{"title": "Noisy Neural Networks and Generalizations", "book": "Advances in Neural Information Processing Systems", "page_first": 335, "page_last": 341, "abstract": null, "full_text": "Noisy Neural Networks and \n\nGeneralizations \n\nHava T. Siegelmann \n\nAlexander Roitershtein \n\nIndustrial Eng. and Management, Mathematics \n\nTechnion - lIT \n\nHaifa 32000, Israel \n\niehava@ie.technion.ac.il \n\nMathematics \nTechnion - lIT \n\nHaifa 32000, Israel \n\nroiterst@math.technion.ac.il \n\nAsa Ben-Hur \n\nIndustrial Eng. and Management \n\nTechnion - lIT \n\nHaifa 32000, Israel \nasa@tx.technion.ac. il \n\nAbstract \n\nIn this paper we define a probabilistic computational model which \ngeneralizes many noisy neural network models, including the recent \nwork of Maass and Sontag [5]. We identify weak ergodicjty as the \nmechanism responsible for restriction of the computational power \nof probabilistic models to definite languages, independent of the \ncharacteristics of the noise: whether it is discrete or analog, or if \nit depends on the input or not, and independent of whether the \nvariables are discrete or continuous. We give examples of weakly \nergodic models including noisy computational systems with noise \ndepending on the current state and inputs, aggregate models, and \ncomputational systems which update in continuous time. \n\n1 \n\nIntroduction \n\nNoisy neural networks were recently examined, e.g. in. [1,4, 5]. It was shown in [5] \nthat Gaussian-like noise reduces the power of analog recurrent neural networks to \nthe class of definite languages, which area strict subset of regular languages. Let \nE be an arbitrary alphabet. LeE\u00b7 is called a definite language if for some integer \nr any two words coinciding on the last r symbols are either both in L or neither in \nL. 
The ability of a computational system to recognize only definite languages can be interpreted as saying that the system forgets all its input signals except for the most recent ones. This property is reminiscent of human short-term memory. \"Definite probabilistic computational models\" have their roots in Rabin's pioneering work on probabilistic automata [9]. He identified a condition on probabilistic automata with a finite state space which restricts them to definite languages. Paz [8] generalized Rabin's condition, applying it to automata with a countable state space and calling it weak ergodicity [7, 8]. In their ground-breaking paper [5], Maass and Sontag extended the principle leading to definite languages to a finite interconnection of continuous-valued neurons. They proved that in the presence of \"analog noise\" (e.g., Gaussian), recurrent neural networks are limited in their computational power to definite languages. Under a different noise model, Maass and Orponen [4] and Casey [1] showed that such neural networks are reduced in their power to regular languages. \n\nIn this paper we generalize the condition of weak ergodicity, making it applicable to numerous probabilistic computational machines. In our general probabilistic model, the state space can be arbitrary: it is not constrained to be a finite or infinite set, to be a discrete or non-discrete subset of some Euclidean space, or even to be a metric or topological space. The input alphabet is arbitrary as well (e.g., bits, rationals, reals, etc.). The stochasticity is not necessarily defined via a transition probability function (TPF) as in all the aforementioned probabilistic and noisy models, but through the more general Markov operators acting on measures. 
\nOur Markov Computational Systems (MCS's) include as special cases Rabin's actual probabilistic automata with cut-point [9], the quasi-definite automata of Paz [8], and the noisy analog neural networks of Maass and Sontag [5]. Interestingly, our model also includes: analog dynamical systems and neural models which have no underlying deterministic rule but rather update probabilistically using finite memory; neural networks with an unbounded number of components; networks of variable dimension (e.g., \"recruiting networks\"); hybrid systems that combine discrete and continuous variables; stochastic cellular automata; and stochastic coupled map lattices. \n\nWe prove that all weakly ergodic Markov systems are stable, i.e., robust with respect to architectural imprecisions and environmental noise. This property is desirable for both biological and artificial neural networks. Until now, this robustness was known only for the classical discrete probabilistic automata [8, 9]. To make deciding weak ergodicity practical for given systems, we provide two conditions on the transition probability functions under which the associated computational system becomes weakly ergodic. One condition is based on a version of Doeblin's condition [5], while the second is motivated by the theory of scrambling matrices [7, 8]. In addition, we construct various examples of weakly ergodic systems, including synchronous and asynchronous computational systems, and hybrid continuous- and discrete-time systems. \n\n2 Markov Computational System (MCS) \n\nInstead of describing various types of noisy neural network models or stochastic dynamical systems, we define a general abstract probabilistic model. 
When dealing with systems containing inherent elements of uncertainty (e.g., noise) we abandon the study of individual trajectories in favor of an examination of the flow of state distributions. The noise models we consider are homogeneous in time: they may depend on the input, but do not depend on time. The dynamics we consider are defined by operators acting on the space of measures, called Markov operators [6]. In the following we define the concepts required for this approach. \n\nLet Σ be an arbitrary alphabet and Ω an abstract state space. We assume that a σ-algebra B (not necessarily of Borel sets) of subsets of Ω is given; thus (Ω, B) is a measurable space. Let us denote by P the set of probability measures on (Ω, B). This set is called the distribution space. Let E be the space of finite measures on (Ω, B) with the total variation norm defined by \n\n‖μ‖_1 = |μ|(Ω) = sup_{A∈B} μ(A) − inf_{A∈B} μ(A). (1) \n\nDenote by C the set of all bounded linear operators acting from E to itself. The ‖·‖_1 norm on E induces a norm ‖P‖_1 = sup_{μ∈P} ‖Pμ‖_1 on C. An operator P ∈ C is said to be a Markov operator if for any probability measure μ ∈ P, the image Pμ is again a probability measure. For a Markov operator, ‖P‖_1 = 1. \n\nDefinition 2.1 A Markov system is a set of Markov operators T = {P_u : u ∈ Σ}. \n\nWith any Markov system T one can associate a probabilistic computational system. If the probability distribution of the initial states is given by the probability measure μ_0, then the distribution of states after n computational steps on input w = w_0 w_1 ... w_n is defined as in [5, 8] by \n\nP_w μ_0 = P_{w_n} ⋯ P_{w_1} P_{w_0} μ_0. (2) 
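On a finite state space, equation (2) reduces to iterated matrix-vector products: each Markov operator P_u is a row-stochastic matrix acting on probability (row) vectors. The following sketch, with a hypothetical two-state system over the alphabet {a, b} (the matrices are illustrative and not taken from the paper), propagates an initial distribution through an input word:

```python
# Finite-state illustration of equation (2): P_w mu_0 = P_{w_n} ... P_{w_1} P_{w_0} mu_0.
# Each letter u is assigned a row-stochastic matrix; distributions are row vectors.

def apply_markov(P, mu):
    """One computational step: map the distribution mu to mu P."""
    n = len(mu)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def run_word(kernels, mu0, word):
    """Propagate the initial distribution mu0 through the input word."""
    mu = mu0
    for u in word:                # earlier letters are applied first, as in (2)
        mu = apply_markov(kernels[u], mu)
    return mu

# Hypothetical two-state Markov system over the alphabet {'a', 'b'}.
kernels = {
    'a': [[0.9, 0.1], [0.2, 0.8]],
    'b': [[0.5, 0.5], [0.5, 0.5]],
}
mu = run_word(kernels, [1.0, 0.0], "ab")
print(mu)   # a probability vector: nonnegative entries summing to 1
```

Acceptance is then decided by whether the resulting distribution P_w μ_0 falls in the accepting set A or the rejecting set R introduced next.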
\n\nLet A and R be two subsets of P with the property of having a ρ-gap: \n\ndist(A, R) = inf_{μ∈A, ν∈R} ‖μ − ν‖_1 = ρ > 0. (3) \n\nThe first set is called the set of accepting distributions and the second the set of rejecting distributions. A language L ⊆ Σ* is said to be recognized by the Markov computational system M = (E, A, R, Σ, μ_0, T) if \n\nw ∈ L ⇔ P_w μ_0 ∈ A, \nw ∉ L ⇔ P_w μ_0 ∈ R. \n\nThis model of language recognition with a gap between accepting and rejecting spaces agrees with Rabin's model of probabilistic automata with isolated cut-point [9] and with the model of analog probabilistic computation [4, 5]. \n\nAn example of a Markov system is a system of operators defined by a TPF on (Ω, B). Let P_u(x, A) be the probability of moving from a state x to the set of states A upon receiving the input signal u ∈ Σ. The function P_u(x, ·) is a probability measure for all x ∈ Ω, and P_u(·, A) is a measurable function of x for any A ∈ B. In this case, P_u μ(A) is defined by \n\nP_u μ(A) = ∫_Ω P_u(x, A) μ(dx). (4) \n\n3 Weakly Ergodic MCS \n\nLet P ∈ C be a Markov operator. The real number δ'(P) = 1 − (1/2) sup_{μ,ν∈P} ‖Pμ − Pν‖_1 is called the ergodicity coefficient of the Markov operator. We denote δ(P) = 1 − δ'(P). It can be proven that for any two Markov operators P_1, P_2, δ(P_1 P_2) ≤ δ(P_1) δ(P_2). The ergodicity coefficient was introduced by Dobrushin [2] for the particular case of Markov operators induced by a TPF P(x, A). In this special case δ'(P) = 1 − sup_{x,y} sup_A |P(x, A) − P(y, A)|. \n\nWeakly ergodic systems were introduced and studied by Paz in the particular case of a denumerable state space Ω, where Markov operators are represented by infinite-dimensional matrices. The following definition makes no assumption on the associated measurable space. 
\n\nDefinition 3.1 A Markov system {P_u : u ∈ Σ} is called weakly ergodic if for any α > 0 there is an integer r = r(α) such that for any w ∈ Σ^{≥r} and any μ, ν ∈ P, \n\nδ(P_w) = (1/2) ‖P_w μ − P_w ν‖_1 ≤ α. (5) \n\nAn MCS M is called weakly ergodic if its associated Markov system {P_u : u ∈ Σ} is weakly ergodic. • \n\nAn MCS M is weakly ergodic if and only if there are an integer r and a real number α < 1 such that ‖P_w μ − P_w ν‖_1 ≤ α for any word w of length r. Our most general characterization of weak ergodicity is as follows [11]: \n\nTheorem 1 An abstract MCS M is weakly ergodic if and only if there exists a multiplicative operator norm ‖·‖_{**} on C, equivalent to the norm ‖P‖_B = sup_{μ: μ(Ω)=0} ‖Pμ‖_1 / ‖μ‖_1, such that sup_{u∈Σ} ‖P_u‖_{**} ≤ ε for some number ε < 1. • \n\nThe next theorem connects the computational power of weakly ergodic MCS's with the class of definite languages, generalizing the results of Rabin [9], Paz [8, p. 175], and Maass and Sontag [5]. \n\nTheorem 2 Let M be a weakly ergodic MCS. If a language L can be recognized by M, then it is definite. • \n\n4 The Stability Theorem of Weakly Ergodic MCS \n\nAn important issue for any computational system is whether the machine is robust with respect to small perturbations of the system's parameters or under some external noise. The stability of language recognition by weakly ergodic MCS's under perturbations of their Markov operators was previously considered by Rabin [9] and Paz [7, 8]. We next state a general version of the stability theorem that is applicable to our wide notion of weakly ergodic systems. \n\nWe first define two MCS's M and M̃ to be similar if they share the same measurable space (Ω, B), alphabet Σ, and sets A and R, and if they differ only by their associated Markov operators. 
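In the finite-state case the quantities above are directly computable: for a stochastic matrix, δ(P) is half the maximal ℓ1-distance between two rows, and Definition 3.1 can be certified by enumerating all words of a fixed length r. A minimal sketch under the assumption of a finite state space (the matrices and helper names are illustrative, not from the paper):

```python
from itertools import product

def mat_mul(A, B):
    """Product of row-stochastic matrices (composition of Markov operators)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def delta(P):
    """Dobrushin's ergodicity coefficient: half the max l1-distance between rows."""
    n = len(P)
    return max(0.5 * sum(abs(P[x][j] - P[y][j]) for j in range(n))
               for x in range(n) for y in range(n))

def certify_weak_ergodicity(kernels, r, alpha):
    """Check delta(P_w) <= alpha for every word w of length r (Definition 3.1)."""
    for word in product(kernels, repeat=r):
        Pw = kernels[word[0]]
        for u in word[1:]:
            Pw = mat_mul(kernels[u], Pw)   # later letters act on the left, as in (2)
        if delta(Pw) > alpha:
            return False
    return True

kernels = {
    'a': [[0.9, 0.1], [0.2, 0.8]],
    'b': [[0.5, 0.5], [0.5, 0.5]],
}
print(delta(kernels['a']))                       # ergodicity coefficient of the 'a' matrix
print(certify_weak_ergodicity(kernels, 3, 0.5))  # prints True: every length-3 word contracts
```

Since δ is submultiplicative, a certificate at length r extends to all longer words, which matches the "if and only if" characterization following Definition 3.1.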
\n\nTheorem 3 Let M and M̃ be two similar MCS's such that the first is weakly ergodic. Then there is α > 0 such that if ‖P_u − P̃_u‖_1 ≤ α for all u ∈ Σ, then the second is also weakly ergodic. Moreover, these two MCS's recognize exactly the same class of languages. • \n\nCorollary 3.1 Let M and M̃ be two similar MCS's. Suppose that the first is weakly ergodic. Then there exists β > 0 such that if sup_{A∈B} |P_u(x, A) − P̃_u(x, A)| ≤ β for all u ∈ Σ and x ∈ Ω, the second is also weakly ergodic. Moreover, these two MCS's recognize exactly the same class of languages. • \n\nA mathematically deeper result, which implies Theorem 3, was proven in [11]: \n\nTheorem 4 Let M and M̃ be two similar MCS's such that the first is weakly ergodic and the second is arbitrary. Then for any α > 0 there exists ε > 0 such that ‖P_u − P̃_u‖_1 ≤ ε for all u ∈ Σ implies ‖P_w − P̃_w‖_1 ≤ α for all words w ∈ Σ*. • \n\nTheorem 3 follows from Theorem 4. To see this, one can choose any α < ρ in Theorem 4 and observe that ‖P_w − P̃_w‖_1 ≤ α < ρ implies that the word w is accepted or rejected by M̃ in accordance with whether it is accepted or rejected by M. \n\n5 Conditions on the Transition Probabilities \n\nThis section discusses practical conditions for weak ergodicity of MCS's in which the Markov operators P_u are induced by transition probability functions as in (4). Clearly, a simple sufficient condition for an MCS to be weakly ergodic is sup_{u∈Σ} δ(P_u) ≤ 1 − c for some c > 0. \n\nMaass and Sontag used Doeblin's condition to bound the computational power of noisy neural networks [5]. Although the networks in [5] constitute a very particular case of weakly ergodic MCS's, Doeblin's condition is applicable also to our general model. The following version of Doeblin's condition was given by Doob [3]: \n\nDefinition 5.1 [3] Let P(x, A) be a TPF on (Ω, B). 
We say that it satisfies Doeblin's condition D'_0 if there exist an integer n ≥ 1, a constant c > 0, and a probability measure ρ on (Ω, B) such that P^n(x, A) ≥ c ρ(A) for all x ∈ Ω and any set A ∈ B. • \n\nIf an MCS M is weakly ergodic, then all its associated TPF P_w(x, A), w ∈ Σ*, must satisfy D'_0 for some n = n(w). Doob proved [3, p. 197] that if P(x, A) satisfies Doeblin's condition D'_0 with constant c, then for any μ, ν ∈ P, ‖Pμ − Pν‖_1 ≤ (1 − c) ‖μ − ν‖_1, i.e., δ(P) ≤ 1 − c. This leads us to the following definition. \n\nDefinition 5.2 Let M be an MCS. We say that the space Ω is small with respect to M if there exists an m > 0 such that all associated TPF P_w(x, A), w ∈ Σ^m, satisfy Doeblin's condition D'_0 uniformly with the same constant c, i.e., P_w(x, A) ≥ c ρ_w(A), w ∈ Σ^m. • \n\nThe following theorem strengthens the result of Maass and Sontag [5]. \n\nTheorem 5 Let M be an MCS. If the space Ω is small with respect to M, then M is weakly ergodic, and it can recognize only definite languages. • \n\nThis theorem provides a convenient method for checking weak ergodicity of a given TPF. The theorem implies that it is sufficient to execute the following simple check: choose any integer n, and then verify that for every state x and all input strings w ∈ Σ^n, the \"absolutely continuous\" part of every TPF P_w, w ∈ Σ^n, is uniformly bounded from below: \n\nψ_w({z : p_w(x, z) ≥ c_1}) ≥ c_2, (6) \n\nwhere p_w(x, ·) is the density of the absolutely continuous component of P_w(x, ·) with respect to ψ_w, and c_1, c_2 are positive numbers. \n\nMost practical systems can be defined by null-preserving TPF (including, for example, the systems in [5]). For these systems we provide (Theorem 6) a necessary and sufficient condition in terms of density kernels. A TPF P_u(x, A), u ∈ Σ, is called null preserving with respect to a probability measure ρ ∈ P if it has a density with respect to ρ, i.e., P_u(x, A) = ∫_A p_u(x, z) ρ(dz). 
It is not hard to see that null preservation for each letter u ∈ Σ implies that all TPF P_w(x, A) for words w ∈ Σ* are null preserving as well. In this case δ(P_u) = 1 − inf_{x,y} ∫_Ω min{p_u(x, z), p_u(y, z)} ρ(dz), and we have: \n\nTheorem 6 Let M be an MCS defined by null-preserving transition probability functions P_u, u ∈ Σ. Then M is weakly ergodic if and only if there exists n such that inf_{w∈Σ^n} inf_{x,y} ∫_Ω min{p_w(x, z), p_w(y, z)} ρ(dz) > 0. • \n\nA similar result was previously established by Paz [7, 8] for the case of a denumerable state space Ω. This theorem allows us to treat examples which are not covered by Theorem 5. For example, suppose that the space Ω is not small with respect to an MCS M, but for some n and every w ∈ Σ^n there exists a measure ψ_w on (Ω, B) with the property that for any pair of states x, y ∈ Ω \n\nψ_w({z : min{p_w(x, z), p_w(y, z)} ≥ c_1}) ≥ c_2, (7) \n\nwhere p_w(x, ·) is the density of P_w(x, ·) with respect to ψ_w, and c_1, c_2 are positive numbers. This condition may hold even if there is no z such that p_u(x, z) ≥ c_1 for all x ∈ Ω. \n\n6 Examples of Weakly Ergodic Systems \n\n1. The Synchronous Parallel Model \n\nLet (Ω_i, B_i), i = 1, 2, ..., N, be a collection of measurable spaces. Define Ω^i = ∏_{j≠i} Ω_j and B^i = ∏_{j≠i} B_j. Then (Ω^i, B^i) are measurable spaces. Define also Σ_i = Σ × Ω^i, and let T_i = {P^i_{x^i,u}(x_i, A_i) : (x^i, u) ∈ Σ_i} be given stochastic kernels. Each set T_i defines an MCS M_i. We can define an aggregate MCS by setting Ω = ∏_i Ω_i, B = ∏_i B_i, S = ∏_i S_i, R = ∏_i R_i, and \n\nP_u(x, A_1 × ⋯ × A_N) = ∏_i P^i_{x^i,u}(x_i, A_i). (8) \n\nThis describes a model of N noisy computational systems that update in synchronous parallelism. The state of the whole aggregate is the vector of states of the individual components, and each component receives the states of all other components as part of its input. \n\nTheorem 7 [12] Let M be an MCS defined by equation (8). 
It is weakly ergodic if at least one set of operators T_i is such that δ(P^i_{x^i,u}) ≤ 1 − c for any u ∈ Σ, x^i ∈ Ω^i, and some positive number c. • \n\n2. The Asynchronous Parallel Model \n\nIn this model, at every step only one component is activated. Suppose that a collection of N similar MCS's M_i, i = 1, ..., N, is given. Consider a probability measure ξ = {ξ_1, ..., ξ_N} on the set K = {1, ..., N}. Assume that in each computational step only one MCS is activated. The current state of the whole aggregate is represented by the state of its active component. Assume also that the probability of a computational system M_i being activated is time-independent and is given by Prob(M_i) = ξ_i. The aggregate system is then described by the stochastic kernels \n\nP_u(x, A) = ∑_{i=1}^N ξ_i P^i_u(x, A). (9) \n\nTheorem 8 [12] Let M be an MCS defined by formula (9). It is weakly ergodic if at least one of the sets of operators {P^1_u}, ..., {P^N_u} is weakly ergodic. • \n\n3. Hybrid Weakly Ergodic Systems \n\nWe now present a hybrid weakly ergodic computational system consisting of both continuous and discrete elements. The evolution of the system is governed by a differential equation, while its input arrives at discrete times. Let Ω = ℝ^n, and consider a collection of differential equations \n\nẋ_u(s) = ψ_u(x_u(s)), u ∈ Σ, s ∈ [0, ∞). (10) \n\nSuppose that ψ_u(x) is sufficiently smooth to ensure the existence and uniqueness of solutions of Equation (10) for s ∈ [0, 1] and for any initial condition. \n\nConsider a computational system which receives an input u(t) at discrete times t_0, t_1, t_2, .... In the interval t ∈ [t_i, t_{i+1}) the behavior of the system is described by Equation (10), where s = t − t_i. 
A random initial condition for the time t_n is defined by \n\nProb(x_{u(t_n)}(0) ∈ A) = P_{u(t_n)}(x_{u(t_{n−1})}(1), A), (11) \n\nwhere x_{u(t_{n−1})}(1) is the state of the system after the previously completed computation, and P_u(x, A), u ∈ Σ, is a family of stochastic kernels on Ω × B. This describes a system which receives inputs at discrete instants of time; the input letters u ∈ Σ cause random perturbations of the state x_{u(t_{n−1})}(1), governed by the transition probability functions P_{u(t_n)}(x_{u(t_{n−1})}(1), A). At all other times the system is a noise-free continuous computational system which evolves according to Equation (10). \n\nLet Ω = ℝ^n, let x_0 ∈ Ω be a distinguished initial state, and let S and R be two subsets of Ω with the property of having a ρ-gap: dist(S, R) = inf_{x∈S, y∈R} ‖x − y‖ = ρ > 0. The first set is called the set of accepting final states and the second the set of rejecting final states. We say that the hybrid computational system M = (Ω, Σ, x_0, ψ_u, S, R) recognizes L ⊆ Σ* if for all w = w_0 ... w_n ∈ Σ* and the end letter $ ∉ Σ the following holds: w ∈ L ⇔ Prob(x_{w_n $}(1) ∈ S) > 1/2 + ε, and w ∉ L ⇔ Prob(x_{w_n $}(1) ∈ R) > 1/2 + ε. \n\nTheorem 9 [12] Let M be a hybrid computational system. It is weakly ergodic if its set of evolution operators T = {P_u : u ∈ Σ} is weakly ergodic. • \n\nReferences \n\n[1] Casey, M., The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction, Neural Computation, 8, 1996, pp. 1135-1178. \n\n[2] Dobrushin, R. L., Central limit theorem for nonstationary Markov chains I, II, Theor. Probability Appl., vol. 1, 1956, pp. 65-80, 298-383. \n\n[3] Doob, J. L., Stochastic Processes, John Wiley and Sons, Inc., 1953. \n\n[4] W. Maass and Orponen, P., On the effect of analog noise in discrete time computation, Neural Computation, 10(5), 1998, pp. 1071-1095. \n\n[5] W. 
Maass and Sontag, E., Analog neural nets with Gaussian or other common noise distributions cannot recognize arbitrary regular languages, Neural Computation, 11, 1999, pp. 771-782. \n\n[6] Neveu, J., Mathematical Foundations of the Calculus of Probability, Holden-Day, San Francisco, 1964. \n\n[7] Paz, A., Ergodic theorems for infinite probabilistic tables, Ann. Math. Statist., vol. 41, 1970, pp. 539-550. \n\n[8] Paz, A., Introduction to Probabilistic Automata, Academic Press, Inc., London, 1971. \n\n[9] Rabin, M., Probabilistic automata, Information and Control, vol. 6, 1963, pp. 230-245. \n\n[10] Siegelmann, H. T., Neural Networks and Analog Computation: Beyond the Turing Limit, Birkhauser, Boston, 1999. \n\n[11] Siegelmann, H. T. and Roitershtein, A., On weakly ergodic computational systems, 1999, submitted. \n\n[12] Siegelmann, H. T., Roitershtein, A., and Ben-Hur, A., On noisy computational systems, Discrete Applied Mathematics, 1999, accepted. \n", "award": [], "sourceid": 1764, "authors": [{"given_name": "Hava", "family_name": "Siegelmann", "institution": null}, {"given_name": "Alexander", "family_name": "Roitershtein", "institution": null}, {"given_name": "Asa", "family_name": "Ben-Hur", "institution": null}]}