{"title": "A Precise Characterization of the Class of Languages Recognized by Neural Nets under Gaussian and Other Common Noise Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 281, "page_last": 287, "abstract": null, "full_text": "A Precise Characterization of the Class of \n\nLanguages Recognized by Neural Nets under \n\nGaussian and other Common Noise Distributions \n\nWolfgang Maass* \n\nInst. for Theoretical Computer Science, \n\nTechnische Universitat Graz \n\nKlosterwiesgasse 3212, \nA-80lO Graz, Austria \n\nemail: maass@igi.tu-graz.ac.at \n\nEduardo D. Sontag \nOep. of Mathematics \nRutgers University \n\nNew Brunswick, NJ 08903, USA \nemail: sontag@hilbert.rutgers.edu \n\nAbstract \n\nWe consider recurrent analog neural nets where each gate is subject to \nGaussian noise, or any other common noise distribution whose probabil(cid:173)\nity density function is nonzero on a large set. We show that many regular \nlanguages cannot be recognized by networks of this type, for example \nthe language {w E {O, I} * I w begins with O}, and we give a precise \ncharacterization of those languages which can be recognized. This result \nimplies severe constraints on possibilities for constructing recurrent ana(cid:173)\nlog neural nets that are robust against realistic types of analog noise. On \nthe other hand we present a method for constructing feed forward analog \nneural nets that are robust with regard to analog noise of this type. \n\n1 Introduction \n\nA fairly large literature (see [Omlin, Giles, 1996] and the references therein) is devoted \nto the construction of analog neural nets that recognize regular languages. Any physical \nrealization of the analog computational units of an analog neural net in technological or \nbiological systems is bound to encounter some form of \"imprecision\" or analog noise at \nits analog computational units. 
We show in this article that this effect has serious consequences for the computational power of recurrent analog neural nets. We show that any analog neural net whose computational units are subject to Gaussian or other common noise distributions cannot recognize arbitrary regular languages. For example, such an analog neural net cannot recognize the regular language {w ∈ {0,1}* | w begins with 0}.

* Partially supported by the Fonds zur Förderung der wissenschaftlichen Forschung (FWF), Austria, project P12153.

282  W. Maass and E. D. Sontag

A precise characterization of those regular languages which can be recognized by such analog neural nets is given in Theorem 1.1. In section 3 we introduce a simple technique for making feedforward neural nets robust with regard to the same types of analog noise. This method is employed to prove the positive part of Theorem 1.1. The main difficulty in proving Theorem 1.1 is its negative part, for which adequate theoretical tools are introduced in section 2.

Before we can give the exact statement of Theorem 1.1 and discuss related preceding work we have to give a precise definition of computations in noisy neural networks. From the conceptual point of view this definition is basically the same as for computations in noisy boolean circuits (see [Pippenger, 1985] and [Pippenger, 1990]). However it is technically more involved since we have to deal here with an infinite state space.

We will first illustrate this definition for a concrete case, a recurrent sigmoidal neural net with Gaussian noise, and then indicate the full generality of our result, which makes it applicable to a very large class of other types of analog computational systems with analog noise. Consider a recurrent sigmoidal neural net N consisting of n units, that receives at each time step t an input u_t from some finite alphabet U (for example U = {0,1}).
The internal state of N at the end of step t is described by a vector x_t ∈ [-1,1]^n, which consists of the outputs of the n sigmoidal units at the end of step t. A computation step of the network N is described by

x_{t+1} = σ(W x_t + h + u_t c + V_t)

where W ∈ R^{n×n} and c, h ∈ R^n represent weight matrix and vectors, σ is a sigmoidal activation function (e.g., σ(y) = 1/(1 + e^{-y})) applied to each vector component, and V_1, V_2, ... is a sequence of n-vectors drawn independently from some Gaussian distribution. In analogy to the case of noisy boolean circuits [Pippenger, 1990] one says that this network N recognizes a language L ⊆ U* with reliability ε (where ε ∈ (0, 1/2] is some given constant) if immediately after reading an arbitrary word w ∈ U* the network N is with probability ≥ 1/2 + ε in an accepting state in case that w ∈ L, and with probability ≤ 1/2 - ε in an accepting state in case that w ∉ L.¹

We will show in this article that even if the parameters of the Gaussian noise distribution for each sigmoidal unit can be determined by the designer of the neural net, it is impossible to find a size n, weight matrix W, vectors h, c and a reliability ε ∈ (0, 1/2] so that the resulting recurrent sigmoidal neural net with Gaussian noise accepts the simple regular language {w ∈ {0,1}* | w begins with 0} with reliability ε. This result exhibits a fundamental limitation for making a recurrent analog neural net noise robust, even in a case where the noise distribution is known and of a rather benign type. This quite startling negative result should be contrasted with the large number of known techniques for making a feedforward boolean circuit robust against noise, see [Pippenger, 1990].
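The noisy update step x_{t+1} = σ(W x_t + h + u_t c + V_t) can be simulated directly. The following sketch is only an illustration: the weights, the tanh activation (one admissible sigmoidal function with range [-1,1]), the noise level, and the acceptance test (first unit positive) are arbitrary choices, not a construction from this paper. It estimates by Monte Carlo sampling the probability that a small net is in an accepting state after reading a word:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3                          # number of sigmoidal units (arbitrary)
W = rng.normal(size=(n, n))    # recurrent weight matrix (arbitrary)
h = rng.normal(size=n)         # bias vector
c = rng.normal(size=n)         # input weight vector
noise_std = 0.1                # std. dev. of the Gaussian noise vectors V_t

def sigma(y):
    # sigmoidal activation applied componentwise, range [-1, 1]
    return np.tanh(y)

def run(word):
    # one sample trajectory: x_{t+1} = sigma(W x_t + h + u_t c + V_t)
    x = np.zeros(n)
    for u in word:
        V = rng.normal(scale=noise_std, size=n)
        x = sigma(W @ x + h + u * c + V)
    return x

def acceptance_prob(word, trials=2000):
    # Monte Carlo estimate of Prob[final state is accepting], where
    # "accepting" is arbitrarily defined here as first unit > 0
    hits = sum(run(word)[0] > 0 for _ in range(trials))
    return hits / trials

p = acceptance_prob([0, 1, 1, 0])
assert 0.0 <= p <= 1.0
```

Because the noise has full support, repeated calls give estimates of a genuinely random acceptance probability; the paper's negative result concerns exactly how far such probabilities can stay away from 1/2 after long inputs.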
Our negative result turns out to be of a very general nature, that holds for virtually all related definitions of noisy analog neural nets and also for completely different models for analog computation in the presence of Gaussian or similar noise. Instead of the state set [-1,1]^n one can take any compact set Ω ⊆ R^n, and instead of the map (x,u) ↦ Wx + h + uc one can consider an arbitrary map f : Ω × U → Ω' for a compact set Ω' ⊆ R^n, where f(·,u) is Borel measurable for each fixed u ∈ U. Instead of a sigmoidal activation function σ and a Gaussian distributed noise vector V it suffices to assume that σ : R^n → Ω is some arbitrary Borel measurable function and V is some R^n-valued random variable with a density φ(·) that has a wide support.²

¹ According to this definition a network N that is after reading some w ∈ U* in an accepting state with probability strictly between 1/2 - ε and 1/2 + ε does not recognize any language L ⊆ U*.

² More precisely: We assume that there exists a subset Ω₀ of Ω and some constant c₀ > 0 such that the following two properties hold.

In order to define a computation in such a system we consider for each u ∈ U the stochastic kernel K_u defined by K_u(x, A) := Prob[σ(f(x,u) + V) ∈ A] for x ∈ Ω and A ⊆ Ω. For each (signed, Borel) measure μ on Ω, and each u ∈ U, we let 𝕂_u μ be the (signed, Borel) measure defined on Ω by (𝕂_u μ)(A) := ∫ K_u(x, A) dμ(x). Note that 𝕂_u μ is a probability measure whenever μ is. For any sequence of inputs w = u_1, ..., u_r, we consider the composition of the evolution operators 𝕂_{u_i}:

𝕂_w := 𝕂_{u_r} ∘ 𝕂_{u_{r-1}} ∘ ... ∘ 𝕂_{u_1}   (1)

If the probability distribution of states at any given instant is given by the measure μ, then the distribution of states after a single computation step on input u ∈ U is given by 𝕂_u μ, and after r computation steps on inputs w = u_1, ..., u_r, the new distribution is 𝕂_w μ, where we are using the notation (1).
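On a finite state space the operators 𝕂_u are just stochastic matrices, and the composition (1) becomes a matrix product applied in the order of the input word. The sketch below uses two made-up 3-state kernels (not derived from any particular net) to propagate an initial distribution through a word:

```python
import numpy as np

# Stochastic kernels for inputs 0 and 1 on a 3-point state space:
# K[u][i, j] = Prob[next state = j | current state = i, input = u].
K = {
    0: np.array([[0.8, 0.1, 0.1],
                 [0.3, 0.4, 0.3],
                 [0.2, 0.2, 0.6]]),
    1: np.array([[0.5, 0.25, 0.25],
                 [0.1, 0.8,  0.1],
                 [0.3, 0.3,  0.4]]),
}

def evolve(mu, word):
    # Apply the evolution operators in the order of the word, i.e.
    # compute (K_{u_r} ∘ ... ∘ K_{u_1}) mu as in the notation (1).
    for u in word:
        mu = mu @ K[u]      # row vector times stochastic matrix
    return mu

delta = np.array([1.0, 0.0, 0.0])   # point mass at state 0
mu = evolve(delta, [0, 1, 1])
assert np.isclose(mu.sum(), 1.0)    # still a probability measure
```

The assertion reflects the remark above: 𝕂_u μ is again a probability measure whenever μ is, and this is preserved under composition.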
In particular, if the system starts at a particular initial state ξ, then the distribution of states after r computation steps on w is 𝕂_w δ_ξ, where δ_ξ is the probability measure concentrated on {ξ}. That is to say, for each measurable subset F ⊆ Ω

Prob[X_{r+1} ∈ F | X_1 = ξ, input = w] = (𝕂_w δ_ξ)(F) .

We fix an initial state ξ ∈ Ω, a set of "accepting" or "final" states F, and a "reliability" level ε > 0, and say that the resulting noisy analog computational system M recognizes the language L ⊆ U* if for all w ∈ U*:

w ∈ L  ⟹  (𝕂_w δ_ξ)(F) ≥ 1/2 + ε
w ∉ L  ⟹  (𝕂_w δ_ξ)(F) ≤ 1/2 - ε .

In general a neural network that simulates a DFA will carry out not just one, but a fixed number k of computation steps (= state transitions) of the form x' = σ(Wx + h + uc + V) for each input symbol u ∈ U that it reads (see the constructions described in [Omlin, Giles, 1996], and in section 3 of this article). This can easily be reflected in our model by formally replacing any input sequence w = u_1, u_2, ..., u_r from U* by a padded sequence w' = u_1, b^{k-1}, u_2, b^{k-1}, ..., u_r, b^{k-1} from (U ∪ {b})*, where b is a blank symbol not in U, and b^{k-1} denotes a sequence of k-1 copies of b (for some arbitrarily fixed k ≥ 1). This completes our definition of language recognition by a noisy analog computational system M with discrete time. This definition essentially agrees with that given in [Maass, Orponen, 1997].

We employ the following common notations from formal language theory: We write w_1 w_2 for the concatenation of two strings w_1 and w_2, U^r for the set of all concatenations of r strings from U, U* for the set of all concatenations of any finite number of strings from U, and UV for the set of all strings w_1 w_2 with w_1 ∈ U and w_2 ∈ V. The main result of this article is the following:

Theorem 1.1 Assume that U is some arbitrary finite alphabet.
A language L ⊆ U* can be recognized by a noisy analog computational system of the previously specified type if and only if L = E_1 ∪ U* E_2 for two finite subsets E_1 and E_2 of U*.

A corresponding version of Theorem 1.1 for discrete computational systems was previously shown in [Rabin, 1963]. More precisely, Rabin had shown that probabilistic automata with strictly positive matrices can recognize exactly the same class of languages L that occur in our Theorem 1.1. Rabin referred to these languages as definite languages. Language recognition by analog computational systems with analog noise has previously been investigated in [Casey, 1996] for the special case of bounded noise and perfect reliability (i.e. ∫_{||v||≤η} φ(v)dv = 1 for some small η > 0 and ε = 1/2 in our terminology), and in [Maass, Orponen, 1997] for the general case. It was shown in [Maass, Orponen, 1997] that any such system can only recognize regular languages. Furthermore it was shown there that if ∫_{||v||≤η} φ(v)dv = 1 for some small η > 0 then all regular languages can be recognized by such systems. In the present paper we focus on the complementary case where the condition "∫_{||v||≤η} φ(v)dv = 1 for some small η > 0" is not satisfied, i.e. analog noise may move states over larger distances in the state space. We show that even if the probability of such an event is arbitrarily small, the neural net will no longer be able to recognize arbitrary regular languages.

2 A Constraint on Language Recognition

We prove in this section the following result for arbitrary noisy computational systems M as specified at the end of section 1:

Theorem 2.1 Assume that U is some arbitrary alphabet. If a language L ⊆ U* is recognized by M, then there are subsets E_1 and E_2 of U^{≤r}, for some integer r, such that L = E_1 ∪ U* E_2.
In other words: whether a string w ∈ U* belongs to the language L can be decided by just inspecting the first r and the last r symbols of w.

2.1 A General Fact about Stochastic Kernels

Let (S, 𝒮) be a measure space, and let K be a stochastic kernel.³ As in the special case of the K_u's above, for each (signed) measure μ on (S, 𝒮), we let 𝕂μ be the (signed) measure defined on 𝒮 by (𝕂μ)(A) := ∫ K(x, A) dμ(x). Observe that 𝕂μ is a probability measure whenever μ is. Let c > 0 be arbitrary. We say that K satisfies Doeblin's condition (with constant c) if there is some probability measure ρ on (S, 𝒮) so that

K(x, A) ≥ c ρ(A)  for all x ∈ S, A ∈ 𝒮.   (2)

(Necessarily c ≤ 1, as is seen by considering the special case A = S.) This condition is due to [Doeblin, 1937].

We denote by ||μ|| the total variation of the (signed) measure μ. Recall that ||μ|| is defined as follows. One may decompose S into a disjoint union of two sets A and B, in such a manner that μ is nonnegative on A and nonpositive on B. Letting the restrictions of μ to A and B be μ⁺ and -μ⁻ respectively (and zero on B and A respectively), we may decompose μ as a difference of nonnegative measures with disjoint supports, μ = μ⁺ - μ⁻. Then, ||μ|| = μ⁺(A) + μ⁻(B). The following Lemma is a "folk" fact ([Papanicolaou, 1978]).

Lemma 2.2 Assume that K satisfies Doeblin's condition with constant c. Let μ be any (signed) measure such that μ(S) = 0. Then ||𝕂μ|| ≤ (1 - c) ||μ||. ∎

2.2 Proof of Theorem 2.1

Lemma 2.3 There is a constant c > 0 such that K_u satisfies Doeblin's condition with constant c, for every u ∈ U.

Proof. Let Ω₀, c₀, and 0 < m₀ < 1 be as in the second footnote, and introduce the following (Borel) probability measure on Ω₀:

λ₀(A) := (1/m₀) λ(σ⁻¹(A)) .

³ That is to say, K(x,·) is a probability distribution for each x, and K(·, A) is a measurable function for each Borel measurable set A.

Pick any measurable A ⊆ Ω₀ and any y ∈ Ω'. Then,

Z(y, A) := Prob[σ(y + V) ∈ A] = Prob[y + V ∈ σ⁻¹(A)] = ∫_{A_y} φ(v) dv ≥ c₀ λ(A_y) = c₀ λ(σ⁻¹(A)) = c₀ m₀ λ₀(A) ,

where A_y := σ⁻¹(A) - {y}. We conclude that Z(y, A) ≥ c λ₀(A) for all y, A, where c = c₀ m₀. Finally, we extend the measure λ₀ to all of Ω by assigning zero measure to the complement of Ω₀, that is, ρ(A) := λ₀(A ∩ Ω₀) for all measurable subsets A of Ω. Pick u ∈ U; we will show that K_u satisfies Doeblin's condition with the above constant c (and using ρ as the "comparison" measure in the definition). Consider any x ∈ Ω and measurable A ⊆ Ω. Then,

K_u(x, A) = Z(f(x,u), A) ≥ Z(f(x,u), A ∩ Ω₀) ≥ c λ₀(A ∩ Ω₀) = c ρ(A) ,

as required. ∎

For every two probability measures μ_1, μ_2 on Ω, applying Lemma 2.2 to μ := μ_1 - μ_2, we know that ||𝕂_u μ_1 - 𝕂_u μ_2|| ≤ (1 - c) ||μ_1 - μ_2|| for each u ∈ U. Recursively, then, we conclude:

||𝕂_w μ_1 - 𝕂_w μ_2|| ≤ (1 - c)^r ||μ_1 - μ_2|| ≤ 2 (1 - c)^r   (3)

for all words w of length ≥ r.

Now pick any integer r such that (1 - c)^r < 2ε. From Equation (3), we have that

|(𝕂_w μ_1)(F) - (𝕂_w μ_2)(F)| < 2ε   (4)

for all w of length ≥ r and any two probability measures μ_1, μ_2. (Because, for any two probability measures ν_1 and ν_2, and any measurable set A, 2|ν_1(A) - ν_2(A)| ≤ ||ν_1 - ν_2||.)

Lemma 2.4 Pick any v ∈ U* and w ∈ U^r. Then

w ∈ L  ⟺  vw ∈ L .

Proof. Assume that w ∈ L, that is, (𝕂_w δ_ξ)(F) ≥ 1/2 + ε. Applying inequality (4) to the measures μ_1 := δ_ξ and μ_2 := 𝕂_v δ_ξ and A = F, we have that |(𝕂_w δ_ξ)(F) - (𝕂_{vw} δ_ξ)(F)| < 2ε, and this implies that (𝕂_{vw} δ_ξ)(F) > 1/2 - ε, i.e., vw ∈ L.
(Since 1/2 - ε < (𝕂_{vw} δ_ξ)(F) < 1/2 + ε is ruled out.) If w ∉ L, the argument is similar. ∎

We have proved that for all v ∈ U* and w ∈ U^r: w ∈ L ⟺ vw ∈ L. So,

L = (L ∩ U^{<r}) ∪ U* (L ∩ U^r) ,

where E_1 := L ∩ U^{<r} and E_2 := L ∩ U^r are both included in U^{≤r}. This completes the proof of Theorem 2.1. ∎

3 Construction of Noise Robust Analog Neural Nets

In this section we exhibit a method for making feedforward analog neural nets robust with regard to arbitrary analog noise of the type considered in the preceding sections. This method will be used to prove in Corollary 3.2 the missing positive part of the claim of the main result (Theorem 1.1) of this article.

Theorem 3.1 Let C be any (noiseless) feedforward threshold circuit, and let σ : R → [-1,1] be some arbitrary function with σ(u) → 1 for u → ∞ and σ(u) → -1 for u → -∞. Furthermore assume that δ, ρ ∈ (0,1) are some arbitrary given parameters. Then one can transform for any given analog noise of the type considered in section 1 the noiseless threshold circuit C into an analog neural net N_C with the same number of gates, whose gates employ the given function σ as activation function, so that for any circuit input x ∈ {-1,1}^m the output of the noisy analog neural net N_C differs with probability ≥ 1 - δ by at most ρ from the output of C.

Idea of the proof. Let k be the maximal fan-in of a gate in C, and let w be the maximal absolute value of a weight in C. We choose R > 0 so large that the density function φ(·) of the noise vector V satisfies for each gate with n inputs in C

∫_{|v_i| ≥ R} φ(v) dv ≤ δ/(2n)  for i = 1, ..., n.

Furthermore we choose u_0 > 0 so large that σ(u) ≥ 1 - ρ/(wk) for u ≥ u_0 and σ(u) ≤ -1 + ρ/(wk) for u ≤ -u_0. Finally we choose a factor γ > 0 so large that γ(1 - ρ) - R ≥ u_0.
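The role of the scaling factor γ can be checked numerically: once γ is large enough, the gate is driven into the saturated regions of its activation function, and moderate additive noise no longer changes its output appreciably. In the sketch below the gate weights, γ, and the noise level are illustrative choices, not the constants R, u_0, γ from the proof:

```python
import numpy as np

rng = np.random.default_rng(1)

w = np.array([1.0, -1.0, 1.0])   # weights of one threshold gate (illustrative)
theta = 0.5                      # threshold of the gate
gamma = 50.0                     # scaling factor from the construction
noise_std = 1.0                  # std. dev. of additive noise at the gate

def threshold_gate(x):
    # noiseless Heaviside-style gate with outputs in {-1, 1}
    return 1.0 if np.dot(w, x) - theta >= 0 else -1.0

def analog_gate(x):
    # same gate with weights and threshold scaled by gamma, a tanh
    # activation, and Gaussian noise added at the gate input
    noise = rng.normal(scale=noise_std)
    return np.tanh(gamma * (np.dot(w, x) - theta) + noise)

x = np.array([1.0, -1.0, 1.0])   # a Boolean input from {-1, 1}^3
errors = sum(abs(analog_gate(x) - threshold_gate(x)) > 0.25
             for _ in range(1000))
assert errors < 50   # the saturated gate almost always stays near ±1
```

With γ = 50 the pre-activation is pushed far past the saturation point u_0, so the noise (here of standard deviation 1) is almost never large enough to pull the output away from ±1; without the scaling, the same noise would flip the gate frequently.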
Let N_C be the analog neural net that results from C through multiplication of all weights and thresholds with γ and through replacement of the Heaviside activation functions of the gates in C by the given activation function σ. ∎

The following Corollary provides the proof of the positive part of our main result Theorem 1.1. It holds for any σ considered in Theorem 3.1.

Corollary 3.2 Assume that U is some arbitrary finite alphabet, and language L ⊆ U* is of the form L = E_1 ∪ U* E_2 for two arbitrary finite subsets E_1 and E_2 of U*. Then the language L can be recognized by a noisy analog neural net N with any desired reliability ε ∈ (0, 1/2), in spite of arbitrary analog noise of the type considered in section 1.

Proof. We first construct a feedforward threshold circuit C for recognizing L, that receives each input symbol from U in the form of a bitstring u ∈ {0,1}^l (for some fixed l ≥ log_2 |U|), that is encoded as the binary states of l input units of the boolean circuit C. Via a tapped delay line of fixed length d (which can easily be implemented in a feedforward threshold circuit by d layers, each consisting of l gates that compute the identity function on a single binary input from the preceding layer) one can achieve that the feedforward circuit C computes any given boolean function of the last d sequences from {0,1}^l that were presented to the circuit. On the other hand for any language of the form L = E_1 ∪ U* E_2 with E_1, E_2 finite there exists some d ∈ N such that for each w ∈ U* one can decide whether w ∈ L by just inspecting the last d characters of w. Therefore a feedforward threshold circuit C with a tapped delay line of the type described above can decide whether w ∈ L.

We apply Theorem 3.1 to this circuit C for δ = ρ = min(1/2 - ε, 1/4).
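The boolean decision procedure realized by the circuit C above only needs short words and bounded-length suffixes. A direct sketch of this decision rule for a definite language L = E_1 ∪ U* E_2, with made-up example sets E_1, E_2 over U = {0,1}:

```python
# Recognizing a definite language L = E1 ∪ U*E2 by inspecting only
# short words and bounded-length suffixes. E1 and E2 are made-up examples.
E1 = {(0,), (0, 1)}        # finitely many short words accepted outright
E2 = {(1, 0), (0, 0)}      # accepted suffixes, here of length r = 2

def in_L(word):
    w = tuple(word)
    if w in E1:
        return True
    # w ∈ U*E2 iff some word of E2 is a suffix of w
    return any(w[len(w) - len(e):] == e
               for e in E2 if len(w) >= len(e))

assert in_L([0]) is True            # in E1
assert in_L([1, 1, 1, 0]) is True   # ends with (1, 0) ∈ E2
assert in_L([1, 1]) is False
```

This is exactly the kind of bounded-memory decision that the tapped delay line makes available to the threshold circuit: only the last d symbols (plus, for short inputs, the whole word) ever matter.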
We define the set F of accepting states for the resulting analog neural net N_C as the set of those states where the computation is completed and the output gate of N_C assumes a value ≥ 3/4. Then according to Theorem 3.1 the analog neural net N_C recognizes L with reliability ε. To be formally precise, one has to apply Theorem 3.1 to a threshold circuit C that receives its input not in a single batch, but through a sequence of d batches. The proof of Theorem 3.1 readily extends to this case. ∎

4 Conclusions

We have exhibited a fundamental limitation of analog neural nets with Gaussian or other common noise distributions whose probability density function is nonzero on a large set: They cannot accept the very simple regular language {w ∈ {0,1}* | w begins with 0}. This holds even if the designer of the neural net is allowed to choose the parameters of the Gaussian noise distribution and the architecture and parameters of the neural net. The proof of this result introduces new mathematical arguments into the investigation of neural computation, which can also be applied to other stochastic analog computational systems.

We also have presented a method for making feedforward analog neural nets robust against the same type of noise. This implies that certain regular languages, such as for example {w ∈ {0,1}* | w ends with 0}, can be recognized by a recurrent analog neural net with Gaussian noise. In combination with our negative result this yields a precise characterization of all regular languages that can be recognized by recurrent analog neural nets with Gaussian noise, or with any other noise distribution that has a large support.

References

[Casey, 1996] Casey, M., "The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction", Neural Computation 8, 1135-1178, 1996.
\n\n[Doeblin, 1937] Doeblin, W., \"Sur Ie proprietes asymtotiques de mouvement regis par \ncertain types de chaInes simples\", Bull. Math. Soc. Roumaine Sci. 39(1): 57-115; (2) \n3-61,1937. \n\n[Maass, Orponen, 1997] Maass, W., and Orponen, P. \"On the effect of analog noise on \ndiscrete-time analog computations\", Advances in Neural Information Processing Sys(cid:173)\ntems 9, 1997, 218-224; journal version: Neural Computation 10(5), 1071-1095, \n1998. \n\n[amlin, Giles, 1996] amlin, C. W., Giles, C. L. \"Constructing deterministic finite-state \nautomata in recurrent neural networks\", J. Assoc. Comput. Mach. 43 (1996), 937-\n972. \n\n[Papinicolaou, 1978] Papinicolaou, G., \"Asymptotic Analysis of Stochastic Equations\", in \nStudies in Probability Theory, MAA Studies in Mathematics, vol. 18, 111-179, edited \nby M. Rosenblatt, Math. Assoc. of America, 1978. \n\n[Pippenger, 1985] Pippenger, N., \"On networks of noisy gates\", IEEE Sympos. on Foun(cid:173)\n\ndations of Computer Science, vol. 26, IEEE Press, New York, 30-38, 1985. \n\n[Pippenger, 1989] Pippenger, N., ':Invariance of complexity measures for networks with \n\nunreliable gates\", J. of the ACM, vol. 36, 531-539,1989. \n\n[Pippenger, 1990] Pippenger, N., \"Developments in 'The Synthesis of Reliable Organisms \nfrom Unreliable Components' \", Proc. of Symposia in Pure Mathematics, vol. 50, \n311-324,1990. \n\n[Rabin, 1963] Rabin, M., \"Probabilistic automata\", Information and Control, vol. 6, 230-\n\n245, 1963. \n\n\f", "award": [], "sourceid": 1530, "authors": [{"given_name": "Wolfgang", "family_name": "Maass", "institution": null}, {"given_name": "Eduardo", "family_name": "Sontag", "institution": null}]}