{"title": "Connectivity Versus Entropy", "book": "Neural Information Processing Systems", "page_first": 1, "page_last": 8, "abstract": null, "full_text": "CONNECTIVITY VERSUS ENTROPY

Yaser S. Abu-Mostafa

California Institute of Technology, Pasadena, CA 91125

ABSTRACT

How does the connectivity of a neural network (number of synapses per neuron) relate to the complexity of the problems it can handle (measured by the entropy)? Switching theory would suggest no relation at all, since all Boolean functions can be implemented using a circuit with very low connectivity (e.g., using two-input NAND gates). However, for a network that learns a problem from examples using a local learning rule, we prove that the entropy of the problem becomes a lower bound for the connectivity of the network.

INTRODUCTION

The most distinguishing feature of neural networks is their ability to spontaneously learn the desired function from 'training' samples, i.e., their ability to program themselves. Clearly, a given neural network cannot just learn any function; there must be some restrictions on which networks can learn which functions. One obvious restriction, independent of the learning aspect, is that the network must be big enough to accommodate the circuit complexity of the function it will eventually simulate. Are there restrictions that arise merely from the fact that the network is expected to learn the function, rather than being purposely designed for the function? This paper reports a restriction of this kind.

The result imposes a lower bound on the connectivity of the network (number of synapses per neuron). This lower bound can only be a consequence of the learning aspect, since switching theory provides purposely designed circuits of low connectivity (e.g., using only two-input NAND gates) capable of implementing any Boolean function [1,2].
It also follows that the learning mechanism must be restricted for this lower bound to hold; a powerful mechanism could be designed that finds one of the low-connectivity circuits (perhaps by exhaustive search), and hence the lower bound on connectivity cannot hold in general. Indeed, we restrict the learning mechanism to be local: when a training sample is loaded into the network, each neuron has access only to those bits carried by itself and the neurons it is directly connected to. This is a strong assumption that excludes sophisticated learning mechanisms used in neural-network models, but it may be more plausible from a biological point of view.

© American Institute of Physics 1988

The lower bound on the connectivity of the network is given in terms of the entropy of the environment that provides the training samples. Entropy is a quantitative measure of the disorder or randomness in an environment or, equivalently, the amount of information needed to specify the environment. There are many different ways to define entropy, and many technical variations of this concept [3]. In the next section, we shall introduce the formal definitions and results, but we start here with an informal exposition of the ideas involved.

The environment in our model produces patterns represented by N bits x = x1 ... xN (pixels in the picture of a visual scene, if you will). Only h different patterns can be generated by a given environment, where h < 2^N (the entropy is essentially log2 h). No knowledge is assumed about which patterns the environment is likely to generate, only that there are h of them. In the learning process, a huge number of sample patterns are generated at random from the environment and input to the network, one bit per neuron. The network uses this information to set its internal parameters and gradually tune itself to this particular environment.
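This environment model is easy to make concrete. The sketch below is illustrative only (the variable names and parameter values are my own, not from the paper): an environment is a set of h distinct N-bit patterns, and its entropy is essentially log2 h bits.

```python
import math
import random

N = 12       # bits per pattern (one bit per neuron)
h = 2 ** 5   # number of distinct patterns this environment can produce

# An environment: h distinct N-bit patterns, chosen uniformly at random
# without replacement (patterns are encoded as integers in [0, 2^N)).
rng = random.Random(0)
environment = rng.sample(range(2 ** N), h)

# The entropy of the environment is essentially log2(h) bits.
entropy = math.log2(h)
print(entropy)  # 5.0
```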
Because of the network architecture, each neuron knows only its own bit and (at best) the bits of the neurons it is directly connected to by a synapse. Hence, the learning rules are local: a neuron does not have the benefit of the entire global pattern that is being learned.

After the learning process has taken place, each neuron is ready to perform a function defined by what it has learned. The collective interaction of the functions of the neurons is what defines the overall function of the network. The main result of this paper is that (roughly speaking) if the connectivity of the network is less than the entropy of the environment, the network cannot learn about the environment. The idea of the proof is to show that if the connectivity is small, the final function of each neuron is independent of the environment, and hence to conclude that the overall network has accumulated no information about the environment it is supposed to learn about.

FORMAL RESULT

A neural network is an undirected graph (the vertices are the neurons and the edges are the synapses). Label the neurons 1, ..., N and define Kn ⊆ {1, ..., N} to be the set of neurons connected by a synapse to neuron n, together with neuron n itself. An environment is a subset e ⊆ {0,1}^N (each x ∈ e is a sample from the environment). During learning, x1, ..., xN (the bits of x) are loaded into the neurons 1, ..., N, respectively. Consider an arbitrary neuron n and relabel everything to make Kn become {1, ..., K}. Thus the neuron sees the first K coordinates of each x.

Since our result is asymptotic in N, we will specify K as a function of N: K = αN, where α = α(N) satisfies lim_{N→∞} α(N) = α0 (0 < α0 < 1). Since the result is also statistical, we will consider the ensemble of environments E,

    E = E(N) = { e ⊆ {0,1}^N : |e| = h }

where h = 2^{βN} and β = β(N) satisfies lim_{N→∞} β(N) = β0 (0 < β0 < 1).
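A minimal sketch of these definitions (the names and parameter values are mine, and I take the 'first K coordinates' to be the K high-order bits of an integer-encoded pattern): draw one environment e uniformly from the ensemble and tabulate n(a), the number of patterns in e whose first K bits equal each K-bit vector a.

```python
from collections import Counter
import random

N = 12
K = 4                      # K = alpha*N with alpha = 1/3
h = 2 ** 6                 # |e| = 2^(beta*N) with beta = 1/2

# One environment drawn uniformly from the ensemble:
# h distinct N-bit patterns, sampled without replacement.
rng = random.Random(1)
e = rng.sample(range(2 ** N), h)

# n(a) = |{x in e : x agrees with a on the first K coordinates}|,
# taking the first K coordinates as the K high-order bits of x.
n = Counter(x >> (N - K) for x in e)

# Sanity check: the counts over the 2^K cells account for every pattern,
# so the average count per cell is h * 2^(-K).
assert sum(n.values()) == h
print(h / 2 ** K)  # 4.0
```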
The probability distribution on E is uniform; any environment e ∈ E is as likely to occur as any other.

The neuron sees only the first K coordinates of each x generated by the environment e. For each e, we define the function n : {0,1}^K → {0, 1, 2, ...} where

    n(a1 ... aK) = |{ x ∈ e : xk = ak for k = 1, ..., K }|

and the normalized version

    ν(a) = n(a) / h.

The function ν describes the relative frequency of occurrence of each of the 2^K binary vectors a1 ... aK as x = x1 ... xN runs through all h vectors in e. In other words, ν specifies the projection of e as seen by the neuron. Clearly, ν(a) ≥ 0 for all a ∈ {0,1}^K and Σ_{a ∈ {0,1}^K} ν(a) = 1.

Corresponding to two environments e1 and e2, we will have two functions ν1 and ν2. If ν1 is not distinguishable from ν2, the neuron cannot tell the difference between e1 and e2. The distinguishability between ν1 and ν2 can be measured by

    d(ν1, ν2) = (1/2) Σ_{a ∈ {0,1}^K} |ν1(a) − ν2(a)|.

The range of d(ν1, ν2) is 0 ≤ d(ν1, ν2) ≤ 1, where 0 corresponds to complete indistinguishability while 1 corresponds to maximum distinguishability. We are now in a position to state the main result.

Let e1 and e2 be independently selected environments from E according to the uniform probability distribution. Then d(ν1, ν2) is a random variable, and we are interested in its expected value E(d(ν1, ν2)). The case E(d(ν1, ν2)) = 0 corresponds to the neuron getting no information about the environment, while the case E(d(ν1, ν2)) = 1 corresponds to the neuron getting maximum information. The theorem predicts, in the limit, one of these extremes, depending on how the connectivity (α0) compares to the entropy (β0).

Theorem.
1. If α0 > β0, then lim_{N→∞} E(d(ν1, ν2)) = 1.
2. If α0 < β0, then lim_{N→∞} E(d(ν1, ν2)) = 0.

The proof is given in the appendix, but the idea is easy to illustrate informally.
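The dichotomy in the theorem can also be observed numerically in small cases. The sketch below (all function names and parameter choices are my own) computes the projection ν for two independently drawn environments, the distance d(ν1, ν2), and a Monte Carlo estimate of E(d) in one instance of each regime:

```python
import random

def projection(e, N, K, h):
    # nu(a): relative frequency of each K-bit prefix among the h patterns of e.
    nu = [0.0] * (2 ** K)
    for x in e:
        nu[x >> (N - K)] += 1.0 / h
    return nu

def d(nu1, nu2):
    # d(nu1, nu2) = (1/2) * sum_a |nu1(a) - nu2(a)|, which lies in [0, 1].
    return 0.5 * sum(abs(p - q) for p, q in zip(nu1, nu2))

def expected_d(N, K, h, trials=100, seed=0):
    # Monte Carlo estimate of E(d(nu1, nu2)) over independent e1, e2
    # drawn uniformly (without replacement) from the ensemble.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        e1 = rng.sample(range(2 ** N), h)
        e2 = rng.sample(range(2 ** N), h)
        total += d(projection(e1, N, K, h), projection(e2, N, K, h))
    return total / trials

# Part 1 regime (alpha > beta, i.e. h << 2^K): E(d) should be near 1.
# Part 2 regime (alpha < beta, i.e. h >> 2^K): E(d) should be near 0.
print(expected_d(N=14, K=10, h=2 ** 4))   # close to 1
print(expected_d(N=14, K=4,  h=2 ** 10))  # close to 0
```

With h much smaller than 2^K, two independent environments occupy almost disjoint sets of K-bit prefixes, so d is nearly 1; with h much larger than 2^K, both projections are close to uniform, so d is nearly 0.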
Suppose h = 2^{K+10} (corresponding to part 2 of the theorem). For most environments e ∈ E, the first K bits of x ∈ e go through all 2^K possible values approximately 2^10 times each as x goes through all h possible values once. Therefore, the patterns seen by the neuron are drawn from the fixed ensemble of all binary vectors of length K with an essentially uniform probability distribution, i.e., ν is the same for most environments. This means that, statistically, the neuron will end up doing the same function regardless of the environment at hand.

What about the opposite case, where h = 2^{K−10} (corresponding to part 1 of the theorem)? Now, with only 2^{K−10} patterns available from the environment, the first K bits of x can assume at most 2^{K−10} values out of the 2^K values a binary vector of length K can assume in principle. Furthermore, which values can be assumed depends on the particular environment at hand, i.e., ν does depend on the environment. Therefore, although the neuron still does not have the global picture, the information it has says something about the environment.

ACKNOWLEDGEMENT

This work was supported by the Air Force Office of Scientific Research under Grant AFOSR-86-0296.

APPENDIX

In this appendix we prove the main theorem. We start by establishing some basic properties of the ensemble of environments E. Since the probability distribution on E is uniform and since |E| = (2^N choose h), we have

    Pr(e) = (2^N choose h)^{-1}

which is equivalent to generating e by choosing h elements x ∈ {0,1}^N with uniform probability (without replacement). It follows that

    Pr(x ∈ e) = h / 2^N
    Pr(x1 ∈ e, x2 ∈ e) = (h / 2^N) × ((h − 1) / (2^N − 1))

and so on.

The functions n and ν are defined on K-bit vectors. The statistics of n(a) (a random variable for fixed a) are independent of a:

    Pr(n(a1) = m) = Pr(n(a2) = m)

which follows from the symmetry with respect to each bit of a.
The same holds for the statistics of ν(a). The expected value E(n(a)) = h·2^{−K} (h objects going into 2^K cells), hence E(ν(a)) = 2^{−K}. We now restate and prove the theorem.

Theorem.
1. If α0 > β0, then lim_{N→∞} E(d(ν1, ν2)) = 1.
2. If α0 < β0, then lim_{N→∞} E(d(ν1, ν2)) = 0.

Proof.

We expand E(d(ν1, ν2)) as follows:

    E(d(ν1, ν2)) = E( (1/2) Σ_a |ν1(a) − ν2(a)| )
                 = (1/2) Σ_a E(|ν1(a) − ν2(a)|)
                 = (2^K / 2h) E(|n1 − n2|)

where n1 and n2 denote n1(0···0) and n2(0···0), respectively, and the last step follows from the fact that the statistics of n1(a) and n2(a) are independent of a. Therefore, to prove the theorem, we evaluate E(|n1 − n2|) for large N.

1. Assume α0 > β0. Let n denote n(0···0), and consider Pr(n = 0). For n to be zero, all 2^{N−K} strings x of N bits starting with K 0's must not be in the environment e. Hence

    Pr(n = 0) = (1 − h/2^N)(1 − h/(2^N − 1)) ··· (1 − h/(2^N − 2^{N−K} + 1))

where the first term is the probability that 0···00 ∉ e, the second term is the probability that 0···01 ∉ e given that 0···00 ∉ e, and so on. Therefore

    Pr(n = 0) ≥ (1 − h/(2^N − 2^{N−K}))^{2^{N−K}}
              = (1 − h·2^{−N}(1 − 2^{−K})^{−1})^{2^{N−K}}
              ≥ (1 − 2h·2^{−N})^{2^{N−K}}
              ≥ 1 − 2h·2^{−N}·2^{N−K}
              = 1 − 2h·2^{−K}.

Hence, Pr(n1 = 0) = Pr(n2 = 0) = Pr(n = 0) ≥ 1 − 2h·2^{−K}, while E(n1) = E(n2) = h·2^{−K}. Therefore,

    E(|n1 − n2|) = Σ_{i=0}^{h} Σ_{j=0}^{h} Pr(n1 = i, n2 = j) |i − j|
                 = Σ_{i=0}^{h} Σ_{j=0}^{h} Pr(n1 = i) Pr(n2 = j) |i − j|
                 ≥ Σ_{j=0}^{h} Pr(n1 = 0) Pr(n2 = j) j + Σ_{i=0}^{h} Pr(n1 = i) Pr(n2 = 0) i

where the second line uses the independence of e1 and e2, and the inequality follows by throwing away all the terms where neither i nor j is zero (the term where both i and j are zero appears twice for convenience, but this term is zero anyway).
Continuing,

    E(|n1 − n2|) ≥ Pr(n1 = 0) E(n2) + Pr(n2 = 0) E(n1)
                 ≥ 2 (1 − 2h·2^{−K}) h·2^{−K}.

Substituting this estimate in the expression for E(d(ν1, ν2)), we get

    E(d(ν1, ν2)) = (2^K / 2h) E(|n1 − n2|)
                 ≥ (2^K / 2h) × 2 (1 − 2h·2^{−K}) h·2^{−K}
                 = 1 − 2h·2^{−K}
                 = 1 − 2 × 2^{(β−α)N}.

Since α0 > β0 by assumption, this lower bound goes to 1 as N goes to infinity. Since 1 is also an upper bound for d(ν1, ν2) (and hence for the expected value E(d(ν1, ν2))), lim_{N→∞} E(d(ν1, ν2)) must be 1.

2. Assume α0 < β0. Consider

    E(|n1 − n2|) = E(|(n1 − h·2^{−K}) − (n2 − h·2^{−K})|)
                 ≤ E(|n1 − h·2^{−K}| + |n2 − h·2^{−K}|)
                 = E(|n1 − h·2^{−K}|) + E(|n2 − h·2^{−K}|)
                 = 2 E(|n − h·2^{−K}|).

To evaluate E(|n − h·2^{−K}|), we estimate the variance of n and use the fact that E(|n − h·2^{−K}|) ≤ √var(n) (recall that h·2^{−K} = E(n)). Since var(n) = E(n²) − (E(n))², we need an estimate for E(n²). We write n = Σ_{a ∈ {0,1}^{N−K}} δ_a, where

    δ_a = 1 if 0···0a ∈ e, and δ_a = 0 otherwise.

In this notation, E(n²) can be written as

    E(n²) = E( Σ_{a ∈ {0,1}^{N−K}} Σ_{b ∈ {0,1}^{N−K}} δ_a δ_b )
          = Σ_{a ∈ {0,1}^{N−K}} Σ_{b ∈ {0,1}^{N−K}} E(δ_a δ_b).

For the 'diagonal' terms (a = b),

    E(δ_a δ_a) = Pr(δ_a = 1) = h·2^{−N}.

There are 2^{N−K} such diagonal terms, hence a total contribution of 2^{N−K} × h·2^{−N} = h·2^{−K} to the sum. For the 'off-diagonal' terms (a ≠ b),

    E(δ_a δ_b) = Pr(δ_a = 1, δ_b = 1)
               = Pr(δ_a = 1) Pr(δ_b = 1 | δ_a = 1)
               = (h / 2^N) × ((h − 1) / (2^N − 1)).

There are 2^{N−K}(2^{N−K} − 1) such off-diagonal terms, hence a total contribution of 2^{N−K}(2^{N−K} − 1) × h(h − 1)/(2^N(2^N − 1)) < (h·2^{−K})² × 2^N/(2^N − 1) to the sum.
Putting the contributions from the diagonal and off-diagonal terms together, we get

    E(n²) < h·2^{−K} + (h·2^{−K})² × 2^N/(2^N − 1)

and hence

    var(n) = E(n²) − (E(n))²
           < h·2^{−K} + (h·2^{−K})² × 2^N/(2^N − 1) − (h·2^{−K})²
           = h·2^{−K} + (h·2^{−K})² × 1/(2^N − 1)
           = h·2^{−K} (1 + h·2^{−K}/(2^N − 1))
           < 2h·2^{−K}.

The last step follows since h·2^{−K} is much smaller than 2^N − 1. Therefore, E(|n − h·2^{−K}|) ≤ √var(n) < (2h·2^{−K})^{1/2}. Substituting this estimate in the expression for E(d(ν1, ν2)), we get

    E(d(ν1, ν2)) = (2^K / 2h) E(|n1 − n2|)
                 ≤ (2^K / 2h) × 2 E(|n − h·2^{−K}|)
                 < (2^K / 2h) × 2 × (2h·2^{−K})^{1/2}
                 = (2 × 2^K / h)^{1/2}
                 = √2 × 2^{(α−β)N/2}.

Since α0 < β0 by assumption, this upper bound goes to 0 as N goes to infinity. Since 0 is also a lower bound for d(ν1, ν2) (and hence for the expected value E(d(ν1, ν2))), lim_{N→∞} E(d(ν1, ν2)) must be 0. ∎

REFERENCES

[1] Y. Abu-Mostafa, \"Neural networks for computing?,\" AIP Conference Proceedings #151, Neural Networks for Computing, J. Denker (ed.), pp. 1-6, 1986.

[2] Z. Kohavi, Switching and Finite Automata Theory, McGraw-Hill, 1978.

[3] Y. Abu-Mostafa, \"The complexity of information extraction,\" IEEE Trans. on Information Theory, vol. IT-32, pp. 513-525, July 1986.

[4] Y. Abu-Mostafa, \"Complexity in neural systems,\" in Analog VLSI and Neural Systems by C. Mead, Addison-Wesley, 1988.
", "award": [], "sourceid": 63, "authors": [{"given_name": "Yaser", "family_name": "Abu-Mostafa", "institution": null}]}