{"title": "The Connectivity Analysis of Simple Association", "book": "Neural Information Processing Systems", "page_first": 338, "page_last": 347, "abstract": "", "full_text": "338 \n\nThe Connectivity Analysis of Simple Association \n\n- or-\n\nHow Many Connections Do You Need! \n\nDan Hammerstrom * \n\nOregon Graduate Center, Beaverton, OR 97006 \n\nABSTRACT \n\nThe efficient realization, using current silicon technology, of Very Large Connection \nNetworks (VLCN) with more than a billion connections requires that these networks exhibit \na high degree of communication locality. Real neural networks exhibit significant locality, \nyet most connectionist/neural network models have little. In this paper, the connectivity \nrequirements of a simple associative network are analyzed using communication theory. \nSeveral techniques based on communication theory are presented that improve the robust(cid:173)\nness of the network in the face of sparse, local interconnect structures. Also discussed are \nsome potential problems when information is distributed too widely. \n\nINTRODUCTION \n\nConnectionist/neural network researchers are learning to program networks that exhi(cid:173)\nbit a broad range of cognitive behavior. Unfortunately, existing computer systems are lim(cid:173)\nited in their ability to emulate such networks efficiently. The cost of emulating a network, \nwhether with special purpose, highly parallel, silicon-based architectures, or with traditional \nparallel architectures, is directly proportional to the number of connections in the network. \nThis number tends to increase geometrically as the number of nodes increases. Even with \nlarge, massively parallel architectures, connections take time and silicon area. Many exist(cid:173)\ning neural network models scale poorly in learning time and connections, precluding large \nimplementations. \n\nThe connectivity 'costs of a network are directly related to its locality. 
A network exhibits locality of communication [1] if most of its processing elements connect to other physically adjacent processing elements in any reasonable mapping of the elements onto a planar surface. There is much evidence that real neural networks exhibit locality [2]. In this paper, a technique is presented for analyzing the effects of locality on the process of association. These networks use a complex node similar to the higher-order learning units of Maxwell et al. [3] \n\nNETWORK MODEL \n\nThe network model used in this paper is now defined (see Figure 1). \n\nDefinition 1: A recursive neural network, called a c-graph, is a graph structure Gamma(V, E, C), where: \n\n\u2022 There is a set of CNs (network nodes), V, whose outputs can take a range of positive real values, v_i, between 0 and 1. There are N_c nodes in the set. \n\n\u2022 There is a set of codons, E, that can take a range of positive real values, e_ij (for codon j of node i), between 0 and 1. There are N_e codons dedicated to each CN (the output of each codon is only used by its local CN), so there are a total of N_e N_c codons in the network. The fan-in, or order, of a codon is f_c. It is assumed that f_c is the same for each codon, and N_e is the same for each CN. \n\n\u2022 C_ijk, an element of C, is the set of connections of CNs to codons. \n\n*This work was supported in part by the Semiconductor Research Corporation contract no. 86-10-097, and jointly by the Office of Naval Research and Air Force Office of Scientific Research, ONR contract no. N00014 87 K 0259. \n\n\u00a9 American Institute of Physics 1988 \n\nFigure 1 - A CN (codon j) \n\nCOMMUNICATION ANALOGY \n\nConsider a single connection network node, or CN. (The remainder of this paper will be restricted to a single CN.) Assume that the CN output value space is restricted to two values, 0 and 1. 
Therefore, the CN must decide whether the input it sees belongs to the class of \"0\" codes, those codes for which it remains off, or the class of \"1\" codes, those codes for which it becomes active. The inputs it sees in its receptive field constitute a subset of the input vectors (the D(...) function) to the network. It is also assumed that the CN is an ideal 1-NN (Nearest Neighbor) classifier or feature detector. That is, given a particular set of learned vectors, the CN will classify an arbitrary input according to the class of the nearest (using d as a measure of distance) learned vector. This situation is equivalent to the case where a single CN has a single codon whose receptive field size is equivalent to that of the CN. \n\nImagine a sender who wishes to send one bit of information over a noisy channel. The sender has a probabilistic encoder that chooses a code word (learned vector) according to some probability distribution. The receiver knows this code set, though it has no knowledge of which bit is being sent. Noise is added to the code word during its transmission over the channel, which is analogous to applying an input vector to a network's inputs, where the vector lies within some learned vector's region. The \"noise\" is represented by the distance (d) between the input vector and the associated learned vector. \n\nThe code word sent over the channel consists of those bits that are seen in the receptive field of the CN being modeled. In the associative mapping of input vectors to output vectors, each CN must respond with the appropriate output (0 or 1) for the associated learned output vector. Therefore, a CN is a decoder that estimates in which class the received code word belongs. This is a classic block encoding problem, where increasing the field size is equivalent to increasing code length. 
As the receptive field size increases, the performance of the decoder improves in the presence of noise. Using communication theory, then, the trade-off between interconnection costs as they relate to field size and the functionality of a node as it relates to the correctness of its decision-making process (output errors) can be characterized. \n\nAs the receptive field size of a node increases, so does the redundancy of the input, though this is dependent on the particular codes being used for the learned vectors, since there are situations where increasing the field size provides no additional information. There is a point of diminishing returns, where each additional bit provides ever less reduction in output error. Another factor is that interconnection costs increase exponentially with field size. The result of these two trends is a cost-performance measure that has a single global maximum value. In other words, given a set of learned vectors and their probabilities, and a set of interconnection costs, a \"best\" receptive field size can be determined, beyond which increasing connectivity brings diminishing returns. \n\nSINGLE CODON, WITH NO CODE COMPRESSION \n\nA single neural element with a single codon and with no code compression can be modelled exactly as a communication channel (see Figure 2). Each network node is assumed to have a single codon whose receptive field size is equal to the receptive field size of the node. \n\nFigure 2 - A Transmission Channel (sender -> encoder -> transmitter -> noisy channel -> receiver -> decoder (CN)) \n\nThe operation of the channel is as follows. A bit is input into the channel encoder, which selects a random code of length N and transmits that code over the channel. The receiver then, using nearest neighbor classification, decides if the original message was either a 0 or a 1. 
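The channel model just described can be sketched numerically. Below is a minimal Monte-Carlo sketch of a CN acting as a nearest-neighbor block decoder; the function name, the random-code construction, and all parameter values are illustrative assumptions, not the simulator described later in this paper.

```python
import random

# Monte-Carlo sketch of a CN as a 1-NN block decoder (illustrative).
#   n_bits  - receptive field size (block length N)
#   n_codes - number of learned code words M
#   eps     - BSC crossover probability (the input noise)
def simulate_cn(n_bits, n_codes, eps, trials=2000, seed=1):
    rng = random.Random(seed)
    # Random learned code words, each assigned a desired CN output (0 or 1).
    codes = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_codes)]
    classes = [rng.randint(0, 1) for _ in range(n_codes)]
    errors = 0
    for _ in range(trials):
        i = rng.randrange(n_codes)
        # Pass code i through a binary symmetric channel.
        received = [b ^ (rng.random() < eps) for b in codes[i]]
        # 1-NN classification by Hamming distance.
        nearest = min(range(n_codes),
                      key=lambda j: sum(a != b for a, b in zip(codes[j], received)))
        errors += classes[nearest] != classes[i]
    return errors / trials
```

With the noise rate held fixed, growing n_bits (the receptive field, hence the block length) drives the decode error down, which is the block-coding effect described above.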
\n\nLet M be the number of code words used by the encoder. The rate then indicates the density of the code space. \n\nDefinition 7: The rate, R, of a communication channel is \n\nR = log2(M) / N    (6) \n\nThe block length, N, corresponds directly to the receptive field size of the codon, i.e., N = f_c. The derivations in later sections use a related measure: \n\nDefinition 8: The code utilization, b, is the number of learned vectors assigned to a particular code, or \n\nb = M / 2^N    (7) \n\nb can be written in terms of R: \n\nb = 2^(N(R-1))    (8) \n\nAs b approaches 1, code compression increases. b is essentially unbounded, since M may be significantly larger than 2^N. \n\nThe decode error (information loss) due to code compression is a random variable that depends on the compression rate and the a priori probabilities; therefore, it will be different with different learned vector sets and codons within a set. As the average code utilization for all codons approaches 1, code compression occurs more often and codon decode error is unavoidable. \n\nLet z_i be the vector output of the encoder, and the input to the channel, where each element of z_i is either a 1 or a 0. Let v_i be the vector output of the channel, and the input to the decoder, where each element is either a 1 or a 0. The Noisy Channel Coding Theorem is now presented for a general case, where the individual M input codes are to be distinguished. The result is then extended to a CN, where, even though M input codes are used, the CN need only distinguish those codes where it must output a 1 from those where it must output a 0. The theorem is from Gallager (5.6.1) [5]. Random codes are assumed throughout. 
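Definitions 7 and 8 can be checked directly; the small sketch below (function names assumed) evaluates the rate and both forms of the code utilization for an example code set.

```python
import math

# Definition 7: rate R = log2(M) / N of a block code with M words, length N.
def rate(M, N):
    return math.log2(M) / N

# Definition 8: code utilization b = M / 2**N, equivalently 2**(N*(R-1)).
def code_utilization(M, N):
    return M / 2 ** N

M, N = 16, 8
R = rate(M, N)               # 0.5
b = code_utilization(M, N)   # 0.0625: far below 1, so no compression yet
assert abs(b - 2 ** (N * (R - 1))) < 1e-12   # the two forms of b agree
```

Pushing M past 2^N drives b above 1, which is exactly the compression regime analyzed in the next sections.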
\n\nTheorem 1: Let a discrete memoryless channel have transition probabilities P_N(j|k) and, for any positive integer N and positive number R, consider the ensemble of (N, R) block codes in which each letter of each code word is independently selected according to the probability assignment Q(k). Then, for each message m, and any rho, 0 <= rho <= 1, \n\nP_em <= exp{-N[E_0(rho) - rho R]}    (13) \n\nThe minimum value of this expression is obtained when rho = 1 (for q = 0.5): \n\nE_0 = -log2[(0.5 sqrt(epsilon) + 0.5 sqrt(1 - epsilon))^2]    (14) \n\nSINGLE CODON WITH CODE COMPRESSION \n\nUnfortunately, the implementation complexity of a codon grows exponentially with the size of the codon, which limits its practical size. An alternative is to approximate the single-codon function of a single CN with many smaller, overlapped codons. The goal is to maintain performance and reduce implementation costs, thus improving the cost/performance of the decoding process. As codons get smaller, the receptive field size becomes smaller relative to the number of CNs in the network. When this happens there is codon compression, or vector aliasing, that introduces its own errors into the decoding process due to information loss. Networks can overcome this error by using multiple redundant codons (with overlapping receptive fields) that tend to correct the compression error. \n\nCompression occurs when two code words requiring different decoder output share the same representation (within the receptive field of the codon). The following theorem gives the probability of incorrect codon output with and without compression error. \n\nTheorem 3: For a BSC model where q = 0.5, the codon receptive field is f_c, the code utilization is b, and the channel bits are selected randomly and independently, the probability of a codon decoding error when b > 1 is approximately \n\nP_cb < (1 - f)^b P_c + [1 - (1 - f)^b] 0.5    (15) \n\nwhere the expected compression error per codon, Pbar_c, is approximated by \n\nPbar_c = 0.5    (16) \n\nand, from equations 13-14, when b < 1, \n\nP_c < exp{-f_c[-log2[(0.5 sqrt(epsilon) + 0.5 sqrt(1 - epsilon))^2] - R]}    (17) \n\nProof is given in Hammerstrom [6]. \n\nAs b grows, P_c approaches 0.5 asymptotically. Thus, the performance of a single codon degrades rapidly in the presence of even small amounts of compression. \n\nMULTIPLE CODONS WITH CODE COMPRESSION \n\nThe use of multiple small codons is more efficient than a few large codons, but there are some fundamental performance constraints. When a codon is split into two or more smaller codons (and the original receptive field is subdivided accordingly), there are several effects to be considered. First, the error rate of each new codon increases due to a decrease in receptive field size (the codon's block code length). The second effect is that the code utilization, b, will increase for each codon, since the same number of learned vectors is mapped into a smaller receptive field. This change also increases the error rate per codon due to code compression. In fact, as the individual codon receptive fields get smaller, significant code compression occurs. For higher-order input codes, there is an added error that occurs when the order of the individual codons is decreased (since random codes are being assumed, this effect is not considered here). The third effect is the mass action of large numbers of codons. Even though individual codons may be in error, if the majority are correct, then the CN will have correct output. This effect decreases the total error rate. \n\nAssume that each CN has more than one codon, c > 1. The union of the receptive fields for these codons is the receptive field for the CN, with no restrictions on the degree of overlap of the various codon receptive fields within or between CNs. 
For a CN with a large number of codons, the codon overlap will generally be random and uniformly distributed. Also assume that the transmission errors seen by different receptive fields are independent. \n\nNow consider what happens to a codon's compression error rate (ignoring transmission error for the time being) when a codon is replaced by two or more smaller codons covering the same receptive field. This replacement process can continue until there are only order-1 codons, which, incidentally, is analogous to most current neural models. For a multiple-codon CN, assume that each codon votes a 1 or 0. The summation unit then totals this information and outputs a 1 if the majority of codons vote for a 1, etc. \n\nTheorem 4: The probability of a CN error due to compression error is \n\nP_CN = (1/sqrt(2 pi)) Integral from x to infinity of e^(-y^2/2) dy,  where x = (c/2 - c Pbar_c - 1/2) / sqrt(c Pbar_c (1 - Pbar_c))    (18) \n\nwhere Pbar_c is given in equation 16 and q = 0.5. \n\nP_CN incorporates the two effects of moving to multiple smaller codons and adding more codons. Using equation 17 gives the total error probability (per bit), P_CN':    (19) \n\nProof is in Hammerstrom [6]. \n\nFor networks that perform association as defined in this paper, the connection weights rapidly approach a single uniform value as the size of the network grows. In information-theoretic terms, the information content of those weights approaches zero as the compression increases. Why then do simple non-conjunctive networks (1-codon equivalent) work at all? In the next section I define connectivity cost constraints and show that the answer is that the general associative structures defined here do not scale cost-effectively and, more importantly, that there are limits to the degree of distribution of information. \n\nCONNECTIVITY COSTS \n\nIt is much easier to assess costs if some implementation medium is assumed. 
I have chosen standard silicon, which is a two-dimensional surface where CNs and codons take up surface area according to their receptive field sizes. In addition, there is area devoted to the metal lines that interconnect the CNs. A specific VLSI technology need not be assumed, since the comparisons are relative, thus keeping CNs, codons, and metal in the proper proportions, according to a standard metal width, m_s (which also includes the inter-metal pitch). For the analyses performed here, it is assumed that m_l levels of metal are possible. \n\nIn the previous section I established the relationship of network performance, in terms of the transmission error rate, epsilon, and the network capacity, M. In this section I present an implementation cost, which is total silicon area, A. This figure can then be used to derive a cost/performance figure that can be used to compare such factors as codon size and receptive field size. There are two components to the total area: A_CN, the area of a CN, and A_MI, the area of the metal interconnect between CNs. A_CN consists of the silicon area requirements of the codons for all CNs. The metal area for local, intra-CN interconnect is considered to be much smaller than that of the codons themselves and of the more global, inter-CN interconnect, and is not considered here. The area per CN is roughly \n\nA_CN = c f_c m_c (m_s^2 / m_l)    (20) \n\nwhere m_c is the maximum number of vectors that each codon must distinguish; for b > 1, m_c = 2^(f_c). 
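The exponential dependence of codon area on codon order can be illustrated numerically. The helper below is a sketch of the per-CN area expression in relative units (metal width m_s = 1, m_l metal levels); the function name and default values are assumptions for illustration.

```python
# Sketch of the per-CN codon area, in relative units: c codons of order f_c,
# where each codon must distinguish m_c = 2**f_c vectors when b > 1.
def cn_area(c, f_c, m_s=1.0, m_l=2):
    m_c = 2 ** f_c
    return c * f_c * m_c * (m_s ** 2 / m_l)

# The exponential term dominates: one order-16 codon costs far more codon
# area than four order-4 codons covering the same total receptive field.
assert cn_area(1, 16) > cn_area(4, 4)
```

This is the quantitative version of the earlier remark that implementation complexity grows exponentially with codon size, motivating many small, overlapped codons.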
\n\nTheorem 5: Assume a rectangular, unbounded* grid of CNs (all CNs are equidistant from their four nearest neighbors), where each CN has a bounded receptive field of its n_CN nearest CNs, where n_CN is the receptive field size for the CN, n_CN = c f_c / R, where c is the number of codons and R is the intra-CN redundancy, that is, the ratio of inputs to synapses (e.g., when R = 1 each CN input is used once at the CN; when R = 2 each input is used on the average at two sites). The metal area required to support each CN's receptive field is (proof is given by Hammerstrom [6]): \n\nA_MI = [n_CN^3 / 12 + 3 n_CN^2 / 2 + 9 n_CN / 2] (m_s^2 / m_l)    (21) \n\nThe total area per CN, A, then is \n\nA = A_CN + A_MI    (22) \n\n*Another implementation strategy is to place all CNs along a diagonal, which gives n^2 area. However, this technique only works for a bounded number of CNs and when dendritic computation can be spread over a large area, which limits the range of possible CN implementations. The theorem stated here covers an infinite plane of CNs, each with a bounded receptive field. \n\nEven with the assumption of maximum locality, the total metal interconnect area increases as the cube of the per-CN receptive field size! \n\nSINGLE CN SIMULATION \n\nWhat do the bounds tell us about CN connectivity requirements? From simulations, increasing the CN's receptive field size improves the performance (increases capacity), but there is also an increasing cost, which increases faster than the performance. Another observation is that redundancy is quite effective as a means for increasing the effectiveness of a CN with constrained connectivity. (There are some limits to R, since it can reach a point where the intra-CN connectivity approaches that of inter-CN connectivity for some situations.) With a fixed n_CN, increasing cost-effectiveness (M/A) is possible by increasing both order and redundancy. 
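The cubic growth of the inter-CN metal area can be demonstrated with a short sketch. Only the structure of the area expression matters here; the coefficients below are illustrative assumptions, and the cubic leading term is the point.

```python
# Sketch of inter-CN metal area as a function of the per-CN receptive field
# n_cn, in relative units (metal width m_s = 1, m_l metal levels assumed).
def metal_area(n_cn, m_s=1.0, m_l=2):
    return (n_cn ** 3 / 12 + 3 * n_cn ** 2 / 2 + 9 * n_cn / 2) * (m_s ** 2 / m_l)

# Cubic growth: doubling the receptive field costs roughly 8x the metal.
ratio = metal_area(256) / metal_area(128)
assert 7 < ratio < 9
```

Whatever the exact coefficients, any expression dominated by an n_CN^3 term behaves this way, which is why fan-in past a certain point yields diminishing cost-performance returns.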
\n\nIn order to verify the derived bounds, I also wrote a discrete event simulation of a CN, where a random set of learned vectors was chosen and the CN's codons were programmed according to the model presented earlier. Learned vectors were chosen randomly and subjected to random noise, epsilon. The CN then attempted to categorize these inputs into two major groups (CN output = 1 and CN output = 0). For the most part the analytic bounds agreed with the simulation, though they tended to be optimistic in slightly underestimating the error. These differences can be easily explained by the simplifying assumptions that were made to make the analytic bounds mathematically tractable. \n\nDISTRIBUTED VS. LOCALIZED \n\nThroughout this paper, it has been tacitly assumed that representations are distributed across a number of CNs, and that any single CN participates in a number of representations. In a local representation each CN represents a single concept or feature. It is the distribution of representation that makes the CN's decode job difficult, since it is the cause of the code compression problem. \n\nThere has been much debate in the connectionist/neuromodelling community as to the advantages and disadvantages of each approach; the interested reader is referred to Hinton [7], Baum et al. [8], and Ballard [9]. Some of the results derived here are relevant to this debate. As the distribution of representation increases, the compression per CN increases accordingly. It was shown above that the mean error in a codon's response quickly approaches 0.5, independent of the input noise. This result also holds at the CN level. 
For each individual CN, this error can be offset by adding more codons, but this is expensive and tends to obviate one of the arguments in favor of distributed representations, that is, the multi-use advantage, where fewer CNs are needed because of more complex, redundant encodings. As the degree of distribution increases, the required connectivity and the code compression increase, so the added information that each codon contributes to its CN's decoding process goes to zero (equivalent to all weights approaching a uniform value). \n\nSUMMARY AND CONCLUSIONS \n\nIn this paper a single-CN (node) performance model was developed that was based on communication theory. Likewise, an implementation cost model was derived. \n\nThe communication model introduced the codon as a higher-order decoding element and showed that for small codons (much less than total CN fan-in, or convergence) code compression, or vector aliasing, within the codon's receptive field is a severe problem for large networks. As code compression increases, the information added by any individual codon to the CN's decoding task rapidly approaches zero. \n\nThe cost model showed that for 2-dimensional silicon, the area required for inter-node metal connectivity grows as the cube of a CN's fan-in. \n\nThe combination of these two trends indicates that past a certain point, which is highly dependent on the probability structure of the learned vector space, increasing the fan-in of a CN (as is done, for example, when the distribution of representation is increased) yields diminishing returns in terms of total cost-performance, though the rate of diminishing returns can be decreased by the use of redundant, higher-order connections. \n\nThe next step is to apply these techniques to ensembles of nodes (CNs) operating in a competitive learning or feature extraction environment. 
\n\nREFERENCES \n\n[1] J. Bailey, \"A VLSI Interconnect Structure for Neural Networks,\" Ph.D. Dissertation, Department of Computer Science/Engineering, OGC. In preparation. \n\n[2] V. B. Mountcastle, \"An Organizing Principle for Cerebral Function: The Unit Module and the Distributed System,\" in The Mindful Brain, MIT Press, Cambridge, MA, 1977. \n\n[3] T. Maxwell, C. L. Giles, Y. C. Lee and H. H. Chen, \"Transformation Invariance Using High Order Correlations in Neural Net Architectures,\" Proceedings International Conf. on Systems, Man, and Cybernetics, 1986. \n\n[4] D. Marr, \"A Theory for Cerebral Neocortex,\" Proc. Roy. Soc. London, vol. 176 (1970), pp. 161-234. \n\n[5] R. G. Gallager, Information Theory and Reliable Communication, John Wiley and Sons, New York, 1968. \n\n[6] D. Hammerstrom, \"A Connectivity Analysis of Recursive, Auto-Associative Connection Networks,\" Tech. Report CS/E-86-009, Dept. of Computer Science/Engineering, Oregon Graduate Center, Beaverton, Oregon, August 1986. \n\n[7] G. E. Hinton, \"Distributed Representations,\" Technical Report CMU-CS-84-157, Computer Science Dept., Carnegie-Mellon University, Pittsburgh, PA 15213, 1984. \n\n[8] E. B. Baum, J. Moody and F. Wilczek, \"Internal Representations for Associative Memory,\" Technical Report NSF-ITP-86-138, Institute for Theoretical Physics, Santa Barbara, CA, 1986. \n\n[9] D. H. Ballard, \"Cortical Connections and Parallel Processing: Structure and Function,\" Technical Report 133, Computer Science Department, Rochester, NY, January 1985. \n", "award": [], "sourceid": 53, "authors": [{"given_name": "Dan", "family_name": "Hammerstrom", "institution": null}]}