{"title": "Scaling Properties of Coarse-Coded Symbol Memories", "book": "Neural Information Processing Systems", "page_first": 652, "page_last": 661, "abstract": null, "full_text": "Scaling Properties of Coarse-Coded Symbol Memories \n\nRonald Rosenfeld \nDavid S. Touretzky \n\nComputer Science Department \nCarnegie Mellon University \nPittsburgh, Pennsylvania 15213 \n\nAbstract: Coarse-coded symbol memories have appeared in several neural network symbol processing models. In order to determine how these models would scale, one must first have some understanding of the mathematics of coarse-coded representations. We define the general structure of coarse-coded symbol memories and derive mathematical relationships among their essential parameters: memory size, symbol-set size and capacity. The computed capacity of one of the schemes agrees well with actual measurements of the coarse-coded working memory of DCPS, Touretzky and Hinton's distributed connectionist production system. \n\n1 Introduction \n\nA distributed representation is a memory scheme in which each entity (concept, symbol) is represented by a pattern of activity over many units [3]. If each unit participates in the representation of many entities, it is said to be coarsely tuned, and the memory itself is called a coarse-coded memory. \n\nCoarse-coded memories have been used for storing symbols in several neural network symbol processing models, such as Touretzky and Hinton's distributed connectionist production system DCPS [8,9], Touretzky's distributed implementation of linked list structures on a Boltzmann machine, BoltzCONS [10], and St. John and McClelland's PDP model of case role defaults [6]. In all of these models, memory capacity was measured empirically and parameters were adjusted by trial and error to obtain the desired behavior. 
We are now able to give a mathematical foundation to these experiments by analyzing the relationships among the fundamental memory parameters. \n\nThere are several paradigms for coarse-coded memories. In a feature-based representation, each unit stands for some semantic feature. Binary units can code features with binary values, whereas more complicated units or groups of units are required to code more complicated features, such as multi-valued properties or numerical values from a continuous scale. The units that form the representation of a concept define an intersection of features that constitutes that concept. Similarity between concepts composed of binary features can be measured by the Hamming distance between their representations. In a neural network implementation, relationships between concepts are implemented via connections among the units forming their representations. Certain types of generalization phenomena thereby emerge automatically. \n\nA different paradigm is used when representing points in a multidimensional continuous space [2,3]. Each unit encodes values in some subset of the space. Typically the subsets are hypercubes or hyperspheres, but they may be more coarsely tuned along some dimensions than others [1]. The point to be represented is in the subspace formed by the intersection of all active units. As more units are turned on, the accuracy of the representation improves. The density and degree of overlap of the units' receptive fields determines the system's resolution [7]. \n\n© American Institute of Physics 1988 \n\nYet another paradigm for coarse-coded memories, and the one we will deal with exclusively, does not involve features. Each concept, or symbol, is represented by an arbitrary subset of the units, called its pattern. Unlike in feature-based representations, the units in the pattern bear no relationship to the meaning of the symbol represented. 
A symbol is stored in memory by turning on all the units in its pattern. A symbol is deemed present if all the units in its pattern are active.¹ The receptive field of each unit is defined as the set of all symbols in whose pattern it participates. We call such memories coarse-coded symbol memories (CCSMs). We use the term \"symbol\" instead of \"concept\" to emphasize that the internal structure of the entity to be represented is not involved in its representation. In CCSMs, a short Hamming distance between two symbols does not imply semantic similarity, and is in general an undesirable phenomenon. \n\nThe efficiency with which CCSMs handle sparse memories is the major reason they have been used in many connectionist systems, and hence the major reason for studying them here. The unit-sharing strategy that gives rise to efficient encoding in CCSMs is also the source of their major weakness. Symbols share units with other symbols. As more symbols are stored, more and more of the units are turned on. At some point, some symbol may be deemed present in memory because all of its units are turned on, even though it was not explicitly stored: a \"ghost\" is born. Ghosts are an unwanted phenomenon arising out of the overlap among the representations of the various symbols. The emergence of ghosts marks the limits of the system's capacity: the number of symbols it can store simultaneously and reliably. \n\n2 Definitions and Fundamental Parameters \n\nA coarse-coded symbol memory in its most general form consists of: \n\n• A set of N binary state units. \n\n• An alphabet of α symbols to be represented. Symbols in this context are atomic entities: they have no constituent structure. \n\n• A memory scheme, which is a function that maps each symbol to a subset of the units - its pattern. The receptive field of a unit is defined as the set of all symbols to whose pattern it belongs (see Figure 1). 
¹This criterion can be generalized by introducing a visibility threshold: a fraction of the pattern that should be on in order for a symbol to be considered present. Our analysis deals only with a visibility criterion of 100%, but can be generalized to accommodate noise. \n\nFigure 1: A memory scheme (N = 6, α = 8) defined in terms of units u_i and symbols s_j. The columns are the symbols' patterns. The rows are the units' receptive fields. [The figure is a 6 x 8 incidence matrix of dots, not reproduced here.] \n\nThe exact nature of the memory scheme mapping determines the properties of the memory, and is the central target of our investigation. \n\nAs symbols are stored, the memory fills up and ghosts eventually appear. It is not possible to detect a ghost simply by inspecting the contents of memory, since there is no general way of distinguishing a symbol that was stored from one that emerged out of overlaps with other symbols. (It is sometimes possible, however, to conclude that there are no ghosts.) Furthermore, a symbol that emerged as a ghost at one time may not be a ghost at a later time if it was subsequently stored into memory. Thus the definition of a ghost depends not only on the state of the memory but also on its history. \n\nSome memory schemes guarantee that no ghost will emerge as long as the number of symbols stored does not exceed some specified limit. In other schemes, the emergence of ghosts is an ever-present possibility, but its probability can be kept arbitrarily low by adjusting other parameters. We analyze systems of both types. First, two more bits of notation need to be introduced: \n\nP_ghost: Probability of a ghost. 
The probability that at least one ghost will appear after some number of symbols have been stored. \n\nk: Capacity. The maximum number of symbols that can be stored simultaneously before the probability of a ghost exceeds a specified threshold. If the threshold is 0, we say that the capacity is guaranteed. \n\nA localist representation, where every symbol is represented by a single unit and every unit is dedicated to the representation of a single symbol, can now be viewed as a special case of coarse-coded memory, where k = N = α and P_ghost = 0. Localist representations are well suited for memories that are not sparse. In these cases, coarse-coded memories are at a disadvantage. In designing coarse-coded symbol memories we are interested in cases where k << N << α. The permissible probability for a ghost in these systems should be low enough so that its impact can be ignored. \n\n3 Analysis of Four Memory Schemes \n\n3.1 Bounded Overlap (guaranteed capacity) \n\nIf we want to construct the memory scheme with the largest possible α (given N and k) while guaranteeing P_ghost = 0, the problem can be stated formally as: \n\nGiven a set of size N, find the largest collection of subsets of it such that no union of k such subsets subsumes any other subset in the collection. \n\nThis is a well-known problem in Coding Theory, in slight disguise. Unfortunately, no complete analytical solution is known. We therefore simplify our task and consider only systems in which all symbols are represented by the same number of units (i.e. all patterns are of the same size). In mathematical terms, we restrict ourselves to constant weight codes. The problem then becomes: \n\nGiven a set of size N, find the largest collection of subsets of size exactly L such that no union of k such subsets subsumes any other subset in the collection. 
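The storage rule, the 100% visibility criterion, and the ghost test defined above are simple to state operationally. The sketch below is our own illustrative code (the class and method names are ours, not the paper's); it shows how overlapping patterns give rise to a ghost:

```python
class CCSM:
    """Coarse-coded symbol memory (Section 2): each symbol owns a fixed
    pattern (a subset of the N units); a symbol is deemed present when
    every unit in its pattern is active (100% visibility criterion)."""

    def __init__(self, patterns):
        self.patterns = patterns   # dict: symbol -> set of unit indices
        self.active = set()        # units currently turned on
        self.stored = set()        # symbols explicitly stored

    def store(self, symbol):
        self.stored.add(symbol)
        self.active |= self.patterns[symbol]

    def present(self, symbol):
        return self.patterns[symbol] <= self.active

    def ghosts(self):
        # symbols deemed present although never explicitly stored
        return {s for s in self.patterns
                if s not in self.stored and self.present(s)}

# Overlapping patterns produce a ghost: storing s1 and s2 turns on
# units {0, 1, 2}, which happens to cover all of s3's pattern.
mem = CCSM({'s1': {0, 1}, 's2': {1, 2}, 's3': {0, 2}})
mem.store('s1')
mem.store('s2')
```

After the two stores, `mem.ghosts()` reports s3 as a ghost even though it was never stored, which is exactly the failure mode whose probability the following sections quantify.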
\n\nThere are no known complete analytical solutions for the size of the largest collection of patterns even when the patterns are of a fixed size. Nor is any efficient procedure for constructing such a collection known. We therefore simplify the problem further. We now restrict our consideration to patterns whose pairwise overlap is bounded by a given number. For a given pattern size L and desired capacity k, we require that no two patterns overlap in more than m units, where: \n\nm = \\lfloor (L-1)/k \\rfloor    (1) \n\nMemory schemes that obey this constraint are guaranteed a capacity of at least k symbols, since any k symbols taken together can overlap at most L - 1 units in the pattern of any other symbol - one unit short of making it a ghost. Based on this constraint, our mathematical problem now becomes: \n\nGiven a set of size N, find the largest collection of subsets of size exactly L such that the intersection of any two such subsets is of size <= m (where m is given by Equation 1). \n\nCoding theory has yet to produce a complete solution to this problem, but several methods of deriving upper bounds have been proposed (see for example [4]). The simple formula we use here is a variant of the Johnson Bound. Let α_bo denote the maximum number of symbols attainable in memory schemes that use bounded overlap. Then \n\nα_bo(N, k, L) <= \\binom{N}{m+1} / \\binom{L}{m+1}    (2) \n\nThe Johnson bound is known to be an exact solution asymptotically (that is, when N, L, m -> ∞ and their ratios remain finite). \n\nSince we are free to choose the pattern size, we optimize our memory scheme by maximizing the above expression over all possible values of L. For the parameter subspace we are interested in here (N < 1000, k < 50) we use numerical approximation to obtain: \n\nα_bo(N, k) <= max_{L ∈ [1,N]} ( N / (L - m) )^{m+1} ≈ e^{0.367 N/k}    (3) \n\n(Recall that m is a function of L and k.) 
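The guarantee behind Equation 1 can be checked by brute force. The following sketch is our own illustrative code (the function name is ours; a greedy search like this generally falls well short of the Johnson bound): it collects patterns whose pairwise overlap respects m = ⌊(L-1)/k⌋ and then verifies exhaustively that no union of k patterns subsumes another pattern:

```python
from itertools import combinations

def greedy_bounded_overlap(N, L, k):
    """Greedily collect L-subsets of {0..N-1} with pairwise overlap <= m (Eq. 1)."""
    m = (L - 1) // k
    chosen = []
    for cand in combinations(range(N), L):
        cs = set(cand)
        if all(len(cs & p) <= m for p in chosen):
            chosen.append(cs)
    return chosen

pats = greedy_bounded_overlap(N=12, L=4, k=3)   # here m = 1

# Any k stored patterns cover at most k*m = L-1 units of any other
# pattern, so no ghost can arise: verify exhaustively for k = 3.
for stored in combinations(range(len(pats)), 3):
    union = set().union(*(pats[i] for i in stored))
    for j, p in enumerate(pats):
        if j not in stored:
            assert not p <= union, "ghost found"
```

The exhaustive check passes by construction; the interesting question, addressed by Equations 2-3, is how large such a collection can be made.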
Thus the upper bound we derived depicts a simple exponential relationship between α and N/k. Next, we try to construct memory schemes of this type. A Common Lisp program using a modified depth-first search constructed memory schemes for various parameter values, whose α's came within 80% to 90% of the upper bound. These results are far from conclusive, however, since only a small portion of the parameter space was tested. \n\nIn evaluating the viability of this approach, its apparent optimality should be contrasted with two major weaknesses. First, this type of memory scheme is hard to construct computationally. It took our program several minutes of CPU time on a Symbolics 3600 to produce reasonable solutions for cases like N = 200, k = 5, m = 1, with an exponential increase in computing time for larger values of m. Second, if CCSMs are used as models of memory in naturally evolving systems (such as the brain), this approach places too great a burden on developmental mechanisms. \n\nThe importance of the bounded overlap approach lies mainly in its role as an upper bound for all possible memory schemes, subject to the simplifications made earlier. All schemes with guaranteed capacities can be measured relative to Equation 3. \n\n3.2 Random Fixed-Size Patterns (a stochastic approach) \n\nRandomly produced memory schemes are easy to implement and are attractive because of their naturalness. However, if the patterns of two symbols coincide, the guaranteed capacity will be zero (storing one of these symbols will render the other a ghost). We therefore abandon the goal of guaranteeing a certain capacity, and instead establish a tolerance level for ghosts, P_ghost. For large enough memories, where stochastic behavior is more robust, we may expect reasonable capacity even with very small P_ghost. \n\nIn the first stochastic approach we analyze, patterns are randomly selected subsets of a fixed size L. 
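Such a scheme is straightforward to simulate. The sketch below is our own illustrative code (the function name is ours): it draws α random L-element patterns, stores the first k, and estimates the probability of a ghost empirically, which is the kind of measurement the analysis that follows is compared against:

```python
import random

def estimate_p_ghost(N, L, k, alpha, trials=1000, seed=0):
    """Monte Carlo estimate of P_ghost for random fixed-size patterns:
    fraction of trials in which at least one unstored symbol's whole
    pattern ends up covered by the units the k stored symbols turn on."""
    rng = random.Random(seed)
    ghost_runs = 0
    for _ in range(trials):
        pats = [frozenset(rng.sample(range(N), L)) for _ in range(alpha)]
        active = set().union(*pats[:k])       # store the first k symbols
        if any(p <= active for p in pats[k:]):
            ghost_runs += 1
    return ghost_runs / trials
```

As expected, the estimate grows with the number of stored symbols k (all else fixed), since each store can only turn more units on.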
Unlike in the previous approach, choosing k does not bound α. We may define as many symbols as we wish, although at the cost of increased probability of a ghost (or, alternatively, decreased capacity). The probability of a ghost appearing after k symbols have been stored is given by Equation 4: \n\nP_ghost(N, L, k, α) = \\sum_{c=L}^{N} T_{N,L}(k, c) · [1 - (1 - \\binom{c}{L}/\\binom{N}{L})^{α-k}]    (4) \n\nT_{N,L}(k, c) is the probability that exactly c units will be active after k symbols have been stored. It is defined recursively by Equation 5: \n\nT_{N,L}(0, 0) = 1 \nT_{N,L}(k, c) = 0   for either k = 0 and c ≠ 0, or k > 0 and c < L \nT_{N,L}(k, c) = \\sum_{a=0}^{L} T_{N,L}(k-1, c-a) · \\binom{N-(c-a)}{a} \\binom{c-a}{L-a} / \\binom{N}{L}    (5) \n\nWe have constructed various coarse-coded memories with random fixed-size receptive fields and measured their capacities. The experimental results show good agreement with the above equation. \n\nThe optimal pattern size for fixed values of N, k, and α can be determined by binary search on Equation 4, since P_ghost(L) has exactly one extremum in the interval [1, N]. However, this may be expensive for large N. A computational shortcut can be achieved by estimating the optimal L and searching in a small interval around it. A good initial estimate is derived by replacing the summation in Equation 4 with a single term involving E[c], the expected value of the number of active units after k symbols have been stored. The latter can be expressed as: \n\nE[c] = N · [1 - (1 - L/N)^k] \n\nThe estimated L is the one that maximizes the following expression: \n\n(1 - \\binom{E[c]}{L}/\\binom{N}{L})^{α-k} \n\nAn alternative formula, developed by Joseph Tebelskis, produces very good approximations to Eq. 4 and is much more efficient to compute. 
After storing k symbols in memory, the probability P_x that a single arbitrary symbol x has become a ghost is given by: \n\nP_x(N, L, k) = \\sum_{i=0}^{L} (-1)^i \\binom{L}{i} \\binom{N-i}{L}^k / \\binom{N}{L}^k    (6) \n\nIf we now assume that each symbol's P_x is independent of that of any other symbol, we obtain: \n\nP_ghost = 1 - (1 - P_x)^{α-k}    (7) \n\nThis assumption of independence is not strictly true, but the relative error was less than 0.1% for the parameter ranges we considered, when P_ghost was no greater than 0.01. \n\nWe have constructed the two-dimensional table T_{N,L}(k, c) for a wide range of (N, L) values (70 <= N <= 1000, 7 <= L <= 43), and produced graphs of the relationships between N, k, α, and P_ghost for optimum pattern sizes, as determined by Equation 4. The results show an approximately exponential relationship between α and N/k [5]. Thus, for a fixed number of symbols, the capacity is proportional to the number of units. Let α_rfp denote the maximum number of symbols attainable in memory schemes that use random fixed-size patterns. Some typical relationships, derived from the data, are: \n\nα_rfp(P_ghost = 0.01) ≈ 0.0086 · e^{0.468 N/k} \nα_rfp(P_ghost = 0.001) ≈ 0.0008 · e^{0.473 N/k}    (8) \n\n3.3 Random Receptors (a stochastic approach) \n\nA second stochastic approach is to have each unit assigned to each symbol with an independent fixed probability s. This method lends itself to easy mathematical analysis, resulting in a closed-form analytical solution. \n\nAfter storing k symbols, the probability that a given unit is active is 1 - (1-s)^k (independent of any other unit). For a given symbol to be a ghost, every unit must either be active or else not belong to that symbol's pattern. That will happen with a probability [1 - s(1-s)^k]^N, and thus the probability of a ghost is: \n\nP_ghost(α, N, k, s) = 1 - (1 - [1 - s(1-s)^k]^N)^{α-k}    (9) \n\nAssuming P_ghost << 1 and k << α (both hold in our case), the expression can be simplified to: \n\nP_ghost(α, N, k, s) ≈ α · [1 - s · 
(1 - s)^k]^N \n\nfrom which α can be extracted: \n\nα_rr(N, k, s, P_ghost) = P_ghost · [1 - s(1-s)^k]^{-N}    (10) \n\nWe can now optimize by finding the value of s that maximizes α, given any desired upper bound on the expected value of P_ghost. This is done straightforwardly by solving ∂α/∂s = 0. Note that s · N corresponds to L in the previous approach. The solution is s = 1/(k+1), which yields, after some algebraic manipulation: \n\nα_rr = P_ghost · e^{N · log[(k+1)^{k+1} / ((k+1)^{k+1} - k^k)]}    (11) \n\nA comparison of the results using the two stochastic approaches reveals an interesting similarity. For large k, with P_ghost = 0.01 the term 0.468/k of Equation 8 can be seen as a numerical approximation to the log term in Equation 11, and the multiplicative factor of 0.0086 in Equation 8 approximates P_ghost in Equation 11. This is hardly surprising, since the Law of Large Numbers implies that in the limit (N, k -> ∞, with s fixed) the two methods are equivalent. \n\nFinally, it should be noted that the stochastic approaches we analyzed generate a family of memory schemes, with non-identical ghost probabilities. P_ghost in our formulas is therefore better understood as an expected value, averaged over the entire family. \n\n3.4 Partitioned Binary Coding (a reference point) \n\nThe last memory scheme we analyze is not strictly distributed. Rather, it is somewhere in between a distributed and a localist representation, and is presented for comparison with the previous results. For a given number of units N and desired capacity k, the units are partitioned into k equal-size \"slots,\" each consisting of N/k units (for simplicity we assume that k divides N). Each slot is capable of storing exactly one symbol. \n\nThe most efficient representation for all possible symbols that may be stored into a slot is to assign them binary codes, using the N/k units of each slot as bits. This would allow 2^{N/k} symbols to be represented. 
Using binary coding, however, will not give us the required capacity of 1 symbol, since binary patterns subsume one another. For example, storing the code '10110' into one of the slots will cause the codes '10010', '10100' and '00010' (as well as several other codes) to become ghosts. \n\nA possible solution is to use only half of the bits in each slot for a binary code, and set the other half to the binary complement of that code (we assume that N/k is even). This way, the codes are guaranteed not to subsume one another. Let α_pbc denote the number of symbols representable using a partitioned binary coding scheme. Then, \n\nα_pbc = 2^{N/2k} = e^{0.347 N/k}    (12) \n\nOnce again, α is exponential in N/k. The form of the result closely resembles the estimated upper bound on the Bounded Overlap method given in Equation 3. There is also a strong resemblance to Equations 8 and 11, except that the fractional multiplier in front of the exponential, corresponding to P_ghost, is missing. P_ghost is 0 for the Partitioned Binary Coding method, but this is enforced by dividing the memory into disjoint sets of units rather than adjusting the patterns to reduce overlap among symbols. \n\nAs mentioned previously, this memory scheme is not really distributed in the sense used in this paper, since there is no one pattern associated with a symbol. Instead, a symbol is represented by any one of a set of k patterns, each N/k bits long, corresponding to its appearance in one of the k slots. To check whether a symbol is present, all k slots must be examined. To store a new symbol in memory, one must scan the k slots until an empty one is found. Equation 12 should therefore be used only as a point of reference. \n\n4 Measurement of DCPS \n\nThe three distributed schemes we have studied all use unstructured patterns, the only constraint being that patterns are at least roughly the same size. 
Imposing more complex structure on any of these schemes is likely to reduce the capacity somewhat. In order to quantify this effect, we measured the memory capacity of DCPS (BoltzCONS uses the same memory scheme) and compared the results with the theoretical models analyzed above. \n\nDCPS' memory scheme is a modified version of the Random Receptors method [5]. The symbol space is the set of all triples over a 25-letter alphabet. Units have fixed-size receptive fields organized as 6 x 6 x 6 subspaces. Patterns are manipulated to minimize the variance in pattern size across symbols. The parameters for DCPS are: N = 2000, α = 25³ = 15625, and the mean pattern size is (6/25)³ x 2000 = 27.65 with a standard deviation of 1.5. When P_ghost = 0.01 the measured capacity was k = 48 symbols. By substituting for N in Equation 11 we find that the highest k value for which α_rr >= 15625 is 51. There does not appear to be a significant cost for maintaining structure in the receptive fields. \n\n5 Summary and Discussion \n\nTable 1 summarizes the results obtained for the four methods analyzed: \n\nTable 1: Summary of results for various memory schemes. \nBounded Overlap:            α_bo(N, k) < e^{0.367 N/k} \nRandom Fixed-size Patterns: α_rfp(P_ghost = 0.01) ≈ 0.0086 · e^{0.468 N/k} \n                            α_rfp(P_ghost = 0.001) ≈ 0.0008 · e^{0.473 N/k} \nRandom Receptors:           α_rr = P_ghost · e^{N · log[(k+1)^{k+1}/((k+1)^{k+1} - k^k)]} \nPartitioned Binary Coding:  α_pbc = e^{0.347 N/k} \n\nSome differences must be emphasized: \n\n• α_bo and α_pbc deal with guaranteed capacity, whereas α_rfp and α_rr are meaningful only for P_ghost > 0. \n\n• α_bo is only an upper bound. \n\n• α_rfp is based on numerical estimates. \n\n• α_pbc is based on a scheme which is not strictly coarse-coded. \n\nThe similar functional form of all the results, although not surprising, is aesthetically pleasing. 
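The closed form of Equation 11 makes the DCPS comparison in Section 4 easy to reproduce numerically. The sketch below is our own code (the function name is ours): it evaluates the Random Receptors capacity at the optimal s = 1/(k+1) and searches for the largest k at which the scheme still accommodates the 25³ symbols of DCPS at P_ghost = 0.01:

```python
def alpha_rr(N, k, p_ghost):
    """Eq. 11: symbol capacity of the Random Receptors scheme at the
    optimal assignment probability s = 1/(k+1)."""
    r = 1 - k**k / (k + 1)**(k + 1)   # 1 - s(1-s)^k evaluated at s = 1/(k+1)
    return p_ghost * r**(-N)

# DCPS parameters (Section 4): N = 2000 units, 25^3 = 15625 symbols.
N, target = 2000, 25**3
k = 1
while alpha_rr(N, k + 1, p_ghost=0.01) >= target:
    k += 1
# k is now the largest capacity at which alpha_rr still covers the
# DCPS symbol space; Section 4 reports this value as 51.
```

Since alpha_rr is monotonically decreasing in k, the simple linear scan suffices; the resulting k = 51 matches the figure quoted in Section 4 against the measured capacity of 48.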
Some of the functional dependencies among the various parameters can be derived informally using qualitative arguments. Only a rigorous analysis, however, can provide the definite answers that are needed for a better understanding of these systems and their scaling properties. \n\nAcknowledgments \n\nWe thank Geoffrey Hinton, Noga Alon and Victor Wei for helpful comments, and Joseph Tebelskis for sharing with us his formula for approximating P_ghost in the case of fixed pattern sizes. \n\nThis work was supported by National Science Foundation grants IST-8516330 and EET-8716324, and by the Office of Naval Research under contract number N00014-86-K-0678. The first author was supported by a National Science Foundation graduate fellowship. \n\nReferences \n\n[1] Ballard, D. H. (1986) Cortical connections and parallel processing: structure and function. Behavioral and Brain Sciences 9(1). \n\n[2] Feldman, J. A., and Ballard, D. H. (1982) Connectionist models and their properties. Cognitive Science 6, pp. 205-254. \n\n[3] Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. (1986) Distributed representations. In D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. Cambridge, MA: MIT Press. \n\n[4] MacWilliams, F. J., and Sloane, N. J. A. (1978) The Theory of Error-Correcting Codes. North-Holland. \n\n[5] Rosenfeld, R., and Touretzky, D. S. (1987) Four capacity models for coarse-coded symbol memories. Technical report CMU-CS-87-182, Carnegie Mellon University Computer Science Department, Pittsburgh, PA. \n\n[6] St. John, M. F., and McClelland, J. L. (1986) Reconstructive memory for sentences: a PDP approach. Proceedings of the Ohio University Inference Conference. \n\n[7] Sullins, J. (1985) Value cell encoding strategies. 
Technical report TR-165, Computer Science Department, University of Rochester, Rochester, NY. \n\n[8] Touretzky, D. S., and Hinton, G. E. (1985) Symbols among the neurons: details of a connectionist inference architecture. Proceedings of IJCAI-85, Los Angeles, CA, pp. 238-243. \n\n[9] Touretzky, D. S., and Hinton, G. E. (1986) A distributed connectionist production system. Technical report CMU-CS-86-172, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. \n\n[10] Touretzky, D. S. (1986) BoltzCONS: reconciling connectionism with the recursive nature of stacks and trees. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, pp. 522-530.", "award": [], "sourceid": 91, "authors": [{"given_name": "Ronald", "family_name": "Rosenfeld", "institution": null}, {"given_name": "David", "family_name": "Touretzky", "institution": null}]}