{"title": "Noise-Enhanced Associative Memories", "book": "Advances in Neural Information Processing Systems", "page_first": 1682, "page_last": 1690, "abstract": "Recent advances in associative memory design through structured pattern sets and graph-based inference algorithms have allowed reliable learning and recall of an exponential number of patterns. Although these designs correct external errors in recall, they assume neurons that compute noiselessly, in contrast to the highly variable neurons in hippocampus and olfactory cortex. Here we consider associative memories with noisy internal computations and analytically characterize performance. As long as the internal noise level is below a specified threshold, the error probability in the recall phase can be made exceedingly small. More surprisingly, we show that internal noise actually improves the performance of the recall phase. Computational experiments lend additional support to our theoretical analysis. This work suggests a functional benefit to noisy neurons in biological neuronal networks.", "full_text": "Noise-Enhanced Associative Memories\n\nAmin Karbasi\n\nAmir Hesam Salavati\n\nSwiss Federal Institute of Technology Zurich\n\nEcole Polytechnique Federale de Lausanne\n\namin.karbasi@inf.ethz.ch\n\nhesam.salavati@epfl.ch\n\nAmin Shokrollahi\n\nEcole Polytechnique Federale de Lausanne\n\namin.shokrollahi@epfl.ch\n\nLav R. Varshney\n\nIBM Thomas J. Watson Research Center\n\nvarshney@alum.mit.edu\n\nAbstract\n\nRecent advances in associative memory design through structured pattern sets and\ngraph-based inference algorithms allow reliable learning and recall of exponential\nnumbers of patterns. Though these designs correct external errors in recall, they\nassume neurons compute noiselessly, in contrast to highly variable neurons in\nhippocampus and olfactory cortex. Here we consider associative memories with\nnoisy internal computations and analytically characterize performance. 
As long\nas internal noise is less than a specified threshold, error probability in the recall\nphase can be made exceedingly small. More surprisingly, we show internal noise\nactually improves performance of the recall phase. Computational experiments\nlend additional support to our theoretical analysis. This work suggests a functional\nbenefit to noisy neurons in biological neuronal networks.\n\n1 Introduction\n\nHippocampus, olfactory cortex, and other brain regions are thought to operate as associative memories\n[1,2], having the ability to learn patterns from presented inputs, store a large number of patterns,\nand retrieve them reliably in the face of noisy or corrupted queries [3\u20135]. Associative memory models\nare designed to have these properties.\nAlthough such information storage and recall seemingly fall into the information-theoretic framework,\nwhere an exponential number of messages can be communicated reliably with a linear number\nof symbols, classical associative memory models could only store a linear number of patterns [4]. A\nprimary reason is that classical models require memorizing a randomly chosen set of patterns. By enforcing\nstructure and redundancy in the possible set of memorizable patterns (like natural stimuli [6],\ninternal neural representations [7], and error-control codewords), advances in associative memory\ndesign allow storage of an exponential number of patterns [8,9], just like in communication systems.\nInformation-theoretic and associative memory models of storage have been used to predict experimentally\nmeasurable properties of synapses in the mammalian brain [10,11]. But contrary to the fact\nthat noise is present in computational operations of the brain [12, 13], associative memory models\nwith exponential capacities have assumed no internal noise in the computational nodes. 
The purpose\nhere is to model internal noise and study whether such associative memories still operate reliably.\nSurprisingly, we find internal noise actually enhances recall performance, suggesting a functional\nrole for variability in the brain.\nIn particular we consider a multi-level, graph code-based, associative memory model [9] and find\nthat even if all components are noisy, the final error probability in recall can be made exceedingly\nsmall. We characterize a threshold phenomenon and show how to optimize algorithm parameters\nwhen knowing statistical properties of internal noise. Rather counterintuitively, the performance\nof the memory model improves in the presence of internal neural noise, as observed previously as\nstochastic resonance [13, 14]. There are mathematical connections to perturbed simplex algorithms\nfor linear programming [15], where internal noise pushes the algorithm out of local minima.\nThe benefit of internal noise has been noted previously in associative memory models with stochastic\nupdate rules, cf. [16]. However, our framework differs from previous approaches in three key aspects.\nFirst, our memory model is different, which makes extension of previous analysis nontrivial.\nSecond, and perhaps most importantly, pattern retrieval capacity in previous approaches decreases\nwith internal noise, cf. [16, Fig. 6.1], in that increasing internal noise helps correct more external\nerrors, but also reduces the number of memorizable patterns. In our framework, internal noise does\nnot affect pattern retrieval capacity (up to a threshold) but improves recall performance. 
Finally, our\nnoise model has bounded rather than Gaussian noise, and so a suitable network may achieve perfect\nrecall despite internal noise.\nReliably storing information in memory systems constructed completely from unreliable components\nis a classical problem in fault-tolerant computing [17\u201319], where models have used random\naccess architectures with sequential correcting networks. Although direct comparison is difficult\nsince notions of circuit complexity are different, our work also demonstrates that associative memory\narchitectures constructed from unreliable components can store information reliably.\nBuilding on the idea of structured pattern sets [20], our associative memory model [9] relies on the\nfact that all patterns to be learned lie in a low-dimensional subspace. Learning features of a low-dimensional\nspace is very similar to autoencoders [21] and has structural similarities to Deep Belief\nNetworks (DBNs), particularly Convolutional Neural Networks [22].\n\n2 Associative Memory Model\n\nNotation and basic structure: In our model, a neuron can assume an integer-valued state from\nthe set S = {0, . . . , S \u2212 1}, interpreted as the short term firing rate of neurons. A neuron updates\nits state based on the states of its neighbors {s_i}_{i=1}^{n} as follows. It first computes a weighted sum\nh = \u2211_{i=1}^{n} w_i s_i + \u03b6, where w_i is the weight of the link from s_i and \u03b6 is the internal noise, and then\napplies nonlinear function f : R \u2192 S to h.\nAn associative memory is represented by a weighted bipartite graph, G, with pattern neurons and\nconstraint neurons. Each pattern x = (x_1, . . . , x_n) is a vector of length n, where x_i \u2208 S, i =\n1, . . . , n. Following [9], the focus is on recalling patterns with strong local correlation among\nentries. Hence, we divide entries of each pattern x into L overlapping subpatterns of lengths\nn_1, . . . , n_L. 
Due to overlaps, a pattern neuron can be a member of multiple subpatterns, as in\nFig. 1a. The ith subpattern is denoted x^{(i)} = (x_1^{(i)}, . . . , x_{n_i}^{(i)}), and local correlations are assumed to\nbe in the form of subspaces, i.e. the subpatterns x^{(i)} form a subspace of dimension k_i < n_i.\nWe capture the local correlations by learning a set of linear constraints over each subspace corresponding\nto the dual vectors orthogonal to that subspace. More specifically, let {w_1^{(i)}, . . . , w_{m_i}^{(i)}} be\na set of dual vectors orthogonal to all subpatterns x^{(i)} of cluster i. Then:\n\ny_j^{(i)} = (w_j^{(i)})^T \u00b7 x^{(i)} = 0, for all j \u2208 {1, . . . , m_i} and for all i \u2208 {1, . . . , L}. (1)\n\nEq. (1) can be rewritten as W^{(i)} \u00b7 x^{(i)} = 0 where W^{(i)} = [w_1^{(i)} | w_2^{(i)} | . . . | w_{m_i}^{(i)}]^T is the matrix of dual\nvectors. Now we use a bipartite graph with connectivity matrix determined by W^{(i)} to represent the\nsubspace constraints learned from subpattern x^{(i)}; this graph is called cluster i. We developed an\nefficient way of learning W^{(i)} in [9], also used here. Briefly, in each iteration of learning:\n\n1. Pick a pattern x at random from the dataset;\n2. Adjust weight vectors w_j^{(i)} for j \u2208 {1, . . . , m_i} and i \u2208 {1, . . . , L} such that the projection\nof x onto w_j^{(i)} is reduced. Apply a sparsity penalty to favor sparse solutions.\n\nThis process repeats until all weights are orthogonal to the patterns in the dataset or the maximum\niteration limit is reached. 
Throughout this paper, the learning rule allows us to assume that the weight matrices W^{(i)} are known\nand satisfy W^{(i)} \u00b7 x^{(i)} = 0 for all patterns x in the dataset X.\n\nFigure 1: The proposed neural associative memory with overlapping clusters. (a) Bipartite graph G. (b) Contraction graph G\u0303.\n\nFor the forthcoming asymptotic analysis, we need to define a contracted graph G\u0303 whose connectivity\nmatrix is denoted W\u0303 and has size L \u00d7 n. This is a bipartite graph in which constraints in each cluster\nare represented by a single neuron. Thus, if pattern neuron x_j is connected to cluster i, W\u0303_{ij} = 1;\notherwise W\u0303_{ij} = 0. We also define the degree distribution from an edge perspective over G\u0303, using\n\u03bb\u0303(z) = \u2211_j \u03bb\u0303_j z^j and \u03c1\u0303(z) = \u2211_j \u03c1\u0303_j z^{j\u22121}, where \u03bb\u0303_j (resp., \u03c1\u0303_j) equals the fraction of edges that\nconnect to pattern (resp., cluster) nodes of degree j.\nNoise model: There are two types of noise in our model: external errors and internal noise. As\nmentioned earlier, a neural network should be able to retrieve memorized pattern x\u0302 from its corrupted\nversion x due to external errors. We assume the external error is an additive vector of size n, denoted\nby z satisfying x = x\u0302 + z, whose entries assume values independently from {\u22121, 0, +1} (see footnote 1) with\ncorresponding probabilities p_{\u22121} = p_{+1} = \u03b5/2 and p_0 = 1 \u2212 \u03b5. The realization of the external error\non subpattern x^{(i)} is denoted z^{(i)}. Note that the subspace assumption implies W \u00b7 x = W \u00b7 z and\nW^{(i)} \u00b7 x^{(i)} = W^{(i)} \u00b7 z^{(i)} for all i. Neurons also suffer from internal noise. We consider a bounded\nnoise model, i.e. 
a random number uniformly distributed in the intervals [\u2212\u03c5, \u03c5] and [\u2212\u03bd, \u03bd] for the\npattern and constraint neurons, respectively (\u03c5, \u03bd < 1).\nThe goal of recall is to filter the external error z to obtain the desired pattern x\u0302 as the correct states\nof the pattern neurons. When neurons compute noiselessly, this task may be achieved by exploiting\nthe fact that the patterns x \u2208 X satisfy the set of constraints W^{(i)} \u00b7 x^{(i)} = 0. However, it is not\nclear how to accomplish this objective when the neural computations are noisy. Rather surprisingly,\nwe show that eliminating external errors is not only possible in the presence of internal noise, but\nthat neural networks with moderate internal noise demonstrate better external noise resilience.\nRecall algorithms: To efficiently deal with external errors, we use a combination of Alg. 1 and\nAlg. 2. The role of Alg. 1 is to correct at least a single external error in each cluster. Without\noverlaps between clusters, the error resilience of the network is limited. Alg. 2 exploits the overlaps:\nit helps clusters with external errors recover their correct states by using the reliable information\nfrom clusters that do not have external errors. The error resilience of the resulting combination\nthereby drastically improves. Now we describe the details of Alg. 1 and Alg. 2 more precisely.\nAlg. 1 performs a series of forward and backward iterations in each cluster G^{(\u2113)} to remove (at\nleast) one external error from its input domain. At each iteration, the pattern neurons locally decide\nwhether to update their current state: if the amount of feedback received by a pattern neuron exceeds\na threshold, the neuron updates its state, and otherwise remains as is. With abuse of notation, let\nus denote messages transmitted by pattern node i and constraint node j at round t by x_i(t) and\ny_j(t), respectively. 
In round 0, pattern nodes are initialized by a pattern x\u0302, sampled from dataset X,\nperturbed by external errors z, i.e., x(0) = x\u0302 + z. Thus, for cluster \u2113 we have x^{(\u2113)}(0) = x\u0302^{(\u2113)} + z^{(\u2113)},\nwhere z^{(\u2113)} is the realization of errors on subpattern x^{(\u2113)}.\nIn round t, the pattern and constraint neurons update their states using feedback from neighbors.\nHowever, since neural computations are faulty, decisions made by neurons may not be reliable. To\nminimize effects of internal noise, we use the following update rule for pattern node i in cluster \u2113:\n\nx_i^{(\u2113)}(t + 1) = x_i^{(\u2113)}(t) \u2212 sign(g_i^{(\u2113)}(t)), if |g_i^{(\u2113)}(t)| \u2265 \u03d5,\nx_i^{(\u2113)}(t + 1) = x_i^{(\u2113)}(t), otherwise, (2)\n\nwhere \u03d5 is the update threshold and g_i^{(\u2113)}(t) = ((sign(W^{(\u2113)}))^T \u00b7 y^{(\u2113)}(t))_i / d_i^{(\u2113)} + u_i (see footnote 2). Here, d_i^{(\u2113)} is\nthe degree of pattern node i in cluster \u2113, y^{(\u2113)}(t) = [y_1^{(\u2113)}(t), . . . , y_{m_\u2113}^{(\u2113)}(t)] is the vector of messages\ntransmitted by the constraint neurons in cluster \u2113, and u_i is the random noise affecting pattern node\ni. Basically, the term g_i^{(\u2113)}(t) reflects the (average) belief of constraint nodes connected to pattern\nneuron i about its correct value. If the magnitude of g_i^{(\u2113)}(t) reaches the specified threshold \u03d5, it means most of\nthe connected constraints suggest the current state x_i^{(\u2113)}(t) is not correct, hence, a change should be\nmade. Note this average belief is diluted by the internal noise of neuron i. As mentioned earlier, u_i\nis uniformly distributed in the interval [\u2212\u03c5, \u03c5], for some \u03c5 < 1. On the constraint side, the update\nrule is:\n\ny_i^{(\u2113)}(t) = f(h_i^{(\u2113)}(t), \u03c8) = +1, if h_i^{(\u2113)}(t) \u2265 \u03c8,\ny_i^{(\u2113)}(t) = 0, if \u2212\u03c8 \u2264 h_i^{(\u2113)}(t) \u2264 \u03c8,\ny_i^{(\u2113)}(t) = \u22121, otherwise, (3)\n\nwhere \u03c8 is the update threshold and h_i^{(\u2113)}(t) = (W^{(\u2113)} \u00b7 x^{(\u2113)}(t))_i + v_i. Here, x^{(\u2113)}(t) =\n[x_1^{(\u2113)}(t), . . . , x_{n_\u2113}^{(\u2113)}(t)] is the vector of messages transmitted by the pattern neurons and v_i is the random\nnoise affecting node i. As before, we consider a bounded noise model for v_i, i.e., it is uniformly\ndistributed in the interval [\u2212\u03bd, \u03bd] for some \u03bd < 1 (see footnote 3).\n\nAlgorithm 1 Intra-Module Error Correction\nInput: Training set X, thresholds \u03d5, \u03c8, iteration t_max\nOutput: x_1^{(\u2113)}, x_2^{(\u2113)}, . . . , x_{n_\u2113}^{(\u2113)}\n1: for t = 1 \u2192 t_max do\n2:   Forward iteration: Calculate the input h_i^{(\u2113)} = \u2211_{j=1}^{n_\u2113} W_{ij}^{(\u2113)} x_j^{(\u2113)} + v_i, for each neuron y_i^{(\u2113)}, and set y_i^{(\u2113)} = f(h_i^{(\u2113)}, \u03c8).\n3:   Backward iteration: Each neuron x_j^{(\u2113)} computes g_j^{(\u2113)} = (\u2211_{i=1}^{m_\u2113} sign(W_{ij}^{(\u2113)}) y_i^{(\u2113)}) / (\u2211_{i=1}^{m_\u2113} sign(|W_{ij}^{(\u2113)}|)) + u_j.\n4:   Update state of each pattern neuron j according to x_j^{(\u2113)} = x_j^{(\u2113)} \u2212 sign(g_j^{(\u2113)}) only if |g_j^{(\u2113)}| > \u03d5.\n5: end for\n\nAlgorithm 2 Sequential Peeling Algorithm\nInput: G\u0303, G^{(1)}, G^{(2)}, . . . , G^{(L)}.\nOutput: x_1, x_2, . . . , x_n\n1: while there is an unsatisfied v^{(\u2113)} do\n2:   for \u2113 = 1 \u2192 L do\n3:     If v^{(\u2113)} is unsatisfied, apply Alg. 1 to cluster G^{(\u2113)}.\n4:     If v^{(\u2113)} remained unsatisfied, revert state of pattern neurons connected to v^{(\u2113)} to their initial state. Otherwise, keep their current states.\n5:   end for\n6: end while\n7: Declare x_1, x_2, . . . , x_n if all v^{(\u2113)}\u2019s are satisfied. Otherwise, declare failure.\n\nThe error correction ability of Alg. 1 is fairly limited, as determined analytically and through simulations\n[23]. In essence, Alg. 1 can correct one external error with high probability, but degrades\nterribly against two or more external errors. Working independently, clusters cannot correct more\nthan a few external errors, but their combined performance is much better. As clusters overlap, they\nhelp each other in resolving external errors: a cluster whose pattern neurons are in their correct states\ncan always provide truthful information to neighboring clusters. This property is exploited in Alg. 2\nby applying Alg. 1 in a round-robin fashion to each cluster. Clusters either eliminate their external\nerrors, in which case they keep their new states and can now help other clusters, or revert back to their\noriginal states. Note that by such a scheduling scheme, neurons can only change their states towards\ncorrect values. This scheduling technique is similar in spirit to the peeling algorithm [24].\n\nFootnote 1: Note that the proposed algorithms also work with larger noise values, i.e. from a set {\u2212a, . . . , a} for some\na \u2208 N, see [23]; the \u00b11 noise model is presented here for simplicity.\nFootnote 2: Note that x_i^{(\u2113)}(t + 1) is further mapped to the interval [0, S \u2212 1] by saturating the values below 0 and above\nS \u2212 1 to 0 and S \u2212 1 respectively. The corresponding equations are omitted for brevity.\nFootnote 3: Note that although the values of y_i^{(\u2113)}(t) can be shifted to 0, 1, 2, instead of \u22121, 0, 1 to match our assumption\nthat neural states are non-negative, we leave them as such to simplify later analysis.\n\n3 Recall Performance Analysis\n\nNow let us analyze recall error performance. The following lemma shows that if \u03d5 and \u03c8 are chosen\nproperly, then in the absence of external errors the constraints remain satisfied and internal noise\ncannot result in violations. This is a crucial property for Alg. 2, as it allows one to determine whether\na cluster has successfully eliminated external errors (Step 4 of Alg. 2) by merely checking the\nsatisfaction of all constraint nodes.\nLemma 1. In the absence of external errors, the probability that a constraint neuron (resp. pattern\nneuron) in cluster \u2113 makes a wrong decision due to its internal noise is given by \u03c0_0^{(\u2113)} =\nmax(0, (\u03bd \u2212 \u03c8)/\u03bd) (resp. P_0^{(\u2113)} = max(0, (\u03c5 \u2212 \u03d5)/\u03c5)).\n\nProof is given in [23]. In the sequel, we assume \u03d5 > \u03c5 and \u03c8 > \u03bd so that \u03c0_0^{(\u2113)} = 0 and P_0^{(\u2113)} = 0.\nHowever, an external error combined with internal noise may still push neurons to an incorrect state.\nGiven the above lemma and our neural architecture, we can prove the following surprising result: in\nthe asymptotic regime of increasing number of iterations of Alg. 2, a neural network with internal\nnoise outperforms one without. Let us define the fraction of errors corrected by the noiseless and\nnoisy neural network (parametrized by \u03c5 and \u03bd) after T iterations of Alg. 2 by \u039b(T) and \u039b_{\u03c5,\u03bd}(T),\nrespectively. Note that both \u039b(T) \u2264 1 and \u039b_{\u03c5,\u03bd}(T) \u2264 1 are non-decreasing sequences of T. Hence,\ntheir limiting values are well defined: lim_{T\u2192\u221e} \u039b(T) = \u039b^* and lim_{T\u2192\u221e} \u039b_{\u03c5,\u03bd}(T) = \u039b^*_{\u03c5,\u03bd}.\n\nTheorem 2. Let us choose \u03d5 and \u03c8 so that \u03c0_0^{(\u2113)} = 0 and P_0^{(\u2113)} = 0 for all \u2113 \u2208 {1, . . . , L}. For the\nsame realization of external errors, we have \u039b^*_{\u03c5,\u03bd} \u2265 \u039b^*.\n\nProof is given in [23]. 
The high-level idea why a noisy network outperforms a noiseless one comes\nfrom understanding stopping sets. These are realizations of external errors where the iterative Alg. 2\ncannot correct all of them. We show that the stopping set shrinks as we add internal noise. In other\nwords, we show that in the limit of T \u2192 \u221e the noisy network can correct any error pattern that can\nbe corrected by the noiseless version and it can also get out of stopping sets that cause the noiseless\nnetwork to fail. Thus, the supposedly harmful internal noise will help Alg. 2 to avoid stopping sets.\nThm. 2 suggests the only possible downside with using a noisy network is its possible running time\nin eliminating external errors: the noisy neural network may need more iterations to achieve the\nsame error correction performance. Interestingly, our empirical experiments show that in certain\nscenarios, even the running time improves when using a noisy network.\nThm. 2 indicates that noisy neural networks (under our model) outperform noiseless ones, but does\nnot specify the level of errors that such networks can correct. Now we derive a theoretical upper\nbound on error correction performance. To this end, let P_{c_i} be the average probability that a cluster\ncan correct i external errors in its domain. The following theorem gives a simple condition under\nwhich Alg. 2 can correct a linear fraction of external errors (in terms of n) with high probability.\nThe condition involves \u03bb\u0303 and \u03c1\u0303, the degree distributions of the contracted graph G\u0303.\n\nTheorem 3. Under the assumptions that graph G\u0303 grows large and it is chosen randomly with degree\ndistributions given by \u03bb\u0303 and \u03c1\u0303, Alg. 2 is successful if\n\n\u03b5 \u03bb\u0303(1 \u2212 \u2211_{i\u22651} P_{c_i} (z^{i\u22121} / i!) \u00b7 d^{i\u22121}\u03c1\u0303(1 \u2212 z) / dz^{i\u22121}) < z, for z \u2208 [0, \u03b5]. (4)\n\nProof is given in [23] and is based on the density evolution technique [25]. Thm. 3 states that for any\nfraction of errors \u039b_{\u03c5,\u03bd} \u2264 \u039b^*_{\u03c5,\u03bd} that satisfies the above recursive formula, Alg. 2 will be successful\nwith probability close to one. Note that the first fixed point of the above recursive equation dictates\nthe maximum fraction of errors \u039b^*_{\u03c5,\u03bd} that our model can correct. For the special case of P_{c_1} = 1 and\nP_{c_i} = 0, \u2200i > 1, we obtain \u03b5 \u03bb\u0303(1 \u2212 \u03c1\u0303(1 \u2212 z)) < z, the same condition given in [9]. Thm. 3 takes into\naccount the contribution of all P_{c_i} terms and as we will see, their values change as we incorporate\nthe effect of internal noise \u03c5 and \u03bd. Our results show that the maximum value of P_{c_i} does not\noccur when the internal noise is equal to zero, i.e. \u03c5 = \u03bd = 0, but instead when the neurons are\ncontaminated with internal noise! As an example, Fig. 2 illustrates how P_{c_i} behaves as a function\nof \u03c5 in the network considered (note that maximum values are not at \u03c5 = 0). 
This finding suggests\nthat even individual clusters are able to correct more errors in the presence of internal noise.\n\nFigure 2: The value of P_{c_i} as a function of pattern neuron noise \u03c5 for i = 1, . . . , 4. Noise at\nconstraint neurons is assumed to be zero (\u03bd = 0). [Plot: probability of correcting external errors versus \u03c5, with curves P_{c_1} to P_{c_4}.]\n\nFigure 3: The final SER for a network with n = 400, L = 50 cf. [9]. The blue curves correspond\nto the noiseless neural network. [Plot: final SER versus \u03b5, with simulated (Sim) and theoretical (Thr) curves for \u03c5 = 0, 0.2, 0.4, 0.6 and \u03bd = 0.]\n\n3.1 Simulations\n\nNow we consider simulation results for a finite system. To learn the subspace constraints (1) for each\ncluster G^{(\u2113)} we use the learning algorithm in [9]. Henceforth, we assume that the weight matrix W\nis known and given. In our setup, we consider a network of size n = 400 with L = 50 clusters. We\nhave 40 pattern nodes and 20 constraint nodes in each cluster, on average. External error is modeled\nby randomly generated vectors z with entries \u00b11 with probability \u03b5 and 0 otherwise. Vector z is\nadded to the correct patterns, which satisfy (1). For recall, Alg. 2 is used and results are reported\nin terms of Symbol Error Rate (SER) as the level of external error (\u03b5) or internal noise (\u03c5, \u03bd) is\nchanged; this involves counting positions where the output of Alg. 2 differs from the correct pattern.\n\n3.1.1 Symbol Error Rate as a function of Internal Noise\n\nFig. 3 illustrates the final SER of our algorithm for different values of \u03c5 and \u03bd. 
Recall that \u03c5 and\n\u03bd quantify the level of noise in pattern and constraint neurons, respectively. Dashed lines in Fig. 3\nare simulation results whereas solid lines are theoretical upper bounds provided in this paper. As\nevident, there is a threshold phenomenon such that SER is negligible for \u03b5 \u2264 \u03b5^* and grows beyond\nthis threshold. As expected, simulation results are better than the theoretical bounds. In particular,\nthe gap is relatively large as \u03c5 moves towards one.\nA more interesting trend in Fig. 3 is the fact that internal noise helps in achieving better performance,\nas predicted by theoretical analysis (Thm. 2). Notice how \u03b5^* moves towards one as \u03c5 increases.\nThis phenomenon is examined more closely in Figs. 4a and 4b where \u03b5 is fixed to 0.125 while \u03c5\nand \u03bd vary. As we see, a moderate amount of internal noise at both pattern and constraint neurons\nimproves performance. There is an optimum point (\u03c5^*, \u03bd^*) for which the SER reaches its minimum.\nFig. 4b indicates for instance that \u03bd^* \u2248 0.25, beyond which SER deteriorates.\n\n3.2 Recall Time as a function of Internal Noise\n\nFig. 5 illustrates the number of iterations performed by Alg. 2 for correcting the external errors when\n\u03b5 is fixed to 0.075. We stop whenever the algorithm corrects all external errors or declare a recall\nerror if all errors were not corrected in 40 iterations. Thus, the corresponding areas in the figure\nwhere the number of iterations reaches 40 indicate decoding failure. Figs. 6a and 6b are projected\nversions of Fig. 5 and show the average number of iterations as a function of \u03c5 and \u03bd, respectively.\nThe amount of internal noise drastically affects the speed of Alg. 2. First, from Fig. 
5 and 6b observe\nthat running time is more sensitive to noise at constraint neurons than pattern neurons and that the\nalgorithm becomes slower as noise at constraint neurons is increased. In contrast, note that internal\nnoise at the pattern neurons may improve the running time, as seen in Fig. 6a.\n\nFigure 4: The final SER vs. internal noise parameters at pattern and constraint neurons for \u03b5 = 0.125. (a) Final SER as a function of \u03c5 for \u03b5 = 0.125, with curves for \u03bd = 0, 0.1, 0.3, 0.5. (b) The effect of \u03bd on the final SER for \u03b5 = 0.125, with curves for \u03c5 = 0, 0.1, 0.2, 0.5.\n\nFigure 5: The effect of internal noise on the number of iterations of Alg. 2 when \u03b5 = 0.075. [Plot: average number of iterations versus \u03c5 and \u03bd.]\n\nNote that the results presented here are for the case where the noiseless decoder succeeds as well and\nits average number of iterations is pretty close to the optimal value (see Fig. 5). In [23], we provide\nadditional results corresponding to \u03b5 = 0.125, where the noiseless decoder encounters stopping sets\nwhile the noisy decoder is still capable of correcting external errors; there we see that the optimal\nrunning time occurs when the neurons have a fair amount of internal noise.\nIn [23] we also provide results of a study for a slightly modified scenario where there is only internal\nnoise and no external errors. Furthermore, \u03d5 < \u03c5. Thus, the internal noise can now cause neurons to\nmake wrong decisions, even in the absence of external errors. 
There, we witness the more familiar\nphenomenon where increasing the amount of internal noise results in a worse performance. This\nfinding emphasizes the importance of choosing update thresholds \u03d5 and \u03c8 according to Lem. 1.\n\n4 Pattern Retrieval Capacity\n\nFor completeness, we review pattern retrieval capacity results from [9] to show that the proposed\nmodel is capable of memorizing an exponentially large number of patterns. First, note that since the\npatterns form a subspace, the number of patterns C does not have any effect on the learning or recall\nalgorithms (except for its obvious influence on the learning time). Thus, in order to show that the\npattern retrieval capacity is exponential in n, all we need to demonstrate is that there exists a training\nset X with C patterns of length n for which C \u221d a^{rn}, for some a > 1 and 0 < r.\nTheorem 4 ([9]). Let X be a C \u00d7 n matrix, formed by C vectors of length n with entries from the\nset S. Furthermore, let k = rn for some 0 < r < 1. Then, there exists a set of vectors for which\nC = a^{rn}, with a > 1, and rank(X) = k < n.\n\nFigure 6: The effect of internal noise on the number of iterations performed by Alg. 2, for different\nvalues of \u03c5 and \u03bd with \u03b5 = 0.075. An average iteration number of 40 indicates the failure of Alg. 2. (a) Effect of internal noise at pattern neurons side, with curves for \u03bd = 0, 0.2, 0.3, 0.5. (b) Effect of internal noise at constraint neurons side, with curves for \u03c5 = 0, 0.2, 0.3, 0.5.\n\nThe proof is constructive: we create a dataset X such that it can be memorized by the proposed\nneural network and satisfies the required properties, i.e. 
the subpatterns form a subspace and pattern\nentries are integer values from the set S = {0, . . . , S \u2212 1}. The complete proof can be found in [9].\n\n5 Discussion\n\nWe have demonstrated that associative memories with exponential capacity still work reliably even\nwhen built from unreliable hardware, addressing a major problem in fault-tolerant computing and\nfurther arguing for the viability of associative memory models for the (noisy) mammalian brain.\nAfter all, brain regions modeled as associative memories, such as the hippocampus and the olfactory\ncortex, certainly do display internal noise [12, 13, 26]. The linear-nonlinear computations of Alg. 1\nare certainly biologically plausible, but implementing the state reversion computation of Alg. 2 in a\nbiologically plausible way remains an open question.\nWe found a threshold phenomenon for reliable operation, which manifests the tradeoff between\nthe amount of internal noise and the amount of external noise that the system can handle. In fact,\nwe showed that internal noise actually improves the performance of the network in dealing with\nexternal errors, up to some optimal value. This is a manifestation of the stochastic facilitation [13] or\nnoise enhancement [14] phenomenon that has been observed in other neuronal and signal processing\nsystems, providing a functional benefit to variability in the operation of neural systems.\nThe associative memory design developed herein uses thresholding operations in the message-passing\nalgorithm for recall; as part of our investigation, we optimized these neural firing thresholds\nbased on the statistics of the internal noise. As noted by Sarpeshkar in describing the properties of\nanalog and digital computing circuits, \u201cIn a cascade of analog stages, noise starts to accumulate.\nThus, complex systems with many stages are difficult to build. [In digital systems] Round-off error\ndoes not accumulate significantly for many computations. 
Thus, complex systems with many stages\nare easy to build\u201d [27]. One key to our result is capturing this benefit of digital processing (thresholding\nto prevent the buildup of errors due to internal noise) as well as a modular architecture which\nallows us to correct a linear number of external errors (in terms of the pattern length).\nThis paper focused on recall; however, learning is the other critical stage of associative memory\noperation. Indeed, information storage in nervous systems is said to be subject to storage (or learning)\nnoise, in situ noise, and retrieval (or recall) noise [11, Fig. 1]. It should be noted, however, that there\nis no essential loss by combining learning noise and in situ noise into what we have called external\nerror herein, cf. [19, Fn. 1 and Prop. 1]. Thus our basic qualitative result extends to the setting where\nthe learning and storage phases are also performed with noisy hardware.\nGoing forward, it is of interest to investigate other neural information processing models that\nexplicitly incorporate internal noise and see whether they provide insight into observed empirical\nphenomena. As an example, we might be able to understand the threshold phenomenon observed in\nthe SER of human telegraph operators under heat stress [28, Fig. 2], by invoking a thermal internal\nnoise explanation.\n\nReferences\n\n[1] A. Treves and E. T. Rolls, \u201cComputational analysis of the role of the hippocampus in memory,\u201d Hippocampus, vol. 4, pp. 374\u2013391, Jun. 1994.\n\n[2] D. A. Wilson and R. M. Sullivan, \u201cCortical processing of odor objects,\u201d Neuron, vol. 72, pp. 506\u2013519, Nov. 2011.\n\n[3] J. J. Hopfield, \u201cNeural networks and physical systems with emergent collective computational abilities,\u201d Proc. Natl. Acad. Sci. U.S.A., vol. 79, pp. 2554\u20132558, Apr. 1982.\n\n[4] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. 
Venkatesh, \u201cThe capacity of the Hopfield associative memory,\u201d IEEE Trans. Inf. Theory, vol. IT-33, pp. 461\u2013482, 1987.\n\n[5] D. J. Amit and S. Fusi, \u201cLearning in neural networks with material synapses,\u201d Neural Comput., vol. 6, pp. 957\u2013982, Sep. 1994.\n\n[6] B. A. Olshausen and D. J. Field, \u201cSparse coding of sensory inputs,\u201d Curr. Opin. Neurobiol., vol. 14, pp. 481\u2013487, Aug. 2004.\n\n[7] A. A. Koulakov and D. Rinberg, \u201cSparse incomplete representations: A potential role of olfactory granule cells,\u201d Neuron, vol. 72, pp. 124\u2013136, Oct. 2011.\n\n[8] A. H. Salavati and A. Karbasi, \u201cMulti-level error-resilient neural networks,\u201d in Proc. 2012 IEEE Int. Symp. Inf. Theory, Jul. 2012, pp. 1064\u20131068.\n\n[9] A. Karbasi, A. H. Salavati, and A. Shokrollahi, \u201cIterative learning and denoising in convolutional neural associative memories,\u201d in Proc. 30th Int. Conf. Mach. Learn. (ICML 2013), Jun. 2013, pp. 445\u2013453.\n\n[10] N. Brunel, V. Hakim, P. Isope, J.-P. Nadal, and B. Barbour, \u201cOptimal information storage and the distribution of synaptic weights: Perceptron versus Purkinje cell,\u201d Neuron, vol. 43, pp. 745\u2013757, 2004.\n\n[11] L. R. Varshney, P. J. Sj\u00f6str\u00f6m, and D. B. Chklovskii, \u201cOptimal information storage in noisy synapses under resource constraints,\u201d Neuron, vol. 52, pp. 409\u2013423, Nov. 2006.\n\n[12] C. Koch, Biophysics of Computation. New York: Oxford University Press, 1999.\n\n[13] M. D. McDonnell and L. M. Ward, \u201cThe benefits of noise in neural systems: bridging theory and experiment,\u201d Nat. Rev. Neurosci., vol. 12, pp. 415\u2013426, Jul. 2011.\n\n[14] H. Chen, P. K. Varshney, S. M. Kay, and J. H. Michels, \u201cTheory of the stochastic resonance effect in signal detection: Part I\u2013fixed detectors,\u201d IEEE Trans. Signal Process., vol. 55, pp. 3172\u20133184, Jul. 2007.\n\n[15] D. A. 
Spielman and S.-H. Teng, \u201cSmoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time,\u201d J. ACM, vol. 51, pp. 385\u2013463, May 2004.\n\n[16] D. J. Amit, Modeling Brain Function. Cambridge: Cambridge University Press, 1992.\n\n[17] M. G. Taylor, \u201cReliable information storage in memories designed from unreliable components,\u201d Bell Syst. Tech. J., vol. 47, pp. 2299\u20132337, Dec. 1968.\n\n[18] A. V. Kuznetsov, \u201cInformation storage in a memory assembled from unreliable components,\u201d Probl. Inf. Transm., vol. 9, pp. 100\u2013114, Jul.\u2013Sep. 1973.\n\n[19] L. R. Varshney, \u201cPerformance of LDPC codes under faulty iterative decoding,\u201d IEEE Trans. Inf. Theory, vol. 57, pp. 4427\u20134444, Jul. 2011.\n\n[20] V. Gripon and C. Berrou, \u201cSparse neural networks with large learning diversity,\u201d IEEE Trans. Neural Netw., vol. 22, pp. 1087\u20131096, Jul. 2011.\n\n[21] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, \u201cExtracting and composing robust features with denoising autoencoders,\u201d in Proc. 25th Int. Conf. Mach. Learn. (ICML 2008), Jul. 2008, pp. 1096\u20131103.\n\n[22] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng, \u201cTiled convolutional neural networks,\u201d in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp. 1279\u20131287.\n\n[23] A. Karbasi, A. H. Salavati, A. Shokrollahi, and L. R. Varshney, \u201cNoise-enhanced associative memories,\u201d arXiv, 2013.\n\n[24] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, and D. A. Spielman, \u201cEfficient erasure correcting codes,\u201d IEEE Trans. Inf. Theory, vol. 47, pp. 569\u2013584, Feb. 2001.\n\n[25] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge: Cambridge University Press, 2008.\n\n[26] M. Yoshida, H. Hayashi, K. Tateno, and S. 
Ishizuka, \u201cStochastic resonance in the hippocampal CA3\u2013CA1 model: a possible memory recall mechanism,\u201d Neural Netw., vol. 15, pp. 1171\u20131183, Dec. 2002.\n\n[27] R. Sarpeshkar, \u201cAnalog versus digital: Extrapolating from electronics to neurobiology,\u201d Neural Comput., vol. 10, pp. 1601\u20131638, Oct. 1998.\n\n[28] N. H. Mackworth, \u201cEffects of heat on wireless telegraphy operators hearing and recording Morse messages,\u201d Br. J. Ind. Med., vol. 3, pp. 143\u2013158, Jul. 1946.\n", "award": [], "sourceid": 847, "authors": [{"given_name": "Amin", "family_name": "Karbasi", "institution": "ETH Zurich"}, {"given_name": "Amir Hesam", "family_name": "Salavati", "institution": "EPFL"}, {"given_name": "Amin", "family_name": "Shokrollahi", "institution": "EPFL"}, {"given_name": "Lav", "family_name": "Varshney", "institution": "IBM Watson Research Center"}]}