{"title": "Dense Associative Memory for Pattern Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1172, "page_last": 1180, "abstract": "A model of associative memory is studied, which stores and reliably retrieves many more patterns than the number of neurons in the network. We propose a simple duality between this dense associative memory and neural networks commonly used in deep learning. On the associative memory side of this duality, a family of models that smoothly interpolates between two limiting cases can be constructed. One limit is referred to as the feature-matching mode of pattern recognition, and the other one as the prototype regime. On the deep learning side of the duality, this family corresponds to feedforward neural networks with one hidden layer and various activation functions, which transmit the activities of the visible neurons to the hidden layer. This family of activation functions includes logistics, rectified linear units, and rectified polynomials of higher degrees. The proposed duality makes it possible to apply energy-based intuition from associative memory to analyze computational properties of neural networks with unusual activation functions - the higher rectified polynomials which until now have not been used in deep learning. The utility of the dense memories is illustrated for two test cases: the logical gate XOR and the recognition of handwritten digits from the MNIST data set.", "full_text": "Dense Associative Memory for Pattern Recognition\n\nDmitry Krotov\n\nSimons Center for Systems Biology\n\nInstitute for Advanced Study\n\nPrinceton, USA\nkrotov@ias.edu\n\nJohn J. Hop\ufb01eld\n\nPrinceton Neuroscience Institute\n\nPrinceton University\n\nPrinceton, USA\n\nhopfield@princeton.edu\n\nAbstract\n\nA model of associative memory is studied, which stores and reliably retrieves many\nmore patterns than the number of neurons in the network. 
We propose a simple duality between this dense associative memory and neural networks commonly used in deep learning. On the associative memory side of this duality, a family of models that smoothly interpolates between two limiting cases can be constructed. One limit is referred to as the feature-matching mode of pattern recognition, and the other one as the prototype regime. On the deep learning side of the duality, this family corresponds to feedforward neural networks with one hidden layer and various activation functions, which transmit the activities of the visible neurons to the hidden layer. This family of activation functions includes logistics, rectified linear units, and rectified polynomials of higher degrees. The proposed duality makes it possible to apply energy-based intuition from associative memory to analyze computational properties of neural networks with unusual activation functions – the higher rectified polynomials which until now have not been used in deep learning. The utility of the dense memories is illustrated for two test cases: the logical gate XOR and the recognition of handwritten digits from the MNIST data set.

1 Introduction

Pattern recognition and models of associative memory [1] are closely related. Consider image classification as an example of pattern recognition. In this problem, the network is presented with an image and the task is to label the image. In the case of associative memory the network stores a set of memory vectors. In a typical query the network is presented with an incomplete pattern resembling, but not identical to, one of the stored memories and the task is to recover the full memory. Pixel intensities of the image can be combined together with the label of that image into one vector [2], which will serve as a memory for the associative memory. Then the image itself can be thought of as a partial memory cue.
The task of identifying an appropriate label is a subpart of the associative memory reconstruction. There is a limitation in using this idea to do pattern recognition. The standard model of associative memory works well in the limit when the number of stored patterns is much smaller than the number of neurons [1], or equivalently the number of pixels in an image. In order to do pattern recognition with small error rate one would need to store many more memories than the typical number of pixels in the presented images. This is a serious problem. It can be solved by modifying the standard energy function of associative memory, quadratic in interactions between the neurons, by including in it higher order interactions. By properly designing the energy function (or Hamiltonian) for these models with higher order interactions one can store and reliably retrieve many more memories than the number of neurons in the network.
Deep neural networks have proven to be useful for a broad range of problems in machine learning including image classification, speech recognition, object detection, etc. These models are composed of several layers of neurons, so that the output of one layer serves as the input to the next layer.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Each neuron calculates a weighted sum of the inputs and passes the result through a non-linear activation function. Traditionally, deep neural networks used activation functions such as hyperbolic tangents or logistics. Learning the weights in such networks, using a backpropagation algorithm, faced serious problems in the 1980s and 1990s. These issues were largely resolved by introducing unsupervised pre-training, which made it possible to initialize the weights in such a way that the subsequent backpropagation could only gently move boundaries between the classes without destroying the feature detectors [3, 4].
More recently, it was realized that the use of rectified linear units (ReLU) instead of the logistic functions speeds up learning and improves generalization [5, 6, 7]. Rectified linear functions are usually interpreted as firing rates of biological neurons. These rates are equal to zero if the input is below a certain threshold and linearly grow with the input if it is above the threshold. To mimic biology the output should be small or zero if the input is below the threshold, but it is much less clear what the behavior of the activation function should be for inputs exceeding the threshold. Should it grow linearly, sub-linearly, or faster than linearly? How does this choice affect the computational properties of the neural network? Are there other functions that would work even better than the rectified linear units? These questions to the best of our knowledge remain open.
This paper examines these questions through the lens of associative memory. We start by discussing a family of models of associative memory with large capacity. These models use higher order (higher than quadratic) interactions between the neurons in the energy function. The associative memory description is then mapped onto a neural network with one hidden layer and an unusual activation function, related to the Hamiltonian. We show that by varying the power of the interaction vertex in the energy function (or equivalently by changing the activation function of the neural network) one can force the model to learn representations of the data either in terms of features or in terms of prototypes.

2 Associative memory with large capacity

The standard model of associative memory [1] uses a system of N binary neurons, with values ±1. A configuration of all the neurons is denoted by a vector σ_i. The model stores K memories, denoted by ξ^μ_i, which for the moment are also assumed to be binary.
The model is defined by an energy function, which is given by

$$E = -\frac{1}{2}\sum_{i,j=1}^{N} \sigma_i T_{ij} \sigma_j, \qquad T_{ij} = \sum_{\mu=1}^{K} \xi^\mu_i \xi^\mu_j,$$   (1)

and a dynamical update rule that decreases the energy at every update. The basic problem is the following: when presented with a new pattern the network should respond with a stored memory which most closely resembles the input.
There has been a large amount of work in the community of statistical physicists investigating the capacity of this model, which is the maximal number of memories that the network can store and reliably retrieve. It has been demonstrated [1, 8, 9] that in case of random memories this maximal value is of the order of K_max ≈ 0.14N. If one tries to store more patterns, several neighboring memories in the configuration space will merge together producing a ground state of the Hamiltonian (1), which has nothing to do with any of the stored memories. By modifying the Hamiltonian (1) in a way that removes second order correlations between the stored memories, it is possible [10] to improve the capacity to K_max = N.
The mathematical reason why the model (1) gets confused when many memories are stored is that several memories produce contributions to the energy which are of the same order. In other words the energy decreases too slowly as the pattern approaches a memory in the configuration space. In order to take care of this problem, consider a modification of the standard energy

$$E = -\sum_{\mu=1}^{K} F\big(\xi^\mu_i \sigma_i\big)$$   (2)

In this formula F(x) is some smooth function (summation over index i is assumed). The computational capabilities of the model will be illustrated for two cases. First, when F(x) = x^n (n is an integer number), which is referred to as a polynomial energy function.
Second, when F(x) is a rectified polynomial energy function

$$F(x) = \begin{cases} x^n, & x \ge 0 \\ 0, & x < 0 \end{cases}$$   (3)

In the case of the polynomial function with n = 2 the network reduces to the standard model of associative memory [1]. If n > 2 each term in (2) becomes sharper compared to the n = 2 case, thus more memories can be packed into the same configuration space before cross-talk intervenes.
Having defined the energy function one can derive an iterative update rule that leads to decrease of the energy. We use asynchronous updates flipping one unit at a time. The update rule is:

$$\sigma_i^{(t+1)} = \mathrm{Sign}\bigg[\sum_{\mu=1}^{K}\bigg(F\Big(\xi^\mu_i + \sum_{j\ne i}\xi^\mu_j \sigma_j^{(t)}\Big) - F\Big(-\xi^\mu_i + \sum_{j\ne i}\xi^\mu_j \sigma_j^{(t)}\Big)\bigg)\bigg],$$   (4)

The argument of the sign function is the difference of two energies. One, for the configuration with all but the i-th units clamped to their current states and the i-th unit in the "off" state. The other one for a similar configuration, but with the i-th unit in the "on" state. This rule means that the system updates a unit, given the states of the rest of the network, in such a way that the energy of the entire configuration decreases. For the case of polynomial energy function a very similar family of models was considered in [11, 12, 13, 14, 15, 16]. The update rule in those models was based on the induced magnetic fields, however, and not on the difference of energies. The two are slightly different due to the presence of self-coupling terms. Throughout this paper we use energy-based update rules.
How many memories can model (4) store and reliably retrieve? Consider the case of random patterns, so that each element of the memories is equal to ±1 with equal probability. Imagine that the system is initialized in a state equal to one of the memories (pattern number μ). One can derive a stability criterion, i.e.
the upper bound on the number of memories such that the network stays in that initial state. Define the energy difference between the initial state and the state with spin i flipped

$$\Delta E = \sum_{\nu=1}^{K}\Big(\xi^\nu_i \xi^\mu_i + \sum_{j\ne i}\xi^\nu_j \xi^\mu_j\Big)^n - \sum_{\nu=1}^{K}\Big(-\xi^\nu_i \xi^\mu_i + \sum_{j\ne i}\xi^\nu_j \xi^\mu_j\Big)^n,$$

where the polynomial energy function is used. This quantity has a mean $\langle \Delta E\rangle = N^n - (N-2)^n \approx 2nN^{n-1}$, which comes from the term with ν = μ, and a variance (in the limit of large N)

$$\Sigma^2 = \Omega_n (K-1) N^{n-1}, \qquad \text{where } \Omega_n = 4n^2(2n-3)!!$$

The i-th bit becomes unstable when the magnitude of the fluctuation exceeds the energy gap ⟨ΔE⟩ and the sign of the fluctuation is opposite to the sign of the energy gap. Thus the probability that the state of a single neuron is unstable (in the limit when both N and K are large, so that the noise is effectively gaussian) is equal to

$$P_{\mathrm{error}} = \int_{\langle \Delta E\rangle}^{\infty} \frac{dx}{\sqrt{2\pi}\,\Sigma}\, e^{-\frac{x^2}{2\Sigma^2}} \approx \sqrt{\frac{(2n-3)!!}{2\pi}\,\frac{K}{N^{n-1}}}\; e^{-\frac{N^{n-1}}{2K(2n-3)!!}}$$

Requiring that this probability is less than a small value, say 0.5%, one can find the upper limit on the number of patterns that the network can store

$$K_{\max} = \alpha_n N^{n-1},$$   (5)

where α_n is a numerical constant, which depends on the (arbitrary) threshold 0.5%. The case n = 2 corresponds to the standard model of associative memory and gives the well known result K = 0.14N. For the perfect recovery of a memory (P_error < 1/N) one obtains

$$K^{\mathrm{no\ errors}}_{\max} \approx \frac{1}{2(2n-3)!!}\,\frac{N^{n-1}}{\ln(N)}$$   (6)

For higher powers n the capacity rapidly grows with N in a non-linear way, allowing the network to store and reliably retrieve many more patterns than the number of neurons that it has, in accord¹ with [13, 14, 15, 16].
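A minimal numerical sketch of this construction, using only the formulas above (the energy (2) with polynomial F(x) = x^n and the energy-difference update rule (4)); the sizes N, K and the number of sweeps are illustrative choices, not values from the paper:

```python
import numpy as np

def energy(sigma, xi, n):
    # E = -sum_mu F(xi^mu . sigma) with polynomial F(x) = x^n, eq. (2)
    return -float(np.sum((xi @ sigma) ** n))

def sweep(sigma, xi, n):
    # One pass of the asynchronous energy-difference update rule, eq. (4):
    # unit i is set to whichever of +/-1 gives the lower total energy,
    # given the current state of every other unit.
    sigma = sigma.copy()
    for i in range(len(sigma)):
        h = xi @ sigma - xi[:, i] * sigma[i]   # input to each memory without unit i
        gap = np.sum((xi[:, i] + h) ** n - (-xi[:, i] + h) ** n)
        sigma[i] = 1 if gap >= 0 else -1
    return sigma

rng = np.random.default_rng(0)
N, K, n = 100, 300, 3                   # K = 3N memories: far beyond the 0.14N of n = 2
xi = rng.choice([-1, 1], size=(K, N))   # random binary memories xi^mu_i

start = xi[0].copy()
start[:5] *= -1                         # corrupt 5 bits of memory number 0
sigma = start
for _ in range(5):
    sigma = sweep(sigma, xi, n)
overlap = float(sigma @ xi[0]) / N      # 1.0 means the memory was fully recovered
```

Because each unit update selects the lower-energy state, the energy is non-increasing along the trajectory; with these sizes K is below the n = 3 bound (6), so the corrupted memory is typically recovered, whereas the same K would be hopeless for n = 2.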
This non-linear scaling relationship between the capacity and the size of the network is the phenomenon that we exploit.

¹The n-dependent coefficient in (6) depends on the exact form of the Hamiltonian and the update rule. References [13, 14, 15] do not allow repeated indices in the products over neurons in the energy function, therefore obtain a different coefficient. In [16] the Hamiltonian coincides with ours, but the update rule is different, which, however, results in exactly the same coefficient as in (6).

We study a family of models of this kind as a function of n. At small n many terms contribute to the sum over μ in (2) approximately equally. In the limit n → ∞ the dominant contribution to the sum comes from a single memory, which has the largest overlap with the input. It turns out that optimal computation occurs in the intermediate range.

3 The case of XOR

The case of XOR is elementary, yet instructive. It is presented here for three reasons. First, it illustrates the construction (2) in this simplest case. Second, it shows that as n increases, the computational capabilities of the network also increase. Third, it provides the simplest example of a situation in which the number of memories is larger than the number of neurons, yet the network works reliably.
The problem is the following: given two inputs x and y produce an output z such that the truth table

 x  |  y  |  z
-1  | -1  | -1
-1  |  1  |  1
 1  | -1  |  1
 1  |  1  | -1

is satisfied. We will treat this task as an associative memory problem and will simply embed the four examples of the input-output triplets (x, y, z) in the memory. Therefore the network has N = 3 identical units: two of which will be used for the inputs and one for the output, and K = 4 memories ξ^μ_i, which are the four lines of the truth table.
Thus, the energy (2) is equal to

$$E_n(x, y, z) = -(-x - y - z)^n - (-x + y + z)^n - (x - y + z)^n - (x + y - z)^n,$$   (7)

where the energy function is chosen to be a polynomial of degree n. For odd n, energy (7) is an odd function of each of its arguments, E_n(x, y, −z) = −E_n(x, y, z). For even n, it is an even function. For n = 1 it is equal to zero. Thus, if evaluated on the corners of the cube x, y, z = ±1, it reduces to

$$E_n(x, y, z) = \begin{cases} 0, & n = 1 \\ -C_n, & n = 2, 4, 6, \ldots \\ C_n\, xyz, & n = 3, 5, 7, \ldots, \end{cases}$$   (8)

where coefficients C_n denote numerical constants.
In order to solve the XOR problem one can present to the network an "incomplete pattern" of inputs (x, y) and let the output z adjust to minimize the energy of the three-spin configuration, while holding the inputs fixed. The network clearly cannot solve this problem for n = 1 and n = 2, since the energy does not depend on the spin configuration. The case n = 2 is the standard model of associative memory. It can also be thought of as a linear perceptron, and the inability to solve this problem represents the well known statement [17] that linear perceptrons cannot compute XOR without hidden neurons. The case of odd n ≥ 3 provides an interesting solution. Given two inputs, x and y, one can choose the output z that minimizes the energy. This leads to the update rule

$$z = \mathrm{Sign}\big[E_n(x, y, -1) - E_n(x, y, +1)\big] = \mathrm{Sign}\big[-xy\big]$$

Thus, in this simple case the network is capable of solving the problem for higher odd values of n, while it cannot do so for n = 1 and n = 2. In case of rectified polynomials, a similar construction solves the problem for any n ≥ 2. The network works well in spite of the fact that K > N.

4 An example of a pattern recognition problem, the case of MNIST

The MNIST data set is a collection of handwritten digits, which has 60000 training examples and 10000 test images. The goal is to classify the digits into 10 classes.
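The three-spin XOR construction of Section 3 can be verified in a few lines; a minimal sketch using the polynomial energy (7) with n = 3 (the Sign convention at zero never matters here, since the energy gap is ±2C_3 ≠ 0):

```python
# Solve XOR by one-step energy minimization over the output spin z, using the
# four truth-table rows as memories and the polynomial energy (7) with n = 3.
memories = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]

def energy(x, y, z, n=3):
    # E_n(x, y, z) = -sum_mu (xi^mu . (x, y, z))^n, eq. (7)
    return -sum((mx * x + my * y + mz * z) ** n for mx, my, mz in memories)

def xor(x, y, n=3):
    # Clamp the inputs, then pick the output state with the lower energy:
    # z = Sign[E_n(x, y, -1) - E_n(x, y, +1)]
    return 1 if energy(x, y, -1, n) - energy(x, y, 1, n) >= 0 else -1

truth = {(x, y): xor(x, y) for x in (-1, 1) for y in (-1, 1)}
```

With n = 1 or n = 2 the same gap vanishes identically on the cube corners, reproducing the failure of the linear perceptron.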
The visible neurons, one for each pixel, are combined together with 10 classification neurons in one vector that defines the state of the network. The visible part of this vector is treated as an "incomplete" pattern and the associative memory is allowed to calculate a completion of that pattern, which is the label of the image.
Dense associative memory (2) is a recurrent network in which every neuron can be updated multiple times. For the purposes of digit classification, however, this model will be used in a very limited capacity, allowing it to perform only one update of the classification neurons. The network is initialized in the state when the visible units v_i are clamped to the intensities of a given image and the classification neurons are in the off state x_α = −1 (see Fig.1A). The network is allowed to make one update of the classification neurons, while keeping the visible units clamped, to produce the output c_α. The update rule is similar to (4) except that the sign is replaced by the continuous function g(x) = tanh(x)

$$c_\alpha = g\bigg[\beta \sum_{\mu=1}^{K}\bigg(F\Big(-\xi^\mu_\alpha x_\alpha + \sum_{\beta\ne\alpha}\xi^\mu_\beta x_\beta + \sum_{i=1}^{N}\xi^\mu_i v_i\Big) - F\Big(\xi^\mu_\alpha x_\alpha + \sum_{\beta\ne\alpha}\xi^\mu_\beta x_\beta + \sum_{i=1}^{N}\xi^\mu_i v_i\Big)\bigg)\bigg],$$   (9)

where parameter β regulates the slope of g(x). The proposed digit class is given by the number of a classification neuron producing the maximal output. Throughout this section the rectified polynomials (3) are used as functions F.
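A minimal sketch of the one-step update (9); the two memories, the input vector, and the value of β here are synthetic stand-ins chosen for illustration, not trained values:

```python
import numpy as np

def rect_poly(x, n):
    # Rectified polynomial energy function F, eq. (3)
    return np.where(x > 0.0, x, 0.0) ** n

def one_step_classify(v, xi_vis, xi_cls, n=3, beta=0.5):
    # One update of the classification neurons, eq. (9): start from the off
    # state x_alpha = -1, keep the visible units v clamped, and push each
    # energy gap through g(x) = tanh(beta * x).
    x = -np.ones(xi_cls.shape[1])
    c = np.empty_like(x)
    for a in range(len(x)):
        rest = xi_vis @ v + xi_cls @ x - xi_cls[:, a] * x[a]
        gap = (rect_poly(-xi_cls[:, a] * x[a] + rest, n)
               - rect_poly(xi_cls[:, a] * x[a] + rest, n))
        c[a] = np.tanh(beta * np.sum(gap))
    return c

# Toy stand-in: K = 2 memories, each voting for its own class.
xi_vis = np.array([[ 1.0, 1.0, -1.0],
                   [-1.0, 1.0,  1.0]])   # visible parts of the memories
xi_cls = np.array([[ 1.0, -1.0],
                   [-1.0,  1.0]])        # recognition parts
c = one_step_classify(np.array([1.0, 1.0, -1.0]), xi_vis, xi_cls)
label = int(np.argmax(c))                # class voted for by the matching memory
```

With the input equal to the visible part of the first memory, the first classification neuron is driven strongly positive and the second strongly negative, so the proposed class is 0.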
To learn effective memories for use in pattern classification, an objective function is defined (see Appendix A in Supplemental), which penalizes the discrepancy between the output c_α and the target output.

Figure 1: (A) The network has N = 28 × 28 = 784 visible neurons and N_c = 10 classification neurons. The visible units are clamped to intensities of pixels (which are mapped to the segment [−1, 1]), while the classification neurons are initialized in the state x_α = −1 and then updated once to the state c_α. (B) Behavior of the error on the test set as training progresses. Each curve corresponds to a different combination of hyperparameters from the optimal window, which was determined on the validation set. The arrows show the first time when the error falls below a 2% threshold (179-312 epochs for n = 2, 158-262 epochs for n = 3). All models have K = 2000 memories (hidden units).

This objective function is then minimized using a backpropagation algorithm. The learning starts with random memories drawn from a Gaussian distribution. The backpropagation algorithm then finds a collection of K memories ξ^μ_{i,α}, which minimize the classification error on the training set. The memories are normalized to stay within the −1 ≤ ξ^μ_{i,α} ≤ 1 range, absorbing their overall scale into the definition of the parameter β.
The performance of the proposed classification framework is studied as a function of the power n. The next section shows that a rectified polynomial of power n in the energy function is equivalent to the rectified polynomial of power n − 1 used as an activation function in a feedforward neural network with one hidden layer of neurons. Currently, the most common choice of activation functions for training deep neural networks is the ReLU, which in our language corresponds to n = 2 for the energy function. Although not currently used to train deep networks, the case n = 3 would correspond to a rectified parabola as an activation function. We start by comparing the performances of the dense memories in these two cases.
The performance of the network depends on n and on the remaining hyperparameters, thus the hyperparameters should be optimized for each value of n. In order to test the variability of performances for various choices of hyperparameters at a given n, a window of hyperparameters for which the network works well on the validation set (see the Appendix A in Supplemental) was determined. Then many networks were trained for various choices of the hyperparameters from this window to evaluate the performance on the test set. The test errors as training progresses are shown in Fig.1B. While there is substantial variability among these samples, on average the cluster of trajectories for n = 3 achieves better results on the test set than that for n = 2. These error rates should be compared with error rates for backpropagation alone without the use of generative pretraining, various kinds of regularizations (for example dropout) or adversarial training, all of which could be added to our construction if necessary.
In this class of models the best published results are all² in the 1.6% range [18], see also controls in [19, 20]. This agrees with our results for n = 2. The n = 3 case does slightly better than that as is clear from Fig.1B, with all the samples performing better than 1.6%.

²Although there are better results on pixel permutation invariant task, see for example [19, 20, 21, 22].

Higher rectified polynomials are also faster in training compared to ReLU. For the n = 2 case, the error crosses the 2% threshold for the first time during training in the range of 179-312 epochs. For the n = 3 case, this happens earlier on average, between 158-262 epochs. For higher powers n this speed-up is larger. This is not a huge effect for a small dataset such as MNIST. However, this speed-up might be very helpful for training large networks on large datasets, such as ImageNet. A similar effect was reported earlier for the transition between saturating units, such as logistics or hyperbolic tangents, to ReLU [7]. In our family of models that result corresponds to moving from n = 1 to n = 2.

Feature to prototype transition

How does the computation performed by the neural network change as n varies? There are two extreme classes of theories of pattern recognition: feature-matching and formation of a prototype. According to the former, an input is decomposed into a set of features, which are compared with those stored in the memory. The subset of the stored features activated by the presented input is then interpreted as an object. One object has many features; features can also appear in more than one object. The prototype theory provides an alternative approach, in which objects are recognized as a whole.
The prototypes do not necessarily match the object exactly, but rather are blurred abstract representations which include all the features that an object has.

Figure 2: We show 25 randomly selected memories (feature detectors) for four networks, which use rectified polynomials of degrees n = 2, 3, 20, 30 as the energy function. The magnitude of a memory element corresponding to each pixel is plotted in the location of that pixel, the color bar explains the color code. The histograms at the bottom are explained in the text: the left histogram shows the percent of memories vs. the number of recognition units with ξ^μ_α > 0.99, the right one the number of test images vs. the number of memories making the decision. The error rates (error_test = 1.51%, 1.44%, 1.61%, 1.80% for n = 2, 3, 20, 30) refer to the particular four samples used in this figure. RU stands for recognition unit.

We argue that the computational models proposed here describe feature-matching mode of pattern recognition for small n and the prototype regime for large n. This can be anticipated from the sharpness of contributions that each memory makes to the total energy (2).
For large n the function F(x) peaks much more sharply around each memory compared to the case of small n. Thus, at large n all the information about a digit must be written in only one memory, while at small n this information can be distributed among several memories. In the case of intermediate n some learned memories behave like features while others behave like prototypes. These two classes of memories work together to model the data in an efficient way.
The feature to prototype transition is clearly seen in memories shown in Fig.2. For n = 2 or 3 each memory does not look like a digit, but resembles a pattern of activity that might be useful for recognizing several different digits. For n = 20 many of the memories can be recognized as digits, which are surrounded by white margins representing elements of memories having approximately zero values. These margins describe the variability of thicknesses of lines of different training examples and mathematically mean that the energy (2) does not depend on whether the corresponding pixel is on or off. For n = 30 most of the memories represent prototypes of whole digits or large portions of digits, with a small admixture of feature memories that do not resemble any digit.
The feature to prototype transition can be visualized by showing the feature detectors in situations when there is a natural ordering of pixels. Such ordering exists in images, for example. In general situations, however, there is no preferred permutation of visible neurons that would reveal this structure (e.g. in the case of genomic data). It is therefore useful to develop a measure that permits a distinction to be made between features and prototypes in the absence of such visual space. Towards the end of training most of the recognition connections ξ^μ_α are approximately equal to ±1. One can choose an arbitrary cutoff, and count the number of recognition connections that are in the "on" state (ξ^μ_α = +1) for each memory. The distribution function of this number is shown on the left histogram in Fig.2. Intuitively, this quantity corresponds to the number of different digit classes that a particular memory votes for. At small n, most of the memories vote for three to five different digit classes, a behavior characteristic of features. As n increases, each memory specializes and votes for only a single class. In the case n = 30, for example, more than 40% of memories vote for only one class, a behavior characteristic of prototypes. A second way to see the feature to prototype transition is to look at the number of memories which make large contributions to the classification decision (right histogram in Fig.2). For each test image one can find the memory that makes the largest contribution to the energy gap, which is the sum over μ in (9). Then one can count the number of memories that contribute to the gap by more than 0.9 of this largest contribution. For small n, there are many memories that satisfy this criterion and the distribution function has a long tail. In this regime several memories are cooperating with each other to make a classification decision. For n = 30, however, more than 8000 of 10000 test images do not have a single other memory that would make a contribution comparable with the largest one. This result is not sensitive to the arbitrary choice (0.9) of the cutoff.
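The two histogram measures can be computed directly from a trained model; a short sketch on synthetic recognition connections (the cutoffs 0.99 and 0.9 follow the text, everything else is an illustrative stand-in):

```python
import numpy as np

def class_votes(xi_cls, cutoff=0.99):
    # Left histogram of Fig.2: for each memory, the number of recognition
    # connections in the "on" state, i.e. the number of classes it votes for.
    return (xi_cls > cutoff).sum(axis=1)

def deciding_memories(contributions, frac=0.9):
    # Right histogram of Fig.2: number of memories whose contribution to the
    # energy gap exceeds `frac` of the largest single contribution.
    return int((contributions >= frac * contributions.max()).sum())

# A feature-like memory votes for several classes, a prototype for one.
xi_cls = np.array([[ 1.0,  1.0,  1.0, -1.0],    # feature: 3 votes
                   [ 1.0, -1.0, -1.0, -1.0]])   # prototype: 1 vote
votes = class_votes(xi_cls)

contribs = np.array([10.0, 9.5, 3.0, 1.0])      # two memories cooperate here
k = deciding_memories(contribs)
```

In the feature regime many memories clear the 0.9 threshold together; in the prototype regime `deciding_memories` is typically 1.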
Interestingly, the performance remains competitive even for very large n ≈ 20 (see Fig.2) in spite of the fact that these networks are doing a very different kind of computation compared with that at small n.

5 Relationship to a neural network with one hidden layer

In this section we derive a simple duality between the dense associative memory and a feedforward neural network with one layer of hidden neurons. In other words, we show that the same computational model has two very different descriptions: one in terms of associative memory, the other one in terms of a network with one layer of hidden units.

Figure 3: On the left a feedforward neural network with one layer of hidden neurons. The states of the visible units v_i are transformed to the hidden neurons h_μ using a non-linear function f, the states of the hidden units are transformed to the output layer c_α using a non-linear function g. On the right the model of dense associative memory with one step update (9), with the classification neurons initialized to x_α = −ε. The two models are equivalent.

Using this correspondence one can transform the family of dense memories, constructed for different values of power n, to the language of models used in deep learning. The resulting neural networks are guaranteed to inherit computational properties of the dense memories such as the feature to prototype transition.
The construction is very similar to (9), except that the classification neurons are initialized in the state when all of them are equal to −ε, see Fig.3. In the limit ε → 0 one can expand the function F in (9) so that the dominant contribution comes from the term linear in ε. Then

$$c_\alpha \approx g\bigg[\beta \sum_{\mu=1}^{K} F'\Big(\sum_{i=1}^{N}\xi^\mu_i v_i\Big)\, 2\xi^\mu_\alpha \varepsilon\bigg] = g\bigg[\sum_{\mu=1}^{K} \xi^\mu_\alpha\, f\Big(\sum_{i=1}^{N}\xi^\mu_i v_i\Big)\bigg],$$   (10)

where the parameter β is set to β = 1/(2ε) (summation over the visible index i is assumed). Thus, the model of associative memory with one step update is equivalent to a conventional feedforward neural network with one hidden layer provided that the activation function from the visible layer to the hidden layer is equal to the derivative of the energy function

$$f(x) = F'(x)$$   (11)

The visible part of each memory serves as an incoming weight to the hidden layer, and the recognition part of the memory serves as an outgoing weight from the hidden layer. The expansion used in (10) is justified by a condition $\sum_{i=1}^{N} \xi^\mu_i v_i \gg \sum_{\alpha=1}^{N_c} \xi^\mu_\alpha x_\alpha$, which is satisfied for most common problems, and is simply a statement that labels contain far less information than the data itself³.
From the point of view of associative memory, the dominant contribution shaping the basins of attraction comes from the low energy states. Therefore mathematically it is determined by the asymptotics of the activation function f(x), or the energy function F(x), at x → ∞. Thus different activation functions having similar asymptotics at x → ∞ should fall into the same universality class and should have similar computational properties. In the table below we list some common activation functions used in models of deep learning, their associative memory counterparts, and the power n which determines the asymptotic behavior of the energy function at x → ∞.

activation function        | energy function                    | n
f(x) = tanh(x)             | F(x) = ln cosh(x) ≈ x, at x → ∞    | 1
f(x) = logistic function   | F(x) = ln(1 + e^x) ≈ x, at x → ∞   | 1
f(x) = ReLU                | F(x) ∝ x², at x → ∞                | 2
f(x) = ReP_{n−1}           | F(x) = ReP_n                       | n

The results of section 4 suggest that for not too large n the speed of learning should improve as n increases.
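The duality (10)-(11) can be checked numerically: a one-step memory update (9) with small ε and β = 1/(2ε) should match the one-hidden-layer network with f = F′. A sketch with a rectified cubic energy; all weights and inputs are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nc, K, n = 8, 3, 5, 3
xi_vis = rng.uniform(-1, 1, size=(K, N))    # visible parts of the memories
xi_cls = rng.uniform(-1, 1, size=(K, Nc))   # recognition parts

F = lambda x: np.maximum(x, 0.0) ** n              # rectified cubic energy, eq. (3)
f = lambda x: n * np.maximum(x, 0.0) ** (n - 1)    # its derivative f = F', eq. (11)

v = rng.uniform(-1, 1, size=N)
eps = 1e-6
beta = 1.0 / (2.0 * eps)

# Associative-memory side: one-step update (9), classification neurons at -eps.
x = -eps * np.ones(Nc)
c_memory = np.empty(Nc)
for a in range(Nc):
    rest = xi_vis @ v + xi_cls @ x - xi_cls[:, a] * x[a]   # all terms except alpha's
    gap = F(-xi_cls[:, a] * x[a] + rest) - F(xi_cls[:, a] * x[a] + rest)
    c_memory[a] = np.tanh(beta * np.sum(gap))

# Deep-learning side: one hidden layer with activation f = F', eq. (10).
c_network = np.tanh(xi_cls.T @ f(xi_vis @ v))

gap_between_models = float(np.max(np.abs(c_memory - c_network)))
```

As ε → 0 the two outputs coincide; with ε = 10⁻⁶ the residual discrepancy is of the order of ε times the label-term correction discussed around the condition above.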
This is consistent with the previous observation that ReLUs are faster in training than hyperbolic tangents and logistics [5, 6, 7]. The last row of the table corresponds to rectified polynomials of higher degrees. To the best of our knowledge these activation functions have not been used in neural networks. Our results suggest that for some problems these higher power activation functions should have even better computational properties than the rectified linear units.

6 Discussion and conclusions

What is the relationship between the capacity of the dense associative memory, calculated in section 2, and the neural network with one step update that is used for digit classification? Consider the limit of very large β in (9), so that the hyperbolic tangent is approximately equal to the sign function, as in (4). In the limit of sufficiently large n the network is operating in the prototype regime. The presented image places the initial state of the network close to a local minimum of energy, which corresponds to one of the prototypes. In most cases the one step update of the classification neurons is sufficient to bring this initial state to the nearest local minimum, thus completing the memory recovery. This is true, however, only if the stored patterns are stable and have basins of attraction around them of at least the size of one neuron flip, which is exactly (in the case of random patterns) the condition given by (6). For correlated patterns the maximal number of stored memories might be different from (6); however, it still rapidly increases with increasing n. The associative memory with one step update (or the feedforward neural network) is exactly equivalent to the full associative memory with multiple updates in this limit. The calculation with random patterns thus theoretically justifies the expectation of good performance in the prototype regime.

To summarize, this paper contains three main results.
First, it is shown how to use the general framework of associative memory for pattern recognition. Second, a family of models is constructed that can learn representations of the data in terms of features or in terms of prototypes, and that smoothly interpolates between these two extreme regimes by varying the power of the interaction vertex. Third, there exists a simple duality between a one step update version of the associative memory model and a feedforward neural network with one layer of hidden units and an unusual activation function. This duality makes it possible to propose a class of activation functions that encourages the network to learn representations of the data with various proportions of features and prototypes. These activation functions can be used in models of deep learning and should be more effective than the standard choices. They allow the networks to train faster. We have also observed an improvement of generalization ability in networks trained with the rectified parabola activation function compared to the ReLU for the case of MNIST. While these ideas were illustrated using the simplest architecture of the neural network with one layer of hidden units, the proposed activation functions can also be used in multilayer architectures. We did not study various regularizations (weight decay, dropout, etc.), which can be added to our construction. The performance of the model supplemented with these regularizations, as well as performance on other common benchmarks, will be reported elsewhere.

³A relationship similar to (11) was discussed in [23, 24] in the context of autoencoders.

References

[1] Hopfield, J.J., 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), pp.2554-2558.

[2] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. and Huang, F., 2006.
A tutorial on energy-based learning. Predicting Structured Data, 1, p.0.

[3] Hinton, G.E., Osindero, S. and Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), pp.1527-1554.

[4] Hinton, G.E. and Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786), pp.504-507.

[5] Nair, V. and Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807-814).

[6] Glorot, X., Bordes, A. and Bengio, Y., 2011. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (pp. 315-323).

[7] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

[8] Amit, D.J., Gutfreund, H. and Sompolinsky, H., 1985. Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14), p.1530.

[9] McEliece, R.J., Posner, E.C., Rodemich, E.R. and Venkatesh, S.S., 1987. The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory, 33(4), pp.461-482.

[10] Kanter, I. and Sompolinsky, H., 1987. Associative recall of memory without errors. Physical Review A, 35(1), p.380.

[11] Chen, H.H., Lee, Y.C., Sun, G.Z., Lee, H.Y., Maxwell, T. and Giles, C.L., 1986. High order correlation model for associative memory. In Neural Networks for Computing (Vol. 151, No. 1, pp. 86-99). AIP Publishing.

[12] Psaltis, D. and Park, C.H., 1986. Nonlinear discriminant functions and associative memories. In Neural Networks for Computing (Vol. 151, No. 1, pp. 370-375). AIP Publishing.

[13] Baldi, P. and Venkatesh, S.S., 1987.
Number of stable points for spin-glasses and neural networks of higher orders. Physical Review Letters, 58(9), p.913.

[14] Gardner, E., 1987. Multiconnected neural network models. Journal of Physics A: Mathematical and General, 20(11), p.3453.

[15] Abbott, L.F. and Arian, Y., 1987. Storage capacity of generalized networks. Physical Review A, 36(10), p.5091.

[16] Horn, D. and Usher, M., 1988. Capacities of multiconnected memory models. Journal de Physique, 49(3), pp.389-395.

[17] Minsky, M. and Papert, S., 1969. Perceptrons: An Introduction to Computational Geometry. The MIT Press, Cambridge, expanded edition.

[18] Simard, P.Y., Steinkraus, D. and Platt, J.C., 2003, August. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (p. 958). IEEE.

[19] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp.1929-1958.

[20] Wan, L., Zeiler, M., Zhang, S., LeCun, Y. and Fergus, R., 2013. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13) (pp. 1058-1066).

[21] Goodfellow, I.J., Shlens, J. and Szegedy, C., 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

[22] Rasmus, A., Berglund, M., Honkala, M., Valpola, H. and Raiko, T., 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems (pp. 3546-3554).

[23] Kamyshanska, H. and Memisevic, R., 2013, April. On autoencoder scoring. In ICML (3) (pp. 720-728).

[24] Kamyshanska, H. and Memisevic, R., 2015. The potential energy of an autoencoder.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6), pp.1261-1273.