{"title": "Structured and Deep Similarity Matching via  Structured and Deep Hebbian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 15403, "page_last": 15412, "abstract": "Synaptic plasticity is widely accepted to be the mechanism behind learning in the brain\u2019s neural networks. A central question is how synapses, with access to only local information about the network, can still organize collectively and perform circuit-wide learning in an efficient manner. In single-layered and all-to-all connected neural networks, local plasticity has been shown to implement gradient-based learning on a class of cost functions that contain a term that aligns the similarity of outputs to the similarity of inputs. Whether such cost functions exist for networks with other architectures is not known. In this paper, we introduce structured and deep similarity matching cost functions, and show how they can be optimized in a gradient-based manner by neural networks with local learning rules. These networks extend F\\\"oldiak\u2019s Hebbian/Anti-Hebbian network to deep architectures and structured feedforward, lateral and feedback connections. Credit assignment problem is solved elegantly by a factorization of the dual learning objective to synapse specific local objectives. Simulations show that our networks learn meaningful features.", "full_text": "Structured and Deep Similarity Matching via\n\nStructured and Deep Hebbian Networks\n\nDina Obeid\n\nHugo Ramambason\n\nCengiz Pehlevan\n\nJohn A. Paulson School of Engineering and Applied Sciences\n\nHarvard University\n\nCambridge, MA, USA\n\n{dinaobeid@seas,hugo_ramambason@g,cpehlevan@seas}.harvard.edu\n\nAbstract\n\nSynaptic plasticity is widely accepted to be the mechanism behind learning in\nthe brain\u2019s neural networks. A central question is how synapses, with access\nto only local information about the network, can still organize collectively and\nperform circuit-wide learning in an ef\ufb01cient manner. In single-layered and all-\nto-all connected neural networks, local plasticity has been shown to implement\ngradient-based learning on a class of cost functions that contain a term that aligns\nthe similarity of outputs to the similarity of inputs. Whether such cost functions\nexist for networks with other architectures is not known. In this paper, we introduce\nstructured and deep similarity matching cost functions, and show how they can\nbe optimized in a gradient-based manner by neural networks with local learning\nrules. These networks extend F\u00f6ldiak\u2019s Hebbian/Anti-Hebbian network to deep\narchitectures and structured feedforward, lateral and feedback connections. Credit\nassignment problem is solved elegantly by a factorization of the dual learning\nobjective to synapse speci\ufb01c local objectives. Simulations show that our networks\nlearn meaningful features.\n\n1\n\nIntroduction\n\nEnd-to-end training of neural networks by gradient-based minimization of a cost function has proved\nto be a very powerful idea, prompting the question whether the brain is implementing the same\nstrategy. The problem with this proposal is that synapses, the sites of learning in the brain, have access\nto only local information, i.e. states of the pre- and post-synaptic neurons, and neuromodulators,\nwhich may represent, for example, global error signals [1]. 
How can synapses calculate from such incomplete information the necessary partial derivatives, which depend on non-local information about other neurons and synapses in the network? Researchers have been tackling this problem by searching for a biologically-plausible implementation of the backpropagation algorithm [2]. While significant progress has been made in this domain, see e.g. [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], a fully plausible implementation is not yet available.

Here we take another approach and focus on networks already operating with biologically-plausible local learning rules. We ask whether one can formulate network-wide learning cost functions for such networks, and whether these networks achieve efficient "credit assignment" by performing gradient-based learning. Previous work in this area showed that single-layered, all-to-all connected Hebbian/anti-Hebbian networks minimize various versions of similarity matching cost functions [13, 14, 15]. In this paper, we generalize these results to networks with structured connectivity and deep architectures.

To achieve our goal, we first introduce a novel class of unsupervised learning objectives that generalize similarity matching [16, 13]: structured and deep similarity matching. This generalization makes use of spatial structure in data and allows hierarchical feature extraction. A parallel can be drawn to the extension of sparse coding [17] to structured [18] and deep sparse coding [19, 20, 21].

We show that structured and deep similarity matching can be implemented by a new class of multi-layered neural networks with structured connectivity and biologically-plausible local learning rules. These networks have Hebbian learning in feedforward and feedback connections between different layers, and anti-Hebbian learning in lateral connections within a layer. They generalize Földiak's single-layered, all-to-all connected Hebbian/anti-Hebbian network [22].

We show how efficient credit assignment is achieved in structured and deep Hebbian/anti-Hebbian networks in an elegant way. The network optimizes a dual min-max problem to the structured and deep similarity matching problem. The network-wide dual objective can be factorized into a summation of distributed objectives over each synapse that depend only on variables local to that synapse. Therefore, gradient learning on them leads to local Hebbian and anti-Hebbian learning rules. Previous work showed this result in a single-layered, all-to-all connected Hebbian/anti-Hebbian network [14]. Here, we extend the result to multi-layered architectures with structured connectivity.

The rest of this paper is organized as follows. In Section 2, we review and extend the results on similarity matching cost functions and their relation to single-layered, all-to-all connected Hebbian/anti-Hebbian networks. In Section 3, we introduce structured similarity matching, and in Section 4 we extend it to deep architectures. In Section 5, we derive structured and deep Hebbian/anti-Hebbian networks from these cost functions, and show how credit assignment is achieved.
We show simulation results in Section 6 and conclude in Section 7.

2 Similarity matching and gradient-based learning in a Hebbian/anti-Hebbian network

In this section we review the results of [14] on Hebbian/anti-Hebbian neural networks and extend them to general monotonic activation functions. Földiak introduced the Hebbian/anti-Hebbian network as a biologically-plausible, single-layered, competitive unsupervised learning network that forms sparse representations [22]. Given an input $x \in \mathbb{R}^K$, the network first produces an output, $r \in \mathbb{R}^N$, by running the following recurrent dynamics until convergence to a fixed point:

$$\tau \frac{du(s)}{ds} = -u(s) + W x - (L - I)\, r(s), \qquad r(s) = f(u(s)), \tag{1}$$

where $f$ is the activation function and $\tau$ is the time constant of the neural dynamics. Given the fixed point output, $r^*$, the synaptic weights are updated by the following local learning rules:

$$\Delta W_{ij} = \eta \left( r^*_i x_j - W_{ij} \right), \qquad \Delta L_{ij} = \frac{\eta}{2} \left( r^*_i r^*_j - L_{ij} \right), \tag{2}$$

where $\eta$ is the learning rate. The $W_{ij}$ updates are Hebbian synaptic plasticity rules with a linear decay term. The $L_{ij}$ updates are anti-Hebbian, because of the minus sign in the corresponding term in (1), and implement lateral competition. After the synaptic update, the network takes in the next input, and the whole process is repeated.

When activation functions are linear, rectified-linear or shrinkage functions, previous studies [23, 13, 14] showed that the learning rules of this network can be interpreted as a stochastic gradient-based optimization of a network-wide learning objective called similarity matching. Our first contribution is to generalize this result to any monotonic activation function by introducing suitable regularizers and constraints to the optimization problem.

Similarity matching is formally defined as follows. Given $T$ inputs, $x_1, \ldots, x_T \in \mathbb{R}^K$, and outputs, $r_1, \ldots, r_T \in \mathbb{R}^N$, similarity matching learns a representation where pairwise input dot products, or similarities, are preserved, subject to regularization and to lower and upper bounds on outputs:

$$\min_{r_1, \ldots, r_T} \; \frac{1}{2T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left( x_t \cdot x_{t'} - r_t \cdot r_{t'} \right)^2 + \frac{2}{T} \sum_{t=1}^{T} \left\| F(r_t) \right\|_1, \quad \text{s.t.} \;\; a \le r_t \le b, \; t = 1, \ldots, T. \tag{3}$$

Here, the bounds and the regularization function act elementwise.
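To make the algorithm concrete, below is a minimal NumPy sketch of the online loop defined by (1) and (2). The Euler step size, iteration count, capped rectified-linear activation, and Gaussian toy inputs are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 20, 10                              # input and output dimensions
W = 0.1 * rng.standard_normal((N, K))      # feedforward weights, Hebbian
L = np.eye(N)                              # lateral weights, anti-Hebbian

def f(u):
    # capped rectified-linear activation; its range [0, 1] supplies a and b
    return np.clip(u, 0.0, 1.0)

def run_dynamics(x, W, L, dt=0.1, steps=500):
    # Euler-integrate eq. (1) until an approximate fixed point
    u = np.zeros(N)
    for _ in range(steps):
        r = f(u)
        u += dt * (-u + W @ x - (L - np.eye(N)) @ r)
    return f(u)

eta = 0.01
for t in range(1000):                      # stream of toy inputs
    x = rng.standard_normal(K)
    r = run_dynamics(x, W, L)
    W += eta * (np.outer(r, x) - W)        # Hebbian update, eq. (2)
    L += (eta / 2) * (np.outer(r, r) - L)  # anti-Hebbian update, eq. (2)
```

Note that only the pre- and post-synaptic activities (and the weight itself) enter each update, which is what makes the rules local.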
To see how the Hebbian/anti-Hebbian network relates to (3), following the method of [14], we expand the square in (3) and introduce new auxiliary variables $W \in \mathbb{R}^{N \times K}$ and $L \in \mathbb{R}^{N \times N}$ (which will be related to the corresponding variables in (1) and (2) shortly) using the identities

$$-\frac{1}{T^2} \sum_t \sum_{t'} x_t^\top x_{t'}\, r_t^\top r_{t'} = \min_{W \in \mathbb{R}^{N \times K}} \; -\frac{2}{T} \sum_t r_t^\top W x_t + \operatorname{Tr} W^\top W, \tag{4}$$

$$\frac{1}{2T^2} \sum_t \sum_{t'} \left( r_t^\top r_{t'} \right)^2 = \max_{L \in \mathbb{R}^{N \times N}} \; \frac{1}{T} \sum_t r_t^\top L r_t - \frac{1}{2} \operatorname{Tr} L^\top L. \tag{5}$$

The first line arises from the cross-term in (3), and aligns the similarities in the input to those in the output. The second line creates diversity in the representation. Plugging these into (3) and exchanging the orders of optimization, we arrive at a dual min-max formulation of similarity matching [14]:

$$\min_{W \in \mathbb{R}^{N \times K}} \; \max_{L \in \mathbb{R}^{N \times N}} \; \frac{1}{T} \sum_{t=1}^{T} l_t(W, L, x_t), \tag{6}$$

where

$$l_t := \operatorname{Tr} W^\top W - \frac{1}{2} \operatorname{Tr} L^\top L + \min_{a \le r_t \le b} \left( -2\, r_t^\top W x_t + r_t^\top L r_t + 2 \left\| F(r_t) \right\|_1 \right). \tag{7}$$

The Hebbian/anti-Hebbian network, defined by equations (1) and (2), can be interpreted as a stochastic alternating optimization [17] of the new objective (6). The algorithm performs two steps for each input, $x_t$.

In the first step, the algorithm minimizes $l_t$ with respect to $r_t$ by running the neural dynamics (1) until convergence. Minimization is achieved because the argument of the min in (7), $E = -2 r_t^\top W x_t + r_t^\top L r_t + 2 \| F(r_t) \|_1$, decreases along the neural dynamics (1) if, within the bounds on the output, the regularizer is related to the neural activation function as

$$F'(r) = u - r, \quad \text{where} \quad r = f(u). \tag{8}$$

The lower and upper bounds $a$ and $b$ are the infimum and supremum of the range of $f$, respectively. We prove a more general version of this result in Proposition 1 in Appendix A. See [24] for other possible neural dynamical systems for $l_t$ minimization.

The following are some examples of the relation (8). The capped rectified-linear activation function $f(u) = \min(\max(u - \lambda, 0), b)$ corresponds to a regularization $F(r) = \lambda r + \text{constant}$ with optimization lower and upper bounds $a = 0$ and $b$. When $\lambda = 0$, $a = 0$, $b = \infty$, we recover nonnegative similarity matching [25]. When $F(r) = \text{constant}$, $a = -\infty$ and $b = \infty$, $f(u) = u$ and we recover the principal subspace network of [14]. For other examples of regularizers see [26].

In the second step of the algorithm, synaptic weights are updated by gradient descent-ascent. Given the optimal network output, $r^*_t$, the $W$- and $L$-dependent terms in $l_t$ can be written as a distributed summation of local objectives over synapses [14]:

$$\sum_{i=1}^{N} \sum_{j=1}^{K} \left( -2 W_{ij} r^*_{t,i} x_{t,j} + W_{ij}^2 \right) + \sum_{i=1}^{N} \sum_{j=1}^{N} \left( L_{ij} r^*_{t,i} r^*_{t,j} - \frac{1}{2} L_{ij}^2 \right). \tag{9}$$

This form shows explicitly how the credit assignment problem is solved in an elegant way: each term in the summation in (9) depends only on variables local to that synapse, so gradient descent on $W$ and gradient ascent on $L$ result in the local updates given in (2).

3 Structured similarity matching

The derivation given in the previous section suggests a generalization of similarity matching such that the corresponding Hebbian/anti-Hebbian network has structured connectivity. A close look at equations (9) and (4) reveals that if we modify the left hand side of (4), the input-output similarity alignment term, to

$$-\frac{1}{T^2} \sum_{i=1}^{K} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{t'=1}^{T} x_{t,i}\, x_{t',i}\, r_{t,j}\, r_{t',j}\, c^{W}_{ij}, \tag{10}$$

where $c^{W}_{ij} \ge 0$ are constants that set the structure, one can still go through the argument in the previous section and arrive at a modified version of (9) where the global objective is still factorized into local objectives. We will do that explicitly in Section 5. A similar modification can be done for the left hand side of (5) by introducing $c^{L}_{ij} \ge 0$, to arrive at the full structured similarity matching (SSM) cost function

$$\min_{\substack{a \le r_t \le b \\ t = 1, \ldots, T}} \; \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left( -\sum_{i,j} x_{t,i}\, x_{t',i}\, r_{t,j}\, r_{t',j}\, c^{W}_{ij} + \frac{1}{2} \sum_{i,j} r_{t,i}\, r_{t',i}\, r_{t,j}\, r_{t',j}\, c^{L}_{ij} \right) + \frac{2}{T} \sum_{t=1}^{T} \left\| F(r_t) \right\|_1, \tag{11}$$

where we dropped terms that depend only on the input.

Through the choice of $c^{W}_{ij}$ and $c^{L}_{ij}$, one can design many topologies for the input-output and output-output interactions. A simple way to choose structure constants is $c^{W}_{ij} \in \{0, 1\}$ and $c^{L}_{ij} \in \{0, 1\}$. Setting $c^{W}_{ij} = 0$ removes any direct interaction between the $i$th input and $j$th output channels, and $c^{L}_{ij} = 0$ does the same for the corresponding outputs. One can anticipate that such structured similarity matching can be learned by a Hebbian/anti-Hebbian network with the corresponding connections removed; we will show this explicitly in Section 5. Other choices of structure constants assign different weights to particular input-output and output-output interactions. A useful architecture for image processing is the locally connected structure, shown in Figure 1A and sketched in code below. It is interesting to note that structured lateral inhibition was also used in [18] for structured sparse coding.
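As an illustration of how one might construct binary structure constants for the locally connected architecture of Figure 1A, here is a minimal sketch; the grid sizes and interaction radii are arbitrary assumptions, and the masks are indexed as `c[i, j]` with $i$ an input site and $j$ an output site, matching (10) and (11).

```python
import numpy as np

def local_mask(grid_in, grid_out, radius):
    # c[i, j] = 1 if input site i and output site j are within `radius`
    # of each other once the two square grids are aligned to a common frame
    in_xy = np.array([(a, b) for a in range(grid_in) for b in range(grid_in)], float)
    out_xy = np.array([(a, b) for a in range(grid_out) for b in range(grid_out)], float)
    out_xy *= grid_in / grid_out               # rescale the output grid
    d = np.linalg.norm(in_xy[:, None, :] - out_xy[None, :, :], axis=-1)
    return (d <= radius).astype(float)

cW = local_mask(grid_in=16, grid_out=8, radius=4.0)   # input-output structure
cL = local_mask(grid_in=8, grid_out=8, radius=2.0)    # output-output (lateral) structure
```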
4 Structured and deep similarity matching

Next, we generalize structured similarity matching to multi-layer processing. To illustrate the main idea, we first focus on generalizing the original similarity matching objective to multiple layers, and bring in the structure constants later.

We consider a series of similarity matching operations, each applied to the output of the previous layer. For notational convenience, we set $r^{(0)}_t := x_t$ and $N^{(0)} := K$, and define deep similarity matching with $P$ layers as:

$$\min_{\substack{a \le r^{(p)}_t \le b \\ t = 1, \ldots, T, \; p = 1, \ldots, P}} \; \sum_{p=1}^{P} \frac{\gamma^{p-P}}{2T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left( r^{(p-1)}_t \cdot r^{(p-1)}_{t'} - r^{(p)}_t \cdot r^{(p)}_{t'} \right)^2 + \sum_{p=1}^{P} \frac{2\gamma^{p-P}}{T} \sum_{t=1}^{T} \left\| F\!\left( r^{(p)}_t \right) \right\|_1, \tag{12}$$

where $\gamma \ge 0$ is a parameter and $r^{(p)}_t \in \mathbb{R}^{N^{(p)}}$.

Figure 1: A) Locally connected similarity matching and structured Hebbian/anti-Hebbian network. By choosing the constants $c^W$ and $c^L$ suitably, one can introduce structured interactions between inputs and outputs. In this example, we assume inputs $x$ and outputs $r$ both live on a 2-dimensional grid. Each output neuron takes input from a small portion of the input grid (red shades) and receives lateral interactions from a subset of the outputs (blue shades). The corresponding structured Hebbian/anti-Hebbian neural network has the same architecture, with connectivity defined by the interactions. Feedforward connections are learned with Hebbian learning rules and lateral connections with anti-Hebbian rules. B) Deep similarity matching and deep Hebbian/anti-Hebbian network. Arrows illustrate synaptic connections. One can introduce structure for all components of the connectivity.
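For concreteness, the objective (12) can be evaluated directly from the layer representations. The sketch below assumes $\gamma > 0$ (so that $\gamma^{p-P}$ is finite) and stands in an $\ell_1$ penalty $\lambda \|r\|_1$ for $\|F(r)\|_1$, matching the capped rectified-linear example of Section 2; both are illustrative assumptions.

```python
import numpy as np

def deep_sm_cost(reps, gamma, lam):
    # reps[0]: T x K input matrix; reps[p]: T x N^(p) outputs of layer p
    P = len(reps) - 1
    T = reps[0].shape[0]
    cost = 0.0
    for p in range(1, P + 1):
        G_prev = reps[p - 1] @ reps[p - 1].T      # similarity matrix of layer p-1
        G = reps[p] @ reps[p].T                   # similarity matrix of layer p
        w = gamma ** (p - P)                      # layer weighting gamma^(p-P)
        cost += w / (2 * T**2) * np.sum((G_prev - G) ** 2)
        cost += 2 * w / T * lam * np.abs(reps[p]).sum()
    return cost
```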
The $\gamma = 0$ limit corresponds to each layer acting independently on the output of the previous layer. The more interesting case is that of finite $\gamma$, which allows later layers to influence earlier layers. Small $\gamma$ emphasizes the costs of earlier layers. In spirit, this construction is similar to deep sparse coding with feedback [20, 21].

One can intuit the kind of network that will optimize the deep similarity matching cost, of which we present a full derivation in the next section. This will be a multi-layer network with Hebbian learning across layers and anti-Hebbian learning within layers, Figure 1B. An interesting consequence of the finite-$\gamma$ coupling between layers is the existence of feedback connections. When $\gamma = 0$, we obtain a network without any feedback. Previously, Bahroun et al. [27] implemented a two-layered similarity matching network without any feedback. That network used biologically-implausible weight sharing by tiling the image plane with identical all-to-all connected networks, and is thus different from our approach.

Different layers in (12) have different representations due to the regularization terms and possible changes in dimensionality. To make the framework stronger and allow better hierarchical feature extraction, we introduce nonnegative structure constants $c^{W,(p)}_{ij}$ and $c^{L,(p)}_{ij}$ for each layer and arrive at the structured and deep similarity matching cost function:

$$\min_{\substack{a \le r^{(p)}_t \le b \\ t=1,\ldots,T, \; p=1,\ldots,P}} \sum_{p=1}^{P} \frac{\gamma^{p-P}}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \Bigg( -\sum_{i=1}^{N^{(p-1)}} \sum_{j=1}^{N^{(p)}} r^{(p-1)}_{t,i} r^{(p-1)}_{t',i} r^{(p)}_{t,j} r^{(p)}_{t',j}\, c^{W,(p)}_{ij} \\ + \frac{1 + \gamma(1 - \delta_{pP})}{2} \sum_{i=1}^{N^{(p)}} \sum_{j=1}^{N^{(p)}} r^{(p)}_{t,i} r^{(p)}_{t',i} r^{(p)}_{t,j} r^{(p)}_{t',j}\, c^{L,(p)}_{ij} \Bigg) + \sum_{p=1}^{P} \frac{2\gamma^{p-P}}{T} \sum_{t=1}^{T} \left\| F\!\left( r^{(p)}_t \right) \right\|_1, \tag{13}$$

where $\delta_{pP}$ is the Kronecker delta. For images, neurobiology suggests choosing the structure constants so that the sizes of receptive fields increase across layers [28].

5 Structured and deep similarity matching via structured and deep Hebbian/anti-Hebbian neural networks

Next, we derive the network that minimizes the structured and deep similarity matching cost (13). We show how credit assignment in this network is solved by explicitly factorizing the dual of the network-wide cost (13) into local synaptic objectives.

Our derivation uses the methods reviewed in Section 2. For each layer, we introduce dual variables $W^{(p)}_{ij}$ and $L^{(p)}_{ij}$ for interactions with positive structure constants, define the variables

$$\bar{W}^{(p)}_{ij} = \begin{cases} W^{(p)}_{ij}, & c^{W,(p)}_{ij} \neq 0 \\ 0, & c^{W,(p)}_{ij} = 0 \end{cases}, \qquad \bar{L}^{(p)}_{ij} = \begin{cases} L^{(p)}_{ij}, & c^{L,(p)}_{ij} \neq 0 \\ 0, & c^{L,(p)}_{ij} = 0 \end{cases} \tag{14}$$

for notational convenience, and rewrite (13) as

$$\min_{\bar{W}^{(1)}, \ldots, \bar{W}^{(P)}} \; \max_{\bar{L}^{(1)}, \ldots, \bar{L}^{(P)}} \; \frac{1}{T} \sum_{t=1}^{T} l_t\!\left( \bar{W}^{(1)}, \ldots, \bar{W}^{(P)}, \bar{L}^{(1)}, \ldots, \bar{L}^{(P)}, r^{(0)}_t \right), \tag{15}$$

where

$$l_t := \sum_{p=1}^{P} \gamma^{p-P} \sum_{\substack{i,j \\ c^{W,(p)}_{ij} \neq 0}} \frac{W^{(p)\,2}_{ij}}{c^{W,(p)}_{ij}} - \sum_{p=1}^{P} \gamma^{p-P} \sum_{\substack{i,j \\ c^{L,(p)}_{ij} \neq 0}} \frac{L^{(p)\,2}_{ij}}{2 \left( 1 + \gamma(1 - \delta_{pP}) \right) c^{L,(p)}_{ij}} \\ + \min_{\substack{a \le r^{(p)}_t \le b \\ p=1,\ldots,P}} \sum_{p=1}^{P} \gamma^{p-P} \left( -2\, r^{(p)\top}_t \bar{W}^{(p)} r^{(p-1)}_t + r^{(p)\top}_t \bar{L}^{(p)} r^{(p)}_t + 2 \left\| F\!\left( r^{(p)}_t \right) \right\|_1 \right). \tag{16}$$

This new optimization problem can be solved in a stochastic manner, by taking gradients of $l_t$ with respect to $W^{(p)}_{ij}$ and $L^{(p)}_{ij}$ at the optimal values of $r^{(p)}_t$. This procedure is akin to the alternating optimization of sparse coding [17, 29]. We describe each of the alternating steps separately.

5.1 Neural dynamics

Proposition 1, given in Appendix A, shows that the minimization in the second line of (16) can be performed by running the following neural network dynamics until convergence to a fixed point:

$$\tau \frac{du^{(p)}}{ds} = -u^{(p)} + \bar{W}^{(p)} r^{(p-1)} - \left( \bar{L}^{(p)} - I \right) r^{(p)} + (1 - \delta_{pP})\, \gamma\, \bar{W}^{(p+1)\top} r^{(p+1)}, \\ r^{(p)} = f(u^{(p)}), \qquad p = 1, \ldots, P, \tag{17}$$

where we dropped the $t$ subscript for notational clarity and set $r^{(0)} = x_t$. As promised, the $\gamma$ parameter sets the strength of the feedback connections in the network. When $\gamma = 0$, information in the network flows bottom-up only. In practice, waiting until convergence may not be necessary [30].

5.2 Gradient-based learning and local learning rules

With the optimal values from the neural dynamics, the network-wide objective factorizes into local synaptic objectives, providing an explicit solution to the credit assignment problem:

$$l_t = \sum_{p=1}^{P} \gamma^{p-P} \sum_{\substack{i,j \\ c^{W,(p)}_{ij} \neq 0}} \left( -2 W^{(p)}_{ij} r^{(p)*}_j r^{(p-1)*}_i + \frac{W^{(p)\,2}_{ij}}{c^{W,(p)}_{ij}} \right) \\ - \sum_{p=1}^{P} \gamma^{p-P} \sum_{\substack{i,j \\ c^{L,(p)}_{ij} \neq 0}} \left( -L^{(p)}_{ij} r^{(p)*}_j r^{(p)*}_i + \frac{L^{(p)\,2}_{ij}}{2 \left( 1 + \gamma(1 - \delta_{pP}) \right) c^{L,(p)}_{ij}} \right). \tag{18}$$

Local learning rules are derived from the above equation by taking derivatives:

$$\Delta W^{(p)}_{ij} = \eta\, \gamma^{p-P} \left( r^{(p)*}_j r^{(p-1)*}_i - \frac{W^{(p)}_{ij}}{c^{W,(p)}_{ij}} \right), \qquad \Delta L^{(p)}_{ij} = \frac{\eta}{2}\, \gamma^{p-P} \left( r^{(p)*}_j r^{(p)*}_i - \frac{L^{(p)}_{ij}}{\left( 1 + \gamma(1 - \delta_{pP}) \right) c^{L,(p)}_{ij}} \right). \tag{19}$$

These rules are Hebbian between layers and anti-Hebbian within layers, Figure 1. One can absorb the $\gamma$ factors into the learning rates and choose different rates for different layers for better performance. Equations (17) and (19) define the structured and deep Hebbian/anti-Hebbian neural network, Figure 1B.
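To make the two alternating steps concrete, below is a minimal NumPy sketch of one training step of the network: Euler integration of the dynamics (17), followed by the local updates (19). The step size, iteration count, capped rectified-linear activation, and binary structure masks are illustrative assumptions; the masks are indexed like the weight matrices (rows are layer-$p$ units, columns are layer-$(p{-}1)$ units), and $\gamma > 0$ is assumed since $\gamma^{p-P}$ appears explicitly.

```python
import numpy as np

def f(u):
    # capped rectified-linear activation
    return np.clip(u, 0.0, 1.0)

def run_dynamics(x, Ws, Ls, gamma, dt=0.1, steps=500):
    # Euler-integrate eq. (17) for layers p = 1..P to an approximate fixed point;
    # Ws[p], Ls[p] hold masked weights (zero wherever the structure constant is zero)
    P = len(Ws)
    us = [np.zeros(W.shape[0]) for W in Ws]
    for _ in range(steps):
        rs = [f(u) for u in us]
        below = [x] + rs[:-1]                        # r^(p-1) for each layer
        for p in range(P):
            du = -us[p] + Ws[p] @ below[p] - (Ls[p] - np.eye(len(us[p]))) @ rs[p]
            if p < P - 1:                            # feedback term, absent at the top
                du += gamma * Ws[p + 1].T @ rs[p + 1]
            us[p] += dt * du
    return [f(u) for u in us]

def update_weights(x, rs, Ws, Ls, cWs, cLs, gamma, eta):
    # local updates, eq. (19); synapses exist only where the structure constant is nonzero
    P = len(Ws)
    below = [x] + rs[:-1]
    for p in range(P):
        g = gamma ** (p + 1 - P)                     # gamma^(p-P), 1-based layer index
        kappa = 1.0 + (gamma if p < P - 1 else 0.0)  # 1 + gamma(1 - delta_pP)
        mW, mL = cWs[p] > 0, cLs[p] > 0
        hebb = np.outer(rs[p], below[p])             # r^(p)*_j r^(p-1)*_i
        Ws[p][mW] += eta * g * (hebb[mW] - Ws[p][mW] / cWs[p][mW])
        anti = np.outer(rs[p], rs[p])                # r^(p)*_j r^(p)*_i
        Ls[p][mL] += (eta / 2) * g * (anti[mL] - Ls[p][mL] / (kappa * cLs[p][mL]))
```

One training pass then alternates `rs = run_dynamics(x, Ws, Ls, gamma)` and `update_weights(x, rs, Ws, Ls, cWs, cLs, gamma, eta)` over the input stream.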
It operates by running the multi-layered dynamics (17) for each input, and performing the updates (19) before seeing the next input.

6 Simulations

Next, we illustrate the performance of structured and deep similarity matching networks on various datasets.

6.1 Illustrative example

We start with a toy example that illustrates the operational principles of the structured and deep Hebbian/anti-Hebbian neural network. We trained a two-layer network with the following architecture: the first layer is composed of two separate networks of 10 neurons each, while the second layer is composed of a single network of 10 neurons. The second layer is connected to both first-layer networks with feedforward and feedback ($\gamma = 0.8$) connections, Figure 2. Inputs to the network are clustered into two groups, and are drawn randomly from 100-dimensional Gaussian distributions; the Gaussian distributions were chosen separately for each cluster and each first-layer network. The representational similarities of these patterns are shown in Figure 2. Neural activation functions were $f(a) = \max(\min(a, 1), 0)$. We used a regularized version of the similarity matching cost [31] to enforce pattern decorrelation in the first and second layers. This regularization does not change the biological plausibility of our networks; it only adds homeostatic plasticity rules [31, 32].

Figure 2: A two-layer Hebbian/anti-Hebbian network with feedback. For each subnetwork, representational similarity matrices are shown for 10 example patterns, 5 from each of the two clusters. Similarities are calculated by taking pairwise dot products of patterns and normalizing the largest dot product to 1. A) Network simulated with patterns from a set generated from the same distribution as the training set. B) Network simulated with patterns to the bottom first-layer network generated from a different distribution. C) Structured and deep similarity matching cost decreases over training.

During training, the structured and deep similarity matching cost consistently decreased, Figure 2C. The networks learned decorrelated representations in the first and second layers, Figure 2A.

Next, we performed a perturbation that elucidates the role of feedback. We kept the input distribution to one of the first-layer networks (top in Figure 2B) as is, but changed the inputs to the other first-layer network (bottom in Figure 2B). The new patterns were nearly equally similar to the original clusters of patterns (bottom in Figure 2B). Even though the cluster identity of these new patterns was ambiguous, the bottom network assigned them to the first or second cluster depending on the identity of the inputs to the top network. This decision was mediated by the feedback connections from the second layer. Therefore, while the anti-Hebbian connections within a layer were performing competitive learning and pattern separation, the Hebbian connections between layers were creating a predictive, pattern-completing pathway between the hierarchical representations across layers.
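A sketch of how the clustered inputs and the representational similarity matrices of Figure 2 can be produced is given below; the cluster centers and noise scale are assumptions, since the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 5, 100                           # patterns per cluster, input dimension
centers = rng.standard_normal((2, D))   # one Gaussian mean per cluster
X = np.vstack([c + 0.1 * rng.standard_normal((n, D)) for c in centers])

def similarity(R):
    # pairwise dot products, largest entry normalized to 1 as in Figure 2
    G = R @ R.T
    return G / np.abs(G).max()

S = similarity(X)   # 10 x 10; the two 5 x 5 diagonal blocks mark the clusters
```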
6.2 Faces dataset

We trained a 3-layer, locally connected Hebbian/anti-Hebbian neural network on examples from the "labeled faces in the wild" dataset [33], Figure 3. Images in this dataset have dimensions $64 \times 64$. We organized the neurons into square grids in each layer, with strides 2, 4 and 8 in the first, second and third layers respectively; thus, there were 1024, 256 and 64 neurons in the respective layers. A neuron was connected to a neuron in the previous layer if the Euclidean distance between its grid location and the previous-layer neuron's grid location was less than or equal to 8 for the first layer, 12 for the second layer and 24 for the third layer. Lateral connections were likewise based on Euclidean distances with the same parameters. Neural activation functions were $f(a) = \max(\min(a, 1), 0)$. We trained with different $\gamma$ values; shown are features for $\gamma = 0.01$. Figure 3 shows the learned features. The network learns diverse localized features in the first layer, and combines them in the second and third layers into larger-scale features.

Figure 3: Features learned by a 3-layer, locally connected Hebbian/anti-Hebbian neural network on the labeled faces in the wild dataset [33]. Features are calculated by reverse correlation on the dataset, and masked to keep only the portions of the input that elicit a response in the neuron. On the left, we zoom in on a selected subset of first-layer features.

6.3 Classifying hand-written digits

We next tested whether the features learned by our networks are useful for classification tasks. We trained a single-layer structured similarity matching network on the MNIST dataset, with each image preprocessed by mean subtraction. We used the locally-connected structure shown in Figure 1. The network had a stride of 2, and each neuron received input from a patch of radius $r_o = 4$. Neurons belonging to the same site had inhibitory recurrent connections. We used the hyperbolic tangent activation function ($\tanh(x)$). Classification was done using the scikit-learn library's LinearSVC with default parameters. Table 1 shows the classification error as a function of the number of neurons per site (NPS). When compared to other networks with biologically-plausible training (1.46% in [34]), our network achieves on-par performance on this dataset.

Table 1: Classification on the MNIST dataset: the test error decreases as the number of neurons per site (NPS) increases.

NPS                      | 4    | 8    | 16   | 32   | 64   | 100
Classification error (%) | 3.87 | 2.41 | 1.73 | 1.60 | 1.47 | 1.40
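The classification step can be reproduced with a few lines of scikit-learn; the array names below are placeholders for the fixed-point network responses and digit labels, which require the trained network to compute and are not shown here.

```python
from sklearn.svm import LinearSVC

def test_error(R_train, y_train, R_test, y_test):
    # R_*: fixed-point responses r* of the trained network, stacked row-wise
    clf = LinearSVC()                    # default parameters, as in the text
    clf.fit(R_train, y_train)
    return 100.0 * (1.0 - clf.score(R_test, y_test))   # error in percent
```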
7 Discussion and conclusion

We introduced a new class of unsupervised learning cost functions, structured and deep similarity matching, and showed how they can be efficiently minimized by a new class of neural networks: structured and deep Hebbian/anti-Hebbian networks. These networks generalize Földiak's single-layer, all-to-all connected network [22].

Even though we introduced depth separately from structure within a layer, the two are related. The structured and deep cost function (13) can be obtained from the structured cost function (11) by allowing structure constants to be negative, and choosing them and the regularizers suitably. Our framework can be used to introduce other architectures, e.g. ones including skip connections.

The credit assignment problem in our networks is solved in an efficient manner. Through a duality transform [14], we showed how the dual min-max objective factorizes into distributed objectives over synapses that depend only on variables local to each synapse. Therefore, each synapse can be updated by biologically-plausible local learning rules, and yet the global objective is optimized in a gradient-based manner.

There are two possible "weight transport problems" [35] in our networks: 1) feedback connections are transposes of feedforward connections, and 2) lateral connections are symmetric. A straightforward and biologically-plausible solution to these problems exists: symmetric weights can be learned asymptotically by the local learning rules in (19), even when the weights are initialized differently. A similar solution was proposed in [36] for the weight transport problem in the backpropagation algorithm. Other solutions, including random feedback weights [5], may be possible.

Acknowledgments

We thank Alper Erdogan and Blake Bordelon for discussions. This work was supported by a gift from the Intel Corporation.

References

[1] Wolfram Schultz, Peter Dayan, and P Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593-1599, 1997.

[2] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.

[3] Xiaohui Xie and H Sebastian Seung. Equivalence of backpropagation and contrastive Hebbian learning in a layered network. Neural Computation, 15(2):441-454, 2003.

[4] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 498-515. Springer, 2015.

[5] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276, 2016.

[6] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037-1045, 2016.

[7] Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11:24, 2017.

[8] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. eLife, 6:e22901, 2017.

[9] James CR Whittington and Rafal Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5):1229-1262, 2017.

[10] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in Neural Information Processing Systems, pages 8721-8732, 2018.

[11] Blake A Richards and Timothy P Lillicrap. Dendritic solutions to the credit assignment problem. Current Opinion in Neurobiology, 54:28-36, 2019.

[12] James CR Whittington and Rafal Bogacz. Theories of error back-propagation in the brain. Trends in Cognitive Sciences, 2019.

[13] Cengiz Pehlevan, Tao Hu, and Dmitri B Chklovskii. A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Computation, 27(7):1461-1495, 2015.

[14] Cengiz Pehlevan, Anirvan M Sengupta, and Dmitri B Chklovskii. Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks? Neural Computation, 30(1):84-124, 2018.
[15] Cengiz Pehlevan and Dmitri B Chklovskii. Neuroscience-inspired online unsupervised learning algorithms. arXiv preprint arXiv:1908.01867, 2019.

[16] Da Kuang, Chris Ding, and Haesun Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the 2012 SIAM International Conference on Data Mining, pages 106-117. SIAM, 2012.

[17] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.

[18] Karol Gregor, Arthur Szlam, and Yann LeCun. Structured sparse coding via lateral inhibition. Advances in Neural Information Processing Systems, 24, 2011.

[19] Yunlong He, Koray Kavukcuoglu, Yun Wang, Arthur Szlam, and Yanjun Qi. Unsupervised feature learning by deep sparse coding. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 902-910. SIAM, 2014.

[20] Edward Kim, Darryl Hannan, and Garrett Kenyon. Deep sparse coding for invariant multimodal Halle Berry neurons. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1111-1120, 2018.

[21] Victor Boutin, Angelo Franciosini, Franck Ruffier, and Laurent Perrinet. Meaningful representations emerge from sparse deep predictive coding. arXiv preprint arXiv:1902.07651, 2019.

[22] Peter Földiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64(2):165-170, 1990.

[23] Tao Hu, Cengiz Pehlevan, and Dmitri B Chklovskii. A Hebbian/anti-Hebbian network for online sparse dictionary learning derived from symmetric matrix factorization. In 2014 48th Asilomar Conference on Signals, Systems and Computers, pages 613-619. IEEE, 2014.

[24] J Hertz, A Krogh, and RG Palmer. Introduction to the Theory of Neural Computation, 1991.

[25] Cengiz Pehlevan and Dmitri B Chklovskii. A Hebbian/anti-Hebbian network derived from online non-negative matrix factorization can cluster and discover sparse features. In 2014 48th Asilomar Conference on Signals, Systems and Computers, pages 769-775. IEEE, 2014.

[26] Christopher J Rozell, Don H Johnson, Richard G Baraniuk, and Bruno A Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20(10):2526-2563, 2008.

[27] Yanis Bahroun and Andrea Soltoggio. Online representation learning with single and multi-layer Hebbian networks for image classification. In International Conference on Artificial Neural Networks, pages 354-363. Springer, 2017.

[28] Jeremy Freeman and Eero P Simoncelli. Metamers of the ventral stream. Nature Neuroscience, 14(9):1195, 2011.

[29] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Proceedings of The 28th Conference on Learning Theory, pages 113-149, 2015.

[30] Victor Minden, Cengiz Pehlevan, and Dmitri B Chklovskii. Biologically plausible online principal component analysis without recurrent neural dynamics. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pages 104-111. IEEE, 2018.

[31] Anirvan Sengupta, Cengiz Pehlevan, Mariano Tepper, Alexander Genkin, and Dmitri Chklovskii. Manifold-tiling localized receptive fields are optimal in similarity-preserving neural networks. In Advances in Neural Information Processing Systems, pages 7080-7090, 2018.
[32] Cengiz Pehlevan. A spiking neural network with local learning rules derived from nonnegative similarity matching. arXiv preprint arXiv:1902.01429, 2019.

[33] Erik Learned-Miller, Gary B Huang, Aruni RoyChowdhury, Haoxiang Li, and Gang Hua. Labeled faces in the wild: A survey. In Advances in Face Detection and Facial Image Analysis, pages 189-248. Springer, 2016.

[34] Dmitry Krotov and John J Hopfield. Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences, 116(16):7723-7731, 2019.

[35] Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23-63, 1987.

[36] John F Kolen and Jordan B Pollack. Backpropagation without weight transport. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 3, pages 1375-1380. IEEE, 1994.

[37] John J Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088-3092, 1984.