{"title": "Unsupervised learning of an efficient short-term memory network", "book": "Advances in Neural Information Processing Systems", "page_first": 3653, "page_last": 3661, "abstract": "Learning in recurrent neural networks has been a topic fraught with difficulties and problems. We here report substantial progress in the unsupervised learning of recurrent networks that can keep track of an input signal. Specifically, we show how these networks can learn to efficiently represent their present and past inputs, based on local learning rules only. Our results are based on several key insights. First, we develop a local learning rule for the recurrent weights whose main aim is to drive the network into a regime where, on average, feedforward signal inputs are canceled by recurrent inputs. We show that this learning rule minimizes a cost function. Second, we develop a local learning rule for the feedforward weights that, based on networks in which recurrent inputs already predict feedforward inputs, further minimizes the cost. Third, we show how the learning rules can be modified such that the network can directly encode non-whitened inputs. Fourth, we show that these learning rules can also be applied to a network that feeds a time-delayed version of the network output back into itself. As a consequence, the network starts to efficiently represent both its signal inputs and their history. We develop our main theory for linear networks, but then sketch how the learning rules could be transferred to balanced, spiking networks.", "full_text": "Unsupervised learning of an ef\ufb01cient short-term\n\nmemory network\n\nPietro Vertechi\u2217\n\nWieland Brendel\u2217\u2020\n\nChristian K. Machens\n\nChampalimaud Neuroscience Programme\nChampalimaud Centre for the Unknown\n\nLisbon, Portugal\n\nfirst.last@neuro.fchampalimaud.org\n\nAbstract\n\nLearning in recurrent neural networks has been a topic fraught with dif\ufb01culties\nand problems. 
We here report substantial progress in the unsupervised learning of recurrent networks that can keep track of an input signal. Specifically, we show how these networks can learn to efficiently represent their present and past inputs, based on local learning rules only. Our results are based on several key insights. First, we develop a local learning rule for the recurrent weights whose main aim is to drive the network into a regime where, on average, feedforward signal inputs are canceled by recurrent inputs. We show that this learning rule minimizes a cost function. Second, we develop a local learning rule for the feedforward weights that, based on networks in which recurrent inputs already predict feedforward inputs, further minimizes the cost. Third, we show how the learning rules can be modified such that the network can directly encode non-whitened inputs. Fourth, we show that these learning rules can also be applied to a network that feeds a time-delayed version of the network output back into itself. As a consequence, the network starts to efficiently represent both its signal inputs and their history. We develop our main theory for linear networks, but then sketch how the learning rules could be transferred to balanced, spiking networks.

1 Introduction

Many brain circuits are known to maintain information over short periods of time in the firing of their neurons [15]. Such "persistent activity" is likely to arise through reverberation of activity due to recurrent synapses. While many recurrent network models have been designed that remain active after transient stimulation, such as hand-designed attractor networks [21, 14] or randomly generated reservoir networks [10, 13], how neural networks can learn to remain active is less well understood.

The problem of learning to remember the input history has mostly been addressed in supervised learning of recurrent networks.
The classical approaches are based on backpropagation through time [22, 6]. However, apart from convergence issues, backpropagation through time is not a feasible method for biological systems. More recent work has drawn attention to random recurrent neural networks, which already provide a reservoir of time constants that allows memories to be stored and read out [10, 13]. Several studies have focused on the question of how to optimize such networks for the task at hand (see [12] for a review); however, the generality of the underlying learning rules is often not fully understood, since many rules are not based on analytical results or convergence proofs.

*These authors contributed equally.
†Current address: Centre for Integrative Neuroscience, University of Tübingen, Germany

The unsupervised learning of short-term memory systems, on the other hand, is largely uncharted territory. While there have been several "bottom-up" studies that use biologically realistic learning rules and simulations (see e.g. [11]), we are not aware of any analytical results based on local learning rules.

Here we report substantial progress in following through a normative, "top-down" approach that results in a recurrent neural network with local synaptic plasticity. This network learns how to efficiently remember an input and its history. The learning rules are largely Hebbian or covariance-based, but separate recurrent and feedforward inputs. Based on recent progress in deriving integrate-and-fire neurons from optimality principles [3, 4], we furthermore sketch how an equivalent spiking network with local learning rules could be derived. Our approach generalizes analogous work in the setting of efficient coding of an instantaneous signal, as developed in [16, 19, 23, 4, 1].

2 The autoencoder revisited

We start by recapitulating the autoencoder network shown in Fig. 1a.
The autoencoder transforms a K-dimensional input signal, x, into a set of N firing rates, r, while obeying two constraints. First, the input signal should be reconstructable from the output firing rates. A common assumption is that the input can be recovered through a linear decoder, D, so that

x ≈ x̂ = Dr.    (1)

Second, the output firing rates, r, should provide an optimal or efficient representation of the input signals. This optimality can be measured by defining a cost C(r) for the representation r. For simplicity, we will in the following assume that the costs are quadratic (L2), although linear (L1) costs in the firing rates could easily be accounted for as well. We note that autoencoder networks are sometimes assumed to reduce the dimensionality of the input (undercomplete case, N < K) and sometimes assumed to increase the dimensionality (overcomplete case, N > K). Our results apply to both cases.

The optimal set of firing rates for a given input signal can then be found by minimizing the loss function

L = (1/2)‖x − Dr‖² + (µ/2)‖r‖²,    (2)

with respect to the firing rates r. Here, the first term is the error between the reconstructed input signal, x̂ = Dr, and the actual stimulus, x, while the second term corresponds to the "cost" of the signal representation. The minimization can be carried out via gradient descent, resulting in the differential equation

ṙ = −∂L/∂r = −µr + D^⊤x − D^⊤Dr.    (3)

This differential equation can be interpreted as a neural network with a 'leak', −µr, feedforward connections, F = D^⊤, and recurrent connections, Ω = D^⊤D. The derivation of neural networks from quadratic loss functions was first introduced by Hopfield [7, 8], and the link to the autoencoder was pointed out in [19].
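The gradient-descent dynamics of Eq. (3) can be checked numerically. The following minimal sketch (our own illustration, not the paper's code; all sizes and constants are toy choices) integrates Eq. (3) for a fixed decoder D and confirms that the rates settle at the minimizer of the loss in Eq. (2):

```python
import numpy as np

# Integrate the rate dynamics r_dot = -mu*r + D^T x - D^T D r  (Eq. 3)
# and check that the equilibrium matches the direct minimizer of
# L = 0.5*||x - D r||^2 + 0.5*mu*||r||^2  (Eq. 2).
rng = np.random.default_rng(0)
K, N, mu, dt = 3, 5, 0.1, 0.01

D = rng.standard_normal((K, N))   # decoder (fixed here, not learned)
x = rng.standard_normal(K)        # one input signal

r = np.zeros(N)
for _ in range(20000):            # simple Euler integration of Eq. (3)
    r += dt * (-mu * r + D.T @ x - D.T @ (D @ r))

# closed-form minimizer of Eq. (2): r* = (D^T D + mu I)^{-1} D^T x
r_star = np.linalg.solve(D.T @ D + mu * np.eye(N), D.T @ x)
assert np.allclose(r, r_star, atol=1e-6)
```

Since the loss is quadratic and µ > 0, the dynamics are linear and stable, so the equilibrium is unique regardless of the initial rates.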
Here, we have chosen a quadratic cost term, which results in a linear differential equation. Depending on the precise nature of the cost term, one can also obtain non-linear differential equations, such as the Cowan-Wilson equations [19, 8]. Here, we will first focus on linear networks, in which case 'firing rates' can be both positive and negative. Further below, we will also show how our results can be generalized to networks with positive firing rates and to networks in which neurons spike.

In the case of arbitrarily small costs, the network can be understood as implementing predictive coding [17]. The reconstructed ("predicted") input signal, x̂ = Dr, is subtracted from the actual input signal, x, see Fig. 1b. Predictive coding here enforces a cancellation or 'balance' between the feedforward and recurrent synaptic inputs. If we assume that the actual input is excitatory, for instance, then the predicted input is mediated through recurrent lateral inhibition. Recent work has shown that this cancellation can be mediated by the detailed balance of currents in spiking networks [3, 1], a result we will return to later on.

Figure 1: Autoencoders. (a) Feedforward network. The input signal x is multiplied with the feedforward weights F. The network generates output firing rates r. (b) Recurrent network. The left panel shows how the reconstructed input signal x̂ = Dr is fed back and subtracted from the original input signal x, so that the network effectively receives F(x − x̂). The right panel shows that this subtraction can also be performed through recurrent connections FD. For the optimal network, we set F = D^⊤. (c) Recurrent network with delayed feedback. Here, the output firing rates are fed back with a delay.
This delayed feedback acts as just another input signal and is thereby re-used, thus generating short-term memory.

3 Unsupervised learning of the autoencoder with local learning rules

The transformation of the input signal, x, into the output firing rates, r, is largely governed by the decoder, D, as can be seen in Eq. (3). When the inputs are drawn from a particular distribution, p(x), such as the distribution of natural images or natural sounds, some decoders will lead to a smaller average loss and better performance. The average loss is given by

⟨L⟩ = (1/2)⟨‖x − Dr‖² + µ‖r‖²⟩,    (4)

where the angular brackets denote an average over many signal presentations. In practice, x will generally be centered and whitened. While it is straightforward to minimize this average loss with respect to the decoder, D, biological networks face a different problem.¹ A general recurrent neural network is governed by the firing rate dynamics

ṙ = −µr + Fx − Ωr,    (5)

and has therefore no access to the decoder, D, but only to its feedforward weights, F, and its recurrent weights, Ω. Furthermore, any change in F and Ω must rely solely on information that is locally available to each synapse. We will assume that the matrix Ω is initially chosen such that the dynamical system is stable, in which case its equilibrium state is given by

Fx = Ωr + µr.    (6)

If the dynamics of the input signal x are slow compared to the rate dynamics of the autoencoder, the network will generally operate close to equilibrium. We will assume that this is the case, an assumption that provides a bridge from firing rate networks to spiking networks, as explained below.

A priori, it is not clear how to change the feedforward weights, F, or the recurrent weights, Ω, since neither appears in the average loss function, Eq. (4).
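The equilibrium of Eq. (6) holds for any stable weight configuration, not just the optimal one. A hedged numerical check (our illustration; sizes, the positive-semidefinite choice of Ω, and constants are assumptions made for stability, not taken from the paper):

```python
import numpy as np

# Generic network dynamics r_dot = -mu*r + F x - Omega r  (Eq. 5):
# for stable Omega, the rates settle into the equilibrium of Eq. (6),
# F x = (Omega + mu I) r, even when F and Omega are far from optimal.
rng = np.random.default_rng(1)
K, N, mu, dt = 3, 5, 0.1, 0.01

F = rng.standard_normal((N, K))
B = rng.standard_normal((N, N)) / np.sqrt(N)
Omega = B.T @ B                   # positive semi-definite => stable dynamics
x = rng.standard_normal(K)

r = np.zeros(N)
for _ in range(50000):            # Euler integration of Eq. (5)
    r += dt * (-mu * r + F @ x - Omega @ r)

# equilibrium condition, Eq. (6): F x = (Omega + mu I) r
assert np.allclose(F @ x, (Omega + mu * np.eye(N)) @ r, atol=1e-5)
```

This is the state on which the learning rules below operate: the rates are assumed to sit at this equilibrium while the much slower synaptic dynamics unfold.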
We might be inclined to solve Eq. (6) for r and plug the result into Eq. (4). However, we would then have to operate on matrix inverses, the resulting gradients imply heavily non-local synaptic operations, and we would still need to somehow eliminate the decoder, D, from the picture.

Here, we follow a different approach. We note that the optimal target network in the previous section implements a form of predictive coding. We therefore suggest a two-step approach to the learning problem. First, we fix the feedforward weights and set up learning rules for the recurrent weights such that the network moves into a regime where the inputs, Fx, are predicted or 'balanced' by the recurrent weights, Ωr, see Fig. 1b. In this case, Ω = FD, and this will be our first target for learning. Second, once Ω is learnt, we change the feedforward weights F to decrease the average loss even further. We then return to step 1 and iterate.

Since F is assumed constant in step 1, we can reach the target Ω = FD by investigating how the decoder D needs to change. The respective learning equation for D can then be translated into a learning equation for Ω, which will directly link the learning of Ω to the minimization of the loss function, Eq. (4).

¹Note that minimization of the average loss with respect to D requires either a hard or a soft normalization constraint on D.

On an intuitive level, a small change of D in the direction ΔD = εxr^⊤ shifts the signal estimate x̂ into the direction of the signal, x̂ → x̂ + ε‖r‖²x, and thereby decreases the reconstruction error (as generally ‖x̂‖² < ‖x‖² due to the regularization).
Such a change translates into the following learning rule for D,

Ḋ = ε(xr^⊤ − αD),    (7)

where ε is sufficiently small to make the learning slow compared to the dynamics of the input signals x = x(t). The 'weight decay' term, −αD, acts as a soft normalization on D. In turn, to have the recurrent weights Ω move towards FD, we multiply with F from the left to obtain the learning rule²

Ω̇ = ε(Fxr^⊤ − αΩ).    (8)

Importantly, this learning rule is completely local: it rests only on information that is available to each synapse, namely the presynaptic firing rates, r, and the postsynaptic input signal, Fx. The recurrent weights adapt in a Hebbian manner by matching strong inputs with strong recurrent drives, and thereby learn to 'predict' or 'balance' the feedforward input.

In step 2, we assume that the recurrent weights have reached their target, Ω = FD, and we learn the feedforward weights. For that, we notice that in the absolute minimum, as shown in the previous section, the feedforward weights become F = D^⊤. Hence, the target for the feedforward weights should be the transpose of the decoder. Over long time intervals, the expected decoder is simply D = ⟨xr^⊤⟩/α, since that is the fixed point of the decoder learning rule, Eq. (7). Hence, we learn the feedforward weights on a yet slower time scale β ≪ ε, according to

Ḟ = β(rx^⊤ − λF),    (9)

where −λF is once more a soft normalization term. The fixed point of the learning rule is then F = D^⊤. We emphasize that this learning rule is also local, based solely on the presynaptic input signal and postsynaptic firing rates.

The learning rules for F and Ω ensure that the fixed points of the network dynamics correspond to the optimal topology, Eq. (3).
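The recurrent rule, Eq. (8), can be sketched numerically. In the illustration below (ours, not the paper's code) we exploit that for whitened inputs (⟨xx^⊤⟩ = I) and equilibrium rates, Eq. (6), the sample average ⟨Fxr^⊤⟩ has the closed form FF^⊤(Ω + µI)^{−⊤}, so the averaged rule can be iterated deterministically; the paper's rule instead uses single input samples. All constants are toy choices:

```python
import numpy as np

# Averaged form of the recurrent rule, Eq. (8):
#     dOmega = eps * (<F x r^T> - alpha * Omega),
# with <x r^T> evaluated at the equilibrium of Eq. (6).  At convergence,
# Omega = F D with D = <x r^T>/alpha, the fixed point of Eq. (7).
rng = np.random.default_rng(3)
K, N, mu, eps, alpha = 2, 4, 0.5, 0.02, 1.0

F = rng.standard_normal((N, K))
Omega = np.zeros((N, N))          # start without recurrence
for _ in range(5000):
    # <x r^T> = F^T (Omega + mu I)^{-T} for whitened x at equilibrium
    xr = F.T @ np.linalg.inv(Omega + mu * np.eye(N)).T
    Omega += eps * (F @ xr - alpha * Omega)   # averaged Eq. (8)

D = xr / alpha                    # decoder estimate, D = <x r^T>/alpha
assert np.allclose(Omega, F @ D, atol=1e-5)   # recurrent weights reach F D
```

The weight decay −αΩ keeps the iteration contractive, so the recurrent weights converge to the target Ω = FD without any explicit knowledge of D.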
The network dynamics are then fully determined by the decoder D, and so is the reconstruction error, Eq. (4). To understand the structure of the decoder, we note from Eq. (7) that the decoder will converge to D = ⟨xr^⊤⟩/α. In the equilibrium state, Eq. (6), the neural responses can be replaced by a linear transform of the input signal, and so

αD → ⟨xr^⊤⟩ = F^⊤(Ω + µI)^{-1},    (10)

where we used that ⟨xx^⊤⟩ = I. Multiplying with Ω + µI from the right and replacing F and Ω with their fixed points yields a condition for the fixed points of the decoder,

αDD^⊤D = (1 − αµ)D.    (11)

If D is full rank, this condition is fulfilled if and only if D is a scaled orthogonal matrix. In other words, the (white) inputs are projected onto orthogonal axes of the population, which is optimal in terms of minimizing signal interference and reconstruction error. If D is not full rank, then the fixed points are unstable, as we show in the supplementary material.

In summary, we note that the autoencoder operates on four separate time scales. On a very fast, almost instantaneous time scale, the firing rates run into equilibrium for a given input signal, Eq. (6). On a slower time scale, the input signal, x, changes. On a yet slower time scale, the recurrent weights, Ω, are learnt, and their learning therefore uses many input signal values. On the final and slowest time scale, the feedforward weights, F, are optimized.

4 Unsupervised learning for non-whitened inputs

Algorithms for efficient coding are generally applied to whitened and centered data (see e.g.
[2, 16]). Indeed, if the data are not centered, the read-out of the neurons will concentrate in the direction of the mean input signal in order to represent it, even though the mean may not carry any relevant information about the actual, time-varying signal. If the data are not whitened, the choice of the decoder will be dominated by second-order statistical dependencies, at the cost of representing higher-order dependencies. The latter are often more interesting to represent, as shown by applications of efficient or sparse coding algorithms to the visual system [20].

While whitening and centering are therefore common pre-processing steps, we note that, with a simple correction, our autoencoder network can take care of the pre-processing steps autonomously. This extra step will be crucial later on, when we feed the time-delayed (and non-whitened) network activity back into the network. The main idea is simple: we suggest to use a cost function that is invariant under affine transformations and equals the cost function we have been using until now in the case of centered and whitened data. To do so, we introduce the short-hands x_c = x − ⟨x⟩ and r_c = r − ⟨r⟩ for the centered input and the centered firing rates, and we write C_x = cov(x, x) for the covariance matrix of the input signal. The corrected loss function is then

L = (1/2)(x_c − Dr_c)^⊤C_x^{-1}(x_c − Dr_c) + (µ/2)‖r‖².    (12)

The loss function reduces to Eq. (2) if the data are centered and if C_x = I.

²Note that the fixed point of the decoder learning rule is D = ⟨xr^⊤⟩/α. Hence, the fixed point of the recurrent learning is Ω = FD.
Furthermore, the value of the loss function remains constant if we apply any affine transformation x → Ax + b.³ In turn, we can interpret the loss function as the likelihood function of a Gaussian.

From hereon, we can follow through exactly the same derivations as in the previous sections. We first notice that the optimal firing rate dynamics become

V = D^⊤C_x^{-1}x − D^⊤C_x^{-1}Dr − µr,    (13)
ṙ = V − ⟨V⟩,    (14)

where V is a placeholder for the overall input. The dynamics differ in two ways from those in Eq. (3). First, the dynamics now require the subtraction of the averaged input, ⟨V⟩. Biophysically, this subtraction could correspond to a slower intracellular process, such as adaptation through hyperpolarization. Second, the optimal feedforward weights are now F = D^⊤C_x^{-1}, and the optimal recurrent weights become Ω = D^⊤C_x^{-1}D.

The derivation of the learning rules follows the outline of the previous section. Initially, the network starts with some random connectivity and obeys the dynamical equations

V = Fx − (Ω + µI)r,    (15)
ṙ = V − ⟨V⟩.    (16)

We then apply the following modified learning rules for D and Ω,

Ḋ = ε(xr^⊤ − ⟨x⟩⟨r⟩^⊤ − αD),    (17)
Ω̇ = ε(Fxr^⊤ − ⟨Fx⟩⟨r⟩^⊤ − αΩ).    (18)

We note that in both cases, the learning remains local. However, similar to the rate dynamics, the dynamics of learning now require a slower synaptic process that computes the averaged signal inputs and presynaptic firing rates.
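The mean-corrected decoder rule, Eq. (17), can be sketched in isolation. In the toy illustration below (ours; the fixed linear encoder r = Wx is a stand-in for the network's equilibrium rates, and the annealed 1/i learning rate replaces the paper's small constant ε so that the fixed point is reached cleanly), the rule recovers the input-rate covariance even for non-centered, non-white inputs:

```python
import numpy as np

# Mean-corrected decoder rule, Eq. (17):
#     dD = eps * (x r^T - <x><r>^T - alpha * D),
# whose fixed point is alpha*D = <x r^T> - <x><r>^T = cov(x, r),
# even when x is neither centered nor white.
rng = np.random.default_rng(4)
K, N, alpha = 2, 3, 1.0

W = rng.standard_normal((N, K))                  # toy encoder: r = W x
mean_x = np.array([1.0, -0.5])                   # non-zero input mean
cov_x = np.array([[2.0, 0.5], [0.5, 1.0]])       # non-white input covariance
A = np.linalg.cholesky(cov_x)

D = np.zeros((K, N))
mx, mr = np.zeros(K), np.zeros(N)                # slow running means <x>, <r>
for i in range(1, 100001):
    x = mean_x + A @ rng.standard_normal(K)      # non-centered, non-white input
    r = W @ x
    mx += (x - mx) / i
    mr += (r - mr) / i
    # annealed version of Eq. (17): running average of x r^T - <x><r>^T
    D += (np.outer(x, r) - np.outer(mx, mr) - alpha * D) / i

# fixed point: alpha*D = cov(x, r) = C_x W^T
assert np.allclose(alpha * D, cov_x @ W.T, atol=0.1)
```

The slow running means implement the subtracted averages ⟨x⟩ and ⟨r⟩ of Eq. (17); without them, D would be biased toward the outer product of the means.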
Synapses are well known to operate on a large range of time scales (e.g., [5]), so that such slower processes are in broad agreement with physiology.

The target for learning the feedforward weights becomes F → D^⊤C_x^{-1}. The matrix inverse can be eliminated by noticing that the differential equation Ḟ = ε(−FC_x + D^⊤) has the required target as its fixed point. The covariance matrix C_x can be estimated by averaging over x_c x_c^⊤, and the decoder D can be estimated by averaging over x_c r_c^⊤, just as in the previous section, or as follows from Eq. (17). Hence, the learning of the feedforward weights becomes

Ḟ = β((r − λFx)x^⊤ − ⟨r − λFx⟩⟨x⟩^⊤).    (19)

As for the recurrent weights, the learning rests on local information, but requires a slower time scale that computes the mean input signal and presynaptic firing rates.

Under these synaptic dynamics, all stable fixed points of the decoder are such that the input is first whitened before being projected onto orthogonal axes of the population. Again, these fixed points are optimal in minimizing the reconstruction error.

³Under an affine transformation, y = Ax + b and ŷ = Ax̂ + b, we obtain: (y − ŷ)^⊤cov(y, y)^{-1}(y − ŷ) = (Ax − Ax̂)^⊤cov(Ax, Ax)^{-1}(Ax − Ax̂) = (x − x̂)^⊤cov(x, x)^{-1}(x − x̂).

5 The autoencoder with memory

We are finally in a position to tackle the problem we started out with: how to build a recurrent network that efficiently represents not just its present input, but also its past inputs.
The objective function used so far, however, completely neglects the input history: even if the dimensionality of the input is much smaller than the number of neurons available to code it, the network will not try to use the extra 'space' available to remember the input history.

5.1 An objective function for short-term memory

Ideally, we would want to be able to read out both the present input and the past inputs, such that x_{t−n} ≈ D_n r_t, where n counts the elementary time steps and the D_n are appropriately chosen readouts. We will in the following assume that there is a matrix M such that D_n M = D_{n+1} for all n. In other words, the input history should be accessible via x̂_{t−n} = D_n r_t = D_0 M^n r_t. The cost function we would like to minimize is then a straightforward generalization of Eq. (2),

L = (1/2) Σ_{n≥0} γ^n ‖x_{t−n} − DM^n r_t‖² + (µ/2)‖r_t‖²,    (20)

where we have set D = D_0. We tacitly assume that x and r are centered and that the L2 norm is defined with respect to the input signal covariance matrix C_x, so that we can work in the full generality of Eq. (12) without keeping the additional notational baggage.

Unfortunately, the direct minimization of this objective is impossible, since the network has no access to the past inputs x_{t−n} for n ≥ 1. Rather, information about past inputs will have to be retrieved from the network activity itself. We can enforce that by replacing the past input signal at time t with its estimate in the previous time step, which we will denote by a prime. In other words, instead of asking that x_{t−n} ≈ x̂_{t−n}, we ask that x̂′_{(t−1)−(n−1)} ≈ x̂_{t−n}, so that the estimates of the input (and its history) are properly propagated through the network.
Given the iterative character of the respective errors, ‖x̂′_{(t−1)−(n−1)} − x̂_{t−n}‖ = ‖DM^{n−1}(r_{t−1} − Mr_t)‖, we can define a loss function for one time step only,

L = (1/2)‖x_t − Dr_t‖² + (γ/2)‖r_{t−1} − Mr_t‖² + (µ/2)‖r_t‖².    (21)

Here, the first term enforces that the instantaneous input signal is properly encoded, while the second term ensures that the network is remembering past information. The last term is a cost term that makes the system more stable and efficient.

Note that a network which minimizes this loss function is maximizing its information content, even if the number of neurons, N, far exceeds the input dimension K, so that N ≫ K. As becomes clear from inspecting the loss function, the network is trying to code an N + K dimensional signal with only N neurons. Consequently, just as in the undercomplete autoencoder, all of its information capacity will be used.

5.2 Dynamics and learning

Conceptually, the loss function in Eq. (21) is identical to Eq. (2), or rather, to Eq. (12), if we keep full generality. We only need to stack the feedforward input and the delayed recurrent input into a single high-dimensional vector x′ = (x_t ; √γ r_{t−1}), and likewise stack the decoder D and the 'time travel' matrix M into a single decoder matrix D′ = (D ; √γ M). The above loss function then reduces to

L = (1/2)‖x′_t − D′r_t‖² + (µ/2)‖r_t‖²,    (22)

and all of our derivations, including the learning rules, can be directly applied to this system. Note that the 'input' to the network now combines the actual input signal, x_t, and the delayed recurrent input, r_{t−1}.
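The equivalence between the memory loss, Eq. (21), and the stacked autoencoder loss, Eq. (22), is a pure bookkeeping identity and easy to verify numerically. In this quick check (ours; all shapes and constants are toy choices):

```python
import numpy as np

# Stacking trick: with x' = (x_t ; sqrt(gamma)*r_{t-1}) and
# D' = (D ; sqrt(gamma)*M), the memory loss of Eq. (21) equals the plain
# autoencoder loss of Eq. (22) applied to (x', D').
rng = np.random.default_rng(5)
K, N, gamma, mu = 3, 6, 0.8, 0.1

D = rng.standard_normal((K, N))
M = rng.standard_normal((N, N))
x_t = rng.standard_normal(K)
r_t, r_prev = rng.standard_normal(N), rng.standard_normal(N)

# Eq. (21): coding error + memory propagation error + cost
L_mem = (0.5 * np.sum((x_t - D @ r_t) ** 2)
         + 0.5 * gamma * np.sum((r_prev - M @ r_t) ** 2)
         + 0.5 * mu * np.sum(r_t ** 2))

# Eq. (22): stacked input and stacked decoder
x_stack = np.concatenate([x_t, np.sqrt(gamma) * r_prev])
D_stack = np.vstack([D, np.sqrt(gamma) * M])
L_stack = 0.5 * np.sum((x_stack - D_stack @ r_t) ** 2) + 0.5 * mu * np.sum(r_t ** 2)

assert np.isclose(L_mem, L_stack)
```

The √γ factors distribute the discount between the stacked input and the stacked decoder so that the cross terms work out exactly.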
Consequently, this extended input is neither white nor centered, and we will need to work with the generalized dynamics and generalized learning rules derived in the previous section.

Figure 2: Emergence of working memory in a network of 10 neurons with random initial connectivity. (A) Mean rates of all neurons over learning. Each time-step corresponds to one input. (B) Rate correlation matrix after learning (black = 1, white = 0). (C) Number of reconstructed steps from the input history. We trained a linear decoder on the network responses to 50000 input stimuli and tested the performance on a test set of 5000 inputs. (D) Overlaps in signal projections before learning. Let u_τ ∝ (Ω_d)^τ F be the (normalized) direction in which the delayed weights project the input in a sequence of τ time-steps. We here show the overlap of these projections, i.e. U^⊤U where U_{τ,·} = u_τ. (E-G) Same as (D) but at different moments during learning.

The network dynamics will initially follow the differential equation⁴

V = Fx_t + Ω_d r_{t−1} − Ω_f r_t − µr_t,    (23)
ṙ = V − ⟨V⟩.    (24)

Compared to our previous network, we now have effectively three inputs into the network: the feedforward inputs with weight F, a delayed recurrent input with weight Ω_d, and a fast recurrent input with weight Ω_f, see Fig. 1c. The optimal connectivities can be derived from the loss function and are (see also Fig. 1c, which considers the simplified case C_x = I and C_r = I)

F*C_x = D^⊤,    (25)
Ω_d*C_r = M^⊤,    (26)
Ω_f* = F*D + Ω_d*M.    (27)

Consequently, there are also three learning rules: one for the fast recurrent weights, which follows Eq. (18), one for the feedforward weights, which follows Eq. (19), and one for the delayed recurrent weights, which also follows Eq. (19).
In summary,

Ω̇_f = ε((Fx_t + Ω_d r_{t−1})r_t^⊤ − ⟨Fx_t + Ω_d r_{t−1}⟩⟨r_t⟩^⊤ − αΩ_f),    (28)
Ḟ = β((r_t − αFx_t)x_t^⊤ − ⟨r_t − αFx_t⟩⟨x_t⟩^⊤),    (29)
Ω̇_d = β((r_t − αΩ_d r_{t−1})r_{t−1}^⊤ − ⟨r_t − αΩ_d r_{t−1}⟩⟨r_{t−1}⟩^⊤).    (30)

In the supplementary information we prove that all fixed points with rank-deficient C_r are unstable, and that C_r ∝ I ∝ Ω_f otherwise. In other words, the neural responses are completely decorrelated. As a result, the network keeps track of the maximum possible number of inputs.

⁴We are now dealing with a delay-differential equation, which may be obscured by our notation. In practice, the term r_{t−1} would be replaced by a term of the form r(t − τ), where τ is the actual value of the 'time step'. Also, for notational convenience we put γ = 1, but the generalization is straightforward.

6 Simulations

We simulated a firing rate network of ten neurons that learn to remember a one-dimensional, temporally uncorrelated white noise stimulus (Fig. 2).
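A condensed sketch of such a simulation can combine the near-equilibrium rates of Eqs. (23)-(24) with the three learning rules, Eqs. (28)-(30). The version below is our own toy reconstruction, not the registered ModelDB script: network size, learning rates, and initial conditions are illustrative assumptions, and the rates are computed directly at equilibrium rather than integrated:

```python
import numpy as np

# Toy sketch of the memory network's learning scheme (Eqs. 23-30):
# white-noise input x_t, near-equilibrium rates
#     r_t = (Omega_f + mu I)^{-1} (F x_t + Omega_d r_{t-1}),
# and local, mean-corrected updates of all three sets of weights.
rng = np.random.default_rng(6)
K, N, mu = 1, 6, 1.0
eps, beta, alpha = 5e-3, 1e-3, 1.0

F = np.ones((N, K))                            # feedforward weights
Omega_f = 0.01 * rng.standard_normal((N, N))   # fast recurrent weights
Omega_d = 0.01 * rng.standard_normal((N, N))   # delayed recurrent weights

r_prev = np.zeros(N)
m_x, m_r, m_in = np.zeros(K), np.zeros(N), np.zeros(N)  # slow running means
for i in range(1, 20001):
    x = rng.standard_normal(K)                          # white-noise input
    ff_in = F @ x + Omega_d @ r_prev                    # feedforward + delayed drive
    r = np.linalg.solve(Omega_f + mu * np.eye(N), ff_in)  # equilibrium rates

    m_x += (x - m_x) / i
    m_r += (r - m_r) / i
    m_in += (ff_in - m_in) / i

    # Eq. (28): fast recurrent weights learn to balance the total input
    Omega_f += eps * (np.outer(ff_in, r) - np.outer(m_in, m_r) - alpha * Omega_f)
    # Eq. (29): feedforward weights, slower time scale
    F += beta * (np.outer(r - alpha * (F @ x), x)
                 - np.outer(m_r - alpha * (F @ m_x), m_x))
    # Eq. (30): delayed recurrent weights, slower time scale
    Omega_d += beta * (np.outer(r - alpha * (Omega_d @ r_prev), r_prev)
                       - np.outer(m_r - alpha * (Omega_d @ m_r), m_r))
    r_prev = r

for W in (F, Omega_f, Omega_d):
    assert np.all(np.isfinite(W))      # learning stays stable
assert np.max(np.abs(r)) < 1e3         # rates remain bounded
```

With the fast rule running on ε and the two slower rules on β ≪ ε, the fast balance tightens first, mirroring the time-scale separation described in the text.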
We initialized all feedforward weights to one, whereas the matrices Ω_f and Ω_d were drawn from a centered Gaussian distribution with small variance. At the onset, the network has some memory, similar to random networks based on reservoir computing. However, the recurrent inputs generally do not cancel the feedforward inputs. The effects of this imprecise balance are initially high firing rates and poor coding properties (Fig. 2A,C). At the beginning of learning, the fast recurrent connections tighten the balance and thus reduce the overall firing rates in the network (Fig. 2A). At this point the delayed recurrent connections Ω_d project the input signal F along a sequence u_τ = (Ω_d)^τ F that is almost random (Fig. 2D). Over learning (Fig. 2E-F), this overlap in the projection sequence vanishes. In other words, inputs entering the network are projected along a sequence of orthogonal subspaces, much like in tapped-delay-line networks. This adaptation leads to increasing memory performance of the network (Fig. 2C) and thus to increasing neural firing rates (Fig. 2A). At the end of learning, the coding properties are close to the information-theoretic limit (10 time steps). A simulation script that reproduces Figure 2 (implemented as an IPython notebook) is registered at ModelDB (accession number 169983).

7 Towards learning in spiking recurrent networks

While we have shown how a recurrent network can learn to efficiently represent an input and its history using only local learning rules, our network is still far from being biologically realistic. A quite obvious discrepancy with biological networks is that the neurons are not spiking, but rather emit 'firing rates' that can be both positive and negative. How can we make the connection to spiking networks? Standard solutions have bridged from rate to spiking networks using mean-field approaches [18].
However, more recent work has shown that there is a direct link from the types of loss functions considered in this paper to balanced spiking networks.
Recently, Hu et al. pointed out that the minimization of Eq. (2) can be performed by a network of neurons that fires both positive and negative spikes [9], and then argued that these networks can be translated into real spiking networks. A similar, but more direct, approach was introduced in [3, 1], which suggested minimizing the loss function, Eq. (2), under the constraint that r ≥ 0. The resulting networks consist of recurrently connected integrate-and-fire neurons that balance their feedforward and recurrent inputs [3, 1, 4]. Importantly, Eq. (2) remains a convex function of r, and Eq. (3) still applies (except that r cannot become negative).
The precise match between the spiking network implementation and the firing rate minimization [1] opens up the possibility of applying our learning rules to the spiking networks. We note, though, that this strictly holds only in the regime where the spiking networks are balanced. (For unbalanced networks, there is no direct link to the firing rate formalism.) If the initial network is not balanced, we first need to learn how to bring it into the balanced state. For white-noise Gaussian inputs, [4] showed how this can be done. For more general inputs, this problem will have to be solved in the future.

8 Discussion

In summary, we have shown how a recurrent neural network can learn to efficiently represent both its present and past inputs. A key insight has been the link between the balancing of feedforward and recurrent inputs and the minimization of the cost function. If neurons can compensate both external feedforward and delayed recurrent excitation with lateral inhibition, then, to some extent, they must be coding the temporal trajectory of the stimulus.
Indeed, in order to be able to compensate an input, the network must be coding it at some level. Furthermore, if synapses are linear, then so must be the decoder.
We have shown that this 'balance' can be learnt through local synaptic plasticity of the lateral connections, based only on the presynaptic input signals and postsynaptic firing rates of the neurons. Similar to tapped delay lines, the converged network propagates inputs along a sequence of independent directions of the population response. Different from tapped delay lines, this capability arises dynamically and can adapt to the input statistics.

References

[1] D. G. Barrett, S. Denève, and C. K. Machens. “Firing rate predictions in optimal balanced networks”. In: Advances in Neural Information Processing Systems 26. 2013, pp. 1538–1546.
[2] A. J. Bell and T. J. Sejnowski. “An information-maximization approach to blind separation and blind deconvolution”. In: Neural Computation 7 (1995), pp. 1129–1159.
[3] M. Boerlin, C. K. Machens, and S. Denève. “Predictive coding of dynamical variables in balanced spiking networks”. In: PLoS Computational Biology 9.11 (2013), e1003258.
[4] R. Bourdoukan et al. “Learning optimal spike-based representations”. In: Advances in Neural Information Processing Systems 25. MIT Press, 2012.
[5] S. Fusi, P. J. Drew, and L. F. Abbott. “Cascade models of synaptically stored memories”. In: Neuron 45.4 (2005), pp. 599–611.
[6] S. Hochreiter and J. Schmidhuber. “Long short-term memory”. In: Neural Computation 9.8 (1997), pp. 1735–1780.
[7] J. J. Hopfield. “Neural networks and physical systems with emergent collective computational abilities”. In: Proceedings of the National Academy of Sciences 79.8 (1982), pp. 2554–2558.
[8] J. J. Hopfield.
\u201cNeurons with graded response have collective computational properties like\nthose of two-state neurons\u201d. In: Proc. Natl. Acad. Sci. USA 81 (1984), pp. 3088\u20133092.\n\n[9] T. Hu, A. Genkin, and D. B. Chklovskii. \u201cA network of spiking neurons for computing sparse\nrepresentations in an energy-ef\ufb01cient way\u201d. In: Neural computation 24.11 (2012), pp. 2852\u2013\n2872.\n\n[10] H. Jaeger. \u201cThe \u201decho state\u201d approach to analysing and training recurrent neural networks.\u201d\n\nIn: German National Research Center for Information Technology. Vol. 48. 2001.\n\n[11] A. Lazaar, G. Pipa, and J. Triesch. \u201cSORN: a self-organizing recurrent neural network\u201d. In:\n\nFrontiers in computational neuroscience 3 (2009), p. 23.\n\n[12] M. Luko\u02c7sevi\u02c7cius and H. Jaeger. \u201cReservoir computing approaches to recurrent neural network\n\ntraining\u201d. In: Computer Science Review 3.3 (2009), pp. 127\u2013149.\n\n[13] W. Maass, T. Natschl\u00a8ager, and H. Markram. \u201cReal-time computing without stable states:\nA new framework for neural computation based on perturbations\u201d. In: Neural computation\n14.11 (2002), pp. 2531\u20132560.\n\n[14] C. K. Machens, R. Romo, and C. D. Brody. \u201cFlexible control of mutual inhibition: A neural\n\nmodel of two-interval discrimination\u201d. In: Science 307 (2005), pp. 1121\u20131124.\n\n[15] G. Major and D. Tank. \u201cPersistent neural activity: prevalence and mechanisms\u201d. In: Curr.\n\nOpin. Neurobiol. 14 (2004), pp. 675\u2013684.\n\n[16] B. A. Olshausen and D. J. Field. \u201cSparse coding with an overcomplete basis set: A strategy\n\nemployed by V1?\u201d In: Vision Research 37.23 (1997), pp. 3311\u20133325.\n\n[17] R. P. N. Rao and D. H. Ballard. \u201cPredictive coding in the visual cortex: a functional inter-\npretation of some extra-classical receptive-\ufb01eld effects\u201d. In: Nature neuroscience 2.1 (1999),\npp. 79\u201387.\n\n[18] A. Renart, N. Brunel, and X.-J. 
Wang. \u201cMean-\ufb01eld theory of irregularly spiking neuronal\npopulations and working memory in recurrent cortical networks\u201d. In: Computational neuro-\nscience: A comprehensive approach (2004), pp. 431\u2013490.\n\n[19] C. J. Rozell et al. \u201cSparse coding via thresholding and local competition in neural circuits\u201d.\n\nIn: Neural computation 20.10 (2008), pp. 2526\u20132563.\n\n[20] E. P. Simoncelli and B. A. Olshausen. \u201cNatural image statistics and neural representation\u201d.\n\nIn: Ann. Rev. Neurosci. 24 (2001), pp. 1193\u20131216.\n\n[21] X.-J. Wang. \u201cProbabilistic decision making by slow reverberation in cortical circuits\u201d. In:\n\nNeuron 36.5 (2002), pp. 955\u2013968.\n\n[22] P. J. Werbos. \u201cBackpropagation through time: what it does and how to do it\u201d. In: Proceedings\n\nof the IEEE 78.10 (1990), pp. 1550\u20131560.\nJ. Zylberberg, J. T. Murphy, and M. R. DeWeese. \u201cA sparse coding model with synaptically\nlocal plasticity and spiking neurons can account for the diverse shapes of V1 simple cell\nreceptive \ufb01elds\u201d. In: PLoS Computational Biology 7.10 (2011), e1002250.\n\n[23]\n\n9\n\n\f", "award": [], "sourceid": 1922, "authors": [{"given_name": "Pietro", "family_name": "Vertechi", "institution": "Champalimaud Center for the Unknown"}, {"given_name": "Wieland", "family_name": "Brendel", "institution": "Champalimaud Neuroscience Programme"}, {"given_name": "Christian", "family_name": "Machens", "institution": "Champalimaud Centre for the Unknown"}]}