{"title": "Updates of Equilibrium Prop Match Gradients of Backprop Through Time in an RNN with Static Input", "book": "Advances in Neural Information Processing Systems", "page_first": 7081, "page_last": 7091, "abstract": "Equilibrium Propagation (EP) is a biologically inspired learning algorithm for\r\nconvergent recurrent neural networks, i.e. RNNs that are fed by a static input x and\r\nsettle to a steady state. Training convergent RNNs consists in adjusting the weights\r\nuntil the steady state of output neurons coincides with a target y. Convergent RNNs\r\ncan also be trained with the more conventional Backpropagation Through Time\r\n(BPTT) algorithm. In its original formulation EP was described in the case of\r\nreal-time neuronal dynamics, which is computationally costly. In this work, we\r\nintroduce a discrete-time version of EP with simplified equations and with reduced\r\nsimulation time, bringing EP closer to practical machine learning tasks. We first\r\nprove theoretically, as well as numerically that the neural and weight updates of EP,\r\ncomputed by forward-time dynamics, are step-by-step equal to the ones obtained by\r\nBPTT, with gradients computed backward in time. The equality is strict when the\r\ntransition function of the dynamics derives from a primitive function and the steady\r\nstate is maintained long enough. We then show for more standard discrete-time\r\nneural network dynamics that the same property is approximately respected and\r\nwe subsequently demonstrate training with EP with equivalent performance to\r\nBPTT. 
In particular, we define the first convolutional architecture trained with EP\r\nachieving \u223c 1% test error on MNIST, which is the lowest error reported with EP.\r\nThese results can guide the development of deep neural networks trained with EP.", "full_text": "Updates of Equilibrium Prop Match Gradients of\n\nBackprop Through Time in an RNN with Static Input\n\nMaxence Ernoult1,2, Julie Grollier2, Damien Querlioz1, Yoshua Bengio3,4, Benjamin Scellier3\n\n1Centre de Nanosciences et de Nanotechnologies, Universit\u00e9 Paris Sud, Universit\u00e9 Paris-Saclay\n\n2Unit\u00e9 Mixte de Physique, CNRS, Thales, Universit\u00e9 Paris-Sud, Universit\u00e9 Paris-Saclay\n\n3Mila, Universit\u00e9 de Montr\u00e9al\n\n4Canadian Institute for Advanced Research\n\nAbstract\n\nEquilibrium Propagation (EP) is a biologically inspired learning algorithm for\nconvergent recurrent neural networks, i.e. RNNs that are fed by a static input x and\nsettle to a steady state. Training convergent RNNs consists in adjusting the weights\nuntil the steady state of output neurons coincides with a target y. Convergent RNNs\ncan also be trained with the more conventional Backpropagation Through Time\n(BPTT) algorithm. In its original formulation EP was described in the case of\nreal-time neuronal dynamics, which is computationally costly. In this work, we\nintroduce a discrete-time version of EP with simpli\ufb01ed equations and with reduced\nsimulation time, bringing EP closer to practical machine learning tasks. We \ufb01rst\nprove theoretically, as well as numerically that the neural and weight updates of EP,\ncomputed by forward-time dynamics, are step-by-step equal to the ones obtained by\nBPTT, with gradients computed backward in time. The equality is strict when the\ntransition function of the dynamics derives from a primitive function and the steady\nstate is maintained long enough. 
We then show for more standard discrete-time\nneural network dynamics that the same property is approximately respected and\nwe subsequently demonstrate training with EP with equivalent performance to\nBPTT. In particular, we de\ufb01ne the \ufb01rst convolutional architecture trained with EP\nachieving \u223c 1% test error on MNIST, which is the lowest error reported with EP.\nThese results can guide the development of deep neural networks trained with EP.\n\n1\n\nIntroduction\n\nThe remarkable development of deep learning over the past years [LeCun et al., 2015] has been\nfostered by the use of backpropagation [Rumelhart et al., 1985] which stands as the most powerful\nalgorithm to train neural networks. In spite of its success, the backpropagation algorithm is not\nbiologically plausible [Crick, 1989], and its implementation on GPUs is energy-consuming [Editorial,\n2018]. Hybrid hardware-software experiments have recently demonstrated how physics and dynamics\ncan be leveraged to achieve learning with energy ef\ufb01ciency [Romera et al., 2018, Ambrogio et al.,\n2018]. Hence the motivation to invent novel learning algorithms where both inference and learning\ncould fully be achieved out of core physics.\nMany biologically inspired learning algorithms have been proposed as alternatives to backpropagation\nto train neural networks. Contrastive Hebbian learning (CHL) has been successfully used to train\nrecurrent neural networks (RNNs) with static input that converge to a steady state (or \u2018equilibrium\u2019),\nsuch as Boltzmann machines [Ackley et al., 1985] and real-time Hop\ufb01eld networks [Movellan,\n1991]. CHL proceeds in two phases, each phase converging to a steady state, where the learning rule\naccommodates the difference between the two equilibria. 
Equilibrium Propagation (EP) [Scellier\nand Bengio, 2017] also belongs to the family of CHL algorithms to train RNNs with static input.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fIn the second phase of EP, the prediction error is encoded as an elastic force nudging the system\ntowards a second equilibrium closer to the target. Interestingly, EP also shares similar features with\nthe backpropagation algorithm, and more speci\ufb01cally recurrent backpropagation (RBP) [Almeida,\n1987, Pineda, 1987]. It was proved in Scellier and Bengio [2019] that neural computation in the\nsecond phase of EP is equivalent to gradient computation in RBP.\nOriginally, EP was introduced in the context of leaky-integrate neurons [Scellier and Bengio, 2017,\n2019]. Computing their dynamics involves long simulation times, hence limiting EP training\nexperiments to small neural networks. In this paper, we propose a discrete-time formulation of EP.\nThis formulation allows demonstrating an equivalence between EP and BPTT in speci\ufb01c conditions,\nsimpli\ufb01es equations and speeds up training, and extends EP to standard neural networks including\nconvolutional ones. Speci\ufb01cally, the contributions of the present work are the following:\n\n\u2022 We introduce a discrete-time formulation of EP (Section 3.1) of which the original real-time\n\nformulation can be seen as a particular case (Section 4.2).\n\n\u2022 We show a step-by-step equality between the updates of EP and the gradients of BPTT when\nthe dynamics converges to a steady state and the transition function of the RNN derives\nfrom a primitive function (Theorem 1, Figure 1). We say that such an RNN has the property\nof \u2018gradient-descending updates\u2019 (or GDU property).\n\n\u2022 We numerically demonstrate the GDU property on a small network, on fully connected\nlayered and convolutional architectures. 
We show that the GDU property continues to hold
approximately for more standard \u2013 prototypical \u2013 neural networks, even when these networks do
not exactly meet the requirements of Theorem 1.

\u2022 We validate our approach with training experiments on different network architectures using
discrete-time EP, achieving performance similar to BPTT. We show that the number of
iterations in the two phases of discrete-time EP can be reduced by a factor of three to five
compared to the original real-time EP, without loss of accuracy. This allows us to train the
first convolutional architecture with EP, reaching \u223c 1% test error on MNIST, which is the
lowest test error reported with EP. Our code is available online in PyTorch.1

2 Background

This section introduces the notations and basic concepts used throughout the paper.

2.1 Convergent RNNs With Static Input

We consider the supervised setting where we want to predict a target y given an input x. The model
is a dynamical system - such as a recurrent neural network (RNN) - parametrized by \u03b8 and evolving
according to the dynamics:

s_{t+1} = F(x, s_t, \u03b8).   (1)

We call F the transition function. The input of the RNN at each timestep is static, equal to x.
Assuming convergence of the dynamics before time step T, we have s_T = s_* where s_* is such that

s_* = F(x, s_*, \u03b8).   (2)

We call s_* the steady state (or fixed point, or equilibrium state) of the dynamical system. The number
of timesteps T is a hyperparameter chosen large enough to ensure s_T = s_*. The goal of learning is to
optimize the parameter \u03b8 to minimize the loss:

L_* = \u2113(s_*, y),   (3)

where the scalar function \u2113 is called the cost function.
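The setting of Eqs. (1)-(3) can be made concrete with a small numerical sketch. The transition function below (a `tanh` layer whose recurrent matrix is rescaled to be a contraction) is illustrative only; it is not one of the paper's architectures, but it converges to a fixed point as Eq. (2) requires:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative transition function F(x, s, theta): s_{t+1} = tanh(W s + U x).
# The spectral norm of W is scaled below 1 so that the map is a contraction
# in s and the dynamics provably settles to a steady state s*.
n_in, n_s = 4, 8
W = rng.standard_normal((n_s, n_s))
W *= 0.5 / np.linalg.norm(W, 2)          # enforce contraction
U = rng.standard_normal((n_s, n_in))

def F(x, s):
    return np.tanh(W @ s + U @ x)

def run_free_phase(x, T=100):
    """Iterate Eq. (1) for T steps from s = 0 (first phase)."""
    s = np.zeros(n_s)
    for _ in range(T):
        s = F(x, s)
    return s

x = rng.standard_normal(n_in)
y = rng.standard_normal(n_s)

s_star = run_free_phase(x)               # steady state: s* = F(x, s*), Eq. (2)
assert np.allclose(s_star, F(x, s_star), atol=1e-8)

loss = 0.5 * np.sum((s_star - y) ** 2)   # L* = l(s*, y), Eq. (3)
```

With the contraction factor of 0.5, the iterate is within 0.5^T of the fixed point after T steps, which is why T = 100 comfortably satisfies s_T = s_*.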
Several algorithms have been proposed to
optimize the loss L_*, including Recurrent Backpropagation (RBP) [Almeida, 1987, Pineda, 1987]
and Equilibrium Propagation (EP) [Scellier and Bengio, 2017]. Here, we present Backpropagation
Through Time (BPTT) and Equilibrium Propagation (EP) and some of the inner mechanisms of these
two algorithms, so as to enunciate the main theoretical result of this paper (Theorem 1).

1https://github.com/ernoult/updatesEPgradientsBPTT

Figure 1: Illustration of the property of Gradient-Descending Updates (GDU property). Top left.
Forward-time pass (or 'first phase') of an RNN with static input x and target y. The final state s_T is
the steady state s_*. Bottom left. Backprop through time (BPTT). Bottom right. Second phase of
equilibrium prop (EP). The starting state in the second phase is the final state of the first phase, i.e.
the steady state s_*. GDU Property (Theorem 1). Step-by-step correspondence between the neural
updates \u0394EP_s(t) in the second phase of EP and the gradients \u2207BPTT_s(t) of BPTT. Corresponding
computations in EP and BPTT at timestep t = 0 (resp. t = 1, 2, 3) are colored in green (resp. blue,
red, yellow). Forward-time computation in EP corresponds to backward-time computation in BPTT.

2.2 Backpropagation Through Time (BPTT)

With frameworks implementing automatic differentiation, optimization by gradient descent using
Backpropagation Through Time (BPTT) has become the standard method to train RNNs. In particular,
BPTT can be used for a convergent RNN such as the one that we study here. To this end, we consider
the loss after T iterations (i.e. the cost of the final state s_T), denoted L = \u2113(s_T, y), and we substitute
L as a proxy2 for the loss at the steady state L_*.
The gradients of L can be computed with BPTT.
In order to state our Theorem 1 (Section 3.2), we recall some of the inner working mechanisms
of BPTT. Eq. (1) can be rewritten in the form s_{t+1} = F(x, s_t, \u03b8_{t+1} = \u03b8), where \u03b8_t denotes the
parameter of the model at time step t, the value \u03b8 being shared across all time steps. This way of
rewriting Eq. (1) enables us to define the partial derivative \u2202L/\u2202\u03b8_t as the sensitivity of the loss L with
respect to \u03b8_t when \u03b8_1, ..., \u03b8_{t\u22121}, \u03b8_{t+1}, ..., \u03b8_T remain fixed (set to the value \u03b8). With these notations,
the gradient \u2202L/\u2202\u03b8 reads as the sum:

\u2202L/\u2202\u03b8 = \u2202L/\u2202\u03b8_1 + \u2202L/\u2202\u03b8_2 + \u00b7\u00b7\u00b7 + \u2202L/\u2202\u03b8_T.   (4)

BPTT computes the 'full' gradient \u2202L/\u2202\u03b8 by computing the partial derivatives \u2202L/\u2202s_t and \u2202L/\u2202\u03b8_t
iteratively and efficiently, backward in time, using the chain rule of differentiation. Subsequently, we
denote the gradients that BPTT computes:

\u2200t \u2208 [0, T \u2212 1]:  \u2207BPTT_s(t) = \u2202L/\u2202s_{T\u2212t},  \u2207BPTT_\u03b8(t) = \u2202L/\u2202\u03b8_{T\u2212t},   (5)

so that

\u2202L/\u2202\u03b8 = \u03a3_{t=0}^{T\u22121} \u2207BPTT_\u03b8(t).   (6)

More details about BPTT are provided in Appendix A.2.

2The difference between the loss L and the loss L_* is explained in Appendix B.1.

3 Equilibrium Propagation (EP) - Discrete Time Formulation

3.1 Algorithm

In its original formulation, Equilibrium Propagation (EP) was introduced in the case of real-time
dynamics [Scellier and Bengio, 2017, 2019].
The first theoretical contribution of this paper is to adapt
the theory of EP to discrete-time dynamics.3 EP is an alternative algorithm to compute the gradient
of L_* in the particular case where the transition function F derives from a scalar function \u03a6, i.e. with
F of the form F(x, s, \u03b8) = \u2202\u03a6/\u2202s(x, s, \u03b8). In this setting, the dynamics of Eq. (1) rewrites:

\u2200t \u2208 [0, T \u2212 1]:  s_{t+1} = \u2202\u03a6/\u2202s(x, s_t, \u03b8).   (7)

This constitutes the first phase of EP. At the end of the first phase, we have reached the steady state, i.e.
s_T = s_*. In the second phase of EP, starting from the steady state s_*, an extra term \u03b2 \u2202\u2113/\u2202s (where \u03b2 is
a positive scaling factor) is introduced in the dynamics of the neurons and acts as an external force
nudging the system dynamics towards decreasing the cost function \u2113. Denoting s^\u03b2_0, s^\u03b2_1, s^\u03b2_2, ... the
sequence of states in the second phase (which depends on the value of \u03b2), the dynamics is defined as

s^\u03b2_0 = s_*  and  \u2200t \u2265 0:  s^\u03b2_{t+1} = \u2202\u03a6/\u2202s(x, s^\u03b2_t, \u03b8) \u2212 \u03b2 \u2202\u2113/\u2202s(s^\u03b2_t, y).   (8)

The network eventually settles to a new steady state s^\u03b2_*. It was shown in Scellier and Bengio [2017]
that the gradient of the loss L_* can be computed based on the two steady states s_* and s^\u03b2_*. More
specifically,4 in the limit \u03b2 \u2192 0,

(1/\u03b2) (\u2202\u03a6/\u2202\u03b8(x, s^\u03b2_*, \u03b8) \u2212 \u2202\u03a6/\u2202\u03b8(x, s_*, \u03b8)) \u2192 \u2212 \u2202L_*/\u2202\u03b8.   (9)

In fact, we can prove a stronger result.
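The two-phase estimate of Eq. (9) can be checked numerically. The following sketch uses a deliberately tiny scalar model, chosen by us for illustration (it is not from the paper): the primitive function Phi(x, s, w) = s\u00b2/4 + s\u00b7tanh(wx), whose transition function s/2 + tanh(wx) is a contraction, with cost l(s, y) = (s \u2212 y)\u00b2/2:

```python
import numpy as np

# Illustrative energy-based toy model (our choice, not the paper's):
#   Phi(x, s, w) = s**2 / 4 + s * tanh(w * x)
#   F(x, s, w)   = dPhi/ds = s / 2 + tanh(w * x)   (contraction in s)
#   l(s, y)      = 0.5 * (s - y)**2
x, y, w, beta = 0.7, 1.5, 0.5, 0.01

def dPhi_ds(s):
    return s / 2 + np.tanh(w * x)

def dPhi_dw(s):
    return s * (1 - np.tanh(w * x) ** 2) * x

# First phase (Eq. 7): free dynamics converge to the steady state s*.
s = 0.0
for _ in range(200):
    s = dPhi_ds(s)
s_star = s

# Second phase (Eq. 8): nudged dynamics starting from s*.
s = s_star
for _ in range(200):
    s = dPhi_ds(s) - beta * (s - y)
s_beta = s

# EP gradient estimate of Eq. (9), to be compared with -dL*/dw.
ep_estimate = (dPhi_dw(s_beta) - dPhi_dw(s_star)) / beta

# Closed-form gradient of L* = 0.5 * (s* - y)**2 with s* = 2 * tanh(w * x).
h = np.tanh(w * x)
true_grad = (2 * h - y) * 2 * (1 - h ** 2) * x

assert np.allclose(ep_estimate, -true_grad, rtol=0.05)
```

For this model the estimate equals \u2212dL*/dw up to a factor 1/(1 + 2\u03b2), so the mismatch shrinks linearly as \u03b2 \u2192 0, consistent with the limit stated in Eq. (9).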
For fixed \u03b2 > 0 we define the neural and weight updates

\u2200t \u2265 0:  \u0394EP_s(\u03b2, t) = (1/\u03b2) (s^\u03b2_{t+1} \u2212 s^\u03b2_t),
          \u0394EP_\u03b8(\u03b2, t) = (1/\u03b2) (\u2202\u03a6/\u2202\u03b8(x, s^\u03b2_{t+1}, \u03b8) \u2212 \u2202\u03a6/\u2202\u03b8(x, s^\u03b2_t, \u03b8)),   (10)

and note that Eq. (9) rewrites as the following telescoping sum:

\u03a3_{t=0}^{\u221e} \u0394EP_\u03b8(\u03b2, t) \u2192 \u2212 \u2202L_*/\u2202\u03b8  as  \u03b2 \u2192 0.   (11)

We can now state our main theoretical result (Theorem 1 below).

3.2 Forward-Time Dynamics of EP Compute Backward-Time Gradients of BPTT

BPTT and EP compute the gradient of the loss in very different ways: while the former algorithm
iteratively adds up gradients going backward in time, as in Eq. (6), the latter algorithm adds up weight
updates going forward in time, as in Eq. (11). In fact, under a condition stated below, the sums are
equal term by term: there is a step-by-step correspondence between the two algorithms.

3We explain in Appendix B.2 the relationship between the discrete-time setting (resp. the primitive function
\u03a6) of this paper and the real-time setting (resp. the energy function E) of Scellier and Bengio [2017, 2019].

4The EP learning rule is a form of contrastive Hebbian learning similar to that of Boltzmann machines
[Ackley et al., 1985] and similar to the one presented in Movellan [1991].

Theorem 1 (Gradient-Descending Updates, GDU). Consider the setting with a transition function
of the form F(x, s, \u03b8) = \u2202\u03a6/\u2202s(x, s, \u03b8). Let s_0, s_1, . . .
, s_T be the convergent sequence of states and
denote s_* = s_T the steady state. If we further assume that there exists some step K with 0 < K \u2264 T
such that s_* = s_T = s_{T\u22121} = ... = s_{T\u2212K}, then, in the limit \u03b2 \u2192 0, the first K updates in the second
phase of EP are equal to the negatives of the first K gradients of BPTT, i.e.

\u2200t = 0, 1, ..., K:  \u0394EP_s(\u03b2, t) \u2192 \u2212\u2207BPTT_s(t),  \u0394EP_\u03b8(\u03b2, t) \u2192 \u2212\u2207BPTT_\u03b8(t).   (12)

We give here a short outline of the proof of Theorem 1 (a complete proof, together with a detailed
sketch of the proof, is provided in Appendix A). The convergence requirement enables us to derive the
equations satisfied by the neural and weight updates (Lemma 3). Then, the existence of a primitive
function ensures that these equations are equal to those satisfied by the gradients of BPTT (Lemma
2), with the same initial conditions, yielding the desired equality (Theorem 1).
Note that other algorithms such as RTRL [Williams and Zipser, 1989] and UORO [Tallec and Ollivier,
2017] also compute the gradients by forward-time dynamics.

4 Experiments

This section uses Theorem 1 as a tool to design neural networks that are trainable with EP: if
a model satisfies the GDU property of Eq. (12), then we expect EP to perform as well as BPTT
on this model. After introducing our protocol (Section 4.1), we define the energy-based setting
and the prototypical setting, where the conditions of Theorem 1 are exactly and approximately met,
respectively (Section 4.2). We show the GDU property on a toy model (Fig. 2) and on fully connected
layered architectures in the two settings (Section 4.3). We define a convolutional architecture in the
prototypical setting (Section 4.4) which also satisfies the GDU property.
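The step-by-step equality of Theorem 1 can itself be checked numerically on a scalar toy model of our own choosing (Phi(x, s, w) = s\u00b2/4 + s\u00b7tanh(wx), l(s, y) = (s \u2212 y)\u00b2/2; the model and all constants below are illustrative, not the paper's architectures). Here the BPTT gradients of Eq. (5) have the closed form (s_T \u2212 y)(\u2202F/\u2202s)^t, since \u2202F/\u2202s = 1/2 is constant:

```python
import numpy as np

# GDU check on an illustrative scalar model:
#   Phi(x, s, w) = s**2 / 4 + s * tanh(w * x),  l(s, y) = 0.5 * (s - y)**2.
x, y, w, beta, T, K = 0.7, 1.5, 0.5, 1e-3, 50, 4

h = np.tanh(w * x)
dF_ds, dF_dw = 0.5, x * (1 - h ** 2)   # Jacobians of F(x, s, w) = s/2 + h

# First phase: free dynamics converge to s* = 2 * tanh(w * x).
s = 0.0
for _ in range(T):
    s = s / 2 + h
s_star = s

# Gradients of BPTT (Eq. 5), accumulated backward in time by the chain rule.
grad_bptt_s = [(s_star - y) * dF_ds ** t for t in range(K)]
grad_bptt_w = [(s_star - y) * dF_ds ** t * dF_dw for t in range(K)]

# Neural and weight updates of EP (Eq. 10), computed forward in time.
s_seq = [s_star]
s = s_star
for _ in range(K):
    s = s / 2 + h - beta * (s - y)     # nudged dynamics, Eq. (8)
    s_seq.append(s)
delta_ep_s = [(s_seq[t + 1] - s_seq[t]) / beta for t in range(K)]
delta_ep_w = [d * dF_dw for d in delta_ep_s]   # since dPhi/dw = s * dF_dw

# GDU property (Eq. 12): Delta_EP(t) ~ -grad_BPTT(t), step by step.
for t in range(K):
    assert np.allclose(delta_ep_s[t], -grad_bptt_s[t], rtol=0.02)
    assert np.allclose(delta_ep_w[t], -grad_bptt_w[t], rtol=0.02)
```

The residual mismatch is of order \u03b2 per time step (here the t-th EP update carries a factor (1/2 \u2212 \u03b2)^t instead of (1/2)^t), which vanishes in the limit \u03b2 \u2192 0, as the theorem states.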
Finally, we validate our
approach by training5 these models with EP and BPTT (Table 1).

Figure 2: Demonstrating the property of gradient-descending updates in the energy-based setting
on a toy model with dummy data x and a target y elastically nudging the output neurons s^0 (right).
Dashed and solid lines represent the \u0394EP and \u2212\u2207BPTT processes respectively and perfectly coincide
for 5 randomly selected neurons (left) and synapses (middle). Each randomly selected neuron or
synapse corresponds to one color. Details can be found in Appendix C.2.

4.1 Demonstrating the Property of Gradient-Descending Updates (GDU Property)

Property of Gradient-Descending Updates. We say that a convergent RNN model fed with a
fixed input has the GDU property if, during the second phase, the updates it computes by EP (\u0394EP)
on the one hand and the gradients it computes by BPTT (\u2212\u2207BPTT) on the other hand are 'equal' - or
'approximately equal', as measured per the RelMSE (Relative Mean Squared Error) metric.

Relative Mean Squared Error (RelMSE). In order to quantitatively measure how well the GDU
property is satisfied, we introduce a metric which we call Relative Mean Squared Error (RelMSE),
such that RelMSE(\u0394EP, \u2212\u2207BPTT) measures the distance between the \u0394EP and \u2212\u2207BPTT processes,
averaged over time, over neurons or synapses (layer-wise) and over a mini-batch of samples - see
Appendix C.1 for the details.

5For training, the effective weight update is performed at the end of the second phase.

Protocol. In order to measure numerically if a given model satisfies the GDU property, we proceed
as follows.
Considering an input x and an associated target y, we perform the first phase for T steps.
Then we perform on the one hand BPTT for K steps (to compute the gradients \u2207BPTT), and on the
other hand EP for K steps (to compute the neural updates \u0394EP), and compare the gradients and neural
updates provided by the two algorithms, either qualitatively by looking at the plots of the curves (as
in Figs. 2 and 4), or quantitatively by computing their RelMSE (as in Fig. 3).

4.2 Energy-Based Setting and Prototypical Setting

Energy-based setting. The system is defined in terms of a primitive function of the form:

\u03a6_\u03b5(s, W) = (1 \u2212 \u03b5) (1/2) ||s||\u00b2 + \u03b5 \u03c3(s)\u22a4 \u00b7 W \u00b7 \u03c3(s),   (13)

where \u03b5 is a discretization parameter, \u03c3 is an activation function and W is a symmetric weight matrix.
In this setting, we consider \u0394EP(\u03b2\u03b5, t) instead of \u0394EP(\u03b2, t) and write \u0394EP(t) for simplicity, so that:

\u0394EP_s(t) = (s^{\u03b2\u03b5}_{t+1} \u2212 s^{\u03b2\u03b5}_t) / (\u03b2\u03b5),
\u0394EP_W(t) = (1/\u03b2) (\u03c3(s^{\u03b2\u03b5}_{t+1})\u22a4 \u00b7 \u03c3(s^{\u03b2\u03b5}_{t+1}) \u2212 \u03c3(s^{\u03b2\u03b5}_t)\u22a4 \u00b7 \u03c3(s^{\u03b2\u03b5}_t)).   (14)

With \u03a6_\u03b5 as a primitive function and with the hyperparameter \u03b2 rescaled by a factor \u03b5, we recover the
discretized version of the real-time setting of Scellier and Bengio [2017], i.e. the Euler scheme of
ds/dt = \u2212\u2202E/\u2202s \u2212 \u03b2 \u2202\u2113/\u2202s with E = (1/2) ||s||\u00b2 \u2212 \u03c3(s)\u22a4 \u00b7 W \u00b7 \u03c3(s) \u2013 see Appendix B.2, where the link
between discrete-time dynamics and real-time dynamics is explained. Fig. 2 qualitatively demonstrates
Theorem 1 in this setting on a toy model.

Prototypical setting. In this case, the dynamics of the system does not derive from a primitive
function. Instead, the dynamics is directly defined as:

s_{t+1} = \u03c3(W \u00b7 s_t).   (15)

Again, W is assumed to be a symmetric matrix. The dynamics of Eq. (15) is a standard and simple
neural network dynamics. Although the model is not defined in terms of a primitive function, note that
s_{t+1} \u2248 \u2202\u03a6/\u2202s(s_t, W) with \u03a6(s, W) = (1/2) s\u22a4 \u00b7 W \u00b7 s if we ignore the activation function \u03c3. Following
Eq. (10), we define:

\u0394EP_s(t) = (1/\u03b2) (s^\u03b2_{t+1} \u2212 s^\u03b2_t),
\u0394EP_W(t) = (1/\u03b2) (s^{\u03b2\u22a4}_{t+1} \u00b7 s^\u03b2_{t+1} \u2212 s^{\u03b2\u22a4}_t \u00b7 s^\u03b2_t).   (16)

4.3 Effect of Depth and Approximation

We consider a fully connected layered architecture where layers s^n are labelled in a backward
fashion: s^0 denotes the output layer, s^1 the last hidden layer, and so forth. Two consecutive layers
are reciprocally connected with tied weights, with the convention that W_{n,n+1} connects s^{n+1} to s^n.
We study this architecture in the energy-based and prototypical settings as described per Eqs.
(13) and (15) respectively, with corresponding weight updates (14) and (16) - see details in Appendix
C.3 and C.4. We study the GDU property layer-wise, e.g. RelMSE(\u0394EP_{s^n}, \u2212\u2207BPTT_{s^n}) measures the
distance between the \u0394EP_{s^n} and \u2212\u2207BPTT_{s^n} processes, averaged over all elements of layer s^n.

We display in Fig. 3 the RelMSE, layer-wise, for the one, two and three hidden layer architectures (from
left to right), in the energy-based (upper panels) and prototypical (lower panels) settings, so that each
architecture in a given setting is displayed in one panel - see Appendix C.3 and C.4 for a detailed
description of the hyperparameters and curve samples. In terms of RelMSE, we can see that the
GDU property is best satisfied in the energy-based setting with one hidden layer, where the RelMSE is
around \u223c 10\u207b\u00b2 (top left). When adding more hidden layers in the energy-based setting (top middle
and top right), the RelMSE increases to \u223c 10\u207b\u00b9, with a greater RelMSE when going away from the
output layer. The same is observed in the prototypical setting when we add more hidden layers (lower
panels).

Figure 3: RelMSE analysis in the energy-based (top) and prototypical (bottom) settings. For one given
architecture, each bar is labelled by a layer or by the synapses connecting two layers, e.g. the orange bar
above s^1 represents RelMSE(\u0394EP_{s^1}, \u2212\u2207BPTT_{s^1}). For each architecture, the recurrent hyperparameters
T, K and \u03b2 have been tuned to make the \u0394EP and \u2212\u2207BPTT processes match best.

Compared to the energy-based setting, although the RelMSEs associated with neurons are
significantly higher in the prototypical setting, the RelMSEs associated with synapses are similar
or lower. On average, the weight updates provided by EP match well the gradients of BPTT, in the
energy-based setting as well as in the prototypical setting.

4.4 Convolutional Architecture

Figure 4: Demonstrating the GDU property with the convolutional architecture on MNIST.
Dashed
and continuous lines represent the \u0394EP and \u2212\u2207BPTT processes respectively, for 5 randomly selected
neurons (top) and synapses (bottom) in each layer. Each randomly selected neuron or synapse
corresponds to one color. Dashed and continuous lines mostly coincide. Some \u0394EP processes
collapse to zero as an effect of the non-linearity, see Appendix D for details. Interestingly, the \u0394EP_s
and \u2212\u2207BPTT_s processes are saw-teeth-shaped; Appendix C.6 accounts for this phenomenon.

In our convolutional architecture, h^n and s^n denote convolutional and fully connected layers respec-
tively. W^fc_{n,n+1} and W^conv_{n,n+1} denote the fully connected weights connecting s^{n+1} to s^n and the filters
connecting h^{n+1} to h^n, respectively. We define the dynamics as:

s^n_{t+1} = \u03c3(W^fc_{n,n+1} \u00b7 s^{n+1}_t + W^{fc\u22a4}_{n\u22121,n} \u00b7 s^{n\u22121}_t),
h^n_{t+1} = \u03c3(P(W^conv_{n,n+1} \u2217 h^{n+1}_t) + \u02dcW^conv_{n\u22121,n} \u2217 P\u207b\u00b9(h^{n\u22121}_t)),   (17)

where \u2217 and P denote convolution and pooling, respectively. Transpose convolution is defined through
convolution by the flipped kernel \u02dcW^conv, and P\u207b\u00b9 denotes inverse pooling - see Appendix D for a
precise definition of these operations and their inverses. Noting N_fc and N_conv the numbers of fully
connected and convolutional layers respectively, we can define the function:

\u03a6(x, {s^n}, {h^n}) = \u03a3_{n=0}^{N_conv\u22121} h^n \u2022 P(W^conv_{n,n+1} \u2217 h^{n+1}) + \u03a3_{n=0}^{N_fc\u22121} s^{n\u22a4} \u00b7 W^fc_{n,n+1} \u00b7 s^{n+1},   (18)

with \u2022 denoting the generalized scalar product. We note that s^n_{t+1} \u2248 \u2202\u03a6/\u2202s^n(t) and h^n_{t+1} \u2248 \u2202\u03a6/\u2202h^n(t) if we
ignore the activation function \u03c3. We define \u0394EP_{s^n}, \u0394EP_{h^n} and \u0394EP_{W^fc} as in Eq. (16). As for \u0394EP_{W^conv}, we
follow the definition of Eq. (10):

\u0394EP_{W^conv_{n,n+1}}(t) = (1/\u03b2) (P\u207b\u00b9(h^{n,\u03b2}_{t+1}) \u2217 h^{n+1,\u03b2}_{t+1} \u2212 P\u207b\u00b9(h^{n,\u03b2}_t) \u2217 h^{n+1,\u03b2}_t).   (19)

Table 1: Training results on MNIST with EP benchmarked against BPTT, in the energy-based and
prototypical settings. "EB" and "P" respectively denote "energy-based" and "prototypical", "-#h"
stands for the number of hidden layers, and WCT for "wall-clock time" in hours : minutes. We
indicate over five trials the mean and standard deviation for the test error, and the mean error in
parentheses for the train error. T (resp. K) is the number of iterations in the first (resp. second) phase.

          EP (error %)              BPTT (error %)
          Test          Train       Test          Train       T    K   Epochs  WCT
EB-1h     2.06 \u00b1 0.17   (0.13)      2.11 \u00b1 0.09   (0.46)      100  12  30      1:33
EB-2h     2.01 \u00b1 0.21   (0.11)      2.02 \u00b1 0.12   (0.29)      500  40  50      16:04
P-1h      2.00 \u00b1 0.13   (0.20)      2.00 \u00b1 0.12   (0.55)      30   10  30      0:17
P-2h      1.95 \u00b1 0.10   (0.14)      2.09 \u00b1 0.12   (0.37)      100  20  50      1:56
P-3h      2.01 \u00b1 0.18   (0.10)      2.30 \u00b1 0.17   (0.32)      180  20  100     8:27
P-conv    1.02 \u00b1 0.04   (0.54)      0.88 \u00b1 0.06   (0.12)      200  10  40      8:58

Table 2: Best training results on MNIST with EP reported in the literature.

                               EP (error %)
                               Test     Train
[Scellier and Bengio, 2017]    \u223c 2.2    (\u223c 0)
[O'Connor et al., 2018]        2.37     (0.15)
[O'Connor et al., 2019]        2.19

As can be seen in Fig. 4, the GDU property is qualitatively very well satisfied: Eq. (19) can thus be
safely used as a learning rule.
More precisely, however, some \u0394EP_{s^n} and \u0394EP_{h^n} processes collapse to
zero as an effect of the non-linearity used (see Appendix C for greater details): the EP error signals
cannot be transmitted through saturated neurons, resulting in a RelMSE of \u223c 10\u207b\u00b9 for the network
parameters - see Fig. 15 in Appendix D.

5 Discussion

Table 1 shows the accuracy results on MNIST of several variations of our approach and of the
literature. First, EP overall performs as well, or practically as well, as BPTT in terms of test accuracy
in all situations. Second, no degradation of accuracy is seen when using the prototypical (P) rather
than the energy-based (EB) setting, although the prototypical setting requires three to five times fewer
time steps in the first phase (T) and cuts the simulation time by a factor of five to eight. Finally, the
best EP result, \u223c 1% test error, is obtained with our convolutional architecture. This is also the best
performance reported in the literature on MNIST training with EP. BPTT achieves 0.90% test error
using the same architecture. This slight degradation is due to saturated neurons which do not route
error signals (as reported in the previous section). The prototypical setting allows using a much
smaller number of time steps in the first phase than Scellier and Bengio [2017] and O'Connor et al.
[2018]. On the other hand, O'Connor et al. [2019] manages to cut this number even more. This
comes at the cost of using an extra network to learn proper initial states for the EP network, which is
not needed in our approach.

Overall, our work broadens the scope of EP from its original formulation for biologically motivated
real-time dynamics and sheds new light on its practical understanding. We first extended EP to a
discrete-time setting, which reduces its computational cost and allows addressing situations closer
to conventional machine learning.
Theorem 1 demonstrated that the gradients provided by EP are
strictly equal to the gradients computed with BPTT under specific conditions. Our numerical experiments
confirmed the theorem and showed that its range of applicability extends well beyond the original
formulation of EP, to the prototypical neural networks widely used today. These results highlight that, in
principle, EP can reach the same performance as BPTT on benchmark tasks, for RNN models with
fixed input. One limitation of our theory, however, is that it has yet to be adapted to sequential data:
such an extension would require capturing and learning correlations between successive equilibrium
states corresponding to different inputs.
Layer-wise analysis of the gradients computed by EP and BPTT shows that the deeper the layer, the
more difficult it becomes to ensure the GDU property. On top of non-linearity effects, this is mainly
due to the fact that the deeper the network, the longer it takes to reach equilibrium.
While this may be a conundrum for current processors, it should not be an issue for alternative
computing schemes. Physics research is now looking at neuromorphic computing approaches that
leverage the transient dynamics of physical devices for computation [Torrejon et al., 2017, Romera
et al., 2018, Feldmann et al., 2019]. In such systems, based on magnetism or optics, dynamical
equations are solved directly by the physical circuits and components, in parallel and at speeds much
higher than those of processors. On the other hand, in such systems, the non-locality of backprop is a major
concern [Ambrogio et al., 2018]. In this context, EP appears as a powerful approach, as computing
the gradients only requires measuring the system at the end of each phase, and going backward in time
is not needed.
In the longer term, interfacing the algorithmics of EP with device physics could drastically cut the cost of inference and learning compared with conventional computers, and thereby address one of the biggest technological limitations of deep learning.

6 We also expect that our discrete-time formulation would accelerate simulations in the setting of Scellier et al. [2018], where the weights are not tied.

Acknowledgments

The authors would like to thank Joao Sacramento for feedback and discussions, as well as NSERC, CIFAR, Samsung and Canada Research Chairs for funding. Julie Grollier and Damien Querlioz acknowledge funding from the European Research Council, respectively under grants bioSPINspired (682955) and NANOINFER (715872).

References

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. Volume 2, pages 609–618, San Diego, 1987. IEEE, New York.

S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. Nolfo, S. Sidler, M. Giordano, M. Bodini, N. C. Farinha, et al. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature, 558(7708):60, 2018.

F. Crick. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.

Editorial. Big data needs a hardware revolution. Nature, 554(7691):145, Feb. 2018. doi: 10.1038/d41586-018-01683-1.

J. Feldmann, N. Youngblood, C. Wright, H. Bhaskaran, and W. Pernice. All-optical spiking neurosynaptic networks with self-learning capabilities. Nature, 569(7755):208, 2019.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.

J. R. Movellan. Contrastive Hebbian learning in the continuous Hopfield model. In Connectionist Models, pages 10–17. Elsevier, 1991.

P. O'Connor, E.
Gavves, and M. Welling. Initialized equilibrium propagation for backprop-free training. In International Conference on Machine Learning, Workshop on Credit Assignment in Deep Learning and Deep Reinforcement Learning, 2018. URL https://ivi.fnwi.uva.nl/isis/publications/2018/OConnorICML2018.

P. O'Connor, E. Gavves, and M. Welling. Training a spiking neural network with equilibrium propagation. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 1516–1523. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/o-connor19a.html.

F. J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, 1987.

M. Romera, P. Talatchian, S. Tsunegi, F. A. Araujo, V. Cros, P. Bortolotti, J. Trastoy, K. Yakushiji, A. Fukushima, H. Kubota, et al. Vowel recognition with four coupled spin-torque nano-oscillators. Nature, 563(7730):230, 2018.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

B. Scellier and Y. Bengio. Towards a biologically plausible backprop. arXiv preprint arXiv:1602.05179, 2016.

B. Scellier and Y. Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11, 2017.

B. Scellier and Y. Bengio. Equivalence of equilibrium propagation and recurrent backpropagation. Neural Computation, 31(2):312–329, 2019.

B. Scellier, A. Goyal, J. Binas, T. Mesnard, and Y. Bengio. Generalization of equilibrium propagation to vector field dynamics.
arXiv preprint arXiv:1808.04873, 2018.

C. Tallec and Y. Ollivier. Unbiased online recurrent optimization. arXiv preprint arXiv:1702.05043, 2017.

J. Torrejon, M. Riou, F. A. Araujo, S. Tsunegi, G. Khalsa, D. Querlioz, P. Bortolotti, V. Cros, K. Yakushiji, A. Fukushima, et al. Neuromorphic computing with nanoscale spintronic oscillators. Nature, 547(7664):428, 2017.

R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.