{"title": "Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures", "book": "Advances in Neural Information Processing Systems", "page_first": 9368, "page_last": 9378, "abstract": "The backpropagation of error algorithm (BP) is impossible to implement in a real brain. The recent success of deep networks in machine learning and AI, however, has inspired proposals for understanding how the brain might learn across multiple layers, and hence how it might approximate BP. As of yet, none of these proposals have been rigorously evaluated on tasks where BP-guided deep learning has proved critical, or in architectures more structured than simple fully-connected networks. Here we present results on scaling up biologically motivated models of deep learning on datasets which need deep networks with appropriate architectures to achieve good performance. We present results on the MNIST, CIFAR-10, and ImageNet datasets and explore variants of target-propagation (TP) and feedback alignment (FA) algorithms, and explore performance in both fully- and locally-connected architectures. We also introduce weight-transport-free variants of difference target propagation (DTP) modified to remove backpropagation from the penultimate layer. Many of these algorithms perform well for MNIST, but for CIFAR and ImageNet we find that TP and FA variants perform significantly worse than BP, especially for networks composed of locally connected units, opening questions about whether new architectures and algorithms are required to scale these approaches. Our results and implementation details help establish baselines for biologically motivated deep learning schemes going forward.", "full_text": "Assessing the Scalability of Biologically-Motivated\n\nDeep Learning Algorithms and Architectures\n\nSergey Bartunov\n\nDeepMind\n\nAdam Santoro\n\nDeepMind\n\nBlake A. Richards\nUniversity of Toronto\n\nLuke Marris\nDeepMind\n\nGeoffrey E. 
Hinton\n\nGoogle Brain\n\nTimothy P. Lillicrap\n\nDeepMind, University College London\n\nAbstract\n\nThe backpropagation of error algorithm (BP) is impossible to implement in a\nreal brain. The recent success of deep networks in machine learning and AI,\nhowever, has inspired proposals for understanding how the brain might learn\nacross multiple layers, and hence how it might approximate BP. As of yet, none\nof these proposals have been rigorously evaluated on tasks where BP-guided deep\nlearning has proved critical, or in architectures more structured than simple fully-\nconnected networks. Here we present results on scaling up biologically motivated\nmodels of deep learning on datasets which need deep networks with appropriate\narchitectures to achieve good performance. We present results on the MNIST,\nCIFAR-10, and ImageNet datasets, explore variants of target-propagation (TP) and\nfeedback alignment (FA) algorithms, and examine performance in both fully- and\nlocally-connected architectures. We also introduce weight-transport-free variants\nof difference target propagation (DTP) modi\ufb01ed to remove backpropagation from\nthe penultimate layer. Many of these algorithms perform well for MNIST, but for\nCIFAR and ImageNet we \ufb01nd that TP and FA variants perform signi\ufb01cantly worse\nthan BP, especially for networks composed of locally connected units, opening\nquestions about whether new architectures and algorithms are required to scale\nthese approaches. Our results and implementation details help establish baselines\nfor biologically motivated deep learning schemes going forward.\n\n1\n\nIntroduction\n\nThe suitability of the backpropagation of error (BP) algorithm [32] for explaining learning in the\nbrain was questioned soon after it was popularized [11, 8]. 
Weaker objections included undesirable\ncharacteristics of artificial networks in general, such as their violation of Dale’s Law, their lack of\ncell-type variability, and the need for the gradient signals to be both positive and negative. More\nserious objections were: (1) The need for the feedback connections carrying the gradient to have the\nsame weights as the corresponding feedforward connections and (2) The need for a distinct form of\ninformation propagation (error feedback) that does not influence neural activity, and hence does not\nconform to known biological feedback mechanisms underlying neural communication. Researchers\nhave long sought biologically plausible and empirically powerful learning algorithms that avoid these\nflaws [2, 30, 31, 1, 26, 39, 14, 16, 12, 5, 23]. Recent work has demonstrated that the first objection\nmay not be as problematic as often supposed [22]: the feedback alignment (FA) algorithm uses\nrandom weights in backward pathways to successfully deliver error information to earlier layers. At\nthe same time, FA still suffers from the second objection: it requires the delivery of signed error\nvectors via a distinct pathway.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n\fAnother family of promising approaches to biologically motivated deep learning – such as Contrastive\nHebbian Learning [24], and Generalized Recirculation [26] – uses top-down feedback connections\nto influence neural activity, and differences in feedforward-driven and feedback-driven activities (or\nproducts of activities) to locally approximate gradients [1, 31, 26, 39, 4, 36, 38]. Since these activity\npropagation methods don’t require explicit propagation of gradients through the network, they go\na long way towards answering the second serious objection noted above. 
However, many of these\nmethods require long “positive” and “negative” settling phases for computing the activities whose\ndifferences provide the learning signal. Proposals for shortening the phases [13, 6] are not entirely\nsatisfactory as they still fundamentally depend on a settling process, and, in general, any settling\nprocess will likely be too slow for a brain that needs to quickly compute hidden activities in order to\nact in real time.\nPerhaps the most practical among this family of “activity propagation” algorithms is target propagation\n(TP) and its variants [19, 20, 13, 3, 21]. TP avoids the weight transport problem by training a distinct\nset of feedback connections that define the backward activity propagation. These connections are\ntrained to approximately invert the computation of the feedforward connections in order to be able to\ncompute target activities for each layer by successively inverting the desired output target. Another\nappealing property of TP is that the errors guiding weight updates are computed locally along with\nbackward activities.\nWhile TP and its variants are promising as biologically-motivated algorithms, there are lingering\nquestions about their applicability to the brain. First, the only variant explored empirically (i.e. DTP)\nstill depends on explicit gradient computation via backpropagation for learning the penultimate layer’s\noutgoing synaptic weights (see Algorithm Box 1 in Lee et al. [21]). Second, they have not been\nrigorously tested on datasets more difficult than MNIST. And third, they have not been incorporated\ninto architectures more complicated than simple multi-layer perceptrons (MLPs).\nOn this second point, it might be argued that an algorithm’s inability to scale to difficult machine\nlearning datasets is a red herring when assessing whether it could help us understand learning in the\nbrain. 
Performance on isolated machine learning tasks using a model that lacks other adaptive neural\nphenomena \u2013 e.g., varieties of plasticity, evolutionary priors, etc. \u2013 makes a statement about the lack\nof these phenomena as much as it does about the suitability of an algorithm. Nonetheless, we argue\nthat there is a need for behavioural realism, in addition to physiological realism, when gathering\nevidence to assess the overall biological realism of a learning algorithm. Given that human beings\nare able to learn complex tasks that bear little relationship to their evolution, it would appear that\nthe brain possesses a powerful, general-purpose learning algorithm for shaping behavior. As such,\nresearchers can, and should, seek learning algorithms that are both more plausible physiologically,\nand scale up to the sorts of complex tasks that humans are capable of learning. Augmenting a model\nwith adaptive capabilities is unlikely to unveil any truths about the brain if the model\u2019s performance\nis crippled by an insuf\ufb01ciently powerful learning algorithm. On the other hand, demonstrating good\nperformance with even a vanilla arti\ufb01cial neural network provides evidence that, at the very least,\nthe learning algorithm is not limiting. Ultimately, we need a con\ufb02uence of evidence for: (1) the\nsuf\ufb01ciency of a learning algorithm, (2) the impact of biological constraints in a network, and (3) the\nnecessity of other adaptive neural capabilities. This paper focuses on addressing the \ufb01rst two.\nIn this work our contribution is threefold:\n(1) We examine the learning and performance of\nbiologically-motivated algorithms on MNIST, CIFAR, and ImageNet. (2) We introduce variants of\nDTP which eliminate signi\ufb01cant lingering biologically implausible features from the algorithm. 
(3)\nWe investigate the role of weight-sharing convolutions, which are key to performance on difficult\ndatasets in artificial neural networks, by testing the effectiveness of locally connected architectures\ntrained with BP and variants of FA and TP.\nOverall, our results are largely negative. That is, we find that none of the tested algorithms are capable\nof effectively scaling up to training large networks on ImageNet. There are three possible interpretations of these results: (1) Existing algorithms need to be modified, added to, and/or optimized to\naccount for learning in the real brain, (2) research should continue into new physiologically realistic\nlearning algorithms that can scale up, or (3) we need to appeal to other adaptive capacities to account\nfor the fact that humans are able to perform well on this task. Ultimately, our negative results are\nimportant because they demonstrate the need for continued work to understand the power of learning\nin the human brain. More broadly, we suggest that behavioural realism, as judged by performance\n\nFigure 1: In BP and DTP, the final layer target is used to compute a loss, and the gradients from this\nloss are shuttled backwards (through all layers, in BP, or just one layer, in DTP) in error propagation\nsteps that do not influence actual neural activity. 
SDTP never transports gradients using error\npropagation steps, unlike DTP and BP.\n\non difficult tasks, should increasingly become one of the metrics used in evaluating the biological\nrealism of computational models and algorithms.\n\n2 Learning in Multilayer Networks\nConsider the case of a feed-forward neural network with L layers {hl}_{l=1}^L, whose activations hl are\ncomputed by elementwise-applying a non-linear function σl to an affine transformation of previous\nlayer activations hl−1:\n\nhl = f(hl−1; θl) = σl(Wl hl−1 + bl),  θl = {Wl, bl},  (1)\n\nwith input to the network denoted as h0 = x and the last layer hL used as output.\nIn classification problems the output layer hL parametrizes a predicted distribution over possible\nlabels p(y|hL), usually using the softmax function. The learning signal is then provided as a loss\nL(hL) incurred by making a prediction for an input x, which in the classification case can be\ncross-entropy between the ground-truth label distribution q(y|x) and the predicted one: L(hL) =\n−Σ_y q(y|x) log p(y|hL). The goal of training is then to adjust the parameters Θ = {θl}_{l=1}^L in order\nto minimize a given loss over the training set of inputs.\n\n2.1 Backpropagation\n\nBackpropagation [32] was popularized as a method for training neural networks by computing\ngradients with respect to layer parameters using the chain rule:\n\ndL/dhl = (dhl+1/dhl)^T dL/dhl+1,  dL/dθl = (dhl/dθl)^T dL/dhl,  dhl+1/dhl = Wl+1 diag(σ′l+1(Wl+1 hl + bl+1)).\n\nThus, gradients are obtained by first propagating activations forward to the output layer via eq. 1,\nand then recursively applying these backward equations. These equations imply that gradients are\npropagated backwards through the network using weights symmetric to their feedforward counterparts. This is biologically problematic because it implies a mode of information propagation (error\npropagation) that does not influence neural activity, and that depends on an implausible network\narchitecture (symmetric weight connectivity for feedforward and feedback directions, which is called\nthe weight transport problem).\n\n2.1.1 Feedback alignment\n\nWhile we focus on TP variants in this manuscript, with the purpose of a more complete experimental\nstudy of biologically motivated algorithms, we explore FA as another baseline. FA replaces the\ntranspose weight matrices in the backward pass for BP with fixed random connections. Thus,\nFA shares features with both target propagation and conventional backpropagation. On the one\nhand, it alleviates the weight transport problem by maintaining a separate set of connections that,\nunder certain conditions, lead to synchronized learning of the network. On the other hand, similar\nto backpropagation, FA transports signed error information in the backward pass, which may be\nproblematic to implement as a plausible neural computation. We consider both the classical variant\nof FA [23] with random feedback weights at each hidden layer, and the recently proposed Direct\nFeedback Alignment [25] (DFA), or Broadcast Feedback Alignment [35], which connects feedback\nfrom the output layer directly to all previous layers.\n\n2.1.2 Target propagation and its variants\n\nUnlike backpropagation, where backwards communication passes on gradients without inducing or\naltering neural activity, the backward pass in target propagation [19, 20, 3, 21] takes place in the\nsame space as the forward-pass neural activity. 
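As a concrete reference for the backward schemes just described, here is a minimal NumPy sketch (ours, not the paper's implementation; layer sizes and the squared-error loss are illustrative) of the forward pass of eq. (1) and of a backward pass that carries the error either through the transposed forward weights (BP, i.e. weight transport) or through fixed random feedback matrices (FA).

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, params):
    """Eq. (1): h_l = sigma_l(W_l h_{l-1} + b_l), with tanh as sigma_l."""
    hs = [x]
    for W, b in params:
        hs.append(np.tanh(W @ hs[-1] + b))
    return hs

def backward(hs, params, dL_dhL, feedback=None):
    """Recursive backward equations of Sec. 2.1. With feedback=None the
    transposed forward weights W_l^T carry the error (BP, i.e. weight
    transport); a list of fixed random matrices instead gives the
    feedback-alignment (FA) backward pass."""
    grads = [None] * len(params)
    delta = dL_dhL
    for l in reversed(range(len(params))):
        W, _ = params[l]
        delta = delta * (1.0 - hs[l + 1] ** 2)      # through tanh derivative
        grads[l] = (np.outer(delta, hs[l]), delta)  # dL/dW_l, dL/db_l
        B = W.T if feedback is None else feedback[l]
        delta = B @ delta                           # error sent to layer l-1
    return grads

# Toy 4-5-3 network with a squared-error loss on the output.
sizes = [4, 5, 3]
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x, target = rng.standard_normal(4), np.eye(3)[0]
hs = forward(x, params)
bp_grads = backward(hs, params, hs[-1] - target)
fa_feedback = [rng.standard_normal((n, m)) for n, m in zip(sizes[:-1], sizes[1:])]
fa_grads = backward(hs, params, hs[-1] - target, feedback=fa_feedback)
```

Note that the only difference between the two passes is which matrix transports the error downward; the form of the weight update itself is identical.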
The backward induced activities are those that layers\nshould strive to match so as to produce the target output. After feedforward propagation given some\ninput, the final output layer hL is trained directly to minimize the loss L, while all other layers are\ntrained so as to match their associated targets.\nIn general, good targets are those that would minimize the loss computed in the output layer if they had\nbeen realized in feedforward propagation. In networks with invertible layers one could generate such\ntargets by first finding a loss-optimal output activation ĥL (e.g. the correct label distribution) and\nthen propagating it back using inverse transformations ĥl = f−1(ĥl+1; θl+1). Since it is hard to\nmaintain invertibility in a network, approximate inverse transformations (or decoders) can be learned:\ng(hl+1; λl+1) ≈ f−1(hl+1; θl+1). Note that this learning obviates the need for symmetric weight\nconnectivity.\nThe generic form of target propagation algorithms we consider in this paper can be summarized as a\nscheduled minimization of two kinds of losses for each layer.\n\n1. Reconstruction or inverse loss Linv_l(λl) = ‖hl−1 − g(f(hl−1; θl); λl)‖²₂ is used to\ntrain the approximate inverse, which is parametrized similarly to the forward computation:\ng(hl; λl) = σl(Vl hl + cl), λl = {Vl, cl}, where activations hl−1 are assumed to be propagated\nfrom the input. One can imagine other learning rules for the inverse; for example,\nthe original DTP algorithm trained inverses on noise-corrupted versions of activations with\nthe purpose of improved generalization. The loss is applied for every layer except the first,\nsince the first layer does not need to propagate target inverses backwards.\n\n2. Forward loss Ll(θl) = ‖f(hl; θl) − ĥl+1‖²₂ penalizes the layer parameters for producing\nactivations different from their targets. Parameters of the last layer are trained to minimize\nthe task’s loss L directly.\n\nUnder this framework both losses are local and involve only a single layer’s parameters; implicit\ndependencies on other layers’ parameters are ignored. Variants differ in the way targets ĥl are\ncomputed.\n\nTarget propagation “Vanilla” target propagation (TP) computes targets by propagating the higher\nlayers’ targets backwards through layer-wise inverses; i.e. ĥl = g(ĥl+1; λl+1). For traditional\ncategorization tasks the same 1-hot vector in the output will always map back to precisely the same\nhidden unit activities in a given layer. Thus, this kind of naive TP may have difficulties when\ndifferent instances of the same class have different appearances, since it will attempt to make their\nrepresentations identical even in the early layers. As well, there are no guarantees about how TP will\nbehave when the inverses are imperfect.\n\nDifference target propagation Both TP and DTP update the output weights and biases using\nthe standard delta rule; this is biologically unproblematic because it does not require weight transport [26, 23]. For most other layers in the network, DTP [21] computes targets as\n\nĥl = g(ĥl+1; λl+1) + [hl − g(hl+1; λl+1)].  (2)\n\nThe second term is the error in the reconstruction, which provides a stabilizing linear correction for\nimprecise inverse functions. However, in the original work by Lee et al. 
[21] the penultimate layer\ntarget, ĥL−1, was computed using gradients from the network’s loss, rather than by target propagation.\nThat is, ĥL−1 = hL−1 − α ∂L(hL)/∂hL−1, rather than ĥL−1 = hL−1 − g(hL; λL) + g(ĥL; λL). Though not\nstated explicitly, this approach was presumably taken to ensure that the penultimate layer received\nreasonable and diverse targets despite the low-dimensional 1-hot targets at the output layer. When\nthere are a small number of 1-hot targets (e.g. 10 classes), learning a good inverse mapping from\nthese vectors back to the hidden activity of the penultimate hidden layer (e.g. 1000 units) might be\nproblematic, since the inverse mapping cannot provide information that is both useful and unique to a\nparticular input sample x. Using BP in the penultimate layer sidesteps this concern, but deviates from\nthe intent of using these algorithms to avoid gradient computation and delivery.\n\nSimplified difference target propagation We introduce SDTP as a simple modification to DTP.\nIn SDTP we compute the target for the penultimate layer as ĥL−1 = hL−1 − g(hL; λL) + g(ĥL; λL),\nwhere ĥL = argmin_{hL} L(hL), i.e. the correct label distribution. This completely removes biologically\ninfeasible gradient communication (and hence weight transport) from the algorithm. However, it\nis not clear whether targets for the penultimate layer will be diverse enough (given low-entropy\nclassification targets) or precise enough (given the inevitable poor performance of the learned inverse\nfor this layer). The latter is particularly important if the dimensionality of the penultimate layer is\nmuch larger than that of the output layer, which is the case for classification problems with a small number\nof classes. Hence, this modification is a non-trivial change that requires empirical investigation. 
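To make the distinction concrete, the following is a small NumPy sketch (ours; the layer sizes, learning-rate constant, and noise level are illustrative, and the softmax output is omitted) of how DTP and SDTP compute layer targets, together with the noise-corrupted inverse loss used for training the decoders.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(h, theta):   # forward layer, as in eq. (1)
    W, b = theta
    return np.tanh(W @ h + b)

def g(h, lam):     # learned approximate inverse g(.; lambda)
    V, c = lam
    return np.tanh(V @ h + c)

def compute_targets(hs, lams, h_hat_L, grad_penultimate=None, alpha=0.1):
    """Targets per eq. (2): h^_l = h_l - g(h_{l+1}) + g(h^_{l+1}).
    DTP (Lee et al.) sets the penultimate target from the loss gradient;
    SDTP (grad_penultimate=None) applies eq. (2) there too, with h^_L the
    correct label distribution, so no gradients are ever transported."""
    L = len(hs) - 1
    targets = {L: h_hat_L}
    if grad_penultimate is not None:   # DTP's backprop step
        targets[L - 1] = hs[L - 1] - alpha * grad_penultimate
    else:                              # SDTP
        targets[L - 1] = hs[L - 1] - g(hs[L], lams[L]) + g(h_hat_L, lams[L])
    for l in range(L - 2, 0, -1):
        targets[l] = hs[l] - g(hs[l + 1], lams[l + 1]) + g(targets[l + 1], lams[l + 1])
    return targets

def inverse_loss(h_prev, theta, lam, sigma=0.1):
    """Reconstruction loss on noise-corrupted activity, as in Algorithm 1."""
    h_noisy = h_prev + sigma * rng.standard_normal(h_prev.shape)
    return np.sum((h_prev - g(f(h_noisy, theta), lam)) ** 2)

# Toy 6-5-4-3 network.
sizes = [6, 5, 4, 3]
thetas = {l: (rng.standard_normal((m, n)) * 0.3, np.zeros(m))
          for l, (n, m) in enumerate(zip(sizes[:-1], sizes[1:]), start=1)}
lams = {l: (rng.standard_normal((n, m)) * 0.3, np.zeros(n))
        for l, (n, m) in enumerate(zip(sizes[:-1], sizes[1:]), start=1)}
hs = {0: rng.standard_normal(6)}
for l in range(1, 4):
    hs[l] = f(hs[l - 1], thetas[l])
targets = compute_targets(hs, lams, h_hat_L=np.eye(3)[1])
```

A useful sanity check on such an implementation: when the output target equals the actual output, the difference correction in eq. (2) cancels exactly and every layer's target collapses to its own activity, i.e. no learning signal remains.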
In\nSection 3 we evaluate SDTP in the presence of low-entropy targets (classification problems) and also\nconsider the problem of learning an autoencoder (for which targets are naturally high-dimensional\nand diverse) in the supplementary material.\n\nAlgorithm 1 Simplified Difference Target Propagation\n\nPropagate activity forward:\nfor l = 1 to L do\n  hl ← f(hl−1; θl)\nend for\nCompute first target: ĥL ← argmin_{hL} L(hL)\nCompute targets for lower layers:\nfor l = L − 1 to 1 do\n  ĥl ← hl − g(hl+1; λl+1) + g(ĥl+1; λl+1)\nend for\nTrain inverse function parameters:\nfor l = L to 2 do\n  Generate corrupted activity h̃l−1 = hl−1 + ε, ε ∼ N(0, σ²)\n  Update parameters λl using SGD on loss Linv_l(λl) = ‖hl−1 − g(f(h̃l−1; θl); λl)‖²₂\nend for\nTrain feedforward function parameters:\nfor l = 1 to L do\n  Update parameters θl using SGD on loss Ll(θl) = ‖f(hl; θl) − ĥl+1‖²₂ if l < L, else LL(θL) = L (task loss)\nend for\n\nAuxiliary output SDTP As outlined above, in the context of 1-hot classification, SDTP produces\nonly weak targets for the penultimate layer, i.e. one for each possible class label. 
To circumvent\nthis problem, we extend SDTP by introducing a composite structure for the output layer hL = [o, z],\nwhere o is the predicted class distribution on which the loss is computed and z is an auxiliary output\nvector that is meant to provide additional information about activations of the penultimate layer hL−1.\nThus, the inverse computation g(hL; λL) can be performed conditional on richer information from\nthe input, not just on the relatively weak information available in the predicted and actual label.\n\nThe auxiliary output z is used to generate targets for the penultimate layer as follows:\n\nĥL−1 = hL−1 − gL(o, z; λL) + gL(ô, z; λL),  (3)\n\nwhere o is the predicted class distribution, ô is the correct class distribution, and z, produced from\nhL−1, is used in both inverse computations. Here gL(ô, z; λL) can be interpreted as a modification of\nhL that preserves certain features of the original hL that can also be classified as ô. The parameters\nλL can still be learned using the usual inverse loss. But the parameters of the forward computation θL−1\nused to produce z are difficult to learn in a way that maximizes their effectiveness for reconstruction\nwithout backpropagation. Thus, we studied a variant that does not require backpropagation: we\nsimply do not optimize the forward weights for z, so z is just a set of random features of hL−1.\n\nParallel and alternating training of inverses In the original implementation of DTP¹, the authors\ntrained forward and inverse model parameters by alternating between their optimizations; in practice\nthey trained one loss for one full epoch of the training set before switching to training the other\nloss. 
We considered a variant that simply optimizes both losses in parallel, which seems nominally\nmore plausible in the brain since both forward and feedback connections are thought to undergo\nplasticity changes simultaneously \u2014 though it is possible that a kind of alternating learning schedule\nfor forward and backward connections could be tied to wake/sleep cycles.\n\n2.2 Biologically-plausible network architectures\n\nConvolution-based architectures have been critical for achieving state of the art in image recog-\nnition [18]. These architectures are biologically implausible, however, because of their extensive\nweight sharing. To implement convolutions in biology, many neurons would need to share the values\nof their weights precisely \u2014 a requirement with no empirical support. In the absence of weight\nsharing, the \u201clocally connected\u201d receptive \ufb01eld structure of convolutional neural networks is in fact\nvery biologically realistic and may still offer a useful prior. Under this prior, neurons in the brain\ncould sample from small areas of visual space, then pool together to create spatial maps of feature\ndetectors.\nOn a computer, sharing the weights of locally connected units greatly reduces the number of free\nparameters and this has several bene\ufb01cial effects on simulations of large neural nets. It improves\ngeneralization and it drastically reduces both the amount of memory needed to store the parameters\nand the amount of communication required between replicas of the same model running on different\nsubsets of the data on different processors. From a biological perspective we are interested in how\nTP and FA compare with BP without using weight sharing, so both our BP results and our TP and\nFA results are considerably worse than convolutional neural nets and take far longer to produce. 
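The difference between a weight-shared convolution and its biologically more plausible locally connected counterpart can be sketched in one dimension (our illustrative example, not the paper's architecture): both have the same receptive-field structure, but the locally connected layer keeps an independent kernel for each output position.

```python
import numpy as np

def conv1d_valid(x, w):
    """Convolutional layer: a single kernel w shared across all positions."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

def locally_connected1d(x, ws):
    """Locally connected layer: same receptive fields, but an independent
    kernel ws[i] for every output position (no weight sharing)."""
    return np.array([x[i:i + ws.shape[1]] @ ws[i] for i in range(len(ws))])

x = np.arange(8.0)
w = np.array([1.0, 0.0, -1.0])
ws = np.tile(w, (6, 1))  # initialize the unshared kernels at the shared solution
assert np.allclose(conv1d_valid(x, w), locally_connected1d(x, ws))
ws[0, 1] += 1.0          # each position is now free to learn its own weights

# Parameter counts: k for the convolution vs. (len(x) - k + 1) * k without sharing.
```

Initialized at the shared solution the two layers are identical; training then lets the locally connected kernels drift apart, which is exactly the extra freedom (and extra parameter cost) discussed above.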
We\nassess the degree to which BP-guided learning is enhanced by convolutions, and not BP per se, by\nevaluating learning methods (including BP) on networks with locally connected layers.\n\n3 Experiments\n\nIn this section we experimentally evaluate variants of target propagation, backpropagation, and\nfeedback alignment [23, 25]. We focused our attention on TP variants. We found all of the variants\nwe explored to be quite sensitive to the choice of hyperparameters and network architecture, espe-\ncially in the case of locally-connected networks. With the aim of understanding the limits of the\nconsidered algorithms, we manually searched for architectures well suited to DTP. Then we \ufb01xed\nthese architectures for BP and FA variants and ran independent hyperparameter searches for each\nlearning method. Finally, we report best errors achieved in 500 epochs. For additional details see\nTables 3 and 4 in the Appendix.\nFor optimization we use Adam [15], with different hyper-parameters for forward and inverse models\nin the case of target propagation. All layers are initialized using the method suggested by Glorot &\nBengio [10]. In all networks we used the hyperbolic tangent as a nonlinearity between layers as it\nwas previously found to work better with DTP than ReLUs [21].\n\n1https://github.com/donghyunlee/dtp/blob/master/conti_dtp.py\n\n6\n\n\fTable 1: Train and test errors (%) achieved by different learning methods for fully-connected (FC)\nand locally-connected (LC) networks on MNIST and CIFAR. 
We highlight best and second best\nresults.\n\n(a) MNIST (train / test error, %)\n\nMETHOD | FC | LC\nDTP, PARALLEL | 0.44 / 2.86 | 0.00 / 1.52\nDTP, ALTERNATING | 0.00 / 1.83 | 0.00 / 1.46\nSDTP, PARALLEL | 1.14 / 3.52 | 0.00 / 1.98\nSDTP, ALTERNATING | 0.00 / 2.28 | 0.00 / 1.90\nAO-SDTP, PARALLEL | 0.96 / 2.93 | 0.00 / 1.92\nAO-SDTP, ALTERNATING | 0.00 / 1.86 | 0.00 / 1.91\nFA | 0.00 / 1.85 | 0.00 / 1.26\nDFA | 0.85 / 2.75 | 0.23 / 2.05\nBP | 0.00 / 1.48 | 0.00 / 1.17\nBP CONVNET | – | 0.00 / 1.01\n\n(b) CIFAR (train / test error, %)\n\nMETHOD | FC | LC\nDTP, PARALLEL | 59.45 / 59.14 | 28.69 / 39.47\nDTP, ALTERNATING | 30.41 / 42.32 | 28.54 / 39.47\nSDTP, PARALLEL | 51.48 / 55.32 | 43.00 / 46.63\nSDTP, ALTERNATING | 48.65 / 54.27 | 40.40 / 45.66\nAO-SDTP, PARALLEL | 4.28 / 47.11 | 32.67 / 40.05\nAO-SDTP, ALTERNATING | 0.00 / 45.40 | 34.11 / 40.21\nFA | 25.62 / 41.97 | 17.46 / 37.44\nDFA | 33.35 / 47.80 | 32.74 / 44.41\nBP | 28.97 / 41.32 | 0.83 / 32.41\nBP CONVNET | – | 1.39 / 31.87\n\nFigure 2: Train (dashed) and test (solid) classification errors on CIFAR.\n\n3.1 MNIST\nTo compare to previously reported results we began with the MNIST dataset, consisting of 28 × 28\ngray-scale images of hand-drawn digits. The final performance for all algorithms is reported in\nTable 1 and the learning dynamics are plotted in Figure 8 (see Appendix). Our implementation of\nDTP matches the performance of the original work [21]. However, all variants of TP performed\nslightly worse than BP, with a larger gap for SDTP, which does not rely on any gradient propagation.\nInterestingly, alternating optimization of forward and inverse losses consistently demonstrates more\nstable learning and better final performance.\n\n3.2 CIFAR-10\nCIFAR-10 is a more challenging dataset introduced by Krizhevsky [17]. It consists of 32 × 32\nRGB images of 10 categories of objects in natural scenes. In contrast to MNIST, classes in CIFAR-10\ndo not have a “canonical appearance” such as a “prototypical bird” or “prototypical truck” as\nopposed to “prototypical 7” or “prototypical 9”. 
This makes them harder to classify with simple\ntemplate matching, making depth imperative for achieving good performance. The only prior study\nof biologically motivated learning methods applied to this data was carried out by Lee et al. [21];\nthis investigation was limited to DTP with alternating updates and fully connected architectures.\nHere we present a more comprehensive evaluation that includes locally-connected architectures and\nexperiments with an augmented training set consisting of vertical flips and random crops applied to\nthe original images.\nFinal results can be found in Table 1. Overall, the results on CIFAR-10 are similar to those obtained\non MNIST, though the gap between TP and backpropagation, as well as between different variants\nof TP, is more prominent. Moreover, while fully-connected DTP-alternating roughly matched the\nperformance of BP, locally-connected networks presented an additional challenge for TP, yielding\nonly a minor improvement.\nThe issue of compatibility with locally-connected layers is yet to be understood. One possible\nexplanation is that the inverse computation might benefit from a form that is not symmetric to the\nforward computation. We experimented with more expressive inverses, such as having larger receptive\nfields or a fully-connected structure, but these did not lead to any significant improvements. We leave\nfurther investigation of this question to future work.\nAs with MNIST, a BP-trained convolutional network with shared weights performed better than\nits locally-connected variant. 
The gap, however, is not large, suggesting that weight sharing is not\nnecessary for good performance as long as the learning algorithm is effective.\nWe hypothesize that the significant gap in performance between DTP and the gradient-free SDTP on\nCIFAR-10 is due to problems with inverting a low-entropy target in the output layer. To validate\nthis hypothesis, we ran AO-SDTP with 512 auxiliary output units and compared its performance with\nother variants of TP. Even though the observed results do not match the performance of DTP, they\nstill present a large improvement over SDTP. This confirms the importance of target diversity for\nlearning in TP (see Appendix 5.5 for related experiments) and provides reasonable hope that future\nwork in this area could further improve the performance of SDTP.\nThe feedback alignment algorithm performed quite well on both MNIST and CIFAR, struggling only\nwith the LC architecture on CIFAR. In contrast, DFA appeared to be quite sensitive to the choice of\narchitecture, and our architecture search was guided by the performance of TP methods. Thus, the\nnumbers achieved by DFA in our experiments should be regarded only as a rough approximation\nof the attainable performance for the algorithm. In particular, DFA appears to struggle with the\nrelatively narrow (256-unit) layers used in the fully-connected MNIST case — see Lillicrap et al. [23]\nSupplementary Information for a possible explanation. Under these conditions, DFA fails to match\nBP in performance, and also tends to fall behind DTP and AO-SDTP, especially on CIFAR.\n\n3.3 ImageNet\n\nWe assessed performance of the methods on the ImageNet dataset [33], a large-scale benchmark that\nhas propelled recent progress in deep learning. To the best of our knowledge, this is the first empirical\nstudy of biologically-motivated methods and architectures conducted on a dataset of such scale and\ndifficulty. 
ImageNet has 1000 object classes appearing in a variety of natural scenes and captured in high-resolution images (resized to 224 × 224).

Final results are reported in Table 2. Unlike on MNIST and CIFAR, on ImageNet all biologically motivated algorithms performed very poorly relative to BP. A number of factors could contribute to this result. One factor may be that deeper networks require more careful hyperparameter tuning; for example, different learning rates or amounts of noise injected for each layer.

Table 2: Test errors on ImageNet.

METHOD                      TOP-1    TOP-5
DTP, PARALLEL               98.34    94.56
DTP, ALTERNATING            99.36    97.28
SDTP, PARALLEL              99.28    97.15
FA                          93.08    82.54
BACKPROPAGATION             71.43    49.07
BACKPROPAGATION, CONVNET    63.93    40.17

[Figure 3: Top-1 (solid) and Top-5 (dotted) test errors on ImageNet. Color legend is the same as for Figure 2.]

A second factor might be a general incompatibility between the mainstream design choices for convolutional networks and the TP and FA algorithms. Years of research have led to a better understanding of efficient architectures, weight initialization, and optimizers for convolutional networks trained with backpropagation, and perhaps more effort is required to reach comparable results for biologically motivated algorithms and architectures. Addressing both of these factors could help improve performance, so it would be premature to conclude that TP cannot perform adequately on ImageNet. We can conclude, though, that out-of-the-box application of this class of algorithms does not provide a straightforward solution to real data on even moderately large networks.

We note that FA demonstrated an improvement over TP, yet still performed much worse than BP. It was not practically feasible to run its sibling, DFA, on large networks such as the one we used in our ImageNet experiments.
This was due to the practical necessity of maintaining a large fully-connected feedback layer of weights from the output layer to each intermediate layer. Modern convolutional architectures tend to have very large activation dimensions, and the requirement for linear projections back to all of the neurons in the network is practically intractable: on a GPU with 16GB of onboard memory, we encountered out-of-memory errors when trying to initialize and train these networks using a TensorFlow implementation. Thus, the DFA algorithm appears to require either modification or GPUs with more memory to run with large networks.

4 Discussion

Historically, there has been significant disagreement about whether BP can tell us anything interesting about learning in the brain [8, 11]. Indeed, from the mid-1990s to 2010, work on applying insights from BP to help understand learning in the brain declined precipitously. Recent progress in machine learning has prompted a revival of this debate; where other approaches have failed, deep networks trained via BP have been key to achieving impressive performance on difficult datasets such as ImageNet. It is once again natural to wonder whether some approximation of BP might underlie learning in the brain [22, 5]. However, none of the algorithms proposed as approximations of BP have been tested on the datasets that were instrumental in convincing the machine learning and neuroscience communities to revisit these questions.

Here we studied TP and FA, and introduced a straightforward variant of the DTP algorithm that completely removes gradient propagation and weight transport. We demonstrated that networks trained with SDTP, without any weight sharing (i.e., weight transport in the backward pass or weight tying in convolutions), perform much worse than networks trained with DTP, likely because of impoverished output targets. We also studied an approach to rescue the performance of SDTP.
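For concreteness, the difference correction that distinguishes DTP from plain target propagation can be sketched in a few lines of NumPy. This is a toy illustration with made-up layer shapes and an untrained stand-in for the learned inverse, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the learned inverse g of the layer above (4 -> 3 units here);
# in DTP, g is trained so that g(f(h)) approximately recovers h.
V = rng.normal(size=(4, 3))

def g(h_above):
    return np.tanh(h_above @ V)

def dtp_target(h, h_above, h_above_target):
    # Difference target propagation (Lee et al. [21]): subtracting g(h_above)
    # cancels the reconstruction error of the imperfect inverse, so when the
    # layer above already hits its target, this layer's target is exactly h.
    return h + g(h_above_target) - g(h_above)

h = rng.normal(size=3)          # this layer's activation
h_above = rng.normal(size=4)    # activation of the layer above
h_above_target = h_above - 0.1  # target handed down from the layer above
target = dtp_target(h, h_above, h_above_target)
```

The correction term is what lets DTP tolerate imperfect inverses; SDTP keeps this structure but removes the remaining gradient computation at the output layer.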
Overall, while some variants of TP and FA came close to matching the performance of BP on MNIST and CIFAR, all of the biologically motivated algorithms performed much worse than BP in the context of ImageNet. Our experiments are far from exhaustive, and we hope that researchers in the field may coordinate to study the performance of other recently introduced biologically motivated algorithms, e.g. [28, 27].

We note that although the TP and FA algorithms go a long way towards biological plausibility, there are still many biological constraints that we did not address here. For example, we set aside the question of spiking neurons entirely to focus on asking whether variants of TP can scale up to solve difficult problems at all. The question of spiking networks is an important one [35, 12, 7, 34], but it should nevertheless be possible to gain algorithmic insight into the brain without tackling all of the elements of biological complexity simultaneously. Similarly, we also ignored Dale's law in all of our experiments [29]. In general, we aimed for the simplest models that allow us to address questions around (1) weight sharing, and (2) the form and function of feedback communication. However, it is worth noting that our work here ignores one other significant issue with respect to the plausibility of feedback communication: BP, FA, all of the TP variants, and indeed most known activation propagation algorithms (for an exception see Sacramento et al. [34]) still require distinct forward and backward (or "positive" and "negative") phases. The way in which forward and backward pathways in the brain interact is not well characterized, but we are not aware of existing evidence that straightforwardly supports distinct phases.

Nevertheless, algorithms that aim to illuminate learning in cortex should be able to perform well on difficult domains without relying on any form of weight sharing.
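The weight-transport problem that FA sidesteps can likewise be illustrated in a few lines: the backward pass uses a fixed random matrix in place of the transposed forward weights. This is a toy NumPy sketch with made-up layer sizes and a single training example, not the networks trained in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 10 -> 8 -> 4 network (hypothetical sizes).
W1 = 0.1 * rng.normal(size=(10, 8))
W2 = 0.1 * rng.normal(size=(8, 4))
B2 = rng.normal(size=(4, 8))  # fixed random feedback weights; never set to W2.T

x, y = rng.normal(size=10), rng.normal(size=4)

def fa_step(W1, W2, lr=0.02):
    h = np.tanh(x @ W1)
    e = h @ W2 - y                      # output error
    delta_h = (e @ B2) * (1 - h ** 2)   # feedback alignment: B2 replaces W2.T
    W1 = W1 - lr * np.outer(x, delta_h)
    W2 = W2 - lr * np.outer(h, e)
    return W1, W2, float(e @ e)

errors = []
for _ in range(200):
    W1, W2, err = fa_step(W1, W2)
    errors.append(err)
```

Because the forward weights come to align with the fixed feedback weights during training, the random projection still delivers a usable teaching signal, with no symmetric copy of the forward weights required.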
Thus, our results offer a new benchmark for future work looking to evaluate the effectiveness of biologically plausible algorithms in more powerful architectures and on more difficult datasets.

Acknowledgments

We would like to thank Shakir Mohamed, Wojtek Czarnecki, Yoshua Bengio, Rafal Bogacz, Walter Senn, Joao Sacramento, James Whittington, and Benjamin Scellier for useful discussions.

References

[1] Ackley, David H, Hinton, Geoffrey E, and Sejnowski, Terrence J. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

[2] Almeida, Luis B. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Artificial Neural Networks, pp. 102–111. IEEE Press, 1990.

[3] Bengio, Yoshua. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906, 2014.

[4] Bengio, Yoshua and Fischer, Asja. Early inference in energy-based models approximates back-propagation. arXiv preprint arXiv:1510.02777, 2015.

[5] Bengio, Yoshua, Lee, Dong-Hyun, Bornschein, Jorg, Mesnard, Thomas, and Lin, Zhouhan. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.

[6] Bengio, Yoshua, Scellier, Benjamin, Bilaniuk, Olexa, Sacramento, Joao, and Senn, Walter. Feedforward initialization for fast inference of deep generative networks is biologically plausible. arXiv preprint arXiv:1606.01651, 2016.

[7] Bengio, Yoshua, Mesnard, Thomas, Fischer, Asja, Zhang, Saizheng, and Wu, Yuhuai. STDP-compatible approximation of backpropagation in an energy-based model. Neural Computation, 2017.

[8] Crick, Francis. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.

[9] Dumoulin, Vincent and Visin, Francesco. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.

[10] Glorot, Xavier and Bengio, Yoshua.
Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

[11] Grossberg, Stephen. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63, 1987.

[12] Guerguiev, Jordan, Lillicrap, Timothy P, and Richards, Blake A. Towards deep learning with segregated dendrites. eLife, 6:e22901, 2017.

[13] Hinton, G.E. How to do backpropagation in a brain. NIPS 2007 Deep Learning Workshop, 2007.

[14] Hinton, Geoffrey E and McClelland, James L. Learning representations by recirculation. In Neural Information Processing Systems, pp. 358–366. New York: American Institute of Physics, 1988.

[15] Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] Körding, Konrad P and König, Peter. Supervised and unsupervised learning with two sites of synaptic integration. Journal of Computational Neuroscience, 11(3):207–215, 2001.

[17] Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[18] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

[19] LeCun, Yann. Learning process in an asymmetric threshold network. In Disordered Systems and Biological Organization, pp. 233–240. Springer, 1986.

[20] LeCun, Yann. Modèles connexionnistes de l'apprentissage. PhD thesis, Université Paris 6, 1987.

[21] Lee, Dong-Hyun, Zhang, Saizheng, Fischer, Asja, and Bengio, Yoshua. Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 498–515.
Springer, 2015.

[22] Lillicrap, Timothy P, Cownden, Daniel, Tweed, Douglas B, and Akerman, Colin J. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247, 2014.

[23] Lillicrap, Timothy P, Cownden, Daniel, Tweed, Douglas B, and Akerman, Colin J. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7, 2016.

[24] Movellan, Javier R. Contrastive Hebbian learning in the continuous Hopfield model. In Connectionist Models: Proceedings of the 1990 Summer School, pp. 10–17, 1991.

[25] Nøkland, Arild. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pp. 1037–1045, 2016.

[26] O'Reilly, Randall C. Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5):895–938, 1996.

[27] Ororbia, Alexander G and Mali, Ankur. Biologically motivated algorithms for propagating local target representations. arXiv preprint arXiv:1805.11703, 2018.

[28] Ororbia, Alexander G, Mali, Ankur, Kifer, Daniel, and Giles, C Lee. Conducting credit assignment by aligning local representations. arXiv preprint arXiv:1803.01834, 2018.

[29] Parisien, Christopher, Anderson, Charles H, and Eliasmith, Chris. Solving the problem of negative synaptic weights in cortical models. Neural Computation, 20(6):1473–1494, 2008.

[30] Pineda, Fernando J. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229, 1987.

[31] Pineda, Fernando J. Dynamics and architecture for neural computation. Journal of Complexity, 4(3):216–245, 1988.

[32] Rumelhart, DE, Hinton, GE, and Williams, RJ. Learning representations by back-propagating errors.
Nature, 323:533–536, 1986.

[33] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

[34] Sacramento, Joao, Costa, Rui Ponte, Bengio, Yoshua, and Senn, Walter. Dendritic error backpropagation in deep cortical microcircuits. arXiv preprint arXiv:1801.00062, 2017.

[35] Samadi, Arash, Lillicrap, Timothy P, and Tweed, Douglas B. Deep learning with dynamic spiking neurons and fixed feedback weights. Neural Computation, 2017.

[36] Scellier, Benjamin and Bengio, Yoshua. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11, 2017.

[37] Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[38] Whittington, James CR and Bogacz, Rafal. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 2017.

[39] Xie, Xiaohui and Seung, H Sebastian. Equivalence of backpropagation and contrastive Hebbian learning in a layered network.
Neural Computation, 15(2):441–454, 2003.