{"title": "Backpropagation for Energy-Efficient Neuromorphic Computing", "book": "Advances in Neural Information Processing Systems", "page_first": 1117, "page_last": 1125, "abstract": "Solving real world problems with embedded neural networks requires both training algorithms that achieve high performance and compatible hardware that runs in real time while remaining energy efficient. For the former, deep learning using backpropagation has recently achieved a string of successes across many domains and datasets. For the latter, neuromorphic chips that run spiking neural networks have recently achieved unprecedented energy efficiency. To bring these two advances together, we must first resolve the incompatibility between backpropagation, which uses continuous-output neurons and synaptic weights, and neuromorphic designs, which employ spiking neurons and discrete synapses. Our approach is to treat spikes and discrete synapses as continuous probabilities, which allows training the network using standard backpropagation. The trained network naturally maps to neuromorphic hardware by sampling the probabilities to create one or more networks, which are merged using ensemble averaging. To demonstrate, we trained a sparsely connected network that runs on the TrueNorth chip using the MNIST dataset. With a high performance network (ensemble of $64$), we achieve $99.42\\%$ accuracy at $121 \\mu$J per image, and with a high efficiency network (ensemble of $1$) we achieve $92.7\\%$ accuracy at $0.408 \\mu$J per image.", "full_text": "Backpropagation for\n\nEnergy-Ef\ufb01cient Neuromorphic Computing\n\nSteve K. Esser\n\nIBM Research\u2013Almaden\n\nRathinakumar Appuswamy\n\nIBM Research\u2013Almaden\n\n650 Harry Road, San Jose, CA 95120\n\n650 Harry Road, San Jose, CA 95120\n\nsesser@us.ibm.com\n\nrappusw@us.ibm.com\n\nPaul A. Merolla\n\nIBM Research\u2013Almaden\n\nJohn V. 
Arthur\n\nIBM Research\u2013Almaden\n\n650 Harry Road, San Jose, CA 95120\n\n650 Harry Road, San Jose, CA 95120\n\npameroll@us.ibm.com\n\narthurjo@us.ibm.com\n\nDharmendra S. Modha\nIBM Research\u2013Almaden\n\n650 Harry Road, San Jose, CA 95120\n\ndmodha@us.ibm.com\n\nAbstract\n\nSolving real world problems with embedded neural networks requires both train-\ning algorithms that achieve high performance and compatible hardware that runs\nin real time while remaining energy ef\ufb01cient. For the former, deep learning using\nbackpropagation has recently achieved a string of successes across many domains\nand datasets. For the latter, neuromorphic chips that run spiking neural networks\nhave recently achieved unprecedented energy ef\ufb01ciency. To bring these two ad-\nvances together, we must \ufb01rst resolve the incompatibility between backpropaga-\ntion, which uses continuous-output neurons and synaptic weights, and neuromor-\nphic designs, which employ spiking neurons and discrete synapses. Our approach\nis to treat spikes and discrete synapses as continuous probabilities, which allows\ntraining the network using standard backpropagation. The trained network natu-\nrally maps to neuromorphic hardware by sampling the probabilities to create one\nor more networks, which are merged using ensemble averaging. To demonstrate,\nwe trained a sparsely connected network that runs on the TrueNorth chip using the\nMNIST dataset. With a high performance network (ensemble of 64), we achieve\n99.42% accuracy at 108 \u00b5J per image, and with a high ef\ufb01ciency network (ensem-\nble of 1) we achieve 92.7% accuracy at 0.268 \u00b5J per image.\n\n1\n\nIntroduction\n\nNeural networks today are achieving state-of-the-art performance in competitions across a range of\n\ufb01elds [1][2][3]. Such success raises hope that we can now begin to move these networks out of\nthe lab and into embedded systems that can tackle real world problems. 
This necessitates a shift in thinking to system design, where both neural network and hardware substrate must collectively meet performance, power, space, and speed requirements.\n\nDARPA: Approved for Public Release, Distribution Unlimited\n\nOn a neuron-for-neuron basis, the most efficient substrates for neural network operation today are dedicated neuromorphic designs [4][5][6][7]. To achieve high efficiency, neuromorphic architectures can use spikes to provide event based computation and communication that consumes energy only when necessary, can use low precision synapses to colocate memory with computation, keeping data movement local and allowing for parallel distributed operation, and can use constrained connectivity to implement neuron fan-out efficiently, thus dramatically reducing network traffic on-chip. However, such design choices introduce an apparent incompatibility with the backpropagation algorithm [8] used for training today\u2019s most successful deep networks, which uses continuous-output neurons and high-precision synapses, and typically operates with no limits on the number of inputs per neuron. How then can we build systems that take advantage of algorithmic insights from deep learning, and the operational efficiency of neuromorphic hardware?\n\nAs our main contribution here, we demonstrate a learning rule and a network topology that reconcile the apparent incompatibility between backpropagation and neuromorphic hardware. The essence of the learning rule is to train a network offline with hardware supported connectivity, as well as continuous valued input, neuron output, and synaptic weights, but values constrained to the range [0, 1]. We further impose that such constrained values represent probabilities, either of a spike occurring or of a particular synapse being on. 
Such a network can be trained using backpropagation, but also has a direct representation in the spiking, low synaptic precision deployment system, thereby bridging these two worlds. The network topology uses a progressive mixing approach, where each neuron has access to a limited set of inputs from the previous layer, but sources are chosen such that neurons in successive layers have access to progressively more network input.\n\nPrevious efforts have shown success with subsets of the elements we bring together here. Backpropagation has been used to train networks with spiking neurons but with high-precision weights [9][10][11][12], and the converse, networks with trinary synapses but with continuous output neurons [13]. Other probabilistic backpropagation approaches have been demonstrated for networks with binary neurons and binary or trinary synapses but full inter-layer connectivity [14][15][16]. The work presented here is novel in that i) we demonstrate for the first time an offline training methodology using backpropagation to create a network that employs spiking neurons, synapses requiring fewer bits of precision than even trinary weights, and constrained connectivity, ii) we achieve the best accuracy to date on MNIST (99.42%) when compared to networks that use spiking neurons, even with high precision synapses (99.12%) [12], as well as networks that use binary synapses and neurons (97.88%) [15], and iii) we demonstrate the network running in real-time on the TrueNorth chip [7], achieving by far the best published power efficiency for digit recognition (4 \u00b5J per classification at 95% accuracy running 1000 images per second) compared to other low power approaches (6 mJ per classification at 95% accuracy running 50 images per second) [17].\n\n2 Deployment Hardware\n\nWe use the TrueNorth neurosynaptic chip [7] as our example deployment system, though the approach here could be generalized to other neuromorphic 
hardware [4][5][6]. The TrueNorth chip consists of 4096 cores, with each core containing 256 axons (inputs), a 256 \u00d7 256 synapse crossbar, and 256 spiking neurons. Information flows via spikes from a neuron to one axon between any two cores, and from the axon to potentially all neurons on the core, gated by binary synapses in the crossbar. Neurons can be considered to take on a variety of dynamics [18], including those described below. Each axon is assigned 1 of 4 axon types, which is used as an index into a lookup table of s-values, unique to each neuron, that provides a signed 9-bit integer synaptic strength to the corresponding synapse. This approach requires only 1 bit per synapse for the on/off state and an additional 0.15 bits per synapse for the lookup table scheme.\n\n3 Network Training\n\nIn our approach, we employ two types of multilayer networks. The deployment network runs on a platform supporting spiking neurons, discrete synapses with low precision, and limited connectivity. The training network is used to learn binary synaptic connectivity states and biases. This network shares the same topology as the deployment network, but represents input data, neuron outputs, and synaptic connections using continuous values constrained to the range [0, 1] (an overview is provided in Figure 1 and Table 1). These values correspond to probabilities of a spike occurring or of a synapse being \u201con\u201d, providing a means of mapping the training network to the deployment network, while providing a continuous and differentiable space for backpropagation. Below, we describe the deployment network, our training methodology, and our procedure for mapping the training network to the deployment network.\n\n3.1 Deployment network\n\nOur deployment network follows a feedforward methodology where neurons are sequentially updated from the first to the last layer. 
Input to the network is represented using stochastically generated spikes, where the value of each input unit is 0 or 1 with some probability. We write this as P(x_i = 1) \u2261 \u02dcx_i, where x_i is the spike state of input unit i and \u02dcx_i is a continuous value in the range [0, 1] derived by re-scaling the input data (pixels). This scheme allows representation of data using binary spikes, while preserving data precision in the expectation.\n\nSummed neuron input is computed as\n\nI_j = \u03a3_i x_i c_ij s_ij + b_j,  (1)\n\nwhere j is the target neuron index, c_ij is a binary indicator variable representing whether a synapse is on, s_ij is the synaptic strength, and b_j is the bias term. This is identical to common practice in neural networks, except that we have factored the synaptic weight into c_ij and s_ij, such that we can focus our learning efforts on the former for reasons described below. The neuron activation function follows a history-free thresholding equation\n\nn_j = 1 if I_j > 0, and n_j = 0 otherwise.\n\nFigure 1: Diagram showing input, synapses, and output for one neuron in the deployment and training network. For simplicity, only three synapses are depicted.\n\nThese dynamics are implemented in TrueNorth by setting each neuron\u2019s leak equal to the learned bias term (dropping any fractional portion), its threshold to 0, its membrane potential floor to 0, and setting its synapse parameters using the scheme described below.\n\nWe represent each class label using multiple output neurons in the last layer of the network, which we found improves prediction performance. 
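As a minimal sketch (toy sizes and random parameters for illustration, not trained TrueNorth values), the deployment forward pass of Equation 1 followed by the history-free threshold can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def deployment_layer(x, c, s, b):
    """One deployment-network layer: I_j = sum_i x_i c_ij s_ij + b_j,
    followed by the threshold n_j = 1 if I_j > 0 else 0."""
    I = x @ (c * s) + b          # c gates each synapse, s is its signed strength
    return (I > 0).astype(np.int8)

# Illustrative sizes: 256 inputs feeding 64 neurons.
x = rng.integers(0, 2, size=256)           # binary input spikes
c = rng.integers(0, 2, size=(256, 64))     # on/off synapse states
s = rng.choice([-1, 1], size=(256, 64))    # signed strengths
b = rng.integers(-5, 6, size=64)           # biases
n = deployment_layer(x, c, s, b)           # binary output spikes
```

Note that factoring the weight into the binary gate c and the fixed strength s mirrors the text: learning acts on c (and b), while s stays fixed.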
The network prediction for a class is simply the average of the output of all neurons assigned to that class.\n\nTable 1: Network components\n\n                    | Deployment Network  | Correspondence       | Training Network\n                    | Variable  Values    |                      | Variable  Values\nNetwork input       | x         {0, 1}    | P(x = 1) \u2261 \u02dcx  | \u02dcx       [0, 1]\nSynaptic connection | c         {0, 1}    | P(c = 1) \u2261 \u02dcc  | \u02dcc       [0, 1]\nSynaptic strength   | s         {\u22121, 1} | s \u2261 s           | s         {\u22121, 1}\nNeuron output       | n         {0, 1}    | P(n = 1) \u2261 \u02dcn  | \u02dcn       [0, 1]\n\n3.2 Training network\n\nTraining follows the backpropagation methodology by iteratively i) running a forward pass from the first layer to the last layer, ii) comparing the network output to desired output using a loss function, iii) propagating the loss backwards through the network to determine the loss gradient at each synapse and bias term, and iv) using this gradient to update the network parameters. The training network forward pass is a probabilistic representation of the deployment network forward pass.\n\nSynaptic connections are represented as probabilities using \u02dcc_ij, where P(c_ij = 1) \u2261 \u02dcc_ij, while synaptic strength is represented using s_ij as in the deployment network. It is assumed that s_ij can be drawn from a limited set of values, and we consider the additional constraint that it is set in \u201cblocks\u201d such that multiple synapses share the same value, as done in TrueNorth for efficiency. While it is conceivable to learn optimal values for s_ij under such conditions, this requires stepwise changes between allowed values and optimization that is not local to each synapse. 
We take a simpler approach here, which is to learn biases and synapse connection probabilities, and to intelligently fix the synapse strengths using an approach described in the Network Initialization section.\n\nInput to the training network is represented using \u02dcx_i, which is the probability of an input spike occurring in the deployment network. For neurons, we note that Equation 1 is a summation of weighted Bernoulli variables plus a bias term. If we assume independence of these inputs and have sufficient numbers, then we can approximate the probability distribution of this summation as a Gaussian with mean\n\n\u00b5_j = b_j + \u03a3_i \u02dcx_i \u02dcc_ij s_ij  (2)\n\nand variance\n\n\u03c3_j^2 = \u03a3_i \u02dcx_i \u02dcc_ij (1 \u2212 \u02dcx_i \u02dcc_ij) s_ij^2.\n\nWe can then derive the probability of such a neuron firing using the complementary cumulative distribution function of a Gaussian:\n\n\u02dcn_j = 1 \u2212 (1/2)[1 + erf((\u03b8 \u2212 \u00b5_j) / \u221a(2\u03c3_j^2))],  (3)\n\nwhere erf is the error function, \u03b8 = 0, and P(n_j = 1) \u2261 \u02dcn_j. For layers after the first, \u02dcx_i is replaced by the input from the previous layer, \u02dcn_i, which represents the probability that a neuron produces a spike.\n\nA variety of loss functions are suitable for our approach, but we found that training converged the fastest when using log loss,\n\nE = \u2212\u03a3_k [y_k log(p_k) + (1 \u2212 y_k) log(1 \u2212 p_k)],\n\nwhere for each class k, y_k is a binary class label that is 1 if the class is present and 0 otherwise, and p_k is the probability that the average spike count for the class is greater than 0.5. 
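The training-network forward pass for one neuron follows directly from the mean, variance, and complementary-CDF expressions above; a sketch with made-up toy values:

```python
import math

def spike_probability(x_tilde, c_tilde, s, b, theta=0.0):
    """Gaussian approximation of firing probability:
    mu = b + sum_i x~_i c~_ij s_ij,
    sigma^2 = sum_i x~_i c~_ij (1 - x~_i c~_ij) s_ij^2,
    n~ = 1 - 0.5 * (1 + erf((theta - mu) / sqrt(2 sigma^2)))."""
    mu = b + sum(xi * ci * si for xi, ci, si in zip(x_tilde, c_tilde, s))
    var = sum(xi * ci * (1 - xi * ci) * si ** 2
              for xi, ci, si in zip(x_tilde, c_tilde, s))
    if var == 0:
        # Degenerate case: the input sum is deterministic.
        return 1.0 if mu > theta else 0.0
    return 1.0 - 0.5 * (1.0 + math.erf((theta - mu) / math.sqrt(2.0 * var)))
```

For a perfectly balanced input (equal positive and negative contributions, zero bias) the mean sits at the threshold and the firing probability is 0.5, as expected from symmetry.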
Conveniently, we can use the Gaussian approximation in Equation 3 for this, with \u03b8 = 0.5 and the mean and variance terms set by the averaging process.\n\nThe training network backward pass is an adaptation of backpropagation using the neuron and synapse equations above. To get the gradient at each synapse, we use the chain rule to compute\n\n\u2202E/\u2202\u02dcc_ij = (\u2202E/\u2202\u02dcn_j)(\u2202\u02dcn_j/\u2202\u02dcc_ij).\n\nFor the bias, a similar computation is made by replacing \u02dcc_ij in the above equation with b_j. We can then differentiate Equation 3 to produce\n\n\u2202\u02dcn_j/\u2202\u02dcc_ij = (\u02dcx_i s_ij / (\u03c3_j \u221a(2\u03c0))) e^(\u2212(\u03b8 \u2212 \u00b5_j)^2 / (2\u03c3_j^2)) \u2212 ((\u03b8 \u2212 \u00b5_j)(\u02dcx_i s_ij^2 \u2212 \u02dcx_i^2 \u02dcc_ij s_ij^2) / (\u03c3_j^3 \u221a(2\u03c0))) e^(\u2212(\u03b8 \u2212 \u00b5_j)^2 / (2\u03c3_j^2)).  (4)\n\nAs described below, we will assume that the synapse strengths to each neuron are balanced between positive and negative values and that each neuron receives 256 inputs, so we can expect \u00b5 to be close to zero, and \u00b5, \u02dcn_i and \u02dcc_ij to be much less than \u03c3. Therefore, the right term of Equation 4, containing the denominator \u03c3_j^3, can be expected to be much smaller than the left term containing the denominator \u03c3_j. 
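This dominance of the left term can be checked numerically; a sketch with assumed toy inputs (256 random input and connection probabilities, alternating \u00b11 strengths) that evaluates both terms of Equation 4 for one synapse:

```python
import math
import random

random.seed(0)
n_inputs = 256
theta = 0.0

# Illustrative values: random probabilities, balanced +1/-1 strengths.
x = [random.random() for _ in range(n_inputs)]   # input spike probabilities
c = [random.random() for _ in range(n_inputs)]   # connection probabilities
s = [1 if i % 2 == 0 else -1 for i in range(n_inputs)]

mu = sum(xi * ci * si for xi, ci, si in zip(x, c, s))
var = sum(xi * ci * (1 - xi * ci) * si ** 2 for xi, ci, si in zip(x, c, s))
sigma = math.sqrt(var)
gauss = math.exp(-((theta - mu) ** 2) / (2 * var))

# The two terms of Equation 4 for synapse i = 0.
i = 0
left = (x[i] * s[i] / (sigma * math.sqrt(2 * math.pi))) * gauss
right = ((theta - mu) * (x[i] * s[i] ** 2 - x[i] ** 2 * c[i] * s[i] ** 2)
         / (sigma ** 3 * math.sqrt(2 * math.pi))) * gauss
```

With balanced strengths the mean stays small relative to the variance, so the sigma-cubed term is a small correction to the sigma term.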
Under these conditions, for computational efficiency we can approximate Equation 4 by dropping the right term and factoring out the remainder as\n\n\u2202\u02dcn_j/\u2202\u02dcc_ij \u2248 (\u2202\u02dcn_j/\u2202\u00b5_j)(\u2202\u00b5_j/\u2202\u02dcc_ij),\n\nwhere\n\n\u2202\u02dcn_j/\u2202\u00b5_j = (1 / (\u03c3_j \u221a(2\u03c0))) e^(\u2212(\u03b8 \u2212 \u00b5_j)^2 / (2\u03c3_j^2))\n\nand\n\n\u2202\u00b5_j/\u2202\u02dcc_ij = \u02dcx_i s_ij.\n\nA similar treatment can be used to show that the corresponding gradient with respect to the bias term equals one.\n\nThe network is updated using the loss gradient at each synapse and bias term. For each iteration, synaptic connection probability changes according to\n\n\u2206\u02dcc_ij = \u2212\u03b1 \u2202E/\u2202\u02dcc_ij,\n\nwhere \u03b1 is the learning rate. Any synaptic connection probabilities that fall outside of the range [0, 1] as a result of the update rule are \u201csnapped\u201d to the nearest valid value. Changes to the bias term are handled in a similar fashion, with values clipped to fall in the range [\u2212255, 255], the largest values supported using TrueNorth neuron parameters.\n\nThe training procedure described here is amenable to methods and heuristics applied in standard backpropagation. For the results shown below, we used mini batch size 100, momentum 0.9, dropout 0.5 [19], learning rate decay on a fixed schedule across training iterations starting at 0.1 and multiplying by 0.1 every 250 epochs, and transformations of the training data for each iteration with rotation up to \u00b115\u00b0, shift up to \u00b15 pixels, and rescale up to \u00b115%.\n\n3.3 Mapping training network to deployment network\n\nTraining is performed offline, and the resulting network is mapped to the deployment network for hardware operation. 
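This mapping can be sketched as independent Bernoulli draws from the trained connection probabilities, one draw per synapse per ensemble member (the shapes below are toy values, not the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_member(c_tilde):
    """Draw one deployment network: each synapse is set on with
    probability P(c = 1) = c~, independently per member."""
    return (rng.random(c_tilde.shape) < c_tilde).astype(np.int8)

# Toy trained connection probabilities for a 256-input, 64-neuron core.
c_tilde = rng.random((256, 64))
members = [sample_member(c_tilde) for _ in range(4)]  # ensemble of 4

# Input spikes are sampled the same way from x~, with independent draws
# per input per member; class scores average outputs over all members.
```

Because every member is an independent sample of the same trained probabilities, the ensemble average approaches the training network's expected output as the ensemble grows.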
For deployment, depending on system requirements, we can utilize an ensemble\nof one or more samplings of the training network to increase overall output performance. Unlike\nother ensemble methods, we train only once then sample the training network for each member. The\nsystem output for each class is determined by averaging across all neurons in all member networks\nassigned to the class. Synaptic connection states are set on or off according to P (cij = 1) \u2261\n\u02dccij, using independent random number draws for each synapse in each ensemble member. Data is\nconverted into a spiking representation for input using P (xi = 1) \u2261 \u02dcxi, using independent random\nnumber draws for each input to each member of the ensemble.\n\n3.4 Network initialization\n\nThe approach for network initialization described here allows us to optimize for ef\ufb01cient neuromor-\nphic hardware that employs less than 2 bits per synapse. In our approach, each synaptic connection\nprobability is initialized from a uniform random distribution over the range [0, 1]. To initialize\nsynapse strength values, we begin from the principle that each core should maximize information\ntransfer by maximizing information per neuron and minimizing redundancy between neurons. Such\nmethods have been explored in detail in approaches such as infomax [20]. While the \ufb01rst of these\ngoals is data dependent, we can pursue the second at initialization time by tuning the space of pos-\nsible weights for a core, represented by the matrix of synapse strength values, S.\n\n5\n\n\fIn our approach, we wish to minimize redundancy\nbetween neurons on a core by attempting to induce a\nproduct distribution on the outputs for every pair of\nneurons. 
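One simple construction meeting these aims, detailed in the derivation that follows, is to assign half of each neuron's s-values to \u22121 and half to +1 so its strengths sum to zero, varying the sign pattern across neurons; a sketch with assumed toy sizes (4 axon types per neuron, as in TrueNorth, but a made-up neuron count and permutation scheme):

```python
import numpy as np

def balanced_strengths(num_axon_types=4, num_neurons=8, seed=0):
    """Assign each neuron one signed strength per axon type, with half the
    s-values -1 and half +1 so each neuron's strengths sum to zero."""
    rng = np.random.default_rng(seed)
    S = np.empty((num_axon_types, num_neurons), dtype=np.int8)
    for j in range(num_neurons):
        col = np.array([-1] * (num_axon_types // 2)
                       + [1] * (num_axon_types // 2), dtype=np.int8)
        rng.shuffle(col)   # vary the sign permutation across neurons
        S[:, j] = col
    return S

S = balanced_strengths()
```

Zero column sums keep each neuron's expected input near zero, which is the balance assumption used when approximating the gradient above.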
To simplify the problem, we note that the summed weighted input to a pair of neurons is well-approximated by a bi-variate Gaussian distribution. Thus, forcing the covariance between the summed weighted inputs to zero guarantees that the inputs are independent. Furthermore, since functions of pair-wise independent random variables remain pair-wise independent, the neuron outputs are guaranteed to be independent.\n\nThe summed weighted input to the j-th neuron is given by Equation 1. It is desirable for the purposes of maintaining balance in neuron dynamics to configure its weights using a mix of positive and negative values that sum to zero. Thus for all j,\n\n\u03a3_i s_ij = 0,  (5)\n\nwhich implies that E[I_j] \u2248 0 assuming inputs and synaptic connection states are both decorrelated and the bias term is near 0. This simplifies the covariance between the inputs to any two neurons on a core to\n\nE[I_j I_r] = E[\u03a3_{i,q} x_i c_ij s_ij x_q c_qr s_qr].\n\nFigure 2: Synapse strength values depicted as an axons (rows) \u00d7 neurons (columns) array. The learning procedure fixes these values when the network is initialized and learns the probability that each synapse is in a transmitting state. The blocky appearance of the strength matrix is the result of the shared synaptic strength approach used by TrueNorth to reduce memory footprint.\n\nRearranging terms, we get\n\nE[I_j I_r] = \u03a3_i c_ij s_ij c_ir s_ir E[x_i^2] + \u03a3_i c_ij s_ij \u03a3_{q\u2260i} c_qr s_qr E[x_i x_q].  (6)\n\nNext, we note from the equation for covariance that E[x_i x_q] = \u03c3(x_i, x_q) + E[x_i]E[x_q]. Under the assumption that inputs have equal mean and variance, then for any i, E[x_i^2] = \u03c1, where \u03c1 = \u03c3(x_i, x_i) + E[x_i]^2 is a constant. 
Further assuming that the covariance between x_i and x_q where i \u2260 q is the same for all inputs, then E[x_i x_q] = \u03b3, where \u03b3 = \u03c3(x_i, x_q) + E[x_i]E[x_q] is a constant. Using this and Equation 5, Equation 6 becomes\n\nE[I_j I_r] = \u03c1\u27e8c_j s_j, c_r s_r\u27e9 + \u03b3 \u03a3_i c_ij s_ij (\u2212c_ir s_ir)\n= \u03c1\u27e8c_j s_j, c_r s_r\u27e9 \u2212 \u03b3\u27e8c_j s_j, c_r s_r\u27e9\n= (\u03c1 \u2212 \u03b3)\u27e8c_j s_j, c_r s_r\u27e9.\n\nSo minimizing the absolute value of the inner product between columns of the effective weight matrix forces I_j and I_r to be maximally uncorrelated under the constraints.\n\nInspired by this observation, we a priori (i.e., without any knowledge of the input data) choose the strength values such that the absolute value of the inner product between columns of the effective weight matrix is minimized, and the sum of effective weights to each neuron is zero. Practically, this is achieved by assigning half of each neuron\u2019s s-values to \u22121 and the other half to 1, balancing the possible permutations of such assignments so they occur as equally as possible across neurons on a core, and evenly distributing the four possible axon types amongst the axons on a core. The resulting matrix of synaptic strength values can be seen in Figure 2. This configuration thus provides an optimal weight subspace, given the constraints, in which backpropagation can operate in a data-driven fashion to find desirable synaptic on/off states.\n\nFigure 3: A) Two network configurations used for the results described here, a 5 core network designed to minimize core count and a 30 core network designed to maximize accuracy. B) Board with a socketed TrueNorth chip used to run the deployment networks. 
The chip is 4.3 cm\u00b2, runs in real time (1 ms neuron updates), and consumes 63 mW running a benchmark network that uses all of its 1 million neurons [7]. C) Measured accuracy and measured energy for the two network configurations running on the chip. Ensemble size is shown to the right of each data point.\n\n4 Network topology\n\nThe network topology is designed to support neurons with responses to local, regional or global features while respecting the \u201ccore-to-core\u201d connectivity of the TrueNorth architecture \u2013 namely that all neurons on a core share access to the same set of inputs, and that the number of such inputs is limited. The network uses a multilayer feedforward scheme, where the first layer consists of input elements in a rows \u00d7 columns \u00d7 channels array, such as an image, and the remaining layers consist of TrueNorth cores. Connections between layers are made using a sliding window approach.\n\nInput to each core in a layer is drawn from an R \u00d7 R \u00d7 F input window (Figure 3A), where R represents the row and column dimensions, and F represents the feature dimension. For input from the first layer, rows and columns are in units of input elements and features are input channels, while for input from the remaining layers rows and columns are in units of cores and features are neurons. The first core in a given target layer locates its input window in the upper left corner of its source layer, and the next core in the target layer shifts its input window to the right by a stride of S. Successive cores slide the window over by S until the edge of the source layer is reached, then the window is returned to the left and shifted down by S, and the process is repeated. Features are sub-selected randomly, with the constraint that each neuron can only be selected by one target core. We allow input elements to be selected multiple times. 
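The sliding-window wiring can be sketched as computing, for each core in a target layer, the origin of its R \u00d7 R input window in the source layer; the 28 \u00d7 28, R = 16, S = 12 values below match the first layer of the 5 core network described in the Results:

```python
def window_origins(source_rows, source_cols, R, S):
    """Upper-left corners of R x R input windows slid with stride S,
    scanning left-to-right then top-to-bottom over the source layer."""
    origins = []
    for r in range(0, source_rows - R + 1, S):
        for c in range(0, source_cols - R + 1, S):
            origins.append((r, c))
    return origins

# 28 x 28 MNIST-sized input with R = 16, S = 12 yields four windows,
# i.e. four second-layer cores, each seeing an overlapping image quadrant.
origins = window_origins(28, 28, R=16, S=12)
```

Each origin corresponds to one target core, so the number of origins sets the core count of the next layer.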
This scheme is similar in some respects to that used by a convolution network, but we employ independent synapses for each location. The specific networks employed here, and associated parameters, are shown in Figure 3A.\n\n5 Results\n\nWe applied the training method described above to the MNIST dataset [21], examining accuracy vs. energy tradeoffs using two networks running on the TrueNorth chip (Figure 3B). The first network is the smallest multilayer TrueNorth network possible for the number of pixels present in the dataset, consisting of 5 cores distributed in 2 layers, corresponding to 512 neurons. The second network was built with a primary goal of maximizing accuracy, and is composed of 30 cores distributed in 4 layers (Figure 3A), corresponding to 3840 neurons. Networks are configured with a first layer using R = 16 and F = 1 in both networks, and S = 12 in the 5 core network and S = 4 in the 30 core network, while all subsequent layers in both networks use R = 2, F = 64, and S = 1. These parameters result in a \u201cpyramid\u201d shape, where all cores from layer 2 to the final layer draw input from 4 source cores and 64 neurons in each of those sources. Each core employs 64 neurons per core it targets, up to a maximum of 256 neurons.\n\nWe tested each network in an ensemble of 1, 4, 16, or 64 members running on a TrueNorth chip in real-time. Each image was encoded using a single time step (1 ms), with a different spike sampling used for each input line targeted by a pixel. 
The instrumentation available measures active power for the network in operation and leakage power for the entire chip, which consists of 4096 cores. We report energy numbers as active power plus the fraction of leakage power for the cores in use. The highest overall performance we observed, 99.42%, was achieved with a 30 core trained network using a 64 member ensemble, for a total of 1920 cores, which was measured using 108 \u00b5J per classification. The lowest energy was achieved by the 5 core network operating in an ensemble of 1, which was measured using 0.268 \u00b5J per classification while achieving 92.70% accuracy. Results are plotted showing accuracy vs. energy in Figure 3C. Both networks classified 1000 images per second.\n\n6 Discussion\n\nOur results show that backpropagation operating in a probabilistic domain can be used to train networks that naturally map to neuromorphic hardware with spiking neurons and extremely low-precision synapses. Our approach can be succinctly summarized as constrain-then-train, where we first constrain our network to provide a direct representation of our deployment system and then train within those constraints. This can be contrasted with a train-then-constrain approach, where a network agnostic to the final deployment system is first trained, and following training is constrained through normalization and discretization methods to provide a spiking representation or low precision weights. While requiring a customized training rule, the constrain-then-train approach offers the advantage that a decrease in training error has a direct correspondence to a decrease in error for the deployment network. 
Conversely, the train-then-constrain approach allows use of off-the-shelf training methods, but unconstrained training is not guaranteed to produce a reduction in error after hardware constraints are applied.\n\nLooking forward, we see several avenues for expanding this approach to more complex datasets. First, deep convolution networks [21] have seen a great deal of success by using backpropagation to learn the weights of convolutional filters. The learning method introduced here is independent of the specific network structure beyond the given sparsity constraint, and could certainly be adapted for use in convolution networks. Second, biology provides a number of examples, such as the retina or cochlea, for mapping high-precision sensory data into a binary spiking representation. Drawing inspiration from such approaches may improve performance beyond the linear mapping scheme used in this work. Third, this approach may also be adaptable to other gradient based learning methods, or to methods with existing probabilistic components such as contrastive divergence [22]. Further, while we describe the use of this approach with TrueNorth to provide a concrete use case, we see no reason why this training approach cannot be used with other spiking neuromorphic hardware [4][5][6].\n\nWe believe this work is particularly timely, as in recent years backpropagation has achieved a high level of performance on a number of real world tasks, including object detection in complex scenes [1], pedestrian detection [2], and speech recognition [3]. A wide range of sensors are found in mobile devices ranging from phones to automobiles, and platforms like TrueNorth provide a low power substrate for processing that sensory data. 
By bridging backpropagation and energy efficient neuromorphic computing, we hope that the work here provides an important step towards building low-power, scalable brain-inspired systems with real world applicability.\n\nAcknowledgments\n\nThis research was sponsored by the Defense Advanced Research Projects Agency under contracts No. HR0011-09-C-0002 and No. FA9453-15-C-0055. The views, opinions, and/or findings contained in this paper are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.\n\nReferences\n\n[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, \u201cImageNet Large Scale Visual Recognition Challenge,\u201d International Journal of Computer Vision, 2015.\n\n[2] W. Ouyang and X. Wang, \u201cJoint deep learning for pedestrian detection,\u201d in International Conference on Computer Vision, pp. 2056\u20132063, 2013.\n\n[3] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., \u201cDeep Speech: Scaling up end-to-end speech recognition,\u201d arXiv preprint arXiv:1412.5567, 2014.\n\n[4] B. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. Chandrasekaran, J.-M. Bussat, R. Alvarez-Icaza, J. Arthur, P. Merolla, and K. Boahen, \u201cNeurogrid: A mixed-analog-digital multichip system for large-scale neural simulations,\u201d Proceedings of the IEEE, vol. 102, no. 5, pp. 699\u2013716, 2014.\n\n[5] E. Painkras, L. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D. Lester, A. Brown, and S. Furber, \u201cSpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation,\u201d IEEE Journal of Solid-State Circuits, vol. 48, no. 8, pp. 1943\u20131953, 2013.\n\n[6] T. Pfeil, A. Gr\u00fcbl, S. Jeltsch, E. M\u00fcller, P. M\u00fcller, M. A. 
Petrovici, M. Schmuker, D. Br\u00a8uderle, J. Schem-\nmel, and K. Meier, \u201cSix networks on a universal neuromorphic computing substrate,\u201d Frontiers in neuro-\nscience, vol. 7, 2013.\n\n[7] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson,\nN. Imam, C. Guo, Y. Nakamura, et al., \u201cA million spiking-neuron integrated circuit with a scalable com-\nmunication network and interface,\u201d Science, vol. 345, no. 6197, pp. 668\u2013673, 2014.\n\n[8] D. Rumelhart, G. Hinton, and R. Williams, \u201cLearning representations by back-propagating errors,\u201d Na-\n\nture, vol. 323, no. 6088, pp. 533\u2013536, 1986.\n\n[9] P. Moerland and E. Fiesler, \u201cNeural network adaptations to hardware implementations,\u201d in Handbook of\nneural computation (E. Fiesler and R. Beale, eds.), New York: Institute of Physics Publishing and Oxford\nUniversity Publishing, 1997.\n\n[10] E. Fiesler, A. Choudry, and H. J. Caul\ufb01eld, \u201cWeight discretization paradigm for optical neural networks,\u201d\n\nin The Hague\u201990, 12-16 April, pp. 164\u2013173, International Society for Optics and Photonics, 1990.\n\n[11] Y. Cao, Y. Chen, and D. Khosla, \u201cSpiking deep convolutional neural networks for energy-ef\ufb01cient object\n\nrecognition,\u201d International Journal of Computer Vision, vol. 113, no. 1, pp. 54\u201366, 2015.\n\n[12] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, \u201cFast-classifying, high-accuracy\nspiking deep networks through weight and threshold balancing,\u201d in International Joint Conference on\nNeural Networks, 2015, in press.\n\n[13] L. K. Muller and G. Indiveri, \u201cRounding methods for neural networks with low resolution synaptic\n\nweights,\u201d arXiv preprint arXiv:1504.05767, 2015.\n\n[14] J. Zhao, J. Shawe-Taylor, and M. van Daalen, \u201cLearning in stochastic bit stream neural networks,\u201d Neural\n\nNetworks, vol. 9, no. 6, pp. 991 \u2013 998, 1996.\n\n[15] Z. 
Cheng, D. Soudry, Z. Mao, and Z. Lan, \u201cTraining binary multilayer neural networks for image classi-\n\n\ufb01cation using expectation backpropgation,\u201d arXiv preprint arXiv:1503.03562, 2015.\n\n[16] D. Soudry, I. Hubara, and R. Meir, \u201cExpectation backpropagation: Parameter-free training of multilayer\nneural networks with continuous or discrete weights,\u201d in Advances in Neural Information Processing\nSystems 27, pp. 963\u2013971, 2014.\n\n[17] E. Stromatias, D. Neil, F. Galluppi, M. Pfeiffer, S. Liu, and S. Furber, \u201cScalable energy-ef\ufb01cient, low-\nlatency implementations of spiking deep belief networks on spinnaker,\u201d in International Joint Conference\non Neural Networks, IEEE, 2015, in press.\n\n[18] A. S. Cassidy, P. Merolla, J. V. Arthur, S. Esser, B. Jackson, R. Alvarez-Icaza, P. Datta, J. Sawada, T. M.\nWong, V. Feldman, A. Amir, D. Rubin, F. Akopyan, E. McQuinn, W. Risk, and D. S. Modha, \u201cCognitive\ncomputing building block: A versatile and ef\ufb01cient digital neuron model for neurosynaptic cores,\u201d in\nInternational Joint Conference on Neural Networks, 2013.\n\n[19] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, \u201cImproving neural\n\nnetworks by preventing co-adaptation of feature detectors,\u201d arXiv preprint arXiv:1207.0580, 2012.\n\n[20] A. J. Bell and T. J. Sejnowski, \u201cAn information-maximization approach to blind separation and blind\n\ndeconvolution,\u201d Neural computation, vol. 7, no. 6, pp. 1129\u20131159, 1995.\n\n[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, \u201cGradient-based learning applied to document recogni-\n\ntion,\u201d Proceedings of the IEEE, vol. 86, no. 11, pp. 2278\u20132324, 1998.\n\n[22] G. E. Hinton and R. R. Salakhutdinov, \u201cReducing the dimensionality of data with neural networks,\u201d\n\nScience, vol. 313, no. 5786, pp. 
504\u2013507, 2006.\n\n9\n\n\f", "award": [], "sourceid": 691, "authors": [{"given_name": "Steve", "family_name": "Esser", "institution": "IBM Research-Almaden"}, {"given_name": "Rathinakumar", "family_name": "Appuswamy", "institution": "IBM Research-Almaden"}, {"given_name": "Paul", "family_name": "Merolla", "institution": "IBM Research-Almaden"}, {"given_name": "John", "family_name": "Arthur", "institution": "IBM Research-Almaden"}, {"given_name": "Dharmendra", "family_name": "Modha", "institution": "IBM Research-Almaden"}]}