{"title": "Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 963, "page_last": 971, "abstract": "Multilayer Neural Networks (MNNs) are commonly trained using gradient descent-based methods, such as BackPropagation (BP). Inference in probabilistic graphical models is often done using variational Bayes methods, such as Expectation Propagation (EP). We show how an EP based approach can also be used to train deterministic MNNs. Specifically, we approximate the posterior of the weights given the data using a \u201cmean-field\u201d factorized distribution, in an online setting. Using online EP and the central limit theorem we find an analytical approximation to the Bayes update of this posterior, as well as the resulting Bayes estimates of the weights and outputs. Despite a different origin, the resulting algorithm, Expectation BackPropagation (EBP), is very similar to BP in form and efficiency. However, it has several additional advantages: (1) Training is parameter-free, given initial conditions (prior) and the MNN architecture. This is useful for large-scale problems, where parameter tuning is a major challenge. (2) The weights can be restricted to have discrete values. This is especially useful for implementing trained MNNs in precision limited hardware chips, thus improving their speed and energy efficiency by several orders of magnitude. We test the EBP algorithm numerically in eight binary text classification tasks. In all tasks, EBP outperforms: (1) standard BP with the optimal constant learning rate (2) previously reported state of the art. 
Interestingly, EBP-trained MNNs with binary weights usually perform better than MNNs with continuous (real) weights - if we average the MNN output using the inferred posterior.", "full_text": "Expectation Backpropagation: Parameter-Free\nTraining of Multilayer Neural Networks with\n\nContinuous or Discrete Weights\n\nDaniel Soudry1, Itay Hubara2, Ron Meir2\n\n(1) Department of Statistics, Columbia University\n\n(2) Department of Electrical Engineering, Technion, Israel Institute of Technology\n\ndaniel.soudry@gmail.com,itayhubara@gmail.com,rmeir@ee.technion.ac.il\n\nAbstract\n\nMultilayer Neural Networks (MNNs) are commonly trained using gradient\ndescent-based methods, such as BackPropagation (BP). Inference in probabilistic\ngraphical models is often done using variational Bayes methods, such as Expec-\ntation Propagation (EP). We show how an EP based approach can also be used\nto train deterministic MNNs. Speci\ufb01cally, we approximate the posterior of the\nweights given the data using a \u201cmean-\ufb01eld\u201d factorized distribution, in an online\nsetting. Using online EP and the central limit theorem we \ufb01nd an analytical ap-\nproximation to the Bayes update of this posterior, as well as the resulting Bayes\nestimates of the weights and outputs.\nDespite a different origin, the resulting algorithm, Expectation BackPropagation\n(EBP), is very similar to BP in form and ef\ufb01ciency. However, it has several addi-\ntional advantages: (1) Training is parameter-free, given initial conditions (prior)\nand the MNN architecture. This is useful for large-scale problems, where param-\neter tuning is a major challenge. (2) The weights can be restricted to have discrete\nvalues. 
This is especially useful for implementing trained MNNs in precision lim-\nited hardware chips, thus improving their speed and energy ef\ufb01ciency by several\norders of magnitude.\nWe test the EBP algorithm numerically in eight binary text classi\ufb01cation tasks.\nIn all tasks, EBP outperforms: (1) standard BP with the optimal constant learning\nrate (2) previously reported state of the art. Interestingly, EBP-trained MNNs with\nbinary weights usually perform better than MNNs with continuous (real) weights\n- if we average the MNN output using the inferred posterior.\n\n1\n\nIntroduction\n\nRecently, Multilayer1 Neural Networks (MNNs) with deep architecture have achieved state-of-the-\nart performance in various supervised learning tasks [10, 13, 7]. Such networks are often massive\nand require large computational and energetic resources. A dense, fast and energetically ef\ufb01cient\nhardware implementation of trained MNNs could be built if the weights were restricted to discrete\nvalues. For example, with binary weights, the chip in [12] can perform 1012 operations per second\nwith 1mW power ef\ufb01ciency. Such performances will enable the integration of massive MNNs into\nsmall and low-power electronic devices.\nTraditionally, MNNs are trained by minimizing some error function using BackPropagation (BP) or\nrelated gradient descent methods [14]. However, such an approach cannot be directly applied if the\nweights are restricted to binary values. Moreover, crude discretization of the weights is usually quite\n\n1i.e., having more than a single layer of adjustable weights.\n\n1\n\n\fdestructive [19]. Other methods have been suggested in the 90\u2019s (e.g., [22, 2, 17]), but it is not clear\nwhether these approaches are scalable.\nThe most ef\ufb01cient methods developed for training Single-layer2 Neural Networks (SNN) with binary\nweights use approximate Bayesian inference, either implicitly [5, 1] or explicitly [23, 21]. 
In theory, given a prior, the Bayes estimate of the weights can be found from their posterior given the data. However, storing or updating the full posterior is usually intractable. To circumvent this problem, these previous works used a factorized "mean-field" form for the posterior of the weights given the data. As explained in [21], this was done using a special case of the widely applicable Expectation Propagation (EP) algorithm [18] - with an additional approximation that the fan-in of all neurons is large, so their inputs are approximately Gaussian. Thus, given an error function, one can analytically obtain the Bayes estimate of the weights or the outputs, using the factorized approximation of the posterior. However, to the best of our knowledge, it is still unknown whether such an approach could be generalized to MNNs, which are more relevant for practical applications.

In this work we derive such a generalization, using similar approximations (section 3). The end result is the Expectation BackPropagation (EBP, section 4) algorithm for online training of MNNs where the weight values can be either continuous (i.e., real numbers) or discrete (e.g., ±1 binary). Notably, the training is parameter-free (with no learning rate), and insensitive to the magnitude of the input. This algorithm is very similar to BP. Like BP, it is very efficient in each update, having a linear computational complexity in the number of weights.

We test the EBP algorithm (section 5) on various supervised learning tasks: eight high dimensional tasks of classifying text into one of two semantic classes, and one low dimensional medical discrimination task. Using MNNs with two or three weight layers, EBP outperforms both standard BP and the previously reported state of the art for these tasks [6]. 
Interestingly, the best performance of EBP is usually achieved using the Bayes estimate of the output of MNNs with binary weights. This estimate can be calculated analytically, or by averaging the output of several such MNNs, with weights sampled from the inferred posterior.

2 Preliminaries

General Notation  A non-capital boldfaced letter x denotes a column vector with components xi, and a boldfaced capital letter X denotes a matrix with components Xij. Also, if indexed, the components of xl are denoted xi,l and those of Xl are denoted Xij,l. We denote by P (x) the probability distribution (in the discrete case) or density (in the continuous case) of a random variable X, and define P (x|y) = P (x, y) / P (y), ⟨x⟩ = ∫ x P (x) dx, ⟨x|y⟩ = ∫ x P (x|y) dx, Cov (x, y) = ⟨xy⟩ − ⟨x⟩⟨y⟩ and Var (x) = Cov (x, x). Integration is exchanged with summation in the discrete case. For any condition A, we make use of I {A}, the indicator function (i.e., I {A} = 1 if A holds, and zero otherwise), and δij = I {i = j}, Kronecker's delta function. If x ∼ N (μ, Σ) then it is Gaussian with mean μ and covariance matrix Σ, and we denote its density by N (x|μ, Σ). Furthermore, we use the standard Gaussian cumulative distribution function Φ (x) = ∫_{−∞}^{x} N (u|0, 1) du.

Model  We consider a general feedforward Multilayer Neural Network (MNN) with connections between adjacent layers (Fig. 2.1). For analytical simplicity, we focus here on deterministic binary (±1) neurons. However, the framework can be straightforwardly extended to other types of neurons (deterministic or stochastic). The MNN has L layers, where Vl is the width of the l-th layer, and W = {Wl}_{l=1}^{L} is the collection of Vl × V_{l−1} synaptic weight matrices which connect neuronal layers sequentially. 
The outputs of the layers are {vl}_{l=0}^{L}, where v0 is the input layer, {vl}_{l=1}^{L−1} are the hidden layers and vL is the output layer. In each layer,

vl = sign (Wl v_{l−1}) , (2.1)

where each sign "activation function" (a neuronal layer) operates component-wise (i.e., ∀i : (sign (x))i = sign (xi)). The output of the network is therefore

vL = g (v0, W) = sign (WL sign (W_{L−1} sign (· · · W1 v0))) . (2.2)

2 i.e., having only a single layer of adjustable weights.

Figure 2.1: Our MNN model (Eq. 2.2).

We assume that the weights are constrained to some set S, with the specific restrictions on each weight denoted by Sij,l, so Wij,l ∈ Sij,l and W ∈ S. If Sij,l = {0}, then we say that Wij,l is "disconnected". For simplicity, we assume that in each layer the "fan-in" Kl = |{j | Sij,l ≠ {0}}| is constant for all i. Biases can be optionally included in the standard way, by adding a constant output v0,l = 1 to each layer.

Task  We examine a supervised classification learning task, in a Bayesian framework. We are given a fixed set of sequentially labeled data pairs DN = {x(n), y(n)}_{n=1}^{N} (so D0 = ∅), where each x(n) ∈ R^{V0} is a data point, and each y(n) is a label taken from a binary set Y ⊂ {−1, +1}^{VL}. Anticipating section 3.1, after the n-th sample is received the posterior over the weights is updated according to Bayes rule,

P (W|Dn) ∝ P (y(n)|x(n), W) P (W|Dn−1) , (3.3)

where, since the MNN is deterministic, the likelihood (per data point) has the simple form3

P (y(n)|x(n), W) = I {g (x(n), W) = y(n)} . (3.4)

3 A MNN with stochastic activation functions will have a "smoothed out" version of this.

For brevity, we will sometimes suppress the sample index n, where it is clear from the context. 
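The deterministic model of Eqs. 2.1-2.2 can be sketched in a few lines of NumPy (a minimal illustration; the function name `mnn_forward` and the strict `u > 0` tie-breaking convention, which matches the indicator in Eq. 3.10, are our own choices):

```python
import numpy as np

def mnn_forward(x, weights):
    """Deterministic sign-MNN of Eq. 2.2: v_L = sign(W_L sign(... sign(W_1 v_0))).

    `weights` is the list [W_1, ..., W_L]; each W_l has shape (V_l, V_{l-1}).
    Ties are broken as in Eq. 3.10: a neuron outputs +1 only if its input is > 0.
    """
    v = np.asarray(x, dtype=float)
    for W in weights:
        v = np.where(W @ v > 0, 1.0, -1.0)   # Eq. 2.1, applied component-wise
    return v
```

For example, with binary weight matrices W ∈ {±1} and an input x ∈ R^{V0}, `mnn_forward(x, [W1, W2])` evaluates a 2-layer network.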
As common for supervised learning with MNNs, we assume that for all n the relation x(n) → y(n) can be represented by a MNN with known architecture (the 'hypothesis class') and unknown weights W ∈ S. This is a reasonable assumption, since a MNN can approximate any deterministic function, given a sufficient number of neurons [11] (if L ≥ 2). Specifically, there exists some W∗ ∈ S such that y(n) = g (x(n), W∗) (see Eq. 2.2). Our goals are: (1) estimate the most probable W∗ for this MNN; (2) estimate the most probable y given some (possibly unseen) x.

3 Theory

In this section we explain how a specific learning algorithm for MNNs (described in section 4) arises from approximate (mean-field) Bayesian inference, used in this context (described in section 2).

3.1 Online Bayesian learning in MNNs

We approach this task within a Bayesian framework, where we assume some prior distribution on the weights - P (W|D0). Our aim is to find P (W|DN), the posterior probability of the configuration of the weights W, given the data. With this posterior, one can select the most probable weight configuration - the Maximum A Posteriori (MAP) weight estimate

W∗ = argmax_{W ∈ S} P (W|DN) , (3.1)

minimizing the expected zero-one loss over the weights (I {W∗ ≠ W}). This weight estimate can be implemented in a single MNN, which provides an estimate of the label y for (possibly unseen) data points x through y = g (x, W∗). Alternatively, one can aim to minimize the expected loss over the output - as more commonly done in the MNN literature. For example, if the aim is to reduce classification error then one should use the MAP output estimate

y∗ = argmax_{y ∈ Y} Σ_W I {g (x, W) = y} P (W|DN) , (3.2)

which minimizes the zero-one loss (I {y∗ ≠ g (x, W)}) over the outputs. 
The resulting estimator does not generally have the form of a MNN (i.e., y = g (x, W) with W ∈ S), but it can be approximated by averaging the output over many such MNNs, with W values sampled from the posterior. Note that averaging the output of several MNNs is a common method to improve performance.

We aim to find the posterior P (W|DN) in an online setting, where samples arrive sequentially. After the n-th sample is received, the posterior is updated according to Bayes rule (Eq. 3.3), where, since the MNN is deterministic, the likelihood is the simple indicator function of Eq. 3.4, for n = 1, . . . , N. Therefore, the Bayes update in Eq. 3.3 simply makes sure that P (W|Dn) = 0 in any "illegal" configuration (i.e., any W0 such that g (x(k), W0) ≠ y(k) for some 1 ≤ k ≤ n). In other words, the posterior is equal to the prior, restricted to the "legal" weight domain, and re-normalized appropriately. Unfortunately, this update is generally intractable for large networks, mainly because we need to store and update an exponential number of values for P (W|Dn). Therefore, some approximation is required.

3.2 Approximation 1: mean-field

In order to reduce computational complexity, instead of storing P (W|Dn), we will store its factorized ('mean-field') approximation P̂ (W|Dn), for which

P̂ (W|Dn) = ∏_{i,j,l} P̂ (Wij,l|Dn) , (3.5)

where each factor must be normalized. Notably, it is easy to find the MAP estimate of the weights (Eq. 3.1) under this factorized approximation: ∀i, j, l

W∗ij,l = argmax_{Wij,l ∈ Sij,l} P̂ (Wij,l|DN) . (3.6)

The factors P̂ (Wij,l|Dn) can be found using a standard variational approach [4, 23]. For each n, we first perform the Bayes update in Eq. 3.3 with P̂ (W|Dn−1) instead of P (W|Dn−1). 
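To make the combinatorial nature of this update concrete, here is a toy enumeration of Eqs. 3.3-3.4 for a single ±1 neuron (a hypothetical example; the helper names and the uniform prior are our own choices). Note the 2^K storage over weight configurations, which is exactly what becomes intractable for large networks:

```python
import itertools
import numpy as np

def sign(u):
    return np.where(u > 0, 1.0, -1.0)

def bayes_update(posterior, x, y):
    """One exact Bayes step (Eqs. 3.3-3.4): zero out the 'illegal' weight
    configurations (those with g(x, W) != y) and re-normalize the rest.
    `posterior` maps each weight configuration (a tuple) to its probability."""
    new = {W: p * float(np.all(sign(np.dot(W, x)) == y))
           for W, p in posterior.items()}
    Z = sum(new.values())
    return {W: p / Z for W, p in new.items()}

# Uniform prior over all 2^K binary weight vectors of a single +-1 neuron.
K = 3
prior = {W: 1.0 / 2**K for W in itertools.product([-1.0, 1.0], repeat=K)}
post = bayes_update(prior, x=np.array([1.0, 1.0, -1.0]), y=1.0)
```

After the update, exactly the configurations consistent with the observed label retain (equal) probability mass, illustrating the "prior restricted to the legal domain" description above.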
Then, we project the resulting posterior onto the family of distributions factorized as in Eq. 3.5, by minimizing the reverse Kullback-Leibler divergence (similarly to EP [18, 21]). A straightforward calculation shows that the optimal factor is just a marginal of the posterior (appendix A, available in the supplementary material). Performing this marginalization on the Bayes update and re-arranging terms, we obtain a Bayes-like update to the marginals: ∀i, j, l

P̂ (Wij,l|Dn) ∝ P̂ (y(n)|x(n), Wij,l, Dn−1) P̂ (Wij,l|Dn−1) , (3.7)

where

P̂ (y(n)|x(n), Wij,l, Dn−1) = Σ_{W′ : W′ij,l = Wij,l} P (y(n)|x(n), W′) ∏_{{k,r,m} ≠ {i,j,l}} P̂ (W′kr,m|Dn−1) (3.8)

is the marginal likelihood. Thus we can directly update the factor P̂ (Wij,l|Dn) in a single step. However, the last equation is still problematic, since it contains a generally intractable summation over an exponential number of values, and therefore requires simplification. For simplicity, from now on we replace any P̂ with P, in a slight abuse of notation (keeping in mind that the distributions are approximated).

3.3 Simplifying the marginal likelihood

In order to be able to use the update rule in Eq. 3.7, we must first calculate the marginal likelihood P (y(n)|x(n), Wij,l, Dn−1) using Eq. 3.8. For brevity, we suppress the index n and the dependence on Dn−1 and x, obtaining

P (y|Wij,l) = Σ_{W′ : W′ij,l = Wij,l} P (y|W′) ∏_{{k,r,m} ≠ {i,j,l}} P (W′kr,m) , (3.9)

where we recall that P (y|W′) is simply an indicator function (Eq. 3.4). Since, by assumption, P (y|W′) arises from a feed-forward MNN with input v0 = x and output vL = y, we can perform the summations in Eq. 
3.9 in a more convenient way - layer by layer. To do this, we define

P (vm|v_{m−1}) = Σ_{W′m} ∏_{k=1}^{Vm} [ I { vk,m Σ_{r=1}^{V_{m−1}} v_{r,m−1} W′kr,m > 0 } ∏_{r=1}^{V_{m−1}} P (W′kr,m) ] (3.10)

and P (vl|v_{l−1}, Wij,l), which is defined identically to P (vl|v_{l−1}), except that the summation is performed over all configurations in which Wij,l is fixed (i.e., W′ij,l = Wij,l) and we set P (Wij,l) = 1. Now we can write recursively, with P (v1) = P (v1|v0 = x),

∀m ∈ {2, . . . , l − 1} : P (vm) = Σ_{v_{m−1}} P (vm|v_{m−1}) P (v_{m−1}) (3.11)

P (vl|Wij,l) = Σ_{v_{l−1}} P (vl|v_{l−1}, Wij,l) P (v_{l−1}) (3.12)

∀m ∈ {l + 1, l + 2, . . . , L} : P (vm|Wij,l) = Σ_{v_{m−1}} P (vm|v_{m−1}) P (v_{m−1}|Wij,l) (3.13)

Thus we obtain the result of Eq. 3.9, through P (y|Wij,l) = P (vL = y|Wij,l). However, this computation is still generally intractable, since all of the above summations (Eqs. 3.10-3.13) are still over an exponential number of values. Therefore, we need to make one additional approximation.

3.4 Approximation 2: large fan-in

Next we simplify the above summations (Eqs. 3.10-3.13) assuming that the neuronal fan-in is "large". We keep in mind that i, j and l are the specific indices of the fixed weight Wij,l. All the other weights besides Wij,l can be treated as independent random variables, due to the mean field approximation (Eq. 3.5). 
Therefore, in the limit of an infinite neuronal fan-in (∀m : Km → ∞) we can use the Central Limit Theorem (CLT) and say that the normalized input to each neuronal layer is distributed according to a Gaussian distribution,

∀m : um = Wm v_{m−1} / √Km ∼ N (μm, Σm) . (3.14)

Since Km is actually finite, this is only an approximation - though a quite common and effective one (e.g., [21]). Using the approximation in Eq. 3.14 together with vm = sign (um) (Eq. 2.1) we can calculate (appendix B) the distribution of um and vm sequentially for all the layers m ∈ {1, . . . , L}, for any given value of v0 and Wij,l. These calculations effectively simplify the summations in Eqs. 3.10-3.13 using Gaussian integrals (appendix B).

At the end of this "forward pass" we will be able to find P (y|Wij,l) = P (vL = y|Wij,l), ∀i, j, l. This takes a polynomial number of steps (appendix B.3), instead of a direct calculation through Eqs. 3.11-3.13, which is exponentially hard. Using P (y|Wij,l) and Eq. 3.7 we can now update the distribution P (Wij,l). This immediately gives the Bayes estimates of the weights (Eq. 3.6) and outputs (Eq. 3.2).

As we note in appendix B.3, the computational complexity of the forward pass is significantly lower if Σm is diagonal. This holds exactly only in special cases - for example, if all hidden neurons have a fan-out of one, such as in a 2-layer network with a single output. However, in order to reduce the computational complexity in cases where Σm is not diagonal, we approximate the distribution of um with its factorized ('mean-field') version. Recall that the optimal factor is the marginal of the distribution (appendix A). 
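As a quick sanity check of Eq. 3.14 (an illustration, not from the paper; the distribution chosen for the posterior means is arbitrary), one can compare Monte Carlo draws of the normalized input u = Wv/√K of a single unit with binary ±1 weights against the Gaussian moments implied by the factorized posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 500                                # fan-in; the approximation assumes K is large
v = rng.choice([-1.0, 1.0], size=K)    # a fixed +-1 input vector
p = rng.uniform(0.1, 0.9, size=K)      # assumed P(W_r = +1) under the factorized posterior

# Moments of u = sum_r W_r v_r / sqrt(K) implied by the mean-field posterior:
w_mean = 2 * p - 1                     # <W_r> for W_r in {-1, +1}
mu = np.sum(w_mean * v) / np.sqrt(K)
var = np.sum(1.0 - w_mean**2) / K      # <W_r^2> = 1 for binary weights, v_r^2 = 1

# Monte Carlo draws of u from the actual (independent) weight posterior:
W = np.where(rng.random((20000, K)) < p, 1.0, -1.0)
u = (W @ v) / np.sqrt(K)
```

For large K the empirical mean and variance of `u` match `mu` and `var` closely, and the histogram of `u` is close to N(mu, var), as the CLT argument predicts.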
Therefore, we can now find P (y|Wij,l) easily (appendix B.1), as all the off-diagonal components in Σm are zero, so Σ_{kk′,m} = σ²_{k,m} δ_{kk′}.

A direct calculation of P (vL = y|Wij,l) for every i, j, l would be computationally wasteful, since we would repeat similar calculations many times. In order to improve the algorithm's efficiency, we again exploit the fact that Kl is large. We approximate ln P (vL = y|Wij,l) using a Taylor expansion of Wij,l around its mean, ⟨Wij,l⟩, to first order in Kl^{−1/2}. The first order terms in this expansion can be calculated using backward propagation of derivative terms

Δk,m = ∂ ln P (vL = y) / ∂μk,m , (3.15)

similarly to the BP algorithm (appendix C). Thus, after a forward pass for m = 1, . . . , L, and a backward pass for l = L, . . . , 1, we obtain P (vL = y|Wij,l) for all Wij,l and update P (Wij,l).

4 The Expectation Backpropagation Algorithm

Using these results we can efficiently update the posterior distribution P (Wij,l|Dn) for all the weights with O (|W|) operations, according to Eq. 3.7. Next, we summarize the resulting general algorithm - the Expectation BackPropagation (EBP) algorithm. In appendix D, we exemplify how to apply the algorithm in the special cases of MNNs with binary, ternary or real (continuous) weights. Similarly to the original BP algorithm (see review in [15]), given input x and desired output y, first we perform a forward pass to calculate the mean output ⟨vl⟩ of each layer. Then we perform a backward pass to update P (Wij,l|Dn) for all the weights.

Forward pass  In this pass we perform the forward calculation of probabilities, as in Eq. 3.11. Recall that ⟨Wkr,m⟩ is the mean of the posterior distribution P (Wkr,m|Dn). 
We first initialize the MNN input ⟨vk,0⟩ = xk for all k, and calculate recursively the following quantities for m = 1, . . . , L and all k:

μk,m = (1/√Km) Σ_{r=1}^{V_{m−1}} ⟨Wkr,m⟩ ⟨v_{r,m−1}⟩ ; ⟨vk,m⟩ = 2 Φ (μk,m / σk,m) − 1 , (4.1)

σ²k,m = (1/Km) Σ_{r=1}^{V_{m−1}} [ ⟨W²kr,m⟩ ( (⟨v_{r,m−1}⟩² − 1) δm,1 + 1 ) − ⟨Wkr,m⟩² ⟨v_{r,m−1}⟩² ] , (4.2)

where μm and σ²m are, respectively, the mean and variance of um, the input of layer m (Eq. 3.14), and ⟨vm⟩ is the resulting mean of the output of layer m.

Backward pass  In this pass we perform the Bayes update of the posterior (Eq. 3.7) using a Taylor expansion. Recall Eq. 3.15. We first initialize4

Δi,L = yi N (0|μi,L, σ²i,L) / Φ (yi μi,L / σi,L) (4.3)

for all i. Then, for l = L, . . . , 1 and ∀i, j we calculate

Δi,l−1 = (2/√Kl) N (0|μ_{i,l−1}, σ²_{i,l−1}) Σ_{j=1}^{Vl} ⟨Wji,l⟩ Δj,l , (4.4)

ln P (Wij,l|Dn) = ln P (Wij,l|Dn−1) + (1/√Kl) Wij,l Δi,l ⟨v_{j,l−1}⟩ + C , (4.5)

where C is some unimportant constant (which does not depend on Wij,l).

Output  Using the posterior distribution, the optimal configuration can be immediately found through the MAP weights estimate (Eq. 3.6): ∀i, j, l

W∗ij,l = argmax_{Wij,l ∈ Sij,l} ln P (Wij,l|Dn) . (4.6)

The output of a MNN implementing these weights would be g (x, W∗) (see Eq. 2.2). We define this to be the 'deterministic' EBP output (EBP-D).
Additionally, the MAP output (Eq. 
3.2) can be calculated directly,

y∗ = argmax_{y ∈ Y} ln P (vL = y) = argmax_{y ∈ Y} Σ_k ln [ ( (1 + ⟨vk,L⟩) / (1 − ⟨vk,L⟩) )^{yk} ] , (4.7)

using ⟨vk,L⟩ from Eq. 4.1, or as an ensemble average over the outputs of all possible MNNs with the weights Wij,l sampled from the estimated posterior P (Wij,l|Dn). We define the output in Eq. 4.7 to be the Probabilistic EBP output (EBP-P). Note that in the case of a single output Y = {−1, 1}, so this output simplifies to y = sign (⟨vk,L⟩).

4 Due to numerical inaccuracy, calculating Δi,L using Eq. 4.3 can generate nonsensical values (±∞, NaN) if |μi,L/σi,L| becomes too large. If this happens, we use instead the asymptotic form in that limit: Δi,L = −√KL (μi,L / σ²i,L) I {yi μi,L < 0}.

5 Numerical Experiments

We use several high dimensional text datasets to assess the performance of the EBP algorithm in a supervised binary classification task. The datasets (taken from [6]) contain eight binary tasks from four datasets: 'Amazon (sentiment)', '20 Newsgroups', 'Reuters' and 'Spam or Ham'. Data specification (N = #examples and M = #features) and results (for each algorithm) are described in Table 1. More details on the data, including data extraction and labeling, can be found in [6].

We test the performance of EBP on MNNs with a 2-layer architecture of M → 120 → 1, and bias weights. We examine two special cases: (1) MNNs with real weights; (2) MNNs with binary weights (and real bias). Recall that the motivation for the latter (section 1) is that they can be efficiently implemented in hardware (real bias has negligible costs). 
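Putting the two passes of section 4 together for the binary-weight case, one online EBP update can be sketched as follows. This is our own illustrative NumPy reimplementation of Eqs. 4.1-4.5, not the authors' released code; in particular, storing the factorized posterior of each ±1 weight as log-odds H, with ⟨W⟩ = tanh(H) and ⟨W²⟩ = 1, is an assumption on our part:

```python
import numpy as np
from math import erf, sqrt, pi

phi_cdf = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # Gaussian CDF

def gauss0(mu, sig2):
    """Gaussian density N(0 | mu, sig2), used in Eqs. 4.3-4.4."""
    return np.exp(-mu**2 / (2.0 * sig2)) / np.sqrt(2.0 * pi * sig2)

def ebp_step(H, x, y):
    """One online EBP update for a binary-weight MNN (no biases).

    H[l] holds per-weight log-odds, so <W_l> = tanh(H[l]) and <W_l^2> = 1.
    Sketch of Eqs. 4.1-4.5: one forward and one backward pass per sample."""
    L = len(H)
    W = [np.tanh(h) for h in H]             # posterior means <W_l>
    v = [np.asarray(x, float)]              # <v_0> = x
    mu, sig2 = [], []
    # Forward pass (Eqs. 4.1-4.2).
    for m in range(L):
        K = W[m].shape[1]
        mu.append(W[m] @ v[m] / sqrt(K))
        if m == 0:                          # the input layer is deterministic
            s2 = ((1.0 - W[m]**2) @ v[m]**2) / K
        else:
            s2 = 1.0 - (W[m]**2 @ v[m]**2) / K
        sig2.append(np.maximum(s2, 1e-12))  # guard against zero variance
        v.append(2.0 * phi_cdf(mu[m] / np.sqrt(sig2[m])) - 1.0)
    # Backward pass (Eqs. 4.3-4.5).
    delta = y * gauss0(mu[-1], sig2[-1]) / phi_cdf(y * mu[-1] / np.sqrt(sig2[-1]))
    for l in range(L - 1, -1, -1):
        K = W[l].shape[1]
        H[l] += np.outer(delta, v[l]) / sqrt(K)          # Eq. 4.5, in log-odds form
        if l > 0:                                        # Eq. 4.4
            delta = (2.0 / sqrt(K)) * gauss0(mu[l-1], sig2[l-1]) * (W[l].T @ delta)
    return H
```

Under this parameterization, repeated calls of `ebp_step` over the data perform the online training; the MAP (EBP-D) weights of Eq. 4.6 are then `sign(H[l])`, and the EBP-P output follows Eq. 4.7 from a final forward pass.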
Recall also that for each type of MNN, the algorithm gives two outputs - EBP-D (deterministic) and EBP-P (probabilistic), as explained near Eqs. 4.6-4.7.

To evaluate our results we compare EBP to: (1) the AROW algorithm, which reports state-of-the-art results on the tested datasets [6]; (2) the traditional BackPropagation (BP) algorithm, used to train an M → 120 → 1 MNN with real weights. In the latter case, we used both Cross Entropy (CE) and Mean Square Error (MSE) as loss functions. On each dataset we report the results of BP with the loss function which achieved the minimal error. We use a simple parameter scan for both AROW (regularization parameter) and the traditional BP (learning rate parameter). Only the results with the optimal parameters (i.e., achieving best results) are reported in Table 1. The optimal parameters found were never at the edges of the scanned range. Lastly, to demonstrate the destructive effect of naive quantization, we also report the performance of the BP-trained MNNs after all the weights (except the bias) were clipped using a sign function.

During training the datasets were repeatedly presented over three epochs (in all algorithms, additional epochs did not reduce the test error). On each epoch the examples were shuffled in random order for BP and EBP (AROW determines its own order). The test results are calculated after each epoch using 8-fold cross-validation, similarly to [6]. Empirically, EBP running time is similar to that of BP with real weights, and twice as slow with binary weights. For additional implementation details, see appendix E.1. The code is available online5.

The minimal values achieved over all three epochs are summarized in Table 1. As can be seen, in all datasets EBP-P performs better than AROW, which performs better than BP. Also, EBP-P usually performs better with binary weights. 
In appendix E.2 we show that this ranking remains true even if the fan-in is small (in contrast to our assumptions), or if a deeper 3-layer architecture is used.

Dataset                    #Examples  #Features  Real EBP-D  Real EBP-P  Binary EBP-D  Binary EBP-P  AROW    BP      Clipped BP
Reuters news I6            2000       11463      14.5%       11.35%      21.7%         9.95%         11.72%  13.3%   26.15%
Reuters news I8            2000       12167      15.65%      15.25%      23.15%        16.4%         15.27%  18.2%   26.4%
Spam or ham d0             2500       26580      1.28%       1.11%       7.93%         0.76%         1.12%   1.32%   7.97%
Spam or ham d1             2500       27523      1.0%        0.96%       3.85%         0.96%         1.36%   1.4%    7.33%
20News group comp vs HW    1943       29409      5.06%       4.96%       7.54%         4.44%         5.79%   7.02%   13.07%
20News group elec vs med   1971       38699      3.36%       3.15%       6.0%          2.08%         2.74%   3.96%   14.23%
Amazon Book reviews        3880       221972     2.14%       2.09%       2.45%         2.01%         2.24%   2.96%   3.81%
Amazon DVD reviews         3880       238739     2.06%       2.14%       5.72%         2.27%         2.63%   2.94%   5.15%

Table 1: Data specification, and test errors (with 8-fold cross-validation). Best results are boldfaced.

6 Discussion

Motivated by the recent success of MNNs, we developed the Expectation BackPropagation algorithm (EBP - see section 4) for approximate Bayesian inference of the synaptic weights of a MNN. Given a supervised classification task with labeled training data and a prior over the weights, this deterministic online algorithm can be used to train deterministic MNNs (Eq. 2.2) without the need to tune learning parameters (e.g., learning rate). Furthermore, each synaptic weight can be restricted to some set - which can be either finite (e.g., binary numbers) or infinite (e.g., real numbers). This
This\nopens the possibility of implementing trained MNNs in power-ef\ufb01cient hardware devices requiring\nlimited parameter precision.\n\n5https://github.com/ExpectationBackpropagation/EBP_Matlab_Code/\n\n7\n\n\fThis algorithm is essentially an analytic approximation to the intractable Bayes calculation of the\nposterior distribution of the weights after the arrival of a new data point. To simplify the intractable\nBayes update rule we use several approximations. First, we approximate the posterior using a prod-\nuct of its marginals - a \u2018mean \ufb01eld\u2019 approximation. Second, we assume the neuronal layers have a\nlarge fan-in, so we can approximate them as Gaussian. After these two approximations each Bayes\nupdate can be tractably calculated in polynomial time in the size of the MNN. However, in order to\nfurther improve computational complexity (to O (|W|) in each step, like BP), we make two addi-\ntional approximations. First, we use the large fan-in to perform a \ufb01rst order expansion. Second, we\noptionally6 perform a second \u2018mean \ufb01eld\u2019 approximation - to the distribution of the neuronal inputs.\nFinally, after we obtain the approximated posterior using the algorithm, the Bayes estimates of the\nmost probable weights and the outputs are found analytically.\nPrevious approaches to obtain these Bayes estimates were too limited for our purposes. The Monte\nCarlo approach [20] achieves state-of-the-art performance for small MNNs [25], but does not scale\nwell [24]. The Laplace approximation [16] and variational Bayes [9? , 8] based methods re-\nquire continuous-valued weights, tuning of the learning rate parameter, and stochastic neurons (to\n\u201csmooth\u201d the likelihood). 
Previous EP [23, 21] and message passing [5, 1] (a special case of EP [4]) based methods were derived only for SNNs.

In contrast, EBP allows parameter-free and scalable training of various types of MNNs (deterministic or stochastic) with discrete (e.g., binary) or continuous weights. In appendix F, we see that for continuous weights EBP is almost identical to standard BP with a specific choice of activation function, s (x) = 2Φ (x) − 1, CE loss and learning rate η = 1. The only difference is that the input is normalized by its standard deviation (Eq. 4.1, right), which depends on the weights and inputs (Eq. 4.2). This re-scaling makes the learning algorithm invariant to amplitude changes in the neuronal input, mirroring the same invariance of the sign activation function. Note that in the standard BP algorithm the performance is directly affected by the amplitude of the input, so it is a recommended practice to re-scale it in pre-processing [15].

We numerically evaluated the algorithm on binary classification tasks using MNNs with two or three synaptic layers. In all data sets and MNNs, EBP performs better than standard BP with the optimal constant learning rate, and even achieves state-of-the-art results in comparison to [6]. Surprisingly, EBP usually performs best when it is used to train binary MNNs. As suggested by a reviewer, this could be related to the type of problems examined here. Text classification tasks have large sparse input spaces (bag of words), in which the presence or absence of features (words) is more important than their real values (frequencies). Therefore, (distributions over) binary weights and a threshold activation function may work well.

In order to get such good performance from binary MNNs, one must average the output over the inferred (approximate) posterior of the weights. The EBP-P output of the algorithm calculates this average analytically. 
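Such a posterior average can also be approximated by sampling. Here is a hypothetical sketch in which the factorized posterior over each binary weight is stored as log-odds H, so that P(W = +1) = sigmoid(2H) and ⟨W⟩ = tanh(H); the names and this parameterization are our own illustrative choices:

```python
import numpy as np

def sign(u):
    return np.where(u > 0, 1.0, -1.0)

def ebp_p_by_sampling(H, x, n_samples=1000, rng=None):
    """Monte Carlo version of the EBP-P output for a single-output task:
    sample binary weight configurations from the factorized posterior
    (P(W = +1) = sigmoid(2 * H), so that <W> = tanh(H)), run each sampled
    deterministic sign-network, and take a majority vote over the outputs."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_samples):
        v = np.asarray(x, dtype=float)
        for h in H:
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * h))          # P(W = +1) per weight
            W = np.where(rng.random(h.shape) < p_plus, 1.0, -1.0)
            v = sign(W @ v)                                  # Eq. 2.2 on one sample
        total += v
    return sign(total)       # majority vote, approximating Eq. 4.7
```

As the number of samples grows, the vote converges to the analytic EBP-P output; in hardware the same average could be formed by running a few sampled binary networks in parallel.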
In hardware, this output could be realized by averaging the outputs of several binary MNNs, with weights sampled from P(W_ij,l | D_n). This can be done efficiently (appendix G).

Our numerical testing focused mainly on high-dimensional text classification tasks, where shallow architectures seem to work quite well. In other domains, such as vision [13] and speech [7], deep architectures achieve state-of-the-art performance. Such deep MNNs usually require considerable fine-tuning and additional 'tricks', such as unsupervised pre-training [7] or weight sharing [13]. Integrating such methods into EBP and using it to train deep MNNs is a promising direction for future work. Another important, and rather straightforward, generalization of the algorithm is to use activation functions other than sign(·). This is particularly important for the last layer, where a linear activation function would be useful for regression tasks and joint activation functions7 would be useful for multi-class tasks [3].

Acknowledgments The authors are grateful to C. Baldassi, A. Braunstein and R. Zecchina for helpful discussions, and to A. Hallak, T. Knafo and U. Sümbül for reviewing parts of this manuscript. The research was partially funded by the Technion V.P.R. fund, by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI), and by the Gruss Lipper Charitable Foundation.

6This approximation is not required if all neurons in the MNN have a fan-out of one.
7I.e., activation functions for which (f(x))_i ≠ f(x_i), such as softmax or argmax.

References
[1] C Baldassi, A Braunstein, N Brunel, and R Zecchina. Efficient supervised learning in networks with binary synapses. PNAS, 104(26):11079–84, 2007.
[2] R Battiti and G Tecchiolli. Training neural nets with the reactive tabu search. IEEE Transactions on Neural Networks, 6(5):1185–200, 1995.
[3] C M Bishop.
Neural networks for pattern recognition. Oxford University Press, 1995.
[4] C M Bishop. Pattern recognition and machine learning. Springer, Singapore, 2006.
[5] A Braunstein and R Zecchina. Learning by message passing in networks of discrete synapses. Physical Review Letters, 96(3), 2006.
[6] K Crammer, A Kulesza, and M Dredze. Adaptive regularization of weight vectors. Machine Learning, 91(2):155–187, March 2013.
[7] G E Dahl, D Yu, L Deng, and A Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
[8] A Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 1–9, 2011.
[9] G E Hinton and D Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In COLT '93, 1993.
[10] G E Hinton, L Deng, D Yu, G E Dahl, A R Mohamed, N Jaitly, A Senior, V Vanhoucke, P Nguyen, T N Sainath, and B Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[11] K Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[12] R Karakiewicz, R Genov, and G Cauwenberghs. 1.1 TMACS/mW fine-grained stochastic resonant charge-recycling array processor. IEEE Sensors Journal, 12(4):785–792, 2012.
[13] A Krizhevsky, I Sutskever, and G E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] Y LeCun and L Bottou. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] Y LeCun, L Bottou, G B Orr, and K R Müller. Efficient BackProp. In G Montavon, G B Orr, and K-R Müller, editors, Neural Networks: Tricks of the Trade.
Springer, Heidelberg, 2nd edition, 2012.
[16] D J C MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
[17] E Mayoraz and F Aviolat. Constructive training methods for feedforward neural networks with binary weights. International Journal of Neural Systems, 7(2):149–66, 1996.
[18] T P Minka. Expectation propagation for approximate Bayesian inference. NIPS, pages 362–369, 2001.
[19] P Moerland and E Fiesler. Neural network adaptations to hardware implementations. In Handbook of Neural Computation. Oxford University Press, New York, 1997.
[20] R M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
[21] F Ribeiro and M Opper. Expectation propagation with factorizing distributions: A Gaussian approximation and performance results for simple models. Neural Computation, 23(4):1047–69, April 2011.
[22] D Saad and E Marom. Training feed forward nets with binary weights via a modified CHIR algorithm. Complex Systems, 4:573–586, 1990.
[23] S A Solla and O Winther. Optimal perceptron learning: An online Bayesian approach. In On-Line Learning in Neural Networks. Cambridge University Press, Cambridge, 1998.
[24] N Srivastava and G E Hinton. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[25] H Y Xiong, Y Barash, and B J Frey. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics, 27(18):2554–62, October 2011.