{"title": "A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 836, "page_last": 844, "abstract": null, "full_text": "A Parallel Gradient Descent Method for Learning \n\nin Analog VLSI Neural Networks \n\nJ. Alspector R. Meir'\" B. Yuhas A. Jayakumar D. Lippet \n\nBellcore \n\nMorristown, NJ 07962-1910 \n\nAbstract \n\nTypical methods for gradient descent in neural network learning involve \ncalculation of derivatives based on a detailed knowledge of the network \nmodel. This requires extensive, time consuming calculations for each pat(cid:173)\ntern presentation and high precision that makes it difficult to implement \nin VLSI. We present here a perturbation technique that measures, not \ncalculates, the gradient. Since the technique uses the actual network as \na measuring device, errors in modeling neuron activation and synaptic \nweights do not cause errors in gradient descent. The method is parallel \nin nature and easy to implement in VLSI. We describe the theory of such \nan algorithm, an analysis of its domain of applicability, some simulations \nusing it and an outline of a hardware implementation. \n\n1 \n\nIntroduction \n\nThe most popular method for neural network learning is back-propagation (Rumel(cid:173)\nhart, 1986) and related algorithms that calculate gradients based on detailed knowl(cid:173)\nedge of the neural network model. These methods involve calculating exact values \nof the derivative of the activation function. For analog VLSI implementations, such \ntechniques require impossibly high precision in the synaptic weights and precise \nmodeling of the activation functions. It is much more appealing to measure rather \nthan calculate the gradient for analog VLSI implementation by perturbing either a \n\n\u00b7Present address: Dept. of EE; Technion; Haifa, Israel \ntpresent address: Dept. 
of EE; MIT; Cambridge, MA

single weight (Jabri, 1991) or a single neuron (Widrow, 1990) and measuring the resulting change in the output error. However, perturbing only a single weight or neuron at a time loses one of the main advantages of implementing neural networks in analog VLSI, namely, that of computing weight changes in parallel. The one-weight-at-a-time perturbation method has the same order of time complexity as a serial computer simulation of learning. A mathematical analysis of the possibility of model-free learning using parallel weight perturbations followed by local correlations suggests that random perturbations by additive, zero-mean, independent noise sources may provide a means of parallel learning (Dembo, 1990). We have previously used such a noise source (Alspector, 1991) in a different implementable learning model.

2 Gradient Estimation by Parallel Weight Perturbation

2.1 A Brownian Motion Algorithm

One can estimate the gradient of the error $E(\mathbf{w})$ with respect to any weight $w_l$ by perturbing $w_l$ by $\delta w_l$ and measuring the change $\delta E$ in the output error while the entire weight vector $\mathbf{w}$, except for component $w_l$, is held constant:

$$\frac{\delta E}{\delta w_l} = \frac{E(\mathbf{w} + \delta w_l) - E(\mathbf{w})}{\delta w_l} \qquad (1)$$

This leads to an approximation to the true gradient $\partial E / \partial w_l$:

$$\frac{\delta E}{\delta w_l} = \frac{\partial E}{\partial w_l} + O(|\delta w_l|) \qquad (2)$$

For small perturbations, the second (and higher order) term can be ignored. This method of perturbing weights one at a time has the advantage of using the correct physical neurons and synapses in a VLSI implementation, but has time complexity of $O(W)$, where $W$ is the number of weights.

Following (Dembo, 1990), let us now consider perturbing all weights simultaneously. However, we wish to have the perturbation vector $\delta\mathbf{w}$ chosen uniformly on a hypercube. Note that this requires only a random sign multiplying a fixed perturbation and is natural for VLSI. Dividing the resulting change in error by any single weight change, say $\delta w_l$, gives

$$\frac{\delta E}{\delta w_l} = \frac{E(\mathbf{w} + \delta\mathbf{w}) - E(\mathbf{w})}{\delta w_l} \qquad (3)$$

which by a Taylor expansion is

$$\frac{\delta E}{\delta w_l} = \frac{\partial E}{\partial w_l} + \sum_{i \neq l} \frac{\partial E}{\partial w_i} \frac{\delta w_i}{\delta w_l} + O(|\delta\mathbf{w}|) \qquad (4)$$

leading to the approximation (ignoring higher order terms)

$$\frac{\delta E}{\delta w_l} \approx \frac{\partial E}{\partial w_l} + \sum_{i \neq l} \frac{\partial E}{\partial w_i} \left( \frac{\delta w_i}{\delta w_l} \right) \qquad (5)$$

An important point of this paper, emphasized by (Dembo, 1990) and embodied in Eq. (5), is that the last term has expectation value zero for random and independently distributed $\delta w_i$, since the last expression in parentheses is equally likely to be $+1$ as $-1$. Thus, one can approximately follow the gradient by perturbing all weights at the same time. If each synapse has access to information about the resulting change in error, it can adjust its weight by assuming it was the only weight perturbed. The weight change rule

$$\Delta w_l = -\eta \, \frac{\delta E}{\delta w_l} \qquad (6)$$

where $\eta$ is a learning rate, will follow the gradient on average, but with the considerable noise implied by the second term in Eq. (5). This type of stochastic gradient descent is similar to the random-direction Kiefer-Wolfowitz method (Kushner, 1978), which can be shown to converge under suitable conditions on $\eta$ and $\delta w_i$. It is also reminiscent of Brownian motion where, although particles may be subject to considerable random motion, there is a general drift of the ensemble of particles in the direction of even a weak external force. In this respect, there is some similarity to the directed drift algorithm of (Venkatesh, 1991), although that work applies to binary weights and single-layer perceptrons, whereas this algorithm should work for any level of weight quantization or precision (an important advantage for VLSI implementations), as well as any number of layers, and even for recurrent networks.
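As a concrete illustration, the parallel perturbation update of Eqs. (5) and (6) can be sketched in a few lines of NumPy. This is only a software sketch, not the authors' analog implementation: the function name, the parameter values, and the `loss` callable (which stands in for a measured network error) are all illustrative assumptions.

```python
import numpy as np

def parallel_perturbation_step(w, loss, delta=1e-3, eta=0.1, rng=None):
    """One parallel weight-perturbation update, in the spirit of Eq. (6).

    All weights are perturbed simultaneously by +/-delta (random signs,
    i.e. a corner of a hypercube), the resulting change in the output
    error is measured, and each weight adjusts as if it alone had been
    perturbed.  `loss` stands in for a measurement of the network error.
    """
    rng = np.random.default_rng() if rng is None else rng
    signs = rng.choice([-1.0, 1.0], size=w.shape)  # random-sign perturbation
    dw = delta * signs
    dE = loss(w + dw) - loss(w)       # error change is measured, not calculated
    grad_est = dE / dw                # per-weight estimate, as in Eq. (5)
    return w - eta * grad_est         # weight change rule of Eq. (6)
```

Because every weight divides the same measured error change by its own perturbation, each synapse needs only local information plus the broadcast value of the error change, which is what makes the scheme attractive for parallel analog hardware.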
2.2 Improving the Estimate by Multiple Perturbations

As was pointed out by (Dembo, 1990), for each pattern one can reduce the variance of the noise term in Eq. (5) by repeating the random parallel perturbation many times to improve the statistical estimate. If we average over $P$ perturbations, we have

$$\frac{\overline{\delta E}}{\delta w_l} = \frac{1}{P} \sum_{p=1}^{P} \frac{\delta E^{(p)}}{\delta w_l^{(p)}} = \frac{\partial E}{\partial w_l} + \frac{1}{P} \sum_{p=1}^{P} \sum_{i \neq l} \frac{\partial E}{\partial w_i} \frac{\delta w_i^{(p)}}{\delta w_l^{(p)}} \qquad (7)$$

where $p$ indexes the perturbation number. The variance of the second term, which is a noise, $\nu$, is

$$\langle \nu^2 \rangle = \frac{1}{P^2} \sum_{p,p'=1}^{P} \sum_{i,j \neq l} \frac{\partial E}{\partial w_i} \frac{\partial E}{\partial w_j} \left\langle \frac{\delta w_i^{(p)}}{\delta w_l^{(p)}} \frac{\delta w_j^{(p')}}{\delta w_l^{(p')}} \right\rangle \qquad (8)$$

where the expectation value, $\langle \cdot \rangle$, leads to the Kronecker delta functions $\delta_{ij}\,\delta_{pp'}$. This reduces Eq. (8) to

$$\langle \nu^2 \rangle = \frac{1}{P^2} \sum_{p=1}^{P} \sum_{i \neq l} \left( \frac{\partial E}{\partial w_i} \right)^2 \qquad (9)$$

The double sum over perturbations and weights (assuming the gradient is bounded and all gradient directions have the same order of magnitude) has magnitude $O(PW)$, so that the variance is $O(W/P)$ and the standard deviation is

$$\sigma_\nu = O\!\left(\sqrt{W/P}\right) \qquad (10)$$

Therefore, for a fixed variance in the noise term, it may be necessary to have a number of perturbations of the same order as the number of weights. So, if a high-precision estimate of the gradient is needed throughout learning, it seems as though the time complexity will still be $O(W)$, giving no advantage over single perturbations. However, one or a few of the gradient derivatives may dominate the noise and reduce the effective number of parameters. One can also make a qualitative argument that early in learning one does not need a precise estimate of the gradient, since a general direction in weight space will suffice. Later, it will be necessary to make a more precise estimate for learning to converge.

2.3 The Gibbs Distribution and the Learning Problem

Note that the noise of Eq.
(7) is Gaussian, since it is composed of a sum of random-sign terms: these give a binomial distribution, which is approximately Gaussian for large $P$. Thus, in the continuous-time limit, the learning problem has Langevin dynamics, such that the time rate of change of a weight $w_k$ is

$$\frac{dw_k}{dt} = -\eta \left( \frac{\partial E}{\partial w_k} + \nu_k \right) \qquad (11)$$

and the learning problem converges in probability (Zinn-Justin, 1989), so that asymptotically Pr(w)
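To make the variance reduction of Eqs. (7)-(10) concrete, the following sketch averages the single-pattern parallel-perturbation estimate over many perturbations. As before, this is an illustrative software model under assumed names and parameter values; the `loss` callable stands in for a measured network error.

```python
import numpy as np

def averaged_gradient_estimate(w, loss, delta=1e-3, n_perturb=100, rng=None):
    """Average the parallel-perturbation gradient estimate over
    n_perturb random-sign perturbations, as in Eq. (7).  The noise
    standard deviation falls as O(sqrt(W / n_perturb)), Eq. (10)."""
    rng = np.random.default_rng() if rng is None else rng
    base_error = loss(w)                 # E(w), measured once per pattern
    estimate = np.zeros_like(w)
    for _ in range(n_perturb):
        dw = delta * rng.choice([-1.0, 1.0], size=w.shape)
        estimate += (loss(w + dw) - base_error) / dw
    return estimate / n_perturb
```

Averaging over $P$ perturbations divides the noise variance by $P$, so driving the noise down to a fixed level requires $P$ of the same order as the number of weights $W$, in line with the discussion in Section 2.2.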