{"title": "Summed Weight Neuron Perturbation: An O(N) Improvement Over Weight Perturbation", "book": "Advances in Neural Information Processing Systems", "page_first": 212, "page_last": 219, "abstract": null, "full_text": "Summed Weight Neuron Perturbation: An O(N) \n\nImprovement over Weight Perturbation. \n\nBarry Flower and Marwan Jabri \n\nSEDAL \n\nDepartment of Electrical Engineering \n\nUniversity of Sydney \nNSW 2006 Australia \n\nAbstract \n\nThe algorithm presented performs gradient descent on the weight space \nof an Artificial Neural Network (ANN), using a finite difference to \napproximate the gradient The method is novel in that it achieves a com(cid:173)\nputational complexity similar to that of Node Perturbation, O(N3), but \ndoes not require access to the activity of hidden or internal neurons. \nThis is possible due to a stochastic relation between perturbations at the \nweights and the neurons of an ANN. The algorithm is also similar to \nWeight Perturbation in that it is optimal in terms of hardware require(cid:173)\nments when used for the training ofVLSI implementations of ANN's. \n\n1 INTRODUCTION \nOptimization of the weights of an ANN may be performed by, the application of a gradi(cid:173)\nent descent teclmique. The gradient may be calculated directly as in Backpropagation, or it \nmay be approximated by a Finite Difference Method which is what we concern ourselves \nwith in this paper. These methods lend themselves to the task of training hardware imple(cid:173)\nmentations of ANNs where real estate is at a premium and synaptic density is of great \nimportance. Neuron Perturbation (NP), as described by the Madaline Rule ill (MRllI) \n(Widrow and Lehr, 1990), is a teclmique that approximates the gradient of the Mean \nSquare Error (MSE) with respect to the change at a given neuron by applying a small per(cid:173)\nturbation to the input of the neuron and measuring the change in the MSE. The weight \n\nI1wij = -tl'-:l-'Xr \n\ndE \nonet. 
I \n\n(1) \n\nupdate is then calculated from the product of this gradient measure and the activation of \n\n212 \n\n\fSummed Weight Neuron Perturbation: An (O)N Improvement over Weight Perturbation \n\n213 \n\nthe neuron from which the weight is fed, as described by (1). \nWeight Perturbation (WP), as described by Jabri and Flower (Jabri and Flower, 1992) is a \nneural network training techniques based on gradient descent using a Finite Difference \nmethod to approximate the gradient. The gradient of the MSE with respect to a weight is \napproximated by applying a small pertubation to the weight and measuring the change in \nthe MSE. This gradient is then used to calculated the weight update such that: \n\niJE \naWr = -11\u00b7 :l-.. \nuw .. \nI) \n\n'J \n\n(2) \n\nThe advantages of WP over NP are that it performs better when limited precision weights \nare used, as shown by Xie and Jabri (Xie and Jabri, 1992), and is optimal with respect to \nhardware requirements when used to train VLSI implementations of ANNs. However, WP \nhas O(~) computational complexity whilst NP has O(N3) computational complexity. \nSummed Weight Neuron Perturbation (SWNP) is similar to NP in that it has a computa(cid:173)\ntional complexity of O(N3) but it has the added advantage that the activation of internal \nneurons does not need to be known. The cost of this reduced computational complexity is \nthat SWNP needs to save the perturbation vector used. \nIn the following sections a description of the SWNP algorithm is provided and, finally, \nsome experimental results are presented. \n\n2 THE SUMMED WEIGHT NEURON PERTURBATION \n\nALGORITHM \n\nA subsection of a feedforward ANN containing N neurons is shown in Figure 1. on which \nnomenclature the following derivation is based. \n\nFIGURE 1: Description Of Indices Used To Describe The Neurons Weights And \n\nPerturbations In An ANN. 
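The finite-difference update in (2) admits a compact sketch. The single linear neuron, quadratic loss, learning rate and perturbation size below are illustrative assumptions rather than the paper's experimental setup; the loop applies one perturbation per weight, measures the change in error, and restores the weight before moving on.

```python
import numpy as np

def mse(w, x, d):
    # Single linear neuron, used purely for illustration.
    return 0.5 * (d - w @ x) ** 2

def wp_step(w, x, d, eta=0.1, pert=1e-4):
    """One Weight Perturbation update: a forward difference per weight
    approximates dE/dw_ij, as in eq. (2)."""
    base = mse(w, x, d)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        w[j] += pert                             # perturb one weight
        grad[j] = (mse(w, x, d) - base) / pert   # forward difference
        w[j] -= pert                             # restore the weight
    return w - eta * grad                        # Delta w = -eta * dE/dw

# Illustrative usage on a single pattern.
x = np.array([1.0, 0.5])
d = 1.0
w = np.array([0.2, -0.3])
for _ in range(200):
    w = wp_step(w, x, d)
# after training, mse(w, x, d) is close to zero
```

Note that each update needs one forward pass per weight, which is the source of WP's extra factor of N relative to NP and SWNP.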
\nIn a feedforward network of N neurons the activation of a given neuron is determined by:\n\n    x_i(p) = f_i(net_i(p)),  where  net_i(p) = sum_l w_il * x_l(p),    (3)\n\nand f_i(y) is the ith neuron's transfer function, x_i(p) is the activation of the ith neuron for the pth pattern, and w_il is the weight connecting the lth neuron's output to the ith neuron's input. The error function (MSE) is defined as in (4), where T is the set of output neurons and d_k(p) is the expected value of the output on the kth neuron. The change in E(p) with respect to a given weight may then be expressed as (5).\n\n    E(p) = (1/2) * sum_{k in T} (d_k(p) - x_k(p))^2.    (4)\n\n    dE(p)/dw_ij = (dE(p)/dnet_i(p)) * x_j(p).    (5)\n\nThe first term on the right-hand side of (5) can be determined using a Finite Difference, in this case a Forward Difference, so that:\n\n    dE(p)/dnet_i(p) = Delta E_Gamma_i(p) / Gamma_i + O(Gamma_i),    (6)\n\nwhere\n\n    Delta E_Gamma_i(p) = E_Gamma_i(p) - E(p),    (7)\n\nand Gamma_i is the perturbation applied to the ith neuron, E_Gamma_i(p) is the error for the pth pattern with a perturbation applied to the ith neuron, and E(p) is the error for the pth pattern without a perturbation applied to any neuron. The error introduced by the approximation is represented by the last term on the right-hand side of (6).\nThe perturbation of one or more of the weights that are inputs to the qth neuron can be thought of as being equal to some perturbation applied directly to that neuron. Hence:\n\n    Gamma_q = sum_l gamma_ql * x_l(p),    (8)\n\nwhere gamma_ql is the perturbation applied to weight w_ql. As will be shown, perturbing the qth neuron by perturbing all the weights feeding into it enables the sign of the gradient dE(p)/dw_ij to be determined without performing the product on the right-hand side of (5). Furthermore, the activation of hidden neurons (i.e. x_j(p) in (5)) need not be known.
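The equivalence in (8) is easy to check numerically: adding gamma_ql to every input weight of a neuron shifts its net input by exactly Gamma_q = sum_l gamma_ql * x_l(p). A minimal sketch, where the sizes and random values are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)            # activations x_l(p) feeding neuron q
w_q = rng.normal(size=5)          # weights w_ql into neuron q
gamma = 1e-3 * rng.choice([-1.0, 1.0], size=5)  # random-polarity weight perturbations

net_before = w_q @ x
net_after = (w_q + gamma) @ x     # perturb every input weight at once
Gamma_q = gamma @ x               # summed neuron perturbation, eq. (8)

# The weight perturbations act exactly like one neuron perturbation.
assert np.isclose(net_after - net_before, Gamma_q)
```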
The contribution of the perturbation of weight w_ij to the perturbation of the ith neuron is\n\n    gamma_ij * x_j(p).    (9)\n\nLet us take the degenerate case where there is only one weight for the ith neuron. Then the gradient of the MSE with respect to weight w_ij is:\n\n    dE(p)/dw_ij = Delta E_Gamma_i(p) * x_j(p) / (gamma_ij * x_j(p)) + O(Gamma_i) = Delta E_Gamma_i(p) / gamma_ij + O(Gamma_i),    (10)\n\nnoting that x_j(p) has been eliminated. In the general case where the ith neuron has more than one weight, the gradient with respect to weight w_ij is shown in (11):\n\n    dE(p)/dw_ij = Delta E_Gamma_i(p) * x_j(p) / Gamma_i + O(Gamma_i) = Delta E_Gamma_i(p) / Psi_ij + O(Gamma_i),    (11)\n\nwhere\n\n    Psi_ij = Gamma_i / x_j(p).    (12)\n\nThe forms of (10) and (11) are the same, and it will be shown that gamma_ij can be substituted for Psi_ij in (11) due to a stochastic relationship between them.\nLet us represent the signs of gamma_ij and Psi_ij as either +1 or -1 such that:\n\n    mu_ij = gamma_ij / |gamma_ij|  and  nu_ij = Psi_ij / |Psi_ij|.    (13)\n\nThe set of all possible states for the system represented by the vector (mu_ij, nu_ij), assuming gamma_ij and Psi_ij are never zero, is:\n\n    {(-1, -1), (-1, 1), (1, -1), (1, 1)},    (14)\n\nand it can be seen that when mu_ij = nu_ij the sign of the gradient of the MSE with respect to weight w_ij given by (10) is the same as that given by (11). If the sign of gamma_ij is chosen randomly then the probability of mu_ij = nu_ij being true is 0.5, from (14), and so (10) will generate a gradient that is in the correct direction 50% of the time. This in itself is not sufficient to allow the network to be trained, as it will take as many steps in the incorrect direction as in the correct direction if the steps themselves are of the same size (i.e.
the magnitude of Gamma_i is the same for a step in the correct direction as for a step in the incorrect direction).\nFortunately, it can be shown that the size of the steps in the correct direction is greater than that of the steps in the incorrect direction. Let us take the case where a particular gamma_ij is chosen such that\n\n    mu_ij = nu_ij.    (15)\n\nNow, by substituting (8), (12) and (13) into (15), we get:\n\n    gamma_ij / |gamma_ij| = (sum_l gamma_il * x_l(p) / x_j(p)) / |sum_l gamma_il * x_l(p) / x_j(p)|,    (16)\n\nrearranging to give\n\n    gamma_ij * x_j(p) / sum_l gamma_il * x_l(p) > 0,    (17)\n\nwhich implies that the contribution to Gamma_i made by the perturbation gamma_ij is of the same sign as Gamma_i. Let us designate this neuron perturbation as Gamma_i(A). Now we take the other possible case, where\n\n    mu_ij != nu_ij,    (18)\n\nassuming every other parameter is the same and only the sign of gamma_ij is changed. The inequality in (17) is now untrue, and the contribution to Gamma_i made by the perturbation gamma_ij is of the opposite sign to Gamma_i. Let us designate this neuron perturbation as Gamma_i(B). From (8) we can determine that\n\n    |Gamma_i(A)| = |Gamma_i(B)| + 2 * |gamma_ij * x_j(p)|.    (19)\n\nEquation (19) shows the relationship between the two possible states of the system, where Gamma_i(A) represents the summed neuron perturbation for a selected weight perturbation gamma_ij that generates a step in the correct direction, and Gamma_i(B) is similar but for a step in the incorrect direction. Clearly the correct step is always calculated from an approximated gradient that is larger than that for an incorrect step, as the neuron perturbation is larger. The weight update rule then becomes:\n\n    Delta w_ij = -eta * Delta E_Gamma_i(p) / gamma_ij.    (20)\n\nThe algorithm for SWNP is shown as pseudo code in Figure 2.
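For the online-mode case, the SWNP procedure of Figure 2 can be read as the short loop below: all weights into a neuron are perturbed at once with random polarity, a single extra forward pass measures Delta E_Gamma_i(p), and each weight is then updated from its own saved gamma_ij as in (20). The single-layer tanh network, the logical-OR toy problem, and the learning parameters are illustrative assumptions, not the paper's benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(W, x):
    return np.tanh(W @ x)                          # neuron transfer function f_i

def error(W, x, d):
    return 0.5 * np.sum((d - forward(W, x)) ** 2)  # MSE for one pattern, eq. (4)

def swnp_epoch(W, patterns, eta=0.1, pert=1e-3):
    """One online-mode epoch of Summed Weight Neuron Perturbation."""
    total = 0.0
    for x, d in patterns:
        for i in range(W.shape[0]):                # all non-input neurons
            # Re-measured per neuron for simplicity; Figure 2 reuses the saved error.
            e0 = error(W, x, d)                    # error without perturbation
            gamma = pert * rng.choice([-1.0, 1.0], size=W.shape[1])
            W[i] += gamma                          # perturb every weight of neuron i
            dE = error(W, x, d) - e0               # Delta E_Gamma_i(p), eq. (7)
            W[i] -= gamma                          # restore the weights
            W[i] -= eta * dE / gamma               # eq. (20), elementwise over j
        total += error(W, x, d)
    return total

# Toy usage: learn logical OR (last input is a bias term).
patterns = [(np.array([a, b, 1.0]), np.array([float(a or b)]))
            for a in (0, 1) for b in (0, 1)]
W = 0.1 * rng.normal(size=(1, 3))
for _ in range(500):
    total_error = swnp_epoch(W, patterns)
```

With the seed fixed, the total error falls well below its initial value (about 1.5 for this toy problem). Batch mode would instead accumulate the weight deltas and apply them once per epoch, as in the Figure 2 pseudo code.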
\n2.1 HARDWARE COMPATIBILITY OF SWNP\n\nThis optimisation technique is ideally suited to the training of hardware implementations of ANNs, whether they consist of discrete components or of VLSI technology. The O(N) speed-up over WP is achieved at the cost of an O(N) storage requirement, but this storage can be achieved with a single bit per neuron. SWNP is of the same order of complexity as NP but does not require access to the activation of internal neurons, and can therefore treat a network as a \"black box\" into which an input vector and weight matrix are fed and from which an output vector is received.\n\nWhile (total error > error threshold) {\n    For (all patterns in training set) {\n        Select next pattern and training vector;\n        Forward Prop.; Measure (calculate) and save error;\n        Accumulate total error;\n        For (all non-input neurons) {\n            For (all weights of current neuron) {\n                Apply and save perturbation of random polarity;\n            }\n            Forward Prop.; Measure (calculate) and save error;\n            For (all weights of current neuron) {\n                Restore value of weight;\n                Calculate weight delta using saved perturbation value;\n                If (Online Mode) Update current weight;\n            }\n            If (Online Mode)\n                Forward Prop.; Measure (calculate) and save new error;\n        }\n        If (Batch Mode) {\n            For (all weights)\n                Update current weight;\n        }\n    }\n}\n\nFIGURE 2: Algorithm in pseudo code for Summed Weight Neuron Perturbation.\n\n3 TEST RESULTS USING SWNP\n\nThe results for a series of tests are shown in the next three tables and are summarised in Figure 4. The headings are: N, the number of neurons in the network; P, the number of patterns in the training set; FF-SWNP, the number of feedforward passes for the SWNP algorithm; FF-WP, the number of feedforward passes for the WP algorithm; and RATIO, the ratio of the number of feedforward passes for WP against SWNP.
The feedforward passes are recorded to 1 significant figure.\nThe results of a series of simulations comparing the performance of SWNP against WP are shown in Table 1. The simulations utilised floating point synaptic and neuron precisions.\nThe results of a series of simulations comparing the performance of SWNP against WP with limited synaptic precision (i.e. 6 bits) and floating point neuron precision are shown in Table 2.\nThe results of a series of hardware experiments comparing the performance of SWNP against WP are shown in Table 3. Note: the training algorithms are the variations of WP and SWNP that are combined with the Random Search Algorithm (RSA). The results reported are averaged over 10 trials.\nAn example of the training error trajectories of WP and SWNP for the Monk 2 problem is shown in Figure 3.\n\nTable 1: Performance of SWNP versus WP, comparing feedforward operations to convergence. (Simulations with floating point precision)\n\nPROBLEM   | N  | P   | FF-SWNP  | FF-WP    | ERROR  | RATIO\nXOR       | 3  | 4   | 1.6x10^3 | 1.9x10^3 | 0.0125 | 1.22\n4 Encoder | 5  | 4   | 0.9x10^3 | 1.8x10^3 | 0.0125 | 1.84\n8 Encoder | 11 | 8   | 1.5x10^5 | 4.5x10^5 | 0.0125 | 2.88\nICEG      | 15 | 119 | 3.7x10^5 | 7.9x10^6 | 0.0125 | 21.34\n\nTable 2: Performance of SWNP versus WP, comparing feedforward operations to convergence. (Simulations with limited precision)\n\nPROBLEM | N  | P   | FF-SWNP  | FF-WP    | ERROR  | RATIO\nMonk 1  | 4  | 129 | 1.0x10^5 | 1.9x10^6 | 0.001  | 19.38\nMonk 2  | 17 | 169 | 3.6x10^5 | 6.8x10^6 | 0.0005 | 18.71\nMonk 3  | 17 | 122 | 1.2x10^6 | 7.1x10^6 | 0.022  | 5.87\nICEG 55 | 5  | 8   | 1.6x10^4 | 7.2x10^4 | 0.0001 | 4.2\n\nTable 3: Performance of SWNP versus WP, comparing feedforward operations to convergence.
(Hardware implementation)\n\nPROBLEM | N | P | FF-SWNP  | FF-WP    | ERROR   | RATIO\nECG 55  | 5 | 8 | 3.1x10^3 | 3.6x10^3 | 0.00001 | 1.13\nECG045  | 5 | 8 | 1.1x10^4 | 2.0x10^4 | 0.001   | 1.78\n\nFIGURE 3: Comparison of WP and SWNP for the Monk 2 problem.\n\nFIGURE 4: Comparison of the number of feedforward passes performed to achieve convergence on a range of problems using SWNP and WP.\n\n4 CONCLUSION\n\nThe algorithm presented, SWNP, performs gradient descent on the weight space of an ANN, using a finite difference to approximate the gradient. The method is novel in that it achieves O(N^3) computational complexity, similar to that of Node Perturbation, but does not require access to the activity of hidden or internal neurons. The algorithm is also similar to Weight Perturbation in that it is optimal in terms of hardware requirements when used for the training of VLSI implementations of ANNs. Results are presented that show the algorithm in operation on floating point simulations, limited precision simulations and an actual hardware implementation of an ANN.\n\nReferences\n\nJabri, M. and Flower, B. (1992). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. IEEE Transactions on Neural Networks, 3(1):154-157.\n\nWidrow, B.
and Lehr, M. A. (1990). 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415-1442.\n\nXie, Y. and Jabri, M. (1992). Analysis of the effects of quantization in multilayer neural networks using a statistical model. IEEE Transactions on Neural Networks, 3(2):334-338.\n", "award": [], "sourceid": 608, "authors": [{"given_name": "Barry", "family_name": "Flower", "institution": null}, {"given_name": "Marwan", "family_name": "Jabri", "institution": null}]}