{"title": "Synaptic Weight Noise During MLP Learning Enhances Fault-Tolerance, Generalization and Learning Trajectory", "book": "Advances in Neural Information Processing Systems", "page_first": 491, "page_last": 498, "abstract": null, "full_text": "Synaptic Weight Noise During MLP Learning Enhances Fault-Tolerance, Generalisation and Learning Trajectory\n\nAlan F. Murray\nDept. of Electrical Engineering\nEdinburgh University\nScotland\n\nPeter J. Edwards\nDept. of Electrical Engineering\nEdinburgh University\nScotland\n\nAbstract\n\nWe analyse the effects of analog noise on the synaptic arithmetic during MultiLayer Perceptron training, by expanding the cost function to include noise-mediated penalty terms. Predictions are made in the light of these calculations which suggest that fault tolerance, generalisation ability and learning trajectory should be improved by such noise-injection. Extensive simulation experiments on two distinct classification problems substantiate the claims. The results appear to be perfectly general for all training schemes where weights are adjusted incrementally, and have wide-ranging implications for all applications, particularly those involving \"inaccurate\" analog neural VLSI.\n\n1 Introduction\n\nThis paper demonstrates both by consideration of the cost function and the learning equations, and by simulation experiments, that injection of random noise on to MLP weights during learning enhances fault-tolerance without additional supervision. We also show that the nature of the hidden node states and the learning trajectory is altered fundamentally, in a manner that improves training times and learning quality. The enhancement uses the mediating influence of noise to distribute information optimally across the existing weights. 
\n\nTaylor [Taylor, 72] has studied noisy synapses, largely in a biological context, and infers that the noise might assist learning. We have already demonstrated that noise injection both reduces the learning time and improves the network's generalisation ability [Murray, 91],[Murray, 92]. It is established [Matsuoka, 92],[Bishop, 90] that adding noise to the training data in neural (MLP) learning improves the \"quality\" of learning, as measured by the trained network's ability to generalise. Here we infer (synaptic) noise-mediated terms that sculpt the error function to favour faster learning, and that generate more robust internal representations, giving rise to better generalisation and immunity to small variations in the characteristics of the test data. Much closer to the spirit of this paper is the work of Hanson [Hanson, 90]. His stochastic version of the delta rule effectively adapts weight means and standard deviations. Also Sequin and Clay [Sequin, 91] use stuck-at faults during training, which imbues the trained network with an ability to withstand such faults. They also note, but do not pursue, an increased generalisation ability.\n\nThis paper presents an outline of the mathematical predictions and verification simulations. A full description of the work is given in [Murray, 93].\n\n2 Mathematics\n\nLet us analyse an MLP with $I$ input, $J$ hidden and $K$ output nodes, with a set of $P$ training input vectors $\\mathbf{o}_p = \\{o_{ip}\\}$, looking at the effect of noise injection into the error function itself. We are thus able to infer, from the additional terms introduced by noise, the characteristics of solutions that tend to reduce the error, and those which tend to increase it. The former will clearly be favoured, or at least stabilised, by the additional terms, while the latter will be de-stabilised. 
\n\nLet each weight $T_{ab}$ be augmented by a random noise source, such that $T_{ab} \\to T_{ab} + \\Delta_{ab} T_{ab}$, for all weights $\\{T_{ab}\\}$. Neuron thresholds are treated in precisely the same way. Note in passing, but importantly, that this synaptic noise is not the same as noise on the input data. Input noise is correlated across the synapses leaving an input node, while the synaptic noise that forms the basis of this study is not. The effect is thus quite distinct.\n\nConsidering, therefore, an error function of the form:\n\n$$\\epsilon_{tot,p} = \\frac{1}{2} \\sum_{k=0}^{K-1} \\epsilon_{kp}^2 = \\frac{1}{2} \\sum_{k=0}^{K-1} \\left( o_{kp}(\\{T_{ab}\\}) - \\tilde{o}_{kp} \\right)^2 \\qquad (1)$$\n\nwhere $\\tilde{o}_{kp}$ is the target output. We can now perform a Taylor expansion of the output $o_{kp}$ to second order, around the noise-free weight set, $\\{T_N\\}$, and thus augment the error function:\n\n$$o_{kp} \\to o_{kp} + \\sum_{ab} T_{ab} \\Delta_{ab} \\left( \\frac{\\partial o_{kp}}{\\partial T_{ab}} \\right) + \\frac{1}{2} \\sum_{ab,cd} T_{ab} \\Delta_{ab} T_{cd} \\Delta_{cd} \\left( \\frac{\\partial^2 o_{kp}}{\\partial T_{ab} \\partial T_{cd}} \\right) + O(\\Delta^3) \\qquad (2)$$\n\nIf we ignore terms of order $\\Delta^3$ and above, and take the time average over the learning phase, we can infer that two terms are added to the error function:\n\n$$\\langle \\epsilon_{tot} \\rangle = \\langle \\epsilon_{tot}(\\{T_N\\}) \\rangle + \\frac{1}{2P} \\sum_{p=1}^{P} \\sum_{k=0}^{K-1} \\Delta^2 \\sum_{ab} T_{ab}^2 \\left[ \\left( \\frac{\\partial o_{kp}}{\\partial T_{ab}} \\right)^2 + \\epsilon_{kp} \\left( \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2} \\right) \\right] \\qquad (3)$$\n\nConsider also the perceptron rule update on the hidden-output layer along with the expanded error function:\n\n$$\\langle \\delta T_{kj} \\rangle = -\\tau \\sum_{p} \\langle \\epsilon_{kp} o_{jp} o'_{kp} \\rangle - \\tau \\frac{\\Delta^2}{2} \\sum_{p} \\langle o_{jp} o'_{kp} \\rangle \\times \\sum_{ab} T_{ab}^2 \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2} \\qquad (4)$$\n\naveraged over several training epochs (which is acceptable for small values of $\\tau$, the adaption rate parameter).\n\n3 Simulations\n\nThe simulations detailed below are based on the virtual targets algorithm [Murray, 92], a variant on backpropagation, with broadly similar performance. The \"targets\" algorithm was chosen for its faster convergence properties. 
Two contrasting classification tasks were selected to verify by simulation the predictions made in the following section. The first, a feature location task, uses real world normalised greyscale image data. The task was to locate eyes in facial images - to classify sections of these as either \"eye\" or \"not-eye\". The network was trained on 16 x 16 preclassified sections of the images, classified as eyes and not-eyes. The not-eyes were random sections of facial images, avoiding the eyes (see Fig. 1). The second, a more artificial task, was the ubiquitous character encoder (Fig. 2) where a 25-dimensional binary input vector describing the 26 alphabetic characters (each 5 x 5 pixels) was used to train the network with a one-out-of-26 output code.\n\nFigure 1: The eye/not-eye classifier. (Diagram: a 16 x 16 image section is classified as \"eye\" or \"not-eye\".)\n\nFigure 2: The character encoder task.\n\nDuring the simulations noise was added to the weights at a level proportional to the weight size and with a uniform probability density (i.e. $-\\Delta_{max} < \\Delta < \\Delta_{max}$). Levels of up to 40% were probed in detail - although it is clear that the expansion above is not quantitatively valid at this level. Above these percentages further improvements were seen in the network performance, although the dynamics of the training algorithm became chaotic. The injected noise level was reduced smoothly to a minimum value of 1% as the network approached convergence (as evidenced by the highest output bit error). As in all neural network simulations, the results depended upon the training parameters, network sizes and the random start position of the network. To overcome these factors and to achieve a meaningful result 35 weight sets were produced for each noise level. All other characteristics of the training process were held constant. 
The results are therefore not simply pathological freaks.\n\n4 Prediction/Verification\n\n4.1 Fault Tolerance\n\nConsider the first derivative penalty term in the expanded cost function (3), averaged over all patterns, output nodes and weights:\n\n$$K \\times \\Delta^2 \\left\\langle T_{ab}^2 \\left( \\frac{\\partial o_{kp}}{\\partial T_{ab}} \\right)^2 \\right\\rangle \\qquad (5)$$\n\nThe implications of this term are straightforward. For large values of the (weighted) average magnitude of the derivative, the overall error is increased. This term therefore causes solutions to be favoured where the dependence of outputs on individual weights is evenly distributed across the entire weight set. Furthermore, weight saliency should not only have a lower average value, but a smaller scatter across the weight set as the training process attempts to reconcile the competing pressures to reduce both (1) and (5). This more distributed representation should be manifest in an improved tolerance to faulty weights.\n\nFigure 3: Fault tolerance in the character encoder problem. (Plot: x-axis Synapses Removed (%), 0 to 25; one curve per training noise level: 0%, 10%, 20%, 30%, 40%.)\n\nSimulations were carried out on 35 weight sets produced for each of the two problems at each of 5 levels of noise injected during training. Weights were then randomly removed and the networks tested on the training data. The resulting graphs (Fig. 3, 4) show graceful degradation with an increased tolerance to faults with injected noise during training. The networks were highly constrained for these simulations to remove some of the natural redundancy of the MLP structure. 
Although the eye/not-eye problem contains a high proportion of redundant information, the improvement in the network's ability to withstand damage, with injected noise, is clear.\n\nFigure 4: Fault tolerance enhancement in the eye/not-eye classifier. (Plot: x-axis Synapses Removed (%), 0 to 25; one curve per training noise level: 0%, 10%, 20%, 30%, 40%.)\n\n4.2 Generalisation Ability\n\nConsider the derivative in equation (5), looking at the input-hidden weights $T_{ji}$. The term that is added to the error function, again averaged over all patterns, output nodes and weights, is:\n\n$$K \\times \\Delta^2 \\left\\langle T_{ji}^2 \\left( \\frac{\\partial o_{kp}}{\\partial T_{ji}} \\right)^2 \\right\\rangle = K \\times \\Delta^2 \\left\\langle T_{ji}^2 T_{kj}^2 o_{ip}^2 (o'_{jp})^2 (o'_{kp})^2 \\right\\rangle \\qquad (6)$$\n\nIf an output neuron has a non-zero connection from a particular hidden node ($T_{kj} \\neq 0$), and provided the input $o_{ip}$ is non-zero and is connected to the hidden node ($T_{ji} \\neq 0$), there is also a term $(o'_{jp})^2$ that will tend to favour solutions with the hidden nodes also turned firmly ON or OFF (i.e. $o_{jp} = 0$ or $1$). Remembering, of course, that all these terms are noise-mediated, and that during the early stages of training, the \"actual\" error $\\epsilon_{kp}$, in (1), will dominate, this term will de-stabilise final solutions that balance the hidden nodes on the slope of the sigmoid. Naturally, hidden nodes $o_j$ that are firmly ON or OFF are less likely to change state as a result of small variations in the input data $\\{o_i\\}$. This should become evident in an increased tolerance to input perturbations and therefore an increased generalisation ability.\n\nSimulations were again carried out on the two problems using 35 weight sets for each level of injected synaptic noise during training. For the character encoder problem generalisation is not really an issue, but it is possible to verify the above prediction by introducing random gaussian noise into the input data and noting the degradation in performance. 
The results of these simulations are shown in Fig. 5, and clearly show an increased ability to withstand input perturbation, with injected noise into the synapses during training.\n\nFigure 5: Generalisation enhancement shown through increased tolerance to input perturbation, in the character encoder problem. (Plot: x-axis Noise Level (%), 0 to 40; one curve per input perturbation level: 0.05, 0.10, 0.15, 0.20.)\n\nGeneralisation ability for the eye/not-eye problem is a real issue. This problem therefore gives a valid test of whether the synaptic noise technique actually improves generalisation performance. The networks were therefore tested on previously unseen facial images and the results are shown in Table 1. These results show dramatically improved generalisation ability with increased levels of injected synaptic noise during training. An improvement of approximately 8% is seen - consistent with earlier results on a different \"real\" problem [Murray, 91].\n\nNoise Level | 0% | 10% | 20% | 30% | 40%\nTest Patterns Correctly Classified (%) | 67.875 | 70.406 | 70.416 | 72.454 | 75.446\n\nTable 1: Generalisation enhancement shown through increased ability to classify previously unseen data, in the eye/not-eye task. 
\n\n4.3 Learning Trajectory\n\nConsider now the second derivative penalty term in the expanded cost function (3). This term is complex as it involves second order derivatives, and also depends upon the sign and magnitude of the errors themselves $\\{\\epsilon_{kp}\\}$. The simplest way of looking at its effect is to look at a single exemplar term:\n\n$$K \\Delta^2 \\epsilon_{kp} T_{ab}^2 \\left( \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2} \\right) \\qquad (7)$$\n\nThis term implies that when the combination $\\epsilon_{kp} \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2}$ is negative then the overall cost function error is reduced and vice versa. The term (7) is therefore constructive as it can actually lower the error locally via noise injection, whereas (6) always increases it. (7) can therefore be viewed as a sculpting of the error surface during the early phases of training (i.e. when $\\epsilon_{kp}$ is substantial). In particular, a weight set with a higher \"raw\" error value, calculated from (1), may be favoured over one with a lower value if noise-injected terms indicate that the \"poorer\" solution is located in a promising area of weight space. This \"look-ahead\" property should lead to an enhanced learning trajectory, perhaps finding a solution more rapidly.\n\nIn the augmented weight update equation (4), the noise is acting as a medium projecting statistical information about the character of the entire weight set on to the update equation for each particular weight. So, the effect of the noise term is to account not only for the weight currently being updated, but to add in a term that estimates what the other weight changes are likely to do to the output, and adjust the size of the weight increment/decrement as appropriate. 
\n\nTo verify this by simulation is not as straightforward as the other predictions. It is however possible to show the mean training time for each level of injected noise. For each noise level, 1000 random start points were used to allow the underlying properties of the training process to emerge. The results are shown in Fig. 6 and clearly show that at low noise levels ($\\leq 30\\%$ for the case of the character encoder) a definite reduction in training time is seen. At higher levels the chaotic nature of the \"noisy learning\" takes over.\n\nFigure 6: Training time as a function of injected synaptic noise during training. (Plot: x-axis Synaptic Noise Level Used During Training, 0 to 50.)\n\nIt is also possible to plot the combination $\\epsilon_{kp} \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2}$. This is shown in Fig. 7, again for the character encoder problem. The term (7) is reduced more quickly\n\n[Figure 7: plot and caption not recoverable from the source.]\n\n[...] 7% the effect is exaggerated, and the noise mediated improvements take place during the first 100-200 epochs of training. The level of 7% is displayed simply because it is visually clear what is happening, and is also typical.\n\n5 Conclusion\n\nWe have shown both by mathematical expansion and by simulation that injecting random noise on to the synaptic weights of a MultiLayer Perceptron during the training phase enhances fault-tolerance, generalisation ability and learning trajectory. It has long been held that any inaccuracy during training is detrimental to MLP learning. This paper proves that analog inaccuracy is not. The mathematical predictions are perfectly general and the simulations relate to a non-trivial classification task and a \"real\" world problem. The results are therefore important for the designers of analog hardware and also as a non-invasive technique for producing learning enhancements in the software domain.\n\nAcknowledgements\n\nWe are grateful to the Science and Engineering Research Council for financial support, and to Lionel Tarassenko and Chris Bishop for encouragement and advice. 
\n\nReferences\n\n[Taylor, 72] J. G. Taylor, \"Spontaneous Behaviour in Neural Networks\", J. Theor. Biol., vol. 36, pp. 513-528, 1972.\n\n[Murray, 91] A. F. Murray, \"Analog Noise-Enhanced Learning in Neural Network Circuits\", Electronics Letters, vol. 27, no. 17, pp. 1546-1548, 1991.\n\n[Murray, 92] A. F. Murray, \"Multi-Layer Perceptron Learning Optimised for On-Chip Implementation - a Noise Robust System\", Neural Computation, vol. 4, no. 3, pp. 366-381, 1992.\n\n[Matsuoka, 92] K. Matsuoka, \"Noise Injection into Inputs in Back-Propagation Learning\", IEEE Trans. Systems, Man and Cybernetics, vol. 22, no. 3, pp. 436-440, 1992.\n\n[Bishop, 90] C. Bishop, \"Curvature-Driven Smoothing in Backpropagation Neural Networks\", IJCNN, vol. 2, pp. 749-752, 1990.\n\n[Hanson, 90] S. J. Hanson, \"A Stochastic Version of the Delta Rule\", Physica D, vol. 42, pp. 265-272, 1990.\n\n[Sequin, 91] C. H. Sequin, R. D. Clay, \"Fault Tolerance in Feed-Forward Artificial Neural Networks\", Neural Networks: Concepts, Applications and Implementations, vol. 4, pp. 111-141, 1991.\n\n[Murray, 93] A. F. Murray, P. J. Edwards, \"Enhanced MLP Performance and Fault Tolerance Resulting from Synaptic Weight Noise During Training\", IEEE Trans. Neural Networks, 1993, In Press.\n", "award": [], "sourceid": 682, "authors": [{"given_name": "Alan", "family_name": "Murray", "institution": null}, {"given_name": "Peter", "family_name": "Edwards", "institution": null}]}*