{"title": "Tempering Backpropagation Networks: Not All Weights are Created Equal", "book": "Advances in Neural Information Processing Systems", "page_first": 563, "page_last": 569, "abstract": null, "full_text": "Tempering Backpropagation Networks: \n\nNot All Weights are Created Equal \n\nNicol N. Schraudolph \n\nEVOTEC BioSystems GmbH \n\nGrandweg 64 \n\n22529 Hamburg, Germany \n\nnici@evotec.de \n\nAbstract \n\nTerrence J. Sejnowski \n\nComputational Neurobiology Lab \nThe Salk Institute for BioI. Studies \nSan Diego, CA 92186-5800, USA \n\nterry@salk.edu \n\nBackpropagation learning algorithms typically collapse the network's \nstructure into a single vector of weight parameters to be optimized. We \nsuggest that their performance may be improved by utilizing the struc(cid:173)\ntural information instead of discarding it, and introduce a framework for \n''tempering'' each weight accordingly. \nIn the tempering model, activation and error signals are treated as approx(cid:173)\nimately independent random variables. The characteristic scale of weight \nchanges is then matched to that ofthe residuals, allowing structural prop(cid:173)\nerties such as a node's fan-in and fan-out to affect the local learning rate \nand backpropagated error. The model also permits calculation of an upper \nbound on the global learning rate for batch updates, which in turn leads \nto different update rules for bias vs. non-bias weights. \nThis approach yields hitherto unparalleled performance on the family re(cid:173)\nlations benchmark, a deep multi-layer network: for both batch learning \nwith momentum and the delta-bar-delta algorithm, convergence at the \noptimal learning rate is sped up by more than an order of magnitude. \n\n1 \n\nIntroduction \n\nAlthough neural networks are structured graphs, learning algorithms typically view them \nas a single vector of parameters to be optimized. 
All information about a network's architecture is thus discarded in favor of the presumption of an isotropic weight space - the notion that a priori all weights in the network are created equal. This serves to decouple the learning process from network design and makes a large body of function optimization techniques directly applicable to backpropagation learning.

But what if the discarded structural information holds valuable clues for efficient weight optimization? Adaptive step size and second-order gradient techniques (Battiti, 1992) may recover some of it, at considerable computational expense. Ad hoc attempts to incorporate structural information such as the fan-in (Plaut et al., 1986) into local learning rates have become a familiar part of backpropagation lore; here we derive a more comprehensive framework - which we call tempering - and demonstrate its effectiveness.

Tempering is based on modeling the activities and error signals in a backpropagation network as independent random variables. This allows us to calculate activity- and weight-invariant upper bounds on the effect of synchronous weight updates on a node's activity. We then derive appropriate local step size parameters by relating this maximal change in a node's activity to the characteristic scale of its residual through a global learning rate.

Our subsequent derivation of an upper bound on the global learning rate for batch learning suggests that the d.c. component of the error signal be given special treatment. Our experiments show that the resulting method of error shunting allows the global learning rate to approach its predicted maximum, for highly efficient learning performance.
2 Local Learning Rates

Consider a neural network with feedforward activation given by

x_j = f_j(y_j), \quad y_j = \sum_{i \in A_j} x_i w_{ij}, \qquad (1)

where A_j denotes the set of anterior nodes feeding directly into node j, and f_j is a nonlinear (typically sigmoid) activation function. We imply that nodes are activated in the appropriate sequence, and that some have their values clamped so as to represent external inputs.

With a local learning rate of \eta_j for node j, gradient descent in an objective function E produces the weight update

\Delta w_{ij} = -\eta_j \, \partial E / \partial w_{ij} = \eta_j \delta_j x_i, \quad \text{where} \quad \delta_j \equiv -\partial E / \partial y_j. \qquad (2)

Linearizing f_j around y_j approximates the resultant change in activation x_j as

\Delta x_j \approx f_j'(y_j) \sum_{i \in A_j} x_i \Delta w_{ij} = \eta_j \delta_j f_j'(y_j) \sum_{i \in A_j} x_i^2. \qquad (3)

Our goal is to put the scale of \Delta x_j in relation to that of the error signal \delta_j. Specifically, when averaged over many training samples, we want the change in output activity of each node in response to each pattern limited to a certain proportion - given by the global learning rate \eta - of its residual. We achieve this by relating the variation of \Delta x_j over the training set to that of the error signal:

\langle (\Delta x_j)^2 \rangle = \eta^2 \langle \delta_j^2 \rangle, \qquad (4)

where \langle \cdot \rangle denotes averaging over training samples. Formally, this approach may be interpreted as a diagonal approximation of the inverse Fisher information matrix (Amari, 1995).

We implement (4) by deriving an upper bound for the left-hand side which is then equated with the right-hand side. Replacing the activity-dependent slope of f_j by its maximum value

s(f_j) \equiv \max_u |f_j'(u)| \qquad (5)

and assuming that there are no correlations[1] between inputs x_i and error \delta_j, we obtain

\langle (\Delta x_j)^2 \rangle \le \eta_j^2 \, s(f_j)^2 \, \langle \delta_j^2 \rangle \, \xi_j \qquad (6)

from (3), provided that

\xi_j \ge \xi_j^* \equiv \langle \textstyle\sum_{i \in A_j} x_i^4 \rangle. \qquad (7)

We can now satisfy (4) by setting the local learning rate to

\eta_j = \frac{\eta}{s(f_j) \sqrt{\xi_j}}. \qquad (8)

[1] Note that such correlations are minimized by the local weight update.
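As an illustration, the local rates of (8) depend only on the architecture and the activation functions, so they can be computed once, before training. The following minimal Python sketch is ours (the helper name local_rate and the constant tables are not part of our simulator) and assumes the bound of (9) for a node whose inputs are other units:

```python
import math

# Characteristic constants of two common squashing functions:
#   s(f)   = max_u |f'(u)|    (maximum slope)
#   rho(f) = max_u  f(u)**2   (maximum squared output)
S   = {'tanh': 1.0, 'logistic': 0.25}
RHO = {'tanh': 1.0, 'logistic': 1.0}

def local_rate(eta, s_j, rhos_in):
    # eq. (9): xi_j = sum of rho(f_i)^2 over the anterior nodes of j
    xi = sum(r ** 2 for r in rhos_in)
    # eq. (8): eta_j = eta / (s(f_j) * sqrt(xi_j))
    return eta / (s_j * math.sqrt(xi))

# A tanh node with 9 tanh inputs recovers the sqrt(fan-in) heuristic:
# eta_j = eta / sqrt(9) = 0.1 / 3
print(local_rate(0.1, S['tanh'], [RHO['tanh']] * 9))
```

For external input units the constants would instead be bounded from the training data.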
There are several approaches to computing an upper bound \xi_j on the total squared input power \xi_j^*. One option would be to calculate the latter empirically during training, though this raises sampling and stability issues. For external inputs we may precompute \xi_j^* or derive an upper bound based on prior knowledge of the training data. For inputs from other nodes in the network we assume independence and derive \xi_j from the range of their activation functions:

\xi_j = \sum_{i \in A_j} \rho(f_i)^2, \quad \text{where} \quad \rho(f_i) \equiv \max_u f_i(u)^2. \qquad (9)

Note that when all nodes use the same activation function f, we obtain the well-known \sqrt{\text{fan-in}} heuristic (Plaut et al., 1986) as a special case of (8).

3 Error Backpropagation

In deriving local learning rates above we have tacitly used the error signal as a stand-in for the residual proper, i.e. the distance to the target. For output nodes we can scale the error to never exceed the residual:

\bar{\delta}_j = \delta_j \, \min(1, |r_j| / |\delta_j|), \qquad (10)

where r_j denotes the residual of output node j. Note that for the conventional quadratic error this scaling factor simplifies to 1. We prefer the hyperbolic tangent for hidden units, with \rho(\tanh) = s(\tanh) = 1.

To illustrate the impact of tempering on this architecture we translate the combined effect of local learning rate and error attenuation into an effective learning rate[2] for each layer, shown on the right in Figure 1. We observe that effective learning rates are largest near the output and decrease towards the input due to error attenuation. Contrary to textbook opinion (LeCun, 1993; Haykin, 1994, page 162) we find that such unequal step sizes are in fact the key to efficient learning here. We suspect that the logistic squashing function may owe its popularity largely to the error attenuation side-effect inherent in its maximum slope of 1/4.

We expect tempering to be applicable to a variety of backpropagation learning algorithms; here we present first results for batch learning with momentum and the delta-bar-delta rule (Jacobs, 1988).
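For reference, the delta-bar-delta rule adapts a per-weight gain according to the agreement in sign between the current gradient and an exponentially decaying trace of past gradients. The sketch below (Python; illustrative only, not our Xerion implementation) uses the additive gain increment 0.1 and multiplicative decrement 0.9 of Table 1; the trace constant theta is a free parameter of the rule (Jacobs, 1988), and its value here is merely a placeholder:

```python
def delta_bar_delta_step(w, grad, gain, bar, kappa=0.1, phi=0.9, theta=0.7):
    # One delta-bar-delta update for a single weight (Jacobs, 1988).
    # The gain grows additively by kappa while the gradient agrees in
    # sign with its trace bar, and shrinks by the factor phi on a flip.
    if bar * grad > 0:
        gain += kappa
    elif bar * grad < 0:
        gain *= phi
    bar = (1 - theta) * grad + theta * bar   # exponential gradient trace
    w -= gain * grad                         # gain acts as the local learning rate
    return w, gain, bar

# Tempering initializes gain at eta / (s(f_j) * sqrt(xi_j)) per (8),
# e.g. 0.1 / 3 for a tanh node with fan-in 9, rather than uniformly:
w, gain, bar = delta_bar_delta_step(0.0, 0.5, 0.1 / 3, 0.0)
```

Seeding the gains with the tempered local rates of (8), rather than a uniform value, is the combination whose speedup is examined below.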
Both algorithms were tested under three conditions: conventional, tempered (as described in Sections 2 and 3), and tempered with error shunting. All experiments were performed with a customized simulator based on Xerion 3.1.[3]

For each condition the global learning rate \eta was empirically optimized (to single-digit precision) for fastest reliable learning performance, as measured by the sum of empirical mean and standard deviation of epochs required to reach a given low value of the cost function. All other parameters were held invariant across experiments; their values (shown in Table 1) were chosen in advance so as not to bias the results.

[2] This is possible only for strictly layered networks, i.e. those with no shortcut (or \"skip-through\") connections between topologically non-adjacent layers.
[3] At the time of writing, the Xerion neural network simulator and its successor UTS are available by anonymous file transfer from ai.toronto.edu, directory pub/xerion.

Parameter                    | Value  | Parameter                       | Value
training set size (= epoch)  | 100    | zero-error radius around target | 0.2
momentum parameter           | 0.9    | acceptable error & weight cost  | 1.0
uniform initial weight range | ±0.3   | delta-bar-delta gain increment  | 0.1
weight decay rate per epoch  | 10^-4  | delta-bar-delta gain decrement  | 0.9

Table 1: Invariant parameter settings for our experiments.

7 Experimental Results

Table 2 lists the empirical mean and standard deviation (over ten restarts) of the number of epochs required to learn the family relations task under each condition, and the optimal learning rate that produced this performance. Training times for conventional backpropagation are quite long; this is typical for deep multi-layer networks.
For comparison, Hinton reports around 1,500 epochs on this problem when both learning rate and momentum have been optimized (personal communication). Much faster convergence - though to a far looser criterion - has recently been observed for online algorithms (O'Reilly, 1996).

Tempering, on the other hand, is seen here to speed up two batch learning methods by almost an order of magnitude. It reduces not only the average training time but also its coefficient of variation, indicating a more reliable optimization process. Note that tempering makes simple batch learning with momentum run about twice as fast as the delta-bar-delta algorithm. This is remarkable since delta-bar-delta uses online measurements to continually adapt the learning rate for each individual weight, whereas tempering merely prescales it based on the network's architecture. We take this as evidence that tempering establishes appropriate local step sizes upfront that delta-bar-delta must discover empirically.

This suggests that by using tempering to set the initial (equilibrium) learning rates for delta-bar-delta, it may be possible to reap the benefits of both prescaling and adaptive step size control. Indeed Table 2 confirms that the respective speedups due to tempering and delta-bar-delta multiply when the two approaches are combined in this fashion. Finally, the addition of error shunting increases learning speed yet further by allowing the global learning rate to be brought close to the maximum of \eta^* = 0.1 that we would predict from (18).

8 Discussion

In our experiments we have found tempering to dramatically improve speed and reliability of learning. More network architectures, data sets and learning algorithms will have to be \"tempered\" to explore the general applicability and limitations of this approach; we also hope to extend it to recurrent networks and online learning.
Error shunting has proven useful in facilitating near-maximal global learning rates for rapid optimization.

Algorithm:           | batch & momentum        | delta-bar-delta
Condition            | \eta     mean ± st.d.   | \eta     mean ± st.d.
conventional         | 3·10^-3  2438 ± 1153    | 3·10^-4  696 ± 218
with tempering       | 1·10^-2  339 ± 95.0     | 3·10^-2  89.6 ± 11.8
tempering & shunting | 4·10^-2  142 ± 27.1     | 9·10^-2  61.7 ± 8.1

Table 2: Epochs required to learn the family relations task.

Although other schemes may speed up backpropagation by comparable amounts, our approach has some unique advantages. It is computationally cheap to implement: local learning and error attenuation rates are invariant with respect to network weights and activities and thus need to be recalculated only when the network architecture is changed.

More importantly, even advanced gradient descent methods typically retain the isotropic weight space assumption that we improve upon; one would therefore expect them to benefit from tempering as much as delta-bar-delta did in the experiments reported here. For instance, tempering could be used to set non-isotropic model-trust regions for conjugate and second-order gradient descent algorithms.

Finally, by restricting ourselves to fixed learning rates and attenuation factors for now we have arrived at a simplified method that is likely to leave room for further improvement. Possible refinements include taking weight vector size into account when attenuating error signals, or measuring quantities such as \langle \delta^2 \rangle online instead of relying on invariant upper bounds. How such adaptive tempering schemes will compare to and interact with existing techniques for efficient backpropagation learning remains to be explored.
Acknowledgements

We would like to thank Peter Dayan, Rich Zemel and Jenny Orr for being instrumental in discussions that helped shape this work. Geoff Hinton not only offered invaluable comments, but is the source of both our simulator and benchmark problem. N. Schraudolph received financial support from the McDonnell-Pew Center for Cognitive Neuroscience in San Diego, and the Robert Bosch Stiftung GmbH.

References

Amari, S.-I. (1995). Learning and statistical inference. In Arbib, M. A., editor, The Handbook of Brain Theory and Neural Networks, pages 522-526. MIT Press, Cambridge.
Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4(2):141-166.
Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. Macmillan, New York.
Hinton, G. (1986). Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12, Amherst, 1986. Lawrence Erlbaum, Hillsdale.
Jacobs, R. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295-307.
Krogh, A., Thorbergsson, G., and Hertz, J. A. (1990). A cost function for internal representations. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems, volume 2, pages 733-740, Denver, CO, 1989. Morgan Kaufmann, San Mateo.
LeCun, Y. (1993). Efficient learning & second-order methods. Tutorial given at the NIPS Conference, Denver, CO.
LeCun, Y., Kanter, I., and Solla, S. A. (1991). Second order properties of error surfaces: Learning time and generalization. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems, volume 3, pages 918-924, Denver, CO, 1990. Morgan Kaufmann, San Mateo.
O'Reilly, R. C. (1996).
Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8.
Plaut, D., Nowlan, S., and Hinton, G. (1986). Experiments on learning by back propagation. Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
", "award": [], "sourceid": 1100, "authors": [{"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}