{"title": "Hoo Optimality Criteria for LMS and Backpropagation", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 358, "abstract": "", "full_text": "Hoo Optimality Criteria for LMS and \n\nBackpropagation \n\nBabak Hassibi \n\nInformation Systems Laboratory \n\nStanford University \nStanford, CA 94305 \n\nAli H. Sayed \n\nDept. of Elec. and Compo Engr. \n\nUniversity of California Santa Barbara \n\nSanta Barbara, CA 93106 \n\nThomas Kailath \n\nInformation Systems Laboratory \n\nStanford University \nStanford, CA 94305 \n\nAbstract \n\nWe have recently shown that the widely known LMS algorithm is \nan H OO optimal estimator. The H OO criterion has been introduced, \ninitially in the control theory literature, as a means to ensure ro(cid:173)\nbust performance in the face of model uncertainties and lack of \nstatistical information on the exogenous signals. We extend here \nour analysis to the nonlinear setting often encountered in neural \nnetworks, and show that the backpropagation algorithm is locally \nH OO optimal. This fact provides a theoretical justification of the \nwidely observed excellent robustness properties of the LMS and \nbackpropagation algorithms. We further discuss some implications \nof these results. \n\n1 \n\nIntroduction \n\nThe LMS algorithm was originally conceived as an approximate recursive procedure \nthat solves the following problem (Widrow and Hoff, 1960): given a sequence of n x 1 \ninput column vectors {hd, and a corresponding sequence of desired scalar responses \n{ di }, find an estimate of an n x 1 column vector of weights w such that the sum \nof squared errors, L:~o Idi - hi w1 2 , is minimized. The LMS solution recursively \n\n351 \n\n\f352 \n\nHassibi. Sayed. and Kailath \n\nupdates estimates of the weight vector along the direction of the instantaneous gra(cid:173)\ndient of the squared error. 
It has long been known that LMS is an approximate minimizing solution to the above least-squares (or H^2) minimization problem. Likewise, the celebrated backpropagation algorithm (Rumelhart and McClelland, 1986) is an extension of the gradient-type approach to nonlinear cost functions of the form Σ_i |d_i - h_i(w)|^2, where the h_i(.) are known nonlinear functions (e.g., sigmoids). It also updates the weight vector estimates along the direction of the instantaneous gradients.

We have recently shown (Hassibi, Sayed and Kailath, 1993a) that the LMS algorithm is an H∞ optimal estimator.

2 Linear H∞ Adaptive Filtering

Let h_2 denote the space of square-summable sequences, with inner product <{f_k}, {g_k}> = Σ_{k=0}^∞ f_k^* g_k, where * denotes complex conjugation. Let T be a transfer operator that maps an input sequence {u_i} to an output sequence {y_i}. Then the H∞ norm of T is defined as

    ||T||_∞ = sup_{u ∈ h_2, u ≠ 0} ||Tu||_2 / ||u||_2.

Theorem 1 (LMS Algorithm) Consider the linear model d_i = h_i^T w + v_i and the LMS recursion w_i = w_{i-1} + μ h_i (d_i - h_i^T w_{i-1}). Then, if the h_i are exciting and 0 < μ < inf_i 1/(h_i^T h_i), LMS achieves

    sup_{w, v ∈ h_2} Σ_i |e_i|^2 / ( μ^{-1} |w - w_{-1}|^2 + Σ_i |v_i|^2 ) = 1,

where e_i = h_i^T (w - w_{i-1}) is the a priori prediction error, and this is the minimum attainable over all causal estimators, i.e., γ_opt = 1.

Figure 1: H∞ norm of transfer operator as a function of the number of observations for (a) RLS, and (b) LMS. The true output and the worst case disturbance signal (dotted curve) for RLS are given in (c). The predicted errors for the RLS (dashed) and LMS (dotted) algorithms corresponding to this disturbance are given in (d). The LMS predicted error goes to zero while the RLS predicted error does not.

3 Nonlinear H∞ Adaptive Filtering

In this section we suppose that the observed sequence {d_i} obeys the following nonlinear model

    d_i = h_i(w) + v_i                                  (5)

where h_i(.) is a known nonlinear function (with bounded first and second order derivatives), w is an unknown weight vector, and {v_i} is an unknown disturbance sequence that includes noise and/or modelling errors. In a neural network context the index i in h_i(.) will correspond to the nonlinear function that maps the weight vector to the output when the ith input pattern is presented, i.e., h_i(w) = h(x(i), w) where x(i) is the ith input pattern. As before we shall denote by w_i = F(d_0, ..., d_i) the estimate of the weight vector using measurements up to and including time i, and the prediction error by

    e_i = h_i(w) - h_i(w_{i-1}).

Let T denote the transfer operator that maps the unknowns/disturbances {w - w_{-1}, {v_i}} to the prediction errors {e_i}.

Problem 2 (Optimal Nonlinear H∞ Adaptive Problem) Find an H∞-optimal estimation strategy w_i = F(d_0, d_1, ..., d_i) that minimizes ||T||_∞, and obtain the resulting

    γ_opt^2 = inf_F ||T||_∞^2 = inf_F sup_{w, v ∈ h_2} Σ_i |e_i|^2 / ( μ^{-1} |w - w_{-1}|^2 + Σ_i |v_i|^2 ).    (6)

Currently there is no general solution to the above problem, and the class of nonlinear functions h_i(.) for which the above problem has a solution is not known (Ball and Helton, 1992).

To make some headway, though, note that by using the mean value theorem (5) may be rewritten as

    d_i = h_i(w_{i-1}) + (∂h_i/∂w)^T(w̃_{i-1}) . (w - w_{i-1}) + v_i    (7)

where w̃_{i-1} is a point on the line connecting w and w_{i-1}. Theorem 1 applied to (7) shows that the recursion

    w_i = w_{i-1} + μ (∂h_i/∂w)(w̃_{i-1}) (d_i - h_i(w_{i-1}))    (8)

will yield γ = 1. The problem with the above algorithm is that the w̃_i's are not known. But it suggests that the γ_opt in Problem 2 (if it exists) cannot be less than one. Moreover, it can be seen that the backpropagation algorithm is an approximation to (8) where w̃_i is replaced by w_i. To pursue this point further we use again the mean value theorem to write (5) in the alternative form

    d_i = h_i(w_{i-1}) + (∂h_i/∂w)^T(w_{i-1}) . (w - w_{i-1}) + (1/2)(w - w_{i-1})^T (∂^2 h_i/∂w^2)(w̄_{i-1}) . (w - w_{i-1}) + v_i    (9)

where once more w̄_{i-1} lies on the line connecting w_{i-1} and w. Using (9) and Theorem 1 we have the following result.

Theorem 2 (Backpropagation Algorithm) Consider the model (5) and the backpropagation algorithm

    w_i = w_{i-1} + μ (∂h_i/∂w)(w_{i-1}) (d_i - h_i(w_{i-1})).    (10)

Then, if the (∂h_i/∂w)(w_{i-1}) are exciting and

    0 < μ < inf_i 1 / [ (∂h_i/∂w)^T(w_{i-1}) (∂h_i/∂w)(w_{i-1}) ],    (11)

then for all nonzero w, v ∈ h_2:

    || (∂h_i/∂w)^T(w_{i-1}) (w - w_{i-1}) ||_2^2 / ( μ^{-1} |w - w_{-1}|^2 + || v_i + (1/2)(w - w_{i-1})^T (∂^2 h_i/∂w^2)(w̄_{i-1}) (w - w_{i-1}) ||_2^2 ) ≤ 1

where w̄_{i-1}, as in (9), lies on the line connecting w_{i-1} and w.

The above result means that if one considers a new disturbance v'_i = v_i + (1/2)(w - w_{i-1})^T (∂^2 h_i/∂w^2)(w̄_{i-1}) . (w - w_{i-1}), whose second term indicates how far h_i(w) is from a first order approximation at the point w_{i-1}, then backpropagation guarantees that the energy of the linearized prediction error (∂h_i/∂w)^T(w_{i-1})(w - w_{i-1}) does not exceed the energy of the new disturbances w - w_{-1} and v'_i.

It seems plausible that if w_{-1} is close enough to w then the second term in v'_i should be small and the true and linearized prediction errors should be close, so that we should be able to bound the ratio in (6). Thus the following result is expected, where we have defined the vectors {h_i} persistently exciting if, and only if, for all nonzero a ∈ R^n,

    Σ_{i=0}^∞ |a^T h_i|^2 = ∞.

Theorem 3 (Local H∞ Optimality) Consider the model (5) and the backpropagation algorithm (10). Suppose that the (∂h_i/∂w)(w_{i-1}) are persistently exciting, and that (11) is satisfied. Then for each ε > 0, there exist δ_1, δ_2 > 0 such that for all |w - w_{-1}| < δ_1 and all v ∈ h_2 with |v_i| < δ_2, we have

    Σ_i |e_i|^2 / ( μ^{-1} |w - w_{-1}|^2 + ||v||_2^2 ) ≤ 1 + ε.

The above Theorem indicates that the backpropagation algorithm is locally H∞ optimal. In other words, for w_{-1} sufficiently close to w, and for sufficiently small disturbance, the ratio in (6) can be made arbitrarily close to one. Note that the conditions on w and v_i are reasonable, since if for example w is too far from w_{-1}, or if some v_i is too large, then it is well known that backpropagation may get stuck in a local minimum, in which case the ratio in (6) may get arbitrarily large.
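To make the local statement concrete, here is a small numerical sketch of the backpropagation recursion (10) under the learning-rate bound (11). The scalar sigmoid model, the data, and the step size are all hypothetical choices for illustration; with no disturbance and the initial guess close to the true weight vector, the observed energy ratio from (6) stays below one, consistent with Theorem 3.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative nonlinear model (our choice, not from the paper):
# d_i = h_i(w) + v_i with h_i(w) = sigmoid(x_i^T w), so that
# dh_i/dw = sigmoid'(x_i^T w) x_i, and sigmoid' <= 1/4 keeps bound (11) easy to meet.
rng = np.random.default_rng(1)
w_true = np.array([0.8, -0.3])
w = w_true + np.array([0.05, -0.05])   # w_{-1} close to w, as Theorem 3 assumes
w_init = w.copy()
mu = 0.2                               # well below 1 / sup_i ||dh_i/dw||^2 for this data

errors = []
for _ in range(2000):
    x = rng.standard_normal(2)         # random input patterns: persistently exciting
    d = sigmoid(x @ w_true)            # disturbance-free: v_i = 0
    y = sigmoid(x @ w)
    grad = y * (1.0 - y) * x           # dh_i/dw evaluated at w_{i-1}, as in (10)
    errors.append(d - y)
    w = w + mu * grad * (d - y)        # backpropagation update (10)

# Energy ratio from (6) with v = 0: prediction-error energy over mu^{-1} |w - w_{-1}|^2.
ratio = np.sum(np.square(errors)) / (np.sum(np.square(w_init - w_true)) / mu)
```

In this run the ratio comes out below one and w converges to w_true; pushing mu past the bound (11), or starting w far from w_true, voids the guarantee, matching the discussion above.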
As before, (11) gives an upper bound on the learning rate μ, and indicates why backpropagation behaves poorly if the learning rate is too large.

If there is no disturbance in (5) we have the following.

Corollary 2 If in addition to the assumptions in Theorem 3 there is no disturbance in (5), then for every ε > 0 there exists a δ > 0 such that for all |w - w_{-1}| < δ, the backpropagation algorithm will yield ||e||_2^2 ≤ μ^{-1} δ^2 (1 + ε), meaning that the prediction error converges to zero. Moreover, w_i will converge to w.

Again, provided (11) is satisfied, the larger μ is the faster the convergence will be.

4 Discussion and Conclusion

The results presented in this paper give some new insights into the behaviour of instantaneous gradient-based adaptive algorithms. We showed that if the underlying observation model is linear then LMS is an H∞ optimal estimator, whereas if the underlying observation model is nonlinear then the backpropagation algorithm is locally H∞ optimal. The H∞ optimality of these algorithms explains their inherent robustness to unknown disturbances and modelling errors, as opposed to other estimation algorithms for which such bounds are not guaranteed.

Note that if one considers the transfer operator from the disturbances to the prediction errors, then LMS (backpropagation) is H∞ optimal (locally) over all causal estimators. This indicates that our result is most applicable in situations where one is confronted with real-time data and there is no possibility of storing the training patterns. Such cases arise when one uses adaptive filters or adaptive neural networks for adaptive noise cancellation, channel equalization, real-time control, and undoubtedly many other situations.
This is as opposed to pattern recognition, where one has a set of training patterns and repeatedly retrains the network until a desired performance is reached.

Moreover, we also showed that the H∞ optimality result leads to convergence proofs for the LMS and backpropagation algorithms in the absence of disturbances. We can pursue this line of thought further and argue why choosing large learning rates increases the resistance of backpropagation to local minima, but we shall not do so due to lack of space.

In conclusion, these results give a new interpretation of the LMS and backpropagation algorithms, which we believe should be worthy of further scrutiny.

Acknowledgements

This work was supported in part by the Air Force Office of Scientific Research, Air Force Systems Command under Contract AFOSR91-0060 and in part by a grant from Rockwell International Inc.

References

J. A. Ball and J. W. Helton. (1992) Nonlinear H∞ control theory for stable plants. Math. Control Signals Systems, 5:233-261.

K. Glover and D. Mustafa. (1989) Derivation of the maximum entropy H∞ controller and a state space formula for its entropy. Int. J. Control, 50:899-916.

B. Hassibi, A. H. Sayed, and T. Kailath. (1993a) LMS is H∞ optimal. IEEE Conf. on Decision and Control, 74-80, San Antonio, Texas.

B. Hassibi, A. H. Sayed, and T. Kailath. (1993b) Recursive linear estimation in Krein spaces - part II: Applications. IEEE Conf. on Decision and Control, 3495-3501, San Antonio, Texas.

S. Haykin. (1991) Adaptive Filter Theory. Prentice Hall, Englewood Cliffs, NJ.

D. E. Rumelhart, J. L. McClelland and the PDP Research Group. (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, Mass.: MIT Press.

P. Whittle. (1990) Risk Sensitive Optimal Control. John Wiley and Sons, New York.

B. Widrow and M. E. Hoff, Jr. (1960) Adaptive switching circuits.
IRE WESCON Conv. Rec., Pt. 4:96-104.

G. Zames. (1981) Feedback and optimal sensitivity: model reference transformations, multiplicative seminorms, and approximate inverses. IEEE Trans. on Automatic Control, AC-26:301-320.
", "award": [], "sourceid": 815, "authors": [{"given_name": "Babak", "family_name": "Hassibi", "institution": null}, {"given_name": "Ali", "family_name": "Sayed", "institution": null}, {"given_name": "Thomas", "family_name": "Kailath", "institution": null}]}