{"title": "On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 605, "page_last": 611, "abstract": null, "full_text": "On iterative Krylov-dogleg trust-region \nsteps for solving neural networks \nnonlinear least squares problems \n\nEiji Mizutani \n\nDepartment of Computer Science \nNational Tsing Hua University \nHsinchu, 30043 TAIWAN R.O.C. \n\neiji@wayne.cs.nthu.edu.tw \n\nJames W. Demmel \n\nMathematics and Computer Science \nUniversity of California at Berkeley, \nBerkeley, CA 94720 USA \n\ndemmel@cs.berkeley.edu \n\nAbstract \n\nThis paper describes a method of dogleg trust-region steps, or \nrestricted Levenberg-Marquardt steps, based on a projection \nprocess onto the Krylov subspaces for neural networks nonlinear least \nsquares problems. In particular, the linear conjugate gradient (CG) \nmethod works as the inner iterative algorithm for solving the \nlinearized Gauss-Newton normal equation, whereas the outer \nnonlinear algorithm repeatedly takes so-called \"Krylov-dogleg\" steps, \nrelying only on matrix-vector multiplication without explicitly \nforming the Jacobian matrix or the Gauss-Newton model Hessian. That \nis, our iterative dogleg algorithm can reduce both operation \ncounts and memory space by a factor of O(n), where n is the number of \nparameters, in comparison with a direct linear-equation solver. This \nmemory-less property is useful for large-scale problems. 
\n\n1 Introduction \n\nWe consider the so-called neural networks nonlinear least squares \nproblem (see footnote 1), wherein the objective is to optimize the n weight parameters of neural \nnetworks (NN) [e.g., multilayer perceptrons (MLP)], denoted by an n-dimensional \nvector $\\theta$, by minimizing the following: \n\n$$E(\\theta) = \\frac{1}{2} \\sum_{p=1}^{m} \\bigl( a_p(\\theta) - t_p \\bigr)^2 = \\frac{1}{2} r(\\theta)^T r(\\theta), \\qquad (1)$$ \n\nwhere $a_p(\\theta)$ is the MLP output for the pth training data pattern and $t_p$ is the \ndesired output. (Of course, these become vectors for a multiple-output MLP.) Here \n$r(\\theta)$ denotes the m-dimensional residual vector composed of $r_i(\\theta)$, i = 1, ..., m, \nfor all m training data. \n\nFootnote 1: The posed problem can be viewed as an implicitly constrained optimization problem as \nlong as hidden-node outputs are produced by sigmoidal \"squashing\" functions [1]. Our \nalgorithm exploits the special structure of the sum of squared error measure in Equation (1); \nhence, other objective functions are outside the scope of this paper. \n\nThe gradient vector and Hessian matrix are given by $g = g(\\theta) \\equiv J^T r$ and \n$H = H(\\theta) \\equiv J^T J + S$, where J is the $m \\times n$ Jacobian matrix of r, and S denotes the \nmatrix of second-derivative terms. If S is simply omitted based on the \"small residual\" \nassumption, then the Hessian matrix reduces to the Gauss-Newton model \nHessian: i.e., $J^T J$. Furthermore, a family of quasi-Newton methods can be \napplied to approximate term S alone, leading to the augmented Gauss-Newton model \nHessian (see, for example, Mizutani [2] and references therein). \n\nWith any form of the aforementioned Hessian matrices, we can collectively write \nthe following Newton formula to determine the next step $\\delta$ in the course of the \nNewton iteration for $\\theta_{next} = \\theta_{now} + \\delta$: \n\n$$H \\delta = -g. \\qquad (2)$$ \n\nThis linear system can be solved by a direct solver in conjunction with a suitable \nmatrix factorization. 
However, typical criticisms of the direct algorithm are: \n\n\u2022 It is expensive to form and solve the linear equation (2), which requires \n$O(mn^2)$ operations when m > n; \n\n\u2022 It is expensive to store the (symmetric) Hessian matrix H, which requires \n$n(n+1)/2$ storage locations. \n\nThese issues may become much more serious for a large-scale problem. \n\nIn light of the vast literature on nonlinear optimization, this paper describes how \nto alleviate these concerns by solving the Newton formula (2) \napproximately with iterative methods, which form a family of inexact (or truncated) \nNewton methods (see Dembo & Steihaug [3], for instance). An important \nsubclass of the inexact Newton methods is the Newton-Krylov methods. In particular, this \npaper focuses on a Newton-CG-type algorithm, wherein the linear Gauss-Newton \nnormal equation, \n\n$$J^T J \\delta = -J^T r, \\qquad (3)$$ \n\nis solved iteratively by the linear conjugate gradient method (known as CGNR) \nfor a dogleg trust-region implementation of the well-known Levenberg-Marquardt \nalgorithm; hence, the name \"dogleg trust-region Gauss-Newton-CGNR\" algorithm, \nor \"iterative Krylov-dogleg\" method (similar to Steihaug [4]; Toint [5]). \n\n2 Direct Dogleg Trust-Region Algorithms \n\nIn the NN literature, several variants of the Levenberg-Marquardt algorithm \nequipped with a direct linear-equation solver, particularly Marquardt's original \nmethod, have been recognized as instrumental and promising techniques; see, for \nexample, Demuth & Beale [6]; Masters [7]; Shepherd [8]. They are based on a simple \ndirect control of the Levenberg-Marquardt parameter $\\mu$ in $(H + \\mu I)\\delta = -g$, although \nsuch a simple $\\mu$-control can cause a number of problems, because of a complicated \nrelation between parameter $\\mu$ and its associated step length (see Mizutani [9]). 
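\n\nAs a worked illustration of why the relation between $\\mu$ and the step length is indirect, the Levenberg-Marquardt step can be expanded in an eigenbasis. This is only a sketch under assumptions that are not stated in the text: H is taken to be symmetric positive semidefinite (as for $H = J^T J$) with orthonormal eigenpairs $(\\lambda_i, v_i)$, and $\\mu > 0$; the symbols $\\lambda_i$ and $v_i$ are introduced here for illustration. \n\n$$\\delta(\\mu) = -(H + \\mu I)^{-1} g = -\\sum_{i=1}^{n} \\frac{v_i^T g}{\\lambda_i + \\mu} v_i, \\qquad \\|\\delta(\\mu)\\|_2^2 = \\sum_{i=1}^{n} \\frac{(v_i^T g)^2}{(\\lambda_i + \\mu)^2}.$$ \n\nUnder these assumptions the step length decreases monotonically as $\\mu$ grows, but it depends nonlinearly on $\\mu$ through every eigenvalue of H, so choosing $\\mu$ to obtain a prescribed length generally requires solving a scalar nonlinear equation. For large $\\mu$ the step approaches $-g/\\mu$, a short step along the steepest descent direction. 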
\n\nAlternatively, a more efficient dogleg algorithm [10] can be employed that takes, \ndepending on the size of trust region R, the Newton step $\\delta_{Newton}$ [i.e., the solution \nof Eq. (2)], the (restricted) Cauchy step $\\delta_{Cauchy}$, or an intermediate dogleg step: \n\n$$\\delta_{dogleg} \\stackrel{\\mathrm{def}}{=} \\delta_{Cauchy} + h \\left( \\delta_{Newton} - \\delta_{Cauchy} \\right), \\qquad (4)$$ \n\nwhich achieves a piecewise linear approximation to a trust-region step, or a restricted \nLevenberg-Marquardt step. Note that $\\delta_{Cauchy}$ is the step that minimizes the local \nquadratic model in the steepest descent direction (i.e., Eq. (8) with k = 1). For \ndetails on Equation (4), refer to Powell [10]; Mizutani [9, 2]. \n\nWhen we consider the Gauss-Newton step for $\\delta_{Newton}$ in Equation (4), we must \nsolve the overdetermined linear least squares problem: $\\min_{\\delta} \\|r + J\\delta\\|_2$, for \nwhich three principal direct linear-equation solvers are: \n\n(1) Normal equation approach (typically with Cholesky decomposition); \n(2) QR decomposition approach to $J\\delta = -r$; \n(3) Singular value decomposition (SVD) approach to $J\\delta = -r$ (only \nrecommended when J is nearly rank-deficient). \n\nAmong those three direct solvers, approach (1) to Equation (3) is fastest. (For more \ndetails, refer to Demmel [11], Chapters 2 and 3.) In a highly overdetermined case \n(with a large data set; i.e., $m \\gg n$), the dominant cost in approach (1) is the $mn^2$ \noperations to form the Gauss-Newton model Hessian by: \n\n$$J^T J = \\sum_{i=1}^{m} u_i u_i^T, \\qquad (5)$$ \n\nwhere $u_i^T$ is the ith row vector of J. This cost might be prohibitive even with \nenough storage for $J^T J$. Therefore, to overcome this limitation of direct solvers for \nEquation (3), we consider an iterative scheme in the next section. 
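\n\nAs a worked illustration of the intermediate step in Equation (4), the scalar h can be written in closed form. This is a sketch that assumes the standard form of Powell's dogleg, in which h is chosen so that the dogleg step lies exactly on the trust-region boundary, and that $\\|\\delta_{Cauchy}\\|_2 < R < \\|\\delta_{Newton}\\|_2$; the shorthand w is introduced here only for illustration. \n\n$$w = \\delta_{Newton} - \\delta_{Cauchy}, \\qquad \\|\\delta_{Cauchy} + h w\\|_2^2 = R^2 \\;\\Longleftrightarrow\\; \\|w\\|_2^2 h^2 + 2 (\\delta_{Cauchy}^T w) h + \\|\\delta_{Cauchy}\\|_2^2 - R^2 = 0,$$ \n\n$$h = \\frac{-\\delta_{Cauchy}^T w + \\sqrt{(\\delta_{Cauchy}^T w)^2 + \\|w\\|_2^2 \\left( R^2 - \\|\\delta_{Cauchy}\\|_2^2 \\right)}}{\\|w\\|_2^2}.$$ \n\nUnder the stated assumptions the quadratic is negative at h = 0 and positive at h = 1, so this root is the unique solution in (0, 1). 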
\n\n3 Iterative Krylov-Dogleg Algorithm \n\nThe iterative Krylov-dogleg step approximates a trust-region step by iteratively \napproximating the Levenberg-Marquardt trajectory in the Krylov subspace via \nlinear conjugate gradient iterates until the approximate trajectory hits the \ntrust-region boundary; i.e., a CG iterate falls outside the trust-region boundary. In this \ncontext, the linear CGNR method is not intended to approximate the full \nGauss-Newton step [i.e., the solution of Eq. (3)]. Therefore, the required number of \nCGNR iterations might be kept small [see Section 4]. \n\nThe iterative process for the linear-equation solution sequence $\\{\\delta_k\\}$ is called the \ninner iteration (see footnote 2), whereas the solution sequence $\\{\\theta_k\\}$ from the Krylov-dogleg \nalgorithm is generated by the outer iteration (or epoch), as shown in Figure 1. We now \ndescribe the inner iteration algorithm, which is identical to the standard linear CG \nalgorithm (see Demmel [11], pages 311-312) except steps 2, 4, and 5: \n\nAlgorithm 3.1: The inner iteration of the Krylov-dogleg algorithm (see Figure 1). \n\n1. Initialization: \n\n$$\\delta_0 = 0; \\quad d_1 = r_0 = -g_{now}, \\quad \\mathrm{and} \\quad k = 1. \\qquad (6)$$ \n\n2. Matrix-vector product (compare Eq. (5) and see Algorithm 3.2): \n\n$$z = H_{now} d_k = J_{now}^T (J_{now} d_k) = \\sum_{i=1}^{m} (u_i^T d_k) u_i. \\qquad (7)$$ \n\n3. Analytical step size: $\\eta_k = \\frac{r_{k-1}^T r_{k-1}}{d_k^T z}$. \n\n4. Approximate solution: \n\n$$\\delta_k = \\delta_{k-1} + \\eta_k d_k. \\qquad (8)$$ \n\nIf $\\|\\delta_k\\| < R_{now}$, then go on to step 5; otherwise compute \n\n$$\\delta_k = R_{now} \\frac{\\delta_k}{\\|\\delta_k\\|}, \\qquad (9)$$ \n\nand terminate. \n\n5. Linear-system residual: $r_k = r_{k-1} - \\eta_k z$. \nIf $\\|r_k\\|_2$ is small enough, then set $R_{now} \\leftarrow \\|\\delta_k\\|$ and terminate. \nOtherwise, continue with step 6. \n\n6. Improvement: $\\beta_{k+1} = \\frac{r_k^T r_k}{r_{k-1}^T r_{k-1}}$. \n\n7. Search direction: $d_{k+1} = r_k + \\beta_{k+1} d_k$. Then set k = k + 1 and go back to step 2. \n\nFootnote 2: Nonlinear conjugate gradient methods, such as Polak-Ribiere's CG (see Mizutani \nand Jang [13]) and Moller's scaled CG [14], are also widely employed for training MLPs, \nbut those nonlinear versions attempt to approximate the entire Hessian matrix by \ngenerating the solution sequence $\\{\\theta_k\\}$ directly as the outer nonlinear algorithm. Thus, they \nignore the special structure of the nonlinear least squares problem; so does Pearlmutter's \nmethod [15] to the Newton formula, although its modification may be possible. \n\n[Figure 1 is a flowchart whose image is not reproduced here. Legible box labels: initialize $R_{now}$ and $\\theta_{now}$; compute $E(\\theta_{now})$; \"Does stopping criteria hold?\" (NO/YES); Algorithm 3.1; a comparison of two E values; a test of $V_{now}$ against $V_{small}$; \"Algorithm for local-model check\".] \n\nFigure 1: The algorithmic flow of an iterative Krylov-dogleg algorithm. For detailed \nprocedures in the three dotted rectangular boxes, refer to Mizutani and Demmel [12] \nand Algorithm 3.1 in text. \n\nThe first step given by Equation (8) is always the Cauchy step $\\delta_{Cauchy}$, moving \n$\\theta_{now}$ to the Cauchy point $\\theta_{Cauchy}$ when $R_{now} > \\|\\delta_{Cauchy}\\|$. Then, departing \nfrom $\\theta_{Cauchy}$, the linear CG constructs a Krylov-dogleg trajectory (by adding a CG \npoint one by one) towards the Gauss-Newton point $\\theta_{Newton}$ until the constructed \ntrajectory hits the trust-region boundary (i.e., $\\|\\delta_k\\| \\geq R_{now}$ is satisfied in step 4), \nor until the linear-system residual becomes small in step 5 (unlikely to occur for \nsmall forcing terms; e.g., 0.01). In this way, the algorithm computes a vector \nbetween the steepest descent direction and the Gauss-Newton direction, resulting \nin an approximate Levenberg-Marquardt step in the Krylov subspace. 
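\n\nAs a worked check of the claim that the first inner iterate is the Cauchy step, one can compare the first CG step size with the exact minimizer of the local quadratic model along the steepest descent direction. This is a sketch assuming the Gauss-Newton model Hessian $J^T J$; the model function q and the unsubscripted direction d are symbols introduced here for illustration, and all quantities are evaluated at the current parameter vector. \n\n$$q(\\delta) = \\frac{1}{2} \\|r + J\\delta\\|_2^2 = \\frac{1}{2} r^T r + g^T \\delta + \\frac{1}{2} \\delta^T J^T J \\delta, \\qquad \\frac{d}{d\\eta} q(-\\eta g) = -g^T g + \\eta \\, g^T J^T J g = 0 \\;\\Longrightarrow\\; \\eta = \\frac{g^T g}{\\|J g\\|_2^2}.$$ \n\nWith the initial search direction $d = r_0 = -g$ and $z = J^T J d$, step 3 of Algorithm 3.1 gives \n\n$$\\eta_1 = \\frac{r_0^T r_0}{d^T z} = \\frac{g^T g}{g^T J^T J g} = \\frac{g^T g}{\\|J g\\|_2^2}, \\qquad \\delta_1 = -\\eta_1 g,$$ \n\nwhich matches the minimizer above, provided $J g \\neq 0$. 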
\n\nIn step 2, the matrix-vector multiplication $H d_k$ in Equation (7) can be performed \nwithout explicitly forming either the Jacobian or the Hessian matrix, keeping only \nseveral n-dimensional vectors in memory at the same time, as shown next: \n\nAlgorithm 3.2: Matrix-vector multiplication step. \nfor i = 1 to m; i.e., one sweep of all training data: \n\n(a) do forward propagation to compute the MLP output $a_i(\\theta)$ for datum i; \n(b) do backpropagation (see footnote 3) to obtain the ith row vector $u_i^T$ of matrix J; \n(c) compute $(u_i^T d_k) u_i$ and add it to z; \n\nend for. \n\nFor one sweep of all m data, each of steps (a) and (b) costs at least 2mn (plus \nadditional costs that depend on the MLP architectures) and step (c) [i.e., Eq. (7)] \ncosts 4mn. Hence, the overall cost of the inner iteration (Algorithm 3.1) can be \nkept at O(mn), especially when the number of inner iterations is small owing to \nour strategy of upper-bounded trust-region radii (e.g., $R_{upper} = 1$ for the parity \nproblem). Note for \"Algorithm for local-model check\" in Figure 1 that evaluating \n$V_{now}$ (a ratio between the actual error reduction and the reduction predicted by \nthe current local quadratic model) needs a procedure similar to Algorithm 3.2. For \nmore details on the algorithm in Figure 1, refer to Mizutani and Demmel [12]. \n\n4 Experiments and Discussions \n\nIn the NN literature, there are numerous algorithmic comparisons available (see, for \nexample, Moller [14]; Demuth & Beale [6]; Shepherd [8]; Mizutani [2, 9, 16]). Due to \nthe space limitation, this section compares typical behaviors of our Krylov-dogleg \nGauss-Newton CGNR (or iterative dogleg) algorithm and Powell's dogleg-based \nalgorithm with a direct linear-equation solver (or direct dogleg) for solving highly \noverdetermined parity problems. 
In our numerical tests, we used a criterion in \nwhich the MLP output for the pth pattern, $a_p$, is regarded as either \"on\" \n(1.0) if $a_p \\geq 0.8$, or \"off\" (-1.0) if $a_p \\leq -0.8$; otherwise, it is \"undecided.\" The \ninitial parameter set was randomly generated in the range [-0.3, 0.3], and the two \nalgorithms started exactly at the same point in the parameter space. \n\nFigure 2 presents MLP-learning curves in RMSE (root mean squared error) for the \n20-bit and 14-bit parity problems. In (b) and (c), the total execution time [roughly \n(b) 32 days (500 epochs); (c) two hours (450 epochs), both on 299-MHz UltraSparc] \nof the direct dogleg algorithm was normalized for comparison purposes. Notably, the \niterative dogleg converged faster to a small RMSE (see footnote 4) than the direct dogleg at an \nearly stage of learning even with respect to epoch. Moreover, the average number \nof inner CG iterations per epoch in the iterative dogleg algorithm was quite small, \n5.53 for (b) and 4.61 for (c). Thus, the iterative dogleg worked nearly (b) nine times \nand (c) four times faster than the direct dogleg in terms of the average execution \ntime per epoch. Those speed-up ratios became smaller than n mainly due to the \naforementioned cost of Algorithm 3.2. Yet, as n increases, the speed-up ratio can \nbe larger especially when the number of inner iterations is reasonably small. \n\nFootnote 3: The batch-mode MLP backpropagation can be viewed as an efficient matrix-vector \nmultiplication (2mn operations) for computing the gradient $J^T r$ without forming explicitly \nthe $m \\times n$ Jacobian matrix or the m-dimensional residual vector (with some extra costs). \n\nFigure 2: MLP-learning curves of RMSE (root mean squared error) obtained by \nthe \"iterative dogleg\" (solid line) and the \"direct dogleg\" (broken line): (a) \"epoch\" \nand (b) \"normalized execution time\" for the 20-bit parity problem with a standard \n20 x 19 x 1 MLP with hyperbolic tangent node functions ($m = 2^{20}$, n = 419), and \n(c) \"normalized execution time\" for the 14-bit parity problem with a 14 x 13 x 1 \nMLP ($m = 2^{14}$, n = 209). In (a) and (b), the iterative dogleg reduced the number of \nincorrect patterns down to 21 (nearly RMSE = 0.009) at epoch 838, whereas the \ndirect dogleg reached the same error level at epoch 388. In (c), the iterative dogleg \nsolved it perfectly at epoch 1,034 and the direct dogleg did so at epoch 401. \n\n5 Conclusion and Future Directions \n\nWe have compared two batch-mode MLP-learning algorithms: iterative and direct \ndogleg trust-region algorithms. Although such a high-dimensional parity problem is \nvery special in the sense that it involves a large data set but the size of MLP can be \nkept relatively small, the algorithmic features of the two dogleg methods can be well \nunderstood from the obtained experimental results. That is, the iterative dogleg \nhas the great advantage of reducing the cost of an epoch from $O(mn^2)$ to O(mn), \nand the memory requirements from $O(n^2)$ to O(n), a factor of O(n) in both cases. \nWhen n is large, this is a very large improvement. It also has the advantage of faster \nconvergence in the early epochs, achieving a lower RMSE after fewer epochs than \nthe direct dogleg. Its disadvantage is that it may need more epochs to converge to a \nvery small RMSE than the direct dogleg (although it might work faster in execution \ntime). 
Thus, the iterative dogleg is most attractive when attempting to achieve a \nreasonably small RMSE on very large problems in a short period of time. \n\nThe iterative dogleg is a matrix-free algorithm that extracts information about the \nHessian matrix via matrix-vector multiplication; this algorithm might be \ncharacterized as iterative batch-mode learning, an intermediate between direct \nbatch-mode learning and online pattern-by-pattern learning. Furthermore, the algorithm \nmight be implemented in a block-by-block updating mode if a large data set can \nbe split into multiple proper-size data blocks; it would therefore be of great \ninterest to us to compare its performance with online-mode learning algorithms for solving \nlarge-scale real-world problems with a large-scale NN model. \n\nFootnote 4: A standard steepest descent-type online pattern-by-pattern learning (or incremental \ngradient) algorithm (with or without a momentum term) failed to converge to a small \nRMSE in those parity problems due to hidden-node saturation [1]. \n\nAcknowledgments \n\nWe would like to thank Stuart Dreyfus (IEOR, UC Berkeley) and Rich Vuduc (CS, \nUC Berkeley) for their valuable advice. The work was supported in part by SONY \nUS Research Labs., and in part by \"Program for Promoting Academic Excellence \nof Universities,\" grant 89-E-FA04-1-4, Ministry of Education, Taiwan. \n\nReferences \n\n[1] E. Mizutani, S. E. Dreyfus, and J.-S. R. Jang. On dynamic programming-like recursive \ngradient formula for alleviating hidden-node saturation in the parity problem. In \nProceedings of the International Workshop on Intelligent Systems Resolutions - the \n8th Bellman Continuum, pages 100-104, Hsinchu, TAIWAN, 2000. \n\n[2] Eiji Mizutani. Powell's dogleg trust-region steps with the quasi-Newton augmented \nHessian for neural nonlinear least-squares learning. In Proceedings of the IEEE Int'l \nConf. on Neural Networks (vol.2), pages 1239-1244, Washington, D.C., July 1999. 
\n\n[3] R. S. Dembo and T. Steihaug. Truncated-Newton algorithms for large-scale \nunconstrained optimization. Math. Prog., 26:190-212, 1983. \n\n[4] Trond Steihaug. The conjugate gradient method and trust regions in large scale \noptimization. SIAM J. Numer. Anal., 20(3):626-637, 1983. \n\n[5] P. L. Toint. On large scale nonlinear least squares calculations. SIAM J. Sci. Statist. \nComput., 8(3):416-435, 1987. \n\n[6] H. Demuth and M. Beale. Neural Network Toolbox for Use with MATLAB. The \nMathWorks, Inc., Natick, Massachusetts, 1998. User's Guide (version 3.0). \n\n[7] Timothy Masters. Advanced algorithms for neural networks: a C++ sourcebook. John \nWiley & Sons, New York, 1995. \n\n[8] Adrian J. Shepherd. Second-Order Methods for Neural Networks: Fast and Reliable \nTraining Methods for Multi-Layer Perceptrons. Springer-Verlag, 1997. \n\n[9] Eiji Mizutani. Computing Powell's dogleg steps for solving adaptive networks \nnonlinear least-squares problems. In Proc. of the 8th Int'l Fuzzy Systems Association World \nCongress (IFSA '99), vol.2, pages 959-963, Hsinchu, Taiwan, August 1999. \n\n[10] M. J. D. Powell. A new algorithm for unconstrained optimization. In Nonlinear \nProgramming, pages 31-65. Edited by J.B. Rosen et al., Academic Press, 1970. \n\n[11] James W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997. \n\n[12] Eiji Mizutani and James W. Demmel. On generalized dogleg trust-region steps using \nthe Krylov subspace for solving neural networks nonlinear least squares problems. \nTechnical report, Computer Science Dept., UC Berkeley, 2001. (In preparation). \n\n[13] E. Mizutani and J.-S. R. Jang. Chapter 6: Derivative-based Optimization. In \nNeuro-Fuzzy and Soft Computing, pages 129-172. J.-S. R. Jang, C.-T. Sun and E. Mizutani. \nPrentice Hall, 1997. \n\n[14] Martin Fodslette Moller. A scaled conjugate gradient algorithm for fast supervised \nlearning. Neural Networks, 6:525-533, 1993. 
\n\n[15] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, \n6(1):147-160, 1994. \n\n[16] E. Mizutani, K. Nishio, N. Katoh, and M. Blasgen. Color device characterization of \nelectronic cameras by solving adaptive networks nonlinear least squares problems. In \nProc. of the 8th IEEE Int'l Conf. on Fuzzy Systems, vol. 2, pages 858-862, 1999. \n", "award": [], "sourceid": 1805, "authors": [{"given_name": "Eiji", "family_name": "Mizutani", "institution": null}, {"given_name": "James", "family_name": "Demmel", "institution": null}]}