{"title": "Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian", "book": "Advances in Neural Information Processing Systems", "page_first": 209, "page_last": 216, "abstract": "", "full_text": "Iterative scaled trust-region learning in\n\nKrylov subspaces via Pearlmutter\u2019s\n\nimplicit sparse Hessian-vector multiply\n\nEiji Mizutani\n\nDepartment of Computer Science\n\nTsing Hua University\n\nHsinchu, 300 TAIWAN R.O.C.\n\neiji@wayne.cs.nthu.edu.tw\n\nJames W. Demmel\n\nMathematics and Computer Science\nUniversity of California at Berkeley,\n\nBerkeley, CA 94720 USA\ndemmel@cs.berkeley.edu\n\nAbstract\n\nThe online incremental gradient (or backpropagation) algorithm is\nwidely considered to be the fastest method for solving large-scale\nneural-network (NN) learning problems. In contrast, we show that\nan appropriately implemented iterative batch-mode (or block-mode)\nlearning method can be much faster. For example, it is three times\nfaster in the UCI letter classi\ufb01cation problem (26 outputs, 16,000\ndata items, 6,066 parameters with a two-hidden-layer multilayer\nperceptron) and 353 times faster in a nonlinear regression problem\narising in color recipe prediction (10 outputs, 1,000 data items,\n2,210 parameters with a neuro-fuzzy modular network). The three\nprincipal innovative ingredients in our algorithm are the following:\nFirst, we use scaled trust-region regularization with inner-outer it-\neration to solve the associated \u201coverdetermined\u201d nonlinear least\nsquares problem, where the inner iteration performs a truncated\n(or inexact) Newton method. Second, we employ Pearlmutter\u2019s\nimplicit sparse Hessian matrix-vector multiply algorithm to con-\nstruct the Krylov subspaces used to solve for the truncated New-\nton update. 
Third, we exploit sparsity (for preconditioning) in the matrices resulting from the NNs having many outputs.

1 Introduction

Our objective function to be minimized for optimizing the n-dimensional parameter vector θ of an F-output NN model is the sum over all the d data of squared residuals:

E(θ) = (1/2)‖r(θ)‖_2^2 = (1/2) Σ_{i=1}^{m} r_i^2 = (1/2) Σ_{k=1}^{F} ‖r_k‖_2^2.

Here, m ≡ Fd; r(θ) is the m-dimensional residual vector composed of all m residual elements r_i (i = 1, ..., m); and r_k is the d-dimensional residual vector evaluated at terminal node k. The gradient vector and the Hessian matrix of E(θ) are given by g ≡ J^T r and H ≡ J^T J + S, respectively, where J, the m×n (residual) Jacobian matrix of r, is readily obtainable from the backpropagation (BP) process, and S is the matrix of second-derivative terms of r; i.e., S ≡ Σ_{i=1}^{m} r_i ∇²r_i. Most nonlinear least squares algorithms take advantage of information in J or its cross product, the Gauss-Newton (GN) Hessian J^T J (or the Fisher information matrix for E(·) in Amari's natural-gradient learning [1]), which is the important portion of H because the influence of S becomes weaker and weaker as residuals become smaller while learning progresses. 
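To make these definitions concrete, here is a minimal numerical sketch (not from the paper; the toy residual map and all names are illustrative) of E(θ), the gradient g = J^T r, and the GN-Hessian J^T J:

```python
import numpy as np

# Minimal sketch (not the paper's code): the least-squares objective,
# gradient, and Gauss-Newton Hessian for a generic residual map r(theta).
# r and J below are toy stand-ins; in an NN, J comes from backpropagation.

def r(theta):                      # m-dimensional residual vector (m = 3)
    return np.array([theta[0] - 1.0,
                     theta[1] - 2.0,
                     theta[0] * theta[1] - 2.0])

def J(theta):                      # m x n residual Jacobian (n = 2)
    return np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [theta[1], theta[0]]])

theta = np.array([2.0, 3.0])
res = r(theta)
E = 0.5 * res @ res                # E(theta) = (1/2) ||r||^2
g = J(theta).T @ res               # gradient g = J^T r
GN = J(theta).T @ J(theta)         # Gauss-Newton Hessian J^T J
```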
With multiple F-output nonlinear models (except fully-connected NNs), J is known to have the m×n block-angular matrix form (see [7, 6] and references therein). For instance, consider a single-hidden-layer S-H-F MLP (with S input, H hidden, and F output nodes); there are n_A = F(H+1) terminal parameters θ^A (including threshold parameters) on direct connections to the F terminal nodes, each of which has C_A (= H+1) direct connections, and the remaining n_B = H(S+1) parameters are not directly connected to any terminal node; hence n_B hidden parameters θ^B. In other words, the model's parameters θ (n = F C_A + n_B in total) separate as θ^T = [θ^{A,T} | θ^{B,T}] = [θ_1^{A,T}, ..., θ_k^{A,T}, ..., θ_F^{A,T} | θ^{B,T}], where θ_k^A is the vector of the kth subset of C_A terminal parameters directly linked to terminal node k (k = 1, ..., F). The associated residual Jacobian matrix J can be given in the block-angular form below left, and thus the (full) Hessian matrix H has the n×n sparse block-arrow form below right (× denotes a non-zero block), as does the GN-Hessian J^T J:

        [ A_1              B_1 ]           [ ×              × ]
        [     A_2          B_2 ]           [    ×           × ]
    J = [         ...      ... ] ,     H = [        ...     × ]      (1)
        [             A_F  B_F ]           [ × ×  ...  ×    × ]
        \____________________/             \__________________/
                m × n                            n × n

Here in J, A_k and B_k are the d×C_A and d×n_B Jacobian matrices, respectively, of the d-dimensional residual vector r_k evaluated at terminal node k. 
Notice that there are F diagonal A_k blocks [because the (F−1)C_A terminal parameters excluding θ_k^A have no effect on r_k], and F vertical B_k blocks corresponding to the n_B hidden parameters θ^B that contribute to minimizing all the residuals r_k (k = 1, ..., F) evaluated at all F terminal nodes. Therefore, the posed problem is overdetermined when m > n (namely, d > C_A + n_B/F) holds. In addition, when the terminal nodes have linear identity functions, the terminal parameters θ^A are linear, and thus all A_k blocks become identical, A_1 = A_2 = ... = A_F, with H+1 hidden-node outputs (including one constant bias-node output) in each row. For small- and medium-scale problems, direct batch-mode learning is recommendable with a suitable “direct” matrix factorization, but attention must be paid to exploiting the obvious sparsity in either the block-angular J or the block-arrow H so as to render the algorithms efficient in both memory and operation counts [7, 6]. Notice that H^{-1} is dense even if H has a nice block-arrow sparsity structure. 
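The payoff of this block structure can be sketched as a matrix-free product z = J^T(J v) that touches only the nonzero blocks A_k and B_k (toy random blocks stand in for the BP-derived Jacobian rows; all names and sizes are illustrative):

```python
import numpy as np

# Illustrative sketch (toy sizes, random blocks): multiplying by the
# Gauss-Newton Hessian J^T J without forming J, using the block-angular
# structure J = [A_k on the diagonal | B_k stacked vertically].
rng = np.random.default_rng(0)
d, F, CA, nB = 5, 3, 2, 4          # data, outputs, terminal/hidden sizes
A = [rng.standard_normal((d, CA)) for _ in range(F)]
B = [rng.standard_normal((d, nB)) for _ in range(F)]

def JtJv(v):
    """z = J^T (J v), touching only the nonzero blocks A_k, B_k."""
    va, vb = v[:F * CA].reshape(F, CA), v[F * CA:]
    z = np.zeros_like(v)
    for k in range(F):
        u = A[k] @ va[k] + B[k] @ vb      # (J v) rows for terminal node k
        z[k * CA:(k + 1) * CA] += A[k].T @ u
        z[F * CA:] += B[k].T @ u
    return z

# Check against the dense J built explicitly.
Jd = np.zeros((F * d, F * CA + nB))
for k in range(F):
    Jd[k * d:(k + 1) * d, k * CA:(k + 1) * CA] = A[k]
    Jd[k * d:(k + 1) * d, F * CA:] = B[k]
v = rng.standard_normal(F * CA + nB)
assert np.allclose(JtJv(v), Jd.T @ (Jd @ v))
```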
For large-scale problems, Krylov subspace methods, which circumvent the need to perform time-consuming and memory-intensive direct matrix factorizations, can be employed to realize what we call iterative batch-mode learning. If any rows (or columns) of those matrices A_k and B_k are not needed explicitly, then Pearlmutter's method [11] can automatically exploit such sparsity to perform the sparse Hessian-vector product in constructing a Krylov subspace for parameter optimization, which we describe in what follows with our numerical evidence.

2 Inner-Outer Iterative Scaled Trust-Region Methods

Practical Newton methods enjoy both the global convergence property of the Cauchy (or steepest descent) method and the fast local convergence of the Newton method.

2.1 Outer iteration process in trust-region methods

One might consider a convex combination of the Cauchy step Δθ_Cauchy and the Newton step Δθ_Newton such as (using a scalar parameter h):

Δθ_Dogleg := (1 − h)Δθ_Cauchy + hΔθ_Newton,      (2)

which is known as the dogleg step [4, 9]. This step yields a good approximate solution to the so-called “scaled 2-norm” or “M-norm” trust-region subproblem (e.g., see Chap. 7 in [2]) with Lagrange multiplier μ below:

min_Δθ q(Δθ) subject to ‖Δθ‖_M ≤ R,  or  min_Δθ { q(Δθ) + (μ/2)(Δθ^T MΔθ − R²) },      (3)

where the distances are measured in the M-norm: ‖x‖_M = √(x^T Mx) with a symmetric positive definite matrix M, and R (called the trust-region radius) signifies the trust-region size of the local quadratic model q(Δθ) := E(θ) + g^T Δθ + (1/2)Δθ^T HΔθ. Radius R is controlled according to how well q(·) predicts the behavior of E(·) 
by checking the error reduction ratio below:

ρ := (Actual error reduction)/(Predicted error reduction) = [E(θ_now) − E(θ_next)] / [E(θ_now) − q(Δθ)].      (4)

For more details, refer to [9, 2]. The posed constrained quadratic minimization can be solved with the Lagrange multiplier μ: If Δθ is a solution to the posed problem, then Δθ satisfies the formula (H + μM)Δθ = −g, with μ(‖Δθ‖_M − R) = 0, μ ≥ 0, and H + μM positive semidefinite. In the nonlinear least squares context, the nonnegative scalar parameter μ is known as the Levenberg-Marquardt parameter. When μ = 0 (namely, R ≥ ‖Δθ_Newton‖_M), the trust-region step Δθ becomes the Newton step Δθ_Newton := −H^{-1}g, and, as μ increases (i.e., as R decreases), Δθ gets closer to the (full) Cauchy step Δθ_Cauchy := −(g^T M^{-1}g / g^T M^{-1}HM^{-1}g) M^{-1}g. When R < ‖Δθ_Cauchy‖_M, the trust-region step Δθ reduces to the restricted Cauchy step Δθ_RC := −(R/‖Δθ_Cauchy‖_M)Δθ_Cauchy. If ‖Δθ_Cauchy‖_M < R < ‖Δθ_Newton‖_M, Δθ is the “dogleg step,” intermediate between Δθ_Cauchy and Δθ_Newton, as shown in Eq. (2), where scalar h (0 < h < 1) is the positive root of ‖s + hp‖_M = R:

h = [ −s^T Mp + √( (s^T Mp)² + p^T Mp (R² − s^T Ms) ) ] / (p^T Mp),      (5)

with s := Δθ_Cauchy and p := Δθ_Newton − Δθ_Cauchy (when p^T g < 0). 
In this way, the trial step Δθ is subject to trust-region regularization.

In large-scale problems, the linear-equation solution sequence {Δθ_k} is generated iteratively while seeking a trial step Δθ in the inner iteration process, and the parameter sequence {θ_i}, whose two consecutive elements are denoted by θ_now and θ_next, is produced by the outer iteration (i.e., epoch in batch mode). The outer iterative process updates parameters by θ_next = θ_now + Δθ without taking any uphill movement: That is, if the step is not satisfactory, then R is decreased so as to realize an important Levenberg-Marquardt concept: the failed step is shortened and deflected towards the Cauchy-step direction simultaneously. For this purpose, the trust-region methods compute the gradient vector in batch mode or with a (sufficiently large) data block (i.e., block mode; see our demonstration in Section 3).

2.2 Inner iteration process with truncated preconditioned linear CG

We employ a preconditioned conjugate gradient (PCG) method (among many Krylov subspace methods; see Section 6.6 in [3] and Chapter 5 in [2]) with our symmetric positive definite preconditioner M for solving the M-norm trust-region subproblem (3). This is the truncated PCG (also known as Steihaug-Toint CG), applicable even to nonconvex problems, for solving the Newton formula inexactly by the inner iterative process below (see pp. 628-629 in [10]; pp. 202-218 in [2]) based on the standard PCG algorithm (e.g., see page 317 in [3]):

Algorithm 1: The inner iteration process via preconditioned CG.
1. Initialization (k=0):
   Set Δθ_0 = 0 and δ_0 = −g (= −g − HΔθ_0);
   Solve Mz = δ_0 for pseudoresiduals: z = M^{-1}δ_0;
   Compute τ_0 = δ_0^T z;
   Set k = 1 and d_1 = z, and then proceed to Step 2.
2. 
Matrix-vector product: z = Hd_k = J^T(Jd_k) + Sd_k (see also Algorithm 2).
3. Curvature check: γ_k = d_k^T z = d_k^T Hd_k.
   If γ_k > 0, then continue with Step 4. Otherwise, compute h (> 0) such that ‖Δθ_{k−1} + hd_k‖_M = R, and terminate with Δθ = Δθ_{k−1} + hd_k.
4. Step size: η_k = τ_{k−1}/γ_k.
5. Approximate solution: Δθ_k = Δθ_{k−1} + η_k d_k.
   If ‖Δθ_k‖_M < R, go on to Step 6; else terminate with Δθ = (R/‖Δθ_k‖_M) Δθ_k.      (6)
6. Linear-system residuals: δ_k = δ_{k−1} − η_k z [= −g − HΔθ_k = −∇q(Δθ_k)].
   If ‖δ_k‖_2 is small enough, i.e., ‖δ_k‖_2 ≤ ξ‖g‖_2, then terminate with Δθ = Δθ_k.
7. Pseudoresiduals: z = M^{-1}δ_k, and then compute τ_k = δ_k^T z.
8. Conjugation factor: β_{k+1} = τ_k/τ_{k−1}.
9. Search direction: d_{k+1} = z + β_{k+1} d_k.
10. If k < k_limit, set k = k+1 and return to Step 2. Otherwise, terminate with Δθ = Δθ_k. □

At Step 3, h is obtainable with Eq. (5) with s = Δθ_{k−1} and p = d_k plugged in. Likewise, in place of Eq. (6) at Step 5, we may use Eq. (5) for Δθ = Δθ_{k−1} + hd_k such that ‖Δθ_{k−1} + hd_k‖_M = R, but both computations become identical if R ≤ ‖Δθ_Cauchy‖_M; otherwise, Eq. (6) is less expensive and tends to give more bias towards the Newton direction. 
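For concreteness, the inner loop of Algorithm 1 can be sketched as follows, specialized to M = I for brevity (so no preconditioning); `Hv` is any Hessian-vector oracle, and the function names are ours, not the paper's:

```python
import numpy as np

# A compact sketch of Steihaug-Toint truncated CG (Algorithm 1 in spirit),
# with M = I for brevity; Hv is any Hessian-vector product oracle.
def steihaug_cg(Hv, g, R, xi=0.01, k_max=None):
    n = g.shape[0]
    k_max = n if k_max is None else k_max
    dtheta = np.zeros(n)
    delta = -g.copy()                   # linear-system residual
    d = delta.copy()                    # search direction
    tau = delta @ delta
    for _ in range(k_max):
        z = Hv(d)
        gamma = d @ z                   # curvature check (Step 3)
        if gamma <= 0:                  # non-positive curvature: boundary
            return dtheta + _to_boundary(dtheta, d, R) * d
        eta = tau / gamma               # step size (Step 4)
        step = dtheta + eta * d
        if np.linalg.norm(step) >= R:   # hit trust-region boundary (Step 5)
            return dtheta + _to_boundary(dtheta, d, R) * d
        dtheta = step
        delta = delta - eta * z         # residual update (Step 6)
        tau_new = delta @ delta
        if np.sqrt(tau_new) <= xi * np.linalg.norm(g):
            return dtheta               # residual small enough (cond. C)
        d = delta + (tau_new / tau) * d # conjugation (Steps 8-9)
        tau = tau_new
    return dtheta                       # iteration limit (cond. D)

def _to_boundary(s, p, R):
    """Positive root h of ||s + h p|| = R (cf. Eq. (5) with M = I)."""
    sp, pp, ss = s @ p, p @ p, s @ s
    return (-sp + np.sqrt(sp * sp + pp * (R * R - ss))) / pp
```

On a convex quadratic with a generous radius this reduces to plain CG and returns (approximately) the Newton step; with a small radius it stops on the trust-region boundary.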
The inner-iterative process terminates (i.e., stops at inner iteration k) when one of the next four conditions holds:

(A) d_k^T Hd_k ≤ 0,  (B) ‖Δθ_k‖_M ≥ R,  (C) ‖HΔθ_k + g‖_2 ≤ ξ‖g‖_2,  (D) k = k_limit.      (7)

Condition (D) at Step 10 is least likely to be met since there would be no prior knowledge about the preset limit k_limit on inner iterations (usually, k_limit = n). As long as d_k^T Hd_k > 0 holds, PCG works properly until the CG trajectory hits the trust-region boundary [Condition (B) at Step 5], or till the 2-norm linear-system residuals become small [Condition (C) at Step 6], where ξ can be fixed (e.g., ξ = 0.01). Condition (A) d_k^T Hd_k ≤ 0 (at Step 3) may hold when the local model is not strictly convex (or H is not positive definite). That is, d_k is a direction of zero or negative curvature; a typical exploitation of non-positive curvature is to set Δθ equal to the “step to the trust-region boundary along that curvature segment (in Step 3)” as a model minimizer in the trust region. In this way, the terminated kth CG step yields an approximate solution to the trust-region subproblem (3), and it belongs to the Krylov subspace span{−M^{-1/2}g, −(M^{-1/2}HM^{-1/2})M^{-1/2}g, ..., −(M^{-1/2}HM^{-1/2})^{k−1}M^{-1/2}g}, resulting from our application of CG (without multiplying by M^{-1/2}) to the symmetric Newton formula (M^{-1/2}HM^{-1/2})(M^{1/2}Δθ) = −M^{-1/2}g, because M^{-1}H (in the system M^{-1}HΔθ = −M^{-1}g) is unlikely symmetric (see page 317 in [3]) even if M is a diagonal matrix (unless M = I).

The overall memory requirement of Algorithm 1 is O(n) because at most five n-vectors are enough to implement it. Since the matrix-vector product Hd_k at Step 2 is dominant in the operation cost of the entire inner-outer process, we can employ Pearlmutter's method with no H explicitly required. 
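The R{·}-operator idea behind Pearlmutter's method can be sketched on a tiny one-hidden-layer MLP: a forward directional pass yields R{y} = Jv, and a simplified backward pass applies J^T, giving the GN product without forming J or H. This is our own toy sketch (tanh hidden units, linear outputs, illustrative shapes), not the paper's code:

```python
import numpy as np

# Sketch of Pearlmutter's R{.}-operator for a tiny one-hidden-layer MLP
# y = W2 tanh(W1 x): the Gauss-Newton product z = J^T (J v) comes from one
# forward R-pass plus one (simplified) backward pass per datum.
rng = np.random.default_rng(1)
S, Hn, F, d = 3, 4, 2, 6                 # inputs, hidden, outputs, data
W1, W2 = rng.standard_normal((Hn, S)), rng.standard_normal((F, Hn))
X = rng.standard_normal((d, S))

def gn_hv(V1, V2):
    """Accumulate J^T (J v) for v = (V1, V2) over all data."""
    Z1, Z2 = np.zeros_like(W1), np.zeros_like(W2)
    for x in X:
        h = np.tanh(W1 @ x)
        Rh = (1.0 - h * h) * (V1 @ x)    # R{h}: forward directional pass
        Ry = V2 @ h + W2 @ Rh            # R{y} = J_p v
        Z2 += np.outer(Ry, h)            # backward pass: apply J_p^T
        Z1 += np.outer((W2.T @ Ry) * (1.0 - h * h), x)
    return Z1, Z2

# Finite-difference check that R{y} is the directional output derivative Jv.
V1, V2 = rng.standard_normal(W1.shape), rng.standard_normal(W2.shape)
eps = 1e-6
y = lambda A, B: np.array([B @ np.tanh(A @ x) for x in X])
Jv = (y(W1 + eps * V1, W2 + eps * V2) - y(W1 - eps * V1, W2 - eps * V2)) / (2 * eps)
Rys = np.array([V2 @ np.tanh(W1 @ x)
                + W2 @ ((1 - np.tanh(W1 @ x) ** 2) * (V1 @ x)) for x in X])
assert np.allclose(Jv, Rys, atol=1e-5)
```

The identity v^T (J^T J v) = ‖Jv‖² gives a cheap consistency check on `gn_hv`.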
To better understand the method, we first describe a straightforward implicit sparse matrix-vector multiply when H = J^T J; it evaluates J^T Jd_i (without forming J^T J) as the two-step implicit matrix-vector product z = J^T(Jd_i), exploiting the block-angular J in Eq. (1); i.e., working on each block, A_k and B_k, in a row-wise manner below:

Algorithm 2: Implicit (i.e., matrix-free) sparse matrix-vector multiplication step with an F-output NN model at inner iteration i, starting with z = 0:
for p = 1 to d (i.e., one sweep of d training data):
  (a) do a forward pass to compute the F final outputs y_p(θ) on datum p;
  for k = 1 to F (at each terminal node k):
    (b) do a backward pass to obtain the pth row of A_k as the C_A-vector a_{p,k}^T, and the pth row of B_k as the n_B-vector b_{p,k}^T;
    (c) compute α_k a_{p,k} and α_k b_{p,k}, where scalar α_k = a_{p,k}^T d_i^a + b_{p,k}^T d_i^b, and then add them to their corresponding elements of z;
  end for k;
end for p. □

Here, Step (a) costs at least 2dn (see details in [8]); Step (b) costs at least 2m·l_u, where m = Fd and l_u = C_A + n_B < n = F C_A + n_B; and Step (c) costs 4m·l_u; overall, Algorithm 2 costs O(m·l_u), linear in F. Note that if sparsity is ignored, the cost becomes O(mn), quadratic in F, since mn = Fd(F C_A + n_B). Algorithm 2 can extract explicitly F pairs of row vectors (a^T and b^T) of J (with F·l_u storage) on each datum, making it easier to apply other numerical linear algebra approaches such as preconditioning to reduce the number of inner iterations. Yet, if the row vectors are not needed explicitly, then Pearlmutter's method is more efficient, calculating α_k [see Step (c)] in its forward pass (i.e., R{y_k} = α_k; see Eq. (4.3) on page 151 in [11]). When H = J^T J, it is easy to simplify its backward pass (see Eq. 
(4.4) on page 152 in [11]), just by eliminating the terms involving the residuals r and the second derivatives of the node functions f''(·), so as to multiply the vectors a_k and b_k through by scalar α_k implicitly. This simplified method of Pearlmutter runs in time O(dn), whereas Algorithm 2 runs in O(m·l_u). Since m·l_u − dn = dF(C_A + n_B) − d(F C_A + n_B) = d(F−1)n_B, Pearlmutter's method can be up to F times faster than Algorithm 2. Furthermore, Pearlmutter's original method efficiently multiplies an n-vector by the “full” Hessian matrix, still in O(dn), for z = Hd_i = J^T(Jd_i) + Sd_i = Σ_{j=1}^{m} (u_j^T d_i)u_j + Σ_{j=1}^{m} [∇²r_j] r_j d_i, where u_i^T is the ith row vector of J; notably, the method automatically exploits the block-arrow sparsity of H [see Eq. (1), right] in essentially the same way as the standard BP deals with the block-angular sparsity of J [see Eq. (1), left] to perform the matrix-vector product g = J^T r in O(dn).

3 Experiments and Discussion

In simulation, we compared the following five algorithms:
Algorithm A: Online-BP (i.e., H = I) with a fixed momentum (0.8);
Algorithm B: Algorithm 2 alone for Algorithm 1 with H = J^T J (see [6]);
Algorithm C: Pearlmutter's method alone for Algorithm 1 with H = J^T J;
Algorithm D: Algorithm 2 to obtain the preconditioner M = diag(J^T J) only, and Pearlmutter's method for Algorithm 1 with H = J^T J;
Algorithm E: Same as Algorithm D except with the “full” Hessian H = J^T J + S. □

Algorithm A is tested for our speed-comparison purpose because, if it works, it is probably fastest. In Algorithms D and E, Algorithm 2 was employed only for obtaining a diagonal preconditioner M = diag(J^T J) (or Jacobi preconditioner) for Algorithm 1, whereas in Algorithms B and C, no preconditioning (M = I) was applied. 
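A minimal sketch of how such a Jacobi preconditioner can be accumulated from the blocks without forming the dense J (toy random blocks and illustrative names; the real rows would come from BP as in Algorithm 2):

```python
import numpy as np

# Sketch: the Jacobi (diagonal) preconditioner M = diag(J^T J) used in
# Algorithms D and E, accumulated row-wise from the blocks A_k, B_k so
# the dense J is never formed.
rng = np.random.default_rng(2)
d, F, CA, nB = 8, 3, 2, 5
A = [rng.standard_normal((d, CA)) for _ in range(F)]
B = [rng.standard_normal((d, nB)) for _ in range(F)]
n = F * CA + nB

M = np.zeros(n)                          # diagonal of J^T J
for k in range(F):
    M[k * CA:(k + 1) * CA] += (A[k] ** 2).sum(axis=0)
    M[F * CA:] += (B[k] ** 2).sum(axis=0)

# Applying this preconditioner inside PCG is an elementwise divide:
solve_M = lambda delta: delta / M

# Check against the dense Jacobian.
Jd = np.zeros((F * d, n))
for k in range(F):
    Jd[k * d:(k + 1) * d, k * CA:(k + 1) * CA] = A[k]
    Jd[k * d:(k + 1) * d, F * CA:] = B[k]
assert np.allclose(M, np.diag(Jd.T @ Jd))
```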
The performance comparisons were made with a nonlinear regression task and a classification benchmark, the letter recognition problem, from the UCI machine learning repository. All the experiments were conducted on a 1.6-GHz Pentium-IV PC with FreeBSD 4.5 and the gcc-2.95.3 compiler (with the -O2 optimization flag).

The first regression task was a real-world application, color recipe prediction: the problem of determining mixing proportions of available colorants to reproduce a given target color, requiring mappings from 16 inputs (16 spectral reflectance signals of the target color) to ten outputs (F = 10) (ten colorant proportions) using 1,000 training data (d = 1,000; m = 10,000) with 302 test data. The table below shows the results averaged over 20 trials with a single 16-82-10 MLP [n = 2,224 (C_A = 83; n_B = 1,394; l_u = 1,477); hence, m·l_u/dn = 6.6], which was optimized until “training RMSE ≤ 0.002 (application requirement)” was satisfied, when we say that “convergence” (relatively early stopping) occurs. Clearly, the posed regression task is nontrivial because Algorithm A, online-BP, took roughly six days (averaged over only ten trials), nearly 280 (= 8748.4/31.2) times slower than the (fastest) Algorithm D. In generalization performance, all the posed algorithms were more or less equivalent.

Model             | Single 16-82-10 MLP                             | Five-MLP mixed
Algorithm         | A            B      C      D      E             | B      C      D
Total time (min)  | 8748.4       336.4  57.6   31.2   64.5          | 162.3  107.2  20.9
Stopped epoch     | 2,916,495.2  272.5  160.0  132.7  300.3         | 147.3  261.5  179.1
Time/epoch (sec)  | 0.2          73.8   21.6   14.1   12.9          | 65.2   24.6   7.0
Inner itr./epoch  | N/A          218.3  174.1  142.7  110.9         | 193.8  216.0  66.0
Flops ratio/itr.  | N/A          3.9    1.0    1.0    1.3           | 4.1    1.0    1.2
Test RMSE         | 0.020        0.015  0.015  0.015  0.015         | 0.016  0.016  0.017

We also observed that use of the full Hessian matrix (Algorithm E) helped reduce inner iterations per epoch, although the total convergence time turned out to be greater than that obtained with the GN-Hessian (Algorithm D), presumably because our Jacobi preconditioner must be more suitable for the GN-Hessian than for the full Hessian, and perhaps because the inner iterative process of Algorithm E can terminate due to detection of non-positive curvature in Eq. (7)(A); this extra chance of termination may increase the total epochs, but help reduce the time per epoch. Remarkably, the time per inner iteration of Algorithm E did not differ much from Algorithms C and D owing to Pearlmutter's method; in fact, given the preconditioner M, Algorithm E merely needed about 1.3 times more flops* per inner iteration than Algorithms C and D did, although Algorithm B needed nearly 3.9 times more. The measured megaflop rates for all these codes lie roughly in the range 200-270 Mflop/sec, typically below 10% of peak machine speed.

* The floating-point operation counts were measured by using PAPI (Performance Application Programming Interface); see http://icl.cs.utk.edu/projects/papi/.

For improving single-MLP performance, one might employ two layers of hidden nodes (rather than one large hidden layer; see the letter problem below), which increases n_B while reducing n_A, rendering Algorithm 2 less efficient (i.e., slower). Alternatively, one might introduce direct connections between the input and terminal output layers, which increases C_A, the column size of A_k, retaining nice parameter separability. Yet another approach (if applicable) is to use a “complementary mixtures of Z MLP-experts” model (or a neuro-fuzzy modular network) that combines Z smaller-size MLPs complementarily; the associated residual vector to be minimized becomes r(θ) = y(θ) − t = Σ_{i=1}^{Z} w_i o_i − t, where scalar w_i, the ith output of the integrating unit, is the ith (normalized) mixing proportion assigned to the outputs (F-vector o_i) of expert-MLP i. Note that each expert learns “residuals” rather than “desired outputs” (unlike in the committee method below) in the sense that only the final combined outputs y must come close to the desired ones t. That is, there are strong coupling effects (see page 80 in [5]) among all experts; hence, it is crucial to consider the global Hessian across all experts to optimize them simultaneously [7]. The corresponding J has the same block-angular form as that in Eq. (1) (left) with A_k ≡ [A_k^1 A_k^2 ··· A_k^Z] and B_k ≡ [B_k^1 B_k^2 ··· B_k^Z] (k = 1, ..., F). Here, the residual Jacobian portion for the parameters of the integrating unit was omitted because they were merely fine-tuned with a steepest-descent-type method owing to our knowledge-based design for input partition to avoid (too many) local experts. 
Specifically, the spectral reflectance signals (16 inputs) were converted to the hue angle as input to the integrating unit, which consists of five bell-shaped basis functions, partitioning that hue subspace alone in a fuzzy fashion into only five color regions (red, yellow, green, blue, and violet) for five 16-16-10 MLP-experts, each of which receives all the 16 spectral signals as input [hence, Z = 5; n = 2,210 (C_A = 85; n_B = 1,360); m·l_u/dn = 6.5]. Due to localized parameter tunings, our five-MLP mixtures model was better in learning; see the faster learning in the table above. In particular, our model with Algorithm D worked 353 (≈ 123.1 × 60.0/20.9) times faster than with Algorithm A, which took 123.1 hours (see [6]), and 419 (≈ 8748.4/20.9) times faster than the single MLP with Algorithm A. For our complementary mixtures model, the R{·}-operator of Pearlmutter's method is readily applicable; for instance, at terminal node k (k = 1, ..., F): R{r_k} = R{y_k} = Σ_{i=1}^{Z} R{o_{i,k}} w_i + Σ_{i=1}^{Z} R{w_i} o_{i,k}, where each R{o_{i,k}} yields α_k [see Algorithm 2(c)] for each expert-MLP i (i = 1, ..., Z).

The second, letter classification benchmark problem involves 16 inputs (features) and 26 outputs (alphabet letters) with 16,000 training data (F = 26; d = 16,000; m = 416,000) plus 4,000 test data. We used the 16-70-50-26 MLP (see [12]) (n = 6,066) with 10 sets of different initial parameters randomly generated uniformly in the range [−0.2, 0.2]. We implemented block-mode learning (as well as batch mode) just by splitting the training data set into two or four equally-sized data blocks; each data block alone is employed for Algorithms 1 and 2 except for computing ρ in Eq. (4), where evaluation of E(·) involves all the d training data. Notice that the two-block-mode learning scheme updates the model's parameters θ twice per epoch, whereas online-BP updates them on each datum (i.e., d times per epoch). 
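The outer block-mode loop just described can be sketched on a toy one-parameter least-squares problem: each block drives one trust-region update per epoch, while the reduction ratio ρ of Eq. (4) is evaluated on the full data set. All constants and the crude clipped step are illustrative, not the paper's implementation:

```python
import numpy as np

# Sketch of block-mode outer iteration on a toy 1-D least-squares model
# t ~ 3*x: two data blocks, one trust-region update per block per epoch,
# with the reduction ratio rho computed on the full training set.
rng = np.random.default_rng(3)
x = rng.standard_normal(100)
t = 3.0 * x + 0.01 * rng.standard_normal(100)   # target slope 3.0

E = lambda th, idx: 0.5 * np.sum((th * x[idx] - t[idx]) ** 2)
blocks = np.array_split(np.arange(100), 2)       # two-block mode
full = np.arange(100)

theta, R = 0.0, 1.0
for epoch in range(50):
    for idx in blocks:                           # one update per block
        g = np.sum((theta * x[idx] - t[idx]) * x[idx])   # block gradient
        H = np.sum(x[idx] ** 2)                  # scalar GN "Hessian"
        step = np.clip(-g / H, -R, R)            # crude trust-region step
        pred = -(g * step + 0.5 * H * step ** 2) # predicted reduction
        rho = (E(theta, full) - E(theta + step, full)) / max(pred, 1e-12)
        if rho > 0.25:                           # accept, maybe grow R
            theta += step
            R = min(2.0 * R, 10.0)
        else:                                    # reject, shrink R
            R *= 0.25
```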
We observed that possible redundancy in the data set appeared to help reduce the number of inner iterations, speeding up our iterative batch-mode learning; therefore, we did not use preconditioning. The next table shows the average performance (over ten trials) when the best test-set performance was obtained by epoch 1,000 with online-BP (i.e., Algorithm A) and by epoch 50 with Algorithm C in three learning modes:

Average results     | Online-BP   | Four-block mode | Two-block mode | Batch mode
Total time (min)    | 63.2        | 22.4            | 41.0           | 61.1
Stopped epoch       | 597.8       | 36.6            | 22.1           | 27.1
Time/epoch (sec)    | 6.3         | 36.8            | 111.7          | 135.2
Avg. inner itr.     | N/A         | 4.5/block       | 26.3/block     | 31.0/batch
Error (train/test)  | 2.3% / 6.4% | 2.7% / 5.1%     | 1.2% / 4.6%    | 1.2% / 4.9%
Committee error     | 0.2% / 3.0% | 1.2% / 2.8%     | 0.3% / 2.2%    | 0.1% / 2.3%

On average, Algorithm C in four-block mode worked about three (≈ 63.2/22.4) times faster than online-BP, and thus can work faster than batch-mode nonlinear-CG algorithms since, as reported in [12], online-BP worked faster than nonlinear-CG. Here, we also tested the committee methods (see Chap. 8 in [13]) that merely combined all (equally weighted) outputs of the ten MLPs, which were optimized independently in this experiment. The committee error was better than the average error, as expected. Intriguingly, our block-mode learning schemes introduced a small (harmless) bias, improving the test-data performance; specifically, the two-block mode yielded the best test error rate of 2.2% even with this simple committee method.

4 Conclusion and Future Directions

Pearlmutter's method can construct Krylov subspaces efficiently for implementing iterative batch- or block-mode learning. In our simulation examples, the simpler version of Pearlmutter's method (see Algorithms C and D) worked excellently. 
But it would be of interest to investigate other real-life large-scale problems to find out the strengths of the full-Hessian-based methods (see Algorithm E), perhaps with a more elaborate preconditioner, which would be much more time-consuming per epoch but might reduce the total time dramatically; hence, one must deal with a delicate balancing act. Besides the simple committee method, it would be worth examining our algorithms for implementing other statistical learning methods (e.g., boosting) in conjunction with appropriate numerical linear algebra techniques. These are part of our overly ambitious goal of attacking practical large-scale problems.

References

[1] Shun-ichi Amari. “Natural gradient works efficiently in learning.” In Neural Computation, Vol. 10, pp. 251-276, 1998.

[2] A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust-Region Methods. SIAM, 2000.

[3] James W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.

[4] J. E. Dennis, D. M. Gay, and R. E. Welsch. “An Adaptive Nonlinear Least-Squares Algorithm.” In ACM Trans. on Mathematical Software, 7(3), pp. 348-368, 1981.

[5] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. “Adaptive Mixtures of Local Experts.” In Neural Computation, Vol. 3, No. 1, pp. 79-87, 1991.

[6] Eiji Mizutani and James W. Demmel. “On structure-exploiting trust-region regularized nonlinear least squares algorithms for neural-network learning.” In International Journal of Neural Networks, Elsevier Science, Vol. 16, pp. 745-753, 2003.

[7] Eiji Mizutani and James W. Demmel. “On separable nonlinear least squares algorithms for neuro-fuzzy modular network learning.” In Proceedings of the IEEE Int'l Joint Conf. on Neural Networks, Vol. 3, pp. 2399-2404, Honolulu, USA, May 2002. (Available at http://www.cs.berkeley.edu/~eiji/ijcnn02.pdf.)

[8] Eiji Mizutani and Stuart E. Dreyfus. 
“On complexity analysis of supervised MLP-learning for algorithmic comparisons.” In Proceedings of the INNS-IEEE Int'l Joint Conf. on Neural Networks, Vol. 1, pp. 347-352, Washington, D.C., July 2001.

[9] Jorge J. Moré and Danny C. Sorensen. “Computing A Trust Region Step.” In SIAM J. Sci. Stat. Comp., 4(3), pp. 553-572, 1983.

[10] Trond Steihaug. “The Conjugate Gradient Method and Trust Regions in Large Scale Optimization.” In SIAM J. Numer. Anal., Vol. 20, No. 3, pp. 626-637, 1983.

[11] Barak A. Pearlmutter. “Fast exact multiplication by the Hessian.” In Neural Computation, Vol. 6, No. 1, pp. 147-160, 1994.

[12] Holger Schwenk and Yoshua Bengio. “Boosting neural networks.” In Neural Computation, Vol. 12, No. 8, pp. 1869-1887, 2000.

[13] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001 (corrected printing 2002).
", "award": [], "sourceid": 2376, "authors": [{"given_name": "Eiji", "family_name": "Mizutani", "institution": null}, {"given_name": "James", "family_name": "Demmel", "institution": null}]}