{"title": "Periodic Step Size Adaptation for Single Pass On-line Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 763, "page_last": 771, "abstract": "It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks.", "full_text": "Periodic Step-Size Adaptation for\n\nSingle-Pass On-line Learning\n\nChun-Nan Hsu1,2,\u2217, Yu-Ming Chang1, Han-Shen Huang1 and Yuh-Jye Lee3\n\n1Institute of Information Science, Academia Sinica, Taipei 115, Taiwan\n2USC/Information Sciences Institute, Marina del Rey, CA 90292, USA\n\n3Department of Computer Science and Information Engineering,\n\nNational Taiwan University of Science and Technology, Taipei 106, Taiwan\n\n\u2217chunnan@isi.edu\n\nAbstract\n\nIt has been established that the second-order stochastic gradient descent (2SGD)\nmethod can potentially achieve generalization performance as well as empirical\noptimum in a single pass (i.e., epoch) through the training examples. However,\n2SGD requires computing the inverse of the Hessian matrix of the loss function,\nwhich is prohibitively expensive. 
This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and exploits a linear relation between the Jacobian and the Hessian to approximate the Hessian periodically, achieving near-optimal results in experiments on a wide variety of models and tasks.

1 Introduction

On-line learning has been studied for decades. Early work concentrates on minimizing the number of model corrections made by the algorithm through a single pass of training examples. More recently, on-line learning has been considered as a solution for large-scale learning, mainly because of its fast convergence. New on-line learning algorithms for large-scale learning, such as SMD [1] and EG [2], are designed to learn incrementally to achieve fast convergence, but they usually still require several passes (or epochs) through the training examples to converge to a satisfactory model. However, the real bottleneck of large-scale learning is I/O time: reading a large data set from disk to memory usually takes much longer than the CPU time spent in learning. Therefore, the study of on-line learning should focus more on single-pass performance. That is, after processing all available training examples once, the learned model should generalize as well as possible, so that used training examples can really be removed from memory to minimize disk I/O time. Single-pass learning is also interesting for natural learning, because it allows for continual learning from unlimited training examples under the constraint of limited storage, resembling a natural learner.

Previously, many authors, including [3] and [4], have established that given a sufficiently large set of training examples, 2SGD can potentially achieve generalization performance as good as the empirical optimum in a single pass through the training examples.
However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. Many attempts to approximate the Hessian have been made. For example, one may consider modifying L-BFGS [5] for on-line settings. L-BFGS relies on line search, but in an on-line setting we only have the loss surface induced by one training example, as opposed to all examples in the batch setting; a search direction obtained by line search on such a surface rarely leads to the empirical optimum. A review of similar attempts can be found in Bottou's tutorial [6], where he suggested that none is actually sufficient to achieve the theoretical single-pass performance in practice. This paper presents a new 2SGD method, called Periodic Step-size Adaptation (PSA). PSA approximates the Jacobian matrix of the mapping function and exploits a linear relation between the Jacobian and the Hessian to approximate the Hessian periodically. The per-iteration time complexity of PSA is linear in the number of nonzero dimensions of the data. We analyze the accuracy of the approximation and derive the asymptotic rate of convergence for PSA. Experimental results show that for a wide variety of models and tasks, PSA is always very close to the empirical optimum in a single pass, and that PSA runs much faster than state-of-the-art algorithms.

2 Aitken's Acceleration

Let w ∈ ℝ^d be a d-dimensional weight vector of a model. A machine learning problem can be formulated as a fixed-point iteration that solves the equation w = M(w), where M is a mapping M : ℝ^d → ℝ^d, until w* = M(w*). Assume that the mapping M is differentiable.
Then we can apply Aitken's acceleration, which attempts to extrapolate to the local optimum in one step, to accelerate the convergence of the mapping:

    w* = w^(t) + (I − J)^{-1} (M(w^(t)) − w^(t)),    (1)

where J := M′(w*) is the Jacobian of the mapping M at w*. When every eigenvalue λ := eig(J) ∈ (−1, 1), the mapping M is guaranteed to converge; that is, when t → ∞, w^(t) → w*.

It is usually difficult to compute J for even a simple machine learning model. To alleviate this issue, we can approximate J with estimates of its i-th eigenvalue λ_i by

    λ_i^(t) := (M(w^(t))_i − w_i^(t)) / (w_i^(t) − w_i^(t−1)),  ∀i,    (2)

and extrapolate at each dimension i by:

    w_i^(t+1) = w_i^(t) + (1 − λ_i^(t))^{-1} (M(w^(t))_i − w_i^(t)).    (3)

In practice, Aitken's acceleration alternates a step for preparing λ^(t) and a step for the extrapolation. That is, when t is an even number, M is used to obtain w^(t+1); otherwise, the extrapolation (3) is used.
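To make the alternating estimate-and-extrapolate scheme of (2)-(3) concrete, here is a minimal sketch (not from the paper; the affine toy mapping and the clipping of the estimates are illustrative assumptions):

```python
import numpy as np

# Componentwise Aitken acceleration, alternating an eigenvalue-estimation
# step (2) with an extrapolation step (3). The affine toy mapping and the
# clipping of the estimates are illustrative assumptions.
def aitken(M, w0, iters=20, eps=1e-12):
    w_prev, w = w0, M(w0)
    for t in range(iters):
        Mw = M(w)
        if t % 2 == 0:
            lam = (Mw - w) / (w - w_prev + eps)        # equation (2), per dimension
            lam = np.clip(lam, -0.95, 0.95)            # keep 1 - lam away from 0
            w_prev, w = w, w + (Mw - w) / (1.0 - lam)  # extrapolation (3)
        else:
            w_prev, w = w, Mw                          # plain fixed-point step
    return w

# Toy contraction M(w) = Aw + b; its fixed point solves (I - A) w* = b.
A = np.diag([0.9, 0.5, -0.3])
b = np.array([1.0, 2.0, 3.0])
M = lambda w: A @ w + b
w_star = np.linalg.solve(np.eye(3) - A, b)
w_hat = aitken(M, np.zeros(3))
print(np.allclose(w_hat, w_star, atol=1e-8))
```

For a diagonal affine mapping the componentwise estimates are exact, so a single extrapolation lands on the fixed point; for general mappings they only approximate the true eigenvalues.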
A benefit of the above approximation is that the cost of performing an extrapolation is O(d), linear in the dimension.

3 Periodic Step-Size Adaptation

When M is a gradient descent update rule, that is, M(w) ← w − ηg(w; D), where η is a scalar step size, D is the entire set of training examples, and g(w; D) is the gradient of a loss function to be minimized, Aitken's acceleration is equivalent to Newton's method, because

    J = M′(w) = I − ηH(w; D),    (4)

so (I − J)^{-1} = (1/η) H(w; D)^{-1} and M(w) − w = w − ηg(w; D) − w = −ηg(w; D), where H(w; D) = g′(w; D) is the Hessian matrix of the loss function, and the extrapolation given in (1) becomes

    w = w + (I − J)^{-1}(M(w) − w) = w − (1/η) H^{-1} ηg = w − H^{-1}g.

In this case, Aitken's acceleration enjoys the same local quadratic convergence as Newton's method. This can also be extended to a SGD update rule: w^(t+1) ← w^(t) − η ∙ g(w^(t); B^(t)), where the mini-batch B ⊆ D, |B| ≪ |D|, is a randomly selected small subset of D. A genuine on-line learner usually has |B| = 1. We consider a positive vector-valued step size η ∈ ℝ^d_+, where "∙" denotes the component-wise (Hadamard) product of two vectors.
Again, by exploiting (4), since

    eig(I − diag(η)H) = eig(M′) = eig(J) ≈ λ,

where λ is an estimated eigenvalue of J as given in (2), when H is a symmetric matrix, its eigenvalue is given by

    eig(J) = 1 − η_i eig(H) ⇒ eig(H) = (1 − eig(J)) / η_i.

Therefore, we can update the step size component-wise by

    eig(H^{-1}) = η_i / (1 − eig(J)) ≈ η_i / (1 − λ_i) ⇒ η_i^(t+1) ∝ η_i^(t) / (1 − λ_i^(t)).    (5)

Since the mapping M in SGD involves the gradient g(w^(t); B^(t)) of a randomly selected training example B^(t), M is itself a random variable. It is unlikely that we can obtain a reliable eigenvalue estimate at each single iteration. To increase the stationarity of the mapping, we take advantage of the law of large numbers and aggregate m consecutive SGD mappings into a new mapping

    M^m = M(M(. . . M(w) . . .))    (m applications of M),

which reduces the variance of the gradient estimation by 1/m compared to the plain SGD mapping M. The approximation is valid because w^(t+j), j = 0, . . . , m − 1 are approximately fixed when η is sufficiently small [7].

We can proceed to estimate the eigenvalues of M^m from w^(t), w^(t+m) and w^(t+2m) by applying (2) for each component i:

    λ̄_i^m = (w_i^(t+2m) − w_i^(t+m)) / (w_i^(t+m) − w_i^(t)).    (6)

We note that our aggregate mapping M^m is different from a mapping that takes m mini-batches as input in a single iteration. Their difference is similar to that between batch and stochastic gradient descent.
Aggregate mappings have m chances to adjust the search direction, while mappings that use m mini-batches together only have one.

With the estimated eigenvalues, we can present the complete update rule to adjust the step-size vector η. To ensure that the estimated values of eig(J) ∈ (−1, 1) and to ensure numerical stability, we introduce a positive constant θ < 1 as an upper bound of |λ̄_i^m|. Let u denote the constrained λ̄^m. Its components are given by

    u_i := sgn(λ̄_i^m) min(|λ̄_i^m|, θ),  ∀i.    (7)

Then we can update the step size every 2m iterations based on u by:

    η^(t+2m+1) = v ∙ η^(t+2m),    (8)

where v is a discount factor with components defined by

    v_i := (a + u_i) / (a + b + θ),  ∀i.    (9)

The discount factor is derived from (5) and the approximation 1/(1 − λ) ≈ 1 + λ when |λ| < 1, with a and b controlling the range to ensure numerical stability. Let q be the maximum value and p be the minimum value of v_i. We can obtain a and b by solving p ≤ v_i ≤ q for all i. Since −θ ≤ u_i ≤ θ, we have v_i = q when u_i = θ and v_i = p when u_i = −θ. Solving these equations yields:

    a = ((q + p) / (q − p)) θ  and  b = (2(1 − q) / (q − p)) θ.    (10)

For example, if we want to set q = 0.9999 and p = 0.99, then a and b will be 201θ and 0.0202θ, respectively.
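A small numeric sketch of the update (7)-(10), using the example range above; the λ̄ values fed in are hypothetical eigenvalue estimates, not computed from data:

```python
import numpy as np

# The periodic step-size discount of equations (7)-(10), checked on the
# example range q = 0.9999, p = 0.99, theta = 0.9. The lam_bar values
# below are hypothetical estimates standing in for equation (6).
q, p, theta = 0.9999, 0.99, 0.9
a = (q + p) / (q - p) * theta            # equation (10): a is about 201*theta
b = 2 * (1 - q) / (q - p) * theta        # equation (10): b is about 0.0202*theta

lam_bar = np.array([0.9, 0.0, -0.9])     # hypothetical eigenvalue estimates
u = np.sign(lam_bar) * np.minimum(np.abs(lam_bar), theta)   # equation (7)
v = (a + u) / (a + b + theta)                               # equation (9)

eta = 0.1 * np.ones(3)
eta = v * eta                            # equation (8): discount the step size
# v reaches its extremes q and p at u = +theta and u = -theta respectively
print(v)
```

By construction v stays inside [p, q], so every periodic update shrinks the step size by a bounded factor.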
Setting 0 < p < q ≤ 1 ensures that the step size is decreasing and approaches zero, so that SGD can be guaranteed to converge [7].

Algorithm 1 shows the PSA algorithm. In a nutshell, PSA applies SGD with a fixed step size and periodically updates the step size by approximating the Jacobian of the aggregated mapping. The complexity per iteration is O(d/m), because the cost of the eigenvalue estimation given in (6) is 2d and it is required only once every 2m iterations. That is, PSA updates η after learning from 2m ⋅ |B| examples.

Algorithm 1 The PSA Algorithm
 1: Given: m, q, p, θ < 1, and η^(0)
 2: Initialize w^(0) and η^(0); t ← 0; a ← ((q + p)/(q − p))θ and b ← (2(1 − q)/(q − p))θ    ⊳ Equation (10)
 3: repeat
 4:   Choose a small batch B^(t) uniformly at random from the set of training examples D
 5:   update w^(t+1) ← w^(t) − η^(t) ∙ g(w^(t); B^(t))    ⊳ SGD update
 6:   if (t + 1) mod 2m = 0 then
 7:     update λ̄_i ← (w_i^(t+2m) − w_i^(t+m)) / (w_i^(t+m) − w_i^(t)) for the current period    ⊳ Equation (6)
 8:     For all i, update u_i ← sgn(λ̄_i) min(|λ̄_i|, θ)    ⊳ Equation (7)
 9:     For all i, update v_i ← (a + u_i) / (a + b + θ)    ⊳ Equation (9)
10:     update η^(t+1) ← v ∙ η^(t)    ⊳ Equation (8)
11:   else
12:     η^(t+1) ← η^(t)
13:   end if
14:   t ← t + 1
15: until Convergence

4 Analysis of PSA

We analyze the accuracy of λ̂_j^(t) as an eigenvalue estimate as follows. Let J = QΛQ^{-1} be the eigendecomposition of J, and let u_i be the column vectors of Q and v_i^T the row vectors of Q^{-1}. Then we have

    J^t = Σ_{i=1}^d λ_i^t u_i v_i^T,

where λ_i is the i-th eigenvalue of J. By applying Taylor's expansion to M, we have

    w^(t) − w* ≈ J^t (w^(0) − w*)
    w^(t−1) − w* ≈ J^{t−1} (w^(0) − w*)
    ⇒ Δ^(t) = w^(t) − w^(t−1) ≈ J^t J^{-1} (J − I)(w^(0) − w*)
    ⇒ Δ^(t+1) = w^(t+1) − w^(t) ≈ Σ_{i=1}^d λ_i^{t+1} u_i v_i^T J^{-1} (J − I)(w^(0) − w*).

Now let

    φ_ji := e_j^T u_i v_i^T J^{-1} (J − I)(w^(0) − w*),

where e_j is the j-th column of I. Let Δ_j be the j-th element of Δ and λ_M be the largest eigenvalue of J such that φ_jM ≠ 0. Then

    λ̂_j ≡ Δ_j^(t+1) / Δ_j^(t) = (Σ_{i=1}^d λ_i^{t+1} φ_ji) / (Σ_{i=1}^d λ_i^t φ_ji)
        = (λ_M + Σ_{i≠M} (λ_i/λ_M)^t λ_i φ_ji/φ_jM) / (1 + Σ_{i≠M} (λ_i/λ_M)^t φ_ji/φ_jM).

Therefore, we can conclude that

∙ λ̂_j → λ_M as t → ∞, because ∀i, if φ_ji ≠ 0 then λ_i/λ_M ≤ 1; the limit λ_M is the j-th componentwise rate of convergence.

∙ λ̂_j = λ_j if J is a diagonal matrix. In this case, our approximation is exact. This happens when there are high percentages of missing data for a Bayesian network model trained by EM [8] and when features are uncorrelated for training a conditional random field model [9].

∙ λ̂_j is an average of the eigenvalues weighted by λ_i^t φ_ji. Since λ_M^t is usually the largest when i = M, we have λ̂_j ≈ λ_M.

When we have the least possible step size η^(t+1) = p ∙ η^(t) for all t mod 2m = 0 in PSA, the expectation of w^(T) obtained by PSA can be shown to be:

    E(w^(T)) = w* + Π_{t=1}^T (I − η^(0) p^{⌊t/2m⌋} H(w*; D)) (w^(0) − w*) = w* + S^(T) (w^(0) − w*).

The rate of convergence is governed by the largest eigenvalue of S^(T). We now derive a bound of this eigenvalue.

Theorem 1 Let λ_h be the least eigenvalue of H(w*; D). The asymptotic rate of convergence of PSA is bounded by

    eig(S^(T)) ≤ exp(−η^(0) λ_h ⋅ 2m / (1 − p)).

Proof We can show that

    eig(S^(T)) = Π_{t=1}^T (1 − η^(0) p^{⌊t/2m⌋} λ_h) ≤ exp(−η^(0) λ_h Σ_{t=1}^T p^{⌊t/2m⌋}),

because for any 0 ≤ x_t < 1, 1 − x_t ≤ e^{−x_t}, so

    0 ≤ Π_{t=1}^T (1 − x_t) ≤ Π_{t=1}^T e^{−x_t} = e^{−Σ_{t=1}^T x_t}.

Now, since

    Σ_{t=1}^T p^{⌊t/2m⌋} ≈ 2m Σ_{k=0}^{⌊T/2m⌋} p^k → 2m / (1 − p)  when T → ∞,

we have

    eig(S^(T)) ≤ exp(−η^(0) λ_h Σ_{t=1}^T p^{⌊t/2m⌋}) → exp(−η^(0) λ_h ⋅ 2m / (1 − p))  when T → ∞.  □

Though this analysis suggests that for rapid convergence to w* we should assign p ≈ 1 with a large m and η^(0), it is based on a worst-case scenario and is thus insufficient as a practical guideline for parameter assignment.
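To tie the pieces together, a compact single-pass PSA loop in the spirit of Algorithm 1 on a least-squares problem might look as follows; the data, loss, step size, and hyper-parameter values are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

# Single-pass PSA (in the spirit of Algorithm 1) on least squares, |B| = 1.
# Data, loss, and hyper-parameter values are illustrative assumptions.
rng = np.random.default_rng(1)
n, d = 4000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)               # noiseless linear target

m, q, p, theta = 10, 0.9999, 0.99, 0.9
a = (q + p) / (q - p) * theta            # equation (10)
b = 2 * (1 - q) / (q - p) * theta
eta = 0.05 * np.ones(d)                  # vector-valued step size
w = np.zeros(d)
w_start, w_mid = w.copy(), None          # snapshots w^(t) and w^(t+m)

for t in range(n):                       # a single pass over D
    g = (X[t] @ w - y[t]) * X[t]         # squared-loss gradient on one example
    w = w - eta * g                      # SGD step
    if (t + 1) % (2 * m) == m:
        w_mid = w.copy()
    elif (t + 1) % (2 * m) == 0:
        denom = w_mid - w_start
        safe = np.where(np.abs(denom) > 1e-12, denom, 1.0)
        lam = (w - w_mid) / safe                           # equation (6)
        u = np.sign(lam) * np.minimum(np.abs(lam), theta)  # equation (7)
        eta = eta * (a + u) / (a + b + theta)              # equations (8), (9)
        w_start = w.copy()               # start the next period

mse = np.mean((X @ w - y) ** 2)
print(mse)                               # far below the initial loss
```

The snapshots w_start, w_mid, w realize the three points w^(t), w^(t+m), w^(t+2m) of equation (6), and the step size can only shrink because v ≤ q < 1.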
In practice, we fix (q, p, θ) = (0.9999, 0.99, 0.9) and tune m as follows. When the training set size |D| ≫ 2000, setting m on the order of 0.5|D|/1000 is usually sufficient. This setting implies that the step size will be adjusted once per |D|/1000 examples. In fact, as long as m is of the same order, PSA performs similarly. Consider the following three settings: (m, q, p) = (10, 0.9999, 0.99), (100, 0.999, 0.9) or (1, 0.99999, 0.999). They all yield nearly identical single-pass F-scores for the BaseNP task (see Section 5). The first setting was used in this paper. To see why this is the case, consider the discount factor v_i (see (8) and (9)), which is confined within the interval (p, q). Assume that v_i is selected uniformly at random; then the mean of v_i is 0.995 when (q, p) = (0.9999, 0.99), and η_i will be decreased by a factor of 0.995 on average in each PSA update. When m = 10, PSA will update η_i once per 20 examples. After learning from 200 examples, PSA will have decreased η_i 10 times by a combined factor of 0.9511. Similarly, we can obtain that the factors for the other two settings are 0.95 and 0.9512, respectively, which are nearly identical.

5 Experimental Results

Table 1 shows the tasks chosen for our comparison. The tasks for CRF have been used in competitions and the performance was measured by F-score. The Weight column for CRF tasks reports the number of features provided by CRF++. Target gives the empirical optimal performance achieved by batch learners. If PSA accurately approximates 2SGD, then its single-pass performance should be very close to Target.
The target F-score for BioNLP/NLPBA is not >85% as reported in [1], because that figure was due to a bug that included true labels as a feature.¹

Table 1: Tasks for the experiments.

Task           Model  Training  Test     Tag/Class  Weight    Target
Base NP        CRF    8936      2012     3          1015662   94.0% [10]
Chunking       CRF    8936      2012     23         7448606   93.6% [11]
BioNLP/NLPBA   CRF    18546     3856     11         5977675   70.0% [12]
BioCreative 2  CRF    15000     5000     3          10242972  86.5% [13]
LS FD          LSVM   2734900   2734900  2          900       3.26%
LS OCR         LSVM   1750000   1750000  2          1156      23.94%
MNIST [14]     CNN    60000     10000    10         134066    0.99%

5.1 Conditional Random Field

We compared PSA with plain SGD and SMD [1] to evaluate PSA's performance for training conditional random fields (CRF). We implemented PSA by replacing the L-BFGS optimizer in CRF++ [11]. For SMD, we used the implementation available in the public domain.² Our SGD implementation for CRF is from Bottou.³ All of the above implementations are revisions of CRF++. Finally, we ran the original CRF++ with default settings to obtain the performance results of L-BFGS. We simply used the original parameter settings for SGD and SMD as given in the literature. For PSA, we used θ = 0.9, (q, p) = (0.9999, 0.99), m = 10, and η_i^(0) = 0.1, ∀i. The batch size is one for all tasks. These parameters were determined by using a small subset of CoNLL 2000 BaseNP and we simply used them for all tasks. All of the experiments reported here for CRF were run on an Intel Q6600 Fedora 8 i686 PC with 4G RAM.

Table 2 compares the SGD variants in terms of execution time and the F-scores achieved after processing the training examples for a single pass. Since the loss function in CRF training is convex, the convergence results of L-BFGS can be considered the empirical minimum.
The results show that the single-pass F-scores achieved by PSA are about as good as the empirical minima, suggesting that PSA effectively approximates the Hessian in CRF training.

Fig. 1 shows the learning curves in terms of CPU time. Though, as expected, plain SGD is the fastest, it is remarkable that PSA is faster than SMD for all tasks. SMD is supposed to have an edge here because the mini-batch size for SMD was set to 6 or 8, as specified in [1], while PSA used one for all tasks. PSA is still faster than SMD partly because PSA can take advantage of the sparsity trick, as plain SGD does [15].

5.2 Linear SVM

We also evaluated PSA's single-pass performance for training linear SVM. It is straightforward to apply PSA as a primal optimizer for linear SVM. We used two very large data sets: FD (face detection) and OCR (see Table 1), from the Pascal large-scale learning challenge in 2008, and compared the performance of PSA with state-of-the-art linear SVM solvers: Liblinear 1.33 [16], the winner of the challenge, and SvmSgd, from Bottou's SGD web site.
They have been shown to outperform many well-known linear SVM solvers, such as SVM-perf [17] and Pegasos [15].

¹Thanks to Shing-Kit Chan of the Chinese University of Hong Kong for pointing that out.
²Available under LGPL from the following URL: http://sml.nicta.com.au/code/crfsmd/.
³http://leon.bottou.org/projects/sgd.

Table 2: CPU time in seconds and F-scores achieved after a single pass of CRF training.

                Base NP          Chunking          BioNLP/NLPBA       BioCreative 2
Method (pass)   time    F-score  time     F-score  time      F-score  time     F-score
SGD (1)         1.15    92.42    13.04    92.26    12.23     66.37    3.18     34.33
SMD (1)         41.50   91.81    350.00   91.89    522.00    66.53    497.71   69.04
PSA (1)         16.30   93.31    160.00   93.16    206.00    69.41    191.61   80.79
L-BFGS (batch)  221.17  93.91    8694.40  93.78    20130.00  70.30    1601.50  86.82

[Figure 1: Comparison of CPU time; horizontal lines indicate target F-scores. Four panels (BaseNP, Chunking, NLPBA04, BioCreative 2 GM Task) plot F-score against time in seconds for PSA, SMD, SGD and L-BFGS.]

We selected L2-regularized logistic regression as the loss function for PSA and Liblinear because it is twice differentiable. The weight C of the margin error term was set to one. We kept SvmSgd intact.
The experiment was run on an Open-SUSE Linux machine with an Intel Xeon E7320 CPU (2.13GHz) and 64GB RAM. Table 3 shows the results. Again, PSA achieves the best single-pass accuracy for both tasks. Its test accuracies are very close to those of converged Liblinear, and PSA takes much less time than the other two solvers. PSA (1) is faster than SvmSgd (1) because SvmSgd uses the sparsity trick [15], which speeds up training for sparse data but otherwise may slow it down; both data sets we used turn out to be dense, i.e., with no zero features. We implemented the sparsity trick in PSA for CRF only, not for SVM and CNN.

Table 3: Test accuracy rates and elapsed CPU time in seconds by various linear SVM solvers.

                       LS FD               LS OCR
Method (pass)          accuracy  time      accuracy  time
Liblinear (converged)  96.74     4648.49   76.06     4454.42
Liblinear (1)          91.43     290.58    74.33     398.00
SvmSgd (20)            93.78     1135.67   -         -
SvmSgd (10)            93.77     567.68    73.71     473.35
SvmSgd (1)             93.60     56.78     73.76     46.96
PSA (1)                95.10     30.65     75.68     25.33

The parameter settings for PSA are basically the same as those for CRF, but with a larger period: m = 1250 for FD and m = 500 for OCR. For FD, the worst accuracy by PSA is 94.66% with m between 250 and 2000. For OCR, the worst is 75.20% with m between 100 and 1000, suggesting that PSA is not very sensitive to parameter settings.

5.3 Convolutional Neural Network

Approximating the Hessian is particularly challenging when the loss function is non-convex. We tested PSA in such a setting by applying it to train a large convolutional neural network for the original 10-class MNIST task (see Table 1). We tried to duplicate in C++ the implementation of LeNet described in [18]. Our implementation, referred to as "LeNet-S", is a simplified variant of LeNet-5.
The differences include that the sub-sampling layers in LeNet-S pick only the upper-left value from each 2 × 2 area and abandon the other three. LeNet-S used more maps (50 vs. 16) in the third layer and fewer nodes (120 vs. 100) in the fifth layer, due to the difference in the previous sub-sampling layer. Finally, we did not implement the Gaussian connections in the last layer. We trained LeNet-S by plain SGD and by PSA. The initial η for SGD was 0.7, decreased by 3 percent per pass. For PSA, we used θ = 0.9, (q, p) = (0.99999, 0.999), m = 10, η_i^(0) = 0.5, ∀i, and a mini-batch size of one for all tasks. We also adopted a trick given in [19], which advises that step sizes in the lower layers should be larger than in the higher layers: following it, the initial step sizes for the first and the third layers were 5 and √2.5 times as large as those for the other layers, respectively. The experiments were run on an Intel Q6600 Fedora 8 i686 PC with 4G RAM.

Table 4 shows the results. To obtain the empirical optimal error rate of our LeNet-S model, we ran plain SGD with sufficient passes and obtained a 0.99% error rate at convergence, slightly higher than LeNet-5's 0.95% [18]. The single-pass performance of PSA with the layer trick is within one percentage point of the target. Starting from an initial weight closer to the optimum helped improve PSA's performance further: we ran SGD for 100 passes with 10K randomly selected training examples and then re-started training with PSA using the remaining 50K training examples for a single pass. Though PSA did achieve a better error rate this way, the approach is infeasible in practice because running SGD for 100 passes took 4492 seconds. Finally, though not directly comparable, we also report the performance of TONGA given in [20] as a reference.
TONGA is a 2SGD method based on natural gradient.

Table 4: CPU time in seconds and percentage test error rates for various neural network trainers.

Method (pass)  time      error   Method (pass)            time    error
SGD (1)        266.77    2.36    PSA w/o layer trick (1)  311.95  2.31
SGD (140)      37336.20  0.99    PSA w/ layer trick (1)   311.00  1.97
TONGA (n/a)    500.00    2.00    PSA re-start (1)         253.72  1.90

6 Conclusions

It has been shown that given a sufficiently large training set, a single pass of 2SGD generalizes as well as the empirical optimum. Our results show that PSA provides a practical solution that accomplishes the near-optimal performance of 2SGD predicted theoretically, for a variety of large-scale models and tasks, with a reasonably low cost per iteration compared to competing 2SGD methods. The benefit of 2SGD with PSA over plain SGD becomes clearer as the scale of the tasks grows. For non-convex neural network tasks, the curvature of the error surface is so complex that eigenvalue approximation methods like PSA remain very challenging there. A complete version of this paper will appear as [21]. Source code for PSA is available at http://aiia.iis.sinica.edu.tw.

References

[1] S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of the 23rd International Conference on Machine Learning (ICML'06), Pittsburgh, PA, USA, June 2006.

[2] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775-1822, August 2008.

[3] Noboru Murata and Shun-Ichi Amari. Statistical analysis of learning dynamics.
Signal Processing, 74(1):3-28, April 1999.

[4] Léon Bottou and Yann LeCun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2):137-151, 2005.

[5] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 1999.

[6] Léon Bottou. The tradeoffs of large-scale learning. Tutorial, the 21st Annual Conference on Neural Information Processing Systems (NIPS 2007), Vancouver, BC, Canada, December 2007. http://leon.bottou.org/talks/largescale.

[7] Albert Benveniste, Michel Metivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, 1990.

[8] Chun-Nan Hsu, Han-Shen Huang, and Bo-Hou Yang. Global and componentwise extrapolation for accelerating data mining from large incomplete data sets with the EM algorithm. In Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM'06), pages 265-274, Hong Kong, China, December 2006.

[9] Han-Shen Huang, Bo-Hou Yang, Yu-Ming Chang, and Chun-Nan Hsu. Global and componentwise extrapolations for accelerating training of Bayesian networks and conditional random fields. Data Mining and Knowledge Discovery, 19(1):58-91, 2009.

[10] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology, the North American Chapter of the Association for Computational Linguistics (NAACL'03), pages 213-220, 2003.

[11] Taku Kudo. CRF++: Yet another CRF toolkit, 2006. Available under LGPL from the following URL: http://crfpp.sourceforge.net/.

[12] Burr Settles.
Biomedical named entity recognition using conditional random fields and novel feature sets. In Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), pages 104-107, 2004.

[13] Cheng-Ju Kuo, Yu-Ming Chang, Han-Shen Huang, Kuan-Ting Lin, Bo-Hou Yang, Yu-Shi Lin, Chun-Nan Hsu, and I-Fang Chung. Rich feature set, unification of bidirectional parsing and dictionary filtering for high f-score gene mention tagging. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, pages 105-107, 2007.

[14] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/.

[15] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In ICML'07: Proceedings of the 24th International Conference on Machine Learning, pages 807-814, New York, NY, USA, 2007. ACM Press.

[16] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[17] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), pages 217-226, New York, NY, USA, 2006. ACM.

[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[19] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In G. Orr and K. Müller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.

[20] Nicolas LeRoux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20 (NIPS 2007), Cambridge, MA, USA, 2008.
MIT Press.

[21] Chun-Nan Hsu, Yu-Ming Chang, Han-Shen Huang, and Yuh-Jye Lee. Periodic step-size adaptation in second-order gradient descent for single-pass on-line structured learning. To appear in Machine Learning, Special Issue on Structured Prediction. DOI: 10.1007/s10994-009-5142-6, 2009.", "award": [], "sourceid": 1160, "authors": [{"given_name": "Chun-nan", "family_name": "Hsu", "institution": null}, {"given_name": "Yu-ming", "family_name": "Chang", "institution": null}, {"given_name": "Hanshen", "family_name": "Huang", "institution": null}, {"given_name": "Yuh-jye", "family_name": "Lee", "institution": null}]}