{"title": "Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy", "book": "Advances in Neural Information Processing Systems", "page_first": 4062, "page_last": 4070, "abstract": "We consider empirical risk minimization for large-scale datasets. We introduce Ada Newton as an adaptive algorithm that uses Newton's method with adaptive sample sizes. The main idea of Ada Newton is to increase the size of the training set by a factor larger than one in a way that the minimization variable for the current training set is in the local neighborhood of the optimal argument of the next training set. This allows to exploit the quadratic convergence property of Newton's method and reach the statistical accuracy of each training set with only one iteration of Newton's method. We show theoretically that we can iteratively increase the sample size while applying single Newton iterations without line search and staying within the statistical accuracy of the regularized empirical risk. In particular, we can double the size of the training set in each iteration when the number of samples is sufficiently large. Numerical experiments on various datasets confirm the possibility of increasing the sample size by factor 2 at each iteration which implies that Ada Newton achieves the statistical accuracy of the full training set with about two passes over the dataset.", "full_text": "Adaptive Newton Method for Empirical Risk\n\nMinimization to Statistical Accuracy\n\nAryan Mokhtari?\n\nUniversity of Pennsylvania\naryanm@seas.upenn.edu\n\nHadi Daneshmand?\n\nETH Zurich, Switzerland\n\nhadi.daneshmand@inf.ethz.ch\n\nAurelien Lucchi\n\nETH Zurich, Switzerland\naurelien.lucchi@inf.ethz.ch\n\nThomas Hofmann\n\nETH Zurich, Switzerland\n\nthomas.hofmann@inf.ethz.ch\n\nAlejandro Ribeiro\n\nUniversity of Pennsylvania\naribeiro@seas.upenn.edu\n\nAbstract\n\nWe consider empirical risk minimization for large-scale datasets. 
We introduce Ada Newton as an adaptive algorithm that uses Newton's method with adaptive sample sizes. The main idea of Ada Newton is to increase the size of the training set by a factor larger than one in a way that the minimization variable for the current training set is in the local neighborhood of the optimal argument of the next training set. This allows us to exploit the quadratic convergence property of Newton's method and reach the statistical accuracy of each training set with only one iteration of Newton's method. We show theoretically that we can iteratively increase the sample size while applying single Newton iterations without line search and staying within the statistical accuracy of the regularized empirical risk. In particular, we can double the size of the training set in each iteration when the number of samples is sufficiently large. Numerical experiments on various datasets confirm the possibility of increasing the sample size by a factor of 2 at each iteration, which implies that Ada Newton achieves the statistical accuracy of the full training set with about two passes over the dataset.

1 Introduction

A hallmark of empirical risk minimization (ERM) on large datasets is that evaluating descent directions requires a complete pass over the dataset. Since this is undesirable due to the large number of training samples, stochastic optimization algorithms with descent directions estimated from a subset of samples are the method of choice. 
First order stochastic optimization has a long history [19, 17], but the last decade has seen fundamental progress in developing alternatives with faster convergence. A partial list of this consequential literature includes Nesterov acceleration [16, 2], stochastic averaging gradient [20, 6], variance reduction [10, 26], and dual coordinate methods [23, 24].

When it comes to stochastic second order methods, the first challenge is that while evaluation of Hessians is as costly as evaluation of gradients, the stochastic estimation of Hessians has proven more challenging. This difficulty is addressed by incremental computations in [9] and subsampling in [7], or circumvented altogether in stochastic quasi-Newton methods [21, 12, 13, 11, 14]. Despite this incipient progress it is nonetheless fair to say that the striking success in developing stochastic first order methods is not matched by equal success in the development of stochastic second order methods. This is because even if the problem of estimating a Hessian is solved there are still four challenges left in the implementation of Newton-like methods in ERM:

(i) Global convergence of Newton's method requires implementation of a line search subroutine, and line searches in ERM require a complete pass over the dataset.

(ii) The quadratic convergence advantage of Newton's method manifests close to the optimal solution, but there is no point in solving ERM problems beyond their statistical accuracy.

(iii) Newton's method works for strongly convex functions, but loss functions are not strongly convex for many ERM problems of practical importance.

(iv) Newton's method requires inversion of Hessians, which is costly in large dimensional ERM.

*The first two authors have contributed equally to this work.

Because stochastic Newton-like methods can't use line searches [cf. (i)], must work on problems that may not be strongly convex [cf. 
(iii)], and never operate very close to the optimal solution [cf. (ii)], they never experience quadratic convergence. They do improve convergence constants and, if efforts are taken to mitigate the cost of inverting Hessians [cf. (iv)] as in [21, 12, 7, 18], they result in faster convergence. But since they still converge at linear rates they do not enjoy the foremost benefits of Newton's method.

In this paper we attempt to circumvent (i)-(iv) with the Ada Newton algorithm, which combines the use of Newton iterations with adaptive sample sizes [5]. Say the total number of available samples is N, consider subsets of n ≤ N samples, and suppose the statistical accuracy of the ERM associated with n samples is Vn (Section 2). In Ada Newton we add a quadratic regularization term of order Vn to the empirical risk – so that the regularized risk also has statistical accuracy Vn – and assume that for a certain initial sample size m0, the problem has been solved to its statistical accuracy Vm0. The sample size is then increased by a factor α > 1 to n = αm0. We proceed to perform a single Newton iteration with unit stepsize and prove that the result of this update solves this extended ERM problem to its statistical accuracy (Section 3). This permits a second increase of the sample size by a factor α and a second Newton iteration that is likewise guaranteed to solve the problem to its statistical accuracy. Overall, this permits minimizing the empirical risk in α/(α − 1) passes over the dataset and inverting log_α N Hessians. Our theoretical results provide a characterization of the values of α that are admissible with respect to different problem parameters (Theorem 1). In particular, we show that asymptotically in the number of samples n, and with proper parameter selection, we can set α = 2 (Proposition 2). 
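To make the cost accounting for α = 2 concrete, the pass and inversion counts follow directly from the two formulas above; the snippet below is only an arithmetic check, using the size N of the protein homology dataset from Section 5:

```python
import math

N = 145751                       # samples in the protein homology dataset (Section 5)
alpha = 2                        # sample-size growth factor
passes = alpha / (alpha - 1)     # data passes in the geometric phase: alpha/(alpha - 1)
hessians = math.log(N, alpha)    # Hessian inversions: log_alpha(N)
print(passes)                    # 2.0
print(round(hessians, 2))        # about 3.32 * log10(N), i.e. roughly 17 inversions
```
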
In such case we can optimize to within statistical accuracy in about 2 passes over the dataset and after inverting about 3.32 log10 N Hessians. Our numerical experiments verify that α = 2 is a valid factor for increasing the size of the training set at each iteration while performing a single Newton iteration for each value of the sample size.

2 Empirical risk minimization

We aim to solve ERM problems to their statistical accuracy. To state this problem formally, consider an argument w ∈ R^p, a random variable Z with realizations z, and a convex loss function f(w; z). We want to find an argument w* that minimizes the statistical average loss L(w) := E_Z[f(w, Z)],

w* := argmin_w L(w) = argmin_w E_Z[f(w, Z)].   (1)

The loss in (1) cannot be evaluated because the distribution of Z is unknown. We have, however, access to a training set T = {z1, . . . , zN} containing N independent samples z1, . . . , zN that we can use to estimate L(w). We therefore consider a subset Sn ⊆ T and settle for minimization of the empirical risk Ln(w) := (1/n) Σ_{k=1}^{n} f(w, zk),

w†n := argmin_w Ln(w) = argmin_w (1/n) Σ_{k=1}^{n} f(w, zk),   (2)

where, without loss of generality, we have assumed Sn = {z1, . . . , zn} contains the first n elements of T. The difference between the empirical risk in (2) and the statistical loss in (1) is a fundamental quantity in statistical learning. We assume here that there exists a constant Vn, which depends on the number of samples n, that upper bounds their difference for all w with high probability (w.h.p.),

sup_w |L(w) − Ln(w)| ≤ Vn,   w.h.p.   (3)

That the statement in (3) holds w.h.p. means that there exists a constant δ such that the inequality holds with probability at least 1 − δ. The constant Vn depends on δ but we keep that dependency implicit to simplify notation. 
For subsequent discussions, observe that bounds Vn of order Vn = O(1/√n) date back to the seminal work of Vapnik – see e.g., [25, Section 3.4]. Bounds of order Vn = O(1/n) have been derived more recently under stronger regularity conditions that are not uncommon in practice [1, 8, 3].

An important consequence of (3) is that there is no point in solving (2) to an accuracy higher than Vn. Indeed, if we find a variable wn for which Ln(wn) − Ln(w†n) ≤ Vn, finding a better approximation of w†n is moot because (3) implies that this is not necessarily a better approximation of the minimizer w* of the statistical loss. We say the variable wn solves the ERM problem in (2) to within its statistical accuracy. In particular, this implies that adding a regularization of order Vn to (2) yields a problem that is essentially equivalent. We can then consider a quadratic regularizer of the form (cVn/2)‖w‖² to define the regularized empirical risk Rn(w) := Ln(w) + (cVn/2)‖w‖² and the corresponding optimal argument

w*n := argmin_w Rn(w) = argmin_w Ln(w) + (cVn/2)‖w‖².   (4)

Since the regularization in (4) is of order Vn and (3) holds, the difference between Rn(w*n) and L(w*) is also of order Vn – this may not be as immediate as it seems; see [22]. Thus, we can say that a variable wn satisfying Rn(wn) − Rn(w*n) ≤ Vn solves the ERM problem to within its statistical accuracy. We accomplish this goal in this paper with the Ada Newton algorithm, which we introduce in the following section.

3 Ada Newton

To solve (4), suppose the problem has been solved to within its statistical accuracy for a set Sm ⊂ Sn with m = n/α samples, where α > 1. Therefore, we have found a variable wm for which Rm(wm) − Rm(w*m) ≤ Vm. Our goal is to update wm using the Newton step in a way that the updated variable wn estimates w*n with accuracy Vn. 
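For concreteness, the regularized risk in (4) is straightforward to evaluate; below is a minimal sketch for logistic loss (the loss used in the experiments of Section 5), assuming the fast regime Vn = 1/n:

```python
import numpy as np

def regularized_risk(w, X, y, c):
    # R_n(w) = L_n(w) + (c*V_n/2)*||w||^2 as in (4), with V_n = 1/n
    # X: n-by-p sample matrix, y: labels in {-1, +1}, c: regularization constant
    n = X.shape[0]
    Ln = np.mean(np.log1p(np.exp(-y * (X @ w))))  # empirical logistic loss L_n(w)
    return Ln + (c / (2.0 * n)) * np.dot(w, w)
```

At w = 0 the logistic loss equals log 2 regardless of the data, which gives a quick sanity check on an implementation.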
To do so, compute the gradient of the risk Rn evaluated at wm,

∇Rn(wm) = (1/n) Σ_{k=1}^{n} ∇f(wm, zk) + cVn wm,   (5)

as well as the Hessian Hn of Rn evaluated at wm,

Hn := ∇²Rn(wm) = (1/n) Σ_{k=1}^{n} ∇²f(wm, zk) + cVn I,   (6)

and update wm with the Newton step of the regularized risk Rn to compute

wn = wm − Hn^{-1} ∇Rn(wm).   (7)

Note that the stepsize of the Newton update in (7) is 1, which avoids line search algorithms that would require extra computation. The main contribution of this paper is to derive a condition that guarantees that wn solves Rn to within its statistical accuracy Vn. To do so, we first assume the following conditions are satisfied.

Assumption 1. The loss functions f(w, z) are convex with respect to w for all values of z. Moreover, their gradients ∇f(w, z) are Lipschitz continuous with constant M,

‖∇f(w, z) − ∇f(w', z)‖ ≤ M ‖w − w'‖  for all z.   (8)

Assumption 2. The loss functions f(w, z) are self-concordant with respect to w for all z.

Assumption 3. The difference between the gradients of the empirical loss Ln and the statistical average loss L is bounded by Vn^{1/2} for all w with high probability,

sup_w ‖∇L(w) − ∇Ln(w)‖ ≤ Vn^{1/2},   w.h.p.   (9)

The conditions in Assumption 1 imply that the average loss L(w) and the empirical loss Ln(w) are convex and their gradients are Lipschitz continuous with constant M. Thus, the empirical risk Rn(w) is strongly convex with constant cVn and its gradients ∇Rn(w) are Lipschitz continuous with parameter M + cVn. Likewise, the condition in Assumption 2 implies that the average loss L(w), the empirical loss Ln(w), and the empirical risk Rn(w) are also self-concordant. The condition in Assumption 3 says that the gradients of the empirical risk converge to their statistical average at a rate of order Vn^{1/2}. If the constant Vn in condition (3) is of order not faster than O(1/n), the condition in Assumption 3 holds if the gradients converge to their statistical average at a rate of order Vn^{1/2} = O(1/√n). This is a conservative rate for the law of large numbers.

Algorithm 1 Ada Newton
1: Parameters: Sample size increase constant α0 > 1 and backtracking constant 0 < β < 1.
2: Input: Initial sample size n = m0 and argument wn = wm0 with ‖∇Rn(wn)‖ < √(2c) Vn
3: while n ≤ N do {main loop}
4: Update argument and index: wm = wn and m = n. Reset factor α = α0.
5: repeat {sample size backtracking loop}
6: Increase sample size: n = min{αm, N}.
7: Compute gradient [cf. (5)]: ∇Rn(wm) = (1/n) Σ_{k=1}^{n} ∇f(wm, zk) + cVn wm
8: Compute Hessian [cf. (6)]: Hn = (1/n) Σ_{k=1}^{n} ∇²f(wm, zk) + cVn I
9: Newton update [cf. (7)]: wn = wm − Hn^{-1} ∇Rn(wm)
10: Compute gradient [cf. (5)]: ∇Rn(wn) = (1/n) Σ_{k=1}^{n} ∇f(wn, zk) + cVn wn
11: Backtrack sample size increase: α = βα.
12: until ‖∇Rn(wn)‖ < √(2c) Vn
13: end while

In the following theorem, given Assumptions 1-3, we state a condition that guarantees that the variable wn evaluated as in (7) solves Rn to within its statistical accuracy Vn.

Theorem 1. Consider the variable wm as a Vm-optimal solution of the risk Rm, i.e., a solution such that Rm(wm) − Rm(w*m) ≤ Vm. Let n = αm > m, consider the risk Rn associated with the sample set Sn ⊇ Sm, and suppose Assumptions 1-3 hold. If the sample size n is chosen such that

(2(M + cVm)Vm / (cVn))^{1/2} + 2(n − m)/(n c^{1/2}) + ((2 + √2)c^{1/2} + c‖w*‖)(Vm − Vn)/(cVn)^{1/2} ≤ 1/4   (10)

and

144 (Vm + (2(n − m)/n)(V_{n−m} + Vm) + 2(Vm − Vn) + (c(Vm − Vn)/2)‖w*‖²)² ≤ Vn   (11)

are satisfied, then the variable wn, which is the outcome of applying one Newton step to the variable wm as in (7), has sub-optimality error Vn with high probability, i.e.,

Rn(wn) − Rn(w*n) ≤ Vn,   w.h.p.   (12)

Proof. 
See Section 4.

Theorem 1 states conditions under which we can iteratively increase the sample size while applying single Newton iterations without line search and staying within the statistical accuracy of the regularized empirical risk. The constants in (10) and (11) are not easy to parse, but we can understand them qualitatively if we focus on large m. This results in a simpler condition that we state next.

Proposition 2. Consider a learning problem in which the statistical accuracy satisfies Vm ≤ αVn for n = αm and lim_{n→∞} Vn = 0. If the regularization constant c is chosen so that

(2αM/c)^{1/2} + 2(α − 1)/(α c^{1/2}) < 1/4,   (13)

then there exists a sample size m̃ such that (10) and (11) are satisfied for all m > m̃ and n = αm. In particular, if α = 2 we can satisfy (10) and (11) with c > 16(2√M + 1)².

Proof. That the condition in (11) is satisfied for all m > m̃ follows simply because the left hand side is of order Vm² and the right hand side is of order Vn. To show that the condition in (10) is satisfied for sufficiently large m, observe that the third summand in (10) is of order O((Vm − Vn)/Vn^{1/2}) and vanishes for large m. In the second summand of (10) we make n = αm to obtain the second summand in (13), and in the first summand we replace the ratio Vm/Vn by its bound α to obtain the first summand of (13). To conclude the proof just observe that the inequality in (13) is strict.

The condition Vm ≤ αVn is satisfied if Vn = 1/n and is also satisfied if Vn = 1/√n because √α < α. 
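The threshold c > 16(2√M + 1)² can be checked directly against (13); the snippet below is only a numerical sanity check with an illustrative value M = 1 (not a value taken from the paper):

```python
import math

def cond13_lhs(alpha, M, c):
    # left-hand side of condition (13): (2*alpha*M/c)^(1/2) + 2*(alpha-1)/(alpha*sqrt(c))
    return math.sqrt(2 * alpha * M / c) + 2 * (alpha - 1) / (alpha * math.sqrt(c))

M = 1.0
c_star = 16 * (2 * math.sqrt(M) + 1) ** 2  # threshold from Proposition 2 for alpha = 2
print(cond13_lhs(2, M, c_star))            # equals 1/4 exactly at the threshold
print(cond13_lhs(2, M, 1.01 * c_star))     # strictly below 1/4 once c exceeds it
```
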
This means that for most ERM problems we can progress geometrically over the sample size and arrive at a solution wN that solves the ERM problem RN to its statistical accuracy VN, as long as (13) is satisfied.

The result in Theorem 1 motivates the definition of the Ada Newton algorithm that we summarize in Algorithm 1. The core of the algorithm is in steps 6-9. Step 6 implements an increase in the sample size by a factor α and steps 7-9 implement the Newton iteration in (5)-(7). The required input to the algorithm is an initial sample size m0 and a variable wm0 that is known to solve the ERM problem with accuracy Vm0. Observe that this initial iterate does not have to be computed with Newton iterations. The initial problem to be solved contains a moderate number of samples m0, has a mild condition number because it is regularized with constant cVm0, and is to be solved to a moderate accuracy Vm0 – recall that Vm0 is of order Vm0 = O(1/m0) or of order Vm0 = O(1/√m0) depending on regularity assumptions. Stochastic first order methods excel at solving problems with a moderate number of samples m0 and a moderate condition number to moderate accuracy.

We remark that the conditions in Theorem 1 and Proposition 2 are conceptual, but the constants involved are unknown in practice. In particular, this means that the allowed values of the factor α that controls the growth of the sample size are unknown a priori. We solve this problem in Algorithm 1 by backtracking the increase in the sample size until we can guarantee that wn minimizes the empirical risk Rn(wn) to within its statistical accuracy. This backtracking of the sample size is implemented in Step 11 and the optimality condition of wn is checked in Step 12. 
The condition in Step 12 is on the gradient norm which, because Rn is strongly convex, can be used to bound the suboptimality Rn(wn) − Rn(w*n) as

Rn(wn) − Rn(w*n) ≤ (1/(2cVn)) ‖∇Rn(wn)‖².   (14)

Observe that checking this condition requires an extra gradient computation undertaken in Step 10. That computation can be reused in the computation of the gradient in Step 7 once we exit the backtracking loop. We emphasize that when the condition in (13) is satisfied, there exists a sufficiently large m for which the conditions in Theorem 1 are satisfied for n = αm. This means that the backtracking condition in Step 12 is satisfied after one iteration and that, eventually, Ada Newton progresses by increasing the sample size by a factor α. This means that Algorithm 1 can be thought of as having a damped phase, in which the sample size increases by a factor smaller than α0, and a geometric phase, in which the sample size grows by a factor α0 in all subsequent iterations. The computational cost of this geometric phase is not more than α/(α − 1) passes over the dataset and requires inverting not more than log_α N Hessians. If c > 16(2√M + 1)², we can make α = 2 for optimizing to within statistical accuracy in about 2 passes over the dataset and after inverting about 3.32 log10 N Hessians.

4 Convergence Analysis

In this section we study the proof of Theorem 1. The main idea of the Ada Newton algorithm is to introduce a policy for increasing the size of the training set from m to n in a way that the current variable wm is in the Newton quadratic convergence phase for the next regularized empirical risk Rn. In the following proposition, we characterize the condition required to guarantee staying in the local neighborhood of Newton's method.

Proposition 3. Consider the sets Sm and Sn as subsets of the training set T such that Sm ⊂ Sn ⊂ T. 
We assume that the number of samples in the sets Sm and Sn are m and n, respectively. Further, define wm as a Vm-optimal solution of the risk Rm, i.e., Rm(wm) − Rm(w*m) ≤ Vm. In addition, define λn(w) := (∇Rn(w)^T ∇²Rn(w)^{-1} ∇Rn(w))^{1/2} as the Newton decrement of variable w associated with the risk Rn. If Assumptions 1-3 hold, then Newton's method at point wm is in the quadratic convergence phase for the objective function Rn, i.e., λn(wm) < 1/4, if we have

(2(M + cVm)Vm / (cVn))^{1/2} + [(2(n − m)/n)Vn^{1/2} + (√(2c) + 2√c + c‖w*‖)(Vm − Vn)] / (cVn)^{1/2} ≤ 1/4,   w.h.p.   (15)

Proof. See Section 7.1 in the supplementary material.

From the analysis of Newton's method we know that if the Newton decrement λn(w) is smaller than 1/4, the variable w is in the local neighborhood of Newton's method; see e.g., Chapter 9 of [4]. From the result in Proposition 3, we obtain a sufficient condition to guarantee that λn(wm) < 1/4, which implies that wm, which is a Vm-optimal solution for the regularized empirical loss Rm, i.e., Rm(wm) − Rm(w*m) ≤ Vm, is in the local neighborhood of the optimal argument of Rn in which Newton's method converges quadratically.

Unfortunately, the quadratic convergence of Newton's method for self-concordant functions is in terms of the Newton decrement λn(w), and it does not necessarily guarantee quadratic convergence in terms of the objective function error. To be more precise, we can show that λn(wn) ≤ λn(wm)²; however, we cannot immediately conclude that the quadratic convergence of the Newton decrement implies that Rn(wn) − Rn(w*n) is bounded by a constant times (Rn(wm) − Rn(w*n))². In the following proposition we characterize an upper bound for the error Rn(wn) − Rn(w*n) in terms of the squared error (Rn(wm) − Rn(w*n))² using the quadratic convergence property of the Newton decrement.

Proposition 4. 
Consider wm as a variable that is in the local neighborhood of the optimal argument of the risk Rn where Newton's method has a quadratic convergence rate, i.e., λn(wm) ≤ 1/4. Recall the definition of the variable wn in (7) as the variable updated with the Newton step. If Assumptions 1 and 2 hold, then the difference Rn(wn) − Rn(w*n) is upper bounded by

Rn(wn) − Rn(w*n) ≤ 144 (Rn(wm) − Rn(w*n))².   (16)

Proof. See Section 7.2 in the supplementary material.

The result in Proposition 4 provides an upper bound for the sub-optimality Rn(wn) − Rn(w*n) in terms of the sub-optimality of the variable wm for the risk Rn, i.e., Rn(wm) − Rn(w*n). Recall that we know that wm is within the statistical accuracy of Rm, i.e., Rm(wm) − Rm(w*m) ≤ Vm, and we aim to show that the updated variable wn stays within the statistical accuracy of Rn, i.e., Rn(wn) − Rn(w*n) ≤ Vn. This can be done by showing that the upper bound for Rn(wn) − Rn(w*n) in (16) is smaller than Vn. We proceed to derive an upper bound for the sub-optimality Rn(wm) − Rn(w*n) in the following proposition.

Proposition 5. Consider the sets Sm and Sn as subsets of the training set T such that Sm ⊂ Sn ⊂ T. We assume that the number of samples in the sets Sm and Sn are m and n, respectively. Further, define wm as a Vm-optimal solution of the risk Rm, i.e., Rm(wm) − Rm(w*m) ≤ Vm. If Assumptions 1-3 hold, then the empirical risk error Rn(wm) − Rn(w*n) of the variable wm corresponding to the set Sn is bounded above by

Rn(wm) − Rn(w*n) ≤ Vm + (2(n − m)/n)(V_{n−m} + Vm) + 2(Vm − Vn) + (c(Vm − Vn)/2)‖w*‖²,   w.h.p.   (17)

Proof. 
See Section 7.3 in the supplementary material.

The result in Proposition 5 characterizes the sub-optimality of the variable wm, which is a Vm sub-optimal solution for the risk Rm, with respect to the empirical risk Rn associated with the set Sn.

The results in Propositions 3, 4, and 5 lead to the result in Theorem 1. To be more precise, from the result in Proposition 3 we obtain that the condition in (10) implies that wm is in the local neighborhood of the optimal argument of Rn and λn(wm) ≤ 1/4. Hence, the hypothesis of Proposition 4 is satisfied and we have Rn(wn) − Rn(w*n) ≤ 144(Rn(wm) − Rn(w*n))². This result paired with the result in Proposition 5 shows that if the condition in (11) is satisfied we can conclude that Rn(wn) − Rn(w*n) ≤ Vn, which completes the proof of Theorem 1.

Figure 1: Comparison of SGD, SAGA, Newton, and Ada Newton in terms of number of effective passes over dataset (left) and runtime (right) for the protein homology dataset.

5 Experiments

In this section, we study the performance of Ada Newton and compare it with the state of the art in solving a large-scale classification problem. In the main paper we only use the protein homology dataset provided on the KDD Cup 2004 website. Further numerical experiments on various datasets can be found in Section 7.4 in the supplementary material. The protein homology dataset contains N = 145751 samples and the dimension of each sample is p = 74. We consider three algorithms to compare with the proposed Ada Newton method. 
One of them is the classic Newton's method with backtracking line search. The second algorithm is Stochastic Gradient Descent (SGD) and the last one is the SAGA method introduced in [6]. In our experiments, we use the logistic loss and set the regularization parameters to c = 200 and Vn = 1/n.

The stepsize of SGD in our experiments is 2 × 10^{-2}. Note that picking a larger stepsize leads to faster but less accurate convergence, while choosing a smaller stepsize improves the accuracy of convergence at the price of a slower convergence rate. The stepsize for SAGA is hand-optimized and the best performance has been observed for a stepsize of 0.2, which is the one that we use in the experiments. For Newton's method, the backtracking line search parameters are α = 0.4 and β = 0.5. In the implementation of Ada Newton we increase the size of the training set by a factor of 2 at each iteration, i.e., α = 2, and we observe that the condition ‖∇Rn(wn)‖ < √(2c) Vn is always satisfied, so there is no need for reducing the factor α. Moreover, the size of the initial training set is m0 = 124. For the warm-up step that we need to get into the quadratic neighborhood of Newton's method we use the gradient descent method. In particular, we run gradient descent with stepsize 10^{-3} for 100 iterations. Note that since the number of samples is very small at the beginning, m0 = 124, and the regularizer is very large, the condition number of the problem is very small. Thus, gradient descent is able to converge to a good neighborhood of the optimal solution in a reasonable time. Notice that the cost of this warm-up process is very low: it is equal to 12400 gradient evaluations, which is less than 10% of the number of samples in the full training set, i.e., less than 10% of the cost of one pass over the dataset. Although this cost is negligible, we account for it in the comparison with SGD, SAGA, and Newton's method. 
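Putting the pieces of this setup together, a minimal sketch of the doubling scheme (logistic loss, Vn = 1/n, fixed α = 2, gradient-descent warm-up) might look as follows; it omits the Step 12 gradient-norm test and the sample-size backtracking of Algorithm 1, and synthetic data stand in for the KDD dataset:

```python
import numpy as np

def grad_hess(w, X, y, cV):
    # Gradient and Hessian of R_n as in (5)-(6), for logistic loss with labels y in {-1, +1}
    n = X.shape[0]
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))             # sigmoid(-y * x^T w)
    g = -(X.T @ (y * s)) / n + cV * w
    H = (X.T * (s * (1.0 - s))) @ X / n + cV * np.eye(X.shape[1])
    return g, H

def ada_newton(X, y, c=200.0, m0=124, alpha=2):
    N, p = X.shape
    w = np.zeros(p)
    for _ in range(100):                              # warm-up: gradient descent on m0 samples
        g, _ = grad_hess(w, X[:m0], y[:m0], c / m0)
        w -= 1e-3 * g
    n = m0
    while n < N:
        n = min(alpha * n, N)                         # grow the sample size by the factor alpha
        g, H = grad_hess(w, X[:n], y[:n], c / n)      # c * V_n with V_n = 1/n
        w = w - np.linalg.solve(H, g)                 # single Newton step, unit stepsize
    return w
```

A full implementation would also recompute the gradient at the new iterate and shrink α whenever ‖∇Rn(wn)‖ ≥ √(2c) Vn, as in Steps 10-12 of Algorithm 1.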
We would like to mention that other algorithms, such as Newton's method and stochastic algorithms, can also be used for the warm-up process; however, gradient descent seems the best option since the gradient evaluation is not costly and the problem is well-conditioned for a small training set.

The left plot in Figure 1 illustrates the convergence paths of SGD, SAGA, Newton, and Ada Newton for the protein homology dataset. Note that the x axis is the total number of samples used divided by the size of the training set N = 145751, which we call the number of passes over the dataset. As we observe, the best performance among the four algorithms belongs to Ada Newton. In particular, Ada Newton is able to achieve the accuracy RN(w) − R*N < 1/N within 2.4 passes over the dataset, which is very close to the theoretical result in Theorem 1 that guarantees accuracy of order O(1/N) after α/(α − 1) = 2 passes over the dataset. To achieve the same accuracy of 1/N, Newton's method requires 7.5 passes over the dataset, while SAGA needs 10 passes. The SGD algorithm cannot achieve the statistical accuracy of order O(1/N) even after 25 passes over the dataset.

Although Ada Newton and Newton outperform SAGA and SGD, their computational complexities are different. We address this concern by comparing the algorithms in terms of runtime. The right plot in Figure 1 demonstrates the convergence paths of the considered methods in terms of runtime. As we observe, Newton's method requires more time than SAGA to achieve the statistical accuracy of 1/N. This observation justifies the belief that Newton's method is not practical for large-scale optimization problems, since enlarging p or making the initial solution worse would degrade the performance of Newton's method even beyond what is shown in Figure 1. Ada Newton resolves this issue by starting from a small sample size, which is computationally less costly. 
Ada Newton also requires Hessian inverse evaluations, but the number of inversions is proportional to log_α N. Moreover, the performance of Ada Newton does not depend on the initial point, and the warm-up process is not costly, as described before. We observe that Ada Newton outperforms SAGA significantly. In particular, it achieves the statistical accuracy of 1/N in less than 25 seconds, while SAGA achieves the same accuracy in 62 seconds. Note that since the variable wN is in the quadratic neighborhood of Newton's method for RN, the convergence path of Ada Newton eventually becomes quadratic once the size of the training set equals the size of the full dataset. It follows that the advantage of Ada Newton with respect to SAGA is more significant if we look for a sub-optimality less than Vn. We have observed similar performance on other datasets such as A9A, W8A, COVTYPE, and SUSY – see Section 7.4 in the supplementary material.

6 Discussions

As explained in Section 4, Theorem 1 holds because condition (10) makes wm part of the quadratic convergence region of Rn. From this fact, it follows that the Newton iteration makes the suboptimality gap Rn(wn) − Rn(w*n) the square of the suboptimality gap Rn(wm) − Rn(w*n). This yields condition (11) and is the fact that makes Newton steps valuable when increasing the sample size. If we replace Newton iterations by any method with a linear convergence rate, the orders of both sides of condition (11) are the same. This would make aggressive increases of the sample size unlikely.

In Section 1 we pointed out four reasons that challenge the development of stochastic Newton methods. It would not be entirely accurate to call Ada Newton a stochastic method because it does not rely on stochastic descent directions. It is, nonetheless, a method for ERM that makes pithy use of the dataset. 
The challenges listed in Section 1 are overcome by Ada Newton because:

(i) Ada Newton does not use line searches. Optimality improvement is guaranteed by increasing the sample size.

(ii) The advantages of Newton's method are exploited by increasing the sample size at a rate that keeps the solution for sample size m in the quadratic convergence region of the risk associated with sample size n = αm. This allows aggressive growth of the sample size.

(iii) The ERM problem is not necessarily strongly convex. A regularization of order Vn is added to construct the empirical risk Rn.

(iv) Ada Newton inverts approximately log_α N Hessians. To be more precise, the total number of inversions could be larger than log_α N because of the backtracking step. However, the backtracking step is bypassed when the number of samples is sufficiently large.

It is fair to point out that items (ii) and (iv) are true only to the extent that the damped phase in Algorithm 1 is not significant. Our numerical experiments indicate that this is true, but the conclusion is not warranted by our theoretical bounds except when the dataset is very large. This suggests the bounds are loose and that further research is warranted to develop tighter bounds.

References

[1] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[3] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, Vancouver, British Columbia, Canada, pages 161–168, 2007.

[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. 
Cambridge University Press, New York, NY, USA, 2004.

[5] Hadi Daneshmand, Aurélien Lucchi, and Thomas Hofmann. Starting small - learning with adaptive sample sizes. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, pages 1463-1471, 2016.

[6] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, pages 1646-1654, 2014.

[7] Murat A. Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems 28, Montreal, Quebec, Canada, pages 3052-3060, 2015.

[8] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 728-763, 2015.

[9] Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283-313, 2015.

[10] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, Lake Tahoe, Nevada, United States, pages 315-323, 2013.

[11] Aurelien Lucchi, Brian McWilliams, and Thomas Hofmann. A variance reduced stochastic Newton method. arXiv, 2015.

[12] Aryan Mokhtari and Alejandro Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Transactions on Signal Processing, 62(23):6089-6104, 2014.

[13] Aryan Mokhtari and Alejandro Ribeiro. Global convergence of online limited memory BFGS.
Journal of Machine Learning Research, 16:3151-3181, 2015.

[14] Philipp Moritz, Robert Nishihara, and Michael I. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, pages 249-258, 2016.

[15] Yu Nesterov. Introductory Lectures on Convex Programming Volume I: Basic course. Citeseer, 1998.

[16] Yurii Nesterov et al. Gradient methods for minimizing composite objective function. 2007.

[17] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.

[18] Zheng Qu, Peter Richtárik, Martin Takáč, and Olivier Fercoq. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1823-1832, 2016.

[19] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.

[20] Nicolas Le Roux, Mark W. Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, Lake Tahoe, Nevada, United States, pages 2672-2680, 2012.

[21] Nicol N. Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-Newton method for online convex optimization. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, AISTATS 2007, San Juan, Puerto Rico, pages 436-443, 2007.

[22] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635-2670, 2010.

[23] Shai Shalev-Shwartz and Tong Zhang.
Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14:567-599, 2013.

[24] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105-145, 2016.

[25] Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.

[26] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.