{"title": "RSN: Randomized Subspace Newton", "book": "Advances in Neural Information Processing Systems", "page_first": 616, "page_last": 625, "abstract": "We develop a randomized Newton method capable of solving learning problems with huge dimensional feature spaces, which is a common setting in applications such as medical imaging, genomics and seismology. Our method leverages randomized sketching in a new way, by finding the Newton direction constrained to the space spanned by a random sketch. We develop a simple global linear convergence theory that holds for practically all sketching techniques, which gives the practitioners the freedom to design custom sketching approaches suitable for particular applications. We perform numerical experiments which demonstrate the efficiency of our method as compared to accelerated gradient descent and the full Newton method. Our method can be seen as a refinement and a randomized extension of the results of Karimireddy, Stich, and Jaggi (2019).", "full_text": "RSN: Randomized Subspace Newton\n\nRobert M. Gower\n\nLTCI, T\u00b4el\u00b4ecom Paristech, IPP, France\n\nDmitry Kovalev\n\nKAUST, Saudi Arabia\n\ngowerrobert@gmail.com\n\ndmitry.kovalev@kaust.edu.sa\n\nHeinrich-Heine-Universit\u00a8at D\u00a8usseldorf, Germany\n\nFelix Lieder\n\nPeter Richt\u00b4arik\n\nKAUST, Saudi Arabia and MIPT, Russia\n\nlieder@opt.uni-duesseldorf.de\n\npeter.richtarik@kaust.edu.sa\n\nAbstract\n\nWe develop a randomized Newton method capable of solving learning problems\nwith huge dimensional feature spaces, which is a common setting in applications\nsuch as medical imaging, genomics and seismology. Our method leverages ran-\ndomized sketching in a new way, by \ufb01nding the Newton direction constrained to\nthe space spanned by a random sketch. 
We develop a simple global linear convergence theory that holds for practically all sketching techniques, which gives practitioners the freedom to design custom sketching approaches suitable for particular applications. We perform numerical experiments which demonstrate the efficiency of our method as compared to accelerated gradient descent and the full Newton method. Our method can be seen as a refinement and randomized extension of the results of Karimireddy, Stich, and Jaggi [18].

1 Introduction

In this paper we are interested in unconstrained optimization problems of the form

    min_{x ∈ R^d} f(x),    (1)

where f : R^d → R is a sufficiently well behaved function, in the large dimensional setting, i.e., when d is very large. Large dimensional optimization problems are becoming ever more common in applications. Indeed, d often stands for the dimensionality of captured data, and due to fast-paced advances in technology, this only keeps growing. One of the key driving forces behind this is the rapid increase in the resolution of sensors used in medicine [19], genomics [26, 8], seismology [2] and weather forecasting [1]. To make predictions using such high dimensional data, typically one needs to solve an optimization problem such as (1). The traditional off-the-shelf solvers for such problems are based on Newton's method, but in this large dimensional setting they cannot be applied due to the high memory footprint and computational costs of solving the Newton system. We offer a new solution to this, by iteratively performing Newton steps in random subspaces of sufficiently low dimensions.
The resulting randomized Newton method needs only solve small, randomly compressed Newton systems and can be applied to solving (1) no matter how big the dimension d.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Background and contributions

Newton's method dates back to even before Newton, making an earlier appearance in the work of the Persian astronomer and mathematician al-Kashi, in 1427, in his "Key to Arithmetic" [33]. In the 1980s Newton's method became the workhorse of nonlinear optimization methods such as trust region [9], augmented Lagrangian [4] and interior point methods. The research into interior point methods culminated with Nesterov and Nemirovskii's [22] groundbreaking work proving that minimizing a convex (self-concordant) function could be done in a polynomial number of steps, where in each step a Newton system was solved.
Amongst the properties that make Newton type methods so attractive is their invariance to rescaling and coordinate transformations. This property makes them particularly appealing for off-the-shelf solvers since they work well independently of how the user chooses to scale or represent the variables. This in turn means that Newton based methods need little or no tuning of hyperparameters. This is in contrast with first-order methods¹, where even rescaling the function can result in a significantly different sequence of iterates, and their efficient execution relies on parameter tuning (typically the stepsize).
Despite these advantages, Newton based solvers are now facing a challenge that renders most of them inapplicable: large dimensional feature spaces. Indeed, solving a generic Newton system costs O(d³).
While inexact Newton methods [11, 5] made significant headway toward diminishing this high cost by relying on Krylov based solvers whose iterations cost O(d²), this too can be prohibitive, and this is why first order methods such as accelerated gradient descent [24] are often used in the large dimensional setting.
In this work we develop a family of randomized Newton methods which work by leveraging randomized sketching and projecting [16]. The resulting randomized Newton method has a global linear convergence for virtually any type and size of sketching matrix. In particular, one can choose a sketch of size one, which yields a low iteration complexity of as little as O(1) if one assumes that scalar derivatives can be computed in constant time. Our main assumptions are the recently introduced [18] relative smoothness and convexity² of f, which are in a certain sense weaker than the more common strong convexity and smoothness assumptions. Our method is also scale invariant, which facilitates setting the stepsize. We further propose an efficient line search strategy that does not increase the iteration complexity.
There are only a handful of Newton type methods in the literature that use iterative sketching, including the sketched Newton algorithm [28], SDNA (Stochastic Dual Newton Ascent) [29], RBCN (Randomized Block Cubic Newton) [12] and SON [21]. In the unconstrained case the sketched Newton algorithm [28] requires a sketch size that is proportional to the global rank of the Hessian, an unknown constant related to high probability statements, and ε⁻², where ε > 0 is the desired tolerance. Consequently, the required sketch size could be as large as d, which defeats the purpose. The SDNA algorithm in [29] relies on the existence of a positive definite matrix M ∈ R^{d×d} that globally upper bounds the Hessian, which is a stronger assumption than our relative smoothness assumption.
The method then proceeds by selecting random principal submatrices of M that it then uses to form and solve an approximate Newton system. The theory in [29] allows for any sketch size, including a sketch size of one. Our method could be seen as an extension of SDNA to allow for any sketch, one that is directly applied to the Hessian (as opposed to M) and one that relies on a set of more relaxed assumptions. The RBCN method combines the ideas of randomized coordinate descent [23] and cubic regularization [25]. The method requires the optimization problem to be block separable and is hence not applicable to the problem we consider here. Finally, SON [21] uses random and deterministic streaming sketches to scale up a second-order method, akin to a Gauss–Newton method, for solving online learning problems.

1.2 Key Assumptions

We assume throughout that f : R^d → R is a convex and twice differentiable function. Further, we assume that f is bounded below and that the set of minimizers X* is nonempty. We denote the optimal value of (1) by f* ∈ R.
Let H(x) := ∇²f(x) (resp. g(x) := ∇f(x)) be the Hessian (resp. gradient) of f at x. We fix an initial iterate x_0 ∈ R^d throughout and define Q to be the level set of the function f associated with x_0:

    Q := {x ∈ R^d : f(x) ≤ f(x_0)}.    (2)

Let ⟨x, y⟩_{H(x_k)} := ⟨H(x_k)x, y⟩ for all x, y ∈ R^d. Our main assumption on f is given next.

1 An exception to this is, for instance, the optimal first order affine-invariant method in [10].
2 These notions are different from the relative smoothness and convexity concepts considered in [20].

Assumption 1.
There exist constants L̂ ≥ μ̂ > 0 such that for all x, y ∈ Q:

    f(x) ≤ f(y) + ⟨g(y), x − y⟩ + (L̂/2) ‖x − y‖²_{H(y)} =: T(x, y),    (3)

    f(x) ≥ f(y) + ⟨g(y), x − y⟩ + (μ̂/2) ‖x − y‖²_{H(y)}.    (4)

We refer to L̂ and μ̂ as the relative smoothness and relative convexity constant, respectively.

Relative smoothness and convexity are a direct consequence of smoothness and strong convexity. They are also a consequence of the recently introduced [18] c-stability condition, which served to us as an inspiration. Specifically, as shown in Lemma 2 in [18] and also formally (for convenience) stated in Proposition 2 in the supplementary material, we have that

    L-smooth + μ-strongly convex ⇒ c-stability ⇒ relative smoothness & relative convexity.

We will also further assume:

Assumption 2. g(x) ∈ Range(H(x)) for all x ∈ R^d.

Assumption 2 holds if the Hessian is positive definite for all x, and for generalized linear models.

1.3 The full Newton method

Our baseline method for solving (1) is the following variant of the Newton Method (NM):

    x_{k+1} = x_k + γ n(x_k) := x_k − γ H†(x_k) g(x_k),    (5)

where H†(x_k) is the Moore–Penrose pseudoinverse of H(x_k) and n(x_k) := −H†(x_k)g(x_k) is the Newton direction. A property (which we recall from [18]) that will be important for our analysis is that, for a suitable stepsize, Newton's method is a descent method.

Lemma 1. Consider the iterates {x_k}_{k≥0} defined recursively by (5). If γ ≤ 1/L̂ and (3) holds, then f(x_{k+1}) ≤ f(x_k) for all k ≥ 0, and in particular, x_k ∈ Q for all k ≥ 0.

The proof follows by using (3), twice differentiability and convexity of f.
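As a quick numerical illustration of the damped step (5) and the descent property of Lemma 1, consider a toy quadratic, for which the model T(x, y) in (3) is exact and hence L̂ = μ̂ = 1; the following sketch is our own illustrative example (the matrix A, vector b and all names are hypothetical, not taken from the paper):

```python
import numpy as np

def newton_step(x, grad, hess, gamma):
    # One step of (5): x+ = x - gamma * H(x)^+ g(x).
    # The Moore-Penrose pseudoinverse also covers singular Hessians,
    # relying on g(x) in Range(H(x)) (Assumption 2).
    return x - gamma * np.linalg.pinv(hess(x)) @ grad(x)

# Toy strongly convex quadratic f(x) = 0.5 x^T A x - b^T x,
# for which H(x) = A and (3) holds with equality, so L_hat = 1.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
hess = lambda x: A

x0 = np.array([10.0, -10.0])
x1 = newton_step(x0, grad, hess, gamma=1.0)  # gamma = 1/L_hat = 1
# f decreases (Lemma 1); for a quadratic, one full step is exact.
```

Since the quadratic model is exact here, the single step with γ = 1/L̂ already lands on the minimizer; for general f only the monotone descent is guaranteed.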
See [18, Lemma 3].
The relative smoothness assumption (3) is particularly important for motivating Newton's method. Indeed, a Newton step is the exact minimizer of the upper bound in (3).

Lemma 2. If Assumption 2 is satisfied, then the quadratic x ↦ T(x, x_k) defined in (3) has a global minimizer x_{k+1} given by x_{k+1} = x_k − (1/L̂) H†(x_k) g(x_k) ∈ Q.

Proof. Lemma 1 implies that x_{k+1} ∈ Q, and Lemma 9 in the appendix shows that (5) is a global minimizer for γ = 1/L̂.

2 Randomized Subspace Newton

Solving a Newton system exactly is costly and may be a waste of resources. Indeed, this is the reason for the existence of inexact variants of Newton methods [11]. For these inexact Newton methods, an accurate solution is only needed when close to the optimal point.
In this work we introduce a different inexactness idea: we propose to solve an exact Newton system, but in an inexact, randomly selected subspace. In other words, we propose a randomized subspace Newton method, where the randomness is introduced via sketching matrices, defined next.

Definition 1. Let D be a (discrete or continuous) distribution over matrices in R^{d×s}. We say that S ∼ D is a random sketching matrix and s ∈ N is the sketch size.

We will often assume that the random sketching is nullspace preserving.

Assumption 3. We say that S ∼ D is nullspace preserving if with probability one we have that

    Null(Sᵀ H(x) S) = Null(S),  ∀x ∈ Q.    (6)

Algorithm 1 RSN: Randomized Subspace Newton
1: input: x_0 ∈ R^d
2: parameters: D = distribution over random matrices
3: for k = 0, 1, 2, . . .
do
4:   sample a fresh sketching matrix: S_k ∼ D
5:   x_{k+1} = x_k − (1/L̂) S_k (S_kᵀ H(x_k) S_k)† S_kᵀ g(x_k)
6: output: last iterate x_k

By sampling a sketching matrix S_k ∼ D in the kth iteration, we can form a sketched Newton direction using only the sketched Hessian S_kᵀ H(x_k) S_k ∈ R^{s×s}; see line 5 in Algorithm 1. Note that the sketched Hessian is the result of twice differentiating the function λ ↦ f(x_k + S_k λ), which can be done efficiently using a single backpropagation pass [14] or s backpropagation passes [7], which costs at most s times the cost of evaluating the function f.
First we show that, much like the full Newton method (5), Algorithm 1 is a descent method.

Lemma 3 (Descent). Consider the iterates x_k given by Algorithm 1. If Assumptions 1, 2 and 3 hold, then f(x_{k+1}) ≤ f(x_k) and consequently x_k ∈ Q for all k ≥ 0.

While common in the literature on randomized coordinate (subspace) descent methods, this is a rare result for randomized stochastic gradient descent methods, which do not enjoy a descent property. Lemma 3 is useful in monitoring the progress of the method in cases when function evaluations are not too prohibitive. However, we use it solely for establishing a tighter convergence theory.
Interestingly, the iterations of Algorithm 1 can be equivalently formulated as a random projection of the full Newton step, as we detail next.

Lemma 4. Let Assumptions 1 and 2 hold. Consider the projection matrix P_k with respect to the seminorm ‖·‖²_{H(x_k)} := ⟨·,·⟩_{H(x_k)} given by

    P_k := S_k (S_kᵀ H(x_k) S_k)† S_kᵀ H(x_k) ∈ R^{d×d}.    (7)

The iterates of Algorithm 1 can be viewed as a projection of the Newton step given by

    x_{k+1} = x_k + (1/L̂) P_k n(x_k).    (8)

Proof.
To verify that P_k is an oblique projection matrix, it suffices to check that

    ⟨P_k x, P_k y⟩_{H(x_k)} = ⟨P_k x, y⟩_{H(x_k)},  ∀x, y ∈ R^d,

which in turn relies on the identity M†MM† = M†, which holds for all matrices M ∈ R^{d×d}. Since g(x_k) ∈ Range(H(x_k)), we have again by the same identity of the pseudoinverse that

    g(x_k) = H(x_k)H†(x_k)g(x_k) = −H(x_k)n(x_k).    (9)

Consequently, P_k n(x_k) = −S_k (S_kᵀ H(x_k) S_k)† S_kᵀ g(x_k).

We will refer to P_k n(x_k) as the sketched Newton direction. If we add one more simple assumption to the selection of the sketching matrices, we have the following equivalent formulations of the sketched Newton direction.

Lemma 5. Let Assumptions 1, 2 and 3 hold. It follows that the iterate x_{k+1} of Algorithm 1 can be equivalently seen as:

1. The minimizer of T(x, x_k) over the random subspace x ∈ x_k + Range(S_k):

    x_{k+1} = x_k + S_k λ_k, where λ_k ∈ argmin_{λ ∈ R^s} T(x_k + S_k λ, x_k).    (10)

Furthermore,

    T(x_{k+1}, x_k) = f(x_k) − (1/(2L̂)) ‖g(x_k)‖²_{S_k(S_kᵀ H(x_k) S_k)† S_kᵀ}.    (11)

2. A projection of the Newton direction onto a random subspace:

    x_{k+1} = argmin_{x ∈ R^d, λ ∈ R^s} ‖x − (x_k + (1/L̂) n(x_k))‖²_{H(x_k)}  subject to  x = x_k + S_k λ.    (12)

3. A projection of the previous iterate onto the sketched Newton system:

    x_{k+1} ∈ argmin ‖x − x_k‖²_{H(x_k)}  subject to  S_kᵀ H(x_k)(x − x_k) = −(1/L̂) S_kᵀ g(x_k).    (13)

Furthermore, if Range(S_k) ⊂ Range(H(x_k)), then x_{k+1} is the unique solution to the above.

3 Convergence Theory

We now present two main convergence theorems.

Theorem 2.
Let G(x) := E_{S∼D}[S (Sᵀ H(x) S)† Sᵀ] and define

    ρ(x) := min_{v ∈ Range(H(x))} ⟨H^{1/2}(x) G(x) H^{1/2}(x) v, v⟩ / ‖v‖²₂   and   ρ := min_{x ∈ Q} ρ(x).    (14)

If Assumptions 1 and 2 hold, then

    E[f(x_k)] − f* ≤ (1 − ρ μ̂/L̂)^k (f(x_0) − f*).    (15)

Consequently, given ε > 0, if ρ > 0 and if

    k ≥ (1/ρ)(L̂/μ̂) log((f(x_0) − f*)/ε),   then   E[f(x_k) − f*] < ε.    (16)

Theorem 2 includes the convergence of the full Newton method as a special case. Indeed, when we choose³ S_k = I ∈ R^{d×d}, it is not hard to show that ρ(x_k) ≡ 1, and thus (16) recovers the (L̂/μ̂) log(1/ε) complexity given in [18]. We provide yet an additional sublinear O(1/k) convergence result that holds even when μ̂ = 0.

Theorem 3. Let Assumption 2 hold and Assumption 1 be satisfied with L̂ > μ̂ = 0. If

    R := inf_{x* ∈ X*} sup_{x ∈ Q} ‖x − x*‖_{H(x)} < +∞,    (17)

and ρ > 0, then E[f(x_k)] − f* ≤ 2L̂R²/(ρk).

As a new consequence of Theorem 3, we can also show that the full Newton method has an O(L̂R²ε⁻¹) iteration complexity.
Both of the above theorems rely on ρ > 0. So in the next Section 3.1 we give sufficient conditions for ρ > 0 that hold for virtually all sketching matrices.

3.1 The sketched condition number ρ(x_k)

The parameters ρ(x_k) and ρ in Theorem 2 characterize the trade-off between the cost of the iterations and the convergence rate of RSN.
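To make the linear rate of Theorem 2 concrete, line 5 of Algorithm 1 can be run on a small quadratic, where L̂ = μ̂ = 1 and Gaussian sketches give ρ > 0. The following is a minimal illustrative sketch, not the paper's code; the problem data, sketch size and iteration count are our own hypothetical choices:

```python
import numpy as np

def rsn_step(x, grad, hess, L_hat, s, rng):
    # Line 5 of Algorithm 1 with a Gaussian sketch S in R^{d x s}:
    # x+ = x - (1/L_hat) S (S^T H S)^+ S^T g, a small s x s solve.
    d = x.shape[0]
    S = rng.standard_normal((d, s))
    lam = np.linalg.pinv(S.T @ hess(x) @ S) @ (S.T @ grad(x))
    return x - (S @ lam) / L_hat

# Toy quadratic: H(x) = A, so Assumption 1 holds with L_hat = mu_hat = 1.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.ones(3)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
hess = lambda x: A

rng = np.random.default_rng(0)
x = np.zeros(3)
vals = [f(x)]
for _ in range(400):
    x = rsn_step(x, grad, hess, L_hat=1.0, s=1, rng=rng)
    vals.append(f(x))
# vals is monotonically decreasing (Lemma 3) and approaches f*
# at a linear rate governed by rho * mu_hat / L_hat (Theorem 2).
```

Even with a sketch size of s = 1, each iteration only solves a 1×1 system, yet the function values decrease monotonically and converge linearly in expectation.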
Here we show that ρ is always bounded between zero and one, and further, we give conditions under which ρ(x_k) is the smallest non-zero eigenvalue of an expected projection matrix, and is thus bounded away from zero.

Lemma 6. The parameter ρ(x_k) appearing in Theorem 2 satisfies 0 ≤ ρ(x_k) ≤ 1. Letting

    P̂(x_k) := H^{1/2}(x_k) S_k (S_kᵀ H(x_k) S_k)† S_kᵀ H^{1/2}(x_k),    (18)

if we assume that the exactness⁴ condition

    Range(H(x_k)) = Range(E_{S∼D}[P̂(x_k)])    (19)

holds, then ρ(x_k) = λ⁺_min(E_{S∼D}[P̂(x_k)]) > 0.

3 Or when S_k is an invertible matrix.
4 An "exactness" condition similar to (19) was introduced in [30] in a program of "exactly" reformulating a linear system into a stochastic optimization problem. Our condition has a similar meaning, but we do not elaborate on this as it is not central to the developments in this paper.

Since (19) is in general hard to verify, we give simpler sufficient conditions for ρ > 0 in the next lemma.

Lemma 7 (Sufficient condition for exactness).
If Assumption 3 and

    Range(H(x_k)) ⊂ Range(E[S_k S_kᵀ])    (20)

hold, then (19) holds and consequently 0 < ρ ≤ 1.

Clearly, condition (20) is immediately satisfied if E[S_k S_kᵀ] is invertible, and this is the case for Gaussian sketches, weighted coordinate sketches, sub-sampled Hadamard or Fourier transforms, and the entire class of randomized orthonormal system sketches [27].

3.2 The relative smoothness and strong convexity constants

In the next lemma we give an insightful formula for calculating the relative smoothness and convexity constants defined in Assumption 1, and in particular, show how L̂ and μ̂ depend on the relative change of the Hessian.

Lemma 8. Let f be twice differentiable, satisfying Assumption 1. If moreover H(x) is invertible for every x ∈ R^d, then

    L̂ = max_{x,y ∈ Q} ∫₀¹ 2(1−t) (‖z_t − y‖²_{H(z_t)} / ‖z_t − y‖²_{H(y)}) dt ≤ max_{x,y ∈ Q} ‖x − y‖²_{H(x)} / ‖x − y‖²_{H(y)} =: c,    (21)

    μ̂ = min_{x,y ∈ Q} ∫₀¹ 2(1−t) (‖z_t − y‖²_{H(z_t)} / ‖z_t − y‖²_{H(y)}) dt ≥ 1/c,    (22)

where z_t := y + t(x − y).

The constant c on the right hand side of (21) is known as the c-stability constant [18]. As a by-product, the above lemma establishes that the rates for the deterministic Newton method obtained as a special case of our general theorems are at least as good as those obtained in [18] using c-stability.

4 Examples

With the freedom of choosing the sketch size, we can consider the extreme case s = 1, i.e., the case with the sketching matrices having only a single column.

Corollary 1 (Single column sketches).
Let 0 ≺ U ∈ R^{d×d} be a symmetric positive definite matrix such that H(x) ⪯ U for all x ∈ R^d. Let D = [d_1, . . . , d_d] ∈ R^{d×d} be a given invertible matrix such that d_iᵀ H(x) d_i ≠ 0 for all x ∈ Q and i = 1, . . . , d. If we sample according to

    P[S_k = d_i] = p_i := d_iᵀ U d_i / Trace(Dᵀ U D),

then the update on line 5 of Algorithm 1 is given by

    x_{k+1} = x_k − (1/L̂) (d_iᵀ g(x_k) / d_iᵀ H(x_k) d_i) d_i,  with probability p_i,    (23)

and under the assumptions of Theorem 2, Algorithm 1 converges according to

    E[f(x_k)] − f* ≤ (1 − min_{x ∈ Q} [λ⁺_min(H^{1/2}(x) D Dᵀ H^{1/2}(x)) / Trace(Dᵀ U D)] (μ̂/L̂))^k (f(x_0) − f*).    (24)

Each iteration of the single column sketching Newton method (23) requires only three scalar derivatives of the function t ↦ f(x_k + t d_i), and thus if f(x) can be evaluated in constant time, this amounts to O(1) cost per iteration. Indeed, (23) is much like coordinate descent, except we descend along the d_i directions, and with a stepsize that adapts depending on the curvature information d_iᵀ H(x_k) d_i.⁵
The rate of convergence in (24) suggests that we should choose D ≈ U^{−1/2} so that ρ is large. If there is no efficient way to approximate U^{−1/2}, then the simple choice of D = I gives ρ(x_k) = λ⁺_min(H(x_k))/Trace(U).
An expressive family of functions that satisfy Assumption 1 are generalized linear models.

Definition 4. Let 0 ≤ u ≤ ℓ and let φ_i : R → R₊ be a twice differentiable function such that

    u ≤ φ_i''(t) ≤ ℓ,  for i = 1, . . . , n.    (25)

Let a_i ∈ R^d for i = 1, . . . , n and A = [a_1, . . . , a_n] ∈ R^{d×n}.

5 There in fact exists a block coordinate method that also incorporates second order information [13].
We say that f : R^d → R is a generalized linear model when

    f(x) = (1/n) Σ_{i=1}^n φ_i(a_iᵀ x) + (λ/2) ‖x‖²₂.    (26)

The structure of the Hessian of a generalized linear model is such that highly efficient fast Johnson–Lindenstrauss sketches [3] can be used. Indeed, the Hessian is given by

    H(x) = (1/n) Σ_{i=1}^n a_i a_iᵀ φ_i''(a_iᵀ x) + λI = (1/n) A Φ''(Aᵀx) Aᵀ + λI,

and consequently, for computing the sketched Hessian S_kᵀ H(x_k) S_k we only need to sketch the fixed matrix A via S_kᵀ A and compute S_kᵀ S_k efficiently, and thus no backpropagation is required. This is exactly the setting where fast Johnson–Lindenstrauss transforms can be effective [17, 3].
We now give a simple expression for computing the relative smoothness and convexity constants for generalized linear models.

Proposition 1. Let f : R^d → R be a generalized linear model with 0 ≤ u ≤ ℓ. Then Assumption 1 is satisfied with

    L̂ = (ℓσ²_max(A) + nλ)/(uσ²_max(A) + nλ)   and   μ̂ = (uσ²_max(A) + nλ)/(ℓσ²_max(A) + nλ).    (27)

Furthermore, if we apply Algorithm 1 with a sketch such that E[SSᵀ] is invertible, then the iteration complexity (16) of applying Algorithm 1 is given by

    k ≥ (1/ρ) ((ℓσ²_max(A) + nλ)/(uσ²_max(A) + nλ))² log(1/ε).    (28)

This complexity estimate (28) should be contrasted with that of gradient descent. When x_0 ∈ Range(A), the iteration complexity of GD (gradient descent) applied to a smooth generalized linear model is given by ((ℓσ²_max(A) + nλ)/(uσ²_min+(A) + nλ)) log(1/ε), where σ_min+(A) is the smallest non-zero singular value of A.
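The identity underpinning this efficiency — that the sketched Hessian of a generalized linear model only needs the sketched data SᵀA and SᵀS, never the d×d Hessian — can be checked directly. Below is a minimal sketch for the logistic loss; the random data, dimensions and variable names are our own hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s = 50, 20, 3
A = rng.standard_normal((d, n))           # data matrix, columns a_i
y = rng.choice([-1.0, 1.0], size=n)       # labels
lam = 0.1

def phi_pp(t):
    # phi_i''(t) for the logistic loss phi_i(t) = ln(1 + exp(-y_i t)):
    # sigma(y_i t)(1 - sigma(y_i t)), which lies in [0, 1/4].
    sig = 1.0 / (1.0 + np.exp(-t))
    return sig * (1.0 - sig)

x = rng.standard_normal(d)
w = phi_pp(y * (A.T @ x))                 # phi_i''(a_i^T x), shape (n,)

S = rng.standard_normal((d, s))
SA = S.T @ A                              # sketch the fixed matrix A once (s x n)
# Sketched Hessian built without ever forming the d x d Hessian:
H_sketched = (SA * w) @ SA.T / n + lam * (S.T @ S)

# Reference only: the full Hessian H(x) = (1/n) A Phi''(A^T x) A^T + lam I.
H_full = (A * w) @ A.T / n + lam * np.eye(d)
```

In practice S_k ᵀA would be formed with a fast Johnson–Lindenstrauss transform rather than a dense product; the dense Gaussian sketch here is just the simplest way to exhibit the identity S_kᵀH(x_k)S_k = (1/n)(S_kᵀA)Φ''(S_kᵀA)ᵀ + λS_kᵀS_k.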
To simplify the discussion, and as a sanity check, consider the full Newton method with S_k = I for all k, and consequently ρ = 1. In view of (28), Newton's method depends neither on the smallest singular value nor on the condition number of the data matrix. This suggests that for ill-conditioned problems Newton's method can be superior to gradient descent, as is well known.

5 Experiments and Heuristics

In this section we evaluate and compare the computational performance of RSN (Algorithm 1) on generalized linear models (26). Specifically, we focus on logistic regression, i.e., φ_i(t) = ln(1 + e^{−y_i t}), where y_i ∈ {−1, 1} are the target values for i = 1, . . . , n. Gradient descent (GD), accelerated gradient descent (AGD) [24] and the full Newton method⁶ are compared with RSN. For simplicity, block coordinate sketches are used; these are random sketch matrices of the form S_k ∈ {0,1}^{d×s} with exactly one non-zero entry per column and at most one non-zero entry per row. We will refer to s ∈ N as the sketch size.
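For intuition, the s = 1 coordinate case of such sketches reduces to the single column update (23) of Corollary 1 with D = I: a coordinate descent step whose stepsize adapts to the curvature d_iᵀH(x_k)d_i. The sketch below is our own illustrative example on a toy quadratic (the matrix, the choice U = H, and the iteration count are hypothetical assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy quadratic f(x) = 0.5 x^T H x - b^T x, so H(x) = H and L_hat = 1.
H = np.array([[5.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: H @ x - b

# Corollary 1 with D = I: directions d_i = e_i, sampled with
# probabilities p_i = d_i^T U d_i / Trace(D^T U D); here U = H is tight.
p = np.diag(H) / np.trace(H)

x = np.zeros(2)
for _ in range(300):
    i = rng.choice(2, p=p)
    # update (23): x+ = x - (1/L_hat) * (d_i^T g) / (d_i^T H d_i) * d_i
    x[i] -= grad(x)[i] / H[i, i]
x_star = np.linalg.solve(H, b)
```

Each step touches a single coordinate and needs only the scalars d_iᵀg(x_k) and d_iᵀH(x_k)d_i, matching the O(1) per-iteration cost discussed after Corollary 1.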
To ensure fairness and for comparability purposes, all methods were supplied with the exact Lipschitz constants and equipped with the same line-search strategy (see Algorithm 3 in the supplementary material). We consider 6 datasets with a diverse number of features and samples (see Table 1 for details), which were modified by removing all zero features and adding an intercept, i.e., a constant feature.

Table 1: Details of the data sets taken from LIBSVM [6] and OpenML [31].

dataset       | samples (n) | non-zero features (d) | density
chemotherapy  | 158         | 61,359 + 1            | 1
gisette       | 6,000       | 5,000 + 1             | 0.9910
news20        | 19,996      | 1,355,191 + 1         | 0.0003
rcv1          | 20,241      | 47,237 + 1            | 0.0016
real-sim      | 72,309      | 20,958 + 1            | 0.0025
webspam       | 350,000     | 680,715 + 1           | 0.0055

6 To implement Newton's method efficiently, we of course exploit the Sherman–Morrison–Woodbury matrix identity [32] when appropriate.

For regularization we used λ = 10⁻¹⁰ and stopped the methods once the gradient norm was below tol = 10⁻⁶ or some maximal number of iterations had been exhausted. In Figures 1 to 3 we plot the gradient norm against iterations and against wall-clock time, respectively.

Figure 1: Highly dense problems, favoring RSN methods.

Figure 2: Due to extreme sparsity, accelerated gradient is competitive with the Newton type methods.

Figure 3: Moderately sparse problems favor the RSN method. The full Newton method is infeasible due to high dimensionality.

Newton's method, when not limited by the immense costs of forming and solving linear systems, is competitive, as we can see in the gisette problem in Figure 1. In most real-world applications, however, the bottleneck is exactly within the linear systems, which may, even if they can be formed at all, require significant solving time. On the other end of the spectrum, GD and AGD usually need more iterations and therefore may suffer from expensive full gradient evaluations, for example due to a higher density of the data matrix; see Figure 3.
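The experiments above rely on a line-search variant of RSN (Algorithm 3 of the supplementary material, which is not reproduced in the paper body). As a hedged illustration only, a generic Armijo backtracking search along the sketched Newton direction might look as follows; this is our own sketch under standard textbook assumptions, not the authors' Algorithm 3, and the constants t0, beta and c are arbitrary:

```python
import numpy as np

def backtracking_rsn_step(x, f, grad, hess, s, rng,
                          t0=1.0, beta=0.5, c=1e-4, max_halvings=30):
    # Sketched Newton direction of Algorithm 1, with the stepsize chosen
    # by a standard Armijo backtracking rule instead of the fixed 1/L_hat.
    d = x.shape[0]
    S = rng.standard_normal((d, s))
    direction = -S @ (np.linalg.pinv(S.T @ hess(x) @ S) @ (S.T @ grad(x)))
    slope = grad(x) @ direction          # non-positive for a descent direction
    t = t0
    for _ in range(max_halvings):
        if f(x + t * direction) <= f(x) + c * t * slope:
            break                        # sufficient decrease achieved
        t *= beta                        # shrink the stepsize and retry
    return x + t * direction

# Toy quadratic, for illustration.
A = np.array([[10.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
hess = lambda x: A

rng = np.random.default_rng(3)
x = np.array([5.0, 5.0])
for _ in range(200):
    x = backtracking_rsn_step(x, f, grad, hess, s=1, rng=rng)
```

Because the trial point only moves inside the s-dimensional sketched subspace, each f-evaluation in the backtracking loop can often be made cheap, which is in the spirit of the claim that the line search does not increase the iteration complexity.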
RSN seems like a good compromise here: as the sketch size and type can be controlled by the user, the involved linear systems can be kept reasonably sized. As a result, RSN is the fastest method in all the above experiments, with the exception of the extremely sparse problem news20 in Figure 2, where AGD outruns RSN with s = 750 by approximately 20 seconds.

6 Conclusions and Future Work

We have laid out the foundational theory of a class of randomized Newton methods, and also performed numerical experiments validating the methods. There are now several avenues of work to explore, including: 1) combining the randomized Newton method with subsampling so that it can be applied to data that is both high dimensional and abundant; 2) leveraging the potential of fast Johnson–Lindenstrauss sketches to design even faster variants of RSN; 3) developing heuristic sketches based on past descent directions, inspired by quasi-Newton methods [15].

References

[1] John T. Abatzoglou, Solomon Z. Dobrowski, Sean A. Parks, and Katherine C. Hegewisch. Data descriptor: TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958–2015. Scientific Data, 5, 2018.

[2] T. G. Addair, D. A. Dodge, W. R. Walter, and S. D. Ruppert. Large-scale seismic signal analysis with Hadoop. Computers and Geosciences, 66(C), 2014.

[3] Nir Ailon and Bernard Chazelle.
The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, May 2009.

[4] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods (Optimization and Neural Computation Series). Athena Scientific, 1996.

[5] Richard H. Byrd, Gillian M. Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

[6] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, April 2011.

[7] Bruce Christianson. Automatic Hessians by reverse accumulation. IMA Journal of Numerical Analysis, 12(2):135–150, 1992.

[8] James R. Cole, Qiong Wang, Jordan A. Fish, Benli Chai, Donna M. McGarrell, Yanni Sun, C. Titus Brown, Andrea Porras-Alfaro, Cheryl R. Kuske, and James M. Tiedje. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research, 42(D1):D633–D642, 2013.

[9] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust-region Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

[10] Alexandre d'Aspremont, Cristóbal Guzmán, and Martin Jaggi. Optimal affine-invariant smooth minimization algorithms. SIAM Journal on Optimization, 28(3):2384–2405, 2018.

[11] Ron S. Dembo, Stanley C. Eisenstat, and Trond Steihaug. Inexact Newton methods. SIAM Journal on Numerical Analysis, 19(2):400–408, 1982.

[12] Nikita Doikov and Peter Richtárik. Randomized block cubic Newton method. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[13] Kimon Fountoulakis and Rachael Tappenden. A flexible coordinate descent method.
Compu-\n\ntational Optimization and Applications, 70(2):351\u2013394, Jun 2018.\n\n[14] R M Gower and M P Mello. A new framework for the computation of hessians. Optimization\n\nMethods and Software, 27(2):251\u2013273, 2012.\n\n[15] Robert M. Gower, Donald Goldfarb, and Peter Richt\u00b4arik. Stochastic block BFGS: Squeezing\nmore curvature out of data. Proceedings of the 33rd International Conference on Machine\nLearning, 2016.\n\n[16] Robert Mansel Gower and Peter Richt\u00b4arik. Randomized iterative methods for linear systems.\n\nSIAM Journal on Matrix Analysis and Applications, 36(4):1660\u20131690, 2015.\n\n[17] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert\nIn Conference in modern analysis and probability (New Haven, Conn., 1982), vol-\nspace.\nume 26 of Contemporary Mathematics, pages 189\u2013206. American Mathematical Society,\n1984.\n\n[18] Sai Praneeth Karimireddy, Sebastian U. Stich, and Martin Jaggi. Global linear convergence of\n\nNewton\u2019s method without strong-convexity or Lipschitz gradients. arXiv:1806:0041, 2018.\n\n[19] C. H. Lee and H. J. Yoon. Medical big data: promise and challenges. kidney research and\n\nclinical practice. Kidney Res Clin Pract, 36(4):3\u20131, 2017.\n\n[20] Haihao Lu, Robert M. Freund, and Yurii Nesterov. Relatively smooth convex optimization by\n\n\ufb01rst-order methods, and applications. SIAM Journal on Optimization, 28(1):333\u2013354, 2018.\n\n[21] Haipeng Luo, Alekh Agarwal, Nicol`o Cesa-Bianchi, and John Langford. Ef\ufb01cient second\norder online learning by sketching. In Advances in Neural Information Processing Systems 29,\npages 902\u2013910. 2016.\n\n9\n\n\f[22] Y. Nesterov and A. Nemirovskii. Interior Point Polynomial Algorithms in Convex Program-\nming. Studies in Applied Mathematics. Society for Industrial and Applied Mathematics, 1987.\n[23] Yurii Nesterov. 
Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[24] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1st edition, 2014.

[25] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[26] Ross A. Overbeek, Niels Larsen, Gordon D. Pusch, Mark D'Souza, Evgeni Selkov Jr., Nikos Kyrpides, Michael Fonstein, Natalia Maltsev, and Evgeni Selkov. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Research, 28(1):123–125, 2000.

[27] Mert Pilanci and Martin J. Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. Journal of Machine Learning Research, 17:1–33, 2016.

[28] Mert Pilanci and Martin J. Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

[29] Zheng Qu, Peter Richtárik, Martin Takáč, and Olivier Fercoq. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[30] Peter Richtárik and Martin Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.

[31] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.

[32] Max A. Woodbury. Inverting modified matrices. Technical Report 42, Statistical Research Group, Princeton University, 1950.

[33] Tjalling J. Ypma.
Historical development of the Newton-Raphson method. SIAM Review, 37(4):531–551, December 1995.