{"title": "Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1619, "page_last": 1629, "abstract": "We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasi-Newton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speed-ups over classical non-accelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.", "full_text": "Accelerated Stochastic Matrix Inversion: General\nTheory and Speeding up BFGS Rules for Faster\n\nSecond-Order Optimization\n\nRobert M. Gower\nT\u00e9l\u00e9com ParisTech\n\nParis, France\n\nrobert.gower@telecom-paristech.fr\n\nFilip Hanzely\n\nKAUST\n\nThuwal, Saudi Arabia\n\nfilip.hanzely@kaust.edu.sa\n\nPeter Richt\u00e1rik\u21e4\n\nKAUST\n\nThuwal, Saudi Arabia\n\npeter.richtarik@kaust.edu.sa\n\nSebastian U. Stich\n\nEPFL\n\nLausanne, Switzerland\n\nsebastian.stich@epfl.ch\n\nAbstract\n\nWe present the \ufb01rst accelerated randomized algorithm for solving linear systems\nin Euclidean spaces. One essential problem of this type is the matrix inversion\nproblem. 
In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasi-Newton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speed-ups over classical non-accelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.

1 Introduction

Consider the optimization problem

min_{w ∈ R^n} f(w),    (1)

and assume f is sufficiently smooth. A new wave of second order stochastic methods is being developed with the aim of solving large scale optimization problems. In particular, many of these new methods are based on stochastic BFGS updates [29, 35, 20, 21, 6, 8, 3]. Here we develop a new stochastic accelerated BFGS update that can form the basis of new stochastic quasi-Newton methods. Another approach to scaling up second order methods is to use randomized sketching to reduce the dimension, and hence the complexity, of the Hessian and of the updates involving the Hessian [26, 38], or subsampled Hessian matrices when the objective function is a sum of many loss functions [5, 2, 1, 37].

The starting point for developing second order methods is arguably Newton's method, which performs the iterative process

w_{k+1} = w_k − (∇²f(w_k))^{-1} ∇f(w_k),    (2)

where ∇²f(w_k) and ∇f(w_k) are the Hessian and gradient of f, respectively.

* University of Edinburgh, Moscow Institute of Physics and Technology

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
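To make the per-iteration cost of (2) concrete, here is a minimal NumPy sketch (our illustration, not the paper's code; the toy objective and all names are assumptions). The O(n³) linear solve with the Hessian inside the loop is exactly the operation that quasi-Newton methods approximate away.

```python
import numpy as np

def newton(grad, hess, w0, iters=25):
    """Newton's method (2): w_{k+1} = w_k - (hess f(w_k))^{-1} grad f(w_k).
    Each step forms the Hessian and solves an n x n linear system."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - np.linalg.solve(hess(w), grad(w))
    return w

# Toy smooth strongly convex objective (illustrative):
# f(w) = sum_i exp(w_i) + 0.5 * ||w - c||^2
c = np.array([1.0, 2.0, 3.0])
grad = lambda w: np.exp(w) + (w - c)
hess = lambda w: np.diag(np.exp(w)) + np.eye(len(c))
w_star = newton(grad, hess, np.zeros(3))
```

On this separable convex example the iterates converge quadratically to the stationary point, where exp(w) + w = c holds coordinate-wise.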
However, it is inefficient for solving large scale problems, as it requires the computation of the Hessian and then the solution of a linear system at each iteration. Several methods have been developed to address this issue, based on the idea of approximating the exact update.

Quasi-Newton methods, in particular BFGS [4, 10, 11, 30], had been the leading optimization algorithms in various fields since the late 60's until the rise of big data, which brought a need for simpler first order algorithms. It is well known that Nesterov's acceleration [22] is a reliable way to speed up first order methods. However, until now, acceleration techniques have been applied exclusively to speeding up gradient updates. In this paper we present an accelerated BFGS algorithm, opening up new applications for acceleration. The acceleration in fact comes from an accelerated algorithm for inverting the Hessian matrix.

To be more specific, recall that quasi-Newton rules aim to maintain an estimate X_k of the inverse Hessian, adjusting it every iteration so that the inverse Hessian acts appropriately in a particular direction, while enforcing symmetry:

X_k (∇f(w_k) − ∇f(w_{k−1})) = w_k − w_{k−1},    X_k = X_k^T.    (3)

A notable research direction is the development of stochastic quasi-Newton methods [15], where the estimated inverse is equal to the true inverse over a subspace:

X_k ∇²f(w_k) S_k = S_k,    X_k = X_k^T,    (4)

where S_k ∈ R^{n×τ} is a randomly generated matrix. In fact, (4) can be seen as the so-called sketch-and-project iteration for inverting ∇²f(w_k). In this paper we first develop the accelerated algorithm for inverting positive definite matrices. As a direct application, our algorithm can be used as a primitive in quasi-Newton methods, which results in a novel accelerated (stochastic) quasi-Newton method of the type (4). In addition, our acceleration technique can also be incorporated in the classical (non-stochastic) BFGS method.
This results in the accelerated BFGS method. Whereas the matrix inversion contribution is accompanied by strong theoretical justifications, this does not apply to the latter. Rather, we verify the effectiveness of this new accelerated BFGS method through numerical experiments.

1.1 Sketch-and-project for linear systems

Our accelerated algorithm can be applied to more general tasks than only inverting matrices. In its most general form, it can be seen as an accelerated version of a sketch-and-project method in Euclidean spaces, which we present now. Consider a linear system Ax = b such that b ∈ Range(A). One step of the sketch-and-project algorithm reads as:

x_{k+1} = argmin_x ||x − x_k||²_B    subject to    S_k^T A x = S_k^T b,    (5)

where ||x||²_B = <Bx, x> for some B ≻ 0 and S_k is a random sketching matrix sampled i.i.d. at each iteration from a fixed distribution.

Randomized Kaczmarz [16, 33] was the first algorithm of this type. In [13], this sketch-and-project algorithm was analyzed in its full generality. Note that the dual problem of (5) takes the form of a quadratic minimization problem [14], and randomized methods such as coordinate descent [23, 36], random pursuit [31, 32] or stochastic dual ascent [14] can thus also be captured as special instances of this method. Richtárik and Takáč [28] adopt a new point of view through a theory of stochastic reformulations of linear systems. In addition, they consider the addition of a relaxation parameter, as well as mini-batch and accelerated variants. Acceleration, however, was only achieved for the expected iterates, and not in the L2 sense as we do here. We refer to Richtárik and Takáč [28] for an interpretation of sketch-and-project as stochastic gradient descent, stochastic Newton, stochastic proximal point method, and stochastic fixed point method.

Gower [15] observed that the procedure (5) can also be applied to find the inverse of a matrix.
Assume that the optimization variable itself is a matrix, x = X, and that b = I, the identity matrix; then sketch-and-project converges (under mild assumptions) to a solution of AX = I. Even the symmetry constraint X = X^T can be incorporated into the sketch-and-project framework, since it is a linear constraint.

There has been recent development in speeding up the sketch-and-project method using the idea of Nesterov's acceleration [22]. In [18] an accelerated Kaczmarz algorithm was presented for special sketches of rank one. Arbitrary sketches of rank one were considered in [31], block sketches in [24], and recently, Tu and coauthors [34] developed acceleration for special sketching matrices, assuming the matrix A is square. This assumption, along with any other assumptions on A, was later dropped in [27]. Another notable way to accelerate the sketch-and-project algorithm is by using momentum or stochastic momentum [19].

We build on the recent work of Richtárik and Takáč [27] and further extend their analysis by studying accelerated sketch-and-project in general Euclidean spaces. This allows us to deduce the result for matrix inversion as a special case. However, there is one additional caveat that has to be considered for the intended application in quasi-Newton methods: ideally, all iterates of the algorithm should be symmetric positive definite matrices. This is not the case in general, but we address this problem by constructing special sketch operators that preserve symmetry and positive definiteness.

2 Contributions

We now present our main contributions.

Accelerated Sketch-and-Project in Euclidean Spaces. We generalize the analysis of an accelerated version of the sketch-and-project algorithm [27] to linear operator systems in Euclidean spaces. We provide a self-contained convergence analysis, recovering the original results in a more general setting.

Faster Algorithms for Matrix Inversion.
We develop an accelerated algorithm for inverting positive definite matrices. This algorithm can be seen as a special case of the accelerated sketch-and-project in Euclidean spaces, thus its convergence follows from the main theorem. However, we also provide a different formulation of the proof that is specialized to this setting. Similarly to [34], the performance of the algorithm depends on two parameters μ and ν that capture spectral properties of the input matrix and of the sketches that are used. Whilst for the non-accelerated sketch-and-project algorithm for matrix inversion [15] the knowledge of these parameters is not necessary, they need to be given as input to the accelerated scheme. When employed with the correct choice of parameters, the accelerated algorithm is always faster than the non-accelerated one. We also provide a theoretical rate for sub-optimal parameters μ, ν, and we perform numerical experiments to inform the choice of μ, ν in practice.

Randomized Accelerated Quasi-Newton. The proposed iterative algorithm for matrix inversion is designed in such a way that each iterate is a symmetric matrix. This means that we can use the generated approximate solutions as estimators for the inverse Hessian in quasi-Newton methods, which is a direct extension of stochastic quasi-Newton methods. To the best of our knowledge, this yields the first accelerated (stochastic) quasi-Newton method.

Accelerated Quasi-Newton. In the standard BFGS method the updates to the Hessian estimate are not chosen randomly, but deterministically. Based on the intuition gained from the accelerated random method, we propose an accelerated scheme for BFGS. The main idea is that we replace the random sketching of the Hessian with a deterministic update.
The theoretical convergence rates do not transfer to this scheme, but we demonstrate by numerical experiments that it is possible to choose a parameter combination which yields a slightly faster convergence. We believe that the novel idea of accelerating the BFGS update is extremely valuable, as until now, acceleration techniques were only considered to improve gradient updates.

2.1 Outline

Our accelerated sketch-and-project algorithm for solving linear systems in Euclidean spaces is developed and analyzed in Section 3, and is used later in Section 4 to analyze an accelerated sketch-and-project algorithm for matrix inversion. The accelerated sketch-and-project algorithm for matrix inversion is then used to accelerate the BFGS update, which in turn leads to the development of an accelerated BFGS optimization method. Lastly, in Section 5, we perform numerical experiments to gain different insights into the newly developed methods. Proofs of all results and additional insights can be found in the appendix.

3 Accelerated Stochastic Algorithm for Matrix Inversion

In this section we propose an accelerated randomized algorithm for solving linear systems in Euclidean spaces. This is a very general problem class which comprises the matrix inversion problem as well. Thus, we will use the result of this section later to analyze our newly proposed matrix inversion algorithm, which we then use to estimate the inverse of the Hessian within a quasi-Newton method.²

Let 𝒳 and 𝒴 be finite dimensional Euclidean spaces and let A : 𝒳 → 𝒴 be a linear operator. Let L(𝒳, 𝒴) denote the space of linear operators that map from 𝒳 to 𝒴. Consider the linear system

A x = b,    (6)

where x ∈ 𝒳 and b ∈ Range(A). Consequently, there exists a solution to equation (6).
In particular, we aim to find the solution closest to a given initial point x_0 ∈ 𝒳:

x* := argmin_{x ∈ 𝒳} (1/2) ||x − x_0||²    subject to    A x = b.    (7)

Using the pseudoinverse and Lemma 22, item (vi), the solution to (7) is given by

x* = x_0 − A†(A x_0 − b) ∈ x_0 + Range(A*),    (8)

where A† and A* denote the pseudoinverse and the adjoint of A, respectively.

3.1 The algorithm

Let 𝒵 be a Euclidean space and consider a random linear operator S_k ∈ L(𝒴, 𝒵) chosen from some distribution D over L(𝒴, 𝒵) at iteration k. Our method is given in Algorithm 1, where Z_k ∈ L(𝒳) is a random linear operator given by the composition

Z_k = Z(S_k) := A* S_k* (S_k A A* S_k*)† S_k A.    (9)

The updates of the variables g_k and x_{k+1} on lines 6 and 7, respectively, correspond to what is known as the sketch-and-project update:

x_{k+1} = argmin_{x ∈ 𝒳} (1/2) ||x − y_k||²    subject to    S_k A x = S_k b,    (10)

which can also be written as the following operation:

x_{k+1} − x* = (I − Z_k)(y_k − x*).    (11)

This follows from the fact that b ∈ Range(A), together with item (i) of Lemma 22. Furthermore, note that the adjoint A* and the pseudoinverse in Algorithm 1 are taken with respect to the norm in (7).

Algorithm 1 Accelerated Sketch-and-Project for solving (10) [27]
1: Parameters: μ, ν > 0, D = distribution over random linear operators.
2: Choose x_0 ∈ 𝒳 and set v_0 = x_0, β = 1 − sqrt(μ/ν), γ = sqrt(1/(μν)), α = 1/(1 + γν).
3: for k = 0, 1, . . . do
4:   y_k = α v_k + (1 − α) x_k
5:   Sample an independent copy S_k ∼ D
6:   g_k = A* S_k* (S_k A A* S_k*)† S_k (A y_k − b) = Z_k (y_k − x*)
7:   x_{k+1} = y_k − g_k
8:   v_{k+1} = β v_k + (1 − β) y_k − γ g_k
9: end for

Algorithm 1 was first proposed and analyzed by Richtárik and Takáč [27] for the special case when 𝒳 = R^n and 𝒴 = R^m.
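For concreteness, the following NumPy sketch (our illustration, not the authors' code) instantiates Algorithm 1 for a symmetric positive definite system in R^n with B = A and rank-one coordinate sketches S_k = e_i, where i is sampled with probability proportional to A_ii; this is the setting analyzed later in Section 3.4, whose closed-form values of μ and ν we assume here. In this setting the projection step g_k reduces to a single coordinate update.

```python
import numpy as np

def accelerated_sketch_and_project(A, b, iters=3000, seed=0):
    """Algorithm 1 specialised to B = A and coordinate sketches S_k = e_i,
    with i sampled with probability A_ii / Tr(A). The projection step then
    reduces to g_k = e_i (A y_k - b)_i / A_ii (a randomized coordinate
    update). mu, nu are the closed-form values of Section 3.4."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    diag = np.diag(A)
    mu = np.linalg.eigvalsh(A)[0] / np.trace(A)   # lambda_min(A) / Tr(A)
    nu = np.trace(A) / diag.min()                 # Tr(A) / min_i A_ii
    beta = 1.0 - np.sqrt(mu / nu)
    gamma = np.sqrt(1.0 / (mu * nu))
    alpha = 1.0 / (1.0 + gamma * nu)
    probs = diag / diag.sum()
    x = np.zeros(n)
    v = x.copy()
    for _ in range(iters):
        y = alpha * v + (1.0 - alpha) * x             # line 4
        i = rng.choice(n, p=probs)                    # line 5
        g = np.zeros(n)
        g[i] = (A[i] @ y - b[i]) / A[i, i]            # lines 6-7 direction
        x = y - g                                     # line 7
        v = beta * v + (1.0 - beta) * y - gamma * g   # line 8
    return x
```

With these exact parameters the iterates contract at the accelerated rate of Theorem 3 below, at the cost of one row access per iteration.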
Our contribution here is in extending the algorithm and analysis to the more abstract setting of Euclidean spaces. In addition, we provide some further extensions of this method in Sections D and E, allowing for a non-unit stepsize and a variable α, respectively.

² Quasi-Newton methods do not compute an exact matrix inverse; rather, they only compute an incremental update. Thus, it suffices to apply one step of our proposed scheme per iteration. This will be detailed in Section 4.

3.2 Key assumptions and quantities

Denote Z = Z(S) for S ∼ D. Assume that the exactness property holds:

Null(A) = Null(E[Z]);    (12)

this is also equivalent to Range(A*) = Range(E[Z]). The exactness assumption is of key importance in the sketch-and-project framework, and indeed it is not very strong. For example, it holds for the matrix inversion problem with every sketching strategy we consider. We further assume that A ≠ 0 and that E[Z] is finite. First we collect a few observations on the operator Z.

Lemma 1. The operator Z defined in (9) is a self-adjoint positive projection. Consequently, E[Z] is a self-adjoint positive operator.

The two parameters that govern the acceleration are

μ := inf_{x ∈ Range(A*)} <E[Z]x, x> / <x, x>,    ν := sup_{x ∈ Range(A*)} <E[Z E[Z]† Z]x, x> / <E[Z]x, x>.    (13)

The supremum in the definition of ν is well defined due to the exactness assumption together with A ≠ 0.

Lemma 2.
We have

1 ≤ ν ≤ 1/μ = ||E[Z]†||.    (14)

Moreover, if Range(A*) = 𝒳, we have

Rank(A*) / E[Rank(Z)] ≤ ν.    (15)

3.3 Convergence and change of the norm

For a positive self-adjoint G ∈ L(𝒳) and x ∈ 𝒳, let ||x||_G := sqrt(<Gx, x>). We now informally state the convergence rate of Algorithm 1. Theorem 3 generalizes the main theorem from [27] to linear systems in Euclidean spaces.

Theorem 3. Let x_k, v_k be the random iterates of Algorithm 1. Then

E[ ||v_k − x*||²_{E[Z]†} + (1/μ) ||x_k − x*||² ] ≤ (1 − sqrt(μ/ν))^k E[ ||v_0 − x*||²_{E[Z]†} + (1/μ) ||x_0 − x*||² ].

This theorem shows that the accelerated sketch-and-project algorithm converges linearly with a rate of 1 − sqrt(μ/ν), which translates to a total of O(sqrt(ν/μ) log(1/ε)) iterations to bring the error measure in Theorem 3 below ε > 0. This is in contrast with the non-accelerated sketch-and-project algorithm, which requires O((1/μ) log(1/ε)) iterations, as shown in [13] for solving linear systems. From (14), we have the bounds 1/sqrt(μ) ≤ sqrt(ν/μ) ≤ 1/μ. On one extreme, this inequality shows that the iteration complexity of the accelerated algorithm is at least as good as that of its non-accelerated counterpart. On the other extreme, the accelerated algorithm might require as little as the square root of the number of iterations of its non-accelerated counterpart. Since the cost of a single iteration of the accelerated algorithm is of the same order as that of the non-accelerated algorithm, this theorem shows that acceleration can offer a significant speed-up, which is verified numerically in Section 5. It is also possible to obtain the convergence rate of accelerated sketch-and-project where projections are taken with respect to a different weighted norm.
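The quantities (13) and the bounds of Lemma 2 are easy to check numerically. The NumPy sketch below (our illustration, not from the paper) takes B = I and rank-one row sketches, so that Z(S) is the orthogonal projection onto the span of the sampled row of A, then forms E[Z] and E[Z E[Z]† Z] explicitly and verifies 1 ≤ ν ≤ 1/μ together with the rank bound (15).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))     # full column rank, so Range(A^T) = R^4
m, n = A.shape
norms2 = (A ** 2).sum(axis=1)
p = norms2 / norms2.sum()           # sample row i with prob. ~ ||a_i||^2

# With B = I and S = e_i^T, Z(S) = a_i a_i^T / ||a_i||^2: the orthogonal
# projection onto span(a_i), where a_i is the i-th row of A.
Zs = [np.outer(A[i], A[i]) / norms2[i] for i in range(m)]
EZ = sum(pi * Zi for pi, Zi in zip(p, Zs))
EZ_pinv = np.linalg.pinv(EZ)
EZZZ = sum(pi * Zi @ EZ_pinv @ Zi for pi, Zi in zip(p, Zs))

mu = np.linalg.eigvalsh(EZ)[0]      # infimum in (13); Range(A^T) = R^n here
w, U = np.linalg.eigh(EZ)
EZ_isqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
nu = np.linalg.eigvalsh(EZ_isqrt @ EZZZ @ EZ_isqrt)[-1]   # supremum in (13)
```

Since each rank-one Z has rank 1 and Rank(A*) = 4 in this example, (15) forces ν ≥ 4, while (14) caps it at 1/μ = 1/λ_min(E[Z]).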
For technical details, see Section B.4 of the Appendix.

3.4 Coordinate sketches with convenient probabilities

Let us consider a simple example in the setting of Algorithm 1 where we can understand the parameters μ, ν. In particular, consider a linear system Ax = b in R^n, where A is symmetric positive definite.

Corollary 4. Choose B = A and S = e_i with probability proportional to A_{i,i}. Then

μ = λ_min(A)/Tr(A) =: μ_P    and    ν = Tr(A)/min_i A_{i,i} =: ν_P,    (16)

and therefore the convergence rate given in Theorem 3 for the accelerated algorithm is

(1 − sqrt(μ/ν))^k = (1 − sqrt(λ_min(A) · min_i A_{i,i}) / Tr(A))^k.    (17)

Rate (17) of our accelerated method is to be contrasted with the rate of the non-accelerated method: (1 − μ)^k = (1 − λ_min(A)/Tr(A))^k. Clearly, we gain from acceleration if the smallest diagonal element of A is significantly larger than the smallest eigenvalue.

In fact, the parameters μ_P, ν_P above are the correct choice for the matrix inversion algorithm when symmetry is not enforced, as we shall see later. Unfortunately, we are not able to estimate the parameters while enforcing symmetry for different sketching strategies. We dedicate a section of the numerical experiments to testing whether the parameter selection (16) performs well under enforced symmetry and different sketching strategies, and also how one might safely choose μ, ν in practice.

4 Accelerated Stochastic BFGS Update

The update of the inverse Hessian used in quasi-Newton methods (e.g., in BFGS) can be seen as a sketch-and-project update applied to the linear system AX = I, while X = X^T is enforced, and where A denotes an approximation of the Hessian. In this section, we present an accelerated version of these updates. We provide two different proofs: one based on Theorem 3 and one based on vectorization.
By mimicking the updates of the accelerated stochastic BFGS method for inverting matrices, we determine a heuristic for accelerating the classic deterministic BFGS update. We then incorporate this acceleration into the classic BFGS optimization method and show that the resulting algorithm can offer a speed-up of the standard BFGS algorithm.

4.1 Accelerated matrix inversion

Consider a symmetric positive definite matrix A ∈ R^{n×n} and the following projection problem:

A^{-1} = argmin_X ||X||²_{F(A)}    subject to    AX = I, X = X^T,    (18)

where ||X||²_{F(A)} := Tr(A X^T A X) = ||A^{1/2} X A^{1/2}||²_F. This projection problem can be cast as an instantiation of the general projection problem (7). Indeed, we need only note that the constraint in (18) is linear and equivalent to 𝒜(X) = (I; 0), where the linear operator 𝒜 is defined by 𝒜(X) := (AX; X − X^T).

The matrix inversion problem can be efficiently solved using sketch-and-project with a symmetric sketch [15]. The symmetric sketch is given by S_k 𝒜(X) = (S_k^T A X; X − X^T), where S_k ∈ R^{n×τ} is a random matrix drawn from a distribution D and τ ∈ N. The resulting sketch-and-project method is as follows:

X_{k+1} = argmin_X ||X − X_k||²_{F(A)}    subject to    S_k^T A X = S_k^T, X = X^T,    (19)

the closed form solution of which is

X_{k+1} = S_k (S_k^T A S_k)^{-1} S_k^T + (I − S_k (S_k^T A S_k)^{-1} S_k^T A) X_k (I − A S_k (S_k^T A S_k)^{-1} S_k^T).    (20)

By observing that (20) is the sketch-and-project algorithm applied to a linear operator equation, we have constructed an accelerated version in Algorithm 2. We can also apply Theorem 3 to prove that Algorithm 2 is indeed accelerated.

Theorem 5. Let L_k := ||V_k − A^{-1}||²_M + (1/μ) ||X_k − A^{-1}||²_{F(A)}. The iterates of Algorithm 2 satisfy

E[L_{k+1}] ≤ (1 − sqrt(μ/ν)) E[L_k],    (21)

where ||X||²_M := Tr(A^{1/2} X^T A^{1/2} E[Z]† A^{1/2} X A^{1/2}).
Furthermore,

μ := inf_{X ∈ R^{n×n}} <E[Z]X, X> / <X, X> = λ_min(E[Z]),    ν := sup_{X ∈ R^{n×n}} <E[Z E[Z]† Z]X, X> / <E[Z]X, X>,    (22)

where

P := A^{1/2} S (S^T A S)^{-1} S^T A^{1/2},    Z = I ⊗ I − (I − P) ⊗ (I − P),    (23)

and Z : R^{n×n} → R^{n×n} is given by Z(X) = X − (I − P) X (I − P) = XP + PX(I − P). Moreover, λ_min(E[P]) ≤ λ_min(E[Z]) ≤ 2 λ_min(E[P]).

Notice that preserving symmetry yields μ = λ_min(E[Z]), which can be up to twice as large as λ_min(E[P]), the value of the μ parameter of the method without preserving symmetry. This improved rate is new, and was not present in the algorithm's debut publication [15]. In terms of parameter estimation, once symmetry is not preserved, we fall back onto the setting from Section 3.4. Unfortunately, we were not able to quantify the effect of enforcing symmetry on the parameter ν.

Algorithm 2 Accelerated BFGS matrix inversion (solving (18))
1: Parameters: μ, ν > 0, D = distribution over random linear operators.
2: Choose X_0 ∈ R^{n×n} and set V_0 = X_0, β = 1 − sqrt(μ/ν), γ = sqrt(1/(μν)), α = 1/(1 + γν).
3: for k = 0, 1, . . . do
4:   Y_k = α V_k + (1 − α) X_k
5:   Sample an independent copy S ∼ D
6:   X_{k+1} = Y_k − (Y_k A − I) S (S^T A S)^{-1} S^T − S (S^T A S)^{-1} S^T A Y_k + S (S^T A S)^{-1} S^T A Y_k A S (S^T A S)^{-1} S^T
7:   V_{k+1} = β V_k + (1 − β) Y_k − γ (Y_k − X_{k+1})
8: end for

4.2 Vectorizing: a different insight

In the previous section we argued that Theorem 5 follows from the more general convergence result established in Theorem 3 for Euclidean spaces. We now show an alternative way to prove Theorem 5. Define Vec : R^{n×n} → R^{n²} to be the vectorization operator that stacks columns, and denote x := Vec(X).
It can be shown that the sketch-and-project operation (19) for matrix inversion is equivalent to

x_{k+1} = argmin_x ||x − x_k||²_{A⊗A}    subject to    (I ⊗ S_k^T)(I ⊗ A) x = (I ⊗ S_k^T) Vec(I),  C x = 0,

where C is defined so that Cx = 0 if and only if X = X^T. The above is a sketch-and-project update for a linear system in R^{n²}, which allows us to obtain an alternative proof of Theorem 5 without using our results for Euclidean spaces. The details are provided in Section H.2 of the Appendix.

4.3 Accelerated BFGS as an optimization algorithm

As a tweak in the stochastic BFGS allows for a faster estimation of the inverse Hessian, and therefore more accurate steps of the method, one might wonder whether an equivalent tweak might speed up the standard, deterministic BFGS algorithm for solving (1). The mentioned tweaked version of standard BFGS is proposed as Algorithm 3. We do not state a convergence theorem for this algorithm (due to the deterministic updates, the analysis is currently elusive), nor do we propose to use it as a default solver; rather, we introduce it as a novel idea for accelerating optimization algorithms. We leave the theoretical analysis for future work. For now, we perform several numerical experiments in order to understand the potential and limitations of this new method.

Algorithm 3 BFGS method with accelerated BFGS update for solving (1)
1: Parameters: μ, ν > 0, stepsize η.
2: Choose X_0, w_0, and set V_0 = X_0, β = 1 − sqrt(μ/ν), γ = sqrt(1/(μν)), α = 1/(1 + γν).
3: for k = 0, 1, . . .
do
4:   w_{k+1} = w_k − η X_k ∇f(w_k)
5:   δ_k = w_{k+1} − w_k,   γ_k = ∇f(w_{k+1}) − ∇f(w_k)
6:   Y_k = α V_k + (1 − α) X_k
7:   X_{k+1} = δ_k δ_k^T / (γ_k^T δ_k) + (I − δ_k γ_k^T / (γ_k^T δ_k)) Y_k (I − γ_k δ_k^T / (γ_k^T δ_k))
8:   V_{k+1} = β V_k + (1 − β) Y_k − γ (Y_k − X_{k+1})
9: end for

To better understand Algorithm 3, recall that BFGS updates an estimate of the inverse Hessian via

X_{k+1} = argmin_X ||X − X_k||²_{F(A)}    subject to    X γ_k = δ_k,  X = X^T,    (24)

where δ_k = w_{k+1} − w_k and γ_k = ∇f(w_{k+1}) − ∇f(w_k). The above has the following closed form solution:

X_{k+1} = δ_k δ_k^T / (γ_k^T δ_k) + (I − δ_k γ_k^T / (γ_k^T δ_k)) X_k (I − γ_k δ_k^T / (γ_k^T δ_k)).

This update appears on line 7 of Algorithm 3, with the difference being that it is applied to the matrix Y_k.

5 Numerical Experiments

We perform extensive numerical experiments to bring additional insight into both the performance of, and the parameter selection for, Algorithms 2 and 3. More numerical experiments can be found in Section A of the appendix. We first test our accelerated matrix inversion algorithm, and subsequently perform experiments related to Section 4.3.

5.1 Accelerated Matrix Inversion

We consider the problem of inverting a symmetric positive definite matrix A. We focus on a few particular choices of matrices A (specified when describing each experiment) that differ in their eigenvalue spectra. Three different sketching strategies are studied: coordinate sketches with convenient probabilities (S = e_i with probability proportional to A_{i,i}), coordinate sketches with uniform probabilities (S = e_i with probability 1/n), and Gaussian sketches (S ∼ N(0, I)). As matrices to be inverted, we use both artificially generated matrices with access to the spectrum and Hessians of ridge regression problems from LIBSVM.

We have shown earlier that μ, ν can be estimated as per (16) for coordinate sketches with convenient probabilities, without enforcing symmetry.
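To make the experimental setup concrete, here is a minimal NumPy sketch (ours, not the authors' code) of Algorithm 2 with coordinate sketches with convenient probabilities, using the parameter estimates (16). Note the hedge baked into the experiments: (16) is derived without the symmetry constraint and is used here as a heuristic when symmetry is enforced. All iterates remain symmetric by construction.

```python
import numpy as np

def accelerated_matrix_inversion(A, iters=4000, seed=0):
    """Algorithm 2 with rank-one coordinate sketches S = e_i, where i is
    sampled with probability A_ii / Tr(A), and mu, nu set as in (16).
    For S = e_i the projector S (S^T A S)^{-1} S^T is e_i e_i^T / A_ii,
    and the X update is the symmetric closed form (20) applied to Y_k."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    diag = np.diag(A)
    probs = diag / diag.sum()
    mu = np.linalg.eigvalsh(A)[0] / np.trace(A)   # mu_P from (16)
    nu = np.trace(A) / diag.min()                 # nu_P from (16)
    beta = 1.0 - np.sqrt(mu / nu)
    gamma = np.sqrt(1.0 / (mu * nu))
    alpha = 1.0 / (1.0 + gamma * nu)
    X = np.zeros((n, n))
    V = X.copy()
    I = np.eye(n)
    for _ in range(iters):
        Y = alpha * V + (1.0 - alpha) * X
        i = rng.choice(n, p=probs)
        Lam = np.zeros((n, n))
        Lam[i, i] = 1.0 / A[i, i]                 # S (S^T A S)^{-1} S^T
        M = I - Lam @ A
        X_new = Lam + M @ Y @ M.T                 # update (20) applied to Y
        V = beta * V + (1.0 - beta) * Y - gamma * (Y - X_new)
        X = X_new
    return X
```

Since X_new = Λ + M Y Mᵀ is symmetric whenever Y is, every iterate X_k and V_k stays symmetric, which is what allows these estimates to be used inside a quasi-Newton method.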
We use the mentioned parameters for the other sketching strategies while enforcing symmetry. Since in practice one might not have access to the exact parameters μ, ν for a given sketching strategy, we also test the sensitivity of the algorithm to the parameter choice: we take ν as given by (16) together with μ = 1/(100ν) and μ = 1/(10000ν).

[Figure 1: four panels plotting residual against time in seconds.] From left to right: (i) Eigenvalues of A ∈ R^{100×100} are 1, 10³, 10³, . . . , 10³ and coordinate sketches with convenient probabilities are used. (ii) Eigenvalues of A ∈ R^{100×100} are 1, 2, . . . , n and Gaussian sketches are used. Label "nsym" indicates non-enforced symmetry and "-a" indicates acceleration. (iii) Epsilon dataset (n = 2000), coordinate sketches with uniform probabilities. (iv) SVHN dataset (n = 3072), coordinate sketches with convenient probabilities. Label "h" indicates that λ_min was not precomputed, but μ was chosen as described in the text.

For more plots, see Section A in the appendix, as here we provide only a small fraction of all plots. The experiments suggest that once the parameters μ, ν are estimated exactly, we get a speedup compared to the non-accelerated method, and the amount of speedup depends on the structure of A and on the sketching strategy. We observe from Figure 1 that we gain a great speedup for ill conditioned
We observe from Figure 1 that we gain a great speedup for ill conditioned\nproblems once the eigenvalues are concentrated around the largest eigenvalue. We also observe\nfrom Figure 1 that enforcing symmetry combines well with \u00b5, \u232b computed by (16), which does\nnot consider the symmetry. On top of that, choice of \u00b5, \u232b per (16) seems to be robust to different\nsketching strategies, and in worst case performs as fast as the nonaccelerated algorithm.\n\n5.2 BFGS Optimization Method\n\nWe test Algorithm 3 on several logistic regression problems using data from LIBSVM [7]. In all\nour tests we centered and normalized the data, included a bias term (a linear intercept), and choose\nthe regularization parameter as = 1/m, where m is the number of data points. To keep things as\nsimple as possible, we also used a \ufb01xed stepsize which was determined using grid search. Since\nour theory regarding the choice for the parameters \u00b5 and \u232b does not apply in this setting, we simply\nprobed the space of parameters manually and reported the best found result, see Figure 2. In the\nlegend we use BFGS-a-\u00b5-\u232b to denote the accelerated BFGS method (Alg 3) with parameters \u00b5 and \u232b.\nOn all four datasets, our method outperforms the classic BFGS method, indicating that replacing\nclassic BFGS update rules for learning the inverse Hessian by our new accelerated rules can be\nbene\ufb01cial in practice. In A.4 in the appendix we also show the time plots for solving the problems in\nFigure 2, and show that the accelerated BFGS method also converges faster in time.\n\n8\n\n\fFigure 2: Algorithm 3 (BFGS with accelerated matrix inversion quasi-Newton update) vs standard\nBFGS. From left to right: phishing, mushrooms, australian and splice dataset.\n\n6 Conclusions and Extensions\n\nWe developed an accelerated sketch-and-project method for solving linear systems in Euclidean\nspaces. 
The method was applied to invert positive definite matrices, while keeping their symmetric structure for all iterates. Our accelerated matrix inversion algorithm was then incorporated into an optimization framework to develop both accelerated stochastic and accelerated deterministic BFGS, which, to the best of our knowledge, are the first accelerated quasi-Newton updates.

We show that under a careful choice of the parameters of the method (depending on the problem structure and conditioning) acceleration might result in significant speedups, both for the matrix inversion problem and for the stochastic BFGS algorithm. We confirm experimentally that our accelerated methods can lead to speed-ups when compared to the classical BFGS algorithm.

As a future line of research, it might be interesting to study the accelerated BFGS algorithm (either deterministic or stochastic) further and provide a convergence analysis on a suitable class of functions. Another interesting area of research might be to combine accelerated BFGS with limited memory [17], or to engineer the method so that it can efficiently compete with first order algorithms for some empirical risk minimization problems, such as, for example, [12].

As we show in this work, Nesterov's acceleration can be applied to quasi-Newton updates. We believe this is a surprising fact, as quasi-Newton updates have not been understood as optimization algorithms, which prevented the idea of applying acceleration in this context.

Since second-order methods are becoming more and more ubiquitous in machine learning and data science, we hope that our work will motivate further advances at the frontiers of big data optimization.

References

[1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time.
The Journal of Machine Learning Research, 18(1):4148–4187, 2017.

[2] Albert S Berahas, Raghu Bollapragada, and Jorge Nocedal. An investigation of Newton-sketch and subsampled Newton methods. CoRR, abs/1705.06211, 2017.

[3] Albert S Berahas, Jorge Nocedal, and Martin Takáč. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pages 1055–1063, 2016.

[4] Charles G Broyden. Quasi-Newton methods and their application to function minimisation. Mathematics of Computation, 21(99):368–381, 1967.

[5] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

[6] Richard H Byrd, Samantha L Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

[7] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, May 2011.

[8] Frank Curtis. A self-correcting variable-metric algorithm for stochastic optimization. In International Conference on Machine Learning, pages 632–641, 2016.

[9] Charles A Desoer and Barry H Whalen. A note on pseudoinverses. Journal of the Society of Industrial and Applied Mathematics, 11(2):442–447, 1963.

[10] Roger Fletcher. A new approach to variable metric algorithms. The Computer Journal, 13(3):317–322, 1970.

[11] Donald Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of Computation, 24(109):23–26, 1970.

[12] Robert M Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block BFGS: Squeezing more curvature out of data.
In International Conference on Machine Learning, pages 1869–1878, 2016.

[13] Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.

[14] Robert M Gower and Peter Richtárik. Stochastic dual ascent for solving linear systems. arXiv:1512.06890, 2015.

[15] Robert M Gower and Peter Richtárik. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM Journal on Matrix Analysis and Applications, 38(4):1380–1409, 2017.

[16] Stefan Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres, 35:355–357, 1937.

[17] Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

[18] Ji Liu and Stephen J Wright. An accelerated randomized Kaczmarz algorithm. Math. Comput., 85(297):153–178, 2016.

[19] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

[20] Aryan Mokhtari and Alejandro Ribeiro. Global convergence of online limited memory BFGS. The Journal of Machine Learning Research, 16:3151–3181, 2015.

[21] Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pages 249–258, 2016.

[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[23] Yurii Nesterov.
Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[24] Yurii Nesterov and Sebastian U Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

[25] Gert K Pedersen. Analysis Now. Graduate Texts in Mathematics. Springer New York, 1996.

[26] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

[27] Peter Richtárik and Martin Takáč. Stochastic reformulations of linear systems: accelerated method. Manuscript, October 2017.

[28] Peter Richtárik and Martin Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.

[29] Nicol N Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-Newton method for online convex optimization. In Artificial Intelligence and Statistics, pages 436–443, 2007.

[30] David F Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656, 1970.

[31] Sebastian U Stich. Convex Optimization with Random Pursuit. PhD thesis, ETH Zurich, 2014. Diss., Eidgenössische Technische Hochschule ETH Zürich, Nr. 22111.

[32] Sebastian U Stich, Christian L Müller, and Bernd Gärtner. Variable metric random pursuit. Mathematical Programming, 156(1):549–579, Mar 2016.

[33] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.

[34] Stephen Tu, Shivaram Venkataraman, Ashia C Wilson, Alex Gittens, Michael I Jordan, and Benjamin Recht. Breaking locality accelerates block Gauss-Seidel.
In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3482–3491, 2017.

[35] Xiao Wang, Shiqian Ma, Donald Goldfarb, and Wei Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927–956, 2017.

[36] Stephen J Wright. Coordinate descent algorithms. Math. Program., 151(1):3–34, June 2015.

[37] Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.

[38] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pages 3000–3008, 2016.