{"title": "Exploiting Numerical Sparsity for Efficient Learning : Faster Eigenvector Computation and Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 5269, "page_last": 5278, "abstract": "In this paper, we obtain improved running times for regression and top eigenvector computation for numerically sparse matrices. Given a data matrix $\\mat{A} \\in \\R^{n \\times d}$ where every row $a \\in \\R^d$ has $\\|a\\|_2^2 \\leq L$ and numerical sparsity $\\leq s$, i.e. $\\|a\\|_1^2 / \\|a\\|_2^2 \\leq s$, we provide faster algorithms for these problems for many parameter settings.\n\nFor top eigenvector computation, when $\\gap > 0$ is the relative gap between the top two eigenvectors of $\\mat{A}^\\top \\mat{A}$ and $r$ is the stable rank of $\\mat{A}$ we obtain a running time of $\\otilde(nd + r(s + \\sqrt{r s}) / \\gap^2)$ improving upon the previous best unaccelerated running time of $O(nd + r d / \\gap^2)$. As $r \\leq d$ and $s \\leq d$ our algorithm everywhere improves or matches the previous bounds for all parameter settings.\n\nFor regression, when $\\mu > 0$ is the smallest eigenvalue of $\\mat{A}^\\top \\mat{A}$ we obtain a running time of $\\otilde(nd + (nL / \\mu) \\sqrt{s nL / \\mu})$ improving upon the previous best unaccelerated running time of $\\otilde(nd + n L d / \\mu)$. This result expands when regression can be solved in nearly linear time from when $L/\\mu = \\otilde(1)$ to when $L / \\mu = \\otilde(d^{2/3} / (sn)^{1/3})$.\n\nFurthermore, we obtain similar improvements even when row norms and numerical sparsities are non-uniform and we show how to achieve even faster running times by accelerating using approximate proximal point \\cite{frostig2015regularizing} / catalyst \\cite{lin2015universal}. Our running times depend only on the size of the input and natural numerical measures of the matrix, i.e. 
eigenvalues and $\\ell_p$ norms, making progress on a key open problem regarding optimal running times for efficient large-scale learning.", "full_text": "Exploiting Numerical Sparsity for Efficient Learning: Faster Eigenvector Computation and Regression

Neha Gupta
Department of Computer Science
Stanford University
Stanford, CA USA
nehagupta@cs.stanford.edu

Aaron Sidford
Department of Management Science and Engineering
Stanford University
Stanford, CA USA
sidford@stanford.edu

Abstract

In this paper, we obtain improved running times for regression and top eigenvector computation for numerically sparse matrices. Given a data matrix A ∈ R^{n×d} where every row a ∈ R^d has ‖a‖_2^2 ≤ L and numerical sparsity at most s, i.e. ‖a‖_1^2/‖a‖_2^2 ≤ s, we provide faster algorithms for these problems in many parameter settings.

For top eigenvector computation, we obtain a running time of Õ(nd + r(s + √(rs))/gap^2), where gap > 0 is the relative gap between the top two eigenvalues of A^⊤A and r is the stable rank of A. This running time improves upon the previous best unaccelerated running time of O(nd + rd/gap^2), as r ≤ d and s ≤ d.

For regression, we obtain a running time of Õ(nd + (nL/μ)√(snL/μ)), where μ > 0 is the smallest eigenvalue of A^⊤A. This running time improves upon the previous best unaccelerated running time of Õ(nd + nLd/μ). This result expands the regimes where regression can be solved in nearly linear time from when L/μ = Õ(1) to when L/μ = Õ(d^{2/3}/(sn)^{1/3}).

Furthermore, we obtain similar improvements even when row norms and numerical sparsities are non-uniform, and we show how to achieve even faster running times by accelerating using approximate proximal point [9] / catalyst [15].
Our running times depend only on the size of the input and natural numerical measures of the matrix, i.e. eigenvalues and ℓ_p norms, making progress on a key open problem regarding optimal running times for efficient large-scale learning.

1 Introduction

Regression and top eigenvector computation are two of the most fundamental problems in learning, optimization, and numerical linear algebra. They are central tools for data analysis and among the simplest problems in a hierarchy of complex machine learning computational problems. Consequently, developing provably faster algorithms for these problems is often a first step towards deriving new theoretically motivated algorithms for large scale data analysis.

Both regression and top eigenvector computation are known to be efficiently reducible [10] to the more general and prevalent finite sum optimization problem of minimizing a convex function f decomposed into the sum of m functions f_1, ..., f_m, i.e. min_{x ∈ R^n} f(x) where f(x) = (1/m) Σ_{i ∈ [m]} f_i(x). This optimization problem encapsulates a variety of learning tasks where we have data points {(a_1, b_1), (a_2, b_2), ..., (a_n, b_n)} corresponding to feature vectors a_i and labels b_i, and we wish to find the predictor x that minimizes the average loss of predicting b_i from a_i using x, denoted by f_i(x).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Given the centrality of this problem to machine learning and optimization, over the past few years there have been extensive research efforts to design new provably efficient methods for solving this problem [12, 9, 13, 6, 20]. Using a variety of sampling techniques, impressive running time improvements have been achieved. The emphasis in this line of work has been on improving the dependence on the number of gradient evaluations of the f_i that need to be performed, i.e.
improving the dependence on m, as well as improving the dependence on other problem parameters.

Much less studied is the question of what structural assumptions on the f_i allow even faster running times to be achieved. A natural and fundamental question in this space is when we can achieve faster running times by computing the gradients of the f_i approximately, thereby decreasing iteration costs. While there has been work on combining coordinate descent methods with these stochastic methods [13], in the simple cases of regression and top eigenvector computation these methods do not yield any improvement in iteration cost. More broadly, we are unaware of previous work on linearly convergent algorithms with faster running times for finite sum problems through this approach.

In this paper, we advance our understanding of the computational power of subsampling gradients of the f_i for the problems of top eigenvector computation and regression. In particular, we show that under assumptions of numerical sparsity of the input matrix we can achieve provably faster algorithms and new nearly linear time algorithms for a broad range of parameters. We achieve our results by applying coordinate sampling techniques to Stochastic Variance Reduced Gradient Descent (SVRG) [13, 12], a popular tool for finite sum optimization, along with linear algebraic data structures (in the case of eigenvector computation) that we believe may be of independent interest.

The results in this paper constitute an important step towards resolving a key gap in our understanding of optimal iterative methods for top eigenvector computation and regression. Ideally, the running times for these problems would depend only on the size of the input, e.g. the number of non-zero entries in the input data matrix, row norms, eigenvalues, etc.
However, this is not the case for the current fastest regression algorithms, as these methods work by picking rows of the matrix non-uniformly, yielding expected iteration costs that depend on brittle weighted sparsity measures (which for simplicity are typically instead stated in terms of the maximum sparsity among all rows; see Section 1.4.1). This causes particularly unusual running times for related problems like nuclear norm estimation [17]. This paper takes an important step towards resolving this problem by providing running times for top eigenvector computation and regression that depend only on the size of the input and natural numerical quantities like eigenvalues, ℓ1-norms, ℓ2-norms, etc. While our running times do not strictly dominate those based on the sparsity structure of the input (and it is unclear if such running times are possible), they improve upon the previous work in many settings. Ultimately, we hope this paper provides useful tools for even faster algorithms for solving large scale learning problems.

1.1 The Problems

Throughout this paper we let A ∈ R^{n×d} denote a data matrix with rows a_1, ..., a_n ∈ R^d. We let sr(A) := ‖A‖_F^2/‖A‖_2^2 denote the stable rank of A and we let nnz(A) denote the number of non-zero entries of A. For symmetric M ∈ R^{d×d} we let λ_1(M) ≥ λ_2(M) ≥ ... ≥ λ_d(M) denote its eigenvalues, ‖x‖_M^2 := x^⊤Mx, and gap(M) := (λ_1(M) − λ_2(M))/λ_1(M) denote its (relative) eigenvalue gap. For convenience we let gap := gap(A^⊤A), λ_1 := λ_1(A^⊤A), μ := λ_d(A^⊤A), sr := sr(A), nnz := nnz(A), κ := ‖A‖_F^2/μ, and κ_max := λ_1/μ. With this notation, we consider the following two optimization problems.

Definition 1 (Top Eigenvector Problem) Find v* ∈ R^d such that

  v* = argmax_{x ∈ R^d, ‖x‖_2 = 1} x^⊤A^⊤Ax.

We call v an ε-approximate solution to the problem if ‖v‖_2 = 1 and v^⊤A^⊤Av ≥ (1 − ε)λ_1(A^⊤A).

Definition 2 (Regression Problem) Given b ∈ R^n, find x* ∈ R^d such that

  x* = argmin_{x ∈ R^d} ‖Ax − b‖_2^2.

Given an initial x_0 ∈ R^d, we call x an ε-approximate solution if ‖x − x*‖_{A^⊤A} ≤ ε‖x_0 − x*‖_{A^⊤A}.

Each of these is known to be reducible to the finite sum optimization problem. The regression problem is equivalent to the finite sum problem with f_i(x) := (m/2)(a_i^⊤x − b_i)^2, and the top eigenvector problem is reducible, with only polylogarithmic overhead, to the finite sum problem with f_i(x) := λ‖x − x_0‖_2^2 − (m/2)(a_i^⊤(x − x_0))^2 + b_i^⊤x for carefully chosen λ and x_0 [10].

1.2 Our Results

In this paper, we provide improved iterative methods for top eigenvector computation and regression that depend only on regularity parameters and not on the specific sparsity structure of the input. Rather than assuming uniform row sparsity as in previous work, our running times depend on the numerical sparsity of the rows of A, i.e. s_i := ‖a_i‖_1^2/‖a_i‖_2^2, which is at most the row sparsity but may be smaller. Note that our results, as stated, can be worse than the previous running times, which depend on the ℓ0 sparsity, in some parameter regimes.
For simplicity, we state our results in terms of the numerical sparsity alone. However, when the number of non-zero entries in a row is small, we can always use that row in full rather than sampling from it. This makes our results always at least as good as the previous results and strictly better in some parameter regimes.

1.2.1 Top Eigenvector Computation

For top eigenvector computation, we give an unaccelerated running time of Õ(nnz(A) + (1/(gap^2·λ_1)) Σ_i ‖a_i‖_2^2 (√s_i + √sr(A))√s_i) and an accelerated running time of Õ(nnz(A) + (nnz(A)^{3/4}/√gap)(Σ_i (‖a_i‖_2^2/λ_1)(√s_i + √sr(A))√s_i)^{1/4}), as compared to the previous best unaccelerated running time of Õ(nnz(A) + max_i nnz(a_i)·sr(A)/gap^2) and the previous best accelerated running time of Õ(nnz(A)^{3/4}(max_i nnz(a_i)·sr(A)/gap^2)^{1/4}), respectively.

In the simpler case of uniform row norms ‖a_i‖_2^2 = ‖a‖_2^2 and uniform row sparsity s_i = s, our unaccelerated running time becomes Õ(nnz(A) + (sr(A)/gap^2)(s + √(sr(A)·s))). To understand the relative strength of our results, we give an example of one parameter regime where our running times are strictly better than the previous ones. When the rows are numerically sparse, i.e. s = O(1), although nnz(a_i) = d, our running time of Õ(nnz(A) + (sr(A)/gap^2)√sr(A)) gives a significant improvement over the previous best running time of Õ(nnz(A) + d·sr(A)/gap^2), since sr(A) ≤ d.

1.2.2 Regression

For regression, we give an unaccelerated running time of Õ(nnz(A) + √κ Σ_i √s_i ‖a_i‖_2^2/μ) and an accelerated running time of Õ(nnz(A)^{2/3}·κ^{1/6}·(Σ_i √s_i ‖a_i‖_2^2/μ)^{1/3}). Our methods improve upon the previous best unaccelerated iterative methods, with runtime Õ(nnz(A) + κ·max_i nnz(a_i)), and the previous best accelerated iterative methods, with runtime Õ(nnz(A) + d·max_i nnz(a_i) + Σ_i √σ_i(A)·max_i nnz(a_i)), where σ_i(A) = a_i^⊤(A^⊤A)^{-1}a_i.

In the simpler case of uniform row norms ‖a_i‖_2^2 = ‖a‖_2^2 and uniform row sparsity s_i = s, our unaccelerated running time becomes Õ(nnz(A) + κ^{3/2}√s).

To understand the relative strength of our results, we give an example of one parameter regime where our running times are strictly better than the previous ones. Consider the case where κ = o(d^2) and the rows are numerically sparse, i.e. s = O(1), but max_i nnz(a_i) = d. In the particular case κ = d^{1.5}, our running time is Õ(nnz(A) + d^{2.25}) whereas the SVRG running time for regression is Õ(nnz(A) + d^{2.5}), so our running time is better.

1.3 Overview of Our Approach

We achieve these results by carefully adapting known techniques for the finite sum optimization problem to our setting. The starting point for our algorithms is Stochastic Variance Reduced Gradient Descent (SVRG) [12], a popular method for finite sum optimization. This method takes steps in the
This method takes steps in the\n\n3\n\n\fdirection of negative gradient in expectation and its convergence depends on a measure of variance of\nthe steps.\nWe apply SVRG to our problems where we carefully subsample the entries of the rows of the data\nmatrix so that we can compute steps that are the negative gradient in expectation in time possibly\nsublinear in the size of the row. There is an inherent issue in such a procedure, in that this can change\nthe shape of variance. Previous sampling methods for regression ensure that the variance can be\ndirectly related to the function error, whereas here such sampling methods give (cid:96)2 error, the bounding\nof which in terms of function error can be expensive.\nIt is unclear how to completely avoid this issue and we leave this as future work. Instead, to mitigate\nthis issue we provide several techniques for subsampling that ensure we can obtain signi\ufb01cant\ndecrease in this (cid:96)2 error for small increases in the number of samples we take per row (See Section 3).\nHere we crucially use that we have bounds on the numerical sparsity of rows of the data matrix and\nprove that we can use this to quantify this decrease.\nFormally, the sampling problem we have for each row is as follows. For each row ai at any point\nwe may receive some vector x and need to compute a random vector g with E[g] = aia(cid:62)\ni x and with\nE(cid:107)g(cid:107)2\ni x) for some value of \u03b1, as\nprevious methods do. However, instead we settle for a bound of the form E(cid:107)g(cid:107)2\n(cid:62)x)+\u03b2(cid:107)x(cid:107)2\n2.\nOur sampling schemes for this problem works as follows: For the outer ai, we sample from the\ncoordinates with probability proportional to the coordinate\u2019s absolute value, we take a few (more than\n(cid:62)x, we always take the dot\n1) samples to control the variance (Lemma 4). 
For the approximation of a_i^⊤x, we always take the dot product of x with the large coordinates of a_i, and we sample from the rest with probability proportional to the squared value of the coordinates of a_i, again taking more than one sample to control the variance (Lemma 5).

Carefully controlling the number of samples we take per row and picking the right distribution over rows gives our bounds for regression. For eigenvector computation the same broad techniques work, but a little more care needs to be taken to keep the iteration costs down, due to the structure of f_i(x) := λ‖x − x_0‖_2^2 − (m/2)(a_i^⊤(x − x_0))^2 + b_i^⊤x. Interestingly, for eigenvector computation the penalty from the ℓ2 error is in some sense smaller, due to the structure of the objective.

1.4 Previous Results

Here we briefly cover previous work on regression and eigenvector computation (Section 1.4.1), sparse finite sum optimization (Section 1.4.2), and matrix entrywise sparsification (Section 1.4.3).

1.4.1 Regression and Eigenvector Algorithms

There is an extensive amount of work on regression, eigenvector computation, and finite sum optimization, with far too many results to state, but we have tried to include the algorithms with the best known running times. The results for top eigenvector computation are stated in Table 1 and the results for regression are stated in Table 2. These algorithms' running times are stated in terms of a weighted ℓ0 sparsity measure over the rows and do not take into account the numerical sparsity, which is a natural parameter in which to state running times and is never larger than the ℓ0 sparsity.

1.4.2 Sparsity Structure

There has been some prior work on attempting to exploit sparsity structure for faster running times.
Particularly relevant is the work of [13] on combining coordinate descent and sampling schemes. This paper picks unbiased estimates of the gradient at each step by first picking a function and then picking a random coordinate, with a variance that decreases as time increases. Unfortunately, for regression and eigenvector computation, computing a partial derivative is as expensive as computing the gradient, and hence this method does not give improved running times for regression and top eigenvector computation.

1.4.3 Entrywise Sparsification

Another natural approach to obtaining the results of this paper would be to simply subsample the entries of A beforehand and use the result as a preconditioner for solving the problem. There have been multiple works on such entrywise sparsification, which we list in Table 3.

Table 1: Previous results for computing an ε-approximate top eigenvector (Definition 1); the second runtime assumes uniform row norms and sparsity.

  Power Method: Õ(nnz/gap); uniform: Õ(nd/gap)
  Lanczos Method: Õ(nnz/√gap); uniform: Õ(nd/√gap)
  Fast subspace embeddings + Lanczos method [7]: Õ(nnz + d·sr/max{gap^2.5, ε, ε^2.5}); uniform: Õ(nd + d·sr/max{gap^2.5, ε, ε^2.5})
  SVRG (assuming bounded row norms and warm start) [21]: Õ(nnz + d·sr^2/gap^2); uniform: Õ(nd + d·sr^2/gap^2)
  Shift & Invert Power method with SVRG [10]: Õ(nnz + d·sr/gap^2); uniform: Õ(nd + d·sr/gap^2)
  Shift & Invert Power method with Accelerated SVRG [10]: Õ(nnz + nnz^{3/4}(d·sr)^{1/4}/√gap); uniform: Õ(nd + (nd)^{3/4}(d·sr)^{1/4}/√gap)
  This paper: Õ(nnz + (1/(gap^2·λ_1)) Σ_i ‖a_i‖_2^2 (√s_i + √sr)√s_i); uniform: Õ(nd + (sr/gap^2)(s + √(sr·s)))
  This paper (accelerated): Õ(nnz + (nnz^{3/4}/√gap)(Σ_i (‖a_i‖_2^2/λ_1)(√s_i + √sr)√s_i)^{1/4}); uniform: Õ(nd + ((nd)^{3/4}/√gap)·sr^{1/4}(s + √(sr·s))^{1/4})

Table 2: Previous results for solving ε-approximate regression (Definition 2); the second runtime assumes uniform row norms and sparsity.

  Gradient Descent: Õ(nnz·κ_max); uniform: Õ(nd·κ_max)
  Conjugate Gradient Descent: Õ(nnz·√κ_max); uniform: Õ(nd·√κ_max)
  SVRG [12]: Õ(nnz + κd); uniform: Õ(nd + κd)
  Accelerated SVRG [4, 9, 15]: Õ(nnz + √(nκ)·d); uniform: Õ(nd + √(nκ)·d)
  Accelerated SVRG with leverage score sampling [3]: Õ(nnz + d·max_i nnz(a_i) + Σ_i √σ_i(A)·max_i nnz(a_i)), where σ_i(A) = a_i^⊤(A^⊤A)^{-1}a_i; uniform: Õ(nd + d^2 + √κ·d^{3/2})
  This paper: Õ(nnz + √κ Σ_i √s_i ‖a_i‖_2^2/μ); uniform: Õ(nd + √(κ^3·s))
  This paper (accelerated): Õ(nnz^{2/3}·κ^{1/6}·(Σ_i √s_i ‖a_i‖_2^2/μ)^{1/3}); uniform: Õ((nd)^{2/3}·κ^{1/2}·s^{1/6})

If we optimistically compare them to our approach, by supposing that their sparsity bounds are uniform (i.e.
every row has the same sparsity) and bounding their quality as preconditioners, the best of these would give bounds of Õ(nnz(A) + λ_max‖A‖_F^4/λ_min^3) [14], Õ(nnz(A) + √s·λ_max‖A‖_F^2/(n·λ_min^2)) [5], and Õ(nnz(A) + ‖A‖_F^2‖A‖_2^2·λ_max/(n·λ_min^3)) [1] for regression. The bound obtained by [14] depends on the square of the condition number and does not depend on the numerical sparsity structure of the matrix. The bound obtained by [5] is worse than our bound for matrices with equal row norms and uniform sparsity. Our running time for regression is Õ(nnz(A) + √κ Σ_i √s_i ‖a_i‖_2^2/μ). Our results are not always comparable to those of [1]: assuming uniform sparsity and row norms, the ratio of our runtime to that of [1] can be either greater or less than 1 depending on the values of the particular parameters, and hence the results are incomparable. Our results are always better than those obtained by [14].

2 Notation

Vector properties: For a ∈ R^d, let s(a) = ‖a‖_1^2/‖a‖_2^2 denote its numerical sparsity. For c ∈ {1, 2, ..., d}, let (Π_c(a))_i = a_i if i ∈ S, where S is a set of the c largest coordinates of a in absolute value, and 0 otherwise, and let Π̄_c(a) = a − Π_c(a). Let I_c(a) denote the set of indices of the c largest coordinates of a in absolute value and Ī_c(a) = [d] \ I_c(a), i.e. everything except the top c coordinates. Let ê_j denote the j-th standard basis vector, i.e. (ê_j)_i = 1 if i = j and 0 otherwise.

Other: Let [d] denote the set {1, 2, ..., d}. We use Õ notation to hide polylogarithmic factors in the input parameters and error rates.
Refer to Section 1.1 for other definitions.

3 Sampling techniques

In this section we provide our key tools for sampling from a matrix for both regression and eigenvector computation. First, we provide a technical lemma on numerical sparsity that we use throughout our analysis. Then, we provide and analyze the sampling distribution we use to sample from our matrix for SVRG. We use the same distribution for both applications, regression and eigenvector computation, and provide some analysis of the properties of this distribution. All proofs in this section are deferred to Appendix B.1.

We begin with a lemma at the core of the proofs of our sampling techniques. The lemma essentially states that for a numerically sparse vector, most of the ℓ2-mass of the vector is concentrated in its top few coordinates. Consequently, if a vector is numerically sparse then we can remove a few big coordinates from it and reduce its ℓ2 norm considerably. Later, in our sampling schemes, we use this lemma to bound the variance of sampling a vector.

Lemma 3 (Numerical Sparsity) For a ∈ R^d and c ∈ [d], we have ‖Π̄_c(a)‖_2^2 ≤ s(a)‖a‖_2^2/c.

The following lemmas state the sampling distribution that we use for sampling the gradient in SVRG. Since we want to approximate the gradient of f(x) = (1/2)x^⊤A^⊤Ax − b^⊤x, i.e. A^⊤Ax − b, we would like to sample A^⊤Ax = Σ_{i ∈ [n]} a_i a_i^⊤ x.

We show how to perform this sampling and analyze it in several steps. In Lemma 4 we show how to sample from a, and in Lemma 5 we show how to sample from a^⊤x. In Lemma 6 we put these together to sample from aa^⊤x, and in Lemma 7 we put it all together to sample from A^⊤Ax.
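The bound of Lemma 3 is easy to verify numerically. A small pure-Python sanity check (the vector and helper names here are illustrative, not from the paper):

```python
def numerical_sparsity(a):
    """s(a) = ||a||_1^2 / ||a||_2^2, always between 1 and d."""
    l1 = sum(abs(x) for x in a)
    l2sq = sum(x * x for x in a)
    return l1 * l1 / l2sq

def tail_after_top_c(a, c):
    """||Pi-bar_c(a)||_2^2: squared l2 mass outside the c largest |a_j|."""
    order = sorted(range(len(a)), key=lambda j: -abs(a[j]))
    return sum(a[j] * a[j] for j in order[c:])

a = [8.0, 4.0, 2.0, 1.0, 0.5, 0.25]
s = numerical_sparsity(a)
l2sq = sum(x * x for x in a)
# Lemma 3: dropping the top c coordinates leaves at most s(a) * ||a||_2^2 / c
checks = [tail_after_top_c(a, c) <= s * l2sq / c for c in range(1, len(a) + 1)]
```

For this geometrically decaying vector, s(a) ≈ 2.9 even though all six coordinates are non-zero, which is exactly the regime where the sampling schemes below win.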
The distributions and our guarantees on them are stated below.

Algorithm 1: Samplevec(a, c)
1: for t = 1, ..., c (i.i.d. trials) do
2:   randomly sample an index j_t with Pr(j_t = j) = p_j = |a_j|/‖a‖_1 for all j ∈ [d]
3: end for
4: Output (1/c) Σ_{t=1}^{c} (a_{j_t}/p_{j_t}) ê_{j_t}

Algorithm 2: Sampledotproduct(a, x, c)
1: for t = 1, ..., c (i.i.d. trials) do
2:   randomly sample an index j_t with Pr(j_t = j) = p_j = a_j^2/‖Π̄_c(a)‖_2^2 for all j ∈ Ī_c(a)
3: end for
4: Output Π_c(a)^⊤x + (1/c) Σ_{t=1}^{c} a_{j_t} x_{j_t}/p_{j_t}

Algorithm 3: Samplerankonemat(a, x, c)
1: â_c = Samplevec(a, c)
2: ẑ_c = Sampledotproduct(a, x, c)
3: Output â_c · ẑ_c

Algorithm 4: Samplemat(A, x, k)
1: c_i = √s_i · k for all i ∈ [n]
2: M = Σ_i ‖a_i‖_2^2 (1 + s_i/c_i)
3: Select a row index i with probability p_i = ‖a_i‖_2^2 (1 + s_i/c_i)/M
4: ĝ_{c_i} = Samplerankonemat(a_i, x, c_i)
5: Output (1/p_i) ĝ_{c_i}

Lemma 4 (Stochastic Approximation of a) Let a ∈ R^d and c ∈ N, and let our estimator be â_c = Samplevec(a, c) (Algorithm 1). Then

  E[â_c] = a and E[‖â_c‖_2^2] ≤ ‖a‖_2^2 (1 + s(a)/c).

Lemma 5 (Stochastic Approximation of a^⊤x) Let a, x ∈ R^d and c ∈ [d], and let our estimator be ẑ_c = Sampledotproduct(a, x, c) (Algorithm 2). Then

  E[ẑ_c] = a^⊤x and E[ẑ_c^2] ≤ (a^⊤x)^2 + (1/c)‖Π̄_c(a)‖_2^2 ‖x‖_2^2.

Lemma 6 (Stochastic Approximation of aa^⊤x) Let a, x ∈ R^d and c ∈ [d], and let the estimator be ĝ_c = Samplerankonemat(a, x, c) (Algorithm 3). Then

  E[ĝ_c] = aa^⊤x and E[‖ĝ_c‖_2^2] ≤ ‖a‖_2^2 (1 + s(a)/c) ((a^⊤x)^2 + (s(a)/c^2)‖a‖_2^2 ‖x‖_2^2).

Lemma 7 (Stochastic Approximation of A^⊤Ax) Let A ∈ R^{n×d} with rows a_1, a_2, ..., a_n, let x ∈ R^d, and let Ĝ_k = Samplemat(A, x, k) (Algorithm 4), where k is some parameter. Then

  E[Ĝ_k] = A^⊤Ax and E[‖Ĝ_k‖_2^2] ≤ M (‖Ax‖_2^2 + (1/k^2)‖A‖_F^2 ‖x‖_2^2).

4 Applications

Using the SVRG framework stated in Theorem 14 and the sampling techniques presented in Section 3, we now state how we solve our two problems, regression and top eigenvector computation.

4.1 Eigenvector computation

The classic method to estimate the top eigenvector of a matrix is the power method, which starts with an initial vector x_0 and repeatedly multiplies it by A^⊤A; the iterate eventually converges to the top eigenvector of A^⊤A provided the top eigenvalue is well separated from the other eigenvalues, i.e. the gap is large enough. The number of iterations required for convergence is O(log(d/ε)/gap). However, this method can be very slow when the gap is small.
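For reference, the basic power method just described fits in a few lines of pure Python (run here on a tiny illustrative 2×2 matrix standing in for A^⊤A; the shift-and-invert approach discussed next replaces the matrix multiplication with approximate linear system solves in B):

```python
import math

def power_method(mat, iters, x0):
    """Estimate the top eigenvector of a symmetric matrix by repeated
    multiplication and normalization; error decays roughly like (1 - gap)^iters."""
    x = x0[:]
    for _ in range(iters):
        y = [sum(row[j] * x[j] for j in range(len(x))) for row in mat]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

# stand-in for A^T A with eigenvalues 4 and 1 (gap = 0.75), top eigenvector e_1
ata = [[4.0, 0.0], [0.0, 1.0]]
v = power_method(ata, iters=30, x0=[1.0, 1.0])
# Rayleigh quotient v^T (A^T A) v should approach the top eigenvalue
rayleigh = sum(v[i] * sum(ata[i][j] * v[j] for j in range(2)) for i in range(2))
```

With gap = 0.75, thirty iterations already isolate the top eigenvector to machine precision; as the gap shrinks, the required iteration count O(log(d/ε)/gap) blows up, which is what motivates the shift-and-invert approach.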
If the gap is small, one way to improve the convergence rate is to run the power method on the matrix B^{-1}, where B = λI − A^⊤A. B^{-1} has the same top eigenvector as A^⊤A, and its relative eigenvalue gap is (1/(λ − λ_1) − 1/(λ − λ_2))/(1/(λ − λ_1)) = 1/2 when λ = (1 + gap)λ_1, so we get a constant eigenvalue gap. Hence, if we have a rough estimate of the largest eigenvalue of the matrix, we can make the gap parameter roughly constant. Section 6 of [10] shows how to obtain such an estimate, based on the gap-free eigenvalue estimation algorithm of [16], in running time proportional to that of a linear system solver for B, ignoring additional polylogarithmic factors. However, power iteration on B^{-1} requires solving linear systems in B, whose condition number now depends on 1/gap, and thus solving linear systems in B would become expensive. [10] showed how to solve the linear systems in B faster using SVRG [12], and thereby achieved a better overall running time for top eigenvector computation. The formal theorem statement is deferred to Theorem 17 in the appendix.

We use this framework for solving the eigenvector problem with SVRG and, on top of it, give a different sampling scheme for the SVRG solves with B^{-1} which reduces the runtime for numerically sparse matrices; specifically, we use the sampling scheme presented in Lemma 7. The following lemma states the variance bound that we get for the gradient updates of SVRG for the top eigenvector computation problem.
This will be used to obtain a bound for solving linear systems in B = λI − A^⊤A, which is ultimately used to solve the approximate top eigenvector problem.

Lemma 8 (Variance bound for eigenvector computation) Let ∇g(x) = λx − Ĝ_k, where Ĝ_k is the estimator of A^⊤Ax defined in Lemma 7 with k = √sr(A), and let f(x) = (1/2)x^⊤Bx − b^⊤x. Then

  E[∇g(x)] = (λI − A^⊤A)x and E[‖∇g(x) − ∇g(x*)‖_2^2] ≤ (f(x) − f(x*))·8M/gap,

with average time T = Σ_i ‖a_i‖_2^2 (s_i + √(s_i·sr(A)))/M to calculate ∇g(x), where M = Σ_i ‖a_i‖_2^2 (1 + √(s_i/sr(A))).

Now, using the variance of the gradient estimators and the per-iteration running time T obtained in Lemma 8, along with the SVRG framework [12] (stated in Theorem 14), we can get a constant multiplicative decrease of the error in solving linear systems in B = λI − A^⊤A in total running time O(nnz(A) + (1/(gap^2·λ_1(A^⊤A))) Σ_i ‖a_i‖_2^2 (√s_i + √sr(A))√s_i), assuming we have a crude approximation to the top eigenvector and eigenvalue, which we have already discussed how to obtain. The formal theorem statement (Theorem 18) and proof are deferred to the appendix. Now, using the linear system solver described above along with the shift-and-invert algorithmic framework, we get the following running time for the top eigenvector computation problem.
The proof appears in Appendix B.2.
Theorem 9 (Numerically Sparse Top Eigenvector Computation Runtime) The linear system solver from Theorem 18 combined with the shift-and-invert framework from [10] stated in Theorem 17 gives an algorithm which computes an $\epsilon$-approximate top eigenvector (Definition 1) in total running time
$$O\bigg(\bigg(\mathrm{nnz}(A) + \frac{1}{\mathrm{gap}^2\,\lambda_1}\sum_i \|a_i\|_2^2\big(\sqrt{s_i} + \sqrt{\mathrm{sr}(A)}\big)\sqrt{s_i}\bigg)\cdot\bigg(\log^2\Big(\frac{d}{\mathrm{gap}}\Big) + \log\Big(\frac{1}{\epsilon}\Big)\bigg)\bigg).$$
Similarly, using the acceleration framework of [9], stated in Theorem 15 in the appendix, together with the linear system solver runtime, we get the following accelerated running time for top eigenvector computation; the proof appears in Appendix B.2.
Theorem 10 (Numerically Sparse Accelerated Top Eigenvector Computation Runtime) The linear system solver from Theorem 18 combined with the acceleration framework from [9] stated in Theorem 15 and the shift-and-invert framework from [10] stated in Theorem 17 gives an algorithm which computes an $\epsilon$-approximate top eigenvector (Definition 1) in total running time
$$\tilde{O}\bigg(\mathrm{nnz}(A) + \frac{\mathrm{nnz}(A)^{3/4}}{\sqrt{\mathrm{gap}}}\bigg(\frac{\sum_i \|a_i\|_2^2\big(\sqrt{s_i} + \sqrt{\mathrm{sr}(A)}\big)\sqrt{s_i}}{\lambda_1}\bigg)^{1/4}\bigg),$$
where $\tilde{O}$ hides a factor of $\log^2\big(\frac{d}{\mathrm{gap}}\big)\big(\log\big(\frac{d}{\mathrm{gap}}\big) + \log\big(\frac{1}{\epsilon}\big)\big)$.

4.2 Linear Regression

In linear regression, we want to minimize $\frac{1}{2}\|Ax - b\|_2^2$, which is equivalent to minimizing $\frac{1}{2}x^\top A^\top A x - x^\top A^\top b = \frac{1}{2}\sum_i x^\top a_i a_i^\top x - x^\top A^\top b$, and hence we can apply the framework of SVRG [12] (stated in Theorem 14) for solving it.
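To make the use of SVRG concrete, the following is a minimal sketch (in NumPy; the function name, step size, and epoch length are illustrative, and this is the plain row-sampling variant, not the entry-sampling scheme developed below) of SVRG applied to this least-squares objective, sampling rows with probability proportional to $\|a_i\|_2^2$ and importance-weighting so each stochastic gradient is unbiased:

```python
import numpy as np

def svrg_least_squares(A, b, x0, step, epochs=50, m=None):
    """SVRG [12] for f(x) = 0.5 * ||Ax - b||_2^2.

    Row i is sampled with probability p_i proportional to ||a_i||_2^2 and the
    component gradient a_i (a_i^T x - b_i) is scaled by 1/p_i, so the
    stochastic gradient is an unbiased estimate of the full gradient.
    """
    n, d = A.shape
    m = m or 2 * n                           # inner-loop length (illustrative)
    row_sq = np.einsum('ij,ij->i', A, A)     # ||a_i||_2^2
    p = row_sq / row_sq.sum()
    rng = np.random.default_rng(0)
    x = x0.astype(float).copy()
    for _ in range(epochs):
        x_ref = x.copy()
        g_ref = A.T @ (A @ x_ref - b)        # full "anchor" gradient
        for _ in range(m):
            i = rng.choice(n, p=p)
            # variance-reduced update: grad_i(x)/p_i - grad_i(x_ref)/p_i + g_ref
            gi = A[i] * ((A[i] @ x) - b[i]) / p[i]
            gr = A[i] * ((A[i] @ x_ref) - b[i]) / p[i]
            x -= step * (gi - gr + g_ref)
    return x
```

With $p_i \propto \|a_i\|_2^2$, a step size on the order of $1/\sum_i \|a_i\|_2^2$ is a safe choice, since the importance-weighted components have smoothness $\|a_i\|_2^2/p_i = \sum_j \|a_j\|_2^2$.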
However, instead of selecting a complete row for calculating the gradient, we select only a few entries from the row to achieve a lower cost per iteration. In particular, we use the distribution defined in Lemma 7. Note that the sampling probabilities depend on $\lambda_d$ and we need to know a constant factor approximation of $\lambda_d$ for the scheme to work. For most ridge regression problems we know a lower bound on the value of $\lambda_d$, and we can obtain an approximation by performing a binary search over the possible values, paying an extra logarithmic factor. The following lemma states the sampling distribution which we use for approximating the true gradient and the corresponding variance that we obtain. The proof appears in Appendix B.2.
Lemma 11 (Variance Bound for Regression) Let $\nabla g(x) = (\widehat{A^\top A x})_k$ where $(\widehat{A^\top A x})_k$ is the estimator for $A^\top A x$ defined in Lemma 7 and $k = \sqrt{\kappa}$. Assuming $\kappa \le d^2$, we get
$$\mathbb{E}[\nabla g(x)] = A^\top A x \quad\text{and}\quad \mathbb{E}\big[\|\nabla g(x) - \nabla g(x^*)\|_2^2\big] \le M(f(x) - f(x^*)) \quad\text{where } M = \sum_{i\in[n]} \|a_i\|_2^2\sqrt{s_i},$$
with average time taken in calculating $\nabla g(x)$ of $T = \sqrt{\kappa}\sum_i \frac{\|a_i\|_2^2\sqrt{s_i}}{M}\big(1 + \sqrt{s_i/\kappa}\big)$, where $f(x) = \frac{1}{2}\|Ax - b\|_2^2$.
Using the variance bound obtained in Lemma 11 and the SVRG framework stated in Theorem 14 for solving approximate linear systems, we show how to obtain an algorithm for solving approximate regression which is faster in certain regimes when the corresponding matrix is numerically sparse.
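The primitive underlying such entry sampling can be illustrated as follows: a minimal sketch (the exact Lemma 7 construction is more involved; the function name and constants here are illustrative) of estimating a single inner product $a^\top x$ from $k$ coordinates sampled proportional to $|a_j|$, whose variance is governed by $\|a\|_1^2 = s\,\|a\|_2^2$, i.e. by the numerical sparsity $s$ rather than by the dimension $d$:

```python
import numpy as np

def sampled_dot(a, x, k, rng):
    """Unbiased estimate of <a, x> from k coordinates of a.

    Coordinates are drawn with probability q_j = |a_j| / ||a||_1 and
    importance-weighted, so E[a_j x_j / q_j] = <a, x>.  A single draw has
    second moment sum_j q_j (a_j x_j / q_j)^2 = ||a||_1 * sum_j |a_j| x_j^2,
    which is at most ||a||_1^2 ||x||_inf^2 = s ||a||_2^2 ||x||_inf^2.
    """
    l1 = np.abs(a).sum()
    q = np.abs(a) / l1                       # ell_1 sampling distribution
    idx = rng.choice(a.size, size=k, p=q)
    return np.mean(a[idx] * x[idx] / q[idx])  # importance-weighted average
```

Averaging over $k$ draws divides the variance by $k$, which is why choosing $k$ on the order of $\sqrt{\kappa}$ (or $\sqrt{\mathrm{sr}(A)}$ for the eigenvector problem) balances per-iteration cost against variance.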
The proof appears in Appendix B.2.
Theorem 12 (Numerically Sparse Regression Runtime) For solving $\epsilon$-approximate regression (Definition 2), if $\kappa \le d^2$, the SVRG framework from Theorem 14 and the variance bound from Lemma 11 give an algorithm with running time
$$O\bigg(\bigg(\mathrm{nnz}(A) + \frac{\sqrt{\kappa}\sum_{i\in[n]}\|a_i\|_2^2\sqrt{s_i}}{\mu}\bigg)\log\Big(\frac{1}{\epsilon}\Big)\bigg).$$
Combined with the additional acceleration framework mentioned in Theorem 15, we can get an accelerated algorithm for solving regression. The proof appears in Appendix B.2.2.
Theorem 13 (Numerically Sparse Accelerated Regression Runtime) For solving $\epsilon$-approximate regression (Definition 2), if $\kappa \le d^2$, the SVRG framework from Theorem 14, the acceleration framework from Theorem 15 and the variance bound from Lemma 11 give an algorithm with running time
$$O\bigg(\mathrm{nnz}(A)^{2/3}\kappa^{1/6}\bigg(\frac{\sum_{i\in[n]}\|a_i\|_2^2\sqrt{s_i}}{\mu}\bigg)^{1/3}\log(\kappa)\log\Big(\frac{1}{\epsilon}\Big)\bigg).$$

Acknowledgments

We would like to thank the anonymous reviewers who helped improve the readability and presentation of this draft by providing many helpful comments.

References

[1] Dimitris Achlioptas, Zohar S Karnin, and Edo Liberty. Near-optimal entrywise sampling for data matrices. In Advances in Neural Information Processing Systems, pages 1565-1573, 2013.

[2] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM (JACM), 54(2):9, 2007.

[3] Naman Agarwal, Sham Kakade, Rahul Kidambi, Yin Tat Lee, Praneeth Netrapalli, and Aaron Sidford. Leverage score sampling for faster accelerated regression and ERM. arXiv preprint arXiv:1711.08426, 2017.

[4] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods.
In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200-1205. ACM, 2017.

[5] Sanjeev Arora, Elad Hazan, and Satyen Kale. A fast random sampling algorithm for sparsifying matrices. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 272-279. Springer, 2006.

[6] Léon Bottou and Yann LeCun. Large scale online learning. In Advances in Neural Information Processing Systems, pages 217-224, 2004.

[7] Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, pages 81-90. ACM, 2013.

[8] Petros Drineas and Anastasios Zouzias. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Information Processing Letters, 111(8):385-389, 2011.

[9] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pages 2540-2548, 2015.

[10] Dan Garber, Elad Hazan, Chi Jin, Cameron Musco, Praneeth Netrapalli, Aaron Sidford, et al. Faster eigenvector computation via shift-and-invert preconditioning. In International Conference on Machine Learning, pages 2626-2634, 2016.

[11] Alex Gittens and Joel A Tropp. Error bounds for random matrix approximation schemes. arXiv preprint arXiv:0911.4108, 2009.

[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.

[13] Jakub Konečný, Zheng Qu, and Peter Richtárik. Semi-stochastic coordinate descent. Optimization Methods and Software, 32(5):993-1005, 2017.

[14] Abhisek Kundu and Petros Drineas.
A note on randomized element-wise matrix sparsification. arXiv preprint arXiv:1404.0320, 2014.

[15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384-3392, 2015.

[16] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems, pages 1396-1404, 2015.

[17] Cameron Musco, Praneeth Netrapalli, Aaron Sidford, Shashanka Ubaru, and David P Woodruff. Spectrum approximation beyond fast matrix multiplication: Algorithms and hardness. arXiv preprint arXiv:1704.04163, 2017.

[18] Nam H Nguyen, Petros Drineas, and Trac D Tran. Tensor sparsification via a bound on the spectral norm of random tensors. arXiv preprint arXiv:1005.4732, 2010.

[19] Nam H Nguyen, Petros Drineas, and Trac D Tran. Matrix sparsification via the Khintchine inequality. 2009.

[20] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.

[21] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning, pages 144-152, 2015.