{"title": "Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls", "book": "Advances in Neural Information Processing Systems", "page_first": 6191, "page_last": 6200, "abstract": "We propose a rank-k variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball. Our algorithm replaces the top singular-vector computation (1-SVD) in Frank-Wolfe with a top-k singular-vector computation (k-SVD), which can be done by repeatedly applying 1-SVD k times. Alternatively, our algorithm can be viewed as a rank-k restricted version of projected gradient descent. We show that our algorithm has a linear convergence rate when the objective function is smooth and strongly convex, and the optimal solution has rank at most k. This improves the convergence rate and the total time complexity of the Frank-Wolfe method and its variants.", "full_text": "Linear Convergence of a Frank-Wolfe Type\n\nAlgorithm over Trace-Norm Balls\u2217\n\nZeyuan Allen-Zhu\n\nMicrosoft Research, Redmond\n\nzeyuan@csail.mit.edu\n\nWei Hu\n\nPrinceton University\n\nhuwei@cs.princeton.edu\n\nElad Hazan\n\nPrinceton University\n\nehazan@cs.princeton.edu\n\nYuanzhi Li\n\nPrinceton University\n\nyuanzhil@cs.princeton.edu\n\nAbstract\n\nWe propose a rank-k variant of the classical Frank-Wolfe algorithm to solve convex\noptimization over a trace-norm ball. Our algorithm replaces the top singular-vector\ncomputation (1-SVD) in Frank-Wolfe with a top-k singular-vector computation\n(k-SVD), which can be done by repeatedly applying 1-SVD k times. Alternatively,\nour algorithm can be viewed as a rank-k restricted version of projected gradient\ndescent. We show that our algorithm has a linear convergence rate when the\nobjective function is smooth and strongly convex, and the optimal solution has rank\nat most k. 
This improves the convergence rate and the total time complexity of the Frank-Wolfe method and its variants.

1 Introduction

Minimizing a convex matrix function over a trace-norm ball, which is (recall that the trace norm ‖X‖* of a matrix X equals the sum of its singular values)

    min_{X ∈ R^{m×n}} { f(X) : ‖X‖* ≤ θ },    (1.1)

is an important optimization problem that serves as a convex surrogate to many low-rank machine learning tasks, including matrix completion [2, 10, 16], multiclass classification [4], phase retrieval [3], polynomial neural nets [12], and more. In this paper we assume without loss of generality that θ = 1.

One natural algorithm for Problem (1.1) is projected gradient descent (PGD). In each iteration, PGD first moves X in the direction of the gradient, and then projects it onto the trace-norm ball. Unfortunately, computing this projection requires the full singular value decomposition (SVD) of the matrix, which takes O(mn·min{m, n}) time in general. This prevents PGD from being efficiently applied to problems with large m and n.

Alternatively, one can use projection-free algorithms. As first proposed by Frank and Wolfe [5], one can select a search direction (usually the gradient direction) and perform a linear optimization over the constraint set in this direction. In the case of Problem (1.1), performing linear optimization over a trace-norm ball amounts to computing the top (left and right) singular vectors of a matrix, which can be done much faster than a full SVD. Therefore, projection-free algorithms become attractive for convex minimization over trace-norm balls.

Unfortunately, despite its low per-iteration complexity, the Frank-Wolfe (FW) algorithm suffers from a slower convergence rate than PGD.
When the objective f(X) is smooth, FW requires O(1/ε) iterations to converge to an ε-approximate minimizer, and this 1/ε rate is tight even if the objective is also strongly convex [6]. In contrast, PGD achieves a 1/√ε rate if f(X) is smooth (under Nesterov's acceleration [14]), and a log(1/ε) rate if f(X) is both smooth and strongly convex.

*The full version of this paper can be found on https://arxiv.org/abs/1708.02105.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Recently, there have been several results revising the FW method to improve its convergence rate for strongly-convex functions. The log(1/ε) rate was obtained when the constraint set is a polyhedron [7, 11], and the 1/√ε rate was obtained when the constraint set is strongly convex [8] or is a spectrahedron [6].

Among these results, the spectrahedron constraint (i.e., all positive semidefinite matrices X with Tr(X) = 1) studied by Garber [6] is almost identical to Problem (1.1), but slightly weaker.² When stating the result of Garber [6], we assume for simplicity that it also applies to Problem (1.1).

Our Question. In this paper, we propose to study the following general question:

Can we design a "rank-k variant" of Frank-Wolfe to improve the convergence rate? (That is, in each iteration it computes the top k singular vectors – i.e., a k-SVD – of some matrix.)

Our motivation to study the above question can be summarized as follows:
• Since FW computes a 1-SVD and PGD computes a full SVD in each iteration, is there a value k ≪ min{n, m} such that a rank-k variant of FW can achieve the convergence rate of PGD?
• Since computing a k-SVD costs roughly the same (sequential) time as "computing a 1-SVD k times" (see recent work [1, 13]),³ if using a rank-k variant of FW, can the number of iterations
be reduced by a factor of more than k? If so, then we can improve the sequential running time of FW.
• A k-SVD can be computed in a more distributed manner than a 1-SVD. For instance, using block Krylov [13], one can distribute the computation of a k-SVD to k machines, each in charge of independent matrix-vector multiplications. Therefore, it is beneficial to study a rank-k variant of FW in such settings.

1.1 Our Results

We propose blockFW, a rank-k variant of Frank-Wolfe. Given a convex function f(X) that is β-smooth, in each iteration t, blockFW performs an update X_{t+1} ← Xt + η(Vt − Xt), where η > 0 is a constant step size and Vt is a rank-k matrix computed from the k-SVD of (−∇f(Xt) + βηXt). If k = min{n, m}, blockFW can be shown to coincide with PGD, so it can also be viewed as a rank-k restricted version of PGD.

Convergence. Suppose f(X) is also α-strongly convex and suppose the optimal solution X* of Problem (1.1) has rank at most k. Then we show that blockFW achieves linear convergence: it finds an ε-approximate minimizer within O((β/α) log(1/ε)) iterations, or equivalently, in

    T = O( (kβ/α) · log(1/ε) )

computations of 1-SVD. We denote by T the number of 1-SVD computations throughout this paper. In contrast,

    T_FW = O( β/ε )   for Frank-Wolfe, and
    T_Gar = O( min{ (β/α)^{1/4} (β/ε)^{3/4} · (1/σmin(X*))^{1/2}, (β/α)^{1/2} (β/ε)^{1/2} · (1/σmin(X*)) } )   for Garber [6].

Above, σmin(X*) is the minimum non-zero singular value of X*.
Note that σmin(X*) ≤ ‖X*‖*/rank(X*) ≤ 1/k. We note that T_Gar is always outperformed by min{T, T_FW}: ignoring the log(1/ε) factor, we have

• min{ β/ε, kβ/α } ≤ (β/α)^{1/4} (β/ε)^{3/4} · k^{1/4} ≤ (β/α)^{1/4} (β/ε)^{3/4} · (1/σmin(X*))^{1/2}, and
• min{ β/ε, kβ/α } ≤ (β/α)^{1/2} (β/ε)^{1/2} · k^{1/2} ≤ (β/α)^{1/2} (β/ε)^{1/2} · (1/σmin(X*)),

where the first step in each line is a weighted geometric-mean bound and the second step uses σmin(X*) ≤ 1/k.

²To the best of our knowledge, given an algorithm that works for the spectrahedron, to solve Problem (1.1) one has to define a function g(Y) over (n+m)×(n+m) matrices by setting g(Y) = f(2·Y_{1:m, m+1:m+n}) [10]. After this transformation, the function g(Y) is no longer strongly convex, even if f(X) is strongly convex. In contrast, most algorithms for trace-norm balls, including FW and our later proposed algorithm, work as well for the spectrahedron after minor changes to the analysis.

³Using block Krylov [13], Lanczos [1], or SVRG [1], at least when k is small, the time complexity of (approximately) computing the top k singular vectors of a matrix is no more than k times the complexity of (approximately) computing the top singular vector of the same matrix.
We refer interested readers to [1] for details.

algorithm | rank | # iterations | time complexity per iteration
PGD [14] | min{m,n} | κ log(1/ε) | O( mn·min{m,n} )
accelerated PGD [14] | min{m,n} | √κ log(1/ε) | O( mn·min{m,n} )
Frank-Wolfe [9] | 1 | β/ε | Õ(nnz(∇)) × min{ ‖∇‖2^{1/2}/ε^{1/2}, ‖∇‖2^{1/2}/(σ1(∇)−σ2(∇))^{1/2} }
Garber [6] | 1 | κ^{1/4}(β/ε)^{3/4}·(1/σmin(X*))^{1/2}, or κ^{1/2}(β/ε)^{1/2}·(1/σmin(X*)) | Õ(nnz(∇)+(m+n)) × min{ ‖∇‖2^{1/2}/ε^{1/2}, ‖∇‖2^{1/2}/(σ1(∇)−σ2(∇))^{1/2} }
blockFW | k | κ log(1/ε) | k·Õ(nnz(∇)+k(m+n)κ) × min{ (‖∇‖2+α)^{1/2}/ε^{1/2}, κ(‖∇‖2+α)^{1/2}/(α^{1/2}·σmin(X*)) }

Table 1: Comparison of first-order methods to minimize a β-smooth, α-strongly convex function over the unit trace-norm ball in R^{m×n}. In the table, k is the rank of X*, κ = β/α is the condition number, ∇ = ∇f(Xt) is the gradient matrix, nnz(∇) is the complexity to multiply ∇ to a vector, σi(X) is the i-th largest singular value of X, and σmin(X) is the minimum non-zero singular value of X.

REMARK. The low-rank assumption on X* should be reasonable: as we mentioned, in most applications of Problem (1.1), the ultimate reason for imposing a trace-norm constraint is to ensure that the optimal solution is low-rank; otherwise the minimization problem may not be interesting to solve in the first place.
Also, the immediate prior work [6] assumes X* to have low rank.

k-SVD Complexity. For theoreticians who are concerned about the time complexity of k-SVD, we also compare it with the 1-SVD complexity of FW and Garber. If one uses LazySVD [1]⁴ to compute the k-SVD in each iteration of blockFW, then the per-iteration k-SVD complexity can be bounded by

    k · Õ( nnz(∇) + k(m+n)κ ) × min{ (‖∇‖2 + α)^{1/2} / ε^{1/2}, κ(‖∇‖2 + α)^{1/2} / (α^{1/2} σmin(X*)) }.    (1.2)

Above, κ = β/α is the condition number of f, ∇ = ∇f(Xt) is the gradient matrix of the current iteration t, nnz(∇) is the complexity to multiply ∇ to a vector, σmin(X*) is the minimum non-zero singular value of X*, and Õ hides poly-logarithmic factors.

In contrast, if using Lanczos, the 1-SVD complexity for FW and Garber can be bounded as (see [6])

    Õ( nnz(∇) ) × min{ ‖∇‖2^{1/2} / ε^{1/2}, ‖∇‖2^{1/2} / (σ1(∇) − σ2(∇))^{1/2} }.    (1.3)

Above, σ1(∇) and σ2(∇) are the top two singular values of ∇, and the gap σ1(∇) − σ2(∇) can be as small as zero.

We emphasize that our k-SVD complexity (1.2) can be upper bounded by a quantity that depends only poly-logarithmically on 1/ε. In contrast, the worst-case 1-SVD complexity (1.3) of FW and Garber depends on ε^{−1/2}, because the gap σ1 − σ2 can be as small as zero. Therefore, if one takes this additional ε dependency into consideration for the convergence rate, then blockFW has rate polylog(1/ε), while FW and Garber have rates ε^{−3/2} and ε^{−1} respectively.
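For intuition about the 1-SVD primitive that all of these bounds are built on, here is a minimal NumPy sketch of the simplest such routine, power iteration on AᵀA. It is only an illustrative stand-in (the bounds above assume the faster Lanczos-type solvers), and the function name is ours:

```python
import numpy as np

def top_singular_pair(A, iters=200, seed=0):
    """Approximate the top singular triplet (u, sigma, v) of A by power
    iteration on A^T A, so that A v ~= sigma * u. A deliberately simple
    stand-in for the Lanczos-type 1-SVD solvers assumed in the analysis."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.T @ (A @ v)          # one power-iteration step on A^T A
        v = w / np.linalg.norm(w)
    sigma = np.linalg.norm(A @ v)  # estimate of sigma_1(A)
    u = (A @ v) / sigma
    return u, sigma, v
```

Note that the iteration count such a routine needs grows as the relative gap σ1(A) − σ2(A) shrinks, which is exactly the gap-dependence issue behind the ε^{−1/2} worst case discussed above.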
The convergence rates and per-iteration running times of the different algorithms for solving Problem (1.1) are summarized in Table 1.

Practical Implementation. Besides our theoretical results above, we also provide practical suggestions for implementing blockFW. Roughly speaking, one can automatically select a different "good" rank k for each iteration. This can be done by iteratively finding the 1st, 2nd, 3rd, etc., top singular vectors of the underlying matrix, and stopping this process whenever the objective decrease is not worth further increasing the value of k. We discuss the details in Section 6.

⁴In fact, LazySVD is a general framework which says, with meaningful theoretical support, that one can apply a reasonable 1-SVD algorithm k times in order to compute a k-SVD. For simplicity, in this paper, whenever referring to LazySVD, we mean applying the Lanczos method k times.

2 Preliminaries and Notation

For a positive integer n, we define [n] := {1, 2, ..., n}. For a matrix A, we denote by ‖A‖F, ‖A‖2 and ‖A‖* respectively the Frobenius norm, the spectral norm, and the trace norm of A. We use ⟨·,·⟩ to denote the (Euclidean) inner product between vectors, or the (trace) inner product between matrices (i.e., ⟨A, B⟩ = Tr(AB^T)). We denote by σi(A) the i-th largest singular value of a matrix A, and by σmin(A) the minimum non-zero singular value of A. We use nnz(A) to denote the time complexity of multiplying the matrix A to a vector (which is at most the number of non-zero entries of A). We define the (unit) trace-norm ball B_{m,n} in R^{m×n} as B_{m,n} := {X ∈ R^{m×n} : ‖X‖* ≤ 1}.

Definition 2.1.
For a differentiable convex function f : K → R over a convex set K ⊆ R^{m×n}, we say
• f is β-smooth if f(Y) ≤ f(X) + ⟨∇f(X), Y − X⟩ + (β/2)‖Y − X‖F² for all X, Y ∈ K;
• f is α-strongly convex if f(Y) ≥ f(X) + ⟨∇f(X), Y − X⟩ + (α/2)‖Y − X‖F² for all X, Y ∈ K.

For Problem (1.1), we assume f is differentiable, β-smooth, and α-strongly convex over B_{m,n}. We denote by κ = β/α the condition number of f, and by X* the minimizer of f(X) over the trace-norm ball B_{m,n}. The strong convexity of f(X) implies:

Fact 2.2. f(X) − f(X*) ≥ (α/2)‖X − X*‖F² for all X ∈ K.

Proof. The minimality of X* implies ⟨∇f(X*), X − X*⟩ ≥ 0 for all X ∈ K. The fact then follows from the α-strong convexity of f. □

The Frank-Wolfe Algorithm. We now quickly review the Frank-Wolfe algorithm (see Algorithm 1) and its relation to PGD.

Algorithm 1 Frank-Wolfe
Input: step sizes {ηt}_{t≥1} (ηt ∈ [0, 1]), starting point X1 ∈ B_{m,n}
1: for t = 1, 2, ... do
2:   Vt ← argmin_{V ∈ B_{m,n}} ⟨∇f(Xt), V⟩   ▷ by finding the top left/right singular vectors ut, vt of −∇f(Xt), and taking Vt = ut vt^T
3:   X_{t+1} ← Xt + ηt(Vt − Xt)
4: end for

Let ht = f(Xt) − f(X*) be the approximation error of Xt.
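A minimal NumPy sketch of one iteration of Algorithm 1 may help; it uses a dense full SVD as a stand-in for the 1-SVD computation in Line 2 (fine for small matrices, not the fast path the paper assumes), and the function name is ours:

```python
import numpy as np

def frank_wolfe_step(X, grad, eta):
    """One iteration of Algorithm 1 over the unit trace-norm ball: take the
    top singular pair (u, v) of -grad (via a dense SVD here, standing in
    for a fast 1-SVD), set V = u v^T, and move X <- X + eta * (V - X)."""
    U, _, Vt = np.linalg.svd(-grad)
    V = np.outer(U[:, 0], Vt[0, :])   # extreme point of the trace-norm ball
    return X + eta * (V - X)
```

For instance, on f(X) = (1/2)‖X − M‖F² with ‖M‖* ≤ 1, repeatedly calling this step with ηt = 2/(t+2) drives the error toward zero at the O(1/ε) rate discussed next.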
The convergence analysis of Algorithm 1 is based on the following relation:

    h_{t+1} = f(Xt + ηt(Vt − Xt)) − f(X*)
            ≤ ht + ηt⟨∇f(Xt), Vt − Xt⟩ + (β/2)ηt²‖Vt − Xt‖F²      (by the β-smoothness of f)
            ≤ ht + ηt⟨∇f(Xt), X* − Xt⟩ + (β/2)ηt²‖Vt − Xt‖F²      (by the choice of Vt in Line 2)
            ≤ (1 − ηt)ht + (β/2)ηt²‖Vt − Xt‖F².                    (by the convexity of f)    (2.1)

Based on (2.1), a suitable choice of step size ηt = Θ(1/t) gives the O(β/ε) convergence rate of the Frank-Wolfe algorithm.

If f is also α-strongly convex, a linear convergence rate can be achieved if we replace the linear optimization step (Line 2) in Algorithm 1 with a constrained quadratic minimization:

    Vt ← argmin_{V ∈ B_{m,n}} ⟨∇f(Xt), V − Xt⟩ + (β/2)ηt‖V − Xt‖F².    (2.2)

In fact, if Vt is defined as above, we have the following relation similar to (2.1):

    h_{t+1} ≤ ht + ηt⟨∇f(Xt), Vt − Xt⟩ + (β/2)ηt²‖Vt − Xt‖F²
            ≤ ht + ηt⟨∇f(Xt), X* − Xt⟩ + (β/2)ηt²‖X* − Xt‖F²
            ≤ (1 − ηt + κηt²)ht,    (2.3)

where the last inequality follows from Fact 2.2. Given (2.3), we can choose ηt = 1/(2κ) to obtain a linear convergence rate, because then h_{t+1} ≤ (1 − 1/(4κ))ht. This is the main idea behind the projected gradient descent (PGD) method.
Unfortunately, optimizing Vt from (2.2) requires a projection operation onto B_{m,n}, and this in turn requires a full singular value decomposition of the matrix ∇f(Xt) − βηtXt.

3 A Rank-k Variant of Frank-Wolfe

Our main idea comes from the following simple observation. Suppose we choose ηt = η = 1/(2κ) for all iterations, and suppose rank(X*) ≤ k. Then we can add a low-rank constraint to Vt in (2.2):

    Vt ← argmin_{V ∈ B_{m,n}, rank(V) ≤ k} ⟨∇f(Xt), V − Xt⟩ + (β/2)η‖V − Xt‖F².    (3.1)

Under this new choice of Vt, the inequalities in (2.3) continue to hold (X* remains feasible for the restricted problem), and thus the linear convergence rate of PGD is preserved. Let us now discuss how to solve (3.1).

3.1 Solving the Low-Rank Quadratic Minimization (3.1)

Although (3.1) is non-convex, we prove that it can be solved efficiently. To achieve this, we first show that Vt is in the span of the top k singular vectors of βηXt − ∇f(Xt).

Lemma 3.1. The minimizer Vt of (3.1) can be written as Vt = Σ_{i=1}^k ai ui vi^T, where a1, ..., ak are nonnegative scalars, and (ui, vi) is the pair of left and right singular vectors of At := βηXt − ∇f(Xt) corresponding to its i-th largest singular value.

The proof of Lemma 3.1 is given in the full version of this paper. Now, owing to Lemma 3.1, we can perform a k-SVD on At to compute {(ui, vi)}_{i∈[k]}, plug the expression Vt = Σ_{i=1}^k ai ui vi^T into the objective of (3.1), and then search for the optimal values {ai}_{i∈[k]}. The last step is equivalent
The last step is equivalent\n\ni Atvi) over the simplex \u2206 :=(cid:8)a \u2208 Rk :\n\ni (where \u03c3i = u(cid:62)\n\n2 \u03b7(cid:80)k\n\ni=1 aiuiv(cid:62)\n\ni=1 aiuiv(cid:62)\n\ni=1 \u03c3iai + \u03b2\n\n\u03b2\u03b7 (\u03c31, . . . , \u03c3k) onto the\n\ni=1 a2\n\ni\n\nsimplex \u2206. It can be easily solved in O(k log k) time (see for instance the applications in [15]).\n3.2 Our Algorithm and Its Convergence\nWe summarize our algorithm in Algorithm 2 and call it blockFW.\n\nAlgorithm 2 blockFW\nInput: Rank parameter k, starting point X1 = 0\n1: \u03b7 \u2190 1\n2\u03ba.\n2: for t = 1, 2, . . . do\n3: At \u2190 \u03b2\u03b7Xt \u2212 \u2207f (Xt)\n4:\n\n(u1, v1, . . . , uk, vk) \u2190 k-SVD(At)\na \u2190 argmina\u2208Rk,a\u22650,(cid:107)a(cid:107)1\u22641 (cid:107)a \u2212 1\n\n\u03b2\u03b7 \u03c3(cid:107)2\n\nVt \u2190(cid:80)k\n\n5:\n6:\n7: Xt+1 \u2190 Xt + \u03b7(Vt \u2212 Xt)\n8: end for\n\ni=1 aiuiv(cid:62)\n\ni\n\n(cid:5) (ui, vi) is the i-th largest pair of left/right singular vectors of At\ni Atvi)k\n\n(cid:5) where \u03c3 := (u(cid:62)\n\ni=1\n\nF be the objective function in (3.1),\nt = gt(X\u2217). Given parameters \u03b3 \u2265 0 and \u03b5 \u2265 0, a feasible solution V to (3.1) is called\n\nSince the state-of-the-art algorithms for k-SVD are iterative methods, which in theory can only give\napproximate solutions, we now study the convergence of blockFW given approximate k-SVD solvers.\nWe introduce the following notion of an approximate solution to the low-rank quadratic minimization\nproblem (3.1).\nDe\ufb01nition 3.2. Let gt(V ) = (cid:104)\u2207f (Xt), V \u2212 Xt(cid:105) + \u03b2\nand let g\u2217\n(\u03b3, \u03b5)-approximate if it satis\ufb01es g(V ) \u2264 (1 \u2212 \u03b3)g\u2217\nNote that the above multiplicative-additive de\ufb01nition makes sense because g\u2217\nFact 3.3.\n\u2212(1 \u2212 \u03ba\u03b7)ht = \u2212 ht\nThe next theorem gives the linear convergence of blockFW under the above approximate solutions to\n(3.1). 
Its proof is simple and uses a variant of (2.3) (see the full version of this paper).

Theorem 3.4. Suppose rank(X*) ≤ k and ε > 0. If each Vt computed in blockFW is a (1/2, ε/8)-approximate solution to (3.1), then for every t, the error ht = f(Xt) − f(X*) satisfies

    ht ≤ (1 − 1/(8κ))^{t−1} h1 + ε/2.

As a consequence, it takes O(κ log(h1/ε)) iterations to achieve the target error ht ≤ ε.

Based on Theorem 3.4, the per-iteration running time of blockFW is dominated by the time needed to produce a (1/2, ε/8)-approximate solution Vt to (3.1), which we study in Section 4.

4 Per-Iteration Running Time Analysis

In this section, we study the running time needed to produce a (1/2, ε)-approximate solution Vt to (3.1). In particular, we wish to show a running time that depends only poly-logarithmically on 1/ε. The reason is that, since we are concerned with the linear convergence rate (i.e., the log(1/ε) rate) in this paper, it is not meaningful to have a per-iteration complexity that scales polynomially with 1/ε.

Remark 4.1. To the best of our knowledge, the Frank-Wolfe method and Garber's method [6] have worst-case per-iteration complexities scaling polynomially with 1/ε. In theory, this also slows down their overall performance in terms of the dependency on 1/ε.

4.1 Step 1: The Necessary k-SVD Accuracy

We first show that if the k-SVD in Line 4 of blockFW is solved sufficiently accurately, then the Vt obtained in Line 6 will be a sufficiently good approximate solution to (3.1).
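Throughout this section, the cost driver is the k-SVD subroutine. As background, the "k-SVD by repeatedly applying 1-SVD" idea can be illustrated by the following much-simplified deflation sketch; the actual LazySVD framework [1] is the careful approximate version with the guarantees quoted below, and the function name here is ours:

```python
import numpy as np

def ksvd_by_deflation(A, k):
    """Build a k-SVD out of k calls to a 1-SVD routine by deflating A after
    each call. With an exact 1-SVD (a dense SVD here) deflation is exact;
    LazySVD [1] makes this work with approximate 1-SVD calls."""
    us, sigmas, vs = [], [], []
    R = np.array(A, dtype=float)
    for _ in range(k):
        U, S, Vt = np.linalg.svd(R)            # stands in for one 1-SVD call
        u, s, v = U[:, 0], S[0], Vt[0, :]
        us.append(u); sigmas.append(s); vs.append(v)
        R = R - s * np.outer(u, v)             # deflate the found component
    return np.column_stack(us), np.array(sigmas), np.column_stack(vs)
```

With exact 1-SVD calls this recovers the top k singular triplets exactly; the analysis below quantifies how accurate the approximate calls must be.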
For notational simplicity, in this section we denote Gt := ‖∇f(Xt)‖2 + α, and we let k* = rank(X*) ≤ k.

Lemma 4.2. Suppose γ ∈ [0, 1] and ε ≥ 0. In each iteration t of blockFW, if the vectors u1, v1, ..., uk, vk returned by the k-SVD in Line 4 satisfy ui^T At vi ≥ (1 − γ)σi(At) − ε for all i ∈ [k*], then Vt = Σ_{i=1}^k ai ui vi^T obtained in Line 6 is ((6Gt/ht + 2)γ, ε)-approximate to (3.1).

The proof of Lemma 4.2 is given in the full version of this paper, and is based on our earlier characterization Lemma 3.1.

4.2 Step 2: The Time Complexity of k-SVD

We recall the following complexity statement for k-SVD:

Theorem 4.3 ([1]). The running time to compute the k-SVD of A ∈ R^{m×n} using LazySVD is⁵

    Õ( (k·nnz(A) + k²(m+n)) / √γ )   or   Õ( (k·nnz(A) + k²(m+n)) / √gap ).

In the former case, we can have ui^T A vi ≥ (1 − γ)σi(A) for all i ∈ [k]; in the latter case, if gap ∈ (0, (σ_{k*}(A) − σ_{k*+1}(A)) / σ_{k*}(A)] for some k* ∈ [k], then we can guarantee ui^T A vi ≥ σi(A) − ε for all i ∈ [k*].

The First Attempt. Recall that we need a (1/2, ε)-approximate solution to (3.1). Using Lemma 4.2, it suffices to obtain a (1 − γ)-multiplicative approximation to the k-SVD of At (i.e., ui^T At vi ≥ (1 − γ)σi(At) for all i ∈ [k]), as long as γ ≤ 1/(12Gt/ht + 4). Therefore, we can directly apply the first running time in Theorem 4.3: Õ( (k·nnz(At) + k²(m+n)) / √γ ).
However, when ht is very small, this running\n\nrunning time in Theorem 4.3: \u02dcO(cid:0) k\u00b7nnz(At)+k2(m+n)\nsince (cid:107)At(cid:107)2 =(cid:13)(cid:13) \u03b1\n\n2 Xt \u2212 \u2207f (Xt)(cid:13)(cid:13)2 \u2264 \u03b1\n\ni Atvi \u2265 \u03c3i(At) \u2212 \u03b5\n\n2 + (cid:107)\u2207f (Xt)(cid:107)2 \u2264 Gt, from u(cid:62)\n\ntime can be unbounded. In that case, we observe that \u03b3 = \u03b5\nGt\n\n(independent of ht) also suf\ufb01ces:\ni Atvi \u2265 (1 \u2212 \u03b5/Gt)\u03c3i(At)\n(cid:107)At(cid:107)2 \u2265 \u03c3i(At) \u2212 \u03b5; then according to\n2 , \u03b5)-approximation.\n) in Claim 4.5; the running time depends polynomially\n\nwe have u(cid:62)\nLemma 4.2 we can obtain (0, \u03b5)-approximation to (3.1), which is stronger than ( 1\nWe summarize this running time (using \u03b3 = \u03b5\nGt\n\u03b5 .\non 1\nThe Second Attempt. To make our linear convergence rate (i.e., the log(1/\u03b5) rate) meaningful, we\nwant the k-SVD running time to depend poly-logarithmically on 1/\u03b5. Therefore, when ht is small,\nwe wish to instead apply the second running time in Theorem 4.3.\n\n\u03c3i(At) \u2265 \u03c3i(At) \u2212 \u03b5\n\n\u221a\n\nGt\n\nGt\n\n\u03b3\n\n(cid:17)\n\n(cid:16) k\u00b7nnz(A)+k2(m+n)\n(cid:105)\n\n\u221a\n\n\u03b3\n\n\u02dcO\n\ngap \u2208(cid:16)\n\n5The \ufb01rst is known as the gap-free result because it does not depend on the gap between any two singular\nvalues. The second is known as the gap-dependent result, and it requires a k\u00d7k full SVD after the k approximate\nsingular vectors are computed one by one. The \u02dcO notation hides poly-log factors in 1/\u03b5, 1/\u03b3, m, n, and 1/gap.\n\n6\n\n\fRecall that X\u2217 has rank k\u2217 so \u03c3k\u2217 (X\u2217) \u2212 \u03c3k\u2217+1(X\u2217) = \u03c3min(X\u2217). We can show that this implies\n2 X\u2217 \u2212 \u2207f (X\u2217) also has a large gap \u03c3k\u2217 (A\u2217) \u2212 \u03c3k\u2217+1(A\u2217). 
Now, according to Fact 2.2, when ht is small, Xt and X* are sufficiently close. This means At = (α/2)Xt − ∇f(Xt) is also close to A*, and thus has a large gap σ_{k*}(At) − σ_{k*+1}(At). We can then apply the second running time in Theorem 4.3.

4.2.1 Formal Running Time Statements

Fact 4.4. We can store Xt as a decomposition into at most rank(Xt) ≤ kt rank-1 components.⁶ Therefore, for At = (α/2)Xt − ∇f(Xt), we have nnz(At) ≤ nnz(∇f(Xt)) + (m+n)·rank(Xt) ≤ nnz(∇f(Xt)) + (m+n)kt.

If we always use the first running time in Theorem 4.3, then Fact 4.4 implies:

Claim 4.5. The k-SVD computation in the t-th iteration of blockFW can be implemented in Õ( (k·nnz(∇f(Xt)) + k²(m+n)t) · √(Gt/ε) ) time.

Remark 4.6. As long as (m+n)kt ≤ nnz(∇f(Xt)), the k-SVD running time in Claim 4.5 becomes Õ( k·nnz(∇f(Xt)) · √(Gt/ε) ), which roughly equals k times the 1-SVD running time Õ( nnz(∇) · √(‖∇‖2/ε) ) of FW and Garber [6]. Since in practice it suffices to run blockFW and FW for a few hundred 1-SVD computations, the relation (m+n)kt ≤ nnz(∇f(Xt)) is often satisfied.

If, as discussed above, we apply the first running time in Theorem 4.3 only when ht is large, and apply the second running time in Theorem 4.3 when ht is small, then we obtain the following theorem, whose proof is given in the full version of this paper.

Theorem 4.7. The k-SVD computation in the t-th iteration of blockFW can be implemented in Õ( (k·nnz(∇f(Xt)) + k²(m+n)t) · κ√(Gt/α) / σmin(X*) ) time.

Remark 4.8.
Since according to Theorem 3.4 we only need to run blockFW for O(κ log(1/ε)) iterations, we can plug t = O(κ log(1/ε)) into Claim 4.5 and Theorem 4.7, and obtain the running time presented in (1.2). The per-iteration running time of blockFW depends poly-logarithmically on 1/ε. In contrast, the per-iteration running times of Garber [6] and FW depend polynomially on 1/ε, making their total running times even worse in terms of the dependency on 1/ε.

5 Maintaining Low-Rank Iterates

One of the main reasons to impose trace-norm constraints is to produce low-rank solutions. However, the rank of the iterate Xt in our algorithm blockFW can be as large as kt, which is much larger than k, the rank of the optimal solution X*. In this section, we show that by adding a simple modification to blockFW, we can make sure the rank of Xt is O(kκ log κ) in all iterations t, without hurting the convergence rate much.

We modify blockFW as follows. Whenever t − 1 is a multiple of S = ⌈8κ(log κ + 1)⌉, we compute (note that this is the same as setting η = 1 in (3.1))

    Wt ← argmin_{W ∈ B_{m,n}, rank(W) ≤ k} ⟨∇f(Xt), W − Xt⟩ + (β/2)‖W − Xt‖F²,

and let the next iterate X_{t+1} be Wt. In all other iterations the algorithm is unchanged. After this change, the function value f(X_{t+1}) may be greater than f(Xt), but it can be bounded as follows:

Lemma 5.1. Suppose rank(X*) ≤ k. Then we have f(Wt) − f(X*) ≤ κht.

Proof.
We have the following relation similar to (2.3):

    f(Wt) − f(X*) ≤ ht + ⟨∇f(Xt), Wt − Xt⟩ + (β/2)‖Wt − Xt‖F²
                  ≤ ht + ⟨∇f(Xt), X* − Xt⟩ + (β/2)‖X* − Xt‖F²      (by the minimality of Wt)
                  ≤ ht − ht + (β/2)·(2/α)·ht = κht.                  (by convexity and Fact 2.2)    □

From Theorem 3.4 we know that h_{S+1} ≤ (1 − 1/(8κ))^S h1 + ε/2 = (1 − 1/(8κ))^{8κ(log κ + 1)} h1 + ε/2 ≤ e^{−(log κ + 1)} h1 + ε/2 = (1/(eκ)) h1 + ε/2. Therefore, after setting X_{S+2} = W_{S+1}, we still have h_{S+2} ≤ (1/e) h1 + κε/2 (according to Lemma 5.1). Continuing this analysis (letting the κε here be the "new ε"), we conclude that this modified version of blockFW converges to an ε-approximate minimizer in O(κ log κ · log(h1/ε)) iterations.

Remark 5.2. Since in each iteration the rank of Xt increases by at most k, if we perform the modified step every S = O(κ log κ) iterations, then throughout the algorithm rank(Xt) is never more than O(kκ log κ). Furthermore, we can always store Xt using O(kκ log κ) vectors, instead of storing all the singular vectors obtained in previous iterations.

⁶In Section 5, we show how to ensure that rank(Xt) is always O(kκ log κ), a quantity independent of t.

6 Preliminary Empirical Evaluation

We conclude this paper with some preliminary experiments to test the performance of blockFW. We first recall two machine learning tasks that fall into Problem (1.1).

Matrix Completion. Suppose there is an unknown matrix M ∈ R^{m×n} that is close to low-rank, and we observe a subset Ω of its entries – that is, we observe M_{i,j} for every (i, j) ∈ Ω.
(Think of M_{i,j} as user i's rating of movie j.) One can recover M by solving the following convex program:

    min_{X ∈ R^{m×n}} { (1/2) Σ_{(i,j)∈Ω} (X_{i,j} − M_{i,j})²  :  ‖X‖_* ≤ θ } .   (6.1)

Although Problem (6.1) is not strongly convex, our experiments show the effectiveness of blockFW on this problem.

Polynomial Neural Networks. Polynomial networks are neural networks with the quadratic activation function σ(a) = a². Livni et al. [12] showed that such networks can express any function computed by a Turing machine, similar to networks with ReLU or sigmoid activations. Following [12], we consider the class of 2-layer polynomial networks with inputs from R^d and k hidden neurons:

    P_k = { x ↦ Σ_{j=1}^k a_j (w_j^⊤ x)²  |  ∀j ∈ [k], w_j ∈ R^d, ‖w_j‖_2 = 1;  a ∈ R^k } .

If we write A = Σ_{j=1}^k a_j w_j w_j^⊤, we have the following equivalent formulation:

    P_k = { x ↦ x^⊤ A x  |  A ∈ R^{d×d}, rank(A) ≤ k } .

Therefore, if we replace the hard rank constraint with the trace-norm constraint ‖A‖_* ≤ θ, the task of empirical risk minimization (ERM) given training data {(x_1, y_1), ..., (x_N, y_N)} ⊂ R^d × R can be formulated as⁷

    min_{A ∈ R^{d×d}} { (1/2) Σ_{i=1}^N (x_i^⊤ A x_i − y_i)²  :  ‖A‖_* ≤ θ } .   (6.2)

Since f(A) = (1/2) Σ_{i=1}^N (x_i^⊤ A x_i − y_i)² is convex in A, the above problem falls into Problem (1.1). Again, this objective f(A) might not be strongly convex, but we still perform experiments on it.

6.1 Preliminary Evaluation 1: Matrix Completion on Synthetic Data

We consider the following synthetic experiment for matrix completion.
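For concreteness, objective (6.1) and its gradient are simple to compute once Ω is fixed. The following is a minimal sketch; the function name `mc_objective_and_grad` and the dense boolean-mask representation of Ω are illustrative choices, not the paper's implementation:

```python
import numpy as np

def mc_objective_and_grad(X, M, mask):
    """Objective (6.1): f(X) = 1/2 * sum_{(i,j) in Omega} (X_ij - M_ij)^2.

    `mask` is a boolean matrix with mask[i, j] = True iff (i, j) is observed.
    Returns (f(X), grad f(X)); the gradient vanishes outside Omega.
    """
    R = (X - M) * mask  # residual supported on the observed entries
    return 0.5 * np.sum(R ** 2), R

# toy usage: a 3x3 matrix with two observed entries
M = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]])
mask = np.zeros((3, 3), dtype=bool)
mask[0, 0] = mask[1, 1] = True
X = np.zeros((3, 3))
f, g = mc_objective_and_grad(X, M, mask)
print(f)  # 0.5 * (1^2 + 2^2) = 2.5
```

The trace-norm constraint ‖X‖_* ≤ θ is handled by the algorithm itself (via its 1-SVD or k-SVD steps), not inside this objective.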
We generate a random rank-10 matrix of dimension 1000 × 1000, plus some small noise. We include each entry in Ω with probability 1/2. We scale M so that ‖M‖_* = 10000, and accordingly set θ = 10000 in (6.1).

We compare blockFW with FW and Garber [6]. When implementing the three algorithms, we use exact line search. For Garber's algorithm, we tune its parameter η_t = c/t with different constant values c, and then exactly search for the optimal η̃_t. When implementing blockFW, we use k = 10 and η = 0.2. We use the MATLAB built-in solver for 1-SVD and k-SVD.

In Figure 1(a), we compare the numbers of 1-SVD computations for the three algorithms. The plot confirms that it suffices to apply a rank-k variant of FW in order to achieve linear convergence.

⁷We consider square loss for simplicity. It can be any loss function ℓ(x_i^⊤ A x_i, y_i) convex in its first argument.

[Figure 1: Partial experimental results (x-axis: number of 1-SVD computations; y-axis: log(error)). (a) matrix completion on synthetic data; (b) matrix completion on MOVIELENS1M, θ = 10000; (c) polynomial neural network on MNIST, θ = 0.03. The full 6 plots for MOVIELENS and 3 plots for MNIST are included in the full version of this paper.]

6.2 Auto Selection of k

In practice, it is often unrealistic to know k in advance. Although one can simultaneously try k = 1, 2, 4, 8, ... and output the best possible solution, this can be unpleasant to work with. We propose the following modification to blockFW which automatically chooses k.

In each iteration t, we first run 1-SVD and compute the objective decrease, denoted by d_1 ≥ 0. Now, given any approximate k-SVD decomposition of the matrix A_t = βηX_t − ∇f(X_t), we can compute its (k+1)-SVD using one additional 1-SVD computation according to the LazySVD framework [1]. We compute the new objective decrease d_{k+1}.
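The step from a k-SVD to a (k+1)-SVD can be illustrated by simple deflation: run one more 1-SVD on the residual left after removing the current rank-k part. This is only a sketch of the idea; the actual LazySVD framework [1] runs 1-SVD on a projected matrix (which is what yields its guarantees), and an iterative 1-SVD solver would replace the full `np.linalg.svd` call below:

```python
import numpy as np

def extend_topk_svd(A, U, s, V):
    """Given top-k singular factors of A (U: m x k, s: length-k, V: n x k),
    append the leading singular triple of the residual A - U diag(s) V^T,
    returning top-(k+1) factors. Costs one extra 1-SVD per call."""
    R = A - U @ np.diag(s) @ V.T
    u, sv, vt = np.linalg.svd(R, full_matrices=False)  # stand-in for an iterative 1-SVD
    return (np.column_stack([U, u[:, 0]]),
            np.append(s, sv[0]),
            np.column_stack([V, vt[0, :]]))

# usage: grow a rank-1 SVD of a random matrix into a rank-2 SVD
A = np.random.default_rng(1).standard_normal((20, 15))
Uf, sf, Vft = np.linalg.svd(A, full_matrices=False)
U, s, V = Uf[:, :1], sf[:1], Vft[:1, :].T
U, s, V = extend_topk_svd(A, U, s, V)
print(np.allclose(s, sf[:2]))  # True: the two leading singular values are recovered
```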
We stop this process and move to the next iteration t+1 whenever d_{k+1}/(k+1) < d_k/k. In other words, we stop whenever it "appears" not worth further increasing k. We count this iteration t as using k + 1 computations of 1-SVD.

All the experiments on real-life datasets are performed using this auto-k process.

6.3 Preliminary Evaluation 2: Matrix Completion on MOVIELENS

We study the same experiment as Garber [6]: the matrix completion Problem (6.1) on the datasets MOVIELENS100K (m = 943, n = 1862, and |Ω| = 10⁵) and MOVIELENS1M (m = 6040, n = 3952, and |Ω| ≈ 10⁶). In the second dataset, following [6], we further subsample Ω so it contains about half of the original entries. For each dataset, we run FW, Garber, and blockFW with three different choices of θ.⁸ We present the six plots side by side in the full version of this paper.

We observe that when θ is large, there is no significant advantage in using blockFW. This is because the rank of the optimal solution X* is also high for large θ. In contrast, when θ is small (so X* is of low rank), as demonstrated for instance by Figure 1(b), it is indeed beneficial to apply blockFW.

6.4 Preliminary Evaluation 3: Polynomial Neural Network on MNIST

We use the 2-layer neural network Problem (6.2) to train a binary classifier on the MNIST dataset of handwritten digits, where the goal is to distinguish images of the digit "0" from images of other digits. The training set contains N = 60000 examples, each of dimension d = 28 × 28 = 784. We set y_i = 1 if that example belongs to digit "0" and y_i = 0 otherwise. We divide the original grey levels by 256 so that x_i ∈ [0, 1]^d.
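As a sanity check of the formulation behind (6.2), the equivalence x^⊤Ax = Σ_j a_j (w_j^⊤x)² for A = Σ_j a_j w_j w_j^⊤ can be verified numerically. This is a toy sketch with random data, not the MNIST experiment itself; the dimensions and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3

# a random 2-layer polynomial network: x -> sum_j a_j (w_j^T x)^2
a = rng.standard_normal(k)
W = rng.standard_normal((k, d))  # row j holds the hidden weights w_j

# the equivalent rank-<=k quadratic form: A = sum_j a_j w_j w_j^T
A = sum(a[j] * np.outer(W[j], W[j]) for j in range(k))

x = rng.standard_normal(d)
net_out = float(np.sum(a * (W @ x) ** 2))  # network output
quad_out = float(x @ A @ x)                # quadratic-form output
print(np.isclose(net_out, quad_out))  # True
```

Minimizing (6.2) over ‖A‖_* ≤ θ is thus a convex relaxation of training this width-k network.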
We again try three different values of θ, and compare FW, Garber, and blockFW.⁹ We present the three plots side by side in the full version of this paper.

The performance of our algorithm is comparable to FW and Garber for large θ, but as demonstrated for instance by Figure 1(c), when θ is small, so that rank(X*) is small, it is beneficial to use blockFW.

7 Conclusion

In this paper, we develop a rank-k variant of Frank-Wolfe for Problem (1.1) and show that: (1) it converges at a linear rate, i.e., within O(log(1/ε)) iterations, for smooth and strongly convex functions; and (2) its per-iteration complexity scales with polylog(1/ε). Preliminary experiments suggest that the value k can also be selected automatically, and that our algorithm outperforms FW and Garber [6] when X* is of relatively small rank.

We hope more rank-k variants of Frank-Wolfe can be developed in the future.

Acknowledgments

Elad Hazan was supported by NSF grant 1523815 and a Google research award. The authors would like to thank Dan Garber for sharing his code for [6].

⁸We perform exact line search for all algorithms. For Garber [6], we tune the best η_t = c/t and exactly search for the optimal η̃_t. For blockFW, we let k be chosen automatically and choose η = 0.01 for all six experiments.

⁹We perform exact line search for all algorithms. For Garber [6], we tune the best η_t = c/t and exactly search for the optimal η̃_t. For blockFW, we let k be chosen automatically and choose η = 0.0005 for all three experiments.

References

[1] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even faster SVD decomposition yet without agonizing pain.
In NIPS, pages 974–982, 2016.

[2] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.

[3] Emmanuel J. Candès, Yonina C. Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 57(2):225–251, 2015.

[4] Miroslav Dudik, Zaid Harchaoui, and Jérôme Malick. Lifted coordinate descent for learning with trace-norm regularization. In AISTATS, pages 327–336, 2012.

[5] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

[6] Dan Garber. Faster projection-free convex optimization over the spectrahedron. In NIPS, pages 874–882, 2016.

[7] Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.

[8] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In ICML, pages 541–549, 2015.

[9] Elad Hazan. Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, pages 306–316. Springer, 2008.

[10] Martin Jaggi and Marek Sulovský. A simple algorithm for nuclear norm regularized problems. In ICML, pages 471–478, 2010.

[11] Simon Lacoste-Julien and Martin Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. arXiv preprint arXiv:1312.7864, 2013.

[12] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In NIPS, pages 855–863, 2014.

[13] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition.
In NIPS, pages 1396–1404, 2015.

[14] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

[15] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, December 2005.

[16] Shai Shalev-Shwartz, Alon Gonen, and Ohad Shamir. Large-scale convex minimization with a low-rank constraint. arXiv preprint arXiv:1106.1622, 2011.