{"title": "Guaranteed Rank Minimization via Singular Value Projection", "book": "Advances in Neural Information Processing Systems", "page_first": 937, "page_last": 945, "abstract": "Minimizing the rank of a matrix subject to affine constraints is a fundamental problem with many important applications in machine learning and statistics. In this paper we propose a simple and fast algorithm SVP (Singular Value Projection) for rank minimization under affine constraints ARMP and show that SVP recovers the minimum rank solution for affine constraints that satisfy a Restricted Isometry Property} (RIP). Our method guarantees geometric convergence rate even in the presence of noise and requires strictly weaker assumptions on the RIP constants than the existing methods. We also introduce a Newton-step for our SVP framework to speed-up the convergence with substantial empirical gains. Next, we address a practically important application of ARMP - the problem of low-rank matrix completion, for which the defining affine constraints do not directly obey RIP, hence the guarantees of SVP do not hold. However, we provide partial progress towards a proof of exact recovery for our algorithm by showing a more restricted isometry property and observe empirically that our algorithm recovers low-rank Incoherent matrices from an almost optimal number of uniformly sampled entries. We also demonstrate empirically that our algorithms outperform existing methods, such as those of \\cite{CaiCS2008,LeeB2009b, KeshavanOM2009}, for ARMP and the matrix completion problem by an order of magnitude and are also more robust to noise and sampling schemes. 
In particular, results show that our SVP-Newton method is significantly robust to noise and performs impressively on a more realistic power-law sampling scheme for the matrix completion problem.", "full_text": "Guaranteed Rank Minimization via Singular Value Projection

Prateek Jain
Microsoft Research Bangalore
Bangalore, India
prajain@microsoft.com

Raghu Meka
UT Austin Dept. of Computer Sciences
Austin, TX, USA
raghu@cs.utexas.edu

Inderjit Dhillon
UT Austin Dept. of Computer Sciences
Austin, TX, USA
inderjit@cs.utexas.edu

Abstract

Minimizing the rank of a matrix subject to affine constraints is a fundamental problem with many important applications in machine learning and statistics. In this paper we propose a simple and fast algorithm SVP (Singular Value Projection) for rank minimization under affine constraints (ARMP) and show that SVP recovers the minimum rank solution for affine constraints that satisfy a restricted isometry property (RIP). Our method guarantees a geometric convergence rate even in the presence of noise and requires strictly weaker assumptions on the RIP constants than the existing methods. We also introduce a Newton step for our SVP framework to speed up the convergence with substantial empirical gains. Next, we address a practically important application of ARMP: the problem of low-rank matrix completion, for which the defining affine constraints do not directly obey RIP, hence the guarantees of SVP do not hold. However, we provide partial progress towards a proof of exact recovery for our algorithm by showing a more restricted isometry property and observe empirically that our algorithm recovers low-rank incoherent matrices from an almost optimal number of uniformly sampled entries.
We also demonstrate empirically that our algorithms outperform existing methods, such as those of [5, 18, 14], for ARMP and the matrix completion problem by an order of magnitude and are also more robust to noise and sampling schemes. In particular, results show that our SVP-Newton method is significantly robust to noise and performs impressively on a more realistic power-law sampling scheme for the matrix completion problem.

1 Introduction

In this paper we study the general affine rank minimization problem (ARMP),

min rank(X)  s.t.  A(X) = b,  X ∈ R^{m×n}, b ∈ R^d,   (ARMP)

where A is an affine transformation from R^{m×n} to R^d.

The affine rank minimization problem above is of considerable practical interest, and many important machine learning problems such as matrix completion, low-dimensional metric embedding and low-rank kernel learning can be viewed as instances of the above problem. Unfortunately, ARMP is NP-hard in general and is also NP-hard to approximate ([22]).

Until recently, most known methods for ARMP were heuristic in nature with few known rigorous guarantees. In a recent breakthrough, Recht et al. [24] gave the first nontrivial results for the problem, obtaining guaranteed rank minimization for affine transformations A that satisfy a restricted isometry property (RIP). Define the isometry constant of A, δk, to be the smallest number such that for all X ∈ R^{m×n} of rank at most k,

(1 − δk) ∥X∥_F² ≤ ∥A(X)∥₂² ≤ (1 + δk) ∥X∥_F².

The above RIP condition is a direct generalization of the RIP condition used in the compressive sensing context. Moreover, RIP holds for many important practical applications of ARMP such as image compression and linear time-invariant systems. In particular, Recht et al. show that for most natural families of random measurements, RIP is satisfied even for only O(nk log n) measurements. Also, Recht et al. show that for ARMP with isometry constant δ5k < 1/10, the minimum rank solution can be recovered by the minimum trace-norm solution.

In this paper we propose a simple and efficient algorithm SVP (Singular Value Projection) based on the projected gradient algorithm. We present a simple analysis showing that SVP recovers the minimum rank solution for noisy affine constraints that satisfy RIP, and we prove the following guarantees. (Independent of our work, Goldfarb and Ma [12] proposed an algorithm similar to SVP. However, their analysis and formulation is different from ours. They also require stronger isometry assumptions, δ3k < 1/√30, than our analysis.)

Theorem 1.1 Suppose the isometry constant of A satisfies δ2k < 1/3 and let b = A(X*) for a rank-k matrix X*. Then, SVP (Algorithm 1) with step-size ηt = 1/(1 + δ2k) converges to X*. Furthermore, SVP outputs a matrix X of rank at most k such that ∥A(X) − b∥₂² ≤ ϵ and ∥X − X*∥_F² ≤ ϵ/(1 − δ2k) in at most

⌈ (1 / log((1 − δ2k)/2δ2k)) log(∥b∥₂² / 2ϵ) ⌉

iterations.

Theorem 1.2 (Main) Suppose the isometry constant of A satisfies δ2k < 1/3 and let b = A(X*) + e for a rank-k matrix X* and an error vector e ∈ R^d. Then, SVP with step-size ηt = 1/(1 + δ2k) outputs a matrix X of rank at most k such that ∥A(X) − b∥₂² ≤ C∥e∥₂² + ϵ and ∥X − X*∥_F² ≤ (C∥e∥₂² + ϵ)/(1 − δ2k), ϵ ≥ 0, in at most

⌈ (1 / log(1/D)) log(∥b∥₂² / 2(C∥e∥₂² + ϵ)) ⌉

iterations for universal constants C, D.

As our SVP algorithm is based on projected gradient descent, it behaves as a first-order method and may require a relatively large number of iterations to achieve high accuracy, even after identifying the correct row and column subspaces. To this end, we introduce a Newton-type step in our framework (SVP-Newton) rather than using a simple gradient-descent step. Guarantees similar to Theorems 1.1, 1.2 follow easily for SVP-Newton using the proofs for SVP. In practice, SVP-Newton performs better than SVP in terms of accuracy and number of iterations.

We next consider an important application of ARMP: the low-rank matrix completion problem (MCP), where, given a small number of entries from an unknown low-rank matrix, the task is to complete the missing entries. Note that RIP does not hold directly for this problem. Recently, Candes and Recht [6], Candes and Tao [7] and Keshavan et al. [14] gave the first theoretical guarantees for the problem, obtaining exact recovery from an almost optimal number of uniformly sampled entries. While RIP does not hold for MCP, we show that a similar property holds for incoherent matrices [6].
Given our refined RIP and a hypothesis bounding the incoherence of the iterates arising in SVP, an analysis similar to that of Theorem 1.1 immediately implies that SVP optimally solves MCP. We provide strong empirical evidence for our hypothesis and show that both of our algorithms recover a low-rank matrix from an almost optimal number of uniformly sampled entries.

In summary, our main contributions are:

• Motivated by [11], we propose a projected gradient based algorithm, SVP, for ARMP and show that our method recovers the optimal rank solution when the affine constraints satisfy RIP. To the best of our knowledge, our isometry constant requirements are the least stringent: we only require δ2k < 1/3, as opposed to δ5k < 1/10 by Recht et al., δ3k < 1/(4√3) by Lee and Bresler [18] and δ4k < 0.04 by Lee and Bresler [17].

• We introduce a Newton-type step in the SVP method which is useful if high precision is critical. SVP-Newton has guarantees similar to those of SVP, is more stable and has better empirical performance in terms of accuracy. For instance, on the Movie-lens dataset [1] and rank k = 3, SVP-Newton achieves an RMSE of 0.89, while the SVT method [5] achieves an RMSE of 0.98.

• As observed in [23], most trace-norm based methods perform poorly for matrix completion when entries are sampled from more realistic power-law distributions.
Our method SVP-Newton is relatively robust to sampling techniques and performs significantly better than the methods of [5, 14, 23] even for power-law distributed samples.

• We show that the affine constraints in the low-rank matrix completion problem satisfy a weaker restricted isometry property and, as supported by empirical evidence, conjecture that SVP (as well as SVP-Newton) recovers the underlying matrix from an almost optimal number of uniformly random samples.

• We evaluate our method on a variety of synthetic and real-world datasets and show that our methods consistently outperform, both in accuracy and time, various existing methods [5, 14].

2 Method

In this section, we first introduce our Singular Value Projection (SVP) algorithm for ARMP and present a proof of its optimality for affine constraints satisfying RIP. We then specialize our algorithm to the problem of matrix completion and prove a more restricted isometry property for the same. Finally, we introduce a Newton-type step in our SVP algorithm and prove its convergence.

2.1 Singular Value Projection (SVP)

Consider the following more robust formulation of ARMP (RARMP),

min_X ψ(X) = (1/2) ∥A(X) − b∥₂²  s.t.  X ∈ C(k) = {X : rank(X) ≤ k}.   (RARMP)

The hardness of the above problem mainly comes from the non-convexity of the set of low-rank matrices C(k). However, the Euclidean projection onto C(k) can be computed efficiently using singular value decomposition (SVD). Our algorithm uses this observation along with the projected gradient method for efficiently minimizing the objective function specified in (RARMP).

Let Pk : R^{m×n} → R^{m×n} denote the orthogonal projection onto the set C(k). That is, Pk(X) = argmin_Y {∥Y − X∥_F : Y ∈ C(k)}.
It is well known that Pk(X) can be computed efficiently by computing the top k singular values and vectors of X.

In SVP, a candidate solution to ARMP is computed iteratively by starting from the all-zero matrix and adapting the classical projected gradient descent update as follows (note that ∇ψ(X) = A^T(A(X) − b)):

X^{t+1} ← Pk( X^t − ηt ∇ψ(X^t) ) = Pk( X^t − ηt A^T(A(X^t) − b) ).   (1)

Algorithm 1 presents SVP in more detail. Note that the iterates X^t are always low-rank, facilitating faster computation of the SVD. See Section 3 for a more detailed discussion of computational issues.

Algorithm 1 Singular Value Projection (SVP) Algorithm
Require: A, b, tolerance ε, ηt for t = 0, 1, 2, . . .
1: Initialize: X^0 = 0 and t = 0
2: repeat
3:   Y^{t+1} ← X^t − ηt A^T(A(X^t) − b)
4:   Compute top k singular vectors of Y^{t+1}: Uk, Σk, Vk
5:   X^{t+1} ← Uk Σk Vk^T
6:   t ← t + 1
7: until ∥A(X^{t+1}) − b∥₂² ≤ ε

Analysis for Constraints Satisfying RIP

Theorem 1.1 shows that SVP converges to an ϵ-approximate solution of RARMP in O(log(∥b∥₂²/ϵ)) steps. Theorem 1.2 shows a similar result for the noisy case. The theorems follow from the following lemma that bounds the objective function after each iteration.

Lemma 2.1 Let X* be an optimal solution of (RARMP) and let X^t be the iterate obtained by SVP at the t-th iteration. Then,

ψ(X^{t+1}) ≤ ψ(X*) + (δ2k / (1 − δ2k)) ∥A(X* − X^t)∥₂²,

where δ2k is the rank-2k isometry constant of A.

The lemma follows from elementary linear algebra, optimality of SVD (Eckart-Young theorem) and two simple applications of RIP. We refer to the supplementary material (Appendix A) for a detailed proof. We now prove Theorem 1.1.
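As a concrete illustration, Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' Matlab implementation: the operator A and its adjoint A^T are assumed to be supplied as callables `A_op` and `At_op`, names chosen here for illustration.

```python
import numpy as np

def svp(A_op, At_op, b, shape, k, eta, eps=1e-9, max_iter=500):
    """Algorithm 1 (SVP): projected gradient descent in which every
    iterate is projected onto C(k) by keeping its top-k singular triplets."""
    X = np.zeros(shape)
    for _ in range(max_iter):
        resid = A_op(X) - b
        if resid @ resid <= eps:                 # ||A(X^t) - b||_2^2 <= eps
            break
        Y = X - eta * At_op(resid)               # gradient step: Y^{t+1}
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U[:, :k] * s[:k]) @ Vt[:k]          # P_k(Y^{t+1}) via top-k SVD
    return X
```

For instance, with an orthonormal measurement operator (isometry constant 0) and step-size ηt = 1, this sketch recovers a rank-k matrix in a single projection step, matching the δ2k = 0 case of Theorem 1.1.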
Theorem 1.2 can also be proved similarly; see the supplementary material (Appendix A) for a detailed proof.

Proof of Theorem 1.1 Using Lemma 2.1 and the fact that ψ(X*) = 0, it follows that

ψ(X^{t+1}) ≤ (δ2k / (1 − δ2k)) ∥A(X* − X^t)∥₂² = (2δ2k / (1 − δ2k)) ψ(X^t).

Also, note that for δ2k < 1/3, 2δ2k/(1 − δ2k) < 1. Hence, ψ(X^τ) ≤ ϵ where τ = ⌈ (1/log((1 − δ2k)/2δ2k)) log(ψ(X^0)/ϵ) ⌉. Further, using RIP for the rank at most 2k matrix X^τ − X* we get: ∥X^τ − X*∥_F² ≤ ∥A(X^τ − X*)∥₂²/(1 − δ2k) = 2ψ(X^τ)/(1 − δ2k) ≤ 2ϵ/(1 − δ2k). Now, the SVP algorithm is initialized using X^0 = 0, i.e., ψ(X^0) = ∥b∥₂²/2. Hence, τ = ⌈ (1/log((1 − δ2k)/2δ2k)) log(∥b∥₂²/2ϵ) ⌉.

2.2 Matrix Completion

We first describe the low-rank matrix completion problem formally. For Ω ⊆ [m] × [n], let PΩ : R^{m×n} → R^{m×n} denote the projection onto the index set Ω. That is, (PΩ(X))_ij = X_ij for (i, j) ∈ Ω and (PΩ(X))_ij = 0 otherwise. Then, the low-rank matrix completion problem (MCP) can be formulated as follows,

min_X rank(X)  s.t.  PΩ(X) = PΩ(X*),  X ∈ R^{m×n}.   (MCP)

Observe that MCP is a special case of ARMP, so we can apply SVP for matrix completion. We use step-size ηt = 1/((1 + δ)p), where p is the density of sampled entries and δ is a parameter which we will explain later in this section.
Using the given step-size and update (1), we get the following update for matrix completion:

X^{t+1} ← Pk( X^t − (1/((1 + δ)p)) (PΩ(X^t) − PΩ(X*)) ).   (2)

Although matrix completion is a special case of ARMP, the affine constraints that define MCP, PΩ, do not satisfy RIP in general. Thus Theorems 1.1, 1.2 above and the results of Recht et al. [24] do not directly apply to MCP. However, we show that the matrix completion affine constraints satisfy RIP for low-rank incoherent matrices.

Definition 2.1 (Incoherence) A matrix X ∈ R^{m×n} with singular value decomposition X = U Σ V^T is µ-incoherent if max_{i,j} |U_ij| ≤ µ/√m and max_{i,j} |V_ij| ≤ µ/√n.

The above notion of incoherence is similar to that introduced by Candes and Recht [6] and also used by [7, 14]. Intuitively, high incoherence (i.e., µ is small) implies that the non-zero entries of X are not concentrated in a small number of entries.
Hence, a random sampling of the matrix should provide enough global information to satisfy RIP.

Using the above definition, we prove the following refined restricted isometry property.

Theorem 2.2 There exists a constant C ≥ 0 such that the following holds for all 0 < δ < 1, µ ≥ 1, n ≥ m ≥ 3: For Ω ⊆ [m] × [n] chosen according to the Bernoulli model with density p ≥ Cµ²k² log n/δ²m, with probability at least 1 − exp(−n log n), the following restricted isometry property holds for all µ-incoherent matrices X of rank at most k:

(1 − δ) p ∥X∥_F² ≤ ∥PΩ(X)∥_F² ≤ (1 + δ) p ∥X∥_F².   (3)

Roughly, our proof combines a Chernoff bound estimate for ∥PΩ(X)∥_F² with a union bound over low-rank incoherent matrices. A proof sketch is presented in Section 2.2.1.

Given the above refined RIP, if the iterates arising in SVP are shown to be incoherent, the arguments of Theorem 1.1 can be used to show that SVP achieves exact recovery for low-rank incoherent matrices from uniformly sampled entries. As supported by empirical evidence, we hypothesize that the iterates X^t arising in SVP remain incoherent when the underlying matrix X* is incoherent. Figure 1 (d) plots the maximum incoherence max_t µ(X^t) = √n max_{t,i,j} |U^t_ij|, where U^t are the left singular vectors of the intermediate iterates X^t computed by SVP. The figure clearly shows that the incoherence µ(X^t) of the iterates is bounded by a constant independent of the matrix size n and density p throughout the execution of SVP. Figure 2 (c) plots the threshold sampling density p beyond which matrix completion for randomly generated matrices is solved exactly by SVP for fixed k and varying matrix sizes n.
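To make the preceding discussion concrete, the matrix-completion update (2) together with the incoherence statistic µ(X^t) can be sketched as follows. This is a minimal NumPy sketch, not the paper's Matlab implementation; the choice δ = 1/3 and the dense mask representation are illustrative assumptions.

```python
import numpy as np

def svp_complete(M_obs, mask, k, delta=1/3, max_iter=200):
    """SVP specialized to matrix completion, update (2): the gradient is
    supported on Omega and scaled by the step-size 1/((1 + delta) * p).
    Also tracks mu(X^t) = sqrt(n) * max_ij |U^t_ij| over the iterates."""
    m, n = M_obs.shape
    p = mask.mean()                          # empirical sampling density
    eta = 1.0 / ((1 + delta) * p)
    X, mu_max = np.zeros_like(M_obs), 0.0
    for _ in range(max_iter):
        G = mask * (X - M_obs)               # P_Omega(X^t) - P_Omega(X*)
        U, s, Vt = np.linalg.svd(X - eta * G, full_matrices=False)
        X = (U[:, :k] * s[:k]) @ Vt[:k]      # P_k via top-k SVD
        mu_max = max(mu_max, np.sqrt(n) * np.abs(U[:, :k]).max())
    return X, mu_max
```

On random low-rank instances with Gaussian factors (which are incoherent with high probability), the returned `mu_max` stays bounded by a small constant, in line with the behavior plotted in Figure 1 (d).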
Note that the density threshold matches the information-theoretically optimal bound [14] of Θ(k log n/n).

Motivated by Theorem 2.2 and supported by empirical evidence (Figures 2 (c), (d)), we hypothesize that SVP achieves exact recovery from an almost optimal number of samples for incoherent matrices.

Conjecture 2.3 Fix µ, k and δ ≤ 1/3. Then, there exists a constant C such that for a µ-incoherent matrix X* of rank at most k and Ω sampled from the Bernoulli model with density p = Ω_{µ,k}((log n)/m), SVP with step-size ηt = 1/((1 + δ)p) converges to X* with high probability. Moreover, SVP outputs a matrix X of rank at most k such that ∥PΩ(X) − PΩ(X*)∥_F² ≤ ϵ after O_{µ,k}(⌈log(1/ϵ)⌉) iterations.

2.2.1 RIP for Matrix Completion on Incoherent Matrices

We now prove the restricted isometry property of Theorem 2.2 for the affine constraints that result from the projection operator PΩ. To prove Theorem 2.2 we first show the theorem for a discrete collection of matrices using Chernoff type large-deviation bounds and use standard quantization arguments to generalize to the continuous case. We first introduce some notation and provide useful lemmas for our main proof¹. First, we introduce the notion of α-regularity.

Definition 2.2 A matrix X ∈ R^{m×n} is α-regular if max_{i,j} |X_ij| ≤ (α/√(mn)) ∥X∥_F.

Lemma 2.4 below relates the notion of regularity to incoherence and Lemma 2.5 proves (3) for a fixed regular matrix when the samples Ω are selected independently.

Lemma 2.4 Let X ∈ R^{m×n} be a µ-incoherent matrix of rank at most k. Then X is µ√k-regular.

Lemma 2.5 Fix an α-regular X ∈ R^{m×n} and 0 < δ < 1.
Then, for Ω ⊆ [m] × [n] chosen according to the Bernoulli model, with each pair (i, j) ∈ Ω chosen independently with probability p,

Pr[ |∥PΩ(X)∥_F² − p∥X∥_F²| ≥ δ p ∥X∥_F² ] ≤ 2 exp( −δ²pmn / 3α² ).

While the above lemma shows Equation (3) for a fixed rank-k, µ-incoherent X (i.e., a (µ√k)-regular X using Lemma 2.4), we need to show Equation (3) for all such rank-k incoherent matrices. To handle this problem, we discretize the space of low-rank incoherent matrices so as to be able to use the above lemma and a union bound. We now show the existence of a small set of matrices S(µ, ϵ) ⊆ R^{m×n} such that every low-rank µ-incoherent matrix is close to an appropriately regular matrix from the set S(µ, ϵ).

Lemma 2.6 For all 0 < ϵ < 1/2, µ ≥ 1, m, n ≥ 3 and k ≥ 1, there exists a set S(µ, ϵ) ⊆ R^{m×n} with |S(µ, ϵ)| ≤ (mnk/ϵ)^{3(m+n)k} such that the following holds. For any µ-incoherent X ∈ R^{m×n} of rank k with ∥X∥₂ = 1, there exists Y ∈ S(µ, ϵ) s.t. ∥Y − X∥_F < ϵ and Y is (4µ√k)-regular.

We now prove Theorem 2.2 by combining Lemmas 2.5, 2.6 and applying a union bound. We present a sketch of the proof but defer the details to the supplementary material (Appendix B).

Proof Sketch of Theorem 2.2 Let S′(µ, ϵ) = {Y : Y ∈ S(µ, ϵ), Y is 4µ√k-regular}, where S(µ, ϵ) is as in Lemma 2.6 for ϵ = δ/9mnk. Let m ≤ n.
Then, by Lemma 2.5 and a union bound, for any Y ∈ S′(µ, ϵ),

Pr[ |∥PΩ(Y)∥_F² − p∥Y∥_F²| ≥ δ p ∥Y∥_F² ] ≤ 2 (mnk/ϵ)^{3(m+n)k} exp( −δ²pmn / 16µ²k ) ≤ exp(C₁ nk log n) · exp( −δ²pmn / 16µ²k ),

where C₁ ≥ 0 is a constant independent of m, n, k. Thus, if p > Cµ²k² log n/δ²m, where C = 16(C₁ + 1), with probability at least 1 − exp(−n log n), the following holds:

∀ Y ∈ S′(µ, ϵ),  |∥PΩ(Y)∥_F² − p∥Y∥_F²| ≤ δ p ∥Y∥_F².   (4)

As the statement of the theorem is invariant under scaling, it is enough to show the statement for all µ-incoherent matrices X of rank at most k with ∥X∥₂ = 1. Fix such an X and suppose that (4) holds. Now, by Lemma 2.6 there exists Y ∈ S′(µ, ϵ) such that ∥Y − X∥_F ≤ ϵ. Moreover,

∥Y∥_F² ≤ (∥X∥_F + ϵ)² ≤ ∥X∥_F² + 2ϵ∥X∥_F + ϵ² ≤ ∥X∥_F² + 3ϵk,  so  |∥X∥_F² − ∥Y∥_F²| ≤ 3ϵk.

Proceeding similarly, we can show that |∥PΩ(X)∥_F² − ∥PΩ(Y)∥_F²| ≤ 3ϵk.   (5)

Combining inequalities (4), (5) above, with probability at least 1 − exp(−n log n) we have,

|∥PΩ(X)∥_F² − p∥X∥_F²| ≤ |∥PΩ(X)∥_F² − ∥PΩ(Y)∥_F²| + |∥PΩ(Y)∥_F² − p∥Y∥_F²| + p|∥X∥_F² − ∥Y∥_F²| ≤ 2δ p ∥X∥_F².

The theorem follows using the above inequality.

¹ Detailed proofs of all the lemmas in this section are provided in Appendix B of the supplementary material.

2.3 SVP-Newton

In this section we introduce a Newton-type step in our SVP method to speed up its convergence. Recall that each iteration of SVP (Equation (1)) takes a step along the gradient of the objective function and then projects the iterate to the set of low-rank matrices using SVD. Now, the top k singular vectors (Uk, Vk) of Y^{t+1} = X^t − ηt A^T(A(X^t) − b) determine the row-space and column-space of the next iterate in SVP, and Σk is given by Σk = Diag(Uk^T (X^t − ηt A^T(A(X^t) − b)) Vk). Hence, Σk can be seen as the result of a single gradient-descent step on the quadratic objective ψ(Uk S Vk^T) in S; instead, we can minimize this quadratic exactly. This leads us to the following variant of SVP we call SVP-Newton:²

Compute top k singular vectors Uk, Vk of Y^{t+1} = X^t − ηt A^T(A(X^t) − b)
X^{t+1} = Uk Σk Vk^T,  Σk = argmin_S ∥A(Uk S Vk^T) − b∥₂².

Note that as A is an affine transformation, Σk can be computed by solving a least squares problem in k × k variables. Also, for a single iteration, given the same starting point, SVP-Newton decreases the objective function more than SVP. This observation, along with straightforward modifications of the proofs of Theorems 1.1, 1.2, shows that similar guarantees hold for SVP-Newton as well³. Note that the least squares problem for computing Σk has k² variables.
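For the matrix-completion operator A = PΩ, this k²-variable least squares problem can be sketched as follows. This is a hypothetical NumPy sketch, not the authors' implementation; representing Ω as parallel `rows`/`cols` index arrays is our own choice.

```python
import numpy as np

def newton_sigma(Uk, Vk, rows, cols, b):
    """SVP-Newton step for A = P_Omega: solve
    S = argmin_S || A(Uk S Vk^T) - b ||_2, a dense least squares
    problem in the k^2 entries of S. (rows[t], cols[t]) are the
    observed positions and b[t] the corresponding observed values."""
    k = Uk.shape[1]
    # Design matrix: row t holds the products Uk[i_t, a] * Vk[j_t, c],
    # so that D @ S.ravel() equals (Uk S Vk^T)[i_t, j_t].
    D = (Uk[rows][:, :, None] * Vk[cols][:, None, :]).reshape(len(rows), k * k)
    S, *_ = np.linalg.lstsq(D, b, rcond=None)
    return S.reshape(k, k)
```

When the observed values are exactly consistent with some Uk S* Vk^T and the design matrix has full column rank (which holds generically once |Ω| ≥ k²), the least squares solution recovers S* exactly.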
This makes SVP-Newton computationally expensive for problems with large rank, particularly in situations with a large number of constraints, as is the case for matrix completion. To overcome this issue, we also consider the alternative where we restrict Σk to be a diagonal matrix, leading to the update

Σk = argmin_{S : S_ij = 0 for i ≠ j} ∥A(Uk S Vk^T) − b∥₂².   (6)

We call the above method SVP-NewtonD (for SVP-Newton Diagonal). As for SVP-Newton, guarantees similar to SVP follow for SVP-NewtonD by observing that for each iteration, SVP-NewtonD decreases the objective function more than SVP.

3 Related Work and Computational Issues

The general rank minimization problem with affine constraints is NP-hard and is also NP-hard to approximate [22]. Most methods for ARMP either relax the rank constraint to a convex function such as the trace-norm [8], [9], or assume a factorization and optimize the resulting non-convex problem by alternating minimization [4, 3, 15].

The results of Recht et al. [24] were later extended to noisy measurements and isometry constants up to δ3k < 1/(4√3) by Fazel et al. [10] and Lee and Bresler [18]. However, even the best existing optimization algorithms for the trace-norm relaxation are relatively inefficient in practice. Recently, Lee and Bresler [17] proposed an algorithm (ADMiRA) motivated by the orthogonal matching pursuit line of work in compressed sensing and showed that for affine constraints with isometry constant δ4k ≤ 0.04, their algorithm recovers the optimal solution. However, their method is not very efficient for large datasets and when the rank of the optimal solution is relatively large.

For the matrix completion problem, until the recent works of [6], [7] and [14], there were few methods with rigorous guarantees.
The alternating least squares minimization heuristic and its variants [3, 15] perform the best in practice, but are notoriously hard to analyze. Candes and Recht [6] and Candes and Tao [7] show that if X* is µ-incoherent and the known entries are sampled uniformly at random with |Ω| ≥ C(µ) k²n log² n, finding the minimum trace-norm solution recovers the minimum rank solution. Keshavan et al. obtained similar results independently for exact recovery from uniformly sampled Ω with |Ω| ≥ C(µ, k) n log n.

Minimizing the trace-norm of a matrix subject to affine constraints can be cast as a semi-definite program (SDP). However, algorithms for semi-definite programming, as used by most methods for minimizing trace-norm, are prohibitively expensive even for moderately large datasets. Recently, a variety of methods based mostly on iterative soft-thresholding have been proposed to solve the trace-norm minimization problem more efficiently. For instance, Cai et al. [5] proposed a Singular Value Thresholding (SVT) algorithm which is based on Uzawa's algorithm [2]. A related approach based on linearized Bregman iterations was proposed by Ma et al. [20] and Toh and Yun [25], while Ji and Ye [13] use Nesterov's gradient descent methods for optimizing the trace-norm.

While the soft-thresholding based methods for trace-norm minimization are significantly faster than SDP based approaches, they suffer from slow convergence (see Figure 2 (d)).
Also, noisy measurements pose considerable computational challenges for trace-norm optimization, as the rank of the intermediate iterates can become very large (see Figure 3 (b)).

² We call our method SVP-Newton as the Newton method, when applied to a quadratic objective function, leads to the exact solution by solving the resulting least squares problem.

³ As a side note, we can show a stronger result for SVP-Newton when applied to the special case of compressed sensing, i.e., when the matrix X is restricted to be diagonal. Specifically, we can show that under certain assumptions SVP-Newton converges to the optimal solution in O(log k) iterations, improving upon the result of Maleki [21]. We give the precise statement of the theorem and proof in the supplementary material.

Figure 1: (a) Time taken by SVP and SVT for random instances of the Affine Rank Minimization Problem (ARMP) with optimal rank k = 5. (b) Reconstruction error for the MIT logo. (c) Empirical estimates of the sampling density threshold required for exact matrix completion by SVP (here C = 1.28). Note that the empirical bounds match the information theoretically optimal bound Θ(k log n/n). (d) Maximum incoherence max_t µ(X^t) over the iterates of SVP for varying densities p and sizes n. Note that the incoherence is bounded by a constant, supporting Conjecture 2.3.

Figure 2: (a), (b) Running time (on log scale) and RMSE of various methods for the matrix completion problem with sampling density p = .1 and optimal rank k = 2. (c) Running time (on log scale) of various methods for matrix completion with sampling density p = .1 and n = 1000.
(d) Number of iterations needed to get RMSE 0.001.

For the case of matrix completion, SVP has an important property facilitating fast computation of the main update in equation (2): each iteration of SVP involves computing the singular value decomposition (SVD) of the matrix Y = X^t − (1/((1 + δ)p)) PΩ(X^t − X*), where X^t is a matrix of rank at most k whose SVD is known and PΩ(X^t − X*) is a sparse matrix. Thus, matrix-vector products of the form Y v can be computed in time O((m + n)k + |Ω|). This facilitates the use of fast SVD computing packages such as PROPACK [16] and ARPACK [19] that only require subroutines for computing matrix-vector products.

4 Experimental Results

In this section, we empirically evaluate our methods on the affine rank minimization problem and low-rank matrix completion. For both problems we present empirical results on synthetic as well as real-world datasets. For ARMP we compare our method against the trace-norm based singular value thresholding (SVT) method [5]. Note that although Cai et al. present the SVT algorithm in the context of MCP, it can be easily adapted for ARMP. For MCP we compare against SVT, ADMiRA [17], the OptSpace (OPT) method of Keshavan et al. [14], and regularized alternating least squares minimization (ALS). We use our own implementation of SVT for ARMP and of ALS, while for matrix completion we use the code provided by the respective authors for SVT, ADMiRA and OPT. We report results averaged over 20 runs. All the methods are implemented in Matlab and use mex files.

4.1 Affine Rank Minimization

We first compare our method against SVT on random instances of ARMP. We generate random matrices X ∈ R^{n×n} of different sizes n and fixed rank k = 5. We then generate d = 6kn random affine constraint matrices Ai and compute b = A(X).
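This instance generation can be sketched as follows. The sketch is a minimal NumPy illustration with Gaussian constraint matrices; the paper does not specify the exact distribution of the A_i, so the Gaussian choice here is an assumption.

```python
import numpy as np

def random_armp_instance(n, k, rng):
    """Random ARMP instance as in Section 4.1: a rank-k matrix X and
    d = 6kn affine constraints, with b_i = <A_i, X> = Tr(A_i^T X)."""
    X = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
    A = rng.standard_normal((6 * k * n, n * n))   # row i stores vec(A_i)
    b = A @ X.ravel()                             # b_i = <A_i, X>
    return X, A, b
```

Storing each A_i as a row of a d × n² matrix makes applying A and its adjoint a single dense matrix-vector product, which is convenient for small synthetic instances like these.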
Figure 1(a) compares the computational time required by SVP and SVT (on a log scale) for achieving a relative error ∥A(X) − b∥₂/∥b∥₂ of 10^{-3}, and shows that our method requires many fewer iterations and is significantly faster than SVT.

Next we evaluate our method for the problem of matrix reconstruction from random measurements. As in Recht et al. [24], we use the MIT logo as the test image for reconstruction. The MIT logo we use is a 38 × 73 image and has rank four. For reconstruction, we generate random measurement matrices A_i and measure b_i = Tr(A_i X). We let both SVP and SVT converge and then compute the reconstruction error for the original image. Figure 1(b) shows that our method incurs significantly smaller reconstruction error than SVT for the same number of measurements.

  k    SVP-NewtonD    SVP     ALS     SVT
  2    0.90           1.15    0.88    1.06
  3    0.89           1.14    0.87    0.98
  5    0.89           1.09    0.86    0.95
  7    0.89           1.08    0.86    0.93
 10    0.90           1.07    0.87    0.91
 12    0.91           1.08    0.88    0.90

Figure 3: (a): RMSE (table above) incurred by various methods for matrix completion with different rank (k) solutions on the Movie-Lens dataset. (b): Time (on log scale) required by various methods for matrix completion with p = .1, k = 2 and 10% Gaussian noise. Note that all four methods achieve similar RMSE. (c): RMSE incurred by various methods for matrix completion with p = 0.1, k = 10 when the sampling distribution follows a power-law distribution (Chung-Lu-Vu model). (d): RMSE incurred for the same problem setting as plot (c) but with added Gaussian noise.

Matrix Completion: Synthetic Datasets (Uniform Sampling)
We now evaluate our method against various matrix completion methods for random low-rank matrices and uniform samples. We generate a random rank-k matrix X ∈ R^{n×n} and generate random Bernoulli samples with probability p. Figure 2(a) compares the time required by various methods (on a log scale) to obtain a root mean square error (RMSE) of 10^{-3} on the sampled entries for fixed k = 2. Clearly, SVP is substantially faster than the other methods. Next, we evaluate our method for increasing k. Figure 2(b) compares the overall RMSE obtained by various methods. Note that SVP-Newton is significantly more accurate than both SVP and SVT. Figure 2(c) compares the time required by various methods to obtain an RMSE of 10^{-3} on the sampled entries for fixed n = 1000 and increasing k. Note that our algorithms scale well with increasing k and are faster than the other methods. Next, we analyze the reasons for the better performance of our methods. To this end, we plot the number of iterations required by our methods as compared to SVT (Figure 2(d)). Note that even though each iteration of SVT is almost as expensive as one of ours, our methods converge in significantly fewer iterations.

Finally, we study the behavior of our method in the presence of noise. For this experiment, we generate random matrices of different sizes and add approximately 10% Gaussian noise. Figure 3(b) plots the time required by various methods as n increases from 1000 to 5000.
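The uniform-sampling completion experiments above can be reproduced in miniature. The following is a hedged Python/NumPy sketch (not the authors' Matlab code; the unit step size, the full SVD, and the small problem sizes are our simplifications) of the SVP iteration for matrix completion: a gradient step on the observed entries followed by projection onto rank-k matrices.

```python
import numpy as np

# Simplified SVP iteration for matrix completion: take a gradient step on the
# observed entries, then project onto rank-k matrices via a truncated SVD.
def svp_complete(M_obs, mask, k, iters=300):
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        Y = X - mask * (X - M_obs)       # unit step size, for simplicity
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U[:, :k] * s[:k]) @ Vt[:k]  # rank-k projection P_k
    return X

rng = np.random.default_rng(1)
n, k, p = 80, 2, 0.5
M = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))  # rank-k truth
mask = (rng.random((n, n)) < p).astype(float)                  # Bernoulli samples
X_hat = svp_complete(mask * M, mask, k)
rmse = np.sqrt(np.sum((mask * (X_hat - M)) ** 2) / mask.sum())
```

With these easy sizes the observed-entry RMSE drops to near zero; the sketch omits the step-size tuning and the PROPACK-based partial SVD that make the full method fast at scale.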
Note that SVT is particularly sensitive to noise. One of the reasons for this is that, due to noise, the rank of the intermediate iterates arising in SVT can become fairly large.

Matrix Completion: Synthetic Dataset (Power-law Sampling)
We now evaluate our methods against existing matrix-completion methods under more realistic power-law distributed samples. As before, we generate a random rank k = 10 matrix X ∈ R^{n×n} and sample the entries of X using a graph generated by the Chung-Lu-Vu model with power-law distributed degrees (see [23] for details). Figure 3(c) plots the RMSE obtained by various methods for varying n and fixed sampling density p = 0.1. Note that SVP-NewtonD performs significantly better than both SVT and SVP. Figure 3(d) plots the RMSE obtained by various methods when each sampled entry is corrupted with around 1% Gaussian noise. Note that here again SVP-NewtonD performs similarly to ALS and is significantly better than the other methods, including the ICMC method [23], which is specially designed for power-law sampling but is quite sensitive to noise.

Matrix Completion: Movie-Lens Dataset
Finally, we evaluate our method on the Movie-Lens dataset [1], which contains 1 million ratings for 3900 movies by 6040 users. Figure 3(a) shows the RMSE obtained by each method with varying k. For SVP and SVP-Newton, we fix the step size to be η = 1/(p√t), where t is the number of iterations. For SVT, we fix δ = .2p using cross-validation. Since the rank cannot be fixed in SVT, we try various values of the parameter τ to obtain a solution of the desired rank. Note that SVP-Newton incurs an RMSE of 0.89 for k = 3. In contrast, SVT achieves an RMSE of 0.98 for the same rank. We remark that SVT was able to achieve an RMSE of 0.89, but required a rank-17 solution and was significantly slower to converge because many intermediate iterates had large rank (up to around 150). We
We\nattribute the relatively poor performance of SVP and SVT as compared with ALS and SVP-Newton\nto the fact that the ratings matrix is not sampled uniformly, thus violating the crucial assumption of\nuniformly distributed samples.\nAcknowledgements: This research was supported in part by NSF grant CCF-0728879.\n\n\u221a\n\n8\n\n10002000300040005000100101102103n (Size of Matrix)Time Taken (secs) SVP\u2212NewtonDSVPSVTALS50010001500200000.511.522.5n (Size of Matrix)RMSE ICMCALSSVTSVPSVP NewtonD50010001500200001234n (Size of Matrix)RMSE ICMCALSSVTSVPSVP NewtonD\fReferences\n[1] Movie lens dataset. Public dataset. URL http://www.grouplens.org/taxonomy/term/14.\n[2] K. Arrow, L. Hurwicz, and H. Uzawa. Studies in Linear and Nonlinear Programming. Stanford University\n\nPress, Stanford, 1958.\n\n[3] Robert Bell and Yehuda Koren. Scalable collaborative \ufb01ltering with jointly derived neighborhood inter-\n\npolation weights. In ICDM, pages 43\u201352, 2007. doi: 10.1109/ICDM.2007.90.\n\n[4] Matthew Brand. Fast online SVD revisions for lightweight recommender systems. In SIAM International\n\nConference on Data Mining, 2003.\n\n[5] Jian-Feng Cai, Emmanuel J. Cand`es, and Zuowei Shen. A singular value thresholding algorithm for\n\nmatrix completion. SIAM Journal on Optimization, 20(4):1956\u20131982, 2010.\n\n[6] Emmanuel J. Cand`es and Benjamin Recht. Exact matrix completion via convex optimization. Foundations\n\nof Computational Mathematics, 9(6):717\u2013772, December 2009.\n\n[7] Emmanuel J. Cand`es and Terence Tao. The power of convex relaxation: Near-optimal matrix completion.\n\nIEEE Trans. Inform. Theory, 56(5):2053\u20132080, 2009.\n\n[8] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order\n\nsystem approximation. In American Control Conference, Arlington, Virginia, 2001.\n\n[9] M. Fazel, H. Hindi, and S. Boyd. 
Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In American Control Conference, 2003.
[10] M. Fazel, E. Candès, B. Recht, and P. Parrilo. Compressed sensing and robust recovery of low rank matrices. In Signals, Systems and Computers, 2008 42nd Asilomar Conference on, pages 1043–1047, Oct. 2008. doi: 10.1109/ACSSC.2008.5074571.
[11] Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, 2009.
[12] Donald Goldfarb and Shiqian Ma. Convergence of fixed point continuation algorithms for matrix rank minimization, 2009. Submitted.
[13] Shuiwang Ji and Jieping Ye. An accelerated gradient method for trace norm minimization. In ICML, 2009.
[14] Raghunandan H. Keshavan, Sewoong Oh, and Andrea Montanari. Matrix completion from a few entries. In ISIT '09: Proceedings of the 2009 IEEE International Conference on Symposium on Information Theory, pages 324–328, Piscataway, NJ, USA, 2009. IEEE Press. ISBN 978-1-4244-4312-3.
[15] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD, pages 426–434, 2008. doi: 10.1145/1401890.1401944.
[16] R.M. Larsen. PROPACK: software for large and sparse SVD calculations. Available online. URL http://sun.stanford.edu/~rmunk/PROPACK/.
[17] Kiryung Lee and Yoram Bresler. ADMiRA: Atomic decomposition for minimum rank approximation, 2009.
[18] Kiryung Lee and Yoram Bresler. Guaranteed minimum rank approximation from linear observations by nuclear norm minimization with an ellipsoidal constraint, 2009.
[19] Richard B. Lehoucq, Danny C. Sorensen, and Chao Yang. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia, 1998.
[20] S. Ma, D. Goldfarb, and L. Chen.
Fixed point and Bregman iterative methods for matrix rank minimization. To appear, Mathematical Programming Series A, 2010.
[21] Arian Maleki. Coherence analysis of iterative thresholding algorithms. CoRR, abs/0904.1193, 2009.
[22] Raghu Meka, Prateek Jain, Constantine Caramanis, and Inderjit S. Dhillon. Rank minimization via online learning. In ICML, pages 656–663, 2008. doi: 10.1145/1390156.1390239.
[23] Raghu Meka, Prateek Jain, and Inderjit S. Dhillon. Matrix completion from power-law distributed samples. In NIPS, 2009.
[24] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, 2007. To appear in SIAM Review.
[25] K.C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Preprint, 2009. URL http://www.math.nus.edu.sg/~matys/apg.pdf.