{"title": "Scaled Gradients on Grassmann Manifolds for Matrix Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 1412, "page_last": 1420, "abstract": "This paper describes gradient methods based on a scaled metric on the Grassmann manifold for low-rank matrix completion. The proposed methods significantly improve canonical gradient methods, especially on ill-conditioned matrices, while maintaining established global convergence and exact recovery guarantees. A connection between a form of subspace iteration for matrix completion and the scaled gradient descent procedure is also established. The proposed conjugate gradient method based on the scaled gradient outperforms several existing algorithms for matrix completion and is competitive with recently proposed methods.", "full_text": "Scaled Gradients on Grassmann Manifolds\n\nfor Matrix Completion\n\nThanh T. Ngo and Yousef Saad\n\nDepartment of Computer Science and Engineering\n\nUniversity of Minnesota, Twin Cities\n\nMinneapolis, MN 55455\n\nthango@cs.umn.edu, saad@cs.umn.edu\n\nAbstract\n\nThis paper describes gradient methods based on a scaled metric on the Grassmann\nmanifold for low-rank matrix completion. The proposed methods significantly\nimprove canonical gradient methods, especially on ill-conditioned matrices, while\nmaintaining established global convergence and exact recovery guarantees. A\nconnection between a form of subspace iteration for matrix completion and the scaled\ngradient descent procedure is also established. The proposed conjugate gradient\nmethod based on the scaled gradient outperforms several existing algorithms for\nmatrix completion and is competitive with recently proposed methods.\n\n1 Introduction\n\nLet A \u2208 R^{m\u00d7n} be a rank-r matrix, where r \u226a m, n. The matrix completion problem is to\nreconstruct A given a subset of entries of A. 
This problem has attracted much attention recently\n[8, 14, 13, 18, 21] because of its broad applications, e.g., in recommender systems, structure from\nmotion, and multitask learning (see e.g. [19, 9, 2]).\n\n1.1 Related work\n\nLet \u2126 = {(i, j) | A_ij is observed}. We define P_\u2126(A) \u2208 R^{m\u00d7n} to be the projection of A onto the\nobserved entries \u2126: P_\u2126(A)_ij = A_ij if (i, j) \u2208 \u2126 and P_\u2126(A)_ij = 0 otherwise. If the rank is\nunknown and there is no noise, the problem can be formulated as:\n\nMinimize rank(X) subject to P_\u2126(X) = P_\u2126(A). (1)\n\nRank minimization is NP-hard, so work has been done to solve a convex relaxation of it by\napproximating the rank by the nuclear norm. Under some conditions, the solution of the relaxed\nproblem can be shown to be the exact solution of the rank minimization problem with overwhelming\nprobability [8, 18]. Usually, algorithms to minimize the nuclear norm iteratively use the Singular\nValue Decomposition (SVD), specifically the singular value thresholding operator [7, 15, 17], which\nmakes them expensive.\n\nIf the rank is known, we can formulate the matrix completion problem as follows:\n\nFind matrix X to minimize ||P_\u2126(X) \u2212 P_\u2126(A)||_F subject to rank(X) = r. (2)\n\nKeshavan et al. [14] have proved that exact recovery can be obtained with high probability by solving\na non-convex optimization problem. A number of algorithms based on non-convex formulations\nuse the framework of optimization on matrix manifolds [14, 22, 6]. Keshavan et al. [14] propose\na steepest descent procedure on the product of Grassmann manifolds of r-dimensional subspaces.\nVandereycken [22] discusses a conjugate gradient algorithm on the Riemannian manifold of rank-r\nmatrices. Boumal and Absil [6] consider a trust region method on the Grassmann manifold. Although\nthey do not solve an optimization problem on the matrix manifold, Wen et al. 
[23] perform a low-rank\nmatrix factorization based on a successive over-relaxation iteration. Also, Srebro and Jaakkola [21]\ndiscuss SVD-EM, one of the early fixed-rank methods, which uses the truncated singular value decomposition\niteratively. Dai et al. [10] recently proposed an interesting approach that does not use the Frobenius\nnorm of the residual as the objective function but instead uses the consistency between the current\nestimate of the column space (or row space) and the observed entries. Guaranteed performance for\nthis method has been established for rank-1 matrices.\n\nIn this paper, we will focus on the case when the rank r is known and solve problem (2). In fact,\neven when the rank is unknown, the sparse matrix which consists of observed entries can give us a\nvery good approximation of the rank based on its singular spectrum [14]. Also, a few values of the\nrank can be tried and the best one selected. Moreover, the singular spectrum is revealed during\nthe iterations, so many fixed-rank methods can also be adapted to find the rank of the matrix.\n\n1.2 Our contribution\n\nOptSpace [14] is an efficient algorithm for low-rank matrix completion with global convergence and\nexact recovery guarantees. We propose using a non-canonical metric on the Grassmann manifold to\nimprove OptSpace while maintaining its appealing properties. The non-canonical metric introduces\na scaling factor to the gradient of the objective function which can be interpreted as an adaptive\npreconditioner for the matrix completion problem. The gradient descent procedure using the scaled\ngradient is related to a form of subspace iteration for matrix completion. Each iteration of the\nsubspace iteration is inexpensive and the procedure converges very rapidly. 
The connection between\nthe two methods leads to some improvements and to efficient implementations for both of them.\nThroughout the paper, A_\u2126 will be a shorthand for P_\u2126(A) and qf(U) is the Q factor in the QR\nfactorization of U, which gives an orthonormal basis for span(U). Also, P_{\u2126\u0304}(.) denotes the projection\nonto the complement of \u2126.\n\n2 Subspace iteration for incomplete matrices\n\nWe begin with a form of subspace iteration for matrix completion depicted in Algorithm 1.\n\nAlgorithm 1 SUBSPACE ITERATION FOR INCOMPLETE MATRICES.\nInput: Matrix A_\u2126, \u2126, and the rank r.\nOutput: Left and right dominant subspaces U and V and associated singular values.\n1: [U_0, \u03a3_0, V_0] = svd(A_\u2126, r), S_0 = \u03a3_0; // Initialize U, V and \u03a3\n2: for i = 0,1,2,... do\n3:     X_{i+1} = P_{\u2126\u0304}(U_i S_i V_i^T) + A_\u2126 // Obtain new estimate of A\n4:     U_{i+1} = X_{i+1} V_i; V_{i+1} = X_{i+1}^T U_{i+1} // Update subspaces\n5:     U_{i+1} = qf(U_{i+1}); V_{i+1} = qf(V_{i+1}) // Re-orthogonalize bases\n6:     S_{i+1} = U_{i+1}^T X_{i+1} V_{i+1} // Compute new S for next estimate of A\n7:     if condition then\n8:         // Diagonalize S to obtain current estimate of singular vectors and values\n9:         [R_U, \u03a3_{i+1}, R_V] = svd(S_{i+1}); U_{i+1} = U_{i+1} R_U; V_{i+1} = V_{i+1} R_V; S_{i+1} = \u03a3_{i+1}.\n10:     end if\n11: end for\n\nIf the matrix A is fully observed, U and V can be randomly initialized, line 3 is not needed and in lines\n4 and 6 we use A instead of X_{i+1} to update the subspaces. In this case, we have the classical two-sided\nsubspace iteration for singular value decomposition. Lines 6-9 correspond to a Rayleigh-Ritz\nprojection to obtain current approximations of singular vectors and singular values. 
It is known that\nif the initial columns of U and V are not orthogonal to any of the first r left and right singular vectors\nof A respectively, the algorithm converges to the dominant subspaces of A [20, Theorem 5.1].\nBack to the case when the matrix A is not fully observed, the basic idea of Algorithm 1 is to use\nan approximation of A in each iteration to update the subspaces U and V and then, from the new U\nand V, to obtain a better approximation of A for the next iteration. Line 3 computes a new\nestimate of A by replacing all entries of U_i S_i V_i^T at the known positions by the true values in A.\nThe update in line 6 gets the new S_{i+1} based on the recently computed subspaces. Diagonalizing\nS_{i+1} (lines 7-10) is optional for matrix completion. This step provides current approximations\nof the singular values which could be useful for several purposes such as in regularization or for\nconvergence tests. This comes with very little additional overhead, since S_{i+1} is a small r \u00d7 r matrix.\nEach iteration of Algorithm 1 can be seen as an approximation of an iteration of SVD-EM where a\nfew matrix multiplications are used to update U and V instead of using a truncated SVD to compute\nthe dominant subspaces of X_{i+1}. Recall that computing an SVD, e.g. by a Lanczos type procedure,\nrequires several, possibly a large number of, matrix multiplications of this type.\n\nWe now discuss efficient implementations of Algorithm 1 and modifications to speed up its convergence.\nFirst, the explicit computation of X_{i+1} in line 3 is not needed. Let X\u0302_i = U_i S_i V_i^T. Then\nX_{i+1} = P_{\u2126\u0304}(U_i S_i V_i^T) + A_\u2126 = X\u0302_i + E_i, where E_i = P_\u2126(A \u2212 X\u0302_i) is a sparse matrix of errors at\nknown entries which can be computed efficiently by exploiting the structure of X\u0302_i. Assume that each\nS_i is not singular (the non-singularity of S_i will be discussed in Section 4). 
Then if we post-multiply\nthe update of U in line 4 by S_i^{-1}, the subspace remains the same, and the update becomes:\n\nU_{i+1} = X_{i+1} V_i S_i^{-1} = (X\u0302_i + E_i) V_i S_i^{-1} = U_i + E_i V_i S_i^{-1}. (3)\n\nThe update of V can also be efficiently implemented. Here, we make a slight change, namely\nV_{i+1} = X_{i+1}^T U_i (U_i instead of U_{i+1}). We observe that the convergence speed remains roughly the\nsame (when A is fully observed, the algorithm is a slower version of subspace iteration where the\nconvergence rate is halved). With this change, we can derive an update to V that is similar to (3):\n\nV_{i+1} = V_i + E_i^T U_i S_i^{-T}. (4)\n\nWe will point out in Section 3 that the updating terms E_i V_i S_i^{-1} and E_i^T U_i S_i^{-T} are related to the\ngradients of a matrix completion objective function on the Grassmann manifold. As a result, to\nimprove the convergence speed, we can add an adaptive step size t_i to the process, as follows:\n\nU_{i+1} = U_i + t_i E_i V_i S_i^{-1} and V_{i+1} = V_i + t_i E_i^T U_i S_i^{-T}.\n\nThis is equivalent to using X\u0302_i + t_i E_i as the estimate of A in each iteration. The step size can be\ncomputed using a heuristic adapted from [23]. Initially, t is set to some initial value t_0 (t_0 = 1 in\nour experiments). If the error ||E_i||_F decreases compared to the previous step, t is increased by a\nfactor \u03b1. Conversely, if the error increases, indicating that the step is too big, t is reset to t = t_0.\nThe matrix S_{i+1} can be computed efficiently by exploiting low-rank structures and the sparsity:\n\nS_{i+1} = (U_{i+1}^T U_i) S_i (V_i^T V_{i+1}) + t_i U_{i+1}^T E_i V_{i+1}. (5)\n\nThere are also other ways to obtain S_{i+1} once U_{i+1} and V_{i+1} are determined to improve the current\napproximation of A. For example, we can solve the following quadratic program [14]:\n\nS_{i+1} = argmin_S ||P_\u2126(A \u2212 U_{i+1} S V_{i+1}^T)||_F^2. (6)\n\nWe summarize the discussion in Algorithm 2. 
A sufficiently small error ||E_i||_F can be used as a\nstopping criterion.\n\nAlgorithm 2 GENERIC SUBSPACE ITERATION FOR INCOMPLETE MATRICES.\nInput: Matrix A_\u2126, \u2126, and number r.\nOutput: Left and right dominant subspaces U and V and associated singular values.\n1: Initialize orthonormal matrices U_0 \u2208 R^{m\u00d7r} and V_0 \u2208 R^{n\u00d7r}.\n2: for i = 0,1,2,... do\n3:     Compute E_i and appropriate step size t_i\n4:     U_{i+1} = U_i + t_i E_i V_i S_i^{-1} and V_{i+1} = V_i + t_i E_i^T U_i S_i^{-T}\n5:     Orthonormalize U_{i+1} and V_{i+1}\n6:     Find S_{i+1} such that P_\u2126(U_{i+1} S_{i+1} V_{i+1}^T) is close to A_\u2126 (e.g. via (5), (6)).\n7: end for\n\nAlgorithm 1 can be shown to be equivalent to the LMaFit algorithm proposed in\n[23]. The authors of [23] also obtain results on local convergence of LMaFit. We will pursue a\ndifferent approach here. The updates (3) and (4) are reminiscent of the gradient descent steps for\nminimizing matrix completion error on the Grassmann manifold that are introduced in [14], and the\nnext section discusses the connection to optimization on the Grassmann manifold.\n\n3 Optimization on the Grassmann manifold\n\nIn this section, we show that using a non-canonical Riemannian metric on the Grassmann manifold,\nthe gradient of the same objective function in [14] is of a form similar to (3) and (4). Based on this,\nimprovements to the gradient descent algorithms can be made and exact recovery results similar\nto those of [14] can be maintained. The readers are referred to [1, 11] for details on optimization\nframeworks on matrix manifolds.\n\n3.1 Gradients on the Grassmann manifold for the matrix completion problem\n\nLet G(m, r) be the Grassmann manifold in which each point corresponds to a subspace of dimension\nr in R^m. 
One of the results of [14] is that under a few assumptions (to be addressed in Section 4),\none can obtain with high probability the exact matrix A by minimizing a regularized version of the\nfunction F : G(m, r) \u00d7 G(n, r) \u2192 R defined below.\n\nF(U, V) = min_{S \u2208 R^{r\u00d7r}} F(U, S, V), (7)\n\nwhere F(U, S, V) = (1/2)||P_\u2126(A \u2212 U S V^T)||_F^2, and U \u2208 R^{m\u00d7r} and V \u2208 R^{n\u00d7r} are orthonormal\nmatrices. Here, we abuse the notation by denoting by U and V both orthonormal matrices as well\nas the points on the Grassmann manifold which they span. Note that F only depends on the subspaces\nspanned by matrices U and V. The function F(U, V) can be easily evaluated by solving\nthe quadratic minimization problem in the form of (6). If G(m, r) is endowed with the canonical\ninner product \u27e8W, W\u2032\u27e9 = Tr(W^T W\u2032), where W and W\u2032 are tangent vectors of G(m, r) at U (i.e.\nW, W\u2032 \u2208 R^{m\u00d7r} such that W^T U = 0 and W\u2032^T U = 0) and similarly for G(n, r), the gradients of\nF(U, V) on the product manifold are:\n\ngrad F_U(U, V) = (I \u2212 U U^T) P_\u2126(U S V^T \u2212 A) V S^T, (8)\ngrad F_V(U, V) = (I \u2212 V V^T) P_\u2126(U S V^T \u2212 A)^T U S. (9)\n\nIn the above formulas, (I \u2212 U U^T) and (I \u2212 V V^T) are the projections of the derivatives P_\u2126(U S V^T \u2212 A) V S^T\nand P_\u2126(U S V^T \u2212 A)^T U S onto the tangent space of the manifold at (U, V). Notice that the\nderivative terms are very similar to the updates in (3) and (4). The difference is in the scaling factors,\nwhere grad F_U and grad F_V use S^T and S while those in Algorithm 2 use S^{-1} and S^{-T}.\nAssume that S is a diagonal matrix, which can always be obtained by rotating U and V appropriately.\nF(U, V) would change more rapidly when the columns of U and V corresponding to larger entries\nof S are changed. 
The rate of change of F would be approximately proportional to S_ii^2 when the\ni-th columns of U and V are changed, or in other words, S^2 gives us approximate second-order\ninformation of F at the current point (U, V). This suggests that the level set of F should be similar to\nan \u201cellipse\u201d with the shorter axes corresponding to the larger values of S. It is therefore compelling\nto use a scaled metric on the Grassmann manifold.\nConsider the inner product \u27e8W, W\u2032\u27e9_D = Tr(D W^T W\u2032), where D \u2208 R^{r\u00d7r} is a symmetric positive\ndefinite matrix. We will derive the partial gradients of F on the Grassmann manifold endowed with\nthis scaled inner product. According to [11], grad F_U is the tangent vector of G(m, r) at U such that\n\nTr(F_U^T W) = \u27e8grad F_U, W\u27e9_D, (10)\n\nfor all tangent vectors W at U, where F_U is the partial derivative of F with respect to U. Recall\nthat the tangent vectors at U are those W such that W^T U = 0. The solution of (10) with the\nconstraints that W^T U = 0 and (grad F_U)^T U = 0 gives us the gradient based on the scaled metric,\nwhich we will denote by grad_s F_U (and similarly grad_s F_V for V).\n\ngrad_s F_U(U, V) = (I \u2212 U U^T) F_U D^{-1} = (I \u2212 U U^T) P_\u2126(U S V^T \u2212 A) V S D^{-1}, (11)\ngrad_s F_V(U, V) = (I \u2212 V V^T) F_V D^{-1} = (I \u2212 V V^T) P_\u2126(U S V^T \u2212 A)^T U S D^{-1}. (12)\n\nNotice the additional scaling D appearing in these scaled gradients. 
Now if we use D = S^2 (still\nwith the assumption that S is diagonal) as suggested by the arguments above on the approximate\nshape of the level set of F, we will have grad_s F_U(U, V) = (I \u2212 U U^T) P_\u2126(U S V^T \u2212 A) V S^{-1} and\ngrad_s F_V(U, V) = (I \u2212 V V^T) P_\u2126(U S V^T \u2212 A)^T U S^{-1} (note that S depends on U and V).\n\nIf S is not diagonalized, we use S S^T and S^T S to derive grad_s F_U and grad_s F_V respectively, and the\nscalings appear exactly as in (3) and (4):\n\ngrad_s F_U(U, V) = (I \u2212 U U^T) P_\u2126(U S V^T \u2212 A) V S^{-1}, (13)\ngrad_s F_V(U, V) = (I \u2212 V V^T) P_\u2126(U S V^T \u2212 A)^T U S^{-T}. (14)\n\nThis scaling can be interpreted as an adaptive preconditioning step similar to those that are popular\nin the scientific computing literature [4]. As will be shown in our experiments, this scaled gradient\ndirection outperforms canonical gradient directions, especially for ill-conditioned matrices.\n\nThe optimization framework on matrix manifolds allows several elements of the manifold to be defined\nin a flexible way. Here, we use the scaled metric to obtain a good descent direction, while other\noperations on the manifold can be based on the canonical metric, which has simple and efficient\ncomputational forms. The next two sections describe algorithms using scaled gradients.\n\n3.2 Gradient descent algorithms on the Grassmann manifold\n\nGradient descent algorithms on matrix manifolds are based on the update\n\nU_{i+1} = R(U_i + t_i W_i), (15)\n\nwhere W_i is the gradient-related search direction, t_i is the step size and R(U) is a retraction on the\nmanifold which defines a projection of U onto the manifold [1]. We use R(U) = span(U) as the\nretraction on the Grassmann manifold where span(U) is represented by qf(U), which is the Q factor\nin the QR factorization of U. 
Optimization on the product of two Grassmann manifolds can be done\nby treating each component as a coordinate component.\n\nThe step size t can be computed in several ways, e.g., by a simple back-tracking method to find the\npoint satisfying the Armijo condition [3]. Algorithm 3 is an outline of our gradient descent method\nfor matrix completion. We let grad_s F_U^{(i)} \u2261 grad_s F_U(U_i, V_i) and grad_s F_V^{(i)} \u2261 grad_s F_V(U_i, V_i). In\nline 5, the exact S_{i+1} which realizes F(U_{i+1}, V_{i+1}) can be computed according to (6). A direct\nmethod to solve (6) costs O(|\u2126| r^4). Alternatively, S_{i+1} can be computed approximately and we\nfound that (5) is fast (O((|\u2126| + m + n) r^2)) and gives the same convergence speed. If (5) fails\nto yield good enough progress, we can always switch back to (6) and compute S_{i+1} exactly. The\nsubspace iteration and LMaFit can be seen as relaxed versions of this gradient descent procedure.\nThe next section goes further and describes the conjugate gradient iteration.\n\nAlgorithm 3 GRADIENT DESCENT WITH SCALED-GRADIENT ON THE GRASSMANN MANIFOLD.\nInput: Matrix A_\u2126, \u2126, and number r.\nOutput: U and V which minimize F(U, V), and S which realizes F(U, V).\n1: Initialize orthonormal matrices U_0 and V_0.\n2: for i = 0,1,2,... do\n3:     Compute grad_s F_U^{(i)} and grad_s F_V^{(i)} according to (13) and (14).\n4:     Find an appropriate step size t_i and compute\n       (U_{i+1}, V_{i+1}) = (qf(U_i \u2212 t_i grad_s F_U^{(i)}), qf(V_i \u2212 t_i grad_s F_V^{(i)}))\n5:     Compute S_{i+1} according to (6) (exact) or (5) (approximate).\n6: end for\n\n3.3 Conjugate gradient method on the Grassmann manifold\n\nIn this section, we describe the conjugate gradient (CG) method on the Grassmann manifold based\non the scaled gradients to solve the matrix completion problem. The main additional ingredient we\nneed is vector transport which is used to transport the old search direction to the current point on the\nmanifold. 
The transported search direction is then combined with the scaled gradient at the current\npoint, e.g. by the Polak-Ribiere formula (see [11]), to derive the new search direction. After this, a line\nsearch procedure is performed to find the appropriate step size along this search direction.\n\nVector transport can be defined using the Riemannian connection, which in turn is defined based on the\nRiemannian metric [1]. As mentioned at the end of Section 3.1, we will use the canonical metric to\nderive vector transport when considering the natural quotient manifold structure of the Grassmann\nmanifold. The tangent W\u2032 at U will be transported to U + W as T_{U+W}(W\u2032) where T_U(W\u2032) =\n(I \u2212 U U^T) W\u2032. Algorithm 4 is a sketch of the resulting conjugate gradient procedure.\n\nAlgorithm 4 CONJUGATE GRADIENT WITH SCALED-GRADIENT ON THE GRASSMANN MANIFOLD.\nInput: Matrix A_\u2126, \u2126, and number r.\nOutput: U and V which minimize F(U, V), and S which realizes F(U, V).\n1: Initialize orthonormal matrices U_0 and V_0.\n2: Compute (\u03b7_0, \u03be_0) = (grad_s F_U^{(0)}, grad_s F_V^{(0)}).\n3: for i = 0,1,2,... do\n4:     Compute a step size t_i and compute (U_{i+1}, V_{i+1}) = (qf(U_i + t_i \u03b7_i), qf(V_i + t_i \u03be_i))\n5:     Compute \u03b2_{i+1} (Polak-Ribiere) and set\n       (\u03b7_{i+1}, \u03be_{i+1}) = (\u2212grad_s F_U^{(i)} + \u03b2_{i+1} T_{U_{i+1}}(\u03b7_i), \u2212grad_s F_V^{(i)} + \u03b2_{i+1} T_{V_{i+1}}(\u03be_i))\n6:     Compute S_{i+1} according to (6) or (5).\n7: end for\n\n4 Convergence and exact recovery of scaled-gradient descent methods\n\nLet A = U\u2217 \u03a3\u2217 V\u2217^T be the singular value decomposition of A, where U\u2217 \u2208 R^{m\u00d7r}, V\u2217 \u2208 R^{n\u00d7r} and\n\u03a3\u2217 \u2208 R^{r\u00d7r}. Let us also denote by z = (U, V) a point on G(m, r) \u00d7 G(n, r). Clearly, z\u2217 = (U\u2217, V\u2217)\nis a minimum of F. Assume that A is incoherent [14]; A has bounded entries and the minimum\nsingular value of A is bounded away from 0. 
Let \u03ba(A) be the condition number of A. It is shown\nthat, if the number of observed entries is of order O(max{\u03ba(A)^2 n log n, \u03ba(A)^6 n}) then, with high\nprobability, F is well approximated by a parabola and z\u2217 is the unique stationary point of F in a\nsufficiently small neighborhood of z\u2217 ([14, Lemma 6.4&6.5]). From these observations, given an\ninitial point that is sufficiently close to z\u2217, a gradient descent procedure on F (with an additional\nregularization term to keep the intermediate points incoherent) converges to z\u2217 and exact recovery\nis obtained. The singular value decomposition of a trimmed version of the observed matrix A_\u2126 can\ngive us the initial point that ensures convergence. The readers are referred to [14] for details.\n\nFrom [14], let G(U, V) = \u2211_{i=1}^{m} G_1(||U^{(i)}||^2 / C_inc) + \u2211_{i=1}^{n} G_1(||V^{(i)}||^2 / C_inc), where G_1(x) = 0 if x \u2264 1\nand G_1(x) = e^{(x\u22121)^2} \u2212 1 otherwise; C_inc is a constant depending on the incoherence assumptions.\nWe consider the regularized version of F: F\u0303(U, V) = F(U, V) + \u03c1 G(U, V), where \u03c1 is chosen\nappropriately so that U and V remain incoherent during the execution of the algorithm. We can see\nthat z\u2217 is also the minimum of F\u0303. We will now show that the scaled-gradients of F\u0303 are well-defined\nduring the iterations and they are indeed descent directions of F\u0303 and only vanish at z\u2217. As a result,\nthe scaled-gradient-based methods can inherit all the convergence results in [14].\n\nFirst, S must be non-singular during the iterations for the scaled-gradients to be well-defined. As a\ncorollary of Lemma 6.4 in [14], the extreme singular values of any intermediate S are bounded by\nthe extreme singular values \u03c3\u2217_min and \u03c3\u2217_max of \u03a3\u2217: \u03c3_max \u2264 2\u03c3\u2217_max and \u03c3_min \u2265 (1/2)\u03c3\u2217_min. The second\ninequality implies that S is well-conditioned during the iterations.\nThe scaled-gradient is a descent direction of F\u0303 as a direct result of the fact that it is indeed\nthe gradient of F\u0303 based on a non-canonical metric. Moreover, by Lemma 6.5 in [14],\n||grad F\u0303(z)||^2 \u2265 C n \u03f5^2 (\u03c3\u2217_min)^4 d(z, z\u2217)^2 for some constant C, where ||.|| and d(., .) are the canonical\nnorm and distance on the Grassmann manifold respectively. Based on this, a similar lower bound of\n||grad_s F\u0303|| can be derived. Let D_1 = S S^T and D_2 = S^T S be the scaling matrices. Then,\n\n||grad_s F\u0303(z)||^2 = ||grad F\u0303_U(z) D_1^{-1}||_F^2 + ||grad F\u0303_V(z) D_2^{-1}||_F^2\n\u2265 \u03c3_max^{-2} (||grad F\u0303_U(z)||_F^2 + ||grad F\u0303_V(z)||_F^2)\n\u2265 (2\u03c3\u2217_max)^{-2} ||grad F\u0303(z)||^2\n\u2265 (2\u03c3\u2217_max)^{-2} C n \u03f5^2 (\u03c3\u2217_min)^4 d(z, z\u2217)^2 = C (\u03c3\u2217_min)^4 (2\u03c3\u2217_max)^{-2} n \u03f5^2 d(z, z\u2217)^2.\n\nTherefore, the scaled gradients only vanish at z\u2217, which means the scaled-gradient descent procedure\nmust converge to z\u2217, which is the exact solution [3].\n\n5 Experiments and results\n\nThe proposed algorithms were implemented in Matlab with some mex-routines to perform matrix\nmultiplications with sparse masks. For synthetic data, we consider two cases: (1) fully random\nlow-rank matrices, A = randn(m, r) \u2217 randn(r, n) (in Matlab notation) whose singular values\ntend to be roughly the same; (2) random low-rank matrices with chosen singular values by letting\nU = qf(randn(m, r)), V = qf(randn(n, r)) and A = U S V^T where S is a diagonal matrix with\nchosen singular values. 
The initializations of all methods are based on the SVD of A_\u2126.\nFirst, we illustrate the improvement of scaled gradients over canonical gradients for steepest descent\nand conjugate gradient methods on 5000 \u00d7 5000 matrices with rank 5 (Figure 1). Note that Canon-Grass-Steep\nis OptSpace with our implementation. In this experiment, S_i is obtained exactly using\n(6). The time needed for each iteration is roughly the same for all methods so we only present the\nresults in terms of iteration counts. We can see that there are some small improvements for the fully\nrandom case (Figure 1a) since the singular values are roughly the same. The improvement is more\nsubstantial for matrices with larger condition numbers (Figure 1b).\n\n[Figure 1 plots: RMSE (log-scale) versus iteration count for Canon-Grass-Steep, Canon-Grass-CG, Scaled-Grass-Steep and Scaled-Grass-CG on 5000x5000 rank-5 matrices with 1.0% observed entries. Panel (a): singular values [4774, 4914, 4979, 5055, 5146]. Panel (b): singular values [1000, 2000, 3000, 4000, 5000].]\n\nFigure 1: Log-RMSE for fully random matrix (a) and random matrix with chosen spectrum (b).\n\nNow, we compare the relaxed version of the scaled conjugate gradient which uses (5) to compute S_i\n(ScGrass-CG) to LMaFit [23], Riemann-CG [22], RTRMC2 [6] (trust region method with second\norder information), SVP [12] and GROUSE [5] (Figure 2). 
These methods are also implemented in\nMatlab with mex-routines similar to ours except for GROUSE which is entirely in Matlab (indeed,\nGROUSE does not use sparse matrix multiplication as other methods do). The subspace iteration\nmethod and the relaxed version of scaled steepest descent converge similarly to LMaFit, so we omit\nthem in the graph. Note that each iteration of GROUSE in the graph corresponds to one pass over\nthe matrix. It does not have exactly the same meaning as one iteration of other methods and is\nmuch slower with its current implementation. We use the best step sizes that we found for SVP\nand GROUSE. In terms of iteration counts, we can see that for the fully random case (upper row),\nRTRMC2 is the best while ScGrass-CG and Riemann-CG converge reasonably fast. However, each\niteration of RTRMC2 is slower so in terms of time, ScGrass-CG and Riemann-CG are the fastest in\nour experiments. When the condition number of the matrix is higher, ScGrass-CG converges fastest\nboth in terms of iteration counts and execution time.\n\nFinally, we test the algorithms on Jester-1 and MovieLens-100K datasets which are assumed to\nbe low-rank matrices with noise (SVP and GROUSE are not tested because their step sizes need\nto be appropriately chosen). Similarly to previous work, for the Jester dataset we randomly select\n4000 users and randomly withhold 2 ratings for each user for testing. For the MovieLens\ndataset, we use the common dataset prepared by [16], and keep 50% for training and 50% for\ntesting. We run 100 different randomizations of Jester and 10 randomizations of MovieLens and\naverage the results. We stop all methods early, when the change of RMSE is less than 10^{-4}, to\navoid overfitting. All methods stop well before one minute. The Normalized Mean Absolute Errors\n(NMAEs) [13] are reported in Table 1. 
ScGrass-CG is the relaxed scaled CG method and ScGrass-CG-Reg\nis the exact scaled CG method using a spectral-regularization version of F proposed in\n[13]: F\u0303(U, V) = min_S (1/2)(||P_\u2126(U S V^T \u2212 A)||_F^2 + \u03bb||S||_F^2). All methods perform similarly and\ndemonstrate overfitting at rank 7 for MovieLens. We observe that ScGrass-CG-Reg suffers the\nleast from overfitting thanks to its regularization. This shows the importance of regularization for\nnoisy matrices and motivates future work in this direction.\n\n[Figure 2 plots: RMSE (log-scale) versus iteration count and versus time [s] for SVP, LMaFit, ScGrass-CG, RTRMC2, Riemann-CG and GROUSE on 10000x10000 rank-10 matrices with 0.5% observed entries. Upper row: singular values [9612, 9717, 9806, 9920, 9987, 10113, 10128, 10226, 10248, 10348]. Lower row: singular values [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000].]\n\nFigure 2: Log-RMSE. Upper row is fully random, lower row is random with chosen singular values.\n\nRank\n5\n7\n5\n7\n\nScGrass-CG ScGrass-CG-Reg LMaFit Riemann-CG RTRMC2\n0.1588\n0.1584\n0.1808\n0.1832\n\n0.1588\n0.1583\n0.1884\n0.2298\n\n0.1591\n0.1584\n0.1781\n0.1817\n\n0.1588\n0.1581\n0.1828\n0.1836\n\n0.1588\n0.1584\n0.1758\n0.1787\n\nTable 1: NMAE on Jester dataset (first 2 rows) and MovieLens 100K. NMAEs for a random guesser\nare 0.33 on Jester and 0.37 on MovieLens 100K.\n\n6 Conclusion and future work\n\nThe gradients obtained from a scaled metric on the Grassmann manifold can result in improved\nconvergence of gradient methods on matrix manifolds for matrix completion while maintaining\ngood global convergence and exact recovery guarantees. We have established a connection between\nscaled gradient methods and the subspace iteration method for matrix completion. The relaxed versions\nof the proposed gradient methods, adapted from the subspace iteration, are faster than previously\ndiscussed algorithms, sometimes much faster depending on the conditioning of the data matrix.\nIn the future, we will investigate if these relaxed versions achieve similar performance guarantees.\nWe are also interested in exploring ways to regularize the relaxed versions to deal with noisy data.\nThe convergence condition of OptSpace depends on \u03ba(A)^6 and weakening this dependency for the\nproposed algorithms is also an interesting future direction.\n\nAcknowledgments\n\nThis work was supported by NSF grants DMS-0810938 and DMR-0940218.\n\nReferences\n\n[1] P.-A. Absil, R. Mahony, and R. Sepulchre. 
Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.\n\n[2] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, ICML \u201907, pages 17\u201324, 2007.\n\n[3] L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1\u20133, 1966.\n\n[4] J. Baglama, D. Calvetti, G. H. Golub, and L. Reichel. Adaptively preconditioned GMRES algorithms. SIAM J. Sci. Comput., 20(1):243\u2013269, December 1998.\n\n[5] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. In Proceedings of Allerton, September 2010.\n\n[6] N. Boumal and P.-A. Absil. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In NIPS, 2011.\n\n[7] J.-F. Cai, E. J. Cand\u00e8s, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956\u20131982, 2010.\n\n[8] E. Cand\u00e8s and T. Tao. The power of convex relaxation: Near-optimal matrix completion, 2009.\n\n[9] P. Chen and D. Suter. Recovering the Missing Components in a Large Noisy Low-Rank Matrix: Application to SFM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1051\u20131063, 2004.\n\n[10] W. Dai, E. Kerman, and O. Milenkovic. A geometric approach to low-rank matrix completion. IEEE Transactions on Information Theory, 58(1):237\u2013247, 2012.\n\n[11] A. Edelman, T. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20:303\u2013353, 1998.\n\n[12] P. Jain, R. Meka, and I. S. Dhillon. Guaranteed rank minimization via singular value projection. In NIPS, pages 937\u2013945, 2010.\n\n[13] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. In Y. 
Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 952\u2013960. 2009.\n\n[14] R. H. Keshavan, S. Oh, and A. Montanari. Matrix completion from a few entries. CoRR, abs/0901.3150, 2009.\n\n[15] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program., 128(1-2):321\u2013353, 2011.\n\n[16] B. Marlin. Collaborative filtering: A machine learning perspective, 2004.\n\n[17] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11:2287\u20132322, August 2010.\n\n[18] B. Recht. A simpler approach to matrix completion. CoRR, abs/0910.0651, 2009.\n\n[19] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 713\u2013719. ACM, 2005.\n\n[20] Y. Saad. Numerical Methods for Large Eigenvalue Problems, classics edition. SIAM, Philadelphia, PA, 2011.\n\n[21] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In 20th International Conference on Machine Learning, pages 720\u2013727. AAAI Press, 2003.\n\n[22] B. Vandereycken. Low-rank matrix completion by Riemannian optimization. Technical report, Mathematics Section, Ecole Polytechnique Federale de Lausanne, 2011.\n\n[23] Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion using a non-linear successive over-relaxation algorithm. In CAAM Technical Report. Rice University, 2010.\n", "award": [], "sourceid": 677, "authors": [{"given_name": "Thanh", "family_name": "Ngo", "institution": null}, {"given_name": "Yousef", "family_name": "Saad", "institution": null}]}