{"title": "Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 4520, "page_last": 4528, "abstract": "Matrix completion, where we wish to recover a low rank matrix by observing a few entries from it, is a widely studied problem in both theory and practice with wide applications. Most of the provable algorithms so far on this problem have been restricted to the offline setting where they provide an estimate of the unknown matrix using all observations simultaneously. However, in many applications, the online version, where we observe one entry at a time and dynamically update our estimate, is more appealing. While existing algorithms are efficient for the offline setting, they could be highly inefficient for the online setting. In this paper, we propose the first provable, efficient online algorithm for matrix completion. Our algorithm starts from an initial estimate of the matrix and then performs non-convex stochastic gradient descent (SGD). After every observation, it performs a fast update involving only one row of two tall matrices, giving near linear total runtime. Our algorithm can be naturally used in the offline setting as well, where it gives competitive sample complexity and runtime to state of the art algorithms. Our proofs introduce a general framework to show that SGD updates tend to stay away from saddle surfaces and could be of broader interests to other non-convex problems.", "full_text": "Provable Ef\ufb01cient Online Matrix Completion via\n\nNon-convex Stochastic Gradient Descent\n\nChi Jin\n\nUC Berkeley\n\nchijin@cs.berkeley.edu\n\nSham M. 
Kakade\nUniversity of Washington\nsham@cs.washington.edu\n\nPraneeth Netrapalli\nMicrosoft Research India\npraneeth@microsoft.com\n\nAbstract\n\nMatrix completion, where we wish to recover a low rank matrix by observing a few entries from it, is a widely studied problem in both theory and practice, with wide applications. Most of the provable algorithms so far for this problem have been restricted to the offline setting, where they provide an estimate of the unknown matrix using all observations simultaneously. However, in many applications the online version, where we observe one entry at a time and dynamically update our estimate, is more appealing. While existing algorithms are efficient for the offline setting, they could be highly inefficient for the online setting.\n\nIn this paper, we propose the first provable, efficient online algorithm for matrix completion. Our algorithm starts from an initial estimate of the matrix and then performs non-convex stochastic gradient descent (SGD). After every observation, it performs a fast update involving only one row of two tall matrices, giving near linear total runtime. Our algorithm can be naturally used in the offline setting as well, where it gives sample complexity and runtime competitive with state-of-the-art algorithms. Our proofs introduce a general framework to show that SGD updates tend to stay away from saddle surfaces, and could be of broader interest to other non-convex problems.\n\n1 Introduction\n\nLow rank matrix completion refers to the problem of recovering a low rank matrix by observing the values of only a tiny fraction of its entries. This problem arises in several applications such as video denoising [13], phase retrieval [3] and, most famously, in movie recommendation engines [15]. 
In the context of recommendation engines, for instance, the matrix we wish to recover is the user-item rating matrix, where each row corresponds to a user and each column corresponds to an item. Each entry of the matrix is the rating given by a user to an item. The low rank assumption on the matrix is inspired by the intuition that the rating of an item by a user depends on only a few hidden factors, which are much fewer than the number of users or items. The goal is to estimate the ratings of all items by users given only partial ratings, which would then be helpful in recommending new items to users.\n\nThe seminal work of Candès and Recht [4] first identified regularity conditions under which low rank matrix completion can be solved in polynomial time using convex relaxation – low rank matrix completion can be ill-posed and NP-hard in general without such regularity assumptions [9]. Since then, a number of works have studied various algorithms under different settings for matrix completion: weighted and noisy matrix completion, fast convex solvers, fast iterative non-convex solvers, parallel and distributed algorithms, and so on.\n\nMost of this work, however, deals only with the offline setting, where all the observed entries are revealed at once and the recovery procedure does computation using all these observations simultaneously. However, in several applications [5, 18], we encounter the online setting, where observations are only revealed sequentially and at each step the recovery algorithm is required to maintain an estimate of the low rank matrix based on the observations so far.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nConsider for instance recommendation engines, where the low rank matrix we are interested in is the user-item rating matrix. 
While we make an observation only when a user rates an item, at any point of time we should have an estimate of the user-item rating matrix based on all prior observations, so as to be able to continuously recommend items to users. Moreover, this estimate should get better as we observe more ratings.\n\nAlgorithms for offline matrix completion can be used to solve the online version by rerunning the algorithm after every additional observation. However, performing so much computation for every observation seems wasteful and is also impractical. For instance, using alternating minimization, which is among the fastest known algorithms for the offline problem, would mean that we take several passes over the entire data for every additional observation. This is simply not feasible in most settings. Another natural approach is to group observations into batches and do an update only once for each batch. This however induces a lag between observations and estimates, which is undesirable. To the best of our knowledge, there is no known provable, efficient, online algorithm for matrix completion.\n\nOn the other hand, in order to deal with the online matrix completion scenario in practical applications, several heuristics (with no convergence guarantees) have been proposed in the literature [2, 19]. Most of these approaches are based on starting with an estimate of the matrix and doing fast updates of this estimate whenever a new observation is presented. One of the update procedures used in this context is stochastic gradient descent (SGD) applied to the following non-convex optimization problem:\n\nmin_{U,V} ||M − UV^T||_F^2   s.t. U ∈ R^{d1×k}, V ∈ R^{d2×k},   (1)\n\nwhere M is the unknown matrix of size d1 × d2, k is the rank of M, and UV^T is a low rank factorization of M we wish to obtain. 
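Each observed entry of M touches only one row of U and one row of V, so a single SGD step on this objective costs O(k) time. Below is a minimal NumPy sketch of such a per-entry update (our own illustration; the function name is ours, and the 2η·d1·d2 scaling, which makes the single-entry gradient an unbiased estimate of the full gradient under uniform sampling, follows the update form used in this paper):

```python
import numpy as np

def sgd_step(U, V, i, j, M_ij, eta):
    """One per-entry SGD step on ||M - U V^T||_F^2 after observing M_ij.

    Only row i of U and row j of V change, so the cost is O(k). The factor
    d1 * d2 rescales the single-entry gradient so that, under uniform
    sampling of (i, j), the step is an unbiased estimate of the full gradient.
    """
    d1, d2 = U.shape[0], V.shape[0]
    r = U[i] @ V[j] - M_ij                 # residual (U V^T - M)_ij
    step = 2 * eta * d1 * d2 * r
    new_Ui = U[i] - step * V[j]            # both updates use the *old* rows
    new_Vj = V[j] - step * U[i]
    U[i], V[j] = new_Ui, new_Vj
    return U, V
```

For a small enough stepsize the update shrinks the residual at the observed entry while leaving all other rows untouched.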
The algorithm starts with some U_0 and V_0, and given a new observation M_ij, SGD updates the i-th row of U_t and the j-th row of V_t by\n\nU_{t+1}^(i) = U_t^(i) − 2η d1 d2 (U_t V_t^T − M)_ij V_t^(j), and\nV_{t+1}^(j) = V_t^(j) − 2η d1 d2 (U_t V_t^T − M)_ij U_t^(i),   (2)\n\nwhere η is an appropriately chosen stepsize and U^(i) denotes the i-th row of matrix U. Note that each update modifies only one row of the factor matrices U and V, and the computation involves only one row of U, one row of V, and the newly observed entry M_ij, and hence is extremely fast. These fast updates make SGD extremely appealing in practice. Moreover, SGD, in the context of matrix completion, is also useful for parallelization and distributed implementation [23].\n\n1.1 Our Contributions\n\nIn this work we present the first provable, efficient algorithm for online matrix completion, by showing that SGD (2) with a good initialization converges to a true factorization of M at a geometric rate. Our main contributions are as follows.\n\n• We provide the first provable, efficient, online algorithm for matrix completion. Starting with a good initialization, after each observation the algorithm makes a quick update taking time O(k^3), and it requires O(µdkκ^4 (k + log(||M||_F / ε)) log d) observations to reach ε accuracy, where µ is the incoherence parameter, d = max(d1, d2), k is the rank, and κ is the condition number of M.\n• Moreover, our result features both sample complexity and total runtime linear in d, and is competitive with even the best existing offline results for matrix completion. 
(they either improve upon these results or are incomparable to them, i.e., better in some parameters and worse in others). See Table 1 for the comparison.\n\nTable 1: Comparison of sample complexity and runtime of our algorithm with existing algorithms in order to obtain Frobenius norm error ε. Õ(·) hides log d factors. See Section 1.2 for more discussion.\n\nAlgorithm | Sample complexity | Total runtime | Online?\nNuclear Norm [22] | Õ(µdk) | Õ(d^3 / √ε) | No\nAlternating minimization [14] | Õ(µdk κ^8 log(1/ε)) | Õ(µdk^2 κ^8 log(1/ε)) | No\nAlternating minimization [8] | Õ(µdk^2 κ^2 (k + log(1/ε))) | Õ(µdk^3 κ^2 (k + log(1/ε))) | No\nProjected gradient descent [12] | Õ(µdk^5) | Õ(µdk^7 log(1/ε)) | No\nSGD [24] | Õ(µ^2 dk^7 κ^6) | poly(µ, d, k, κ) log(1/ε) | Yes\nOur result | Õ(µdk κ^4 (k + log(1/ε))) | Õ(µdk^4 κ^4 log(1/ε)) | Yes\n\n• To obtain our results, we introduce a general framework to show that SGD updates tend to stay away from saddle surfaces. In order to do so, we consider distances from saddle surfaces, show that they behave like sub-martingales under SGD updates, and use martingale convergence techniques to conclude that the iterates stay away from saddle surfaces. While [24] shows that SGD updates stay away from saddle surfaces, the stepsizes they can handle are quite small (scaling as 1/poly(d1, d2)), leading to suboptimal computational complexity. Our framework makes it possible to establish the same statement for much larger step sizes, giving us near-optimal runtime. 
We believe these techniques may be applicable in other non-convex settings as well.\n\n1.2 Related Work\n\nIn this section we discuss some additional related work.\n\nOffline matrix completion: There has been a lot of work on designing offline algorithms for matrix completion; we provide a detailed comparison with our algorithm in Table 1. The nuclear norm relaxation algorithm [22] has near-optimal sample complexity for this problem but is computationally expensive. Motivated by the empirical success of non-convex heuristics, a long line of work ([14, 8, 12, 24] and so on) has obtained convergence guarantees for alternating minimization, gradient descent, projected gradient descent, etc. Even the best of these are suboptimal in sample complexity by poly(k, κ) factors. Our sample complexity is better than that of [14] and is incomparable to those of [8, 12]. To the best of our knowledge, the only provable online algorithm for this problem is that of Sun and Luo [24]. However, the stepsizes they suggest are quite small, leading to computational complexity that is suboptimal by factors of poly(d1, d2). The runtime of our algorithm is linear in d, a poly(d) improvement over theirs.\n\nOther models for online matrix completion: Another variant of online matrix completion studied in the literature is one where observations are made on a column-by-column basis, e.g., [16, 26]. These models can give improved offline performance in terms of space and could potentially work under relaxed regularity conditions. However, they do not tackle the version where only entries (as opposed to columns) are observed.\n\nNon-convex optimization: Over the last few years, there has also been a significant amount of work on designing other efficient algorithms for solving non-convex problems. Examples include eigenvector computation [6, 11], sparse coding [20, 1], etc. 
For general non-convex optimization, an interesting line of recent work is [7], which proves that gradient descent with noise can escape saddle points, but it only provides a polynomial rate without explicit dependence. Later, [17, 21] showed that without noise, the set of points from which gradient descent converges to a saddle point has measure zero; however, they do not provide a rate of convergence. Another work related to ours is [10], which proves global convergence, along with rates of convergence, for the special case of computing the matrix square root.\n\n1.3 Outline\n\nThe rest of the paper is organized as follows. In Section 2 we formally describe the problem and all relevant parameters. In Section 3 we present our algorithms, our results, and some of the key intuition behind them. In Section 4 we give a proof outline for our main results. We conclude in Section 5. All formal proofs are deferred to the Appendix.\n\n2 Preliminaries\n\nIn this section, we introduce our notation, formally define the matrix completion problem, and state the regularity assumptions that make the problem tractable.\n\n2.1 Notation\n\nWe use [d] to denote {1, 2, ..., d}. We use bold capital letters A, B to denote matrices and bold lowercase letters u, v to denote vectors. A_ij denotes the (i, j)-th entry of matrix A. ||w|| denotes the ℓ2-norm of vector w, and ||A|| / ||A||_F / ||A||_∞ denote the spectral / Frobenius / infinity norm of matrix A. σ_i(A) denotes the i-th largest singular value of A and σ_min(A) denotes the smallest singular value of A. We also let κ(A) = ||A|| / σ_min(A) denote the condition number of A (i.e., the ratio of the largest to the smallest singular value). 
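For concreteness, these quantities are straightforward to compute numerically. A small illustration (our own; we read ||A||_∞ here as the entrywise maximum magnitude, the usual convention in the matrix completion literature):

```python
import numpy as np

A = np.diag([3.0, 1.0])

spectral  = np.linalg.norm(A, 2)        # ||A||   : largest singular value
frobenius = np.linalg.norm(A, 'fro')    # ||A||_F : sqrt of the sum of squared entries
entrywise = np.abs(A).max()             # ||A||_inf, read as max_ij |A_ij|

s = np.linalg.svd(A, compute_uv=False)  # singular values, largest first
kappa = s[0] / s[-1]                    # kappa(A) = ||A|| / sigma_min(A)
```

For this A the values are ||A|| = 3, ||A||_F = √10, ||A||_∞ = 3, and κ(A) = 3.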
Finally, for an orthonormal basis W of a subspace, we use P_W = WW^T to denote the projection onto the subspace spanned by W.\n\n2.2 Problem statement and assumptions\n\nConsider a general rank-k matrix M ∈ R^{d1×d2}. Let Ω ⊂ [d1] × [d2] be a subset of coordinates, sampled uniformly and independently from [d1] × [d2]. We denote by P_Ω(M) the projection of M onto the set Ω, so that\n\n[P_Ω(M)]_ij = M_ij if (i, j) ∈ Ω, and 0 if (i, j) ∉ Ω.\n\nLow rank matrix completion is the task of recovering M by observing only P_Ω(M). This task is ill-posed and NP-hard in general [9]. In order to make it tractable, we make by now standard assumptions about the structure of M.\n\nDefinition 2.1. Let W ∈ R^{d×k} be an orthonormal basis of a subspace of R^d of dimension k. The coherence of W is defined to be\n\nµ(W) := (d/k) max_{1≤i≤d} ||P_W e_i||^2 = (d/k) max_{1≤i≤d} ||e_i^T W||^2.\n\nAssumption 2.2 (µ-incoherence [4, 22]). We assume M is µ-incoherent, i.e., max{µ(X), µ(Y)} ≤ µ, where X ∈ R^{d1×k}, Y ∈ R^{d2×k} are the left and right singular vectors of M.\n\n3 Main Results and Intuition\n\nIn this section, we present our main results. We first state the result for the special case where M is a symmetric positive semi-definite (PSD) matrix, where the algorithm and analysis are much simpler. We then discuss the general case.\n\n3.1 Symmetric PSD Case\n\nConsider the special case where M is symmetric PSD and let d := d1 = d2. Then we can parametrize a rank-k symmetric PSD matrix by UU^T, where U ∈ R^{d×k}. Our algorithm for this case is given in Algorithm 1. 
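To make the two phases concrete, here is a compact NumPy sketch of this scheme: a warm start from the rescaled observed entries followed by per-entry SGD. This is our own simplified rendering (e.g., it uses an eigendecomposition with clipped eigenvalues for the top-k step), not the paper's exact pseudocode:

```python
import numpy as np

def psd_sgd_step(U, i, j, M_ij, eta):
    """U <- U - 2*eta*d^2*(U U^T - M)_ij (e_i e_j^T + e_j e_i^T) U: touches rows i, j."""
    d = U.shape[0]
    g = 2 * eta * d * d * (U[i] @ U[j] - M_ij)
    ui, uj = U[i].copy(), U[j].copy()
    U[i] -= g * uj           # when i == j these two lines combine to U[i] -= 2*g*U[i]
    U[j] -= g * ui
    return U

def online_psd_completion(M, omega_init, k, eta, T, rng):
    d = M.shape[0]
    # warm start: top-k factorization of (d^2 / |Omega_init|) * P_Omega(M)
    P = np.zeros((d, d))
    for (i, j) in omega_init:
        P[i, j] = M[i, j]
    P *= d * d / len(omega_init)
    vals, vecs = np.linalg.eigh((P + P.T) / 2)
    top = np.argsort(vals)[-k:]
    U = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
    # online phase: one uniformly sampled entry per step
    for _ in range(T):
        i, j = rng.integers(d), rng.integers(d)
        U = psd_sgd_step(U, i, j, M[i, j], eta)
    return U
```

Each online step reads one entry and rewrites at most two rows of U, which is the O(k)-per-observation behavior described below.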
The following theorem provides guarantees on the performance of Algorithm 1. The algorithm starts by using an initial set of samples Ω_init to construct a crude approximation to the low rank factorization of M. It then observes samples from M one at a time and updates its factorization after every observation.\n\nTheorem 3.1. Let M ∈ R^{d×d} be a rank-k, symmetric PSD matrix with µ-incoherence. There exist absolute constants c_0 and c such that if |Ω_init| ≥ c_0 µdk^2 κ^2(M) log d and the learning rate satisfies η ≤ c / (µdk κ^3(M) ||M|| log d), then with probability at least 1 − 1/d^8 we have, for all t ≤ d^2, that¹\n\n||U_t U_t^T − M||_F^2 ≤ (1 − (1/2) η σ_min(M))^t (σ_min(M)/10)^2.\n\n¹W.l.o.g. we can always assume t < d^2; otherwise we have already observed the entire matrix.\n\nAlgorithm 1 Online Algorithm for PSD Matrix Completion\nInput: initial set of uniformly random samples Ω_init of a symmetric PSD matrix M ∈ R^{d×d}, learning rate η, number of iterations T\nOutput: U such that UU^T ≈ M\nU_0 U_0^T ← top-k SVD of (d^2 / |Ω_init|) P_{Ω_init}(M)\nfor t = 0, ..., T − 1 do\n  Observe M_ij where (i, j) ~ Unif([d] × [d])\n  U_{t+1} ← U_t − 2η d^2 (U_t U_t^T − M)_ij (e_i e_j^T + e_j e_i^T) U_t\nend for\nReturn U_T\n\nRemarks:\n\n• The algorithm uses an initial set of observations Ω_init to produce a warm start iterate U_0, then enters the online stage, where it performs SGD.\n• The sample complexity of the warm start phase is O(µdk^2 κ^2(M) log d). 
The initialization\n\nthen enters the online stage, where it performs SGD.\n\nconsists of a top-k SVD on a sparse matrix, whose runtime is O(\u00b5dk3\u03ba2(M) log d).\n\n\u2022 For the online phase (SGD),\n\nif we choose \u03b7 =\nof observations T required for the error (cid:107)UT U(cid:62)\nO(\u00b5dk\u03ba(M)4 log d log \u03c3min(M)\n\nthe number\nT \u2212 M(cid:107)F to be smaller than \u0001 is\n\u2022 Since each SGD step modi\ufb01es two rows of Ut, its runtime is O(k) with a total runtime for\n\n\u00b5dk\u03ba3(M)(cid:107)M(cid:107) log d,\n\n).\n\nc\n\n\u0001\n\nonline phase of O(kT ).\n\nOur proof approach is to essentially show that the objective function is well-behaved (i.e., is smooth\nand strongly convex) in a local neighborhood of the warm start region, and then use standard\ntechniques to show that SGD obtains geometric convergence in this setting. The most challenging and\nnovel part of our analysis comprises of showing that the iterate does not leave this local neighborhood\nwhile performing SGD updates. Refer Section 4 for more details on the proof outline.\n\n3.2 General Case\nLet us now consider the general case where M \u2208 Rd1\u00d7d2 can be factorized as UV(cid:62) with U \u2208 Rd1\u00d7k\nand V \u2208 Rd2\u00d7k. In this scenario, we denote d = max{d1, d2}. We recall our remarks from the\nprevious section that our analysis of the performance of SGD depends on the smoothness and strong\nconvexity properties of the objective function in a local neighborhood of the iterates. Having U (cid:54)= V\nintroduces additional challenges in this approach since for any nonsingular k-by-k matrix C, and\nU(cid:48) def\n= VC\u22121, we have U(cid:48)V(cid:48)(cid:62) = UV(cid:62). Suppose for instance C is a very small scalar\ntimes the identity i.e., C = \u03b4I for some small \u03b4 > 0. In this case, U(cid:48) will be large while V(cid:48) will be\nsmall. 
This drastically deteriorates the smoothness and strong convexity properties of the objective function in a neighborhood of (U′, V′).\n\nTo preclude such a scenario, we would ideally like to renormalize after each step by setting Ũ_t ← W_U D^{1/2} and Ṽ_t ← W_V D^{1/2}, where W_U D W_V^T is the SVD of the matrix U_t V_t^T. This algorithm is described in Algorithm 2.\n\nAlgorithm 2 Online Algorithm for Matrix Completion (Theoretical)\nInput: initial set of uniformly random samples Ω_init of M ∈ R^{d1×d2}, learning rate η, number of iterations T\nOutput: U, V such that UV^T ≈ M\nU_0 V_0^T ← top-k SVD of (d1 d2 / |Ω_init|) P_{Ω_init}(M)\nfor t = 0, ..., T − 1 do\n  W_U D W_V^T ← SVD(U_t V_t^T)\n  Ũ_t ← W_U D^{1/2}, Ṽ_t ← W_V D^{1/2}\n  Observe M_ij where (i, j) ~ Unif([d1] × [d2])\n  U_{t+1} ← Ũ_t − 2η d1 d2 (Ũ_t Ṽ_t^T − M)_ij e_i e_j^T Ṽ_t\n  V_{t+1} ← Ṽ_t − 2η d1 d2 (Ũ_t Ṽ_t^T − M)_ij e_j e_i^T Ũ_t\nend for\nReturn U_T, V_T\n\nHowever, a naive implementation of Algorithm 2, especially the SVD step, would incur O(min{d1, d2}) computation per iteration, resulting in a runtime overhead of O(d) over both the online PSD case (i.e., Algorithm 1) and the near linear time offline algorithms (see Table 1). It turns out that we can take advantage of the fact that in each iteration we only update a single row of U_t and a single row of V_t, and do efficient (but more complicated) update steps instead of computing an SVD of a d1 × d2 matrix. The resulting algorithm is given in Algorithm 3. The key idea is that in order to implement the updates, it suffices to compute SVDs of U_t^T U_t and V_t^T V_t, which are k × k matrices. So the runtime of each iteration is at most O(k^3). 
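To see why k × k decompositions suffice, note that the SVD of U_t V_t^T can be assembled from the Gram matrices U_t^T U_t and V_t^T V_t plus one k × k SVD. A sketch of such a rebalancing step (our own illustration of the idea, not the paper's exact pseudocode; it assumes both factors have full column rank):

```python
import numpy as np

def balance(U, V):
    """Rebalance (U, V) so that U V^T is unchanged but U^T U = V^T V.

    Equivalent to taking the SVD of U V^T (= W_U D W_V^T) and returning
    (W_U D^{1/2}, W_V D^{1/2}), but computed with k x k decompositions only,
    so the cost is O((d1 + d2) k^2 + k^3) rather than a full SVD.
    """
    # k x k eigendecompositions of the Gram matrices
    DU, RU = np.linalg.eigh(U.T @ U)     # U^T U = RU diag(DU) RU^T
    DV, RV = np.linalg.eigh(V.T @ V)
    # SVD of the k x k core: D_U^{1/2} R_U^T R_V D_V^{1/2}
    core = np.sqrt(DU)[:, None] * (RU.T @ RV) * np.sqrt(DV)[None, :]
    QU, S, QVt = np.linalg.svd(core)
    # assemble the balanced factors: U R_U D_U^{-1/2} Q_U S^{1/2}, and symmetrically for V
    U_new = U @ (RU / np.sqrt(DU)[None, :]) @ QU * np.sqrt(S)[None, :]
    V_new = V @ (RV / np.sqrt(DV)[None, :]) @ QVt.T * np.sqrt(S)[None, :]
    return U_new, V_new
```

The two factors come back with equal (diagonal) Gram matrices while their product is untouched, which is exactly the normalization Ũ_t = W_U D^{1/2}, Ṽ_t = W_V D^{1/2} that Algorithm 2 obtains from a full SVD.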
The following lemma shows the equivalence between Algorithms 2 and 3.\n\nAlgorithm 3 Online Algorithm for Matrix Completion (Practical)\nInput: initial set of uniformly random samples Ω_init of M ∈ R^{d1×d2}, learning rate η, number of iterations T\nOutput: U, V such that UV^T ≈ M\nU_0 V_0^T ← top-k SVD of (d1 d2 / |Ω_init|) P_{Ω_init}(M)\nfor t = 0, ..., T − 1 do\n  R_U D_U R_U^T ← SVD(U_t^T U_t)\n  R_V D_V R_V^T ← SVD(V_t^T V_t)\n  Q_U D Q_V^T ← SVD(D_U^{1/2} R_U^T (D_V^{1/2} R_V^T)^T)\n  Observe M_ij where (i, j) ~ Unif([d1] × [d2])\n  U_{t+1} ← U_t − 2η d1 d2 (U_t V_t^T − M)_ij e_i e_j^T V_t R_V D_V^{-1/2} Q_V Q_U^T D_U^{1/2} R_U^T\n  V_{t+1} ← V_t − 2η d1 d2 (U_t V_t^T − M)_ij e_j e_i^T U_t R_U D_U^{-1/2} Q_U Q_V^T D_V^{1/2} R_V^T\nend for\nReturn U_T, V_T\n\nLemma 3.2. Algorithm 2 and Algorithm 3 are equivalent in the following sense: given the same observations from M and the same other inputs, the outputs U, V of Algorithm 2 and U′, V′ of Algorithm 3 satisfy UV^T = U′V′^T.\n\nSince the outputs of both algorithms are the same, we can analyze Algorithm 2 (which is easier to analyze than Algorithm 3) while implementing Algorithm 3 in practice. The following theorem is the main result of our paper, presenting guarantees on the performance of Algorithm 2.\n\nTheorem 3.3. 
Let M ∈ R^{d1×d2} be a rank-k matrix with µ-incoherence, and let d := max(d1, d2). There exist absolute constants c_0 and c such that if |Ω_init| ≥ c_0 µdk^2 κ^2(M) log d and the learning rate satisfies η ≤ c / (µdk κ^3(M) ||M|| log d), then with probability at least 1 − 1/d^8 we have, for all t ≤ d^2, that\n\n||U_t V_t^T − M||_F^2 ≤ (1 − (1/2) η σ_min(M))^t (σ_min(M)/10)^2.\n\nRemarks:\n\n• Just as in the case of PSD matrix completion (Theorem 3.1), Algorithm 2 needs an initial set of observations Ω_init to provide a warm start U_0 and V_0, after which it performs SGD.\n• The sample complexity and runtime of the warm start phase are the same as in the symmetric PSD case. The stepsize η and the number of observations T needed to achieve ε error in the online phase (SGD) are also the same as in the symmetric PSD case.\n• However, the runtime of each update step in the online phase is O(k^3), giving a total runtime for the online phase of O(k^3 T).\n\nThe proof of this theorem again follows a similar line of reasoning as that of Theorem 3.1: we first show that the local neighborhood of the warm start iterate has good smoothness and strong convexity properties, and then use these to show geometric convergence of SGD. Proving that the iterates do not move away from this local neighborhood is, however, significantly more challenging due to the renormalization steps in the algorithm. Please see Appendix C for the full proof.\n\n4 Proof Sketch\n\nIn this section we provide the intuition and a proof sketch for our main results. For simplicity, and to highlight the most essential ideas, we mostly focus on the symmetric PSD case (Theorem 3.1). 
For the asymmetric case, though the high-level ideas are still valid, a lot of additional effort is required to address the renormalization step in Algorithm 2, which makes the proof more involved.\n\nFirst, note that our algorithm for the PSD case consists of an initialization followed by stochastic descent steps. The following lemma provides guarantees on the error achieved by the initial iterate U_0.\n\nLemma 4.1. Let M ∈ R^{d×d} be a rank-k PSD matrix with µ-incoherence. There exists a constant c_0 such that if |Ω_init| ≥ c_0 µdk^2 κ^2(M) log d, then with probability at least 1 − 1/d^10, the top-k SVD of (d^2 / |Ω_init|) P_{Ω_init}(M) (denoted U_0 U_0^T) satisfies\n\n||M − U_0 U_0^T||_F ≤ (1/20) σ_min(M)   and   max_j ||e_j^T U_0||^2 ≤ 10 µkκ(M) ||M|| / d.   (3)\n\nBy Lemma 4.1, the initialization already gives a U_0 in the local region described by Eq. (3). Intuitively, the stochastic descent steps should keep doing local search within this local region. To establish linear convergence of ||U_t U_t^T − M||_F^2 and obtain the final result, we first establish several important lemmas describing the properties of this local region. Throughout this section, we denote SVD(M) = XSX^T, where X ∈ R^{d×k} and S ∈ R^{k×k} is diagonal. All formal proofs are postponed to the Appendix.\n\nLemma 4.2. For the function f(U) = ||M − UU^T||_F^2 and any U_1, U_2 ∈ {U : ||U|| ≤ Γ}, we have\n\n||∇f(U_1) − ∇f(U_2)||_F ≤ 16 max{Γ^2, ||M||} · ||U_1 − U_2||_F.\n\nLemma 4.3. 
For the function f(U) = ||M − UU^T||_F^2 and any U ∈ {U : σ_min(X^T U) ≥ γ}, we have\n\n||∇f(U)||_F^2 ≥ 4γ^2 f(U).\n\nLemma 4.2 says that f is smooth as long as the spectral norm of U is not too large. On the other hand, σ_min(X^T U) being not too small requires both that σ_min(U^T U) is not too small and that σ_min(X^T W) is not too small, where W is the top-k eigenspace of UU^T. That is, Lemma 4.3 says that f has a property similar to strong convexity in the standard optimization literature, provided U is rank k in a robust sense (σ_k(U) is not too small) and the angle between the top-k eigenspace of UU^T and the top-k eigenspace of M is not too large.\n\nLemma 4.4. Within the region D = {U : ||M − UU^T||_F ≤ (1/10) σ_k(M)}, we have\n\n||U|| ≤ √(2||M||)   and   σ_min(X^T U) ≥ √(σ_k(M)/2).\n\nLemma 4.4 says that inside the region D = {U : ||M − UU^T||_F ≤ (1/10) σ_k(M)}, the matrix U always has good spectral properties, which supply the preconditions of both Lemma 4.2 and Lemma 4.3; within D, therefore, f(U) is both smooth and has a property very similar to strong convexity.\n\nWith the above three lemmas, we can already see the intuition behind the linear convergence in Theorem 3.1. Denote the stochastic gradient by\n\nSG(U) = 2d^2 (UU^T − M)_ij (e_i e_j^T + e_j e_i^T) U,   (4)\n\nwhere SG(U) is a random matrix depending on the randomness of the sample (i, j) from matrix M. The stochastic update step in Algorithm 1 can then be rewritten as\n\nU_{t+1} ← U_t − η SG(U_t).\n\nLet f(U) = ||M − UU^T||_F^2. A simple calculation shows that E[SG(U)] = ∇f(U), that is, SG(U) is unbiased. 
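This unbiasedness is easy to verify numerically: averaging SG(U) over all d^2 choices of (i, j) recovers ∇f(U) = 4(UU^T − M)U exactly. A small self-contained check (function names are ours):

```python
import numpy as np

def stochastic_grad(U, M, i, j):
    """SG(U) from Eq. (4): 2 d^2 (U U^T - M)_ij (e_i e_j^T + e_j e_i^T) U."""
    d = M.shape[0]
    r = U[i] @ U[j] - M[i, j]
    G = np.zeros_like(U)
    G[i] += 2 * d * d * r * U[j]   # row i of (e_i e_j^T) U is U^(j)
    G[j] += 2 * d * d * r * U[i]   # row j of (e_j e_i^T) U is U^(i)
    return G

def full_grad(U, M):
    """Gradient of f(U) = ||M - U U^T||_F^2 for symmetric M: 4 (U U^T - M) U."""
    return 4 * (U @ U.T - M) @ U
```

Averaging the d^2 stochastic gradients with uniform weights 1/d^2 cancels the d^2 scaling and sums the two rank-one pieces into 2((UU^T − M) + (UU^T − M)^T)U = 4(UU^T − M)U, since both M and UU^T are symmetric.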
Combining Lemma 4.4 with Lemmas 4.2 and 4.3, we know that within the region D specified by Lemma 4.4, the function f(U) is 32||M||-smooth and satisfies ||∇f(U)||_F^2 ≥ 2σ_min(M) f(U).\n\nSuppose, ideally, that we always have U_0, ..., U_t inside the region D. This directly gives\n\nE f(U_{t+1}) ≤ E f(U_t) − η E⟨∇f(U_t), SG(U_t)⟩ + 16η^2 ||M|| · E||SG(U_t)||_F^2\n= E f(U_t) − η E||∇f(U_t)||_F^2 + 16η^2 ||M|| · E||SG(U_t)||_F^2\n≤ (1 − 2η σ_min(M)) E f(U_t) + 16η^2 ||M|| · E||SG(U_t)||_F^2.\n\nOne interesting aspect of our main result is that we actually show linear convergence in the presence of noise in the gradient. This is possible because, for the second-order (η^2) term above, one can roughly see from Eq. (4) that ||SG(U)||_F^2 ≤ h(U) · f(U), where h(U) is a factor depending on U that remains bounded. That is, SG(U) enjoys a self-bounding property: ||SG(U)||_F^2 goes to zero as the objective f(U) goes to zero. Therefore, by choosing the learning rate η appropriately small, the first-order term always dominates the second-order term, which establishes linear convergence.\n\nNow, the only remaining issue is to prove that U_0, ..., U_t always stay inside the local region D. In reality, we can only prove this statement with high probability, due to the stochastic nature of the updates. This is also the most challenging part of our proof; it makes our analysis different from standard convex analysis and is uniquely required by the non-convex setting.\n\nOur key theorem is presented as follows.\n\nTheorem 4.5. 
Let f(U) = ||UU^T − M||_F^2 and g_i(U) = ||e_i^T U||^2. Suppose the initial U_0 satisfies\n\nf(U_0) ≤ (σ_min(M)/20)^2   and   max_i g_i(U_0) ≤ 10 µkκ^2(M) ||M|| / d.\n\nThen there exists an absolute constant c such that for any learning rate η < c / (µdk κ^3(M) ||M|| log d), with probability at least 1 − T/d^10 we have, for all t ≤ T, that\n\nf(U_t) ≤ (1 − (1/2) η σ_min(M))^t (σ_min(M)/10)^2   and   max_i g_i(U_t) ≤ 20 µkκ^2(M) ||M|| / d.   (5)\n\nNote that the function max_i g_i(U) measures the incoherence of the matrix U. Theorem 4.5 guarantees that if the initial U_0 lies in a local region that is incoherent and in which U_0 U_0^T is close to M, then with high probability, for all steps t ≤ T, U_t always stays in a slightly relaxed local region, and f(U_t) converges linearly.\n\nIt is not hard to show that all saddle points of f(U) satisfy σ_k(U) = 0, and that all local minima are global minima. Since U_0, ..., U_t automatically stay in the region f(U) ≤ (σ_min(M)/10)^2 with high probability, we know that U_t also stays away from all saddle points. The claim that U_0, ..., U_t stay incoherent is essential for controlling the variance and the almost-sure bound on SG(U_t), so that we can use a large step size and obtain a tight convergence rate.\n\nThe major challenge in proving Theorem 4.5 is to prove that U_t stays in the local region while simultaneously achieving good sample complexity and running time (linear in d); this also requires the learning rate η in Algorithm 1 to be relatively large. Let the event E_t denote the good event in which U_0, ..., U_t satisfy Eq. (5). Theorem 4.5 claims that P(E_T) is large. 
The essential steps in the proof are constructing two supermartingales related to $f(U_t)1_{\mathcal{E}_t}$ and $g_i(U_t)1_{\mathcal{E}_t}$ (where $1_{(\cdot)}$ denotes the indicator function), and using a Bernstein inequality to show the concentration of these supermartingales. The $1_{\mathcal{E}_t}$ term allows us to claim that all previous iterates $U_0, \ldots, U_t$ have all the desired properties inside the local region. Finally, Theorem 3.1 follows as an immediate corollary of Theorem 4.5.

5 Conclusion

In this paper, we presented the first provable, efficient online algorithm for matrix completion, based on nonconvex SGD. In addition to the online setting, our results are also competitive with state-of-the-art results in the offline setting. We obtain our results by introducing a general framework that helps us show how SGD updates self-regulate to stay away from saddle points. We hope our paper and results help generate interest in online matrix completion, and that our techniques and framework prompt tighter analyses of other nonconvex problems.

References

[1] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. arXiv preprint arXiv:1503.00778, 2015.

[2] Matthew Brand. Fast online SVD revisions for lightweight recommender systems. In SDM, pages 37–46. SIAM, 2003.

[3] Emmanuel J. Candes, Yonina C. Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 57(2):225–251, 2015.

[4] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.

[5] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. The YouTube video recommendation system. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 293–296.
ACM, 2010.

[6] Christopher De Sa, Kunle Olukotun, and Christopher Ré. Global convergence of stochastic gradient descent for some non-convex matrix problems. arXiv preprint arXiv:1411.1134, 2014.

[7] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. arXiv preprint arXiv:1503.02101, 2015.

[8] Moritz Hardt. Understanding alternating minimization for matrix completion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 651–660. IEEE, 2014.

[9] Moritz Hardt, Raghu Meka, Prasad Raghavendra, and Benjamin Weitz. Computational limits for matrix completion. In COLT, pages 703–725, 2014.

[10] Prateek Jain, Chi Jin, Sham M. Kakade, and Praneeth Netrapalli. Computing matrix squareroot via non convex local search. arXiv preprint arXiv:1507.05854, 2015.

[11] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Matching matrix Bernstein with little memory: Near-optimal finite sample guarantees for Oja's algorithm. arXiv preprint arXiv:1602.06929, 2016.

[12] Prateek Jain and Praneeth Netrapalli. Fast exact matrix completion with finite samples. arXiv preprint arXiv:1411.1087, 2014.

[13] Hui Ji, Chaoqiang Liu, Zuowei Shen, and Yuhong Xu. Robust video denoising using low rank matrix completion. In CVPR, pages 1791–1798. Citeseer, 2010.

[14] Raghunandan Hulikal Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 2012.

[15] Yehuda Koren. The BellKor solution to the Netflix grand prize. Netflix prize documentation, 81:1–10, 2009.

[16] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 836–844, 2013.

[17] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht.
Gradient descent converges to minimizers. University of California, Berkeley, 1050:16, 2016.

[18] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, January 2003.

[19] Xin Luo, Yunni Xia, and Qingsheng Zhu. Incremental collaborative filtering recommender based on regularized matrix factorization. Knowledge-Based Systems, 27:271–280, 2012.

[20] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.

[21] Ioannis Panageas and Georgios Piliouras. Gradient descent converges to minimizers: The case of non-isolated critical points. arXiv preprint arXiv:1605.00405, 2016.

[22] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430, 2011.

[23] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.

[24] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270–289. IEEE, 2015.

[25] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[26] Se-Young Yun, Marc Lelarge, and Alexandre Proutiere. Streaming, memory limited matrix completion with noise. arXiv preprint arXiv:1504.03156, 2015.