{"title": "Implicit Regularization in Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 6151, "page_last": 6159, "abstract": "We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of X. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.", "full_text": "Implicit Regularization in Matrix Factorization\n\nSuriya Gunasekar\n\nTTI at Chicago\n\nsuriya@ttic.edu\n\nBlake Woodworth\n\nTTI at Chicago\n\nblake@ttic.edu\n\nSrinadh Bhojanapalli\n\nTTI at Chicago\n\nsrinadh@ttic.edu\n\nBehnam Neyshabur\n\nTTI at Chicago\n\nbehnam@ttic.edu\n\nNathan Srebro\nTTI at Chicago\n\nnati@ttic.edu\n\nAbstract\n\nWe study implicit regularization when optimizing an underdetermined quadratic\nobjective over a matrix X with gradient descent on a factorization of X. We\nconjecture and provide empirical and theoretical evidence that with small enough\nstep sizes and initialization close enough to the origin, gradient descent on a full\ndimensional factorization converges to the minimum nuclear norm solution.\n\n1\n\nIntroduction\n\nWhen optimizing underdetermined problems with multiple global minima, the choice of optimization\nalgorithm can play a crucial role in biasing us toward a speci\ufb01c global minima, even though this bias is\nnot explicitly speci\ufb01ed in the objective or problem formulation. For example, using gradient descent\nto optimize an unregularized, underdetermined least squares problem would yield the minimum\nEuclidean norm solution, while using coordinate descent or preconditioned gradient descent might\nyield a different solution. 
Such implicit bias, which can also be viewed as a form of regularization, can play an important role in learning.

In particular, implicit regularization has been shown to play a crucial role in training deep models [14, 13, 18, 11]: deep models often generalize well even when trained purely by minimizing the training error without any explicit regularization, and when there are more parameters than samples and the optimization problem is underdetermined. Consequently, there are many zero training error solutions, all global minima of the training objective, many of which generalize badly. Nevertheless, our choice of optimization algorithm, typically a variant of gradient descent, seems to prefer solutions that do generalize well. This generalization ability cannot be explained by the capacity of the explicitly specified model class (namely, the functions representable in the chosen architecture). Instead, it seems that the optimization algorithm biases us toward a "simple" model, minimizing some implicit "regularization measure", and that generalization is linked to this measure. But what are the regularization measures that are implicitly minimized by different optimization procedures?

As a first step toward understanding implicit regularization in complex models, in this paper we carefully analyze implicit regularization in matrix factorization models, which can be viewed as two-layer networks with linear transfer. We consider gradient descent on the entries of the factor matrices, which is analogous to gradient descent on the weights of a multilayer network. We show how such an optimization approach can indeed yield good generalization properties even when the problem is underdetermined.
We identify the implicit regularizer as the nuclear norm, and show that even when we use a full dimensional factorization, imposing no constraints on the factored matrix, optimization by gradient descent on the factorization biases us toward the minimum nuclear norm solution. Our empirical study leads us to conjecture that with small step sizes and initialization close to zero, gradient descent converges to the minimum nuclear norm solution, and we provide empirical and theoretical evidence for this conjecture, proving it in certain restricted settings.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Factorized Gradient Descent for Matrix Regression

We consider least squares objectives over matrices $X \in \mathbb{R}^{n\times n}$ of the form:

$$\min_{X \succeq 0} F(X) = \|A(X) - y\|_2^2, \qquad (1)$$

where $A : \mathbb{R}^{n\times n} \to \mathbb{R}^m$ is a linear operator specified by $A(X)_i = \langle A_i, X\rangle$, $A_i \in \mathbb{R}^{n\times n}$, and $y \in \mathbb{R}^m$. Without loss of generality, we consider only symmetric positive semidefinite (p.s.d.) $X$ and symmetric linearly independent $A_i$ (otherwise, consider optimization over a larger matrix $\begin{bmatrix} W & X \\ X^\top & Z \end{bmatrix}$ with $A$ operating symmetrically on the off-diagonal blocks). In particular, this setting covers problems including matrix completion (where the $A_i$ are indicators, [5]), matrix reconstruction from linear measurements [15], and multi-task training (where each column of $X$ is a predictor for a different task and the $A_i$ have a single non-zero column, [2, 1]).

We are particularly interested in the regime where $m \ll n^2$, in which case (1) is underdetermined with many global minima satisfying $A(X) = y$. For such underdetermined problems, merely minimizing (1) cannot ensure recovery (in matrix completion or recovery problems) or generalization (in prediction problems).
For example, in a matrix completion problem (without diagonal observations), we can minimize (1) by setting all non-diagonal unobserved entries to zero, or to any arbitrary value.

Instead of working on $X$ directly, we will study a factorization $X = UU^\top$. We can write (1) equivalently as optimization over $U$ as,

$$\min_{U \in \mathbb{R}^{n\times d}} f(U) = \|A(UU^\top) - y\|_2^2. \qquad (2)$$

When $d < n$, this imposes a constraint on the rank of $X$, but we will be mostly interested in the case $d = n$, under which no additional constraint is imposed on $X$ (beyond being p.s.d.) and (2) is equivalent to (1). Thus, if $m \ll n^2$, then (2) with $d = n$ is similarly underdetermined and can be optimized in many ways; reaching a global optimum cannot ensure generalization (e.g. imputing zeros in a matrix completion objective). Let us investigate what happens when we optimize (2) by gradient descent on $U$.

To simulate such a matrix reconstruction problem, we generated $m \ll n^2$ random measurement matrices and set $y = A(X^*)$ according to some planted $X^* \succeq 0$. We minimized (2) by performing gradient descent on $U$ to convergence, and then measured the relative reconstruction error $\|X - X^*\|_F / \|X^*\|_F$ for $X = UU^\top$. Figure 1 shows the normalized training objective and reconstruction error as a function of the dimensionality $d$ of the factorization, for different initialization and step-size policies, and three different planted $X^*$.

First, we see that (for sufficiently large $d$) gradient descent indeed finds a global optimum, as evidenced by the training error (the optimization objective) being zero.
This is not surprising since with large enough $d$ this non-convex problem has no spurious local minima [4, 9] and gradient descent converges almost surely to a global optimum [12]; there has also been recent work establishing conditions for global convergence for low $d$ [3, 7].

The more surprising observation is that in panels (a) and (b), even when $d > m/n$, indeed even for $d = n$, we still get good reconstructions from the solution of gradient descent with initialization $U_0$ close to zero and small step size. In this regime, (2) is underdetermined and minimizing it does not ensure generalization. To emphasize this, we plot the reference behavior of a rank unconstrained global minimizer $X_{gd}$ obtained via projected gradient descent for (1) on the $X$ space. For $d < n$ we also plot an example of an alternate "bad" rank $d$ global optimum obtained with an initialization based on the SVD of $X_{gd}$ ('SVD Initialization').

When $d < m/n$, we understand how the low-rank structure can guarantee generalization [16] and reconstruction [10, 3, 7]. What ensures generalization when $d \gg m/n$? Is there a strong implicit regularization at play for the case of gradient descent on factor space and initialization close to zero?

Observing the nuclear norm of the resulting solutions plotted in Figure 2 suggests that gradient descent implicitly induces a low nuclear norm solution. This is the case even for $d = n$, when the factorization imposes no explicit constraints. Furthermore, we do not include any explicit regularization, and optimization is run to convergence without any early stopping. In fact, we can see a clear bias toward low nuclear norm even in problems where reconstruction is not possible: in panel (c) of Figure 2 the number of samples $m = nr/4$ is much smaller than that required to reconstruct a rank $r$ ground truth matrix $X^*$. The optimization in (2) is highly underdetermined and there are many possible zero-error global minima, but gradient descent still prefers a lower nuclear norm solution.

Figure 1: Reconstruction error of the global optima for $50\times 50$ matrix reconstruction. (Left) $X^*$ is of rank $r = 2$ and $m = 3nr$; (Center) $X^*$ has a spectrum decaying as $O(1/k^{1.5})$, normalized to have $\|X^*\|_* = \sqrt{r}\|X^*\|_F$ for $r = 2$, and $m = 3nr$; and (Right) is a non-reconstructable setting where the number of measurements $m = nr/4$ is much smaller than the requirement to reconstruct a rank $r = 2$ matrix. The plots compare the reconstruction error of gradient descent on $U$ for different choices of initialization $U_0$ and step size $\eta$, including fixed step-size and exact line search clipped for stability ($\eta_{ELS}$). Additionally, the orange dashed reference line represents the performance of $X_{gd}$, a rank unconstrained global optimum obtained by projected gradient descent for (1) on $X$ space, and 'SVD Initialization' is an example of an alternate rank $d$ global optimum, where the initialization $U_0$ is picked based on the SVD of $X_{gd}$ and gradient descent is run on factor space with small step size. Training error behaves similarly in all these settings (zero for $d \geq 2$) and is plotted for reference. Results are averaged across 3 random initializations and (near zero) error bars indicate the standard deviation.

Figure 2: Nuclear norm of the solutions from Figure 1. In addition to the reference of $X_{gd}$ from Figure 1, the magenta dashed line (almost overlapped by the plot of $\|U_0\|_F = 10^{-4}$, $\eta = 10^{-3}$) is added as a reference for the (rank unconstrained) minimum nuclear norm global optimum. The error bars indicate the standard deviation across 3 random initializations. We have dropped the plot for $\|U_0\|_F = 1$, $\eta = 10^{-3}$ to reduce clutter.
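The matrix sensing experiment described above is easy to reproduce at small scale. The sketch below is an illustrative reimplementation, not the authors' code: the problem size, seed, step size, and iteration count are our own choices, chosen much smaller than the paper's $50\times 50$ setting so it runs quickly.

```python
import numpy as np

# Small matrix sensing instance: planted rank-1 X*, m < n(n+1)/2 measurements,
# gradient descent on the full-dimensional factorization X = U U^T with small init.
rng = np.random.RandomState(0)
n, r, m = 6, 1, 18                      # 18 measurements < 21 degrees of freedom

As = []                                 # random symmetric measurement matrices A_i
for _ in range(m):
    G = rng.randn(n, n)
    S = (G + G.T) / 2
    As.append(S / np.linalg.norm(S))    # normalize to unit Frobenius norm
As = np.array(As)                       # shape (m, n, n)

v = rng.randn(n); v /= np.linalg.norm(v)
X_star = np.outer(v, v)                 # planted rank-1 solution, ||X*||_F = 1
y = np.einsum('kij,ij->k', As, X_star)  # y = A(X*)

def A_op(X):  return np.einsum('kij,ij->k', As, X)
def A_adj(s): return np.einsum('k,kij->ij', s, As)

U = 1e-3 * np.eye(n)                    # initialization close to zero
eta = 2e-3
for _ in range(100000):
    resid = A_op(U @ U.T) - y
    U -= eta * 4 * A_adj(resid) @ U     # gradient of ||A(UU^T) - y||^2, A_i symmetric

X = U @ U.T
train_loss = np.sum((A_op(X) - y) ** 2)
rec_err = np.linalg.norm(X - X_star) / np.linalg.norm(X_star)
nuc = np.abs(np.linalg.eigvalsh(X)).sum()
print(train_loss, rec_err, nuc)         # near-zero loss, low reconstruction error
```

Despite the problem being underdetermined, the small-initialization run recovers a solution close to the planted low-rank matrix, with nuclear norm close to $\|X^*\|_* = 1$, mirroring the behavior in Figures 1 and 2.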
The emerging story is that gradient descent biases us to a low nuclear norm solution, and we already know how having low nuclear norm can ensure generalization [17, 6] and how minimizing the nuclear norm ensures reconstruction [15, 5].

Can we more explicitly characterize this bias? We see that we do not always converge precisely to the minimum nuclear norm solution. In particular, the choice of step size and initialization affects which solution gradient descent converges to. Nevertheless, as we formalize in Section 3, we argue that when $U$ is full dimensional, the step size becomes small enough, and the initialization approaches zero, gradient descent will converge precisely to a minimum nuclear norm solution, i.e. to $\operatorname{argmin}_{X\succeq 0} \|X\|_*$ s.t. $A(X) = y$.

3 Gradient Flow and Main Conjecture

The behavior of gradient descent with infinitesimally small step size is captured by the differential equation $\dot{U}_t := \frac{dU_t}{dt} = -\nabla f(U_t)$ with an initial condition for $U_0$. For the optimization in (2) this is

$$\dot{U}_t = -A^*(A(U_tU_t^\top) - y)\,U_t, \qquad (3)$$

where $A^* : \mathbb{R}^m \to \mathbb{R}^{n\times n}$ is the adjoint of $A$, given by $A^*(r) = \sum_i r_i A_i$. Gradient descent can be seen as a discretization of (3), and approaches (3) as the step size goes to zero.

The dynamics (3) define the behavior of the solution $X_t = U_tU_t^\top$, and using the chain rule we can verify that $\dot{X}_t = \dot{U}_tU_t^\top + U_t\dot{U}_t^\top = -A^*(r_t)X_t - X_tA^*(r_t)$, where $r_t = A(X_t) - y$ is the vector of residuals. That is, even though the dynamics are defined in terms of a specific factorization $X_t = U_tU_t^\top$, they are actually independent of the factorization and can be equivalently characterized as

$$\dot{X}_t = -A^*(r_t)X_t - X_tA^*(r_t). \qquad (4)$$

We can now define the limit point $X_\infty(X_{init}) := \lim_{t\to\infty} X_t$ for the factorized gradient flow (4) initialized at $X_0 = X_{init}$. We emphasize that these dynamics are very different from the standard gradient flow dynamics of (1) on $X$, corresponding to gradient descent on $X$, which take the form $\dot{X}_t = -\nabla F(X_t) = -A^*(r_t)$.

Based on the preliminary experiments in Section 2 and a more comprehensive numerical study discussed in Section 5, we state our main conjecture as follows:

Conjecture. For any full rank $X_{init}$, if $\hat{X} = \lim_{\alpha\to 0} X_\infty(\alpha X_{init})$ exists and is a global optimum for (1) with $A(\hat{X}) = y$, then $\hat{X} \in \operatorname{argmin}_{X\succeq 0} \|X\|_*$ s.t. $A(X) = y$.

Requiring a full-rank initial point demands a full dimensional $d = n$ factorization in (2). The assumption of global optimality in the conjecture is generally satisfied: for almost all initializations, gradient flow will converge to a local minimizer [12], and when $d = n$ any such local minimizer is also a global minimum [9]. Since we are primarily concerned with underdetermined problems, we expect the global optimum to achieve zero error, i.e. satisfy $A(X) = y$.
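The claim that the dynamics (4) are independent of the factorization is a purely algebraic identity and can be sanity-checked numerically. The following snippet (ours, with arbitrary sizes and seed) verifies that the chain rule applied to (3) yields exactly the right-hand side of (4):

```python
import numpy as np

# Check that the factorized dynamics (3), Udot = -A*(r) U, induce the
# factorization-independent dynamics (4): Udot U^T + U Udot^T = -A*(r) X - X A*(r).
rng = np.random.RandomState(1)
n, m = 5, 7
As = np.array([(lambda G: (G + G.T) / 2)(rng.randn(n, n)) for _ in range(m)])
y = rng.randn(m)

U = rng.randn(n, n)
X = U @ U.T
r = np.einsum('kij,ij->k', As, X) - y       # residual r = A(X) - y
Astar_r = np.einsum('k,kij->ij', r, As)     # adjoint A*(r) = sum_i r_i A_i

Udot = -Astar_r @ U                         # equation (3)
Xdot_factored = Udot @ U.T + U @ Udot.T     # chain rule on X = U U^T
Xdot_direct = -Astar_r @ X - X @ Astar_r    # equation (4)
print(np.allclose(Xdot_factored, Xdot_direct))  # True
```

The identity holds at any point $U$, which is why (4) can be analyzed without reference to the particular factor $U_t$.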
We already know from this existing literature that gradient descent (or gradient flow) will generally converge to a solution satisfying $A(X) = y$; the question we address here is which of those solutions it will converge to. The conjecture implies the same behavior for an asymmetric factorization $X = UV^\top$ with gradient flow on $(U, V)$, since this is equivalent to gradient flow on the p.s.d. factorization of $\begin{bmatrix} W & X \\ X^\top & Z \end{bmatrix}$.

4 Theoretical Analysis

We will prove our conjecture for the special case where the matrices $A_i$ commute, and discuss the more challenging non-commutative case. But first, let us begin by reviewing the behavior of straightforward gradient descent on $X$ for the convex problem in (1).

Warm up: Consider gradient descent updates on the original problem (1) in $X$ space, ignoring the p.s.d. constraint. The gradient direction $\nabla F(X) = A^*(A(X) - y)$ is always spanned by the $m$ matrices $A_i$. Initializing at $X_{init} = 0$, we will therefore always remain in the $m$-dimensional subspace $L = \{X = A^*(s) \,|\, s \in \mathbb{R}^m\}$. Now consider the optimization problem $\min_X \|X\|_F^2$ s.t. $A(X) = y$. The KKT optimality conditions for this problem are $A(X) = y$ and $\exists \nu$ s.t. $X = A^*(\nu)$. As long as we are in $L$, the second condition is satisfied, and if we converge to a zero-error global minimum, then the first condition is also satisfied. Since gradient descent stays on this manifold, this establishes that if gradient descent converges to a zero-error solution, it is the minimum Frobenius norm solution.

Getting started: m = 1. Consider the simplest case of the factorized problem when $m = 1$, with $A_1 = A$ and $y_1 = y$.
The dynamics of (4) are given by $\dot{X}_t = -r_t(AX_t + X_tA)$, where $r_t$ is simply a scalar, and the solution for $X_t$ is given by $X_t = \exp(s_tA)\,X_0\,\exp(s_tA)$, where $s_T = -\int_0^T r_t\,dt$. Assuming $\hat{X} = \lim_{\alpha\to 0} X_\infty(\alpha X_0)$ exists and $A(\hat{X}) = y$, we want to show $\hat{X}$ is an optimum for the following problem

$$\min_{X\succeq 0} \|X\|_* \text{ s.t. } A(X) = y. \qquad (5)$$

The KKT optimality conditions for (5) are:

$$A(X) = y, \qquad X \succeq 0, \qquad \exists \nu \in \mathbb{R}^m \text{ s.t. } A^*(\nu) \preceq I, \qquad (I - A^*(\nu))X = 0. \qquad (6)$$

We already know that the first condition holds, and the p.s.d. condition is guaranteed by the factorization of $X$. The remaining complementary slackness and dual feasibility conditions effectively require that $\hat{X}$ is spanned by the top eigenvector(s) of $A$. Informally, looking at the gradient flow path above, for any non-zero $y$, as $\alpha \to 0$ it is necessary that $|s_\infty| \to \infty$ in order to converge to a global optimum; thus eigenvectors corresponding to the top eigenvalues of $A$ will dominate the span of $X_\infty(\alpha X_{init})$.

What we can prove: Commutative $\{A_i\}_{i\in[m]}$. The characterization of the gradient flow path from the previous section can be extended to arbitrary $m$ in the case that the matrices $A_i$ commute, i.e. $A_iA_j = A_jA_i$ for all $i, j$. Defining $s_T = -\int_0^T r_t\,dt$, now a vector integral, we can verify by differentiating that the solution of (4) is

$$X_t = \exp(A^*(s_t))\,X_0\,\exp(A^*(s_t)). \qquad (7)$$

Theorem 1. In the case where the matrices $\{A_i\}_{i=1}^m$ commute, if $\hat{X} = \lim_{\alpha\to 0} X_\infty(\alpha I)$ exists and is a global optimum for (1) with $A(\hat{X}) = y$, then $\hat{X} \in \operatorname{argmin}_{X\succeq 0} \|X\|_*$ s.t. $A(X) = y$.

Proof. It suffices to show that such an $\hat{X}$ satisfies the complementary slackness and dual feasibility KKT conditions in (6). Since the matrices $A_i$ commute and are symmetric, they are simultaneously diagonalizable by a basis $v_1, \ldots, v_n$, and so is $A^*(s)$ for any $s \in \mathbb{R}^m$. This implies that for any $\alpha$, $X_\infty(\alpha I)$ given by (7) and its limit $\hat{X}$ also have the same eigenbasis. Furthermore, since $X_\infty(\alpha I)$ converges to $\hat{X}$, the scalars $v_k^\top X_\infty(\alpha I)v_k \to v_k^\top \hat{X} v_k$ for each $k \in [n]$. Therefore, $\lambda_k(X_\infty(\alpha I)) \to \lambda_k(\hat{X})$, where $\lambda_k(\cdot)$ is defined as the eigenvalue corresponding to eigenvector $v_k$ and not necessarily the $k$th largest eigenvalue.

Let $\beta = -\log \alpha$. Then, using $X_0 = e^{-\beta}I$ in (7), $\lambda_k(X_\infty(\alpha I)) = \exp(2\lambda_k(A^*(s_\infty(\beta))) - 2\beta)$. For all $k$ such that $\lambda_k(\hat{X}) > 0$, by the continuity of $\log$, we have

$$2\lambda_k(A^*(s_\infty(\beta))) - 2\beta - \log\lambda_k(\hat{X}) \to 0 \;\Longrightarrow\; \lambda_k\Big(A^*\Big(\frac{s_\infty(\beta)}{\beta}\Big)\Big) - 1 - \frac{\log\lambda_k(\hat{X})}{2\beta} \to 0. \qquad (8)$$

Defining $\nu(\beta) = s_\infty(\beta)/\beta$, we conclude that for all $k$ such that $\lambda_k(\hat{X}) \neq 0$, $\lim_{\beta\to\infty} \lambda_k(A^*(\nu(\beta))) = 1$. Similarly, for each $k$ such that $\lambda_k(\hat{X}) = 0$,

$$\exp(2\lambda_k(A^*(s_\infty(\beta))) - 2\beta) \to 0 \;\Longrightarrow\; \big(\exp(\lambda_k(A^*(\nu(\beta))) - 1)\big)^{2\beta} \to 0. \qquad (9)$$

Thus, for every $\epsilon \in (0, 1]$, for sufficiently large $\beta$,

$$\exp(\lambda_k(A^*(\nu(\beta))) - 1) < \epsilon^{\frac{1}{2\beta}} < 1 \;\Longrightarrow\; \lambda_k(A^*(\nu(\beta))) < 1. \qquad (10)$$

Therefore, we have shown that $\lim_{\beta\to\infty} A^*(\nu(\beta)) \preceq I$ and $\lim_{\beta\to\infty} A^*(\nu(\beta))\hat{X} = \hat{X}$, establishing the optimality of $\hat{X}$ for (5).

Interestingly, and similarly to gradient descent on $X$, this proof does not exploit the particular form of the "control" $r_t$ and only relies on the fact that the gradient flow path stays within the manifold

$$M = \{X = \exp(A^*(s))\,X_{init}\,\exp(A^*(s)) \,|\, s \in \mathbb{R}^m\}. \qquad (11)$$

Since the $A_i$'s commute, we can verify that the tangent space of $M$ at a point $X$ is given by $T_XM = \operatorname{Span}\{A_iX + XA_i\}_{i\in[m]}$; thus gradient flow will always remain in $M$. For any control $r_t$ such that following $\dot{X}_t = -A^*(r_t)X_t - X_tA^*(r_t)$ leads to a zero-error global optimum, that optimum will be a minimum nuclear norm solution. This implies in particular that the conjecture extends to gradient flow on (2) even when the Euclidean norm is replaced by certain other norms, or when only a subset of measurements is used for each step (such as in stochastic gradient descent). However, unlike gradient descent on $X$, the manifold $M$ is not flat, and the tangent space at each point is different. Taking finite length steps, as in gradient descent, would cause us to "fall off" of the manifold.
To avoid this, we must take infinitesimal steps, as in the gradient flow dynamics.

In the case that $X_{init}$ and the measurements $A_i$ are diagonal matrices, gradient descent on (2) is equivalent to a vector least squares problem, parametrized in terms of the square root of the entries:

Corollary 2. Let $x_\infty(x_{init})$ be the limit point of gradient flow on $\min_{u\in\mathbb{R}^n} \|Ax(u) - y\|_2^2$ with initialization $x_{init}$, where $x(u)_i = u_i^2$, $A \in \mathbb{R}^{m\times n}$ and $y \in \mathbb{R}^m$. If $\hat{x} = \lim_{\alpha\to 0} x_\infty(\alpha\vec{1})$ exists and $A\hat{x} = y$, then $\hat{x} \in \operatorname{argmin}_{x\in\mathbb{R}^n_+} \|x\|_1$ s.t. $Ax = y$.

The plot thickens: Non-commutative $\{A_i\}_{i\in[m]}$. Unfortunately, in the case that the matrices $A_i$ do not commute, the analysis is much more difficult. For a matrix-valued function $F$, $\frac{d}{dt}\exp(F_t)$ is equal to $\dot{F}_t\exp(F_t)$ only when $\dot{F}_t$ and $F_t$ commute. Therefore, (7) is no longer a valid solution for (4). Discretizing the solution path, we can express the solution as the "time ordered exponential":

$$X_t = \lim_{\epsilon\to 0} \left(\prod_{\tau=t/\epsilon}^{1} \exp\big(-\epsilon A^*(r_{\tau\epsilon})\big)\right) X_0 \left(\prod_{\tau=1}^{t/\epsilon} \exp\big(-\epsilon A^*(r_{\tau\epsilon})\big)\right), \qquad (12)$$

where the order in the products is important. If the $A_i$ commute, the product of exponentials is equal to an exponential of sums, which in the limit evaluates to the solution in (7). However, since in general $\exp(A_1)\exp(A_2) \neq \exp(A_1 + A_2)$, the path (12) is not contained in the manifold $M$ defined in (11).

It is tempting to try to construct a new manifold $M'$ such that $\operatorname{Span}\{A_iX + XA_i\}_{i\in[m]} \subseteq T_XM'$ and $X_0 \in M'$, ensuring the gradient flow remains in $M'$.
However, since the $A_i$'s do not commute, by combining infinitesimal steps along different directions it is possible to move (very slowly) in directions that are not of the form $A^*(s)X + XA^*(s)$ for any $s \in \mathbb{R}^m$. The possible directions of movement indeed correspond to the Lie algebra defined by the closure of $\{A_i\}_{i=1}^m$ under the commutator operator $[A_i, A_j] := A_iA_j - A_jA_i$. Even when $m = 2$, this closure will generally encompass all of $\mathbb{R}^{n\times n}$, allowing us to approach any p.s.d. matrix $X$ with some (wild) control $r_t$. Thus, we cannot hope to ensure the KKT conditions for an arbitrary control as we did in the commutative case; it is necessary to exploit the structure of the residuals $A(X_t) - y$ in some way. Nevertheless, in order to make finite progress moving along a commutator direction like $[A_i, A_j]X_t + X_t[A_i, A_j]^\top$, it is necessary to use an extremely non-smooth control, e.g., looping $1/\epsilon^2$ times between $\epsilon$ steps in the directions $A_i, A_j, -A_i, -A_j$, each such loop making an $\epsilon^2$ step in the desired direction. We expect the actual residuals $r_t$ to behave much more smoothly, and that for smooth control the non-commutative terms in the expansion of the time ordered exponential (12) are asymptotically lower order than the direct term $A^*(s)$ (as $X_{init} \to 0$). This is indeed confirmed numerically, both for the actual residual controls of the gradient flow path, and for other random controls.

5 Empirical Evidence

Beyond the matrix reconstruction experiments of Section 2, we also conducted experiments with similarly simulated matrix completion problems, including problems where entries are sampled from power-law distributions (thus not satisfying incoherence), as well as a matrix completion problem on non-simulated Movielens data.
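Before describing those experiments, note that the diagonal special case of Corollary 2 is easy to check directly. The following sketch is our own (sizes, seed, step size, and tolerances are arbitrary choices, not from the paper): it runs gradient descent on $u$ with a small positive initialization and compares $x = u^2$ against the minimum $\ell_1$-norm nonnegative solution, computed as a reference by scipy's linear programming solver.

```python
import numpy as np
from scipy.optimize import linprog

# Corollary 2, numerically: gradient descent on u for ||A x(u) - y||^2 with
# x(u)_i = u_i^2 and small positive init should approach the minimum l1-norm
# nonnegative solution of Ax = y.
rng = np.random.RandomState(2)
m, n = 3, 8
A = rng.randn(m, n)
x_plant = np.zeros(n); x_plant[[1, 4]] = [1.0, 0.5]  # sparse nonnegative planted solution
y = A @ x_plant

u = 1e-3 * np.ones(n)            # initialization alpha * vec(1), alpha small
eta = 1e-3
for _ in range(60000):
    r = A @ (u * u) - y
    u -= eta * 4 * (A.T @ r) * u  # gradient of ||A u^2 - y||^2 w.r.t. u

x_gf = u * u                     # (approximate) limit of the factorized dynamics

# Reference: min ||x||_1 s.t. Ax = y, x >= 0, as a linear program
lp = linprog(c=np.ones(n), A_eq=A, b_eq=y, bounds=(0, None))
print(np.linalg.norm(A @ x_gf - y), x_gf.sum(), lp.fun)
```

Note the sign of each $u_i$ is preserved along the dynamics (since $\dot{u}_i \propto u_i$), so $x = u^2$ stays nonnegative throughout; with the finite initialization used here the $\ell_1$ norms agree only approximately, consistent with the conjecture being an $\alpha \to 0$ limit statement.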
In addition to gradient descent, we also looked more directly at the gradient flow ODE (3), using a numerical ODE solver provided as part of SciPy [8] to solve (3), though still with a finite (non-zero) initialization. We also emulated staying on a valid "steering path" by numerically approximating the time ordered exponential of (12): for a finite discretization $\eta$, instead of moving linearly in the direction of the gradient $\nabla f(U)$ (as in gradient descent), we multiply $X_t$ on the right and left by $e^{-\eta A^*(r_t)}$. The results of these experiments are summarized in Figure 3.

In these experiments, we again observe trends similar to those in Section 2. In some panels in Figure 3 we do see a discernible gap between the minimum nuclear norm global optimum and the nuclear norm of the gradient flow solution with $\|U_0\|_F = 10^{-4}$. This discrepancy could be due to starting at a non-limit point $U_0$, or to numerical issues arising from approximations to the ODE, or it could potentially suggest a weakening of the conjecture. Even if the latter were true, the experiments so far provide strong evidence for at least approximate versions of our conjecture being true under a wide range of problems.

(i) Gaussian random measurements. We report the nuclear norm of the gradient flow solutions from three different approximations to (3): a numerical ODE solver (ODE approx.), the time ordered exponential specified in (12) (Time ordered exp.), and standard gradient descent with small step size (Gradient descent). The nuclear norm of the solution from gradient descent on $X$ space, $X_{gd}$, and the minimum nuclear norm global minimum are provided as references.
In (a) $X^*$ is rank $r$ and $m = 3nr$; in (b) $X^*$ has a decaying spectrum with $\|X^*\|_* = \sqrt{r}\|X^*\|_F$ and $m = 3nr$; and in (c) $X^*$ is rank $r$ with $m = nr/4$, where $n = 50$, $r = 2$.

(ii) Uniform matrix completion: $\forall i$, $A_i$ measures a uniform random entry of $X^*$. Details on $X^*$, the number of measurements, and the legends follow Figure 3-(i).

(iii) Power law matrix completion: $\forall i$, $A_i$ measures a random entry of $X^*$ chosen according to a power law distribution. Details on $X^*$, the number of measurements, and the legends follow Figure 3-(i).

                 $X_{gd}$    Gradient descent ($\|U_0\|_F = 10^{-3}$, $\eta = 10^{-2}$)    $\operatorname{argmin}_{A(X)=y} \|X\|_*$
Test Error       1.000       0.2631                                                        0.2880
Nuclear norm     20912       8876                                                          8391

(iv) Benchmark movie recommendation dataset: Movielens 100k. The dataset contains ~100k ratings from $n_1 = 943$ users on $n_2 = 1682$ movies. In this problem, gradient updates are performed on the asymmetric matrix factorization space $X = UV^\top$ with dimension $d = \min(n_1, n_2)$. The training data is completely fit to have $< 10^{-2}$ error. Test error is computed on held out data of 10 ratings per user. Here we are not interested in the recommendation performance (test error) itself but in observing the bias of gradient flow with initialization close to zero to return a low nuclear norm solution; the test error is provided merely to demonstrate the effectiveness of such a bias in this application. Also, due to the scale of the problem, we only report a coarse approximation of the gradient flow (3) from gradient descent with $\|U_0\|_F = 10^{-3}$, $\eta = 10^{-2}$.

Figure 3: Additional matrix reconstruction experiments.

Exhaustive search. Finally, we also ran experiments with an exhaustive grid search over small problems, capturing essentially all possible problems of this size.
We performed an exhaustive grid search over matrix completion problem instances in symmetric p.s.d. $3\times 3$ matrices. With $m = 4$, there are 15 unique masks or $\{A_i\}_{i\in[4]}$'s that are valid symmetric matrix completion observations. For each mask, we fill the $m = 4$ observations with all possible combinations of 10 uniformly spaced values in the interval $[-1, 1]$. This gives us a total of $15 \times 10^4$ problem instances. Of these problem instances, we discard the ones that do not have a valid p.s.d. completion and run the ODE solver on every remaining instance with a random $U_0$ such that $\|U_0\|_F = \bar{\alpha}$, for different values of $\bar{\alpha}$. Results on the deviation from the minimum nuclear norm are reported in Figure 4. For small $\bar{\alpha} = 10^{-5}, 10^{-3}$, most instances of our grid search returned solutions with near minimal nuclear norm, and the deviations are within the possibility of numerical error. This behavior degrades for $\bar{\alpha} = 1$.

Figure 4: Histogram of the relative sub-optimality of the nuclear norm of $X_\infty$ in the grid search experiments. We plot the histogram of $\Delta(X_\infty) = \frac{\|X_\infty\|_* - \|X_{min}\|_*}{\|X_{min}\|_*}$, where $\|X_{min}\|_* = \min_{A(X)=y} \|X\|_*$. The panels correspond to different values of the norm of the initialization, $\bar{\alpha} = \|U_0\|_F$.
(Left) $\bar{\alpha} = 10^{-5}$, (Center) $\bar{\alpha} = 10^{-3}$, and (Right) $\bar{\alpha} = 1$.

6 Discussion

It is becoming increasingly apparent that biases introduced by optimization procedures, especially for underdetermined problems, are playing a key role in learning. Yet, so far we have very little understanding of the implicit biases associated with different non-convex optimization methods. In this paper we carefully study such an implicit bias in a two-layer non-convex problem, identify it, and show how, even though there is no difference in the model class (problems (1) and (2) are equivalent when $d = n$, both with very high capacity), the non-convex modeling induces a potentially much more useful implicit bias.

We also discuss how the bias in the non-convex case is much more delicate than in convex gradient descent: since we are not restricted to a flat manifold, the bias introduced by optimization depends on the step sizes taken. Furthermore, for linear least squares problems (i.e. methods based on the gradients w.r.t. $X$ in our formulation), any global optimization method that uses linear combinations of gradients, including conjugate gradient descent, Nesterov acceleration and momentum methods, remains on the manifold spanned by the gradients, and so leads to the same minimum norm solution. This is not true if the manifold is curved, as using momentum or past gradients will lead us to "shoot off" the manifold.

Much of the recent work on non-convex optimization, and matrix factorization in particular, has focused on global convergence: whether, and how quickly, we converge to a global minimum. In contrast, we address the complementary question of which global minimum we converge to.
There has also been much work on methods ensuring good matrix reconstruction or generalization based on structural and statistical properties. We do not assume any such properties, nor that reconstruction is possible, or even that there is anything to reconstruct: for any problem of the form (1) we conjecture that (4) leads to the minimum nuclear norm solution. Whether such a minimum nuclear norm solution is good for reconstruction or learning is a separate issue already well addressed by the above literature. We based our conjecture on extensive numerical simulations, with random, skewed, reconstructible, non-reconstructible, incoherent, non-incoherent, and exhaustively enumerated problems, some of which are reported in Section 5. We believe our conjecture holds, perhaps with some additional technical conditions or corrections. We explain how the conjecture is related to control on manifolds and the time ordered exponential, and discuss a possible approach for proving it.

References

[1] Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pages 17–24. ACM, 2007.

[2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. Advances in Neural Information Processing Systems, 19:41, 2007.

[3] Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Global optimality of local search for low rank matrix recovery. Advances in Neural Information Processing Systems, 2016.

[4] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization.
Mathematical Programming, 95(2):329–357, 2003.

[5] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[6] Rina Foygel and Nathan Srebro. Concentration-based guarantees for low-rank matrix reconstruction. In COLT, pages 315–340, 2011.

[7] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[8] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001.

[9] Michel Journée, Francis Bach, P.-A. Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

[10] Raghunandan Hulikal Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 2012.

[11] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.

[12] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, 2016.

[13] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.

[14] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015.

[15] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo.
Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[16] Nathan Srebro, Noga Alon, and Tommi S. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems, pages 1321–1328, 2005.

[17] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545–560. Springer, 2005.

[18] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.