{"title": "Dynamic matrix recovery from incomplete observations under an exact low-rank constraint", "book": "Advances in Neural Information Processing Systems", "page_first": 3585, "page_last": 3593, "abstract": "Low-rank matrix factorizations arise in a wide variety of applications -- including recommendation systems, topic models, and source separation, to name just a few. In these and many other applications, it has been widely noted that by incorporating temporal information and allowing for the possibility of time-varying models, significant improvements are possible in practice. However, despite the reported superior empirical performance of these dynamic models over their static counterparts, there is limited theoretical justification for introducing these more complex models. In this paper we aim to address this gap by studying the problem of recovering a dynamically evolving low-rank matrix from incomplete observations. First, we propose the locally weighted matrix smoothing (LOWEMS) framework as one possible approach to dynamic matrix recovery. We then establish error bounds for LOWEMS in both the {\\em matrix sensing} and {\\em matrix completion} observation models. Our results quantify the potential benefits of exploiting dynamic constraints both in terms of recovery accuracy and sample complexity. To illustrate these benefits we provide both synthetic and real-world experimental results.", "full_text": "Dynamic matrix recovery from incomplete\n\nobservations under an exact low-rank constraint\n\nLiangbei Xu Mark A. 
Davenport\n\nDepartment of Electrical and Computer Engineering\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30318\n\nlxu66@gatech.edu mdav@gatech.edu\n\nAbstract\n\nLow-rank matrix factorizations arise in a wide variety of applications \u2013 including\nrecommendation systems, topic models, and source separation, to name just a few.\nIn these and many other applications, it has been widely noted that by incorporat-\ning temporal information and allowing for the possibility of time-varying models,\nsigni\ufb01cant improvements are possible in practice. However, despite the reported\nsuperior empirical performance of these dynamic models over their static counter-\nparts, there is limited theoretical justi\ufb01cation for introducing these more complex\nmodels. In this paper we aim to address this gap by studying the problem of recov-\nering a dynamically evolving low-rank matrix from incomplete observations. First,\nwe propose the locally weighted matrix smoothing (LOWEMS) framework as one\npossible approach to dynamic matrix recovery. We then establish error bounds for\nLOWEMS in both the matrix sensing and matrix completion observation models.\nOur results quantify the potential bene\ufb01ts of exploiting dynamic constraints both\nin terms of recovery accuracy and sample complexity. To illustrate these bene\ufb01ts\nwe provide both synthetic and real-world experimental results.\n\nIntroduction\n\n1\nSuppose that X \u2208 Rn1\u00d7n2 is a rank-r matrix with r much smaller than n1 and n2. We observe X\nthrough a linear operator A : Rn1\u00d7n2 \u2192 Rm,\n\ny = A(X),\n\ny \u2208 Rm.\n\nIn recent years there has been a signi\ufb01cant amount of progress in our understanding of how to recover\nX from observations of this form even when the number of observations m is much less than the\nnumber of entries in X. (See [8] for an overview of this literature.) 
When A is a set of weighted linear combinations of the entries of X, this problem is often referred to as the matrix sensing problem. In the special case where A samples a subset of the entries of X, it is known as the matrix completion problem. There are a number of ways to establish recovery guarantees in these settings. Perhaps the most popular approach for theoretical analysis in recent years has focused on the use of nuclear norm minimization as a convex surrogate for the (nonconvex) rank constraint [1, 3, 4, 5, 6, 7, 15, 19, 21, 22]. An alternative, however, is to directly solve the problem under an exact low-rank constraint. This leads to a non-convex optimization problem, but it has several computational advantages over most approaches to minimizing the nuclear norm and is widely used in large-scale applications (such as recommendation systems) [16]. In general, popular algorithms for solving rank-constrained models – e.g., alternating minimization and alternating gradient descent – do not have convergence or recovery error guarantees that are as strong, owing to the non-convexity of the rank constraint. However, there has been significant progress on this front in recent years [10, 11, 12, 13, 14, 23, 25], with many of these algorithms now having guarantees comparable to those for nuclear norm minimization.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nNearly all of this existing work assumes that the underlying low-rank matrix X remains fixed throughout the measurement process. In many practical applications, this is a tremendous limitation. For example, users' preferences for various items may change (sometimes quite dramatically) over time. Modelling such drift in users' preferences has been proposed in the context of both music and movies as a way to achieve higher accuracy in recommendation systems [9, 17]. 
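For concreteness, the two observation models can be sketched as linear operators in code. This is an illustrative sketch matching the Gaussian ensemble and uniform sampling models defined later in Section 3, not the authors' implementation; all dimensions are arbitrary choices.

```python
import numpy as np

def gaussian_sensing_operator(n1, n2, m, rng):
    """Matrix sensing: [A(X)]_i = <A_i, X>, with A_i entries i.i.d. N(0, 1/m)."""
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n1 * n2))
    return (lambda X: A @ X.ravel()), A

def completion_operator(n1, n2, m, rng):
    """Matrix completion: sample m entries of X uniformly with replacement."""
    rows = rng.integers(0, n1, size=m)
    cols = rng.integers(0, n2, size=m)
    return (lambda X: X[rows, cols]), (rows, cols)

rng = np.random.default_rng(0)
n1, n2, r, m = 30, 20, 3, 200
X = rng.normal(size=(n1, r)) @ rng.normal(size=(r, n2))   # rank-r ground truth
A_sense, _ = gaussian_sensing_operator(n1, n2, m, rng)
A_comp, (rows, cols) = completion_operator(n1, n2, m, rng)
y_sense, y_comp = A_sense(X), A_comp(X)
```

In both cases the observation is a length-m vector; only the structure of the sensing matrices differs.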
Another example in signal processing is dynamic non-negative matrix factorization for the blind signal separation problem [18]. In these and many other applications, explicitly modelling the dynamic structure in the data has led to superior empirical performance. However, our theoretical understanding of dynamic low-rank matrix recovery is still very limited.\nIn this paper we provide the first theoretical results on the dynamic low-rank matrix recovery problem. We determine the sense in which dynamic constraints can help to recover the underlying time-varying low-rank matrix in a particular dynamic model, and we quantify this impact through recovery error bounds. To describe our approach, we consider a simple example where we have two rank-r matrices X^1 and X^2. Suppose that we have a set of observations for each of X^1 and X^2, given by\n\n$$y^i = \mathcal{A}^i(X^i), \qquad i = 1, 2.$$\n\nThe naïve approach is to use y^1 to recover X^1 and y^2 to recover X^2 separately. In this case the number of observations required to guarantee successful recovery is roughly m_i ≥ C^i r max(n_1, n_2) for i = 1, 2 respectively, where C^1, C^2 are fixed positive constants (see [4]). However, if we know that X^2 is close to X^1 in some sense (for example, if X^2 is a small perturbation of X^1), then the above approach is suboptimal both in terms of recovery accuracy and sample complexity, since in this setting y^1 actually contains information about X^2 (and similarly, y^2 contains information about X^1). There are a variety of possible approaches to incorporating this additional information. The approach we will take is inspired by the LOWESS (locally weighted scatterplot smoothing) approach from non-parametric regression. 
In the case of this simple example, if we look just at the problem of estimating X^2, our approach reduces to solving a problem of the form\n\n$$\min_{X^2} \; \|\mathcal{A}^2(X^2) - y^2\|_2^2 + \lambda \|\mathcal{A}^1(X^2) - y^1\|_2^2 \quad \text{s.t.} \quad \mathrm{rank}(X^2) \le r,$$\n\nwhere λ is a parameter that determines how strictly we are enforcing the dynamic constraint (if X^1 is very close to X^2 we can set λ to be larger, but if X^1 is far from X^2 we will set it to be comparatively small). This approach generalizes naturally to the locally weighted matrix smoothing (LOWEMS) program described in Section 2. Note that it has a (simple) convex objective function, but a non-convex rank constraint. Our analysis in Section 3 shows that the proposed program outperforms the above naïve recovery strategy both in terms of recovery accuracy and sample complexity.\nWe should emphasize that the proposed LOWEMS program is non-convex due to the exact low-rank constraint. Inspired by previous work on matrix factorization, we propose using an efficient alternating minimization algorithm (described in more detail in Section 4). We explicitly enforce the low-rank constraint by optimizing over a rank-r factorization, alternately minimizing with respect to one of the factors while holding the other one fixed. This approach is popular in practice since it is typically less computationally complex than algorithms based on nuclear norm minimization. In addition, thanks to recent work on global convergence guarantees for alternating minimization in low-rank matrix recovery [10, 13, 25], one can reasonably expect similar convergence guarantees to hold for alternating minimization in the context of LOWEMS, although we leave the pursuit of such guarantees for future work.\nTo empirically verify our analysis, we perform both synthetic and real-world experiments, described in Section 5. 
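As a minimal illustration of a rank-constrained solver (not the alternating minimization algorithm the paper actually proposes), the two-snapshot problem above can be attacked by projected gradient descent with a rank-r SVD truncation, often called singular value projection. The entry-sampling masks, step size, and iteration count below are our own assumptions.

```python
import numpy as np

def svp_two_snapshots(y2, mask2, y1, mask1, lam, r, steps=300):
    """Projected gradient descent on ||mask2*X - y2||_F^2 + lam*||mask1*X - y1||_F^2,
    projecting onto rank-r matrices with a truncated SVD after each step."""
    X = np.zeros_like(y2)
    for _ in range(steps):
        grad = 2 * mask2 * (X - y2) + 2 * lam * mask1 * (X - y1)
        X = X - grad / (2 * (1 + lam))            # heuristic normalized step size
        Us, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (Us[:, :r] * s[:r]) @ Vt[:r]          # projection onto rank-r matrices
    return X

# Demo: X1 == X2 (zero perturbation), 40% of entries observed at each snapshot.
rng = np.random.default_rng(0)
Xtrue = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 20))
mask1 = (rng.random((20, 20)) < 0.4).astype(float)
mask2 = (rng.random((20, 20)) < 0.4).astype(float)
Xhat = svp_two_snapshots(mask2 * Xtrue, mask2, mask1 * Xtrue, mask1, lam=1.0, r=2)
```

Setting lam=0 recovers the naïve strategy that ignores the first snapshot entirely.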
The synthetic experimental results demonstrate that LOWEMS outperforms the naïve approach in practice both in terms of recovery accuracy and sample complexity. We also demonstrate the effectiveness of LOWEMS in the context of recommendation systems.\nBefore proceeding, we briefly state some of the notation that we will use throughout. For a vector x ∈ R^n, we let ‖x‖_p denote the standard ℓ_p norm. Given a matrix X ∈ R^{n1×n2}, we use X_{i:} to denote the ith row of X and X_{:j} to denote the jth column of X. We let ‖X‖_F denote the Frobenius norm, ‖X‖_2 the operator norm, ‖X‖_* the nuclear norm, and ‖X‖_∞ = max_{i,j} |X_{ij}| the element-wise infinity norm. Given a pair of matrices X, Y ∈ R^{n1×n2}, we let ⟨X, Y⟩ = Σ_{i,j} X_{ij} Y_{ij} = Tr(Y^T X) denote the standard inner product. Finally, we let n_max and n_min denote max{n_1, n_2} and min{n_1, n_2} respectively.\n\n2 Problem formulation\n\nThe underlying assumption throughout this paper is that our low-rank matrix is changing over time during the measurement process. For simplicity we will model this through the following discrete dynamic process: at time t, we have a low-rank matrix X^t ∈ R^{n1×n2} with rank r, which we assume is related to the matrix at previous time-steps via X^t = f(X^1, ..., X^{t−1}) + ε^t, where ε^t represents noise. We then observe each X^t through a linear operator A^t : R^{n1×n2} → R^{m_t},\n\n$$y^t = \mathcal{A}^t(X^t) + z^t, \qquad y^t, z^t \in \mathbb{R}^{m_t}, \qquad (1)$$\n\nwhere z^t is measurement noise. In our problem we will suppose that we observe up to d time steps, and our goal is to recover {X^t}_{t=1}^d jointly from {y^t}_{t=1}^d.\nThe above model is sufficiently flexible to incorporate a wide variety of dynamics, but we will make several simplifications. 
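To make the observation model (1) concrete, the following sketch simulates data from it, specializing the generic drift f to the random-walk on one factor that the paper adopts below; all dimensions and noise levels are illustrative choices of ours.

```python
import numpy as np

def simulate_dynamic_data(n1=30, n2=20, r=3, d=4, m0=150, sigma1=0.05, sigma2=0.1, seed=0):
    """Simulate y^t = A^t(X^t) + z^t with X^t = U (V^t)^T and V^t = V^{t-1} + eps^t."""
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.normal(size=(n1, r)))[0]      # fixed orthonormal factor
    V = rng.normal(size=(n2, r))
    data = []
    for t in range(d):
        if t > 0:
            V = V + sigma2 * rng.normal(size=(n2, r))  # random-walk drift, model (2)
        X = U @ V.T
        rows = rng.integers(0, n1, size=m0)            # uniform sampling w/ replacement
        cols = rng.integers(0, n2, size=m0)
        y = X[rows, cols] + sigma1 * rng.normal(size=m0)
        data.append((rows, cols, y))
    return U, V, data

U, V, data = simulate_dynamic_data()
```

The returned V is the factor at the final time step, so U @ V.T is the target X^d.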
First, we note that we can impose the low-rank constraint explicitly by factorizing X^t as X^t = U^t (V^t)^T, with U^t ∈ R^{n1×r} and V^t ∈ R^{n2×r}. In general both U^t and V^t may be changing over time. However, in some applications it is reasonable to assume that only one set of factors is changing. For example, in a recommendation system where our matrix represents user preferences, if the rows correspond to items and the columns correspond to users, then U^t contains the latent properties of the items and V^t models the latent preferences of the users. In this context it is reasonable to assume that only V^t changes over time [9, 17], and that there is a fixed matrix U (which we may assume to be orthonormal) such that we can write X^t = U (V^t)^T for all t. Similar arguments can be made in a variety of other applications, including personalized learning systems, blind signal separation, and more.\nSecond, we assume a Markov property on f, so that X^t (or equivalently, V^t) only depends on the previous X^{t−1} (or V^{t−1}). Furthermore, although other dynamic models could be accommodated, for the sake of simplicity in our analysis we consider the simple model on V^t where\n\n$$V^t = V^{t-1} + \epsilon^t, \qquad t = 2, \ldots, d. \qquad (2)$$\n\nWe will also assume that both ε^t and the measurement noise z^t are i.i.d. zero-mean Gaussian noise. To simplify our discussion, we will assume that our goal is to recover the matrix at the most recent time-step, i.e., we wish to estimate X^d from {y^t}_{t=1}^d. Our general approach can be stated as follows. The LOWEMS estimator is given by the following optimization program:\n\n$$\hat{X}^d = \arg\min_{X \in C(r)} L(X) = \arg\min_{X \in C(r)} \frac{1}{2} \sum_{t=1}^{d} w_t \|\mathcal{A}^t(X) - y^t\|_2^2, \qquad (3)$$\n\nwhere C(r) = {X ∈ R^{n1×n2} : rank(X) ≤ r} and {w_t}_{t=1}^d are non-negative weights. We further assume Σ_{t=1}^d w_t = 1 to avoid ambiguity. In the following section we provide bounds on the performance of the LOWEMS estimator for two common choices of the operators A^t.\n\n3 Recovery error bounds\n\nGiven the estimator X̂^d from (3), we define the recovery error to be Δ^d := X̂^d − X^d. Our goal in this section will be to provide bounds on ‖X̂^d − X^d‖_F under two common observation models. Our analysis builds on the following (deterministic) inequality.\nProposition 3.1. The estimator X̂^d given by either (3) or (9) satisfies\n\n$$\sum_{t=1}^{d} w_t \|\mathcal{A}^t(\Delta^d)\|_2^2 \le 2\sqrt{2r} \, \Big\| \sum_{t=1}^{d} w_t \mathcal{A}^{t*}(h^t - z^t) \Big\|_2 \, \|\Delta^d\|_F, \qquad (4)$$\n\nwhere h^t = A^t(X^d − X^t) and A^{t*} is the adjoint operator of A^t.\n\nThis is a deterministic result that holds for any set of {A^t}. The remaining work is to lower bound the LHS of (4) and to upper bound the RHS of (4) for concrete choices of {A^t}. In the following sections we derive such bounds in the settings of both Gaussian matrix sensing and matrix completion. For simplicity, and without loss of generality, we will assume m_1 = ... = m_d =: m_0, so that the total number of observations is simply m = d m_0.\n\n3.1 Matrix sensing setting\n\nFor the matrix sensing problem, we will consider the case where all operators A^t correspond to Gaussian measurement ensembles, defined as follows.\nDefinition 3.2. [4] A linear operator A : R^{n1×n2} → R^m is a Gaussian measurement ensemble if we can express each entry of A(X) as [A(X)]_i = ⟨A_i, X⟩ for a matrix A_i whose entries are i.i.d. according to N(0, 1/m), and where the matrices A_1, . . . 
, A_m are independent from each other.\nAlso, we define the matrix restricted isometry property (RIP) for a linear map A.\nDefinition 3.3. [4] For each integer r = 1, ..., n_min, the isometry constant δ_r of A is the smallest quantity such that\n\n$$(1 - \delta_r)\|X\|_F^2 \le \|\mathcal{A}(X)\|_2^2 \le (1 + \delta_r)\|X\|_F^2$$\n\nholds for all matrices X of rank at most r.\nAn important result (that we use in the proof of Theorem 3.4) is that Gaussian measurement ensembles satisfy the matrix RIP with high probability provided that m ≥ C r n_max. See, for example, [4] for details.\nTo obtain an error bound in the matrix sensing case we lower bound the LHS of (4) using the matrix RIP and upper bound the stochastic error (the RHS of (4)) using a covering argument. The following is our main result in the context of matrix sensing.\nTheorem 3.4. Suppose that we are given measurements as in (1) where all A^t's are Gaussian measurement ensembles. Assume that X^t evolves according to (2) and has rank r. Further assume that the measurement noise z^t is i.i.d. N(0, σ1²) for 1 ≤ t ≤ d and that the perturbation noise ε^t is i.i.d. N(0, σ2²) for 2 ≤ t ≤ d. If\n\n$$m_0 \ge D_1 \max\Big\{ n_{\max} r \sum_{t=1}^{d} w_t^2, \; n_{\max} \Big\}, \qquad (5)$$\n\nwhere D_1 is a fixed positive constant, then the estimator X̂^d from (3) satisfies\n\n$$\|\Delta^d\|_F^2 \le C_0 \Big( \sum_{t=1}^{d} w_t^2 \sigma_1^2 + \sum_{t=1}^{d-1} (d-t) w_t^2 \sigma_2^2 \Big) \frac{n_{\max} r}{m_0} \qquad (6)$$\n\nwith probability at least P_1 = 1 − d C_1 exp(−c_1 n_2), where C_0, C_1, c_1 are positive constants.\nIf we choose the weights as w_d = 1 and w_t = 0 for 1 ≤ t ≤ d − 1, the bound in Theorem 3.4 reduces to a bound matching classical (static) matrix recovery results (see, for example, [4, Theorem 2.4]). Also note that in this case Theorem 3.4 implies exact recovery when the sample complexity is O(rn/d). In order to help interpret this result for other choices of the weights, we note that for a given set of parameters we can determine the optimal weights that minimize this bound. Towards this end, we define κ := σ2²/σ1² and set p_t = d − t for 1 ≤ t ≤ d. Then one can calculate the optimal weights by solving the following quadratic program:\n\n$$\{w_t^*\}_{t=1}^{d} = \arg\min_{\sum_t w_t = 1, \; w_t \ge 0} \; \sum_{t=1}^{d} w_t^2 + \sum_{t=1}^{d-1} p_t \kappa \, w_t^2. \qquad (7)$$\n\nUsing the method of Lagrange multipliers one can show that (7) has the analytical solution\n\n$$w_j^* = \frac{1/(1 + p_j \kappa)}{\sum_{i=1}^{d} 1/(1 + p_i \kappa)}, \qquad 1 \le j \le d. \qquad (8)$$\n\nA simple special case occurs when σ2² = 0. In this case all the V^t's are the same, and the optimal weights go to w_t = 1/d for all t. In contrast, when σ2² grows large the weights eventually converge to w_d = 1 and w_t = 0 for all t ≠ d. This results in essentially using only y^d to recover X^d and ignoring the rest of the measurements. 
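The closed-form weights in (8) are easy to compute and to sanity-check against the two limiting regimes just described; a small sketch:

```python
import numpy as np

def optimal_weights(d, kappa):
    """Optimal LOWEMS weights from (8): w_j* proportional to 1/(1 + p_j*kappa), p_j = d - j."""
    p = d - np.arange(1, d + 1)          # p_j = d - j, so p_d = 0
    w = 1.0 / (1.0 + p * kappa)
    return w / w.sum()

w_static = optimal_weights(4, 0.0)       # kappa -> 0: equal weights 1/d
w_drift = optimal_weights(4, 1e6)        # kappa large: nearly all weight on t = d
```

The two extremes reproduce the equal-weight and last-snapshot-only strategies discussed above.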
Combining these, we note that when σ2² is small, we can gain a factor of approximately d over the naïve strategy that ignores dynamics and tries to recover X^d using only y^d. Notice also that the minimum sample complexity is proportional to Σ_{t=1}^d w_t² when r/d is relatively large. Thus, when σ2² is small, the required number of measurements can be reduced by a factor of d compared to what would be required to recover X^d using only y^d.\n\n3.2 Matrix completion setting\n\nFor the matrix completion problem, we consider the following simple uniform sampling ensemble:\nDefinition 3.5. A linear operator A : R^{n1×n2} → R^m is a uniform sampling ensemble (with replacement) if all sensing matrices A_i are i.i.d. uniformly distributed on the set\n\n$$\mathcal{X} = \big\{ e_j(n_1) e_k(n_2)^T, \; 1 \le j \le n_1, \; 1 \le k \le n_2 \big\},$$\n\nwhere e_j(n) are the canonical basis vectors in R^n. We let p = m_0/(n_1 n_2) denote the fraction of sampled entries.\n\nFor this observation architecture, our analysis is complicated by the fact that it does not satisfy the matrix RIP. (A quick problematic example is a rank-1 matrix with only one non-zero entry.) To handle this we follow the typical approach and restrict our focus to matrices that satisfy certain incoherence properties.\nDefinition 3.6. (Subspace incoherence [10]) Let U ∈ R^{n×r} be the orthonormal basis for an r-dimensional subspace U; then the incoherence of U is defined as\n\n$$\mu(\mathcal{U}) := \max_{i \in [n]} \frac{\sqrt{n}}{\sqrt{r}} \|e_i^T U\|_2,$$\n\nwhere e_i denotes the ith standard basis vector. We also simply denote μ(span(U)) as μ(U).\nDefinition 3.7. (Matrix incoherence [13]) A rank-r matrix X ∈ R^{n1×n2} with SVD X = UΣV^T is incoherent with parameter μ if\n\n$$\|U_{i:}\|_2 \le \frac{\mu\sqrt{r}}{\sqrt{n_1}} \text{ for any } i \in [n_1] \quad \text{and} \quad \|V_{j:}\|_2 \le \frac{\mu\sqrt{r}}{\sqrt{n_2}} \text{ for any } j \in [n_2],$$\n\ni.e., the subspaces spanned by the columns of U and V are both μ-incoherent.\n\nThe incoherence assumption guarantees that X is far from sparse, which makes it possible to recover X from incomplete measurements since each measurement contains roughly the same amount of information about all dimensions.\nTo proceed we also assume that the matrix X^d has "bounded spikiness" in that the maximum entry of X^d is bounded by a, i.e., ‖X^d‖_∞ ≤ a. To exploit the spikiness constraint below, we replace the optimization constraint C(r) in (3) with C(r, a) := {X ∈ R^{n1×n2} : rank(X) ≤ r, ‖X‖_∞ ≤ a}:\n\n$$\hat{X}^d = \arg\min_{X \in C(r,a)} L(X) = \arg\min_{X \in C(r,a)} \frac{1}{2} \sum_{t=1}^{d} w_t \|\mathcal{A}^t(X) - y^t\|_2^2. \qquad (9)$$\n\nNote that Proposition 3.1 still holds for (9).\nTo obtain an error bound in the matrix completion case, we lower bound the LHS of (4) using a restricted convexity argument (see, for example, [20]) and upper bound the RHS using the matrix Bernstein inequality. The result of this approach is the following theorem.\nTheorem 3.8. Suppose that we are given measurements as in (1) where all A^t's are uniform sampling ensembles. Assume that X^t evolves according to (2), has rank r, and is incoherent with parameter μ0, and that ‖X^d‖_∞ ≤ a. Further assume that the perturbation noise and the measurement noise satisfy the same assumptions as in Theorem 3.4. If\n\n$$m_0 \ge D_2 n_{\min} \log^2(n_1 + n_2) \, \phi'(w), \qquad \text{where } \phi'(w) = \frac{\max_t w_t^2 \big( (d-t)\mu_0^2 r \sigma_2^2 + \sigma_1^2 \big)}{\sum_{t=1}^{d} w_t^2 \big( (d-t)\sigma_2^2/n_1 + \sigma_1^2 \big)}, \qquad (10)$$\n\nthen the estimator X̂^d from (9) satisfies\n\n$$\|\Delta^d\|_F^2 \le \max\{B_1, B_2\}, \qquad B_1 := C_2 a^2 n_1 n_2 \sqrt{\frac{\sum_{t=1}^{d} w_t^2 \log(n_1 + n_2)}{m_0}}, \qquad (11)$$\n\nwith probability at least P_1 = 1 − 5/(n_1 + n_2) − 5 d n_max exp(−n_min), where\n\n$$B_2 = \frac{C_3 r n_1^2 n_2^2 \log(n_1 + n_2)}{n_{\min} m_0} \Big( \sum_{t=1}^{d} w_t^2 \sigma_1^2 + \sum_{t=1}^{d-1} (d-t) w_t^2 \sigma_2^2 + \sum_{t=1}^{d} w_t^2 a^2 \Big) \qquad (12)$$\n\nand C_2, C_3, D_2 are absolute positive constants.\n\nIf we choose the weights as w_d = 1 and w_t = 0 for 1 ≤ t ≤ d − 1, the bound in Theorem 3.8 reduces to a result comparable to classical (static) matrix completion results (see, for example, [15, Theorem 7]). Moreover, from the B_2 term in (11), we obtain the same dependence on m as in (6), i.e., 1/m. However, there are also a few key differences between Theorem 3.4 and our results for matrix completion. In general the bound is loose in several respects compared to the matrix sensing bound. For example, when m_0 is small, B_1 actually dominates, in which case the dependence on m is actually 1/√m instead of 1/m. When m_0 is sufficiently large, B_2 dominates, in which case we can consider two cases. The first case corresponds to when a is relatively large compared to σ1, σ2 – i.e., the low-rank matrix is spiky. In this case the term containing a² in B_2 dominates, and the optimal weights are equal weights of 1/d. This occurs because the term involving a dominates and there is little improvement to be obtained by exploiting temporal dynamics. 
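The incoherence parameter of Definition 3.7 and the spikiness bound a are both straightforward to evaluate numerically; this helper is an illustration of ours, not part of the paper's method.

```python
import numpy as np

def matrix_incoherence(X, r):
    """Smallest mu such that X (rank r, SVD X = U S V^T) satisfies Definition 3.7."""
    n1, n2 = X.shape
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    mu_u = np.sqrt(n1 / r) * np.linalg.norm(U, axis=1).max()  # max row norm of U, scaled
    mu_v = np.sqrt(n2 / r) * np.linalg.norm(V, axis=1).max()
    return max(mu_u, mu_v)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30))
mu = matrix_incoherence(X, 3)   # always >= 1; small values mean far from sparse
a = np.abs(X).max()             # spikiness bound ||X||_inf
```

Since the squared row norms of U sum to r, the incoherence is always at least 1, with equality only for maximally spread-out subspaces.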
In the second case, when a is relatively small compared to σ1, σ2 (which is usually the case in practice), the bound can be simplified to\n\n$$\|\Delta^d\|_F^2 \le \frac{c_3 r n_1^2 n_2^2 \log(n_1 + n_2)}{n_{\min} m_0} \Big( \sum_{t=1}^{d} w_t^2 \sigma_1^2 + \sum_{t=1}^{d-1} (d-t) w_t^2 \sigma_2^2 \Big).$$\n\nThe above bound is much more similar to the bound in (6) from Theorem 3.4. In fact, we can also obtain the optimal weights by solving the same quadratic program as in (7).\nWhen n_1 ≈ n_2, the sample complexity is Θ(n_min log²(n_1 + n_2) φ'(w)). In this case Theorem 3.8 also implies a sample complexity reduction similar to the one we observed in the matrix sensing setting. However, the precise relations between the sample complexity and the weights w_t are different in the two cases (deriving from the fact that the proof uses matrix Bernstein inequalities in the matrix completion setting rather than concentration inequalities for Chi-squared variables as in the matrix sensing setting).\n\n4 An algorithm based on alternating minimization\n\nAs noted in Section 2, any rank-r matrix can be factorized as X = UV^T where U is n_1 × r and V is n_2 × r; therefore the LOWEMS estimator in (3) can be reformulated as\n\n$$\hat{X}^d = \arg\min_{X \in C(r)} L(X) = \arg\min_{X = UV^T} \frac{1}{2} \sum_{t=1}^{d} w_t \|\mathcal{A}^t(UV^T) - y^t\|_2^2. \qquad (13)$$\n\nThe above program can be solved by alternating minimization (see [17]), which alternately minimizes the objective function over U (or V) while holding V (or U) fixed until a stopping criterion is reached. Since the objective function is quadratic, each step in this procedure reduces to conventional weighted least squares, which can be solved via efficient numerical procedures. Theoretical guarantees for global convergence of alternating minimization for the static matrix sensing/completion problem have recently been established in [10, 13, 25] by treating alternating minimization as a noisy version of the power method. Extending these results to establish convergence guarantees for (13) would involve analyzing a weighted power method. We leave this analysis for future work, but expect that similar convergence guarantees should be possible in this setting.\n\n5 Simulations and experiments\n\n5.1 Synthetic simulations\n\nOur synthetic simulations consider both matrix sensing and matrix completion, but with an emphasis on matrix completion. We set n_1 = 100, n_2 = 50, d = 4 and r = 5. We consider two baselines: baseline one uses only y^d to recover X^d and simply ignores y^1, ..., y^{d−1}; baseline two uses {y^t}_{t=1}^d with equal weights. Note that both of these can be viewed as special cases of LOWEMS with weights (0, ..., 0, 1) and (1/d, ..., 1/d) respectively. Recalling the formula for the optimal choice of weights in (8), it is easy to show that baseline one is equivalent to the case where κ = σ2²/σ1² → ∞ and baseline two is equivalent to the case where κ → 0. This also makes intuitive sense since κ → ∞ means the perturbation is arbitrarily large between time steps, while κ → 0 reduces to the static setting.\n\nFigure 1: Recovery error under different levels of perturbation noise. (a) matrix sensing. (b) matrix completion.\n\nFigure 2: Sample complexity under different levels of perturbation noise (matrix completion).\n\n1) Recovery error. In this simulation, we set m_0 = 4000 and set the measurement noise level σ1 to 0.05. We vary the perturbation noise level σ2. For every pair of (σ1, σ2) we perform 10 trials and show the average relative recovery error ‖Δ^d‖²_F / ‖X^d‖²_F. Figure 1 illustrates how LOWEMS reduces the recovery error compared to our baselines. As one can see, when σ2 is small, the optimal κ, i.e., σ2²/σ1², generates nearly equal weights (baseline two), reducing the recovery error approximately by a factor of 4 over baseline one, which is roughly equal to d as expected. As σ2 grows, the recovery error of baseline two increases dramatically due to the perturbation noise. However, in this case the optimal κ of LOWEMS grows with it, leading to a more uneven weighting and to somewhat diminished performance gains. We also note that, as expected, LOWEMS converges to baseline one when σ2 is large.\n2) Sample complexity. In the interest of conciseness we only provide results here for the matrix completion setting (matrix sensing yields broadly similar results). In this simulation we vary the fraction of observed entries p to empirically find the minimum sample complexity required to guarantee successful recovery (defined as a relative error ≤ 0.08). We compare the sample complexity of the proposed LOWEMS to baseline one and baseline two under different perturbation noise levels σ2 (σ1 is set to 0.02). For each σ2, the relative recovery error is averaged over 10 trials. Figure 2 illustrates how LOWEMS reduces the sample complexity required to guarantee successful recovery. When the perturbation noise is weaker than the measurement noise, the sample complexity can be reduced approximately by a factor of d compared to baseline one. When the perturbation noise is much stronger than the measurement noise, the recovery error of baseline two increases due to the perturbation noise and hence the sample complexity increases rapidly. 
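Simulations like these rest on the alternating minimization scheme of Section 4. A compact sketch for the matrix completion case follows; the per-row least-squares updates, the small ridge term (for numerical stability on unobserved rows), and the fixed iteration count are our own implementation choices, not specified in the paper.

```python
import numpy as np

def lowems_altmin(data, weights, n1, n2, r, iters=30, ridge=1e-8, seed=0):
    """Alternating minimization for the LOWEMS objective (13), matrix completion case.
    data: list of (rows, cols, y) triples, one per time step; weights: w_t (sum to 1)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n1, r))
    V = rng.normal(size=(n2, r))
    # Pool all observations across time steps, each carrying its weight w_t.
    rows = np.concatenate([d[0] for d in data])
    cols = np.concatenate([d[1] for d in data])
    y = np.concatenate([d[2] for d in data])
    w = np.concatenate([np.full(len(d[2]), wt) for d, wt in zip(data, weights)])
    for _ in range(iters):
        # Each half-step is a weighted least-squares solve, one factor row at a time.
        for fixed, upd, idx_fixed, idx_upd in [(V, U, cols, rows), (U, V, rows, cols)]:
            for i in range(upd.shape[0]):
                sel = idx_upd == i
                F = fixed[idx_fixed[sel]]
                G = (F * w[sel, None]).T @ F + ridge * np.eye(r)
                upd[i] = np.linalg.solve(G, (F * w[sel, None]).T @ y[sel])
    return U, V

# Small noiseless demo (dimensions and sampling are illustrative choices).
rng = np.random.default_rng(1)
Ut, Vt = rng.normal(size=(20, 2)), rng.normal(size=(15, 2))
Xtrue = Ut @ Vt.T
obs_r = rng.integers(0, 20, size=400)
obs_c = rng.integers(0, 15, size=400)
U, V = lowems_altmin([(obs_r, obs_c, Xtrue[obs_r, obs_c])], [1.0], 20, 15, 2)
```

With a single time step and weight 1 this reduces to ordinary alternating least squares for static matrix completion.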
However, in this case the proposed LOWEMS still achieves relatively small sample complexity, and its sample complexity converges to that of baseline one when σ2 is relatively large.\n\nFigure 3: Experimental results on the truncated Netflix dataset. (a) Testing RMSE vs. number of time steps. (b) Validation RMSE vs. κ.\n\n5.2 Real-world experiments\n\nWe next test the LOWEMS approach in the context of a recommendation system using the (truncated) Netflix dataset. We eliminate movies with few ratings and users rating few movies, generating a truncated dataset with 3199 users, 1042 movies, and 2462840 ratings; hence the fraction of visible entries in the rating matrix is ≈ 0.74. All the ratings are distributed over a period of 2191 days. For the sake of robustness, we additionally impose a Frobenius norm penalty on the factor matrices U and V in (13). We keep the latest (in time) 10% of the ratings as a testing set. The remaining ratings are split into a validation set and a training set for the purpose of cross validation. We divide the remaining ratings into d ∈ {1, 3, 6, 8} bins of equal time span according to their timestamps. We use 5-fold cross validation, keeping 1/5 of the ratings from the dth bin as a validation set. The number of latent factors r is set to 10. The Frobenius norm regularization parameter γ is set to 1. We also note that in practice one likely has no prior information on σ1, σ2 and hence κ. 
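Selecting κ on held-out data amounts to a simple grid search over the weights (8). In the sketch below, fit_and_score is a hypothetical user-supplied routine (it would fit (13) with the given weights and return validation RMSE); the grid is an arbitrary choice of ours.

```python
import numpy as np

def optimal_weights(d, kappa):
    """Weights from (8) for drift-to-noise ratio kappa (p_j = d - j)."""
    p = d - np.arange(1, d + 1)
    w = 1.0 / (1.0 + p * kappa)
    return w / w.sum()

def select_kappa(fit_and_score, d, kappas=np.logspace(-2, 1, 7)):
    """Grid-search kappa: evaluate fit_and_score(weights) on each candidate
    and keep the kappa with the smallest validation error."""
    scores = [fit_and_score(optimal_weights(d, k)) for k in kappas]
    return kappas[int(np.argmin(scores))], scores

# Stand-in scoring function purely to exercise the search (not a real model fit).
best_kappa, scores = select_kappa(lambda w: abs(w[-1] - 0.5), d=4)
```

This mirrors the cross-validation strategy described in the text: κ is treated as a tuning parameter rather than estimated from the unknown noise levels.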
However, we can use model selection techniques such as cross validation to select the best κ, thereby incorporating the unknown prior information on the measurement/perturbation noise. We use root mean squared error (RMSE) to measure prediction accuracy. Since alternating minimization uses a random initialization, we generate 10 test RMSEs (shown as a boxplot) for the same testing set. Figure 3(a) shows that the proposed LOWEMS estimator improves the testing RMSE significantly with an appropriate κ. Additionally, the performance improvement increases as d gets larger.\nTo further investigate how the parameter κ affects accuracy, we also plot the validation RMSE against κ in Figure 3(b). When κ ≈ 1, LOWEMS achieves the best RMSE on the validation data. This further demonstrates that imposing an appropriate dynamic constraint should improve recovery accuracy in practice.\n\n6 Conclusion\n\nIn this paper we consider the low-rank matrix recovery problem in a novel setting, where one of the factor matrices changes over time. We propose the locally weighted matrix smoothing (LOWEMS) framework and establish error bounds for LOWEMS in both the matrix sensing and matrix completion cases. Our analysis quantifies how the proposed estimator improves recovery accuracy and reduces sample complexity compared to static recovery methods. Finally, we provide both synthetic and real-world experimental results to verify our analysis and demonstrate superior empirical performance when exploiting dynamic constraints in a recommendation system.\n\nAcknowledgments\nThis work was supported by grants NRL N00173-14-2-C001, AFOSR FA9550-14-1-0342, NSF CCF-1409406, CCF-1350616, and CMMI-1537261.\n\nReferences\n[1] A. Agarwal, S. Negahban, and M. Wainwright. 
Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Ann. Stat., 40(2):1171–1197, 2012.

[2] P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: Methods, theory and applications. Springer-Verlag Berlin Heidelberg, 2011.

[3] E. Candès and Y. Plan. Matrix completion with noise. Proc. IEEE, 98(6):925–936, 2010.

[4] E. Candès and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory, 57(4):2342–2359, 2011.

[5] E. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, 2009.

[6] E. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 56(5):2053–2080, 2010.

[7] M. Davenport, Y. Plan, E. van den Berg, and M. Wootters. 1-bit matrix completion. Inf. Inference, 3(3):189–223, 2014.

[8] M. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE J. Select. Top. Signal Processing, 10(4):608–622, 2016.

[9] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! music dataset and KDD-Cup'11. In Proc. ACM SIGKDD Int. Conf. on Knowledge, Discovery, and Data Mining (KDD), San Diego, CA, Aug. 2011.

[10] M. Hardt. Understanding alternating minimization for matrix completion. In Proc. IEEE Symp. Found. Comp. Science (FOCS), Philadelphia, PA, Oct. 2014.

[11] M. Hardt and M. Wootters. Fast matrix completion without the condition number. In Proc. Conf. Learning Theory (COLT), Barcelona, Spain, June 2014.

[12] P. Jain and P. Netrapalli. Fast exact matrix completion with finite samples. In Proc. Conf. Learning Theory (COLT), Paris, France, July 2015.

[13] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proc. ACM Symp. Theory of Comput. (STOC), Stanford, CA, June 2013.

[14] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. In Proc. Adv. in Neural Information Processing Systems (NIPS), Vancouver, BC, Dec. 2009.

[15] O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.

[16] Y. Koren. The Bellkor solution to the Netflix grand prize, 2009.

[17] Y. Koren. Collaborative filtering with temporal dynamics. Comm. ACM, 53(4):89–97, 2010.

[18] N. Mohammadiha, P. Smaragdis, G. Panahandeh, and S. Doclo. A state-space approach to dynamic nonnegative matrix factorization. IEEE Trans. Signal Processing, 63(4):949–959, 2015.

[19] S. Negahban and M. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat., 39(2):1069–1097, 2011.

[20] S. Negahban and M. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Machine Learning Research, 13(1):1665–1697, 2012.

[21] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, 2010.

[22] B. Recht, W. Xu, and B. Hassibi. Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization. In Proc. IEEE Conf. on Decision and Control (CDC), Cancun, Mexico, Dec. 2008.

[23] R. Sun and Z.-Q. Luo. Guaranteed matrix completion via nonconvex factorization. In Proc. IEEE Symp. Found. Comp. Science (FOCS), Berkeley, CA, Oct. 2015.

[24] J. A. Tropp. An introduction to matrix concentration inequalities. Found. Trends Mach. Learning, 8(1–2):1–230, 2015.

[25] T. Zhao, Z. Wang, and H. Liu. A nonconvex optimization framework for low rank matrix estimation. In Proc. Adv. in Neural Information Processing Systems (NIPS), Montréal, QC, Dec. 2015.