{"title": "Matrix Completion Under Monotonic Single Index Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1873, "page_last": 1881, "abstract": "Most recent results in matrix completion assume that the matrix under consideration is low-rank or that the columns are in a union of low-rank subspaces. In real-world settings, however, the linear structure underlying these models is distorted by a (typically unknown) nonlinear transformation. This paper addresses the challenge of matrix completion in the face of such nonlinearities. Given a few observations of a matrix that are obtained by applying a Lipschitz, monotonic function to a low rank matrix, our task is to estimate the remaining unobserved entries. We propose a novel matrix completion method that alternates between low-rank matrix estimation and monotonic function estimation to estimate the missing matrix elements. Mean squared error bounds provide insight into how well the matrix can be estimated based on the size, rank of the matrix and properties of the nonlinear transformation. Empirical results on synthetic and real-world datasets demonstrate the competitiveness of the proposed approach.", "full_text": "Matrix Completion Under Monotonic Single Index\n\nModels\n\nWisconsin Institutes for Discovery\n\nElectrical Engineering and Computer Sciences\n\nLaura Balzano\n\nUniversity of Michigan Ann Arbor\n\ngirasole@umich.edu\n\nRavi Ganti\n\nUW-Madison\n\ngantimahapat@wisc.edu\n\nDepartment of Electrical and Computer Engineering\n\nRebecca Willett\n\nUW-Madison\n\nrmwillett@wisc.edu\n\nAbstract\n\nMost recent results in matrix completion assume that the matrix under consider-\nation is low-rank or that the columns are in a union of low-rank subspaces. In\nreal-world settings, however, the linear structure underlying these models is dis-\ntorted by a (typically unknown) nonlinear transformation. 
This paper addresses the challenge of matrix completion in the face of such nonlinearities. Given a few observations of a matrix that are obtained by applying a Lipschitz, monotonic function to a low-rank matrix, our task is to estimate the remaining unobserved entries. We propose a novel matrix completion method that alternates between low-rank matrix estimation and monotonic function estimation to estimate the missing matrix elements. Mean squared error bounds provide insight into how well the matrix can be estimated based on the size and rank of the matrix and properties of the nonlinear transformation. Empirical results on synthetic and real-world datasets demonstrate the competitiveness of the proposed approach.

1 Introduction

In matrix completion, one has access to a matrix with only a few observed entries, and the task is to estimate the entire matrix using the observed entries. This problem has a plethora of applications such as collaborative filtering, recommender systems [1] and sensor networks [2]. Matrix completion has been well studied in machine learning, and we now know how to recover certain matrices given a few observed entries of the matrix [3, 4] when the matrix is assumed to be low rank. Typical work in matrix completion assumes that the matrix to be recovered is incoherent and low rank, and that the entries are sampled uniformly at random [5, 6, 4, 3, 7, 8]. While recent work has focused on relaxing the incoherence and sampling conditions under which matrix completion succeeds, there has been little work on matrix completion when the underlying matrix is of high rank. In this paper, we shall assume that the matrix that we need to complete is obtained by applying some unknown, non-linear function to each element of an unknown low-rank matrix. Because of the application of a non-linear transformation, the resulting ratings matrix tends to have a large rank. 
To understand the effect of applying a non-linear transformation to a low-rank matrix, we shall consider the following simple experiment. Given an n × m matrix X, let X = Σ_{i=1}^m σ_i u_i v_i^⊤ be its SVD. The rank of the matrix X is the number of non-zero singular values. Given an ε ∈ (0, 1), define the effective rank of X as follows:

r_ε(X) = min { k ∈ ℕ : √( (Σ_{j=k+1}^m σ_j²) / (Σ_{j=1}^m σ_j²) ) ≤ ε }.  (1)

Figure 1: The plot shows r_{0.01}(X) defined in equation (1), obtained by applying a non-linear function g* to each element of Z, where g*(z) = 1/(1 + exp(−cz)). Z is a 30 × 20 matrix of rank 5.

The effective rank of X tells us the rank k of the lowest-rank approximator X̂ that satisfies

‖X̂ − X‖_F / ‖X‖_F ≤ ε.  (2)

In Figure 1, we show the effect of applying the non-linear monotonic function g*(z) = 1/(1 + exp(−cz)) to the elements of a low-rank matrix Z. Both the rank of X and its effective rank r_ε(X) grow rapidly with c, rendering traditional matrix completion methods ineffective even in the presence of mild nonlinearities.

1.1 Our Model and contributions

In this paper we consider the high-rank matrix completion problem where the data generating process is as follows. There is some unknown matrix Z* ∈ ℝ^{n×m} with m ≤ n and of rank r ≪ m. A non-linear, monotonic, L-Lipschitz function g* is applied to each element of the matrix Z* to get another matrix M*. 
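The effective-rank experiment above is easy to reproduce. The sketch below is our own minimal implementation (the function name `effective_rank` is our choice, not the authors'): it computes r_ε(X) from equation (1) via the SVD and shows the rank inflation caused by a steep logistic transfer, assuming only numpy.

```python
import numpy as np

def effective_rank(X, eps=0.01):
    """Smallest k such that the relative tail energy of the spectrum is <= eps (eq. 1)."""
    s = np.linalg.svd(X, compute_uv=False)
    total = np.sum(s ** 2)
    for k in range(len(s) + 1):
        if np.sqrt(np.sum(s[k:] ** 2) / total) <= eps:
            return k
    return len(s)

rng = np.random.default_rng(0)
n, m, r = 30, 20, 5
Z = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-5 matrix
M = 1.0 / (1.0 + np.exp(-50 * Z))                              # g*(z) with a steep c = 50

# The rank-5 Z has effective rank at most 5; the transformed M needs many more
# singular values to capture the same fraction of its squared Frobenius norm.
print(effective_rank(Z), effective_rank(M))
```

This mirrors the behavior plotted in Figure 1: the larger c is, the flatter the spectrum of M and the larger r_{0.01}(M) becomes.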
A noisy version of M*, which we call X, is observed on a subset of indices denoted by Ω ⊂ [n] × [m]:

M*_{i,j} = g*(Z*_{i,j}), ∀ i ∈ [n], j ∈ [m]  (3)
X_Ω = (M* + N)_Ω  (4)

The function g* is called the transfer function. We shall assume that E[N] = 0, and that the entries of N are i.i.d. We shall also assume that the index set Ω is generated uniformly at random with replacement from the set [n] × [m]¹. Our task is to reliably estimate the entire matrix M* given observations of X on Ω. We shall call the above model Monotonic Matrix Completion (MMC).
To illustrate our framework we shall consider the following two simple examples. In recommender systems users are required to provide discrete ratings of various objects. For example, in the Netflix problem users are required to rate movies on a scale of 1–5². These discrete scores can be thought of as obtained by applying a rounding function to some ideal real-valued score matrix given by the users. This real-valued score matrix may be well modeled by a low-rank matrix, but the application of the rounding function³ increases the rank of the original low-rank matrix. Another important example is the completion of Gaussian kernel matrices. Gaussian kernel matrices are used in kernel-based learning methods. The Gaussian kernel matrix of a set of n points is an n × n matrix obtained by applying the Gaussian function to an underlying Euclidean distance matrix. The Euclidean distance matrix is a low-rank matrix [9]. However, in many cases one cannot measure all pair-wise distances between objects, resulting in an incomplete Euclidean distance matrix and hence an incomplete kernel matrix. 
Completing the kernel matrix can then be viewed as completing a matrix of large rank.
In this paper we study this matrix completion problem and provide algorithms with provable error guarantees. Our contributions are as follows:

1. In Section 3 we propose an optimization formulation to estimate matrices in the above described context. In order to do this we introduce two formulations, one using a squared loss, which we call MMC-LS, and another using a calibrated loss function, which we call MMC-c. For both these formulations we minimize w.r.t. M* and g*. The calibrated loss function has the property that the minimizer of the calibrated loss satisfies equation (3).

2. We propose alternating minimization algorithms to solve our optimization problem. Our proposed algorithms, called MMC-c and MMC-LS, alternate between solving a quadratic program to estimate g* and performing projected gradient descent updates to estimate the matrix Z*. MMC outputs the matrix M̂ where M̂_{i,j} = ĝ(Ẑ_{i,j}).

3. In Section 4 we analyze the mean squared error (MSE) of the matrix M̂ returned by one step of the MMC-c algorithm. The upper bound on the MSE of the matrix M̂ output by MMC depends only on the rank r of the matrix Z* and not on the rank of matrix M*. This property makes our analysis useful because the matrix M* could potentially be of high rank, and our results imply reliable estimation of a high-rank matrix with error guarantees that depend only on the rank of the matrix Z*.

¹By [n] we denote the set {1, 2, . . . , n}.
²This is typical of many other recommender engines such as Pandora.com, Last.fm and Amazon.com.
³Technically the rounding function is not a Lipschitz function but can be well approximated by a Lipschitz function.

4. 
We compare our proposed algorithms to state-of-the-art implementations of low-rank matrix completion on both synthetic and real datasets (Section 5).

2 Related work

Classical matrix completion with and without noise has been investigated by several authors [5, 6, 4, 3, 7, 8]. The recovery techniques proposed in these papers solve a convex optimization problem that minimizes the nuclear norm of the matrix subject to convex constraints. Progress has also been made on designing efficient algorithms to solve the ensuing convex optimization problem [10, 11, 12, 13]. Recovery techniques based on nuclear norm minimization guarantee matrix recovery under the conditions that a) the matrix is low rank, b) the matrix is incoherent or not very spiky, and c) the entries are observed uniformly at random. Literature on high-rank matrix completion is relatively sparse. When the columns or rows of the matrix belong to a union of subspaces, the matrix tends to be of high rank. For such high-rank matrix completion problems, algorithms have been proposed that exploit the fact that multiple low-rank subspaces can be learned by clustering the columns or rows and learning a subspace from each of the clusters. While Eriksson et al. [14] suggested looking at the neighbourhood of each incomplete point for completion, [15] used a combination of spectral clustering techniques, as done in [16, 17], along with learning sparse representations via convex optimization to estimate the incomplete matrix. Singh et al. [18] consider a certain specific class of high-rank matrices that are obtained from ultra-metrics. In [19] the authors consider a model similar to ours, but instead of learning a single monotonic function, they learn multiple monotonic functions, one for each row of the matrix. 
However, unlike in this paper, their focus is on a ranking problem and their proposed algorithms lack theoretical guarantees.
Davenport et al. [20] studied the one-bit matrix completion problem. Their model is a special case of the matrix completion model considered in this paper. In the one-bit matrix completion problem we assume that g* is known and is the CDF of an appropriate probability distribution, and the matrix X is a boolean matrix where each entry takes the value 1 with probability M_{i,j}, and 0 with probability 1 − M_{i,j}. Since g* is known, the focus in one-bit matrix completion problems is accurate estimation of Z*.
To the best of our knowledge the MMC model considered in this paper has not been investigated before. The MMC model is inspired by the single index model (SIM) that has been studied for regression problems both in statistics [21, 22] and econometrics [23, 24]. Our MMC model can be thought of as an extension of SIM to matrix completion problems.

3 Algorithms for matrix completion

Our goal is to estimate g* and Z* from the model in equations (3)-(4). We approach this problem via mathematical optimization. Before we discuss our algorithms, we briefly describe an algorithm for the problem of learning Lipschitz, monotonic functions in one dimension. This algorithm will be used for learning the link function in MMC.

The LPAV algorithm: Suppose we are given data (p_1, y_1), . . . , (p_n, y_n), where p_1 ≤ p_2 ≤ . . . ≤ p_n, and y_1, . . . , y_n are real numbers. Let G := {g : ℝ → ℝ, g is L-Lipschitz and monotonic}. The LPAV⁴ algorithm introduced in [21] outputs the best function ĝ in G that minimizes Σ_{i=1}^n (g(p_i) − y_i)². In order to do this, the LPAV first solves the following optimization problem:

ẑ = arg min_{z ∈ ℝⁿ} ‖z − y‖²₂  s.t. 0 ≤ z_j − z_i ≤ L(p_j − p_i) if p_i ≤ p_j,  (5)

where ĝ(p_i) := ẑ_i. This gives us the value of ĝ on the discrete set of points p_1, . . . , p_n. To get ĝ everywhere else on the real line, we simply perform linear interpolation as follows:

ĝ(ζ) = ẑ_1 if ζ ≤ p_1;  ẑ_n if ζ ≥ p_n;  µẑ_i + (1 − µ)ẑ_{i+1} if ζ = µp_i + (1 − µ)p_{i+1}.  (6)

3.1 Squared loss minimization

A natural approach to the monotonic matrix completion problem is to learn g*, Z* via squared loss minimization. In order to do this we need to solve the following optimization problem:

min_{g,Z} Σ_{(i,j)∈Ω} (g(Z_{i,j}) − X_{i,j})²
s.t. g : ℝ → ℝ is L-Lipschitz and monotonic, rank(Z) ≤ r.  (7)

The problem is a non-convex optimization problem in each of the parameters g, Z individually. A reasonable approach to solve this optimization problem would be to optimize w.r.t. each variable while keeping the other variable fixed. For instance, in iteration t, while estimating Z one would keep g fixed, say to g^{t−1}, and then perform projected gradient descent w.r.t. Z. This leads to the following updates for Z:

Z^t_{i,j} ← Z^{t−1}_{i,j} − η (g^{t−1}(Z^{t−1}_{i,j}) − X_{i,j}) (g^{t−1})′(Z^{t−1}_{i,j}), ∀ (i, j) ∈ Ω  (8)
Z^t ← P_r(Z^t)  (9)

where η > 0 is a step size used in our projected gradient descent procedure, and P_r is the projection onto the rank-r cone. The above update involves both the function g^{t−1} and its derivative (g^{t−1})′. Since our link function is monotonic, one can use the LPAV algorithm to estimate this link function g^{t−1}. 
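One convenient way to sketch the LPAV step is to optimize over the increments ẑ_{i+1} − ẑ_i: for sorted inputs, the pairwise constraints in (5) follow from the adjacent-pair constraints by telescoping, and those become simple box constraints. The snippet below is our own reformulation (it uses `scipy.optimize.lsq_linear` rather than a general QP solver, and the name `lpav_fit` is ours); the extension rule (6) is exactly `np.interp` with its constant extrapolation at the ends.

```python
import numpy as np
from scipy.optimize import lsq_linear

def lpav_fit(p, y, L=1.0):
    """Fit a monotone, L-Lipschitz function to (p, y) by least squares (eq. 5).

    Parametrize z = z_1 + cumulative sum of increments d_i with
    0 <= d_i <= L * (p_{i+1} - p_i); for sorted p these adjacent-pair
    box constraints imply all pairwise constraints in (5).
    """
    order = np.argsort(p)
    p, y = np.asarray(p, float)[order], np.asarray(y, float)[order]
    n = len(p)
    # z = A @ theta with theta = (z_1, d_1, ..., d_{n-1}); A is lower-triangular ones
    A = np.tril(np.ones((n, n)))
    lb = np.concatenate(([-np.inf], np.zeros(n - 1)))
    ub = np.concatenate(([np.inf], L * np.diff(p)))
    z_hat = A @ lsq_linear(A, y, bounds=(lb, ub)).x
    # eq. (6): linear interpolation inside [p_1, p_n], constant outside
    return p, z_hat, (lambda t: np.interp(t, p, z_hat))

rng = np.random.default_rng(2)
p = np.sort(rng.uniform(-3, 3, 40))
y = 1 / (1 + np.exp(-2 * p)) + 0.05 * rng.standard_normal(40)  # noisy monotone data
knots, z_hat, g_hat = lpav_fit(p, y, L=1.0)
```

The fitted values are non-decreasing and change by at most L times the spacing of the inputs, so ĝ is monotone and L-Lipschitz by construction.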
Furthermore, since LPAV estimates g^{t−1} as a piecewise linear function, the function has a sub-differential everywhere and the sub-differential (g^{t−1})′ can be obtained very cheaply. Hence, the projected gradient update shown in equation (8), along with the LPAV algorithm, can be iteratively used to learn estimates for Z* and g*. We shall call this algorithm MMC-LS. Incorrect estimation of g^{t−1} will also lead to incorrect estimation of the derivative (g^{t−1})′. Hence, we would expect MMC-LS to be less accurate than a learning algorithm that does not have to estimate (g^{t−1})′. We next outline an approach that provides a principled way to derive updates for Z^t and g^t and does not require us to estimate derivatives of the transfer function, as in MMC-LS.

3.2 Minimization of a calibrated loss function and the MMC algorithm.
Let Φ : ℝ → ℝ be a differentiable function that satisfies Φ′ = g*. Since g* is a monotonic function, Φ will be a convex loss function. Now, suppose g* (and hence Φ) is known. Consider the following function of Z:

L(Z; Φ, Ω) = E_X [ Σ_{(i,j)∈Ω} Φ(Z_{i,j}) − X_{i,j} Z_{i,j} ].  (10)

The above loss function is convex in Z, since Φ is convex. Differentiating the expression on the R.H.S. of Equation (10) w.r.t. Z, and setting it to 0, we get

g*(Z_{i,j}) − E[X_{i,j}] = 0, ∀ (i, j) ∈ Ω.  (11)

The MMC model shown in Equation (3) satisfies Equation (11) and is therefore a minimizer of the loss function L(Z; Φ, Ω). Hence, the loss function (10) is "calibrated" for the MMC model that we are interested in. The idea of using calibrated loss functions was first introduced for learning single index models [25].

⁴LPAV stands for Lipschitz Pool Adjacent Violator. 
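As a concrete illustration (our own, for the logistic transfer used later in the experiments), the calibrated potential has a closed form, and the condition Φ′ = g* is easy to verify:

```latex
g^*(z) = \frac{1}{1 + e^{-cz}}, \qquad
\Phi(z) = \frac{1}{c}\,\log\!\left(1 + e^{cz}\right)
\;\Longrightarrow\;
\Phi'(z) = \frac{1}{c}\cdot\frac{c\,e^{cz}}{1 + e^{cz}}
         = \frac{1}{1 + e^{-cz}} = g^*(z),
\qquad
\Phi''(z) = c\,g^*(z)\bigl(1 - g^*(z)\bigr) > 0,
```

so Φ is indeed convex, consistent with the monotonicity of g*.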
When the transfer function is the identity, Φ is a quadratic function and we recover the squared loss approach that we discussed in Section 3.1.
The above discussion assumes that g* is known. However, in the MMC model this is not the case. To get around this problem, we consider the following optimization problem:

min_{Φ,Z} L(Φ, Z; Ω) = min_{Φ,Z} E_X [ Σ_{(i,j)∈Ω} Φ(Z_{i,j}) − X_{i,j} Z_{i,j} ],  (12)

where Φ : ℝ → ℝ is a convex function with Φ′ = g, and Z ∈ ℝ^{n×m} is a low-rank matrix. Since we know that g* is a Lipschitz, monotonic function, we shall solve a constrained optimization problem that enforces Lipschitz constraints on g and low-rank constraints on Z. We consider the sample version of the optimization problem shown in equation (12):

min_{Φ, rank(Z)≤r} L(Φ, Z; Ω) = min_{Φ,Z} Σ_{(i,j)∈Ω} Φ(Z_{i,j}) − X_{i,j} Z_{i,j}.  (13)

The pseudo-code of our algorithm MMC, which solves the above optimization problem (13), is shown in Algorithm 1. MMC optimizes over Φ and Z alternately, fixing one variable while updating the other.
At the start of iteration t, we have at our disposal the iterates ĝ^{t−1} and Ẑ^{t−1}. To update our estimate of Z, we perform gradient descent with Φ fixed such that Φ′ = ĝ^{t−1}. Notice that the objective in equation (13) is convex w.r.t. Z. This is in contrast to the least squares formulation, where the objective in equation (7) is non-convex w.r.t. Z. The gradient of L(Z; Φ) w.r.t. 
Z is

∇_{Z_{i,j}} L(Z; Φ) = ĝ^{t−1}(Ẑ^{t−1}_{i,j}) − X_{i,j}, (i, j) ∈ Ω.  (14)

Gradient descent on Ẑ^{t−1} using the above gradient calculation leads to an update of the form

Ẑ^t_{i,j} ← Ẑ^{t−1}_{i,j} − η (ĝ^{t−1}(Ẑ^{t−1}_{i,j}) − X_{i,j}) 1_{(i,j)∈Ω},  Ẑ^t ← P_r(Ẑ^t).  (15)

Equation (15) projects the matrix Ẑ^t onto the cone of matrices of rank r. This entails performing an SVD of Ẑ^t and retaining the top r singular vectors and singular values while discarding the rest. This is done in steps 4, 5 of Algorithm 1. As can be seen from the above equation, we do not need to estimate the derivative of ĝ^{t−1}. This, along with the convexity of the optimization problem in Equation (13) w.r.t. Z for a given Φ, are two of the key advantages of using a calibrated loss function over the previously proposed squared loss minimization formulation.
Optimization over Φ. In round t of Algorithm 1, we have Ẑ^t after performing steps 4, 5. Differentiating the objective function in equation (13) w.r.t. Z, we get that the optimal Φ should satisfy

Σ_{(i,j)∈Ω} ĝ^t(Ẑ^t_{i,j}) − X_{i,j} = 0,  (16)

where Φ′ = ĝ^t. This provides us with a strategy to calculate ĝ^t. Let X̂_{i,j} := ĝ^t(Ẑ^t_{i,j}). Then solving the optimization problem in equation (16) is equivalent to solving the following optimization problem:

min_{X̂} Σ_{(i,j)∈Ω} (X̂_{i,j} − X_{i,j})²
subject to: 0 ≤ −X̂_{i,j} + X̂_{k,l} ≤ L(Ẑ^t_{k,l} − Ẑ^t_{i,j}) if Ẑ^t_{i,j} ≤ Ẑ^t_{k,l}, (i, j) ∈ Ω, (k, l) ∈ Ω,  (17)

where L is the Lipschitz constant of g*. We shall assume that L is known and does not need to be estimated. 
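A single Z-update of the algorithm, the gradient step (15) followed by the rank-r projection P_r, can be sketched as follows. This is our own minimal implementation (names are ours); `g_hat` stands for whatever current estimate of the link function is available.

```python
import numpy as np

def mmc_z_step(Z, X, mask, g_hat, r, eta=0.5):
    """One calibrated-loss update: gradient step on the observed entries (eq. 15),
    then projection onto the rank-r cone via a truncated SVD."""
    G = np.zeros_like(Z)
    G[mask] = g_hat(Z[mask]) - X[mask]      # gradient is g(Z_ij) - X_ij on Omega, 0 elsewhere
    Z_new = Z - eta * G
    U, s, Vt = np.linalg.svd(Z_new, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]      # P_r: keep the top-r singular triplets

rng = np.random.default_rng(3)
n, m, r = 30, 20, 5
Z0 = rng.standard_normal((n, m))
X = rng.uniform(size=(n, m))
mask = rng.uniform(size=(n, m)) < 0.5       # boolean indicator of the observed set Omega
Z1 = mmc_z_step(Z0, X, mask, g_hat=lambda z: 1 / (1 + np.exp(-z)), r=r)
```

Note that, as the text emphasizes, no derivative of the link estimate appears anywhere in this update.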
The gradient w.r.t. X̂ of the objective function in equation (17), when set to zero, is the same as Equation (16). The constraints enforce the monotonicity and the Lipschitz property of ĝ^t. The above optimization routine is exactly the LPAV algorithm. The solution X̂ obtained from solving the LPAV problem can be used to define ĝ^t on X_Ω. These two steps are repeated for T iterations. After T iterations we have ĝ^T defined on Ẑ^T_Ω. In order to define ĝ^T everywhere else on the real line we perform linear interpolation as shown in equation (6).

Algorithm 1 Monotonic Matrix Completion (MMC)
Input: Parameters η > 0, T > 0, r; Data: X_Ω, Ω
Output: M̂ = ĝ^T(Ẑ^T)
1: Initialize Ẑ^0 = (mn/|Ω|) X_Ω, where X_Ω is the matrix X with zeros filled in at the unobserved locations.
2: Initialize ĝ^0(z) = (|Ω|/mn) z
3: for t = 1, . . . , T do
4:   Ẑ^t_{i,j} ← Ẑ^{t−1}_{i,j} − η (ĝ^{t−1}(Ẑ^{t−1}_{i,j}) − X_{i,j}) 1_{(i,j)∈Ω}
5:   Ẑ^t ← P_r(Ẑ^t)
6:   Solve the optimization problem in (17) to get X̂
7:   Set ĝ^t(Ẑ^t_{i,j}) = X̂_{i,j} for all (i, j) ∈ Ω
8: end for
9: Obtain ĝ^T on the entire real line using the linear interpolation shown in equation (6).

Let us now explain our initialization procedure. Define X_Ω := Σ_{j=1}^{|Ω|} X ∘ Δ_j, where each Δ_j is a boolean mask with zeros everywhere and a 1 at the index of an observed entry. A ∘ B is the Hadamard, i.e. entry-wise, product of the matrices A and B. We have |Ω| such boolean masks, one for each observed entry. We initialize Ẑ^0 to (mn/|Ω|) X_Ω = (mn/|Ω|) Σ_{j=1}^{|Ω|} X ∘ Δ_j. 
Because each observed index is assumed to be sampled uniformly at random with replacement, our initialization is guaranteed to be an unbiased estimate of X.

4 MSE Analysis of MMC

We shall analyze our algorithm, MMC, for the case of T = 1, under the modeling assumptions shown in Equations (3) and (4). Additionally, we will assume that the matrices Z* and M* are bounded entry-wise in absolute value by 1. When T = 1, the MMC algorithm estimates Ẑ, ĝ and M̂ as follows:

Ẑ = P_r( (mn/|Ω|) X_Ω ).  (18)

ĝ is obtained by solving the LPAV problem from Equation (17) with the Ẑ shown in Equation (18). This allows us to define M̂_{i,j} = ĝ(Ẑ_{i,j}), ∀ i ∈ [n], j ∈ [m].
Define the mean squared error (MSE) of our estimate M̂ as

MSE(M̂) = E[ (1/mn) Σ_{i=1}^n Σ_{j=1}^m (M̂_{i,j} − M_{i,j})² ].  (19)

Denote by ‖M‖ the spectral norm of a matrix M. We need the following additional technical assumptions:

A1. ‖Z*‖ = O(√n).
A2. σ_{r+1}(X) = Õ(√n) with probability at least 1 − δ, where Õ hides terms logarithmic in 1/δ.

Z* has entries bounded in absolute value by 1. This means that in the worst case ‖Z*‖ = √(mn). Assumption A1 requires that the spectral norm of Z* is not very large. Assumption A2 is a weak assumption on the decay of the spectrum of M*. By assumption, X = M* + N. Applying Weyl's inequality we get σ_{r+1}(X) ≤ σ_{r+1}(M*) + σ_1(N). Since N is a zero-mean noise matrix with independent bounded entries, N is a matrix with sub-Gaussian entries. This means that σ_1(N) = Õ(√n) with high probability. 
Hence, assumption A2 can be interpreted as imposing the condition σ_{r+1}(M*) = O(√n). This means that while M* could be full rank, the (r + 1)-th singular value of M* cannot be too large.
Theorem 1. Let µ₁ := E‖N‖ and µ₂ := E‖N‖². Let α = ‖M* − Z*‖. Then, under assumptions A1 and A2, the MSE of the estimator output by MMC with T = 1 is given by

MSE(M̂) = O( √(r/m) + √(mn log(n))/|Ω| + mn/|Ω|^{3/2} + √(rmn log²(n)/|Ω|²) + (√r/(m√n)) (α + µ₁ + µ₂/√n)(1 + α/√n) ),  (20)

where the O(·) notation hides universal constants and the Lipschitz constant L of g*. We would like to mention that the result derived for MMC-1 can be made to hold for T > 1 by an additional large deviation argument.
Interpretation of our results: Our upper bound on the MSE of MMC depends on the quantity α = ‖M* − Z*‖, and on µ₁, µ₂. Since the matrix N has independent zero-mean entries which are bounded in absolute value by 1, N is a sub-Gaussian matrix with independent entries. For such matrices µ₁ = O(√n) and µ₂ = O(n) (see Theorem 5.39 in [26]). With these settings we can simplify the expression in Equation (20) to

MSE(M̂) = Õ( √(r/m) + √(mn log(n))/|Ω| + mn/|Ω|^{3/2} + (√r α/(m√n))(1 + α/√n) + √(rmn log²(n)/|Ω|²) ).

A remarkable fact about our sample complexity results is that the sample complexity is independent of the rank of the matrix M*, which could be large. 
Instead it depends on the rank of the matrix Z*, which we assume to be small. The dependence on M* is via the term α = ‖M* − Z*‖. From equation (20) it is evident that the best error guarantees are obtained when α = O(√n). For such values of α, equation (20) reduces to

MSE(M̂) = Õ( √(r/m) + √(mn log(n))/|Ω| + mn/|Ω|^{3/2} + √(rmn log²(n)/|Ω|²) ).

This result can be converted into a sample complexity bound as follows. If we are given |Ω| = Õ((mn/ε)^{2/3}), then MSE(M̂) ≤ √(r/m) + ε. It is important to note that the floor of the MSE is √(r/m), which depends on the rank of Z* and not on rank(M*), which can be much larger than r.

5 Experimental results

We compare the performance of MMC-1, MMC-c, MMC-LS, and nuclear norm based low-rank matrix completion (LRMC) [4] on various synthetic and real world datasets. The objective metric that we use to compare the different algorithms is the root mean squared error (RMSE) on unobserved, test indices of the incomplete matrix.

5.1 Synthetic experiments

For our synthetic experiments we generated a random 30 × 20 matrix Z* of rank 5 by taking the product of two random Gaussian matrices of sizes n × r and r × m, with n = 30, m = 20, r = 5. The matrix M* was generated as M*_{i,j} = g*(Z*_{i,j}) = 1/(1 + exp(−cZ*_{i,j})), where c > 0. By increasing c, we increase the Lipschitz constant of the function g*, making the matrix completion task harder. For large enough c, M_{i,j} ≈ sgn(Z_{i,j}). We consider the noiseless version of the problem where X = M*. Each entry in the matrix X was sampled with probability p, and the sampled entries are observed. 
This makes E|Ω| = mnp. For our implementations we assume that r is unknown, and estimate it either (i) via the use of a dedicated validation set in the case of MMC-1, or (ii) adaptively, where we progressively increase the estimate of the rank until a sufficient decrease in error over the training set is achieved [13]. For an implementation of the LRMC algorithm we used a standard off-the-shelf implementation from TFOCS [27]. In order to speed up the run time of MMC, we also keep track of the training set error, and terminate the iterations if the relative residual on the training set goes below a certain threshold⁵. In the supplement we provide a plot demonstrating that, for MMC-c, the RMSE on the training dataset has a decreasing trend and reaches the required threshold in at most 50 iterations. Hence, we set T = 50.

Figure 2: RMSE of different methods at different values of c.

Figure 2 shows the RMSE of each method for different values of p and c. As one can see from Figure 2, the RMSE of all the methods improves for any given c as p increases. This is expected, since as p increases E|Ω| = pmn also increases. As c increases, g* becomes steeper, increasing the effective rank of X. This makes the matrix completion task harder. For small p, such as p = 0.2, MMC-1 is competitive with MMC-c and MMC-LS and is often the best. In fact, for small p, irrespective of the value of c, LRMC is far inferior to the other methods. For larger p, MMC-c works the best, achieving smaller RMSE than the other methods.

5.2 Experiments on real datasets

We performed experimental comparisons on four real-world datasets: paper recommendation, Jester-3, ML-100k, and Cameraman. All of these datasets, except the Cameraman dataset, are ratings datasets, where users have rated a few of several different items. 
For the Jester-3 dataset we used 5 randomly chosen ratings per user for training, 5 randomly chosen ratings for validation, and the rest for testing. ML-100k comes with its own training and testing datasets. We used 20% of the training data for validation. For the Cameraman and the paper recommendation datasets, 20% of the data was used for training, 20% for validation and the rest for testing. The baseline algorithm chosen for low-rank matrix completion is LMaFit-A [13]⁶.
For each of the datasets we report the RMSE of MMC-1, MMC-c, and LMaFit-A on the test sets. We excluded MMC-LS from these experiments because in all of our datasets the number of observed entries is a very small fraction of the total number of entries, and from our results on synthetic datasets we know that MMC-LS is not the best performing algorithm in such cases.
Table 1 shows the RMSE over the test set of the different matrix completion methods. As we can see, the RMSE of MMC-c is the smallest of all the methods, surpassing LMaFit-A by a large margin.

Table 1: RMSE of different methods on real datasets.

Dataset    | Dimensions  | |Ω|    | r_{0.01}(X) | LMaFit-A | MMC-1  | MMC-c
PaperReco  | 3426 × 50   | 34294  | 47          | 0.4026   | 0.4247 | 0.2965
Jester-3   | 24938 × 100 | 124690 | 66          | 6.8728   | 5.327  | 5.2348
ML-100k    | 1682 × 943  | 64000  | 391         | 3.3101   | 1.388  | 1.1533
Cameraman  | 1536 × 512  | 157016 | 393         | 0.0754   | 0.1656 | 0.06885

6 Conclusions and future work

We have investigated a new framework for high-rank matrix completion problems, called monotonic matrix completion, and proposed new algorithms. In the future we would like to investigate whether one could relax the assumptions and improve the theoretical results.

⁵For our experiments this threshold is set to 0.001.
⁶http://lmafit.blogs.rice.edu/. 
The parameter k in the LMaFit algorithm was set to the effective rank, and we used est_rank = 1 for LMaFit-A.

[Figure 2 panels: bar charts of RMSE on test data for LRMC, MMC-LS, MMC-1, and MMC-c at p = 0.2, 0.35, 0.5, 0.7, one panel each for c = 1.0, c = 10, and c = 40.]

References

[1] Prem Melville and Vikas Sindhwani. Recommender systems. In Encyclopedia of Machine Learning. Springer, 2010.

[2] Mihai Cucuringu. Graph Realization and Low-Rank Matrix Completion. PhD thesis, Princeton University, 2012.

[3] Benjamin Recht. A simpler approach to matrix completion. JMLR, 12:3413–3430, 2011.

[4] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. FOCM, 9(6):717–772, 2009.

[5] Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

[6] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697, 2012.

[7] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

[8] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.

[9] Jon Dattorro. Convex Optimization & Euclidean Distance Geometry. Lulu.com, 2010.

[10] Bart Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.

[11] Mingkui Tan, Ivor W. Tsang, Li Wang, Bart Vandereycken, and Sinno J. Pan. Riemannian pursuit for big matrix recovery. In ICML, pages 1539–1547, 2014.

[12] Zheng Wang, Ming-Jun Lai, Zhaosong Lu, Wei Fan, Hasan Davulcu, and Jieping Ye. Rank-one matrix pursuit for matrix completion. In ICML, pages 91–99, 2014.

[13] Zaiwen Wen, Wotao Yin, and Yin Zhang. Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 2012.

[14] Brian Eriksson, Laura Balzano, and Robert Nowak. High-rank matrix completion. In AISTATS, 2012.

[15] Congyuan Yang, Daniel Robinson, and Rene Vidal. Sparse subspace clustering with missing entries. In ICML, 2015.

[16] Mahdi Soltanolkotabi, Emmanuel J. Candès, et al. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195–2238, 2012.

[17] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI, 2013.

[18] Aarti Singh, Akshay Krishnamurthy, Sivaraman Balakrishnan, and Min Xu. Completion of high-rank ultrametric matrices using selective entries. In SPCOM, pages 1–5. IEEE, 2012.

[19] Oluwasanmi Koyejo, Sreangsu Acharyya, and Joydeep Ghosh. Retargeted matrix factorization for collaborative filtering. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 49–56. ACM, 2013.

[20] Mark A. Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference, 3(3):189–223, 2014.

[21] Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, 2011.

[22] Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.

[23] Hidehiko Ichimura. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1):71–120, 1993.

[24] Joel L. Horowitz and Wolfgang Härdle. Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association, 91(436):1632–1640, 1996.

[25] Alekh Agarwal, Sham Kakade, Nikos Karampatziakis, Le Song, and Gregory Valiant. Least squares revisited: Scalable approaches for multi-class prediction. In ICML, pages 541–549, 2014.

[26] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[27] Stephen Becker, E. Candès, and M. Grant. TFOCS: Flexible first-order methods for rank minimization. In Low-rank Matrix Optimization Symposium, SIAM Conference on Optimization, 2011.