{"title": "Matrix Completion with Noisy Side Information", "book": "Advances in Neural Information Processing Systems", "page_first": 3447, "page_last": 3455, "abstract": "We study matrix completion problem with side information. Side information has been considered in several matrix completion applications, and is generally shown to be useful empirically. Recently, Xu et al. studied the effect of side information for matrix completion under a theoretical viewpoint, showing that sample complexity can be significantly reduced given completely clean features. However, since in reality most given features are noisy or even weakly informative, how to develop a general model to handle general feature set, and how much the noisy features can help matrix recovery in theory, is still an important issue to investigate. In this paper, we propose a novel model that balances between features and observations simultaneously, enabling us to leverage feature information yet to be robust to feature noise. Moreover, we study the effectof general features in theory, and show that by using our model, the sample complexity can still be lower than matrix completion as long as features are sufficiently informative. This result provides a theoretical insight of usefulness for general side information. Finally, we consider synthetic data and two real applications - relationship prediction and semi-supervised clustering, showing that our model outperforms other methods for matrix completion with features both in theory and practice.", "full_text": "Matrix Completion with Noisy Side Information\n\n\u2217 University of Texas at Austin\n\n\u2020 University of California at Davis\n\nKai-Yang Chiang\u2217 Cho-Jui Hsieh \u2020\n\nInderjit S. Dhillon \u2217\n\n\u2217 {kychiang,inderjit}@cs.utexas.edu\n\n\u2020 chohsieh@ucdavis.edu\n\nAbstract\n\nWe study the matrix completion problem with side information. 
Side information has been considered in several matrix completion applications, and has been empirically shown to be useful in many cases. Recently, researchers studied the effect of side information for matrix completion from a theoretical viewpoint, showing that sample complexity can be significantly reduced given completely clean features. However, since in reality most given features are noisy or only weakly informative, the development of a model to handle a general feature set, and investigation of how much noisy features can help matrix recovery, remains an important issue. In this paper, we propose a novel model that balances between features and observations simultaneously in order to leverage feature information yet be robust to feature noise. Moreover, we study the effect of general features in theory and show that by using our model, the sample complexity can be lower than matrix completion as long as features are sufficiently informative. This result provides a theoretical insight into the usefulness of general side information. Finally, we consider synthetic data and two applications, relationship prediction and semi-supervised clustering, and show that our model outperforms other methods for matrix completion that use features both in theory and practice.

1 Introduction

Low rank matrix completion is an important topic in machine learning and has been successfully applied to many practical applications [22, 12, 11]. One promising direction in this area is to exploit side information, or features, to help matrix completion tasks. For example, in the famous Netflix problem, besides rating history, profiles of users and/or genres of movies might also be given, and one could possibly leverage such side information for better prediction.
Since such additional features are usually available in real applications, how to better incorporate features into matrix completion becomes an important problem with both theoretical and practical aspects.

Several approaches have been proposed for matrix completion with side information, and most of them empirically show that features are useful for certain applications [1, 28, 9, 29, 33]. However, there is surprisingly little analysis of the effect of features for general matrix completion. More recently, Jain and Dhillon [18] and Xu et al. [35] provided non-trivial guarantees for matrix completion with side information. They showed that if "perfect" features are given, under certain conditions, one can substantially reduce the sample complexity by solving a feature-embedded objective. This result suggests that completely informative features are extremely powerful for matrix completion, and the algorithm has been successfully applied in many applications [29, 37]. However, this model is still quite restrictive: if features are not perfect, it fails to guarantee recoverability and can even suffer poor performance in practice. A more general model, with recovery analysis, that handles noisy features is thus desired.

In this paper, we study the matrix completion problem with general side information. We propose a dirty statistical model which balances feature and observation information simultaneously to complete a matrix. As a result, our model can leverage feature information, yet is robust to noisy features. Furthermore, we provide a theoretical foundation showing the effectiveness of our model. We formally quantify the quality of features and show that the sample complexity of our model
Two noticeable results could thus be inferred: \ufb01rst, unlike [18, 35],\ngiven any feature set, our model is guaranteed to achieve recovery with at most O(n3/2) samples in\ndistribution-free manner, where n is the dimensionality of the matrix. Second, if features are rea-\nsonably good, we can improve the sample complexity to o(n3/2). We emphasize that since \u2126(n3/2)\nis the lower bound of sample complexity for distribution-free, trace-norm regularized matrix com-\npletion [32], our result suggests that even noisy features could asymptotically reduce the number\nof observations needed in matrix completion. In addition, we empirically show that our model out-\nperforms other completion methods on synthetic data as well as in two applications: relationship\nprediction and semi-supervised clustering. Our contribution can be summarized as follows:\n\nwhere the matrix is learned by balancing features and pure observations simultaneously.\n\n\u2022 We propose a dirty statistical model for matrix completion with general side information\n\u2022 We quantify the effectiveness of features in matrix completion problem.\n\u2022 We show that our model is guaranteed to recover the matrix with any feature set, and\nmoreover, the sample complexity can be lower than standard matrix completion given in-\nformative features.\n\nThe paper is organized as follows. Section 2 states some related research. In Section 3, we introduce\nour proposed model for matrix completion with general side information. We theoretically analyze\nthe effectiveness of features in our model in Section 4, and show experimental results in Section 5.\n\n2 Related Work\n\nMatrix completion has been widely applied to many machine learning tasks, such as recommender\nsystems [22], social network analysis [12] and clustering [11]. Several theoretical foundations have\nalso been established. One remarkable milestone is the strong guarantee provided by Cand`es et\nal. 
[7, 5], who proved that O(n polylog n) observations are sufficient for exact recovery provided entries are uniformly sampled at random. Several works also study recovery under non-uniform distributional assumptions [30, 10], the distribution-free setting [32], and noisy observations [21, 4].

Several works also consider side information in matrix completion [1, 28, 9, 29, 33]. Although most of them found, based on experimental evidence, that features are helpful for certain applications [28, 33] and the cold-start setting [29], their proposed methods focus on non-convex matrix factorization formulations without any theoretical guarantees. Compared to them, our model focuses on a convex trace-norm regularized objective and on theoretical insight into the effect of features. On the other hand, Jain and Dhillon [18] (also see [38]) studied an inductive matrix completion objective to incorporate side information, and followup work [35] considers a similar formulation with a trace-norm regularized objective. Both of them show that recovery guarantees can be attained with lower sample complexity when features are perfect. However, if features are imperfect, such models cannot recover the underlying matrix and may suffer poor performance in practice. We give a detailed discussion of the inductive matrix completion model in Section 3.

Our proposed model is also related to the family of dirty statistical models [36], where the model parameter is expressed as the sum of a number of parameter components, each of which has its own structure. Dirty statistical models have been proposed mostly for robust matrix completion, graphical model estimation, and multi-task learning, to decompose the sparse component (noise) and low-rank component (model parameters) [6, 8, 19].
Our proposed algorithm is different: we aim to decompose the model into two parts, the part that can be described by side information and the part that has to be recovered purely from observations.

3 A Dirty Statistical Model for Matrix Completion with Features

Let R ∈ R^{n1×n2} be the underlying rank-k matrix that we aim to recover, where k ≪ min(n1, n2) so that R is low-rank. Let Ω be the set of observed entries sampled from R, with cardinality |Ω| = m. Furthermore, let X ∈ R^{n1×d1} and Y ∈ R^{n2×d2} be the feature set, where each row x_i (or y_i) denotes the feature of the i-th row (or column) entity of R. Both d1, d2 ≤ min(n1, n2), but each can be either smaller or larger than k. Thus, given a set of observations Ω and the feature set X and Y as side information, the goal is to recover the underlying low rank matrix R.

To begin with, consider an ideal case where the given features are "perfect" in the following sense:

col(R) ⊆ col(X) and row(R) ⊆ col(Y).    (1)

Such a feature set can be thought of as perfect since it fully describes the true latent feature space of R. Then, instead of recovering the low rank matrix R directly, one can recover a smaller matrix M ∈ R^{d1×d2} such that R = XMY^T. The resulting formulation, called inductive matrix completion (or IMC in brief) [18], is shown to be both theoretically preferred [18, 35] and useful in real applications [37, 29]. Details of this model can be found in [18, 35].

However, in practice, most given features X and Y will not be perfect. In fact, they could be quite noisy or only weakly correlated with the latent feature space of R. Though in some cases applying IMC with imperfect X, Y might still yield decent performance, in many other cases the performance drops drastically when features become noisy. This weakness of IMC can also be seen empirically in Section 5.
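As a quick illustration of why perfect features in the sense of (1) make the problem easier, the following numpy sketch (sizes and names are illustrative, not from the paper) builds a feature set satisfying (1) and, with full observations, reproduces R through the small d × d matrix M alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 50, 3, 6
# Rank-k ground truth R = U V^T.
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))
R = U @ V.T
# "Perfect" features: col(R) ⊆ col(X) and row(R) ⊆ col(Y), as in (1).
X = np.hstack([U, rng.standard_normal((n, d - k))])
Y = np.hstack([V, rng.standard_normal((n, d - k))])
# With perfect features (and here, full observations), M = X^+ R (Y^+)^T
# satisfies X M Y^T = R, so only a small d x d matrix must be learned.
M = np.linalg.pinv(X) @ R @ np.linalg.pinv(Y).T
print(np.linalg.norm(X @ M @ Y.T - R) / np.linalg.norm(R))  # ~0 up to round-off
```

This is what makes IMC's reduced sample complexity plausible: the number of unknowns shrinks from n1·n2 to d1·d2.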
Therefore, a more robust model is desired to better handle noisy features.

We now introduce a dirty statistical model for matrix completion with (possibly noisy) features. The core concept of our model is to learn the underlying matrix by balancing feature information and observations. Specifically, we propose to learn R jointly from two parts: one is the low rank estimate from the feature space, XMY^T, and the other, N, is the part outside the feature space. Thus, N can be used to capture the information that noisy features fail to describe, which is then estimated from pure observations. Naturally, both XMY^T and N are preferred to be low rank since they are aggregated to estimate a low rank matrix R. This further leads to a preference for M to be low rank as well, since one can expect that only a small subspace of X and a subspace of Y are jointly effective in forming the low rank space XMY^T. Putting all of the above together, we consider solving the following problem:

min_{M,N} Σ_{(i,j)∈Ω} ℓ((XMY^T + N)_{ij}, R_{ij}) + λ_M ‖M‖_* + λ_N ‖N‖_*,    (2)

where M and N are regularized with the trace norm because of the low rank prior. The underlying matrix R can thus be estimated by X M* Y^T + N*. We refer to our model as DirtyIMC for convenience. To solve the convex problem (2), we propose an alternating minimization scheme that solves for N and M iteratively. Our algorithm is stated in detail in Appendix A. One remark about this algorithm is that it is guaranteed to converge to a global optimum, since the problem is jointly convex in M and N.

The parameters λ_M and λ_N are crucial for controlling the balance between features and the residual. When λ_M = ∞, M is forced to 0, so features are disregarded and (2) becomes a standard matrix completion objective. Another special case is λ_N = ∞, in which N is forced to 0 and the objective becomes IMC.
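To make objective (2) concrete, here is a minimal proximal-gradient sketch of the alternating scheme under squared loss. It is an illustrative stand-in for the algorithm in Appendix A, not a transcription of it; the step size assumes ‖X‖_2 = ‖Y‖_2 = 1, and otherwise should be scaled down by 1/(‖X‖_2² ‖Y‖_2²):

```python
import numpy as np

def svt(A, tau):
    # Singular value thresholding: the proximal operator of tau * ||.||_*.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def dirty_imc(R_obs, mask, X, Y, lam_M=1e-3, lam_N=1e-3, iters=300, step=0.5):
    """Alternating proximal-gradient sketch for (2) with squared loss:
    min_{M,N} 0.5 * || mask * (X M Y^T + N - R_obs) ||_F^2
              + lam_M * ||M||_* + lam_N * ||N||_*."""
    (n1, d1), (n2, d2) = X.shape, Y.shape
    M, N = np.zeros((d1, d2)), np.zeros((n1, n2))
    for _ in range(iters):
        resid = mask * (X @ M @ Y.T + N - R_obs)
        M = svt(M - step * (X.T @ resid @ Y), step * lam_M)   # M gradient + prox
        resid = mask * (X @ M @ Y.T + N - R_obs)
        N = svt(N - step * resid, step * lam_N)               # N gradient + prox
    return M, N
```

The estimate of R is then `X @ M @ Y.T + N`; pushing `lam_M` to ∞ pins M at 0 (standard MC behavior) and pushing `lam_N` to ∞ pins N at 0 (IMC behavior), matching the two special cases above.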
Intuitively, with an appropriate ratio λ_M/λ_N, the proposed model can incorporate the useful part of the features, yet be robust to the noisy part by compensating with pure observations. Some natural questions arise here: How do we quantify the quality of features? What are the right λ_M and λ_N for a given feature set? And beyond intuition, how much can we benefit from features using our model in theory? We formally answer these questions in Section 4.

4 Theoretical Analysis

We now analyze the usefulness of features in our model from a theoretical perspective. We first quantify the quality of features and show that with reasonably good features, our model achieves recovery with lower sample complexity. Finally, we compare our results to matrix completion and IMC. Due to space limitations, detailed proofs of theorems and lemmas are left to Appendix B.

4.1 Preliminaries

Recall that our goal is to recover a rank-k matrix R given the observed entry set Ω and the feature set X and Y described in Section 3. Recovering the matrix with our model (Equation (2)) is equivalent to solving the hard-constraint problem:

min_{M,N} Σ_{(i,j)∈Ω} ℓ((XMY^T + N)_{ij}, R_{ij}),  subject to ‖M‖_* ≤ ℳ, ‖N‖_* ≤ 𝒩.    (3)

For simplicity, we consider d = max(d1, d2) = O(1), so that the feature dimensions do not grow as a function of n. We assume each entry (i, j) ∈ Ω is sampled i.i.d. from an unknown distribution, with index set {(i_α, j_α)}_{α=1}^m. Also, each entry of R is assumed to be upper bounded, i.e. max_{ij} |R_{ij}| ≤ ℛ (so that the trace norm of R is O(√(n1 n2))). This is consistent with real scenarios like the Netflix problem, where users rate movies on a scale from 1 to 5.
For convenience, let θ = (M, N) be any feasible solution, let Θ = {(M, N) : ‖M‖_* ≤ ℳ, ‖N‖_* ≤ 𝒩} be the feasible solution set, let f_θ(i, j) = x_i^T M y_j + N_{ij} be the estimation function for R_{ij} parameterized by θ, and let F_Θ = {f_θ : θ ∈ Θ} be the set of feasible functions. We are interested in the following two "ℓ-risk" quantities:

• Expected ℓ-risk: R_ℓ(f) = E_{(i,j)}[ℓ(f(i, j), R_{ij})].
• Empirical ℓ-risk: R̂_ℓ(f) = (1/m) Σ_{(i,j)∈Ω} ℓ(f(i, j), R_{ij}).

Thus, our model solves for the θ* that parameterizes f* = argmin_{f∈F_Θ} R̂_ℓ(f), and it is sufficient to show that recovery can be attained if R_ℓ(f*) approaches zero for large enough n and m.

4.2 Measuring the Quality of Features

We now link the quality of features to Rademacher complexity, a learning-theoretic tool for measuring the complexity of a function class. We will show that quality features result in a lower model complexity and thus a smaller error bound. From this viewpoint, the upper bound on Rademacher complexity can be used to measure the quality of features.

To begin with, we apply the following lemma to bound the expected ℓ-risk.

Lemma 1 (Bound on Expected ℓ-risk [2]). Let ℓ be a loss function with Lipschitz constant L_ℓ, bounded by B with respect to its first argument, and let δ be a constant with 0 < δ < 1. Let R(F_Θ) be the Rademacher complexity of the function class F_Θ (w.r.t. Ω and associated with ℓ), defined as:

R(F_Θ) = E_σ[ sup_{f∈F_Θ} (1/m) Σ_{α=1}^m σ_α ℓ(f(i_α, j_α), R_{i_α j_α}) ],

where each σ_α takes values in {±1} with equal probability. Then with probability at least 1 − δ, for all f ∈ F_Θ we have:

R_ℓ(f) ≤ R̂_ℓ(f) + 2 E_Ω[R(F_Θ)] + B √(log(1/δ) / (2m)).    (4)

Clearly, to guarantee a small enough R_ℓ, both R̂_ℓ and the model complexity E_Ω[R(F_Θ)] have to be bounded. The next key lemma shows that the model complexity term E_Ω[R(F_Θ)] is related to feature quality in the matrix completion context.

Before diving into the details, we first provide an intuition about the meaning of "good" features. Consider any imperfect feature set which violates (1). One can imagine that such a feature set is perturbed by some misleading noise which is not correlated with the true latent features. However, features should still be effective if such noise does not weaken the true latent feature information too much. Thus, if a large portion of the true latent features lies in the informative part of the feature spaces X and Y, the features should still be somewhat informative and helpful for recovering the matrix R.

More formally, the model complexity can be bounded in terms of ℳ and 𝒩 by the following lemma:

Lemma 2. Let X = max_i ‖x_i‖_2, Y = max_i ‖y_i‖_2 and n = max(n1, n2). Then the model complexity of the function class F_Θ is upper bounded by:

E_Ω[R(F_Θ)] ≤ 2 L_ℓ ℳ X Y √(log 2d / m) + min( 2 L_ℓ 𝒩 √(log 2n / m), √( 9 C L_ℓ B 𝒩 (√n1 + √n2) / m ) ).

Then, by Lemmas 1 and 2, one can carefully construct a feasible solution set (by setting ℳ and 𝒩) such that both R̂_ℓ(f*) and E_Ω[R(F_Θ)] are controlled to be reasonably small. We now suggest a witness pair of ℳ and 𝒩 constructed as follows. Let γ be defined as:

γ = min( min_i ‖x_i‖ / X, min_i ‖y_i‖ / Y ).

Let T_μ(·) : R_+ → R_+ be the thresholding operator where T_μ(x) = x if x ≥ μ and T_μ(x) = 0 otherwise. In addition, let X = Σ_{i=1}^{d1} σ_i u_i v_i^T be the reduced SVD of X, and define

X_μ = Σ_{i=1}^{d1} σ_1 T_μ(σ_i/σ_1) u_i v_i^T

to be the "μ-informative" part of X. The ν-informative part of Y, denoted Y_ν, is defined similarly. Now consider setting ℳ = ‖M̂‖_* and 𝒩 = ‖R − X_μ M̂ Y_ν^T‖_*, where

M̂ = argmin_M ‖X_μ M Y_ν^T − R‖_F^2 = (X_μ^T X_μ)^{−1} X_μ^T R Y_ν (Y_ν^T Y_ν)^{−1}

is the optimal solution for approximating R in the informative feature space X_μ and Y_ν. Then the following lemma shows that the trace norm of M̂ does not grow as n increases.

Lemma 3. Fix μ, ν ∈ (0, 1], and let d̂ = min(rank(X_μ), rank(Y_ν)). Then, for some universal constant C′:

‖M̂‖_* ≤ d̂ / (C′ μ² ν² γ² X Y).

Moreover, by combining Lemmas 1-3, we can upper bound the R_ℓ(f*) of DirtyIMC as follows:

Theorem 1. Consider problem (3) with ℳ = ‖M̂‖_* and 𝒩 = ‖R − X_μ M̂ Y_ν^T‖_*. Then with probability at least 1 − δ, the expected ℓ-risk of an optimal solution (M*, N*) is bounded by:

R_ℓ(f*) ≤ min( 4 L_ℓ 𝒩 √(log 2n / m), √( 36 C L_ℓ B 𝒩 (√n1 + √n2) / m ) ) + (4 L_ℓ d̂ / (C′ μ² ν² γ²)) √(log 2d / m) + B √(log(1/δ) / (2m)).

4.3 Sample Complexity Analysis

From Theorem 1, we can derive the following sample complexity guarantee for our model. For simplicity, we assume k = O(1), so that it does not grow as n increases in the following discussion.

Corollary 1. Suppose we aim to "ε-recover" R in the sense that E_{(i,j)}[ℓ((XMY^T + N)_{ij}, R_{ij})] < ε for an arbitrarily small ε. Then for the DirtyIMC model, O(min(𝒩√n, 𝒩² log n)/ε²) observations are sufficient for ε-recovery provided n is sufficiently large.

Corollary 1 suggests that the sample complexity of our model depends only on the trace norm of the residual, 𝒩. This matches the intuition about good features stated in Section 4.2: X M̂ Y^T will cover most of R if features are good, and as a result, 𝒩 will be small and one can enjoy low sample complexity by exploiting quality features.

We also compare our sample complexity result with other models. First, suppose features are perfect (so that 𝒩 = O(1)); our result then suggests that only O(log n) samples are required for recovery. This matches the result of [35], in which the authors show that given perfect features, O(log n) observations are enough for exact recovery by solving the IMC objective. However, IMC does not guarantee recovery when features are not perfect, while our result shows that recovery is still attainable by DirtyIMC with O(min(𝒩√n, 𝒩² log n)/ε²) samples. We also empirically justify this result in Section 5.

On the other hand, for standard matrix completion (i.e.
no features are considered), the most well-known guarantee is that under certain conditions, one can achieve O(n polylog n) sample complexity for both ε-recovery [34] and exact recovery [5]. However, these bounds only hold under distributional assumptions on the observed entries. For sample complexity without any distributional assumptions, Shamir et al. [32] recently showed that O(n^{3/2}) entries are sufficient for ε-recovery, and that this bound is tight if no further distribution of observed entries is assumed. Compared to those results, our analysis also requires no assumptions on the distribution of observed entries, and our sample complexity is also O(n^{3/2}) in the worst case, by the fact that 𝒩 ≤ ‖R‖_* = O(n). Notice that it is reasonable to meet the lower bound Ω(n^{3/2}) even given features, since in an extreme case X, Y could be random matrices with no correlation to R, in which case the given information is the same as in standard matrix completion.

However, in many applications, features will be far from random, and our result provides theoretical insight showing that features can be useful even if they are imperfect. Indeed, as long as features are informative enough that 𝒩 = o(n), our sample complexity will be asymptotically lower than O(n^{3/2}). Here we provide two concrete instances of such a scenario. In the first, we consider the rank-k matrix R to be generated from the random orthogonal model [5] as follows:

Theorem 2. Let R ∈ R^{n×n} be generated from the random orthogonal model, where U = {u_i}_{i=1}^k, V = {v_i}_{i=1}^k are random orthogonal bases, and σ_1, …, σ_k are singular values of arbitrary magnitude. Let σ_t be the largest singular value such that lim_{n→∞} σ_t/√n = 0.
Then, given noisy features X, Y where X_{:i} = u_i (and Y_{:i} = v_i) if i < t, and X_{:i} (and Y_{:i}) is any basis vector orthogonal to U (and V) if i ≥ t, o(n) samples are sufficient for DirtyIMC to achieve ε-recovery.

Theorem 2 suggests that, under the random orthogonal model, if features are not too noisy in the sense that noise only corrupts the true subspace associated with smaller singular values, we can approximately recover R with only o(n) observations. An empirical justification of this result is presented in Appendix C. Another scenario is to consider R to be the product of two rank-k Gaussian matrices:

Theorem 3. Let R = UV^T be a rank-k matrix, where U, V ∈ R^{n×k} are true latent row/column features with each U_{ij}, V_{ij} ~ N(0, σ²) i.i.d. Suppose now we are given a feature set X, Y where g(n) row items and h(n) column items have corrupted features. Moreover, each corrupted row/column item has perturbed feature x_i = u_i + Δu_i and y_i = v_i + Δv_i, where ‖Δu‖_∞ ≤ ξ_1 and ‖Δv‖_∞ ≤ ξ_2 for some constants ξ_1 and ξ_2. Then for the DirtyIMC model (3), with high probability, O(max(√g(n), √h(n)) · n log n) observations are sufficient for ε-recovery.

[Figure 1: six panels plotting relative error of SVDfeature, MC, IMC and DirtyIMC versus feature noise level ρ_f at fixed sparsity ρ_s ∈ {0.1, 0.25, 0.4} (top row, panels a-c), and versus sparsity ρ_s at fixed ρ_f ∈ {0.1, 0.5, 0.9} (bottom row, panels d-f).]
Figure 1: Performance of various methods for matrix completion under different sparsity and feature quality. Compared to other feature-based completion methods, the top figures show that DirtyIMC is less sensitive to noisy features for each ρ_s, and the bottom figures show that the error of DirtyIMC always decreases to 0 with more observations, given any feature quality.

Theorem 3 suggests that if the features have good quality in the sense that there are not too many items with corrupted features, for example g(n), h(n) = O(log n), then the sample complexity of DirtyIMC can be O(n log n √(log n)) = o(n^{3/2}) as well. Thus, both Theorems 2 and 3 provide concrete examples showing that given imperfect yet informative features, the sample complexity of our model can be asymptotically lower than the lower bound for pure matrix completion (which is Ω(n^{3/2})).

5 Experimental Results

In this section, we show the effectiveness of the DirtyIMC model (2) for matrix completion with features on both synthetic datasets and real-world applications.
For synthetic datasets, we show that the DirtyIMC model better recovers low rank matrices under various feature qualities. For real applications, we consider relationship prediction and semi-supervised clustering, where the current state-of-the-art methods are based on matrix completion and IMC respectively. We show that by applying the DirtyIMC model to these two problems, we can further improve performance by making better use of features.

5.1 Synthetic Experiments

We consider matrix recovery with features on synthetic data generated as follows. We create a low rank matrix R = UV^T, with true latent row/column spaces U, V ∈ R^{200×20}, U_{ij}, V_{ij} ~ N(0, 1/20). We then randomly sample a fraction ρ_s of entries Ω from R as observations, and construct a perfect feature set X*, Y* ∈ R^{200×40} which satisfies (1). To examine performance under different feature qualities, we generate features X, Y with a noise parameter ρ_f, where X and Y are derived by replacing a fraction ρ_f of the bases of X* (and Y*) with bases orthogonal to X* (and Y*). We then consider recovering the underlying matrix R given X, Y and a subset Ω of R.

We compare our DirtyIMC model (2) with standard trace-norm regularized matrix completion (MC) and two other feature-based completion methods: IMC [18] and SVDfeature [9]. The standard relative error ‖R̂ − R‖_F / ‖R‖_F is used to evaluate a recovered matrix R̂. For each method, we select parameters from the set {10^α}_{α=−3}^{2} and report the one with the best recovery. All results are averaged over 5 random trials.

Figure 1 shows the recovery of each method under each sparsity level ρ_s = 0.1, 0.25, 0.4, and each feature noise level ρ_f = 0.1, 0.5 and 0.9.
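The synthetic data generation described above can be sketched as follows. This is one reading of the construction; the text does not specify which basis vectors get replaced, so replacing a random subset here is an assumption:

```python
import numpy as np

def make_synthetic(n=200, k=20, d=40, rho_s=0.25, rho_f=0.5, seed=0):
    """Generate R = U V^T, an observation mask covering ~rho_s of the
    entries, and noisy features built by replacing a rho_f fraction of the
    perfect bases X* (resp. Y*) with bases orthogonal to them."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, np.sqrt(1.0 / k), (n, k))
    V = rng.normal(0.0, np.sqrt(1.0 / k), (n, k))
    R = U @ V.T
    mask = rng.random((n, n)) < rho_s          # observed entry set Omega

    def noisy_features(latent):
        # Orthonormal Q whose first k columns span col(latent), so that
        # Q[:, :d] is a perfect feature set in the sense of (1).
        Q, _ = np.linalg.qr(np.hstack([latent, rng.standard_normal((n, n - k))]))
        feats = Q[:, :d].copy()
        n_bad = int(rho_f * d)
        chosen = rng.choice(d, size=n_bad, replace=False)
        feats[:, chosen] = Q[:, d:d + n_bad]   # swap in orthogonal noise bases
        return feats

    return R, mask, noisy_features(U), noisy_features(V)
```

At ρ_f = 0 the features are perfect (col(R) ⊆ col(X)); at ρ_f = 1 they are orthogonal to the latent space, matching the two extremes in Figure 1.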
We first observe that in the top figures, IMC and SVDfeature perform similarly under different ρ_s. This suggests that with sufficient observations, the performance of IMC and SVDfeature depends mainly on feature quality and is not affected much by the number of observations. As a result, given good features (1d), they achieve smaller error than MC with few observations, but as features become noisy (1e-1f), they suffer poor performance by trying to learn the underlying matrix under biased feature spaces. Another interesting finding is that even when good features are given (1d), IMC (and SVDfeature) still fails to achieve 0 relative error as the number of observations increases, which reconfirms that IMC cannot guarantee recoverability when features are not perfect. On the other hand, we see that the performance of DirtyIMC improves with both better features and more observations. In particular, it makes use of informative features to achieve lower error than MC, and it is also less sensitive to noisy features than IMC and SVDfeature. Some finer-grained recovery results over ρ_s and ρ_f can be found in Appendix C.

Method    | DirtyIMC      | MF-ALS [16]   | IMC [18]      | HOC-3         | HOC-5 [12]
Accuracy  | 0.9474±0.0009 | 0.9412±0.0011 | 0.9139±0.0016 | 0.9242±0.0010 | 0.9297±0.0011
AUC       | 0.9506        | 0.9020        | 0.9109        | 0.9432        | 0.9480

Table 1: Relationship prediction on Epinions. Compared with other approaches, the DirtyIMC model gives the best performance in terms of both accuracy and AUC.

5.2 Real-world Applications

Relationship Prediction in Signed Networks. As the first application, we consider the relationship prediction problem on the online review website Epinions [26], where people can write reviews and trust or distrust others based on their reviews.
Such a social network can be modeled as a signed network in which trust/distrust relationships are modeled as positive/negative edges between entities [24], and the problem is to predict the unknown relationship between any two users given the network. A state-of-the-art approach is the low rank model [16, 12], in which one first conducts matrix completion on the adjacency matrix and then uses the sign of the completed matrix for relationship prediction. Therefore, if features of users are available, we can also consider the low rank model with our model used for the matrix completion step. This approach can be regarded as an improvement over [16] that incorporates feature information.

In this dataset, there are about n = 105K users and m = 807K observed relationship pairs, of which 15% are distrust relationships. In addition to who-trusts-whom information, we also have a user feature matrix Z ∈ R^{n×41}, where for each user a 41-dimensional feature vector is collected based on the user's review history, such as the number of positive/negative reviews the user gave/received. We then consider the low-rank model in [16] where matrix completion is conducted by DirtyIMC with the non-convex relaxation (5) (DirtyIMC), IMC [18] (IMC), and the matrix factorization proposed in [16] (MF-ALS), along with two other prediction methods, HOC-3 and HOC-5 [12]. Note that both row and column entities are users, so X = Y = Z is set for both the DirtyIMC and IMC models.

We conduct the experiment using 10-fold cross validation on observed edges, where the parameters are chosen from the set {10^α, 5 × 10^α}_{α=−3}^{2}. The averaged accuracy and AUC of each method are reported in Table 1. We first observe that IMC performs worse than MF-ALS even though IMC takes features into account. This is because the features are only weakly related to the relationship matrix, and as a result, IMC is misled by such noisy features. On the other hand, DirtyIMC performs the best among all prediction methods.
In particular, it performs slightly better than MF-ALS in terms of accuracy, and much better in terms of AUC. This shows that DirtyIMC can still exploit weakly informative features without being trapped by feature noise.

Semi-supervised Clustering. We now consider the semi-supervised clustering problem as another application. Given n items, an item feature matrix Z ∈ R^{n×d}, and m pairwise constraints specifying whether items i and j are similar or dissimilar, the goal is to find a clustering of the items such that most similar items fall within the same cluster.
We observe that this problem can be solved by matrix completion. Let S ∈ R^{n×n} be the signed similarity matrix defined by Sij = 1 (or −1) if items i and j are similar (or dissimilar), and 0 if their similarity is unknown. Solving semi-supervised clustering then becomes equivalent to clustering the symmetric signed graph S, where the goal is to partition the nodes so that most edges within the same group are positive and most edges between groups are negative [12]. As a result, a matrix completion approach [12] can be applied to solve the signed graph clustering problem on S.

However, this solution is clearly suboptimal for semi-supervised clustering, as it disregards features.
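To make the construction of S concrete, here is a minimal sketch; the function name and the pair-list representation of the constraints are our own illustrative choices, not from the paper:

```python
import numpy as np

def signed_similarity(n, similar_pairs, dissimilar_pairs):
    """Build the signed similarity matrix S described above:
    S[i, j] = 1 for similar pairs, -1 for dissimilar pairs,
    and 0 where the similarity is unobserved."""
    S = np.zeros((n, n))
    for i, j in similar_pairs:
        S[i, j] = S[j, i] = 1.0   # symmetric: constraints are unordered pairs
    for i, j in dissimilar_pairs:
        S[i, j] = S[j, i] = -1.0
    return S

# 4 items, one similar constraint and one dissimilar constraint
S = signed_similarity(4, similar_pairs=[(0, 1)], dissimilar_pairs=[(0, 2)])
```

The zeros in S are then treated as missing entries to be filled in by the matrix completion step.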
Many semi-supervised clustering algorithms have thus been proposed that take both item features and constraints into consideration [13, 25, 37].

[Figure 2: three panels (Mushroom, Segment, Covtype) plotting pairwise error against the number of observed pairs for K-means, SignMC, MCCC, and DirtyIMC.]
Figure 2: Semi-supervised clustering on real-world datasets. For the Mushroom dataset, where features are almost ideal, both MCCC and DirtyIMC achieve 0 error rate. For Segment and Covtype, where features are noisier, our model outperforms MCCC, as its error decreases given more constraints.

Dataset      number of items n   feature dimension d   number of clusters k
Mushrooms    8124                112                   2
Segment      2319                19                    7
Covtype      11455               54                    7

Table 2: Statistics of semi-supervised clustering datasets.

The current state-of-the-art method is the MCCC algorithm [37], which essentially solves semi-supervised clustering with the IMC objective. In [37], the authors show that by running k-means on the top-k eigenvectors of the completed matrix ZMZ^T, MCCC outperforms other state-of-the-art algorithms.

We now consider solving semi-supervised clustering with our DirtyIMC model. Our algorithm, summarized in Algorithm 2 in Appendix D, first completes the pairwise matrix with the DirtyIMC objective (2) instead of IMC (with both X and Y set to Z), and then runs k-means on the top-k eigenvectors of the completed matrix to obtain a clustering.
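The clustering step above (k-means on the top-k eigenvectors of the completed matrix) can be sketched as follows. This is only an illustration of the spectral step: the completion itself (the DirtyIMC objective) is assumed to be done upstream, the function name is ours, and the tiny Lloyd's loop with farthest-point initialization merely stands in for a library k-means:

```python
import numpy as np

def cluster_from_completed(M, k, n_iter=20):
    """Run k-means on the top-k eigenvectors of a symmetric
    completed pairwise matrix M, as in the algorithm above."""
    # eigh returns eigenvalues in ascending order; take the top-k eigenvectors
    vals, vecs = np.linalg.eigh((M + M.T) / 2.0)
    U = vecs[:, np.argsort(vals)[-k:]]

    # minimal Lloyd's k-means with deterministic farthest-point initialization
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = U[labels == c].mean(axis=0)
    return labels

# toy completed matrix with two obvious groups: items {0, 1, 2} and {3, 4}
M = np.block([[np.ones((3, 3)), np.zeros((3, 2))],
              [np.zeros((2, 3)), np.ones((2, 2))]])
labels = cluster_from_completed(M, k=2)
```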
This algorithm can be viewed as an improved version of MCCC that handles noisy features Z.

We compare our algorithm with k-means, signed graph clustering via matrix completion [12] (SignMC), and MCCC [37]. Note that since MCCC has been shown to outperform most other state-of-the-art semi-supervised clustering algorithms in [37], comparing with MCCC is sufficient to demonstrate the effectiveness of our algorithm. We run each method on three real-world datasets: Mushrooms, Segment, and Covtype.^1 All are classification benchmarks where both features and ground-truth classes of items are available; their statistics are summarized in Table 2. For each dataset, we randomly sample m = [1, 5, 10, 15, 20, 25, 30] × n pairwise constraints and run each algorithm to derive a clustering π, where π_i is the cluster index of item i. We then evaluate π by the following pairwise error with respect to the ground truth:

  (2 / (n(n − 1))) ( Σ_{(i,j): π*_i = π*_j} 1(π_i ≠ π_j) + Σ_{(i,j): π*_i ≠ π*_j} 1(π_i = π_j) ),

where π*_i is the ground-truth class of item i.

Figure 2 shows the results of each method on all three datasets. We first see that for the Mushrooms dataset, where features are perfect (a linear SVM attains 100% training accuracy), both MCCC and DirtyIMC obtain a perfect clustering, which shows that MCCC is indeed effective with perfect features. For the Segment and Covtype datasets, we observe that the performance of k-means and MCCC is dominated by feature quality. Although MCCC still benefits from constraint information, as it outperforms k-means, it clearly does not make the best use of the constraints: its performance does not improve even as the number of constraints increases. On the other hand, the error rate of SignMC can always be driven to 0 by increasing m.
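The pairwise error used above can be computed directly from the definition; a minimal sketch (the function name is ours), which is by construction invariant to permutations of the cluster labels:

```python
from itertools import combinations

def pairwise_error(pi, pi_star):
    """Pairwise error from the formula above: the fraction of item
    pairs on which clustering pi disagrees with ground truth pi_star,
    i.e. pairs placed together by one and apart by the other."""
    n = len(pi)
    mistakes = sum((pi[i] == pi[j]) != (pi_star[i] == pi_star[j])
                   for i, j in combinations(range(n), 2))
    return 2.0 * mistakes / (n * (n - 1))

err = pairwise_error([0, 0, 1, 1], [0, 0, 0, 1])  # 3 of 6 pairs disagree
```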
However, since it disregards features, it suffers from a much higher error rate than the feature-based methods when constraints are few. We again see that DirtyIMC combines the advantages of MCCC and SignMC: it makes use of features when few constraints are observed, yet simultaneously leverages constraint information to avoid being trapped by feature noise. This experiment shows that our model outperforms state-of-the-art approaches for semi-supervised clustering.

Acknowledgement. We thank David Inouye and Hsiang-Fu Yu for helpful comments and discussions. This research was supported by NSF grants CCF-1320746 and CCF-1117055.

^1 All datasets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. For Covtype, we subsample from the entire dataset so that each cluster has balanced size.

References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. JMLR, 10:803–826, 2009.
[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2003.
[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.
[4] E. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[5] E. Candès and B. Recht. Exact matrix completion via convex optimization. Commun. ACM, 55(6):111–119, 2012.
[6] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, 2011.
[7] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theor., 56(5):2053–2080, 2010.
[8] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 2012.
[9] T. Chen, W. Zhang, Q. Lu, K.
Chen, Z. Zheng, and Y. Yu. SVDFeature: A toolkit for feature-based collaborative filtering. JMLR, 13:3619–3622, 2012.
[10] Y. Chen, S. Bhojanapalli, S. Sanghavi, and R. Ward. Coherent matrix completion. In ICML, 2014.
[11] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. JMLR, 15(1):2213–2238, 2014.
[12] K.-Y. Chiang, C.-J. Hsieh, N. Natarajan, I. S. Dhillon, and A. Tewari. Prediction and clustering in signed networks: A local to global perspective. JMLR, 15:1177–1213, 2014.
[13] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
[14] U. Feige and G. Schechtman. On the optimality of the random hyperplane rounding technique for max cut. Random Struct. Algorithms, 20(3):403–440, 2002.
[15] L. Grippo and M. Sciandrone. Globally convergent block-coordinate techniques for unconstrained optimization. Optimization Methods and Software, 10:587–637, 1999.
[16] C.-J. Hsieh, K.-Y. Chiang, and I. S. Dhillon. Low rank modeling of signed networks. In KDD, 2012.
[17] C.-J. Hsieh and P. A. Olsen. Nuclear norm minimization via active subspace selection. In ICML, 2014.
[18] P. Jain and I. S. Dhillon. Provable inductive matrix completion. CoRR, abs/1306.0626, 2013.
[19] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In NIPS, 2010.
[20] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793–800, 2008.
[21] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. JMLR, 2010.
[22] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42:30–37, 2009.
[23] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection.
The Annals of Statistics, 28(5):1302–1338, 2000.
[24] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In WWW, 2010.
[25] Z. Li and J. Liu. Constrained clustering by spectral kernel learning. In ICCV, 2009.
[26] P. Massa and P. Avesani. Trust-aware bootstrapping of recommender systems. In Proceedings of the ECAI 2006 Workshop on Recommender Systems, pages 29–33, 2006.
[27] R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 2003.
[28] A. K. Menon, K.-P. Chitrapura, S. Garg, D. Agarwal, and N. Kota. Response prediction using collaborative filtering with hierarchies and side-information. In KDD, pages 141–149, 2011.
[29] N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 30(12):60–68, 2014.
[30] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. JMLR, 13(1):1665–1697, 2012.
[31] M. Rudelson and R. Vershynin. Smallest singular value of a random rectangular matrix. Comm. Pure Appl. Math, pages 1707–1739, 2009.
[32] O. Shamir and S. Shalev-Shwartz. Matrix completion with the trace norm: Learning, bounding, and transducing. JMLR, 15(1):3401–3423, 2014.
[33] D. Shin, S. Cetintas, K.-C. Lee, and I. S. Dhillon. Tumblr blog recommendation with boosted inductive matrix completion. In CIKM, pages 203–212, 2015.
[34] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In COLT, pages 545–560, 2005.
[35] M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In NIPS, 2013.
[36] E. Yang and P. Ravikumar. Dirty statistical models. In NIPS, 2013.
[37] J. Yi, L. Zhang, R. Jin, Q. Qian, and A. Jain.
Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In ICML, 2013.
[38] K. Zhong, P. Jain, and I. S. Dhillon. Efficient matrix sensing using rank-1 Gaussian measurements. In International Conference on Algorithmic Learning Theory (ALT), 2015.