{"title": "Speedup Matrix Completion with Side Information: Application to Multi-Label Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2301, "page_last": 2309, "abstract": "In standard matrix completion theory, it is required to have at least $O(n\\ln^2 n)$ observed entries to perfectly recover a low-rank matrix $M$ of size $n\\times n$, leading to a large number of observations when $n$ is large. In many real tasks, side information in addition to the observed entries is often available. In this work, we develop a novel theory of matrix completion that explicitly explore the side information to reduce the requirement on the number of observed entries. We show that, under appropriate conditions, with the assistance of side information matrices, the number of observed entries needed for a perfect recovery of matrix $M$ can be dramatically reduced to $O(\\ln n)$. We demonstrate the effectiveness of the proposed approach for matrix completion in transductive incomplete multi-label learning.", "full_text": "Speedup Matrix Completion with Side Information:\n\nApplication to Multi-Label Learning\n\nMiao Xu1\n1National Key Laboratory for Novel Software Technology,\n\nZhi-Hua Zhou1\n\nRong Jin2\n\nrongjin@cse.msu.edu\n\nNanjing University, Nanjing 210023, China\n\n2Department of Computer Science and Engineering,\nMichigan State University, East Lansing, MI 48824\n\n{xum, zhouzh}@lamda.nju.edu.cn\n\nAbstract\n\nIn standard matrix completion theory, it is required to have at least O(n ln2 n) ob-\nserved entries to perfectly recover a low-rank matrix M of size n \u00d7 n, leading to\na large number of observations when n is large. In many real tasks, side informa-\ntion in addition to the observed entries is often available. In this work, we develop\na novel theory of matrix completion that explicitly explore the side information\nto reduce the requirement on the number of observed entries. 
We show that, under appropriate conditions, with the assistance of side information matrices, the number of observed entries needed for a perfect recovery of matrix M can be dramatically reduced to O(ln n). We demonstrate the effectiveness of the proposed approach for matrix completion in transductive incomplete multi-label learning.

1 Introduction

Matrix completion concerns the problem of recovering a low-rank matrix from a limited number of observed entries. It has broad applications including collaborative filtering [35], dimensionality reduction [41], multi-class learning [4, 31], clustering [15, 42], etc. Recent studies show that, with a high probability, we can efficiently recover a matrix M ∈ R^{n×m} of rank r from O(r(n + m) ln^2(n + m)) observed entries when the observed entries are uniformly sampled from M [11, 12, 34].

Although the sample complexity for matrix completion, i.e., the number of observed entries required for perfectly recovering a low-rank matrix, is already near optimal (up to a logarithmic factor), its linear dependence on n and m requires a large number of observations for recovering large matrices, significantly limiting its application to real-world problems. Moreover, current techniques for matrix completion require solving an optimization problem that can be computationally prohibitive when the size of the matrix is very large. In particular, although a number of algorithms have been developed for matrix completion [10, 22, 23, 25, 27, 28, 39], most of them require updating the full matrix M at each iteration of optimization, leading to a high computational cost and a large storage requirement when both n and m are large. 
Several recent efforts [5, 19] try to address this issue, at the price of losing the performance guarantee in recovering the target matrix.

On the other hand, in several applications of matrix completion, besides the observed entries, side information is often available that can potentially benefit the process of matrix completion. Below we list a few examples:

• Collaborative filtering aims to predict ratings of individual users based on the ratings from other users [35]. Besides the ratings provided by users, side information, such as the textual description of items and the demographic information of users, is often available and can be used to facilitate the prediction of missing ratings.

• Link prediction, which aims to predict missing links between users in a social network based on the existing ones, can be viewed as a matrix completion problem [20], where side information, such as attributes of users (e.g., browse patterns and interactions among users), can be used to assist in completing the user-user link matrix.

Although several studies exploit side information for matrix recovery [1, 2, 3, 16, 29, 32, 33], most of them focus on matrix factorization techniques, which usually result in non-convex optimization problems without a guarantee of perfectly recovering the target matrix. In contrast, matrix completion deals with convex optimization problems, and perfect recovery is guaranteed under appropriate conditions.

In this work, we focus on exploiting side information to improve the sample complexity and scalability of matrix completion. 
We assume that besides the observed entries in the matrix M, there exist two side information matrices A ∈ R^{n×ra} and B ∈ R^{m×rb}, where r ≤ ra ≤ n and r ≤ rb ≤ m. We further assume the target matrix and the side information matrices share the same latent information; that is, the column and row vectors in M lie in the subspaces spanned by the column vectors in A and B, respectively. Unlike the standard theory of matrix completion that needs to find the optimal matrix M of size n × m, our optimization problem is reduced to searching for an optimal matrix of size ra × rb, making the recovery significantly more efficient both computationally and storage-wise provided ra ≪ n and/or rb ≪ m. We show that, with the assistance of side information matrices, with a high probability, we can perfectly recover M with O(r(ra + rb) ln(ra + rb) ln(n + m)) observed entries, a sample complexity that is sublinear in n and m.

We demonstrate the effectiveness of matrix completion with side information in transductive incomplete multi-label learning [17], which aims to assign multiple labels to individual instances in a transductive learning setting. We formulate transductive incomplete multi-label learning as a matrix completion problem, i.e., completing the instance-label matrix based on the observed entries that correspond to the given label assignments. Both the feature vectors of instances and the class correlation matrix can be used as side information. Our empirical study shows that the proposed approach is particularly effective when the number of given label assignments is small, verifying our theoretical result, i.e., side information can be used to reduce the sample complexity.

The rest of the paper is organized as follows: Section 2 briefly reviews some related work. Section 3 presents our main contribution. Section 4 presents our empirical study. 
Finally, Section 5 concludes with future issues.

2 Related work

Matrix Completion The objective of matrix completion is to fill in the missing entries of a matrix based on the observed ones. Early work on matrix completion, also referred to as maximum margin matrix factorization [37], was developed for collaborative filtering. Theoretical studies show that it is sufficient to perfectly recover a matrix M ∈ R^{n×m} of rank r when the number of observed entries is O(r(n + m) ln^2(n + m)) [11, 12, 34]. A more general matrix recovery problem, referred to as matrix regression, was examined in [30, 36]. Unlike these studies, our proposed approach reduces the sample complexity with the help of side information matrices.

Several computational algorithms [10, 22, 23, 25, 27, 28, 39] have been developed to efficiently solve the optimization problem of matrix completion. The main problem with these algorithms lies in the fact that they have to explicitly update the full matrix of size n × m, which is expensive both computationally and storage-wise for large matrices. This issue has been addressed in several recent studies [5, 19], where the key idea is to store and update a low-rank factorization of the target matrix. A preliminary convergence analysis is given in [19]; however, none of these approaches guarantees perfect recovery of the target matrix, even with a significantly large number of observed entries. In contrast, our proposed approach reduces the computational cost by explicitly exploring the side information matrices and still delivers the promise of perfect recovery.

Several recent studies involve matrix recovery with side information. [2, 3, 29, 33] are based on graphical models by assuming special distributions of latent factors; these algorithms, as well as [16] and [32], consider side information in matrix factorization. 
The main limitation lies in the fact that they have to solve non-convex optimization problems, and they do not have theoretical guarantees on matrix recovery. Matrix completion with infinite-dimensional side information was exploited in [1], yet it lacks a guarantee of perfect recovery. In contrast, our work is based on matrix completion theory that deals with a general convex optimization problem and is guaranteed to make a perfect recovery of the target matrix.

Multi-label Learning Multi-label learning allows each instance to be assigned to multiple classes simultaneously, making it more challenging than multi-class learning. The simplest approach for multi-label learning is to train one binary model for each label, which is also referred to as BR (Binary Relevance) [7]. Many advanced algorithms have been developed to explicitly explore the dependence among labels ([44] and references therein).

In this work, we will evaluate our proposed approach by transductive incomplete multi-label learning [17]. Let X = (x1, . . . , xn)^⊤ ∈ R^{n×d} be the feature matrix with xi ∈ R^d, where n is the number of instances and d is the dimension. Let C1, . . . , Cm denote the m labels, and let T ∈ {−1, +1}^{n×m} be the instance-label matrix, where T_{i,j} = +1 when xi is associated with the label Cj, and T_{i,j} = −1 when xi is not associated with the label Cj. Let Ω denote the subset of the observed entries in T that corresponds to the given label assignments of instances. The objective of transductive incomplete multi-label learning is to predict the missing entries in T based on the feature matrix X and the given label assignments in Ω. The main challenge lies in the fact that only a partial label assignment is given for each training instance. 
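As a concrete illustration of this setup, the following toy sketch in Python/NumPy (our own illustration with made-up sizes; the paper's implementation is in Matlab) builds a feature matrix X, an instance-label matrix T in {−1, +1}, and the observed index set Ω:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 100, 20, 5                      # instances, features, labels (toy sizes)
X = rng.standard_normal((n, d))           # feature matrix
T = np.sign(rng.standard_normal((n, m)))  # instance-label matrix in {-1, +1}
T[T == 0] = 1                             # guard against exact zeros

# Expose only a fraction of the label assignments: omega holds the
# indices (i, j) of observed entries of T.
obs_fraction = 0.2
mask = rng.random((n, m)) < obs_fraction
omega = np.argwhere(mask)                 # observed (instance, label) pairs

# The learning task: predict T[i, j] for all (i, j) with mask[i, j] == False.
num_observed = int(mask.sum())
print(num_observed, "of", n * m, "entries observed")
```

Here `obs_fraction` plays the role of the partial label assignment rate; everything outside `omega` is unknown at training time.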
This is in contrast to many studies on common semi-supervised or transductive multi-label learning [18, 24, 26, 43], where each labeled instance receives a complete set of label assignments. This is also different from multi-label learning with weak labels [8, 38], which assumes that only the positive labels can be observed. Here we assume the observed labels can be either positive or negative.

In [17], a matrix completion based approach was proposed for transductive incomplete multi-label learning. To effectively exploit the information in the feature matrix X, the authors proposed to complete the matrix T′ = [X, T] that combines the input features with label assignments into a single matrix. Two algorithms, MC-b and MC-1, were presented there, differing only in the treatment of the bias term, whereas the convergence of MC-1 was examined in [9]. The main limitation of both algorithms lies in their high computational cost when both the number of instances and the number of features are large. Unlike MC-1 and MC-b, our proposed approach does not need to deal with the big matrix T′, and is computationally more efficient. Besides the computational advantage, we show that our proposed approach significantly improves the sample complexity of matrix completion by exploiting side information matrices.

3 Speedup Matrix Completion with Side Information

We first describe the framework of matrix completion with side information, and then present its theoretical guarantee and application to multi-label learning.

3.1 Matrix Completion using Side Information

Let M ∈ R^{n×m} be the target matrix of rank r to be recovered. Without loss of generality, we assume n ≥ m. Let λk, k ∈ {1, . . . , r}, be the kth largest singular value of M, and let uk ∈ R^n and vk ∈ R^m be the corresponding left and right singular vectors, i.e., M = UΣV^⊤, where Σ = diag(λ1, . . . , λr), U = (u1, . . . , ur) and V = (v1, . . . , vr).

Let Ω ⊆ {1, . . . , n} × {1, . . . , m} be the subset of indices of observed entries sampled uniformly from all entries in M. Given Ω, we define a linear operator R_Ω(M): R^{n×m} → R^{n×m} as

    [R_Ω(M)]_{i,j} = M_{i,j} if (i, j) ∈ Ω, and 0 if (i, j) ∉ Ω.

Using R_Ω(·), the standard matrix completion problem is:

    min_{M̃ ∈ R^{n×m}} ∥M̃∥_tr   s.t.  R_Ω(M̃) = R_Ω(M),        (1)

where ∥·∥_tr is the trace norm.

Let A = (a1, . . . , a_{ra}) ∈ R^{n×ra} and B = (b1, . . . , b_{rb}) ∈ R^{m×rb} be the side information matrices, where r ≤ ra ≤ n and r ≤ rb ≤ m. Without loss of generality, we assume that ra ≥ rb and that both A and B are orthonormal matrices, i.e., a_i^⊤ a_j = δ_{i,j} and b_i^⊤ b_j = δ_{i,j} for any i and j, where δ_{i,j} is the Kronecker delta function that outputs 1 if i = j and 0 otherwise. In the case where side information is not available, A and B will be set to the identity matrix.

The objective is to complete a matrix M of rank r with the side information matrices A and B. We make the following assumption in order to fully exploit the side information:

Assumption A: the column vectors in M lie in the subspace spanned by the column vectors in A, and the row vectors in M lie in the subspace spanned by the column vectors in B.

To understand the implication of this assumption, let us consider the problem of transductive incomplete multi-label learning [17], where the objective is to complete the instance-label matrix based on the observed entries corresponding to the given label assignments, and the side information matrices A and B are given by the feature vectors of instances and the label correlation matrix, respectively. Assumption A essentially implies that all the label assignments can be accurately predicted by a linear combination of feature vectors of instances.

Using Assumption A, we can write M as M = A Z_0 B^⊤, and therefore our goal is to learn Z_0 ∈ R^{ra×rb}. Following the standard theory for matrix completion [11, 12, 34], we can cast the matrix completion task into the following optimization problem:

    min_{Z ∈ R^{ra×rb}} ∥Z∥_tr   s.t.  R_Ω(AZB^⊤) = R_Ω(M).        (2)

Unlike the standard algorithm for matrix completion that requires solving an optimization problem involving a matrix of size n × m, the optimization problem given in (2) only deals with a matrix Z of size ra × rb, and therefore can be solved significantly more efficiently if ra ≪ n and rb ≪ m.

3.2 Theoretical Result

We define μ0 and μ1, the coherence measurements for matrix M, as

    μ0 = max( (n/r) max_{1≤i≤n} ∥P_U e_i∥^2, (m/r) max_{1≤j≤m} ∥P_V e_j∥^2 ),
    μ1 = (mn/r) max_{i,j} ([UV^⊤]_{i,j})^2,

where e_i is the vector with the ith entry equal to 1 and all others equal to 0, and P_U and P_V project a vector onto the subspace spanned by the column vectors of U and V, respectively. We also define the coherence measure for matrices A and B as

    μ_AB = max( max_{1≤i≤n} n∥A_{i,*}∥^2 / ra, max_{1≤j≤m} m∥B_{j,*}∥^2 / rb ),

where A_{i,*} and B_{j,*} stand for the ith row of A and the jth row of B, respectively.

Theorem 1. Let μ = max(μ0, μ_AB). Define q0 = (1/2)(1 + log2 ra − log2 r), Ω0 = (128β/3) μ max(μ1, μ) r(ra + rb) ln n, and Ω1 = (8β/3) μ^2 (ra rb + r^2) ln n. Assume Ω1 ≥ q0 Ω0. Then, with probability at least 1 − 4(q0 + 1) n^{−β+1} − 2 q0 n^{−β+2}, Z_0 is the unique optimizer of the problem in (2) provided

    |Ω| ≥ (64β/3) μ max(μ1, μ) (1 + log2 ra − log2 r) r(ra + rb) ln n.

Compared to the standard matrix completion theory [34], the side information matrices reduce the sample complexity from O(r(n + m) ln^2(n + m)) to O(r(ra + rb) ln(ra + rb) ln n). 
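The coherence quantities above can be computed directly from the singular vectors of M and the side information matrices. The following NumPy sketch (our own illustration; all sizes and variable names are assumptions) evaluates μ0, μ1, and μ_AB for a random low-rank M satisfying Assumption A:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, ra, rb = 60, 40, 3, 6, 5

# Random orthonormal side information matrices A (n x ra) and B (m x rb).
A, _ = np.linalg.qr(rng.standard_normal((n, ra)))
B, _ = np.linalg.qr(rng.standard_normal((m, rb)))

# Rank-r target satisfying Assumption A: M = A Z0 B^T.
Z0 = rng.standard_normal((ra, r)) @ rng.standard_normal((r, rb))
M = A @ Z0 @ B.T

U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, V = U[:, :r], Vt[:r, :].T              # top-r singular vectors of M

# mu0: since U has orthonormal columns, ||P_U e_i||^2 equals ||U[i, :]||^2.
mu0 = max(n / r * (U ** 2).sum(axis=1).max(),
          m / r * (V ** 2).sum(axis=1).max())

# mu1: based on the largest squared entry of U V^T.
mu1 = m * n / r * ((U @ V.T) ** 2).max()

# mu_AB: row coherence of the side information matrices.
mu_AB = max(n / ra * (A ** 2).sum(axis=1).max(),
            m / rb * (B ** 2).sum(axis=1).max())
```

Note that μ0 and μ_AB are always at least 1, since the squared row norms of an orthonormal basis sum to its number of columns.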
When ra ≪ n and rb ≪ m, the side information allows us to significantly reduce the number of observed entries required for perfectly recovering matrix M. We defer the technical proof of Theorem 1 to the supplementary material due to the page limit. Note that although we follow the framework of [34] for the analysis, namely first proving the result under deterministic conditions and then showing that the deterministic conditions hold with a high probability, our technical proof is quite different due to the involvement of the side information matrices A and B.

3.3 Application to Multi-Label Learning

Similar to the Singular Value Thresholding (SVT) method [10], we approximate the problem in (2) by an unconstrained optimization problem, i.e.,

    min_{Z ∈ R^{ra×rb}} L(Z) = λ∥Z∥_tr + (1/2)∥R_Ω(AZB^⊤ − M)∥_F^2,        (3)

where λ > 0 is introduced to weight the trace norm regularization term against the regression error. We develop an algorithm that exploits the smoothness of the loss function and therefore achieves an O(1/T^2) convergence rate, where T is the number of iterations. Details of the algorithm can be found in the supplementary material. We refer to the proposed algorithm as Maxide.

For transductive incomplete multi-label learning, we abuse our notation by defining n as the number of instances, m as the number of labels, and d as the dimensionality of input patterns. Our goal is to complete the instance-label matrix M ∈ R^{n×m} by using (i) the feature matrix X ∈ R^{n×d} and (ii) the observed entries Ω in M (i.e., the given label assignments). We thus set the side information matrix A to include the top left singular vectors of X, and B = I to indicate that no side information is available for the dependence among labels. 
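The exact Maxide updates are given in the authors' supplementary material; as a rough illustration of the kind of iteration involved (our own simplified, unaccelerated sketch, not the authors' code), a proximal gradient step for (3) combines the gradient of the smooth term, A^⊤ R_Ω(AZB^⊤ − M) B, with singular value soft-thresholding of Z:

```python
import numpy as np

def svd_soft_threshold(Z, tau):
    """Proximal operator of tau * trace norm: shrink the singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def maxide_like(M_obs, mask, A, B, lam=1e-2, n_iter=200, step=1.0):
    """Minimize lam*||Z||_tr + 0.5*||R_Omega(A Z B^T - M)||_F^2 by
    proximal gradient descent (simplified, unaccelerated variant)."""
    Z = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iter):
        residual = mask * (A @ Z @ B.T - M_obs)  # R_Omega(A Z B^T - M)
        grad = A.T @ residual @ B                # gradient of the smooth term
        Z = svd_soft_threshold(Z - step * grad, step * lam)
    return Z

# Toy usage: recover a rank-2 matrix satisfying Assumption A.
rng = np.random.default_rng(0)
n, m, r, ra, rb = 50, 40, 2, 4, 4
A, _ = np.linalg.qr(rng.standard_normal((n, ra)))
B, _ = np.linalg.qr(rng.standard_normal((m, rb)))
M = A @ rng.standard_normal((ra, r)) @ rng.standard_normal((r, rb)) @ B.T
mask = rng.random((n, m)) < 0.5
Z_hat = maxide_like(mask * M, mask, A, B)
rel_err = np.linalg.norm(A @ Z_hat @ B.T - M) / np.linalg.norm(M)
```

Since A and B have orthonormal columns and R_Ω is a projection, the smooth term's gradient is 1-Lipschitz, so a unit step size is safe here; each iteration only factorizes the small ra × rb matrix Z, which is the source of the computational savings.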
We note that the low-rank assumption on the instance-label matrix M implies a linear dependence among the label prediction functions. This assumption has been explored extensively in previous studies of multi-label learning [17, 21, 38].

4 Experiments

We evaluate the proposed algorithm for matrix completion with side information on both synthetic and real data sets. Our implementation is in Matlab except that the operation R_Ω(L × R) is implemented in C. All the results were obtained on a Linux server with a 2.53GHz CPU and 48GB memory.

4.1 Experiments on Synthetic Data

To create the side information matrices A and B, we first generate a random matrix F ∈ R^{n×m}, with each entry F_{i,j} drawn independently from N(0, 1). Side information matrix A includes the first ra left singular vectors of F, and B includes the first rb right singular vectors. To create Z_0, we generate two Gaussian random matrices Z_A ∈ R^{ra×r} and Z_B ∈ R^{rb×r}, where each entry is sampled independently from N(0, 1). The singular value decompositions of AZ_A and BZ_B are given by AZ_A = U Σ_1 V_1^⊤ and BZ_B = V Σ_2 V_2^⊤, respectively. We create a diagonal matrix Σ ∈ R^{r×r}, whose diagonal entries are drawn independently from N(0, 10^4). Z_0 is then given by Z_0 = (Z_A Σ_1^†(V_1^⊤)^†) Σ (Z_B Σ_2^†(V_2^⊤)^†)^⊤, where † denotes the pseudo-inverse of a matrix. Finally, the target matrix M is given by M = A Z_0 B^⊤.

Settings and Baselines Our goal is to show that the proposed algorithm is able to accurately recover the target matrix with a significantly smaller number of entries and less computational time. In this study, we only consider square matrices (i.e., m = n), with n ∈ {1,000, 5,000, 10,000, 20,000, 30,000} and rank r ∈ {10, 50, 100}. 
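The generation procedure above can be sketched as follows (our simplified NumPy version: we build Z_0 directly from the two Gaussian factors and the diagonal Σ instead of going through the pseudo-inverse construction, which still yields a rank-r target of the form M = A Z_0 B^⊤):

```python
import numpy as np

def make_synthetic(n, m, r, ra, rb, seed=0):
    """Generate side information A, B and a rank-r target M = A Z0 B^T
    (simplified variant of the construction described in the text)."""
    rng = np.random.default_rng(seed)
    # A: first ra left singular vectors of a Gaussian F; B: first rb right ones.
    F = rng.standard_normal((n, m))
    U, _, Vt = np.linalg.svd(F, full_matrices=False)
    A, B = U[:, :ra], Vt[:rb, :].T
    # Gaussian factors and a diagonal Sigma with N(0, 1e4) entries (std 100).
    ZA = rng.standard_normal((ra, r))
    ZB = rng.standard_normal((rb, r))
    Sigma = np.diag(rng.standard_normal(r) * 100.0)
    Z0 = ZA @ Sigma @ ZB.T
    return A, B, Z0, A @ Z0 @ B.T

A, B, Z0, M = make_synthetic(n=200, m=200, r=10, ra=20, rb=20)

# Sample |Omega| = r(2n - r) observed entries uniformly at random.
n, m = M.shape
r = 10
rng = np.random.default_rng(1)
num_obs = r * (2 * n - r)
flat = rng.choice(n * m, size=num_obs, replace=False)
mask = np.zeros(n * m, dtype=bool)
mask[flat] = True
mask = mask.reshape(n, m)
```

The sampling budget r(2n − r) matches the number of degrees of freedom of a rank-r n × n matrix, which is why it is far below the budgets used in earlier matrix completion experiments.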
Both ra and rb of the side information matrices are set to 2r, and |Ω|, the number of observed entries, is set to r(2n − r), which is significantly smaller than the number of observed entries used in previous studies [10, 25, 27]. We repeat each experiment 10 times, and report the result averaged over 10 runs. We compare the proposed Maxide algorithm with three state-of-the-art matrix completion algorithms: Singular Value Thresholding (SVT) [10], the Fixed Point Bregman Iterative Method (FPCA) [27], and the Augmented Lagrangian Method (ALM) [25]. In addition to these matrix completion methods, we also compare with a trace norm minimization method (TraceMin) [6]. For all the baselines, we use the codes provided by their original authors with their default parameter settings.

Results We measure the performance of matrix completion by the relative error ∥AZB^⊤ − M∥_F / ∥M∥_F and report both relative error and running time in Table 1. For TraceMin, we observe that for n = 1,000 and r = 10, it gives a result of 1.75 × 10^{−7} within 2.94 × 10^4 seconds, which is very slow compared to our proposal; for n = 1,000 and r = 50, it gives no result within one week.

Table 1: Results on synthesized data sets. n is the size of a square matrix and r is its rank. Rate is the number of observed entries divided by the size of the matrix, i.e., |Ω|/(nm). Time measures the running time in seconds, and Relative error measures ∥AZB^⊤ − M∥_F / ∥M∥_F. The best performance for each setting is bolded. We do not report the results for FPCA and SVT when n ≥ 5,000 because they were unable to finish the computation after 50 hours. [The table data is omitted here owing to extraction damage: for each (n, r) setting it lists the sampling rate together with the running time and relative error of Maxide, SVT, FPCA, and ALM.]

In Table 1, we first observed that for all the cases, the relative error achieved by the baseline methods is Ω(1), implying that none of them is able to make an accurate recovery of the target matrix given the small number of observed entries. In contrast, our proposed algorithm is able to recover the target matrix with small relative error. In addition, our proposed algorithm is computationally more efficient than the baseline methods. The improvement in computational efficiency becomes more significant for large matrices.

4.2 Application to Transductive Incomplete Multi-Label Learning

We evaluate the proposed algorithm for transductive incomplete multi-label learning on thirteen benchmark data sets, including eleven data sets for web page classification from "yahoo.com" [40], and two image classification data sets, NUS-WIDE [14] and Flickr [45]. For the eleven "yahoo.com" data sets, the number of instances is n = 5,000; the number of dimensions varies from 438 to 1,047, and the number of labels varies from 21 to 40. Detailed information on these eleven data sets can be found in [40]. For the NUS-WIDE data set, we have n = 209,347 images, each represented by a bag-of-words model with d = 500 visual words, and 81 labels. For the Flickr data set, we only keep the first 1,000 most popular keywords as labels, leaving us with n = 565,444 images, each represented by a d = 297-dimensional vector.

Settings and Baselines For each data set, we randomly sample 10% of the instances for testing (unlabeled data) and use the remaining 90% of the data for training. No label assignment is provided for any test instance. To create partial label assignments for the training data, for each label Cj, we expose the label assignment of Cj for ω% randomly sampled positive and negative training instances and keep the label assignment of Cj unknown for the rest of the training instances. To examine the performance of the proposed algorithm, we vary ω% in the range {10%, 20%, 40%}. 
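The masking protocol for a single trial might look like the following sketch (our own illustration; T and the split are synthetic placeholders, and we sample ω% of the training instances per label uniformly rather than stratifying by positive and negative instances as the paper does):

```python
import numpy as np

def make_partial_labels(T_train, omega_pct, rng):
    """For each label (column), expose the assignment for omega_pct of the
    training instances, chosen at random; the rest stay unobserved."""
    n, m = T_train.shape
    mask = np.zeros((n, m), dtype=bool)
    k = int(round(omega_pct * n))
    for j in range(m):
        idx = rng.choice(n, size=k, replace=False)
        mask[idx, j] = True
    return mask

rng = np.random.default_rng(0)
n, m = 1000, 20
T = np.where(rng.random((n, m)) < 0.3, 1, -1)  # synthetic label matrix

# 10% test split; test labels are fully hidden.
test = rng.choice(n, size=n // 10, replace=False)
train = np.setdiff1d(np.arange(n), test)

# Number of observed entries per label for each omega setting.
counts = {}
for omega in (0.10, 0.20, 0.40):
    mask_train = make_partial_labels(T[train], omega, rng)
    counts[omega] = int(mask_train[:, 0].sum())
```

With 900 training instances, this exposes 90, 180, and 360 label assignments per label for ω% = 10%, 20%, and 40%, respectively.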
We repeat each experiment\n10 times, and report the result averaged over 10 trials. The regularization parameter \u03bb is selected\n{\u221210;\u22129;:::;9;10} by cross validation on training data for smaller data sets and set as 1 for larger\nfrom 2\n\u22125, respectively, for the proposed algorithm, and the\nones. Parameters \u03b3 and \u03f5 are set to be 2 and 10\nmaximum number of iterations is set to be 100. The Average Precision [44], which measures the\naverage number of relevant labels ranked before a particular relevant label, is computed over the test\ndata (the metric on all the data is provided in the supplementary material) and used as our evaluation\nmetric.\nWe compare the proposed Maxide method with MC-1 and MC-b, the state-of-the-art methods for\ntransductive incomplete multi-label learning developed in [17]. In addition, we also compare with\ntwo reference methods for multi-label learning that train one binary classi\ufb01er for each label; that\nis, the Binary Relevance method [7] based on Linear kernel (BR-L) and the method based on RBF\nkernel (BR-R), where the kernel width is set to 1. For the eleven data sets from \u201cyahoo.com\u201d,\n\n6\n\n\fLIBSVM [13] is used by BR-L and BR-R to learn both a linear and nonlinear SVM classi\ufb01er. For\nthe two image data sets, due to their large size, only BR-L method is included in comparison and\nLIBLINEAR is used for the implementation of BR-L due to its high ef\ufb01ciency for large data sets. A\nsimilar strategy is used to determine the optimal \u03bb as our proposal.\n\nResults Table 2 summarizes the results on transductive incomplete multi-label learning. We ob-\nserve that the proposed Maxide algorithm outperforms the baseline methods, for most setting on\nseveral data sets (e.g., Business, Education, and Recreation), and the improvements are signi\ufb01cant.\nMore impressively, for most data sets, the proposed algorithm is three order faster than MC-1 and\nMC-b. 
For the NUS-WIDE data set, none of MC-1 and MC-b, the two existing matrix completion\nbased algorithms for transductive incomplete multi-label learning, is able to \ufb01nish within one week.\nFor the Flickr data set, MC-1 and MC-b are not runnable due to the out of memory problem. For the\nNUS-WIDE and Flickr data sets, our proposed Maxide method gets an average of more than 50%\nimprovement against BR-L, the only runnable baseline, on the Average Precision.\n\n5 Conclusion\n\nIn this paper, we develop the theory of matrix completion with side information. We show theoreti-\ncally that, with side information matrices A \u2208 Rn\u00d7ra and B \u2208 Rm\u00d7rb, we can perfectly recover an\nn \u00d7 m rank-r matrix with only O(r(ra + rb) ln(ra + rb) ln(n + m)) observed entries, a signi\ufb01cant\nimprovement compared to the sample complexity O(r(n + m) ln2(n + m)) for the standard theory\nfor matrix completion. We present the Maxide algorithm that can ef\ufb01ciently solve the optimization\nproblem for matrix completion with side information. Empirical studies with synthesized data sets\nand transductive incomplete multi-label learning show the promising performance of the proposed\nalgorithm.\n\nAcknowledgement This research was partially supported by 973 Program (2010CB327903), NS-\nFC (61073097, 61273301), and ONR Award (N000141210431).\n\nReferences\n[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative \ufb01ltering: Operator\n\nestimation with spectral regularization. JMLR, 10:803\u2013826, 2009.\n\n[2] R. Adams, G. Dahl, and I. Murray. Incorporating side information in probabilistic matrix factorization\n\nwith gaussian processes. In UAI, 2010.\n\n[3] D. Agarwal and B.-C. Chen. Regression-based latent factor models. In KDD, 2009.\n[4] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. MLJ, 73(3):243\u2013272, 2008.\n[5] H. Avron, S. Kale, S. Kasiviswanathan, and V. Sindhwani. 
Ef\ufb01cient and practical stochastic subgradient\n\ndescent for nuclear norm regularization. In ICML, 2012.\n\n[6] F. Bach. Consistency of trace norm minimization. JMLR, 9:1019\u20131048, 2008.\n[7] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classi\ufb01cation. Pattern\n\nRecognition, 37(9):1757\u20131771, 2004.\n\n[8] S. Bucak, R. Jin, and A. Jain. Multi-label learning with incomplete class assignments. In CVPR, 2011.\n[9] R. Cabral, F. Torre, J. Costeira, and A. Bernardino. Matrix completion for multi-label image classi\ufb01cation.\n\nIn NIPS, 2011.\n\n[10] J.-F. Cai, E. Cand`es, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM\n\nJournal on Optimization, 20(4):1956\u20131982, 2010.\n\n[11] E. Cand`es and B. Recht. Exact matrix completion via convex optimization. CACM, 55(6):111\u2013119, 2012.\n[12] E. Cand`es and T. Tao. The power of convex relaxation: near-optimal matrix completion.\nIEEE TIT,\n\n56(5):2053\u20132080, 2010.\n\n[13] C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector machines. ACM TIST, 2(3):27, 2011.\n[14] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. Nus-wide: A real-world web image database\n\nfrom national university of singapore. In CIVR, 2009.\n\n[15] B. Eriksson, L. Balzano, and R. Nowak. High-rank matrix completion and subspace clustering with\n\nmissing data. CoRR, 2011.\n\n7\n\n\fTable 2: Results on transductive incomplete multi-label learning. Algo. speci\ufb01es the name of the algorithms.\nTime is the CPU time measured in seconds. AP is Average Precision measured based on test data; the higher the\nAP, the better the performance. !% represents the percentage of training instances with observed label assign-\nment for each label. 
The best result and its comparable ones (pairwise single-tailed t-tests at 95% confidence level) are bolded.

                         ---- ℓ% = 10% ----   ---- ℓ% = 20% ----   ---- ℓ% = 40% ----
Data           Algo.     time       AP        time       AP        time       AP
Arts           Maxide    3.09e0     0.548     3.60e0     0.572     4.42e0     0.596
               MC-b      2.47e4     0.428     1.59e4     0.444     9.54e3     0.434
               MC-1      2.39e4     0.430     2.05e4     0.494     1.27e4     0.473
               BR-R      1.63e1     0.540     2.98e1     0.563     5.71e1     0.574
               BR-1      1.77e1     0.540     3.07e1     0.563     7.10e1     0.575
Business       Maxide    3.24e0     0.868     3.89e0     0.860     5.04e0     0.872
               MC-b      2.94e4     0.865     1.83e4     0.851     1.08e4     0.858
               MC-1      3.25e4     0.865     2.18e4     0.855     1.21e4     0.862
               BR-R      1.02e1     0.846     1.78e1     0.841     3.32e1     0.854
               BR-1      1.19e1     0.846     1.96e1     0.841     4.30e1     0.854
Computers      Maxide    4.67e0     0.635     5.81e0     0.660     7.79e0     0.675
               MC-b      5.58e4     0.597     3.38e4     0.599     1.87e4     0.604
               MC-1      6.56e4     0.600     4.40e4     0.608     2.30e4     0.618
               BR-R      2.34e1     0.622     4.13e1     0.649     7.68e1     0.662
               BR-1      2.70e1     0.621     4.50e1     0.648     8.25e1     0.661
Education      Maxide    4.40e0     0.566     5.41e0     0.604     6.73e0     0.618
               MC-b      3.82e4     0.472     2.40e4     0.478     1.32e4     0.474
               MC-1      4.68e4     0.484     3.02e4     0.536     1.55e4     0.564
               BR-R      1.77e1     0.535     3.16e1     0.568     6.01e1     0.583
               BR-1      1.94e1     0.535     3.28e1     0.568     6.94e1     0.583
Entertainment  Maxide    2.77e0     0.631     3.41e0     0.650     4.56e0     0.679
               MC-b      4.86e4     0.474     3.13e4     0.467     1.73e4     0.468
               MC-1      4.40e4     0.489     4.15e4     0.492     2.27e4     0.578
               BR-R      1.89e1     0.628     3.38e1     0.638     6.47e1     0.668
               BR-1      2.04e1     0.627     3.44e1     0.640     6.41e1     0.667
Health         Maxide    4.31e0     0.725     5.36e0     0.746     7.11e0     0.769
               MC-b      4.98e4     0.609     2.99e4     0.607     1.71e4     0.610
               MC-1      5.82e4     0.626     3.82e4     0.632     2.03e4     0.645
               BR-R      2.03e1     0.725     3.61e1     0.742     6.83e1     0.757
               BR-1      2.16e1     0.725     3.59e1     0.741     7.05e1     0.757
Recreation     Maxide    2.75e0     0.559     3.38e0     0.592     4.44e0     0.614
               MC-b      3.56e4     0.381     2.41e4     0.381     1.30e4     0.378
               MC-1      3.48e4     0.381     3.25e4     0.430     1.90e4     0.421
               BR-R      1.97e1     0.548     3.48e1     0.574     6.53e1     0.596
               BR-1      2.24e1     0.547     3.74e1     0.573     6.86e1     0.596
Reference      Maxide    5.11e0     0.635     6.47e0     0.666     8.49e0     0.696
               MC-b      9.38e4     0.565     5.38e4     0.561     2.75e4     0.575
               MC-1      1.11e5     0.576     6.53e4     0.576     3.22e4     0.575
               BR-R      2.28e1     0.644     3.89e1     0.670     7.08e1     0.693
               BR-1      2.71e1     0.644     4.34e1     0.669     7.48e1     0.692
Science        Maxide    6.21e0     0.513     7.67e0     0.543     1.02e1     0.568
               MC-b      6.80e4     0.395     3.94e4     0.403     2.06e4     0.394
               MC-1      8.50e4     0.411     4.97e4     0.470     2.52e4     0.414
               BR-R      2.93e1     0.506     5.06e1     0.535     9.30e1     0.557
               BR-1      3.60e1     0.506     5.91e1     0.535     1.04e2     0.557
Social         Maxide    7.18e0     0.721     9.09e0     0.748     1.21e1     0.754
               MC-b      1.71e5     0.582     9.65e4     0.595     4.56e4     0.594
               MC-1      2.22e5     0.602     1.17e5     0.625     5.41e4     0.604
               BR-R      3.09e1     0.717     5.35e1     0.746     9.74e1     0.751
               BR-1      3.71e1     0.717     6.00e1     0.746     1.02e2     0.751
Society        Maxide    3.69e0     0.580     4.54e0     0.594     5.80e0     0.616
               MC-b      4.75e4     0.550     2.93e4     0.545     1.62e4     0.552
               MC-1      4.14e4     0.550     3.65e4     0.561     2.04e4     0.590
               BR-R      2.50e1     0.571     4.54e1     0.590     8.59e1     0.600
               BR-1      2.84e1     0.572     4.92e1     0.590     9.58e1     0.601
NUS-WIDE       Maxide    1.47e3     0.513     2.10e3     0.519     3.53e3     0.522
               BR-1      1.24e2     0.329     2.38e2     0.398     4.81e2     0.466
Flickr         Maxide    1.33e4     0.124     1.89e4     0.124     2.67e4     0.124
               BR-1      2.48e4     0.064     4.74e4     0.074     1.11e5     0.077

[16] Y. Fang and L. Si. Matrix co-factorization for recommendation with rich side information and implicit feedback. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, 2011.
[17] A. Goldberg, X. Zhu, B. Recht, J.-M. Xu, and R. Nowak. Transduction with matrix completion: Three birds with one stone. In NIPS, 2010.
[18] Y. Guo and D. Schuurmans. Semi-supervised multi-label classification - a simultaneous large-margin, subspace learning approach. In ECML, 2012.
[19] P. Jain, P. Netrapalli, and S. Sanghavi. Provable matrix sensing using alternating minimization.
In NIPS Workshop on Optimization for Machine Learning, 2012.
[20] A. Jalali, Y. Chen, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. In ICML, 2011.
[21] S. Ji, L. Tang, S. Yu, and J. Ye. Extracting shared subspace for multi-label classification. In KDD, 2008.
[22] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In ICML, 2009.
[23] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE TIT, 56(6):2980-2998, 2010.
[24] X. Kong, M. Ng, and Z.-H. Zhou. Transductive multi-label learning via label set propagation. IEEE TKDE, 25(3):704-719, 2013.
[25] Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical report, UIUC, 2009.
[26] Y. Liu, R. Jin, and L. Yang. Semi-supervised multi-label learning by constrained non-negative matrix factorization. In AAAI, 2006.
[27] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1-2):321-353, 2011.
[28] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. JMLR, 11:2287-2322, 2010.
[29] A. Menon, K. Chitrapura, S. Garg, D. Agarwal, and N. Kota. Response prediction using collaborative filtering with hierarchies and side-information. In KDD, 2011.
[30] S. Negahban and M. Wainwright. Estimation of (near) low-rank matrices with noise and high dimensional scaling. Annals of Statistics, 39(2):1069-1097, 2011.
[31] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231-252, 2010.
[32] W. Pan, E. Xiang, N. Liu, and Q. Yang.
Transfer learning in collaborative filtering for sparsity reduction. In AAAI, 2010.
[33] I. Porteous, A. Asuncion, and M. Welling. Bayesian matrix factorization with side information and Dirichlet process mixtures. In AAAI, 2010.
[34] B. Recht. A simpler approach to matrix completion. JMLR, 12:3413-3430, 2011.
[35] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.
[36] A. Rohde and A. Tsybakov. Estimation of high dimensional low rank matrices. Annals of Statistics, 39(2):887-930, 2011.
[37] N. Srebro, J. D. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NIPS, 2005.
[38] Y.-Y. Sun, Y. Zhang, and Z.-H. Zhou. Multi-label learning with weak label. In AAAI, 2010.
[39] K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 2010.
[40] N. Ueda and K. Saito. Parametric mixture models for multi-labeled text. In NIPS, 2002.
[41] K. Weinberger and L. Saul. Unsupervised learning of image manifolds by semidefinite programming. IJCV, 70(1):77-90, 2006.
[42] J. Yi, T. Yang, R. Jin, A. Jain, and M. Mahdavi. Robust ensemble clustering by matrix completion. In ICDM, 2012.
[43] G. Yu, C. Domeniconi, H. Rangwala, G. Zhang, and Z. Yu. Transductive multi-label ensemble classification for protein function prediction. In KDD, 2012.
[44] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE TKDE, in press.
[45] J. Zhuang and S. Hoi. A two-view learning approach for image tag ranking.
In WSDM, 2011.
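As a supplementary illustration of the optimization summarized in the conclusion, here is a minimal NumPy sketch of matrix completion with side information: it minimizes lam*||Z||_* + 0.5*||P_Omega(A Z B^T - M)||_F^2 over the small ra × rb matrix Z by plain proximal gradient (singular value thresholding). This is not the authors' Maxide implementation, which uses an accelerated solver; the function names, synthetic sizes, and parameter values below are illustrative assumptions.

```python
import numpy as np

def svt(X, tau):
    # Singular value thresholding: the proximal operator of tau * ||.||_*.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def side_info_completion(M_obs, mask, A, B, lam=1e-3, n_iter=500, step=1.0):
    # Proximal gradient on the small (ra x rb) matrix Z; the recovered
    # n x m matrix is A Z B^T. With orthonormal A and B, step = 1 is a
    # valid step size (the smooth part is 1-Lipschitz in the gradient).
    Z = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iter):
        R = mask * (A @ Z @ B.T - M_obs)     # residual on observed entries
        Z = svt(Z - step * (A.T @ R @ B), step * lam)
    return A @ Z @ B.T

# Synthetic check: side information spans the true row/column spaces.
rng = np.random.default_rng(0)
n, m, ra, rb, r = 60, 50, 8, 6, 3
A, _ = np.linalg.qr(rng.standard_normal((n, ra)))  # column-space side info
B, _ = np.linalg.qr(rng.standard_normal((m, rb)))  # row-space side info
Z0 = rng.standard_normal((ra, r)) @ rng.standard_normal((r, rb))
M = A @ Z0 @ B.T                                   # rank-r ground truth
mask = rng.random((n, m)) < 0.3                    # ~30% entries observed
M_hat = side_info_completion(mask * M, mask, A, B)
err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
```

Because the optimization variable Z has only ra × rb entries rather than n × m, each iteration is cheap; this is the source of the speedup the paper describes.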