{"title": "Transduction with Matrix Completion: Three Birds with One Stone", "book": "Advances in Neural Information Processing Systems", "page_first": 757, "page_last": 765, "abstract": "We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum.", "full_text": "Transduction with Matrix Completion:\n\nThree Birds with One Stone\n\nAndrew B. Goldberg1, Xiaojin Zhu1, Benjamin Recht1, Jun-Ming Xu1, Robert Nowak2\n\nDepartment of {1Computer Sciences, 2Electrical and Computer Engineering}\n\n{goldberg, jerryzhu, brecht, xujm}@cs.wisc.edu, nowak@ece.wisc.edu\n\nUniversity of Wisconsin-Madison, Madison, WI 53706\n\nAbstract\n\nWe pose transductive classi\ufb01cation as a matrix completion problem. By assuming\nthe underlying matrix has a low rank, our formulation is able to handle three prob-\nlems simultaneously: i) multi-label learning, where each item has more than one\nlabel, ii) transduction, where most of these labels are unspeci\ufb01ed, and iii) miss-\ning data, where a large number of features are missing. We obtained satisfactory\nresults on several real-world tasks, suggesting that the low rank assumption may\nnot be as restrictive as it seems. Our method allows for different loss functions to\napply on the feature and label entries of the matrix. 
The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum.

1 Introduction

Semi-supervised learning methods make assumptions about how unlabeled data can help in the learning process, such as the manifold assumption (data lies on a low-dimensional manifold) and the cluster assumption (classes are separated by low density regions) [4, 16]. In this work, we present two transductive learning methods under the novel assumption that the feature-by-item and label-by-item matrices are jointly low rank. This assumption effectively couples different label prediction tasks, allowing us to implicitly use observed labels in one task to recover unobserved labels in others. The same is true for imputing missing features. In fact, our methods learn in the difficult regime of multi-label transductive learning with missing data that one sometimes encounters in practice. That is, each item is associated with many class labels, many of the items' labels may be unobserved (some items may be completely unlabeled across all labels), and many features may also be unobserved. Our methods build upon recent advances in matrix completion, with efficient algorithms to handle matrices with mixed real-valued features and discrete labels. We obtain promising experimental results on a range of synthetic and real-world data.

2 Problem Formulation

Let x_1 . . . x_n ∈ R^d be feature vectors associated with n items. Let X = [x_1 . . . x_n] be the d × n feature matrix whose columns are the items. Let there be t binary classification tasks, let y_1 . . . y_n ∈ {−1, 1}^t be the label vectors, and let Y = [y_1 . . . y_n] be the t × n label matrix. Entries in X or Y can be missing at random. Let Ω_X be the index set of observed features in X, such that (i, j) ∈ Ω_X if and only if x_ij is observed.
Similarly, let Ω_Y be the index set of observed labels in Y. Our main goal is to predict the missing labels y_ij for (i, j) ∉ Ω_Y. Of course, this reduces to standard transductive learning when t = 1, |Ω_X| = nd (no missing features), and 1 < |Ω_Y| < n (some missing labels). In our more general setting, as a side product we are also interested in imputing the missing features, and de-noising the observed features, in X.

2.1 Model Assumptions

The above problem is in general ill-posed. We now describe our assumptions to make it a well-defined problem. In a nutshell, we assume that X and Y are jointly produced by an underlying low rank matrix. We then take advantage of the sparsity to fill in the missing labels and features using a modified method of matrix completion. Specifically, we assume the following generative story. It starts from a d × n low rank "pre"-feature matrix X^0, with rank(X^0) ≪ min(d, n). The actual feature matrix X is obtained by adding iid Gaussian noise to the entries of X^0: X = X^0 + E, where E_ij ~ N(0, σ_ε²). Meanwhile, the t "soft" labels (y^0_1j . . . y^0_tj)^T ≡ y^0_j ∈ R^t of item j are produced by y^0_j = W x^0_j + b, where W is a t × d weight matrix, and b ∈ R^t is a bias vector. Let Y^0 = [y^0_1 . . . y^0_n] be the soft label matrix. Note the combined (t + d) × n matrix [Y^0; X^0] is low rank too: rank([Y^0; X^0]) ≤ rank(X^0) + 1. The actual label y_ij ∈ {−1, 1} is generated randomly via a sigmoid function: P(y_ij | y^0_ij) = 1/(1 + exp(−y_ij y^0_ij)). Finally, two random masks Ω_X, Ω_Y are applied to expose only some of the entries in X and Y, and we use ω to denote the percentage of observed entries.
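As a sanity check, the generative story above can be simulated in a few lines. The sketch below is illustrative only (numpy, with arbitrary small dimensions; it is not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, t, r, sigma = 20, 100, 10, 2, 0.1   # dimensions are illustrative choices

# Rank-r "pre"-feature matrix X^0 = L R^T
L = rng.standard_normal((d, r))
R = rng.standard_normal((n, r))
X0 = L @ R.T

# Observed features: X = X^0 + iid Gaussian noise
X = X0 + sigma * rng.standard_normal((d, n))

# Soft labels Y^0 = W X^0 + b 1^T, so [Y^0; X^0] has rank <= rank(X^0) + 1
W = rng.standard_normal((t, d))
b = rng.standard_normal((t, 1))
Y0 = W @ X0 + b

# Hard labels drawn through the sigmoid: P(y_ij = 1 | y0_ij) = 1/(1 + exp(-y0_ij))
P = 1.0 / (1.0 + np.exp(-Y0))
Y = np.where(rng.random((t, n)) < P, 1, -1)

# Random masks exposing roughly a fraction omega of the entries
omega = 0.4
mask_X = rng.random((d, n)) < omega
mask_Y = rng.random((t, n)) < omega
```

The stacked matrix [Y^0; X^0] built this way indeed has rank at most rank(X^0) + 1, which is exactly the low-rank structure the formulations below exploit.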
This generative story may seem restrictive, but our approaches based on it perform well on synthetic and real datasets, outperforming several baselines with linear classifiers.

2.2 Matrix Completion for Heterogeneous Matrix Entries

With the above data generation model, our task can be defined as follows. Given the partially observed features and labels as specified by X, Y, Ω_X, Ω_Y, we would like to recover the intermediate low rank matrix [Y^0; X^0]. Then, X^0 will contain the denoised and completed features, and sign(Y^0) will contain the completed and correct labels.

The key assumption is that the (t + d) × n stacked matrix [Y^0; X^0] is of low rank. We will start from a "hard" formulation that is illustrative but impractical, then relax it:

  argmin_{Z ∈ R^((t+d)×n)}  rank(Z)
  s.t.  sign(z_ij) = y_ij, ∀(i, j) ∈ Ω_Y;   z_(i+t)j = x_ij, ∀(i, j) ∈ Ω_X.        (1)

Here, Z is meant to recover [Y^0; X^0] by directly minimizing the rank while obeying the observed features and labels. Note the indices (i, j) ∈ Ω_X are with respect to X, such that i ∈ {1, . . . , d}. To index the corresponding element in the larger stacked matrix Z, we need to shift the row index by t to skip the part for Y^0, hence the constraints z_(i+t)j = x_ij. The above formulation assumes that there is no noise in the generation processes X^0 → X and Y^0 → Y. Of course, there are several issues with formulation (1), and we handle them as follows:

• rank(·) is a non-convex function and difficult to optimize. Following recent work in matrix completion [3, 2], we relax rank(·) with the convex nuclear norm ||Z||_* = Σ_{k=1}^{min(t+d,n)} σ_k(Z), where the σ_k's are the singular values of Z.
The relationship between rank(Z) and ||Z||_* is analogous to that of the ℓ0-norm and ℓ1-norm for vectors.

• There is feature noise from X^0 to X. Instead of the equality constraints in (1), we minimize a loss function c_x(z_(i+t)j, x_ij). We choose the squared loss c_x(u, v) = (1/2)(u − v)² in this work, but other convex loss functions are possible too.

• Similarly, there is label noise from Y^0 to Y. The observed labels are of a different type than the observed features. We therefore introduce another loss function c_y(z_ij, y_ij) to account for the heterogeneous data. In this work, we use the logistic loss c_y(u, v) = log(1 + exp(−uv)).

In addition to these changes, we will model the bias b either explicitly or implicitly, leading to two alternative matrix completion formulations below.

Formulation 1 (MC-b). In this formulation, we explicitly optimize the bias b ∈ R^t in addition to Z ∈ R^((t+d)×n), hence the name. Here, Z corresponds to the stacked matrix [WX^0; X^0] instead of [Y^0; X^0], making it potentially lower rank. The optimization problem is

  argmin_{Z,b}  μ||Z||_* + (λ/|Ω_Y|) Σ_{(i,j)∈Ω_Y} c_y(z_ij + b_i, y_ij) + (1/|Ω_X|) Σ_{(i,j)∈Ω_X} c_x(z_(i+t)j, x_ij),        (2)

where μ, λ are positive trade-off weights. Notice the bias b is not regularized. This is a convex problem, whose optimization procedure will be discussed in section 3. Once the optimal Z, b are found, we recover the task-i label of item j by sign(z_ij + b_i), and feature k of item j by z_(k+t)j.

Formulation 2 (MC-1). In this formulation, the bias is modeled implicitly within Z. Similar to how bias is commonly handled in linear classifiers, we append an additional feature with constant value one to each item.
The corresponding pre-feature matrix is augmented into [X^0; 1^T], where 1 is the all-1 vector. Under the same label assumption y^0_j = W x^0_j + b, the rows of the soft label matrix Y^0 are linear combinations of rows in [X^0; 1^T], i.e., rank([Y^0; X^0; 1^T]) = rank([X^0; 1^T]). We then let Z correspond to the (t + d + 1) × n stacked matrix [Y^0; X^0; 1^T], by forcing its last row to be 1^T (hence the name):

  argmin_{Z ∈ R^((t+d+1)×n)}  μ||Z||_* + (λ/|Ω_Y|) Σ_{(i,j)∈Ω_Y} c_y(z_ij, y_ij) + (1/|Ω_X|) Σ_{(i,j)∈Ω_X} c_x(z_(i+t)j, x_ij)
  s.t.  z_(t+d+1)· = 1^T.        (3)

This is a constrained convex optimization problem. Once the optimal Z is found, we recover the task-i label of item j by sign(z_ij), and feature k of item j by z_(k+t)j.

MC-b and MC-1 differ mainly in what is in Z, which leads to different behaviors of the nuclear norm. Despite the generative story, we do not explicitly recover the weight matrix W in these formulations. Other formulations are certainly possible. One way is to let Z correspond to [Y^0; X^0] directly, without introducing the bias b or the all-1 row, and hope nuclear norm minimization will prevail. This is inferior in our preliminary experiments, and we do not explore it further in this paper.

3 Optimization Techniques

We solve MC-b and MC-1 using modifications of the Fixed Point Continuation (FPC) method of Ma, Goldfarb, and Chen [10].¹ While nuclear norm minimization can be converted into a semidefinite programming (SDP) problem [2], current SDP solvers are severely limited in the size of problems they can solve.
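(As a concrete reference point, the MC-1 objective in (3) is straightforward to evaluate. The following is a minimal numpy sketch, with boolean mask arrays standing in for Ω_Y and Ω_X; it is an illustration under those assumptions, not the authors' implementation.)

```python
import numpy as np

def mc1_objective(Z, X, Y, mask_X, mask_Y, mu, lam):
    """Objective (3): mu*||Z||_* + (lam/|O_Y|)*sum c_y + (1/|O_X|)*sum c_x.

    Z is (t+d+1) x n, conceptually [Y0; X0; 1^T]; X is d x n, Y is t x n.
    """
    t, d = Y.shape[0], X.shape[0]
    nuclear = np.linalg.norm(Z, "nuc")        # nuclear norm: sum of singular values
    Zy, Zx = Z[:t, :], Z[t:t + d, :]          # label block; feature block (rows shifted by t)
    # Logistic loss c_y(u, v) = log(1 + exp(-u*v)) over observed labels
    ly = np.logaddexp(0.0, -Zy[mask_Y] * Y[mask_Y]).sum() / mask_Y.sum()
    # Squared loss c_x(u, v) = (1/2)(u - v)^2 over observed features
    lx = 0.5 * ((Zx[mask_X] - X[mask_X]) ** 2).sum() / mask_X.sum()
    return mu * nuclear + lam * ly + lx
```

Note that `np.logaddexp(0, -u*v)` evaluates log(1 + exp(−uv)) without overflow for large margins.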
Instead, the basic fixed point approach is a computationally efficient alternative, which provably converges to the globally optimal solution and has been shown to outperform SDP solvers in terms of matrix recoverability.

3.1 Fixed Point Continuation for MC-b

We first describe our modified FPC method for MC-b. It differs from [10] in the extra bias variables and multiple loss functions. Our fixed point iterative algorithm to solve the unconstrained problem (2) consists of two alternating steps for each iteration k:

  1. (gradient step)   b^(k+1) = b^k − τ_b g(b^k),  A^k = Z^k − τ_Z g(Z^k)
  2. (shrinkage step)  Z^(k+1) = S_(τ_Z μ)(A^k).

In the gradient step, τ_b and τ_Z are step sizes whose choice will be discussed next. Overloading notation a bit, g(b^k) is the vector gradient, and g(Z^k) is the matrix gradient, respectively, of the two loss terms in (2) (i.e., excluding the nuclear norm term):

  g(b_i) = (λ/|Ω_Y|) Σ_{j:(i,j)∈Ω_Y} −y_ij / (1 + exp(y_ij(z_ij + b_i)))        (4)

  g(z_ij) = (λ/|Ω_Y|) · (−y_ij / (1 + exp(y_ij(z_ij + b_i)))),  if i ≤ t and (i, j) ∈ Ω_Y
          = (1/|Ω_X|)(z_ij − x_(i−t)j),                         if i > t and (i − t, j) ∈ Ω_X        (5)
          = 0,                                                  otherwise

Note for g(z_ij) with i > t, we need to shift down (un-stack) the row index by t in order to map the element in Z back to the item x_(i−t)j.

¹While the primary method of [10] is Fixed Point Continuation with Approximate Singular Value Decomposition (FPCA), where the approximate SVD is used to speed up the algorithm, we opt to use an exact SVD for simplicity and will refer to the method simply as FPC.

Algorithm 1: FPC algorithm for MC-b.
  Input: initial matrix Z0, bias b0, parameters μ, λ, step sizes τ_b, τ_Z
  Determine μ1 > μ2 > · · · > μL = μ > 0.
  Set Z = Z0, b = b0.
  foreach μ = μ1, μ2, . . . , μL do
    while not converged do
      Compute b = b − τ_b g(b), A = Z − τ_Z g(Z)
      Compute the SVD of A: A = UΛV^T
      Compute Z = U max(Λ − τ_Z μ, 0) V^T
    end
  end
  Output: recovered matrix Z, bias b

Algorithm 2: FPC algorithm for MC-1.
  Input: initial matrix Z0, parameters μ, λ, step size τ_Z
  Determine μ1 > μ2 > · · · > μL = μ > 0.
  Set Z = Z0.
  foreach μ = μ1, μ2, . . . , μL do
    while not converged do
      Compute A = Z − τ_Z g(Z)
      Compute the SVD of A: A = UΛV^T
      Compute Z = U max(Λ − τ_Z μ, 0) V^T
      Project Z to the feasible region: z_(t+d+1)· = 1^T
    end
  end
  Output: recovered matrix Z

In the shrinkage step, S_(τ_Z μ)(·) is a matrix shrinkage operator. Let A^k = UΛV^T be the SVD of A^k. Then S_(τ_Z μ)(A^k) = U max(Λ − τ_Z μ, 0)V^T, where the max is elementwise. That is, the shrinkage operator shifts the singular values down, and truncates any negative values to zero. This step reduces the nuclear norm.

Even though the problem is convex, convergence can be slow. We follow [10] and use a continuation or homotopy method to improve the speed. This involves beginning with a large value μ1 > μ and solving a sequence of subproblems, each with a decreasing value and using the previous solution as its initial point. The sequence of values is determined by a decay parameter η_μ: μ_(k+1) = max{μ_k η_μ, μ}, k = 1, . . . , L − 1, where μ is the final value to use, and L is the number of rounds of continuation.
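One iteration of Algorithm 2 (gradient step, singular value shrinkage, projection onto the all-ones-row constraint) can be sketched as follows. This is a minimal numpy illustration of the steps above, without MC-b's bias variable; it is not the authors' implementation:

```python
import numpy as np

def shrink(A, thresh):
    """Singular value shrinkage S_thresh(A): soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - thresh, 0.0)) @ Vt

def fpc_mc1_step(Z, X, Y, mask_X, mask_Y, mu, lam, tau):
    """One gradient + shrinkage + projection iteration for MC-1 (cf. Algorithm 2)."""
    t, d = Y.shape[0], X.shape[0]
    G = np.zeros_like(Z)
    # Label rows: gradient of the logistic loss, as in (4)/(5) but without the bias b
    Zy = Z[:t, :]
    G[:t, :][mask_Y] = (lam / mask_Y.sum()) * (
        -Y[mask_Y] / (1.0 + np.exp(Y[mask_Y] * Zy[mask_Y])))
    # Feature rows: gradient of the squared loss
    Zx = Z[t:t + d, :]
    G[t:t + d, :][mask_X] = (Zx[mask_X] - X[mask_X]) / mask_X.sum()
    A = Z - tau * G                      # gradient step
    Z_new = shrink(A, tau * mu)          # shrinkage step
    Z_new[-1, :] = 1.0                   # project back onto the all-ones-row constraint
    return Z_new
```

In a full solver, this step would be wrapped in the inner convergence loop and the outer continuation loop over decreasing μ values.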
The complete FPC algorithm for MC-b is listed in Algorithm 1.

A minor modification of the argument in [10] reveals that as long as we choose non-negative step sizes satisfying τ_b < 4|Ω_Y|/(λn) and τ_Z < min{4|Ω_Y|/λ, |Ω_X|}, the algorithm MC-b is guaranteed to converge to a global optimum. Indeed, to guarantee convergence, we only need that the gradient step is non-expansive in the sense that

  ||b_1 − τ_b g(b_1) − b_2 + τ_b g(b_2)||² + ||Z_1 − τ_Z g(Z_1) − Z_2 + τ_Z g(Z_2)||²_F ≤ ||b_1 − b_2||² + ||Z_1 − Z_2||²_F

for all b_1, b_2, Z_1, and Z_2. Our choice of τ_b and τ_Z guarantees such non-expansiveness. Once this non-expansiveness is satisfied, the remainder of the convergence analysis is the same as in [10].

3.2 Fixed Point Continuation for MC-1

Our modified FPC method for MC-1 is similar except for two differences. First, there is no bias variable b. Second, the shrinkage step will in general not satisfy the all-1-row constraints in (3). Thus, we add a third projection step at the end of each iteration to project Z^(k+1) back to the feasible region, by simply setting its last row to all 1's. The complete algorithm for MC-1 is given in Algorithm 2. We were unable to prove convergence for this gradient + shrinkage + projection algorithm. Nonetheless, in our empirical experiments, Algorithm 2 always converges and tends to outperform MC-b. The two algorithms have about the same convergence speed.

4 Experiments

We now empirically study the ability of matrix completion to perform multi-label transductive classification when there is missing data. We first present a family of 24 experiments on a synthetic task by systematically varying different aspects of the task, including the rank of the problem, noise level, number of items, and observed label and feature percentage. We then present experiments on two real-world datasets: music emotions and yeast microarray.
In each experiment, we compare MC-b and MC-1 against four other baseline algorithms. Our results show that MC-1 consistently outperforms the other methods, and MC-b follows closely.

Parameter Tuning and Other Settings for MC-b and MC-1: To tune the parameters μ and λ, we use 5-fold cross validation (CV) separately for each experiment. Specifically, we randomly divide Ω_X and Ω_Y into five disjoint subsets each. We then run our matrix completion algorithms using 4/5 of the observed entries, measure their performance on the remaining 1/5, and average over the five folds. Since our main goal is to predict unobserved labels, we use label error as the CV performance criterion to select parameters. Note that tuning μ is quite efficient since all values under consideration can be evaluated in one run of the continuation method. We set η_μ = 0.25 and, as in [10], consider μ values starting at σ_1 η_μ, where σ_1 is the largest singular value of the matrix of observed entries in [Y; X] (with the unobserved entries set to 0), and decrease μ until 10^−5. The range of λ values considered was {10^−3, 10^−2, 10^−1, 1}. We initialized b0 to be all zero and Z0 to be the rank-1 approximation of the matrix of observed entries in [Y; X] (with unobserved entries set to 0), obtained by performing an SVD and reconstructing the matrix using only the largest singular value and the corresponding left and right singular vectors. The step sizes were set as follows: τ_Z = min(3.8|Ω_Y|/λ, |Ω_X|), τ_b = 3.8|Ω_Y|/(λn). Convergence was defined as relative change in the objective functions (2)(3) smaller than 10^−5.

Baselines: We compare to the following baselines, each consisting of some missing feature imputation step on X first, then using a standard SVM to predict the labels: [FPC+SVM] Matrix completion on X alone using FPC [10].
[EM(k)+SVM] Expectation Maximization algorithm to impute missing X entries using a mixture of k Gaussian components. As in [9], missing features, mixing component parameters, and the assignments of items to components are treated as hidden variables, which are estimated in an iterative manner to maximize the likelihood of the data. [Mean+SVM] Impute each missing feature by the mean of the observed entries for that feature. [Zero+SVM] Impute missing features by filling in zeros.

After imputation, an SVM is trained using the available (noisy) labels in Ω_Y for that task, and predictions are made for the rest of the labels. All SVMs are linear, trained using SVMlin², and the regularization parameter is tuned using 5-fold cross validation separately for each task. The range of parameter values considered was {10^−8, 10^−7, . . . , 10^7, 10^8}.

Evaluation Method: To evaluate performance, we consider two measures: transductive label error, i.e., the percentage of unobserved labels predicted incorrectly; and relative feature imputation error (Σ_{ij∉Ω_X} (x_ij − x̂_ij)²) / (Σ_{ij∉Ω_X} x_ij²), where x̂ is the predicted feature value. In the tables below, for each parameter setting, we report the mean performance (and standard deviation in parentheses) of the different algorithms over 10 random trials. The best algorithm within each parameter setting, as well as any statistically indistinguishable algorithms via a two-tailed paired t-test at significance level α = 0.05, are marked in bold.

4.1 Synthetic Data Experiments

Synthetic Data Generation: We generate a family of synthetic datasets to systematically explore the performance of the algorithms. We first create a rank-r matrix X^0 = LR^T, where L ∈ R^(d×r) and R ∈ R^(n×r) with entries drawn iid from N(0, 1). We then normalize X^0 such that its entries have variance 1.
Next, we create a weight matrix W ∈ R^(t×d) and bias vector b ∈ R^t, with all entries drawn iid from N(0, 10). We then produce X, Y^0, Y according to section 2.1. Finally, we produce the random Ω_X, Ω_Y masks with ω percent observed entries.

Using the above procedure, we vary ω = 10%, 20%, 40%, n = 100, 400, r = 2, 4, and σ_ε² = 0.01, 0.1, while fixing t = 10, d = 20, to produce 24 different parameter settings. For each setting, we generate 10 trials, where the randomness is in the data and mask.

Synthetic experiment results: Table 1 shows the transductive label errors, and Table 2 shows the relative feature imputation errors, on the synthetic datasets. We make several observations.

²http://vikas.sindhwani.org/svmlin.html

Table 1: Transductive label error of six algorithms on the 24 synthetic datasets. The varying parameters are feature noise σ_ε², rank(X^0) = r, number of items n, and observed label and feature percentage ω. Each row is for a unique parameter combination. Each cell shows the mean (standard deviation) of transductive label error (in percentage) over 10 random trials. The "meta-average" row is the simple average over all parameter settings and all trials.

  [Table 1 body: 24 parameter settings (σ_ε² ∈ {0.01, 0.1}, r ∈ {2, 4}, n ∈ {100, 400}, ω ∈ {10%, 20%, 40%}) × 6 algorithms. Meta-average label error (%): MC-1 21.4, FPC+SVM 22.6, Mean+SVM 24.1, MC-b 25.6, EM1+SVM 28.6, Zero+SVM 28.0.]

Observation 1: MC-b and MC-1 are the best for feature imputation, as Table 2 shows. However, the imputations are not perfect, because in these particular parameter settings the ratio of the number of observed entries to the degrees of freedom needed to describe the feature matrix (i.e., r(d + n − r)) is below the necessary condition for perfect matrix completion [2], and because there is some feature noise. Furthermore, our CV tuning procedure selects the parameters μ, λ to optimize label error, which often leads to suboptimal imputation performance. In a separate experiment (not reported here) when we made the ratio sufficiently large and without noise, and specifically tuned for imputation error, both MC-b and MC-1 did achieve perfect feature imputation. Also, FPC+SVM is slightly worse in feature imputation. This may seem curious as FPC focuses exclusively on imputing X. We believe the fact that MC-b and MC-1 can use information in Y to enhance feature imputation in X made them better than FPC+SVM.

Observation 2: MC-1 is the best for multi-label transductive classification, as suggested by Table 1. Surprisingly, the feature imputation advantage of MC-b did not translate into classification, and FPC+SVM took second place.

Observation 3: The same factors that affect standard matrix completion also affect the classification performance of MC-b and MC-1. As the tables show, everything else being equal, less feature noise (smaller σ_ε²), lower rank r, more items, or more observed features and labels reduce label error. A beneficial combination of these factors (the 6th row) produces the lowest label errors.

Matrix completion benefits from more tasks. We performed one additional synthetic data experiment examining the effect of t (the number of tasks) on MC-b and MC-1, with the remaining data parameters fixed at ω = 10%, n = 400, r = 2, d = 20, and σ_ε² = 0.01. Table 3 reveals that both MC methods achieve statistically significantly better label prediction and imputation performance with t = 10 than with only t = 2 (as determined by two-sample t-tests at significance level 0.05).

4.2 Music Emotions Data Experiments

In this task introduced by Trohidis et al. [14], the goal is to predict which of several types of emotion are present in a piece of music. The data³ consists of n = 593 songs of a variety of musical genres, each labeled with one or more of t = 6 emotions (i.e., amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, and angry-fearful).
Each song is represented by d = 72 features (8 rhythmic, 64 timbre-based) automatically extracted from a 30-second sound clip.

³Available at http://mulan.sourceforge.net/datasets.html

Table 2: Relative feature imputation error on the synthetic datasets. The algorithm Zero+SVM is not shown because it by definition has relative feature imputation error 1.

  [Table 2 body: 24 parameter settings × 5 algorithms. Meta-average relative imputation error: MC-b 0.66, MC-1 0.66, FPC+SVM 0.68, EM1+SVM 0.78, Mean+SVM 1.02.]

Table 3: More tasks help matrix completion (ω = 10%, n = 400, r = 2, d = 20, σ_ε² = 0.01).

        transductive label error          relative feature imputation error
  t     MC-b       MC-1       FPC+SVM     MC-b       MC-1       FPC+SVM
  2     30.1(2.8)  22.9(2.2)  20.5(2.5)   0.78(0.07) 0.78(0.04) 0.76(0.03)
  10    26.5(2.0)  19.9(1.7)  23.7(1.7)   0.73(0.03) 0.72(0.04) 0.76(0.03)

We vary the percentage of observed entries ω = 40%, 60%, 80%. For each ω, we run 10 random trials with different masks Ω_X, Ω_Y. For this dataset, we tuned only μ with CV, and set λ = 1.

Table 4: Performance on the music emotions data.

  transductive label error              relative feature imputation error
  Algorithm   ω=40%      60%        80%         ω=40%      60%        80%
  MC-b        28.0(1.2)  25.2(1.0)  22.2(1.6)   0.69(0.05) 0.54(0.10) 0.41(0.02)
  MC-1        27.4(0.8)  23.7(1.6)  19.8(2.4)   0.60(0.05) 0.46(0.12) 0.25(0.03)
  FPC+SVM     26.9(0.7)  25.2(1.6)  24.4(2.0)   0.64(0.01) 0.46(0.02) 0.31(0.03)
  EM1+SVM     26.0(1.1)  23.6(1.1)  21.2(2.3)   0.46(0.09) 0.23(0.04) 0.13(0.01)
  EM4+SVM     26.2(0.9)  23.1(1.2)  21.6(1.6)   0.49(0.10) 0.27(0.04) 0.15(0.02)
  Mean+SVM    26.3(0.8)  24.2(1.0)  22.6(1.3)   0.18(0.00) 0.19(0.00) 0.20(0.01)
  Zero+SVM    30.3(0.6)  28.9(1.1)  25.7(1.4)   1.00(0.00) 1.00(0.00) 1.00(0.00)

The results are in Table 4. Most importantly, these results show that MC-1 is useful for this real-world multi-label classification problem, leading to the best (or statistically indistinguishable from the best) transductive error performance with 60% and 80% of the data available, and close to the best with only 40%.

We also compared these algorithms against an "oracle baseline" (not shown in the table).
In this baseline, we give 100% of the features (i.e., no indices are missing from Ω_X) and the training labels in Ω_Y to a standard SVM, and let it predict the unspecified labels. On the same random trials, for observed percentage ω = 40%, 60%, 80%, the oracle baseline achieved label error rates of 22.1(0.8), 21.3(0.8), and 20.5(1.8), respectively. Interestingly, MC-1 with ω = 80% (19.8) is statistically indistinguishable from the oracle baseline.

4.3 Yeast Microarray Data Experiments

This dataset comes from a biological domain and involves the problem of Yeast gene functional classification. We use the data studied by Elisseeff and Weston [5], which contains n = 2417 examples (Yeast genes) with d = 103 input features (results from microarray experiments).⁴ We follow the approach of [5] and predict each gene's membership in t = 14 functional classes. For this larger dataset, we omitted the computationally expensive EM4+SVM method, and tuned only μ for matrix completion while fixing λ = 1.

Table 5 reveals that MC-b leads to statistically significantly lower transductive label error on this biological dataset. Although not highlighted in the table, MC-1 is also statistically better than the SVM methods in label error. In terms of feature imputation performance, the MC methods are weaker than FPC+SVM. However, simultaneously predicting the missing labels and features appears to provide a large advantage to the MC methods. It should be pointed out that all algorithms except Zero+SVM in fact have small but non-zero standard deviations on imputation error, despite what the fixed-point formatting in the table suggests.
For instance, with ω = 40%, the standard deviation is 0.0009 for MC-1, 0.0011 for FPC+SVM, and 0.0001 for Mean+SVM.

Again, we compared these algorithms to an oracle SVM baseline with 100% observed entries in ΩX. The oracle SVM approach achieves label errors of 20.9(0.1), 20.4(0.2), and 20.1(0.3) for ω = 40%, 60%, and 80% observed labels, respectively. Both MC-b and MC-1 significantly outperform this oracle under paired t-tests at significance level 0.05. We attribute this advantage to a combination of multi-label learning and transduction that is intrinsic to our matrix completion methods.

Table 5: Performance on the yeast data.

                transductive label error              relative feature imputation error
Algorithm     ω =40%      60%         80%         ω =40%      60%         80%
MC-b          16.1(0.3)   12.2(0.3)   8.7(0.4)    0.83(0.02)  0.76(0.00)  0.73(0.02)
MC-1          16.7(0.3)   13.0(0.2)   8.5(0.4)    0.86(0.00)  0.92(0.00)  0.74(0.00)
FPC+SVM       21.5(0.3)   20.8(0.3)   20.3(0.3)   0.81(0.00)  0.76(0.00)  0.72(0.00)
EM1+SVM       22.0(0.2)   21.2(0.2)   20.4(0.2)   1.15(0.02)  1.04(0.02)  0.77(0.01)
Mean+SVM      21.7(0.2)   21.1(0.2)   20.5(0.4)   1.00(0.00)  1.00(0.00)  1.00(0.00)
Zero+SVM      21.6(0.2)   21.1(0.2)   20.5(0.4)   1.00(0.00)  1.00(0.00)  1.00(0.00)

5 Discussions and Future Work

We have introduced two matrix completion methods for multi-label transductive learning with missing features, which outperformed several baselines. In terms of problem formulation, our methods differ considerably from sparse multi-task learning [11, 1, 13] in that we regularize the feature and label matrix directly, without ever learning explicit weight vectors. Our methods also differ from multi-label prediction via reduction to binary classification or ranking [15], and via compressed sensing [7], which assumes sparsity in the sense that each item has a small number of positive labels, rather than assuming the feature matrix is low rank.
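Because our formulation regularizes the stacked feature-and-label matrix directly via the nuclear norm, the central computational primitive of the fixed-point continuation solver [10] is singular-value shrinkage, the proximal operator of the nuclear norm. A minimal sketch of that primitive follows; the test matrix and the threshold τ are illustrative, not the tuned values from our experiments.

```python
import numpy as np

def svd_shrink(Z, tau):
    """Proximal operator of tau * (nuclear norm): soft-threshold the
    singular values of Z. This is the shrinkage step at the heart of
    fixed-point continuation methods for nuclear norm minimization."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 8)) @ rng.standard_normal((8, 5))  # rank <= 5
Z_shrunk = svd_shrink(Z, tau=1.0)

# Shrinkage reduces every singular value by tau (clipping at zero) and
# never increases rank, which is what drives the iterates toward low rank.
print(np.linalg.matrix_rank(Z), np.linalg.matrix_rank(Z_shrunk))
```

Alternating a gradient step on the observed feature and label losses with this shrinkage step gives a fixed-point iteration of the kind used by our solver.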
These alternative methods do not naturally allow for missing features. Yet other multi-label methods identify a subspace of highly predictive features across tasks in a first stage, and learn in this subspace in a second stage [8, 12]. Our methods do not require separate stages. Learning in the presence of missing data typically involves imputation followed by learning with the completed data [9]. Our methods perform imputation plus learning in one step, similar to EM on missing labels and features [6], but the underlying model assumption is quite different.

A drawback of our methods is their restriction to linear classifiers. One future extension is to explicitly map the partial feature matrix to a partially observed polynomial (or other) kernel Gram matrix, and apply our methods there. Though such a mapping proliferates the missing entries, we hope that the low-rank structure in the kernel matrix will allow us to recover labels that are nonlinear functions of the original features.

Acknowledgements: This work is supported in part by NSF IIS-0916038, NSF IIS-0953219, AFOSR FA9550-09-1-0313, and AFOSR FA9550-09-1-0423. We also wish to thank Brian Eriksson for useful discussions and source code implementing EM-based imputation.

4 Available at http://mulan.sourceforge.net/datasets.html

References

[1] Andreas Argyriou, Charles A. Micchelli, and Massimiliano Pontil. On spectral learning. Journal of Machine Learning Research, 11:935–953, 2010.

[2] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2009.

[3] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56:2053–2080, 2010.

[4] Olivier Chapelle, Alexander Zien, and Bernhard Schölkopf, editors.
Semi-Supervised Learning. MIT Press, 2006.

[5] André Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, NIPS, pages 681–687. MIT Press, 2001.

[6] Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems 6, pages 120–127. Morgan Kaufmann, 1994.

[7] Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 772–780. 2009.

[8] Shuiwang Ji, Lei Tang, Shipeng Yu, and Jieping Ye. Extracting shared subspace for multi-label classification. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 381–389, New York, NY, USA, 2008. ACM.

[9] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Wiley-Interscience, 2nd edition, September 2002.

[10] Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming Series A, to appear (published online September 23, 2009).

[11] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.

[12] Piyush Rai and Hal Daume. Multi-label prediction via sparse infinite CCA. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1518–1526. 2009.

[13] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm.
In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560. Springer-Verlag, 2005.

[14] K. Trohidis, G. Tsoumakas, G. Kalliris, and I. Vlahavas. Multilabel classification of music into emotions. In Proc. 9th International Conference on Music Information Retrieval (ISMIR 2008), Philadelphia, PA, USA, 2008.

[15] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook. Springer, 2nd edition, 2010.

[16] Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Morgan & Claypool, 2009.