{"title": "Multi-Label Prediction via Sparse Infinite CCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1518, "page_last": 1526, "abstract": "Canonical Correlation Analysis (CCA) is a useful technique for modeling dependencies between two (or more) sets of variables. Building upon the recently suggested probabilistic interpretation of CCA, we propose a nonparametric, fully Bayesian framework that can automatically select the number of correlation components, and effectively capture the sparsity underlying the projections. In addition, given (partially) labeled data, our algorithm can also be used as a (semi)supervised dimensionality reduction technique, and can be applied to learn useful predictive features in the context of learning a set of related tasks. Experimental results demonstrate the efficacy of the proposed approach for both CCA as a stand-alone problem, and when applied to multi-label prediction.", "full_text": "Multi-label Prediction via Sparse In\ufb01nite CCA\n\nPiyush Rai and Hal Daum\u00b4e III\n\nSchool of Computing, University of Utah\n\n{piyush,hal}@cs.utah.edu\n\nAbstract\n\nCanonical Correlation Analysis (CCA) is a useful technique for modeling de-\npendencies between two (or more) sets of variables. Building upon the re-\ncently suggested probabilistic interpretation of CCA, we propose a nonparametric,\nfully Bayesian framework that can automatically select the number of correla-\ntion components, and effectively capture the sparsity underlying the projections.\nIn addition, given (partially) labeled data, our algorithm can also be used as a\n(semi)supervised dimensionality reduction technique, and can be applied to learn\nuseful predictive features in the context of learning a set of related tasks. 
Experimental results demonstrate the efficacy of the proposed approach for both CCA as a stand-alone problem, and when applied to multi-label prediction.

1 Introduction

Learning with examples having multiple labels is an important problem in machine learning and data mining. Such problems are encountered in a variety of application domains. For example, in text classification, a document (e.g., a newswire story) can be associated with multiple categories. Likewise, in bio-informatics, a gene or protein usually performs several functions. All these settings suggest a common underlying problem: predicting multivariate responses. When the responses come from a discrete set, the problem is termed multi-label classification. The aforementioned setting is a special case of multitask learning [6] in which predicting each label is a task and all the tasks share a common source of input. An important characteristic of these problems is that the labels are not independent of each other but often have significant correlations with each other. A naïve approach to learning in such settings is to train a separate classifier for each label. However, such an approach ignores the label correlations and leads to sub-optimal performance [20].

In this paper, we show how Canonical Correlation Analysis (CCA) [11] can be used to exploit label relatedness, learning multiple prediction problems simultaneously. CCA is a useful technique for modeling dependencies between two (or more) sets of variables. One important application of CCA is in supervised dimensionality reduction, albeit in the more general setting where each example has several labels. In this setting, CCA on an input-output pair (X, Y) can be used to project the inputs X to a low-dimensional space directed by the label information Y. 
This makes CCA an ideal candidate for\nextracting useful predictive features from data in the context of multi-label prediction problems.\n\nThe classical CCA formulation, however, has certain inherent limitations. It is non-probabilistic\nwhich means that it cannot deal with missing data, and precludes a Bayesian treatment which can\nbe important if the dataset size is small. An even more crucial issue is choosing the number of cor-\nrelation components, which is traditionally dealt with by using cross-validation, or model-selection\n[21]. Another issue is the potential sparsity [18] of the underlying projections that is ignored by the\nstandard CCA formulation.\n\nBuilding upon the recently suggested probabilistic interpretation of CCA [3], we propose a nonpara-\nmetric, fully Bayesian framework that can deal with each of these issues. In particular, the proposed\nmodel can automatically select the number of correlation components, and effectively capture the\n\n1\n\n\fsparsity underlying the projections. Our framework is based on the Indian Buffet Process [9], a\nnonparametric Bayesian model to discover latent feature representation of a set of observations. In\naddition, our probabilistic model allows dealing with missing data and, in the supervised dimension-\nality reduction case, can incorporate additional unlabeled data one may have access to, making our\nCCA algorithm work in a semi-supervised setting. Thus, apart from being a general, nonparamet-\nric, fully Bayesian solution to the CCA problem, our framework can be readily applied for learning\nuseful predictive features from labeled (or partially labeled) data in the context of learning a set of\nrelated tasks.\n\nThis paper is organized as follows. Section 2 introduces the CCA problem and its recently proposed\nprobabilistic interpretation. In section 3, we describe our general framework for in\ufb01nite CCA. 
Section 4 gives a concrete example of an application (multi-label learning) where the proposed approach can be applied. In particular, we describe a fully supervised setting (when the test data is not available at the time of training), and a semi-supervised setting with partial labels (when we have access to the test data at the time of training). We describe our experiments in section 5, and discuss related work in section 6, drawing connections of the proposed method with previously proposed ones for this problem.

2 Canonical Correlation Analysis

Canonical correlation analysis (CCA) is a useful technique for modeling the relationships among a set of variables. CCA computes a low-dimensional shared embedding of a set of variables such that the correlations among the variables are maximized in the embedded space.

More formally, given a pair of variables x ∈ R^D1 and y ∈ R^D2, CCA seeks to find linear projections u_x and u_y such that the variables are maximally correlated in the projected space. The correlation coefficient between the two variables in the embedded space is given by

ρ = (u_x^T x y^T u_y) / sqrt( (u_x^T x x^T u_x) (u_y^T y y^T u_y) )

Since the correlation is not affected by rescaling of the projections u_x and u_y, CCA can be posed as a constrained optimization problem:

max_{u_x, u_y} u_x^T x y^T u_y, subject to: u_x^T x x^T u_x = 1, u_y^T y y^T u_y = 1

It can be shown that the above formulation is equivalent to solving the following generalized eigenvalue problem:

( 0     Σ_xy ) ( u_x )       ( Σ_xx   0    ) ( u_x )
( Σ_yx  0    ) ( u_y )  =  ρ ( 0      Σ_yy ) ( u_y )

where Σ denotes the covariance matrix of size D × D (with D = D1 + D2) obtained from the data samples X = [x_1, . . . , x_n] and Y = [y_1, . . . , y_n].

2.1 Probabilistic CCA

Bach and Jordan [3] gave a probabilistic interpretation of CCA by posing it as a latent variable model. 
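As a concrete illustration of the classical formulation above, the following sketch (helper names are ours, not from the paper) computes the canonical directions via a whitened SVD, which is algebraically equivalent to the generalized eigenvalue problem; a small ridge term is added so the covariance inverses exist:

```python
import numpy as np

def classical_cca(X, Y, k, reg=1e-8):
    """Canonical directions for data matrices X (D1 x N) and Y (D2 x N)."""
    X = X - X.mean(axis=1, keepdims=True)   # center, as the derivation assumes
    Y = Y - Y.mean(axis=1, keepdims=True)
    n = X.shape[1]
    Sxx = X @ X.T / n + reg * np.eye(X.shape[0])   # Sigma_xx (regularized)
    Syy = Y @ Y.T / n + reg * np.eye(Y.shape[0])   # Sigma_yy (regularized)
    Sxy = X @ Y.T / n                              # Sigma_xy

    def inv_sqrt(S):
        w, U = np.linalg.eigh(S)
        return U @ np.diag(w ** -0.5) @ U.T

    Cx, Cy = inv_sqrt(Sxx), inv_sqrt(Syy)
    # SVD of the whitened cross-covariance: singular values are the
    # canonical correlations rho; singular vectors give u_x and u_y.
    U, rho, Vt = np.linalg.svd(Cx @ Sxy @ Cy)
    return Cx @ U[:, :k], Cy @ Vt[:k].T, rho[:k]
```

Under the normalization constraints u_x^T Σ_xx u_x = 1 and u_y^T Σ_yy u_y = 1, the recovered ρ values coincide with the generalized eigenvalues ρ above.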
To see this, let x and y be two random vectors of size D1 and D2. Let us now consider the following latent variable model:

z ∼ Nor(0, I_K), K ≤ min{D1, D2}
x ∼ Nor(µ_x + W_x z, Ψ_x), W_x ∈ R^(D1×K), Ψ_x ⪰ 0
y ∼ Nor(µ_y + W_y z, Ψ_y), W_y ∈ R^(D2×K), Ψ_y ⪰ 0

Equivalently, we can also write the above as

[x; y] ∼ Nor(µ + Wz, Ψ)

where µ = [µ_x; µ_y], W = [W_x; W_y], and Ψ is a block-diagonal matrix consisting of Ψ_x and Ψ_y on its diagonal. [.; .] denotes row-wise concatenation. The latent variable z is shared between x and y.

Bach and Jordan [3] showed that, given the maximum likelihood solution for the model parameters, the expectations E(z|x) and E(z|y) of the latent variable z lie in the same subspace that classical CCA finds, thereby establishing the equivalence between the above probabilistic model and CCA.

The probabilistic interpretation opens doors to several extensions of the basic setup proposed in [3], which suggested a maximum likelihood approach for parameter estimation. However, it still assumes an a priori fixed number of canonical correlation components. In addition, another important issue is the sparsity of the underlying projection matrix, which is usually ignored.

3 The Infinite Canonical Correlation Analysis Model

Recall that the CCA problem can be defined as [x; y] ∼ Nor(Wz, Ψ) (assuming centered data). A crucial issue in the CCA model is choosing the number of canonical correlation components, which is set to a fixed value in classical CCA (and even in the probabilistic extensions of CCA). In the Bayesian formulation of CCA, one can use the Automatic Relevance Determination (ARD) prior [5] on the projection matrix W, which gives a way to select this number. 
However, it would be more appropriate to have a principled way to automatically figure out this number based on the data.

We propose a nonparametric Bayesian model that selects the number of canonical correlation components automatically. More specifically, we use the Indian Buffet Process [9] (IBP) as a nonparametric prior on the projection matrix W. The IBP prior allows W to have an unbounded number of columns, which gives a way to automatically determine the dimensionality K of the latent space associated with Z.

3.1 The Indian Buffet Process

The Indian Buffet Process [9] defines a distribution over infinite binary matrices, originally motivated by the need to model the latent feature structure of a given set of observations. The IBP has been a model of choice in a variety of non-parametric Bayesian approaches, such as for factorial structure learning, learning causal structures, modeling dyadic data, modeling overlapping clusters, and several others [9].

In the latent feature model, each observation can be thought of as being explained by a set of latent features. Given an N × D matrix X of N observations having D features each, we can consider a decomposition of the form X = ZA + E, where Z is an N × K binary feature-assignment matrix describing which features are present in each observation. Z_{n,k} is 1 if feature k is present in observation n, and is 0 otherwise. A is a K × D matrix of feature scores, and the matrix E consists of observation-specific noise. A crucial issue in such models is choosing the number K of latent features. 
The standard formulation of the IBP lets us define a prior over the binary matrix Z such that it can have an unbounded number of columns, and it can thus be a suitable prior in problems dealing with such structures.

The IBP derivation starts by defining a finite model over the K columns of an N × K binary matrix:

P(Z) = ∏_{k=1}^{K} [ (α/K) Γ(m_k + α/K) Γ(N − m_k + 1) ] / Γ(N + 1 + α/K)    (1)

Here m_k = Σ_i Z_{ik}. In the limiting case, as K → ∞, it was shown in [9] that the binary matrix Z generated by the IBP is equivalent to one produced by a sequential generative process. This equivalence can be best understood by a culinary analogy of customers coming to an Indian restaurant and selecting dishes from an infinite array of dishes. In this analogy, customers represent observations and dishes represent latent features. Customer 1 selects Poisson(α) dishes to begin with. Thereafter, each incoming customer n selects an existing dish k with probability m_k/n, where m_k denotes how many previous customers chose that particular dish. Customer n then goes on to additionally select Poisson(α/n) new dishes. This process generates a binary matrix Z with rows representing customers and columns representing dishes.

Figure 1: The graphical model depicts the fully supervised case when all variables X and Y are observed. The semi-supervised case can have X and/or Y consisting of missing values as well. The graphical model structure remains the same.

Many real world datasets have a sparseness property, which means that each observation depends only on a subset of all the K latent features. This means that the binary matrix Z is expected to be reasonably sparse for many datasets. 
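The sequential process just described is easy to simulate directly; the following sketch (function and variable names are ours, not from the paper) draws a binary matrix by walking customers through the buffet:

```python
import numpy as np

def simulate_ibp(num_customers, alpha, rng):
    """Draw Z from the IBP culinary process: customer n takes existing dish k
    with probability m_k / n, then Poisson(alpha / n) brand-new dishes."""
    counts = []            # m_k: how many customers have taken dish k so far
    choices = []
    for n in range(1, num_customers + 1):
        row = [rng.random() < m / n for m in counts]     # revisit existing dishes
        new_dishes = rng.poisson(alpha / n)              # open new dishes
        counts = [m + took for m, took in zip(counts, row)] + [1] * new_dishes
        choices.append(row + [True] * new_dishes)
    Z = np.zeros((num_customers, len(counts)), dtype=int)
    for i, row in enumerate(choices):
        Z[i, :len(row)] = row                            # pad later columns with 0
    return Z
```

The number of columns (dishes) is not fixed in advance; it grows with the data, which is exactly the property used to determine K automatically.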
This makes the IBP a suitable choice for also capturing the underlying sparsity, in addition to automatically discovering the number of latent features.

3.2 The Infinite CCA Model

In our proposed framework, the matrix W consisting of canonical correlation vectors is modeled using an IBP prior. However, since W can be real-valued and the IBP prior is defined only for binary matrices, we represent the (D1 + D2) × K matrix W as (B ⊙ V), where B = [B_x; B_y] is a (D1 + D2) × K binary matrix, V = [V_x; V_y] is a (D1 + D2) × K real-valued matrix, and ⊙ denotes their element-wise (Hadamard) product. We place an IBP prior on B that automatically determines K, and a Gaussian prior on V. Note that B and V have the same number of columns. Under this model, two random vectors x and y can be modeled as x = (B_x ⊙ V_x)z + E_x and y = (B_y ⊙ V_y)z + E_y. Here z is shared between x and y, and E_x and E_y are observation-specific noise.

In the full model, X = [x_1, . . . , x_N] is a D1 × N matrix consisting of N samples of D1 dimensions each, and Y = [y_1, . . . , y_N] is another matrix consisting of N samples of D2 dimensions each. Here is the generative story for our basic model:

B ∼ IBP(α)
V ∼ Nor(0, σ_v² I), σ_v ∼ IG(a, b)
Z ∼ Nor(0, I)
[X; Y] ∼ Nor((B ⊙ V)Z, Ψ)

where Ψ is a diagonal matrix of size D × D with D = (D1 + D2), with each diagonal entry having an inverse-Gamma prior.

Since our model is probabilistic, it can also deal with the problem when X or Y have missing entries. This is particularly important in the case of supervised dimensionality reduction (i.e., X consisting of inputs and Y of associated responses) when the labels for some of the inputs are unknown, making it a model for semi-supervised dimensionality reduction with partially labeled data. 
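A finite-truncation sketch of this generative story helps make the pieces concrete. The snippet below (names are ours) truncates the IBP to K_max columns and uses a fixed Bernoulli support prior on B as a stand-in for a proper IBP draw, then forward-samples [X; Y] = (B ⊙ V)Z + E:

```python
import numpy as np

def generate_infinite_cca(D1, D2, N, K_max, rng, pi=0.5, sigma_v=1.0, noise=0.1):
    """Forward-sample the truncated model: data = (B * V) @ Z + noise.

    B is a binary support matrix (Bernoulli(pi) stand-in for the IBP draw),
    V holds Gaussian weights, and Z holds the shared latent coordinates."""
    D = D1 + D2
    B = (rng.random((D, K_max)) < pi).astype(float)   # sparse binary support
    V = sigma_v * rng.standard_normal((D, K_max))     # Gaussian weights
    Z = rng.standard_normal((K_max, N))               # shared latents, Nor(0, I)
    W = B * V                                         # Hadamard product B ⊙ V
    data = W @ Z + noise * rng.standard_normal((D, N))
    return data[:D1], data[D1:], W                    # split back into X, Y
```

The Hadamard construction is what gives W exact zeros wherever B is zero, which is how the model expresses sparsity in the projections.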
In addition, placing the IBP prior on the projection matrix W (via the binary matrix B) also helps in capturing the sparsity in W (see the results section for evidence).

3.3 Inference

We take a fully Bayesian approach by treating everything as latent variables and computing the posterior distributions over them. We use Gibbs sampling with a few Metropolis-Hastings steps to do inference in this model. In what follows, D denotes the data [X; Y], B = [B_x; B_y], and V = [V_x; V_y].

Sampling B: Sampling the binary IBP matrix B consists of sampling existing dishes, proposing new dishes, and accepting or rejecting them based on the acceptance ratio in the associated M-H step. For sampling existing dishes, an entry in B is set to 1 according to p(B_{ik} = 1 | D, B_{−ik}, V, Z, Ψ) ∝ (m_{−i,k}/D) p(D | B, V, Z, Ψ), whereas it is set to 0 according to p(B_{ik} = 0 | D, B_{−ik}, V, Z, Ψ) ∝ ((D − m_{−i,k})/D) p(D | B, V, Z, Ψ). Here m_{−i,k} = Σ_{j≠i} B_{jk} is how many other customers chose dish k.

For sampling new dishes, we use an M-H step where we simultaneously propose η = (K_new, V_new, Z_new), with K_new ∼ Poisson(α/D). We accept the proposal with an acceptance probability given by a = min{1, p(rest | η*) / p(rest | η)}. Here, p(rest | η) is the probability of the data given the parameters η. We propose V_new from its prior (Gaussian) but, for faster mixing, we propose Z_new from its posterior.

Sampling V: We sample the real-valued matrix V from its posterior p(V_{i,k} | D, B, Z, Ψ) ∝ Nor(V_{i,k} | µ_{i,k}, Σ_{i,k}), where Σ_{i,k} = (Σ_{n=1}^{N} Z_{k,n}²/Ψ_i + 1/σ_v²)^{−1} and µ_{i,k} = Σ_{i,k} (Σ_{n=1}^{N} Z_{k,n} D*_{i,n}) Ψ_i^{−1}. We define D*_{i,n} = D_{i,n} − Σ_{l=1, l≠k}^{K} (B_{i,l} V_{i,l}) Z_{l,n}. The hyperparameter σ_v on V has an inverse-gamma prior, and its posterior also has the same form. Note that the number of columns in V is the same as the number of columns in the IBP matrix B.

Sampling Z: We sample Z from its posterior p(Z | D, B, V, Ψ) ∝ Nor(Z | µ, Σ), where µ = W^T (WW^T + Ψ)^{−1} D and Σ = I − W^T (WW^T + Ψ)^{−1} W, with W = B ⊙ V.

Note that, in our sampling scheme, we considered the matrices B_x and B_y as simply parts of the big IBP matrix B, and sampled them together using a single IBP draw. However, one could also sample them separately as two separate IBP matrices for B_x and B_y. This would require different IBP draws for sampling B_x and B_y, with some modification of the existing Gibbs sampler. Different IBP draws could result in different numbers of nonzero columns in B_x and B_y. To deal with this issue, one could sample B_x (say having K_x nonzero columns) and B_y (say having K_y nonzero columns) first, introduce extra dummy columns (|K_x − K_y| in number) in the matrix having the smaller number of nonzero columns, and then set all such columns to zero. The effective K for each iteration of the Gibbs sampler would be max{K_x, K_y}. A similar scheme could also be followed for the corresponding real-valued matrices V_x and V_y, sampling them in conjunction with B_x and B_y respectively.

4 Multitask Learning using Infinite CCA

Having set up the framework for infinite CCA, we now describe its applicability to the problem of multitask learning. In particular, we consider the setting when each example is associated with multiple labels. Here, predicting each individual label becomes a task to be learned. Although one can individually learn a separate model for each task, doing so would ignore the label correlations. This makes borrowing information across tasks crucial, making it imperative to share the statistical strength across all the tasks. 
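The Gibbs update for Z in Section 3.3 is a standard Gaussian conditional that can be computed in a few lines; this sketch (function name is ours) draws all N latent columns at once, sharing the posterior covariance:

```python
import numpy as np

def gibbs_sample_Z(D, W, Psi, rng):
    """Draw Z | D, W, Psi with mu = W^T (W W^T + Psi)^{-1} D and
    Sigma = I - W^T (W W^T + Psi)^{-1} W, shared across the N data columns."""
    K = W.shape[1]
    # M = W^T (W W^T + Psi)^{-1}; the system matrix is symmetric, so
    # solve(A, W).T gives (A^{-1} W)^T = W^T A^{-1}.
    M = np.linalg.solve(W @ W.T + Psi, W).T
    mu = M @ D                                          # K x N posterior means
    Sigma = np.eye(K) - M @ W                           # shared posterior covariance
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(K))   # tiny jitter for stability
    return mu + L @ rng.standard_normal(mu.shape)
```

Because Ψ is diagonal in the model, W W^T + Ψ is cheap to form, and the same Cholesky factor serves every data point.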
With this motivation, we apply our infinite CCA model to capture the label correlations and to learn better predictive features from the data by projecting it to a subspace directed by the label information. It has been shown both empirically and theoretically [25] that incorporating label information in dimensionality reduction indeed leads to better projections if the final goal is prediction.

More concretely, let X = [x_1, . . . , x_N] be a D × N matrix of predictor variables, and Y = [y_1, . . . , y_N] be an M × N matrix of response variables (i.e., the labels), with each y_i being an M × 1 vector of responses for input x_i. The labels can take real (for regression) or categorical (for classification) values. The infinite CCA model is applied on the pair X and Y, which is akin to doing supervised dimensionality reduction for the inputs X. Note that the generalized eigenvalue problem posed in such a supervised setting of CCA involves the cross-covariance matrix Σ_XY and the label covariance matrix Σ_YY. The projection therefore takes into account both the input-output correlations and the label correlations. Such a subspace is therefore expected to consist of much better predictive features than one obtained by a naïve feature extraction approach such as simple PCA, which completely ignores the label information, or approaches like Linear Discriminant Analysis (LDA) that do take into account label information but ignore label correlations.

Multitask learning using the infinite CCA model can be done in two settings, supervised and semi-supervised, depending on whether or not the inputs of the test data are involved in learning the shared subspace Z.

4.1 Fully supervised setting

In the supervised setting, CCA is done on labeled data (X, Y) to give a single shared subspace Z ∈ R^(K×N) that is good across all tasks. 
A model is then learned in the Z subspace to obtain M task parameters θ_m ∈ R^(K×1), where m ∈ {1, . . . , M}. Each of the parameters θ_m is then used to predict the labels for the test data of task m. Note, however, that since the test data is still D dimensional, we need to either separately project it down onto the K dimensional subspace and do predictions in this subspace, or "inflate" each task parameter back to D dimensions by applying the projection matrix W_x and do predictions in the original D dimensional space. The first option uses the fact that P(Z | X_te) ∝ P(X_te | Z) P(Z), which is a Gaussian Nor(µ_{Z|X}, Σ_{Z|X}) with µ_{Z|X} = (W_x^T Ψ_x^{−1} W_x + I)^{−1} W_x^T Ψ_x^{−1} X_te and Σ_{Z|X} = (W_x^T Ψ_x^{−1} W_x + I)^{−1}. With the second option, we can inflate each learned task parameter back to D dimensions by applying the projection matrix W_x. We choose the second option for the experiments. We call this fully supervised setting model-1.

4.2 A Semi-supervised setting

In the semi-supervised setting, we combine the training data and the test data (with unknown labels) as X = [X_tr, X_te] and Y = [Y_tr, Y_te], where the labels Y_te are unknown. The infinite CCA model is then applied on the pair (X, Y), and the parts of Y consisting of Y_te are treated as latent variables to be imputed. With this model, we get the embeddings also for the test data, and thus training and testing both take place in the K dimensional subspace, unlike model-1, in which training is done in the K dimensional subspace and predictions are made in the original D dimensional space. We call this semi-supervised setting model-2.

5 Experiments

Here we report our experimental results on several synthetic and real world datasets. 
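For completeness, the first option's posterior-mean projection of test inputs in Section 4.1 can be sketched as follows (helper name is ours; Ψ_x is assumed diagonal, as in the model):

```python
import numpy as np

def project_test_inputs(X_te, Wx, Psi_x_diag):
    """Posterior mean mu_{Z|X} = (Wx^T Psi_x^{-1} Wx + I)^{-1} Wx^T Psi_x^{-1} X_te."""
    K = Wx.shape[1]
    # Wx^T Psi_x^{-1}: dividing the columns of Wx^T by the diagonal of Psi_x
    WtPinv = Wx.T / Psi_x_diag
    A = WtPinv @ Wx + np.eye(K)
    return np.linalg.solve(A, WtPinv @ X_te)   # K x N_te embeddings
```

This is the same Gaussian conditional used for Z during training, restricted to the X half of the model.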
We \ufb01rst show\nour results with the in\ufb01nite CCA as a stand alone algorithm for CCA by using it on a synthetic\ndataset demonstrating its effectiveness in capturing the canonical correlations. We then also report\nour experiments on applying the in\ufb01nite CCA model to the problem of multitask learning on two\nreal world datasets.\n\n5.1\n\nIn\ufb01nite CCA results on synthetic data\n\nIn the \ufb01rst experiment, we demonstrate the effectiveness of our proposed in\ufb01nite CCA model in\ndiscovering the correct number of canonical correlation components, and in capturing the sparsity\npattern underlying the projection matrix. For this, we generated two datasets of dimensions 25 and\n10 respectively, with each having 100 samples. For this synthetic dataset, we knew the ground truth\n(i.e., the number of components, and the underlying sparsity of projection matrix). In particular, the\ndataset had 4 correlation components with a 63% sparsity in the true projection matrix. We then\nran both classical CCA and in\ufb01nite CCA algorithm on this dataset. Looking at all the correlations\ndiscovered by classical CCA, we found that it discovered 8 components having signi\ufb01cant correla-\ntions, whereas our model correctly discovered exactly 4 components in the \ufb01rst place (we extract\nthe MAP samples for W and Z output by our Gibbs sampler). Thus on this small dataset, standard\nCCA indeed seems to be \ufb01nding spurious correlations, indicating a case of over\ufb01tting (the over\ufb01t-\nting problem of classical CCA was also observed in [15] when comparing Bayesian versus classical\nCCA). Furthermore, as expected, the projection matrix inferred by the classical CCA had no exact\nzero entries and even after thresholding signi\ufb01cantly small absolute values to zero, the uncovered\nsparsity was only about 25%. 
On the other hand, the projection matrix inferred by the infinite CCA model had 57% exact zero entries, and 62% zero entries after thresholding very small values, thereby demonstrating its effectiveness in also capturing the sparsity patterns.

Table 1: Results on the multi-label classification task. Bold face indicates the best performance. Model-1 and Model-2 scores are averaged over 10 runs with different initializations.

Model    | Yeast: Acc  F1-macro  F1-micro  AUC     | Scene: Acc  F1-macro  F1-micro  AUC
Full     | 0.5583     0.3132    0.3929    0.5054   | 0.7565     0.3445    0.3527    0.6339
PCA      | 0.5612     0.3144    0.4648    0.5026   | 0.7233     0.2857    0.2734    0.6103
CCA      | 0.5441     0.2888    0.3923    0.5135   | 0.7496     0.3342    0.3406    0.6346
Model-1  | 0.5842     0.3327    0.4402    0.5232   | 0.7533     0.3630    0.3732    0.6517
Model-2  | 0.6156     0.3463    0.4954    0.5386   | 0.7664     0.3742    0.3825    0.6686

5.2 Infinite CCA applied to multi-label prediction

In the second experiment, we use the infinite CCA model to learn a set of related tasks in the context of multi-label prediction. For our experiments, we use two real-world multi-label datasets (Yeast and Scene) from the UCI repository. The Yeast dataset consists of 1500 training and 917 test examples, each having 103 features. The number of labels (or tasks) per example is 14. The Scene dataset consists of 1211 training and 1196 test examples, each having 294 features. The number of labels per example for this dataset is 6. We compare the following models in our experiments:

• Full: Train separate classifiers (SVM) on the full feature set for each task.
• PCA: Apply PCA on the training and test data and then train separate classifiers for each task in the low dimensional subspace. 
This baseline ignores the label information while learning the low dimensional subspace.
• CCA: Apply classical CCA on the training data to extract the shared subspace, learn a separate model (i.e., task parameters) for each task in this subspace, project the task parameters back to the original D dimensional feature space by applying the projection W_x, and do predictions on the test data in this feature space.
• Model-1: Use our supervised infinite CCA model to learn the shared subspace using only the training data (see section 4.1).
• Model-2: Use our semi-supervised infinite CCA model to simultaneously learn the shared subspace for both the training and test data (see section 4.2).

The performance metrics used are overall accuracy, F1-Macro, F1-Micro, and AUC (Area Under ROC Curve). For PCA and CCA, we chose the K that gives the best performance, whereas this parameter was learned automatically for both of our proposed models. The results are shown in Table 1. As we can see, both the proposed models do better than the other baselines. Of the two proposed models, we see that model-2 does better in most cases, suggesting that it is useful to incorporate the test data while learning the projections. This is possible in our probabilistic model since we can treat the unknown Y's of the test data as latent variables to be imputed while doing the Gibbs sampling.

We note here that our results are for cases where we only had access to a small number of related tasks (Yeast has 14, Scene has 6). We expect the performance improvements to be even more significant when the number of (related) tasks is high.

6 Related Work

A number of approaches have been proposed in the recent past for the problem of supervised dimensionality reduction of multi-label data. 
The few approaches that exist include Partial Least Squares\n[2], multi-label informed latent semantic indexing [24], and multi-label dimensionality reduction us-\ning dependence maximization (MDDM) [26]. None of these, however, deal with the case when the\ndata is only partially labeled. Somewhat similar in spirit to our approach is the work on supervised\nprobabilistic PCA [25] that extends probabilistic PCA to the setting when we also have access to\nlabels. However, it assumes a \ufb01xed number of components and does not take into account sparsity\nof the projections.\n\n7\n\n\fThe CCA based approach to supervised dimensionality reduction is more closely related to the\nnotion of dimension reduction for regression (DRR) which is formally de\ufb01ned as \ufb01nding a low\ndimensional representation z \u2208 RK of inputs x \u2208 RD (K \u226a D) for predicting multivariate outputs\ny \u2208 RM . An important notion in DRR is that of suf\ufb01cient dimensionality reduction (SDR) [10, 8]\nwhich states that given z, x and y are conditionally independent, i.e., x \u22a5\u22a5 y|z. As we can see in the\ngraphical model shown in \ufb01gure-1, the probabilistic interpretation of CCA yields the same condition\nwith X and Y being conditionally independent given Z.\n\nAmong the DRR based approaches to dimensionality reduction for real-valued multilabel data, Co-\nvariance Operator Inverse Regression (COIR) exploits the covariance structures of both the inputs\nand outputs [14]. Please see [14] for more details on the connection between COIR and CCA. Be-\nsides the DRR based approaches, the problem of extracting useful features from data, particularly\nwith the goal of making predictions, has also been considered in other settings. The information\nbottleneck (IB) method [19] is one such example. Given input-output pairs (X, Y), the information\nbottleneck method aims to obtain a compressed representation T of X that can account for Y. 
IB achieves this using a single tradeoff parameter to represent the tradeoff between the complexity of the representation of X, measured by I(X; T), and the accuracy of this representation, measured by I(T; Y), where I(.; .) denotes the mutual information between two variables. In another recent work [13], a joint learning framework is proposed which performs dimensionality reduction and multi-label classification simultaneously.

In the context of CCA as a stand-alone problem, sparsity is another important issue. In particular, sparsity improves model interpretation and has been gaining considerable attention recently. Existing work on sparsity in CCA includes the double-barrelled lasso, which is based on a convex least squares approach [17], and CCA as a sparse solution to the generalized eigenvalue problem [18], which is based on constraining the cardinality of the solution to the generalized eigenvalue problem to obtain a sparse solution. Another recent solution is based on a direct greedy approach which bounds the correlation at each stage [22].

The probabilistic approaches to CCA include the works of [15] and [1], both of which use an automatic relevance determination (ARD) prior [5] to determine the number of relevant components, which is a rather ad-hoc way of doing this. In contrast, the nonparametric Bayesian alternative proposed here is a more principled way to determine the number of components.

We note that the sparse factor analysis model proposed in [16] actually falls out as a special case of our proposed infinite CCA model if one of the datasets (X or Y) is absent. 
Besides, the sparse factor analysis model is limited to factor analysis, whereas the proposed model can be seen as an infinite generalization of both an unsupervised problem (sparse CCA) and a (semi)supervised problem (dimensionality reduction using CCA with full or partial label information), with the latter being especially relevant for multitask learning in the presence of multiple labels.

Finally, multitask learning has been tackled using a variety of different approaches, primarily depending on what notion of task relatedness is assumed. Some examples include tasks generated from an IID space [4], and learning multiple tasks using a hierarchical prior over the task space [23, 7], among others. In this work, we consider multi-label prediction in particular, based on the premise that a set of such related tasks shares an underlying low-dimensional feature space [12] that captures the task relatedness.

7 Conclusion

We have presented a nonparametric Bayesian model for the Canonical Correlation Analysis problem to discover the dependencies between a set of variables. In particular, our model does not assume a fixed number of correlation components, and this number is determined automatically based only on the data. In addition, our model enjoys sparsity, making the model more interpretable. The probabilistic nature of our model also allows dealing with missing data. Finally, we also demonstrate the model's applicability to the problem of multi-label learning, where our model, directed by label information, can be used to automatically extract useful predictive features from the data.

Acknowledgements

We thank the anonymous reviewers for helpful comments. This work was partially supported by NSF grant IIS-0712764.

References

[1] C. Archambeau and F. Bach. Sparse probabilistic projections. In Neural Information Processing Systems 21, 2008.

[2] J. Arenas-García, K. B. Petersen, and L. 
K. Hansen. Sparse Kernel Orthonormalized PLS for Feature Extraction in Large Data Sets. In Neural Information Processing Systems 19, 2006.

[3] F. R. Bach and M. I. Jordan. A Probabilistic Interpretation of Canonical Correlation Analysis. Technical Report 688, Dept. of Statistics, University of California, 2005.

[4] J. Baxter. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

[5] C. M. Bishop. Bayesian PCA. In Neural Information Processing Systems 11, Cambridge, MA, USA, 1999. MIT Press.

[6] R. Caruana. Multitask Learning. Machine Learning, 28(1):41–75, 1997.

[7] H. Daumé III. Bayesian Multitask Learning with Latent Hierarchies. In Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 2009.

[8] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research, 5:73–99, 2004.

[9] Z. Ghahramani, T. L. Griffiths, and P. Sollich. Bayesian Nonparametric Latent Feature Models. In Bayesian Statistics 8. Oxford University Press, 2007.

[10] A. Globerson and N. Tishby. Sufficient Dimensionality Reduction. Journal of Machine Learning Research, 3:1307–1331, 2003.

[11] H. Hotelling. Relations Between Two Sets of Variates. Biometrika, pages 321–377, 1936.

[12] S. Ji, L. Tang, S. Yu, and J. Ye. Extracting Shared Subspace for Multi-label Classification. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

[13] S. Ji and J. Ye. Linear Dimensionality Reduction for Multi-label Classification. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.

[14] M. Kim and V. Pavlovic. Covariance Operator Based Dimensionality Reduction with Extension to Semi-supervised Settings. In Twelfth International Conference on Artificial Intelligence and Statistics, Florida, USA, 2009.

[15] A. Klami and S. Kaski. Local Dependent Components.
In ICML '07: Proceedings of the 24th International Conference on Machine Learning, 2007.

[16] P. Rai and H. Daumé III. The Infinite Hierarchical Factor Regression Model. In Neural Information Processing Systems 21, 2008.

[17] D. Hardoon and J. Shawe-Taylor. The Double-Barrelled LASSO (Sparse Canonical Correlation Analysis). In Workshop on Learning from Multiple Sources (NIPS), 2008.

[18] B. Sriperumbudur, D. Torres, and G. Lanckriet. The Sparse Eigenvalue Problem. arXiv:0901.1504v1, 2009.

[19] N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.

[20] N. Ueda and K. Saito. Parametric Mixture Models for Multi-labeled Text. In Advances in Neural Information Processing Systems, pages 737–744, 2003.

[21] C. Wang. Variational Bayesian Approach to Canonical Correlation Analysis. IEEE Transactions on Neural Networks, 2007.

[22] A. Wiesel, M. Kliger, and A. Hero. A Greedy Approach to Sparse Canonical Correlation Analysis. arXiv:0801.2748, 2008.

[23] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task Learning for Classification with Dirichlet Process Priors. Journal of Machine Learning Research, 8:35–63, 2007.

[24] K. Yu, S. Yu, and V. Tresp. Multi-label Informed Latent Semantic Indexing. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 258–265, 2005.

[25] S. Yu, K. Yu, V. Tresp, H. Kriegel, and M. Wu. Supervised Probabilistic Principal Component Analysis. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

[26] Y. Zhang and Z.-H. Zhou. Multi-label Dimensionality Reduction via Dependence Maximization.
In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI 2008), pages 1503–1505, 2008.