{"title": "Multilabel Classification using Bayesian Compressed Sensing", "book": "Advances in Neural Information Processing Systems", "page_first": 2645, "page_last": 2653, "abstract": "In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have  missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model naturally allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case.  Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model.", "full_text": "Multilabel Classi\ufb01cation using Bayesian Compressed\n\nSensing\n\nAshish Kapoor\u2020, Prateek Jain\u2021 and Raajay Viswanathan\u2021\n\n\u2020Microsoft Research, Redmond, USA\n\u2021Microsoft Research, Bangalore, INDIA\n\n{akapoor, prajain, t-rviswa}@microsoft.com\n\nAbstract\n\nIn this paper, we present a Bayesian framework for multilabel classi\ufb01cation using\ncompressed sensing. 
The key idea in compressed sensing for multilabel classi-\n\ufb01cation is to \ufb01rst project the label vector to a lower dimensional space using a\nrandom transformation and then learn regression functions over these projections.\nOur approach considers both of these components in a single probabilistic model,\nthereby jointly optimizing over compression as well as learning tasks. We then\nderive an ef\ufb01cient variational inference scheme that provides joint posterior distri-\nbution over all the unobserved labels. The two key bene\ufb01ts of the model are that a)\nit can naturally handle datasets that have missing labels and b) it can also measure\nuncertainty in prediction. The uncertainty estimate provided by the model allows\nfor active learning paradigms where an oracle provides information about labels\nthat promise to be maximally informative for the prediction task. Our experiments\nshow signi\ufb01cant boost over prior methods in terms of prediction performance over\nbenchmark datasets, both in the fully labeled and the missing labels case. Finally,\nwe also highlight various useful active learning scenarios that are enabled by the\nprobabilistic model.\n\n1\n\nIntroduction\n\nLarge scale multilabel classi\ufb01cation problems arise in several practical applications and has recently\ngenerated a lot of interest with several ef\ufb01cient algorithms being proposed for different settings\n[1, 2]. A primary reason for thrust in this area is due to explosion of web-based applications, such\nas Picasa, Facebook and other online sharing sites, that can obtain multiple tags per data point.\nFor example, users on the web can annotate videos and images with several possible labels. Such\napplications have provided a new dimension to the problem as these applications typically have\nmillions of tags. 
Most of the existing multilabel methods learn a decision function or weight vector\nper label and then combine the decision functions in a certain manner to predict labels for an unseen\npoint [3, 4, 2, 5, 6]. However, such approaches quickly become infeasible in real-world settings as the\nnumber of labels in such applications is typically very large. For instance, traditional multi-label\nclassification techniques based on 1-vs-all SVM [7] are prohibitive because of both large train and\ntest times.\nTo alleviate this problem, [1] proposed a compressed sensing (CS) based method that exploits the\nfact that usually the label vectors are very sparse, i.e., the number of positive labels/tags present\nin a point is significantly less than the total number of labels. Their algorithm uses the following\nresult from the CS literature: an s-sparse vector in R^L can be recovered efficiently using\nK = O(s log(L/s)) measurements. Their method projects label vectors into an s log(L/s) dimensional\nspace and learns a regression function in the projected space (independently for each dimension).\nFor test points, the learnt regression function is applied in the reduced space and then standard recovery\nalgorithms from the CS literature are used to obtain sparse predicted labels [8, 9]. However, in\nthis method, learning of the decision functions is independent of the sparse recovery and hence in\npractice, it requires several measurements to match the accuracy of standard baseline methods such\nas 1-vs-all SVM. Another limitation of this method is that the scheme does not directly apply when\nlabels are missing, a common aspect in real-world web applications. Finally, the method does not\nlend itself naturally to uncertainty analysis that can be used for active learning of labels.\nIn this paper, we address some of the issues mentioned above using a novel Bayesian framework\nfor multilabel classification. 
In particular, we propose a joint probabilistic model that combines\ncompressed sensing [10, 11] with a Bayesian learning model on the projected space. Our model\ncan be seen as a Bayesian co-training model, where the lower dimensional projected space can be\nthought of as latent variables. And these latent variables are generated by two different views: a)\nusing a random projection of the label vector, b) using a (linear) predictor over the input data space.\nHence, unlike the method of [1], our model can jointly infer predictions in the projected space and\nprojections of the label vector. This joint inference leads to more ef\ufb01cient utilization of the latent\nvariable space and leads to signi\ufb01cantly better accuracies than the method of [1] while using same\nnumber of latent variables K.\nBesides better prediction performance, there are several other advantages offered by our probabilistic\nmodel. First, the model naturally handles missing labels as the missing labels are modeled as ran-\ndom variables that can be marginalized out. Second, the model enables derivation of a variational\ninference method that can ef\ufb01ciently compute joint posterior distribution over all the unobserved\nrandom variables. Thus, we can infer labels not only for the test point but also for all the missing\nlabels in the training set. Finally, the inferred posterior over labels provide an estimate of uncertainty\nmaking the proposed method amenable to active learning.\nActive learning is an important learning paradigm that has received a lot of attention due to the\navailability of large unlabeled data but paucity of labels over these data sets. In the traditional active\nlearning setting (for binary/multiclass classi\ufb01cation), at each round the learner actively seeks labels\nfor a selected unlabeled point and updates its models using the provided label. 
Several criteria, such\nas uncertainty [12], expected informativeness [13, 14], reduction in version space [15], disagree-\nment among a committee of classi\ufb01ers [16], etc. have been proposed. While heuristics have been\nproposed [17] in the case of 1-vs-all SVMs, it is still unclear how these methods can be extended to\nmultilabel classi\ufb01cation setting in a principled manner. Our proposed model naturally handles the\nactive learning task as the variational inference procedure provides the required posteriors which can\nguide information acquisition. Further, besides the traditional active learning scenario, where all the\nlabels are revealed for a selected data, the model leads to extension of information foraging to more\npractical and novel scenarios. For example, we introduce active diagnosis, where the algorithm only\nasks about labels for the test case that potentially can help with prediction over the rest of the un-\nobserved tags. Similarly, we can extend to a generalized active learning setting, where the method\nseeks answer to questions of the type: \u201cdoes label \u2019A\u2019 exists in data point x\u201d. Such extensions are\nmade feasible due to the Bayesian interpretation of the multilabel classi\ufb01cation task.\nWe demonstrate the above mentioned advantages of our model using empirical validation on bench-\nmark datasets. In particular, experiments show that the method signi\ufb01cantly outperforms ML-CS\nbased method by [1] and also obtains accuracies matching 1-vs-all SVM while projecting onto K-\ndimensional space that is typically less than half the total number of labels. We expect these gains\nto become even more signi\ufb01cant for datasets with larger number of labels. We also show that the\nproposed framework is robust to missing labels and actually outperforms 1-vs-all SVM with about\n85-95% missing labels while using K = .5L only. 
Finally, we demonstrate that our active learning\nstrategies select significantly more informative labels/points than the random selection strategy.\n2 Approach\nAssume that we are given a set of training data points X = {x_i} with labels Y = {y_i}, where each\ny_i = [y_i^1, ..., y_i^L] \u2208 {0, 1}^L is a multilabel binary vector of size L. Further, let us assume that there\nare data points in the training set for which we have only partially observed label vectors, which leads to\nthe following partitioning: X = X_L \u222a X_P. Here the subscripts L and P indicate fully and partially\nlabeled data respectively. Our goal then is to correctly predict all the labels for data in the test set\nX_U. Further, we also seek an active learning procedure that would request as few labels as possible\nfrom an oracle to maximize classification rate over the test set.\nIf we treat each label independently then standard machine learning procedures could be used to train\nindividual classifiers and this can even be extended to do active learning. However, such procedures\ncan be fairly expensive when the number of labels is huge. Further, these methods would simply\nignore the missing data, and thus may not utilize statistical relationships amongst the labels. Recent\ntechniques in multilabel classification alleviate the problem of large output space [1, 18], but cannot\nhandle the missing data cases. Finally, there are no clear methods of extending these approaches for\nactive learning.\nWe present a probabilistic graphical model that builds upon ideas of compressed sensing and utilizes\nstatistical relations across the output space for prediction and active information acquisition. The\nkey idea in compressed sensing is to consider a linear transformation of the L dimensional label\nvector y to a K dimensional space z, where K \u226a L, via a random matrix \u03a6. 
The ef\ufb01ciency in the\nclassi\ufb01cation system is improved by considering regression functions to the compressed vectors z\ninstead of the true label space. The proposed framework considers Gaussian process priors over the\ncompressed label space and has the capability to propagate uncertainties to the output label space by\nconsidering the constraints imposed by the random projection matrix. There are several bene\ufb01ts of\nthe proposed method: 1) \ufb01rst it naturally handles missing data by marginalizing over the unobserved\nlabels, 2) the Bayesian perspective leads to valid probabilities that re\ufb02ect the true uncertainties\nin the system, which in turn helps guide active learning procedures, 3) \ufb01nally, the experiments\nshow that the model signi\ufb01cantly outperforms state-of-the-art compressed sensing based multilabel\nclassi\ufb01cation methods.\n2.1 A Model for Multilabel Classi\ufb01cation with Bayesian Compressed Sensing\nWe propose a model that simultaneously handles two key aspects: \ufb01rst is the task of compressing\nand recovering the label vector yi to and from the lower dimensional representation zi. Second,\ngiven an input data xi the problem is estimating low dimensional representation in the compressed\nspace. Instead of separately solving each of the tasks, the proposed approach aims at achieving better\nperformance by considering both of these tasks jointly, thereby modeling statistical relationships\namongst different variables of interest.\nFigure 1 illustrates the factor graph corresponding to the proposed model. For every data point xi,\nthe output labels yi in\ufb02uence the compressed latent vector zi via the random projection matrix \u03a6.\nThese compressed signals in turn also get in\ufb02uenced by the d-dimensional feature vector xi via the\nK different linear regression functions represented as a d \u00d7 K matrix W. 
Consequently, the role of\nz_i is not only to compress the output space but also to consider the compatibility with the input data\npoint. The latent variable W corresponding to the linear model has a spherical Gaussian prior and\nis motivated by Gaussian Process regression [19]. Note that when z_i is observed, the model reduces\nto simple Gaussian Process regression.\nOne of the critical assumptions in compressed sensing is that the output label vector y_i is sparse. The\nproposed model induces this constraint via a zero-mean Gaussian prior on each of the labels (i.e.\ny_i^j \u223c N(0, 1/\u03b1_i^j)), where the precision \u03b1_i^j of the normal distribution follows a Gamma prior\n\u03b1_i^j \u223c \u0393(a_0, b_0) with hyper-parameters a_0 and b_0. The Gamma prior has been earlier proposed\nin the context of Relevance Vector Machine (RVM) [20] as it not only induces sparsity but also is a\nconjugate prior to the precision \u03b1_i^j of the zero mean Gaussian distributions. Intuitively, marginalizing\nthe precision in the product of Gamma priors and the Gaussian likelihoods leads to a potential\nfunction on the labels that is a Student-t distribution and has a significant probability mass around\nzero. Thus, the labels y_i^j naturally tend to zero unless they need to explain observed data. Finally, the\nconjugate-exponential form between the precisions \u03b1_i and the output labels y_i leads to an efficient\ninference procedure that we describe later in the paper.\nNote that, for labeled training data x_i \u2208 X_L all the labels y_i are observed, while only some or\nnone of the labels are observed for the partially labeled and test cases respectively. The proposed\nmodel ties the input feature vectors X to the output space Y via the compressed representations Z\naccording to the following distribution:\n\np(Y, Z, W, [\\alpha_i]_{i=1}^N \\mid X, \\Phi) = \\frac{1}{Z} \\, p(W) \\prod_{i=1}^N f_{x_i}(W, z_i) \\, g_\\Phi(y_i, z_i) \\, h_{\\alpha_i}(y_i) \\, p(\\alpha_i)\n\nwhere Z is the partition function (normalization term), p(W) = \u220f_{i=1}^K N(w_i; 0, I) is the spherical\nGaussian prior on the linear regression functions and p(\u03b1_i) = \u220f_{j=1}^L \u0393(\u03b1_i^j; a_0, b_0) is the product of\nGamma priors on each individual label. Finally, the potentials f_{x_i}(\u00b7,\u00b7), g_\u03a6(\u00b7,\u00b7) and h_{\u03b1_i}(\u00b7) take the\nfollowing form:\n\nf_{x_i}(W, z_i) = \\exp\\left(-\\frac{\\|W^T x_i - z_i\\|^2}{2\\sigma^2}\\right), \\quad g_\\Phi(y_i, z_i) = \\exp\\left(-\\frac{\\|\\Phi y_i - z_i\\|^2}{2\\chi^2}\\right), \\quad h_{\\alpha_i}(y_i) = \\prod_{j=1}^L N\\left(y_i^j; 0, \\frac{1}{\\alpha_i^j}\\right).\n\nFigure 1: A Bayesian model for multilabel classification via compressed sensing. The input data is\nx_i with multiple labels y_i, which are fully observed for the case of fully labeled training data set L,\npartially observed for training data with missing labels P, or completely unobserved as in test data\nU. The latent variables z_i indicate the compressed label space, and \u03b1_i with independent Gamma\npriors enforce the sparsity. The set of regression functions described by W is also a latent random\nvariable and is connected across all the data points.\n\nIntuitively, the potential term f_{x_i}(W, z_i) favors configurations that are aligned with the output of the\nlinear regression function when applied to the input feature vector. Similarly, the term g_\u03a6(y_i, z_i)\nfavors configurations that are compatible with the output compressive projections determined by\n\u03a6. Finally, as described earlier, h_{\u03b1_i}(y_i) enforces sparsity in the output space. 
The parameters \u03c3^2\nand \u03c7^2 denote noise parameters and determine how tight the relation is between the labels in the\noutput space, the compressed space and the regression coefficients. By changing the value of these\nparameters we can emphasize or de-emphasize the relationship between the latent variables.\nIn summary, our model provides a powerful framework for modeling multilabel classification using\ncompressive sensing. The model promises statistical efficiency by jointly considering compressive\nsensing and regression within a single model. Moreover, as we will see in the next section this model\nallows efficient numerical procedures for inferring unobserved labels by resolving the constraints\nimposed by the potential functions and the observed data. The model naturally handles the case of\nmissing data (incomplete labels) by automatically marginalizing the unobserved data as a part of\nthe inference mechanism. Finally, the probabilistic nature of the approach provides us with valid\nprobabilistic quantities that can be used to perform active selection of the unlabeled points.\n2.2 Inference\nFirst, consider the simpler scenario where the training data set only consists of fully labeled instances\nX_L with labels Y_L. Thus our aim is to infer p(Y_U | X, Y_L, \u03a6), the posterior distribution over\nunlabeled data. Performing exact inference is prohibitive in this model primarily due to the following\nreasons. First, notice that the joint distribution is a product of Gaussian terms (the spherical prior on W and\nthe compatibility terms with z_i) and non-Gaussian terms (the Gamma priors). Along with these sparsity\nterms, the projection of the label space into the compressed space precludes usage of exact inference\nvia a junction tree algorithm. Thus, we resort to approximate inference techniques. In particular we\nperform approximate inference by maximizing the variational lower bound, assuming that the\nposterior over the unobserved random variables W, Y_U, Z and [\u03b1_i]_{i=1}^N can be factorized:\n\nF = \\int_{Y_U, Z, W, [\\alpha_i]_{i=1}^N} q(Y_U) q(Z) q(W) q([\\alpha_i]_{i=1}^N) \\log \\frac{p(Y, Z, W, [\\alpha_i]_{i=1}^N \\mid X, \\Phi)}{q(Y_U) q(Z) q(W) q([\\alpha_i]_{i=1}^N)} \\le \\log \\int_{Y_U, Z, W, [\\alpha_i]_{i=1}^N} p(Y, Z, W, [\\alpha_i]_{i=1}^N \\mid X, \\Phi)\n\nHere, the posteriors on the precisions \u03b1_i are assumed to be Gamma distributed while the rest\nof the distributions are constrained to be Gaussian. Further, each of these joint posterior densities\nis assumed to have the following per data point factorization: q(Y_U) = \u220f_{i\u2208U} q(y_i),\nq(Z) = \u220f_{i\u2208U\u222aL} q(z_i) and q([\u03b1_i]_{i=1}^N) = \u220f_{i=1}^N \u220f_{j=1}^L q(\u03b1_i^j). Similarly the posterior over the regression\nfunctions has a per dimension factorization: q(W) = \u220f_{i=1}^K q(w_i). The approximate inference\nalgorithm aims to compute good approximations to the real posteriors by iteratively optimizing the\nabove described variational bound. Specifically, given the approximations q^t(y_i) \u223c N(\u00b5_{y_i}^t, \u03a3_{y_i}^t)\n(similar forms for z_i and w_i) and q^t(\u03b1_i^j) \u223c \u0393(a_{ij}^t, b_{ij}^t) from the t-th iteration the update rules are as\nfollows:\n\nUpdate for q^{t+1}(y_i): \\Sigma_{y_i}^{t+1} = [\\mathrm{diag}(E(\\alpha_i^t)) + \\Phi^T \\chi^{-2} \\Phi]^{-1}, \\quad \\mu_{y_i}^{t+1} = \\Sigma_{y_i}^{t+1} \\Phi^T \\chi^{-2} \\mu_{z_i}^t,\nUpdate for q^{t+1}(\\alpha_i^j): a_{ij}^{t+1} = a_0 + 0.5, \\quad b_{ij}^{t+1} = b_0 + 0.5 [\\Sigma_{y_i}^{t+1}(j, j) + [\\mu_{y_i}^{t+1}(j)]^2],\nUpdate for q^{t+1}(z_i): \\Sigma_{z_i}^{t+1} = [\\sigma^{-2} I + \\chi^{-2} I]^{-1}, \\quad \\mu_{z_i}^{t+1} = \\Sigma_{z_i}^{t+1} [\\sigma^{-2} [\\mu_W^{t+1}]^T x_i + \\chi^{-2} \\Phi \\mu_{y_i}^{t+1}],\nUpdate for q^{t+1}(w_i): \\Sigma_{w_i}^{t+1} = [\\sigma^{-2} X X^T + I]^{-1}, \\quad \\mu_{w_i}^{t+1} = \\sigma^{-2} \\Sigma_{w_i}^{t+1} X [\\mu_z^{t+1}(i)]^T.\n\nAlternating between the above described updates can be considered as message passing between\nthe low-dimensional regression outputs and higher dimensional output labels, which in turn are\nconstrained to be sparse. By doing the update on q(y_i), the algorithm attempts to explain the\ncompressed signal z_i using sparsity imposed by the precisions \u03b1_i. Similarly, by updating q(z_i)\nand q(W) the inference procedure reasons about a compressed representation that is most efficient\nin terms of reconstruction. 
By iterating between these updates the model consolidates information\nfrom the two key components, compressed sensing and regression, that constitute the system and is\nmore effective than doing these tasks in isolation.\nAlso note that the most expensive step is in the first update for computing \u03a3_{y_i}^{t+1}, which if naively\nimplemented would require an inversion of an L \u00d7 L matrix. However, this inversion can be computed\neasily using the Sherman-Morrison-Woodbury formula, which in turn reduces the complexity of\nthe update to O(K^3 + K^2 L). The only other significant update is the posterior computation q(w)\nthat is O(d^3), where d is the dimensionality of the feature space. Consequently, this scheme is fairly\nefficient and has time complexity similar to that of other non-probabilistic approaches. Finally, note\nthat straightforward extension to non-linear regression functions can be done via the kernel trick.\nHandling Missing Labels in Training Data: The proposed model and the inference procedure\nnaturally handle the case of missing labels in the training set via the variational inference. Let us\nconsider a data point x_p with a set of partially observed labels y_p^o. If we denote y_p^u as the set of\nunobserved labels, then all the above mentioned update steps stay the same except for the one that\nupdates q(z_p), which takes the following form:\n\n\\mu_{z_p}^{t+1} = \\Sigma_{z_p}^{t+1} [\\sigma^{-2} [\\mu_W^{t+1}]^T x_p + \\chi^{-2} \\Phi_{uo} [\\mu_{y_p^u}^{t+1}; y_p^o]].\n\nHere \u03a6_{uo} denotes re-ordering of the columns of \u03a6 according to the indices of the observed and\nunobserved labels. 
Intuitively, the compressed signal z_p now considers compatibility with the unobserved\nlabels, while taking into account the observed labels, and in doing so effectively facilitates\nmessage passing between all the latent random variables.\nHandling a Test Point: While it might seem that the above mentioned framework works only in the\ntransductive setting, we show here that this is not the case and that the framework can seamlessly\nhandle test data in an inductive setting. Note that given a training set, we can recover the posterior\ndistribution q(W) that summarizes the regression parameters. This posterior distribution is sufficient\nfor doing inference on a test point x\u2217. Intuitively, the key idea is that the information about the\ntraining set is fully captured in the regression parameters, thus, the labels for the test point can be\nsimply recovered by only iteratively updating q(y\u2217), q(z\u2217) and q(\u03b1\u2217).\n2.3 Active Learning\nThe main aim in active learning is to seek bits of information that would promise to enhance the\ndiscriminatory power of the framework the most. When employed in a traditional classification\nsetting, the active learning procedure boils down to the task of seeking the label for one of the\nunlabeled examples that promises to be most informative and then updating the classification model\nby incorporating it into the existing training set. However, multilabel classification enables richer\nforms of active information acquisition, which we describe below:\n\n\u2022 Traditional Active Learning: This is similar to the active learning scenario in traditional\nclassification tasks. 
In particular, the goal is to select an unlabeled sample for which all the\nlabels will be revealed.\n\n\u2022 Active Diagnosis: Given a test data point, at every iteration the active acquisition procedure\nseeks a label for that test point that is maximally informative and promises to\nimprove the prediction accuracy over the rest of its unknown labels.\n\nNote that Active Diagnosis is highly relevant for real-world tasks. For example, consider the Wikipedia\npage classification problem. Just knowing a few labels about the page can be immensely\nuseful in inferring the rest of the labels. Active diagnosis should be able to leverage the statistical\ndependency amongst the output label space, in order to ask for labels that are maximally informative.\nA direct generalization of the above two paradigms is a setting in which the active learning procedure\nselects a label for one point in the training set. Specifically, the key difference between this scenario\nand the traditional active learning is that only one label is chosen to be revealed for the selected data\npoint instead of the entire set of labels.\nNon-probabilistic classification schemes, such as SVMs, can handle traditional active learning by\nfirst establishing the confidence in the estimate of each label by using the distance from the classification\nboundary (margin) and then selecting the point that is closest to the margin. However, it is\nfairly non-trivial to extend those approaches to tackle the active diagnosis and generalized information\nacquisition. On the other hand the proposed Bayesian model provides a posterior distribution\nover the unknown class labels as well as other latent variables and can be used for active learning.\nIn particular, measures such as uncertainty or information gain can be used to guide the selective\nsampling procedure for active learning. 
Formally, we can write these two selection criteria as:\n\nUncertainty: \\arg\\max_{y_i \\in Y_U} H(y_i)\nInfoGain: \\arg\\max_{y_i \\in Y_U} H(Y_U \\setminus y_i) - E_{y_i}[H(Y_U \\setminus y_i \\mid y_i)]\n\nHere, H(\u00b7) denotes Shannon entropy and is a measure of uncertainty. The uncertainty criterion seeks\nto select the labels that have the highest entropy, whereas the information gain criterion seeks to\nselect a label that has the highest expected reduction in uncertainty over all the other unlabeled points\nor unknown labels. Either of these criteria can be computed given the inferred posteriors; however\nwe note that the information gain criterion is far more expensive to compute as it requires repeated\ninference by considering all possible labels for every unlabeled data point. The uncertainty criterion\non the other hand is very simple and often guides active learning with a reasonable amount of gains.\nIn this work we will consider uncertainty as the primary active learning criterion. Finally, we would like\nto point out that the different described forms of active learning can naturally be addressed with these\nheuristics by appropriately choosing the set of possible candidates and the posterior distributions\nover which the entropy is measured.\n3 Experiments\nIn this section, we present experimental results using our methods on standard benchmark datasets.\nThe goals of our experiments are three-fold: a) demonstrate that the proposed joint probabilistic\nmethod is significantly better than the standard compressed sensing based method by [1] and gets\ncomparable accuracy to 1-vs-all SVM while projecting labels onto a much smaller dimensionality\nK compared to the total number of labels L, b) show robustness of our method to missing labels,\nc) demonstrate various active learning scenarios and compare them against the standard baselines.\nWe use Matlab for all our implementations. We refer to our Compressed Sensing based Bayesian\nMultilabel classification method as BML-CS. 
In the BML-CS method, the hyper-parameters a_0 and b_0\nare set to 10^{-6}, which in turn leads to a fairly uninformative prior. The noise parameters \u03c7 and \u03c3 are\nfound by maximizing the marginalized likelihood of the Gaussian Process Regression model [19].\nWe use liblinear for the SVM implementation; the error penalty C is selected using cross-validation. We\nalso implemented the multilabel classification method based on compressed sensing (ML-CS) [1]\nwith CoSamp [8] being the underlying sparse vector recovery algorithm.\nFor our experiments, we use standard multilabel datasets. In particular, we choose datasets where\nthe number of labels is high. Such datasets generally tend to have only a few labels per data point\nand the compressed sensing methods can exploit this sparsity to their advantage.\n\nFigure 2: (a) CAL500 dataset, (b) Bookmarks dataset, (c) RCV1 dataset, (d) Corel5k dataset. Comparison of precision values (in top-1 label) for different methods with different values\nof K, dimensionality of the compressed label space. The SVM baseline uses all the L labels. The\nx-axis shows K as a percentage of the total number of labels L. Clearly, for each of the datasets\nthe proposed method obtains accuracy similar to the 1-vs-all SVM method while projecting to only\nK = L/2 dimensions. 
Also, our method consistently obtains significantly higher accuracies than\nthe CS method of [1] while using the same number of latent variables K.\n\n(a) 1-vs-all SVM: 0.74 (Top-3), 0.67 (Top-5)\nK      Top-3 BML-CS   Top-3 ML-CS   Top-5 BML-CS   Top-5 ML-CS\n10%    0.04           0.36          0.09           0.32\n25%    0.38           0.48          0.28           0.41\n50%    0.61           0.44          0.51           0.40\n75%    0.75           0.53          0.60           0.55\n100%   0.70           0.61          0.65           0.57\n\n(b) 1-vs-all SVM: 0.20 (Top-3), 0.14 (Top-5)\nK      Top-3 BML-CS   Top-3 ML-CS   Top-5 BML-CS   Top-5 ML-CS\n10%    0.10           0.06          0.07           0.04\n25%    0.15           0.08          0.10           0.05\n50%    0.17           0.09          0.12           0.06\n75%    0.17           0.10          0.13           0.07\n100%   0.19           0.10          0.13           0.07\n\n(c) 1-vs-all SVM: 0.75 (Top-3), 0.54 (Top-5)\nK      Top-3 BML-CS   Top-3 ML-CS   Top-5 BML-CS   Top-5 ML-CS\n10%    0.33           0.19          0.23           0.14\n25%    0.65           0.59          0.44           0.39\n50%    0.75           0.69          0.52           0.49\n75%    0.75           0.71          0.53           0.50\n100%   0.75           0.72          0.53           0.51\n\n(d) 1-vs-all SVM: 0.27 (Top-3), 0.22 (Top-5)\nK      Top-3 BML-CS   Top-3 ML-CS   Top-5 BML-CS   Top-5 ML-CS\n10%    0.20           0.08          0.15           0.06\n25%    0.27           0.17          0.22           0.14\n50%    0.27           0.21          0.23           0.17\n75%    0.27           0.22          0.23           0.18\n100%   0.27           0.22          0.23           0.17\n\nFigure 3: Precision values obtained by various methods in retrieving 3 and 5 labels respectively.\nThe first column in each table shows K as a fraction of the number of labels L. 1-vs-all SVM requires\ntraining L weight vectors, while both BML-CS and ML-CS train K weight vectors. BML-CS is\nconsistently more accurate than ML-CS although its accuracy is not as close to SVM as it is for the\ncase of top-1 labels (see Figure 2).\n\nFor each of the algorithms we recover the top 1, 3, 5 most likely positive labels and set remaining\nlabels to be negative. 
For each value of t \u2208 {1, 3, 5}, we report precision in prediction, i.e., the fraction\nof true positives among all predicted positives.\n3.1 Multilabel Classification Accuracies\nWe train both ML-CS and our method BML-CS on all datasets using different values of K, i.e.,\nthe dimensionality of the space of latent variables z for which weight vectors are learned. Figure 2\ncompares the precision (in predicting 1 positive label) of our proposed method on four different datasets\nfor different values of K with the corresponding values obtained by ML-CS and SVM. Note that\n1-vs-all SVM learns all L > K weight vectors, hence it is just one point in the plot; we provide a line\nfor ease of comparison. It is clear from the figure that both BML-CS and ML-CS are significantly\nworse than 1-vs-all SVM when K is very small compared to the total number of labels L. However,\nfor around K = 0.5L, our method achieves close to the baseline (1-vs-all SVM) accuracy while\nML-CS still achieves significantly worse accuracies. In fact, even with K = L, ML-CS still obtains\nsignificantly lower accuracy than the SVM baseline.\nIn Figure 3 we tabulate precision for top-3 and top-5 retrieved positive labels. Here again, the\nproposed method is consistently more accurate than ML-CS. However, it requires a larger K to\nobtain precision values similar to SVM. This is fairly intuitive: at higher recall rates the multilabel\nproblem becomes harder, and hence our method requires more weight vectors to be learned per label.\n\n3.2 Missing Labels\nNext, we conduct experiments for multilabel classification with missing labels. Specifically, we\nremove a fixed fraction of training labels randomly from each dataset considered. 
We then apply\nBML-CS as well as the 1-vs-all SVM method to such training data. Since SVM cannot directly handle\nmissing labels, we always set a missing label to be a negative label. In contrast, our method can\nexplicitly handle missing labels and can perform inference by marginalizing the unobserved tags.\nAs the number of positive labels is significantly smaller than the number of negative labels, when only a small\nfraction of labels is removed, both SVM and BML-CS obtain accuracies similar to the case where\nall the labels are present. However, as the number of missing labels increases there is a smooth dip\nin the precision of the two methods. Figure 4 (a) compares precision obtained by BML-CS with the\nprecision obtained by 1-vs-all SVM. Clearly, our method performs better than SVM, while using\nonly K = 0.5L weight vectors.\n\nFigure 4: (a) Precision (in retrieving the most likely positive label) obtained by BML-CS and SVM\nmethods on RCV1 dataset with varying fraction of missing labels. We observe that BML-CS obtains\nhigher precision values than baseline SVM (K = 0.5L). (b) Precision obtained after each round\nof active learning by BML-CS-Active method and by the baseline random selection strategy over\nRCV1 dataset. (c) Precision after active learning, where one label per point is added to the training\nset, in comparison with random baseline on RCV1 dataset. Parameters for (b) and (c): K = 0.1L.\nBoth (b) and (c) start with 100 points initially.\n\n3.3 Active Learning\nIn this section, we provide empirical results for some of the active learning tasks we discussed in\nSection 2.3. 
For each of the tasks, we use our Bayesian multilabel method to compute the posterior\nover the label vector. We then select the desired label/point appropriately according to each individual\ntask. For each of the tasks, we compare our method against an appropriate baseline method.\nTraditional Active Learning: The goal here is to select the most informative points which, if labeled\ncompletely, will increase the accuracy by the highest amount. We use uncertainty sampling, where\nwe consider the entropy of the posterior over the label vector as the selection criterion for the BML-CS-Active\nmethod. We compare the proposed method against the standard random selection baseline.\nFor these experiments, we initialize both methods with an initial labeled dataset of 100 points,\nand then after each active learning round we seek all the labels for the selected training data point.\nFigure 4 (b) compares precisions obtained by the BML-CS-Active method with the precisions obtained\nby the baseline method after every active learning round. After just 15 active learning rounds, our\nmethod gains about 6% in accuracy, while the random selection method does not provide any gain\nin accuracy.\nActive Diagnosis: In this type of active learning, we query one label for each of the training points\nin each round. For each training point, we choose the label with the most uncertainty and ask for\nits value. Figure 4 (c) plots the improvement in precision values with the number of rounds of active\nlearning, for estimating the top-1 label. From the plot, we can see that after just 20 rounds, selection\nby uncertainty yields an improvement of 20% over the random baseline.\n4 Conclusion and Future Work\nWe presented a Bayesian framework for multilabel classification that uses compressive sensing.\nThe proposed framework jointly models the compressive sensing/reconstruction task with learning\nregression over the compressed space. 
We present an efficient variational inference scheme that\njointly resolves compressed sensing and regression tasks. The resulting posterior distribution can\nfurther be used to perform different flavors of active learning. Experimental evaluations highlight the\nefficacy of the framework. Future directions include considering other structured prediction tasks\nthat are sparse and applying the framework to novel scenarios. Further, instead of myopic next-best\ninformation seeking, we also seek to investigate non-myopic selective sampling where an optimal\nsubset of unlabeled data is selected.\n\nReferences\n[1] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In NIPS, pages 772\u2013780, 2009.\n[2] B. Hariharan, L. Zelnik-Manor, S. V. N. Vishwanathan, and M. Varma. Large scale max-margin multi-label classification with priors. In ICML, pages 423\u2013430, 2010.\n[3] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. IJDWM, 3(3):1\u201313, 2007.\n[4] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453\u20131484, 2005.\n[5] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757\u20131771, 2004.\n[6] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In NIPS, 2003.\n[7] R. M. Rifkin and A. Klautau. In defense of one-vs-all classification. 
Journal of Machine Learning Research, 5:101\u2013141, 2004.\n[8] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301\u2013321, 2009.\n[9] S. Foucart. Hard thresholding pursuit: an algorithm for compressive sensing, 2010. preprint.\n[10] D. Baron, S. S. Sarvotham, and R. G. Baraniuk. Bayesian compressive sensing via belief propagation. IEEE Transactions on Signal Processing, 58(1), 2010.\n[11] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE Transactions on Signal Processing, 56(6), 2008.\n[12] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In COLT, 2003.\n[13] N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian Process method: Informative vector machines. NIPS, 2002.\n[14] D. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4), 1992.\n[15] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In ICML, 2000.\n[16] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3), 1997.\n[17] B. Yang, J.-Tao Sun, T. Wang, and Z. Chen. Effective multi-label active learning for text classification. In KDD, pages 917\u2013926, 2009.\n[18] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21\u201335, 2010.\n[19] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.\n[20] M. E. Tipping. Sparse bayesian learning and the relevance vector machine. 
Journal of Machine Learning Research, 1:211\u2013244, 2001.", "award": [], "sourceid": 1243, "authors": [{"given_name": "Ashish", "family_name": "Kapoor", "institution": null}, {"given_name": "Raajay", "family_name": "Viswanathan", "institution": null}, {"given_name": "Prateek", "family_name": "Jain", "institution": null}]}