{"title": "Large-Scale Bayesian Multi-Label Learning via Topic-Based Label Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 3222, "page_last": 3230, "abstract": "We present a scalable Bayesian multi-label learning model based on learning low-dimensional label embeddings. Our model assumes that each label vector is generated as a weighted combination of a set of topics (each topic being a distribution over labels), where the combination weights (i.e., the embeddings) for each label vector are conditioned on the observed feature vector. This construction, coupled with a Bernoulli-Poisson link function for each label of the binary label vector, leads to a model with a computational cost that scales in the number of positive labels in the label matrix. This makes the model particularly appealing for real-world multi-label learning problems where the label matrix is usually very massive but highly sparse. Using a data-augmentation strategy leads to full local conjugacy in our model, facilitating simple and very efficient Gibbs sampling, as well as an Expectation Maximization algorithm for inference. Also, predicting the label vector at test time does not require doing an inference for the label embeddings and can be done in closed form. We report results on several benchmark data sets, comparing our model with various state-of-the art methods.", "full_text": "Large-Scale Bayesian Multi-Label Learning via\n\nTopic-Based Label Embeddings\n\nPiyush Rai\u2020\u2217, Changwei Hu\u2217, Ricardo Henao\u2217, Lawrence Carin\u2217\n\npiyush@cse.iitk.ac.in, {ch237,r.henao,lcarin}@duke.edu\n\n\u2020CSE Dept, IIT Kanpur\n\n\u2217ECE Dept, Duke University\n\nAbstract\n\nWe present a scalable Bayesian multi-label learning model based on learning low-\ndimensional label embeddings. 
Our model assumes that each label vector is generated as a weighted combination of a set of topics (each topic being a distribution over labels), where the combination weights (i.e., the embeddings) for each label vector are conditioned on the observed feature vector. This construction, coupled with a Bernoulli-Poisson link function for each label of the binary label vector, leads to a model with a computational cost that scales in the number of positive labels in the label matrix. This makes the model particularly appealing for real-world multi-label learning problems where the label matrix is usually very massive but highly sparse. Using a data-augmentation strategy leads to full local conjugacy in our model, facilitating simple and very efficient Gibbs sampling, as well as an Expectation Maximization algorithm for inference. Also, predicting the label vector at test time does not require inferring the label embeddings and can be done in closed form. We report results on several benchmark data sets, comparing our model with various state-of-the-art methods.\n\n1 Introduction\n\nMulti-label learning refers to the problem setting in which the goal is to assign to an object (e.g., a video, image, or webpage) a subset of labels (e.g., tags) from a (possibly very large) set of labels. The label assignments of each example can be represented using a binary label vector, indicating the presence/absence of each label. 
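As a small illustrative sketch (toy sizes and label assignments are hypothetical, not from the paper), the binary label-vector representation described above can be written with each column of a matrix Y holding one example's label vector:

```python
import numpy as np

# Toy multi-label setup: L labels, N examples (sizes are illustrative).
# Column n of Y is the binary label vector y_n of example n.
L, N = 5, 3
Y = np.zeros((L, N), dtype=int)
Y[[0, 2], 0] = 1      # example 0 carries labels {0, 2}
Y[[1], 1] = 1         # example 1 carries label {1}
Y[[0, 3, 4], 2] = 1   # example 2 carries labels {0, 3, 4}

nnz = int(Y.sum())    # number of positive labels in the matrix
```

In real data sets the label matrix is huge but mostly zeros, which is exactly the regime the model below is designed to exploit.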
Despite a signi\ufb01cant amount of prior work, multi-label learning [7,\n6] continues to be an active area of research, with a recent surge of interest [1, 25, 18, 13, 10] in\ndesigning scalable multi-label learning methods to address the challenges posed by problems such as\nimage/webpage annotation [18], computational advertising [1, 18], medical coding [24], etc., where\nnot only the number of examples and data dimensionality are large but the number of labels can also\nbe massive (several thousands to even millions).\nOften, in multi-label learning problems, many of the labels tend to be correlated with each other.\nTo leverage the label correlations and also handle the possibly massive number of labels, a common\napproach is to reduce the dimensionality of the label space, e.g., by projecting the label vectors to\na subspace [10, 25, 21], learning a prediction model in that space, and then projecting back to the\noriginal space. However, as the label space dimensionality increases and/or the sparsity in the label\nmatrix becomes more pronounced (i.e., very few ones), and/or if the label matrix is only partially\nobserved, such methods tend to suffer [25] and can also become computationally prohibitive.\nTo address these issues, we present a scalable, fully Bayesian framework for multi-label learning.\nOur framework is similar in spirit to the label embedding methods based on reducing the label space\ndimensionality [10, 21, 25]. 
However, our framework offers the following key advantages: (1) the computational cost of training our model scales in the number of ones in the label matrix, which makes our framework scale easily in cases where the label matrix is massive but sparse; (2) our likelihood model for the binary labels, based on a Bernoulli-Poisson link, more realistically models the extreme sparsity of the label matrix as compared to the commonly employed logistic/probit link; and (3) our model is more interpretable - the embeddings naturally correspond to topics, where each topic is a distribution over labels. Moreover, at test time, unlike other Bayesian methods [10], we do not need to infer the label embeddings of the test example, thereby leading to faster predictions.\nIn addition to the modeling flexibility that leads to a robust, interpretable, and scalable model, our framework enjoys full local conjugacy, which allows us to develop a simple Gibbs sampler, as well as an Expectation Maximization (EM) algorithm for the proposed model, both of which are simple to implement in practice (and amenable to parallelization).\n\n2 The Model\n\nWe assume that the training data are given in the form of N examples represented by a feature matrix X \u2208 R^{D\u00d7N}, along with their labels in a (possibly incomplete) label matrix Y \u2208 {0, 1}^{L\u00d7N}. The goal is to learn a model that can predict the label vector y\u2217 \u2208 {0, 1}^L for a test example x\u2217 \u2208 R^D. We model the binary label vector yn of the nth example by thresholding a count-valued vector mn\n\nyn = 1(mn \u2265 1)   (1)\n\nwhich, for each individual binary label yln \u2208 yn, l = 1, . . . , L, can also be written as yln = 1(mln \u2265 1). In Eq. (1), mn = [m1n, . . . , mLn] \u2208 Z^L denotes a latent count vector of size L and is assumed drawn from a Poisson\n\nmn \u223c Poisson(\u03bbn)   (2)\n\nEq. (2) denotes drawing each component of mn independently, from a Poisson distribution, with rate equal to the corresponding component of \u03bbn \u2208 R^L_+, which is defined as\n\n\u03bbn = Vun   (3)\n\nHere V \u2208 R^{L\u00d7K}_+ and un \u2208 R^K_+ (typically K \u226a L). Note that the K columns of V can be thought of as atoms of a label dictionary (or \u201ctopics\u201d over labels) and un can be thought of as the atom weights or embedding of the label vector yn (or \u201ctopic proportions\u201d, i.e., how active each of the K topics is for example n). Also note that Eqs. (1)-(3) can be combined as\n\nyn = f (\u03bbn) = f (Vun)   (4)\n\nwhere f jointly denotes drawing the latent counts mn from a Poisson (Eq. 2) with rate \u03bbn = Vun, followed by thresholding mn at 1 (Eq. 1). In particular, note that marginalizing out mn from Eqs. (1)-(2) leads to yn \u223c Bernoulli(1 \u2212 exp(\u2212\u03bbn)). This link function, termed the Bernoulli-Poisson link [28, 9], has also been used recently in modeling relational data with binary observations.\nIn Eq. (4), expressing the label vector yn \u2208 {0, 1}^L in terms of Vun is equivalent to a low-rank assumption on the L \u00d7 N label matrix Y = [y1 . . . yN ]: Y = f (VU), where V = [v1 . . . vK] \u2208 R^{L\u00d7K}_+ and U = [u1 . . . uN ] \u2208 R^{K\u00d7N}_+, which are modeled as follows\n\nvk \u223c Dirichlet(\u03b7 1L)   (5)\nukn \u223c Gamma(rk, pkn(1 \u2212 pkn)^{-1})   (6)\npkn = \u03c3(wk^\u22a4 xn)   (7)\nwk \u223c Nor(0, \u0393)   (8)\n\nwhere \u03c3(z) = 1/(1 + exp(\u2212z)), \u0393 = diag(\u03c4_1^{-1}, . . . , \u03c4_D^{-1}), and the hyperparameters rk, \u03c41, . . . , \u03c4D are given improper gamma priors. Since the columns of V are Dirichlet drawn, they correspond to distributions (i.e., topics) over the labels. 
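For concreteness, a minimal NumPy sketch of the generative process in Eqs. (1)-(8) follows; the dimensions, hyperparameter values, and variable names are illustrative, not the paper's experimental settings (and \u0393 is taken as the identity for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, D, N = 50, 5, 10, 4      # labels, topics, features, examples (illustrative)
eta, r = 0.1, 1.0              # Dirichlet and gamma hyperparameters (illustrative)

X = rng.normal(size=(D, N))                    # observed feature vectors x_n
V = rng.dirichlet(eta * np.ones(L), size=K).T  # L x K: each column v_k is a topic over labels (Eq. 5)
W = rng.normal(size=(D, K))                    # regression weights w_k (Eq. 8, Gamma = I assumed)

P = 1.0 / (1.0 + np.exp(-(X.T @ W)))           # N x K: p_kn = sigmoid(w_k^T x_n) (Eq. 7)
U = rng.gamma(shape=r, scale=P / (1.0 - P)).T  # K x N: u_kn ~ Gamma(r, p_kn / (1 - p_kn)) (Eq. 6)
M = rng.poisson(V @ U)                         # latent counts m_n ~ Poisson(V u_n) (Eq. 2)
Y = (M >= 1).astype(int)                       # binary labels y_n = 1(m_n >= 1) (Eq. 1)
```

Each column of V sums to one, so the embedding un simply mixes the K label topics into the Poisson rates \u03bbn = Vun.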
It is important to note here that the dependence of the label embedding un = {ukn}_{k=1}^K on the feature vector xn is achieved by making the scale parameter of the gamma prior on {ukn}_{k=1}^K depend on {pkn}_{k=1}^K, which in turn depends on the features xn via the regression weights W = {wk}_{k=1}^K (Eqs. 6 and 8).\n\nFigure 1: Graphical model for the generative process of the label vector. Hyperpriors omitted for brevity.\n\n\f2.1 Computational scalability in the number of positive labels\n\nFor the Bernoulli-Poisson likelihood model for binary labels, we can write the conditional posterior [28, 9] of the latent count vector mn as\n\n(mn|yn, V, un) \u223c yn \u2299 Poisson+(Vun)   (9)\n\nwhere Poisson+ denotes the zero-truncated Poisson distribution with support only on the positive integers, and \u2299 denotes the element-wise product. Eq. 9 suggests that the zeros in yn will result in the corresponding elements of the latent count vector mn being zero, almost surely (i.e., with probability one). As shown in Section 3, the sufficient statistics of the model parameters do not depend on latent counts that are equal to zero; such latent counts can simply be ignored during inference. This aspect leads to substantial computational savings in our model, making it scale only in the number of positive labels in the label matrix. In the rest of the exposition, we will refer to our model as BMLPL to denote Bayesian Multi-label Learning via Positive Labels.\n\n2.2 Asymmetric Link Function\n\nIn addition to the computational advantage (i.e., scaling in the number of non-zeros in the label matrix), another appealing aspect of our multi-label learning framework is that the Bernoulli-Poisson likelihood is also a more realistic model for highly sparse binary data as compared to the commonly used logistic/probit likelihood. 
To see this, note that the Bernoulli-Poisson model defines the probability of an observation y being one as p(y = 1|\u03bb) = 1 \u2212 exp(\u2212\u03bb), where \u03bb is the positive rate parameter. As \u03bb increases, p(y = 1|\u03bb) grows from 0.5 towards 1 much more slowly than it drops from 0.5 towards 0 as \u03bb decreases. This behavior of the Bernoulli-Poisson link encourages far fewer nonzeros than zeros in the observed data. The logistic and probit links, on the other hand, approach 0 and 1 at the same rate, and therefore cannot model the sparsity/skewness of the label matrix the way the Bernoulli-Poisson link does. Therefore, in contrast to multi-label learning models based on a logistic/probit likelihood function or on standard loss functions such as the hinge loss [25, 14] for the binary labels, our proposed model provides better robustness against label imbalance.\n\n3 Inference\n\nA key aspect of our framework is that the conditional posteriors of all the model parameters are available in closed form using the data augmentation strategies that we describe below. In particular, since we model the binary label matrix as thresholded counts, we are also able to leverage some of the inference methods proposed for Bayesian matrix factorization of count-valued data [27] to derive an efficient Gibbs sampler for our model.\nInference in our model requires estimating V \u2208 R^{L\u00d7K}_+, W \u2208 R^{D\u00d7K}, U \u2208 R^{K\u00d7N}_+, and the hyperparameters of the model. As we will see below, the latent count vectors {mn}_{n=1}^N (which are functions of V and U) provide sufficient statistics for the model parameters. Each element of mn (if the corresponding element in yn is one) is drawn from a truncated Poisson distribution\n\nmln \u223c Poisson+(Vl,: un) = Poisson+(\u03bbln)   (10)\n\nwhere Vl,: denotes the lth row of V and \u03bbln = \u03a3_{k=1}^K \u03bbkln = \u03a3_{k=1}^K vlk ukn. Thus we can also write mln = \u03a3_{k=1}^K mlkn, where mlkn \u223c Poisson+(\u03bbkln) = Poisson+(vlk ukn). On the other hand, if yln = 0 then mln = 0 with probability one (Eq. (9)), and therefore need not be sampled because it does not affect the sufficient statistics of the model parameters.\nUsing the equivalence of the Poisson and multinomial distributions [27], we can express the decomposition mln = \u03a3_{k=1}^K mlkn as a draw from a multinomial\n\n[ml1n, . . . , mlKn] \u223c Mult(mln; \u03b6l1n, . . . , \u03b6lKn)   (11)\n\nwhere \u03b6lkn = vlk ukn / \u03a3_{k'=1}^K vlk' uk'n. This allows us to exploit the Dirichlet-multinomial conjugacy and helps in designing efficient Gibbs sampling and EM algorithms for doing inference in our model. As discussed before, the computational cost of both algorithms scales in the number of ones in the label matrix Y, which makes our model especially appealing for multi-label learning problems where the label matrix is massive but highly sparse.\n\n3.1 Gibbs Sampling\n\nGibbs sampling for our model proceeds as follows.\nSampling V: Using Eq. 11 and the Dirichlet-multinomial conjugacy, each column of V \u2208 R^{L\u00d7K}_+ can be sampled as\n\nvk \u223c Dirichlet(\u03b7 + m1k, . . . , \u03b7 + mLk)   (12)\n\nwhere mlk = \u03a3_n mlkn, \u2200l = 1, . . . , L.\nSampling U: Using the gamma-Poisson conjugacy, each entry of U \u2208 R^{K\u00d7N}_+ can be sampled as\n\nukn \u223c Gamma(rk + mkn, pkn)   (13)\n\nwhere mkn = \u03a3_l mlkn and pkn = \u03c3(wk^\u22a4 xn).\nSampling W: Since mkn = \u03a3_l mlkn and mlkn \u223c Poisson+(vlk ukn), p(mkn|ukn) is also Poisson. Further, since p(ukn|rk, pkn) is gamma, we can integrate out ukn from p(mkn|ukn), which gives\n\nmkn \u223c NegBin(rk, pkn)\n\nwhere NegBin(., .) 
denotes the negative binomial distribution.\nAlthough the negative binomial is not conjugate to the Gaussian prior on wk, we leverage the P\u00f3lya-Gamma data augmentation strategy [17] to \u201cGaussianify\u201d the negative binomial likelihood. Doing this, we are able to derive closed-form Gibbs sampling updates for wk, k = 1, . . . , K. The P\u00f3lya-Gamma (PG) strategy is based on sampling a set of auxiliary variables, one for each observation (which, in the context of sampling wk, are the latent counts mkn). For sampling wk, we draw N P\u00f3lya-Gamma random variables [17] \u03c9k1, . . . , \u03c9kN (one for each training example) as\n\n\u03c9kn \u223c PG(mkn + rk, wk^\u22a4 xn)   (14)\n\nwhere PG(., .) denotes the P\u00f3lya-Gamma distribution [17]. Given these PG variables, the posterior distribution of wk is the Gaussian Nor(\u03bc_wk, \u03a3_wk), where\n\n\u03a3_wk = (X \u2126k X^\u22a4 + \u0393^{-1})^{-1}   (15)\n\u03bc_wk = \u03a3_wk X \u03bak   (16)\n\nwhere \u2126k = diag(\u03c9k1, . . . , \u03c9kN ) and \u03bak = [(mk1 \u2212 rk)/2, . . . , (mkN \u2212 rk)/2]^\u22a4.\nSampling the hyperparameters: The hyperparameter rk is given a gamma prior and can be sampled easily. The other hyperparameters \u03c41, . . . , \u03c4D are estimated using Type-II maximum likelihood estimation [22].\n\n3.2 Expectation Maximization\n\nThe Gibbs sampler described in Section 3.1 is efficient and has a computational complexity that scales in the number of ones in the label matrix. To further scale up the inference, we also develop an efficient Expectation-Maximization (EM) inference algorithm for our model. In the E-step, we need to compute the expectations of the local variables U, the latent counts, and the P\u00f3lya-Gamma variables \u03c9k1, . . . , \u03c9kN , for k = 1, . . . , K. These expectations are available in closed form and can thus easily be computed. 
In particular, the expectation of each P\u00f3lya-Gamma variable \u03c9kn is very efficient to compute and is available in closed form [20]\n\nE[\u03c9kn] = ((mkn + rk)/(2 wk^\u22a4 xn)) tanh(wk^\u22a4 xn/2)   (17)\n\nThe M-step involves a maximization w.r.t. V and W, which essentially involves solving for their maximum-a-posteriori (MAP) estimates, which are available in closed form. In particular, as shown in [20], estimating wk requires solving a linear system which, in our case, is of the form\n\nSk wk = dk   (18)\n\nwhere Sk = X \u2126k X^\u22a4 + \u0393^{-1}, dk = X \u03bak, and \u2126k and \u03bak are defined as in Section 3.1, except that the P\u00f3lya-Gamma random variables are replaced by their expectations given by Eq. 17. Note that Eq. 18 can be straightforwardly solved as wk = Sk^{-1} dk. However, convergence of the EM algorithm [20] does not require solving for wk exactly in each EM iteration, and running a couple of iterations of any of the various iterative methods that solve a linear system of equations can be used for this step. We use the Conjugate Gradient [2] method, which also allows us to exploit the sparsity in X and \u2126k to solve this system of equations very efficiently, even when D and N are very large.\nAlthough in this paper we only use batch EM, it is possible to speed it up even further using an online version of this EM algorithm, as shown in [20]. The online EM processes data in small minibatches and in each EM iteration updates the sufficient statistics of the global parameters. In our case, these sufficient statistics include Sk and dk, for k = 1, . . . , K, and can be updated as\n\nSk^(t+1) = (1 \u2212 \u03b3t) Sk^(t) + \u03b3t X^(t) \u2126k^(t) X^(t)\u22a4\ndk^(t+1) = (1 \u2212 \u03b3t) dk^(t) + \u03b3t X^(t) \u03bak^(t)\n\nwhere X^(t) denotes the set of examples in the current minibatch, and \u2126k^(t) and \u03bak^(t) denote quantities that are computed using the data from the current minibatch.\n\n3.3 Predicting Labels for Test Examples\n\nPredicting the label vector y\u2217 \u2208 {0, 1}^L for a new test example x\u2217 \u2208 R^D can be done as\n\np(y\u2217 = 1|x\u2217) = \u222b_{u\u2217} (1 \u2212 exp(\u2212Vu\u2217)) p(u\u2217) du\u2217\n\nIf using Gibbs sampling, the integral above can be approximated using samples {u\u2217^(m)}_{m=1}^M from the posterior of u\u2217. It is also possible to integrate out u\u2217 (details skipped for brevity) and obtain a closed-form estimate of the probability of each label yl\u2217 in terms of the model parameters V and W, given by\n\np(yl\u2217 = 1|x\u2217) = 1 \u2212 \u220f_{k=1}^K 1/[Vlk exp(wk^\u22a4 x\u2217) + 1]^{rk}   (19)\n\n4 Computational Cost\n\nComputing the latent count mln for each nonzero entry yln in Y requires computing [ml1n, . . . , mlKn], which takes O(K) time; therefore computing all the latent counts takes O(nnz(Y)K) time, which is very efficient if Y has very few nonzeros (as is true of most real-world multi-label learning problems). Estimating V, U, and the hyperparameters is relatively cheap and can be done very efficiently. The P\u00f3lya-Gamma variables, when doing Gibbs sampling, can be efficiently sampled using the methods described in [17]; when doing EM, they can be computed even more cheaply because the P\u00f3lya-Gamma expectations, which are available in closed form (as a hyperbolic tangent function), can be computed very efficiently [20]. 
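As a sketch, the closed-form test-time prediction of Eq. (19) can be implemented directly from V, W, and the rk's; the random parameters below are placeholders for learned ones and are used only for shape-checking:

```python
import numpy as np

def predict_label_probs(x_star, V, W, r):
    """Closed-form p(y_l* = 1 | x*) from Eq. (19):
       p_l = 1 - prod_k [V_lk exp(w_k^T x*) + 1]^(-r_k)."""
    s = np.exp(W.T @ x_star)  # K-vector of exp(w_k^T x*)
    return 1.0 - np.prod((V * s[None, :] + 1.0) ** (-r[None, :]), axis=1)

# Placeholder (untrained) parameters, for illustration only.
rng = np.random.default_rng(1)
L, K, D = 6, 3, 4
V = rng.dirichlet(0.5 * np.ones(L), size=K).T  # L x K topic matrix
W = rng.normal(size=(D, K))                    # D x K regression weights
r = np.ones(K)                                 # gamma shape hyperparameters r_k
p = predict_label_probs(rng.normal(size=D), V, W, r)
```

Since every factor in the product is at least 1, each predicted probability lies in [0, 1), and no per-example embedding needs to be inferred at test time.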
The most dominant step is estimating W; when doing Gibbs sampling, if done na\u00efvely, it would take O(DK^3) time if sampling W row-wise, and O(KD^3) time if sampling column-wise. However, when using the EM algorithm, estimating W can be done much more efficiently, e.g., using Conjugate Gradient updates, because it is not even required to solve for W exactly in each iteration of the EM algorithm [20]. Also note that since most of the parameter updates for different k = 1, . . . , K, n = 1, . . . , N are independent of each other, our Gibbs sampler and EM algorithm can be easily parallelized/block-updated.\n\n5 Connection: Topic Models with Meta-Data\n\nAs discussed earlier, our multi-label learning framework is similar in spirit to a topic model, as the label embeddings naturally correspond to topics - each Dirichlet-drawn column vk of the matrix V \u2208 R^{L\u00d7K}_+ can be seen as representing a \u201ctopic\u201d. In fact, our model can directly be seen as a topic model [3, 27] in which we have side-information associated with each document (e.g., document features): each document yn \u2208 {0, 1}^L (in a bag-of-words representation with vocabulary of size L) may also have some meta-data xn \u2208 R^D associated with it. Our model can therefore also be used to perform topic modeling of text documents with such meta-data [15, 12, 29, 19] in a robust and scalable manner.\n\n\f6 Related Work\n\nDespite the significant number of methods proposed in recent years, learning from multi-label data continues to be an active area of research, especially due to the recent surge of interest in learning when the output space (i.e., the number of labels) is massive. 
To handle the huge dimensionality of the label space, a common approach is to embed the labels in a lower-dimensional space, e.g., using methods such as Canonical Correlation Analysis or other methods for jointly embedding feature and label vectors [26, 5, 23], Compressed Sensing [8, 10], or by assuming that the matrix consisting of the weight vectors of all the labels is a low-rank matrix [25]. Another interesting line of work on label embedding methods makes use of random projections to reduce the label space dimensionality [11, 16], or uses methods such as multitask learning (each label is a task).\nOur proposed framework is most similar in spirit to the aforementioned class of label embedding based methods (we compare with some of these in our experiments). In contrast to these methods, our framework reduces the label-space dimensionality via a nonlinear mapping (Section 2), has accompanying inference algorithms that scale in the number of positive labels (Section 2.1), has an underlying generative model that more realistically models the imbalanced nature of the labels in the label matrix (Section 2.2), can deal with missing labels, and is easily parallelizable. Also, the connection to topic models provides a nice interpretability to the results, which is usually not possible with the other methods (e.g., in our model, the columns of the matrix V can be seen as a set of topics over the labels; in Section 7.2, we show an experiment on this). Moreover, although in this paper we have focused on the multi-label learning problem, our framework can also be applied to multiclass problems via the one-vs-all reduction, in which case the label matrix is usually very sparse (each column of the label matrix represents the labels of a single one-vs-all binary classification problem).\nFinally, although not a focus of this paper, some other important aspects of the multi-label learning problem have also been looked at in recent work. 
For example, fast prediction at test time is an important concern when the label space is massive. To deal with this, some recent work focuses on methods that only incur a logarithmic cost (in the number of labels) at test time [1, 18], e.g., by inferring and leveraging a tree structure over the labels.\n\n7 Experiments\n\nWe evaluate the proposed multi-label learning framework on four benchmark multi-label data sets - bibtex, delicious, compphys, eurlex [25] - with their statistics summarized in Table 1. The data sets we use in our experiments have both feature and label dimensions that range from a few hundred to several thousand. In addition, the feature and/or label matrices are also quite sparse.\n\nData set | D | L | Ntrain | \u00afL | \u00afD | Ntest | \u00afL | \u00afD\nbibtex | 1836 | 159 | 4880 | 2.40 | 68.74 | 2515 | 2.40 | 68.50\ndelicious | 500 | 983 | 12920 | 19.03 | 18.17 | 3185 | 19.00 | 18.80\ncompphys | 33,284 | 208 | 161 | 9.80 | 792.78 | 40 | 11.83 | 899.20\neurlex | 5000 | 3993 | 17413 | 5.30 | 236.69 | 1935 | 5.32 | 240.96\n\nTable 1: Statistics of the data sets used in our experiments. Columns 4-6 describe the training set and columns 7-9 the test set. \u00afL denotes the average number of positive labels per example; \u00afD denotes the average number of nonzero features per example.\n\nWe compare the proposed model BMLPL with four state-of-the-art methods. All these methods, just like our method, are based on the assumption that the label vectors live in a low-dimensional space.\n\n\u2022 CPLST: Conditional Principal Label Space Transformation [5]: CPLST is based on embedding the label vectors conditioned on the features.\n\u2022 BCS: Bayesian Compressed Sensing for multi-label learning [10]: BCS is a Bayesian method that uses the idea of doing compressed sensing on the labels [8].\n\u2022 WSABIE: It assumes that the feature as well as the label vectors live in a low-dimensional space. The model is based on optimizing a weighted approximate ranking loss [23].\n\u2022 LEML: Low-rank Empirical risk minimization for multi-label learning [25]. For LEML, we report the best results across the three loss functions (squared, logistic, hinge) they propose.\n\n\fTable 2 shows the results, where we report the Area Under the ROC Curve (AUC) for each method on all the data sets. For each method, as done in [25], we vary the label space dimensionality from 20% - 100% of L, and report the best results. For BMLPL, both Gibbs sampling and EM based inference perform comparably (though EM runs much faster than Gibbs); here we report results obtained with EM inference only (Section 7.4 provides another comparison between these two inference methods). The EM algorithms were run for 1000 iterations and they converged in all the cases.\nAs shown in Table 2, in almost all of the cases the proposed BMLPL model performs better than the other methods (except for the compphys data set, where the AUC is slightly worse than LEML). The better performance of our model justifies the flexible Bayesian formulation and also shows evidence of the robustness provided by the asymmetric link function against sparsity and label imbalance in the label matrix (note that the data sets we use have very sparse label matrices).\n\nData set | CPLST | BCS | WSABIE | LEML | BMLPL\nbibtex | 0.8882 | 0.8614 | 0.9182 | 0.9040 | 0.9210\ndelicious | 0.8834 | 0.8000 | 0.8561 | 0.8894 | 0.8950\ncompphys | 0.7806 | 0.7884 | 0.8212 | 0.9274 | 0.9211\neurlex | - | - | 0.8651 | 0.9456 | 0.9520\n\nTable 2: Comparison of the various methods in terms of AUC scores on all the data sets. Note: CPLST and BCS were not feasible to run on the eurlex data, so we are unable to report those numbers here.\n\n7.1 Results with Missing Labels\n\nOur generative model for the label matrix can also handle missing labels (the missing labels may include both zeros and ones). 
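A minimal sketch of this masking idea follows (the -1 code for unobserved entries is a hypothetical convention for illustration, not the paper's notation): inference only iterates over observed positive entries, since observed zeros force mln = 0 and unobserved entries are simply skipped.

```python
import numpy as np

# Hypothetical encoding of a partially observed label matrix:
# 1 = observed positive, 0 = observed zero, -1 = unobserved.
Y = np.array([[ 1, -1,  0],
              [-1,  1, -1],
              [ 0,  0,  1]])

# Only the observed positive entries drive the latent-count updates.
rows, cols = np.where(Y == 1)
positives = list(zip(rows.tolist(), cols.tolist()))  # [(0, 0), (1, 1), (2, 2)]
```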
We perform an experiment on two of the data sets - bibtex and compphys - where only 20% of the labels from the label matrix are revealed (note that, of all these revealed labels, our model uses only the positive ones), and compare our model with LEML and BCS (both capable of handling missing labels). The results are shown in Table 3. For each method, we set K = 0.4L. As the results show, our model yields better results than the competing methods even in the presence of missing labels.\n\nData set | BCS | LEML | BMLPL\nbibtex | 0.7871 | 0.8332 | 0.8420\ncompphys | 0.6442 | 0.7964 | 0.8012\n\nTable 3: AUC scores with only 20% labels observed.\n\n7.2 Qualitative Analysis: Topic Modeling on Eurlex Data\n\nSince, in our model, each column of the L \u00d7 K matrix V represents a distribution (i.e., a \u201ctopic\u201d) over the labels, to assess its ability to discover meaningful topics we run an experiment on the Eurlex data with K = 20 and look at each column of V. The Eurlex data consists of 3993 labels (each of which is a tag; a document can have a subset of the tags), so each column in V is of that size. In Table 4, we show five of the topics (and the top five labels in each topic, based on the magnitude of the entries in the corresponding column of V). As shown in Table 4, our model is able to discover clear and meaningful topics from the Eurlex data, which shows its usefulness as a topic model when each document yn \u2208 {0, 1}^L has features in the form of meta-data xn \u2208 R^D associated with it.\n\nTopic 1 (Nuclear): nuclear safety; nuclear power station; radioactive effluent; radioactive waste; radioactive pollution\nTopic 2 (Agreements): EC agreement; trade agreement; EC interim agreement; trade cooperation; EC coop. agree.\nTopic 3 (Environment): environmental protection; waste management; env. monitoring; dangerous substance; pollution control measures\nTopic 4 (Stats & Data): community statistics; statistical method; agri. statistics; statistics; data transmission\nTopic 5 (Fishing Trade): fishing regulations; fishing agreement; fishery management; fishing area; conservation of fish stocks\n\nTable 4: Most probable labels in different topics.\n\n\f7.3 Scalability w.r.t. Number of Positive Labels\n\nTo demonstrate the linear scalability in the number of positive labels, we run an experiment on the Delicious data set, varying the number of positive labels used for training the model from 20% to 100% (to simulate this, we simply treat all the other labels as zeros, so as to keep the label matrix size constant). We run each experiment for 100 iterations (using EM for the inference) and report the running time for each case. Fig. 2 (left) shows the results, which demonstrate the roughly linear scalability w.r.t. the number of positive labels. This experiment is only meant as a small illustration. Note that the actual scalability will also depend on the relative values of D and L and on the sparsity of Y. In any case, the computations that involve the labels (both positive and negative) depend only on the positive labels, and this part, for our model, is clearly linear in the number of positive labels in the label matrix.\n\nFigure 2: (Left) Scalability w.r.t. number of positive labels. (Right) Time vs accuracy comparison for Gibbs and EM (with exact and with CG based M steps).\n\n7.4 Gibbs Sampling vs EM\n\nWe finally show another experiment comparing Gibbs sampling and EM for our model in terms of accuracy vs running time. We run each inference method for only 100 iterations. For EM, we try two settings: EM with an exact M step for W, and EM with an approximate M step where we run 2 steps of conjugate gradient (CG). Fig. 2 (right) shows a plot comparing each inference method in terms of accuracy vs running time. As Fig. 
2 (right) shows, the EM algorithms (both\nexact as well as the one that uses CG) attain reasonably high AUC scores in a short amount of time,\nwhich the Gibbs sampling takes much longer per iteration and seems to converge rather slowly.\nMoreover, remarkably, EM with 2 iterations CG in each M steps seems to perform comparably\nto the EM with an exact M step, while running considerably faster. As for the Gibbs sampler,\nalthough it runs slower than the EM based inference, it should be noted that the Gibbs sampler\nwould still be considerably faster than other fully Bayesian methods for multi-label prediction (such\nas BCS [10]) because it only requires evaluating the likelihoods over the positive labels in the label\nmatrix). Moreover, the step involving sampling of the W matrix can be made more ef\ufb01cient by using\ncholesky decompositions which can avoid matrix inversions needed for computing the covariance\nof the Gaussian posterior on wk.\n8 Discussion and Conclusion\n\nWe have presented a scalable Bayesian framework for multi-label learning. In addition to providing\na \ufb02exible model for sparse label matrices, our framework is also computationally attractive and\ncan scale to massive data sets. The model is easy to implement and easy to parallelize. Both full\nBayesian inference via simple Gibbs sampling and EM based inference can be carried out in this\nmodel in a computationally ef\ufb01cient way. Possible future work includes developing online Gibbs\nand online EM algorithms to further enhance the scalability of the proposed framework to handle\neven bigger data sets. 
Another possible extension could be to impose label correlations more explicitly (in addition to the low-rank structure already imposed by the current model), e.g., by replacing the Dirichlet distribution on the columns of V with logistic normal distributions [4]. Because our framework allows efficiently computing the predictive distribution of the labels (as shown in Section 3.3), it can be easily extended to do active learning on the labels [10]. Finally, although here we only focused on multi-label learning, our framework can be readily used as a robust and scalable alternative to methods that perform binary matrix factorization with side-information.\n\nAcknowledgements This research was supported in part by ARO, DARPA, DOE, NGA and ONR.\n\nReferences\n[1] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.\n[2] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.\n[3] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. JMLR, 2003.\n[4] Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng, and Bo Zhang. Scalable inference for logistic-normal topic models. In NIPS, 2013.\n[5] Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, 2012.\n[6] Eva Gibaja and Sebastián Ventura. Multilabel learning: A review of the state of the art and ongoing research. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2014.\n[7] Eva Gibaja and Sebastián Ventura. A tutorial on multilabel learning. ACM Comput. Surv., 2015.\n[8] Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang.
Multi-label prediction via compressed sensing. In NIPS, 2009.\n[9] Changwei Hu, Piyush Rai, and Lawrence Carin. Zero-truncated Poisson tensor factorization for massive binary tensors. In UAI, 2015.\n[10] Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classification using Bayesian compressed sensing. In NIPS, 2012.\n[11] Nikos Karampatziakis and Paul Mineiro. Scalable multilabel prediction via randomized methods. arXiv preprint arXiv:1502.02710, 2015.\n[12] Dae I Kim and Erik B Sudderth. The doubly correlated nonparametric topic model. In NIPS, 2011.\n[13] Xiangnan Kong, Zhaoming Wu, Li-Jia Li, Ruofei Zhang, Philip S Yu, Hang Wu, and Wei Fan. Large-scale multi-label learning with incomplete label assignments. In SDM, 2014.\n[14] Xin Li, Feipeng Zhao, and Yuhong Guo. Conditional restricted Boltzmann machines for multi-label learning with incomplete labels. In AISTATS, 2015.\n[15] David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI, 2008.\n[16] Paul Mineiro and Nikos Karampatziakis. Fast label embeddings for extremely large output spaces. In ICLR Workshop, 2015.\n[17] Nicholas G Polson, James G Scott, and Jesse Windle. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339-1349, 2013.\n[18] Yashoteja Prabhu and Manik Varma. FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, 2014.\n[19] Maxim Rabinovich and David Blei. The inverse regression topic model. In ICML, 2014.\n[20] James G Scott and Liang Sun. Expectation-maximization for logistic regression. arXiv preprint arXiv:1306.0040, 2013.\n[21] Farbound Tai and Hsuan-Tien Lin. Multilabel classification with principal label space transformation. Neural Computation, 2012.\n[22] Michael E Tipping.
Bayesian inference: An introduction to principles and practice in machine learning. In Advanced Lectures on Machine Learning, pages 41-62. Springer, 2004.\n[23] Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2011.\n[24] Yan Yan, Glenn Fung, Jennifer G Dy, and Romer Rosales. Medical coding classification by leveraging inter-code relationships. In KDD, 2010.\n[25] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.\n[26] Yi Zhang and Jeff G Schneider. Multi-label output codes using canonical correlation analysis. In AISTATS, 2011.\n[27] M. Zhou, L. A. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.\n[28] Mingyuan Zhou. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, 2015.\n[29] Jun Zhu, Ni Lao, Ning Chen, and Eric P Xing. Conditional topical coding: an efficient topic model conditioned on rich features. In KDD, 2011.