{"title": "Probabilistic n-Choose-k Models for Classification and Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 3050, "page_last": 3058, "abstract": null, "full_text": "Probabilistic n-Choose-k Models for Classi\ufb01cation\n\nand Ranking\n\nKevin Swersky\n\nDaniel Tarlow\n\nDept. of Computer Science\n\nUniversity of Toronto\n\nRyan P. Adams\n\n[kswersky,dtarlow]@cs.toronto.edu\n\nrpa@seas.harvard.edu\n\nSchool of Eng. and Appl. Sciences\n\nHarvard University\n\nRichard S. Zemel\n\nDept. of Computer Science\n\nUniversity of Toronto\n\nzemel@cs.toronto.edu\n\nBrendan J. Frey\n\nProb. and Stat. Inf. Group\n\nUniversity of Toronto\n\nfrey@psi.toronto.edu\n\nAbstract\n\nIn categorical data there is often structure in the number of variables that take on\neach label. For example, the total number of objects in an image and the number of\nhighly relevant documents per query in web search both tend to follow a structured\ndistribution. In this paper, we study a probabilistic model that explicitly includes\na prior distribution over such counts, along with a count-conditional likelihood\nthat de\ufb01nes probabilities over all subsets of a given size. When labels are binary\nand the prior over counts is a Poisson-Binomial distribution, a standard logistic\nregression model is recovered, but for other count distributions, such priors induce\nglobal dependencies and combinatorics that appear to complicate learning and\ninference. However, we demonstrate that simple, ef\ufb01cient learning procedures\ncan be derived for more general forms of this model. We illustrate the utility of\nthe formulation by exploring applications to multi-object classi\ufb01cation, learning\nto rank, and top-K classi\ufb01cation.\n\nIntroduction\n\n1\nWhen models contain multiple output variables, an important potential source of structure is the\nnumber of variables that take on a particular value. 
For example, if we have binary variables indicating the presence or absence of a particular object class in an image, then the number of "present" objects may be highly structured, such as the number of digits in a zip code. In ordinal regression problems there may be some prior knowledge about the proportion of outputs within each level. For instance, when modeling scores assigned to papers submitted to a conference, this structure can be due to instructions that reviewers assign scores such that the distribution is roughly uniform.

One popular model for multiple output classification problems is logistic regression (LR), in which the class probabilities are modeled as being conditionally independent, given the features; another popular approach utilizes a softmax over the class outputs. Both models can be seen as possessing a prior on the label counts: in the case of the softmax model this prior is explicit, namely that exactly one output is active. For LR, there is an implicit factorization in which there is a specific prior on counts; this prior is the source of computational tractability, but it also imparts an inductive bias to the model. The starting observation for our work is that we do not lose much efficiency by replacing the LR counts prior with a general prior, which permits the specification of a variety of inductive biases.

In this paper we present a probabilistic model of multiple output classification, the n-choose-k model, which incorporates a distribution over the label counts, and show that the computations needed for learning and inference in this model are efficient. We develop applications of this model to diverse problems. A maximum-likelihood version of the model can be used for problems such as multi-class recognition, in which the label counts are known at training time but only a prior distribution is known at test time.
The model easily extends to ordinal regression problems, such as ranking or collaborative filtering, in which each item is assigned to one of a small number of relevance levels. We establish a connection between n-choose-k models and ranking objectives, and prove that optimal decision-theoretic predictions under the model for "monotonic" gain functions (to be defined later), which include standard objectives used in ranking, can be achieved by a simple sorting operation. Other problems can be modeled via direct maximization of expected gain. An important aim in classification and information retrieval is to optimize expected Precision@K. We show that we can efficiently optimize this objective under the model and that it yields promising results.

Overall, the result is a class of models along with a well-developed probabilistic framework for learning and inference that makes use of algorithms and modeling components that are not often used in machine learning. We demonstrate that it is a simple, yet expressive probabilistic approach that has many desirable computational properties.

2 Binary n-Choose-k Model

We begin by defining the basic model under the assumption of binary output variables. In the following section, we will generalize to the case of ordinal variables. The model inputs are x, and θ is defined as θ = Wx, where W are the parameters. The model output is a vector of D binary variables y ∈ Y = {0, 1}^D. We will use subsets c ⊆ {1, ..., D} of variable indices and will represent the value assigned to a subset of variables as y_c. We will also make use of the notation c̄ to mean the complement {1, ..., D} \ c.
The generative procedure is then defined as follows:

• Draw k from a prior distribution p(k) over counts k.
• Draw k variables to take on label 1, where the probability of choosing subset c is given by

  p(y_c = 1, y_c̄ = 0 | k) = exp{Σ_{d∈c} θ_d} / Z_k(θ) if |c| = k, and 0 otherwise,   (1)

where θ = (θ_1, ..., θ_D) are parameters that determine individual variable biases towards being off or on, and Z_k(θ) = Σ_{y : Σ_d y_d = k} exp{Σ_d θ_d y_d}. Under this definition Z_0 = 1, and p(0 | 0) = 1. This has been referred to as a conditional Bernoulli distribution [1].

Logistic regression can be viewed as an instantiation of this model, with a "prior" distribution over count values that depends on the parameters θ. This is a forced interpretation, but it is useful in understanding the implicit prior over counts that is imposed when using LR. Specifically, if p(k) is defined to be a particular function of θ (known as a Poisson-Binomial distribution [2]): p(k; θ) = Z_k(θ)/Z(θ), where Z(θ) = Σ_k Z_k(θ), then the joint probability p(y, k; θ) becomes equivalent to a LR model in the following sense. Suppose we have a joint assignment of variables y with Σ_d y_d = k, and p(k; θ) is Poisson-Binomial; then

  p(y, k; θ) = p(k; θ) p(y | k; θ) = (Z_k(θ)/Z(θ)) · (exp{Σ_{d∈c} θ_d} / Z_k(θ)) = Π_d exp{θ_d y_d} / (1 + exp{θ_d}).   (2)

Note that the last equality factorizes Z(θ) to create independence across variables, but it requires that the "prior" be defined in terms of the parameters θ. Our interest in this paper is in the more flexible family of models that arise after breaking the dependence of the "prior" on θ.
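The equivalence in Eq. (2) is easy to verify numerically for small D by brute-force enumeration of all 2^D configurations. The following sketch is our own illustration (the function names are not from the paper):

```python
import itertools
import math

def joint_ncK(y, theta):
    """p(y, k) under the n-choose-k model with the Poisson-Binomial "prior"
    p(k; theta) = Z_k(theta) / Z(theta), computed by brute force (tiny D only)."""
    D = len(theta)
    # Z_k(theta): sum of exp(theta . y) over configurations with k ones
    Zk = [0.0] * (D + 1)
    for yp in itertools.product([0, 1], repeat=D):
        Zk[sum(yp)] += math.exp(sum(t * v for t, v in zip(theta, yp)))
    Z = sum(Zk)
    k = sum(y)
    prior = Zk[k] / Z                                              # p(k; theta)
    lik = math.exp(sum(t * v for t, v in zip(theta, y))) / Zk[k]   # Eq. (1)
    return prior * lik

def joint_lr(y, theta):
    """Independent logistic-regression likelihood, the right-hand side of Eq. (2)."""
    p = 1.0
    for t, v in zip(theta, y):
        p *= math.exp(t * v) / (1.0 + math.exp(t))
    return p

theta = [0.3, -1.2, 2.0]
for y in itertools.product([0, 1], repeat=3):
    assert abs(joint_ncK(y, theta) - joint_lr(y, theta)) < 1e-12
```

The two joints agree configuration by configuration, confirming that LR is the special case in which the count "prior" is tied to θ.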
First, we explore treating p(k) as a prior in the Bayesian sense, using it to express prior knowledge about label counts; later we will explore learning p(k) using separate parameters from θ. A consequence of these decisions is that the distribution does not factorize. At this point, we have not made it clear that these models can be learned efficiently, but we will show in the next section that this is indeed the case.

2.1 Maximum Likelihood Learning

Our goal in learning is to select parameters so as to maximize the probability assigned to observed data by the model. For notational simplicity in this section, we compute partial derivatives with respect to θ; it should then be clear that these can be back-propagated to a model of θ(x; W). We note that if this relationship is linear, and the objective is convex in terms of θ, then it will also be convex in terms of W. The log-likelihood is as follows:

  log p(y; θ) = log Σ_{k=0}^{D} p(k) p(y | k; θ) = log p(y | Σ_d y_d; θ) + κ   (3)
             = Σ_d θ_d y_d − log Z_{Σ_d y_d}(θ) + κ,   (4)

where κ is a constant that is independent of θ. As is standard, if we are given multiple sets of binary variables {y_n}_{n=1}^N, we maximize the sum of log probabilities Σ_n log p(y_n; θ). The partial derivatives take a standard log-sum-exp form, requiring expectations E_{p(y_d | k = Σ_{d'} y_{d'})}[y_d]. A naive computation of this expectation would require summing over (D choose k) configurations. However, there are more efficient alternatives: the dynamic programming algorithms developed in the context of Poisson-Binomial distributions are applicable, e.g., the algorithm from [3] runs in O(Dk) time. The basic idea is to compute partial sums along a chain that lays out the variables y_d in sequence.
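The chain idea can be sketched as follows (our own minimal implementation, not the authors' code): process the variables one at a time, maintaining the partition functions of the prefix seen so far.

```python
import math
import itertools

def count_partition_functions(theta):
    """Chain DP for the count partition functions Z_k(theta), k = 0..D.

    Z_k sums exp{sum_{d in c} theta_d} over all size-k subsets c. Each item
    either joins the subset (multiplying in exp(theta_d)) or does not, so
    Z[k] accumulates Z[k-1] * exp(theta_d); iterating k downward ensures each
    item is used at most once. This version computes all Z_0..Z_D in O(D^2);
    truncating the inner loop at a fixed k gives the O(Dk) variant.
    """
    D = len(theta)
    Z = [0.0] * (D + 1)
    Z[0] = 1.0
    for d in range(D):
        w = math.exp(theta[d])
        for k in range(d + 1, 0, -1):
            Z[k] += Z[k - 1] * w
    return Z

# Brute-force check on a tiny example
theta = [0.2, -0.5, 1.0, 0.7]
Z = count_partition_functions(theta)
for k in range(5):
    brute = sum(math.exp(sum(theta[d] for d in c))
                for c in itertools.combinations(range(4), k))
    assert abs(Z[k] - brute) < 1e-10
```

With θ = 0 the Z_k reduce to binomial coefficients, which makes for an easy sanity check.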
An alternative formulation of the dynamic program [4] can be made to yield an O(D log² D) algorithm by using a divide-and-conquer strategy that employs Fast Fourier Transforms (FFTs). These algorithms are quite general and can also be used to compute Z_k values, incorporate prior distributions over count values, and draw a sample of y values conditional upon some k for the same computational cost [5]. We use the FFT tree algorithm from [5] throughout, because it is the most flexible and has the best worst-case complexity.

2.2 Test-time Inference

Having learned a model, we would like to make test-time predictions. In Section 4.2, we will show that optimal decision-theoretic predictions (i.e., that maximize expected gain) can be made in several settings by a simple sorting procedure, and this will be our primary way of using the learned model. However, here, we consider the task of producing a distribution over labels y, given θ(x). To draw a joint sample of y values, we can begin by drawing k from p(k), then, conditional on that k, use the dynamic programming algorithm to draw a sample.

To compute marginals, a simple strategy is to loop over each value of k, run dynamic programming conditioned on k, and then average the results weighted by the respective prior. For priors that only give support to a small number of k values, this is quite efficient. An alternative approach is to draw several samples of k from p(k), then for each sampled value, run dynamic programming to compute marginals.
Averaging these marginals can then be seen as a Rao-Blackwellized estimate. Finally, it is possible to compute exact marginals for arbitrary p(k) in a single run of an O(D log² D) dynamic programming algorithm, but the simpler strategies were sufficient for our needs here, so we do not pursue that direction further.

3 Ordinal n-Choose-k Model

An extension of the binary n-choose-k model can be developed for the case of ordinal data, where we assume that each label y can take on one of R categorical values, and where there is an inherent ordering to the labels, R > R−1 > ... > 1; each label represents a relevance level in a learning-to-rank setting. Let k_r represent the number of variables y that take on label r, and define k = (k_R, ..., k_1). The idea in the ordinal case is to define a joint model over count variables k, then to reduce the conditional distribution p(y | k) to a series of binary models. The generative model is defined as follows:

• Initialize all variables y to be unlabeled.
• Sample k_R, ..., k_1 jointly from p(k).
• Repeat for r = R down to 1:
  – Choose a set c_r of k_r unlabeled variables and assign them relevance label r. Choose subsets with probability

    p(y_{≤r,c_r} = 1, y_{≤r,c̄_r} = 0 | k_r) = exp{Σ_{d∈c_r} θ_d} / Z_{r,k_r}(θ, y_{≤r}) if |c_r| = k_r, and 0 otherwise,   (5)

where we use the notation y_{≤r} to represent all variables that are given a relevance label less than or equal to r. Z_{r,k_r} is similar to the normalization constant Z_k that appears in the binary model, but it is restricted to sum over y_{≤r} instead of the full y:

    Z_{r,k_r}(θ, y_{≤r}) = Σ_{y_{≤r} : (Σ_d 1{y_d = r}) = k_r} exp{Σ_d θ_d · 1{y_d = r}}.

Note that if R = D and p(k) specifies that k_r = 1 for all r, then this process defines a Plackett-Luce (PL) [6, 7, 8] ranking model.
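In that special case (k_r = 1 at every step), the generative process reduces to repeatedly drawing one item with probability proportional to exp(θ_d) from the items not yet chosen, which is exactly PL sampling. A minimal sketch of that reduction (our own illustration):

```python
import math
import random

def plackett_luce_sample(theta, rng=random):
    """Sample a ranking from a Plackett-Luce model: at each step, draw one of
    the remaining items with probability proportional to exp(theta_d)."""
    remaining = list(range(len(theta)))
    order = []
    while remaining:
        weights = [math.exp(theta[d]) for d in remaining]
        total = sum(weights)
        u = rng.random() * total
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if u <= acc:
                order.append(remaining.pop(i))
                break
    return order  # order[0] receives the top relevance label
```

The ordinal model generalizes this by drawing whole groups of size k_r at each step rather than single items.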
One interpretation of this model is as a "group" PL model, where instead of drawing individual elements in the generative process, groups of elements are drawn simultaneously. In this work, we focus on ranking with weak labels (R < D), which is more restrictive than modeling distributions over permutations [9], where learning would require marginalizing over all possible permutations consistent with the given labels. In this setting, inference in the ordinal n-choose-k model is both exact and efficient.

3.1 Maximum Likelihood Learning

Let k_r = Σ_d 1{y_d = r}. The log-likelihood of parameters θ can be written as follows:

  log Σ_{k∈K} p(k) p(y | k; θ) = Σ_{r=1}^{R} [ Σ_{d : y_d = r} θ_d − log Z_{r,k_r}(θ, y_{≤r}) ] + κ.   (6)

Here, we see that learning decomposes into the sum of R objectives of the same form as arise in the binary n-choose-k model. As before, the only non-trivial part of the gradient computation comes from the log-sum-exp term, but the required expectations can be efficiently computed using dynamic programming. In this case, R − 1 calls are required.

3.2 Test-time Inference

The test-time inference procedure in the ordinal model is similar to the binary case. Brute-force enumeration over k becomes exponentially more expensive as R grows, but for some priors where p(k) has sparse support, this may be feasible. To draw samples of y, the main requirement is the ability to draw a joint sample of k from p(k). In the case that p(k) is a simple distribution such as a multinomial, this can be done easily. It is also possible to efficiently draw a joint sample if the distribution over k takes the form p(k) = 1{Σ_r k_r = D} · Π_r p(k_r). That is, there is an arbitrary but independent prior over each k_r value, along with a single constraint that the chosen k_r values sum to exactly D.
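One simple, if not the most efficient, way to draw from such a constrained product prior is rejection sampling: draw each k_r independently and keep the draw only when the counts sum to D. This sketch is our own illustration, not the authors' algorithm:

```python
import random

def sample_constrained_counts(priors, D, rng=random, max_tries=100000):
    """Draw k = (k_1, ..., k_R) with k_r ~ priors[r] independently, accepted
    only when sum_r k_r == D (the constraint in the ordinal model's prior).

    priors[r] is a list of probabilities over k_r = 0, 1, 2, ...
    """
    for _ in range(max_tries):
        k = [rng.choices(range(len(p)), weights=p)[0] for p in priors]
        if sum(k) == D:
            return k
    raise RuntimeError("constraint never satisfied; check the priors")
```

Rejection can be slow when the constraint has low prior mass; the dynamic-programming machinery referenced in the text provides an exact alternative for the same job.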
Given a sample of k, it is straightforward to sample y using R calls to dynamic programming. To do so, begin by using the binary algorithm to sample k_R variables to take on value R. Then remove the chosen variables from the set of possible variables, and sample k_{R−1} variables to take on value R − 1. Repeat until all variables have been assigned a value.

An alternative to producing marginal probabilities at test time is to try to optimize performance under a task-specific evaluation measure. The main motivation for the ordinal model is the learning-to-rank problem [10], so our main interest is in methods that do well under the task-specific evaluation measures that arise in ranking. In Section 4.2, we show that we can make exact optimal decision-theoretic test-time predictions under the learning-to-rank gain functions without the need for sampling.

4 Incorporating Gain

4.1 Training to Maximize Expected Top-K Classification Gain

One of the motivating applications for this model is the top-K classification (TKC) task. We formulate this task using a gain function, parameterized by a value K and a "scoring vector" t, which is assumed to be of the same dimension as y. The gain function stipulates that K elements of y are chosen (assigning a score of zero if some other number is chosen), and assigns reward for choosing each element of y based on t. Specifically, the gain function is defined as follows:

  G_K(y, t) = Σ_d y_d t_d if Σ_d y_d = K, and 0 otherwise.   (7)

The same gain can be used for Precision@K, in which case the number of nonzero values in t is unrestricted. Here, we focus on the case where t is binary with a single nonzero entry at index d*. An interesting issue is what gain function should be used to train a model when the test-time evaluation metric is TKC or Precision@K.
Maximum likelihood training of TKC in this case of a single target class could correspond to a version of our n-choose-k model in which p(k) is a spike at k = 1; note that in this case the n-choose-k model is equivalent to a softmax over the output classes. An alternative is to train using the same gain function used at test time.

Here, we consider incorporating the TKC gain at training time for binary t with one nonzero entry, training the model to maximize expected gain. Specifically, the objective is the following:

  E_p[G_K(y, t)] = Σ_k Σ_y p(k) p(y | k) 1{Σ_d y_d = K} Σ_d y_d t_d = p(K) Σ_y p(y | K) y_{d*}.   (8)

It becomes clear that this objective is equivalent to the marginal probability of y_{d*} under a prior distribution that places all its mass on k = K. In Section 5.3, we empirically investigate training under expected gain versus training under maximum likelihood.

4.2 Optimal Decision-theoretic Predictions for Monotonic Gain Functions

We now turn attention to gain functions defined on rankings of items. Letting π be a permutation, we define a "monotonic" gain function as follows:

Definition 1. A gain function G(π, r) is a monotonic ranking gain if:

• it can be expressed as Σ_{d=1}^{D} α_d f(r_{π_d}), where α_d is a weighting (or discount) term, and π_d is the index of the item ranked in position d,
• α_d ≥ α_{d+1} ≥ 0 for all d, and
• f(r) ≥ f(r − 1) ≥ 0 for all r.

It is straightforward to see that popular learning-to-rank scoring functions like normalized discounted cumulative gain (NDCG) and Precision@K are monotonic ranking gains. NDCG(π, r) ∝ Σ_d (2^{r_{π_d}} − 1) / log₂(1 + d), so set α_d = κ · 1/log₂(1 + d) and f(r) = 2^r − 1.
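As a concrete check of this α/f decomposition (a sketch with our own function names), the unnormalized DCG is recovered exactly by summing α_d f(r_{π_d}):

```python
import math

def monotonic_gain(perm, rels, alpha, f):
    """G(pi, r) = sum_d alpha[d] * f(r_{pi_d}) from Definition 1."""
    return sum(a * f(rels[p]) for a, p in zip(alpha, perm))

def dcg(perm, rels):
    """Standard unnormalized DCG: sum_d (2^{r_{pi_d}} - 1) / log2(1 + d)."""
    return sum((2 ** rels[p] - 1) / math.log2(1 + d)
               for d, p in enumerate(perm, start=1))

rels = [2, 0, 1, 3]            # relevance label of each item
perm = [3, 0, 2, 1]            # item ranked in each position
alpha = [1.0 / math.log2(1 + d) for d in range(1, 5)]
f = lambda r: 2 ** r - 1
assert abs(monotonic_gain(perm, rels, alpha, f) - dcg(perm, rels)) < 1e-12
```

Both α_d and f here are nonnegative and monotone, as Definition 1 requires; the NDCG normalization constant κ only rescales the gain and does not affect which ranking is optimal.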
We define Precision@K gain to be the fraction of documents in the top K produced ranks that have label R: P@K(π, r) = (1/K) Σ_d 1{d ≤ K} · 1{r_{π_d} = R}, so set α_d = (1/K) · 1{d ≤ K} and f(r) = 1{r = R}.

The expected gain under a monotonic ranking gain and an ordinal n-choose-k model is

  E_p[G(π)] = Σ_{y'∈Y} p(y') Σ_{d=1}^{D} α_d f(y'_{π_d}) = Σ_{d=1}^{D} α_d Σ_{r=1}^{R} f(r) p(y_{π_d} = r) = Σ_{d=1}^{D} α_d g_{π_d},   (9)

where we have defined g_d = Σ_{r=1}^{R} f(r) p(y_d = r).

We now state four propositions and a lemma. The proofs of the propositions mostly result from algebraic manipulation, so we leave them to the supplementary materials. The main theorem will be proved afterwards.

Proposition 1. If θ_i ≥ θ_j, then p(y_i = R) ≥ p(y_j = R).
Proposition 2. If θ_i ≥ θ_j and p(y_i ≥ r) ≥ p(y_j ≥ r), then p(y_i ≥ r − 1) ≥ p(y_j ≥ r − 1).
Lemma 1. If θ_i ≥ θ_j, then for all r, p(y_i ≥ r) ≥ p(y_j ≥ r).

Proof. By induction. Proposition 1 is the base case, and Proposition 2 is the inductive step.

Proposition 3. If θ_i ≥ θ_j and f is defined as in Definition 1, then g_i ≥ g_j.
Proposition 4. Consider two pairs of non-negative real numbers a_i, a_j and b_i, b_j where a_i ≥ a_j and b_i ≥ b_j. It follows that a_i b_i + a_j b_j ≥ a_i b_j + a_j b_i.

Theorem 1. Under an ordinal n-choose-k model, the optimal decision-theoretic predictions for a monotonic ranking gain are made by sorting the θ values.

Figure 1: Four example images from the embedded MNIST dataset test set, along with the Poisson-Binomial distribution produced by logistic regression for each image.
The area marked in red has zero probability under the data distribution, but the logistic regression model is not flexible enough to model it.

Proof. Without loss of generality, assume that we are given a vector α corresponding to placing the α's in descending order, and a vector g_π where π is some arbitrary ordering of the g's. The goal is to find the ordering π* that maximizes the objective given in (9), which can equivalently be expressed as the inner product αᵀ g_π.

Assume that we are given an ordering π̂ for which there is at least one pair of positions i > j with θ_{π̂_i} > θ_{π̂_j}; that is, π̂ does not place the items in sorted order of θ. By Proposition 3 we have g_{π̂_i} ≥ g_{π̂_j}. The contribution of these two elements to the overall objective is α_i g_{π̂_i} + α_j g_{π̂_j}. Since α_j ≥ α_i, Proposition 4 shows that swapping g_{π̂_i} and g_{π̂_j} does not decrease the objective, so the ordering with this pair sorted by θ is at least as good as π̂. If multiple elements are not in sorted order, we can repeat this argument on pairs of elements until the whole vector is sorted; hence sorting the θ values achieves the maximum expected gain.

5 Experiments

5.1 Modeling Varying Numbers of Objects

Our first experiment explores an issue that arises frequently in computer vision: there is an unknown number of objects in an image, but the number is highly structured. We developed a multiple-digit image dataset that simulates this scenario.¹ To generate an image, we uniformly sample a count between 1 and 4, and then take that number of digit instances (with at most one instance per digit class) from the MNIST dataset and embed them in a 60 × 60 image. The x, y locations are chosen from a 4 × 4 uniformly spaced grid, and then a small amount of jitter is added.
We generated 10,000 images each for the training, test, and validation sets. The goal is to predict the set of digits that appear in a given image. Examples can be seen in Figure 1.

We train a binary n-choose-k model on this dataset. The inputs to the model are features learned from the images by a standard Restricted Boltzmann Machine with 1000 hidden units. As a baseline, we trained a logistic regression classifier on the features and achieved a test-set negative log-likelihood (NLL) of 2.84. Ideally, a model should learn that there are never more than four digits in any image. In Figure 1, we show four test images, and the Poisson-Binomial distribution over counts that arises from the logistic regression model. Marked in red are regions where the count value has zero probability under the data distribution. Here it is clear that the implicit count prior in LR is not powerful enough to model this data. As a comparison, we trained a binary n-choose-k model in which we explicitly parameterize and learn an input-dependent prior. This model learns the correct distribution over counts and achieves a test-set NLL of 1.95. We show a visualization of the learned likelihood and prior parameters in the supplementary material.

5.2 Ranking

A second set of experiments considers learning-to-rank applications of the n-choose-k model. We report on comparisons to other ranking approaches, using seven datasets associated with the LETOR 3.0 benchmark [10]. Following the standard LETOR procedures, we trained over five folds, each with distinct training, validation, and testing splits.

For each dataset, we train an ordinal n-choose-k model to maximize the likelihood of the data, where each training example consists of a number of items, each assigned a particular relevance level; the number of levels ranges from 2 to 4 across the datasets.
At test time, we produce a ranking which, as shown in Section 4.2, is the optimal decision-theoretic prediction under a ranking gain function: we simply sort the items for each test query based on their θ score values. Note that this is a very simple ranking model, in that the score assigned to each test item by the model is a linear function of the input features, and the only hyperparameter to tune is an ℓ2 regularization strength.

¹http://www.cs.toronto.edu/~kswersky/data/

Figure 2: Ranking results on two datasets from LETOR 3.0: (a) TD 2003, (b) NP 2004. Results for the other 5 datasets, along with Precision@K results, appear in the supplementary material.

Results for two of the datasets are shown in Figure 2 (the first is our best relative performance, the second is typical); the full set of results appears in the supplementary material. Several publicly available baselines are shown for comparison. As can be seen in the graphs, our approach is competitive with the state of the art on all datasets, and substantially outperforms all baselines on the TD 2003 dataset. Note that the performance of the baseline methods is quite variable, and it appears that overfitting is an issue on these datasets, even for linear models. We hypothesize that proper probabilistic incorporation of weak labels helps to mitigate this effect to some degree.

5.3 Top-K Classification

Our third and final set of experiments concerns top-K classification, an important task that has gained considerable attention recently in the ImageNet Challenge.² Here we consider a task analogous to that in the ImageNet Challenge, in which each image contains a single object label, but a model is allowed to return up to K class predictions per image.
A classification is deemed correct if the appropriate class is one of the K returned classes.

We train binary n-choose-k models, experimenting with different training protocols that directly maximize expected gain under the model, as described in Section 4.1. That is, we train on the expected top-K gain for different values of K. Note that top-1 is equivalent to softmax regression. For each model/evaluation criterion combination, we find the ℓ2 penalty that gives the highest validation accuracy; the corresponding test-set results are shown in Table 1. For comparison, we also include logistic regression, where each output is conditionally independent. We experimented on the embedded MNIST dataset, where all but one label from each example was randomly removed, and on the Caltech-101 Silhouettes dataset [11], which consists of images of binarized silhouettes from 101 different categories. In both datasets we trained the models using the pixels as inputs. We noticed that the optimal ℓ2 strength chosen by each method was quite high, suggesting that overfitting is an issue in these datasets. When the ℓ2 strength is low, the difference between the objectives becomes more apparent. On Caltech it is clear that training for the expected gain improves the corresponding test accuracy in this regime. On the embedded MNIST dataset, when the ℓ2 strength is low there is a surprising result: the top-3 and top-5 criteria outperform top-1, even when top-1 is used as the evaluation measure. Since there are several digits actually present in the ground truth, there is no real signal in the data that differentiates the digit labeled as the target from the other equally valid "distractor" digits.
In order to satisfy the top-1 objective for the given target, the learning algorithm is forced to find some arbitrary criterion by which to cause the given target to be preferred over the distractors, which is harmful for generalization. This scenario does occur in datasets like ImageNet, where multiple objects can be present in a single image. It would be interesting to repeat these experiments on more challenging, large-scale datasets, but we leave this for future work.

²http://www.image-net.org/challenges/LSVRC/2011/

[Figure 2 plots NDCG@K against the NDCG truncation level K, comparing Ordinal nCk with the AdaRank-NDCG, FRank, ListNet, RankBoost, RankSVM, Regression-Reg, and SmoothRank baselines.]

Table 1: Top-K classification results when various models are trained using an expected top-K gain and then tested using some possibly different top-K criterion. The rows correspond to training criteria, and the columns (Top 1 / Top 3 / Top 5) to test criteria. (a) and (c) show the test accuracy when a strong ℓ2 regularizer is used, while (b) and (d) use a relatively weaker regularizer. Logistic regression is included for comparison.

        (a) Caltech Sil. strong ℓ2   (b) Caltech Sil. weak ℓ2     (c) EMNIST strong ℓ2         (d) EMNIST weak ℓ2
LR      0.606 / 0.785 / 0.812        0.545 / 0.716 / 0.766        0.346 / 0.647 / 0.815        0.263 / 0.557 / 0.742
Top 1   0.621 / 0.796 / 0.831        0.574 / 0.755 / 0.804        0.353 / 0.659 / 0.820        0.268 / 0.569 / 0.757
Top 3   0.614 / 0.792 / 0.834        0.558 / 0.771 / 0.813        0.353 / 0.671 / 0.834        0.318 / 0.637 / 0.815
Top 5   0.602 / 0.787 / 0.834        0.523 / 0.767 / 0.823        0.330 / 0.659 / 0.824        0.313 / 0.642 / 0.822

6 Related Work

Our work here is related to many different areas; we cannot hope to survey all related work in multi-label classification and ranking.
Instead, we focus on work related to the main novelty in this paper: the explicit modeling of structure on label counts. That is, given that we have prior knowledge of label count structure, or are modeling a domain that exhibits such structure, the question is how that structure can be leveraged to improve a model.

The first and most direct approach is the one that we take here: explicitly model the count structure within the model. There are other approaches that are similar in this respect. The work of [12] considers MAP inference in the context of cardinality-based models and develops applications to named entity recognition tasks. Similarly, [13] develops an example application in which a cardinality-based term constrains the number of pixels that take on the label "foreground" in a foreground/background image segmentation task. [14] develops models that include a penalty in the energy function for using more labels, which can be seen as a restricted form of structure over label cardinalities.

An alternative way of incorporating structure over counts into a model is via the gain function. The work of Joachims [15] can be seen in this light: the training objective is formulated so as to optimize performance on evaluation measures that include Precision@K. A different approach to including count information in the gain function comes from [16], which trains an image segmentation model so as to match count statistics present in the ground-truth data. Finally, there are other approaches that do not neatly fall into either category, such as the posterior regularization framework of [17] and related works such as [18].
There, structural constraints, including prior knowledge about counts (for example, that most sentences contain at least one verb), are added as a regularization term used during both learning and inference.

Overall, the main difference between our work and these others is that we work in a proper probabilistic framework, whether maximizing likelihood, maximizing expected gain, or making decision-theoretic predictions at test time. Importantly, there is no significant penalty for adopting the proper probabilistic approach: learning is exact, and test-time prediction is efficient.

7 Discussion

We have presented a flexible probabilistic model for multiple output variables that explicitly models structure in the number of variables taking on specific values. The model is simple, efficient, easy to learn due to its convex objective, and widely applicable. Our theoretical contribution provides a link between this type of ordinal model and ranking problems, bridging the gap between the two tasks and allowing the same model to be effective for several quite different problems. Finally, there are many possible extensions. More powerful models of θ can be put into the formulation, and gradients can easily be back-propagated. Also, while we chose to take a maximum likelihood approach in this paper, the model is well suited to fully Bayesian inference using, e.g., slice sampling; the unimodal posterior distribution should lead to good behavior of the sampler. Beyond these extensions, we believe the framework presented here to be a valuable building block with broad application to problems in machine learning.

References

[1] S. X. Chen and J. S. Liu. Statistical applications of the Poisson-Binomial and conditional Bernoulli distributions. Statistica Sinica, 7(4), 1997.

[2] X. H. Chen, A. P. Dempster, and J. S. Liu. Weighted finite population sampling to maximize entropy.
Biometrika, 81(3):457–469, 1994.

[3] M. H. Gail, J. H. Lubin, and L. V. Rubinstein. Likelihood calculations for matched case-control studies and survival studies with tied death times. Biometrika, 68:703–707, 1981.

[4] L. Belfore. An O(n(log2(n))^2) algorithm for computing the reliability of k-out-of-n:G and k-to-l-out-of-n:G systems. IEEE Transactions on Reliability, 44(1), 1995.

[5] D. Tarlow, K. Swersky, R. Zemel, R. P. Adams, and B. Frey. Fast exact inference for recursive cardinality models. In Uncertainty in Artificial Intelligence, 2012.

[6] R. Plackett. The analysis of permutations. Applied Statistics, 24(2):193–202, 1975.

[7] R. D. Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.

[8] J. Guiver and E. Snelson. Bayesian inference for Plackett-Luce ranking models. In International Conference on Machine Learning, 2009.

[9] J. Huang, C. Guestrin, and L. Guibas. Efficient inference for distributions on permutations. In Advances in Neural Information Processing Systems, 2007.

[10] T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 2010.

[11] B. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machine learning. In Artificial Intelligence and Statistics, 2010.

[12] R. Gupta, A. Diwan, and S. Sarawagi. Efficient inference with cardinality-based clique potentials. In International Conference on Machine Learning, 2007.

[13] D. Tarlow, I. Givoni, and R. Zemel. HOP-MAP: Efficient message passing for high order potentials. In Artificial Intelligence and Statistics, 2010.

[14] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov. Fast approximate energy minimization with label costs. International Journal of Computer Vision, 96(1):1–27, 2012.

[15] T. Joachims.
A support vector method for multivariate performance measures. In International Conference on Machine Learning, 2005.

[16] P. Pletscher and P. Kohli. Learning low-order models for enforcing high-order statistics. In Artificial Intelligence and Statistics, 2012.

[17] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.

[18] G. Mann and A. McCallum. Generalized expectation criteria with application to semi-supervised classification and sequence modeling. Journal of Machine Learning Research, 11:955–984, 2010.