{"title": "Generative and Discriminative Learning with Unknown Labeling Bias", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": "We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. For the generative case, we derive an entropy-based weighting that maximizes expected log likelihood under the worst-case true class proportions. For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of label bias since there is no absence data. On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data.", "full_text": "Generative and Discriminative Learning with\n\nUnknown Labeling Bias\n\nMiroslav Dud\u00b4\u0131k\n\nCarnegie Mellon University\n\n5000 Forbes Ave, Pittsburgh, PA 15213\n\nSteven J. Phillips\n\nAT&T Labs \u2212 Research\n\n180 Park Ave, Florham Park, NJ 07932\n\nmdudik@cmu.edu\n\nphillips@research.att.com\n\nAbstract\n\nWe apply robust Bayesian decision theory to improve both generative and discrim-\ninative learners under bias in class proportions in labeled training data, when the\ntrue class proportions are unknown. For the generative case, we derive an entropy-\nbased weighting that maximizes expected log likelihood under the worst-case true\nclass proportions. For the discriminative case, we derive a multinomial logistic\nmodel that minimizes worst-case conditional log loss. We apply our theory to the\nmodeling of species geographic distributions from presence data, an extreme case\nof labeling bias since there is no absence data. 
On a benchmark dataset, we \ufb01nd\nthat entropy-based weighting offers an improvement over constant estimates of\nclass proportions, consistently reducing log loss on unbiased test data.\n\n1 Introduction\n\nIn many real-world classi\ufb01cation problems, it is not equally easy or affordable to verify membership\nin different classes. Thus, class proportions in labeled data may signi\ufb01cantly differ from true class\nproportions. In an extreme case, labeled data for an entire class might be missing (for example,\nnegative experimental results are typically not published). A naively trained learner may perform\npoorly on test data that is not similarly af\ufb02icted by labeling bias. Several techniques address labeling\nbias in the context of cost-sensitive learning and learning from imbalanced data [5, 11, 2]. If the\nlabeling bias is known or can be estimated, and all classes appear in the training set, a model trained\non biased data can be corrected by reweighting [5]. When the labeling bias is unknown, a model is\noften selected using threshold-independent analysis such as ROC curves [11]. A good ROC curve,\nhowever, does not guarantee a low loss on test data. Here, we are concerned with situations when\nthe labeling bias is unknown and some classes may be missing, but we have access to unlabeled\ndata. We want to construct models that in addition to good ROC-based performance, also yield\nlow test loss. We will be concerned with minimizing joint and conditional log loss, or equivalently,\nmaximizing joint and conditional log likelihood.\n\nOur work is motivated by the application of modeling species\u2019 geographic distributions from occur-\nrence data. The data consists of a set of locations within some region (for example, the Australian\nwet tropics) where a species (such as the golden bowerbird) was observed, and a set of features such\nas precipitation and temperature, describing environmental conditions at each location. 
Species dis-\ntribution modeling suffers from extreme imbalance in training data: we often only have information\nabout species presence (positive examples), but no information about species absence (negative ex-\namples). We do, however, have unlabeled data, obtained either by randomly sampling locations\nfrom the region [4], or pooling presence data for several species collected with similar methods to\nyield a representative sample of locations which biologists have surveyed [13].\n\nPrevious statistical methods for species distribution modeling can be divided into three main ap-\nproaches. The \ufb01rst interprets all unlabeled data as examples of species absence and learns a rule\n\n\fto discriminate them from presences [19, 4]. The second embeds a discriminative learner in the\nEM algorithm in order to infer presences and absences in unlabeled data; this explicitly requires\nknowledge of true class probabilities [17]. The third models the presences alone, which is known in\nmachine learning as one-class estimation [14, 7]. When using the \ufb01rst approach, the training data is\ncommonly reweighted so that positive and negative examples have the same weight [4]; this models\na quantity monotonically related to conditional probability of presence [13], with the relationship\ndepending on true class probabilities. If we use y to denote the binary variable indicating presence\nand x to denote a location on the map, then the \ufb01rst two approaches yield models of conditional\nprobability p(y = 1|x), given estimates of true class probabilities. On the other hand, the main in-\nstantiation of the third approach, maximum entropy density estimation (maxent) [14] yields a model\nof the distribution p(x|y = 1). To convert this to an estimate of p(y = 1|x) (as is usually required,\nand necessary for measuring conditional log loss on which we focus here) again requires knowledge\nof the class probabilities p(y = 1) and p(y = 0). 
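The conversion just described, from a class-conditional density p(x | y = 1) to a conditional probability p(y = 1 | x), is a direct application of Bayes' rule. A minimal sketch follows; the densities and the class probability `p1` are illustrative values (as the text notes, the true class probabilities are typically unknown in practice):

```python
import numpy as np

def posterior_from_density(p_x_given_1, p_x_given_0, p1):
    """Convert class-conditional densities into p(y=1|x) via Bayes' rule.

    p_x_given_1, p_x_given_0: densities p(x|y=1), p(x|y=0) at query points.
    p1: assumed class probability p(y=1); this is the quantity that is
        usually unknown under labeling bias.
    """
    num = p1 * np.asarray(p_x_given_1, dtype=float)
    den = num + (1.0 - p1) * np.asarray(p_x_given_0, dtype=float)
    return num / den

# Toy values: the species' density at x is 0.3, the background density 0.1.
post = posterior_from_density([0.3], [0.1], p1=0.5)
```

Note that the output is sensitive to `p1`: the same densities with a different assumed prior give a different posterior, which is exactly why an unknown labeling bias is problematic.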
Thus, existing discriminative approaches (the \ufb01rst\nand second) as well as generative approaches (the third) require estimates of true class probabilities.\n\nWe apply robust Bayesian decision theory, which is closely related to the maximum entropy prin-\nciple [6], to derive conditional probability estimates p(y | x) that perform well under a wide range\nof test distributions. Our approach can be used to derive robust estimates of class probabilities p(y)\nwhich are then used to reweight discriminative models or to convert generative models into discrimi-\nnative ones. We present a treatment for the general multiclass problem, but our experiments focus on\none-class estimation and species distribution modeling in particular. Using an extensive evaluation\non real-world data, we show improvement in both generative and discriminative techniques.\n\nThroughout this paper we assume that the dif\ufb01culty of uncovering the true class label depends on the\nclass label y alone, but is independent of the example x. Even though this assumption is simplistic,\nwe will see that our approach yields signi\ufb01cant improvements. A related set of techniques estimates\nand corrects for the bias in sample selection, also known as covariate shift [9, 16, 18, 1, 13]. When\nthe bias can be decomposed into an estimable and inestimable part, the right approach might be to\nuse a combination of techniques presented in this paper and those for sample-selection bias.\n\n2 Robust Bayesian Estimation with Unknown Class Probabilities\n\nOur goal is to estimate an unknown conditional distribution \u03c0(y | x), where x \u2208 X is an example\nand y \u2208 Y is a label. The input consists of labeled examples (x1, y1), . . . , (xm, ym) and unlabeled\nexamples xm+1, . . . , xM . Each example x is described by a set of features fj : X \u2192 R, indexed\nby j \u2208 J. 
For simplicity, we assume that sets X, Y, and J are \ufb01nite, but we would like to allow the\nspace X and the set of features J to be very large.\n\nIn species distribution modeling from occurrence data, the space X corresponds to locations on the\nmap, features are various functions derived from the environmental variables, and the set Y contains\ntwo classes: presence (y = 1) and absence (y = 0) for a particular species. Labeled examples are\npresences of the species, e.g., recorded presence locations of the golden bowerbird, while unlabeled\nexamples are locations that have been surveyed by biologists, but neither presence nor absence was\nrecorded. The unlabeled examples can be obtained as presence locations of species observed by a\nsimilar protocol, for example other birds [13].\n\nWe posit a joint density \u03c0(x, y) and assume that examples are generated by the following process.\nFirst, a pair (x, y) is chosen according to \u03c0. We always get to see the example x, but the label y is\nrevealed with an unknown probability that depends on y and is independent of x. This means that\nwe have access to independent samples from \u03c0(x) and from \u03c0(x| y), but no information about \u03c0(y).\nIn our example, species presence is revealed with an unknown \ufb01xed probability whereas absence is\nrevealed with probability zero (i.e., never revealed).\n\n2.1 Robust Bayesian Estimation, Maximum Entropy, and Logistic Regression\n\nRobust Bayesian decision theory formulates an estimation problem as a zero-sum game between a\ndecision maker and nature [6]. In our case, the decision maker chooses an estimate p(x, y) while\nnature selects a joint density \u03c0(x, y). 
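The generative process above (a pair (x, y) is drawn from π, and the label is then revealed with a probability that depends only on y) can be simulated to produce synthetically biased training data. A toy sketch, with an invented joint table and invented revelation probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_biased(pi_xy, reveal_prob, n):
    """Simulate the assumed process: draw (x, y) from pi, then reveal
    the label with a probability that depends on y only.

    pi_xy: (|X|, |Y|) joint probability table (invented here).
    reveal_prob: per-class revelation probabilities, e.g. [0.0, 0.3]
                 (absence never revealed, presence revealed 30% of the time).
    Returns labeled (x, y) pairs and unlabeled x's.
    """
    flat = pi_xy.ravel()
    idx = rng.choice(flat.size, size=n, p=flat)
    xs, ys = np.unravel_index(idx, pi_xy.shape)
    labeled, unlabeled = [], []
    for x, y in zip(xs, ys):
        if rng.random() < reveal_prob[y]:
            labeled.append((int(x), int(y)))
        else:
            unlabeled.append(int(x))
    return labeled, unlabeled

pi = np.array([[0.4, 0.1], [0.3, 0.2]])  # made-up joint over 2 sites x 2 classes
lab, unlab = sample_biased(pi, reveal_prob=[0.0, 0.3], n=1000)
```

With `reveal_prob[0] = 0.0`, every labeled example is a presence, reproducing the one-class setting of species distribution modeling.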
Using the available data, the decision maker forms a set P in which he believes nature's choice lies, and tries to minimize worst-case loss under nature's choice. In this paper we are interested in minimizing the worst-case log loss relative to a fixed default estimate ν (equivalently, maximizing the worst-case log likelihood ratio)

min_{p∈Δ} max_{π∈P} E_π[ ln( p(X, Y) / ν(X, Y) ) ] .    (1)

Here, Δ is the simplex of joint densities and E_π is a shorthand for E_{X,Y∼π}. The default density ν represents any prior information we have about π; if we have no prior information, ν is typically the uniform density.

Grünwald and Dawid [6] show that the robust Bayesian problem (Eq. 1) is often equivalent to the minimum relative entropy problem

min_{p∈P} RE(p ‖ ν) ,    (2)

where RE(p ‖ q) = E_p[ln(p(X, Y)/q(X, Y))] is relative entropy or Kullback-Leibler divergence and measures discrepancy between distributions p and q. The formulation intuitively says that we should choose the density p which is closest to ν while respecting the constraints P. When ν is uniform, minimizing relative entropy is equivalent to maximizing the entropy H(p) = E_p[−ln p(X, Y)]. Hence, the approach is mainly referred to as maximum entropy [10], or maxent for short. The next theorem outlines the equivalence of robust Bayes and maxent for the case considered in this paper. It is a special case of Theorem 6.4 of [6].

Theorem 1 (Equivalence of maxent and robust Bayes). Let X × Y be a finite sample space, ν a density on X × Y, and P ⊆ Δ a closed convex set containing at least one density absolutely continuous w.r.t. ν. Then Eqs. (1) and (2) have the same optimizers.

For the case without labeling bias, the set P is usually described in terms of equality constraints on moments of the joint distribution (feature expectations). Specifically, feature expectations with respect to p are required to equal their empirical averages. When features are functions of x, but the goal is to discriminate among classes y, it is natural to consider a derived set of features which are versions of f_j(x) active solely in individual classes y (see for instance [8]). If we were to estimate the distribution of the golden bowerbird from presence-absence data then moment equality constraints require that the joint model p(x, y) match the average altitude of presence locations as well as the average altitude of absence locations (both weighted by their respective training proportions).

When the number of samples is too small or the number of features too large then equality constraints lead to overfitting because the true distribution does not match empirical averages exactly. Overfitting is alleviated by relaxing the constraints so that feature expectations are only required to lie within a certain distance of sample averages [3].

The solution of Eq. (2) with equality or relaxed constraints can be shown to lie in an exponential family parameterized by λ = ⟨λ_y⟩_{y∈Y}, λ_y ∈ R^J, and containing densities

q_λ(x, y) ∝ ν(x, y) e^{λ_y · f(x)} .

The optimizer of Eq. (2) is the unique density which minimizes the empirical log loss

(1/m) Σ_{i≤m} −ln q_λ(x_i, y_i) ,    (3)

possibly with an additional ℓ1-regularization term accounting for slacks in equality constraints. (See [3] for a proof.)

In addition to constraints on moments of the joint distribution, it is possible to introduce constraints on marginals of p. The most common implementations of maxent impose marginal constraints p(x) = π̃_lab(x), where π̃_lab is the empirical distribution over labeled examples. The solution then takes the form q_λ(x, y) = π̃_lab(x) q_λ(y | x), where q_λ(y | x) is the multinomial logistic model

q_λ(y | x) ∝ ν(y | x) e^{λ_y · f(x)} .

As before, the maxent solution is the unique density of this form which minimizes the empirical log loss (Eq. 3). The minimization of Eq. (3) is equivalent to the minimization of the conditional log loss

(1/m) Σ_{i≤m} −ln q_λ(y_i | x_i) .

Hence, this approach corresponds to logistic regression. Since it only models the labeling process π(y | x), but not the sample generation π(x), it is known as discriminative training.

The case with equality constraints p(y) = π̃_lab(y) has been analyzed for example by [8]. The solution has the form q_λ(x, y) = π̃_lab(y) q_λ(x | y) with

q_λ(x | y) ∝ ν(x | y) e^{λ_y · f(x)} .

Log loss can be minimized for each class separately, i.e., each λ_y is the maximum likelihood estimate (possibly with regularization) of π(x | y). The joint estimate q_λ(x, y) can be used to derive the conditional distribution q_λ(y | x). Since this approach estimates the sample-generating distributions of the individual classes, it is known as generative training. Naive Bayes is a special case of generative training in which ν(x | y) = Π_j ν_j(f_j(x) | y).

The two approaches presented in this paper can be viewed as generalizations of generative and discriminative training with two additional components: availability of unlabeled examples and lack of information about class probabilities.
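Discriminative training as just described, minimizing empirical conditional log loss for a multinomial logistic model, can be sketched with plain gradient descent. This is a generic illustration (uniform default ν, no regularization, made-up toy data), not the paper's implementation:

```python
import numpy as np

def fit_logistic(F, y, n_classes, lr=0.1, steps=2000):
    """Minimize empirical conditional log loss for the model
    q_lambda(y|x) proportional to exp(lambda_y . f(x)) (uniform default nu)."""
    m, d = F.shape
    lam = np.zeros((n_classes, d))
    Y = np.eye(n_classes)[y]                 # one-hot labels
    for _ in range(steps):
        scores = F @ lam.T                   # (m, n_classes)
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        q = np.exp(scores)
        q /= q.sum(axis=1, keepdims=True)    # q_lambda(y | x_i)
        grad = (q - Y).T @ F / m             # gradient of conditional log loss
        lam -= lr * grad
    return lam

# Separable toy data: feature vector is [x, 1] (constant feature as intercept).
F = np.array([[0., 1.], [1., 1.], [4., 1.], [5., 1.]])
y = np.array([0, 0, 1, 1])
lam = fit_logistic(F, y, n_classes=2)
```

The gradient `(q - Y).T @ F / m` is the standard expected-minus-empirical feature difference; adding an ℓ1 penalty on λ would give the regularized variant discussed above.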
The availability of unlabeled examples will influence the choice of the default ν; the lack of information about class probabilities, the form of the constraints P.

2.2 Generative Training: Entropy-weighted Maxent

When the number of labeled and unlabeled examples is sufficiently large, it is reasonable to assume that the empirical distribution π̃(x) over all examples (labeled and unlabeled) is a faithful representation of π(x). Thus, we consider defaults with ν(x) = π̃(x), shown to work well in species distribution modeling [13]. For simplicity, we assume that ν(y | x) does not depend on x and focus on ν(x, y) = π̃(x)ν(y). Other options are possible. For example, when the number of examples is small, π̃(x) might be replaced by an estimate of π(x). The distribution ν(y) can be chosen uniform across y, but if some classes are known to be rarer than others then a non-uniform estimate will perform better. In Section 3, we analyze the impact of this choice.

Constraints on moments of the joint distribution, such as those in the previous section, will misspecify true moments in the presence of labeling bias. However, as discussed earlier, labeled examples from each class y approximate the conditional distributions π(x | y). Thus, instead of constraining joint expectations, we constrain conditional expectations E_p[f_j(X) | y]. In general, we consider robust Bayes and maxent problems with the set P of the form P = {p ∈ Δ : p^y_X ∈ P^y_X}, where p^y_X denotes the |X|-dimensional vector of conditional probabilities p(x | y) and P^y_X expresses the constraints on p^y_X. For example, relaxed constraints for class y are expressed as

∀j : | E_p[f_j(X) | y] − μ̃^y_j | ≤ β^y_j    (4)

where μ̃^y_j is the empirical average of f_j among labeled examples in class y and β^y_j are estimates of deviations of averages from true expectations. Similar to [14], we use standard-error-like deviation estimates β^y_j = β σ̃^y_j / √m_y, where β is a single tuning constant, σ̃^y_j is the empirical standard deviation of f_j among labeled examples in class y, and m_y is the number of labeled examples in class y. When m_y equals 0, we choose β^y_j = ∞ and thus leave feature expectations unconstrained.

The next theorem and the following corollary show that robust Bayes (and also maxent) with the constraint set P of the form above yield estimators similar to generative training. In addition to the notation p^y_X for conditional densities, we use the notation p_Y and p_X to denote vectors of marginal probabilities p(y) and p(x), respectively. For example, the empirical distribution over examples is denoted π̃_X.

Theorem 2. Let P^y_X, y ∈ Y, be closed convex sets of densities over X and P = {p ∈ Δ : p^y_X ∈ P^y_X}. If P contains at least one density absolutely continuous w.r.t. ν then robust Bayes and maxent over P are equivalent. The solution p̂ has the form p̂(y)p̂(x | y) where the class-conditional densities p̂^y_X minimize RE(p^y_X ‖ π̃_X) among p^y_X ∈ P^y_X and

p̂(y) ∝ ν(y) e^{−RE(p̂^y_X ‖ π̃_X)} .    (5)

Proof. It is not too difficult to verify that the set P is a closed convex set of joint densities, so the equivalence of robust Bayes and maxent follows from Theorem 1. To prove the remainder, we rewrite the maxent objective as

RE(p ‖ ν) = RE(p_Y ‖ ν_Y) + Σ_y p(y) RE(p^y_X ‖ π̃_X) .

The maxent problem is then equivalent to

min_{p_Y} [ RE(p_Y ‖ ν_Y) + Σ_y p(y) min_{p^y_X ∈ P^y_X} RE(p^y_X ‖ π̃_X) ]
  = min_{p_Y} [ Σ_y p(y) ln( p(y) / ν(y) ) + Σ_y p(y) RE(p̂^y_X ‖ π̃_X) ]
  = min_{p_Y} [ Σ_y p(y) ln( p(y) / ( ν(y) e^{−RE(p̂^y_X ‖ π̃_X)} ) ) ]
  = const. + min_{p_Y} RE(p_Y ‖ p̂_Y) .

Since RE(p ‖ q) is minimized for p = q, we indeed obtain that for the minimizing p, p_Y = p̂_Y.

Theorem 2 generalizes to the case when, in addition to constraining p^y_X to lie in P^y_X, we also constrain p_Y to lie in a closed convex set P_Y. The solution then takes the form p(y)p̂(x | y) with p̂(x | y) as in the theorem, but with p(y) minimizing RE(p_Y ‖ p̂_Y) subject to p_Y ∈ P_Y. Unlike generative training without labeling bias, the class-conditional densities in the theorem above influence class probabilities. When the sets P^y_X are specified using constraints of Eq. (4) then p̂ has a form derived from regularized maximum likelihood estimates in an exponential family (see, e.g., [3]):

Corollary 3. If sets P^y_X are specified by inequality constraints of Eq. (4) then robust Bayes and maxent are equivalent. The class-conditional densities p̂(x | y) of the solution take the form

q_λ̂(x | y) ∝ π̃(x) e^{λ̂_y · f(x)}    (6)

and solve single-class regularized maximum likelihood problems

min_{λ_y} { Σ_{i: y_i = y} [ −ln q_λ(x_i | y) ] + m_y Σ_{j∈J} β^y_j |λ^y_j| } .    (7)

One-class Estimation. In one-class estimation problems, there are two classes (0 and 1), but we only have access to labeled examples from one class (e.g., class 1). In species distribution modeling, we only have access to presence records of the species. Based on labeled examples, we derive a set of constraints on p(x | y = 1), but leave p(x | y = 0) unconstrained. By Theorem 2, p̂(x | y = 1) then solves the single-class maximum entropy problem, we write p̂(x | y = 1) = p̂_ME(x), and p̂(x | y = 0) = π̃(x). Assume without loss of generality that examples x_1, . . .
, x_M are distinct (but allow them to have identical feature vectors). Thus, π̃(x) = 1/M on examples and zero elsewhere, and RE(p̂_ME ‖ π̃_X) = −H(p̂_ME) + ln M. Plugging these into Theorem 2, we can derive the conditional estimate p̂(y = 1 | x) across all unlabeled examples x:

p̂(y = 1 | x) = ν(y = 1) p̂_ME(x) e^{H(p̂_ME)} / ( ν(y = 0) + ν(y = 1) p̂_ME(x) e^{H(p̂_ME)} ) .    (8)

If constraints on p(x | y = 1) are chosen as in Corollary 3 then p̂_ME is exponential and Eq. (8) thus describes a logistic model. This model has the same coefficients as p̂_ME, with the intercept chosen so that "typical" examples x under p̂_ME (examples with log probability close to the expected log probability) yield predictions close to the default.

2.3 Discriminative Training: Class-robust Logistic Regression

Similar to the previous section, we consider ν(x, y) = π̃(x)ν(y). The set of constraints P will now also include equality constraints on p(x). Since π̃_lab(x) misspecifies the marginal, we use p(x) = π̃(x). The next theorem is an analog of Corollary 3 for discriminative training. It follows from a combination of Theorem 1 and duality of maxent with maximum likelihood [3]. A complete proof will appear in the extended version of this paper.

Theorem 4. Assume that the sets P^y_X are specified by inequality constraints of Eq. (4). Let P = {p ∈ Δ : p^y_X ∈ P^y_X and p_X = π̃_X}. If the set P is non-empty then robust Bayes and maxent over P are equivalent. For the solution p̂, p̂(x) = π̃(x) and p̂(y | x) takes the form

q_λ(y | x) ∝ ν(y) e^{λ_y · f(x) − λ_y · μ̃^y + Σ_j β^y_j |λ^y_j|}    (9)

and solves the regularized "logistic regression" problem

min_λ { (1/M) Σ_{i≤M} Σ_{y∈Y} [ −π̄(y | x_i) ln q_λ(y | x_i) ] + Σ_{y∈Y} π̄(y) Σ_{j∈J} [ β^y_j |λ^y_j| + (μ̄^y_j − μ̃^y_j) λ^y_j ] }    (10)

where π̄ is an arbitrary feasible point, π̄ ∈ P, and μ̄^y_j its class-conditional feature expectations.

We put logistic regression in quotes, because the model described by Eq. (9) is not the usual logistic model; however, once the parameters λ_y are fixed, Eq. (9) simply determines a logistic model with a special form of the intercept. Note that the second term of Eq. (10) is indeed a regularization, albeit possibly an asymmetric one, since any feasible π̄ will have |μ̄^y_j − μ̃^y_j| ≤ β^y_j. Since π̄(x) = π̃(x), π̄ is specified solely by π̄(y | x) and thus can be viewed as a tentative imputation of labels across all examples. We remark that the value of the objective of Eq. (10) does not depend on the choice of π̄, because a different choice of π̄ (influencing the first term) yields a different set of means μ̄^y_j (influencing the second term) and these differences cancel out. To provide a more concrete example and some intuition about Eq. (10), we now consider one-class estimation.

One-class estimation. A natural choice of π̄ is the "pseudo-empirical" distribution which views all unlabeled examples as negatives.
Pseudo-empirical means of class 1 match empirical averages of class 1 exactly, whereas pseudo-empirical means of class 0 can be arbitrary because they are unconstrained. The lack of constraints on class 0 forces the corresponding λ_y to equal zero. The objective can thus be formulated solely using λ_y for class 1; therefore, we will omit the superscript y. Eq. (10) after multiplying by M then becomes

min_λ { Σ_{i≤m} [ −ln q_λ(y = 1 | x_i) ] + Σ_{m
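The entropy-weighted one-class conversion of Eq. (8) can be illustrated numerically. A minimal sketch, with an invented maxent density over four example sites and the uniform default ν(y = 1) = 1/2:

```python
import numpy as np

def entropy_weighted_posterior(p_me, nu1=0.5):
    """Eq. (8): p(y=1|x) from a one-class maxent density p_ME.

    p_me: maxent probabilities over the M distinct example sites,
          summing to 1 (the empirical pi(x) is uniform, 1/M, there).
    nu1:  default class probability nu(y=1); nu(y=0) = 1 - nu1.
    """
    p_me = np.asarray(p_me, dtype=float)
    H = -np.sum(p_me * np.log(p_me))          # entropy of p_ME
    num = nu1 * p_me * np.exp(H)              # nu(1) * p_ME(x) * e^H
    return num / ((1.0 - nu1) + num)          # / (nu(0) + nu(1) p_ME(x) e^H)

post = entropy_weighted_posterior([0.5, 0.3, 0.15, 0.05])  # invented density
```

Consistent with the text, a "typical" site with p_ME(x) = e^{−H(p_ME)} gets prediction exactly ν(y = 1), and a uniform p_ME yields the default everywhere.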