{"title": "Estimating the class prior and posterior from noisy positives and unlabeled data", "book": "Advances in Neural Information Processing Systems", "page_first": 2693, "page_last": 2701, "abstract": "We develop a classification algorithm for estimating posterior distributions from positive-unlabeled data, that is robust to noise in the positive labels and effective for high-dimensional data. In recent years, several algorithms have been proposed to learn from positive-unlabeled data; however, many of these contributions remain theoretical, performing poorly on real high-dimensional data that is typically contaminated with noise. We build on this previous work to develop two practical classification algorithms that explicitly model the noise in the positive labels and utilize univariate transforms built on discriminative classifiers. We prove that these univariate transforms preserve the class prior, enabling estimation in the univariate space and avoiding kernel density estimation for high-dimensional data. The theoretical development and parametric and nonparametric algorithms proposed here constitute an important step towards wide-spread use of robust classification algorithms for positive-unlabeled data.", "full_text": "Estimating the class prior and posterior from noisy\n\npositives and unlabeled data\n\nShantanu Jain, Martha White, Predrag Radivojac\n\nDepartment of Computer Science\n\nIndiana University, Bloomington, Indiana, USA\n\n{shajain, martha, predrag}@indiana.edu\n\nAbstract\n\nWe develop a classi\ufb01cation algorithm for estimating posterior distributions from\npositive-unlabeled data, that is robust to noise in the positive labels and effective\nfor high-dimensional data. In recent years, several algorithms have been proposed\nto learn from positive-unlabeled data; however, many of these contributions re-\nmain theoretical, performing poorly on real high-dimensional data that is typically\ncontaminated with noise. 
We build on this previous work to develop two practical classification algorithms that explicitly model the noise in the positive labels and utilize univariate transforms built on discriminative classifiers. We prove that these univariate transforms preserve the class prior, enabling estimation in the univariate space and avoiding kernel density estimation for high-dimensional data. The theoretical development and the parametric and nonparametric algorithms proposed here constitute an important step towards widespread use of robust classification algorithms for positive-unlabeled data.\n\n1 Introduction\n\nAccess to positive, negative and unlabeled examples is a standard assumption for most semi-supervised binary classification techniques. In many domains, however, a sample from one of the classes (say, negatives) may not be available, leading to the setting of learning from positive and unlabeled data (Denis et al., 2005). Positive-unlabeled learning often emerges in the sciences and in commerce, where an observation of a positive example (say, that a protein catalyzes reactions or that a social network user likes a particular product) is usually reliable. Here, however, the absence of a positive observation cannot be interpreted as a negative example. In molecular biology, for example, an attempt to label a data point as positive (say, that a protein is an enzyme) may be unsuccessful for a variety of experimental and biological reasons, whereas in social networks an explicit dislike of a product may not be possible. Both scenarios lead to a situation where negative examples cannot be actively collected.\n\nFortunately, the absence of negatively labeled examples can be tackled by incorporating unlabeled examples as negatives, leading to the development of non-traditional classifiers. 
Here we follow the terminology of Elkan and Noto (2008): a traditional classifier predicts whether an example is positive or negative, whereas a non-traditional classifier predicts whether the example is positive or unlabeled. Positive vs. unlabeled (non-traditional) training is reasonable because the class posterior — and also the optimum scoring function for composite losses (Reid and Williamson, 2010) — in the traditional setting is monotonically related to the posterior in the non-traditional setting. However, the true posterior can be fully recovered from the non-traditional posterior only if we know the class prior, i.e., the proportion of positives in the unlabeled data. Knowledge of the class prior is also necessary for estimating performance criteria such as the error rate, the balanced error rate or the F-measure, and for finding the threshold for the non-traditional scoring function that leads to an optimal classifier with respect to some criterion (Menon et al., 2015).\n\nClass prior estimation in a nonparametric setting has been actively researched in the past decade, offering an extensive theory of identifiability (Ward et al., 2009; Blanchard et al., 2010; Scott et al., 2013; Jain et al., 2016) and a few practical solutions (Elkan and Noto, 2008; Ward et al., 2009; du Plessis and Sugiyama, 2014; Sanderson and Scott, 2014; Jain et al., 2016; Ramaswamy et al., 2016). Application of these algorithms to real data, however, is limited in that none of the proposed algorithms simultaneously deals with noise in the labels and practical estimation for high-dimensional data.\n\nMuch of the theory on learning class priors relies on the assumption that either the distribution of positives is known or that the positive sample is clean. In practice, however, labeled data sets contain class-label noise, where an unspecified number of negative examples contaminates the positive sample. 
This is a realistic scenario in the experimental sciences, where technological advances have enabled the generation of high-throughput data at a cost of occasional errors. One example comes from the study of proteins using analytical chemistry technology, i.e., mass spectrometry. In the process of peptide identification (Steen and Mann, 2004), bioinformatics methods are usually set to report results at a specified false discovery rate threshold (e.g., 1%). Unfortunately, the statistical assumptions in these experiments are sometimes violated, thereby leading to substantial noise in the reported results, as in the case of identifying protein post-translational modifications. Similar amounts of noise might appear in social networks such as Facebook, where some users select ‘like’ even when they do not actually like a particular post. Further, the only approach that does consider similar noise (Scott et al., 2013) requires density estimation, which is known to be problematic for high-dimensional data.\n\nIn this work, we propose the first classification algorithm, with class prior estimation, designed particularly for high-dimensional data with noise in the labeling of positive data. We first formalize the problem of class prior estimation from noisy positive and unlabeled data. We extend the existing identifiability theory for class prior estimation from positive-unlabeled data to this noise setting. We then show that we can practically estimate class priors and posterior distributions by first transforming the input space to a univariate space, where density estimation is reliable. We prove that these transformations preserve class priors and show that they correspond to training a non-traditional classifier. We derive a parametric algorithm and a nonparametric algorithm to learn the class priors. 
Finally, we carry out experiments on synthetic and real-life data and provide evidence that the new approaches are sound and effective.\n\n2 Problem formulation\n\nConsider a binary classification problem of mapping an input space X to an output space Y = {0, 1}. Let f be the true distribution of inputs. It can be represented as the mixture\n\nf(x) = αf1(x) + (1 − α)f0(x),  (1)\n\nwhere x ∈ X, y ∈ Y, fy are the distributions over X for the positive (y = 1) and negative (y = 0) class, respectively, and α ∈ [0, 1) is the class prior, the proportion of positive examples in f. We will refer to a sample from f as unlabeled data.\n\nLet g be the distribution of inputs for the labeled data. Because the labeled sample contains some mislabeled examples, the corresponding distribution is also a mixture of f1 and a small proportion, say 1 − β, of f0. That is,\n\ng(x) = βf1(x) + (1 − β)f0(x),  (2)\n\nwhere β ∈ (0, 1]. Observe that both mixtures have the same components but different mixing proportions. The simplest scenario is that the mixing components f0 and f1 correspond to the class-conditional distributions p(x|Y = 0) and p(x|Y = 1), respectively. However, our approach also permits transformations of the input space X, resulting in a more general setup.\n\nThe objective of this work is to study the estimation of the class prior α = p(Y = 1) and to propose practical algorithms for estimating α. The efficacy of this estimation is clearly tied to β: as β gets smaller, the noise in the positive labels becomes larger. We will discuss the identifiability of α and β and give a practical algorithm for estimating α (and β). 
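The two-mixture setup of Equations 1 and 2 is easy to simulate. A minimal sketch, with illustrative Gaussian components and illustrative values α = 0.3, β = 0.9 (not taken from the paper):

```python
import numpy as np

def sample_mixture(n, prop, sample_f1, sample_f0, rng):
    """Draw n points from prop * f1 + (1 - prop) * f0 (Eqs. 1 and 2)."""
    from_f1 = rng.random(n) < prop
    return np.where(from_f1, sample_f1(n, rng), sample_f0(n, rng))

rng = np.random.default_rng(0)
f1 = lambda n, r: r.normal(2.0, 1.0, n)    # positive component (illustrative)
f0 = lambda n, r: r.normal(-2.0, 1.0, n)   # negative component (illustrative)

unlabeled = sample_mixture(10_000, 0.3, f1, f0, rng)  # sample from f, alpha = 0.3
labeled = sample_mixture(1_000, 0.9, f1, f0, rng)     # sample from g, beta = 0.9
```

Both samples share the components f1 and f0; only the mixing proportion differs, which is exactly the structure the estimation problem exploits.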
We will then use these results to estimate the posterior distribution of the class variable, p(y|x), despite the fact that the labeled set does not contain any negative examples.\n\n3 Identifiability\n\nThe class prior is identifiable if there is a unique class prior for a given pair (f, g). Much of the identifiability characterization in this section has already been considered as the case of asymmetric noise (Scott et al., 2013); see Section 7 on related work. We recreate these results here with the aim to introduce required notation, to highlight several results that are important for later algorithm development, and to include a few missing results needed for our approach. Though the proof techniques are themselves quite different and could be of interest, we include them in the appendix due to space constraints.\n\nThere are typically two aspects to address with identifiability. First, one needs to determine if a problem is identifiable and, second, if it is not, propose a canonical form that is identifiable. In this section we will see that the class prior is not identifiable in general, because f0 can be a mixture containing f1 and vice versa. To ensure identifiability, it is necessary to choose a canonical form that prefers a class prior that makes the two components as different as possible; this canonical form was introduced as the mutual irreducibility condition (Scott et al., 2013) and is related to the proper novelty distribution (Blanchard et al., 2010) and the max-canonical form (Jain et al., 2016).\n\nWe discuss identifiability in terms of measures. Let µ, ν, µ0 and µ1 be probability measures defined on some σ-algebra A on X, corresponding to f, g, f0 and f1, respectively. 
It follows that\n\nµ = αµ1 + (1 − α)µ0,  (3)\nν = βµ1 + (1 − β)µ0.  (4)\n\nConsider a family of pairs of mixtures having the same components,\n\nF(Π) = {(µ, ν) : µ = αµ1 + (1 − α)µ0, ν = βµ1 + (1 − β)µ0, (µ0, µ1) ∈ Π, 0 ≤ α < β ≤ 1},\n\nwhere Π is some set of pairs of probability measures defined on A. The family is parametrized by the quadruple (α, β, µ0, µ1). The condition β > α means that ν has a greater proportion of µ1 compared to µ. This is consistent with our assumption that the labeled sample mainly contains positives. The most general choice for Π is\n\nΠ_all = P_all × P_all \ {(µ, µ) : µ ∈ P_all},\n\nwhere P_all is the set of all probability measures defined on A and {(µ, µ) : µ ∈ P_all} is the set of pairs with equal distributions. Removing equal pairs prevents µ and ν from being identical.\n\nWe now define the maximum proportion of a component λ1 in a mixture λ, which is used in the results below and to specify the criterion that enables identifiability; more specifically,\n\na^{λ1}_λ = max{α ∈ [0, 1] : λ = αλ1 + (1 − α)λ0, λ0 ∈ P_all}.\n\nOf particular interest is the case when a^{λ1}_λ = 0, which should be read as “λ is not a mixture containing λ1”. 
We finally define the set of all possible (α, β) that generate µ and ν when (µ0, µ1) varies in Π:\n\nA+(µ, ν, Π) = {(α, β) : µ = αµ1 + (1 − α)µ0, ν = βµ1 + (1 − β)µ0, (µ0, µ1) ∈ Π, 0 ≤ α < β ≤ 1}.\n\nIf A+(µ, ν, Π) is a singleton set for all (µ, ν) ∈ F(Π), then F(Π) is identifiable in (α, β).\n\nFirst, we show that the most general choice for Π, Π_all, leads to unidentifiability (Lemma 1). Fortunately, however, by choosing the restricted set\n\nΠ_res = {(µ0, µ1) ∈ Π_all : a^{µ1}_{µ0} = 0, a^{µ0}_{µ1} = 0}  (5)\n\nas Π, we do obtain identifiability (Theorem 1). In words, Π_res contains pairs of distributions where each distribution in a pair cannot be expressed as a mixture containing the other. The proofs of the results below are in the Appendix.\n\nLemma 1 (Unidentifiability) Given a pair of mixtures (µ, ν) ∈ F(Π_all), let parameters (α, β, µ0, µ1) generate (µ, ν) and let α+ = a^ν_µ, β+ = a^µ_ν. It follows that\n\n1. There is a one-to-one relation between (µ0, µ1) and (α, β), with\n\nµ0 = (βµ − αν)/(β − α),  µ1 = ((1 − α)ν − (1 − β)µ)/(β − α).  (6)\n\n2. Both expressions on the right-hand side of Equation 6 are well-defined probability measures if and only if α/β ≤ α+ and (1 − β)/(1 − α) ≤ β+.\n3. A+(µ, ν, Π_all) = {(α, β) : α/β ≤ α+, (1 − β)/(1 − α) ≤ β+}.\n4. 
F(Π_all) is unidentifiable in (α, β); i.e., (α, β) is not uniquely determined from (µ, ν).\n5. F(Π_all) is unidentifiable in α and β, individually; i.e., neither α nor β is uniquely determined from (µ, ν).\n\nObserve that the definition of a^{λ1}_λ and µ ≠ ν imply α+ < 1 and, consequently, any (α, β) ∈ A+(µ, ν, Π_all) satisfies α < β, as expected.\n\nTheorem 1 (Identifiability) Given (µ, ν) ∈ F(Π_all), let α+ = a^ν_µ and β+ = a^µ_ν. Let µ*0 = (µ − α+ν)/(1 − α+), µ*1 = (ν − β+µ)/(1 − β+) and\n\nα* = α+(1 − β+)/(1 − α+β+),  β* = (1 − β+)/(1 − α+β+).  (7)\n\nIt follows that\n\n1. (α*, β*, µ*0, µ*1) generate (µ, ν).\n2. (µ*0, µ*1) ∈ Π_res and, consequently, α* = a^{µ*1}_µ, β* = a^{µ*1}_ν.\n3. F(Π_res) contains all pairs of mixtures in F(Π_all).\n4. A+(µ, ν, Π_res) = {(α*, β*)}.\n5. F(Π_res) is identifiable in (α, β); i.e., (α, β) is uniquely determined from (µ, ν).\n\nWe refer to the expressions of µ and ν as mixtures of components µ0 and µ1 as a max-canonical form when (µ0, µ1) is picked from Π_res. This form enforces that µ1 is not a mixture containing µ0 and vice versa, which leads to µ0 and µ1 having maximum separation while still generating µ and ν. Each pair of distributions in F(Π_res) is represented in this form. 
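Equations 6 and 7 are directly computable. A small numerical check, on an illustrative univariate Gaussian pair (values not from the paper), confirms that Equation 6 recovers the component densities and that the outputs of Equation 7 satisfy the inverse identities α+ = α*/β* and β+ = (1 − β*)/(1 − α*), which follow algebraically:

```python
import numpy as np

def canonical_proportions(alpha_plus, beta_plus):
    """Equation 7: map the maximum proportions (alpha+, beta+) to the mixing
    proportions (alpha*, beta*) of the max-canonical form."""
    denom = 1.0 - alpha_plus * beta_plus
    return alpha_plus * (1.0 - beta_plus) / denom, (1.0 - beta_plus) / denom

def npdf(x, m):
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)

# Illustrative mixtures: f = alpha*f1 + (1-alpha)*f0, g = beta*f1 + (1-beta)*f0.
alpha, beta = 0.2, 0.85
x = np.linspace(-6.0, 6.0, 121)
f1_true, f0_true = npdf(x, 2.0), npdf(x, -2.0)
f = alpha * f1_true + (1 - alpha) * f0_true
g = beta * f1_true + (1 - beta) * f0_true

# Equation 6 (stated for measures) holds pointwise for the densities:
f0 = (beta * f - alpha * g) / (beta - alpha)
f1 = ((1 - alpha) * g - (1 - beta) * f) / (beta - alpha)
assert np.allclose(f0, f0_true) and np.allclose(f1, f1_true)

# Equation 7 round trip: alpha+ = alpha*/beta*, beta+ = (1-beta*)/(1-alpha*).
a_star, b_star = canonical_proportions(0.25, 0.4)
assert abs(a_star / b_star - 0.25) < 1e-12
assert abs((1 - b_star) / (1 - a_star) - 0.4) < 1e-12
```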
Identifiability of F(Π_res) in (α, β) occurs precisely when A+(µ, ν, Π_res) = {(α*, β*)}, i.e., (α*, β*) is the only pair of mixing proportions that can appear in a max-canonical form of µ and ν. Moreover, Statement 1 in Theorem 1 and Statement 1 in Lemma 1 imply that the max-canonical form is unique and completely specified by (α*, β*, µ*0, µ*1), with α* < β* following from Equation 7. Thus, using F(Π_res) to model the unlabeled and labeled data distributions makes estimation of not only α, the class prior, but also β, µ0, µ1 a well-posed problem. Moreover, due to Statement 3 in Theorem 1, there is no loss of modeling capability from using F(Π_res) instead of F(Π_all). Overall, identifiability, absence of loss of modeling capability and maximum separation between µ0 and µ1 combine to justify estimating α* as the class prior.\n\n4 Univariate Transformation\n\nThe theory and algorithms for class prior estimation are agnostic to the dimensionality of the data; in practice, however, this dimensionality can have important consequences. Parametric Gaussian mixture models trained via expectation-maximization (EM) are known to suffer strongly from collinearity in high-dimensional data. Nonparametric (kernel) density estimation is also known to have curse-of-dimensionality issues, both in theory (Liu et al., 2007) and in practice (Scott, 2008).\n\nWe address the curse of dimensionality by transforming the data to a single dimension. The transformation τ : X → R, surprisingly, is simply an output of a non-traditional classifier trained to separate the labeled sample, L, from the unlabeled sample, U. 
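As a concrete illustration, such a non-traditional classifier can be any scorer trained to separate L from U. Below is a minimal pure-NumPy logistic-regression sketch (data, names and hyperparameters are illustrative; the paper's experiments instead use neural network ensembles, described in Section 6):

```python
import numpy as np

def train_nontraditional_classifier(X_lab, X_unl, lr=0.1, epochs=500):
    """Logistic regression separating labeled (S = 1) from unlabeled (S = 0)
    points. Its linear score is a univariate transform tau: X -> R; when the
    model ranks like p(S = 1 | x, S in {0,1}), the transform is a monotone
    function of that posterior and hence alpha*-preserving (Theorem 2)."""
    X = np.vstack([X_lab, X_unl])
    s = np.r_[np.ones(len(X_lab)), np.zeros(len(X_unl))]
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))              # sigmoid
        w -= lr * Xb.T @ (p - s) / len(Xb)             # gradient step
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w  # tau: X -> R

rng = np.random.default_rng(0)
pos = lambda n: rng.normal(2.0, 1.0, (n, 2))           # positives (illustrative)
neg = lambda n: rng.normal(-2.0, 1.0, (n, 2))          # negatives (illustrative)
U = np.vstack([pos(300), neg(700)])                    # unlabeled, alpha = 0.3
L = np.vstack([pos(190), neg(10)])                     # noisy positives, beta = 0.95
tau = train_nontraditional_classifier(L, U)
```

Because L is enriched in positives relative to U, the learned score ranks positive-region points above negative-region points, which is all the downstream univariate estimation needs.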
The transform is similar to that in (Jain et al., 2016), except that it is not required to be calibrated like a posterior distribution; as shown below, a good ranking function is sufficient. First, however, we introduce notation and formalize the data generation steps (Figure 1).\n\nLet X be a random variable taking values in X, capturing the true distribution of inputs, µ, and Y be an unobserved random variable taking values in Y, giving the true class of the inputs. It follows that X|Y = 0 and X|Y = 1 are distributed according to µ0 and µ1, respectively. Let S be a selection random variable whose value in S = {0, 1, 2} determines the sample to which an input x is added (Figure 1). When S = 1, x is added to the noisy labeled sample; when S = 0, x is added to the unlabeled sample; and when S = 2, x is not added to either of the samples. It follows that\n\n[Figure 1: flowchart of the labeling procedure, with nodes “Input”, “Select for labeling”, “Success of labeling”, “Noisy positive (S = 1)”, “Unlabeled (S = 0)” and “Dropped (S = 2)”.]\n\nFigure 1: The labeling procedure, with S taking values from S = {0, 1, 2}. In the first step, the sample is randomly selected to attempt labeling, with some probability independent of X or Y. If it is not selected, it is added to the “Unlabeled” set. If it is selected, then labeling is attempted. If the true label is Y = 1, then with probability γ1 ∈ (0, 1) the labeling will succeed and the example is added to “Noisy positives”. Otherwise, it is added to the “Dropped” set. If the true label is Y = 0, then the attempted labeling is much more likely to fail but, because of noise, could succeed. The attempted labeling of Y = 0 succeeds with probability γ0, and the example is added to “Noisy positives”, even though it is actually a negative instance. γ0 = 0 leads to the no-noise case and the noise increases as γ0 increases. 
β = γ1α/(γ1α + γ0(1 − α)) gives the proportion of positives in the “Noisy positives”.\n\nX^u = X|S = 0 and X^l = X|S = 1 are distributed according to µ and ν, respectively. We make the following assumptions, which are consistent with the statements above:\n\np(y|S = 0) = p(y),  (8)\np(y = 1|S = 1) = β,  (9)\np(x|s, y) = p(x|y).  (10)\n\nAssumptions 8 and 9 state that the proportion of positives in the unlabeled sample and the labeled sample matches the true proportion in µ and ν, respectively. Assumption 10 states that the distribution of the positive inputs (and the negative inputs) in both the unlabeled and the labeled samples is equal and unbiased. Lemma 2 gives the implications of these assumptions. Statement 3 in Lemma 2 is particularly interesting and perhaps counter-intuitive, as it states that with non-zero probability some inputs need to be dropped.\n\nLemma 2 Let X, Y and S be random variables taking values in X, Y and S, respectively, and X^u = X|S = 0 and X^l = X|S = 1. For measures µ, ν, µ0, µ1 satisfying Equations 3 and 4 and µ1 ≠ µ0, let µ, µ0, µ1 give the distribution of X, X|Y = 0 and X|Y = 1, respectively. If X, Y and S satisfy Assumptions 8, 9 and 10, then\n\n1. X is independent of S = 0; i.e., p(x|S = 0) = p(x).\n2. X^u and X^l are distributed according to µ and ν, respectively.\n3. p(S = 2) ≠ 0.\n\nThe proof is in the Appendix. Next, we highlight the conditions under which the score function τ preserves α*. Observing that S serves as the pseudo class label for labeled vs. unlabeled classification as well, we first give an expression for the posterior:\n\nτp(x) = p(S = 1|x, S ∈ {0, 1}), ∀x ∈ X.  (11)\n\nTheorem 2 (α*-preserving transform) Let random variables X, Y, S, X^u, X^l and measures µ, ν, µ0, µ1 be as defined in Lemma 2. 
Let τp be the posterior as defined in Equation 11 and τ = H ∘ τp, where H is a 1-to-1 function on [0, 1] and ∘ is the composition operator. Assume\n\n1. (µ0, µ1) ∈ Π_res,\n2. X^u and X^l are continuous with densities f and g, respectively,\n3. µ_τ, ν_τ, µ_τ1 are the measures corresponding to τ(X^u), τ(X^l), τ(X1), respectively,\n4. (α+, β+, α*, β*) = (a^ν_µ, a^µ_ν, a^{µ1}_µ, a^{µ1}_ν) and (α+_τ, β+_τ, α*_τ, β*_τ) = (a^{ν_τ}_{µ_τ}, a^{µ_τ}_{ν_τ}, a^{µ_τ1}_{µ_τ}, a^{µ_τ1}_{ν_τ}).\n\nThen\n\n(α+_τ, β+_τ, α*_τ, β*_τ) = (α+, β+, α*, β*)\n\nand so τ is an α*-preserving transformation. Moreover, τp can also be used to compute the true posterior probability:\n\np(Y = 1|x) = [α*(1 − α*)/(β* − α*)] ( (p(S = 0)/p(S = 1)) · τp(x)/(1 − τp(x)) − (1 − β*)/(1 − α*) ).  (12)\n\nThe proof is in the Appendix. Theorem 2 shows that α* is the same for the original data and the transformed data if the transformation function τ can be expressed as a composition of τp and a one-to-one function H defined on [0, 1]. Trivially, τp itself is one such function. We emphasize, however, that α*-preservation is not limited by the efficacy of the calibration algorithm; uncalibrated scoring that ranks inputs in the same order as τp(x) also preserves α*. 
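Equation 12 can be sanity-checked numerically. In the sketch below (illustrative univariate Gaussian components and an arbitrary p(S = 1), not values from the paper), τp is computed analytically from Equation 11 and the posterior recovered through Equation 12 matches the posterior computed directly from the mixture:

```python
import numpy as np

def npdf(x, m, s=1.0):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

alpha, beta = 0.3, 0.9                     # illustrative class prior / label purity
pS1 = 0.4                                  # p(S = 1); p(S = 0) = 0.6
x = np.linspace(-5, 5, 201)

f1, f0 = npdf(x, 2.0), npdf(x, -2.0)
f = alpha * f1 + (1 - alpha) * f0          # unlabeled density, Eq. (1)
g = beta * f1 + (1 - beta) * f0            # noisy-positive density, Eq. (2)

tau_p = g * pS1 / (g * pS1 + f * (1 - pS1))   # posterior of Eq. (11)

# Eq. (12): recover p(Y = 1 | x) from tau_p, alpha*, beta* and p(S=0)/p(S=1).
ratio = (1 - pS1) / pS1
post = (alpha * (1 - alpha) / (beta - alpha)) * (
    ratio * tau_p / (1 - tau_p) - (1 - beta) / (1 - alpha))

direct = alpha * f1 / f                    # posterior computed directly
assert np.allclose(post, direct)
```

The identity holds because (p(S = 0)/p(S = 1)) · τp/(1 − τp) equals the density ratio g/f, from which the class posterior follows by solving Equations 1 and 2 for f1.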
Theorem 2 further demonstrates how the true posterior, p(Y = 1|x), can be recovered from τp by plugging estimates of τp, p(S = 0)/p(S = 1), α* and β* into Equation 12. The posterior probability τp can be estimated directly by using a probabilistic classifier or by calibrating a classifier’s score (Platt, 1999; Niculescu-Mizil and Caruana, 2005); |U|/|L| serves as an estimate of p(S = 0)/p(S = 1); Section 5 gives parametric and nonparametric approaches for the estimation of α* and β*.\n\n5 Algorithms\n\nIn this section, we derive a parametric and a nonparametric algorithm to estimate α* and β* from the unlabeled sample, U = {X^u_i}, and the noisy positive sample, L = {X^l_i}. In theory, both approaches can handle multivariate samples; in practice, however, to circumvent the curse of dimensionality, we exploit the theory of α*-preserving univariate transforms to transform the samples.\n\nParametric approach. The parametric approach is derived by modeling each sample as a two-component Gaussian mixture, sharing the same components but having different mixing proportions:\n\nX^u_i ∼ αN(u1, Σ1) + (1 − α)N(u0, Σ0),\nX^l_i ∼ βN(u1, Σ1) + (1 − β)N(u0, Σ0),\n\nwhere u1, u0 ∈ R^d and Σ1, Σ0 ∈ S^d_++, the set of all d×d positive definite matrices. The algorithm is an extension of the EM approach for Gaussian mixture models (GMMs) where, instead of estimating the parameters of a single mixture, the parameters of both mixtures (α, β, u0, u1, Σ0, Σ1) are estimated simultaneously by maximizing the combined likelihood over both U and L. This approach, which we refer to as a multi-sample GMM (MSGMM), exploits the constraint that the two mixtures share the same components. The update rules and their derivation are given in the Appendix.\n\nNonparametric approach. 
Our nonparametric strategy directly exploits the results of Lemma 1 and Theorem 1, which give a direct connection between (α+ = a^ν_µ, β+ = a^µ_ν) and (α*, β*). Therefore, for a two-component mixture sample, M, and a sample from one of the components, C, it only requires an algorithm to estimate the maximum proportion of C in M. For this purpose, we use the AlphaMax algorithm (Jain et al., 2016), briefly summarized in the Appendix. Specifically, our two-step approach for estimating α* and β* is as follows: (i) estimate α+ and β+ as the outputs of AlphaMax(U, L) and AlphaMax(L, U), respectively; (ii) estimate (α*, β*) from the estimates of (α+, β+) by applying Equation 7. We refer to our nonparametric algorithm as AlphaMax-N.\n\n6 Empirical investigation\n\nIn this section we systematically evaluate the new algorithms in a controlled, synthetic setting as well as on a variety of data sets from the UCI Machine Learning Repository (Lichman, 2013).\n\nExperiments on synthetic data: We start by evaluating all algorithms in a univariate setting where both mixing proportions, α and β, are known. We generate unit-variance Gaussian and unit-scale Laplace-distributed i.i.d. samples and explore the impact of the mixing proportions, the size of the component sample, and the separation and overlap between the mixing components on the accuracy of estimation. The class prior α was varied over {0.05, 0.25, 0.50} and the noise parameter β over {1.00, 0.95, 0.75}. The size of the labeled sample L was varied over {100, 1000}, whereas the size of the unlabeled sample U was fixed at 10000.\n\nExperiments on real-life data: We considered twelve real-life data sets from the UCI Machine Learning Repository. 
To adjust these data to our problems, categorical features were transformed into numerical features using a sparse binary representation, the regression data sets were turned into classification problems by thresholding at the mean of the target variable, and the multi-class classification problems were converted into binary problems by combining classes. In each data set, a subset of positive and negative examples was randomly selected to provide a labeled sample, while the remaining data (without class labels) were used as unlabeled data. The size of the labeled sample was kept at 1000 (or 100 for small data sets) and the maximum size of the unlabeled data was set to 10000.\n\nAlgorithms: We compare the AlphaMax-N and MSGMM algorithms to the Elkan-Noto algorithm (Elkan and Noto, 2008) as well as to the noiseless version of AlphaMax (Jain et al., 2016). There are several versions of the Elkan-Noto estimator and each can use any underlying classifier. We used the e1 alternative estimator combined with ensembles of 100 two-layer feed-forward neural networks, each with five hidden units. The out-of-bag scores of the same classifier were used as a class-prior-preserving transformation that created the input to the AlphaMax algorithms. It is important to mention that neither the Elkan-Noto nor the AlphaMax algorithm was developed to handle noisy labeled data. In addition, the theory behind the Elkan-Noto estimator restricts its use to class-conditional distributions with non-overlapping supports. The algorithm by du Plessis and Sugiyama (2014) minimizes the same objective as the e1 Elkan-Noto estimator and, thus, was not implemented.\n\nEvaluation: All experiments were repeated 50 times to be able to draw conclusions with statistical significance. 
In real-life data, the labeled sample was created randomly by choosing an appropriate number of positive and negative examples to satisfy the conditions on β and on the size of the labeled sample, while the remaining data were used as the unlabeled sample. Therefore, the class prior in the unlabeled data varies with the selection of the noise parameter β. The mean absolute difference between the true and estimated class priors was used as the performance measure. The best performing algorithm on each data set was determined by multiple hypothesis testing using a P-value of 0.05 with Bonferroni correction.\n\nResults: The comprehensive results for synthetic data drawn from univariate Gaussian and Laplace distributions are shown in the Appendix (Table 2). In these experiments no transformation was applied prior to running any of the algorithms. As expected, the results show excellent performance of the MSGMM model on the Gaussian data. These results significantly degrade on Laplace-distributed data, suggesting sensitivity to the underlying assumptions. On the other hand, AlphaMax-N was accurate over all data sets and also robust to noise. These results suggest that the new parametric and nonparametric algorithms perform well in these controlled settings.\n\nTable 1 shows the results on twelve real data sets. Here, the AlphaMax and AlphaMax-N algorithms demonstrate significant robustness to noise, although the parametric version MSGMM was competitive in some cases. On the other hand, the Elkan-Noto algorithm expectedly degrades with noise. Finally, we investigated the practical usefulness of the α*-preserving transform. Table 3 (Appendix) shows the results of AlphaMax-N and MSGMM on the real data sets, with and without using the transform. 
Because of computational and numerical issues, we reduced the dimensionality by using principal component analysis (the original data caused matrix-singularity issues for MSGMM and density-estimation issues for AlphaMax-N). MSGMM deteriorates significantly without the transform, whereas AlphaMax-N preserves some signal for the class prior. AlphaMax-N with the transform, however, shows superior performance on most data sets.\n\n7 Related work\n\nClass prior estimation in a semi-supervised setting, including positive-unlabeled learning, has been extensively discussed previously; see Saerens et al. (2002); Cortes et al. (2008); Elkan and Noto (2008); Blanchard et al. (2010); Scott et al. (2013); Jain et al. (2016) and references therein. Recently, a general setting for label noise, called the mutual contamination model, has also been introduced. The aim under this model is to estimate multiple unknown base distributions using multiple random samples that are composed of different convex combinations of those base distributions (Katz-Samuels and Scott, 2016). The setting of asymmetric label noise is a subset of this more general setting, treated under general conditions by Scott et al. (2013), and previously investigated under a more restrictive setting as co-training (Blum and Mitchell, 1998). A natural approach is to use robust estimation to learn in the presence of class noise; this strategy, however, has been shown to be ineffective, both theoretically (Long and Servedio, 2010; Manwani and Sastry, 2013) and empirically (Hawkins and McLachlan, 1997; Bashir and Carter, 2005), indicating the need to explicitly model the noise. Generative mixture-model approaches that explicitly model the noise have also been developed (Lawrence and Scholkopf, 2001; Bouveyron and Girard, 2009); these algorithms, however, assume labeled data for each class. 
Table 1: Mean absolute difference between estimated and true mixing proportions over twelve data sets from the UCI Machine Learning Repository. Statistical significance was evaluated by comparing the Elkan-Noto algorithm, AlphaMax, AlphaMax-N, and the multi-sample GMM after applying a multivariate-to-univariate transform (MSGMM-T). Bold type indicates the winner and an asterisk indicates statistical significance. For each data set, shown are the true mixing proportion (α), the true proportion of positives in the labeled sample (β), the sample dimensionality (d), the number of positive examples (n1), the total number of examples (n), and the area under the ROC curve (AUC) for a model trained between the labeled and unlabeled data.

[Table 1 body: per-data-set results for Bank, Concrete, Gas, Housing, Landsat, Mushroom, Pageblock, Pendigit, Pima, Shuttle, Spambase, and Wine, each evaluated at β ∈ {1.00, 0.95, 0.75}.]

The most closely related work is Scott et al. (2013): although they did not explicitly treat positive-unlabeled learning with noisy positives, their formulation can accommodate this setting by taking π0 = α and β = 1 − π1. The theoretical and algorithmic treatment, however, is very different. Their focus is on identifiability and on analyzing convergence rates and statistical properties, assuming access to some function κ* that can obtain proportions between samples. They do not explicitly address issues with high-dimensional data, nor do they focus on algorithms to obtain κ*.
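The correspondence with the mutual contamination notation (π0 = α, π1 = 1 − β) can be checked numerically: the mean of each contaminated sample must equal the matching convex combination of the base-distribution means. A minimal sketch with illustrative Gaussian base distributions (all numeric values are ours):

```python
import random

rng = random.Random(1)
mu_pos, mu_neg = 2.0, -1.0     # means of the two base distributions
alpha, beta = 0.3, 0.9         # class prior; cleanliness of the positive labels
n = 200_000

def draw(p_pos):
    """One point from a contaminated sample: positive with probability p_pos."""
    mu = mu_pos if rng.random() < p_pos else mu_neg
    return rng.gauss(mu, 1.0)

unlabeled = [draw(alpha) for _ in range(n)]   # pi_0 = alpha positives mixed in
noisy_pos = [draw(beta) for _ in range(n)]    # pi_1 = 1 - beta negatives mixed in

mean = lambda xs: sum(xs) / len(xs)
# Sample means should match the convex combinations of the base means.
print(mean(unlabeled), alpha * mu_pos + (1 - alpha) * mu_neg)  # both near -0.1
print(mean(noisy_pos), beta * mu_pos + (1 - beta) * mu_neg)    # both near 1.7
```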
In contrast, we focus primarily on the univariate transformation to handle high-dimensional data and on practical algorithms for estimating α*. Using supervised learning for the class-prior-preserving transformation provides a rich set of techniques for addressing high-dimensional data.

8 Conclusion

In this paper, we developed a practical algorithm for the classification of positive-unlabeled data with noise in the labeled data set. In particular, we focused on a strategy for high-dimensional data, providing a univariate transform that reduces the dimension of the data and preserves the class prior, so that estimation in the reduced space remains valid and is further useful for classification. This approach yields a simple algorithm that simultaneously improves estimation of the class prior and provides a resulting classifier. We derived a parametric and a nonparametric version of the algorithm and evaluated their performance on a wide variety of learning scenarios and data sets. To the best of our knowledge, this algorithm represents one of the first practical and easy-to-use approaches to learning from high-dimensional positive-unlabeled data with noise in the labels.

Acknowledgements
We thank Prof. Michael W. Trosset for helpful comments. Grant support: NSF DBI-1458477, NIH R01MH105524, NIH R01GM103725, and the Indiana University Precision Health Initiative.

References
S. Bashir and E. M. Carter. High breakdown mixture discriminant analysis. J Multivar Anal, 93(1):102–111, 2005.

G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. J Mach Learn Res, 11:2973–3009, 2010.

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT 1998, pages 92–100, 1998.

C. Bouveyron and S. Girard. Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognit, 42(11):2649–2658, 2009.

C. 
Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. ALT 2008,\n\npages 38\u201353, 2008.\n\nF. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theor Comput Sci,\n\n348(16):70\u201383, 2005.\n\nM. C. du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data. IEICE Trans Inf\n\n& Syst, E97-D(5):1358\u20131362, 2014.\n\nC. Elkan and K. Noto. Learning classi\ufb01ers from only positive and unlabeled data. KDD 2008, pages 213\u2013220,\n\n2008.\n\nD. M. Hawkins and G. J. McLachlan. High-breakdown linear discriminant analysis. J Am Stat Assoc, 92(437):\n\n136\u2013143, 1997.\n\nS. Jain, M. White, M. W. Trosset, and P. Radivojac. Nonparametric semi-supervised learning of class propor-\n\ntions. arXiv preprint arXiv:1601.01944, 2016. URL http://arxiv.org/abs/1601.01944.\n\nJ. Katz-Samuels and C. Scott. A mutual contamination analysis of mixed membership and partial label models.\n\narXiv preprint arXiv:1602.06235, 2016. URL http://arxiv.org/abs/1602.06235.\n\nN. D. Lawrence and B. Scholkopf. Estimating a kernel Fisher discriminant in the presence of label noise. ICML\n\n2001, pages 306\u2013313, 2001.\n\nM. Lichman. UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml.\nH. Liu, J. D. Lafferty, and L. A. Wasserman. Sparse nonparametric density estimation in high dimensions using\n\nthe rodeo. AISTATS 2007, pages 283\u2013290, 2007.\n\nP. M. Long and R. A. Servedio. Random classi\ufb01cation noise defeats all convex potential boosters. Mach Learn,\n\n78(3):287\u2013304, 2010.\n\nN. Manwani and P. S. Sastry. Noise tolerance under risk minimization. IEEE T Cybern, 43(3):1146\u20131151,\n\n2013.\n\nA. K. Menon, B. van Rooyen, C. S. Ong, and R. C. Williamson. Learning from corrupted binary labels via\n\nclass-probability estimation. ICML 2015, pages 125\u2013134, 2015.\n\nA. Niculescu-Mizil and R. Caruana. Obtaining calibrated probabilities from boosting. 
UAI 2005, pages 413–420, 2005.

J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, pages 61–74. MIT Press, 1999.

H. G. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embedding of distributions. arXiv preprint arXiv:1603.02501, 2016. URL https://arxiv.org/abs/1603.02501.

M. D. Reid and R. C. Williamson. Composite binary losses. J Mach Learn Res, 11:2387–2422, 2010.

M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Comput, 14:21–41, 2002.

T. Sanderson and C. Scott. Class proportion estimation with application to multiclass anomaly rejection. AISTATS 2014, pages 850–858, 2014.

C. Scott, G. Blanchard, and G. Handy. Classification with asymmetric label noise: consistency and maximal denoising. J Mach Learn Res W&CP, 30:489–511, 2013.

D. W. Scott. The curse of dimensionality and dimension reduction. Multivariate Density Estimation: Theory, Practice, and Visualization, pages 195–217, 2008.

H. Steen and M. Mann. The ABC's (and XYZ's) of peptide sequencing. Nat Rev Mol Cell Biol, 5(9):699–711, 2004.

G. Ward, T. Hastie, S. Barry, J. Elith, and J. R. Leathwick. Presence-only data and the EM algorithm. Biometrics, 65(2):554–563, 2009.