{"title": "Correcting Sample Selection Bias by Unlabeled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 601, "page_last": 608, "abstract": "", "full_text": "Correcting Sample Selection Bias by Unlabeled Data\n\nJiayuan Huang\n\nSchool of Computer Science\nUniv. of Waterloo, Canada\nj9huang@cs.uwaterloo.ca\n\nAlexander J. Smola\n\nNICTA, ANU\n\nCanberra, Australia\n\nAlex.Smola@anu.edu.au\n\nArthur Gretton\n\nMPI for Biological Cybernetics\n\nT\u00fcbingen, Germany\n\narthur@tuebingen.mpg.de\n\nKarsten M. Borgwardt\n\nLudwig-Maximilians-University\n\nMunich, Germany\nkb@dbs.ifi.lmu.de\n\nBernhard Sch\u00f6lkopf\n\nMPI for Biological Cybernetics\n\nT\u00fcbingen, Germany\nbs@tuebingen.mpg.de\n\nAbstract\n\nWe consider the scenario where training and test data are drawn from different\ndistributions, commonly referred to as sample selection bias. Most algorithms\nfor this setting try to first recover sampling distributions and then make appropriate\ncorrections based on the distribution estimate. We present a nonparametric\nmethod which directly produces resampling weights without distribution estimation.\nOur method works by matching distributions between training and testing\nsets in feature space. Experimental results demonstrate that our method works\nwell in practice.\n\n1 Introduction\n\nThe default assumption in many learning scenarios is that training and test data are independently\nand identically (iid) drawn from the same distribution. When the distributions on training and test\nset do not match, we are facing sample selection bias or covariate shift. Specifically, given a domain\nof patterns X and labels Y, we obtain training samples Z = {(x1, y1), . . . , (xm, ym)} \u2286 X \u00d7 Y from\na Borel probability distribution Pr(x, y), and test samples Z\u2032 = {(x\u20321, y\u20321), . . . 
, (x\u2032m\u2032, y\u2032m\u2032)} \u2286 X \u00d7 Y\ndrawn from another such distribution Pr\u2032(x, y).\nAlthough there exists previous work addressing this problem [2, 5, 8, 9, 12, 16, 20], sample selection\nbias is typically ignored in standard estimation algorithms. Nonetheless, in reality the problem\noccurs rather frequently: while the available data have been collected in a biased manner, the test is\nusually performed over a more general target population. Below, we give two examples, but similar\nsituations occur in many other domains.\n1. Suppose we wish to generate a model to diagnose breast cancer. Suppose, moreover, that most\nwomen who participate in the breast screening test are middle-aged and likely to have attended the\nscreening in the preceding 3 years. Consequently our sample includes mostly older women and\nthose who have low risk of breast cancer because they have been tested before. The examples do not\nreflect the general population with respect to age (which amounts to a bias in Pr(x)) and they only\ncontain very few diseased cases (i.e. a bias in Pr(y|x)).\n2. Gene expression profile studies using DNA microarrays are used in tumor diagnosis. A common\nproblem is that the samples are obtained using certain protocols, microarray platforms and analysis\ntechniques. In addition, they typically have small sample sizes. The test cases are recorded under\ndifferent conditions, resulting in a different distribution of gene expression values.\n\nIn this paper, we utilize the availability of unlabeled data to direct a sample selection de-biasing\nprocedure for various learning methods. Unlike previous work we infer the resampling weight directly\nby distribution matching between training and testing sets in feature space in a non-parametric\nmanner. We do not require the estimation of biased densities or selection probabilities [20, 2, 12], or\nthe assumption that probabilities of the different classes are known [8]. 
Rather, we account for the\ndifference between Pr(x, y) and Pr\u2032(x, y) by reweighting the training points such that the means\nof the training and test points in a reproducing kernel Hilbert space (RKHS) are close. We call this\nreweighting process kernel mean matching (KMM). When the RKHS is universal [14], the population\nsolution to this minimisation is exactly the ratio Pr\u2032(x, y)/Pr(x, y); however, we also derive a\ncautionary result, which states that even granted this ideal population reweighting, the convergence\nof the empirical means in the RKHS depends on an upper bound on the ratio of distributions (but\nnot on the dimension of the space), and will be extremely slow if this ratio is large.\n\nThe required optimisation is a simple quadratic program (QP), and the reweighted sample can be incorporated\nstraightforwardly into several different regression and classification algorithms. We apply our\nmethod to a variety of regression and classification benchmarks from UCI and elsewhere, as well as\nto classification of microarrays from prostate and breast cancer patients. These experiments demonstrate\nthat KMM greatly improves learning performance compared with training on unweighted data,\nand that our reweighting scheme can in some cases outperform reweighting using the true sample\nbias distribution.\nKey Assumption 1: In general, the estimation problem with two different distributions Pr(x, y)\nand Pr\u2032(x, y) is unsolvable, as the two terms could be arbitrarily far apart. In particular, for arbitrary\nPr(y|x) and Pr\u2032(y|x), there is no way we could infer a good estimator based on the training\nsample. Hence we make the simplifying assumption that Pr(x, y) and Pr\u2032(x, y) only differ via\nPr(x, y) = Pr(y|x) Pr(x) and Pr\u2032(x, y) = Pr(y|x) Pr\u2032(x). In other words, the conditional probabilities of y|x\nremain unchanged (this particular case of sample selection bias has been termed covariate shift\n[12]). 
However, we will see experimentally that even in situations where our key assumption is not\nvalid, our method can nonetheless perform well (see Section 4).\n\n2 Sample Reweighting\n\nWe begin by stating the problem of regularized risk minimization. In general a learning method\nminimizes the expected risk\n\n$$R[\\Pr, \\theta, l(x, y, \\theta)] = \\mathbf{E}_{(x,y) \\sim \\Pr}\\left[l(x, y, \\theta)\\right] \\qquad (1)$$\n\nof a loss function l(x, y, \u03b8) that depends on a parameter \u03b8. For instance, the loss function could\nbe the negative log-likelihood \u2212 log Pr(y|x, \u03b8), a misclassification loss, or some form of regression\nloss. However, since typically we only observe examples (x, y) drawn from Pr(x, y) rather than\nPr\u2032(x, y), we resort to computing the empirical average\n\n$$R_{\\mathrm{emp}}[Z, \\theta, l(x, y, \\theta)] = \\frac{1}{m} \\sum_{i=1}^{m} l(x_i, y_i, \\theta). \\qquad (2)$$\n\nTo avoid overfitting, instead of minimizing Remp directly we often minimize a regularized variant\nRreg[Z, \u03b8, l(x, y, \u03b8)] := Remp[Z, \u03b8, l(x, y, \u03b8)] + \u03bb\u2126[\u03b8], where \u2126[\u03b8] is a regularizer.\n\n2.1 Sample Correction\n\nThe problem is more involved if Pr(x, y) and Pr\u2032(x, y) are different. The training set is drawn from\nPr; however, what we would really like is to minimize R[Pr\u2032, \u03b8, l] as we wish to generalize to test\nexamples drawn from Pr\u2032. An observation from the field of importance sampling is that\n\n$$R[\\Pr', \\theta, l(x, y, \\theta)] = \\mathbf{E}_{(x,y) \\sim \\Pr'}\\left[l(x, y, \\theta)\\right] = \\mathbf{E}_{(x,y) \\sim \\Pr}\\Big[\\underbrace{\\frac{\\Pr'(x, y)}{\\Pr(x, y)}}_{:= \\beta(x, y)} l(x, y, \\theta)\\Big] \\qquad (3)$$\n\n$$= R[\\Pr, \\theta, \\beta(x, y) l(x, y, \\theta)], \\qquad (4)$$\n\nprovided that the support of Pr\u2032 is contained in the support of Pr. Given \u03b2(x, y), we can thus\ncompute the risk with respect to Pr\u2032 using Pr. 
Similarly, we can estimate the risk with respect to\nPr\u2032 by computing Remp[Z, \u03b8, \u03b2(x, y)l(x, y, \u03b8)].\nThe key problem is that the coefficients \u03b2(x, y) are usually unknown, and we need to estimate them\nfrom the data. When Pr and Pr\u2032 differ only in Pr(x) and Pr\u2032(x), we have \u03b2(x, y) = Pr\u2032(x)/Pr(x),\nwhere \u03b2 is a reweighting factor for the training examples. We thus reweight every observation\n(x, y) such that observations that are under-represented in Pr obtain a higher weight, whereas over-represented\ncases are downweighted.\nNow we could estimate Pr and Pr\u2032 and subsequently compute \u03b2 based on those estimates. This is\nclosely related to the methods in [20, 8], as they have to either estimate the selection probabilities\nor have prior knowledge of the class distributions. Although intuitive, this approach has two major\nproblems: first, it only works whenever the density estimates for Pr and Pr\u2032 (or potentially, the selection\nprobabilities or class distributions) are good. In particular, small errors in estimating Pr can\nlead to large coefficients \u03b2 and consequently to a serious overweighting of the corresponding observations.\nSecond, estimating both densities just for the purpose of computing reweighting coefficients\nmay be overkill: we may be able to directly estimate the coefficients \u03b2i := \u03b2(xi, yi) without having\nto estimate the two distributions. 
Furthermore, we can regularize \u03b2i directly with more flexibility,\ntaking prior knowledge into account similar to learning methods for other problems.\n\n2.2 Using the sample reweighting in learning algorithms\n\nBefore we describe how we will estimate the reweighting coefficients \u03b2i, let us briefly discuss how\nto minimize the reweighted regularized risk\n\n$$R_{\\mathrm{reg}}[Z, \\beta, l(x, y, \\theta)] := \\frac{1}{m} \\sum_{i=1}^{m} \\beta_i l(x_i, y_i, \\theta) + \\lambda \\Omega[\\theta], \\qquad (5)$$\n\nin the classification and regression settings (an additional classification method is discussed in the\naccompanying technical report [7]).\nSupport Vector Classification: Using the setting of [17], we obtain the following minimization\nproblem (the original SVMs can be formulated in the same way):\n\n$$\\operatorname*{minimize}_{\\theta, \\xi} \\; \\frac{1}{2} \\|\\theta\\|^2 + C \\sum_{i=1}^{m} \\beta_i \\xi_i \\qquad (6a)$$\n\n$$\\text{subject to } \\langle \\phi(x_i, y_i) - \\phi(x_i, y), \\theta \\rangle \\geq 1 - \\xi_i / \\Delta(y_i, y) \\text{ for all } y \\in \\mathcal{Y}, \\text{ and } \\xi_i \\geq 0. \\qquad (6b)$$\n\nHere, \u03c6(x, y) is a feature map from X \u00d7 Y into a feature space F, where \u03b8 \u2208 F and \u2206(y, y\u2032) denotes\na discrepancy function between y and y\u2032. The dual of (6) is given by\n\n$$\\operatorname*{minimize}_{\\alpha} \\; \\frac{1}{2} \\sum_{i,j=1;\\, y, y' \\in \\mathcal{Y}}^{m} \\alpha_{iy} \\alpha_{jy'} k(x_i, y, x_j, y') - \\sum_{i=1;\\, y \\in \\mathcal{Y}}^{m} \\alpha_{iy} \\qquad (7a)$$\n\n$$\\text{subject to } \\alpha_{iy} \\geq 0 \\text{ for all } i, y \\text{ and } \\sum_{y \\in \\mathcal{Y}} \\alpha_{iy} / \\Delta(y_i, y) \\leq \\beta_i C. \\qquad (7b)$$\n\nHere k(x, y, x\u2032, y\u2032) := $\\langle \\phi(x, y), \\phi(x', y') \\rangle$ denotes the inner product between the feature maps. This\ngeneralizes the observation-dependent binary SV classification described in [10]. Modifications of\nexisting solvers, such as SVMStruct [17], are straightforward.\nPenalized LMS Regression: Assume $l(x, y, \\theta) = (y - \\langle \\phi(x), \\theta \\rangle)^2$ and $\\Omega[\\theta] = \\|\\theta\\|^2$. 
Here we\nminimize\n\n$$\\sum_{i=1}^{m} \\beta_i (y_i - \\langle \\phi(x_i), \\theta \\rangle)^2 + \\lambda \\|\\theta\\|^2. \\qquad (8)$$\n\nDenote by $\\bar{\\beta}$ the diagonal matrix with diagonal (\u03b21, . . . , \u03b2m) and let $K \\in \\mathbb{R}^{m \\times m}$ be the kernel\nmatrix Kij = k(xi, xj). In this case minimizing (8) is equivalent to minimizing $(y - K\\alpha)^\\top \\bar{\\beta} (y - K\\alpha) + \\lambda \\alpha^\\top K \\alpha$ with respect to \u03b1. Assuming that K and $\\bar{\\beta}$ have full rank, the minimization yields\n$\\alpha = (\\lambda \\bar{\\beta}^{-1} + K)^{-1} y$. The advantage of this formulation is that it can be solved as easily as solving\nthe standard penalized regression problem. Essentially, we rescale the regularizer depending on the\npattern weights: the higher the weight of an observation, the less we regularize.\n\n3 Distribution Matching\n\n3.1 Kernel Mean Matching and its relation to importance sampling\n\nLet \u03a6 : X \u2192 F be a map into a feature space F and denote by \u00b5 : P \u2192 F the expectation operator\n\n$$\\mu(\\Pr) := \\mathbf{E}_{x \\sim \\Pr(x)}\\left[\\Phi(x)\\right]. \\qquad (9)$$\n\nClearly \u00b5 is a linear operator mapping the space of all probability distributions P into feature space.\nDenote by M(\u03a6) := {\u00b5(Pr) where Pr \u2208 P} the image of P under \u00b5. This set is also often referred\nto as the marginal polytope. We have the following theorem (proved in [7]):\nTheorem 1 The operator \u00b5 is bijective if F is an RKHS with a universal kernel k(x, x\u2032) =\n$\\langle \\Phi(x), \\Phi(x') \\rangle$ in the sense of Steinwart [15].\nThe use of feature space means to compare distributions is further explored in [3]. The practical\nconsequence of this (rather abstract) result is that if we know \u00b5(Pr\u2032), we can infer a suitable \u03b2 by\nsolving the following minimization problem:\n\n$$\\operatorname*{minimize}_{\\beta} \\; \\left\\| \\mu(\\Pr') - \\mathbf{E}_{x \\sim \\Pr(x)}\\left[\\beta(x) \\Phi(x)\\right] \\right\\| \\; \\text{ subject to } \\beta(x) \\geq 0 \\text{ and } \\mathbf{E}_{x \\sim \\Pr(x)}\\left[\\beta(x)\\right] = 1.$$ 
(10)\n\nThis is the kernel mean matching (KMM) procedure. For a proof of the following (and further\nresults in the paper) see [7].\nLemma 2 The problem (10) is convex. Moreover, assume that Pr\u2032 is absolutely continuous with\nrespect to Pr (so Pr(A) = 0 implies Pr\u2032(A) = 0). Finally assume that k is universal. Then the\nsolution \u03b2(x) of (10) satisfies Pr\u2032(x) = \u03b2(x) Pr(x).\n\n3.2 Convergence of reweighted means in feature space\n\nLemma 2 shows that in principle, if we knew Pr and \u00b5(Pr\u2032), we could fully recover Pr\u2032 by solving\na simple quadratic program. In practice, however, neither \u00b5(Pr\u2032) nor Pr is known. Instead, we only\nhave samples X and X\u2032 of size m and m\u2032, drawn iid from Pr and Pr\u2032 respectively.\nNaively we could just replace the expectations in (10) by empirical averages and hope that the\nresulting optimization problem provides us with a good estimate of \u03b2. However, it is to be expected\nthat empirical averages will differ from each other due to finite sample size effects. In this section,\nwe explore two such effects. First, we demonstrate that in the finite sample case, for a fixed \u03b2, the\nempirical estimate of the expectation of \u03b2 is asymptotically normally distributed: this provides a natural limit on\nthe precision with which we should enforce the constraint $\\int \\beta(x) \\, d\\Pr(x) = 1$ when using empirical\nexpectations (we will return to this point in the next section).\nLemma 3 If \u03b2(x) \u2208 [0, B] is some fixed function of x \u2208 X, then given xi \u223c Pr iid such that \u03b2(xi)\nhas finite mean and non-zero variance, the sample mean $\\frac{1}{m} \\sum_i \\beta(x_i)$ converges in distribution to a\nGaussian with mean $\\int \\beta(x) \\, d\\Pr(x)$ and standard deviation bounded by $\\frac{B}{2\\sqrt{m}}$.\nThis lemma is a direct consequence of the central limit theorem [1, Theorem 5.5.15]. 
Alternatively,\nit is straightforward to get a large deviation bound that likewise converges as 1/\u221am [6].\nOur second result demonstrates the deviation between the empirical means of Pr\u2032 and \u03b2(x) Pr in\nfeature space, given \u03b2(x) is chosen perfectly in the population sense. In particular, this result shows\nthat convergence of these two means will be slow if there is a large difference in the probability mass\nof Pr\u2032 and Pr (and thus the bound B on the ratio of probability masses is large).\nLemma 4 In addition to the Lemma 3 conditions, assume that we draw X\u2032 := {x\u20321, . . . , x\u2032m\u2032} iid\nfrom X using Pr\u2032 = \u03b2(x) Pr, and $\\|\\Phi(x)\\| \\leq R$ for all x \u2208 X. Then with probability at least 1 \u2212 \u03b4\n\n$$\\left\\| \\frac{1}{m} \\sum_{i=1}^{m} \\beta(x_i) \\Phi(x_i) - \\frac{1}{m'} \\sum_{i=1}^{m'} \\Phi(x'_i) \\right\\| \\leq \\left(1 + \\sqrt{-2 \\log \\delta/2}\\right) R \\sqrt{B^2/m + 1/m'} \\qquad (11)$$\n\nNote that this lemma shows that for a given \u03b2(x), which is correct in the population sense, we can\nbound the deviation between the feature space mean of Pr\u2032 and the reweighted feature space mean\nof Pr. It is not a guarantee that we will find coefficients \u03b2i that are close to \u03b2(xi), but it gives us a\nuseful upper bound on the outcome of the optimization.\nLemma 4 implies that we have $O(B \\sqrt{1/m + 1/(m' B^2)})$ convergence in m, m\u2032 and B. This means\nthat, for very different distributions we need a large equivalent sample size to get reasonable convergence.\nOur result also implies that it is unrealistic to assume that the empirical means (reweighted\nor not) should match exactly.\n\n3.3 Empirical KMM optimization\n\nTo find suitable values of $\\beta \\in \\mathbb{R}^m$ we want to minimize the discrepancy between means subject\nto constraints \u03b2i \u2208 [0, B] and $\\left| \\frac{1}{m} \\sum_{i=1}^{m} \\beta_i - 1 \\right| \\leq \\epsilon$. 
The former limits the scope of discrepancy\nbetween Pr and Pr\u2032 whereas the latter ensures that the measure \u03b2(x) Pr(x) is close to a probability\ndistribution. The objective function is given by the discrepancy term between the two empirical\nmeans. Using Kij := k(xi, xj) and $\\kappa_i := \\frac{m}{m'} \\sum_{j=1}^{m'} k(x_i, x'_j)$ one may check that\n\n$$\\left\\| \\frac{1}{m} \\sum_{i=1}^{m} \\beta_i \\Phi(x_i) - \\frac{1}{m'} \\sum_{i=1}^{m'} \\Phi(x'_i) \\right\\|^2 = \\frac{1}{m^2} \\beta^\\top K \\beta - \\frac{2}{m^2} \\kappa^\\top \\beta + \\text{const}.$$\n\nWe now have all necessary ingredients to formulate a quadratic problem to find suitable \u03b2 via\n\n$$\\operatorname*{minimize}_{\\beta} \\; \\frac{1}{2} \\beta^\\top K \\beta - \\kappa^\\top \\beta \\; \\text{ subject to } \\beta_i \\in [0, B] \\text{ and } \\left| \\sum_{i=1}^{m} \\beta_i - m \\right| \\leq m \\epsilon. \\qquad (12)$$\n\nIn accordance with Lemma 3, we conclude that a good choice of \u03b5 should be O(B/\u221am). Note\nthat (12) is a quadratic program which can be solved efficiently using interior point methods or any\nother successive optimization procedure. We also point out that (12) resembles Single Class SVM\n[11] using the \u03bd-trick. Besides the approximate equality constraint, the main difference is the linear\ncorrection term by means of \u03ba. Large values of \u03bai correspond to particularly important observations\nxi and are likely to lead to large \u03b2i.\n\n4 Experiments\n\n4.1 Toy regression example\n\nOur first experiment is on toy data, and is intended mainly to provide a comparison with the approach\nof [12]. 
This method uses an information criterion to optimise the weights, under certain restrictions\non Pr and Pr\u2032 (namely, Pr\u2032 must be known, while Pr can be either known exactly, Gaussian with\nunknown parameters, or approximated via kernel density estimation).\n\nOur data is generated according to the polynomial regression example from [12, Section 2], for\nwhich Pr \u223c N(0.5, 0.5\u00b2) and Pr\u2032 \u223c N(0, 0.3\u00b2) are two normal distributions. The observations are\ngenerated according to y = \u2212x + x\u00b3, and are observed in Gaussian noise with standard deviation\n0.3 (see Figure 1(a); the blue curve is the noise-free signal).\nWe sampled 100 training (blue circles) and testing (red circles) points from Pr and Pr\u2032 respectively.\nWe attempted to model the observations with a degree 1 polynomial. The black dashed line is a\nbest-case scenario, which is shown for reference purposes: it represents the model fit using ordinary\nleast squares (OLS) on the labeled test points. The red line is a second reference result, derived\nonly from the training data via OLS, and predicts the test data very poorly. The other three dashed\nlines are fit with weighted ordinary least squares (WOLS), using one of three weighting schemes: the\nratio of the underlying test and training densities, KMM, and the information criterion of [12]. A\nsummary of the performance over 100 trials is shown in Figure 1(b). 
Our method outperforms the\ntwo other reweighting methods.\n\n[Figure 1. Panel (a) legend: x from q0; true fitting model; OLS fitting x_q0; x from q1; OLS fitting x_q1; WOLS by ratio; WOLS by KMM; WOLS by min IC. Panel (b): sum of square loss for ratio, KMM, IC and OLS.]\n\nFigure 1: (a) Polynomial models of degree 1 fit with OLS and WOLS; (b) Average performances of three\nWOLS methods and OLS on the test data in (a). Labels are Ratio for ratio of test to training density; KMM for\nour approach; min IC for the approach of [12]; and OLS for the model trained on the labeled test points.\n\n4.2 Real world datasets\n\nWe next test our approach on real world data sets, from which we select training examples using a\ndeliberately biased procedure (as in [20, 9]). To describe our biased selection scheme, we need to\ndefine an additional random variable si for each point in the pool of possible training samples, where\nsi = 1 means the ith sample is included, and si = 0 indicates an excluded sample. Two situations\nare considered: the selection bias corresponds to our assumption regarding the relation between\nthe training and test distributions, and P(si = 1|xi, yi) = P(si|xi); or si is dependent only on\nyi, i.e. P(si|xi, yi) = P(si|yi), which potentially creates a greater challenge since it violates our\nkey assumption 1. In the following, we compare our method (labeled KMM) against two others: a\nbaseline unweighted method (unweighted), in which no modification is made, and a weighting by\nthe inverse of the true sampling distribution (importance sampling), as in [20, 9]. 
We emphasise,\nhowever, that our method does not require any prior knowledge of the true sampling probabilities.\nIn our experiments, we used a Gaussian kernel $\\exp(-\\sigma \\|x_i - x_j\\|^2)$ in our kernel classification and\nregression algorithms, and parameters \u03b5 = (\u221am \u2212 1)/\u221am and B = 1000 in the optimization (12).\n\n[Figure 2. Panels (a) to (c) plot test error for unweighted, importance sampling and KMM: (a) Simple bias on features (x-axis: biased feature); (b) Joint bias on features (x-axis: training set proportion); (c) Bias on labels (x-axis: training set proportion). Panel (d), \u03b2 vs inverse sampling prob., shows the optimal weights and the inverse of the true sampling probabilities.]\n\nFigure 2: Classification performance analysis on breast cancer dataset from UCI.\n\n4.2.1 Breast Cancer Dataset\n\nThis dataset is from the UCI Archive, and is a binary classification task. It includes 699 examples\nfrom 2 classes: benign (positive label) and malignant (negative label). The data are randomly split\ninto training and test sets, where the proportion of examples used for training varies from 10% to\n50%. Test results are averaged over 30 trials, and were obtained using a support vector classifier with\nkernel size \u03c3 = 0.1. First, we consider a biased sampling scheme based on the input features, of\nwhich there are nine, with integer values from 0 to 9. 
Since smaller feature values predominate in the\nunbiased data, we sample according to P(s = 1|x \u2264 5) = 0.2 and P(s = 1|x > 5) = 0.8, repeating\nthe experiment for each of the features in turn. Results are an average over 30 random training/test\nsplits, with 1/4 of the data used for training and 3/4 for testing. Performance is shown in Figure 2(a):\nwe consistently outperform the unweighted method, and match or exceed the performance obtained\nusing the known distribution ratio. Next, we consider a sampling bias that operates jointly across\nmultiple features. We select samples less often when they are further from the sample mean $\\bar{x}$ over\nthe training data, i.e. $P(s_i|x_i) \\propto \\exp(-\\sigma \\|x_i - \\bar{x}\\|^2)$ where \u03c3 = 1/20. Performance of our method\nin Figure 2(b) is again better than the unweighted case, and as good as or better than reweighting using the\nsampling model. Finally, we consider a simple biased sampling scheme which depends only on the\nlabel y: P(s = 1|y = 1) = 0.1 and P(s = 1|y = \u22121) = 0.9 (the data has on average twice as\nmany positive as negative examples when uniformly sampled). Average performance for different\ntraining/testing split proportions is in Figure 2(c); remarkably, despite our assumption regarding the\ndifference between the training and test distributions being violated, our method still improves the\ntest performance, and outperforms the reweighting by density ratio for large training set sizes.\nFigure 2(d) shows the weights \u03b2 are proportional to the inverse of true sampling probabilities: positive\nexamples have higher weights and negative ones have lower weights.\n\n4.2.2 Further Benchmark Datasets\n\nWe next compare the performance on further benchmark datasets (see footnote 1) by selecting training data via\nvarious biased sampling schemes. 
Specifically, for the sampling distribution bias on labels, we\nuse P(s = 1|y) = exp(a + by)/(1 + exp(a + by)) (datasets 1 to 5), or the simple step distribution\nP(s = 1|y = 1) = a, P(s = 1|y = \u22121) = b (datasets 6 and 7). For the remaining\ndatasets, we generate biased sampling schemes over their features. We first do PCA, selecting the\nfirst principal component of the training data and the corresponding projection values. Denoting\nthe minimum value of the projection as m and the mean as $\\bar{m}$, we apply a normal distribution with\nmean $m + (\\bar{m} - m)/a$ and variance $(\\bar{m} - m)/b$ as the biased sampling scheme. Please refer to\n[7] for detailed parameter settings. We use penalized LMS for regression problems and SVM for\nclassification problems. To evaluate generalization performance, we utilize the normalized mean\nsquare error (NMSE) given by $\\frac{1}{n} \\sum_{i=1}^{n} \\frac{(y_i - \\mu_i)^2}{\\operatorname{var} y}$ for regression problems, and the average test error\nfor classification problems. In 13 out of 23 experiments, our reweighting approach is the most accurate\n(see Table 1), despite having no prior information about the bias of the test sample (and, in some\ncases, despite the additional fact that the data reweighting does not conform to our key assumption\n1). In addition, the KMM always improves test performance compared with the unweighted case.\nTwo additional points should be borne in mind: first, we use the same \u03c3 for the kernel mean matching\nand the SVM, as listed in Table 1. Performance might be improved by decoupling these kernel\nsizes: indeed, we employ kernels that are somewhat large, suggesting that the KMM procedure is\nhelpful in the case of relatively smooth classification/regression functions. Second, we did not find\na performance improvement in the case of data sets with smaller sample sizes. This is not surprising,\nsince a reweighting would further reduce the effective number of points used for training, resulting\nin insufficient data for learning.\n\nTable 1: Test results for three methods on 18 datasets with different sampling schemes. The results are\naverages over 10 trials for regression problems (marked *) and 30 trials for classification problems. We used a\nGaussian kernel of size \u03c3 for both the kernel mean matching and the SVM/LMS regression, and set B = 1000.\nThe last three columns report NMSE / Test err.\n\nDataSet | \u03c3 | ntr | selected | ntst | unweighted | importance samp. | KMM\n1. Abalone* | 1e-1 | 2000 | 853 | 2177 | 1.00 \u00b1 0.08 | 1.1 \u00b1 0.2 | 0.6 \u00b1 0.1\n2. CA Housing* | 1e-1 | 16512 | 3470 | 4128 | 2.29 \u00b1 0.01 | 1.72 \u00b1 0.04 | 1.24 \u00b1 0.09\n3. Delta Ailerons(1)* | 1e3 | 4000 | 1678 | 3129 | 0.51 \u00b1 0.01 | 0.51 \u00b1 0.01 | 0.401 \u00b1 0.007\n4. Ailerons* | 1e-5 | 7154 | 925 | 6596 | 1.50 \u00b1 0.06 | 0.7 \u00b1 0.1 | 1.2 \u00b1 0.2\n5. haberman(1) | 1e-2 | 150 | 52 | 156 | 0.50 \u00b1 0.09 | 0.37 \u00b1 0.03 | 0.30 \u00b1 0.05\n6. USPS(6vs8)(1) | 1/128 | 500 | 260 | 1042 | 0.13 \u00b1 0.18 | 0.1 \u00b1 0.2 | 0.1 \u00b1 0.1\n7. USPS(3vs9)(1) | 1/128 | 500 | 252 | 1145 | 0.016 \u00b1 0.006 | 0.012 \u00b1 0.005 | 0.013 \u00b1 0.005\n8. Bank8FM* | 1e-1 | 4500 | 654 | 3692 | 0.5 \u00b1 0.1 | 0.45 \u00b1 0.06 | 0.47 \u00b1 0.05\n9. Bank32nh* | 1e-2 | 4500 | 740 | 3692 | 23 \u00b1 4.0 | 19 \u00b1 2 | 19 \u00b1 2\n10. cpu-act* | 1e-12 | 4000 | 1462 | 4192 | 10 \u00b1 1 | 4.0 \u00b1 0.2 | 1.9 \u00b1 0.2\n11. cpu-small* | 1e-12 | 4000 | 1488 | 4192 | 9 \u00b1 2 | 4.0 \u00b1 0.2 | 2.0 \u00b1 0.5\n12. Delta Ailerons(2)* | 1e3 | 4000 | 634 | 3129 | 2 \u00b1 2 | 1.5 \u00b1 1.5 | 1.7 \u00b1 0.9\n13. Boston house* | 1e-4 | 300 | 108 | 206 | 0.8 \u00b1 0.2 | 0.74 \u00b1 0.09 | 0.76 \u00b1 0.07\n14. kin8nm* | 1e-1 | 5000 | 428 | 3192 | 0.85 \u00b1 0.2 | 0.81 \u00b1 0.1 | 0.81 \u00b1 0.2\n15. puma8nh* | 1e-1 | 4499 | 823 | 3693 | 1.1 \u00b1 0.1 | 0.77 \u00b1 0.05 | 0.83 \u00b1 0.03\n16. haberman(2) | 1e-2 | 150 | 90 | 156 | 0.27 \u00b1 0.01 | 0.39 \u00b1 0.04 | 0.25 \u00b1 0.2\n17. USPS(6vs8)(2) | 1/128 | 500 | 156 | 1042 | 0.23 \u00b1 0.2 | 0.23 \u00b1 0.2 | 0.16 \u00b1 0.08\n18. USPS(6vs8)(3) | 1/128 | 500 | 104 | 1042 | 0.54 \u00b1 0.0002 | 0.5 \u00b1 0.2 | 0.16 \u00b1 0.04\n19. USPS(3vs9)(2) | 1/128 | 500 | 252 | 1145 | 0.46 \u00b1 0.09 | 0.5 \u00b1 0.2 | 0.2 \u00b1 0.1\n20. Breast Cancer | 1e-1 | 280 | 96 | 419 | 0.05 \u00b1 0.01 | 0.036 \u00b1 0.005 | 0.033 \u00b1 0.004\n21. India diabetes | 1e-4 | 200 | 97 | 568 | 0.32 \u00b1 0.02 | 0.30 \u00b1 0.02 | 0.30 \u00b1 0.02\n22. ionosphere | 1e-1 | 150 | 64 | 201 | 0.32 \u00b1 0.06 | 0.31 \u00b1 0.07 | 0.28 \u00b1 0.06\n23. German credit | 1e-4 | 400 | 214 | 600 | 0.283 \u00b1 0.004 | 0.282 \u00b1 0.004 | 0.280 \u00b1 0.004\n\nFootnote 1: Regression data from http://www.liacc.up.pt/~ltorgo/Regression/DataSets.html;\nclassification data from UCI. Sets with numbers in brackets are examined by different sampling schemes.\n\n4.2.3 Tumor Diagnosis using Microarrays\n\nOur next benchmark is a dataset of 102 microarrays from prostate cancer patients [13]. Each of these\nmicroarrays measures the expression levels of 12,600 genes. The dataset comprises 50 samples\nfrom normal tissues (positive label) and 52 from tumor tissues (negative label). We simulate the\nrealistic scenario that two sets of microarrays A and B are given with dissimilar proportions of tumor\nsamples, and we want to perform cancer diagnosis via classification, training on A and predicting\non B. We select training examples via the biased selection scheme P(s = 1|y = 1) = 0.85 and\nP(s = 1|y = \u22121) = 0.15. The remaining data points form the test set. We then perform SVM\nclassification for the unweighted, KMM, and importance sampling approaches. 
The experiment\nwas repeated over 500 independent draws from the dataset according to our biased scheme; the 500\nresulting test errors are plotted in [7]. The KMM achieves much higher accuracy levels than the\nunweighted approach, and is very close to the importance sampling approach.\n\nWe study a very similar scenario on two breast cancer microarray datasets from [4] and [19], measuring\nthe expression levels of 2,166 common genes for normal and cancer patients [18]. We train\nan SVM on one of them and test on the other. Our reweighting method achieves significant improvement\nin classification accuracy over the unweighted SVM (see [7]). Hence our method promises to\nbe a valuable tool for cross-platform microarray classification.\nAcknowledgements: The authors thank Patrick Warnat (DKFZ, Heidelberg) for providing the microarray\ndatasets, and Olivier Chapelle and Matthias Hein for helpful discussions. The work is\npartially supported by the BMBF under grant 031U112F within the BFAM project, which is part\nof the German Genome Analysis Network. NICTA is funded through the Australian Government\u2019s\nBacking Australia\u2019s Ability initiative, in part through the ARC. This work was supported in part by\nthe IST Programme of the EC, under the PASCAL Network of Excellence, IST-2002-506778.\n\nReferences\n\n[1] G. Casella and R. Berger. Statistical Inference. Duxbury, Pacific Grove, CA, 2nd edition, 2002.\n[2] M. Dudik, R. E. Schapire, and S. J. Phillips. Correcting sample selection bias in maximum entropy density estimation. In Advances in Neural Information Processing Systems 17, 2005.\n[3] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel method for the two-sample-problem. In NIPS. MIT Press, 2006.\n[4] S. Gruvberger, M. Ringner, Y. Chen, S. Panavally, L. H. Saal, C. Peterson, A. Borg, M. Ferno, and P. S. Meltzer. Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Research, 61, 2001.\n[5] J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153\u2013161, 1979.\n[6] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13\u201330, 1963.\n[7] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Sch\u00f6lkopf. Correcting sample selection bias by unlabeled data. Technical report, CS-2006-44, University of Waterloo, 2006.\n[8] Y. Lin, Y. Lee, and G. Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46:191\u2013202, 2002.\n[9] S. Rosset, J. Zhu, H. Zou, and T. Hastie. A method for inferring label sampling mechanisms in semi-supervised learning. In Advances in Neural Information Processing Systems 17, 2004.\n[10] M. Schmidt and H. Gish. Speaker identification via support vector classifiers. In Proc. ICASSP \u201996, pages 105\u2013108, Atlanta, GA, May 1996.\n[11] B. Sch\u00f6lkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443\u20131471, 2001.\n[12] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 2000.\n[13] D. Singh, P. Febbo, K. Ross, D. Jackson, J. Manola, C. Ladd, P. Tamayo, A. Renshaw, A. D'Amico, and J. Richie. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 2002.\n[14] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67\u201393, 2002.\n[15] I. Steinwart. Support vector machines are universally consistent. J. Compl., 18:768\u2013791, 2002.\n[16] M. Sugiyama and K.-R. M\u00fcller. Input-dependent estimation of generalization error under covariate shift. Statistics and Decisions, 23:249\u2013279, 2005.\n[17] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 2005.\n[18] P. Warnat, R. Eils, and B. Brors. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6:265, Nov 2005.\n[19] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson Jr, J. R. Marks, and J. R. Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. PNAS, 98(20), 2001.\n[20] B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In International Conference on Machine Learning ICML\u201904, 2004.\n", "award": [], "sourceid": 3075, "authors": [{"given_name": "Jiayuan", "family_name": "Huang", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Karsten", "family_name": "Borgwardt", "institution": ""}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}