{"title": "A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1161, "page_last": 1168, "abstract": null, "full_text": "          A Method for Inferring Label Sampling\n        Mechanisms in Semi-Supervised Learning\n\n\n\n                        Saharon Rosset                                   Ji Zhu\n             Data Analytics Research Group                     Department of Statistics\n            IBM T.J. Watson Research Center                    University of Michigan\n              Yorktown Heights, NY 10598                        Ann Arbor, MI 48109\n                srosset@us.ibm.com                              jizhu@umich.edu\n\n\n                          Hui Zou                                  Trevor Hastie\n                Department of Statistics                      Department of Statistics\n                     Stanford University                        Stanford University\n                     Stanford, CA 94305                         Stanford, CA 94305\n           hzou@stat.stanford.com                          hastie@stanford.edu\n\n\n\n\n                                              Abstract\n\n           We consider the situation in semi-supervised learning, where the \"label\n           sampling\" mechanism stochastically depends on the true response (as\n           well as potentially on the features). We suggest a method of moments\n           for estimating this stochastic dependence using the unlabeled data. This\n           is potentially useful for two distinct purposes: a. As an input to a super-\n           vised learning procedure which can be used to \"de-bias\" its results using\n           labeled data only and b. As a potentially interesting learning task in it-\n           self. 
We present several examples to illustrate the practical usefulness of our method.\n\n1 Introduction\n\nIn semi-supervised learning, we assume we have a sample {(x_i, y_i, s_i)}_{i=1}^n of i.i.d. draws from a joint distribution on (X, Y, S), where:^1\n\n- x_i ∈ R^p are p-vectors of features.\n- y_i is a label, or response (y_i ∈ R for regression, y_i ∈ {0, 1} for 2-class classification).\n- s_i ∈ {0, 1} is a "labeling indicator"; that is, y_i is observed if and only if s_i = 1, while x_i is observed for all i.\n\n(^1 Our notation here differs somewhat from many semi-supervised learning papers, where the unlabeled part of the sample is separated from the labeled part and sometimes called the "test set".)\n\nIn this paper we consider the interesting case of semi-supervised learning, where the probability of observing the response depends on the data through the true response, as well as potentially through the features. Our goal is to model this unknown dependence:\n\n    l(x, y) = Pr(S = 1 | x, y)    (1)\n\nNote that the dependence on y (which is unobserved when S = 0) prevents us from using standard supervised modeling approaches to learn l. We show here that we can use the whole data-set (labeled + unlabeled data) to obtain estimates of this probability distribution within a parametric family of distributions, without needing to "impute" the unobserved responses.^2\n\nWe believe this setup is of significant practical interest. Here are a couple of examples of realistic situations:\n1. The problem of learning from positive examples and unlabeled data is of significant interest in document topic learning [4, 6, 8]. 
Consider a generalization of that problem, where we observe a sample of positive and negative examples and unlabeled data, but we believe that the positive and negative labels are supplied with different probabilities (in the document learning example, positive examples are typically more likely to be labeled than negative ones, which are much more abundant). These probabilities may also not be uniform within each class, and may depend on the features as well. Our methods allow us to infer these labeling probabilities by utilizing the unlabeled data.\n2. Consider a satisfaction survey, where clients of a company are requested to report their level of satisfaction, but they can choose whether or not they do so. It is reasonable to assume that their willingness to report their satisfaction depends on their actual satisfaction level. Using our methods, we can infer the dependence of the reporting probability on the actual satisfaction by utilizing the unlabeled data, i.e., the customers who declined to respond.\n\nBeing able to infer the labeling mechanism is important for two distinct reasons. First, it may be useful for "de-biasing" the results of supervised learning, which uses only the labeled examples. The generic approach for achieving this is to use "inverse sampling" weights (i.e., weigh labeled examples by 1/l(x, y)). The use of such weights in maximum likelihood estimation is well established in the literature as a method for correcting sampling bias (of which semi-supervised learning is an example) [10]. We can also use the learned mechanism to post-adjust the probabilities from a probability estimation method such as logistic regression to attain "unbiasedness" and consistency [11]. Second, understanding the labeling mechanism may be an interesting and useful learning task in itself. Consider, for example, the "satisfaction survey" scenario described above. 
Understanding the way in which satisfaction affects the customers' willingness to respond to the survey can be used to get a better picture of overall satisfaction and to design better future surveys, regardless of any supervised learning task which models the actual satisfaction.\n\nOur approach is described in section 2, and is based on a method of moments. Observe that for every function of the features g(x), we can get an unbiased estimate of its mean as (1/n) Σ_{i=1}^n g(x_i). We show that if we know the underlying label sampling mechanism l(x, y) we can get a different unbiased estimate of Eg(x), which uses only the labeled examples, weighted by 1/l(x, y). We suggest inferring the unknown function l(x, y) by requiring that we get identical estimates of Eg(x) using both approaches. We illustrate our method's implementation on the California Housing data-set in section 3. In section 4 we review related work in the machine learning and statistics literature, and we conclude with a discussion in section 5.\n\n(^2 The importance of this is that we are required to hypothesize and fit a conditional probability model for l(x, y) only, as opposed to the full probability model for (S, X, Y) required for, say, EM.)\n\n2 The method\n\nLet g(x) be any function of our features. We construct two different unbiased estimates of Eg(x), one based on all n data points and one based on labeled examples only, assuming P(S = 1|x, y) is known. Then, our method uses the equality in expectation of the two estimates to infer P(S = 1|x, y). Specifically, consider g(x) and also:\n\n    f(x, y, s) = g(x) / P(S = 1|x, y)   if s = 1 (y observed)\n    f(x, y, s) = 0                      otherwise    (2)\n\nThen:\n\nTheorem 1 Assume P(S = 1|x, y) > 0 ∀ x, y. 
Then:\n\n    E(g(X)) = E(f(X, Y, S))\n\nProof:\n\n    E(f(X, Y, S)) = ∫_{X,Y,S} f(x, y, s) dP(x, y, s)\n                  = ∫_X ∫_Y g(x) [P(S = 1|x, y) / P(S = 1|x, y)] dP(y|x) dP(x)\n                  = ∫_X g(x) dP(x) = Eg(X)\n\nThe empirical interpretation of this expectation result is:\n\n    (1/n) Σ_{i=1}^n f(x_i, y_i, s_i) = (1/n) Σ_{i: s_i=1} g(x_i) / P(S = 1|x_i, y_i)  ≈  Eg(x)  ≈  (1/n) Σ_{i=1}^n g(x_i)    (3)\n\nwhich can be interpreted as relating an estimate of Eg(x) based on the complete data, on the right, to the one based on labeled data only, on the left, which requires weighting that is inversely proportional to the probability of labeling, to compensate for ignoring the unlabeled data.\n\n(3) is the fundamental result we use for our purpose, leading to a "method of moments" approach to estimating l(x, y) = P(S = 1|x, y), as follows:\n\n- Assume that l(x, y) = p_θ(x, y), θ ∈ R^k, belongs to a parametric family with k parameters.\n- Select k different functions g_1(x), ..., g_k(x), and define f_1, ..., f_k correspondingly according to (2).\n- Demand equality of the leftmost and rightmost sums 
in (3) for each of g_1, ..., g_k, and solve the resulting k equations to get an estimate of θ.\n\nMany practical and theoretical considerations arise when we consider what "good" choices of the representative functions g_1(x), ..., g_k(x) may be. Qualitatively, we would like to attain the standard desirable properties of inverse problems: uniqueness, stability and robustness. We want the resulting equations to have a unique "correct" solution. We want our functions to have low variance, so the inaccuracy in (3) is minimal, and we want them to be "different" from each other, to get a stable solution in the k-dimensional space. It is of course much more difficult to give concrete quantitative criteria for selecting the functions in practical situations. What we can do in practice is evaluate how stable the results we get are. We return to this topic in more detail in section 5.\n\nA second set of considerations in selecting these functions is the computational one: can we even solve the resulting inverse problems with a reasonable computational effort? In general, solving systems of more than one nonlinear equation is a very hard problem. We also need to consider the possibility of non-unique solutions. These questions are sometimes inter-related with the choice of the g_k(x).\n\nSuppose we wish to solve a set of non-linear equations for θ:\n\n    h_k(θ) = Σ_{i: s_i=1} g_k(x_i) / p_θ(x_i, y_i) - Σ_{i=1}^n g_k(x_i) = 0,    k = 1, ..., K    (4)\n\nThe solution of (4) can be sought through the minimization problem\n\n    arg min_θ h(θ) = arg min_θ Σ_k h_k(θ)^2    (5)\n\nNotice that every solution to (4) minimizes (5), but there may be local minima of (5) that are not solutions to (4). Hence simply applying a Newton-Raphson method to (5) is not a good idea: if we have a sufficiently good initial guess about the solution, the Newton-Raphson method converges quadratically fast; however, it can also fail to converge if no root exists nearby. In practice, we can combine the Newton-Raphson method with a line search strategy that makes sure h(θ) is reduced at each iteration (the Newton step is always a descent direction of h(θ)). While this method can still occasionally fail by landing on a local minimum of h(θ), this is quite rare in practice [1]. The remedy is usually to try a new starting point. Other global algorithms, based on the so-called model-trust region approach, are also used in practice. These methods have a reputation for robustness even when starting far from the desired zero or minimum [2].\n\nIn some cases we can employ simpler methods, since the equations we get can be manipulated algebraically to give more "friendly" formulations. We show two examples in the next sub-section.\n\n2.1 Examples of simplified calculations\n\nWe consider two situations where we can use algebra to simplify the solution of the equations our method gives. The first is the obvious application to two-class classification, where the label sampling mechanism depends on the class label only. Our method then reduces to the one suggested by [11]. 
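Before specializing, note that the general system (4) can also be attacked directly with an off-the-shelf solver, which is a useful sanity check on the algebra that follows. Below is a minimal sketch on synthetic two-class data; all variable names are our own, and SciPy's `least_squares` (minimizing the sum of squares (5) under box constraints) stands in for the Newton/line-search strategies discussed above:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
n = 20000
y = rng.binomial(1, 0.5, n)               # true (partially hidden) labels
x = y + rng.normal(0.0, 1.0, n)           # a feature correlated with y
p1_true, p0_true = 0.8, 0.2               # label-dependent sampling probabilities
s = rng.binomial(1, np.where(y == 1, p1_true, p0_true))

lab = s == 1                              # y is only used where it is observed

def moments(theta):
    """Moment equations (3) with g(x) = 1 and g(x) = x."""
    p1, p0 = theta
    w = np.where(y[lab] == 1, p1, p0)     # l(x, y) on the labeled examples
    return [np.sum(1.0 / w) - n,
            np.sum(x[lab] / w) - np.sum(x)]

res = least_squares(moments, x0=[0.5, 0.5],
                    bounds=([1e-3, 1e-3], [1.0, 1.0]))
p1_hat, p0_hat = res.x
```

With this sample size the recovered probabilities should land close to the true (0.8, 0.2); the box constraints simply keep the solver inside the valid probability range.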
The second is a more involved regression scenario, with a logistic dependence between the sampling probability and the actual label.\n\nFirst, consider a two-class classification scenario, where the sampling mechanism is independent of x:\n\n    P(S = 1|x, y) = p_1   if y = 1\n    P(S = 1|x, y) = p_0   if y = 0\n\nThen we need two functions of x to "de-bias" our classes. One natural choice is g(x) = 1, which implies we are simply trying to invert the sampling probabilities. Assume we use one of the features, g(x) = x_j, as our second function. Plugging these into (3) we get that to find p_0, p_1 we should solve:\n\n    n = #{y_i = 1 observed} / p̂_1 + #{y_i = 0 observed} / p̂_0\n    Σ_i x_{ij} = Σ_{s_i=1, y_i=1} x_{ij} / p̂_1 + Σ_{s_i=1, y_i=0} x_{ij} / p̂_0\n\nwhich we can solve analytically to get:\n
    p̂_1 = (r_1 n_0 - r_0 n_1) / (r n_0 - r_0 n)\n    p̂_0 = (r_1 n_0 - r_0 n_1) / (r_1 n - r n_1)\n\nwhere n_k = #{y_i = k observed} and r_k = Σ_{s_i=1, y_i=k} x_{ij} for k = 0, 1, while n is the total sample size and r = Σ_{i=1}^n x_{ij}.\n\nAs a second, more involved, example, consider a regression situation (like the satisfaction survey mentioned in the introduction), where we assume the probability of observing the response has a linear-logistic dependence on the actual response (again we assume for simplicity independence of x, although dependence on x poses no theoretical complications):\n\n    P(S = 1|x, y) = exp(a + by) / (1 + exp(a + by)) = logit^{-1}(a + by)    (6)\n\nwith a, b unknown parameters. 
Using the same two g functions as above gives us the slightly less friendly set of equations:\n\n    n = Σ_{s_i=1} 1 / logit^{-1}(â + b̂ y_i)\n    Σ_i x_{ij} = Σ_{s_i=1} x_{ij} / logit^{-1}(â + b̂ y_i)\n\nwhich with some algebra we can re-write as:\n\n    0 = Σ_{s_i=1} exp(-b̂ y_i)(x̄_{0j} - x_{ij})    (7)\n    exp(â) m_0 = Σ_{s_i=1} exp(-b̂ y_i)    (8)\n\nwhere x̄_{0j} is the empirical mean of the j'th feature over unlabeled examples and m_0 is the number of unlabeled examples. We do not have an analytic solution for these equations. However, the decomposition they offer allows us to solve them by searching first over b to solve (7), then plugging the result into (8) to get an estimate of a. 
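This two-stage strategy can be sketched in a few lines on simulated data. The setup below is our own toy example (not the California Housing data), and a coarse grid over b stands in for a more careful root search:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30000
y = rng.normal(0.0, 1.0, n)                  # response
x = y + rng.normal(0.0, 1.0, n)              # one feature x_j
a_true, b_true = -0.5, 1.5
s = rng.binomial(1, 1.0 / (1.0 + np.exp(-(a_true + b_true * y))))

lab, unl = s == 1, s == 0
x0_bar, m0 = x[unl].mean(), unl.sum()        # unlabeled mean of x_j, count

def h(b):
    """Left-hand side of (7); it crosses zero at the moment estimate of b."""
    return np.sum(np.exp(-b * y[lab]) * (x0_bar - x[lab]))

# search over b to (approximately) solve (7), then plug into (8)
grid = np.linspace(0.0, 3.0, 301)
b_hat = grid[np.argmin(np.abs([h(b) for b in grid]))]
a_hat = np.log(np.sum(np.exp(-b_hat * y[lab])) / m0)
```

The estimates should fall near (a, b) = (-0.5, 1.5); the grid mimics the range [0, 3] searched in the experiment below, and a bracketing root finder could replace it once a sign change of h is located.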
In the next section we use this solution strategy on a real-data example.\n\n3 Illustration on the California Housing data-set\n\nTo illustrate our method, we take a fully labeled regression data-set and hide some of the labels based on a logistic transformation of the response, then examine the performance of our method in recovering the sampling mechanism and improving the resulting prediction through de-biasing. We use the California Housing data-set [9], collected based on US Census data. It contains 20640 observations on log(median house price) throughout California regions. The eight features are: median income, housing median age, total rooms, total bedrooms, population, households, latitude and longitude.\n\nWe use 3/4 of the data for modeling and leave 1/4 aside for evaluation. Of the training data, we hide some of the labels stochastically, based on the "label sampling" model:\n\n    P(S = 1|y) = logit^{-1}(1.5(y - ȳ) - 0.5)    (9)\n\nThis scheme results in having 6027 labeled training examples, 9372 training examples with the labels removed and 5241 test examples.\n\nWe use equations (7,8) to estimate â, b̂ based on each one of the 8 features. Figure 1 and Table 1 show the results of our analysis. In Figure 1 we display the value of Σ_{s_i=1} exp(-b y_i)(x̄_{0j} - x_{ij}) for a range of possible values of b. We observe that all features give us a 0 crossing around the correct value of 1.5. 
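The de-biasing step evaluated in Table 1 amounts to inverse-probability-weighted least squares on the labeled examples. A minimal sketch on synthetic stand-in data follows; all names are hypothetical, and for simplicity the "estimated" â, b̂ are taken equal to the truth, where in practice they would come from solving (7)-(8):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + rng.normal(0.0, 0.5, n)

# logistic label sampling of the form (6); a_hat, b_hat stand in for
# the estimates obtained from (7)-(8)
a_hat, b_hat = -0.5, 1.5
p = 1.0 / (1.0 + np.exp(-(a_hat + b_hat * y)))
s = rng.binomial(1, p)
lab = s == 1

# weighted least squares on labeled examples, with weights 1/l(x, y)
w = 1.0 / p[lab]
Xl, yl = X[lab], y[lab]
Xw = Xl * w[:, None]
beta_weighted = np.linalg.solve(Xl.T @ Xw, Xw.T @ yl)

# unweighted fit for comparison: generally biased, since sampling depends on y
beta_plain = np.linalg.solve(Xl.T @ Xl, Xl.T @ yl)
```

The weighted fit should recover beta_true up to sampling noise, while the unweighted fit is generally distorted by the label-dependent sampling; this is the mechanism behind the MSE gap in Table 1.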
In Table 1 we give details of the 8 models estimated by a search strategy as follows:\n\n[Figure 1: Value of the RHS of (7) (vertical axis) vs. the value of b (horizontal axis) for the 8 different features. The correct value is b = 1.5, and so we expect to observe "zero crossings" around that value, which we indeed observe on all 8 graphs.]\n\n- Find b̂ by minimizing |Σ_{s_i=1} exp(-b y_i)(x̄_{0j} - x_{ij})| over the range b ∈ [0, 3].\n- Find â by plugging b̂ from above into (8).\n\nThe table also shows the results of using these estimates to "de-bias" the prediction model, i.e., once we have â, b̂ we use them to calculate P̂(S = 1|y) and use the inverse of these estimated probabilities as weights in a least squares analysis of the labeled data. The table compares the predictive performance of the resulting models on the 1/4 evaluation set (5241 observations) to that of the model built using labeled data only with no weighting, and to that of the model built using the labeled data and the "correct" weighting based on our knowledge of the true a, b. Most of the 8 features give reasonable estimates, and the prediction models built using the resulting weighting schemes perform comparably to the one built using the "correct" weights. They generally attain MSE about 20% smaller than that of the non-weighted model built without regard to the label sampling mechanism.\n\nThe stability of the resulting estimates is related to the "reasonableness" of the selected g(x) functions. 
To illustrate that, we also tried the function g(x) = x_3 · x_4 · x_5 / (x_1 · x_2) (still in combination with the identity function, so we can use (7,8)). The resulting estimates were b̂ = 3.03, â = 0.074. Clearly these numbers are way outside the reasonable range of the results in Table 1. This is to be expected, as this choice of g(x) gives a function with very long tails. Thus, a few "outliers" dominate the two estimates of E(g(x)) in (3) which we use to estimate a, b.\n\nTable 1: Summary of estimates of sampling mechanism using each of 8 features\n\n    Feature                 b̂       â        Prediction MSE\n    median income           1.52    -0.519   0.1148\n    housing median age      1.18    -0.559   0.1164\n    total rooms             1.58    -0.508   0.1147\n    total bedrooms          1.64    -0.497   0.1146\n    population              1.70    -0.484   0.1146\n    households              1.63    -0.499   0.1146\n    latitude                1.55    -0.514   0.1147\n    longitude               1.33    -0.545   0.1155\n    (no weighting)          N/A     N/A      0.1354\n    (true sampling model)   1.5     -0.5     0.1148\n\n4 Related work\n\nThe surge of interest in semi-supervised learning in recent years has been mainly in the context of text classification ([4, 6, 8] are several examples of many). There is also a wealth of statistical literature on missing data and biased sampling (e.g., [3, 7, 10]) where methods have been developed that can be directly or indirectly applied to semi-supervised learning. 
Here we briefly survey some of the interesting and popular approaches and relate them to our method.\n\nThe EM approach to text classification is advocated by [8]. Some ad-hoc two-step variants are surveyed by [6]. They consist of iterating between completing the class labels and estimating the classification model. The main caveat of all these methods is that they ignore the sampling mechanism, and thus implicitly assume it cancels out in the likelihood function -- i.e., that the sampling is random and that l(x, y) is fixed. It is possible, in principle, to remove this assumption, but that would significantly increase the complexity of the algorithms, as it would require specifying a likelihood model for the sampling mechanism and adding its parameters to the estimation procedure. The methods described by [7] and discussed below take this approach.\n\nThe use of re-weighted loss to account for unknown sampling mechanisms is suggested by [4, 11]. Although they differ significantly in the details, both of these can account for label-dependent sampling in two-class classification. They do not offer solutions for other modeling tasks or for feature-dependent sampling, which our approach covers.\n\nIn the missing data literature, [7] (chapter 15) and references therein offer several methods for handling "nonignorable nonresponse". These are all based on assuming complete probability models for (X, Y, S) and designing EM algorithms for the resulting problem. An interesting example is the bivariate normal stochastic ensemble model, originally suggested by [3]. In our notation, they assume that there is an additional, fully unobserved "response" z_i, and that y_i is observed if and only if z_i > 0. 
They also assume that y_i, z_i are bivariate normal, depending on the features x_i, that is:\n\n    (y_i, z_i)^T ~ N( (x_i^T β_1, x_i^T β_2)^T, Σ ),   Σ = [ σ^2, ρσ^2 ; ρσ^2, 1 ]\n\nThis leads to a complex, but manageable, EM algorithm for inferring the sampling mechanism and fitting the actual model at once. The main shortcoming of this approach, as we see it, is the need to specify a complete and realistic joint probability model engulfing both the sampling mechanism and the response function. This precludes completely the use of non-probabilistic methods for the response model part (like trees or kernel methods), and seems to involve significant computational complications if we stray from normal distributions.\n\n5 Discussion\n\nThe method we suggest in this paper allows for the separate and unbiased estimation of label-sampling mechanisms, even when they stochastically depend on the partially unobserved labels. We view this "de-coupling" of the sampling mechanism estimation from the actual modeling task at hand as an important and potentially very useful tool, both because it creates a new, interesting learning task and because the results of the sampling model can be used to "de-bias" any black-box modeling tool for the supervised learning task through inverse weighting (or sampling, if the chosen tool does not take weights).\n\nOur method of moments suffers from the same problems all such methods (and inverse problems in general) share, namely the uncertainty about the stability and validity of the results. It is difficult to develop general theory for stable solutions to inverse problems. What we can do in practice is attempt to validate the estimates we get. 
We have already seen one approach for doing this in section 3, where we used multiple choices of g(x) and compared the resulting estimates of the parameters determining l(x, y). Even if we had not known the true values of a and b, the fact that we got similar estimates using different features would reassure us that these estimates were reliable, and give us an idea of their uncertainty. A second approach for evaluating the resulting estimates is bootstrap sampling, which can be used to calculate confidence intervals for the parameter estimates.\n\nThe computational issues also need to be tackled if our method is to be applicable to large scale problems with complex sampling mechanisms. We have suggested a general methodology in section 2, and some ad-hoc solutions for special cases in section 2.1. However, we feel that a lot more can be done to develop efficient and widely applicable methods for solving the moment equations.\n\nAcknowledgments\n\nWe thank John Langford and Tong Zhang for useful discussions.\n\nReferences\n\n[1] Acton, F. (1990). Numerical Methods That Work. Washington: Math. Assoc. of America.\n[2] Dennis, J. & Schnabel, R. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. New Jersey: Prentice-Hall.\n[3] Heckman, J.I. (1976). The common structure of statistical models for truncation, sample selection and limited dependent variables, and a simple estimator for such models. Annals of Economic and Social Measurement, 5:475-492.\n[4] Lee, W.S. & Liu, B. (2003). Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression. Proceedings of ICML-03.\n[5] Lin, Y., Lee, Y. & Wahba, G. (2000). Support vector machines for classification in nonstandard situations. Machine Learning, 46:191-202.\n[6] Liu, B., Dai, Y., Li, X., Lee, W.S. & Yu, P. (2003). 
Building Text Classifiers Using Positive and Unlabeled Examples. Proceedings of ICDM-03.\n[7] Little, R. & Rubin, D. (2002). Statistical Analysis with Missing Data, 2nd Ed. Wiley & Sons.\n[8] Nigam, K., McCallum, A., Thrun, S. & Mitchell, T. (2000). Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3):103-134.\n[9] Pace, R.K. & Barry, R. (1997). Sparse Spatial Autoregressions. Statistics & Probability Letters, 33:291-297.\n[10] Vardi, Y. (1985). Empirical Distributions in Selection Bias Models. Annals of Statistics, 13.\n[11] Zou, H., Zhu, J. & Hastie, T. (2004). Automatic Bayes Carpentry in Semi-Supervised Classification. Unpublished.\n", "award": [], "sourceid": 2583, "authors": [{"given_name": "Saharon", "family_name": "Rosset", "institution": null}, {"given_name": "Ji", "family_name": "Zhu", "institution": null}, {"given_name": "Hui", "family_name": "Zou", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}