{"title": "Minimizing Statistical Bias with Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 417, "page_last": 423, "abstract": null, "full_text": "Minimizing Statistical Bias with Queries \n\nDavid A. Cohn \n\nAdaptive Systems Group \n\nHarlequin, Inc. \n\nOne Cambridge Center \nCambridge, MA 02142 \ncOhnCharlequin.com \n\nAbstract \n\nI describe a querying criterion that attempts to minimize the error \nof a learner by minimizing its estimated squared bias. I describe \nexperiments with locally-weighted regression on two simple prob(cid:173)\nlems, and observe that this \"bias-only\" approach outperforms the \nmore common \"variance-only\" exploration approach, even in the \npresence of noise. \n\n1 \n\nINTRODUCTION \n\nIn recent years, there has been an explosion of interest in \"active\" machine learning \nsystems. These are learning systems that make queries, or perform experiments \nto gather data that are expected to maximize performance. When compared with \n\"passive\" learning systems, which accept given, or randomly drawn data, active \nlearners have demonstrated significant decreases in the amount of data required to \nachieve equivalent performance. In industrial applications, where each experiment \nmay take days to perform and cost thousands of dollars, a method for optimally \nselecting these points would offer enormous savings in time and money. \nAn active learning system will typically attempt to select data that will minimize \nits predictive error. This error can be decomposed into bias and variance terms. \nMost research in selecting optimal actions or queries has assumed that the learner \nis approximately unbiased, and that to minimize learner error, variance is the only \nthing to minimize (e.g. Fedorov [1972]' MacKay [1992]' Cohn [1996], Cohn et al., \n[1996], Paass [1995]). In practice, however, there are very few problems for which \nwe have unbiased learners. 
Frequently, bias constitutes a large portion of a learner's error; if the learner is deterministic and the data are noise-free, then bias is the only source of error. Note that the bias term here is a statistical bias, distinct from the inductive bias discussed in some machine learning research [Dietterich and Kong, 1995]. \n\nIn this paper I describe an algorithm which selects actions/queries designed to minimize the bias of a locally weighted regression-based learner. Empirically, \"variance-minimizing\" strategies which ignore bias seem to perform well, even in cases where, strictly speaking, there is no variance to minimize. In the tasks considered in this paper, the bias-minimizing strategy consistently outperforms variance minimization, even in the presence of noise. \n\n1.1 BIAS AND VARIANCE \n\nLet us begin by defining P(x, y) to be the unknown joint distribution over x and y, and P(x) to be the known marginal distribution of x (commonly called the input distribution). We denote the learner's output on input x, given training set D, as ŷ(x; D). We can then write the expected error of the learner as \n\n∫ E[(ŷ(x; D) - y(x))² | x] P(x) dx,   (1) \n\nwhere E[·] denotes the expectation over P and over training sets D. The expectation inside the integral may be decomposed as follows (Geman et al., 1992): \n\nE[(ŷ(x; D) - y(x))² | x] = E[(y(x) - E[y|x])² | x] + (E_D[ŷ(x; D)] - E[y|x])² + E_D[(ŷ(x; D) - E_D[ŷ(x; D)])²],   (2) \n\nwhere E_D[·] denotes the expectation over training sets. The first term in Equation 2 is the variance of y given x: it is the noise in the distribution, and does not depend on our learner or how the training data are chosen. The second term is the learner's squared bias, and the third is its variance; these last two terms comprise the expected squared error of the learner with respect to the regression function E[y|x]. 
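The decomposition in Equation 2 can be checked numerically. The following sketch (illustrative only: the sinusoidal target, the 10% noise level, and the deliberately biased constant-output learner are assumptions, not the learner used in this paper) estimates each term by Monte Carlo over many training sets and compares their sum against the directly estimated error:

```python
import math
import random

NOISE_SD = 0.1          # assumed noise level (illustrative)
M = 20                  # training set size

def true_f(x):
    return math.sin(x)  # illustrative target function

def sample_training_set():
    xs = [random.uniform(0.0, math.pi) for _ in range(M)]
    ys = [true_f(x) + random.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ys

def fit_and_predict(xs, ys, x):
    # A deliberately biased learner: it always predicts the mean of the
    # training targets, so its constant hypothesis class cannot match sin.
    return sum(ys) / len(ys)

random.seed(0)
x0 = 0.3                # point at which the error is decomposed
preds, sq_errors = [], []
for _ in range(4000):   # expectation over training sets D
    xs, ys = sample_training_set()
    p = fit_and_predict(xs, ys, x0)
    preds.append(p)
    y_obs = true_f(x0) + random.gauss(0.0, NOISE_SD)
    sq_errors.append((p - y_obs) ** 2)

mean_pred = sum(preds) / len(preds)
noise = NOISE_SD ** 2
bias_sq = (mean_pred - true_f(x0)) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
total = sum(sq_errors) / len(sq_errors)
print(total, noise + bias_sq + variance)  # the two estimates agree closely
```

With a biased learner such as this, the squared-bias term dominates the variance term, which is precisely the regime the all-bias criterion targets.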
\nMost research in active learning assumes that the second term of Equation 2 is approximately zero, that is, that the learner is unbiased. If this is the case, then one may concentrate on selecting data so as to minimize the variance of the learner. Although this \"all-variance\" approach is optimal when the learner is unbiased, truly unbiased learners are rare. Even when the learner's representation class is able to match the target function exactly, bias is generally introduced by the learning algorithm and learning parameters. From the Bayesian perspective, a learner is only unbiased if its priors are exactly correct. \n\nThe optimal choice of query would, of course, minimize both bias and variance, but I leave that for future work. For the purposes of this paper, I will only be concerned with selecting queries that are expected to minimize learner bias. This approach is justified in cases where noise is believed to be only a small component of the learner's error. If the learner is deterministic and there is no noise, then, strictly speaking, there is no error due to variance: all the error must be due to learner bias. In cases with non-determinism or noise, all-bias minimization, like all-variance minimization, becomes an approximation of the optimal approach. \n\nThe learning model discussed in this paper is a form of locally weighted regression (LWR) [Cleveland et al., 1988], which has been used in difficult machine learning tasks, notably the \"robot juggler\" of Schaal and Atkeson [1994]. Previous work [Cohn et al., 1996] discussed all-variance query selection for LWR; in the remainder of this paper, I describe a method for performing all-bias query selection. Section 2 describes the criterion that must be optimized for all-bias query selection. 
Section 3 describes the locally weighted regression learner used in this paper and describes how the all-bias criterion may be computed for it. Section 4 describes the results of experiments using this criterion on several simple domains. Directions for future work are discussed in Section 5. \n\n2 ALL-BIAS QUERY SELECTION \n\nLet us assume for the moment that we have a source of noise-free examples (x_i, y_i) and a deterministic learner which, given input x, outputs estimate ŷ(x).¹ Let us also assume that we have an accurate estimate of the bias of ŷ, which can be used to estimate the true function y(x) ≈ ŷ(x) - bias(x). We will break these rather strong assumptions of noise-free examples and accurate bias estimates in Section 4, but they are useful for deriving the theoretical approach described below. \n\nGiven an accurate bias estimate, we must force the biased estimator into the best approximation of y(x) with the fewest number of examples. This, in effect, transforms the query selection problem into an example filter problem similar to that studied by Plutowski and White [1993] for neural networks. Below, I derive a criterion for estimating the change in error at x given a new queried example at x̃. Since we have (temporarily) assumed a deterministic learner and noise-free data, the expected error in Equation 2 simplifies to: \n\nE[(ŷ(x; D) - y(x))² | x, D] = (ŷ(x; D) - y(x))².   (3) \n\nWe want to select a new x̃ such that when we add (x̃, ỹ), the resulting squared bias is minimized: \n\n(ŷ' - y)² ≡ (ŷ(x; D ∪ (x̃, ỹ)) - y(x))².   (4) \n\nI will, for the remainder of the paper, use the \"'\" to indicate estimates based on the initial training set plus the additional example (x̃, ỹ). To minimize Expression 4, we need to compute how a query at x̃ will change the learner's bias at x. 
If we assume that we know the input distribution,² then we can integrate this change over the entire domain (using Monte Carlo procedures) to estimate the resulting average change, and select an x̃ such that the expected squared bias is minimized. Defining bias ≡ ŷ - y and Δŷ ≡ ŷ' - ŷ, we can write the new squared bias as: \n\nbias'² = (ŷ' - y)² = (ŷ + Δŷ - y)² = Δŷ² + 2Δŷ·bias + bias².   (5) \n\nNote that since bias as defined here is independent of x̃, minimizing the new squared bias is equivalent to minimizing Δŷ² + 2Δŷ·bias. \n\nThe estimate of bias' tells us how much our bias will change for a given x̃. We may optimize this value over x̃ in one of a number of ways. In low-dimensional spaces, it is often sufficient to consider a set of \"candidate\" x̃ and select the one promising the smallest resulting error. In higher-dimensional spaces, it is often more efficient to search for an optimal x̃ with a response surface technique [Box and Draper, 1987], or to hill climb on ∂bias'²/∂x̃. \n\nEstimates of bias and Δŷ depend on the specific learning model being used. In Section 3, I describe a locally weighted regression model, and show how differentiable estimates of bias and Δŷ may be computed for it. \n\n¹ For clarity, I will drop the argument x except where required for disambiguation. I will also denote only the univariate case; the results apply in higher dimensions as well. \n² This assumption is contrary to the assumption normally made in some forms of learning, e.g. PAC-learning, but it is appropriate in many domains. \n\n2.1 AN ASIDE: WHY NOT JUST USE ŷ - bias? \n\nIf we have an accurate bias estimate, it is reasonable to ask why we do not simply use the corrected ŷ - bias as our predictor. 
The answer has two parts, the first of which is that for most learners there are no perfect bias estimators: they introduce their own bias and variance, which must be addressed in data selection. Second, we can define a composite learner ŷ_c ≡ ŷ - bias. Given a random training sample, we would expect ŷ_c to outperform ŷ. However, there is no obvious way to select data for this composite learner other than selecting to maximize the performance of its two components. In our case, the second component (the bias estimate) is non-analytic, which leaves us selecting data so as to maximize the performance of the first component (the uncorrected estimator). We are now back to our original problem: we can select data so as to minimize either the bias or variance of the uncorrected LWR-based learner. Since the purpose of the correction is to give an unbiased estimator, intuition suggests that variance minimization would be the more sensible route in this case. Empirically, this approach does not appear to yield any benefit over uncorrected variance minimization (see Figure 1). \n\n3 LOCALLY WEIGHTED REGRESSION \n\nThe type of learner I consider here is a form of locally weighted regression (LWR) that is a slight variation on the LOESS model of Cleveland et al. [1988] (see Cohn et al. [1996] for details). The LOESS model performs a linear regression on points in the data set, weighted by a kernel centered at x. The kernel shape is a design parameter: the original LOESS model uses a \"tricubic\" kernel; in my experiments I use the more common Gaussian \n\nh_i(x) = exp(-k (x_i - x)²), \n\nwhere k is a smoothing parameter. For brevity, I will drop the argument x for h_i(x), and define n = Σ_i h_i. We can then write the weighted means and covariances as: \n\nμ_x = Σ_i h_i x_i / n,   μ_y = Σ_i h_i y_i / n, \nσ_x² = Σ_i h_i (x_i - μ_x)² / n,   σ_xy = Σ_i h_i (x_i - μ_x)(y_i - μ_y) / n. 
\n\nWe use these means and covariances to produce an estimate ŷ at the x around which the kernel is centered, with a confidence term in the form of a variance estimate: \n\nŷ(x) = μ_y + (σ_xy / σ_x²)(x - μ_x), \n\ntogether with an accompanying estimate σ_ŷ² of the variance of ŷ (see Cohn et al. [1996] for its derivation). In all the experiments discussed in this paper, the smoothing parameter k was set so as to minimize σ_ŷ². \n\nThe low cost of incorporating new training examples makes this form of locally weighted regression appealing for learning systems which must operate in real time, or with time-varying target functions (e.g. [Schaal and Atkeson, 1994]). \n\n3.1 COMPUTING Δŷ FOR LWR \n\nIf we know what new point (x̃, ỹ) we're going to add, computing Δŷ for LWR is straightforward. Defining h̃ as the weight given to x̃, and ñ = n + h̃, the weighted statistics after adding the new point are \n\nμ'_x = μ_x + (h̃/ñ)(x̃ - μ_x),   μ'_y = μ_y + (h̃/ñ)(ỹ - μ_y), \nσ'_x² = (n σ_x² + (n h̃/ñ)(x̃ - μ_x)²) / ñ,   σ'_xy = (n σ_xy + (n h̃/ñ)(x̃ - μ_x)(ỹ - μ_y)) / ñ, \n\nso that \n\nΔŷ = ŷ' - ŷ = μ'_y + (σ'_xy / σ'_x²)(x - μ'_x) - μ_y - (σ_xy / σ_x²)(x - μ_x). \n\nNote that computing Δŷ requires us to know both the x̃ and ỹ of the new point. In practice, we only know x̃. If we assume, however, that we can estimate the learner's bias at any x, then we can also estimate the unknown value ỹ ≈ ŷ(x̃) - bias(x̃). Below, I consider how to compute the bias estimate. \n\n3.2 ESTIMATING BIAS FOR LWR \n\nThe most common technique for estimating bias is cross-validation. Standard cross-validation, however, only gives estimates of the bias at our specific training points, which are usually combined to form an average bias estimate. This is sufficient if one assumes that the training distribution is representative of the test distribution (which it isn't in query learning) and if one is content to just estimate the bias where one already has training data (which we can't be). 
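The LOESS-style estimate and the incremental computation of Δŷ can be sketched as follows. This is a minimal univariate sketch, assuming the Gaussian kernel h_i(x) = exp(-k (x_i - x)²); the kernel width, the toy data, and the function names are illustrative, and the rank-one updates of the weighted statistics are checked against a full refit:

```python
import math

def lwr_stats(xs, ys, x, k=2.0):
    # Gaussian kernel weights, centered on the prediction point x
    h = [math.exp(-k * (xi - x) ** 2) for xi in xs]
    n = sum(h)
    mu_x = sum(hi * xi for hi, xi in zip(h, xs)) / n
    mu_y = sum(hi * yi for hi, yi in zip(h, ys)) / n
    var_x = sum(hi * (xi - mu_x) ** 2 for hi, xi in zip(h, xs)) / n
    cov_xy = sum(hi * (xi - mu_x) * (yi - mu_y)
                 for hi, xi, yi in zip(h, xs, ys)) / n
    return n, mu_x, mu_y, var_x, cov_xy

def lwr_predict(stats, x):
    n, mu_x, mu_y, var_x, cov_xy = stats
    return mu_y + (cov_xy / var_x) * (x - mu_x)

def add_point(stats, x, x_new, y_new, k=2.0):
    # Rank-one update of the weighted statistics after adding (x_new, y_new)
    n, mu_x, mu_y, var_x, cov_xy = stats
    h = math.exp(-k * (x_new - x) ** 2)   # weight of the new point
    n2 = n + h
    mu_x2 = mu_x + (h / n2) * (x_new - mu_x)
    mu_y2 = mu_y + (h / n2) * (y_new - mu_y)
    var_x2 = (n * var_x + (n * h / n2) * (x_new - mu_x) ** 2) / n2
    cov_xy2 = (n * cov_xy + (n * h / n2) * (x_new - mu_x) * (y_new - mu_y)) / n2
    return n2, mu_x2, mu_y2, var_x2, cov_xy2

xs = [0.0, 0.4, 0.9, 1.3, 2.0]            # illustrative training data
ys = [math.sin(x) for x in xs]
x = 1.0                                    # prediction point
stats = lwr_stats(xs, ys, x)
y_hat = lwr_predict(stats, x)

x_new, y_new = 1.1, math.sin(1.1)          # candidate query and its label
delta_inc = lwr_predict(add_point(stats, x, x_new, y_new), x) - y_hat
delta_ref = lwr_predict(lwr_stats(xs + [x_new], ys + [y_new], x), x) - y_hat
print(delta_inc, delta_ref)                # the two computations agree
```

The incremental form matters in practice because each candidate must be scored at many reference points, and refitting from scratch at every pair would be wasteful.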
\nIn the query selection problem, we must be able to estimate the bias at all possible x. Box and Draper [1987] suggest fitting a higher-order model and measuring the difference. For the experiments described in this paper, this method yielded poor results; two other bias-estimation techniques, however, performed very well. \n\nOne method of estimating bias is by bootstrapping the residuals of the training points. One produces a \"bootstrap sample\" of the learner's residuals on the training data, and adds them to the original predictions to create a synthetic training set. By averaging predictions over a number of bootstrapped training sets and comparing the average prediction with that of the original predictor, one arrives at a first-order bootstrap estimate of the predictor's bias [Connor, 1993; Efron and Tibshirani, 1993]. It is known that this estimate is itself biased towards zero; a standard heuristic is to divide the estimate by 0.632 [Efron, 1983]. \n\nAnother method of estimating the bias of a learner is by fitting its own cross-validated residuals. We first compute the cross-validated residuals on the training examples. These produce estimates of the learner's bias at each of the training points. We can then use these residuals as training examples for another learner (again LWR) to produce estimates of what the cross-validated error would be in places where we don't have training data. \n\n4 EMPIRICAL RESULTS \n\nIn the previous two sections, I have explained how having an estimate of Δŷ and bias for a learner allows one to compute the learner's change in bias given a new query, and have shown how these estimates may be computed for a learner that uses locally weighted regression. Here, I apply these results to two simple problems and demonstrate that they may actually be used to select queries that minimize the statistical bias (and the error) of the learner. 
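The residual-bootstrap bias estimate of Section 3.2 can be sketched as follows. For self-containedness, a simple Gaussian-kernel smoother stands in for the LWR learner, and the target function, kernel width, and number of resamples are illustrative assumptions:

```python
import math
import random

def smooth(xs, ys, x, k=4.0):
    # a simple Gaussian-kernel smoother standing in for the LWR learner
    w = [math.exp(-k * (xi - x) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

random.seed(0)
xs = [i * math.pi / 11 for i in range(12)]     # illustrative, noise-free data
ys = [math.sin(x) for x in xs]

fitted = [smooth(xs, ys, xi) for xi in xs]     # the learner's own predictions
resid = [yi - fi for yi, fi in zip(ys, fitted)]

def bias_boot(x, B=200):
    # Refit on B residual-bootstrap training sets, compare the average
    # prediction with the original one, and rescale by 0.632 (Efron, 1983).
    total = 0.0
    for _ in range(B):
        ys_star = [fi + random.choice(resid) for fi in fitted]
        total += smooth(xs, ys_star, x)
    return (total / B - smooth(xs, ys, x)) / 0.632

# The smoother flattens the peak of sin, so bias = yhat - y is negative
# there; the bootstrap estimate recovers the sign of that bias.
print(bias_boot(math.pi / 2))
```

The same loop yields a variance estimate for free (the spread of the bootstrap predictions), which is the property exploited in the Discussion.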
The problems involve learning the kinematics of a planar two-jointed robot arm: given the shoulder and elbow joint angles, the learner must predict the tip position. \n\n4.1 BIAS ESTIMATES \n\nI tested the accuracy of the two bias estimators by observing their correlations on 64 reference inputs, given 100 random training examples from the planar arm problem. The bias estimates had a correlation with actual biases of 0.852 for the bootstrap method, and 0.871 for the cross-validation method. \n\n4.2 BIAS MINIMIZATION \n\nI ran two sets of experiments using the bias-minimizing criterion in conjunction with the bias estimation techniques of the previous section on the planar arm problem. The bias minimization criterion was used as follows: at each time step, the learner was given a set of 64 randomly chosen candidate queries and 64 uniformly chosen reference points. It evaluated E'(x) for each reference point given each candidate point, and selected for its next query the candidate point with the smallest average E'(x) over the reference points. I compared the bias-minimizing strategy (using the cross-validation and bootstrap estimation techniques) against random sampling and the variance-minimizing strategy discussed in Cohn et al. [1996]. On a Sparc 10, with m training examples, the average evaluation times per candidate per reference point were 58 + 0.16m microseconds for the variance criterion, 65 + 0.53m microseconds for the cross-validation-based bias criterion, and 83 + 3.7m microseconds for the bootstrap-based bias criterion (with 20x resampling). \n\nTo test whether the bias-only assumption was robust against the presence of noise, 1% Gaussian noise was added to the input values of the training data in all experiments. This simulates noisy position effectors on the arm, and results in non-Gaussian noise in the output coordinate system. 
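The selection procedure of Section 4.2 can be sketched end to end. This illustrative version substitutes a one-dimensional sinusoidal target for the arm kinematics and a Gaussian-kernel smoother for full LWR; it uses the cross-validated-residual bias estimate, computes Δŷ by direct refitting rather than the incremental update, and scores each candidate by the average of Δŷ² + 2Δŷ·bias over the reference points (the Δŷ-dependent part of Equation 5):

```python
import math
import random

def smooth(xs, ys, x, k=4.0):
    # a Gaussian-kernel smoother standing in for the full LWR learner
    w = [math.exp(-k * (xi - x) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

def target(x):
    return math.sin(x)     # illustrative stand-in for the arm kinematics

random.seed(1)
xs = [random.uniform(0.0, math.pi) for _ in range(15)]
ys = [target(x) for x in xs]

# Cross-validated (leave-one-out) residuals, themselves smoothed so the
# bias can be estimated anywhere, not just at the training points.
loo = [ys[i] - smooth(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i])
       for i in range(len(xs))]

def bias_hat(x):
    return -smooth(xs, loo, x)   # residual = y - yhat, so bias = -residual

refs = [random.uniform(0.0, math.pi) for _ in range(32)]   # reference points
cands = [random.uniform(0.0, math.pi) for _ in range(32)]  # candidate queries

def score(xq):
    # Impute the unknown label, then average the change in squared bias.
    yq = smooth(xs, ys, xq) - bias_hat(xq)
    total = 0.0
    for x in refs:
        dy = smooth(xs + [xq], ys + [yq], x) - smooth(xs, ys, x)
        total += dy * dy + 2.0 * dy * bias_hat(x)
    return total / len(refs)

best = min(cands, key=score)     # the next query
print(best)
```

In the experiments reported below, the same greedy loop runs in the two-dimensional joint-angle space with 64 candidates and 64 reference points per step.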
\n\nIn the first series of experiments, the candidate shoulder and elbow joint angles were drawn uniformly over (U[0, 2π], U[0, π)). In unconstrained domains like this, random sampling is a fairly good default strategy. The bias minimization strategies still significantly outperform both random sampling and the variance minimizing strategy in these experiments (see Figure 1). \n\nFigure 1: (left) MSE as a function of number of noisy training examples for the unconstrained arm problem, for random sampling, variance minimization, cross-validation-based bias minimization, and bootstrap-based bias minimization. Errors are averaged over 10 runs for the bootstrap method and 15 runs for all others. One run with the cross-validation-based method was excluded when k failed to converge to a reasonable value. (center) MSE as a function of number of noisy training examples for the constrained arm problem. The bias correction strategy discussed in Section 2.1 does no better than the uncorrected variance-minimizing strategy, and much worse than the bias-minimization strategy. (right) Sample exploration trajectory in joint-space for the constrained arm problem, explored according to the bias minimizing criterion. \n\nIn the second series of experiments, candidates were drawn uniformly from a region local to the previously selected query: (θ1 ± 0.2π, θ2 ± 0.1π). This corresponds to restricting the arm to local motions. 
In a constrained problem such as this, random sampling is a poor strategy; both the bias- and variance-reducing strategies outperform it by at least an order of magnitude. Further, the bias-minimization strategy outperforms variance minimization by a large margin (Figure 1). Figure 1 also shows an exploration trajectory produced by pursuing the bias-minimizing criterion. It is noteworthy that, although the implementation in this case was a greedy (one-step) minimization, the trajectory results in globally good exploration. \n\n5 DISCUSSION \n\nI have argued in this paper that, in many situations, selecting queries to minimize learner bias is an appropriate and effective strategy for active learning. I have given empirical evidence that, with an LWR-based learner and the examples considered here, the strategy is effective even in the presence of noise. \n\nBeyond minimizing either bias or variance, an important next step is to explicitly minimize them together. The bootstrap-based estimate should facilitate this, as it produces a complementary variance estimate with little additional computation. By optimizing over both criteria simultaneously, we expect to derive a criterion that is truly statistically optimal for selecting queries. \n\nREFERENCES \n\nBox, G., & Draper, N. (1987). Empirical Model-Building and Response Surfaces. Wiley, New York. \nCleveland, W., Devlin, S., & Grosse, E. (1988). Regression by local fitting. Journal of Econometrics, 37, 87-114. \nCohn, D. (1996). Neural network exploration using optimal experiment design. Neural Networks, 9(6):1071-1083. \nCohn, D., Ghahramani, Z., & Jordan, M. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145. \nConnor, J. (1993). Bootstrap methods in neural network time series prediction. In J. Alspector et al., eds., Proc. of the Int. 
Workshop on Applications of Neural Networks to Telecommunications. Lawrence Erlbaum, Hillsdale, NJ. \nDietterich, T., & Kong, E. (1995). Error-correcting output coding corrects bias and variance. In S. Prieditis and S. Russell, eds., Proceedings of the 12th International Conference on Machine Learning. \nEfron, B. (1983). Estimating the error rate of a prediction rule: some improvements on cross-validation. J. Amer. Statist. Assoc., 78:316-331. \nEfron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall, New York. \nFedorov, V. (1972). Theory of Optimal Experiments. Academic Press, New York. \nGeman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58. \nMacKay, D. (1992). Information-based objective functions for active data selection. Neural Computation, 4, 590-604. \nPaass, G., & Kindermann, J. (1994). Bayesian query construction for neural network models. In G. Tesauro et al., eds., Advances in Neural Information Processing Systems 7. MIT Press. \nPlutowski, M., & White, H. (1993). Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4, 305-318. \nSchaal, S., & Atkeson, C. (1994). Robot juggling: an implementation of memory-based learning. Control Systems, 14, 57-71. \n", "award": [], "sourceid": 1288, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}]}