{"title": "Active Learning for Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 593, "page_last": 600, "abstract": null, "full_text": "Active Learning for Function \n\nApproximation \n\nKah Kay Sung \n(sung@ai.mit.edu) \n\nPartha Niyogi \n(pn@ai.mit.edu) \n\nMassachusetts Institute of Technology \n\nMassachusetts Institute of Technology \n\nArtificial Intelligence Laboratory \n\nArtificial Intelligence Laboratory \n\n545 Technology Square \nCambridge, MA 02139 \n\n545 Technology Square \nCambridge, MA 02139 \n\nAbstract \n\nWe develop a principled strategy to sample a function optimally for \nfunction approximation tasks within a Bayesian framework. Using \nideas from optimal experiment design, we introduce an objective \nfunction (incorporating both bias and variance) to measure the de(cid:173)\ngree of approximation, and the potential utility of the data points \ntowards optimizing this objective. We show how the general strat(cid:173)\negy can be used to derive precise algorithms to select data for two \ncases: learning unit step functions and polynomial functions. In \nparticular, we investigate whether such active algorithms can learn \nthe target with fewer examples. We obtain theoretical and empir(cid:173)\nical results to suggest that this is the case. \n\nINTRODUCTION AND MOTIVATION \n\n1 \nLearning from examples is a common supervised learning paradigm that hypothe(cid:173)\nsizes a target concept given a stream of training examples that describes the concept. \nIn function approximation, example-based learning can be formulated as synthesiz(cid:173)\ning an approximation function for data sampled from an unknown target function \n(Poggio and Girosi, 1990). \n\nActive learning describes a class of example-based learning paradigms that seeks out \nnew training examples from specific regions of the input space, instead of passively \naccepting examples from some data generating source. 
By judiciously selecting examples instead of drawing them at random, active learning techniques can conceivably achieve faster learning rates and better approximation results than passive learning methods.

This paper presents a Bayesian formulation for active learning within the function approximation framework. Specifically, here is the problem we want to address: Let D_n = {(x_i, y_i) | i = 1, ..., n} be a set of n data points sampled from an unknown target function g, possibly in the presence of noise. Given an approximation function concept class, F, where each f in F has prior probability P_F[f], one can use regularization techniques to approximate g from D_n (in the Bayes optimal sense) by means of a function ĝ in F. We want a strategy for determining at what input location one should sample the next data point, (x_{n+1}, y_{n+1}), in order to obtain the \"best\" possible Bayes optimal approximation of the unknown target function g with our concept class F.

The data sampling problem consists of two parts:

1) Defining what we mean by the \"best\" possible Bayes optimal approximation of an unknown target function. In this paper, we propose an optimality criterion for evaluating the \"goodness\" of a solution with respect to an unknown target function.

2) Formalizing precisely the task of determining where in input space to sample the next data point. We express the above-mentioned optimality criterion as a cost function to be minimized, and the task of choosing the next sample as one of minimizing the cost function with respect to the input space location of the next sample point.

Earlier work (Cohn, 1991; MacKay, 1992) has tried to use similar optimal experiment design (Fedorov, 1972) techniques to collect data that would provide maximum information about the target function. 
Our work differs from theirs in several respects. First, we use a different, and perhaps more general, optimality criterion for evaluating solutions to an unknown target function, based on a measure of function uncertainty that incorporates both bias and variance components of the total output generalization error. In contrast, MacKay and Cohn use only variance components in model parameter space. Second, we address the important sample complexity question, i.e., does the active strategy require fewer examples to learn the target to the same degree of uncertainty? Our results are stated in PAC-style (Valiant, 1984). After completing this work, we learnt that Sollich (1994) had also recently developed a formulation similar to ours. His analysis is conducted in a statistical physics framework.

The rest of the paper is organized as follows: Section 2 develops our active sampling paradigm. In Sections 3 and 4, we consider two classes of functions for which active strategies are obtained, and investigate their performance both theoretically and empirically.

2 THE MATHEMATICAL FRAMEWORK

In order to optimally select examples for a learning task, one should first have a clear notion of what an \"ideal\" learning goal is for the task. We can then measure an example's utility in terms of how well the example helps the learner achieve the goal, and devise an active sampling strategy that selects examples with maximum potential utility. In this section, we propose one such learning goal: to find an approximation function ĝ in F that \"best\" estimates the unknown target function g. We then derive an example utility cost function for the goal and finally present a general procedure for selecting examples. 
\n\n2.1 EVALUATING A SOLUTION TO AN UNKNOWN TARGET -\nTHE EXPECTED INTEGRATED SQUARED DIFFERENCE \n\nLet 9 be the target function that we want to estimate by means of an approximation \nfunction 9 E :F. If the target function 9 were known, then one natural measure of \nhow well (or badly) g approximates 9 would be the Integrated Squared Difference \n(ISD) of the two functions: \n\n1x\", \n\n8(g,g) = \n\n(g(x) - g(x))2dx. \n\n(1) \n\nXlo \n\nIn most function approximation tasks, the target 9 is unknown, so we clearly can(cid:173)\nnot express the quality of a learning result, g, in terms of g. We can, however, \nobtain an expected integrated squared difference (EISD) between the unknown tar(cid:173)\nget, g, and its estimate, g, by treating the unknown target 9 as a random vari(cid:173)\nable from the approximation function concept class:F. Taking into account the \nn data points, D n , seen so far, we have the following a-posteriori likelihood for \ng: P(gIDn) ex PF[g]P(Dnlg). The expected integrated squared difference (EISD) \nbetween an unknown target, g, and its estimate, g, given D n , is thus: \n\nEF[8(g,g)IDn] = f P(gIDn)8(g,g)dg = f PF[g]P(Dnlg)8(y,g)dg. \n\n(2) \n\n19EF \n\n19EF \n\n2.2 SELECTING THE NEXT SAMPLE LOCATION \nWe can now express our learning goal as minimizing the expected integrated squared \ndifference (EISD) between the unknown target 9 and its estimate g. A reasonable \nsampling strategy would be to choose the next example from the input location that \nminimizes the EISD between g and the new estimate gn+l. How does one predict \nthe new EISD that results from sampling the next data point at location Xn+l ? \nSuppose we also know the target output value (possibly noisy), Yn+l, at Xn+l. \nThe EISD between 9 and its new estimate gn+l would then be EF[8(gn+l, g)IDn U \n(x n+l,Yn+d]' where gn+l can be recovered from Dn U (Xn+l,Yn+l) via regular(cid:173)\nization. 
In reality, we do not know y_{n+1}, but we can derive for it the following conditional probability distribution:

    P(y_{n+1}|x_{n+1}, D_n) ∝ ∫_{f ∈ F} P(D_n ∪ (x_{n+1}, y_{n+1})|f) P_F[f] df.    (3)

This leads to the following expected value for the new EISD, if we sample our next data point at x_{n+1}:

    U(ĝ_{n+1}|D_n, x_{n+1}) = ∫_{-∞}^{∞} P(y_{n+1}|x_{n+1}, D_n) E_F[δ(ĝ_{n+1}, g)|D_n ∪ (x_{n+1}, y_{n+1})] dy_{n+1}.    (4)

Clearly, the optimal input location to sample next is the location that minimizes the cost function in Equation 4 (henceforth referred to as the total output uncertainty), i.e.,

    x_{n+1} = arg min_x U(ĝ_{n+1}|D_n, x).    (5)

2.3 SUMMARY OF ACTIVE LEARNING PROCEDURE

We summarize the key steps involved in finding the optimal next sample location:

1) Compute P(g|D_n). This is the a-posteriori likelihood of the different functions g given D_n, the n data points seen so far.

2) Fix a new point x_{n+1} to sample.

3) Assume a value y_{n+1} for this x_{n+1}. One can compute P(g|D_n ∪ (x_{n+1}, y_{n+1})) and hence the expected integrated squared difference between the target and its new estimate. This is given by E_F[δ(ĝ_{n+1}, g)|D_n ∪ (x_{n+1}, y_{n+1})]. See also Equation 2.

4) At the given x_{n+1}, y_{n+1} has a probability distribution given by Equation 3. Averaging over all y_{n+1}'s, we obtain the total output uncertainty for x_{n+1}, given by U(ĝ_{n+1}|D_n, x_{n+1}) in Equation 4.

5) Sample at the input location that minimizes the total output uncertainty cost function.

3 EXAMPLE 1: UNIT STEP FUNCTIONS

To demonstrate the usefulness of the above procedure, let us first consider the following simple class of indicator functions parameterized by a single parameter a which takes values in [0,1]:

    F = {1_{[a,1]} | 0 ≤ a ≤ 1}.

We obtain a prior P(g = 1_{[a,1]}) by assuming that a has an a-priori uniform distribution on [0,1]. Assume that data, D_n = {(x_i, y_i); i = 1, ..., 
n} consistent with some unknown target function 1_{[a_t,1]} (which the learner is to approximate) has been obtained. We are interested in choosing a point x ∈ [0,1] to sample which will provide us with maximal information. Following the general procedure outlined above, we go through the following steps.

For ease of notation, let x_R be the right-most point belonging to D_n whose y value is 0, i.e., x_R = max_{i=1,...,n}{x_i | y_i = 0}. Similarly, let x_L = min_{i=1,...,n}{x_i | y_i = 1}, and let w = x_L - x_R.

1) We first need to get P(g|D_n). It is easy to show that

    P(g = 1_{[a,1]}|D_n) = 1/w if a ∈ [x_R, x_L]; 0 otherwise.

2) Suppose we sample next at a particular x ∈ [0,1]; we would obtain y with the distribution

    P(y = 0|D_n, x) = (x_L - x)/(x_L - x_R) = (x_L - x)/w if x ∈ [x_R, x_L]; 1 if x ≤ x_R; 0 otherwise.

For a particular y, the new data set would be D_{n+1} = D_n ∪ (x, y), and the corresponding EISD can be easily obtained using the distribution P(g|D_{n+1}). Averaging this over P(y|D_n, x) as in step 4 of the general procedure, we obtain a piecewise expression for U(ĝ|D_n, x): one (constant) value for x ≤ x_R or x ≥ x_L, where the sample is uninformative, and a smaller, x-dependent value otherwise. Clearly, the point which minimizes the expected total output uncertainty is the midpoint of x_L and x_R:

    x_{n+1} = arg min_{x ∈ [0,1]} U(ĝ|D_n, x) = (x_L + x_R)/2.

Thus, applying the general procedure to this special case reduces to a binary search learning algorithm which queries the midpoint of x_R and x_L. An interesting question at this stage is whether such a strategy provably reduces the sample complexity, and if so, by how much. It is possible to prove the following theorem, which shows that for a certain pre-decided total output uncertainty value, the active learning algorithm takes fewer examples to learn the target to the same degree of total output uncertainty than a random drawing of examples according to a uniform distribution. 
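The binary-search behavior derived above is easy to simulate. Below is a minimal sketch (function and variable names are our own illustration) of the active midpoint-querying learner next to a passive learner that samples uniformly at random; both maintain the interval [x_R, x_L] of step locations still consistent with the data.

```python
import random

def active_learn_step(a_target, n_queries):
    # Active learner for the class {1_[a,1]}: each query is the midpoint
    # of the current uncertainty interval [x_R, x_L], the minimizer of
    # the total output uncertainty derived above.
    x_R, x_L = 0.0, 1.0
    for _ in range(n_queries):
        x = (x_R + x_L) / 2.0
        y = 1 if x >= a_target else 0   # noiseless membership query
        if y == 0:
            x_R = x                     # the step lies strictly right of x
        else:
            x_L = x                     # the step lies at or left of x
    return x_R, x_L

def passive_learn_step(a_target, n_queries, rng=random):
    # Passive learner: x is drawn uniformly on [0,1]; only samples that
    # happen to fall inside [x_R, x_L] shrink the interval.
    x_R, x_L = 0.0, 1.0
    for _ in range(n_queries):
        x = rng.random()
        y = 1 if x >= a_target else 0
        if y == 0:
            x_R = max(x_R, x)
        else:
            x_L = min(x_L, x)
    return x_R, x_L
```

After n active queries the interval width is exactly 2^{-n}, while the passive interval shrinks only when a random sample happens to land inside it; this gap is what Theorem 1 quantifies.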
\n\nTheorem 1 Suppose we want to collect examples so that we are guaranteed with \nhigh probability (i. e. probability> 1 - 6) that the total output uncertainty is less \nthan f. Then a passive learner would require at least ~ In(I/6) examples while \n\ny(48f) \n\nthe active strategy described earlier would require at most (1/2) In(1 / 12f) examples. \n\n4 . EXAMPLE 2: THE CASE OF POLYNOMIALS \nIn this section we turn our attention to a class of univariate polynomials (from \n[-5,5] to ~) of maximum degree K, i.e., \n\nK \n\n:F = {g(ao, ... , aK) = L: aixi} \n\ni:=O \n\nAs before, the prior on :F is obtained here by assuming a prior on the parameters; \nin particular we assume that a = (ao, aI, ... , aK) has a multivariate normal distri(cid:173)\nbution N(O, S). For simplicity, it is assumed that the parameters are independent, \ni.e., S is a diagonal matrix with Si,i = (7;' In this example we also incorporate noise \n(distributed normally according to N(O, (72)). As before, there is a target gt E G \nwhich the learner is to approximate on the basis of data. Suppose the learner is in \npossession of a data set Dn = {(Xi, Yi = gt(Xi) + 1]); i = 1 ... n} and is to receive \nanother data point. The two options are 1) to sample the function at a point X \naccording to a uniform distribution on the domain [-5,5] (passive learning) and 2) \nfollow our principled active learning strategy to select the next point to be sampled. \n\n4.1 ACTIVE STRATEGY \nHere we derive an exact expression for Xn+1 (the next query point) by applying the \ngeneral procedure described earlier. Going through the steps as before, \n1)It is possible to show that p(g(a)IDn) = P(aIDn) is again a multivariate normal \ndistribution N(J1., 1:n ) where J1. = L~l Yixi, Xi = (1, Xi, xl, . .. , xf)T and \n\n1:-1 = S-l + _ \" (x.x. T ) \n\n1 1 \n\nn \n\n1 \nn \n2(72 L..J \ni:=l \n\n2)Computation of the total output uncertainty U(gn+1IDn, x) requires several steps. 
\nTaking advantage of the Gaussian distribution on both the parameters a and the \nnoise, we obtain (see Niyogi and Sung, 1995 for details): \n\nU(gIDn, x) = l1:n+1AI \n\n\f598 \n\nKah Kay Sung. Partha Niyogi \n\nJ i\\ Noise=0.1 \n\nto/ I~J \\ A \n\"\\ \n\\.j\\ \n\n'\\ \n\n\"'-\n\n20 \n\n15 \n\n10 \n\n5 \n0 \n\n20 \n\\ \n15 0\\ '\\ l \nI \\J\\ \n\nNoise=1.0 \n\nr \\ \\ \n10\". \n\n'v .J\\ \n\n10 \n\n5 \n0 \n\n50 \n\n50 \n\nFigure 1: Comparing active and passive learning average error rates at different sample \nnoise levels. The two graphs above plot log error rates against number of samples. See text \nfor detailed explanation. The solid and dashed curves are the active and passive learning \nerror rates respectively. \n\nwhere A is a matrix of numbers whose i,j element is J~5 t(i+i- 2)dt . En+l has the \nsame form as En and depends on the previous data, the priors, noise and Xn+l. \nWhen minimized over Xn+l, we get xn+! as the maximum utility location where \nthe active learner should next sample the unknown target function. \n\n4.2 SIMULATIONS \nWe have performed several simulations to compare the performance of the active \nstrategy developed in the previous section to that of a passive learner (who receives \nexamples according to a uniform random distribution on the domain [-5,5]). The \nfollowing issues have been investigated. \n\n1) Average error rate as a function of the number of examples: Is it \nindeed the case that the active strategy has superior error performance for the same \nnumber of examples? To investigate this we generated 1000 test target polynomial \nfunctions (of maximum degree 9) according ~o the following Gaussian prior on \nthe parameters: for each ai,P(ai) = N(0,0.9'). For each target polynomial, we \ncollected data according to the active strategy as well as the passive (random) \nstrategy for varying number of data points. 
Figure 1 shows the average error rate (i.e., the integrated squared difference between the actual target function and its estimate, averaged over the 1000 different target polynomials) as a function of the number of data points. Notice that the active strategy has a lower error rate than the passive one for the same number of examples; this is particularly true for small numbers of data points. The active strategy uses the same priors that generate the test target functions. We show results of the same simulation performed at two noise levels (noise standard deviation 0.1 and 1.0). In both cases the active strategy outperforms the passive learner, indicating robustness in the face of noise.

2) Incorrect priors: How sensitive is the active learner to possible differences between its prior assumptions on the class F and the true priors? We repeated the function learning task of the earlier case with the test targets generated in the same way as before. The active learner assumes a slightly different Gaussian prior and polynomial degree from the target (Std(a_i) = 0.7^i and K = 7 for the active learner versus Std(a_i) = 0.9^i and K = 9 for the target). Despite its inaccurate priors, the active learner outperforms the passive case.

Figure 2: Active learning results with different Gaussian priors for coefficients, and a lower a-priori polynomial degree K. The graph plots log error rate against number of samples. See text for detailed explanation. The solid and dashed curves are the active and passive learning error rates respectively.
\n\n3) The distribution of points: How does the active learner choose to sample \nthe domain for maximally reducing uncertainty? There are a few sampling trends \nwhich are noteworthy here. First, the learner does not simply sample the domain \non a uniform grid. Instead it chooses to cluster its samples typically around K + 1 \nlocations for concept classes with maximum degree K as borne out by simulations \nwhere K varies from 5 to 9. One possible explanation for this is it takes only \nK + 1 points to determine the target in the absence of noise. Second, as the noise \nincreases, although the number of clusters remains fixed, they tend to be distributed \naway from the origin. It seems that for higher noise levels, there is less pressure to \nfit the data closely; consequently the prior assumption of lower order polynomials \ndominates. For such lower order polynomials, it is profitable to sample away from \nthe origin as it reduces the variance of the resulting fit. (Note the case of linear \nregression) . \nRemarks \n1) Notice that because the class of polynomials is linear in its model parameters, a, \nthe new sample location (xn+d does not depend on the y values actually observed \nbut only on the x values sampled . Thus if the learner is to collect n data points, \nit can pre-compute the n points at which to sample from the start. In this sense \nthe active algorithm is not really adaptive. This behavior has also been observed \nby MacKay (1992) and Sollich (1994) . \n2) Needless to say, the general framework from optimal design can be used for \nany function class within a Bayesian framework. We are currently investigating \nthe possibility of developing active strategies for Radial Basis Function networks. \nWhile it is possible to compute exact expressions for Xn+l for such RBF networks \nwith fixed centers, for the case of moving centers, one has to resort to numerical \nminimization. For lack of space we do not include those results in this paper. 
\n\n\f600 \n\n2 \n\n1 \n\no o \n\nKah Kay Sung, Partha Niyogi \n\n10 \n\n8 \n\n6 \n\n(DO 0 00 (D CJX:\u00bb 0 \n\n00 (I) \n\n0 \n00 \n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n00 \n\n0 \n\n5 \n\n-5 \n\n0 \n\no \n\no~~~~----~~~~ 4 \n-5 \n\n5 \nFigure 3: Distribution of active learning sample points as a function of (i) noise strength \nand (ii) a-priori polynomial degree. The horizontal axis of both graphs represents the input \nspace [-5,5]. Each circle indicates a sample location. The Left graph shows the distribution \nof sample locations (on x axis) for different noise level (indicated on y-axis). The Right \ngraph shows the distribution of sample locations (on x-axis) for different assumptions on \nthe maximum polynomial degree K (indicated on y-axis) . \n\n5 CONCLUSIONS \nWe have developed a Bayesian framework for active learning using ideas from opti(cid:173)\nmal experiment design. Our focus has been to investigate the possibility of improved \nsample complexity using such active learning schemes. For a simple case of unit \nstep functions, we are able to derive a binary search algorithm from a completely \ndifferent standpoint. Such an algorithm then provably requires fewer examples for \nthe same error rate. We then show how to derive specific algorithms for the case \nof polynomials and carry out extensive simulations to compare their performance \nagainst the benchmark of a passive learner with encouraging results. This is an \napplication of the optimal design paradigm to function learning and seems to bear \npromise for the design of more efficient learning algorithms. \n\nReferences \n\nD. Cohn. (1991) A Local Approach to Optimal Queries. In D. Touretzky (ed.), Proc. of \n1990 Connectionist Summer School, San Mateo, CA , 1991. Morgan Kaufmann Publishers. \n\nV. Fedorov. (1972) Theory of Optimal Experiments. Academic Press, New York, 1972. \n\nD. MacKay. (1992) Bayesian Methods for Adaptive Models. PhD thesis, CalTech, 1992. \n\nP. Niyogi and K. Sung. 
(1995) Active Learning for Function Approximation: Paradigms from Optimal Experiment Design. Tech Report AIM-1483, AI Lab., MIT, In Preparation.

M. Plutowski and H. White. (1991) Active Selection of Training Examples for Network Learning in Noiseless Environments. Tech Report CS91-180, Dept. of Computer Science and Engineering, University of California, San Diego, 1991.

T. Poggio and F. Girosi. (1990) Regularization Algorithms for Learning that are Equivalent to Multilayer Networks. Science, 247:978-982, 1990.

P. Sollich. (1994) Query Construction, Entropy, Generalization in Neural Network Models. Physical Review E, 49:4637-4651, 1994.

L. Valiant. (1984) A Theory of the Learnable. Proc. of the 1984 STOC, p. 436-445, 1984.
", "award": [], "sourceid": 900, "authors": [{"given_name": "Kah", "family_name": "Sung", "institution": null}, {"given_name": "Partha", "family_name": "Niyogi", "institution": null}]}