{"title": "Active Learning for Misspecified Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1305, "page_last": 1312, "abstract": null, "full_text": "Active Learning for Misspeci\ufb01ed Models\n\nMasashi Sugiyama\n\nDepartment of Computer Science, Tokyo Institute of Technology\n\n2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan\n\nsugi@cs.titech.ac.jp\n\nAbstract\n\nexpressed as\n\nActive learning is the problem in supervised learning to design the loca-\ntions of training input points so that the generalization error is minimized.\nExisting active learning methods often assume that the model used for\nlearning is correctly speci\ufb01ed, i.e., the learning target function can be ex-\npressed by the model at hand. In many practical situations, however, this\nassumption may not be ful\ufb01lled. In this paper, we \ufb01rst show that the ex-\nisting active learning method can be theoretically justi\ufb01ed under slightly\nweaker condition: the model does not have to be correctly speci\ufb01ed, but\nslightly misspeci\ufb01ed models are also allowed. However, it turns out that\nthe weakened condition is still restrictive in practice. To cope with this\nproblem, we propose an alternative active learning method which can be\ntheoretically justi\ufb01ed for a wider class of misspeci\ufb01ed models. Thus,\nthe proposed method has a broader range of applications than the exist-\ning method. Numerical studies show that the proposed active learning\nmethod is robust against the misspeci\ufb01cation of models and is thus reli-\nable.\n\nLet us discuss the regression problem of learning a real-valued functionf\u0004x\u0005 de\ufb01ned on\nRd\nfrom training examplesf\u0004xi;yi\u0005jyi=f\u0004xi\u0005\u0007(cid:15)ig\u0002i=1;\nwheref(cid:15)ig\u0002i=1 are i.i.d. 
noise with mean zero and unknown variance(cid:27)2\nbf\u0004x\u0005=\u0004Xi=1(cid:11)i\u0003i\u0004x\u0005;\nwheref\u0003i\u0004x\u0005g\u0004i=1 are \ufb01xed linearly independent functions and(cid:11)=\u0004(cid:11)1;(cid:11)2;:::;(cid:11)\u0004\u0005>\nWe evaluate the goodness of the learned functionbf\u0004x\u0005 by the expected squared test error\nare drawn independently from a distribution with density\u0004\b\u0004x\u0005, the generalization error is\nG=E(cid:15)Z(cid:16)bf\u0004x\u0005f\u0004x\u0005(cid:17)2\u0004\b\u0004x\u0005dx;\n\nover test input points and noise (i.e., the generalization error). When the test input points\n\n1\n\nIntroduction and Problem Formulation\n\nare parameters to be learned.\n\nlowing linear regression model for learning.\n\n. We use the fol-\n\n\fIn a standard setting of regression, the training input points are provided from the environ-\n\nActive learning\u2014also referred to as experimental design\u2014is the problem of optimizing the\nlocation of training input points so that the generalization error is minimized. In active\nlearning research, it is often assumed that the regression model is correctly speci\ufb01ed [2,\n\nhowever, this assumption is often violated.\n\nhand, in some cases, the training input points can be designed by users. In such cases,\nit is expected that the accuracy of the learning result can be improved if the training input\npoints are chosen appropriately, e.g., by densely locating training input points in the regions\nof high uncertainty.\n\nwhereE(cid:15) denotes the expectation over the noisef(cid:15)ig\u0002i=1. In the following, we suppose that\n\u0004\b\u0004x\u0005 is known1.\nment, i.e.,fxig\u0002i=1 independently follow the distribution with density\u0004\b\u0004x\u0005. On the other\n1, 3], i.e., the learning target functionf\u0004x\u0005 can be expressed by the model. 
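For concreteness, the generalization error above can be approximated by Monte Carlo simulation, averaging over noise realizations and over test inputs drawn from p_t(x). The following sketch is illustrative only: the quadratic target, the Gaussian densities, and the sample sizes are my choices rather than the paper's, and ordinary least squares (introduced in Section 2) is used as the learner.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 1 - x + x ** 2              # illustrative target function
sigma = 0.3                               # noise standard deviation
xs = rng.normal(0.2, 0.4, size=100)       # fixed training input points
test = rng.normal(0.2, 0.4, size=5000)    # samples from the test density p_t

def fit(ys):
    # Ordinary least-squares fit of a quadratic model; returns fhat
    X = np.vander(xs, 3)
    a = np.linalg.lstsq(X, ys, rcond=None)[0]
    return lambda x: np.vander(np.atleast_1d(x), 3) @ a

# E_eps and the integral over p_t are both approximated by averages:
# repeated noise draws for E_eps, test samples for the integral
errs = []
for _ in range(200):
    fhat = fit(f(xs) + rng.normal(0, sigma, size=xs.shape))
    errs.append(np.mean((fhat(test) - f(test)) ** 2))
G = np.mean(errs)   # Monte Carlo estimate of the generalization error
```

Here the model can express the target exactly, so G reflects only the variance of the fit; Section 2 makes this bias-variance decomposition explicit.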
In this paper, we first show that the existing active learning method can still be theoretically justified when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justified for approximately correct models, but under a weaker condition on the approximate correctness of the model than that for the existing method. Thus, the proposed method has a wider range of applications.

In the following, we suppose that the training input points {x_i}_{i=1}^n are independently drawn from a user-defined distribution with density p_x(x), and discuss the problem of finding the optimal density function.

¹In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples (input points without output values, independently drawn from the distribution with density p_t(x)) are easily gathered. In such cases, a reasonably good estimate of p_t(x) may be obtained by some standard density estimation method. Therefore, the assumption that p_t(x) is known may not be so restrictive.

2 Existing Active Learning Method

The generalization error G defined by Eq.(2) can be decomposed as

  G = B + V,

where B is the (squared) bias term and V is the variance term:

  B = ∫ ( E_ε f̂(x) − f(x) )² p_t(x) dx   and   V = E_ε ∫ ( f̂(x) − E_ε f̂(x) )² p_t(x) dx.

A standard way to learn the parameters in the regression model (1) is ordinary least-squares (OLS) learning, i.e., the parameter vector α is determined as

  α̂_OLS = argmin_α Σ_{i=1}^n ( f̂(x_i) − y_i )².

It is known that α̂_OLS is given by

  α̂_OLS = L_OLS y,  where  L_OLS = (X^⊤X)^{−1}X^⊤,  X_{i,j} = φ_j(x_i),  and  y = (y_1, y_2, ..., y_n)^⊤.

Let G_OLS, B_OLS, and V_OLS be G, B, and V for the learned function obtained by ordinary least-squares learning, respectively. Then the following proposition holds.

Proposition 1 ([2, 1, 3]) Suppose that the model is correctly specified, i.e., the learning target function f(x) is expressed as

  f(x) = Σ_{i=1}^b α*_i φ_i(x).

Then B_OLS and V_OLS are expressed as

  B_OLS = 0   and   V_OLS = σ² J_OLS,

where

  J_OLS = tr( U L_OLS L_OLS^⊤ )   and   U_{i,j} = ∫ φ_i(x) φ_j(x) p_t(x) dx.

Therefore, for the correctly specified model (1), the generalization error G_OLS is expressed as

  G_OLS = σ² J_OLS.

Based on this expression, the existing active learning method determines the location of the training input points {x_i}_{i=1}^n (or the training input density p_x(x)) so that J_OLS is minimized [2, 1, 3].

3 Analysis of Existing Method under Misspecification of Models

In this section, we investigate the validity of the existing active learning method for misspecified models. Suppose the model does not exactly include the learning target function f(x), but approximately includes it, i.e., for a scalar δ such that |δ| is small, f(x) is expressed as

  f(x) = g(x) + δ r(x),   (3)

where g(x) is the orthogonal projection of f(x) onto the span of {φ_i(x)}_{i=1}^b,

  g(x) = Σ_{i=1}^b α*_i φ_i(x),

and the residual r(x) is orthogonal to {φ_i(x)}_{i=1}^b:

  ∫ r(x) φ_i(x) p_t(x) dx = 0   for i = 1, 2, ..., b.

In this case, the bias term B is expressed as

  B = ∫ ( E_ε f̂(x) − g(x) )² p_t(x) dx + C,  where  C = ∫ ( g(x) − f(x) )² p_t(x) dx.

Since C is a constant that does not depend on the training input density p_x(x), we subtract C in the following discussion. Then we have the following lemma².

Lemma 2 For the approximately correct model (3), we have

  B_OLS − C = δ² ⟨ U L_OLS z_r, L_OLS z_r ⟩ = O_p(δ²)   and   V_OLS = σ² J_OLS = O_p(n^{−1}),

where z_r = ( r(x_1), r(x_2), ..., r(x_n) )^⊤.

²Proofs of lemmas are provided in an extended version [6].

Note that the asymptotic order of V_OLS is in probability since V_OLS is a random variable that includes {x_i}_{i=1}^n. The lemma implies that

  G_OLS − C = σ² J_OLS + o_p(n^{−1})   if δ = o(n^{−1/2}).

Therefore, the existing active learning method of minimizing J_OLS is still justified if δ = o(n^{−1/2}).
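The existing criterion is straightforward to evaluate numerically. In the sketch below (function and variable names are my own, not the paper's), the matrix U is approximated by a Monte Carlo average over samples from p_t(x) rather than by exact integration:

```python
import numpy as np

def J_ols(train_xs, basis, test_mc):
    # U[i, j] = integral of phi_i(x) phi_j(x) p_t(x) dx, approximated by
    # averaging over samples test_mc drawn from the test input density p_t
    Phi_t = np.array([[phi(x) for phi in basis] for x in test_mc])
    U = Phi_t.T @ Phi_t / len(test_mc)
    # L_OLS = (X^T X)^{-1} X^T and J_OLS = tr(U L_OLS L_OLS^T)
    X = np.array([[phi(x) for phi in basis] for x in train_xs])
    L = np.linalg.solve(X.T @ X, X.T)
    return np.trace(U @ L @ L.T)

rng = np.random.default_rng(0)
basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]
test_mc = rng.normal(0.2, 0.4, size=10000)   # p_t of the toy example below
spread = rng.normal(0.2, 0.8, size=100)      # widely spread training inputs
packed = rng.normal(0.2, 0.1, size=100)      # tightly packed training inputs
```

Since V_OLS = σ²J_OLS, a smaller J_OLS means a smaller variance term; here the spread-out design scores better because the tightly packed one makes X^⊤X nearly singular.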
However, when δ ≠ o(n^{−1/2}), the existing method may not work well, because the bias term B_OLS − C is then not smaller in order than the variance term V_OLS and so cannot be neglected.

4 New Active Learning Method

In this section, we propose a new active learning method based on weighted least-squares learning.

4.1 Weighted Least-Squares Learning

When the model is correctly specified, α̂_OLS is an unbiased estimator of α*. However, for misspecified models, α̂_OLS is generally biased even asymptotically if δ = O(1).

The bias of α̂_OLS is actually caused by the covariate shift [5]: the training input density p_x(x) is different from the test input density p_t(x). For correctly specified models, the influence of the covariate shift can be ignored, as the existing active learning method does. However, for misspecified models, we should explicitly cope with the covariate shift. Under the covariate shift, it is known that the following weighted least-squares (WLS) learning is asymptotically unbiased even if δ = O(1) [5]:

  α̂_WLS = argmin_α Σ_{i=1}^n ( p_t(x_i) / p_x(x_i) ) ( f̂(x_i) − y_i )².

The asymptotic unbiasedness of α̂_WLS may be intuitively understood from the following identity, which is similar in spirit to importance sampling:

  ∫ ( f̂(x) − f(x) )² p_t(x) dx = ∫ ( f̂(x) − f(x) )² ( p_t(x) / p_x(x) ) p_x(x) dx.

In the following, we assume that p_x(x) is strictly positive for all x. Let D be the diagonal matrix with the i-th diagonal element D_{i,i} = p_t(x_i) / p_x(x_i). Then it can be confirmed that α̂_WLS is given by

  α̂_WLS = L_WLS y,  where  L_WLS = (X^⊤DX)^{−1}X^⊤D.

4.2 Active Learning Based on Weighted Least-Squares Learning

Let G_WLS, B_WLS, and V_WLS be G, B, and V for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma.

Lemma 3 For the approximately correct model (3), we have

  B_WLS − C = δ² ⟨ U L_WLS z_r, L_WLS z_r ⟩ = O_p(δ² n^{−1})   and   V_WLS = σ² J_WLS = O_p(n^{−1}),

where

  J_WLS = tr( U L_WLS L_WLS^⊤ ).

This lemma implies that

  G_WLS − C = σ² J_WLS + o_p(n^{−1})   if δ = o(1).

Based on this expression, we propose determining the training input density p_x(x) so that J_WLS is minimized. The use of the proposed criterion J_WLS can be theoretically justified when δ = o(1), while the existing criterion J_OLS requires δ = o(n^{−1/2}).
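The weighted estimate can be sketched as follows; the Gaussian densities and polynomial basis are illustrative stand-ins of mine for p_t, p_x, and {φ_i}:

```python
import numpy as np

def wls_fit(xs, ys, basis, pt, px):
    # alpha_hat_WLS = (X^T D X)^{-1} X^T D y, with importance weights
    # D_ii = p_t(x_i) / p_x(x_i) compensating for the covariate shift
    X = np.array([[phi(x) for phi in basis] for x in xs])
    d = pt(xs) / px(xs)
    XtD = X.T * d                       # scales column i of X^T by d_i
    return np.linalg.solve(XtD @ X, XtD @ ys)

def gauss(mu, s):
    # One-dimensional Gaussian density with mean mu and std s
    return lambda x: np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]
xs = np.linspace(-1, 1, 50)
ys = 1 - xs + xs ** 2                   # noiseless, correctly specified target
alpha = wls_fit(xs, ys, basis, gauss(0.2, 0.4), gauss(0.2, 0.8))
```

With noiseless data and a correctly specified model, any strictly positive weights recover the exact coefficients (1, −1, 1); the weighting matters precisely when the model is misspecified and the two densities differ.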
Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section.

5 Numerical Examples

We evaluate the usefulness of the proposed active learning method through experiments.

Toy Data Set: We first illustrate how the proposed method works under a controlled setting. Let d = 1 and the learning target function be f(x) = 1 − x + x² + δx³. Let n = 100 and let {ε_i}_{i=1}^{100} be i.i.d. Gaussian noise with mean zero and standard deviation 0.3. Let p_t(x) be the Gaussian density with mean 0.2 and standard deviation 0.4, which is assumed to be known here. Let b = 3 and the basis functions be φ_i(x) = x^{i−1} for i = 1, 2, 3. Let us consider the following three cases, δ = 0, 0.04, 0.5, which correspond to "correctly specified", "approximately correct", and "misspecified", respectively (see Figure 1). We choose the training input density p_x(x) from the Gaussian densities with mean 0.2 and standard deviation 0.4c, where c = 0.8, 0.9, 1.0, ..., 2.5.

We compare the accuracy of the following three methods:

(A) Proposed active learning criterion + WLS learning: The training input density is determined so that J_WLS is minimized. Following the determined input density, training input points {x_i}_{i=1}^{100} are created and the corresponding output values {y_i}_{i=1}^{100} are observed. Then WLS learning is used for estimating the parameters.

(B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that J_OLS is minimized. OLS learning is used for estimating the parameters.

(C) Passive learning + OLS learning: The test input density p_t(x) is used as the training input density. OLS learning is used for estimating the parameters.

First, we evaluate the accuracy of J_WLS and J_OLS as approximations of G_WLS and G_OLS. The means and standard deviations of G_WLS, J_WLS, G_OLS, and J_OLS over 100 runs are depicted as functions of c in Figure 2. These graphs show that when δ = 0 ("correctly specified"), both J_WLS and J_OLS give accurate estimates of G_WLS and G_OLS. When δ = 0.04 ("approximately correct"), J_WLS again works well, while J_OLS tends to be negatively biased for large c. This result is surprising since, as illustrated in Figure 1, the learning target functions with δ = 0 and δ = 0.04 are visually quite similar; it therefore intuitively seems that the result for δ = 0.04 should not differ much from that for δ = 0. However, the simulation shows that this slight difference makes J_OLS unreliable. When δ = 0.5 ("misspecified"), J_WLS is still reasonably accurate, while J_OLS is heavily biased.

These results show that, as an approximation of the generalization error, J_WLS is more robust against the misspecification of models than J_OLS, which is in good agreement with the theoretical analyses given in Section 3 and Section 4.

Figure 1: Learning target function f(x) for δ = 0, 0.04, 0.5, and the input density functions p_t(x) and p_x(x).
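The selection step of method (A) can be sketched as follows under the toy setting above: for each candidate width c, draw one set of training inputs, evaluate J_WLS, and keep the minimizer. The names and the single-draw simplification are mine.

```python
import numpy as np

def J_wls(train_xs, basis, test_mc, pt, px):
    # J_WLS = tr(U L_WLS L_WLS^T), with U estimated from samples of p_t
    Phi_t = np.array([[phi(x) for phi in basis] for x in test_mc])
    U = Phi_t.T @ Phi_t / len(test_mc)
    X = np.array([[phi(x) for phi in basis] for x in train_xs])
    d = pt(train_xs) / px(train_xs)
    L = np.linalg.solve((X.T * d) @ X, X.T * d)
    return np.trace(U @ L @ L.T)

rng = np.random.default_rng(1)
basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]
gauss = lambda mu, s: (lambda x: np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi)))
pt = gauss(0.2, 0.4)
test_mc = rng.normal(0.2, 0.4, size=5000)    # stand-in for integrating over p_t
scores = {}
for c in np.arange(0.8, 2.55, 0.1):          # candidate widths 0.8, 0.9, ..., 2.5
    px = gauss(0.2, 0.4 * c)
    xs = rng.normal(0.2, 0.4 * c, size=100)  # one draw of n = 100 inputs
    scores[round(c, 1)] = J_wls(xs, basis, test_mc, pt, px)
best_c = min(scores, key=scores.get)         # width minimizing J_WLS
```

In the paper's experiment the criterion is evaluated per run before sampling the outputs; averaging J_WLS over several input draws per c would reduce the selection noise of this sketch.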
The best\nmethod and comparable ones by the t-test at the\n\nsigni\ufb01cance level5\u0001 are described with boldface.\nThe value of method (B) for\u00c6=0:5 is extremely\n\u00c6=0\n\u00c6=0:04\n\u00c6=0:5\n1:99\u00060:07\n2:02\u00060:075:94\u00060:80\n1:34\u00060:04\n3:27\u00061:23\n303\u0006197\n2:60\u00060:44\n2:62\u00060:43\n6:87\u00061:15\nAll values in the table are multiplied by103\n\u00c6=0:04\n\u00c6=0:5\n\u00c6=0\nFigure 2: The means and error bars ofGW\u0004S,\u0002W\u0004S,G\u0007\u0004S, and\u0002\u0007\u0004S over100 runs as\nfunctions of\r.\nmethod is described. When\u00c6=0, the existing method (B) works better than the proposed\nGW\u0004S andG\u0007\u0004S were found by\u0002W\u0004S and\u0002\u0007\u0004S. Therefore, the difference of the errors\nOLS. Since bias is zero for both WLS and OLS if\u00c6=0, OLS would be more accurate\nit still works better than the passive learning scheme (C). When\u00c6=0:04 and\u00c6=0:5 the\n\nmethod (A). Actually, in this case, training input densities that approximately minimize\n\nthan WLS. 
Although the proposed method (A) is outperformed by the existing method (B),\n\nIn Table 1, the mean and standard deviation of the generalization error obtained by each\n\nis caused by the difference of WLS and OLS: WLS generally has larger variance than\n\n\u201capproximately correct\u201d\n\n\u201ccorrectly speci\ufb01ed\u201d\n\n\u201cmisspeci\ufb01ed\u201d\n\n0.07\n0.06\n0.05\n0.04\n0.03\n\n0.07\n0.06\n0.05\n0.04\n0.03\n\n0.07\n0.06\n0.05\n0.04\n0.03\n\n0.06\n0.05\n0.04\n0.03\n0.02\n\n0.06\n0.05\n0.04\n0.03\n0.02\n\n0.06\n0.05\n0.04\n0.03\n0.02\n\n0.5\n0.4\n0.3\n0.2\n0.1\n\n12\n\n10\n\n8\n\n6\n\n1.2\n\n1.6\n\n1.2\n\n1.6\n\n1.2\n\n1.6\n\n1.2\n\n1.6\n\n1.2\n\n1.6\n\n1.2\n\n1.6\n\n1.2\n\n1.6\n\n1.2\n\n1.6\n\n0.8\n\n1.2\n\n0.8\n\n1.2\n\n0.8\n\n1.2\n\n0.8\n\n1.2\n\n1.6\n\n2\n\n2\n\n2\n\n2\n\n2\n\n2\n\n2\n\n2\n\n2\n\n2\n\n2\n\n2\n\n2.4\n\nJ\u2212OLS\n\n2.4\n\nJ\u2212OLS\n\n2.4\n\nJ\u2212OLS\n\n2.4\n\nJ\u2212WLS\n\n2.4\n\nJ\u2212WLS\n\n2.4\n\nJ\u2212WLS\n\n2.4\n\nG\u2212OLS\n\n2.4\n\nG\u2212OLS\n\n6\n\n5\n\n4\n\n3\n\n5\n\n4\n\n3\n\n2\n\n6\n\n5\n\n4\n\n3\n\n5\n\n4\n\n3\n\n2\n\n0.8\n\n0.8\n\nx 10\u22123\n\n0.8\n\nx 10\u22123\n\n0.8\n\n0.8\nx 10\u22123\n\n0.8\n\n0.8\nx 10\u22123\n\n0.8\n\n1.6\nc\n\n2.4\n\n1.6\nc\n\n2.4\n\n1.6\nc\n\n2.4\n\nG\u2212WLS\n\nG\u2212WLS\n\nG\u2212WLS\n\n.\n\n2.4\n\nx 10\u22123\n\nG\u2212OLS\n\n2\n\nproposed method (A) gives signi\ufb01cantly smaller errors than other methods.\n\nOverall, we found that for all three cases, the proposed method (A) works reasonably well\nand outperforms the passive learning scheme (C). On the other hand, the existing method\n(B) works excellently in the correctly speci\ufb01ed case, although it tends to perform poorly\nonce the correctness of the model is violated. 
Therefore, the proposed method (A) is found to be robust against the misspecification of models and is thus reliable.

Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank-8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes 8192 samples, consisting of 8-dimensional input values and 1-dimensional output values. For convenience, every attribute is normalized into [0, 1].

Suppose we are given all 8192 input points (i.e., unlabeled samples); note that the output values are unknown. From the pool of unlabeled samples, we choose n = 1000 input points {x_i}_{i=1}^{1000} for training and observe the corresponding output values {y_i}_{i=1}^{1000}. The task is to predict the output values of all unlabeled samples.

In this experiment, the test input density p_t(x) is unknown, so we estimate it using the independent Gaussian density

  p_t(x) = ( 2π γ̂²_MLE )^{−d/2} exp( −‖x − μ̂_MLE‖² / (2 γ̂²_MLE) ),

where μ̂_MLE and γ̂_MLE are the maximum likelihood estimates of the mean and standard deviation obtained from all unlabeled samples. Let b = 50 and the basis functions be

  φ_i(x) = exp( −‖x − t_i‖² / 2 )   for i = 1, 2, ..., 50,

where {t_i}_{i=1}^{50} are template points randomly chosen from the pool of unlabeled samples. We select the training input density p_x(x) from the independent Gaussian densities with mean μ̂_MLE and standard deviation cγ̂_MLE, where c = 0.7, 0.75, 0.8, ..., 2.4.

In this simulation, we cannot create training input points at arbitrary locations because we only have the 8192 samples. Therefore, we first create temporary input points following the determined training input density, and then choose, from the pool of unlabeled samples, the input points closest to the temporary input points. For each data set, we repeat this simulation 100 times, changing the template points {t_i}_{i=1}^{50} in each run.

The means and standard deviations of the test error over 100 runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for five data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct for these three data sets.

Table 2: The means and standard deviations of the test error for the DELVE data sets. All values in the table are multiplied by 10³.

        Kin-8fm       Kin-8fh       Kin-8nm        Kin-8nh
  (A)   0.31±0.04     2.10±0.05     24.66±1.20     37.98±1.11
  (B)   0.44±0.07     2.21±0.09     27.67±1.50     39.71±1.38
  (C)   0.35±0.04     2.20±0.06     26.34±1.35     39.84±1.35

        Bank-8fm      Bank-8fh      Bank-8nm       Bank-8nh
  (A)   1.59±0.07     5.90±0.16     0.72±0.04      3.68±0.09
  (B)   1.49±0.06     5.63±0.13     0.85±0.06      3.60±0.09
  (C)   1.70±0.08     6.27±0.24     0.81±0.06      3.89±0.14

Figure 3: Mean relative performance of (A) and (B) compared with (C). For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over 100 runs. Note that the error bars were reasonably small, so they were omitted.
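The pool-based matching step can be sketched as follows on synthetic data. The paper does not spell out the matching algorithm, so the greedy without-replacement nearest-neighbour rule below is my assumption:

```python
import numpy as np

def choose_from_pool(pool, temp_points):
    # Match each temporary input point to its nearest neighbour in the
    # pool of unlabeled samples, without reusing a pool point
    chosen, taken = [], set()
    for t in temp_points:
        dists = np.linalg.norm(pool - t, axis=1)
        for idx in np.argsort(dists):
            if idx not in taken:
                taken.add(idx)
                chosen.append(idx)
                break
    return np.array(chosen)

rng = np.random.default_rng(0)
pool = rng.uniform(0, 1, size=(200, 2))     # stand-in for the 8192 samples
temp = rng.normal(0.5, 0.2, size=(10, 2))   # temporary points drawn from p_x
idx = choose_from_pool(pool, temp)          # indices of the selected inputs
```

The selected indices then play the role of the training input points whose output values are observed.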
This result implies that the proposed method (A) is slightly better than the existing method (B).

Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C). It shows that (A) outperforms (C) for all eight data sets, while (B) is comparable to or outperformed by (C) for five data sets. Therefore, the proposed method (A) is overall shown to work better than the other schemes.

6 Conclusions

We argued that active learning is essentially a situation under covariate shift: the training input density is different from the test input density. When the model used for learning is correctly specified, the covariate shift does not matter. However, for misspecified models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on weighted least-squares learning.

The numerical study showed that the existing method works better than the proposed method if the model is correctly specified. However, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well and consistently outperformed the passive learning scheme. Therefore, the proposed method would be robust against the misspecification of models and is thus reliable.

The proposed method can be theoretically justified if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspecified models. A natural future direction would therefore be to devise an active learning method which has a theoretical guarantee for totally misspecified models. It is also important to note that when the model is totally misspecified, even learning with optimal training input points would not be successful anyway. In such cases, it is of course important to carry out model selection. In active learning research, including the present paper, however, the locations of training input points are designed for a single model at hand. That is, the model should have been chosen before performing active learning. Devising a method for simultaneously optimizing the model and the location of training input points would be a more important and promising future direction.

Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial financial support.

References

[1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145, 1996.

[2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.

[3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17-26, 2000.

[4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996.

[5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.

[6] M. Sugiyama. Active learning for misspecified models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.
", "award": [], "sourceid": 2944, "authors": [{"given_name": "Masashi", "family_name": "Sugiyama", "institution": null}]}