{"title": "Statistical Analysis of Semi-Supervised Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 801, "page_last": 808, "abstract": null, "full_text": "Statistical Analysis of Semi-Supervised Regression\n\nJohn Lafferty\n\nComputer Science Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nLarry Wasserman\n\nDepartment of Statistics\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nlafferty@cs.cmu.edu\n\nlarry@stat.cmu.edu\n\nAbstract\n\nSemi-supervised methods use unlabeled data in addition to labeled data to con-\nstruct predictors. While existing semi-supervised methods have shown some\npromising empirical performance, their development has been based largely based\non heuristics. In this paper we study semi-supervised learning from the viewpoint\nof minimax theory. Our \ufb01rst result shows that some common methods based on\nregularization using graph Laplacians do not lead to faster minimax rates of con-\nvergence. Thus, the estimators that use the unlabeled data do not have smaller\nrisk than the estimators that use only labeled data. We then develop several new\napproaches that provably lead to improved performance. The statistical tools of\nminimax analysis are thus used to offer some new perspective on the problem of\nsemi-supervised learning.\n\n1 Introduction\n\nSuppose that we have labeled data L = {(X1, Y1), . . . (Xn, Yn)} and unlabeled data U =\n{Xn+1, . . . X N} where N (cid:29) n and Xi \u2208 RD. Ordinary regression and classi\ufb01cation techniques\nuse L to predict Y from X. Semi-supervised methods also use the unlabeled data U in an attempt\nto improve the predictions. 
To justify these procedures, it is common to invoke one or both of the following assumptions:\n\nManifold Assumption (M): The distribution of X lives on a low dimensional manifold.\n\nSemi-Supervised Smoothness Assumption (SSS): The regression function m(x) = E(Y | X = x) is very smooth where the density p(x) of X is large. In particular, if there is a path connecting Xi and Xj on which p(x) is large, then Yi and Yj should be similar with high probability.\n\nWhile these assumptions are somewhat intuitive, and synthetic examples can easily be constructed to demonstrate good performance of various techniques, there has been very little theoretical analysis of semi-supervised learning that rigorously shows how the assumptions lead to improved performance of the estimators.\n\nIn this paper we provide a statistical analysis of semi-supervised methods for regression, and propose some new techniques that provably lead to better inferences, under appropriate assumptions. In particular, we explore precise formulations of SSS, which is motivated by the intuition that high density level sets correspond to clusters of similar objects, but as stated above is quite vague. To the best of our knowledge, no papers have made the assumption precise and then explored its consequences in terms of rates of convergence, with the exception of one of the first papers on semi-supervised learning, by Castelli and Cover (1996), which evaluated a simple mixture model, and the recent paper of Rigollet (2006) in the context of classification. This situation is striking, given the level of activity in this area within the machine learning community; for example, the recent survey of semi-supervised learning by Zhu (2006) contains 163 references.\n\nAmong our findings are:\n\n1. Under the manifold assumption M, the semi-supervised smoothness assumption SSS is superfluous. 
This point was made heuristically by Bickel and Li (2006), but we show that in fact ordinary regression methods are automatically adaptive if the distribution of X concentrates on a manifold.\n\n2. Without the manifold assumption M, the semi-supervised smoothness assumption SSS as usually defined is too weak, and current methods don't lead to improved inferences. In particular, methods that use regularization based on graph Laplacians do not achieve faster rates of convergence.\n\n3. Assuming specific conditions that relate m and p, we develop new semi-supervised methods that lead to improved estimation. In particular, we propose estimators that reduce bias by estimating the Hessian of the regression function, improve the choice of bandwidths using unlabeled data, and estimate the regression function on level sets.\n\nThe focus of the paper is on a theoretical analysis of semi-supervised regression techniques, rather than the development of practical new algorithms and techniques. While we emphasize regression, most of our results have analogues for classification. Our intent is to bring the statistical perspective of minimax analysis to bear on the problem, in order to study the interplay between the labeled sample size and the unlabeled sample size, and between the regression function and the data density. By studying simplified versions of the problem, our analysis suggests how precise formulations of assumptions M and SSS can be made and exploited to lead to improved estimators.\n\n2 Preliminaries\n\nThe data are (X1, Y1, R1), . . . , (XN, YN, RN) where Ri ∈ {0, 1} and we observe Yi only if Ri = 1. The labeled data are L = {(Xi, Yi) : Ri = 1} and the unlabeled data are U = {(Xi, Yi) : Ri = 0}. For convenience, assume that the data are labeled so that Ri = 1 for i = 1, . . . , n and Ri = 0 for i = n + 1, . . . , N. 
Thus, the labeled sample size is n, and the unlabeled sample size is u = N − n.\n\nLet p(x) be the density of X and let m(x) = E(Y | X = x) denote the regression function. Assume that R ⊥⊥ Y | X (missing at random) and that Ri | Xi ∼ Bernoulli(π(Xi)). Finally, let µ = P(Ri = 1) = ∫ π(x) p(x) dx. For simplicity we assume that π(x) = µ for all x. The missing at random assumption R ⊥⊥ Y | X is crucial, although this point is rarely emphasized in the machine learning literature.\n\nIt is clear that without some further conditions, the unlabeled data are useless. The key assumption we need is that there is some correspondence between the shape of the regression function m and the shape of the data density p.\n\nWe will use minimax theory to judge the quality of an estimator. Let R denote a class of regression functions and let F denote a class of density functions. In the classical setting, we observe labeled data (X1, Y1), . . . , (Xn, Yn). The pointwise minimax risk, or mean squared error (MSE), is defined by\n\nRn(x) = inf_{m̂n} sup_{m∈R, p∈F} E(m̂n(x) − m(x))^2    (1)\n\nwhere the infimum is over all estimators. The global minimax risk is defined by\n\nRn = inf_{m̂n} sup_{m∈R, p∈F} E ∫ (m̂n(x) − m(x))^2 dx.    (2)\n\nA typical assumption is that R is the Sobolev space of order two, meaning essentially that m has smooth second derivatives. In this case we have^1 Rn ≍ n^{−4/(4+D)}. The minimax rate is achieved by kernel estimators and local polynomial estimators. In particular, for kernel estimators, if we use a product kernel with common bandwidth hn for each variable, choosing hn ∼ n^{−1/(4+D)} yields an\n\n^1 We write an ≍ bn to mean that an/bn is bounded away from 0 and infinity for large n. 
We have suppressed some technicalities such as moment assumptions on ε = Y − m(X).\n\nestimator with the minimax rate. The difficulty is that the rate Rn ≍ n^{−4/(4+D)} is extremely slow when D is large.\n\nIn more detail, let C > 0, let B be a positive definite matrix, and define\n\nR = { m : |m(x) − m(x0) − (x − x0)^T ∇m(x0)| ≤ (C/2) (x − x0)^T B (x − x0) }    (3)\n\nF = { p : p(x) ≥ b > 0, |p(x1) − p(x2)| ≤ c ‖x1 − x2‖_2^α }.    (4)\n\nFan (1993) shows that the local linear estimator is asymptotically minimax for this class. This estimator is given by m̂n(x) = â0, where (â0, â1) minimizes Σ_{i=1}^n (Yi − a0 − a1^T(Xi − x))^2 K(H^{−1/2}(Xi − x)), where K is a symmetric kernel and H is a matrix of bandwidths.\n\nThe asymptotic MSE of the local linear estimator m̂(x) using the labeled data is\n\nR(H) = ( (1/2) µ2(K) tr(Hm(x) H) )^2 + ν0 σ^2 / ( n |H|^{1/2} p(x) ) + o(tr(H))    (5)\n\nwhere Hm(x) is the Hessian of m at x, µ2(K) = ∫ u^2 K(u) du, and ν0 = ∫ K^2(u) du. The optimal bandwidth matrix H* is given by\n\nH* = ( ν0 σ^2 |Hm|^{1/2} / ( µ2^2(K) n D p(x) ) )^{2/(D+4)} (Hm)^{−1}    (6)\n\nand R(H*) = O(n^{−4/(4+D)}). This result is important to what follows, because it suggests that if the Hessian Hm of the regression function is related to the Hessian Hp of the data density, one may be able to estimate the optimal bandwidth matrix from unlabeled data in order to reduce the risk.\n\n3 The Manifold Assumption\n\nIt is common in the literature to invoke both M and SSS. But if M holds, SSS is not needed. 
This is argued by Bickel and Li (2006), who say, \u201cWe can unwittingly take advantage of low dimensional structure without knowing it.\u201d\n\nSuppose X ∈ R^D has support on a manifold M with dimension d < D. Let m̂h be the local linear estimator with diagonal bandwidth matrix H = h^2 I. Then Bickel and Li show that the bias and variance are\n\nb(x) = h^2 J1(x)(1 + oP(1)) and v(x) = J2(x)/(n h^d) (1 + oP(1))    (7)\n\nfor some functions J1 and J2. Choosing h ≍ n^{−1/(4+d)} yields a risk of order n^{−4/(4+d)}, which is the optimal rate for data that lie on a manifold of dimension d.\n\nTo use the above result we would need to know d. Bickel and Li argue heuristically that the following procedure will lead to a reasonable bandwidth. First, estimate d using the procedure in Levina and Bickel (2005). Now let B = {λ1/n^{1/(d̂+4)}, . . . , λB/n^{1/(d̂+4)}} be a set of bandwidths, scaling the asymptotic order n^{−1/(d̂+4)} by different constants. Finally, choose the bandwidth h ∈ B that minimizes a local cross-validation score.\n\nWe now show that, in fact, one can skip the step of estimating d. Let E1, . . . , En be independent Bernoulli(θ = 1/2) random variables. Split the data into two groups, so that I0 = {i : Ei = 0} and I1 = {i : Ei = 1}. Let H = {n^{−1/(4+d)} : 1 ≤ d ≤ D}. Construct m̂h for h ∈ H using the data in I0, and estimate the risk from I1 by setting R̂(h) = |I1|^{−1} Σ_{i∈I1} (Yi − m̂h(Xi))^2. Finally, let ĥ minimize R̂(h) and set m̂ = m̂ĥ. For simplicity, let us assume that both Yi and Xi are bounded by a finite constant B.\n\nTheorem 1. Suppose that |Yi| ≤ B and |Xij| ≤ B for all i and j. Assume the conditions in Bickel and Li (2006). Suppose that the data density p(x) is supported on a manifold of dimension d ≥ 4. Then we have that\n\nE(m̂(x) − m(x))^2 = Õ(n^{−4/(4+d)}).    (8)\n\nThe notation Õ allows for logarithmic factors in n.\n\nProof. 
The risk is, up to a constant, R(h) = E(Y − m̂h(X))^2, where (X, Y) is a new pair and Y = m(X) + ε. Note that (Y − m̂h(X))^2 = Y^2 − 2Y m̂h(X) + m̂h^2(X), so R(h) = E(Y^2) − 2E(Y m̂h(X)) + E(m̂h^2(X)). Let n1 = |I1|. Then,\n\nR̂(h) = (1/n1) Σ_{i∈I1} Yi^2 − (2/n1) Σ_{i∈I1} Yi m̂h(Xi) + (1/n1) Σ_{i∈I1} m̂h^2(Xi).    (9)\n\nBy conditioning on the data in I0 and applying Bernstein's inequality, we have\n\nP( max_{h∈H} |R̂(h) − R(h)| > ε ) ≤ Σ_{h∈H} P( |R̂(h) − R(h)| > ε ) ≤ D e^{−ncε^2}    (10)\n\nfor some c > 0. Setting εn = √(C log n / n) for suitably large C, we conclude that\n\nP( max_{h∈H} |R̂(h) − R(h)| > √(C log n / n) ) → 0.    (11)\n\nLet h* minimize R(h) over H. Then, except on a set of probability tending to 0,\n\nR(ĥ) ≤ R̂(ĥ) + √(C log n / n) ≤ R̂(h*) + √(C log n / n) ≤ R(h*) + 2√(C log n / n)    (12)\n\n= O(n^{−4/(4+d)}) + 2√(C log n / n) = Õ(n^{−4/(4+d)})    (13)\n\nwhere we used the assumption d ≥ 4 in the last equality. If d = 4 then O(√(log n / n)) = Õ(n^{−4/(4+d)}); if d > 4 then O(√(log n / n)) = o(n^{−4/(4+d)}). □\n\nWe conclude that ordinary regression methods are automatically adaptive, and achieve the low-dimensional minimax rate if the distribution of X concentrates on a manifold; there is no need for semi-supervised methods in this case. Similar results apply to classification.\n\n4 Kernel Regression with Laplacian Regularization\n\nIn practice, it is unlikely that the distribution of X would be supported exactly on a low-dimensional manifold. 
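The data-splitting bandwidth selection of Theorem 1 can be sketched as follows. This is only an illustration of the procedure, not the authors' implementation: a Nadaraya–Watson smoother stands in for the local linear estimator, the Gaussian kernel is our choice, and all function names are ours.

```python
import numpy as np

def nw_smoother(X_tr, Y_tr, X_te, h):
    # Nadaraya-Watson kernel smoother (a simple stand-in for the local
    # linear estimator used in the paper), Gaussian kernel, bandwidth h.
    d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * h * h))
    return (W @ Y_tr) / W.sum(axis=1)

def adaptive_bandwidth_fit(X, Y, D, seed=0):
    """Data-splitting selection over H = {n^(-1/(4+d)) : 1 <= d <= D}:
    fit on the I0 half, estimate the risk on the I1 half, keep the best h.
    No estimate of the manifold dimension d is needed."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    split = rng.random(n) < 0.5          # E_i ~ Bernoulli(1/2)
    I0, I1 = ~split, split
    grid = [n ** (-1.0 / (4 + d)) for d in range(1, D + 1)]
    risks = [np.mean((Y[I1] - nw_smoother(X[I0], Y[I0], X[I1], h)) ** 2)
             for h in grid]
    h_hat = grid[int(np.argmin(risks))]
    return h_hat, lambda X_new: nw_smoother(X[I0], Y[I0], X_new, h_hat)
```

The selected bandwidth always lies on the grid indexed by the candidate dimensions 1, . . . , D, which is the point of the theorem: the procedure adapts to the intrinsic dimension without estimating it.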
Nevertheless, the shape of the data density p(x) might provide information about the regression function m(x), in which case the unlabeled data are informative.\n\nSeveral recent methods for semi-supervised learning attempt to exploit the smoothness assumption SSS using regularization operators defined with respect to graph Laplacians (Zhu et al., 2003; Zhou et al., 2004; Belkin et al., 2005). The technique of Zhu et al. (2003) is based on Gaussian random fields and harmonic functions defined with respect to discrete Laplace operators. To express this method in statistical terms, recall that standard kernel regression corresponds to the locally constant estimator\n\nm̂n(x) = arg min_{m(x)} Σ_{i=1}^n Kh(Xi, x)(Yi − m(x))^2 = Σ_{i=1}^n Kh(Xi, x) Yi / Σ_{i=1}^n Kh(Xi, x)    (14)\n\nwhere Kh is a symmetric kernel depending on bandwidth parameters h. In the semi-supervised approach of Zhu et al. (2003), the locally constant estimate m̂(x) is formed using not only the labeled data, but also the estimates at the unlabeled points. Suppose that the first n data points (X1, Y1), . . . , (Xn, Yn) are labeled, and the next u = N − n points are unlabeled, Xn+1, . . . , Xn+u. The semi-supervised regression estimate (m̂(X1), m̂(X2), . . . , m̂(XN)) is then given by\n\nm̂ = arg min_m Σ_{i=1}^N Σ_{j=1}^N Kh(Xi, Xj)(m(Xi) − m(Xj))^2    (15)\n\nwhere the minimization is carried out subject to the constraint m(Xi) = Yi, i = 1, . . . , n. Thus, the estimates are coupled, unlike the standard kernel regression estimate (14), where the estimate at each point x can be formed independently, given the labeled data.\n\nThe estimator can be written in closed form as a linear smoother m̂ = C^{−1} B Y = G Y, where m̂ = (m̂(Xn+1), . . . , m̂(Xn+u))^T is the vector of estimates over the unlabeled test points, and Y = (Y1, . . . , Yn)^T is the vector of labeled values. 
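This closed-form smoother can be sketched as follows. This is a minimal illustration, not the authors' code: the Gaussian kernel is our choice, the function name is ours, and we solve the harmonic block system directly, with the sign of the off-diagonal Laplacian block written out explicitly.

```python
import numpy as np

def harmonic_estimate(X, y_labeled, n_labeled, h=1.0):
    """Harmonic-function smoother (sketch): clamp m(X_i) = Y_i at the
    labeled points and solve the Laplacian block system for the unlabeled
    values, so each unlabeled m(X_j) is a kernel-weighted average of its
    neighbors' values. Assumes the first n_labeled rows of X are labeled."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * h * h))      # Gaussian kernel weights (our choice)
    np.fill_diagonal(K, 0.0)
    lap = np.diag(K.sum(axis=1)) - K     # combinatorial Laplacian
    u = slice(n_labeled, len(X))         # unlabeled block
    lab = slice(0, n_labeled)            # labeled block
    # Block system: the unlabeled block of the Laplacian times m_u balances
    # the labeled block times Y, i.e. m_u = C^{-1} B Y up to the sign
    # convention for the off-diagonal block.
    return np.linalg.solve(lap[u, u], -lap[u, lab] @ y_labeled)
```

On two well-separated clusters with one labeled point each, the unlabeled estimates are pulled toward the label of their own cluster, which is the behavior the SSS assumption is meant to capture.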
The (N − n) × (N − n) matrix C and the (N − n) × n matrix B denote blocks of the combinatorial Laplacian on the data graph corresponding to the labeled and unlabeled data:\n\nΔ = ( A  B^T ; B  C )    (16)\n\nwhere the Laplacian Δ = (Δij) has entries\n\nΔij = Σk Kh(Xi, Xk) if i = j, and Δij = −Kh(Xi, Xj) otherwise.    (17)\n\nThis expresses the effective kernel G in terms of geometric objects such as heat kernels for the discrete diffusion equations (Smola and Kondor, 2003).\n\nThis estimator assumes the noise is zero, since m̂(Xi) = Yi for i = 1, . . . , n. To work in the standard model Y = m(X) + ε, the natural extension of the harmonic function approach is manifold regularization (Belkin et al., 2005; Sindhwani et al., 2005; Tsang and Kwok, 2006). Here the estimator is chosen to minimize the regularized empirical risk functional\n\nRγ(m) = Σ_{i=1}^N Σ_{j=1}^n KH(Xi, Xj)(Yj − m(Xi))^2 + γ Σ_{i=1}^N Σ_{j=1}^N KH(Xi, Xj)(m(Xj) − m(Xi))^2    (18)\n\nwhere H is a matrix of bandwidths and KH(Xi, Xj) = K(H^{−1/2}(Xi − Xj)). When γ = 0 the standard kernel smoother is obtained. The regularization term is\n\nJ(m) ≡ Σ_{i=1}^N Σ_{j=1}^N KH(Xi, Xj)(m(Xj) − m(Xi))^2 = 2 m^T Δ m    (19)\n\nwhere Δ is the combinatorial Laplacian associated with KH. This regularization term is motivated by the semi-supervised smoothness assumption\u2014it favors functions m for which m(Xi) is close to m(Xj) when Xi and Xj are similar, according to the kernel function. The name manifold regularization is justified by the fact that (1/2) J(m) → ∫_M ‖∇m(x)‖^2 dMx, the energy of m over the manifold.\n\nWhile this regularizer has primarily been used for SVM classifiers (Belkin et al., 2005), it can be used much more generally. For an appropriate choice of γ, minimizing the functional (18) can be expected to give essentially the same results as the harmonic function approach that minimizes (15).\n\nTheorem 2. Suppose that D ≥ 2. Let m̃H,γ minimize (18), and let Δp,H be the differential operator defined by\n\nΔp,H f(x) = (1/2) trace(Hf(x) H) + ∇p(x)^T H ∇f(x) / p(x).    (20)\n\nThen the asymptotic MSE of m̃H,γ(x) is\n\nM̃ = c1 µ σ^2 / ( n(µ + γ) p(x) |H|^{1/2} ) + ( c2 ((µ + γ)/µ) (I − (γ/µ) Δp,H)^{−1} Δp,H m(x) )^2 + o(tr(H))    (21)\n\nwhere µ = P(Ri = 1).\n\nThe proof is given in the full version of the paper. Note that the bias of the standard kernel estimator, in the notation of this theorem, is b(x) = c2 Δp,H m(x), and the variance is V(x) = c1/(n p(x)|H|^{1/2}). Thus, this result agrees with the standard supervised MSE in the special case γ = 0. It follows from this theorem that M̃ = M + o(tr(H)) where M is the usual MSE for a kernel estimator. Therefore, the minimum of M̃ has the same leading order in H as the minimum of M.\n\nThe implication of this theorem is that the estimator that uses Laplacian regularization has the same rate of convergence as the usual kernel estimator, and thus the unlabeled data have not improved the estimator asymptotically.\n\n5 Semi-Supervised Methods With Improved Rates\n\nThe previous result is negative, in the sense that it shows unlabeled data do not help to improve the rate of convergence. This is because the bias and variance of a manifold regularized kernel estimator are of the same order in H as the bias and variance of standard kernel regression. We now demonstrate how improved rates of convergence can be obtained by formulating and exploiting appropriate SSS assumptions. 
We describe three different approaches: semi-supervised bias reduction, improved bandwidth selection, and averaging over level sets.\n\n5.1 Semi-Supervised Bias Reduction\n\nWe first show a positive result by formulating an SSS assumption that links the shape of p to the shape of m, by positing a relationship between the Hessian Hm of m and the Hessian Hp of p. Under this SSS assumption, we can improve the rate of convergence by reducing the bias.\n\nTo illustrate the idea, take p(x) known (i.e., N = ∞) and suppose that Hm(x) = Hp(x). Define\n\nm̃n(x) = m̂n(x) − (1/2) µ2(K) tr(Hm(x) H)    (22)\n\nwhere m̂n(x) is the local linear estimator.\n\nTheorem 3. The risk of m̃n(x) is O(n^{−8/(8+D)}).\n\nProof. First note that the variance of the estimator m̃n, conditional on X1, . . . , Xn, is Var(m̃n(x) | X1, . . . , Xn) = Var(m̂n(x) | X1, . . . , Xn). Now, the term (1/2) µ2(K) tr(Hm(x) H) is precisely the bias of the local linear estimator, under the SSS assumption that Hp(x) = Hm(x). Thus, the first order bias term has been removed. The result now follows from the fact that the next term in the bias of the local linear estimator is of order O(tr(H)^2). □\n\nBy assuming that 2ℓ derivatives are matched, we get the rate n^{−(4+4ℓ)/(4+4ℓ+D)}. When p is estimated from the data, the risk will be inflated by N^{−4/(4+D)}, assuming standard smoothness assumptions on p. This term will not dominate the improved rate n^{−(4+4ℓ)/(4+4ℓ+D)} as long as N > n^ℓ. The assumption that Hm = Hp can be replaced by the more realistic assumption that Hm = g(p; β) for some parameterized family of functions g(·; β). Semiparametric methods can then be used to estimate β. This approach is taken in the following section.\n\n5.2 Improved Bandwidth Selection\n\nLet Ĥ be the estimated bandwidth using the labeled data. 
We will now show how a bandwidth Ĥ* can be estimated using the labeled and unlabeled data together, such that, under appropriate assumptions,\n\nlim sup_{n→∞} |R(Ĥ*) − R(H*)| / |R(Ĥ) − R(H*)| = 0, where H* = arg min_H R(H).    (23)\n\nTherefore, the unlabeled data allow us to construct an estimator that gets closer to the oracle risk. The improvement is weaker than the bias adjustment method. But it has the virtue that the optimal local linear rate is maintained even if the proposed model linking Hm to p is incorrect.\n\nWe begin in one dimension to make the ideas clear. Let m̂H denote the local linear estimator with bandwidth H ∈ R, H > 0. To use the unlabeled data, note that the optimal (global) bandwidth is H* = (c2 B/(4n c1 A))^{1/5} where A = ∫ m″(x)^2 dx and B = ∫ dx/p(x). Let p̂(x) be the kernel density estimator of p using X1, . . . , XN and bandwidth h = O(N^{−1/5}). We assume\n\n(SSS) m″(x) = Gθ(p) for some function G depending on finitely many parameters θ.\n\nNow let m̂″(x) = G_θ̂(p̂), and define Ĥ* = (c2 B̂/(4n c1 Â))^{1/5} where Â = ∫ (m̂″(x))^2 dx and B̂ = ∫ dx/p̂(x).\n\nTheorem 4. Suppose that m̂″(x) − m″(x) = OP(N^{−β}) where β > 2/5. Let N = N(n) → ∞ as n → ∞. If N/n^{1/4} → ∞, then\n\nlim sup_{n→∞} |R(Ĥ*) − R(H*)| / |R(Ĥ) − R(H*)| = 0.    (24)\n\nProof. The risk is\n\nR(H) = c1 H^4 ∫ (m″(x))^2 dx + (c2/(nH)) ∫ dx/p(x) + o(1/(nH)).    (25)\n\nThe oracle bandwidth is H* = c3/n^{1/5} and then R(H*) = O(n^{−4/5}). Now let Ĥ be the bandwidth estimated by cross-validation. 
Then, since R′(H*) = 0 and H* = O(n^{−1/5}), we have\n\nR(Ĥ) = R(H*) + ((Ĥ − H*)^2/2) R″(H*) + O(|Ĥ − H*|^3)    (26)\n\n= R(H*) + (Ĥ − H*)^2 O(n^{−2/5}) + O(|Ĥ − H*|^3).    (27)\n\nFrom Girard (1998), Ĥ − H* = OP(n^{−3/10}). Hence, R(Ĥ) − R(H*) = OP(n^{−1}). Also, p̂(x) − p(x) = O(N^{−2/5}). Since m̂″(x) − m″(x) = OP(N^{−β}),\n\nĤ* − H* = OP(N^{−2/5}/n^{1/5}) + OP(N^{−β}/n^{1/5}).    (28)\n\nThe first term is oP(n^{−3/10}) since N > n^{1/4}. The second term is oP(n^{−3/10}) since β > 2/5. Thus R(Ĥ*) − R(H*) = oP(1/n) and the result follows. □\n\nThe proof in the multidimensional case is essentially the same as in the one dimensional case, except that we use the multivariate version of Girard's result, namely, H* − Ĥ = OP(n^{−(D+2)/(2(D+4))}). This leads to the following result.\n\nTheorem 5. Let N = N(n). If N/n^{D/4} → ∞ and θ̂ − θ = OP(N^{−β}) for some β > 2/(4+D), then\n\nlim sup_{n→∞} |R(Ĥ*) − R(H*)| / |R(Ĥ) − R(H*)| = 0.    (29)\n\n5.3 Averaging over Level Sets\n\nRecall that SSS is motivated by the intuition that high density level sets should correspond to clusters of similar objects. Another approach to quantifying SSS is to make this cluster assumption explicit. Rigollet (2006) shows one way to do this in classification. Here we focus on regression.\n\nSuppose that L = {x : p(x) > λ} can be decomposed into a finite number of connected, compact, convex sets C1, . . . , Cg, where λ is chosen so that Lc has negligible probability. For N large we can replace L with L̂ = {x : p̂(x) > λ} with small loss in accuracy, where p̂ is an estimate of p using the unlabeled data; see Rigollet (2006) for details. Let kj = Σ_{i=1}^n I(Xi ∈ Cj) and, for x ∈ Cj, define\n\nm̂(x) = (1/kj) Σ_{i=1}^n Yi I(Xi ∈ Cj).    (30)\n\nThus, m̂(x) simply averages the labels of the data that fall in the set to which x belongs. If the regression function is slowly varying over this set, the risk should be small. A similar estimator is considered by Cortes and Mohri (2006), but they do not provide estimates of the risk.\n\nTheorem 6. The risk of m̂(x) for x ∈ L ∩ Cj is bounded by\n\nO(1/(nπj)) + O(δj^2 ξj^2)    (31)\n\nwhere δj = sup_{x∈Cj} ‖∇m(x)‖, ξj = diameter(Cj), and πj = P(X ∈ Cj).\n\nProof. Since the kj are Binomial, kj = nπj + o(1) almost surely. Thus, the variance of m̂(x) is O(1/(nπj)). The mean, given X1, . . . , Xn, is\n\n(1/kj) Σ_{i : Xi∈Cj} m(Xi) = m(x) + (1/kj) Σ_{i : Xi∈Cj} (m(Xi) − m(x)).    (32)\n\nNow m(Xi) − m(x) = (Xi − x)^T ∇m(ui) for some ui between x and Xi. Hence, |m(Xi) − m(x)| ≤ ‖Xi − x‖ sup_{x∈Cj} ‖∇m(x)‖, and so the bias is bounded by δj ξj. □\n\nThis result reveals an interesting bias-variance tradeoff. Making λ smaller decreases the variance and increases the bias. Suppose the two terms are balanced at λ = λ*. Then we will beat the usual rate of convergence if πj(λ*) > n^{−D/(4+D)}.\n\n6 Conclusion\n\nSemi-supervised methods have been very successful in many problems. Our results suggest that the standard explanations for this success are not correct. We have indicated some new approaches to understanding and exploiting the relationship between the labeled and unlabeled data. Of course, we make no claim that these are the only ways of incorporating unlabeled data. 
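As a concrete illustration, the level-set averaging estimator of Section 5.3 can be sketched in one dimension. This is a minimal sketch, not the authors' implementation: the KDE bandwidth, the level λ, the evaluation grid, and all function names are our own illustrative choices, and grid intervals stand in for the connected components Cj.

```python
import numpy as np

def kde(x_eval, X_all, h):
    # Gaussian kernel density estimate built from all points,
    # labeled and unlabeled alike.
    z = (x_eval[:, None] - X_all[None, :]) / h
    return np.exp(-0.5 * z * z).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def level_set_components(X_all, h, lam, grid):
    """The sets C_j, approximated by the maximal grid intervals on which
    the density estimate exceeds the level lam."""
    above = kde(grid, X_all, h) > lam
    comps, start = [], None
    for i, a in enumerate(above):
        if a and start is None:
            start = i
        elif not a and start is not None:
            comps.append((grid[start], grid[i - 1]))
            start = None
    if start is not None:
        comps.append((grid[start], grid[-1]))
    return comps

def level_set_predict(x, X_lab, Y_lab, comps):
    # Average the labels of the labeled points in the component containing x.
    for lo, hi in comps:
        if lo <= x <= hi:
            mask = (X_lab >= lo) & (X_lab <= hi)
            if mask.any():
                return Y_lab[mask].mean()
    return Y_lab.mean()   # fallback off the estimated level set
```

With two well-separated clusters of unlabeled points, thresholding the density produces two components, and each prediction is simply the average label within the component, as in (30).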
But our results indicate that decoupling the manifold assumption and the semi-supervised smoothness assumption is crucial to clarifying the problem.\n\n7 Acknowledgments\n\nWe thank Partha Niyogi for several interesting discussions. This work was supported in part by NSF grant CCF-0625879.\n\nReferences\n\nBELKIN, M., NIYOGI, P. and SINDHWANI, V. (2005). On manifold regularization. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005).\n\nBICKEL, P. and LI, B. (2006). Local polynomial regression on unknown manifolds. Tech. rep., Department of Statistics, UC Berkeley.\n\nCASTELLI, V. and COVER, T. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. on Info. Theory 42 2101\u20132117.\n\nCORTES, C. and MOHRI, M. (2006). On transductive regression. In Advances in Neural Information Processing Systems (NIPS), vol. 19.\n\nFAN, J. (1993). Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics 21 196\u2013216.\n\nGIRARD, D. (1998). Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression. Ann. Statist. 26 315\u2013334.\n\nLEVINA, E. and BICKEL, P. (2005). Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems (NIPS), vol. 17.\n\nNIYOGI, P. (2007). Manifold regularization and semi-supervised learning: Some theoretical analyses. Tech. rep., Departments of Computer Science and Statistics, University of Chicago.\n\nRIGOLLET, P. (2006). Generalization error bounds in semi-supervised classification under the cluster assumption. arxiv.org/math/0604233.\n\nSINDHWANI, V., NIYOGI, P., BELKIN, M. and KEERTHI, S. (2005). Linear manifold regularization for large scale semi-supervised learning. In Proc. 
of the 22nd ICML Workshop on Learning with Partially Classified Training Data.\n\nSMOLA, A. and KONDOR, R. (2003). Kernels and regularization on graphs. In Conference on Learning Theory (COLT/KW).\n\nTSANG, I. and KWOK, J. (2006). Large-scale sparsified manifold regularization. In Advances in Neural Information Processing Systems (NIPS), vol. 19.\n\nZHOU, D., BOUSQUET, O., LAL, T., WESTON, J. and SCH\u00d6LKOPF, B. (2004). Learning with local and global consistency. In Advances in Neural Information Processing Systems (NIPS), vol. 16.\n\nZHU, X. (2006). Semi-supervised learning literature review. Tech. rep., University of Wisconsin.\n\nZHU, X., GHAHRAMANI, Z. and LAFFERTY, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML-03, 20th International Conference on Machine Learning.\n", "award": [], "sourceid": 293, "authors": [{"given_name": "Larry", "family_name": "Wasserman", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}