{"title": "Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1330, "page_last": 1338, "abstract": "We study the behavior of the popular Laplacian Regularization method for Semi-Supervised Learning in the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in $\\R^d$, $d \\geq 2$, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the \u201csmoothness\u201d assumptions associated with this alternative method.", "full_text": "Semi-Supervised Learning with the Graph Laplacian:
The Limit of Infinite Unlabelled Data

Boaz Nadler
Dept. of Computer Science and Applied Mathematics
Weizmann Institute of Science
Rehovot, Israel 76100
boaz.nadler@weizmann.ac.il

Nathan Srebro
Toyota Technological Institute
Chicago, IL 60637
nati@uchicago.edu

Xueyuan Zhou
Dept. of Computer Science
University of Chicago
Chicago, IL 60637
zhouxy@cs.uchicago.edu

Abstract

We study the behavior of the popular Laplacian Regularization method for Semi-Supervised Learning in the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in $\mathbb{R}^d$, $d \geq 2$, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternative method.

1 Introduction and Setup

In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. 
We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL, as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator.

We consider the following setup: Let p(x) be an unknown smooth density on a compact domain $\Omega \subset \mathbb{R}^d$ with a smooth boundary. Let $y : \Omega \to \mathcal{Y}$ be the unknown function we wish to estimate. In the case of regression $\mathcal{Y} = \mathbb{R}$, whereas in binary classification $\mathcal{Y} = \{-1, 1\}$. The standard (transductive) semi-supervised learning problem is formulated as follows: Given l labeled points, $(x_1, y_1), \ldots, (x_l, y_l)$, with $y_i = y(x_i)$, and u unlabeled points $x_{l+1}, \ldots, x_{l+u}$, with all points $x_i$ sampled i.i.d. from p(x), the goal is to construct an estimate of $y(x_{l+i})$ for any unlabeled point $x_{l+i}$, utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and $u \to \infty$.

2 SSL with Graph Laplacian Regularization

We first consider the following graph-based approach formulated by Zhu et al. [15]:

$$\hat{y}(x) = \arg\min_{y} I_n(y) \quad \text{subject to} \quad y(x_i) = y_i, \; i = 1, \ldots, l \qquad (1)$$

where

$$I_n(y) = \frac{1}{n^2} \sum_{i,j} W_{i,j}\,(y(x_i) - y(x_j))^2 \qquad (2)$$

is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W. This formulation has several natural interpretations in terms of, e.g., random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities.

In contrast, our focus here is on the more typical scenario where the points $x_i \in \mathbb{R}^d$ are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible?

The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances:

$$W_{i,j} = G\!\left(\frac{\|x_i - x_j\|}{\sigma}\right) \qquad (3)$$

where $G : \mathbb{R}_+ \to \mathbb{R}_+$ is some function with an adequately fast decay. Popular choices are the Gaussian filter $G(z) = e^{-z^2/2}$ or the $\epsilon$-neighborhood graph obtained by the step filter $G(z) = 1_{z<1}$. For simplicity, we focus here on the formulation (1), where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term $I_n(y)$, e.g. [14, 6]. Our analysis and conclusions apply to such variants as well.

Limit of the Laplacian Regularization Term

As the number of unlabeled examples grows, the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x):

$$\lim_{n\to\infty} I_n(y) = I^{(\sigma)}(y) = \int_\Omega \int_\Omega G\!\left(\frac{\|x - x'\|}{\sigma}\right) (y(x) - y(x'))^2\, p(x)\, p(x')\, dx\, dx'. \qquad (4)$$

In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as $n \to \infty$. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term $I_n(y)$, at least when $\sigma \to 0$ slowly enough¹, namely when $\sigma = \omega(\sqrt[d]{\log n / n})$. If $\sigma \to 0$ as $n \to \infty$, and as long as $n\sigma^d/\log n \to \infty$, then after appropriate normalization the regularizer converges to a density-weighted gradient penalty term [7, 8]:

$$\lim_{n\to\infty} \frac{d}{C\sigma^{d+2}}\, I_n(y) = \lim_{\sigma\to 0} \frac{d}{C\sigma^{d+2}}\, I^{(\sigma)}(y) = J(y) = \int_\Omega \|\nabla y(x)\|^2\, p(x)^2\, dx \qquad (5)$$

where $C = \int_{\mathbb{R}^d} \|z\|^2 G(\|z\|)\, dz$, and assuming $0 < C < \infty$ (which is the case for both the Gaussian and the step filters). This energy functional J(y) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem:

$$\hat{y}(x) = \arg\min_{y} J(y) \quad \text{subject to} \quad y(x_i) = y_i, \; i = 1, \ldots, l \qquad (6)$$

¹When $\sigma = o(\sqrt[d]{1/n})$ all non-diagonal weights $W_{i,j}$ vanish (points no longer have any “close by” neighbors). 
We are not aware of an analysis covering the regime where σ decays roughly as $\sqrt[d]{1/n}$, but would be surprised if a qualitatively different meaningful limit is reached.

3 Graph Laplacian Regularization in R¹

We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and $x \in \mathbb{R}$. We first consider the situation where the support of p(x) is a continuous interval $\Omega = [a, b] \subset \mathbb{R}$ (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order $a \leq x_1 < x_2 < \cdots < x_l \leq b$. Applying the theory of variational calculus, the solution $\hat{y}(x)$ satisfies inside each interval $(x_i, x_{i+1})$ the Euler-Lagrange equation

$$\frac{d}{dx}\left[ p^2(x)\, \frac{dy}{dx} \right] = 0.$$

Performing two integrations and enforcing the constraints at the labeled points yields

$$y(x) = y_i + \frac{\int_{x_i}^{x} 1/p^2(t)\, dt}{\int_{x_i}^{x_{i+1}} 1/p^2(t)\, dt}\,(y_{i+1} - y_i) \quad \text{for } x_i \leq x \leq x_{i+1} \qquad (7)$$

with $y(x) = y_1$ for $a \leq x \leq x_1$ and $y(x) = y_l$ for $x_l \leq x \leq b$. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution apply in each interval separately.

The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer:

Theorem 1. Let p(x) be a smooth density on $\Omega = [a, b] \subset \mathbb{R}$ such that $A_p = \frac{1}{4}\int_a^b 1/p^2(t)\, dt < \infty$. Then J(f) can be written as a squared semi-norm $J(f) = \|f\|_{K_p}^2$ induced by the kernel

$$K_p(x, x') = A_p - \frac{1}{2}\left| \int_x^{x'} \frac{dt}{p^2(t)} \right| \qquad (8)$$

with a null-space of all constant functions. That is, $\|f\|_{K_p}$ is the norm of the projection of f onto the RKHS induced by $K_p$.
If p(x) is supported on several disjoint intervals, $\Omega = \cup_i [a_i, b_i]$, then J(f) can be written as a squared semi-norm induced by the kernel

$$K_p(x, x') = \begin{cases} \frac{1}{4}\int_{a_i}^{b_i} \frac{dt}{p^2(t)} - \frac{1}{2}\left| \int_x^{x'} \frac{dt}{p^2(t)} \right| & \text{if } x, x' \in [a_i, b_i] \\[4pt] 0 & \text{if } x \in [a_i, b_i],\; x' \in [a_j, b_j],\; i \neq j \end{cases} \qquad (9)$$

with a null-space spanned by the indicator functions $1_{[a_i,b_i]}(x)$ of the connected components of Ω.

Proof. For any $f(x) = \sum_i \alpha_i K_p(x, x_i)$ in the RKHS induced by $K_p$:

$$J(f) = \int \left(\frac{df}{dx}\right)^2 p^2(x)\, dx = \sum_{i,j} \alpha_i \alpha_j J_{ij}, \quad \text{where } J_{ij} = \int \frac{d}{dx} K_p(x, x_i)\, \frac{d}{dx} K_p(x, x_j)\, p^2(x)\, dx \qquad (10)$$

When $x_i$ and $x_j$ are in different connected components of Ω, the gradients of $K_p(\cdot, x_i)$ and $K_p(\cdot, x_j)$ are never non-zero together and $J_{ij} = 0 = K_p(x_i, x_j)$. When they are in the same connected component [a, b], and assuming w.l.o.g. $a \leq x_i \leq x_j \leq b$:

$$J_{ij} = \frac{1}{4}\left[\int_a^{x_i} \frac{dt}{p^2(t)} - \int_{x_i}^{x_j} \frac{dt}{p^2(t)} + \int_{x_j}^{b} \frac{dt}{p^2(t)}\right] = \frac{1}{4}\int_a^b \frac{dt}{p^2(t)} - \frac{1}{2}\int_{x_i}^{x_j} \frac{dt}{p^2(t)} = K_p(x_i, x_j). \qquad (11)$$

Substituting $J_{ij} = K_p(x_i, x_j)$ into (10) yields $J(f) = \sum \alpha_i \alpha_j K_p(x_i, x_j) = \|f\|_{K_p}^2$.

Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form:

$$y(x) = \sum_{j=1}^{l} \alpha_j K_p(x, x_j) + \sum_i \beta_i\, 1_{[a_i,b_i]}(x),$$

where i ranges over the connected components $[a_i, b_i]$ of Ω, and we have:

$$J(y) = \sum_{i,j=1}^{l} \alpha_i \alpha_j K_p(x_i, x_j). \qquad (12)$$

Viewing the regularizer as $\|y\|_{K_p}^2$ suggests understanding (6), and hence also its empirical approximation (1), by interpreting $K_p(x, x')$ as a density-based “similarity measure” between x and x'. This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which $1/p^2(x)$ is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero-density regions.

4 Graph Laplacian Regularization in Higher Dimensions

The analysis of the previous section seems promising, as it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regrettably, in higher dimensions this is no longer the case. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points.

Theorem 2. 
Let p(x) be a smooth density over $\mathbb{R}^d$, $d \geq 2$, bounded from above by some constant $p_{\max}$, and let $(x_1, y_1), \ldots, (x_l, y_l)$ be any (non-repeating) set of labeled examples. There exist continuous functions $y_\varepsilon(x)$, for any $\varepsilon > 0$, all satisfying the constraints $y_\varepsilon(x_j) = y_j$, $j = 1, \ldots, l$, such that $J(y_\varepsilon) \to 0$ as $\varepsilon \to 0$, but also $y_\varepsilon(x) \to 0$ for all $x \neq x_j$, $j = 1, \ldots, l$.

Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at $x_0 = 0$ with $y(x_0) = 0$ and the second labeled point is at $x_1$ with $\|x_1\| = 1$ and $y(x_1) = 1$. In addition, we assume that the ball $B_1(0)$ of radius one centered around the origin is contained in $\Omega = \{x \in \mathbb{R}^d \mid p(x) > 0\}$.

We first consider the case d > 2. Here, for any $\varepsilon > 0$, consider the function

$$y_\varepsilon(x) = \min\!\left(\frac{\|x\|}{\varepsilon}, 1\right)$$

which indeed satisfies the two constraints $y_\varepsilon(x_i) = y_i$, $i = 0, 1$. Then,

$$J(y_\varepsilon) = \int_{B_\varepsilon(0)} \frac{p^2(x)}{\varepsilon^2}\, dx \leq \frac{p_{\max}^2}{\varepsilon^2} \int_{B_\varepsilon(0)} dx = p_{\max}^2\, V_d\, \varepsilon^{d-2} \qquad (13)$$

where $V_d$ is the volume of a unit ball in $\mathbb{R}^d$. Hence, the sequence of functions $y_\varepsilon(x)$ satisfies the constraints, but for d > 2, $\inf_\varepsilon J(y_\varepsilon) = 0$.

For d = 2, a more extreme example is necessary: consider the functions

$$y_\varepsilon(x) = \log\!\left(\frac{\|x\|^2 + \varepsilon}{\varepsilon}\right) \Big/ \log\!\left(\frac{1+\varepsilon}{\varepsilon}\right) \quad \text{for } \|x\| \leq 1$$

and $y_\varepsilon(x) = 1$ for $\|x\| > 1$. These functions satisfy the two constraints $y_\varepsilon(x_i) = y_i$, $i = 0, 1$, and:

$$J(y_\varepsilon) = \frac{4}{\left[\log\!\left(\frac{1+\varepsilon}{\varepsilon}\right)\right]^2} \int_{B_1(0)} \frac{\|x\|^2}{(\|x\|^2+\varepsilon)^2}\, p^2(x)\, dx \leq \frac{4\, p_{\max}^2}{\left[\log\!\left(\frac{1+\varepsilon}{\varepsilon}\right)\right]^2} \int_0^1 \frac{r^2}{(r^2+\varepsilon)^2}\, 2\pi r\, dr \leq \frac{4\pi\, p_{\max}^2}{\log\!\left(\frac{1+\varepsilon}{\varepsilon}\right)} \;\xrightarrow{\varepsilon\to 0}\; 0.$$

The implication of Theorem 2 is that regardless of the values at the labeled points, as $u \to \infty$, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel $K_p$, as in the one-dimensional case, is not possible.

Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in

$$\hat{y} = \arg\min_{y(x)} \frac{1}{l} \sum_{j=1}^{l} (y(x_j) - y_j)^2 + \gamma\, I_n(y)$$

In the limit of infinite unlabeled data, functions of the form $y_\varepsilon(x)$ above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points.

4.1 Numerical Example

We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in $\mathbb{R}^2$, which is a mixture of two unit-variance spherical Gaussians, one per class, centered at the origin and at (4, 0). 
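Before turning to the simulation, the scaling in the bound (13) is easy to check numerically. The short sketch below is our own illustration (not part of the original experiments): it evaluates the right-hand side of (13), $p_{\max}^2 V_d \varepsilon^{d-2}$, for shrinking ε, confirming that the spike functions $y_\varepsilon$ drive the bound to zero whenever d > 2, while for d = 2 the bound stays constant (which is why the more extreme logarithmic construction is needed there).

```python
import math

def ball_volume(d):
    # volume V_d of the unit ball in R^d
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def J_bound(eps, d, p_max=1.0):
    # right-hand side of Eq. (13): J(y_eps) <= p_max^2 * V_d * eps^(d-2)
    return p_max ** 2 * ball_volume(d) * eps ** (d - 2)

for d in (3, 5, 10):
    bounds = [J_bound(eps, d) for eps in (0.5, 0.25, 0.125, 0.0625)]
    # for d > 2 the bound decreases strictly toward zero
    assert all(b1 > b2 for b1, b2 in zip(bounds, bounds[1:]))
# for d = 2 the bound is eps-independent: eps^0 = 1
assert J_bound(0.5, 2) == J_bound(0.0625, 2)
```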
We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with σ = 0.4.

Figure 1 depicts the predictor $\hat{y}(x)$ obtained from (1). In fact, two different predictors are shown, obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution $\hat{y}(x)$ of (1) satisfies:

$$\hat{y}(x_i) = \sum_{j=1}^{n} W_{ij}\, \hat{y}(x_j) \Big/ \sum_{j=1}^{n} W_{ij} \quad \text{on all unlabeled points } i = l+1, \ldots, l+u. \qquad (14)$$

Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB's backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for $\hat{y}(x_i)$, which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on $x_1, \ldots, x_l$. This is the method used in the bottom panels of Figure 1.

As predicted, $\hat{y}(x)$ is almost constant for almost all unlabeled points. Although all values are very close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true −1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known a priori, setting the threshold between the positive and negative classes is problematic. 
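The label-propagation iteration (14) is straightforward to reproduce. The following sketch is our own illustrative reconstruction in Python rather than the authors' MATLAB code; the sample size, seed, and number of iterations are arbitrary choices, kept small for speed. It builds the Gaussian similarity matrix (3) on the two-Gaussian mixture and iterates (14) while clamping the labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# two unit-variance spherical Gaussians in R^2, centered at (0,0) and (4,0)
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 2)),
               rng.normal(0.0, 1.0, (n // 2, 2)) + np.array([4.0, 0.0])])
labeled = np.array([0, n // 2])            # one labeled point per class
y_lab = np.array([-1.0, 1.0])

sigma = 0.4                                # bandwidth of the Gaussian filter (3)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / (2.0 * sigma ** 2))

y = np.zeros(n)
y[labeled] = y_lab
for _ in range(1000):
    y = W @ y / W.sum(axis=1)              # the update (14): weighted neighbor average
    y[labeled] = y_lab                     # keep the known labels fixed

# each update is a convex average of values in [-1, 1], so the predictor stays bounded
assert np.abs(y).max() <= 1.0 + 1e-9 and np.allclose(y[labeled], y_lab)
```

With many more unlabeled points, the same iteration produces the nearly flat predictors shown in Figure 1.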
In our example, setting the threshold to zero yields a generalization error of 45%.

The differences between the two numerical methods for solving (1) also point to another problem with the ill-posedness of the limit problem: the solution is numerically very unstable.

A more quantitative evaluation, which also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classification error of the resulting predictor $\hat{y}(x)$. We plot the classification error both when a threshold of zero is used (i.e. the class is determined by $\mathrm{sign}(\hat{y}(x))$) and with the ideal threshold minimizing the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. 
Figure 1: Left plots: minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations; the solution has not yet converged, despite small residuals of the order of 2·10⁻⁴.

Figure 2: Squared error (top), classification error with a threshold of zero (center), and minimal classification error using the ideal threshold (bottom), of the minimizer of (1) as a function of the number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted.

Note that ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. 
when the predictor is even flatter.

4.2 Probabilistic Interpretation, Exit and Hitting Times

As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix $M = D^{-1}W$, where D is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. Then, for the binary classification case with $y_i = \pm 1$ we have [15]:

$$\hat{y}(x_i) = 2\, \Pr\!\left[\, x(t) \text{ hits a point labeled } +1 \text{ before hitting a point labeled } -1 \,\middle|\, x(0) = x_i \,\right] - 1$$

We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low-density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and $\sigma \to 0$, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location $x_i$, yielding a flat predictor.

5 Keeping σ Finite

At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit $\sigma \to 0$. One possible objection is that there is an intrinsic characteristic scale for the data, $\sigma_0$, such that (with high probability) all points at a distance $\|x_i - x_j\| < \sigma_0$ have the same label. 
If this is the case, then it may not necessarily make sense to take values of $\sigma < \sigma_0$ in constructing W.

However, keeping σ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this regime. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where $y(x_j) = y_j$. With a finite number of labeled points of measure zero, $I^{(\sigma)}(y) = 0$ in any dimension and for any fixed σ > 0. While this limiting function is discontinuous, it is also possible to construct a sequence of continuous functions $y_\varepsilon$ that all satisfy the constraints and for which $I^{(\sigma)}(y_\varepsilon) \to 0$ as $\varepsilon \to 0$.

This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as $u \to \infty$.

Figure 3: Minimizer of (1) for a 1-D problem with a fixed σ = 0.4, two labeled points, and an increasing number of unlabeled points (50, 500, and 3500 points).

6 Fourier-Eigenvector Based Methods

Before we conclude, we discuss a different approach for SSL, also based on the graph Laplacian, suggested by Belkin and Niyogi [3]. 
Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small $I_n(y)$ values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$). Only predictors

$$\hat{y}(x) = \sum_{j=1}^{p} a_j e_j \qquad (15)$$

spanned by the first p eigenvectors $e_1, \ldots, e_p$ of L (those with smallest eigenvalues) are considered. The coefficients $a_j$ are chosen by minimizing a loss function on the labeled data, e.g. the squared loss:

$$(\hat{a}_1, \ldots, \hat{a}_p) = \arg\min \sum_{j=1}^{l} (y_j - \hat{y}(x_j))^2. \qquad (16)$$

Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit $u \to \infty$. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4].

Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that “the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis”. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low-density regions. 
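As a concrete sketch of the procedure (15)–(16), the following Python fragment is our own illustration (the sample size, bandwidth, number of eigenvectors p, and seed are arbitrary choices): it builds the graph Laplacian on a two-cluster 1-D dataset, takes the p eigenvectors with smallest eigenvalues, and fits the coefficients by least squares on a handful of labels.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, l = 400, 6, 20
# two 1-D clusters; the true label is the cluster membership
x = np.concatenate([rng.normal(0.0, 1.0, n // 2), rng.normal(4.0, 1.0, n // 2)])
y_true = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

sigma = 0.4
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma ** 2))
L = np.diag(W.sum(axis=1)) - W            # graph Laplacian L = D - W

# hypothesis space: span of the p eigenvectors with smallest eigenvalues
eigvals, eigvecs = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
E = eigvecs[:, :p]

lab = rng.choice(n, size=l, replace=False)
a, *_ = np.linalg.lstsq(E[lab], y_true[lab], rcond=None)   # squared loss (16)
y_hat = E @ a                                              # predictor (15)
print("sign agreement:", (np.sign(y_hat) == y_true).mean())
```

On clustered data like this, the first non-trivial eigenvector is approximately piecewise constant on the clusters, which is exactly the situation in which the method works well.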
Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well the unknown classification function can be approximated by a truncated expansion with relatively few eigenvectors.

We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, $y_1(x) = 1_{x_1 > 0.5}$, $y_2(x) = 1_{x_1 > x_2/0.8}$ and $y_3(x) = 1_{x_2/0.8 > x_3/0.6}$. Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5, as suggested by Belkin and Niyogi, and for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method).

Figure 4: Left three panels: generalization performance of the eigenvector method (15)–(16) for the three different functions described in the text, shown for p = #labeled points/5 and for the optimal p. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: approximation error, i.e. the best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors.

The reason for this behavior is that $y_2(x)$, and even more so $y_3(x)$, cannot be as easily approximated by the very few leading eigenfunctions: even though they seem “simple” and “smooth”, they are significantly more complicated than $y_1(x)$ in terms of the measure of simplicity implied by the eigenvector method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form $\psi_{i,j,k}(x) = \cos(i\pi x_1)\cos(j\pi x_2/0.8)\cos(k\pi x_3/0.6)$, making it hard to represent simple decision boundaries which are not axis-aligned.

7 Discussion

Our results show that a popular SSL method, the Laplacian Regularization method (1), is not well-behaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons.

First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grows while l/u is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying, as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. 
It also places the user in the delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance.

Second, in our experiments we noticed that although the predictor $\hat{y}(x)$ becomes extremely flat, in binary tasks it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression.

The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first-order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least (d+1)/2 derivatives in $\mathbb{R}^d$. In the context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the graph Laplacian matrix L = D − W, where D is the diagonal degree matrix, with $L^{(d+1)/2}$ (the matrix raised to the (d+1)/2 power). In the infinite unlabeled data limit, this corresponds to regularizing all order-(d+1)/2 (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high-dimensional ambient space, the order of iteration should correspond to the intrinsic, rather than ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. 
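The iterated-Laplacian variant is simple to express in code. The sketch below is our own hedged illustration, not a method from the paper's experiments; the power k is a free parameter standing in for (d+1)/2, and the data, bandwidth, and seed are arbitrary. It replaces the regularizer $y^\top L y$ with $y^\top L^k y$ and solves the hard-constrained problem in closed form from the first-order conditions on the unlabeled block.

```python
import numpy as np

def iterated_laplacian_ssl(X, labeled_idx, y_lab, sigma=1.0, k=2):
    """Minimize y^T L^k y subject to y = y_lab on the labeled points."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    L = np.diag(W.sum(axis=1)) - W
    Lk = np.linalg.matrix_power(L, k)          # iterated Laplacian L^k

    unlab = np.setdiff1d(np.arange(n), labeled_idx)
    # first-order condition on the unlabeled block:
    #   Lk[u,u] y_u + Lk[u,l] y_l = 0
    y = np.empty(n)
    y[labeled_idx] = y_lab
    y[unlab] = np.linalg.solve(Lk[np.ix_(unlab, unlab)],
                               -Lk[np.ix_(unlab, labeled_idx)] @ y_lab)
    return y

# two overlapping 2-D clusters, one labeled point each (illustrative data)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(0.0, 1.0, (50, 2)) + np.array([2.0, 0.0])])
y = iterated_laplacian_ssl(X, np.array([0, 50]), np.array([-1.0, 1.0]), k=2)
```

Note that for k > 1 the solution is no longer governed by a maximum principle, so unlike the harmonic case (14) it may overshoot the label values; choosing k, and estimating the intrinsic dimensionality it should track, remains the practical difficulty noted above.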
We are not aware of much practical work using the iterated Laplacian, nor of a good understanding of its appropriateness for SSL.

A different approach leading to a well-posed solution is to also include an ambient regularization term [5]. However, the properties of the solution, and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x), remain unclear.

Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06).

References

[1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975.
[2] A. Azran, The rendezvous algorithm: multiclass semi-supervised learning with Markov random walks, ICML, 2007.
[3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003.
[4] M. Belkin, P. Niyogi, Convergence of Laplacian eigenmaps, NIPS, vol. 19, 2007.
[5] M. Belkin, P. Niyogi and V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, JMLR, 7:2399-2434, 2006.
[6] Y. Bengio, O. Delalleau, N. Le Roux, Label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Schölkopf and Zien, editors, MIT Press, 2006.
[7] O. Bousquet, O. Chapelle, M. Hein, Measure based regularization, NIPS, vol. 16, 2004.
[8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006.
[9] J. Lafferty, L. Wasserman, Statistical analysis of semi-supervised regression, NIPS, vol. 20, 2008.
[10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008.
[11] M. Meila, J. Shi, A random walks view of spectral segmentation, AI and Statistics, 2001.
[12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. 
Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006.
[13] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, 2002.
[14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, NIPS, vol. 16, 2004.
[15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, ICML, 2003.
", "award": [], "sourceid": 465, "authors": [{"given_name": "Boaz", "family_name": "Nadler", "institution": null}, {"given_name": "Nathan", "family_name": "Srebro", "institution": null}, {"given_name": "Xueyuan", "family_name": "Zhou", "institution": null}]}