{"title": "Non-Local Manifold Tangent Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 129, "page_last": 136, "abstract": null, "full_text": "Non-Local Manifold Tangent Learning\n\nYoshua Bengio and Martin Monperrus\nDept. IRO, Université de Montréal\nP.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada\n{bengioy,monperrm}@iro.umontreal.ca\n\nAbstract\n\nWe claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests exploring non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails.\n\n1 Introduction\n\nA central issue of generalization is how information from the training examples can be used to make predictions about new examples, and without strong prior assumptions, i.e. in non-parametric models, this may be fundamentally difficult, as illustrated by the curse of dimensionality.\n
There has been in recent years a lot of work on unsupervised learning based on characterizing a possibly non-linear manifold near which the data would lie, such as Locally Linear Embedding (LLE) (Roweis and Saul, 2000), Isomap (Tenenbaum, de Silva and Langford, 2000), kernel Principal Components Analysis (PCA) (Schölkopf, Smola and Müller, 1998), Laplacian Eigenmaps (Belkin and Niyogi, 2003), and Manifold Charting (Brand, 2003). These are all essentially non-parametric methods that can be shown to be kernel methods with an adaptive kernel (Bengio et al., 2004), and which represent the manifold on the basis of local neighborhood relations, very often constructed using the nearest neighbors graph (the graph with one vertex per observed example, and arcs between near neighbors). The above methods characterize the manifold through an embedding which associates each training example (an input object) with a low-dimensional coordinate vector (the coordinates on the manifold). Other closely related methods characterize the manifold as well as \"noise\" around it. Most of these methods consider the density as a mixture of flattened Gaussians, e.g. mixtures of factor analyzers (Ghahramani and Hinton, 1996), Manifold Parzen windows (Vincent and Bengio, 2003), and other local PCA models such as mixtures of probabilistic PCA (Tipping and Bishop, 1999). This is not an exhaustive list, and recent work also combines modeling through a mixture density and dimensionality reduction (Teh and Roweis, 2003; Brand, 2003).\n\nIn this paper we claim that there is a fundamental weakness with such kernel methods, due to the locality of learning: we show that the local tangent plane of the manifold at a point x is defined based mostly on the near neighbors of x according to some possibly data-dependent kernel K_D.\n
As a consequence, it is difficult with such methods to generalize to\nnew combinations of values x that are \"far\" from the training examples xi, where being\n\"far\" is a notion that should be understood in the context of several factors: the amount of\nnoise around the manifold (the examples do not lie exactly on the manifold), the curvature\nof the manifold, and the dimensionality of the manifold. For example, if the manifold\ncurves quickly around x, neighbors need to be closer for a locally linear approximation to\nbe meaningful, which means that more data are needed. Dimensionality of the manifold\ncompounds that problem because the amount of data thus needed will grow exponentially\nwith it. Saying that y is \"far\" from x means that y is not well represented by its projection\non the tangent plane at x.\n\nIn this paper we explore one way to address that problem, based on estimating the tangent\nplanes of the manifolds as a function of x, with parameters that can be estimated not only\nfrom the data around x but from the whole dataset. Note that there can be more than one\nmanifold (e.g. in vision, one may imagine a different manifold for each \"class\" of object),\nbut the structure of these manifolds may be related, something that many previous manifold\nlearning methods did not take advantage of. We present experiments on a variety of tasks\nillustrating the weaknesses of the local manifold learning algorithms enumerated above.\nThe most striking result is that the model is able to generalize a notion of rotation learned\non one kind of image (digits) to a very different kind (alphabet characters), i.e. very far\nfrom the training data.\n\n2      Local Manifold Learning\n\nBy \"local manifold learning\", we mean a method that derives information about the local\nstructure of the manifold (i.e. implicitly its tangent directions) at x based mostly on the\ntraining examples \"around\" x. 
There is a large group of manifold learning methods (as well as the spectral clustering methods) that share several characteristics, and can be seen as data-dependent kernel PCA (Bengio et al., 2004). These include LLE (Roweis and Saul, 2000), Isomap (Tenenbaum, de Silva and Langford, 2000), kernel PCA (Schölkopf, Smola and Müller, 1998) and Laplacian Eigenmaps (Belkin and Niyogi, 2003). They first build a data-dependent Gram matrix M with n x n entries K_D(x_i, x_j), where D = {x_1, ..., x_n} is the training set and K_D is a data-dependent kernel, and compute the eigenvector-eigenvalue pairs {(v_k, lambda_k)} of M. The embedding of the training set is obtained directly from the principal eigenvectors v_k of M (the i-th element of v_k gives the k-th coordinate of x_i's embedding, possibly scaled by sqrt(lambda_k), i.e. e_k(x_i) = v_ik), and the embedding for a new example can be estimated using the Nyström formula (Bengio et al., 2004):\n\n e_k(x) = (1/lambda_k) sum_{i=1}^n v_ki K_D(x, x_i)\n\nfor the k-th coordinate of x, where lambda_k is the k-th eigenvalue of M (the optional scaling by sqrt(lambda_k) would also apply). The above equation says that the embedding for a new example x is a local interpolation of the manifold coordinates of its neighbors x_i, with interpolating weights given by K_D(x, x_i)/lambda_k. To see more clearly how the tangent plane may depend only on the neighbors of x, consider the relation between the tangent plane and the embedding function: the tangent plane at x is simply the subspace spanned by the vectors de_k(x)/dx.\n
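The Nyström formula above is easy to sketch numerically. The following minimal illustration is our own (an arbitrary Gaussian kernel on synthetic data, not the paper's setup): it builds the Gram matrix M, extracts its eigenvectors, and embeds a new point; at a training point the formula recovers the corresponding eigenvector entry, i.e. e_k(x_i) = v_ik.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # toy training set D = {x_1, ..., x_n} (synthetic)
sigma = 1.0

def K(a, b):
    # data-dependent kernel K_D; here a plain Gaussian kernel for illustration
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# Gram matrix M with entries K_D(x_i, x_j), and its eigen-decomposition
M = np.array([[K(xi, xj) for xj in X] for xi in X])
lam, V = np.linalg.eigh(M)
lam, V = lam[::-1], V[:, ::-1]   # principal eigenvectors first

def embed(x, num_coords=2):
    # Nystrom formula: e_k(x) = (1/lambda_k) * sum_i v_ki * K_D(x, x_i)
    kx = np.array([K(x, xi) for xi in X])
    return np.array([V[:, k] @ kx / lam[k] for k in range(num_coords)])

# Sanity check: at a training point the formula recovers the eigenvector entry.
print(np.allclose(embed(X[0]), V[0, :2]))  # True
```

For a genuinely new x the same `embed` call interpolates the training embeddings with weights K_D(x, x_i)/lambda_k, which is the locality property discussed next.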
In the case of very \"local\" kernels like that of LLE, spectral clustering with Gaussian kernel, Laplacian Eigenmaps or kernel PCA with Gaussian kernel, that derivative only depends significantly on the near neighbors of x. Consider for example kernel PCA with a Gaussian kernel: then de_k(x)/dx can be closely approximated by a linear combination of the difference vectors (x - x_j) for x_j near x. The weights of that combination may depend on the whole data set, but if the ambient space has many more dimensions than the number of such \"near\" neighbors of x, this is a very strong locally determined constraint on the shape of the manifold. The case of Isomap is less obvious, but we show below that it is also local. Let D(a, b) denote the graph geodesic distance going only through a, b and points from the training set. As shown in (Bengio et al., 2004), the corresponding data-dependent kernel can be defined as\n\n K_D(x, x_i) = -(1/2) (D(x, x_i)^2 - (1/n) sum_j D(x, x_j)^2 - Dbar_i + Dbar)\n\nwhere Dbar_i = (1/n) sum_j D(x_i, x_j)^2 and Dbar = (1/n) sum_j Dbar_j.\n
Let N(x, x_i) denote the index j of the training set example x_j that is a neighbor of x minimizing ||x - x_j|| + D(x_j, x_i). Then\n\n de_k(x)/dx = (1/lambda_k) sum_i v_ki [ (1/n) sum_j D(x, x_j) (x - x_N(x,x_j))/||x - x_N(x,x_j)|| - D(x, x_i) (x - x_N(x,x_i))/||x - x_N(x,x_i)|| ]   (1)\n\nwhich is a linear combination of vectors (x - x_k), where x_k is a neighbor of x. This clearly shows that the tangent plane at x associated with Isomap is also included in the subspace spanned by the vectors (x - x_k) where x_k is a neighbor of x.\n\nThere is also a variety of local manifold learning algorithms which can be classified as \"mixtures of pancakes\" (Ghahramani and Hinton, 1996; Tipping and Bishop, 1999; Vincent and Bengio, 2003; Teh and Roweis, 2003; Brand, 2003). These are generally mixtures of Gaussians with a particular covariance structure. When the covariance matrix is approximated using its principal eigenvectors, this leads to \"local PCA\" types of methods. For these methods the local tangent directions directly correspond to the principal eigenvectors of the local covariance matrices.\n
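For concreteness, the local PCA tangent estimate just described can be sketched in a few lines (a toy illustration of ours; the function name and the circle data are not from the paper):

```python
import numpy as np

def local_pca_tangent(x, data, k=10, d=1):
    # Local PCA tangent estimate: the d principal eigenvectors of the
    # covariance of the k nearest neighbors of x (returned as a d x n basis).
    dists = np.linalg.norm(data - x, axis=1)
    neighbors = data[np.argsort(dists)[:k]]
    C = np.cov(neighbors, rowvar=False)
    _, U = np.linalg.eigh(C)          # eigenvalues in ascending order
    return U[:, ::-1][:, :d].T        # top-d directions as rows

# Toy check on the unit circle (a 1-D manifold in R^2): near (1, 0) the
# tangent should be close to the vertical direction (0, 1).
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.c_[np.cos(theta), np.sin(theta)]
T = local_pca_tangent(np.array([1.0, 0.0]), circle, k=8, d=1)
print(abs(T[0, 1]) > 0.99)   # True: the estimated direction is nearly (0, 1)
```

Note that the estimate uses only the k neighbors of x; this is exactly the locality that the next paragraphs identify as the source of the problem.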
Learning is also local, since it is mostly the examples around the Gaussian center that determine its covariance structure. The problem is not so much due to the form of the density as a mixture of Gaussians. The problem is that the local parameters (e.g. local principal directions) are estimated mostly based on local data. There is usually a non-local interaction between the different Gaussians, but its role is mainly one of global coordination, e.g. where to set the Gaussian centers to allocate them properly where there is data, and optionally how to orient the principal directions so as to obtain a globally coherent coordinate system for embedding the data.\n\n2.1 Where Local Manifold Learning Would Fail\n\nIt is easy to imagine at least four failure causes for local manifold learning methods, and combining them will create even greater problems:\n\n- Noise around the manifold: data are not exactly lying on the manifold. In the case of non-linear manifolds, the presence of noise means that more data around each pancake region will be needed to properly estimate the tangent directions of the manifold in that region.\n\n- High curvature of the manifold. Local manifold learning methods basically approximate the manifold by the union of many locally linear patches. For this to work, there must be at least d close enough examples in each patch (more with noise). With a high curvature manifold, more (and smaller) patches will be needed, and the number of required patches will grow exponentially with the dimensionality of the manifold. Consider for example the manifold of translations of a high-contrast image. The tangent direction corresponds to the change in image due to a small translation, i.e. it is non-zero only at edges in the image. After a one-pixel translation, the edges have moved by one pixel, and may not overlap much with the edges of the original image if it had high contrast.\n
This is indeed a very high curvature manifold.\n\n- High intrinsic dimension of the manifold. We have already seen that high manifold dimensionality d is hurtful, because O(d) examples are required in each patch and O(r^d) patches (for some r depending on curvature and noise) are necessary to span the manifold. In the translation example, if the image resolution is increased, then many more training images will be needed to capture the curvature around the translation manifold with locally linear patches. Yet the physical phenomenon responsible for translation is expressed by a simple equation, which does not get more complicated with increasing resolution.\n\n- Presence of many manifolds with little data per manifold. In many real-world contexts there is not just one global manifold but a large number of manifolds which nonetheless share something about their structure. A simple example is the manifold of transformations (view-point, position, lighting, ...) of 3D objects in 2D images. There is one manifold per object instance (corresponding to the successive application of small amounts of all of these transformations). If there are only a few examples for each such class, then it is almost impossible to learn the manifold structures using only local manifold learning. However, if the manifold structures are generated by a common underlying phenomenon, then a non-local manifold learning method could potentially learn all of these manifolds and even generalize to manifolds for which a single instance is observed, as demonstrated in the experiments below.\n\n3 Non-Local Manifold Tangent Learning\n\nHere we choose to characterize the manifolds in the data distribution through a matrix-valued function F(x) that predicts at x in R^n a basis for the tangent plane of the manifold near x, hence F(x) in R^{d x n} for a d-dimensional manifold.\n
Basically, F(x) specifies \"where\" (in which directions) one expects to find near neighbors of x.\n\nWe are going to consider a simple supervised learning setting to train this function. As with Isomap, we consider that the vectors (x - x_i), with x_i a near neighbor of x, span a noisy estimate of the manifold tangent space. We propose to use them to define a \"target\" for training F(x). In our experiments we simply collected the k nearest neighbors of each example x, but better selection criteria might be devised. Points on the predicted tangent subspace can be written F'(x)w, with w in R^d being local coordinates in the basis specified by F(x). Several criteria are possible to match the neighbor differences with the subspace defined by F(x). One that yields simple analytic calculations is simply to minimize the distance between the x - x_j vectors and their projection on the subspace defined by F(x). The low-dimensional local coordinate vector w_tj in R^d that matches neighbor x_j of example x_t is thus an extra free parameter that has to be optimized, but it is obtained analytically. The overall training criterion involves a double optimization over the function F and the local coordinates w_tj of what we call the relative projection error:\n\n min_{F, {w_tj}} sum_t sum_{j in N(x_t)} ||F'(x_t) w_tj - (x_t - x_j)||^2 / ||x_t - x_j||^2   (2)\n\nwhere N(x) denotes the selected set of near neighbors of x. The normalization by ||x_t - x_j||^2 is to avoid giving more weight to the neighbors that are further away. The above ratio amounts to minimizing the square of the sine of the projection angle.\n
To perform the above minimization, we can use coordinate descent (which guarantees convergence to a minimum), i.e. alternate changes in F and changes in the w's which at each step decrease the total criterion. Since the minimization over the w's can be done separately for each example x_t and neighbor x_j, it is equivalent to minimize\n\n ||F'(x_t) w_tj - (x_t - x_j)||^2 / ||x_t - x_j||^2   (3)\n\nover the vector w_tj for each such pair (done analytically), and to compute the gradient of the above over F (or its parameters) to move F slightly (we used stochastic gradient on the parameters of F). Since the normalizer is constant in w_tj, the solution for w_tj is obtained by solving the linear system\n\n F(x_t) F'(x_t) w_tj = F(x_t) (x_t - x_j).   (4)\n\nIn our implementation this is done robustly through a singular value decomposition F'(x_t) = U S V' and w_tj = B (x_t - x_j), where B can be precomputed for all the neighbors of x_t: B = (sum_{k=1}^d 1_{S_k > epsilon} V_{.k} V'_{.k} / S_k^2) F(x_t). The gradient of the criterion with respect to the i-th row of F(x_t), holding w_tj fixed, is simply\n\n sum_j (2 w_tji / ||x_t - x_j||^2) (F'(x_t) w_tj - (x_t - x_j))   (5)\n\nwhere w_tji is the i-th element of w_tj. In practice, it is not necessary to store more than one w_tj vector at a time.\n
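The inner step (solve for the local coordinates, then measure the relative projection error) can be sketched numerically. This is our own toy illustration, not the paper's implementation; a generic least-squares solve is used, which gives the same solution as the SVD/pseudo-inverse route:

```python
import numpy as np

def relative_projection_error(F, xt, xj):
    # One term of the criterion (eq. 3): F is a predicted d x n tangent
    # basis at x_t.  The optimal local coordinates w solve the linear
    # system F F' w = F (x_t - x_j), computed here with lstsq.
    delta = xt - xj
    w, *_ = np.linalg.lstsq(F.T, delta, rcond=None)
    residual = F.T @ w - delta
    return (residual @ residual) / (delta @ delta)

# Toy checks with a d=1 basis in R^3: a difference vector lying in the
# tangent subspace gives error 0; an orthogonal one gives error 1.
F = np.array([[1.0, 0.0, 0.0]])
print(relative_projection_error(F, np.array([2.0, 0.0, 0.0]), np.zeros(3)))  # 0.0
print(relative_projection_error(F, np.array([0.0, 1.0, 0.0]), np.zeros(3)))  # 1.0
```

In the full algorithm this quantity is differentiated with respect to the parameters of F (holding w fixed, as in eq. 5) for the stochastic gradient updates.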
In the experiments, F(.) is parameterized as an ordinary one-hidden-layer neural network with n inputs and d x n outputs. It is trained by stochastic gradient descent, one example x_t at a time.\n\nAlthough the above algorithm provides a characterization of the manifold, it does not directly provide an embedding nor a density function. However, once the tangent plane function is trained, there are ways to use it to obtain both. The simplest method is to apply existing algorithms that provide both an embedding and a density function based on a Gaussian mixture with pancake-like covariances. For example one could use (Teh and Roweis, 2003) or (Brand, 2003), the local covariance matrix around x being constructed as F'(x) diag(sigma^2(x)) F(x), where sigma_i^2(x) should estimate Var(w_i) around x.\n\n3.1 Previous Work on Non-Local Manifold Learning\n\nThe non-local manifold learning algorithm presented here (find F(.) which minimizes the criterion in eq. 2) is similar to the one proposed in (Rao and Ruderman, 1999) to estimate the generator matrix of a Lie group. That group defines a one-dimensional manifold generated by following the orbit x(t) = e^{Gt} x(0), where G is an n x n matrix and t is a scalar manifold coordinate. A multi-dimensional manifold can be obtained by replacing Gt above by a linear combination of multiple generating matrices. In (Rao and Ruderman, 1999) the matrix exponential is approximated to first order by (I + Gt), and the authors estimate G for a simple signal undergoing translations, using as a criterion the minimization of sum_{(x, xtilde)} min_t ||(I + Gt)x - xtilde||^2, where xtilde is a neighbor of x in the data. Note that in this model the tangent plane is a linear function of x, i.e. F_1(x) = Gx.\n
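Under the first-order approximation (I + Gt), estimating G reduces to a linear least-squares problem across neighbor pairs. The following toy sketch is our own construction (an exact one-step circular translation of random signals, for which the first-order model is exact with t = 1), not Rao and Ruderman's data or code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.normal(size=(200, n))       # toy signals, one per row
Xshift = np.roll(X, 1, axis=1)      # each signal's one-step circular translation

# First-order model: x_shift ~ (I + G t) x with t = 1, so G x ~ x_shift - x.
# Across all pairs this is a linear least-squares problem in G:
# (Xshift - X) = X G', solved in one call to lstsq.
Gt, *_ = np.linalg.lstsq(X, Xshift - X, rcond=None)
G = Gt.T

# Because a one-step circular translation is an exactly linear map, the
# recovered generator reproduces it on unseen data:
x_new = rng.normal(size=n)
print(np.allclose(x_new + G @ x_new, np.roll(x_new, 1)))  # True
```

The point of the non-local tangent learning proposal is precisely to go beyond this linear-in-x special case: F(x) = Gx is replaced by a non-linear neural network predictor.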
By minimizing the above criterion across many pairs of examples, a good estimate of G for the artificial data was recovered by (Rao and Ruderman, 1999). Our proposal extends this approach to multiple dimensions and to non-linear relations between x and the tangent planes. Note also the earlier work on Tangent Distance (Simard, LeCun and Denker, 1993), in which the tangent planes are not learned but are used to build a nearest neighbor classifier based on the distance between the tangent subspaces around the two examples to be compared. The main advantage of the approach proposed here over local manifold learning is that the parameters of the tangent plane predictor can be estimated using data from very different regions of space, in principle making it less sensitive to all four of the problems described in section 2.1, thanks to the sharing of information across these different regions.\n\n4 Experimental Results\n\nThe objective of the experiments is to validate the proposed algorithm: does it estimate the true tangent planes well? Does it learn better than a local manifold learning algorithm?\n\nError Measurement. In addition to visualizing the results for the low-dimensional data, we measure performance by considering how well the algorithm learns the local tangent distance, as measured by the normalized projection error of nearest neighbors (eq. 3). We compare the errors of four algorithms, always on test data not used to estimate the tangent plane: (a) true analytic (using the true manifold's tangent plane at x, computed analytically), (b) tangent learning (using the neural-network tangent plane predictor F(x), trained using the k >= d nearest neighbors in the training set of each training set example), (c) Isomap (using the tangent plane defined in Eq. 
1), (d) Local PCA (using the d principal components of the k nearest neighbors of x in the training set).\n\nFigure 1: Task 1 2-D data with 1-D sinusoidal manifolds: the method indeed captures the tangent planes. The small segments are the estimated tangent planes. Red points are training examples.\n\nFigure 2: Task 2 relative projection error for the k-th nearest neighbor, w.r.t. k, for the compared methods (from lowest to highest at k=1: analytic, tangent learning, local PCA, Isomap). Note the U-shape due to the opposing effects of curvature and noise.\n\nTask 1. We first consider a low-dimensional but multi-manifold problem. The data {x_i} are in 2 dimensions and come from a set of 40 1-dimensional manifolds. Each manifold is composed of 4 near points obtained from a randomly placed sine curve, i.e. for i in {1, ..., 4}, x_i = (a + t_i, sin(a + t_i) + b), where a, b, and the t_i are randomly chosen. Four neighbors were used for training both the Tangent Learning algorithm and the benchmark local non-parametric estimator (local PCA of the 4 neighbors). Figure 1 shows the training set and the tangent planes recovered, both on the training examples and generalizing away from the data. The neural network has 10 hidden units (chosen arbitrarily). This problem is particularly difficult for local manifold learning, which does very poorly here: the out-of-sample relative prediction errors are respectively 0.09 for the true analytic plane, 0.25 for non-local tangent learning, and 0.81 for local PCA.\n\nTask 2. This is a higher dimensional manifold learning problem, with 41 dimensions. The data are generated by sampling Gaussian curves. Each curve is of the form x(i) = e^{t_1 - (-2 + i/10)^2 / t_2} with i in {0, 1, ..., 40}. Note that the tangent vectors are not linear in x. The manifold coordinates are t_1 and t_2, sampled uniformly, respectively from (-1, 1) and (0.1, 3.1). Normal noise (standard deviation = 0.001) is added to each point. 100 example curves were generated for training and 200 for testing. The neural network has 100 hidden units. Figure 2 shows the relative projection error for the four methods on this task, for the k-th nearest neighbor, for increasing values of k. First, the error decreases because of the effect of noise (near noisy neighbors may form a high angle with the tangent plane).\n
Then, it increases because of the curvature of the manifold (neighbors that are further away form a larger angle).\n\nTask 3. This is a high-dimensional multi-manifold task, involving digit images to which we have applied slight rotations, in such a way as to have knowledge of the analytic formulation of the manifolds. There is one rotation manifold for each instance of digit from the database, but only two examples for each manifold: one real image from the MNIST dataset and one slightly rotated image. 1000 x 2 examples are used for training and 1000 x 2 for testing. In this context we use only the k = 1 nearest neighbor, and the manifold dimension is 1. The average relative projection error for the nearest neighbor is 0.27 for the analytic tangent plane, 0.43 with tangent learning (100 hidden units), and 1.5 with Local PCA. Here the neural network would probably overfit if trained too long (here only 100 epochs were used).\n\nFigure 3: Left column: original image. Middle: applying a small amount of the predicted rotation. Right: applying a larger amount of the predicted rotation. Top: using the estimated tangent plane predictor. Bottom: using local PCA, which is clearly much worse.\n\nAn even more interesting experiment consists in applying the above trained predictor to novel images that come from a very different distribution but one that shares the same manifold structure: it was applied to images of other characters that are not digits. We have used the predicted tangent planes to follow the manifold by small steps (this is very easy to do in the case of a one-dimensional manifold). Figure 3 shows, for example on a letter 'M' image, the effect of a few such steps and of a larger number of steps, both for the neural network predictor and for the local PCA predictor.\n\nThis example illustrates the crucial point that non-local tangent plane learning is able to generalize to truly novel cases, where local manifold learning fails.\n\nIn all the experiments we found that all the randomly initialized neural networks converged to similarly good solutions.\n
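The manifold-following procedure used for Figure 3 can be sketched generically. In this toy illustration of ours (not the paper's implementation), the trained predictor F is replaced by the exact tangent of the unit circle; stepping repeatedly along the predicted tangent traces the manifold with only a small drift when the step is small:

```python
import numpy as np

def follow_manifold(x0, tangent_fn, step=0.01, n_steps=40):
    # Walk along a 1-D manifold by repeatedly stepping in the predicted
    # tangent direction, sign-aligned with the previous step so the walk
    # keeps moving forward instead of oscillating.
    path = [np.asarray(x0, dtype=float)]
    prev = None
    for _ in range(n_steps):
        t = tangent_fn(path[-1])
        t = t / np.linalg.norm(t)
        if prev is not None and t @ prev < 0:
            t = -t
        path.append(path[-1] + step * t)
        prev = t
    return np.array(path)

# Toy check, with the exact tangent of the unit circle standing in for the
# trained predictor: the walk moves along the circle with little radial drift.
exact_tangent = lambda p: np.array([-p[1], p[0]])
path = follow_manifold([1.0, 0.0], exact_tangent, step=0.01, n_steps=40)
radii = np.linalg.norm(path, axis=1)
print(radii.max() - 1.0 < 0.01, np.linalg.norm(path[-1] - path[0]) > 0.3)  # True True
```

For the rotation experiments the tangent is the predicted rotation direction of the image, and each step produces a slightly more rotated image.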
The number of hidden units was not optimized, although preliminary experimentation showed that both over-fitting and under-fitting were possible with too large or too small a number of hidden units.\n\n5 Conclusion\n\nThe central claim of this paper is that there are fundamental problems with non-parametric local approaches to manifold learning, essentially due to the curse of dimensionality (at the dimensionality of the manifold), but worsened by manifold curvature, noise, and the presence of several disjoint manifolds. To address these problems, we propose that learning algorithms should be designed in such a way that they can share information, coming from different regions of space, about the structure of the manifold. In this spirit we have proposed a simple learning algorithm based on predicting the tangent plane at x with a function F(x) whose parameters are estimated based on the whole data set. Note that the same fundamental problems are present with non-parametric approaches to semi-supervised learning (e.g. as in (Szummer and Jaakkola, 2002; Chapelle, Weston and Schölkopf, 2003; Belkin and Niyogi, 2003; Zhu, Ghahramani and Lafferty, 2003)), which rely on proper estimation of the manifold in order to propagate label information.\n\nFuture work should investigate how to better handle the curvature problem, e.g. by following the manifold (using the local tangent estimates) to estimate a manifold-following path between pairs of neighboring examples. The algorithm can also be extended in a straightforward way to obtain a Gaussian mixture or a mixture of factor analyzers (with the factors or the principal eigenvectors of the Gaussian centered at x obtained from F(x)). This view can also provide an alternative criterion to optimize F(x) (the local log-likelihood of such a Gaussian). This criterion also tells us how to estimate the missing information (the variances along the eigenvector directions).\n
Since we can estimate F(x) everywhere, a more ambitious view would consider the density as a \"continuous\" mixture of Gaussians (with an infinitesimal component located everywhere in space).\n\nAcknowledgments\n\nThe authors would like to thank the following funding organizations for support: NSERC, MITACS, IRIS, and the Canada Research Chairs.\n\nReferences\n\nBelkin, M. and Niyogi, P. (2003). Using manifold structure for partially labeled classification. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, Cambridge, MA. MIT Press.\n\nBengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., and Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10):2197-2219.\n\nBrand, M. (2003). Charting a manifold. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15. MIT Press.\n\nChapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised learning. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, Cambridge, MA. MIT Press.\n\nGhahramani, Z. and Hinton, G. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Dept. of Computer Science, University of Toronto.\n\nRao, R. and Ruderman, D. (1999). Learning Lie groups for invariant visual perception. In Kearns, M., Solla, S., and Cohn, D., editors, Advances in Neural Information Processing Systems 11, pages 810-816. MIT Press, Cambridge, MA.\n\nRoweis, S. and Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326.\n\nSchölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. 
Neural Computation, 10:1299-1319.\n\nSimard, P., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In Giles, C., Hanson, S., and Cowan, J., editors, Advances in Neural Information Processing Systems 5, pages 50-58, Denver, CO. Morgan Kaufmann, San Mateo.\n\nSzummer, M. and Jaakkola, T. (2002). Partially labeled classification with Markov random walks. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, Cambridge, MA. MIT Press.\n\nTeh, Y. W. and Roweis, S. (2003). Automatic alignment of local representations. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15. MIT Press.\n\nTenenbaum, J., de Silva, V., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323.\n\nTipping, M. and Bishop, C. (1999). Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482.\n\nVincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, Cambridge, MA. MIT Press.\n\nZhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML'2003.\n", "award": [], "sourceid": 2647, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Martin", "family_name": "Monperrus", "institution": null}]}