{"title": "Manifold Denoising", "book": "Advances in Neural Information Processing Systems", "page_first": 561, "page_last": 568, "abstract": null, "full_text": "Manifold Denoising\n\nMatthias Hein\n\nMarkus Maier\n\nMax Planck Institute for Biological Cybernetics\n\nT\u00a8ubingen, Germany\n\n{first.last}@tuebingen.mpg.de\n\nAbstract\n\nWe consider the problem of denoising a noisily sampled submanifold M in Rd,\nwhere the submanifold M is a priori unknown and we are only given a noisy point\nsample. The presented denoising algorithm is based on a graph-based diffusion\nprocess of the point sample. We analyze this diffusion process using recent re-\nsults about the convergence of graph Laplacians. In the experiments we show that\nour method is capable of dealing with non-trivial high-dimensional noise. More-\nover using the denoising algorithm as pre-processing method we can improve the\nresults of a semi-supervised learning algorithm.\n\n1 Introduction\n\nIn the last years several new methods have been developed in the machine learning community\nwhich are based on the assumption that the data lies on a submanifold M in Rd. They have been\nused in semi-supervised learning [15], dimensionality reduction [14, 1] and clustering. However\nthere exists a certain gap between theory and practice. Namely in practice the data lies almost never\nexactly on the submanifold but due to noise is scattered around it. Several of the existing algorithms\nin particular graph based methods are quite sensitive to noise. Often they fail in the presence of high-\ndimensional noise since then the distance structure is non-discriminative. In this paper we tackle this\nproblem by proposing a denoising method for manifold data. Given noisily sampled manifold data\nin Rd the objective is to \u2019project\u2019 the sample onto the submanifold.\nThere exist already some methods which have related objectives like principal curves [6] and the\ngenerative topographic mapping [2]. 
For both methods one has to know the intrinsic dimension of the submanifold M as a parameter of the algorithm. However, in the presence of high-dimensional noise it is almost impossible to estimate the intrinsic dimension correctly. Moreover, problems usually arise if there is more than one connected component.\n\nThe algorithm we propose addresses these problems. It works well for low-dimensional submanifolds corrupted by high-dimensional noise and can deal with multiple connected components. The basic principle behind our denoising method has been proposed by [13] as a surface processing method in R^3. The goal of this paper is twofold. First, we extend this method to general submanifolds in R^d, aimed in particular at dealing with high-dimensional noise. Second, we provide an interpretation of the denoising algorithm which takes into account the probabilistic setting encountered in machine learning and which differs from the one usually given in the computer graphics community.\n\n2 The noise model and problem statement\n\nWe assume that the data lies on an abstract m-dimensional manifold M, where the dimension m can be seen as the number of independent parameters in the data. This data is mapped via a smooth, regular embedding i : M → R^d into the feature space R^d. In the following we will not distinguish between M and i(M) ⊂ R^d, since it should be clear from the context which case we are considering. The Euclidean distance in R^d then induces a metric on M. This metric depends on the embedding/representation (e.g. scaling) of the data in R^d but is at least continuous with respect to the intrinsic parameters. 
Furthermore, we assume that the manifold M is equipped with a probability measure P_M which is absolutely continuous with respect to the natural volume element¹ dV of M. With these definitions the model for the noisy data-generating process in R^d has the following form:\n\nX = i(Θ) + ε,\n\nwhere Θ ∼ P_M and ε ∼ N(0, σ²𝟙). Note that the probability measure of the noise ε has full support in R^d. We consider a Gaussian noise model for convenience, but any other reasonably concentrated isotropic noise should work as well. The law P_X of the noisy data X can be computed from the true data-generating probability measure P_M:\n\nP_X(x) = (2πσ²)^{−d/2} ∫_M exp(−‖x − i(θ)‖² / (2σ²)) p(θ) dV(θ).    (1)\n\nNow the Gaussian measure is equivalent to the heat kernel p_t(x, y) = (4πt)^{−d/2} exp(−‖x − y‖²/(4t)) of the diffusion process on R^d, see e.g. [5], if we make the identification σ² = 2t. An alternative point of view on P_X is therefore to see P_X as the result of a diffusion of the density function² p(θ) of P_M stopped at time t = σ²/2. The basic principle behind the denoising algorithm in this paper is to reverse this diffusion process.\n\n3 The denoising algorithm\n\nIn practice we are only given an i.i.d. sample X_i, i = 1, …, n of P_X. The ideal goal would be to find the corresponding set of points i(θ_i), i = 1, …, n on the submanifold M which generated the points X_i. However, due to the random nature of the noise this is in principle impossible. Instead the goal is to find corresponding points Z_i on the submanifold M which are close to the points X_i. However, we are facing several problems. First, since we are only given a finite sample, we do not know P_X or even P_M. Second, as stated in the last section, we would like to reverse this diffusion process, which amounts to solving a PDE. 
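As a concrete illustration, the noise model X = i(Θ) + ε is easy to simulate. The sketch below (our own helper; the sinusoid embedding and the values d = 200, σ = 0.4 are taken from the toy experiment in Section 5.1) draws a noisy sample in R^d:

```python
import numpy as np

def sample_noisy_manifold(n=500, d=200, sigma=0.4, rng=None):
    """Draw n points from the noise model X = i(Theta) + eps.

    The manifold is the curve t -> [sin(2 pi t), 2 pi t] (intrinsic
    dimension m = 1), embedded in R^d by padding with zeros; isotropic
    Gaussian noise of standard deviation sigma is added in all d
    ambient coordinates.
    """
    rng = np.random.default_rng(rng)
    t = rng.uniform(0.0, 1.0, size=n)          # Theta ~ uniform on [0, 1]
    X = np.zeros((n, d))
    X[:, 0] = np.sin(2 * np.pi * t)            # embedding i(Theta)
    X[:, 1] = 2 * np.pi * t
    X += sigma * rng.standard_normal((n, d))   # eps ~ N(0, sigma^2 * Identity)
    return X, t

X, t = sample_noisy_manifold()
```

Note that with these values the noise contribution to squared distances, 2dσ² = 64 (see Lemma 1 below), already exceeds the maximal squared distance along the clean curve (about 4 + (2π)² ≈ 43.5), so the sample is genuinely in the high-dimensional-noise regime the paper targets.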
However, the usual technique of solving this PDE on a grid is infeasible due to the high dimension of the ambient space R^d.\n\nInstead we solve the diffusion process directly on a graph generated by the sample X_i. This can be motivated by recent results in [7], where it was shown that the generator of the diffusion process, the Laplacian ∆_{R^d}, can be approximated by the graph Laplacian of a random neighborhood graph. A similar setting for the denoising of two-dimensional meshes in R^3 has been proposed in the seminal work of Taubin [13]. Since then several modifications of his original idea have been proposed in the computer graphics community, including the recent development in [11] to apply the algorithm directly to point cloud data in R^3. In this paper we propose a modification of this diffusion process which allows us to deal with general noisy samples of arbitrary (low-dimensional) submanifolds in R^d. In particular, the proposed algorithm can cope with high-dimensional noise. Moreover, we give an interpretation of the algorithm which differs from the one usually given in the computer graphics community and takes into account the probabilistic nature of the problem.\n\n3.1 Structure on the sample-based graph\n\nWe would like to define a diffusion process directly on the sample X_i. To this end we need the generator of the diffusion process, the graph Laplacian. We will construct this operator for a weighted, undirected graph. The graph vertices are the sample points X_i. With {h(X_i)}_{i=1}^n being the k-nearest neighbor (k-NN) distances, the weights of the k-NN graph are defined as\n\nw(X_i, X_j) = exp(−‖X_i − X_j‖² / (max{h(X_i), h(X_j)})²)   if ‖X_i − X_j‖ ≤ max{h(X_i), h(X_j)},\n\nand w(X_i, X_j) = 0 otherwise. Additionally we set w(X_i, X_i) = 0, so that the graph has no loops. Further we denote by d the degree function d(X_i) = Σ_{j=1}^n w(X_i, X_j) of the graph, and we introduce two Hilbert spaces H_V, H_E of functions on the vertices V and edges E. Their inner products are defined as\n\n⟨f, g⟩_{H_V} = Σ_{i=1}^n f(X_i) g(X_i) d(X_i),   ⟨φ, ψ⟩_{H_E} = Σ_{i,j=1}^n w(X_i, X_j) φ(X_i, X_j) ψ(X_i, X_j).\n\nIntroducing the discrete differential ∇ : H_V → H_E, (∇f)(X_i, X_j) = f(X_j) − f(X_i), the graph Laplacian is defined as\n\n∆ : H_V → H_V,  ∆ = ∇*∇,   (∆f)(X_i) = f(X_i) − (1/d(X_i)) Σ_{j=1}^n w(X_i, X_j) f(X_j),\n\nwhere ∇* is the adjoint of ∇. Defining the matrix D with the degree function on the diagonal, the graph Laplacian in matrix form is given as ∆ = 𝟙 − D⁻¹W, see [7] for more details. Note that although ∆ is not a symmetric matrix, it is a self-adjoint operator with respect to the inner product in H_V.\n\n¹In local coordinates θ¹, …, θ^m the natural volume element dV is given as dV = √(det g) dθ¹ … dθ^m, where det g is the determinant of the metric tensor g.\n\n²Note that P_M is not absolutely continuous with respect to the Lebesgue measure in R^d and therefore p(θ) is not a density in R^d.\n\n3.2 The denoising algorithm\n\nHaving defined the necessary structure on the graph, it is straightforward to write down the backward diffusion process. In the next section we will analyze the geometric properties of this diffusion process and show why it is directed towards the submanifold M. Since the graph Laplacian is the generator of the diffusion process on the graph, we can formulate the algorithm by the following differential equation on the graph:\n\n∂_t X = −γ ∆X,    (2)\n\nwhere γ > 0 is the diffusion constant. Since the points change with time, the whole graph is dynamic in our setting. This is different from the diffusion processes on a fixed graph studied in semi-supervised learning. 
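The graph structure above translates directly into code. The following sketch (our own function names; a dense numpy implementation, whereas a sparse solver would be used in practice) builds the k-NN weights and the Laplacian ∆ = 𝟙 − D⁻¹W, and already includes the implicit Euler iteration of Algorithm 1 described next:

```python
import numpy as np

def knn_graph_laplacian(X, k):
    """Weighted k-NN graph of Section 3.1 and its Laplacian L = I - D^{-1} W."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise ||.||^2
    h = np.sqrt(np.sort(sq, axis=1)[:, k])        # k-NN distance h(X_i) (index 0 is the point itself)
    hmax2 = np.maximum.outer(h, h) ** 2           # (max{h(X_i), h(X_j)})^2
    W = np.exp(-sq / hmax2)
    W[sq > hmax2] = 0.0                           # cut off beyond the neighborhood
    np.fill_diagonal(W, 0.0)                      # w(X_i, X_i) = 0, no self-loops
    L = np.eye(n) - W / W.sum(axis=1)[:, None]    # Delta = 1 - D^{-1} W
    return W, L

def manifold_denoise(X, k=10, delta_t=0.5, steps=10):
    """Algorithm 1: repeat X <- (I + dt * Delta)^{-1} X, rebuilding the graph each step."""
    X = np.asarray(X, dtype=float).copy()
    n = X.shape[0]
    for _ in range(steps):
        _, L = knn_graph_laplacian(X, k)
        X = np.linalg.solve(np.eye(n) + delta_t * L, X)  # implicit Euler step
    return X
```

Since the rows of D⁻¹W sum to one, constant functions lie in the kernel of ∆, as they should for a Laplacian; this is a cheap sanity check on any implementation.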
In order to solve the differential equation (2) we choose an implicit Euler scheme, that is\n\nX(t + 1) − X(t) = −δt γ ∆X(t + 1),    (3)\n\nwhere δt is the time-step. Since the implicit Euler scheme is unconditionally stable, we can choose the factor δt γ arbitrarily. In the following we fix γ = 1, so that the only free parameter remains δt, which is set to δt = 0.5 in the rest of the paper. The solution of the implicit Euler scheme for one timestep in Equation 3 can then be computed as X(t + 1) = (𝟙 + δt ∆)⁻¹X(t). After each timestep the point configuration has changed, so that one has to recompute the weight matrix W of the graph. Then the procedure is continued until a predefined stopping criterion is satisfied, see Section 3.4. The pseudo-code is given in Algorithm 1.\n\nAlgorithm 1 Manifold denoising\n1: Choose δt, k\n2: while Stopping criterion not satisfied do\n3:   Compute the k-NN distances h(X_i), i = 1, …, n\n4:   Compute the weights w(X_i, X_j) of the graph with w(X_i, X_i) = 0 and w(X_i, X_j) = exp(−‖X_i − X_j‖² / (max{h(X_i), h(X_j)})²) if ‖X_i − X_j‖ ≤ max{h(X_i), h(X_j)}\n5:   Compute the graph Laplacian ∆ = 𝟙 − D⁻¹W\n6:   Solve X(t + 1) − X(t) = −δt ∆X(t + 1)  ⇒  X(t + 1) = (𝟙 + δt ∆)⁻¹X(t)\n7: end while\n\nIn [12] it was pointed out that there exists a connection between diffusion processes and Tikhonov regularization. Namely, the result of one time step of the diffusion process with the implicit Euler scheme is equivalent to the solution of the following regularization problem on the graph:\n\narg min_{Z^α ∈ H_V} S(Z) := arg min_{Z^α ∈ H_V} Σ_{α=1}^d ‖Z^α − X^α(t)‖²_{H_V} + δt Σ_{α=1}^d ‖∇Z^α‖²_{H_E},\n\nwhere Z^α denotes the α-component of the vector Z ∈ R^d. 
With ‖∇Z^α‖²_{H_E} = ⟨Z^α, ∆Z^α⟩_{H_V}, the minimizer of the above functional with respect to Z^α can be easily computed from\n\n∂S(Z^α)/∂Z^α = 2(Z^α − X^α(t)) + 2 δt ∆Z^α = 0,  α = 1, …, d,\n\nso that Z = (𝟙 + δt ∆)⁻¹X(t). Each time-step of our diffusion process can therefore be seen as a regression problem, where we trade off between fitting the new points Z to the points X(t) and having a 'smooth' point configuration Z, measured with respect to the current graph built from X(t).\n\n3.3 k-nearest neighbor graph versus h-neighborhood graph\n\nIn the denoising algorithm we have chosen to use a weighted k-NN graph. It turns out that a k-NN graph has three advantages over an h-neighborhood graph³. The first advantage is that the graph has better connectivity. Namely, points in areas of different density have quite different neighborhood scales, which for a fixed h leads to either disconnected or over-connected graphs.\n\nSecond, we usually have high-dimensional noise. In this case it is well known that the distance statistics of a sample change drastically, which is illustrated by the following trivial lemma.\n\nLemma 1 Let x, y ∈ R^d and ε₁, ε₂ ∼ N(0, σ²𝟙), and define X = x + ε₁ and Y = y + ε₂. Then\n\nE ‖X − Y‖² = ‖x − y‖² + 2 d σ²,   and   Var ‖X − Y‖² = 8σ² ‖x − y‖² + 8 d σ⁴.\n\nOne can deduce that the expected squared distance of the noisy submanifold sample is dominated by the noise term if 2dσ² > max_{θ,θ′} ‖i(θ) − i(θ′)‖², which is usually the case for large d. In this case it is quite difficult to adjust the average number of neighbors in a graph by a fixed neighborhood size h, since the distances start to concentrate around their mean value. 
The third advantage is that by choosing k we can directly control the sparsity of the weight matrix W and of the Laplacian ∆ = 𝟙 − D⁻¹W, so that the linear equation in each time step can be solved efficiently.\n\n3.4 Stopping criterion\n\nThe problem of choosing the correct number of iterations is very difficult if one initially has high-dimensional noise, and it requires prior knowledge. We propose two stopping criteria. The first one is based on the effect that if the diffusion is run for too long, the data becomes disconnected and concentrates in local clusters. One can therefore stop if the number of connected components of the graph⁴ increases. The second one is based on prior knowledge about the intrinsic dimension of the data. In this case one can stop the denoising if the estimated dimension of the sample (e.g. via the correlation dimension, see [4]) is equal to the intrinsic one. Another less founded but very simple alternative is to stop the iterations if the changes in the sample are below some pre-defined threshold.\n\n4 Large sample limit and theoretical analysis\n\nOur qualitative theoretical analysis of the denoising algorithm is based on recent results on the limit of graph Laplacians [7, 8] as the neighborhood size decreases and the sample size increases. We use this result to study the continuous limit of the diffusion process. The following theorem about the limit of the graph Laplacian applies to h-neighborhood graphs, whereas the denoising algorithm is based on a k-NN graph. Our conjecture⁵ is that the result carries over to k-NN graphs.\n\nTheorem 1 [7, 8] Let {X_i}_{i=1}^n be an i.i.d. sample of a probability measure P_M on an m-dimensional compact submanifold⁶ M of R^d, where P_M has a density p_M ∈ C³(M). 
Let f \u2208 C 3(M ) and\nx \u2208 M \\\u2202M, then if h \u2192 0 and nhm+2/ log n \u2192 \u221e,\n\nlim\nn\u2192\u221e\n\n1\nh2 (\u2206f )(x) \u223c \u2212(\u2206M f )(x) \u2212\n\n2\np\n\nh\u2207f, \u2207piTxM ,\n\nalmost surely,\n\nwhere \u2206M is the Laplace-Beltrami operator of M and \u223c means up to a constant which depends on\nthe kernel function k(kx \u2212 yk) used to de\ufb01ne the weights W (x, y) = k(kx \u2212 yk) of the graph.\n\n3In an h-neighborhood graph two sample points Xi, Xj have a common edge if kXi \u2212 Xjk \u2264 h.\n4The number of connected comp. is equal to the multiplicity of the \ufb01rst eigenvalue of the graph Laplacian.\n5Partially we veri\ufb01ed the conjecture however the proof would go beyond the scope of this paper.\n6Note that the case where P has full support in Rd is a special case of this theorem.\n\n\f4.1 The noise-free case\n\nWe \ufb01rst derive in a non-rigorous way the continuum limit of our graph based diffusion process in\nthe noise free case. To that end we do the usual argument made in physics to go from a difference\nequation on a grid to the differential equation. 
We rewrite our diffusion equation (2) on the graph as\n\n(i(t + 1) − i(t)) / δt = −(h²/δt) (1/h²) ∆ i.\n\nTaking now the limit h → 0 and δt → 0 such that the diffusion constant D = h²/δt stays finite, and using the limit of (1/h²)∆ given in Theorem 1, we get the following differential equation:\n\n∂_t i = D [∆_M i + (2/p) ⟨∇p, ∇i⟩].    (4)\n\nNote that for the k-NN graph the neighborhood size h is a function of the local density, which implies that the diffusion constant D also becomes a function of the local density, D = D(p(x)).\n\nLemma 2 ([9], Lemma 2.14) Let i : M → R^d be a regular, smooth embedding of an m-dimensional manifold M. Then ∆_M i = m H, where H is the mean curvature⁷ of M.\n\nUsing the equation ∆_M i = m H we can establish the equivalence of the continuous diffusion equation (4) to a generalized mean curvature flow:\n\n∂_t i = D [m H + (2/p) ⟨∇p, ∇i⟩].    (5)\n\nThe equivalence to the mean curvature flow ∂_t i = m H is usually given in computer graphics as the reason for the denoising effect, see [13, 11]. However, as we have shown, the diffusion already has an additional part if one has a non-uniform probability measure on M.\n\n4.2 The noisy case\n\nThe analysis of the noisy case is more complicated and we can only provide a rough analysis. The large sample limit n → ∞ of the graph Laplacian ∆ at a sample point X_i is given as\n\n∆X_i = X_i − (∫_{R^d} k_h(‖X_i − y‖) y p_X(y) dy) / (∫_{R^d} k_h(‖X_i − y‖) p_X(y) dy),    (6)\n\nwhere k_h(‖x − y‖) is the weight function used in the construction of the graph, in our case k_h(‖x − y‖) = e^{−‖x−y‖²/(2h²)} 𝟙_{‖x−y‖≤h}. In the following analysis we will assume three things: 1) the noise level σ is small compared to the neighborhood size h, 2) the curvature of M is small compared to h, and 3) the density p_M varies slowly along M. 
Under these conditions it is easy to see that the main contribution of −∆X_i in Equation 6 will be in the direction of the gradient of p_X at X_i. In the following we try to separate this effect from the mean curvature part derived in the noise-free case. Under the above conditions we can do the following second-order approximation of a convolution with a Gaussian, see [7], using the explicit form of p_X from Equation 1:\n\n∫_{R^d} k_h(‖X − y‖) y p_X(y) dy = ∫_M [ (2πσ²)^{−d/2} ∫_{R^d} k_h(‖X − y‖) y e^{−‖y−i(θ)‖²/(2σ²)} dy ] p(θ) dV(θ) = ∫_M k_h(‖X − i(θ)‖) i(θ) p(θ) dV(θ) + O(σ²).\n\nNow define the closest point of the submanifold M to X: i(θ_min) = arg min_{i(θ)∈M} ‖X − i(θ)‖. Using the condition on the curvature we can approximate the diffusion step −∆X as follows:\n\n−∆X ≈ (i(θ_min) − X) − ( i(θ_min) − (∫_M k_h(‖i(θ_min) − i(θ)‖) i(θ) p(θ) dV(θ)) / (∫_M k_h(‖i(θ_min) − i(θ)‖) p(θ) dV(θ)) ),\n\nwhere the first bracket is denoted term I, the second bracket term II, and where we have omitted second-order terms. It follows from the proof of Theorem 1 that term II is an approximation of −∆_M i(θ_min) − (2/p) ⟨∇p, ∇i⟩ = −mH − (2/p) ⟨∇p, ∇i⟩, whereas term I leads to a movement of X towards M. We conclude from this rough analysis that in the denoising procedure we always have a tradeoff between reducing the noise via term I and smoothing the manifold via the mean curvature term II. Note that term II is the same for all points X which have i(θ_min) as their closest point on M. Therefore this term leads to a global flow which smoothes the submanifold. In the experiments we observe this as the shrinking phenomenon.\n\n⁷The mean curvature H is the trace of the second fundamental form. If M is a hypersurface in R^d, the mean curvature at p is H = (1/(d−1)) Σ_{i=1}^{d−1} κ_i N, where N is the normal vector and κ_i are the principal curvatures at p.\n\n5 Experiments\n\nIn the experimental section we test the performance of the denoising algorithm on three noisy datasets. Furthermore, we explore the possibility of using the denoising method as a preprocessing step for semi-supervised learning. Due to lack of space we cannot deal with further applications as a preprocessing method for clustering or dimensionality reduction.\n\n5.1 Denoising\n\nThe first experiment is done on a toy dataset. The manifold M is given as t → [sin(2πt), 2πt], where t is sampled uniformly on [0, 1]. We embed M into R^200 and put full isotropic Gaussian noise with σ = 0.4 on each datapoint, resulting in the left part of Figure 1. We verify the effect of the denoising algorithm by continuously estimating the dimension over different scales (note that the dimension of a finite sample always depends on the scale at which one examines it). We use for that purpose the correlation dimension estimator of [4]. The result of the denoising algorithm with k = 25 for the k-NN graph and 10 timesteps is given in the right part of Figure 1. One can observe visually, by inspecting the dimension estimate, as well as by the histogram of distances, that the algorithm has reduced the noise. One can also see two undesired effects. First, as discussed in the last section, the diffusion process has a component which moves the manifold in the direction of the mean curvature, which leads to a smoothing of the sinusoid. Second, at the boundary the sinusoid shrinks due to the missing counterparts in the local averaging done by the graph Laplacian, see (6), which results in an inward tangential component.\n\nFigure 1: Left: 500 samples of the noisy sinusoid in R^200 as described in the text (panels: data points, dimension vs. scale, histogram of distances). Right: Result after 10 steps of the denoising method with k = 25; note that the estimated dimension is much smaller and the scale has changed, as can be seen from the histogram of distances shown to the right.\n\nIn the next experiment we apply the denoising to the handwritten digit datasets USPS and MNIST. For handwritten digits the underlying manifold corresponds to varying writing styles. In order to check whether the denoising method can also handle several manifolds at the same time, which would make the method useful for clustering and dimensionality reduction, we fed all 10 digits simultaneously into the algorithm. For USPS we used the 9298 digits in the training and test set, and from MNIST a subsample of 1000 examples from each digit. We used the two-sided tangent distance of [10], which provides a certain invariance against translation, scaling, rotation and line thickness. In Figures 2 and 3 we show a sample of the result across all digits. In both cases some digits are transformed wrongly. This happens since they are outliers with respect to their digit manifold and lie closer to another digit component. An improved handling of invariances should at least partially resolve this problem.\n\n5.2 Denoising as pre-processing for semi-supervised learning\n\nMost semi-supervised learning (SSL) algorithms are based on the cluster assumption, that is, the decision boundary should lie in a low-density region. 
The denoising algorithm is consistent with that assumption, since it moves data points towards high-density regions.\n\nFigure 2: Left: Original images from USPS; right: after 15 iterations with k = [9298/50].\n\nFigure 3: Left: Original images from MNIST; right: after 15 iterations with k = 100.\n\nThis is in particular helpful if the original clusters are distorted by high-dimensional noise. 
In this case the distance structure of the data becomes less discriminative, see Lemma 1, and the identification of the low-density regions is quite difficult. We expect that in such cases manifold denoising as a pre-processing step should improve the discriminative capacity of graph-based methods. However, the denoising algorithm does not take into account label information. Therefore, in the case where the cluster assumption is not fulfilled, the denoising algorithm might decrease the performance. We therefore add the number of iterations of the denoising process as an additional parameter of the SSL algorithm.\n\nFor the evaluation of our denoising algorithm as a preprocessing step for SSL, we used the benchmark data sets from [3]. A description of the data sets and the results of several state-of-the-art SSL algorithms can be found there. As SSL algorithm we use a slight variation of the one by Zhou et al. [15]. It can be formulated as the following regularized least squares problem:\n\nf* = arg min_{f ∈ H_V} ‖f − y‖²_{H_V} + μ ⟨f, ∆f⟩_{H_V},\n\nwhere y is the given label vector and ⟨f, ∆f⟩_{H_V} is the smoothness functional induced by the graph Laplacian. The solution is given as f* = (𝟙 + μ∆)⁻¹y. In order to be consistent with our denoising scheme, we choose, instead of the normalized graph Laplacian ∆̃ = 𝟙 − D^{−1/2} W D^{−1/2} suggested in [15], the graph Laplacian ∆ = 𝟙 − D⁻¹W and the graph structure described in Section 3.1. As neighborhood graph for the SSL algorithm we used a symmetric k-NN graph with the following weights: w(X_i, X_j) = exp(−γ ‖X_i − X_j‖²) if ‖X_i − X_j‖ ≤ min{h(X_i), h(X_j)}. As suggested in [3], the distances are rescaled in each iteration such that the 1/c²-quantile of the distances equals 1, where c is the number of classes. The number of nearest neighbors k was chosen for denoising in {5, 10, 15, 25, 50, 100, 150, 200}, and for classification in {5, 10, 20, 50, 100}. The scaling parameter γ and the regularization parameter μ were selected from {1/2, 1, 2} resp. {2, 20, 200}. The maximum number of iterations was set to 20. Parameter values where not all data points have been classified, that is, where the graph is disconnected, were excluded. The best parameters were found by ten-fold cross validation. The final classification is done using a majority vote of the classifiers corresponding to the minimal cross-validation test error. In Table 1 the results are shown for the standard case, that is no manifold denoising (No MD), and with manifold denoising (MD). For the datasets g241c, g241d and Text we get significantly better performance using denoising as a preprocessing step, whereas the results are indifferent for the other datasets. However, compared to the state-of-the-art SSL results on all the datasets reported in [3], the denoising preprocessing has led to a performance of the algorithm which is competitive uniformly over all datasets. This improvement is probably not limited to the employed SSL algorithm but should also apply to other graph-based methods.\n\nTable 1: Manifold Denoising (MD) as preprocessing for SSL. 
The mean and standard deviation of the test error are shown for the datasets from [3] for 10 (top) and 100 (bottom) labeled points.\n\n10 labels | g241c | g241d | Digit1 | USPS | COIL | BCI | Text\nNo MD | 47.9±2.67 | 47.2±4.0 | 14.1±5.4 | 19.2±2.1 | 66.2±7.8 | 50.0±1.1 | 41.9±7.0\nMD | 29.0±14.3 | 26.6±17.8 | 13.8±5.5 | 20.5±5.0 | 66.4±6.0 | 49.8±1.5 | 33.6±7.0\n∅ Iter. | 12.3±3.8 | 11.7±4.4 | 9.6±2.4 | 7.3±2.9 | 4.9±2.7 | 8.2±3.5 | 5.6±4.4\n\n100 labels | g241c | g241d | Digit1 | USPS | COIL | BCI | Text\nNo MD | 38.9±6.3 | 34.2±4.1 | 3.0±1.6 | 6.2±1.2 | 15.5±2.6 | 46.5±1.9 | 27.0±1.9\nMD | 16.1±2.2 | 7.5±0.9 | 3.2±1.2 | 5.3±1.4 | 16.2±2.5 | 48.4±2.0 | 24.1±2.8\n∅ Iter. | 15.0±0.8 | 14.5±1.5 | 8.0±3.2 | 8.3±3.8 | 1.6±1.8 | 8.4±4.3 | 6.0±3.5\n\nReferences\n\n[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.\n[2] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10:215-234, 1998.\n[3] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, 2006. In press, http://www.kyb.tuebingen.mpg.de/ssl-book.\n[4] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica D, 9:189-208, 1983.\n[5] A. Grigoryan. Heat kernels on weighted manifolds and applications. Contemporary Mathematics, 398:93-191, 2006.\n[6] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502-516, 1989.\n[7] M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians. In P. Auer and R. Meir, editors, Proc. of the 18th Conf. on Learning Theory (COLT), pages 486-500, Berlin, 2005. Springer.\n[8] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs, 2006. Accepted at JMLR, available at arXiv:math.ST/0608522.\n[9] M. Hein. Geometrical aspects of statistical learning theory. PhD thesis, MPI für biologische Kybernetik/Technische Universität Darmstadt, 2005.\n[10] D. Keysers, W. Macherey, H. Ney, and J. Dahmen. Adaptation in statistical pattern recognition using tangent vectors. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26:269-274, 2004.\n[11] C. Lange and K. Polthier. Anisotropic smoothing of point sets. Computer Aided Geometric Design, 22:680-692, 2005.\n[12] O. Scherzer and J. Weickert. Relations between regularization and diffusion filtering. J. of Mathematical Imaging and Vision, 12:43-63, 2000.\n[13] G. Taubin. A signal processing approach to fair surface design. In Proc. of the 22nd Annual Conf. on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 351-358, 1995.\n[14] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.\n[15] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In S. Thrun, L. Saul, and B. Schölkopf, editors, Adv. in Neural Information Processing Systems (NIPS), volume 16. MIT Press, 2004.", "award": [], "sourceid": 2997, "authors": [{"given_name": "Matthias", "family_name": "Hein", "institution": null}, {"given_name": "Markus", "family_name": "Maier", "institution": null}]}