{"title": "Improved Graph Laplacian via Geometric Self-Consistency", "book": "Advances in Neural Information Processing Systems", "page_first": 4457, "page_last": 4466, "abstract": "We address the problem of setting the kernel bandwidth, \u03f5, used by Manifold Learning algorithms to construct the graph Laplacian. Exploiting the connection between manifold geometry, represented by the Riemannian metric, and the Laplace-Beltrami operator, we set \u03f5 by optimizing the Laplacian's ability to preserve the geometry of the data. Experiments show that this principled approach is effective and robust.", "full_text": "Improved Graph Laplacian via Geometric Self-Consistency\n\nDominique C. Perrault-Joncas\n\nGoogle, Inc.\n\ndominiquep@google.com\n\nMarina Meil\u0103\n\nDepartment of Statistics\nUniversity of Washington\n\nmmp2@uw.edu\n\nJames McQueen\n\nAmazon\n\njmcq@amazon.com\n\nAbstract\n\nIn all manifold learning algorithms and tasks, setting the kernel bandwidth \u03f5 used to\nconstruct the graph Laplacian is critical. We address this problem by choosing\na quality criterion for the Laplacian that measures its ability to preserve the\ngeometry of the data. For this, we exploit the connection between manifold\ngeometry, represented by the Riemannian metric, and the Laplace-Beltrami operator.\nExperiments show that this principled approach is effective and robust.\n\n1 Introduction\n\nManifold learning and manifold regularization are popular tools for dimensionality reduction and\nclustering [1, 2], as well as for semi-supervised learning [3, 4, 5, 6] and modeling with Gaussian\nProcesses [7]. Whatever the task, a manifold learning method requires the user to provide an external\nparameter, called \u201cbandwidth\u201d or \u201cscale\u201d \u03f5, that defines the size of the local neighborhood.\nMore formally put, a common challenge in semi-supervised and unsupervised manifold learning\nlies in obtaining a \u201cgood\u201d graph Laplacian estimator L. 
We focus on the practical problem of\noptimizing the parameters used to construct L and, in particular, \u03f5. As we see empirically, since the\nLaplace-Beltrami operator on a manifold is intimately related to the geometry of the manifold, our\nestimator for \u03f5 has advantages even in methods that do not explicitly depend on L.\nIn manifold learning, there has been sustained interest in determining the asymptotic properties of L\n[8, 9, 10, 11]. The most relevant is [12], which derives the optimal rate for \u03f5 w.r.t. the sample size N\n\n$$\\epsilon^2 = C(M)\\, N^{-\\frac{1}{3+d/2}}, \\tag{1}$$\n\nwith d denoting the intrinsic dimension of the data manifold M. The problem is that C(M) is a\nconstant that depends on the yet unknown data manifold, so it is rarely known in practice.\nConsiderably fewer studies have focused on the parameters used to construct L in a finite sample\nproblem. A common approach is to \u201ctune\u201d parameters by cross-validation in the semi-supervised\ncontext. However, in an unsupervised problem like non-linear dimensionality reduction, there is\nno context in which to apply cross-validation. While several approaches [13, 14, 15, 16] may yield\na usable parameter, they generally do not aim to improve L per se and offer no geometry-based\njustification for its selection.\nIn this paper, we present a new, geometrically inspired approach to selecting the bandwidth parameter\n\u03f5 of L for a given data set. Under the data manifold hypothesis, the Laplace-Beltrami operator \u2206M\nof the data manifold M contains all the intrinsic geometry of M. 
We set out to exploit this fact by\ncomparing the geometry induced by the graph Laplacian L with the local data geometry and choose\nthe value of \u03f5 for which these two are closest.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Background: Heat Kernel, Laplacian and Geometry\n\nOur paper builds on two previous sets of results: 1) the construction of L that is consistent for \u2206M\nwhen the sample size N \u2192 \u221e under the data manifold hypothesis (see [17]); and 2) the relationship\nbetween \u2206M and the Riemannian metric g on a manifold, as well as the estimation of g (see [18]).\nConstruction of the graph Laplacian. Several methods to construct L have been suggested\n(see [10, 11]). The one we present, due to [17], guarantees that, if the data are sampled from a manifold\nM, L converges to \u2206M:\nGiven a set of points D = {x1, . . . , xN} in high-dimensional Euclidean space R^r, construct a\nweighted graph G = (D, W ) over them, with W = [wij]ij=1:N . The weight wij between xi and xj\nis the heat kernel [1]\n\n$$W_{ij} \\equiv w_\\epsilon(x_i, x_j) = \\exp\\left(-\\|x_i - x_j\\|_2^2 / \\epsilon^2\\right), \\tag{2}$$\n\nwith \u03f5 a bandwidth parameter fixed by the user. Next, construct L = [Lij]ij of G by\n\n$$t_i = \\sum_j W_{ij}, \\quad W'_{ij} = \\frac{W_{ij}}{t_i t_j}, \\quad t'_i = \\sum_j W'_{ij}, \\quad \\text{and} \\quad L_{ij} = \\sum_j \\frac{W'_{ij}}{t'_j}. \\tag{3}$$\n\nEquation (3) represents the discrete version of the renormalized Laplacian construction from [17].\nNote that ti, t'i, W', L all depend on the bandwidth \u03f5 via the heat kernel.\nEstimation of the Riemannian metric. We follow [18] in this step. 
A Riemannian manifold (M, g)\nis a smooth manifold M endowed with a Riemannian metric g; the metric g at point p \u2208 M is a scalar\nproduct over the vectors in TpM, the tangent subspace of M at p. In any coordinate representation\nof M, gp \u2261 G(p) - the Riemannian metric at p - represents a positive definite matrix\u00b9 of dimension d\nequal to the intrinsic dimension of M. We say that the metric g encodes the geometry of M because\ng determines the volume element for any integration over M by $\\sqrt{\\det G(x)}\\,dx$, and the line element\nfor computing distances along a curve x(t) \u2282 M, by $\\sqrt{\\left(\\frac{dx}{dt}\\right)^T G(x)\\, \\frac{dx}{dt}}\\; dt$.\nIf we assume that the data we observe (in R^r) lies on a manifold, then under rotation of the original\ncoordinates, the metric G(p) is the unit matrix of dimension d padded with zeros up to dimension r.\nWhen the data is mapped to another coordinate system - for instance by a manifold learning algorithm\nthat performs non-linear dimension reduction - the matrix G(p) changes with the coordinates to\nreflect the distortion induced by the mapping (see [18] for more details).\nProposition 2.1 Let x denote local coordinate functions of a smooth Riemannian manifold (M, g)\nof dimension d and \u2206M the Laplace-Beltrami operator defined on M. Then, H(p) = (G(p))^{\u22121}, the\n(matrix) inverse of the Riemannian metric at point p, is given by\n\n$$(H(p))^{kj} = \\frac{1}{2} \\Delta_M \\left(x^k - x^k(p)\\right)\\left(x^j - x^j(p)\\right)\\Big|_{x=x(p)} \\quad \\text{with } k, j = 1, \\ldots, d. \\tag{4}$$\n\nNote that the inverse matrices H(p), p \u2208 M, being symmetric and positive definite, also define a\nmetric h called the cometric on M. Proposition 2.1 says that the cometric is given by applying the\n\u2206M operator to the function $\\phi^{kj} = (x^k - x^k(p))(x^j - x^j(p))$, where x^k, x^j denote coordinates\nk, j seen as functions on M. A converse theorem [19] states that g (or h) uniquely determines \u2206M.\nProposition 2.1 provides a way to estimate h and g from data. 
Algorithm 1, adapted from [18],\nimplements (4).\n\nAlgorithm 1 RiemannianMetric(X, i, L, pow \u2208 {\u22121, 1})\n\nInput: N \u00d7 d design matrix X, i index in data set, Laplacian L, binary variable pow\nfor k = 1 \u2192 d, l = 1 \u2192 d do\n    H_{k,l} \u2190 \u2211_{j=1}^{N} L_{ij} (X_{jk} \u2212 X_{ik})(X_{jl} \u2212 X_{il})\nend for\nreturn H^{pow} (i.e. H if pow = 1 and H^{\u22121} if pow = \u22121)\n\n3 A Quality Measure for L\n\nOur approach can be simply stated: the \u201cbest\u201d value for \u03f5 is the value for which the corresponding L\nof (3) best captures the original data geometry. For this we must: (1) estimate the geometry g or h\nfrom L (this is achieved by RiemannianMetric()); (2) find an independent way to estimate the data\ngeometry, locally (this is done in Sections 3.1 and 3.2); (3) define a measure of agreement between\nthe two (Section 3.3).\n\n\u00b9 This paper contains mathematical objects like M, g and \u2206, and computable objects like a data point x, and\nthe graph Laplacian L. The Riemannian metric at a point belongs to both categories, so it will sometimes be\ndenoted gp, gxi and sometimes G(p), G(xi), depending on whether we refer to its mathematical or algorithmic\naspects (or, more formally, whether the expression is coordinate free or in a given set of coordinates). This also\nholds for the cometric h, defined in Proposition 2.1.\n\n3.1 The Geometric Consistency Idea and gtarget\n\nThere is a natural way to estimate the geometry of the data without the use of L. We consider\nthe canonical embedding of the data in the ambient space R^r for which the geometry is trivially\nknown. This provides a target gtarget; we tune the scale of the Laplacian so that the g calculated\nfrom Proposition 2.1 matches this target. Hence, we choose \u03f5 to maximize consistency with the\ngeometry of the data. 
We denote the inherited metric by g_{R^r}|_{TM}, which stands for the restriction of\nthe natural metric of the ambient space R^r to the tangent bundle TM of the manifold M. We tune\nthe parameters of the graph Laplacian L so as to enforce (a coordinate expression of) the identity\n\ng_p(\u03f5) = gtarget, with gtarget = g_{R^r}|_{T_pM} \u2200p \u2208 M. (5)\n\nIn the above, the l.h.s. will be the metric implied from the Laplacian via Proposition 2.1, and the r.h.s.\nis the metric induced by R^r. Mathematically speaking, (5) is necessary and sufficient for finding the\n\u201ccorrect\u201d Laplacian. The next section describes how to obtain the r.h.s. from a finite sample D. Then,\nto optimize the graph Laplacian we estimate g from L as prescribed by Proposition 2.1 and compare\nwith g_{R^r}|_{T_pM} numerically. We call this approach geometric consistency (GC). The GC method is not\nlimited to the choice of \u03f5, but can be applied to any other parameter required for the Laplacian.\n\n3.2 Robust Estimation of gtarget for a finite sample\n\nFirst idea: estimate tangent subspace We use the simple fact, implied by Section 3.1, that\nprojecting the data onto TpM preserves the metric locally around p. Hence, Gtarget = Id in the\nprojected data. Moreover, projecting on any direction in TpM does not change the metric in that\ndirection. This remark allows us to work with small matrices (of at most d \u00d7 d instead of r \u00d7 r) and\nto avoid the problem of estimating d, the intrinsic dimension of the data manifold.\nSpecifically, we evaluate the tangent subspace around each sampled point xi using weighted (local)\nPrincipal Component Analysis (wPCA) and then express g_{R^r}|_{T_pM} directly in the resulting\nlow-dimensional subspace as the unit matrix Id. The tangent subspace also serves to define a local\ncoordinate chart, which is passed as input to Algorithm 1, which computes H(xi), G(xi) in these\ncoordinates. 
For computing TxiM by wPCA, we choose weights defined by the heat kernel (2),\ncentered around xi, with the same bandwidth \u03f5 as for computing L. This approach is similar to the\nsample-wise weighted PCA of [20], with one important requirement: the weights must decay rapidly away\nfrom xi so that only points close to xi are used to estimate TxiM. This is satisfied by the weighted\nrecentered design matrix Z, where Zi:, row i of Z, is given by:\n\n$$Z_{i:} = W_{ij}(x_i - \\bar{x}) \\Big/ \\Big(\\sum_{j'=1}^{N} W_{ij'}\\Big), \\quad \\text{with } \\bar{x} = \\Big(\\sum_{j=1}^{N} W_{ij} x_j\\Big) \\Big/ \\Big(\\sum_{j'=1}^{N} W_{ij'}\\Big). \\tag{6}$$\n\n[21] proves that wPCA using the heat kernel, with the PCA and heat kernel bandwidths equated\nas we do, yields a consistent estimator of TxiM. This is implemented in Algorithm 2.\nIn summary, to instantiate equation (5) at point xi \u2208 D, one must (i) construct row i of the graph\nLaplacian by (3); (ii) perform Algorithm 2 to obtain Y; (iii) apply Algorithm 1 to Y to obtain\nG(xi) \u2208 R^{d\u00d7d}; (iv) this matrix is then compared with Id, which represents the r.h.s. of (5).\n\nAlgorithm 2 TangentSubspaceProjection(X, w, d')\n\nInput: N \u00d7 r design matrix X, weight vector w, working dimension d'\nCompute Z using (6)\n[V, \u039b] \u2190 eig(Z^T Z, d') (i.e. d'-SVD of Z)\nCenter X around x\u0304 from (6)\nY \u2190 XV_{:,1:d'} (Project X on d' principal subspace)\nreturn Y\n\nSecond idea: project onto tangent directions We now take this approach a few steps further in\nterms of improving its robustness with minimal sacrifice to its theoretical grounding. 
In particular,\nwe perform both Algorithm 2 and Algorithm 1 in d' dimensions, with d' < d (and typically d' = 1).\nThis makes the algorithm faster, and makes the computed metrics G(xi), H(xi) both more stable\nnumerically and more robust to possible noise in the data\u00b2. Proposition 3.1 shows that the resulting\nmethod remains theoretically sound.\nProposition 3.1 Let X, Y, Z, V, W:i, H, and d \u2265 1 represent the quantities in Algorithms 1 and 2;\nassume that the columns of V are sorted in decreasing order of the singular values, and that the rows\nand columns of H are sorted according to the same order. Now denote by Y', V', H' the quantities\ncomputed by Algorithms 1 and 2 for the same X, W:i but with d \u2190 d' = 1. Then,\n\n$$V' = V_{:1} \\in \\mathbb{R}^{r \\times 1}, \\quad Y' = Y_{:1} \\in \\mathbb{R}^{N \\times 1}, \\quad H' = H_{11} \\in \\mathbb{R}. \\tag{7}$$\n\nThe proof of this result is straightforward and omitted for brevity. It is easy to see that Proposition 3.1\ngeneralizes immediately to any 1 \u2264 d' < d. In other words, by using d' < d, we will be projecting\nthe data on a proper subspace of TxiM - namely, the subspace of least curvature [22]. The cometric\nH' of this projection is the principal submatrix of order d' of H, i.e. H11 if d' = 1.\n\nThird idea: use h instead of g Relation (5) is trivially satisfied by the cometrics of g and gtarget\n(the latter being Htarget = Id). Hence, inverting H in Algorithm 1 is not necessary, and we will use\nthe cometric h in place of g by default. This saves time and increases numerical stability.\n\n3.3 Measuring the Distortion\n\nFor a finite sample, we cannot expect (5) to hold exactly, and so we need to define a distortion\nbetween the two metrics to evaluate how well they agree. 
We propose the distortion\n\n$$D = \\frac{1}{N} \\sum_{i=1}^{N} \\|H(x_i) - I_d\\|, \\tag{8}$$\n\nwhere ||A|| = \u03bb_max(A) is the matrix spectral norm. Thus D measures the average distance of H\nfrom the unit matrix over the data set. For a \u201cgood\u201d Laplacian, the distortion D should be minimal:\n\n$$\\hat{\\epsilon} = \\mathrm{argmin}_{\\epsilon}\\, D. \\tag{9}$$\n\nThe choice of norm in (8) is not arbitrary. Riemannian metrics are order 2 tensors on TM,\nhence the expression of D is the discrete version of $D_{g_0}(g_1, g_2) = \\int_M \\|g_1 - g_2\\|_{g_0}\\, dV_{g_0}$, with\n$\\|g\\|_{g_0}\\big|_p = \\sup_{u,v \\in T_pM \\setminus \\{0\\}} \\frac{\\langle u, v \\rangle_{g_p}}{\\langle u, v \\rangle_{g_{0p}}}$, representing the tensor norm of gp on TpM with respect to\nthe Riemannian metric g0p. Now, (8) follows when g0, g1, g2 are replaced by I, I and H, respectively.\nWith (9), we have established a principled criterion for selecting the parameter(s) of the graph\nLaplacian, by minimizing the distortion between the true geometry and the geometry derived from\nProposition 2.1. Practically, we have in (9) a 1D optimization problem with no derivatives, and we\ncan use standard algorithms to find its minimum $\\hat{\\epsilon}$.\n\nAlgorithm 3 ComputeDistortion(X, \u03f5, d')\n\nInput: N \u00d7 r design matrix X, \u03f5, working dimension d', index set I \u2286 {1, . . . , N}\nCompute the heat kernel W by (2) for each pair of points in X\nCompute the graph Laplacian L from W by (3)\nD \u2190 0\nfor i \u2208 I do\n    Y \u2190 TangentSubspaceProjection(X, Wi,:, d')\n    H \u2190 RiemannianMetric(Y, i, L, pow = 1)\n    D \u2190 D + ||H \u2212 Id'||2/|I|\nend for\nreturn D\n\n\u00b2 We know from matrix perturbation theory that noise affects the d-th principal vector increasingly with d.\n\n4 Related Work\n\nWe have already mentioned the asymptotic result (1) of [12]. Other work in this area [8, 10, 11, 23]\nprovides the rates of change for \u03f5 with respect to N to guarantee convergence. These studies are\nrelevant, but they depend on manifold parameters that are usually not known. 
Recently, an extremely\ninteresting Laplacian \u201ccontinuous nearest neighbor\u201d consistent construction method was proposed\nby [24], from a topological perspective. However, this method depends on a smoothness parameter\ntoo, and this is estimated by constructing the persistence diagram of the data. [25] propose a new,\nstatistical approach for estimating \u03f5, which is very promising, but currently can be applied only to\nun-normalized Laplacian operators. This approach also depends on unknown parameters a, b, which\nare set heuristically. (By contrast, our method depends only weakly on d', which can be set to 1.)\nAmong practical methods, the most interesting is that of [14], which estimates k, the number of\nnearest neighbors to use in the construction of the graph Laplacian. This method optimizes k\ndepending on the embedding algorithm used. By contrast, the selection algorithm we propose\nestimates an intrinsic quantity, a scale \u03f5 that depends exclusively on the data. Moreover, it is not\nknown when minimizing reconstruction error for a particular method can be optimal, since, as shown in [26], even\nin the limit of infinite data, most embeddings will distort the original geometry. In semi-supervised\nlearning (SSL), one uses Cross-Validation (CV) [5].\nFinally, we mention the algorithm proposed in [27] (CLMR). 
Its goal is to obtain an estimate of the\nintrinsic dimension of the data; however, a by-product of the algorithm is a range of scales where\nthe tangent space at a data point is well aligned with the principal subspace obtained by a local\nsingular value decomposition. As these are scales at which the manifold looks locally linear, one can\nreasonably expect that they are also the correct scales at which to approximate differential operators,\nsuch as \u2206M. Given this, we implement the method and compare it to our own results.\nFrom the computational point of view, all methods described above explore exhaustively a range\nof \u03f5 values. GC and CLMR only require local PCA at a subset of the data points (with d' < d\ncomponents for GC, d' >> d for CLMR); whereas CV and [14] require, respectively, running an SSL\nalgorithm or an embedding algorithm for each \u03f5. In relation to these, GC is by far the most efficient\ncomputationally.\u00b3\n\n5 Experimental Results\n\nSynthetic Data. We experimented with estimating the bandwidth $\\hat{\\epsilon}$ on data sampled from two\nknown manifolds, the two-dimensional hourglass and dome manifolds of Figure 1. We sampled\npoints uniformly from these, adding 10 \u201cnoise\u201d dimensions and Gaussian noise N(0, \u03c3^2) resulting in\nr = 13 dimensions.\nThe range of \u03f5 values was delimited by \u03f5min and \u03f5max. We set \u03f5max to the average of ||xi \u2212 xj||2\nover all point pairs and \u03f5min to the limit in which the heat kernel W becomes approximately equal\nto the unit matrix; this is tested by maxj(\u2211i Wij) \u2212 1 < \u03b3 for \u03b3 \u2248 10^{\u22124}.\u2074 This range spans about\ntwo orders of magnitude in the data we considered, and was searched by a logarithmic grid with\napproximately 20 points. We saved computation time by evaluating all pointwise quantities (D\u0302,\nlocal SVD) on a random sample of size N' = 200 of each data set. 
We replicated each experiment\non 10 independent samples.\n\n\u00b3 In addition, these operations being local, they can be further parallelized or accelerated in the usual ways.\n\u2074 Guaranteeing that all eigenvalues of W are less than \u03b3 away from 1.\n\n[Figure 1 panel titles: \u03c3 = 0.001, \u03c3 = 0.01, \u03c3 = 0.1]\n\nFigure 1: Estimates $\\hat{\\epsilon}$ (mean and standard deviation over 10 runs) on the dome and hourglass data, vs sample\nsizes N for various noise levels \u03c3; d' = 2 is in black and d' = 1 in blue. In the background, we also show as\ngray rectangles, for each N, \u03c3 the intervals in the \u03f5 range where the eigengaps of local SVD indicate the true\ndimension, and, as unfilled rectangles, the estimates proposed by CLMR [27] for these intervals. The variance\nof $\\hat{\\epsilon}$ observed is due to randomness in the subsample N' used to evaluate the distortion. Our $\\hat{\\epsilon}$ always falls in the\ntrue interval (when this exists), and is less variable and more accurate than the CLMR intervals.\n\nReconstruction of manifold w.r.t. gold standard These results (relegated to the Supplement) are\nuniformly very positive, and show that GC achieves its most explicit goal, even in the presence of\nnoise. In the remainder, we illustrate the versatility of our method on other tasks.\nEffects of d', noise and N. The estimated \u03f5 are presented in Figure 1. Let $\\hat{\\epsilon}_{d'}$ denote the estimate obtained for a\ngiven d' \u2264 d. We note that when d1 < d2, typically $\\hat{\\epsilon}_{d_1} > \\hat{\\epsilon}_{d_2}$, but the values are of the same order\n(a ratio of about 2 in the synthetic experiments). The explanation is that choosing d' < d directions\nin the tangent subspace will select a subspace aligned with the \u201cleast curvature\u201d directions of the\nmanifold, if any exist, or with the \u201cleast noise\u201d in the random sample. 
In these directions, the data\nwill tolerate more smoothing, which results in larger $\\hat{\\epsilon}$. The optimal \u03f5 decreases with N and grows\nwith the noise levels, reflecting the balance it must find between variance and bias. Note that for the\nhourglass data, the highest noise level of \u03c3 = 0.1 is an extreme case, where the original manifold\nis almost drowned in the 13-dimensional noise. Hence, \u03f5 is not only commensurately larger, but also\nstable between the two dimensions and runs. This reflects the fact that \u03f5 captures the noise dimension,\nand its values are indeed just below the noise amplitude of $0.1\\sqrt{13}$. The dome data set exhibits the\nsame properties discussed previously, showing that our method is effective even for manifolds with\nboundary.\nSemi-supervised Learning (SSL) with Real Data. In this set of experiments, the task is classification\non the benchmark SSL data sets proposed by [28]. This was done by least-squares classification,\nsimilarly to [5], after choosing the optimal bandwidth by one of the methods below.\n\nTE Minimize Test Error, i.e. \u201ccheat\u201d in an attempt to get an estimate of the \u201cground truth\u201d.\n\nCV Cross-validation. We split the training set (consisting of 100 points in all data sets) into two\nequal groups;\u2075 we minimize the highly non-smooth CV classification error by simulated\nannealing.\n\nRec Minimize the reconstruction error. We cannot use the method of [14] directly, as it requires\nan embedding, so we minimize reconstruction error based on the heat kernel weights w.r.t. 
\u03f5 (this is reminiscent of LLE [29]):\n\n$$R(\\epsilon) = \\sum_{i=1}^{n} \\Big\\| x_i - \\sum_{j \\neq i} \\frac{W_{ij}}{\\sum_{l \\neq i} W_{il}}\\, x_j \\Big\\|^2$$\n\nOur method is denoted GC for Geometric Consistency; we evaluate straightforward GC, which uses the\ncometric H, and a variant that includes the matrix inversion in Algorithm 1, denoted GC\u22121.\n\n\u2075 In other words, we do 2-fold CV. We also tried 20-fold and 5-fold CV, with no significant difference.\n\nData set   TE                            CV                               Rec      GC\u22121     GC\nDigit1     0.67\u00b10.08 [0.57, 0.78]        0.80\u00b10.45 [0.47, 1.99]           0.64     0.74     0.74\nUSPS       1.24\u00b10.15 [1.04, 1.59]        1.25\u00b10.86 [0.50, 3.20]           1.68     2.42     1.10\nCOIL       49.79\u00b16.61 [42.82, 60.36]     69.65\u00b131.16 [50.55, 148.96]      78.37    216.95   116.38\nBCI        3.4\u00b13.1 [1.2, 8.9]            3.2\u00b12.5 [1.2, 8.2]               3.31     3.79     3.77\ng241c      8.3\u00b12.5 [6.3, 14.6]           8.8\u00b13.3 [4.4, 14.9]              3.19     7.37     7.35\ng241d      5.7\u00b10.24 [5.6, 6.3]           6.4\u00b11.15 [4.3, 8.2]              5.61     7.38     7.36\n\nTable 1: Estimates of \u03f5 by methods presented for the six SSL data sets used, as well as TE. 
For TE\nand CV, which depend on the training/test splits, we report the average, its standard error, and range\n(in brackets) over the 12 splits.\n\nData set   CV      Rec     GC\u22121    GC\nDigit1     3.32    2.16    2.11    2.11\nUSPS       5.18    4.83    12.00   3.89\nCOIL       7.02    8.03    16.31   8.81\nBCI        49.22   49.17   50.25   48.67\ng241c      13.31   23.93   12.77   12.77\ng241d      8.67    18.39   8.76    8.76\n\nData set   Method   d'=1    d'=2    d'=3\nDigit1     GC\u22121     0.743   0.293   0.305\nDigit1     GC       0.744   0.767   0.781\nUSPS       GC\u22121     2.42    2.31    3.88\nUSPS       GC       1.10    1.16    1.18\nCOIL       GC\u22121     116     87.4    128\nCOIL       GC       187     179     187\nBCI        GC\u22121     3.32    3.48    3.65\nBCI        GC       5.34    5.34    5.34\ng241c      GC\u22121     7.38    7.38    7.38\ng241c      GC       7.38    9.83    9.37\ng241d      GC\u22121     7.35    7.35    7.35\ng241d      GC       7.35    9.33    9.78\n\nTable 2: Left panel: Percent classification error for the six SSL data sets using the four \u03f5 estimation\nmethods described. Right panel: \u03f5 obtained for the six datasets using various d' values with GC and\nGC\u22121. $\\hat{\\epsilon}$ was computed for d=5 for Digit1, as it is known to have an intrinsic dimension of 5, and\nfound to be 1.162 with GC and 0.797 with GC\u22121.\n\nAcross all methods and data sets, estimates of \u03f5 closer to the values determined by TE lead to\nbetter classification error, see Table 2. For five of the six data sets\u2076, GC-based methods outperformed\nCV, and were 2 to 6 times faster to compute. This is in spite of the fact that GC, unlike CV, does not use label\ninformation and is not aimed at reducing the classification error. Further, the CV\nestimates of \u03f5 are highly variable, suggesting that CV tends to overfit to the training data.\nEffect of Dimension d'. Table 2 shows how changing the dimension d' alters our estimate of \u03f5. We\nsee that the $\\hat{\\epsilon}$ for different d' values are close, even though we search over a range of two orders of\nmagnitude. 
Even for g241c and g241d, which were constructed so as to not satisfy the manifold\nhypothesis, our method does reasonably well at estimating \u03f5. That is, our method finds the $\\hat{\\epsilon}$ for\nwhich the Laplacian encodes the geometry of the data set irrespective of whether or not that geometry\nis lower-dimensional. Overall, we have found that using d' = 1 is most stable, and that adding more\ndimensions introduces more numerical problems: it becomes more difficult to optimize the distortion\nas in (9), as the minimum becomes shallower. In our experience, this is due to the increase in\nvariance associated with adding more dimensions.\nUsing one dimension probably works well because the wPCA selects the dimension that explains the\nmost variance and hence is the closest to linear over the scale considered. Subsequently, the wPCA\nmoves to incrementally \u201cshorter\u201d or less linear dimensions, leading to more variance in the estimate\nof the tangent subspace (more evidence for this in the Supplement).\n\n\u2076 In the COIL data set, despite their variability, CV estimates still outperformed the GC-based methods. This is\nthe only data set constructed from a collection of manifolds - in this case, 24 one-dimensional image rotations.\nAs such, one would expect that there would be more than one natural length scale.\n\nFigure 2: Bandwidth Estimation For Galaxy Spectra Data. Left: GC results for d' = 1 (d' = 2, 3 are\nalso shown); we chose radius = 66, the minimum of D for d' = 1. Right: A log-log plot of radius\nversus average number of neighbors within this radius. The region in blue includes radius = 66 and\nindicates dimension d = 3. In the code \u03f5 = radius/3, hence we use \u03f5 = 22.\n\nEmbedding spectra of galaxies (Details of this experiment are in the Supplement.) For these data\nin r = 3750 dimensions, with N = 650,000, the goal was to obtain a smooth, low dimensional\nembedding. 
The intrinsic dimension d is unknown, CV cannot be applied, and it is impractical\nto construct multiple embeddings for large N. Hence, we used the GC method with d' = 1, 2, 3\nand N' = |I| = 200. We compare the $\\hat{\\epsilon}$'s obtained with a heuristic based on the scaling of the\nneighborhood sizes [30] with the radius, which relates \u03f5, d and N (Figure 2). Remarkably, both\nmethods yield the same \u03f5; see the Supplement for evidence that the resulting embedding is smooth.\n\n6 Discussion\n\nIn manifold learning, supervised and unsupervised, estimating the graph versions of Laplacian-type\noperators is a fundamental task. We have provided a principled method for selecting the parameters\nof such operators, and have applied it to the selection of the bandwidth/scale parameter \u03f5. Moreover,\nour method can be used to optimize any other parameters used in the graph Laplacian; for example,\nk in the k-nearest neighbors graph, or - more interestingly - the renormalization parameter \u03bb [17]\nof the kernel. The latter is theoretically equal to 1, but it is possible that it may differ from 1 in the\nfinite N regime. In general, for finite N, a small departure from the asymptotic prescriptions may be\nbeneficial - and a data-driven method such as ours can deliver this benefit.\nBy imposing geometric self-consistency, our method estimates an intrinsic quantity of the data. GC is\nalso fully unsupervised, aiming to optimize a (lossy) representation of the data, rather than a particular\ntask. This is efficient if the data is used in an unsupervised mode, or if it is used in many different\nsubsequent tasks. Of course, one cannot expect an unsupervised method to always be superior to a\ntask-dependent one. Yet, GC has been shown to be competitive and sometimes superior in experiments\nwith the widely accepted CV. 
Besides the experimental validation, there are other reasons to consider\nan unsupervised method like GC in a supervised task: (1) the labeled data is scarce, so $\\hat{\\epsilon}$ will have\nhigh variance, (2) the CV cost function is highly non-smooth while D is much smoother, and (3)\nwhen there is more than one parameter to optimize, difficulties (1) and (2) become much more severe.\nOur algorithm requires minimal prior knowledge. In particular, it does not require exact knowledge\nof the intrinsic dimension d, since it can work satisfactorily with d' = 1 in many cases.\nAn interesting problem that is outside the scope of our paper is the question of whether \u03f5 needs to\nvary over M. This is a question/challenge facing not just GC, but any method for setting the scale,\nunsupervised or supervised. Asymptotically, a uniform \u03f5 is sufficient. Practically, however, we believe\nthat allowing \u03f5 to vary may be beneficial. In this respect, the GC method, which simply evaluates\nthe overall result, can be seamlessly adapted to work with any user-selected spatially-variable \u03f5, by\nappropriately changing (2) or sub-sampling D when calculating D.\n\nReferences\n[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation.\nNeural Computation, 15:1373\u20131396, 2002.\n\n[2] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Annals of\nStatistics, 36(2):555\u2013585, 2008.\n\n[3] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework\nfor learning from labeled and unlabeled examples. Journal of Machine Learning Research,\n7:2399\u20132434, December 2006.\n\n[4] Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Semi-supervised learning: From gaussian\nfields to gaussian processes. Technical Report, 2003.\n\n[5] X. Zhou and M. Belkin. Semi-supervised learning by higher order regularization. AISTATS,\n2011.\n\n[6] A. J. 
Smola and I. R. Kondor. Kernels and regularization on graphs. In Proceedings of the Annual Conference on Computational Learning Theory, 2003.\n\n[7] V. Sindhwani, W. Chu, and S. S. Keerthi. Semi-supervised gaussian process classifiers. In Proceedings of the International Joint Conferences on Artificial Intelligence, 2007.\n\n[8] E. Gin\u00e9 and V. Koltchinskii. Empirical Graph Laplacian Approximation of Laplace-Beltrami Operators: Large Sample results. High Dimensional Probability, pages 238\u2013259, 2006.\n\n[9] M. Belkin and P. Niyogi. Convergence of Laplacian eigenmaps. NIPS, 19:129\u2013136, 2007.\n\n[10] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their Convergence on Random Neighborhood Graphs. Journal of Machine Learning Research, 8:1325\u20131368, 2007.\n\n[11] D. Ting, L. Huang, and M. I. Jordan. An analysis of the convergence of graph laplacians. In ICML, pages 1079\u20131086, 2010.\n\n[12] A. Singer. From graph to manifold laplacian: the convergence rate. Applied and Computational Harmonic Analysis, 21(1):128\u2013134, 2006.\n\n[13] John A. Lee and Michel Verleysen. Nonlinear Dimensionality Reduction. Springer Publishing Company, Incorporated, 1st edition, 2007.\n\n[14] Lisha Chen and Andreas Buja. Local Multidimensional Scaling for nonlinear dimension reduction, graph drawing and proximity analysis. Journal of the American Statistical Association, 104(485):209\u2013219, March 2009.\n\n[15] E. Levina and P. Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in NIPS, 17, 2005. Vancouver, Canada.\n\n[16] K. Carter, A. Hero, and R. Raich. De-biasing for intrinsic dimension estimation. IEEE/SP 14th Workshop on Statistical Signal Processing, pages 601\u2013605, August 2007.\n\n[17] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):6\u201330, 2006.\n\n[18] Anonymous. 
Metric learning and manifolds: Preserving the intrinsic geometry. Submitted, 7, December 2012.\n\n[19] S. Rosenberg. The Laplacian on a Riemannian Manifold. Cambridge University Press, 1997.\n\n[20] H. Yue, M. Tomoyasu, and N. Yamanashi. Weighted principal component analysis and its applications to improve fdc performance. In 43rd IEEE Conference on Decision and Control, pages 4262\u20134267, 2004.\n\n[21] Anil Aswani, Peter Bickel, and Claire Tomlin. Regression on manifolds: Estimation of the exterior derivative. Annals of Statistics, 39(1):48\u201381, 2011.\n\n[22] J. M. Lee. Riemannian Manifolds: An Introduction to Curvature, volume M. Springer, New York, 1997.\n\n[23] Xu Wang. Spectral convergence rate of graph laplacian. ArXiv, 2015. Convergence rate of Laplacian when both n and h vary simultaneously.\n\n[24] Tyrus Berry and Timothy Sauer. Consistent manifold representation for topological data analysis. ArXiv, June 2016.\n\n[25] Frederic Chazal, Ilaria Giulini, and Bertrand Michel. Data driven estimation of laplace-beltrami operator. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3963\u20133971. Curran Associates, Inc., 2016.\n\n[26] Y. Goldberg, A. Zakai, D. Kushnir, and Y. Ritov. Manifold Learning: The Price of Normalization. Journal of Machine Learning Research, 9:1909\u20131939, August 2008.\n\n[27] Guangliang Chen, Anna Little, Mauro Maggioni, and Lorenzo Rosasco. Some recent advances in multiscale geometric analysis of point clouds. In J. Cohen and A. I. Zayed, editors, Wavelets and multiscale analysis: Theory and Applications, Applied and Numerical Harmonic Analysis, chapter 10, pages 199\u2013225. Springer, 2011.\n\n[28] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien, editors. Semi-Supervised Learning. The MIT Press, 2006.\n\n[29] L. Saul and S. Roweis. 
Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119\u2013155, 2003.\n\n[30] Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC \u201908, pages 537\u2013546, New York, NY, USA, 2008. ACM.", "award": [], "sourceid": 2330, "authors": [{"given_name": "Dominique", "family_name": "Joncas", "institution": "Google"}, {"given_name": "Marina", "family_name": "Meila", "institution": "University of Washington"}, {"given_name": "James", "family_name": "McQueen", "institution": "University of Washington"}]}