{"title": "Sample Complexity of Testing the Manifold Hypothesis", "book": "Advances in Neural Information Processing Systems", "page_first": 1786, "page_last": 1794, "abstract": "The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is {\\it independent} of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound of $O(\\frac{k}{\\eps^2} + \\frac{\\log \\frac{1}{\\de}}{\\eps^2})$ for the sample complexity of Empirical Risk minimization on $k-$means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997 \\cite{bart2}. Here $\\eps$ is the desired bound on the error and $\\de$ is a bound on the probability of failure. We improve the best currently known upper bound \\cite{pontil} of $O(\\frac{k^2}{\\eps^2} + \\frac{\\log \\frac{1}{\\de}}{\\eps^2})$ to $O\\left(\\frac{k}{\\eps^2}\\left(\\min\\left(k, \\frac{\\log^4 \\frac{k}{\\eps}}{\\eps^2}\\right)\\right) + \\frac{\\log \\frac{1}{\\de}}{\\eps^2}\\right)$. 
Based on these results, we devise a simple algorithm for $k$-means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension.", "full_text": "Sample complexity of testing the manifold hypothesis\n\nHariharan Narayanan\u2217\nLaboratory for Information and Decision Systems\nEECS, MIT\nCambridge, MA 02139\nhar@mit.edu\n\nSanjoy Mitter\nLaboratory for Information and Decision Systems\nEECS, MIT\nCambridge, MA 02139\nmitter@mit.edu\n\nAbstract\n\nThe hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound of $O(\\frac{k}{\\epsilon^2} + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$ for the sample complexity of Empirical Risk Minimization on $k$-means applied to data in a unit ball of arbitrary dimension is tight has been an open question since 1997 [3]. Here $\\epsilon$ is the desired bound on the error and $\\delta$ is a bound on the probability of failure. We improve the best currently known upper bound [14] of $O(\\frac{k^2}{\\epsilon^2} + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$ to $O(\\frac{k}{\\epsilon^2}\\min(k, \\frac{\\log^4 \\frac{k}{\\epsilon}}{\\epsilon^2}) + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$. Based on these results, we devise a simple algorithm for $k$-means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension.\n\n1 Introduction\n\nWe are increasingly confronted with very high dimensional data in speech signals, images, gene-expression data, and other sources. Manifold Learning can be loosely defined to be a collection of methodologies that are motivated by the belief that the hypothesis that high dimensional data tend to lie in the vicinity of a low dimensional manifold (henceforth called the manifold hypothesis) is true. It includes a number of extensively used algorithms such as Locally Linear Embedding [17], ISOMAP [19], Laplacian Eigenmaps [4] and Hessian Eigenmaps [8]. The sample complexity of classification is known to be independent of the ambient dimension [15] under the manifold hypothesis (assuming the decision boundary is a manifold as well), thus obviating the curse of dimensionality. A recent empirical study [6] of a large number of 3 \u00d7 3 images, represented as points in $\\mathbb{R}^9$, revealed that they approximately lie on a two dimensional manifold known as the Klein bottle. On the other hand, knowledge that the manifold hypothesis is false with regard to certain data would give us reason to exercise caution in applying algorithms from manifold learning and would provide an incentive for further study. It is thus of considerable interest to know whether given data lie in the vicinity of a low dimensional manifold. Our primary technical results are the following.\n\n\u2217Research supported by grant CCF-0836720\n\nFigure 1: Fitting a torus to data.\n\n1. 
We obtain uniform bounds relating the empirical squared loss and the true squared loss over a class $\\mathcal{F}$ consisting of manifolds whose dimensions, volumes and curvatures are bounded, in Theorems 1 and 2. These bounds imply upper bounds on the sample complexity of Empirical Risk Minimization (ERM) that are independent of the ambient dimension, exponential in the intrinsic dimension, polynomial in the curvature and almost linear in the volume.\n\n2. We obtain a minimax lower bound on the sample complexity of any rule for learning a manifold from $\\mathcal{F}$ in Theorem 6, showing that for a fixed error, the dependence of the sample complexity on intrinsic dimension, curvature and volume must be at least exponential, polynomial, and linear, respectively.\n\n3. We improve the best currently known upper bound [14] on the sample complexity of Empirical Risk Minimization on $k$-means applied to data in a unit ball of arbitrary dimension from $O(\\frac{k^2}{\\epsilon^2} + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$ to $O(\\frac{k}{\\epsilon^2}\\min(k, \\frac{\\log^4 \\frac{k}{\\epsilon}}{\\epsilon^2}) + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$. Whether the known lower bound of $O(\\frac{k}{\\epsilon^2} + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$ is tight has been an open question since 1997 [3]. Here $\\epsilon$ is the desired bound on the error and $\\delta$ is a bound on the probability of failure.\n\nOne technical contribution of this paper is the use of dimensionality reduction via random projections in the proof of Theorem 5 to bound the Fat-Shattering dimension of a function class, elements of which roughly correspond to the squared distance to a low dimensional manifold. The application of the probabilistic method involves a projection onto a low dimensional random subspace. This is then followed by arguments of a combinatorial nature involving the VC dimension of halfspaces, and the Sauer-Shelah Lemma applied with respect to the low dimensional subspace.
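The role played by random projections here can be illustrated numerically. The sketch below (our own, assuming NumPy; the dimensions and names are illustrative choices, not the construction used in the proof of Theorem 5) projects points of the unit sphere in $\\mathbb{R}^m$ onto a random $g$-dimensional subspace, scaled by $\\sqrt{m/g}$, and checks that pairwise squared distances are approximately preserved, which is the Johnson-Lindenstrauss phenomenon the proof relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
m, g, n = 1000, 50, 20          # ambient dim, projected dim, number of points

# Points on the unit sphere in R^m.
X = rng.standard_normal((n, m))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Orthogonal projection onto a random g-dimensional subspace,
# scaled by sqrt(m/g) so squared norms are preserved in expectation.
Q, _ = np.linalg.qr(rng.standard_normal((m, g)))   # columns: orthonormal basis
R = np.sqrt(m / g) * Q.T                           # the scaled projection

Y = X @ R.T                                        # projected points, shape (n, g)

def pdist2(Z):
    """All pairwise squared Euclidean distances."""
    d = Z[:, None, :] - Z[None, :, :]
    return (d ** 2).sum(-1)

iu = np.triu_indices(n, 1)
ratios = pdist2(Y)[iu] / pdist2(X)[iu]
print(ratios.min(), ratios.max())   # concentrated around 1
```

With $g$ of order $\\log(n)/\\gamma^2$ the distortion is of order $\\gamma$ with high probability, which is what lets the combinatorial argument work in a dimension independent of $m$.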
While random projections have frequently been used in machine learning algorithms, for example in [2, 7], to our knowledge, they have not been used as a tool to bound the complexity of a function class. We illustrate the algorithmic utility of our uniform bound by devising an algorithm for $k$-means and a convex programming algorithm for fitting a piecewise linear curve of bounded length. For a fixed error threshold and length, the dependence on the ambient dimension is linear, which is optimal since this is the complexity of reading the input.\n\n2 Connections and Related work\n\nIn the context of curves, [10] proposed \u201cPrincipal Curves\u201d, where it was suggested that a natural curve that may be fit to a probability distribution is one where every point on the curve is the center of mass of all those points to which it is the nearest point. A different definition of a principal curve was proposed by [12], where they attempted to find piecewise linear curves of bounded length which minimize the expected squared distance to a random point from a distribution. This paper studies the decay of the error rate as the number of samples tends to infinity, but does not analyze the dependence of the error rate on the ambient dimension and the bound on the length. We address this in a more general setup in Theorem 4, and obtain sample complexity bounds that are independent of the ambient dimension, and depend linearly on the bound on the length. There is a significant amount of recent research aimed at understanding topological aspects of data, such as its homology [20, 16]. It has been an open question since 1997 [3] whether the known lower bound of $O(\\frac{k}{\\epsilon^2} + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$ for the sample complexity of Empirical Risk Minimization on $k$-means applied to data in a unit ball of arbitrary dimension is tight.
Here $\\epsilon$ is the desired bound on the error and $\\delta$ is a bound on the probability of failure. The best currently known upper bound is $O(\\frac{k^2}{\\epsilon^2} + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$ and is based on Rademacher complexities. We improve this bound to $O(\\frac{k}{\\epsilon^2}\\min(k, \\frac{\\log^4 \\frac{k}{\\epsilon}}{\\epsilon^2}) + \\frac{\\log \\frac{1}{\\delta}}{\\epsilon^2})$, using an argument that bounds the Fat-Shattering dimension of the appropriate function class using random projections and the Sauer-Shelah Lemma. Generalizations of principal curves to parameterized principal manifolds in certain regularized settings have been studied in [18]. There, the sample complexity was related to the decay of eigenvalues of a Mercer kernel associated with the regularizer. When the manifold to be fit is a set of $k$ points ($k$-means), we obtain a bound on the sample complexity $s$ that is independent of $m$ and depends at most linearly on $k$, which also leads to an approximation algorithm with additive error, based on sub-sampling. If one allows a multiplicative error of 4 in addition to an additive error of $\\epsilon$, a statement of this nature has been proven by Ben-David (Theorem 7, [5]).\n\n3 Upper bounds on the sample complexity of Empirical Risk Minimization\n\nIn the remainder of the paper, $C$ will always denote a universal constant which may differ across the paper. For any submanifold $\\mathcal{M}$ contained in, and probability distribution $\\mathcal{P}$ supported on, the unit ball $B$ in $\\mathbb{R}^m$, let $L(\\mathcal{M}, \\mathcal{P}) := \\int d(\\mathcal{M}, x)^2 d\\mathcal{P}(x)$. Given a set of i.i.d points $x = \\{x_1, \\ldots, x_s\\}$ from $\\mathcal{P}$, a tolerance $\\epsilon$ and a class of manifolds $\\mathcal{F}$, Empirical Risk Minimization (ERM) outputs a manifold $\\mathcal{M}_{erm}(x) \\in \\mathcal{F}$ such that $\\frac{1}{s}\\sum_{i=1}^s d(x_i, \\mathcal{M}_{erm})^2 \\le \\epsilon/2 + \\inf_{\\mathcal{N} \\in \\mathcal{F}} \\frac{1}{s}\\sum_{i=1}^s d(x_i, \\mathcal{N})^2$.
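In the simplest case, where the class is a finite family of candidate $k$-point sets (the $k$-means setting discussed below), ERM just returns the candidate with the smallest empirical squared loss. A minimal sketch, assuming NumPy; the data and the candidate family here are our own illustrative choices, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def squared_loss(M, X):
    """Mean squared distance from each sample in X to its nearest point of M."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)   # (s, k) squared distances
    return d2.min(axis=1).mean()

# Samples drawn near two clusters inside the unit ball of R^5.
centers = np.array([[0.5, 0, 0, 0, 0], [-0.5, 0, 0, 0, 0]])
X = centers[rng.integers(0, 2, size=200)] + 0.05 * rng.standard_normal((200, 5))

# A finite class F of candidate 2-point "manifolds"; ERM picks the best one.
F = [rng.uniform(-1, 1, size=(2, 5)) for _ in range(50)] + [centers]
erm = min(F, key=lambda M: squared_loss(M, X))

print(squared_loss(erm, X))   # small, since the true centers are in F
```

The theorems below say how many samples suffice for this empirical minimizer to be nearly optimal for the underlying distribution as well.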
Given error parameters $\\epsilon, \\delta$, and a rule $\\mathcal{A}$ that outputs a manifold in $\\mathcal{F}$ when provided with a set of samples, we define the sample complexity $s = s(\\epsilon, \\delta, \\mathcal{A})$ to be the least number such that for any probability distribution $\\mathcal{P}$ in the unit ball, if the result of $\\mathcal{A}$ applied to a set of at least $s$ i.i.d random samples from $\\mathcal{P}$ is $\\mathcal{N}$, then $\\mathbb{P}[L(\\mathcal{N}, \\mathcal{P}) < \\inf_{\\mathcal{M} \\in \\mathcal{F}} L(\\mathcal{M}, \\mathcal{P}) + \\epsilon] > 1 - \\delta$.\n\n3.1 Bounded intrinsic curvature\n\nLet $\\mathcal{M}$ be a Riemannian manifold and let $p \\in \\mathcal{M}$. Let $\\zeta$ be a geodesic starting at $p$.\n\nDefinition 1. The first point on $\\zeta$ where $\\zeta$ ceases to minimize distance is called the cut point of $p$ along $\\zeta$. The cut locus of $p$ is the set of cut points of $p$. The injectivity radius is the minimum taken over all points of the distance between the point and its cut locus. $\\mathcal{M}$ is complete if it is complete as a metric space.\n\nLet $\\mathcal{G}_i = \\mathcal{G}_i(d, V, \\lambda, \\iota)$ be the family of all isometrically embedded, complete Riemannian submanifolds of $B$ having dimension less or equal to $d$, induced $d$-dimensional volume less or equal to $V$, sectional curvature less or equal to $\\lambda$ and injectivity radius greater or equal to $\\iota$. Let $U_{int}(\\frac{1}{\\epsilon}, d, V, \\lambda, \\iota) := V \\left( \\frac{Cd}{\\min(\\epsilon, \\iota, \\lambda^{-1/2})} \\right)^d$, which for brevity we denote $U_{int}$.\n\nTheorem 1. Let $\\epsilon$ and $\\delta$ be error parameters. If $$s \\ge C \\left( \\frac{U_{int}}{\\epsilon^2} \\min \\left( \\frac{1}{\\epsilon^2} \\log^4 \\frac{U_{int}}{\\epsilon}, U_{int} \\right) + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right)$$ and $x = \\{x_1, \\ldots, x_s\\}$ is a set of i.i.d points from $\\mathcal{P}$, then $$\\mathbb{P} \\left[ L(\\mathcal{M}_{erm}(x), \\mathcal{P}) - \\inf_{\\mathcal{M} \\in \\mathcal{G}_i} L(\\mathcal{M}, \\mathcal{P}) < \\epsilon \\right] > 1 - \\delta.$$ The proof of this theorem is deferred to Section 4.\n\n3.2 Bounded extrinsic curvature\n\nWe will consider submanifolds of $B$ that have the property that around each of them, for any radius $r < \\tau$, the boundary of the set of all points within a distance $r$ is smooth. This class of submanifolds has appeared in the context of manifold learning [16, 15]. The condition number is defined as follows.\n\nDefinition 2 (Condition Number). Let $\\mathcal{M}$ be a smooth $d$-dimensional submanifold of $\\mathbb{R}^m$. We define the condition number $c(\\mathcal{M})$ to be $\\frac{1}{\\tau}$, where $\\tau$ is the largest number having the property that for any $r < \\tau$, no two normals of length $r$ that are incident on $\\mathcal{M}$ have a common point unless it is on $\\mathcal{M}$.\n\nLet $\\mathcal{G}_e = \\mathcal{G}_e(d, V, \\tau)$ be the family of Riemannian submanifolds of $B$ having dimension $\\le d$, volume $\\le V$ and condition number $\\le \\frac{1}{\\tau}$. Let $\\epsilon$ and $\\delta$ be error parameters. Let $U_{ext}(\\frac{1}{\\epsilon}, d, \\tau) := V \\left( \\frac{Cd}{\\min(\\epsilon, \\tau)} \\right)^d$, which for brevity we denote by $U_{ext}$.\n\nTheorem 2. If $$s \\ge C \\left( \\frac{U_{ext}}{\\epsilon^2} \\min \\left( \\frac{1}{\\epsilon^2} \\log^4 \\frac{U_{ext}}{\\epsilon}, U_{ext} \\right) + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right) \\quad (1)$$ and $x = \\{x_1, \\ldots, x_s\\}$ is a set of i.i.d points from $\\mathcal{P}$, then $$\\mathbb{P} \\left[ L(\\mathcal{M}_{erm}(x), \\mathcal{P}) - \\inf_{\\mathcal{M}} L(\\mathcal{M}, \\mathcal{P}) < \\epsilon \\right] > 1 - \\delta.$$\n\n4 Relating bounded curvature to covering number\n\nIn this section, we note that bounds on the dimension, volume, sectional curvature and injectivity radius suffice to ensure that the manifolds in question can be covered by relatively few $\\epsilon$-balls. Let $V^{\\mathcal{M}}_p(r)$ be the volume of the ball of radius $r$ in $\\mathcal{M}$ centered around a point $p$.
See ([9], page 51) for a proof of the following theorem.\n\nTheorem 3 (Bishop-G\u00fcnther Inequality). Let $\\mathcal{M}$ be a complete Riemannian manifold and assume that $r$ is not greater than the injectivity radius $\\iota$. Let $K^{\\mathcal{M}}$ denote the sectional curvature of $\\mathcal{M}$ and let $\\lambda > 0$ be a constant. Then, $K^{\\mathcal{M}} \\le \\lambda$ implies $$V^{\\mathcal{M}}_p(r) \\ge \\frac{2 \\pi^{n/2}}{\\Gamma(\\frac{n}{2})} \\int_0^r \\left( \\frac{\\sin(t \\sqrt{\\lambda})}{\\sqrt{\\lambda}} \\right)^{n-1} dt.$$ Thus, if $\\epsilon < \\min(\\iota, \\frac{\\pi}{2} \\lambda^{-1/2})$, then $V^{\\mathcal{M}}_p(\\epsilon) > \\left( \\frac{\\epsilon}{Cd} \\right)^d$.\n\nProof of Theorem 1. As a consequence of Theorem 3, we obtain an upper bound of $V \\left( \\frac{Cd}{\\min(\\epsilon, \\lambda^{-1/2}, \\iota)} \\right)^d$ on the number of disjoint sets of the form $\\mathcal{M} \\cap B_{\\epsilon/32}(p)$ that can be packed in $\\mathcal{M}$. If $\\{\\mathcal{M} \\cap B_{\\epsilon/32}(p_1), \\ldots, \\mathcal{M} \\cap B_{\\epsilon/32}(p_k)\\}$ is a maximal family of disjoint sets of this form, then there is no point $p \\in \\mathcal{M}$ such that $\\min_i \\|p - p_i\\| > \\epsilon/16$. Therefore, $\\mathcal{M}$ is contained in the union of balls $\\bigcup_i B_{\\epsilon/16}(p_i)$, and we may apply Theorem 4 with $U(\\frac{1}{\\epsilon}) \\le V \\left( \\frac{Cd}{\\min(\\epsilon, \\lambda^{-1/2}, \\iota)} \\right)^d$.\n\nThe proof of Theorem 2 is along the lines of that of Theorem 1, so it has been deferred to the journal version.\n\n5 Class of manifolds with a bounded covering number\n\nIn this section, we show that uniform bounds relating the empirical squared loss and the expected squared loss can be obtained for a class of manifolds whose covering number at a different scale $\\epsilon$ has a specified upper bound. Let $U : \\mathbb{R}_+ \\to \\mathbb{Z}_+$ be any integer valued function. Let $\\mathcal{G}$ be any family of subsets of $B$ such that for all $r > 0$ every element $\\mathcal{M} \\in \\mathcal{G}$ can be covered using open Euclidean balls of radius $r$ centered around $U(\\frac{1}{r})$ points; let this set of centers be $\\Lambda_{\\mathcal{M}}(r)$.
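The covering function $U(1/r)$ can be estimated on sample data by a standard greedy construction: repeatedly take an uncovered point as a new center until every point lies within $r$ of some center. A small illustrative sketch (our own, assuming NumPy, not part of the paper), applied to points on the unit circle, where the count grows roughly like $1/r$ as expected for a curve:

```python
import numpy as np

def greedy_cover(points, r):
    """Greedily pick centers until every point is within distance r of a center."""
    centers = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        c = points[np.argmax(uncovered)]          # first still-uncovered point
        centers.append(c)
        uncovered &= np.linalg.norm(points - c, axis=1) > r
    return np.array(centers)

# Points on the unit circle in R^2, a curve (1-manifold) of length 2*pi.
t = np.linspace(0, 2 * np.pi, 500, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)

ks = {r: len(greedy_cover(circle, r)) for r in (0.5, 0.25, 0.125)}
print(ks)   # the number of r-balls grows roughly like 1/r
```

For a $d$-dimensional manifold of bounded volume and curvature, Theorem 3 above controls this count by $V(Cd/r)^d$ up to the curvature and injectivity radius scales.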
Note that if the subsets consist of $k$-tuples of points, $U(1/r)$ can be taken to be the constant function equal to $k$ and we recover the $k$-means question. A priori, it is unclear if $$\\sup_{\\mathcal{M} \\in \\mathcal{G}} \\left| \\frac{\\sum_{i=1}^s d(x_i, \\mathcal{M})^2}{s} - \\mathbb{E}_{\\mathcal{P}} d(x, \\mathcal{M})^2 \\right| \\quad (2)$$ is a random variable, since the supremum of a set of random variables is not always a random variable (although if the set is countable this is true). However (2) is equal to $$\\lim_{n \\to \\infty} \\sup_{\\mathcal{M} \\in \\mathcal{G}} \\left| \\frac{\\sum_{i=1}^s d(x_i, \\Lambda_{\\mathcal{M}}(1/n))^2}{s} - \\mathbb{E}_{\\mathcal{P}} d(x, \\Lambda_{\\mathcal{M}}(1/n))^2 \\right|, \\quad (3)$$ and for each $n$, the supremum in the limits is over a set parameterized by $U(n)$ points, which without loss of generality we may take to be countable (due to the density and countability of rational points). Thus, for a fixed $n$, the quantity in the limits is a random variable. Since the limit as $n \\to \\infty$ of a sequence of bounded random variables is a random variable as well, (2) is a random variable too.\n\nTheorem 4. Let $\\epsilon$ and $\\delta$ be error parameters. If $$s \\ge C \\left( \\frac{U(16/\\epsilon)}{\\epsilon^2} \\min \\left( U(16/\\epsilon), \\frac{1}{\\epsilon^2} \\log^4 \\frac{U(16/\\epsilon)}{\\epsilon} \\right) + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right),$$ then $$\\mathbb{P} \\left[ \\sup_{\\mathcal{M} \\in \\mathcal{G}} \\left| \\frac{\\sum_{i=1}^s d(x_i, \\mathcal{M})^2}{s} - \\mathbb{E}_{\\mathcal{P}} d(x, \\mathcal{M})^2 \\right| < \\frac{\\epsilon}{2} \\right] > 1 - \\delta. \\quad (4)$$\n\nProof. For every $g \\in \\mathcal{G}$, let $c(g, \\epsilon) = \\{c_1, \\ldots, c_k\\}$ be a set of $k := U(16/\\epsilon)$ points in $g \\subseteq B$, such that $g$ is covered by the union of balls of radius $\\epsilon/16$ centered at these points.
Thus, for any point $x \\in B$, $$d^2(x, g) \\le \\left( \\frac{\\epsilon}{16} + d(x, c(g, \\epsilon)) \\right)^2 \\quad (5)$$ $$\\le \\frac{\\epsilon^2}{256} + \\frac{\\epsilon \\min_i \\|x - c_i\\|}{8} + d(x, c(g, \\epsilon))^2. \\quad (6)$$ Since $\\min_i \\|x - c_i\\|$ is less or equal to 2, the last expression is less than $\\frac{\\epsilon}{2} + d(x, c(g, \\epsilon))^2$. Our proof uses the \u201ckernel trick\u201d in conjunction with Theorem 5. Let $\\Phi : (x_1, \\ldots, x_m)^T \\mapsto 2^{-1/2}(x_1, \\ldots, x_m, 1)^T$ map a point $x \\in \\mathbb{R}^m$ to one in $\\mathbb{R}^{m+1}$. For each $i$, let $c_i := (c_{i1}, \\ldots, c_{im})^T$, and $\\tilde{c}_i := 2^{-1/2}(-c_{i1}, \\ldots, -c_{im}, \\frac{\\|c_i\\|^2}{2})^T$. The factor of $2^{-1/2}$ is necessitated by the fact that we wish the image of a point in the unit ball to also belong to the unit ball. Given a collection of points $c := \\{c_1, \\ldots, c_k\\}$ and a point $x \\in B$, let $f_c(x) := d(x, c(g, \\epsilon))^2$. Then, $$f_c(x) = \\|x\\|^2 + 4 \\min(\\Phi(x) \\cdot \\tilde{c}_1, \\ldots, \\Phi(x) \\cdot \\tilde{c}_k).$$ For any set of $s$ samples $x_1, \\ldots, x_s$, $$\\sup_{f_c} \\left| \\frac{\\sum_{i=1}^s f_c(x_i)}{s} - \\mathbb{E}_{\\mathcal{P}} f_c(x) \\right| \\le \\left| \\frac{\\sum_{i=1}^s \\|x_i\\|^2}{s} - \\mathbb{E}_{\\mathcal{P}} \\|x\\|^2 \\right| + 4 \\sup_{f_c} \\left| \\frac{\\sum_{i=1}^s \\min_j \\Phi(x_i) \\cdot \\tilde{c}_j}{s} - \\mathbb{E}_{\\mathcal{P}} \\min_j \\Phi(x) \\cdot \\tilde{c}_j \\right|. \\quad (7)$$ By Hoeffding\u2019s inequality, $$\\mathbb{P} \\left[ \\left| \\frac{\\sum_{i=1}^s \\|x_i\\|^2}{s} - \\mathbb{E}_{\\mathcal{P}} \\|x\\|^2 \\right| > \\frac{\\epsilon}{4} \\right] < 2 e^{-(\\frac{1}{8}) s \\epsilon^2}, \\quad (8)$$ which is less than $\\frac{\\delta}{2}$. By Theorem 5, $$\\mathbb{P} \\left[ \\sup_{f_c} \\left| \\frac{\\sum_{i=1}^s \\min_j \\Phi(x_i) \\cdot \\tilde{c}_j}{s} - \\mathbb{E}_{\\mathcal{P}} \\min_j \\Phi(x) \\cdot \\tilde{c}_j \\right| > \\frac{\\epsilon}{16} \\right] < \\frac{\\delta}{2}. \\quad (9)$$ Therefore, $$\\mathbb{P} \\left[ \\sup_{f_c} \\left| \\frac{\\sum_{i=1}^s f_c(x_i)}{s} - \\mathbb{E}_{\\mathcal{P}} f_c(x) \\right| \\le \\frac{\\epsilon}{2} \\right] \\ge 1 - \\delta.$$\n\nFigure 2: Random projections are likely to preserve linear separations.\n\n6 Bounding the Fat-Shattering dimension using random projections\n\nThe core of the uniform bounds in Theorems 1 and 2 is the following uniform bound on the minimum of $k$ linear functions on a ball in $\\mathbb{R}^m$.\n\nTheorem 5. Let $\\mathcal{F}$ be the set of all functions $f$ from $B := \\{x \\in \\mathbb{R}^m : \\|x\\| \\le 1\\}$ to $\\mathbb{R}$, such that for some $k$ vectors $v_1, \\ldots, v_k \\in B$, $$f(x) := \\min_i (v_i \\cdot x).$$ Independent of $m$, if $$s \\ge C \\left( \\frac{k}{\\epsilon^2} \\min \\left( \\frac{1}{\\epsilon^2} \\log^4 \\frac{k}{\\epsilon}, k \\right) + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right),$$ then $$\\mathbb{P} \\left[ \\sup_{F \\in \\mathcal{F}} \\left| \\frac{\\sum_{i=1}^s F(x_i)}{s} - \\mathbb{E}_{\\mathcal{P}} F(x) \\right| < \\epsilon \\right] > 1 - \\delta. \\quad (10)$$\n\nIt has been open since 1997 [3] whether the known lower bound of $C(\\frac{k}{\\epsilon^2} + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta})$ on the sample complexity $s$ is tight. Theorem 5 in [14] uses Rademacher complexities to obtain an upper bound of $$C \\left( \\frac{k^2}{\\epsilon^2} + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right). \\quad (11)$$ (The scenarios in [3, 14] are that of $k$-means, but the argument in Theorem 4 reduces $k$-means to our setting.) Theorem 5 improves this to $$C \\left( \\frac{k}{\\epsilon^2} \\min \\left( \\frac{1}{\\epsilon^2} \\log^4 \\frac{k}{\\epsilon}, k \\right) + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right), \\quad (12)$$ by putting together (11) with a bound of $$C \\left( \\frac{k}{\\epsilon^4} \\log^4 \\frac{k}{\\epsilon} + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right) \\quad (13)$$ obtained using the Fat-Shattering dimension. Due to constraints on space, the details of the proof of Theorem 5 will appear in the journal version, but the essential ideas are summarized here. Let $u := fat_{\\mathcal{F}}(\\frac{\\epsilon}{24})$ and let $x_1, \\ldots, x_u$ be a set of vectors that is $\\gamma$-shattered by $\\mathcal{F}$. We would like to use VC theory to bound $u$, but doing so directly leads to a linear dependence on the ambient dimension $m$. In order to circumvent this difficulty, for $g := \\frac{C \\log(u+k)}{\\gamma^2}$, we consider a $g$-dimensional random linear subspace and the image under an appropriately scaled orthogonal projection $R$ of the points $x_1, \\ldots, x_u$ onto it. We show that the expected value of the $\\frac{\\gamma}{2}$-shatter coefficient of $\\{Rx_1, \\ldots, Rx_u\\}$ is at least $2^{u-1}$ using the Johnson-Lindenstrauss Lemma [11] and the fact that $\\{x_1, \\ldots, x_u\\}$ is $\\gamma$-shattered. Using Vapnik-Chervonenkis theory and the Sauer-Shelah Lemma, we then show that the $\\frac{\\gamma}{2}$-shatter coefficient cannot be more than $u^{k(g+2)}$. This implies that $2^{u-1} \\le u^{k(g+2)}$, allowing us to conclude that $fat_{\\mathcal{F}}(\\frac{\\epsilon}{24}) \\le \\frac{Ck}{\\epsilon^2} \\log^2 \\frac{k}{\\epsilon}$. By a well-known theorem of [1], a bound of $\\frac{Ck}{\\epsilon^2} \\log^2 \\frac{k}{\\epsilon}$ on $fat_{\\mathcal{F}}(\\frac{\\epsilon}{24})$ implies the bound in (13) on the sample complexity, which implies Theorem 5.\n\n7 Minimax lower bounds on the sample complexity\n\nLet $K$ be a universal constant whose value will be fixed throughout this section. In this section, we will state lower bounds on the number of samples needed for the minimax decision rule for learning from high dimensional data, with high probability, a manifold with a squared loss that is within $\\epsilon$ of the optimal. We will construct a carefully chosen prior on the space of probability distributions and use an argument that can either be viewed as an application of the probabilistic method or of the fact that the minimax risk is at least the risk of a Bayes optimal manifold computed with respect to this prior. Let $U$ be a $K^{2dk}$-dimensional vector space containing the origin, spanned by the basis $\\{e_1, \\ldots, e_{K^{2dk}}\\}$, and let $S$ be the surface of the ball of radius 1 in $\\mathbb{R}^m$. We assume that $m$ is greater or equal to $K^{2dk} + d$. Let $W$ be the $d$-dimensional vector space spanned by $\\{e_{K^{2dk}+1}, \\ldots, e_{K^{2dk}+d}\\}$. Let $S_1, \\ldots, S_{K^{2dk}}$ denote spheres, such that for each $i$, $S_i := S \\cap (\\sqrt{1 - \\tau^2}\\, e_i + W)$, where $x + W$ is the translation of $W$ by $x$. Note that each $S_i$ has radius $\\tau$. Let $\\ell = \\binom{K^{2dk}}{K^{dk}}$ and let $\\{\\mathcal{M}_1, \\ldots, \\mathcal{M}_\\ell\\}$ consist of all $K^{dk}$-element subsets of $\\{S_1, \\ldots, S_{K^{2dk}}\\}$. Let $\\omega_d$ be the volume of the unit ball in $\\mathbb{R}^d$. The following theorem shows that no algorithm can produce a nearly optimal manifold with high probability unless it uses a number of samples that depends linearly on volume, exponentially on intrinsic dimension and polynomially on the curvature.\n\nTheorem 6. Let $\\mathcal{F}$ be equal to either $\\mathcal{G}_e(d, V, \\tau)$ or $\\mathcal{G}_i(d, V, \\frac{1}{\\tau^2}, \\pi\\tau)$. Let $k = \\lfloor \\frac{V}{d \\omega_d (K^{5/4} \\tau)^d} \\rfloor$. Let $\\mathcal{A}$ be an arbitrary algorithm that takes as input a set of data points $x = \\{x_1, \\ldots, x_k\\}$ and outputs a manifold $\\mathcal{M}_{\\mathcal{A}}(x)$ in $\\mathcal{F}$. If $$\\epsilon + 2\\delta < \\frac{1}{3} \\left( \\frac{1}{\\sqrt{2}} - \\tau \\right)^2,$$ then $$\\inf_{\\mathcal{P}} \\mathbb{P} \\left[ L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}) - \\inf_{\\mathcal{M} \\in \\mathcal{F}} L(\\mathcal{M}, \\mathcal{P}) < \\epsilon \\right] < 1 - \\delta,$$ where $\\mathcal{P}$ ranges over all distributions supported on $B$ and $x_1, \\ldots, x_k$ are i.i.d draws from $\\mathcal{P}$.\n\nProof. Observe from Lemma ?? and Theorem 3 that $\\mathcal{F}$ is a class of manifolds such that each manifold in $\\mathcal{F}$ is contained in the union of $K^{\\frac{3d}{2}k}$ $m$-dimensional balls of radius $\\tau$, and $\\{\\mathcal{M}_1, \\ldots, \\mathcal{M}_\\ell\\} \\subseteq \\mathcal{F}$. (The reason why we have $K^{\\frac{3d}{2}}$ rather than $K^{\\frac{5d}{4}}$ as in the statement of the theorem is that the parameters of $\\mathcal{G}_i(d, V, \\tau)$ are intrinsic, and to transfer to the extrinsic setting of the last sentence, one needs some leeway.) Let $\\mathcal{P}_1, \\ldots, \\mathcal{P}_\\ell$ be probability distributions that are uniform on $\\mathcal{M}_1, \\ldots, \\mathcal{M}_\\ell$ with respect to the induced Riemannian measure. Suppose $\\mathcal{A}$ is an algorithm that takes as input a set of data points $x = \\{x_1, \\ldots, x_t\\}$ and outputs a manifold $\\mathcal{M}_{\\mathcal{A}}(x)$. Let $r$ be chosen uniformly at random from $\\{1, \\ldots, \\ell\\}$.
Then, $$\\inf_{\\mathcal{P}} \\mathbb{P} \\left[ \\left| L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}) - \\inf_{\\mathcal{M} \\in \\mathcal{F}} L(\\mathcal{M}, \\mathcal{P}) \\right| < \\epsilon \\right] \\le \\mathbb{E}_{\\mathcal{P}_r} \\mathbb{P}_x \\left[ \\left| L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}_r) - \\inf_{\\mathcal{M} \\in \\mathcal{F}} L(\\mathcal{M}, \\mathcal{P}_r) \\right| < \\epsilon \\right] = \\mathbb{E}_x \\mathbb{P}_{\\mathcal{P}_r} \\left[ \\left| L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}_r) - \\inf_{\\mathcal{M} \\in \\mathcal{F}} L(\\mathcal{M}, \\mathcal{P}_r) \\right| < \\epsilon \\,\\middle|\\, x \\right] = \\mathbb{E}_x \\mathbb{P}_{\\mathcal{P}_r} \\left[ L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}_r) < \\epsilon \\,\\middle|\\, x \\right]. \\quad (14)$$ We first prove a lower bound on $\\inf_x \\mathbb{E}_r [L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}_r) | x]$. We see that $$\\mathbb{E}_r \\left[ L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}_r) \\,\\middle|\\, x \\right] = \\mathbb{E}_{r, x_{k+1}} \\left[ d(\\mathcal{M}_{\\mathcal{A}}(x), x_{k+1})^2 \\,\\middle|\\, x \\right].$$ Conditioned on $x$, the probability of the event (say $E_{dif}$) that $x_{k+1}$ does not belong to the same sphere as one of the $x_1, \\ldots, x_k$ is at least $\\frac{1}{2}$. Conditioned on $E_{dif}$ and $x_1, \\ldots, x_k$, the probability that $x_{k+1}$ lies on a given sphere $S_j$ is equal to 0 if one of $x_1, \\ldots, x_k$ lies on $S_j$, and $\\frac{1}{K^{2dk} - k'}$ otherwise, where $k' \\le k$ is the number of spheres in $\\{S_i\\}$ which contain at least one point among $x_1, \\ldots, x_k$. By construction, $\\mathcal{A}(x_1, \\ldots, x_k)$ can be covered by $K^{\\frac{3d}{2}k}$ balls of radius $\\tau$; let their centers be $y_1, \\ldots, y_{K^{3dk/2}}$. However, it is easy to check that for any dimension $m$, the cardinality of the set $S_y$ of all $S_i$ that have a nonempty intersection with the balls of radius $\\frac{1}{\\sqrt{2}}$ centered around $y_1, \\ldots, y_{K^{3dk/2}}$ is at most $K^{\\frac{3d}{2}k}$. Therefore, $$\\mathbb{P} \\left[ d(\\mathcal{M}_{\\mathcal{A}}(x), x_{k+1}) \\ge \\frac{1}{\\sqrt{2}} - \\tau \\,\\middle|\\, x \\right] \\ge \\mathbb{P} \\left[ d(\\{y_1, \\ldots, y_{K^{3dk/2}}\\}, x_{k+1}) \\ge \\frac{1}{\\sqrt{2}} \\,\\middle|\\, x \\right] \\ge \\mathbb{P}[E_{dif}]\\, \\mathbb{P}[x_{k+1} \\notin S_y | E_{dif}] \\ge \\frac{1}{2} \\cdot \\frac{K^{2dk} - k' - K^{\\frac{3d}{2}k}}{K^{2dk} - k'} \\ge \\frac{1}{3}.$$ Therefore, $\\mathbb{E}_{r, x_{k+1}} \\left[ d(\\mathcal{M}_{\\mathcal{A}}(x), x_{k+1})^2 \\,\\middle|\\, x \\right] \\ge \\frac{1}{3} \\left( \\frac{1}{\\sqrt{2}} - \\tau \\right)^2$. Finally, we observe that it is not possible for $\\mathbb{E}_x \\mathbb{P}_{\\mathcal{P}_r} \\left[ L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}_r) < \\epsilon \\,\\middle|\\, x \\right]$ to be more than $1 - \\delta$ if $\\inf_x \\mathbb{E}_{\\mathcal{P}_r} \\left[ L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}_r) \\,\\middle|\\, x \\right] > \\epsilon + 2\\delta$, because $L(\\mathcal{M}_{\\mathcal{A}}(x), \\mathcal{P}_r)$ is bounded above by 2.\n\n8 Algorithmic implications\n\n8.1 $k$-means\n\nApplying Theorem 4 to the case when $\\mathcal{P}$ is a distribution supported equally on $n$ specific points (that are part of an input) in a unit ball of $\\mathbb{R}^m$, we see that in order to obtain an additive $\\epsilon$ approximation for the $k$-means problem with probability $1 - \\delta$, it suffices to sample $$s \\ge C \\left( \\frac{k}{\\epsilon^2} \\min \\left( \\frac{\\log^4 \\frac{k}{\\epsilon}}{\\epsilon^2}, k \\right) + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right)$$ points uniformly at random (which would have a cost of $O(s \\log n)$ if the cost of one random bit is $O(1)$) and exhaustively solve $k$-means on the resulting subset. Supposing that a dot product between two vectors $x_i, x_j$ can be computed using $\\tilde{m}$ operations, the total cost of sampling and then exhaustively solving $k$-means on the sample is $O(\\tilde{m} s k^s \\log n)$. In contrast, if one asks for a multiplicative $(1 + \\epsilon)$ approximation, the best running time known depends linearly on $n$ [13].
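The sampling-based scheme just described can be sketched as follows, with Lloyd's algorithm over random restarts standing in for the exhaustive minimization step; this is therefore a heuristic sketch rather than the algorithm with the stated guarantee, NumPy is assumed, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans_cost(centers, X):
    """Mean squared distance from each point of X to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()

def lloyd(X, k, iters=50):
    """Plain Lloyd iterations from a random initialization."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

# n points near 3 well-separated clusters in R^20; we only ever
# solve k-means on a small subsample of size s.
n, m, k, s = 5000, 20, 3, 200
true_centers = rng.standard_normal((k, m))
true_centers /= 2 * np.linalg.norm(true_centers, axis=1, keepdims=True)
X = true_centers[rng.integers(0, k, n)] + 0.02 * rng.standard_normal((n, m))

sample = X[rng.choice(n, s, replace=False)]
centers = min((lloyd(sample, k) for _ in range(30)),
              key=lambda c: kmeans_cost(c, sample))

print(kmeans_cost(centers, X))   # close to the cost on the subsample
```

The point of Theorem 4 is exactly that the gap between the cost on the subsample and the cost on the full point set is small once $s$ exceeds a bound independent of both $n$ and $m$.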
If $\\mathcal{P}$ is an unknown probability distribution, the above algorithm improves upon the best results in a natural statistical framework for clustering [5].\n\n8.2 Fitting piecewise linear curves\n\nIn this subsection, we illustrate the algorithmic utility of the uniform bound in Theorem 4 by obtaining an algorithm for fitting a curve of length no more than $L$ to data drawn from an unknown probability distribution $\\mathcal{P}$ supported in $B$, whose sample complexity is independent of the ambient dimension. This curve, with probability $1 - \\delta$, achieves a mean squared error of less than $\\epsilon$ more than the optimum. The proof of its correctness and the analysis of its run-time have been deferred to the journal version. The algorithm is as follows:\n\n1. Let $k := \\lceil \\frac{L}{\\epsilon} \\rceil$ and $s \\ge C \\left( \\frac{k}{\\epsilon^2} \\min \\left( \\frac{\\log^4 \\frac{k}{\\epsilon}}{\\epsilon^2}, k \\right) + \\frac{1}{\\epsilon^2} \\log \\frac{1}{\\delta} \\right)$. Sample points $x_1, \\ldots, x_s$ i.i.d from $\\mathcal{P}$, and set $J := \\mathrm{span}(\\{x_i\\}_{i=1}^s)$.\n\n2. For every permutation $\\sigma$ of $[s]$, minimize the convex objective function $\\sum_{i=1}^s d(x_{\\sigma(i)}, y_i)^2$ over the convex set of all $s$-tuples of points $(y_1, \\ldots, y_s)$ in $J$ such that $\\sum_{i=1}^{s-1} \\|y_{i+1} - y_i\\| \\le L$.\n\n3. If the minimum over all $(y_1, \\ldots, y_s)$ (and $\\sigma$) is achieved for $(z_1, \\ldots, z_s)$, output the curve obtained by joining $z_i$ to $z_{i+1}$ for each $i$ by a straight line segment.\n\n9 Acknowledgements\n\nWe are grateful to Stephen Boyd for several helpful conversations.\n\nReferences\n\n[1] Noga Alon, Shai Ben-David, Nicol\u00f2 Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615\u2013631, 1997.\n\n[2] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. In FOCS, pages 616\u2013623, 1999.\n\n[3] Peter Bartlett. 
The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44:1802\u20131813, 1997.\n\n[4] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373\u20131396, 2003.\n\n[5] Shai Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Mach. Learn., 66(2-3):243\u2013257, 2007.\n\n[6] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46:255\u2013308, January 2009.\n\n[7] Sanjoy Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634\u2013644, 1999.\n\n[8] David L. Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591\u20135596, May 2003.\n\n[9] A. Gray. Tubes. Addison-Wesley, 1990.\n\n[10] Trevor J. Hastie and Werner Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502\u2013516, 1989.\n\n[11] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:419\u2013441, 1984.\n\n[12] Bal\u00e1zs K\u00e9gl, Adam Krzyzak, Tam\u00e1s Linder, and Kenneth Zeger. Learning and design of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:281\u2013297, 2000.\n\n[13] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time $(1+\\epsilon)$-approximation algorithm for k-means clustering in any dimensions. In FOCS, pages 454\u2013462, 2004.\n\n[14] Andreas Maurer and Massimiliano Pontil. Generalization bounds for k-dimensional coding schemes in Hilbert spaces. In ALT, pages 79\u201391, 2008.\n\n[15] H. Narayanan and P. Niyogi. On the sample complexity of learning smooth cuts on a manifold. In Proc. of the 22nd Annual Conference on Learning Theory (COLT), June 2009.\n\n[16] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3):419\u2013441, 2008.\n\n[17] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323\u20132326, 2000.\n\n[18] Alexander J. Smola, Sebastian Mika, Bernhard Sch\u00f6lkopf, and Robert C. Williamson. Regularized principal manifolds. J. Mach. Learn. Res., 1:179\u2013209, 2001.\n\n[19] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319\u20132323, 2000.\n\n[20] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249\u2013274, 2005.", "award": [], "sourceid": 1076, "authors": [{"given_name": "Hariharan", "family_name": "Narayanan", "institution": null}, {"given_name": "Sanjoy", "family_name": "Mitter", "institution": null}]}