{"title": "Fast, smooth and adaptive regression in metric spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 1024, "page_last": 1032, "abstract": "It was recently shown that certain nonparametric regressors can escape the curse of dimensionality in the sense that their convergence rates adapt to the intrinsic dimension of data (\\cite{BL:65, SK:77}). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, operates on a general metric space, yields a smooth function, and evaluates in time $O(\\log n)$. We derive a tight convergence rate of the form $n^{-2/(2+d)}$ where $d$ is the Assouad dimension of the input space.", "full_text": "Fast, smooth and adaptive regression in metric spaces\n\nSamory Kpotufe\n\nUCSD CSE\n\nAbstract\n\nIt was recently shown that certain nonparametric regressors can escape the curse\nof dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove\nsome stronger results in more general settings. In particular, we consider a regres-\nsor which, by combining aspects of both tree-based regression and kernel regres-\nsion, adapts to intrinsic dimension, operates on general metrics, yields a smooth\nfunction, and evaluates in time O(log n). We derive a tight convergence rate of\nthe form n\n\n(cid:0)2=(2+d) where d is the Assouad dimension of the input space.\n\n1 Introduction\n\nRelative to parametric methods, nonparametric regressors require few structural assumptions on the\nfunction being learned. However, their performance tends to deteriorate as the number of features\nincreases. This so-called curse of dimensionality is quanti\ufb01ed by various lower bounds on the con-\n(cid:0)2=(2+D) for data in RD (see e.g. [3, 4]). 
In other words, one might\nvergence rates of the form n\nrequire a data size exponential in D in order to attain a low risk.\nFortunately, it is often the case that data in RD has low intrinsic complexity, e.g. the data is near a\nmanifold or is sparse, and we hope to exploit such situations. One simple approach, termed manifold\nlearning (e.g. [5, 6, 7]), is to embed the data into a lower dimensional space where the regressor\nmight work well. A recent approach with theoretical guarantees for nonparametric regression, is\nthe study of adaptive procedures, i.e. ones that operate in RD but attain convergence rates that\ndepend just on the intrinsic dimension of data. An initial result [1] shows that for data on a d-\ndimensional manifold, the asymptotic risk at a point x 2 RD depends just on d and on the behavior\nof the distribution in a neighborhood of x. Later, [2] showed that a regressor based on the RPtree\nof [8] (a hierarchical partitioning procedure) is not only fast to evaluate, but is adaptive to Assouad\ndimension, a measure which captures notions such as manifold dimension and data sparsity. The\nrelated notion of box dimension (see e.g. [9]) was shown in an earlier work [10] to control the risk\nof nearest neighbor regression, although adaptivity was not a subject of that result.\nThis work extends the applicability of such adaptivity results to more general uses of nonparametric\nregression.\nIn particular, we present an adaptive regressor which, unlike RPtree, operates on a\ngeneral metric space where only distances are provided, and yields a smooth function, an important\nproperty in many domains (see e.g. [11] which considers the smooth control of a robotic tool based\non noisy outside input). In addition, our regressor can be evaluated in time just O(log n), unlike\n(\nkernel or nearest neighbor regression. The evaluation time for these two forms of regression is\nlower bounded by the number of sample points contributing to the regression estimate. 
For nearest neighbor regression, this number is given by a parameter $k_n$ whose optimal setting (see [12]) is $O(n^{2/(2+d)})$. For kernel regression, given an optimal bandwidth $h \approx n^{-1/(2+d)}$ (see [12]), we would expect about $nh^d \approx n^{2/(2+d)}$ points in the ball $B(x, h)$ around a query point $x$.

We note that there exist many heuristics for speeding up kernel regression, which generally combine fast proximity search procedures with other elaborate methods for approximating the kernel weights (see e.g. [13, 14, 15]). There are no rigorous bounds on either the achievable speedup or the risk of the resulting regressor.

Figure 1: Left and Middle - Two r-nets at different scales $r$, each net inducing a partition of the sample $X$. In each case, the gray points are the r-net centers. For regression, each center contributes the average $Y$ value of the data points assigned to it (the points in its cell). Right - Given an r-net and a bandwidth $h$, a kernel around a query point $x$ weights the $Y$-contribution of each center to the regression estimate for $x$.

Our regressor integrates aspects of both tree-based regression and kernel regression. It constructs partitions of the input dataset $X = \{X_i\}_1^n$, and uses a kernel to select a few sets within a given partition, each set contributing its average output $Y$ value to the estimate. We show that such a regressor achieves an excess risk of $O(n^{-2/(2+d)})$, where $d$ is the Assouad dimension of the input data space. This is a tighter convergence rate than the $O(n^{-2/(2+O(d \log d))})$ of RPtree regression (see [2]). Finally, the evaluation time of $O(\log n)$ is arrived at by modifying the cover tree proximity search procedure of [16].
Unlike in [16], this guarantee requires no growth assumption on the data distribution.

We'll now proceed with a more detailed presentation of the results in the next section, followed by technical details in sections 3 and 4.

2 Detailed overview of results

We're given i.i.d. training data $(X, Y) = \{(X_i, Y_i)\}_1^n$, where the input variable $X$ belongs to a metric space $\mathcal{X}$ where the distance between points is given by the metric $\rho$, and the output $Y$ belongs to a subset $\mathcal{Y}$ of some Euclidean space. We'll let $\Delta_X$ and $\Delta_Y$ denote the diameters of $\mathcal{X}$ and $\mathcal{Y}$.

Assouad dimension: The Assouad or doubling dimension of $\mathcal{X}$ is defined as the smallest $d$ such that any ball can be covered by $2^d$ balls of half its radius.

Examples: A $d$-dimensional affine subspace of a Euclidean space $\mathbb{R}^D$ has Assouad dimension $O(d)$ [9]. A $d$-dimensional submanifold of a Euclidean space $\mathbb{R}^D$ has Assouad dimension $O(d)$ subject to a bound on its curvature [8]. A $d$-sparse data space in $\mathbb{R}^D$, i.e. one where each data point has at most $d$ nonzero coordinates, has Assouad dimension $O(d \log D)$ [8, 2].

The algorithm has no knowledge of the dimension $d$, nor of $\Delta_Y$, although we assume $\Delta_X$ is known (or can be upper-bounded).

Regression function: We assume the regression function $f(x) := \mathbb{E}[Y \mid X = x]$ is Lipschitz, i.e. there exists $\lambda$, unknown, such that $\forall x, x' \in \mathcal{X}$, $\|f(x) - f(x')\| \le \lambda \cdot \rho(x, x')$.

Excess risk: Our performance criterion for a regressor $f_n(x)$ is the integrated excess $l_2$ risk:

$\|f_n - f\|^2 := \mathbb{E}_X \|f_n(X) - f(X)\|^2 = \mathbb{E}_{X,Y} \|f_n(X) - Y\|^2 - \mathbb{E}_{X,Y} \|f(X) - Y\|^2. \quad (1)$

2.1 Algorithm overview

We'll consider a set of partitions of the data induced by a hierarchy of r-nets of $X$.
Here an r-net $Q_r$ is understood to be both an r-cover of $X$ (all points in $X$ are within $r$ of some point in $Q_r$), and an r-packing (the points in $Q_r$ are at least $r$ apart). The details of how to build the r-nets are covered in section 4. For now, we'll consider a class of regressors defined over these nets (as illustrated in Figure 1), and we'll describe how to select a good regressor out of this class.

Partitions of $X$: The r-nets are denoted by $\{Q_r, r \in \{\Delta_X/2^i\}_{i=0}^{I+2}\}$, where $I := \lceil \log n \rceil$, and $Q_r \subset X$. Each $Q \in \{Q_r, r \in \{\Delta_X/2^i\}_{i=0}^{I+2}\}$ induces a partition $\{X(q), q \in Q\}$ of $X$, where $X(q)$ designates all those points in $X$ whose closest point in $Q$ is $q$. We set $n_q := |X(q)|$, and $\bar{Y}_q = \frac{1}{n_q} \sum_{i: X_i \in X(q)} Y_i$.

Admissible kernels: We assume that $K(u)$ is a non-increasing function of $u \in [0, \infty)$; $K$ is positive on $u \in [0, 1)$, maximal at $u = 0$, and vanishes for $u \ge 1$. To simplify notation, we'll often let $K(x, q, h)$ denote $K(\rho(x, q)/h)$.

Regressors: For each $Q \in \{Q_r, r \in \{\Delta_X/2^i\}_{i=0}^{I+2}\}$, and given a bandwidth $h$, we define the following regressor:

$f_{n,Q}(x) = \sum_{q \in Q} w_q(x) \bar{Y}_q, \quad \text{where} \quad w_q(x) = \frac{n_q (K(x, q, h) + \epsilon)}{\sum_{q' \in Q} n_{q'} (K(x, q', h) + \epsilon)}. \quad (2)$

The positive constant $\epsilon$ ensures that the estimate remains well defined when $K(x, q, h) = 0$. We assume $\epsilon \le K(1/2)/n^2$. We can view $K(\cdot) + \epsilon$ as the effective kernel which never vanishes. It is clear that the learned function $f_{n,Q}$ inherits any degree of smoothness from the kernel function $K$, i.e. if $K$ is of class $C^k$, then so is $f_{n,Q}$.

Selecting the final regressor: For fixed $n$, $K(\cdot)$, and $\{Q_r, r \in \{\Delta_X/2^i\}_{i=0}^{I+2}\}$, equation (2) above defines a class of regressors parameterized by $r \in \{\Delta_X/2^i\}_{i=0}^{I+2}$ and the bandwidth $h$. We want to pick a good regressor out of this class.
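The regressor of equation (2) can be sketched in a few lines. The sketch below is illustrative only and assumes Euclidean inputs and a triangle kernel; the names (net_regressor, centers) and the crude cell-assignment step are not from the paper.

```python
import numpy as np

def net_regressor(X, Y, centers, h, eps, K=lambda u: np.maximum(0.0, 1.0 - u)):
    # Sketch of equation (2): each net center q carries the count n_q and the
    # mean output Ybar_q of its cell; a kernel K (non-increasing, vanishing for
    # u >= 1) weights the centers around a query point. K defaults to a
    # triangle kernel; all names here are illustrative, not the paper's code.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    cell = d.argmin(axis=1)                      # the cells X(q)
    n_q = np.bincount(cell, minlength=len(centers)).astype(float)
    Ybar = np.zeros(len(centers))
    np.add.at(Ybar, cell, Y)                     # per-cell sums of Y
    Ybar /= np.maximum(n_q, 1.0)                 # per-cell means Ybar_q

    def f(x):
        u = np.linalg.norm(centers - x, axis=1) / h
        w = n_q * (K(u) + eps)   # eps keeps the estimate defined when K = 0
        return np.dot(w / w.sum(), Ybar)
    return f
```

Since the weights sum to one, the estimate is always a convex combination of the cell means, hence it stays within the range of the observed outputs.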
We can reduce the search space right away by noticing that we need $r = \Theta(h)$: if $r \gg h$ then $B(x, h) \cap Q_r$ is empty for most $x$ since the points in $Q_r$ are over $r$ apart, and if $r \ll h$ then $B(x, h) \cap Q_r$ might contain a lot of points, thus increasing evaluation time. So for each choice of $h$, we will set $r = h/4$, which will yield good guarantees on computational and prediction performance. The final regressor is selected as follows.

Draw a new sample $(X', Y')$ of size $n$. As before let $I := \lceil \log n \rceil$, and define $H := \{\Delta_X/2^i\}_{i=0}^{I}$. For every $h \in H$, pick the r-net $Q_{h/4}$ and test $f_{n,Q_{h/4}}$ on $(X', Y')$; let the empirical risk be minimized at $h_o$, i.e. $h_o := \arg\min_{h \in H} \frac{1}{n} \sum_{i=1}^n \|f_{n,Q_{h/4}}(X'_i) - Y'_i\|^2$. Return $f_{n,Q_{h_o/4}}$ as the final regressor.

Fast evaluation: Each regressor $f_{n,Q_{h/4}}(x)$ can be evaluated quickly on points $x$ by traversing (nested) r-nets as described in detail in section 4.

2.2 Computational and prediction performance

The cover property ensures that for some $h$, $Q_{h/4}$ is a good summary of local information (for prediction performance), while the packing property ensures that few points in $Q_{h/4}$ fall in $B(x, h)$ (for fast evaluation). We have the following main result.

Theorem 1.
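The selection rule above can be sketched as a simple loop over the grid $H$. This is a minimal sketch, not the paper's implementation: make_regressor is a hypothetical factory (e.g. wrapping the net regressor together with an $(h/4)$-net), and base-2 logs are assumed for the grid.

```python
import numpy as np

def select_bandwidth(train, val, delta_X, n, make_regressor):
    # Sketch of the selection rule: H = {delta_X / 2**i}, i = 0..ceil(log2 n);
    # for each h build the regressor on the training sample and keep the h
    # with smallest empirical risk on the fresh sample (X', Y').
    Xtr, Ytr = train
    Xv, Yv = val
    I = int(np.ceil(np.log2(n)))
    H = [delta_X / 2**i for i in range(I + 1)]
    best_h, best_risk = None, np.inf
    for h in H:
        f = make_regressor(Xtr, Ytr, h)   # regressor over the (h/4)-net
        risk = np.mean([(f(x) - y) ** 2 for x, y in zip(Xv, Yv)])
        if risk < best_risk:
            best_h, best_risk = h, risk
    return best_h
```

Note that the grid has only $O(\log n)$ candidate bandwidths, which is what makes the union bound over $H$ in Corollary 1 cheap.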
Let $d$ be the Assouad dimension of $\mathcal{X}$ and let $n \ge \max\{9, (\lambda\Delta_X/\Delta_Y)^2, (\Delta_Y/\lambda\Delta_X)^2\}$.

(a) The final regressor selected satisfies

$\mathbb{E}\|f_{n,Q_{h_o/4}} - f\|^2 \le C (\lambda\Delta_X)^{2d/(2+d)} \left(\frac{\Delta_Y^2}{n}\right)^{2/(2+d)} + 3\Delta_Y^2 \sqrt{\frac{\ln(n \log n)}{n}},$

where $C$ depends on the Assouad dimension $d$ and on $K(0)/K(1/2)$.

(b) $f_{n,Q_{h_o/4}}(x)$ can be computed in time $C' \log n$, where $C'$ depends just on $d$.

Part (a) of Theorem 1 is given by Corollary 1 of section 3, and does not depend on how the r-nets are built; part (b) follows from Lemma 4 of section 4, which specifies the nets.

3 Risk analysis

Throughout this section we assume $0 < h < \Delta_X$ and we let $Q = Q_{h/4}$. We'll bound the risk of $f_{n,Q}$ for any fixed choice of $h$, and then show that the final $h_o$ selected yields a good risk. The results in this section only require the fact that $Q$ is a cover of the data and thus preserves local information, while the packing property is needed in the next section for fast evaluation.

Define $\tilde{f}_{n,Q}(x) := \mathbb{E}_{Y|X} f_{n,Q}(x)$, i.e. the conditional expectation of the estimate, for $X$ fixed. We have the following standard decomposition of the excess risk into variance and bias terms:

$\forall x \in \mathcal{X}, \quad \mathbb{E}_{Y|X}\|f_{n,Q}(x) - f(x)\|^2 = \mathbb{E}_{Y|X}\|f_{n,Q}(x) - \tilde{f}_{n,Q}(x)\|^2 + \|\tilde{f}_{n,Q}(x) - f(x)\|^2. \quad (3)$

We'll proceed by bounding each term separately in the following two lemmas, and then combine these bounds in Lemma 3. We'll let $\mu$ denote the marginal measure over $\mathcal{X}$ and $\mu_n$ denote the corresponding empirical measure.

Lemma 1 (Variance at $x$). Fix $X$, and let $Q$ be an $h/4$-net of $X$, $0 < h < \Delta_X$. Consider $x \in \mathcal{X}$ such that $X \cap B(x, h/4) \neq \emptyset$.
We have

$\mathbb{E}_{Y|X}\|f_{n,Q}(x) - \tilde{f}_{n,Q}(x)\|^2 \le \frac{2K(0)\Delta_Y^2}{K(1/2) \cdot n\mu_n(B(x, h/4))}.$

Proof. Remember that for independent random vectors $v_i$ with expectation 0, $\mathbb{E}\|\sum_i v_i\|^2 = \sum_i \mathbb{E}\|v_i\|^2$. We apply this fact twice in the inequalities below, given that, conditioned on $X$ and $Q \subset X$, the $Y_i$ values are mutually independent and so are the $\bar{Y}_q$ values. We have

$\mathbb{E}_{Y|X}\|f_{n,Q}(x) - \tilde{f}_{n,Q}(x)\|^2 = \mathbb{E}_{Y|X}\Big\|\sum_{q \in Q} w_q(x)\big(\bar{Y}_q - \mathbb{E}_{Y|X}\bar{Y}_q\big)\Big\|^2 = \sum_{q \in Q} w_q^2(x)\, \mathbb{E}_{Y|X}\big\|\bar{Y}_q - \mathbb{E}_{Y|X}\bar{Y}_q\big\|^2$

$= \sum_{q \in Q} w_q^2(x)\, \frac{1}{n_q^2} \sum_{i: X_i \in X(q)} \mathbb{E}_{Y|X}\Big\|Y_i - \mathbb{E}_{Y|X} Y_i\Big\|^2 \le \sum_{q \in Q} w_q^2(x) \frac{\Delta_Y^2}{n_q} \le \Big(\max_{q \in Q} w_q(x) \frac{\Delta_Y^2}{n_q}\Big) \sum_{q \in Q} w_q(x)$

$= \max_{q \in Q}\left\{\frac{(K(x, q, h) + \epsilon)\Delta_Y^2}{\sum_{q' \in Q} n_{q'}(K(x, q', h) + \epsilon)}\right\} \le \frac{2K(0)\Delta_Y^2}{\sum_{q \in Q} n_q K(x, q, h)}. \quad (4)$

To bound the fraction in (4), we lower-bound the denominator as:

$\sum_{q \in Q} n_q K(x, q, h) \ge \sum_{q: \rho(x,q) \le h/2} n_q K(x, q, h) \ge \sum_{q: \rho(x,q) \le h/2} n_q K(1/2) \ge K(1/2) \cdot n\mu_n(B(x, h/4)).$

The last inequality follows by remarking that, since $Q$ is an $h/4$-cover of $X$, the ball $B(x, h/4)$ can only contain points from $\bigcup_{q: \rho(x,q) \le h/2} X(q)$. Plug this last inequality into (4) and conclude.

Lemma 2 (Bias at $x$). As before, fix $X$, and let $Q$ be an $h/4$-net of $X$, $0 < h < \Delta_X$. Consider $x \in \mathcal{X}$ such that $X \cap B(x, h/4) \neq \emptyset$.
We have

$\|\tilde{f}_{n,Q}(x) - f(x)\|^2 \le 2\lambda^2 h^2 + \Delta_Y^2 \, \frac{n\epsilon}{K(1/2)} \le 2\lambda^2 h^2 + \frac{\Delta_Y^2}{n}.$

Proof. Writing $\bar{f}_q := \frac{1}{n_q} \sum_{i: X_i \in X(q)} f(X_i)$ for the cell averages, we have $\tilde{f}_{n,Q}(x) = \sum_{q \in Q} w_q(x) \bar{f}_q$, so that

$\|\tilde{f}_{n,Q}(x) - f(x)\|^2 = \Big\|\sum_{q \in Q} w_q(x)\big(\bar{f}_q - f(x)\big)\Big\|^2 \le \sum_{q \in Q} w_q(x)\, \frac{1}{n_q} \sum_{i: X_i \in X(q)} \|f(X_i) - f(x)\|^2,$

where we just applied Jensen's inequality on the norm square. We bound the r.h.s. by breaking the summation over two subsets of $Q$ as follows. For $q$ with $\rho(x, q) < h$, every $X_i \in X(q)$ satisfies $\rho(x, X_i) \le \rho(x, q) + \rho(q, X_i) \le h + h/4$, so by the Lipschitz assumption $\|f(X_i) - f(x)\|^2 \le \lambda^2 (5h/4)^2 \le 2\lambda^2 h^2$. For $q$ with $\rho(x, q) \ge h$ we have $K(x, q, h) = 0$ and $\|f(X_i) - f(x)\| \le \Delta_Y$; since $X \cap B(x, h/4) \neq \emptyset$, the set $B(x, h/2) \cap Q$ cannot be empty (remember that $Q$ is an $h/4$-cover of $X$), so the denominator of the weights is at least $K(1/2)$, and the total weight of such $q$ is at most $n\epsilon/K(1/2) \le 1/n$ by the assumption on $\epsilon$. This concludes the argument.

Lemma 3 (Integrated excess risk). Let $Q$ be an $h/4$-net of $X$, $0 < h < \Delta_X$. We have

$\mathbb{E}_{(X,Y)} \|f_{n,Q} - f\|^2 \le C_0 \frac{\Delta_Y^2}{n (h/\Delta_X)^d} + 2\lambda^2 h^2,$

where $C_0$ depends on the Assouad dimension $d$ and on $K(0)/K(1/2)$.

Proof. Applying Fubini's theorem, the expected excess risk $\mathbb{E}_{(X,Y)} \|f_{n,Q} - f\|^2$ can be written as

$\mathbb{E}_X \mathbb{E}_{(X,Y)} \|f_{n,Q}(X) - f(X)\|^2 \left(1_{\{\mu_n(B(X, h/4)) > 0\}} + 1_{\{\mu_n(B(X, h/4)) = 0\}}\right).$

By lemmas 1 and 2 we have for $X = x$ fixed,

$\mathbb{E}_{(X,Y)} \|f_{n,Q}(x) - f(x)\|^2 1_{\{\mu_n(B(x, h/4)) > 0\}} \le C_1 \mathbb{E}_X\left[\frac{\Delta_Y^2 1_{\{\mu_n(B(x, h/4)) > 0\}}}{n\mu_n(B(x, h/4))}\right] + 2\lambda^2 h^2 + \frac{\Delta_Y^2}{n} \le C_1 \frac{2\Delta_Y^2}{n\mu(B(x, h/4))} + 2\lambda^2 h^2 + \frac{\Delta_Y^2}{n}, \quad (5)$

where for the last inequality we used the fact that for a binomial $b(n, p)$, $\mathbb{E}\left[\frac{1_{\{b(n,p) > 0\}}}{b(n,p)}\right] \le \frac{2}{np}$ (see lemma 4.1 of [12]).

For the case where $B(x, h/4)$ is empty, we have

$\mathbb{E}_{(X,Y)} \|f_{n,Q}(x) - f(x)\|^2 1_{\{\mu_n(B(x, h/4)) = 0\}} \le \Delta_Y^2 \mathbb{E}_X 1_{\{\mu_n(B(x, h/4)) = 0\}} = \Delta_Y^2 (1 - \mu(B(x, h/4)))^n \le \Delta_Y^2 e^{-n\mu(B(x, h/4))} \le \frac{\Delta_Y^2}{n\mu(B(x, h/4))}. \quad (6)$

Combining (6) and (5), we can then bound the expected excess risk as

$\mathbb{E}_{(X,Y)} \|f_{n,Q} - f\|^2 \le 3C_1 \Delta_Y^2 \, \mathbb{E}_X\left[\frac{1}{n\mu(B(X, h/4))}\right] + 2\lambda^2 h^2 + \frac{\Delta_Y^2}{n}. \quad (7)$

The expectation on the r.h.s. is bounded using a standard covering argument (see e.g. [12]). Let $\{z_i\}_1^N$ be an $h/8$-cover of $\mathcal{X}$. Notice that for any $z_i$, $x \in B(z_i, h/8)$ implies $B(x, h/4) \supset B(z_i, h/8)$. We therefore have

$\mathbb{E}_X\left[\frac{1}{\mu(B(X, h/4))}\right] \le \sum_{i=1}^N \mathbb{E}_X\left[\frac{1_{\{X \in B(z_i, h/8)\}}}{\mu(B(X, h/4))}\right] \le \sum_{i=1}^N \mathbb{E}_X\left[\frac{1_{\{X \in B(z_i, h/8)\}}}{\mu(B(z_i, h/8))}\right] = N \le C_2 \left(\frac{\Delta_X}{h}\right)^d, \text{ where } C_2 \text{ depends just on } d.$

We conclude by combining the above with (7) to obtain

$\mathbb{E}_{(X,Y)} \|f_{n,Q} - f\|^2 \le \frac{3C_1 C_2 \Delta_Y^2}{n (h/\Delta_X)^d} + 2\lambda^2 h^2 + \frac{\Delta_Y^2}{n}.$

Corollary 1. Let $n \ge \max\{9, (\lambda\Delta_X/\Delta_Y)^2, (\Delta_Y/\lambda\Delta_X)^2\}$. The final regressor selected satisfies

$\mathbb{E}\|f_{n,Q_{h_o/4}} - f\|^2 \le C (\lambda\Delta_X)^{2d/(2+d)} \left(\frac{\Delta_Y^2}{n}\right)^{2/(2+d)} + 3\Delta_Y^2 \sqrt{\frac{\ln(n \log n)}{n}},$

where $C$ depends on the Assouad dimension $d$ and on $K(0)/K(1/2)$.

Proof outline. Let $\tilde{h} = C_3 \left(\frac{\Delta_Y^2}{\lambda^2 n}\right)^{1/(2+d)} \Delta_X^{d/(2+d)} \in H$. We note that $n$ is lower bounded so that such an $\tilde{h}$ is in $H$. We have by Lemma 3 that for $\tilde{h}$,

$\|f_{n,Q_{\tilde{h}/4}} - f\|^2 \le C' (\lambda\Delta_X)^{2d/(2+d)} \left(\frac{\Delta_Y^2}{n}\right)^{2/(2+d)} + \frac{\Delta_Y^2}{n}.$

Applying McDiarmid's inequality to the empirical risk, followed by a union bound over $H$, we have that with probability at least $1 - 1/\sqrt{n}$ over the choice of $(X', Y')$, for all $h \in H$,

$\left| \mathbb{E}_{X,Y} \|f_{n,Q_{h/4}}(X) - Y\|^2 - \frac{1}{n} \sum_{i=1}^n \|f_{n,Q_{h/4}}(X'_i) - Y'_i\|^2 \right| \le \Delta_Y^2 \sqrt{\frac{\ln(|H|\sqrt{n})}{n}}.$

It follows that $\mathbb{E}_{X,Y} \|f_{n,Q_{h_o/4}}(X) - Y\|^2 \le \mathbb{E}_{X,Y} \|f_{n,Q_{\tilde{h}/4}}(X) - Y\|^2 + 2\Delta_Y^2 \sqrt{\frac{\ln(|H|\sqrt{n})}{n}}$, which by (1) implies $\|f_{n,Q_{h_o/4}} - f\|^2 \le \|f_{n,Q_{\tilde{h}/4}} - f\|^2 + 2\Delta_Y^2 \sqrt{\frac{\ln(|H|\sqrt{n})}{n}}$. Take the expectation (given the randomness in the two samples) over this last inequality and conclude.

4 Fast evaluation

In this section we show how to modify the cover-tree procedure of [16] to enable fast evaluation of $f_{n,Q_{h/4}}$ for any $h \in H := \{\Delta_X/2^i\}_1^I$, $I = \lceil \log n \rceil$.

The cover-tree performs proximity search by navigating a hierarchy of nested r-nets of $X$. The navigating-nets of [17] implement the same basic idea. They require additional book-keeping to enable range queries of the form $X \cap B(x, h)$, for a query point $x$.
Here we need to perform range searches of the form $Q_{h/4} \cap B(x, h)$, and our book-keeping will therefore be different from [17]. Note that, for each $h$ and $Q_{h/4}$, one could use a generic range search procedure such as [17] with the data in $Q_{h/4}$ as input, but this requires building a separate data structure for each $h$, which is expensive. We use a single data structure.

4.1 The hierarchy of nets

Consider an ordering $\{X_{(i)}\}_1^n$ of the data points obtained as follows: $X_{(1)}$ and $X_{(2)}$ are the two farthest points in $X$; inductively, for $2 < i \le n$, $X_{(i)}$ is the point of $X$ farthest from $\{X_{(1)}, \ldots, X_{(i-1)}\}$, where the distance to a set is defined as the minimum distance to a point in the set.

For $r \in \{\Delta_X/2^i\}_{i=0}^{I+2}$, define $Q_r = \{X_{(1)}, \ldots, X_{(i)}\}$, where $i \ge 1$ is the highest index such that $\rho(X_{(i)}, \{X_{(1)}, \ldots, X_{(i-1)}\}) \ge r$. Notice that, by construction, $Q_r$ is an r-net of $X$.

4.2 Data structure

The data structure consists of an acyclic directed graph, and range sets, defined below.

Neighborhood graph: The nodes of the graph are the $\{X_{(i)}\}_1^n$, and the edges are given by the following parent-child relationship: starting at $r = \Delta_X/2$, the parent of each node in $Q_r \setminus Q_{2r}$ is the point it is closest to in $Q_{2r}$. The graph is implemented by maintaining an ordered list of children for each node, where the order is given by the children's appearance in the sequence $\{X_{(i)}\}_1^n$. These relationships are depicted in Figure 2.

Figure 2: The r-nets (rows of the left subfigure) are implicit in an ordering of the data. They define a parent-child relationship implemented by the neighborhood graph (right), the structure traversed for fast evaluation.

These ordered lists of children are used to implement the operation nextChildren, defined iteratively as follows. Given $Q \subset \{X_{(i)}\}_1^n$, let visited children denote any child of $q \in Q$ that a previous call to nextChildren has already returned. The call nextChildren(Q) returns children of $q \in Q$ that have not yet been visited, starting with the unvisited child with lowest index in $\{X_{(i)}\}_1^n$, say $X_{(i)}$, and returning all unvisited children in $Q_r$, the first net containing $X_{(i)}$, i.e. $X_{(i)} \in Q_r \setminus Q_{2r}$; $r$ is also returned. The children returned are then marked off as visited. The time complexity of this routine is just the number of children returned.

Range sets: For each node $X_{(i)}$ and each $r \in \{\Delta_X/2^i\}_{i \ge 0}$, we maintain a set of neighbors of $X_{(i)}$ in $Q_r$ defined as $R_{(i),r} := \{q \in Q_r : \rho(X_{(i)}, q) \le 8r\}$.

4.3 Evaluation

Procedure evaluate(x, h):

    Q := Q_{\Delta_X}
    repeat
        (Q', r) := nextChildren(Q)
        Q'' := Q \cup Q'
        if r < h/4 or Q' = \emptyset then           // We reached past Q_{h/4}.
            X_{(i)} := argmin_{q \in Q} \rho(x, q)  // Closest point to x in Q_{h/4}.
            Q := R_{(i),h/4} \cap B(x, h)           // Search in a range of 2h around X_{(i)}.
            break loop
        if \rho(x, Q'') \ge h + 2r then             // The set Q_{h/4} \cap B(x, h) is empty.
            Q := \emptyset
            break loop
        Q := \{q \in Q'' : \rho(x, q) < \rho(x, Q'') + 2r\}
    until done
    // At this point Q = Q_{h/4} \cap B(x, h).
    return

$f_{n,Q_{h/4}}(x) = \frac{\sum_{q \in Q} n_q (K(x, q, h) + \epsilon) \bar{Y}_q + \epsilon\left(\sum_{q \in Q_{h/4}} n_q \bar{Y}_q - \sum_{q \in Q} n_q \bar{Y}_q\right)}{\sum_{q \in Q} n_q (K(x, q, h) + \epsilon) + \epsilon\left(n - \sum_{q \in Q} n_q\right)}.$

The evaluation procedure consists of quickly identifying the closest point $X_{(i)}$ to $x$ in $Q_{h/4}$ and then searching in the range of $X_{(i)}$ for the points in $Q_{h/4} \cap B(x, h)$.
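Stepping back, the net hierarchy of section 4.1 that this procedure traverses can be sketched directly: a greedy farthest-point ordering of the sample, with each $Q_r$ a prefix of that ordering. This is a minimal illustrative sketch, not the paper's data structure: it is $O(n^2)$, assumes Euclidean inputs, and seeds the ordering at index 0 rather than at the two farthest points.

```python
import numpy as np

def farthest_point_ordering(X):
    # Repeatedly take the point farthest from those already chosen; every
    # r-net Q_r is then a prefix of this ordering. O(n^2) for clarity.
    n = len(X)
    order = [0]                                  # arbitrary seed (a simplification)
    dist = np.linalg.norm(X - X[0], axis=1)      # distance to the chosen set
    insertion_radius = [np.inf]                  # rho(X_(i), {X_(1)..X_(i-1)})
    for _ in range(n - 1):
        i = int(dist.argmax())
        insertion_radius.append(float(dist[i]))
        order.append(i)
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    return order, insertion_radius

def net_prefix(order, insertion_radius, r):
    # Q_r = the longest prefix whose insertion radii are all >= r.
    k = 1
    while k < len(order) and insertion_radius[k] >= r:
        k += 1
    return [order[j] for j in range(k)]
```

The prefix property is what makes a single structure serve all bandwidths: each point added was at least $r$ from all earlier ones (packing), and the first excluded point, being the farthest remaining, witnesses that everything else is within $r$ of the prefix (cover).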
The identification of $X_{(i)}$ is done by going down the levels of nested nets, and discarding those points (and their descendants) that are certain to be farther from $x$ than $X_{(i)}$ (we will argue that $\rho(x, Q'') + 2r$ is an upper bound on $\rho(x, X_{(i)})$). Also, if $x$ is far enough from all points at the current level (second if-clause), we can safely stop early because $B(x, h)$ cannot contain points from $Q_{h/4}$ (we'll see that points in $Q_{h/4}$ are all within $2r$ of their ancestor at the current level).

Lemma 4. The call to procedure evaluate(x, h) correctly evaluates $f_{n,Q_{h/4}}(x)$ and has time complexity $C \log(\Delta_X/h) + \log n$, where $C$ is at most $2^{8d}$.

Proof. We first show that the algorithm correctly returns $f_{n,Q_{h/4}}(x)$, and we then argue its running time.

Correctness of evaluate. The procedure works by first finding the closest point to $x$ in $Q_{h/4}$, say $X_{(i)}$, and then identifying all nodes in $Q_{h/4} \cap B(x, h) = R_{(i),h/4} \cap B(x, h)$ (see the first if-clause). We just have to show that this closest point $X_{(i)}$ is correctly identified.

We'll argue the following loop invariant I: at the beginning of the loop, $X_{(i)}$ is either in $Q'' = Q \cup Q'$ or is a descendant of a node in $Q'$. Let's consider some iteration where I holds (it certainly does in the first iteration).

If the first if-clause is entered, then $Q$ is contained in $Q_{h/4}$ but $Q'$ is not, so $X_{(i)}$ must be in $Q$ and we correctly return.

Suppose the first if-clause is not entered. Now let $X_{(j)}$ be the ancestor in $Q'$ of $X_{(i)}$, or let it be $X_{(i)}$ itself if it's in $Q''$.
Let $r$ be as defined in evaluate; we have $\rho(X_{(i)}, X_{(j)}) \le \sum_{k=0}^{\infty} r/2^k = 2r$ by going down the parent-child relations. It follows that

$\rho(x, Q'') \le \rho(x, X_{(j)}) \le \rho(x, X_{(i)}) + \rho(X_{(i)}, X_{(j)}) < \rho(x, X_{(i)}) + 2r.$

In other words, we have $\rho(x, X_{(i)}) > \rho(x, Q'') - 2r$. Thus, if the second if-clause is entered, we necessarily have $\rho(x, X_{(i)}) > h$, i.e. $B(x, h) \cap Q_{h/4} = \emptyset$, and we correctly return.

Now assume none of the if-clauses is entered. Let $X_{(j)} \in Q''$ be any of the points removed from $Q''$ to obtain the next $Q$. Let $X_{(k)}$ be a child of $X_{(j)}$ that has not yet been visited, or a descendant of such a child. If neither such $X_{(j)}$ nor $X_{(k)}$ is $X_{(i)}$ then, by definition, I must hold at the next iteration. We certainly have $X_{(j)} \neq X_{(i)}$, since $\rho(x, X_{(j)}) \ge \rho(x, Q'') + 2r > \rho(x, X_{(i)})$. Now notice that, by the same argument as above, $\rho(X_{(j)}, X_{(k)}) \le \sum_{k=0}^{\infty} r/2^k = 2r$. We thus have $\rho(x, X_{(k)}) \ge \rho(x, X_{(j)}) - 2r \ge \rho(x, Q'') \ge \rho(x, X_{(i)})$, so we know $X_{(k)} \neq X_{(i)}$.

Runtime of evaluate. Starting from $Q_{\Delta_X}$, a different net $Q_r$ is reached at every iteration, and the loop stops when we reach past $Q_{h/4}$. Therefore the loop is entered at most $\log(4\Delta_X/h)$ times. In each iteration, most of the work is done parsing through $Q''$, besides the time spent on the range search in the last iteration. So the total runtime is $O(\log(4\Delta_X/h) \cdot \max|Q''|)$ plus the range search time. We just need to bound $\max|Q''| \le \max|Q| + \max|Q'|$ and the range search time.

The following fact (see e.g.
Lemma 4.1 of [9]) will come in handy: consider $r_1$ and $r_2$ such that $r_1/r_2$ is a power of 2, and let $B \subset X$ be a ball of radius $r_1$; since $\mathcal{X}$ has Assouad dimension $d$, the smallest $r_2$-cover of $B$ is of size at most $(r_1/r_2)^d$, and the largest $r_2$-packing of $B$ is of size at most $(r_1/r_2)^{2d}$. This is true for any metric space, and therefore holds for $X$, which is of Assouad dimension at most $d$ by inclusion in $\mathcal{X}$.

Let $Q' \subset Q_r$ so that $Q \subset Q_{2r}$ at the beginning of some iteration. Let $q \in Q$; the children of $q$ in $Q'$ are not in $Q_{2r}$ and therefore are all within $2r$ of $q$; since these children form an r-packing of $B(q, 2r)$, there are at most $2^{2d}$ of them. Thus, $\max|Q'| \le 2^{2d} \max|Q|$.

Initially $Q = Q_{\Delta_X}$, so we have $|Q| \le 2^{2d}$ since $Q_{\Delta_X}$ is a $\Delta_X$-packing of $X \subset B(X_{(1)}, 2\Delta_X)$. At the end of each iteration we have $Q \subset B(x, \rho(x, Q'') + 2r)$. Now $\rho(x, Q'') \le h + 2r \le 4r + 2r$ since the if-clauses were not entered if we got to the end of the iteration. Thus, $Q$ is an r-packing of $B(x, 8r)$, and therefore $\max|Q| \le 2^{8d}$.

To finish, the range search around $X_{(i)}$ takes time $|R_{(i),h/4}| \le 2^{8d}$, since $R_{(i),h/4}$ is an $h/4$-packing of $B(X_{(i)}, 2h)$.

Acknowledgements

This work was supported by the National Science Foundation (under grants IIS-0347646, IIS-0713540, and IIS-0812598) and by a fellowship from the Engineering Institute at the Los Alamos National Laboratory. Many thanks to the anonymous NIPS reviewers for their useful comments, and thanks to Sanjoy Dasgupta for advice on the presentation.

References

[1] P. Bickel and B. Li. Local polynomial regression on unknown manifolds. Technical report, Dept. of Statistics, UC Berkeley, 2006.

[2] S. Kpotufe. Escaping the curse of dimensionality with a tree-based regressor. COLT, 2009.

[3] C. J. Stone. Optimal rates of convergence for non-parametric estimators. Ann.
Statist., 8:1348–1360, 1980.

[4] C. J. Stone. Optimal global rates of convergence for non-parametric estimators. Ann. Statist., 10:1340–1353, 1982.

[5] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

[6] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.

[7] J. B. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for non-linear dimensionality reduction. Science, 290:2319–2323, 2000.

[8] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. STOC, 2008.

[9] K. Clarkson. Nearest-neighbor searching and metric space dimensions. Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, 2005.

[10] S. Kulkarni and S. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41, 1995.

[11] S. Schaal and C. Atkeson. Robot juggling: an implementation of memory-based learning. Control Systems Magazine, IEEE, 1994.

[12] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution Free Theory of Nonparametric Regression. Springer, New York, NY, 2002.

[13] D. Lee and A. Gray. Faster Gaussian summation: theory and experiment. UAI, 2006.

[14] D. Lee and A. Gray. Fast high-dimensional kernel summations using the Monte Carlo multipole method. NIPS, 2008.

[15] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 1997.

[16] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbors. ICML, 2006.

[17] R. Krauthgamer and J. Lee. Navigating nets: simple algorithms for proximity search. SODA, 2004.
", "award": [], "sourceid": 1009, "authors": [{"given_name": "Samory", "family_name": "Kpotufe", "institution": null}]}