{"title": "Optimal rates for k-NN density and mode estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 2555, "page_last": 2563, "abstract": "We present two related contributions of independent interest: (1) high-probability finite sample rates for $k$-NN density estimation, and (2) practical mode estimators -- based on $k$-NN -- which attain minimax-optimal rates under surprisingly general distributional conditions.", "full_text": "Optimal rates for k-NN density and mode estimation\n\nSanjoy Dasgupta\n\nUniversity of California, San Diego, CSE\n\ndasgupta@eng.ucsd.edu\n\nSamory Kpotufe \u2217\n\nPrinceton University, ORFE\nsamory@princeton.edu\n\nAbstract\n\nWe present two related contributions of independent interest: (1) high-probability\n\ufb01nite sample rates for k-NN density estimation, and (2) practical mode estimators\n\u2013 based on k-NN \u2013 which attain minimax-optimal rates under surprisingly general\ndistributional conditions.\n\n1\n\nIntroduction\n\nWe prove \ufb01nite sample bounds for k-nearest neighbor (k-NN) density estimation, and subsequently\napply these bounds to the related problem of mode estimation. These two main results, while related,\nare interesting on their own.\nFirst, k-NN density estimation [1] is one of the better known and simplest density estimation pro-\ncedures. The estimate fk(x) of an unknown density f (see De\ufb01nition 1 of Section 3) is a simple\nfunctional of the distance rk(x) from x to its k-th nearest neighbor in a sample X[n] (cid:44) {Xi}n\ni=1.\nAs such it is intimately related to other functionals of rk(x), e.g. the degree of vertices x in k-NN\ngraphs and their variants used in modeling communities and in clustering applications (see e.g. [2]).\nWhile this procedure has been known for a long time, its convergence properties are still not fully\nunderstood. 
The bulk of research in the area has concentrated on establishing its asymptotic con-\nvergence, while its \ufb01nite sample properties have received little attention in comparison. Our \ufb01nite\nsample bounds are concisely derived once the proper tools are identi\ufb01ed. The bounds hold with high\nprobability, under general conditions on the unknown density f. This generality proves quite useful\nas shown in our subsequent application to the problem of mode estimation.\nThe basic problem of estimating the modes (local maxima) of an unknown density f has also been\nstudied for a while (see e.g. [3] for an early take on the problem). It arises in various unsupervised\nproblems where modes are used as a measure of typicality of a sample X. In particular, in modern\napplications, mode estimation is often used in clustering, with the modes representing cluster centers\n(see e.g. [4, 5] and general applications of the popular mean-shift procedure).\nWhile there exists a rich literature on mode estimation, the bulk of theoretical work concerns es-\ntimators of a single mode (highest maximum of f), and often concentrates on procedures that are\nhard to implement in practice. Given the generality of our \ufb01rst result on k-NN density estimation,\nwe can prove that some simple implementable procedures yield optimal estimates of the modes of\nan unknown density f, under surprisingly general conditions on f.\nOur results are overviewed in the following section, along with an overview of the rich literature on\nk-NN density estimation and mode estimation. 
This is followed by our theoretical setup in Section 3;\nour rates for k-NN density estimation are detailed in Section 4, while the results on mode estimation\nare given in Section 5.\n\n\u2217Much of this work was conducted when this author was at TTI-Chicago.\n\n2 Overview of results and related work\n\n2.1 Rates for k-NN density estimates\n\nThe k-NN density estimator dates back perhaps to the early work of [1] where it is shown to be\nconsistent when the unknown density f is continuous on R^d. While one of the best known and\nsimplest procedures for density estimation, it has proved more cumbersome to analyze than its smooth\ncounterpart, the kernel density estimator.\nMore general consistency results such as [6, 7] have been established since its introduction. In\nparticular [6] shows that, for f Lipschitz in a neighborhood of a point x, where f(x) > 0, and\nk = k(n) satisfying k \u2192 \u221e and k/n^{2/(2+d)} \u2192 0, the estimator is asymptotically normal, i.e.\n$\\sqrt{k}\\,(f_k(x) - f(x))/f(x) \\xrightarrow{D} \\mathcal{N}(0, 1)$. The recent work of [8], concerning generalized weighted\nvariants of k-NN, shows that asymptotic normality holds under the weaker restriction k/n^{4/(4+d)} \u2192\n0 if f is twice differentiable at x.\nAsymptotic normality as stated above yields some insight into the rate of convergence of fk: we\ncan expect that $|f_k(x) - f(x)| \\lesssim f(x)/\\sqrt{k}$ under the stated conditions on k. In fact, [8] shows\nthat such a result can be obtained in expectation for n = n(x) suf\ufb01ciently large. In particular,\ntheir conditions on k allow for a setting of k \u2248 n^{4/(4+d)} (not allowed under the above conditions)\nyielding a minimax-optimal l2 risk $\\mathbb{E}\\,|f_k(x) - f(x)|^2 \\lesssim f(x)^2/k = O(n^{-4/(4+d)})$.\nWhile consistency results and bounds on expected error are now well understood, we still don\u2019t have\na clear understanding of the conditions under which high probability bounds on |fk(x) \u2212 f (x)| are\npossible. 
This is particularly important given the inherent instability of nearest neighbors estimates\nwhich are based on order-statistics rather than the more stable average statistics at the core of kernel-\ndensity estimates. The recent result of [9] provides an initial answer: they obtain a high-probability\nbound uniformly over x taking value in the sample X[n], however under conditions not allowing for\noptimal settings of k (where f is assumed Lipschitz).\nThe bounds in the present paper hold with high-probability, simultaneously for all x in the support\nof f. Rather than requiring smoothness conditions on f, we simply give the bounds in terms of the\nmodulus of continuity of f at any x, i.e. how much f can change in a neighborhood of x. This\nallows for a useful degree of \ufb02exibility in applying these bounds. In particular, optimal bounds\nunder various degrees of smoothness of f at x easily follow. More importantly, for our application\nto mode estimation, the bounds allow us to handle |fk(x) \u2212 f (x)| at different x \u2208 Rd with varying\nsmoothness in f. As a result we can derive minimax-optimal mode estimation rates for practical\nprocedures under surprisingly weak assumptions.\n\n2.2 Mode estimation\n\nThere is an extensive literature on mode estimation and we unfortunately can only overview some\nof the relevant work. Most of the literature covers the case of a unimodal distribution, or one where\nthere is a single maximizer x0 of f.\nEarly work on estimating the (single) mode of a distribution focused primarily on understand-\ning the consistency and rates achievable by various approaches, with much less emphasis on the\nease of implementation of these approaches. The common approaches consist of estimating x0 as\n\u02c6x (cid:44) arg supx\u2208Rd fn(x) where fn is an estimate of f, usually a kernel density estimate. 
Various\nworks such as [3, 10, 11] establish consistency properties of the approach and achievable rates under\nvarious Euclidean settings and regularity assumptions on the distribution F. More recent work such\nas [12, 13] addresses the problem of optimal choice of bandwidth and kernel to adaptively achieve\nthe minimax risk for mode estimation. Essentially, under smoothness \u03ba (e.g. f is \u03ba times differentiable),\nthe minimax risk ($\\inf_{\\hat{x}} \\sup_f \\mathbb{E}_f \\|\\hat{x} - x_0\\|$) is of the form $n^{-(\\kappa-1)/(2\\kappa+d)}$, as independently\nestablished in [14] and [15].\nAs noticed early in [16], the estimator $\\arg\\sup_{x \\in \\mathbb{R}^d} f_n(x)$, while yielding much insight into the\nproblem, is hard to implement in practice. Hence, other work, apparently starting with [16, 14],\nhas looked into so-called recursive estimators of the (single) mode which are practical and easy\nto update as the sample size increases. These approaches can be viewed as some form of gradient\nascent of fn with carefully chosen step sizes. The later versions of [14] are shown to be minimax-\noptimal. Another line of work is that of so-called direct mode estimators which estimate the mode\nfrom practical statistics of the data [17, 18]. In particular, [18] shows that the simple and practical\nestimator $\\arg\\max_{x \\in X_{[n]}} f_n(x)$, where fn is a kernel-density estimator, is a consistent estimator of\nthe mode. We show in the present paper that $\\arg\\max_{x \\in X_{[n]}} f_k(x)$, where fk is a k-NN density\nestimator, is not only consistent, but converges at a minimax-optimal rate under surprisingly mild\ndistributional conditions.\nThe more general problem of estimating all modes of a distribution has received comparatively little\nattention. 
The best known practical approach for this problem is the mean-shift procedure and its\nvariants [19, 4, 20, 21], quite related to recursive mode estimators, as they essentially consist of\ngradient ascent of fn starting from every sample point, where fn is required to be appropriately\nsmooth to ascend (e.g. a smooth kernel estimate). While mean-shift is popular in practice, it has\nproved quite dif\ufb01cult to analyze. A recent result of [22] comes close to establishing the consistency\nof mean-shift, as it establishes the convergence of the procedure to the right gradient lines (essentially\nthe ascent path to the mode) if it is seeded from \ufb01xed starting points rather than the random\nsamples themselves. It remains unclear however whether mean-shift produces only true modes,\ngiven the inherent variability in estimating f from a sample. This question was recently addressed by\n[23] which proposes a hypothesis test to detect false modes based on con\ufb01dence intervals around\nHessians estimated at the modes returned by any procedure.\nInterestingly, while a k-NN density estimate fk is far from smooth, in fact not even continuous, we\nshow a simple practical procedure that identi\ufb01es any mode of the unknown density f under mild\nconditions: we mainly require that f is well approximated by a quadratic in a neighborhood of\neach mode. Our \ufb01nite sample rates (on $\\|\\hat{x} - x_0\\|$, for an estimate $\\hat{x}$ of any mode x0) are of the\nform O(k^{\u22121/4}), hold with high probability and are minimax-optimal for an appropriate choice of\nk = \u0398(n^{4/(4+d)}).\nIf in addition f is Lipschitz or more generally H\u00f6lder-continuous (in principle uniform continuity\nof f is enough), all the modes returned above a level set \u03bb of fk can be optimally assigned to\nseparate modes of the unknown f. 
Since $\\lambda \\xrightarrow{n \\to \\infty} 0$, the procedure consistently prunes false modes.\nThis feature is made intrinsic to the procedure by borrowing from insights of [9, 24] on identifying\nfalse clusters by inspecting level sets of fn. These last works concern the related area of level set\nestimation, and do not study mode estimation rates.\nAs alluded to so far, our results are given in terms of local assumptions on modes rather than\nglobal distributional conditions. We show that any mode that is suf\ufb01ciently salient (this is locally\nparametrized) w.r.t. the \ufb01nite sample size n, is optimally estimated, while false modes are pruned\naway. In particular our results allow for f having a countably in\ufb01nite number of modes.\n\n3 Preliminaries\n\nThroughout the analysis, we assume access to a sample $X_{[n]} = \\{X_i\\}_{i=1}^n$ drawn i.i.d. from an\nabsolutely continuous distribution F over R^d, with Lebesgue-density function f. We let $\\mathcal{X}$ denote\nthe support of the density function f.\nThe k-NN density estimate at a point x is de\ufb01ned as follows.\nDe\ufb01nition 1 (k-NN density estimate). For every x \u2208 R^d, let rk(x) denote the distance from x to its\nk-th nearest neighbor in X[n]. The density estimate is given as:\n\n$$f_k(x) \\triangleq \\frac{k}{n \\cdot v_d \\cdot r_k(x)^d},$$\n\nwhere vd denotes the volume of the unit ball in R^d.\n\nAll balls considered in the analysis are closed Euclidean balls of R^d.\n\n4 k-NN density estimation rates\nIn this section we bound the error in estimating f (x) as fk(x) at every x \u2208 $\\mathcal{X}$. The main results of\nthe section are Lemmas 3 and 4. These lemmas are easily obtained given the right tools: uniform\nconcentration bounds on the empirical mass of balls in R^d, using relative Vapnik-Chervonenkis\nbounds, i.e. Bernstein-type bounds rather than Chernoff-type bounds (see e.g. Theorem 5.1 of\n[25]). We next state a form of these bounds for completeness.\nLemma 1. 
Let G be a class of functions from X to {0, 1} with VC dimension d < \u221e, and P a\nprobability distribution on X . Let E denote expectation with respect to P. Suppose n points are\ndrawn independently at random from P; let En denote expectation with respect to this sample. Then\nfor any \u03b4 > 0, with probability at least 1 \u2212 \u03b4, the following holds for all g \u2208 G:\n\n$$-\\min\\left(\\beta_n\\sqrt{\\mathbb{E}_n g},\\; \\beta_n^2 + \\beta_n\\sqrt{\\mathbb{E} g}\\right) \\le \\mathbb{E} g - \\mathbb{E}_n g \\le \\min\\left(\\beta_n^2 + \\beta_n\\sqrt{\\mathbb{E}_n g},\\; \\beta_n\\sqrt{\\mathbb{E} g}\\right),$$\n\nwhere $\\beta_n = \\sqrt{(4/n)(d \\ln 2n + \\ln(8/\\delta))}$.\n\nThis sort of relative VC bound allows for a tighter relation (than Chernoff-type bounds) between\nempirical and true mass of sets (Eng and Eg) in those situations where these quantities are small,\ni.e. of the order of $\\beta_n^2 = \\tilde{O}(1/n)$ above. This is particularly useful since the balls we have to deal\nwith are those containing approximately k points, and hence of (small) mass approximately k/n.\nA direct result of the above lemma is the following lemma of [26]. This next lemma essentially\nreworks Lemma 1 above into a form we can use more directly. We re-use C\u03b4,n below throughout\nthe analysis.\nLemma 2 ([26]). Pick 0 < \u03b4 < 1. Let $C_{\\delta,n} \\triangleq 16 \\log(2/\\delta) \\sqrt{d \\log n}$. Assume k \u2265 d log n. With\nprobability at least 1 \u2212 \u03b4, for every ball B \u2282 R^d we have,\n\n$$F(B) \\ge C_{\\delta,n}\\frac{\\sqrt{d \\log n}}{n} \\implies F_n(B) > 0,$$\n\n$$F(B) \\ge \\frac{k}{n} + C_{\\delta,n}\\frac{\\sqrt{k}}{n} \\implies F_n(B) \\ge \\frac{k}{n}, \\text{ and}$$\n\n$$F(B) \\le \\frac{k}{n} - C_{\\delta,n}\\frac{\\sqrt{k}}{n} \\implies F_n(B) < \\frac{k}{n}.$$\n\nThe main idea in bounding fk(x) is to bound the random term rk(x) in terms of f (x) using Lemma\n2 above. 
We can deduce from the lemma that if a ball B(x, r) centered at x has mass roughly k/n, then\nits empirical mass is likely to be of the order k/n; hence rk(x) is likely to be close to the radius r\nof B(x, r). Now if f does not vary too much in B(x, r), then we can express the mass of B(x, r) in\nterms of f (x), and thus get our desired bound on rk(x) and fk(x) in terms of f (x).\nOur results are given in terms of how f varies in a neighborhood of x, captured as follows.\nDe\ufb01nition 2. For x \u2208 R^d and \u03b5 > 0, de\ufb01ne\n\n$$\\hat{r}(\\epsilon, x) \\triangleq \\sup\\left\\{ r : \\sup_{\\|x - x'\\| \\le r} f(x') - f(x) \\le \\epsilon \\right\\}, \\quad \\text{and} \\quad \\check{r}(\\epsilon, x) \\triangleq \\sup\\left\\{ r : \\sup_{\\|x - x'\\| \\le r} f(x) - f(x') \\le \\epsilon \\right\\}.$$\n\nThe continuity parameters \u02c6r(\u03b5, x) and \u02c7r(\u03b5, x) (related to the modulus of continuity of f at x) are easily\nbounded under smoothness assumptions on f at x. Our high-probability bounds on the estimates\nfk(x) in terms of f (x) and the continuity parameters are given as follows.\nLemma 3 (Upper-bound on fk). Suppose $k \\ge 4C_{\\delta,n}^2$. Then, with probability at least 1 \u2212 \u03b4, for all\nx \u2208 R^d and all \u03b5 > 0,\n\n$$f_k(x) < \\left(1 + 2\\frac{C_{\\delta,n}}{\\sqrt{k}}\\right)(f(x) + \\epsilon),$$\n\nprovided k satis\ufb01es $v_d \\cdot \\hat{r}(\\epsilon, x)^d \\cdot (f(x) + \\epsilon) \\ge \\frac{k}{n} - C_{\\delta,n}\\frac{\\sqrt{k}}{n}$.\n\nLemma 4 (Lower-bound on fk). With probability at least 1 \u2212 \u03b4, for all x \u2208 R^d and all \u03b5 > 0,\n\n$$f_k(x) \\ge \\left(1 - \\frac{C_{\\delta,n}}{\\sqrt{k}}\\right)(f(x) - \\epsilon),$$\n\nprovided k satis\ufb01es $v_d \\cdot \\check{r}(\\epsilon, x)^d \\cdot (f(x) - \\epsilon) \\ge \\frac{k}{n} + C_{\\delta,n}\\frac{\\sqrt{k}}{n}$.\nThe proofs of these results are concise applications of Lemma 2 above. They are given in the appendix\n(long version). 
The trick is in showing that, under the conditions on k, there exists an $r \\approx (k/(n \\cdot f(x)))^{1/d}$\nwhich is at most \u02c6r(\u03b5, x) or \u02c7r(\u03b5, x) as appropriate; hence, f does not vary much on B(x, r)\nso we must have\n\n$$F(B(x, r)) \\approx \\mathrm{volume}(B(x, r)) \\cdot f(x) = v_d \\cdot r^d \\cdot f(x) \\approx \\frac{k}{n}.$$\n\nUsing Lemma 2 we get rk(x) \u2248 r; plug this value into fk(x) to obtain $f_k(x) \\approx (1 + 1/\\sqrt{k}) f(x)$.\nLemmas 3 and 4 allow a great deal of \ufb02exibility as we will soon see with their application to mode\nestimation. In particular we can consider various smoothness conditions simultaneously at different\nx for different biases \u03b5.\nSuppose for instance that f is locally H\u00f6lder at x, i.e. \u2203r, L, \u03b2 > 0 s.t. for all x' \u2208 B(x, r),\n$|f(x) - f(x')| \\le L\\|x - x'\\|^\\beta$. Then for small \u03b5, both \u02c6r(\u03b5, x) and \u02c7r(\u03b5, x) are at least\n$(\\epsilon/L)^{1/\\beta}$; pick $\\epsilon = O(f(x)/\\sqrt{k})$ for n suf\ufb01ciently large, then by both lemmas we have, w.h.p.,\n$|f_k(x) - f(x)| \\le O(f(x)/\\sqrt{k})$ provided $k = \\Omega(\\log^2 n)$ and satis\ufb01es $v_d \\left(1/(L\\sqrt{k})\\right)^{d/\\beta} f(x) \\ge Ck/n$\nfor some constant C. This allows for a setting of $k = \\Theta\\left(n^{2\\beta/(2\\beta+d)}\\right)$ for a minimax-optimal rate\nof $|f_k(x) - f(x)| = O\\left(n^{-\\beta/(2\\beta+d)}\\right)$.\n\nThe ability to consider various biases \u03b5 will prove particularly helpful in the next section on\nmode estimation where we have to consider different approximations in different parts of space with\nvarying smoothness in f. In particular, at a mode x, we will essentially have \u03b2 = 2 (f is twice\ndifferentiable) while elsewhere on $\\mathcal{X}$ we might not have much smoothness in f.\n\n5 Mode estimation\n\nWe start with the following de\ufb01nition of modes.\nDe\ufb01nition 3. 
We denote the set of modes of f by $M \\equiv \\{x : \\exists r > 0, \\forall x' \\in B(x, r) \\setminus \\{x\\}, f(x') < f(x)\\}$.\nWe need the following assumption at modes.\nAssumption 1. f is twice differentiable in a neighborhood of every x \u2208 M. We denote the gradient\nand Hessian of f by \u2207f and \u2207^2 f. Furthermore, \u2207^2 f(x) is negative de\ufb01nite at all x \u2208 M.\nAssumption 1 excludes modes at the boundary of the support of f (where f cannot be continuously\ndifferentiable). We note that most work on the subject considers only interior modes as we are\ndoing here. Modes on the boundary can however be handled under additional boundary smoothness\nassumptions to ensure that f puts suf\ufb01cient mass on any ball around such modes. This however only\ncomplicates the analysis, while the main insights remain the same as for interior modes.\nAn implication of Assumption 1 is that for all x \u2208 M, \u2207f is continuous in a neighborhood of x,\nwith \u2207f (x) = 0. Together with \u2207^2 f(x) \u227a 0 (i.e. negative de\ufb01nite), f is well-approximated by a\nquadratic in a neighborhood of a mode x \u2208 M. This is stated in the following lemma.\nLemma 5. Let f satisfy Assumption 1. Consider any x \u2208 M. Then there exists a neighborhood\nB(x, r), r > 0, and constants $\\hat{C}_x, \\check{C}_x > 0$ such that, for all x' \u2208 B(x, r), we have\n\n$$\\check{C}_x \\|x' - x\\|^2 \\le f(x) - f(x') \\le \\hat{C}_x \\|x' - x\\|^2. \\qquad (1)$$\n\nWe can therefore parametrize a mode x \u2208 M locally as follows:\nDe\ufb01nition 4 (Critical radius rx around mode x). 
For every mode x \u2208 M, there exists rx > 0, such\nthat B(x, rx) is contained in a set Ax, satisfying the following conditions:\n(i) Ax is a connected component of a level set $\\mathcal{X}^\\lambda \\triangleq \\{x' \\in \\mathcal{X} : f(x') > \\lambda\\}$ for some \u03bb > 0.\n(ii) $\\exists\\, \\hat{C}_x, \\check{C}_x > 0$, $\\forall x' \\in A_x$, $\\check{C}_x \\|x' - x\\|^2 \\le f(x) - f(x') \\le \\hat{C}_x \\|x' - x\\|^2$. (So Ax \u2229 M = {x}.)\n\nReturn $\\arg\\max_{x \\in X_{[n]}} f_k(x)$.\n\nFigure 1: Estimate the mode of a unimodal density f from X[n].\n\nFigure 2: The analysis argues over different regions (depicted) around a mode x.\n\nFinally, we assume that every hill in f corresponds to a mode in M:\nAssumption 2. Each connected component of any level set $\\mathcal{X}^\\lambda$, \u03bb > 0, contains a mode in M.\n\n5.1 Single mode\nWe start with the simple but common assumption that |M| = 1. This case has been extensively\nstudied to get a handle on the inherent dif\ufb01culty of mode estimation. The usual procedures in the\nstatistical literature are known to be minimax-optimal but are not practical: they invariably return the\nmaximizer of some density estimator (usually a kernel estimate) over the entire space R^d. Instead\nwe analyze the practical procedure of Figure 1 where we pick the maximizer of fk out of the \ufb01nite\nsample X[n]. The rates of Theorem 1 are optimal (O(n^{\u22121/(4+d)})) for a setting of k = O(n^{4/(4+d)}).\nTheorem 1. Let \u03b4 > 0. Assume f has a single mode x0 and satis\ufb01es Assumptions 1, 2. There exists\n$N_{x_0,\\delta}$ such that the following holds for $n \\ge N_{x_0,\\delta}$. Let $\\hat{C}_{x_0}, \\check{C}_{x_0}$ be as in De\ufb01nition 4. 
Suppose k\nsatis\ufb01es(cid:32)\n\n(cid:33)4d/(4+d)\n\n(cid:33)2\n\n(cid:32)\n\n(cid:115)\n\nf (x0)(2d+4)/(4+d)(cid:16) vd\n\n(cid:17)4/(4+d)\n\n.\n\n(2)\n\n24C\u03b4,nf (x0)\n\n\u02c7Cx0r2\nx0\n\n\u2264 k \u2264\n\n1\n2\n\nC\u03b4,n\n\u02c6Cx0\n\nn\n\n4\n\n(cid:115)\n\nLet x be the mode returned in the procedure of Figure 1. With probability at least 1 \u2212 2\u03b4 we have\n\n(cid:107)x \u2212 x0(cid:107) \u2264 5\n\nC\u03b4,n\n\u02c7Cx0\n\nf (x0) \u00b7 1\nk1/4\n\n.\n\nProof. Let rx0 be the critical radius of De\ufb01nition 4. Let rn(x0) \u2261 inf(cid:8)r : B(x0, r) \u2229 X[n] (cid:54)= \u2205(cid:9).\n\nLet 0 < \u03c4 < 1 to be later speci\ufb01ed, and assume the event that rn(x0) \u2264 \u03c4\n2 rx0. We will bound the\nprobability of this event once the proper setting of \u03c4 becomes clear.\nConsider \u02dcr satisfying rx0 \u2265 \u02dcr \u2265 2rn(x0)/\u03c4 (see Figure 2). We will \ufb01rst upper bound fk for any x\noutside B(x0, \u02dcr), then lower-bound fk for x \u2208 B(x0, rn(x0)).\nRecall Ax0 from De\ufb01nition 4. By equation (1) we have\n\nsup\n\nx\u2208Ax0\\B(x0,\u02dcr/2)\n\nf (x) \u2264 f (x0) \u2212 \u02c7Cx0(\u02dcr/2)2 (cid:44) \u02c6F .\n\n(3)\n\nThe above allows us to apply Lemma 3 as follows. First note that for any x \u2208 X\\B(x0, \u02dcr/2), f (x) \u2264\n\u02c6F since Ax0 is a level set of the unimodal f, i.e. supx /\u2208Ax0\nf (x). Therefore, for\n= \u02c6F \u2212 f (x). 
By equation (3) the modulus of continuity \u02c6r(\u0001, x) is at least\nany x \u2208 X \\ B(x0, \u02dcr) let \u0001\n.\n\nf (x) \u2264 inf x\u2208Ax0\n\n6\n\n\fInitialize: Mn \u2190 \u2205.\nFor \u03bb = maxx\u2208X[n] fn(x) down to 0:\n\n\u221a\nk.\n\n(cid:110) \u02dcAi\n\n\u2022 Let \u0001\u03bb (cid:44) \u03bb \u00b7 C\u03b4,n/\n\u2022 Let\n\n(cid:111)m\n\u2022 Mn \u2190 Mn \u222a(cid:110)\n\ni=1\n\nReturn the estimated modes Mn.\n\nbe the CCs of G (\u03bb \u2212 \u0001\u03bb \u2212 \u02dc\u0001) disjoint from Mn.\n\nxi (cid:44) arg maxx\u2208 \u02dcAi\u2229X\u03bb\n\n[n]\n\nfn(x)\n\n(cid:111)m\n\n.\n\ni=1\n\nFigure 3: Estimate the modes of a multimodal f from X[n]. The parameter \u02dc\u0001 serves to prune.\n\n\u02dcr/2. Therefore, if k satis\ufb01es\n\nvd \u00b7 (\u02dcr/2)d \u00b7(cid:0)f (x0) \u2212 \u02c7Cx0 (\u02dcr/2)2(cid:1) \u2265 k\n\nwe have with probability at least 1 \u2212 \u03b4\n\n(cid:18)\n\n\u2212 C\u03b4,n\n\n\u221a\n\nk\nn\n\n,\n\nn\n\n(cid:19)(cid:0)f (x0) \u2212 \u02c7Cx0 (\u02dcr/2)2(cid:1) .\n\nsup\n\nx\u2208X\\B(x0,\u02dcr)\n\nfk(x) <\n\n1 + 2\n\nC\u03b4,n\u221a\nk\n\n(4)\n\n(5)\n\nNow we turn to x \u2208 B(x0, rn(x0)). We have again by equation (1) that inf x\u2208B(x,\u03c4 \u02dcr) f (x) \u2265\nf (x0) \u2212 \u02c6Cx0(\u03c4 \u02dcr)2 (cid:44) \u02c7F . Therefore, for x \u2208 B(x0, rn(x0)) let \u0001 = f (x) \u2212 \u02c7F , we have \u02c7r(\u0001, x) \u2265\n\u03c4 \u02dcr \u2212 rn(x0) \u2265 \u03c4 \u02dcr/2. 
It follows that, if k satis\ufb01es\n\n(6)\nwe have by Lemma 4 that, with probability at least 1 \u2212 \u03b4 (under the same event used in Lemma 3)\n\n+ C\u03b4,n\n\nn\n\n,\n\nvd \u00b7 ((\u03c4 /2)\u02dcr)d \u00b7(cid:16)\n\nf (x0) \u2212 \u02c6Cx0 (\u03c4 \u02dcr)2(cid:17) \u2265 k\n(cid:18)\n\n(cid:19)(cid:16)\n\nf (x0) \u2212 \u02c6Cx0 (\u03c4 \u02dcr)2(cid:17)\n\n\u221a\n\nk\nn\n\ninf\n\nx\u2208B(x,rn(x0))\n\nfk(x) \u2265\n\n1 \u2212 C\u03b4,n\u221a\nk\n\n.\n\n(7)\n\n\u221a\n\n(cid:17)\n\n24f (x0)C\u03b4,n/\n\n(cid:16) \u02c7Cx0\n\nNext, with a bit of algebra, we can pick \u03c4 and \u02dcr so that the l.h.s. of (5) is less than the l.h.s.\nof equation (7). It suf\ufb01ces to pick \u03c4 2 = \u02c7Cx0/8 \u02c6Cx0 and \u02dcr2 \u2265 24f (x0)C\u03b4,n/ \u02c7Cx0\nk. Given these\nsettings, equations (4) and (6) are satis\ufb01ed whenever k satis\ufb01es equation (2) of the lemma statement.\nIt follows that, with probability at least 1 \u2212 \u03b4, inf x\u2208B(x,rn(x0)) fk(x) > supx\u2208X\\B(x0,\u02dcr) fk(x).\nTherefore, the empirical mode chosen by the procedure is in B(x0, \u02c6r). We are free to choose \u02dcr as\nsmall as max\nWe\u2019ve assumed so far the event that rn(x0) \u2264 \u03c4\n\n(cid:26)(cid:114)\nfollows. Let r (cid:44)(cid:113)\non k imply that r \u2264 rx0, and that vd \u00b7 ((\u03c4 /2)r)d \u00b7(cid:16)\n\n2 rx0. We bound the probability of this event as\nk. Under the above setting of \u03c4, the Theorem\u2019s assumptions\n\u221a\nn . Again,\n\u221a\nby equation (1), this implies that F(B(x0, (\u03c4 /2)r)) \u2265 k\nn . By Lemma, 2, with probability\nat least 1 \u2212 \u03b4, Fn(B(x0, (\u03c4 /2)r)) \u2265 k/n and therefore rn(x0) \u2264 (\u03c4 /2)r \u2264 (\u03c4 /2)rx0. 
It now\nbecomes clear that we can just pick \u02dcr = r.\n\nf (x0) \u2212 \u02c6Cx0 ((\u03c4 /2)r)2(cid:17) \u2265 k\n\n24f (x0)C\u03b4,n/ \u02c7Cx0\n\n, 2rn(x0)/\u03c4\n\nn + C\u03b4,n\n\nn + C\u03b4,n\n\n(cid:27)\n\n\u221a\n\n\u221a\n\nk\n\n.\n\nk\n\nk\n\n5.2 Multiple modes\n\nIn this section we turn to the problem of estimating the modes of a more general density f with an\nunknown number of modes.\nThe algorithm of Figure 3 operates on the following set of nested graphs G(\u03bb). These are subgraphs\nof a mutual k-NN graph on the sample X[n], where vertices are connected if they are in each other\u2019s\nnearest neighbor sets. The connected components (CCs) of these graphs G(\u03bb) are known to be good\nestimates of the CCs of corresponding level sets of the unknown density f [9, 26, 27].\n\n7\n\n\f(cid:44)(cid:8)x \u2208 X[n] : fn(x) \u2265 \u03bb(cid:9) , and where vertices x, x(cid:48) are connected by an edge when and only\nDe\ufb01nition 5 (k-NN level set G(\u03bb)). Given \u03bb \u2208 R, let G(\u03bb) denote the graph with vertices in\nX \u03bb\nwhen (cid:107)x \u2212 x(cid:48)(cid:107) \u2264 \u03b1 \u00b7 min{rk(x), rk(x(cid:48))}, for some \u03b1 \u2265 \u221a\n[n]\nWe will show that for a given n, any suf\ufb01ciently salient mode is optimally recovered; furthermore,\nif f is uniformly continuous on Rd, then the procedure returns no false mode above a level \u03bbn \u2192 0.\n\n2.\n\n5.2.1 Optimal Recovery for Any Mode\n\nThe guarantees of this section would be given in terms of salient modes as de\ufb01ned below. Essentially\na mode x0 is salient if it is separated from other modes by a suf\ufb01ciently wide and deep valley.\nWe de\ufb01ne saliency in a way similar to [9], but simpler: we only require a wide valley since the\nsmoothness of f at the mode (as expressed in equation 1) takes care of the depth.\nWe start with a notion of separation between sets inspired from [26].\nDe\ufb01nition 6 (r-separation). 
A, A(cid:48) \u2282 X are r-separated if there exists a (separating) set S \u2282 Rd\nsuch that: every path from A to A(cid:48) crosses S, and supx\u2208S+B(0,r) f (x) < inf x\u2208A\u222aA(cid:48) f (x).\nOur notion of mode saliency follows: for a mode x, we require the critical set Ax of De\ufb01nition 4 to\nbe well separated from all components at the level where it appears.\nDe\ufb01nition 7 (r-salient Modes). A mode x of f is said to be r-salient for r > 0 if the following\nholds. There exist Ax as in De\ufb01nition 4 (with the corresponding rx, \u02c6Cx and \u02c7Cx), which is a CC of\nsay X \u03bbx (cid:44) {x \u2208 X : f (x) \u2265 \u03bbx}. Ax is r-separated from X \u03bbx \\ Ax.\nThe next theorem again yields the optimal rates O(n\u22121/(4+d)) for k = O(n4/(4+d)).\nTheorem 2 (Recovery of salient modes). Assume f satis\ufb01es Assumptions 1, 2. Suppose \u02dc\u0001 =\n\u02dc\u0001(n) n\u2192\u221e\u2212\u2212\u2212\u2212\u2192 0. Let x0 be an r-salient mode for some r > 0. Assume k = \u2126\n. Then\nthere exist N = N (x0,{\u02dc\u0001(n)}) depending on x0 and \u02dc\u0001(n) such that the following holds for n \u2265 N.\nLet Ax0 , \u02c6Cx0, \u02c7Cx0 be as in De\ufb01nition 4, and let \u03bbx0\nf (x). Let \u03b4 > 0. Suppose k further\nsatis\ufb01es(cid:32)\n\n(cid:44) inf x\u2208Ax0\n(cid:33)4d/(4+d)\n\n(cid:32)\n\n(cid:115)\n\nC 2\n\n\u03b4,n\n\n(cid:17)\n\n(cid:16)\n\n(cid:16) vd\n\n(cid:17)4/(4+d)\n\n/4, (r/\u03b1)2(cid:9)(cid:33)2\n\n\u02c7Cx0 min(cid:8)r2\n\nx0\n\n24C\u03b4,nf (x0)\n\n\u03bb(2d+4)/(4+d)\nx0\n\nn\n\n4\n\n.\n\nLet Mn be the modes returned by the procedure of Figure 3. With probability at least 1 \u2212 2\u03b4, there\nexists x \u2208 Mn such that\n\n\u2264 k \u2264\n\nC\u03b4,n\n\u02c6Cx0\n\n1\n2\n\n(cid:115)\n\n(cid:107)x \u2212 x0(cid:107) \u2264 5\n\nC\u03b4,n\n\u02c7Cx0\n\nf (x0) \u00b7 1\nk1/4\n\n.\n\n5.2.2 Pruning guarantees\n\nThe proof of the main theorem of this section is based on Lemma 7.4 of [24].\nTheorem 3. 
Let $\\Lambda \\triangleq \\sup_x f(x)$ and $r(\\epsilon) \\triangleq \\sup_{x \\in \\mathbb{R}^d} \\max\\{\\hat{r}(\\epsilon, x), \\check{r}(\\epsilon, x)\\}$. Assume f satis\ufb01es\nAssumption 2. Suppose $r(\\tilde{\\epsilon}) = \\Omega\\left((k/n)^{1/d}\\right)$, which is feasible whenever f is uniformly continuous\non R^d. In particular, if f is H\u00f6lder continuous, i.e.\n\n$$\\forall x, x' \\in \\mathbb{R}^d, \\quad |f(x) - f(x')| \\le L\\|x - x'\\|^\\beta, \\quad \\text{for some } L > 0,\\ 0 < \\beta \\le 1,$$\n\nthen we can just let $\\tilde{\\epsilon} = \\Omega\\left((k/n)^{\\beta/d}\\right)$ since $r(\\tilde{\\epsilon}) \\ge (\\tilde{\\epsilon}/L)^{1/\\beta}$. De\ufb01ne\n\n$$\\lambda_0 = \\max\\left\\{ 2\\tilde{\\epsilon},\\ 8\\frac{\\Lambda}{k} C_{\\delta,n}^2,\\ \\frac{2}{v_d\\, r(\\tilde{\\epsilon})^d}\\left(\\frac{k}{n} + C_{\\delta,n}\\frac{\\sqrt{k}}{n}\\right) \\right\\}.$$\n\nAssume $k \\ge 9C_{\\delta,n}^2$. The following holds with probability at least 1 \u2212 \u03b4. Pick any $\\lambda \\ge 2\\lambda_0$, and\nlet $\\lambda_f = \\inf_{x \\in X^\\lambda_{[n]}} f(x)$. All estimated modes in $M_n \\cap X^\\lambda_{[n]}$ can be assigned to distinct modes in\n$M \\cap \\mathcal{X}^{\\lambda_f}$.\n\nReferences\n[1] Don O Loftsgaarden, Charles P Quesenberry, et al. A nonparametric estimate of a multivariate density\n\nfunction. The Annals of Mathematical Statistics, 36(3):1049\u20131051, 1965.\n\n[2] M. Maier, M. Hein, and U. von Luxburg. Optimal construction of k-nearest neighbor graphs for identify-\n\ning noisy clusters. Theoretical Computer Science, 410:1749\u20131764, 2009.\n\n[3] Emanuel Parzen et al. On estimation of a probability density function and mode. Annals of mathematical\n\nstatistics, 33(3):1065\u20131076, 1962.\n\n[4] Yizong Cheng. Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence,\n\nIEEE Transactions on, 17(8):790\u2013799, 1995.\n\n[5] Fr\u00e9d\u00e9ric Chazal, Leonidas J Guibas, Steve Y Oudot, and Primoz Skraba. Persistence-based clustering in\n\nRiemannian manifolds. 
Journal of the ACM (JACM), 60(6):41, 2013.\n\n[6] David S Moore and James W Yackel. Large sample properties of nearest neighbor density function estimators. Technical report, DTIC Document, 1976.\n\n[7] L.P. Devroye and T.J. Wagner. The strong uniform consistency of nearest neighbor density estimates. The Annals of Statistics, 5:536\u2013540, 1977.\n\n[8] G\u00e9rard Biau, Fr\u00e9d\u00e9ric Chazal, David Cohen-Steiner, Luc Devroye, Carlos Rodriguez, et al. A weighted k-nearest neighbor density estimate for geometric inference. Electronic Journal of Statistics, 5:204\u2013237, 2011.\n\n[9] S. Kpotufe and U. von Luxburg. Pruning nearest neighbor cluster trees. In International Conference on Machine Learning, 2011.\n\n[10] Herman Chernoff. Estimation of the mode. Annals of the Institute of Statistical Mathematics, 16(1):31\u201341, 1964.\n\n[11] William F Eddy et al. Optimum kernel estimators of the mode. The Annals of Statistics, 8(4):870\u2013882, 1980.\n\n[12] Birgit Grund, Peter Hall, et al. On the minimisation of L^p error in mode estimation. The Annals of Statistics, 23(6):2264\u20132284, 1995.\n\n[13] Jussi Klemel\u00e4. Adaptive estimation of the mode of a multivariate density. Journal of Nonparametric Statistics, 17(1):83\u2013105, 2005.\n\n[14] Aleksandr Borisovich Tsybakov. Recursive estimation of the mode of a multivariate distribution. Problemy Peredachi Informatsii, 26(1):38\u201345, 1990.\n\n[15] David L Donoho and Richard C Liu. Geometrizing rates of convergence, III. The Annals of Statistics, pages 668\u2013701, 1991.\n\n[16] Luc Devroye. Recursive estimation of the mode of a multivariate density. Canadian Journal of Statistics, 7(2):159\u2013167, 1979.\n\n[17] Ulf Grenander et al. Some direct estimates of the mode. The Annals of Mathematical Statistics, 36(1):131\u2013138, 1965.\n\n[18] Christophe Abraham, G\u00e9rard Biau, and Beno\u00eet Cadre. 
On the asymptotic properties of a simple estimate of the mode. ESAIM: Probability and Statistics, 8:1\u201311, 2004.\n\n[19] Keinosuke Fukunaga and Larry Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 21(1):32\u201340, 1975.\n\n[20] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603\u2013619, 2002.\n\n[21] Jia Li, Surajit Ray, and Bruce G Lindsay. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(8), 2007.\n\n[22] Ery Arias-Castro, David Mason, and Bruno Pelletier. On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. Unpublished Manuscript, 2013.\n\n[23] Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Nonparametric inference for density modes. arXiv preprint arXiv:1312.7567, 2013.\n\n[24] K. Chaudhuri, S. Dasgupta, S. Kpotufe, and U. von Luxburg. Consistent procedures for cluster tree estimation and pruning. arXiv, 2014.\n\n[25] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intelligence, 3176:169\u2013207, 2004.\n\n[26] K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, 2010.\n\n[27] S. Balakrishnan, S. Narayanan, A. Rinaldo, A. Singh, and L. Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, pages 2679\u20132687, 2013.", "award": [], "sourceid": 1324, "authors": [{"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}, {"given_name": "Samory", "family_name": "Kpotufe", "institution": "Princeton University"}]}