On the Consistency of Quick Shift

Advances in Neural Information Processing Systems, pp. 46–55

Heinrich Jiang
Google Inc.
1600 Amphitheatre Parkway, Mountain View, CA 94043
heinrich.jiang@gmail.com

Abstract

Quick Shift is a popular mode-seeking and clustering algorithm. We present finite sample statistical consistency guarantees for Quick Shift on mode and cluster recovery under mild distributional assumptions. We then apply our results to construct a consistent modal regression algorithm.

1 Introduction

Quick Shift [16] is a clustering and mode-seeking procedure that has received much attention in computer vision and related areas. It is simple and proceeds as follows: it moves each sample to its closest sample with a higher empirical density, if one exists within a τ-radius ball, where the empirical density is taken to be the kernel density estimator (KDE). The output of the procedure can thus be seen as a graph whose vertices are the sample points, with a directed edge from each sample to its next point if one exists. Furthermore, Quick Shift partitions the samples into trees, which can be taken as the final clusters, and the root of each such tree is an estimate of a local maximum.
Quick Shift was designed as an alternative to the better-known mean-shift procedure [4, 5]. Mean-shift performs a gradient ascent of the KDE starting at each sample until ε-convergence.
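Both procedures hill-climb the same empirical density. As a concrete point of reference, a Gaussian-kernel KDE (one kernel satisfying the regularity assumptions introduced later) can be sketched as follows; this is an illustrative snippet, not the paper's code:

```python
import numpy as np

def kde(x, samples, h):
    """Gaussian-kernel KDE: f_h(x) = 1/(n h^d) * sum_i K((x - X_i)/h).

    `samples` is an (n, d) array of draws from the unknown density;
    `h` is the bandwidth. The Gaussian kernel is one choice that is
    spherically symmetric, non-increasing, and exponentially decaying.
    """
    n, d = samples.shape
    # Scaled distances |(x - X_i)/h| for every sample
    u = np.linalg.norm((x - samples) / h, axis=1)
    # Gaussian kernel evaluated at each scaled distance
    K = np.exp(-0.5 * u ** 2) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h ** d)
```

With two coincident 1-d samples at the origin and h = 1, the estimate at 0 equals the standard Gaussian peak value 1/√(2π).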
The samples whose ascents converge to the same point are placed in the same cluster, and the points of convergence are taken as the estimates of the modes. Both procedures aim at clustering the data points by incrementally hill-climbing to a mode of the underlying density. Two key differences are that Quick Shift restricts its steps to sample points and has the extra τ parameter. In this paper, we show that Quick Shift can, surprisingly, attain strong statistical guarantees without the second-order density assumptions required to analyze mean-shift.
We prove that Quick Shift recovers the modes of an arbitrary multimodal density at a minimax optimal rate under mild nonparametric assumptions. This provides an alternative to known procedures with similar statistical guarantees; however, such procedures only recover the modes and fail to tell us how to assign the sample points to a mode, which is critical for clustering. Quick Shift, on the other hand, recovers both the modes and the clustering assignments with statistical consistency guarantees. Moreover, Quick Shift's ability to do all of this has been extensively validated in practice.
A unique feature of Quick Shift is its segmentation parameter τ, which allows practitioners to merge the clusters corresponding to certain less salient modes of the distribution. In other words, if a local mode is not the maximizer of its τ-radius neighborhood, then its corresponding cluster is merged into that of another mode. Current consistent mode-seeking procedures [6, 12] do not allow one to control such segmentation. We give guarantees on what Quick Shift returns for an arbitrary setting of τ.
We show that Quick Shift can also be used to recover the cluster tree. In cluster tree estimation, the known procedures with the strongest statistical consistency guarantees include Robust Single Linkage (RSL) [2] and its variants, e.g. [13, 7].
We show that Quick Shift attains similar guarantees.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Thus, Quick Shift, a simple and already popular procedure, can simultaneously recover the modes with segmentation tuning, provide clustering assignments to the appropriate mode, and estimate the cluster tree of an unknown density f, all with strong consistency guarantees. No other procedure has been shown to have these properties.
We then use Quick Shift to solve the modal regression problem [3], which involves estimating the modes of the conditional density f(y|X) rather than the mean as in classical regression. Traditional approaches use a modified version of mean-shift. We provide an alternative using Quick Shift which has precise statistical consistency guarantees under much milder assumptions.

Figure 1: Quick Shift example. Left: τ = ∞. The procedure returns one tree, whose head is the sample with highest empirical density. Right: τ set to a lower value. The edges with length greater than τ are no longer present when compared to the left. We are left with three clusters.

2 Assumptions and Supporting Results

Algorithm 1 Quick Shift
Input: Samples X_[n] := {x_1, ..., x_n}, KDE bandwidth h, segmentation parameter τ > 0.
Initialize directed graph G with vertices {x_1, ..., x_n} and no edges.
for i = 1 to n do
  if there exists x ∈ X_[n] such that f̂_h(x) > f̂_h(x_i) and ‖x − x_i‖ ≤ τ then
    Add to G a directed edge from x_i to argmin_{x_j ∈ X_[n] : f̂_h(x_j) > f̂_h(x_i)} ‖x_i − x_j‖.
  end if
end for
return G.

2.1 Setup

Let X_[n] = {x_1, ..., x_n} be n i.i.d. samples drawn from a distribution F with density f w.r.t. the uniform measure on R^d.
Assumption 1 (Hölder Density). f is Hölder continuous on compact support X ⊆ R^d, i.e.
|f(x) − f(x′)| ≤ C_α ‖x − x′‖^α for all x, x′ ∈ X and some 0 < α ≤ 1 and C_α > 0.

Definition 1 (Level Set). The λ level set of f is defined as L_f(λ) := {x ∈ X : f(x) ≥ λ}.

Definition 2 (Hausdorff Distance). d_H(A, A′) = max{sup_{x∈A} d(x, A′), sup_{x∈A′} d(x, A)}, where d(x, A) := inf_{x′∈A} ‖x − x′‖.

The next assumption says that the level sets are continuous w.r.t. the level in the following sense, where we denote the ε-interior of A as A⊖ε := {x ∈ A : inf_{y∈∂A} d(x, y) ≥ ε} (∂A is the boundary of A):

Assumption 2 (Uniform Continuity of Level Sets). For each ε > 0, there exists δ > 0 such that for 0 < λ ≤ λ′ ≤ ‖f‖_∞ with |λ − λ′| < δ, we have L_f(λ)⊖ε ⊆ L_f(λ′).

Remark 1. Procedures that try to incrementally move points to nearby areas of higher density will have difficulties in regions where there is little or no change in density. The above assumption is a simple and mild formulation which ensures there are no such flat regions.

Remark 2. Note that our assumptions are quite mild when compared to analyses of similar procedures like mean-shift, which require at least second-order smoothness assumptions. Interestingly, we only require Hölder continuity.

2.2 KDE Bounds

We next give uniform bounds on the KDE required to analyze Quick Shift.

Definition 3. Define the kernel function K : R^d → R_{≥0}, where R_{≥0} denotes the non-negative real numbers, such that ∫_{R^d} K(u) du = 1.

We make the following mild regularity assumptions on K.
Assumption 3.
(Spherically symmetric, non-increasing, and exponentially decaying) There exists a non-increasing function k : R_{≥0} → R_{≥0} such that K(u) = k(|u|) for u ∈ R^d, and there exist ρ, C_ρ, t_0 > 0 such that for t > t_0, k(t) ≤ C_ρ · exp(−t^ρ).

Remark 3. These assumptions allow the popular kernels such as Gaussian, exponential, Silverman, uniform, triangular, tricube, cosine, and Epanechnikov.

Definition 4 (Kernel Density Estimator). Given a kernel K and bandwidth h > 0, the KDE is defined by

  f̂_h(x) = (1 / (n · h^d)) · Σ_{i=1}^{n} K((x − X_i) / h).

Here we provide the uniform KDE bound which will be used for our analysis, established in [11].

Lemma 1 (ℓ_∞ bound for α-Hölder continuous functions; Theorem 2 of [11]). There exists a positive constant C′ depending on f and K such that the following holds with probability at least 1 − 1/n, uniformly in h > (log n / n)^{1/d}:

  sup_{x∈R^d} |f̂_h(x) − f(x)| < C′ · (h^α + √(log n / (n · h^d))).

3 Mode Estimation

In this section, we give guarantees about the local modes returned by Quick Shift. We make the additional assumption that the modes are local maxima with negative-definite Hessian.

Assumption 4 (Modes). A local maximum of f is a connected region M such that the density is constant on M and decays around its boundary. Assume that each local maximum of f is a point, which we call a mode. Let M be the set of modes of f, where M is a finite set. Further, let f be twice differentiable in a neighborhood of each x ∈ M, let f have a negative-definite Hessian at each x ∈ M, and let those neighborhoods be disjoint.

This assumption leads to the following.

Lemma 2 (Lemma 5 of [6]). Let f satisfy Assumption 4.
There exist r_M, Č, Ĉ > 0 such that the following holds for all x_0 ∈ M simultaneously:

  Č · |x_0 − x|² ≤ f(x_0) − f(x) ≤ Ĉ · |x_0 − x|²,

for all x ∈ A_{x_0}, where A_{x_0} is the connected component of {x : f(x) ≥ inf_{x′∈B(x_0, r_M)} f(x′)} which contains x_0 and does not intersect the other modes.

The next assumption ensures that the level sets do not become arbitrarily thin as long as we are sufficiently far away from the modes.

Assumption 5 (Level Set Regularity). For each σ, r > 0, there exists η > 0 such that the following holds for all connected components A of L_f(λ) with λ > 0 and A ⊄ ∪_{x_0∈M} B(x_0, r): if x lies on the boundary of A, then Vol(B(x, σ) ∩ A) > η, where Vol denotes volume w.r.t. the uniform measure on R^d.

We next give the mode recovery result for Quick Shift. It says that as long as τ is small enough, then as the number of samples grows, the roots of the trees returned by Quick Shift bijectively correspond to the true modes of f, and the estimation errors match the lower bounds established by Tsybakov [15] up to logarithmic factors. We defer the proof to Theorem 2, which is a generalization of the following result.

Theorem 1 (Mode estimation guarantees for Quick Shift). Let τ < r_M/2 and let Assumptions 1, 2, 3, 4, and 5 hold. Choose h such that (log n)^{2/ρ} · h → 0 and log n/(nh^d) → 0 as n → ∞. Let M̂ be the heads of the trees in G (returned by Algorithm 1). There exists a constant C depending on f and K such that for n sufficiently large, with probability at least 1 − 1/n,

  d_H(M, M̂)² < C · ((log n)^{4/ρ} h² + √(log n / (n · h^d))),

and |M| = |M̂|. In particular, taking h ≈ n^{−1/(4+d)} optimizes the above rate to d_H(M, M̂) = Õ(n^{−1/(4+d)}).
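The linking rule of Algorithm 1, which produces the graph G whose tree roots M̂ appear in Theorem 1, can be sketched in a few lines. This is an illustrative reimplementation under the assumption that the KDE values at the samples are precomputed, not the paper's reference code:

```python
import numpy as np

def quick_shift(X, density, tau):
    """Sketch of Quick Shift's linking rule: connect each sample to its
    nearest sample of strictly higher estimated density within distance
    tau. `X` is an (n, d) array, `density[i]` the KDE value at X[i].
    Returns parent[i] = index of the linked sample, or -1 for tree roots
    (the mode estimates)."""
    n = len(X)
    parent = np.full(n, -1)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        # Candidates: strictly higher density, inside the tau-radius ball
        mask = (density > density[i]) & (dists <= tau)
        if mask.any():
            cand = np.where(mask)[0]
            parent[i] = cand[np.argmin(dists[cand])]
    return parent
```

Following `parent` pointers to a root gives the cluster assignment of each sample; shrinking `tau` breaks long edges and yields more trees, exactly the segmentation behavior of Figure 1.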
This matches the minimax optimal rate for mode estimation up to logarithmic factors.
We now give a stronger notion of mode that fits better for analyzing the role of τ. In the last result, it was assumed that the practitioner wished to recover exactly the modes of the density f by taking τ sufficiently small. Now, we analyze the case where τ is intentionally set to a particular value so that Quick Shift produces segmentations that merge modes which are in close proximity to higher density regions.

Definition 5. A mode x_0 ∈ M is an (r, δ)+-mode if f(x_0) > f(x) + δ for all x ∈ B(x_0, r) \ B(x_0, r_M). A mode x_0 ∈ M is an (r, δ)−-mode if f(x_0) < f(x) − δ for some x ∈ B(x_0, r). Let M+_{r,δ} ⊆ M and M−_{r,δ} ⊆ M denote the sets of (r, δ)+-modes and (r, δ)−-modes of f, respectively.

In other words, an (r, δ)+-mode is a mode that is also the maximizer of the larger ball of radius r, by a margin of at least δ, outside of the region of quadratic decay and smoothness (B(x_0, r_M)). An (r, δ)−-mode is a mode that is not the maximizer of its radius-r ball, by a margin of at least δ.
The next result shows that Algorithm 1 recovers the (τ+ε, δ)+-modes of f and excludes the (τ−ε, δ)−-modes of f. The proof is in the appendix.

Theorem 2 (Generalization of Theorem 1). Let δ, ε > 0 and suppose Assumptions 1, 2, 3, 4, and 5 hold. Let h ≡ h(n) be chosen such that h → 0 and log n/(nh^d) → 0 as n → ∞. Then there exists C > 0 depending on f and K such that the following holds for n sufficiently large with probability at least 1 − 1/n. For each x_0 ∈ M+
For each x0 \u2208 M+\n\n\u03c4\u2212\u0001,\u03b4, there exists unique \u02c6x \u2208 (cid:99)M such that\n\n(cid:32)\n\u03c4 +\u0001,\u03b4\\M\u2212\n\n||x0 \u2212 \u02c6x||2 < C\n\n(log n)4/\u03c1h2 +\n\n(cid:114)\n\n(cid:33)\n\n.\n\nlog n\nn \u00b7 hd\n\nMoreover, |(cid:99)M| \u2264 |M| \u2212 |M\u2212\n\n\u03c4\u2212\u0001,\u03b4|.\n\nIn particular, taking \u0001 \u2192 0 and \u03b4 \u2192 0 gives us an exact characterization of the asymptotic behavior\nof Quick Shift in terms of mode recovery.\n\n4 Assignment of Points to Modes\n\nIn this section, we give guarantees on how the points are assigned to their respective modes. We\n\ufb01rst give the following de\ufb01nition which formalizes how two points are separated by a wide and deep\nvalley.\nDe\ufb01nition 6. x1, x2 \u2208 X are (rs, \u03b4)-separated if there exists a set S such that every path from x1\nand x2 intersects with S and\n\nsup\n\nx\u2208S+B(0,rs)\n\nf (x) <\n\ninf\n\nx\u2208B(x1,rs)\u222aB(x2,rs)\n\nf (x) \u2212 \u03b4.\n\nLemma 3. Suppose Assumptions 1, 2, 3, 4, and 5 hold. Let \u03c4 < rs/2 and choose h such that\n(log n)2/\u03c1 \u00b7 h \u2192 0 and log n/(nhd) \u2192 0 as n \u2192 \u221e. Let G be the output of Algorithm 1. The\nfollowing holds with probability at least 1 \u2212 1/n for n suf\ufb01ciently large depending on f, K, \u03b4, and \u03c4\nuniformly in all x1, x2 \u2208 X . If x1 and x2 are (rs, \u03b4)-separated, then there cannot exist a directed\npath from x1 to x2 in G.\n\nProof. Suppose that x1 and x2 are (rs, \u03b4)-separated (with respect to set S) and there exists a\ndirected path from x1 to x2 in G. Given our choice of \u03c4, there exists some point x \u2208 G such that\nx \u2208 S+B(0, rs) and x is on the path from x1 to x2. We have f (x) < f (x1)\u2212\u03b4. Choose n suf\ufb01ciently\n\nlarge such that by Lemma 1, supx\u2208X |(cid:98)fh(x) \u2212 f (x)| < \u03b4/2. 
Thus, we have f̂_h(x) < f̂_h(x_1), which means a directed path in G starting from x_1 cannot contain x, a contradiction. The result follows.

Figure 2: Illustration of (r_s, δ)-separation in 1 dimension. Here A and B are (r_s, δ)-separated by S. This is because the minimum density level of the r_s-radius balls around A and B (the red dotted line) exceeds the maximum density level of the r_s-radius ball around S by at least δ (golden dotted line). In other words, there exists a sufficiently wide (controlled by r_s and S) and deep (controlled by δ) valley separating A and B. The results in this section show that in such cases, these pairs of points will not be assigned to the same cluster.

This leads to the following consequence about how samples are assigned to their respective modes.

Theorem 3. Assume the same conditions as Lemma 3. The following holds with probability at least 1 − 1/n for n sufficiently large depending on f, K, δ, and τ, uniformly over x ∈ X and x_0 ∈ M: if x and x_0 are (r_s, δ)-separated, then x will not be assigned to the tree corresponding to x_0 from Theorem 1.

Remark 4. In particular, taking δ → 0 and r_s → 0 gives us guarantees for all points which have a unique mode to which they can be assigned.

We now give a more general version of (r_s, δ)-separation, in which the condition holds if every path between the two points dips down at some point. The same results as above extend to this definition in a straightforward manner.

Definition 7. x_1, x_2 ∈ X are (r_s, δ)-weakly-separated if there exists a set S, with x_1, x_2 ∉ S + B(0, r_s), such that every path P from x_1 to x_2 satisfies the following:
(1) P ∩ S ≠ ∅, and (2)

  sup_{x ∈ P ∩ (S + B(0, r_s))} f(x) < inf_{x ∈ B(x′_1, r_s) ∪ B(x′_2, r_s)} f(x) − δ,

where x′_1, x′_2 are defined as follows. Let P_1 be the path obtained by starting at x_1 and following P until it intersects S, and P_2 be the path obtained by following P starting from the last time it intersects S until the end. Then x′_1 and x′_2 are the points which respectively attain the highest values of f on P_1 and P_2.

5 Cluster Tree Recovery

The connected components of the level sets, as the density level varies, form a hierarchical structure known as the cluster tree.

Definition 8 (Cluster Tree). The cluster tree of f is given by

  C_f(λ) := connected components of {x ∈ X : f(x) ≥ λ}.

Definition 9. Let G(λ) be the subgraph of G with vertices x ∈ X_[n] such that f̂_h(x) > λ and edges between pairs of vertices which have corresponding edges in G. Let G̃(λ) be the sets of vertices corresponding to the connected components of G(λ).

Definition 10. Suppose that A is a collection of sets of points in R^d.
Then define Link(A, δ) to be the result of repeatedly removing pairs A_1, A_2 ∈ A (A_1 ≠ A_2) from A that satisfy inf_{a_1∈A_1} inf_{a_2∈A_2} ‖a_1 − a_2‖ < δ and adding A_1 ∪ A_2 to A, until no such pairs exist.

Parameter settings for Algorithm 2: Suppose that τ ≡ τ(n) is chosen as a function of n such that τ → 0 as n → ∞ and τ(n) ≥ (log² n / n)^{1/d}, and h ≡ h(n) is chosen such that h → 0 and log n/(nh^d) → 0 as n → ∞.

Algorithm 2 Quick Shift Cluster Tree Estimator
Input: Samples X_[n] := {X_1, ..., X_n}, KDE bandwidth h, segmentation parameter τ > 0.
Let G be the output of Quick Shift (Algorithm 1) with the above parameters.
For λ > 0, let Ĉ_f(λ) := Link(G̃(λ), τ).
return Ĉ_f.

The following is the main result of this section; the proof is in the appendix.

Theorem 4 (Consistency). Algorithm 2 converges in probability to the true cluster tree of f under merge distortion (defined in [7]).

Remark 5. By combining the result of this section with the mode estimation result, we obtain the following interpretation. For any level λ, a component in G(λ) estimates a connected component of the λ-level set of f, and further, the trees within that component of G(λ) are in one-to-one correspondence with the modes in that connected component.

Figure 3: Illustration on a 1-dimensional density with three modes A, B, and C. When restricting Quick Shift's output to samples that have empirical density above a certain threshold and connecting nearby clusters, we approximate the connected components of the true density level set. Moreover, we give guarantees that such points will be assigned to clusters which correspond to modes within their connected component.

6 Modal Regression

Suppose that we have a joint density f(X, y) on R^d × R w.r.t. the Lebesgue measure.
In modal regression, we are interested in estimating the modes of the conditional density f(y|X = x) given samples from the joint distribution.

Algorithm 3 Quick Shift Modal Regression
Input: Samples D := {(x_1, y_1), ..., (x_n, y_n)}, bandwidth h, τ > 0, and x ∈ X.
Let Y = {y_1, ..., y_n} and let f̂_h be the KDE computed w.r.t. D.
Initialize directed graph G with vertices Y and no edges.
for i = 1 to n do
  if there exists y_j ∈ [y_i − τ, y_i + τ] ∩ Y such that f̂_h(x, y_j) > f̂_h(x, y_i) then
    Add to G a directed edge from y_i to argmin_{y_j ∈ Y : f̂_h(x, y_j) > f̂_h(x, y_i)} |y_i − y_j|.
  end if
end for
return The roots of the trees of G as the estimates of the modes of f(y|X = x).

Theorem 5 (Consistency of Quick Shift Modal Regression). Suppose that τ ≡ τ(n) is chosen as a function of n such that τ → 0 as n → ∞ and τ(n) ≥ (log² n / n)^{1/d}, and h ≡ h(n) is chosen such that h → 0 and log n/(nh^{d+1}) → 0 as n → ∞. Let M_x be the modes of the conditional density f(y|X = x) and let M̂_x be the output of Algorithm 3. Then, with probability at least 1 − 1/n, uniformly in x such that f(y|X = x) and K satisfy Assumptions 1, 2, 3, 4, and 5,

  d_H(M_x, M̂_x) → 0 as n → ∞.

7 Related Works

Mode Estimation. Perhaps the most popular procedure for estimating the modes is mean-shift; however, it has proven quite difficult to analyze. Arias-Castro et al. [1] made much progress by utilizing dynamical systems theory to show that mean-shift's updates converge to the correct gradient ascent steps. The recent work of Dasgupta and Kpotufe [6] was the first to give a procedure which recovers the modes of a multimodal density with minimax optimal statistical guarantees.
They do this by using a top-down traversal of the density levels of a proximity graph, borrowing from work in cluster tree estimation. The procedure was shown to recover exactly the modes of the density at minimax optimal rates.
In this work, we showed that Quick Shift attains the same guarantees while being a simpler approach than known procedures that attain these guarantees [6, 12]. Moreover, unlike these procedures, Quick Shift also assigns the remaining samples to their appropriate modes. Furthermore, Quick Shift has a segmentation tuning parameter τ which allows us to merge the clusters of modes that are not maximal in their τ-radius neighborhoods into the clusters of other modes. This is useful because, in practice, one may not wish to pick up every single local maximum, especially when there are local maxima that can be grouped together by proximity. We formalized the segmentation of such modes and identified which modes get returned and which ones become merged into other modes' clusters by Quick Shift.
Cluster Tree Estimation. Work on cluster tree estimation has a long history. Some early work on density-based clustering by Hartigan [9] modeled the clusters of a density as the regions {x : f(x) ≥ λ} for some λ. This is called the density level set of f at level λ. The cluster tree of f is the hierarchy formed by the infinite collection of these clusters over all λ. Chaudhuri and Dasgupta [2] introduced Robust Single Linkage (RSL), which was the first cluster tree estimation procedure with precise statistical guarantees. Shortly after, Kpotufe and Luxburg [13] provided an estimator that removed false clusters using an extra pruning step. Interestingly, Quick Shift does not require such a pruning step, since points near cluster boundaries naturally get assigned to regions with higher density, and thus no spurious clusters are formed near these boundaries.
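The linking step used by Quick Shift's cluster tree estimator (the Link operation of Definition 10) amounts to single-linkage merging of components. A minimal sketch, under the assumption that each component is given as an array of points in R^d:

```python
import numpy as np

def link(components, delta):
    """Sketch of Link(A, delta): repeatedly merge any two components
    whose closest pair of points is within delta, until no such pair
    remains. Each component is an (m, d) array of points."""
    comps = [np.asarray(c, dtype=float) for c in components]
    merged = True
    while merged:
        merged = False
        for i in range(len(comps)):
            for j in range(i + 1, len(comps)):
                # Pairwise distances between the two components
                D = np.linalg.norm(comps[i][:, None] - comps[j][None, :], axis=2)
                if D.min() < delta:
                    comps[i] = np.vstack([comps[i], comps[j]])
                    del comps[j]
                    merged = True
                    break
            if merged:
                break
    return comps
```

Applied to the vertex sets G̃(λ) with delta = τ, this is the merge step of Algorithm 2 below.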
Sriperumbudur and Steinwart [14], Jiang [10], and Wang et al. [17] showed that the popular DBSCAN algorithm [8] also estimates these level sets. Eldridge et al. [7] introduced the merge distortion metric for cluster tree estimates, which provides a stronger notion of consistency. We use their framework to analyze Quick Shift and show that this simple estimator is consistent in merge distortion.

Figure 4: Density-based clusters discovered by the level-set model {x : f(x) ≥ λ} (e.g. DBSCAN) vs. Quick Shift on a one-dimensional density. Left two images: level sets for two density level settings. Unassigned regions are noise and have no cluster assignment. Right two images: Quick Shift with two different τ settings. The latter is a hill-climbing based clustering assignment.

Modal Regression. Nonparametric modal regression [3] is an alternative to classical regression, where we are interested in estimating the modes of the conditional density f(y|X = x) rather than the mean. Current approaches primarily use a modification of mean-shift; however, analyses of mean-shift require higher-order smoothness assumptions. Using Quick Shift for modal regression instead requires fewer regularity assumptions while retaining consistency guarantees.

8 Conclusion

We provided consistency guarantees for Quick Shift under mild assumptions. We showed that Quick Shift recovers the modes of a density from a finite sample with minimax optimal guarantees. The approach of this method is considerably different from known procedures that attain similar guarantees. Moreover, Quick Shift allows tuning of the segmentation, and we provided an analysis of this behavior. We also showed that Quick Shift can be used as an alternative for estimating the cluster tree, which contrasts with current approaches that utilize proximity graph sweeps.
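The modal regression variant (Algorithm 3) runs the same hill-climb over the responses at a fixed query point. A sketch, in which `joint_density` is a hypothetical stand-in for the joint KDE f̂_h(x, y):

```python
import numpy as np

def modal_regression(x, Y, joint_density, tau):
    """Sketch of Quick Shift modal regression at a query point x:
    a response y_i is a tree root (a mode estimate of f(y | X = x))
    when no other response within tau has higher estimated joint
    density at (x, y). `Y` is the array of observed responses."""
    dens = np.array([joint_density(x, y) for y in Y])
    roots = []
    for i, yi in enumerate(Y):
        gaps = np.abs(Y - yi)
        # Any higher-density response within the tau-interval?
        mask = (dens > dens[i]) & (gaps <= tau)
        if not mask.any():
            roots.append(yi)  # yi has no parent: mode estimate
    return roots
```

With a unimodal conditional density and a large enough tau, a single root is returned near the conditional mode.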
We then constructed a procedure for modal regression using Quick Shift which attains strong statistical guarantees.

Appendix

Mode Estimation Proofs

Lemma 4. Suppose Assumptions 1, 2, 3, 4, and 5 hold. Let r̄ > 0 and h ≡ h(n) be chosen such that h → 0 and log n/(nh^d) → 0 as n → ∞. Then the following holds for n sufficiently large with probability at least 1 − 1/n. Define

  r̃² := max{ (32Ĉ/Č) · (log n)^{4/ρ} h², 17 · C′ · √(log n / (n · h^d)) }.

Suppose x_0 ∈ M and x_0 is the unique maximizer of f on B(x_0, r̄). Then, letting x̂ := argmax_{x ∈ B(x_0, r̄) ∩ X_[n]} f̂_h(x), we have

  ‖x_0 − x̂‖ < r̃.

Proof sketch. This follows from modifying the proof of Theorem 3 of [11] by replacing R^d \ B(x_0, r̃) with B(x_0, r̄) \ B(x_0, r̃). This leads us to

  inf_{x ∈ B(x_0, r_n)} f̂_h(x) > sup_{x ∈ B(x_0, r̄) \ B(x_0, r̃)} f̂_h(x),

where r_n := min_{x ∈ X_[n]} |x_0 − x| and n is chosen sufficiently large. Thus, |x_0 − x̂| ≤ r̃.

Proof of Theorem 2. Suppose that x_0 ∈ M+_{τ+ε,δ} \ M−_{τ−ε,δ}. Let x̂ := argmax_{x ∈ B(x_0, τ) ∩ X_[n]} f̂_h(x). We first show that x̂ ∈ M̂. By Lemma 4, we have |x_0 − x̂| ≤ r̃, where r̃² := max{ (32Ĉ/Č) · (log n)^{4/ρ} h², 17 · C′ · √(log n / (n · h^d)) }. It remains to show that x̂ = argmax_{x ∈ B(x̂, τ) ∩ X_[n]} f̂_h(x). We have B(x̂, τ) ⊆ B(x_0, τ + r̃). Choose n sufficiently large such that (i) r̃ < ε, (ii) by Lemma 1, sup_{x∈X} |f̂_h(x) − f(x)| < δ/4, and (iii) r̃² < δ/(4Ĉ).
Now, we have

  sup_{x ∈ B(x_0, τ+r̃) \ B(x_0, τ)} f̂_h(x) ≤ sup_{x ∈ B(x_0, τ+r̃) \ B(x_0, τ)} f(x) + δ/4 ≤ f(x_0) − 3δ/4
    ≤ f(x̂) + Ĉ r̃² − 3δ/4 < f(x̂) − δ/2 < f̂_h(x̂).

Thus, x̂ = argmax_{x ∈ B(x̂, τ) ∩ X_[n]} f̂_h(x). Hence, x̂ ∈ M̂.

Next, we show that it is unique. To do this, suppose that x̂′ ∈ M̂ with ‖x̂′ − x_0‖ ≤ τ/2. Then we have both x̂ = argmax_{x ∈ B(x̂, τ) ∩ X_[n]} f̂_h(x) and x̂′ = argmax_{x ∈ B(x̂′, τ) ∩ X_[n]} f̂_h(x). However, choosing n sufficiently large such that r̃ < τ/2, we obtain x̂ ∈ B(x̂′, τ). This implies that x̂ = x̂′, as desired.

We now show |M̂| ≤ |M| − |M−_{τ−ε,δ}|. Suppose that x̂ ∈ M̂. Let τ_0 := min{ε/3, τ/3, r_M/2}. We show that B(x̂, τ_0) ∩ M ≠ ∅. Suppose otherwise. Let λ = f(x̂). By Assumptions 2 and 5, there exist σ > 0 and η > 0 such that the following holds uniformly: Vol(B(x̂, τ_0) ∩ L_f(λ + σ)) ≥ η. Choose n sufficiently large such that (i) by Lemma 1, sup_{x∈X} |f̂_h(x) − f(x)| < min{σ/2, δ/4}, and (ii) there exists a sample x ∈ B(x̂, ε/3) ∩ L_f(λ + σ) ∩ X_[n] by Lemma 7 of Chaudhuri and Dasgupta [2]. Then f̂_h(x) > λ + σ/2 > f̂_h(x̂) but x ∈ B(x̂, τ_0), a contradiction since x̂ is the maximizer of the KDE over the samples in its τ-radius neighborhood.
Thus, B(x̂, τ_0) ∩ M ≠ ∅.
Now, suppose that there exists x_0 ∈ B(x̂, τ_0) ∩ M−_{τ−ε,δ}. Then there exists x′ ∈ B(x_0, τ − 2τ_0) such that f(x′) ≥ f(x_0) + δ. Then, if x̄ is the closest sample point to x′, we have for n sufficiently large that |x′ − x̄| ≤ τ_0 and f(x̄) ≥ f(x_0) + δ/2, and thus f̂_h(x̄) > f(x̄) − δ/4 ≥ f(x̂) + δ/4 > f̂_h(x̂). But x̄ ∈ B(x̂, τ) ∩ X_[n], contradicting the fact that x̂ is the maximizer of the KDE over the samples in its τ-radius neighborhood. Thus, B(x̂, τ_0) ∩ (M \ M−_{τ−ε,δ}) ≠ ∅.
Finally, suppose that there exist x̂, x̂′ ∈ M̂ such that x_0 ∈ M \ M−_{τ−ε,δ} and x_0 ∈ B(x̂, τ_0) ∩ B(x̂′, τ_0). Then x̂, x̂′ ∈ B(x_0, τ_0), thus |x̂ − x̂′| ≤ τ and thus x̂ = x̂′, as desired.

Cluster Tree Estimation Proofs

Lemma 5 (Minimality). The following holds with probability at least 1 − 1/n: if A is a connected component of {x ∈ X : f(x) ≥ λ}, then A ∩ X_[n] is contained in the same component in Ĉ_f(λ − ε) for any ε > 0 as n → ∞.

Proof. It suffices to show that for each x ∈ A, there exists x′ ∈ B(x, τ/2) ∩ X_[n] such that f̂_h(x′) > λ − ε. Given our choice of τ, it follows by Lemma 7 of [2] that B(x, τ/2) ∩ X_[n] is non-empty for n sufficiently large. Let x′ ∈ B(x, τ/2) ∩ X_[n]. Choose n sufficiently large such that, by Lemma 1, we have sup_{x∈X} |f̂_h(x) − f(x)| < ε/2. We have f(x′) ≥ inf_{B(x,τ/2)} f ≥ λ − C_α(τ/2)^α > λ − ε/2, where the last inequality holds for n sufficiently large so that τ is sufficiently small. Thus, we have f̂_h(x′) > λ − ε, as desired.

Lemma 6 (Separation). Suppose that A and B are distinct connected components of {x ∈ X : f(x) ≥ λ} which merge at {x ∈ X : f(x) ≥ µ}. Then A ∩ X_[n] and B ∩ X_[n] are separated in Ĉ_f(µ + ε) for any ε > 0 as n → ∞.

Proof. It suffices to assume that λ = µ + ε. Let A′ and B′ be the connected components of {x ∈ X : f(x) ≥ µ + ε/2} which contain A and B, respectively. By the uniform continuity of f, there exists r̃ > 0 such that A + B(0, 3r̃) ⊆ A′. We have sup_{x ∈ A′ \ (A + B(0, r̃))} f(x) = µ + ε − ε′ for some ε′ > 0. Choose n sufficiently large such that, by Lemma 1, we have sup_{x∈X} |f̂_h(x) − f(x)| < ε′/2. Thus, sup_{x ∈ A′ \ (A + B(0, r̃))} f̂_h(x) < µ + ε − ε′/2. Hence, points in Ĉ_f(µ + ε) cannot belong to A′ \ (A + B(0, r̃)). Since A′ also contains A + B(0, 3r̃), there cannot be a path from A to B with points of empirical density at least µ + ε with all edges of length less than r̃. The result follows by taking n sufficiently large so that τ < r̃, as desired.

Proof of Theorem 4.
By the regularity assumptions on $f$ and Theorem 2 of [7], Algorithm 2 has both uniform minimality and uniform separation (defined in [7]), which implies convergence in merge distortion.

Modal Regression Proofs

Proof of Theorem 5. There are two directions to show: (1) if $\hat{y} \in \widehat{\mathcal{M}}_x$, then $\hat{y}$ is a consistent estimator of some mode $y_0 \in \mathcal{M}_x$; (2) for each mode $y_0 \in \mathcal{M}_x$, there exists a unique $\hat{y} \in \widehat{\mathcal{M}}_x$ which estimates it.

We first show (1). We show that $[\hat{y} - \tau, \hat{y} + \tau] \cap \mathcal{M}_x \neq \emptyset$. Suppose otherwise. Let $\lambda = f(x, \hat{y})$ and choose $\sigma < \tau/4$. Then by Assumptions 2 and 5, taking $\epsilon = \tau/2$, there exist $\eta > 0$ and $\delta > 0$ such that $\{(x, y') : y' \in [\hat{y} - \tau, \hat{y} + \tau]\} \cap L_f(\lambda + \delta)$ contains a connected set $A$ with $\mathrm{Vol}(A) > \eta$. Choose $n$ sufficiently large such that (i) there exists $y \in A \cap Y$, and (ii) by Lemma 1, $\sup_{(x', y')} |\hat{f}_h(x', y') - f(x', y')| < \delta/2$. Then $\hat{f}_h(x, y) > \lambda + \delta/2 > \hat{f}_h(x, \hat{y})$ but $y \in [\hat{y} - \tau, \hat{y} + \tau]$, a contradiction since $\hat{y}$ is the maximizer of the KDE in its $\tau$-radius neighborhood when restricted to $X = x$. Thus, there exists $y_0 \in \mathcal{M}_x$ such that $y_0 \in [\hat{y} - \tau, \hat{y} + \tau]$. Moreover, this $y_0 \in \mathcal{M}_x$ must be unique by Lemma 2. As $n \to \infty$, we have $\tau \to 0$, and thus consistency is established for $\hat{y}$ estimating $y_0$.

Now we show (2). Suppose that $y_0 \in \mathcal{M}_x$. From the above, for $n$ sufficiently large, the maximizer of the KDE in $[y_0 - 2\tau, y_0 + 2\tau] \cap Y$ is contained in $[y_0 - \tau, y_0 + \tau]$.
Thus, there exists a root of the tree contained in $[y_0 - \tau, y_0 + \tau]$, and taking $\tau \to 0$ gives us the desired result.

Acknowledgements

I thank the anonymous reviewers for their valuable feedback.

References

[1] Ery Arias-Castro, David Mason, and Bruno Pelletier. On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. Journal of Machine Learning Research, 2015.

[2] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.

[3] Yen-Chi Chen, Christopher R. Genovese, Ryan J. Tibshirani, and Larry Wasserman. Nonparametric modal regression. The Annals of Statistics, 44(2):489–514, 2016.

[4] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.

[5] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.

[6] Sanjoy Dasgupta and Samory Kpotufe. Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems, pages 2555–2563, 2014.

[7] Justin Eldridge, Mikhail Belkin, and Yusu Wang. Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering. In COLT, pages 588–606, 2015.

[8] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.

[9] John A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76(374):388–394, 1981.

[10] Heinrich Jiang. Density level set estimation on manifolds with DBSCAN. In International Conference on Machine Learning, pages 1684–1693, 2017.

[11] Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning, pages 1694–1703, 2017.

[12] Heinrich Jiang and Samory Kpotufe. Modal-set estimation with an application to clustering. In International Conference on Artificial Intelligence and Statistics, pages 1197–1206, 2017.

[13] Samory Kpotufe and Ulrike V. Luxburg. Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 225–232, 2011.

[14] Bharath Sriperumbudur and Ingo Steinwart. Consistency and rates for clustering with DBSCAN. In Artificial Intelligence and Statistics, pages 1090–1098, 2012.

[15] Aleksandr Borisovich Tsybakov. Recursive estimation of the mode of a multivariate distribution. Problemy Peredachi Informatsii, 26(1):38–45, 1990.

[16] Andrea Vedaldi and Stefano Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pages 705–718. Springer, 2008.

[17] Daren Wang, Xinyang Lu, and Alessandro Rinaldo. Optimal rates for cluster tree estimation using kernel density estimators. arXiv preprint arXiv:1706.03113, 2017.
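For concreteness, the procedure analyzed in these proofs (each sample points to its nearest sample of strictly higher KDE value within a $\tau$-radius ball; the roots of the resulting trees are the mode estimates) can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the function name `quick_shift`, the Gaussian kernel choice, and the brute-force pairwise distances are our own assumptions.

```python
import numpy as np

def quick_shift(X, h, tau):
    """Illustrative Quick Shift sketch (hypothetical helper, not the paper's code).

    Each sample is linked to its nearest sample of strictly higher KDE value
    within distance tau; samples with no such neighbor become roots, which
    serve as the mode estimates.
    """
    n = len(X)
    # Pairwise squared distances and (unnormalized) Gaussian KDE values.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    dens = np.exp(-d2 / (2.0 * h ** 2)).sum(axis=1)
    parent = np.full(n, -1)  # -1 marks a root of a Quick Shift tree
    for i in range(n):
        # Candidates: strictly higher empirical density, inside the tau ball.
        cand = np.where((dens > dens[i]) & (d2[i] <= tau ** 2))[0]
        if cand.size > 0:
            parent[i] = cand[np.argmin(d2[i, cand])]
    return parent, dens
```

On two well-separated clusters, following `parent` pointers strictly increases density and terminates at exactly one root per cluster, matching the mode-recovery behavior the theorems formalize; enlarging `tau` merges the trees of less salient modes, which is the segmentation behavior discussed in the introduction.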