{"title": "Optimal Ridge Detection using Coverage Risk", "book": "Advances in Neural Information Processing Systems", "page_first": 316, "page_last": 324, "abstract": "We introduce the concept of coverage risk as an error measure for density ridge estimation.The coverage risk generalizes the mean integrated square error to set estimation.We propose two risk estimators for the coverage risk and we show that we can select tuning parameters by minimizing the estimated risk.We study the rate of convergence for coverage risk and prove consistency of the risk estimators.We apply our method to three simulated datasets and to cosmology data.In all the examples, the proposed method successfully recover the underlying density structure.", "full_text": "Optimal Ridge Detection using Coverage Risk\n\nYen-Chi Chen\n\nDepartment of Statistics\n\nCarnegie Mellon University\n\nyenchic@andrew.cmu.edu\n\nShirley Ho\n\nDepartment of Physics\n\nCarnegie Mellon University\n\nshirleyh@andrew.cmu.edu\n\nChristopher R. Genovese\nDepartment of Statistics\n\nCarnegie Mellon University\n\ngenovese@stat.cmu.edu\n\nLarry Wasserman\n\nDepartment of Statistics\n\nCarnegie Mellon University\nlarry@stat.cmu.edu\n\nAbstract\n\nWe introduce the concept of coverage risk as an error measure for density ridge\nestimation. The coverage risk generalizes the mean integrated square error to set\nestimation. We propose two risk estimators for the coverage risk and we show that\nwe can select tuning parameters by minimizing the estimated risk. We study the\nrate of convergence for coverage risk and prove consistency of the risk estimators.\nWe apply our method to three simulated datasets and to cosmology data. In all\nthe examples, the proposed method successfully recover the underlying density\nstructure.\n\n1\n\nIntroduction\n\nDensity ridges [10, 22, 15, 6] are one-dimensional curve-like structures that characterize high den-\nsity regions. 
Density ridges have been applied to computer vision [2], remote sensing [21], biomedical imaging [1], and cosmology [5, 7]. Density ridges are similar to principal curves [17, 18, 27]. Figure 1 provides an example of applying density ridges to learn the structure of our Universe.\n\nTo detect density ridges from data, [22] proposed the 'Subspace Constrained Mean Shift (SCMS)' algorithm. SCMS is a modification of the usual mean shift algorithm [14, 8] that adapts to the local geometry. Unlike mean shift, which pushes every mesh point to a local mode, SCMS moves the mesh points along a projected gradient until they arrive at a nearby ridge. Essentially, the SCMS algorithm detects the ridges of the kernel density estimator (KDE). Therefore, the SCMS algorithm requires a pre-selected parameter h, which plays the role of the smoothing bandwidth in the kernel density estimator.\n\nDespite the wide application of the SCMS algorithm, the choice of h remains an unsolved problem. As in the density estimation problem, a poor choice of h results in over-smoothing or under-smoothing of the density ridges. See the second row of Figure 1.\n\nIn this paper, we introduce the concept of coverage risk, which is a generalization of the mean integrated square error from function estimation. We then show that one can consistently estimate the coverage risk by using data splitting or the smoothed bootstrap. This leads us to a data-driven selection rule for choosing the parameter h for the SCMS algorithm. We apply the proposed method to several famous datasets including the spiral dataset, the three spirals dataset, and the NIPS dataset. In all simulations, our selection rule allows the SCMS algorithm to detect the underlying structure of the data.\n\nFigure 1: The cosmic web. This is a slice of the observed Universe from the Sloan Digital Sky Survey. We apply the density ridge method to detect filaments [7].
The top row shows one example of the detected filaments. The bottom row shows the effect of smoothing. Bottom-Left: optimal smoothing. Bottom-Middle: under-smoothing. Bottom-Right: over-smoothing. Under optimal smoothing, we detect an intricate filament network. If we under-smooth or over-smooth the dataset, we cannot find the structure.\n\n1.1 Density Ridges\n\nDensity ridges are defined as follows. Assume X1, ..., Xn are independently and identically distributed from a smooth probability density function p with compact support K. The density ridges [10, 15, 6] are defined as\n\nR = {x ∈ K : V(x)V(x)ᵀ∇p(x) = 0, λ2(x) < 0},\n\nwhere V(x) = [v2(x), ..., vd(x)], with vj(x) being the eigenvector associated with the ordered eigenvalue λj(x) (λ1(x) ≥ ... ≥ λd(x)) of the Hessian matrix H(x) = ∇∇p(x). That is, R is the collection of points whose projected gradient V(x)V(x)ᵀ∇p(x) = 0. It can be shown that (under appropriate conditions) R is a collection of 1-dimensional smooth curves (1-dimensional manifolds) in Rd.\n\nThe SCMS algorithm is a plug-in estimate for R given by\n\nR̂n = {x ∈ K : V̂n(x)V̂n(x)ᵀ∇p̂n(x) = 0, λ̂2(x) < 0},\n\nwhere p̂n(x) = (1/(nh^d)) Σ_{i=1}^n K((x − Xi)/h) is the KDE, and V̂n and λ̂2 are the associated quantities defined by p̂n. Hence, one can clearly see that the parameter h in the SCMS algorithm plays the same role as the smoothing bandwidth of the KDE.\n\n2 Coverage Risk\n\nBefore we introduce the coverage risk, we first define some geometric concepts. Let μℓ be the ℓ-dimensional Hausdorff measure [13]. Namely, μ1(A) is the length of a set A and μ2(A) is the area of A. Let d(x, A) be the projection distance from a point x to a set A. We define U_R and U_R̂n as random variables uniformly distributed over the true density ridge R and the ridge estimator R̂n, respectively. Assuming R and R̂n are given, we define the following two random variables:\n\nWn = d(U_R, R̂n), W̃n = d(U_R̂n, R). (1)\n\nNote that U_R and U_R̂n are random variables, while R and R̂n are sets. Wn is the distance from a randomly selected point on R to the estimator R̂n, and W̃n is the distance from a random point on R̂n to R.\n\nLet Haus(A, B) = inf{r : A ⊂ B ⊕ r, B ⊂ A ⊕ r} be the Hausdorff distance between A and B, where A ⊕ r = {x : d(x, A) ≤ r}. The following lemma gives some useful properties of Wn and W̃n.\n\nLemma 1 Both random variables Wn and W̃n are bounded by Haus(R̂n, R). Namely,\n\n0 ≤ Wn ≤ Haus(R̂n, R), 0 ≤ W̃n ≤ Haus(R̂n, R). (2)\n\nThe cumulative distribution functions (CDFs) of Wn and W̃n are\n\nP(Wn ≤ r | R̂n) = μ1(R ∩ (R̂n ⊕ r)) / μ1(R), P(W̃n ≤ r | R̂n) = μ1(R̂n ∩ (R ⊕ r)) / μ1(R̂n). (3)\n\nThis lemma follows trivially from the definitions, so we omit its proof. Lemma 1 links the random variables Wn and W̃n to the Hausdorff distance and to the coverage of R and R̂n. For instance, P(Wn ≤ r | R̂n) is the fraction of R covered by padding the region around R̂n at distance r. We therefore call Wn and W̃n coverage random variables. Now we define the L1 and L2 coverage risks for estimating R by R̂n as\n\nRisk1,n = E(Wn + W̃n)/2, Risk2,n = E(Wn² + W̃n²)/2. (4)\n\nThat is, Risk1,n (and Risk2,n) is the expected (squared) projected distance between R and R̂n. Note that the expectation in (4) applies to both R̂n and U_R.
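As a quick numerical illustration (our own sketch, not code from the paper), the coverage risks in (4) and the Hausdorff bound of Lemma 1 can be approximated by nearest-neighbor distances between dense point samples standing in for U_R and U_R̂n:

```python
import numpy as np

def proj_dist(points, target):
    # d(x, A): distance from each point to its nearest neighbor in the target set
    diffs = points[:, None, :] - target[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def coverage_risks(R, Rhat):
    # W_n: distances from (uniform) points on R to the estimator Rhat;
    # W_tilde_n: distances from points on Rhat back to R
    W = proj_dist(R, Rhat)
    Wt = proj_dist(Rhat, R)
    risk1 = 0.5 * (W.mean() + Wt.mean())                 # L1 coverage risk
    risk2 = 0.5 * ((W ** 2).mean() + (Wt ** 2).mean())   # L2 coverage risk
    haus = max(W.max(), Wt.max())                        # both W's are bounded by Haus (Lemma 1)
    return risk1, risk2, haus

# toy example: the "estimator" is the true curve shifted vertically by 0.1,
# so every projected distance equals 0.1
t = np.linspace(0.0, 1.0, 200)
R = np.column_stack([t, np.zeros_like(t)])
Rhat = np.column_stack([t, np.full_like(t, 0.1)])
r1, r2, hd = coverage_risks(R, Rhat)
```

On this toy pair r1 ≈ 0.1, r2 ≈ 0.01, and the Hausdorff distance is 0.1, so both risks attain the Lemma 1 bound.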
One can view Risk2,n as a generalized mean integrated square error (MISE) for sets.\n\nA nice property of Risk1,n and Risk2,n is that they are not sensitive to outliers of R, in the sense that a small perturbation of R will not change the risk much. On the contrary, the Hausdorff distance is very sensitive to outliers.\n\n2.1 Selection of Tuning Parameters Based on Risk Minimization\n\nIn this section, we show how to choose h by minimizing an estimate of the risk.\n\nWe propose two risk estimators. The first estimator is based on the smoothed bootstrap [25]. We sample X*1, ..., X*n from the KDE p̂n and recompute the estimator R̂*n. Then we estimate the risk by\n\nR̂isk1,n = E(W*n + W̃*n | X1, ..., Xn)/2, R̂isk2,n = E(W*n² + W̃*n² | X1, ..., Xn)/2, (5)\n\nwhere W*n = d(U_R̂n, R̂*n) and W̃*n = d(U_R̂*n, R̂n).\n\nThe second approach is data splitting. We randomly split the data into X†11, ..., X†1m and X†21, ..., X†2m, assuming n is even and 2m = n. We compute the estimated manifolds using each half of the data, which we denote R̂†1,n and R̂†2,n. Then we compute\n\nR̂isk†1,n = E(W†1,n + W†2,n | X1, ..., Xn)/2, R̂isk†2,n = E(W†1,n² + W†2,n² | X1, ..., Xn)/2, (6)\n\nwhere W†1,n = d(U_R̂†1,n, R̂†2,n) and W†2,n = d(U_R̂†2,n, R̂†1,n).\n\nHaving estimated the risk, we select h by\n\nh* = argmin_{h ≤ h̄n} R̂isk†1,n, (7)\n\nwhere h̄n is an upper bound given by the normal reference rule [26] (which is known to oversmooth, so we only consider h below this rule). Moreover, one can choose h by minimizing the L2 risk as well.\n\nIn [11], the author considers selecting the smoothing bandwidth for local principal curves by self-coverage. This criterion is different from ours: the self-coverage counts data points, it is a monotonically increasing function, and [11] proposes to select the bandwidth at which its derivative is highest. Our coverage risk yields a simple trade-off curve, and one can easily pick the optimal bandwidth by minimizing the estimated risk.\n\n3 Manifold Comparison by Coverage\n\nThe concept of coverage in the previous section can be generalized to investigate the difference between two manifolds. Let M1 and M2 be an ℓ1-dimensional and an ℓ2-dimensional manifold, respectively (ℓ1 and ℓ2 are not necessarily the same).
We define the coverage random variables\n\nW12 = d(U_M1, M2), W21 = d(U_M2, M1). (8)\n\nThen by Lemma 1, the CDFs of W12 and W21 contain information about how M1 and M2 differ from each other:\n\nP(W12 ≤ r) = μℓ1(M1 ∩ (M2 ⊕ r)) / μℓ1(M1), P(W21 ≤ r) = μℓ2(M2 ∩ (M1 ⊕ r)) / μℓ2(M2). (9)\n\nP(W12 ≤ r) is the coverage of M1 obtained by padding regions at distance r around M2.\n\nWe call the plots of the CDFs of W12 and W21 coverage diagrams, since they are linked to the coverage of M1 and M2. The coverage diagram allows us to study how two manifolds differ from each other. When ℓ1 = ℓ2, the coverage diagram can be used as a similarity measure for two manifolds. When ℓ1 ≠ ℓ2, the coverage diagram serves as a measure of the quality of representing high-dimensional objects by low-dimensional ones. A nice property of the coverage diagram is that we can approximate the CDFs of W12 and W21 by a mesh of points (or points uniformly distributed) over M1 and M2. In Figure 2 we consider a Helix dataset whose support has dimension d = 3, and we compare two curves, a spiral curve (green) and a straight line (orange), as representations of the Helix dataset. As can be seen from the coverage diagram (right panel), the green curve has better coverage at each distance (compared to the orange curve), so the spiral curve provides a better representation of the Helix dataset.\n\nIn addition to the coverage diagram, we can also use the following L1 and L2 losses as summaries of the difference:\n\nLoss1(M1, M2) = E(W12 + W21)/2, Loss2(M1, M2) = E(W12² + W21²)/2. (10)\n\nThe expectation is taken over U_M1 and U_M2, and both M1 and M2 here are fixed.
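The coverage diagram is easy to approximate from point samples of the two manifolds: threshold nearest-neighbor distances at a grid of radii. A minimal sketch (our own, not code from the paper):

```python
import numpy as np

def nn_dist(points, target):
    # distance from each point to the nearest point of the other set
    d = np.sqrt(((points[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1))
    return d.min(axis=1)

def coverage_diagram(M1, M2, radii):
    # empirical CDFs of W12 = d(U_M1, M2) and W21 = d(U_M2, M1):
    # the fraction of each manifold covered when the other is padded by r
    w12, w21 = nn_dist(M1, M2), nn_dist(M2, M1)
    cdf12 = np.array([(w12 <= r).mean() for r in radii])
    cdf21 = np.array([(w21 <= r).mean() for r in radii])
    return cdf12, cdf21

# toy example: a unit circle versus a single point at its center;
# every nearest-neighbor distance between the two sets equals 1
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
center = np.zeros((1, 2))
cdf12, cdf21 = coverage_diagram(circle, center, radii=[0.5, 1.1])
```

Both empirical CDFs jump from 0 to 1 between r = 0.5 and r = 1.1, reflecting that every distance between the two sets is 1.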
The risks in (4) are the expected losses:\n\nRisk1,n = E(Loss1(R̂n, R)), Risk2,n = E(Loss2(R̂n, R)). (11)\n\nFigure 2: The Helix dataset. The original support of the Helix dataset (black dots) is a 3-dimensional region. We can use the green spiral curve (d = 1) to represent the region. Note that we also provide a bad representation using a straight line (orange). The coverage plot reveals the quality of representation. Left: the original data. Right: the coverage plot for the spiral curve (green) versus the straight line (orange). Dashed lines give the coverage of the data points (black dots) by the green/orange curves in the left panel; solid lines give the coverage of the green/orange curves by the data points.\n\n4 Theoretical Analysis\n\nIn this section, we analyze the asymptotic behavior of the coverage risk and prove the consistency of estimating the coverage risk by the proposed method. In particular, we derive the asymptotic properties of the density ridges. We focus only on the L2 risk since, by Jensen's inequality, the L1 risk can be bounded by the L2 risk.\n\nBefore we state our assumptions, we first define the orientation of density ridges. Recall that the density ridge R is a collection of one-dimensional curves. Thus, for each point x ∈ R, we can associate a unit vector e(x) that represents the orientation of R at x. The explicit formula for e(x) can be found in Lemma 1 of [6].\n\nAssumptions.\n\n(R) There exist β0, β1, β2, δR > 0 such that for all x ∈ R ⊕ δR,\n\nλ2(x) ≤ −β1, λ1(x) ≥ β0 − β1, ‖∇p(x)‖ ‖p^(3)(x)‖max ≤ β0(β1 − β2), (12)\n\nwhere ‖p^(3)(x)‖max is the element-wise norm of the third derivative. And for each x ∈ R,\n\n|e(x)ᵀ∇p(x)| ≥ λ1(x)/(λ1(x) − λ2(x)).\n\n(K1) The kernel function K is three times bounded differentiable, symmetric, and non-negative, and\n\n∫ x²K^(α)(x) dx < ∞, ∫ (K^(α)(x))² dx < ∞\n\nfor all α = 0, 1, 2, 3.\n\n(K2) The kernel function K and its partial derivatives satisfy condition K1 in [16]. Specifically, let\n\nK = { y ↦ K^(α)((x − y)/h) : x ∈ Rd, h > 0, |α| = 0, 1, 2 }. (13)\n\nWe require that K satisfies\n\nsup_P N(K, L2(P), ε‖F‖L2(P)) ≤ (A/ε)^v, (14)
for some positive numbers A and v, where N(T, d, ε) denotes the ε-covering number of the metric space (T, d) and F is the envelope function of K; the supremum is taken over the whole Rd. A and v are usually called the VC characteristics of K. The norm is ‖F‖L2(P) = sup_P ∫ |F(x)|² dP(x).\n\nAssumption (R) appears in [6] and is very mild. The first two inequalities in (12) are just bounds on the eigenvalues. The last inequality requires the density around the ridges to be smooth. The latter part of (R) requires the direction of the ridges to be similar to the gradient direction. Assumption (K1) is a common condition for kernel density estimators; see e.g. [28] and [24]. Assumption (K2) regularizes the class of kernel functions and is widely assumed [12, 15, 4]; any bounded kernel function with compact support satisfies this condition.
Both (K1) and (K2) hold for the Gaussian kernel.\n\nUnder the above conditions, we derive the rate of convergence for the L2 risk.\n\nTheorem 2 Let Risk2,n be the L2 coverage risk for estimating the density ridges. Assume (K1–2) and (R), and assume p is at least four times bounded differentiable. Then as n → ∞, h → 0 and log n/(nh^(d+6)) → 0,\n\nRisk2,n = B_R²h⁴ + σ_R²/(nh^(d+2)) + o(h⁴) + o(1/(nh^(d+2))),\n\nfor some B_R and σ_R² that depend only on the density p and the kernel function K.\n\nThe rate in Theorem 2 shows a bias-variance decomposition: the first term, involving h⁴, is the bias, while the second term is the variance. By Jensen's inequality, the rate of convergence for the L1 risk is the square root of the rate in Theorem 2. Note that we require the smoothing parameter h to decay to 0 slowly, in the sense that log n/(nh^(d+6)) → 0. This constraint comes from the uniform bound for estimating the third derivatives of p: we need the smoothness of the estimated ridges to converge to that of the true ridges. Similar results for density level sets appear in [3, 20].\n\nBy Lemma 1, we can upper bound the L2 risk by the expected squared Hausdorff distance, which gives the rate\n\nRisk2,n ≤ E(Haus²(R̂n, R)) = O(h⁴) + O(log n/(nh^(d+2))). (15)\n\nThe rate under the Hausdorff distance for density ridges can be found in [6], and the corresponding rate for level sets appears in [9]. The rate in Theorem 2 agrees with the bound from the Hausdorff distance and has a slightly better variance term (without a log n factor). This phenomenon is similar to the relation between the MISE and the L∞ error in nonparametric function estimation: the MISE converges slightly faster (by a log n factor) than the squared L∞ error.\n\nNow we prove the consistency of the risk estimators.
In particular, we prove the consistency of the smoothed bootstrap; the case of data splitting can be proved in a similar way.\n\nTheorem 3 Let Risk2,n be the L2 coverage risk for estimating the density ridges, and let R̂isk2,n be the corresponding risk estimator from the smoothed bootstrap. Assume (K1–2) and (R), and assume p is at least four times bounded differentiable. Then as n → ∞, h → 0 and log n/(nh^(d+6)) → 0,\n\n(R̂isk2,n − Risk2,n)/Risk2,n → 0 in probability.\n\nTheorem 3 proves the consistency of risk estimation using the smoothed bootstrap. This also leads to the consistency of data splitting.\n\nFigure 3: Three different simulation datasets. Top row: the spiral dataset. Middle row: the three spirals dataset. Bottom row: the NIPS character dataset. For each row, the leftmost panel shows the estimated L1 coverage risk using data splitting; the red line indicates the bandwidth selected by least squares cross-validation [19], which either undersmooths or oversmooths. The remaining three panels show the results for different smoothing parameters; from left to right: under-smoothing, optimal smoothing (using the coverage risk), and over-smoothing. Note that the second minimum of the coverage risk for the three spirals dataset (middle row) corresponds to a phase transition when the estimator becomes a big circle; this is also a locally stable structure.\n\n5 Applications\n\n5.1 Simulation Data\n\nWe now apply the data splitting technique (7) to choose the smoothing bandwidth for density ridge estimation. We use data splitting rather than the smoothed bootstrap since, in practice, data splitting works better. The density ridge estimation can be done by the subspace constrained mean shift algorithm [22].
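For concreteness, one SCMS iteration can be sketched as follows. This is our own minimal 2-D re-implementation with a Gaussian kernel, not the authors' code, and it omits the convergence checks and mesh management of [22]:

```python
import numpy as np

def scms_step(x, X, h):
    # one subspace constrained mean shift update of a mesh point x
    diff = X - x                                            # displacements to the data
    c = np.exp(-(diff ** 2).sum(axis=1) / (2.0 * h ** 2))   # Gaussian kernel weights
    d = x.size
    # Hessian of the KDE up to a positive constant (constants do not change eigenvectors)
    outer = diff[:, :, None] * diff[:, None, :] / h ** 2
    H = (c[:, None, None] * (outer - np.eye(d))).sum(axis=0)
    eigvals, eigvecs = np.linalg.eigh(H)                    # eigenvalues in ascending order
    V = eigvecs[:, :-1]                                     # drop top eigenvector: V = [v2, ..., vd]
    m = (c[:, None] * X).sum(axis=0) / c.sum() - x          # ordinary mean shift vector
    return x + V @ V.T @ m                                  # move only within the constrained subspace

# a mesh point drifting onto the ridge of data lying along the x-axis
t = np.linspace(-2.0, 2.0, 81)
X = np.column_stack([t, np.zeros_like(t)])
x = np.array([0.1, 0.2])
for _ in range(50):
    x = scms_step(x, X, h=0.3)
```

The point slides toward the ridge (y ≈ 0) while its position along the ridge barely moves; removing the projection V Vᵀ recovers plain mean shift, which would instead run to a local mode.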
We consider three famous datasets: the spiral dataset, the three spirals dataset, and a 'NIPS' dataset.\n\nFigure 3 shows the results for the three simulation datasets. The top row is the spiral dataset; the middle row is the three spirals dataset; the bottom row is the NIPS character dataset. For each row, the first panel from the left is the estimated L1 risk obtained by data splitting. Note that there is no practical difference between the L1 and L2 risks. The second to fourth panels show under-smoothing, optimal smoothing, and over-smoothing. Note that we also remove the ridges whose density is below 0.05 × max_x p̂n(x), since they behave like random noise. As can easily be seen, the optimal bandwidth allows the density ridges to capture the underlying structure in every dataset. On the contrary, under-smoothing and over-smoothing do not capture the structure and have a higher risk.\n\nFigure 4: Another slice of the cosmic web data from the Sloan Digital Sky Survey. The leftmost panel shows the estimated L1 coverage risk for estimating density ridges under different smoothing parameters; we estimated the L1 coverage risk using data splitting. The remaining panels, from left to right, display under-smoothing, optimal smoothing, and over-smoothing. As can easily be seen, optimal smoothing allows the SCMS algorithm to detect the intricate cosmic network structure.\n\n5.2 Cosmic Web\n\nNow we apply our technique to the Sloan Digital Sky Survey, a huge dataset that contains millions of galaxies.
In our data, each point is an observed galaxy with three features:\n\n• z: the redshift, which is the distance from the galaxy to Earth.\n• RA: the right ascension, which is the longitude of the Universe.\n• dec: the declination, which is the latitude of the Universe.\n\nThese three features (z, RA, dec) uniquely determine the location of a given galaxy.\n\nTo demonstrate the effectiveness of our method, we select a 2-D slice of our Universe at redshift z = 0.050–0.055 with (RA, dec) ∈ [200, 240] × [0, 40]. Since the redshift difference is very tiny, we ignore the redshift values of the galaxies within this region and treat them as 2-D data points; thus, we use only RA and dec. We then apply the SCMS algorithm (the version of [7]) with the data splitting method introduced in Section 2.1 to select the smoothing parameter h. The result is given in Figure 4. The left panel provides the estimated coverage risk at different smoothing bandwidths. The remaining panels give the results for under-smoothing (second panel), optimal smoothing (third panel), and over-smoothing (rightmost panel). In the third panel of Figure 4, we see that the SCMS algorithm detects the filament structure in the data.\n\n6 Discussion\n\nIn this paper, we propose a method using the coverage risk, a generalization of the mean integrated square error, to select the smoothing parameter for the density ridge estimation problem. We show that the coverage risk can be estimated using data splitting or the smoothed bootstrap, and we derive the statistical consistency of the risk estimators. Both simulations and real data analysis show that the proposed bandwidth selector works very well in practice.\n\nThe concept of coverage risk is not limited to density ridges; it can easily be generalized to other manifold learning techniques. Thus, we can use data splitting to estimate the risk and use the risk estimator to select tuning parameters. This is related to the so-called stability selection [23],
This is related to the so-called stability selection [23],\nwhich allows us to select tuning parameters even in an unsupervised learning settings.\n\n8\n\n0.00.20.40.60.81.00.91.01.11.21.31.41.5Smoothing Parameter(Estimated) L1 Coverage Riskllllllllllllllllllll\fReferences\n\n[1] E. Bas, N. Ghadarghadar, and D. Erdogmus. Automated extraction of blood vessel networks from 3d\nmicroscopy image stacks via multi-scale principal curve tracing. In Biomedical Imaging: From Nano to\nMacro, 2011 IEEE International Symposium on, pages 1358\u20131361. IEEE, 2011.\n\n[2] E. Bas, D. Erdogmus, R. Draft, and J. W. Lichtman. Local tracing of curvilinear structures in volumet-\nric color images: application to the brainbow analysis. Journal of Visual Communication and Image\nRepresentation, 23(8):1260\u20131271, 2012.\n\n[3] B. Cadre. Kernel estimation of density level sets. Journal of multivariate analysis, 2006.\n[4] Y.-C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. arXiv\n\npreprint arXiv:1412.1716, 2014.\n\n[5] Y.-C. Chen, C. R. Genovese, and L. Wasserman. Generalized mode and ridge estimation.\n\n1406.1803, June 2014.\n\narXiv:\n\n[6] Y.-C. Chen, C. R. Genovese, and L. Wasserman. Asymptotic theory for density ridges. arXiv preprint\n\narXiv:1406.5663, 2014.\n\n[7] Y.-C. Chen, S. Ho, P. E. Freeman, C. R. Genovese, and L. Wasserman. Cosmic web reconstruction\n\nthrough density ridges: Method and algorithm. arXiv preprint arXiv:1501.05303, 2015.\n\n[8] Y. Cheng. Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE\n\nTransactions on, 17(8):790\u2013799, 1995.\n\n[9] A. Cuevas, W. Gonzalez-Manteiga, and A. Rodriguez-Casal. Plug-in estimation of general level sets.\n\nAust. N. Z. J. Stat., 2006.\n\n[10] D. Eberly. Ridges in Image and Data Analysis. Springer, 1996.\n[11] J. Einbeck. 
Bandwidth selection for mean-shift based unsupervised learning techniques: a unified approach via self-coverage. Journal of Pattern Recognition Research, 6(2):175–192, 2011.\n\n[12] U. Einmahl and D. M. Mason. Uniform in bandwidth consistency for kernel-type function estimators. The Annals of Statistics, 2005.\n\n[13] L. C. Evans and R. F. Gariepy. Measure Theory and Fine Properties of Functions, volume 5. CRC Press, 1991.\n\n[14] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 21(1):32–40, 1975.\n\n[15] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. The Annals of Statistics, 42(4):1511–1545, 2014.\n\n[16] E. Gine and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l'Institut Henri Poincare (B) Probability and Statistics, 2002.\n\n[17] T. Hastie. Principal curves and surfaces. Technical report, DTIC Document, 1984.\n\n[18] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.\n\n[19] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433):401–407, 1996.\n\n[20] D. M. Mason, W. Polonik, et al. Asymptotic normality of plug-in level set estimates. The Annals of Applied Probability, 19(3):1108–1142, 2009.\n\n[21] Z. Miao, B. Wang, W. Shi, and H. Wu. A method for accurate road centerline extraction from a classified image. 2014.\n\n[22] U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 2011.\n\n[23] A. Rinaldo and L. Wasserman. Generalized density clustering. The Annals of Statistics, 2010.\n\n[24] D. W. Scott.
Multivariate Density Estimation: Theory, Practice, and Visualization, volume 383. John Wiley & Sons, 2009.\n\n[25] B. Silverman and G. Young. The bootstrap: To smooth or not to smooth? Biometrika, 74(3):469–479, 1987.\n\n[26] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.\n\n[27] R. Tibshirani. Principal curves revisited. Statistics and Computing, 2(4):183–190, 1992.\n\n[28] L. Wasserman. All of Nonparametric Statistics. Springer-Verlag New York, Inc., 2006.", "award": [], "sourceid": 191, "authors": [{"given_name": "Yen-Chi", "family_name": "Chen", "institution": "Carnegie Mellon University"}, {"given_name": "Christopher", "family_name": "Genovese", "institution": "Carnegie Mellon University"}, {"given_name": "Shirley", "family_name": "Ho", "institution": "Carnegie Mellon University"}, {"given_name": "Larry", "family_name": "Wasserman", "institution": "Carnegie Mellon University"}]}