{"title": "Efficient Bregman Range Search", "book": "Advances in Neural Information Processing Systems", "page_first": 243, "page_last": 251, "abstract": "We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tasks like outlier detection and information retrieval. In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe the first algorithm for range search for an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and a variety of matrix divergences. Metric methods cannot be directly applied since Bregman divergences do not in general satisfy the triangle inequality. We derive geometric properties of Bregman divergences that yield an efficient algorithm for range search based on a recently proposed space decomposition for Bregman divergences.", "full_text": "Ef\ufb01cient Bregman Range Search\n\nLawrence Cayton\n\nMax Planck Institute for Biological Cybernetics\n\nlcayton@tuebingen.mpg.de\n\nAbstract\n\nWe develop an algorithm for ef\ufb01cient range search when the notion of dissim-\nilarity is given by a Bregman divergence. The range search task is to return\nall points in a potentially large database that are within some speci\ufb01ed distance\nof a query.\nIt arises in many learning algorithms such as locally-weighted re-\ngression, kernel density estimation, neighborhood graph-based algorithms, and in\ntasks like outlier detection and information retrieval. 
In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe the first algorithm for range search for an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and a variety of matrix divergences. Metric methods cannot be directly applied since Bregman divergences do not in general satisfy the triangle inequality. We derive geometric properties of Bregman divergences that yield an efficient algorithm for range search based on a recently proposed space decomposition for Bregman divergences.

1 Introduction

Range search is a fundamental proximity task at the core of many learning problems. The task of range search is to return all points in a database within a specified distance of a given query. The problem is to do so efficiently, without examining the entire database. Many machine learning algorithms require range search. Locally weighted regression and kernel density estimation/regression both require retrieving points in a region around a test point. Neighborhood graphs—used in manifold learning, spectral algorithms, semisupervised algorithms, and elsewhere—can be built by connecting each point to all other points within a certain radius; doing so requires range search at each point. Computing point-correlation statistics, distance-based outliers/anomalies, and intrinsic dimensionality estimates also requires range search.

A growing body of work uses spatial data structures to accelerate the computation of these and other proximity problems for statistical tasks. 
This line of techniques, coined “n-body methods” in [11], has shown impressive speedups on a variety of tasks including density estimation [12], Gaussian process regression [25], non-parametric classification [17], matrix approximation [14], and kernel summation [15]. These methods achieve speedups by pruning out large portions of the search space with bounds derived from KD-trees or metric trees that are augmented with statistics of the database. Some of these algorithms are direct applications of range search; others rely on very similar pruning techniques. One fairly substantial limitation of these methods is that they all derive bounds from the triangle inequality and thus only work for notions of distance that are metrics.

The present work is on performing range search efficiently when the notion of dissimilarity is not a metric, but a Bregman divergence. The family of Bregman divergences includes the standard squared ℓ2 distance, Mahalanobis distance, KL-divergence, Itakura-Saito divergence, and a variety of matrix dissimilarity measures. We are particularly interested in the KL-divergence, as it is not a metric and is used extensively in machine learning. It appears naturally in document analysis, since documents
Because of the widespread use of histogram representations,\nthis generalization is important.\nThe task of ef\ufb01cient Bregman range search presents a technical challenge. Our algorithm cannot\nrely on the triangle inequality, so bounds must be derived from geometric properties of Bregman\ndivergences. The algorithm makes use of a simple space decomposition scheme based on Bregman\nballs [8], but deploying this decomposition for the range search problem is not straightforward. In\nparticular, one of the bounds required results in a non-convex program to be solved, and the other\nrequires comparing two convex bodies. We derive properties of Bregman divergences that imply\nef\ufb01cient algorithms for these problems.\n\n2 Background\n\nIn this section, we brie\ufb02y review prior work on Bregman divergences and proximity search. Breg-\nman divergences originate in [7] and have become common in the machine learning literature, e.g.\n[3, 4].\nDe\ufb01nition 1. Let f : RD \u2192 R be strictly convex and differentiable. The Bregman divergence based\non f is\n\ndf (x, y) \u2261 f(x) \u2212 f(y) \u2212 (cid:104)\u2207f(y), x \u2212 y(cid:105).\n\n2, and f(x) =(cid:80)\n\n2(cid:107)x \u2212 y(cid:107)2\n\ndf (x, y) =(cid:80)\n\nAs can be seen from the de\ufb01nition, a Bregman divergence measures the distance between a func-\n2(cid:107)x(cid:107)2\ntion and its \ufb01rst-order taylor series approximation. Standard examples include f(x) = 1\n2,\ni xi log xi, giving the KL-divergence\nyielding the (cid:96)2\nThe Itakura-Saito divergence and Mahalanobis distance are other examples\n\n2 distance df (x, y) = 1\ni xi log xi\nyi\nof Bregman divergences.\nStrict convexity of f implies that df (x, y) \u2265 0, with equality if, and only if, x = y. Though Bregman\ndivergences satisfy this non-negativity property, like metrics, the similarities to metrics end there. 
In particular, a Bregman divergence need not satisfy the triangle inequality or be symmetric.

Bregman divergences do possess several geometric properties related to the convexity of the base function. Most notably, d_f(x, y) is always convex in x (though not necessarily in y), implying that the Bregman ball

B_f(µ, R) ≡ {x | d_f(x, µ) ≤ R}

is a convex body.

Recently, work on a variety of geometric tasks with Bregman divergences has appeared. In [19], geometric properties of Bregman Voronoi diagrams are derived. [1] studies core-sets under Bregman divergences and gives a provably correct approximation algorithm for k-median clustering. [13] examines sketching Bregman (and Csiszár) divergences. [8] describes the Bregman ball tree in the context of nearest neighbor search; we will describe this work further momentarily. As these papers demonstrate, there has been substantial recent interest in developing basic geometric algorithms for Bregman divergences. The present paper contributes an effective algorithm for range search, one of the core problems of computational geometry [2], to this repertoire.

The Bregman ball tree (BB-tree) was introduced in the context of nearest neighbor (NN) search [8]. Though NN search has a similar flavor to range search, the bounds that suffice for NN search are not sufficient for range search. Thus the utility of the BB-tree for statistical tasks is at present rather seriously limited. Moreover, though the extension of metric trees to range search (and hence to the previously described statistical tasks) is fairly straightforward because of the triangle inequality, the extension of BB-trees is substantially more complex.

Several other papers on Bregman proximity search have appeared very recently. Nielsen et al. study some improvements to the BB-tree [21] and develop a related data structure which can be used with symmetrized divergences [20]. Zhang et al. 
develop extensions of the VA-file and the R-tree for Bregman divergences [26]. These data structures can be adapted to work for Bregman divergences, as the authors of [26] demonstrate, because bounds on the divergence from a query to a rectangular cell can be computed cheaply; however this idea appears limited to decomposable Bregman divergences—divergences that decompose into a sum over one-dimensional divergences.1 Nevertheless, these data structures seem practical and effective and it would be interesting to apply them to statistical tasks.2 The applicability of rectangular cell bounds was independently demonstrated in [9, Chapter 7], where it is mentioned that KD-trees (and relatives) can be used for decomposable Bregman divergences. That chapter also contains theoretical results on the general Bregman range search problem attained by adapting known data structures via the lifting technique (also used in [26] and previously in [19]).

3 Range search with BB-trees

In this section, we review the Bregman ball tree data structure and outline the range search algorithm. The search algorithm relies on geometric properties of Bregman divergences, which we derive in section 4.

The BB-tree is a hierarchical space decomposition based on Bregman balls. It is a binary tree defined over the database such that each level provides a partition of the database points. As the tree is descended, the partition becomes finer and finer. Each node i in the tree owns a subset of the points X_i and also defines a Bregman ball B_f(µ, R) such that X_i ⊂ B_f(µ, R). If i is an interior node, it has two children j and k that encapsulate database points X_j and X_k. Moreover, each point in X_i is in exactly one of X_j and X_k. 
Each leaf node contains some small number of points and the root node contains the entire database.

Here we use this simple form of BB-tree, though our results apply to any hierarchical space decomposition based on Bregman balls, such as the more complex tree described in [21].

To encourage a rapid rate of radius decrease, an effective build algorithm will split a node into two well-separated and compact children. Thus a reasonable method for building BB-trees is to perform a top-down hierarchical clustering. Since k-means has been generalized to arbitrary Bregman divergences [4], it is a natural choice for a clustering algorithm.

3.1 Search algorithm

We now turn to the search algorithm, which uses a branch-and-bound approach. We develop the necessary novel bounding techniques in the next section.

Suppose we are interested in returning all points within distance γ of a query q—i.e. we hope to retrieve all database points lying inside of B_q ≡ B_f(q, γ). The search algorithm starts at the root node and recursively explores the tree. At a node i, the algorithm compares the node's Bregman ball B_x to B_q. There are three possible situations. First, if B_x is contained in B_q, then all x ∈ B_x are in the range of interest. We can thus stop the recursion and return all the points associated with the node without explicitly computing the divergence to any of them. This type of pruning is called inclusion pruning. Second, if B_x ∩ B_q = ∅, the algorithm can prune out B_x and stop the recursion; none of these points are in range. This is exclusion pruning. See figure 1. All performance gains from using the algorithm come from these two types of pruning. The third situation is B_x ∩ B_q ≠ ∅ and B_x ⊄ B_q. In this situation, the algorithm cannot perform any pruning, so recurses on the children of node i. 
If i is a leaf node, then the algorithm computes the divergence to each database point associated with i and returns those elements within range.

The two types of pruning—inclusion and exclusion—have been applied to a variety of problems with metric and KD-trees, see e.g. [11, 12, 25] and the papers cited previously. Thus though we focus on range search, these types of prunings are useful in a broad range of statistical problems. A third type of pruning, approximation pruning, is useful in tasks like kernel density estimation [12]. This type of pruning is another form of inclusion pruning and can be accomplished with the same technique.

1This assumption is implicit in the proof of [26, Lemma 3.1] and is used in the revised lower bound computation as well.
2[26] had not yet been published at the time of submission of the present work and hence we have not yet done a detailed comparison.

Figure 1: The two pruning scenarios. The dotted, shaded object is the query range and the other is the Bregman ball associated with a node of the BB-tree.

It has been widely observed that the performance of spatial decomposition data structures degrades with increasing dimensionality. In order to manage high-dimensional datasets, practitioners often use approximate proximity search techniques [8, 10, 17]. In the experiments, we explore one way to use the BB-tree in an approximate fashion.

Determining whether two Bregman balls intersect, or whether one Bregman ball contains another, is non-trivial. For the range search algorithm to be effective, it must be able to determine these relationships very quickly. In the case of metric balls, these determinations are trivially accomplished using the triangle inequality. 
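The recursion of Section 3.1 can be sketched as follows. This is an illustrative sketch of ours, not the paper's implementation; the two geometric predicates are treated as black boxes, and computing them for Bregman balls is exactly the subject of Section 4:

```python
from collections import namedtuple

# A BB-tree node: center mu, radius R, the points it owns, and its children.
Node = namedtuple("Node", ["mu", "R", "points", "children"])

def range_search(node, query, gamma, df, contained, disjoint, out):
    """Collect into `out` every point x in node's subtree with df(x, query) <= gamma."""
    if contained(node):       # inclusion pruning: take all points, no divergences computed
        out.extend(node.points)
    elif disjoint(node):      # exclusion pruning: nothing here is in range
        return
    elif node.children:       # cannot decide: recurse on the children
        for child in node.children:
            range_search(child, query, gamma, df, contained, disjoint, out)
    else:                     # leaf: check each point directly
        out.extend(x for x in node.points if df(x, query) <= gamma)

# Toy 1-D demo with df(x, q) = |x - q| (a metric, so both ball tests follow
# from the triangle inequality; for Bregman balls the tests of Section 4 apply).
leaves = [Node(0.5, 0.5, [0.0, 1.0], ()), Node(2.5, 0.5, [2.0, 3.0], ()),
          Node(4.5, 0.5, [4.0, 5.0], ()), Node(6.5, 0.5, [6.0, 7.0], ())]
left = Node(1.5, 1.5, [0.0, 1.0, 2.0, 3.0], (leaves[0], leaves[1]))
right = Node(5.5, 1.5, [4.0, 5.0, 6.0, 7.0], (leaves[2], leaves[3]))
root = Node(3.5, 3.5, left.points + right.points, (left, right))

q, gamma = 2.2, 1.5
df = lambda x, y: abs(x - y)
in_range = []
range_search(root, q, gamma, df,
             contained=lambda n: df(n.mu, q) + n.R <= gamma,
             disjoint=lambda n: df(n.mu, q) > n.R + gamma,
             out=in_range)
# in_range now holds the points x with |x - q| <= 1.5
```

In the demo the triangle inequality makes both predicates one-liners; the point of Section 4 is to supply analogous tests when no triangle inequality is available.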
Since we cannot rely on the triangle inequality for an arbitrary Bregman divergence, we must develop novel techniques.

4 Computation of ball intersection

In this section we lay out the main technical contribution of the paper. We develop algorithms for determining (1) whether one Bregman ball is contained in another and (2) whether two Bregman balls have non-empty intersection.

4.1 Containment

Let B_q ≡ B_f(µ_q, R_q) and B_x ≡ B_f(µ_x, R_x). We wish to evaluate if B_x ⊂ B_q. This problem is equivalent to testing whether

d_f(x, µ_q) ≤ R_q

for all x ∈ B_x. Simplifying notation, the core problem is determining

max_x d_f(x, q) subject to: d_f(x, µ) ≤ R. (maxP)

Unfortunately, this problem is not convex. As is well known, non-convex problems are in general much more computationally difficult to solve than convex ones. This difficulty is particularly problematic in the case of range search, as the search algorithm will need to solve this problem repeatedly in the course of evaluating a single range query. Moreover, finding a sub-optimal solution (i.e. a point x ∈ B_f(µ, R) that is not the max) will render the solution to the range search incorrect.

Remarkably, beneath (maxP) lies a geometric structure that allows an efficient solution. 
We now show the main claim of this section, which implies a simple, efficient algorithm for solving (maxP). We denote the convex conjugate of f by

f∗(x) ≡ sup_y {⟨x, y⟩ − f(y)}

and define x′ ≡ ∇f(x), q′ ≡ ∇f(q), etc.

Claim 1. Suppose that the domain of f is C and that B_f(µ, R) ⊂ relint(C). Furthermore, assume that ‖∇²f∗(x′)‖ is lower-bounded for all x′ such that x ∈ B_f(µ, R). Let x_p be the optimal solution to (maxP). Then x′_p lies in the set {θµ′ + (1 − θ)q′ | θ ≥ 0}.

Proof. Though the program is not concave, the Lagrange dual still provides an upper bound on the optimal solution value (by weak duality). The Lagrangian is

ν(x, λ) ≡ d_f(x, q) − λ(d_f(x, µ) − R), (1)

where λ ≥ 0. Differentiating (1) with respect to x and setting it equal to 0, we get

∇f(x_p) − ∇f(q) − λ∇f(x_p) + λ∇f(µ) = 0,

which implies that

∇f(x_p) = (1/(1 − λ))(∇f(q) − λ∇f(µ)). (2)

We need to check what type of extremum x_p is:

∇²_x ν(x, λ) = (1 − λ)∇²f(x).

Thus for λ > 1, the x_p defined implicitly in (2) is a maximum. Setting θ ≡ −λ/(1 − λ) gives

∇f(x_p) = θµ′ + (1 − θ)q′,

where θ ∈ (−∞, 0) ∪ (1, ∞); we restrict attention to θ ∈ (1, ∞) since that is where λ > 1 and hence x_p is a maximum. Let x′_θ ≡ θµ′ + (1 − θ)q′ and x_θ ≡ ∇f∗(x′_θ). 
The Lagrange dual is

L(θ) ≡ d_f(x_θ, q) + (θ/(1 − θ))(d_f(x_θ, µ) − R).

Then for any θ ∈ (1, ∞), we have

d_f(x_p, q) ≤ L(θ) (3)

by weak duality. We now show that there is a θ∗ > 1 satisfying d_f(x_θ∗, µ) = R. One can check that the derivative of d_f(x_θ, µ) with respect to θ is

(θ − 1)(µ′ − q′)⊤∇²f∗(x′_θ)(µ′ − q′). (4)

Since ‖∇²f∗‖ > c for some positive c, (4) is at least (θ − 1)c‖µ′ − q′‖². We conclude that d_f(x_θ, µ) is increasing at an increasing rate with θ. Thus there must be some θ∗ > 1 such that d_f(x_θ∗, µ) = R. Plugging this θ∗ into the dual, we get

L(θ∗) = d_f(x_θ∗, q) + (θ∗/(1 − θ∗))(d_f(x_θ∗, µ) − R) = d_f(x_θ∗, q).

Combining with (3), we have

d_f(x_p, q) ≤ d_f(x_θ∗, q).

Finally, since (maxP) is a maximization problem and since x_θ∗ is feasible, the previous inequality is actually an equality, giving the theorem.

Thus determining if B_x ⊂ B_q reduces to searching for θ∗ > 1 satisfying

d_f(x_θ∗, µ_x) = R_x

and comparing d_f(x_θ∗, µ_q) to R_q. Note that there is no obvious upper bound on θ∗ in general, though one may be able to derive such a bound for a particular Bregman divergence. Without such an upper bound, one needs to use a line search method that does not require one, such as Newton's method or the secant method. 
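As a concrete illustration (ours, not the paper's code), the containment test is particularly simple for f(x) = ½‖x‖², since ∇f and ∇f∗ are both the identity and x_θ = θµ_x + (1 − θ)µ_q in closed form; a bisection stands in for Newton's or the secant method, using the monotonicity of d_f(x_θ, µ_x) to bracket θ∗:

```python
import numpy as np

def df_sq(x, y):
    # Bregman divergence of f(x) = 0.5 * ||x||^2, i.e. d_f(x, y) = 0.5 * ||x - y||^2
    return 0.5 * float(np.dot(x - y, x - y))

def ball_contained(mu_x, R_x, mu_q, R_q, tol=1e-9):
    """Is B_f(mu_x, R_x) contained in B_f(mu_q, R_q) for f(x) = 0.5 * ||x||^2?

    Locates theta* > 1 with d_f(x_theta*, mu_x) = R_x by bisection (the
    function is monotonically increasing there), then compares
    d_f(x_theta*, mu_q) against R_q, following Claim 1.
    """
    if np.allclose(mu_x, mu_q):
        return R_x <= R_q + tol
    x_theta = lambda t: t * mu_x + (1.0 - t) * mu_q
    g = lambda t: df_sq(x_theta(t), mu_x) - R_x
    lo, hi = 1.0, 2.0
    while g(hi) < 0.0:          # grow the bracket until the root is enclosed
        lo, hi = hi, 2.0 * hi
    for _ in range(100):        # bisection on the monotone function g
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
    return df_sq(x_theta(0.5 * (lo + hi)), mu_q) <= R_q + tol
```

For the KL-divergence one would instead map through ∇f(x) = log x + 1 and ∇f∗(x′) = exp(x′ − 1); the search over θ is unchanged.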
Both of these line search methods will converge quickly (quadratic in the case of Newton's method, slightly slower in the case of the secant method): since d_f(x_θ, µ_x) is monotonic in θ, there is a unique root.

Interestingly, the convex program evaluated in [8] has a similar solution space, which we will again encounter in the next section.

4.2 Non-empty intersection

In this section we provide an algorithm for evaluating whether B_q ∩ B_x = ∅. We will need to make use of the Pythagorean theorem, a standard property of Bregman divergences.

Theorem 1 (Pythagorean). Let C ⊂ R^D be a convex set and let x ∈ C. Then for all z, we have

d_f(x, z) ≥ d_f(x, y) + d_f(y, z),

where y ≡ argmin_{y∈C} d_f(y, z) is the projection of z onto C.

At first glance, the Pythagorean theorem may appear to be a triangle inequality for Bregman divergences. However, the inequality is actually the reverse of the standard triangle inequality and only applies to the very special case when y is the projection of z onto a convex set containing x. We now prove the main claim of this section.

Claim 2. Suppose that B_x ∩ B_q ≠ ∅. Then there exists a w in

{∇f∗(θµ′_x + (1 − θ)µ′_q) | θ ∈ [0, 1]}

such that w ∈ B_q ∩ B_x.

Proof. Let z ∈ B_x ∩ B_q. We will refer to the set {∇f∗(θµ′_x + (1 − θ)µ′_q) | θ ∈ [0, 1]} as the dual curve.

Let x be the projection of µ_q onto B_x and let q be the projection of µ_x onto B_q. Both x and q are on the dual curve (this fact follows from [8, Claim 2]), so we are done if we can show that at least one of them lies in the intersection of B_x and B_q. Suppose towards contradiction that neither is in the intersection.

The projection of x onto B_q lies on the dual curve between x and µ_q; thus projecting x onto B_q yields q and similarly projecting q onto B_x yields x. 
By the Pythagorean theorem,

d_f(z, x) ≥ d_f(z, q) + d_f(q, x), (5)

since q is the projection of x onto B_q and since z ∈ B_q. Similarly,

d_f(z, q) ≥ d_f(z, x) + d_f(x, q). (6)

Inserting (5) into (6), we get

d_f(z, q) ≥ d_f(z, q) + d_f(q, x) + d_f(x, q).

Rearranging, we get that d_f(q, x) + d_f(x, q) ≤ 0. Thus both d_f(q, x) = 0 and d_f(x, q) = 0, implying that x = q. But since x ∈ B_x and q ∈ B_q, we have that x = q ∈ B_x ∩ B_q. This is the desired contradiction.

The preceding claim yields a simple algorithm for determining whether two balls B_x and B_q are disjoint: project µ_x onto B_q using the line search algorithm discussed previously. The projected point will obviously be in B_q; if it is also in B_x, the two balls intersect.3 Otherwise, they are disjoint and exclusion pruning can be performed.

3Claim 2 actually only shows that at least one of two projections—µ_x onto B_q and µ_q onto B_x—will be in the intersection. However, one can show that both projections will be in the intersection using the monotonicity of d_f(x_θ, ·) in θ.

5 Experiments

We compare the performance of the search algorithm to standard brute force search on several datasets. We are particularly interested in text applications as histogram representations are common, datasets are often very large, and efficient search is broadly useful. We experimented with the following datasets, many of which are fairly high-dimensional.

• pubmed-D. We used one million documents from the pubmed abstract corpus (available from the UCI collection). We generated a correlated topic model (CTM) [5] with D = 4, 8, . . . , 256 topics. For each D, we built a CTM using a training set and then performed inference on the 1M documents to generate the topic histograms.

Figure 2: Approximate search. 
The y-axis is on a logarithmic scale and is the speedup over brute force search. The x-axis is on a linear scale and is the average percentage of the points in range returned (i.e. the average recall).

• Corel histograms. This data set consists of 60k color histograms of dimensionality 64 generated from the Corel image datasets.

• rcv-D. Latent Dirichlet allocation was applied to 500K documents from the rcv1 [16] corpus to generate topic histograms for each [6]. D is set to 8, 16, 32, . . . , 256.

• Semantic space. This dataset is a 371-dimensional representation of 5000 images from the Corel stock photo collection. Each image is represented as a distribution over keywords [24].

All of our experiments are for the KL-divergence. Although the KL-divergence is widely used, little is known about efficient proximity techniques for it. In contrast, the squared ℓ2 and Mahalanobis distances can be handled by metric methods, for which there is a huge literature. Application of the range search algorithm for the KL-divergence raises one technical point: Claim 1 requires that the KL-ball being investigated lies within the domain of the KL-divergence. It is possible that the ball will cross the domain boundary (x_i = 0), though we found that this was not a significant issue. When it did occur (which can be checked by evaluating d_f(µ, x_θ) for large θ), we simply did not perform inclusion pruning for that node.

There are two regimes where range search is particularly useful: when the radius γ is very small and when it is large. When γ is small, range search is useful in instance-based learning algorithms like locally weighted regression, which need to retrieve points close to each test point. It is also useful for generating neighborhood graphs. 
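As an illustration of the small-radius regime, building an ε-neighborhood graph under the KL-divergence amounts to one range query per point. In this sketch (ours, not the paper's code; brute-force queries stand in for the BB-tree search):

```python
import numpy as np

def kl(x, y):
    # KL-divergence between strictly positive probability vectors
    return float(np.sum(x * np.log(x / y)))

def neighborhood_graph(hists, gamma):
    """Adjacency lists: j is a neighbor of i when kl(hists[j], hists[i]) <= gamma.

    One range query (here a brute-force scan) per point; a BB-tree would
    replace the inner scan. The KL-divergence is asymmetric, so the
    resulting graph is directed.
    """
    n = len(hists)
    return {i: [j for j in range(n) if j != i and kl(hists[j], hists[i]) <= gamma]
            for i in range(n)}

hists = [np.array([0.5, 0.3, 0.2]),
         np.array([0.45, 0.35, 0.2]),
         np.array([0.1, 0.1, 0.8])]
graph = neighborhood_graph(hists, gamma=0.05)  # only the two similar histograms connect
```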
When γ is large enough that B_f(q, γ) will contain most of the database, range search is potentially useful for applications like distance-based outlier detection and anomaly detection. We provide experiments for both of these regimes.

Table 1 shows the results for exact range search. For the small radius experiments, γ was chosen so that about 20 points would be inside the query ball (on average). On the pubmed datasets, we are getting one to two orders of magnitude speed-up across all dimensionalities. On the rcv datasets, the BB-tree range search algorithm is an order of magnitude faster than brute force search except for the two datasets of highest dimensionality. The algorithm provides a useful speedup on corel, but no speedup on semantic space. We note that the semantic space dataset is both high-dimensional (371 dimensions) and quite small (5k), which makes it very hard for proximity search. The algorithm reflects the widely observed phenomenon that the performance of spatial decomposition data structures degrades with dimensionality, but still provides a useful speedup on several moderate-dimensional datasets.

[Figure 2 appears here; panels: corel, pmed4–pmed32, pmed64–pmed256, rcv8–rcv32, rcv64–rcv256, semantic space.]

Table 1: Exact range search.

dataset          dimensionality   speedup (small radius)   speedup (large radius)
corel            64               2.53                     3.4
pubmed4          4                371.6                    5.1
pubmed8          8                102.7                    9.7
pubmed16         16               37.3                     12.8
pubmed32         32               18.6                     47.1
pubmed64         64               13.26                    21.6
pubmed128        128              15.0                     120.4
pubmed256        256              18.9                     39.0
rcv8             8                48.1                     8.9
rcv16            16               23.0                     21.9
rcv32            32               16.4                     16.4
rcv64            64               11.4                     9.6
rcv128           128              6.1                      3.1
rcv256           256              1.1                      1.9
semantic space   371              .7                       1.0

For the large radius experiments, γ was chosen so that all but about 100-300 points would be in range. The results here are more varied than for small γ, but we are still getting useful speedups across most of the datasets. Interestingly, the amount of speedup seems less dependent on the dimensionality in comparison to the small γ experiments.

Finally, we investigate approximate search, which we consider the most likely use of this algorithm. There are many ways to use the BB-tree in an approximate way. Here, we follow [18] and simply cut off the search process early. We are thus guaranteed to get only points within the specified range (perfect precision), but we may not get all of them (less than perfect recall). In instance-based learning algorithms, this loss of recall is often tolerable as long as a reasonable number of points are returned. Thus a practical way to deploy the range search algorithm is to run it until enough points are recovered. In this experiment, γ was set so that about 50 points would be returned. Figure 2 shows the results.

These are likely the most relevant results to practical applications. They demonstrate that the proposed algorithm provides a speedup of up to four orders of magnitude with a high recall.

6 Conclusion

We presented the first algorithm for efficient ball range search when the notion of dissimilarity is an arbitrary Bregman divergence. This is an important step towards generalizing the efficient proximity algorithms from ℓ2 (and metrics) to the family of Bregman divergences, but there is plenty more to do. 
First, it would be interesting to see if the dual-tree approach promoted in [11, 12] and elsewhere can be used with BB-trees. This generalization appears to require more complex bounding techniques than those discussed here. A different research goal is to develop efficient algorithms for proximity search that have rigorous guarantees on run-time; theoretical questions about proximity search with Bregman divergences remain largely open. Finally, the work in this paper provides a foundation for developing efficient statistical algorithms using Bregman divergences; fleshing out the details for a particular application is an interesting direction for future research.

References

[1] Marcel Ackermann and Johannes Blömer. Coresets and approximate clustering for Bregman divergences. In Proceedings of the Symposium on Discrete Algorithms (SODA), 2009.

[2] Pankaj K. Agarwal and Jeff Erickson. Geometric range searching and its relatives. In Advances in Discrete and Computational Geometry, pages 1–56. American Mathematical Society, 1999.

[3] Katy Azoury and Manfred Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

[4] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, Oct 2005.

[5] David Blei and John Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1(1):17–35, 2007.

[6] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

[7] L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

[8] Lawrence Cayton. 
Fast nearest neighbor retrieval for Bregman divergences. In Proceedings of the International Conference on Machine Learning, 2008.

[9] Lawrence Cayton. Bregman Proximity Search. PhD thesis, University of California, San Diego, 2009.

[10] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004.

[11] Alexander Gray and Andrew Moore. 'N-body' problems in statistical learning. In Advances in Neural Information Processing Systems, 2000.

[12] Alexander Gray and Andrew Moore. Nonparametric density estimation: Toward computational tractability. In SIAM International Conference on Data Mining, 2003.

[13] Sudipto Guha, Piotr Indyk, and Andrew McGregor. Sketching information divergences. In Conference on Learning Theory, 2007.

[14] Michael P. Holmes, Alexander Gray, and Charles Lee Isbell. QUIC-SVD: Fast SVD using cosine trees. In Advances in Neural Information Processing Systems 21, 2008.

[15] Dongryeol Lee and Alexander Gray. Fast high-dimensional kernel summations using the Monte Carlo multipole method. In Advances in Neural Information Processing Systems 21, 2008.

[16] D. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004.

[17] Ting Liu, Andrew Moore, and Alexander Gray. New algorithms for efficient high-dimensional nonparametric classification. Journal of Machine Learning Research, 2006.

[18] Ting Liu, Andrew Moore, Alexander Gray, and Ke Yang. An investigation of practical approximate neighbor algorithms. In Advances in Neural Information Processing Systems, 2004.

[19] Frank Nielsen, Jean-Daniel Boissonnat, and Richard Nock. On Bregman Voronoi diagrams. 
In Symposium on Discrete Algorithms, pages 746–755, 2007.

[20] Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In IEEE International Conference on Multimedia & Expo, 2009.

[21] Frank Nielsen, Paolo Piro, and Michel Barlaud. Tailored Bregman ball trees for effective nearest neighbors. In European Workshop on Computational Geometry, 2009.

[22] Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words. In 31st Annual Meeting of the ACL, pages 183–190, 1993.

[23] Jan Puzicha, Joachim Buhmann, Yossi Rubner, and Carlo Tomasi. Empirical evaluation of dissimilarity measures for color and texture. In Proceedings of the International Conference on Computer Vision (ICCV), 1999.

[24] N. Rasiwasia, P. Moreno, and N. Vasconcelos. Bridging the gap: query by semantic example. IEEE Transactions on Multimedia, 2007.

[25] Yirong Shen, Andrew Ng, and Matthias Seeger. Fast Gaussian process regression using kd-trees. In Advances in Neural Information Processing Systems, 2006.

[26] Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, and Anthony Tung. Similarity search on Bregman divergence: towards non-metric indexing. In International Conference on Very Large Databases (VLDB), 2009.
", "award": [], "sourceid": 255, "authors": [{"given_name": "Lawrence", "family_name": "Cayton", "institution": null}]}