{"title": "`N-Body' Problems in Statistical Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 527, "abstract": null, "full_text": "'N-Body' Problems in Statistical Learning \n\nAlexander G. Gray \n\nAndrew W. Moore \n\nDepartment of Computer Science \n\nRobotics Inst. and Dept. Comp. Sci. \n\nCarnegie Mellon University \n\nagray@cs.cmu.edu \n\nCarnegie Mellon University \n\nawm@cs.cmu.edu \n\nAbstract \n\nWe present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N^2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric techniques which are applicable in principle to any 'N-body' computation, including large-scale mixtures of Gaussians, RBF neural networks, and HMMs. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. \n\n1 Introduction \n\nThis paper is about accelerating a wide class of statistical methods that are naively quadratic in the number of datapoints. 
\n1 We introduce a family of dual kd-tree traversal algorithms for these problems. They are the statistical siblings of powerful state-of-the-art N-body simulation algorithms [1, 4] of computational physics, but the computations within statistical learning present new opportunities for acceleration and require techniques more general than those which have been exploited for the special case of potential-based problems involving forces or charges. \n\nWe describe in detail a dual-tree algorithm for calculating the two-point correlation, the simplest case of the problems we consider; for the five other statistical problems we consider, we show only performance results for lack of space. The last of our examples, \n\n1 In the general case, when we are computing distances between two different datasets having sizes N1 and N2, as in nearest-neighbor classification with separate training and test sets, say, the cost is O(N1 N2). \n\nFigure 1: A kd-tree. (a) Nodes at level 3. (b) Nodes at level 5. The dots are the individual data points. The sizes and positions of the disks show the node counts and centroids. The ellipses and rectangles show the covariances and bounding boxes. (c) The rectangles show the nodes pruned during a RangeSearch for one (depicted) query and radius. (d) More pruning is possible using RangeCount instead of RangeSearch. \n\nthe n-point correlation, illustrates a generalization from all-point-pairs problems to all-n-tuples problems, which are much harder (naively O(N^n)). For all the examples, we believe there exist no exact algorithms which are faster either empirically or theoretically, nor any approximate algorithms that are faster while providing guarantees of acceptably high accuracy (as ours do). For n-tuple N-body problems in particular, this type of algorithm design appears to have surpassed the existing computational barriers. 
In addition, all the algorithms in this paper can be compactly defined and are easy to implement. \n\nStatistics and geometry. We proceed by viewing these statistical problems as geometric problems, exploiting the data's hyperstructure. Each algorithm utilizes multiresolution kd-trees, providing a geometric partitioning of the data space which is used to reason about entire chunks of the data simultaneously. \n\nA review of kd-trees and mrkd-trees. A kd-tree [3] records a d-dimensional data set containing N records. Each node represents a set of data points by their bounding box. Non-leaf nodes have two children, obtained by splitting the widest dimension of the parent's bounding box. For the purposes of this paper, nodes are split until they contain only one point, where they become leaves. An mrkd-tree [2, 6] is a conventional kd-tree decorated, at each node, with extra statistics about the node's data, such as their count, centroid, and covariance. They are an instance of the idea of cached sufficient statistics [8] and are quite efficient in practice. 2 See Figure 1. \n\n2 The 2-point correlation function \n\nThe two-point correlation is a spatial statistic which is of fundamental importance in many natural sciences, in particular astrophysics and biology. It can be thought of roughly as a measure of the clumpiness of a set of points. It is easily defined as the number of pairs of points in a dataset which lie within a given radius r of each other. \n\n2.1 Previous approaches \n\nQuadratic algorithm. The most naive approach is to simply compare each datum to each other one, incrementing a count if the distance between them is less than r. This has O(N^2) cost, unacceptably high for problems of practical interest. \n\n2 mrkd-trees can be built quickly, in time O(dN log N + d^2 N). 
Although we have not needed to do so, they can be modified to become disk-resident for data sets with billions of records, and they can be efficiently updated incrementally. They scale poorly to higher dimensions, but recent work [7] significantly remedies the dimensionality problem. \n\nBinning and gridding algorithms. The schemes in widespread use [12, 13] are mainly of this sort. The idea of binning is simply to divide the data space into a fine grid defining a set of bins, perform the quadratic algorithm on the bins as if they were individual data, then multiply by the bin sizes as appropriate to get an estimate of the total count. The idea of gridding is to divide the data space into a coarse grid, perform the quadratic algorithm within each bin, and sum the results over all bins to get an estimate of the total count. These are both of course very approximate methods yielding large errors. They are not usable when r is small or when r is large, respectively. \n\nRange-searching with a kd-tree. An approach to the two-point correlation computation that has been taken is to treat it as a range-searching problem [5, 10], since kd-trees have been historically almost synonymous with range-searching. The idea is that we will make each datapoint in turn a query point and then execute a range search of the kd-tree to find all other points within distance r of the query. A search is a depth-first traversal of the kd-tree, always checking the minimum possible distance dmin between the query and the hyper-rectangle surrounding the current node. If dmin > r there is no point in visiting the node's children, and computation is saved. We call this exclusion-based pruning. \n\nThe range searching avoids computing most of the distances between pairs of points further than r apart, which is a considerable saving if r is small. But is it the best we can do? And what if r is large? We now propose several layers of new approaches. 
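As a concrete baseline for the faster methods that follow, the quadratic algorithm of Section 2.1 can be sketched in a few lines. This is our own minimal illustration, not the authors' code; the function name is hypothetical.

```python
import numpy as np

def two_point_naive(data, r):
    # O(N^2) two-point correlation: count unordered pairs of points
    # whose Euclidean distance is strictly less than r.
    n = len(data)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(data[i] - data[j]) < r:
                count += 1
    return count
```

Every distance is computed explicitly, which is exactly the N^2 cost that the tree-based methods below avoid.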
\n\n2.2 Better geometric approaches: new algorithms \n\nSingle-tree search (Range-Counting Algorithm). A straightforward extension can exploit the fact that, unlike conventional use of range searching, these statistics frequently don't need to retrieve all the points in the radius but merely to count them. The mrkd-tree has, in each node, the count of the number of data it contains, the simplest kind of cached sufficient statistic. At a given node, if the distance between the query and the farthest point of the bounding box of the data in the node is smaller than the radius r, clearly every datum in the node is within range of the query. We can then simply add the node's stored count to the total count. We call this subsumption. 3 (Note that both exclusion and subsumption are simple computations because the geometric regions are always axis-parallel rectangles.) This paper introduces new single-tree algorithms for most of our examples, though it is not our main focus. \n\nDual-tree search. This is the primary topic of this paper. The idea is to consider the query points in chunks as well, as defined by nodes in a kd-tree. In the general case where the query points are different from the data being queried, a separate kd-tree is built for the query points; otherwise a query node and a data node are simply pointers into the same kd-tree. Dual-tree search can be thought of as a simultaneous traversal of two trees, instead of iterating over the query points in an outer loop and only exploiting single-tree search in the inner loop. Dual-tree search is based on node-node comparisons, while Single-tree search was based on point-node comparisons. \n\nPseudocode for a recursive procedure called TwoPoint() is shown in Figure 2. It counts the number of pairs of points (xq in QNODE, xd in DNODE) such that |xq - xd| < r. 
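As a concrete companion to this description, the dual-tree recursion with exclusion and subsumption pruning can be sketched as runnable Python. The node structure, helper names, and leaf handling here are our own illustration under simple assumptions (median splits, one point per leaf), not the paper's implementation.

```python
import numpy as np

class Node:
    # A kd-tree node storing its bounding box and point count
    # (the cached sufficient statistics needed for pruning).
    def __init__(self, pts):
        self.count = len(pts)
        self.mins, self.maxs = pts.min(axis=0), pts.max(axis=0)
        if len(pts) == 1:
            self.pt, self.left, self.right = pts[0], None, None
        else:
            dim = int(np.argmax(self.maxs - self.mins))  # widest dimension
            pts = pts[pts[:, dim].argsort()]
            mid = len(pts) // 2
            self.pt = None
            self.left, self.right = Node(pts[:mid]), Node(pts[mid:])

def min_dist(a, b):
    # Smallest possible distance between any point of box a and any of box b.
    gap = np.maximum(0.0, np.maximum(a.mins - b.maxs, b.mins - a.maxs))
    return np.sqrt((gap ** 2).sum())

def max_dist(a, b):
    # Largest possible distance between any point of box a and any of box b.
    span = np.maximum(np.abs(a.maxs - b.mins), np.abs(b.maxs - a.mins))
    return np.sqrt((span ** 2).sum())

def two_point(q, d, r):
    # Count ordered pairs (xq in q, xd in d) with |xq - xd| < r.
    if min_dist(q, d) > r:               # exclusion: no pair can match
        return 0
    if max_dist(q, d) < r:               # subsumption: every pair matches
        return q.count * d.count
    if q.left is None and d.left is None:  # two leaves: one real distance
        return int(np.linalg.norm(q.pt - d.pt) < r)
    if q.left is None:
        return two_point(q, d.left, r) + two_point(q, d.right, r)
    if d.left is None:
        return two_point(q.left, d, r) + two_point(q.right, d, r)
    return (two_point(q.left, d.left, r) + two_point(q.left, d.right, r) +
            two_point(q.right, d.left, r) + two_point(q.right, d.right, r))
```

Calling `two_point(root, root, r)` on one tree counts ordered pairs, including self-pairs, which agrees with a brute-force double loop over the same data.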
Before doing any real work, the procedure checks whether it can perform an exclusion pruning (in which case the call terminates, contributing nothing) or a subsumption pruning (in which case the call terminates, adding to the total the product of the numbers of points in the two nodes). If neither of these prunes occurs, then depending on whether QNODE and/or DNODE are leaves, the corresponding recursive calls are made. \n\n3 Subsumption can also be exploited when other aggregate statistics, such as centroids or covariances of sets of points in a range, are required [2, 14, 9]. \n\nTwoPoint(QNODE, DNODE, r) \n  if excludes(QNODE, DNODE, r), return; \n  if subsumes(QNODE, DNODE, r) \n    total = total + (count(QNODE) x count(DNODE)); return; \n  if leaf(QNODE) and leaf(DNODE) \n    if distance(QNODE, DNODE) < r, total = total + 1; \n  if leaf(QNODE) and notleaf(DNODE) \n    TwoPoint(QNODE, leftchild(DNODE), r); TwoPoint(QNODE, rightchild(DNODE), r); \n  if notleaf(QNODE) and leaf(DNODE) \n    TwoPoint(leftchild(QNODE), DNODE, r); TwoPoint(rightchild(QNODE), DNODE, r); \n  if notleaf(QNODE) and notleaf(DNODE) \n    TwoPoint(leftchild(QNODE), leftchild(DNODE), r); TwoPoint(leftchild(QNODE), rightchild(DNODE), r); \n    TwoPoint(rightchild(QNODE), leftchild(DNODE), r); TwoPoint(rightchild(QNODE), rightchild(DNODE), r); \n\nFigure 2: A recursive Dual-tree code. All the reported algorithms have a similar brevity. \n\nImportantly, both kinds of prunings can now apply to many query points at once, instead of each nearby query point rediscovering the same prune during the Single-tree search. The intuition behind Dual-tree's advantage can be seen by considering two cases. First, if r is so large that all pairs of points are counted, then Single-tree search will perform O(N) operations, where each query point immediately prunes at the root, while Dual-tree search will perform O(1) operations. Second, if r is so small that no pairs of points are counted, Single-tree search will run to one leaf for each query, meaning total work O(N log N), whereas Dual-tree search will visit each leaf once, meaning O(N) work. Note, however, that in the middle case of a medium-size r, Dual-tree is theoretically only a constant factor superior to Single-tree. 4 \n\nNon-redundant dual-tree search. So far, we have discussed two operations which cut short the need to traverse the tree further: exclusion and subsumption. Another form of pruning is to eliminate node-node comparisons which have been performed already in the reverse order. This can be done [11] simply by (virtually) ranking the datapoints according to their position in a depth-first traversal of the tree, then recording for each node the minimum and maximum ranks of the points it owns, and pruning whenever QNODE's maximum rank is less than DNODE's minimum rank. This is useful for all-pairs problems, but becomes essential for all-n-tuples problems. This kind of pruning is not practical for Single-tree search. Figure 3 shows the performance of a two-point correlation algorithm using all the aforementioned pruning methods. \n\nMultiple radii simultaneously. Most often in practice, the two-point is computed for many successive radii so that a curve can be plotted, indicating the clumpiness on different scales. Though the method presented so far is fast, it may have to be run once for each of, say, 1,000 radii. It is possible to perform a single, faster computation for all the radii simultaneously, by taking advantage of the nesting structure of the ordered radii, with an algorithm which recursively narrows the radii which still need to \n\n4 We'll summarize the asymptotic analysis briefly. 
\nIf the data is uniformly distributed in d-dimensional space, the cost of computing the n-point correlation function on a dataset with N points using the Dual-tree (n-tree) algorithm is O(N^α_nd), where α_nd is the dimensionality of the manifold of n-tuples that are just on the border between being matched and not-matched, and α_nd = n'(1 - 1/(n'd)), where n' = min(n, d). For example, the 2-point correlation function in two dimensions is O(N^(3/2)), considerably better than the O(N^2) naive algorithm. Disappointingly, for 2-point, this performance is asymptotically the same cost as Single-tree. For n > 2 our algorithm is better. Furthermore, if we can accept an approximate answer, the cost is O(ε^(-α_nd/(n - α_nd))), which is independent of N. \n\nAlgorithm | # Data | Quadratic | Single-tree | Dual-tree | ST Speedup | DT Speedup \ntwopoint | 10,000 | 132 | 2.2 | 1.2 | 60 | 110 \ntwopoint | 50,000 | 3300 est. | 11.8 | 7.0 | 280 | 471 \ntwopoint | 150,000 | 30899 est. | 37 | 20 | 835 | 1545 \ntwopoint | 300,000 | 123599 est. | 76 | 40 | 1626 | 3090 \nnearest | 10,000 | 139 | 2.0 | 1.4 | 70 | 99 \nnearest | 20,000 | 556 est. | 11.6 | 9.8 | 48 | 57 \nnearest | 50,000 | 3475 est. | 30.6 | 26.4 | 114 | 132 \noutliers | 10,000 | 141 | 2.3 | 1.2 | 61 | 118 \noutliers | 50,000 | 3525 est. | 12 | 6.5 | 294 | 542 \noutliers | 150,000 | 33006 est. | 36 | 21 | 917 | 1572 \noutliers | 300,000 | 132026 est. | 72 | 44 | 1834 | 3001 \n\nFigure 3: Our experiments timed our algorithms on large astronomical datasets of current scientific interest, consisting of x-y positions of sky objects from the Sloan Digital Sky Survey. All times are given in seconds, and runs were performed on a Pentium III-500 MHz Linux workstation. The larger runtimes for the quadratic algorithm were estimated based on those for smaller datasets. The dual kd-tree method is about a factor of 2 faster than the single kd-tree method, and both are 3 orders of magnitude faster than the quadratic method for a medium-sized dataset of 300,000 points. \n\n(a) # Data | 10 radii | 100 radii | 1000 radii | Speedup \n10,000 | 1.2 | 1.8 | 2.4 | 500 \n20,000 | 2.8 | 6.4 | 6.6 | 424 \n50,000 | 7.0 | 31 | 31 | 226 \n150,000 | 20 | 133 | 146 | 137 \n\n(b) # Data | Quadratic | Dual-tree (larger ε) | Dual-tree (smaller ε) | Speedup \n10,000 | 226 | 1.2 | 3.0 | 188 \n50,000 | 5650 est. | 10.4 | 16.8 | 543 \n150,000 | 50850 est. | 32 | 65 | 1589 \n300,000 | 203400 est. | 73 | 151 | 2786 \n\nFigure 4: (a) Runtimes for multiple 2-point correlation with increasing number of radii, and the speedup factor compared to 1,000 separate Dual-tree 2-point correlations. (b) Runtimes for kernel density estimation with decreasing levels of approximation, controlled by parameter ε, and speedup over quadratic. \n\nbe considered based on the current closest and farthest distances between the nodes. The details are omitted for space, regrettably. The results in Figure 4 confirm that the algorithm quickly focuses on the radii of relevance: for 150,000 data, computing 1,000 2-point correlations took only 7 times as long as computing one. \n\n3 Kernel density estimation \n\nApproximation accelerations. A fourth major type of pruning opportunity is approximation. This is often needed in all-point-pairs computations which involve computing some real-valued function f(x, y) between every pair of points x and y. An example is kernel density estimation with an infinite-tailed kernel such as a Gaussian, in which every training point has some non-zero (though perhaps infinitesimal) contribution to the density at each test point. \nFor each query point xq we need to accumulate K Σ_i w(|xq - xi|), where K is a normalizing constant and w is a weighting function (which we will need to assume is monotonic). 
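Written out directly, the quantity being accelerated is a simple weighted sum over all data points for each query. A one-dimensional Gaussian-kernel sketch (our own toy illustration; the function name and bandwidth parameter h are assumptions, and the dual-tree algorithm replaces the inner loop with node-level bounds):

```python
import numpy as np

def kde_naive(queries, data, h):
    # Direct O(Nq * Nd) kernel density estimate at each 1-D query point,
    # using a Gaussian weighting function w and its standard normalizer K.
    K = 1.0 / (len(data) * h * np.sqrt(2.0 * np.pi))
    dens = []
    for xq in queries:
        w = np.exp(-0.5 * ((xq - data) / h) ** 2)  # w(|xq - xi|) for all xi
        dens.append(K * w.sum())
    return np.array(dens)
```

Because w is monotonic in the distance, bounding the node-node distance immediately bounds every term contributed by a node, which is what makes the pruning described next possible.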
A recursive call of the Dual-tree implementation has the following job: for xq in QNODE, compute the contribution to xq's summed weights that is due to all points in DNODE. Once again, before doing any real work we use simple rectangle geometry to compute the shortest and furthest possible distances between any (xq, xd) pair. This bounds the minimum and maximum possible values of Kw(|xq - xd|). If these bounds are tight enough (according to an approximation parameter ε) we prune by simply distributing the midpoint weight to all the points in QNODE. \n\n(b) # Data | Time \n1,000 | 1 \n2,000 | 13 \n10,000 | 1470 \n20,000 | 14441 \n\n(c) d | n=2 | n=3 | n=4 \n1 | <1 | <1 | <1 \n2 | <1 | 3 | 23 \n3 | <1 | 6 | 57 \n4 | <1 | 7 | 73 \n\nFigure 5: (a) Runtimes for approximate n-point correlation with ε = 0.02 and 20,000 data. (b) Runtimes for approximate 4-point with ε = 0.02 and increasing data size. (c) Runtimes for exact n-point, run on 2000 datapoints of galaxies in d-dimensional color space. \n\n4 The n-point correlation, for n > 2 \n\nThe n-point correlation is the generalization of the 2-point correlation, which counts the number of n-tuples of points lying within radius r of each other, or, more generally, between some rmin and rmax. 5 The implementation is entirely analogous to the 2-point case, using n trees in general instead of two, except that there is more benefit in being careful about which of the 2^n possible recursive calls to choose in the cases where you cannot prune, the approximation versions are harder, there is no immediately analogous Single-tree version of the algorithm, and anti-redundancy pruning is much more important. Figure 5 shows the unprecedented efficiency gains, which become more dramatic as n increases. \n\nApproximating 'exact' computations. Even for algorithms such as 2-point, that return exact counts, bounded approximation is possible. 
Suppose the true value of the 2-point function is V* but that we can tolerate a fractional error of ε: we'll accept any value V such that |V - V*| < εV*. It is possible to adapt the dual-tree algorithm, using a best-first iterative-deepening search strategy, to guarantee this result while exploiting the permission to approximate: the count is built as much as possible from \"easy-win\" node pairs, while approximation is done at hard, deep node pairs. \n\n5 Outlier detection, nearest neighbors, and other problems \n\nOne of the main intents of this paper is to point out the broad applicability of this type of algorithm within statistical learning. Figure 3 shows performance results for our outlier detection and nearest-neighbors algorithms. Figure 6 lists many N-body problems which are clear candidates for acceleration in future work. 6 \n\n5 The n-point correlation is useful for detailed characterizations of mass distributions (including galaxies and biomasses). Higher-order n-point correlations detect increasingly subtle differences in mass distribution, and are also useful for assessing variance in the lower-order n-point statistics. For example, the three-point correlation, which measures the number of triplets of points meeting the specified geometric constraints, can distinguish between two distributions that have the same 2-point correlations but differ in their degree of \"stripiness\" versus \"spottiness\". \n\n6 In our nearest-neighbors algorithm we consider the problem of finding, for each query point, its single nearest neighbor among the data points. (This is exactly the all-nearest-neighbors problem of computational geometry.) The methods are easily generalized to the case of finding the k nearest neighbors, as in k-NN classification and locally weighted regression. Outlier detection is one of the most common statistical operations encountered in data analysis. 
The question of which procedure is most correct is an open and active one. We present here a natural operation which might be used directly for outlier detection, or within another procedure: for each of the points, find the number of other points that are within distance r of it; those having zero neighbors within r are defined as outliers. (This is exactly the all-range-count problem.) \n\nStatistical Operation | Results here? | Approximation? | What is N? \n2-point function | Yes | Optional | # Data \nn-point function | Yes | Optional | # Data \nMultiple 2-point function | Yes | Optional | # Data \nBatch k-nearest neighbor | Yes | Optional | # Data \nNon-parametric outlier detection / denoising | Yes | Optional | # Data \nBatch kernel density / classify / regression | Yes | Yes | # Data \nBatch locally weighted regression | No | Yes | # Data \nBatch kernel PCA | No | Yes | # Data \nGaussian process learning and prediction | No | Yes | # Data \nK-means | No | Optional | # Data, Clusters \nMixture of Gaussians clustering | No | Yes | # Data, Clusters \nHidden Markov model | No | Yes | # Data, States \nRBF neural network | No | Yes | # Data, Neurons \nFinding pairs of correlated attributes | No | Optional | # Attributes \nFinding n-tuples of correlated attributes | No | Optional | # Attributes \nDependency-tree learning | No | Optional | # Attributes \n\nFigure 6: A very brief sample of applicability of Dual-tree search methods. \n\nReferences \n\n[1] J. Barnes and P. Hut. A Hierarchical O(N log N) Force-Calculation Algorithm. Nature, 324, 1986. \n\n[2] K. Deng and A. W. Moore. Multiresolution instance-based learning. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 1233-1239, San Francisco, 1995. Morgan Kaufmann. \n\n[3] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. 
ACM Transactions on Mathematical Software, 3(3):209-226, September 1977. \n\n[4] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 1987. \n\n[5] D. E. Knuth. Sorting and Searching. Addison-Wesley, 1973. \n\n[6] A. W. Moore. Very fast mixture-model-based clustering using multiresolution kd-trees. In M. Kearns and D. Cohn, editors, Advances in Neural Information Processing Systems 10, pages 543-549, San Francisco, April 1999. Morgan Kaufmann. \n\n[7] A. W. Moore. The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data. In Twelfth Conference on Uncertainty in Artificial Intelligence (to appear). AAAI Press, 2000. \n\n[8] A. W. Moore and M. S. Lee. Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets. Journal of Artificial Intelligence Research, 8, March 1998. \n\n[9] D. Pelleg and A. W. Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1999. \n\n[10] F. P. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 1985. \n\n[11] A. Szalay. Personal Communication, 2000. \n\n[12] I. Szapudi. A New Method for Calculating Counts in Cells. The Astrophysical Journal, 1997. \n\n[13] I. Szapudi, S. Colombi, and F. Bernardeau. Cosmic Statistics of Statistics. Monthly Notices of the Royal Astronomical Society, 1999. \n\n[14] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems: PODS 1996. Assn for Computing Machinery, 1996. \n", "award": [], "sourceid": 1848, "authors": [{"given_name": "Alexander", "family_name": "Gray", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}