{"title": "Automatic online tuning for fast Gaussian summation", "book": "Advances in Neural Information Processing Systems", "page_first": 1113, "page_last": 1120, "abstract": "Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches.", "full_text": "Automatic online tuning for fast Gaussian summation\n\nVlad I. Morariu1\u2217, Balaji V. Srinivasan1, Vikas C. Raykar2, Ramani Duraiswami1, and Larry S. 
Davis1

1University of Maryland, College Park, MD 20742
2Siemens Medical Solutions Inc., USA, 912 Monroe Blvd, King of Prussia, PA 19406
morariu@umd.edu, balajiv@umiacs.umd.edu, vikas.raykar@siemens.com, ramani@umiacs.umd.edu, lsd@cs.umd.edu

Abstract

Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. 
Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches.

1 Introduction

Gaussian summations occur in many machine learning algorithms, including kernel density estimation [1], Gaussian process regression [2], fast particle smoothing [3], and kernel based machine learning techniques that need to solve a linear system with a similarity matrix [4]. In such algorithms, the sum

g(y_j) = \sum_{i=1}^{N} q_i e^{-\|x_i - y_j\|^2/h^2}

must be computed for j = 1, . . . , M, where {x_1, . . . , x_N} and {y_1, . . . , y_M} are d-dimensional source and target (or reference and query) points, respectively; q_i is the weight associated with x_i; and h is the bandwidth. Straightforward computation of the above sum is computationally intensive, taking O(MN) time.

To reduce the computational complexity, Greengard and Strain proposed the Fast Gauss Transform (FGT) [5], using two expansions, the far-field Hermite expansion and the local Taylor expansion, and a translation process that converts between the two, yielding an overall complexity of O(M + N). However, due to the expensive translation operation, the O(p^d) constant term, and the box based data structure, this method becomes less effective for higher dimensions (e.g. d > 3) [6].

Dual-tree methods [7, 8, 9, 10] approach the problem by building two separate trees for the source and target points respectively, and recursively considering contributions from nodes of the source tree to nodes of the target tree. The most recent works [9, 10] present new expansions and error control schemes that yield improved results for bandwidths in a large range above and below the optimal bandwidth, as determined by the standard least-squares cross-validation score [11]. 
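For reference, the direct O(MN) baseline that all of the fast methods are measured against can be sketched in a few lines of NumPy (a minimal illustration; the function name is ours, not part of the figtree library):

```python
import numpy as np

def gauss_sum_direct(X, Y, q, h):
    """Direct O(M*N) evaluation of g(y_j) = sum_i q_i * exp(-||x_i - y_j||^2 / h^2).

    X: (N, d) sources, Y: (M, d) targets, q: (N,) weights, h: bandwidth.
    """
    # Pairwise squared distances between every target and every source: (M, N).
    d2 = np.sum((Y[:, None, :] - X[None, :, :]) ** 2, axis=2)
    # Weighted Gaussian kernel sum for each target: (M,).
    return np.exp(-d2 / h**2) @ q
```

The memory and time cost of this dense evaluation is exactly what the tree and series methods below avoid.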
Efficiency across bandwidth scales is important in cases where the optimal bandwidth must be searched for.

*Our code is available for download as open source at http://sourceforge.net/projects/figtree.

Another approach, the Improved Fast Gauss Transform (IFGT) [6, 12, 13], uses a Taylor expansion and a space subdivision different from those of the original FGT, allowing for efficient evaluation in higher dimensions. This approach also achieves O(M + N) asymptotic computational complexity. However, the approach as initially presented in [6, 12] was not accompanied by an automatic parameter selection algorithm. Because the parameters interact in a non-trivial way, some authors designed simple parameter selection methods that meet the error bounds but do not maximize performance [14]; others attempted, unsuccessfully, to choose parameters, reporting times of "∞" for IFGT [9, 10]. Recently, Raykar et al. [13] presented an approach which selects parameters that minimize the constant term that appears in the asymptotic complexity of the method, while guaranteeing that the error bounds are satisfied. This approach is automatic, but only works for uniformly distributed sources, a situation often not met in practice. In fact, Gaussian summations are often used precisely because a simple distribution cannot be assumed. In addition, the IFGT performs poorly at low bandwidths because of the number of Taylor expansion terms that must be retained to meet the error bounds.

We address both problems with the IFGT: 1) small bandwidth performance, and 2) parameter selection. First, we employ a tree data structure [15, 16] that allows for fast neighbor search and greatly speeds up computation for low bandwidths. 
This gives rise to four possible evaluation methods that are chosen based on input parameters and data distributions: direct evaluation, direct evaluation using a tree data structure, IFGT evaluation, and IFGT evaluation using a tree data structure (denoted by direct, direct+tree, ifgt, and ifgt+tree, respectively). We improve parameter selection by removing the assumption that data is uniformly distributed and by providing a method for selecting individual source and target truncation numbers that allows for tighter error bounds. Finally, we provide an algorithm that automatically selects the evaluation method that is likely to be fastest for the given data, bandwidth, and error tolerance. This is done in a way that is automatic and transparent to the user, as for other software packages such as FFTW [17] and ATLAS [18]. The algorithm is tested on several datasets, including those in [10], and in each case found to perform as expected.

2 Improved Fast Gauss Transform

We briefly summarize the IFGT, which is described in detail in [13, 12, 6]. The speedup is achieved by employing a truncated Taylor series factorization, using a space sub-division to reduce the number of terms needed to satisfy the error bound, and ignoring sources whose contributions are negligible. The approximation is guaranteed to satisfy the absolute error |\hat{g}(y_j) - g(y_j)|/Q \le \epsilon, where Q = \sum_i |q_i|. The factorization that the IFGT uses involves the truncated multivariate Taylor expansion

e^{-\|y_j - x_i\|^2/h^2} = e^{-\|x_i - x_*\|^2/h^2} e^{-\|y_j - x_*\|^2/h^2} \left[ \sum_{|\alpha| \le p-1} \frac{2^{|\alpha|}}{\alpha!} \left( \frac{y_j - x_*}{h} \right)^{\alpha} \left( \frac{x_i - x_*}{h} \right)^{\alpha} \right] + \Delta_{ij},

where \alpha is multi-index notation1 and \Delta_{ij} is the error induced by truncating the series to exclude terms of degree p and higher, which can be bounded by

\Delta_{ij} \le \frac{2^p}{p!} \left( \frac{\|x_i - x_*\|}{h} \right)^p \left( \frac{\|y_j - x_*\|}{h} \right)^p e^{-(\|x_i - x_*\| - \|y_j - x_*\|)^2/h^2}.   (1)

Because reducing the distance \|x_i - x_*\| also reduces the error bound given above, the sources can be divided into K clusters, so the Taylor series center of expansion for source x_i is the center of the cluster to which the source belongs. Because of the rapid decay of the Gaussian function, the contribution of sources in cluster k can be ignored if \|y_j - c_k\| > r_y^k = r_x^k + h\sqrt{\log(1/\epsilon)}, where c_k and r_x^k are the center and radius of the kth cluster, respectively.

In [13], the authors ensure that the error bound is met by choosing the truncation number p_i for each source so that \Delta_{ij} \le \epsilon. This guarantees that |\hat{g}(y_j) - g(y_j)| = |\sum_{i=1}^N q_i \Delta_{ij}| \le \sum_{i=1}^N |q_i|\epsilon = Q\epsilon. Because \|y_j - c_k\| cannot be computed for each \Delta_{ij} term (to prevent quadratic complexity), the authors use the worst case scenario; denoting d_{ik} = \|x_i - c_k\| and d_{jk} = \|y_j - c_k\|, the bound on error term \Delta_{ij} is maximized at d_{jk}^* = (d_{ik} + \sqrt{d_{ik}^2 + 2 p_i h^2})/2, or d_{jk}^* = r_y^k, whichever is smaller (since targets further than r_y^k from c_k will not consider cluster k).

1Multi-index \alpha = (\alpha_1, . . . , \alpha_d) is a d-tuple of nonnegative integers; its length is |\alpha| = \alpha_1 + . . . + \alpha_d, its factorial is defined as \alpha! = \alpha_1! \alpha_2! \cdots \alpha_d!, and for x = (x_1, . . . , x_d) \in R^d, x^\alpha = x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}.

[Figure 1 diagram: sources, a separated target, and cluster centers c_1, c_2, c_3 with cut-off radii, illustrated for each of the four methods.]
Figure 1: The four evaluation methods. 
Target is displayed elevated to separate it from sources.

The algorithm proceeds as follows. First, the number of clusters K, maximum truncation number p_max, and the cut-off radius r are selected by assuming that sources are uniformly distributed. Next, K-center clustering is performed to obtain c_1, . . . , c_K, and the set of sources S is partitioned into S_1, . . . , S_K. Using the max cluster radius r_x, the truncation number p_max is found that satisfies the worst-case error bound. Choosing p_i for each source x_i so that \Delta_{ij} \le \epsilon, source contributions are accumulated to cluster centers:

C_\alpha^k = \frac{2^{|\alpha|}}{\alpha!} \sum_{x_i \in S_k} \mathbf{1}_{|\alpha| \le p_i - 1} \, q_i e^{-\|x_i - c_k\|^2/h^2} \left( \frac{x_i - c_k}{h} \right)^{\alpha}.

For each target y_j, influential clusters for which \|y_j - c_k\| \le r_y^k = r_x^k + r are found, and contributions from those clusters are evaluated:

\hat{g}(y_j) = \sum_{k : \|y_j - c_k\| \le r_y^k} \sum_{|\alpha| \le p_{max} - 1} C_\alpha^k e^{-\|y_j - c_k\|^2/h^2} \left( \frac{y_j - c_k}{h} \right)^{\alpha}.

The clustering step can be performed in O(NK) time using a simple algorithm due to Gonzalez [19], or in optimal O(N log K) time using the algorithm by Feder and Greene [20]. Because the number of values of \alpha such that |\alpha| \le p is r_{pd} = C(p + d, d), the total complexity of the algorithm is O((N + M n_c)(\log K + r_{(p_{max}-1)d})), where n_c is the number of cluster centers that are within the cut-off radius of a target point. Note that for fixed p, r_{pd} is polynomial in the dimension d rather than exponential. 
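To make the factorization concrete, here is a hypothetical one-dimensional sketch (a single index n in place of the multi-index α, with our own function name): it evaluates the series truncated at degree p about a center c, together with the corresponding bound (1):

```python
import math

def ifgt_factorization_1d(x, y, c, h, p):
    """Truncated IFGT-style factorization of exp(-(y - x)^2 / h^2) about center c (1-D).

    Returns (approximation, error bound of Eq. (1))."""
    u, v = (x - c) / h, (y - c) / h
    # Truncated expansion of exp(2*u*v); degree-p and higher terms are dropped.
    series = sum(2**n * u**n * v**n / math.factorial(n) for n in range(p))
    approx = math.exp(-u * u) * math.exp(-v * v) * series
    # One-dimensional instance of the bound (1) on the truncation error.
    bound = (2**p / math.factorial(p)) * abs(u)**p * abs(v)**p \
            * math.exp(-(abs(x - c) - abs(y - c))**2 / h**2)
    return approx, bound
```

Increasing p tightens both the approximation and the bound, which is the trade-off that parameter selection balances against the cluster radius.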
Searching for clusters within the cut-off radius of each target can take time O(M K),\nbut ef\ufb01cient data-structures can be used to reduce the cost to O(M nc log K).\n\n3 Fast Fixed-Radius Search with Tree Data Structure\n\nOne problem that becomes apparent from the point-wise error bound on \u2206ij is that as bandwidth h\ndecreases, the error bound increases, and either dik = ||xi \u2212 ck|| must be decreased (by increasing\nthe number of clusters K) or the maximum truncation number pmax must be increased to continue\nsatisfying the desired error. An increase in either K or pmax increases the total cost of the algorithm.\nConsequently, the algorithm originally presented above does not perform well for small bandwidths.\n\nHowever, few sources have a contribution greater than qi\u01eb at low bandwidths, since the cut-off radius\nbecomes very small. Also, because the number of clusters increases as the bandwidth decreases, we\nneed an ef\ufb01cient way of searching for clusters that are within the cut-off radius. For this reason, a\ntree data structure can be used since it allows for ef\ufb01cient \ufb01xed-radius nearest neighbor search. If h\nis moderately low, a tree data structure can be built on the cluster centers, such that the nc in\ufb02uential\nclusters within the cut-off radius can be found in O(nc log K) time [15, 16]. If the bandwidth is\nvery low, then it is more ef\ufb01cient to simply \ufb01nd all source points xi that in\ufb02uence a target yj and\nperform exact evaluation for those source points. Thus, if ns source points are within the cut-off\nradius of yj, then the time to build the structure is O(N log N ) and the time to perform a query is\nO(ns log N ) for each target. Thus, we have four methods that may be used for evaluation of the\nGauss Transform: direct evaluation, direct evaluation with the tree data structure, IFGT evaluation,\nand IFGT evaluation with a tree data structure on the cluster centers. 
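The cut-off logic that both tree-based variants rely on can be sketched as follows (an illustrative linear scan with a function name of our own; in the actual method a tree answers this fixed-radius query in logarithmic time):

```python
import numpy as np

def influential_sources(X, y, h, eps):
    """Indices of sources that cannot be ignored for target y.

    A source is dropped when ||y - x_i|| > h * sqrt(log(1/eps)); in the paper a
    tree structure answers this fixed-radius query in O(n_s log N) rather than
    the O(N) scan used here for illustration."""
    r = h * np.sqrt(np.log(1.0 / eps))    # cut-off radius
    d = np.linalg.norm(X - y, axis=1)     # distances from y to all sources
    return np.nonzero(d <= r)[0]
```

The same pruning applies to cluster centers in the ifgt+tree variant, with the cluster radius added to the cut-off.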
Figure 1 shows a graphical representation of the four methods. Because the running times of the four methods for various parameters can differ greatly (i.e. using direct+tree evaluation when ifgt is optimal could result in a running time that is many orders of magnitude larger), we will need an efficient and online method selection approach, which is presented in section 5.

[Figure 2 plots: left, max cluster radius r_x versus number of clusters K (actual vs. predicted radius, d = 3); right, speedup versus bandwidth h for d = 1, . . . , 6.]
Figure 2: Selecting p_max and K using cluster radius, for M = N = 20000, sources dist. as mixture of 25 N(\mu \sim U[0,1]^d, \Sigma = 4^{-4} I), targets as U[0,1]^d, \epsilon = 10^{-2}. Left: Predicted cluster radius as K^{-1/d} vs actual cluster radius for d = 3. Right: Speedup from using actual cluster radius.

4 Choosing IFGT Parameters

As mentioned in Section 1, the process of choosing the parameters is non-trivial. In [13], the point-wise error bounds described in Eq. 1 were used in an automatic parameter selection scheme that is optimized when sources are uniformly distributed. We remove the uniformity assumption and also make the error bounds tighter by selecting individual source and target truncation numbers to satisfy cluster-wise error bounds instead of the worst-case point-wise error bounds. 
The \ufb01rst improvement\nprovides signi\ufb01cant speedup in cases where sources are not uniformly distributed, and the second\nimprovement results in general speedup since we are no longer considering the error contribution of\njust the worst source point, but considering the total error of each cluster instead.\n\n4.1 Number of Clusters and Maximum Truncation Number\n\nThe task of selecting the number of clusters K and maximum truncation number pmax is dif\ufb01cult\nbecause they depend on each other indirectly through the source distribution. For example, increas-\ning K decreases the cluster radius, which allows for a lower truncation number while still satisfying\nthe error bound; conversely, increasing pmax allows clusters to have a larger radius, which allows\nfor a smaller K. Ideally, both parameters should be as low as possible since they both affect compu-\ntational complexity. Unfortunately, we cannot \ufb01nd the balance between the two without analyzing\nthe source distribution because it in\ufb02uences the rate at which the cluster radius decreases. The uni-\n\nformity assumption leads to an estimate of maximum cluster radius, rx \u223c K \u22121/d [13]. However,\nfew interesting datasets are uniformly distributed, and when the assumption is violated, as in Fig. 2,\nactual rx will decrease faster than K \u22121/d, leading to over-clustering and increased running time.\n\nOur solution is to perform clustering as part of the parameter selection process, obtaining the actual\ncluster radii for each value of K. Using this approach, parameters are selected in a way that the\nalgorithm is tuned to the actual distribution of the sources.\n\nWe can take advantage of the incremental nature of some clustering algorithms such as the greedy al-\ngorithm proposed by Gonzalez [19] or the \ufb01rst phase of the Feder and Greene algorithm [20], which\nprovide a 2-approximation and 6-approximation of the optimal k-center clustering, respectively. 
We can then increment the value K, obtain the maximum cluster radius, and then find the lowest p that satisfies the error bound, picking the final value of K which yields the lowest computational cost.

Note that if we simply set the maximum number of clusters to K_limit = N, we would spend O(N log N) time to estimate parameters. However, in practice, the optimal value of K is low relative to N, and it is possible to detect when we cannot lower the cost further by increasing K or lowering p_max, thus allowing the search to terminate early. In addition, in Section 5, we show how the data distribution allows us to intelligently choose K_limit.

4.2 Individual Truncation Numbers by Cluster-wise Error Bounds

Once the maximum truncation number p_max is selected, we can guarantee that the worst source-target pairwise error is below the desired error bound. However, simply setting each source and target truncation number to p_max wastes computational resources since most source-target pairs do not contribute much error. This problem is addressed in [13] by allowing each source to have its own truncation number based on its distance from the cluster center and assuming the worst placement of any target.

[Figure 3 plot: speedup versus bandwidth h for d = 1, . . . , 6.]
Figure 3: Speedup obtained by using cluster-wise instead of point-wise truncation numbers, for M = N = 4000, sources dist. as mixture of 25 N(\mu \sim U[0,1]^d, \Sigma = 4^{-4} I), targets as U[0,1]^d, \epsilon = 10^{-4}. For d = 1, the gain of lowering truncation is not large enough to make up for overhead costs.
However, this means that each cluster will have to compute r_{(p_i-1)d} coefficients, where p_i is the truncation number of its farthest point.

We propose a method for further decreasing most individual source and target truncation numbers by considering the total error incurred by evaluation at any target:

|\hat{g}(y_j) - g(y_j)| \le \sum_{k : \|y_j - c_k\| \le r_y^k} \sum_{x_i \in S_k} |q_i| \Delta_{ij} + \sum_{k : \|y_j - c_k\| > r_y^k} \sum_{x_i \in S_k} |q_i| \epsilon,

where the left term on the r.h.s. is the error from truncating the Taylor series for the clusters that are within the cut-off radius, and the right term bounds the error from ignoring clusters outside the cut-off radius r_y. Instead of ensuring that \Delta_{ij} \le \epsilon for all (i, j) pairs, we ensure

\sum_{x_i \in S_k} |q_i| \Delta_{ij} \le \sum_{x_i \in S_k} |q_i| \epsilon = Q_k \epsilon

for all clusters. In this case, if a cluster is outside the cut-off radius, then the error incurred is no greater than Q_k \epsilon; otherwise, the cluster-wise error bounds guarantee that the error is still no greater than Q_k \epsilon. Summing over all clusters we have

|\hat{g}(y_j) - g(y_j)| \le \sum_k Q_k \epsilon = Q \epsilon,

our desired error bound. The lowest truncation number that satisfies the cluster-wise error for each cluster is found in O(p_max N) time by evaluating the cluster-wise error for all clusters for each value of p = 1, . . . , p_max. In addition, we can find individual target point truncation numbers by not only considering the worst case target distance r_y^k when computing cluster error contributions, but considering target errors for sources at varying distance ranges from each cluster center. This yields concentric regions around each cluster, each of which has its own truncation number, which can be used for targets in that region. 
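A sketch of the cluster-wise selection (a hypothetical helper of our own, with each Δ_ij evaluated at the worst-case target distance from Section 2): the smallest p is chosen so the weighted cluster error stays below Q_k ε:

```python
import math
import numpy as np

def cluster_truncation(d_ik, q, h, r_y, eps, p_max):
    """Smallest p with the cluster-wise bound sum_i |q_i| * Delta_ij <= Q_k * eps.

    d_ik: distances of the cluster's sources to its center. The worst-case
    target distance d*_jk is the smaller of r_y and
    (d_ik + sqrt(d_ik^2 + 2*p*h^2)) / 2, as in the paper."""
    Q_k = np.sum(np.abs(q))
    for p in range(1, p_max + 1):
        djk = np.minimum(r_y, (d_ik + np.sqrt(d_ik**2 + 2 * p * h**2)) / 2)
        delta = (2**p / math.factorial(p)) * (d_ik / h)**p * (djk / h)**p \
                * np.exp(-((d_ik - djk)**2) / h**2)
        if np.sum(np.abs(q) * delta) <= Q_k * eps:
            return p
    return p_max
```

Because the per-source errors are weighted by |q_i|, sources with negligible weight no longer force a larger truncation number, which is one of the gains listed below.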
Our approach satisfies the error bound more tightly and reduces computational cost because:

• Each cluster's maximum truncation number no longer depends only on its farthest point, so if most points are clustered close to the center the maximum truncation will be lower;
• The weight of each source point is considered in the error contributions, so if a source point is far away but has a weight of q_i = 0 its error contribution will be ignored; and finally
• Each target can use a truncation number that depends on its distance from the cluster.

5 Automatic Tuning via Method Selection

For any input source and target point distribution, requested absolute error, and Gaussian bandwidth, we have the option of evaluating the Gauss Transform using any one of four methods: direct, direct+tree, ifgt, and ifgt+tree. As Fig. 4 shows, choosing the wrong method can result in orders of magnitude more time to evaluate the sum. Thus, we require an efficient scheme to automatically choose the best method online based on the input. 
The scheme must use the distribution of both the source and target points in making its decision, while at the same time avoiding long computations that would defeat the purpose of automatic method selection.

Note that if we know d, M, N, n_s, n_c, K, and p_max, we can calculate the cost of each method:

Cost_direct(d, N, M) = O(dMN)
Cost_direct+tree(d, N, M, n_s) = O(d(N + M n_s) log N)
Cost_ifgt(d, N, M, K, n_c, p_max) = O(dN log K + (N + M n_c) r_{(p_max-1)d} + dMK)
Cost_ifgt+tree(d, N, M, K, n_c, p_max) = O((N + M n_c)(d log K + r_{(p_max-1)d}))

[Figure 4 plots: left, CPU time (seconds) versus bandwidth h for direct, direct+tree, ifgt, ifgt+tree, and auto (d = 4); right, ratio of auto to the best and to the worst method versus bandwidth h for d = 1, . . . , 6.]
Figure 4: Running times of the four methods and our automatic method selection for M = N = 4000, sources dist. as mixture of 25 N(\mu \sim U[0,1]^d, \Sigma = 4^{-4} I), targets as U[0,1]^d, \epsilon = 10^{-4}. Left: example for d = 4. 
Right: Ratio of automatic to fastest method and automatic to slowest method, showing that method selection incurs very small overhead while preventing potentially large slowdowns.

Algorithm 1 Method Selection
1: Calculate \hat{n}_s, an estimate of n_s
2: Calculate Cost_direct(d, N, M) and Cost_direct+tree(d, N, M, \hat{n}_s)
3: Calculate the highest K_limit \ge 0 such that for some n_c and p_max, min(Cost_ifgt, Cost_ifgt+tree) \le min(Cost_direct, Cost_direct+tree)
4: if K_limit > 0 then
5:   Compute p_max and K \le K_limit that minimize the estimated cost of IFGT
6:   Calculate \hat{n}_c, an estimate of n_c
7:   Calculate Cost_ifgt+tree(d, N, M, K, \hat{n}_c, p_max) and Cost_ifgt(d, N, M, K, \hat{n}_c, p_max)
8: end if
9: return arg min_i Cost_i

More precise equations and the correct constants that relate the four costs can be obtained directly from the specific implementation of each method (this could be done by inspection, or automatically offline or at compile-time to account for hardware). A simple approach to estimating the distribution dependent n_s and n_c is to build a tree on sampled source points and compute the average number of neighbors of a sampled set of targets. The asymptotic complexity of this approximation is the same as that of direct+tree, unless sub-linear sampling is used at the expense of accuracy in predicting cost. However, n_s and n_c can be estimated in O(M + N) time even without sampling by using techniques from the field of database management systems for estimating spatial join selectivity [21]. Given n_s, we predict the cost of direct+tree, and estimate K_limit as the highest value that might yield lower costs than direct or direct+tree. If K_limit > 0, then we can estimate the parameters and costs of ifgt and ifgt+tree. Finally, we pick the method with lowest cost. 
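Ignoring implementation-specific constants (all set to 1 below, whereas the paper calibrates them from the actual implementation), the selection step reduces to an argmin over the four cost models:

```python
import math

def select_method(d, N, M, ns_hat, nc_hat, K, p_max):
    """Pick the evaluation method with the smallest estimated cost.

    A sketch of the final step of the method selection scheme; ns_hat and
    nc_hat are the estimated neighbor/cluster counts, and all cost constants
    are taken to be 1 for illustration."""
    # Number of multi-indices with |alpha| <= p_max - 1.
    r_pd = math.comb(p_max - 1 + d, d)
    costs = {
        'direct':      d * M * N,
        'direct+tree': d * (N + M * ns_hat) * math.log(N),
        'ifgt':        d * N * math.log(K) + (N + M * nc_hat) * r_pd + d * M * K,
        'ifgt+tree':   (N + M * nc_hat) * (d * math.log(K) + r_pd),
    }
    return min(costs, key=costs.get), costs
```

With the constants calibrated, evaluating these formulas is negligible next to the summation itself, which is why the selection can be done online.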
As figure 4 shows, our method selection approach chooses the correct method across bandwidths at very low computational cost.

6 Experiments

Performance Across Bandwidths. We empirically evaluate our method on the same six real-world datasets as in [10] and compare against the authors' reported results. As in [10], we scale the data to fit the unit hypercube and evaluate the Gauss transform using all 50K points as sources and targets, with bandwidths varying from 10^{-3} to 10^{3} times the optimal bandwidth. Because our method satisfies an absolute error, we use for the absolute \epsilon the highest value that guarantees a relative error of 10^{-2} (to achieve this, \epsilon ranges from 10^{-1} to 10^{-4} by factors of 10). We do not include the time required to choose \epsilon (since we are doing this only to evaluate the running times of the two methods for the same relative errors), but we do include the time to automatically select the method and parameters. Since the code of [10] is not currently available, our experiments do not use the same machine as [10], and the CPU times are scaled based on the reported/computed times needed by the naive approach on the corresponding machines. Figure 5 shows the normalized running times of our method versus the Dual-Tree methods DFD, DFDO, DFTO, and DITO. For most bandwidths our method is generally faster by about one order of magnitude (sometimes as much as 1000 times faster). For near-optimal bandwidths, our approach is either faster or comparable to the other approaches.

Gaussian Process Regression. Gaussian process regression (GPR) [22] provides a Bayesian framework for non-parametric regression. 
The computational complexity for straightforward GPR is O(N^3), which is undesirable for large datasets.

[Figure 5 panels: sj2 (d = 1, h* = 0.001395), mockgalaxy (d = 3, h* = 0.000768), bio5 (d = 5, h* = 0.000567), pall7 (d = 7, h* = 0.001319), covtype (d = 10, h* = 0.015476), and CoocTexture (d = 16, h* = 0.026396); each plots CPU time / naive CPU time versus bandwidth scale h/h* for DFD, DFDO, DFTO, DITO, and our method.]
Figure 5: Comparison with Dual-Tree methods for six real-world datasets (lower is faster).

The core computation in GPR involves the solution of a linear system for the dense covariance matrix K + \sigma^2 I, where [K]_{ij} = K(x_i, x_j). Our method can be used to accelerate this solution for Gaussian processes with Gaussian covariance, given by K(x, x') = \sigma_f^2 \exp(-\sum_{k=1}^{d} (x_k - x'_k)^2 / h_k^2) [22]. Given the training set D = {x_i, y_i}_{i=1}^{N} and a new point x_*, the training phase involves computing \alpha = (K + \sigma^2 I)^{-1} y, and the prediction of y_* is given by y_* = k(x_*)^T \alpha, where k(x_*) = [K(x_*, x_1), . . . , K(x_*, x_N)]. The system can be solved efficiently by a conjugate gradient method using the IFGT for matrix-vector multiplication. Further, the accuracy of the matrix-vector product can be reduced as the iterations proceed (i.e. \epsilon is modified every iteration) if we use inexact Krylov subspaces [23] for the conjugate gradient iterations.

We apply our method for Gaussian process regression on four standard datasets: robotarm, abalone, housing, and elevator2. We present the results of the training phase (though we also speed up the prediction phase). For each dataset we ran five experiments: the first four fixed one of the four methods (direct, direct+tree, ifgt, ifgt+tree) and used it for all conjugate gradient iterations; the fifth automatically selected the best method at each iteration (denoted by auto in figure 6). To validate our solutions, we measured the relative error between the vectors found by the direct method and our approximate methods; the errors were small, ranging from ~10^{-10} to ~10^{-5}. As expected, auto chose the correct method for each dataset, incurring only a small overhead cost. 
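The training phase can be sketched with a hand-rolled conjugate gradient over a kernel matvec (a simplified isotropic covariance with σ_f = 1 and a single bandwidth h; in the paper the matvec is supplied by the tuned fast Gauss transform, whereas the dense product below is a self-contained stand-in):

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Plain CG for A x = b, given only a matvec callable for the SPD matrix A."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        a = rs / (p @ Ap)
        x += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def gpr_train(X, y, h, sigma2):
    """GPR training phase: alpha = (K + sigma^2 I)^{-1} y, Gaussian covariance."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / h**2)           # [K]_ij = exp(-||x_i - x_j||^2 / h^2)
    return conjugate_gradient(lambda v: K @ v + sigma2 * v, y)
```

Replacing the lambda with an approximate (fast Gauss transform) matvec, and loosening its tolerance per iteration, is exactly where the online method selection is invoked in the experiments below.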
Also, for the abalone dataset, auto outperformed any of the fixed-method experiments; as the right side of figure 6 shows, halfway through the iterations the required accuracy decreased enough to make ifgt faster than direct evaluation. By switching methods dynamically, the automatic selection approach outperformed any fixed method, further demonstrating the usefulness of our online tuning approach.

Fast Particle Smoothing. Finally, we embed our automatic method selection in the two-filter particle smoothing demo provided by the authors of [3]³. For a data size of 1000 and a tolerance of 10⁻⁶, the run-times are 18.26s, 90.28s, and 0.56s for the direct, dual-tree, and automatic (ifgt was chosen) methods, respectively. The RMS error from the ground truth values was observed as 2.904 ± 10⁻⁴ for all methods.

² The last three datasets can be downloaded from http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html; the first, robotarm, is a synthetic dataset generated as in [2].
³ The code was downloaded from http://www.cs.ubc.ca/~awll/nbody/demos.html

              Robotarm   Abalone   Housing   Elevator
  Dims        2          7         12        18
  Size        1000       4177      506       8752
  direct      0.578s     16.1s     0.313s    132s
  ifgt        0.0781s    32.3s     1317s     133s
  direct-tree 5.45s      328s      2.27s     0.516s
  ifgt-tree   0.0781s    35.2s     549s      101s
  auto        0.0938s    14.5s     0.547s    0.797s

[Right panel: log(desired accuracy) versus iteration number for IFGT and the direct method.]

Figure 6: GPR Results. Left: CPU times.
Right: Desired accuracy per iteration for the abalone dataset.

7 Conclusion

We presented an automatic online tuning approach to Gaussian summation that combines a tree data structure with the IFGT, performs well for both high and low bandwidths, and can be treated by users as a black box. The approach also tunes IFGT parameters to the source distribution and provides tighter error bounds. Experiments demonstrated that our approach outperforms competing methods for most bandwidth settings and dynamically adapts to various datasets and input parameters.

Acknowledgments. We would like to thank the U.S. Government VACE program for supporting this work. This work was also supported by a NOAA-ESDIS Grant to ASIEP at UMD.

References

[1] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, 1995.
[2] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In NIPS, 1995.
[3] M. Klaas, M. Briers, N. de Freitas, A. Doucet, S. Maskell, and D. Lang. Fast particle smoothing: if I had a million particles. In ICML, 2006.
[4] N. de Freitas, Y. Wang, M. Mahdaviani, and D. Lang. Fast Krylov methods for N-body learning. In NIPS, 2006.
[5] L. Greengard and J. Strain. The fast Gauss transform. SIAM J. Sci. Stat. Comput., 1991.
[6] C. Yang, R. Duraiswami, N. A. Gumerov, and L. S. Davis. Improved fast Gauss transform and efficient kernel density estimation. In ICCV, 2003.
[7] A. G. Gray and A. W. Moore. 'N-body' problems in statistical learning. In NIPS, 2000.
[8] A. G. Gray and A. W. Moore. Nonparametric density estimation: Toward computational tractability. In SIAM Data Mining, 2003.
[9] D. Lee, A. Gray, and A. Moore. Dual-tree fast Gauss transforms. In NIPS, 2006.
[10] D. Lee and A. G. Gray. Faster Gaussian summation: Theory and experiment. In UAI, 2006.
[11] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
[12] C. Yang, R. Duraiswami, and L. S. Davis. Efficient kernel machines using the improved fast Gauss transform. In NIPS, 2004.
[13] V. Raykar, C. Yang, R. Duraiswami, and N. Gumerov. Fast computation of sums of Gaussians in high dimensions. Technical Report UMD-CS-TR-4767, University of Maryland, 2005.
[14] D. Lang, M. Klaas, and N. de Freitas. Empirical testing of fast kernel density estimation algorithms. Technical Report UBC TR-2005-03, University of British Columbia, Vancouver, 2005.
[15] S. Arya and D. Mount. Approximate nearest neighbor queries in fixed dimensions. In SODA, 1993.
[16] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 1998.
[17] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 2005.
[18] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.
[19] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, October 1985.
[20] T. Feder and D. H. Greene. Optimal algorithms for approximate clustering. In STOC, 1988.
[21] C. Faloutsos, B. Seeger, A. Traina, and C. Traina. Spatial join selectivity using power laws. In SIGMOD Conference, 2000.
[22] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[23] V. Simoncini and D. Szyld. Theory of inexact Krylov subspace methods and applications to scientific computing. Technical Report 02-4-12, Temple University, 2002.