{"title": "Convergent Bounds on the Euclidean Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 388, "page_last": 396, "abstract": "Given a set V of n vectors in d-dimensional space, we provide an efficient method for computing quality upper and lower bounds of the Euclidean distances between a pair of the vectors in V . For this purpose, we define a distance measure, called the MS-distance, by using the mean and the standard deviation values of vectors in V . Once we compute the mean and the standard deviation values of vectors in V in O(dn) time, the MS-distance between them provides upper and lower bounds of Euclidean distance between a pair of vectors in V in constant time. Furthermore, these bounds can be refined further such that they converge monotonically to the exact Euclidean distance within d refinement steps. We also provide an analysis on a random sequence of refinement steps which can justify why MS-distance should be refined to provide very tight bounds in a few steps of a typical sequence. The MS-distance can be used to various problems where the Euclidean distance is used to measure the proximity or similarity between objects. We provide experimental results on the nearest and the farthest neighbor searches.", "full_text": "Convergent Bounds on the Euclidean Distance\n\nYoonho Hwang\n\nHee-Kap Ahn\n\nDepartment of Computer Science and Engineering\n\nPohang University of Science and Technology\nPOSTECH, Pohang, Gyungbuk, Korea(ROK)\n{cypher,heekap}@postech.ac.kr\n\nAbstract\n\nGiven a set V of n vectors in d-dimensional space, we provide an ef\ufb01cient method\nfor computing quality upper and lower bounds of the Euclidean distances between\na pair of vectors in V . For this purpose, we de\ufb01ne a distance measure, called the\nMS-distance, by using the mean and the standard deviation values of vectors in\nV . 
Once we compute the mean and the standard deviation values of vectors in V in O(dn) time, the MS-distance provides upper and lower bounds of the Euclidean distance between any pair of vectors in V in constant time. Furthermore, these bounds can be refined so that they converge monotonically to the exact Euclidean distance within d refinement steps. An analysis of a random sequence of refinement steps shows that the MS-distance provides very tight bounds in only a few refinement steps. The MS-distance can be applied to various problems where the Euclidean distance is used to measure the proximity or similarity between objects. We provide experimental results on the nearest and the farthest neighbor searches.\n\n1 Introduction\n\nThe Euclidean distance between two vectors x and y in d-dimensional space is a typical distance measure that reflects their proximity in the space. Measuring the Euclidean distance is a fundamental operation in computer science, including the areas of databases, computational geometry, computer vision and computer graphics. In machine learning, the Euclidean distance, denoted by dist(x, y), or its variations (for example, $e^{\\|x-y\\|}$) are widely used to measure data similarity for clustering [1], classification [2] and so on.\n\nA typical problem is as follows. Given two sets X and Y of vectors in d-dimensional space, our goal is to find a pair (x, y), for x \u2208 X and y \u2208 Y, such that dist(x, y) is the optimum (minimum or maximum) over all such pairs. For the nearest or farthest neighbor searches, X is the set consisting of a single query point while Y consists of all candidate data points. If the dimension is low, a brute-force computation would be fast enough. However, data sets in areas such as optimization, computer vision, machine learning or statistics often live in spaces of dimensionality in the order of thousands or millions. 
In d-dimensional space, a single distance computation already takes O(d) time, thus the cost for finding the nearest or farthest neighbor becomes O(dnm) time, where n and m are the cardinalities of X and Y, respectively.\n\nSeveral techniques have been proposed to reduce the cost of computing distances. PCA (principal component analysis) is probably the most frequently used technique for this purpose [3]: an orthogonal transformation based on PCA converts the given data so that the dimensionality of the transformed data is reduced, and distances between pairs of transformed data can then be computed efficiently. However, this transformation does not preserve the pairwise distances of data in general, therefore there is no guarantee on the computation results.\n\nIf we restrict ourselves to the nearest neighbor search, some methods using space partitioning trees such as KD-tree [4], R-tree [5], or their variations have been widely used. However, they become impractical for high dimensions because of their poor performance in constructing data structures for queries. Recently, cover tree [6] has been used for high dimensional nearest neighbor search, but its construction time increases drastically as the dimension increases [7].\n\nAnother approach that has attracted some attention is to compute a good bound of the exact Euclidean distance efficiently such that it can be used to filter off some unnecessary computation, for example, the distance computation between two vectors that are far apart from each other in nearest neighbor search. One such method is to compute a distance bound using the inner product approximation [8]. This method, however, requires the distribution of the input data to be known in advance, and works only on data in some predetermined distribution. Another method is to compute a distance bound using bitwise operations [9]. 
But this method works well only on uniformly distributed vectors, and requires O(2d) bitwise operations in d dimensions. A method using an index structure [10] provides an effective filtering method based on the triangle inequality, but it works well only when the data are well clustered.\n\nIn this paper, we define a distance measure, called the MS-distance, by using the mean and the standard deviation values of vectors in V. Once we compute the mean and the standard deviation values of vectors in V in O(dn) time, the MS-distance provides tight upper and lower bounds of the Euclidean distance between any pair of vectors in V in constant time. Furthermore, these bounds can be refined so that they converge monotonically to the exact Euclidean distance within d refinement steps. Each refinement step takes constant time.\n\nWe provide an analysis of a random sequence of k refinement steps for 0 \u2264 k \u2264 d, which shows a good expectation on the lower and upper bounds. This justifies the claim that the MS-distance provides very tight bounds in a few refinement steps of a typical sequence. We also show that the MS-distance can be used in fast filtering. Note that we make no assumption on the data distribution.\n\nThe MS-distance can be applied to various problems where the Euclidean distance is a measure for proximity or similarity between objects. Among them, we provide experimental results on the nearest and the farthest neighbor searches.\n\n2 An Upper and A Lower Bounds of the Euclidean Distance\n\nFor a d-dimensional vector $x = [x_1, x_2, \\ldots, x_d]$, we denote its mean by $\\mu_x = \\frac{1}{d}\\sum_{i=1}^{d} x_i$ and its variance by $\\sigma_x^2 = \\frac{1}{d}\\sum_{i=1}^{d} (x_i - \\mu_x)^2$. For a pair of vectors x and y, we can reformulate the squared Euclidean distance between x and y as follows. Let $a = [a_1, a_2, \\ldots, a_d]$ and $b = [b_1, b_2, \\ldots, b_d]$ such that $a_i = x_i - \\mu_x$ and $b_i = y_i - \\mu_y$.\n\n\\begin{align}\n\\mathrm{dist}(x, y)^2 &= \\sum_{i=1}^{d} (x_i - y_i)^2 \\notag \\\\\n&= \\sum_{i=1}^{d} ((\\mu_x + a_i) - (\\mu_y + b_i))^2 \\notag \\\\\n&= \\sum_{i=1}^{d} (\\mu_x^2 + 2 a_i \\mu_x + a_i^2 + \\mu_y^2 + 2 b_i \\mu_y + b_i^2 - 2(\\mu_x \\mu_y + a_i \\mu_y + b_i \\mu_x + a_i b_i)) \\tag{1} \\\\\n&= \\sum_{i=1}^{d} (\\mu_x^2 - 2 \\mu_x \\mu_y + \\mu_y^2 + a_i^2 + b_i^2 - 2 a_i b_i) \\tag{2} \\\\\n&= d\\left((\\mu_x - \\mu_y)^2 + (\\sigma_x + \\sigma_y)^2\\right) - 2 d \\sigma_x \\sigma_y - 2 \\sum_{i=1}^{d} a_i b_i \\tag{3} \\\\\n&= d\\left((\\mu_x - \\mu_y)^2 + (\\sigma_x - \\sigma_y)^2\\right) + 2 d \\sigma_x \\sigma_y - 2 \\sum_{i=1}^{d} a_i b_i. \\tag{4}\n\\end{align}\n\nBy the definitions of $a_i$ and $b_i$, we have $\\sum_{i=1}^{d} a_i = \\sum_{i=1}^{d} b_i = 0$ and $\\frac{1}{d}\\sum_{i=1}^{d} a_i^2 = \\sigma_x^2$ (likewise $\\frac{1}{d}\\sum_{i=1}^{d} b_i^2 = \\sigma_y^2$). By the first property, equation (1) simplifies to (2), and by the second property, equation (2) becomes (3) and (4).\n\nNote that equations (3) and (4) are composed of the mean and variance values (their products and squared values, multiplied by d) of x and y, except the last summations. 
Thus, once we preprocess V of n vectors such that both $\\mu_x$ and $\\sigma_x$ for all x \u2208 V are computed in O(dn) time and stored in a table of size O(n), every term other than the last summation can be computed in constant time for any pair of vectors, regardless of the dimension.\n\nThe last summation, $\\sum_{i=1}^{d} a_i b_i$, is the inner product $\\langle a, b \\rangle$, and therefore by applying the Cauchy-Schwarz inequality we get\n\n\\begin{equation}\n|\\langle a, b \\rangle| = \\left|\\sum_{i=1}^{d} a_i b_i\\right| \\le \\sqrt{\\left(\\sum_{i=1}^{d} a_i^2\\right)\\left(\\sum_{i=1}^{d} b_i^2\\right)} = d \\sigma_x \\sigma_y. \\tag{5}\n\\end{equation}\n\nThis gives us the following upper and lower bounds of the squared Euclidean distance from equations (3) and (4).\n\nLemma 1 For two d-dimensional vectors x, y, the following hold.\n\n\\begin{align}\n\\mathrm{dist}(x, y)^2 &\\ge d\\left((\\mu_x - \\mu_y)^2 + (\\sigma_x - \\sigma_y)^2\\right) \\tag{6} \\\\\n\\mathrm{dist}(x, y)^2 &\\le d\\left((\\mu_x - \\mu_y)^2 + (\\sigma_x + \\sigma_y)^2\\right) \\tag{7}\n\\end{align}\n\n3 The MS-distance\n\nThe lower and upper bounds in inequalities (6) and (7) can be computed in constant time once we compute the mean and standard deviation values of each vector in V in the preprocessing. However, in some applications these bounds may not be tight enough. In this section, we introduce the MS-distance, which not only provides lower and upper bounds of the Euclidean distance in constant time, but can also be refined further so that it converges to the exact Euclidean distance within d steps.\n\nTo do this, we reformulate the last term of equations (3) and (4), that is, the inner product $\\langle a, b \\rangle$. If either norm $\\|a\\| = \\sqrt{\\sum_{i=1}^{d} a_i^2}$ or $\\|b\\| = \\sqrt{\\sum_{i=1}^{d} b_i^2}$ is zero, then $\\sum_{i=1}^{d} a_i b_i = 0$, thus the upper and lower bounds become the same. This implies that we can compute the exact Euclidean distance in constant time. So from now on, we assume that both $\\|a\\|$ and $\\|b\\|$ are non-zero. We reformulate the inner product $\\langle a, b \\rangle$.\n\n\\begin{align}\n\\sum_{i=1}^{d} a_i b_i &= d \\sigma_x \\sigma_y - d \\sigma_x \\sigma_y + \\sum_{i=1}^{d} a_i b_i \\notag \\\\\n&= d \\sigma_x \\sigma_y - \\frac{\\sigma_x \\sigma_y}{2}\\left(2d - \\sum_{i=1}^{d} \\frac{2 a_i b_i}{\\sigma_x \\sigma_y}\\right) \\notag \\\\\n&= d \\sigma_x \\sigma_y - \\frac{\\sigma_x \\sigma_y}{2}\\left(\\sum_{i=1}^{d} \\left(\\frac{a_i}{\\sigma_x}\\right)^2 + \\sum_{i=1}^{d} \\left(\\frac{b_i}{\\sigma_y}\\right)^2 - \\sum_{i=1}^{d} \\frac{2 a_i b_i}{\\sigma_x \\sigma_y}\\right) \\tag{8} \\\\\n&= d \\sigma_x \\sigma_y - \\frac{\\sigma_x \\sigma_y}{2} \\sum_{i=1}^{d} \\left(\\frac{b_i}{\\sigma_y} - \\frac{a_i}{\\sigma_x}\\right)^2 \\tag{9} \\\\\n&= -d \\sigma_x \\sigma_y + \\frac{\\sigma_x \\sigma_y}{2} \\sum_{i=1}^{d} \\left(\\frac{b_i}{\\sigma_y} + \\frac{a_i}{\\sigma_x}\\right)^2 \\tag{10}\n\\end{align}\n\nEquation (8) holds because $\\sum_{i=1}^{d} a_i^2 = d \\sigma_x^2$ and $\\sum_{i=1}^{d} b_i^2 = d \\sigma_y^2$. We can also get equation (10) by switching the roles of the term $-d \\sigma_x \\sigma_y$ and the term $d \\sigma_x \\sigma_y$ in the above equations.\n\nDefinition. Now we define the MS-distance between x and y in its lower bound form, denoted by MSL(x, y, k), by replacing the last term of equation (3) with equation (9) and truncating its summation at index k, and in its upper bound form, denoted by MSU(x, y, k), by replacing the last term of equation (4) with equation (10) and truncating its summation at index k. The MS-distance makes use of the nonincreasing intermediate values for its upper bound and the nondecreasing intermediate values for its lower bound. We let $a_0 = b_0 = 0$.\n\n\\begin{align}\n\\mathrm{MSL}(x, y, k) &= d\\left((\\mu_x - \\mu_y)^2 + (\\sigma_x - \\sigma_y)^2\\right) + \\sigma_x \\sigma_y \\sum_{i=0}^{k} \\left(\\frac{b_i}{\\sigma_y} - \\frac{a_i}{\\sigma_x}\\right)^2 \\tag{11} \\\\\n\\mathrm{MSU}(x, y, k) &= d\\left((\\mu_x - \\mu_y)^2 + (\\sigma_x + \\sigma_y)^2\\right) - \\sigma_x \\sigma_y \\sum_{i=0}^{k} \\left(\\frac{b_i}{\\sigma_y} + \\frac{a_i}{\\sigma_x}\\right)^2 \\tag{12}\n\\end{align}\n\nProperties. 
Note that equation (11) is nondecreasing and equation (12) is nonincreasing as k increases from 0 to d, because d, $\\sigma_x$, and $\\sigma_y$ are all nonnegative, and $\\left(\\frac{b_i}{\\sigma_y} - \\frac{a_i}{\\sigma_x}\\right)^2$ and $\\left(\\frac{b_i}{\\sigma_y} + \\frac{a_i}{\\sigma_x}\\right)^2$ are also nonnegative for all i. This is very useful because, in equation (11), the first term, MSL(x, y, 0), is already a lower bound of $\\mathrm{dist}(x, y)^2$ by inequality (6), and the lower bound can be refined further nondecreasingly over the summation in the second term. If we stop the summation at i = k, for k < d, the intermediate result is also a refined lower bound of $\\mathrm{dist}(x, y)^2$. Similarly, in equation (12), the first term, MSU(x, y, 0), is already an upper bound of $\\mathrm{dist}(x, y)^2$ by inequality (7), and the upper bound can be refined further nonincreasingly over the summation in the second term. This means we can stop the summation as soon as we find a bound good enough for the application under consideration. If we need the exact Euclidean distance, we can get it by continuing to the full summation. We summarize the above properties in the following.\n\nLemma 2 (Monotone Convergence) Let MSL(x, y, k) and MSU(x, y, k) be the lower and upper bounds of the MS-distance as defined above, respectively. Then the following properties hold.\n\n\u2022 $\\mathrm{MSL}(x, y, 0) \\le \\mathrm{MSL}(x, y, 1) \\le \\cdots \\le \\mathrm{MSL}(x, y, d-1) \\le \\mathrm{MSL}(x, y, d) = \\mathrm{dist}(x, y)^2$.\n\u2022 $\\mathrm{MSU}(x, y, 0) \\ge \\mathrm{MSU}(x, y, 1) \\ge \\cdots \\ge \\mathrm{MSU}(x, y, d-1) \\ge \\mathrm{MSU}(x, y, d) = \\mathrm{dist}(x, y)^2$.\n\u2022 $\\mathrm{MSL}(x, y, k) = \\mathrm{MSL}(x, y, k+1)$ if and only if $b_{k+1}/\\sigma_y = a_{k+1}/\\sigma_x$.\n\u2022 $\\mathrm{MSU}(x, y, k) = \\mathrm{MSU}(x, y, k+1)$ if and only if $b_{k+1}/\\sigma_y = -a_{k+1}/\\sigma_x$.\n\nLemma 3 For 0 \u2264 k < d, we can update MSL(x, y, k) to MSL(x, y, k + 1), and MSU(x, y, k) to MSU(x, y, k + 1) in constant time.\n\nFast Filtering. 
We must emphasize that MSL(x, y, 0) and MSU(x, y, 0) can be used for fast filtering. Let \u03c6 denote a threshold for filtering defined in some proximity search problem under consideration. If \u03c6 < MSL(x, y, 0) in the case of nearest search, or \u03c6 > MSU(x, y, 0) in the case of farthest search, we do not need to consider the pair (x, y) as a candidate, and thus we save the time of computing their exact Euclidean distance.\n\nPrecisely speaking, we map each d-dimensional vector $x = [x_1, x_2, \\ldots, x_d]$ into a pair of points, $\\hat{x}^+$ and $\\hat{x}^-$, in the 2-dimensional plane such that $\\hat{x}^+ = [\\mu_x, \\sigma_x]$ and $\\hat{x}^- = [\\mu_x, -\\sigma_x]$. Then\n\n\\begin{align}\n\\mathrm{dist}(\\hat{x}^+, \\hat{y}^+)^2 &= \\mathrm{MSL}(x, y, 0)/d \\tag{13} \\\\\n\\mathrm{dist}(\\hat{x}^+, \\hat{y}^-)^2 &= \\mathrm{MSU}(x, y, 0)/d. \\tag{14}\n\\end{align}\n\nTo see why this is useful in fast filtering, consider the case of finding the nearest vector. For d-dimensional vectors in V of size n, we have n pairs of points in the plane as in Figure 1. Since $\\sigma_x$ is nonnegative, exactly n points lie on or below the $\\mu$-axis. Let q be a query vector, and let $\\hat{q}^+$ denote the point mapped in the plane as defined above. Among these n points lying on or below the $\\mu$-axis, let $\\hat{x}_i^-$ be the point that is nearest to $\\hat{q}^+$. Note that the closest point from the query can be computed efficiently in 2-dimensional space: for example, after constructing a space partitioning structure such as a kd-tree or an R-tree, each query can be answered in poly-logarithmic search time.\n\nThen we can ignore all d-dimensional vectors x whose mapped point $\\hat{x}^+$ lies outside the circle centered at $\\hat{q}^+$ and of radius $\\mathrm{dist}(\\hat{q}^+, \\hat{x}_i^-)$ in the plane, because they are strictly farther than $x_i$ from q. Indeed, for such a vector x, equations (13) and (14) together with Lemma 1 give $\\mathrm{dist}(q, x)^2 \\ge \\mathrm{MSL}(q, x, 0) > \\mathrm{MSU}(q, x_i, 0) \\ge \\mathrm{dist}(q, x_i)^2$.\n\nFigure 1: Fast filtering using MSL(x, y, 0) and MSU(x, y, 0). 
All d-dimensional vectors x whose mapped point $\\hat{x}^+$ lies outside the circle are strictly farther than $x_i$ from q.\n\n4 Estimating the Expected Difference Between Two Bounds\n\nWe now turn to estimating the expected difference between MSL(x, y, k) and MSU(x, y, k). Observe that MSL(x, y, k) is almost the same as MSL(x, y, k \u2212 1) if $b_k/\\sigma_y \\approx a_k/\\sigma_x$. Hence, in the worst case, $\\mathrm{MSL}(x, y, 0) = \\mathrm{MSL}(x, y, d-1) < \\mathrm{MSL}(x, y, d) = \\mathrm{dist}(x, y)^2$ when $b_k/\\sigma_y = a_k/\\sigma_x$ for all k = 0, 1, . . . , d \u2212 1, except k = d. In that case, if we need a lower bound strictly better than MSL(x, y, 0), then we need to go through all d refinement steps, which takes O(d) time. It is not difficult to see that this also applies to the case of MSU(x, y, k).\n\nHowever, this is unlikely to happen. Consider a random order for the terms of the last summation in MSL(x, y, k) and in MSU(x, y, k). We show below that their expected values increase and decrease linearly, respectively, as k increases from 0 to d. Formally, let $(a_{\\gamma(i)}, b_{\\gamma(i)})$ denote the ith pair in the random order. We measure the expected quality of the bounds by the difference between the bounds, that is, MSU(x, y, k) \u2212 MSL(x, y, k), as follows.\n\n\\begin{align}\n\\mathrm{MSU}(x, y, k) - \\mathrm{MSL}(x, y, k) &= 4 d \\sigma_x \\sigma_y - 2 \\sigma_x \\sigma_y \\sum_{i=0}^{k} \\left(\\left(\\frac{a_{\\gamma(i)}}{\\sigma_x}\\right)^2 + \\left(\\frac{b_{\\gamma(i)}}{\\sigma_y}\\right)^2\\right) \\tag{15} \\\\\n&= 4 d \\sigma_x \\sigma_y - 2 \\sigma_x \\sigma_y \\frac{k}{d} \\sum_{i=0}^{d} \\left(\\left(\\frac{a_i}{\\sigma_x}\\right)^2 + \\left(\\frac{b_i}{\\sigma_y}\\right)^2\\right) \\tag{16} \\\\\n&= 4 d \\sigma_x \\sigma_y - 4 k \\sigma_x \\sigma_y \\tag{17} \\\\\n&= 4 \\sigma_x \\sigma_y (d - k) \\tag{18}\n\\end{align}\n\nLet us explain how we get Equation (16) from (15); this step holds in expectation over the random order. Let N denote the set of all pairs, and let $N^k$ denote the set of the first k pairs in the random order. Since each pair in N is treated equally, $N^k$ is a random subset of N of size k. Therefore, in expectation, $\\sum_{i=1}^{k} (a_{\\gamma(i)}/\\sigma_x)^2$ equals the total sum of $(a_i/\\sigma_x)^2$ with i from 1 to d, divided by d/k. The same holds for $\\sum_{i=1}^{k} (b_{\\gamma(i)}/\\sigma_y)^2$ by a similar argument.\n\nEquations (17) and (18) follow because $\\sum_{i=1}^{d} a_i^2 = d \\sigma_x^2$ and $\\sum_{i=1}^{d} b_i^2 = d \\sigma_y^2$ by the definitions of $a_i$ and $b_i$. By replacing each squared sum with d, that is, by applying $\\sum_{i=1}^{d} (a_i/\\sigma_x)^2 = \\sum_{i=1}^{d} (b_i/\\sigma_y)^2 = d$, we obtain Equation (17), which simplifies to Equation (18).\n\nLemma 4 The expected value of MSU(x, y, k) \u2212 MSL(x, y, k) is $4 \\sigma_x \\sigma_y (d - k)$.\n\nBecause $\\mathrm{dist}(x, y)^2$ always lies between the two bounds, the following also holds.\n\nCorollary 1 Both expected values of $\\mathrm{MSU}(x, y, k) - \\mathrm{dist}(x, y)^2$ and $\\mathrm{dist}(x, y)^2 - \\mathrm{MSL}(x, y, k)$ are at most $4 \\sigma_x \\sigma_y (d - k)$.\n\nThis shows a good theoretical expectation on the lower and upper bounds, and justifies the claim that the MS-distance provides very tight bounds in a few refinement steps of a typical sequence.\n\n5 Applications: Proximity Searches\n\nThe MS-distance can be applied to problems where the Euclidean distance is a measure for proximity or similarity of objects. As a case study, we implemented the nearest neighbor search (NNS) and the farthest neighbor search (FNS) using the MS-distance.\n\nGiven a set X of d-dimensional vectors $x_i$, for i = 1, . . . , n, and a d-dimensional query vector q, we use the following simple randomized algorithm for NNS. Initially, we set \u03c6 to the threshold given by the application under consideration or computed from the fast filtering in 2 dimensions in Section 3.\n\n1. Consider the vectors in X one at a time in a random sequence. 
At the ith stage, we do the following.\n\n    if MSL(q, x_i, 0) < \u03c6:\n        for j = 1, 2, ..., d:\n            if MSL(q, x_i, j) > \u03c6:\n                break\n            if j = d:\n                \u03c6 = MSL(q, x_i, d)\n                NN = i\n\n2. Return NN as the nearest neighbor of q with the squared Euclidean distance \u03c6.\n\nNote that the first line of the pseudocode filters out the vectors whose distance to q is larger than \u03c6, as in the fast filtering in Section 3. In the for loop, we compute MSL(q, x_i, j) from MSL(q, x_i, j \u2212 1) in constant time. In the last two lines of the pseudocode, we update \u03c6 to the exact squared Euclidean distance between q and x_i and store the index as the current nearest neighbor (NN). The algorithm for the farthest neighbor search is similar to this one, except that it uses MSU(q, x_i, j) and maintains the maximum distance.\n\nFor empirical comparison, we implemented a linear search algorithm that simply computes distances from q to every x_i and chooses the one with the minimum distance. We also used the implementation of the cover tree [6]. A cover tree is a data structure that supports fast nearest neighbor queries given a fixed intrinsic dimensionality [7].\n\nWe tested these implementations on data sets from the UCI machine learning archive [11]. We selected data sets D of various dimensions (from 10 to 100,000), randomly selected 30 query points Q \u2282 D, and queried them on D \\ Q. We labelled the data set of dimension d as \u201cDd\u201d. The data sets D500, D5000, D10000, D20000, D100000 were used in the NIPS 2003 challenge on feature selection [12]. The test machine has one Intel Q6600 CPU at 2.4GHz and 3GB of memory, and runs the 32-bit Ubuntu 10 operating system.\n\nFigure 2 shows the percentage of data filtered off. For the data sets of moderate dimensions, the MS-distance filtered off over 95% of data without loss of accuracy. For high dimensional data, the MS-distance failed to filter off many data. 
Probably this is because the distances from queries to their nearest vectors tend to converge to the distances to their farthest vectors, as described in [13]. This makes it hard to decrease (or increase in FNS) the threshold \u03c6 for the MS-distance enough to filter off many data. However, in such high dimensions, both the linear search and the cover tree algorithm also show poor performance.\n\nFigure 3 shows the preprocessing time of the MS-distance and the cover tree for NNS. The time axis is in log-scaled seconds. This shows that the preprocessing of the MS-distance is up to 1000 times faster than that of the cover tree. This is because the MS-distance requires only O(dn) time to compute the mean and the standard deviation values.\n\nFigure 2: Data filtered off in percentage.\n\nFigure 3: Preprocessing time for nearest neighbor search in log-scaled second.\n\nFigure 4: Relative running time for the nearest neighbor search queries, normalized by linear search time.\n\nFigure 5: Relative running time for the farthest neighbor search queries, normalized by linear search time.\n\nFigure 4 shows the time spent for NNS queries. The graph shows the query time normalized by the linear search time. It is clear that the filtering algorithm based on the MS-distance beats the linear search algorithm, even on the high dimensional data in the results. The cover tree, which is designed exclusively for NNS, shows slightly better query performance than ours. However, the MS-distance is more general and flexible: it supports addition of a new vector to the data set (our data structure) in O(d) time for computing the mean and the standard deviation values of the vector. Deletion of a vector from the data set can be done in constant time. Furthermore, the data structure for NNS can also be used for FNS.\n\nFigure 5 shows the time spent for FNS queries. This is outstanding compared to the linear search algorithm. 
We are not aware of any previous work achieving better performance than this.\n\n6 Conclusion\n\nWe introduce a fast distance bounding technique, called the MS-distance, by using the mean and the standard deviation values. The MS-distance between two vectors provides upper and lower bounds of the Euclidean distance between them in constant time, and these bounds converge monotonically to the exact Euclidean distance over iterations. The MS-distance can be applied to problems where the Euclidean distance is a measure for proximity or similarity of objects. The experimental results show that our method is efficient enough even to replace the best known algorithms for proximity searches.\n\nTable 1: Data sets\n\nData Label  Name                 # of vectors\nD10         Page Blocks          5473\nD11         Wine Quality         6497\nD16         Letter Recognition   20000\nD19         Image Segmentation   2310\nD22         Parkinsons Tel       5875\nD27         Steel Plates Faults  1941\nD37         Statlog Satellite    6435\nD50         MiniBooNE            130064\nD55         Covertype            581012\nD57         Spambase             4601\nD61         IPUMS Census         233584\nD64         Optical Recognition  5620\nD86         Insurance Company    5822\nD90         YearPredictionMSD    515345\nD167        Musk2                6597\nD255        Semeion              1593\nD500        Madelon              4400\nD617        ISOLET               7795\nD5000       Gisette              13500\nD10000      Arcene               900\nD20000      Dexter               2600\nD100000     Dorothea             1950\n\nAcknowledgments\n\nThis work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (MEST) 
(NRF-2010-0009857).\n\nReferences\n\n[1] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281\u2013297. University of California Press, 1967.\n\n[2] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, NY, USA, 1995.\n\n[3] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559\u2013572, 1901.\n\n[4] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of ACM, 18:509\u2013517, September 1975.\n\n[5] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Beatrice Yormark, editor, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD \u201984, pages 47\u201357. ACM, 1984.\n\n[6] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, ICML \u201906, pages 97\u2013104, New York, NY, USA, 2006. ACM.\n\n[7] D. R. Karger and M. Ruhl. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the 34th annual ACM symposium on Theory of computing, STOC \u201902, pages 741\u2013750, New York, NY, USA, 2002. ACM.\n\n[8] \u00d6. E\u011fecio\u011flu and H. Ferhatosmano\u011flu. Dimensionality reduction and similarity computation by inner product approximations. In Proceedings of the ninth international conference on Information and knowledge management, CIKM \u201900, pages 219\u2013226, New York, NY, USA, 2000. ACM.\n\n[9] R. Weber, H. J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. 
In Proceedings of the 24th International Conference on Very Large Data Bases, VLDB \u201998, pages 194\u2013205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.\n\n[10] H. V. Jagadish, B. C. Ooi, K. L. Tan, C. Yu, and R. Zhang. idistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30:364\u2013397, June 2005.\n\n[11] UCI machine learning archive. http://archive.ics.uci.edu/ml/.\n\n[12] NIPS 2003 challenge on feature selection. http://clopinet.com/isabelle/projects/nips2003/.\n\n[13] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is \u201cnearest neighbor\u201d meaningful? In Proceedings of the 7th International Conference on Database Theory, ICDT \u201999, pages 217\u2013235, London, UK, 1999. Springer.", "award": [], "sourceid": 286, "authors": [{"given_name": "Yoonho", "family_name": "Hwang", "institution": null}, {"given_name": "Hee-kap", "family_name": "Ahn", "institution": null}]}