{"title": "A Theory-Based Evaluation of Nearest Neighbor Models Put Into Practice", "book": "Advances in Neural Information Processing Systems", "page_first": 6742, "page_last": 6753, "abstract": "In the $k$-nearest neighborhood model ($k$-NN), we are given a set of points $P$, and we shall answer queries $q$ by returning the $k$ nearest neighbors of $q$ in $P$ according to some metric. This concept is crucial in many areas of data analysis and data processing, e.g., computer vision, document retrieval and machine learning. Many $k$-NN algorithms have been published and implemented, but often the relation between parameters and accuracy of the computed $k$-NN is not explicit. We study property testing of $k$-NN graphs in theory and evaluate it empirically: given a point set $P \\subset \\mathbb{R}^\\delta$ and a directed graph $G=(P,E)$, is $G$ a $k$-NN graph, i.e., every point $p \\in P$ has outgoing edges to its $k$ nearest neighbors, or is it $\\epsilon$-far from being a $k$-NN graph? Here, $\\epsilon$-far means that one has to change more than an $\\epsilon$-fraction of the edges in order to make $G$ a $k$-NN graph. We develop a randomized algorithm with one-sided error that decides this question, i.e., a property tester for the $k$-NN property, with complexity $O(\\sqrt{n} k^2 / \\epsilon^2)$ measured in terms of the number of vertices and edges it inspects, and we prove a lower bound of $\\Omega(\\sqrt{n / \\epsilon k})$. 
We evaluate our tester empirically on the $k$-NN models computed by various algorithms and show that it can be used to detect $k$-NN models with bad accuracy in significantly less time than the building time of the $k$-NN model.", "full_text": "A Theory-Based Evaluation of Nearest Neighbor Models Put Into Practice\n\nHendrik Fichtenberger\u2217\nTU Dortmund\nDortmund, Germany\nhendrik.fichtenberger@tu-dortmund.de\n\nDennis Rohde\u2020\nTU Dortmund\nDortmund, Germany\ndennis.rohde@cs.tu-dortmund.de\n\nAbstract\n\nIn the k-nearest neighborhood model (k-NN), we are given a set of points P, and we shall answer queries q by returning the k nearest neighbors of q in P according to some metric. This concept is crucial in many areas of data analysis and data processing, e.g., computer vision, document retrieval and machine learning. Many k-NN algorithms have been published and implemented, but often the relation between parameters and accuracy of the computed k-NN is not explicit. We study property testing of k-NN graphs in theory and evaluate it empirically: given a point set P \u2282 R\u03b4 and a directed graph G = (P, E), is G a k-NN graph, i.e., every point p \u2208 P has outgoing edges to its k nearest neighbors, or is it \u03b5-far from being a k-NN graph? Here, \u03b5-far means that one has to change more than an \u03b5-fraction of the edges in order to make G a k-NN graph. 
We develop a randomized algorithm with one-sided error that decides this question, i.e., a property tester for the k-NN property, with complexity O(\u221an \u00b7 k\u00b2/\u03b5\u00b2) measured in terms of the number of vertices and edges it inspects, and we prove a lower bound of \u03a9(\u221a(n/\u03b5k)). We evaluate our tester empirically on the k-NN models computed by various algorithms and show that it can be used to detect k-NN models with bad accuracy in significantly less time than the building time of the k-NN model.\n\n1 Introduction\n\nThe k-nearest neighborhood (k-NN) of a point q with respect to some set of points P is one of the most fundamental concepts used in data analysis tasks such as classification, regression and machine learning. In the past decades, many algorithms have been proposed in theory as well as in practice to efficiently answer k-NN queries [e.g., 1, 7\u201310, 18, 20, 23, 26, 27, 30, 36]. For example, one can construct a k-NN graph of a point set P, i.e., a directed graph G = (P, E) of size n = |P| such that E contains an edge (p, q) for every k-nearest neighbor q of p for every p \u2208 P, in time O(n log n + kn) for constant dimension \u03b4 [8]. Due to restrictions on computational resources, approximations and heuristics are often used instead (see, e.g., [9, 10] and the discussion therein for details). Given the output graph G\u2032 of such a randomized approximation algorithm or heuristic, one might want to check whether G\u2032 resembles a k-NN graph before using it, e.g., in a data processing pipeline. However, the time required for exact verification might cancel out the advantages gained by using an approximation algorithm or a heuristic. On the other hand, testing whether G\u2032 is at least close to a k-NN graph will suffice for many purposes. 
Property testing is a framework for the theoretical analysis of decision and verification problems that are relaxed in favor of sublinear complexity. One motivation of property testing is to fathom the theoretical foundations of efficiently assessing approximation and heuristic algorithms\u2019 outputs.\n\n\u2217ORCID iD: 0000-0003-3246-5323\n\u2020ORCID iD: 0000-0001-8984-1962\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nProperty testing [32], and in particular property testing of graphs [21], has been studied quite extensively since its founding. A one-sided error \u03b5-tester for a property P of graphs with average degree bounded by d has to accept every graph G \u2208 P, and it has to reject every graph H that is \u03b5-far from P with probability at least 2/3 (i.e., if graphs that are \u03b5-far are relevant, it has precision 1 and recall 2/3). A graph H of size n is \u03b5-far from some property P if more than \u03b5dn edges have to be added or removed to transform it into a graph that is in P. A two-sided error \u03b5-tester may also err with probability less than 1/3 if the graph has the property. The computational complexity of a property tester is the number of adjacency list entries it reads, denoted its queries. Many works in graph property testing focus on testing plain graphs that contain only the pure combinatorial information. However, most graphs that model real data contain some additional information that may, for example, indicate the type of an atom, the bandwidth of a data link or spatial information of an object that is represented by a vertex or an edge, respectively. In this work, we consider geometric graphs with bounded average degree. In particular, the graphs are embedded into R\u03b4, i.e., every vertex has a coordinate x \u2208 R\u03b4. 
The coordinate of a vertex may be obtained by a query.\n\nMain Results Our first result is a property tester with one-sided error for the property that a given geometric graph G with bounded average degree is a k-nearest neighborhood graph of its underlying point set (i.e., it has precision 1 and recall 2/3 when taking \u03b5-far graphs as relevant).\nTheorem 1. Given an input graph G = (V, E) of size n = |V| with bounded average degree d, there exists a one-sided error \u03b5-tester that tests whether G is a k-nearest neighbourhood graph. It has query complexity c \u00b7 \u221an \u00b7 k\u00b2\u03c8\u03b4/\u03b5\u00b2, where \u03c8\u03b4 is the \u03b4-dimensional kissing number and c > 0 is a universal constant.\n\nWe emphasize that it is not necessary to compute the ground truth (i.e., the k-NN of P) in order to run the property tester. Furthermore, the tester can be easily adapted for graphs G = (P \u222a Q, E) such that P \u2229 Q = \u2205 and we only require that for every q \u2208 Q, E contains an edge (q, p) for every k-nearest neighbor p of q in P. This is more natural when we think of P as a training set and Q as a test set or query domain. To complement this result, we prove a lower bound that holds even for two-sided error testers.\nTheorem 2. Testing whether a given input graph G = (V, E) of size n = |V| is a k-nearest neighbourhood graph with one-sided or two-sided error requires max(\u221a(n/(8\u03b5k)), k\u03c8\u03b4/6) queries.\n\nFinally, we provide an experimental evaluation of our property tester on approximate nearest neighbor (ANN) indices computed by various ANN algorithms. 
Our results indicate that the tester requires significantly less time than the ANN algorithm needs to build the ANN index, most times just a 1/10-fraction. Therefore, it can often detect badly chosen parameters of the ANN algorithm at almost no additional cost and before the ANN index is fed into the remaining data processing pipeline.\n\nRelated Work We give an overview of sublinear algorithms for geometric graphs, which is the topic of research that is most relevant to our work. As mentioned above, the research on k-NN algorithms is very broad and diverse. See, e.g., [15, 33] for surveys. Testing whether a geometric graph that is embedded into the plane is a Euclidean minimum spanning tree has been studied by Ben-Zwi et al. [3] and Czumaj and Sohler [11]. In [3], the authors show that any non-adaptive tester has to make \u03a9(\u221an) queries, and that any adaptive tester has query complexity \u03a9(n^{1/3}). In [11], a one-sided error tester with query complexity \u00d5(\u221a(n/\u03b5)) is given. In a fashion similar to property testing, Czumaj et al. [14] estimate the weight of Euclidean Minimum Spanning Trees in \u00d5(\u221an \u00b7 poly(\u03b5)) time, and Czumaj and Sohler [12] approximate the weight of Metric Minimum Spanning Trees in \u00d5(n \u00b7 poly(\u03b5)) time for constant dimension, respectively. Hellweg et al. [22] develop a tester for Euclidean (1 + \u03b4)-spanners. Property testers for many other geometric problems can, for example, be found in [13, 29].\n\n2 Preliminaries\n\nLet d, \u03b4, \u03b5, k \u2265 0, d \u2265 k be fixed parameters. In this paper, we consider property testing on directed geometric graphs with bounded average degree d. A graph G = (V, E) with an associated function coord : V \u2192 R\u03b4 is a geometric graph, where each vertex v is assigned a coordinate coord(v). Given v \u2208 V, we denote its degree by d(v) and the set of adjacent vertices N(v) := {u | (v, u) \u2208 E}. 
The Euclidean distance between two points x, y is denoted by dist(x, y). For the sake of simplicity, we write dist(u, v) := dist(coord(u), coord(v)) for two vertices u, v \u2208 V. When there is no ambiguity, we also refer to coord(v) by simply writing v. We denote the size of the graph G = (V, E) at hand by n = |V|.\nDefinition 1 (k-nearest neighborhood graph). A geometric graph G = (V, E) is a k-nearest neighbourhood (k-NN) graph if for every v \u2208 V, the k points u1, . . . , uk \u2208 V that lie nearest to v according to dist(\u00b7,\u00b7) are neighbors of v in G, i.e., (v, ui) \u2208 E for all i \u2208 [k] (breaking ties arbitrarily).\nLet G = (V, E) be a geometric graph. We say that a graph G is \u03b5-far from a geometric graph property P if at least \u03b5dn edges of G have to be modified in order to convert it into a graph that satisfies the property P. We assume that the graph G is represented by a function fG : V \u00d7 [n] \u2192 V \u222a {\u22c6}, where fG(v, i) denotes the ith neighbor of v if v has at least i neighbors (otherwise, fG(v, i) = \u22c6), a degree function dG : V \u2192 N that outputs the degree of a vertex, and a coordinate function coordG : V \u2192 R\u03b4 that outputs the coordinates of a vertex.\nDefinition 2 (\u03b5-tester). A one-sided (error) \u03b5-tester for a property P with query complexity q is a randomized algorithm that makes q queries to fG, dG and coordG for a graph G. The algorithm accepts if G has the property P. If G is \u03b5-far from P, then it rejects with probability at least 2/3.\nThe motivation to consider query complexity is that accessing the graph, e.g., through an ANN index, is costly but cannot be influenced. Therefore, one should minimize access to the graph.\nDefinition 3 (witness). Let #nearer(v, w) := |{u \u2208 V | u \u2260 v \u2227 dist(v, u) < dist(v, w)}| denote the number of vertices u that lie nearer to v than w. 
Further let knn(v) := {u \u2208 V | u \u2260 v \u2227 #nearer(v, u) \u2264 k \u2212 1} denote the set of v\u2019s k-nearest neighbors. Let wit(v) := {u \u2208 V | u \u2209 N(v) \u2227 u \u2208 knn(v)} define the subset of knn(v) that is not adjacent to v. If wit(v) \u2260 \u2205 or d(v) < k, we call v incomplete, and we call elements of wit(v) the witnesses of v.\n\nIf G is \u03b5-far from being a k-nearest neighborhood graph, a significant fraction of its vertices are incomplete. The proof follows from common arguments in property testing (see the full version [16]).\nLemma 4. If G is \u03b5-far from being a k-nearest neighborhood graph, at least \u03b5dn/(2k) vertices are incomplete.\n\nThe main challenge for the property tester will be to find matching witnesses for a fixed set of incomplete vertices. The following result from coding theory for Euclidean codes bounds the maximum number of points qi that can have the same fixed point p as nearest neighbor.\nLemma 5. [35] Given a point set P \u2282 R\u03b4 and p \u2208 P, the maximum number of points qi \u2208 P that can have p as nearest neighbour is bounded by the \u03b4-dimensional kissing number \u03c8\u03b4, where 2^{0.2075\u03b4(1+o(1))} \u2264 \u03c8\u03b4 [34] and \u03c8\u03b4 \u2264 2^{0.401\u03b4(1+o(1))} [24] (asymptotic notation with respect to \u03b4).\n\n3 Upper Bound\n\nThe idea of the tester is as follows (see Algorithm 1). Two samples are drawn uniformly at random: S\u2032, which shall contain many incomplete vertices if G is \u03b5-far from being a k-nearest neighborhood graph, and T, which shall contain at least one witness of an incomplete vertex in S\u2032. For every v \u2208 S\u2032, the algorithm queries its degree, its coordinate as well as every adjacent vertex and their coordinates, and calculates the distance to them. If d(v) < k or if one of the vertices in T is a witness of v, the algorithm has found an incomplete vertex, and hence rejects. 
Otherwise, it accepts.\nHowever, we have to deal with the case that some vertices in S\u2032 have non-constant degree, say, \u03a9(1/\u03b5), such that querying all their adjacent vertices would require too many queries. To this end, we prove that one can prune these vertices to obtain a subset S \u2286 S\u2032 of low-degree vertices that still contains many incomplete vertices with sufficient probability.\n\nAlgorithm 1: Tester for k-nearest neighborhood\nData: G = (V, E), d, k, \u03b5\nResult: accept or reject\nS\u2032 \u2190 sample 100k\u221an/\u03b5 vertices from V u.a.r. without replacement;\nT \u2190 sample ln(10) \u00b7 k \u00b7 \u03c8\u03b4 \u00b7 \u221an vertices from V u.a.r. with replacement;\nS \u2190 {v \u2208 S\u2032 | d(v) \u2264 100k/\u03b5};\nfor v \u2208 S, u \u2208 T do\n  if (u \u2260 v \u2227 u \u2208 knn(v) \u2227 u \u2209 N(v)) \u2228 d(v) < k then\n    reject;\n  end\nend\naccept;\n\nProof of Theorem 1 We prove that Algorithm 1 is an \u03b5-tester as claimed by Theorem 1. Since Algorithm 1 never rejects a k-nearest neighbourhood graph, assume without loss of generality that G = (V, E) is \u03b5-far from being a k-nearest neighborhood graph. Algorithm 1 only queries the neighbors of S, and therefore its query complexity is at most |S| \u00b7 100k/\u03b5 = O(\u221an \u00b7 k\u00b2/\u03b5\u00b2). It remains to prove the correctness.\nIn the following, let L := {v \u2208 V | d(v) \u2264 100k/\u03b5} denote the set of all vertices in G that have low degree, let I denote the set of incomplete vertices in L, and let IS \u2286 S denote the set of incomplete vertices in S. By an averaging argument, |V \\L| \u2264 \u03b5dn/(100k). It follows from Lemma 4 that L contains at least \u03b5dn/(4k) incomplete vertices, and therefore we focus on finding incomplete vertices that have low degree. 
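As an illustration, Algorithm 1 can be sketched in plain Python. This is our own simplified rendition, not the paper's C++ implementation: the names are ours, and the literal "u \u2208 knn(v)" test is replaced by a local, one-sided check against the distance to v's k-th listed neighbor, which only inspects v's own neighborhood and never rejects a true k-NN graph.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_tester(coords, adj, k, eps, psi, rng=random):
    """One-sided-error tester sketched after Algorithm 1.

    coords maps a vertex to its coordinate tuple, adj maps a vertex to
    its set of out-neighbors, and psi is (an upper bound on) the kissing
    number of the ambient dimension. Sample sizes follow the pseudocode;
    the min()/ceil() guards for tiny inputs are our own additions.
    """
    V = list(coords)
    n = len(V)
    s_size = min(n, math.ceil(100 * k * math.sqrt(n) / eps))
    t_size = math.ceil(math.log(10) * k * psi * math.sqrt(n))
    S_prime = rng.sample(V, s_size)                 # u.a.r. without replacement
    T = [rng.choice(V) for _ in range(t_size)]      # u.a.r. with replacement
    S = [v for v in S_prime if len(adj[v]) <= 100 * k / eps]  # prune high degree
    for v in S:
        if len(adj[v]) < k:
            return "reject"                         # v is trivially incomplete
        # Distance from v to its k-th nearest listed neighbor.
        kth = sorted(dist(coords[v], coords[u]) for u in adj[v])[k - 1]
        for u in T:
            # A non-neighbor u strictly nearer than the k-th listed
            # neighbor certifies that adj[v] misses one of v's k nearest
            # neighbors, so v is incomplete (sound one-sided check).
            if u != v and u not in adj[v] and dist(coords[v], coords[u]) < kth:
                return "reject"
    return "accept"
```

Note that the sketch only touches the sampled vertices and their adjacency lists, mirroring the query model: no ground-truth k-NN graph is computed.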
Given u \u2208 T, let WS(u) be a random variable that is 1 if u is a witness of an incomplete vertex v \u2208 IS and 0 otherwise.\nThe proof of Theorem 1 follows from the following three claims. First, note that S is a uniform sample without replacement from L whose size |S| is random. However, |S| is sufficiently large with constant probability. This claim follows from Markov\u2019s inequality (see full version [16]).\nClaim 6. With probability at least 9/10, |S| \u2265 20\u221an/\u03b5.\n\nIn the subsequent sections, we prove the following two claims. Given that S is sufficiently large, it will contain at least \u221an incomplete vertices with constant probability.\nClaim 7. If |S| \u2265 20\u221an/\u03b5, it holds with probability at least 9/10 that |IS| > \u221an.\n\nFinally, we show that if S contains at least \u221an incomplete vertices, then T will contain at least one witness of such an incomplete vertex with constant probability.\nClaim 8 (Lemma 11). If |IS| > \u221an, then with probability at least 9/10, \u2211_{w\u2208T} WS(w) > 0.\nThe correctness follows by a union bound over these three bad events.\n\nAnalysis of the Sample S: Proof of Claim 7\nProof. Since S was sampled without replacement, the random variable |IS| follows the hypergeometric distribution. Let X be a random variable that denotes the number of draws that are needed to obtain \u221an incomplete vertices in S, which therefore follows the negative hypergeometric distribution. By Lemma 4, we have E[X] \u2264 \u221an \u00b7 (n + 1)/((\u03b5dn)/(2k) + 1). By the definition of |IS| and X, we have Pr[|IS| < \u221an] \u2264 Pr[X \u2265 |S|]. We apply Markov\u2019s inequality to obtain Pr[X \u2265 |S|] \u2264 \u221an \u00b7 (n + 1)/(|S| \u00b7 ((\u03b5dn)/(2k) + 1)). 
It follows that |S| \u2265 20\u221an/\u03b5 ensures |IS| \u2265 \u221an with sufficient probability.\n\nAnalysis of the Sample T: Proof of Claim 8 We prove the following lower bound on the number of witnesses in G, which will imply a bound on |T| by k-reducing it to the case k = 1.\nProposition 9. Given a point set P \u2282 R\u03b4, p \u2208 P and k \u2208 N, the maximum number of points qi \u2208 P that can have p as k-nearest neighbor is bounded by k \u00b7 \u03c8\u03b4.\nWe note that this bound is tight, as shown in Lemma 12.\nDefinition 10 (k-reducing). Let p \u2208 P be an arbitrary point. Fix Q := {q \u2208 P | #nearer(q, p) \u2264 k \u2212 1}. Repeat the following steps until \u2200q \u2208 Q : \u2204q\u2032 \u2208 Q \\ {q} : dist(q, q\u2032) < dist(q, p).\n\n(\u2217) Pick a point q \u2208 Q that lies furthest from p and let Qq := {q\u2032 \u2208 Q | q \u2260 q\u2032 \u2227 dist(q, q\u2032) < dist(q, p)}.\n\n(#) Set Q := Q \\ Qq.\n\nProof of Proposition 9. We apply Definition 10 to p and prove that the size of Q at the beginning of the process is at most k \u00b7 \u03c8\u03b4, which proves the claim.\nAt first we show that every vertex that is picked by (\u2217) stays in Q: Let q1, q2 be arbitrary points that are picked by (\u2217) in the process of k-reducing, with q1 being picked in an earlier iteration than q2. The latter implies dist(q2, p) < dist(q1, p). Assume that q1 \u2208 Qq2 at the time q2 is selected, and therefore q1 is removed from Q. Since q1 is deleted by (#), it holds that dist(q1, p) < dist(q2, p), which is a contradiction as q1 has been selected before q2.\nWe continue to bound the maximum number of vertices that share their k-nearest neighbor: Because p is the nearest point for the remaining q \u2208 Q, we apply Lemma 5 and conclude that at most \u03c8\u03b4 vertices are remaining in Q after k-reducing. 
Since every iteration of step (#) removed at most k \u2212 1 points from Q, the cardinality of Q at the beginning of the process was at most \u03c8\u03b4 + (k \u2212 1) \u00b7 \u03c8\u03b4 = k \u00b7 \u03c8\u03b4.\n\nSince at most k \u00b7 \u03c8\u03b4 vertices can share a witness by Proposition 9, there are at least |IS|/(k \u00b7 \u03c8\u03b4) distinct witnesses of vertices in S. We employ this bound to calculate the size of the sample T such that it contains at least one witness of an incomplete vertex in S with constant probability.\nLemma 11. If |IS| \u2265 \u221an and |T| \u2265 ln(10) \u00b7 k \u00b7 \u03c8\u03b4 \u00b7 \u221an, then Pr[\u2211_{w\u2208T} WS(w) = 0] \u2264 1/10.\n\nProof. Since every vertex is sampled uniformly at random with replacement, the event that one vertex is a witness is a Bernoulli trial with probability Pr_{w\u2208V}[WS(w) = 1] \u2265 |IS|/(k \u00b7 \u03c8\u03b4) \u00b7 1/n = |IS|/(k \u00b7 \u03c8\u03b4 \u00b7 n). Therefore Pr_{w\u2208V}[WS(w) = 0] \u2264 1 \u2212 |IS|/(k \u00b7 \u03c8\u03b4 \u00b7 n). We have\n\n|T| \u2265 ln(10) \u00b7 k \u00b7 \u03c8\u03b4 \u00b7 \u221an\n\u21d2 |IS| \u00b7 |T| \u2265 ln(10) \u00b7 k \u00b7 \u03c8\u03b4 \u00b7 n   (1)\n\u21d2 (1 \u2212 |IS|/(k \u00b7 \u03c8\u03b4 \u00b7 n))^{|T|} \u2264 1/10   (2)\n\u21d4 Pr[\u2211_{u\u2208T} WS(u) = 0] \u2264 1/10   (3)\n\nBy Claim 7, Eq. (1) holds for |S| as chosen in Algorithm 1. In Eq. (2) we use the fact that 1 \u2212 x \u2264 e^{\u2212x}, and in Eq. (3) we use that all events WS(u) = 0 for u \u2208 T are independent Bernoulli trials.\n\nFinally, we observe that the factor k that is introduced in Proposition 9 is tight.\nLemma 12. For every \u03b4 \u2265 3, k \u2265 2, there exists a point set P \u2282 R\u03b4 such that there is a set of k\u03c8\u03b4 points qi \u2208 P that have the same k-nearest neighbor.\nProof. 
Take a set P = \u02dcP \u222a {(0, . . . , 0)} of \u03b4-dimensional points, where \u02dcP consists of \u03c8\u03b4 points from R\u03b4 that have (0, . . . , 0) \u2208 R\u03b4 as their nearest neighbor. Create a new point set P\u2032 from P by splitting each point p \u2208 \u02dcP into k points p1, . . . , pk. Breaking ties arbitrarily, the 1st to (k \u2212 1)st nearest neighbors of pi are \u222a_{j\u2260i}{pj} (with distance 0), but (0, . . . , 0) is the k-nearest neighbor for all pi, i \u2208 [k]. Thus, |P\u2032| = k \u00b7 |\u02dcP| + 1 = k\u03c8\u03b4 + 1, and all points in P\u2032 except the origin have (0, . . . , 0) as their k-nearest neighbor.\n\n4 Lower Bound\n\nWe prove the first lower bound by constructing two (distributions of) graphs that are composed of multiple copies of the same building block. All graphs in one distribution are k-nearest neighborhood graphs, and all graphs in the other distribution are \u03b5-far from the property. It suffices to show that no deterministic algorithm that makes o(\u221an) queries can distinguish these two distributions with sufficiently high probability. Our building block is defined as follows.\n\nDefinition 13 (line gadget). Let x \u2208 R. A line gadget is a geometric, complete, directed graph Lx = (V, E) of size k + 1. The vertices v1, . . . , vk+1 \u2208 V have coordinates x, x + 1, . . . , x + k \u2208 R.\nNote that a line gadget is a k-nearest neighborhood graph itself. In the following, let k\u2032 := k + 1. The graphs in the first distribution D1 are composed of n/k\u2032 line gadgets with sufficiently large pair-wise distances that maintain the k-nearest neighborhood property. The construction of the distribution of \u03b5-far graphs D2 is a bit more complicated. 
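The D1 construction above can be sketched directly from Definition 13. The following illustration is our own (the helper names and the `gap` spacing are arbitrary choices); it relies on the fact that a complete gadget on k + 1 vertices is itself a k-NN graph.

```python
def line_gadget(x, k):
    """Line gadget L_x (Definition 13): a complete directed graph on
    k + 1 vertices with 1-D coordinates x, x + 1, ..., x + k. Being
    complete, every vertex links to all k others, so the gadget is a
    k-NN graph on its own."""
    coords = [float(x + i) for i in range(k + 1)]
    edges = [(i, j) for i in range(k + 1) for j in range(k + 1) if i != j]
    return coords, edges

def build_d1_instance(num_gadgets, k, gap=1000.0):
    """Sketch of a graph from distribution D1: num_gadgets line gadgets
    placed with pairwise distance large enough (the gap value is our
    choice) that each vertex's k nearest neighbors stay inside its own
    gadget, so the whole graph is a k-NN graph."""
    coords, edges = {}, set()
    for g in range(num_gadgets):
        c, e = line_gadget(g * gap, k)
        base = g * (k + 1)               # global vertex ids per gadget
        for i, x in enumerate(c):
            coords[base + i] = x
        for i, j in e:
            edges.add((base + i, base + j))
    return coords, edges
```

A graph from D2 would then be obtained, as described next, by relocating some gadgets onto the coordinates of others while keeping the edge structure, which makes the affected vertices incomplete.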
Basically, we want to move \u2308\u03b5n/k\u2032\u2309 line gadgets to the exact position of \u2308\u03b5n/k\u2032\u2309 other line gadgets such that in the resulting graph, \u2308\u03b5n/k\u2032\u2309 pairs of line gadgets share the same coordinates. However, we have to make sure that the algorithm is oblivious of this relocation with sufficiently high probability. We provide a sketch of the proof here; the whole proof is contained in the full version [16].\nLemma 14. Testing whether a graph is a k-nearest neighborhood graph with two-sided error requires \u221a(n/(8\u03b5k\u2032)) queries.\n\nProof sketch. It is sufficient to show that for any deterministic algorithm that makes o(\u221an) queries, the distributions of knowledge graphs that are obtained from distributions D1 and D2, respectively, have small statistical distance.\nWithout loss of generality, one can assume that every query of the algorithm to a graph from D1 reveals a line gadget that is not in the knowledge graph yet. Then, the probability that some line gadget is revealed by the i-th query is uniform over all undiscovered line gadgets. Now, consider a graph from D2. Call all line gadgets that share their coordinates with another line gadget blue, and call all other line gadgets red. One can show that if a query does not reveal a blue line gadget such that a previous query revealed a blue line gadget with the same coordinates, then the revealed line gadget is distributed uniformly among all other undiscovered (red or blue) line gadgets.\nThe probability that two (blue) line gadgets with the same coordinates are revealed by the first b = \u221a(n/(8\u03b5k\u2032)) queries is upper bounded by the probability that for any pair of queries (i, j) \u2208 [b]\u00b2, query i hits one of the k\u2032 vertices in one of the \u2308\u03b5n/k\u2032\u2309 moved line gadgets and query j hits its (blue) counterpart. One can show that this probability is at most \u2211_{i,j\u2208[b]} \u2308\u03b5n/k\u2032\u2309 \u00b7 k\u2032/n \u00b7 k\u2032/n \u2264 b\u00b2\u03b5k\u2032/n \u2264 1/4. 
Therefore, the total variation distance between the knowledge graph distributions is at most 1/4.\n\nIn property testing, it is common to fix problem-specific parameters such as the dimension and analyze the asymptotic behavior with respect to n and \u03b5. However, it may be interesting that the computational complexity of a tester for k-nearest neighborhood graphs is at least linear in \u03c8\u03b4. A sketch of the proof is provided in the full version [16].\nLemma 15. Testing whether a graph of size n is a k-nearest neighborhood graph with two-sided error requires at least k\u03c8\u03b4/6 queries.\n\n5 Experiments\n\nAs discussed above, property testing aims at distinguishing, at very small cost, perfect objects from objects that have many flaws. Given the output of an approximate nearest neighbor (ANN) algorithm, a natural use case for a property tester is to decide whether the nearest neighbor index computed by the ANN algorithm is accurate or resolves many queries incorrectly.\nAlthough Algorithm 1 already gives values for the sizes of |S| and |T|, one would probably want to minimize the running time of the tester beyond worst-case analysis in practice. When used as a tool to assess an ANN index before actually putting it to work, it is also important that the tester actually reduces the total computation time compared to observing poor results at the end of the data processing pipeline (e.g., bad classification results) and starting over. Therefore, we seek to answer the following questions:\nQ1 Parameterization. 
What quality of ANN indices can be tested by different choices of |S\u2032|, |T|?\nQ2 Performance. How does the testing time compare to the time required by the ANN algorithm?\n\nSetup We implemented our property tester in C++ and integrated it into the Python framework ANN-Benchmarks [2, 4]. The key idea of ANN-Benchmarks is to compare the quality of the indices built by ANN implementations with respect to their running times and query-answer times. To evaluate our property tester, we chose three algorithms with the best performance observed in [2]: KGraph [1], and hnsw and SW-graph from the Non-Metric Space Library [7, 28]. All of the ANN algorithms are implemented in C / C++ and build upon nearest neighbor / proximity graphs. We computed the ground truth, i.e., a k-NN graph of the input data, for the Euclidean datasets MNIST (size 60 000, dimension 960, [25]), Fashion-MNIST (size 60 000, dimension 960, [31]) and SIFT (size 1 000 000, dimension 128, [19]) to evaluate the answers of the tester.\nWe ran our benchmarks on identical machines with 60 GB of free RAM guaranteed and an Intel Xeon E5-2640 v4 CPU running at 2.40 GHz (capable of running 20 concurrent threads) and measured CPU time. To minimize interference between different processes, a single instance of an ANN algorithm was run exclusively on one machine at a time.\nThe C++ source code of the property tester that was used for the experiments is available here [17]. The modified version of ANN-Benchmarks is available here [6].\n\nQ1: Parameterization of the Property Tester We analyze how different choices for |S\u2032| and |T| in Algorithm 1 affect which quality of ANN indices the tester is likely to reject. All ANN algorithms were run ten times for each choice of parameters built into ann-benchmarks (as listed in [5]) and every dataset. 
Then the tester was run once for each output and for every choice from {0.001, 0.01, 0.1} \u00d7 {0.05, 0.5, 5} for (c1, c2) in |S\u2032| = c1 \u00b7 8k\u221an and |T| = c2 \u00b7 k\u221an log(10), with oracle access to the resulting ANN index. We chose to evaluate the tester for k = 10 because indices that are very close to 10-NN graphs \u2013 which is the hard case for the property tester to detect \u2013 can be computed by the ANN algorithms in reasonable time, and we support this decision by an additional experiment for k = 50. The ground truth, i.e., a k-NN graph of each dataset, and the \u03b5-distance (see Section 2) of each ANN index to a k-NN graph were computed offline.\nWe evaluate the recall of the property tester by distance of a tested ANN index to ground truth, where graphs that are no k-NN graphs are relevant (note that the tester always provides a witness when it rejects, so its precision is 1). Since the quality of an ANN index varies depending on the ANN algorithm\u2019s parameters and internal randomness, we group the computed ANN indices into buckets according to their distance to ground truth and depict the resulting recall on these classes in Fig. 1 for all datasets combined and for each dataset individually. As the oracle access that is provided to the property tester is oblivious of the underlying ANN algorithm, the figures show the combined results for all algorithms.\nWe observe that for distances and parameters that result in a reasonable overall recall, say, greater than 0.75, the property tester behaves comparably on all datasets. Since the property tester is guaranteed to have precision 1, even parameterizations with low recall on a small distance can be amplified by running the tester multiple times, possibly for different values of c1, c2. In summary, after choosing a target distance that the property tester should detect, the tested parameters seem suitable for data with dimensions up to roughly 800. 
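For concreteness, the sample-size formulas from the experiment can be evaluated numerically. This is our own illustration; we assume the natural logarithm in log(10), matching the ln(10) factor in Algorithm 1, and round up to whole samples.

```python
import math

def sample_sizes(n, k, c1, c2):
    """Sample sizes as used in the parameterization experiment:
    |S'| = c1 * 8k * sqrt(n) and |T| = c2 * k * sqrt(n) * log(10)
    (natural logarithm assumed), rounded up to integers."""
    s = math.ceil(c1 * 8 * k * math.sqrt(n))
    t = math.ceil(c2 * k * math.sqrt(n) * math.log(10))
    return s, t

# Example: a dataset of a million points (the size of SIFT) with
# k = 10 and the mid-grid choice (c1, c2) = (0.01, 0.5). Both samples
# are tiny compared to n, which is what makes the tester sublinear
# in practice.
s, t = sample_sizes(1_000_000, 10, 0.01, 0.5)
```

This also shows why the grid spans three orders of magnitude for c1: it directly scales how many candidate vertices are inspected per run.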
For higher dimensions, it is likely advisable to apply dimensionality reduction techniques first before computing and using nearest neighbors in Euclidean space.\nTo get an indication of how the tester behaves for larger k, we conducted an additional experiment where we ran the tester on KGraph indices with k = 50. As one might expect, fewer indices are close to being a 50-NN graph than a 10-NN graph for the same sets of KGraph parameters (although the distance is normalized by k and therefore allows more errors), but the results indicate that it is also easier for the property tester to spot errors. This suggests that, at least for KGraph, errors are spread quite uniformly in the index rather than being concentrated on some vertices.\n\nFigure 1: Recall of the property tester by \u03b5-distance of the ANN index to a 10-NN graph for different choices of the tester\u2019s parameters c1, c2. Distances are grouped into classes (0, 10^{-4}], (10^{-4}, 10^{-3}], . . ., all of size at least 150 (lateral axis shows upper bound of the respective bucket). For example, the property tester rejected more than 95% of the ANN indices that are between 0.005-far and 0.01-far from being a 10-NN graph for c1 = 0.01, c2 = 5.\n\nQ2: Performance Consider the following scenario: an algorithm that processes data employs an ANN index. The quality of the algorithm\u2019s result (e.g., the classification rate) depends on the quality of the ANN index. However, the best parameters for the ANN algorithm are not known, and conclusions about the quality of the ANN index can only be drawn by looking at the algorithm\u2019s final result, which may be a long, costly way to go. Does it pay off to run the property tester on the ANN index and recompute the index using different parameters if the tester rejects? We address this question by measuring the tester\u2019s performance. However, whether to use the tester or not depends heavily on the cost incurred otherwise. Therefore, we compare the property tester against the minimum cost that every algorithm that uses an ANN index must invest before it can employ it or even just draw conclusions about its quality: the build time of the index. Figure 3 shows the time required by the property tester normalized (divided) by the time required to build the ANN index for each ANN algorithm and each dataset. There are two plots: one for all graphs that are between 0.005-far and 0.01-far from a 10-NN graph, and one for all graphs that are between 0.01-far and 0.02-far from a 10-NN graph.\nIn general, the running time of the property tester is always smaller than the build time for hnsw and SW-graph, and at most five times the build time for KGraph. Mostly, it is even smaller than 1/10 of the build time, and therefore running the property tester comes at almost no additional cost. For the runs of the tester on KGraph indices with k = 50, the testing time is also upper bounded by five times the build time, and the tester time vs. build time ratio is 0.1 for k = 10 (restricted to MNIST and Fashion-MNIST) and k = 50.\n\n6 Conclusion\n\nWe have studied the task of efficiently identifying NN models with low accuracy by exploring possibilities within the theoretical framework of sublinear algorithms and evaluated our approach by moving to experiments. 
In particular, we have proved that there is a one-sided error property tester with complexity O(√n · k²/ε²), i.e., a sublinear (randomized) algorithm that decides whether an input graph G is a k-NN graph or requires many edge modifications to become a k-NN graph (i.e., precision 1 and recall 2/3 when taking ε-far graphs as relevant). We also proved that even a two-sided error property tester requires complexity Ω(√(n/(εk))). Our experiments with the property tester on ANN indices computed by various algorithms indicate that testing comes at almost no additional cost, i.e., the testing time is significantly smaller than the building time of the ANN index that is tested.

From the perspective of applications, it would be desirable to analyze the tester for a more context-sensitive notion of edit distance. For example, an edge to the (k+1)-nearest neighbor of a point instead of an edge to its k-nearest neighbor might be a defect that is much less severe than an edge to the k²-nearest neighbor.

Figure 2: Recall of the property tester by ε-distance of the ANN index to a k-NN graph for different choices of the tester's parameters and k ∈ {10, 50}. As in Fig. 1, distances are grouped into classes.

Figure 3: Performance of the property tester in terms of property tester CPU time over ANN index building CPU time for all computed ANN indices that are (0.005, 0.01]-far and (0.01, 0.02]-far from being a 10-NN graph, respectively. SW-graph computed only one graph that is 0.01-close on SIFT.
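As a rough illustration of the scenario evaluated in Q2 (vet the index with the tester, rebuild with different parameters on rejection), the control flow might look as follows; build_index, run_tester, and the parameter list are hypothetical placeholders, not the paper's API.

```python
# Hypothetical sketch of the vet-and-rebuild loop from Q2; build_index,
# run_tester, and the parameter candidates are placeholders.
def vetted_ann_index(build_index, run_tester, parameter_candidates):
    """build_index(params) -> index; run_tester(index) -> a witness that the
    index is not a k-NN graph, or None if the tester accepts. Returns the
    first accepted index, or the last one built if all candidates fail."""
    index = None
    for params in parameter_candidates:
        index = build_index(params)
        if run_tester(index) is None:  # accepted: no witness found
            break
    return index
```

Because the tester's running time is typically below 1/10 of the build time, such a vetting step adds little on top of a rebuild that would have been necessary anyway.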
It would be interesting to investigate what results can be obtained under established oracle access models, which are oblivious to the graph's structure, and whether other useful models can be devised.

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement no. 307696. We thank the anonymous reviewers for their comments and questions, which we addressed most notably by adding Lemma 12 and Lemma 15.