{"title": "Near-optimal sample compression for nearest neighbors", "book": "Advances in Neural Information Processing Systems", "page_first": 370, "page_last": 378, "abstract": "We present the first sample compression algorithm for nearest neighbors with non-trivial performance guarantees. We complement these guarantees by demonstrating almost matching hardness lower bounds, which show that our bound is nearly optimal. Our result yields new insight into margin-based nearest neighbor classification in metric spaces and allows us to significantly sharpen and simplify existing bounds. Some encouraging empirical results are also presented.", "full_text": "Near-optimal sample compression\n\nfor nearest neighbors\n\nDepartment of Computer Science and Mathematics, Ariel University\n\nAriel, Israel. leead@ariel.ac.il\n\nLee-Ad Gottlieb\n\nAryeh Kontorovich\n\nComputer Science Department, Ben Gurion University\n\nBeer Sheva, Israel. karyeh@cs.bgu.ac.il\n\nPinhas Nisnevitch\n\nDepartment of Computer Science and Mathematics, Ariel University\n\nAriel, Israel. pinhasn@gmail.com\n\nAbstract\n\nWe present the \ufb01rst sample compression algorithm for nearest neighbors with non-\ntrivial performance guarantees. We complement these guarantees by demonstrat-\ning almost matching hardness lower bounds, which show that our bound is nearly\noptimal. Our result yields new insight into margin-based nearest neighbor classi\ufb01-\ncation in metric spaces and allows us to signi\ufb01cantly sharpen and simplify existing\nbounds. Some encouraging empirical results are also presented.\n\n1 Introduction\n\nThe nearest neighbor classi\ufb01er for non-parametric classi\ufb01cation is perhaps the most intuitive learn-\ning algorithm. It is apparently the earliest, having been introduced by Fix and Hodges in 1951\n(technical report reprinted in [1]). 
In this model, the learner observes a sample S of labeled points (X, Y) = (Xi, Yi)i∈[n], where Xi is a point in some metric space X and Yi ∈ {1, −1} is its label. Being a metric space, X is equipped with a distance function d : X × X → R. Given a new unlabeled point x ∈ X to be classified, x is assigned the same label as its nearest neighbor in S, namely Y_i* where i* = argmin_{i∈[n]} d(x, Xi). Under mild regularity assumptions, the nearest neighbor classifier's expected error is asymptotically bounded by twice the Bayes error, when the sample size tends to infinity [2].¹ These results have inspired a vast body of research on proximity-based classification (see [4, 5] for extensive background and [6] for a recent refinement of classic results). More recently, strong margin-dependent generalization bounds were obtained in [7], where the margin is the minimum distance between opposite labeled points in S.

In addition to provable generalization bounds, nearest neighbor (NN) classification enjoys several other advantages. These include simple evaluation on new data, immediate extension to multiclass labels, and minimal structural assumptions — it does not assume a Hilbertian or even a Banach space. However, the naive NN approach also has disadvantages. In particular, it requires storing the entire sample, which may be memory-intensive.
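A minimal sketch of this naive rule (illustrative helper names, not code from the paper): the classifier stores the entire sample and scans all of it at query time.

```python
def nn_classify(sample, x, d):
    """1-NN rule: return the label of x's nearest neighbor in the sample.

    sample: list of (point, label) pairs; d: metric, d(p, q) -> float.
    """
    nearest_point, nearest_label = min(sample, key=lambda pl: d(x, pl[0]))
    return nearest_label


# Toy usage with the L1 metric on the plane.
l1 = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
S = [((0.0, 0.0), +1), ((2.0, 2.0), -1)]
print(nn_classify(S, (0.4, 0.3), l1))  # nearest neighbor is (0, 0): prints 1
```

Note that each query scans the whole sample, which is exactly the storage and evaluation cost that condensing aims to reduce.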
Further, information-theoretic considerations show that exact NN evaluation requires Θ(|S|) time in high-dimensional metric spaces [8] (and possibly Euclidean space as well [9]) — a phenomenon known as the algorithmic curse of dimensionality. Lastly, the NN classifier has infinite VC-dimension [5], implying that it tends to overfit the data.

¹ A Bayes-consistent modification of the 1-NN classifier was recently proposed in [3].

This last problem can be mitigated by taking the majority vote among k > 1 nearest neighbors [10, 11, 5], or by deleting some sample points so as to attain a larger margin [12].

Shortcomings in the NN classifier led Hart [13] to pose the problem of sample compression. Indeed, significant compression of the sample has the potential to simultaneously address the issues of memory usage, NN search time, and overfitting. Hart considered the Minimum Consistent Subset problem — elsewhere called the Nearest Neighbor Condensing problem — which seeks to identify a minimal subset S* ⊂ S that is consistent with S, in the sense that the nearest neighbor in S* of every x ∈ S possesses the same label as x. This problem is known to be NP-hard [14, 15], and Hart provided a heuristic with runtime O(n³). The runtime was recently improved by [16] to O(n²), but neither paper gave performance guarantees.

The Nearest Neighbor Condensing problem has been the subject of extensive research since its introduction [17, 18, 19]. Yet surprisingly, there are no known approximation algorithms for it — all previous results on this problem are heuristics that lack any non-trivial approximation guarantees. Conversely, no strong hardness-of-approximation results for this problem are known, which indicates a gap in the current state of knowledge.

Main results.
Our contribution aims at closing the existing gap in solutions to the Nearest Neighbor Condensing problem. We present a simple near-optimal approximation algorithm for this problem, where our only structural assumption is that the points lie in some metric space. Define the scaled margin γ < 1 of a sample S as the ratio of the minimum distance between opposite labeled points in S to the diameter of S. Our algorithm produces a consistent set S′ ⊂ S of size ⌈1/γ⌉^(ddim(S)+1) (Theorem 1), where ddim(S) is the doubling dimension of the space S. This result can significantly speed up evaluation on test points, and also yields sharper and simpler generalization bounds than were previously known (Theorem 3).

To establish optimality, we complement the approximation result with an almost matching hardness-of-approximation lower bound. Using a reduction from the Label Cover problem, we show that the Nearest Neighbor Condensing problem is NP-hard to approximate within factor 2^((ddim(S) log(1/γ))^(1−o(1))) (Theorem 2). Note that the above upper bound is an absolute size guarantee, and stronger than an approximation guarantee.

Additionally, we present a simple heuristic to be applied in conjunction with the algorithm of Theorem 1, that achieves further sample compression. The empirical performances of both our algorithm and heuristic seem encouraging (see Section 4).

Related work. A well-studied problem related to the Nearest Neighbor Condensing problem is that of extracting a small set of simple conjunctions consistent with much of the sample, introduced by [20] and shown by [21] to be equivalent to minimum Set Cover (see [22, 23] for further extensions). This problem is monotone in the sense that adding a conjunction to the solution set can only increase the sample accuracy of the solution.
In contrast, in our problem the addition of a point of S to S* can cause S* to be inconsistent — and this distinction is critical to the hardness of our problem.

Removal of points from the sample can also yield lower dimensionality, which itself implies faster nearest neighbor evaluation and better generalization bounds. For metric spaces, [24] and [25] gave algorithms for dimensionality reduction via point removal (irrespective of margin size).

The use of doubling dimension as a tool to characterize metric learning has appeared several times in the literature, initially by [26] in the context of nearest neighbor classification, and then in [27] and [28]. A series of papers by Gottlieb, Kontorovich and Krauthgamer investigate doubling spaces for classification [12], regression [29], and dimension reduction [25].

k-nearest neighbor. A natural question is whether the Nearest Neighbor Condensing problem of [13] has a direct analogue when the 1-nearest neighbor rule is replaced by a (k > 1)-nearest neighbor rule – that is, when the label of a point is determined by the majority vote among its k nearest neighbors. A simple argument shows that the analogy breaks down. Indeed, a minimal requirement for the condensing problem to be meaningful is that the full (uncondensed) set S is feasible, i.e. consistent with itself. Yet even for k = 3 there exist self-inconsistent sets. Take for example the set S consisting of two positive points at (0, 1) and (0, −1) and two negative points at (1, 0) and (−1, 0). Then the 3-nearest neighbor rule misclassifies every point in S, hence S itself is inconsistent.

Paper outline. This paper is organized as follows. In Section 2, we present our algorithm and prove its performance bound, as well as the reduction implying its near optimality (Theorem 2). We then highlight the implications of this algorithm for learning in Section 3.
In Section 4 we describe a heuristic which refines our algorithm, and present empirical results.

1.1 Preliminaries

Metric spaces. A metric d on a set X is a positive symmetric function satisfying the triangle inequality d(x, y) ≤ d(x, z) + d(z, y); together the two comprise the metric space (X, d). The diameter of a set A ⊆ X is defined by diam(A) = sup_{x,y∈A} d(x, y). Throughout this paper we will assume that diam(S) = 1; this can always be achieved by scaling.

Doubling dimension. For a metric (X, d), let λ be the smallest value such that every ball in X of radius r (for any r) can be covered by λ balls of radius r/2. The doubling dimension of X is ddim(X) = log₂ λ. A metric is doubling when its doubling dimension is bounded. Note that while a low Euclidean dimension implies a low doubling dimension (Euclidean metrics of dimension d have doubling dimension O(d) [30]), low doubling dimension is strictly more general than low Euclidean dimension. The following packing property can be demonstrated via a repetitive application of the doubling property: for a set S with doubling dimension ddim(X) and diam(S) ≤ β, if the minimum interpoint distance in S is at least α < β, then

|S| ≤ ⌈β/α⌉^(ddim(X)+1)    (1)

(see, for example, [8]). The above bound is tight up to constant factors, meaning there exist sets of size (β/α)^(Ω(ddim(X))).

Nearest Neighbor Condensing. Formally, we define the Nearest Neighbor Condensing (NNC) problem as follows: we are given a set S = S− ∪ S+ of points, and a distance metric d : S × S → R. We must compute a minimal cardinality subset S′ ⊂ S with the property that for any p ∈ S, the nearest neighbor of p in S′ comes from the same subset {S+, S−} as does p. If p has multiple exact nearest neighbors in S′, then they must all be of the same subset.

Label Cover.
The Label Cover problem was first introduced by [31] in a seminal paper on the hardness of computation. Several formulations of this problem have appeared in the literature, and we give the description forwarded by [32]: The input is a bipartite graph G = (U, V, E), with two sets of labels: A for U and B for V. For each edge (u, v) ∈ E (where u ∈ U, v ∈ V), we are given a relation Π_{u,v} ⊂ A × B consisting of admissible label pairs for that edge. A labeling (f, g) is a pair of functions f : U → 2^A and g : V → 2^B\{∅} assigning a set of labels to each vertex. A labeling covers an edge (u, v) if for every label b ∈ g(v) there is some label a ∈ f(u) such that (a, b) ∈ Π_{u,v}. The goal is to find a labeling that covers all edges, and which minimizes the sum of the number of labels assigned to each u ∈ U, that is Σ_{u∈U} |f(u)|. It was shown in [32] that it is NP-hard to approximate Label Cover to within a factor 2^((log n)^(1−o(1))), where n is the total size of the input.

Learning. We work in the agnostic learning model [33, 5]. The learner receives n labeled examples (Xi, Yi) ∈ X × {−1, 1} drawn iid according to some unknown probability distribution P. Associated to any hypothesis h : X → {−1, 1} is its empirical error êrr(h) = n⁻¹ Σ_{i∈[n]} 1_{h(Xi)≠Yi} and generalization error err(h) = P(h(X) ≠ Y).

2 Near-optimal approximation algorithm

In this section, we describe a simple approximation algorithm for the Nearest Neighbor Condensing problem. In Section 2.1 we provide almost tight hardness-of-approximation bounds. We have the following theorem:

Theorem 1.
Given a point set S and its scaled margin γ < 1, there exists an algorithm that in time

min{n², 2^(O(ddim(S))) · n log(1/γ)}

computes a consistent set S′ ⊂ S of size at most ⌈1/γ⌉^(ddim(S)+1).

Recall that an ε-net of a point set S is a subset S_ε ⊂ S with two properties:

(i) Packing. The minimum interpoint distance in S_ε is at least ε.
(ii) Covering. Every point p ∈ S has a nearest neighbor in S_ε strictly within distance ε.

We make the following observation: since the margin of the point set is γ, a γ-net of S is consistent with S. That is, every point p ∈ S has a neighbor in S_γ strictly within distance γ, and since the margin of S is γ, this neighbor must be of the same label set as p. By the packing property of doubling spaces (Equation 1), the size of S_γ is at most ⌈1/γ⌉^(ddim(S)+1). The solution returned by our algorithm is S_γ, and it satisfies the guarantees claimed in Theorem 1.

It remains only to compute the net S_γ. A brute-force greedy algorithm can accomplish this in time O(n²): for every point p ∈ S, we add p to S_γ if the distance from p to all points currently in S_γ is γ or greater, d(p, S_γ) ≥ γ. See Algorithm 1.

Algorithm 1 Brute-force net construction
Require: S
1: S_γ ← arbitrary point of S
2: for all p ∈ S do
3:   if d(p, S_γ) ≥ γ then
4:     S_γ ← S_γ ∪ {p}
5:   end if
6: end for

The construction time can be improved by building a net hierarchy, similar to the one employed by [8], in total time 2^(O(ddim(S))) · n log(1/γ). (See also [34, 35, 36].) A hierarchy consists of all nets S_{2^i} for i = 0, −1, . . . , ⌊log γ⌋, where S_{2^i} ⊂ S_{2^{i−1}} for all i > ⌊log γ⌋. Two points p, q ∈ S_{2^i} are neighbors if d(p, q) < 4 · 2^i.
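The brute-force construction of Algorithm 1 admits a direct sketch (illustrative code, not the authors'; O(n²) as stated):

```python
def gamma_net(points, gamma, d):
    """Brute-force gamma-net (Algorithm 1 sketch): greedily keep any point at
    distance >= gamma from every point kept so far."""
    net = [points[0]]                      # start from an arbitrary point of S
    for p in points[1:]:
        if min(d(p, q) for q in net) >= gamma:
            net.append(p)
    return net


# Toy usage on the line with d(x, y) = |x - y| and gamma = 0.3.
pts = [0.0, 0.05, 0.5, 0.55, 1.0]
print(gamma_net(pts, 0.3, lambda x, y: abs(x - y)))  # [0.0, 0.5, 1.0]
```

By construction the kept points are pairwise at distance at least γ (packing), and every discarded point has a net point strictly within γ (covering).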
Further, each point q ∈ S is a child of a single nearby parent point p ∈ S_{2^i} satisfying d(p, q) < 2^i. By the definition of a net, a parent point must exist. If two points p, q ∈ S_{2^i} are neighbors (d(p, q) < 4 · 2^i), then their respective parents p′, q′ ∈ S_{2^{i+1}} are necessarily neighbors as well: d(p′, q′) ≤ d(p′, p) + d(p, q) + d(q, q′) < 2^{i+1} + 4 · 2^i + 2^{i+1} = 4 · 2^{i+1}.

The net S_{2^0} = S_1 consists of a single arbitrary point. Having constructed S_{2^i}, it is an easy matter to construct S_{2^{i−1}}: since we require S_{2^{i−1}} ⊃ S_{2^i}, we will initialize S_{2^{i−1}} = S_{2^i}. For each q ∈ S, we need only to determine whether d(q, S_{2^{i−1}}) ≥ 2^{i−1}, and if so add q to S_{2^{i−1}}. Crucially, we need not compare q to all points of S_{2^{i−1}}: if there exists a point p ∈ S_{2^i} with d(q, p) < 2^i, then the respective parents p′, q′ ∈ S_{2^i} of p, q must be neighbors. Let set T include only the children of q′ and of q′'s neighbors. To determine the inclusion of every q ∈ S in S_{2^{i−1}}, it suffices to compute whether d(q, T) ≥ 2^{i−1}, and so n such queries are sufficient to construct S_{2^{i−1}}. The points of T have minimum distance 2^{i−1} and are all contained in a ball of radius 4 · 2^i + 2^{i−1} centered at q′, so by the packing property (Equation 1), |T| = 2^(O(ddim(S))). It follows that the above query d(q, T) can be answered in time 2^(O(ddim(S))). For each point in S we execute O(log(1/γ)) queries, for a total runtime of 2^(O(ddim(S))) · n log(1/γ). The above procedure is illustrated in the Appendix.

2.1 Hardness of approximation of NNC

In this section, we prove almost matching hardness results for the NNC problem.

Theorem 2.
Given a set S of labeled points with scaled margin γ, it is NP-hard to approximate the solution to the Nearest Neighbor Condensing problem on S to within a factor 2^((ddim(S) log(1/γ))^(1−o(1))).

To simplify the proof, we introduce an easier version of NNC called Weighted Nearest Neighbor Condensing (WNNC). In this problem, the input is augmented with a function assigning a weight to each point of S, and the goal is to find a subset S′ ⊂ S of minimum total weight. We will reduce Label Cover to WNNC and then reduce WNNC to NNC (with some mild assumptions on the admissible range of weights), all while preserving hardness of approximation. The theorem will follow from the hardness of Label Cover [32].

First reduction. Given a Label Cover instance of size m = |U| + |V| + |A| + |B| + |E| + Σ_{e∈E} |Π_e|, fix a large value c to be specified later, and an infinitesimally small constant η. We create an instance of WNNC as follows (see Figure 1).

Figure 1: Reduction from Label Cover to Nearest Neighbor Condensing.

1. We first create a point p+ ∈ S+ of weight 1. We introduce a set S_E ⊂ S− representing the edges in E: for each edge e ∈ E, create a point p_e of weight ∞. The distance from p_e to p+ is 3 + η.

2. We introduce a set S_{V,B} ⊂ S− representing pairs in V × B: for each vertex v ∈ V and label b ∈ B, create a point p_{v,b} of weight 1.
If an edge e is incident to v and there exists a label (a, b) ∈ Π_e for some a ∈ A, then the distance from p_{v,b} to p_e is 3. Further add a point p− ∈ S− of weight 1, at distance 2 from all points in S_{V,B}.

3. We introduce a set S_L ⊂ S+ representing the labels in Π_e: for each edge e = (u, v) and label b ∈ B for which (a, b) ∈ Π_e (for some a ∈ A), we create a point p_{e,b} ∈ S_L of weight ∞. p_{e,b} represents the set of labels (a, b) ∈ Π_e over all a ∈ A, and is at distance 2 + η from p_{v,b}. Further add a point p′+ ∈ S+ of weight 1, at distance 2 + 2η from all points in S_L.

4. We introduce a set S_{U,A} ⊂ S+ representing pairs in U × A: for each vertex u ∈ U and label a ∈ A, create a point p_{u,a} of weight c. For any edge e = (u, v) and label b ∈ B, if (a, b) ∈ Π_e then the distance from p_{e,b} ∈ S_L to p_{u,a} is 2.

The points of each set S_E, S_{V,B}, S_L and S_{U,A} are packed into respective balls of diameter 1. Fixing any target doubling dimension D = Ω(1) and recalling that the cardinality of each of these sets is less than m², we conclude that the minimum interpoint distance in each ball is m^(−O(1/D)). All interpoint distances not yet specified are set to their maximum possible value. The diameter of the resulting set is constant, so its scaled margin is γ = m^(−O(1/D)). We claim that a solution of WNNC on the constructed instance implies some solution of the Label Cover instance:

1. p+ must appear in any solution: the nearest neighbors of p+ are the negative points of S_E, so if p+ is not included, the nearest neighbor of set S_E is necessarily the nearest neighbor of p+, which is not consistent.

2. Points in S_E have infinite weight, so no points of S_E appear in the solution.
All points of S_E are at distance exactly 3 + η from p+, hence each point of S_E must be covered by some point of S_{V,B} to which it is connected – other points in S_{V,B} are farther than 3 + η. (Note that S_{V,B} itself can be covered by including the single point p−.) Choosing covering points in S_{V,B} corresponds to assigning labels in B to vertices of V in the Label Cover instance.

3. Points in S_L have infinite weight, so no points of S_L appear in the solution. Hence, either p′+ or some points of S_{U,A} must be used to cover points of S_L. Specifically, a point in S_L ⊂ S+ incident on an included point of S_{V,B} ⊂ S− is at distance exactly 2 + η from this point, and so it must be covered by some point of S_{U,A} to which it is connected, at distance 2 – other points in S_{U,A} are farther than 2 + η. Points of S_L not incident on an included point of S_{V,B} can be covered by p′+, which at distance 2 + 2η is still closer than any point in S_{V,B}. (Note that S_{U,A} itself can be covered by including a single arbitrary point of S_{U,A}, which at distance 1 is closer than all other point sets.) Choosing the covering point in S_{U,A} corresponds to assigning labels in A to vertices of U in the Label Cover instance, thereby inducing a valid labeling for some edge and solving the Label Cover problem.

Now, a trivial solution to this instance of WNNC is to take all points of S_{U,A}, S_{V,B} and the single point p+: then S_E and p− are covered by S_{V,B}, and S_L and p′+ by S_{U,A}. The weight of the resulting set is c|S_{U,A}| + |S_{V,B}| + 1, and this provides an upper bound on the optimal solution. By setting c = m⁴ ≫ m³ > m(|S_{V,B}| + 1), we ensure that the solution cost of WNNC is asymptotically equal to the number of points of S_{U,A} included in its solution. This in turn is exactly the sum of labels of A assigned to each vertex of U in a solution to the Label Cover problem.
Label Cover is hard to approximate within a factor 2^((log m)^(1−o(1))), implying that WNNC is hard to approximate within a factor of 2^((log m)^(1−o(1))) = 2^((D log(1/γ))^(1−o(1))).

Before proceeding to the next reduction, we note that to rule out the inclusion of points of S_E, S_L in the solution set, infinite weight is not necessary: it suffices to give each heavy point weight c², which is itself greater than the weight of the optimal solution by a factor of at least m². Hence, we may assume all weights are restricted to the range [1, m^O(1)], and the hardness result for WNNC still holds.

Second reduction. We now reduce WNNC to NNC, assuming that the weights of the n points are in the range [1, m^O(1)]. Let γ be the scaled margin of the WNNC instance. To mimic the weight assignment of WNNC using the unweighted points of NNC, we introduce the following gadget graph G(w, D): given parameter w and doubling dimension D, create a point set T of size w whose interpoint distances are the same as those realized by a set of contiguous points on the D-dimensional ℓ1-grid of side-length ⌈w^(1/D)⌉. Now replace each point p ∈ T by twin positive and negative points at mutual distance γ/2, so that the distance from each twin replacing p to each twin replacing any q ∈ T is the same as the distance from p to q. G(w, D) consists of T, as well as a single positive point at distance ⌈w^(1/D)⌉ from all positive points of T and ⌈w^(1/D)⌉ + γ/2 from all negative points of T, and a single negative point at distance ⌈w^(1/D)⌉ from all negative points of T and ⌈w^(1/D)⌉ + γ/2 from all positive points of T.

Clearly, the optimal solution to NNC on the gadget instance is to choose the two points not in T. Further, if any single point in T is included in the solution, then all of T must be included in the solution: first, the twin of the included point must also be included in the solution. Then, any point at distance 1 from both twins must be included as well, along with its own twin.
But then all points within distance 1 of the new twins must be included, etc., until all points of T are found in the solution.

To effectively assign a weight to a positive point of NNC, we add a gadget to the point set, and place all negative points of the gadget at distance ⌈w^(1/D)⌉ from this point. If the point is not included in the NNC solution, then the cost of the gadget is only 2.² But if this point is included in the NNC solution, then it is the nearest neighbor of the negative gadget points, and so all the gadget points must be included in the solution, incurring a cost of w. A similar argument allows us to assign weight to negative points of NNC. The scaled margin of the NNC instance is of size Ω(γ/w^(1/D)) = Ω(γ m^(−O(1/D))), which completes the proof of Theorem 2.

3 Learning

In this section, we apply Theorem 1 to obtain improved generalization bounds for binary classification in doubling spaces. Working in the standard agnostic PAC setting, we take the labeled sample S to be drawn iid from some unknown distribution over X × {−1, 1}, with respect to which all of our probabilities will be defined. In a slight abuse of notation, we will blur the distinction between S ⊂ X as a collection of points in a metric space and S ∈ (X × {−1, 1})ⁿ as a sequence of point-label pairs. As mentioned in the preliminaries, there is no loss of generality in taking diam(S) = 1. Partitioning the sample S = S+ ∪ S− into its positively and negatively labeled subsets, the margin induced by the sample is given by γ(S) = d(S+, S−), where d(A, B) := min_{x∈A, x′∈B} d(x, x′) for A, B ⊂ X.
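The induced margin just defined, and the scaled margin of Section 1 (the margin divided by diam(S)), can be computed directly; a small illustrative sketch (helper names are ours, not the paper's):

```python
from itertools import product

def sample_margin(pos, neg, d):
    """gamma(S) = d(S+, S-): minimum distance between opposite-labeled points."""
    return min(d(p, q) for p, q in product(pos, neg))

def scaled_margin(pos, neg, d):
    """The margin divided by diam(S); after scaling diam(S) = 1, the two agree."""
    pts = pos + neg
    diam = max(d(p, q) for p, q in product(pts, pts))
    return sample_margin(pos, neg, d) / diam

# Toy usage on the line: S+ = {0.0, 0.2}, S- = {1.0}.
dist = lambda x, y: abs(x - y)
print(sample_margin([0.0, 0.2], [1.0], dist))  # 0.8
print(scaled_margin([0.0, 0.2], [1.0], dist))  # 0.8, since diam(S) = 1.0
```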
Any labeled sample S induces the nearest-neighbor classifier ν_S : X → {−1, 1} via

ν_S(x) = +1 if d(x, S+) < d(x, S−), and −1 otherwise.

² By scaling up all weights by a factor of n², we can ensure that the cost of all added gadgets (2n) is asymptotically negligible.

We say that ˜S ⊂ S is ε-consistent with S if (1/n) Σ_{x∈S} 1_{ν_S(x) ≠ ν_˜S(x)} ≤ ε. For ε = 0, an ε-consistent ˜S is simply said to be consistent (which matches our previous notion of consistent subsets). A sample S is said to be (ε, γ)-separable (with witness ˜S) if there is an ε-consistent ˜S ⊂ S with γ(˜S) ≥ γ.

We begin by invoking a standard Occam-type argument to show that the existence of small ε-consistent sets implies good generalization. The generalizing power of sample compression was independently discovered by [37, 38], and later elaborated upon by [39].

Theorem 3. For any distribution P, any n ∈ N and any 0 < δ < 1, with probability at least 1 − δ over the random sample S ∈ (X × {−1, 1})ⁿ, the following holds:

(i) If ˜S ⊂ S is consistent with S, then

err(ν_˜S) ≤ (1/(n − |˜S|)) (|˜S| log n + log n + log(1/δ)).

(ii) If ˜S ⊂ S is ε-consistent with S, then

err(ν_˜S) ≤ εn/(n − |˜S|) + √( (|˜S| log n + 2 log n + log(1/δ)) / (2(n − |˜S|)) ).

Proof. Finding a consistent (resp., ε-consistent) ˜S ⊂ S constitutes a sample compression scheme of size |˜S|, as stipulated in [39]. Hence, the bounds in (i) and (ii) follow immediately from Theorems 1 and 2 ibid.

Corollary 1.
With probability at least 1 − δ, the following holds: if S is (ε, γ)-separable with witness ˜S, then

err(ν_˜S) ≤ εn/(n − ℓ) + √( (ℓ log n + 2 log n + log(1/δ)) / (2(n − ℓ)) ),

where ℓ = ⌈1/γ⌉^(ddim(S)+1).

Proof. Follows immediately from Theorems 1 and 3(ii).

Remark. It is instructive to compare the bound above to [12, Corollary 5]. Stated in the language of this paper, the latter upper-bounds the NN generalization error in terms of the sample margin γ and ddim(X) by

ε + √( (2/n) (d_γ ln(34en/d_γ) log₂(578n) + ln(4/δ)) ),    (2)

where d_γ = ⌈16/γ⌉^(ddim(X)+1) and ε is the fraction of the points in S that violate the margin condition (i.e., opposite-labeled point pairs less than γ apart in d). Hence, Corollary 1 is a considerable improvement over (2) in at least three aspects. First, the data-dependent ddim(S) may be significantly smaller than the dimension of the ambient space, ddim(X).³ Secondly, the factor of 16^(ddim(X)+1) is shaved off. Finally, (2) relied on some fairly intricate fat-shattering arguments [40, 41], while Corollary 1 is an almost immediate consequence of much simpler Occam-type results.

One limitation of Theorem 1 is that it requires the sample to be (0, γ)-separable. The form of the bound in Corollary 1 suggests a natural Structural Risk Minimization (SRM) procedure: minimize the right-hand side over (ε, γ). A solution to this problem was (essentially) given in [12, Theorem 7]:

Theorem 4. Let R(ε, γ) denote the right-hand side of the inequality in Corollary 1 and put (ε*, γ*) = argmin_{ε,γ} R(ε, γ).
Then (i) One may compute (ε*, γ*) in O(n^4.376) randomized time. (ii) One may compute (˜ε, ˜γ) satisfying R(˜ε, ˜γ) ≤ 4R(ε*, γ*) in O(ddim(S) · n² log n) deterministic time. Both solutions yield a witness ˜S ⊂ S of (ε, γ)-separability as a by-product.

Having thus computed the optimal (or near-optimal) ˜ε, ˜γ with the corresponding witness ˜S, we may now run the algorithm furnished by Theorem 1 on the sub-sample ˜S and invoke the generalization bound in Corollary 1. The latter holds uniformly over all ˜ε, ˜γ.

³ In general, ddim(S) ≤ c · ddim(X) for some universal constant c, as shown in [24].

4 Experiments

In this section we discuss experimental results. First, we will describe a simple heuristic built upon our algorithm. The theoretical guarantees in Theorem 1 feature a dependence on the scaled margin γ, and our heuristic aims to give an improved solution in the problematic case where γ is small. Consider the following procedure for obtaining a smaller consistent set. We first extract a net S_γ satisfying the guarantees of Theorem 1. We then remove points from S_γ using the following rule: for all i ∈ {0, . . . , ⌈log γ⌉}, and for each p ∈ S_γ, if the distance from p to all opposite-labeled points in S_γ is at least 2 · 2^i, then remove from S_γ all points strictly within distance 2^i − γ of p (see Algorithm 2). We can show that the resulting set is consistent:

Lemma 5. The above heuristic produces a consistent solution.

Proof. Consider a point p ∈ S_γ, and assume without loss of generality that p is positive. If d(p, S⁻_γ) ≥ 2 · 2^i, then the positive net-points strictly within distance 2^i of p are closer to p than to any negative point in S_γ, and are "covered" by p.
The removed positive net-points strictly within distance 2^i − γ of p themselves cover other positive points of S within distance γ, but p covers these points of S as well. Further, p cannot be removed at a later stage in the algorithm, since p's distance from all remaining points is at least 2^i − γ.

Algorithm 2 Consistent pruning heuristic
1: S_γ is produced by Algorithm 1 or its fast version (Appendix)
2: for all i ∈ {0, . . . , ⌈log γ⌉} do
3:   for all p ∈ S_γ do
4:     if p ∈ S^±_γ and d(p, S^∓_γ) ≥ 2 · 2^i then
5:       for all q ≠ p ∈ S_γ with d(p, q) < 2^i − γ do
6:         S_γ ← S_γ \ {q}
7:       end for
8:     end if
9:   end for
10: end for

As a proof of concept, we tested our sample compression algorithms on several data sets from the UCI Machine Learning Repository. These included the Skin Segmentation, Statlog Shuttle, and Covertype sets.⁴ The final dataset features 7 different label types, which we treated as 21 separate binary classification problems; we report results for labels 1 vs. 4, 4 vs. 6, and 4 vs. 7, and these typify the remaining pairs. We stress that the focus of our experiments is to demonstrate that (i) a significant amount of consistent sample compression is often possible and (ii) the compression does not adversely affect the generalization error.

For each data set and experiment, we sampled equal sized learning and test sets, with equal representation of each label type. The L1 metric was used for all data sets. We report (i) the initial sample set size, (ii) the percentage of points retained after the net extraction procedure of Algorithm 1, (iii) the percentage retained after the pruning heuristic of Algorithm 2, and (iv) the change in prediction accuracy on test data, when comparing the heuristic to the uncompressed sample.
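The pruning rule of Algorithm 2 can be sketched as follows (an illustrative reading, not the authors' code; the net is a list of (point, label) pairs, and radii run from 2^0 down to roughly γ):

```python
def prune(net, gamma, d):
    """Sketch of Algorithm 2: a net point whose distance to every
    opposite-labeled net point is at least 2 * 2^i removes all other net
    points strictly within 2^i - gamma of it."""
    pts = list(net)
    r = 1.0                                 # radii 2^0, 2^-1, ..., down to ~gamma
    while r >= gamma:
        for p, lp in list(pts):
            if (p, lp) not in pts:          # already pruned on this pass
                continue
            if all(d(p, q) >= 2 * r for q, lq in pts if lq != lp):
                pts = [(q, lq) for q, lq in pts
                       if (q, lq) == (p, lp) or d(p, q) >= r - gamma]
        r /= 2.0
    return pts


# Toy usage on the line: two close positives and one far negative.
dist = lambda x, y: abs(x - y)
net = [(0.0, +1), (0.1, +1), (1.0, -1)]
print(prune(net, 0.05, dist))  # [(0.0, 1), (1.0, -1)]
```

Here the point at 0.0 is far enough from the negative point to "cover" and remove its positive neighbor at 0.1, matching the consistency argument of Lemma 5.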
The results, averaged over 500 trials, are summarized in Figure 2.

data set            | original sample | % after net | % after heuristic | ±% accuracy
Skin Segmentation   | 10000           | 35.10       | 4.78              | -0.0010
Statlog Shuttle     | 2000            | 65.75       | 29.65             | +0.0080
Covertype 1 vs. 4   | 2000            | 35.85       | 17.70             | +0.0200
Covertype 4 vs. 6   | 2000            | 96.50       | 69.00             | -0.0300
Covertype 4 vs. 7   | 2000            | 4.40        | 3.40              | 0.0000

Figure 2: Summary of the performance of NN sample compression algorithms.

4 http://tinyurl.com/skin-data; http://tinyurl.com/shuttle-data; http://tinyurl.com/cover-data

References

[1] E. Fix and J. L. Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review / Revue Internationale de Statistique, 57(3):238–247, 1989.
[2] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans. Info. Theo., 13:21–27, 1967.
[3] A. Kontorovich and R. Weiss. A Bayes consistent 1-NN classifier (arXiv:1407.0208), 2014.
[4] G. Toussaint. Open problems in geometric methods for instance-based learning. In Discrete and Computational Geometry, volume 2866 of Lecture Notes in Comput. Sci., pp. 273–283, 2003.
[5] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning. 2014.
[6] K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In NIPS, 2014.
[7] U. von Luxburg and O. Bousquet. Distance-based classification with Lipschitz functions. JMLR, 2004.
[8] R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. In SODA, 2004.
[9] K. L. Clarkson. An algorithm for approximate closest-point queries. In SCG, 1994.
[10] L. Devroye, L. Györfi, A. Krzyżak, and G. Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. Ann. Statist., 22(3):1371–1385, 1994.
[11] R. R. Snapp and S. S. Venkatesh. Asymptotic expansions of the k nearest neighbor risk.
Ann. Statist., 26(3):850–878, 1998.
[12] L. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient classification for metric data. In COLT, 2010.
[13] P. E. Hart. The condensed nearest neighbor rule. IEEE Trans. Info. Theo., 14(3):515–516, 1968.
[14] G. Wilfong. Nearest neighbor problems. In SCG, 1991.
[15] A. V. Zukhba. NP-completeness of the problem of prototype selection in the nearest neighbor method. Pattern Recognit. Image Anal., 20(4):484–494, 2010.
[16] F. Angiulli. Fast condensed nearest neighbor rule. In ICML, 2005.
[17] W. Gates. The reduced nearest neighbor rule. IEEE Trans. Info. Theo., 18:431–433, 1972.
[18] G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour. An algorithm for a selective nearest neighbor decision rule. IEEE Trans. Info. Theo., 21:665–669, 1975.
[19] D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Mach. Learn., 38:257–286, 2000.
[20] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
[21] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2):177–221, 1988.
[22] F. Laviolette, M. Marchand, M. Shah, and S. Shanian. Learning the set covering machine by bound minimization and margin-sparsity trade-off. Mach. Learn., 78(1-2):175–201, 2010.
[23] M. Marchand and J. Shawe-Taylor. The set covering machine. JMLR, 3:723–746, 2002.
[24] L. Gottlieb and R. Krauthgamer. Proximity algorithms for nearly doubling spaces. SIAM J. on Discr. Math., 27(4):1759–1769, 2013.
[25] L. Gottlieb, A. Kontorovich, and R. Krauthgamer. Adaptive metric dimensionality reduction. In ALT, 2013.
[26] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.
[27] Y. Li and P. M. Long. Learnability and the doubling dimension. In NIPS, 2006.
[28] N. H. Bshouty, Y. Li, and P. M.
Long. Using the doubling dimension to analyze the generalization of learning algorithms. J. Comp. Sys. Sci., 75(6):323–335, 2009.
[29] L. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient regression in metric spaces via approximate Lipschitz extension. In SIMBAD, 2013.
[30] A. Gupta, R. Krauthgamer, and J. R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In FOCS, 2003.
[31] S. Arora, L. Babai, J. Stern, and Z. Sweedyk. The hardness of approximate optima in lattices, codes, and systems of linear equations. In FOCS, 1993.
[32] I. Dinur and S. Safra. On the hardness of approximating label-cover. Info. Proc. Lett., 2004.
[33] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. 2012.
[34] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.
[35] S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. on Comput., 35(5):1148–1184, 2006.
[36] R. Cole and L. Gottlieb. Searching dynamic point sets in spaces with bounded doubling dimension. In STOC, 2006.
[37] N. Littlestone and M. K. Warmuth. Relating data compression and learnability. Unpublished, 1986.
[38] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, 1996.
[39] T. Graepel, R. Herbrich, and J. Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Mach. Learn., 59(1-2):55–76, 2005.
[40] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997.
[41] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, pages 43–54.
1999.