{"title": "k*-Nearest Neighbors: From Global to Local", "book": "Advances in Neural Information Processing Systems", "page_first": 4916, "page_last": 4924, "abstract": "The weighted k-nearest neighbors algorithm is one of the most fundamental non-parametric methods in pattern recognition and machine learning. The question of setting the optimal number of neighbors as well as the optimal weights has received much attention throughout the years, nevertheless this problem seems to have remained unsettled. In this paper we offer a simple approach to locally weighted regression/classification, where we make the bias-variance tradeoff explicit. Our formulation enables us to phrase a notion of optimal weights, and to efficiently find these weights as well as the optimal number of neighbors efficiently and adaptively, for each data point whose value we wish to estimate. The applicability of our approach is demonstrated on several datasets, showing superior performance over standard locally weighted methods.", "full_text": "k\u21e4-Nearest Neighbors: From Global to Local\n\nOren Anava\n\nThe Voleon Group\noren@voleon.com\n\nK\ufb01r Y. Levy\nETH Zurich\n\nyehuda.levy@inf.ethz.ch\n\nAbstract\n\nThe weighted k-nearest neighbors algorithm is one of the most fundamental non-\nparametric methods in pattern recognition and machine learning. The question of\nsetting the optimal number of neighbors as well as the optimal weights has received\nmuch attention throughout the years, nevertheless this problem seems to have\nremained unsettled. In this paper we offer a simple approach to locally weighted\nregression/classi\ufb01cation, where we make the bias-variance tradeoff explicit. Our\nformulation enables us to phrase a notion of optimal weights, and to ef\ufb01ciently \ufb01nd\nthese weights as well as the optimal number of neighbors ef\ufb01ciently and adaptively,\nfor each data point whose value we wish to estimate. 
The applicability of our approach is demonstrated on several datasets, showing superior performance over standard locally weighted methods.\n\n1 Introduction\n\nThe k-nearest neighbors (k-NN) algorithm [1, 2] and Nadaraya-Watson estimation [3, 4] are the cornerstones of non-parametric learning. Owing to their simplicity and flexibility, these procedures have become the methods of choice in many scenarios [5], especially in settings where the underlying model is complex. Modern applications of the k-NN algorithm include recommendation systems [6], text categorization [7], heart disease classification [8], and financial market prediction [9], amongst others.\n\nA successful application of the weighted k-NN algorithm requires a careful choice of three ingredients: the number of nearest neighbors k, the weight vector α, and the distance metric. The latter requires domain knowledge and is thus henceforth assumed to be set and known in advance to the learner. Surprisingly, even under this assumption, the problem of choosing the optimal k and α is not fully understood, and has been studied extensively since the 1950's under many different regimes. Most of the theoretical work focuses on the asymptotic regime in which the number of samples n goes to infinity [10, 11, 12], and ignores the practical regime in which n is finite. More importantly, the vast majority of k-NN studies aim at finding an optimal value of k per dataset, which overlooks the specific structure of the dataset and the properties of the data points whose labels we wish to estimate. While kernel-based methods such as Nadaraya-Watson enable an adaptive choice of the weight vector α, there still remains the question of how to choose the kernel's bandwidth σ, which can be thought of as the parallel of the number of neighbors k in k-NN.
Moreover, there is no principled approach towards choosing the kernel function in practice.\n\nIn this paper we offer a coherent and principled approach to adaptively choosing the number of neighbors k and the corresponding weight vector α ∈ R^k per decision point. Given a new decision point, we aim to find the best locally weighted predictor, in the sense of minimizing the distance between our prediction and the ground truth. In addition to yielding predictions, our approach enables us to provide a per-decision-point guarantee for the confidence of our predictions. Fig. 1 illustrates the importance of choosing k adaptively. In contrast to previous works on non-parametric regression/classification, we do not assume that the data {(x_i, y_i)}_{i=1}^n arrive from some (unknown) underlying distribution, but rather make the weaker assumption that the labels {y_i}_{i=1}^n are independent given the data points {x_i}_{i=1}^n, allowing the latter to be chosen arbitrarily. Alongside providing a theoretical basis for our approach, we conduct an empirical study that demonstrates its superiority with respect to the state-of-the-art.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Three different scenarios ((a) First scenario, (b) Second scenario, (c) Third scenario). In all three scenarios, the same data points x_1, . . . , x_n ∈ R^2 are given (represented by black dots). The red dot in each of the scenarios represents the new data point whose value we need to estimate. Intuitively, in the first scenario it would be beneficial to consider only the nearest neighbor for the estimation task, whereas in the other two scenarios we might profit by considering more neighbors.\n\nThis paper is organized as follows. In Section 2 we introduce our setting and assumptions, and derive the locally optimal prediction problem.
In Section 3 we analyze the solution of the above prediction problem, and introduce a greedy algorithm designed to efficiently find the exact solution. Section 4 presents our experimental study, and Section 5 concludes.\n\n1.1 Related Work\n\nAsymptotic universal consistency is the most widely known theoretical guarantee for k-NN. This powerful guarantee implies that as the number of samples n goes to infinity, with k → ∞ and k/n → 0, the risk of the k-NN rule converges to the risk of the Bayes classifier for any underlying data distribution. Similar guarantees hold for weighted k-NN rules, under the additional assumptions that Σ_{i=1}^k α_i = 1 and max_{i≤n} α_i → 0 [12, 10]. In the regime of practical interest, where the number of samples n is finite, using k = ⌊√n⌋ neighbors is a widely mentioned rule of thumb [10]. Nevertheless, this rule often yields poor results, and in the finite-sample regime it is usually advised to choose k using cross-validation. Similar consistency results apply to kernel-based local methods [13, 14].\n\nA novel study of k-NN by Samworth [11] derives a closed-form expression for the optimal weight vector, and extracts the optimal number of neighbors. However, this result is only optimal under several restrictive assumptions, and only holds in the asymptotic regime where n → ∞. Furthermore, the above optimal number of neighbors/weights do not adapt, but are rather fixed over all decision points given the dataset. In the context of kernel-based methods, it is possible to extract an expression for the optimal kernel bandwidth [14, 15]. Nevertheless, this bandwidth is fixed over all decision points, and is only optimal under several restrictive assumptions.\n\nThere exist several heuristics for adaptively choosing the number of neighbors and weights separately for each decision point.
In [16, 17] it is suggested to use local cross-validation in order to adapt the value of k to different decision points. Conversely, Ghosh [18] takes a Bayesian approach towards choosing k adaptively. Focusing on the multiclass classification setup, it is suggested in [19] to consider different values of k for each class, choosing k proportionally to the class populations. Similarly, there exist several approaches for adaptively choosing the kernel bandwidth σ in kernel-based methods [20, 21, 22, 23].\n\nLearning the distance metric for k-NN has been extensively studied throughout the last decade. There are several approaches towards metric learning, which roughly divide into linear/non-linear learning methods. It was found that metric learning may significantly affect the performance of k-NN in numerous applications, including computer vision, text analysis, program analysis and more. A comprehensive survey by Kulis [24] provides a review of the metric learning literature. Throughout this work we assume that the distance metric is fixed, and thus the focus is on finding the best (in a sense) values of k and α for each new data point.\n\nTwo comprehensive monographs, [10] and [25], provide an extensive survey of the existing literature regarding k-NN rules, including theoretical guarantees, useful practices, limitations and more.\n\n2 Problem Definition\n\nIn this section we present our setting and assumptions, and formulate the locally weighted optimal estimation problem. Recall that we seek to find the best local prediction, in the sense of minimizing the distance between this prediction and the ground truth. The problem at hand is thus defined as follows: We are given n data points x_1, . . . , x_n ∈ R^d, and n corresponding labels¹ y_1, . . . , y_n ∈ R. Assume that for any i ∈ {1, . . . , n} = [n] it holds that y_i = f(x_i) + ε_i, where f(·) and the ε_i are such that:\n\n(1) f(·) is a Lipschitz continuous function: for any x, y ∈ R^d it holds that |f(x) − f(y)| ≤ L · d(x, y), where the distance function d(·,·) is set and known in advance. This assumption is rather standard when considering nearest-neighbors-based algorithms, and is required in our analysis to bound the so-called bias term (to be defined later). In the binary classification setup we assume that f : R^d → [0, 1], and that given x its label y ∈ {0, 1} is distributed Bernoulli(f(x)).\n\n(2) The ε_i's are noise terms: for any i ∈ [n] it holds that E[ε_i | x_i] = 0 and |ε_i| ≤ b for some given b > 0. In addition, it is assumed that given the data points {x_i}_{i=1}^n, the noise terms {ε_i}_{i=1}^n are independent. This assumption is later used in our analysis to apply Hoeffding's inequality and bound the so-called variance term (to be defined later). Alternatively, we could assume that E[ε_i² | x_i] ≤ b (instead of |ε_i| ≤ b) and apply Bernstein inequalities; the results and analysis remain qualitatively similar.\n\nGiven a new data point x_0, our task is to estimate f(x_0), where we restrict the estimator f̂(x_0) to be of the form f̂(x_0) = Σ_{i=1}^n α_i y_i. That is, the estimator is a weighted average of the given noisy labels. Formally, we aim at minimizing the absolute distance between our prediction and the ground truth f(x_0), which translates into the optimization problem (P1) below; relaxing (P1) then yields a problem (P2) whose solution provides a high-probability guarantee for (P1).\n\n¹Note that our analysis holds for both setups of classification/regression. For brevity we use a classification task terminology, relating to the y_i's as labels.
Our analysis extends directly to the regression setup.\n\nFormally, the problem is\n\n(P1): min_{α∈Δ_n} |Σ_{i=1}^n α_i y_i − f(x_0)|,\n\nwhere we minimize over the simplex Δ_n = {α ∈ R^n : Σ_{i=1}^n α_i = 1 and α_i ≥ 0, ∀i}. Decomposing the objective of (P1) into a sum of bias and variance terms, we arrive at the following relaxed objective:\n\n|Σ_{i=1}^n α_i y_i − f(x_0)| = |Σ_{i=1}^n α_i (y_i − f(x_i) + f(x_i)) − f(x_0)|\n= |Σ_{i=1}^n α_i ε_i + Σ_{i=1}^n α_i (f(x_i) − f(x_0))|\n≤ |Σ_{i=1}^n α_i ε_i| + |Σ_{i=1}^n α_i (f(x_i) − f(x_0))|\n≤ |Σ_{i=1}^n α_i ε_i| + L Σ_{i=1}^n α_i d(x_i, x_0).\n\nBy Hoeffding's inequality (see supplementary material) it follows that |Σ_{i=1}^n α_i ε_i| ≤ C‖α‖₂ for C = b√(2 log(2/δ)), with probability at least 1 − δ. We thus arrive at a new optimization problem (P2), such that solving it yields a guarantee for (P1) with high probability:\n\n(P2): min_{α∈Δ_n} C‖α‖₂ + L Σ_{i=1}^n α_i d(x_i, x_0).\n\nThe first term in (P2) corresponds to the noise in the labels and is therefore denoted the variance term, whereas the second term corresponds to the distance between f(x_0) and {f(x_i)}_{i=1}^n and is thus denoted the bias term.\n\n3 Algorithm and Analysis\n\nIn this section we discuss the properties of the optimal solution of (P2), and present a greedy algorithm designed to efficiently find the exact solution of the latter objective (see Section 3.1). Given a decision point x_0, Theorem 3.1 demonstrates that the optimal weight α_i of the data point x_i decreases linearly with d(x_i, x_0) (closer points are given more weight). Interestingly, this weight decay is quite slow compared to popular weight kernels, which utilize sharper decay schemes, e.g., exponential/inversely-proportional. Theorem 3.1 also implies a cutoff effect, meaning that there exists k* ∈ [n] such that only the k* nearest neighbors of x_0 contribute to the prediction of its label. Note that both α and k* may adapt from one x_0 to another. Also notice that the optimal weights depend on a single parameter L/C, namely the Lipschitz-to-noise ratio.
As L/C grows, k* tends to be smaller, which is quite intuitive.\n\nWithout loss of generality, assume that the points are ordered in ascending order according to their distance from x_0, i.e., d(x_1, x_0) ≤ d(x_2, x_0) ≤ . . . ≤ d(x_n, x_0). Also, let β ∈ R^n be such that β_i = L · d(x_i, x_0)/C. Then, the following is our main theorem:\n\nTheorem 3.1. There exists λ > 0 such that the optimal solution of (P2) is of the form\n\nα*_i = (λ − β_i) · 1{β_i < λ} / Σ_{j=1}^n (λ − β_j) · 1{β_j < λ}.    (1)\n\nFurthermore, the value of (P2) at the optimum is Cλ.\n\nFollowing is a direct corollary of the above theorem:\n\nCorollary 3.2. There exists 1 ≤ k* ≤ n such that for the optimal solution of (P2) the following applies:\n\nα*_i > 0, ∀i ≤ k*,  and  α*_i = 0, ∀i > k*.\n\nProof of Theorem 3.1. Notice that (P2) may be written as follows:\n\n(P2): min_{α∈Δ_n} C(‖α‖₂ + α^⊤β).\n\nWe henceforth ignore the parameter C. In order to find the solution of (P2), let us first consider its Lagrangian:\n\nL(α, λ, θ) = ‖α‖₂ + α^⊤β + λ(1 − Σ_{i=1}^n α_i) − Σ_{i=1}^n θ_i α_i,\n\nwhere λ ∈ R is the multiplier of the equality constraint Σ_i α_i = 1, and θ_1, . . . , θ_n ≥ 0 are the multipliers of the inequality constraints α_i ≥ 0, ∀i ∈ [n]. Since (P2) is convex, any solution satisfying the KKT conditions is a global minimum. Differentiating the Lagrangian with respect to α, we get that for any i ∈ [n]:\n\nα_i / ‖α‖₂ = λ − β_i + θ_i.\n\nDenote by α* the optimal solution of (P2). By the KKT conditions, for any α*_i > 0 it follows that θ_i = 0. Otherwise, for any i such that α*_i = 0 it follows that θ_i ≥ 0, which implies λ ≤ β_i.
Thus, for any nonzero weight α*_i > 0 the following holds:\n\nα*_i / ‖α*‖₂ = λ − β_i.    (2)\n\nSquaring and summing Equation (2) over all the nonzero entries of α*, we arrive at the following equation for λ:\n\n1 = Σ_{α*_i>0} (α*_i)² / ‖α*‖₂² = Σ_{α*_i>0} (λ − β_i)².    (3)\n\nAlgorithm 1 k*-NN\nInput: vector of ordered distances β ∈ R^n, noisy labels y_1, . . . , y_n ∈ R\nSet: λ_0 = β_1 + 1, k = 0\nwhile λ_k > β_{k+1} and k ≤ n − 1 do\n  Update: k ← k + 1\n  Calculate: λ_k = (1/k)(Σ_{i=1}^k β_i + √(k + (Σ_{i=1}^k β_i)² − k Σ_{i=1}^k β_i²))\nend while\nOutput: weights α_i = (λ_k − β_i) · 1{i ≤ k} / Σ_{j=1}^k (λ_k − β_j), and estimate f̂(x_0) = Σ_{i=1}^n α_i y_i\n\nBy Equation (2), any nonzero weight of the optimal solution satisfies\n\nα*_i = (λ − β_i) / A, where A = Σ_{α*_j>0} (λ − β_j).    (4)\n\nPlugging the above into the objective of (P2) yields\n\nC‖α*‖₂ + C α*^⊤β = (C/A)[Σ_{α*_i>0} (λ − β_i)² + Σ_{α*_i>0} (λ − β_i)β_i] = (C/A) Σ_{α*_i>0} (λ − β_i)(λ − β_i + β_i) = (C/A) · λ · Σ_{α*_i>0} (λ − β_i) = Cλ,\n\nwhere in the first equality we used Equation (3), and in the last equality we substituted A = Σ_{α*_i>0} (λ − β_i).\n\n3.1 Solving (P2) Efficiently\n\nNote that (P2) is a convex optimization problem, and it can therefore be (approximately) solved efficiently, e.g., via any first-order algorithm. Concretely, given an accuracy ε > 0, any off-the-shelf convex optimization method would require a running time which is poly(n, 1/ε) in order to find an ε-optimal solution to (P2)². Note that the calculation of the (unsorted) β requires an additional computational cost of O(nd).\n\nHere we present an efficient method that computes the exact solution of (P2). In addition to the O(nd) cost for calculating β, our algorithm requires an O(n log n) cost for sorting the entries of β, as well as an additional running time of O(k*), where k* is the number of non-zero elements at the optimum.
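As an illustration, the greedy procedure of Algorithm 1 can be sketched in Python as follows; this is a minimal sketch under our reconstruction of the algorithm, with function and variable names of our own choosing (the input is assumed to be the distance vector β = L·d(x_i, x_0)/C, already sorted in ascending order):\n\n```python\nimport numpy as np\n\ndef k_star_nn_weights(beta, y):\n    """Greedy k*-NN sketch: compute the cutoff lambda and the weights.\n\n    beta : 1-D array of normalized distances L*d(x_i, x0)/C, ascending.\n    y    : labels ordered consistently with beta.\n    Returns (alpha, prediction) with prediction = sum_i alpha_i * y_i.\n    """\n    beta = np.asarray(beta, dtype=float)\n    n = len(beta)\n    lam = beta[0] + 1.0            # lambda_0, guarantees the loop starts\n    k, s1, s2 = 0, 0.0, 0.0        # running sums of beta_i and beta_i**2\n    while k <= n - 1 and lam > beta[k]:\n        k += 1\n        s1 += beta[k - 1]\n        s2 += beta[k - 1] ** 2\n        # closed-form lambda_k solving the quadratic from Equation (3)\n        lam = (s1 + np.sqrt(k + s1 ** 2 - k * s2)) / k\n    alpha = np.maximum(lam - beta, 0.0)   # cutoff: zero weight beyond k*\n    alpha /= alpha.sum()\n    return alpha, float(alpha @ np.asarray(y, dtype=float))\n```\n\nFor instance, with two neighbors at normalized distances β = (0, 10), the loop stops after one iteration (λ₁ = 1 ≤ β₂), so only the nearest label receives weight, matching the cutoff effect of Corollary 3.2; with all β_i equal the weights become uniform.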
Thus, the running time of our method is independent of any accuracy ε, and may be significantly better compared to any off-the-shelf optimization method. Note that in some cases [26], using advanced data structures may decrease the cost of finding the nearest neighbors (i.e., the sorted β), yielding a running time substantially smaller than O(nd + n log n).\n\nOur method is depicted in Algorithm 1. Quite intuitively, the core idea is to greedily add neighbors according to their distance from x_0 until a stopping condition is fulfilled (indicating that we have found the optimal solution). Letting C_sortNN be the computational cost of calculating the sorted vector β, the following theorem presents our guarantees.\n\nTheorem 3.3. Algorithm 1 finds the exact solution of (P2) within k* iterations, with an O(k* + C_sortNN) running time.\n\n²Note that (P2) is not strongly convex, hence the polynomial dependence on 1/ε rather than log(1/ε) for first-order methods. Other methods such as the Ellipsoid depend logarithmically on 1/ε, but suffer a worse dependence on n compared to first-order methods.\n\nProof of Theorem 3.3. Denote by α* the optimal solution of (P2), and by k* the corresponding number of nonzero weights. By Corollary 3.2, these k* nonzero weights correspond to the k* smallest values of β. Thus, we are left to show that (1) the optimal λ is of the form calculated by the algorithm; and (2) the algorithm halts after exactly k* iterations and outputs the optimal solution.\n\nLet us first find the optimal λ.
Since the non-zero elements of the optimal solution correspond to the k* smallest values of β, Equation (3) is equivalent to the following quadratic equation in λ:\n\nk* λ² − 2λ Σ_{i=1}^{k*} β_i + (Σ_{i=1}^{k*} β_i² − 1) = 0.\n\nSolving for λ and neglecting the solution that does not agree with α_i ≥ 0, ∀i ∈ [n], we get\n\nλ = (1/k*)(Σ_{i=1}^{k*} β_i + √(k* + (Σ_{i=1}^{k*} β_i)² − k* Σ_{i=1}^{k*} β_i²)).    (5)\n\nThe above implies that given k*, the optimal solution (satisfying KKT) can be directly derived by a calculation of λ according to Equation (5) and computing the α_i's according to Equation (1). Since Algorithm 1 calculates λ and α in the form appearing in Equations (5) and (1) respectively, it is therefore sufficient to show that it halts after exactly k* iterations in order to prove its optimality. The latter is a direct consequence of the following conditions:\n\n(1) Upon reaching iteration k*, Algorithm 1 necessarily halts.\n(2) For any k ≤ k* it holds that λ_k ∈ R.\n(3) For any k < k*, Algorithm 1 does not halt.\n\nNote that the first condition together with the second condition imply that λ_k is well defined until the algorithm halts (in the sense that the ">" operation in the while condition is meaningful). The first condition together with the third condition imply that the algorithm halts after exactly k* iterations, which concludes the proof. We are now left to show that the above three conditions hold:\n\nCondition (1): Note that upon reaching k*, Algorithm 1 necessarily calculates the optimal λ = λ_{k*}. Moreover, the entries of α* whose indices are greater than k* are necessarily zero, and in particular, α*_{k*+1} = 0.
By Equation (1), this implies that λ_{k*} ≤ β_{k*+1}, and therefore the algorithm halts upon reaching k*.\n\nIn order to establish conditions (2) and (3) we require the following lemma:\n\nLemma 3.4. Let λ_k be as calculated by Algorithm 1 at iteration k. Then, for any k ≤ k* the following holds:\n\nλ_k = min_{α∈Δ_n^{(k)}} ‖α‖₂ + α^⊤β, where Δ_n^{(k)} = {α ∈ Δ_n : α_i = 0, ∀i > k}.\n\nWe are now ready to prove the remaining conditions.\n\nCondition (2): Lemma 3.4 states that λ_k is the solution of a convex program over a nonempty set, therefore λ_k ∈ R.\n\nCondition (3): By definition, Δ_n^{(k)} ⊂ Δ_n^{(k+1)} for any k < n. Therefore, Lemma 3.4 implies that λ_k ≥ λ_{k+1} for any k < k* (minimizing the same objective with stricter constraints yields a higher optimal value). Now assume by contradiction that Algorithm 1 halts at some k₀ < k*; then the stopping condition of the algorithm implies that λ_{k₀} ≤ β_{k₀+1}. Combining the latter with λ_k ≥ λ_{k+1}, ∀k ≤ k*, and using β_k ≤ β_{k+1}, ∀k ≤ n, we conclude that\n\nλ_{k*} ≤ λ_{k₀} ≤ β_{k₀+1} ≤ β_{k*}.\n\nThe above implies that α_{k*} = 0 (see Equation (1)), which contradicts Corollary 3.2 and the definition of k*.\n\nRunning time: Note that the main running-time burden of Algorithm 1 is the calculation of λ_k for any k ≤ k*. A naive calculation of λ_k requires O(k) running time. However, note that λ_k depends only on Σ_{i=1}^k β_i and Σ_{i=1}^k β_i². Updating these sums incrementally requires only O(1) running time per iteration, yielding a total running time of O(k*). The remaining O(C_sortNN) running time is required in order to calculate the (sorted) β.\n\n3.2 Special Cases\n\nThe aim of this section is to discuss two special cases in which the bound of our algorithm coincides with familiar bounds in the literature, thus justifying the relaxed objective of (P2). We present here only a high-level description of both cases, and defer the formal details to the full version of the paper.\n\nThe solution of (P2) is a high-probability upper bound on the true prediction error |Σ_{i=1}^n α_i y_i − f(x_0)|. Two interesting cases to consider in this context are β_i = 0 for all i ∈ [n], and β_1 = . . . = β_n = β > 0. In the first case, our algorithm includes all labels in the computation of f̂(x_0), thus yielding a confidence bound of 2Cλ = 2b√((2/n) log(2/δ)) for the prediction error (with probability 1 − δ). Not surprisingly, this bound coincides with the standard Hoeffding bound for the task of estimating the mean value of a given distribution based on noisy observations drawn from this distribution. Since the latter is known to be tight (in general), so is the confidence bound obtained by our algorithm. In the second case as well, our algorithm will use all data points to arrive at the confidence bound 2Cλ = 2Ld̄ + 2b√((2/n) log(2/δ)), where we denote d(x_1, x_0) = . . . = d(x_n, x_0) = d̄. The second term is again tight by concentration arguments, whereas the first term cannot be improved due to the Lipschitz property of f(·), thus yielding an overall tight confidence bound for our prediction in this case.\n\n4 Experimental Results\n\nThe following experiments demonstrate the effectiveness of the proposed algorithm on several datasets. We start by presenting the baselines used for the comparison.\n\n4.1 Baselines\n\nThe standard k-NN: Given k, the standard k-NN finds the k nearest data points to x_0 (assume without loss of generality that these data points are x_1, . . . , x_k), and then estimates f̂(x_0) = (1/k) Σ_{i=1}^k y_i.\n\nThe Nadaraya-Watson estimator: This estimator assigns the data points weights that are proportional to some given similarity kernel K : R^d × R^d → R_+. That is,\n\nf̂(x_0) = Σ_{i=1}^n K(x_i, x_0) y_i / Σ_{i=1}^n K(x_i, x_0).\n\nPopular choices of kernel functions include the Gaussian kernel K(x_i, x_j) = (1/(√(2π)σ)) e^{−‖x_i−x_j‖²/(2σ²)}; the Epanechnikov kernel K(x_i, x_j) = (3/4)(1 − ‖x_i−x_j‖²/σ²) · 1{‖x_i−x_j‖ ≤ σ}; and the triangular kernel K(x_i, x_j) = (1 − ‖x_i−x_j‖/σ) · 1{‖x_i−x_j‖ ≤ σ}. Due to lack of space, we present here only the best-performing kernel function among the three listed above (on the tested datasets), which is the Gaussian kernel.\n\n4.2 Datasets\n\nIn our experiments we use 8 real-world datasets, all available in the UCI repository website (https://archive.ics.uci.edu/ml/). In each of the datasets, the feature vector consists of real values only, whereas the labels take different forms: in the first 6 datasets (QSAR, Diabetes, PopFailures, Sonar, Ionosphere, and Fertility), the labels are binary, y_i ∈ {0, 1}. In the last two datasets (Slump and Yacht), the labels are real-valued. Note that our algorithm (as well as the other two baselines) applies to all datasets without requiring any adjustment.
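To make the Nadaraya-Watson baseline above concrete, here is a minimal Python sketch with the Gaussian kernel (function and variable names are ours; the kernel's constant normalization factor cancels in the ratio, so it is omitted):\n\n```python\nimport numpy as np\n\ndef nadaraya_watson(X, y, x0, sigma):\n    """Nadaraya-Watson estimate at x0 with a Gaussian kernel of bandwidth sigma.\n\n    X : (n, d) array of data points; y : (n,) labels; x0 : (d,) query point.\n    """\n    sq_dists = np.sum((np.asarray(X, dtype=float) - np.asarray(x0, dtype=float)) ** 2, axis=1)\n    w = np.exp(-sq_dists / (2.0 * sigma ** 2))   # unnormalized Gaussian weights\n    return float(w @ np.asarray(y, dtype=float) / w.sum())\n```\n\nFor example, two neighbors equidistant from x_0 receive equal weight, so the estimate is the average of their labels regardless of the bandwidth σ.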
The number of samples n and the dimension of each sample d are given in Table 1 for each dataset.\n\nTable 1: Experimental results. The values of k, σ and L/C are determined via 5-fold cross-validation on the validation set. These values are then used on the test set to generate the (absolute) error rates presented in the table. In each line, the best result is marked with bold font, where an asterisk indicates a significance level of 0.05 over the second-best result.\n\nDataset (n, d) | Standard k-NN: Error (STD), k | Nadaraya-Watson: Error (STD), σ | k*-NN: Error (STD), Range of k\nQSAR (1055, 41) | 0.2467 (0.3445), k=2 | 0.2303 (0.3500), σ=0.1 | 0.2105* (0.3935), k=1-4\nDiabetes (1151, 19) | 0.3809 (0.2939), k=4 | 0.3675 (0.3983), σ=0.1 | 0.3666 (0.3897), k=1-9\nPopFailures (360, 18) | 0.1333 (0.2924), k=2 | 0.1155 (0.2900), σ=0.01 | 0.1218 (0.2302), k=2-24\nSonar (208, 60) | 0.1731 (0.3801), k=1 | 0.1711 (0.3747), σ=0.1 | 0.1636 (0.3661), k=1-2\nIonosphere (351, 34) | 0.1257 (0.3055), k=2 | 0.1191 (0.2937), σ=0.5 | 0.1113* (0.3008), k=1-4\nFertility (100, 9) | 0.1900 (0.3881), k=1 | 0.1884 (0.3787), σ=0.1 | 0.1760 (0.3094), k=1-5\nSlump (103, 9) | 3.4944 (3.3042), k=4 | 2.9154 (2.8930), σ=0.05 | 2.8057 (2.7886), k=1-4\nYacht (308, 6) | 6.4643 (10.2463), k=2 | 5.2577 (8.7051), σ=0.05 | 5.0418* (8.6502), k=1-3\n\n4.3 Experimental Setup\n\nWe randomly divide each dataset into two halves (one used for validation and the other for test). On the first half (the validation set), we run the two baselines and our algorithm with different values of k, σ and L/C (respectively), using 5-fold cross-validation. Specifically, we consider values of k in {1, 2, . . . , 10} and values of σ and L/C in {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}. The best values of k, σ and L/C are then used on the second half of the dataset (the test set) to obtain the results presented in Table 1. For our algorithm, the range of k that corresponds to the selection of L/C is also given.
Notice that we present here the average absolute error of our prediction, as a consequence of our theoretical guarantees.\n\n4.4 Results and Discussion\n\nAs evidenced by Table 1, our algorithm outperforms the baselines on 7 (out of 8) datasets, where on 3 datasets the outperformance is significant. It can also be seen that whereas the standard k-NN is restricted to choose one value of k per dataset, our algorithm fully utilizes the ability to choose k adaptively per data point. This validates our theoretical findings, and highlights the advantage of adaptive selection of k.\n\n5 Conclusions and Future Directions\n\nWe have introduced a principled approach to locally weighted optimal estimation. By explicitly phrasing the bias-variance tradeoff, we defined the notion of optimal weights and optimal number of neighbors per decision point, and consequently devised an efficient method to extract them. Note that our approach could be extended to handle multiclass classification, as well as scenarios in which predictions of different data points correlate (and we have an estimate of their correlations). Due to lack of space we leave these extensions to the full version of the paper.\n\nA shortcoming of current non-parametric methods, including our k*-NN algorithm, is their limited geometrical perspective. Concretely, all of these methods only consider the distances between the decision point and the dataset points, i.e., {d(x_0, x_i)}_{i=1}^n, and ignore the geometrical relations between the dataset points themselves, i.e., {d(x_i, x_j)}_{i,j=1}^n. We believe that our approach opens an avenue for taking advantage of this additional geometrical information, which may have a great effect on the quality of our predictions.\n\nReferences\n\n[1] Thomas M Cover and Peter E Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.\n\n[2] Evelyn Fix and Joseph L Hodges Jr.
Discriminatory analysis-nonparametric discrimination: consistency properties. Technical report, DTIC Document, 1951.\n\n[3] Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.\n\n[4] Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.\n\n[5] Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J McLachlan, Angus Ng, Bing Liu, S Yu Philip, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.\n\n[6] DA Adeniyi, Z Wei, and Y Yongquan. Automated web usage data mining and recommendation system using k-nearest neighbor (knn) classification method. Applied Computing and Informatics, 12(1):90–108, 2016.\n\n[7] Bruno Trstenjak, Sasa Mikac, and Dzenana Donko. Knn with tf-idf based framework for text categorization. Procedia Engineering, 69:1356–1364, 2014.\n\n[8] BL Deekshatulu, Priti Chandra, et al. Classification of heart disease using k-nearest neighbor and genetic algorithm. Procedia Technology, 10:85–94, 2013.\n\n[9] Sadegh Bafandeh Imandoust and Mohammad Bolandraftar. Application of k-nearest neighbor (knn) approach for predicting economic events: Theoretical background. International Journal of Engineering Research and Applications, 3(5):605–610, 2013.\n\n[10] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.\n\n[11] Richard J Samworth et al. Optimal weighted nearest neighbour classifiers. The Annals of Statistics, 40(5):2733–2763, 2012.\n\n[12] Charles J Stone. Consistent nonparametric regression. The Annals of Statistics, pages 595–620, 1977.\n\n[13] Luc P Devroye, TJ Wagner, et al.
Distribution-free consistency results in nonparametric discrimination and regression function estimation. The Annals of Statistics, 8(2):231–239, 1980.\n\n[14] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.\n\n[15] Jianqing Fan and Irene Gijbels. Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66, volume 66. CRC Press, 1996.\n\n[16] Dietrich Wettschereck and Thomas G Dietterich. Locally adaptive nearest neighbor algorithms. Advances in Neural Information Processing Systems, pages 184–184, 1994.\n\n[17] Shiliang Sun and Rongqing Huang. An adaptive k-nearest neighbor algorithm. In 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery.\n\n[18] Anil K Ghosh. On nearest neighbor classification using adaptive choice of k. Journal of Computational and Graphical Statistics, 16(2):482–502, 2007.\n\n[19] Li Baoli, Lu Qin, and Yu Shiwen. An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP), 3(4):215–226, 2004.\n\n[20] Ian S Abramson. On bandwidth variation in kernel estimates - a square root law. The Annals of Statistics, pages 1217–1223, 1982.\n\n[21] Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.\n\n[22] Serdar Demir and Öniz Toktamiş. On the adaptive Nadaraya-Watson kernel regression estimators. Hacettepe Journal of Mathematics and Statistics, 39(3), 2010.\n\n[23] Khulood Hamed Aljuhani et al. Modification of the adaptive Nadaraya-Watson kernel regression estimator. Scientific Research and Essays, 9(22):966–971, 2014.\n\n[24] Brian Kulis. Metric learning: A survey.
Foundations and Trends in Machine Learning, 5(4):287–364, 2012.\n\n[25] Gérard Biau and Luc Devroye. Lectures on the Nearest Neighbor Method, volume 1. Springer, 2015.\n\n[26] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.\n", "award": [], "sourceid": 2483, "authors": [{"given_name": "Oren", "family_name": "Anava", "institution": "Technion"}, {"given_name": "Kfir", "family_name": "Levy", "institution": "Technion"}]}