{"title": "Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1925, "page_last": 1933, "abstract": "This paper sheds light on some fundamental connections of the diffusion decision making model of neuroscience and cognitive psychology with k-nearest neighbor classification. We show that conventional k-nearest neighbor classification can be viewed as a special problem of the diffusion decision model in the asymptotic situation. Applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in k-nearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectivness of our classification criteria.", "full_text": "Diffusion Decision Making for Adaptive\n\nk-Nearest Neighbor Classi\ufb01cation\n\nYung-Kyun Noh, Frank Chongwoo Park\nSchl. of Mechanical and Aerospace Engineering\n\nSeoul National University\n\nSeoul 151-744, Korea\n\n{nohyung,fcp}@snu.ac.kr\n\nDaniel D. Lee\n\nDept. of Electrical and Systems Engineering\n\nUniversity of Pennsylvania\nPhiladelphia, PA 19104, USA\nddlee@seas.upenn.edu\n\nAbstract\n\nThis paper sheds light on some fundamental connections of the diffusion decision\nmaking model of neuroscience and cognitive psychology with k-nearest neighbor\nclassi\ufb01cation. We show that conventional k-nearest neighbor classi\ufb01cation can\nbe viewed as a special problem of the diffusion decision model in the asymptotic\nsituation. By applying the optimal strategy associated with the diffusion decision\nmodel, an adaptive rule is developed for determining appropriate values of k in k-\nnearest neighbor classi\ufb01cation. 
Making use of the sequential probability ratio test\n(SPRT) and Bayesian analysis, we propose \ufb01ve different criteria for adaptively\nacquiring nearest neighbors. Experiments with both synthetic and real datasets\ndemonstrate the effectiveness of our classi\ufb01cation criteria.\n\n1\n\nIntroduction\n\nThe recent interest in understanding human perception and behavior from the perspective of neuro-\nscience and cognitive psychology has spurred a revival of interest in mathematical decision theory.\nOne of the standard interpretations of this theory is that when there is a continuous input of noisy\ninformation, a decision becomes certain only after accumulating suf\ufb01cient information. It is also\ntypically understood that early decisions save resources. Among the many theoretical explanations\nfor this phenomenon, the diffusion decision model offers a particularly appealing explanation of\nhow information is accumulated and how the time involved in making a decision affects overall ac-\ncuracy. The diffusion decision model considers the diffusion of accumulated evidence toward one\nof the competing choices, and reaches a decision when the evidence meets a pre-de\ufb01ned con\ufb01dence\nlevel.\nThe diffusion decision model successfully explains the distribution of decision times for humans\n[13, 14, 15]. More recently, this model offers a compelling explanation of the neuronal decision\nmaking process in the lateral intraparietal (LIP) area of the brain for perceptual decision making\nbased on visual evidence [2, 11, 16]. The fundamental premise behind this model is that there is a\ntradeoff between decision times and accuracy, and that both are controlled by the con\ufb01dence level.\nAs described in Bogacz et al [3], the sequential probability ratio test (SPRT) is one mathematical\nmodel that explains this tradeoff. 
More recent studies also demonstrate how SPRT can be used to explain evidence arising from Poisson processes [6, 21].\nNow shifting our attention to machine learning, the well-known k-nearest neighbor classification uses a simple majority voting strategy that, at least in the asymptotic case, implicitly involves a similar tradeoff between time and accuracy. According to Cover and Hart [4], the expected accuracy of k-nearest neighbor classification always increases with respect to k when there is sufficient data. At the same time, there is a natural preference to use fewer resources or, equivalently, fewer nearest neighbors. If one seeks to maximize the accuracy for a given number of total nearest neighbors, this naturally leads to the idea of using different ks for different data.\n\nFigure 1: Diffusion decision model. The evidence of decision making is accumulated, and it diffuses over time (to the right). Once the accumulated evidence reaches one of the confidence levels of either choice, z or −z, the model stops collecting any more evidence and makes a decision.\n\nAt a certain level, this adaptive idea can be anticipated, but methods described in the existing literature are almost exclusively heuristic-based, without offering a thorough understanding of the situations in which these heuristics are effective [1, 12, 19].\nIn this work, we present a set of simple, theoretically sound criteria for adaptive k-nearest neighbor classification. We first show that the conventional majority voting rule is identical to the diffusion decision model when applied to data from two different Poisson processes. Depending on how the accumulating evidence is defined, it is possible to construct five different criteria based on different statistical tests. First, we derive three different criteria using the SPRT statistical test. 
Second, using standard Bayesian analysis, we derive two probabilities for the case where one density function is greater than the other. Our five criteria are then used as diffusing evidence; once the evidence exceeds a certain confidence level, collection of information can cease and a decision can be made immediately. Despite the complexity of the derivations involved, the resulting five criteria have a particularly simple and appealing form. This feature can be traced to the memoryless property of Poisson processes. In particular, all criteria can be cast as a function of the information of only one nearest neighbor in each class. Using our derivation, we consider this property to be the result of the assumption that we have sufficient data; the criteria are not guaranteed to work in the event that there is insufficient data. We present experimental results involving real and synthetic data to verify this conjecture.\nThe remainder of the paper is organized as follows. In Section 2, a particular form of the diffusion decision model is reviewed for Poisson processes, and two simple tests based on SPRT are derived. The relationship between k-nearest neighbor classification and diffusion decision making is explained in Section 3. In Section 4, we describe the adaptive k-nearest neighbor classification procedure in terms of the diffusion decision model, and we introduce five different criteria within this context. Experiments on synthetic and real datasets are presented in Section 5, and the main conclusions are summarized in Section 6.\n\n2 Diffusion Decision Model for Two Poisson Processes\n\nThe diffusion decision model is a stochastic model for decision making. The model considers the diffusion of evidence in favor of one of two possible choices as information is continuously accumulated. 
After initial wavering between the two choices, the evidence finally reaches a level of confidence where a decision is made as in Fig. 1.\nIn mathematical modeling of this diffusion process, Gaussian noise has been predominantly used as a model for zigzagging upon a constant drift toward a choice [3, 13]. However, when we consider two competing Poisson signals, a simpler statistical test can be used instead of estimating the direction of the drift. In the studies of decision making in the lateral intraparietal (LIP) area of the brain [2, 11], two Poisson processes are assumed to have rate parameters λ+ and λ−, where we know that λ+ > λ−, but the exact values are unknown. When it must be determined which Poisson process has the larger rate λ+, a sequential probability ratio test (SPRT) can be used to explain a diffusion decision model [6, 21].\n\nThe Poisson distribution we use has the form p(N|λ, T) = ((λT)^N / N!) exp(−λT), and we consider two Poisson distributions for N1 and N2 at times T1 and T2, respectively: p(N1|λ1, T1) and p(N2|λ2, T2). Here, λ1 and λ2 are the rate parameters, and one of them equals λ+ while the other equals λ−. Now, we apply the statistical test of Wald [18] for a confidence α (> 1):\n\np(N1|λ1 = λ+) p(N2|λ2 = λ−) / [p(N1|λ1 = λ−) p(N2|λ2 = λ+)] > α or < 1/α    (1)\n\nfor the situation where there are N1 signals at time T1 for the first Poisson process and N2 signals at time T2 for the second process. We can determine that λ1 equals λ+ once the left-hand side is greater than α, and that λ2 equals λ+ once it is less than 1/α; otherwise, we must collect more information. 
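The log form of the test in Eq. (1) is straightforward to compute. The following is a minimal sketch (our own illustrative code, not from the paper; function names and the example rates are assumptions), where the two counts N1 and N2 are observed at times T1 and T2:

```python
import math

def poisson_log_pmf(n, lam, t):
    # log p(N | lambda, T) for a Poisson count N with mean lambda * T
    return n * math.log(lam * t) - lam * t - math.lgamma(n + 1)

def sprt_poisson(n1, t1, n2, t2, lam_plus, lam_minus, alpha):
    """Wald's test of Eq. (1): returns 1 if lambda1 is decided to be
    lambda+, 2 if lambda2 is decided to be lambda+, or None when more
    evidence must be collected."""
    log_ratio = (poisson_log_pmf(n1, lam_plus, t1) + poisson_log_pmf(n2, lam_minus, t2)
                 - poisson_log_pmf(n1, lam_minus, t1) - poisson_log_pmf(n2, lam_plus, t2))
    if log_ratio > math.log(alpha):
        return 1
    if log_ratio < -math.log(alpha):
        return 2
    return None
```

For equal observation times T1 = T2 the log ratio collapses to (N1 − N2) log(λ+/λ−), matching the simplification obtained by taking the log of the test in the text.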
According to Wald and Wolfowitz [18], this test is optimal in that it requires the fewest observations on average for a given probability of error.\nBy taking the log on both sides, we can rewrite the test as\n\nlog(λ+/λ−) (N1 − N2) − (λ+ − λ−)(T1 − T2) > log α or < −log α.    (2)\n\nConsidering two special situations, this equation can be reduced to two different, simple tests. First, we can consider observation of the numbers N1 and N2 at a certain time T = T1 = T2. Then the test in Eq. (2) reduces to a test previously proposed in [21]:\n\n|N1 − N2| > zN    (3)\n\nwhere zN is a constant satisfying zN = log α / log(λ+/λ−). Another simple test can be made by using the observation times T1 and T2 when we find the same number of signals N = N1 = N2:\n\n|T1 − T2| > zT    (4)\n\nwhere zT satisfies zT = log α / (λ+ − λ−).\nHere, we can consider ΔN = N1 − N2 and ΔT = T1 − T2 as two different forms of evidence in the diffusion decision model. The evidence diffuses as we collect more information, and we come to make a decision once the evidence reaches the confidence levels, ±zN for ΔN and ±zT for ΔT. In this work, we refer to the first model, using the criterion ΔN, as the ΔN rule and the second model, using ΔT, as the ΔT rule.\nAlthough the ΔN rule has been previously derived and used [21], we propose four more test criteria in this paper, including Eq. (4). Later, we show that diffusion decision making with these five criteria is related to different methods for k-nearest neighbor classification.\n\n3 Equivalence of Diffusion Decision Model and k-Nearest Neighbor Classification\n\nA conventional k-nearest neighbor (k-NN) classification takes a majority voting strategy using k nearest neighbors. 
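As a concrete baseline for the discussion that follows, here is a minimal sketch of this majority voting strategy (our own illustrative code, not an implementation from the paper; it assumes Euclidean distance and list-based data):

```python
from collections import Counter

def knn_majority_vote(train_points, train_labels, query, k):
    # Rank training points by squared Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_points[i], query)))
    # Majority vote among the k nearest neighbors.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

With an odd k and two classes, ties cannot occur, which is the setting analyzed in this section.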
According to Cover and Hart [4], in the limit of infinite sampling, this simple majority voting rule can produce a fairly low expected error, and furthermore, this error decreases further as a larger k is used. This theoretical result is obtained from the relationship between the k-NN classification error and the optimal Bayes error: the expected error with one nearest neighbor is always less than twice the Bayes error, and the error decreases asymptotically toward the Bayes error as k grows [4].\nIn this situation, we can claim that k-NN classification actually performs the aforementioned diffusion decision making for Poisson processes. The identity comes from two equivalence relationships: first, the logical equivalence between two decision rules; second, the equivalence of the distribution of nearest neighbors to the Poisson distribution in an asymptotic situation.\n\n3.1 Equivalent Strategy of Majority Voting\n\nHere, we first show an equivalence between the conventional k-NN classification and a novel comparison algorithm:\n\nTheorem: For two-class data, we consider the N-th nearest datum of each class from the testing point. With an odd number k, the majority voting rule in k-NN classification is equivalent to the rule of choosing the class whose N-th nearest datum has the smaller distance to the testing point, for k = 2N − 1.\nProof: Among the k-NNs of a test point, if there are at least N data having label C, for C ∈ {1, 2}, the test point is classified as class C according to the majority voting because N = (k + 1)/2 > k/2. If we consider three distances dk to the k-th nearest neighbor among all data, dN,C to the N-th nearest neighbor in class C, and dN,¬C to the N-th nearest neighbor in class non-C, then both dN,C ≤ dk and dN,¬C > dk are satisfied in this case. 
This completes one direction of the proof, that the selection of class C by majority voting implies dN,C < dN,¬C. The opposite direction can be proved similarly.\nTherefore, instead of counting the number of nearest neighbors, we can classify a test point using the two separate N-th nearest neighbors of the two classes and comparing their distances. This logical equivalence applies regardless of the underlying density functions.\n\n3.2 Nearest neighbors as Poisson processes\n\nThe random generation of data from a particular underlying density function induces a density function of the distance to the nearest neighbors. When the density function is λ(x) for x ∈ R^D and we consider a D-dimensional hypersphere of volume V with the N-th nearest neighbor on its surface, a random variable u = MV, which is the volume of the sphere V multiplied by the number of data M, asymptotically converges in distribution to the Erlang density function [10]:\n\np(u|λ) = (λ^N / Γ(N)) exp(−λu) u^(N−1)    (5)\n\nwith a large amount of data. Here, the volume element is a function of the distance d, which can be represented as V = γd^D with γ = π^(D/2) / Γ(D/2 + 1), a proportionality constant for a hypersphere volume. This Erlang function is a special case of the Gamma density function when the parameter N is an integer.\nWe can also note that this Erlang density function implies the Poisson distribution with respect to N [20], and we can write the distribution of N as follows:\n\np(N|λ) = (λ^N / Γ(N + 1)) exp(−λ).    (6)\n\nThis equation shows that the appearance of nearest neighbors can be approximated with Poisson processes. In other words, as a hypersphere grows at a constant rate in volume, the occurrence of new points within it will follow a Poisson distribution.\nThis Erlang function in Eq. 
(5) comes from the asymptotic convergence in distribution of the true distribution, a binomial distribution with a finite number of samples [10]. Here, we note that, with a finite number of samples, the memoryless property of the Poisson process disappears. This results in the breakdown of the independence assumption between the posterior probabilities of the classes, which Cover and Hart used implicitly when they derived the expected error of k-NN classification [4].\nOn the other hand, once we have enough data, and hence the density functions in Eq. (5) and Eq. (6) describe the data correctly, we can expect the equivalence between diffusion decision making and k-NN classification. In this case, the nearest neighbors are samples of a Poisson process with rate parameter λ, which is the probability density at the test point.\nNow we can return to the conventional k-NN classification. By the theorem in Section 3.1 and the arguments in this section, the k-NN classification strategy is the same as the strategy of comparing two Poisson processes using the N-th samples of each class. This connection naturally extends the conventional k-NN classification to an adaptive method that uses different ks, controlled by the confidence level in the diffusion decision model.\n\n4 Criteria for Adaptive k-NN Classification\n\nUsing this equivalence between the diffusion decision model and k-NN classification, we can extend the conventional majority voting strategy to more sophisticated adaptive strategies. First, the SPRT criteria of the previous section, the ΔN rule and the ΔT rule, can be used. For the ΔN rule in Eq. (3), we can use the numbers of nearest neighbors N1 and N2 within a fixed distance d, then compare |ΔN| = |N1 − N2| with a pre-defined confidence level zN. Instead of making an immediate decision, we can collect more nearest neighbors by increasing d until Eq. (3) is satisfied. 
This is the “ΔN rule” for adaptive k-NN classification.\nIn terms of the ΔT rule in Eq. (4), using the correspondence of time in the original SPRT to the volume within the hypersphere in k-NN classification, we can make two different criteria for adaptive k-NN classification. First, we consider the two volume elements V1 and V2 of the N-th nearest neighbors, and the criterion can be rewritten as |V1 − V2| > zV. We refer to this rule as the “ΔV rule”.\nAn additional criterion for the ΔT rule considers a more conservative rule using the volume of the (N + 1)-th nearest neighbor hypersphere. Since a slightly smaller hypersphere than this one still contains N nearest neighbors, we can make the same test stricter, so that diffusion stops less easily, by replacing the smaller volume in the ΔV rule with the volume of the (N + 1)-th nearest neighbor hypersphere of that class. We refer to this rule as the “Conservative ΔV rule” because it is more cautious in making a decision.\nIn addition to the SPRT method, with which we derive three different criteria, we can also derive several stopping criteria using the Bayesian approach. If we consider λ as a random variable and apply an appropriate prior, we can obtain a posterior distribution of λ as well as the probability of P(λ1 > λ2) or P(λ1 < λ2). In the following section, we show how we can derive these probabilities and how they can be used as evidence in the diffusion decision making model.\n\n4.1 Bayesian Criteria\n\nFor both Eq. (5) and Eq. (6), we consider λ as a random variable, and we can apply a conjugate prior for λ:\n\np(λ) = (b^a / Γ(a)) λ^(a−1) exp(−λb)    (7)\n\nwith constants a and b. The constant a is an integer satisfying a ≥ 1, and b is a real number. With this prior Eq. (7), the posteriors for the two likelihoods Eq. (5) and Eq. 
(6) are obtained easily:\n\np(λ|u) = ((u + b)^(N+a) / Γ(N + a)) λ^(N+a−1) exp(−λ(u + b))    (8)\n\np(λ|N) = ((b + 1)^(N+a) / Γ(N + a)) λ^(N+a−1) exp(−λ(b + 1)).    (9)\n\nFirst, we derive P(λ1 > λ2|u1, u2) for u1 and u2 obtained using the N-th nearest neighbors in class 1 and class 2. Because the posterior functions of the different classes are independent of each other, this probability of λ1 > λ2 is simply obtained by the double integration:\n\nP(λ1 > λ2|u1, u2) = ∫_0^∞ p(λ2|u2) [ ∫_{λ2}^∞ p(λ1|u1) dλ1 ] dλ2.    (10)\n\nAfter some calculation, the integration gives an extremely simple analytic solution:\n\nP(λ1 > λ2|u1, u2) = Σ_{m=0}^{N+a−1} (2N + 2a − 1 choose m) (u1 + b)^m (u2 + b)^(2N+2a−1−m) / (u1 + u2 + 2b)^(2N+2a−1).    (11)\n\nHere, we consider only the case a = 1, and it is interesting to note that this probability is equivalent to the probability of flipping a biased coin 2N + 1 times and observing at most N heads. This probability from the Bayesian approach can be efficiently computed\n\nFigure 2: Decision making process for the nearest neighbor classification with (a) 80% and (b) 90% confidence level. Sample data are generated from the probability densities λ1 = 0.8 and λ2 = 0.2. For incrementing N-th nearest neighbors of the different classes, the criterion probabilities P(λ1 > λ2|u1, u2) and P(λ1 < λ2|u1, u2) are calculated and compared with the confidence level. 
Unless the probability exceeds the confidence level, the next (N + 1)-th nearest neighbors are collected and the criterion probabilities are calculated again. In this figure, the diffusion of the criterion probability P(λ1 > λ2|u1, u2) is displayed for different realizations, where the evidence stops diffusing once the criterion passes the threshold at which enough evidence has accumulated. The bars represent the number of points that are correctly (red, upward bars) and incorrectly (blue, downward bars) classified at each stage of the computation. Using a larger confidence level results in fewer errors, but with a concomitant increase in the number of nearest neighbors used.\n\nin an incremental fashion, and the nearest neighbor computation can be adaptively stopped once there is enough confidence in the evidence probability.\nThe second probability, P(λ1 > λ2|N1, N2), for the numbers of nearest neighbors N1 and N2 within a particular distance, can be similarly derived. Using the double integration of Eq. (9), we can derive the analytic result again as\n\nP(λ1 > λ2|N1, N2) = (1 / 2^(N1+N2+2a−1)) Σ_{m=0}^{N1+a−1} (N1 + N2 + 2a − 1 choose m).    (12)\n\nBoth the probabilities in Eq. (11) and Eq. (12) can be used as evidence that diffuses along with incoming information. Stopping criteria for diffusion can be derived using these probabilities.\n\n4.2 Adaptive k-NN Classification\n\nOf interest in the diffusion decision model is the relationship between the accuracy and the amount of resources needed to obtain that accuracy. In a diffusion decision setting for k-NN classification, we can control the amount of resources using the confidence level. For example, in Fig. 2, we generated data from two uniform density functions, λ1 = 0.8 and λ2 = 0.2, for the different classes, and we applied different confidence levels, 0.8 and 0.9, in Fig. 2(a) and (b), respectively. 
Using the P(λ1 > λ2|u1, u2) criterion in Eq. (11), we applied the adaptive k-NN classification with increasing N for the two classes.\nFig. 2 shows the decision results of the classification with incrementing N for 1000 realizations, and a few diffusion examples of the evidence probability in Eq. (11) are presented. Depending on the confidence level, the average number of nearest neighbors used differs. In Fig. 2(a), where the confidence level is lower than in Fig. 2(b), the evidence reaches the confidence level at an earlier stage, while the decision in Fig. 2(b) tends to select the first class more often than in Fig. 2(a). Considering that the optimal Bayes classification chooses class 1 when λ1 > λ2, the decisions for class 2 can be considered errors. In this sense, we can say that with the higher confidence level, decisions are made more correctly while using more resources. Therefore, the efficiencies of the strategies can be compared using the accuracies as well as the average number of nearest neighbors used.\n\nFigure 3: Classification accuracy (vertical axis) versus the average number of nearest neighbors used (horizontal axis) for adaptive k-NN classification. (a) Uniform probability densities for λ1 = 0.8 and λ2 = 0.2 in 100-dimensional space, (b) CIFAR-10, (c) 2 × 10^5 data per class for 5-dimensional Gaussians, and (d) 2 × 10^6 data per class for the same Gaussians as in (c) are used.\n\n5 Experiments\n\nIn the experiments, we compare the accuracy of the algorithms to the number of nearest neighbors used, for various confidence levels of the criteria. We used the conventional k-NN classification as well as the proposed adaptive methods. 
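The criterion probability in Eq. (11) is cheap to evaluate through its biased-coin form. A minimal sketch for the a = 1 case (our own illustrative code; the names are assumptions, u1 and u2 denote the normalized volumes to the N-th nearest neighbor of each class, and b is the prior constant):

```python
from math import comb

def p_lambda1_greater(u1, u2, N, b=1.0):
    # Eq. (11) with a = 1: probability of at most N heads in 2N + 1 flips
    # of a coin with head probability p = (u1 + b) / (u1 + u2 + 2b).
    p = (u1 + b) / (u1 + u2 + 2 * b)
    return sum(comb(2 * N + 1, m) * p ** m * (1 - p) ** (2 * N + 1 - m)
               for m in range(N + 1))
```

Collection of nearest neighbors can stop as soon as this probability or its complement exceeds the chosen confidence level.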
Adaptive classification includes the comparison rule of the N-th nearest neighbors using three criteria, namely the ΔV rule (DV), the Conservative ΔV rule (CDV), and the Bayesian probability in Eq. (11) (PV), as well as the comparison rule of the N1-th and N2-th nearest neighbors at a given volume using two rules, namely the ΔN rule (DN) and the Bayesian probability in Eq. (12) (PN). We present the average accuracies resulting from the use of the k-NN classification and the five adaptive rules with respect to the average number of nearest neighbors used.\nWe first show the results on synthetic datasets. In Fig. 3(a), we used two uniform probability densities λ1 = 0.8 and λ2 = 0.2 in 100-dimensional space, and we classified a test point based on the nearest neighbors. In this figure, all algorithms are expected to approach the Bayes performance based on Cover and Hart’s approach when the average number of nearest neighbors increases. In this experiment, we can observe that all five proposed adaptive algorithms approach the Bayes error more quickly than the other methods, at similar rates to one another.\nHere, we also present the results of the other adaptive algorithms CNN [12], Race, minRace, MinMaxRatio, and Jigang [19]. 
They perform majority voting with increasing k; CNN stops collecting more nearest neighbors once more than a certain number of consecutive neighbors with the same label are found; Race stops when the total number of neighbors of one class exceeds a certain level; minRace stops when all classes have at least a predefined number of neighbors; MinMaxRatio considers the ratio between the numbers of nearest neighbors in the different classes; lastly, Jigang is a probability criterion slightly different from Eq. (12). Except for Jigang’s method, these algorithms perform poorly, while our five algorithms perform equally well even though they use different information, probably because the performance produced by the diffusion decision making algorithms is optimal.\nFig. 3(b) shows the experiments for a CIFAR-10 subset of the tiny images dataset [17]. The CIFAR-10 set contains 32 × 32 color images in 10 classes. Each class has 6000 images, and they are separated into one testing set and five training sets. With this 10-class data, we first performed Fisher Discriminant Analysis to obtain a 9-dimensional subspace, and then all of the adaptive algorithms were applied in this subspace. The result is the average accuracy over the five different training sets and over all possible pairs of the 10 classes. Because the underlying density is non-uniform here, the results show a decrease in performance when algorithms use nearest neighbors that are not close to the test point. Except for the DV and PV criteria, all of our adaptive algorithms outperform all of the other methods. The k-NN classification in the original data space shows a maximal average performance of 0.721 at k = 3, which is far less than the overall accuracies in the figure, because the distance information is poor in the high-dimensional space.\nFig. 
3(c) and (d) clearly show that our algorithms are not guaranteed to work with insufficient data. We generated data from two different Gaussian functions and tried to classify a datum located at one of the modes. The number of generated data is 2 × 10^5 per class for (c) and 2 × 10^6 per class for (d) in 5-dimensional space. We present the average result of 5000 realizations, and the comparison of the two figures shows that our adaptive algorithms work as expected when Cover and Hart’s asymptotic data condition holds. The Poisson process assumption also holds when this condition is satisfied.\n\n6 Conclusions\n\nIn this work, we showed that k-NN classification in the asymptotic limit is equivalent to the diffusion decision model for decision making. Nearest neighbor classification and the diffusion decision model are both well-known models in machine learning and cognitive science, respectively, but the intimate connection between them has not been studied before. Using an analysis of Poisson processes, we showed how classification using incrementally increasing nearest neighbors can be mapped to a simple threshold-based decision model.\nIn the diffusion decision model, the confidence level plays a key role in determining the tradeoff between speed and accuracy. The notion of confidence can also be applied to nearest neighbor classification to adapt the number of nearest neighbors used in making the classification decision. We presented several different criteria for choosing the appropriate number of nearest neighbors based on the sequential probability ratio test in addition to Bayesian inference. We demonstrated the utility of these methods in modulating speed versus accuracy on both simulated and benchmark datasets.\nIt is straightforward to extend these methods to other datasets and algorithms that utilize neighborhood information. 
Future work will investigate how our results would scale with dataset size and feature representations. Potential benefits of this work include a well-grounded approach to speeding up classification using parallel computation on very large datasets.\n\nAcknowledgments\n\nThis research is supported in part by the US Office of Naval Research, Intel Science and Technology Center, AIM Center, KIST-CIR, ROSAEC-ERC, SNU-IAMD, and the BK21.\n\nReferences\n[1] A. F. Atiya. Estimating the posterior probabilities using the k-nearest neighbor rule. Neural Computation, 17(3):731–740, 2005.\n[2] J. M. Beck, W. J. Ma, R. Kiani, T. Hanks, A. K. Churchland, J. Roitman, M. N. Shadlen, P. E. Latham, and A. Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–1152, 2008.\n[3] R. Bogacz, E. Brown, J. Moehlis, P. Holmes, and J. D. Cohen. The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review, 113(4):700–765, 2006.\n[4] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.\n[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics. Springer, 1996.\n[6] M. A. Girshick. Contributions to the theory of sequential analysis I. The Annals of Mathematical Statistics, 17:123–143, 1946.\n[7] M. Goldstein. kn-Nearest Neighbor Classification. IEEE Transactions on Information Theory, IT-18(5):627–630, 1972.\n[8] C. C. Holmes and N. M. Adams. A probabilistic nearest neighbour method for statistical pattern recognition. Journal of the Royal Statistical Society Series B, 64(2):295–306, 2002.\n[9] M. D. Lee, I. G. Fuss, and D. J. Navarro. A Bayesian approach to diffusion models of decision-making and response time. 
In Advances in Neural Information Processing Systems 19, pages 809–816. 2007.\n[10] N. Leonenko, L. Pronzato, and V. Savani. A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36:2153–2182, 2008.\n[11] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 9(11):1432–1438, 2006.\n[12] S. Ougiaroglou, A. Nanopoulos, A. N. Papadopoulos, Y. Manolopoulos, and T. Welzer-Druzovec. Adaptive k-nearest-neighbor classification using a dynamic number of nearest neighbors. In Proceedings of the 11th East European Conference on Advances in Databases and Information Systems, pages 66–82, 2007.\n[13] R. Ratcliff and G. McKoon. The diffusion decision model: theory and data for two-choice decision tasks. Neural Computation, 20(4):873–922, 2008.\n[14] R. Ratcliff and J. N. Rouder. A diffusion model account of masking in two-choice letter identification. Journal of Experimental Psychology: Human Perception and Performance, 26(1):127–140, 2000.\n[15] M. N. Shadlen, A. K. Hanks, A. K. Churchland, R. Kiani, and T. Yang. The speed and accuracy of a simple perceptual decision: a mathematical primer. Bayesian Brain: Probabilistic Approaches to Neural Coding, 2006.\n[16] M. N. Shadlen and W. T. Newsome. The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18:3870–3896, 1998.\n[17] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.\n[18] A. Wald and J. Wolfowitz. Optimum character of the sequential probability ratio test. Annals of Mathematical Statistics, 19:326–339, 1948.\n[19] J. Wang, P. Neskovic, and L. N. Cooper. 
Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence. Pattern Recognition, 39(3):417–423, 2006.\n[20] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference (Springer Texts in Statistics). Springer, December 2003.\n[21] J. Zhang and R. Bogacz. Optimal decision making on the basis of evidence represented in spike trains. Neural Computation, 22(5):1113–1148, 2010.\n", "award": [], "sourceid": 954, "authors": [{"given_name": "Yung-kyun", "family_name": "Noh", "institution": null}, {"given_name": "Frank", "family_name": "Park", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}