{"title": "An Incremental Nearest Neighbor Algorithm with Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 612, "page_last": 618, "abstract": "", "full_text": "An Incremental Nearest Neighbor \n\nAlgorithm with Queries \n\nJoel Ratsaby\u00b7 \n\nN.A.P. Inc. \n\nHollis, New York \n\nAbstract \n\nWe consider the general problem of learning multi-category classifi(cid:173)\ncation from labeled examples. We present experimental results for \na nearest neighbor algorithm which actively selects samples from \ndifferent pattern classes according to a querying rule instead of the \na priori class probabilities. The amount of improvement of this \nquery-based approach over the passive batch approach depends on \nthe complexity of the Bayes rule. The principle on which this al(cid:173)\ngorithm is based is general enough to be used in any learning algo(cid:173)\nrithm which permits a model-selection criterion and for which the \nerror rate of the classifier is calculable in terms of the complexity \nof the model. \n\n1 \n\nINTRODUCTION \n\nWe consider the general problem of learning multi-category classification from la(cid:173)\nbeled examples. In many practical learning settings the time or sample size available \nfor training are limited. This may have adverse effects on the accuracy of the result(cid:173)\ning classifier. For instance, in learning to recognize handwritten characters typical \ntime limitation confines the training sample size to be of the order of a few hundred \nexamples. It is important to make learning more efficient by obtaining only training \ndata which contains significant information about the separability of the pattern \nclasses thereby letting the learning algorithm participate actively in the sampling \nprocess. Querying for the class labels of specificly selected examples in the input \nspace may lead to significant improvements in the generalization error (cf. Cohn, \nAtlas & Ladner, 1994, Cohn, 1996). 
However, in learning pattern recognition this is not always useful or possible. In the handwritten recognition problem, the computer could ask the user for labels of selected patterns generated by the computer; however, such patterns are not necessarily representative of his handwriting style but rather of his reading recognition ability. On the other hand, it is possible to let the computer (learner) select particular pattern classes, not necessarily according to their a priori probabilities, and then obtain randomly drawn patterns according to the underlying unknown class-conditional probability distribution. We refer to such selective sampling as sample querying. Recent theory (cf. Ratsaby, 1997) indicates that such freedom to select different classes at any time during the training stage is beneficial to the accuracy of the classifier learnt. In the current paper we report on experimental results for an incremental algorithm which utilizes this sample-querying procedure.

*The author's coordinates are: Address: Hamered St. #2, Ra'anana, ISRAEL. Email: jer@ee.technion.ac.il

2 THEORETICAL BACKGROUND

We use the following setting: Given M distinct pattern classes, each with a class-conditional probability density f_i(x), 1 ≤ i ≤ M, x ∈ ℝ^d, and a priori probabilities p_i, 1 ≤ i ≤ M. The functions f_i(x), 1 ≤ i ≤ M, are assumed to be unknown, while the p_i are assumed to be known or easily estimable, as is the case of learning character recognition. For a sample-size vector m = [m_1, ..., m_M] where ∑_{i=1}^M m_i = m, denote by ζ_m = {(x_j, y_j)}_{j=1}^m a sample of labeled examples consisting of m_i examples from pattern class i, where the y_j, 1 ≤ j ≤ m, are chosen not necessarily at random from {1, 2, ..., M}, and the corresponding x_j are drawn at random i.i.d.
according to the class-conditional probability density f_{y_j}(x). The expected misclassification error of a classifier c is referred to as the loss of c and is denoted by L(c). It is defined as the probability of misclassification of a randomly drawn x with respect to the underlying mixture probability density function f(x) = ∑_{i=1}^M p_i f_i(x). The loss is commonly represented as L(c) = E 1_{{x : c(x) ≠ y(x)}}, where 1_{{x ∈ A}} is the indicator function of a set A, expectation is taken with respect to the joint probability distribution f_y(x)p(y), where p(y) is a discrete probability distribution taking values p_i over 1 ≤ i ≤ M, while y denotes the label of the class whose distribution f_y(x) was used to draw x. The loss L(c) may also be written as L(c) = ∑_{i=1}^M p_i E_i 1_{{c(x) ≠ i}}, where E_i denotes expectation with respect to f_i(x). The pattern recognition problem is to learn, based on ζ_m, the optimal classifier, also known as the Bayes classifier, which by definition has minimum loss, which we denote by L*.

A multi-category classifier c is represented as a vector c(x) = [c_1(x), ..., c_M(x)] of boolean classifiers, where c_i(x) = 1 if c(x) = i, and c_i(x) = 0 otherwise, 1 ≤ i ≤ M. The loss L(c) of a multi-category classifier c may then be expressed as the average of the losses of its component classifiers, i.e., L(c) = ∑_{i=1}^M p_i L(c_i), where for a boolean classifier c_i the loss is defined as L(c_i) = E_i 1_{{c_i(x) ≠ 1}}. As an estimate of L(c) we define the empirical loss L_m(c) = ∑_{i=1}^M p_i L_{m_i}(c_i), where L_{m_i}(c_i) = (1/m_i) ∑_{j : y_j = i} 1_{{c(x_j) ≠ i}}, which may also be expressed as L_{m_i}(c_i) = (1/m_i) ∑_{j : y_j = i} 1_{{c_i(x_j) ≠ 1}}.

The family of all classifiers is assumed to be decomposed into a multi-structure S = S_1 × S_2 × ... × S_M, where S_i is a nested structure (cf. Vapnik, 1982) of boolean families B_{k_{j_i}}, j_i = 1, 2, ..., for 1 ≤ i ≤ M, i.e., S_1 = B_{k_1}, B_{k_2}, ..., B_{k_{j_1}}, ..., S_2 = B_{k_1}, B_{k_2}, ..., B_{k_{j_2}}, ..., up to S_M = B_{k_1}, B_{k_2}, ...
, B_{k_{j_M}}, ..., where k_{j_i} ∈ ℤ_+ denotes the VC-dimension of B_{k_{j_i}}, and B_{k_{j_i}} ⊆ B_{k_{j_i + 1}}, 1 ≤ i ≤ M. For any fixed positive integer vector j ∈ ℤ_+^M consider the class of vector classifiers H_{k(j)} = B_{k_{j_1}} × B_{k_{j_2}} × ... × B_{k_{j_M}} ≡ H_k, where we take the liberty of dropping the multi-index j and writing k instead of k(j). Define by Q_k the subfamily of H_k consisting of classifiers c that are well-defined, i.e., ones whose components c_i, 1 ≤ i ≤ M, satisfy ∪_{i=1}^M {x : c_i(x) = 1} = ℝ^d and {x : c_i(x) = 1} ∩ {x : c_j(x) = 1} = ∅ for 1 ≤ i ≠ j ≤ M.

From the Vapnik-Chervonenkis theory (cf. Vapnik, 1982, Devroye, Gyorfi & Lugosi, 1996) it follows that the loss of any boolean classifier c_i ∈ B_{k_{j_i}} is, with high confidence, related to its empirical loss as L(c_i) ≤ L_{m_i}(c_i) + ε(m_i, k_{j_i}), where ε(m_i, k_{j_i}) = const √(k_{j_i} ln m_i / m_i), 1 ≤ i ≤ M, and where henceforth we denote by const any constant which does not depend on the relevant variables in the expression. Let the vectors m = [m_1, ..., m_M] and k ≡ k(j) = [k_{j_1}, ..., k_{j_M}] be in ℤ_+^M. Define ε(m, k) = ∑_{i=1}^M p_i ε(m_i, k_{j_i}). It follows that the deviation between the empirical loss and the loss is bounded uniformly over all multi-category classifiers in a class Q_k by ε(m, k). We henceforth denote by c*_k the optimal classifier in Q_k, i.e., c*_k = argmin_{c ∈ Q_k} L(c), and ĉ_k = argmin_{c ∈ Q_k} L_m(c) is the empirical loss minimizer over the class Q_k.

The above implies that the classifier ĉ_k has a loss which is no more than L(c*_k) + ε(m, k). Denote by k* the minimal complexity of a class Q_k which contains the Bayes classifier. We refer to it as the Bayes complexity and henceforth assume k*_i < ∞, 1 ≤ i ≤ M.
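The criterion ε(m, k) above is simple to evaluate numerically. The following is a minimal Python sketch, assuming the unspecified constant is taken as 1; the function names are illustrative, not from the paper:

```python
import math

def eps_term(m_i, k_i, const=1.0):
    """Per-class deviation bound: const * sqrt(k_i * ln(m_i) / m_i)."""
    return const * math.sqrt(k_i * math.log(m_i) / m_i)

def criterion(m, k, p, const=1.0):
    """eps(m, k) = sum_i p_i * eps_term(m_i, k_i): the uniform bound on
    the deviation between loss and empirical loss over the class Q_k."""
    return sum(p_i * eps_term(m_i, k_i, const)
               for p_i, k_i, m_i in zip(p, k, m))
```

Larger per-class VC-dimension k_i or smaller subsample size m_i inflates the bound, which is what later drives the sample-query rule.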
If k* were known, then based on a sample of size m with sample-size vector m = [m_1, ..., m_M], a classifier ĉ_{k*} whose loss is bounded from above by L* + ε(m, k*) may be determined, where L* = L(c*_{k*}) is the Bayes loss. This bound is minimal with respect to k by definition of k*, and we refer to it as the minimal criterion. It can be further minimized by selecting a sample of size vector m* = argmin_{m ∈ ℤ_+^M : ∑_{i=1}^M m_i = m} ε(m, k*). This basically says that more examples should be queried from pattern classes which require more complex discriminating rules within the Bayes classifier. Thus sample-querying via minimization of the minimal criterion makes learning more efficient by tuning the subsample sizes to the complexity of the Bayes classifier. However, the Bayes classifier depends on the underlying probability distributions, which in most interesting scenarios are unknown; thus k* should be assumed unknown. In (Ratsaby, 1997) an incremental learning algorithm, based on Vapnik's structural risk minimization, generates a random complexity sequence k(n), corresponding to a sequence of empirical loss minimizers ĉ_{k(n)} over Q_{k(n)}, which converges to k* with increasing time n for learning problems with a zero Bayes loss. Based on this, a sample-query rule which achieves the same minimization is defined without the need to know k*. We briefly describe the main ideas next.

At any time n, the criterion function is ε(·, k(n)) and is defined over the m-domain ℤ_+^M. A gradient descent step of a fixed size is taken to minimize the current criterion. After a step is taken, a new sample-size vector m(n+1) is obtained, and the difference m(n+1) − m(n) dictates the sample-query at time n, namely, the increment in subsample size for each of the M pattern classes.
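The descent step on the criterion can be sketched as follows. This is an illustrative finite-difference version (the coordinate whose increase most reduces ε(m, k) receives the next Δ examples), a stand-in for the paper's fixed-size gradient step rather than its exact update:

```python
import math

def query_step(m, k, p, delta=1):
    """One sample-query step on the criterion eps(m, k): find the class j*
    whose subsample-size increase yields the largest decrease of the
    criterion (a finite-difference stand-in for the gradient) and return
    j* together with the updated sample-size vector m(n+1)."""
    def eps(mv):
        return sum(p_i * math.sqrt(k_i * math.log(m_i) / m_i)
                   for p_i, k_i, m_i in zip(p, k, mv))
    base = eps(m)
    def drop(j):
        m_try = list(m)
        m_try[j] += delta
        return base - eps(m_try)
    j_star = max(range(len(m)), key=drop)
    m_new = list(m)
    m_new[j_star] += delta
    return j_star, m_new
```

With equal subsample sizes, the class carrying the largest complexity k_i is queried first, which matches the intuition that harder-to-separate classes need more examples.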
With increasing n the vector sequence m(n) gets closer to an optimal path, defined as the set comprised of the solutions to the minimization of ε(m, k*) under all different constraints ∑_{i=1}^M m_i = m, where m runs over the positive integers. Thus for all large n the sample-size vector m(n) is optimal in that it minimizes the minimal criterion ε(·, k*) for the current total sample size m(n). This constitutes the sample-querying procedure of the learning algorithm. The remaining part does empirical loss minimization over the current class Q_{k(n)} and outputs ĉ_{k(n)}. By assumption, since the Bayes classifier is contained in Q_{k*}, it follows that for all large n, the loss L(ĉ_{k(n)}) ≤ L* + min_{m ∈ ℤ_+^M : ∑_{i=1}^M m_i = m(n)} ε(m, k*), which is basically the minimal criterion mentioned above. Thus the algorithm produces a classifier ĉ_{k(n)} with a minimal loss even when the Bayes complexity k* is unknown.

In the next section we consider specific model classes consisting of nearest-neighbor classifiers on which we implement this incremental learning approach.

3 INCREMENTAL NEAREST-NEIGHBOR ALGORITHM

Fix and Hodges, cf. Silverman & Jones (1989), introduced the simple but powerful nearest-neighbor classifier which, based on a labeled training sample {(x_j, y_j)}_{j=1}^m, x_j ∈ ℝ^d, y_j ∈ {1, 2, ..., M}, when given a pattern x, outputs the label y_j corresponding to the example whose x_j is closest to x. Every example in the training sample is used for this decision (we denote such an example as a prototype), thus the empirical loss is zero. The condensed nearest-neighbor algorithm (Hart, 1968) and the reduced nearest neighbor algorithm (Gates, 1972) are procedures which aim at reducing the number of prototypes while maintaining a zero empirical loss.
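The two pruning passes (Hart's condensing followed by Gates' reduction) for the 1-NN rule can be sketched in a few lines of Python. This is a minimal sketch assuming Euclidean distance and a first-match tie-break; the helper names are illustrative:

```python
import math

def nn_label(x, prototypes):
    """1-NN label of point x given prototypes, a list of (point, label)."""
    return min(prototypes, key=lambda pl: math.dist(x, pl[0]))[1]

def condense_reduce(sample):
    """Hart's condensing pass followed by Gates' reduction pass.
    sample: list of (point, label) pairs. Returns the prototype sublist."""
    # Condense: start with the first example; add any misclassified example
    # as a new prototype, repeating until a full pass makes no change.
    protos = [sample[0]]
    changed = True
    while changed:
        changed = False
        for ex in sample:
            if nn_label(ex[0], protos) != ex[1]:
                protos.append(ex)
                changed = True
    # Reduce: drop a prototype if the remaining prototypes still classify
    # it correctly, repeating until a full pass makes no change.
    changed = True
    while changed:
        changed = False
        for ex in list(protos):
            rest = [p for p in protos if p is not ex]
            if rest and nn_label(ex[0], rest) == ex[1]:
                protos.remove(ex)
                changed = True
    return protos
```

On a linearly separable two-class sample this typically keeps only a handful of prototypes near the class boundary while preserving the training-set classifications.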
Thus given a training sample of size m, after running either of these procedures, a nearest neighbor classifier having a zero empirical loss is generated based on s ≤ m prototypes. Learning in this manner may be viewed as a form of empirical loss minimization with a complexity regularization component which puts a penalty proportional to the number of prototypes.

A cell boundary e_{i,j} of the Voronoi diagram (cf. Preparata & Shamos, 1985) corresponding to a multi-category nearest-neighbor classifier c is defined as the (d−1)-dimensional perpendicular-bisector hyperplane between a pair of neighboring prototypes; for d > 2 the adjacencies of the facets of a polyhedron, and hence the cell boundaries, may be computed by resorting to linear programming, cf. Fukuda (1997).

Incremental Nearest Neighbor (INN) Algorithm

Initialization: (Time n = 0)
Let increment-size Δ be a fixed small positive integer. Start with m(0) = [e, ..., e], where e is a small positive integer. Draw ζ_{m(0)} = {ζ_{m_j(0)}}_{j=1}^M, where ζ_{m_j(0)} consists of m_j(0) randomly drawn i.i.d. examples from pattern class j.
While (number of available examples ≥ Δ) Do:
1. Call Procedure CR: ĉ_{k(n)} = CR(ζ_{m(n)}).
2. Call Procedure GQ: m(n+1) = GQ(n).
3. n := n + 1.
End While
// Used up all examples.
Output: NN-classifier ĉ_{k(n)}.

Procedure Condense-Reduce (CR)

Input: Sample ζ_{m(n)} stored in an array A[] of size m(n).
Initialize: Make only the first example A[1] be a prototype.
// Condense
Do:
ChangeOccurred := FALSE.
For i = 1, ..., m(n):
- Classify A[i] based on available prototypes using the NN-Rule.
- If not correct then
  - Let A[i] be a prototype.
  - ChangeOccurred := TRUE.
- End If
End For
While (ChangeOccurred).
// Reduce
Do:
ChangeOccurred := FALSE.
For i = 1, ..., m(n):
- If A[i] is a prototype then classify it using the remaining prototypes by the NN-Rule.
- If correct then
  - Make A[i] be not a prototype.
  - ChangeOccurred := TRUE.
- End If
End For
While (ChangeOccurred).
Run Delaunay-Triangulation. Let k(n) = [k_1, ..., k_M], where k_i denotes the number of Voronoi-cell boundaries associated with the s_i prototypes of class i.
Return (NN-classifier with complexity vector k(n)).

Procedure Greedy-Query (GQ)

Input: Time n.
j*(n) := argmax_{1 ≤ j ≤ M} |∂ε(m, k(n))/∂m_j| evaluated at m = m(n).
Draw: Δ new i.i.d. examples from class j*(n). Denote them by ζ.
Update Sample: ζ_{m_{j*(n)}(n+1)} := ζ_{m_{j*(n)}(n)} ∪ ζ, while ζ_{m_i(n+1)} := ζ_{m_i(n)} for 1 ≤ i ≠ j*(n) ≤ M.
Return: m(n) + Δ e_{j*(n)}, where e_j is an all-zero vector except 1 at the jth element.

3.1 EXPERIMENTAL RESULTS

We ran algorithm INN on several two-dimensional (d = 2) multi-category classification problems and compared its generalization error versus total sample size m with that of batch learning; the latter uses Procedure CR (but not Procedure GQ) with uniform subsample proportions, i.e., m_i = m/M, 1 ≤ i ≤ M.

We ran three classification problems consisting of 4 equiprobable pattern classes with a zero Bayes loss. The generalization curves represent the average of 15 independent learning runs of the empirical error on a fixed-size test set. Each run (both for INN and Batch learning) consists of 80 independent experiments, where each differs by 10 in the sample size used for training, and the maximum sample size is 800. We call an experiment a success if INN results in a lower generalization error than Batch. Let p be the probability of INN beating Batch. We wish to reject the hypothesis H that p = 1/2, which says that INN and Batch are approximately equal in performance.
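The reject level for H can be computed from the binomial tail. A minimal sketch, assuming a one-sided sign test over the experiments (the paper does not state the exact test used, and the win count below is hypothetical):

```python
from math import comb

def binomial_tail(n, s):
    """P(S >= s) for S ~ Binomial(n, 1/2): the one-sided significance
    level of observing s or more INN wins in n paired experiments
    under H: p = 1/2."""
    return sum(comb(n, i) for i in range(s, n + 1)) / 2 ** n
```

For example, a hypothetical count of 55 INN wins out of 80 experiments gives binomial_tail(80, 55) well below 0.01, rejecting H at the 1% level.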
The results are displayed in Figure 1 as a series of pairs, the first picture showing the pattern classes of the specific problem while the second shows the learning curves for the two learning algorithms. Algorithm INN outperformed the simple Batch approach with a reject level of less than 1%, the latter ignoring the inherent Bayes complexity and using an equal subsample size for each of the pattern classes. In contrast, the INN algorithm learns, incrementally over time, which of the classes are harder to separate and queries more from these pattern classes.

References

Cohn D., Atlas L., Ladner R. (1994). Improving Generalization with Active Learning. Machine Learning, Vol. 15, p. 201-221.

Devroye L., Gyorfi L., Lugosi G. (1996). "A Probabilistic Theory of Pattern Recognition", Springer Verlag.

Fukuda K. (1997). Frequently Asked Questions in Geometric Computation. Technical report, Swiss Federal Institute of Technology, Lausanne. Available at ftp://ftp.ifor.ethz.ch/pub/fukuda/reports.

Gates, G. W. (1972). The Reduced Nearest Neighbor Rule. IEEE Trans. Info. Theo., p. 431-433.

Hart P. E. (1968). The Condensed Nearest Neighbor Rule. IEEE Trans. on Info. Theo., Vol. IT-14, No. 3.

O'Rourke J. (1994). "Computational Geometry in C". Cambridge University Press.

Ratsaby, J. (1997). Learning Classification with Sample Queries. Electrical Engineering Dept., Technion, CC PUB #196. Available at URL http://www.ee.technion.ac.il/jer/iandc.ps.

Rivest R. L., Eisenberg B. (1990). On the sample complexity of pac-learning using random and chosen examples. Proceedings of the 1990 Workshop on Computational Learning Theory, p. 154-162, Morgan Kaufmann, San Mateo, CA.

B. W. Silverman and M. C. Jones. E. Fix and J. L.
Hodges (1951): An important contribution to nonparametric discriminant analysis and density estimation: commentary on Fix and Hodges (1951). International Statistical Review, 57(3), p. 233-247, 1989.

Vapnik V. N. (1982). "Estimation of Dependences Based on Empirical Data", Springer-Verlag, Berlin.

[Figure 1 plots omitted: for each of the three problems, a scatter plot of Pattern Classes 1-4 paired with learning curves of generalization error versus total number of examples for Batch and INN.]

Figure 1. Three different Pattern Classification Problems and Learning Curves of the INN-Algorithm compared to Batch Learning.
\n\n\f", "award": [], "sourceid": 1418, "authors": [{"given_name": "Joel", "family_name": "Ratsaby", "institution": null}]}*