K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

Pascal Vincent and Yoshua Bengio
Dept. IRO, Université de Montréal
C.P. 6128, Montreal, Qc, H3C 3J7, Canada
{vincentp,bengioy}@iro.umontreal.ca
http://www.iro.umontreal.ca/~vincentp

In Advances in Neural Information Processing Systems, pp. 985-992.

Abstract

Guided by an initial idea of building a complex (non-linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than from prior knowledge. Experimental results on real-world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well as or better than SVMs.

1 Motivation

The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation¹: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes.
While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces.

Non-linear SVMs usually use the "kernel trick" to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The "trick" is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite-dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space is meaningful in general (see [16]).

¹ For the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes.

A different approach is to try to build a non-linear decision surface with maximal distance to the closest data point, as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin within this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point.
Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this!

So why does 1NN in practice often perform worse than SVMs? One typical explanation is that it has too much capacity compared to SVMs, i.e. that the class of functions it can produce is too rich. But, considering that it has infinite capacity (VC-dimension), 1NN still performs quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding.

2 Fixing a broken Nearest Neighbor algorithm

2.1 Setting and definitions

The setting is that of a classical classification problem in R^n (the input space). We are given a training set S of l points {x_1, ..., x_l}, x_i in R^n, and their corresponding class labels {c_1, ..., c_l}, c_i in {1, ..., N_c}, where N_c is the number of different classes. The (x_i, c_i) pairs are assumed to be samples drawn from an unknown distribution P(X, C). Barring duplicate inputs, the class labels associated to each x_i define a partition of S: let S_c = {x_i in S | c_i = c}.

The problem is to find a decision function f: R^n -> {1, ..., N_c} that will generalize well on new points drawn from P(X, C), i.e. that minimizes the expected classification error E_{X,C}[1_{f(X) != C}], where 1_{.} denotes the indicator function, whose value is 1 if its argument is true and 0 otherwise.

In the previous and following discussion, we often refer to the concept of decision surface, also known as decision boundary. The function f corresponding to a given algorithm defines, for any class c, two regions of the input space: the region R_c = {x in R^n | f(x) = c} and its complement. The decision surface for class c is the "boundary" between those two regions, i.e. the contour of R_c, and can be seen as an (n - 1)-dimensional manifold (a "surface" in R^n), possibly made of several disconnected components.

When we mention a test point, we mean a point x in R^n that does not belong to the training set S and for which the algorithm is to decide on a class f(x).

By distance, we mean the usual Euclidean distance in input space R^n. The distance between two points a and b will be written d(a, b) or alternatively ||a - b||. The distance between a single point x and a set of points S is defined as d(x, S) = min_{x_i in S} d(x, x_i).
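In this notation, the plain 1NN rule of Section 1 simply labels x with the class of the training point minimizing d(x, x_i). A minimal NumPy sketch (array names are illustrative, not from the paper):

```python
import numpy as np

def nn_classify(x, X, y):
    """Plain 1NN: label x with the class of its nearest training point,
    i.e. the class c minimizing the point-to-set distance d(x, S_c)."""
    dists = np.linalg.norm(X - x, axis=1)  # Euclidean distances d(x, x_i)
    return y[np.argmin(dists)]
```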
2.3 The basic HKNN algorithm

Given a test point x and a class c, let N_1, ..., N_K denote the K nearest neighbors of x within S_c. These K points support a "local hyperplane" (an affine subspace), which can be expressed as

LH_c^K(x) = { p in R^n | p = sum_{k=1}^K a_k N_k, with sum_{k=1}^K a_k = 1 }.   (1)

Another way to define this hyperplane, which gets rid of the constraint sum_k a_k = 1, is to take a reference point within the hyperplane as an origin, for instance the centroid Nbar = (1/K) sum_{k=1}^K N_k:

LH_c^K(x) = { p in R^n | p = Nbar + sum_{k=1}^K a_k V_k },   (2)

where V_k = N_k - Nbar. Note that the V_k are not linearly independent (they sum to zero), so our "local hyperplanes" can have fewer than K - 1 dimensions.

Our modified nearest neighbor algorithm then associates a test point x to the class c whose local hyperplane is closest:

f(x) = argmin_c d(x, LH_c^K(x)),  where  d(x, LH_c^K(x)) = min_a || x - Nbar - sum_{k=1}^K a_k V_k ||.   (3)

Computing this distance, for each class c, amounts to solving the linear system (V'V) a = V'(x - Nbar), where V is the n x K matrix whose columns are the V_k. If the V_k are linearly dependent the system is underdetermined, but any solution will do. Alternatively, we could use one of the K neighbors as the reference point and drop the corresponding a_k from the system so that it has a unique solution, but this formulation with the centroid will prove useful later. We will refer to this algorithm as HKNN.

2.4 Links with other paradigms

The proposed HKNN algorithm is very similar in spirit to the Tangent Distance Algorithm [13]. LH_c^K(x) can be seen as a tangent hyperplane representing a set of local directions of transformation (any linear combination of the V_k vectors) that do not affect the class identity. These are invariances.
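The local hyperplane distance and the HKNN decision rule can be sketched as follows, using a least-squares solver for the (possibly underdetermined) linear system; function and variable names are illustrative, not from the paper:

```python
import numpy as np

def local_hyperplane_distance(x, neighbors):
    """Distance from x to the affine hull of the rows of `neighbors`,
    via the centroid parametrization: min over a of ||x - Nbar - V a||."""
    Nbar = neighbors.mean(axis=0)
    V = (neighbors - Nbar).T                          # columns are V_k = N_k - Nbar
    a, *_ = np.linalg.lstsq(V, x - Nbar, rcond=None)  # underdetermined is fine: any solution will do
    return np.linalg.norm(x - Nbar - V @ a)

def hknn_classify(x, X, y, K):
    """HKNN: choose the class whose K-local hyperplane is closest to x."""
    best_c, best_d = None, np.inf
    for c in np.unique(y):
        Xc = X[y == c]
        # K nearest neighbors of x within class c
        nearest = Xc[np.argsort(np.linalg.norm(Xc - x, axis=1))[:K]]
        d = local_hyperplane_distance(x, nearest)
        if d < best_d:
            best_c, best_d = c, d
    return best_c
```

Note that the per-class work is a small K-dimensional solve; as the paper observes, the cost is dominated by the nearest-neighbor search within each class.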
The main difference is that in HKNN these invariances are inferred directly from the local neighborhood in the training set, whereas in Tangent Distance they are based on prior knowledge. It should be interesting (and relatively easy) to combine both approaches for improved performance when prior knowledge is available.

Previous work on nearest-neighbor variations based on other locally-defined metrics can be found in [12, 9, 6, 7], and is very much related to the more general paradigm of Local Learning Algorithms [3, 1, 10].

We should also mention close similarities between our approach and the recently proposed Locally Linear Embedding [11] method for dimensionality reduction.

The idea of fantasizing points around the training points in order to define the decision surface is also very close to methods based on estimating the class-conditional input density [14, 4].

Besides, it is interesting to look at HKNN from a different, less geometrical angle: it can be understood as choosing the class that achieves the best reconstruction (the smallest reconstruction error) of the test pattern through a linear combination of K particular prototypes of that class (the K neighbors). From this point of view, the algorithm is very similar to the Nearest Feature Line (NFL) [8] method. They differ in the fact that NFL considers all pairs of points of a class for its search, thus looking at many (on the order of l^2) lines (i.e. one-dimensional affine subspaces, each spanned by a pair of points), rather than at a single (K - 1)-dimensional one.

3 Fixing the basic HKNN algorithm

3.1 Problem arising for large K

One problem with the basic HKNN algorithm, as previously described, arises as we increase the value of K, i.e. the number of points considered in the neighborhood of the test point. In a typical high-dimensional setting, exact colinearities between input patterns are rare, which means that as soon as K > n, any pattern of R^n (including nonsensical ones) can be produced by a linear combination of the K neighbors. Moreover, the "actual" dimensionality of the manifold may be much less than K. This is due to "near-colinearities" producing directions, associated to small eigenvalues of the covariance matrix of the neighbors, that are but noise; these can lead the algorithm to mistake noise directions for "invariances", and may hurt its performance even for smaller values of K. Another related issue is that the linear approximation of the class manifold by a hyperplane is valid only locally, so we might want to restrict the "fantasizing" of class members to a smaller region of the hyperplane. We considered two ways of dealing with these problems.²

² A third interesting avenue, which we did not have time to explore, would be to keep only the most relevant principal components of V, ignoring those corresponding to small eigenvalues.

3.2 The convex hull solution

One way to avoid the above-mentioned problems is to restrict ourselves to considering the convex hull of the neighbors, rather than the whole hyperplane they support (of which the convex hull is a subset). This corresponds to adding the constraint a_k >= 0 to equation (1). Unlike the problem of computing the distance to the hyperplane, the distance to the convex hull cannot be found by solving a simple linear system, but typically requires solving a quadratic programming problem (very similar to the one of SVMs). While this is more complex to implement, it should be mentioned that the problems to be solved are of a relatively small dimension of order K, and that the time of the whole algorithm will very likely still be dominated by the search of the K nearest neighbors within each class. This algorithm will be referred to as the K-local Convex Distance Nearest Neighbor Algorithm (CKNN in short).

3.3 The "weight decay" penalty solution

This consists in incorporating into the distance minimization a penalty term on large values of a (i.e. on solutions far from the centroid):

d_lambda(x, LH_c^K(x))^2 = min_a || x - Nbar - sum_{k=1}^K a_k V_k ||^2 + lambda sum_{k=1}^K a_k^2.
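A minimal sketch of this penalized variant, assuming the quadratic "weight decay" penalty lambda * sum_k a_k^2 above: the regularized normal equations (V'V + lambda I) a = V'(x - Nbar) are then always well-posed. The function name and the default lambda are illustrative:

```python
import numpy as np

def penalized_hyperplane_distance(x, neighbors, lam=1.0):
    """'Weight decay' HKNN variant: minimize
    ||x - Nbar - V a||^2 + lam * ||a||^2 over a, which keeps the
    solution close to the centroid. Returns the square root of the
    minimized objective."""
    Nbar = neighbors.mean(axis=0)
    V = (neighbors - Nbar).T                  # columns are V_k = N_k - Nbar
    K = V.shape[1]
    # Normal equations of the penalized least-squares problem:
    # (V'V + lam*I) a = V'(x - Nbar); lam > 0 makes the matrix invertible.
    a = np.linalg.solve(V.T @ V + lam * np.eye(K), V.T @ (x - Nbar))
    residual = x - Nbar - V @ a
    return np.sqrt(residual @ residual + lam * (a @ a))
```

Larger lambda shrinks the coefficients toward zero, i.e. restricts the "fantasized" class members to a region of the hyperplane near the centroid.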