{"title": "Neighbourhood Components Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 513, "page_last": 520, "abstract": null, "full_text": "              Neighbourhood Components Analysis\n\n\n\n           Jacob Goldberger, Sam Roweis, Geoff Hinton, Ruslan Salakhutdinov\n                     Department of Computer Science, University of Toronto\n                         {jacob,roweis,hinton,rsalakhu}@cs.toronto.edu\n\n\n\n                                           Abstract\n\n             In this paper we propose a novel method for learning a Mahalanobis\n             distance measure to be used in the KNN classification algorithm. The\n             algorithm directly maximizes a stochastic variant of the leave-one-out\n             KNN score on the training set. It can also learn a low-dimensional lin-\n             ear embedding of labeled data that can be used for data visualization\n             and fast classification. Unlike other methods, our classification model\n             is non-parametric, making no assumptions about the shape of the class\n             distributions or the boundaries between them. The performance of the\n             method is demonstrated on several data sets, both for metric learning and\n             linear dimensionality reduction.\n\n\n\n1    Introduction\n\nNearest neighbor (KNN) is an extremely simple yet surprisingly effective method for clas-\nsification. Its appeal stems from the fact that its decision surfaces are nonlinear, there\nis only a single integer parameter (which is easily tuned with cross-validation), and the\nexpected quality of predictions improves automatically as the amount of training data in-\ncreases. 
These advantages, shared by many non-parametric methods, reflect the fact that\nalthough the final classification machine has quite high capacity (since it accesses the entire\nreservoir of training data at test time), the trivial learning procedure rarely causes overfitting\nitself.\n\nHowever, KNN suffers from two very serious drawbacks. The first is computational, since\nit must store and search through the entire training set in order to classify a single test point.\n(Storage can potentially be reduced by \"editing\" or \"thinning\" the training data; and in low\ndimensional input spaces, the search problem can be mitigated by employing data structures\nsuch as KD-trees or ball-trees [4].) The second is a modeling issue: how should the distance\nmetric used to define the \"nearest\" neighbours of a test point be defined? In this paper, we\nattack both of these difficulties by learning a quadratic distance metric which optimizes the\nexpected leave-one-out classification error on the training data when used with a stochastic\nneighbour selection rule. Furthermore, we can force the learned distance metric to be low\nrank, thus substantially reducing storage and search costs at test time.\n\n\n2    Stochastic Nearest Neighbours for Distance Metric Learning\n\nWe begin with a labeled data set consisting of n real-valued input vectors x_1, ..., x_n in R^D\nand corresponding class labels c_1, ..., c_n. We want to find a distance metric that maximizes\nthe performance of nearest neighbour classification. Ideally, we would like to optimize\nperformance on future test data, but since we do not know the true data distribution we\ninstead attempt to optimize leave-one-out (LOO) performance on the training data.\n\nIn what follows, we restrict ourselves to learning Mahalanobis (quadratic) distance metrics,\nwhich can always be represented by symmetric positive semi-definite matrices. 
We esti-\nmate such metrics through their inverse square roots, by learning a linear transformation\nof the input space such that in the transformed space, KNN performs well. If we denote\nthe transformation by a matrix A we are effectively learning a metric Q = A^T A such that\nd(x, y) = (x - y)^T Q (x - y) = (Ax - Ay)^T (Ax - Ay).\n\nThe actual leave-one-out classification error of KNN is quite a discontinuous function of the\ntransformation A, since an infinitesimal change in A may change the neighbour graph and\nthus affect LOO classification performance by a finite amount. Instead, we adopt a more\nwell behaved measure of nearest neighbour performance, by introducing a differentiable\ncost function based on stochastic (\"soft\") neighbour assignments in the transformed space.\nIn particular, each point i selects another point j as its neighbour with some probability p_{ij},\nand inherits its class label from the point it selects. We define the p_{ij} using a softmax over\nEuclidean distances in the transformed space:\n\n    p_{ij} = \\frac{\\exp(-\\|Ax_i - Ax_j\\|^2)}{\\sum_{k \\neq i} \\exp(-\\|Ax_i - Ax_k\\|^2)}, \\qquad p_{ii} = 0    (1)\n\nUnder this stochastic selection rule, we can compute the probability p_i that point i will be\ncorrectly classified (denote the set of points in the same class as i by C_i = {j | c_i = c_j}):\n\n    p_i = \\sum_{j \\in C_i} p_{ij}    (2)\n\nThe objective we maximize is the expected number of points correctly classified under this\nscheme:\n\n    f(A) = \\sum_i \\sum_{j \\in C_i} p_{ij} = \\sum_i p_i    (3)\n\nDifferentiating f with respect to the transformation matrix A yields a gradient rule which\nwe can use for learning (denote x_{ij} = x_i - x_j):\n\n    \\frac{\\partial f}{\\partial A} = -2A \\sum_i \\sum_{j \\in C_i} p_{ij} \\left( x_{ij} x_{ij}^T - \\sum_k p_{ik} x_{ik} x_{ik}^T \\right)    (4)\n\nReordering the terms we obtain a more efficiently computed expression:\n\n    \\frac{\\partial f}{\\partial A} = 2A \\sum_i \\left( p_i \\sum_k p_{ik} x_{ik} x_{ik}^T - \\sum_{j \\in C_i} p_{ij} x_{ij} x_{ij}^T \\right)    (5)\n\nOur algorithm, which we dub Neighbourhood Components Analysis (NCA), is extremely\nsimple: maximize the above objective (3) using a gradient based optimizer such as delta-\nbar-delta or conjugate gradients. Of course, since the cost function above is not convex,\nsome care must be taken to avoid local maxima during training. However, unlike many\nother objective functions (where good optima are not necessarily deep but rather broad) it\nhas been our experience that the larger we can drive f during training the better our test\nperformance will be. In other words, we have never observed an \"overtraining\" effect.\n\nNotice that by learning the overall scale of A as well as the relative directions of its rows\nwe are also effectively learning a real-valued estimate of the optimal number of neighbours\n(K). This estimate appears as the effective perplexity of the distributions p_{ij}. 
If the learning\nprocedure wants to reduce the effective perplexity (consult fewer neighbours) it can scale\nup A uniformly; similarly by scaling down all the entries in A it can increase the perplexity\nand effectively average over more neighbours during the stochastic selection.\n\nMaximizing the objective function f(A) is equivalent to minimizing the L1 norm between\nthe true class distribution (having probability one on the true class) and the stochastic class\ndistribution induced by p_{ij} via A. A natural alternative distance is the KL-divergence which\ninduces the following objective function:\n\n    g(A) = \\sum_i \\log \\left( \\sum_{j \\in C_i} p_{ij} \\right) = \\sum_i \\log(p_i)    (6)\n\nMaximizing this objective would correspond to maximizing the probability of obtaining a\nperfect (error free) classification of the entire training set. The gradient of g(A) is even\nsimpler than that of f(A):\n\n    \\frac{\\partial g}{\\partial A} = 2A \\sum_i \\left( \\sum_k p_{ik} x_{ik} x_{ik}^T - \\frac{\\sum_{j \\in C_i} p_{ij} x_{ij} x_{ij}^T}{\\sum_{j \\in C_i} p_{ij}} \\right)    (7)\n\nWe have experimented with optimizing this cost function as well, and found both the trans-\nformations learned and the performance results on training and testing data to be very\nsimilar to those obtained with the original cost function.\n\nTo speed up the gradient computation, the sums that appear in equations (5) and (7) over\nthe data points and over the neighbours of each point can be truncated (one because we\ncan do stochastic gradient rather than exact gradient and the other because p_{ij} drops 
off\nquickly).\n\n\n3    Low Rank Distance Metrics and Nonsquare Projection\n\nOften it is useful to reduce the dimensionality of input data, either for computational sav-\nings or for regularization of a subsequent learning algorithm. Linear dimensionality re-\nduction techniques (which apply a linear operator to the original data in order to arrive\nat the reduced representation) are popular because they are both fast and themselves rela-\ntively immune to overfitting. Because they implement only affine maps, linear projections\nalso preserve some essential topology of the original data. Many approaches exist for lin-\near dimensionality reduction, ranging from purely unsupervised approaches (such as factor\nanalysis, principal components analysis and independent components analysis) to methods\nwhich make use of class labels in addition to input features such as linear discriminant\nanalysis (LDA) [3] possibly combined with relevant components analysis (RCA) [1].\n\nBy restricting A to be a nonsquare matrix of size d × D, NCA can also do linear dimension-\nality reduction. In this case, the learned metric will be low rank, and the transformed inputs\nwill lie in R^d. (Since the transformation is linear, without loss of generality we only con-\nsider the case d ≤ D.) By making such a restriction, we can potentially reap many further\nbenefits beyond the already convenient method for learning a KNN distance metric. In par-\nticular, by choosing d ≪ D we can vastly reduce the storage and search-time requirements\nof KNN. Selecting d = 2 or d = 3 we can also compute useful low dimensional visual-\nizations on labeled datasets, using only a linear projection. The algorithm is exactly the\nsame: maximize the objective (3) using a gradient based optimizer on a nonsquare A. Our method\nrequires no matrix inversions and assumes no parametric model (Gaussian or otherwise)\nfor the class distributions or the boundaries between them. 
For now, the dimensionality of\nthe reduced representation (the number of rows in A) must be set by the user.\n\nBy using a highly rectangular A so that d ≪ D, we can significantly reduce the com-\nputational load of KNN at the expense of restricting the allowable metrics to be those of\nrank at most d. To achieve this, we apply the NCA learning algorithm to find the optimal\ntransformation A, and then we store only the projections of the training points y_n = Ax_n\n(as well as their labels). At test time, we classify a new point x_test by first computing its\nprojection y_test = Ax_test and then doing KNN classification on y_test using the y_n and\na simple Euclidean metric. If d is relatively small (say less than 10), we can preprocess\nthe y_n by building a KD-tree or a ball-tree to further increase the speed of search at test\ntime. The storage requirements of this method are O(dN) + Dd compared with O(DN)\nfor KNN in the original input space.\n\n\n4    Experiments in Metric Learning and Dimensionality Reduction\n\n\nWe have evaluated the NCA algorithm against standard distance metrics for KNN and other\nmethods for linear dimensionality reduction. In our experiments, we have used 6 data sets\n(5 from the UC Irvine repository). We compared the NCA transformation obtained from\noptimizing f (for square A) on the training set with the default Euclidean distance A = I,\nthe \"whitening\" transformation A = Σ^{-1/2} (where Σ is the sample data covariance matrix),\nand the RCA [1] transformation A = Σ_w^{-1/2} (where Σ_w is the average of the within-class\ncovariance matrices). 
We also investigated the behaviour of NCA when A is restricted to\nbe diagonal, allowing only axis aligned Mahalanobis measures.\n\nFigure 1 shows that the training and (more importantly) testing performance of NCA is\nconsistently the same as or better than that of other Mahalanobis distance measures for\nKNN, despite the relative simplicity of the NCA objective function and the fact that the\ndistance metric being learned is nothing more than a positive definite matrix A^T A.\n\nWe have also investigated the use of linear dimensionality reduction using NCA (with non-\nsquare A) for visualization as well as reduced-complexity classification on several datasets.\nIn figure 2 we show 4 examples of 2-D visualization. First, we generated a synthetic three-\ndimensional dataset (shown in the top row of figure 2) which consists of 5 classes (shown by\ndifferent colors). In two dimensions, the classes are distributed in concentric circles, while\nthe third dimension is just Gaussian noise, uncorrelated with the other dimensions or the\nclass label. If the noise variance is large enough, the projection found by PCA is forced\nto include the noise (as shown on the top left of figure 2). (A full rank Euclidean metric\nwould also be misled by this dimension.) The classes are not convex and cannot be lin-\nearly separated, hence the results obtained from LDA will be inappropriate (as shown in\nfigure 2). In contrast, NCA adaptively finds the best projection without assuming any para-\nmetric structure in the low dimensional representation. We have also applied NCA to the\nUCI \"wine\" dataset, which consists of 178 points labeled into 3 classes and to a database\nof gray-scale images of faces consisting of 18 classes (each a separate individual) and 560\ndimensions (image size is 20 × 28). The face dataset consists of 1800 images (100 for each\nperson). 
Finally, we applied our algorithm to a subset of the USPS dataset of handwritten\ndigit images, consisting of the first five digit classes (\"one\" through \"five\"). The grayscale\nimages were downsampled to 8 × 8 pixel resolution resulting in 64 dimensions.\n\nAs can be seen in figure 2, when a two-dimensional projection is used, the classes are con-\nsistently much better separated by the NCA transformation than by either PCA (which is\nunsupervised) or LDA (which has access to the class labels). Of course, the NCA transfor-\nmation is still only a linear projection, just optimized with a cost function which explicitly\nencourages local separation. To further quantify the projection results we can apply a\nnearest-neighbor classification in the projected space. Using the same projection learned\nat training time, we project the training set and all future test points and perform KNN in\nthe low-dimensional space using the Euclidean measure. The results under the PCA, LDA,\nLDA followed by RCA and NCA transformations (using K=1) appear in figure 1. 
\n\n[Figure 1 plot omitted. Panels: distance metric learning - training; distance metric learning - testing; rank 2 transformation - training; rank 2 transformation - testing. Horizontal axis: datasets bal, ion, iris, wine, hous, digit. Vertical axis: classification accuracy. Legends: NCA, diag-NCA, RCA, whitened, Euclidean (top panels); NCA, LDA+RCA, LDA, PCA (bottom panels).]\n\nFigure 1: KNN classification accuracy (left train, right test) on UCI datasets balance, ionosphere,\niris, wine and housing and on the USPS handwritten digits. Results are averages\nover 40 realizations of splitting each dataset into training (70%) and testing (30%) subsets\n(for USPS 200 images for each of the 10 digit classes were used for training and 500 for\ntesting). Top panels show distance metric learning (square A) and bottom panels show\nlinear dimensionality reduction down to d = 2.\n\nThe NCA projection consistently gives superior performance in this highly constrained low-\nrank KNN setting. 
In summary, we have found that when labeled data is available, NCA\nperforms better both in terms of classification performance in the projected representation\nand in terms of visualization of class separation as compared to the standard methods of\nPCA and LDA.\n\n[Figure 2 images omitted. Columns: PCA, LDA, NCA.]\n\nFigure 2: Dataset visualization results of PCA, LDA and NCA applied to (from top) the\n\"concentric rings\", \"wine\", \"faces\" and \"digits\" datasets. The data are reduced from their\noriginal dimensionalities (D=3, D=13, D=560, D=256 respectively) to the d=2 dimensions\nshown.\n\n\n5    Extensions to Continuous Labels and Semi-Supervised Learning\n\nFigure 3: The two dimensional outputs of the neural network on a set of test cases. On the left, each\npoint is shown using a line segment that has the same orientation as the input face. On the right, the\nsame points are shown again with the size of the circle representing the size of the face.\n\nAlthough we have focused here on discrete classes, linear transformations and fully su-\npervised learning, many extensions of this basic idea are possible. Clearly, a nonlinear\ntransformation function A(·) could be learned using any architecture (such as a multilayer\nperceptron) trainable by gradient methods. Furthermore, it is possible to extend the clas-\nsification framework presented above to the case of a real valued (continuous) supervision\nsignal by defining the set of \"correct matches\" C_i for point i to be those points j having\nsimilar (continuous) targets. This naturally leads to the idea of \"soft matches\", in which\nthe objective function becomes a sum over all pairs, each weighted by their agreement ac-\ncording to the targets. Learning under such an objective can still proceed even in settings\nwhere the targets are not explicitly provided as long as information identifying close pairs\nis available. 
Such semi-supervised tasks often arise in domains with strong spatial or tem-\nporal continuity constraints on the supervision, e.g. in a video of a person's face we may\nassume that pose and expression vary slowly in time even if no individual frames are ever\nlabeled explicitly with numerical pose or expression values.\n\nTo illustrate this, we generate pairs of faces in the following way: First we choose two faces\nat random from the FERET-B dataset (5000 isolated faces that have a standard orientation\nand scale). The first face is rotated by an angle uniformly distributed between ±45° and\nscaled to have a height uniformly distributed between 25 and 35 pixels. The second face\n(which is of a different person) is given the same rotation and scaling but with Gaussian\nnoise of 1.22° and 1.5 pixels. The pair is given a weight, w_{ab}, which is the probability\ndensity of the added noise divided by its maximum possible value. We then trained a neural\nnetwork with one hidden layer of 100 logistic units to map from the 35 × 35 pixel intensities\nof a face to a point, y, in a 2-D output space. Backpropagation was used to minimize the\ncost function in Eq. 8 which encourages the faces in a pair to be placed close together:\n\n    Cost = -\\sum_{\\mathrm{pair}(a,b)} w_{ab} \\log \\left( \\frac{\\exp(-\\|y_a - y_b\\|^2)}{\\sum_{c,d} \\exp(-\\|y_c - y_d\\|^2)} \\right)    (8)\n\nwhere c and d are indices over all of the faces, not just the ones\nthat form a pair. Four example faces are shown to the right; hori-\nzontally the pairs agree and vertically they do not. 
Figure 3 above\nshows that the feedforward neural network discovered polar coor-\ndinates without the user having to decide how to represent scale\nand orientation in the output space.\n\n\n6    Relationships to Other Methods and Conclusions\n\nSeveral papers recently addressed the problem of learning Mahalanobis distance functions\ngiven labeled data or at least side-information in the form of equivalence constraints. Two\nrelated methods are RCA [1] and a convex optimization based algorithm [7]. RCA is\nimplicitly assuming a Gaussian distribution for each class (so it can be described using\nonly the first two moments of the class-conditional distribution). Xing et al. attempt to\nfind a transformation which minimizes all pairwise squared distances between points in the\nsame class; this implicitly assumes that classes form a single compact connected set. For\nhighly multimodal class distributions this cost function will be severely penalized. Lowe [6]\nproposed a method similar to ours but used a more limited idea for learning a nearest\nneighbour distance metric. In his approach, the metric is constrained to be diagonal (as\nwell, it is somewhat redundantly parameterized), and the objective function corresponds to\nthe average squared error between the true class distribution and the predicted distribution,\nwhich is not entirely appropriate in a more probabilistic setting.\n\nIn parallel there has been work on learning low rank transformations for fast classification\nand visualization. The classic LDA algorithm [3] is optimal if all class distributions are\nGaussian with a single shared covariance; this assumption, however, is rarely true. LDA\nalso suffers from a small sample size problem when dealing with high-dimensional data\nwhen the within-class scatter matrix is nearly singular [2]. Recent variants of LDA (e.g.\n[5], [2]) make the transformation more robust to outliers and to numerical instability when\nnot enough datapoints are available. 
(This problem does not exist in our method since there\nis no need for a matrix inversion.)\n\nIn general, there are two classes of regularization assumption that are common in linear\nmethods for classification. The first is a strong parametric assumption about the structure of\nthe class distributions (typically enforcing connected or even convex structure); the second\nis an assumption about the decision boundary (typically enforcing a hyperplane). Our\nmethod makes neither of these assumptions, relying instead on the strong regularization\nimposed by restricting ourselves to a linear transformation of the original inputs.\n\nFuture research on the NCA model will investigate using local estimates of K as derived\nfrom the entropy of the distributions p_{ij}; the possible use of a stochastic classification rule\nat test time; and more systematic comparisons between the objective functions f and g.\n\nTo conclude, we have introduced a novel non-parametric learning method -- NCA -- that\nhandles the tasks of distance learning and dimensionality reduction in a unified manner.\nAlthough much recent effort has focused on non-linear methods, we feel that linear em-\nbedding has still not fully fulfilled its potential for either visualization or learning.\n\n\nAcknowledgments\n\nThanks to David Heckerman and Paul Viola for suggesting that we investigate the alterna-\ntive cost g(A) and the case of diagonal A.\n\n\nReferences\n\n[1] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equiva-\n    lence relations. In International Conference on Machine Learning, 2003.\n\n[2] L. Chen, H. Liao, M. Ko, J. Lin, and G. Yu. A new LDA-based face recognition system which can\n    solve the small sample size problem. In Pattern Recognition, pages 1713-1726, 2000.\n\n[3] R. A. Fisher. The use of multiple measurements in taxonomic problems. In Annals of Eugenics,\n    pages 179-188, 1936.\n\n[4] J. Friedman, J. Bentley, and R. Finkel. 
An algorithm for finding best matches in logarithmic\n    expected time. In ACM, 1977.\n\n[5] Y. Koren and L. Carmel. Robust linear dimensionality reduction. In IEEE Trans. Vis. and Comp.\n    Graph., pages 459-470, 2004.\n\n[6] D. Lowe. Similarity metric learning for a variable kernel classifier. In Neural Computation,\n    pages 72-85, 1995.\n\n[7] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to\n    clustering with side-information. In Proc. of Neural Information Processing Systems, 2003.\n", "award": [], "sourceid": 2566, "authors": [{"given_name": "Jacob", "family_name": "Goldberger", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}