{"title": "Generative Local Metric Learning for Nearest Neighbor Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1822, "page_last": 1830, "abstract": "We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models.", "full_text": "Generative Local Metric Learning for\n\nNearest Neighbor Classi\ufb01cation\n\nYung-Kyun Noh1,2\n\nByoung-Tak Zhang2\n\nDaniel D. Lee1\n\n1GRASP Lab, University of Pennsylvania, Philadelphia, PA 19104, USA\n2Biointelligence Lab, Seoul National University, Seoul 151-742, Korea\n\nnohyung@seas.upenn.edu, btzhang@snu.ac.kr, ddlee@seas.upenn.edu\n\nAbstract\n\nWe consider the problem of learning a local metric to enhance the performance of\nnearest neighbor classi\ufb01cation. Conventional metric learning methods attempt to\nseparate data distributions in a purely discriminative manner; here we show how\nto take advantage of information from parametric generative models. We focus\non the bias in the information-theoretic error arising from \ufb01nite sampling effects,\nand \ufb01nd an appropriate local metric that maximally reduces the bias based upon\nknowledge from generative models. 
As a byproduct, the asymptotic theoretical\nanalysis in this work relates metric learning with dimensionality reduction, which\nwas not understood from previous discriminative approaches. Empirical exper-\niments show that this learned local metric enhances the discriminative nearest\nneighbor performance on various datasets using simple class conditional gener-\native models.\n\n1\n\nIntroduction\n\nThe classic dichotomy between generative and discriminative methods for classi\ufb01cation in machine\nlearning can be clearly seen in two distinct performance regimes as the number of training examples\nis varied [12, 18]. Generative models\u2014which employ models \ufb01rst to \ufb01nd the underlying distribu-\ntion p(x|y) for discrete class label y and input data x \u2208 RD\u2014typically outperform discriminative\nmethods when the number of training examples is small, due to smaller variance in the generative\nmodels which compensates for any possible bias in the models. On the other hand, more \ufb02exible\ndiscriminative methods\u2014which are interested in a direct measure of p(y|x)\u2014can accurately cap-\nture the true posterior structure p(y|x) when the number of training examples is large. Thus, given\nenough training examples, the best performing classi\ufb01cation algorithms have typically employed\npurely discriminative methods.\nHowever, due to the curse of dimensionality when D is large, the number of data examples may\nnot be suf\ufb01cient for discriminative methods to approach their asymptotic performance limits. In this\ncase, it may be possible to improve discriminative methods by exploiting knowledge of generative\nmodels. There has been recent work on hybrid models showing some improvement [14, 15, 20], but\nmainly the generative models have been improved through the discriminative formulation. 
In this\nwork, we consider a very simple discriminative classi\ufb01er, the nearest neighbor classi\ufb01er, where the\nclass label of an unknown datum is chosen according to the class label of the nearest known datum.\nThe choice of a metric to de\ufb01ne nearest is then crucial, and we show how this metric can be locally\nde\ufb01ned based upon knowledge of generative models.\nPrevious work on metric learning for nearest neighbor classi\ufb01cation has focused on a purely discrim-\ninative approach. The metric is parameterized by a global quadratic form which is then optimized\non the training data to maximize pairwise separation between dissimilar points, and to minimize the\npairwise separation of similar points [3, 9, 10, 21, 26]. Here, we show how the problem of learning\n\n1\n\n\fa metric can be related to reducing the theoretical bias of the nearest neighbor classi\ufb01er. Though\nthe performance of the nearest neighbor classi\ufb01er has good theoretical guarantees in the limit of\nin\ufb01nite data, \ufb01nite sampling effects can introduce a bias which can be minimized by the choice of an\nappropriate metric. By directly trying to reduce this bias at each point, we will see the classi\ufb01cation\nerror is signi\ufb01cantly reduced compared to the global class-separating metric.\nWe show how to choose such a metric by analyzing the probability distribution on nearest neighbors,\nprovided we know the underlying generative models. Analyses of nearest neighbor distributions\nhave been discussed before [11, 19, 24, 25], but we take a simpler approach and derive the metric-\ndependent term in the bias directly. We then show that minimizing this bias results in a semi-de\ufb01nite\nprogramming optimization that can be solved analytically, resulting in a locally optimal metric. In\nrelated work, Fukunaga et al. 
considered optimizing a metric function in a generative setting [7, 8], but the resulting derivation was inaccurate and does not improve nearest neighbor performance. Jaakkola et al. first showed how a generative model can be used to derive a special kernel, called the Fisher kernel [12], which can be related to a distance function. Unfortunately, the Fisher kernel is quite generic, and need not necessarily improve nearest neighbor performance.

Our generative approach also provides a theoretical relationship between metric learning and the dimensionality reduction problem. In order to find better projections for classification, research on dimensionality reduction using labeled training data has utilized information-theoretic measures such as the Bhattacharyya divergence [6] and mutual information [2, 17]. We show how these problems can be connected with metric learning for nearest neighbor classification within the general framework of F-divergences. We also explain how dimensionality reduction is entirely different from metric learning in the generative approach, whereas in the discriminative setting it is simply a special case of metric learning where particular directions are shrunk to zero.

The remainder of the paper is organized as follows. In section 2, we motivate our approach by comparing the metric dependency of the discriminative and generative approaches to nearest neighbor classification. After deriving the bias due to finite sampling in section 3, we show in section 4 how minimizing this bias results in a local metric learning algorithm. In section 5, we explain how metric learning should be understood from a generative perspective, in particular its relationship with dimensionality reduction. Experiments on various datasets are presented in section 6, comparing our results with other well-known algorithms. 
Finally, in section 7, we conclude with a discussion of future work and possible extensions.

2 Metric and Nearest Neighbor Classification

Recent work holds that determining a good metric is crucial for nearest neighbor classification. However, traditional generative analyses of this problem have simply ignored the metric issue, with good reason, as we will see in section 2.2. In this section, we explain the apparent contradiction between these two views, and briefly describe how resolving it leads to a metric learning method that is both theoretically and practically plausible.

2.1 Metric Learning for Nearest Neighbor Classification

A nearest neighbor classifier determines the label of an unknown datum according to the label of its nearest neighbor. In general, the meaning of the term nearest is defined along with a notion of distance in data space. One common choice is the Mahalanobis distance with a positive definite square matrix A ∈ R^{D×D}, where D is the dimensionality of the data space. In this case, the distance between two points x1 and x2 is defined as

d(x1, x2) = √( (x1 − x2)^T A (x1 − x2) ),   (1)

and the nearest datum xNN is the one having minimal distance to the test point among the labeled training data {x_i}, i = 1, ..., N.

In this classification task, the results are highly dependent on the choice of the matrix A, and prior work has attempted to improve performance through a better choice of A. This recent work has assumed the following common heuristic: the training data in different classes should be separated in a new metric space. 
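As a concrete illustration of Eq. (1), a minimal Mahalanobis nearest neighbor rule can be sketched as follows (our own sketch, not code from the paper; all names are illustrative):

```python
import numpy as np

def mahalanobis(x1, x2, A):
    """Distance of Eq. (1): sqrt((x1 - x2)^T A (x1 - x2))."""
    d = x1 - x2
    return float(np.sqrt(d @ A @ d))

def nn_classify(x_test, X_train, y_train, A):
    """Label of the training point nearest to x_test under the metric A."""
    dists = [mahalanobis(x_test, x, A) for x in X_train]
    return y_train[int(np.argmin(dists))]
```

With A = I this is the ordinary Euclidean rule; rescaling directions through A can change which training point is nearest, and hence the predicted label.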
Given training data, a global A is optimized such that directions separating data in different classes are stretched, and directions binding data of the same class together are shrunk [3, 9, 10, 21, 26]. However, these conventional methods do not improve test performance dramatically, as our later experiments on large datasets show; our theoretical analysis explains why only small improvements arise.

2.2 Theoretical Performance of the Nearest Neighbor Classifier

Contrary to recent metric learning approaches, a simple theoretical analysis using a generative model displays no sensitivity to the choice of metric. We consider i.i.d. samples generated from two different distributions p1(x) and p2(x) over the vector space x ∈ R^D. With infinite samples, the probability of misclassification using a nearest neighbor classifier is

E_Asymp = ∫ p1(x)p2(x) / (p1(x) + p2(x)) dx,   (2)

which is better known through its relationship to an upper bound, twice the optimal Bayes error [4, 7, 8]. By examining the asymptotic error in a linearly transformed z-space, we can show that Eq. (2) is invariant to a change of metric. If we consider a linear transformation z = L^T x with a full-rank matrix L, and distributions q_c(z) for c ∈ {1, 2} in z-space satisfying p_c(x)dx = q_c(z)dz with the accompanying measure change dz = |L|dx, then E_Asymp computed in z-space is unchanged. Since any positive definite A can be decomposed as A = LL^T, the asymptotic error remains constant even as the metric shrinks or expands arbitrary spatial directions in data space.

This difference in behavior with respect to the metric can be understood as a special property of infinite data. Without infinite samples, the expected error is biased in the sense that it deviates from the asymptotic error, and this bias depends on the metric. 
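The metric invariance of Eq. (2) is easy to check numerically. The following sketch (our illustration, with arbitrarily chosen example Gaussians) estimates E_Asymp by Monte Carlo in the original x-space and again after a linear transformation z = L^T x; the two estimates agree because the posterior ratio is unchanged pointwise:

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Gaussian density evaluated at the rows of x."""
    d = x - mu
    ci = np.linalg.inv(cov)
    quad = np.einsum('ij,jk,ik->i', d, ci, d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(cov))

rng = np.random.default_rng(0)
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.5, 0.5])
S1, S2 = np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])

# E_Asymp = int p1 p2 / (p1 + p2) dx = E_{x ~ p1}[ p2 / (p1 + p2) ]
x = rng.multivariate_normal(mu1, S1, size=100000)
p1v, p2v = gauss_pdf(x, mu1, S1), gauss_pdf(x, mu2, S2)
e_x = np.mean(p2v / (p1v + p2v))

# Same estimate in z = L^T x space: q_c has mean L^T mu_c and covariance L^T S_c L
L = np.array([[2.0, 1.0], [0.0, 0.5]])
z = x @ L
q1v = gauss_pdf(z, L.T @ mu1, L.T @ S1 @ L)
q2v = gauss_pdf(z, L.T @ mu2, L.T @ S2 @ L)
e_z = np.mean(q2v / (q1v + q2v))
```

Since q_c(z) = p_c(x)/|L|, the factor |L| cancels in the ratio and e_z equals e_x up to floating-point rounding.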
From a theoretical perspective, the asymptotic error is the limit of the expected error, and the bias shrinks as the number of samples increases. Since previous research does not consider this difference, the aforementioned metrics do not yield performance improvements when the number of samples is large.

In the next section, we analyze the performance bias associated with finite sampling directly and find a metric that minimizes the deviation from the asymptotic theoretical error.

3 Performance Bias due to Finite Sampling

Here, we obtain the expected nearest neighbor classification error from the distribution of nearest neighbors in different classes. With a finite number of samples, the nearest neighbor of a point x0 appears at a finite distance dN > 0. This non-zero distance gives rise to the performance difference from the theoretical limit (2). We consider a twice-differentiable distribution p(x) and approximate it to second order near a test point x0:

p(x) ≈ p(x0) + ∇p(x)|_{x=x0}^T (x − x0) + (1/2)(x − x0)^T ∇∇p(x)|_{x=x0} (x − x0),   (3)

with the gradient ∇p(x) and Hessian matrix ∇∇p(x) defined by taking derivatives with respect to x.

Now, under the condition that the nearest neighbor appears at distance dN from the test point, the expected probability p(xNN) at the nearest neighbor point is derived by averaging the probability over the D-dimensional hypersphere of radius dN, as in Fig. 1. After averaging, the gradient term disappears, and the resulting expectation is the sum of the probability at x0 and a residual term containing the Laplacian of p. We denote this expected probability by p̃(x0):

E_xNN[ p(xNN) | dN, x0 ] = p(x0) + (1/2) E_xNN[ (x − x0)^T ∇∇p(x)|_{x=x0} (x − x0) | ‖x − x0‖² = dN² ]
= p(x0) + (dN²/2D) ∇²p|_{x=x0} ≡ p̃(x0),   (4)

where the scalar Laplacian ∇²p(x) is the sum of the eigenvalues of the Hessian ∇∇p(x).

Figure 1: The nearest neighbor xNN appears at a finite distance dN from x0 due to finite sampling. Given the data distribution p(x), the average probability density over the surface of a D-dimensional hypersphere is p̃(x0) = p(x0) + (dN²/2D)∇²p|_{x=x0} for small dN.

The expected error is the expectation of the probability that the test point and its neighbor are labeled differently. In other words, the expected error E_NN is the expectation of e(x, xNN) = p(C1|x)p(C2|xNN) + p(C2|x)p(C1|xNN) over both the distribution of x and the distribution of the nearest neighbor xNN for a given x:

E_NN = E_x[ E_xNN[ e(x, xNN) | x ] ].   (5)

We then replace the posteriors p(C_c|x) and p(C_c|xNN) with p_c(x)/(p1(x) + p2(x)) and p_c(xNN)/(p1(xNN) + p2(xNN)) respectively, and approximate the expectation of the posterior at a fixed distance dN from the test point x using p̃_c(x)/(p̃1(x) + p̃2(x)). 
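The sphere average in Eq. (4) can be verified numerically. The sketch below (our illustration, assuming a single standard Gaussian p and an arbitrarily chosen test point) averages p over points drawn uniformly from the sphere of radius dN around x0 and compares the result with the second-order prediction p(x0) + (dN²/2D)∇²p(x0):

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Gaussian density evaluated at the rows of x."""
    d = np.atleast_2d(x) - mu
    ci = np.linalg.inv(cov)
    quad = np.einsum('ij,jk,ik->i', d, ci, d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(cov))

rng = np.random.default_rng(1)
D = 3
mu, cov = np.zeros(D), np.eye(D)
x0 = np.array([0.5, 0.0, 0.0])
d_N = 0.1

p0 = gauss_pdf(x0, mu, cov)[0]
ci = np.linalg.inv(cov)
g = ci @ (x0 - mu)
# Laplacian of the Gaussian density (trace of its Hessian)
lap = p0 * (g @ g - np.trace(ci))

# Monte Carlo average of p over points uniform on the sphere ||x - x0|| = d_N
v = rng.standard_normal((200000, D))
v /= np.linalg.norm(v, axis=1, keepdims=True)
avg = np.mean(gauss_pdf(x0 + d_N * v, mu, cov))

# Second-order prediction of Eq. (4)
tilde_p = p0 + d_N ** 2 / (2 * D) * lap
```

For small dN the Monte Carlo average and p̃(x0) agree closely; the gradient term averages out over the sphere, as stated in the text.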
If we expand E_NN with respect to dN and take the expectation using the decomposition E_xNN[f] = E_dN[ E_xNN[f | dN] ], then the expected error is given to leading order by

E_NN ≈ ∫ p1p2/(p1 + p2) dx + (E_dN[dN²]/4D) ∫ [ p1²∇²p2 + p2²∇²p1 − p1p2(∇²p1 + ∇²p2) ] / (p1 + p2)² dx.   (6)

When E_dN[dN²] → 0 with an infinite number of samples, this error converges to the asymptotic limit in Eq. (2) as expected. The residual term is the finite sampling bias of the error discussed earlier. Under the coordinate transformation z = L^T x, with distributions p(x) on x and q(z) on z, this bias term depends on the choice of metric A = LL^T:

∫ [ q1²∇²q2 + q2²∇²q1 − q1q2(∇²q1 + ∇²q2) ] / (q1 + q2)² dz
= ∫ (1/(p1 + p2)²) tr[ A⁻¹ ( p1²∇∇p2 + p2²∇∇p1 − p1p2(∇∇p1 + ∇∇p2) ) ] dx,   (7)

which is derived using p(x)dx = q(z)dz and |L|∇²q = tr[A⁻¹∇∇p]. The expectation of the squared distance E_dN[dN²] is related to the determinant |A|, which we fix to 1. Thus, finding the metric that minimizes the quantity in Eq. (7) at each point is equivalent to minimizing the metric-dependent bias in Eq. (6).

4 Reducing Deviation from the Asymptotic Performance

Finding the local metric that minimizes the bias can be formulated as a semi-definite programming (SDP) problem: minimize the squared residual with respect to a positive semi-definite metric A,

min_A (tr[A⁻¹B])²   s.t.   |A| = 1, A ⪰ 0,   (8)

where the matrix B at each point is

B = p1²∇∇p2 + p2²∇∇p1 − p1p2(∇∇p1 + ∇∇p2).   (9)

This is a simple SDP with an analytical solution that shares its eigenvectors with B. Let Λ+ ∈ R^{d+×d+} and Λ− ∈ R^{d−×d−} be the diagonal matrices containing the positive and negative eigenvalues of B respectively. If U+ ∈ R^{D×d+} contains the eigenvectors corresponding to the eigenvalues in Λ+ and U− ∈ R^{D×d−} contains the eigenvectors corresponding to the eigenvalues in Λ−, the solution is given by

A_opt = [U+ U−] diag( d+Λ+, −d−Λ− ) [U+ U−]^T.   (10)

The solution A_opt is a local metric, since we assumed that the nearest neighbor is close to the test point, satisfying Eq. (3). In principle, distances should then be defined as geodesic distances using this local metric on a Riemannian manifold. However, this is computationally difficult, so we use the surrogate distance A = γI + A_opt and treat γ as a regularization parameter that is learned in addition to the local metric A_opt.

The multiway extension of this problem is straightforward. The asymptotic error with C class distributions can be extended to (1/C) ∫ Σ_i p_i (Σ_{j≠i} p_j) / (Σ_c p_c) dx using the posteriors of each class, and it replaces B in Eq. 
(9) by the extended matrix:

B = Σ_{i=1..C} [ (Σ_{j≠i} p_j)² ∇∇p_i − p_i (Σ_{j≠i} p_j)(Σ_{j≠i} ∇∇p_j) ].   (11)

5 Metric Learning in Generative Models

Traditional metric learning methods can be understood as purely discriminative. In contrast to our method, which directly considers the expected error, those methods focus on maximizing the separation between data belonging to different classes. Their motivations are commonly compared to supervised dimensionality reduction methods, which try to find a low-dimensional space where the separation between classes is maximized. In this view, dimensionality reduction is not that different from metric learning; it is often a special case in which the metric in particular directions is forced to zero.

In the generative approach, however, the relationship between dimensionality reduction and metric learning is different. As in the discriminative case, dimensionality reduction in generative models tries to obtain class separation in a transformed space. It assumes particular parametric distributions (typically Gaussians), and uses a criterion that maximizes the separation [2, 6, 16, 17]. One general form of these criteria is the F-divergence (also known as Csiszár's general measure of divergence), which can be defined with respect to a convex function φ(t) for t ∈ R [13]:

F(p1(x), p2(x)) = ∫ p1(x) φ( p2(x)/p1(x) ) dx.   (12)

Examples of this divergence include the Bhattacharyya divergence ∫ √(p1(x)p2(x)) dx, obtained when φ(t) = √t, and the KL-divergence −∫ p1(x) log( p2(x)/p1(x) ) dx, obtained when φ(t) = −log(t). 
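As a small numerical illustration of Eq. (12) (ours, not from the paper), choosing φ(t) = √t recovers the Bhattacharyya coefficient, which for N(0,1) and N(1,1) has the closed form exp(−1/8); the value is also unchanged under a rescaling of x, illustrating metric invariance:

```python
import numpy as np

def f_divergence(p1, p2, phi, xs):
    """Eq. (12), F(p1, p2) = int p1(x) phi(p2(x)/p1(x)) dx, by trapezoidal integration."""
    y = p1(xs) * phi(p2(xs) / p1(xs))
    return float(np.sum((y[1:] + y[:-1]) * np.diff(xs)) / 2.0)

def gauss1d(mu, s2):
    """1-D Gaussian density with mean mu and variance s2."""
    return lambda x: np.exp(-(x - mu) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

xs = np.linspace(-30.0, 30.0, 600001)
# phi(t) = sqrt(t) gives the Bhattacharyya coefficient int sqrt(p1 p2) dx
bc = f_divergence(gauss1d(0.0, 1.0), gauss1d(1.0, 1.0), np.sqrt, xs)
```

Doubling x maps the two densities to N(0,4) and N(2,4), and recomputing the integral gives the same value, as the metric invariance of any F-divergence requires.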
Using the mutual information between data and labels can be understood as an extension of the KL-divergence. The well-known Linear Discriminant Analysis is a special example of the Bhattacharyya criterion when we assume two-class Gaussians sharing the same covariance matrix.

Unlike in dimensionality reduction, we cannot use these criteria for metric learning, because any F-divergence is metric-invariant. The asymptotic error in Eq. (2) is related to one particular F-divergence by E_Asymp = 1 − F(p1, p2) with the convex function φ(t) = 1/(1 + t). Therefore, in generative models, the metric learning problem is qualitatively different from the dimensionality reduction problem in this respect. One interpretation is that the F-measure can be understood as a measure for dimensionality reduction in the asymptotic situation; the role of metric learning is then to move the expected F-measure toward the asymptotic F-measure by appropriate metric adaptation.

Finally, we provide an alternative understanding of the problem of reducing Eq. (7). By reformulating Eq. (9) as (p2 − p1)(p2∇∇p1 − p1∇∇p2), we can see that the optimal metric tries to minimize the difference between ∇²p1/p1 and ∇²p2/p2. If

∇²p1/p1 ≈ ∇²p2/p2,   this also implies   p̃1/p1 ≈ p̃2/p2,   (13)

for p̃ = p + (dN²/2D)∇²p, the average probability at a distance dN in (4). Thus, the algorithm tries to keep the ratio of the average probabilities p̃1/p̃2 at a distance dN as similar as possible to the ratio of probabilities p1/p2 at the test point. This means that the expected nearest neighbor classification at a distance dN will be least biased by finite sampling. Fig. 2 shows how the learned local metric A_opt varies at a point x for a 3-class Gaussian example, and how the ratio p̃/p is kept as similar as possible.

Figure 2: Optimal local metrics are shown on the left for three example Gaussian distributions in a 5-dimensional space. The projected 2-dimensional distributions are represented as ellipses (one standard deviation from the mean), while the remaining 3 dimensions have an isotropic distribution. The local p̃/p of the three classes are plotted on the right using the Euclidean metric I and the optimal metric A_opt. The solution A_opt tries to keep the ratio p̃/p over the different classes as similar as possible as the distance dN is varied.

6 Experiments

We apply our local metric learning algorithm to synthetic and various real datasets and examine how well it improves nearest neighbor classification performance. Simple standard Gaussian distributions are used for the generative model, with a mean vector µ and covariance matrix Σ for each class. The Hessian of a Gaussian distribution is given by

∇∇p(x) = p(x) [ Σ⁻¹(x − µ)(x − µ)^T Σ⁻¹ − Σ⁻¹ ].   (14)

This expression is used to learn the optimal local metric. We compare the performance of our method (GLML: Generative Local Metric Learning) with recent discriminative metric learning methods that report state-of-the-art performance on a number of datasets. These include Information-Theoretic Metric Learning (ITML)1 [3], Boost Metric2 (BM) [21], and Large Margin Nearest Neighbor (LMNN)3 [26]. We used the implementations downloaded from the corresponding authors' websites. We also compare with a local metric given by the Fisher kernel [12], assuming a single Gaussian for the generative model and using the location parameter to derive the Fisher information matrix. 
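Putting Eqs. (9), (10), and (14) together, the computation of the local metric at a test point can be sketched as follows (a minimal two-class illustration under the |A| = 1 constraint; the example Gaussians are our own choice, and the γI regularization of section 4 is omitted):

```python
import numpy as np

def gauss_pdf_hess(x, mu, cov):
    """Density p(x) and its Hessian, Eq. (14), for one Gaussian class model."""
    ci = np.linalg.inv(cov)
    d = x - mu
    p = np.exp(-0.5 * d @ ci @ d) / np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(cov))
    return p, p * (np.outer(ci @ d, ci @ d) - ci)

def local_metric(x, params):
    """A_opt of Eq. (10) built from the matrix B of Eq. (9), normalized so |A| = 1."""
    (p1, H1), (p2, H2) = (gauss_pdf_hess(x, m, S) for m, S in params)
    B = p1 ** 2 * H2 + p2 ** 2 * H1 - p1 * p2 * (H1 + H2)
    lam, U = np.linalg.eigh(B)
    d_plus, d_minus = int(np.sum(lam > 0)), int(np.sum(lam < 0))
    # Eigenvalues of A_opt: d+ * lambda on the positive part, -d- * lambda on the negative part
    scale = np.where(lam > 0, d_plus * lam, -d_minus * lam)
    A = (U * scale) @ U.T
    return A / np.linalg.det(A) ** (1.0 / len(x))

# Illustrative two-class model: equal identity covariances, shifted means (our choice)
params = [(np.zeros(2), np.eye(2)), (np.array([2.0, 0.0]), np.eye(2))]
A = local_metric(np.array([0.8, 0.3]), params)
```

By construction tr(A_opt⁻¹B) = 0 whenever B has eigenvalues of both signs, which is exactly the condition under which the squared residual of Eq. (8) can be driven to zero.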
The metric from the Fisher kernel was not originally intended for nearest neighbor classification, but it is the only other reported algorithm that learns a local metric from generative models.

For the synthetic dataset, we generated data from two-class random Gaussian distributions with two fixed means. The covariance matrices are generated from random orthogonal eigenvectors and random eigenvalues. Experiments were performed varying the input dimensionality, and the classification accuracies are shown in Fig. 3(a) along with the results of the other algorithms. We used 500 test points and an equal number of training examples. The experiments were performed with 20 different realizations and the results were averaged. As the dimensionality grows, the original nearest neighbor performance degrades because of the high dimensionality. However, the proposed local metric substantially outperforms the discriminative methods in high dimensional spaces. We note that this example is ideal for GLML, and it shows much improvement compared to the other methods.

The other experiments use the following benchmark datasets: the UCI machine learning repository4 (Ionosphere, Wine) and the IDA benchmark repository5 (German, Image, Waveform, Twonorm). We also used the USPS handwritten digits and the TI46 speech dataset. For the USPS data, we resized the images to 8 × 8 pixels and trained on the 64-dimensional pixel vector data. For the TI46 dataset, the examples consist of spoken words pronounced by 8 different men and 8 different women. We chose the pronunciations of ten digits (“zero” to “nine”) and performed a 10-class digit classification task. 10 different filters in the Fourier domain were used as features to preprocess the acoustic data. 
The experiments were done on 20 data sampling realizations for Twonorm and TI46, 10 for USPS, 200 for Wine, and 100 for the others.

Except for the synthetic data in Fig. 3(a), the datasets have varying numbers of training data per class. The regularization parameter γ is chosen by cross-validation on a subset of the training data, then fixed for testing. The covariance matrix of each learned Gaussian distribution is also regularized by setting Σ = Σ̂ + αI, where Σ̂ is the estimated covariance. The parameter α is set prior to each experiment.

From the results shown in Fig. 3, our local metric algorithm generally outperforms the other metrics across most of the datasets. On quite a number of datasets, many of the other methods do not outperform the original Euclidean nearest neighbor classifier; on some of these datasets, performance simply cannot be improved using a global metric. The local metric derived from simple Gaussian distributions, on the other hand, always shows a performance gain over the naive nearest neighbor classifier. In contrast, using Bayes rule with these simple Gaussian generative models often results in very poor performance. The computational cost of using a local metric is also very competitive, since the underlying SDP optimization has a simple spectral solution. This is in contrast to other methods, which numerically solve for a global metric using an SDP over the data points.

7 Conclusions

In this study, we showed how a local metric for nearest neighbor classification can be learned using generative models. Our experiments show improvement over competitive methods on a number of datasets. The learning algorithm is derived from an analysis of the asymptotic performance of the nearest neighbor classifier, such that the optimal metric minimizes the bias of the expected performance of the classifier. 
This connection to generative models is very powerful, and can easily be extended to include missing data, one of the large advantages of generative models in machine learning. Here we used simple Gaussians for the generative models, but this could also easily be extended to other possibilities such as mixture models, hidden Markov models, or other dynamic generative models.

The kernelization of this work is straightforward, and the extension to the k-nearest neighbor setting using the theoretical distribution of k-th nearest neighbors is an interesting future direction. Another possible avenue of future work is to combine dimensionality reduction and metric learning within this framework.

1 http://userweb.cs.utexas.edu/~pjain/itml/
2 http://code.google.com/p/boosting/
3 http://www.cse.wustl.edu/~kilian/Downloads/LMNN.html
4 http://archive.ics.uci.edu/ml/
5 http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark

Figure 3: (a) Gaussian synthetic data with different dimensionality. As the number of dimensions gets large, most methods degrade except GLML and LMNN, and GLML continues to improve vastly over the other methods. (b)–(h) are experiments on benchmark datasets varying the number of training data per class: (b) Ionosphere, (c) German, (d) Image, (e) Waveform, (f) Twonorm, (g) Wine, (h) USPS 8×8. (i) TI46 is the speech dataset pronounced by 8 men and 8 women. The Fisher kernel and BM are omitted for (f)–(i) and (h)–(i) respectively, since their performances are much worse than the naive nearest neighbor classifier.

Acknowledgments

This research was supported by the National Research Foundation of Korea (2010-0017734, 2010-0018950, 314-2008-1-D00377) and by the MARS (KI002138) and BK-IT Projects.

References

[1] B. Alipanahi, M. Biggs, and A. Ghodsi. Distance metric learning vs. Fisher discriminant analysis. 
In Proceedings of the 23rd National Conference on Artificial Intelligence, pages 598–603, 2008.

[2] K. Das and Z. Nenadic. Approximate information discriminant analysis: A computationally simple heteroscedastic feature extraction technique. Pattern Recognition, 41(5):1548–1557, 2008.

[3] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pages 209–216, 2007.

[4] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2000.

[5] A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local distance functions. In Advances in Neural Information Processing Systems 18, pages 417–424, 2006.

[6] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA, 1990.

[7] K. Fukunaga and T.E. Flick. The optimal distance measure for nearest neighbour classification. IEEE Transactions on Information Theory, 27(5):622–627, 1981.

[8] K. Fukunaga and T.E. Flick. An optimal global nearest neighbour measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(3):314–318, 1984.

[9] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems 18, pages 451–458, 2006.

[10] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. 
In Advances in Neural Information Processing Systems 17, pages 513–520, 2005.

[11] M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. Inverardi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17(3):277–297, 2005.

[12] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, pages 487–493, 1998.

[13] J.N. Kapur. Measures of Information and Their Applications. John Wiley & Sons, New York, NY, 1994.

[14] S. Lacoste-Julien, F. Sha, and M. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems 21, pages 897–904, 2009.

[15] J.A. Lasserre, C.M. Bishop, and T.P. Minka. Principled hybrids of generative and discriminative models. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 87–94, 2006.

[16] M. Loog and R.P.W. Duin. Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):732–739, 2004.

[17] Z. Nenadic. Information discriminant analysis: Feature extraction with an information-theoretic objective. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1394–1407, 2007.

[18] A.Y. Ng and M.I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14, pages 841–848, 2001.

[19] F. Perez-Cruz. Estimation of information theoretic measures for continuous random variables. In Advances in Neural Information Processing Systems 21, pages 1257–1264, 2009.

[20] R. Raina, Y. Shen, A.Y. Ng, and A. McCallum. 
Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems 16, pages 545–552, 2004.

[21] C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive semidefinite metric learning with boosting. In Advances in Neural Information Processing Systems 22, pages 1651–1659, 2009.

[22] N. Singh-Miller and M. Collins. Learning label embeddings for nearest-neighbor multi-class classification with an application to speech recognition. In Advances in Neural Information Processing Systems 22, pages 1678–1686, 2009.

[23] D. Tran and A. Sorokin. Human activity recognition with metric learning. In Proceedings of the 10th European Conference on Computer Vision, pages 548–561, 2008.

[24] Q. Wang, S. R. Kulkarni, and S. Verdú. A nearest-neighbor approach to estimating divergence between continuous random vectors. In Proceedings of the IEEE International Symposium on Information Theory, pages 242–246, 2006.

[25] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392–2405, 2009.

[26] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems 18, pages 1473–1480, 2006.
", "award": [], "sourceid": 348, "authors": [{"given_name": "Yung-kyun", "family_name": "Noh", "institution": null}, {"given_name": "Byoung-tak", "family_name": "Zhang", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}