{"title": "Parametric Local Metric Learning for Nearest Neighbor Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1601, "page_last": 1609, "abstract": "We study the problem of learning local metrics for nearest neighbor classification. Most previous work on local metric learning learns a number of unrelated local metrics. While this 'independence' approach delivers increased flexibility, its downside is a considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function, we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing manifold regularization on the linear combinations, which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. 
We experimented with several large-scale classification problems with tens of thousands of instances, and compared our method with several state-of-the-art metric learning methods, both global and local, as well as with SVM with automatic kernel selection, all of which it outperforms significantly.", "full_text": "Parametric Local Metric Learning for Nearest Neighbor Classification\n\nJun Wang\nDepartment of Computer Science\nUniversity of Geneva, Switzerland\nJun.Wang@unige.ch\n\nAdam Woznica\nDepartment of Computer Science\nUniversity of Geneva, Switzerland\nAdam.Woznica@unige.ch\n\nAlexandros Kalousis\nDepartment of Business Informatics\nUniversity of Applied Sciences, Western Switzerland\nAlexandros.Kalousis@hesge.ch\n\nAbstract\n\nWe study the problem of learning local metrics for nearest neighbor classification. Most previous work on local metric learning learns a number of unrelated local metrics. While this 'independence' approach delivers increased flexibility, its downside is a considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function, we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing manifold regularization on the linear combinations, which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. 
We experimented with several large-scale classification problems with tens of thousands of instances, and compared our method with several state-of-the-art metric learning methods, both global and local, as well as with SVM with automatic kernel selection, all of which it outperforms significantly.\n\n1 Introduction\n\nThe nearest neighbor (NN) classifier is one of the simplest and most classical non-linear classification algorithms. It is guaranteed to yield an error no worse than twice the Bayes error as the number of instances approaches infinity. With finite learning instances, its performance strongly depends on the use of an appropriate distance measure. Mahalanobis metric learning [4, 15, 9, 10, 17, 14] improves the performance of the NN classifier when used instead of the Euclidean metric. It learns a global distance metric which determines the importance of the different input features and their correlations. However, since the discriminatory power of the input features might vary between different neighborhoods, a single global metric cannot fit the distance well over the whole data manifold. A more appropriate approach is to learn a metric on each neighborhood, which is exactly what local metric learning [8, 3, 15, 7] does. It increases the expressive power of standard Mahalanobis metric learning by learning a number of local metrics (e.g. one per instance).\n\nLocal metric learning has been shown to be effective in different learning scenarios. One of the first local metric learning works, Discriminant Adaptive Nearest Neighbor classification (DANN) [8], learns local metrics by shrinking neighborhoods in directions orthogonal to the local decision boundaries and enlarging the neighborhoods parallel to the boundaries. It learns the local metrics independently, with no regularization between them, which makes it prone to overfitting. 
The authors of LMNN-Multiple Metric (LMNN-MM) [15] significantly limited the number of learned metrics and constrained all instances in a given region to share the same metric in an effort to combat overfitting. In the supervised setting they fixed the number of metrics to the number of classes; a similar idea has also been considered in [3]. However, they too learn the metrics independently for each region, which also makes them prone to overfitting, since the local metrics become overly specific to their respective regions. The authors of [16] learn local metrics with a least-squares approach, minimizing a weighted sum of the distances of each instance to a priori defined target positions and constraining the instances in the projected space to preserve the original geometric structure of the data in an effort to alleviate overfitting. However, the method learns the local metrics using a learning-order-sensitive propagation strategy, and depends heavily on an appropriate definition of the target positions for each instance, a task far from obvious. In another effort to overcome the overfitting problem of the discriminative methods [8, 15], Generative Local Metric Learning (GLML) [11] proposes to learn local metrics by minimizing the expected NN classification error under strong model assumptions, using a Gaussian distribution to model the learning instances of each class. However, these strong model assumptions can easily be too inflexible for many learning problems.\n\nIn this paper we propose the Parametric Local Metric Learning method (PLML), which learns a smooth metric matrix function over the data manifold. 
More precisely, we parametrize the metric matrix of each instance as a linear combination of the basis metric matrices of a small set of anchor points; this parametrization is naturally derived from an error bound on the local metric approximation. Additionally, we incorporate manifold regularization on the linear combinations, forcing them to vary smoothly over the data manifold. We develop an efficient two-stage algorithm that first learns the linear combination weights of each instance and then the metric matrices of the anchor points. To improve scalability and efficiency we employ a fast first-order optimization algorithm, FISTA [2], to learn both the linear combinations and the basis metrics of the anchor points. We experiment with PLML on a number of large-scale classification problems with tens of thousands of learning instances. The experimental results clearly demonstrate that PLML significantly improves the predictive performance over the current state-of-the-art metric learning methods, as well as over multi-class SVM with automatic kernel selection.\n\n2 Preliminaries\n\nWe denote by X the n × d matrix of learning instances, the i-th row of which is the instance xiT, xi ∈ Rd, and by y = (y1, . . . , yn)T, yi ∈ {1, . . . , c}, the vector of class labels. The squared Mahalanobis distance between two instances in the input space is given by:\n\nd2_M(xi, xj) = (xi − xj)T M (xi − xj)\n\nwhere M is a PSD metric matrix (M ⪰ 0). A linear metric learning method learns a Mahalanobis metric M by optimizing some cost function under the PSD constraint on M and a set of additional constraints on pairwise instance distances. Depending on the actual metric learning method, different kinds of constraints on pairwise distances are used. The most successful ones are the large margin triplet constraints. 
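To make the preliminaries concrete, here is a minimal NumPy sketch of the squared Mahalanobis distance defined above, together with a large margin triplet check; the function names are our own illustration, not code from the paper:

```python
import numpy as np

def mahalanobis_sq(xi, xj, M):
    """Squared Mahalanobis distance d2_M(xi, xj) = (xi - xj)^T M (xi - xj)."""
    diff = xi - xj
    return float(diff @ M @ diff)

def triplet_satisfied(xi, xj, xk, M, margin=0.0):
    """Large margin triplet constraint c(xi, xj, xk): under M, xi should be
    closer to xj than to xk, by at least `margin`."""
    return mahalanobis_sq(xi, xk, M) >= mahalanobis_sq(xi, xj, M) + margin

# With M = I the distance reduces to the squared Euclidean distance.
xi, xj, xk = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([3.0, 0.0])
I = np.eye(2)
assert mahalanobis_sq(xi, xj, I) == 1.0          # |xi - xj|^2 = 1
assert triplet_satisfied(xi, xj, xk, I, margin=1.0)  # 9 >= 1 + 1
```

A learned PSD matrix M would simply replace the identity here; the triplet check is the unit-margin condition used later in the basis metric learning problem.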
A triplet constraint, denoted by c(xi, xj, xk), indicates that in the projected space induced by M the distance between xi and xj should be smaller than the distance between xi and xk.\n\nVery often a single metric M cannot adequately model the complexity of a given learning problem, in which discriminative features vary between different neighborhoods. To address this limitation, in local metric learning we learn a set of local metrics. In most cases we learn a local metric for each learning instance [8, 11]; however, we can also learn a local metric for some part of the instance space, in which case the number of learned metrics can be considerably smaller than n, e.g. [15]. We follow the former approach and learn one local metric per instance. In principle, distances should then be defined as geodesic distances using the local metric on a Riemannian manifold. However, this is computationally difficult, thus we define the distance between instances xi and xj as:\n\nd2_Mi(xi, xj) = (xi − xj)T Mi (xi − xj)\n\nwhere Mi is the local metric of instance xi. Note that most often the local metric Mi of instance xi is different from that of xj. As a result, the distance d2_Mi(xi, xj) does not satisfy the symmetry property, i.e. it is not a proper metric. Nevertheless, in accordance with standard practice we will continue to use the term local metric learning, following [15, 11].\n\n3 Parametric Local Metric Learning\n\nWe assume that there exists a Lipschitz smooth vector-valued function f(x) whose output is the vectorized local metric matrix of instance x. Learning the local metric of each instance is essentially learning the value of this function at different points over the data manifold. 
In order to significantly reduce the computational complexity we will approximate the metric function instead of directly learning it.\n\nDefinition 1 A vector-valued function f(x) on Rd is an (α, β, p)-Lipschitz smooth function with respect to a vector norm ||·|| if ||f(x) − f(x')|| ≤ α ||x − x'|| and ||f(x) − f(x') − ∇f(x')T (x − x')|| ≤ β ||x − x'||^(1+p), where ∇f(x')T is the derivative of the f function at x'. We assume α, β > 0 and p ∈ (0, 1].\n\n[18] have shown that any Lipschitz smooth real function f(x) defined on a lower dimensional manifold can be approximated by a linear combination of function values f(u), u ∈ U, of a set U of anchor points. Based on this result we have the following lemma that gives the respective error bound for learning a Lipschitz smooth vector-valued function.\n\nLemma 1 Let (γ, U) be a nonnegative weighting on anchor points U in Rd. Let f be an (α, β, p)-Lipschitz smooth vector function. We have for all x ∈ Rd:\n\n||f(x) − Σ_{u∈U} γu(x) f(u)|| ≤ α ||x − Σ_{u∈U} γu(x) u|| + β Σ_{u∈U} γu(x) ||x − u||^(1+p)   (1)\n\nThe proof of the above Lemma 1 is similar to the proof of Lemma 2.1 in [18]; for lack of space we omit its presentation. 
Under the nonnegative weighting strategy (γ, U), the PSD constraint on the approximated local metric is automatically satisfied if the local metrics of the anchor points are PSD matrices.\n\nLemma 1 suggests a natural way to approximate the local metric function by parameterizing the metric Mi of each instance xi as a weighted linear combination, Wi ∈ Rm, of a small set of basis metrics, {Mb1, . . . , Mbm}, each one associated with an anchor point defined in some region of the instance space. This parametrization will also provide us with a global way to regularize the flexibility of the metric function. We will first learn the vector of weights Wi for each instance xi, and then the basis metric matrices; these two together will give us the metric Mi for the instance xi.\n\nMore formally, we define an m × d matrix U of anchor points, the i-th row of which is the anchor point ui, where uiT ∈ Rd. We denote by Mbi the Mahalanobis metric matrix associated with ui. The anchor points can be defined using some clustering algorithm; we have chosen to define them as the means of clusters constructed by the k-means algorithm. The local metric Mi of an instance xi is parametrized by:\n\nMi = Σ_bk Wibk Mbk ,   Wibk ≥ 0,   Σ_bk Wibk = 1   (2)\n\nwhere W is an n × m weight matrix whose entry Wibk is the weight of the basis metric Mbk for the instance xi. The constraint Σ_bk Wibk = 1 removes the scaling problem between different local metrics. Using the parametrization of equation (2), the squared distance of xi to xj under the metric Mi is:\n\nd2_Mi(xi, xj) = Σ_bk Wibk d2_Mbk(xi, xj)   (3)\n\nwhere d2_Mbk(xi, xj) is the squared Mahalanobis distance between xi and xj under the basis metric Mbk. 
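Equations (2) and (3) are easy to sanity-check numerically: by linearity, the distance under the combined metric Mi equals the weighted sum of the basis distances, and Mi is PSD whenever the simplex-constrained weights combine PSD basis metrics. A small NumPy sketch (our own illustration; the names are hypothetical):

```python
import numpy as np

def local_metric(W_i, basis_metrics):
    """M_i = sum_k W_ik * M_bk (eq. 2); W_i lies on the simplex, so M_i is
    PSD whenever every basis metric is PSD."""
    return sum(w * M for w, M in zip(W_i, basis_metrics))

def local_dist_sq(xi, xj, W_i, basis_metrics):
    """d2_Mi(xi, xj) as the weighted sum of basis distances (eq. 3)."""
    diff = xi - xj
    return sum(w * float(diff @ M @ diff) for w, M in zip(W_i, basis_metrics))

# Two PSD basis metrics and a weight vector on the simplex.
basis = [np.diag([2.0, 1.0]), np.diag([1.0, 3.0])]
W_i = np.array([0.25, 0.75])
xi, xj = np.array([1.0, 1.0]), np.array([0.0, 0.0])

# The two formulations agree by linearity of (2)-(3).
diff = xi - xj
assert np.isclose(float(diff @ local_metric(W_i, basis) @ diff),
                  local_dist_sq(xi, xj, W_i, basis))
```

Computing the distance via (3) avoids ever materializing Mi, which is what makes the weight learning step cheap.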
We will show in the next section how to learn the weights of the basis metrics for each instance, and in section 3.2 how to learn the basis metrics.\n\nAlgorithm 1 Smooth Local Linear Weight Learning\n\nInput: W0, X, U, G, L, λ1, and λ2\nOutput: matrix W\ninitialize: t1 = 1, β = 1, Y1 = W0, and i = 0\ndefine g~β,Y(W) = g(Y) + tr(∇g(Y)T (W − Y)) + (β/2) ||W − Y||F2\nrepeat\n  i = i + 1, Wi = Proj(Yi − (1/β) ∇g(Yi))\n  while g(Wi) > g~β,Yi(Wi) do\n    β = 2β, Wi = Proj(Yi − (1/β) ∇g(Yi))\n  end while\n  ti+1 = (1 + sqrt(1 + 4ti2)) / 2,  Yi+1 = Wi + ((ti − 1)/ti+1) (Wi − Wi−1)\nuntil convergence\n\n3.1 Smooth Local Linear Weighting\n\nLemma 1 bounds the approximation error by two terms. The first term states that x should be close to its linear approximation, and the second that the weighting should be local. In addition, we want the local metrics to vary smoothly over the data manifold. To achieve this smoothness we rely on manifold regularization and constrain the weight vectors of neighboring instances to be similar. Following this reasoning, we will learn Smooth Local Linear Weights for the basis metrics by minimizing the error bound of (1) together with a regularization term that controls the weight variation of similar instances. To simplify the objective function, we use the term ||x − Σ_{u∈U} γu(x) u||2 instead of ||x − Σ_{u∈U} γu(x) u||. 
By including the constraints on the W weight matrix from (2), the optimization problem is given by:\n\nmin_W g(W) = ||X − WU||F2 + λ1 tr(WG) + λ2 tr(WT L W)\ns.t. Wibk ≥ 0, Σ_bk Wibk = 1, ∀i, bk   (4)\n\nwhere tr(·) and ||·||F denote respectively the trace of a square matrix and the Frobenius norm of a matrix. The m × n matrix G is the squared distance matrix between each anchor point ui and each instance xj, obtained for p = 1 in (1), i.e. its (i, j) entry is the squared Euclidean distance between ui and xj. L is the n × n Laplacian matrix constructed as D − S, where S is the n × n symmetric pairwise similarity matrix of the learning instances and D is a diagonal matrix with Dii = Σ_k Sik. Thus the minimization of the tr(WT L W) term constrains similar instances to have similar weight coefficients. The minimization of the tr(WG) term forces the weights of the instances to reflect their local properties. Most often the similarity matrix S is constructed using a k-nearest-neighbor graph [19]. The λ1 and λ2 parameters control the importance of the different terms.\n\nSince the cost function g(W) is convex quadratic in W and the constraints are simply linear, (4) is a convex optimization problem with a unique optimal solution. The constraints on W in (4) can be seen as n simplex constraints, one on each row of W; we will use the projected gradient method to solve the optimization problem. At each iteration t, the learned weight matrix W is updated by:\n\nWt+1 = Proj(Wt − η ∇g(Wt))   (5)\n\nwhere η > 0 is the step size and ∇g(Wt) is the gradient of the cost function g(W) at Wt. Proj(·) denotes the simplex projection operator applied to each row of W. Such a projection operator can be efficiently implemented with a complexity of O(nm log(m)) [6]. 
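The row-wise simplex projection Proj(·) used in update (5) can be implemented with the standard sort-based method of [6]; a NumPy sketch of that O(m log m) procedure (our own illustration, not the paper's code):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {w : w >= 0, sum(w) = 1}, via the sort-based method of [6]."""
    u = np.sort(v)[::-1]                 # sort in decreasing order
    css = np.cumsum(u) - 1.0             # cumulative sums minus the target sum
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]   # largest index kept active
    theta = css[rho] / (rho + 1.0)       # shared shift
    return np.maximum(v - theta, 0.0)

def project_rows(W):
    """Row-wise simplex projection Proj(.) as used in update (5)."""
    return np.apply_along_axis(project_to_simplex, 1, W)

v = np.array([0.9, 0.6, -0.2])
w = project_to_simplex(v)
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
```

Each projected-gradient step of (5) is then just `project_rows(W - eta * grad_g(W))` for a suitable gradient routine.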
To speed up the optimization procedure we employ the fast first-order optimization method FISTA [2]. The detailed algorithm is described in Algorithm 1. The Lipschitz constant β required by this algorithm is estimated using the condition g(Wi) ≤ g~β,Yi(Wi) [1]. At each iteration, the main computations are those of the gradient and the objective value, with complexity O(nmd + n2m).\n\nTo set the weights of the basis metrics for a testing instance we can optimize (4) given the weights of the basis metrics of the training instances. Alternatively, we can simply set them to the weights of its nearest neighbor among the training instances. In the experiments we used the latter approach.\n\n3.2 Large Margin Basis Metric Learning\n\nIn this section we define a large margin based algorithm to learn the basis metrics Mb1, . . . , Mbm. Given the W weight matrix of basis metrics obtained using Algorithm 1, the local metric Mi of an instance xi defined in (2) is linear with respect to the basis metrics Mb1, . . . , Mbm. We define the relative comparison distance of instances xi, xj and xk as: d2_Mi(xi, xk) − d2_Mi(xi, xj). In a large margin constraint c(xi, xj, xk), the squared distance d2_Mi(xi, xk) is required to be larger than d2_Mi(xi, xj) + 1; otherwise an error ξijk ≥ 0 is generated. Note that this relative comparison definition is different from the one used in LMNN-MM [15]. In LMNN-MM, to avoid overfitting, different local metrics Mj and Mk are used to compute the squared distances d2_Mj(xi, xj) and d2_Mk(xi, xk) respectively, as no smoothness constraint is imposed between the metrics of different local regions.\n\nGiven a set of triplet constraints, we learn the basis metrics Mb1, . . .
, Mbm with the following optimization problem:\n\nmin_{Mb1,...,Mbm,ξ}  α1 Σ_bl ||Mbl||F2 + Σ_ijk ξijk + α2 Σ_ij Σ_bl Wibl d2_Mbl(xi, xj)\ns.t.  Σ_bl Wibl (d2_Mbl(xi, xk) − d2_Mbl(xi, xj)) ≥ 1 − ξijk, ∀i, j, k;  ξijk ≥ 0, ∀i, j, k;  Mbl ⪰ 0, ∀bl   (6)\n\nwhere α1 and α2 are parameters that balance the importance of the different terms. The large margin triplet constraints for each instance are generated using its k1 same-class nearest neighbors and k2 different-class nearest neighbors, by requiring its distances to the k2 different-class instances to be larger than those to its k1 same-class instances. In the objective function of (6) the basis metrics are learned by minimizing the sum of large margin errors and the sum of squared pairwise distances of each instance to its k1 nearest neighbors, computed using the local metric. Unlike LMNN, we add the squared Frobenius norm of each basis metric to the objective function. We do this for two reasons. First, we exploit the connection between LMNN and SVM shown in [5], under which the squared Frobenius norm of the metric matrix is related to the SVM margin. Second, adding this term leads to an easy-to-optimize dual formulation of (6) [12].\n\nUnlike many special-purpose solvers which optimize the primal form of the metric learning problem [15, 13], we follow [12] and optimize the Lagrangian dual problem of (6). The dual formulation leads to an efficient basis metric learning algorithm. 
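The triplet-generation scheme described above, pairing each instance's k1 same-class nearest neighbors with its k2 different-class nearest neighbors, can be sketched as follows; this is our own brute-force illustration, not the paper's implementation:

```python
import numpy as np

def build_triplets(X, y, k1=3, k2=3):
    """Large margin triplets c(xi, xj, xk): for each xi, pair its k1 nearest
    same-class neighbors xj with its k2 nearest different-class neighbors xk.
    Brute-force Euclidean neighbors, O(n^2) memory; a sketch only."""
    triplets = []
    # pairwise squared Euclidean distances
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for i in range(len(X)):
        order = np.argsort(D[i])
        same = [j for j in order if j != i and y[j] == y[i]][:k1]
        diff = [k for k in order if y[k] != y[i]][:k2]
        triplets += [(i, j, k) for j in same for k in diff]
    return triplets

X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 0, 1, 1])
trips = build_triplets(X, y, k1=1, k2=1)
assert len(trips) == 4  # one (j, k) pair per instance
```

Each triplet then contributes one margin constraint (and one dual variable γijk) in problem (6).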
Introducing the Lagrangian dual multipliers γijk, pijk and the PSD matrices Zbl to associate respectively with the large margin triplet constraints, the constraints ξijk ≥ 0 and the PSD constraints Mbl ⪰ 0 in (6), we can easily derive the following Lagrangian dual form:\n\nmax_{Zb1,...,Zbm,γ}  Σ_ijk γijk − Σ_bl (1/(4α1)) ||Zbl + Σ_ijk γijk Wibl Cijk − α2 Σ_ij Wibl Aij||F2\ns.t.  1 ≥ γijk ≥ 0, ∀i, j, k;  Zbl ⪰ 0, ∀bl   (7)\n\nwith the corresponding optimality conditions M*bl = (1/(2α1)) (Z*bl + Σ_ijk γ*ijk Wibl Cijk − α2 Σ_ij Wibl Aij) and 1 ≥ γijk ≥ 0, where the matrices Aij and Cijk are given by xij xijT and xik xikT − xij xijT respectively, with xij = xi − xj.\n\nCompared to the primal form, the main advantage of the dual formulation is that the second term in the objective function of (7) has a closed-form solution for Zbl given a fixed γ. To derive the optimal solution for Zbl, let Kbl = α2 Σ_ij Wibl Aij − Σ_ijk γijk Wibl Cijk. Then, given a fixed γ, the optimal solution for Zbl is Z*bl = (Kbl)+, where (Kbl)+ projects the matrix Kbl onto the PSD cone, i.e. (Kbl)+ = U [max(diag(Σ), 0)] UT with Kbl = U Σ UT. Now, (7) is rewritten as:\n\nmin_γ  g(γ) = −Σ_ijk γijk + Σ_bl (1/(4α1)) ||(Kbl)+ − Kbl||F2\ns.t.  1 ≥ γijk ≥ 0, ∀i, j, k   (8)\n\nand the optimality condition for Mbl is M*bl = (1/(2α1)) ((K*bl)+ − K*bl). The gradient of the objective function in (8) is given by: ∇g(γijk) = −1 + Σ_bl (1/(2α1)) ⟨(Kbl)+ − Kbl, Wibl Cijk⟩. 
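The PSD-cone projection (K)+ used in the closed-form solution for Zbl amounts to clipping negative eigenvalues; a small NumPy sketch (our own illustration):

```python
import numpy as np

def psd_projection(K):
    """(K)+ : project a symmetric matrix onto the PSD cone by clipping
    negative eigenvalues, K = U diag(s) U^T -> U diag(max(s, 0)) U^T."""
    s, U = np.linalg.eigh(K)             # eigendecomposition, s ascending
    return (U * np.maximum(s, 0.0)) @ U.T

K = np.array([[1.0, 0.0],
              [0.0, -2.0]])
Kp = psd_projection(K)
assert np.all(np.linalg.eigvalsh(Kp) >= -1e-12)   # result is PSD
assert np.allclose(Kp, [[1.0, 0.0], [0.0, 0.0]])  # negative eigenvalue clipped
```

Note that (K)+ - K keeps only the flipped negative part of K and is therefore itself PSD, which is what makes the optimality condition for Mbl produce a valid metric.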
At each iteration, γ is updated by: γi+1 = BoxProj(γi − η ∇g(γi)), where η > 0 is the step size. BoxProj(·) denotes the simple box projection operator on γ, as specified by the constraints of (8). At each iteration, the main computational cost lies in the eigendecomposition, with a complexity of O(md3), and in the computation of the gradient, with a complexity of O(m(nd2 + cd)), where m is the number of basis metrics and c is the number of large margin triplet constraints. As in the weight learning problem, the FISTA algorithm is employed to accelerate the optimization process; for lack of space we omit the algorithm presentation.\n\n4 Experiments\n\nIn this section we evaluate the performance of PLML and compare it with a number of relevant baseline methods on six datasets with large numbers of instances, ranging from 5K to 70K; these datasets are Letter, USPS, Pendigits, Optdigits, Isolet and MNIST. We want to determine whether the addition of manifold regularization on the local metrics improves the predictive performance of local metric learning, and whether local metric learning improves over learning with a single global metric. We compare PLML against six baseline methods. The first, SML, is a variant of PLML in which a single global metric is learned, i.e. we set the number of basis metrics in (6) to one. The second, Cluster-Based LML (CBLML), is a variant of PLML without weight learning. Here we learn one local metric for each cluster and we assign a weight of one to a basis metric Mbi if the corresponding cluster of Mbi contains the instance, and zero otherwise. Finally, we also compare against four state-of-the-art metric learning methods: LMNN [15], BoostMetric [13]1, GLML [11] and LMNN-MM [15]2. The former two learn a single global metric and the latter two a number of local metrics. 
In addition to the different metric learning methods, we also compare PLML against multi-class SVMs, in which we use the one-against-all strategy to determine the class label for multi-class problems and select the best kernel with inner cross-validation.\n\nSince metric learning is computationally expensive for datasets with a large number of features, we followed [15] and reduced the dimensionality of the USPS, Isolet and MNIST datasets by applying PCA. In these datasets the retained PCA components explain 95% of their total variance. We preprocessed all datasets by first standardizing the input features, and then normalizing the instances so that their L2-norm is one.\n\nPLML has a number of hyper-parameters. To reduce the computational time we do not tune λ1 and λ2 of the weight learning optimization problem (4), and we set them to their default values of λ1 = 1 and λ2 = 100. The Laplacian matrix L is constructed using the six-nearest-neighbor graph following [19]. The anchor points U are the means of clusters constructed with k-means clustering. The number m of anchor points, i.e. the number of basis metrics, depends on the complexity of the learning problem. More complex problems will often require a larger number of anchor points to better model the complexity of the data. As the number of classes in the examined datasets is 10 or 26, we simply set m = 20 for all datasets. In the basis metric learning problem (6), the number of dual parameters γ is the same as the number of triplet constraints. To speed up the learning process, the triplet constraints are constructed using only the three same-class and the three different-class nearest neighbors of each learning instance. The parameter α2 is set to 1, while the parameter α1 is the only parameter that we select, from the set {0.01, 0.1, 1, 10, 100}, using 2-fold inner cross-validation. 
The above setting of basis metric learning for PLML is also used with the SML and CBLML methods. For LMNN and LMNN-MM we use their default settings [15], in which the triplet constraints are constructed from the three nearest same-class neighbors and all different-class samples. As a result, the number of triplet constraints optimized in LMNN and LMNN-MM is much larger than in PLML, SML, BoostMetric and CBLML. The local metrics are initialized by identity matrices. As in [11], GLML uses the Gaussian distribution to model the learning instances of each class. Finally, we use the 1-NN rule to evaluate the performance of the different metric learning methods. In addition, as already mentioned, we also compare against multi-class SVM. Since the performance of the latter depends heavily on the kernel with which it is coupled, we do automatic kernel selection with inner cross-validation to select the best\n\n1 http://code.google.com/p/boosting\n2 http://www.cse.wustl.edu/~kilian/code/code.html\n\nFigure 1: Visualization of the local metrics learned by (a) LMNN-MM, (b) CBLML, (c) GLML and (d) PLML.\n\nTable 1: Accuracy results. The superscripts +−= next to the accuracies of PLML indicate the result of McNemar's statistical test against LMNN, BoostMetric, SML, CBLML, LMNN-MM, GLML and SVM. They denote respectively a significant win, loss or no difference for PLML. 
The number in parentheses indicates the score of the respective algorithm on the given dataset, based on the pairwise comparisons of McNemar's statistical test.\n\nDatasets     PLML                   LMNN         BoostMetric   SML          CBLML        LMNN-MM      GLML         SVM\nLetter       97.22+++|+++|+(7.0)    96.08(2.5)   96.49(4.5)    96.71(5.5)   95.82(2.5)   95.02(1.0)   93.86(0.0)   96.64(5.0)\nPendigits    98.34+++|+++|+(7.0)    97.43(2.0)   97.43(2.5)    97.80(4.5)   97.94(5.0)   97.43(2.0)   96.88(0.0)   97.91(5.0)\nOptdigits    97.72===|+++|=(5.0)    97.55(5.0)   97.61(5.0)    97.22(5.0)   95.94(1.5)   95.94(1.5)   94.82(0.0)   97.33(5.0)\nIsolet       95.25=+=|+++|=(5.5)    95.51(5.5)   89.16(2.5)    94.68(5.5)   89.03(2.5)   84.61(0.5)   84.03(0.5)   95.19(5.5)\nUSPS         98.26+++|+++|=(6.5)    97.92(4.5)   97.65(2.5)    97.94(4.0)   96.22(0.5)   97.90(4.0)   96.05(0.5)   98.19(5.5)\nMNIST        97.30=++|+++|=(6.0)    97.30(6.0)   96.03(2.5)    96.57(4.0)   95.77(2.5)   93.24(1.0)   84.02(0.0)   97.62(6.0)\nTotal Score  37                     25.5         19.5          28.5         14.5         10           1            32.5\n\nkernel and parameter setting. The kernels were chosen from the set of linear, polynomial (degree 2, 3 and 4), and Gaussian kernels; the width of the Gaussian kernel was set to the average of all pairwise distances. The C parameter of the hinge loss term was selected from {0.1, 1, 10, 100}.\n\nTo estimate the classification accuracy for Pendigits, Optdigits, Isolet and MNIST we used the default train and test split; for the other datasets we used 10-fold cross-validation. The statistical significance of the differences was tested with McNemar's test with a p-value of 0.05. 
In order to get a better understanding of the relative performance of the different algorithms on a given dataset, we used a simple ranking schema in which an algorithm A was assigned one point if it was found to have a statistically significantly better accuracy than another algorithm B, 0.5 points if the two algorithms did not have a significant difference, and zero points if A was found to be significantly worse than B.\n\n4.1 Results\n\nIn Table 1 we report the experimental results. PLML consistently outperforms the single global metric learning methods LMNN, BoostMetric and SML on all datasets except Isolet, on which its accuracy is slightly lower than that of LMNN. Depending on the single global metric learning method with which we compare it, it is significantly better on three, four, and five datasets out of the six (for LMNN, SML, and BoostMetric respectively), and never significantly worse. When we compare PLML with CBLML and LMNN-MM, the two baseline methods which learn one local metric per cluster and per class respectively, with no smoothness constraints, we see that it is statistically significantly better on all the datasets. GLML fails to learn appropriate metrics on all datasets because its fundamental generative model assumption is often not valid. Finally, we see that PLML is significantly better than SVM on two out of the six datasets and is never significantly worse; recall that with SVM we also do inner-fold kernel selection to automatically select the appropriate feature space. Overall, PLML is the best performing method, scoring 37 points over the different datasets, followed by SVM with automatic kernel selection and SML, which score 32.5 and 28.5 points respectively. 
The other metric learning methods perform rather poorly.\n\nExamining more closely the performance of the baseline local metric learning methods CBLML and LMNN-MM, we observe that they tend to overfit the learning problems. This can be seen from their considerably worse performance with respect to that of SML and LMNN, which rely on a single global model. On the other hand, even though PLML also learns local metrics, it does not suffer from the overfitting problem, due to the manifold regularization. The poor performance of LMNN-MM is not in agreement with the results reported in [15]. The main reason for the difference is the experimental setting: in [15], 30% of the training instances of each dataset were used as a validation set to avoid overfitting.\n\nTo provide a better understanding of the behavior of the learned metrics, we applied PLML, LMNN-MM, CBLML and GLML to an image dataset containing instances of four different handwritten digits, zero, one, two, and four, from the MNIST dataset. As in [15], we use the two main principal components to learn. Figure 1 shows the learned local metrics by plotting the axes of their corresponding ellipses (black lines). The direction of the longer axis is the more discriminative. Clearly PLML fits the data much better than LMNN-MM and, as expected, its local metrics vary smoothly. In terms of predictive performance, PLML has the best accuracy with 82.76%. CBLML, LMNN-MM and GLML have an almost identical performance, with respective accuracies of 82.59%, 82.56% and 82.51%.\n\nFigure 2: Accuracy results of PLML and CBLML with varying number of basis metrics (panels: (a) Letter, (b) Pendigits, (c) Optdigits, (d) USPS, (e) Isolet, (f) MNIST).\n\nFinally, we investigated the sensitivity of PLML and CBLML to the number of basis metrics; we experimented with m ∈ {5, 10, 15, 20, 25, 30, 35, 40}. The results are given in Figure 2. 
We see that the predictive performance of PLML often improves as we increase the number of basis metrics. Its performance saturates when the number of basis metrics becomes sufficient to model the underlying training data. As expected, different learning problems require different numbers of basis metrics. PLML does not overfit on any of the datasets. In contrast, the performance of CBLML gets worse when the number of basis metrics is large, which provides further evidence that CBLML does indeed overfit the learning problems, demonstrating clearly the utility of the manifold regularization.

5 Conclusions

Local metric learning provides a more flexible way to learn the distance function. However, local metric learning methods are prone to overfitting, since the number of parameters they learn can be very large. In this paper we presented PLML, a local metric learning method which regularizes the local metrics to vary smoothly over the data manifold. Using an approximation error bound of the metric matrix function, we parametrize the local metrics as weighted linear combinations of basis metrics defined on anchor points. Our method scales to learning problems with tens of thousands of instances and avoids the overfitting problems that plague other local metric learning methods. The experimental results show that PLML significantly outperforms the state-of-the-art metric learning methods and that its performance is significantly better than, or equivalent to, that of SVM with automatic kernel selection.

Acknowledgments

This work was funded by the Swiss NSF (Grant 200021-137949). The support of EU projects DebugIT (FP7-217139) and e-LICO (FP7-231519), as well as that of COST Action BM072 ('Urine and Kidney Proteomics'), is also gratefully acknowledged.

References

[1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning.

[2] A.
Beck and M. Teboulle. Gradient-based algorithms with applications to signal-recovery problems. Convex Optimization in Signal Processing and Communications, pages 42-88, 2010.

[3] M. Bilenko, S. Basu, and R.J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML, page 11, 2004.

[4] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric learning. In ICML, 2007.

[5] H. Do, A. Kalousis, J. Wang, and A. Woznica. A metric learning perspective of SVM: on the relation of SVM and LMNN. In AISTATS, 2012.

[6] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, 2008.

[7] A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local distance functions. In Advances in Neural Information Processing Systems, volume 19, pages 417-424. MIT Press, 2007.

[8] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Trans. on PAMI, 1996.

[9] P. Jain, B. Kulis, J.V. Davis, and I.S. Dhillon. Metric and kernel learning using a linear transformation. JMLR, 2012.

[10] R. Jin, S. Wang, and Y. Zhou. Regularized distance metric learning: Theory and algorithm. In NIPS, 2009.

[11] Y.K. Noh, B.T. Zhang, and D.D. Lee. Generative local metric learning for nearest neighbor classification. In NIPS, 2009.

[12] C. Shen, J. Kim, and L. Wang. A scalable dual approach to semidefinite metric learning. In CVPR, 2011.

[13] C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive semidefinite metric learning using boosting-like algorithms. JMLR, 2012.

[14] J. Wang, H. Do, A. Woznica, and A. Kalousis. Metric learning with multiple kernels. In NIPS, 2011.

[15] K.Q. Weinberger and L.K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 2009.

[16] D.Y.
Yeung and H. Chang. Locally smooth metric learning with application to image retrieval. In ICCV, 2007.

[17] Y. Ying, K. Huang, and C. Campbell. Sparse metric learning via smooth optimization. In NIPS, 2009.

[18] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In NIPS, 2009.

[19] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2004.