{"title": "Metric Learning with Multiple Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1170, "page_last": 1178, "abstract": "Metric learning has become a very active research field. The most popular representative--Mahalanobis metric learning--can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection.", "full_text": "Metric Learning with Multiple Kernels\n\nJun Wang\n\nHuyen Do\n\nAdam Woznica\n\nAlexandros Kalousis\n\n{Jun.Wang, Huyen.Do, Adam.Woznica, Alexandros.Kalousis}@unige.ch\n\nAI Lab, Department of Informatics\nUniversity of Geneva, Switzerland\n\nAbstract\n\nMetric learning has become a very active research \ufb01eld. 
The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can also be readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection.

1 Introduction

Metric learning (ML), which aims at learning dissimilarities by determining the importance of different input features and their correlations, has become a very active research field over the last years [23, 5, 3, 14, 22, 7, 12]. The most prominent form of ML is learning the Mahalanobis metric. Its computation can be seen as a two-step process; in the first step we perform a linear projection of the instances and in the second step we compute their Euclidean metric in the projected space.

Very often a linear projection cannot adequately represent the inherent complexities of a problem at hand. 
To address this limitation various works proposed kernelized versions of ML methods in order to implicitly compute a linear transformation and Euclidean metric in some non-linear feature space; this computation results in a non-linear projection and distance computation in the original input space [23, 5, 3, 14, 22]. However, we are now faced with a new problem, namely that of finding the appropriate kernel function and the associated feature space matching the requirements of the learning problem.

The simplest approach to address this problem is to select the best kernel from a predefined kernel set using internal cross-validation. The main drawback of this approach is that only one kernel is selected, which limits the expressiveness of the resulting method. Additionally, this approach is limited to a small number of kernels–due to computational constraints–and requires the use of extra data. Multiple Kernel Learning (MKL) [10, 17] lifts the above limitations by learning a linear combination of a number of predefined kernels. The MKL approach can also naturally handle multiple-source learning scenarios where, instead of combining kernels defined on a single input data, which depending on the selected kernels could give rise to feature spaces with redundant features, we combine different and complementary data sources. In [11, 13] the authors propose a method that learns a distance metric for multiple-source problems within a multiple-kernel scenario. The proposed method defines the distance of two instances as the sum of their distances in the feature spaces induced by the different kernels. During learning, a set of Mahalanobis metrics, one for each source, is learned jointly. However, this approach ignores the potential correlations between the different kernels. 
To the best of our knowledge most of the work on MKL has been confined to the framework of SVMs, and despite the recent popularity of ML there exists so far no work that performs MKL in the ML framework by learning a distance metric in the weighted linear combination of feature spaces.

In this paper we show how to perform Mahalanobis ML with MKL. We first propose a general framework of ML with MKL which can be instantiated with virtually any Mahalanobis ML algorithm h provided that the latter satisfies some stated conditions. We examine two parametrizations of the learning problem that give rise to two alternative formulations, denoted by MLh-MKLµ and MLh-MKLP. Our approach can be seen as the counterpart of MKL with SVMs [10, 20, 17] for ML. Since the learned metric matrix has a regularized form (i.e. it has internal structure) we also propose a straightforward non-regularized version of ML with MKL, denoted by NR-MLh-MKL; however, due to the number of free parameters the non-regularized version can only scale to a very small number of kernels and requires ML methods that are able to cope with large dimensionalities. We performed a number of experiments for ML with MKL in which, for the needs of this paper, we have chosen the well-known Large Margin Nearest Neighbor [22] (LMNN) algorithm as the ML method h. The experimental results suggest that LMNN-MKLP outperforms LMNN with an unweighted kernel combination and the single best kernel selected by internal cross-validation.

2 Preliminaries

In the different flavors of metric learning we are given a matrix of learning instances X : n × d, the i-th row of which is the instance x_i^T ∈ R^d, and a vector of class labels y = (y_1, ..., y_n)^T, y_i ∈ {1, ..., c}. Consider a mapping Φ_l(x) of instances x to some feature space H_l, i.e. x → Φ_l(x) ∈ H_l. 
The corresponding kernel function k_l(x_i, x_j) computes the inner product of two instances in the H_l feature space, i.e. k_l(x_i, x_j) = ⟨Φ_l(x_i), Φ_l(x_j)⟩. We denote the dimensionality of H_l (possibly infinite) by d_l. The squared Mahalanobis distance of two instances in the H_l space is given by

d^2_{M_l}(Φ_l(x_i), Φ_l(x_j)) = (Φ_l(x_i) − Φ_l(x_j))^T M_l (Φ_l(x_i) − Φ_l(x_j))

where M_l is a Positive Semi-Definite (PSD) metric matrix in the H_l space (M_l ⪰ 0). For some given ML method h we optimize (most often minimize) some cost function F_h with respect to the M_l metric matrix(1) under the PSD constraint for M_l and an additional set of pairwise distance constraints C_h({d^2_{M_l}(Φ_l(x_i), Φ_l(x_j)) | i, j = 1, ..., n}) that depend on the choice of h, e.g. similarity and dissimilarity pairwise constraints [3] and relative comparison constraints [22]. In the remainder of this paper, for simplicity, we denote this set of constraints by C_h(d^2_{M_l}(Φ_l(x_i), Φ_l(x_j))). The kernelized ML optimization problem can now be written as:

min_{M_l} F_h(M_l)   s.t.  C_h(d^2_{M_l}(Φ_l(x_i), Φ_l(x_j))),  M_l ⪰ 0   (1)

Kernelized ML methods do not require learning the explicit form of the Mahalanobis metric M_l. It was shown in [9] that the optimal solution of the Mahalanobis metric M_l has the form M_l = η_h I + Φ_l(X)^T A_l Φ_l(X), where I is the identity matrix of dimensionality d_l × d_l, A_l is an n × n PSD matrix, Φ_l(X) is the matrix of learning instances in the H_l space (with instances in rows), and η_h is a constant that depends on the ML method h. Since in the vast majority of the existing ML methods [19, 8, 18, 23, 5, 14, 22] the value of the constant η_h is zero, in this paper we only consider the optimal form of M_l with η_h = 0. Under the optimal parametrization M_l = Φ_l(X)^T A_l Φ_l(X) the squared Mahalanobis distance becomes:

d^2_{M_l}(Φ_l(x_i), Φ_l(x_j)) = (K_l^i − K_l^j)^T A_l (K_l^i − K_l^j) = d^2_{A_l}(Φ_l(x_i), Φ_l(x_j))   (2)

where K_l^i is the i-th column of the kernel matrix K_l, the (i, j) element of which is K_{lij} = k_l(x_i, x_j). As a result, (1) can be rewritten as:

min_{A_l} F_h(Φ_l(X)^T A_l Φ_l(X))   s.t.  C_h(d^2_{A_l}(Φ_l(x_i), Φ_l(x_j))),  A_l ⪰ 0   (3)

(1) The optimization could also be done with respect to other variables of the cost function and not only M_l. However, to keep the notation uncluttered we parametrize the optimization problem only over M_l.

In MKL we are given a set of kernel functions Z = {k_l(x_i, x_j) | l = 1, ..., m} and the goal is to learn an appropriate kernel function k_µ(x_i, x_j) parametrized by µ under a cost function Q. The cost function Q is determined by the cost function of the learning method that is coupled with multiple kernel learning, e.g. it can be the SVM cost function if one is using an SVM as the learning approach. As in [10, 17] we parametrize k_µ(x_i, x_j) by a linear combination of the form:

k_µ(x_i, x_j) = Σ_{l=1}^m µ_l k_l(x_i, x_j),  µ_l ≥ 0,  Σ_{l=1}^m µ_l = 1   (4)

We denote by H_µ the feature space induced by the k_µ kernel, given by the mapping x → Φ_µ(x) = (√µ_1 Φ_1(x)^T, ..., √µ_m Φ_m(x)^T)^T ∈ H_µ. We denote the dimensionality of H_µ by d; it can be infinite. Finally, we denote by H the feature space that we get by the unweighted concatenation of the m feature spaces, i.e. ∀µ_i, µ_i = 1, whose representation is given by x → Φ(x) = (Φ_1(x)^T, ..., Φ_m(x)^T)^T.
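The parametrization in (2) reduces the distance computation to arithmetic on kernel-matrix columns; a minimal numeric sketch (function name is ours):

```python
import numpy as np

def kernelized_sq_distance(K, A, i, j):
    """Squared Mahalanobis distance of eq. (2): with M = Phi(X)^T A Phi(X),
    d^2(x_i, x_j) = (K^i - K^j)^T A (K^i - K^j), where K^i is the i-th
    column of the kernel matrix K and A is an n x n PSD matrix."""
    diff = K[:, i] - K[:, j]
    return float(diff @ A @ diff)
```

With A = I this is simply the squared Euclidean distance between the i-th and j-th kernel columns; learning A reshapes that geometry.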
3 Metric Learning with Multiple Kernel Learning

The goal is to learn a metric matrix M in the feature space H_µ induced by the mapping Φ_µ, as well as the kernel weight vector µ; we denote this metric by d^2_{M,µ}. Based on the optimal form of the Mahalanobis metric M for a metric learning method learning with a single kernel function [9], we have the following lemma:

Lemma 1. Assume that for a metric learning method h the optimal parametrization of its Mahalanobis metric M* is Φ_l(X)^T A* Φ_l(X), for some A*, when learning with a single kernel function k_l(x, x'). Then, for h with multiple kernel learning the optimal parametrization of its Mahalanobis metric M** is given by Φ_µ(X)^T A** Φ_µ(X), for some A**.

The proof of the above lemma is similar to the proof of Theorem 1 in [9] (it is not presented here due to the lack of space). Following Lemma 1, we have:

d^2_{M,µ}(Φ_µ(x_i), Φ_µ(x_j)) = (Φ_µ(x_i) − Φ_µ(x_j))^T Φ_µ(X)^T A Φ_µ(X) (Φ_µ(x_i) − Φ_µ(x_j))
                             = (Σ_l µ_l (K_l^i − K_l^j))^T A (Σ_l µ_l (K_l^i − K_l^j)) = d^2_{A,µ}(Φ_µ(x_i), Φ_µ(x_j))   (5)

Based on (5) and the constraints from (4), the ML optimization problem with MKL can be presented as:

min_{A,µ} F_h(Φ_µ(X)^T A Φ_µ(X))   s.t.  C_h(d^2_{A,µ}(Φ_µ(x_i), Φ_µ(x_j))),  A ⪰ 0,  µ_l ≥ 0,  Σ_{l}^m µ_l = 1   (6)

We denote the resulting optimization problem and learning method by MLh-MKLµ; clearly this is not fully specified until we choose a specific ML method h.

Let B be the m × n matrix whose l-th row is (K_l^i − K_l^j)^T:

B = [ (K_1^i − K_1^j)^T
      ...
      (K_m^i − K_m^j)^T ]

We note that d^2_{A,µ}(Φ_µ(x_i), Φ_µ(x_j)) from (5) can also be written as:

d^2_{A,µ}(Φ_µ(x_i), Φ_µ(x_j)) = µ^T B A B^T µ = tr(P B A B^T) = d^2_{A,P}(Φ_P(x_i), Φ_P(x_j))   (7)

where P = µµ^T and tr(·) is the trace of a matrix. We use Φ_P(X) to emphasize the explicit dependence of Φ_µ(X) on P = µµ^T. As a result, instead of optimizing over µ we can also use the parametrization over P; the new optimization problem can now be written as:

min_{A,P} F_h(Φ_P(X)^T A Φ_P(X))   s.t.  C_h(d^2_{A,P}(Φ_P(x_i), Φ_P(x_j))),  A ⪰ 0,
          Σ_{ij} P_{ij} = 1,  P_{ij} ≥ 0,  Rank(P) = 1,  P = P^T   (8)

where the constraints Σ_{ij} P_{ij} = 1, P_{ij} ≥ 0, Rank(P) = 1, and P = P^T are added so that P = µµ^T. We call the optimization problem and learning method (8) MLh-MKLP; as before, in order to fully instantiate it we need to choose a specific metric learning method h.

Now, we derive an alternative parametrization of (5). We need two additional matrices: C_{µ_iµ_j} = µ_iµ_j I, where the dimensionality of I is n × n, and Φ'(X), which is an mn × d dimensional block-diagonal matrix:

Φ'(X) = [ Φ_1(X)   0     ...   0
          0      Φ_2(X)  ...   0
          ...
          0        0     ...  Φ_m(X) ]

We have:

d^2_{A,µ}(Φ_µ(x_i), Φ_µ(x_j)) = (Φ(x_i) − Φ(x_j))^T M' (Φ(x_i) − Φ(x_j))   (9)

where:

M' = Φ'(X)^T A' Φ'(X)   (10)

and A' is an mn × mn matrix:

A' = [ C_{µ_1µ_1}A  ...  C_{µ_1µ_m}A
       ...
       C_{µ_mµ_1}A  ...  C_{µ_mµ_m}A ]   (11)

From (9) we see that the Mahalanobis metric, parametrized by the M or A matrix, in the feature space H_µ induced by the kernel k_µ, is equivalent to a Mahalanobis metric in the feature space H which is parametrized by M' or A'. As we can see from (11), MLh-MKLµ and MLh-MKLP learn a regularized matrix A' (i.e. a matrix with internal structure) that corresponds to a parametrization of the Mahalanobis metric M' in the feature space H.

3.1 Non-Regularized Metric Learning with Multiple Kernel Learning

We present here a more general formulation of the optimization problem (6) in which we lift the regularization of matrix A' from (11), and learn instead a full PSD matrix A'':

A'' = [ A_11  ...  A_1m
        ...
        A_1m  ...  A_mm ]   (12)

where A_kl is an n × n matrix. The respective Mahalanobis matrix, which we denote by M'', still has the same parametrization form as in (10), i.e. M'' = Φ'(X)^T A'' Φ'(X). As a result, by using A'' instead of A' the squared Mahalanobis distance can now be written as:

d^2_{A''}(Φ(x_i), Φ(x_j)) = (Φ(x_i) − Φ(x_j))^T M'' (Φ(x_i) − Φ(x_j))
 = [(K_1^i − K_1^j)^T, ..., (K_m^i − K_m^j)^T] A'' [(K_1^i − K_1^j)^T, ..., (K_m^i − K_m^j)^T]^T
 = (Φ_Z(x_i) − Φ_Z(x_j))^T A'' (Φ_Z(x_i) − Φ_Z(x_j))   (13)

where Φ_Z(x_i) = ((K_1^i)^T, ..., (K_m^i)^T)^T ∈ H_Z. What we see here is that under the M'' parametrization computing the Mahalanobis metric in H is equivalent to computing the Mahalanobis metric in the H_Z space.
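Equations (5), (7) and (13) can be checked numerically; the sketch below (names are ours) implements the µ-weighted distance through the matrix B, and the stacked representation Φ_Z used by the non-regularized variant:

```python
import numpy as np

def mkl_sq_distance(Ks, A, mu, i, j):
    """d^2_{A,mu} of eqs. (5)/(7): stack one row (K_l^i - K_l^j)^T per kernel
    into B; the distance equals mu^T B A B^T mu = tr(P B A B^T), P = mu mu^T."""
    B = np.stack([K[:, i] - K[:, j] for K in Ks])   # m x n
    v = B.T @ mu                                    # = sum_l mu_l (K_l^i - K_l^j)
    return float(v @ A @ v)

def phi_z(Ks, i):
    """Phi_Z(x_i) of eq. (13): the stacked i-th columns of the m kernel matrices."""
    return np.concatenate([K[:, i] for K in Ks])

def nr_sq_distance(Ks, A2, i, j):
    """Non-regularized distance d^2_{A''} of eq. (13): a full mn x mn PSD
    matrix A'' acting on Phi_Z differences."""
    diff = phi_z(Ks, i) - phi_z(Ks, j)
    return float(diff @ A2 @ diff)
```

With m identical kernels and uniform weights, `mkl_sq_distance` reduces to the single-kernel distance of (2), which is a quick sanity check on the parametrization.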
Under the parametrization of the Mahalanobis distance given by (13), the optimization problem of metric learning with multiple kernel learning is the following:

min_{A''} F_h(Φ'(X)^T A'' Φ'(X))   s.t.  C_h(d^2_{A''}(Φ(x_i), Φ(x_j))),  A'' ⪰ 0   (14)

We call this optimization problem NR-MLh-MKL. We should note that this formulation has scaling problems since it has O(m^2 n^2) parameters that need to be estimated, and it clearly requires a very efficient ML method h in order to be practical.

4 Optimization

4.1 Analysis

The NR-MLh-MKL optimization problem obviously has the same convexity properties as the metric learning algorithm h that will be used, since the parametrization M'' = Φ'(X)^T A'' Φ'(X) used in NR-MLh-MKL is linear in A'', and the composition of a function with an affine mapping preserves the convexity property of the original function [1]. This is also valid for the subproblems of learning the matrix A in MLh-MKLµ and MLh-MKLP given the weight vector µ.

Given the PSD matrix A, we have the following two lemmas for the optimization problems MLh-MKL{µ|P}:

Lemma 2. Given the PSD matrix A, the MLh-MKLµ optimization problem is convex in µ if the metric learning algorithm h is convex in µ.

Proof. 
The last two constraints on µ of the optimization problem (6) are linear; thus the problem is convex if the metric learning algorithm h is convex in µ.

Since d^2_{A,µ}(Φ_µ(x_i), Φ_µ(x_j)) is convex quadratic in µ, which can easily be proved based on the PSD property of the matrix B A B^T in (7), many of the well known metric learning algorithms, such as Pairwise SVM [21], POLA [19] and Xing's method [23], satisfy the conditions of Lemma 2.

The MLh-MKLP optimization problem (8) is not convex given a PSD matrix A because the rank constraint is not convex. However, when the number of kernels m is small, e.g. a few tens of kernels, there is an equivalent convex formulation.

Lemma 3. Given the PSD matrix A, the MLh-MKLP optimization problem (8) can be formulated as an equivalent convex problem with respect to P if the ML algorithm h is linear in P and the number of kernels m is small.

Proof. Given the PSD matrix A, if h is linear in P, we can formulate the rank-constrained problem with the help of the two following convex problems [2]:

min_P F_h(Φ_P(X)^T A Φ_P(X)) + w · tr(P^T W)   s.t.  C_h(d^2_{A,P}(Φ_P(x_i), Φ_P(x_j))),  A ⪰ 0,  P ⪰ 0,
      Σ_{ij} P_{ij} = 1,  P_{ij} ≥ 0,  P = P^T   (15)

where w is a positive scalar just large enough to make tr(P^T W) vanish, i.e. to reach the global convergence defined in (17), and the direction matrix W is an optimal solution of the following problem:

min_W tr(P*^T W)   s.t.  0 ⪯ W ⪯ I,  tr(W) = m − 1   (16)

where P* is an optimal solution of (15) given A and W, and m is the number of kernels. Problem (16) has a closed-form solution W = U U^T, where U ∈ R^{m×(m−1)} is the eigenvector matrix of P* whose columns are the eigenvectors which correspond to the m − 1 smallest eigenvalues of P*. The two convex problems are iteratively solved until global convergence, defined as:

Σ_{i=2}^m λ(P*)_i = tr(P*^T W*) = λ(P*)^T λ(W*) ≡ 0   (17)

where λ(P*)_i is the i-th largest eigenvalue of P*. This formulation is not a projection method. At global convergence the convex problem (15) is not a relaxation of the original problem; instead it is an equivalent convex problem [2].

We will now prove the convergence of problem (15). Suppose the objective value of (15) is f_i at iteration i. Since both (15) and (16) minimize the objective value of (15), we have f_j < f_i for any iteration j > i. The infimum f* of the objective values corresponds to the optimal objective value of (15) when the second term is removed. Thus the nonincreasing sequence of objective values is bounded below and as a result converges, because any bounded monotonic sequence in R is convergent. The local convergence of (15) is thus established.

Only local convergence can be established for problem (15) because the objective tr(P^T W) is generally multimodal [2]. However, as indicated in Section 7.2 of [2], when the size of m is small, the global optimum of problem (15) is often achieved. This can be simply verified by comparing the difference between the infimum f* and the optimal objective value f of problem (15).

For a number of known metric learning algorithms, such as LMNN [22], POLA [19], MLSVM [14] and Xing's method [23], linearity with respect to P holds given A ⪰ 0.

Algorithm 1 MLh-MKLµ, MLh-MKLP

Input: X, Y, A_0, µ_0, and matrices K_1, ..., K_m
Output: A and µ
repeat
  µ(i) = WeightLearning(A(i−1))
  K_µ(i) = Σ_k µ(i)_k K_k
  A(i) = MetricLearning_h(A(i−1), X, K_µ(i))
  i := i + 1
until convergence

4.2 Optimization Algorithms

The NR-MLh-MKL optimization problem can be directly solved by any metric learning algorithm h on the space H_Z when the optimization problem of the latter only involves the squared pairwise Mahalanobis distance, e.g. LMNN [22] and MCML [5]. When the metric learning algorithm h has a regularization term on M, e.g. trace norm [8] or Frobenius norm [14, 19], most often the NR-MLh-MKL optimization problem can be solved by a slight modification of the original algorithm.

We now describe how we can solve the optimization problems of MLh-MKLµ and MLh-MKLP. Based on Lemmas 2 and 3 we propose for both methods a two-step iterative algorithm, Algorithm 1, in the first step of which we learn the kernel weighting and in the second the metric under the kernel weighting learned in the first step. At the first step of the i-th iteration we learn the µ(i) kernel weight vector under the fixed PSD matrix A(i−1), learned at the preceding iteration (i − 1). For MLh-MKLµ we solve the weight learning problem using Lemma 2 and for MLh-MKLP using Lemma 3. At the second step we apply the metric learning algorithm h and learn the PSD matrix A(i) with the kernel matrix K_µ(i) = Σ_l µ(i)_l K_l, using A(i−1) as the initial metric matrix. We should make clear that the optimization problem we are solving is only individually convex with respect to µ given the PSD matrix A, and vice-versa. As a result, the convergence of the two-step algorithm (possibly to a local optimum) is guaranteed [6] and is checked by the variation of µ and the objective value of the metric learning method h. 
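The two-step alternation of Algorithm 1, together with the closed-form direction matrix W = UU^T of (16), can be sketched as follows; the two learning steps are passed in as callables since they depend on the chosen method h, and the names and the convergence check on µ are ours:

```python
import numpy as np

def direction_matrix(P_star):
    """Closed-form solution of (16): W = U U^T, where the columns of U are
    the eigenvectors of P* for its m-1 smallest eigenvalues."""
    vals, vecs = np.linalg.eigh(P_star)   # eigenvalues in ascending order
    U = vecs[:, :-1]                      # drop the eigenvector of the largest one
    return U @ U.T

def ml_mkl(Ks, weight_learning, metric_learning, mu0, A0,
           max_iter=20, tol=1e-6):
    """Two-step alternation of Algorithm 1 (sketch): weight_learning solves
    the mu/P subproblem given A (Lemma 2 or 3); metric_learning runs the ML
    method h on the combined kernel K_mu given the previous A."""
    mu, A = mu0, A0
    for _ in range(max_iter):
        mu_new = weight_learning(A)                       # step 1: weights
        K_mu = sum(m_l * K for m_l, K in zip(mu_new, Ks)) # K_mu = sum_l mu_l K_l
        A = metric_learning(A, K_mu)                      # step 2: metric
        done = np.linalg.norm(mu_new - mu) < tol
        mu = mu_new
        if done:
            break
    return A, mu
```

For a rank-one P* = µµ^T the returned W spans the orthogonal complement of µ, so tr(P*^T W) = 0, which is exactly the global-convergence condition (17).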
In our experiments (Section 6) we observed that it most often converges in less than ten iterations.

5 LMNN-Based Instantiation

We have presented two basic approaches to metric learning with multiple kernel learning: MLh-MKLµ (MLh-MKLP) and NR-MLh-MKL. In order for the approaches to be fully instantiated we have to specify the ML algorithm h. In this paper we focus on the state-of-the-art LMNN method [22]. Due to its relative comparison constraints, LMNN does not satisfy the condition of Lemma 2. However, as we already mentioned, LMNN satisfies the condition of Lemma 3, so we get the MLh-MKLP variant of the optimization problem for LMNN, which we denote by LMNN-MKLP. The resulting optimization problem is:

min_{A,P,ξ} Σ_{ij} S_{ij} {(1 − γ) d^2_{A,P}(Φ_P(x_i), Φ_P(x_j)) + γ Σ_k (1 − Y_{ik}) ξ_{ijk}}   (18)
s.t.  d^2_{A,P}(Φ_P(x_i), Φ_P(x_k)) − d^2_{A,P}(Φ_P(x_i), Φ_P(x_j)) ≥ 1 − ξ_{ijk},  ξ_{ijk} > 0,  A ⪰ 0,
      Σ_{kl} P_{kl} = 1,  P_{kl} ≥ 0,  Rank(P) = 1,  P = P^T

where the matrix Y, Y_{ij} ∈ {0, 1}, indicates whether the class labels y_i and y_j are the same (Y_{ij} = 1) or different (Y_{ij} = 0). The matrix S is a binary matrix whose S_{ij} entry is non-zero if instance x_j is one of the k same-class nearest neighbors of instance x_i. The objective is to minimize the sum of the distances of all instances to their k same-class nearest neighbors while allowing for some errors, the trade-off being controlled by the γ parameter. As the objective function of LMNN only involves the squared pairwise Mahalanobis distances, the instantiation of NR-MLh-MKL is straightforward and consists simply of the application of LMNN on the space H_Z in order to learn the metric. We denote this instantiation by NR-LMNN-MKL.

Table 1: Accuracy results. The superscripts +, −, = next to the accuracies of NR-LMNN-MKL and LMNN-MKLP indicate the result of McNemar's statistical test of their comparison to the accuracies of LMNNH and LMNN-MKLCV, and denote respectively a significant win, loss or no difference. The number in parentheses indicates the score of the respective algorithm for the given dataset based on the pairwise comparisons of McNemar's statistical test.

Datasets        NR-LMNN-MKL    LMNN-MKLP      LMNNH        LMNN-MKLCV   1-NN
Sonar           88.46+=(3.0)   85.58==(2.0)   82.21(1.0)   88.46(3.0)   82.21(1.0)
Wine            98.88==(2.0)   98.88==(2.0)   98.31(2.0)   96.07(2.0)   97.19(2.0)
Iris            93.33==(2.0)   95.33==(2.0)   94.67(2.0)   94.00(2.0)   95.33(2.0)
Ionosphere      93.73==(2.5)   94.87=+(3.0)   92.59(2.5)   90.88(2.0)   86.89(0.0)
Wdbc            94.90−=(1.0)   97.36=+(3.5)   97.36(3.0)   95.96(1.5)   95.43(1.0)
CentralNervous  55.00==(2.0)   63.33==(2.0)   65.00(2.0)   65.00(2.0)   58.33(2.0)
Colon           80.65==(2.0)   85.48+=(2.5)   66.13(1.5)   79.03(2.0)   74.19(2.0)
Leukemia        95.83+=(2.5)   94.44+=(2.5)   70.83(0.0)   95.83(2.5)   88.89(2.5)
MaleFemale      86.57==(2.5)   88.81+=(3.0)   80.60(1.5)   89.55(3.0)   58.96(0.0)
Ovarian         95.26+=(3.0)   94.47+=(3.0)   90.51(0.5)   94.47(3.0)   87.35(0.5)
Prostate        79.50==(2.0)   80.43==(2.5)   79.19(2.0)   78.88(2.0)   76.71(1.5)
Stroke          69.71==(2.0)   72.12==(2.0)   71.15(2.0)   70.19(2.0)   65.38(2.0)
Total Score     26.5           30.0           20.0         27.0         16.5

6 Experiments

In this section we perform a number of experiments on real world datasets in order to compare the two LMNN-based instantiations of our framework, i.e. LMNN-MKLP and NR-LMNN-MKL. We compare these methods against two baselines: LMNN-MKLCV, in which a kernel is selected from a set of kernels using 2-fold inner cross-validation (CV), and LMNN with the unweighted sum of kernels, which induces the H feature space, denoted by LMNNH. 
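Given precomputed squared distances and slack values, the objective of (18) is a plain weighted sum; a minimal sketch under the stated definitions of S and Y (the array layout, with `xi[i, j, k]` holding ξ_ijk, is ours):

```python
import numpy as np

def lmnn_mkl_objective(d2, S, Y, xi, gamma=0.5):
    """Objective of eq. (18): (1-gamma) * pull term over target-neighbor
    pairs (S_ij = 1) plus gamma * hinge slacks xi_ijk for differently
    labeled impostors k (Y_ik = 0)."""
    pull = (1 - gamma) * np.sum(S * d2)
    push = gamma * np.sum(S[:, :, None] * (1 - Y)[:, None, :] * xi)
    return float(pull + push)
```

This only evaluates the objective; the actual LMNN-MKLP solver additionally enforces the margin, PSD and rank-one constraints of (18).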
Additionally, we report the performance of 1-Nearest-Neighbor, denoted by 1-NN, with no metric learning. The PSD matrix A and the weight vector µ in LMNN-MKLP were initialized by I and by equal weighting (1 divided by the number of kernels), respectively. The parameter w in the weight learning subproblem of LMNN-MKLP was selected from {10^i | i = 0, 1, ..., 8} as the smallest value sufficient to achieve global convergence. Its direction matrix W was initialized by 0. The number of k same-class nearest neighbors required by LMNN was set to 5 and its γ parameter to 0.5. After learning the metric and the multiple kernel combination we used 1-NN for classification.

6.1 Benchmark Datasets

We first experimented with 12 different datasets: five from the UCI machine learning repository, i.e. Sonar, Ionosphere, Wine, Iris, and Wdbc; three microarray datasets, i.e. CentralNervous, Colon, and Leukemia; and four proteomics datasets, i.e. MaleFemale, Stroke, Prostate and Ovarian. The attributes of all the datasets are standardized in the preprocessing step. The Z set of kernels that we use consists of the following 20 kernels: ten polynomial with degree from one to ten, and ten Gaussian with bandwidth σ ∈ {0.5, 1, 2, 5, 7, 10, 12, 15, 17, 20} (the same set of kernels was used in [4]). Each basic kernel K_k was normalized by the average of its diag(K_k). LMNN-MKLP, LMNNH and LMNN-MKLCV were tested using the complete Z set. For NR-LMNN-MKL, due to its scaling limitations, we could only use a small subset of Z consisting of the linear kernel, the second-order polynomial kernel, and the Gaussian kernel with kernel width 0.5. We use 10-fold CV to estimate the predictive performance of the different methods. To test the statistical significance of the differences we used McNemar's test and we set the p-value to 0.05. 
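A sketch of building such a Z kernel set; the exact kernel forms, (x^T y)^d for the polynomial and exp(−‖x−y‖²/2σ²) for the Gaussian, are our assumptions, as the paper does not spell them out:

```python
import numpy as np

def build_kernel_set(X, degrees=range(1, 11),
                     sigmas=(0.5, 1, 2, 5, 7, 10, 12, 15, 17, 20)):
    """Candidate Z set of Section 6.1: ten polynomial kernels (degrees 1..10)
    and ten Gaussian kernels, each normalized by the mean of its diagonal."""
    G = X @ X.T                                              # Gram matrix
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G   # pairwise sq. dists
    Ks = [G ** d for d in degrees]
    Ks += [np.exp(-sq / (2 * s ** 2)) for s in sigmas]
    return [K / np.mean(np.diag(K)) for K in Ks]
```

The diagonal normalization puts all 20 kernels on a comparable scale, so the learned weights µ are not dominated by kernels with large raw magnitudes.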
To get a better understanding of the relative performance of the different methods on a given dataset we used a ranking schema in which a method A was assigned one point if its accuracy was significantly better than that of another method B, 0.5 points if the two methods did not have a significantly different performance, and zero points if A was found to be significantly worse than B.

The results are reported in Table 1. First, we observe that by learning the kernel inside LMNN-MKLP we improve performance over LMNNH, which uses the unweighted kernel combination. More precisely, LMNN-MKLP is significantly better than LMNNH in four out of the twelve datasets. If we now compare LMNN-MKLP with LMNN-MKLCV, the other baseline method where we select the best kernel with CV, we can see that LMNN-MKLP also performs better, being statistically significantly better in two datasets. If we now examine NR-LMNN-MKL and LMNNH we see that the former method, even though learning with only three kernels, is significantly better in two datasets, while it is significantly worse in one dataset. Comparing NR-LMNN-MKL and LMNN-MKLCV we observe that the two methods achieve comparable predictive performances. We should stress here that NR-LMNN-MKL is at a disadvantage since it only uses three kernels, as opposed to the other methods that use 20 kernels; the scalability of NR-LMNN-MKL is left as future work. 

Table 2: Accuracy results on the multiple source datasets.

Datasets          LMNN-MKLP      LMNNH        LMNN-MKLCV   1-NN
Multiple Feature  98.79++(3.0)   98.44(1.5)   98.44(1.5)   97.86(0.0)
Oxford Flowers    86.01++(3.0)   85.74(2.0)   65.46(0.0)   67.38(1.0)

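The per-dataset scores in parentheses in Tables 1 and 2 follow the pairwise ranking schema described above; a minimal sketch (the encoding of McNemar-test outcomes as '+', '-', '=' is ours):

```python
from itertools import combinations

def rank_scores(sig):
    """Ranking schema of Section 6.1: sig[(a, b)] in {'+', '-', '='} encodes
    the outcome of McNemar's test comparing method a to method b; a win
    earns 1 point, a tie 0.5, a loss 0."""
    methods = sorted({m for pair in sig for m in pair})
    scores = {m: 0.0 for m in methods}
    for a, b in combinations(methods, 2):
        # use the stored outcome, or flip the reverse-ordered one
        out = sig.get((a, b)) or {'+': '-', '-': '+', '=': '='}[sig[(b, a)]]
        scores[a] += {'+': 1.0, '=': 0.5, '-': 0.0}[out]
        scores[b] += {'+': 0.0, '=': 0.5, '-': 1.0}[out]
    return scores
```

Summing these per-dataset scores over all datasets yields the Total Score row of Table 1.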
In terms of the total score obtained by the different methods, the best one is LMNN-MKLP, followed by LMNN-MKLCV and NR-LMNN-MKL.

6.2 Multiple Source Datasets

To evaluate the proposed method on problems with multiple sources of information we also performed experiments on the Multiple Features and the Oxford flowers datasets [16]. Multiple Features, from UCI, has six different feature representations for 2,000 handwritten digits (0-9); each class has 200 instances. In the preprocessing step all the features are standardized in all the data sources. The Oxford flowers dataset has 17 categories of flower images; each class has 80 instances. In the experiments, seven distance matrices from the website(2) are used; these matrices are precomputed respectively from seven features, the details of which are described in [16, 15]. For both datasets Gaussian kernels are constructed respectively using the different feature representations of the instances, with kernel width σ_0, where σ_0 is the mean of all pairwise distances. We experiment with 10 random splits where half of the data is used for training and the other half for testing. We do not experiment with NR-LMNN-MKL here due to its scaling limitations.

The accuracy results are reported in Table 2. We can see that by learning a linear combination of different feature representations LMNN-MKLP achieves the best predictive performance on both datasets, being significantly better than the two baselines, LMNNH and LMNN-MKLCV. The bad
The poor performance of LMNN-MKLCV on the Oxford Flowers dataset could be explained by the fact that the different Gaussian kernels are complementary for the given problem, whereas in LMNN-MKLCV only one kernel is selected.

7 Conclusions

In this paper we combine two recent developments in the field of machine learning, namely metric learning and multiple kernel learning, and propose a general framework for learning a metric in a feature space induced by a weighted combination of a number of individual kernels. This is in contrast with existing kernelized metric learning techniques, which consider only one kernel function (or possibly an unweighted combination of a number of kernels) and hence are sensitive to the selection of the associated feature space. The proposed framework is general as it can be coupled with many existing metric learning techniques. In this work, to practically demonstrate the effectiveness of the proposed approach, we instantiate it with the well-known LMNN metric learning method. The experimental results confirm that the adaptively induced feature space does bring an advantage in terms of predictive performance with respect to feature spaces induced by an unweighted combination of kernels and the single best kernel selected by internal CV.

Acknowledgments

This work was funded by the Swiss NSF (Grant 200021-122283/1). The support of the European Commission through EU projects DebugIT (FP7-217139) and e-LICO (FP7-231519) is also gratefully acknowledged.

2http://www.robots.ox.ac.uk/~vgg/data/flowers/index.html

References

[1] S.P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] J. Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing USA, 2005.
[3] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric learning. In ICML, 2007.
[4] K. Gai, G. Chen, and C. Zhang.
Learning kernels with radiuses of minimum enclosing balls. In NIPS, 2010.
[5] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS, 2006.
[6] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26(3):127–136, 2000.
[7] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In ICCV, pages 498–505, 2009.
[8] K. Huang, Y. Ying, and C. Campbell. GSML: A unified framework for sparse metric learning. In ICDM, pages 189–198. IEEE, 2009.
[9] P. Jain, B. Kulis, and I. Dhillon. Inductive regularized learning of kernel functions. In NIPS, 2010.
[10] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[11] B. McFee and G. Lanckriet. Partial order embedding with multiple kernels. In ICML, pages 721–728. ACM, 2009.
[12] B. McFee and G. Lanckriet. Metric learning to rank. In ICML, 2010.
[13] B. McFee and G. Lanckriet. Learning multi-modal similarity. Journal of Machine Learning Research, 12:491–523, 2011.
[14] N. Nguyen and Y. Guo. Metric learning: A support vector approach. In ECML/PKDD, 2008.
[15] M.E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In CVPR, volume 2, pages 1447–1454. IEEE, 2006.
[16] M.E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722–729. IEEE, 2008.
[17] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
[18] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In NIPS, 2003.
[19] S. Shalev-Shwartz, Y. Singer, and A.Y. Ng. Online and batch learning of pseudo-metrics. In ICML, 2004.
[20] S. Sonnenburg, G. Ratsch, and C. Schafer. A general and efficient multiple kernel learning algorithm. In NIPS, 2006.
[21] J.P. Vert, J. Qiu, and W. Noble. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, 8(Suppl 10):S8, 2007.
[22] K.Q. Weinberger and L.K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
[23] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS, 2003.