{"title": "Inductive Regularized Learning of Kernel Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 946, "page_last": 954, "abstract": "In this paper we consider the fundamental problem of semi-supervised kernel function learning. We propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. We introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the $k$-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies.", "full_text": "Inductive Regularized Learning of Kernel Functions\n\nPrateek Jain\n\nMicrosoft Research Bangalore\n\nBangalore, India\n\nprajain@microsoft.com\n\nBrian Kulis\n\nUC Berkeley EECS and ICSI\n\nBerkeley, CA, USA\n\nkulis@eecs.berkeley.edu\n\nInderjit Dhillon\n\nAustin, TX, USA\n\nUT Austin Dept. of Computer Sciences\n\ninderjit@cs.utexas.edu\n\nAbstract\n\nIn this paper we consider the problem of semi-supervised kernel function learn-\ning. 
We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies.

1 Introduction

Learning kernel functions is an ongoing research topic in machine learning that focuses on learning an appropriate kernel function for a given task. While several methods have been proposed, many of the existing techniques can only be applied transductively [1-3]; i.e., they cannot be applied inductively to new data points. Of the methods that can be applied inductively, several are either too computationally expensive for large-scale data (e.g., hyperkernels [4]) or are limited to small classes of possible learned kernels (e.g.,
multiple kernel learning [5]).

In this paper, we propose and analyze a general kernel matrix learning problem using provided side-information over the training data. Our learning problem regularizes the desired kernel matrix via a convex regularizer chosen from a broad class, subject to convex constraints on the kernel. While the learned kernel matrix should be able to capture the provided side-information well, it is not clear how the information can be propagated to new data points. Our first main result demonstrates that our kernel matrix learning problem is equivalent to learning a linear transformation (LT) kernel function (a kernel of the form φ(x)^T W φ(y) for some matrix W ⪰ 0) with a specific regularizer. With the appropriate representation of W, this result implies that the learned LT kernel function can be naturally applied to new data. Additionally, we demonstrate that a large class of Mahalanobis metric learning methods can be seen as learning an LT kernel function, and so our result provides a constructive method for kernelizing these methods. Our analysis recovers some recent kernelization results for metric learning, but also implies several new results.

As our proposed kernel learning formulation learns a kernel matrix over the training points, the memory requirements scale quadratically in the number of training points, a common issue arising in kernel methods. To alleviate such issues, we propose an additional constraint to the learning formulation to reduce the number of parameters. We prove that the equivalence to LT kernel function learning still holds with the addition of this constraint, and that the resulting formulation can be scaled to very large data sets.

We then focus on a novel application of our framework to the problem of inductive semi-supervised kernel dimensionality reduction.
Our method is a special case of our kernel function learning framework with the trace-norm as the regularization function. As a result, we learn low-rank linear transformations, which correspond to low-dimensional embeddings of high- or infinite-dimensional kernel embeddings; unlike previous kernel dimensionality reduction methods, which are either unsupervised (kernel-PCA) or cannot easily be applied inductively to new data (spectral kernels [6]), our method intrinsically possesses both desirable properties. Furthermore, our method can handle a variety of side-information, e.g., class labels, click-through rates, etc. Finally, we validate the effectiveness of our proposed framework. We quantitatively compare several regularizers, including the trace-norm regularizer for dimensionality reduction, over standard data sets. We also apply the methods to an object recognition task in computer vision and qualitatively show results of dimensionality reduction on a handwritten digits data set.

Related Work: Most of the existing kernel learning methods can be classified into two broad categories. The first category includes parametric approaches, where the learned kernel function is restricted to be of a specific form and then the relevant parameters are learned according to the provided data. Prominent methods include multiple kernel learning [5], hyperkernels [4], infinite kernel learning [7], and hyper-parameter cross-validation [8]. Most of these methods either lack modeling flexibility, require non-convex optimization, or are restricted to a supervised learning scenario. The second category includes non-parametric methods, which explicitly model geometric structure in the data. Examples include spectral kernel learning [6], manifold-based kernel learning [9], and kernel target alignment [3]. However, most of these approaches are limited to the transductive setting and cannot be used to naturally generalize to new points.
In comparison, our method combines the strengths of both of the above approaches. We propose a general non-parametric kernel matrix learning framework, similar to methods of the second category. However, we show that our learned kernel matrix corresponds to a linear transformation kernel function parameterized by a PSD matrix. Hence, our method can also be applied in inductive settings without sacrificing significant modeling power. Furthermore, our methods can be applied to a variety of domains and with a variety of forms of side-information.

Existing work on learning linear transformations has largely focused on learning Mahalanobis distances; examples include [10-15], among others. POLA [13] and ITML [12] provide specialized kernelization techniques for their respective metric learning formulations. Kernelization of LMNN was discussed in [16], though it relied on a convex perturbation based formulation that can lead to suboptimal solutions. Recently, [17] showed kernelization for a class of metric learning algorithms including LMNN and NCA [15]; as we will see, our result is more general, in that we can prove kernelization over a larger class of problems and can also reduce the number of parameters to be learned. Independent of our work, [18] recently proved a representer-type theorem for spectral regularization functions. However, the framework they consider differs from ours in that they are interested in sensing an underlying high-dimensional matrix from given measurements.

Kernel dimensionality reduction methods can generally be divided into two categories: 1) semi-supervised dimensionality reduction in the transductive setting, and 2) supervised dimensionality reduction in the inductive setting. Methods in the first category include the incomplete Cholesky decomposition [19], colored maximum variance unfolding [20], and manifold preserving semi-supervised dimensionality reduction [21].
Methods in the second category include the kernel dimensionality reduction method of [22] and Gaussian Process latent variable models [23]. Kernel PCA [24] reduces the dimensionality in the inductive unsupervised setting, while various manifold learning methods can reduce the dimensionality, but only in the unsupervised transductive setting. In contrast, our dimensionality reduction method, which is an instantiation of our general kernel learning framework, can perform kernel dimensionality reduction simultaneously in both the semi-supervised and the inductive setting. Additionally, it can capture the manifold structure using an appropriate baseline kernel function, such as the one proposed by [25].

2 Learning Framework

Given an input kernel function κ : R^d × R^d → R and some side-information over a set of points X = {x_1, x_2, ..., x_n}, the goal is to learn a new kernel function κ_W that is regularized against κ but incorporates the provided side-information (the use of the subscript W will become clear later). The initial kernel function κ is of the form κ(x, y) = φ(x)^T φ(y) for some mapping φ. Throughout the rest of this paper, we will write φ_i as shorthand for φ(x_i), i.e., data point x_i after applying the mapping φ. We will also assume that the data vectors in X have been mapped via φ, resulting in Φ = {φ_1, φ_2, ..., φ_n}. Learning a kernel function from the provided side-information is an ill-posed problem, since infinitely many kernels can satisfy the provided supervision. A common approach is to formulate a transductive learning problem to learn a new kernel matrix over the training data. Denoting the input kernel matrix as K = Φ^T Φ, we aim to learn a new kernel matrix K_W that is regularized against K while satisfying the available side-information.
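For concreteness, the input kernel matrix K can be formed from any baseline kernel; the sketch below uses a Gaussian kernel (the baseline used in the experiments of Section 5). This is an illustrative sketch only; the function name and the bandwidth sigma are hypothetical choices, not values from the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = kappa(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.random.RandomState(0).randn(5, 3)   # 5 points in R^3
K = gaussian_kernel_matrix(X)
# K is a symmetric PSD matrix with unit diagonal, as required of an input kernel
assert np.allclose(K, K.T) and np.allclose(np.diag(K), 1.0)
assert np.linalg.eigvalsh(K).min() > -1e-10
```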
In this work, we study the following optimization problem:

    min_{K_W ⪰ 0}  f(K^{-1/2} K_W K^{-1/2})   s.t.  g_i(K_W) ≤ b_i,  1 ≤ i ≤ m,        (1)

where f and the g_i are functions from R^{n×n} → R. We call f the regularizer and the g_i the constraints. Note that if f and the constraints g_i are all convex functions, then the above problem can be solved optimally using standard convex optimization algorithms. Note also that our results hold for unconstrained variants of the above problem, as well as for variants that incorporate slack variables.

In general, such learning formulations are limited in that the learned kernel cannot readily be applied to new data points. However, we will show that the above problem is equivalent to learning a linear transformation (LT) kernel function. Formally, an LT kernel function κ_W is a kernel function of the form κ_W(x, y) = φ(x)^T W φ(y), where W is a positive semi-definite (PSD) matrix; we can think of the LT kernel as describing the linear transformation φ_i → W^{1/2} φ_i. A natural way to learn an LT kernel function would be to learn the parameterization matrix W using the provided side-information. To this end, we consider the following problem:

    min_{W ⪰ 0}  f(W)   s.t.  g_i(Φ^T W Φ) ≤ b_i,  1 ≤ i ≤ m,        (2)

where, as before, the function f is the regularizer and the functions g_i are the constraints that encode the side-information. The constraints g_i are assumed to be functions of the matrix Φ^T W Φ of learned kernel values over the training data. We make two observations about this problem: first, for data mapped to high-dimensional spaces via kernel functions, this problem is seemingly impossible to optimize, since the size of W grows quadratically with the dimensionality. We will show that (2) need not be solved explicitly in order to learn an LT kernel function.
Second, most Mahalanobis metric learning methods may be viewed as special cases of the above framework, and we will discuss some of them throughout the paper.

2.1 Examples of Regularizers and Constraints

To make the kernel learning optimization problem concrete, we discuss a few examples of possible regularizers and constraints.

For the regularizer f(A) = (1/2)‖A − I‖_F^2, the resulting kernel learning objective can be equivalently expressed as minimizing (1/2)‖K^{-1}K_W − I‖_F^2. Thus, the goal is to keep the learned kernel close to the input kernel subject to the constraints in g_i. Similarly, for f(A) = tr(A − I), the resulting objective can be expressed as minimizing tr(K^{-1}K_W − I). Another interesting regularizer is f(A) = tr(A) − log det(A). In this case, the resulting objective is to minimize the LogDet divergence D_ld(K_W, K) subject to the constraints given by the g_i. For linear g_i, this problem was studied in [12, 26].

In terms of constraints, a pairwise squared Euclidean distance constraint between a pair of points (φ_i, φ_j) in feature space can be formulated as K_W(i, i) + K_W(j, j) − 2K_W(i, j) ≥ b or K_W(i, i) + K_W(j, j) − 2K_W(i, j) ≤ b; this constraint is clearly linear in the entries of K_W. Similarity constraints can be represented as K_W(i, j) ≤ b or K_W(i, j) ≥ b and are also linear in K_W. Relative distance constraints over a triplet (φ_i, φ_j, φ_k) specify that φ_i should be closer to φ_j than to φ_k; they are often used in metric learning formulations and ranking problems, and such constraints can be easily formulated within our framework.
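Since each of these constraints is linear in K_W, it can be encoded as tr(C_i K_W) ≤ b_i (or ≥ b_i) for a suitable sparse matrix C_i. The sketch below illustrates two such encodings; the helper names are hypothetical, not from the paper.

```python
import numpy as np

def pairwise_distance_C(i, j, n):
    """C such that tr(C K_W) = K_W[i,i] + K_W[j,j] - 2 K_W[i,j],
    the squared distance between phi_i and phi_j under the learned kernel."""
    C = np.zeros((n, n))
    C[i, i] = C[j, j] = 1.0
    C[i, j] = C[j, i] = -1.0
    return C

def similarity_C(i, j, n):
    """Symmetrized C such that tr(C K_W) = K_W[i,j] for symmetric K_W."""
    C = np.zeros((n, n))
    C[i, j] = C[j, i] = 0.5
    return C

rng = np.random.RandomState(0)
A = rng.randn(4, 4)
K_W = A @ A.T                     # an arbitrary PSD "learned kernel" matrix
C = pairwise_distance_C(0, 2, 4)
assert np.isclose(np.trace(C @ K_W), K_W[0, 0] + K_W[2, 2] - 2 * K_W[0, 2])
assert np.isclose(np.trace(similarity_C(0, 2, 4) @ K_W), K_W[0, 2])
```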
Finally, non-parametric probability estimation constraints can be used to constrain the conditional probability of a class c given a data point φ_i:

    p(c | x_i) = ( Σ_{j ∈ c} K_W(i, j) ) / ( Σ_{t=1}^{C} Σ_{j ∈ t} K_W(i, j) ) ≥ b,

where C is the number of classes. This constraint can be written as a linear constraint over K_W after appropriate manipulation.

3 Analysis

We are now ready to analyze the connection between problems (1) and (2). We will show that the solutions to the two problems are equivalent, in the sense that by optimally solving one of the problems, the solution to the other can be computed in closed form. More importantly, this result will yield insight into the type of kernel that is learned by the kernel learning problem.

We begin by defining the class of regularizers considered in our analysis. Note that each of the example regularizers discussed earlier satisfies the following definition of spectral functions.

Definition 3.1. We say that f : R^{n×n} → R is a spectral function if f(A) = Σ_i f_s(λ_i), where λ_1, ..., λ_n are the eigenvalues of A and f_s : R → R is a real-valued function over the reals. Note that if f_s is a convex function over the reals, then f is also convex.

3.1 Learning Linear Transformation Kernels

Now we present our main result: for a spectral function f, problems (1) and (2) are equivalent.

Theorem 1. Let K ≻ 0 be an invertible matrix, let f be a spectral function, and denote the global minimum of the corresponding scalar function f_s as α. Let W* be an optimal solution to (2) and K_W* be an optimal solution to (1). Then,

    W* = αI + Φ S* Φ^T,   where S* = K^{-1}(K_W* − αK)K^{-1}.

Furthermore, K_W* = Φ^T W* Φ.

The first part of the theorem demonstrates that, given an optimal solution K_W* to (1), one can construct the corresponding solution W* to (2), while the second part shows the reverse (this also demonstrates why W is used in the subscript of the learned kernel). The proof of this theorem appears in the supplementary material. The main idea behind the proof is to first show that the optimal solution to (2) is always of the form W = αI + Φ S Φ^T, and then to obtain the closed-form expression for S using algebraic manipulations.

As a first consequence of this result, we can achieve induction over the learned kernels. Given that K_W = Φ^T W Φ, we can see that the learned kernel function is a linear transformation kernel; that is, κ_W(φ_i, φ_j) = φ_i^T W φ_j. Given a pair of new data points φ_{n1} and φ_{n2}, we use the fact that the learned kernel is a linear transformation kernel, along with the first result of the theorem (W = αI + Φ S Φ^T), to compute the learned kernel as:

    κ_W(x_{n1}, x_{n2}) = φ_{n1}^T W φ_{n2} = α κ(x_{n1}, x_{n2}) + Σ_{i,j=1}^{n} S_{ij} κ(x_{n1}, x_i) κ(x_j, x_{n2}).        (3)

As mentioned in Section 2, many Mahalanobis metric learning methods can be viewed as special cases of (2). Therefore, a corollary of Theorem 1 is that we can constructively apply these metric learning methods in kernel space by solving their corresponding kernel learning problem, and then compute the learned metrics via (3). Thus, W need not be constructed explicitly to learn the LT kernel. Kernelization of Mahalanobis metric learning has previously been established for some special cases; our results generalize and extend previous methods, as well as provide simpler techniques in some cases.
Below, we elaborate with some special cases.

Example 1 [Information Theoretic Metric Learning (ITML)]: [12] proposed the following Mahalanobis metric learning formulation:

    min_{W ⪰ 0}  tr(W) − log det(W)   s.t.  d_W(φ_i, φ_j) ≤ b_ij, (i, j) ∈ S;  d_W(φ_i, φ_j) ≥ b_ij, (i, j) ∈ D,

where S and D specify pairs of similar and dissimilar points, respectively, and d_W(φ_i, φ_j) = (φ_i − φ_j)^T W (φ_i − φ_j) is the Mahalanobis distance between φ_i and φ_j. ITML is an instantiation of our framework with regularizer f(A) = tr(A) − log det(A) and pairwise distance constraints encoded as the g_i functions. Furthermore, it is straightforward to show that f is a convex spectral function with global minimum α = 1, so the optimal W can be learned implicitly using (1). The corresponding kernel learning optimization problem simplifies to:

    min_{K_W}  D_ld(K_W, K)   s.t.  g_i(K_W) ≤ b_i,  1 ≤ i ≤ m,        (4)

where D_ld(K_W, K) = tr(K_W K^{-1}) − log det(K_W K^{-1}) − n is the LogDet divergence [12], and the positive definiteness of K_W is satisfied automatically. This recovers the kernelized metric learning problem analyzed in [12], where kernelization for this special case was established and an iterative projection algorithm for optimization was developed. Note that, in the analysis of [12], the g_i were limited to similarity and dissimilarity constraints; our result is therefore more general than the existing kernelization result, even for this special case.

Example 2 [Pseudo Online Metric Learning (POLA)]: [13] proposed the following metric learning formulation:

    min_{W ⪰ 0}  (1/2)‖W‖_F^2   s.t.  y_ij(b − d_W(φ_i, φ_j)) ≥ 1,  ∀(i, j) ∈ P,

where y_ij = 1 if φ_i and φ_j are similar, and y_ij = −1 if φ_i and φ_j are dissimilar. P is a set of pairs of points with known distance constraints. POLA is an instantiation of (2) with f(A) = (1/2)‖A‖_F^2 and side-information available in the form of pairwise distance constraints. Note that the regularizer f(A) = (1/2)‖A‖^2 was also employed in [2, 27], and these methods also fall under our general formulation. In this case, f is once again a convex spectral function, and its global minimum is α = 0, so we can use (1) to solve for the learned kernel K_W as:

    min_{K_W}  ‖K_W K^{-1}‖_F^2   s.t.  g_i(K_W) ≤ b_i,  1 ≤ i ≤ m,  K_W ⪰ 0.        (5)

The constraints g_i for this problem can be easily constructed by rewriting each of POLA's constraints as a function of Φ^T W Φ. Note that the above approach to kernelization is much simpler than the method suggested in [13], which involves a kernelized Gram-Schmidt procedure at each step of the algorithm.

Other Examples: The above two examples show that our analysis recovers two well-known kernelization results for Mahalanobis metric learning. However, several other metric learning approaches fall into our framework as well, including large margin nearest neighbor metric learning (LMNN) [11] and maximally collapsing metric learning (MCML) [14], both of which can be seen as instantiations of our learning framework with a constant f, as well as relevant component analysis (RCA) [28] and Xing et al.'s Mahalanobis metric learning method for clustering [10]. Given lack of space, we cannot detail the kernelization of all these methods, but they follow in the same manner as the above two examples.
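The LogDet divergence appearing in Example 1 can be computed directly from K_W and K. The following is a minimal sketch (the function name is illustrative); it checks two basic properties of the divergence on synthetic positive definite matrices.

```python
import numpy as np

def logdet_divergence(K_W, K):
    """D_ld(K_W, K) = tr(K_W K^{-1}) - log det(K_W K^{-1}) - n (Example 1)."""
    n = K.shape[0]
    M = np.linalg.solve(K, K_W)     # K^{-1} K_W; shares trace and det with K_W K^{-1}
    sign, logdet = np.linalg.slogdet(M)
    assert sign > 0, "K_W K^{-1} must have positive determinant"
    return float(np.trace(M) - logdet - n)

rng = np.random.RandomState(2)
A = rng.randn(4, 4)
K = A @ A.T + 4 * np.eye(4)         # a positive definite input kernel matrix
B = rng.randn(4, 4)
K_W = B @ B.T + np.eye(4)           # a candidate learned kernel matrix
assert np.isclose(logdet_divergence(K, K), 0.0)   # D_ld(K, K) = 0
assert logdet_divergence(K_W, K) >= 0.0           # nonnegative for PD arguments
```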
In particular, each of these methods may be run in kernel space, and our analysis yields new insights into these methods; for example, kernelization of LMNN [11] using Theorem 1 avoids the convex perturbation analysis of [16], which leads to suboptimal solutions in some cases.

3.2 Parameter Reduction

One of the drawbacks of Theorem 1 is that the matrices K_W and S are of size n × n, and thus grow quadratically with the number of data points. We would like a way to restrict our optimization to a smaller number of parameters, so we now discuss a generalization of (2) that introduces an additional constraint to reduce the number of parameters to learn, permitting scalability to data sets with many training points and with very high dimensionality.

Theorem 1 shows that the optimal K_W* is of the form Φ^T W* Φ = αK + K S* K. In order to accommodate fewer parameters to learn, a natural option is to replace the unknown S matrix with a low-rank matrix J L J^T, where J ∈ R^{n×r} is a pre-specified matrix, L ∈ R^{r×r} is unknown (we use L instead of S to emphasize that S is of size n × n whereas L is r × r), and the rank r is a parameter of the algorithm. We then explicitly enforce that the learned kernel is of this form. By plugging K_W = αK + KSK into (1) and replacing S with J L J^T, the resulting optimization problem is given by:

    min_{L ⪰ 0}  f(αI + K^{1/2} J L J^T K^{1/2})   s.t.  g_i(αK + K J L J^T K) ≤ b_i,  1 ≤ i ≤ m.        (6)

While the above problem involves just r × r variables, the functions f and g_i are applied to n × n matrices, and therefore the problem may still be computationally expensive to optimize. Below, we show that for any spectral function f and linear constraints g_i(K_W) = tr(C_i K_W), (6) reduces to a problem that applies f and the g_i to r × r matrices only, which provides significant scalability.

Theorem 2. Let K = Φ^T Φ ≻ 0 and J ∈ R^{n×r}, and let K^J = J^T K J. Also, let the regularization function f be a spectral function (see Definition 3.1) such that the corresponding scalar function f_s has a global minimum at α. Then problem (6) is equivalent to the following problem:

    min_{L ⪰ −α(K^J)^{-1}}  f((K^J)^{-1/2}(αK^J + K^J L K^J)(K^J)^{-1/2})   s.t.  tr(L J^T K C_i K J) ≤ b_i − α tr(K C_i),  1 ≤ i ≤ m.        (7)

Note that (7) is over r × r matrices (after initial pre-processing) and is in fact similar to the kernel learning problem (1), but with a kernel K^J of smaller size r × r, r ≪ n. A proof of the above theorem is in the supplementary material; it follows by showing that for spectral functions the objective functions of the two problems differ by a universal constant.

Similar to (1), we can show that (6) is also equivalent to learning a linear transformation kernel function. This enables us to naturally apply the above kernel learning problem in the inductive setting. We provide a proof of the following theorem in the supplementary material.

Theorem 3. Consider (6) with a spectral function f such that the corresponding scalar function f_s has a global minimum at α, and let K ≻ 0 be invertible. Then, (6) and (7) are equivalent to the following linear transformation kernel learning problem (analogous to the connection between (1) and (2)):

    min_{W ⪰ 0, L}  f(W)   s.t.  tr(C_i Φ^T W Φ) ≤ b_i,  1 ≤ i ≤ m,  W = αI + Φ J L J^T Φ^T.        (8)

Note that, in contrast to (2), where the last constraint on W is achieved automatically, (8) requires that this constraint be satisfied during the optimization process, which leads to a reduced number of parameters for our kernel learning problem.
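The constraint reduction in Theorem 2 rests on the trace identity tr(C_i(αK + K J L J^T K)) = α tr(K C_i) + tr(L J^T K C_i K J), which converts each n × n constraint into an r × r one. A numerical check of this identity on synthetic matrices (all variable names illustrative):

```python
import numpy as np

rng = np.random.RandomState(3)
n, r = 6, 2
A = rng.randn(n, n)
K = A @ A.T + np.eye(n)                  # positive definite input kernel matrix
J = rng.randn(n, r)                      # pre-specified n x r coefficient matrix
L = rng.randn(r, r); L = (L + L.T) / 2   # the r x r unknown
C = rng.randn(n, n); C = (C + C.T) / 2   # a linear constraint matrix C_i
alpha = 0.5

K_W = alpha * K + K @ J @ L @ J.T @ K    # learned kernel under the low-rank restriction
lhs = np.trace(C @ K_W)                                            # original n x n constraint value
rhs = alpha * np.trace(K @ C) + np.trace(L @ J.T @ K @ C @ K @ J)  # reduced r x r form
assert np.isclose(lhs, rhs)
```

The identity follows from the cyclic property of the trace, so only r × r quantities involving L need to be touched during optimization.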
The above theorem shows that our reduced-parameter kernel learning method (6) also implicitly learns a linear transformation kernel function; hence we can generalize the learned kernel to unseen data points using an expression similar to (3).

The parameter reduction approach presented in this section depends critically on the choice of J. A few simple heuristics for choosing J, beyond choosing a subset of the points from Φ, include a randomly sampled coefficient matrix, or clustering Φ into r clusters and letting J be the cluster membership indicator matrix. Also note that using this parameter reduction technique, we can scale the optimization to kernel learning problems with millions of points or more. For example, we have applied a special case of this scalable framework to learn kernels over data sets containing nearly half a million images, as well as the MNIST data set of 60,000 data points [29].

4 Trace-norm based Inductive Semi-supervised Kernel Dimensionality Reduction (Trace-SSIKDR)

We now consider applying our framework to the scenario of semi-supervised kernel dimensionality reduction, which provides a novel and practical application of our framework. While there exist a variety of methods for kernel dimensionality reduction, most of these methods are unsupervised (e.g., kernel-PCA) or are restricted to the transductive setting. In contrast, we can use our kernel learning framework to implicitly learn a low-rank transformation of the feature vectors that in turn provides a low-dimensional embedding of the dataset. Furthermore, our framework permits a variety of side-information such as pairwise or relative distance constraints, beyond the class label information allowed by existing transductive methods.

We describe our method starting from the linear transformation problem. Our goal is to learn a low-rank linear transformation W whose corresponding low-dimensional mapped embedding of φ_i is W^{1/2} φ_i.
Even when the dimensionality of \u03d5i is very large, if the rank of W is low enough, then the\nmapped embedding will have small dimensionality. With that in mind, a possible regularizer could\nbe the rank, i.e., f (A) = rank(A); one can easily show that this satis\ufb01es the de\ufb01nition of a spectral\nfunction. Unfortunately, optimization is intractable in general with the non-convex rank function,\nso we use the trace-norm relaxation for the matrix rank function, i.e., we set f (A) = Tr(A). This\nfunction has been extensively studied as a relaxation for the rank function [30], and it satis\ufb01es the\nde\ufb01nition of a spectral function (with \u03b1 = 0). We also add a small Frobenius norm regularization\nfor ease of optimization (this does not affect the spectral property of the regularization function).\nThen using Theorem 1, the resulting relaxed kernel learning problem is:\n\n\u03c4Tr(K\n\n\u22121/2KW K\n\n\u22121/2) + \u2225K\n\n\u22121/2KW K\n\n\u22121/2\u22252\n\nF\n\ns.t. Tr(CiKW ) \u2264 bi, 1 \u2264 i \u2264 m,\n\n(9)\n\nmin\nKW \u227d0\n\nwhere \u03c4 > 0 is a parameter. The above problem can be solved using a method based on Uzawa\u2019s\ninexact algorithm, similar to [31].\nWe brie\ufb02y describe the steps taken by our method at each iteration. For simplicity, denote ~K =\n\u22121/2; we will optimize with respect to ~K instead of KW . Let ~K t be the t-th iterate.\n\u22121/2KW K\nK\ni = 0,\u2200i. Let \u03b4t\nAssociate variable zt\n\ni , 1 \u2264 i \u2264 m with each constraint at each iteration t, and let z0\n\n6\n\n\fTable 1: UCI Datasets: accuracy achieved by various methods. The numbers in parentheses show\nthe rank of the corresponding learned kernels. 
Trace-SSIKDR achieves accuracy comparable to Frob\n(Frobenius norm regularization) and ITML (LogDet regularization) with a signi\ufb01cantly smaller rank.\n\nDataset\\Method\n\nIris\nWine\n\nIonosphere\nSoybean\nDiabetes\n\nBalance-scale\nBreast-cancer\nSpectf-heart\n\nHeart-c\nHeart-h\n\nGaussian\n0.99(40)\n0.80(105)\n0.94(337)\n0.89(624)\n0.75(251)\n0.93(156)\n0.72(259)\n0.74(267)\n0.68(228)\n0.59(117)\n\nFrob\n\n0.99(27)\n0.94(36)\n0.98(64)\n0.96(96)\n0.74(154)\n0.96(106)\n0.73(61)\n0.87(39)\n0.78(62)\n0.69(71)\n\nITML\n0.99(40)\n0.99(105)\n0.98(337)\n0.96(624)\n0.76(251)\n0.97(156)\n0.78(259)\n0.84(267)\n0.79(228)\n0.70(117)\n\nFrob LR\n0.91(4)\n0.72(11)\n0.98(19)\n0.44(40)\n0.67(14)\n0.97(10)\n0.69(21)\n0.84(22)\n0.73(39)\n0.56(31)\n\nITML LR-pre\n\n0.93(4)\n0.85(11)\n0.98(19)\n0.87(40)\n0.62(14)\n0.80(10)\n0.68(21)\n0.89(22)\n0.61(39)\n0.30(31)\n\nITML LR-post\n\n0.99(4)\n0.46(11)\n0.93(19)\n0.35(40)\n0.73(14)\n0.82(10)\n0.68(21)\n0.89(22)\n0.55(39)\n0.56(31)\n\nTrace-SSIKDR\n\n0.99(4)\n0.94(11)\n0.99(19)\n0.96(40)\n0.74(14)\n0.97(10)\n0.75(21)\n0.84(22)\n0.78(39)\n0.68(31)\n\n(\u2211\n\n)\n\nbe the step size at iteration t. The algorithm performs the following updates:\n\nU (cid:6)U T \u2190 K 1/2\n\nzt\u22121\ni Ci\n\u2190 zt\u22121\n\ni\n\ni\nzt\ni\n\n~K t \u2190 U max((cid:6) \u2212 \u03c4 I, 0)U T ,\nK 1/2,\n\u2212 \u03b4 max(Tr(CiK 1/2 ~K tK 1/2) \u2212 bi, 0),\u2200i.\n\nThe above updates require computation of K 1/2 which is expensive for large high-rank matrices.\nHowever, using elementary linear algebra we can show that ~K and the learned kernel function\n\u22121/2 from\ncan be computed ef\ufb01ciently without computing K 1/2 by maintaining S = K\nstep to step. Algorithm 1 details an ef\ufb01cient method for optimizing (9) and returns matrices (cid:6)k,\nDk and Vk all of which are contain only O(nk) parameters, where k is the rank of ~K t, which\nchanges from iteration to iteration. 
Note that step 4 of the algorithm computes k singular vectors and requires O(nk^2) time. Since k is typically significantly smaller than n, the computational cost is significantly smaller than that of computing the whole SVD. Note that the learned embedding φ_i → K̃^{1/2} K^{-1/2} k_i, where k_i is a vector of input kernel function values between φ_i and the training data, can be computed efficiently as φ_i → Σ_k^{1/2} D_k V_k^T k_i, which does not require K^{1/2} explicitly. We defer the proof of correctness for Algorithm 1 to the supplementary material.

Algorithm 1 Trace-SSIKDR
Require: K, (C_i, b_i) for 1 ≤ i ≤ m, τ, δ
1: Initialize: z_i^0 = 0 ∀i, t = 0
2: repeat
3:    t = t + 1
4:    Compute V_k and Σ_k, the top k eigenvectors and eigenvalues of (∑_i z_i^{t-1} C_i) K, where k = argmax_j σ_j > τ
5:    D_k(i, i) ← 1/(v_i^T K v_i), 1 ≤ i ≤ k
6:    z_i^t ← z_i^{t-1} − δ max(Tr(C_i K V_k D_k Σ_k D_k V_k^T K) − b_i, 0), ∀i.    // S^t = V_k D_k Σ_k D_k V_k^T
7: until convergence
8: Return Σ_k, D_k, V_k

5 Experimental Results

We now present an empirical evaluation of our kernel learning framework and our semi-supervised kernel dimensionality reduction approach when applied in conjunction with k-nearest neighbor classification. In particular, using different regularization functions, we show that our framework can be used to obtain significantly better kernels than the baseline kernels for k-NN classification. Additionally, we show that our semi-supervised kernel dimensionality reduction approach achieves comparable accuracy while significantly reducing the dimensionality of the linear mapping.
UCI Datasets: First, we evaluate the performance of our kernel learning framework on standard UCI datasets.
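The training constraints Tr(C_i K_W) ≤ b_i used in the experiments can be generated from labeled pairs. The following sketch is an assumption-laden illustration: the rank-one construction C = (e_a − e_b)(e_a − e_b)^T and the use of distance percentiles as thresholds follow the standard metric-learning setup, but which percentile bounds which constraint type is our guess, and all names are ours.

```python
import numpy as np
from itertools import combinations

def pairwise_constraints(X, y):
    """Build rank-one constraint matrices C = (e_a - e_b)(e_a - e_b)^T for
    all labeled pairs, so that Tr(C K_W) is the learned squared distance
    between the pair. Right-hand sides come from the 5th/95th percentiles
    of the pairwise squared input distances."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    lo, hi = np.percentile(d2[np.triu_indices(n, k=1)], [5, 95])
    Cs, bs, is_upper = [], [], []
    for a, b in combinations(range(n), 2):
        e = np.zeros(n)
        e[a], e[b] = 1.0, -1.0
        Cs.append(np.outer(e, e))
        if y[a] == y[b]:              # similar pair: distance at most lo
            bs.append(lo); is_upper.append(True)
        else:                         # dissimilar pair: distance at least hi
            bs.append(hi); is_upper.append(False)
    return Cs, bs, is_upper
```

In practice one would subsample the pairs rather than enumerate all of them; the enumeration here is only to keep the sketch deterministic.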
We measure accuracy of the learned kernels using 5-NN classification with two-fold cross validation, averaged over 10 runs. For training, we use pairwise (dis)similarity constraints as described in Section 2.1. We select the parameters l and u (the right-hand sides of the pairwise constraints) using the 5th and 95th percentiles of all the pairwise distances between points in the training dataset.

Figure 1: (a): Mean classification accuracy on the Caltech101 dataset obtained by 1-NN classification with learned kernels obtained by various methods. (b): Rank of the learned kernel functions obtained by various methods. The rank of the learned kernel function is the same as the reduced dimensionality of the dataset. (c): Two-dimensional embedding of 2000 USPS digits obtained using our method Trace-SSIKDR for a training set of just 100 USPS digits. Note that we use the inductive setting here and the embedding is color coded according to the underlying digit. (d): Embedding of the USPS digits dataset obtained using kernel-PCA.

Table 1 shows the 5-NN classification accuracies achieved by our kernel learning framework with different regularization functions. Gaussian represents the baseline Gaussian kernel, Frob represents an instantiation of our framework with Frobenius norm regularization (f(A) = ‖A‖_F^2), while ITML corresponds to LogDet regularization (f(A) = Tr(A) − log det(A)). For the latter case, our formulation is the same as the formulation proposed by [12]. Note that for almost all the datasets (except Iris and Diabetes), both Frob and ITML improve upon the baseline Gaussian kernel significantly.
We also compare our semi-supervised dimensionality reduction method Trace-SSIKDR (see Section 4) with the baseline kernel dimensionality reduction methods Frob LR, ITML LR-pre, and ITML
Frob LR reduces the rank of the learned matrix W (equivalently, it reduces the dimension-\nality) using Frobenius norm regularization by taking the top eigenvectors. Similarly, ITML LR-post\nreduces the rank of the learned kernel matrix obtained using ITML by taking its top eigenvectors.\nITML LR-pre reduces the rank of the kernel function by reducing the rank of the training kernel ma-\ntrix. The learned linear transformation W (or equivalently, the learned kernel function) should have\nthe same rank as that of training kernel matrix as the LogDet divergence preserves the range space\nof the input kernel. We \ufb01x the rank of the learned W for Frob LR, ITML LR-pre, ITML LR-post as\nthe rank of the transformation W obtained by our Trace-SSIKDR method. Note that Trace-SSIKDR\nachieves accuracies similar to Frob and ITML, while decreasing the rank signi\ufb01cantly. Furthermore,\nit is signi\ufb01cantly better than the corresponding baseline dimensionality reduction methods.\nCaltech-101: Next, we evaluate our kernel learning framework on the Caltech-101 dataset, a bench-\nmark object recognition dataset containing over 3000 images. Here, we compare various methods\nusing 1-NN classi\ufb01cation method and the accuracy is measured in terms of the mean recognition\naccuracy per class. We use a pool of 30 images per class for our experiments, out of which a vary-\ning number of random images are selected for training and the remaining are used for testing the\nlearned kernel function. The baseline kernel function is selected to be the sum of four different\nkernel functions: PMK [32], SPMK [33], Geoblur-1 and Geoblur-2 [34]. Figure 1 (a) shows the\naccuracy achieved by various methods (acronyms represent the same methods as described in the\nprevious section). Clearly, ITML and Frob (which are speci\ufb01c instances of our framework) are able\nto learn signi\ufb01cantly more accurate kernel functions than the baseline kernel function. 
Furthermore, our Trace-SSIKDR method is able to achieve reasonable accuracy while reducing the rank of the kernel function significantly (Figure 1 (b)). Also note that Trace-SSIKDR achieves significantly better accuracy than Frob LR, ITML LR-pre and ITML LR-post, although all of these methods have the same rank as Trace-SSIKDR.
USPS Digits: Finally, we qualitatively evaluate our dimensionality reduction method on the USPS digits dataset. Here, we train our method using 100 examples to learn a linear mapping to two dimensions, i.e., a rank-2 matrix W. For the baseline kernel, we use the data-dependent kernel function proposed by [25], which also takes the data's manifold structure into account. We then embed 2000 (unseen) test examples into two dimensions using our learned low-rank transformation. Figure 1 (c) shows the embedding obtained by our Trace-SSIKDR method, while Figure 1 (d) shows the embedding obtained by the kernel-PCA algorithm. Each point is color coded according to the underlying digit. Note that our method is able to separate out most of the digits even in 2D, and is significantly better than the embedding obtained using kernel-PCA.
Acknowledgements: This research was supported in part by NSF grant CCF-0728879.

References
[1] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. JMLR, 6:995–1018, 2005.
[2] J. T.
Kwok and I. W. Tsang. Learning with idealized kernels. In ICML, 2003.
[3] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In NIPS, 2001.
[4] C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. JMLR, 6:1043–1071, 2005.
[5] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, 2004.
[6] Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, and John Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, NIPS, volume 17, pages 1641–1648, 2005.
[7] Peter V. Gehler and Sebastian Nowozin. Let the kernel figure it out; principled learning of pre-processing for kernel classifiers. In CVPR, pages 2836–2843, 2009.
[8] Matthias Seeger. Cross-validation optimization for large scale hierarchical classification kernel methods. In NIPS, pages 1233–1240, 2006.
[9] Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux, Jean-Francois Paiement, Pascal Vincent, and Marie Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10):2197–2219, 2004.
[10] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with side-information. In NIPS, pages 505–512, 2002.
[11] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005.
[12] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
[13] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In ICML, 2004.
[14] A. Globerson and S. T. Roweis. Metric learning by collapsing classes.
In NIPS, 2005.
[15] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood component analysis. In NIPS, 2004.
[16] B. Kulis, S. Sra, and I. S. Dhillon. Convex perturbations for scalable semidefinite programming. In AISTATS, 2009.
[17] R. Chatpatanasiri, T. Korsrilabutr, P. Tangchanachaianan, and B. Kijsirikul. On kernelization of supervised Mahalanobis distance learners, 2008.
[18] Andreas Argyriou, Charles A. Micchelli, and Massimiliano Pontil. On spectral learning. JMLR, 11:935–953, 2010.
[19] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In ICML, pages 33–40, 2005.
[20] L. Song, A. Smola, K. M. Borgwardt, and A. Gretton. Colored maximum variance unfolding. In NIPS, pages 1385–1392, 2007.
[21] Y. Song, F. Nie, C. Zhang, and S. Xiang. A unified framework for semi-supervised dimensionality reduction. Pattern Recognition, 41(9):2789–2799, 2008.
[22] K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimensionality reduction for supervised learning. In NIPS, 2003.
[23] R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In ICML, pages 927–934, 2007.
[24] S. Mika, B. Schölkopf, A. J. Smola, K. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In NIPS, pages 536–542, 1998.
[25] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In ICML, pages 824–831, 2005.
[26] Brian Kulis, Mátyás Sustik, and Inderjit S. Dhillon. Learning low-rank kernel matrices. In ICML, pages 505–512, 2006.
[27] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In NIPS, 2003.
[28] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints.
JMLR, 6:937–965, 2005.
[29] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In CVPR, 2008.
[30] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, 2007.
[31] J. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion, 2008.
[32] K. Grauman and T. Darrell. The Pyramid Match Kernel: Efficient learning with sets of features. JMLR, 8:725–760, April 2007.
[33] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
[34] A. C. Berg and J. Malik. Geometric blur for template matching. In CVPR, pages 607–614, 2001.