{"title": "Gaussian and Wishart Hyperkernels", "book": "Advances in Neural Information Processing Systems", "page_first": 729, "page_last": 736, "abstract": null, "full_text": "Gaussian and Wishart Hyp erkernels\n\nRisi Kondor, Tony Jebara Computer Science Department, Columbia University 1214 Amsterdam Avenue, New York, NY 10027, U.S.A. {risi,jebara}@cs.columbia.edu\n\nAbstract\nWe propose a new method for constructing hyperkenels and define two promising special cases that can be computed in closed form. These we call the Gaussian and Wishart hyperkernels. The former is especially attractive in that it has an interpretable regularization scheme reminiscent of that of the Gaussian RBF kernel. We discuss how kernel learning can be used not just for improving the performance of classification and regression methods, but also as a stand-alone algorithm for dimensionality reduction and relational or metric learning.\n\n1\n\nIntro duction\n\nThe performance of kernel methods, such as Support Vector Machines, Gaussian Processes, etc. depends critically on the choice of kernel. Conceptually, the kernel captures our prior knowledge of the data domain. There is a small number of popular kernels expressible in closed form, such as the Gaussian RBF kernel k (x, x ) = exp(- x - x 2 /(2 2 )), which boasts attractive and unique properties from an abstract function approximation point of view. In real world problems, however, and especially when the data is heterogenous or discrete, engineering an appropriate kernel is a ma jor part of the modelling process. It is natural to ask whether instead it might be possible to learn the kernel itself from the data. Recent years have seen the development of several approaches to kernel learning [5][1]. Arguably the most principled method proposed to date is the hyperkernels idea introduced by Ong, Smola and Williamson [8][7][9]. 
The current paper is a continuation of this work, introducing a new family of hyperkernels with attractive properties.

Most work on kernel learning has focused on finding a kernel which is subsequently to be used in a conventional kernel machine, turning learning into an essentially two-stage process: first learn the kernel, then use it in a conventional algorithm such as an SVM to solve a classification or regression task. Recently there has been increasing interest in using the kernel in its own right to answer relational questions about the dataset. Instead of predicting individual labels, a kernel characterizes which pairs of labels are likely to be the same, or related. Kernel learning can be used to infer the network structure underlying data. A different application is to use the learnt kernel to produce a low dimensional embedding via kernel PCA. In this sense, kernel learning can also be regarded as a dimensionality reduction or metric learning algorithm.

2 Hyperkernels

We begin with a brief review of the kernel and hyperkernel formalism. Let X be the input space, Y the output space, and {(x1, y1), (x2, y2), ..., (xm, ym)} the training data. By kernel we mean a symmetric function k : X × X → R that is positive definite on X. Whenever we refer to a function being positive definite, we assume that it is also symmetric. Positive definiteness guarantees that k induces a Reproducing Kernel Hilbert Space (RKHS) F, which is a vector space of functions spanned by { k_x(·) = k(x, ·) | x ∈ X } and endowed with an inner product satisfying ⟨k_x, k_x'⟩ = k(x, x'). Kernel-based learning algorithms find a hypothesis f̂ ∈ F by solving some variant of the Regularized Risk Minimization problem

f̂ = argmin_{f ∈ F}  (1/m) Σ_{i=1}^m L(f(x_i), y_i) + (λ/2) ‖f‖²_F,

where L is a loss function of our choice. By the Representer Theorem [2], f̂ is expressible in the form f̂(x) = Σ_{i=1}^m α_i k(x_i, x) for some α_1, α_2, ..., α_m ∈ R.

The idea expounded in [8] is to set up an analogous optimization problem for finding k itself in the RKHS of a hyperkernel K : X̲ × X̲ → R, where X̲ = X². We will sometimes view K as a function of four arguments, K((x1, x1'), (x2, x2')), and sometimes as a function of two pairs, K(x̲1, x̲2), with x̲1 = (x1, x1') and x̲2 = (x2, x2'). To induce an RKHS, K must be positive definite in the latter sense. Additionally, we have to ensure that the solution of our regularized risk minimization problem is itself a kernel. To this end, we require that the functions K_{x1,x1'}(x2, x2') that we get by fixing the first two arguments of K((x1, x1'), (x2, x2')) be symmetric and positive definite kernels in the remaining two arguments.

Definition 1. Let X be a nonempty set, X̲ = X × X and K : X̲ × X̲ → R with K_x̲(·) = K(x̲, ·) = K(·, x̲). Then K is called a hyperkernel on X if and only if

1. K is positive definite on X̲, and
2. for any x̲ ∈ X̲, K_x̲ is positive definite on X.

Denoting the RKHS of K by K, potential kernels lie in the cone K_pd = { k ∈ K | k is pos. def. }. Unfortunately, there is no simple way of restricting kernel learning algorithms to K_pd. Instead, we will restrict ourselves to the positive quadrant K⁺ = { k ∈ K | ⟨k, K_x̲⟩ ≥ 0 for all x̲ ∈ X̲ }, which is a subcone of K_pd. The actual learning procedure involved in finding k is very similar to conventional kernel methods, except that now regularized risk minimization is to be performed over all pairs of data points:

k̂ = argmin_{k ∈ K⁺}  Q(X, Y, k) + (λ/2) ‖k‖²_K,    (1)

where Q is a quality functional describing how well k fits the training data. Several candidates for Q are described in [8]. If K⁺ has the property that for any S ⊆ X̲ the orthogonal projection of any k ∈ K⁺ to the subspace spanned by { K_x̲ | x̲ ∈ S } remains in K⁺, then k̂ is expressible as

k̂(x, x') = Σ_{i,j=1}^m α_ij K_{(x_i, x_j)}(x, x') = Σ_{i,j=1}^m α_ij K((x_i, x_j), (x, x'))    (2)

for some real coefficients (α_ij). In other words, we have a hyper-representer theorem.
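Both conditions of Definition 1 can be checked numerically on a small sample. The sketch below (illustrative, not from the paper) uses the toy rank-one hyperkernel K((x1, x1'), (x2, x2')) = q(x1, x1') q(x2, x2') with q a Gaussian RBF; since q is positive definite and pointwise positive, both conditions hold, and both Gram matrices below should come out positive semidefinite.

```python
import numpy as np

def q(x, xp, sigma2=1.0):
    """Gaussian RBF factor: positive definite and pointwise positive."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma2))

def K(p1, p2):
    """Toy rank-one hyperkernel on pairs: K((x1,x1'),(x2,x2')) = q(x1,x1') q(x2,x2')."""
    (x1, x1p), (x2, x2p) = p1, p2
    return q(x1, x1p) * q(x2, x2p)

rng = np.random.default_rng(1)
pts = rng.standard_normal((8, 3))
pairs = [(pts[i], pts[j]) for i in range(4) for j in range(4)]

# Condition 1: K is positive definite on the pair space.
G = np.array([[K(p, pp) for pp in pairs] for p in pairs])
cond1 = np.linalg.eigvalsh((G + G.T) / 2).min() > -1e-9

# Condition 2: each slice K_{x1,x1'} is positive definite on X.
x1, x1p = pts[0], pts[1]
Gs = np.array([[K((x1, x1p), (a, b)) for b in pts] for a in pts])
cond2 = np.linalg.eigvalsh((Gs + Gs.T) / 2).min() > -1e-9
```

The same two checks apply unchanged to any candidate hyperkernel.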
It is easy to see that for K⁺ this condition is satisfied provided that K((x1, x1'), (x2, x2')) ≥ 0 for all x1, x1', x2, x2' ∈ X. Thus, in this case to solve (1) it is sufficient to optimize the m² variables (α_ij)_{i,j=1}^m, introducing the additional constraints α_ij ≥ 0 to enforce k ∈ K⁺.

Finding functions that satisfy Definition 1 and also make sense in terms of regularization theory or practical problem domains is not trivial. Some potential choices are presented in [8]. In this paper we propose some new families of hyperkernels. The key tool we use is the following simple lemma.

Lemma 1. Let {g_z : X → R} be a family of functions indexed by z ∈ Z and let h : Z × Z → R be a kernel. Then

k(x, x') = ∫∫ g_z(x) h(z, z') g_{z'}(x') dz dz'    (3)

is a kernel on X. Furthermore, if h is pointwise positive (h(z, z') ≥ 0) and { g_z : X × X → R } is a family of pointwise positive kernels, then

K((x1, x1'), (x2, x2')) = ∫∫ g_{z1}(x1, x1') h(z1, z2) g_{z2}(x2, x2') dz1 dz2    (4)

is a hyperkernel on X, and it satisfies K((x1, x1'), (x2, x2')) ≥ 0 for all x1, x1', x2, x2' ∈ X.

3 Convolution hyperkernels

One interpretation of a kernel k(x, x') is that it quantifies some notion of similarity between points x and x'. For the Gaussian RBF kernel, and heat kernels in general, this similarity can be regarded as induced by a diffusion process in the ambient space [4]. Just as physical substances diffuse in space, the similarity between x and x' is mediated by intermediate points, in the sense that by virtue of x being similar to some x0 and x0 being similar to x', x and x' themselves become similar to each other. This captures the natural transitivity of similarity. Specifically, the normalized Gaussian kernel on Rⁿ of variance 2t = σ²,

k_t(x, x') = (4πt)^{-n/2} e^{-‖x - x'‖² / (4t)},

satisfies the well known convolution property

k_t(x, x') = ∫ k_{t/2}(x, x0) k_{t/2}(x0, x') dx0.    (5)

Such kernels are by definition homogeneous and isotropic in the ambient space. What we hope for from the hyperkernels formalism is to be able to adapt to the inhomogeneous and anisotropic nature of training data, while retaining the transitivity idea in some form. Hyperkernels achieve this by weighting the integrand of (5) in relation to what is \"on the other side\" of the hyperkernel. Specifically, we define convolution hyperkernels by setting g_z(x, x') = r(x, z) r(x', z) in (4) for some r : X × X → R. By (3), the resulting hyperkernel always satisfies the conditions of Definition 1.

Definition 2. Given functions r : X × X → R and h : X × X → R, where h is positive definite, the convolution hyperkernel induced by r and h is

K((x1, x1'), (x2, x2')) = ∫∫ r(x1, z1) r(x1', z1) h(z1, z2) r(x2, z2) r(x2', z2) dz1 dz2.    (6)

A good way to visualize the structure of convolution hyperkernels is to note that (6) is proportional to the likelihood of the graphical model in the figure to the right. The only requirements on the graphical model are to have the same potential function ψ1 at each of the extremities and to have a positive definite potential function ψ2 at the core.

3.1 The Gaussian hyperkernel

To make the foregoing more concrete, we now investigate the case where r(x, x') and h(z, z') are Gaussians. To simplify the notation we use the shorthand

⟨x, x'⟩_{σ²} = (2πσ²)^{-n/2} e^{-‖x - x'‖² / (2σ²)}.

The Gaussian hyperkernel on X = Rⁿ is then defined as

K((x1, x1'), (x2, x2')) = ∫∫ ⟨x1, z⟩_{σ²} ⟨z, x1'⟩_{σ²} ⟨z, z'⟩_{σ_h²} ⟨x2, z'⟩_{σ²} ⟨z', x2'⟩_{σ²} dz dz'.    (7)

Completing the square, we have

⟨x1, z⟩_{σ²} ⟨z, x1'⟩_{σ²} = (2πσ²)^{-n} exp( -( ‖z - x1‖² + ‖z - x1'‖² ) / (2σ²) )
 = (2πσ²)^{-n} exp( -‖z - x̄1‖²/σ² - ‖x1 - x1'‖²/(4σ²) ) = ⟨x1, x1'⟩_{2σ²} ⟨z, x̄1⟩_{σ²/2},

where x̄i = (x_i + x_i')/2.
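This completing-the-square step, including the normalizing constants, is easy to verify numerically; the sketch below (illustrative, not from the paper) checks it at a random point.

```python
import numpy as np

def gauss(x, xp, s2):
    """Normalized Gaussian <x, x'>_{s2} = (2 pi s2)^{-n/2} exp(-||x - x'||^2 / (2 s2))."""
    n = len(x)
    return (2 * np.pi * s2) ** (-n / 2) * np.exp(-np.sum((x - xp) ** 2) / (2 * s2))

rng = np.random.default_rng(0)
x1, x1p, z = rng.standard_normal((3, 4))
s2 = 0.7

# Left side: product of the two sigma^2 Gaussians appearing in (7).
lhs = gauss(x1, z, s2) * gauss(z, x1p, s2)

# Right side: <x1, x1'>_{2 sigma^2} <z, xbar1>_{sigma^2 / 2}.
xbar1 = (x1 + x1p) / 2
rhs = gauss(x1, x1p, 2 * s2) * gauss(z, xbar1, s2 / 2)
```

The identity is exact, so the two values agree to machine precision.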
By the convolution property of Gaussians it follows that

∫∫ ⟨x̄1, z⟩_{σ²/2} ⟨z, z'⟩_{σ_h²} ⟨z', x̄2⟩_{σ²/2} dz dz' = ⟨x̄1, x̄2⟩_{σ² + σ_h²},

and therefore

K((x1, x1'), (x2, x2')) = ⟨x1, x1'⟩_{2σ²} ⟨x2, x2'⟩_{2σ²} ⟨x̄1, x̄2⟩_{σ² + σ_h²}.    (8)

It is an important property of the Gaussian hyperkernel that it can be evaluated in closed form. A noteworthy special case is h(x, x') = δ(x - x'), corresponding to σ_h² → 0. At the opposite extreme, in the limit σ_h² → ∞, the hyperkernel decouples into the product of two RBF kernels. Since the hyperkernel expansion (2) is a sum of hyperkernel evaluations with one pair of arguments fixed, it is worth examining what these functions look like:

K_{x1,x1'}(x2, x2') ∝ exp( -‖x̄1 - x̄2‖² / (2(σ² + σ_h²)) ) exp( -‖x2 - x2'‖² / (2τ²) )    (9)

with τ² = 2σ². This is really a conventional Gaussian kernel between x2 and x2', multiplied by a spatially varying Gaussian intensity factor depending on how close the mean of x2 and x2' is to the mean of the training pair. This can be regarded as a localized Gaussian, and the full kernel (2) will be a sum of such terms with positive weights. As x2 and x2' move around in X, whichever localized Gaussians are centered close to their mean will dominate the sum. By changing the (α_ij) weights, the kernel learning algorithm can choose k from a highly flexible class of potential kernels.

The close relationship of K to the ordinary Gaussian RBF kernel is further borne out by changing coordinates to x̂ = (x + x')/√2 and x̃ = (x - x')/√2, which factorizes the hyperkernel in the form

K((x1, x1'), (x2, x2')) = K̂(x̂1, x̂2) K̃(x̃1, x̃2) ∝ ⟨x̂1, x̂2⟩_{2(σ² + σ_h²)} ⟨x̃1, 0⟩_{σ²} ⟨x̃2, 0⟩_{σ²}.

In summary, K behaves in (x̂1, x̂2) like a Gaussian kernel with variance 2(σ² + σ_h²), but in x̃ it just effects a one-dimensional feature mapping. Omitting details for brevity, the consequences of this include that K = K̂ ⊗ K̃, where K̂ is the RKHS of a Gaussian kernel over X, while K̃ is the one-dimensional space generated by ⟨x̃, 0⟩_{σ²}: each k ∈ K can be written as k(x̂, x̃) = k̂(x̂) ⟨x̃, 0⟩_{σ²}. Furthermore, the regularization operator Υ (defined by ⟨k, k'⟩_K = ⟨Υk, Υk'⟩_{L2} [10]) will be

(Υ k)(x̂, x̃) = [ ∫ e^{(σ² + σ_h²) ‖ω‖² / 2} κ(ω) e^{i ω·x̂} dω ] ⟨x̃, 0⟩_{σ²},

where κ(ω) is the Fourier transform of k̂, establishing the same exponential regularization penalty scheme in the Fourier components of k that is familiar from the theory of Gaussian RBF kernels.

4 Anisotropic hyperkernels

With the hyperkernels discussed so far we can only learn kernels that are a sum of rotationally invariant terms. Consequently, the learnt kernel will have a locally isotropic character. Yet rescaling of the axes and anisotropic dilations are among the most common forms of variation in naturally occurring data that we would hope to accommodate by learning the kernel.

4.1 The Wishart hyperkernel

We define the Wishart hyperkernel as

K((x1, x1'), (x2, x2')) = ∫_{Σ ≻ 0} ∫ ⟨x1, z⟩_Σ ⟨z, x1'⟩_Σ ⟨x2, z⟩_Σ ⟨z, x2'⟩_Σ IW(Σ; C, r) dz dΣ,    (10)

where

⟨x, x'⟩_Σ = (2π)^{-n/2} |Σ|^{-1/2} e^{-(x - x')ᵀ Σ⁻¹ (x - x') / 2}

and IW(Σ; C, r) is the inverse Wishart distribution

IW(Σ; C, r) = |C|^{r/2} / ( Z_{r,n} |Σ|^{(n+r+1)/2} ) · exp( -tr(C Σ⁻¹) / 2 )

over positive definite matrices (denoted Σ ≻ 0) [6]. Here r is an integer parameter, C is an n × n positive definite parameter matrix, and Z_{r,n} = 2^{rn/2} π^{n(n-1)/4} ∏_{i=1}^n Γ((r+1-i)/2) is a normalizing factor. The Wishart hyperkernel can be seen as the anisotropic analog of (7) in the limit σ_h² → 0, when ⟨z, z'⟩_{σ_h²} → δ(z - z'). Hence, by Lemma 1, it is a valid hyperkernel. In analogy with (8),

K((x1, x1'), (x2, x2')) = ∫_{Σ ≻ 0} ⟨x1, x1'⟩_{2Σ} ⟨x2, x2'⟩_{2Σ} ⟨x̄1, x̄2⟩_Σ IW(Σ; C, r) dΣ.    (11)

By using the identity vᵀ A v = tr(A v vᵀ),

⟨x, x'⟩_Σ IW(Σ; C, r) = ( Z_{r+1,n} |C|^{r/2} ) / ( (2π)^{n/2} Z_{r,n} |C + S|^{(r+1)/2} ) · IW(Σ; C + S, r + 1),

where S = (x - x')(x - x')ᵀ.
Cascading this through each of the terms in the integrand of (11) and noting that the integral of an inverse Wishart density is unity, we conclude that

K((x1, x1'), (x2, x2')) ∝ |C|^{r/2} / |C + S_tot|^{(r+3)/2},    (12)

where S_tot = S1 + S2 + S̄, with S_i = ½ (x_i - x_i')(x_i - x_i')ᵀ and S̄ = (x̄1 - x̄2)(x̄1 - x̄2)ᵀ. We can read off that for given ‖x1 - x1'‖, ‖x2 - x2'‖ and ‖x̄1 - x̄2‖, the hyperkernel will favor quadruples where x1 - x1', x2 - x2' and x̄1 - x̄2 are close to parallel to each other and to the leading eigenvector of C. It is not so easy to immediately see the dependence of K on the relative distances between x1, x1', x2 and x2'. To better expose the qualitative behavior of the Wishart hyperkernel, we fix (x1, x1'), assume that C = cI for some c ∈ R, and use the identity |cI + vvᵀ| = c^{n-1}(c + ‖v‖²) to write

K_{x1,x1'}(x2, x2') ∝ Q_c(2S1, 2S̄)^{(r+3)/2} Q_c(S1 + S̄, S2)^{r+3} / [ (c + 4‖x̄1 - x̄2‖²)^{(r+3)/8} (c + ‖x2 - x2'‖²)^{(r+3)/4} ],

where Q_c(A, B) is the affinity

Q_c(A, B) = |cI + 2A|^{1/4} |cI + 2B|^{1/4} / |cI + A + B|^{1/2}.

This latter expression is a natural positive definite similarity measure between positive definite matrices, as we can see from the fact that it is the overlap integral (Bhattacharyya kernel)

Q_c(A, B) = ∫ ⟨x, 0⟩_{(cI + 2A)⁻¹}^{1/2} ⟨x, 0⟩_{(cI + 2B)⁻¹}^{1/2} dx

between two zero-centered Gaussian distributions with inverse covariances cI + 2A and cI + 2B, respectively [3].

Figure 1: The first two panes show the separation of '3's and '8's in the training and testing sets respectively achieved by the Gaussian hyperkernel (the plots show the data plotted by its first two eigenvectors according to the learned kernel k). The right hand pane shows a similar kernel PCA plot but based on a fixed RBF kernel.

5 Experiments

We conducted preliminary experiments with the hyperkernels in relation learning between pairs of datapoints. The idea here is that the learned kernel k naturally induces a distance metric d(x, x') = ( k(x, x) - 2 k(x, x') + k(x', x') )^{1/2}, and in this sense kernel learning is equivalent to learning d. Given a labeled dataset, we can learn a kernel which effectively remaps the data in such a way that data points with the same label are close to each other, while those with different labels are far apart.

For classification problems (y_i being the class label), a natural choice of quality functional similar to the hinge loss is Q(X, Y, k) = (1/m²) Σ_{i,j=1}^m | 1 - y_ij k(x_i, x_j) |₊, where |z|₊ = z if z ≥ 0 and |z|₊ = 0 for z < 0, while y_ij = 1 if y_i = y_j and y_ij = -1 otherwise. The corresponding optimization problem learns k(x, x') = Σ_{i=1}^m Σ_{j=1}^m α_ij K((x, x'), (x_i, x_j)) + b by minimizing

(1/2) Σ_{i,j} Σ_{i',j'} α_ij α_{i'j'} K((x_i, x_j), (x_{i'}, x_{j'})) + C Σ_{i,j} ξ_ij

subject to the classification constraints

y_ij ( Σ_{i',j'} α_{i'j'} K((x_i, x_j), (x_{i'}, x_{j'})) + b ) ≥ 1 - ξ_ij,    ξ_ij ≥ 0,    α_ij ≥ 0

for all pairs i, j ∈ {1, 2, ..., m}. In testing we interpret k(x, x') > 0 to mean that x and x' are of the same class and k(x, x') ≤ 0 to mean that they are of different classes.

As an illustrative example we learned a kernel (and hence, a metric) between a subset of the NIST handwritten digits¹. The training data consisted of 20 '3's and 20 '8's randomly rotated by 45 degrees to make the problem slightly harder. Figure 1 shows that a kernel learned by the above strategy with a Gaussian hyperkernel, with parameters set by cross validation, is extremely good at separating the two classes in training as well as testing. In comparison, in a similar plot for a fixed RBF kernel the '3's and '8's are totally intermixed. Interpreting this as an information retrieval problem, we can imagine inflating a ball around each data point in the test set and asking how many other data points in this ball are of the same class. The corresponding area under the curve (AUC) in the original space is just 0.5575, while in the hyperkernel space it is 0.7341.

¹ Provided at http://yann.lecun.com/exdb/mnist/ courtesy of Yann LeCun and Corinna Cortes.

Figure 2: Test area under the curve (AUC) for Olivetti face recognition under varying σ and σ_h (panels for σ_h = 0, 1, 2, 4, 6 and 10, comparing SVM, linear hyperkernel and conic hyperkernel).

We ran a similar experiment but with multiple classes on the Olivetti faces dataset, which consists of 92 × 112 pixel normalized gray-scale images of 30 individuals in 10 different poses. Here we also experimented with dropping the α_ij ≥ 0 constraints, which breaks the positive definiteness of k but might still give a reasonable similarity measure. The first case we call \"conic hyperkernels\", whereas the second are just \"linear hyperkernels\". Both involve solving a quadratic program over 2m² + 1 variables.
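The learned kernel and its induced metric are simple to assemble once expansion coefficients are available. The sketch below is illustrative only: the weights α_ij are random nonnegative stand-ins for trained values, and the closed form (8) of the Gaussian hyperkernel supplies K in the conic expansion (2); d is the induced distance described above.

```python
import numpy as np

def gauss(x, xp, s2):
    """Normalized Gaussian <x, x'>_{s2}."""
    n = len(x)
    return (2 * np.pi * s2) ** (-n / 2) * np.exp(-np.sum((x - xp) ** 2) / (2 * s2))

def hyper_K(p1, p2, s2=1.0, s2h=0.5):
    """Gaussian hyperkernel closed form (8)."""
    (x1, x1p), (x2, x2p) = p1, p2
    return (gauss(x1, x1p, 2 * s2) * gauss(x2, x2p, 2 * s2)
            * gauss((x1 + x1p) / 2, (x2 + x2p) / 2, s2 + s2h))

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 2))        # "training" points
alpha = rng.uniform(0, 1, (5, 5))      # hypothetical nonnegative weights (not trained)

def k(x, xp):
    """Learned kernel: conic combination (2) with alpha_ij >= 0."""
    return sum(alpha[i, j] * hyper_K((X[i], X[j]), (x, xp))
               for i in range(5) for j in range(5))

def d(x, xp):
    """Induced metric d(x, x') = sqrt(k(x,x) - 2 k(x,x') + k(x',x'))."""
    return np.sqrt(max(k(x, x) - 2 * k(x, xp) + k(xp, xp), 0.0))

u, v = rng.standard_normal((2, 2))
d_uv = d(u, v)
```

Because each slice K((x_i, x_j), ·, ·) is symmetric and positive definite in (x, x'), the conic combination yields a valid kernel and hence a well-defined d.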
Finally, as a baseline, we trained an SVM over pairs of datapoints to predict y_ij, representing (x_i, x_j) with a concatenated feature vector [x_i, x_j] and using a Gaussian RBF kernel between these concatenations. The results on the Olivetti dataset are summarized in Figure 2. We trained the system with m = 20 faces and considered all pairs of the training data-points (i.e. 400 constraints) to find a kernel that predicted the labeling matrix. When speed becomes an issue it often suffices to work with a subsample of the binary entries in the m × m label matrix and thus avoid having m² constraints. Also, we only need to consider half the entries due to symmetry. Using the learned kernel, we then test on 100 unseen faces and predict all their pairwise kernel evaluations, in other words, 10⁴ predicted pairwise labelings. Test error rates are averaged over 10 folds of the data. For both the baseline Gaussian RBF and the Gaussian hyperkernels we varied the parameter σ from 0.1 to 0.6. For the Gaussian hyperkernel we also varied σ_h from 0 to 10. We used a value of C = 10 for all experiments and for all algorithms; the value of C had very little effect on the testing accuracy.

Using a conic hyperkernel combination did best in labeling new faces. The advantage over SVMs is dramatic: the support vector machine can only achieve an AUC of less than 0.75, while the Gaussian hyperkernel methods achieve an AUC of almost 0.9 with only m = 20 training examples. While the difference between the conic and linear hyperkernel methods is harder to see, across all settings of σ and σ_h the conic combination outperformed the linear combination over 92% of the time. The conic hyperkernel combination is also the only method of the three that guarantees a true Mercer kernel as an output, which can then be converted into a valid metric. The average runtime for the three methods was comparable.
The SVM took 2.08s ± 0.18s, the linear hyperkernel took 2.75s ± 0.10s and the conic hyperkernel took 7.63s ± 0.50s to train on m = 20 faces with m² constraints. We implemented quadratic programming using the MOSEK optimization package on a single CPU workstation.

6 Conclusions

The main barrier to hyperkernels becoming more popular is their high computational demands (out of the box algorithms run in O(m⁶) time as opposed to O(m³) in regular learning). In certain metric learning and on-line settings, however, this need not be forbidding, and is compensated for by the elegance and generality of the framework. The Gaussian and Wishart hyperkernels presented in this paper are in a sense canonical, with intuitively appealing interpretations. In the case of the Gaussian hyperkernel we even have a natural regularization scheme. Preliminary experiments show that these new hyperkernels can capture the inherent structure of some input spaces. We hope that their introduction will give a boost to the whole hyperkernels field.

Acknowledgements

The authors wish to thank Zoubin Ghahramani, Alex Smola and Cheng Soon Ong for discussions related to this work. This work was supported in part by National Science Foundation grants IIS-0347499, CCR-0312690 and IIS-0093302.

References

[1] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 367-373, Cambridge, MA, 2002. MIT Press.
[2] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95, 1971.
[3] R. Kondor and T. Jebara. A kernel between sets of vectors. In Machine Learning: Tenth International Conference, ICML 2003, 2003.
[4] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Machine Learning: Proceedings of the Nineteenth International Conference (ICML '02), 2002.
[5] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27-72, 2004.
[6] T. P. Minka. Inferring a Gaussian distribution. Tutorial paper, 2001. Available at http://www.stat.cmu.edu/~minka/papers/learning.html.
[7] C. S. Ong and A. J. Smola. Machine learning using hyperkernels. In Proceedings of the International Conference on Machine Learning, 2003.
[8] C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 478-485. MIT Press, Cambridge, MA, 2003.
[9] C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Submitted to the Journal of Machine Learning Research, 2003.
[10] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
", "award": [], "sourceid": 3099, "authors": [{"given_name": "Risi", "family_name": "Kondor", "institution": null}, {"given_name": "Tony", "family_name": "Jebara", "institution": null}]}