{"title": "Learning Non-Linear Combinations of Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 396, "page_last": 404, "abstract": "This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.", "full_text": "Learning Non-Linear Combinations of Kernels

Corinna Cortes
Google Research
76 Ninth Ave
New York, NY 10011
corinna@google.com

Mehryar Mohri
Courant Institute and Google
251 Mercer Street
New York, NY 10012
mohri@cims.nyu.edu

Afshin Rostamizadeh
Courant Institute and Google
251 Mercer Street
New York, NY 10012
rostami@cs.nyu.edu

Abstract

This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations.
Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

1 Introduction

Learning algorithms based on kernels have been used with much success in a variety of tasks [17, 19]. Classification algorithms such as support vector machines (SVMs) [6, 10], regression algorithms, e.g., kernel ridge regression and support vector regression (SVR) [16, 22], and general dimensionality reduction algorithms such as kernel PCA (KPCA) [18] all benefit from kernel methods. Positive definite symmetric (PDS) kernel functions implicitly specify an inner product in a high-dimensional Hilbert space where large-margin solutions are sought. So long as the kernel function used is PDS, convergence of the training algorithm is guaranteed.

However, in the typical use of these kernel method algorithms, the choice of the PDS kernel, which is crucial to improved performance, is left to the user. A less demanding alternative is to require the user to instead specify a family of kernels and to use the training data to select the most suitable kernel out of that family. This is commonly referred to as the problem of learning kernels.

There is a large recent body of literature addressing various aspects of this problem, including deriving efficient solutions to the optimization problems it generates and providing a better theoretical analysis of the problem both in classification and regression [1, 8, 9, 11, 13, 15, 21]. With the exception of a few publications considering infinite-dimensional kernel families such as hyperkernels [14] or general convex classes of kernels [2], the great majority of analyses and algorithmic results focus on learning finite linear combinations of base kernels as originally considered by [12].
However, despite the substantial progress made in the theoretical understanding and the design of efficient algorithms for the problem of learning such linear combinations of kernels, no method seems to reliably give improvements over baseline methods. For example, the learned linear combination does not consistently outperform either the uniform combination of base kernels or simply the best single base kernel (see, for example, the UCI dataset experiments in [9, 12]; see also the NIPS 2008 workshop). This suggests exploring other non-linear families of kernels to obtain consistent and significant performance improvements.

Non-linear combinations of kernels have been recently considered by [23]. However, here too, experimental results have not demonstrated a consistent performance improvement for the general learning task. Another method, hierarchical multiple kernel learning [3], considers learning a linear combination of an exponential number of linear kernels, which can be efficiently represented as a product of sums. Thus, this method can also be classified as learning a non-linear combination of kernels. However, in [3] the base kernels are restricted to concatenation kernels, where the base kernels apply to disjoint subspaces. For this approach the authors provide an effective and efficient algorithm, and some performance improvement is actually observed for regression problems in very high dimensions.

This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze that problem in the case of regression using the kernel ridge regression (KRR) algorithm. We show how to simplify its optimization problem from a minimax problem to a simpler minimization problem and prove that the global solution of the optimization problem always lies on the boundary.
We give a projection-based gradient descent algorithm for solving this minimization problem that is shown empirically to converge in few iterations. Furthermore, we give a necessary and sufficient condition for this algorithm to reach a global optimum. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

The paper is structured as follows. In Section 2, we introduce the non-linear family of kernels considered. Section 3 discusses the learning problem, formulates the optimization problem, and presents our solution. In Section 4, we study the performance of our algorithm for learning non-linear combinations of kernels in regression (NKRR) on several publicly available datasets.

2 Kernel Family

This section introduces and discusses the family of kernels we consider for our learning kernel problem. Let K_1, ..., K_p be a finite set of kernels that we combine to define more complex kernels. We refer to these kernels as base kernels. In much of the previous work on learning kernels, the family of kernels considered is that of linear or convex combinations of some base kernels. Here, we consider polynomial combinations of higher degree d ≥ 1 of the base kernels with non-negative coefficients of the form:

K_μ = Σ_{0 ≤ k_1+···+k_p ≤ d, k_i ≥ 0, i ∈ [1,p]} μ_{k_1···k_p} K_1^{k_1} ··· K_p^{k_p},   μ_{k_1···k_p} ≥ 0.   (1)

Any kernel function K_μ of this form is PDS since products and sums of PDS kernels are PDS [4]. Note that K_μ is in fact a linear combination of the PDS kernels K_1^{k_1} ··· K_p^{k_p}. However, the number of coefficients μ_{k_1···k_p} is in O(p^d), which may be too large for a reliable estimation from a sample of size m. Instead, we can assume that for some subset I of all p-tuples (k_1, ..., k_p), μ_{k_1···k_p} can be written as a product of non-negative coefficients μ_1, ..., μ_p: μ_{k_1···k_p} = μ_1^{k_1} ··· μ_p^{k_p}. Then, the general form of the polynomial combinations we consider becomes

K = Σ_{(k_1,...,k_p) ∈ I} μ_1^{k_1} ··· μ_p^{k_p} K_1^{k_1} ··· K_p^{k_p} + Σ_{(k_1,...,k_p) ∈ J} μ_{k_1···k_p} K_1^{k_1} ··· K_p^{k_p},   (2)

where J denotes the complement of the subset I. The total number of free parameters is then reduced to p + |J|. The choice of the set I and its size depends on the sample size m and possible prior knowledge about relevant kernel combinations. The second sum of equation (2) defining our general family of kernels represents a linear combination of PDS kernels. In the following, we focus on kernels that have the form of the first sum and that are thus non-linear in the parameters μ_1, ..., μ_p. More specifically, we consider kernels K_μ defined by

K_μ = Σ_{k_1+···+k_p = d} μ_1^{k_1} ··· μ_p^{k_p} K_1^{k_1} ··· K_p^{k_p},   (3)

where μ = (μ_1, ..., μ_p)^⊤ ∈ R^p. For the ease of presentation, our analysis is given for the case d = 2, where the quadratic kernel can be given the following simpler expression:

K_μ = Σ_{k,l=1}^p μ_k μ_l K_k K_l.   (4)

But the extension to higher-degree polynomials is straightforward and our experiments include results for degrees d up to 4.

3 Algorithm for Learning Non-Linear Kernel Combinations

3.1 Optimization Problem

We consider a standard regression problem where the learner receives a training sample of size m, S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y ⊆ R the label space.
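Computationally, the quadratic combination of equation (4) need not be formed term by term: since the Hadamard (elementwise) product is bilinear, the Gram matrix of K_μ is simply the elementwise square of the weighted sum of the base Gram matrices. A minimal NumPy sketch (the rank-one toy kernels and all names below are our own, for illustration only):

```python
import numpy as np

def quadratic_kernel(base_kernels, mu):
    """Gram matrix of equation (4): K_mu = sum_{k,l} mu_k mu_l (K_k ∘ K_l).
    By bilinearity of the Hadamard product this equals the elementwise square
    of S = sum_k mu_k K_k, so a single O(m^2) product suffices."""
    S = sum(m_k * K for m_k, K in zip(mu, base_kernels))
    return S * S  # elementwise (Hadamard) square

# toy data: one rank-one base kernel per feature (hypothetical setup)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
base = [np.outer(X[:, k], X[:, k]) for k in range(2)]
mu = np.array([0.6, 0.4])
K_mu = quadratic_kernel(base, mu)

# direct evaluation of the double sum agrees with the factored form
direct = sum(mu[k] * mu[l] * base[k] * base[l]
             for k in range(2) for l in range(2))
```

The resulting matrix is symmetric PSD, as guaranteed by the closure of PDS kernels under products and sums [4].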
The family of hypotheses H_μ out of which the learner selects a hypothesis is the reproducing kernel Hilbert space (RKHS) associated to a PDS kernel function K_μ: X × X → R as defined in the previous section. Unlike standard kernel-based regression algorithms, however, here both the parameter vector μ defining the kernel K_μ and the hypothesis are learned using the training sample S.

The learning kernel algorithm we consider is derived from kernel ridge regression (KRR). Let y = [y_1, ..., y_m]^⊤ ∈ R^m denote the vector of training labels and let K_μ denote the Gram matrix of the kernel K_μ for the sample S: [K_μ]_{i,j} = K_μ(x_i, x_j), for all i, j ∈ [1, m]. The standard KRR dual optimization problem for a fixed kernel matrix K_μ is given in terms of the Lagrange multipliers α ∈ R^m by [16]:

max_{α ∈ R^m} −α^⊤(K_μ + λI)α + 2α^⊤y.   (5)

The related problem of learning the kernel K_μ concomitantly can be formulated as the following min-max optimization problem [9]:

min_{μ ∈ M} max_{α ∈ R^m} −α^⊤(K_μ + λI)α + 2α^⊤y,   (6)

where M is a positive, bounded, and convex set. The positivity of μ ensures that K_μ is positive semi-definite (PSD) and its boundedness forms a regularization controlling the norm of μ.¹ Two natural choices for the set M are the norm-1 and norm-2 bounded sets,

M₁ = {μ | μ ⪰ 0 ∧ ‖μ − μ₀‖₁ ≤ Λ}   (7)
M₂ = {μ | μ ⪰ 0 ∧ ‖μ − μ₀‖₂ ≤ Λ}.   (8)

These definitions include an offset parameter μ₀ for the weights μ. Some natural choices for μ₀ are: μ₀ = 0, or μ₀/‖μ₀‖ = 1.
Note that here, since the objective function is not linear in μ, the norm-1-type regularization may not lead to a sparse solution.

3.2 Algorithm Formulation

For learning linear combinations of kernels, a typical technique consists of applying the minimax theorem to permute the min and max operators, which can lead to optimization problems computationally more efficient to solve [8, 12]. However, in the non-linear case we are studying, this technique is unfortunately not applicable.

Instead, our method for learning non-linear kernels and solving the min-max problem in equation (6) consists of first directly solving the inner maximization problem. In the case of KRR, for any fixed μ the optimum is given by

α = (K_μ + λI)^{-1} y.   (9)

Plugging the optimal expression of α into the min-max optimization yields the following equivalent minimization in terms of μ only:

min_{μ ∈ M} F(μ) = y^⊤(K_μ + λI)^{-1} y.   (10)

We refer to this optimization as the NKRR problem. Although the original min-max problem has been reduced to a simpler minimization problem, the function F is not convex in general, as illustrated by Figure 1. For small values of μ, concave regions are observed.
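Both the closed-form dual vector of equation (9) and the objective of equation (10) are cheap to evaluate. A minimal sketch (function name is ours; a linear solve replaces the explicit inverse):

```python
import numpy as np

def nkrr_objective(K_mu, y, lam):
    """F(mu) = y^T (K_mu + lam*I)^{-1} y  (equation (10)), together with the
    optimal dual vector alpha = (K_mu + lam*I)^{-1} y of equation (9)."""
    m = K_mu.shape[0]
    alpha = np.linalg.solve(K_mu + lam * np.eye(m), y)  # solve, don't invert
    return float(y @ alpha), alpha

# sanity check: with K_mu = 0 the objective reduces to ||y||^2 / lam
y = np.array([1.0, 2.0, 3.0])
F0, alpha0 = nkrr_objective(np.zeros((3, 3)), y, lam=2.0)
```

The K_μ = 0 value ‖y‖²/λ is also, not coincidentally, the maximal value of F identified by Proposition 2 below.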
Thus, standard interior-point or gradient methods are not guaranteed to be successful at finding a global optimum.

In the following, we give an analysis which shows that under certain conditions it is however possible to guarantee the convergence of a gradient-descent type algorithm to a global minimum.

Algorithm 1 illustrates a general gradient descent algorithm for the norm-2 bounded setting which projects μ back to the feasible set M₂ after each gradient step (projecting to M₁ is very similar).

¹To clarify the difference between similar acronyms, a PDS function corresponds to a PSD matrix [4].

Figure 1: Example plots for F defined over two linear base kernels generated from the first two features of the sonar dataset. From left to right λ = 1, 10, 100. For larger values of λ it is clear that there are in fact concave regions of the function near 0.

Algorithm 1 Projection-based Gradient Descent Algorithm
  Input: μ_init ∈ M₂, η ∈ [0, 1], ε > 0, K_k for k ∈ [1, p]
  μ′ ← μ_init
  repeat
    μ ← μ′
    μ′ ← μ − η∇F(μ)
    ∀k, μ′_k ← max(0, μ′_k)
    normalize μ′ s.t. ‖μ′ − μ₀‖ = Λ
  until ‖μ′ − μ‖ < ε

In Algorithm 1 we have fixed the step size η; however, this can be adjusted at each iteration via a line-search.
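A minimal NumPy sketch of Algorithm 1 for the quadratic case d = 2 follows (all names are ours, and the data is synthetic; the gradient uses the expression derived in Section 3.3, and the projection follows the clip-then-normalize steps of the pseudocode, which keeps iterates on the sphere ‖μ − μ₀‖ = Λ):

```python
import numpy as np

def grad_F(mu, base_kernels, y, lam):
    """Gradient from Proposition 1: dF/dmu_k = -2 alpha^T U_k alpha, where
    U_k = (sum_r mu_r K_r) ∘ K_k and alpha = (K_mu + lam*I)^{-1} y (d = 2)."""
    S = sum(m_k * K for m_k, K in zip(mu, base_kernels))
    alpha = np.linalg.solve(S * S + lam * np.eye(len(y)), y)
    return np.array([-2.0 * alpha @ ((S * K) @ alpha) for K in base_kernels])

def project(mu, mu0, Lam):
    """Clip negative weights, then rescale the offset mu - mu0 to norm Lam,
    as in the last two steps of Algorithm 1."""
    mu = np.maximum(mu, 0.0)
    d = mu - mu0
    n = np.linalg.norm(d)
    return mu0 + Lam * d / n if n > 0 else mu

def learn_mu(base_kernels, y, lam, mu0, Lam, eta=0.1, eps=1e-8, max_iter=500):
    mu = project(mu0 + 1e-3, mu0, Lam)  # arbitrary feasible starting point
    for _ in range(max_iter):
        mu_new = project(mu - eta * grad_F(mu, base_kernels, y, lam), mu0, Lam)
        if np.linalg.norm(mu_new - mu) < eps:
            return mu_new
        mu = mu_new
    return mu

# toy run with two rank-one base kernels (synthetic data)
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 2))
base = [np.outer(X[:, k], X[:, k]) for k in range(2)]
y = rng.standard_normal(8)
mu0 = np.ones(2)
mu_hat = learn_mu(base, y, lam=1.0, mu0=mu0, Lam=0.5)
```

Every iterate returned by the projection satisfies the constraints of M₂ with ‖μ − μ₀‖ = Λ, consistent with the boundary result of Section 3.3.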
Furthermore, as shown later, the thresholding step that forces μ′ to be positive is unnecessary since ∇F is never positive.

Note that Algorithm 1 is simpler than the wrapper method proposed by [20]. Because of the closed-form expression (10), we do not alternate between solving for the dual variables and performing a gradient step in the kernel parameters. We only need to optimize with respect to the kernel parameters.

3.3 Algorithm Properties

We first explicitly calculate the gradient of the objective function for the optimization problem (10). In what follows, ∘ denotes the Hadamard (pointwise) product between matrices.

Proposition 1. For any k ∈ [1, p], the partial derivative of F: μ → y^⊤(K_μ + λI)^{-1}y with respect to μ_k is given by

∂F/∂μ_k = −2 α^⊤ U_k α,   (11)

where U_k = (Σ_{r=1}^p (μ_r K_r)) ∘ K_k.

Proof. In view of the identity ∇_M Tr(y^⊤M^{-1}y) = −M^{-⊤} y y^⊤ M^{-⊤}, we can write:

∂F/∂μ_k = Tr[ (∂ y^⊤(K_μ + λI)^{-1}y / ∂(K_μ + λI)) · ∂(K_μ + λI)/∂μ_k ]
        = −Tr[ (K_μ + λI)^{-1} y y^⊤ (K_μ + λI)^{-1} ∂(K_μ + λI)/∂μ_k ]
        = −Tr[ (K_μ + λI)^{-1} y y^⊤ (K_μ + λI)^{-1} ( 2 Σ_{r=1}^p (μ_r K_r) ∘ K_k ) ]
        = −2 y^⊤(K_μ + λI)^{-1} ( Σ_{r=1}^p (μ_r K_r) ∘ K_k ) (K_μ + λI)^{-1} y = −2 α^⊤ U_k α.

The matrix U_k just defined in Proposition 1 is always PSD, thus ∂F/∂μ_k ≤ 0 for all k ∈ [1, p] and ∇F ≤ 0. As already mentioned, this fact obviates the thresholding step in Algorithm 1. We now provide guarantees for convergence to a global optimum.
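The gradient expression of Proposition 1 can be checked numerically against central finite differences; a self-contained toy sketch (all data synthetic, names ours):

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, lam = 6, 3, 0.5
X = rng.standard_normal((m, p))
y = rng.standard_normal(m)
base = [np.outer(X[:, k], X[:, k]) for k in range(p)]  # one kernel per feature
mu = np.array([0.7, 0.4, 0.9])

def F(mu):
    # F(mu) = y^T (K_mu + lam*I)^{-1} y with the quadratic kernel K_mu = S ∘ S
    S = sum(m_k * K for m_k, K in zip(mu, base))
    return float(y @ np.linalg.solve(S * S + lam * np.eye(m), y))

# analytic gradient of Proposition 1: dF/dmu_k = -2 alpha^T U_k alpha
S = sum(m_k * K for m_k, K in zip(mu, base))
alpha = np.linalg.solve(S * S + lam * np.eye(m), y)
grad = np.array([-2.0 * alpha @ ((S * K) @ alpha) for K in base])

# central finite differences for comparison
h = 1e-6
fd = np.array([(F(mu + h * e) - F(mu - h * e)) / (2 * h) for e in np.eye(p)])
```

The two agree to several digits, and every component is non-positive, as the PSD-ness of U_k predicts.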
We shall assume that λ is strictly positive: λ > 0.

Proposition 2. Any stationary point μ⋆ of the function F: μ → y^⊤(K_μ + λI)^{-1}y necessarily maximizes F:

F(μ⋆) = max_μ F(μ) = ‖y‖²/λ.   (12)

Proof. In view of the expression of the gradient given by Proposition 1, at any point μ⋆,

μ⋆^⊤ ∇F(μ⋆) = −2 α^⊤ ( Σ_{k=1}^p μ⋆_k U_k ) α = −2 α^⊤ K_μ⋆ α.   (13)

By definition, if μ⋆ is a stationary point, ∇F(μ⋆) = 0, which implies μ⋆^⊤∇F(μ⋆) = 0. Thus, α^⊤ K_μ⋆ α = 0, which implies K_μ⋆ α = 0, that is

K_μ⋆ (K_μ⋆ + λI)^{-1} y = 0 ⇔ (K_μ⋆ + λI − λI)(K_μ⋆ + λI)^{-1} y = 0   (14)
                           ⇔ y − λ(K_μ⋆ + λI)^{-1} y = 0   (15)
                           ⇔ (K_μ⋆ + λI)^{-1} y = y/λ.   (16)

Thus, for any such stationary point μ⋆, F(μ⋆) = y^⊤(K_μ⋆ + λI)^{-1}y = y^⊤y/λ, which is clearly a maximum since K_μ being PSD for any μ ⪰ 0 implies F(μ) ≤ y^⊤y/λ.

We next show that there cannot be an interior stationary point, and thus any local minimum strictly within the feasible set, unless the function is constant.

Proposition 3. If any point μ⋆ > 0 is a stationary point of F: μ → y^⊤(K_μ + λI)^{-1}y, then the function is necessarily constant.

Proof.
Assume that μ⋆ > 0 is a stationary point. Then, by Proposition 2, F(μ⋆) = y^⊤(K_μ⋆ + λI)^{-1}y = y^⊤y/λ, which implies that y is an eigenvector of (K_μ⋆ + λI)^{-1} with eigenvalue λ^{-1}. Equivalently, y is an eigenvector of K_μ⋆ + λI with eigenvalue λ, which is equivalent to y ∈ null(K_μ⋆). Thus,

y^⊤ K_μ⋆ y = Σ_{k,l=1}^p μ_k μ_l [ Σ_{r,s=1}^m y_r y_s K_k(x_r, x_s) K_l(x_r, x_s) ] = 0,   (17)

where the bracketed term is denoted (*). Since the product of PDS functions is also PDS, (*) must be non-negative. Furthermore, since by assumption μ_k > 0 for all k ∈ [1, p], it must be the case that each term (*) is equal to zero. Thus, equation (17) is equal to zero for all μ and the function F is equal to the constant ‖y‖²/λ.

The previous propositions are sufficient to show that the gradient descent algorithm will not become stuck at a local minimum while searching the interior of a convex set M and, furthermore, they indicate that the optimum is found at the boundary.

The following proposition gives a necessary and sufficient condition for the convexity of F on a convex region C. If the boundary region defined by ‖μ − μ₀‖ = Λ is contained in this convex region, then Algorithm 1 is guaranteed to converge to a global optimum. Let u ∈ R^p represent an arbitrary direction of μ in C. We simplify the analysis of convexity in the following derivation by separating the terms that depend on K_μ and those depending on K_u, which arise when showing the positive semi-definiteness of the Hessian, i.e., u^⊤∇²F u ⪰ 0. We denote by ⊗ the Kronecker product of two matrices.

Proposition 4.
The function F: μ → y^⊤(K_μ + λI)^{-1}y is convex over the convex set C iff the following condition holds for all μ ∈ C and all u:

⟨M, N − ẽ₁⟩_F ≥ 0,   (18)

where M = (1 ⊗ vec(αα^⊤)^⊤) ∘ (K_u ⊗ K_u), N = 4(1 ⊗ vec(V)^⊤) ∘ (K_μ ⊗ K_μ), and ẽ₁ is the matrix with zero-one entries constructed to select the terms [M]_{ijkl} where i = k and j = l, i.e., it is non-zero only in the (i, j)th coordinate of the (i, j)th m × m block.

Table 1: The square root of the mean squared error is reported for each method and several datasets.

Data         m    p   lin. base   lin. ℓ1     lin. ℓ2     quad. base  quad. ℓ1    quad. ℓ2
Parkinsons   194  21  .70 ± .03   .70 ± .04   .70 ± .03   .65 ± .03   .66 ± .03   .64 ± .03
Iono         351  34  .82 ± .03   .81 ± .04   .81 ± .03   .62 ± .05   .62 ± .05   .60 ± .05
Sonar        208  60  .90 ± .02   .92 ± .03   .90 ± .04   .84 ± .03   .80 ± .04   .80 ± .04
Breast       683  9   .70 ± .02   .71 ± .02   .70 ± .02   .70 ± .02   .70 ± .01   .70 ± .01

Proof. For any u ∈ R^p, the expression of the Hessian of F at the point μ ∈ C can be derived from that of its gradient and shown to be

u^⊤(∇²F)u = 4 α^⊤(K_μ ∘ K_u) V (K_μ ∘ K_u) α − α^⊤(K_u ∘ K_u) α,   (19)

where V = (K_μ + λI)^{-1} and α = (K_μ + λI)^{-1}y. Expanding each term, we obtain:

α^⊤(K_μ ∘ K_u) V (K_μ ∘ K_u) α = Σ_{i,j=1}^m α_i α_j Σ_{k,l=1}^m [K_μ]_{ik}[K_u]_{ik} [V]_{kl} [K_μ]_{lj}[K_u]_{lj}
                               = Σ_{i,j,k,l=1}^m (α_i α_j [K_u]_{ik}[K_u]_{lj}) ([V]_{kl}[K_μ]_{ik}[K_μ]_{lj})

and α^⊤(K_u ∘ K_u)α = Σ_{i,j=1}^m α_i α_j [K_u]_{ij}[K_u]_{ij}.
Let 1 ∈ R^{m²} denote the column vector of all ones and let vec(A) denote the vectorization of a matrix A by stacking its columns. Let the matrices M and N be defined as in the statement of the proposition. Then, [M]_{ijkl} = α_i α_j [K_u]_{ik}[K_u]_{lj} and [N]_{ijkl} = [V]_{kl}[K_μ]_{ik}[K_μ]_{lj}. Then, in view of the definition of ẽ₁, the terms of equation (19) can be represented with the Frobenius inner product,

u^⊤(∇²F)u = ⟨M, N⟩_F − ⟨M, ẽ₁⟩_F = ⟨M, N − ẽ₁⟩_F.   (20)

For any μ ∈ R^p, let K_μ = Σ_i μ_i K_i and let V = (K_μ + λI)^{-1}. We now show that the condition of Proposition 4 is satisfied for convex regions for which Λ, and therefore μ, is sufficiently large, in the case where K_u and K_μ are diagonal. In that case, M, N and V are diagonal as well and the condition of Proposition 4 can be rewritten as follows:

Σ_{i,j} [K_u]_{ii}[K_u]_{jj} α_i α_j (4[K_μ]_{ii}[K_μ]_{jj} V_{ij} − 1_{i=j}) ≥ 0.   (21)

Using the fact that V is diagonal, this inequality can be further simplified:

Σ_{i=1}^m [K_u]²_{ii} α²_i (4[K_μ]²_{ii} V_{ii} − 1) ≥ 0.   (22)

A sufficient condition for this inequality to hold is that each term (4[K_μ]²_{ii}V_{ii} − 1) be non-negative, or equivalently that 4K_μ²V − I ⪰ 0, that is K_μ ⪰ √(λ/3) I. Therefore, it suffices to select μ such that

min_i Σ_{k=1}^p μ_k [K_k]_{ii} ≥ √(λ/3).   (23)

4 Empirical Results

To test the advantage of learning non-linear kernel combinations, we carried out a number of experiments on publicly available datasets. The datasets are chosen to demonstrate the effectiveness of the algorithm under a number of conditions. For general performance improvement, we chose a number of UCI datasets frequently used in kernel learning experiments, e.g., [7, 12, 15].
For learning with thousands of kernels, we chose the sentiment analysis dataset of Blitzer et al. [5]. Finally, for learning with higher-order polynomials, we selected datasets with a large number of examples, such as kin-8nm from the Delve repository. The experiments were run on a 2.33 GHz Intel Xeon processor with 2GB of RAM.

Figure 2: The performance of baseline and learned quadratic kernels (plus or minus one standard deviation) versus the number of bigrams (and kernels) used. [Two RMSE vs. number-of-bigrams panels, Kitchen and Electronics, each showing Baseline, L1 reg., and L2 reg. curves.]

4.1 UCI Datasets

We first analyzed the performance of the kernels learned as quadratic combinations. For each dataset, features were scaled to lie in the interval [0, 1]. Then, both labels and features were centered. In the case of classification datasets, the labels were set to ±1 and the RMSE was reported. We associated a base kernel with each feature, which computes the product of this feature between different examples. We compared both linear and quadratic combinations, each with a baseline (uniform), norm-1-regularized, and norm-2-regularized weighting using μ₀ = 1 corresponding to the weights of the baseline kernel. The parameters λ and Λ were selected via 10-fold cross-validation and the error reported was based on 30 random 50/50 splits of the entire dataset into training and test sets. For the gradient descent algorithm, we started with η = 1 and reduced it by a factor of 0.8 if the step was found to be too large, i.e., if the difference ‖μ′ − μ‖ increased.
Convergence was typically obtained in less than 25 steps, each requiring a fraction of a second (∼0.05 seconds).

The results, which are presented in Table 1, are in line with previous ones reported for learning kernels on these datasets [7, 8, 12, 15]. They indicate that learning quadratic combination kernels can sometimes offer improvements and that it clearly does not degrade the performance with respect to the baseline kernel. The learned quadratic combination performs well, particularly on tasks where the number of features is large compared to the number of points. This suggests that the learned kernel is better regularized than the plain quadratic kernel and can be advantageous in scenarios where over-fitting is an issue.

4.2 Text-Based Dataset

We next analyzed a text-based task where features are frequent word n-grams. Each base kernel computes the product between the counts of a particular n-gram for the given pair of points. Such kernels have a direct connection to count-based rational kernels, as described in [8]. We used the sentiment analysis dataset of Blitzer et al. [5]. This dataset contains text-based user reviews found for products on amazon.com. Each text review is associated with a 0-5 star rating. The product reviews fall into two categories: electronics and kitchen-wares, each with 2,000 data-points. The data was not centered in this case since we wished to preserve sparsity, which offers the advantage of significantly more efficient computations. A constant feature was included to act as an offset.

For each domain, the parameters λ and Λ were chosen via 10-fold cross-validation on 1,000 points. Once these parameters were fixed, the performance of each algorithm was evaluated using 20 random 50/50 splits of the entire 2,000 points into training and test sets.
We used the performance of the uniformly weighted quadratic combination kernel as a baseline, and showed the improvement when learning the kernel with norm-1 or norm-2 regularization using μ₀ = 1 corresponding to the weights of the baseline kernel. As shown by Figure 2, the learned kernels significantly improved over the baseline quadratic kernel in both the kitchen and electronics categories. For this case too, the number of features was large in comparison with the number of points. Using 900 training points and about 3,600 bigrams, and thus kernels, each iteration of the algorithm took approximately 25 seconds to compute with our Matlab implementation. When using norm-2 regularization, the algorithm generally converges in under 30 iterations, while the norm-1 regularization requires even fewer iterations, typically less than 5.

Figure 3: Performance on the kin-8nm dataset. For all polynomials (degrees 1 through 4), we compared un-weighted, standard KRR (solid lines) with norm-2 regularized kernel learning (dashed lines); MSE is plotted against the training-data subsampling factor. For 4th-degree polynomials we observed a clear performance improvement, especially for a medium amount of training data (subsampling factor of 10-50). Standard deviations were typically on the order of 0.005, so the results were statistically significant.

4.3 Higher-order Polynomials

We finally investigated the performance of higher-order non-linear combinations. For these experiments, we used the kin-8nm dataset from the Delve repository. This dataset has 20,000 examples with 8 input features. Here too, we used polynomial kernels over the features, but this time we experimented with polynomials of degree as high as 4.
Again, we made the assumption that all coefficients of μ are in the form of products of μ_i's (see Section 2), thus only 8 kernel parameters needed to be estimated.

We split the data into 10,000 examples for training and 10,000 examples for testing, and, to investigate the effect of the sample size on learning kernels, subsampled the training data by a factor ranging from 1 to 100. The parameters λ and Λ were determined by 10-fold cross-validation on the training data, and results are reported on the test data; see Figure 3. We used norm-2 regularization with μ₀ = 1 and compared our results with those of uniformly weighted KRR.

For lower-degree polynomials, the performance was essentially the same, but for 4th-degree polynomials we observed a significant performance improvement of learning kernels over the uniformly weighted KRR, especially for a medium amount of training data (subsampling factor of 10-50). For the sake of readability, the standard deviations are not indicated in the plot. They were typically on the order of 0.005, so the results were statistically significant. This result corroborates the finding on the UCI datasets, that learning kernels is better regularized than plain unweighted KRR and can be advantageous in scenarios where overfitting is an issue.

5 Conclusion

We presented an analysis of the problem of learning polynomial combinations of kernels in regression. This extends learning kernel ideas and helps explore kernel combinations leading to better performance. We proved that the global solution of the optimization problem always lies on the boundary and gave a simple projection-based gradient descent algorithm shown empirically to converge in few iterations. We also gave a necessary and sufficient condition for that algorithm to converge to a global optimum.
Finally, we reported the results of several experiments on publicly available datasets demonstrating the benefits of learning polynomial combinations of kernels. We are well aware that this constitutes only a preliminary study and that a better analysis of the optimization problem and solution should be further investigated. We hope that the performance improvements reported will further motivate such analyses.

References

[1] A. Argyriou, R. Hauser, C. Micchelli, and M. Pontil. A DC-programming algorithm for kernel selection. In International Conference on Machine Learning, 2006.
[2] A. Argyriou, C. Micchelli, and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. In Conference on Learning Theory, 2005.
[3] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008.
[4] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag: Berlin-New York, 1984.
[5] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Association for Computational Linguistics, 2007.
[6] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Conference on Learning Theory, 1992.
[7] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3), 2002.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning sequence kernels. In Machine Learning for Signal Processing, 2008.
[9] C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Uncertainty in Artificial Intelligence, 2009.
[10] C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 20(3), 1995.
[11] T. Jebara. Multi-task feature and kernel selection for SVMs.
In International Conference on Machine Learning, 2004.
[12] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 2004.
[13] C. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6, 2005.
[14] C. S. Ong, A. Smola, and R. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 2005.
[15] A. Rakotomamonjy, F. Bach, Y. Grandvalet, and S. Canu. SimpleMKL. Journal of Machine Learning Research, 9, 2008.
[16] C. Saunders, A. Gammerman, and V. Vovk. Ridge Regression Learning Algorithm in Dual Variables. In International Conference on Machine Learning, 1998.
[17] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.
[18] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1998.
[19] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[20] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 2006.
[21] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In Conference on Learning Theory, 2006.
[22] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[23] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In International Conference on Machine Learning, 2009.
", "award": [], "sourceid": 716, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}, {"given_name": "Afshin", "family_name": "Rostamizadeh", "institution": null}]}