{"title": "Learning Kernels with Radiuses of Minimum Enclosing Balls", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 657, "abstract": "In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches.", "full_text": "Learning Kernels with Radiuses of Minimum\n\nEnclosing Balls\n\nKun Gai\n\nGuangyun Chen\n\nChangshui Zhang\n\nState Key Laboratory on Intelligent Technology and Systems\n\nTsinghua National Laboratory for Information Science and Technology (TNList)\n{gaik02, cgy08}@mails.thu.edu.cn, zcs@mail.thu.edu.cn\n\nDepartment of Automation, Tsinghua University, Beijing 100084, China\n\nAbstract\n\nIn this paper, we point out that there exist scaling and initialization problems in\nmost existing multiple kernel learning (MKL) approaches, which employ the large\nmargin principle to jointly learn both a kernel and an SVM classi\ufb01er. 
The reason is that the margin itself cannot well describe how good a kernel is, because it neglects the scaling of the kernel. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combinations of basis kernels it is also invariant to scalings of the basis kernels and to the types (e.g., L1 or L2) of norm constraints on the combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-the-art MKL approaches.

1 Introduction

In recent years, kernel methods, like support vector machines (SVM), have achieved great success in many learning problems, such as classification and regression. For such tasks, the performance strongly depends on the choice of the kernel used. A good kernel function, which implicitly characterizes a suitable transformation of the input data, can greatly benefit the accuracy of the predictor. However, when there are many available kernels, it is difficult for the user to pick out a suitable one. Kernel learning has been developed to jointly learn both a kernel function and an SVM classifier. Chapelle et al. [1] present several principles to tune parameters in kernel functions. In particular, when the learned kernel is restricted to be a linear combination of multiple basis kernels, the problem of learning the combination coefficients as well as an SVM classifier is usually called multiple kernel learning (MKL). Lanckriet et al. [2] formulate the MKL problem as a quadratically constrained quadratic programming problem, which implicitly uses an L1 norm constraint to promote sparse combinations. To improve computational efficiency, different approaches for solving this MKL problem have been proposed, using SMO-like strategies [3], semi-infinite linear programming [4], gradient-based methods [5], and second-order optimization [6]. Subsequent work explores more general forms of multiple kernel learning by promoting non-sparse [7, 8] or group-sparse [9] combinations of basis kernels, or by using other forms of learned kernels, e.g., a combination of an exponential number of kernels [10] or nonlinear combinations [11, 12, 13].

Most existing MKL approaches employ the objective function used in SVM. With an acceptable empirical loss, they aim to find the kernel that leads to the largest margin of the SVM classifier. However, despite the substantial progress in both the algorithmic design and the theoretical understanding of the MKL problem, none of these approaches seems to reliably outperform baseline methods, like SVM with the uniform combination of basis kernels [13]. As will be shown in this paper, the large margin principle used in these methods causes a scaling problem and an initialization problem, which can strongly affect the final solutions of learned kernels as well as their performance. This implies that the large margin preference cannot reliably result in a good kernel, and thus the margin itself is not a suitable measure of the goodness of a kernel.

Motivated by the generalization bounds for SVM and kernel learning, we use the ratio between the margin of the SVM classifier and the radius of the minimum enclosing ball (MEB) of data in the feature space endowed with the learned kernel as a measure of the goodness of the kernel, and propose a new kernel learning formulation. Our formulation differs from the radius-based principle of Chapelle et al. [1]. Their principle is sensitive to kernel scalings when a nonzero empirical loss is allowed, causing the same problems as the margin-based formulations. We prove that our formulation is invariant to scalings of learned kernels, and also invariant to initial scalings of basis kernels and to the types (e.g., L1 or L2) of norm constraints on kernel parameters for the MKL problem. Therefore our formulation completely addresses the scaling and initialization problems. Experiments show that our approach gives significant performance improvements both over SVM with the uniform combination of basis kernels and over other state-of-the-art kernel learning methods.

Our proposed kernel learning problem can be reformulated as a tri-level optimization problem. We establish the differentiability of a general family of multilevel optimization problems. This enables us to handle the radius of the minimum enclosing ball, or other complicated optimal value functions, in the kernel learning framework by simple gradient-based methods. We hope that our results will also benefit other learning problems.

The paper is structured as follows. Section 2 shows problems in previous MKL formulations. In Section 3 we present a new kernel learning formulation and give discussions. Then, we study the differentiability of multilevel optimization problems and give an efficient algorithm in Section 4 and Section 5, respectively. Experiments are shown in Section 6. Finally, we close with a conclusion.

2 Measuring how good a kernel is

Let D = {(x1, y1), ..., (xn, yn)} denote a training set of n pairs of input points xi ∈ X and target labels yi ∈ {±1}.
Suppose we have a kernel family K = {k : X × X → R}, in which any kernel function k implicitly defines a transformation φ(·; k) from the input space X to a feature space by k(xc, xd) = ⟨φ(xc; k), φ(xd; k)⟩. Let a classifier be linear in the feature space endowed with k, as

f(x; w, b, k) = ⟨φ(x; k), w⟩ + b,   (1)

the sign of which is used to classify data. The task of kernel learning (for binary classification) is to learn both a kernel function k ∈ K and a classifier w and b.

To make the problem tractable, the learned kernel is usually restricted to a parametric form k(θ)(·,·), where θ = [θi]i is the kernel parameter. Then the problem of learning a kernel transfers to the problem of learning a kernel parameter θ. The most commonly used kernel form is a linear combination of multiple basis kernels, as

k(θ)(·,·) = Σ_{j=1}^m θj kj(·,·),  θj ≥ 0.   (2)

2.1 Problems in multiple kernel learning

Most existing MKL approaches, e.g., [2, 4, 5], employ the same objective function as in SVM:

min_{k,w,b,ξ}  (1/2)||w||² + C Σ_i ξi,  s.t. yi f(xi; w, b, k) + ξi ≥ 1, ξi ≥ 0,   (3)

where ξi is the hinge loss. This problem can be reformulated as

min_k  G̃(k),   (4)

where

G̃(k) = min_{w,b,ξ}  (1/2)||w||² + C Σ_i ξi,  s.t. yi f(xi; w, b, k) + ξi ≥ 1, ξi ≥ 0.   (5)

For any kernel k, the optimal classifier w and b is exactly the SVM classifier with the kernel k. Let γ denote the margin of the SVM classifier in the feature space endowed with k. We have γ⁻² = ||w||². Thus the term ||w||² makes formulation (3) prefer the kernel that results in an SVM classifier with a larger margin (as well as an acceptable empirical loss). Here, a natural question is whether, across different kernels, the margins of SVM classifiers can well measure the goodness of the kernels.

To answer this question, we consider what happens when a kernel k is enlarged by a scalar a: knew = ak, where a > 1. The corresponding transformations satisfy φ(·; knew) = √a φ(·; k). For k, let {w1*, b1*} denote the optimal solution of (5). For knew, we set w2 = w1*/√a and b2 = b1*; then we have ||w2||² = ||w1*||²/a, and f(x; w2, b2, knew) and f(x; w1*, b1*, k) are the same classifier, resulting in the same ξi. Then we obtain

G̃(ak) = G̃(knew) ≤ (1/2)||w2||² + C Σ_i ξi < (1/2)||w1*||² + C Σ_i ξi = G̃(k),

which means the enlarged kernel gives a larger margin and a smaller objective value. As a consequence, on one hand, the large margin preference drives the scaling of the learned kernel to be as large as possible. On the other hand, any kernel, even one resulting in a bad performance, can give an arbitrarily large margin by enlarging its scaling. This problem is called the scaling problem. It shows that the margin is not a suitable measure of the goodness of a kernel.

In the linear combination case, the scaling problem causes the kernel parameter θ not to converge in the optimization. A remedy is to use a norm constraint on θ. However, it has been shown in recent literature [7, 9] that different types of norm constraints fit different data sets. So users face the difficulty of choosing a suitable norm constraint. Even after a norm constraint is selected, the scaling problem also causes another problem about the initialization.
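The scaling argument can be checked numerically on a minimal example. The sketch below is our illustration, not code from the paper: it uses the closed form for a hard-margin SVM on one point per class, where ||w||² = 4/d² with d² = k(x1,x1) + k(x2,x2) − 2k(x1,x2) the squared feature-space distance, so scaling the kernel by a divides ||w||² (and hence the margin-based objective) by a:

```python
import numpy as np

def two_point_wnorm_sq(K):
    """Hard-margin SVM with one point per class: ||w||^2 = 4 / (squared
    feature-space distance between the two points)."""
    d2 = K[0, 0] + K[1, 1] - 2 * K[0, 1]
    return 4.0 / d2

K = np.array([[1.0, 0.2],
              [0.2, 1.0]])  # a toy 2x2 kernel matrix (our choice)

for a in (1.0, 10.0, 100.0):
    # ||w||^2 shrinks like 1/a: the margin objective rewards mere upscaling
    print(a, two_point_wnorm_sq(a * K))
```

Any kernel, however poor, can thus be made to "win" the large-margin comparison simply by multiplying it with a large enough scalar.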
Consider an L1 norm constraint and a learned kernel which is a combination of two basis kernels, as

k(θ)(·,·) = θ1 k1(·,·) + θ2 k2(·,·),  θ1, θ2 ≥ 0,  θ1 + θ2 = 1.   (6)

To leave the empirical loss out of consideration, assume: (a) both k1 and k2 can lead to zero empirical loss, (b) k1 results in a larger margin than k2. For simplicity, we further restrict θ1 and θ2 to be equal to 0 or 1, to enable kernel selection. The MKL formulation (3), of course, will choose k1 from {k1, k2} due to the large margin preference. Then we set k1new(·,·) = a k1(·,·), where a is a small scalar chosen so that k1new has a smaller margin than k2. After k1new substitutes for k1, the MKL formulation (3) will select k2 from {k1new, k2}. The example shows that the final solution can be greatly affected by the initial scalings of basis kernels, although a norm constraint is used. This problem is called the initialization problem. When the MKL framework is extended from the linear combination cases to nonlinear cases, the scaling problem becomes more serious, as even a finite scaling of the learned kernel may not be guaranteed by a simple norm constraint on kernel parameters for some kernel forms. These problems imply that the margin itself is not enough to measure the goodness of kernels.

2.2 Measuring the goodness of kernels with the radiuses of MEB

Now we need to find a more reasonable way to measure the goodness of kernels. Below we introduce the generalization error bounds for SVM and kernel learning, which inspire us to consider the minimum enclosing ball to learn a kernel.
For SVM with a fixed kernel, it is well known that the estimation error, which denotes the gap between the expected error and the empirical error, is bounded by √(O(R²γ⁻²)/n), where R is the radius of the minimum enclosing ball (MEB) of data in the feature space endowed with the kernel used. For SVM with a kernel learned from a kernel family K, if we restrict the radius of the minimum enclosing ball in the feature space endowed with the learned kernel to be no larger than R, then the theoretical results of Srebro and Ben-David [14] say: for any fixed margin γ > 0 and any fixed radius R > 0, with probability at least 1 − δ over a training set of size n, the estimation error is no larger than

√( (8/n) (2 + dφ log(128en³R²/γ²) + 256 (R²/γ²) log(enγ/(8R)) log(128nR²/γ²) − log δ) ).

The scalar dφ denotes the pseudodimension [14] of the kernel family K. For example, dφ of linear combination kernels is no larger than the number of basis kernels, and dφ of the Gaussian kernels with a form of k(θ)(xa, xb) = e^{−θ||xa−xb||²} is no larger than 1 (see [14] for more details). The above results clearly state that the generalization error bounds for SVM with both fixed kernels and learned kernels depend on the ratio between the margin γ and the radius R of the minimum enclosing ball of data. Although some new results on generalization bounds for kernel learning, like [15], give different types of dependencies on dφ, they also rely on the margin-and-radius ratio.

In SVM with a fixed kernel, the radius R is a constant and we can safely minimize ||w||² (as well as the empirical loss). However, in kernel learning, the radius R changes drastically from one kernel to another. (An example is given in the supplemental materials: when we uniformly combine p basis kernels by kunif = Σ_{j=1}^p (1/p) kj, the squared radius becomes only 1/p of the squared radius of each basis kernel.) Thus we should also take the radius into account. As a result, we use the ratio between the margin γ and the radius R to measure how good a kernel is for kernel learning.

Given any kernel k, the radius of the minimum enclosing ball, denoted by R(k), can be obtained by:

R²(k) = min_{y,c} y,  s.t. y ≥ ||φ(xi; k) − c||², ∀i.   (7)

This problem is a convex minimization problem, equivalent to its dual problem:

R²(k) = max_β  Σ_i βi k(xi, xi) − Σ_{i,j} βi k(xi, xj) βj,  s.t. Σ_i βi = 1, βi ≥ 0,   (8)

which shows a property of R²(k): for any kernel k and any scalar a > 0, we have R²(ak) = aR²(k).

3 Learning kernels with the radiuses

Considering the ratio between the margin and the radius of the MEB, we propose a new formulation:

min_{k,w,b,ξ}  (1/2)R²(k)||w||² + C Σ_i ξi,  s.t. yi(⟨φ(xi; k), w⟩ + b) + ξi ≥ 1, ξi ≥ 0,   (9)

where R²(k)||w||² is a radius-based regularizer that prefers a large ratio between the margin and the radius, and Σ_i ξi is the hinge loss, an upper bound on the empirical misclassification error. This optimization problem is called the radius-based kernel learning problem, referred to as RKL.

Chapelle et al. [1] also utilize the radius of the MEB to tune kernel parameters for hard margin SVM. Our formulation (9) is equivalent to theirs if ξi is restricted to be zero. To give a soft margin version, they modify the kernel matrix as K(θ) = K(θ) + (1/C)I, resulting in a formulation equivalent to:

min_{θ,w,b,ξ}  (1/2)R²(k(θ))||w||² + C R²(k(θ)) Σ_i ξi²,  s.t. yi(⟨φ(xi; k(θ)), w⟩ + b) + ξi ≥ 1, ξi ≥ 0.   (10)

The function R²(k(θ)) in the second term, which may become small, means that minimizing the objective function cannot reliably give a small empirical loss, even when C is large.
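Since the MEB dual (8) is a concave quadratic program over the simplex, even a simple Frank-Wolfe loop recovers R²(k). The sketch below is our illustration (not the paper's solver); it also checks the stated property R²(ak) = aR²(k) on a toy linear kernel whose radius is known in closed form:

```python
import numpy as np

def meb_r2(K, iters=20000):
    """Maximize sum_i b_i K_ii - b'Kb over the simplex (dual (8)) by Frank-Wolfe."""
    n = K.shape[0]
    b = np.full(n, 1.0 / n)
    diag = np.diag(K).copy()
    for t in range(iters):
        grad = diag - 2.0 * K @ b      # gradient of the concave dual objective
        i = int(np.argmax(grad))       # best vertex of the simplex
        step = 2.0 / (t + 3.0)
        b *= (1.0 - step)              # convex combination with that vertex
        b[i] += step
    return float(b @ diag - b @ K @ b)

X = np.array([0.0, 2.0])               # two points on a line
K = np.outer(X, X)                     # linear kernel: MEB radius is 1, so R^2 = 1
print(meb_r2(K))       # close to 1.0
print(meb_r2(5 * K))   # close to 5.0, illustrating R^2(ak) = a R^2(k)
```

The fixed step size 2/(t+3) is the standard Frank-Wolfe schedule; a line search or an SMO-style solver (as used later in the paper) would converge faster.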
Besides, when we reduce the scaling of a kernel by multiplying it with a small scalar a and substitute w̃ = w/√a for w to keep the same ξi, the objective function always decreases (due to the decrease of R² in the empirical loss term), still leading to scaling problems. Do et al. [16] recently propose to learn a linear kernel combination, as defined in (2), through

min_{θ,wj,b,ξ}  (1/2) Σ_j ||wj||²/θj + C Σ_i ξi²,  s.t. yi(Σ_j ⟨wj, φ(xi; kj)⟩ + b) + ξi ≥ 1, ξi ≥ 0,  Σ_j θj R²(kj) = 1.   (11)

Their objective function can also always be decreased by multiplying θ with a large scalar. Thus their method does not address the scaling problem, which also results in the initialization problem. If we initially adjust the scalings of the basis kernels to make each R(kj) equal to the others, their formulation becomes equivalent to the margin-based formulation (3). Different from the above formulations, our formulation (9) is invariant to scalings of kernels.

3.1 Invariance to scalings of kernels

Now we discuss the properties of formulation (9). The RKL problem can be reformulated as

min_k  G(k),   (12)

where

G(k) = min_{w,b,ξ}  (1/2)R²(k)||w||² + C Σ_i ξi,  s.t. yi(⟨φ(xi; k), w⟩ + b) + ξi ≥ 1, ξi ≥ 0.   (13)

The functional G(k) defines a measure of the goodness of kernel functions, which considers a trade-off between the margin-and-radius ratio and the empirical loss. This functional is invariant to the scaling of k, as stated by the following proposition.

Proposition 1. For any kernel k and any scalar a > 0, the equation G(ak) = G(k) holds.

Proof. For the scaled kernel ak, the equation R²(ak) = aR²(k) holds. Thereby, we get

G(ak) = min_{w,b,ξ}  (a/2)R²(k)||w||² + C Σ_i ξi,  s.t. yi(⟨√a φ(xi; k), w⟩ + b) + ξi ≥ 1, ξi ≥ 0.   (14)

Substituting w̃ = √a w for w in (14) makes (14) equivalent to (13). Thus G(ak) = G(k).

For a parametric kernel form k(θ), the RKL problem transfers to minimizing a function g(θ) := G(k(θ)). Here we temporarily focus on the linear combination case defined by (2), and use glinear(θ) to denote g(θ) in this case. Due to the scaling invariance, for any θ and any a > 0, we have glinear(aθ) = glinear(θ). This makes the problem of minimizing glinear(θ) invariant to the types of norm constraints on θ, as stated in the following.

Proposition 2. Given any norm definition N(·) and any set S ⊆ R, suppose there exists c > 0 that satisfies c ∈ S. Let (a) denote the problem of minimizing glinear(θ) s.t. θi ≥ 0, and (b) denote the problem of minimizing glinear(θ) s.t. θi ≥ 0 and N(θ) ∈ S. Then we have: (1) For any local (global) optimal solution of (a), denoted by θa, (c/N(θa))θa is also a local (global) optimal solution of (b). (2) Any local (global) optimal solution of (b), denoted by θb, is also a local (global) optimal solution of (a).

Proof. The complete proof is given in the supplemental materials. Here we only prove the equivalence of global optimal solutions of (a) and (b). On one hand, if θa is the global optimal solution of (a), then for any θ that satisfies θi ≥ 0 and N(θ) ∈ S, we have glinear((c/N(θa))θa) = glinear(θa) ≤ glinear(θ). Due to N((c/N(θa))θa) = c ∈ S, (c/N(θa))θa also satisfies the constraint of (b), and thus (c/N(θa))θa is the global optimal solution of (b).
On the other hand, for any θ (θi ≥ 0), glinear((c/N(θ))θ) = glinear(θ) due to the scaling invariance. If θb is the global optimal solution of (b), then for any θ (θi ≥ 0), as (c/N(θ))θ satisfies the constraint of (b), we have glinear(θb) ≤ glinear((c/N(θ))θ), giving glinear(θb) ≤ glinear(θ). Thus θb is the global optimal solution of (a).

As the problems of minimizing glinear(θ) under different types of norm constraints on θ are all equivalent to the same problem without any norm constraint, they are equivalent to each other. Based on the above proposition, we can also reach another conclusion: in the linear combination case the minimization problem (12) is also invariant to the initial scalings of basis kernels (see below).

Proposition 3. Let kj denote basis kernels, and aj > 0 be initial scaling coefficients of the basis kernels. Given a norm constraint N(θ) ∈ S, defined as in Proposition 2, let (a) denote the problem of minimizing G(Σ_j θj kj) w.r.t. θ s.t. θi ≥ 0 and N(θ) ∈ S, and let (b) denote the problem with different initial scalings: minimizing G(Σ_j θj aj kj) w.r.t. θ s.t. θi ≥ 0 and N(θ) ∈ S. Then: (1) Problems (a) and (b) have the same local and global optimums. (2) For any local (global) optimal solution of (b), denoted by θb, [c aj θbj / N([at θbt]_t)]_j is also a local (global) optimal solution of (a).

Proof. By Proposition 2, problem (b) is equivalent to the one without any norm constraint: minimizing G(Σ_j θj aj kj) w.r.t. θ s.t. θi ≥ 0, which is denoted by problem (c). Let θ̃j = aj θj; then problem (c) is equivalent to the problem of minimizing G(Σ_j θ̃j kj) w.r.t. θ̃ s.t. θ̃i ≥ 0, which is denoted by problem (d) (local and global optimal solutions of problems (c) and (d) have one-to-one correspondences due to the simple transform θ̃j = aj θj). Again, by Proposition 2, problem (d) is equivalent to the one with N(θ) ∈ S, which is indeed problem (a). So we have conclusion (1). By proper transformations of the optimal solutions of these equivalent problems, we get conclusion (2).

Note that in Proposition 3, optimal solutions of problems (a) and (b), which use different initial scalings of basis kernels, actually result in the same kernel combinations up to scaling. As shown in the above three propositions, our proposed formulation not only completely addresses the scaling and initialization problems, but is also insensitive to the types of norm constraints used.

3.2 Reformulation as a tri-level optimization problem

The remaining task is to optimize the RKL problem (12). Given a parametric kernel form k(θ), for any parameter θ, to obtain the value of the objective function g(θ) = G(k(θ)) in (12), we need to solve the SVM-like problem in (13), which is a convex minimization problem and can be solved via its dual. Indeed, the whole RKL problem is transformed into a tri-level optimization problem:

min_θ  g(θ),   (15)

where

g(θ) = max_α { Σ_i αi − (1/(2r²(θ))) Σ_{i,j} αi αj yi yj Ki,j(θ) },  s.t. Σ_i αi yi = 0,  0 ≤ αi ≤ C,   (16)

where

r²(θ) = max_β { Σ_i βi Ki,i(θ) − Σ_{i,j} βi Ki,j(θ) βj },  s.t. Σ_i βi = 1,  βi ≥ 0.   (17)

The notation K(θ) denotes the kernel matrix [k(θ)(xi, xj)]i,j. The above formulations show that, given any θ, calculating a value of g(θ) requires solving a bi-level optimization problem.
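The bi-level evaluation of g(θ) can be sketched directly from (16) and (17): solve the MEB dual, plug r²(θ) into the SVM dual, and solve that. The code below is our illustration using a generic SLSQP solver (the paper uses SMO-style solvers instead); it also checks Proposition 1 numerically, since g computed from K and from 5K should coincide:

```python
import numpy as np
from scipy.optimize import minimize

def meb_r2(K):
    """Solve the MEB dual (17): max_b b'diag(K) - b'Kb over the simplex."""
    n = K.shape[0]
    d = np.diag(K)
    res = minimize(lambda b: -(b @ d - b @ K @ b), np.full(n, 1.0 / n),
                   bounds=[(0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda b: b.sum() - 1.0}])
    return -res.fun

def g_value(K, y, C=10.0):
    """Solve the SVM dual (16), with the 1/(2 r^2) weight taken from (17)."""
    r2 = meb_r2(K)
    Q = (y[:, None] * y[None, :]) * K
    n = len(y)
    res = minimize(lambda a: -(a.sum() - a @ Q @ a / (2.0 * r2)), np.zeros(n),
                   bounds=[(0, C)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
    return -res.fun

X = np.array([[-1.0], [1.0]])          # one toy point per class
y = np.array([1.0, -1.0])
K = X @ X.T                            # linear kernel
print(g_value(K, y), g_value(5 * K, y))   # equal up to solver tolerance (Proposition 1)
```

On this two-point example r²(θ) = 1 and the inner dual value is 1/2 for both K and 5K, in line with the scaling invariance of (12).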
First, solve the MEB dual problem (17), and obtain the optimal value r²(θ) and the optimal solution, denoted by β*i. Then, substitute r²(θ) into the objective function of the SVM dual problem (16), solve it, and obtain the value of g(θ), as well as the optimal solution of (16), denoted by α*i. Unlike in other kernel learning approaches, here the optimization of the SVM dual problem relies on another optimal value function r²(θ), making the RKL problem more challenging.

If g(θ), the objective function of the top-level optimization, is differentiable and we can obtain its derivatives, then we can use a variety of gradient-based methods to solve the RKL problem. So in the next section, we study the differentiability of a general family of multilevel optimization problems.

4 Differentiability of the multilevel optimization problem

Danskin's theorem [17] states the differentiability of the optimal value of a single-level optimization problem, and has been applied in many MKL algorithms, e.g., [5, 12]. Unfortunately, it is not directly applicable to the optimal value of a multilevel optimization problem. Below we generalize Danskin's theorem and give new results for multilevel optimization problems.

Let Y be a metric space, and X, U and Z be normed spaces. Suppose: (1) The function g1(x, u, z) is continuous on X × U × Z. (2) For all x ∈ X the function g1(x, ·, ·) is continuously differentiable. (3) The function g2(y, x, u) (g2 : Y × X × U → Z) is continuous on Y × X × U. (4) For all y ∈ Y the function g2(y, ·, ·) is continuously differentiable. (5) The sets ΦX ⊆ X and ΦY ⊆ Y are compact. With these notations, we propose the following theorem about bi-level optimal value functions.

Theorem 1. Let us define a bi-level optimal value function as

v1(u) = inf_{x∈ΦX} g1(x, u, v2(x, u)),   (18)

where v2(x, u) is another optimal value function:

v2(x, u) = inf_{y∈ΦY} g2(y, x, u).   (19)

If for any x and u, g2(·, x, u) has a unique minimizer y*(x, u) over ΦY, then y*(x, u) is continuous on X × U, and v1(u) is directionally differentiable. Furthermore, if for any u, g1(·, u, v2(·, u)) also has a unique minimizer x*(u) over ΦX, then

1. the minimizer x*(u) is continuous on U,
2. v1(u) is continuously differentiable, and its derivative is equal to

dv1(u)/du = ( ∂g1(x*, u, v2)/∂u + (∂g1(x*, u, v2)/∂v2) · ∂v2(x*, u)/∂u ) |_{v2 = v2(x*,u)},  where ∂v2(x*, u)/∂u = ∂g2(y*, x*, u)/∂u.   (20)

The proof is given in the supplemental materials. To apply Theorem 1 to the objective function g(θ) in the RKL problem (15), we must ensure the following two conditions are satisfied. First, both the MEB dual problem (17) and the SVM dual problem (16) must have unique optimal solutions. This can be guaranteed when the kernel matrix K(θ) is strictly positive definite. Second, the kernel matrix K(θ) must be continuously differentiable with respect to θ. Both conditions are met in the linear combination case when each basis kernel matrix is strictly positive definite, and can also be easily satisfied in nonlinear cases, like in [11, 12].
If these two conditions are met, then g(θ) is continuously differentiable and

dg(θ)/dθ = −(1/(2r²(θ))) Σ_{i,j} α*i α*j yi yj dKi,j(θ)/dθ + (1/(2r⁴(θ))) Σ_{i,j} α*i α*j yi yj Ki,j(θ) dr²(θ)/dθ,   (21)

where α*i is the optimal solution of the SVM dual problem (16), and

dr²(θ)/dθ = Σ_i β*i dKi,i(θ)/dθ − Σ_{i,j} β*i dKi,j(θ)/dθ β*j,   (22)

where β*i is the optimal solution of the MEB dual problem (17). In the above equations, the value of dKi,j(θ)/dθ is needed. It depends on the specific form of the parametric kernels, and deriving it is easy. For example, for the linear combination kernel Ki,j(θ) = Σ_m θm K^m_{i,j}, we have ∂Ki,j(θ)/∂θm = K^m_{i,j}. For the Gaussian kernel Ki,j(θ) = e^{−θ||xi−xj||²}, we have dKi,j(θ)/dθ = −Ki,j(θ)||xi − xj||².

5 Algorithm

With the derivative of g(θ), we use the standard gradient projection approach with the Armijo rule [18] for selecting step sizes to address the RKL problem. To compare with the most popular kernel learning algorithm, SimpleMKL [5], in the experiments we employ the linear combination kernel form with nonnegative combination coefficients, as defined in (2). In addition, we also consider three types of norm constraints on the kernel parameters (combination coefficients): L1, L2 and no norm constraint. The L1 and L2 norm constraints are Σ_j θj = 1 and Σ_j θj² = 1, respectively. The projection for the L1 norm and nonnegative constraints can be efficiently done by the method of Duchi et al. [19]. The projection for only nonnegative constraints can be accomplished by setting negative elements to zero.
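As a sanity check on (22), one can compare the Danskin-style gradient of r²(θ) with central finite differences of the MEB dual value. The sketch below is our illustration: the basis kernel matrices are random positive-definite matrices of our choosing, and a generic SLSQP solver stands in for an SMO solver:

```python
import numpy as np
from scipy.optimize import minimize

def meb_dual(K):
    """Optimal value r^2 and optimal beta of the MEB dual (17)."""
    n = K.shape[0]
    d = np.diag(K)
    res = minimize(lambda b: -(b @ d - b @ K @ b), np.full(n, 1.0 / n),
                   bounds=[(0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda b: b.sum() - 1.0}],
                   options={'ftol': 1e-12})
    return -res.fun, res.x

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))                     # 5 toy points in 2-D
Ks = [A @ A.T + np.eye(5),                          # linear kernel, made strictly PD
      np.exp(-0.5 * np.square(A[:, None] - A[None, :]).sum(-1))]  # Gaussian kernel
theta = np.array([0.4, 0.6])

r2, beta = meb_dual(theta[0] * Ks[0] + theta[1] * Ks[1])
# Eq. (22) with dK/dtheta_m = K^m in the linear combination case:
grad = [float(beta @ np.diag(Km) - beta @ Km @ beta) for Km in Ks]

fds, h = [], 1e-4                                   # central finite differences
for m in range(2):
    e = np.zeros(2); e[m] = h
    up = meb_dual((theta + e)[0] * Ks[0] + (theta + e)[1] * Ks[1])[0]
    dn = meb_dual((theta - e)[0] * Ks[0] + (theta - e)[1] * Ks[1])[0]
    fds.append((up - dn) / (2 * h))

print(grad)
print(fds)   # should agree with grad
```

Agreement of the two printed rows illustrates that, with a unique MEB solution, the optimal-value function r²(θ) is differentiable and its gradient is the one Theorem 1 predicts.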
The projection for the L2 norm and nonnegative constraints needs another step after eliminating negative values: normalize θ by multiplying it with ||θ||₂⁻¹.

In our gradient projection algorithm, each evaluation of the objective function g(θ) needs solving an MEB problem (17) and an SVM problem (16), whereas the gradient calculation and projection steps have negligible time complexity compared to the MEB and SVM solvers. The MEB and SVM problems have similar forms of objective functions and constraints, and both can be efficiently solved by SMO algorithms. Moreover, previous solutions α*i and β*i can be used as a "hot start" to accelerate the solvers. This is because the optimal solutions of the two problems are continuous in the kernel parameter θ according to Theorem 1; thus when θ moves a small step, the optimal solutions also change only a little. In real experiments our approach usually achieves approximate convergence within one or two dozen invocations of the SVM and MEB solvers (for lack of space, examples of the convergence speed of our algorithm are shown in the supplemental materials).

In linear combination cases, the RKL problem, like the radius-based formulation by Chapelle et al. [1], is not convex. Gradient-based methods only guarantee local optimums. The following states the nontrivial quality of local optimal solutions and their connections to related convex problems.

Proposition 4. In linear combination cases, for any local optimal solution of the RKL problem, denoted by θ*, there exist C1 > 0 and C2 > 0 such that θ* is the global optimal solution of the following convex problem:

min_{θ,wj,b,ξ}  (1/2) Σ_j ||wj||² + C1 r²(θ) + C2 Σ_i ξi²,  s.t. yi(Σ_j ⟨wj, φ(xi; θj kj)⟩ + b) + ξi ≥ 1, ξi ≥ 0.   (23)

The proof can be found in the supplemental materials.
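The two projection steps described above can be sketched as follows. This is our implementation of the standard sort-based simplex projection in the style of Duchi et al. [19], plus the clip-then-normalize step the paper describes for the L2 case; the variable names are ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {theta >= 0, sum_j theta_j = 1}."""
    u = np.sort(v)[::-1]                   # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)   # shared shift determined by active set
    return np.maximum(v - tau, 0.0)

def project_l2_nonneg(v):
    """Set negative entries to zero, then rescale to unit L2 norm."""
    w = np.maximum(v, 0.0)
    nrm = np.linalg.norm(w)
    return w / nrm if nrm > 0 else w

p = project_simplex(np.array([0.9, 0.8, -0.5]))
print(p, p.sum())                          # nonnegative, sums to 1
q = project_l2_nonneg(np.array([3.0, -4.0]))
print(q)                                   # [1. 0.]
```

Both projections are O(m log m) or better in the number of basis kernels, which is why their cost is negligible next to the SVM and MEB solves.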
The proposition also suggests another possible way to address the RKL problem: iteratively solve the convex problem (23) with a search over C1 and C2. However, it is difficult to find the exact values of C1 and C2 by a grid search, and even a rough search would incur too high a computational load. Besides, such a method also lacks the ability to extend to nonlinear parametric kernel forms. In the experiments, we therefore demonstrate that the gradient-based approach gives satisfactory performance, significantly better than that of SVM with the uniform combination of basis kernels and of other kernel learning approaches.

6 Experiments

In this section, we illustrate the performance of our presented RKL approach, in comparison with SVM with the uniform combination of basis kernels (Unif), the margin-based MKL method using formulation (3) (MKL), and the kernel learning principle by Chapelle et al. [1] using formulation (10) (KL-C). The evaluation is made on eleven publicly available data sets from the UCI repository [20] and LIBSVM Data [21] (see Table 1). All data sets have been normalized to zero mean and unit variance on every feature. The basis kernels used are the same as in SimpleMKL [5]: 10 Gaussian kernels with bandwidths γG ∈ {0.5, 1, 2, 5, 7, 10, 12, 15, 17, 20} and 10 polynomial kernels of degree 1 to 10. All kernel matrices have been normalized to unit trace, as in [5, 7]. Note that although our RKL formulation is theoretically invariant to the initial scalings, the normalization is still applied in RKL to avoid numerical problems caused by large-valued kernel matrices in the SVM and MEB solvers. To show the impact of different norm constraints, we use three types of them: L1, L2 and no norm constraint. With no norm constraint, only RKL can converge, and so only its results are reported. The SVM toolbox used is LIBSVM [21]. MKL with the L1 norm constraint is solved by the code from SimpleMKL [5].
Other problems are solved by standard gradient projection methods, where the calculation of the gradients of the MKL formulation (3) and Chapelle's formulation (10) is the same as in [5] and [1], respectively. The initial θ is set to (1/20)e, where e is an all-ones vector. The trade-off coefficient C in SVM, MKL, KL-C and RKL is automatically determined by 3-fold cross-validation on the training sets; in all methods, C is selected from the set Scoef := {0.01, 0.1, 1, 10, 100}. For each data set, we split it into five parts, and each time we use four parts as the training set and the remaining one as the test set. The average accuracies with standard deviations and the average numbers of selected basis kernels are reported in Table 1.

Table 1: The testing accuracies (Acc.) with standard deviations (in parentheses), and the average numbers of selected basis kernels (Nk). We set the numbers of our method in bold if our method outperforms both Unif and the other two kernel learning approaches under the same norm constraint. Each cell shows Acc. Nk.

Data set    | 1 Unif        | 2 MKL (L1)    | 3 KL-C (L1)   | 4 Ours (L1)   | 5 MKL (L2)    | 6 KL-C (L2)   | 7 Ours (L2)   | 8 Ours (No)
Ionosphere  | 94.0(1.4) 20  | 92.9(1.6) 3.8 | 86.0(1.9) 4.0 | 95.7(0.9) 2.8 | 94.3(1.5) 20  | 84.4(1.6) 18  | 95.7(0.9) 3.0 | 95.7(0.9) 3.0
Splice      | 51.7(0.1) 20  | 79.5(1.9) 1.0 | 80.5(1.9) 2.8 | 86.5(2.4) 3.2 | 82.0(2.2) 20  | 74.0(2.6) 14  | 86.5(2.4) 2.2 | 86.3(2.5) 3.2
Liver       | 58.0(0.0) 20  | 59.1(1.4) 4.2 | 62.9(3.5) 4.0 | 64.1(4.2) 3.6 | 67.0(3.8) 20  | 64.1(3.9) 11  | 64.1(4.2) 8.0 | 64.3(4.3) 6.6
Fourclass   | 81.2(1.9) 20  | 97.7(1.2) 7.0 | 94.0(1.2) 2.0 | 100 (0.0) 1.0 | 97.3(1.6) 20  | 94.0(1.3) 17  | 100 (0.0) 1.0 | 100 (0.0) 1.6
Heart       | 83.7(6.1) 20  | 84.1(5.7) 7.4 | 83.3(5.9) 1.8 | 84.1(5.7) 5.2 | 83.7(5.8) 20  | 83.3(5.1) 19  | 84.4(5.9) 5.4 | 84.8(5.0) 5.8
Germannum   | 70.0(0.0) 20  | 70.0(0.0) 7.2 | 71.9(1.8) 9.8 | 73.7(1.6) 4.8 | 71.5(0.8) 20  | 71.6(2.1) 13  | 73.9(1.2) 6.0 | 73.9(1.8) 5.8
Musk1       | 61.4(2.9) 20  | 85.5(2.9) 1.6 | 73.9(2.9) 2.0 | 93.3(2.3) 4.0 | 87.4(3.0) 20  | 61.9(3.1) 19  | 93.5(2.2) 3.8 | 93.3(2.3) 3.8
Wdbc        | 94.4(1.8) 20  | 97.0(1.8) 1.2 | 97.4(2.3) 4.6 | 97.4(1.6) 6.2 | 96.8(1.6) 20  | 97.4(2.0) 11  | 97.6(1.9) 5.8 | 97.6(1.9) 5.8
Wpbc        | 76.5(2.9) 20  | 76.5(2.9) 7.2 | 52.2(5.9) 9.6 | 76.5(2.9) 17  | 75.9(1.8) 20  | 51.0(6.6) 17  | 76.5(2.9) 15  | 76.5(2.9) 15
Sonar       | 76.5(1.8) 20  | 82.3(5.6) 2.6 | 80.8(5.8) 7.4 | 86.0(2.6) 2.6 | 85.2(2.9) 20  | 80.2(5.9) 11  | 86.0(2.6) 2.6 | 86.0(3.3) 3.0
Coloncancer | 67.2(11) 20   | 82.6(8.5) 13  | 74.5(4.4) 11  | 84.2(4.2) 7.2 | 76.5(9.0) 20  | 76.0(3.6) 15  | 84.2(4.2) 5.6 | 84.2(4.2) 7.6

The results in Table 1 can be summarized as follows. (a) RKL gives the best results on most sets. Under L1 norm constraints, RKL (Index 4) outperforms all other methods (Index 1, 2, 3) on 8 out of 11 sets, and gives results equal to the best of the other methods on the remaining 3 sets. In particular, RKL gains 5 or more percentage points of accuracy on Splice, Liver and Musk1 over MKL, and gains more than 9 percentage points on four sets over KL-C. Under L2 norm constraints, the results are similar: RKL (Index 7) outperforms the other methods (Index 5, 6) on 10 out of 11 sets, with only 1 inverse result.
(b) Both MKL and KL-C are sensitive to the type of norm constraint (compare Index 2 and 5, as well as 3 and 6). As shown in recent literature [7, 9], for the MKL formulation, different types of norm constraints fit different data sets. However, RKL outperforms MKL (as well as KL-C) under both L1 and L2 norm constraints on most sets. (c) RKL is invariant to the type of norm constraint: see Index 4, 7 and 8, where most accuracy numbers are identical. The few exceptions with slight differences are likely due to numerical precision. (d) For MKL, the L1 norm constraint always results in sparse combinations, whereas the L2 norm constraint always gives non-sparse results (see Index 2 and 5). (e) Interestingly, our presented RKL gives sparse solutions on most sets, whatever type of norm constraint is used. As there usually exist redundancies among the basis kernels, the search for good kernels and small empirical loss often directly leads to sparse solutions. We notice that KL-C under L2 norm constraints also slightly promotes sparsity (Index 6). Compared to KL-C under L2 norm constraints, RKL provides not only higher performance but also more sparsity, which benefits both interpretability and computational efficiency in prediction.

7 Conclusion

In this paper, we show that the margin term used in previous MKL formulations is not a suitable measure of the goodness of kernels, resulting in scaling and initialization problems. We propose a new formulation, called RKL, which uses the ratio between the margin and the radius of the MEB to learn kernels. We prove that our formulation is invariant to kernel scalings, and, for the MKL problem, also invariant to scalings of basis kernels and to the types of norm constraints. Then, by establishing the differentiability of a general family of multilevel optimal value functions, we propose a gradient-based algorithm to address the RKL problem.
We also characterize the properties of the solutions found by our algorithm. The experiments validate that our approach outperforms both SVM with the uniform combination of basis kernels and other state-of-the-art kernel learning methods.

Acknowledgments

The work is supported by the National Natural Science Foundation of China (NSFC) (Grant Nos. 60835002 and 61075004) and the National Basic Research Program (973 Program) (No. 2009CB320602).

References

[1] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002.

[2] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27–72, 2004.

[3] F.R. Bach, G.R.G. Lanckriet, and M.I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), 2004.

[4] S. Sonnenburg, G. Rätsch, and C. Schäfer. A general and efficient multiple kernel learning algorithm. In Adv. Neural Inform. Process. Syst. (NIPS 2005), 2006.

[5] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

[6] O. Chapelle and A. Rakotomamonjy. Second order optimization of kernel parameters. In Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.

[7] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K. Müller, and A. Zien. Efficient and accurate lp-norm multiple kernel learning. In Adv. Neural Inform. Process. Syst. (NIPS 2009), 2009.

[8] C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Uncertainty in Artificial Intelligence, 2009.

[9] J. Saketha Nath, G. Dinesh, S. Raman, Chiranjib Bhattacharyya, Aharon Ben-Tal, and K. R. Ramakrishnan. On the algorithmics and applications of a mixed-norm based kernel learning formulation. In Adv. Neural Inform. Process. Syst. (NIPS 2009), 2009.

[10] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Adv. Neural Inform. Process. Syst. (NIPS 2008), 2008.

[11] M. Gönen and E. Alpaydin. Localized multiple kernel learning. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 2008.

[12] M. Varma and B.R. Babu. More generality in efficient multiple kernel learning. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), 2009.

[13] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In Adv. Neural Inform. Process. Syst. (NIPS 2009), 2009.

[14] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In Proceedings of the International Conference on Learning Theory (COLT 2006), pages 169–183. Springer, 2006.

[15] Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel. In Proceedings of the International Conference on Learning Theory (COLT 2009), 2009.

[16] H. Do, A. Kalousis, A. Woznica, and M. Hilario. Margin and radius based multiple kernel learning. In Proceedings of the European Conference on Machine Learning (ECML 2009), 2009.

[17] J.M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, pages 641–664, 1966.

[18] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, September 1999.

[19] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 2008.

[20] A. Asuncion and D.J. Newman.
UCI machine learning repository, 2007. Software available at http://www.ics.uci.edu/~mlearn/MLRepository.html.

[21] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.