{"title": "Efficient and Accurate Lp-Norm Multiple Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 997, "page_last": 1005, "abstract": "Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations and hence support interpretability. Unfortunately, L1-norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary Lp-norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p>1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply Lp-norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art.", "full_text": "Ef\ufb01cient and Accurate (cid:96)p-Norm\n\nMultiple Kernel Learning\n\nMarius Kloft\n\nUniversity of California\n\nBerkeley, USA\n\nUlf Brefeld\n\nYahoo! Research\nBarcelona, Spain\n\nS\u00a8oren Sonnenburg\n\nTechnische Universit\u00a8at Berlin\n\nBerlin, Germany\n\nPavel Laskov\n\nUniversit\u00a8at T\u00a8ubingen\nT\u00a8ubingen, Germany\n\nKlaus-Robert M\u00a8uller\n\nTechnische Universit\u00a8at Berlin\n\nBerlin, Germany\n\nAlexander Zien\n\nLIFE Biosystems GmbH\n\nHeidelberg, Germany\n\nAbstract\n\nLearning linear combinations of multiple kernels is an appealing strategy when\nthe right choice of features is unknown. Previous approaches to multiple kernel\nlearning (MKL) promote sparse kernel combinations to support interpretability.\nUnfortunately, (cid:96)1-norm MKL is hardly observed to outperform trivial baselines in\npractical applications. 
To allow for robust kernel mixtures, we generalize MKL to arbitrary ℓp-norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply ℓp-norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art.

1 Introduction

Sparseness is regarded as one of the key features in machine learning [15] and biology [16]. Sparse models are appealing since they provide an intuitive interpretation of the task at hand by singling out relevant pieces of information. Such automatic complexity reduction facilitates efficient training algorithms, and the resulting models are distinguished by small capacity. Interpretability is one of the main reasons for the popularity of sparse methods in complex domains such as computational biology, and consequently building sparse models from data has received a significant amount of recent attention.
Unfortunately, sparse models do not always perform well in practice [7, 15]. This holds particularly for learning sparse linear combinations of data sources [15], an abstraction of which is known as multiple kernel learning (MKL) [10]. The data sources give rise to a set of (possibly correlated) kernel matrices $K_1, \ldots, K_M$, and the task is to learn the optimal mixture $K = \sum_m \theta_m K_m$ for the problem at hand. Previous MKL research aims at finding sparse mixtures to effectively simplify the underlying data representation. 
For instance, [10] study semi-definite matrices $K \succeq 0$, inducing sparseness by bounding the trace $\mathrm{tr}(K) \le c$; unfortunately, the resulting semi-definite optimization problems are computationally too expensive for large-scale deployment.
Recent approaches to MKL promote sparse solutions either by Tikhonov regularization over the mixing coefficients [25] or by incorporating an additional constraint $\|\theta\| \le 1$ [18, 27] requiring solutions on the standard simplex, known as Ivanov regularization. Based on the one or the other, efficient optimization strategies have been proposed for solving ℓ1-norm MKL using semi-infinite linear programming [21], second-order approaches [6], gradient-based optimization [19], and level-set methods [26]. Other variants of ℓ1-norm MKL have been proposed in subsequent work addressing practical algorithms for multi-class [18, 27] and multi-label [9] problems.

Previous approaches to MKL successfully identify sparse kernel mixtures; however, the solutions found frequently suffer from poor generalization performance. Often, trivial baselines using unweighted-sum kernels $K = \sum_m K_m$ are observed to outperform the sparse mixture [7]. One reason for the collapse of ℓ1-norm MKL is that kernels deployed in real-world tasks are usually highly sophisticated and effectively capture relevant aspects of the data. In contrast, sparse approaches to MKL rely on the assumption that some kernels are irrelevant for solving the problem. Enforcing sparse mixtures in these situations may lead to degenerate models. As a remedy, we propose to sacrifice sparseness in these situations and deploy non-sparse mixtures instead. After submission of this paper, we learned about a related approach, in which the sum of an ℓ1- and an ℓ2-regularizer is used [12]. 
Although non-sparse solutions are not as easy to interpret, they account for (even small) contributions of all available kernels, making them better suited to practical applications.
In this paper, we first show the equivalence of the most common approaches to ℓ1-norm MKL [18, 25, 27]. Our theorem allows for a generalized view of recent strands of multiple kernel learning research. Based on this unified view, we extend the MKL framework to arbitrary ℓp-norm MKL with p ≥ 1. Our approach can either be motivated by additionally regularizing over the mixing coefficients $\|\theta\|_p^p$, or equivalently by incorporating the constraint $\|\theta\|_p^p \le 1$. We propose two alternative optimization strategies based on Newton descent and cutting planes, respectively. Empirically, we demonstrate the efficiency and accuracy of non-sparse MKL. Large-scale experiments on gene start detection show a significant improvement of predictive accuracy compared to ℓ1- and ℓ∞-norm MKL.
The rest of the paper is structured as follows. We present our main contributions in Section 2: the theoretical analysis of existing approaches to MKL, our ℓp-norm MKL generalization with two highly efficient optimization strategies, and relations to ℓ1-norm MKL. We report on our empirical results in Section 3, and Section 4 concludes.

2 Generalized Multiple Kernel Learning

2.1 Preliminaries
In the standard supervised learning setup, a labeled sample $\mathcal{D} = \{(x_i, y_i)\}_{i=1,\ldots,n}$ is given, where the $x$ lie in some input space $\mathcal{X}$ and $y \in \mathcal{Y} \subset \mathbb{R}$. The goal is to find a hypothesis $f \in \mathcal{H}$ that generalizes well on new and unseen data. Applying regularized risk minimization returns the minimizer $f^*$,

$$f^* = \operatorname{argmin}_f\; R_{\mathrm{emp}}(f) + \lambda\,\Omega(f),$$

where $R_{\mathrm{emp}}(f) = \frac{1}{n}\sum_{i=1}^n V(f(x_i), y_i)$ is the empirical risk of hypothesis $f$ w.r.t. 
the loss $V : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$, regularizer $\Omega : \mathcal{H} \to \mathbb{R}$, and trade-off parameter $\lambda > 0$. In this paper, we focus on $\Omega(f) = \frac{1}{2}\|\tilde{w}\|_2^2$ and on linear models of the form

$$f_{\tilde{w},b}(x) = \tilde{w}^\top \psi(x) + b, \quad (1)$$

together with a (possibly non-linear) mapping $\psi : \mathcal{X} \to \mathcal{H}$ to a Hilbert space $\mathcal{H}$ [20]. We will later make use of kernel functions $K(x, x') = \langle \psi(x), \psi(x') \rangle_{\mathcal{H}}$ to compute inner products in $\mathcal{H}$.

2.2 Learning with Multiple Kernels
When learning with multiple kernels, we are given M different feature mappings $\psi_m : \mathcal{X} \to \mathcal{H}_m$, $m = 1, \ldots, M$, each giving rise to a reproducing kernel $K_m$ of $\mathcal{H}_m$. Approaches to multiple kernel learning consider linear kernel mixtures $K_\theta = \sum \theta_m K_m$, $\theta_m \ge 0$. Compared to Eq. (1), the primal model for learning with multiple kernels is extended to

$$f_{\tilde{w},b,\theta}(x) = \tilde{w}^\top \psi_\theta(x) + b = \sum_{m=1}^M \sqrt{\theta_m}\, \tilde{w}_m^\top \psi_m(x) + b, \quad (2)$$

where the weight vector $\tilde{w}$ and the composite feature map $\psi_\theta$ have a block structure $\tilde{w} = (\tilde{w}_1^\top, \ldots, \tilde{w}_M^\top)^\top$ and $\psi_\theta = \sqrt{\theta_1}\,\psi_1 \times \ldots \times \sqrt{\theta_M}\,\psi_M$, respectively.
The idea in learning with multiple kernels is to minimize the loss on the training data w.r.t. the optimal kernel mixture $\sum \theta_m K_m$, in addition to regularizing $\theta$ to avoid overfitting. Hence, in terms of regularized risk minimization, the optimization problem becomes

$$\inf_{\tilde{w},b,\theta \ge 0}\; \frac{1}{n}\sum_{i=1}^n V(f_{\tilde{w},b,\theta}(x_i), y_i) + \frac{\lambda}{2}\sum_{m=1}^M \|\tilde{w}_m\|_2^2 + \tilde{\mu}\,\tilde{\Omega}[\theta]. \quad (3)$$

Previous approaches to multiple kernel learning employ regularizers of the form $\tilde{\Omega}(\theta) = \|\theta\|_1$ to promote sparse kernel mixtures. By contrast, we propose to use smooth convex regularizers of the form $\tilde{\Omega}(\theta) = \|\theta\|_p^p$, $1 < p < \infty$, allowing for non-sparse solutions. The non-convexity of the resulting optimization problem is not inherent and can be resolved by substituting $w_m \leftarrow \sqrt{\theta_m}\,\tilde{w}_m$. Furthermore, regularization parameter and sample size can be decoupled by introducing $\tilde{C} = \frac{1}{n\lambda}$ (and adjusting $\mu \leftarrow \frac{\tilde{\mu}}{\lambda}$), which has favorable scaling properties in practice. We obtain the following convex optimization problem [5], which has also been considered by [25] for hinge loss and p = 1:

$$\inf_{w,b,\theta \ge 0}\; \tilde{C}\sum_{i=1}^n V\!\Big(\sum_{m=1}^M w_m^\top \psi_m(x_i) + b,\; y_i\Big) + \frac{1}{2}\sum_{m=1}^M \frac{\|w_m\|_2^2}{\theta_m} + \mu\|\theta\|_p^p, \quad (4)$$

where we use the convention that $\frac{t}{0} = 0$ if $t = 0$ and $\infty$ otherwise. An alternative approach has been studied by [18, 27] (again using hinge loss and p = 1). They upper bound the value of the regularizer, $\|\theta\|_1 \le 1$, and incorporate the latter as an additional constraint into the optimization problem. For C > 0, they arrive at

$$\inf_{w,b,\theta \ge 0}\; C\sum_{i=1}^n V\!\Big(\sum_{m=1}^M w_m^\top \psi_m(x_i) + b,\; y_i\Big) + \frac{1}{2}\sum_{m=1}^M \frac{\|w_m\|_2^2}{\theta_m} \quad \text{s.t.}\quad \|\theta\|_p^p \le 1. \quad (5)$$

Our first contribution shows that both the Tikhonov regularization in Eq. (4) and the Ivanov regularization in Eq. (5) are equivalent.

Theorem 1 Let p ≥ 1. For each pair $(\tilde{C}, \mu)$ there exists C > 0 such that for each optimal solution $(w^*, b^*, \theta^*)$ of Eq. (4) using $(\tilde{C}, \mu)$, we have that $(w^*, b^*, \kappa\,\theta^*)$ is also an optimal solution of Eq. (5) using C, and vice versa, where $\kappa > 0$ is some multiplicative constant.

Proof. The proof is shown in the supplementary material for lack of space. Sketch of the proof: We incorporate the regularizer of (4) into the constraints and show that the resulting upper bound is tight. A variable substitution completes the proof. □
Zien and Ong [27] showed that the MKL optimization problems by Bach et al. [3], Sonnenburg et al. [21], and their own formulation are equivalent. As a main implication of Theorem 1, and by using the result of Zien and Ong, it follows that the optimization problem of Varma and Ray [25] and the ones from [3, 18, 21, 27] are all equivalent.
In addition, our result shows the coupling between the trade-off parameter C and the regularization parameter µ in Eq. (4): tweaking one also changes the other, and vice versa. Moreover, Theorem 1 implies that optimizing C in Eq. (5) implicitly searches the regularization path for the parameter µ of Eq. (4). In the remainder, we will therefore focus on the formulation in Eq. (5), as a single parameter is preferable in terms of model selection. 
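As a small illustration of the mixture model underlying Eqs. (2)-(5), the following numpy sketch forms $K_\theta = \sum_m \theta_m K_m$ for a given mixing vector; the simple rescaling onto the constraint set $\{\theta \ge 0, \|\theta\|_p^p \le 1\}$ and the toy kernel matrices are our own illustrative choices, not part of the method described in the paper.

```python
import numpy as np

def mix_kernels(kernels, theta, p=2.0):
    """Form the kernel mixture K_theta = sum_m theta_m K_m, after clipping
    theta to be non-negative and rescaling it so that ||theta||_p^p <= 1
    (a simple feasibility rescaling, used here only for illustration)."""
    theta = np.maximum(np.asarray(theta, dtype=float), 0.0)
    norm = np.sum(theta ** p) ** (1.0 / p)
    if norm > 1.0:  # rescale onto the l_p-ball constraint
        theta = theta / norm
    return sum(t * K for t, K in zip(theta, kernels)), theta

# toy example: three 2x2 kernel matrices, uniform initial weights
Ks = [np.eye(2), np.ones((2, 2)), np.diag([2.0, 0.5])]
K_theta, theta = mix_kernels(Ks, [1.0, 1.0, 1.0], p=2.0)
```

For p = 2 and uniform weights, the rescaled mixture is simply the sum kernel divided by $\sqrt{M}$; in the limit p → ∞ the constraint becomes inactive and the unweighted-sum kernel of the baselines is recovered.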
Furthermore, we will focus on binary classification problems with $\mathcal{Y} = \{-1, +1\}$, equipped with the hinge loss $V(f(x), y) = \max\{0, 1 - yf(x)\}$. Note, however, that all our results can easily be transferred to regression and multi-class settings using appropriate convex loss functions and joint kernel extensions.

2.3 Non-Sparse Multiple Kernel Learning

We now extend the existing MKL framework to allow for non-sparse kernel mixtures θ; see also [13]. Let us begin with rewriting Eq. (5) by expanding the hinge loss into the slack variables as follows:

$$\min_{\theta,w,b,\xi}\; \frac{1}{2}\sum_{m=1}^M \frac{\|w_m\|_2^2}{\theta_m} + C\|\xi\|_1 \quad \text{s.t.}\; \forall i: y_i\Big(\sum_{m=1}^M w_m^\top \psi_m(x_i) + b\Big) \ge 1 - \xi_i;\;\; \xi \ge 0;\;\; \theta \ge 0;\;\; \|\theta\|_p^p \le 1. \quad (6)$$

Applying Lagrange's theorem incorporates the constraints into the objective by introducing non-negative Lagrangian multipliers $\alpha, \beta \in \mathbb{R}^n$, $\gamma \in \mathbb{R}^M$, $\delta \in \mathbb{R}$ (including a pre-factor of $\frac{1}{p}$ for the δ-term). Resubstituting the optimality conditions w.r.t. $w$, $b$, $\xi$, and $\theta$ removes the dependency of the Lagrangian on the primal variables. After some additional algebra (e.g., the terms associated with γ cancel), the Lagrangian can be written as

$$L = \mathbf{1}^\top\alpha - \frac{1}{p}\,\delta - \frac{p-1}{p}\,\delta^{-\frac{1}{p-1}}\sum_{m=1}^M \Big(\frac{1}{2}\alpha^\top Q_m \alpha\Big)^{\frac{p}{p-1}}, \quad (7)$$

where $Q_m = \mathrm{diag}(y)\,K_m\,\mathrm{diag}(y)$. Eq. (7) now has to be maximized w.r.t. the dual variables α, δ, subject to $\alpha^\top y = 0$, $0 \le \alpha_i \le C$ for $1 \le i \le n$, and $\delta \ge 0$. Let us ignore for a moment the non-negativity $\delta \ge 0$ and solve $\partial L / \partial \delta = 0$ for the unbounded δ. Setting the partial derivative to zero yields

$$\delta = \Bigg(\sum_{m=1}^M \Big(\frac{1}{2}\alpha^\top Q_m \alpha\Big)^{\frac{p}{p-1}}\Bigg)^{\frac{p-1}{p}}. \quad (8)$$

Interestingly, at optimality we always have $\delta \ge 0$, because the quadratic term in α is non-negative. Plugging the optimal δ into Eq. (7), we arrive at the following optimization problem, which solely depends on α:

$$\max_\alpha\; \mathbf{1}^\top\alpha - \frac{1}{2}\Bigg(\sum_{m=1}^M \big(\alpha^\top Q_m \alpha\big)^{\frac{p}{p-1}}\Bigg)^{\frac{p-1}{p}} \quad \text{s.t.}\; 0 \le \alpha \le C\mathbf{1};\;\; \alpha^\top y = 0. \quad (9)$$

In the limit p → ∞, the above problem reduces to the SVM dual (with $Q = \sum_m Q_m$), while p → 1 gives rise to a QCQP ℓ1-MKL variant. However, optimizing the dual efficiently is difficult and will cause numerical problems in the limits p → 1 and p → ∞.

2.4 Two Efficient Second-Order Optimization Strategies

Many recent MKL solvers (e.g., [19, 24, 26]) are based on wrapping linear programs around SVMs. From an optimization standpoint, our work is most closely related to the SILP approach [21] and the simpleMKL method [19, 24]. Both of these methods also aim at efficient large-scale MKL algorithms. The two alternative approaches for ℓp-norm MKL proposed in this paper are largely inspired by these methods and extend them in two aspects: customization to arbitrary norms and a tight coupling with minor iterations of an SVM solver, respectively.
Our first strategy interleaves maximizing the Lagrangian of (6) w.r.t. α with minor precision and Newton descent on θ. For the second strategy, we devise a semi-infinite convex program, which we solve by column generation with nested sequential quadratically constrained linear programming (SQCLP). In both cases, the maximization step w.r.t. 
α is performed by chunking optimization with minor iterations. The Newton approach can be applied without a general-purpose QCQP solver; however, convergence can only be guaranteed for the SQCLP [8].

2.4.1 Newton Descent

For a Newton descent on the mixing coefficients, we first compute the partial derivatives

$$\nabla_{\theta_m} := \frac{\partial L}{\partial \theta_m} = -\frac{1}{2}\,\frac{w_m^\top w_m}{\theta_m^2} + \delta\,\theta_m^{p-1} \qquad \text{and} \qquad h_m := \frac{\partial^2 L}{\partial \theta_m^2} = \frac{w_m^\top w_m}{\theta_m^3} + (p-1)\,\delta\,\theta_m^{p-2}$$

of the original Lagrangian. Fortunately, the Hessian H is diagonal, i.e., given by $H = \mathrm{diag}(h)$. The m-th element $s_m$ of the corresponding Newton step, defined as $s := -H^{-1}\nabla_\theta$, is thus computed by

$$s_m = \frac{\frac{1}{2}\theta_m\|w_m\|^2 - \delta\,\theta_m^{p+2}}{\|w_m\|^2 + (p-1)\,\delta\,\theta_m^{p+1}},$$

where δ is defined in Eq. (8). However, a Newton step $\theta^{t+1} = \theta^t + s$ might lead to non-positive θ. To avoid this awkward situation, we take the Newton steps in the space of $\log(\theta)$ by adjusting the derivatives according to the chain rule. We obtain

$$\log(\theta_m^{t+1}) = \log(\theta_m^t) - \frac{\theta_m^t\,\nabla^t_{\theta_m}}{(\theta_m^t)^2\,h_m^t + \theta_m^t\,\nabla^t_{\theta_m}}, \quad (10)$$

which corresponds to a multiplicative update of θ:

$$\theta_m^{t+1} = \theta_m^t \cdot \exp\!\Bigg(\frac{\nabla^t_{\theta_m}}{-\theta_m^t\,h_m^t - \nabla^t_{\theta_m}}\Bigg). \quad (11)$$

Furthermore, we additionally enhance the Newton step by a line search.

2.4.2 Cutting Planes

In order to obtain an alternative optimization strategy, we fix θ and build the partial Lagrangian w.r.t. all other primal variables w, b, ξ. The derivation is analogous to [18, 27] and we omit details for lack of space. 
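The multiplicative Newton update on the mixing coefficients from Section 2.4.1 (Eqs. (10)-(11)) can be sketched in a few lines of numpy; the function and variable names are our own, `w_norms_sq[m]` plays the role of $\|w_m\|^2$ from the current SVM solution, δ is taken as given by Eq. (8), and the line search used in the paper is omitted.

```python
import numpy as np

def newton_theta_update(theta, w_norms_sq, delta, p):
    """One multiplicative Newton step on theta, taken in log-space so that
    theta stays positive (illustrative sketch of Eqs. (10)-(11); no line
    search).  theta, w_norms_sq: arrays of length M; delta: scalar, Eq. (8)."""
    # partial derivatives of the Lagrangian w.r.t. theta_m
    grad = -0.5 * w_norms_sq / theta**2 + delta * theta**(p - 1)
    # diagonal of the Hessian
    hess = w_norms_sq / theta**3 + (p - 1) * delta * theta**(p - 2)
    # multiplicative update, Eq. (11)
    return theta * np.exp(grad / (-theta * hess - grad))
```

At a stationary point (grad = 0) the update leaves θ unchanged, and for small gradients it agrees with the plain Newton step $\theta + s$ to first order.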
The resulting dual problem is a min-max problem of the form

$$\min_\theta \max_\alpha\; \mathbf{1}^\top\alpha - \frac{1}{2}\,\alpha^\top \sum_{m=1}^M \theta_m Q_m\, \alpha \quad \text{s.t.}\; 0 \le \alpha \le C\mathbf{1};\;\; y^\top\alpha = 0;\;\; \theta \ge 0;\;\; \|\theta\|_p^p \le 1.$$

The above optimization problem is a saddle point problem and can be solved by alternating α and θ optimization steps. While the former can simply be carried out by a support vector machine for a fixed mixture θ, the latter has been optimized for p = 1 by reduced gradients [18].
We take a different approach and translate the min-max problem into an equivalent semi-infinite program (SIP) as follows. Denote the value of the target function by $t(\alpha, \theta)$ and suppose $\alpha^*$ is optimal. Then, according to the max-min inequality [5], we have $t(\alpha^*, \theta) \ge t(\alpha, \theta)$ for all α and θ. Hence, we can equivalently minimize an upper bound η on the optimal value and arrive at

$$\min_{\eta,\theta}\; \eta \quad \text{s.t.}\; \eta \ge \mathbf{1}^\top\alpha - \frac{1}{2}\,\alpha^\top \sum_{m=1}^M \theta_m Q_m\, \alpha \quad (12)$$

for all $\alpha \in \mathbb{R}^n$ with $0 \le \alpha \le C\mathbf{1}$ and $y^\top\alpha = 0$, as well as $\|\theta\|_p^p \le 1$ and $\theta \ge 0$.
[21] optimize the above SIP for p ≥ 1 with interleaving cutting plane algorithms. The solution of a quadratic program (here the regular SVM) generates the most strongly violated constraint for the actual mixture θ. The optimal $(\theta^*, \eta)$ is then identified by solving a linear program with respect to the set of active constraints. The optimal mixture is then used for computing a new constraint, and so on.
Unfortunately, for p > 1 a non-linearity is introduced by the requirement $\|\theta\|_p^p \le 1$, and such a constraint is unlikely to be found in standard optimization toolboxes, which often handle only linear and quadratic constraints. 
As a remedy, we propose to approximate $\|\theta\|_p^p \le 1$ by a sequential second-order Taylor expansion of the form

$$\|\theta\|_p^p \;\approx\; 1 + \frac{p(p-3)}{2} \;-\; \sum_{m=1}^M p(p-2)\,\tilde{\theta}_m^{\,p-1}\theta_m \;+\; \frac{p(p-1)}{2}\sum_{m=1}^M \tilde{\theta}_m^{\,p-2}\theta_m^2,$$

where $\theta^p$ is defined element-wise, that is, $\theta^p := (\theta_1^p, \ldots, \theta_M^p)$. The sequence $(\theta^0, \theta^1, \cdots)$ is initialized with a uniform mixture satisfying $\|\theta^0\|_p^p = 1$ as a starting point. Successively, $\theta^{t+1}$ is computed using $\tilde{\theta} = \theta^t$. Note that the quadratic term in the approximation is diagonal, so that the subsequent quadratically constrained problem can be solved efficiently. Finally, note that this approach can be further sped up by an additional projection onto the level sets in the θ-optimization phase, similar to [26]. In our case, the level-set projection is a convex quadratic problem with ℓp-norm constraints and can again be approximated by successive second-order Taylor expansions.

Figure 1: Execution times of SVM training, of ℓp-norm MKL based on interleaved optimization via the Newton and the cutting plane algorithm (CPA), and of the SimpleMKL wrapper. (left) Training using a fixed number of 50 kernels and a varying training set size. (right) For 500 examples and varying numbers of kernels. Our proposed Newton and CPA methods obtain speedups of over an order of magnitude. Notice the tiny error bars.

3 Computational Experiments

In this section we study non-sparse MKL in terms of efficiency and accuracy.1 We apply the method of [21] for ℓ1-norm results, as it is contained as a special case of our cutting plane strategy. 
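The second-order Taylor approximation of $\|\theta\|_p^p$ from Section 2.4.2 can be evaluated and sanity-checked with a short numpy sketch (our own illustration; the function name is not from the paper). It assumes an expansion point with $\|\tilde{\theta}\|_p^p = 1$, as maintained by the cutting-plane iteration.

```python
import numpy as np

def lp_norm_taylor(theta, theta_tilde, p):
    """Second-order Taylor expansion of ||theta||_p^p around theta_tilde,
    assuming ||theta_tilde||_p^p = 1; note the quadratic term is diagonal,
    which keeps the resulting quadratically constrained problem cheap."""
    const = 1.0 + p * (p - 3.0) / 2.0
    lin = -np.sum(p * (p - 2.0) * theta_tilde**(p - 1) * theta)
    quad = np.sum(p * (p - 1.0) / 2.0 * theta_tilde**(p - 2) * theta**2)
    return const + lin + quad
```

Two quick checks: at $\theta = \tilde{\theta}$ the expansion recovers $\|\tilde{\theta}\|_p^p = 1$ exactly for any p, and for p = 2 the constraint is already quadratic, so the "approximation" is exact.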
We write\n\n(cid:96)\u221e-norm MKL for a regular SVM with the unweighted-sum kernel K =(cid:80)\n\nm Km.\n\n3.1 Execution Time\n\nWe demonstrate the ef\ufb01ciency of our implementations of non-sparse MKL. We experiment on the\nMNIST data set where the task is to separate odd vs. even digits. We compare our (cid:96)p-norm MKL\nwith two methods for (cid:96)1-norm MKL, simpleMKL [19] and SILP-based chunking [21], and to SVMs\nusing the unweighted-sum kernel ((cid:96)\u221e-norm MKL) as additional baseline. We optimize all methods\nup to a precision of 10\u22123 for the outer SVM-\u03b5 and 10\u22125 for the \u201cinner\u201d SIP precision and computed\nrelative duality gaps. To provide a fair stopping criterion to simpleMKL, we set the stopping criterion\nof simpleMKL to the relative duality gap of its (cid:96)1-norm counterpart. This way, the deviations of\nrelative objective values of (cid:96)1-norm MKL variants are guaranteed to be smaller than 10\u22124. SVM\ntrade-off parameters are set to C = 1 for all methods.\nFigure 1 (left) displays the results for varying sample sizes and 50 precomputed Gaussian kernels\nwith different bandwidths. Error bars indicate standard error over 5 repetitions. Unsurprisingly,\nthe SVM with the unweighted-sum kernel is the fastest method. Non-sparse MKL scales similarly\nas (cid:96)1-norm chunking; the Newton strategy (Section 2.4.1) is slightly faster than the cutting plane\nvariant (Section 2.4.2) that needs additional Taylor expansions within each \u03b8-step. SimpleMKL\nsuffers from training an SVM to full precision for each gradient evaluation and performs worst.2\nFigure 1 (right) shows the results for varying the number of precomputed RBF kernels for a \ufb01xed\nsample size of 500. The SVM with the unweighted-sum kernel is hardly affected by this setup and\nperforms constantly. The (cid:96)1-norm MKL by [21] handles the increasing number of kernels best and is\nthe fastest MKL method. 
Non-sparse approaches to MKL show reasonable run-times, the Newton-based ℓp-norm MKL being again slightly faster than its peer. SimpleMKL again performs worst. Overall, our proposed Newton and cutting plane based optimization strategies achieve a speedup of often more than one order of magnitude.

3.2 Protein Subcellular Localization

The prediction of the subcellular localization of proteins is one of the rare empirical success stories of ℓ1-norm-regularized MKL [17, 27]: after defining 69 kernels that capture diverse aspects of protein sequences, ℓ1-norm MKL could raise the predictive accuracy significantly above that of the unweighted sum of kernels (thereby also improving on established prediction systems for this problem). Here we investigate the performance of non-sparse MKL.
We download the kernel matrices of the dataset plant3 and follow the experimental setup of [17] with the following changes: instead of a genuine multiclass SVM, we use the 1-vs-rest decomposition; instead of performing cross-validation for model selection, we report results for the best models, as we are only interested in the relative performance of the MKL regularizers. Specifically, for each C ∈ {1/32, 1/8, 1/2, 1, 2, 4, 8, 32, 128}, we compute the average Matthews correlation coefficient (MCC) on the test data. For each norm, the best average MCC is recorded.

1Available at http://www.shogun-toolbox.org/
2SimpleMKL could not be evaluated for 2000 instances (ran out of memory on a 4GB machine).

Table 1: Results for Protein Subcellular Localization

ℓp-norm     | 1    | 32/31 | 16/15 | 8/7  | 4/3  | 2     | 4     | ∞
1 - MCC [%] | 9.13 | 9.12  | 9.64  | 9.84 | 9.56 | 10.18 | 10.08 | 10.41
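For reference, the Matthews correlation coefficient reported (as 1 - MCC) in Table 1 can be computed from the binary confusion counts with the standard textbook formula; the sketch below (our own code, not from the paper) assumes labels in {-1, +1}.

```python
import numpy as np

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {-1, +1}:
    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    with the usual convention MCC = 0 when the denominator vanishes."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```

MCC ranges from -1 (total disagreement) over 0 (chance level) to +1 (perfect prediction), which is why Table 1 reports the error-like quantity 1 - MCC.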
Table 1 shows\nthe averages over several splits of the data.\nThe results indicate that, indeed, with proper choice of a non-sparse regularizer, the accuracy of\n(cid:96)1-norm can be recovered. This is remarkable, as this dataset is particular in that it full\ufb01lls the rare\ncondition that (cid:96)1-norm MKL performs better than (cid:96)\u221e-norm MKL. In other words, selecting these\ndata may imply a bias towards (cid:96)1-norm. Nevertheless our novel non-sparse MKL can keep up with\nthis, essentially by approximating (cid:96)1-norm.\n\n3.3 Gene Start Recognition\n\nThis experiment aims at detecting transcription start sites (TSS) of RNA Polymerase II binding genes\nin genomic DNA sequences. Accurate detection of the transcription start site is crucial to identify\ngenes and their promoter regions and can be regarded as a \ufb01rst step in deciphering the key regulatory\nelements in the promoter region that determine transcription. For our experiments we use the dataset\nfrom [22] which contains a curated set of 8,508 TSS annotated genes built from dbTSS version 4\n[23] and refseq genes. These are translated into positive training instances by extracting windows of\nsize [\u22121000, +1000] around the TSS. Similar to [4], 85,042 negative instances are generated from\nthe interior of the gene using the same window size.\nFollowing [22], we employ \ufb01ve different kernels representing the TSS signal (weighted degree with\nshift), the promoter (spectrum), the 1st exon (spectrum), angles (linear), and energies (linear). Opti-\nmal kernel parameters are determined by model selection in [22]. Every kernel is normalized such\nthat all points have unit length in feature space. We reserve 13,000 and 20,000 randomly drawn\ninstances for holdout and test sets, respectively, and use the remaining 60,000 as the training pool.\nFigure 2 shows test errors for varying training set sizes drawn from the pool; training sets of the\nsame size are disjoint. 
Error bars indicate standard errors of repetitions for small training set sizes.\nRegardless of the sample size, (cid:96)1-MKL is signi\ufb01cantly outperformed by the sum-kernel. On the\ncontrary, non-sparse MKL signi\ufb01cantly achieves higher AUC values than the (cid:96)\u221e-MKL for sample\nsizes up to 20k. The scenario is well suited for (cid:96)2-norm MKL which performs best. Finally, for 60k\ntraining instances, all methods but (cid:96)1-norm MKL yield the same performance. Again, the superior\nperformance of non-sparse MKL is remarkable, and of signi\ufb01cance for the application domain: the\nmethod using the unweighted sum of kernels [22] has recently been con\ufb01rmed to be the leading in\na comparison of 19 state-of-the-art promoter prediction programs [1], and our experiments suggest\nthat its accuracy can be further elevated by non-sparse MKL.\n\n4 Conclusion and Discussion\n\nWe presented an ef\ufb01cient and accurate approach to non-sparse multiple kernel learning and showed\nthat our (cid:96)p-norm MKL can be motivated as Tikhonov and Ivanov regularization of the mixing coef-\n\ufb01cients, respectively. Applied to previous MKL research, our result allows for a uni\ufb01ed view as so\nfar seemingly different approaches turned out to be equivalent. Furthermore, we devised two ef\ufb01-\ncient approaches to non-sparse multiple kernel learning for arbitrary (cid:96)p-norms, p > 1. The resulting\n\n3from http://www.fml.tuebingen.mpg.de/raetsch/suppl/protsubloc/\n\n7\n\n\fFigure 2: Left: Area under ROC curve (AUC) on test data for TSS recognition as a function of the training\nset size. Notice the tiny bars indicating standard errors w.r.t. repetitions on disjoint training sets. Right: Corre-\nsponding kernel mixtures. 
For p = 1, consistently sparse solutions are obtained, while the optimal p = 2 distributes weights over the weighted degree and the two spectrum kernels, in good agreement with [22].

optimization strategies are based on semi-infinite programming and Newton descent, both interleaved with chunking-based SVM training. Execution times moreover revealed that our interleaved optimization vastly outperforms commonly used wrapper approaches.
We would like to note that there is a certain preference/obsession for sparse models in the scientific community, for various reasons. The present paper, however, clearly shows that sparsity by itself is not the ultimate virtue to be strived for. Rather, on the contrary: non-sparse models may improve quite impressively over sparse ones. The reason for this is less obvious, and its theoretical exploration goes well beyond the scope of this submission. We remark nevertheless that some interesting asymptotic results exist that show model selection consistency of sparse MKL (or the closely related group lasso) [2, 14]; in other words, in the limit n → ∞, MKL is guaranteed to find the correct subset of kernels. However, the rate of convergence to the true estimator also needs to be considered; thus we conjecture that the rate slower than $\sqrt{n}$ which is common to sparse estimators [11] may be one of the reasons for the excellent (non-asymptotic) results of non-sparse MKL. In addition to the convergence rate, the variance properties of MKL estimators may play an important role in elucidating the performance seen in our various simulation experiments.
Intuitively speaking, we observe clearly that in some cases all features, even though they may contain redundant information, are to be kept, since setting their contributions to zero does not improve prediction; i.e., all of them are informative to our MKL models. Note, however, that this result is also class-specific, i.e., 
for some classes we may sparsify. Cross-validation-based model building that includes the choice of p will, however, inevitably tell us which classes should be treated sparsely and which non-sparsely.

Large-scale experiments on TSS recognition even raised the bar for ℓ1-norm MKL: non-sparse MKL proved consistently better than its sparse counterparts, which were outperformed even by an unweighted-sum kernel. This exemplifies how the unprecedented combination of accuracy and scalability of our MKL approach and methods paves the way for progress in other real-world applications of machine learning.

Authors' Contributions

The authors contributed in the following way: MK and UB had the initial idea. MK, UB, SS, and AZ each contributed substantially to the mathematical modelling, the design and implementation of algorithms, the conception and execution of experiments, and the writing of the manuscript. PL had some share in the initial phase, and KRM contributed to the text. Most of the work was done at previous affiliations of several authors: Fraunhofer Institute FIRST (Berlin), Technical University Berlin, and the Friedrich Miescher Laboratory (Tübingen).

Acknowledgments

This work was supported in part by the German BMBF grant REMIND (FKZ 01-IS07007A) and by the European Community under the PASCAL2 Network of Excellence (ICT-216886).

References

[1] T. Abeel, Y. V. de Peer, and Y. Saeys. Towards a gold standard for promoter prediction evaluation. Bioinformatics, 2009.

[2] F. R. Bach. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9:1179–1225, 2008.

[3] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc.
21st ICML. ACM, 2004.

[4] V. B. Bajic, S. L. Tan, Y. Suzuki, and S. Sugano. Promoter prediction analysis on the whole human genome. Nature Biotechnology, 22(11):1467–1473, 2004.

[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

[6] O. Chapelle and A. Rakotomamonjy. Second order optimization of kernel parameters. In Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.

[7] C. Cortes, A. Gretton, G. Lanckriet, M. Mohri, and A. Rostamizadeh. Proceedings of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.

[8] R. Hettich and K. O. Kortanek. Semi-infinite programming: theory, methods, and applications. SIAM Rev., 35(3):380–429, 1993.

[9] S. Ji, L. Sun, R. Jin, and J. Ye. Multi-label multiple kernel learning. In Advances in Neural Information Processing Systems, 2009.

[10] G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. JMLR, 5:27–72, 2004.

[11] H. Leeb and B. M. Pötscher. Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics, 142:201–211, 2008.

[12] C. Longworth and M. J. F. Gales. Combining derivative and parametric kernels for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 17(4):748–757, 2009.

[13] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.

[14] Y. Nardi and A. Rinaldo. On the asymptotic properties of the group lasso estimator for linear models. Electron. J. Statist., 2:605–633, 2008.

[15] S. Olhede, M. Pontil, and J. Shawe-Taylor. Proceedings of the PASCAL2 Workshop on Sparsity in Machine Learning and Statistics, 2009.

[16] B. A. Olshausen and D. J. Field.
Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

[17] C. S. Ong and A. Zien. An Automated Combination of Kernels for Predicting Protein Subcellular Localization. In Proc. of the 8th Workshop on Algorithms in Bioinformatics, 2008.

[18] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, pages 775–782, 2007.

[19] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

[20] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[21] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large Scale Multiple Kernel Learning. Journal of Machine Learning Research, 7:1531–1565, July 2006.

[22] S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–e480, 2006.

[23] Y. Suzuki, R. Yamashita, K. Nakai, and S. Sugano. dbTSS: Database of human transcriptional start sites and full-length cDNAs. Nucleic Acids Research, 30(1):328–331, 2002.

[24] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. In Proceedings of the International Conference on Machine Learning, 2008.

[25] M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In IEEE 11th International Conference on Computer Vision (ICCV), pages 1–8, 2007.

[26] Z. Xu, R. Jin, I. King, and M. Lyu. An extended level method for efficient multiple kernel learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1825–1832, 2009.

[27] A. Zien and C. S. Ong. Multiclass multiple kernel learning.
In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 1191–1198. ACM, 2007.