{"title": "Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging", "book": "Advances in Neural Information Processing Systems", "page_first": 1421, "page_last": 1429, "abstract": "Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the vector of base kernel mixing coefficients. Existing methods, however, neither regularize nor exploit potentially useful information pertaining to how kernels in the input set 'interact'; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q \\succeq 0, one can impose a desired covariance structure on mixing coefficient selection, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model\u2019s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject\u2019s conversion to Alzheimer\u2019s Disease (AD) by exploiting aggregate information from several distinct imaging modalities. Here, our new model outperforms the state of the art (p-values << 10\u22123 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity).", "full_text": "Q-MKL: Matrix-induced Regularization in\nMulti-Kernel Learning with Applications to\n\nNeuroimaging\u2217\n\nChris Hinrichs\u2020\u2021 Vikas Singh\u2020\u2021\n\nJiming Peng\u00a7\n\nSterling C. 
Johnson\u2020\u2021\n\n\u2020University of Wisconsin\n\nMadison, WI\n\n\u00a7University of Illinois\nUrbana-Champaign, IL\n\n\u2021Geriatric Research Education & Clinical Center\n\nWm. S. Middleton Memorial VA Hospital, Madison, WI\n\n{ hinrichs@cs, vsingh@biostat, scj@medicine }.wisc.edu\n\npengj@illinois.edu\n\nAbstract\n\nMultiple Kernel Learning (MKL) generalizes SVMs to the setting where one\nsimultaneously trains a linear classi\ufb01er and chooses an optimal combination of\ngiven base kernels. Model complexity is typically controlled using various norm\nregularizations on the base kernel mixing coef\ufb01cients. Existing methods neither\nregularize nor exploit potentially useful information pertaining to how kernels in\nthe input set \u2018interact\u2019; that is, higher order kernel-pair relationships that can be\neasily obtained via unsupervised (similarity, geodesics), supervised (correlation\nin errors), or domain knowledge driven mechanisms (which features were used\nto construct the kernel?). We show that by substituting the norm penalty with an\narbitrary quadratic function Q (cid:23) 0, one can impose a desired covariance struc-\nture on mixing weights, and use this as an inductive bias when learning the con-\ncept. This formulation signi\ufb01cantly generalizes the widely used 1- and 2-norm\nMKL objectives. We explore the model\u2019s utility via experiments on a challeng-\ning Neuroimaging problem, where the goal is to predict a subject\u2019s conversion to\nAlzheimer\u2019s Disease (AD) by exploiting aggregate information from many dis-\ntinct imaging modalities. Here, our new model outperforms the state of the art\n(p-values (cid:28) 10\u22123). 
We briefly discuss ramifications in terms of learning bounds (Rademacher complexity).

1 Introduction
Kernel learning methods (such as Support Vector Machines) are conceptually simple, strongly rooted in statistical learning theory, and can often be formulated as a convex optimization problem. As a result, SVMs have come to dominate the landscape of supervised learning applications in bioinformatics, computer vision, neuroimaging, and many other domains. A standard SVM-based 'learning system' may be conveniently thought of as a composition of two modules [1, 2, 3, 4]: (1) feature pre-processing, and (2) a core learning algorithm. The design of a kernel (feature pre-processing) may involve using different sets of extracted features, dimensionality reductions, or parameterizations of the kernel functions. Each of these alternatives produces a distinct kernel matrix. While much research has focused on efficient methods for the latter (i.e., support vector learning) step, specific choices of feature pre-processing are frequently a dominant factor in the system's overall performance as well, and may involve significant user effort. Multi-kernel learning [5, 6, 7] transfers a part of this burden from the user to the algorithm. Rather than selecting a single kernel, MKL offers the flexibility of specifying a large set of kernels corresponding to the many options (i.e., kernels) available, and additively combining them to construct an optimized, data-driven Reproducing Kernel Hilbert Space (RKHS), while simultaneously finding a max-margin classifier.

∗Supported by NIH (R01AG040396), (R01AG021155); NSF (RI 1116584), (DMS 09-15240 ARRA), and (CMMI-1131690); Wisconsin Partnership Proposal; UW ADRC; UW ICTR (1UL1RR025011); AFOSR (FA9550-09-1-0098); and NLM (5T15LM007359). The authors would like to thank Maxwell Collins and Sangkyun Lee for many helpful discussions.
MKL has turned out to be very successful in many applications: on several important vision problems (such as image categorization), some of the best known results on community benchmarks come from MKL-type methods [8, 9]. In the context of our primary motivating application, the current state of the art in multi-modality neuroimaging-based Alzheimer's Disease (AD) prediction [10] is achieved by multi-kernel methods [3, 4], where each imaging modality spawns a kernel, or set of kernels. In allowing the user to specify an arbitrary number of base kernels for combination, MKL provides more expressive power, but this comes with the responsibility to regularize the kernel mixing coefficients so that the classifier generalizes well. While the importance of this regularization cannot be overstated, it is also a fact that commonly used ℓp-norm regularizers operate on kernels separately, without explicitly acknowledging dependencies and interactions among them. To see how such dependencies can arise in practice, consider our neuroimaging learning problem of interest: the task of learning to predict the onset of AD. A set of base kernels K_1, ..., K_M are derived from several different medical imaging modalities (MRI; PET), image processing methods (morphometric; anatomical modelling), and kernel functions (linear; RBF). Some features may be shared between kernels, or kernel functions may use similar parameters. As a result we expect the kernels' behaviors to exhibit some correlational, or other cluster structure according to how they were constructed. (See Fig. 2 (a) and related text for a concrete discussion of these behaviors in our problem of interest.) We will denote this relationship as Q ∈ R^{M×M}.
Ideally, the regularizer should reflect these dependencies encoded by Q, as they can significantly impact the learning characteristics of a linearly combined kernel.
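As a small, self-contained illustration (a numpy sketch with toy numbers, not taken from the paper), the quadratic penalty βᵀQβ discussed here recovers the familiar norm penalties for special choices of Q, while a block-structured Q couples the mixing coefficients group-wise:

```python
import numpy as np

beta = np.array([0.5, 0.3, 0.0, 0.2])  # toy mixing coefficients, beta >= 0
M = len(beta)

def quad_penalty(beta, Q):
    """Evaluate the matrix-induced penalty beta^T Q beta."""
    return beta @ Q @ beta

# Q = all-ones matrix: beta^T Q beta = ||beta||_1^2 for beta >= 0 (1-norm MKL).
assert np.isclose(quad_penalty(beta, np.ones((M, M))), np.sum(beta) ** 2)

# Q = identity: beta^T Q beta = ||beta||_2^2 (2-norm MKL).
assert np.isclose(quad_penalty(beta, np.eye(M)), np.sum(beta ** 2))

# A block-diagonal Q (kernels {0,1} and {2,3} grouped, e.g., by modality):
# the penalty becomes a sum of squared within-group weight masses,
# beta^T Q beta = (0.5+0.3)^2 + (0.0+0.2)^2 = 0.68.
Q_block = np.kron(np.eye(2), np.ones((2, 2)))
assert np.isclose(quad_penalty(beta, Q_block), 0.68)
```

The toy `Q_block` is only meant to show the mechanism; the paper's experiments construct Q from error correlations and modality structure.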
Some extensions work at the level\nof group membership (e.g., [11]), but do not explicitly quantify these interactions. Instead, rather\nthan penalizing covariances or inducing sparsity among groups of kernels, it may be bene\ufb01cial to\nreward such covariances, so as to better re\ufb02ect a latent cluster structure between kernels. In this\npaper, we show that a rich class of regularization schemes are possible under a new MKL formulation\nwhich regularizes on Q directly \u2013 the model allows one to exploit domain knowledge (as above) and\nstatistical measures of interaction between kernels, employ estimated error covariances in ways that\nare not possible with (cid:96)p-norm regularization, or, encourage sparsity, group sparsity or non-sparsity\nas needed \u2013 all within a convex optimization framework. We call this form of multi-kernel learning,\nQ-norm MKL or \u201cQ-MKL\u201d. This paper makes the following contributions: (a) presents our new\nQ-MKL model which generalizes 1- (and 2-) norm MKL models, (b) provides a learning theoretic\nresult showing that Q-MKL can improve MKL\u2019s generalization error rate, (c) develops ef\ufb01cient\noptimization strategies (to be distributed with the Shogun toolbox), and (d) provides empirical results\ndemonstrating statistically signi\ufb01cant gains in accuracy on the important AD prediction problem.\nBackground. The development of MKL methods began with [5], which showed that the problem\nof learning the right kernel for an input problem instance could be formulated as a Semi-De\ufb01nite\nProgram (SDP). Subsequent papers have focused on designing more ef\ufb01cient optimization methods,\nwhich have enabled its applications to large-scale problem domains. To this end, the model in [5]\nwas shown to be solvable as a Second Order Cone Program [12], a Semi-In\ufb01nite Linear Program\n[6], and via gradient descent methods in the dual and primal [7, 13]. 
More recently, efforts have focused on generalizing MKL to arbitrary p-norm regularizers where p > 1 [13, 14] while maintaining overall efficiency. In [14], the authors briefly mentioned that more general norms may be possible, but this issue was not further examined. A non-linear "hyperkernel" method was proposed [15] which maps the kernels themselves to an implicit RKHS; however, this method is computationally very demanding (it has 4th-order interactions among training examples). The authors of [16] proposed to first select the sub-kernel weights by minimizing an objective function derived from Normalized Cuts, and subsequently train an SVM on the combined kernel. In [17, 2], a method was proposed for selecting an optimal finite combination from an infinite parameter space of kernels. Contemporary to these results, [18] showed that if a large number of kernels had a desirable shared structure (e.g., followed directed acyclic dependencies), extensions of MKL could still be applied. Recently in [8], a set of base classifiers were first trained using each kernel and were then boosted to produce a strong multi-class classifier. At this time, MKL methods [8, 9] provide some of the best known accuracy on image categorization datasets such as Caltech101/256 (see www.robots.ox.ac.uk/~vgg/software/MKL/). Next, we describe in detail the motivation and theoretical properties of Q-MKL.

2 From MKL to Q-MKL
MKL Models. Adding kernels corresponds to taking a direct sum of Reproducing Kernel Hilbert Spaces (RKHS), and scaling a kernel by a constant c scales the axes of its RKHS by √c. In the MKL setting, the SVM margin regularizer (1/2)‖w‖² becomes a weighted sum (1/2) Σ_{m=1}^M ‖w_m‖²_{H_m}/β_m over contributions from RKHSs H_1, ..., H_M, where the vector of mixing coefficients β scales each respective RKHS [14]. A norm penalty on β ensures that the units in which the margin is measured are meaningful (provided the base kernels are normalized). The MKL primal problem is given as

  min_{w,b,β≥0,ξ≥0}  (1/2) Σ_m ‖w_m‖²_{H_m}/β_m + C Σ_i ξ_i + ‖β‖²_p
  s.t.  y_i (Σ_m ⟨w_m, φ_m(x_i)⟩_{H_m} + b) ≥ 1 − ξ_i,   (1)

where φ_m(x) is the (potentially unknown) transformation from the original data space to the mth RKHS H_m. As in SVMs, we turn to the dual problem to see the role of kernels:

  max_{0≤α≤C}  αᵀ1 − (1/2)‖G‖_q,   G ∈ R^M;  G_m = (α ∘ y)ᵀ K_m (α ∘ y),   (2)

where ∘ denotes element-wise multiplication, and the dual q-norm follows the identity 1/p + 1/q = 1. Note that the primal norm penalty ‖β‖²_p becomes a dual norm on the vector G. At optimality, w_m = β_m (α ∘ y)ᵀ φ_m(X), and so G_m = (α ∘ y)ᵀ K_m (α ∘ y) = ‖w_m‖²_{H_m}/β_m², i.e., G is the vector of scaled classifier norms. This shows that the dual norm is tied to how MKL measures the margin in each RKHS.
The Q-MKL model. The key characteristic of Q-MKL is that the standard (squared) ℓp-norm penalty on β, along with the corresponding dual-norm penalty in (2), is substituted with a more general class of quadratic penalty functions, expressed as βᵀQβ = ‖β‖²_Q. Here ‖β‖_Q = √(βᵀQβ) is a Mahalanobis (matrix-induced) norm so long as Q ⪰ 0. In this framework, the burden of choosing a kernel is deferred to a choice of Q-function. This approach gives the algorithm greater flexibility while controlling model complexity, as we will discuss shortly. The model we optimize is

  min_{w,b,β≥0,ξ≥0}  (1/2) Σ_m ‖w_m‖²_{H_m}/β_m + C Σ_i ξ_i + βᵀQβ
  s.t.  y_i (Σ_m ⟨w_m, φ_m(x_i)⟩_{H_m} + b) ≥ 1 − ξ_i,   (3)

where the last objective term provides a bias relative to βᵀQβ. The dual problem becomes max_{0≤α≤C} αᵀ1 − (1/2)√(GᵀQ⁻¹G). It is easy to see that if Q = 1_{M×M}, we obtain the p = 1 form of (1), i.e., 1-norm MKL, as a special case, because βᵀ1_{M×M}β = ‖β‖²_1. On the other hand, setting Q to I_{M×M} (the identity) reduces to 2-norm MKL.

3 The case for Q-MKL
Extending the MKL regularizer to arbitrary quadratics Q ⪰ 0 significantly expands the richness of the MKL framework; yet we can show that for reasonable choices of Q, this actually decreases MKL's learning-theoretic complexity. Joachims et al. [19] derived a theoretical generalization error bound on kernel combinations which depends on the degree of redundancy between support vectors in SVMs trained on the base kernels individually. Using this type of correlational structure, we can derive a Q function between kernels to automatically select a combination of kernels which will maximize this bound. This type of Q function can be shown to have lower Rademacher complexity (see below), while simultaneously decreasing the error bound from [19], which does not directly depend on Rademacher complexity.

3.1 Virtual Kernels, Rademacher Complexity and Renyi Entropy
If we decompose Q into its component eigenvectors, we can see that each eigenvector defines a linear combination of kernels. This observation allows us to analyze Q-MKL in terms of these objects, which we will refer to as Virtual Kernels.
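A minimal numpy sketch of this construction (toy random kernels; the names, e.g., `K_virtual`, are ours): given the eigendecomposition Q = VΛVᵀ, each eigenvector defines a virtual kernel, the penalty βᵀQβ equals Σ_i λ_i γ_i² in the eigenbasis, and the combined kernel can be re-expressed in terms of the virtual kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 5, 20

# Toy PSD base kernels K_m = X_m X_m^T (stand-ins for real kernel matrices).
Ks = [(lambda X: X @ X.T)(rng.standard_normal((n, 3))) for _ in range(M)]

# A positive definite Q encoding pairwise kernel interactions (random here).
A = rng.standard_normal((M, M))
Q = A @ A.T + np.eye(M)

# Eigendecomposition Q = V diag(lam) V^T.
lam, V = np.linalg.eigh(Q)

# Virtual kernels: combinations of base kernels along Q's eigenvectors.
K_virtual = [sum(V[m, i] * Ks[m] for m in range(M)) for i in range(M)]

# For beta = V gamma, the penalty beta^T Q beta equals sum_i lam_i gamma_i^2.
beta = rng.random(M)
gamma = V.T @ beta
assert np.isclose(beta @ Q @ beta, np.sum(lam * gamma**2))

# The combined kernel sum_m beta_m K_m equals sum_i gamma_i K_virtual[i].
K_combined = sum(b * K for b, K in zip(beta, Ks))
K_alt = sum(g * Kv for g, Kv in zip(gamma, K_virtual))
assert np.allclose(K_combined, K_alt)
```

Note that for a generic Q the virtual kernels need not be positive semi-definite; the discussion below treats both cases.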
We first show that as Q⁻¹'s eigenvalues decay, so do the traces of the virtual kernels. Assuming Q⁻¹ has a bounded, non-uniform spectrum, this property can then be used to analyze (and bound) Q-MKL's Rademacher complexity, which has been shown to depend on the traces of the base kernels. We then offer a few observations on how Q⁻¹'s Renyi entropy [20] relates to these learning-theoretic bounds.

Virtual Kernels. In the following, assume that Q ≻ 0, with eigendecomposition Q = VΛVᵀ and V = {v_1, ..., v_M}. First, observe that because Q's eigenvectors provide an orthonormal basis of R^M, β ∈ R^M can be expressed as a linear combination in this basis with γ as its coefficients: β = Σ_i γ_i v_i = Vγ. Substituting in βᵀQβ we have

  βᵀQβ = (γᵀVᵀ) VΛVᵀ (Vγ) = γᵀ(VᵀV)Λ(VᵀV)γ = γᵀΛγ = Σ_i γ_i² λ_i.   (4)

This simple observation offers an alternate view of what Q-MKL is actually optimizing. Each eigenvector v_i of Q can be used to define a linear combination of kernels, which we will refer to as the virtual kernel K̃_i = Σ_m v_i(m) K_m. Note that if K̃_i ⪰ 0 ∀i, then they each define an independent RKHS. This can be ensured by choosing Q in a specific way, if desired. This leads to the following result:

Lemma 1. If K̃_i ⪰ 0 ∀i, then Q-MKL is equivalent to 2-norm MKL using virtual kernels instead of base kernels.

Proof. Let μ_i = γ_i √λ_i. Then βᵀQβ = ‖μ‖²_2 (eq. 4) and K* = Σ_m β_m K_m = Σ_i γ_i Σ_m v_i(m) K_m = Σ_i μ_i λ_i^{-1/2} K̃_i, where K̃_i = Σ_m v_i(m) K_m is the ith virtual kernel. The learned kernel K* is a weighted combination of virtual kernels, and the coefficients are regularized under a squared 2-norm.

Rademacher Complexity in MKL. With this result in hand, we can now evaluate the Rademacher complexity of Q-MKL by using a recent result for p-norm MKL. We first state a theorem from [21], which relates the Rademacher complexity of MKL to the traces of its base kernels.

Theorem 1. ([21]) The empirical Rademacher complexity on a sample set S of size n, with M base kernels, is bounded as follows (with η_0 = 23/22):

  R_S(H^p_M) ≤ √(η_0 q ‖u‖_q) / n,   (5)

where u = [Tr(K_1), ..., Tr(K_M)]ᵀ and 1/p + 1/q = 1.

The bound in (5) shows that the Rademacher complexity R_S(·) depends on ‖u‖_q, a norm on the base kernels' traces. Assuming they are normalized to have unit trace, the bound for p = q = 2-norm MKL is governed by ‖u‖_2 = √M. However, in Q-MKL the virtual kernel traces are not equal, and are in fact given by Tr(K̃_i) = 1ᵀv_i / √λ_i. With this expression for the traces of the virtual kernels, we can now prove that the bound given in (5) is strictly decreased as long as the eigenvalues ψ_i of Q⁻¹ are in the range (0, 1]. (Adding 1 to the diagonal of Q is sufficient to guarantee this.)

Theorem 2. If Q⁻¹ ≠ I_{M×M} and K̃_i ⪰ 0 ∀i, then the bound on Rademacher complexity given in (5) is strictly lower for Q-MKL than for 2-norm MKL.

Proof. By Lemma 1, the bound in (5) will decrease if ‖u‖_2, the norm on the virtual kernel traces, decreases. As shown above, the virtual kernel traces are given as Tr(K̃_i) = √ψ_i 1ᵀv_i, meaning that ‖u‖²_2 = Σ_i ψ_i (1ᵀv_i)² = Σ_i ψ_i 1ᵀ v_i v_iᵀ 1 = 1ᵀQ⁻¹1. Clearly, this sum is maximal for ψ_i = 1 ∀i, which is true if and only if Q⁻¹ = I_{M×M}. This means that when Q ≠ I_{M×M}, the bound in (5) is strictly decreased.

Note that requiring the virtual kernels to be p.s.d., while achievable (see supplements), is somewhat restrictive. In practice, such a Q matrix may not differ substantially from I_{M×M}. We therefore provide the following result, which frees us from this restriction and has more practical significance.

Theorem 3. Q-MKL is equivalent to the following model:

  min_{w,b,μ,ξ≥0}  (1/2) Σ_m ‖w_m‖²_{V_m}/μ_m + C Σ_i ξ_i + ‖μ‖²_2
  s.t.  y_i (Σ_m ⟨w_m, φ_m(x_i)⟩_{V_m} + b) ≥ 1 − ξ_i,   Q^{-1/2} μ ≥ 0,   (6)

where φ_m() is the feature transform mapping data space to the mth virtual kernel, denoted as V_m.

While the virtual kernels themselves may be indefinite, recall that μ = Q^{1/2} β, and so the constraint Q^{-1/2} μ ≥ 0 is equivalent to β ≥ 0, guaranteeing that the combined kernel will be p.s.d. This formulation is slightly different from the 2-norm MKL formulation; however, it does not alter the theoretical guarantee of [21], providing a stronger result.

Renyi Entropy. Renyi entropy [20] significantly generalizes the usual notion of Shannon entropy [22, 23, 24], has applications in statistics and many other fields, and has recently been proposed as an alternative to PCA [22]. Thm. 2 points to an intuitive explanation of where the benefit of a Q regularizer comes from, if we analyze the Renyi entropy of the distribution on kernels defined by Q⁻¹, treating Q⁻¹ as a kernel density estimator. The quadratic Renyi entropy of a probability measure p is given as

  H(p) = − log ∫ p²(x) dx.

Now, if we use a kernel function (i.e., Q⁻¹) and a finite sample (i.e., the base kernels) as a kernel density estimator (cf. [15]), then with some normalization we can derive an estimate of the underlying probability p̂, which is a distribution over base kernels. We can then interpret its Renyi entropy as a complexity measure on the space of combined kernels. Eq. (5.2) in [23] relates the virtual kernel traces to the Renyi entropy estimator of Q⁻¹ as ∫ p̂²(x) dx = (1/N²) 1ᵀQ⁻¹1,¹ which leads to a nice connection to Thm. 2. This view informs us that setting Q⁻¹ = I_{M×M} (i.e., 2-norm MKL) has maximal Renyi entropy because it is maximally uninformative; adding structure to Q⁻¹ concentrates p̂, reducing both its Renyi entropy and its Rademacher complexity together.
This series of results suggests an entirely new approach to analyzing the Rademacher complexity of MKL methods. The proof of Thm. 2 relies on decreasing a norm on the virtual kernel traces, which we now see relates directly to the Renyi entropy of Q⁻¹, as well as to a decrease in the Rademacher complexity of the search space of combined kernels. It is even possible that directly analyzing Renyi entropy in a multi-kernel setting may be useful in deriving analogous bounds in, e.g., Indefinite Kernel Learning [25], because the virtual kernels are in general indefinite.

3.2 Special Cases: Q-SVM and relative margin
Before describing our optimization strategy, we discuss several variations on the Q-MKL model.
Q-SVM.
An interesting special case of Q-MKL is Q-SVM, which generalizes several recent (but independently developed) models in the literature [26, 27]. If the base kernels are rank-1 (i.e., singleton features), then each coefficient β_m effectively becomes a feature weight, and a 2-norm penalty on β is a penalty on weights. Q-MKL therefore reduces to a form of SVM in which ‖w‖² becomes wᵀQw. Thus, in such cases we can reduce the Q-MKL model to a simple QP, which we call Q-SVM. Please refer to the supplements for details and some experimental results.
Relative Margin. Several interesting extensions to the SVM and MKL frameworks have been proposed which focus on relative margin methods [28, 29], which maximize the margin relative to the spread of the data. In particular, Q-MKL can easily be modified to incorporate the Relative Margin Machine (RMM) model [28] by replacing Module 1 in (7) with the RMM objective. Our alternating optimization approach (described next) is not affected by this addition; however, the additional constraints would mean that SMO-based strategies would not be applicable.

4 Optimization
We now present the core engine to solve (3). Most MKL implementations make use of an alternating minimization strategy which first minimizes the objective in terms of the SVM parameters, and then with respect to the sub-kernel weights β. Since the MKL problem is convex, this method leads to global convergence [7, 14], and minor modifications to standard SVM implementations are sufficient. Q-MKL generalizes ‖β‖²_p to arbitrary convex quadratic functions, while the feasible set is the same as for MKL. This directly gives that the Q-MKL model in (3) is convex.
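The alternating strategy can be sketched end-to-end as follows. This is an illustrative Python version only, not the authors' Shogun implementation: Module 1 is delegated to a standard SVM solver on the combined kernel, and the β-step solves the constrained subproblem with a generic SLSQP optimizer rather than a specialized method.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVC

def qmkl_alternate(Ks, Q, y, C=1.0, iters=10):
    """Schematic alternating minimization for Q-MKL (illustrative sketch).

    Module 1: SVM dual solve with the combined kernel held fixed.
    Module 2: update beta by minimizing sum_m ||w_m||^2 / beta_m subject to
              beta^T Q beta <= 1, beta >= 0 (generic SLSQP in place of a
              specialized nonlinear-system solver).
    """
    M, y = len(Ks), np.asarray(y, dtype=float)
    beta = np.full(M, 1.0 / M)
    for _ in range(iters):
        K = sum(b * Km for b, Km in zip(beta, Ks))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        # Recover alpha_i y_i (nonzero only on support vectors).
        ay = np.zeros(len(y))
        ay[svm.support_] = svm.dual_coef_.ravel()
        # ||w_m||^2_{H_m} = beta_m^2 (alpha o y)^T K_m (alpha o y).
        w2 = np.array([(b ** 2) * ay @ Km @ ay for b, Km in zip(beta, Ks)])
        obj = lambda t: np.sum(w2 / np.maximum(t, 1e-12))
        cons = [{"type": "ineq", "fun": lambda t: 1.0 - t @ Q @ t}]
        beta = minimize(obj, beta, bounds=[(1e-12, None)] * M,
                        constraints=cons, method="SLSQP").x
    return beta
```

The β-update here is a stand-in; the paper develops a more efficient solver for that subproblem below.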
We will broadly follow this strategy, but as will become clear shortly, interaction between sub-kernel weights makes the optimization of β more involved (than in [6, 14]), and requires alternative solution mechanisms. We may consider this process as a composition of two modules: one which solves for the SVM dual parameters (α) with fixed β, and the other solving for β with fixed α:

  (Module 1)  max_{0≤α≤C}  αᵀ1 − αᵀ Y K Y α   s.t.  αᵀy = 0   (7)

  (Module 2)  min_{β≥0}  Σ_m ‖w_m‖²/β_m   s.t.  βᵀQβ ≤ 1   (8)

¹Note that this involves a Gaussian assumption, but [24] provides extensions to non-Gaussian kernels.

Using a result from [14] we can replace the βᵀQβ objective term with a quadratic constraint, which gives the problem in (8). Notice that (8) has a sum of ratios with optimization variables in the denominator, while the constraint is quadratic; this means that standard convex optimization toolkits may not be able to solve this problem without significant reformulation from its canonical form in (8). Our approach is to search for a stationary point by representing the gradient as a non-linear system. Writing the gradient in terms of the Lagrange multiplier δ, and setting it equal to 0, gives:

  ‖w_m‖²_{H_m}/β_m² − δ (Qβ)_m = 0,  ∀m ∈ {1, ..., M}.   (9)

We now seek to eliminate δ so that the non-linear system will be limited to quadratic terms in β, allowing us to use a non-linear system solver. Let W = Diag(‖w_1‖²_{H_1}, ..., ‖w_M‖²_{H_M}), and β⁻² = (β_1⁻², ..., β_M⁻²). We can then write Wβ⁻² = δ(Qβ). Now, solving for β (on the right-hand side) gives

  β = (1/δ) Q⁻¹ W β⁻².   (10)

Because Q ≻ 0 and β ≥ 0, at optimality the constraint βᵀQβ ≤ 1 must be active. So, we can plug in the above identity to solve for δ:

  1 = ((1/δ) Q⁻¹Wβ⁻²)ᵀ Q ((1/δ) Q⁻¹Wβ⁻²)  ⟹  δ = √((Wβ⁻²)ᵀ Q⁻¹ (Wβ⁻²)) = ‖Wβ⁻²‖_{Q⁻¹},   (11)

which shows that δ effectively normalizes Wβ⁻² according to Q⁻¹. We can now solve (10) in terms of β using a nonlinear root finder, such as the GNU Scientific Library; in practice this method is quite efficient, typically requiring 10 to 20 outer iterations. Putting these parts together, we propose the following algorithm for optimizing Q-MKL:

Algorithm 1. Q-MKL
Input: kernels {K_1, ..., K_M}; Q ⪰ 0 ∈ R^{M×M}; labels y ∈ {±1}^N.
Outputs: α, b, β
β^(0) = 1/M; t = 0 (iterations)
while not optimal do
  K^(t) ← Σ_m β_m^(t) K_m
  α^(t), b^(t) ← SVM(K^(t), C, y)   (Module 1, (7))
  W_mm ← (β_m^(t))² α^(t)ᵀ K_m α^(t)
  β^(t+1) ← arg min of Problem (8)   (Module 2, (8))
  t ← t + 1
end while

4.1 Convergence
We can show that our model can be solved optimally by noting that Module 2 can be precisely optimized at each step. If Module 2 cannot be solved precisely, then Algorithm 1 may not converge. The following result assures us that Module 2 can indeed be solved precisely, by reducing it to a convex Semi-Definite Program (SDP).

Theorem 4. The solution to Problem (8) is the same as the solution to the following SDP:

  min_{ν≥0, β≥0, Z∈R^{M×M}}  wᵀν   (12)
  subject to  [ν_m, 1; 1, β_m] ⪰ 0 ∀m,   [1, βᵀ; β, Z] ⪰ 0,   Tr(QZ) ≤ 1.   (13)

Proof. The first PSD constraint in (13) requires that ν_m = β_m⁻¹, meaning that the objective (12) is the same as that of Problem (8). From the second we have Z = ββᵀ, and so Tr(QZ) = βᵀQβ; therefore the feasible sets are equivalent.

The last PSD constraint is only necessary to ensure that βᵀQβ ≤ 1, and can be replaced with that quadratic constraint. Doing so yields a Second-Order Cone Program (SOCP) which is also amenable to standard solvers. Note that it is not necessary to solve for β as an SDP, though it may nevertheless be an effective solution mechanism, depending on the size and characteristics of the problem.

Figure 1: Comparison of spatial smoothness of weights chosen by Q-SVM and SVM with gray matter (GM) density maps. Left (a–b): weights given by a standard SVM; Right (c–d): weights given by Q-SVM.

5 Experiments
We performed extensive experiments to validate Q-MKL, examine the effect it has on β, and to assess its advantages in the context of our motivating neuroimaging application. In these main experiments, we demonstrate how domain knowledge can be adapted to improve the algorithm's performance. Our focus on a practical application is intended as a demonstration of how domain knowledge can be seamlessly incorporated into a learning model, giving significant gains in accuracy. We also performed experiments on the UCI repositories, which are described in detail in the supplements. Briefly, in these experiments Q-MKL performed as well as, or better than, 1- and 2-norm MKL on most datasets, showing that even in the absence of significant domain knowledge, Q-MKL can still perform about as well as existing MKL methods.
Image preprocessing.
In our main experiments we used brain scans of AD patients and Cognitively Normal healthy controls (CN) from the Alzheimer's Disease Neuroimaging Initiative (ADNI) [30] in a set of cross-validation experiments. ADNI is a landmark study sponsored by the NIH, major pharmaceuticals and others to determine the extent to which multi-modal brain imaging can help predict onset, and monitor progression, of AD. To this end, MKL-type methods have already defined the state of the art for this application [3, 4]. For our experiments, 48 AD subjects and 66 controls were chosen who had both T1-weighted MR scans and Fluoro-Deoxy-Glucose PET (FDG-PET) scans at two time-points two years apart. Standard diffeomorphic methods, known generally as Voxel-Based Morphometry (VBM) (see SPM, www.fil.ion.ucl.ac.uk/spm/), were used to register scans to a common template and calculate Gray Matter (GM) densities at each voxel in the MR scans. We also used Tensor-Based Morphometry (TBM) to calculate maps of longitudinal voxel-wise expansion or contraction over the two-year period. Feature selection was performed separately in each set of images by sorting voxels by t-statistic (calculated using training data), and choosing the highest 2000, 5000, 10000, ..., 250000 voxels in 8 stages. We used linear, quadratic, and Gaussian kernels: a total of 24 kernels per set (GM maps, TBM maps, baseline FDG-PET, FDG-PET at 2-year follow-up), for a total of 96 kernels. For the Q matrix we used the Laplacian of the covariance between single-kernel α parameters (recall the motivation from [19] in Section 3), plus a block-diagonal representing clusters of kernels derived from the same imaging modalities.

5.1 Spatial SVM
Before describing our main experiments, we first return to the Q-SVM model briefly mentioned in 3.2.
To demonstrate that Q-regularizers indeed influence the learned classifier, we performed classification experiments with the Laplacian of the inverse distance between voxels as the Q matrix, and voxel-wise GM density (VBM) as features. Using 10-fold cross-validation with 10 realizations, Q-SVM's accuracy was 0.819, compared to the regular SVM's accuracy of 0.792. These accuracies are significantly different at the α = 0.0005 level under a paired t-test. In Fig. 1 we show a comparison of weights trained by a regular SVM (a–b) and those trained by the spatially regularized SVM (c–d). Note the greater spatial smoothness and the lower incidence of isolated "pockets".

Regularizer           Acc.    Sens.   Spec.
‖β‖1-MKL              0.864   0.771   0.931
‖β‖1.5-MKL            0.875   0.790   0.936
‖β‖2-MKL              0.875   0.789   0.938
Covα                  0.884   0.780   0.942
Lap.(Covα)            0.884   0.785   0.955
Lap.(Covα) + diag     0.888   0.786   0.956

Table 1: Comparison of Q-MKL & MKL. Bold numerals indicate methods not differing from the best at the 0.01 level under a paired t-test. Lap. = "Laplacian"; diag = "Block-diagonal".

5.2 Multi-modality Alzheimer's disease (AD) prediction
Next, we performed multi-modality AD prediction experiments using all 96 kernels across two modalities: MR provides structural information, while FDG-PET assesses hypo-metabolism. Further, we may use several image processing pipelines. Due to the inherent similarities in how the various kernels are derived, there are clear cluster structures and behaviors among the kernels, which we would like to exploit using Q-MKL. We used 10-fold cross-validation with 30 realizations, for a total of 300 folds. Accuracy, sensitivity, and specificity were averaged over all folds. For comparison we also examined 1-, 1.5-, and 2-norm MKL.
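The Q matrix described above (the graph Laplacian of the covariance among single-kernel α patterns, plus a modality block-diagonal) can be sketched in a few lines of numpy. This is a minimal illustration only: the kernel count, modality grouping, edge-weight convention, and the random stand-in α values are all assumptions, not the construction used in the actual experiments.

```python
import numpy as np

# Hypothetical setup: 6 base kernels, 20 training examples. alphas[m] stands
# in for the dual variables obtained by training an SVM on kernel m alone;
# real experiments would use the actual single-kernel alpha patterns.
rng = np.random.default_rng(0)
alphas = rng.random((6, 20))

# Covariance between single-kernel alpha patterns (6 x 6).
C = np.cov(alphas)

# Graph Laplacian of the covariance, using |C| off-diagonals as edge weights
# (this weighting convention is an assumption made for illustration).
W = np.abs(C) - np.diag(np.diag(np.abs(C)))
L = np.diag(W.sum(axis=1)) - W

# Block-diagonal term grouping kernels by (hypothetical) modality:
# kernels 0-2 share one modality, kernels 3-5 another.
B = np.zeros_like(L)
B[:3, :3] = 1.0
B[3:, 3:] = 1.0

Q = L + B

# A Laplacian with nonnegative weights is PSD, and each all-ones block is
# PSD (a ones outer product), so Q >= 0 and the Q-MKL objective stays convex.
min_eig = np.linalg.eigvalsh(Q).min()
```

The PSD check at the end matters because Q ⪰ 0 is exactly the condition under which the β^T Qβ penalty keeps the overall problem convex.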
As MKL methods have emerged as the state of the art in this domain [3, 4], and have performed favorably in extensive evaluations against various baselines such as single-kernel methods and naïve combinations, we focus our analysis on comparisons with existing MKL methods. Results are shown in Table 1. Q-MKL had the highest performance overall, reducing the error rate from 12.5% to 11.2% (significant at the α = 0.001 level). Note that the in vivo diagnostic error rate for AD is believed to be near 8–10%, meaning that this improvement is substantial. The primary benefit of current sparse MKL methods is that they effectively filter out uninformative or noisy kernels; however, the kernels used in these experiments are all derived from clinically relevant neuroimaging data, and are thus highly reliable. Q-MKL's performance suggests that it boosts overall accuracy even when there are few noisy kernels to discard.
Virtual kernel analysis. We next turn to an empirical analysis of the covariance structures found in the data, as a concrete demonstration of the type of patterns towards which the Q-MKL regularizer biases β. Recall that Q's eigenvectors show which patterns are encouraged or deterred, in proportion to their eigenvalues. In Fig. 2, we compare the Q matrix used in the ADNI experiments, based on the correlations of single-kernel α parameters (a), the 3 eigenvectors of its graph Laplacian with smallest eigenvalues (b–d), and the β vector optimized by Q-MKL (e). In (a), we can see that while the VBM kernels (first block of 24) and the TBM kernels (second block) are each highly correlated internally, the two blocks appear fairly uncorrelated with one another. The FDG-PET kernels (last 48 kernels) are much more strongly interrelated. Interestingly, the first eigenvector is almost entirely devoted to two large blocks of kernels: those which come from MRI data, and those which come from FDG-PET data.
The positive elements in the off-diagonal encourage sparsity within these two super-blocks of kernels. Somewhat to the contrary, the next two eigenvectors have negative weights in the region between TBM and VBM kernels, encouraging non-sparsity between these two blocks. In (e) we see that the optimized β discards most (but not all) TBM kernels, puts the strongest weight on a few VBM kernels, and keeps a wider distribution over the FDG-PET kernels.
Conclusion. MKL is an elegant method for aggregating multiple data views, and is being extensively adopted for a variety of problems in machine learning, computer vision, and neuroimaging. Q-MKL extends this framework to exploit higher-order interactions between kernels using supervised, unsupervised, or domain-knowledge driven measures. This flexibility can impart greater control over how the model utilizes cluster structure among kernels, and can effectively encourage cancellation of errors wherever possible. We have presented a convex optimization formulation that efficiently solves the resulting model, and shown experiments on a challenging problem of identifying AD based on multi-modal brain imaging data (obtaining statistically significant improvements). Our implementation will be made available within the Shogun toolbox (www.shogun-toolbox.org).

Figure 2: Cov. Q used in AD experiments (a); three smallest graph Laplacian eigenvectors (b–d); outer product of optimized β (e). Note the block structure in (a–d) relating to the imaging modalities and kernel functions.

References
[1] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. JMLR, 3:1157–1182, 2003.
[2] P. V. Gehler and S. Nowozin. Let the kernel figure it out; principled learning of pre-processing for kernel classifiers. In CVPR, 2009.
[3] C. Hinrichs, V. Singh, G. Xu, and S.C. Johnson.
Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population. NeuroImage, 55(2):574–589, 2011.
[4] D. Zhang, Y. Wang, L. Zhou, H. Yuan, and D. Shen. Multimodal Classification of Alzheimer's Disease and Mild Cognitive Impairment. NeuroImage, 55(3):856–867, 2011.
[5] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, 2004.
[6] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. JMLR, 7:1531–1565, 2006.
[7] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. JMLR, 9:2491–2521, 2008.
[8] P. V. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009.
[9] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao. Group-sensitive multiple kernel learning for object categorization. In ICCV, 2009.
[10] P. Vemuri, J. L. Gunter, M. L. Senjem, J. L. Whitwell, K. Kantarci, D. S. Knopman, et al. Alzheimer's disease diagnosis in individual subjects using structural MR images: validation studies. NeuroImage, 39(3):1186–1197, 2008.
[11] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. Machine Learning, 79(1):73–103, 2010.
[12] F. R. Bach, G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[13] F. Orabona, L. Jie, and B. Caputo. Online-Batch Strongly Convex Multi Kernel Learning. In CVPR, 2010.
[14] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. ℓp-Norm Multiple Kernel Learning. JMLR, 12:953–997, 2011.
[15] C. S. Ong, A. Smola, and B. Williamson. Learning the kernel with hyperkernels. JMLR, 6:1045–1071, 2005.
[16] L. Mukherjee, V. Singh, J. Peng, and C. Hinrichs.
Learning Kernels for variants of Normalized Cuts: Convex Relaxations and Applications. In CVPR, 2010.
[17] P. V. Gehler and S. Nowozin. Infinite kernel learning. Technical Report 178, Max-Planck Institute for Biological Cybernetics, October 2008.
[18] F. R. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, 2008.
[19] T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorisation. In ICML, 2001.
[20] A. Renyi. On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 547–561, 1961.
[21] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.
[22] R. Jenssen. Kernel entropy component analysis. IEEE Trans. PAMI, pages 847–860, 2009.
[23] M. Girolami. Orthogonal series density estimation and the kernel eigenvalue problem. Neural Computation, 14(3):669–688, 2002.
[24] D. Erdogmus and J. C. Principe. Generalized information potential criterion for adaptive system training. IEEE Trans. Neural Networks, 13(5):1035–1044, 2002.
[25] M. Kowalski, M. Szafranski, and L. Ralaivola. Multiple indefinite kernel learning with mixed norm regularization. In ICML, 2009.
[26] S. Bergsma, D. Lin, and D. Schuurmans. Improved Natural Language Learning via Variance-Regularization Support Vector Machines. In CoNLL, 2010.
[27] R. Cuingnet, M. Chupin, H. Benali, and O. Colliot. Spatial and anatomical regularization of SVM for brain image analysis. In NIPS, 2010.
[28] P. Shivaswamy and T. Jebara. Maximum relative margin and data-dependent regularization. JMLR, 11:747–788, 2010.
[29] K. Gai, G. Chen, and C. Zhang. Learning kernels with radiuses of minimum enclosing balls. In NIPS, 2010.
[30] S. G. Mueller, M. W. Weiner, et al.
Ways toward an early diagnosis in Alzheimer's disease: The Alzheimer's Disease Neuroimaging Initiative. Journal of the Alzheimer's Association, 1(1):55–66, 2005.