{"title": "Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 2519, "page_last": 2527, "abstract": "We consider regularized risk minimization in a large dictionary of Reproducing  kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation.  This setting, commonly referred to as Sparse Multiple Kernel Learning  (MKL), may be viewed as the non-parametric extension of group sparsity in linear  models. While the two dominant algorithmic strands of sparse learning, namely  convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP),  have both been rigorously extended for group sparsity, the sparse MKL literature  has so farmainly adopted the former withmild empirical success. In this paper, we  close this gap by proposing a Group-OMP based framework for sparse multiple  kernel learning. Unlike l1-MKL, our approach decouples the sparsity regularizer  (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms)  which leads to better empirical performance as well as a simpler optimization  procedure that only requires a black-box single-kernel solver. The algorithmic  development and empirical studies are complemented by theoretical analyses in  terms of Rademacher generalization bounds and sparse recovery conditions analogous  to those for OMP [27] and Group-OMP [16].", "full_text": "Non-parametric Group Orthogonal Matching Pursuit\n\nfor Sparse Learning with Multiple Kernels\n\nVikas Sindhwani and Aur\u00b4elie C. Lozano\n\nIBM T.J. Watson Research Center\n\nYorktown Heights, NY 10598\n\n{vsindhw,aclozano}@us.ibm.com\n\nAbstract\n\nWe consider regularized risk minimization in a large dictionary of Reproducing\nkernel Hilbert Spaces (RKHSs) over which the target function has a sparse repre-\nsentation. 
This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using the l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former, with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1-MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16].

1 Introduction

Kernel methods are widely used to address a variety of learning problems including classification, regression, structured prediction, data fusion, clustering and dimensionality reduction [22, 23]. However, choosing an appropriate kernel and tuning the corresponding hyper-parameters can be highly challenging, especially when little is known about the task at hand. In addition, many modern problems involve multiple heterogeneous data sources (e.g., gene functional classification, prediction of protein-protein interactions), each necessitating the use of a different kernel. This strongly suggests avoiding the risks and limitations of single-kernel selection by considering flexible combinations of multiple kernels. Furthermore, it is appealing to impose sparsity to discard noisy data sources. As several papers have provided evidence in favor of using multiple kernels (e.g., 
[19, 14, 7]), the multiple kernel learning (MKL) problem has generated a large body of recent work [13, 5, 24, 33], and has become the focal point of the intersection between non-parametric function estimation and sparse learning methods traditionally explored in linear settings.

Given a convex loss function, the MKL problem is usually formulated as the minimization of empirical risk together with a mixed-norm regularizer, e.g., the square of the sum of individual RKHS norms, or variants thereof, that have a close relationship to the Group Lasso criterion [30, 2]. Equivalently, this formulation may be viewed as the simultaneous optimization of both the non-negative convex combination of kernels and the prediction functions induced by this combined kernel. In constraining the combination of kernels, the l1 penalty is of particular interest, as it encourages sparsity in the supporting kernels, which is highly desirable when the number of kernels considered is large. The MKL literature has rapidly evolved along two directions: one concerns the scalability of optimization algorithms, beyond the early pioneering proposals based on semi-definite programming or second-order cone programming [13, 5] to simpler and more efficient alternating optimization schemes [20, 29, 24]; the other concerns the use of l_p norms [10, 29] to construct complex non-sparse kernel combinations, with the goal of outperforming 1-norm MKL which, as reported in several papers, has demonstrated only mild success in practical applications.

The class of Orthogonal Matching Pursuit techniques has recently received considerable attention as a competitive alternative to Lasso. 
The basic OMP algorithm originates from the signal-processing community and is similar to forward greedy feature selection, except that it performs re-estimation of the model parameters in each iteration, which has been shown to contribute to improved accuracy. For linear models, strong theoretical performance guarantees and empirical support have been provided for OMP [31] and for its extension to variable group selection, Group-OMP [16]. In particular, it was shown in [25, 9] that OMP and Lasso exhibit competitive theoretical performance guarantees. It is therefore desirable to investigate the use of Matching Pursuit techniques in the MKL framework, and whether one may be able to improve upon existing MKL methods.

Our contributions in this paper are as follows. We propose a non-parametric kernel-based extension to Group-OMP [16]. In terms of the feature-space (as opposed to function-space) perspective of kernel methods, this allows Group-OMP to handle groups that can potentially contain infinitely many features. By adding regularization in Group-OMP, we allow it to handle settings where the sample size might be smaller than the number of features in any group. Rather than imposing a mixed l1/RKHS-norm regularizer as in group-Lasso based MKL, a Group-OMP based approach allows us to consider the exact sparse kernel selection problem via l0 regularization instead. Note that, in contrast to the group-Lasso penalty, the l0 penalty by itself has no effect on the smoothness of each individual component. This allows for a clear decoupling between the role of the smoothness regularizer (namely, an RKHS regularizer) and the sparsity regularizer (via the l0 penalty). Our greedy algorithms allow for simple and flexible optimization schemes that only require a black-box solver for standard learning algorithms. In this paper, we focus on multiple kernel learning with Regularized Least Squares (RLS). 
We provide a bound on the Rademacher complexity of the hypothesis sets considered by our formulation. We derive conditions analogous to those for OMP [27] and Group-OMP [16] that guarantee the "correctness" of kernel selection. We close the paper with empirical studies on simulated and real-world datasets that confirm the value of our methods.

2 Learning Over an RKHS Dictionary

In this section, we set up some notation and give a brief background before introducing our main objective function and describing our algorithm in the next section. Let H_1, ..., H_N be a collection of Reproducing Kernel Hilbert Spaces with associated kernel functions k_1, ..., k_N defined on the input space X ⊂ R^d. Let H denote the sum space of functions,

H = H_1 ⊕ H_2 ⊕ ... ⊕ H_N = { f : X → R | f(x) = Σ_{j=1}^N f_j(x), x ∈ X, f_j ∈ H_j, j = 1...N }

Let us equip this space with the following l_p norms,

‖f‖_{l_p(H)} = inf { ( Σ_{j=1}^N ‖f_j‖^p_{H_j} )^{1/p} : f(x) = Σ_{j=1}^N f_j(x), x ∈ X, f_j ∈ H_j, j = 1...N }   (1)

It is now natural to consider a regularized risk minimization problem over such an RKHS dictionary, given a collection of training examples {x_i, y_i}_{i=1}^l,

argmin_{f ∈ H}  (1/l) Σ_{i=1}^l V(y_i, f(x_i)) + λ ‖f‖²_{l_p(H)}   (2)

where V(·,·) is a convex loss function, such as the squared loss in the Regularized Least Squares (RLS) algorithm or the hinge loss in the SVM method. If this problem again has elements of an RKHS structure, then, via the Representer Theorem, it can again be reduced to a finite-dimensional problem and efficiently solved.

Let q = p/(2−p), and let us define the q-convex hull of the set of kernel functions to be the following,

co_q(k_1 ... k_N) = { k_γ : X × X → R | k_γ(x, z) = Σ_{j=1}^N γ_j k_j(x, z), Σ_{j=1}^N γ_j^q = 1, γ_j ≥ 0 }

where γ ∈ R^N. It is easy to see that the non-negative combination of kernels, k_γ, is itself a valid kernel with an associated RKHS H_{k_γ}. With this definition, [17] show the following:

‖f‖_{l_p(H)} = inf_γ { ‖f‖_{H_{k_γ}} : k_γ ∈ co_q(k_1 ... k_N) }   (3)

This relationship connects Tikhonov regularization with l_p norms over H to regularization over RKHSs parameterized by the kernel functions k_γ. This leads to a large family of "multiple kernel learning" algorithms (whose variants are also sometimes referred to as l_q-MKL), where the basic idea is to solve an equivalent problem,

argmin_{f ∈ H_{k_γ}, γ ∈ Δ_q}  (1/l) Σ_{i=1}^l V(y_i, f(x_i)) + λ ‖f‖²_{H_{k_γ}}   (4)

where Δ_q = { γ ∈ R^N : ‖γ‖_q = 1, γ_j ≥ 0, j = 1...N }. For a fixed γ, the optimization over f ∈ H_{k_γ} is recognizable as an RKHS problem, for which a standard black-box solver may be used. The weights γ may then be optimized in an alternating minimization scheme, although several other optimization procedures can also be used (see, e.g., [4]). The case p = 1 is of particular interest in the setting where the size of the RKHS dictionary is large but the unknown target function can be approximated in a much smaller number of RKHSs. This leads to a large family of sparse multiple kernel learning algorithms that have a strong connection to the Group Lasso [2, 20, 29].

3 Multiple Kernel Learning with Group Orthogonal Matching Pursuit

Let us recall the l0 pseudo-norm, which is the cardinality of the sparsest representation of f in the dictionary: ‖f‖_{l_0(H)} = min{ |J| : f = Σ_{j∈J} f_j }. 
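For concreteness, the alternating scheme behind problem (4) can be sketched in a few lines for the l1 case (q = 1) with squared loss: for fixed mixing weights γ, the inner problem is a single-kernel RLS solve with K_γ = Σ_j γ_j K_j, and γ is then refit in closed form (γ_j ∝ ‖f_j‖_{H_j} on the l1 simplex). This is a minimal numpy sketch under our own naming and assumptions, not the solvers of [20, 29]:

```python
import numpy as np

def l1_mkl_rls(Ks, y, lam, n_iters=50):
    """Alternating minimization sketch for l1-MKL with squared loss.

    For fixed gamma, alpha solves the single-kernel RLS system with
    K_gamma = sum_j gamma_j K_j (the black-box inner step). Gamma is then
    updated in closed form: gamma_j proportional to ||f_j||_{H_j}, where
    ||f_j||_{H_j} = gamma_j * sqrt(alpha' K_j alpha), which minimizes
    sum_j ||f_j||^2 / gamma_j over the l1 simplex.
    """
    N, l = len(Ks), len(y)
    gamma = np.full(N, 1.0 / N)
    alpha = np.zeros(l)
    for _ in range(n_iters):
        K_gamma = sum(g * K for g, K in zip(gamma, Ks))
        alpha = np.linalg.solve(K_gamma + lam * l * np.eye(l), y)  # RLS black box
        norms = np.array([g * np.sqrt(max(alpha @ K @ alpha, 0.0))
                          for g, K in zip(gamma, Ks)])
        if norms.sum() == 0.0:
            break
        gamma = norms / norms.sum()  # closed-form gamma update on the simplex
    return gamma, alpha
```

In this sketch, the γ update drives the weights of uninformative kernels toward zero, which is the sparsity-inducing behavior of the p = 1 case discussed above.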
We now pose the following exact sparse kernel selection problem,

argmin_{f ∈ H}  (1/l) Σ_{i=1}^l V(y_i, f(x_i)) + λ ‖f‖²_{l_2(H)}   subject to   ‖f‖_{l_0(H)} ≤ s   (5)

It is important to note the following: when using a dictionary of universal kernels, e.g., Gaussian kernels with different bandwidths, the presence of the regularization term ‖f‖²_{l_2(H)} is critical (i.e., λ > 0), since otherwise the labeled data can be perfectly fit by any single kernel. In other words, the kernel selection problem is ill-posed. While conceptually simple, our formulation is quite different from those proposed earlier, since the role of a smoothness regularizer (via the ‖f‖²_{l_2(H)} penalty) is decoupled from the role of a sparsity regularizer (via the constraint ‖f‖_{l_0(H)} ≤ s). Moreover, the latter is imposed directly, as opposed to through a p = 1 penalty, making the spirit of our approach closer to Group Orthogonal Matching Pursuit (Group-OMP [16]), where the groups are formed by the very high-dimensional (infinite for Gaussian kernels) feature spaces associated with the kernels. It has been observed in recent work [10, 29] on l1-MKL that sparsity alone does not lead to improvements in real-world empirical tasks, and hence several methods have been proposed that explore l_q-norm MKL with q > 1 in Eqn. (4), making MKL depart from sparsity in kernel combinations. By contrast, we note that as q → ∞, p → 2. Our approach gives a direct knob on both smoothness (via λ) and sparsity (via s), with a solution path along these dimensions that differs from the one offered by Group-Lasso based l_q-MKL as q is varied. By combining the l0 pseudo-norm with RKHS norms, our method is conceptually reminiscent of the elastic net [32] (also see [26, 12, 21]). 
If the kernels arise from different subsets of input variables, our approach is also related to sparse additive models [18].

Our algorithm, MKL-GOMP, is outlined below for regularized least squares. Extensions for other loss functions, e.g., the hinge loss for SVMs, can be similarly derived. In the description of the algorithm, our notation is as follows. For any function f belonging to an RKHS F_k with kernel function k(·,·), we denote the regularized objective function as R_λ(f, y) = (1/l) Σ_{i=1}^l (y_i − f(x_i))² + λ‖f‖²_{F_k}, where ‖·‖_F denotes the RKHS norm. Recall that the minimizer f⋆ = argmin_{f∈F} R_λ(f, y) is given by solving the linear system α = (K + λlI)^{-1} y, where K is the Gram matrix of the kernel on the labeled data, and by setting f⋆(x) = Σ_{i=1}^l α_i k(x, x_i). Moreover, the objective value achieved by the minimizer is R_λ(f⋆, y) = λ yᵀ(K + λlI)^{-1} y. Note that MKL-GOMP should not be confused with Kernel Matching Pursuit [28], whose goal is different: it is designed to sparsify α in a single-kernel setting. The MKL-GOMP procedure iteratively expands the hypothesis space, H_{G(1)} ⊆ H_{G(2)} ⊆ ... ⊆ H_{G(i)}, by greedily selecting kernels from a given dictionary, where G(i) ⊂ {1...N} is a subset of indices and H_G = ⊕_{j∈G} H_j. Note that each H_G is an RKHS with kernel Σ_{j∈G} k_j (see Section 6 in [1]). The selection criterion is the best improvement, I(f^(i), H_j), offered by a new hypothesis space H_j in reducing the norm of the current residual r^(i) = y − f^(i), where f^(i) = [f^(i)(x_1) ... f^(i)(x_l)]ᵀ, by finding the best regularized (smooth) approximation. Since min_{g∈H_j} R_λ(g, r) ≤ R_λ(0, r) = ‖r‖²₂/l ≤ ‖r‖²₂, the value of the improvement function,

I(f^(i), H_j) = ‖r^(i)‖²₂ − min_{g∈H_j} R_λ(g, r^(i))

is always non-negative. 
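The closed form used in the selection step below follows from min_{g∈H_j} R_λ(g, r) = λ rᵀ(K_j + λlI)^{-1} r, so that I(f^(i), H_j) = r^(i)ᵀ(I − λ(K_j + λlI)^{-1}) r^(i). This identity is easy to check numerically; a small sketch with our own naming:

```python
import numpy as np

def improvement(K, r, lam):
    """I(f, H_j) = r' (I - lam (K + lam*l*I)^{-1}) r  (closed form)."""
    l = len(r)
    return r @ (r - lam * np.linalg.solve(K + lam * l * np.eye(l), r))

def improvement_direct(K, r, lam):
    """The same quantity via the definition ||r||^2 - min_g R_lam(g, r),
    evaluating R_lam at the RLS minimizer alpha = (K + lam*l*I)^{-1} r."""
    l = len(r)
    alpha = np.linalg.solve(K + lam * l * np.eye(l), r)
    fit = K @ alpha
    obj = np.sum((r - fit) ** 2) / l + lam * alpha @ K @ alpha  # R_lam(g*, r)
    return r @ r - obj
```

Both routes agree, and the value is non-negative for any positive semi-definite Gram matrix, as claimed above.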
Once a kernel is selected, the function is re-estimated by learning in the enlarged space H_{G(i+1)}. Note that since H_G is an RKHS whose kernel function is the sum Σ_{j∈G} k_j, we can use a simple RLS linear-system solver for refitting. Unlike group-Lasso based MKL, we do not need an iterative kernel reweighting step, which essentially arises as a mechanism to transform the less convenient group-sparsity norms into reweighted squared RKHS norms. MKL-GOMP converges when the best improvement is no better than ε.

Input: data matrix X = [x_1 ... x_l]ᵀ, label vector y ∈ R^l, kernel dictionary {k_j(·,·)}_{j=1}^N, precision ε > 0
Output: selected kernels G(i) and a function f^(i) ∈ H_{G(i)}
Initialization: G(0) = ∅, f^(0) = 0, residual r^(0) = y
for i = 0, 1, 2, ...
  1. Kernel selection: for all j ∉ G(i), set
     I(f^(i), H_j) = ‖r^(i)‖²₂ − min_{g∈H_j} R_λ(g, r^(i)) = r^(i)ᵀ (I − λ(K_j + λlI)^{-1}) r^(i).
     Pick j(i) = argmax_{j∉G(i)} I(f^(i), H_j).
  2. Convergence check: if I(f^(i), H_{j(i)}) ≤ ε, break.
  3. Refitting: set G(i+1) = G(i) ∪ {j(i)}, and set f^(i+1)(x) = Σ_{m=1}^l α_m k(x, x_m),
     where k = Σ_{j∈G(i+1)} k_j and α = ( Σ_{j∈G(i+1)} K_j + λlI )^{-1} y.
  4. Residual update: r^(i+1) = y − f^(i+1), where f^(i+1) = [f^(i+1)(x_1) ... f^(i+1)(x_l)]ᵀ.
end

Remarks: Our algorithm can be applied to multivariate problems with group structure among the outputs, similar to Multivariate Group-OMP [15]. In particular, in our experiments on multiclass datasets, we treat all outputs as a single group and evaluate each kernel for selection based on how well the total residual is reduced across all outputs simultaneously. Kernel matrices are normalized to unit trace, or to have uniform variance of data points in their associated feature spaces, as in [10, 33]. 
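Putting the four steps together, the whole MKL-GOMP loop for RLS fits in a few lines of numpy. A sketch with our own naming, omitting the efficiency tricks discussed next:

```python
import numpy as np

def mkl_gomp(Ks, y, lam, eps=1e-6, max_kernels=None):
    """MKL-GOMP for regularized least squares, as outlined above.

    Greedily adds the kernel whose regularized fit best reduces the
    current residual, then refits RLS on the sum of selected kernels.
    Returns the selected indices and the coefficient vector alpha.
    """
    l = len(y)
    G, r, alpha = [], y.copy(), None
    max_kernels = max_kernels or len(Ks)
    while len(G) < max_kernels:
        # 1. Kernel selection: best improvement I(f, H_j) over j not in G
        best_j, best_I = None, -np.inf
        for j in range(len(Ks)):
            if j in G:
                continue
            I_j = r @ (r - lam * np.linalg.solve(Ks[j] + lam * l * np.eye(l), r))
            if I_j > best_I:
                best_j, best_I = j, I_j
        # 2. Convergence check
        if best_I <= eps:
            break
        # 3. Refitting on the sum kernel of the enlarged support
        G.append(best_j)
        K_sum = sum(Ks[j] for j in G)
        alpha = np.linalg.solve(K_sum + lam * l * np.eye(l), y)
        # 4. Residual update
        r = y - K_sum @ alpha
    return G, alpha
```

Only the generic linear-system solve is kernel-specific, illustrating that a black-box single-kernel RLS solver is all the procedure requires.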
In practice, we can also monitor the error on a validation set to decide the optimal degree of sparsity. For efficiency, we can precompute the matrices Q_j = (I − λ(K_j + λlI)^{-1})^{1/2}, so that I(f^(i), H_j) = ‖Q_j r^(i)‖²₂ can be evaluated very quickly at selection time, and/or reduce the search space by considering a random subsample of the dictionary.

4 Theoretical Analysis

Our analysis is composed of two parts. In the first part, we establish generalization bounds for the hypothesis spaces considered by our formulation, based on the notion of Rademacher complexity. The second component of our theoretical analysis consists of deriving conditions under which MKL-GOMP can recover good solutions. While the first part can be seen as characterizing the "statistical convergence" of our method, the second part characterizes its "numerical convergence" as an optimization method, and is required to complement the first part. This is because matching pursuit methods can be deemed to solve an exact sparse problem approximately, while regularized methods (e.g., l1-norm MKL) solve an approximate problem exactly. We therefore need to show that MKL-GOMP recovers a solution that is close to an optimum of the exact sparse problem.

4.1 Rademacher Bounds

Theorem 1. Consider the hypothesis space of sufficiently sparse and smooth functions¹,

H_{τ,s} = { f ∈ H : ‖f‖²_{l_2(H)} ≤ τ, ‖f‖_{l_0(H)} ≤ s }

Let δ ∈ (0, 1) and κ = sup_{x∈X, j=1...N} k_j(x, x). Let ρ be any probability distribution on (x, y) ∈ X × R satisfying |y| ≤ M almost surely, and let {x_i, y_i}_{i=1}^l be randomly sampled according to ρ. Define f̂ = argmin_{f∈H_{τ,s}} (1/l) Σ_{i=1}^l (y_i − f(x_i))² to be the empirical risk minimizer, and f⋆ = argmin_{f∈H_{τ,s}} R(f) to be the true risk minimizer in H_{τ,s}, where R(f) = E_{(x,y)∼ρ} (y − f(x))² denotes the true risk. Then, with probability at least 1 − δ over random draws of samples of size l,

R(f̂) ≤ R(f⋆) + 8L √(sκτ / l) + 4L² √( log(3/δ) / (2l) )   (6)

where ‖y − f‖_∞ ≤ L = (M + √(sκτ)).

¹ Note that Tikhonov regularization using a penalty term λ‖·‖², and Ivanov regularization using a ball constraint ‖·‖² ≤ τ, return identical solutions for some one-to-one correspondence between λ and τ.

The proof is given in the supplementary material, but can also be reasoned as follows. In the standard single-RKHS case, the Rademacher complexity can be upper bounded by a quantity proportional to the square root of the trace of the Gram matrix, which is further upper bounded by √(lκ). In our case, any collection of s-sparse functions from a dictionary of N RKHSs reduces to a single RKHS whose kernel is the sum of s base kernels, and hence the corresponding trace can be bounded by √(lsκ) for all possible subsets of size s. Once it is established that the empirical Rademacher complexity of H_{τ,s} is upper bounded by √(sκτ/l), the generalization bound follows from well-known results [6] tailored to regularized least squares regression with a bounded target variable.

For l1-norm MKL, in the context of margin-based loss functions, Cortes et al., 2010 [8] bound the Rademacher complexity as √(ce⌈log(N)⌉κτ / l), where ⌈·⌉ is the ceiling function that rounds to the next integer, e is the exponential, and c = 23/22. Using VC-based lower-bound arguments, they point out that the √(log(N)) dependence on N is essentially optimal. By contrast, our greedy approach with sequential regularized risk minimization imposes direct control over the degree of sparsity as well as smoothness, and hence the Rademacher complexity in our case is independent of N. If s = O(log N), the bounds are similar. A critical difference between l1-norm MKL and sparse greedy approximations, however, is that the former is convex, and hence the empirical risk can be minimized exactly in the hypothesis space whose complexity is bounded by the Rademacher analysis. This is not true in our case, and therefore, to complement the Rademacher analysis, we need conditions under which good solutions can be recovered.

4.2 Exact Recovery Conditions in Noiseless Settings

We now assume that the regression function f_ρ(x) = ∫ y dρ(y|x) is sparse, i.e., f_ρ ∈ H_{G_good} for some subset G_good of s "good" kernels, and that it is sufficiently smooth in the sense that, for some λ > 0 and given sufficient samples, the empirical minimizer f̂ = argmin_{f∈H_{G_good}} R_λ(f, y) gives near-optimal generalization as per Theorem 1. In this section, our main concern is to characterize Group-OMP like conditions under which MKL-GOMP will be able to learn f̂ by recovering the support G_good exactly.

Let us denote by r^(i) = f̂ − f^(i) the residual function at step i of the algorithm. Initially, r^(0) = f̂ ∈ H_{G_good}. Our argument is inductive: if at any step i, r^(i) ∈ H_{G_good} and we can always guarantee that max_{j∈G_good} I(f^(i), H_j) > max_{j∉G_good} I(f^(i), H_j), i.e., a good kernel offers better greedy improvement, then it is clear that the algorithm correctly expands the hypothesis space and never makes a mistake. Without loss of generality, let us rearrange the dictionary so that G_good = {1...s}. For any function f ∈ H_{G_good}, we now wish to derive an upper bound of the form

‖( I(f, H_{s+1}) ... I(f, H_N) )‖_∞ / ‖( I(f, H_1) ... I(f, H_s) )‖_∞ ≤ µ_H(G_good)²   (7)

Clearly, a sufficient condition for exact recovery is µ_H(G_good) < 1.

We need some notation to state our main result. Let s = |G_good|, i.e., the number of good kernels. For any matrix A ∈ R^{ls × l(N−s)}, let ‖A‖_(2,1) denote the matrix norm induced by the following vector norms: for any vector u = [u_1 ... u_s] ∈ R^{ls}, define ‖u‖_(2,1) = Σ_{i=1}^s ‖u_i‖₂; and similarly, for any vector v = [v_1 ... v_{N−s}] ∈ R^{l(N−s)}, define ‖v‖_(2,1) = Σ_{i=1}^{N−s} ‖v_i‖₂. Then ‖A‖_(2,1) = sup_{v∈R^{l(N−s)}} ‖Av‖_(2,1) / ‖v‖_(2,1). We can now state the following:

Theorem 2. Given the kernel dictionary {k_j(·,·)}_{j=1}^N with associated Gram matrices {K_j}_{j=1}^N over the labeled data, MKL-GOMP correctly recovers the good kernels, i.e., G(s) = G_good, if

µ_H(G_good) = ‖C_{λ,H}(G_good)‖_(2,1) < 1

where C_{λ,H}(G_good) ∈ R^{ls × l(N−s)} is a coherence matrix whose (i, j)-th block of size l × l, i ∈ G_good, j ∉ G_good, is given by

C_{λ,H}(G_good)_{i,j} = K_{G_good} Q_i ( Σ_{k∈G_good} Q_k K²_{G_good} Q_k )^{-1} Q_j K_{G_good}   (8)

where K_{G_good} = Σ_{j∈G_good} K_j and Q_j = (I − λ(K_j + λlI)^{-1})^{1/2}, j = 1...N.

The proof is given in the supplementary material. This result is analogous to sparse recovery conditions for OMP and l1 methods and their (linear) group counterparts. In the noiseless setting, Tropp [27] gives an exact recovery condition of the form ‖X†_good X_bad‖₁ < 1, where X_good and X_bad refer to the restriction of the data matrix to the good and bad features, and ‖·‖₁ refers to the l1-induced matrix norm. Intriguingly, the same paper shows that this condition is also sufficient for the Basis Pursuit l1 minimization problem. 
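The matrices Q_j = (I − λ(K_j + λlI)^{-1})^{1/2} used in Theorem 2 (and in the efficiency remark of Section 3) are well defined: the inner matrix is positive semi-definite, with eigenvalue (µ + λ(l−1))/(µ + λl) for each eigenvalue µ ≥ 0 of K_j, and ‖Q_j r‖² recovers the selection score. A quick numerical check of both facts (our own naming):

```python
import numpy as np

def q_matrix(K, lam):
    """Q = (I - lam (K + lam*l*I)^{-1})^{1/2} via an eigendecomposition.

    The inner matrix is PSD: its eigenvalues are (mu + lam*(l-1))/(mu + lam*l)
    for eigenvalues mu >= 0 of K, so the symmetric square root exists.
    """
    l = K.shape[0]
    M = np.eye(l) - lam * np.linalg.inv(K + lam * l * np.eye(l))
    w, V = np.linalg.eigh((M + M.T) / 2)  # symmetrize for numerical safety
    assert w.min() > -1e-10  # PSD, up to round-off
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
```

With Q in hand, the selection score of Section 3 becomes a plain squared norm, ‖Q r‖², which is what makes precomputing the Q_j attractive.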
For Group-OMP [16], the condition generalizes to involve a group-sensitive matrix norm on the same matrix objects. Likewise, Bach [2] generalizes the Lasso variable-selection consistency conditions to apply to the Group Lasso, and then further to non-parametric l1-MKL. The above result is similar in spirit. A stronger sufficient condition can be derived by requiring ‖Q_j K_{G_good}‖₂ to be sufficiently small for all j ∉ G_good. Intuitively, this means that smooth functions in H_{G_good} cannot be well approximated by smooth functions induced by the "bad" kernels, so that MKL-GOMP is never led to making a mistake.

5 Empirical Studies

We report empirical results on a collection of simulated datasets and 3 classification problems from computational cell biology. In all experiments, as in [10, 33], candidate kernels are normalized multiplicatively to have uniform variance of data points in their associated feature spaces.

5.1 Adaptability to Data Sparsity - Simulated Setting

We adapt the experimental setting proposed by [10], where the sparsity of the target function is explicitly controlled and the optimal subset of kernels is varied from requiring the entire dictionary to requiring a single kernel. Our goal is to study the solution paths offered by MKL-GOMP in comparison to l_q-norm MKL. For consistency, we use squared loss in all experiments². 
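The multiplicative normalization to uniform feature-space variance mentioned above can be sketched as follows: the variance of the data points in the feature space of k is (1/l) tr(K) − (1/l²) Σ_{ij} K_{ij}, so dividing K by this quantity sets it to one. (A sketch of one standard recipe; the exact normalization used in [10, 33] may differ in details.)

```python
import numpy as np

def normalize_kernel(K):
    """Scale K so the feature-space variance of the data equals one:
    var = (1/l) tr(K) - (1/l^2) sum_ij K_ij."""
    l = K.shape[0]
    var = np.trace(K) / l - K.sum() / l ** 2
    return K / var
```

Such a rescaling puts heterogeneous kernels on a comparable footing before any selection takes place.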
We implemented l_q-norm MKL for regularized least squares (RLS) using an alternating minimization scheme adapted from [17, 29].

² l_q-MKL with the SVM hinge loss behaves similarly.

[Figure 1: Simulated Setting: Adaptability to Data Sparsity. Left: test error versus v(θ), the fraction of noise kernels (0% to 98%), for 1-norm, 4/3-norm, 2-norm, 4-norm and ∞-norm MKL (= RLS), MKL-GOMP, and the Bayes error. Middle: sparsity (% of kernels selected) versus v(θ). Right: smoothness (value of λ) versus v(θ).]

Different binary classification datasets³ with 50 labeled examples are randomly generated by sampling the two classes from 50-dimensional isotropic Gaussian distributions with equal covariance matrices (identity) and equal but opposite means, µ1 = 1.75 θ/‖θ‖ and µ2 = −µ1, where θ is a binary vector encoding the true underlying sparsity. The fraction of zero components in θ is a measure of the feature sparsity of the learning problem. For each dataset, a linear kernel (normalized as in [10]) is generated from each feature, and the resulting dictionary is input to MKL-GOMP and l_q-norm MKL. For each level of sparsity, a training set of size 50 and validation and test sets of size 10000 are generated 10 times, and average classification errors are reported. For each run, the validation error is monitored as kernel selection progresses in MKL-GOMP, and the number of kernels with smallest validation error is chosen. 
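The toy generator just described can be reproduced roughly as follows (a sketch with our own naming; the exact generator of [10], available at the URL in the footnote, may differ in details):

```python
import numpy as np

def make_sparse_toy(l=50, d=50, n_relevant=10, seed=0):
    """Two Gaussian classes in R^d with means +/- 1.75 * theta/||theta||,
    identity covariance, where theta is a binary vector with n_relevant
    ones. Returns X, labels y, and a dictionary of one linear kernel
    per feature."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    theta[:n_relevant] = 1.0
    mu = 1.75 * theta / np.linalg.norm(theta)
    y = rng.choice([-1.0, 1.0], size=l)
    X = rng.standard_normal((l, d)) + np.outer(y, mu)  # class mean +/- mu
    Ks = [np.outer(X[:, j], X[:, j]) for j in range(d)]  # per-feature linear kernels
    return X, y, Ks
```

Varying `n_relevant` from d down to 1 sweeps the feature sparsity of the problem, which is the quantity on the horizontal axis of Figure 1.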
The regularization parameters for both MKL-GOMP and l_q-norm MKL are similarly chosen using the validation set. Figure 1 shows test error rates as a function of the sparsity of the target function: from non-sparse (all kernels needed) to extremely sparse (only 1 kernel needed). We recover the observations also made in [10]: l1-norm MKL excels in extremely sparse settings, where a single kernel carries the whole discriminative information of the learning problem. However, in the other scenarios it mostly performs worse than the q > 1 variants, despite the fact that the vector θ remains sparse in all but the uniform scenario. As q is increased, the error rate in these settings improves, but deteriorates in sparse settings. As reported in [11], the elastic net MKL approach of [26] performs similarly to l1-MKL in the hinge-loss case. As can be seen in the figure, the error curve of MKL-GOMP tends to lie below the lower envelope of the error rates given by the l_q-MKL solutions. To adapt to the sparsity of the problem, l_q methods clearly need to tune q, requiring several fresh invocations of the appropriate l_q-MKL solver. On the other hand, in MKL-GOMP the hypothesis space grows as a function of the iteration number, and the solution trajectory naturally expands sequentially in the direction of decreasing sparsity. The middle and right plots in Figure 1 show the number of kernels selected by MKL-GOMP and the optimal value of λ, suggesting that MKL-GOMP adapts to the sparsity and smoothness of the learning problem.

5.2 Protein Subcellular Localization

The multiclass generalization of l1-MKL proposed in [33] (MCMKL) is state-of-the-art methodology in predicting protein subcellular localization, an important cell biology problem that concerns the estimation of where a protein resides in a cell so that, for example, the identification of drug targets can be aided. 
We use three multiclass datasets, PSORT+, PSORT- and PLANT, provided by the authors of [33] at http://www.fml.tuebingen.mpg.de/raetsch/suppl/protsubloc, together with a dictionary of 69 kernels derived with biological insight: 2 kernels on phylogenetic trees, 3 kernels based on similarity to known proteins (BLAST E-values), and 64 kernels based on amino-acid sequence patterns. The statistics of the three datasets are as follows: PSORT+ has 541 proteins labeled with 4 location classes, PSORT- has 1444 proteins in 5 classes, and PLANT is a 4-class problem with 940 proteins.

³ Provided by the authors of [10] at mldata.org/repository/data/viewslug/mkl-toy/

[Figure 2: Protein Subcellular Localization Results. Performance (higher is better) of MKL-GOMP, MCMKL, the sum-kernel and best-single-kernel baselines, and other published systems on PSORT+, PSORT- and PLANT.]

For each dataset, results are averaged over 10 splits of the dataset into training and test sets. We used exactly the same experimental protocol, data splits and evaluation methodology as given in [33]: the hyper-parameters of MKL-GOMP (sparsity and the regularization parameter λ) were tuned by 3-fold cross-validation; results on PSORT+ and PSORT- are F1-scores averaged over the classes, while those on PLANT are Matthews correlation coefficients⁴. Figure 2 compares MKL-GOMP against MCMKL, against baselines such as using the sum of all the kernels and using the best single kernel, and against results from other prediction systems proposed in the literature. As can be seen, MKL-GOMP slightly outperforms MCMKL on the PSORT+ and PSORT- datasets, and is slightly worse on PLANT, where RLS with the sum of all the kernels also performs very well. On the two PSORT datasets, [33] report selecting 25 kernels using MCMKL. On the other hand, on average, MKL-GOMP selects 14 kernels on PSORT+, 15 on PSORT-, and 24 kernels on PLANT. 
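In these multiclass experiments, MKL-GOMP runs in the multivariate mode described in the remarks of Section 3: each candidate kernel is scored by the total residual reduction summed over all output columns, i.e., with R the l × c matrix of per-class residuals, the score is tr(Rᵀ(I − λ(K_j + λlI)^{-1})R). A sketch with our own naming:

```python
import numpy as np

def multivariate_improvement(K, R, lam):
    """Selection score for a multi-output residual matrix R (l x c):
    the single-output improvement summed over the output columns,
    tr(R' (I - lam (K + lam*l*I)^{-1}) R)."""
    l = R.shape[0]
    M = np.eye(l) - lam * np.linalg.inv(K + lam * l * np.eye(l))
    return np.trace(R.T @ M @ R)
```

Because the score is a plain sum over columns, the multivariate mode reuses the single-output machinery unchanged; only the argmax over kernels sees the aggregated quantity.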
Note that MKL-GOMP is applied in multivariate mode: the kernels are selected based on their utility in reducing the total residual error across all target classes.

6 Conclusion

By proposing a Group-OMP based framework for sparse multiple kernel learning, theoretically analyzing the performance of the resulting methods in relation to the dominant convex relaxation-based approach, and demonstrating the value of our framework through extensive experimental studies, we believe greedy methods arise as a natural alternative for tackling MKL problems. Relevant directions for future research include extending our theoretical analysis to the stochastic setting, investigating complex multivariate structures and groupings over outputs, e.g., by generalizing the multivariate version of Group-OMP [15], and extending our algorithm to incorporate interesting structured kernel dictionaries [3].

⁴ See http://www.fml.tuebingen.mpg.de/raetsch/suppl/protsubloc/protsubloc-wabi08-supp.pdf

Acknowledgments

We thank Rick Lawrence, David S. Rosenberg and Ha Quang Minh for helpful conversations and support for this work.

References

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.
[2] F. Bach. Consistency of group lasso and multiple kernel learning. JMLR, 9:1179-1225, 2008.
[3] F. Bach. High-dimensional non-linear variable selection through hierarchical kernel learning. Technical report, HAL 00413473, 2009.
[4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Technical report, HAL 00413473, 2010.
[5] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[6] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
[7] A. 
Ben-Hur and W. S. Noble. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21, January 2005.

[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.

[9] A. K. Fletcher and S. Rangan. Orthogonal matching pursuit from noisy measurements: A new analysis. In NIPS, 2009.

[10] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. lp-norm multiple kernel learning. JMLR, 12:953–997, 2011.

[11] M. Kloft, U. Ruckert, and P. Bartlett. A unifying view of multiple kernel learning. In European Conference on Machine Learning (ECML), 2010.

[12] V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6):3660–3695, 2010.

[13] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, December 2004.

[14] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20, November 2004.

[15] A. C. Lozano and V. Sindhwani. Block variable selection in multivariate regression and high-dimensional causal inference. In NIPS, 2010.

[16] A. C. Lozano, G. Swirszcz, and N. Abe. Group orthogonal matching pursuit for variable selection and prediction. In NIPS, 2009.

[17] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. JMLR, 6:1099–1125, 2005.

[18] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.

[19] P. Pavlidis, J. Cai, J. Weston, and W. S. Noble. Learning gene functional classifications from multiple data types. Journal of Computational Biology, 9:401–411, 2002.

[20] A. Rakotomamonjy, F. Bach, S. Canu, and Y.
Grandvalet. SimpleMKL. JMLR, 9:2491–2521, 2008.

[21] G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Technical Report 795, Statistics Department, UC Berkeley, 2010.

[22] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[23] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[24] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. JMLR, 7, December 2006.

[25] T. Zhang. Sparse recovery with orthogonal matching pursuit under RIP. Computing Research Repository, 2010.

[26] R. Tomioka and T. Suzuki. Sparsity-accuracy tradeoff in MKL. In NIPS Workshop: Understanding Multiple Kernel Learning Methods. Technical report, arXiv:1001.2615v1, 2010.

[27] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

[28] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48:165–188, 2002.

[29] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu. Simple and efficient multiple kernel learning by group lasso. In ICML, 2010.

[30] M. Yuan, A. Ekici, Z. Lu, and R. Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B, 69(3):329–346, 2007.

[31] T. Zhang. On the consistency of feature selection using greedy least squares regression. JMLR, 10, June 2009.

[32] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.

[33] A. Zien and C. S. Ong.
Multiclass multiple kernel learning. In ICML, 2007.