{"title": "Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1198, "page_last": 1206, "abstract": "In this paper we consider the problem of learning an n x n Kernel matrix from m similarity matrices under general convex loss. Past research have extensively studied the m =1 case and have derived several algorithms which require sophisticated techniques like ACCP, SOCP, etc. The existing algorithms do not apply if one uses arbitrary losses and often can not handle m > 1 case. We present several provably convergent iterative algorithms, where each iteration requires either an SVM or a Multiple Kernel Learning (MKL) solver for m > 1 case. One of the major contributions of the paper is to extend the well known Mirror Descent(MD) framework to handle Cartesian product of psd matrices. This novel extension leads to an algorithm, called EMKL, which solves the problem in O(m^2 log n) iterations; in each iteration one solves an MKL involving m kernels and m eigen-decomposition of n x n matrices. By suitably defining a restriction on the objective function, a faster version of EMKL is proposed, called REKL, which avoids the eigen-decomposition. An alternative to both EMKL and REKL is also suggested which requires only an SVM solver. Experimental results on real world protein data set involving several similarity matrices illustrate the efficacy of the proposed algorithms.", "full_text": "Ef\ufb01cient algorithms for learning kernels from\n\nmultiple similarity matrices with general convex loss\n\nfunctions\n\nAchintya Kundu\n\nVikram Tankasali\n\nDept. of Computer Science & Automation,\n\nDept. of Computer Science & Automation,\n\nIndian Institute of Science, Bangalore.\nachintya@csa.iisc.ernet.in\n\nIndian Institute of Science, Bangalore.\nvikram@csa.iisc.ernet.in\n\nChiranjib Bhattacharyya\n\nDept. 
of Computer Science & Automation,
Indian Institute of Science, Bangalore.
chiru@csa.iisc.ernet.in

Aharon Ben-Tal
Faculty of Industrial Engg. & Management,
Technion - Israel Institute of Technology, Haifa.
Visiting Professor, CWI, Amsterdam
abental@ie.technion.ac.il

Abstract

In this paper we consider the problem of learning an n × n kernel matrix from m (≥ 1) similarity matrices under general convex loss. Past research has extensively studied the m = 1 case and has derived several algorithms which require sophisticated techniques like ACCP, SOCP, etc. The existing algorithms do not apply if one uses arbitrary losses and often cannot handle the m > 1 case. We present several provably convergent iterative algorithms, where each iteration requires either an SVM or a Multiple Kernel Learning (MKL) solver for the m > 1 case. One of the major contributions of the paper is to extend the well-known Mirror Descent (MD) framework to handle Cartesian products of psd matrices. This novel extension leads to an algorithm, called EMKL, which solves the problem in O(m² log n / ε²) iterations; in each iteration one solves an MKL problem involving m kernels and m eigendecompositions of n × n matrices. By suitably defining a restriction on the objective function, a faster version of EMKL is proposed, called REKL, which avoids the eigendecompositions. An alternative to both EMKL and REKL is also suggested which requires only an SVM solver. Experimental results on a real-world protein data set involving several similarity matrices illustrate the efficacy of the proposed algorithms.

1 Introduction

Learning procedures based on positive semidefinite (psd) kernel functions, like Support Vector Machines (SVMs), have emerged as powerful tools for several learning tasks with wide applicability [13].
In many applications it is relatively straightforward to define measures of similarity between any pair of examples but extremely difficult to design a kernel function for accurate classification. For instance, similarity scores between two protein sequences given by various measures like BLAST [1], Smith-Waterman [14], etc. are not psd, and hence cannot be substituted as kernels. In this paper, we consider the problem of learning an optimal kernel matrix, from multiple similarity matrices, that yields accurate classification.

Let the set of n × n real symmetric matrices be denoted as S^n and the set of psd matrices as S^n_+. Consider a binary classification problem with n training examples. Let y ∈ {+1, −1}^n be the vector of class labels and K ∈ S^n_+ be a kernel matrix. The SVM formulation [13] computes a performance measure ω(K) by solving

ω(K) = max_{α ∈ A} [ α^⊤1 − 0.5 α^⊤ Y K Y α ],   (1)

where A = {α ∈ R^n | α^⊤y = 0, 0 ≤ α ≤ C1}, Y = diag(y), 1 = [1 . . . 1]^⊤ ∈ R^n and C is a user-defined positive constant.

1.1 Background and Related work

To the best of our knowledge the problem of handling multiple similarity matrices and arbitrary convex losses has not been studied before. Existing literature has focused on only one similarity matrix and specific choices of loss function. In this section we briefly review the related literature.

The problem was first studied in [8] for a single similarity matrix. They introduced the following optimization problem

min_{K ∈ S^n_+} ω(K) + ρ ‖K − S‖²_F,   (2)

where S ∈ S^n is a similarity matrix, whose (i, j)-th element S(i, j) represents the similarity between example pair i, j, and ω(K) is defined in (1).
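The psd projection (·)_+ used in the closed-form solution (3) that follows, which clips the negative eigenvalues of a symmetric matrix to zero, can be sketched in a few lines of numpy (a minimal illustration; the helper name clip_psd is ours, not from the paper):

```python
import numpy as np

def clip_psd(X):
    """Project a symmetric matrix onto the psd cone by clipping its
    negative eigenvalues to zero: the (X)_+ operation in the text."""
    lam, V = np.linalg.eigh(X)               # X = V diag(lam) V^T
    return (V * np.maximum(lam, 0.0)) @ V.T

# An indefinite "similarity" matrix: its eigenvalues are 3 and -1.
S = np.array([[1.0, 2.0],
              [2.0, 1.0]])
K = clip_psd(S)                              # now usable as a kernel matrix
```

Since the projection only removes the negative part of the spectrum, it is idempotent: applying clip_psd to an already-psd matrix returns it unchanged.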
By interchanging the maximization over α and the minimization over K, the authors note that the inner minimization admits a closed-form solution:

K* = ( S + (4ρ)^{-1} (Yα)(Yα)^⊤ )_+ ,   (3)

where (X)_+ denotes the psd matrix obtained by clipping the negative eigenvalues of X to zero, i.e., if X = Σ_{i=1}^n λ_i v_i v_i^⊤ is the eigendecomposition of X ∈ S^n then (X)_+ = Σ_{i=1}^n max(λ_i, 0) v_i v_i^⊤. After plugging in the value of K*, the authors suggest using sophisticated techniques like the analytic center cutting plane (ACCP) method to solve the outer maximization in α.

The formulation (2) was studied further in [5], where an iterative algorithm based on solving a quadratically constrained linear program (QCLP) was proposed. In both the above approaches the order of maximization and minimization has been interchanged, which leads to optimization problems that can be posed as semi-infinite quadratically constrained linear programs (SIQCLP) [5]. In [6] an alternate loss function ‖K − S‖_F was studied, which led to a Second Order Cone Program (SOCP) formulation. The choice of ‖K − S‖²_F as a measure of loss is arbitrary. In this paper we generalize the setting in (2) by providing an algorithm which works for any convex loss function. We note that the method used in solving (2) utilizing (3) is specific to the loss function ‖K − S‖²_F and does not apply generally. Apart from the non-applicability of the existing methods to other loss functions, it is not clear how these procedures could be used to handle multiple similarity matrices.

Contributions: The key contribution of the paper is to design efficient procedures which can learn a kernel matrix K ∈ S^n_+ from m (≥ 1) similarity matrices, possibly indefinite, under a general convex sub-differentiable loss function by using either SVM or MKL solvers. We study the problem in
We study the problem in\ntwo different settings. In the \ufb01rst setting we consider learning a single kernel matrix from multiple\nsimilarity matrices under a general loss function. A novel algorithm, referred in the paper as ESKL,\nbased on the Mirror Descent (MD) [3] procedure is proposed. It is a provably convergent algorithm\nwhich requires O( log n\n\u01eb2 ) calls to an SVM solver. In the second setting we consider learning separate\nkernel matrix for each of the given similarity matrices and then aggregating them using a Multiple\nKernel Learning (MKL) setup. The resultant formulation is non-trivial to solve. We present EMKL\nwhich is based on generalizing the existing MD setup to deal with Cartesian product of psd matri-\nces. Like the previous case it requires O( m2 log n\n) calls to an MKL solver. At every iteration the\nalgorithm also requires eigen-decomposition which is expensive. We present a related algorithm,\nREKL, which does not require the expensive eigen-decomposition step but yields similar classi\ufb01ca-\ntion performance as EMKL. Apart from allowing general loss functions the procedures also opens\nup new avenues for learning multiple kernels, which could be viable alternatives to the framework\nproposed in [7].\n\n\u01eb2\n\nThe remainder of the paper is organized as follows: in section 2 we discuss problem formulation.\nOur main contribution is in section 3, where we develop mirror descent algorithms for learning\nkernel from multiple inde\ufb01nite similarity matrices. We also analyze the complexity and convergence\nproperties of the proposed algorithms. Finally we present our experimental results in section 4.\n\n2\n\n\f2 Problem formulation\n\nGiven multiple similarity matrices {Si : i = 1, . . . 
, m} we consider the following formulation

min_{K ∈ S^n_+, tr(K) = τ} f(K) ≡ [ ω(K) + ρ Σ_{i=1}^m L_i(K − S_i) ],   (4)

where ρ ≥ 0 is a trade-off parameter and L_i : S^n → R is a convex sub-differentiable loss function operating on K and S_i. We impose the trace constraint on K to ensure good generalization as in [7]. A more naturally suited formulation for handling multiple similarity matrices is to consider learning an individual kernel matrix K_i from similarity matrix S_i, ∀i, and invoke a Multiple Kernel Learning (MKL) setup to obtain a kernel matrix K ∈ K := { Σ_i β_i K_i | K_i ∈ S^n_+, β_i ≥ 0, ∀i }. In [7] the MKL problem is proposed as

Ω(K_1, . . . , K_m) = min_{K ∈ K, tr(K) = τ} ω(K),   (5)

where the kernels K_i are fixed and the β_i are variable. Based on MKL we consider the following kernel learning formulation

min_{K_1, ..., K_m} F(K_1, . . . , K_m) ≡ [ Ω(K_1, . . . , K_m) + ρ Σ_{i=1}^m L_i(K_i − S_i) ]
s.t. K_i ∈ S^n_+ , tr(K_i) = τ , i = 1, . . . , m.   (6)

Note that Ω(K_1, . . . , K_m) can be obtained by solving any standard MKL formulation.
The restriction of Ω(K_1, . . . , K_m) to the Cartesian product of the sets ⊗_{i=1}^m { K_i = Σ_j μ_ij v_ij v_ij^⊤ | μ_ij ≥ 0, v_ij is the j-th eigenvector of S_i } yields a very interesting alternative to (6). Based on this restriction we formulate the restricted kernel learning problem as

min_{μ_1, ..., μ_m ∈ R^n} g(μ_1, . . . , μ_m) ≡ [ Ω(K_1, . . . , K_m) + ρ Σ_{i=1}^m ℓ_i(μ_i, λ_i) ]
s.t. K_i = Σ_j μ_ij v_ij v_ij^⊤ , K_i ∈ S^n_+ , Σ_{j=1}^n μ_ij = τ , i = 1, . . . , m,   (7)

where λ_i = [λ_i1 . . .
λ_in]^⊤ denotes the eigenvalues of S_i and ℓ_i : R^n × R^n → R is a convex loss function on μ_i = [μ_i1 . . . μ_in]^⊤.
We mention that the formulation (4) generalizes the existing formulations based on a single similarity matrix. For m = 1, with L(X) = ‖X‖²_F and L(X) = ‖X‖_F we recover the formulations in [8] and [6] respectively (albeit with a trace constraint). Also the SOCP-based spectrum modification learning formulation [6], proposed in the context of a single similarity matrix (m = 1), is a special case of (7).

3 Kernel Learning using Mirror Descent

In this section we derive general methods for solving (4) and (6) based on the following assumptions: each loss function L_i is convex; a sub-gradient L_i′ can be computed efficiently; and the computed sub-gradients are bounded. We also assume the availability of an efficient SVM / MKL solver.

3.1 Entropic single kernel learning (ESKL) algorithm

We denote the feasible set of kernels as K = {K ∈ S^n_+ | tr(K) = τ} and its relative interior as int(K) = {K ∈ S^n | tr(K) = τ, K is positive definite}. Note that K is convex and compact. Define an inner product on S^n as ⟨K, K′⟩ = tr(KK′). From Eqn. (1) we note that ω(K) is a convex function of K. Therefore the objective function f in (4) is convex and Lipschitz continuous on K. Let α* denote a maximizer of the SVM dual (1). Then we can compute a sub-gradient of f as

f′(K) = −0.5 Y α* α*^⊤ Y + ρ Σ_{i=1}^m L_i′(K − S_i).   (8)

Thus the convex programming problem (4) satisfies the conditions for applying the Mirror Descent (MD) [2] scheme. To apply the MD procedure we require a strongly convex and continuously differentiable function ψ : K → R.
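With the entropic choice of ψ made below, each ESKL iteration (Algorithm 1) reduces to a multiplicative reweighting of the stored spectrum along the eigenvectors of the current sub-gradient. A minimal numpy sketch of one such update, illustrative only: G stands in for the sub-gradient f′(K^(t)) and eta for the step-size η_t, and the eigenvalue bookkeeping follows the algorithm's statement.

```python
import numpy as np

def eskl_step(lam, G, eta, tau):
    """One entropic mirror-descent update as in Algorithm 1:
    eigendecompose the sub-gradient G, reweight the stored spectrum
    lam multiplicatively, and rescale so that tr(K_next) = tau."""
    d, V = np.linalg.eigh(G)             # G = V diag(d) V^T
    w = lam * np.exp(-eta * d)
    lam_next = tau * w / w.sum()         # stays on the simplex scaled by tau
    K_next = (V * lam_next) @ V.T
    return lam_next, K_next

tau, n = 5.0, 4
lam = np.full(n, tau / n)                # K^(1) = (tau/n) I: a flat spectrum
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
G = (A + A.T) / 2                        # a symmetric stand-in sub-gradient
lam, K = eskl_step(lam, G, eta=0.1, tau=tau)
```

Note that the trace constraint is maintained exactly (tr(K^(t+1)) = τ) and the spectrum stays strictly positive, so every iterate remains in int(K) without any explicit projection.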
Following [2] we choose the negative of the matrix entropy as the candidate for ψ:

ψ(K) = Σ_{j=1}^n λ_j log λ_j,   (9)

where (λ_1, . . . , λ_n) are the eigenvalues of K ∈ K. With the above setup we derive an MD algorithm, named the entropic single kernel learning (ESKL) algorithm, similar to the entropic mirror descent algorithm proposed in [2].

Algorithm 1 Entropic single kernel learning (ESKL) algorithm

Initialization: K^(1) ∈ int(K). Set t = 0.
repeat
  • t := t + 1.
  • Obtain α*, i.e., a maximizer of the SVM dual problem (1) for kernel K = K^(t).
  • Compute the sub-gradient f′(K^(t)) := −0.5 Y α* α*^⊤ Y + ρ Σ_i L_i′(K^(t) − S_i).
  • Choose a suitable step-size η_t.
  • Compute the eigendecomposition f′(K^(t)) = V^(t) diag([d_1^(t) . . . d_n^(t)]) V^(t)^⊤.
  • λ_j^(t+1) := τ λ_j^(t) exp(−η_t d_j^(t)) / Σ_{l=1}^n λ_l^(t) exp(−η_t d_l^(t)),  ∀j = 1, . . . , n.
  • K^(t+1) := V^(t) diag([λ_1^(t+1) . . . λ_n^(t+1)]) V^(t)^⊤.
until Convergence

Proposition 3.1. Let f^(t) denote the objective function value at the t-th iteration and f* be the optimal objective value. If the ESKL algorithm is initialized with K^(1) = (τ/n) I and the step-sizes are chosen as η_t = (1/Lip(f)) √(2 log n / t), then min_{1≤t≤T} f^(t) − f* ≤ τ Lip(f) √(2 log n / T), where Lip(f) is a Lipschitz constant of f such that ‖f′(K^(t))‖_F ≤ Lip(f), t = 1, . . . , T.

Proof. The strong convexity constant of ψ w.r.t. the ‖·‖_F norm is σ = 1/τ. Let B_ψ denote the Bregman distance function [2] generated by ψ. Then we have B_ψ(K, K^(1)) ≤ τ log n, ∀K ∈ K (assuming n ≥ 3).
We complete the proof by applying Theorem 4.2 of [2] to the ESKL algorithm.

3.2 Entropic multiple kernel learning (EMKL) algorithm

Consider the kernel learning formulation (6), which minimizes the distances of the kernels {K_i : i = 1, . . . , m} from the corresponding similarity matrices {S_i : i = 1, . . . , m} and simultaneously learns an SVM classifier by performing multiple kernel learning (MKL) on those kernels. To learn a non-sparse combination of kernels the following MKL formulation has been proposed in [10]:

Ω(K_1, . . . , K_m) ≡ max_{γ ∈ △_m} max_{α ∈ A} [ α^⊤1 − (1/2) α^⊤ Y ( Σ_{i=1}^m (1/γ_i) K_i ) Y α ],   (10)

where △_m = { γ = [γ_1 . . . γ_m]^⊤ : Σ_i γ_i ≤ 1, γ_i ≥ 0, ∀i }. With Ω(K_1, . . . , K_m) as defined above, the objective function F in (6) can be expressed as

F(K_1, . . . , K_m) = max_{γ ∈ △_m} max_{α ∈ A} Σ_i F_i(K_i ; α, γ),
F_i(K_i ; α, γ) = (1/m) 1^⊤α − (1/(2γ_i)) tr(K_i Y αα^⊤ Y) + ρ L_i(K_i − S_i), i = 1, . . . , m.   (11)

Let V := ⊗_{i=1}^m S^n, K = {K ∈ S^n_+ : tr(K) = τ} and K^m := ⊗_{i=1}^m K ⊂ V. Denote K = (K_1, . . . , K_m) ∈ K^m. Define an inner product on V as ⟨K, K′⟩_V := Σ_{i=1}^m ⟨K_i, K_i′⟩, where ⟨K_i, K_i′⟩ = tr(K_i K_i′). Also define a norm on V as ‖K‖ = Σ_{i=1}^m ‖K_i‖_F. From (10) we note that Ω(K_1, . . . , K_m) is a convex function of (K_1, . . . , K_m) over the compact space K^m. Thus the objective function F in (6) is convex and Lipschitz continuous on K^m.

Lemma 3.1. Let (α*, γ*) be a solution of (10) and L_i′ be a sub-gradient of L_i. Then a sub-gradient of F is given by

F′(K_1, . . . , K_m) = ( ∂_{K_1} F_1(K_1; α*, γ*) · · · ∂_{K_m} F_m(K_m; α*, γ*) ),
∂_{K_i} F_i(K_i; α*, γ*) = −(1/(2γ_i*)) Y α* α*^⊤ Y + L_i′(K_i − S_i), i = 1, . . . , m.   (12)

Proof. First, we observe that F_i(K_i; α, γ) is a convex function of K_i ∈ K and the expression of ∂_{K_i} F_i(K_i; α, γ) given in Eqn. (12) is precisely a sub-gradient of F_i. Therefore, we can write

F_i(K_i′; α, γ) ≥ F_i(K_i; α, γ) + ⟨ K_i′ − K_i , ∂_{K_i} F_i(K_i; α, γ) ⟩, ∀K_i′ ∈ K.   (13)

By optimality of (α*, γ*) we have F(K_1, . . . , K_m) = Σ_{i=1}^m F_i(K_i; α*, γ*). Because of the max operation over α, γ, we have F(K_1′, . . . , K_m′) ≥ Σ_{i=1}^m F_i(K_i′; α*, γ*) for any K′ = (K_1′, . . . , K_m′) ∈ K^m. Applying (13) we arrive at

F(K_1′, . . . , K_m′) ≥ F(K_1, . . . , K_m) + ⟨ K′ − K , F′(K_1, . . . , K_m) ⟩_V.

Hence, F′(K_1, . . . , K_m) given in (12) is a sub-gradient of F.

We develop a novel Mirror Descent procedure for problem (6) by defining a strongly convex and continuously differentiable function Ψ on the product space K^m as

Ψ(K) = Σ_{i=1}^m Σ_{j=1}^n λ_{i,j} log λ_{i,j},  K ∈ K^m,   (14)

where (λ_{i,1}, . . . , λ_{i,n}) denote the eigenvalues of K_i. The resulting algorithm, named entropic multiple kernel learning (EMKL), is given below.

Algorithm 2 Entropic multiple kernel learning (EMKL) algorithm

Initialization: K_i^(1) ∈ int(K), i = 1, . . . , m. Set t = 0.
repeat
  t := t + 1.
  Obtain α*, γ* by solving the MKL problem (10) with K_i = K_i^(t), i = 1, . . . , m.
  for i = 1 to m do
    • Compute the sub-gradient ∂_{K_i} F_i(K_i^(t); α*, γ*) := −(1/(2γ_i*)) Y α* α*^⊤ Y + L_i′(K_i^(t) − S_i).
    • Find the eigendecomposition ∂_{K_i} F_i(K_i^(t); α*, γ*) = V_i^(t) diag([d_{i,1}^(t) . . . d_{i,n}^(t)]) V_i^(t)^⊤.
    • λ_{i,j}^(t+1) := τ λ_{i,j}^(t) exp(−η_t d_{i,j}^(t)) / Σ_{l=1}^n λ_{i,l}^(t) exp(−η_t d_{i,l}^(t)),  j = 1, . . . , n.
    • K_i^(t+1) := V_i^(t) diag([λ_{i,1}^(t+1) . . . λ_{i,n}^(t+1)]) V_i^(t)^⊤.
  end for
until Convergence

Theorem 3.2. Let F^(t) denote the objective function value at the t-th iteration and F* be the optimal objective value. If the EMKL algorithm is initialized with K_i^(1) = (τ/n) I, ∀i, and the step-sizes are chosen as η_t = (1/Lip(F)) √(2 log n / (m t)), then min_{1≤t≤T} F^(t) − F* ≤ τ m Lip(F) √(2 log n / T), where Lip(F) is a Lipschitz constant of F such that ‖∂_{K_i} F(K_i^(t); α, γ)‖_F ≤ Lip(F), ∀i, t.

Proof. Let K* = (K_1*, . . . , K_m*) be an optimal solution of (6). Denote K^(t) = (K_1^(t), . . . , K_m^(t)). We apply the convergence result presented as Theorem 4.2 in [2]. This leads to the following:

η_t = (1/Lip(F)) √( 2 σ B_Ψ(K*, K^(t)) / t )  ⇒  min_{1≤t≤T} F^(t) − F* ≤ Lip(F) √( 2 B_Ψ(K*, K^(t)) / (σ T) ),   (15)

where σ > 0 is the strong convexity constant of Ψ and B_Ψ is the Bregman distance function generated by Ψ. For the Ψ function defined in Eqn. (14), we have σ = 1/(m τ). Assuming n ≥ 3, we also have B_Ψ(K, K^(1)) ≤ m τ log n, ∀K ∈ K^m.
Substituting the values of B_Ψ and σ in (15) we obtain the desired result.

3.3 Restricted entropic kernel learning (REKL) algorithm

The proposed EMKL algorithm is computationally expensive as it computes the eigendecomposition of m matrices of dimension n × n at every iteration. Here we propose an efficient algorithm by considering the restricted kernel learning formulation (7), where Ω(K_1, . . . , K_m) is given in Eqn. (10). We denote the feasible set for μ_i as X := { μ_i ∈ R^n | μ_ij ≥ 0, ∀j, Σ_{j=1}^n μ_ij = τ }, which is a convex compact subset of R^n. We note that Ω in (10), when viewed as a function of the μ_i's, is a convex function on the Cartesian product space X^m := ⊗_{i=1}^m X. The loss function ℓ_i is assumed to be a convex function of μ_i with bounded sub-gradients. Hence, the objective function g in (7) is convex and Lipschitz continuous over the compact space X^m. Denote a sub-gradient of ℓ_i as [ ∂_{μ_i1} ℓ_i(μ_i, λ_i) , . . . , ∂_{μ_in} ℓ_i(μ_i, λ_i) ]^⊤. We can compute a sub-gradient of Ω as Ω′ = ( ∂_{μ_11} Ω , ∂_{μ_12} Ω , . . . , ∂_{μ_mn} Ω ), where ∂_{μ_ij} Ω = −(1/(2γ_i*)) α*^⊤ Y v_ij v_ij^⊤ Y α*. We derive an MD algorithm, named restricted entropic kernel learning (REKL), by extending the entropic mirror descent scheme [2] to deal with a Cartesian product of simplices. This is achieved by defining a strongly convex and continuously differentiable function ψ_e : X^m → R as

ψ_e(μ_1, . . . , μ_m) = Σ_{i=1}^m Σ_{j=1}^n μ_ij log μ_ij,  μ_i ∈ X, ∀i.   (16)

Algorithm 3 Restricted entropic kernel learning (REKL) algorithm

Find the eigendecompositions: S_i = Σ_j λ_ij v_ij v_ij^⊤, i = 1, . . . , m.
Initialization: μ_i^(1) ∈ int(X), i = 1, . . . , m. Set t = 0.
repeat
  t := t + 1.
  Obtain α*, γ* by solving the MKL problem (10) with K_i = Σ_j μ_ij^(t) v_ij v_ij^⊤, i = 1, . . . , m.
  for i = 1 to m do
    • Compute the sub-gradient g′_ij^(t) := −(1/(2γ_i*)) α*^⊤ Y v_ij v_ij^⊤ Y α* + ∂_{μ_ij} ℓ_i(μ_i^(t), λ_i).
    • μ_ij^(t+1) := τ μ_ij^(t) exp(−η_t g′_ij^(t)) / Σ_{l=1}^n μ_il^(t) exp(−η_t g′_il^(t)),  j = 1, . . . , n.
  end for
until Convergence

Proposition 3.2. Let g^(t) denote the objective function value at the t-th iteration and g* be the optimal objective value. If the REKL algorithm is initialized with μ_i^(1) = (τ/n) 1, ∀i, and the step-sizes are chosen as η_t = (1/Lip(g)) √(2 log n / (m t)), then min_{1≤t≤T} g^(t) − g* ≤ τ m Lip(g) √(2 log n / T), where Lip(g) is a Lipschitz constant of g such that |g′_ij^(t)| ≤ Lip(g), i = 1, . . . , m, j = 1, . . . , n, t = 1, . . . , T.

Proof. The proof is similar to that of Theorem 3.2.

3.4 Discussion

The ESKL formulation requires O(log n / ε²) iterations (see Proposition 3.1), where each iteration solves one SVM and computes one eigendecomposition of an n × n matrix. Both the EMKL and REKL formulations require O(m² log n / ε²) iterations (see Theorem 3.2 and Proposition 3.2), and in each iteration one solves an MKL problem. However, EMKL is more computationally expensive than REKL.

4 Experiments and Results

In this section we experimentally compare the proposed kernel learning formulations against IndSVM [8] and the eigen transformation methods: Denoise, Flip, Shift [15].
Given an indefinite similarity matrix S with eigendecomposition S = Σ_j λ_j v_j v_j^⊤, the eigen transformation methods generate a kernel matrix as K := Σ_j μ_j v_j v_j^⊤, where the choices of μ_j are: (a) Denoise: μ_j = max(λ_j, 0), (b) Flip: μ_j = |λ_j|, (c) Shift: μ_j = λ_j − δ, where δ = min{λ_1, . . . , λ_n, 0}. We consider the following choices for the loss functions in ESKL / EMKL: [L1] L(K − S) = Σ_{i,j} |K(i, j) − S(i, j)|, [L2] L(K − S) = ‖K − S‖_F, [L3] L(K − S) = Σ_{i,j} |K(i, j) − S(i, j)|². For REKL we choose ℓ(μ_j, λ_j) = |μ_j − λ_j|², i.e., the squared Euclidean distance. Algorithm parameters are tuned using a standard 5-fold cross-validation procedure. LibSVM [4] is used as the SVM solver. For each data set we have considered an equal number of positive and negative training / test samples. We report classification performance in terms of accuracy and F-score (expressed as %) averaged over 5 different train / test splits.

4.1 Data sets

We experimented on 10 different data sets, including the data sets covered in [16, 6]. We have generated the indefinite similarity matrices as prescribed in [16] for each of the following data sets: Sonar, Liver disorder, Ionosphere, Diabetes and Heart. We have used the same similarity matrices as in [6]¹ for the data sets Amazon, AuralSonar, Yeast-SW-5-7 and Yeast-SW-5-12. To test the proposed multiple-similarity based formulations we experimented on a subset of the SCOP database [9] taken from the Protein Classification Benchmark Collection². Considering proteins having < 40% sequence identity, we randomly select 8 super-families which have at least 45 proteins. We compute 3 different pairwise similarity measures for proteins: Psi-BLAST [1], Smith-Waterman [14] and Needleman-Wunsch [11].
The similarity matrices obtained from these 3 similarity measures are indefinite in general.

4.2 Effect of various loss functions

We experimentally demonstrate the ability of the proposed ESKL algorithm to handle general convex loss functions. Classification performance is presented in Table 1. We observe that on the Liver disorder data set the L2 loss performs better than L1 and L3. Again, on the Diabetes and Heart data sets both L1 and L2 provide much better performance than L3. From Table 2 we observe that on the AuralSonar data set the ESKL formulation works best with the L3 loss function, but on the Yeast-SW-5-7 data set the L1 loss function works best. Therefore we can say that the choice of loss function has an effect on classification accuracy. This suggests the need for a general algorithm which provides flexibility to choose the loss function based on the data set. Hence in this paper we have developed the algorithms keeping the choice of loss function almost open.

Table 1: Comparison of classification accuracy (odd rows) and F-score (even rows) on UCI data sets

Dataset         | Denoise | Flip | Shift | IndSVM [8] | ESKL [L1] | ESKL [L2] | ESKL [L3]
Sonar           |  71.5   | 72.5 | 76.5  |   76.5     |   75.5    |   73.0    |   75.5
                |  70.0   | 70.6 | 75.0  |   74.8     |   73.9    |   71.3    |   74.1
Liver disorder  |  57.6   | 54.5 | 55.5  |   59.7     |   61.0    |   62.8    |   60.7
                |  55.4   | 53.8 | 52.9  |   55.8     |   58.9    |   59.1    |   57.5
Ionosphere      |  87.3   | 89.6 | 91.2  |   91.5     |   88.5    |   91.2    |   91.5
                |  87.6   | 89.9 | 91.4  |   91.8     |   88.5    |   91.4    |   91.8
Diabetes        |  63.9   | 58.7 | 64.4  |   70.2     |   73.3    |   73.5    |   69.8
                |  65.0   | 58.8 | 65.1  |   71.4     |   74.3    |   74.6    |   71.0
Heart           |  73.3   | 63.8 | 75.8  |   76.3     |   78.8    |   78.8    |   76.3
                |  73.1   | 65.1 | 76.9  |   76.5     |   79.5    |   79.0    |   76.5

4.3 Combining multiple sequence similarity matrices for Proteins

Consider the task of classifying proteins into super-families when multiple sequence similarity measures are available.
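The simple baseline extension of IndSVM discussed below, averaging the available similarity matrices and projecting the average onto the psd cone so it can serve as a single kernel, can be sketched as follows (a hypothetical helper for illustration, not code from the paper):

```python
import numpy as np

def averaged_similarity_kernel(sims):
    """Baseline: average m (possibly indefinite) similarity matrices and
    clip the negative eigenvalues of the average, yielding one psd kernel."""
    S_avg = sum(sims) / len(sims)
    lam, V = np.linalg.eigh(S_avg)
    return (V * np.maximum(lam, 0.0)) @ V.T

# Three toy symmetric "similarity measures" over the same 3 examples:
rng = np.random.default_rng(1)
sims = []
for _ in range(3):
    A = rng.standard_normal((3, 3))
    sims.append((A + A.T) / 2)
K = averaged_similarity_kernel(sims)
```

Unlike the proposed formulations, this baseline fixes the combination weights to 1/m up front rather than learning the kernel from the data.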
We perform 1-vs-rest classification experiments on each of the 8 protein super-families and report performance averaged over 5 train / test splits.

¹http://idl.ee.washington.edu/SimilarityLearning/
²http://net.icgeb.org/benchmark/

Table 2: Comparison of classification accuracy (odd rows) and F-score (even rows) on real data sets

Dataset        | Denoise | Flip | Shift | IndSVM [8] | ESKL [L1] | ESKL [L2] | ESKL [L3]
Amazon         |  83.8   | 83.8 | 85.0  |   87.5     |   88.8    |   85.0    |   88.8
               |  84.8   | 84.8 | 85.9  |   86.9     |   88.0    |   84.3    |   88.0
AuralSonar     |  87.0   | 87.0 | 87.0  |   88.0     |   88.0    |   87.0    |   90.0
               |  86.5   | 86.3 | 86.3  |   87.3     |   87.3    |   86.3    |   89.1
Yeast-SW-5-7   |  75.5   | 70.0 | 74.0  |   77.0     |   79.0    |   75.5    |   76.5
               |  77.1   | 72.4 | 74.1  |   77.7     |   79.9    |   76.1    |   77.2
Yeast-SW-5-12  |  86.0   | 85.5 | 86.0  |   90.0     |   90.0    |   90.5    |   90.0
               |  87.1   | 85.8 | 87.6  |   90.9     |   90.9    |   91.35   |   90.9

Table 3: Comparison of classification accuracy (odd rows) and F-score (even rows) on Proteins

Super-family | Linear SVM | Eigen Denoise | IndSVM [8] | simpleMKL [12] | ESKL [L1] | EMKL [L1] | REKL ‖·‖₂
a.4.1        |   51.9     |     53.1      |   54.4     |     56.9       |   70.0    |   73.1    |   84.4
             |   67.5     |     68.1      |   68.7     |     69.9       |   77.1    |   78.9    |   86.5
b.1.18       |   63.8     |     62.5      |   58.1     |     65.6       |   71.9    |   75.6    |   74.4
             |   73.9     |     73.1      |   70.7     |     74.7       |   78.6    |   80.6    |   78.6
b.29.1       |   70.6     |     80.6      |   75.6     |     77.5       |   85.0    |   83.8    |   75.0
             |   55.6     |     76.3      |   67.0     |     70.2       |   83.5    |   82.1    |   71.0
b.40.4       |   66.9     |     68.1      |   59.4     |     70.0       |   71.9    |   68.8    |   76.2
             |   74.8     |     75.4      |   71.1     |     76.7       |   77.3    |   76.3    |   78.5
c.1.8        |   58.8     |     75.0      |   66.9     |     73.7       |   80.6    |   85.0    |   80.0
             |   29.5     |     65.1      |   50.1     |     62.7       |   74.6    |   80.8    |   77.9
c.3.1        |   91.9     |     97.5      |   95.6     |     95.0       |   95.6    |   95.6    |   96.2
             |   90.9     |     97.4      |   95.4     |     94.7       |   95.2    |   95.4    |   96.0
c.47.1       |   88.1     |     86.2      |   76.2     |     90.0       |   84.4    |   90.6    |   84.4
             |   85.7     |     87.8      |   81.5     |     88.8       |   85.7    |   90.3    |   86.4
c.67.1       |   88.8     |     90.6      |   90.6     |     91.2       |   81.9    |   91.2    |   93.1
             |   87.2     |     89.6      |   89.6     |     90.3       |   76.8    |   90.3    |   92.6

One can extend IndSVM [8], originally proposed to handle a single similarity matrix, to multiple similarity matrices by averaging over the similarity
matrices. We implement the linear SVM by considering the similarities as features and computing a linear kernel. We also compare with a multiple kernel learning formulation, simpleMKL [12]. Denoised versions of the similarity matrices are given as input to simpleMKL. In Table 3 the proposed multiple-similarity based kernel learning algorithms ESKL / EMKL / REKL are compared with the other methods mentioned above. We observe significant performance improvement in most cases. We also note that REKL is computationally cheaper than EMKL but provides reasonably good performance.

5 Conclusion

We have proposed three formulations, (4), (6) and (7), for learning kernels from multiple similarity matrices. The key advantages of the proposed algorithms over the state of the art are: (i) they require only SVM / MKL solvers and no other sophisticated tools; (ii) the algorithms are applicable for a wide choice of loss functions and multiple similarity functions. The proposed methods can also be seen as an alternative to Multiple Kernel Learning, which will be explored in future research.

Acknowledgments

Prof. Chiranjib Bhattacharyya was partly supported by a Yahoo! faculty award grant.

References

[1] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.

[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

[3] A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12:79–108, 2001.

[4] Chih-Chung Chang and Chih-Jen Lin.
LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[5] J. Chen and J. Ye. Training SVM with indefinite kernels. In International Conference on Machine Learning, 2008.

[6] Y. Chen, M. R. Gupta, and B. Recht. Learning kernels from indefinite similarities. In International Conference on Machine Learning, 2009.

[7] Gert R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

[8] R. Luss and A. d'Aspremont. Support vector machine classification with indefinite kernels. In Advances in Neural Information Processing Systems, 2007.

[9] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.

[10] J. Saketha Nath, G. Dinesh, S. Raman, C. Bhattacharyya, A. Ben-Tal, and K. R. Ramakrishnan. On the algorithmics and applications of a mixed-norm based kernel learning formulation. In Advances in Neural Information Processing Systems, pages 844–852, 2009.

[11] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[12] A. Rakotomamonjy, Francis R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

[13] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). The MIT Press, 2001.

[14] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences.
Journal of Molecular Biology, 147(1):195–197, 1981.

[15] G. Wu, Z. Zhang, and E. Y. Chang. An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. Technical Report, University of California, Santa Barbara, 2005.

[16] Y. Ying, C. Campbell, and M. Girolami. Analysis of SVM with indefinite kernels. In Advances in Neural Information Processing Systems, 2009.
", "award": [], "sourceid": 791, "authors": [{"given_name": "Achintya", "family_name": "Kundu", "institution": null}, {"given_name": "Vikram", "family_name": "Tankasali", "institution": null}, {"given_name": "Chiranjib", "family_name": "Bhattacharyya", "institution": null}, {"given_name": "Aharon", "family_name": "Ben-tal", "institution": null}]}