{"title": "Analysis of SVM with Indefinite Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 2205, "page_last": 2213, "abstract": "The recent introduction of indefinite SVM by Luss and dAspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterovs smooth optimization approach [16,17] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.", "full_text": "Analysis of SVM with Inde\ufb01nite Kernels\n\nYiming Ying\u2020 , Colin Campbell\u2020 and Mark Girolami\u2021\n\n\u2020Department of Engineering Mathematics, University of Bristol,\n\nS.A.W. Building, G12 8QQ, United Kingdom\n\nBristol BS8 1TR, United Kingdom\n\n\u2021Department of Computer Science, University of Glasgow,\n\nAbstract\n\nThe recent introduction of inde\ufb01nite SVM by Luss and d\u2019Aspremont [15] has ef-\nfectively demonstrated SVM classi\ufb01cation with a non-positive semi-de\ufb01nite ker-\nnel (inde\ufb01nite kernel). This paper studies the properties of the objective function\nintroduced there. 
In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov's smooth optimization approach [17, 18] for indefinite SVM, which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.\n\n1 Introduction\n\nKernel methods [5, 24] such as Support Vector Machines (SVM) have recently attracted much attention due to their good generalization performance and appealing optimization approaches. The basic idea of kernel methods is to map the data into a high-dimensional (even infinite-dimensional) feature space through a kernel function. The kernel function over samples forms a similarity kernel matrix which is usually required to be positive semi-definite (PSD). The PSD property of the similarity matrix ensures that the SVM can be efficiently solved as a convex quadratic program.\n\nHowever, many potential kernel matrices may fail to be positive semi-definite. Such cases are quite common in applications: examples include the sigmoid kernel [14] for various values of the hyper-parameters, hyperbolic tangent kernels [25], and the protein sequence similarity measures derived from Smith-Waterman and BLAST scores [23]. The problem of learning with a non-PSD similarity matrix (indefinite kernel) has recently attracted considerable attention [4, 8, 9, 14, 20, 21, 26]. 
One widely used method is to convert the indefinite kernel matrix into a PSD one by a spectrum transformation. The denoise method neglects the negative eigenvalues [8, 21], flip [8] takes the absolute value of all eigenvalues, shift [22] makes the eigenvalues positive by adding a positive constant, and the diffusion method [11] takes the exponentials of the eigenvalues; see [26] for a detailed coverage. However, useful information in the data could be lost in these spectral transformations since they are separated from the process of training classifiers. In [9], the classification problem with indefinite kernels is regarded as the minimization of the distance between convex hulls in a pseudo-Euclidean space. In [20], general Reproducing Kernel Krein spaces (RKKS) with indefinite kernels are introduced, which allow a general representer theorem and regularization formulations.\n\nLuss and d'Aspremont [15] recently proposed a regularized formulation for SVM classification with an indefinite kernel. Training an SVM with an indefinite kernel was viewed as a kernel matrix learning problem [13], i.e. learning a proxy PSD kernel matrix to approximate the indefinite one. Without realizing that the objective function is differentiable, the authors quadratically smoothed the objective function and then formulated two approximate algorithms: the projected gradient method and the analytic center cutting plane method.\n\nIn this paper we follow the formulation of SVM with indefinite kernels proposed in [15]. We mainly establish the differentiability of the objective function (see its precise definition in equation (3)) and prove that it is, indeed, differentiable with Lipschitz continuous gradient. This elementary result shows there is no need to smooth the objective function, which greatly facilitates the application of gradient-based algorithms. 
The main idea behind our analysis comes from the saddle (min-max) representation, which involves a penalty term in the form of a Frobenius norm of matrices, measuring the distance between the indefinite kernel matrix and the proxy PSD one. This penalty term can be regarded as a Moreau-Yosida regularization term [12] that smooths out the objective function.\n\nThe paper is organized as follows. In Section 2, we review the formulation of indefinite SVM classification presented in [15]. Our main contribution is outlined in Section 3. There, we first show that the objective function of interest is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. Based on our analysis, in Section 4 we propose a simplified formulation of the projected gradient method presented in [15] and show that it has a convergence rate of $O(1/k)$, where $k$ is the iteration number. We further develop Nesterov's smooth optimization approach [17, 18] for indefinite SVM, which has an optimal convergence rate of $O(1/k^2)$ for smooth problems. In Section 5, our analysis and proposed optimization approaches are validated by experiments on various benchmark data sets.\n\n2 Indefinite SVM Classification\n\nIn this section we review the regularized formulation of indefinite SVM presented in [15]. To this end, we introduce some notation. Let $\mathbb{N}_n = \{1, 2, \ldots, n\}$ for any $n \in \mathbb{N}$ and let $\mathcal{S}^n$ be the space of all $n \times n$ symmetric matrices. If $A \in \mathcal{S}^n$ is positive semi-definite, we write $A \succeq 0$; the cone of PSD matrices is denoted by $\mathcal{S}^n_+$. For any $A, B \in \mathbb{R}^{n \times n}$, $\langle A, B\rangle_F := \mathrm{Tr}(A^\top B)$, where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. Finally, the Frobenius norm on $\mathcal{S}^n$ is denoted, for any $A \in \mathcal{S}^n$, by $\|A\|_F := (\mathrm{Tr}(A^\top A))^{1/2}$. 
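Since the later sections manipulate these quantities numerically, the notation can be mirrored in a few lines of code. The following numpy sketch (our own illustrative helpers, not from the paper) computes the Frobenius inner product $\langle A, B\rangle_F$, the Frobenius norm, and an eigenvalue test for $A \succeq 0$:

```python
import numpy as np

def frob_inner(A, B):
    # <A, B>_F = Tr(A^T B)
    return np.trace(A.T @ B)

def frob_norm(A):
    # ||A||_F = (Tr(A^T A))^{1/2}
    return np.sqrt(frob_inner(A, A))

def is_psd(A, tol=1e-10):
    # A in S^n is PSD iff all eigenvalues of its symmetric part are nonnegative
    return bool(np.all(np.linalg.eigvalsh((A + A.T) / 2) >= -tol))

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # PSD (eigenvalues 1 and 3)
B = np.array([[0.0, 1.0], [1.0, 0.0]])     # indefinite (eigenvalues +1 and -1)
```

For real symmetric matrices the inner product reduces to the elementwise sum $\sum_{ij} A_{ij}B_{ij}$, which is the cheaper way to compute it in practice.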
The standard Euclidean norm and inner product are denoted by $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$, respectively.\n\nLet a set of training samples be given by inputs $x = \{x_i \in \mathbb{R}^d : i \in \mathbb{N}_n\}$ and outputs $y = \{y_i \in \{\pm 1\} : i \in \mathbb{N}_n\}$. Suppose that $K$ is a positive semi-definite kernel matrix (proxy kernel matrix) on the inputs $x$. Let $Y = \mathrm{diag}(y)$, let $e$ be the $n$-dimensional vector of all ones, and let $C$ be a positive trade-off parameter. Then the dual formulation of the 1-norm soft margin SVM [5, 24] is given by\n\n$\max_\alpha \ \alpha^\top e - \tfrac{1}{2}\alpha^\top Y K Y \alpha \quad \text{s.t.} \quad \alpha^\top y = 0, \ 0 \le \alpha \le C.$\n\nSince we assume that $K$ is positive semi-definite, the above problem is a standard convex quadratic program [2] and a global solution can be efficiently obtained by, e.g., a primal-dual interior-point method. Suppose now we are only given an indefinite kernel matrix $K_0 \in \mathcal{S}^n$. Luss and d'Aspremont [15] proposed the following max-min approach to simultaneously learn a proxy PSD kernel matrix $K$ for the indefinite matrix $K_0$ and the SVM classification:\n\n$\min_K \max_\alpha \ \alpha^\top e - \tfrac{1}{2}\alpha^\top Y K Y \alpha + \rho\|K - K_0\|_F^2 \quad \text{s.t.} \quad \alpha^\top y = 0, \ 0 \le \alpha \le C, \ K \succeq 0. \qquad (1)$\n\nLet $Q_1 = \{\alpha \in \mathbb{R}^n : \alpha^\top y = 0, \ 0 \le \alpha \le C\}$ and $L(\alpha, K) = \alpha^\top e - \tfrac{1}{2}\alpha^\top Y K Y \alpha + \rho\|K - K_0\|_F^2$. By the min-max theorem [2], problem (1) is equivalent to\n\n$\max_{\alpha \in Q_1} \min_{K \in \mathcal{S}^n_+} L(\alpha, K). \qquad (2)$\n\nFor simplicity, we refer to the function defined by\n\n$f(\alpha) = \min_{K \in \mathcal{S}^n_+} L(\alpha, K) \qquad (3)$\n\nas the objective function. It is obviously concave since $f$ is the minimum of a family of concave functions. 
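As a concrete reference point, the pieces just defined, the feasible set $Q_1$ and the saddle function $L(\alpha, K)$, can be written down directly. The following numpy sketch is illustrative only (the helper names are our own):

```python
import numpy as np

def in_Q1(alpha, y, C, tol=1e-8):
    # Q1 = {alpha : alpha^T y = 0, 0 <= alpha <= C}
    return (abs(alpha @ y) < tol) and np.all(alpha >= -tol) and np.all(alpha <= C + tol)

def saddle_L(alpha, K, K0, y, rho):
    # L(alpha, K) = alpha^T e - 1/2 alpha^T Y K Y alpha + rho ||K - K0||_F^2
    Y = np.diag(y)
    return (alpha.sum()
            - 0.5 * alpha @ (Y @ K @ Y) @ alpha
            + rho * np.linalg.norm(K - K0, 'fro') ** 2)

# tiny example with an indefinite K0
y = np.array([1.0, -1.0])
K0 = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3 and -1: indefinite
alpha = np.array([0.5, 0.5])              # alpha^T y = 0 and 0 <= alpha <= C = 1
```

With $K = K_0$ the penalty term vanishes and only the quadratic SVM part of $L$ remains, which makes the role of the two terms easy to inspect on toy data.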
We also call the associated function $L(\alpha, K)$ the saddle representation of the objective function $f$.\n\nFor fixed $\alpha \in Q_1$, the optimization $K(\alpha) = \arg\min_{K \succeq 0} L(\alpha, K)$ is equivalent to a projection onto the semi-definite cone $\mathcal{S}^n_+$. Indeed, it was shown in [15] that the optimal solution is given by\n\n$K(\alpha) = (K_0 + Y\alpha\alpha^\top Y/(4\rho))_+ \qquad (4)$\n\nwhere, for any matrix $A \in \mathcal{S}^n$, the notation $A_+$ denotes the positive part of $A$, obtained by simply setting its negative eigenvalues to zero. The optimal solution $(\alpha^*, K^*) \in Q_1 \times \mathcal{S}^n_+$ of the above min-max problem is a saddle point of $L(\alpha, K)$ (see e.g. [2]), i.e. for any $\alpha \in Q_1$ and $K \in \mathcal{S}^n_+$ there holds $L(\alpha, K^*) \le L(\alpha^*, K^*) \le L(\alpha^*, K)$. For a matrix $A \in \mathcal{S}^n$, denote its maximum eigenvalue by $\lambda_{\max}(A)$. The next lemma tells us that the optimal solution $K^*$ belongs to a bounded domain in $\mathcal{S}^n_+$.\n\nLemma 1. Problem (2) is equivalent to the formulation $\max_{\alpha \in Q_1} \min_{K \in Q_2} L(\alpha, K)$ and the objective function can be defined by\n\n$f(\alpha) = \min_{K \in Q_2} L(\alpha, K) \qquad (5)$\n\nwhere $Q_2 := \{K \in \mathcal{S}^n_+ : \lambda_{\max}(K) \le \lambda_{\max}(K_0) + nC^2/(4\rho)\}$.\n\nProof. By the saddle theorem [2], we have $L(\alpha^*, K^*) = \min_{K \in \mathcal{S}^n_+} L(\alpha^*, K)$. Combining this with equation (4) yields $K^* = K(\alpha^*) = (K_0 + Y\alpha^*(\alpha^*)^\top Y/(4\rho))_+$. We can easily see that $\lambda_{\max}(K^*) \le \lambda_{\max}(K_0 + Y\alpha^*(\alpha^*)^\top Y/(4\rho)) \le \lambda_{\max}(K_0) + \lambda_{\max}(Y\alpha^*(\alpha^*)^\top Y/(4\rho)) \le \lambda_{\max}(K_0) + \|\alpha^*\|^2/(4\rho)$, where the second inequality uses the subadditivity of the maximum eigenvalue (e.g. [10, Page 201]), i.e. $\lambda_{\max}(A + B) \le \lambda_{\max}(A) + \lambda_{\max}(B)$ for any $A, B \in \mathcal{S}^n$. Since $0 \le \alpha^* \le C$, we have $\|\alpha^*\|^2 \le nC^2$. Combining this with the above inequality yields the desired lemma.\n\nIt is worth mentioning that [18, Theorem 1] shows that a function $g$ has a Lipschitz continuous gradient if it enjoys the special structure $g(\alpha) = \min\{\langle \mathcal{A}\alpha, K\rangle + \gamma d(K) : K \in Q\}$, where $Q$ is a closed convex subset of a certain vector space, $d(\cdot)$ is a strongly convex function and, most importantly, $\mathcal{A}$ is a linear operator. Since the variable $\alpha$ appears in a quadratic form, i.e. $\alpha^\top Y K Y \alpha$, in the objective function defined by (5), it cannot be written in this special form, and hence the theorem there cannot be applied to our case.\n\n3 Differentiability of the Objective Function\n\nThe following lemma outlines a very useful characterization of the differentiability properties of the optimal value function [3, Theorem 4.1], essentially due to Danskin [7].\n\nLemma 2. Let $\mathcal{X}$ be a metric space and $U$ be a normed space. Suppose that for all $x \in \mathcal{X}$ the function $L(\cdot, x)$ is differentiable, that $L(\alpha, x)$ and $\partial_\alpha L(\alpha, x)$, the derivative of $L(\cdot, x)$, are continuous on $U \times \mathcal{X}$, and let $Q$ be a compact subset of $\mathcal{X}$. Define the optimal value function $f(\alpha) = \inf_{x \in Q} L(\alpha, x)$. The optimal value function is directionally differentiable. Furthermore, if for $\alpha \in U$ the function $L(\alpha, \cdot)$ has a unique minimizer $x(\alpha)$ over $Q$, then $f$ is differentiable at $\alpha$ and the gradient of $f$ is given by $\nabla f(\alpha) = \partial_\alpha L(\alpha, x(\alpha))$.\n\nApplying the above lemma to the objective function $f$ defined by equation (5), we have:\n\nTheorem 1. 
The objective function $f$ defined by (3) (equivalently by (5)) is differentiable and its gradient is given by\n\n$\nabla f(\alpha) = e - Y(K_0 + Y\alpha\alpha^\top Y/(4\rho))_+ Y\alpha. \qquad (6)$\n\nProof. We apply Lemma 2 with $\mathcal{X} = \mathcal{S}^n$, $Q = Q_2 \subseteq \mathcal{S}^n$, $U = Q_1$ and $x = K$. To this end, we first prove the uniqueness of $K(\alpha)$. Suppose there are two minimizers $K_1, K_2$ of the problem $\arg\min_{K \in \mathcal{S}^n_+} L(\alpha, K)$. By the first-order optimality condition for the minimizer $K_1$, we have $\langle \partial_K L(\alpha, K_1), K_2 - K_1\rangle_F \ge 0$. Considering the minimizer $K_2$, we also have $\langle \partial_K L(\alpha, K_2), K_1 - K_2\rangle_F \ge 0$. Noting that $\partial_K L(\alpha, K) = -\tfrac{1}{2}Y\alpha\alpha^\top Y + 2\rho(K - K_0)$ and adding the two first-order optimality inequalities together, we obtain $-2\rho\|K_2 - K_1\|_F^2 \ge 0$, which means that $K_1 = K_2$ and hence completes the proof of the uniqueness of $K(\alpha)$. The desired result now follows directly from Lemma 2 by noting that the derivative of $L$ w.r.t. the first argument is $\partial_\alpha L(\alpha, K) = e - YKY\alpha$.\n\nIndeed, we can go further and establish the Lipschitz continuity of $\nabla f$, based on the strong convexity of $L(\alpha, \cdot)$. To this end, we first establish a useful lemma.\n\nLemma 3. For any $\alpha_1, \alpha_2 \in Q_1$, there holds $\|(K_0 + Y\alpha_1\alpha_1^\top Y/(4\rho))_+ - (K_0 + Y\alpha_2\alpha_2^\top Y/(4\rho))_+\|_F \le (\|\alpha_1\| + \|\alpha_2\|)\|\alpha_1 - \alpha_2\|/(4\rho)$.\n\nProof. Let $\partial_K L(\alpha, \cdot)$ denote the gradient w.r.t. $K$. Consider the minimization problem $\arg\min_{K \in Q_2} L(\alpha, K)$. By the first-order optimality conditions, for any $K \in Q_2$ there holds $\langle \partial_K L(\alpha, K(\alpha)), K - K(\alpha)\rangle_F \ge 0$. 
Applying the above inequality twice implies that $\langle \partial_K L(\alpha_1, K(\alpha_1)), K(\alpha_2) - K(\alpha_1)\rangle_F \ge 0$ and $\langle \partial_K L(\alpha_2, K(\alpha_2)), K(\alpha_1) - K(\alpha_2)\rangle_F \ge 0$. Consequently, $\langle \partial_K L(\alpha_1, K(\alpha_1)) - \partial_K L(\alpha_2, K(\alpha_2)), K(\alpha_2) - K(\alpha_1)\rangle_F \ge 0$. Substituting $\partial_K L(\alpha, K) = -\tfrac{1}{2}Y\alpha\alpha^\top Y + 2\rho(K - K_0)$ into this inequality, we have $4\rho\|K(\alpha_1) - K(\alpha_2)\|_F^2 \le \langle Y(\alpha_2\alpha_2^\top - \alpha_1\alpha_1^\top)Y, K(\alpha_2) - K(\alpha_1)\rangle_F \le \|Y(\alpha_2\alpha_2^\top - \alpha_1\alpha_1^\top)Y\|_F\,\|K(\alpha_2) - K(\alpha_1)\|_F$. Consequently,\n\n$\|K(\alpha_1) - K(\alpha_2)\|_F \le \|Y(\alpha_2\alpha_2^\top - \alpha_1\alpha_1^\top)Y\|_F/(4\rho) \le \|\alpha_2\alpha_2^\top - \alpha_1\alpha_1^\top\|_F/(4\rho), \qquad (7)$\n\nwhere the last inequality follows from the fact that $Y$ is an orthogonal matrix, since $y_i \in \{\pm 1\}$ and $Y = \mathrm{diag}(y_1, \ldots, y_n)$. Note that $\|\alpha_2\alpha_2^\top - \alpha_1\alpha_1^\top\|_F = \|(\alpha_2 - \alpha_1)\alpha_2^\top - \alpha_1(\alpha_1 - \alpha_2)^\top\|_F \le (\|\alpha_1\| + \|\alpha_2\|)\|\alpha_1 - \alpha_2\|$. Putting this back into inequality (7) completes the proof of the lemma.\n\nIt is interesting to point out that the above lemma can alternatively be established by delicate techniques from matrix analysis. To see this, recall that a spectral function $G : \mathcal{S}^n \to \mathcal{S}^n$ is defined by applying a real-valued function $g$ to the eigenvalues of its argument, i.e. for any $K \in \mathcal{S}^n$ with eigen-decomposition $K = U\mathrm{diag}(\lambda_1, \ldots, \lambda_n)U^\top$, $G(K) := U\mathrm{diag}(g(\lambda_1), \ldots, g(\lambda_n))U^\top$. The perturbation inequality in matrix analysis [1, Lemma VII.5.5] shows that if $g$ is Lipschitz continuous with Lipschitz constant $\kappa$ then $\|G(K_1) - G(K_2)\|_F \le \kappa\|K_1 - K_2\|_F$ for all $K_1, K_2 \in \mathcal{S}^n$. Applying this inequality with $g(t) = \max(0, t)$, $K_1 = K_0 + Y\alpha_1\alpha_1^\top Y/(4\rho)$ and $K_2 = K_0 + Y\alpha_2\alpha_2^\top Y/(4\rho)$ implies inequality (7), and hence Lemma 3. However, we prefer the proof presented above for Lemma 3 since it explains more clearly how the strong convexity of the regularization term $\|K - K_0\|_F^2$ plays a critical role in the analysis.\n\nFrom the above lemma, we can establish the Lipschitz continuity of the gradient of the objective function.\n\nTheorem 2. The gradient of the objective function given by (6) is Lipschitz continuous with Lipschitz constant $L = \lambda_{\max}(K_0) + nC^2/\rho$, i.e. for any $\alpha_1, \alpha_2 \in Q_1$ the following inequality holds: $\|\nabla f(\alpha_1) - \nabla f(\alpha_2)\| \le [\lambda_{\max}(K_0) + nC^2/\rho]\,\|\alpha_1 - \alpha_2\|$.\n\nProof. 
For any $\alpha_1, \alpha_2 \in Q_1$, from the representation of $\nabla f$ in Theorem 1 the term $\|\nabla f(\alpha_1) - \nabla f(\alpha_2)\|$ can be bounded by\n\n$\|Y[(K_0 + Y\alpha_1\alpha_1^\top Y/(4\rho))_+ - (K_0 + Y\alpha_2\alpha_2^\top Y/(4\rho))_+]Y\alpha_1\| + \|Y(K_0 + Y\alpha_2\alpha_2^\top Y/(4\rho))_+Y(\alpha_2 - \alpha_1)\|. \qquad (8)$\n\nNow it remains to estimate the two terms on the right-hand side of inequality (8). Let us begin with the first one by applying Lemma 3:\n\n$\|Y[(K_0 + Y\alpha_1\alpha_1^\top Y/(4\rho))_+ - (K_0 + Y\alpha_2\alpha_2^\top Y/(4\rho))_+]Y\alpha_1\| \le \|(K_0 + Y\alpha_1\alpha_1^\top Y/(4\rho))_+ - (K_0 + Y\alpha_2\alpha_2^\top Y/(4\rho))_+\|_F\,\|\alpha_1\| \le \|\alpha_1\|(\|\alpha_1\| + \|\alpha_2\|)\|\alpha_1 - \alpha_2\|/(4\rho) \le \frac{nC^2}{2\rho}\|\alpha_1 - \alpha_2\|, \qquad (9)$\n\nwhere the first inequality follows from the fact that $Y$ is an orthogonal matrix. For the second term on the right-hand side of inequality (8), we apply the fact proved in Theorem 1 that $K(\alpha) \in Q_2$ for any $\alpha \in Q_1$. Indeed, $\|Y(K_0 + Y\alpha_2\alpha_2^\top Y/(4\rho))_+Y(\alpha_2 - \alpha_1)\| \le \lambda_{\max}\big((K_0 + Y\alpha_2\alpha_2^\top Y/(4\rho))_+\big)\,\|\alpha_2 - \alpha_1\| \le [\lambda_{\max}(K_0) + nC^2/(4\rho)]\,\|\alpha_1 - \alpha_2\|. 
Putting this estimate and (9) back into inequality (8) completes the proof of Theorem 2.\n\nSimplified Projected Gradient Method (SPGM)\n1. Choose $\gamma \ge \lambda_{\max}(K_0) + nC^2/\rho$. Let $\varepsilon > 0$, $\alpha_0 \in Q_1$ be given and set $k = 0$.\n2. Compute $\nabla f(\alpha_k) = e - Y(K_0 + Y\alpha_k\alpha_k^\top Y/(4\rho))_+Y\alpha_k$.\n3. Set $\alpha_{k+1} = P_{Q_1}(\alpha_k + \nabla f(\alpha_k)/\gamma)$.\n4. Set $k \leftarrow k + 1$. Go to step 2 until the stopping criterion is less than $\varepsilon$.\n\nTable 1: Pseudo-code of the projected gradient method\n\n4 Smooth Optimization Algorithms\n\nThis section builds on the theoretical analysis above, mainly Theorem 2. We first outline a simplified version of the projected gradient method proposed in [15] and show that it has a convergence rate of $O(1/k)$, where $k$ is the iteration number. We then develop a smooth optimization approach [17, 18] for indefinite SVM (5). This scheme has an optimal convergence rate $O(1/k^2)$ for smooth problems and has been applied to various problems, e.g. 
[6].\n\n4.1 Simplified Projected Gradient Method\n\nIn [15], the objective function was smoothed by adding a quadratic term (see details in Section 3 there) and a projected gradient algorithm was then proposed to solve this approximate problem. Using the explicit gradient representation in Theorem 1, we formulate its simplified version in Table 1, where the projection $P_{Q_1} : \mathbb{R}^n \to Q_1$ is defined, for any $\beta \in \mathbb{R}^n$, by\n\n$P_{Q_1}(\beta) = \arg\min_{\alpha \in Q_1} \|\alpha - \beta\|^2. \qquad (10)$\n\nIndeed, from Theorem 2 we can further obtain the following result by developing the techniques in Sections 2.1.5, 2.2.3 and 2.2.4 of [18].\n\nLemma 4. Let $\gamma \ge \lambda_{\max}(K_0) + nC^2/\rho$ and let $\{\alpha_k : k \in \mathbb{N}\}$ be given by the simplified projected gradient method in Table 1. For any $\alpha \in Q_1$, the following inequality holds: $f(\alpha_{k+1}) \ge f(\alpha) + \gamma\langle \alpha_k - \alpha_{k+1}, \alpha - \alpha_k\rangle + \frac{\gamma}{2}\|\alpha_k - \alpha_{k+1}\|^2$.\n\nProof. We know from Theorem 2 that $\nabla f$ is Lipschitz continuous with Lipschitz constant $L = \lambda_{\max}(K_0) + nC^2/\rho$. Then, for any $\alpha \in Q_1$, $f(\alpha) - f(\alpha_k) - \langle \nabla f(\alpha_k), \alpha - \alpha_k\rangle = \int_0^1 \langle \nabla f(\alpha_k + \theta(\alpha - \alpha_k)) - \nabla f(\alpha_k), \alpha - \alpha_k\rangle\, d\theta \ge -L\int_0^1 \theta\|\alpha - \alpha_k\|^2\, d\theta \ge -\frac{\gamma}{2}\|\alpha - \alpha_k\|^2. 
Applying this inequality with $\alpha = \alpha_{k+1}$ implies that\n\n$-f(\alpha_k) - \langle \nabla f(\alpha_k), \alpha_{k+1} - \alpha_k\rangle \ge -f(\alpha_{k+1}) - \frac{\gamma}{2}\|\alpha_{k+1} - \alpha_k\|^2. \qquad (11)$\n\nLet $\phi(\alpha) = -f(\alpha_k) - \langle \nabla f(\alpha_k), \alpha - \alpha_k\rangle + \frac{\gamma}{2}\|\alpha - \alpha_k\|^2$, so that $\alpha_{k+1} = \arg\min_{\alpha \in Q_1} \phi(\alpha)$. Then, by the first-order optimality condition at $\alpha_{k+1}$ there holds, for any $\alpha \in Q_1$, $\langle \nabla\phi(\alpha_{k+1}), \alpha - \alpha_{k+1}\rangle \ge 0$, i.e. $-\langle \nabla f(\alpha_k), \alpha - \alpha_{k+1}\rangle \ge \gamma\langle \alpha_{k+1} - \alpha_k, \alpha_{k+1} - \alpha\rangle$. Adding this inequality and (11) together yields $-f(\alpha_k) - \langle \nabla f(\alpha_k), \alpha - \alpha_k\rangle \ge -f(\alpha_{k+1}) + \gamma\langle \alpha_k - \alpha_{k+1}, \alpha - \alpha_k\rangle + \frac{\gamma}{2}\|\alpha_k - \alpha_{k+1}\|^2$. Also, since $-f$ is convex, $-f(\alpha) \ge -f(\alpha_k) - \langle \nabla f(\alpha_k), \alpha - \alpha_k\rangle$. Combining this with the above inequality finishes the proof of the lemma.\n\nTheorem 3. Let $\gamma \ge \lambda_{\max}(K_0) + nC^2/\rho$ and let the iteration sequence $\{\alpha_k : k \in \mathbb{N}\}$ be given by the simplified projected gradient method in Table 1. Then we have\n\n$f(\alpha_{k+1}) \ge f(\alpha_k) + \frac{\gamma}{2}\|\alpha_{k+1} - \alpha_k\|^2. \qquad (12)$\n\nMoreover,\n\n$\max_{\alpha \in Q_1} f(\alpha) - f(\alpha_k) \le \frac{\gamma}{2k}\|\alpha_0 - \alpha^*\|^2, \qquad (13)$\n\nwhere $\alpha^*$ is an optimal solution of the problem $\max_{\alpha \in Q_1} f(\alpha)$.\n\nNesterov's Smooth Optimization Method (SMM)\n1. Let $\varepsilon > 0$, $k = 0$, initialize $\alpha_0 \in Q_1$ and let $L = \lambda_{\max}(K_0) + nC^2/\rho$.\n2. Compute $\nabla f(\alpha_k) = e - Y(K_0 + Y\alpha_k\alpha_k^\top Y/(4\rho))_+Y\alpha_k$.\n3. Compute $\gamma_k = P_{Q_1}(\alpha_k + \nabla f(\alpha_k)/L)$.\n4. Compute $\beta_k = P_{Q_1}(\alpha_0 + \sum_{i=0}^{k}(i+1)\nabla f(\alpha_i)/(2L))$.\n5. Set $\alpha_{k+1} = \frac{2}{k+3}\beta_k + \frac{k+1}{k+3}\gamma_k$.\n6. Set $k \leftarrow k + 1$. Go to step 2 until the stopping criterion is less than $\varepsilon$.\n\nTable 2: Pseudo-code of the first-order Nesterov smooth optimization method\n\nProof. Applying Lemma 4 with $\alpha = \alpha_k$ yields inequality (12). To prove inequality (13), we first apply Lemma 4 with $\alpha = \alpha^*$ to get that, for any $i$, $\max_{\alpha \in Q_1} f(\alpha) - f(\alpha_{i+1}) \le -\gamma\langle \alpha_i - \alpha_{i+1}, \alpha^* - \alpha_i\rangle - \frac{\gamma}{2}\|\alpha_i - \alpha_{i+1}\|^2 = \frac{\gamma}{2}\|\alpha^* - \alpha_i\|^2 - \frac{\gamma}{2}\|\alpha^* - \alpha_{i+1}\|^2$. Summing over $i$ from $0$ to $k-1$, and noting from (12) that $\{\max_{\alpha \in Q_1} f(\alpha) - f(\alpha_k) : k \in \mathbb{N}\}$ is decreasing, we have $k\,(\max_{\alpha \in Q_1} f(\alpha) - f(\alpha_k)) \le \sum_{i=0}^{k-1}(\max_{\alpha \in Q_1} f(\alpha) - f(\alpha_{i+1})) \le \frac{\gamma}{2}\|\alpha^* - \alpha_0\|^2. 
This completes the proof of the theorem.\n\nFrom the above theorem, the sequence $\{f(\alpha_k) : k \in \mathbb{N}\}$ is monotonically increasing and the iteration complexity of SPGM is $O(L/\varepsilon)$ for finding an $\varepsilon$-optimal solution.\n\n4.2 Nesterov's Smooth Optimization Method\n\nIn [17, 18], Nesterov proposed an efficient smooth optimization method for solving convex programming problems of the form $\min_{x \in U} g(x)$, where $g$ is a convex function with Lipschitz continuous gradient and $U$ is a closed convex set in $\mathbb{R}^n$. Specifically, suppose there exists $L > 0$ such that $\|\nabla g(x) - \nabla g(x')\| \le L\|x - x'\|$ for all $x, x' \in U$. The smooth optimization approach introduces a proxy-function $d(x)$ associated with the set $U$, assumed to be continuous and strongly convex on $U$ with convexity parameter $\sigma > 0$. Let $x_0 = \arg\min_{x \in U} d(x)$; without loss of generality, assume that $d(x_0) = 0$. Strong convexity of $d$ then means that, for any $x \in U$, $d(x) \ge \frac{\sigma}{2}\|x - x_0\|^2$. A specific first-order smooth optimization scheme detailed in [18] can then be applied to the function $g$ with convergence rate $O(\sqrt{L/\varepsilon})$. The first-order method needs a proxy-function associated with $Q_1$; here we define $d(\alpha) = \frac{1}{2}\|\alpha - \alpha_0\|^2$ with $\alpha_0 \in Q_1$. The Lipschitz constant of $-f$ is established in Theorem 2 and given by $L = \lambda_{\max}(K_0) + nC^2/\rho$. 
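To make the scheme concrete, here is a minimal numpy sketch of the method in Table 2: the gradient from Theorem 1 via an eigen-decomposition, and a bisection routine for the projection $P_{Q_1}$. All function names are our own, the bisection is only one convenient way to realize the projection, and the sketch assumes both labels occur in $y$:

```python
import numpy as np

def positive_part(A):
    # A_+ : set negative eigenvalues of the symmetric matrix A to zero
    lam, U = np.linalg.eigh(A)
    return (U * np.maximum(lam, 0.0)) @ U.T

def grad_f(alpha, K0, y, rho):
    # Theorem 1: grad f(alpha) = e - Y (K0 + Y alpha alpha^T Y / (4 rho))_+ Y alpha
    Ya = y * alpha                                # Y alpha, with Y = diag(y)
    Kp = positive_part(K0 + np.outer(Ya, Ya) / (4.0 * rho))
    return 1.0 - y * (Kp @ Ya)

def proj_Q1(beta, y, C, iters=100):
    # P_{Q1}: KKT gives alpha(lmb) = clip(beta + lmb*y, 0, C), and
    # y^T alpha(lmb) is continuous and nondecreasing in lmb, so bisect on lmb.
    lo, hi = -(np.abs(beta).max() + C + 1.0), np.abs(beta).max() + C + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if y @ np.clip(beta + mid * y, 0.0, C) > 0.0:
            hi = mid
        else:
            lo = mid
    return np.clip(beta + 0.5 * (lo + hi) * y, 0.0, C)

def smm(K0, y, C, rho, n_iter=50):
    # Table 2: Nesterov's smooth optimization method for indefinite SVM
    n = len(y)
    L = np.max(np.linalg.eigvalsh(K0)) + n * C**2 / rho
    alpha0 = proj_Q1(np.zeros(n), y, C)
    alpha, grad_sum, gamma = alpha0.copy(), np.zeros(n), alpha0.copy()
    for k in range(n_iter):
        g = grad_f(alpha, K0, y, rho)
        gamma = proj_Q1(alpha + g / L, y, C)                        # Step 3
        grad_sum += (k + 1) * g / (2.0 * L)
        beta = proj_Q1(alpha0 + grad_sum, y, C)                     # Step 4
        alpha = 2.0 / (k + 3) * beta + (k + 1.0) / (k + 3) * gamma  # Step 5
    return gamma
```

On a small indefinite $K_0$ the returned iterate stays feasible ($\gamma_k \in Q_1$ up to the bisection tolerance); the per-iteration cost is dominated by the eigen-decomposition inside `grad_f`, matching the complexity discussion below.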
Translating the first-order Nesterov scheme [18, Section 3] to our problem (5), we obtain the smooth optimization algorithm for indefinite SVM; see its pseudo-code in Table 2. One can see [17] for its variants with general step sizes.\n\nThe effectiveness of the first-order Nesterov algorithm largely depends on Steps 2, 3 and 4 outlined in Table 2. By Theorem 1, the computation of $\nabla f(\alpha_k)$ in Step 2 needs an eigen-decomposition. Steps 3 and 4 are the projection problem (10) with $\beta$ replaced respectively by $\alpha_k + \nabla f(\alpha_k)/L$ and $\alpha_0 + \sum_{i=0}^{k}(i+1)\nabla f(\alpha_i)/(2L)$. The convergence of this optimal method was shown in [18]: $\max_{\alpha \in Q_1} f(\alpha) - f(\gamma_k) \le 4L\|\alpha_0 - \alpha^*\|^2/((k+1)(k+2))$, where $\alpha^*$ is one of the optimal solutions. It is worth pointing out that either $\{f(\alpha_k) : k \in \mathbb{N}\}$ or $\{f(\gamma_k) : k \in \mathbb{N}\}$ may fail to increase monotonically; however, the algorithm can be made monotone by a simple modification [18]. In addition, the above estimate of the Lipschitz constant $L$ could be loose in practice and one could further accelerate the algorithm by using a line search scheme [16].\n\n4.3 Related Work and Complexity Discussion\n\nWe list the theoretical time complexity of algorithms for Indefinite SVM. It is worth noting that reaching a target precision $\varepsilon$ means that $-f(\alpha_k) - \min_{\alpha \in Q_1}(-f(\alpha)) = \max_{\alpha \in Q_1} f(\alpha) - f(\alpha_k) \le \varepsilon$. However, this does not mean the duality gap as used in [15] is less than $\varepsilon$. In [15], the objective function is smoothed by adding a quadratic term, and a projected gradient algorithm and an analytic center cutting plane method (ACCPM) were then proposed. 
As proved in Theorem 3, the number of iterations of the projected gradient method is $O(L/\varepsilon)$. In each iteration, the main cost, $O(n^3)$, comes from the eigen-decomposition. Hence the overall complexity of SPGM is $O(n^3 L/\varepsilon)$. As discussed in [15], ACCPM has an overall complexity of $O(n^4 \log(1/\varepsilon)^2)$ for finding an $\varepsilon$-optimal solution. However, this method needs interior-point methods at each iteration, which would be slow for large-scale datasets.\n\nChen and Ye [4] reformulated indefinite SVM as an appealing semi-infinite quadratically constrained linear programming (SIQCLP) without applying extra smoothing techniques. There, the algorithm iteratively solves a linear program with a finite number of quadratic constraints. The iteration complexity of semi-infinite linear programming is usually $O(1/\varepsilon^3)$. In each iteration, one needs to find maximum violation constraints, which involves an eigen-decomposition of complexity $O(n^3)$. Hence the overall complexity is $O(n^3/\varepsilon^3)$. The main limitation of this approach is that one needs to store the growing subset of quadratic constraints indexed by $n \times n$ matrices and iteratively solve a quadratically constrained linear program (QCLP). The QCLP sub-problem can be solved by general software packages, e.g. Mosek (http://www.mosek.com/), which is generally slow in our experience. This tends to make the algorithm inefficient during the iteration process, although pruning techniques were proposed to avoid accumulating too many quadratic constraints.\n\nBased on our theoretical results (Theorem 2), Nesterov's smooth optimization method can be applied. The complexity of this smooth optimization method (SMM) mainly relies on the eigenvalue decomposition in Step 2 of Table 2, which costs $O(n^3)$. Steps 3 and 4 are projections onto the convex region $Q_1$, which cost $O(n \log n)$ as pointed out in [15]. 
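As an aside on implementing these projection steps: the exact $O(n \log n)$ routine is sorting-based, but for illustration a simple bisection on the multiplier of the equality constraint suffices. The sketch below is our own (it assumes both labels occur in $y$) and exploits the KKT form $\alpha(\lambda) = \mathrm{clip}(\beta + \lambda y, 0, C)$, for which $y^\top \alpha(\lambda)$ is nondecreasing in $\lambda$:

```python
import numpy as np

def project_onto_Q1(beta, y, C, iters=100):
    # P_{Q1}(beta) = argmin ||alpha - beta||^2  s.t.  y^T alpha = 0, 0 <= alpha <= C.
    # KKT: alpha(lmb) = clip(beta + lmb*y, 0, C); since y_i^2 = 1 the map
    # lmb -> y^T alpha(lmb) is continuous and nondecreasing, so bisection finds a root.
    lo = -(np.abs(beta).max() + C + 1.0)   # y^T alpha(lo) <= 0
    hi = np.abs(beta).max() + C + 1.0      # y^T alpha(hi) >= 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if y @ np.clip(beta + mid * y, 0.0, C) > 0.0:
            hi = mid
        else:
            lo = mid
    return np.clip(beta + 0.5 * (lo + hi) * y, 0.0, C)
```

Each bisection step is $O(n)$, so with a fixed number of steps the call is linear in $n$; it is easily swapped for the exact sorting-based projection when high accuracy is required.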
The first-order smooth optimization approach [17, 18] has iteration complexity O(√(L/ε)) for finding an ε-optimal solution. Consequently, the overall complexity is O(n³√(L/ε)). Hence, in terms of theoretical complexity, smooth optimization is better than the simplified projected gradient method (SPGM) and SIQCLP. Compared with ACCPM, SMM has better dependence on the sample number n but worse dependence on the precision ε.

5 Experimental Validation

We run our proposed smooth optimization approach and the simplified projected gradient method on various datasets to validate our analysis. The experiments are performed on several benchmark datasets from the UCI repository [19], including Sonar, Ionosphere, Heart, Pima Indians Diabetes, Breast Cancer, and USPS with digits 3 and 5. For the USPS dataset, we randomly select 600 samples for each digit. All reported results are based on 10 random training/test partitions with ratio 4/1. In each data split, as in [4], we first generate a Gaussian kernel matrix K with the hyper-parameter determined by cross-validation on the training data using LIBSVM, and then construct indefinite matrices by adding a small noise matrix, i.e. K0 := K − 0.1Ê. Here, the noise matrix Ê = (E + E′)/2, where E is randomly generated with zero mean and identity covariance. For all methods, the parameters C and ρ for Indefinite SVM are tuned by cross-validation, and we terminate each algorithm when the relative change of the objective value is less than 10⁻⁶.

In Table 3, we report the average test set accuracy (%) and CPU time (seconds) of the different algorithms: the smooth optimization method (SMM), simplified projected gradient method (SPGM), analytic center cutting plane method (ACCPM), and semi-infinite quadratically constrained linear programming (SIQCLP).
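The indefinite-kernel construction used in these experiments (K0 := K − 0.1Ê with Ê = (E + E′)/2) can be reproduced in a few lines. A sketch with NumPy, where the data, sample size, and kernel width are illustrative stand-ins (the paper tunes the width by cross-validation with LIBSVM):

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative stand-ins: 150 samples in 5 dimensions, kernel width gamma = 1
X = rng.standard_normal((150, 5))

sq = (X ** 2).sum(axis=1)
D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T     # squared Euclidean distances
K = np.exp(-1.0 * D)                              # PSD Gaussian kernel matrix

E = rng.standard_normal(K.shape)                  # i.i.d. zero-mean unit-variance noise
K0 = K - 0.1 * (E + E.T) / 2.0                    # K0 := K - 0.1 * Ehat, Ehat symmetric

print(np.linalg.eigvalsh(K0).min())               # negative: K0 is indefinite
```

Because the symmetrized noise matrix has extreme eigenvalues of order √n, the 0.1 scaling is enough to push the smallest eigenvalue of K0 below zero, matching the negative λmin values reported in Table 3.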
For the QCLP sub-problem in the SIQCLP method, we use the Mosek software package (http://www.mosek.com/). We can see that test accuracies are statistically the same across the different algorithms, which validates our analysis of the objective function. In particular, we observe that SMM is consistently more efficient than the other methods, especially for large numbers of training samples. SIQCLP needs much more time since, in each iteration, it must solve a quadratically constrained linear program. In Figure 1, we plot the objective value versus iteration on Sonar and Diabetes for SMM, SPGM, and ACCPM. The SIQCLP approach is not included since it does not iterate directly over the variable α and hence does not produce an increasing sequence of objective values, in contrast to the other three algorithms. From Figure 1, we can see that SMM converges faster than SPGM, which is consistent with the complexity analysis.
The convergence of ACCPM is quite similar to that of SMM, especially for small-sized datasets, which coincides with the complexity analysis in Section 4.3 since ACCPM generally attains a high precision. However, ACCPM needs more time in each iteration than SMM, and this observation becomes more apparent for the relatively large datasets shown in the time comparison of Table 3. (MATLAB codes are available at http://www.princeton.edu/ rluss/IndefiniteSVM.htm.)

Data           Size   λmin    λmax     SMM                SPGM                ACCPM              SIQCLP
Sonar          208    −1.38   21.47    76.34% (0.74s)     76.34% (3.20s)      75.12% (5.12s)     76.09% (244.55s)
Ionosphere     351    −2.08   101.34   93.14% (5.47s)     93.43% (22.73s)     93.54% (28.93s)    93.54% (455.81s)
Heart          270    −1.98   178.03   79.81% (3.54s)     79.44% (11.96s)     79.25% (12.05s)    79.25% (689.17s)
Diabetes       768    −3.44   539.12   70.00% (39.93s)    69.86% (678.85s)    70.52% (345.48s)   69.73% (3134.31s)
Breast-cancer  683    −2.87   290.41   95.93% (5.71s)     96.02% (212.96s)    96.02% (236.00s)   95.40% (4610.82s)
USPS-35        1200   −3.72   112.65   96.33% (23.22s)    96.33% (3713.05s)   96.04% (50.13s)    95.54% (5199.17s)

Table 3: Average test set accuracy (%) and CPU time in seconds (s) of the different algorithms, where λmax (λmin) denotes the average maximum (minimum) eigenvalue of the indefinite kernel matrix over training samples.

Figure 1: Objective value versus iteration: Sonar (left) and Diabetes (right). Curves: SMM (blue), SPGM (red) and ACCPM (black).

6 Conclusion

In this paper we analyzed the regularization formulation for training SVM with indefinite kernels proposed by Luss and d'Aspremont [15]. We showed that the objective function of interest is continuously differentiable with Lipschitz continuous gradient. Our elementary analysis greatly facilitates the application of gradient-based methods.
We formulated a simplified version of the projected gradient method presented in [15] and showed that it has a convergence rate of O(1/k). We further developed Nesterov's smooth optimization method [17, 18] for Indefinite SVM, which has an optimal convergence rate of O(1/k²) for smooth problems. Experiments on various datasets validate our analysis and the efficiency of our proposed optimization approach. In future work, we plan to accelerate the algorithm further using a line-search scheme [16], and to apply the method to real biological datasets such as protein sequence analysis with sequence-alignment similarity measures.

Acknowledgements

This work is supported by EPSRC grant EP/E027296/1.

References

[1] R. Bhatia. Matrix Analysis. Graduate Texts in Mathematics. Springer, 1997.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] J. F. Bonnans and A. Shapiro. Optimization problems with perturbation: a guided tour. SIAM Review, 40: 202–227, 1998.
[4] J. Chen and J. Ye. Training SVM with indefinite kernels. ICML, 2008.
[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[6] A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and its Applications, 30: 56–66, 2007.
[7] J. M. Danskin. The Theory of Max-Min and its Applications to Weapons Allocation Problems. Springer-Verlag, New York, 1967.
[8] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. NIPS, 1998.
[9] B. Haasdonk.
Feature space interpretation of SVMs with indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27: 482–492, 2005.
[10] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
[11] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. ICML, 2002.
[12] C. Lemaréchal and C. Sagastizábal. Practical aspects of the Moreau-Yosida regularization: theoretical preliminaries. SIAM Journal on Optimization, 7: 367–385, 1997.
[13] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. of Machine Learning Research, 5: 27–72, 2004.
[14] H.-T. Lin and C.-J. Lin. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical Report, National Taiwan University, 2003.
[15] R. Luss and A. d'Aspremont. Support vector machine classification with indefinite kernels. NIPS, 2007.
[16] A. Nemirovski. Efficient Methods in Convex Programming. Lecture Notes, 1994.
[17] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
[18] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103: 127–152, 2005.
[19] D. Newman, S. Hettich, C. Blake, and C. Merz. UCI repository of machine learning datasets. 1998.
[20] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels. ICML, 2004.
[21] E. Pekalska, P. Paclik, and R. P. W. Duin. A generalized kernel approach to dissimilarity-based classification. J. of Machine Learning Research, 2: 175–211, 2002.
[22] V. Roth, J. Laub, M. Kawanabe, and J. M. Buhmann. Optimal cluster preserving embedding of nonmetric proximity data.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 25: 1540–1551, 2003.
[23] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20: 1682–1689, 2004.
[24] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2001.
[25] A. J. Smola, Z. L. Óvári, and R. C. Williamson. Regularization with dot-product kernels. NIPS, 2000.
[26] G. Wu, Z. Zhang, and E. Y. Chang. An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. Technical Report, UCSB, 2005.