{"title": "Fast Graph Laplacian Regularized Kernel Learning via Semidefinite\u2013Quadratic\u2013Linear Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1964, "page_last": 1972, "abstract": "Kernel learning is a powerful framework for nonlinear data modeling. Using the kernel trick, a number of problems have been formulated as semidefinite programs (SDPs). These include Maximum Variance Unfolding (MVU) (Weinberger et al., 2004) in nonlinear dimensionality reduction, and Pairwise Constraint Propagation (PCP) (Li et al., 2008) in constrained clustering. Although in theory SDPs can be efficiently solved, the high computational complexity incurred in numerically processing the huge linear matrix inequality constraints has rendered the SDP approach unscalable. In this paper, we show that a large class of kernel learning problems can be reformulated as semidefinite-quadratic-linear programs (SQLPs), which only contain a simple positive semidefinite constraint, a second-order cone constraint and a number of linear constraints. These constraints are much easier to process numerically, and the gain in speedup over previous approaches is at least of the order $m^{2.5}$, where m is the matrix dimension. Experimental results are also presented to show the superb computational efficiency of our approach.", "full_text": "Fast Graph Laplacian Regularized Kernel Learning\nvia Semide\ufb01nite\u2013Quadratic\u2013Linear Programming\n\nXiao-Ming Wu\n\nDept. of IE\n\nThe Chinese University of Hong Kong\n\nAnthony Man-Cho So\n\nDept. of SE&EM\n\nThe Chinese University of Hong Kong\n\nwxm007@ie.cuhk.edu.hk\n\nmanchoso@se.cuhk.edu.hk\n\nZhenguo Li\nDept. of IE\n\nThe Chinese University of Hong Kong\n\nShuo-Yen Robert Li\n\nDept. of IE\n\nThe Chinese University of Hong Kong\n\nzgli@ie.cuhk.edu.hk\n\nbobli@ie.cuhk.edu.hk\n\nAbstract\n\nKernel learning is a powerful framework for nonlinear data modeling. 
Using the kernel trick, a number of problems have been formulated as semidefinite programs (SDPs). These include Maximum Variance Unfolding (MVU) (Weinberger et al., 2004) in nonlinear dimensionality reduction, and Pairwise Constraint Propagation (PCP) (Li et al., 2008) in constrained clustering. Although in theory SDPs can be efficiently solved, the high computational complexity incurred in numerically processing the huge linear matrix inequality constraints has rendered the SDP approach unscalable. In this paper, we show that a large class of kernel learning problems can be reformulated as semidefinite-quadratic-linear programs (SQLPs), which only contain a simple positive semidefinite constraint, a second-order cone constraint and a number of linear constraints. These constraints are much easier to process numerically, and the gain in speedup over previous approaches is at least of the order m^2.5, where m is the matrix dimension. Experimental results are also presented to show the superb computational efficiency of our approach.

1 Introduction

Kernel methods provide a principled framework for nonlinear data modeling, where the inference in the input space can be transferred intact to any feature space by simply treating the associated kernel as inner products, or more generally, as nonlinear mappings on the data (Schölkopf & Smola, 2002). Some well-known kernel methods include support vector machines (SVMs) (Vapnik, 2000), kernel principal component analysis (kernel PCA) (Schölkopf et al., 1998), and kernel k-means (Shawe-Taylor & Cristianini, 2004). Naturally, an important issue in kernel methods is kernel design. Indeed, the performance of a kernel method depends crucially on the kernel used, where different choices of kernels often lead to quite different results. Therefore, substantial efforts have been made to design appropriate kernels for the problems at hand. 
For instance, parametric kernel functions are proposed in (Chapelle & Vapnik, 2000), where the focus is on model selection. The modeling capability of parametric kernels, however, is limited. A more natural idea is to learn specialized nonparametric kernels for specific problems. For instance, in cases where only inner products of the input data are involved, kernel learning is equivalent to the learning of a kernel matrix. This is the main focus of many recent kernel learning methods.
Currently, many different kernel learning frameworks have been proposed. These include spectral kernel learning (Li & Liu, 2009), multiple kernel learning (Lanckriet et al., 2004), and Bregman divergence-based kernel learning (Kulis et al., 2009). Typically, a kernel learning framework consists of two main components: the problem formulation in terms of the kernel matrix, and an optimization procedure for finding a kernel matrix with certain desirable properties. As seen in, e.g., the Maximum Variance Unfolding (MVU) method (Weinberger et al., 2004) for nonlinear dimensionality reduction (see (So, 2007) for a related discussion) and Pairwise Constraint Propagation (PCP) (Li et al., 2008) for constrained clustering, a nice feature of such a framework is that the problem formulation often becomes straightforward. Thus, the major challenge in optimization-based kernel learning lies in the second component, where the key is to find an efficient procedure for obtaining a positive semidefinite kernel matrix that satisfies the desired properties.
Using the kernel trick, most kernel learning problems (Graepel, 2002; Weinberger et al., 2004; Globerson & Roweis, 2007; Song et al., 2008; Li et al., 2008) can naturally be formulated as semidefinite programs (SDPs). Although in theory SDPs can be efficiently solved, the high computational complexity has rendered the SDP approach unscalable. 
An effective and widely used heuristic for speedup is to perform low-rank kernel approximation and matrix factorization (Weinberger et al., 2005; Weinberger et al., 2007; Li et al., 2009). In this paper, we investigate the possibility of further speedup by studying a class of convex quadratic semidefinite programs (QSDPs). These QSDPs arise in many contexts, such as graph Laplacian regularized low-rank kernel learning in nonlinear dimensionality reduction (Sha & Saul, 2005; Weinberger et al., 2007; Globerson & Roweis, 2007; Song et al., 2008; Singer, 2008) and constrained clustering (Li et al., 2009). In the aforementioned papers, a QSDP is reformulated as an SDP with O(m^2) variables and a linear matrix inequality of size O(m^2) × O(m^2). Such a reformulation is highly inefficient and unscalable, as it has an order of m^9 time complexity (Ben-Tal & Nemirovski, 2001, Lecture 6). In this paper, we propose a novel reformulation that exploits the structure of the QSDP and leads to a semidefinite-quadratic-linear program (SQLP) that can be solved by the standard software SDPT3 (Tütüncü et al., 2003). Such a reformulation has the advantage that it only has one positive semidefinite constraint on a matrix of size m × m, one second-order cone constraint of size O(m^2) and a number of linear constraints on O(m^2) variables. As a result, our reformulation is much easier to process numerically. Moreover, a simple complexity analysis shows that the gain in speedup over previous approaches is at least of the order m^2.5. Experimental results show that our formulation is indeed far more efficient than previous ones.
The rest of the paper is organized as follows. We review related kernel learning problems in Section 2 and present our formulation in Section 3. Experimental results are reported in Section 4. 
Section 5 concludes the paper.

2 The Problems

In this section, we briefly review some kernel learning problems that arise in dimensionality reduction and constrained clustering. They include MVU (Weinberger et al., 2004), Colored MVU (Song et al., 2008), (Singer, 2008), Pairwise Semidefinite Embedding (PSDE) (Globerson & Roweis, 2007), and PCP (Li et al., 2008). MVU maximizes the variance of the embedding while preserving local distances of the input data. Colored MVU generalizes MVU with side information such as class labels. PSDE derives an embedding that strictly respects known similarities, in the sense that objects known to be similar are always closer in the embedding than those known to be dissimilar. PCP, designed for constrained clustering, embeds the data on the unit hypersphere so that two objects known to be from the same cluster are embedded to the same point, while two objects known to be from different clusters are embedded orthogonally. In particular, PCP seeks the smoothest mapping for such an embedding, thereby propagating pairwise constraints.
Initially, each of the above problems is formulated as an SDP, whose kernel matrix K is of size n × n, where n denotes the number of objects. Since such an SDP is computationally expensive, one can try to reduce the problem size by using graph Laplacian regularization. In other words, one takes K = QYQ^T, where Q ∈ R^{n×m} consists of the smoothest m eigenvectors of the graph Laplacian (m ≪ n), and Y is of size m × m (Sha & Saul, 2005; Weinberger et al., 2007; Song et al., 2008; Globerson & Roweis, 2007; Singer, 2008; Li et al., 2009). The learning of K is then reduced to the learning of Y, leading to a quadratic semidefinite program (QSDP) that is similar to a standard quadratic program (QP), except that the feasible set of a QSDP resides in the positive semidefinite cone as well. 
The intuition behind this low-rank kernel approximation is that a kernel matrix of the form K = QYQ^T actually, to some degree, preserves the proximity of objects in the feature space. Detailed justification can be found in the related work mentioned above.
Next, we use MVU and PCP as representatives to demonstrate how the SDP formulations emerge from nonlinear dimensionality reduction and constrained clustering.

2.1 MVU

The SDP of MVU (Weinberger et al., 2004) is as follows:

max_K tr(K) = I • K   (1)
s.t. Σ_{i,j=1}^n k_ij = 0,   (2)
k_ii + k_jj − 2k_ij = d_ij^2, ∀(i, j) ∈ N,   (3)
K ⪰ 0,   (4)

where K = (k_ij) denotes the kernel matrix to be learned, I denotes the identity matrix, tr(·) denotes the trace of a square matrix, • denotes the element-wise dot product between matrices, d_ij denotes the Euclidean distance between the i-th and j-th objects, and N denotes the set of neighbor pairs, whose distances are to be preserved in the embedding.
The constraint in (2) centers the embedding at the origin, thus removing the translation freedom. The constraints in (3) preserve local distances. The constraint K ⪰ 0 in (4) specifies that K must be symmetric and positive semidefinite, which is necessary since K is taken as the inner product matrix of the embedding. Note that given the constraint in (2), the variance of the embedding is characterized by V(K) = (1/2n) Σ_{i,j} (k_ii + k_jj − 2k_ij) = tr(K) (Weinberger et al., 2004) (see the related discussion in (So, 2007), Chapter 4). Thus, the SDP in (1-4) maximizes the variance of the embedding while keeping local distances unchanged. After K is obtained, kernel PCA is applied to K to compute the low-dimensional embedding.

2.2 PCP

The SDP of PCP (Li et al., 2008) is:

min_K L̄ • K   (5)
s.t. k_ii = 1, i = 1, 2, . . . , n,   (6)
k_ij = 1, ∀(i, j) ∈ M,   (7)
k_ij = 0, ∀(i, j) ∈ C,   (8)
K ⪰ 0,   (9)

where L̄ denotes the normalized graph Laplacian, M denotes the set of object pairs that are known to be from the same cluster, and C denotes those that are known to be from different clusters.
The constraints in (6) map the objects to the unit hypersphere. The constraints in (7) map two objects that are known to be from the same cluster to the same vector. The constraints in (8) map two objects that are known to be from different clusters to orthogonal vectors. Let X = {x_i}_{i=1}^n be the data set, F be the feature space, and φ : X → F be the associated feature map of K. Then, the degree of smoothness of φ on the data graph can be captured by (Zhou et al., 2004):

S(φ) = (1/2) Σ_{i,j=1}^n w_ij ‖ φ(x_i)/√d_ii − φ(x_j)/√d_jj ‖_F^2 = L̄ • K,   (10)

where w_ij is the similarity of x_i and x_j, d_ii = Σ_{j=1}^n w_ij, and ‖·‖_F is the distance metric in F. The smaller the value S(φ), the smoother is the feature map φ. Thus, the SDP in (5-9) seeks the smoothest feature map that embeds the data on the unit hypersphere and at the same time respects the pairwise constraints. After K is solved, kernel k-means is applied to K to obtain the clusters.

2.3 Low-Rank Approximation: from SDP to QSDP

The SDPs in MVU and PCP are difficult to solve efficiently because their computational complexity scales at least cubically in the size of the matrix variable and the number of constraints (Borchers, 1999). 
A useful heuristic is to use low-rank kernel approximation, which is motivated by the observation that the number of degrees of freedom in the data is often much smaller than the number of parameters in a fully nonparametric kernel matrix K. For instance, it may be equal to or slightly larger than the intrinsic dimension of the data manifold (for dimensionality reduction) or the number of clusters (for clustering). Another more specific observation is that it is often desirable to have nearby objects mapped to nearby points, as is done in MVU or PCP.
Based on these observations, instead of learning a fully nonparametric K, one learns a K of the form K = QYQ^T, where Q and Y are of sizes n × m and m × m, respectively, with m ≪ n. The matrix Q should be smooth in the sense that nearby objects in the input space are mapped to nearby points (the i-th row of Q is taken as a new representation of x_i). Q is computed prior to the learning of K. In this way, the learning of a kernel matrix K is reduced to the learning of a much smaller Y, subject to the constraint that Y ⪰ 0. This idea is used in (Weinberger et al., 2007) and (Li et al., 2009) to speed up MVU and PCP, respectively, and is also adopted in Colored MVU (Song et al., 2008) and PSDE (Globerson & Roweis, 2007) for efficient computation.
The choice of Q can differ between MVU and PCP. In (Weinberger et al., 2007), Q = (v_2, . . . , v_{m+1}), where {v_i} are the eigenvectors of the graph Laplacian. In (Li et al., 2009), Q = (u_1, . . . , u_m), where {u_i} are the eigenvectors of the normalized graph Laplacian. 
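As a concrete illustration of this parametrization, the sketch below (assuming numpy is available; the similarity matrix W is random, standing in for a real data graph, and is not from the paper) builds Q from the smoothest eigenvectors of the normalized graph Laplacian and checks that K = QYQ^T is positive semidefinite of rank at most m for any Y ⪰ 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 3

# Random symmetric similarity matrix (a stand-in for a real data graph)
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
d = W.sum(axis=1)

# Normalized graph Laplacian: Lbar = I - D^{-1/2} W D^{-1/2}
Lbar = np.eye(n) - W / np.sqrt(np.outer(d, d))

# Q = the m smoothest eigenvectors (smallest eigenvalues of Lbar)
eigvals, eigvecs = np.linalg.eigh(Lbar)
Q = eigvecs[:, :m]

# Any Y >= 0 yields a PSD kernel K = Q Y Q^T of rank at most m
Y = rng.standard_normal((m, m))
Y = Y @ Y.T
K = Q @ Y @ Q.T
assert np.linalg.eigvalsh(K).min() >= -1e-9
assert np.linalg.matrix_rank(K) <= m
```

Learning K is thereby reduced to learning the small m × m block Y.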
Since such Q's are obtained from graph Laplacians, we call the learning of K of the form K = QYQ^T the Graph Laplacian Regularized Kernel Learning problem, and denote the methods in (Weinberger et al., 2007) and (Li et al., 2009) by RegMVU and RegPCP, respectively. With K = QYQ^T, RegMVU and RegPCP become:

RegMVU: max_{Y ⪰ 0} tr(Y) − ν Σ_{(i,j)∈N} ((QYQ^T)_ii − 2(QYQ^T)_ij + (QYQ^T)_jj − d_ij^2)^2,   (11)

RegPCP: min_{Y ⪰ 0} Σ_{(i,j,t_ij)∈S} ((QYQ^T)_ij − t_ij)^2,   (12)

where ν > 0 is a regularization parameter and S = {(i, j, t_ij) | (i, j) ∈ M ∪ C, or i = j; t_ij = 1 if (i, j) ∈ M or i = j; t_ij = 0 if (i, j) ∈ C}. Both RegMVU and RegPCP can be succinctly rewritten in the unified form:

min_y y^T A y + b^T y   (13)
s.t. Y ⪰ 0,   (14)

where y = vec(Y) ∈ R^{m^2} denotes the vector obtained by concatenating all the columns of Y, and A ⪰ 0 (Weinberger et al., 2007; Li et al., 2009). Note that this problem is convex since both the objective function and the feasible set are convex.
Problem (13-14) is an instance of the so-called convex quadratic semidefinite program (QSDP), where the objective is a quadratic function in the matrix variable Y. Note that similar QSDPs arise in Colored MVU, PSDE, Conformal Eigenmaps (Sha & Saul, 2005), Locally Rigid Embedding (Singer, 2008), and Kernel Matrix Completion (Graepel, 2002). 
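To see how the unified form (13-14) arises, one can assemble A and b explicitly for a toy RegPCP-style objective. The following sketch (assuming numpy; the matrix Q, the constraint list S and all sizes are made up for illustration) uses column-major vec(Y) and checks the identity Σ((QYQ^T)_ij − t_ij)^2 = y^T A y + b^T y + const:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
Q = rng.standard_normal((n, m))               # stand-in for Laplacian eigenvectors
S = [(0, 1, 1.0), (2, 3, 0.0), (4, 4, 1.0)]   # toy (i, j, t_ij) constraints

# (Q Y Q^T)_ij = vec(q_i q_j^T) . vec(Y), so the objective is quadratic in y
A = np.zeros((m * m, m * m))
b = np.zeros(m * m)
const = 0.0
for i, j, t in S:
    c = np.outer(Q[i], Q[j]).flatten(order='F')
    A += np.outer(c, c)                       # A = sum of c c^T, hence A >= 0
    b += -2.0 * t * c
    const += t ** 2

Y = rng.standard_normal((m, m))
Y = Y @ Y.T                                   # an arbitrary PSD test point
y = Y.flatten(order='F')                      # y = vec(Y), columns concatenated
direct = sum(((Q @ Y @ Q.T)[i, j] - t) ** 2 for i, j, t in S)
assert np.isclose(direct, y @ A @ y + b @ y + const)
assert np.linalg.eigvalsh(A).min() >= -1e-9   # A is positive semidefinite
```

The constant term does not affect the minimizer, which is why it is dropped in (13).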
Before we present our approach for tackling the QSDP (13-14), let us briefly review existing approaches in the literature.

2.4 Previous Approach: from QSDP to SDP

Currently, a typical approach for tackling a QSDP is to use the Schur complement (Boyd & Vandenberghe, 2004) to rewrite it as an SDP (Sha & Saul, 2005; Weinberger et al., 2007; Li et al., 2009; Song et al., 2008; Globerson & Roweis, 2007; Singer, 2008; Graepel, 2002), and then solve it using an SDP solver such as CSDP^1 (Borchers, 1999) or SDPT3^2 (Toh et al., 2006). In this paper, we call this approach the Schur Complement Based SDP (SCSDP) formulation. For the QSDP in (13-14), the equivalent SDP takes the form:

min_{y,ν} ν + b^T y   (15)

s.t. Y ⪰ 0 and
[ I_{m^2}        A^{1/2} y ]
[ (A^{1/2} y)^T      ν     ]  ⪰ 0,   (16)

where A^{1/2} is the matrix square root of A, I_{m^2} is the identity matrix of size m^2 × m^2, and ν is a slack variable serving as an upper bound of y^T A y. The second semidefinite cone constraint is equivalent to (A^{1/2} y)^T (A^{1/2} y) ≤ ν by the Schur complement.
Although the SDP in (15-16) has only m(m + 1)/2 + 1 variables, it has two semidefinite cone constraints, of sizes m × m and (m^2 + 1) × (m^2 + 1), respectively. Such an SDP not only scales poorly, but is also difficult to process numerically. Indeed, by considering Problem (15-16) as an SDP in the standard dual form, the number of iterations required by standard interior-point algorithms is of the order m, and the total number of arithmetic operations required is of the order m^9 (Ben-Tal & Nemirovski, 2001, Lecture 6). In practice, it takes only a few seconds to solve the aforementioned SDP when m = 10, but can take more than 1 day when m = 40 (see Section 4 for details). 

^1 https://projects.coin-or.org/Csdp/
^2 http://www.math.nus.edu.sg/~mattohkc/sdpt3.html
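The Schur-complement step behind (16) is easy to sanity-check numerically. The sketch below (assuming numpy; not part of the paper) verifies that the block matrix [[I, v], [v^T, ν]] is positive semidefinite exactly when v^T v ≤ ν:

```python
import numpy as np

def schur_block(v, nu):
    """Assemble the block matrix [[I, v], [v^T, nu]] as in constraint (16)."""
    k = len(v)
    M = np.zeros((k + 1, k + 1))
    M[:k, :k] = np.eye(k)
    M[:k, k] = v
    M[k, :k] = v
    M[k, k] = nu
    return M

def is_psd(M, tol=1e-9):
    return np.linalg.eigvalsh(M).min() >= -tol

rng = np.random.default_rng(2)
v = rng.standard_normal(4)          # plays the role of A^{1/2} y
# PSD holds iff nu >= v^T v (the Schur complement of the identity block)
assert is_psd(schur_block(v, v @ v + 0.1))
assert not is_psd(schur_block(v, v @ v - 0.1))
```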
Thus, it is not surprising that m is set to a very small value in the related methods: for example, m = 10 in (Weinberger et al., 2007) and m = 15 in (Li et al., 2009). However, as noted by the authors in (Weinberger et al., 2007), a larger m does lead to better performance. In (Li et al., 2009), the authors suggest that m should be larger than the number of clusters.
Is this formulation from QSDP to SDP the best we can have? The answer is no. In the next section, we present a novel formulation that leads to a semidefinite-quadratic-linear program (SQLP), which is much more efficient and scalable than the one above. For instance, it takes about 15 seconds when m = 30, 2 minutes when m = 40, and 1 hour when m = 100, as reported in Section 4.

3 Our Formulation: from QSDP to SQLP

In this section, we formulate the QSDP in (13-14) as an SQLP. Though our focus here is on the QSDP in (13-14), we should point out that our method applies to any convex QSDP.
Recall that the size of A is m^2 × m^2. Let r be the rank of A. With Cholesky factorization, we can obtain an r × m^2 matrix B such that A = B^T B, as A is symmetric positive semidefinite and of rank r (Golub & Loan, 1996). Now, let z = By. Then, the QSDP in (13-14) is equivalent to:

min_{y,z,µ} µ + b^T y   (17)
s.t. z = By,   (18)
z^T z ≤ µ,   (19)
Y ⪰ 0.   (20)

Next, we show that the constraint in (19) is equivalent to a second-order cone constraint. Let K_n be the second-order cone of dimension n, i.e.,

K_n = {(x_0; x) ∈ R^n : x_0 ≥ ‖x‖},

where ‖·‖ denotes the standard Euclidean norm. Let u = ((1+µ)/2, (1−µ)/2, z^T)^T. Then, the following holds.
Theorem 3.1. z^T z ≤ µ if and only if u ∈ K_{r+2}.
Proof. Note that u ∈ R^{r+2}, since z ∈ R^r. Also, note that µ = ((1+µ)/2)^2 − ((1−µ)/2)^2. If z^T z ≤ µ, then ((1+µ)/2)^2 ≥ ((1−µ)/2)^2 + z^T z, which means that (1+µ)/2 ≥ ‖((1−µ)/2, z^T)^T‖. In particular, we have u ∈ K_{r+2}. Conversely, if u ∈ K_{r+2}, then ((1+µ)/2)^2 ≥ ((1−µ)/2)^2 + z^T z, thus implying z^T z ≤ µ.

Let e_i (where i = 1, 2, . . . , r + 2) be the i-th basis vector, and let C = (0_{r×2}, I_{r×r}). Then, we have (e_1 − e_2)^T u = µ, (e_1 + e_2)^T u = 1, and z = Cu.

Figure 1: Swiss Roll. (a) The true manifold. (b) A set of 2000 points sampled from the manifold.

Hence, by Theorem 3.1, the problem in (17-20) is equivalent to:

min_{y,u} (e_1 − e_2)^T u + b^T y   (21)
s.t. (e_1 + e_2)^T u = 1,   (22)
By − Cu = 0,   (23)
u ∈ K_{r+2},   (24)
Y ⪰ 0,   (25)

which is an instance of the SQLP problem (Tütüncü et al., 2003). Note that in this formulation, we have traded the semidefinite cone constraint of size (m^2 + 1) × (m^2 + 1) in (16) for one second-order cone constraint of size r + 2 and r + 1 linear constraints. As it turns out, such a formulation is much easier to process numerically and can be solved much more efficiently. Indeed, using standard interior-point algorithms, the number of iterations required is of the order √m (Ben-Tal & Nemirovski, 2001, Lecture 6), and the total number of arithmetic operations required is of the order m^6.5 (Tütüncü et al., 2003). This compares very favorably with the m^9 arithmetic complexity of the SCSDP approach, and our experimental results indicate that the speedup in computation is quite substantial. 
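The algebra behind this reformulation can be verified on random data. The sketch below (assuming numpy; sizes are made up, and an eigendecomposition stands in for the Cholesky factorization) obtains a factor B with A = B^T B and checks the cone membership and linear identities of Theorem 3.1:

```python
import numpy as np

rng = np.random.default_rng(3)
m2, r_true = 5, 3                 # m2 stands in for m^2
G = rng.standard_normal((r_true, m2))
A = G.T @ G                       # PSD matrix of rank r_true

# Factor A = B^T B with B of size r x m2 (eigendecomposition here,
# in place of the Cholesky factorization used in the text)
w, V = np.linalg.eigh(A)
keep = w > 1e-10
B = (np.sqrt(w[keep]) * V[:, keep]).T
r = B.shape[0]
assert np.allclose(B.T @ B, A)

y = rng.standard_normal(m2)
z = B @ y
mu = z @ z                        # tightest feasible mu equals y^T A y
assert np.isclose(mu, y @ A @ y)

# Theorem 3.1: z^T z <= mu iff u = ((1+mu)/2, (1-mu)/2, z) lies in K_{r+2}
u = np.concatenate([[(1 + mu) / 2, (1 - mu) / 2], z])
assert u[0] >= np.linalg.norm(u[1:]) - 1e-9    # second-order cone membership
assert np.isclose(u[0] - u[1], mu)             # (e1 - e2)^T u = mu, as in (21)
assert np.isclose(u[0] + u[1], 1.0)            # (e1 + e2)^T u = 1, as in (22)
```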
Moreover, in contrast with the SCSDP formulation, which does not take advantage of the low-rank structure of A, our formulation does exploit such structure.

4 Experimental Results

In this section, we perform several experiments to demonstrate the viability of our SQLP formulation and its superior computational performance. Since both the SQLP formulation and the previous SCSDP formulation can be solved by standard software packages to a satisfactory gap tolerance, the focus of this comparison is not on accuracy but on computational efficiency.
We set the relative gap tolerance for both algorithms to be 1e-08. We use SDPT3 (Toh et al., 2006; Tütüncü et al., 2003) to solve the SQLP. We follow (Weinberger et al., 2007; Li et al., 2009) and use CSDP 6.0.1 (Borchers, 1999) to solve the SCSDP. All experiments are conducted in Matlab 7.6.0 (R2008a) on a PC with a 2.5GHz CPU and 4GB RAM.
Two benchmark databases, Swiss Roll^3 and USPS^4, are used in our experiments. Swiss Roll (Figure 1(a)) is a standard manifold model used for manifold learning and nonlinear dimensionality reduction. In the experiments, we use the data set shown in Figure 1(b), which consists of 2000 points sampled from the Swiss Roll manifold. USPS is a database of handwritten digits widely used for clustering and classification. It contains images of handwritten digits from 0 to 9 of size 16 × 16, and has 7291 training examples and 2007 test examples. In the experiments, we use a subset of USPS with 2000 images, containing the first 200 examples of each digit from 0-9 in the training data. The feature representing each image is a vector formed by concatenating all the columns of the image intensities. 
In the sequel, we shall refer to the two subsets used in the experiments simply as Swiss Roll and USPS.

^3 http://www.cs.toronto.edu/~roweis/lle/code.html
^4 http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Table 1: Computational Results on Swiss Roll (Time: second; # Iter: number of iterations)

m    | SCSDP Time | # Iter | Time per Iter | SQLP Time | # Iter | Time per Iter | rank(A)
10   | 3.84       | 29     | 0.13          | 0.96      | 32     | 0.03          | 64
15   | 60.36      | 30     | 2.01          | 1.75      | 31     | 0.06          | 153
20   | 557.79     | 32     | 17.43         | 4.48      | 35     | 0.13          | 264
25   | 2821.76    | 34     | 82.99         | 7.84      | 37     | 0.21          | 403
30   | 13039.30   | 37     | 352.41        | 13.39     | 35     | 0.38          | 537
35   | 38559.50   | 33     | 1168.50       | 29.74     | 35     | 0.85          | 670
40   | > 1 day    | —      | —             | 74.01     | 35     | 2.12          | 852
50   | —          | —      | —             | 213.26    | 35     | 6.09          | 1152
60   | —          | —      | —             | 467.90    | 35     | 13.37         | 1451
80   | —          | —      | —             | 1729.65   | 39     | 44.35         | 2062
100  | —          | —      | —             | 3988.31   | 36     | 110.79        | 2623

Table 2: Computational Results on USPS (Time: second; # Iter: number of iterations)

m    | SCSDP Time | # Iter | Time per Iter | SQLP Time | # Iter | Time per Iter | rank(A)
10   | 2.84       | 21     | 0.14          | 0.47      | 16     | 0.03          | 100
15   | 42.96      | 22     | 1.95          | 1.26      | 17     | 0.07          | 225
20   | 461.38     | 27     | 17.09         | 3.35      | 17     | 0.20          | 400
25   | 2572.72    | 31     | 82.99         | 5.97      | 14     | 0.43          | 625
30   | 10576.01   | 30     | 352.53        | 15.72     | 19     | 0.83          | 900
35   | 35173.60   | 30     | 1172.50       | 44.53     | 17     | 2.62          | 1225
40   | > 1 day    | —      | —             | 133.58    | 20     | 6.68          | 1600
50   | —          | —      | —             | 362.24    | 16     | 22.64         | 2379
60   | —          | —      | —             | 936.58    | 19     | 49.29         | 2938
80   | —          | —      | —             | 1784.12   | 17     | 104.95        | 3112
100  | —          | —      | —             | 2900.44   | 17     | 170.61        | 3111

The Swiss Roll (resp. USPS) is used to derive the QSDP in RegMVU (resp. RegPCP). For RegMVU, the 4NN graph is used, i.e., w_ij = 1 if x_i is within the 4NN of x_j or vice versa, and w_ij = 0 otherwise. We verified that the 4NN graph derived from our Swiss Roll data is connected. 
For RegPCP, we construct the graph following the approach suggested in (Li et al., 2009). Specifically, we have w_ij = exp(−d_ij^2/(2σ^2)) if x_i is within the 20NN of x_j or vice versa, and w_ij = 0 otherwise. Here, σ is the average distance from each object to its 20-th nearest neighbor. For the pairwise constraints used in RegPCP, we randomly generate 20 similarity constraints for each class, and 20 dissimilarity constraints for every two classes, yielding a total of 1100 constraints. For each data set, m ranges over {10, 15, 20, 25, 30, 35, 40, 50, 60, 80, 100}. In summary, for each data set, 11 QSDPs are formed. Each QSDP gives rise to one SQLP and one SCSDP. Therefore, for each data set, 11 SQLPs and 11 SCSDPs are derived.

4.1 The Results

The computational results of the programs are shown in Tables 1 and 2. For each program, we report the total computation time, the number of iterations needed to achieve the required tolerance, and the average time per iteration in solving the program. A dash (—) indicates that the corresponding program takes too much time to solve. We choose to stop a program if it fails to converge within 1 day. This happens for the SCSDP with m = 40 on both data sets.
From the tables, we see that solving an SQLP is consistently much faster than solving an SCSDP. To see the scalability, we plot the solution time (Time) against the problem size (m) in Figure 2. It can be seen that the solution time of the SCSDP grows much faster than that of the SQLP. This demonstrates the superiority of our proposed approach.

Figure 2: Curves on computational cost: Solution Time vs. Problem Scale. (a) Swiss Roll. (b) USPS.

We also note that the per-iteration computational costs of the SCSDP and the SQLP are drastically different. Indeed, for the same problem size m, each iteration takes much less time for the SQLP than for the SCSDP. 
This is not very surprising, as the SQLP formulation takes advantage of the low rank\nstructure of the data matrix A.\n\n5 Conclusions\n\nWe have studied a class of convex optimization programs called convex Quadratic Semide\ufb01nite\nPrograms (QSDPs), which arise naturally from graph Laplacian regularized kernel learning (Sha &\nSaul, 2005; Weinberger et al., 2007; Li et al., 2009; Song et al., 2008; Globerson & Roweis, 2007;\nSinger, 2008). A QSDP is similar to a QP, except that it is subject to a semide\ufb01nite cone constraint\nas well. To tackle the QSDP, one typically uses the Schur complement to rewrite it as an SDP\n(SCSDP), thus resulting in a large linear matrix inequality constraint. In this paper, we argue that\nthis formulation is not computationally optimal and have proposed a novel formulation that leads to\na semide\ufb01nite-quadratic-linear program (SQLP). Our formulation introduces one positive semidef-\ninite constraint, one second-order cone constraint and a set of linear constraints. This should be\ncontrasted with the two large semide\ufb01nite cone constraints in the SCSDP. Our complexity analysis\nand experimental results have shown that the proposed SQLP formulation scales far better than the\nSCSDP formulation.\n\nAcknowledgements\n\nThe authors would like to thank Professor Kim-Chuan Toh for his valuable comments. This re-\nsearch work was supported in part by GRF grants CUHK 2150603, CUHK 414307 and CRF grant\nCUHK2/06C from the Research Grants Council of the Hong Kong SAR, China, as well as the\nNSFC-RGC joint research grant N CUHK411/07.\n\nReferences\nBen-Tal, A., & Nemirovski, A. (2001). Lectures on Modern Convex Optimization: Analysis, Algorithms, and\nEngineering Applications. MPS\u2013SIAM Series on Optimization. Philadelphia, Pennsylvania: Society for\nIndustrial and Applied Mathematics.\n\nBorchers, B. (1999). CSDP, a C Library for Semide\ufb01nite Programming. 
Optimization Methods and Software, 11/12, 613–623.

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press. Available online at http://www.stanford.edu/~boyd/cvxbook/.

Chapelle, O., & Vapnik, V. (2000). Model Selection for Support Vector Machines. In S. A. Solla, T. K. Leen and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference, 230–236. Cambridge, Massachusetts: The MIT Press.

Globerson, A., & Roweis, S. (2007). Visualizing Pairwise Similarity via Semidefinite Programming. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (pp. 139–146).

Golub, G. H., & Loan, C. F. V. (1996). Matrix Computations. Baltimore, Maryland: The Johns Hopkins University Press. Third edition.

Graepel, T. (2002). Kernel Matrix Completion by Semidefinite Programming. Proceedings of the 12th International Conference on Artificial Neural Networks (pp. 694–699). Springer–Verlag.

Kulis, B., Sustik, M. A., & Dhillon, I. S. (2009). Low–Rank Kernel Learning with Bregman Matrix Divergences. The Journal of Machine Learning Research, 10, 341–376.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the Kernel Matrix with Semidefinite Programming. The Journal of Machine Learning Research, 5, 27–72.

Li, Z., & Liu, J. (2009). Constrained Clustering by Spectral Kernel Learning. To appear in the Proceedings of the 12th IEEE International Conference on Computer Vision.

Li, Z., Liu, J., & Tang, X. (2008). Pairwise Constraint Propagation by Semidefinite Programming for Semi–Supervised Classification. 
Proceedings of the 25th International Conference on Machine Learning (pp. 576–583).

Li, Z., Liu, J., & Tang, X. (2009). Constrained Clustering via Spectral Regularization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2009 (pp. 421–428).

Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, Massachusetts: The MIT Press.

Schölkopf, B., Smola, A. J., & Müller, K.-R. (1998). Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10, 1299–1319.

Sha, F., & Saul, L. K. (2005). Analysis and Extension of Spectral Methods for Nonlinear Dimensionality Reduction. Proceedings of the 22nd International Conference on Machine Learning (pp. 784–791).

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge: Cambridge University Press.

Singer, A. (2008). A Remark on Global Positioning from Local Distances. Proceedings of the National Academy of Sciences, 105, 9507–9511.

So, A. M.-C. (2007). A Semidefinite Programming Approach to the Graph Realization Problem: Theory, Applications and Extensions. Doctoral dissertation, Stanford University.

Song, L., Smola, A., Borgwardt, K., & Gretton, A. (2008). Colored Maximum Variance Unfolding. In J. C. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference, 1385–1392. Cambridge, Massachusetts: The MIT Press.

Toh, K. C., Tütüncü, R. H., & Todd, M. J. (2006). On the Implementation and Usage of SDPT3 — A MATLAB Software Package for Semidefinite–Quadratic–Linear Programming, Version 4.0. User's Guide.

Tütüncü, R. H., Toh, K. C., & Todd, M. J. (2003). Solving Semidefinite–Quadratic–Linear Programs using SDPT3. 
Mathematical Programming, 95, 189–217.

Vapnik, V. N. (2000). The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. New York: Springer–Verlag. Second edition.

Weinberger, K. Q., Packer, B. D., & Saul, L. K. (2005). Nonlinear Dimensionality Reduction by Semidefinite Programming and Kernel Matrix Factorization. Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 381–388).

Weinberger, K. Q., Sha, F., & Saul, L. K. (2004). Learning a Kernel Matrix for Nonlinear Dimensionality Reduction. Proceedings of the 21st International Conference on Machine Learning (pp. 85–92).

Weinberger, K. Q., Sha, F., Zhu, Q., & Saul, L. K. (2007). Graph Laplacian Regularization for Large–Scale Semidefinite Programming. Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference (pp. 1489–1496). Cambridge, Massachusetts: The MIT Press.

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2004). Learning with Local and Global Consistency. Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference (pp. 595–602). Cambridge, Massachusetts: The MIT Press.
", "award": [], "sourceid": 792, "authors": [{"given_name": "Xiao-ming", "family_name": "Wu", "institution": null}, {"given_name": "Anthony", "family_name": "So", "institution": null}, {"given_name": "Zhenguo", "family_name": "Li", "institution": null}, {"given_name": "Shuo-yen", "family_name": "Li", "institution": null}]}