{"title": "Multi-label Multiple Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 777, "page_last": 784, "abstract": "We present a multi-label multiple kernel learning (MKL) formulation, in which the data are embedded into a low-dimensional space directed by the instance-label correlations encoded into a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, and it can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained and convex optimization problem. In addition, we show that the objective function of the approximate formulation is continuously differentiable with Lipschitz gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results have been reported in comparison with representative algorithms.", "full_text": "Multi-label Multiple Kernel Learning\n\nShuiwang Ji\nArizona State University, Tempe, AZ 85287\nshuiwang.ji@asu.edu\n\nLiang Sun\nArizona State University, Tempe, AZ 85287\nsun.liang@asu.edu\n\nRong Jin\nMichigan State University, East Lansing, MI 48824\nrongjin@cse.msu.edu\n\nJieping Ye\nArizona State University, Tempe, AZ 85287\njieping.ye@asu.edu\n\nAbstract\n\nWe present a multi-label multiple kernel learning (MKL) formulation in which the data are embedded into a low-dimensional space directed by the instance-label correlations encoded into a hypergraph. 
We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, which can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained convex optimization problem. In addition, we show that the objective function of the approximate formulation is differentiable with Lipschitz continuous gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results are reported in comparison with representative algorithms.\n\n1 Introduction\n\nSpectral graph-theoretic methods have recently been widely used in unsupervised and semi-supervised learning. In this paradigm, a weighted graph is constructed for the data set, where the nodes represent the data points and the edge weights characterize the relationships between vertices. The structural and spectral properties of the graph can then be exploited to perform the learning task. 
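As a minimal illustration of this paradigm (a generic sketch on a hypothetical toy graph, not code from this paper), one can build a normalized graph Laplacian from an edge-weight matrix and use its bottom eigenvectors as a low-dimensional embedding:

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized Laplacian L = I - D^{-1/2} W D^{-1/2} of a weighted graph."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def spectral_embedding(W, dim):
    """Embed the vertices using the eigenvectors of L with smallest eigenvalues."""
    vals, vecs = np.linalg.eigh(normalized_laplacian(W))
    return vecs[:, :dim]

# hypothetical toy graph: two tightly coupled pairs, weakly linked to each other
W = np.array([[0.0, 1.0, 0.05, 0.0],
              [1.0, 0.0, 0.0, 0.05],
              [0.05, 0.0, 0.0, 1.0],
              [0.0, 0.05, 1.0, 0.0]])
Z = spectral_embedding(W, 2)  # each row is a 2-dimensional vertex embedding
```

Vertices joined by heavy edges receive nearby embedding coordinates, which is the property the multi-label construction below exploits.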
One fundamental limitation of using traditional graphs for this task is that they can only represent pairwise relationships between data points, and hence higher-order information cannot be captured [1]. Hypergraphs [1, 2] generalize traditional graphs by allowing edges, called hyperedges, to connect more than two vertices, thereby being able to capture the relationships among multiple vertices.\n\nIn this paper, we propose to use a hypergraph to capture the correlation information for multi-label learning [3]. In particular, we propose to construct a hypergraph for multi-label data in which all data points annotated with a common label are included in a hyperedge, thereby capturing the similarity among data points with a common label. By exploiting the spectral properties of the constructed hypergraph, we propose to embed the multi-label data into a lower-dimensional space in which data points with a common label tend to be close to each other. We formulate the multi-label learning problem in the kernel-induced feature space, and show that the well-known kernel canonical correlation analysis (KCCA) [4] is a special case of the proposed framework. As the kernel plays an essential role in the formulation, we propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the multiple kernel learning (MKL) framework. The resulting formulation involves a non-smooth min-max problem, and we show that it can be cast into a semi-infinite linear program (SILP). To further improve the efficiency and reduce the non-smoothness effect of the SILP formulation, we propose an approximate formulation by introducing a smoothing term into the original problem. The resulting formulation is unconstrained and convex. In addition, the objective function of the approximate formulation is shown to be differentiable with Lipschitz continuous gradient. 
We can thus employ Nesterov's method [5, 6], which solves smooth convex problems with the optimal convergence rate, to compute the solution efficiently.\n\nWe apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, which document the spatial and temporal dynamics of gene expression during Drosophila embryogenesis [7]. Comparative analysis of such images can potentially reveal new genetic interactions and yield insights into the complex regulatory networks governing embryonic development. To facilitate pattern comparison and searching, groups of images are annotated with a variable number of labels by human curators in the Berkeley Drosophila Genome Project (BDGP) high-throughput study [7]. However, the number of available images produced by high-throughput in situ hybridization is rapidly increasing, and it is therefore tempting to design computational methods to automate this task [8]. Since the labels are associated with groups of a variable number of images, we propose to extract invariant features from each image and construct kernels between groups of images by employing the vocabulary-guided pyramid match algorithm [9]. By applying various local descriptors, we obtain multiple kernel matrices, and the proposed multi-label MKL formulation is applied to obtain an optimal kernel matrix for the low-dimensional embedding. Experimental results demonstrate the effectiveness of the kernel matrices obtained by the proposed formulation. Moreover, the approximate formulation is shown to yield results similar to those of the original formulation, while being much more efficient.\n\n2 Multi-label Learning with Hypergraph\n\nAn essential issue in learning from multi-label data is how to exploit the correlation information among labels. 
We propose to capture such information through a hypergraph as described below.\n\n2.1 Hypergraph Spectral Learning\n\nHypergraphs generalize traditional graphs by allowing hyperedges to connect more than two vertices, thus capturing the joint relationships among multiple vertices. We propose to construct a hypergraph for multi-label data in which each data point is represented as a vertex. To document the joint similarity among data points annotated with a common label, we propose to construct a hyperedge for each label and include all data points annotated with a common label into one hyperedge. Following the spectral graph embedding theory [10], we propose to compute the low-dimensional embedding through a linear transformation W by solving the following optimization problem:\n\n    min_W  tr( W^T φ(X) L φ(X)^T W )\n    subject to  W^T ( φ(X) φ(X)^T + λI ) W = I,    (1)\n\nwhere φ(X) = [φ(x_1), ···, φ(x_n)] is the data matrix consisting of n data points in the feature space, φ is the feature mapping, L is the normalized Laplacian matrix derived from the hypergraph, and λ > 0 is the regularization parameter. In this formulation, the instance-label correlations are encoded into L through the hypergraph, and data points sharing a common label tend to be close to each other in the embedded space.\n\nIt follows from the representer theorem [11] that W = φ(X)B for some matrix B ∈ R^{n×k}, where k is the number of labels. By noting that L = I − C for some matrix C, the problem in Eq. (1) can be reformulated as\n\n    max_B  tr( B^T (KCK) B )\n    subject to  B^T ( K^2 + λK ) B = I,    (2)\n\nwhere K = φ(X)^T φ(X) is the kernel matrix. Kernel canonical correlation analysis (KCCA) [4] is a widely-used method for dimensionality reduction. It can be shown [4] that KCCA is obtained by substituting C = Y^T (Y Y^T)^{−1} Y in Eq. 
(2), where Y ∈ R^{k×n} is the label indicator matrix. Thus, KCCA is a special case of the proposed formulation.\n\n2.2 A Semi-infinite Linear Program Formulation\n\nIt follows from the theory of kernel methods [11] that the kernel K in Eq. (2) uniquely determines the feature mapping φ. Thus, kernel selection (learning) is one of the central issues in kernel methods. Following the MKL framework [12], we propose to learn an optimal kernel matrix by integrating multiple candidate kernel matrices, that is,\n\n    K ∈ K = { K = Σ_{j=1}^p θ_j K_j | θ^T e = 1, θ ≥ 0 },    (3)\n\nwhere {K_j}_{j=1}^p are the p candidate kernel matrices, {θ_j}_{j=1}^p are the weights for the linear combination, and e is the vector of all ones of length p. We have assumed in Eq. (3) that all the candidate kernel matrices are normalized to have a unit trace value. It has been shown [8] that the optimal weights maximizing the objective function in Eq. (2) can be obtained by solving a semi-infinite linear program (SILP) [13], in which a linear objective is optimized subject to an infinite number of linear constraints, as summarized in the following theorem:\n\nTheorem 2.1. Given a set of p kernel matrices {K_j}_{j=1}^p, the optimal kernel matrix in K that maximizes the objective function in Eq. (2) can be obtained by solving the following SILP problem:\n\n    max_{θ,γ}  γ\n    subject to  θ ≥ 0, θ^T e = 1,    (4)\n                Σ_{j=1}^p θ_j S_j(Z) ≥ γ, for all Z ∈ R^{n×k},    (5)\n\nwhere S_j(Z), for j = 1, ···, p, is defined as\n\n    S_j(Z) = Σ_{i=1}^k ( (1/4) z_i^T z_i + (1/(4λ)) z_i^T K_j z_i − z_i^T h_i ),    (6)\n\nZ = [z_1, ···, z_k], H is obtained from C such that HH^T = C, and H = [h_1, ···, h_k].\n\nNote that the matrix C is symmetric and positive semidefinite. Moreover, for the L considered in this paper, we have rank(C) = k. Hence, H ∈ R^{n×k} is always well-defined. The SILP formulation in Theorem 2.1 can be solved by the column generation technique as in [14].\n\n3 The Approximate Formulation\n\nThe multi-label kernel learning formulation proposed in Theorem 2.1 involves optimizing a linear objective subject to an infinite number of constraints. The column generation technique used to solve this problem adds constraints to the problem successively until all the constraints are satisfied. Since the convergence rate of this algorithm is slow, the problem solved at each iteration may involve a large number of constraints, and hence is computationally expensive. In this section, we propose an approximate formulation by introducing a smoothing term into the original problem. This results in an unconstrained and smooth convex problem. 
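The effect of such entropic smoothing can be checked numerically on a toy instance: for any vector of scores s (playing the role of the values S_j(Z)), the smoothed maximum equals µ log Σ_j exp(s_j/µ) (the closed form derived in Lemma 3.1) and lies within µ log p of the true maximum. This is an illustrative sketch, not code from the paper; the score vector is hypothetical:

```python
import numpy as np

def f_max(s):
    # non-smooth objective: max over theta in the simplex of theta^T s
    return float(np.max(s))

def f_mu(s, mu):
    # entropy-smoothed maximum: mu * log(sum_j exp(s_j / mu))
    s = np.asarray(s, dtype=float)
    m = s.max()  # shift for numerical stability; the identity is shift-invariant
    return m + mu * np.log(np.sum(np.exp((s - m) / mu)))

s = np.array([0.3, -1.2, 0.9, 0.5])  # hypothetical scores
p, mu = len(s), 0.05
gap = f_mu(s, mu) - f_max(s)  # nonnegative and at most mu * log(p)
```

Shrinking µ tightens the approximation but, as Lemma 4.1 later shows, inflates the Lipschitz constant of the gradient, which is the trade-off behind the choice µ = O(1/N).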
We propose to employ existing methods to solve the smooth convex optimization problem efficiently in the next section.\n\nBy rewriting the formulation in Theorem 2.1 as\n\n    max_{θ: θ^T e = 1, θ ≥ 0}  min_Z  Σ_{j=1}^p θ_j S_j(Z)\n\nand exchanging the minimization and maximization, the SILP formulation can be expressed as\n\n    min_Z  f(Z),    (7)\n\nwhere f(Z) is defined as\n\n    f(Z) = max_{θ: θ^T e = 1, θ ≥ 0}  Σ_{j=1}^p θ_j S_j(Z).    (8)\n\nThe maximization problem in Eq. (8) with respect to θ leads to a non-smooth objective function for f(Z). To reduce this effect, we introduce a smoothing term and modify the objective to f_µ(Z) as\n\n    f_µ(Z) = max_{θ: θ^T e = 1, θ ≥ 0} { Σ_{j=1}^p θ_j S_j(Z) − µ Σ_{j=1}^p θ_j log θ_j },    (9)\n\nwhere µ is a positive constant controlling the approximation. The following lemma shows that the problem in Eq. (9) can be solved analytically:\n\nLemma 3.1. The optimization problem in Eq. (9) can be solved analytically, and the optimal value can be expressed as\n\n    f_µ(Z) = µ log( Σ_{j=1}^p exp( S_j(Z)/µ ) ).    (10)\n\nProof. Define the Lagrangian function for the optimization problem in Eq. (9) as\n\n    L = Σ_{j=1}^p θ_j S_j(Z) − µ Σ_{j=1}^p θ_j log θ_j + Σ_{j=1}^p α_j θ_j + ( Σ_{j=1}^p θ_j − 1 ) β,    (11)\n\nwhere {α_j}_{j=1}^p and β are Lagrangian dual variables. Taking the derivative of the Lagrangian function with respect to θ_j and setting it to zero, we obtain θ_j = exp( (S_j(Z) + α_j + β − µ)/µ ). It follows from the complementarity condition that α_j θ_j = 0 for j = 1, ···, p. Since θ_j ≠ 0, we have α_j = 0 for j = 1, ···, p. 
By removing {α_j}_{j=1}^p and substituting θ_j into the objective function in Eq. (9), we obtain f_µ(Z) = µ − β. Since µ − β = S_j(Z) − µ log θ_j, we have\n\n    θ_j = exp( (S_j(Z) − f_µ(Z))/µ ).    (12)\n\nFollowing 1 = Σ_{j=1}^p θ_j = Σ_{j=1}^p exp( (S_j(Z) − f_µ(Z))/µ ), we obtain Eq. (10).\n\nThe above discussion shows that we can approximate the original non-smooth constrained min-max problem in Eq. (7) by the following smooth unconstrained minimization problem:\n\n    min_Z  f_µ(Z),    (13)\n\nwhere f_µ(Z) is defined in Eq. (10). We show in the following two lemmas that the approximate formulation in Eq. (13) is convex and has a guaranteed approximation bound controlled by µ.\n\nLemma 3.2. The problem in Eq. (13) is a convex optimization problem.\n\nProof. The optimization problem in Eq. (13) can be expressed equivalently as\n\n    min_{Z, {u_j}_{j=1}^p, {v_j}_{j=1}^p}  µ log( Σ_{j=1}^p exp( u_j + v_j − (1/µ) Σ_{i=1}^k z_i^T h_i ) )\n    subject to  µ u_j ≥ (1/4) Σ_{i=1}^k z_i^T z_i,  µ v_j ≥ (1/(4λ)) Σ_{i=1}^k z_i^T K_j z_i,  j = 1, ···, p.    (14)\n\nSince the log-exponential-sum function is a convex function and the two constraints are second-order cone constraints, the problem in Eq. (13) is a convex optimization problem.\n\nLemma 3.3. Let f(Z) and f_µ(Z) be defined as above. Then we have f_µ(Z) ≥ f(Z) and |f_µ(Z) − f(Z)| ≤ µ log p.\n\nProof. 
The term −Σ_{j=1}^p θ_j log θ_j defines the entropy of {θ_j}_{j=1}^p when it is considered as a probability distribution, since θ ≥ 0 and θ^T e = 1. Hence, this term is non-negative and f_µ(Z) ≥ f(Z). It is known from the property of entropy that −Σ_{j=1}^p θ_j log θ_j is maximized with a uniform {θ_j}_{j=1}^p, i.e., θ_j = 1/p for j = 1, ···, p. Thus, we have −Σ_{j=1}^p θ_j log θ_j ≤ log p and |f_µ(Z) − f(Z)| = −µ Σ_{j=1}^p θ_j log θ_j ≤ µ log p. This completes the proof of the lemma.\n\n4 Solving the Approximate Formulation Using Nesterov's Method\n\nNesterov's method (known as "the optimal method" in [5]) is an algorithm for solving smooth convex problems with the optimal rate of convergence. In this method, the objective function needs to be differentiable with Lipschitz continuous gradient. In order to apply this method to solve the proposed approximate formulation, we first compute the Lipschitz constant for the gradient of the function f_µ(Z), as summarized in the following lemma:\n\nLemma 4.1. Let f_µ(Z) be defined as in Eq. (10). Then the Lipschitz constant L of the gradient of f_µ(Z) can be bounded from above as\n\n    L ≤ L_µ,    (15)\n\nwhere L_µ is defined as\n\n    L_µ = 1/2 + (1/(2λ)) max_{1≤j≤p} λ_max(K_j) + (1/(8µλ^2)) tr(Z^T Z) max_{1≤i,j≤p} λ_max( (K_i − K_j)(K_i − K_j)^T ),    (16)\n\nand λ_max(·) denotes the maximum eigenvalue. Moreover, the distance from the origin to the optimal set of Z can be bounded as tr(Z^T Z) ≤ R_µ^2, where R_µ^2 is defined as\n\n    R_µ^2 = ( Σ_{i=1}^k ||[C_j]_i||_2 + sqrt( 4µ log p + tr( C_j^T [I + (1/λ) K_j] C_j ) ) )^2,    (17)\n\nC_j = 2 ( I + (1/λ) K_j )^{−1} H, and [C_j]_i denotes the ith column of C_j.\n\nProof. 
To compute the Lipschitz constant for the gradient of f_µ(Z), we first compute the first and second order derivatives as follows:\n\n    ∇f_µ(Z) = Σ_{j=1}^p g_j ( vec(Z)/2 + vec(K_j Z)/(2λ) ) − vec(H),    (18)\n\n    ∇^2 f_µ(Z) = (1/2) I + (1/(2λ)) Σ_{j=1}^p g_j D_k(K_j) + (1/(8µ)) Σ_{i,j=1}^p g_i g_j ( vec(K_i Z)/λ − vec(K_j Z)/λ )( vec(K_i Z)/λ − vec(K_j Z)/λ )^T,    (19)\n\nwhere vec(·) converts a matrix into a vector, D_k(K_j) ∈ R^{(nk)×(nk)} is a block diagonal matrix with each of its k diagonal blocks equal to K_j, and g_j = exp(S_j(Z)/µ) / Σ_{i=1}^p exp(S_i(Z)/µ). Then we have\n\n    L ≤ 1/2 + (1/(2λ)) max_{1≤j≤p} λ_max(K_j) + (1/(8µλ^2)) max_{1≤i,j≤p} tr( Z^T (K_i − K_j)(K_i − K_j)^T Z ) ≤ L_µ,\n\nwhere L_µ is defined in Eq. (16).\n\nWe next derive the upper bound for tr(Z^T Z). To this end, we first rewrite S_j(Z) as\n\n    S_j(Z) = (1/4) tr( (Z − C_j)^T [I + (1/λ) K_j] (Z − C_j) ) − (1/4) tr( C_j^T [I + (1/λ) K_j] C_j ).\n\nSince min_Z f_µ(Z) ≤ f_µ(0) = µ log p, and f_µ(Z) ≥ S_j(Z), we have S_j(Z) ≤ µ log p for j = 1, ···, p. It follows that (1/4) tr( (Z − C_j)^T (Z − C_j) ) ≤ µ log p + (1/4) tr( C_j^T [I + (1/λ) K_j] C_j ). By using this inequality, it can be verified that tr(Z^T Z) ≤ R_µ^2, where R_µ^2 is defined in Eq. (17).\n\nTable 1: Nesterov's method for solving the proposed multi-label MKL formulation.\n• Initialize X^0 = Z^1 = Q^0 = 0 ∈ R^{n×k}, t_0 = 1, L_0 = 1/2 + (1/(2λ)) max_{1≤j≤p} λ_max(K_j), and µ = 1/N, where N is the predefined number of iterations\n• for i = 1, ···, N do\n    • Set X^i = Z^i − (1/t_{i−1}) (Z^i + Q^{i−1})\n    • Compute f_µ(X^i) and ∇f_µ(X^i)\n    • Set L = L_{i−1}\n    • while f_µ( X^i − ∇f_µ(X^i)/L ) > f_µ(X^i) − (1/(2L)) tr( (∇f_µ(X^i))^T ∇f_µ(X^i) ) do\n        • L = L × 2\n    • end while\n    • Set L_i = L\n    • Set Z^{i+1} = X^i − (1/L_i) ∇f_µ(X^i), Q^i = Q^{i−1} + (t_{i−1}/L_i) ∇f_µ(X^i)\n    • Set t_i = (1/2) ( 1 + sqrt( 1 + 4 t_{i−1}^2 ) )\n• end for\n\nNesterov's method for solving the proposed approximate formulation is presented in Table 1. After the optimal Z is obtained from Nesterov's method, the optimal {θ_j}_{j=1}^p can be computed from Eq. (12). It follows from the convergence proof in [5] that after N iterations, as long as f_µ(X^i) ≤ f_µ(X^0) for i = 1, ···, N, we have\n\n    f_µ(Z^{N+1}) − f_µ(Z*) ≤ 4 L_µ R_µ^2 / (N + 1)^2,    (20)\n\nwhere Z* = arg min_Z f_µ(Z). Furthermore, since f_µ(Z^{N+1}) ≥ f(Z^{N+1}) and f_µ(Z*) ≤ f(Z*) + µ log p, we have\n\n    f(Z^{N+1}) − f(Z*) ≤ µ log p + 4 L_µ R_µ^2 / (N + 1)^2.    (21)\n\nBy setting µ = O(1/N), we have L_µ ∝ O(1/µ) ∝ O(N). Hence, the convergence rate of Nesterov's method is on the order of O(1/N). 
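An accelerated gradient method with the same doubling line search on the Lipschitz estimate can be sketched as follows. This is a generic FISTA-style scheme, not the exact update rule of Table 1, and the least-squares test problem is hypothetical:

```python
import numpy as np

def accelerated_gradient(f, grad, x0, n_iter=2000, L=1.0):
    """FISTA-style accelerated gradient descent with a doubling
    (backtracking) search on the Lipschitz estimate L."""
    x_prev = x0.copy()
    x = x0.copy()
    t_prev = 1.0
    for _ in range(n_iter):
        t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
        y = x + ((t_prev - 1.0) / t) * (x - x_prev)  # momentum step
        g = grad(y)
        # double L until the sufficient-decrease condition holds
        while f(y - g / L) > f(y) - np.dot(g, g) / (2.0 * L):
            L *= 2.0
        x_prev, x = x, y - g / L  # gradient step from the momentum point
        t_prev = t
    return x

# hypothetical smooth convex test problem: f(z) = 0.5 ||A z - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
grad = lambda z: A.T @ (A @ z - b)
z_hat = accelerated_gradient(f, grad, np.zeros(5))
z_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # closed-form minimizer
```

The doubling search removes the need to know the Lipschitz constant in advance, at the cost of at most a constant-factor overestimate, which mirrors the role of L_{i−1} and the while-loop in Table 1.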
This is significantly better than the convergence rates of O(1/N^{1/3}) and O(1/N^{1/2}) for the SILP and the gradient descent method, respectively.\n\n5 Experiments\n\nIn this section, we evaluate the proposed formulation on the automated annotation of gene expression pattern images. The performance of the approximate formulation is also validated.\n\nExperimental Setup. The experiments use a collection of gene expression pattern images retrieved from the FlyExpress database (http://www.flyexpress.net). We apply nine local descriptors (SIFT, shape context, PCA-SIFT, spin image, steerable filters, differential invariants, complex filters, moment invariants, and cross correlation) on regular grids of 16 and 32 pixels in radius and spacing on each image. These local descriptors are commonly used in computer vision problems [15]. We also apply Gabor filters with different wavelet scales and filter orientations on each image to obtain global features of 384 and 2592 dimensions. Moreover, we sample the pixel values of each image to obtain features of 10240, 2560, and 640 dimensions. After generating the features, we apply the vocabulary-guided pyramid match algorithm [9] to construct kernels between the image sets. A total of 23 kernel matrices (2 grid sizes × 9 local descriptors + 2 Gabor + 3 pixel) are constructed. Then the proposed MKL formulation is employed to obtain the optimal integrated kernel matrix, based on which the low-dimensional embedding is computed. We use the expansion-based approach (star and clique) to construct the hypergraph Laplacian, since it has been shown [1] that the Laplacians constructed in this way are similar to those obtained directly from a hypergraph. The performance of kernel matrices (either single or integrated) is evaluated by applying the support vector machine (SVM) for each term using the one-against-rest scheme. 
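The per-label predictions produced by such a one-against-rest scheme are typically scored with macro- and micro-averaged F1. A minimal sketch of both averages; the toy indicator matrices are hypothetical:

```python
import numpy as np

def f1(tp, fp, fn):
    # F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator vanishes
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def macro_micro_f1(Y_true, Y_pred):
    """Y_true, Y_pred: (n_samples, n_labels) binary indicator matrices."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0)
    # macro: average of per-label F1; micro: F1 of pooled counts
    macro = float(np.mean([f1(t, f_, n_) for t, f_, n_ in zip(tp, fp, fn)]))
    micro = f1(tp.sum(), fp.sum(), fn.sum())
    return macro, micro

# hypothetical predictions for 3 samples and 2 labels
Y_true = np.array([[1, 0], [1, 1], [0, 1]])
Y_pred = np.array([[1, 0], [0, 1], [0, 1]])
macro, micro = macro_micro_f1(Y_true, Y_pred)  # macro = 5/6, micro = 6/7
```

Macro averaging weights all labels equally, while micro averaging weights labels by their frequency; reporting both, as done below, guards against conclusions driven by either rare or dominant labels.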
The F1 score is used as the performance measure, and both macro-averaged and micro-averaged F1 scores across labels are reported. In each case, the entire data set is randomly partitioned into training and test sets with a ratio of 1:1. This process is repeated ten times, and the averaged performance is reported.\n\nPerformance Evaluation. It can be observed from Tables 2 and 3 that, in terms of both macro and micro F1 scores, the kernels integrated by either star or clique expansions achieve the highest performance on almost all of the data sets. In particular, the integrated kernels outperform the best individual kernel significantly on all data sets. This shows that the proposed formulation is effective in combining multiple kernels and exploiting the complementary information contained in different kernels constructed from various features. Moreover, the proposed formulation based on a hypergraph outperforms the classical KCCA consistently.\n\nTable 2: Performance of integrated kernels and the best individual kernel (denoted as BIK) in terms of macro F1 score. The numbers of terms used are 20, 30, and 40, and the numbers of image sets used are 1000, 1500, and 2000. "SILP", "APP", "SVM1", and "Uniform" denote the performance of kernels combined with the SILP formulation, the approximate formulation, the 1-norm SVM formulation proposed in [12] applied for each label separately, and the case where all kernels are given the same weight, respectively. The subscripts "star" and "clique" denote the way the Laplacian is constructed, and "KCCA" denotes the case where C = Y^T (Y Y^T)^{−1} Y.\n\n# of labels |         20           |         30           |         40\n# of sets   | 1000   1500   2000   | 1000   1500   2000   | 1000   1500   2000\nSILPstar    | 0.4396 0.4903 0.4575 | 0.3852 0.4437 0.4162 | 0.3768 0.4019 0.3927\nSILPclique  | 0.4536 0.5125 0.4926 | 0.4065 0.4747 0.4563 | 0.4145 0.4346 0.4283\nSILPKCCA    | 0.3987 0.4635 0.4477 | 0.3497 0.4240 0.4063 | 0.3538 0.3872 0.3759\nAPPstar     | 0.4404 0.4930 0.4703 | 0.3896 0.4494 0.4267 | 0.3900 0.4100 0.3983\nAPPclique   | 0.4510 0.5125 0.4917 | 0.4060 0.4741 0.4563 | 0.4180 0.4338 0.4281\nAPPKCCA     | 0.4029 0.4805 0.4586 | 0.3571 0.4313 0.4146 | 0.3642 0.3914 0.3841\nSVM1        | 0.3780 0.4640 0.4356 | 0.3523 0.4352 0.4200 | 0.3741 0.4048 0.3955\nUniform     | 0.3727 0.4703 0.4480 | 0.3513 0.4410 0.4191 | 0.3719 0.4111 0.3986\nBIK         | 0.4241 0.4515 0.4344 | 0.3782 0.4312 0.3996 | 0.3914 0.3954 0.3827\n\nTable 3: Performance in terms of micro F1 score. See the caption of Table 2 for explanations.\n\n# of labels |         20           |         30           |         40\n# of sets   | 1000   1500   2000   | 1000   1500   2000   | 1000   1500   2000\nSILPstar    | 0.4861 0.5199 0.4847 | 0.4472 0.4837 0.4473 | 0.4277 0.4470 0.4305\nSILPclique  | 0.5039 0.5422 0.5247 | 0.4682 0.5127 0.4894 | 0.4610 0.4796 0.4660\nSILPKCCA    | 0.4581 0.4994 0.4887 | 0.4209 0.4737 0.4532 | 0.4095 0.4420 0.4271\nAPPstar     | 0.4852 0.5211 0.4973 | 0.4484 0.4875 0.4582 | 0.4355 0.4541 0.4346\nAPPclique   | 0.5013 0.5421 0.5239 | 0.4673 0.5124 0.4894 | 0.4633 0.4793 0.4658\nAPPKCCA     | 0.4612 0.5174 0.5018 | 0.4299 0.4828 0.4605 | 0.4194 0.4488 0.4350\nSVM1        | 0.4361 0.5024 0.4844 | 0.4239 0.4844 0.4632 | 0.3947 0.4234 0.4188\nUniform     | 0.4390 0.5096 0.4975 | 0.4242 0.4939 0.4683 | 0.3999 0.4358 0.4226\nBIK         | 0.4614 0.4735 0.4562 | 0.4189 0.4484 0.4178 | 0.3869 0.3905 0.3781\n\nSILP versus the Approximate Formulation. In terms of classification performance, we can observe from Tables 2 and 3 that the SILP and the approximate formulations are similar. More precisely, the approximate formulation performs slightly better than SILP in almost all cases. This may be due to the smoothness of the approximate formulation and the simplicity of the computational procedure employed in Nesterov's method, so that it is less prone to numerical problems. Figure 1 compares the computation time and the kernel weights of SILPstar and APPstar. It can be observed that in general the approximate formulation is significantly faster than SILP, especially when the number of labels and the number of image sets are large, while both yield very similar kernel weights.\n\n6 Conclusions and Future Work\n\nWe present a multi-label learning formulation that incorporates instance-label correlations by a hypergraph. 
We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix in the MKL framework. The resulting formulation leads to a non-smooth min-max problem, and it can be cast as an SILP. We propose an approximate formulation by introducing a smoothing term and show that the resulting formulation is an unconstrained convex problem that can be solved by Nesterov's method. We demonstrate the effectiveness and efficiency of the method on the task of automated annotation of gene expression pattern images.\n\nFigure 1: Comparison of computation time and kernel weights for SILPstar and APPstar. (a) Comparison of computation time; (b) comparison of kernel weights. The left panel plots the computation time (in seconds) of the two formulations on one partition of the data set as the number of labels and image sets increases gradually, and the right panel plots the weights assigned to each of the 23 kernels by SILPstar and APPstar on a data set of 40 labels and 1000 image sets.\n\nThe experiments in this paper focus on the annotation of gene expression pattern images. The proposed formulation can also be applied to the task of multiple object recognition in computer vision. We plan to pursue other applications in the future. Experimental results indicate that the best individual kernel may not be assigned a large weight by the proposed MKL formulation. 
We plan to perform a detailed analysis of the weights in the future.\n\nAcknowledgements\n\nThis work is supported in part by research grants from the National Institutes of Health (HG002516 and 1R01-GM079688-01) and the National Science Foundation (IIS-0612069 and IIS-0643494).\n\nReferences\n\n[1] S. Agarwal, K. Branson, and S. Belongie. Higher order learning with graphs. In ICML, pages 17–24, 2006.\n[2] D. Zhou, J. Huang, and B. Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In NIPS, pages 1601–1608, 2007.\n[3] Z. H. Zhou and M. L. Zhang. Multi-instance multi-label learning with application to scene classification. In NIPS, pages 1609–1616, 2007.\n[4] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.\n[5] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.\n[6] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.\n[7] P. Tomancak et al. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, 3(12), 2002.\n[8] S. Ji, L. Sun, R. Jin, S. Kumar, and J. Ye. Automated annotation of Drosophila gene expression patterns using a controlled vocabulary. Bioinformatics, 24(17):1881–1888, 2008.\n[9] K. Grauman and T. Darrell. Approximate correspondences in high dimensions. In NIPS, pages 505–512, 2006.\n[10] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.\n[11] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.\n[12] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. 
Journal of Machine Learning Research, 5:27–72, 2004.\n[13] R. Hettich and K. O. Kortanek. Semi-infinite programming: Theory, methods, and applications. SIAM Review, 35(3):380–429, 1993.\n[14] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, July 2006.\n[15] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005.\n", "award": [], "sourceid": 129, "authors": [{"given_name": "Shuiwang", "family_name": "Ji", "institution": null}, {"given_name": "Liang", "family_name": "Sun", "institution": null}, {"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Jieping", "family_name": "Ye", "institution": null}]}