{"title": "Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 325, "page_last": 333, "abstract": "Recent studies have shown that multiple kernel learning is very effective for object recognition, leading to the popularity of kernel learning in computer vision problems. In this work, we develop an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes, making them unscalable when the number of classes is large, a challenge frequently encountered in visual object recognition. We address this computational challenge by developing a framework for ML-MKL that combines the worst-case analysis with stochastic approximation. Our analysis shows that the complexity of our algorithm is $O(m^{1/3}\\sqrt{ln m})$, where $m$ is the number of classes. Empirical studies with object recognition show that while achieving similar classification accuracy, the proposed method is significantly more efficient than the state-of-the-art algorithms for ML-MKL.", "full_text": "Multi-label Multiple Kernel Learning by Stochastic\n\nApproximation: Application to Visual Object Recognition\n\nSerhat S. Bucak\u2217\n\nRong Jin\u2217\n\nAnil K. Jain\u2217\u2020\n\nbucakser@cse.msu.edu\n\nrongjin@cse.msu.edu\n\njain@cse.msu.edu\n\nDept. of Comp. Sci. & Eng.\u2217\nMichigan State University\n\nEast Lansing, MI 48824,U.S.A.\n\nDept. 
of Brain & Cognitive Eng.†
Korea University, Anam-dong,
Seoul, 136-713, Korea

Abstract

Recent studies have shown that multiple kernel learning is very effective for object recognition, leading to the popularity of kernel learning in computer vision problems. In this work, we develop an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes, making them unscalable when the number of classes is large, a challenge frequently encountered in visual object recognition. We address this computational challenge by developing a framework for ML-MKL that combines worst-case analysis with stochastic approximation. Our analysis shows that the complexity of our algorithm is O(m^{1/3}√(ln m)), where m is the number of classes. Empirical studies with object recognition show that, while achieving similar classification accuracy, the proposed method is significantly more efficient than the state-of-the-art algorithms for ML-MKL.

1 Introduction

Recent studies have shown promising performance of kernel methods for object classification, recognition and localization [1]. Since the choice of kernel functions can significantly affect the performance of kernel methods, kernel learning, or more specifically Multiple Kernel Learning (MKL) [2, 3, 4, 5, 6, 7], has attracted a considerable amount of interest in the computer vision community. In this work, we focus on kernel learning for object recognition because the visual content of an image can be represented in many ways, depending on the methods used for keypoint detection, descriptor/feature extraction, and keypoint quantization.
Since each representation leads to a different similarity measure between images (i.e., a kernel function), the related fusion problem can be cast as an MKL problem.

A number of algorithms have been developed for MKL. In [2], MKL is formulated as a quadratically constrained quadratic program (QCQP). [8] suggests an algorithm based on sequential minimal optimization (SMO) to improve the efficiency of [2]. [9] shows that MKL can be formulated as a semi-infinite linear program (SILP) and solved efficiently using off-the-shelf SVM implementations. In order to improve the scalability of MKL, several first-order optimization methods have been proposed, including the subgradient method [10], the level method [11], and methods based on the equivalence between group lasso and MKL [12, 13, 14]. Besides the L1-norm [15] and L2-norm [16], the Lp-norm [17] has also been proposed to regularize the weights of the kernel combination. Other than the framework based on maximum margin classification, MKL can also be formulated using kernel alignment [18] and Fisher discriminant analysis [19].

Although most efforts in MKL focus on binary classification problems, several recent studies have attempted to extend MKL to multi-class and multi-label learning [3, 20, 21, 22, 23]. Most of these studies assume that either the same or similar kernel functions are used by different but related classification tasks. Even though studies show that MKL for multi-class and multi-label learning can result in significant improvements in classification accuracy, the computational cost is often linear in the number of classes, making it computationally expensive when dealing with a large number of classes.
Since most object recognition problems involve many object classes, whose number may go up to hundreds or sometimes even thousands, it is important to develop an efficient learning algorithm for multi-class and multi-label MKL that is sublinear in the number of classes.

In this work, we develop an efficient algorithm for Multi-Label MKL (ML-MKL) that assumes all the classifiers share the same combination of kernels. We note that although this assumption significantly constrains the choice of kernel functions for different classes, our empirical studies with object recognition show that it does not affect the classification performance. A similar phenomenon was also observed in [21]. A naive implementation of ML-MKL with a shared kernel combination leads to a computational cost linear in the number of classes. We alleviate this computational challenge by exploring the idea of combining worst-case analysis with stochastic approximation. Our analysis reveals that the convergence rate of the proposed algorithm is O(m^{1/3}√(ln m)), which is significantly better than a linear dependence on m, where m is the number of classes. Our empirical studies show that the proposed MKL algorithm yields similar performance as the state-of-the-art algorithms for ML-MKL, but with a significantly shorter running time, making it suitable for multi-label learning with a large number of classes.

The rest of this paper is organized as follows. Section 2 presents the proposed algorithm for Multi-Label MKL, along with its convergence analysis. Section 3 summarizes the experimental results for object recognition. Section 4 concludes this work.

2 Multi-label Multiple Kernel Learning (ML-MKL)

We denote by D = {x_1, ..., x_n} the collection of n training instances, and by m the number of classes. We introduce y^k = (y^k_1, ..., y^k_n)^T ∈ {−1, +1}^n, the assignment of the kth class to all the training instances: y^k_i = +1 if x_i is assigned to the k-th class and y^k_i = −1 otherwise. We introduce κ_a(x, x') : R^d × R^d → R, a = 1, ..., s, the s kernel functions to be combined. We denote by {K_a ∈ R^{n×n}, a = 1, ..., s} the collection of s kernel matrices for the data points in D, i.e., K^a_{i,j} = κ_a(x_i, x_j). We introduce p = (p_1, ..., p_s), a probability distribution, for combining kernels, and denote by K(p) = Σ_{a=1}^s p_a K_a the combined kernel matrix. We introduce the domain P for the probability distribution p, i.e., P = {p ∈ R^s_+ : p^T 1 = 1}. Our goal is to learn from the training examples the optimal kernel combination p for all the m classes.

The simplest approach for multi-label multiple kernel learning with a shared kernel combination is to find the optimal kernel combination p by minimizing the sum of the regularized loss functions of all m classes, leading to the following optimization problem:

    min_{p∈P} min_{{f_k∈H(p)}_{k=1}^m} Σ_{k=1}^m H_k = Σ_{k=1}^m { (1/2)|f_k|²_{H(p)} + Σ_{i=1}^n ℓ(y^k_i f_k(x_i)) },    (1)

where ℓ(z) = max(0, 1 − z) and H(p) is a Reproducing Kernel Hilbert Space endowed with kernel κ(x, x'; p) = Σ_{a=1}^s p_a κ_a(x, x'). H_k is the regularized loss function for the kth class. It is straightforward to verify the following dual problem of (1):

    min_{p∈P} max_{α∈Q_1} L(p, α) = Σ_{k=1}^m { [α^k]^T 1 − (1/2)(α^k ◦ y^k)^T K(p)(α^k ◦ y^k) },    (2)

where Q_1 = {α = (α^1, ..., α^m) : α^k ∈ [0, C]^n, k = 1, ..., m}. To solve the optimization problem in Eq. (2), we can view it as a minimization problem, i.e., min_{p∈P} A(p), where A(p) = max_{α∈Q_1} L(p, α).
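As a concrete illustration, the combined kernel K(p) and the per-class dual term appearing in Eq. (2) can be computed as follows; this is a minimal numpy sketch with illustrative function names, not code from the paper:

```python
import numpy as np

def combine_kernels(kernels, p):
    """K(p) = sum_a p_a * K_a for a list of n x n kernel matrices
    and simplex weights p (p >= 0, sum(p) = 1)."""
    return sum(pa * Ka for pa, Ka in zip(p, kernels))

def dual_term(alpha, y, Kp):
    """Per-class dual term in Eq. (2):
    alpha^T 1 - 1/2 (alpha o y)^T K(p) (alpha o y),
    where 'o' is the element-wise product."""
    v = alpha * y
    return alpha.sum() - 0.5 * v @ Kp @ v
```

In the full objective, this term is summed (or, later, weighted) over the m classes, with each alpha obtained from an SVM solver run on K(p).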
We then follow the subgradient descent approach in [10] and compute the gradient of A(p) as

    ∂_{p_i} A(p) = −(1/2) Σ_{k=1}^m (α^k(p) ◦ y^k)^T K_i (α^k(p) ◦ y^k),

where α^k(p) = arg max_{α∈[0,C]^n} α^T 1 − (α ◦ y^k)^T K(p)(α ◦ y^k)/2. We refer to this approach as Multi-label Multiple Kernel Learning by Sum, or ML-MKL-Sum. Note that this approach is similar to the one proposed in [21]. The main computational problem with ML-MKL-Sum is that, by treating every class equally, each iteration of subgradient descent requires solving m kernel SVMs, making it unscalable to a very large number of classes. Below we present a formulation for multi-label MKL whose computational cost is sublinear in the number of classes.

2.1 A Minimax Framework for Multi-label MKL

In order to alleviate the computational difficulty arising from a large number of classes, we search for the combined kernel matrix K(p) that minimizes the worst classification error among the m classes, i.e.,

    min_{p∈P} min_{{f_k∈H(p)}_{k=1}^m} max_{1≤k≤m} H_k.    (3)

Eq. (3) differs from Eq. (1) in that it replaces Σ_{k=1}^m H_k with max_{1≤k≤m} H_k. The main computational advantage of using max_k H_k instead of Σ_k H_k is that, with an appropriately designed method, we may be able to identify the most difficult class within a few iterations and spend most of the computational cycles on learning the optimal kernel combination for that class. In this way, we are able to achieve a running time that is sublinear in the number of classes. Below, we present an optimization strategy for Eq. (3) based on the idea of stochastic approximation.

A direct approach is to solve the optimization problem in Eq. (3) by its dual form. It is straightforward to derive the dual problem of Eq. (3) as follows (more details can be found in the supplementary documents):

    min_{p∈P} max_{β∈B} L(p, β) = ( Σ_{k=1}^m { [β^k]^T 1 − (1/2)(β^k ◦ y^k)^T K(p)(β^k ◦ y^k) }^{1/2} )²,    (4)

where

    B = { (β^1, ..., β^m) : β^k ∈ R^n_+, β^k ∈ [0, Cλ_k]^n, k = 1, ..., m, s.t. Σ_{k=1}^m λ_k = 1 }.

The challenge in solving Eq. (4) is that the solutions {β^1, ..., β^m} in the domain B are correlated with each other, making it impossible to solve each β^k independently by an off-the-shelf SVM solver. Although a gradient descent approach can be developed for optimizing Eq. (4), it is unable to exploit the sparse structure of β^k, making it less efficient than state-of-the-art SVM solvers. In order to effectively exploit the power of off-the-shelf SVM solvers, we rewrite (3) as follows:

    min_{p∈P} max_{γ∈Γ} max_{α∈Q_1} L(p, γ) = Σ_{k=1}^m γ_k { [α^k]^T 1 − (1/2)(α^k ◦ y^k)^T K(p)(α^k ◦ y^k) },    (5)

where Γ = {(γ_1, ..., γ_m) ∈ R^m_+ : γ^T 1 = 1}. In Eq. (5), we replace max_{1≤k≤m} with max_{γ∈Γ}.
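The updates of p and γ used to optimize this saddle-point problem are multiplicative (entropic mirror descent) steps that keep the iterates on the probability simplex. A minimal sketch of one such step, with illustrative names and numpy assumed:

```python
import numpy as np

def exp_gradient_step(w, grad, eta):
    """One multiplicative update w_a <- w_a * exp(-eta * grad_a) / Z,
    where Z normalizes w back onto the probability simplex.
    Used for the descent step in p; the ascent step in gamma is the
    same update with the sign of the gradient flipped."""
    w_new = w * np.exp(-eta * np.asarray(grad))
    return w_new / w_new.sum()
```

Because the update is multiplicative, a coordinate of w can shrink toward zero but never leave the simplex, so no explicit projection is needed.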
The advantage of using Eq. (5) is that we can resort to an SVM solver to efficiently find α^k for a given combination of kernels K(p). Given Eq. (5), we develop a subgradient descent approach for solving the optimization problem. In particular, in each iteration of subgradient descent, we compute the gradients of L(p, γ) with respect to p and γ as follows:

    ∇_{p_a} L(p, γ) = −(1/2) Σ_{k=1}^m γ_k (α^k ◦ y^k)^T K_a (α^k ◦ y^k),    ∇_{γ_k} L(p, γ) = [α^k]^T 1 − (1/2)(α^k ◦ y^k)^T K(p)(α^k ◦ y^k),    (6)

where α^k = arg max_{α∈[0,C]^n} α^T 1 − (α ◦ y^k)^T K(p)(α ◦ y^k)/2, i.e., an SVM solution for the combined kernel K(p). Following the mirror prox descent method [24], we define the potential functions Φ_p = (1/η_p) Σ_{a=1}^s p_a ln p_a for p and Φ_γ = (1/η_γ) Σ_{i=1}^m γ_i ln γ_i for γ, and obtain the following equations for updating p_t and γ_t:

    p^a_{t+1} = (p^a_t / Z^p_t) exp(−η_p ∇_{p_a} L(p_t, γ_t)),    γ^k_{t+1} = (γ^k_t / Z^γ_t) exp(η_γ ∇_{γ_k} L(p_t, γ_t)),    (7)

where Z^p_t and Z^γ_t are normalization factors that ensure p^T_{t+1} 1 = γ^T_{t+1} 1 = 1, and η_p > 0 and η_γ > 0 are the step sizes for optimizing p and γ, respectively.

Unfortunately, the algorithm described above shares the same shortcoming as the other approaches for multi-label multiple kernel learning, i.e., it requires solving m SVM problems in each iteration, and therefore its computational complexity is linear in the number of classes. To alleviate this problem, we modify the above algorithm by introducing a stochastic approximation method. In particular, in each iteration t, instead of computing the full gradients, which requires solving m SVMs, we sample one classification task according to the multinomial distribution Multi(γ^1_t, ..., γ^m_t). Let j_t be the index of the sampled classification task. Using the sampled task j_t, we estimate the gradients of L(p, γ) with respect to p_a and γ_k, denoted by ĝ^p_a(p_t, γ_t) and ĝ^γ_k(p_t, γ_t), as follows:

    ĝ^p_a(p_t, γ_t) = −(1/2)(α^{j_t} ◦ y^{j_t})^T K_a (α^{j_t} ◦ y^{j_t}),    (8)

    ĝ^γ_k(p_t, γ_t) = (1/γ_k)( [α^k]^T 1 − (1/2)(α^k ◦ y^k)^T K(p)(α^k ◦ y^k) ) if k = j_t, and 0 if k ≠ j_t.    (9)

The computation of ĝ^p_a(p_t, γ_t) and ĝ^γ_k(p_t, γ_t) only requires α^{j_t}, and therefore only one SVM problem needs to be solved, instead of m. The key property of the estimated gradients in Eqs. (8) and (9) is that their expectations equal the true gradients, as summarized by Proposition 1. This property is the key to the correctness of the algorithm.

Proposition 1. We have

    E_t[ĝ^p_a(p_t, γ_t)] = ∇_{p_a} L(p_t, γ_t),    E_t[ĝ^γ_i(p_t, γ_t)] = ∇_{γ_i} L(p_t, γ_t),

where E_t[·] stands for the expectation over the randomly sampled task j_t.

Given the estimated gradients, we follow Eq. (7) to update p and γ in each iteration. Since ĝ^γ_i(p_t, γ_t) is proportional to 1/γ_t, to ensure that the norm of ĝ^γ_i(p_t, γ_t) is bounded, we need to smooth γ_{t+1}. In order to have the smoothing effect without modifying γ_{t+1}, we sample directly from

    γ'^k_{t+1} ← γ^k_{t+1}(1 − δ) + δ/m,    k = 1, ..., m,

where δ > 0 is a small probability mass used for smoothing and

    Γ' = { γ' : γ'^T 1 = 1, γ'_k ≥ δ/m, k = 1, ..., m },

so that for any γ ∈ Γ there exists a smoothed γ' ∈ Γ'.

We refer to this algorithm as Multi-label Multiple Kernel Learning by Stochastic Approximation, or ML-MKL-SA for short. Algorithm 1 gives the detailed description.

2.2 Convergence Analysis

Since Eq.
(5) is a convex-concave optimization problem, we introduce the following criterion for measuring the quality of a solution (p, γ):

    Δ(p, γ) = max_{γ'∈Γ} L(p, γ') − min_{p'∈P} L(p', γ).    (11)

We denote by (p*, γ*) the optimal solution to Eq. (5).

Proposition 2. We have the following properties for Δ(p, γ):

1. Δ(p, γ) ≥ 0 for any solution p ∈ P and γ ∈ Γ;
2. Δ(p*, γ*) = 0;
3. Δ(p, γ) is jointly convex in both p and γ.

We have the following theorem for the convergence rate of Algorithm 1. The detailed proof can be found in the supplementary document.

Theorem 1. After running Algorithm 1 over T iterations, we have the following inequality for the solution p̂ and γ̂ obtained by Algorithm 1:

    E[Δ(p̂, γ̂)] ≤ (1/(η_γ T))(ln m + ln s) + η_γ ( d (m²/(2δ²)) λ₀² n² C⁴ + n² C² ),

where d is a constant term, E[·] stands for the expectation over the sampled task indices of all iterations, and λ₀ = max_{1≤a≤s} λ_max(K_a), where λ_max(Z) stands for the maximum eigenvalue of matrix Z.

Algorithm 1 Multi-label Multiple Kernel Learning: ML-MKL-SA
1: Input
   • η_p, η_γ: step sizes
   • K_1, ..., K_s: s kernel matrices
   • y^1, ..., y^m: the assignments of the m different classes to the n training instances
   • T: number of iterations
   • δ: smoothing parameter
2: Initialization: γ_1 = 1/m and p_1 = 1/s
3: for t = 1, ..., T do
4:    Sample a classification task j_t according to the distribution Multi(γ^1_t, ..., γ^m_t).
5:    Compute α^{j_t} = arg max_{α∈[0,C]^n} α^T 1 − (α ◦ y^{j_t})^T K(p)(α ◦ y^{j_t})/2 using an off-the-shelf SVM solver.
6:    Compute the estimated gradients ĝ^p_a(p_t, γ_t) and ĝ^γ_k(p_t, γ_t) using Eqs. (8) and (9).
7:    Update p_{t+1}, γ_{t+1} and γ'_{t+1} as follows:
         p^a_{t+1} = (p^a_t / Z^p_t) exp(−η_p ĝ^p_a(p_t, γ_t)),  a = 1, ..., s;
         [γ_{t+1}]_k = (γ^k_t / Z^γ_t) exp(η_γ ĝ^γ_k(p_t, γ_t)),  k = 1, ..., m;
         γ'_{t+1} = (1 − δ)γ_{t+1} + (δ/m)1.
8: end for
9: Compute the final solution p̂ and γ̂ as

    p̂ = (1/T) Σ_{t=1}^T p_t,    γ̂ = (1/T) Σ_{t=1}^T γ_t.    (10)

Corollary 1. With δ = m^{2/3} and η_γ = (1/(n m^{1/3})) √((ln m)/T), after running Algorithm 1 over T iterations, we have E[Δ(p̂, γ̂)] ≤ O(n m^{1/3} √((ln m)/T)) in terms of m, n and T.

Since we only need to solve one kernel SVM in each iteration, the computational complexity of the proposed algorithm is on the order of O(m^{1/3} √((ln m)/T)), sublinear in the number of classes m.

3 Experiments

In this section, we empirically evaluate the proposed multiple kernel learning algorithm² by demonstrating its efficiency and effectiveness on the visual object recognition task.

3.1 Data sets

We use three benchmark data sets for visual object recognition: Caltech-101, Pascal VOC 2006 and Pascal VOC 2007. Caltech-101 contains 101 different object classes in addition to a "background" class. We use the same settings as [25], in which 30 instances of each class are used for training and 15 instances for testing. The Pascal VOC 2006 data set [26] consists of 5,303 images distributed over 10 classes, of which 2,618 are used for training. Pascal VOC 2007 [27] consists of 5,011 training images and 4,932 test images distributed over 20 classes. For both data sets, we used the default train-test partition provided by the VOC Challenge.
Unlike the Caltech-101 data set, where each image is assigned to one class, images in the VOC data sets can be assigned to multiple classes simultaneously, making them more suitable for multi-label learning.

²Code can be downloaded from http://www.cse.msu.edu/~bucakser/ML-MKL-SA.rar

Table 1: Classification accuracy (AUC) and running times (seconds) of all ML-MKL algorithms on three data sets. Abbreviations SA, GMKL, Sum, Simple, VSKL, AVG stand for ML-MKL-SA, Generalized MKL, ML-MKL-Sum, SimpleMKL, variable sparsity kernel learning and average kernel, respectively.

Accuracy (AUC):
dataset      | SA   | GMKL | Sum  | Simple | VSKL | AVG
CALTECH-101  | 0.80 | 0.79 | 0.77 | 0.80   | 0.78 | 0.77
VOC2006      | 0.75 | 0.75 | 0.72 | 0.74   | 0.74 | 0.74
VOC2007      | 0.50 | 0.49 | 0.47 | 0.45   | 0.42 | 0.46

Training Time (sec):
dataset      | SA      | GMKL     | Sum     | Simple   | VSKL     | AVG
CALTECH-101  | 191.17  | 18292.00 | 1814.50 | 9869.40  | 21266.05 | N/A
VOC2006      | 245.10  | 2586.90  | 890.65  | 11549.00 | 7368.27  | N/A
VOC2007      | 1329.40 | 30333.14 | 1372.60 | 18536.37 | 11370.48 | N/A

[Figure 1: kernel coefficients vs. time (sec); panels: ML-MKL-Sum, GMKL, VSKL, ML-MKL-SA]
Figure 1: The evolution of kernel weights over time for the CALTECH-101 data set. For GMKL and VSKL, the curves display the kernel weights averaged over all the classes, since a different kernel combination is learnt for each class.

3.2 Kernels

We extracted 9 kernels for the Caltech-101 data set using the software provided in [28].
Three different feature extraction methods are used for kernel construction: (i) GB: geometric blur descriptors are applied to the detected keypoints [29]; an RBF kernel is used in which the distance between two images is computed by averaging the distances of the nearest descriptor pairs of the image pair. (ii) PHOW gray/color: keypoints are obtained by dense sampling; SIFT descriptors are quantized into 300 words, and spatial histograms with 2x2 and 4x4 subdivisions are built to generate chi-squared kernels [30]. (iii) SSIM: self-similarity features from [31] are used, and spatial histograms based on 300 visual words form the chi-squared kernel.

For the VOC data sets, a different procedure, based on the reports of the VOC challenges [1], is used to construct multiple visual dictionaries, and each dictionary results in a different kernel. To obtain multiple visual dictionaries, we deploy (i) three keypoint detectors, i.e., dense sampling, HARHES [32] and HESLAP [33]; (ii) two keypoint descriptors, i.e., SIFT [33] and SPIN [34]; (iii) two different numbers of visual words, i.e., 500 and 1,000; and (iv) two different kernel functions, i.e., the linear kernel and the chi-squared kernel. The bandwidth of the chi-squared kernels is calculated using the procedure in [25]. Using the above variants in visual dictionary construction, we constructed 22 kernels for both the VOC2007 and VOC2006 data sets.
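For reference, the chi-squared kernel used for the histogram features above has a simple closed form. The following is a minimal numpy sketch assuming L1-normalized histograms; the exact bandwidth normalization used in the paper follows the procedure in [25]:

```python
import numpy as np

def chi2_kernel(H1, H2, gamma=1.0, eps=1e-10):
    """Chi-squared kernel between rows of H1 and H2 (histograms):
    k(x, z) = exp(-gamma * sum_i (x_i - z_i)^2 / (x_i + z_i))."""
    K = np.zeros((H1.shape[0], H2.shape[0]))
    for i, x in enumerate(H1):
        num = (x - H2) ** 2      # broadcasts x against every row of H2
        den = x + H2 + eps       # eps guards against empty bins
        K[i] = np.exp(-gamma * (num / den).sum(axis=1))
    return K
```

The resulting matrix is symmetric with ones on the diagonal (a histogram is at chi-squared distance zero from itself), and can be fed directly into the kernel combination K(p).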
In addition to the K-means implementation in [28], we also applied a hierarchical clustering algorithm [35] to descriptor quantization for the VOC 2007 data set, leading to four more kernels for the VOC2007 data set.

3.3 Baseline Methods

We first compare the proposed algorithm ML-MKL-SA to the following MKL algorithms that learn a different kernel combination for each class: (i) the generalized multiple kernel learning method (GMKL) [25], which reports promising results for object recognition; (ii) SimpleMKL [10], which learns the kernel combination by a subgradient approach; and (iii) Variable Sparsity Kernel Learning (VSKL), a mirror-prox descent based algorithm for MKL [36]. We also compare ML-MKL-SA to ML-MKL-Sum, which learns a kernel combination shared by all classes, as described in Section 2, using the optimization method in [21]. In all implementations of multi-label multiple kernel learning algorithms, we use the LIBSVM implementation of one-versus-all SVM where needed.

3.4 Experimental Results

To evaluate the effectiveness of different algorithms for multi-label multiple kernel learning, we first compute the area under the precision-recall curve (AUC) for each class, and report the value of AUC averaged over all the classes.
[Figure 2: AUC vs. time (sec) for ML-MKL-SA and ML-MKL-Sum; panels: CALTECH-101, VOC-2006, VOC-2007]
Figure 2: The evolution of classification accuracy over time for ML-MKL-SA and ML-MKL-Sum on the three data sets.

[Figure 3: AUC vs. number of iterations for δ = 0, 0.2, 0.6, 1]
Figure 3: Classification accuracy (AUC) of the proposed algorithm ML-MKL-SA on CALTECH-101 using different values of δ (for η_p = η_γ = 0.01).

[Figure 4: AUC vs. number of iterations for η = 0.01, 0.001, 0.0001]
Figure 4: Classification accuracy (AUC) of the proposed algorithm ML-MKL-SA on CALTECH-101 using different values of η_p = η_γ = η (for δ = 0).

We evaluate the efficiency of the algorithms by their running times for training. All methods are coded in MATLAB and run on machines with two dual-core AMD Opterons at 2.2 GHz, 8 GB RAM and the Linux operating system.

For the proposed method, iterations stop when ||p̂_t − p̂_{t−1}|| / ||p̂_t|| is smaller than 0.01. Unless stated otherwise, the smoothing parameter δ is set to 0.2. For simplicity, we take η = η_p = η_γ in all the following experiments.
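The stopping rule just described is a relative-change test on the kernel weight vector; a minimal sketch (numpy; the Euclidean norm is an assumption, as the excerpt does not specify which norm is used):

```python
import numpy as np

def has_converged(p_new, p_old, tol=0.01):
    """Stop when ||p_t - p_{t-1}|| / ||p_t|| < tol, i.e., when the
    kernel combination weights have effectively stopped moving."""
    return np.linalg.norm(p_new - p_old) / np.linalg.norm(p_new) < tol
```

In practice this check would be evaluated once per outer iteration, after the mirror descent update of p.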
Step size η is chosen as 0.0001 for the CALTECH-101 data set and 0.001 for the VOC data sets in order to achieve the best computational efficiency.

Table 1 summarizes the classification accuracies (AUC) and the running times of all the algorithms over the three data sets. We first note that the proposed MKL method for multi-labeled data, i.e., ML-MKL-SA, yields the best performance among the methods in comparison, which justifies the assumption of using the same kernel combination for all the classes. Note that a simple approach that uses the average of all kernels yields reasonable performance, although its classification accuracy is significantly worse than that of the proposed approach ML-MKL-SA. Second, we observe that, except for the average kernel method, which does not require learning the kernel combination weights, ML-MKL-SA and ML-MKL-Sum are significantly more efficient than the other baseline approaches. This is not surprising, as ML-MKL-SA and ML-MKL-Sum compute a single kernel combination for all classes. Third, compared to ML-MKL-Sum, we observe that ML-MKL-SA is overall more efficient, and significantly more so on the CALTECH-101 data set. This is because the number of classes in CALTECH-101 is significantly larger than that of the two VOC challenge data sets. This result further confirms that the proposed algorithm is scalable to data sets with a large number of classes.

Fig. 1 shows the change in the kernel weights over time for the proposed method and the three baseline methods (i.e., ML-MKL-Sum, GMKL, and VSKL) on the CALTECH-101 data set. We observe that, overall, ML-MKL-SA shares a similar pattern with GMKL and VSKL in the evolution of kernel weights, but is ten times faster than the two baseline methods.
Although ML-MKL-Sum is significantly more efficient than GMKL and VSKL, the kernel weights learned by ML-MKL-Sum vary significantly, particularly at the beginning of the learning process, making it a less stable algorithm than the proposed algorithm ML-MKL-SA. To further compare ML-MKL-SA with ML-MKL-Sum, in Fig. 2 we show how the classification accuracy changes over time for both methods on all three data sets. We again observe the unstable behavior of ML-MKL-Sum: its classification accuracy can vary significantly over a relatively short period of time, making it a less desirable method for MKL.

To evaluate the sensitivity of the proposed method to the parameters δ and η, we conducted experiments with varied values for the two parameters. Fig. 3 shows how the classification accuracy (AUC) of the proposed algorithm changes over the iterations on CALTECH-101 using four different values of δ. We observe that the final classification accuracy is comparable for different values of δ, demonstrating the robustness of the proposed method to the choice of δ. We also note that the two extreme cases, i.e., δ = 0 and δ = 1, give the worst performance, indicating the importance of choosing an appropriate value for δ. Fig. 4 shows the classification accuracy for three different values of η on the CALTECH-101 data set. We observe that the proposed algorithm achieves similar classification accuracy when η is set to a relatively small value (i.e., η = 0.001 or η = 0.0001). This result demonstrates that the proposed algorithm is in general insensitive to the choice of step size η.

4 Conclusion and Future Work

In this paper, we present an efficient optimization framework for multi-label multiple kernel learning that combines worst-case analysis with stochastic approximation.
Compared to the other algorithms for ML-MKL, the key advantage of the proposed algorithm is that its computational cost is sublinear in the number of classes, making it suitable for handling a large number of classes. We verify the effectiveness of the proposed algorithm by experiments in object recognition on several benchmark data sets. There are two directions that we plan to explore in the future. First, we aim to further improve the efficiency of ML-MKL by reducing its dependence on the number of training examples and speeding up the convergence rate. Second, we plan to improve the effectiveness and efficiency of multi-label learning by exploring the correlation and structure among the classes.

5 Acknowledgements

This work was supported in part by the National Science Foundation (IIS-0643494), US Army Research (ARO Award W911NF-08-010403) and the Office of Naval Research (ONR N00014-09-1-0663). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF, ARO, and ONR. Part of Anil Jain's research was supported by the WCU (World Class University) program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology (R31-2008-000-10008-0).

References

[1] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results." http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html.

[2] G. Lanckriet, T. De Bie, N. Cristianini, M. Jordan, and W. Noble, "A statistical framework for genomic data fusion," Bioinformatics, vol. 20, pp. 2626–2635, 2004.

[3] S. Ji, L. Sun, R. Jin, and J. Ye, "Multi-label multiple kernel learning," in Proceedings of Neural Information Processing Systems, 2008.

[4] G. Lanckriet, N. Cristianini, P. Bartlett, L.
Ghaoui, and M. Jordan, \u201cLearning the kernel matrix with semide\ufb01nite program-\n\nming,\u201d Journal of Machine Learning Research, vol. 5, pp. 27\u201372, 2004.\n\n[5] O. Chapelle and A. Rakotomamonjy, \u201cSecond order optimization of kernel parameters,\u201d in NIPS Workshop on Kernel Learn-\n\ning: Automatic Selection of Optimal Kernels, 2008.\n\n[6] P. Gehler and S. Nowozin, \u201cOn feature combination for multiclass object classi\ufb01cation,\u201d in Proceedings of the IEEE Interna-\n\ntional Conference on Computer Vision, 2009.\n\n[7] P. Gehler and S. Nowozin, \u201cLet the kernel \ufb01gure it out: Principled learning of pre-processing for kernel classi\ufb01ers,\u201d in\n\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.\n\n[8] F. Bach, G. Lanckriet, and M. Jordan, \u201cMultiple kernel learning, conic duality, and the smo algorithm,\u201d in Proceedings of the\n\n21st International Conference on Machine Learning, 2004.\n\n[9] S. Sonnenburg, G. Ratsch, and C. Schafer, \u201cA general and ef\ufb01cient multiple kernel learning algorithm,\u201d in Proceedings of\n\nNeural Information Processings Systems, pp. 1273\u20131280, 2006.\n\n[10] A. Rakotomamonjy, F. Bach, Y. Grandvalet, and S. Canu, \u201cSimpleMKL,\u201d Journal of Machine Learning Research, vol. 9,\n\npp. 2491\u20132521, 2008.\n\n8\n\n\f[11] Z. Xu, R. Jin, I. King, and M. R. Lyu, \u201cAn extended level method for ef\ufb01cient multiple kernel learning,\u201d in Proceedings of\n\nNeural Information Processings Systems, pp. 1825\u20131832, 2008.\n\n[12] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, \u201cSimple and ef\ufb01cient multiple kernel learning by group lasso,\u201d in Proceedings\n\nof the 27th International Conference on Machine Learning, 2010.\n\n[13] F. Bach, \u201cConsistency of the group lasso and multiple kernel learning,\u201d Journal of Machine Learning Research, vol. 9,\n\npp. 1179\u20131225, 2008.\n\n[14] Z. Xu, R. Jin, S. 
Zhu, M. R. Lyu, and I. King, \u201cSmooth optimization for effective multiple kernel learning,\u201d in Proceedings of\n\nthe AAAI Conference on Arti\ufb01cial Intelligence, 2010.\n\n[15] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, \u201cMore ef\ufb01ciency in multiple kernel learning,\u201d in Proceedings of the\n\n24th International Conference on Machine Learning, 2007.\n\n[16] M. Kloft, U. Brefeld, A. Sonnenburg, and A. Zien, \u201cComparing sparse and non-sparse multiple kernel learning,\u201d in NIPS\n\nWorkshop on Understanding Multiple Kernel Learning Methods, 2009.\n\n[17] M. Kloft, U. Brefeld, A. Sonnenburg, P. Laskov, K.-R. Muller, and A. Zien, \u201cEf\ufb01cient and accurate lp-norm multiple kernel\n\nlearning,\u201d in Proceedings of Neural Information Processings Systems, 2009.\n\n[18] S. Hoi, M. Lyu, and E. Chang, \u201cLearning the uni\ufb01ed kernel machines for classi\ufb01cation,\u201d in Proceedings of the Conference on\n\nKnowledge Discovery and Data Mining, p. 187196, 2006.\n\n[19] J. Ye, J. Chen, and J. S., \u201cDiscriminant kernel and regularization parameter learning via semide\ufb01nite programming,\u201d in\n\nProceedings of the International Conference on Machine Learning, p. 10951102, 2007.\n\n[20] A. Zien and S. Cheng, \u201cMulticlass multiple kernel learning,\u201d in Proceedings of the 24th International Conference on Machine\n\nLearning, 2007.\n\n[21] L. Tang, J. Chen, and J. Ye, \u201cOn multiple kernel learning with multiple labels,\u201d in Proceedings of the 21st International Jont\n\nConference on Arti\ufb01cal Intelligence, 2009.\n\n[22] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, \u201cGroup-sensitive multiple kernel learning for object categorization,\u201d in Pro-\n\nceedings of the IEEE International Conference on Computer Vision, 2009.\n\n[23] F. Orabona, L. Jie, and B. 
Caputo, \u201cOnline-batch strongly convex multi kernel learning,\u201d in Proceedings of the IEEE Confer-\n\nence on Computer Vision and Pattern Recognition, 2010.\n\n[24] A. Nemirovski, \u201cProx-method with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone\n\noperators and smooth convex-concave saddle point problems,\u201d SIAM Journal on Optimization, vol. 15, pp. 229\u2013251, 2004.\n\n[25] M. Varma and D. Ray, \u201cLearning the discriminative power-invariance trade-off,\u201d in Proceedings of the IEEE International\n\nConference on Computer Vision, October 2007.\n\n[26] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, \u201cThe PASCAL Visual Object Classes Challenge 2006\n\n(VOC2006) Results.\u201d http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf.\n\n[27] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, \u201cThe PASCAL Visual Object Classes Challenge\n\n2007 (VOC2007) Results.\u201d http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.\n\n[28] A. Vedaldi and B. Fulkerson, \u201cVLFeat: An open and portable library of computer vision algorithms.\u201d http://www.\n\nvlfeat.org/, 2008.\n\n[29] A. Berg, T. Berg, and J. Malik, \u201cShape matching and object recognition using low distortion correspondences,\u201d in Proceedings\n\nof the IEEE Conference on Computer Vision and Pattern Recognition, 2005.\n\n[30] S. Lazebnik, C. Schmid, and P. Ponce, \u201cBeyond bag of features: Spatial pyramid matching for recognizing natural scene\n\ncategories,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.\n\n[31] E. Shechtman and I. M., \u201cMatching local self-similarities across images and videos,\u201d in Proceedings of the IEEE Conference\n\non Computer Vision and Pattern Recognition, 2007.\n\n[32] K. Mikolajczyk and C. 
Schmid, \u201cDistinctive image features from scale-invariant keypoints,\u201d IEEE Transactions on Pattern\n\nAnalysis and Machine Intelligence, vol. 27, no. 10, pp. 1615\u20131630, 2005.\n\n[33] D. Lowe, \u201cDistinctive image features from scale-invariant keypoints,\u201d International Journal of Computer Vision, vol. 2, no. 60,\n\npp. 91\u2013110, 2004.\n\n[34] S. Lazebnik, C. Schmid, and P. Ponce, \u201cSparse texture representation using af\ufb01ne-invariant neighborhoods,\u201d in Proceedings\n\nof the IEEE Conference on Computer Vision and Pattern Recognition, 2003.\n\n[35] M. Muja and D. G. Lowe, \u201cFast approximate nearest neighbors with automatic algorithm con\ufb01guration,\u201d in Proceedings of\n\nthe International Conference on Computer Vision Theory and Application, pp. 331\u2013340, INSTICC Press, 2009.\n\n[36] J. Saketha Nath, G. Dinesh, S. Raman, C. Bhattacharyya, A. Ben-Tal, and K. Ramakrishan, \u201cOn the algorithmics and ap-\nplications of a mixed-norm based kernel learning formulation,\u201d in Proceedings of Neural Information Processings Systems,\n2009.\n\n9\n\n\f", "award": [], "sourceid": 1145, "authors": [{"given_name": "Serhat", "family_name": "Bucak", "institution": null}, {"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Anil", "family_name": "Jain", "institution": null}]}