{"title": "On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation", "book": "Advances in Neural Information Processing Systems", "page_first": 844, "page_last": 852, "abstract": "Motivated from real world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels are grouped into distinct components where each component is crucial for the learning task at hand. The formulation hence employs $l_\\infty$ regularization for promoting combinations at the component level and $l_1$ regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over product of simplexes, which is not a well-studied case in literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD based algorithm outperforms state-of-the-art MKL solvers like \\texttt{simpleMKL} in terms of computational effort.", "full_text": "On the Algorithmics and Applications of a\n\nMixed-norm based Kernel Learning Formulation\n\nJ. Saketha Nath\n\nDept. of Computer Science & Engg.,\n\nIndian Institute of Technology, Bombay.\n\nsaketh@cse.iitb.ac.in\n\nG. Dinesh\n\nDept. of Computer Science & Automation,\n\nIndian Institute of Science, Bangalore.\ndinesh@csa.iisc.ernet.in\n\nS. Raman\n\nDept. of Computer Science & Automation,\n\nIndian Institute of Science, Bangalore.\nsraman@csa.iisc.ernet.in\n\nChiranjib Bhattacharyya\n\nDept. of Computer Science & Automation,\n\nIndian Institute of Science, Bangalore.\n\nchiru@csa.iisc.ernet.in\n\nAharon Ben-Tal\n\nTechnion, Haifa.\n\nFaculty of Industrial Engg. & Management,\n\nabental@ie.technion.ac.il\n\nK. R. 
Ramakrishnan
Dept. of Electrical Engg.,
Indian Institute of Science, Bangalore.
krr@ee.iisc.ernet.in

Abstract

Motivated by real-world problems such as object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels is grouped into distinct components, where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among the kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of a non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over a product of simplexes, which is not a well-studied case in the literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD based algorithm outperforms state-of-the-art MKL solvers like simpleMKL in terms of computational effort.

1 Introduction

In this paper the problem of Multiple Kernel Learning (MKL) is studied in the setting where the given kernels are assumed to be grouped into distinct components and each component is crucial for the learning task at hand. The focus of this paper is the formalism and algorithmics of a specific mixed-norm regularization based MKL formulation suited for such tasks.

The majority of the existing MKL literature has considered employing a block l1 norm regularization, leading to the selection of a few of the given kernels [8, 1, 16, 14, 20]. Such formulations tend to select the "best" among the given kernels, and consequently the decision functions tend to depend only on the selected kernels.
Recently, [17] extended the framework of MKL to the case where the kernels are partitioned into groups, and introduced a generic mixed-norm regularization based MKL formulation in order to handle groups of kernels. Again, the idea there is to promote sparsity, leading to a small number of selected kernels. This paper differs from [17] by assuming that every component (group of kernels) is highly crucial for the success of the learning task. It is well known in the optimization literature that l∞ regularizations often promote combinations with equal preferences, whereas l1 regularizations lead to selections. The proposed MKL formulation hence employs l∞ regularization to promote combinations of kernels at the component level. Moreover, it employs l1 regularization for promoting sparsity among the kernels in each component.

The formulation studied here is motivated by real-world learning applications like object categorization, where multiple feature representations need to be employed simultaneously in order to achieve good generalization. Combining feature descriptors using the framework of Multiple Kernel Learning (MKL) [8] for object categorization has been a topic of interest in many recent studies [19, 13]. For example, in the case of flower classification, feature descriptors for shape, color and texture need to be employed in order to achieve good visual discrimination in the presence of significant within-class variation [12]. A key finding of [12] is the following: in object categorization tasks, employing few of the feature descriptors, or employing a canonical combination of them, often leads to sub-optimal solutions. Hence, in the framework of MKL, employing an l1 regularization, which is equivalent to selecting one of the given kernels, as well as employing an l2 regularization, which is equivalent to working with a canonical combination of the given kernels, may lead to sub-optimality.
This important finding clearly motivates the use of l∞ norm regularization for combining kernels generated from different feature descriptors and l1 norm regularization for selecting kernels generated from the same feature descriptor. Hence, by grouping kernels generated from the same feature descriptor together and employing the new MKL formulation, classifiers which are potentially well-suited for object categorization tasks can be built.

Apart from the novel MKL formulation, the main contribution of the paper is a highly efficient algorithm for solving it. Since the formulation is an instance of a Second Order Cone Program (SOCP), it can be solved using generic interior point algorithms. However, it is impractical to work with such solvers even for a moderately large number of data points and kernels. Also, the generic wrapper approach proposed in [17] cannot be employed, as it solves a non-convex variant of the proposed (convex) formulation. The proposed algorithm employs mirror-descent [3, 2, 9], leading to extremely scalable solutions.

The feasibility set for the minimization problem tackled by Mirror-Descent (MD) turns out to be a direct product of simplexes, which is not a standard set-up discussed in the optimization literature. We employ a weighted version of the entropy function as the prox-function in the auxiliary problem solved by MD at each iteration, and justify its suitability for the case of a direct product of simplexes. The mirror-descent based algorithm presented here is also of independent interest to the MKL community, as it can solve the traditional MKL problem, namely the case when the number of groups is unity. Empirically, we show that the mirror-descent based algorithm proposed here scales better than state-of-the-art steepest descent based algorithms [14].

The remainder of this paper is organized as follows: in section 2, details of the new MKL formulation and its dual are presented.
The mirror-descent based algorithm which efficiently solves the dual is presented in section 3. This is followed by a summary of the numerical experiments carried out to verify the major claims of the paper. In particular, the empirical findings are: a) the new MKL formulation is well-suited for object categorization tasks; b) the MD based algorithm scales better than state-of-the-art gradient descent methods (e.g. simpleMKL) in solving the special case where the number of components (groups) of kernels is unity.

2 Mixed-norm based MKL Formulation

This section presents the novel mixed-norm regularization based MKL formulation and its dual. In the following text we concentrate on the case of binary classification; however, many of the ideas presented here apply to other learning problems too. Let the training dataset be denoted by D = {(x_i, y_i), i = 1, ..., m | x_i ∈ X, y_i ∈ {−1, 1}}. Here, x_i represents the ith training data point with label y_i. Let Y denote the diagonal matrix with entries y_i. Suppose the given kernels are divided into n groups (components) and the jth component has n_j kernels. Let the feature-space mapping generated from the kth kernel of the jth component be φ_jk(·), and the corresponding gram-matrix of the training data points be K_jk¹.

¹The gram-matrices are unit-trace normalized.

We are in search of a hyperplane classifier of the form Σ_{j=1}^n Σ_{k=1}^{n_j} w_jk^⊤ φ_jk(x) − b = 0. As discussed above, we wish to perform a block l∞ regularization over the model parameters w_jk associated with distinct components and an l1 regularization over those associated with the same component. Intuitively, such a regularization promotes combinations of kernels belonging to different components and selections among kernels of the same component.
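As a concrete illustration of this setup, the grouped, unit-trace normalized gram-matrices K_jk can be constructed as below. This is only a sketch, not the authors' code: the Gaussian widths, the feature dimensions, and the random data are all made-up placeholders.

```python
import numpy as np

def rbf_gram(X, width):
    """Gaussian (RBF) gram matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * width ** 2))

def unit_trace(G):
    """Unit-trace normalization assumed in the formulation (footnote 1)."""
    return G / np.trace(G)

def grouped_grams(features_per_component, widths):
    """One component per feature representation; within a component,
    one kernel per width.  K[j][k] is the k-th gram matrix of component j."""
    return [[unit_trace(rbf_gram(X, w)) for w in widths]
            for X in features_per_component]

rng = np.random.default_rng(0)
m = 20
# two hypothetical feature representations of the same m data points
feats = [rng.standard_normal((m, 5)), rng.standard_normal((m, 8))]
K = grouped_grams(feats, widths=[0.5, 1.0, 2.0])   # n = 2 components, n_j = 3
```

Each K[j][k] is then symmetric positive semi-definite with trace one, matching the assumptions on the given kernels.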
Following the framework of MKL and the mixed-norm regularization detailed here, the following formulation is immediate:

min_{w_jk, b, ξ_i}  (1/2) max_j ( Σ_{k=1}^{n_j} ||w_jk||_2 )² + C Σ_i ξ_i
s.t.  y_i ( Σ_{j=1}^n Σ_{k=1}^{n_j} w_jk^⊤ φ_jk(x_i) − b ) ≥ 1 − ξ_i,  ξ_i ≥ 0 ∀ i        (1)

Here, the ξ_i variables measure the slack in correctly classifying the ith training data point, and C is the regularization parameter controlling the weightage given to the mixed-norm regularization term versus the total slack. The MKL formulation in (1) is convex and, moreover, an instance of an SOCP. This formulation can also be realized as a limiting case of the generic CAP formulation presented in [17] (with γ = 1, γ_0 → ∞). However, since the motivation of that work was to perform feature selection, this limiting case was neither theoretically studied nor empirically evaluated there. Moreover, the generic wrapper approach of [17] is inappropriate for solving this limiting case, as that approach would solve a non-convex variant of this (convex) formulation. In the following text, a dual of (1) is derived.

Let a simplex of dimensionality d be represented by Δ^d. Following the strategy of [14], one can introduce variables λ_j ≡ [λ_j1 ... λ_jn_j]^⊤ ∈ Δ^{n_j} and re-write (1) as follows:

min_{w_jk, b, ξ_i}  (1/2) max_j ( min_{λ_j ∈ Δ^{n_j}} Σ_{k=1}^{n_j} ||w_jk||_2² / λ_jk ) + C Σ_i ξ_i
s.t.  y_i ( Σ_{j=1}^n Σ_{k=1}^{n_j} w_jk^⊤ φ_jk(x_i) − b ) ≥ 1 − ξ_i,  ξ_i ≥ 0 ∀ i        (2)

This is because for any vector [a_1 ... a_n] ≥ 0, the following holds: min_{x_i ≥ 0, Σ_i x_i = 1} Σ_i a_i² / x_i = (Σ_i a_i)².
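This variational identity, which underlies the rewrite of (1) as (2), is easy to check numerically. The sketch below (the vector a is arbitrary) uses the closed-form minimizer x_i = a_i / Σ_i' a_i', at which Σ_i a_i²/x_i = (Σ_i a_i)², and verifies that random points of the simplex can only do worse:

```python
import numpy as np

def identity_gap(a, trials=20000, seed=0):
    """Compare min over the simplex of sum_i a_i^2 / x_i with (sum_i a_i)^2.
    The closed-form minimizer is x_i = a_i / sum(a); we also sample random
    simplex points, which should never beat it."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    target = a.sum() ** 2
    x_star = a / a.sum()
    best = np.sum(a ** 2 / x_star)           # attains the bound exactly
    for _ in range(trials):
        x = rng.dirichlet(np.ones(len(a)))   # random point of the simplex
        best = min(best, np.sum(a ** 2 / x))
    return best, target

best, target = identity_gap([0.3, 1.2, 2.5])
```

Here target = (0.3 + 1.2 + 2.5)² = 16, and best matches it to machine precision.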
Notice that the max over j and the min over λ_j can be interchanged. To see this, rewrite max_j as min_t t with constraints min_{λ_j ∈ Δ^{n_j}} Σ_{k=1}^{n_j} ||w_jk||_2² / λ_jk ≤ t, where t is a new decision variable. This problem is feasible in both the λ_j's and t, and hence we can drop the minimization over the individual constraints to obtain an equivalent problem: min_{λ_j ∈ Δ^{n_j} ∀ j, t} t subject to Σ_{k=1}^{n_j} ||w_jk||_2² / λ_jk ≤ t. One can now eliminate t by reintroducing the max_j and interchange the min_{λ_j ∈ Δ^{n_j} ∀ j} with the other variables to obtain:

min_{λ_j ∈ Δ^{n_j} ∀ j}  min_{w_jk, b, ξ_i}  (1/2) max_j Σ_{k=1}^{n_j} ||w_jk||_2² / λ_jk + C Σ_i ξ_i
s.t.  y_i ( Σ_{j=1}^n Σ_{k=1}^{n_j} w_jk^⊤ φ_jk(x_i) − b ) ≥ 1 − ξ_i,  ξ_i ≥ 0 ∀ i        (3)

Now one can derive the standard dual of (3) wrt. the variables w_jk, b, ξ_i alone, leading to:

min_{λ_j ∈ Δ^{n_j} ∀ j}  max_{α ∈ S_m(C), γ ∈ Δ^n}  1^⊤ α − (1/2) α^⊤ [ Σ_{j=1}^n ( Σ_{k=1}^{n_j} λ_jk Q_jk ) / γ_j ] α        (4)

where α, γ are Lagrange multipliers, S_m(C) ≡ {x ∈ R^m | 0 ≤ x ≤ C1, Σ_{i=1}^m x_i y_i = 0} and Q_jk ≡ Y K_jk Y. The following points regarding (4) must be noted:

• (4) is equivalent to the well-known SVM [18] formulation with kernel K_eff ≡ Σ_{j=1}^n ( Σ_{k=1}^{n_j} λ*_jk K_jk ) / γ*_j. In other words, 1/γ*_j is the weight given to the jth component² and λ*_jk is the weight given to the kth kernel of the jth component.

• It can be shown that none of γ_j, j = 1, . . .
, n can be zero, provided the given gram-matrices K_jk are positive definite³.

²Superscript '*' represents the optimal value as per (4).
³Add a small ridge if only positive semi-definite.

• By construction, most of the weights λ_jk are zero, and for at least one kernel in every component the weight is non-zero (see also [14]).

These facts readily justify the suitability of the particular mixed-norm regularization for object categorization. Indeed, in sync with the findings of [12], kernels from different feature descriptors (components) are combined using non-trivial weights (i.e., 1/γ*_j). Moreover, only the "best" kernels from each feature descriptor (component) are utilized by the model. This sparsity feature leads to better interpretability as well as computational benefits during the prediction stage. In the following section, an efficient iterative algorithm for solving the dual (4) is presented.

3 Efficient Algorithm for Solving the Dual

This section presents an efficient algorithm for solving the dual (4). Note that typically in object categorization or other such multi-modal learning tasks, the number of feature descriptors (i.e. the number of groups of kernels, n) is low (< 10). However, the number of kernels constructed from each feature descriptor can be very high, i.e., n_j ∀ j can be quite high. Also, it is common to encounter datasets with a huge number of training data points, m. Hence it is desirable to derive algorithms which scale well wrt. m and n_j. We assume n is small and almost O(1).
Consider the dual formulation (4). Using the minimax theorem [15], one can interchange the min over the λ_j's and the max over γ to obtain:

min_{γ ∈ Δ^n}  [ − f(γ) ],   where   f(γ) ≡ min_{λ_j ∈ Δ^{n_j} ∀ j}  g_γ(λ_1, . . . , λ_n)   and
g_γ(λ_1, . . . , λ_n) ≡ max_{α ∈ S_m(C)}  1^⊤ α − (1/2) α^⊤ [ Σ_{j=1}^n ( Σ_{k=1}^{n_j} λ_jk Q_jk ) / γ_j ] α        (5)

We have restated the maximum over γ as a minimization problem by introducing a minus sign. The proposed algorithm performs alternate minimization over the variables γ and (λ_1, . . . , λ_n, α). In other words, in one step the variables (λ_1, . . . , λ_n, α) are assumed to be constant and (5) is optimized wrt. γ. This leads to the following optimization problem:

min_{γ ∈ Δ^n}  Σ_{j=1}^n W_j / γ_j

where W_j = α^⊤ ( Σ_{k=1}^{n_j} λ_jk Q_jk ) α. This problem has an analytical solution, given by:

γ_j = √W_j / Σ_{j'} √W_{j'}        (6)

In the subsequent step, γ is assumed to be fixed and (5) is optimized wrt. (λ_1, . . . , λ_n, α). For this, f(γ) needs to be evaluated by solving the corresponding optimization problem (refer to (5) for the definition of f). Now, the per-step computational complexity of the iterative algorithm depends on how efficiently one evaluates f for a given γ. In the following, we present a mirror-descent (MD) based algorithm which evaluates f to sufficient accuracy in O(log [max_j n_j]) · O(SVM_m).
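The closed form (6) for the γ-step can be sanity-checked numerically. In the sketch below (the positive W_j values are arbitrary), the point γ_j = √W_j / Σ_j' √W_j' should beat every randomly sampled point of the simplex on the objective Σ_j W_j/γ_j:

```python
import numpy as np

def gamma_step(W):
    """Analytical minimizer of sum_j W_j / gamma_j over the simplex, eq. (6)."""
    s = np.sqrt(W)
    return s / s.sum()

def objective(W, gamma):
    return float(np.sum(W / gamma))

rng = np.random.default_rng(1)
W = np.array([0.5, 2.0, 0.125])       # arbitrary positive component values
g_star = gamma_step(W)
best = objective(W, g_star)           # equals (sum_j sqrt(W_j))^2 = 6.125 here
# no random simplex point should do better
others = [objective(W, rng.dirichlet(np.ones(3))) for _ in range(10000)]
```

For these W_j, the optimal value is (√0.5 + √2 + √0.125)² = 49/8 = 6.125, and every sampled point attains at least that value.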
Here, O(SVM_m) represents the computational complexity of solving an SVM with m training data points. Neglecting the log term, the overall per-step computational effort for the alternate minimization can be assumed to be O(SVM_m), and hence nearly independent of the number of kernels. Alternatively, one can employ the strategy of [14] and compute f using projected steepest-descent (SD) methods. The following points highlight the merits and de-merits of these two methods:

• In the case of SD, the per-step auxiliary problem has no closed-form solution, and projections onto the feasibility set need to be performed, which is computationally intensive, especially for problems in high dimensions. In the case of MD, the auxiliary problem has an analytical solution (refer to (8)).

• The step size needs to be computed using a 1-d line search in the case of SD, whereas the step-sizes for MD can be computed using analytical expressions (refer to (9)).

• The computational complexity of evaluating f using MD is nearly independent of the number of kernels. No such statement can be made for SD (unless the feasibility set is of Euclidean geometry, which is not so in our case).

The MD based algorithm for evaluating f(γ), i.e. solving min_{λ_j ∈ Δ^{n_j} ∀ j} g_γ(λ_1, . . . , λ_n), is detailed below. Let λ represent the vector [λ_1 . . . λ_n]^⊤. Also, let values at iteration 't' be indicated using the super-script '(t)'.
Similar to any gradient-based method, at each step 't' MD works with a linear approximation of g_γ: ĝ_γ^(t)(λ) = g_γ(λ^(t)) + (λ − λ^(t))^⊤ ∇g_γ(λ^(t)), and follows the update rule:

λ^(t+1) = argmin_{λ ∈ Δ^{n_1} × . . . × Δ^{n_n}}  [ ĝ_γ^(t)(λ) + (1/s_t) ω(λ^(t), λ) ]        (7)

where ω(x, y) ≡ ω(y) − ω(x) − (y − x)^⊤ ∇ω(x) is the Bregman divergence (prox-function) associated with ω(x), a continuously differentiable, strongly convex distance-generating function, and s_t is a regularization parameter which also determines the step-size. (7) is usually known as the auxiliary problem and needs to be solved at each step. Intuitively, (7) minimizes a weighted sum of the local linear approximation of the original objective and a regularization term that penalizes solutions far from the current iterate. It is easy to show that the update rule in (7) leads to the SD technique if ω(x) = (1/2)||x||_2² and the step-size is chosen using a 1-d line search. The key idea in MD is to choose the distance-generating function based on the feasibility set, which in our case is a direct product of simplexes, such that (7) is very easy to solve. Note that for SD, with the feasibility set being a direct product of simplexes, (7) is not easy to solve, especially in higher dimensions.

We choose the distance-generating function to be the following modified entropy function: ω(x) ≡ Σ_{j=1}^n Σ_{k=1}^{n_j} (x_jk n⁻¹ + δ n⁻¹ n_j⁻¹) log(x_jk n⁻¹ + δ n⁻¹ n_j⁻¹), where δ is a small positive number (say, 10⁻¹⁶). Now, let g̃_γ^(t) ≡ s_t ∇g_γ(λ^(t)) − ∇ω(λ^(t)). Note that g_γ is nothing but the optimal objective of an SVM with kernel K_eff.
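To see why an entropy-type prox-function makes (7) easy, consider for illustration the plain entropy ω(x) = Σ x_k log x_k on a single simplex (the paper's ω is a δ-smoothed, n-scaled variant over the product of simplexes). Its Bregman divergence is the KL divergence, and the auxiliary problem then has the familiar multiplicative closed form. The sketch below (arbitrary gradient vector and step size) verifies numerically that the closed form minimizes the auxiliary objective over the simplex:

```python
import numpy as np

def aux_objective(lam, lam_t, grad, s):
    """Linearized objective plus (1/s) times the Bregman divergence of
    omega(x) = sum x log x, which on the simplex is KL(lam || lam_t)."""
    kl = float(np.sum(lam * np.log(lam / lam_t)))
    return float(lam @ grad) + kl / s

def md_update(lam_t, grad, s):
    """Closed-form minimizer of the auxiliary problem over the simplex:
    a softmax-like multiplicative update."""
    w = lam_t * np.exp(-s * grad)
    return w / w.sum()

rng = np.random.default_rng(2)
lam_t = rng.dirichlet(np.ones(4))
grad = rng.standard_normal(4)
s = 0.3
lam_next = md_update(lam_t, grad, s)
f_star = aux_objective(lam_next, lam_t, grad, s)
f_rand = [aux_objective(rng.dirichlet(np.ones(4)), lam_t, grad, s)
          for _ in range(5000)]
```

Every randomly sampled simplex point yields an auxiliary objective at least as large as that of the closed-form update, and the update stays strictly inside the simplex.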
Since it is assumed that each given kernel is positive definite, the optimum of the SVM is unique, and hence the gradient of g_γ wrt. λ exists [5]. The gradient of g_γ can be computed using

∂g_γ / ∂λ_jk = − (1/2) (α^(t))^⊤ Q_jk α^(t) / γ_j

where α^(t) is the optimal α obtained by solving an SVM with kernel Σ_{j=1}^n ( Σ_{k=1}^{n_j} λ_jk^(t) K_jk ) / γ_j. With this notation, it is easy to show that the optimal update (7) has the following analytical form⁴:

λ_jk^(t+1) = exp{ −n g̃_γ,jk^(t) } / Σ_{k'=1}^{n_j} exp{ −n g̃_γ,jk'^(t) }        (8)

The following text discusses the convergence issues with MD. Let the modulus of strong convexity of ω wrt. ||·|| ≡ ||·||_1 be σ. Also, let the ω-size of the feasibility set be defined as follows: Θ ≡ max_{u,v ∈ Δ^{n_1} × . . . × Δ^{n_n}} ω(u, v). It is easy to verify that σ = O(1) n⁻² and Θ = O(log [max_j n_j]) in our case. The convergence and its efficiency follow from this result [3, 2, 9]:

Result 1. With step-sizes s_t = √(Θσ) / ( ||∇g_γ||_* √t ), one has the following bound on the error after iteration T:

ε_T = min_{t ≤ T} g_γ(λ^(t)) − g_γ(λ*) ≤ O(1) √Θ L_{||·||}(g_γ) / √(σT)

where ||·||_* is the dual norm of the norm wrt. which the modulus of strong convexity was computed (in our case ||·||_* = ||·||_∞), and L_{||·||}(h) is the Lipschitz constant of the function h wrt.
the norm ||·|| (in our case ||·|| = ||·||_1, and it can be shown that the Lipschitz constant exists for g_γ). Substituting the particular values for our case, we obtain

s_t = √(log [max_j n_j]) / ( n ||∇g_γ||_∞ √t )        (9)

and ε_T ∝ √(log [max_j n_j]) / √T. In other words, for reaching a reasonable approximation of the optimal, the number of iterations required is O(log [max_j n_j]), which is nearly independent of the number of kernels. Since the computations in each iteration are dominated by the SVM optimization, the overall complexity of MD is (nearly) O(SVM_m). Note that the iterative algorithm can be improved by improving the algorithm for solving the SVM problem. The overall algorithm is summarized in Algorithm 1⁵.

Algorithm 1: Mirror-descent based alternate minimization algorithm
Data: labels and gram-matrices of the training examples, component-id of each kernel, regularization parameter (C)
Result: optimal values of α, γ, λ in (4)
begin
    Set γ, λ to some initial feasible values.
    while stopping criterion for γ is not met do        /* alternate minimization loop */
        while stopping criterion for λ is not met do        /* mirror-descent loop */
            Solve the SVM with the current kernel weights and update α
            Compute g̃_γ^(t) and update λ using (8)
        Compute the W_j and update γ using (6)
    Return the values of α, γ, λ
end

⁴Since the term involving δ is ≪ λ_jk, it is neglected in this computation.

The MKL formulation presented here exploits the special structure in the kernels and leads to non-trivial combinations of the kernels belonging to different components and selections among the kernels of the same component.
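Algorithm 1 can be sketched end-to-end as follows. This is a minimal illustration, not the authors' implementation: the inner SVM is replaced by a crude box-constrained dual solver with the bias term dropped, a plain entropic (multiplicative) λ-update with a fixed step size stands in for (8) with the schedule (9), and the stopping criteria are fixed iteration counts; the paper uses libsvm and the exact updates.

```python
import numpy as np

def solve_svm_dual(Qeff, C, steps=500, lr=0.01):
    """Crude stand-in for the SVM solve: maximizes 1'a - 0.5 a'Qa over
    the box 0 <= a <= C (bias term dropped for simplicity)."""
    a = np.zeros(Qeff.shape[0])
    for _ in range(steps):
        a = np.clip(a + lr * (1.0 - Qeff @ a), 0.0, C)
    return a

def mixed_norm_mkl(Q, C=1.0, outer=15, inner=15, s=0.5):
    """Alternate minimization for the dual (4); Q[j][k] = Y K_jk Y."""
    n = len(Q)
    lam = [np.ones(len(Qj)) / len(Qj) for Qj in Q]   # lambda_j on simplexes
    gamma = np.ones(n) / n
    for _ in range(outer):                            # alternate minimization
        for _ in range(inner):                        # mirror-descent loop
            Qeff = sum(lam[j][k] * Q[j][k] / gamma[j]
                       for j in range(n) for k in range(len(Q[j])))
            alpha = solve_svm_dual(Qeff, C)
            for j in range(n):   # entropic multiplicative update per simplex
                grad = np.array([-0.5 * alpha @ Qjk @ alpha / gamma[j]
                                 for Qjk in Q[j]])
                w = lam[j] * np.exp(-s * grad)
                lam[j] = w / w.sum()
        W = np.array([alpha @ sum(lam[j][k] * Q[j][k]
                                  for k in range(len(Q[j]))) @ alpha
                      for j in range(n)])
        gamma = np.sqrt(W) / np.sqrt(W).sum()         # gamma-step, eq. (6)
    return alpha, gamma, lam

# tiny synthetic problem: 2 components with 2 positive-definite kernels each
rng = np.random.default_rng(3)
m = 12
y = np.array([1] * 6 + [-1] * 6)
Y = np.diag(y)
def rand_gram():
    A = rng.standard_normal((m, m))
    G = A @ A.T + 1e-3 * np.eye(m)     # positive definite
    return G / np.trace(G)             # unit-trace normalized
Q = [[Y @ rand_gram() @ Y for _ in range(2)] for _ in range(2)]
alpha, gamma, lam = mixed_norm_mkl(Q)
```

At every iterate the variables remain dual-feasible: α stays in the box [0, C]^m, γ is on Δ^n, and each λ_j is on Δ^{n_j} with strictly positive entries, mirroring the structure of (4).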
Moreover, the proposed iterative algorithm solves the formulation with a per-step complexity of (almost) O(SVM_m), which is the same as that of traditional MKL formulations (which do not exploit this structure). As discussed earlier, this efficiency is an outcome of employing state-of-the-art mirror-descent techniques. The MD based algorithm presented here is of independent interest to the MKL community: in the special case where the number of components is unity (i.e. n = 1), the proposed algorithm solves the traditional MKL formulation. Clearly, owing to the merits of MD over SD discussed earlier, the new algorithm can potentially be employed to boost the performance of state-of-the-art MKL algorithms. Our empirical results confirm that the proposed algorithm (with n = 1) outperforms simpleMKL in terms of computational efficiency.

4 Numerical Experiments

This section presents the results of experiments which empirically verify the major claims of the paper: a) the proposed formulation is well-suited for object categorization; b) in the case n = 1, the proposed algorithm outperforms simpleMKL wrt. computational effort. In the following, the experiments done on real-world object categorization datasets are summarized. The proposed MKL formulation is compared with a state-of-the-art methodology for object categorization [19, 13] that employs a block l1 regularization based MKL formulation with additional constraints for including prior information regarding the weights of kernels.
Since such constraints lead to independent improvements with all formulations, the experiments here compare the following three MKL formulations without the additional constraints: MixNorm-MKL, the (l∞, l1) mixed-norm based MKL formulation studied in this paper; L1-MKL, the block l1 regularization based MKL formulation [14]; and L2-MKL, which is nothing but an SVM built using the canonical combination of all the kernels, i.e. K_eff ≡ Σ_{j=1}^n Σ_{k=1}^{n_j} K_jk. In the case of MixNorm-MKL, the MD based algorithm (section 3) was used to solve the formulation; the SVM problem arising at each step of mirror-descent is solved using the libsvm software⁶. L1-MKL is solved using simpleMKL⁷. L2-MKL is solved using libsvm and serves as a baseline for comparison. In all cases, the hyper-parameters of the various formulations were tuned using suitable cross-validation procedures, and the accuracies reported denote the testset accuracies achieved by the respective classifiers with the tuned set of hyper-parameters.

⁵Asymptotic convergence can be proved for the algorithm; details are omitted due to lack of space.
⁶Available at www.csie.ntu.edu.tw/~cjlin/libsvm
⁷Available at http://asi.insa-rouen.fr/enseignants/~arakotom/code/mklindex.html

Figure 1: Plot of the average gain (%) in accuracy with MixNorm-MKL on the various real-world datasets. Panels: (a), (b) Caltech-5; (c), (d) Oxford Flowers; (e), (f) Caltech-101.

The following real-world datasets were used in the experiments: Caltech-5 [6], Caltech-101 [7] and Oxford Flowers [10]. The Caltech datasets contain digital images of various objects like faces, watches, ants, etc., whereas the Oxford dataset contains images of 17 varieties of flowers.
The Caltech-101 dataset has 101 categories of objects, whereas the Caltech-5 dataset is a subset of the Caltech-101 dataset comprising images of Airplanes, Car sides, Faces, Leopards and Motorbikes alone. Most categories of objects in the Caltech dataset have about 50 images; the number of images per category varies from 40 to 800. In the Oxford flowers dataset there are 80 images in each flower category. In order to make the results presented here comparable to others in the literature, we have followed the usual practice of generating training and test sets using a fixed number of pictures from each object category and repeating the experiments with different random selections of pictures. For the Caltech-5, Caltech-101 and Oxford flowers datasets we have used 50, 15 and 60 images per object category as training images, and 50, 15 and 20 images per object category as testing images, respectively. Also, in the case of the Caltech-5 and Oxford flowers datasets, the accuracies reported are the testset accuracies averaged over 10 such randomly sampled training and test datasets. Since the Caltech-101 dataset has a large number of classes and the experiments are computationally intensive (100 choose 2 classifiers need to be built in each case), the results are averaged over 3 sets of training and test datasets only. In the case of the Caltech datasets, five feature descriptors⁸ were employed: SIFT, OpponentSIFT, rgSIFT, C-SIFT and Transformed Color SIFT. In the case of the Oxford flowers dataset, following the strategy of [11, 10], seven feature descriptors⁹ were employed. Using each feature descriptor, nine kernels were generated by varying the width-parameter of the Gaussian kernel. The kernels can be grouped based on the feature descriptor they were generated from, and the proposed formulation can be employed to construct classifiers well-suited for object categorization. For example,
in the case of the Caltech datasets, n = 5 and n_j = 9 ∀ j, and in the case of the Oxford flowers dataset, n = 7 and n_j = 9 ∀ j. In all cases, the 1-vs-1 methodology was employed to handle the multi-class problems.

The results of the experiments are summarized in figure 1. Each plot shows, per object category (x-axis), the % gain in accuracy (y-axis) achieved by MixNorm-MKL over L1-MKL and L2-MKL.

⁸Code at http://staff.science.uva.nl/~ksande/research/colordescriptors/
⁹Distance matrices available at http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html

Figure 2: Scaling plots comparing the scalability of the mirror-descent based algorithm and simpleMKL (training time in seconds versus log10 of the number of kernels, for MixNorm-MKL and L1-MKL).

Note that for most object categories, the gains are positive and moreover quite high. The best results are seen in the case of the Caltech-101 dataset: the peak and average gains over L1-MKL are 800% and 37.57% respectively, and over L2-MKL are 600% and 21.75% respectively. The gains in terms of raw numbers for the other two datasets are not as high, merely because the baseline accuracies were themselves high. The baseline accuracies, i.e., the average accuracy achieved by L2-MKL (over all categories), were 93.84%, 34.81% and 85.97% for the Caltech-5, Caltech-101 and Oxford flowers datasets respectively. The figures clearly show that the proposed formulation outperforms state-of-the-art object categorization techniques and is hence highly suited for such tasks.
Another observation was that the average sparsity (% of kernels with zero weightage) with the methods MixNorm-MKL, L1-MKL and L2-MKL was 57%, 96% and 0% respectively. Also, it was observed that L1-MKL almost always selected kernels from only one or two components (feature descriptors), whereas MixNorm-MKL (and, of course, L2-MKL) selected kernels from all the components. These observations clearly show that the proposed formulation combines important kernels while eliminating redundant and noisy kernels, using the information embedded in the group structure of the kernels.

In the following, the results of experiments which compare the scalability of simpleMKL and the proposed mirror-descent based algorithm wrt. the number of kernels are presented. Note that in the special case n = 1, the proposed formulation is exactly the same as the l1 regularization based formulation. Hence the mirror-descent based iterative algorithm proposed here can also be employed for solving l1 regularization based MKL. Figure 2 shows plots of the training times as a function of the number of kernels for the two algorithms on two binary classification problems encountered in the object categorization experiments. The plots clearly show that the proposed algorithm outperforms simpleMKL in terms of computational effort. Interestingly, it was found in our experiments that, in most cases, the major computational effort at every iteration of simpleMKL was in computing the projection onto the feasible set! On the contrary, mirror descent allows an easily computable closed-form solution for the per-step auxiliary problem.
We think this is the crucial advantage of the proposed iterative algorithm over the gradient-descent based algorithms which were traditionally employed for solving MKL formulations.

5 Conclusions

This paper makes two important contributions: a) a specific mixed-norm regularization based MKL formulation which is well-suited for object categorization and multi-modal tasks; b) an efficient mirror-descent based algorithm for solving the new formulation. Empirical results on real-world datasets show that the new formulation achieves far better generalization than state-of-the-art object categorization techniques. In some cases, the average gain in test-set accuracy compared to the state-of-the-art was as high as 37%. The mirror-descent based algorithm presented in the paper not only solves the proposed formulation efficiently but also outperforms simpleMKL in solving the traditional l1 regularization based MKL. The speed-up was as high as 12 times in some cases. Application of the proposed methodology to various other multi-modal tasks and the study of improved variants of the mirror-descent algorithm [4] for faster convergence are currently being explored by us.

Acknowledgements CB was supported by grants from Yahoo! and IBM.

[Figure 2 panels: training time (seconds) vs. log10(number of kernels) for MixNorm-MKL and L1-MKL on the two binary classification problems.]

References

[1] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In International Conference on Machine Learning, 2004.

[2] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

[3] Aharon Ben-Tal, Tamar Margalit, and Arkadi Nemirovski.
The Ordered Subsets Mirror Descent Optimization Method with Applications to Tomography. SIAM Journal on Optimization, 12(1):79–108, 2001.

[4] Aharon Ben-Tal and Arkadi Nemirovski. Non-Euclidean Restricted Memory Level Method for Large-scale Convex Optimization. Mathematical Programming, 102(3):407–456, 2005.

[5] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for SVM. Machine Learning, 46:131–159, 2002.

[6] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 2003.

[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.

[8] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research, 5:27–72, 2004.

[9] Arkadi Nemirovski. Lectures on modern convex optimization (chp. 5.4). Available at www2.isye.gatech.edu/~nemirovs/Lect_ModConvOpt.pdf.

[10] M-E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[11] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.

[12] Maria-Elena Nilsback and Andrew Zisserman. A Visual Vocabulary for Flower Classification. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 1447–1454, 2006.

[13] Maria-Elena Nilsback and Andrew Zisserman.
Automated Flower Classification over a Large Number of Classes. In Proceedings of the Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008.

[14] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

[15] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[16] Soren Sonnenburg, Gunnar Ratsch, Christin Schafer, and Bernhard Scholkopf. Large Scale Multiple Kernel Learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

[17] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite Kernel Learning. In Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML), 2008.

[18] Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[19] M. Varma and D. Ray. Learning the Discriminative Power Invariance Trade-off. In Proceedings of the International Conference on Computer Vision, 2007.

[20] Zenglin Xu, Rong Jin, Irwin King, and Michael R. Lyu. An Extended Level Method for Multiple Kernel Learning. In Advances in Neural Information Processing Systems, 2008.