{"title": "Multilabel Structured Output Learning with Random Spanning Trees of Max-Margin Markov Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 873, "page_last": 881, "abstract": "We show that the usual score function for conditional Markov networks can be written as the expectation over the scores of their spanning trees. We also show that a small random sample of these output trees can attain a significant fraction of the margin obtained by the complete graph and we provide conditions under which we can perform tractable inference. The experimental results confirm that practical learning is scalable to realistic datasets using this approach.", "full_text": "Multilabel Structured Output Learning with Random\n\nSpanning Trees of Max-Margin Markov Networks\n\nD\u00b4epartement d\u2019informatique et g\u00b4enie logiciel\n\nMario Marchand\n\nUniversit\u00b4e Laval\n\nQu\u00b4ebec (QC), Canada\n\nmario.marchand@ift.ulaval.ca\n\nHongyu Su\n\nHelsinki Institute for Information Technology\nDept of Information and Computer Science\n\nAalto University, Finland\nhongyu.su@aalto.fi\n\nEmilie Morvant\u2217\n\nLaHC, UMR CNRS 5516\nUniv. of St-Etienne, France\n\nemilie.morvant@univ-st-etienne.fr\n\nJuho Rousu\n\nHelsinki Institute for Information Technology\nDept of Information and Computer Science\n\nAalto University, Finland\njuho.rousu@aalto.fi\n\nJohn Shawe-Taylor\n\nDepartment of Computer Science\n\nUniversity College London\n\nLondon, UK\n\nj.shawe-taylor@ucl.ac.uk\n\nAbstract\n\nWe show that the usual score function for conditional Markov networks can be\nwritten as the expectation over the scores of their spanning trees. We also show\nthat a small random sample of these output trees can attain a signi\ufb01cant fraction\nof the margin obtained by the complete graph and we provide conditions under\nwhich we can perform tractable inference. 
The experimental results confirm that practical learning is scalable to realistic datasets using this approach.

*Most of this work was carried out while E. Morvant was affiliated with IST Austria, Klosterneuburg.

1 Introduction

Finding a hyperplane that minimizes the number of misclassifications is NP-hard. But the support vector machine (SVM) substitutes the hinge for the discrete loss and, modulo a margin assumption, can nonetheless efficiently find a hyperplane with a guarantee of good generalization. This paper investigates whether the problem of inference over a complete graph in structured output prediction can be avoided in an analogous way based on a margin assumption.

We first show that the score function for the complete output graph can be expressed as the expectation over the scores of random spanning trees. A sampling result then shows that a small random sample of these output trees can attain a significant fraction of the margin obtained by the complete graph. Together with a generalization bound for the sample of trees, this shows that we can obtain good generalization using the average scores of a sample of trees in place of the complete graph. We have thus reduced the intractable inference problem to a convex optimization not dissimilar to an SVM. The key inference problem to enable learning with this ensemble now becomes finding the maximum violator for the (finite sample) average tree score. We then provide the conditions under which the inference problem is tractable. Experimental results confirm this prediction and show that practical learning is scalable to realistic datasets using this approach, with the resulting classification accuracy enhanced over more naive ways of training the individual tree score functions.

The paper aims at exploring the potential ramifications of the random spanning tree observation both theoretically and practically. 
As such, we think that we have laid the foundations for a fruitful approach to tackling the intractability of inference in a number of scenarios. Other attractive features are that we do not require knowledge of the output graph's structure, that the optimization is convex, and that the accuracy of the optimization can be traded against computation. Our approach is firmly rooted in the maximum margin Markov network analysis [1]. Other ways to address the intractability of loopy graph inference include approximate MAP inference with tree-based and LP relaxations [2], semi-definite programming convex relaxations [3], special cases of graph classes for which inference is efficient [4], and the use of random tree score functions in heuristic combinations [5]. Our work is not based on any of these approaches, despite superficial resemblances to, e.g., the trees in tree-based relaxations and the use of random trees in [5]. We believe it represents a distinct approach to a fundamental problem of learning and, as such, is worthy of further investigation.

2 Definitions and Assumptions

We consider supervised learning problems where the input space X is arbitrary and the output space Y consists of the set of all ℓ-dimensional multilabel vectors y := (y_1, ..., y_ℓ), where each y_i ∈ {1, ..., r_i} for some finite positive integer r_i. Each example (x, y) ∈ X × Y is mapped to a joint feature vector φ(x, y). Given a weight vector w in the space of joint feature vectors, the predicted output y_w(x) at input x ∈ X is given by the output y maximizing the score F(w, x, y), i.e.,

y_w(x) := argmax_{y∈Y} F(w, x, y) , where F(w, x, y) := ⟨w, φ(x, y)⟩ ,   (1)

and where ⟨·,·⟩ denotes the inner product in the joint feature space. 
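Equation (1) can be made concrete with a small brute-force sketch. Everything below (the one-hot pairwise feature map, the dimensions, the label sets) is illustrative only and not part of the paper's method; the point is merely that y_w(x) is an argmax over the exponentially large set Y, which is exactly the enumeration the rest of the paper works to avoid.

```python
import itertools
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def brute_force_inference(w, phi, x, label_sets):
    """Solve Eq. (1) by exhaustive enumeration of Y (tractable for tiny Y only)."""
    y_best = max(itertools.product(*label_sets), key=lambda y: dot(w, phi(x, y)))
    return y_best, dot(w, phi(x, y_best))

# Illustrative feature map: 3 binary microlabels, one-hot (position, label)
# features scaled by x and normalized to unit L2 norm.
def phi(x, y):
    v = [0.0] * 6
    for i, yi in enumerate(y):
        v[2 * i + yi] = x[i]
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

random.seed(0)
w = [random.gauss(0, 1) for _ in range(6)]
x = [1.0, 1.0, 1.0]
y_star, score = brute_force_inference(w, phi, x, [(0, 1)] * 3)
```

With ℓ binary microlabels the loop visits 2^ℓ candidates, which is why the paper restricts attention to feature maps whose argmax admits dynamic programming.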
Hence, y_w(x) is obtained by solving the so-called inference problem, which is known to be NP-hard for many output feature maps [6, 7]. Consequently, we aim at using an output feature map for which the inference problem can be solved by a polynomial time algorithm such as dynamic programming. The margin Γ(w, x, y) achieved by predictor w at example (x, y) is defined as

Γ(w, x, y) := min_{y'≠y} [F(w, x, y) − F(w, x, y')] .

We consider the case where the feature map φ is a potential function for a Markov network defined by a complete graph G with ℓ nodes and ℓ(ℓ−1)/2 undirected edges. Each node i of G represents an output variable y_i and there exists an edge (i, j) of G for each pair (y_i, y_j) of output variables. For any example (x, y) ∈ X × Y, its joint feature vector is given by

φ(x, y) = (φ_{i,j}(x, y_i, y_j))_{(i,j)∈G} = (ϕ(x) ⊗ ψ_{i,j}(y_i, y_j))_{(i,j)∈G} ,

where ⊗ is the Kronecker product. Hence, any predictor w can be written as w = (w_{i,j})_{(i,j)∈G}, where w_{i,j} is w's weight on φ_{i,j}(x, y_i, y_j). Therefore, for any w and any (x, y), we have

F(w, x, y) = ⟨w, φ(x, y)⟩ = Σ_{(i,j)∈G} ⟨w_{i,j}, φ_{i,j}(x, y_i, y_j)⟩ = Σ_{(i,j)∈G} F_{i,j}(w_{i,j}, x, y_i, y_j) ,

where we denote by F_{i,j}(w_{i,j}, x, y_i, y_j) := ⟨w_{i,j}, φ_{i,j}(x, y_i, y_j)⟩ the score of labeling the edge (i, j) by (y_i, y_j) given input x.

For any vector a, let ‖a‖ denote its L2 norm. Throughout the paper, we make the assumption that we have a normalized joint feature space such that ‖φ(x, y)‖ = 1 for all (x, y) ∈ X × Y and ‖φ_{i,j}(x, y_i, y_j)‖ is the same for all (i, j) ∈ G. Since the complete graph G has ℓ(ℓ−1)/2 edges, it follows that ‖φ_{i,j}(x, y_i, y_j)‖² = 2/(ℓ(ℓ−1)) for all (i, j) ∈ G.

We also have a training set S := {(x_1, y_1), ..., (x_m, y_m)} where each example is generated independently according to some unknown distribution D. Mathematically, we do not assume the existence of a predictor w achieving some positive margin Γ(w, x, y) on each (x, y) ∈ S. Indeed, for some S, there might not exist any w for which Γ(w, x, y) > 0 for all (x, y) ∈ S. However, the generalization guarantee will be best when w achieves a large margin on most training points. Given any γ > 0 and any (x, y) ∈ X × Y, the hinge loss (at scale γ) incurred on (x, y) by a unit L2 norm predictor w that achieves a (possibly negative) margin Γ(w, x, y) is given by L_γ(Γ(w, x, y)), where the so-called hinge loss function L_γ is defined as L_γ(s) := max(0, 1 − s/γ) ∀s ∈ R. We will also make use of the ramp loss function A_γ defined by A_γ(s) := min(1, L_γ(s)) ∀s ∈ R.

The proofs of all the rigorous results of this paper are provided in the supplementary material.

3 Superposition of Random Spanning Trees

Given a complete graph G of ℓ nodes (representing the Markov network), let S(G) denote the set of all ℓ^{ℓ−2} spanning trees of G. Recall that each spanning tree of G has ℓ−1 edges. Hence, for any edge (i, j) ∈ G, the number of trees in S(G) covering that edge (i, j) is given by ℓ^{ℓ−2}(ℓ−1)/(ℓ(ℓ−1)/2) = (2/ℓ) ℓ^{ℓ−2}. 
Therefore, for any function f of the edges of G, we have

Σ_{T∈S(G)} Σ_{(i,j)∈T} f((i, j)) = ℓ^{ℓ−2} (2/ℓ) Σ_{(i,j)∈G} f((i, j)) .

Given any spanning tree T of G and given any predictor w, let w_T denote the projection of w on the edges of T. Namely, (w_T)_{i,j} = w_{i,j} if (i, j) ∈ T, and (w_T)_{i,j} = 0 otherwise. Let us also denote by φ_T(x, y) the projection of φ(x, y) on the edges of T. Namely, (φ_T(x, y))_{i,j} = φ_{i,j}(x, y_i, y_j) if (i, j) ∈ T, and (φ_T(x, y))_{i,j} = 0 otherwise. Recall that ‖φ_{i,j}(x, y_i, y_j)‖² = 2/(ℓ(ℓ−1)) for all (i, j) ∈ G. Thus, for all (x, y) ∈ X × Y and for all T ∈ S(G), we have

‖φ_T(x, y)‖² = Σ_{(i,j)∈T} ‖φ_{i,j}(x, y_i, y_j)‖² = (ℓ−1) · 2/(ℓ(ℓ−1)) = 2/ℓ .

We now establish how F(w, x, y) can be written as an expectation over all the spanning trees of G.

Lemma 1. Let ŵ_T := w_T/‖w_T‖ and φ̂_T := φ_T/‖φ_T‖, and let U(G) denote the uniform distribution on S(G). Then, we have

F(w, x, y) = E_{T∼U(G)} a_T ⟨ŵ_T, φ̂_T(x, y)⟩ , where a_T := √(ℓ/2) ‖w_T‖ .

Moreover, for any w such that ‖w‖ = 1, we have E_{T∼U(G)} a_T² = 1 and E_{T∼U(G)} a_T ≤ 1.

Let T := {T_1, ..., T_n} be a sample of n spanning trees of G where each T_i is sampled independently according to U(G). 
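For small ℓ, the counting facts above can be verified exhaustively by enumerating S(G) through Prüfer sequences, the standard bijection between {0, ..., ℓ−1}^{ℓ−2} and labeled trees on ℓ nodes (this enumeration is an illustration only, not part of the method):

```python
import itertools
import random
from collections import Counter

def prufer_to_edges(seq, l):
    """Decode a Prufer sequence into the edge set of a labeled tree on
    {0, ..., l-1} (the classic bijection behind Cayley's formula)."""
    degree = [1] * l
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        leaf = min(u for u in range(l) if degree[u] == 1)
        edges.append((min(leaf, v), max(leaf, v)))
        degree[leaf] -= 1
        degree[v] -= 1
    u, v = (u for u in range(l) if degree[u] == 1)
    edges.append((min(u, v), max(u, v)))
    return edges

l = 5
all_trees = [prufer_to_edges(seq, l)
             for seq in itertools.product(range(l), repeat=l - 2)]
n_trees = len(all_trees)                 # Cayley: l^(l-2) spanning trees
cover = Counter(e for t in all_trees for e in t)

# The summation identity, checked for an arbitrary edge function f.
random.seed(0)
f = {e: random.random() for e in itertools.combinations(range(l), 2)}
lhs = sum(f[e] for t in all_trees for e in t)
rhs = n_trees * (2 / l) * sum(f.values())
```

Every edge of the complete graph is covered by exactly a (2/ℓ) fraction of the ℓ^{ℓ−2} trees, which is what makes the identity (and Lemma 1 below it) hold.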
Given any unit L2 norm predictor w on the complete graph G, our task is to investigate how the margins Γ(w, x, y), for each (x, y) ∈ X × Y, will be modified if we approximate the (true) expectation over all spanning trees by an average over the sample T. For this task, we consider any (x, y) and any w of unit L2 norm. Let F_T(w, x, y) denote the estimation of F(w, x, y) on the tree sample T,

F_T(w, x, y) := (1/n) Σ_{i=1}^n a_{T_i} ⟨ŵ_{T_i}, φ̂_{T_i}(x, y)⟩ ,

and let Γ_T(w, x, y) denote the estimation of Γ(w, x, y) on the tree sample T,

Γ_T(w, x, y) := min_{y'≠y} [F_T(w, x, y) − F_T(w, x, y')] .

The following lemma states how Γ_T relates to Γ.

Lemma 2. Consider any unit L2 norm predictor w on the complete graph G that achieves a margin of Γ(w, x, y) for each (x, y) ∈ X × Y. Then we have

Γ_T(w, x, y) ≥ Γ(w, x, y) − 2ε   ∀(x, y) ∈ X × Y ,

whenever we have |F_T(w, x, y) − F(w, x, y)| ≤ ε for all (x, y) ∈ X × Y.
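Lemma 1, and the approximation quality that Lemma 2 asks for, can be checked with a simple Monte-Carlo sketch. Below we simplify each edge score ⟨w_{i,j}, φ_{i,j}(x, y_i, y_j)⟩ to a fixed number f_e (so a_T ⟨ŵ_T, φ̂_T⟩ reduces to (ℓ/2) times the sum of a tree's edge scores), and draw uniform spanning trees of the complete graph with the Aldous-Broder random walk; the sampler is our choice for the illustration, any uniform sampler would do:

```python
import itertools
import random

def uniform_spanning_tree(l, rng):
    """Aldous-Broder sampler: the first-entrance edges of a simple random
    walk on the complete graph K_l form a uniform random spanning tree."""
    current = rng.randrange(l)
    visited = {current}
    edges = []
    while len(visited) < l:
        nxt = rng.randrange(l - 1)
        if nxt >= current:          # uniform neighbour distinct from current
            nxt += 1
        if nxt not in visited:
            visited.add(nxt)
            edges.append((min(current, nxt), max(current, nxt)))
        current = nxt
    return edges

rng = random.Random(1)
l = 10
# Hypothetical fixed edge scores f_e standing in for <w_ij, phi_ij(x, y_i, y_j)>.
f = {e: rng.gauss(0, 1) for e in itertools.combinations(range(l), 2)}
F = sum(f.values())                 # F(w, x, y): the complete-graph score

# Lemma 1 in Monte-Carlo form: (l/2) * (sum of the tree's edge scores) is an
# unbiased per-tree estimate of F, so the sample average F_T approaches F.
n = 4000
est = sum((l / 2) * sum(f[e] for e in uniform_spanning_tree(l, rng))
          for _ in range(n)) / n
```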
This last property follows directly from the fact that A_γ(s) is a non-increasing function of s.

The next lemma tells us that, apart from a slow ln²(√n) dependence, a sample of n ∈ Θ(ℓ²/ε²) spanning trees is sufficient to assure that the condition of Lemma 2 holds with high probability for all (x, y) ∈ X × Y. Such a fast convergence rate was made possible by using PAC-Bayesian methods which, in our case, allowed us to avoid using the union bound over all possible y ∈ Y.

Lemma 3. Consider any ε > 0 and any unit L2 norm predictor w for the complete graph G acting on a normalized joint feature space. For any δ ∈ (0, 1), let

n ≥ (ℓ²/ε²) (1/2 + 16 ln(8√n/δ))² .   (2)

Then with probability of at least 1 − δ/2 over all samples T generated according to U(G)^n, we have, simultaneously for all (x, y) ∈ X × Y, that |F_T(w, x, y) − F(w, x, y)| ≤ ε.

Given a sample T of n spanning trees of G, we now consider an arbitrary set W := {ŵ_{T_1}, ..., ŵ_{T_n}} of unit L2 norm weight vectors, where each ŵ_{T_i} operates on a unit L2 norm feature vector φ̂_{T_i}(x, y). For any T and any such set W, we consider an arbitrary unit L2 norm conical combination of each weight in W realized by an n-dimensional weight vector q := (q_1, ..., q_n), where Σ_{i=1}^n q_i² = 1 and each q_i ≥ 0. Given any (x, y) and any T, we define the score F_T(W, q, x, y) achieved on (x, y) by the conical combination (W, q) on T as

F_T(W, q, x, y) := (1/√n) Σ_{i=1}^n q_i ⟨ŵ_{T_i}, φ̂_{T_i}(x, y)⟩ ,   (3)

where the √n denominator ensures that we always have F_T(W, q, x, y) ≤ 1, in view of the fact that Σ_{i=1}^n q_i can be as large as √n. 
Note also that F_T(W, q, x, y) is the score of the feature vector obtained by the concatenation of all the weight vectors in W (and weighted by q) acting on a feature vector obtained by concatenating each φ̂_{T_i} multiplied by 1/√n. Hence, given T, we define the margin Γ_T(W, q, x, y) achieved on (x, y) by the conical combination (W, q) on T as

Γ_T(W, q, x, y) := min_{y'≠y} [F_T(W, q, x, y) − F_T(W, q, x, y')] .   (4)

For any unit L2 norm predictor w that achieves a margin of Γ(w, x, y) for all (x, y) ∈ X × Y, we now show that there exists, with high probability, a unit L2 norm conical combination (W, q) on T achieving margins that are not much smaller than Γ(w, x, y).

Theorem 4. Consider any unit L2 norm predictor w for the complete graph G, acting on a normalized joint feature space, achieving a margin of Γ(w, x, y) for each (x, y) ∈ X × Y. Then for any ε > 0, any n satisfying Lemma 3, and any δ ∈ (0, 1], with probability of at least 1 − δ over all samples T generated according to U(G)^n, there exists a unit L2 norm conical combination (W, q) on T such that, simultaneously for all (x, y) ∈ X × Y, we have

Γ_T(W, q, x, y) ≥ (1/√(1+ε)) [Γ(w, x, y) − 2ε] .

From Theorem 4, and since A_γ(s) is a non-increasing function of s, it follows that, with probability at least 1 − δ over the random draws of T ∼ U(G)^n, there exists (W, q) on T such that, simultaneously for all (x, y) ∈ X × Y, for any n satisfying Lemma 3, we have

A_γ(Γ_T(W, q, x, y)) ≤ A_γ([Γ(w, x, y) − 2ε] (1+ε)^{−1/2}) .

Hence, instead of searching for a predictor w for the complete graph G that achieves a small expected ramp loss E_{(x,y)∼D} A_γ(Γ(w, x, y)), Theorem 4 tells 
us that we can settle for searching for a unit L2 norm conical combination (W, q) on a sample T of randomly generated spanning trees of G that achieves a small E_{(x,y)∼D} A_γ(Γ_T(W, q, x, y)). But recall that Γ_T(W, q, x, y) is the margin of a weight vector obtained by the concatenation of all the weight vectors in W (weighted by q) on a feature vector obtained by the concatenation of the n feature vectors (1/√n) φ̂_{T_i}. It thus follows that any standard risk bound for the SVM applies directly to E_{(x,y)∼D} A_γ(Γ_T(W, q, x, y)). Hence, by adapting the SVM risk bound of [8], we have the following result.

Theorem 5. Consider any sample T of n spanning trees of the complete graph G. For any γ > 0 and any 0 < δ ≤ 1, with probability of at least 1 − δ over the random draws of S ∼ D^m, simultaneously for all unit L2 norm conical combinations (W, q) on T, we have

E_{(x,y)∼D} A_γ(Γ_T(W, q, x, y)) ≤ (1/m) Σ_{i=1}^m A_γ(Γ_T(W, q, x_i, y_i)) + 2/(γ√m) + 3 √(ln(2/δ)/(2m)) .

Hence, according to this theorem, the conical combination (W, q) having the best generalization guarantee is the one which minimizes the sum of the first two terms on the right hand side of the inequality. Note that the theorem is still valid if we replace, in the empirical risk term, the non-convex ramp loss A_γ by the convex hinge loss L_γ. This provides the theoretical basis of the proposed optimization problem for learning (W, q) on the sample T.

4 A L2-Norm Random Spanning Tree Approximation Approach

If we introduce the usual slack variables ξ_k := γ · L_γ(Γ_T(W, q, x_k, y_k)), Theorem 5 suggests that we should minimize (1/γ) Σ_{k=1}^m ξ_k for some fixed margin value γ > 0. 
Rather than performing this task for several values of γ, we show in the supplementary material that we can, equivalently, solve the following optimization problem for several values of C > 0.

Definition 6. Primal L2-norm Random Tree Approximation.

min_{w_{T_i}, ξ_k}  (1/2) Σ_{i=1}^n ‖w_{T_i}‖² + C Σ_{k=1}^m ξ_k

s.t.  Σ_{i=1}^n ⟨w_{T_i}, φ̂_{T_i}(x_k, y_k)⟩ − max_{y≠y_k} Σ_{i=1}^n ⟨w_{T_i}, φ̂_{T_i}(x_k, y)⟩ ≥ 1 − ξ_k ,
      ξ_k ≥ 0 , ∀k ∈ {1, ..., m} ,

where {w_{T_i} | T_i ∈ T} are the feature weights to be learned on each tree, ξ_k is the margin slack allocated for each x_k, and C is the slack parameter that controls the amount of regularization.

This primal form has the interpretation of maximizing the joint margins from individual trees between (correct) training examples and all the other (incorrect) examples.

The key to efficient optimization is solving the 'argmax' problem efficiently. In particular, we note that the space of all multilabels is exponential in size, thus forbidding exhaustive enumeration over it. 
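On toy problem sizes only, the constraint of Definition 6 can still be evaluated by brute force, which makes the role of the slack ξ_k explicit. The per-edge score tables below are purely illustrative and stand in for the learned ⟨w_{T_i}, φ̂_{T_i}(·)⟩ terms:

```python
import itertools
import random

def ensemble_score(trees, y):
    """Sum over trees of the tree's edge scores for multilabel y; each score
    table stands in for a learned <w_Ti, phi_Ti(x, y)> of Definition 6."""
    return sum(sc[(u, v, y[u], y[v])] for edges, sc in trees for (u, v) in edges)

def margin_slack(trees, y_true, label_sets):
    """xi_k of Definition 6, computed by exhaustive argmax (toy sizes only)."""
    runner_up = max((y for y in itertools.product(*label_sets) if y != y_true),
                    key=lambda y: ensemble_score(trees, y))
    gap = ensemble_score(trees, y_true) - ensemble_score(trees, runner_up)
    return max(0.0, 1.0 - gap)

# Toy instance: 3 binary microlabels, two spanning trees of K_3 with
# randomly drawn (illustrative) edge score tables.
random.seed(7)
def rand_scores(edges):
    return {(u, v, a, b): random.uniform(-1, 1)
            for (u, v) in edges for a in (0, 1) for b in (0, 1)}

T1, T2 = [(0, 1), (1, 2)], [(0, 2), (1, 2)]
trees = [(T1, rand_scores(T1)), (T2, rand_scores(T2))]
xi = margin_slack(trees, (1, 0, 1), [(0, 1)] * 3)
```

A zero slack means the correct multilabel beats every competitor by at least the unit margin; a positive slack measures by how much the constraint had to be relaxed.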
In the following, we show how exact inference over a collection T of trees can be implemented in Θ(Knℓ) time per data point, where K is the smallest number such that the average score of the K'th best multilabel for each tree of T is at most F_T(x, y) := (1/n) Σ_{i=1}^n ⟨w_{T_i}, φ̂_{T_i}(x, y)⟩ evaluated at the best multilabel found. Whenever K is polynomial in the number of labels, this gives us exact polynomial-time inference over the ensemble of trees.

4.1 Fast inference over a collection of trees

It is well known that the exact solution to the inference problem

ŷ_{T_i}(x) = argmax_{y∈Y} F_{w_{T_i}}(x, y) := argmax_{y∈Y} ⟨w_{T_i}, φ̂_{T_i}(x, y)⟩ ,   (5)

on an individual tree T_i can be obtained in Θ(ℓ) time by dynamic programming. However, there is no guarantee that the maximizer ŷ_{T_i} of Equation (5) is also a maximizer of F_T. In practice, ŷ_{T_i} can differ for each spanning tree T_i ∈ T. Hence, instead of using only the best scoring multilabel ŷ_{T_i} from each individual T_i ∈ T, we consider the set of the K highest scoring multilabels Y_{T_i,K} = {ŷ_{T_i,1}, ..., ŷ_{T_i,K}} of F_{w_{T_i}}(x, y). In the supplementary material, we describe a dynamic programming algorithm that finds the K highest scoring multilabels in Θ(Kℓ) time. Running this algorithm for all of the trees gives us a candidate set of Θ(Kn) multilabels Y_{T,K} = Y_{T_1,K} ∪ ··· ∪ Y_{T_n,K}. We now state a key lemma that will enable us to verify if the candidate set contains the maximizer of F_T.

Lemma 7. Let y*_K = argmax_{y∈Y_{T,K}} F_T(x, y) be the highest scoring multilabel in Y_{T,K}. Suppose that

F_T(x, y*_K) ≥ (1/n) Σ_{i=1}^n F_{w_{T_i}}(x, y_{T_i,K}) =: θ_x(K) .

It follows that F_T(x, y*_K) = max_{y∈Y} F_T(x, y).

We can use any K satisfying the lemma as the length of the K-best lists, and be assured that y*_K is a maximizer of F_T.

We now examine the conditions under which the highest scoring multilabel is present in our candidate set Y_{T,K} with high probability. For any x ∈ X and any predictor w, let ŷ := y_w(x) := argmax_{y∈Y} F(w, x, y) be the highest scoring multilabel in Y for predictor w on the complete graph G. For any y ∈ Y, let K_T(y) be the rank of y in tree T and let ρ_T(y) := K_T(y)/|Y| be the normalized rank of y in tree T. We then have 0 < ρ_T(y) ≤ 1, and ρ_T(y') = min_{y∈Y} ρ_T(y) whenever y' is a highest scoring multilabel in tree T. Since w and x are arbitrary and fixed, let us drop them momentarily from the notation and let F(y) := F(w, x, y) and F_T(y) := F_{w_T}(x, y). Let U(Y) denote the uniform distribution of multilabels on Y. Then, let μ_T := E_{y∼U(Y)} F_T(y) and μ := E_{T∼U(G)} μ_T.

Let T ∼ U(G)^n be a sample of n spanning trees of G. Since the scoring function F_T of each tree T of G is bounded in absolute value, it follows that F_T is a σ_T-sub-Gaussian random variable for some σ_T > 0. We now show that, with high probability, there exists a tree T ∈ T such that ρ_T(ŷ) decreases exponentially rapidly with (F(ŷ) − μ)/σ, where σ² := E_{T∼U(G)} σ_T².

Lemma 8. 
Let the scoring function F_T of each spanning tree of G be a σ_T-sub-Gaussian random variable under the uniform distribution of labels; i.e., for each T on G, there exists σ_T > 0 such that for any λ > 0 we have

E_{y∼U(Y)} e^{λ(F_T(y) − μ_T)} ≤ e^{(λ²/2) σ_T²} .

Let σ² := E_{T∼U(G)} σ_T², and let α := Pr_{T∼U(G)} (μ_T ≤ μ ∧ F_T(ŷ) ≥ F(ŷ) ∧ σ_T² ≤ σ²). Then,

Pr_{T∼U(G)^n} (∃T ∈ T : ρ_T(ŷ) ≤ e^{−(1/2)(F(ŷ)−μ)²/σ²}) ≥ 1 − (1 − α)^n .

Thus, even for very small α, when n is large enough, there exists, with high probability, a tree T ∈ T such that ŷ has a small ρ_T(ŷ) whenever [F(ŷ) − μ]/σ is large for G. For example, when |Y| = 2^ℓ (the multiple binary classification case), we have, with probability of at least 1 − (1 − α)^n, that there exists T ∈ T such that K_T(ŷ) = 1 whenever F(ŷ) − μ ≥ σ √(2ℓ ln 2).

4.2 Optimization

To optimize the L2-norm RTA problem (Definition 6), we convert it to the marginalized dual form (see the supplementary material for the derivation), which gives us a polynomial-size problem (in the number of microlabels) and allows us to use kernels to tackle complex input spaces efficiently.

Definition 9. 
L2-norm RTA Marginalized Dual.

max_{μ∈M^m}  (1/|E_T|) Σ_{e,k,u_e} μ(k, e, u_e) − (1/2) Σ_{e,k,u_e,k',u'_e} μ(k, e, u_e) K_T^e(x_k, u_e; x_{k'}, u'_e) μ(k', e, u'_e) ,

where E_T is the union of the sets of edges appearing in T, and μ ∈ M^m are the marginal dual variables μ := (μ(k, e, u_e))_{k,e,u_e}, with the triplet (k, e, u_e) corresponding to labeling the edge e = (v, v') ∈ E_T of the output graph by u_e = (u_v, u_{v'}) ∈ Y_v × Y_{v'} for the training example x_k. Also, M^m is the marginal dual feasible set and

K_T^e(x_k, u_e; x_{k'}, u'_e) := (N_T(e)/|E_T|²) K(x_k, x_{k'}) ⟨ψ_e(y_{kv}, y_{kv'}) − ψ_e(u_v, u_{v'}), ψ_e(y_{k'v}, y_{k'v'}) − ψ_e(u'_v, u'_{v'})⟩

is the joint kernel of input features and the differences of output features of true and competing multilabels (y_k, u), projected to the edge e. Finally, N_T(e) denotes the number of times e appears among the trees of the ensemble.

The master algorithm described in the supplementary material iterates over each training example until convergence. The processing of each training example x_k proceeds by finding the worst violating multilabel of the ensemble, defined as

ȳ_k := argmax_{y≠y_k} F_T(x_k, y) ,   (6)

using the K-best inference approach of the previous section, with the modification that the correct multilabel is excluded from the K-best lists. The worst violator ȳ_k is mapped to a vertex

μ̄(x_k) = C · ([ȳ_e = u_e])_{e,u_e} ∈ M_k

corresponding to the steepest feasible ascent direction (c.f. [9]) in the marginal dual feasible set M_k of example x_k, thus giving us a subgradient of the objective of Definition 9. An exact line search is used to find the saddle point between the current solution and μ̄.

DATASET       MICROLABEL LOSS (%)                0/1 LOSS (%)
              SVM    MTL    MMCRF  MAM    RTA    SVM    MTL    MMCRF  MAM    RTA
EMOTIONS      22.4   20.2   20.1   19.5   18.8   77.8   74.5   71.3   69.6   66.3
YEAST         20.0   20.7   21.7   20.1   19.8   85.9   88.7   93.0   86.0   77.7
SCENE         9.8    11.6   18.4   17.0   8.8    47.2   94.6   72.2   55.2   30.2
ENRON         6.4    6.5    6.2    5.0    5.3    99.6   99.6   92.7   87.9   87.7
CAL500        13.8   13.8   13.7   13.7   13.7   100.0  100.0  100.0  100.0  100.0
FINGERPRINT   10.7   17.3   10.5   10.5   10.3   99.0   100.0  99.6   99.6   96.7
NCI60         15.3   16.0   14.6   14.3   14.9   56.9   60.0   63.1   53.0   52.9
MEDICAL       2.6    2.6    2.1    2.1    2.1    91.8   91.8   63.8   63.1   58.8
CIRCLE10      4.7    6.3    2.6    2.5    0.6    28.9   33.2   20.3   17.7   4.0
CIRCLE50      3.8    6.2    1.5    2.1    5.7    69.8   72.3   38.8   46.2   52.8

Table 1: Prediction performance of each algorithm in terms of microlabel loss and 0/1 loss. The best performing algorithm is highlighted with boldface, the second best is in italic.

5 Empirical Evaluation

We compare our method RTA to Support Vector Machine (SVM) [10, 11], Multitask Feature Learning (MTL) [12], Max-Margin Conditional Random Fields (MMCRF) [9], which uses the loopy belief propagation algorithm for approximate inference on the general graph, and Maximum Average Marginal Aggregation (MAM) [5], which is a multilabel ensemble model that trains a set of random tree based learners separately and performs the final approximate inference on a union graph of the edge potential functions of the trees. We use ten multilabel datasets from [5]. 
Following [5], MAM is constructed with 180 tree based learners, and for MMCRF a consensus graph is created by pooling edges from 40 trees. We train RTA with up to 40 spanning trees and with K up to 32. The linear kernel is used for methods that require kernelized input. Margin slack parameters are selected from {100, 50, 10, 1, 0.5, 0.1, 0.01}. We use 5-fold cross-validation to compute the results.

Prediction performance. Table 1 shows the performance in terms of microlabel loss and 0/1 loss. The best methods are highlighted in boldface and the second best in italics (see the supplementary material for full results). RTA quite often improves over MAM in 0/1 accuracy, sometimes with a noticeable margin, except for Enron and Circle50. The performances in microlabel accuracy are quite similar, with RTA slightly above the competition. This demonstrates the advantage of RTA, which gains by optimizing on a collection of trees simultaneously rather than optimizing on individual trees as MAM does. In addition, learning using approximate inference on a general graph seems less favorable than the tree-based methods, as MMCRF quite consistently trails RTA and MAM in both microlabel and 0/1 error, except for Circle50 where it outperforms the other models. Finally, we notice that SVM, as a single label classifier, is very competitive against most multilabel methods for microlabel accuracy.

Figure 1: Percentage of examples with the provably optimal y* being in the K-best lists, plotted as a function of K, scaled with respect to the number of microlabels in the dataset.

Exactness of inference on the collection of trees. We now study the empirical behavior of the inference (see Section 4) on the collection of trees, which, if taken as a single general graph, would call for solving an NP-hard inference problem. 
We provide here empirical evidence that we can perform exact inference on most examples in most datasets in polynomial time.

We ran the K-best inference on eleven datasets where the RTA models were trained with different numbers of spanning trees |T| ∈ {5, 10, 40} and values K ∈ {2, 4, 8, 16, 32, 40, 60}. For each parameter combination and for each example, we recorded whether the K-best inference was provably exact on the collection (i.e., whether Lemma 7 was satisfied). Figure 1 plots the percentage of examples where the inference was indeed provably exact. The values are shown as a function of K, expressed as the percentage of the number of microlabels in each dataset. Hence, 100% means K = ℓ, which denotes low polynomial (Θ(nℓ²)) time inference in the exponential-size multilabel space.

We observe, from Figure 1, that on some datasets (e.g., Medical, NCI60) the inference task is very easy, since exact inference can be computed for most of the examples even with K values that are below 50% of the number of microlabels. By setting K = ℓ (i.e., 100%), we can perform exact inference for about 90% of the examples on nine datasets with five trees, and on eight datasets with 40 trees. On two of the datasets (Cal500, Circle50), inference is not (in general) exact with low values of K. Allowing K to grow superlinearly in ℓ would possibly permit exact inference on these datasets. However, this is left for future studies.

Finally, we note that the difficulty of performing provably exact inference slightly increases when more spanning trees are used. We have observed that, in most cases, the optimal multilabel y* is still on the K-best lists, but the conditions of Lemma 7 are no longer satisfied, hence preventing us from proving the exactness of the inference. 
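The certification test of Lemma 7 used in these experiments can be sketched as follows, with brute-force per-tree rankings standing in for the K-best dynamic programming (toy sizes and random score tables, for illustration only):

```python
import itertools
import random

def kbest_certified_inference(tree_scores, label_sets, K):
    """K-best ensemble inference with the optimality certificate of Lemma 7.

    tree_scores: one function y -> F_{w_Ti}(x, y) per tree; here the K-best
    lists are obtained by brute-force ranking, standing in for the per-tree
    dynamic programming. Returns (y_best, certified); certified=True means
    y_best provably maximizes the ensemble score F_T.
    """
    Y = list(itertools.product(*label_sets))
    n = len(tree_scores)

    def F_T(y):
        return sum(s(y) for s in tree_scores) / n

    kbest = [sorted(Y, key=s, reverse=True)[:K] for s in tree_scores]
    candidates = {y for lst in kbest for y in lst}
    y_star = max(candidates, key=F_T)
    theta = sum(s(lst[K - 1]) for s, lst in zip(tree_scores, kbest)) / n
    return y_star, F_T(y_star) >= theta      # the condition of Lemma 7

# Toy instance (illustrative): 4 binary microlabels, 5 random score tables.
random.seed(3)
labels = [(0, 1)] * 4
Y_all = list(itertools.product(*labels))
tables = [{y: random.uniform(-1, 1) for y in Y_all} for _ in range(5)]
tree_scores = [lambda y, t=t: t[y] for t in tables]
y_hat, certified = kbest_certified_inference(tree_scores, labels, K=3)
```

When the certificate fails one can grow K and retry, which is exactly the behavior whose cost the experiments above measure.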
Thus, working to establish alternative proofs of exactness is a worthy future research direction.

6 Conclusion

The main theoretical result of the paper is the demonstration that if a large-margin structured output predictor exists, then combining a small sample of random trees will, with high probability, generate a predictor with good generalization. The key attraction of this approach is the tractability of the inference problem for the ensemble of trees, both indicated by our theoretical analysis and supported by our empirical results. Moreover, as a by-product, we have a significant added benefit: we do not need to know the output structure a priori, as it is generated implicitly in the learned weights for the trees. This is used to significant advantage in our experiments, which automatically leverage correlations between the multiple target outputs to give a substantive increase in accuracy. It also suggests that the approach has enormous potential for applications where the structure of the output is not known but is expected to play an important role.

References

[1] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In S.
Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32. MIT Press, 2004.

[2] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697–3717, 2005.

[3] Michael I. Jordan and Martin J. Wainwright. Semidefinite relaxations for approximate inference on graphs with cycles. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 369–376. MIT Press, 2004.

[4] Amir Globerson and Tommi S. Jaakkola. Approximate inference using planar graph decomposition. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 473–480. MIT Press, 2007.

[5] Hongyu Su and Juho Rousu. Multilabel classification through random graph ensembles. Machine Learning, dx.doi.org/10.1007/s10994-014-5465-9, 2014.

[6] Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, New York, 1999.

[7] Thomas Gärtner and Shankar Vembu. On structured output training: hard cases and an efficient alternative. Machine Learning, 79:227–242, 2009.

[8] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[9] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Efficient algorithms for max-margin structured classification. Predicting Structured Data, pages 105–129, 2007.

[10] Kristin P. Bennett. Combining support vector and mathematical programming methods for classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 307–326.
MIT Press, Cambridge, MA, 1999.

[11] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, U.K., 2000.

[12] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[13] Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58:7086–7093, 2012.

[14] Andreas Maurer. A note on the PAC Bayesian theorem. CoRR, cs.LG/0411099, 2004.

[15] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003.

[16] Juho Rousu, Craig Saunders, Sandor Szedmak, and John Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, 7:1601–1626, December 2006.