{"title": "Maximum Margin Multi-Label Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 289, "page_last": 297, "abstract": "We study multi-label prediction for structured output spaces, a problem that occurs, for example, in object detection in images, secondary structure prediction in computational biology, and graph matching with symmetries. Conventional multi-label classification techniques are typically not applicable in this situation, because they require explicit enumeration of the label space, which is infeasible in case of structured outputs. Relying on techniques originally designed for single- label structured prediction, in particular structured support vector machines, results in reduced prediction accuracy, or leads to infeasible optimization problems. In this work we derive a maximum-margin training formulation for multi-label structured prediction that remains computationally tractable while achieving high prediction accuracy. It also shares most beneficial properties with single-label maximum-margin approaches, in particular a formulation as a convex optimization problem, efficient working set training, and PAC-Bayesian generalization bounds.", "full_text": "Maximum Margin Multi-Label Structured Prediction\n\nChristoph H. Lampert\n\nIST Austria (Institute of Science and Technology Austria)\n\nAm Campus 1, 3400 Klosterneuburg, Austria\n\nhttp://www.ist.ac.at/\u223cchl\n\nchl@ist.ac.at\n\nAbstract\n\nWe study multi-label prediction for structured output sets, a problem that occurs,\nfor example, in object detection in images, secondary structure prediction in com-\nputational biology, and graph matching with symmetries. Conventional multi-\nlabel classi\ufb01cation techniques are typically not applicable in this situation, be-\ncause they require explicit enumeration of the label set, which is infeasible in case\nof structured outputs. 
Relying on techniques originally designed for single-label\nstructured prediction, in particular structured support vector machines, results in\nreduced prediction accuracy, or leads to infeasible optimization problems.\n\nIn this work we derive a maximum-margin training formulation for multi-label\nstructured prediction that remains computationally tractable while achieving high\nprediction accuracy.\nIt also shares most bene\ufb01cial properties with single-label\nmaximum-margin approaches, in particular formulation as a convex optimization\nproblem, ef\ufb01cient working set training, and PAC-Bayesian generalization bounds.\n\n1\n\nIntroduction\n\nThe recent development of conditional random \ufb01elds (CRFs) [1], max-margin Markov networks\n(M3Ns) [2], and structured support vector machines (SSVMs) [3] has triggered a wave of interest in\nthe prediction of complex outputs. Typically, these are formulated as graph labeling or graph match-\ning tasks in which each input has a unique correct output. However, not all problems encountered\nin real applications are re\ufb02ected well by this assumption: machine translation in natural language\nprocessing, secondary structure prediction in computational biology, and object detection in com-\nputer vision are examples of tasks in which more than one prediction can be \u201ccorrect\u201d for each data\nsample, and that are therefore more naturally formulated as multi-label prediction tasks.\nIn this paper, we study multi-label structured prediction, de\ufb01ning the task and introducing the nec-\nessary notation in Section 2. Our main contribution is a formulation of a maximum-margin training\nproblem, named MLSP, which we introduce in Section 3. Once trained it allows the prediction of\nmultiple structured outputs from a single input, as well as abstaining from a decision. 
We study the generalization properties of MLSP in the form of a generalization bound in Section 3.2, and we introduce a working set optimization procedure in Section 3.3. The main insight from these is that MLSP behaves similarly to a single-label SSVM in terms of efficient use of training data and computational effort during training, despite the increased complexity of the problem setting. In Section 4 we discuss MLSP's relation to existing methods for multi-label prediction with simple label sets, and to single-label structured prediction. We furthermore compare MLSP to multi-label structured prediction methods within the SSVM framework in Section 4.1. In Section 5 we compare the different approaches experimentally, and we conclude in Section 6 by summarizing and discussing our contribution.

2 Multi-label structured prediction

We first recall some background and establish the notation necessary to discuss multi-label classification and structured prediction in a maximum margin framework. Our overall task is predicting outputs y ∈ Y for inputs x ∈ X in a supervised learning setting.

In ordinary (single-label) multi-class prediction we use a prediction function, g : X → Y, for this, which we learn from i.i.d. example pairs {(x^i, y^i)}_{i=1,...,n} ⊂ X × Y. Adopting a maximum-margin setting, we set

g(x) := argmax_{y ∈ Y} f(x, y)   for a compatibility function   f(x, y) := ⟨w, ψ(x, y)⟩.   (1)

The joint feature map ψ : X × Y → H maps input-output pairs into a Hilbert space H with inner product ⟨·, ·⟩. It is defined either explicitly, or implicitly through a joint kernel function k : (X × Y) × (X × Y) → R. 
We measure the quality of predictions by a task-dependent loss function\n\u2206 : Y \u00d7 Y \u2192 R+, where \u2206(y, \u00afy) speci\ufb01es what cost occurs if we predict an output \u00afy while the\ncorrect prediction is y.\nStructured output prediction can be seen as a generalization of the above setting, where one wants to\nmake not only one, but several dependent decisions at the same time, for example, deciding for each\npixel of an image to which out of several semantic classes it belongs. Equivalently, one can interpret\nthe same task as a special case of supervised single-label prediction, where inputs and outputs consist\nof multiple parts. In the above example, a whole image is one input sample, and a segmentation mask\nwith as many entries as the image has pixels is an output. Having a choice of M \u2265 2 classes per pixel\nof a (w\u00d7h)-sized image leads to an output set of M w\u00b7h elements. Enumerating all of these is out of\nquestion, and collecting training examples for each of them even more so. Consequently, structured\noutput prediction requires specialized techniques that avoid enumerating all possible outputs, and\nthat can generalize between labels in the output set. A popular technique for this task is the structured\n(output) support vector machine (SSVM) [3]. To train it, one has to solve a quadratic program\nsubject to n|Y| linear constraints. If an ef\ufb01cient separation oracle is available, i.e. a technique for\nidentifying the currently most violated linear constraints, working set training, in particular cutting\nplane [4] or bundle methods [5] allow SSVM training to arbitrary precision in polynomial time.\nMulti-label prediction is a generalization of single-label prediction that gives up the condition of a\nfunctional relation between inputs and outputs. Instead, each input object can be associated with\nany (\ufb01nite) number of outputs, including none. 
Formally, we are given pairs {(x^i, Y^i)}_{i=1,...,n} ⊂ X × P(Y), where P denotes the power set operation, and we want to determine a set-valued function G : X → P(Y). Often it is convenient to use indicator vectors instead of variable size subsets. We say that v ∈ {±1}^Y represents the subset Y ∈ P(Y) if v_y = +1 for y ∈ Y and v_y = −1 otherwise. Where no confusion arises, we use both representations interchangeably, e.g., we write either Y^i or v^i for a label set in the training data. To measure the quality of a predicted set we use a set loss function ∆_ML : P(Y) × P(Y) → R. Note that multi-label prediction can also be interpreted as ordinary single-output prediction with P(Y) taking the place of the original output set Y. We will come back to this view in Section 4.1 when discussing related work.

Multi-label structured prediction combines the aspects of multi-label prediction and structured output sets: we are given a training set {(x^i, Y^i)}_{i=1,...,n} ⊂ X × P(Y), where Y is a structured output set of potentially very large size, and we would like to learn a prediction function G : X → P(Y) with the ability to generalize also in the output set. In the following, we will take the structured prediction point of view, deriving expressions for predicting multiple structured outputs instead of single ones. Alternatively, the same conclusions could be reached by interpreting the task as performing multi-label prediction with binary output vectors that are too large to store or enumerate explicitly, but that have an internal structure allowing generalization between the elements.

3 Maximum margin multi-label structured prediction

In this section we propose a learning technique designed for multi-label structured prediction that we call MLSP. 
It makes set-valued prediction by1,\n\nG(x) := {y \u2208 Y : f (x, y) > 0}\n\nfor\n\nf (x, y) := (cid:104)w, \u03c8(x, y)(cid:105).\n\n(2)\n\n1More complex prediction rules exist in the multi-label literature, see, e.g., [6]. We restrict ourselves to per-\nlabel thresholding, because more advanced rules complicate the learning and prediction problem even further.\n\n2\n\n\fNote that the compatibility function, f (x, y), acts on individual inputs and outputs, as in single-label\nprediction (1), but the prediction step consists of collecting all outputs of positive scores instead of\n\ufb01nding the outputs of maximal score. By including a constant entry into the joint feature map \u03c8(x, y)\nwe can model a bias term, thereby avoiding the need of a threshold during prediction (2). We can\nalso add further \ufb02exibility by a data-independent, but label-dependent term. Note that our setup\ndiffers from SSVMs training in this regard. There, a bias term, or a constant entry of the feature\nmap, would have no in\ufb02uence, because during training only pairwise differences of function values\nare considered, and during prediction a bias does not affect the argmax-decision in Equation (1).\nWe learn the weight vector w for the MLSP compatibility function in a maximum-margin framework\nthat is derived from regularized risk minimization. As the risk depends on the loss function chosen,\nwe \ufb01rst study the possibilities we have for the set loss \u2206ML : P(Y) \u00d7 P(Y) \u2192 R+. There are no\nestablished functions for this in the structured prediction setting, but it turns out that two canonical\nset losses are consistent with the following \ufb01rst principles. 
Positivity: ∆_ML(Y, Ȳ) ≥ 0, with equality only if Y = Ȳ. Modularity: ∆_ML should decompose over the elements of Y (in order to facilitate efficient computation). Monotonicity: ∆_ML should reflect that making a wrong decision about some element y ∈ Y can never reduce the loss. The last criterion we formalize as

∆_ML(Y, Ȳ ∪ {ȳ}) ≥ ∆_ML(Y, Ȳ)   for any ȳ ∉ Y, and   (3)
∆_ML(Y ∪ {y}, Ȳ) ≥ ∆_ML(Y, Ȳ)   for any y ∉ Ȳ.   (4)

Two candidates that fulfill these criteria are the sum loss, ∆_sum(Y, Ȳ) := Σ_{y ∈ Y ⊖ Ȳ} λ(Y, y), and the max loss, ∆_max(Y, Ȳ) := max_{y ∈ Y ⊖ Ȳ} λ(Y, y), where Y ⊖ Ȳ := (Y \ Ȳ) ∪ (Ȳ \ Y) is the symmetric set difference, and λ : P(Y) × Y → R+ is a task-dependent per-label misclassification cost. Assuming that a set Y is the correct prediction, λ(Y, ȳ) specifies either the cost of predicting ȳ, although ȳ ∉ Y, or of not predicting ȳ, when really ȳ ∈ Y. In the special case of λ ≡ 1 the sum loss is known as the symmetric difference loss, and it coincides with the Hamming loss of the binary indicator vector representation. The max loss becomes the 0/1-loss between sets in this case. In the general case, λ typically expresses partial correctness, generalizing the single-label structured loss ∆(y, ȳ). Note that in evaluating λ(Y, ȳ) one has access to the whole set Y, not just single elements. Therefore, a flexible penalization of multiple errors is possible, e.g., submodular behavior.

While in the small-scale multi-label situation the sum loss is more common, we argue in this work that the max loss has advantages in the structured prediction situation. 
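For concreteness, the two set losses above can be sketched in a few lines. This is an illustrative snippet, not part of the original formulation; for brevity it uses a uniform per-label cost λ(Y, y) ≡ 1, i.e. the symmetric-difference and 0/1 special cases.

```python
# Minimal sketch of the two set losses over the symmetric set difference.
def sym_diff(Y, Y_pred):
    """Symmetric set difference Y (-) Y_pred."""
    return (Y - Y_pred) | (Y_pred - Y)

def sum_loss(Y, Y_pred, cost=lambda Y, y: 1.0):
    """Delta_sum: add the per-label cost over all mislabeled elements."""
    return sum(cost(Y, y) for y in sym_diff(Y, Y_pred))

def max_loss(Y, Y_pred, cost=lambda Y, y: 1.0):
    """Delta_max: largest per-label cost among mislabeled elements."""
    return max((cost(Y, y) for y in sym_diff(Y, Y_pred)), default=0.0)

# With unit costs, the sum loss counts errors (Hamming loss of the
# indicator vectors) and the max loss is the 0/1 subset loss:
assert sum_loss({1, 2, 3}, {2, 4}) == 3.0   # mislabeled elements: 1, 3, 4
assert max_loss({1, 2, 3}, {2, 4}) == 1.0
assert max_loss({1, 2}, {1, 2}) == 0.0
```

Passing a non-constant `cost` function reproduces the general, task-dependent λ(Y, y) of the text.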
For one, the sum loss has a scaling problem. Because it adds potentially exponentially many terms, the ratio in loss between making few mistakes and making many mistakes is very large. If used in the unnormalized form given above this can result in impractically large values. Normalizing the expression by multiplying with 1/|Y| stabilizes the upper value range, but it leads to a situation where ∆_sum(Y, Ȳ) ≈ 0 in the common situation that Ȳ differs from Y in only a few elements. The value range of the max loss, on the other hand, is the same as the value range of λ and therefore easy to keep reasonable. A second advantage of the max loss is that it leads to an efficient constraint generation technique during training, as we will see in Section 3.3.

3.1 Maximum margin multi-label structured prediction (MLSP)

To learn the parameters w of the compatibility function f(x, y) we follow a regularized risk minimization framework: given i.i.d. training examples {(x^i, Y^i)}_{i=1,...,n}, we would like to minimize (1/2)‖w‖² + (C/n) Σ_i ∆_max(Y^i, G(x^i)). Using the definition of ∆_max this is equivalent to minimizing (1/2)‖w‖² + (C/n) Σ_i ξ^i, subject to ξ^i ≥ λ(Y^i, y) for all y ∈ Y with v^i_y f(x^i, y) ≤ 0. Upper bounding the inequalities by a Hinge construction yields the following maximum-margin training problem:

(w*, ξ*) = argmin_{w ∈ H, ξ^1,...,ξ^n ∈ R+}  (1/2)‖w‖² + (C/n) Σ_{i=1}^n ξ^i   (5)

subject to, for i = 1, . . . , n,

ξ^i ≥ λ(Y^i, y)[1 − v^i_y f(x^i, y)],  for all y ∈ Y.   (6)

Note that making per-label decisions through thresholding does not rule out the sharing of information between labels. 
In the terminology of [7], Equation (2) corresponds to a conditional label independence assumption. Through the joint feature function ψ(x, y) the proposed model can still learn unconditional dependence between labels, which relates more closely to an intuition of the form "Label A tends to co-occur with label B".

Besides this slack rescaled variant, one can also form margin rescaled training using the constraints

ξ^i ≥ λ(Y^i, y) − v^i_y f(x^i, y),  for all y ∈ Y.   (7)

Both variants coincide in the case of 0/1 set loss, λ(Y^i, y) ≡ 1. The main difference between slack and margin rescaled training is how they treat the case of λ(Y^i, y) = 0 for some y ∈ Y. In slack rescaling, the corresponding outputs have no effect on the training at all, whereas for margin rescaling, no margin is enforced for such examples, but a penalization still occurs whenever f(x^i, y) > 0 for y ∉ Y^i, or if f(x^i, y) < 0 for y ∈ Y^i.

3.2 Generalization Properties

Maximum margin structured learning has become successful not only because it provides a powerful framework for solving practical prediction problems, but also because it comes with certain theoretical guarantees, in particular generalization bounds. We expect that many of these results will have multi-label analogues. As an initial step, we formulate and prove a generalization bound for slack-rescaled MLSP similar to the single-label SSVM analysis in [8].

Let G_w(x) := {y ∈ Y : f_w(x, y) > 0} for f_w(x, y) = ⟨w, ψ(x, y)⟩. 
We assume |Y| < r and ‖ψ(x, y)‖ < s for all (x, y) ∈ X × Y, and λ(Y, y) ≤ Λ for all (Y, y) ∈ P(Y) × Y. For any distribution Q_w over weight vectors, that may depend on w, we denote by L(Q_w, P) the expected ∆_max-risk for P-distributed data,

L(Q_w, P) = E_{w̄ ∼ Q_w} { R_{P,∆max}(G_w̄) } = E_{w̄ ∼ Q_w, (x,Y) ∼ P} { ∆_max(Y, G_w̄(x)) }.   (8)

The following theorem bounds the expected risk in terms of the total margin violations.

Theorem 1. With probability at least 1 − σ over the sample S of size n, the following inequality holds simultaneously for all weight vectors w:

L(Q_w, D) ≤ (1/n) Σ_{i=1}^n ℓ(x^i, Y^i, f) + ‖w‖²/n + ( ( s²‖w‖² ln(rn/‖w‖²) + ln(n/σ) ) / ( 2(n − 1) ) )^{1/2}   (9)

for ℓ(x^i, Y^i, f) := max_{y ∈ Y} λ(Y^i, y) ⟦v^i_y f(x^i, y) < 1⟧, where v^i is the binary indicator vector of Y^i.

Proof. The argument follows [8, Section 11.6]. It can be found in the supplemental material.

A main insight from Theorem 1 is that the number of samples needed for good generalization grows only logarithmically with r, i.e. the size of Y. This is the same complexity as for single-label prediction using SSVMs, despite the fact that multi-label prediction formally maps into P(Y), i.e. an exponentially larger output set.

3.3 Numeric Optimization

The numeric solution of MLSP training resembles SSVM training. For explicitly given joint feature maps, ψ(x, y), we can solve the optimization problem (5) in the primal, for example using subgradient descent. To solve MLSP in a kernelized setup we introduce Lagrangian multipliers (α^i_y)_{i=1,...,n; y ∈ Y} for the constraints (7)/(6). 
For the margin-rescaled variant we obtain the dual

max_{α^i_y ∈ R+}  −(1/2) Σ_{(i,y),(ī,ȳ)} v^i_y v^ī_ȳ α^i_y α^ī_ȳ k((x^i, y), (x^ī, ȳ)) + Σ_{(i,y)} λ^i_y α^i_y   (10)

subject to  Σ_y α^i_y ≤ C/n,  for i = 1, . . . , n.   (11)

For slack-rescaled MLSP, the dual is computed analogously as

max_{α^i_y ∈ R+}  −(1/2) Σ_{(i,y),(ī,ȳ)} v^i_y v^ī_ȳ α^i_y α^ī_ȳ k((x^i, y), (x^ī, ȳ)) + Σ_{(i,y)} α^i_y   (12)

subject to  Σ_y α^i_y / λ^i_y ≤ C/n,  for i = 1, . . . , n,   (13)

with the convention that only terms with λ^i_y ≠ 0 enter the summation. In both cases, the compatibility function becomes

f(x, y) = Σ_{(i,ȳ)} α^i_ȳ v^i_ȳ k((x^i, ȳ), (x, y)).   (14)

Comparing the optimization problems (10)/(11) and (12)/(13) to the ordinary SVM dual, we see that MLSP couples |Y| binary SVM problems by the joint kernel function and the summed-over box constraints. In particular, whenever only a feasibly small subset of variables has to be considered, we can solve the problem using a general purpose QP solver, or a slightly modified SVM solver. Overall, however, there are infeasibly many constraints in the primal, or variables in the dual. Analogously to the SSVM situation we therefore apply iterative working set training, which we explain here using the terminology of the primal. We start with an arbitrary, e.g. empty, working set S. Then, in each step we solve the optimization using only the constraints indicated by the working set. 
For the resulting solution (w_S, ξ_S) we check whether any constraints of the full set (6)/(7) are violated up to a target precision ε. If not, we have found the optimal parameters. Otherwise, we add the most violated constraint to S and start the next iteration. The same monotonicity argument as in [3] shows that we reach an objective value ε-close to the optimal one within O(1/ε) steps. Consequently, MLSP training is roughly comparable in computational complexity to SSVM training.

The crucial step in working set training is the identification of violated constraints. Note that constraints in MLSP are determined by pairs of samples and single labels, not pairs of samples and sets of labels. This allows us to reuse existing methods for loss augmented single label inference. In practice, it is safe to assume that the sets Y^i are feasibly small, since they are given to us explicitly. Consequently, we can identify violated "positive" constraints by explicitly checking the inequalities (7)/(6) for y ∈ Y^i. Identifying violated "negative" constraints requires loss-augmented prediction over Y \ Y^i. We are not aware of a general purpose solution for this task, but at least all problems that allow efficient K-best MAP prediction can be handled by iteratively performing loss-augmented prediction within Y until a violating example from Y \ Y^i is found, or it is confirmed that no such example exists. Note that K-best versions of most standard MAP prediction methods have been developed, including max-flow [9], loopy BP [10], LP-relaxations [11], and sampling [12].

3.4 Prediction problem

After training, Equation (2) specifies the rule to predict output sets for new input data. In contrast to single-label SSVM prediction this requires not only a maximization over all elements of Y, but the collection of all elements y ∈ Y of positive score. 
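For a label space small enough to enumerate explicitly, this set-valued prediction rule (2) can be sketched directly. The snippet is illustrative only (the score values are hypothetical); for large structured output sets one would replace the enumeration by the search techniques discussed below.

```python
# Sketch of the set-valued prediction rule (2): collect every label whose
# compatibility score f(x, y) is positive, instead of taking an argmax.
def predict_set(label_space, f):
    """Return G(x) = {y in label_space : f(x, y) > 0}."""
    return {y for y in label_space if f(y) > 0}

# Hypothetical scores for one input x over a tiny enumerable label space.
scores = {"a": 0.7, "b": -0.3, "c": 1.2}
assert predict_set(scores, scores.get) == {"a", "c"}
```

Note that, unlike an argmax rule, this can also return the empty set, i.e. abstain from a decision.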
The structure of the output set is not as immediately helpful for this as it is, e.g., in MAP prediction. Task-specific solutions exist, however, for example branch-and-bound search for object detection [13]. Also, it is often possible to establish an upper bound on the number of desired outputs, and then K-best prediction techniques can again be applied. This makes MLSP of potential use for several classical tasks, such as parsing and chunking in natural language processing, secondary structure prediction in computational biology, or human pose estimation in computer vision. In general situations, evaluating (2) might require approximate structured prediction techniques, e.g. iterative greedy selection [14]. Note that the use of approximation algorithms is less problematic here, because, in contrast to training, the prediction step is not performed in an iterative manner, so errors do not accumulate.

4 Related Work

Multi-label classification is an established field of research in machine learning and several established techniques are available, most of which fall into one of three categories: 1) Multi-class reformulations [15] treat every possible label subset, Y ∈ P(Y), as a new class in an independent multi-class classification scenario. 2) Per-label decomposition [16] trains one classifier for each output label and makes independent decisions for each of those. 3) Label ranking [17] learns a function that ranks all potential labels for an input sample. Given the size of Y, 1) is not a promising direction for multi-label structured prediction. A straight-forward application of 2) and 3) is also infeasible if Y is too large to enumerate. However, MLSP resembles both approaches by sharing their prediction rule (2). 
MLSP can be seen as a way to make a combination of both approaches applicable to the situation of structured prediction by incorporating the ability to generalize in the label set.

Besides the general concepts above, many specific techniques for multi-label prediction have been proposed, several of them making use of structured prediction techniques: [18] introduces an SSVM formulation that allows direct optimization of the average precision ranking loss when the label set can be enumerated. [19] relies on a counting framework for this purpose, and [20] proposes an SSVM formulation for enforcing diversity between the labels. [21] and [22] identify shared subspaces between sets of labels, [23] encodes linear label relations by a change of the SSVM regularizer, and [24] handles the case of tree- and DAG-structured dependencies between possible outputs. All these methods work in the multi-class setup and require an explicit enumeration of the label set. They use a structured prediction framework to encode dependencies between the individual output labels, of which there are relatively few. MLSP, on the other hand, aims at predicting multiple structured objects, i.e. the structured prediction framework is not just a tool to improve multi-class classification with multiple output labels, but it is required as a core component for predicting even a single output.

Some previous methods target multi-label prediction with large output sets, in particular by using label compression [25] or a label hierarchy [26]. This allows handling thousands of potential output classes, but a direct application to the structured prediction situation is not possible, because these methods still require explicit handling of the output label vectors, or cannot predict labels that were not part of the training set.

The actual task of predicting multiple structured outputs has so far not appeared explicitly in the literature. 
The situation of multiple inputs during training has, however, received some attention:\n[27] introduces a one-class SVM based training technique for learning with ambiguous ground truth\ndata. [13] trains an SSVM for the same task by de\ufb01ning a task-adapted loss function \u2206min(Y, \u00afy) =\nminy\u2208Y \u2206(y, \u00afy). [28] uses a similar min-loss in a CRF setup to overcome problems with incomplete\nannotation. Note that \u2206min(Y, \u00afy) has the right signature to be used as a misclassi\ufb01cation cost \u03bb(Y, \u00afy)\nin MLSP. The compatibility functions learned by the maximum-margin techniques [13, 27] have\nthe same functional form as f (x, y) in MLSP, so they can, in principle, be used to predict multiple\noutputs using Equation (2). However, our experiments of Section 5 show that this leads to low multi-\nlabel prediction accuracy, because the training setup is not designed for this evaluation procedure.\n\n4.1 Structured Multilabel Prediction in the SSVM Framework\nAt \ufb01rst sight, it appears unnecessary to go beyond the standard structured prediction framework at\nall in trying to predict subsets of Y. As mentioned in Section 3, multi-label prediction into Y can\nbe interpreted as single-label prediction into P(Y), so a straight-forward approach to multi-label\nstructured prediction would be to use an ordinary SSVM with output set P(Y). We will call this\nsetup P-SSVM. It has previously been proposed for classical multi-label prediction, for example\nin [23]. 
Unfortunately, as we will show in this section, the P-SSVM setup is not well suited to the structured prediction situation.

A P-SSVM learns a prediction function, G(x) := argmax_{Y ∈ P(Y)} F(x, Y), with a linearly parameterized compatibility function, F(x, Y) := ⟨w, ψ(x, Y)⟩, by solving the optimization problem

argmin_{w ∈ H, ξ^1,...,ξ^n ∈ R+}  (1/2)‖w‖² + (C/n) Σ_{i=1}^n ξ^i,  subject to  ξ^i ≥ ∆_ML(Y^i, Y) + F(x^i, Y) − F(x^i, Y^i),   (15)

for i = 1, . . . , n, and for all Y ∈ P(Y). The main problem with this general form is that identifying violated constraints of (15) requires loss-augmented maximization of F over P(Y), i.e. an exponentially larger set than Y. To better understand this problem, we analyze what happens when making the same simplifying assumptions as for MLSP in Section 3.1. First, we assume additivity of F over Y, i.e. F(x, Y) := Σ_{y ∈ Y} f(x, y) for f(x, y) := ⟨w, ψ(x, y)⟩. This turns the argmax-evaluation for G(x) exactly into the prediction rule (2), and the constraint set in (15) simplifies to

ξ^i ≥ ∆_ML(Y^i, Y) − Σ_{y ∈ Y ⊖ Y^i} v^i_y f(x^i, y),  for i = 1, . . . , n, and for all Y ∈ P(Y).   (16)

Choosing ∆_ML as max loss does not allow us to further simplify this expression, but choosing the sum loss does: with ∆_ML(Y^i, Y) = Σ_{y ∈ Y ⊖ Y^i} λ(Y^i, y), we obtain an explicit expression for the label set maximizing the right hand side of the constraint (16), namely

Y^i_viol = {y ∈ Y^i : f(x^i, y) < λ(Y^i, y)} ∪ {y ∈ Y \ Y^i : f(x^i, y) > −λ(Y^i, y)}.   (17)

Thus, we avoid having to maximize a function over P(Y). 
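The maximizing set of Equation (17) is simple to form once the scores are known. The following is an illustrative sketch over a small, explicitly enumerable label space (all names and values are hypothetical; the per-label costs λ(Y^i, y) are collapsed to a dictionary for brevity):

```python
# Sketch of Y_viol from Equation (17): the label set maximizing the
# right-hand side of constraint (16) under the sum loss.
def most_violating_set(label_space, Y_true, f, lam):
    pos = {y for y in Y_true if f[y] < lam[y]}                 # margin missed
    neg = {y for y in label_space - Y_true if f[y] > -lam[y]}  # wrong positives
    return pos | neg

labels = {"a", "b", "c", "d"}                     # hypothetical label space
Y_true = {"a", "b"}
f = {"a": 1.5, "b": 0.2, "c": -0.1, "d": -2.0}    # current scores f(x, y)
lam = {y: 1.0 for y in labels}                    # unit per-label costs
# "b" misses its margin (0.2 < 1) and "c" scores above -1: both enter Y_viol
assert most_violating_set(labels, Y_true, f, lam) == {"b", "c"}
```

The computation itself is cheap per label; the difficulty discussed next is that the resulting set can be huge.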
Unfortunately, the set Y^i_viol in Equation (17) can contain exponentially many terms, rendering a numeric computation of F(x^i, Y^i_viol) or its gradient still infeasible in general. Note that this is not just a rare, easily avoidable case. Because w, and thereby f, are learned iteratively, they typically go through phases of low prediction quality, i.e. large Y^i_viol. In fact, starting the optimization with w = 0 would already lead to Y^i_viol = Y for all i = 1, . . . , n. Consequently, we presume that P-SSVM training is intractable for structured prediction problems, except for the case of a small label set.

Note that while computational complexity is the most prominent problem of P-SSVM training, it is not the only one. For example, even if we did find a polynomial-time training algorithm to solve (15), the generalization ability of the resulting predictor would be unclear: the SSVM generalization bounds [8] suggest that training sets of size O(log |P(Y)|) = O(|Y|) will be required, compared to the O(log |Y|) bound we established for MLSP in Section 3.2.

5 Experimental Evaluation

To show the practical use of MLSP we performed experiments on multi-label hierarchical classification and object detection in natural images. The complete protocol of training a miniature toy example can be found in the supplemental material (available from the author's homepage).

5.1 Multi-label hierarchical classification

We use hierarchical classification as an illustrative example that in particular allows us to compare MLSP to alternative, less scalable, methods. On the one hand, it is straight-forward to model as a structured prediction task, see e.g. [3, 29, 30, 31]. 
On the other hand, its output set is small enough\nsuch that we can compare MLSP also against other approaches that cannot handle very large output\nsets, in particular P-SSVM and independent per-label training.\nThe task in hierarchical classi\ufb01cation is to classify samples into a number of discrete classes, where\neach class corresponds to a path in a tree. Classes are considered related if they share a path in the\ntree, and this is re\ufb02ected by sharing parts of the joint feature representations. In our experiments,\nwe use the PASCAL VOC2006 dataset that contains 5304 images, each belonging to between 1\nand 4 out of 10 classes. We represent each image x by 960-dimensional GIST features \u03c6(x) and\nuse the same 19-node hierarchy \u03ba and joint feature function, \u03c8(x, y) = vec(\u03c6(x) \u2297 \u03ba(y)), as in\n[30]. As baselines we use P-SSVM [23], JKSE [27], and an SSVM trained with the normal, single-\nlabel objective, but evaluated by Equation (2). We follow the pre-de\ufb01ned data splits, doing model\nselection using the train and val parts to determine C \u2208 {2\u22121, . . . , 214} (MLSP, P-SSVM, SSVM),\nor \u03bd \u2208 {0.05, 0.10, . . . , 0.95} (JKSE). We then retrain on the combination of train and val and we\ntest on the test part of the dataset. As the label set is small, we use exhaustive search over Y to\nidentify violated constraints during training and to perform the \ufb01nal predictions.\nWe report results in Table 1a). As there is no single established multi-label error measure, and be-\ncause it illustrates the effect of training with different loss function, we report several common mea-\nsures. The results show nicely how the assumptions made during training in\ufb02uence the prediction\ncharacteristics. Qualitatively, MLSP achieves best prediction accuracy in the max loss, P-SSVM is\nbetter if we judge by the sum loss. This exactly re\ufb02ects the loss functions they are trained with. 
Independent training achieves very good results with respect to both measures, justifying its common use for multi-label prediction with small label sets and many training examples per label.² Ordinary SSVM training does not achieve good max- or sum-loss scores, but it performs well if quality is measured by the average of the area under the precision-recall curves across labels for each individual test example. This is also plausible, as SSVM training uses a ranking-like loss: all potential labels for each input are enforced to be in the right order (correct labels have higher score than incorrect ones), but nothing in the objective encourages a cut-off point at 0. As a consequence, too few or too many labels are predicted by Equation (2). In Table 1a) it appears to be too many, visible as high recall but low precision. JKSE does not achieve competitive results in max loss, mAUC or F1-score. Potentially this is because we use it with a linear kernel to stay comparable with the other methods, whereas [27] reported good results mainly for nonlinear kernels.
Qualitatively, MLSP and P-SSVM show comparable prediction quality. We take this as an indication that both training with sum loss and training with max loss make sense conceptually. However, of the five methods, only MLSP, JKSE and SSVM generalize to the more general structured prediction setting, as they do not require exhaustive enumeration of the label set. Amongst these, MLSP is preferable, except if one is only interested in ranking the labels, for which SSVM also works well.

²For ∆sum this is not surprising: independent training is known to be the optimal setup, if enough data is available [32]. For ∆max, the multi-class reformulation would be the optimal setup. The problem in multi-label structured prediction is solely that |Y| is too large, and training data too scarce, to use either of these setups.

Table 1: Multi-label structured prediction results. ∆max/∆sum: max/sum loss (lower is better), mAUC: mean area under per-sample precision-recall curve, prec/rec/F1: precision, recall, F1-score (higher is better). Methods printed in italics are infeasible for general structured output sets.

          ∆max   ∆sum   mAUC   F1 ( prec / rec )
MLSP      0.73   1.59   0.82   0.42 ( 0.40 / 0.46 )
JKSE      1.00   1.91   0.54   0.23 ( 0.14 / 0.76 )
SSVM      0.88   3.86   0.84   0.37 ( 0.24 / 0.88 )
P-SSVM    0.75   1.11   0.83   0.44 ( 0.48 / 0.41 )
indep.    0.73   1.07   0.84   0.46 ( 0.61 / 0.38 )

(a) Hierarchical classification results.

          ∆max   ∆sum   F1 ( prec / rec )
MLSP      0.66   1.31   0.46 ( 0.60 / 0.52 )
JKSE      0.99   7.29   0.09 ( 0.60 / 0.16 )
SSVM      0.93   3.71   0.21 ( 0.79 / 0.33 )
P-SSVM    infeasible
indep.    infeasible

(b) Object detection results.

5.2 Object class detection in natural images
Object detection can be solved as a structured prediction problem where natural images are the inputs and coordinate tuples of bounding boxes are the outputs. The label set is of quadratic size in the number of image pixels and thus cannot be searched exhaustively. However, efficient (loss-augmented) argmax-prediction can be performed by branch-and-bound search [33]. Object detection is also inherently a multi-label task, because natural images contain different numbers of objects. We perform experiments on the public UIUC-Cars dataset [34]. Following the experimental setup of [27] we use the multiscale part of the dataset for training and the singlescale part for testing. The additional set of pre-cropped car and background images serves as validation set for model selection.
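Detections in this kind of experiment are judged by area overlap: a predicted box matches a ground-truth box if their intersection-over-union is at least 0.5. A minimal sketch of that test follows; the (left, top, right, bottom) box format and the function names are our assumptions, not the paper's code:

```python
# Hedged sketch of the standard bounding-box overlap test used in detection.
# Boxes are axis-aligned tuples (left, top, right, bottom).

def iou(a, b):
    """Area of intersection divided by area of union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def A(y_bar, y):
    """Overlap cost: 0 if the boxes match (IoU >= 0.5), 1 otherwise."""
    return 0.0 if iou(y_bar, y) >= 0.5 else 1.0
```

In the misclassification cost λ(Y, y) described in this section, a candidate box y outside the ground-truth set Y is then charged the best overlap cost min over ȳ ∈ Y of A(ȳ, y).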
We use the localization kernel k((x, y), (x̄, ȳ)) = φ(x|y)^T φ(x̄|ȳ), where φ(x|y) is a 1000-dimensional bag of visual words representation of the region y within the image x [13]. As misclassification cost we use λ(Y, y) := 1 for y ∈ Y, and λ(Y, y) := min_{ȳ∈Y} A(ȳ, y) otherwise, where A(ȳ, y) := 0 if area(ȳ ∩ y)/area(ȳ ∪ y) ≥ 0.5, and A(ȳ, y) := 1 otherwise. This is a common measure in object detection, which reflects the intuition that all objects in an image should be identified, and that an object's position is acceptable if it overlaps sufficiently with at least one ground truth object. P-SSVM and independent training are not applicable in this setup, so we compare MLSP against JKSE and SSVM. For each method we train models on the training set and choose the C or ν value that maximizes the F1 score over the validation set of pre-cropped object and background images. Prediction is performed using branch-and-bound optimization with greedy non-maximum suppression [35]. Table 1b) summarizes the results on the test set (we do not report the mAUC measure, as computing it would require summing over the complete output set). One sees that MLSP achieves the best results amongst the three methods. SSVM as well as JKSE suffer particularly from low recall, and their predictions also have higher sum loss as well as max loss.

6 Summary and Discussion
We have studied multi-label classification for structured output sets. Existing multi-label techniques cannot directly be applied to this task because of the large size of the output set, and our analysis showed that formulating multi-label structured prediction in a set-valued structured support vector machine framework also leads to infeasible training problems.
Instead, we proposed a new maximum-margin formulation, MLSP, that remains computationally tractable by using the max loss instead of the sum loss between sets, and that shows several of the advantageous properties known from other maximum-margin based techniques, in particular a convex training problem and PAC-Bayesian generalization bounds. Our experiments showed that MLSP has higher prediction accuracy than baseline methods that remain applicable in structured output settings. For small label sets, where both concepts are applicable, MLSP performs comparably to the set-valued SSVM formulation.
Besides these promising initial results, we believe that there are still several aspects of multi-label structured prediction that need to be better understood, in particular the prediction problem at test time. Collecting all elements of positive score is a natural criterion, but it is costly to perform exactly if the output set is very large. Therefore, it would be desirable to develop sparsity-enforcing variations of Equation (2), for example by adopting ideas from compressed sensing [25].

References
[1] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[2] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[3] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6, 2005.
[4] T. Joachims, T. Finley, and C. N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1), 2009.
[5] C. H. Teo, S. V. N. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. JMLR, 11, 2010.
[6] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3), 2007.
[7] K. Dembczynski, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, 2011.
[8] D. McAllester. Generalization bounds and consistency for structured labeling. In G. Bakır, T. Hofmann, B. Schölkopf, A. J. Smola, and B. Taskar, editors, Predicting Structured Data. MIT Press, 2007.
[9] D. Nilsson. An efficient algorithm for finding the M most probable configurations in probabilistic expert systems. Statistics and Computing, 8(2), 1998.
[10] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In NIPS, 2004.
[11] M. Fromer and A. Globerson. An LP view of the M-best MAP problem. In NIPS, 2009.
[12] J. Porway and S.-C. Zhu. C^4: Exploring multiple solutions in graphical models by cluster sampling. PAMI, 33(9), 2011.
[13] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In ECCV, 2008.
[14] A. Bordes, N. Usunier, and L. Bottou. Sequence labelling SVMs trained in one pass. In ECML PKDD, 2008.
[15] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9), 2004.
[16] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In ECML, 1998.
[17] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2–3), 2000.
[18] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In ACM SIGIR, 2007.
[19] T. Gärtner and S. Vembu. On structured output training: Hard cases and an efficient alternative. Machine Learning, 76(2):227–242, 2009.
[20] Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In ICML, 2008.
[21] S. Ji, L. Tang, S. Yu, and J. Ye. Extracting shared subspaces for multi-label classification. In ACM SIGKDD, 2008.
[22] P. Rai and H. Daumé III. Multi-label prediction via sparse infinite CCA. In NIPS, 2009.
[23] B. Hariharan, L. Zelnik-Manor, S. V. N. Vishwanathan, and M. Varma. Large scale max-margin multi-label classification with priors. In ICML, 2010.
[24] W. Bi and J. Kwok. Multi-label classification on tree- and DAG-structured hierarchies. In ICML, 2011.
[25] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In NIPS, 2009.
[26] G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In ECML PKDD, 2008.
[27] C. H. Lampert and M. B. Blaschko. Structured prediction by joint kernel support estimation. Machine Learning, 77(2–3), 2009.
[28] J. Petterson, T. S. Caetano, J. J. McAuley, and J. Yu. Exponential family graph matching and ranking. In NIPS, 2009.
[29] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. JMLR, 7, 2006.
[30] A. Binder, K.-R. Müller, and M. Kawanabe. On taxonomies for multi-class image categorization. IJCV, 2011.
[31] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In CIKM, 2004.
[32] K. Dembczynski, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, 2010.
[33] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. PAMI, 31(12), 2009.
[34] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. PAMI, 26(11), 2004.
[35] C. H. Lampert. An efficient divide-and-conquer cascade for nonlinear object detection. In CVPR, 2010.