{"title": "Submodular Multi-Label Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1512, "page_last": 1520, "abstract": "In this paper we present an algorithm to learn a multi-label classifier which attempts at directly optimising the F-score. The key novelty of our formulation is that we explicitly allow for assortative (submodular) pairwise label interactions, i.e., we can leverage the co-ocurrence of pairs of labels in order to improve the quality of prediction. Prediction in this model consists of minimising a particular submodular set function, what can be accomplished exactly and efficiently via graph-cuts. Learning however is substantially more involved and requires the solution of an intractable combinatorial optimisation problem. We present an approximate algorithm for this problem and prove that it is sound in the sense that it never predicts incorrect labels. We also present a nontrivial test of a sufficient condition for our algorithm to have found an optimal solution. We present experiments on benchmark multi-label datasets, which attest the value of our proposed technique. We also make available source code that enables the reproduction of our experiments.", "full_text": "Submodular Multi-Label Learning\n\nJames Petterson\n\nNICTA/ANU\n\nCanberra, Australia\n\nTiberio Caetano\n\nNICTA/ANU\n\nSydney/Canberra, Australia\n\nAbstract\n\nIn this paper we present an algorithm to learn a multi-label classi\ufb01er which\nattempts at directly optimising the F -score. The key novelty of our for-\nmulation is that we explicitly allow for assortative (submodular) pairwise\nlabel interactions, i.e., we can leverage the co-ocurrence of pairs of labels\nin order to improve the quality of prediction. Prediction in this model\nconsists of minimising a particular submodular set function, what can be\naccomplished exactly and e\ufb03ciently via graph-cuts. 
Learning, however, is substantially more involved and requires the solution of an intractable combinatorial optimisation problem. We present an approximate algorithm for this problem and prove that it is sound in the sense that it never predicts incorrect labels. We also present a nontrivial test of a sufficient condition for our algorithm to have found an optimal solution. We present experiments on benchmark multi-label datasets, which attest the value of the proposed technique. We also make available source code that enables the reproduction of our experiments.

1 Introduction

Research in multi-label classification has seen substantial growth in recent years (e.g., [1, 2, 3, 4]). This is due to a number of reasons, including the increased availability of multi-modal datasets and the emergence of crowdsourcing, which naturally create settings where multiple interpretations of a given input observation are possible (multiple labels for a single instance). Many classical problems are also inherently multi-label, such as the categorisation of documents [5], gene function prediction [6] and image tagging [7].

There are two desirable aspects in a multi-label classification system. The first is that a prediction should ideally be good both in terms of precision and recall: we care not only about predicting as many of the correct labels as possible, but also as few non-correct labels as possible. One of the most popular measures for assessing performance is therefore the F-score, which is the harmonic mean of precision and recall [8]. The second is that, both during training and at test time, the algorithm should ideally take into account possible dependencies between the labels.
For example, in automatic image tagging, if the labels ocean and ship have high co-occurrence frequency in the training set, the model learned should somehow boost the chances of predicting ocean if there is strong visual evidence for the label ship [9].

In this paper we present a method that directly addresses these two aspects. First, we explicitly model the dependencies between pairs of labels, albeit restricting them to be submodular (in rough terms, we model only the positive pairwise label correlations). This enables exact and efficient prediction at test time, since finding an optimal subset of labels reduces to the minimisation of a particular kind of submodular set function, which can be done efficiently via graph-cuts. Second, our method directly attempts to optimise a convex surrogate of the F-score. This is because we draw on the max-margin structured prediction framework from [10], which, as we will see, enables us to optimise a convex upper bound on the loss induced by the F-score. The critical technical contribution of the paper is a constraint generation algorithm for loss-augmented inference where the scoring of the (input, output) pair is a submodular set function and the loss is derived from the F-score. This is what enables us to fit our model into the estimator from [10]. Our constraint generation algorithm is only approximate since the problem is intractable. However, we give theoretical arguments supporting our empirical findings that the algorithm is not only very accurate in practice, but in the majority of our real-world experiments it actually produces a solution which is exactly optimal. We compare the proposed method with other benchmark methods on publicly available multi-label datasets, and the results favour our approach. We also provide source code that enables the reproduction of all the experiments presented in this paper.

Related Work.
A convex relaxation for F -measure optimisation in the multi-label setting\nwas proposed recently in [11]. This can be seen as a particular case of our method when there\nare no explicit label dependencies. In [12] the authors propose quite general tree and DAG-\nbased dependencies among the labels and adapt decoding algorithms from signal processing\nto the problem of \ufb01nding predictions consistent with the structures learned. In [13] graphical\nmodels are used to impose structure in the label dependencies. Both [12] and [13] are in a\nsense complementary to our method since we do not enforce any particular graph topology\non the labels but instead we limit the nature of the interactions to be submodular. In [14]\nthe authors study the multi-label problem under the assumption that prior knowledge on\nthe density of label correlations is available. They also use a max-margin framework, similar\nin spirit to our formulation. A quite simple and basic strategy for multi-label problems is\nto treat them as multiclass classi\ufb01cation, e\ufb00ectively ignoring the relationships between the\nlabels. One example in this class is the Binary Method [15]. The RAkEL algorithm [16] uses\ninstead an ensemble of classi\ufb01ers, each learned on a random subset of the label set. In [17] the\nauthors propose a Bayesian CCA model and apply it to multi-label problems by enforcing\ngroup sparsity regularisation in order to capture information about label co-occurrences.\n\n2 The Model\n\nLet x \u2208 X be a vector of dimensionality D with the features of an instance (say, an image);\nlet y \u2208 Y be a set of labels for an instance (say, tags for an image), from a \ufb01xed dictionary\nof V possible labels, encoded as y \u2208{ 0, 1}V . For example, y = [1 1 0 0] denotes the \ufb01rst\nand second labels of a set of four. 
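The binary label encoding above, together with the empirical label co-occurrence counts used later for the pairwise terms, can be sketched in a few lines. This is a toy Python illustration only: the label names are invented, and normalising the counts by the number of training instances is one plausible reading of the "normalised counts" mentioned later, not a detail fixed by the paper.

```python
import numpy as np

# Hypothetical dictionary of V = 4 labels and a toy training set of
# tag sets (label names invented for illustration).
labels = ["ocean", "ship", "sky", "car"]
tag_sets = [{"ocean", "ship"}, {"ocean", "sky"}, {"ship", "ocean"}]

def encode(tags, labels):
    """Encode a set of tags as a binary indicator vector y in {0,1}^V."""
    return np.array([1 if l in tags else 0 for l in labels])

Y = np.stack([encode(t, labels) for t in tag_sets])
print(Y[0])  # [1 1 0 0]: first and second labels present

# Empirical co-occurrence counts of label pairs, normalised by the
# number of training instances (an assumed choice for the matrix C
# that will score the pairwise terms).
C = (Y.T @ Y).astype(float) / len(Y)
np.fill_diagonal(C, 0.0)
```

Here `C[i, j]` is the fraction of training instances in which labels i and j appear together, e.g. ocean and ship co-occur in 2 of 3 toy instances.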
We assume we are given a training set {(xn, yn)}, n = 1, ..., N, and our task is to estimate a map f : X → Y that has good agreement with the training set but also generalises well to new data. In this section we define the class of functions f that we will consider. In the next section we define the learning algorithm, i.e., a procedure to find a specific f in the class.

2.1 The Loss Function Derived from the F-Score

Our notion of 'agreement' with the training set is given by a loss function. We focus on maximising the average over all instances of F, a score that considers both precision and recall and can be written in our notation as

F = (1/N) Σ_{n=1}^{N} 2 p(yn, ȳn) r(yn, ȳn) / (p(yn, ȳn) + r(yn, ȳn)), where p(y, ȳ) = |y ⊙ ȳ| / |ȳ| and r(y, ȳ) = |y ⊙ ȳ| / |y|.

Here ȳn denotes our prediction for input instance n, yn is the corresponding ground-truth, ⊙ denotes the element-wise product and |u| denotes the 1-norm of vector u (in our case the number of 1s, since u will always be binary). Since our goal is to maximise the F-score, a suitable choice of loss function is Δ(y, ȳ) = 1 − F(y, ȳ), which is the one we adopt in this paper. The loss for a single prediction is therefore

Δ(y, ȳ) = 1 − 2 |y ⊙ ȳ| / (|y| + |ȳ|)    (1)

2.2 Feature Maps and Parameterisation

We assume that the prediction for a given input x is a maximiser of a score that encodes both the unary dependency between labels and instances as well as the pairwise dependencies between labels:

ȳ ∈ argmax_{y ∈ Y} yᵀAy    (2)

where A is an upper-triangular matrix scoring the pair (x, y), with diagonal elements Aii = ⟨x, θ¹ᵢ⟩, where x is the input feature vector and θ¹ᵢ is a parameter vector that defines how label i weighs each feature of x.
The off-diagonal elements are Aij = Cij θ²ij, where Cij is the normalised count of co-occurrences of labels i and j in the training set, and θ²ij the corresponding scalar parameter, which is forced to be non-negative. This ensures that the off-diagonal entries of A are non-negative and therefore that problem (2) consists of the maximisation of a supermodular function (or, equivalently, the minimisation of a submodular function), which can be solved efficiently via graph-cuts. We also define the complete parameter vectors θ¹ := [. . . θ¹ᵢᵀ . . .]ᵀ and θ² := [. . . θ²ij . . .]ᵀ, with θ = [θ¹ᵀ θ²ᵀ]ᵀ, as well as the complete feature maps φ¹(x, y) = vec(x ⊗ y), φ²(y) = vec(y ⊗ y) and φ(x, y) = [φ¹ᵀ(x, y) φ²ᵀ(y)]ᵀ. This way the score in expression (2) can be written as yᵀAy = ⟨φ(x, y), θ⟩. Note that the dimensionality of θ² is the number of non-zero elements of matrix C; in this setting that is V-choose-2, but it can be reduced by setting to zero the elements of C below a specified threshold.

3 Learning Algorithm

Optimisation Problem. Direct optimisation of the loss defined in equation (1) is a highly intractable problem, since it is a discrete quantity and our parameter space is continuous. Here we will follow the program in [10] and instead construct a convex upper bound on the loss function, which can then be attacked using convex optimisation tools. The purpose of learning will be to solve the following convex optimisation problem

[θ*, ξ*] = argmin_{θ,ξ} (1/N) Σ_{n=1}^{N} ξn + (λ/2) ‖θ‖²    (3a)
s.t. ⟨φ(xn, yn), θ⟩ − ⟨φ(xn, y), θ⟩ ≥ Δ(y, yn) − ξn,  ξn ≥ 0,  ∀n, y ≠ yn.    (3b)

This is the margin-rescaling estimator for structured support vector machines [10]. The constraints immediately imply that the optimal solution will be such that ξ*n ≥ Δ(argmax_y ⟨φ(xn, y), θ*⟩, yn), and therefore the minimum value of the objective function upper bounds the loss, thus motivating the formulation. Since there are exponentially many constraints, we follow [10] in adopting a constraint generation strategy, which starts by solving the problem with no constraints and iteratively adds the most violated constraint for the current solution of the optimisation problem. This is guaranteed to find an ε-close approximation of the solution of (3) after including only a polynomial (O(ε⁻²)) number of constraints [10]. At each iteration we need to maximise the violation margin ξn, which from the constraints (3b) reduces to

y*n ∈ argmax_{y ∈ Y} [Δ(y, yn) + ⟨φ(xn, y), θ⟩]    (4)

Learning Algorithm. The learning algorithm is described in Algorithm 1 (which requires Algorithm 2 as a subroutine). Algorithm 1 describes a particular convex solver based on bundle methods (BMRM [18]), which we use here. Other solvers could have been used instead. Our contribution lies not here, but in the constraint generation routine for Algorithm 1, which is described in Algorithm 2.

BMRM requires the solution of constraint generation and the value of the objective function for the slack corresponding to the constraint generated, as well as its gradient. We discuss constraint generation shortly. The other two ingredients we describe here.
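On toy instances, the loss-augmented problem in equation (4) can be solved by exhaustive enumeration over {0,1}^V, which is useful as a reference when checking faster solvers. This is only an illustrative sketch: the paper's actual solver is the graph-cut-based Algorithm 2, the function name below is our own, and brute force is feasible only for small V.

```python
import itertools
import numpy as np

def loss_augmented_argmax(A, y_true):
    """Brute-force solution of equation (4):
    argmax_y [Delta(y, yn) + y^T A y] over y in {0,1}^V, where
    Delta(y, yn) = 1 - 2|y . yn| / (|y| + |yn|) is the F-score loss
    and A is the score matrix for one training instance."""
    y_true = np.asarray(y_true)
    V = len(y_true)
    best_y, best_val = None, -np.inf
    for bits in itertools.product([0, 1], repeat=V):
        y = np.array(bits)
        total = int(y.sum()) + int(y_true.sum())
        delta = 0.0 if total == 0 else 1.0 - 2.0 * int((y * y_true).sum()) / total
        val = delta + float(y @ A @ y)
        if val > best_val:
            best_val, best_y = val, y
    return best_y, best_val

# A strong unary score on label 0 and negative scores elsewhere: the
# maximiser is the ground truth itself, with zero loss and score 5.
A = np.diag([5.0, -1.0, -1.0])
y, v = loss_augmented_argmax(A, [1, 0, 0])
print(y.tolist(), v)  # [1, 0, 0] 5.0
```

With all-zero scores the loss term dominates and the maximiser simply maximises Δ, reaching the value 1.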
The slack at the optimal solution is

ξ*n = Δ(y*n, yn) + ⟨φ(xn, y*n), θ⟩ − ⟨φ(xn, yn), θ⟩    (5)

thus the objective function from (3) becomes

(1/N) Σ_n [Δ(y*n, yn) + ⟨φ(xn, y*n), θ⟩ − ⟨φ(xn, yn), θ⟩] + (λ/2) ‖θ‖²    (6)

whose gradient is

λθ − (1/N) Σ_n (φ(xn, yn) − φ(xn, y*n))    (7)

Algorithm 1: Bundle Method for Regularised Risk Minimisation (BMRM)
1: Input: training set {(xn, yn)}, n = 1, ..., N, and λ. Output: θ
2: Initialise i = 1, θ1 = 0
3: repeat
4:   for n = 1 to N do
5:     Compute y*n (the y*n_kmax returned by Algorithm 2)
6:   end for
7:   Compute gradient gi (equation (7)) and objective oi (equation (6))
8:   θ_{i+1} := argmin_θ (λ/2) ‖θ‖² + max(0, max_{j ≤ i} ⟨gj, θ⟩ + oj); i ← i + 1
9: until converged (see [18])
10: return θ

Algorithm 2: Constraint Generation
1: Input: (xn, yn), θ, V. Output: y*n_kmax
2: k = 0
3: A[k],n_ij = Cij θ²ij (for all i, j : i ≠ j)
4: while k ≤ V do
5:   diag(A[k],n) = diag(A) − 2 yn / (k + |yn|)
6:   y*n_k = argmax_y yᵀ A[k],n y (graph-cuts)
7:   if |y*n_k| > k then
8:     kmax = |y*n_k|; k = kmax
9:   else if |y*n_k| = k then
10:    kmax = |y*n_k|; k = kmax + 1
11:  else
12:    k = k + 1
13:  end if
14: end while
15: return y*n_kmax

Expressions (6) and (7) are then used in Algorithm 1.

Constraint Generation. The most challenging step consists of solving the constraint generation problem. Constraint generation for a given training instance n consists of solving the combinatorial optimisation problem in expression (4), which, using the loss in equation (1), as well as the correspondence yᵀAy = ⟨φ(x, y), θ⟩, can be written as

y*n ∈ argmax_y yᵀ An(y) y    (8)

where diag(An) = diag(A) − 2 yn / (|y| + |yn|) and offdiag(An) = offdiag(A). Note that the matrix An depends on y. More precisely, a subset of its diagonal elements (those An_ii for which yn(i) = 1) depends on the quantity |y|, i.e., the number of nonzero elements in y. This makes solving problem (8) a formidable task. If An were independent of y, then eq. (8) could be solved exactly and efficiently via graph-cuts, just as our prediction problem in equation (2). A naïve strategy would be to aim for solving problem (8) V times, one for each value of |y|, constraining the optimisation to only include elements y such that |y| is fixed. In other words, we can partition the optimisation problem into k optimisation problems conditioned on the sets Yk := {y : |y| = k}:

max_y yᵀ A(y) y = max_k max_{y ∈ Yk} yᵀ A[k],n y    (9)

where A[k],n denotes the particular matrix An that we obtain when |y| = k. However the inner maximisation above, i.e., the problem of maximising a supermodular function (or minimising a submodular function) subject to a cardinality constraint, is itself NP-hard [19]. We therefore do not follow this strategy, but instead seek a polynomial-time algorithm that in practice will give us an optimal solution most of the time. Algorithm 2 describes our algorithm. In the worst case it calls graph-cuts O(V) times, so the total complexity is O(V⁴).¹ The algorithm essentially searches for the largest k such that solving argmax_y yᵀ A[k],n y returns a solution with k 1s. We call the k obtained kmax, and the corresponding solution y*n_kmax. Observe that, as k increases during the execution of the algorithm, An_ii increases for those i where yn(i) = 1. The increment observed when k increases to k′ is

Δ(k → k′) := A[k′],n_ii − A[k],n_ii = 2 (k′ − k) / ((k′ + |yn|)(k + |yn|))    (10)

which is always a positive quantity. Although this algorithm is not provably optimal, Theorem 1 guarantees that it is sound in the sense that it never predicts incorrect labels. In the next section we present additional evidence supporting this algorithm, in the form of a test that, if positive, guarantees the solution obtained is optimal.

¹The worst-case bound of O(V³) for graph-cuts is very pessimistic; in practice the algorithm is extremely efficient.

We call a solution y′ a partially optimal solution of argmax_y yᵀ An(y) y if the labels it predicts as being present are indeed present in an optimal solution, i.e., if for those i for which y′(i) = 1 we also have y*n(i) = 1, for some y*n ∈ argmax_y yᵀ An(y) y.
Equivalently, we can write y′ ⊙ y*n = y′. We have the following result.

Theorem 1. Upon completion of Algorithm 2, y*n_kmax is a partially optimal solution of argmax_y yᵀ An(y) y.

The proof is in Appendix A. The theorem means that whenever the algorithm predicts the presence of a label, it does so correctly; however, there may be labels not predicted which are in fact present in the corresponding optimal solution.

4 Certificate of Optimality

As empirically verified in our experiments in section 5, our constraint generation algorithm (Algorithm 2) is indeed quite accurate: most of the time the solution obtained is optimal. In this section we present a test that, if positive, guarantees that an optimal solution has been obtained (i.e., a certificate of optimality). This can be used to generate empirical lower bounds on the probability that the algorithm returns an optimal solution (we explore this possibility in the experimental section).

We start by formalising the situation in which the algorithm will fail. Let Z := {i : y*n_kmax(i) = 0}, and PZ be the power set of Z (Z for 'zeros'). Let O := {i : y*n_kmax(i) = 1} (O for 'ones'). Then the algorithm will fail if there exists α ∈ PZ such that

Σ_{i,j ∈ α; i ≠ j} An_ij (a) + Σ_{i ∈ α, j ∈ O} A_ij (b) + Σ_{i ∈ α} A[kmax+|α|],n_ii (c) + Δ(kmax → kmax+|α|) |yn ⊙ y*n_kmax| (d) > 0    (11)

The above expression describes the situation in which, starting with y*n_kmax, if we insert |α| 1s in the indices defined by index set α, we obtain a new vector y′ which is a feasible solution of argmax_y yᵀ An(y) y and yet has strictly larger score than the solution y*n_kmax. This can be understood by looking closely into each of the sums in expression (11). Sums (a) and (b) describe the increase in the objective function due to the inclusion of off-diagonal terms. Both (a) and (b) are non-negative due to the submodularity assumption. Term (c) is the sum of the diagonal terms corresponding to the newly introduced 1s of y′. Term (c) is negative or zero, since each term in the sum is negative or zero (otherwise y*n_kmax would have included it). Finally, term (d) is non-negative, being the total increase in the diagonal elements of O due to the inclusion of |α| additional 1s. We can write (c) as

Σ_{i ∈ α} A[kmax+|α|],n_ii (c) = Σ_{i ∈ α} A[kmax],n_ii (e) + Σ_{i ∈ α} (A[kmax+|α|],n_ii − A[kmax],n_ii) (f)    (12)

and the last term can be bounded as

Σ_{i ∈ α} (A[kmax+|α|],n_ii − A[kmax],n_ii) ≤ Δ(kmax → kmax+|α|) vα (g)    (13)

where vα = min[|yn| − |yn ⊙ y*n_kmax|, |α|] is an upper bound on the number of indices i ∈ α such that yn(i) = 1, and Δ(kmax → kmax+|α|) is the increment in a diagonal element i for which yn(i) = 1 arising from increasing the cardinality of the solution from kmax to kmax + |α|. Incorporating bound (13) into equation (12), we get that (c) ≤ (e) + (g).
We can then replace (c) in inequality (11) by (e) + (g), obtaining

Σ_{i,j ∈ α; i ≠ j} An_ij + Σ_{i ∈ α, j ∈ O} A_ij + Σ_{i ∈ α} A[kmax],n_ii [=: β_{A,α}] + Δ(kmax → kmax+|α|) vα + Δ(kmax → kmax+|α|) |yn ⊙ y*n_kmax| [=: γα] > 0    (14)

Algorithm 3: Compute max_α β_{A,α}
1: Input: A[kmax],n, y*n_kmax, V. Output: max
2: max = −∞
3: Z = {i : y*n_kmax(i) = 0}
4: O = {i : y*n_kmax(i) = 1}
5: for i ∈ Z do
6:   O′ = O ∪ i
7:   rmax = max_{y : y_{O′} = 1} yᵀ A[kmax],n y (graph-cuts)
8:   if rmax > max then
9:     max = rmax
10:  end if
11: end for
12: max = max − max_y yᵀ A[kmax],n y
13: return max

Table 1: Datasets. #train/#test denotes the number of observations used for training and testing respectively; V is the number of labels and D the dimensionality of the features; Avg is the average number of labels per instance.

dataset | domain  | #train | #test | V  | D    | Avg
yeast   | biology | 1500   | 917   | 14 | 103  | 4.23
enron   | text    | 1123   | 579   | 53 | 1001 | 3.37

We know that, regardless of A or α, β_{A,α} ≤ 0 (otherwise y*n_kmax ∉ argmax_y yᵀ A[kmax],n y, since β_{A,α} is the increment in the objective function yᵀ A[kmax],n y obtained by adding 1s in the entries of α). The key fact coming to our aid is that γα is 'small', and a weak upper bound is 2. This is because

Δ(kmax → kmax+|α|) vα + Δ(kmax → kmax+|α|) |yn ⊙ y*n_kmax| ≤ Δ(kmax → kmax+|α|) |yn| ≤ Δ(kmax → V) |yn| ≤ Δ(0 → V) |yn| = 2 V |yn| / ((V + |yn|) |yn|) ≤ 2    (15)

(Note that if |yn| = 0 then γα = 0 and our algorithm will always return an optimal solution, since β_{A,α} ≤ 0.) Now, since β_{A,α} ≤ 0 for any A and α ∈ PZ, it suffices that we study the quantity max_α β_{A,α}: if max_α β_{A,α} < −2, then β_{A,α} < −2 for any α ∈ PZ. It is, however, very hard to understand theoretically the behaviour of the random variable max_α β_{A,α}, even under a simplistic uniform i.i.d. assumption on the entries of A. This is because the domain of α, PZ, is itself a random quantity that depends on the particular A chosen. This makes computing even the expected value of max_α β_{A,α} an intractable task, let alone obtaining concentration of measure results that could give us upper bounds on the probability of condition (14) holding under the assumed distribution on A. However, for a given A we can actually compute max_α β_{A,α} efficiently. This can be done with Algorithm 3. The algorithm effectively computes the gap between the score of the optimal solution y*n_kmax and that of the highest scoring solution if one sets to 1 at least one of the zero entries in y*n_kmax. It does so by solving graph-cuts constraining the solution y to include the 1s present in y*n_kmax but additionally fixing one of the zero entries of y*n_kmax to 1 (lines 6-7). This is done for every possible zero entry of y*n_kmax, and the maximum score is recorded (lines 5-11). The gap between this and the score of the optimal solution y*n_kmax is then returned (line 12). This involves V − kmax calls to graph-cuts, and therefore the total computational complexity is O(V⁴). Once we compute max_α β_{A,α}, we simply test whether max_α β_{A,α} + Δ(kmax → V) |yn| > 0 holds (we use Δ(kmax → V) |yn| rather than 2 as an upper bound for γα because, as seen from (15), it is the tightest upper bound which still does not depend on α and can therefore be computed). We have the following theorem (proven in Appendix A).

Theorem 2. Upon completion of Algorithm 3, if max_α β_{A,α} + Δ(kmax → V) |yn| ≤ 0, then y*n_kmax is an optimal solution of argmax_y yᵀ An(y) y.

5 Experimental Results

To evaluate our multi-label learning method we applied it to real-world datasets and compared it to state-of-the-art methods.

Datasets. For the sake of reproducibility we focused on publicly available datasets, and to ensure that the label dependencies have a reasonable impact on the results we restricted the experiments to datasets with a sufficiently large average number of labels per instance. We therefore chose two multi-label datasets from mulan:² yeast and enron. Table 1 describes them in more detail.

Figure 1: F-Score results on enron (left) and yeast (right), for different amounts of unary features. The horizontal axis denotes the proportion of the features used in training.

Experimental setting. The datasets used have very informative unary features, so to better visualise the contribution of the label dependencies to the model we trained using varying amounts (1%, 10% and 100%) of the original unary features. We compared our proposed method to RML [11] without reversion³, which is essentially our model without the quadratic term, and to other state-of-the-art methods for which source code is publicly available: BR [15], RAkEL [16] and MLKNN [20].

Model selection. Our model has two parameters: λ, the trade-off between data fitting and good generalisation, and c, a scalar that multiplies C to control the trade-off between the linear and the quadratic terms. For each experiment we selected them with 5-fold cross-validation on the training data. We also control the sparsity of C by setting Cij to zero for all except the most frequent pairs; this way we can reduce the dimensionality of θ², avoiding an excessive number of parameters for datasets with large values of V. In our experiments we used 50% of the pairs with yeast and 5% with enron (45 and 68 pairs, respectively). We experimented with other settings, but the results were very similar.

RML's only parameter, λ, was selected with 5-fold cross-validation. MLKNN's two parameters k (number of neighbors) and s (strength of the uniform prior) were kept fixed at 10 and 1.0, respectively, as was done in [20]. RAkEL's m (number of models) and t (threshold) were set to the library's defaults (respectively 2N and 0.5), and k (size of the labelset) was set to V/2 as suggested by [4]. For BR we kept the library's defaults.

Implementation. Our implementation is in C++, based on the source code of RML [11], which uses the Bundle Methods for Risk Minimization (BMRM) of [18]. The max-flow computations needed for graph-cuts are done with the library of [21]. The modifications necessary to enforce positivity of θ² in BMRM are described in Appendix C. Source code is available⁴ under the Mozilla Public License. Details of training time for our implementation are available in Appendix B.

Results: F-Score. In Figure 1 we plot the F-Score for varying-sized subsets of the unary features, for both enron (left) and yeast (right). The goal is to assess the benefits of explicitly modelling the pairwise label interactions, particularly when the unary information is deteriorated. As can be seen in Figure 1, when all features are available our model behaves similarly to RML. In this setting the unary features are very informative and the pairwise interactions are not helpful.
As we reduce the number of available unary features (from right to left in the plots), the importance of the pairwise interactions increases, and our model demonstrates improvement over RML.

²http://mulan.sourceforge.net/datasets.html
³RML deals mainly with the reverse problem of predicting instances given labels; however, it can be applied in the forward direction as well, as described in [11].
⁴http://users.cecs.anu.edu.au/~jpetterson/

Figure 2: Empirical analysis of Algorithms 2 and 3 during training with the yeast dataset. Left: frequency with which Algorithm 2 is optimal at each iteration (blue) and frequency with which Algorithm 3 reports an optimal solution has been found by Algorithm 2 (green). Right: difference, at each iteration, between the objective computed using the results from Algorithm 2 and exhaustive enumeration.

Results: Correctness. To evaluate how well our constraint generation algorithm performs in practice we compared its results against those of exhaustive search, which is exact but only feasible for a dataset with a small number of labels, such as yeast. We also assessed the strength of the test proposed in Algorithm 3. In Figure 2-left we plot, for the first 100 iterations of the learning algorithm, the frequency with which Algorithm 2 returns the exact solution (blue line) as well as the frequency with which the test given in Algorithm 3 guarantees the solution is exact (green line). We can see that overall, in more than 50% of its executions, Algorithm 2 produces an optimal solution.
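The comparison against exhaustive search can be reproduced in miniature. The sketch below is our own illustration, not the paper's code: the function names are invented, and exhaustive enumeration stands in for both graph-cuts and the exact solver, so it is only feasible for small V. It checks the partial-optimality property of Theorem 1 for a candidate labelling.

```python
import itertools
import numpy as np

def f_loss(y, y_true):
    """F-score loss Delta(y, yn) = 1 - 2|y . yn| / (|y| + |yn|)."""
    total = int(np.sum(y)) + int(np.sum(y_true))
    return 0.0 if total == 0 else 1.0 - 2.0 * int(np.sum(y * y_true)) / total

def exact_maximisers(A, y_true, tol=1e-9):
    """All maximisers of Delta(y, yn) + y^T A y, by exhaustive search."""
    scored = []
    for bits in itertools.product([0, 1], repeat=len(y_true)):
        y = np.array(bits)
        scored.append((f_loss(y, y_true) + float(y @ A @ y), y))
    best = max(s for s, _ in scored)
    return [y for s, y in scored if s >= best - tol]

def is_partially_optimal(y_cand, A, y_true):
    """True iff y_cand (element-wise) y* equals y_cand for some exact
    maximiser y*, i.e. every label the candidate predicts as present
    appears in an optimal solution (the paper's partial optimality)."""
    y_cand = np.asarray(y_cand)
    return any(np.array_equal(y_cand * y_star, y_cand)
               for y_star in exact_maximisers(A, y_true))
```

Running such a check over the training set at each iteration gives the kind of frequency curve shown in Figure 2-left.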
Our test effectively offers a lower bound which, as expected, is not tight; however, it is informative in the sense that its variations reflect legitimate variations in the real quantity of interest (as can be seen from the obvious correlation between the two curves).

For the learning algorithm, however, what we are interested in is the objective oi and the gradient gi of line 7 of Algorithm 1, and both depend only on the compound result of N executions of Algorithm 2 at each iteration of the learning algorithm. This is illustrated in Figure 2-right, where we plot, for each iteration, the normalised difference between the objective computed with the results from Algorithm 2 and the one computed with the results of an exact exhaustive search⁵. We can see that the difference is quite small: below 4% after the initial iterations.

6 Conclusion

We presented a method for learning multi-label classifiers which explicitly models label dependencies in a submodular fashion. As an estimator we use structured support vector machines solved with constraint generation. Our key contribution is an algorithm for constraint generation which is proven to be partially optimal in the sense that all labels it predicts are included in some optimal solution. We also describe an efficiently computable test that, if positive, guarantees the solution found is optimal, and can be used to generate empirical lower bounds on the probability of finding an optimal solution.
We present empir-\nical results that corroborate the fact that the algorithm is very accurate, and we illustrate\nthe gains obtained in comparison to other popular algorithms, particularly a previous al-\ngorithm which can be seen as the particular case of ours when there are no explicit label\ninteractions being modelled.\n\nAcknowledgements\nWe thank Choon Hui Teo for his help on making the necessary modi\ufb01cations to BMRM.\nNICTA is funded by the Australian Government as represented by the Department of\nBroadband, Communications and the Digital Economy and the Australian Research Council\nthrough the ICT Centre of Excellence program.\n\n5We repeated this experiment with several sets of parameters with similar results.\n\n8\n\n\fReferences\n\n[1] K. Dembczynski, W. Cheng, and E. H\u00a8ullermeier, \u201cBayes Optimal Multilabel Classi\ufb01-\n\ncation via Probabilistic Classi\ufb01er Chains,\u201d in ICML, 2010.\n\n[2] X. Zhang, T. Graepel, and R. Herbrich, \u201cBayesian Online Learning for Multi-label and\n\nMulti-variate Performance Measures,\u201d in AISTATS, 2010.\n\n[3] P. Rai and H. Daume, \u201cMulti-Label Prediction via Sparse In\ufb01nite CCA,\u201d in NIPS,\n\n2009.\n\n[4] J. Read, B. Pfahringer, G. Holmes, and E. Frank, \u201cClassi\ufb01er chains for multi-label\n\nclassi\ufb01cation.,\u201d in ECML/PKDD, 2009.\n\n[5] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor, \u201cKernel-based learning of\nhierarchical multilabel classi\ufb01cation models,\u201d JMLR, vol. 7, pp. 1601\u20131626, December\n2006.\n\n[6] Z. Barutcuoglu, R. E. Schapire, and O. G. Troyanskaya, \u201cHierarchical multi-label pre-\n\ndiction of gene function,\u201d Bioinformatics, vol. 22, pp. 830\u2013836, April 2006.\n\n[7] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, \u201cTagProp: Discriminative\nMetric Learning in Nearest Neighbor Models for Image Auto-Annotation,\u201d in ICCV,\n2009.\n\n[8] M. 
Jansche, \u201cMaximum expected F-measure training of logistic regression models,\u201d\n\nHLT, 2005.\n\n[9] T. Mensink, J. Verbeek, and G. Csurka, \u201cLearning structured prediction models for\n\ninteractive image labeling,\u201d in CVPR, 2011.\n\n[10] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, \u201cLarge margin methods for\nstructured and interdependent output variables,\u201d JMLR, vol. 6, pp. 1453\u20131484, 2005.\n\n[11] J. Petterson and T. Caetano, \u201cReverse multi-label learning,\u201d in NIPS, 2010.\n[12] W. Bi and J. Kwok, \u201cMulti-Label Classi\ufb01cation on Tree- and DAG-Structured Hierar-\n\nchies,\u201d in ICML, 2011.\n\n[13] N. Ghamrawi and A. Mccallum, \u201cCollective Multi-Label Classi\ufb01cation,\u201d 2005.\n[14] B. Hariharan, S. V. N. Vishwanathan, and M. Varma, \u201cLarge Scale Max-Margin Multi-\nLabel Classi\ufb01cation with Prior Knowledge about Densely Correlated Labels,\u201d in ICML,\n2010.\n\n[15] G. Tsoumakas, I. Katakis, and I. P. Vlahavas, Mining Multi-label Data. Springer, 2009.\n[16] G. Tsoumakas and I. P. Vlahavas, \u201cRandom k-labelsets: An ensemble method for\n\nmultilabel classi\ufb01cation,\u201d in ECML, 2007.\n\n[17] S. Virtanen, A. Klami, and S. Kaski, \u201cBayesian CCA via Group Sparsity,\u201d in ICML,\n\n2011.\n\n[18] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le, \u201cBundle methods for\n\nregularized risk minimization,\u201d JMLR, vol. 11, pp. 311\u2013365, 2010.\n\n[19] Z. Svitkina and L. Fleischer, \u201cSubmodular approximation: Sampling-based algorithms\n\nand lower bounds,\u201d in FOCS, 2008.\n\n[20] M.-L. Zhang and Z.-H. Zhou, \u201cML-KNN: A lazy learning approach to multi-label learn-\n\ning,\u201d Pattern Recognition, vol. 40, pp. 2038\u20132048, July 2007.\n\n[21] Y. Boykov and V. Kolmogorov, \u201cAn experimental comparison of min-cut/max-\ufb02ow\n\nalgorithms for energy minimization in vision,\u201d IEEE Trans. 
PAMI, 2004.\n\n9\n\n\f", "award": [], "sourceid": 860, "authors": [{"given_name": "James", "family_name": "Petterson", "institution": null}, {"given_name": "Tib\u00e9rio", "family_name": "Caetano", "institution": null}]}