{"title": "Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 2690, "page_last": 2698, "abstract": "We present a novel regularization-based Multitask Learning (MTL) formulation for Structured Output (SO) prediction for the case of hierarchical task relations. Structured output learning often results in difficult inference problems and requires large amounts of training data to obtain accurate models. We propose to use MTL to exploit information available for related structured output learning tasks by means of hierarchical regularization. Due to the combination of example sets, the cost of training models for structured output prediction can easily become infeasible for real-world applications. We thus propose an efficient algorithm based on bundle methods to solve the optimization problems resulting from MTL structured output learning. We demonstrate the performance of our approach on gene finding problems from the application domain of computational biology. We show that 1) our proposed solver achieves much faster convergence than previous methods and 2) that the Hierarchical SO-MTL approach clearly outperforms considered non-MTL methods.", "full_text": "

Hierarchical Multitask Structured Output Learning for Large-Scale Sequence Segmentation

Nico Görnitz [1] (Technical University Berlin, Franklinstr. 28/29, 10587 Berlin, Germany; Nico.Goernitz@tu-berlin.de)
Georg Zeller (European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 Heidelberg, Germany; Georg.Zeller@gmail.com)
Sören Sonnenburg [2] (TomTom, An den Treptowers 1, 12435 Berlin, Germany; Soeren.Sonnenburg@tomtom.com)
Christian Widmer [1] (FML of the Max Planck Society, Spemannstr. 39, 72070 Tübingen, Germany; Christian.Widmer@tue.mpg.de)
André Kahles (FML of the Max Planck Society, Spemannstr. 39, 72070 Tübingen, Germany; Andre.Kahles@tue.mpg.de)
Gunnar Rätsch (FML of the Max Planck Society, Spemannstr. 39, 72070 Tübingen, Germany; Gunnar.Raetsch@tue.mpg.de)

Abstract

We present a novel regularization-based Multitask Learning (MTL) formulation for Structured Output (SO) prediction for the case of hierarchical task relations. Structured output prediction often leads to difficult inference problems and hence requires large amounts of training data to obtain accurate models. We propose to use MTL to exploit additional information from related learning tasks by means of hierarchical regularization. Training SO models on the combined set of examples from multiple tasks can easily become infeasible for real-world applications. To be able to solve the optimization problems underlying multitask structured output learning, we propose an efficient algorithm based on bundle methods. We demonstrate the performance of our approach in applications from the domain of computational biology addressing the key problem of gene finding. We show that 1) our proposed solver achieves much faster convergence than previous methods and 2) that the Hierarchical SO-MTL approach outperforms considered non-MTL methods.

1 Introduction

In Machine Learning, model quality is most often limited by the lack of sufficient training data. When data from different, but related, tasks is available, it is possible to exploit it to boost the performance of each task by transferring relevant information. Multitask learning (MTL) considers the problem of inferring models for several tasks simultaneously, while imposing regularity criteria or shared representations in order to allow learning across tasks. This has been an active research focus and various methods (e.g., [5, 8]) have been explored, providing empirical findings [16] and theoretical foundations [3, 4].
Recently, the relationships between tasks have also been studied (e.g., [1]), assuming a cluster relationship [11] or a hierarchy [6, 23, 13] between tasks. Our proposed method follows this line of research in that it exploits externally provided hierarchical task relations. The generality of regularization-based MTL approaches makes it possible to extend them beyond the simple cases of classification or regression to Structured Output (SO) learning problems [14, 2, 21, 10]. Here, the output is not in the form of a discrete class label or a real-valued number, but a structured entity such as a label sequence, a tree, or a graph. One of the main contributions of this paper is to explicitly extend a regularization-based MTL formulation to the SVM-struct formulation for SO prediction [2, 21]. SO learning methods can be computationally demanding, and combining information from several tasks leads to even larger problems, which renders many interesting applications infeasible. Hence, our second main contribution is to provide an efficient solver for SO problems which is based on bundle methods [18, 19, 7]. It achieves much faster convergence and is therefore an essential tool to cope with the demands of the MTL setting.

[1] These authors contributed equally.
[2] This work was done while SS was at Technical University Berlin.

SO learning has been successfully applied in the analysis of images, natural language, and sequences. The latter is of particular interest in computational biology for the analysis of DNA, RNA or protein sequences. This field moreover constitutes an excellent application area for MTL [12, 22]. In computational biology, one often uses supervised learning methods to model biological processes in order to predict their outcomes and ultimately understand them better.
Due to the complexity of many biological mechanisms, rich computational models have to be developed, which in turn require a reasonable amount of training data. However, especially in the biomedical domain, obtaining labeled training examples through experiments can be costly. Thus, combining information from several related tasks can be a cost-effective approach to best exploit the available label data. When transferring label information across tasks, it often makes sense to assume hierarchical task relations, in particular in computational biology, where evolutionary processes often impose a task hierarchy [22]. For instance, we might be interested in modeling a common biological mechanism in several organisms such that each task corresponds to one organism. In this setting, we expect that the longer the common evolutionary history between two organisms, the more beneficial it is to share information between the corresponding tasks. In this work, we chose a challenging problem from genome biology to demonstrate that our approach is practically feasible in terms of speed and accuracy. In ab initio gene finding [17], the task is to build an accurate model of a gene and subsequently use it to predict the gene content of newly sequenced genomes or to refine existing annotations. Despite many commonalities between sequence features of genes across organisms, sequence differences have made it very difficult to build universal gene finders that achieve high accuracy in cross-organism prediction. This problem is hence ideally suited for the application of the proposed SO-MTL approach.

2 Methods

Regularization-based supervised learning methods, such as the SVM or Logistic Regression, play a central role in many applications. In its most general form, such a method consists of a loss function L that captures the error with respect to the training data S = \{(x_1, y_1), \ldots, (x_n, y_n)\} and a regularizer R that penalizes model complexity:

    J(w) = \sum_{i=1}^{n} L(w, x_i, y_i) + R(w).

In the case of Multitask Learning (MTL), one is interested in obtaining several models w_1, \ldots, w_T based on T associated sets of examples S_t = \{(x_1, y_1), \ldots, (x_{n_t}, y_{n_t})\}, t = 1, \ldots, T. To couple individual tasks, an additional regularization term R_{MTL} is introduced that penalizes the disagreement between the individual models (e.g., [1, 8]):

    J(w_1, \ldots, w_T) = \sum_{t=1}^{T} \left( \sum_{i=1}^{n_t} L(w_t, x_i, y_i) + R(w_t) \right) + R_{MTL}(w_1, \ldots, w_T).

Special cases include T = 2 and R_{MTL}(w_1, w_2) = \gamma \|w_1 - w_2\| (e.g., [8, 16]), where \gamma is a hyper-parameter controlling the strength of coupling of the solutions for both tasks. For more than two tasks, the number of coupling terms and hyper-parameters can grow quadratically, leading to a difficult model-selection problem.

2.1 Hierarchical Multitask Learning (HMTL)

We consider the case where tasks correspond to leaves of a tree and are related by its inner nodes. In [22], the case of taxonomically organized two-class classification tasks was investigated, where each task corresponds to a species (taxon). The idea was to mimic biological evolution, which is assumed to generate more specialized molecular processes with each speciation event from root to leaf. This is implemented by training on examples available for nodes in the current subtree (i.e., the tasks below the current node), while similarity to the parent classifier is induced through regularization. Thus, for each node n, one solves the following optimization problem,

    (w^*_n, b^*_n) = \argmin_{w,b} \left\{ \frac{1}{2} \left( (1 - \gamma) \|w\|^2 + \gamma \|w - w^*_p\|^2 \right) + C \sum_{(x,y) \in S} \ell(\langle x, w \rangle + b, y) \right\},    (1)

where p is the parent node of n (with the special case of w^*_p = 0 for the root node) and \ell is an appropriate loss function (e.g., the hinge loss). The hyper-parameter \gamma \in [0, 1] determines the contribution of regularization towards the origin vs. the parent node's parameters (i.e., the strength of coupling between the node and its parent). Expanding the squared norms and omitting constant terms, the above problem can be equivalently rewritten as:

    (w^*_n, b^*_n) = \argmin_{w,b} \left\{ \frac{1}{2} \|w\|^2 - \gamma \langle w, w^*_p \rangle + C \sum_{(x,y) \in S} \ell(\langle x, w \rangle + b, y) \right\}.    (2)

For \gamma = 0, the tasks completely decouple and can be learnt independently. The parameters for the root node correspond to the globally best model. We will refer to these two cases as baseline methods for comparisons in the experimental section.

2.2 Structured Output Learning and Extensions for HMTL

In contrast to binary classification, elements from the output space \Upsilon (e.g., sequences, trees, or graphs) of structured output problems have an inherent structure, which makes more sophisticated, problem-specific loss functions desirable. The loss between the true label y \in \Upsilon and the predicted label \hat{y} \in \Upsilon is measured by a loss function \Delta : \Upsilon \times \Upsilon \to \mathbb{R}_+.
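To make the per-node training scheme of Eqs. (1)-(2) concrete, the following sketch (our illustration, not the paper's implementation) instantiates it for a plain linear model with squared loss in place of the structured hinge loss, so each node has a closed-form solution; the names `train_node`, `train_hierarchy`, and the dict-based tree encoding are illustrative assumptions.

```python
import numpy as np

def train_node(X, y, w_parent, gamma, C):
    """Ridge analogue of Eq. (1):
    minimize 0.5*((1-gamma)*||w||^2 + gamma*||w - w_parent||^2) + C*||Xw - y||^2.
    Setting the gradient to zero gives (I + 2C X^T X) w = gamma*w_parent + 2C X^T y,
    which mirrors the equivalent form (2): only the linear pull towards the
    parent (gamma * w_parent) survives after expanding the squared norms."""
    d = X.shape[1]
    A = np.eye(d) + 2.0 * C * X.T @ X
    rhs = gamma * w_parent + 2.0 * C * X.T @ y
    return np.linalg.solve(A, rhs)

def train_hierarchy(node, data, gamma, C, w_parent=None):
    """Train top-down: each node fits on the examples of its subtree while
    being regularized towards its parent's solution (w_parent = 0 at the root)."""
    X, y = data[node["name"]]
    if w_parent is None:
        w_parent = np.zeros(X.shape[1])
    w = train_node(X, y, w_parent, gamma, C)
    models = {node["name"]: w}
    for child in node.get("children", []):
        models.update(train_hierarchy(child, data, gamma, C, w_parent=w))
    return models
```

For gamma = 0 the parent term vanishes and the nodes decouple, matching the decoupled baseline discussed above.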
A widely used approach to predict \hat{y} \in \Upsilon is the use of a linearly parametrized model given an input vector x \in X and a joint feature map \Psi : X \times \Upsilon \to H that captures the dependencies between input and output (e.g., [21]):

    \hat{y}_w(x) = \argmax_{\bar{y} \in \Upsilon} \langle w, \Psi(x, \bar{y}) \rangle.

The most common approaches to estimate the model parameters w are based on structured output SVMs (e.g., [2, 21]) and conditional random fields (e.g., [14]; see also [10]). Here we follow the approach taken in [21, 15], where estimating the parameter vector w amounts to solving the following optimization problem

    \min_{w \in H} \; R(w) + C \sum_{i=1}^{n} \ell\left( \max_{\bar{y} \in \Upsilon} \left( \langle w, \Psi(x_i, \bar{y}) \rangle + \Delta(y_i, \bar{y}) \right) - \langle w, \Psi(x_i, y_i) \rangle \right),    (3)

where R(w) is a regularizer and \ell is a loss function. For \ell(a) = \max(0, a) and R(w) = \frac{1}{2} \|w\|_2^2 we obtain the structured output support vector machine [21, 2] with margin rescaling and hinge loss.

It turns out that we can combine the structured output formulation with hierarchical multitask learning in a straightforward way. We replace the regularizer R(w) in (3) by a \gamma-parametrized convex combination of the multitask regularizer \frac{1}{2} \|w - w_p\|_2^2 with the original term \frac{1}{2} \|w\|_2^2. Omitting constant terms, as in (2), we arrive at R_{p,\gamma}(w) = \frac{1}{2} \|w\|_2^2 - \gamma \langle w, w_p \rangle. Thus we can apply the described hierarchical multitask learning approach and solve for every node the following optimization problem:

    \min_{w \in H} \; R_{p,\gamma}(w) + C \sum_{i=1}^{n} \ell\left( \max_{\bar{y} \in \Upsilon} \left( \langle w, \Psi(x_i, \bar{y}) \rangle + \Delta(y_i, \bar{y}) \right) - \langle w, \Psi(x_i, y_i) \rangle \right).    (4)

A major difficulty remains: solving the resulting optimization problems, which now can become considerably larger than in the single-task case.

2.3 A Bundle Method for Efficient Optimization

A common approach to obtain a solution to (3) is to use so-called cutting-plane or column-generation methods. Here one considers growing subsets of all possible structures and solves restricted optimization problems. An algorithm implementing a variant of this strategy based on primal optimization is given in the appendix (similar to [21]). Cutting-plane and column-generation techniques often converge slowly. Moreover, the size of the restricted optimization problems grows steadily, and solving them becomes more expensive in each iteration. Simple gradient descent or second-order methods cannot be directly applied as alternatives, because (4) is continuous but non-smooth. Our approach is instead based on bundle methods for regularized risk minimization as proposed in [18, 19] and [7]. In the case of SVMs, this further relates to the OCAS method introduced in [9]. In order to achieve fast convergence, we use a variant of these methods adapted to structured output learning that is suitable for hierarchical multitask learning.

We consider the objective function J(w) = R_{p,\gamma}(w) + L(w), where

    L(w) := C \sum_{i=1}^{n} \ell\left( \max_{\bar{y} \in \Upsilon} \left\{ \langle w, \Psi(x_i, \bar{y}) \rangle + \Delta(y_i, \bar{y}) \right\} - \langle w, \Psi(x_i, y_i) \rangle \right)

and R_{p,\gamma}(w) is as defined in Section 2.2. Direct optimization of J is very expensive, as computing L involves computing the maximum over the output space.
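For chain-structured outputs such as label sequences, this inner maximization (loss-augmented decoding) is tractable by dynamic programming. The sketch below is our illustration, not the paper's gene-finding model: it assumes a first-order chain with per-position emission scores, label-transition scores, and a Hamming loss \Delta, which decomposes over positions.

```python
import numpy as np

def loss_augmented_viterbi(emissions, transitions, y_true):
    """argmax_y  <w, Psi(x, y)> + Hamming(y_true, y)  for a label chain.
    emissions:   (T, K) per-position label scores (the emission part of <w, Psi>)
    transitions: (K, K) label-transition scores (the transition part)
    The Hamming loss decomposes per position, so it simply adds 1 to the
    score of every wrong label before running standard Viterbi."""
    T, K = emissions.shape
    aug = emissions + 1.0
    aug[np.arange(T), y_true] -= 1.0      # the correct label gets no loss bonus
    dp = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = aug[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + transitions   # [from, to]
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + aug[t]
    y = np.zeros(T, dtype=int)
    y[-1] = dp[-1].argmax()
    for t in range(T - 1, 0, -1):                   # backtrack best path
        y[t - 1] = back[t, y[t]]
    return y, dp[-1].max()
```

The same routine with the loss augmentation removed computes the prediction argmax of Section 2.2.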
Hence, we propose to optimize an estimate of the empirical loss, \hat{L}(w), which can be computed efficiently. We define the estimated empirical loss \hat{L}(w) as

    \hat{L}(w) := C \sum_{i=1}^{n} \ell\left( \max_{(\Psi, \Delta) \in \Gamma_i} \left\{ \langle w, \Psi \rangle + \Delta \right\} - \langle w, \Psi(x_i, y_i) \rangle \right).

Accordingly, we define the estimated objective function as \hat{J}(w) = R_{p,\gamma}(w) + \hat{L}(w). It is easy to verify that J(w) \geq \hat{J}(w). Here, \Gamma_i is a set of pairs (\Psi(x_i, y), \Delta(y_i, y)) defined by a suitably chosen, growing subset of \Upsilon, such that \hat{L}(w) \to L(w) (cf. Algorithm 1).

In general, bundle methods are extensions of cutting-plane methods that use a prox-function to stabilize the solution of the approximated function. In the framework of regularized risk minimization, a natural prox-function is given by the regularizer. We apply this approach to the objective \hat{J}(w) and solve

    \min_{w} \; R_{p,\gamma}(w) + \max_{i \in I} \left\{ \langle a_i, w \rangle + b_i \right\},    (5)

where the set of cutting planes (a_i, b_i) lower bounds \hat{L}. As proposed in [7, 19], we use a set I of limited size. Moreover, we calculate an aggregation cutting plane (\bar{a}, \bar{b}) that lower bounds the estimated empirical loss \hat{L}.
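The cutting-plane mechanics behind (5) can be illustrated on a one-dimensional toy problem. The sketch below is our simplification (no limited-size bundle, aggregation, or linesearch): each iteration adds the plane a = L'(w), b = L(w) - a*w, which lower bounds the convex loss L, and re-minimizes the regularized piecewise-linear model until the model value matches the true objective.

```python
import numpy as np

def solve_model(planes, lam):
    """Exactly minimize lam/2*w^2 + max_i(a_i*w + b_i) in 1-D by checking
    each piece's unconstrained minimizer and all pairwise kink points."""
    cands = [-a / lam for a, b in planes]
    for i, (ai, bi) in enumerate(planes):
        for aj, bj in planes[i + 1:]:
            if ai != aj:
                cands.append((bj - bi) / (ai - aj))
    f = lambda w: 0.5 * lam * w ** 2 + max(a * w + b for a, b in planes)
    w = min(cands, key=f)
    return w, f(w)

def bundle_1d(L, subgrad, lam=1.0, w0=5.0, tol=1e-9, max_iter=50):
    """Cutting-plane loop for min_w lam/2*w^2 + L(w), cf. Eq. (5):
    the plane (a, b) with a in dL(w), b = L(w) - a*w never overestimates L,
    so the model minimum is a lower bound; stop when the bound is tight."""
    w, planes = w0, []
    for _ in range(max_iter):
        a = subgrad(w)
        planes.append((a, L(w) - a * w))
        w, model_val = solve_model(planes, lam)
        true_val = 0.5 * lam * w ** 2 + L(w)
        if true_val - model_val <= tol:   # model agrees with objective: optimal
            break
    return w
```

On L(w) = |w - 3| with lam = 1 the minimizer is w = 1, which the loop reaches after a handful of planes; the prox term lam/2*w^2 is what keeps each intermediate model bounded below.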
To be able to solve the primal optimization problem in (5) in the dual space as proposed by [7, 19], we adopt an elegant strategy described in [7] to obtain the aggregated cutting plane (\bar{a}', \bar{b}') using the dual solution \alpha of (5):

    \bar{a}' = \sum_{i \in I} \alpha_i a_i  \quad \text{and} \quad  \bar{b}' = \sum_{i \in I} \alpha_i b_i.    (6)

The following two formulations reach the same minimum when optimized with respect to w:

    \min_{w \in H} \left\{ R_{p,\gamma}(w) + \max_{i \in I} \left\{ \langle a_i, w \rangle + b_i \right\} \right\} = \min_{w \in H} \left\{ R_{p,\gamma}(w) + \langle \bar{a}', w \rangle + \bar{b}' \right\}.

This new aggregated plane can be used as an additional cutting plane in the next iteration step. We therefore have a monotonically increasing lower bound on the estimated empirical loss and can remove previously generated cutting planes without compromising convergence (see [7] for details). The algorithm is able to handle any (non-)smooth convex loss function \ell, since only the subgradient needs to be computed. This can be done efficiently for the hinge loss, squared hinge loss, Huber loss, and logistic loss.

The resulting optimization algorithm is outlined in Algorithm 1. Several improvements are possible: for instance, one can bypass updating the empirical risk estimates in line 6 when L(w^{(k)}) - \hat{L}(w^{(k)}) \leq \epsilon. Finally, while Algorithm 1 is formulated in primal space, it is easy to reformulate it in dual variables, making it independent of the dimensionality of w \in H.

2.4 Taxonomically Constrained Model Selection

Model selection for multitask learning is particularly difficult, as it requires hyper-parameter selection for several different, but related, tasks in a dependent manner.
For the described approach, each

Algorithm 1 Bundle Method for Structured Output Learning
 1: S \geq 1: maximal size of the bundle set
 2: \theta > 0: linesearch trade-off (cf. [9] for details)
 3: w^{(1)} = w_p
 4: k = 1 and \bar{a} = 0, \bar{b} = 0, \Gamma_i = \emptyset \; \forall i
 5: repeat
 6:   for i = 1, \ldots, n do
 7:     y^* = \argmax_{y \in \Upsilon} \{ \langle w^{(k)}, \Psi(x_i, y) \rangle + \Delta(y_i, y) \}
 8:     if \ell\left( \max_{y \in \Upsilon} \{ \langle w^{(k)}, \Psi(x_i, y) \rangle + \Delta(y_i, y) \} \right) > \ell\left( \max_{(\Psi, \Delta) \in \Gamma_i} \{ \langle w^{(k)}, \Psi \rangle + \Delta \} \right) then
 9:       \Gamma_i = \Gamma_i \cup (\Psi(x_i, y^*), \Delta(y_i, y^*))
10:     end if
11:   end for
12:   Compute a_k \in \partial_w \hat{L}(w^{(k)})
13:   Compute b_k = \hat{L}(w^{(k)}) - \langle w^{(k)}, a_k \rangle
14:   w^* = \argmin_{w \in H} \left\{ R_{p,\gamma}(w) + \max\left\{ \langle \bar{a}, w \rangle + \bar{b}, \max_{(k-S)_+