{"title": "Online Learning via Global Feedback for Phrase Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 233, "page_last": 240, "abstract": "", "full_text": "Online Learning via Global Feedback for Phrase Recognition\n\nXavier Carreras\n\nLlu\u00eds M\u00e0rquez\n\nTALP Research Center, LSI Department\nTechnical University of Catalonia (UPC)\nCampus Nord UPC, E-08034 Barcelona\n{carreras,lluism}@lsi.upc.es\n\nAbstract\n\nThis work presents an architecture based on perceptrons to recognize\nphrase structures, and an online learning algorithm to train the perceptrons\ntogether and dependently. The recognition strategy applies learning\nin two layers: a filtering layer, which reduces the search space by identifying\nplausible phrase candidates, and a ranking layer, which recursively\nbuilds the optimal phrase structure. We provide a recognition-based feedback\nrule which reflects to each local function its committed errors from\na global point of view, and allows them to be trained together online as perceptrons.\nExperimentation on a syntactic parsing problem, the recognition\nof clause hierarchies, improves state-of-the-art results and evinces the\nadvantages of our global training method over optimizing each function\nlocally and independently.\n\n1 Introduction\n\nOver the past few years, many machine learning methods have been successfully applied\nto Natural Language tasks in which phrases of some type have to be recognized. Generally,\ngiven an input sentence (as a sequence of words), the task is to predict a bracketing\nfor the sentence representing a structure of phrases, either sequential or hierarchical. 
For\ninstance, syntactic analysis of Natural Language provides several problems of this type,\nsuch as partial parsing tasks [1, 2], or even full parsing [3].\n\nThe general approach consists of decomposing the global phrase recognition problem into\na number of local learnable subproblems, and inferring the global solution from the outcomes\nof the local subproblems. For chunking problems, in which phrases are sequentially\nstructured, the approach is typically to perform tagging. In this case, local subproblems\ninclude learning whether a word opens, closes, or is inside a phrase of some type (noun\nphrase, verb phrase, ...), and the inference process consists of sequentially computing\nthe optimal tag sequence which encodes the phrases, by means of dynamic programming\n[1, 4, 5]. When hierarchical structure has to be recognized, additional local decisions are\nrequired to determine the embedding of phrases, resulting in a more complex inference\nprocess which recursively builds the global solution [2, 3, 6, 7]. In general, a learning\nsystem for these tasks makes use of several learned functions which interact in some way\nto determine the structure.\n\nA usual methodology for solving the local subproblems is to use a discriminative learning\nalgorithm to learn a classifier for each local decision [1, 2]. Each individual classifier is\ntrained separately from the others, maximizing some local measure such as the accuracy of\nthe local decision. However, when performing the phrase recognition task, the classifiers\nare used together and dependently, in the sense that one classifier's predictions may affect\nthe prediction of another. 
Indeed, the global performance of a system is measured in terms\nof precision and recall of the recognized phrases, which, although related, are not the local\nclassification accuracy measure for which the local classifiers are usually trained.\n\nIn this direction, recent works in the area provide alternative strategies in which the learning\nprocess is driven from the global level. The general idea consists of moving the learning\nstrategy from the binary classification setting to a general ranking context into which the\nglobal problem can be cast. Crammer and Singer [8] present a label-ranking algorithm,\nin which several perceptrons receive feedback from the ranking they produce over a training\ninstance. Har-Peled et al. [9] study a general learning framework in which the constraints\nbetween a number of linear functions and an output prediction make it possible to effectively learn a\ndesired label-ranking function. For structured outputs, and motivating this work, Collins\n[10] introduces a variant of the perceptron for tagging tasks, in which the learning feedback\nis given globally from the output of the Viterbi decoding algorithm.\n\nIn this paper we present a global learning strategy for the general task of recognizing\nphrases in a sentence. We adopt the general phrase recognition strategy of our previous\nwork [6]. Given a sentence, learning is first applied at word level to identify phrase candidates\nof the solution. Then, learning is applied at a higher-order level in which phrase\ncandidates are scored to discriminate among competing ones. The overall strategy infers\nthe global solution by exploring with learning components a number of plausible solutions.\n\nAs a main contribution, we propose a recognition-based feedback rule which allows\nthe decisions in the system to be learned as perceptrons, all in one go. The learning strategy works online\nat sentence level. 
When visiting a sentence, the perceptrons are first used to recognize the\nset of phrases, and then updated according to the correctness of the global solution. As a result,\neach local function is automatically adapted to the recognition strategy. Furthermore,\nfollowing [11] the final model incorporates voted prediction methods for the perceptrons\nand the use of kernel functions. Experimenting on the Clause Identification problem [2] we\nshow the effectiveness of our method, evincing the benefits over local learning strategies\nand improving the best results for the particular task.\n\n2 Phrase Recognition\n\n2.1 Formalization\n\nLet x be a sentence formed by n words xi, with i ranging from 0 to n \u2212 1, belonging\nto the sentence space X. Let K be a predefined set of phrase categories. For instance,\nin syntactic parsing K may include noun phrases, verb phrases, prepositional phrases and\nclauses, among others. A phrase, denoted as (s, e)k, is the sequence of consecutive words\nspanning from word xs to word xe, having s \u2264 e, with category k \u2208 K.\n\nLet ph1 = (s1, e1)k1 and ph2 = (s2, e2)k2 be two different phrases. We say that ph1 and\nph2 overlap iff s1 < s2 \u2264 e1 < e2 or s2 < s1 \u2264 e2 < e1, and we denote it as ph1 \u223c ph2. Also,\nwe say that ph1 is embedded in ph2 iff s2 \u2264 s1 \u2264 e1 \u2264 e2, and we denote it as ph1 \u227a ph2.\n\nLet P be the set of all possible phrases, expressed as P = {(s, e)k | 0 \u2264 s \u2264 e, k \u2208 K}.\nA solution for a phrase recognition problem is a set y of phrases which is coherent with\nrespect to some constraints. We consider two types of constraints: overlapping and embedding.\nFor the problem of recognizing sequentially organized phrases, often referred to\nas chunking, phrases are not allowed to overlap or embed. Thus, the solution space can\nbe formally expressed as Y = {y \u2286 P | \u2200 ph1, ph2 \u2208 y : ph1 \u2241 ph2 \u2227 ph1 \u2280 ph2}. 
More\ngenerally, for the problem of recognizing phrases organized hierarchically, a solution is a\nset of phrases which do not overlap but may be embedded. Formally, the solution space is\nY = {y \u2286 P | \u2200 ph1, ph2 \u2208 y : ph1 \u2241 ph2}.\n\nIn order to evaluate a phrase recognition system we use the standard measures for recognition\ntasks: precision (p), the ratio of recognized phrases that are correct; recall (r),\nthe ratio of correct phrases that are recognized; and their harmonic mean F1 = 2pr / (p + r).\n\n2.2 Recognizing Phrases\n\nThe mechanism to recognize phrases is described here as a function which, given a sentence\nx, identifies the set of phrases y of x: R : X \u2192 Y. We assume two components within this\nfunction, both being learning components of the recognizer. First, we assume a function P\nwhich, given a sentence x, identifies a set of candidate phrases, not necessarily coherent,\nfor the sentence, P(x) \u2286 P. Second, we assume a score function which, given a phrase,\nproduces a real-valued prediction indicating the plausibility of the phrase for the sentence.\n\nThe phrase recognizer is a function which searches a coherent phrase set for a sentence x\naccording to the following optimality criterion:\n\nR(x) = arg max_{y \u2286 P(x), y \u2208 Y} \u03a3_{(s,e)k \u2208 y} score((s, e)k, x, y)    (1)\n\nThat is, among all the coherent subsets of candidate phrases, the optimal solution is defined\nas the one whose phrases maximize the summation of phrase scores.\n\nThe function P is only used to reduce the search space of the R function. Note that the\nR function constructs the optimal phrase set by evaluating scores of phrase candidates,\nand the number of possible phrases, that is, the size of the set P, is quadratic in the length of the sentence.\nThus, straightforwardly considering all phrases in P would result in a\nvery expensive exploration. 
The function P is intended to filter out phrase candidates from\nP by applying decisions at word level. A simple setting for this function is a start-end\nclassification for each phrase type: each word of the sentence is tested as k-start (if it\nis likely to start phrases of type k) and as k-end (if it is likely to end phrases of type k).\nEach k-start word xs with each k-end word xe, having s \u2264 e, forms the phrase candidate\n(s, e)k. Assuming start and end binary classification functions, h^k_S and h^k_E, for each type\nk \u2208 K, the filtering function is expressed as:\n\nP(x) = { (s, e)k \u2208 P | h^k_S(xs) = +1 \u2227 h^k_E(xe) = +1 }\n\nAlternatives to this setting may be to consider a single pair of start-end classifiers, independent\nof phrase types, or to perform a different tagging for identifying phrases, such as the\nwell-known begin-inside classification. In general, each classifier will be applied to each\nword in the sentence, and deciding the best strategy for identifying phrase candidates will\ndepend on the sparseness of phrases in a sentence, the length of phrases and the number of\ncategories.\n\nOnce the phrase candidates are identified, the optimal coherent phrase set is selected according\nto (1). Given the form of this criterion, there is no need to explicitly enumerate each possible\ncoherent phrase set, which would result in an exponential exploration. Instead, by guiding\nthe exploration through the problem constraints and using dynamic programming the optimal\ncoherent phrase set can be found in polynomial time over the sentence length. For\nchunking problems, the solution can be found in quadratic time by performing a Viterbi-style\nexploration from left to right [4]. When embedding of phrases is allowed, a cubic-time\nbottom-up exploration is required [6]. 
As noted above, in either case there will be the additional\ncost of applying a quadratic number of decisions for scoring phrases.\n\nSummarizing, phrase recognition is performed in two layers: the identification\nlayer, which filters out phrase candidates in linear time, and the scoring layer, which selects\nthe optimal phrase structure in quadratic or cubic time.\n\n3 Additive Online Learning via Recognition Feedback\n\nIn this section we describe an online learning strategy for training the learning components\nof the Phrase Recognizer, namely the start-end classifiers in P and the score function. The\nlearning challenge consists in approximating the functions so as to maximize the global F1\nmeasure on the problem, taking into account that the functions interact. In particular, the\nstart-end functions define the actual input space of the score function.\n\nEach function is implemented using a linear separator, hw : R^n \u2192 R, operating in a\nfeature space defined by a feature representation function, \u03c6 : X \u2192 R^n, for some instance\nspace X. The function P consists of two classifiers per phrase type: the start classifier (h^k_S)\nand the end classifier (h^k_E). Thus, the P function is formed by a prediction vector for each\nclassifier, noted as w^k_S or w^k_E, and a unique shared representation function \u03c6w which maps a\nword in context into a feature vector. A prediction is computed as h^k_S(x) = w^k_S \u00b7 \u03c6w(x), and\nsimilarly for h^k_E, and the sign is taken as the binary classification. The score function\ncomputes a real-valued score for a phrase candidate (s, e)k. We implement this function\nwith a prediction vector w^k for each type k \u2208 K, and also a shared representation function\n\u03c6p which maps a phrase into a feature vector. 
The score prediction is then given by the\nexpression: score((s, e)k, x, y) = w^k \u00b7 \u03c6p((s, e)k, x, y).\n\n3.1 The FR-Perceptron Learning Algorithm\n\nWe propose a mistake-driven online learning algorithm for training the parameter vectors\nall together. We give the algorithm the name FR-Perceptron since it is a Perceptron-based\nlearning algorithm that approximates the prediction vectors in P as Filters of words, and the\nscore vectors as Rankers of phrases. The algorithm starts with all vectors initialized to 0,\nand then runs repeatedly in a number of epochs T through all the sentences in the training\nset. Given a sentence, it predicts its optimal phrase solution as specified in (1) using the\ncurrent vectors. As in the traditional Perceptron algorithm, if the predicted phrase set is\nnot perfect the vectors responsible for the incorrect prediction are updated additively. The\nalgorithm is as follows:\n\n\u2022 Input: {(x1, y1), ..., (xm, ym)}, xi are sentences, yi are solutions in Y\n\u2022 Define: W = {w^k_S, w^k_E, w^k | k \u2208 K}\n\u2022 Initialize: \u2200w \u2208 W : w = 0\n\u2022 for t = 1 ... T, for i = 1 ... m:\n    1. \u0177 = R_W(xi)\n    2. recognition_learning_feedback(W, xi, yi, \u0177)\n\u2022 Output: the vectors in W\n\nWe now describe the recognition-based learning feedback. By analyzing the dependencies\nbetween each function and a solution, we derive a feedback rule which naturally fits the\nphrase recognition setting. Let y* be the gold set of phrases for a sentence x, and \u0177 the set\npredicted by the R function. Let goldS(xi, k) and goldE(xi, k) be, respectively, the perfect\nindicator functions for start and end boundaries of phrases of type k. That is, they return 1\nif word xi starts/ends some k-phrase in y* and -1 otherwise. 
We differentiate three kinds\nof phrases in order to give feedback to the functions being learned:\n\n\u2022 Phrases correctly identified: \u2200(s, e)k \u2208 y* \u2229 \u0177:\n    \u2013 Do nothing, since they are correct.\n\n\u2022 Missed phrases: \u2200(s, e)k \u2208 y* \\ \u0177:\n    1. Update misclassified boundary words:\n        if (w^k_S \u00b7 \u03c6w(xs) \u2264 0) then w^k_S = w^k_S + \u03c6w(xs)\n        if (w^k_E \u00b7 \u03c6w(xe) \u2264 0) then w^k_E = w^k_E + \u03c6w(xe)\n    2. Update score function, if applied:\n        if (w^k_S \u00b7 \u03c6w(xs) > 0 \u2227 w^k_E \u00b7 \u03c6w(xe) > 0) then w^k = w^k + \u03c6p((s, e)k, x, y)\n\n\u2022 Over-predicted phrases: \u2200(s, e)k \u2208 \u0177 \\ y*:\n    1. Update score function: w^k = w^k \u2212 \u03c6p((s, e)k, x, y)\n    2. Update words misclassified as S or E:\n        if (goldS(xs, k) = \u22121) then w^k_S = w^k_S \u2212 \u03c6w(xs)\n        if (goldE(xe, k) = \u22121) then w^k_E = w^k_E \u2212 \u03c6w(xe)\n\nThis feedback models the interaction between the two layers of the recognition process.\nThe start-end layer filters out phrase candidates for the scoring layer. Thus, misclassifying\nthe boundary words of a correct phrase blocks the generation of the candidate and produces\na missed phrase. Therefore, we move the start or end prediction vectors toward the\nmisclassified boundary words of a missed phrase. When an incorrect phrase is predicted,\nwe move away the prediction vectors from the start or end words, provided that they are\nnot boundary words of a phrase in the gold solution. 
Note that we deliberately do not care\nabout false positive start or end words which do not finally over-produce a phrase.\n\nRegarding the scoring layer, each category prediction vector is moved toward missed\nphrases and moved away from over-predicted phrases. It is important to note that this\nfeedback operates only on the basis of the predicted solution \u0177, avoiding updates\nfor every prediction the function has made. Thus, the learning strategy takes advantage\nof the recognition process, and concentrates on (i) assigning high scores to the correct\nphrases and (ii) making the incorrect competing phrases score lower than the correct\nones. As a consequence, this feedback rule tends to approximate the desired behavior of\nthe global R function, that is, to make the summation of the scores of the correct phrase\nset maximal with respect to other phrase set candidates. This learning strategy is closely\nrelated to other recent works on learning ranking functions [8, 9, 10].\n\nA Note on the Convergence. Assuming linear separability for each start, end and score\nfunction, it can be shown that (i) the mistakes of the start-end filters are bounded (applying\nNovikoff's proof); (ii) between two consecutive updates in the start-end layer, there is room\nonly for a finite number of updates of the score function; and (iii) once the start-end filters\nhave converged, the correct solution is always considered in the score layer as candidate,\nand in this state the overall learning process converges (applying the proof of Collins for a\nperceptron tagger [10]).\n\n4 Experiments on Clause Identification\n\nClause Identification is the problem of recognizing the clauses of a sentence. A clause can\nbe roughly defined as a phrase with a subject, possibly implicit, and a predicate. Clauses in\na sentence form a hierarchical structure which constitutes the skeleton of the full syntactic\ntree. 
In the following example, the clauses are annotated with brackets:\n\n( (When (you don\u2019t have any other option)), it is easy (to fight) .)\n\nWe followed the setting of the CoNLL-2001 competition (footnote 1). The problem consists of recognizing\nthe set of clauses on the basis of words, part-of-speech tags (PoS), and syntactic\nbase phrases (or chunks). There is only one category of phrases to be considered, namely\nthe clauses. The data consists of a training set (8,936 sentences, 24,841 clauses), a development\nset (2,012 sentences, 5,418 clauses) and a test set (1,671 sentences, 5,225 clauses).\n\nRepresentation Functions. We now describe the representation functions \u03c6w and \u03c6p,\nwhich respectively map a word or a phrase and their local context into a feature vector in\n{0, 1}^n. Their design is inspired by our previous work [6]. For the function \u03c6w(xi) we\ncapture the form, PoS and chunk tags of words in a window around xi, that is, words x_{i+l}\nwith l \u2208 [\u2212Lw, +Lw]. Each attribute type, together with each relative position l and each\nreturned value forms a final binary indicator feature (for instance, \u201cthe word at position -2\nis that\u201d is a binary feature). Also, we consider the word decisions of the words to the left\nof xi, that is, binary flags indicating whether the [\u2212Lw, \u22121] words in the window are starts\nand/or ends of a phrase. For the function \u03c6p(s, e) we represent the context of the phrase\nby capturing a [\u2212Lp, 0] window of forms, PoS and chunks at the s word, and a separate\n[0, +Lp] window at the e word. Furthermore, we represent the (s, e) phrase by evaluating\na pattern from s to e which captures the relevant elements in the sentence fragment from\nword s to word e (footnote 2). We experimentally set both Lw and Lp to 3.\n\nOn this problem we were interested in comparing the FR-Perceptron algorithm with other\nalternative learning methods. 
The system to train was composed of the start and end functions\nwhich identify clause candidates, and a score function for clauses. As alternatives, we\nfirst considered a batch classification setting, in which each function is trained separately\nwith binary classification loss. To do so, we generated three data sets from training examples,\none for each function. For the start-end sets, we considered an example for each word\nin the data. To train the score classifier, we generated only the phrase candidates formed\nwith all pairs of correct phrase boundaries. This latter generation greatly reduces the real\ninstance space in which the scoring function operates. The alternative of generating all possible\nphrases as examples would be more realistic, but infeasible for the learning algorithm\nsince it would produce 1,377,843 examples, 98.2% of them negative. As a second,\nintermediate approach, we considered a simple model which learns all the functions online\nvia binary classification loss. That is, the training sentences are visited online as in the\nFR-Perceptron: first, the start-end functions are applied to each word, and according to\ntheir positive decisions, phrase examples are generated to train the score function. In this\nway, the input of the score function is dynamically adapted to the start-end behavior, but a\nclassification feedback is given to each function for each decision taken.\n\nThe functions of the system were actually modeled as Voted Perceptrons [11], which compute\na prediction as an average of all vectors generated during training. For the batch\nclassification setting, we modeled the functions as Voted Perceptrons and also as SVMs (footnote 3).\nIn all cases, a function can be expressed in dual form as a combination of training instances,\nwhich allows the use of kernel functions. We work with polynomial kernels of degree 2. 
(footnote 4)\nWe trained the perceptron models for up to 20 epochs via the FR-Perceptron algorithm and\nvia classification feedback, either online (CO-VP) or batch (CB-VP). We also trained SVM\nclassifiers (Cl-SVM), adjusting the soft margin C parameter on the development set.\n\nFootnote 1: Data and details at the CoNLL-2001 website: http://cnts.uia.ac.be/conll2001.\nFootnote 2: The following elements are considered in a pattern: a) Punctuation marks and coordinate conjunctions;\nb) The word that; c) Relative pronouns; d) Verb phrase chunks; and e) The top clauses\nwithin the s to e fragment, already recognized through the bottom up search (a clause in a pattern\nreduces all the elements within it into an atomic element).\nFootnote 3: We used the SVMlight package available at http://svmlight.joachims.org.\nFootnote 4: Initial tests revealed poor performance for the linear case and no improvements for degrees > 2.\n\nFigure 1: Performance on the development set with respect to the number of epochs. Top:\nglobal F1 (left) and precision/recall on starts (right). Bottom: given the start-end filters,\nupper bound on the global F1 (left) and number of proposed phrase candidates (right).\n\nFigure 1 (top, left) shows the performance curves in terms of the F1 measure with respect\nto the number of training epochs. Clearly, the FR-Perceptron model exhibits a much better\ncurve than classification models, being at any epoch more than 2 points higher than the\nonline model, and far from the batch models. To get an idea of how the learning strategy\nbehaves, it is interesting to look at the other plots of Figure 1. The top right plot shows the\nperformance of the start function. The FR-Perceptron model exhibits the desirable filtering\nbehavior for this local decision, which consists in maintaining a very high recall (so that no\ncorrect candidates are blocked) while increasing the precision during epochs. 
In contrast,\nthe CO-VP model concentrates mainly on the precision. The same behavior is observed for\nthe other classification models, and also for the end local decision. The start-end behavior\nis also shown from a global point of view in the bottom plots. The left plot shows the\nmaximum achievable global F1, assuming a perfect scorer, given the phrases proposed by\nthe start-end functions. Additionally, the right plot depicts the filtering capabilities in terms\nof the number of phrase candidates produced, out of a total number of 300,511 possible\nphrases. The FR-Perceptron behavior in the filtering layer is clear: while it maintains a\nhigh recall on identifying correct phrases (above 95%), it substantially reduces the number\nof phrase candidates to explore in the scoring layer, and thus, it progressively simplifies the\ninput to the score function. Far from this behavior, the classification-based models are not\nsensitive to the global performance in the filtering layer and, although they aggressively\nreduce the search space, provide only a moderate upper bound on the global F1.\n\nTable 1 shows the performance of each model, together with the results of our previous\nsystem [6], which held the best results on the problem. There, the same decisions were\nlearned by AdaBoost classifiers working in a richer feature space. Also, the score function\nwas a robust combination of several classifiers. These were trained taking into account the\nerrors of the start-end classifiers, which required a tuning procedure to select the amount\nof introduced errors. Our new approach is much simpler to learn, since the interaction\nbetween functions is naturally ruled by the recognition feedback. 
Looking at the results, we\nsubstantially improve the global F1 (85.03 versus 83.71 on the test set).\n\n[Figure 1 panels: global F measure; precision/recall on start words; global F upper bound; number of phrase candidates. Curves: FR-Perceptron, CO-VP, CB-VP, SVM. Horizontal axis: 0 to 20 epochs.]\n\n                         development                 test\nModel            T   prec.  recall    F\u03b2=1    prec.  recall    F\u03b2=1\nCB-VP            8   83.84   80.55   82.16    82.22   78.09   80.10\nSVM              -   84.31   82.83   83.57    83.19   80.00   81.57\nCO-VP           19   91.06   80.62   85.52    89.25   77.62   83.03\nFR-Perceptron   20   90.56   85.73   88.08    88.17   82.10   85.03\nAdaBoost [6]     -   92.53   82.48   87.22    90.18   78.11   83.71\n\nTable 1: Results of Clause Identification on the CoNLL-2001 development and test sets.\nThe T column shows the optimal number of epochs on the development set.\n\n5 Conclusion\n\nWe have presented a global learning strategy for the general problem of recognizing structures\nof phrases, in which, typically, several different learning functions interact to explore\nand recognize the structure. The effectiveness of our method has been shown empirically\non the problem of clause identification, where we have shown that a considerable improvement\ncan be obtained by exploiting high-order global dependencies in learning, in contrast\nto concentrating only on the local subproblems. These results suggest scaling up global\nlearning strategies to more complex problems found in the natural language area (such as\nfull parsing or machine translation), or other structured domains.\n\nAcknowledgements\n\nResearch partially funded by the European Commission (Meaning, IST-2001-34460) and\nthe Spanish Research Department (Hermes, TIC2000-0335-C03-02; Petra, TIC2000-1735-C02-02). 
Xavier Carreras is supported by a grant from the Catalan Research Department.\n\nReferences\n\n[1] E. F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 Shared Task: Chunking.\nIn Proc. of CoNLL-2000 and LLL-2000, 2000.\n\n[2] E. F. Tjong Kim Sang and H. D\u00e9jean. Introduction to the CoNLL-2001 Shared Task:\nClause Identification. In Proc. of CoNLL-2001, 2001.\n\n[3] A. Ratnaparkhi. Learning to Parse Natural Language with Maximum-Entropy Models.\nMachine Learning, 34(1):151\u2013175, 1999.\n\n[4] V. Punyakanok and D. Roth. The Use of Classifiers in Sequential Inference. In Advances in\nNeural Information Processing Systems 13 (NIPS\u201900), 2001.\n\n[5] T. Kudo and Y. Matsumoto. Chunking with Support Vector Machines. In Proc. of 2nd Conference\nof the North American Chapter of the Association for Computational Linguistics, 2001.\n\n[6] X. Carreras, L. M\u00e0rquez, V. Punyakanok, and D. Roth. Learning and Inference for Clause\nIdentification. In Proceedings of the 14th ECML, Helsinki, Finland, 2002.\n\n[7] T. Kudo and Y. Matsumoto. Japanese Dependency Analysis using Cascaded Chunking.\nIn Proc. of CoNLL-2002, 2002.\n\n[8] K. Crammer and Y. Singer. A Family of Additive Online Algorithms for Category Ranking.\nJournal of Machine Learning Research, 3:1025\u20131058, 2003.\n\n[9] S. Har-Peled, D. Roth, and D. Zimak. Constraint Classification for Multiclass Classification\nand Ranking. In Advances in Neural Information Processing Systems 15 (NIPS\u201902), 2003.\n\n[10] M. Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments\nwith Perceptron Algorithms. In Proceedings of the EMNLP\u201902, 2002.\n\n[11] Y. Freund and R. E. Schapire. 
Large Margin Classification Using the Perceptron Algorithm.\nMachine Learning, 37(3):277\u2013296, 1999.\n", "award": [], "sourceid": 2528, "authors": [{"given_name": "Xavier", "family_name": "Carreras", "institution": null}, {"given_name": "Llu\u00eds", "family_name": "M\u00e0rquez", "institution": null}]}