{"title": "The Use of Classifiers in Sequential Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 995, "page_last": 1001, "abstract": null, "full_text": "NIPS '00 \n\nThe Use of Classifiers in Sequential Inference \n\nVasin Punyakanok \n\nDan Roth \n\nDepartment of Computer Science \nUniversity of Illinois at Urbana-Champaign \nUrbana, IL 61801 \npunyakan@cs.uiuc.edu \ndanr@cs.uiuc.edu \n\nAbstract \n\nWe study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing. \n\n1 Introduction \n\nIn many situations it is necessary to make decisions that depend on the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints - the sequential nature of the data or other domain-specific constraints. Consider, for example, the problem of chunking natural language sentences, where the goal is to identify several kinds of phrases (e.g. noun phrases, verb phrases) in sentences. A task of this sort involves multiple predictions that interact in some way. For example, one way to address the problem is to utilize two classifiers for each phrase type, one of which recognizes the beginning of the phrase, and the other its end. 
Clearly, there are constraints over the predictions; for instance, phrases cannot overlap and there are probabilistic constraints over the order of phrases and their lengths. The above-mentioned problem is an instance of a general class of problems - identifying the phrase structure in sequential data. This paper develops two general approaches for this class of problems by utilizing general classifiers and performing inferences with their outcomes. Our formalisms apply directly to natural language problems such as shallow parsing [7, 23, 5, 3, 21], computational biology problems such as identifying splice sites [8, 4, 15], and problems in information extraction [9]. \nOur first approach is within a Markovian framework. In this case, classifiers are functions of the observation sequence and their outcomes represent states; we study two Markov models that are used as inference procedures and differ in the type of classifiers and the details of the probabilistic modeling. The critical shortcoming of this framework is that it attempts to maximize the likelihood of the state sequence - not the true performance measure of interest but only a derivative of it. The second approach extends a constraint satisfaction formalism to deal with variables that are associated with costs and shows how to use this to model the classifier combination problem. In this approach general constraints can be incorporated flexibly and algorithms can be developed that closely address the true global optimization criterion of interest. For both approaches we develop efficient combination algorithms that use general classifiers to yield the inference. 
\nThe approaches are studied experimentally in the context of shallow parsing - the task of identifying syntactic sequences in sentences [14, 1, 11] - which has been found useful in many large-scale language processing applications including information extraction and text summarization [12, 2]. Working within a concrete task allows us to compare the approaches experimentally for phrase types such as base Noun Phrases (NPs) and Subject-Verb phrases (SVs) that differ significantly in their statistical properties, including length and internal dependencies. Thus, the robustness of the approaches to deviations from their assumptions can be evaluated. \nOur two main methods, projection-based Markov Models (PMM) and constraint satisfaction with classifiers (CSCL), are shown to perform very well on the task of predicting NP and SV phrases, with CSCL at least as good as any other method tried on these tasks. CSCL performs better than PMM on both tasks, more significantly so on the harder, SV, task. We attribute this to CSCL's ability to cope better with the length of the phrase and the long-term dependencies. Our experiments make use of the SNoW classifier [6, 24] and we provide a way to combine its scores in a probabilistic framework; we also exhibit the improvements of the standard hidden Markov model (HMM) when allowing states to depend on a richer structure of the observation via the use of classifiers. \n\n2 Identifying Phrase Structure \n\nThe inference problem considered can be formalized as that of identifying the phrase structure of an input string. Given an input string O = <o1, o2, ..., on>, a phrase is a substring of consecutive input symbols oi, oi+1, ..., oj. Some external mechanism is assumed to consistently (or stochastically) annotate substrings as phrases¹. Our goal is to come up with a mechanism that, given an input string, identifies the phrases in this string. 
\nThe identification mechanism works by using classifiers that attempt to recognize, in the input string, local signals which are indicative of the existence of a phrase. We assume that the outcome of the classifier at input symbol o can be represented as a function of the local context of o in the input string, perhaps with the aid of some external information inferred from it². Classifiers can indicate that an input symbol o is inside or outside a phrase (IO modeling) or they can indicate that an input symbol o opens or closes a phrase (OC modeling) or some combination of the two. Our work here focuses on OC modeling, which has been shown to be more robust than IO, especially with fairly long phrases [21]. In any case, the classifiers' outcomes can be combined to determine the phrases in the input string. This process, however, needs to satisfy some constraints for the resulting set of phrases to be legitimate. Several types of constraints, such as length, order and others, can be formalized and incorporated into the approaches studied here. \nThe goal is thus twofold: to learn classifiers that recognize the local signals and to combine them in a way that respects the constraints. We call the inference algorithm that combines the classifiers and outputs a coherent phrase structure a combinator. The performance of this process is measured by how accurately it retrieves the phrase structure of the input string. This is quantified in terms of recall - the percentage of phrases that are correctly identified - and precision - the percentage of identified phrases that are indeed correct phrases. \n\n¹We assume here a single type of phrase, and thus each input symbol is either in a phrase or outside it. All the methods can be extended to deal with several kinds of phrases in a string. 
\n\n²In the case of natural language processing, if the oi's are words in a sentence, additional information might include morphological information, part-of-speech tags, semantic class information from WordNet, etc. This information can be assumed to be encoded into the observed sequence. \n\n3 Markov Modeling \n\nAn HMM is a probabilistic finite state automaton that models the probabilistic generation of sequential processes. It consists of a finite set S of states, a set O of observations, an initial state distribution P1(s), a state-transition distribution P(s|s') (s, s' ∈ S) and an observation distribution P(o|s) (o ∈ O, s ∈ S). A sequence of observations is generated by first picking an initial state according to P1(s); this state produces an observation according to P(o|s) and transits to a new state according to P(s|s'). This state produces the next observation, and the process goes on until it reaches a designated final state [22]. \nIn a supervised learning task, an observation sequence O = <o1, o2, ..., on> is supervised by a corresponding state sequence S = <s1, s2, ..., sn>. This allows one to estimate the HMM parameters and then, given a new observation sequence, to identify the most likely corresponding state sequence. The supervision can also be supplied (see Sec. 2) using local signals from which the state sequence can be recovered. Constraints can be incorporated into the HMM by constraining the state-transition probability distribution P(s|s'). For example, set P(s|s') = 0 for all s, s' such that the transition from s' to s is not allowed. \n\n3.1 A Hidden Markov Model Combinator \n\nTo recover the most likely state sequence in HMM, we wish to estimate all the required probability distributions. As in Sec. 2 we assume to have local signals that indicate the state. That is, we are given classifiers with states as their outcomes. Formally, we assume that Pt(s|o) is given, where t is the time step in the sequence. 
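For reference, the constrained decoding described in Sec. 3 - zeroing P(s|s') for disallowed transitions and recovering the most likely state sequence by dynamic programming - can be sketched as follows. This is a minimal illustration with made-up distributions, not the authors' implementation:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for an observation sequence.

    pi[s]     : initial state distribution P1(s)
    A[s2, s1] : transition probability P(s2|s1); forbidden transitions are 0
    B[s, o]   : observation probability P(o|s)
    obs       : list of observation indices
    """
    delta = pi * B[:, obs[0]]             # delta_1(s) = P1(s) P(o1|s)
    back = []
    for o in obs[1:]:
        scores = A * delta[None, :]       # scores[s, s'] = P(s|s') delta_{t-1}(s')
        back.append(scores.argmax(axis=1))
        delta = scores.max(axis=1) * B[:, o]
    # trace back the best path from the best final state
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

With a two-state model in which self-transitions are forbidden (A has zeros on the diagonal), any decoded sequence necessarily alternates states, illustrating how a hard constraint is enforced through P(s|s') = 0.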
In order to use this information in the HMM framework, we compute Pt(o|s) = Pt(s|o)Pt(o)/Pt(s). That is, instead of observing the conditional probability Pt(o|s) directly from training data, we compute it from the classifiers' output. Notice that in HMM, the assumption is that the probability distributions are stationary. We can assume that for Pt(s|o), which we obtain from the classifier, but need not assume it for the other distributions, Pt(o) and Pt(s). Pt(s) can be calculated by Pt(s) = Σ_{s'∈S} P(s|s')Pt-1(s'), where P1(s) and P(s|s') are the two required distributions for the HMM. We still need Pt(o), which is harder to approximate but, for each t, can be treated as a constant ηt because the goal is to find the most likely sequence of states for the given observations, which are the same for all compared sequences. \nWith this scheme, we can still combine the classifiers' predictions by finding the most likely sequence for an observation sequence using dynamic programming. To do so, we incorporate the classifiers' opinions in its recursive step by computing P(ot|s) as above: \n\nδt(s) = max_{s'∈S} δt-1(s')P(s|s')P(ot|s) = max_{s'∈S} δt-1(s')P(s|s')P(s|ot)ηt/Pt(s). \n\nThis is derived using the HMM assumptions but utilizes the classifier outputs P(s|o), allowing us to extend the notion of an observation. In Sec. 6 we estimate P(s|o) based on a whole observation sequence rather than ot to significantly improve the performance. \n\n3.2 A Projection based Markov Model Combinator \n\nIn HMMs, observations are allowed to depend only on the current state and long-term dependencies are not modeled. Equivalently, the constraints structure is restricted by having a stationary probability distribution of a state given the previous one. We attempt to relax this by allowing the distribution of a state to depend, in addition to the previous state, on the observation. 
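The HMM combinator recursion of Sec. 3.1 - with the classifier output P(s|ot) and the running marginal Pt(s) standing in for the observation probability, and the constant ηt dropped since it is shared by all compared sequences - might be transcribed as follows. This is a sketch under toy-distribution assumptions, not the authors' code:

```python
import numpy as np

def combinator_viterbi(pi, A, post):
    """Viterbi decoding driven by classifier posteriors.

    pi[s]      : initial distribution P1(s)
    A[s2, s1]  : transition probability P(s2|s1)
    post[t, s] : classifier output, interpreted as P(s|o_t)
    """
    delta = post[0].copy()            # delta_1(s) = P(s|o1) * eta_1 (eta dropped)
    p = pi.copy()                     # running state marginal P_t(s)
    back = []
    for t in range(1, len(post)):
        p = A @ p                     # P_t(s) = sum_{s'} P(s|s') P_{t-1}(s')
        scores = A * delta[None, :]   # scores[s, s'] = P(s|s') delta_{t-1}(s')
        back.append(scores.argmax(axis=1))
        best = scores.max(axis=1)
        # unreachable states (P_t(s) = 0) are assigned probability 0
        delta = np.where(p > 0, best * post[t] / np.maximum(p, 1e-300), 0.0)
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

Note that only P1(s), P(s|s'), and the per-step classifier posteriors are needed; no observation model P(o|s) is ever estimated, which is the point of the Bayes inversion described above.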
Formally, we now make the following independence assumption: P(st|st-1, st-2, ..., s1, ot, ot-1, ..., o1) = P(st|st-1, ot). Thus, given an observation sequence O we can find the most likely state sequence S given O by maximizing \n\nP(S|O) = P1(s1|o1) ∏_{t=2}^{n} P(st|st-1, ot). \n\nHence, this model generalizes the standard HMM by combining the state-transition probability and the observation probability into one function. The most likely state sequence can still be recovered using the dynamic programming (Viterbi) algorithm if we modify the recursive step: δt(s) = max_{s'∈S} δt-1(s')P(s|s', ot). In this model, the classifiers' decisions are incorporated in the terms P(s|s', o) and P1(s|o). To learn these classifiers we follow the projection approach [26] and separate P(s|s', o) into many functions Ps'(s|o) according to the previous states s'. Hence as many as |S| classifiers, projected on the previous states, are separately trained. (Therefore the name \"Projection based Markov Model (PMM)\".) Since these are simpler classifiers we hope that the performance will improve. As before, the question of what constitutes an observation is an issue. Sec. 6 exhibits the contribution of estimating Ps'(s|o) using a wider window in the observation sequence. \n\n3.3 Related Work \n\nSeveral attempts to combine classifiers, mostly neural networks, into HMMs have been made in speech recognition works in the last decade [20]. A recent work [19] is similar to our PMM but uses maximum entropy classifiers. In both cases, the attempt to combine classifiers with Markov models is motivated by an attempt to improve the existing Markov models; the belief is that this would yield better generalization than the pure observation probability estimation from the training data. Our motivation is different. 
The starting point is the existence of general classifiers that provide some local information on the input sequence along with constraints on their outcomes; our goal is to use the classifiers to infer the phrase structure of the sequence in a way that satisfies the constraints. Using Markov models is only one possibility and, as mentioned earlier, not one that optimizes the real performance measure of interest. Technically, another novelty worth mentioning is that we use a wider range of observations instead of a single observation to predict a state. This certainly violates the assumption underlying HMMs but improves the performance. \n\n4 Constraint Satisfaction with Classifiers \n\nThis section describes a different model that is based on an extension of the Boolean constraint satisfaction (CSP) formalism [17] to handle variables that are the outcome of classifiers. As before, we assume an observed string O = <o1, o2, ..., on> and local classifiers that, without loss of generality, take two distinct values, one indicating opening a phrase and a second indicating closing it (OC modeling). The classifiers provide their output in terms of the probability P(o) and P(c), given the observation. \nWe extend the CSP formalism to deal with probabilistic variables (or, more generally, variables with cost) as follows. Let V be the set of Boolean variables associated with the problem, |V| = n. The constraints are encoded as clauses and, as in standard CSP modeling, the Boolean CSP becomes a CNF (conjunctive normal form) formula f. Our problem, however, is not simply to find an assignment τ : V → {0, 1} that satisfies f but rather the following optimization problem. We associate a cost function c : V → R with each variable, and try to find a solution τ of f of minimum cost, C(τ) = Σ_{i=1}^{n} τ(vi)c(vi). \nOne efficient way to use this general scheme is by encoding phrases as variables. 
Let E be the set of all possible phrases. Then, all the non-overlapping constraints can be encoded in ∧_{ei overlaps ej} (¬ei ∨ ¬ej). This yields a quadratic number of variables, and the constraints are binary, encoding the restriction that phrases do not overlap. A satisfying assignment for the resulting 2-CNF formula can therefore be computed in polynomial time, but the corresponding optimization problem is still NP-hard [13]. For the specific case of phrase structure, however, we can find the optimal solution in linear time. The solution to the optimization problem corresponds to a shortest path in a directed acyclic graph constructed on the observation symbols, with legitimate phrases (the variables of the CSP) as its edges and their cost as the edges' weights. The construction of the graph takes quadratic time and corresponds to constructing the 2-CNF formula above. It is not hard to see (details omitted) that each path in this graph corresponds to a satisfying assignment and the shortest path corresponds to the optimal solution. The time complexity of this algorithm is linear in the size of the graph. The main difficulty here is to determine the cost c as a function of the confidence given by the classifiers. Our experiments revealed, though, that the algorithm is robust to reasonable modifications in the cost function. A natural cost function is to use the classifiers' probabilities P(o) and P(c) and define, for a phrase e = (o, c), c(e) = 1 - P(o)P(c). The interpretation is that the error in selecting e is the error in selecting either o or c, and allowing those to overlap³. The constant in 1 - P(o)P(c) biases the minimization to prefer selecting few phrases, so instead we minimize -P(o)P(c). \n\n5 Shallow Parsing \n\nWe use shallow parsing tasks in order to evaluate our approaches. Shallow parsing involves the identification of phrases or of words that participate in a syntactic relationship. 
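The shortest-path phrase selection of Sec. 4 can be sketched as a dynamic program over the observation positions: each candidate phrase (i, j) is an edge from node i to node j+1 with weight -P(o)P(c), and null edges let a symbol belong to no phrase. The candidate scores below are hypothetical classifier products; this is an illustrative sketch, not the authors' implementation:

```python
def select_phrases(n, candidates):
    """Pick a best set of non-overlapping phrases via a DAG shortest path.

    n          : number of input symbols (nodes 0..n)
    candidates : dict {(i, j): score} for a candidate phrase spanning
                 symbols i..j inclusive; score plays the role of P(o)P(c)
    Minimizing the sum of -score over chosen phrases maximizes the total
    score of the selected, non-overlapping phrases.
    """
    best = [0.0] * (n + 1)          # best[k]: min cost over symbols 0..k-1
    choice = [None] * (n + 1)       # phrase ending at k-1 on the best path
    for k in range(1, n + 1):
        best[k] = best[k - 1]       # null edge: symbol k-1 is in no phrase
        choice[k] = None
        for (i, j), score in candidates.items():
            if j == k - 1 and best[i] - score < best[k]:
                best[k] = best[i] - score
                choice[k] = (i, j)
    # recover the chosen phrases by walking back along the path
    phrases, k = [], n
    while k > 0:
        if choice[k] is None:
            k -= 1
        else:
            i, j = choice[k]
            phrases.append((i, j))
            k = i
    return sorted(phrases)
```

Because every node-to-node step either skips a symbol or consumes a whole candidate phrase, any path through the graph is automatically a non-overlapping (satisfying) assignment, mirroring the 2-CNF argument above.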
The observation that shallow syntactic information can be extracted using local information - by examining the pattern itself, its nearby context and the local part-of-speech information - has motivated the use of learning methods to recognize these patterns [7, 23, 3, 5]. In this work we study the identification of two types of phrases, base Noun Phrases (NP) and Subject Verb (SV) patterns. We chose these since they differ significantly in their structural and statistical properties and this allows us to study the robustness of our methods to several assumptions. As in previous work on this problem, this evaluation is concerned with identifying one-layer NP and SV phrases, with no embedded phrases. We use the OC modeling and learn two classifiers; one predicting whether there should be an open in location t or not, and the other whether there should be a close in location t or not. For technical reasons the cases ¬o and ¬c are separated according to whether we are inside or outside a phrase. Consequently, each classifier may output three possible outcomes O, nOi, nOo (open, not open inside, not open outside) and C, nCi, nCo, resp. The state-transition diagram in Figure 1 captures the order constraints. Our modeling of the problem is a modification of our earlier work on this topic, which has been found to be quite successful compared to other learning methods attempted on this problem [21]. \n\nFigure 1: State-transition diagram for the phrase recognition problem. \n\n5.1 Classification \n\nThe classifier we use to learn the states as a function of the observation is SNoW [24, 6], a multi-class classifier that is specifically tailored for large scale learning tasks. The SNoW learning architecture learns a sparse network of linear functions, in which the targets (states, in this case) are represented as linear functions over a common feature space. 
SNoW has already been used successfully for a variety of tasks in natural language and visual processing [10, 25]. Typically, SNoW is used as a classifier, and predicts using a winner-take-all mechanism over the activation values of the target classes. The activation value is computed using a sigmoid function over the linear sum. In the current study we normalize the activation levels of all targets to sum to 1 and output the outcomes for all targets (states). We verified experimentally on the training data that the output for each state is indeed a distribution function and can be used in further processing as P(s|o) (details omitted). \n\n³It is also possible to account for the classifiers' suggestions inside each phrase; details omitted. \n\n6 Experiments \n\nWe experimented both with NPs and SVs and we show results for two different representations of the observations (that is, different feature sets for the classifiers) - part-of-speech (POS) information only and POS with additional lexical information (words). The result of interest is Fβ = (β² + 1) · Recall · Precision / (β² · Precision + Recall) (here β = 1). The data sets used are the standard data sets for this problem [23, 3, 21] taken from the Wall Street Journal corpus in the Penn Treebank [18]. For NP, the training and test corpus was prepared from sections 15 to 18 and section 20, respectively; the SV phrase corpus was prepared from sections 1 to 9 for training and section 0 for testing. \nFor each model we study three different classifiers. The simple classifier corresponds to the standard HMM in which P(o|s) is estimated directly from the data. When the observations are in terms of lexical items, the data is too sparse to yield robust estimates and these entries were left empty. The NB (naive Bayes) and SNoW classifiers use the same feature set, conjunctions of size 3 of POS tags (POS and words, resp.) in a window of size 6. 
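The Fβ measure above is a direct computation; for β = 1 it reduces to the harmonic mean of precision and recall. A transcription (with the added convention, an assumption of this sketch, that Fβ is 0 when both precision and recall are 0):

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: (b^2 + 1) * Recall * Precision / (b^2 * Precision + Recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * recall * precision / (beta ** 2 * precision + recall)
```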
\n\nTable 1: Results (Fβ=1) of different methods on NP and SV recognition \n\nModel  Classifier | NP, POS only | NP, POS+words | SV, POS only | SV, POS+words \nHMM    SNoW       |    90.64     |     92.89     |    64.15     |     77.54 \nHMM    NB         |    90.50     |     92.26     |    75.40     |     78.43 \nHMM    Simple     |    87.83     |       -       |    64.85     |       - \nPMM    SNoW       |    90.61     |     92.98     |    74.98     |     86.07 \nPMM    NB         |    90.22     |     91.98     |    74.80     |     84.80 \nPMM    Simple     |    61.44     |       -       |    40.18     |       - \nCSCL   SNoW       |    90.87     |     92.88     |    85.36     |     90.09 \nCSCL   NB         |    90.49     |     91.95     |    80.63     |     88.28 \nCSCL   Simple     |    54.42     |       -       |    59.27     |       - \n\nThe first important observation is that the SV identification task is significantly more difficult than the NP task. This is consistent across all models and feature sets. When comparing between different models and feature sets, it is clear that the simple HMM formalism is not competitive with the other two models. What is interesting here is the very significant sensitivity to the feature base of the classifiers used, despite the violation of the probabilistic assumptions. For the easier NP task, the HMM model is competitive with the others when the classifiers used are NB or SNoW. In particular, the fact that both probabilistic methods achieve significant improvements when their input is given by SNoW confirms the claim that the output of SNoW can be used reliably as a probabilistic classifier. \nPMM and CSCL perform very well on predicting NP and SV phrases, with CSCL at least as good as any other method tried on these tasks. Both for NPs and SVs, CSCL performs better than the others, more significantly on the harder, SV, task. We attribute this to CSCL's ability to cope better with the length of the phrase and the long-term dependencies. \n\n7 Conclusion \n\nWe have addressed the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. 
This can be viewed as a concrete instantiation of the Learning to Reason framework [16]. The focus here is on an important subproblem, the identification of phrase structure. We presented two approaches: a probabilistic framework that extends HMMs in two ways and an approach that is based on an extension of the CSP formalism. In both cases we developed efficient combination algorithms and studied them empirically. It seems that the CSP formalism can support the desired performance measure as well as complex constraints and dependencies more flexibly than the Markovian approach. This is supported by the experimental results that show that CSCL yields better results, in particular, for the more complex case of SV phrases. As a side effect, this work exhibits the use of general classifiers within a probabilistic framework. Future work includes extensions to deal with more general constraints by exploiting more general probabilistic structures and generalizing the CSP approach. \n\nAcknowledgments \n\nThis research is supported by NSF grants IIS-9801638 and IIS-9984168. \n\nReferences \n\n[1] S. P. Abney. Parsing by chunks. In R. C. Berwick, S. P. Abney, and C. Tenny, editors, Principle-based parsing: Computation and Psycholinguistics, pages 257-278. Kluwer, Dordrecht, 1991. \n[2] D. Appelt, J. Hobbs, J. Bear, D. Israel, and M. Tyson. FASTUS: A finite-state processor for information extraction from real-world text. In Proc. of IJCAI, 1993. \n[3] S. Argamon, I. Dagan, and Y. Krymolowski. A memory-based approach to learning shallow natural language patterns. Journal of Experimental and Theoretical Artificial Intelligence, special issue on memory-based learning, 10:1-22, 1999. \n[4] C. Burge and S. Karlin. Finding the genes in genomic DNA. Current Opinion in Structural Biology, 8:346-354, 1998. \n[5] C. Cardie and D. Pierce. 
Error-driven pruning of treebank grammars for base noun phrase identification. In Proceedings of ACL-98, pages 218-224, 1998. \n[6] A. Carlson, C. Cumby, J. Rosen, and D. Roth. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, UIUC Computer Science Department, May 1999. \n[7] K. W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proc. of ACL Conference on Applied Natural Language Processing, 1988. \n[8] J. W. Fickett. The gene identification problem: An overview for developers. Computers and Chemistry, 20:103-118, 1996. \n[9] D. Freitag and A. McCallum. Information extraction using HMMs and shrinkage. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31-36, 1999. \n[10] A. R. Golding and D. Roth. A Winnow-based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130, 1999. \n[11] G. Grefenstette. Evaluation techniques for automatic semantic extraction: comparing semantic and window based approaches. In ACL'93 Workshop on the Acquisition of Lexical Knowledge from Text, 1993. \n[12] R. Grishman. The NYU system for MUC-6 or where's syntax? In B. Sundheim, editor, Proceedings of the Sixth Message Understanding Conference. Morgan Kaufmann Publishers, 1995. \n[13] D. Gusfield and L. Pitt. A bounded approximation for the minimum cost 2-SAT problem. Algorithmica, 8:103-117, 1992. \n[14] Z. S. Harris. Co-occurrence and transformation in linguistic structure. Language, 33(3):283-340, 1957. \n[15] D. Haussler. Computational genefinding. Trends in Biochemical Sciences, Supplementary Guide to Bioinformatics, pages 12-15, 1998. \n[16] R. Khardon and D. Roth. Learning to reason. J. ACM, 44(5):697-725, Sept. 1997. \n[17] A. Mackworth. Constraint satisfaction. In S. C. Shapiro, editor, Encyclopedia of Artificial Intelligence, pages 285-293, 1992. Volume 1, second edition. \n[18] M. 
P. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, June 1993. \n[19] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML-2000, 2000. To appear. \n[20] N. Morgan and H. Bourlard. Continuous speech recognition. IEEE Signal Processing Magazine, 12(3):24-42, 1995. \n[21] M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. A learning approach to shallow parsing. In EMNLP-VLC'99, 1999. \n[22] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, 1989. \n[23] L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In Proceedings of the Third Annual Workshop on Very Large Corpora, 1995. \n[24] D. Roth. Learning to resolve natural language ambiguities: A unified approach. In Proceedings of the National Conference on Artificial Intelligence, pages 806-813, 1998. \n[25] D. Roth, M.-H. Yang, and N. Ahuja. Learning to recognize objects. In CVPR'00, The IEEE Conference on Computer Vision and Pattern Recognition, pages 724-731, 2000. \n[26] L. G. Valiant. Projection learning. In Proceedings of the Conference on Computational Learning Theory, pages 287-293, 1998. \n", "award": [], "sourceid": 1817, "authors": [{"given_name": "Vasin", "family_name": "Punyakanok", "institution": null}, {"given_name": "Dan", "family_name": "Roth", "institution": null}]}