{"title": "Conditional Random Fields with High-Order Features for Sequence Labeling", "book": "Advances in Neural Information Processing Systems", "page_first": 2196, "page_last": 2204, "abstract": "Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences in the features used is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective.", "full_text": "Conditional Random Fields with High-Order\n\nFeatures for Sequence Labeling\n\nNan Ye\n\nWee Sun Lee\n\nDepartment of Computer Science\nNational University of Singapore\n\n{yenan,leews}@comp.nus.edu.sg\n\nHai Leong Chieu\n\nDSO National Laboratories\nchaileon@dso.org.sg\n\nDan Wu\n\nSingapore MIT Alliance\n\nNational University of Singapore\n\ndwu@nus.edu.sg\n\nAbstract\n\nDependencies among neighbouring labels in a sequence is an important source\nof information for sequence labeling problems. However, only dependencies be-\ntween adjacent labels are commonly exploited in practice because of the high\ncomputational complexity of typical inference algorithms when longer distance\ndependencies are taken into account. 
In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective.

1 Introduction

In a sequence labeling problem, we are given an input sequence x and need to label each component of x with its class to produce a label sequence y. Examples of sequence labeling problems include labeling words in sentences with their types in named-entity recognition problems [16], handwriting recognition problems [15], and deciding whether each DNA base in a DNA sequence is part of a gene in gene prediction problems [2].
Conditional random fields (CRFs) [8] have been successfully applied in many sequence labeling problems. Their chief advantage lies in the fact that they model the conditional distribution P(y|x) rather than the joint distribution P(y, x). In addition, they can effectively encode arbitrary dependencies of y on x, as the learning cost mainly depends on the parts of y involved in the dependencies. However, the use of high-order features, where a feature of order k is a feature that encodes the dependency between x and (k + 1) consecutive elements of y, can potentially lead to an exponential blowup in the computational complexity of inference. 
Hence, dependencies are usually assumed to exist only between adjacent components of y, giving rise to linear-chain CRFs, which limit the order of the features to one.
In this paper, we show that it is possible to learn and predict CRFs with high-order features efficiently under the following pattern sparsity assumption (which is often observed in real problems): the number of observed label sequences of length, say, k that the features depend on is much smaller than n^k, where n is the number of possible labels. We give an algorithm for computing the marginals and the CRF log likelihood gradient that runs in time polynomial in the number and length of the label sequences that the features depend on. The gradient can be used with quasi-Newton methods to efficiently solve the convex log likelihood optimization problem [14]. We also provide an efficient decoding algorithm for finding the most probable label sequence in the presence of long label sequence features. This can be used with cutting plane methods to train max-margin solutions for sequence labeling problems in polynomial time [18].
We show experimentally that using high-order features can improve performance in sequence labeling problems. We show that in handwriting recognition, using even simple high-order indicator features improves performance over using linear-chain CRFs, and impressive performance improvement is observed when the maximum order of the indicator features is increased. We also use a synthetic data set to discuss the conditions under which higher order features can be helpful. We further show that higher order label features can sometimes be more stable under change of data distribution, using a named entity data set.

2 Related Work

Conditional random fields [8] are discriminatively trained, undirected Markov models, which have been shown to perform well in various sequence labeling problems. 
Although a CRF can be used to capture arbitrary dependencies among components of x and y, in practice this flexibility of the CRF is not fully exploited, as inference in Markov models is NP-hard in general (see e.g. [1]) and can only be performed efficiently for special cases such as linear chains. As such, most applications involving CRFs are limited to some tractable Markov models. This observation also applies to other structured prediction methods such as structured support vector machines [15, 18].
A commonly used inference algorithm for CRFs is the clique tree algorithm [5]. Defining a feature depending on k (not necessarily consecutive) labels will require forming a clique of size k, resulting in a clique tree with tree-width greater than or equal to k. Inference on such a clique tree will be exponential in k. For sequence models, a feature of order k can be incorporated into a k-order Markov chain, but the complexity of inference is again exponential in k. Under the pattern sparsity assumption, our algorithm achieves efficiency by maintaining only information related to the few patterns that actually occur, while previous algorithms maintain information about all (exponentially many) possible patterns.
In the special case of semi-Markov random fields, where high-order features depend on segments of identical labels, the complexity of inference is linear in the maximum length of the segments [13]. The semi-Markov assumption can be seen as defining a sparse feature representation: though the number of length-k label patterns is exponential in k, the semi-Markov assumption effectively allows only n^2 of them (n is the cardinality of the label set), as the features are defined on a sequence of identical labels that can only depend on the label of the preceding segment. 
Compared to this approach, our algorithm has the advantage of being able to efficiently handle high-order features having arbitrary label patterns.
Long distance dependencies can also be captured using hierarchical models such as the Hierarchical Hidden Markov Model (HHMM) [4] or Probabilistic Context Free Grammar (PCFG) [6]. The time complexity of inference in an HHMM is O(min{nl^3, n^2 l}) [4, 10], where n is the number of states and l is the length of the sequence. Discriminative versions such as hierarchical CRFs have also been studied [17]. Inference in a PCFG and its discriminative version can also be done efficiently in O(ml^3) time, where m is the number of productions in the grammar [6]. These methods are able to capture dependencies of arbitrary lengths, unlike k-order Markov chains. However, to do efficient learning with these methods, the hierarchical structure of the examples needs to be provided. For example, if we use a PCFG to do named entity recognition, we need to provide the parse trees for efficient learning; providing the named entity labels for each word is not sufficient. Hence, a training set that has not been labeled with hierarchical labels will need to be relabeled before it can be trained on efficiently. Alternatively, methods that employ hidden variables can be used (e.g. to infer the hidden parse tree), but the optimization problem is then no longer convex and local optima can sometimes be a problem. Using high-order features captures a less expressive form of dependencies than these models but allows efficient learning without relabeling the training set with hierarchical labels.
Similar work on using higher order features for CRFs was independently done in [11]. 
Their work applies to a larger class of CRFs, including those requiring exponential time for inference, and they did not identify subclasses for which inference is guaranteed to be efficient.

3 CRF with High-order Features

Throughout the remainder of this paper, x, y, z (with or without decorations) respectively denote an observation sequence of length T, a label sequence of length T, and an arbitrary label sequence. The function |·| denotes the length of any sequence. The set of labels is Y = {1, . . . , n}. If z = (y_1, . . . , y_t), then z_{i:j} denotes (y_i, . . . , y_j). When j < i, z_{i:j} is the empty sequence (denoted by ε). Let the features being considered be f_1, . . . , f_m. Each feature f_i is associated with a label sequence z^i, called f_i's label pattern, and f_i has the form

f_i(x, y, t) = g_i(x, t) if y_{t-|z^i|+1:t} = z^i, and 0 otherwise.

We call f_i a feature of order |z^i| − 1. Consider, for example, the problem of named entity recognition. The observations x = (x_1, . . . , x_T) may be a word sequence; g_i(x, t) may be an indicator function for whether x_t is capitalized, or may output a precomputed term weight if x_t matches a particular word; and z^i may be a sequence of two labels, such as (person, organization) for the named entity recognition task, giving a feature of order one.
A CRF defines conditional probability distributions P(y|x) = Z_x(y)/Z_x, where Z_x(y) = exp(\sum_{i=1}^m \sum_{t=|z^i|}^T \lambda_i f_i(x, y, t)) and Z_x = \sum_y Z_x(y). The normalization factor Z_x is called the partition function. In this paper, we will use the notation \sum_{x : Pred(x)} f(x) to denote the summation of f(x) over all elements of x satisfying the predicate Pred(x).

3.1 Inference for High-order CRF

In this section, we describe the algorithms for computing the partition function, the marginals and the most likely label sequence for high-order CRFs. 
We give rough polynomial time complexity bounds to give an idea of the effectiveness of the algorithms. These bounds are pessimistic compared to the practical performance of the algorithms. It can also be verified that the algorithms for linear-chain CRFs [8] are special cases of our algorithms when only zero-th and first order features are considered. We show a worked example illustrating the computations in the supplementary material.

3.1.1 Basic Notations

As in the case of hidden Markov models (HMMs) [12], our algorithm uses a forward and backward pass. First, we describe the equivalent of states used in the forward and backward computation. We shall work with three sets: the pattern set Z, the forward-state set P and the backward-state set S. The pattern set, Z, is the set of distinct label patterns used in the m features. For notational simplicity, assume Z = {z^1, . . . , z^M}. The forward-state set, P = {p^1, . . . , p^{|P|}}, consists of the distinct elements in Y ∪ {z^j_{1:k}}_{0 ≤ k ≤ |z^j|−1, 1 ≤ j ≤ M}; that is, P consists of all labels and all proper prefixes (including ε) of label patterns, with duplicates removed. Similarly, S = {s^1, . . . , s^{|S|}} consists of the labels and proper suffixes (including ε) of label patterns, with duplicates removed.
The transitions between states are based on the prefix and suffix relationships defined below. Let z_1 ≤p z_2 denote that z_1 is a prefix of z_2, and let z_1 ≤s z_2 denote that z_1 is a suffix of z_2. We define the longest prefix and suffix relations with respect to the sets P and S as follows:

z_1 ≤pS z_2 if and only if (z_1 ∈ S) and (z_1 ≤p z_2) and (∀z ∈ S, z ≤p z_2 ⇒ z ≤p z_1);
z_1 ≤sP z_2 if and only if (z_1 ∈ P) and (z_1 ≤s z_2) and (∀z ∈ P, z ≤s z_2 ⇒ z ≤s z_1).

Finally, the subsequence relationships defined below are used when combining forward and backward variables to compute marginals. Let z ⊆ z' denote that z is a subsequence of z', and let z ⊂ z' denote that z is a subsequence of z'_{2:|z'|−1}. The addition of the subscript j in ⊆_j and ⊂_j indicates that the condition z ≤s z'_{1:j} is satisfied as well (that is, z ends at position j in z').
We shall give rough time bounds in terms of m (the total number of features), n (the number of labels), T (the length of the sequence), M (the number of distinct label patterns in Z), and the maximum order K = max{|z^1| − 1, . . . , |z^M| − 1}.

3.1.2 The Forward and Backward Variables

We now define the forward vector α_x and the backward vector β_x. Suppose z ≤p y; then define y's prefix score Z^p_x(z) = exp(\sum_{i=1}^m \sum_{t=|z^i|}^{|z|} \lambda_i f_i(x, y, t)). Similarly, if z ≤s y, then define y's suffix score Z^s_x(z) = exp(\sum_{i=1}^m \sum_{t=T−|z|+|z^i|}^{T} \lambda_i f_i(x, y, t)). Note that Z^p_x(z) and Z^s_x(z) only depend on z. Let

α_x(t, p^i) = \sum_{z : |z|=t, p^i ≤sP z} Z^p_x(z),
β_x(t, s^i) = \sum_{z : |z|=T+1−t, s^i ≤pS z} Z^s_x(z).

The variable α_x(t, p^i) computes, for x_{1:t}, the sum of the scores of all its label sequences z having p^i as the longest suffix. Similarly, the variable β_x(t, s^i) computes, for x_{t:T}, the sum of the scores of all its label sequences z having s^i as the longest prefix. Each vector α_x(t, ·) is of dimension |P|, while β_x(t, ·) has dimension |S|. We shall compute the α_x and β_x vectors with dynamic programming.
Let Ψ^p_x(t, p) = exp(\sum_{i : z^i ≤s p} \lambda_i g_i(x, t)). For y with p ≤s y_{1:t}, this function counts the contribution towards Z_x(y) of all features f_i with their label patterns ending at position t and being suffixes of p. Let p^i y denote the concatenation of p^i with a label y. The following proposition is immediate.

Proposition 1
(a) For any z, there is a unique p^i such that p^i ≤sP z.
(b) For any z, y, if p^i ≤sP z and p^k ≤sP p^i y, then p^k ≤sP zy and Z^p_x(zy) = Ψ^p_x(t, p^i y) Z^p_x(z).

Proposition 1(a) means that we can induce partitions of label sequences using the forward states, and Proposition 1(b) shows how to make a well-defined transition from one forward state at a time slice to another forward state at the next time slice. By definition, α_x(0, ε) = 1 and α_x(0, p^i) = 0 for all p^i ≠ ε. Using Proposition 1(b), the recurrence for α_x is

α_x(t, p^k) = \sum_{(p^i, y) : p^k ≤sP p^i y} Ψ^p_x(t, p^i y) α_x(t − 1, p^i), for 1 ≤ t ≤ T.

Similarly, for the backward vectors β_x, let Ψ^s_x(t, s) = exp(\sum_{i : z^i ≤p s} \lambda_i g_i(x, t + |z^i| − 1)). By definition, β_x(T + 1, ε) = 1 and β_x(T + 1, s^i) = 0 for all s^i ≠ ε. The recurrence for β_x is

β_x(t, s^k) = \sum_{(s^i, y) : s^k ≤pS y s^i} Ψ^s_x(t, y s^i) β_x(t + 1, s^i), for 1 ≤ t ≤ T.

Once α_x or β_x is computed, then using Proposition 1(a), Z_x can be easily obtained:

Z_x = \sum_{i=1}^{|P|} α_x(T, p^i) = \sum_{i=1}^{|S|} β_x(1, s^i).

Time Complexity: We assume that each evaluation of the function g_i(·, ·) can be performed in unit time for all i. All relevant values of Ψ^p_x that are used can hence be computed in O(mn|P|T) (thus O(mnMKT)) time. 
In practice, this is pessimistic, and the computation can be done more quickly. For all following analyses, we assume that Ψ^p_x has already been computed and stored in an array. Now all values of α_x can be computed in Θ(n|P|T), thus O(nMKT), time. Similar bounds hold for Ψ^s_x and β_x.

3.1.3 Computing the Most Likely Label Sequence

As in the case of HMMs [12], Viterbi decoding (calculating the most likely label sequence) is obtained by replacing the sum operator in the forward backward algorithm with the max operator. Formally, let δ_x(t, p^i) = max_{z : |z|=t, p^i ≤sP z} Z^p_x(z). By definition, δ_x(0, ε) = 1 and δ_x(0, p^i) = 0 for all p^i ≠ ε, and using Proposition 1, we have

δ_x(t, p^k) = max_{(p^i, y) : p^k ≤sP p^i y} Ψ^p_x(t, p^i y) δ_x(t − 1, p^i), for 1 ≤ t ≤ T.

We use Φ_x(t, p^k) to record the pair (p^i, y) chosen to obtain δ_x(t, p^k):

Φ_x(t, p^k) = arg max_{(p^i, y) : p^k ≤sP p^i y} Ψ^p_x(t, p^i y) δ_x(t − 1, p^i).

Let p*_T = arg max_{p^i} δ_x(T, p^i); then the most likely path y* = (y*_1, . . . , y*_T) has y*_T as the last label in p*_T, and the full sequence can be traced backwards using Φ_x(·, ·) as follows:

(p*_t, y*_t) = Φ_x(t + 1, p*_{t+1}), for 1 ≤ t < T.

Time Complexity: Either Ψ^p_x or Ψ^s_x can be used for decoding; hence decoding can be done in Θ(n min{|P|, |S|} T) time.

3.1.4 Computing the Marginals

We need to compute marginals of label sequences and single variables, that is, compute P(y_{t−|z|+1:t} = z | x) for z ∈ Z ∪ Y. Unlike in the traditional HMM, additional care needs to be taken regarding features having label patterns that are super or sub sequences of z. 
We define

W_x(t, z) = exp(\sum_{(i,j) : z^i ⊂_j z} \lambda_i g_i(x, t − |z| + j)).

This function computes the sum of all features that may activate strictly within z. If z_{1:|z|−1} ≤s p^i and z_{2:|z|} ≤p s^j, define [p^i, z, s^j] as the sequence p^i_{1:|p^i|−(|z|−1)} z s^j_{|z|−1:|s^j|}, and

O_x(t, p^i, s^j, z) = exp(\sum_{(k,k') : z ⊆ z^k, z^k ⊆_{k'} [p^i, z, s^j]} \lambda_k g_k(x, t − |p^i| + k' − 1)).

O_x(t, p^i, s^j, z) counts the contribution of features with their label patterns properly containing z but within [p^i, z, s^j].

Proposition 2 Let z ∈ Z ∪ Y. For any y with y_{t−|z|+1:t} = z, there exist unique p^i, s^j such that z_{1:|z|−1} ≤s p^i, z_{2:|z|} ≤p s^j, p^i ≤sP y_{1:t−1}, and s^j ≤pS y_{t−|z|+2:T}. In addition,

Z_x(y) = (1 / W_x(t, z)) Z^p_x(y_{1:t−1}) Z^s_x(y_{t−|z|+2:T}) O_x(t, p^i, s^j, z).

Multiplying by O_x counts the features that are not counted in Z^p_x Z^s_x, while division by W_x removes the features that are double-counted. By Proposition 2, we have

P(y_{t−|z|+1:t} = z | x) = \sum_{(i,j) : z_{1:|z|−1} ≤s p^i, z_{2:|z|} ≤p s^j} α_x(t − 1, p^i) β_x(t − |z| + 2, s^j) O_x(t, p^i, s^j, z) / (Z_x W_x(t, z)).

Time Complexity: Both W_x(t, z) and O_x(t, p^i, s^j, z) can be computed in O(|p^i||s^j|) = O(K^2) time (with some precomputation). Thus a very pessimistic time bound for computing P(y_{t−|z|+1:t} = z | x) is O(K^2 |P| |S|) = O(M^2 K^4).

3.2 Training

Given a training set T, the model parameters λ_i can be chosen by maximizing the regularized log-likelihood L_T = log \prod_{(x,y)∈T} P(y|x) − \sum_{i=1}^m λ_i^2 / (2σ_reg^2), where σ_reg is a parameter that controls the degree of regularization. Note that L_T is a concave function of λ_1, . . . , λ_m, and its maximum is achieved when

∂L_T/∂λ_i = Ẽ(f_i) − E(f_i) − λ_i/σ_reg^2 = 0,

where Ẽ(f_i) = \sum_{(x,y)∈T} \sum_{t=|z^i|}^{|x|} f_i(x, y, t) is the empirical sum of the feature f_i in the observed data, and E(f_i) = \sum_{(x,y)∈T} \sum_{|y'|=|x|} P(y'|x) \sum_{t=|z^i|}^{|x|} f_i(x, y', t) is the expected sum of f_i. Given the gradient and value of L_T, we use the L-BFGS optimization method [14] for maximizing the regularized log-likelihood.
The function L_T can now be computed because we have shown how to compute Z_x, and computing the value of Z_x(y) is straightforward for all (x, y) ∈ T. For the gradient, computing Ẽ(f_i) is straightforward, and E(f_i) can be computed using the marginals computed in the previous section:

E(f_i) = \sum_{(x,y)∈T} \sum_{t=|z^i|}^{|x|} P(y'_{t−|z^i|+1:t} = z^i | x) g_i(x, t).

Time Complexity: Computing the gradient is clearly more time-consuming than computing L_T, so we shall just consider the time needed to compute the gradient. Let X = \sum_{(x,y)∈T} |x|. We need to compute at most MX marginals, so the total time needed to compute all the marginals has O(M^3 K^4 X) as an upper bound. Given the marginals, we can compute the gradient in O(mX) time. If the total number of gradient computations needed in the maximization is I, then the total running time of training is bounded by O((M^3 K^4 + m) X I) (very pessimistic).

4 Experiments

The practical feasibility of making use of high-order features with our algorithm lies in the observation that the pattern sparsity assumption often holds. Our algorithm can be applied to take those high-order features into consideration; high-order features now form a component that one can play with in feature engineering.
Now, the question is whether high-order features are practically significant. 
We first use a synthetic data set to explore conditions under which high-order features can be expected to help. We then use a handwritten character recognition problem to demonstrate that even incorporating simple high-order features can lead to impressive performance improvement on a naturally occurring dataset. Finally, we use a named entity data set to show that for some data sets, higher order label features may be more robust to changes in data distributions than observation features.

4.1 Synthetic Data Generated Using k-Order Markov Model

We randomly generate an order-k Markov model with n states s_1, . . . , s_n as follows. To increase pattern sparsity, we allow at most r randomly chosen possible next states given the previous k states. This limits the number of possible label sequences in each length-(k + 1) segment from n^{k+1} to n^k r. The conditional probabilities of these r next states are generated by randomly selecting a vector from the uniform distribution over [0, 1]^r and normalizing it. Each state s_i generates an observation (a_1, . . . , a_m) such that a_j follows a Gaussian distribution with mean µ_{ij} and standard deviation σ. Each µ_{ij} is independently randomly generated from the uniform distribution over [0, 1]. In the experiments, we use the values n = 5, r = 2 and m = 3.
The standard deviation, σ, has an important role in determining the characteristics of the data generated by this Markov model. If σ is very small as compared to most µ_{ij}'s, then using the observations alone as features is likely to be good enough to obtain a good classifier of the states; the label correlations become less important for classification. However, if σ is large, then it is difficult to distinguish the states based on the observations alone, and the label correlations, particularly those captured by higher order features, are likely to be helpful. 
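The generator just described can be sketched as follows (a hypothetical re-implementation under the stated settings, not the authors' code; all function names are ours):

```python
import random
from itertools import product

def make_korder_model(n=5, k=2, r=2, m=3, seed=0):
    """Random order-k Markov model: each length-k history allows r randomly
    chosen successor states with randomly drawn, normalized probabilities;
    state i emits an m-dimensional Gaussian whose means are uniform on [0, 1]."""
    rng = random.Random(seed)
    trans = {}
    for hist in product(range(n), repeat=k):
        succ = rng.sample(range(n), r)          # at most r possible next states
        w = [rng.random() for _ in succ]
        trans[hist] = (succ, [x / sum(w) for x in w])
    means = [[rng.random() for _ in range(m)] for _ in range(n)]
    return trans, means

def sample_sequence(trans, means, T=20, sigma=0.05, seed=1):
    """Sample a length-T state sequence (random length-k start) plus observations."""
    rng = random.Random(seed)
    k = len(next(iter(trans)))
    states = [rng.randrange(len(means)) for _ in range(k)]
    while len(states) < T:
        succ, probs = trans[tuple(states[-k:])]
        states.append(rng.choices(succ, weights=probs)[0])
    obs = [[rng.gauss(mu, sigma) for mu in means[s]] for s in states]
    return states, obs
```

Small σ makes the Gaussian emissions nearly separate the states; large σ blurs them, which is exactly the regime where label correlations matter.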
In short, the standard deviation, σ, is used to control how much information the observations reveal about the states.
We use the current, previous and next observations, rather than just the current observation, as features, exploiting the conditional probability modeling strength of CRFs. For higher order features, we simply use all indicator features that appeared in the training data up to a maximum order. We considered the cases k = 2 and k = 3, and varied σ and the maximum order. The training set and test set each contain 500 sequences of length 20; each sequence was initialized with a random sequence of length k and generated using the randomly generated order-k Markov model. Training was done by maximizing the regularized log likelihood with regularization parameter σ_reg = 1 in all experiments in this paper. The experimental results are shown in Figure 1.
Figure 1 shows that the high-order indicator features are useful in this case. In particular, we can see that it is beneficial to increase the order of the high-order features when the underlying model has longer distance correlations. As expected, increasing the order of the features beyond the order of the underlying model is not helpful. 
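The "all indicator features that appeared in the training data up to a maximum order" construction can be sketched as follows (a minimal illustration with made-up toy label sequences; the counts also show why the pattern sparsity assumption tends to hold, since far fewer than n^k distinct patterns actually occur):

```python
from collections import Counter

def observed_patterns(label_seqs, max_order):
    """Count the distinct label patterns occurring in the training labels,
    up to the given maximum order; an order-j feature has a pattern of
    length j + 1."""
    counts = Counter()
    for y in label_seqs:
        for length in range(1, max_order + 2):   # pattern lengths 1 .. K+1
            for t in range(length, len(y) + 1):
                counts[tuple(y[t - length:t])] += 1
    return counts

train = [list("abca"), list("abcb"), list("abab")]  # toy label sequences
pats = observed_patterns(train, max_order=2)
```

Each distinct key of `pats` would receive one indicator feature; with 3 labels and maximum order 2 there are 3^3 = 27 possible length-3 patterns, but only a handful occur in these sequences.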
The results also suggest that, in general, if the observations are closely coupled with the states (in the sense that different states correspond to very different observations), then feature engineering on the observations is generally enough to perform well, and it is less important to use high-order features to capture label correlations. On the other hand, when such coupling is not clear, it becomes important to capture the label correlations, and high-order features can be useful.

Figure 1: Accuracy as a function of maximum order on the synthetic data set (panels for data generated by 2nd-order and 3rd-order Markov models; curves for σ = 0.01, 0.05, 0.10).

Figure 2: Accuracy (left) and running time (right) as a function of maximum order for the handwriting recognition data set.

4.2 Handwriting Recognition

We used the handwriting recognition data set from [15], consisting of around 6100 handwritten words with an average length of around 8 characters. The data was originally collected by Kassel [7] from around 150 human subjects. 
The words were segmented into characters, and each character was converted into an image of 16 by 8 binary pixels. In this labeling problem, each x_i is the image of a character, and each y_i is a lower-case letter. The experimental setup is the same as that used in [15]: the data set was divided into 10 folds with each fold having approximately 600 training and 5500 test examples. The zero-th order features for a character are the pixel values.
For higher order features, we again used all indicator features that appeared in the training data up to a maximum order. The average accuracy over the 10 folds is shown in Figure 2, where strong improvements are observed as the maximum order increases. Figure 2 also shows the total training time and the running time per iteration of the L-BFGS algorithm (which requires computation of the gradient and value of the function at each iteration). The running time appears to grow no more than linearly with the maximum order of the features for this data set.

4.3 Named Entity Recognition with Distribution Change

The Named Entity Recognition (NER) problem asks for the identification of named entities from texts. With carefully engineered observation features, there does not appear to be very much to be gained from using higher order features. However, in some situations, the training data does not come from the same distribution as the test data. In such cases, we hypothesize that higher order label features may be more stable than observation features and can sometimes offer performance gain.
In our experiment, we used the Automatic Content Extraction (ACE) data [9], which is labeled with seven classes: Person, Organization, Geo-political, Location, Facility, Vehicle, and Weapon. 
The ACE data comes from several genres, and we use the following in our experiment: Broadcast conversation (BC), Newswire (NW), Weblog (WL) and Usenet (UN). We use all pairs of genres as training and test data. Scoring was done with the F1 score [16]. The features used are the previous word, next word, current word, case patterns for these words, and all indicator label features of order up to k. The results for the cases k = 1 and k = 2 are shown in Figure 3. Introducing second order indicator features shows improvement in 10 out of the 12 combinations and degrades performance in two of the combinations. However, the overall effect is small, with an average improvement of 0.62 in F1 score.

Figure 3: Named entity recognition results (F1 score per training:test domain pair, linear chain vs. second order).

4.4 Discussion

In our experiments, we used indicator features of all label patterns that appear in the training data. For real applications, if the pattern sparsity assumption is not satisfied, but certain patterns do not appear frequently enough to matter, then it is useful to see how we can automatically select a subset of features with few distinct label patterns. One possible approach would be to use boosting-type methods [3] to sequentially select useful features.
An alternate approach to feature selection is to use all possible features and maximize the margin of the solution instead. Generalization error bounds [15] show that it is possible to obtain good generalization with a relatively small training set size despite having a very large number of features if the margin is large. 
This indicates that feature selection may not be critical in some cases. Theoretically, it is also interesting to note that minimizing the regularized training cost when all possible high-order features of arbitrary length are used is computationally tractable. This is because the representer theorem [19] tells us that the optimum solution for minimizing quadratically regularized cost functions lies in the span of the training examples. Hence, even when we are learning with arbitrary sets of high-order features, we only need to use the features that appear in the training set to obtain the optimal solution. Given a training set of N sequences of length l, only O(l^2 N) long label sequences of all orders are observed. Using cutting plane techniques [18], the computational complexity of optimization is polynomial in the inverse accuracy parameter, the training set size and the maximum length of the sequences.
It should also be possible to use kernels within the approach here. On the handwritten character problem, [15] reports substantial improvement in performance with the use of kernels. Use of kernels together with high-order features may lead to further improvements. However, we note that the advantage of the higher order features may become less substantial as the observations become more powerful in distinguishing the classes. Whether the use of higher order features together with kernels brings substantial improvement in performance is likely to be problem dependent. Similarly, observation features that are more distribution invariant, such as comprehensive name lists, can be used for the NER task we experimented with and may reduce the improvements offered by higher order features.

5 Conclusion

The pattern sparsity assumption often holds in real applications, and we give efficient inference algorithms for CRFs with high-order features when the pattern sparsity assumption is satisfied. 
This allows high-order features to be explored in feature engineering for real applications. We used a synthetic data set to study the conditions favourable for high-order features, and demonstrated that simple high-order features can lead to performance improvements on a handwriting recognition problem and a named entity recognition problem.

Acknowledgements
This work is supported by DSO grant R-252-000-390-592 and AcRF grant R-252-000-327-112.

References
[1] B. A. Cipra, "The Ising model is NP-complete," SIAM News, vol. 33, no. 6, 2000.
[2] A. Culotta, D. Kulp, and A. McCallum, "Gene prediction with conditional random fields," University of Massachusetts, Amherst, Tech. Rep. UM-CS-2005-028, 2005.
[3] T. G. Dietterich, A. Ashenfelter, and Y. Bulatov, "Training conditional random fields via gradient tree boosting," in Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
[4] S. Fine, Y. Singer, and N. Tishby, "The hierarchical hidden Markov model: Analysis and applications," Machine Learning, vol. 32, no. 1, pp. 41–62, 1998.
[5] C. Huang and A. Darwiche, "Inference in belief networks: A procedural guide," International Journal of Approximate Reasoning, vol. 15, no. 3, pp. 225–263, 1996.
[6] F. Jelinek, J. D. Lafferty, and R. L. Mercer, "Basic methods of probabilistic context free grammars," in Speech Recognition and Understanding. Recent Advances, Trends, and Applications. Springer Verlag, 1992.
[7] R. H. Kassel, "A comparison of approaches to on-line handwritten character recognition," Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, USA, 1995.
[8] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 282–289.
[9] Linguistic Data Consortium, "ACE (Automatic Content Extraction) English Annotation Guidelines for Entities," 2005.
[10] K. P. Murphy and M. A. Paskin, "Linear-time inference in hierarchical HMMs," in Advances in Neural Information Processing Systems 14, vol. 14, 2002.
[11] X. Qian, X. Jiang, Q. Zhang, X. Huang, and L. Wu, "Sparse higher order conditional random fields for improved sequence labeling," in ICML, 2009, p. 107.
[12] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990.
[13] S. Sarawagi and W. W. Cohen, "Semi-Markov conditional random fields for information extraction," in Advances in Neural Information Processing Systems 17. Cambridge, MA: MIT Press, 2005, pp. 1185–1192.
[14] F. Sha and F. Pereira, "Shallow parsing with conditional random fields," in Proceedings of the Twentieth International Conference on Machine Learning, 2003, pp. 282–289.
[15] B. Taskar, C. Guestrin, and D. Koller, "Max-margin Markov networks," in Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press, 2004.
[16] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Conference on Computational Natural Language Learning, 2003.
[17] T. T. Tran, D. Phung, H. Bui, and S. Venkatesh, "Hierarchical semi-Markov conditional random fields for recursive sequential data," in Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press, 2008, pp. 1657–1664.
[18] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, "Support vector machine learning for interdependent and structured output spaces," in Proceedings of the Twenty-First International Conference on Machine Learning, 2004, pp. 104–112.
[19] G. Wahba, Spline Models for Observational Data, ser. CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 59. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 1990.