{"title": "Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1808, "page_last": 1816, "abstract": "Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied by Sinn and Poupart [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given.", "full_text": "Mixing Properties of Conditional Markov Chains\n\nwith Unbounded Feature Functions\n\nMathieu Sinn\n\nIBM Research - Ireland\nMulhuddart, Dublin 15\n\nmathsinn@ie.ibm.com\n\nBei Chen\n\nMcMaster University\n\nHamilton, Ontario, Canada\n\nbei.chen@math.mcmaster.ca\n\nAbstract\n\nConditional Markov Chains (also known as Linear-Chain Conditional Random\nFields in the literature) are a versatile class of discriminative models for the dis-\ntribution of a sequence of hidden states conditional on a sequence of observable\nvariables. Large-sample properties of Conditional Markov Chains have been \ufb01rst\nstudied in [1]. The paper extends this work in two directions: \ufb01rst, mixing prop-\nerties of models with unbounded feature functions are being established; second,\nnecessary conditions for model identi\ufb01ability and the uniqueness of maximum\nlikelihood estimates are being given.\n\n1\n\nIntroduction\n\nConditional Random Fields (CRF) are a widely popular class of discriminative models for the dis-\ntribution of a set of hidden states conditional on a set of observable variables. 
The fundamental assumption is that the hidden states, conditional on the observations, form a Markov random field [2, 3]. Of special importance, particularly for the modeling of sequential data, is the case where the underlying undirected graphical model forms a simple linear chain. In the literature, this subclass of models is often referred to as Linear-Chain Conditional Random Fields. This paper adopts the terminology of [4] and refers to such models as Conditional Markov Chains (CMC).

Large-sample properties of CRFs and CMCs were first studied in [1] and [5]. [1] defines CMCs of infinite length and studies ergodic properties of the joint sequences of observations and hidden states. The analysis relies on fundamental results from the theory of weak ergodicity [6]. The exposition is restricted to CMCs with bounded feature functions, which precludes the application, e.g., to models with linear features and Gaussian observations. [5] considers weak consistency and central limit theorems for models with a more general structure. Ergodicity and mixing of the models is assumed, but no explicit conditions on the feature functions or on the distribution of the observations are given. An analysis of model identifiability in the case of finite sequences can be found in [7].

The present paper studies mixing properties of Conditional Markov Chains with unbounded feature functions. The results are fundamental for analyzing the consistency of Maximum Likelihood estimates and for establishing Central Limit Theorems (which are very useful for constructing statistical hypothesis tests, e.g., for model misspecifications and the significance of features). The paper is organized as follows: Sec. 2 reviews the definition of infinite CMCs and some of their basic properties. In Sec. 3 the ergodicity results from [1] are extended to models with unbounded feature functions. Sec. 4 establishes various mixing properties. 
A key result is that, in order to allow for unbounded feature functions, the observations need to follow a distribution such that Hoeffding-type concentration inequalities can be established. Furthermore, the mixing rates depend on the tail behaviour of the distribution. In Sec. 5 the mixing properties are used to analyze model identifiability and consistency of the Maximum Likelihood estimates. Sec. 6 concludes with an outlook on open problems for future research.

2 Conditional Markov Chains

Preliminaries. We use N, Z and R to denote the sets of natural numbers, integers and real numbers, respectively. Let X be a metric space with the Borel sigma-field A, and let Y be a finite set. Furthermore, consider a probability space (Ω, F, P) and let X = (Xt)t∈Z, Y = (Yt)t∈Z be sequences of measurable mappings from Ω into X and Y, respectively. Here,

• X is an infinite sequence of observations ranging in the domain X,
• Y is an aligned sequence of hidden states taking values in the finite set Y.

For now, the distribution of X is arbitrary. Next we define Conditional Markov Chains, which parameterize the conditional distribution of Y given X.

Definition. Consider a vector f of real-valued functions f : X × Y × Y → R, called the feature functions. Throughout this paper, we assume that the following condition is satisfied:

(A1) All feature functions are finite: |f(x, i, j)| < ∞ for all x ∈ X and i, j ∈ Y.

Associated with the feature functions is a vector λ of real-valued model weights. The key in the definition of Conditional Markov Chains is the matrix M(x) with the (i, j)-th component

m(x, i, j) = exp(λ^T f(x, i, j)).

In terms of statistical physics, m(x, i, j) measures the potential of the transition between the hidden states i and j from time t−1 to t, given the observation x at time t. 
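As a concrete illustration, the matrix M(x) can be computed directly from the definition m(x, i, j) = exp(λ^T f(x, i, j)). The following sketch uses a toy feature vector and toy weights of our own choosing; the names `features` and `transition_matrix` and the specific form of f are illustrative assumptions, not part of the paper:

```python
import numpy as np

# Toy feature vector (illustrative only):
# f(x, i, j) = (x * 1(i == j), 1(i != j)) for a scalar observation x.
def features(x, i, j):
    return np.array([x if i == j else 0.0, 0.0 if i == j else 1.0])

def transition_matrix(x, lam, n_states=2):
    """Potential matrix M(x) with entries m(x, i, j) = exp(lam . f(x, i, j))."""
    M = np.empty((n_states, n_states))
    for i in range(n_states):
        for j in range(n_states):
            M[i, j] = np.exp(lam @ features(x, i, j))
    return M

lam = np.array([0.5, -1.0])   # model weights (arbitrary choice)
M = transition_matrix(1.2, lam)
```

Note that all entries of M(x) are strictly positive by construction, which is what condition (A1) guarantees; this positivity is essential for the mixing coefficients introduced below.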
Next, for a sequence x = (xt)t∈Z in X and time points s, t ∈ Z with s ≤ t, introduce the vectors

α^t_s(x) = M(xt)^T · · · M(xs)^T (1, 1, . . . , 1)^T,
β^t_s(x) = M(xs+1) · · · M(xt) (1, 1, . . . , 1)^T,

and write α^t_s(x, i) and β^t_s(x, j) to denote the ith respectively jth components. Intuitively, α^t_s(x, i) measures the potential of the hidden state i at time t given the observations xs, . . . , xt and assuming that at time s−1 all hidden states have potential equal to 1. Similarly, β^t_s(x, j) is the potential of j at time s assuming equal potential of all hidden states at time t. Now let t ∈ Z and k ∈ N, and define the distribution of the labels Yt, . . . , Yt+k conditional on X,

P(Yt = yt, . . . , Yt+k = yt+k | X) := [Π_{i=1}^k m(Xt+i, yt+i−1, yt+i)] × lim_{n→∞} [α^t_{t−n}(X, yt) β^{t+k+n}_{t+k}(X, yt+k)] / [α^t_{t−n}(X)^T β^{t+k+n}_t(X)].

Note that, under assumption (A1), the limit on the right-hand side is well-defined (see Theorem 2 in [1]). Furthermore, the family of all marginal distributions obtained this way satisfies the consistency conditions of Kolmogorov's Extension Theorem. Hence we obtain a unique distribution for Y conditional on X, parameterized by the feature functions f and the model weights λ. Intuitively, the distribution is obtained by conditioning the marginal distributions of Y on the finite observational context (Xt−n, . . . , Xt+k+n), and then letting the size of the context go to infinity.

Basic properties. We introduce the following notation: for any matrix P = (pij) with strictly positive entries let φ(P) denote the mixing coefficient

φ(P) = min_{i,j,k,l} (pik pjl) / (pjk pil).

Note that 0 ≤ φ(P) ≤ 1. 
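The coefficient φ(P) can be computed by brute force over all index quadruples; a minimal sketch (the function name `phi` is our own):

```python
import numpy as np

def phi(P):
    """phi(P) = min over i,j,k,l of p_ik * p_jl / (p_jk * p_il), for strictly positive P."""
    n = P.shape[0]
    return min(P[i, k] * P[j, l] / (P[j, k] * P[i, l])
               for i in range(n) for j in range(n)
               for k in range(n) for l in range(n))

# A matrix with identical rows forgets the current state entirely: phi = 1.
P_equal = np.array([[0.2, 0.8], [0.2, 0.8]])
# A "sticky" near-diagonal matrix mixes slowly: phi is small.
P_sticky = np.array([[0.9, 0.1], [0.1, 0.9]])
```

Taking i = j (or k = l) gives a ratio of 1, so φ(P) ≤ 1 always; strict positivity of the entries guarantees φ(P) > 0.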
This coefficient will play a key role in the analysis of mixing properties. The following proposition summarizes fundamental properties of the distribution of Y conditional on X, which follow directly from the above definition (see also Corollary 1 in [1]).

Proposition 1. Suppose that condition (A1) holds true. Then Y conditional on X forms a time-inhomogeneous Markov chain. Moreover, if X is strictly stationary, then the joint distribution of the aligned sequences (X, Y) is strictly stationary. The conditional transition probabilities Pt(x, i, j) := P(Yt = j | Yt−1 = i, X = x) of Y given X = x have the following form:

Pt(x, i, j) = m(xt, i, j) lim_{n→∞} β^n_t(x, j) / β^n_{t−1}(x, i).

In particular, a lower bound for Pt(x, i, j) is given by

Pt(x, i, j) ≥ [m(xt, i, j) (min_{k∈Y} m(xt+1, i, k))] / [|Y| (max_{k∈Y} m(xt, j, k)) (max_{k,l∈Y} m(xt+1, k, l))],

and the matrix of transition probabilities P_t(x), with the (i, j)-th component given by Pt(x, i, j), satisfies φ(P_t(x)) = φ(M(xt)).

3 Ergodicity

In this section we establish conditions under which the aligned sequences (X, Y) are jointly ergodic. Let us first recall the definition of ergodicity of X (see [8]): by X we denote the space of sequences x = (xt)t∈Z in X, and by A the corresponding product σ-field. Consider the probability measure PX on (X, A) defined by PX(A) := P(X ∈ A) for A ∈ A. Finally, let τ denote the operator on X which shifts sequences one position to the left: τx = (xt+1)t∈Z. Then ergodicity of X is formally defined as follows:

(A2) X is ergodic, that is, PX(A) = PX(τ^{−1}A) for every A ∈ A, and PX(A) ∈ {0, 1} for every set A ∈ A satisfying A = τ^{−1}A.

As a particular consequence of the invariance PX(A) = PX(τ^{−1}A), we obtain that X is strictly stationary. 
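The identity φ(P_t(x)) = φ(M(xt)) from Proposition 1 can be checked numerically: in a finite observation window, the β-factors cancel out of every ratio defining φ. A sketch with arbitrary positive potential matrices (the names and the choice of potentials are our own, and the truncation to a finite window is an approximation of the limit definition):

```python
import numpy as np

def phi(P):
    n = P.shape[0]
    return min(P[i, k] * P[j, l] / (P[j, k] * P[i, l])
               for i in range(n) for j in range(n)
               for k in range(n) for l in range(n))

# Potential matrices M(x_1), ..., M(x_6) for a window of length 6 (arbitrary positives).
rng = np.random.default_rng(42)
Ms = [np.exp(rng.normal(size=(2, 2))) for _ in range(6)]

# beta[t] = M(x_{t+1}) ... M(x_6) (1, 1)^T, computed by the backward recursion.
beta = [None] * 7
beta[6] = np.ones(2)
for t in range(5, -1, -1):
    beta[t] = Ms[t] @ beta[t + 1]

def transition(t):
    """Finite-window analogue of P_t(x): P[i, j] = m(x_t, i, j) * beta_t(j) / beta_{t-1}(i)."""
    return Ms[t - 1] * beta[t][np.newaxis, :] / beta[t - 1][:, np.newaxis]

P3 = transition(3)
```

Because beta[t−1] = M(x_t) beta[t], the rows of `transition(t)` sum to one automatically, and the β-terms cancel in the ratios p_ik p_jl / (p_jk p_il), so φ of the transition matrix equals φ of the raw potential matrix exactly.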
Now we are able to formulate the key result of this section, which will be of central importance in the later analysis. For simplicity, we state it for functions depending on the values of X and Y only at time t; the generalization of the statement is straightforward. In our later analysis, we will use the theorem to show that the time average of feature functions f(Xt, Yt−1, Yt) converges to the expected value E[f(Xt, Yt−1, Yt)].

Theorem 1. Suppose that conditions (A1) and (A2) hold, and g : X × Y → R is a function which satisfies E[|g(Xt, Yt)|] < ∞. Then

lim_{n→∞} (1/n) Σ_{t=1}^n g(Xt, Yt) = E[g(Xt, Yt)]   P-almost surely.

Proof. Consider the sequence Z = (Zt)t∈N given by Zt := (τ^{t−1}X, Yt), where we write τ^{t−1} to denote the (t−1)th iterate of τ. Note that Zt represents the hidden state at time t together with the entire aligned sequence of observations τ^{t−1}X. In the literature, such models are known as Markov sequences in random environments (see [9]). The key step in the proof is to show that Z is ergodic. Then, for any function h with E[|h(Zt)|] < ∞, the time average (1/n) Σ_{t=1}^n h(Zt) converges to the expected value E[h(Zt)] P-almost surely. Applying this result to the composition of the function g and the projection of (τ^{t−1}X, Yt) onto (Xt, Yt) completes the proof. The details of the proof that Z is ergodic can be found in an extended version of this paper, which is included in the supplementary material.

4 Mixing properties

In this section we are going to study mixing properties of the aligned sequences (X, Y). 
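Before turning to the assumptions, the covariance decay at the heart of the analysis can be computed in closed form for a small example. With constant potentials M(xt) = [[2, 1], [1, 2]] and uniform boundary conditions, one finds Cov(1(Yt = 0), 1(Yt+k = 0) | X) = 3^{−k}/4, i.e., geometric decay. The sketch below (our own finite-window construction, not code from the paper) verifies this via the forward-backward recursions:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # constant potential matrix M(x_t)
n, t = 40, 5                            # window length and reference time

# Forward vectors alpha[s] = (1,1) A^s and backward vectors beta[s] = A^(n-s) (1,1)^T.
alpha = [np.ones(2)]
for _ in range(n):
    alpha.append(alpha[-1] @ A)
beta = [None] * (n + 1)
beta[n] = np.ones(2)
for s in range(n - 1, -1, -1):
    beta[s] = A @ beta[s + 1]
Z = alpha[n] @ np.ones(2)  # total potential (normalizing constant)

def marginal(s):
    """P(Y_s = 0 | x) for the finite window."""
    return alpha[s][0] * beta[s][0] / Z

def pair(s, k):
    """P(Y_s = 0, Y_{s+k} = 0 | x)."""
    G = np.linalg.matrix_power(A, k)
    return alpha[s][0] * G[0, 0] * beta[s + k][0] / Z

covs = [pair(t, k) - marginal(t) * marginal(t + k) for k in range(1, 11)]
```

The decay rate 1/3 matches (1 − √φ(A)) / (1 + √φ(A)) with φ(A) = 1/4, in line with the weak-ergodicity bounds used below.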
To establish the results, we will assume that the distribution of the observations X satisfies conditions under which certain concentration inequalities hold true:

(A3) Let A ⊂ A be a measurable set, with p := P(Xt ∈ A) and Sn(x) := (1/n) Σ_{t=1}^n 1(xt ∈ A) for x ∈ X. Then there exists a constant γ such that, for all n ∈ N and ε > 0,

P(|Sn(X) − p| ≥ ε) ≤ exp(−γ ε² n).

If X is a sequence of independent random variables, then (A3) follows by Hoeffding's inequality. In the dependent case, concentration inequalities of this type can be obtained by imposing martingale or mixing conditions on X (see [12, 13]). Furthermore, we will make the following assumption, which relates the feature functions to the tail behaviour of the distribution of X:

(A4) Let h : [0, ∞) → [0, ∞) be a differentiable decreasing function with h(z) = O(z^{−(1+κ)}) for some κ > 0. Furthermore, let

F(x) := Σ_{j,k∈Y} |λ^T f(x, j, k)|

for x ∈ X. Then E[h(F(Xt))^{−1}] and E[h'(F(Xt))^{−1}] both exist and are finite.

The following theorem establishes conditions under which the expected conditional covariances of square-integrable functions are summable. The result is obtained by studying ergodic properties of the transition probability matrices.

Theorem 2. Suppose that conditions (A1)-(A3) hold true, and g : X × Y → R is a function with finite second moment, E[|g(Xt, Yt)|²] < ∞. Let γ_{t,k}(X) = Cov(g(Xt, Yt), g(Xt+k, Yt+k) | X) denote the covariance of g(Xt, Yt) and g(Xt+k, Yt+k) conditional on X. Then, for every t ∈ Z:

lim_{n→∞} Σ_{k=1}^n E[|γ_{t,k}(X)|] < ∞.

Proof. 
Without loss of generality we may assume that g can be written as g(x, y) = g(x) 1(y = i). Hence, using Hölder's inequality, we obtain

E[|γ_{t,k}(X)|] ≤ E[|g(Xt)|] E[|g(Xt+k)|] E[|Cov(1(Yt = i), 1(Yt+k = i) | X)|].

According to the assumptions, we have E[|g(Xt)|] = E[|g(Xt+k)|] < ∞, so we only need to bound the expectation of the conditional covariance. Note that

Cov(1(Yt = i), 1(Yt+k = i) | X) = P(Yt = i, Yt+k = i | X) − P(Yt = i | X) P(Yt+k = i | X).

Recall the definition of φ(P) before Corollary 1. Using probabilistic arguments, it is not difficult to show that the ratio of P(Yt = i, Yt+k = i | X) to P(Yt = i | X) P(Yt+k = i | X) is greater than or equal to φ(P_{t+1}(X) · · · P_{t+k}(X)), where P_{t+1}(X), . . . , P_{t+k}(X) denote the transition matrices introduced in Proposition 1. Hence,

|Cov(1(Yt = i), 1(Yt+k = i) | X)| ≤ P(Yt = i, Yt+k = i | X) [1 − φ(P_{t+1}(X) · · · P_{t+k}(X))].

Now, using results from the theory of weak ergodicity (see Chapter 3 in [6]), we can establish

[1 − √φ(P_{t+1}(x) · · · P_{t+k}(x))] / [1 + √φ(P_{t+1}(x) · · · P_{t+k}(x))] ≤ Π_{j=1}^k [1 − √φ(P_{t+j}(x))] / [1 + √φ(P_{t+j}(x))]

for all x ∈ X. Using Bernoulli's inequality and the fact φ(P_{t+j}(x)) = φ(M(x_{t+j})) established in Proposition 1, we obtain φ(P_{t+1}(x) · · · P_{t+k}(x)) ≥ 1 − 4 Π_{j=1}^k [1 − φ(M(x_{t+j}))]. Consequently,

|Cov(1(Yt = i), 1(Yt+k = i) | X)| ≤ 4 Π_{j=1}^k [1 − φ(M(X_{t+j}))].

With the notation introduced in assumption (A3), let δ > 0 and A ⊂ X with p > 0 be such that x ∈ A implies φ(M(x)) ≥ δ. Furthermore, let ε be a constant with 0 < ε < p. 
In order to bound |Cov(1(Yt = i), 1(Yt+k = i) | X)| for a given k ∈ N, we distinguish two different cases: if |Sk(X) − p| < ε, then we obtain

4 Π_{j=1}^k (1 − φ(M(X_{t+j}))) ≤ 4 (1 − δ)^{k(p−ε)}.

If |Sk(X) − p| ≥ ε, then we use the trivial upper bound 1. According to assumption (A3), the probability of the latter event is bounded by an exponential, and hence

E[|Cov(1(Yt = i), 1(Yt+k = i) | X)|] ≤ 4 (1 − δ)^{k(p−ε)} + exp(−γ ε² k).

Obviously, the sum of all these expectations is finite, which completes the proof.

The next theorem bounds the difference between the distribution of Y conditional on X and finite approximations of it. Introduce the following notation: for t, k ≥ 0 with t + k ≤ n let

P^{(0:n)}(Yt = yt, . . . , Yt+k = yt+k | X = x) := [Π_{i=1}^k m(xt+i, yt+i−1, yt+i)] × [α^t_0(x, yt) β^n_{t+k}(x, yt+k)] / [α^t_0(x)^T β^n_t(x)].

Accordingly, write E^{(0:n)} for expectations taken with respect to P^{(0:n)}. As emphasized by the superscripts, these quantities can be regarded as marginal distributions of Y conditional on the observations at times t = 0, 1, . . . , n. To simplify notation, the following theorem is stated for 1-dimensional conditional marginal distributions; however, the extension to the general case is straightforward.

Theorem 3. Suppose that conditions (A1)-(A4) hold true. Then the limit

P^{(0:∞)}(Yt = i | X) := lim_{n→∞} P^{(0:n)}(Yt = i | X)

is well-defined P-almost surely. Moreover, there exists a measurable function C(x) of x ∈ X with finite expectation, E[|C(X)|] < ∞, and a function h(z) satisfying the conditions in (A4), such that

|P^{(0:∞)}(Yt = i | X) − P^{(0:n)}(Yt = i | X)| ≤ C(τ^t X) h(n − t).

Proof. 
Define G_n(x) := M(x_{t+1}) · · · M(x_n) and write g_n(x, i, j) for the (i, j)-th component of G_n(x). Note that β^n_t(x) = G_n(x)(1, 1, . . . , 1)^T. According to Lemma 3.4 in [6], there exist numbers r_{ij}(x) such that

lim_{n→∞} g_n(x, i, k) / g_n(x, j, k) = r_{ij}(x)

for all k ∈ Y. Consequently, the ratio of β^n_t(x, i) to β^n_t(x, j) converges to r_{ij}(x), and hence

lim_{n→∞} [α^t_0(x, i) β^n_t(x, i)] / [α^t_0(x)^T β^n_t(x)] = 1 / (q_i(x)^T r_i(x)),

where we use the notation q_i(x) = α^t_0(x)/α^t_0(x, i), and r_i(x) denotes the vector with the kth component r_{ki}(x). This proves the first part of the theorem. In order to prove the second part, note that |x − y| ≤ |x^{−1} − y^{−1}| for any x, y ∈ (0, 1], and hence

|P^{(0:∞)}(Yt = i | X) − P^{(0:n)}(Yt = i | X)| ≤ |q_i(X)^T r_i(X) − [α^t_0(X)^T β^n_t(X)] / [α^t_0(X, i) β^n_t(X, i)]|.

To bound the latter expression, introduce the vectors r^n_i(x) and r̄^n_i(x) with the kth components

r^n_{ki}(x) = min_{l∈Y} g_n(x, k, l)/g_n(x, i, l)   and   r̄^n_{ki}(x) = max_{l∈Y} g_n(x, k, l)/g_n(x, i, l).

It is easy to see that q_i(x)^T r^n_i(x) ≤ q_i(x)^T r_i(x) ≤ q_i(x)^T r̄^n_i(x), and

q_i(x)^T r^n_i(x) ≤ [α^t_0(x)^T β^n_t(x)] / [α^t_0(x, i) β^n_t(x, i)] ≤ q_i(x)^T r̄^n_i(x).

Hence,

|q_i(X)^T r_i(X) − [α^t_0(X)^T β^n_t(X)] / [α^t_0(X, i) β^n_t(X, i)]| ≤ |q_i(X)^T (r̄^n_i(X) − r^n_i(X))|.

Due to space limitations, we only sketch the proof of how the latter quantity can be bounded. For details, see the extended version of this paper in the supplementary material. 
The first step is to show the existence of a function C1(x) with E[|C1(X)|] < ∞ such that |r̄^n_{ki}(X) − r^n_{ki}(X)| ≤ C1(τ^t X)(1 − ζ)^{n−t} for some ζ > 0. With the function F(x) introduced in assumption (A4), we define C2(x) := exp(F(x)) for x ∈ X and arrive at

|P^{(0:∞)}(Yt = i | X) − P^{(0:n)}(Yt = i | X)| ≤ |Y|² C1(τ^t X) C2(Xt) (1 − ζ)^{n−t}.

The next step is to construct a function C3(x) satisfying the following two conditions: (i) if C2(x)(1 − ζ)^k ≥ 1, then C3(x) h(k) ≥ 1; (ii) if C2(x)(1 − ζ)^k < 1, then C3(x) h(k) ≥ C2(x)(1 − ζ)^k. Since the difference between two probabilities cannot exceed 1, we obtain

|P^{(0:∞)}(Yt = i | X) − P^{(0:n)}(Yt = i | X)| ≤ |Y|² C1(τ^t X) C3(Xt) h(n − t).

The last step is to show that E[|C3(Xt)|] < ∞.

The following result will play a key role in the later analysis of empirical likelihood functions.

Theorem 4. Suppose that conditions (A1)-(A4) hold, and the function g : X × Y → R satisfies E[|g(Xt, Yt)|] < ∞. Then

lim_{n→∞} (1/n) Σ_{t=1}^n E^{(0:n)}[g(Xt, Yt) | X] = E[g(Xt, Yt)]   P-almost surely.

Proof. Without loss of generality we may assume that g can be written as g(x, y) = g(x) 1(y = i). Using the result from Theorem 3, we obtain

|Σ_{t=1}^n E^{(0:n)}[g(Xt, Yt) | X] − Σ_{t=1}^n E^{(0:∞)}[g(Xt, Yt) | X]| ≤ Σ_{t=1}^n |g(Xt)| |C(τ^t X)| h(n − t),

where h(z) is a function satisfying the conditions in assumption (A4). See the extended version of this paper in the supplementary material for more details. 
Using the facts that X is ergodic and that the expectations of |g(Xt)| and |C(τ^t X)| are finite, we obtain

lim_{n→∞} (1/n) |Σ_{t=1}^n E^{(0:n)}[g(Xt, Yt) | X] − Σ_{t=1}^n E^{(0:∞)}[g(Xt, Yt) | X]| = 0.

By similar arguments to the proof of the first part of Theorem 3, one can show that the difference |E^{(0:∞)}[g(Xt, Yt) | X] − E[g(Xt, Yt) | X]| tends to 0 as t → ∞. Thus,

lim_{n→∞} (1/n) |Σ_{t=1}^n E^{(0:∞)}[g(Xt, Yt) | X] − Σ_{t=1}^n E[g(Xt, Yt) | X]| = 0.

Now, noting that E[g(Xt, Yt) | X] = E[g(X0, Y0) | τ^t X] for every t, the theorem follows by the ergodicity of X.

5 Maximum Likelihood learning and model identifiability

In this section we apply the previous results to analyze asymptotic properties of the empirical likelihood function. The setting is the following: suppose that we observe finite subsequences X_n = (X0, . . . , Xn) and Y_n = (Y0, . . . , Yn) of X and Y, where the distribution of Y conditional on X follows a conditional Markov chain with fixed feature functions f and unknown model weights λ*. We assume that λ* lies in some parameter space Θ, the structure of which will become important later. To emphasize the role of the model weights, we will use subscripts, e.g., P_λ and E_λ, to denote the conditional distributions. Our goal is to identify the unknown model weights from the finite samples X_n and Y_n. In order to do so, introduce the shorthand notation f(x_n, y_n) = Σ_{t=1}^n f(xt, yt−1, yt) for x_n = (x0, . . . , xn) and y_n = (y0, . . . , yn). Consider the
Consider the\n\nf (xn, yn) = (cid:80)n\n\nnormalized conditional likelihood,\n\n(cid:16)\n\n(cid:88)\n\nexp(cid:0)\u03bbT f (X n, yn)(cid:1)(cid:17)\n\n.\n\nyn\u2208Y n+1\n\nLn(\u03bb) =\n\n1\nn\n\n\u03bbT f (X n, Y n) \u2212 log\n\nNote that, in the context of \ufb01nite Conditional Markov Chains, Ln(\u03bb) is the exact likelihood of Y n\nconditional on X n. The Maximum Likelihood estimate of \u03bb\u2217 is given by\n\n\u02c6\u03bbn\n\n:= arg max\n\n\u03bb\u2208\u0398\n\nLn(\u03bb).\n\nIf Ln(\u03bb) is strictly concave, then the arg max is unique and can be found using gradient-based\nsearch (see [14]). It is easy to see that Ln(\u03bb) is strictly concave if and only if |Y| > 1, and there\nexists a sequence yn such that at least one component of f (X n, yn) is non-zero. In the following,\nwe study strong consistency of the Maximum Likelihood estimates, a property which is of central\nimportance in large sample theory (see [15]). As we will see, a key problem is the identi\ufb01ability and\nuniqueness of the model weights.\n\n\f5.1 Asymptotic properties of the likelihood function\n\nIn addition to the conditions (A1)-(A4) stated earlier, we will make the following assumptions:\n\n(A5) The feature functions have \ufb01nite second moments: E\u03bb\u2217 [|f (Xt, Yt\u22121, Yt)|2] < \u221e.\n(A6) The parameter space \u0398 is compact.\n\nThe next theorem establishes asymptotic properties of the likelihood function Ln(\u03bb).\nTheorem 5. Suppose that conditions (A1)-(A6) are satis\ufb01ed. Then the following holds true:\n\n(i) There exists a function L(\u03bb) such that limn\u2192\u221e Ln(\u03bb) = L(\u03bb) P\u03bb\u2217-almost surely for\nevery \u03bb \u2208 \u0398. 
Moreover, the convergence of Ln(λ) to L(λ) is uniform on Θ, that is, lim_{n→∞} sup_{λ∈Θ} |Ln(λ) − L(λ)| = 0 P_{λ*}-almost surely.

(ii) The gradients satisfy lim_{n→∞} ∇Ln(λ) = E_{λ*}[f(Xt, Yt−1, Yt)] − E_λ[f(Xt, Yt−1, Yt)] P_{λ*}-almost surely for every λ ∈ Θ.

(iii) If the limit of the Hessian ∇²Ln(λ) is finite and non-singular, then the function L(λ) is strictly concave on Θ. As a consequence, the Maximum Likelihood estimates are strongly consistent:

lim_{n→∞} λ̂_n = λ*   P_{λ*}-almost surely.

Proof. The statements are obtained analogously to Lemmas 4-6 and Theorem 4 in [1], using the auxiliary results for Conditional Markov Chains with unbounded feature functions established in Theorem 1, Theorem 2, and Theorem 4.

Next, we study the asymptotic behaviour of the Hessian ∇²Ln(λ). In order to compute the derivatives, introduce the vectors λ1, . . . , λn with λt = λ for t = 1, . . . , n. This allows us to write λ^T f(X_n, Y_n) = Σ_{t=1}^n λ_t^T f(Xt, Yt−1, Yt). Now, regard the argument λ of the likelihood function as a stacked vector (λ1, . . . , λn). 
As in [1], this gives us the expressions

∂²/(∂λt ∂λ^T_{t+k}) Ln(λ) = (1/n) Cov^{(0:n)}_λ[f(Xt, Yt−1, Yt), f(Xt+k, Yt+k−1, Yt+k)^T | X],

where Cov^{(0:n)}_λ is the covariance with respect to the measure P^{(0:n)}_λ introduced before Theorem 3. Using these expressions, the Hessian of Ln(λ) can be written as

∇²Ln(λ) = −(Σ_{t=1}^n ∂²/(∂λt ∂λ_t^T) Ln(λ) + 2 Σ_{k=1}^{n−1} Σ_{t=1}^{n−k} ∂²/(∂λt ∂λ^T_{t+k}) Ln(λ)).

The following theorem establishes an expression for the limit of ∇²Ln(λ). It differs from the expression given in Lemma 7 of [1], which is incorrect.

Theorem 6. Suppose that conditions (A1)-(A5) hold. Then

lim_{n→∞} ∇²Ln(λ) = −(γ_λ(0) + 2 Σ_{k=1}^∞ γ_λ(k))   P_{λ*}-almost surely,

where γ_λ(k) = E[Cov_λ(f(Xt, Yt−1, Yt), f(Xt+k, Yt+k−1, Yt+k) | X)] is the expectation of the conditional covariance (with respect to P_λ) between f(Xt, Yt−1, Yt) and f(Xt+k, Yt+k−1, Yt+k) given X. In particular, the limit of ∇²Ln(λ) is finite.

Proof. The key step is to show the existence of a positive measurable function U_λ(k, x) such that

lim_{n→∞} |Σ_{k=1}^{n−1} Σ_{t=1}^{n−k} ∂²/(∂λt ∂λ^T_{t+k}) Ln(λ)| ≤ lim_{n→∞} Σ_{k=1}^{n−1} E[U_λ(k, X)]

with the limit on the right-hand side being finite. 
Then the rest of the proof is straightforward: Theorem 4 shows that, for fixed k ≥ 0,

lim_{n→∞} Σ_{t=1}^{n−k} ∂²/(∂λt ∂λ^T_{t+k}) Ln(λ) = γ_λ(k)   P_{λ*}-almost surely.

Hence, in order to establish the theorem, it suffices to show that

lim_{n→∞} Σ_{k=1}^{n−1} |γ_λ(k) − Σ_{t=1}^{n−k} ∂²/(∂λt ∂λ^T_{t+k}) Ln(λ)| ≤ ε

for all ε > 0. Now let ε > 0 be fixed. According to Theorem 2 we have Σ_{k=1}^∞ |γ_λ(k)| < ∞. Hence we can find a finite N such that

lim_{n→∞} Σ_{k=N}^{n−1} |γ_λ(k)| + lim_{n→∞} Σ_{k=N}^{n−1} E[U_λ(k, X)] ≤ ε.

On the other hand, Theorem 4 shows that

lim_{n→∞} Σ_{k=1}^{N−1} |γ_λ(k) − Σ_{t=1}^{n−k} ∂²/(∂λt ∂λ^T_{t+k}) Ln(λ)| = 0.

For details on how to construct U_λ(k, x), see the extended version of this paper.

5.2 Model identifiability

Let us conclude the analysis by investigating conditions under which the limit of the Hessian ∇²Ln(λ) is non-singular. Note that ∇²Ln(λ) is negative definite for every n, so the limit is also negative definite, but not necessarily strictly negative definite. Using the result in Theorem 6, we can establish the following statement:

Corollary 1. Suppose that assumptions (A1)-(A5) hold true. 
Then the following conditions are necessary for the limit of ∇²Ln(λ) to be non-singular:

(i) For each feature function f(x, i, j), there exists a set A ⊂ X with P(Xt ∈ A) > 0 such that, for every x ∈ A, we can find i, j, k, l ∈ Y with f(x, i, j) ≠ f(x, k, l).

(ii) There does not exist a non-zero vector λ and a subset A ⊂ X with P(Xt ∈ A) = 1 such that λ^T f(x, i, j) is constant for all x ∈ A and i, j ∈ Y.

Condition (i) essentially says that features f(x, i, j) must not be constant in i and j. Condition (ii) says that features must not be expressible as linear combinations of each other. Clearly, features violating condition (i) can be assigned arbitrary model weights without any effect on the conditional distributions. If condition (ii) is violated, then there are infinitely many ways of parameterizing the same model. In practice, some authors have found positive effects of including redundant features (see, e.g., [16]), although usually in the context of a learning objective with an additional penalizer.

6 Conclusions

We have established ergodicity and various mixing properties of Conditional Markov Chains with unbounded feature functions. The main insight is that results similar to the setting with bounded feature functions can be obtained, however under additional assumptions on the distribution of the observations. In particular, the proof of Theorem 2 has shown that the sequence of observations needs to satisfy conditions under which Hoeffding-type concentration inequalities can be obtained. The proof of Theorem 3 has revealed an interesting interplay between mixing rates, feature functions, and the tail behaviour of the distribution of observations. By applying the mixing properties to the empirical likelihood functions we have obtained necessary conditions for the Maximum Likelihood estimates to be strongly consistent. 
We see a couple of interesting problems for future research: establishing Central Limit Theorems for Conditional Markov Chains; deriving bounds for the asymptotic variance of Maximum Likelihood estimates; constructing tests for the significance of features; generalizing the estimation theory to an infinite number of features; and, finally, finding sufficient conditions for model identifiability.

References

[1] Sinn, M. & Poupart, P. (2011) Asymptotic theory for linear-chain conditional random fields. In Proc. of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
[2] Lafferty, J., McCallum, A. & Pereira, F. (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning (ICML).
[3] Sutton, C. & McCallum, A. (2006) An introduction to conditional random fields for relational learning. In: Getoor, L. & Taskar, B. (editors), Introduction to Statistical Relational Learning. Cambridge, MA: MIT Press.
[4] Hofmann, T., Schölkopf, B. & Smola, A.J. (2008) Kernel methods in machine learning. The Annals of Statistics, Vol. 36, No. 3, 1171-1220.
[5] Xiang, R. & Neville, J. (2011) Relational learning with one network: an asymptotic analysis. In Proc. of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
[6] Seneta, E. (2006) Non-Negative Matrices and Markov Chains. Revised Edition. New York, NY: Springer.
[7] Wainwright, M.J. & Jordan, M.I. (2008) Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, Vol. 1, Nos. 1-2, 1-305.
[8] Cornfeld, I.P., Fomin, S.V. & Sinai, Y.G. (1982) Ergodic Theory. Berlin, Germany: Springer.
[9] Orey, S. (1991) Markov chains with stochastically stationary transition probabilities. The Annals of Probability, Vol. 19, No. 
3, 907-928.
[10] Hernández-Lerma, O. & Lasserre, J.B. (2003) Markov Chains and Invariant Probabilities. Basel, Switzerland: Birkhäuser.
[11] Foguel, S.R. (1969) The Ergodic Theory of Markov Processes. Princeton, NJ: Van Nostrand.
[12] Samson, P.-M. (2000) Concentration of measure inequalities for Markov chains and Φ-mixing processes. The Annals of Probability, Vol. 28, No. 1, 416-461.
[13] Kontorovich, L. & Ramanan, K. (2008) Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, Vol. 36, No. 6, 2126-2158.
[14] Sha, F. & Pereira, F. (2003) Shallow parsing with conditional random fields. In Proc. of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL).
[15] Lehmann, E.L. (1999) Elements of Large-Sample Theory. New York, NY: Springer.
[16] Hoefel, G. & Elkan, C. (2008) Learning a two-stage SVM/CRF sequence classifier. In Proc. of the 17th ACM International Conference on Information and Knowledge Management (CIKM).
", "award": [], "sourceid": 892, "authors": [{"given_name": "Mathieu", "family_name": "Sinn", "institution": null}, {"given_name": "Bei", "family_name": "Chen", "institution": null}]}