{"title": "Probabilistic Deterministic Infinite Automata", "book": "Advances in Neural Information Processing Systems", "page_first": 1930, "page_last": 1938, "abstract": "We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop and sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference both on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models.", "full_text": "Probabilistic Deterministic In\ufb01nite Automata\n\nDavid Pfau\n\n{pfau@neurotheory,{bartlett,fwood}@stat}.columbia.edu\n\nColumbia University, New York, NY 10027, USA\n\nNicholas Bartlett\n\nFrank Wood\n\nAbstract\n\nWe propose a novel Bayesian nonparametric approach to learning with probabilis-\ntic deterministic \ufb01nite automata (PDFA). We de\ufb01ne and develop a sampler for a\nPDFA with an in\ufb01nite number of states which we call the probabilistic determin-\nistic in\ufb01nite automata (PDIA). Posterior predictive inference in this model, given\na \ufb01nite training sequence, can be interpreted as averaging over multiple PDFAs of\nvarying structure, where each PDFA is biased towards having few states. We sug-\ngest that our method for averaging over PDFAs is a novel approach to predictive\ndistribution smoothing. 
We test PDIA inference on PDFA structure learning and on natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models.\n\n1 Introduction\n\nThe focus of this paper is a novel Bayesian framework for learning with probabilistic deterministic finite automata (PDFA) [9]. A PDFA is a generative model for sequential data (PDFAs are reviewed in Section 2). Intuitively, a PDFA is similar to a hidden Markov model (HMM) [10] in that it consists of a set of states, each of which when visited emits a symbol according to an emission probability distribution. It differs from an HMM in how state-to-state transitions occur: transitions are deterministic in a PDFA and nondeterministic in an HMM.\n\nIn our framework for learning with PDFAs we specify a prior over the parameters of a single large PDFA that encourages state reuse. The inductive bias introduced by the PDFA prior provides a soft constraint on the number of states used to generate the data. We take the limit as the number of states becomes infinite, yielding a model we call the probabilistic deterministic infinite automaton (PDIA). Given a finite training sequence, the PDIA posterior distribution is an infinite mixture of PDFAs. Samples from this distribution form a finite sample approximation to this infinite mixture, and can be drawn via Markov chain Monte Carlo (MCMC) [6]. Using such a mixture we can average over our uncertainty about the model parameters (including state cardinality) in a Bayesian way during prediction and other inference tasks. 
We find that averaging over a finite number of PDFAs trained on naturalistic data leads to better predictive performance than using a single \u201cbest\u201d PDFA.\n\nWe chose to investigate learning with PDFAs because they are intermediate in expressive power between HMMs and finite-order Markov models, and thus strike a good balance between generalization performance and computational efficiency. A single PDFA is known to have relatively limited expressivity. We argue that a finite mixture of PDFAs has greater expressivity than that of a single PDFA but is not as expressive as a probabilistic nondeterministic finite automaton (PNFA).1 A PDIA is clearly highly expressive; an infinite mixture over the same is even more so. Even though ours is a Bayesian approach to PDIA learning, in practice we only ever deal with a finite approximation to the full posterior and thus limit our discussion to finite mixtures of PDFAs.\n\n1PNFAs with no final probability are equivalent to hidden Markov models [3].\n\nWhile model expressivity is a concern, computational considerations often dominate model choice. We show that prediction in a trained mixture of PDFAs can have lower asymptotic cost than forward prediction in the PNFA/HMM class of models. We also present evidence that averaging over PDFAs gives predictive performance superior to HMMs trained with standard methods on naturalistic data. We find that PDIA predictive performance is competitive with that of fixed-order, smoothed Markov models with the same number of states. 
While sequence learning approaches such as the HMM and smoothed Markov models are well known and now highly optimized, our PDIA approach to learning is novel and is amenable to future improvement.\n\nSection 2 reviews PDFAs, Section 3 introduces Bayesian PDFA inference, Section 4 presents experimental results on DNA and natural language, and Section 5 discusses related work on PDFA induction and the theoretical expressive power of mixtures of PDFAs. In Section 6 we discuss ways in which PDIA predictive performance might be improved in future research.\n\n2 Probabilistic Deterministic Finite Automata\n\nA PDFA is formally defined as a 5-tuple M = (Q, \u03a3, \u03b4, \u03c0, q0), where Q is a finite set of states, \u03a3 is a finite alphabet of observable symbols, \u03b4 : Q \u00d7 \u03a3 \u2192 Q is the transition function from a state/symbol pair to the next state, \u03c0 : Q \u00d7 \u03a3 \u2192 [0, 1] is the probability of the next symbol given a state, and q0 is the initial state.2 Throughout this paper we will use i to index elements of Q, j to index elements of \u03a3, and t to index elements of an observed string. For example, \u03b4ij is shorthand for \u03b4(qi, \u03c3j), where qi \u2208 Q and \u03c3j \u2208 \u03a3.\n\nGiven a state qi, the probability that the next symbol takes the value \u03c3j is given by \u03c0(qi, \u03c3j). We use the shorthand \u03c0qi for the state-specific discrete distribution over symbols for state qi. We can also write \u03c3|qi \u223c \u03c0qi, where \u03c3 is a random variable that takes values in \u03a3. Given a state qi and a symbol \u03c3j, however, the next state qi\u2032 is deterministic: qi\u2032 = \u03b4(qi, \u03c3j). Generating from a PDFA involves first generating a symbol stochastically given the state the process is in: xt|\u03bet \u223c \u03c0\u03bet, where \u03bet \u2208 Q is the state at time t. 
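As a concrete illustration of these generative semantics (stochastic emission, deterministic transition), the following is a minimal sketch; the class and method names are our own invention, not code from the paper:

```python
import math
import random

# Minimal PDFA sketch.  States are hashable labels, delta maps
# (state, symbol) -> next state, and pi maps each state to a dict of
# symbol -> emission probability.
class PDFA:
    def __init__(self, delta, pi, q0):
        self.delta, self.pi, self.q0 = delta, pi, q0

    def generate(self, T, rng=random):
        """Emit symbols stochastically; transition deterministically."""
        state, out = self.q0, []
        for _ in range(T):
            symbols, probs = zip(*self.pi[state].items())
            x = rng.choices(symbols, weights=probs)[0]
            out.append(x)
            state = self.delta[(state, x)]  # deterministic next state
        return "".join(out)

    def log_likelihood(self, xs):
        """Exact log-likelihood: the state path is fixed by q0, delta and xs."""
        state, ll = self.q0, 0.0
        for x in xs:
            ll += math.log(self.pi[state][x])
            state = self.delta[(state, x)]
        return ll
```

Because the state path is fully determined by the data, evaluating the likelihood is a single pass over the string, with no sum over hidden paths.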
Next, given \u03bet and xt, the process transitions deterministically to the next state: \u03bet+1 = \u03b4(\u03bet, xt). This is the reason for the confusing \u201cprobabilistic deterministic\u201d name for these models. Turning this around: given data, q0, and \u03b4, there is no uncertainty about the path through the states. This is a primary source of computational savings relative to HMMs.\n\nPDFAs are more general than nth-order Markov models (i.e. m-gram models, m = n + 1), but less expressive than hidden Markov models (HMMs) [3]. For the case of nth-order Markov models, we can construct a PDFA with one state per suffix x1x2...xn. Given a state and a symbol xn+1, the unique next state is the one corresponding to the suffix x2...xn+1. Thus nth-order Markov models are a subclass of PDFAs with O(|\u03a3|^n) states. For an HMM, given data and an initial distribution over states, there is a posterior probability for every path through the state space. PDFAs are those HMMs for which, given a unique start state, the posterior probability over paths is degenerate at a single path. As we explain in Section 5, mixtures of PDFAs are strictly more expressive than single PDFAs, but still less expressive than PNFAs.\n\n3 Bayesian PDFA Inference\n\nWe start our description of Bayesian PDFA inference by defining a prior distribution over the parameters of a finite PDFA. We then show how to analytically marginalize nuisance parameters out of the model and derive a Metropolis-Hastings sampler for posterior inference using the resulting collapsed representation. We discuss the limit of our model as the number of states in the PDFA goes to infinity. We call this limit the probabilistic deterministic infinite automaton (PDIA). 
We develop\na PDIA sampler that carries over from the \ufb01nite case in a natural way.\n\n3.1 A PDFA Prior\n\nWe assume that the set of states Q, set of symbols \u03a3, and initial state q0 of a PDFA are known but\nthat the transition and emission functions are unknown. The PDFA prior then consists of a prior\nover both the transition function \u03b4 and the emission probability function \u03c0. In the \ufb01nite case \u03b4 and\n\n2In general q0 may be replaced by a distribution over initial states.\n\n2\n\n\f\u03c0 are representable as \ufb01nite matrices, with one column per element of \u03a3 and one row per element\nof Q. For each column j (j co-indexes columns and set elements) of the transition matrix \u03b4, our\nprior stipulates that the elements of that column are i.i.d. draws from a discrete distribution \u03c6j over\nQ, that is, \u03b4ij \u223c [\u03c61, . . . , \u03c6|\u03a3|], 0 \u2264 i \u2264 |Q| \u2212 1. The \u03c6j represent transition tendencies given\na symbol, if the ith element of \u03c6j is large then state qi is likely to be transitioned to anytime the\nlast symbol was \u03c3j. The \u03c6j\u2019s are themselves given a shared Dirichlet prior with parameters \u03b1\u00b5,\nwhere \u03b1 is a concentration and \u00b5 is a template transition probability vector. If the ith element of \u00b5\nis large then the ith state is likely to be transitioned to regardless of the emitted symbol. We place\na uniform Dirichlet prior on \u00b5 itself, with \u03b3 total mass and average over \u00b5 during inference. This\nhierarchical Dirichlet construction encourages both general and context speci\ufb01c state reuse. We also\nplace a uniform Dirichlet prior over the per-state emission probabilities \u03c0qi with \u03b2 total mass which\nsmooths emission distribution estimates. Formally:\n\n\u00b5|\u03b3,|Q| \u223c Dir (\u03b3/|Q|, . . . , \u03b3/|Q|)\n\u03c6j|\u03b1, \u00b5 \u223c Dir(\u03b1\u00b5)\n\u03c0qi|\u03b2,|\u03a3| \u223c Dir(\u03b2/|\u03a3|, . . . 
, \u03b2/|\u03a3|)\n\n\u03b4ij \u223c \u03c6j\n\n(1)\n(2)\n\nwhere 0 \u2264 i \u2264 |Q| \u2212 1 and 1 \u2264 j \u2264 |\u03a3|. Given a sample from this model we can run the PDFA\nto generate a sequence of T symbols. Using \u03bet to denote the state of the PDFA at position t in the\nsequence:\n\n\u03be0 = q0,\n\nx0 \u223c \u03c0q0,\n\n\u03bet = \u03b4(\u03bet\u22121, xt\u22121),\n\nxt \u223c \u03c0\u03bet\n\nWe choose this particular inductive bias, with transitions tied together within a column of \u03b4, because\nwe wanted the most recent symbol emission to be informative about what the next state is. If we\ninstead had a single Dirichlet prior over all elements of \u03b4, transitions to a few states would be highly\nlikely no matter the context and those states would dominate the behavior of the automata. If we\ntied together rows of \u03b4 instead of columns, being in a particular state would tell us more about the\nsequence of states we came from than the symbols that got us there.\nNote that this prior stipulates a fully connected PDFA in which all states may transition to all others\nand all symbols may be emitted from each state. This is slightly different that the canonical \ufb01nite\nstate machine literature where sparse connectivity is usually the norm.\n\n3.2 PDFA Inference\n\nGiven observational data, we are interested in learning a posterior distribution over PDFAs. We do\nthis by GIbbs sampling the transition matrix \u03b4 with \u03c0 and \u03c6j integrated out. To start inference we\nneed the likelihood function for a \ufb01xed PDFA; it is given by\n\np(x0:T|\u03c0, \u03b4) = \u03c0(\u03be0, x0)\n\n\u03c0(\u03bet, xt).\n\nT(cid:89)\n\nt=1\n\nmatrix \u03b4 as cij = (cid:80)T\n\nRemember that \u03bet|\u03bet\u22121, xt\u22121 is deterministic given the transition function \u03b4. We can marginalize \u03c0\nout of this expression and express the likelihood of the data in a form that depends only on the counts\nof symbols emitted from each state. 
De\ufb01ne the count matrix c for the sequence x0:T and transition\nt=0 Iij(\u03bet, xt), where Iij(\u03bet, xt) is an indicator function for the automaton\nbeing in state qi when it generates xt, i.e. \u03bet = qi and xt = \u03c3j. This matrix c = [cij] gives the\nnumber of times each symbol is emitted from each state. Due to multinomial-Dirichlet conjugacy\nwe can express the probability of a sequence given the transition function \u03b4, the count matrix c and\n\u03b2:\n\np(x0:T|\u03b4, c, \u03b2) =\n\np(x0:T|\u03c0, \u03b4)p(\u03c0|\u03b2)d\u03c0 =\n\n(3)\n\n(cid:90)\n\n|Q|\u22121(cid:89)\n\ni=0\n\n\u0393(\u03b2)\n\u0393( \u03b2|\u03a3|)|\u03a3|\n\n(cid:81)|\u03a3|\n\u0393(\u03b2 +(cid:80)|\u03a3|\n\nj=1 \u0393( \u03b2|\u03a3| + cij)\nj=1 cij)\n\nIf the transition matrix \u03b4 is observed we have a closed-form expression for its likelihood given \u00b5\nwith all \u03c6j\u2019s marginalized out. Let vij be the number of times state qi is transitioned to given that\n\u03c3j was the last symbol emitted, i.e. vij is the number of times \u03b4i(cid:48)j = qi for all states i(cid:48) in the column\n\n3\n\n\fj. The marginal likelihood of \u03b4 in terms of \u00b5 is then:\n\np(\u03b4|\u00b5, \u03b1) =\n\np(\u03b4|\u03c6)p(\u03c6|\u00b5, \u03b1)d\u03c6 =\n\n(cid:90)\n\n|\u03a3|(cid:89)\n\nj=1\n\n(cid:81)|Q|\u22121\n\n\u0393(\u03b1)\n\ni=0 \u0393(\u03b1\u00b5i)\n\n(cid:81)|Q|\u22121\n\ni=0 \u0393(\u03b1\u00b5i + vij)\n\n\u0393(\u03b1 + |Q|)\n\n(4)\n\nWe perform posterior inference in the \ufb01nite model by sampling elements of \u03b4 and the vector \u00b5. One\ncan sample \u03b4ij given the rest of the matrix \u03b4\u2212ij using\n\np(\u03b4ij|\u03b4\u2212ij, x0:T , \u00b5, \u03b1) \u221d p(x0:T|\u03b4ij, \u03b4\u2212ij)p(\u03b4ij|\u03b4\u2212ij, \u00b5, \u03b1)\n\n(5)\n\nBoth terms on the right hand side of this equation have closed-form expressions, the \ufb01rst given in\n(3). 
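The collapsed emission likelihood (3) depends on the data only through the count matrix c, so it can be evaluated directly with log-gamma functions. A minimal sketch (hypothetical function name, standard library only):

```python
import math

def log_marginal_emission_likelihood(c, beta):
    """Sketch of Eq. (3): log p(x_{0:T} | delta, beta) with each state's
    emission distribution integrated out against a uniform Dirichlet prior
    with total mass beta.  c[i][j] is the number of times symbol j was
    emitted from state i along the deterministic state path."""
    n_sym = len(c[0])
    ll = 0.0
    for row in c:
        # log [ Gamma(beta) / Gamma(beta/|S|)^|S| ]
        ll += math.lgamma(beta) - n_sym * math.lgamma(beta / n_sym)
        # log prod_j Gamma(beta/|S| + c_ij)
        ll += sum(math.lgamma(beta / n_sym + cij) for cij in row)
        # - log Gamma(beta + sum_j c_ij)
        ll -= math.lgamma(beta + sum(row))
    return ll
```

For a single emission from a two-symbol alphabet with beta = 2 (a uniform Dir(1, 1) prior), this gives log(1/2), as expected.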
The second can be found from (4) and is\n\nP(\u03b4ij = qi\u2032 | \u03b4\u2212ij, \u03b1, \u00b5) = (\u03b1\u00b5i\u2032 + vi\u2032j) / (\u03b1 + |Q| \u2212 1)    (6)\n\nwhere vi\u2032j is the number of elements in column j equal to qi\u2032, excluding \u03b4ij. As |Q| is finite, we compute (5) for all values of \u03b4ij and normalize to produce the required conditional probability distribution.\n\nNote that in (3), the count matrix c may be profoundly impacted by changing even a single element of \u03b4. The values in c depend on the specific sequence of states the automaton used to generate x. Changing the value of a single element of \u03b4 affects the state trajectory the PDFA must follow to generate x0:T. Among other things this means that some elements of c that were nonzero may become zero, and vice versa.\n\nWe can reduce the computational cost of inference by deleting transitions \u03b4ij for which the corresponding counts cij become 0. In practical sampler implementations this means that one need not even represent transitions corresponding to zero counts. The likelihood of the data (3) does not depend on the value of \u03b4ij if symbol \u03c3j is never emitted while the machine is in state qi. In this case sampling from (5) is the same as sampling without conditioning on the data at all. Thus, if while sampling we change some transition that renders cij = 0 for some i and j, we can delete \u03b4ij until another transition is changed such that cij becomes nonzero again, at which point we sample \u03b4ij anew. Under the marginal joint distribution of a column of \u03b4 the row entries in that column are exchangeable, and so deleting an entry of \u03b4 has the same effect as marginalizing it out. When all \u03b4ij for some state qi are marginalized out, we can say the state itself is marginalized out. When we delete an element from a column of \u03b4, we replace the |Q| \u2212 1 in the denominator of (6) with D+j = \u2211_{i=0}^{|Q|\u22121} I(vij \u2260 0), the number of entries in the jth column of \u03b4 that are not marginalized out, yielding\n\nP(\u03b4ij = qi\u2032 | \u03b4\u2212ij, \u03b1, \u00b5) = (\u03b1\u00b5i\u2032 + vi\u2032j) / (\u03b1 + D+j).    (7)\n\nIf when sampling \u03b4ij it is assigned a state qi\u2032 such that some ci\u2032j\u2032 which was zero is now nonzero, we simply reinstantiate \u03b4i\u2032j\u2032 by drawing from (7) and update D+j\u2032. When sampling a single \u03b4ij there can be many such transitions, as the path through the machine dictated by x0:T may use many transitions in \u03b4 that were deleted. In this case we update incrementally, increasing D+j and vij as we go.\n\nWhile it is possible to construct a Gibbs sampler using (5) in this collapsed representation, such a sampler requires a Monte Carlo integration over a potentially large subset of the marginalized-out transitions in \u03b4, which may be costly. A simpler strategy is to pretend that all entries of \u03b4 exist but are sampled in a \u201cjust-in-time\u201d manner. This gives rise to a Metropolis-Hastings (MH) sampler for \u03b4 where the proposed value for \u03b4ij is either one of the instantiated states or any one of the equivalent marginalized-out states. Any time a marginalized-out element of \u03b4 is required, we can pretend as if we had just sampled its value; because its value had no effect on the likelihood of the data, we know that it would have been sampled directly from (7). It is in this sense that all marginalized-out states are equivalent \u2013 we know nothing more about their connectivity structure than that given by the prior in (7).\n\nFor the MH sampler, denote the set of non-marginalized-out \u03b4 entries \u03b4+ = {\u03b4ij : cij > 0}. 
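Sampling a transition entry from the collapsed conditional (7) amounts to drawing a candidate target state with unnormalized weight \u03b1\u00b5i\u2032 + vi\u2032j; the normalization is implicit. A small illustrative sketch (the helper name is ours):

```python
import random

def propose_transition(mu, v_col, alpha, rng=random):
    """Sketch of drawing delta_ij from Eq. (7): candidate target state q_i'
    gets unnormalised weight alpha * mu[i'] + v_col[i'], where v_col[i']
    counts the instantiated entries of column j equal to q_i'
    (excluding delta_ij itself)."""
    weights = [alpha * m + v for m, v in zip(mu, v_col)]
    return rng.choices(range(len(mu)), weights=weights)[0]
```

Normalizing these weights over the represented states reproduces the \u03b1 + D+j denominator of (7) up to the mass reserved for marginalized-out states.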
We propose a new value qi\u2217 for one \u03b4ij \u2208 \u03b4+ according to (7). The conditional posterior probability of this proposal is proportional to p(x0:T | \u03b4ij = qi\u2217, \u03b4+\u2212ij) P(\u03b4ij = qi\u2217 | \u03b4+\u2212ij). The Hastings correction exactly cancels the proposal probability in the accept/reject ratio, leaving an MH accept probability for \u03b4ij being set to qi\u2217, given that its previous value was qi\u2032, of\n\n\u03b1(\u03b4ij = qi\u2217 | \u03b4ij = qi\u2032) = min( 1, p(x0:T | \u03b4ij = qi\u2217, \u03b4+\u2212ij) / p(x0:T | \u03b4ij = qi\u2032, \u03b4+\u2212ij) )    (8)\n\nWhether qi\u2217 is marginalized out or not, evaluating p(x0:T | \u03b4ij = qi\u2217, \u03b4+\u2212ij) may require reinstantiating marginalized-out elements of \u03b4. As before, these values are sampled from (7) on a just-in-time schedule. If the new value is accepted, all \u03b4ij \u2208 \u03b4+ for which cij = 0 are removed, and we then move to the next transition in \u03b4 to sample.\n\nIn the finite case, one can sample \u00b5 by Metropolis-Hastings or use a MAP estimate as in [7]. Hyperparameters \u03b1, \u03b2 and \u03b3 can be sampled via Metropolis-Hastings updates. In our experiments we use Gamma(1,1) hyperpriors.\n\nTable 1: PDIA inference performance relative to HMM and fixed-order Markov models. Top rows: perplexity. Bottom rows: number of states in each model. For the PDIA this is an average number.\n\n               | PDIA  | PDIA-MAP | HMM-EM | bigram | trigram | 4-gram | 5-gram | 6-gram | SM\nAIW perplexity | 5.13  | 5.46     | 7.89   | 9.71   | 6.45    | 5.13   | 4.80   | 4.69   | 4.78\nAIW states     | 365.6 | 379      | 52     | 28     | 382     | 2,023  | 5,592  | 10,838 | 19,358\nDNA perplexity | 3.72  | 3.72     | 3.76   | 3.77   | 3.75    | 3.74   | 3.73   | 3.72   | 3.56\nDNA states     | 64.7  | 54       | 19     | 5      | 21      | 85     | 341    | 1,365  | 314,166\n\n3.3 The Probabilistic Deterministic Infinite Automaton\n\nWe would like to avoid placing a strict upper bound on the number of states so that model complexity can grow with the amount of training data. To see how to do this, consider what happens when |Q| \u2192 \u221e. In this case, the right hand sides of equations (1) and (2) must be replaced by infinite dimensional alternatives:\n\n\u00b5 \u223c PY(\u03b3, d0, H)\n\u03c6j \u223c PY(\u03b1, d, \u00b5)\n\u03b4ij | \u03c6j \u223c \u03c6j\n\nwhere PY stands for the Pitman-Yor process and H in our case is a geometric distribution over the integers with parameter \u03bb. The resulting hierarchical model becomes the hierarchical Pitman-Yor process (HPYP) over a discrete alphabet [14]. The discount parameters d0 and d are particular to the infinite case, and when both are zero the HPYP becomes the well known hierarchical Dirichlet process (HDP), which is the infinite dimensional limit of (1) and (2) [15]. Given a finite amount of data, there can only be nonzero counts for a finite number of state/symbol pairs, so our marginalization procedure from the finite case will yield a \u03b4 with at most T elements. Denote these non-marginalized-out entries by \u03b4+. We can sample the elements of \u03b4+ as before using (8), provided that we can propose from the HPYP. In many HPYP sampler representations this is easy to do. We use the Chinese restaurant franchise representation [15], in which the posterior predictive distribution of \u03b4ij given \u03b4+\u2212ij can be expressed with \u03c6j and \u00b5 integrated out as\n\nP(\u03b4ij = qi\u2032 | \u03b4+\u2212ij, \u03b1, \u03b3) = E[ (vi\u2032j \u2212 ki\u2032j d)/(\u03b1 + D+j) + ((\u03b1 + k\u00b7j d)/(\u03b1 + D+j)) \u00b7 ( (wi\u2032 \u2212 \u03bai\u2032 d0)/(\u03b3 + w\u00b7) + ((\u03b3 + \u03ba\u00b7 d0)/(\u03b3 + w\u00b7)) H(qi\u2032) ) ]    (9)\n\nwhere wi\u2032, ki\u2032j, \u03bai\u2032, w\u00b7 = \u2211_i wi, k\u00b7j = \u2211_i kij, and \u03ba\u00b7 = \u2211_i \u03bai are stochastic bookkeeping counts required by the Chinese restaurant franchise sampler. These counts must themselves be sampled [15]. The discount hyperparameters can also be sampled by Metropolis-Hastings.\n\n4 Experiments and Results\n\nFigure 1: Subsampled PDIA sampler trace for Alice in Wonderland. The top trace is the joint log likelihood of the model and training data, the bottom trace is the number of states.\n\nTo test our PDIA inference approach we evaluated it on discrete natural sequence prediction and compared its performance to HMMs and smoothed n-gram models. We trained the models on two datasets: a character sequence from Alice in Wonderland [2] and a short sequence of mouse DNA. The Alice in Wonderland (AIW) dataset was preprocessed to remove all characters but letters and spaces, shift all letters from upper to lower case, and split along sentence dividers to yield a 27-character alphabet (a-z and space). We trained on 100 random sentences (9,986 characters) and tested on 50 random sentences (3,891 characters). The mouse DNA dataset consisted of a fragment of chromosome 2 with 194,173 base pairs, which we treated as a single unbroken string. We used the first 150,000 base pairs for training and the rest for testing. For AIW, the state of the PDIA model was always set to q0 at the start of each sentence. For DNA, the state of the PDIA model at the start of the test data was set to the last state of the model after accepting the training data. 
We placed Gamma(1,1) priors over \u03b1, \u03b2 and \u03b3, set \u03bb = .001, and used uniform priors for d0 and d. We evaluated the performance of the learned models by calculating the average per-character predictive perplexity of the test data. For training data x1:T and test data y1:T\u2032 this is given by 2^(\u2212(1/T\u2032) log2 P(y1:T\u2032 | x1:T)). It is a measure of the average uncertainty the model has about what character comes next given the sequence up to that point, and is at most |\u03a3|. We evaluated the probability of the test data incrementally, integrating the test data into the model in the standard Bayesian way.\n\nTest perplexity results are shown in Table 1 on the first line of each subtable. Each sampler sweep passed through every instantiated transition. Every fifth sample for AIW and every tenth sample for DNA after burn-in was used for prediction. For AIW, we ran 15,000 burn-in samples and used 3,500 samples for predictive inference. Subsampled sampler diagnostic plots that demonstrate the convergence properties of our sampler are shown in Figure 1. When modeling the DNA dataset we burn in for 1,000 samples and use 900 samples for inference. For the smoothed n-gram models, we report thousand-sample average perplexity results for hierarchical Pitman-Yor process (HPYP) [14] models of varying Markov order (1 through 5, notated as bigram through 6-gram) after burning each model in for one hundred samples. We also show the performance of the single-particle incremental variant of the sequence memoizer (SM) [5], the SM being the limit of an n-gram model as n \u2192 \u221e. 
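The per-character predictive perplexity defined above can be computed from the incremental log2 predictive probabilities of the test characters; a small sketch (hypothetical helper name):

```python
import math

def per_char_perplexity(log2_probs):
    """Average per-character predictive perplexity,
    2^(-(1/T') * sum_t log2 P(y_t | y_{<t}, x_{1:T})), given the
    incremental log2 predictive probability of each test character."""
    return 2.0 ** (-sum(log2_probs) / len(log2_probs))
```

A model that is uniform over a 27-character alphabet scores perplexity 27, the maximum |\u03a3| mentioned above.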
We also show results for a hidden Markov model (HMM) [8] trained using expectation-maximization (EM). We determined the best number of hidden states by cross-validation on the test data (a procedure used here to produce optimistic HMM performance for comparison purposes only).\n\nThe performance of the PDIA exceeds that of the HMM and is approximately equal to that of a smoothed 4-gram model, though it does not outperform very deep, smoothed Markov models. This is in contrast to [16], which found that PDFAs trained on natural language data were able to predict as well as unsmoothed trigrams, but were significantly worse than smoothed trigrams, even when averaging over multiple learned PDFAs. As can be seen in the second line of each subtable in Table 1, the MAP number of states learned by the PDIA is significantly lower than that of the n-gram model with equal predictive performance.\n\nUnlike the HMM, the computational complexity of PDFA prediction does not depend on the number of states in the model, because only a single path through the states is followed. This means that the asymptotic cost of prediction for the PDIA is O(LT\u2032), where L is the number of posterior samples and T\u2032 is the length of the test sequence. For any single HMM it is O(KT\u2032), where K is the number of states in the HMM. This is because all possible paths must be followed to achieve the given HMM predictive performance (although a subset of possible paths could be followed if doing approximate inference).\n\nFigure 2: Two PNFAs outside the class of PDFAs. (a) can be represented by a mixture of two PDFAs, one following the right branch from state 0, the other following the left branch. (b), in contrast, cannot be represented by any finite mixture of PDFAs.\n\n
In PDIA inference we too can choose the number of samples used for prediction, but here even a single sample has empirical prediction performance superior to averaging over all paths in an HMM. The computational complexity of smoothed n-gram inference is equivalent to that of PDIA inference; however, the storage cost for the large n-gram models is significantly higher than that of the estimated PDIA for the same predictive performance.\n\n5 Theory and Related Work\n\nThe PDIA posterior distribution takes the form of an infinite mixture of PDFAs. In practice, we run a sampler for some number of iterations and approximate the posterior with a finite mixture of PDFAs. For this reason, we now consider the expressive power of finite mixtures of PDFAs. We show that they are strictly more expressive than PDFAs, but strictly less expressive than hidden Markov models. Probabilistic nondeterministic finite automata (PNFA) are a strictly larger model class than PDFAs. For example, the PNFA in Figure 2(a) cannot be expressed as a PDFA [3]. However, it can be expressed as a mixture of two PDFAs, one with Q = {q0, q1, q3} and the other with Q = {q0, q2, q3}. Thus mixtures of PDFAs are a strictly larger model class than PDFAs. In general, any PNFA where the nondeterministic transitions can only be visited once can be expressed as a mixture of PDFAs. However, if we replace transitions to q3 with transitions to q0, as in Figure 2(b), there is no longer any equivalent finite mixture of PDFAs, since the nondeterministic branch from q0 can be visited an arbitrary number of times.\n\nPrevious work on PDFA induction has focused on accurately discovering model structure when the true generative mechanism is a PDFA. State merging algorithms do this by starting with the trivial PDFA that only accepts the training data and merging states that pass a similarity test [1, 17], and have been proven to identify the correct model in the limit of infinite data. 
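The mixture construction for a branch-once PNFA like Figure 2(a) can be checked numerically. The topology and edge probabilities below (emit A from q0, then branch with probability 0.5 to a state emitting A/0.8, B/0.2 or to one emitting A/0.6, B/0.4) are read off the figure labels and should be treated as an illustrative assumption:

```python
# Branch-once PNFA vs. an equal-weight mixture of two PDFAs.
# Topology and probabilities are assumptions based on Figure 2(a).

def pnfa_prob(s):
    """P(s) under the PNFA: the branch taken after the first A is hidden,
    so probabilities sum over both branches."""
    if len(s) != 2 or s[0] != "A":
        return 0.0
    return 0.5 * {"A": 0.8, "B": 0.2}[s[1]] + 0.5 * {"A": 0.6, "B": 0.4}[s[1]]

def pdfa_prob(s, second_step):
    """P(s) under a PDFA committed to one branch (first step emits A surely)."""
    if len(s) != 2 or s[0] != "A":
        return 0.0
    return second_step[s[1]]

def mixture_prob(s):
    """Equal-weight mixture of the two deterministic machines."""
    return 0.5 * pdfa_prob(s, {"A": 0.8, "B": 0.2}) \
         + 0.5 * pdfa_prob(s, {"A": 0.6, "B": 0.4})
```

Under these assumptions the mixture assigns every string the same probability as the PNFA, illustrating why a branch visited only once can be pushed into the mixture weights.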
State splitting algorithms start at the opposite extreme, with the trivial single-state PDFA, and split states that pass a difference test [12, 13]. These algorithms return only a deterministic estimate, while ours naturally expresses uncertainty about the learned model.\n\nTo test if we can learn the generative mechanism given our inductive bias, we trained the PDIA on data from three synthetic grammars: the even process [13], the Reber grammar [11] and the Feldman grammar [4], which have up to 7 states and 7 symbols in the alphabet. In each case the mean number of states discovered by the model approached the correct number as more data was used in training. Results are presented in Figure 3. Furthermore, the predictive performance of the PDIA was nearly equivalent to that of the actual data generating mechanism.\n\nFigure 3: Three synthetic PDFAs: (a) even process [13], (b) Reber grammar [11], (c) Feldman grammar [4]. (d) Posterior marginal PDIA state cardinality distribution: posterior mean and standard deviation of the number of states discovered during PDIA inference for varying amounts of data generated by each of the synthetic PDFAs. PDIA inference discovers PDFAs with the correct number of states.\n\n6 Discussion\n\nOur Bayesian approach to PDIA inference can be interpreted as a stochastic search procedure for PDFA structure learning where the number of states is unknown. In Section 5 we presented evidence that PDFA samples from our PDIA inference algorithm have the same characteristics as the true generative process. This in and of itself may be of interest to the PDFA induction community.\n\nWe ourselves are more interested in establishing new ways to produce smoothed predictive conditional distributions. 
Inference in the PDIA presents a completely new approach to smoothing: smoothing by averaging over PDFA model structure rather than hierarchically smoothing related emission distribution estimates. Our PDIA approach gives us an attractive ability to trade off between model simplicity in terms of number of states, computational complexity in terms of asymptotic cost of prediction, and predictive perplexity. While our PDIA approach may not yet outperform the best smoothing Markov model approaches in terms of predictive perplexity alone, it does outperform them in terms of model complexity required to achieve the same predictive perplexity, and outperforms HMMs in terms of asymptotic time complexity of prediction. This suggests that a future combination of smoothing over model structure and smoothing over emission distributions could produce excellent results. PDIA inference gives researchers another tool to choose from when building models. If very fast prediction is desirable and the predictive perplexity difference between the PDIA and, for instance, the most competitive n-gram is insignificant from an application perspective, then doing finite sample inference in the PDIA offers a significant computational advantage in terms of memory.\n\nWe believe the most promising approach to improving PDIA predictive performance is to construct a smoothing hierarchy over the state-specific emission distributions, as is done in the smoothing n-gram models. For an n-gram, where every state corresponds to a suffix of the sequence, the predictive distribution for a suffix is smoothed by the predictive distribution for a shorter suffix, for which there are more observations. This makes it possible to increase the size of the model indefinitely without generalization performance suffering [18]. In the PDIA, by contrast, the predictive probabilities for states are not tied together. 
Since states of the PDIA are not uniquely identified by suffixes, it is no longer clear what the natural smoothing hierarchy is. It is somewhat surprising that PDIA learning works nearly as well as n-gram modeling even without a smoothing hierarchy for its emission distributions. Imposing a hierarchical smoothing of the PDIA emission distributions remains an open problem.

References

[1] R. Carrasco and J. Oncina. Learning stochastic regular grammars by means of a state merging method. Grammatical Inference and Applications, pages 139-152, 1994.

[2] L. Carroll. Alice's Adventures in Wonderland. Macmillan, 1865. URL http://www.gutenberg.org/etext/11.

[3] P. Dupont, F. Denis, and Y. Esposito. Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognition, 38(9):1349-1371, 2005.

[4] J. Feldman and J.F. Hanna. The structure of responses to a sequence of binary events. Journal of Mathematical Psychology, 3(2):371-387, 1966.

[5] J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In Data Compression Conference 2010, pages 337-345, 2010.

[6] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995.

[7] D. J. C. MacKay and L. C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(2):289-307, 1995.

[8] K. Murphy. Hidden Markov model (HMM) toolbox for Matlab, 2005. URL http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html.

[9] M.O. Rabin.
Probabilistic automata. Information and Control, 6(3):230-245, 1963.

[10] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257-286, 1989.

[11] A.S. Reber. Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6(6):855-863, 1967.

[12] D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2):117-149, 1996.

[13] C.R. Shalizi and K.L. Shalizi. Blind construction of optimal nonlinear recursive predictors for discrete sequences. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 504-511. AUAI Press, 2004.

[14] Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, pages 985-992, 2006.

[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.

[16] F. Thollard. Improving probabilistic grammatical inference core algorithms with post-processing techniques. In Eighteenth International Conference on Machine Learning, pages 561-568, 2001.

[17] F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Seventeenth International Conference on Machine Learning, pages 975-982, 2000.

[18] F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh.
A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, pages 1129-1136, Montreal, Canada, 2009.