{"title": "The Power of Amnesia", "book": "Advances in Neural Information Processing Systems", "page_first": 176, "page_last": 183, "abstract": null, "full_text": "The Power of Amnesia \n\nDana Ron Yoram Singer Naftali Tishby \n\nInstitute of Computer Science and \n\nCenter for Neural Computation \n\nHebrew University, Jerusalem 91904, Israel \n\nAbstract \n\nWe propose a learning algorithm for a variable memory length Markov process. Human communication, whether given as text, handwriting, or speech, has multiple characteristic time scales. On short scales it is characterized mostly by the dynamics that generate the process, whereas on large scales more syntactic and semantic information is carried. For that reason the conventionally used fixed memory Markov models cannot effectively capture the complexity of such structures. On the other hand, using long memory models uniformly is not practical even for a memory as short as four. The algorithm we propose is based on minimizing the statistical prediction error by extending the memory, or state length, adaptively, until the total prediction error is sufficiently small. We demonstrate the algorithm by learning the structure of natural English text and applying the learned model to the correction of corrupted text. Using less than 3000 states the model's performance is far superior to that of fixed memory models with a similar number of states. We also show how the algorithm can be applied to intergenic E. coli DNA base prediction with results comparable to HMM based methods. \n\n1 Introduction \n\nMethods for automatically acquiring the structure of the human language are attracting increasing attention. One of the main difficulties in modeling natural language is its multiple temporal scales. As has been known for many years, language is far more complex than any finite memory Markov source. 
Yet Markov models are powerful tools that capture the short scale statistical behavior of language, whereas long memory models are generally impossible to estimate. The obvious desired solution is a Markov source with a 'deep' memory just where it is really needed. Variable memory length Markov models have been in use for language modeling in speech recognition for some time [3, 4], yet no systematic derivation, nor rigorous analysis, of such learning mechanisms has been proposed. \n\nMarkov models are a natural candidate for language modeling and temporal pattern recognition, mostly due to their mathematical simplicity. It is nevertheless obvious that finite memory Markov models cannot in any way capture the recursive nature of language, nor can they be trained effectively with long enough memory. The notion of a variable length memory seems to appear naturally also in the context of universal coding [6]. This information theoretic notion is now known to be closely related to efficient modeling [7]. The natural measure that appears in information theory is the description length, as measured by the statistical predictability via the Kullback-Leibler (KL) divergence. \n\nThe algorithm we propose here is based on optimizing the statistical prediction of a Markov model, measured by the instantaneous KL divergence of the following symbols, or by the current statistical surprise of the model. The memory is extended precisely when such a surprise is significant, until the overall statistical prediction of the stochastic model is sufficiently good. We apply this algorithm successfully to statistical language modeling. Here we demonstrate its ability for spelling correction of corrupted English text. We also show how the algorithm can be applied to intergenic E. coli DNA base prediction with results comparable to HMM based methods. 
\n\n2 Prediction Suffix Trees and Finite State Automata \n\nDefinitions and Notations \n\nLet Σ be a finite alphabet. Denote by Σ* the set of all strings over Σ. A string s over Σ* of length n is denoted by s = s_1 s_2 ... s_n. We denote by e the empty string. The length of a string s is denoted by |s| and the size of an alphabet Σ is denoted by |Σ|. Let Prefix(s) = s_1 s_2 ... s_{n-1} denote the longest prefix of a string s, and let Prefix*(s) denote the set of all prefixes of s, including the empty string. Similarly, Suffix(s) = s_2 s_3 ... s_n, and Suffix*(s) is the set of all suffixes of s. A set of strings S is called a prefix free set if, ∀ s_1, s_2 ∈ S: {s_1} ∩ Prefix*(s_2) = ∅. We call a probability measure P over the strings in Σ* proper if P(e) = 1 and, for every string s, ∑_{σ∈Σ} P(sσ) = P(s). Hence, for every prefix free set S, ∑_{s∈S} P(s) ≤ 1, and specifically for every integer n ≥ 0, ∑_{s∈Σ^n} P(s) = 1. \n\nPrediction Suffix Trees \n\nA prediction suffix tree T over Σ is a tree of degree |Σ|. The edges of the tree are labeled by symbols from Σ, such that from every internal node there is at most one outgoing edge labeled by each symbol. The nodes of the tree are labeled by pairs (s, γ_s) where s is the string associated with the walk starting from that node and ending in the root of the tree, and γ_s : Σ → [0,1] is the output probability function related with s, satisfying ∑_{σ∈Σ} γ_s(σ) = 1. A prediction suffix tree induces probabilities on arbitrarily long strings in the following manner. The probability that T generates a string w = w_1 w_2 ... w_n in Σ^n, denoted by P_T(w), is ∏_{i=1}^{n} γ_{s^{i-1}}(w_i), where s^0 = e, and for 1 ≤ i ≤ n-1, s^i is the string labeling the deepest node reached by taking the walk corresponding to w_1 ... w_i starting at the root of T. By definition, a prediction suffix tree induces a proper measure over Σ*, and hence for every prefix free set of strings {w^1, ... 
, w^m}, ∑_{i=1}^{m} P_T(w^i) ≤ 1, and specifically for n ≥ 1, ∑_{s∈Σ^n} P_T(s) = 1. An example of a prediction suffix tree is depicted in Fig. 1 on the left, where the nodes of the tree are labeled by the corresponding suffix they represent. \n\nFigure 1: Left: A prediction suffix tree over Σ = {0, 1}. The strings written in the nodes are the suffixes the nodes represent. For each node there is a probability vector over the next possible symbols. For example, the probability of observing a '1' after observing the string '010' is 0.3. Right: The equivalent probabilistic finite automaton. Bold edges denote transitions with the symbol '1' and dashed edges denote transitions with '0'. The states of the automaton are the leaves of the tree except for the leaf denoted by the string 1, which was replaced by the prefixes of the strings 010 and 110: 01 and 11. \n\nFinite State Automata and Markov Processes \n\nA Probabilistic Finite Automaton (PFA) A is a 5-tuple (Q, Σ, τ, γ, π), where Q is a finite set of n states, Σ is an alphabet of size k, τ : Q × Σ → Q is the transition function, γ : Q × Σ → [0,1] is the output probability function, and π : Q → [0,1] is the probability distribution over the starting states. The functions γ and π must satisfy the following requirements: for every q ∈ Q, ∑_{σ∈Σ} γ(q, σ) = 1, and ∑_{q∈Q} π(q) = 1. The probability that A generates a string s = s_1 s_2 ... s_n ∈ Σ^n is P_A(s) = ∑_{q^0∈Q} π(q^0) ∏_{i=1}^{n} γ(q^{i-1}, s_i), where q^i = τ(q^{i-1}, s_i). \n\nWe are interested in learning a sub-class of finite state machines which have the following property. Each state in a machine M belonging to this sub-class is labeled by a string of length at most L over Σ, for some L ≥ 0. The set of strings labeling the states is suffix free. 
We require that for every two states q_1, q_2 ∈ Q and for every symbol σ ∈ Σ, if τ(q_1, σ) = q_2 and q_1 is labeled by a string s_1, then q_2 is labeled by a string s_2 which is a suffix of s_1σ. Since the set of strings labeling the states is suffix free, if there exists a string having this property then it is unique. Thus, in order that τ be well defined on a given set of strings S, not only must the set be suffix free, but it must also have the property that for every string s in the set and every symbol σ, there exists a string in the set which is a suffix of sσ. For our convenience, from this point on, if q is a state in Q then q will also denote the string labeling that state. \n\nA special case of these automata is the case in which Q includes all |Σ|^L strings of length L. These automata are known as Markov processes of order L. We are interested in learning automata for which the number of states, n, is actually much smaller than |Σ|^L, which means that few states have \"long memory\" and most states have a short one. We refer to these automata as Markov processes with bounded memory L. In the case of Markov processes of order L, the \"identity\" of the states (i.e. the strings labeling the states) is known, and learning such a process reduces to approximating the output probability function. When learning Markov processes with bounded memory, the task of a learning algorithm is much more involved, since it must reveal the identity of the states as well. \n\nIt can be shown that under a slightly more complicated definition of prediction suffix trees, and assuming that the initial distribution on the states is the stationary distribution, these two models are equivalent up to a growth in size which is at most linear in L. The proof of this equivalence is beyond the scope of this paper, yet the transformation from a prediction suffix tree to a finite state automaton is rather simple. 
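As a concrete illustration of the definitions above, the following sketch computes the probability P_T(w) = ∏_i γ_{s^{i-1}}(w_i) that a prediction suffix tree assigns to a string. This is an illustrative rendering under our own assumptions (the `PSTNode` class and function names are ours, not the authors' implementation):

```python
# Minimal sketch (not the authors' code): a prediction suffix tree and the
# probability it induces on strings, following the definitions of Section 2.

class PSTNode:
    def __init__(self, probs):
        self.probs = probs      # gamma_s: next-symbol probability function
        self.children = {}      # symbol -> child node (one-symbol-longer suffix)

def deepest_suffix_node(root, context):
    """Walk from the root, reading the context from its end backwards,
    and return the deepest matching node (the state s^{i-1})."""
    node = root
    for sym in reversed(context):
        if sym in node.children:
            node = node.children[sym]
        else:
            break
    return node

def string_probability(root, w):
    """P_T(w) = prod_i gamma_{s^{i-1}}(w_i), with s^0 = e (the root)."""
    p = 1.0
    for i, sym in enumerate(w):
        node = deepest_suffix_node(root, w[:i])
        p *= node.probs[sym]
    return p

# Tiny example tree: the root predicts uniformly; after a '0' the model
# favors another '0'. These numbers are arbitrary illustrations.
root = PSTNode({'0': 0.5, '1': 0.5})
root.children['0'] = PSTNode({'0': 0.7, '1': 0.3})
```

For instance, `string_probability(root, '00')` is 0.5 · 0.7 = 0.35: the first symbol is predicted by the root (empty context), the second by the node for the suffix '0'.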
Roughly speaking, in order to implement a prediction suffix tree by a finite state automaton, we define the leaves of the tree to be the states of the automaton. If the transition function of the automaton, τ(·,·), cannot be well defined on this set of strings, we might need to slightly expand the tree and use the leaves of the expanded tree. The output probability function of the automaton, γ(·,·), is defined based on the prediction values of the leaves of the tree, i.e., for every state (leaf) s and every symbol σ, γ(s, σ) = γ_s(σ). The outgoing edges from the states are defined as follows: τ(q_1, σ) = q_2, where q_2 is the state such that q_2 ∈ Suffix*(q_1σ). An example of a finite state automaton which corresponds to the prediction tree depicted in Fig. 1 on the left is depicted on the right part of the figure. \n\n3 Learning Prediction Suffix Trees \n\nGiven a sample consisting of one sequence of length l, or m sequences of lengths l_1, l_2, ..., l_m, we would like to find a prediction suffix tree that has the same statistical properties as the sample and thus can be used to predict the next outcome for sequences generated by the same source. At each stage we can transform the tree into a Markov process with bounded memory. Hence, if the sequence was created by a Markov process, the algorithm will find the structure and estimate the probabilities of the process. The key idea is to iteratively build a prediction tree whose probability measure equals the empirical probability measure calculated from the sample. \n\nWe start with a tree consisting of a single node (labeled by the empty string e) and add nodes which we have reason to believe should be in the tree. A node σs must be added to the tree if it statistically differs from its parent node s. 
A natural measure to check the statistical difference is the relative entropy (also known as the Kullback-Leibler (KL) divergence) [5] between the conditional probabilities P(·|s) and P(·|σs). Let X be an observation space and P_1, P_2 be probability measures over X; then the KL divergence between P_1 and P_2 is D_KL(P_1||P_2) = ∑_{x∈X} P_1(x) log (P_1(x)/P_2(x)). Note that this distance is not symmetric, and P_1 should be absolutely continuous with respect to P_2. In our problem, the KL divergence measures how much additional information is gained by using the suffix σs for prediction instead of predicting using the shorter suffix s. There are cases where the statistical difference is large yet the probability of observing the suffix σs itself is so small that we can neglect those cases. Hence we weigh the statistical error by the prior probability of observing σs. The statistical error measure in our case is \n\nErr(σs, s) = P(σs) D_KL(P(·|σs)||P(·|s)) = P(σs) ∑_{σ'∈Σ} P(σ'|σs) log (P(σ'|σs)/P(σ'|s)) = ∑_{σ'∈Σ} P(σsσ') log (P(σ'|σs)/P(σ'|s)) \n\nTherefore, a node σs is added to the tree if the statistical difference (defined by Err(σs, s)) between the node and its parent s is larger than a predetermined accuracy ε. The tree is grown level by level, adding a son of a given leaf in the tree whenever the statistical surprise is large. The problem is that the requirement that a node statistically differ from its parent node is a necessary condition for belonging to the tree, but is not sufficient. The leaves of a prediction suffix tree must differ from their parents (or they are redundant), but internal nodes might not have this property. Therefore, we must continue testing further potential descendants of the leaves in the tree up to depth L. 
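The weighted-KL criterion Err(σs, s) above translates directly into code. The fragment below is an illustrative sketch under our own assumptions (the plain-dict representation of the conditional probabilities and the function name are ours):

```python
from math import log

def err(p_sigs, p_cond_long, p_cond_short):
    """Err(sigma s, s) = P(sigma s) * D_KL( P(.|sigma s) || P(.|s) ).

    p_sigs:        prior probability of observing the suffix sigma s
    p_cond_long:   dict symbol -> P(symbol | sigma s)
    p_cond_short:  dict symbol -> P(symbol | s)
    """
    d = 0.0
    for sym, p_long in p_cond_long.items():
        if p_long > 0.0:   # 0 * log(0/x) contributes nothing
            d += p_long * log(p_long / p_cond_short[sym])
    return p_sigs * d
```

When the two conditional distributions coincide the divergence, and hence the error, is zero, so the candidate node adds no predictive information and is not added to the tree.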
In order to avoid exponential growth in the number of strings tested, we do not test strings which belong to branches which are reached with small probability. The set of strings tested at each step is denoted by S, and can be viewed as a kind of potential 'frontier' of the growing tree T. At each stage, or when the construction is completed, we can produce the equivalent Markov process with bounded memory. The learning algorithm of the prediction suffix tree is depicted in Fig. 2. The algorithm gets two parameters: an accuracy parameter ε and the maximal order of the process (which is also the maximal depth of the tree) L. \n\nThe true source probabilities are not known, hence they should be estimated from the empirical counts of their appearances in the observation sequences. Denote by #s the number of times the string s appeared in the observation sequences and by #σ|s the number of times the symbol σ appeared after the string s. Then, using Laplace's rule of succession, the empirical estimation of the probabilities is \n\nP(s) ≈ P̂(s) = (#s + 1) / (∑_{s'∈Σ^|s|} #s' + |Σ|^|s|) ,   P(σ|s) ≈ P̂(σ|s) = (#σ|s + 1) / (∑_{σ'∈Σ} #σ'|s + |Σ|) \n\n4 A Toy Learning Example \n\nThe algorithm was applied to a 1000 symbol long sequence produced by the automaton depicted top left in Fig. 3. The alphabet was binary. Bold lines in the figure represent transitions with the symbol '0' and dashed lines represent the symbol '1'. The prediction suffix tree is plotted at each stage of the algorithm. At the \n\n• Initialize the tree T and the candidate strings S: T consists of a single root node, and S ← {σ | σ ∈ Σ ∧ P̂(σ) ≥ ε}. \n\n• While S ≠ ∅, do the following: \n\n1. Pick any s ∈ S and remove it from S. \n2. 
If Err(s, Suffix(s)) ≥ ε, then add to T the node corresponding to s and all the nodes on the path from the deepest node in T (the deepest ancestor of s) until s. \n\n3. If |s| < L, then for every σ ∈ Σ, if P̂(σs) ≥ ε, add σs to S. \n\nFigure 2: The algorithm for learning a prediction suffix tree. \n\nend of the run the corresponding automaton is plotted as well (bottom right). Note that the original automaton and the learned automaton are the same except for small differences in the transition probabilities. \n\nFigure 3: The original automaton (top left), the instantaneous automata built along the run of the algorithm (left to right and top to bottom), and the final automaton (bottom left). \n\n5 Applications \n\nWe applied the algorithm to the Bible with L = 30 and ε = 0.001, which resulted in an automaton having less than 3000 states. The alphabet was the English letters and the blank character. The final automaton consists of states that are of length 2, like 'qu' and 'xe', and on the other hand 8 and 9 symbol long states, like 'shall be' and 'there was'. This indicates that the algorithm really captures the notion of variable context length prediction, which results in a compact yet accurate model. Building a full Markov model in this case is impossible, since a full model of order 9 alone already requires 27^9 states. Here we demonstrate our algorithm for cleaning corrupted text. A test text (which was taken out of the training sequence) was modified in two different ways: first by a stationary noise that altered each letter with probability 0.2, and then the text was further modified by changing each blank to a random letter. The most probable state sequence was found via dynamic programming. The 'cleaned' observation sequence is the most probable outcome given the knowledge of the error rate. 
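The decoding step just described, recovering the most probable state sequence by dynamic programming, can be sketched as a standard Viterbi pass over the learned automaton combined with a memoryless symbol-corruption channel. The interface below (`states` as a dict of next-symbol distributions, `trans` as the suffix transition function, and the uniform-corruption channel) is our illustrative assumption, not the paper's implementation:

```python
from math import log

def viterbi_clean(obs, states, trans, alphabet, p_err):
    """Most probable underlying string given a noisy observation `obs`.

    states: dict state-label -> dict(symbol -> transition probability)
    trans(label, sym): label of the next state (suffix of label+sym)
    p_err: probability that a symbol was altered (uniformly) by the noise
    """
    def emit(true_sym, seen_sym):
        # simple channel: correct with prob 1-p_err, else uniform over the rest
        return 1.0 - p_err if true_sym == seen_sym else p_err / (len(alphabet) - 1)

    delta = {q: 0.0 for q in states}   # best log-probability of a path ending in q
    back = []                          # per-step backpointers: q2 -> (q, emitted sym)
    for seen in obs:
        nxt, choice = {}, {}
        for q, d in delta.items():
            for sym, p in states[q].items():
                q2 = trans(q, sym)
                score = d + log(p) + log(emit(sym, seen))
                if q2 not in nxt or score > nxt[q2]:
                    nxt[q2] = score
                    choice[q2] = (q, sym)
        delta = nxt
        back.append(choice)
    # backtrack from the best final state
    q = max(delta, key=delta.get)
    out = []
    for choice in reversed(back):
        q, sym = choice[q]
        out.append(sym)
    return ''.join(reversed(out))

# Toy order-1 model that strongly prefers alternating symbols.
states = {'0': {'0': 0.1, '1': 0.9}, '1': {'0': 0.9, '1': 0.1}}
trans = lambda q, sym: sym   # memory length 1: the state is the last symbol
```

With this toy model, an observation such as '0111' decodes to the alternating string '0101': paying one emission error is far cheaper than two low-probability self-transitions.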
An example of such decoding for these two types of noise is shown in Fig. 4. \n\nOriginal Text: \nand god called the dry land earth and the gathering together of the waters called \nhe seas and god saw that it was good and god said let the earth bring forth grass \nthe herb yielding seed and the fruit tree yielding fruit after his kind \nNoisy text (1): \nand god cavsed the drxjland earth ibd shg gathervng together oj the waters dlled \nre seas aed god saw thctpit was good ann god said let tae earth bring forth gjasb \ntse hemb yielpinl peed and thesfruit tree sielxing fzuitnafter his kind \nDecoded text (1): \nand god caused the dry land earth and she gathering together of the waters called \nhe sees and god saw that it was good and god said let the earth bring forth grass \nthe memb yielding peed and the fruit tree fielding fruit after his kind \nNoisy text (2): \nandhgodpcilledjthesdryjlandbeasthcandmthelgatceringhlogetherjfytrezaatersoczlled \nxherseasaknddgodbsawwthathitqwasoqoohanwzgodcsaidhletdtheuejrthriringmforth \nbgrasstthexherbyieidingzseedmazdctcybfruitttreeayieidinglfruztbafberihiskind \nDecoded text (2): \nand god called the dry land earth and the gathering together of the altars called he \nseasaked god saw that it was took and god said let the earthriring forth grass the \nherb yielding seed and thy fruit treescielding fruit after his kind \n\nFigure 4: Cleaning corrupted text using a Markov process with bounded memory. \n\nWe also applied the algorithm to intergenic regions of E. coli DNA, with L = 20 and ε = 0.0001. The alphabet is: A, C, T, G. The result of the algorithm is an automaton having 80 states. The names of the states of the final automaton are depicted in Fig. 5. The performance of the model can be compared to other models, such as the HMM based model [8], by calculating the normalized log-likelihood (NLL) over unseen data. The NLL is an empirical measure of the entropy of the source as induced by the model. 
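The NLL itself is simple to compute: it is the per-symbol negative log-probability (base 2) the model assigns to held-out data. A minimal sketch, assuming a `next_prob(context, symbol)` interface that we introduce here for illustration:

```python
from math import log2

def nll(next_prob, w):
    """NLL(w) = -(1/n) * sum_i log2 P(w_i | w_1 ... w_{i-1}), in bits/symbol."""
    total = 0.0
    for i, sym in enumerate(w):
        total += log2(next_prob(w[:i], sym))
    return -total / len(w)

# Sanity check: a memoryless uniform model over a 4-letter alphabet should
# score exactly 2 bits per symbol on any sequence.
uniform = lambda context, sym: 0.25
```

A lower NLL means the model predicts the held-out sequence better, which is what makes it a fair common yardstick between the bounded memory Markov model and the HMM.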
The NLL of the bounded memory Markov model is about the same as the one obtained by the HMM based model. Yet, the Markov model does not contain the length distribution of the intergenic segments, hence the overall performance of the HMM based model is slightly better. On the other hand, the HMM based model is more complicated and requires manual tuning of its architecture. \n\nA C T G AA AC AT CA CC CT CG TA TC TT TG GA GC GT GG AAC AAT AAG \nACA ATT CAA CAC CAT CAG CCA CCT CCG CTA CTC CTT CGA CGC CGT TAT \nTAG TCA TCT TTA TTG TGC GAA GAC GAT GAG GCA GTA GTC GTT GTG \nGGA GGC GGT AACT CAGC CCAG CCTG CTCA TCAG TCTC TTAA TTGC \nTTGG TGCC GACC GATA GAGC GGAC GGCA GGCG GGTA GGTT GGTG \nCAGCC TTGCA GGCGC GGTTA \n\nFigure 5: The states that constitute the automaton for predicting the next base of intergenic regions in E. coli DNA. \n\n6 Conclusions and Future Research \n\nIn this paper we present a new efficient algorithm for estimating the structure and the transition probabilities of a Markov process with bounded yet variable memory. The algorithm, when applied to natural language modeling, results in a compact and accurate model which captures the short term correlations. The theoretical properties of the algorithm will be described elsewhere. In fact, we can prove that a slightly different algorithm constructs a bounded memory Markov process which, with arbitrarily high probability, induces distributions (over Σ^n for n > 0) which are very close to those induced by the 'true' Markovian source, in the sense of the KL divergence. This algorithm uses a polynomial size sample and runs in polynomial time in the relevant parameters of the problem. We are also investigating hierarchical models based on these automata which are able to capture multi-scale correlations, and thus can be used to model more of the large scale structure of the natural language. 
\n\nAcknowledgment \n\nWe would like to thank Lee Giles for providing us with the software for plotting finite state machines, and Anders Krogh and David Haussler for letting us use their E. coli DNA data and for many helpful discussions. Y.S. would like to thank the Clore foundation for its support. \n\nReferences \n\n[1] J.G. Kemeny and J.L. Snell, Finite Markov Chains, Springer-Verlag, 1982. \n[2] Y. Freund, M. Kearns, D. Ron, R. Rubinfeld, R.E. Schapire, and L. Sellie, Efficient Learning of Typical Finite Automata from Random Walks, STOC-93. \n[3] F. Jelinek, Self-Organized Language Modeling for Speech Recognition, 1985. \n[4] A. Nadas, Estimation of Probabilities in the Language Model of the IBM Speech Recognition System, IEEE Trans. on ASSP, Vol. 32, No. 4, pp. 859-861, 1984. \n[5] S. Kullback, Information Theory and Statistics, New York: Wiley, 1959. \n[6] J. Rissanen and G.G. Langdon, Universal modeling and coding, IEEE Trans. on Info. Theory, IT-27(3), pp. 12-23, 1981. \n[7] J. Rissanen, Stochastic complexity and modeling, The Ann. of Stat., 14(3), 1986. \n[8] A. Krogh, I.S. Mian, and D. Haussler, A Hidden Markov Model that finds genes in E. coli DNA, UCSC Tech. Rep. UCSC-CRL-93-16. \n", "award": [], "sourceid": 723, "authors": [{"given_name": "Dana", "family_name": "Ron", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}