{"title": "Building Predictive Models from Fractal Representations of Symbolic Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 645, "page_last": 651, "abstract": null, "full_text": "Building Predictive Models from Fractal \nRepresentations of Symbolic Sequences \n\nPeter Tioo Georg Dorffner \n\nAustrian Research Institute for Artificial Intelligence \n\nSchottengasse 3, A-101O Vienna, Austria \n\n{petert,georg}@ai.univie.ac.at \n\nAbstract \n\nWe propose a novel approach for building finite memory predictive mod(cid:173)\nels similar in spirit to variable memory length Markov models (VLMMs). \nThe models are constructed by first transforming the n-block structure of \nthe training sequence into a spatial structure of points in a unit hypercube, \nsuch that the longer is the common suffix shared by any two n-blocks, \nthe closer lie their point representations. Such a transformation embodies \na Markov assumption - n-blocks with long common suffixes are likely \nto produce similar continuations. Finding a set of prediction contexts is \nformulated as a resource allocation problem solved by vector quantizing \nthe spatial n-block representation. We compare our model with both the \nclassical and variable memory length Markov models on three data sets \nwith different memory and stochastic components. Our models have a \nsuperior performance, yet, their construction is fully automatic, which is \nshown to be problematic in the case of VLMMs. \n\n1 \n\nIntroduction \n\nStatistical modeling of complex sequences is a prominent theme in machine learning due \nto its wide variety of applications (see e.g. [5)). Classical Markov models (MMs) of finite \norder are simple, yet widely used models for sequences generated by stationary sources. \nHowever, MMs can become hard to estimate due to the familiar explosive increase in the \nnumber of free parameters when increasing the model order. 
Consequently, only low order MMs can be considered in practical applications. Some time ago, Ron, Singer and Tishby [4] introduced at this conference a Markovian model that could (at least partially) overcome the curse of dimensionality in classical MMs. The basic idea behind their model was simple: instead of fixed-order MMs, consider variable memory length Markov models (VLMMs) with a \"deep\" memory just where it is really needed (see also e.g. [5][7]).\n\nThe size of VLMMs is usually controlled by one or two construction parameters. Unfortunately, constructing a series of increasingly complex VLMMs (for example, to enter a model selection phase on a validation set) by varying the construction parameters can be a troublesome task [1]. Construction often does not work \"smoothly\" as the parameters vary. There are large intervals of parameter values yielding unchanged VLMMs, interleaved with tiny parameter regions corresponding to a large spectrum of VLMM sizes. In such cases it is difficult to fully automate the VLMM construction.\n\nTo overcome this drawback, we suggest an alternative predictive model similar in spirit to VLMMs. Searching for the relevant prediction contexts is reformulated as a resource allocation problem in Euclidean space solved by vector quantization. A potentially prohibitively large set of all length-L blocks is assigned to a much smaller set of prediction contexts on a suffix basis. To that end, we first transform the set of L-blocks appearing in the training sequence into a set of points in Euclidean space, such that points corresponding to blocks sharing a long common suffix are mapped close to each other. Vector quantization on such a set partitions the set of L-blocks into several classes dominated by common suffixes. Quantization centers play the role of predictive contexts. 
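The suffix-to-spatial-proximity transformation described above can be illustrated with a minimal sketch for a binary alphabet (one dimension suffices, contraction coefficient 1/2; the helper name is ours, not from the paper):

```python
def fractal_map(block, k=0.5):
    """Map a binary block (string over {'1','2'}) to a point in [0, 1]
    by applying one contraction per symbol (D = 1 suffices for A = 2).
    Later symbols are applied last, so blocks sharing a long common
    suffix end up close together."""
    vertex = {'1': 0.0, '2': 1.0}   # one hypercube vertex per symbol
    x = 0.5                         # the centre of the hypercube
    for s in block:
        x = k * x + (1 - k) * vertex[s]
    return x

a = fractal_map('111222')   # shares suffix '222' with b -> |a - b| < k**3
b = fractal_map('222222')
c = fractal_map('111112')   # shares only suffix '2' with a
assert abs(a - b) < 0.5 ** 3 < abs(a - c)
```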
A great advantage of our model is that vector quantization can be performed on a completely self-organized basis.\n\nWe compare our model with both classical MMs and VLMMs on three data sets representing a wide range of grammatical and statistical structure. First, we train the models on the Feigenbaum binary sequence, which has a very strict topological and metric organization of allowed subsequences. Highly specialized, deep prediction contexts are needed to model this sequence. Classical Markov models cannot succeed, and the full power of admitting a limited number of variable length contexts can be exploited. The second data set consists of quantized daily volatility changes of the Dow Jones Industrial Average (DJIA). Predictive models are used to predict the direction of the volatility move for the next day. Financial time series are known to be highly stochastic with a relatively shallow memory structure. In this case, it is difficult to beat low-order classical MMs. One can perform better than MMs only by developing a few deeper specialized contexts, but that, on the other hand, can lead to overfitting. Finally, we test our model on the experiments of Ron, Singer and Tishby with language data from the Bible [5]. They trained classical MMs and a VLMM on the books of the Bible except for the book of Genesis. The models were then evaluated on the basis of negative log-likelihood on an unseen text from Genesis. We compare likelihood results of our model with those of MMs and VLMMs.\n\n2 Predictive models\n\nWe consider sequences S = s1 s2 ... over a finite alphabet A = {1, 2, ..., A} generated by stationary sources. The set of all sequences over A with exactly n symbols is denoted by A^n. An information source over A = {1, 2, ..., A} is defined by a family of consistent probability measures P_n on A^n, n = 0, 1, 2, ..., such that sum_{s ∈ A} P_(n+1)(ws) = P_n(w) for all w ∈ A^n (A^0 = {λ} and P_0(λ) = 1, where λ denotes the empty string). 
\nIn applications it is useful to consider probability functions P_n that are easy to handle. This can be achieved, for example, by assuming a finite source memory of length at most L, and formulating the conditional measures P(s|w) = P_(L+1)(ws) / P_L(w), w ∈ A^L, using a function c : A^L → C, from L-blocks over A to a (presumably small) finite set C of prediction contexts:\n\nP(s|w) = P(s|c(w)).   (1)\n\nIn Markov models (MMs) of order n ≤ L, for all L-blocks w ∈ A^L, c(w) is the length-n suffix of w, i.e. c(uv) = v, v ∈ A^n, u ∈ A^(L-n).\n\nIn variable memory length Markov models (VLMMs), the suffixes c(w) of L-blocks w ∈ A^L can have different lengths, depending on the particular L-block w. For strategies of selecting and representing the prediction contexts through prediction suffix trees and/or probabilistic suffix automata see, for example, [4][5]. VLMM construction is controlled by one or several parameters regulating the selection of candidate contexts and growing/pruning decisions.\n\nThe prediction context function c : A^L → C in Markov models of order n ≤ L can be interpreted as the natural homomorphism c : A^L → A^L/E corresponding to an equivalence relation E ⊆ A^L × A^L on L-blocks over A: two L-blocks u, v are in the same class, i.e. (u, v) ∈ E, if they share the same suffix of length n. The factor set A^L/E = C = A^n consists of all n-blocks over A. Classical MMs define the equivalence E on a suffix basis, but regardless of the suffix structure present in the training data. Our idea is to keep the Markov-motivated suffix strategy for constructing E, but at the same time take into account the suffix structure of the data. 
\nVector quantization on a set of B points in a Euclidean space positions N << B codebook vectors (CVs), each CV representing the subset of points that are closer to it than to any other CV, so that the overall error of substituting CVs for the points they represent is minimal. In other words, CVs tend to represent points lying close to each other (in the Euclidean metric). In order to use vector quantization for determining relevant predictive contexts we need to do two things:\n\n1. Define a suitable metric on the sequence space that corresponds to the Markov assumptions:\n(a) two sequences are \"close\" if they share a common suffix;\n(b) the longer the common suffix, the closer the sequences.\n\n2. Define a uniformly continuous map from the sequence metric space to the Euclidean space, i.e. sequences that are close in the sequence space (i.e. share a long common suffix) are mapped close to each other in the Euclidean space.\n\nIn [6] we rigorously study a class of such spatial representations of symbolic structures. Specifically, a family of distances between two L-blocks u = u1 u2 ... u(L-1) uL and v = v1 v2 ... v(L-1) vL over A = {1, 2, ..., A}, expressed as\n\nd_k(u, v) = sum_{i=1..L} k^(L-i+1) δ(u_i, v_i),   k ≤ 1/2,   (2)\n\nwith δ(i, j) = 0 if i = j, and δ(i, j) = 1 otherwise, corresponds to the Markov assumptions.\n\nThe parameter k influences the rate of \"forgetting the past\". We construct a map from the sequence metric space to the Euclidean space as follows. Associate with each symbol i ∈ A a contraction\n\ni(x) = k x + (1 - k) t_i,   t_i ∈ {0, 1}^D,   (3)\n\noperating on the unit D-dimensional hypercube [0, 1]^D. The dimension of the hypercube should be large enough that each symbol i is associated with a unique vertex t_i, i.e. D = ceil(log2 A) and t_i ≠ t_j whenever i ≠ j. The map σ : A^L → [0, 1]^D, from L-blocks v1 v2 ... vL over A to the unit hypercube,\n\nσ(v1 v2 ... vL) = vL(v(L-1)( ... (v2(v1(x*))) ... )) = (vL ∘ v(L-1) ∘ ... ∘ v2 ∘ v1)(x*),   (4)\n\nwhere x* = {1/2}^D is the center of the hypercube, is \"uniformly continuous\". Indeed, whenever two sequences u, v share a common suffix of length Q, the Euclidean distance between their point representations σ(u) and σ(v) is less than √D k^Q. Strictly speaking, for a mathematically correct treatment of uniform continuity, we would need to consider infinite sequences. Finite blocks of symbols would then correspond to cylinder sets (see [6]). For the sake of simplicity we deal only with finite sequences.\n\nAs with classical Markov models, we define the prediction context function c : A^L → C via an equivalence E on L-blocks over A: two L-blocks u, v are in the same class if their images under the map σ are represented by the same codebook vector. In this case, the set of prediction contexts C can be identified with the set of codebook vectors {b_1, b_2, ..., b_N}, b_i ∈ R^D, i = 1, 2, ..., N. We refer to predictive models with such a context function as prediction fractal machines (PFMs). The prediction probabilities (1) are determined by\n\nP(s | b_i) = N(i, s) / sum_{a ∈ A} N(i, a),   s ∈ A,   (5)\n\nwhere N(i, a) is the number of (L+1)-blocks ua, u ∈ A^L, a ∈ A, in the training sequence such that the point σ(u) is allocated to the codebook vector b_i.\n\n3 Experiments\n\nIn all experiments we constructed PFMs using a contraction coefficient k = 1/2 (see eq. (3)) and K-means as a vector quantization tool.\n\nThe first data set is the Feigenbaum sequence over the binary alphabet A = {1, 2}. This sequence is well studied in symbolic dynamics and has a number of interesting properties. First, the topological structure of the sequence can only be described using a context sensitive tool - a restricted indexed context-free grammar. Second, for each block length n = 1, 2, ...
, the distribution of n-blocks is either uniform or has just two probability levels. Third, the n-block distributions are organized in a self-similar fashion (see [2]). The sequence can be specified by the subsequence composition rule\n\nS_(n+1) = S_n S_n',   S_1 = 1,   (6)\n\nwhere S_n' denotes S_n with its last symbol exchanged (1 ↔ 2).\n\nWe chose to work with the Feigenbaum sequence because increasingly accurate modeling of the sequence with finite memory models requires a selective mechanism for deep prediction contexts.\n\nWe created a large portion of the Feigenbaum sequence and trained a series of classical MMs, variable memory length MMs (VLMMs), and prediction fractal machines (PFMs) on the first 260,000 symbols. The following 200,000 symbols formed a test set. The maximum memory length L for VLMMs and PFMs was set to 30.\n\nAs mentioned in the introduction, constructing a series of increasingly complex VLMMs by varying the construction parameters proved to be a troublesome task. We spent a fair amount of time finding \"critical\" parameter values at which the model size changed. In contrast, a fully automatic construction of PFMs involved sliding a window of length L = 30 through the training set; for each window position, mapping the L-block u appearing in the window to the point σ(u) (eq. (4)), and vector-quantizing the resulting set of points (up to 30 codebook vectors). After the quantization step we computed predictive probabilities according to eq. (5).\n\nTable 1: Normalized negative log-likelihoods (NNL) on the Feigenbaum test set. 
\n\nmodel \nPFM \n\nVLMM \n\nMM \n\n# contexts NNL \n0.6666 \n0.3333 \n0.1666 \n0.0833 \n0.6666 \n0.3333 \n0.1666 \n0.0833 \n2,4,8,16,32 0.6666 \n\n2-4 \n5-7 \n8-22 \n23-\n2-4 \n5 \n11 \n23 \n\ncaptured block distribution \n\n1-3 \n1-6 \n1-12 \n1-24 \n1-3 \n1-6 \n1-12 \n1-24 \n1-3 \n\nNegative log-likelihoods per symbol (the base oflogarithm is always taken to be the number \nof symbols in the alphabet) of the test set computed using the fitted models exhibited a step(cid:173)\nlike increasing tendency shown in Table 1. We also investigated the ability of the models \nto reproduce the n-block distribution found in the training and test sets. This was done by \nletting the models generate sequences of length equal to the length of the training sequence \nand for each block length n = 1,2, ... , 30, computing the L1 distance between the n-block \ndistribution of the training and model-generated sequences. The n-block distributions on \nthe test and training sets were virtually the same for n = 1,2, ... 30. In Table I we show \nblock lengths for which the L1 distance does not exceed a small threshold~ . We set \n~ = 0.005, since in this experiment, either the L1 distance was less 0.005, or exceeded \n0.005 by a large amount. \n\nAn explanation of the step-like behavior in the log-likelihood and n-block modeling be(cid:173)\nhavior of VLMMs and PFMs is out of the scope of this paper. We briefly mention, how(cid:173)\never, that by combining the knowledge about the topological and metric structur~s of the \nFeigenbaum sequence (e.g. [2]) with a careful analysis of the models, one can show why \nand when an inclusion of a prediction context leads to an abrupt improvement in the mod(cid:173)\neling performance. In fact, we can show that VLMMs and PFMs constitute increasingly \nbetter approximations to the infinite self-similar Feigenbaum machine known in symbolic \ndynamics [2]. 
\n\nThe classical MM totally fails in this experiment, since the context length 5 is far too \nsmall to enable the MM to mimic the complicated subsequence structure in the Feigenbaum \nsequence. PFMs and VLMMs quickly learn to explore a limited number of deep prediction \ncontexts and perform comparatively well. \nIn the second experiment, a time series {xtJ of the daily values ofthe Dow Jones Industrial \nAverage (DJIA) from Feb. 1 1918 until April 1 1997 was transformed into a time series \nof returns rt = log Xt+1 -\nlog Xt, and divided into 12 partially overlapping epochs, each \ncontaining about 2300 values (spanning approximately 9 years). We consider the squared \nreturn r; a volatility estimate for day t. Volatility change forecasts (volatility is going to \nincrease or decrease) based on historical returns can be interpreted as a buying or selling \nsignal for a straddle (see e.g. [3]). If the volatility decreases we go short (straddle is sold), \nif it increases we take a long position (straddle is bought). In this respect, the quality of \na volatility model can be measured by the percentage of correctly predicted directions of \ndaily volatility differences. \n\n\f650 \n\nP Tino and G. DorfJner \n\nTable 2: Prediction perfonnance on the DJIA volatility series. 
\n\nmodel \nPPM \n\nVLMM \n\nMM \n\n1 \n\n71.08 \n68.67 \n68.56 \n\n7 \n\n70.39 \n68.18 \n69.11 \n\n8 \n\nPercent correct on test set \n2 \n5 \n\n3 \n\n4 \n\n6 \n\n69.70 70.05 \n68.79 \n69.25 \n68.28 \n69.78 \n\n72.12 \n72.46 \n68.29 \n69.41 \n69.50 73.13 \n\nPPM \nVLMM \n\nMM \n\n74.01 \n71.77 73.84 \n69.83 \n67.00 67.96 \n74.16 71.96 69.95 \n\n9 \n\n10 \n\n73.84 \n70.76 \n69.16 \n\n11 \n\n71,77 \n69.80 \n71.74 \n\n12 \n\n74.19 \n70.25 \n71.07 \n\nThe series {r~+1 - r~} of differences between the successive squared returns is transfonned \n\ninto a sequence {Dt} over 4 symbols by quantizing the series {r~+1 - rn as follows: \n\nD -\nt -\n\n{\n\nI (extreme down), \n2 (nonnal down), \n3 (nonnal up), \n4 (extreme up), \n\nif rr+1 - rr < 01 < 0 \nif 01 ~ r;+1 - r~ < a \nif a ~ rt~1 -\nr t < O2 \n2 \nif 02 ~ rt+1 - r;, \n\n2 \n\n(7) \n\nwhere the parameters 01 and ()2 correspond to Q percent and (100 - Q) percent sample \nquantiles, respectively. So, the upper (lower) Q% of all daily volatility increases (de(cid:173)\ncreases) in the sample are considered extremal, and the lower (upper) (50 - Q)% of daily \nvolatility increases (decreases) are viewed as nonnal. \n\nEach epoch is partitioned into training, validation and test parts containing 110, 600 and \n600 symbols, respectively. Maximum memory length L for VLMMs and PFMs was set to \n10 (two weeks). We trained classical MMs, VLMMs and PFMs with various numbers of \nprediction contexts (up to 256) and extremal event quantiles Q E {5, 10, 15, ... , 45}. For \neach model class, the model size and the quantile Q to be used on the test set were'selected \naccording to the validation set perfonnance. Perfonnance of the models was quantified as \nthe percentage of correct guesses of the volatility change direction for the next day. If the \nnext symbol is 1 or 2 (3 or 4) and the sum of conditional next symbol probabilities for 1 \nand 2 (3 and 4) given by a model is greater than 0.5, the model guess is considered correct. 
\nResults are shown in Table 2. A paired t-test reveals that PFMs significantly (p < 0.005) outperform both VLMMs and classical MMs.\n\nOf course, fixed-order MMs are just special cases of VLMMs, so theoretically VLMMs cannot perform worse than MMs. We present separate results for MMs and VLMMs to illustrate practical problems in fitting VLMMs. Besides the familiar problems with setting the construction parameter values, one-parameter schemes (like that presented in [4] and used here) operate only on small subsets of potential VLMMs. On data sets with a rather shallow memory structure, this can have a negative effect.\n\nThe third experiment extends the work of Ron, Singer and Tishby [5]. They tested classical MMs and VLMMs on the Bible. The alphabet is the English letters and the blank character (27 symbols). The training set consisted of the Bible except for the book of Genesis. The test set was a portion of 236 characters from the book of Genesis. They set the maximal memory depth to L = 30 and constructed a VLMM with about 3000 contexts. Summarizing the results in [5], classical MMs of order 0, 1, 2 and 3 achieved negative log-likelihoods per character (NNL) of 0.853, 0.681, 0.560 and 0.555, respectively. The authors point out a huge difference between the number of states in MMs of order 2 and 3: 27^3 - 27^2 = 18954. The VLMM performed much better and achieved an NNL of 0.456. In our experiments, we set the maximal memory length to L = 30 (the same maximal memory length was used for VLMM construction in [5]). PFMs were constructed by vector quantizing a 5-dimensional (the alphabet has 27 symbols) spatial representation of the 30-blocks appearing in the training set. On the test set, PFMs with 100, 500, 1000 and 3000 predictive contexts achieved NNLs of 0.622, 0.518, 0.510 and 0.435. 
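The NNL figures quoted throughout follow from per-symbol predictive probabilities, with the logarithm taken to base A as stated above. A small sketch (the function name is ours):

```python
import math

def nnl(probs, A):
    """Normalized negative log-likelihood per symbol; the base of the
    logarithm is the alphabet size A, so a uniform predictor scores 1
    and a perfect predictor scores 0."""
    return -sum(math.log(p, A) for p in probs) / len(probs)

# Sanity check on a 27-symbol alphabet: uniform prediction gives NNL ~ 1.
assert abs(nnl([1 / 27] * 100, 27) - 1.0) < 1e-9
```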
\n\n4 Conclusion\n\nWe presented a novel approach for building finite memory predictive models similar in spirit to variable memory length Markov models (VLMMs). Constructing a series of VLMMs is often a troublesome and highly time-consuming task requiring many interactive steps. Our predictive models, prediction fractal machines (PFMs), can be constructed in a completely automatic and intuitive way - the number of codebook vectors in the vector quantization step of PFM construction corresponds to the number of predictive contexts.\n\nWe tested our model on three data sets with different memory and stochastic components. VLMMs excel over classical MMs on the Feigenbaum sequence, which requires deep prediction contexts. On this sequence, PFMs achieved the same performance as their rivals, VLMMs. On the financial time series, PFMs significantly outperform the purely symbolic Markov models - MMs and VLMMs. On the natural language Bible data, our PFM outperforms a VLMM of comparable size.\n\nAcknowledgments\n\nThis work was supported by the Austrian Science Fund (FWF) within the research project \"Adaptive Information Systems and Modeling in Economics and Management Science\" (SFB 010) and the Slovak Academy of Sciences grant SAV 2/6018/99. The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Science and Transport.\n\nReferences\n\n[1] P. Bühlmann. Model selection for variable length Markov chains and tuning the context algorithm. Annals of the Institute of Statistical Mathematics, (in press), 1999.\n\n[2] J. Freund, W. Ebeling, and K. Rateitschak. Self-similar sequences and universal scaling of dynamical entropies. Physical Review E, 54(5), pp. 5561-5566, 1996.\n\n[3] J. Noh, R.F. Engle, and A. Kane. Forecasting volatility and option prices of the S&P 500 index. Journal of Derivatives, pp. 17-30, 1994.\n\n[4] D. Ron, Y. Singer, and N. Tishby. 
The power of amnesia. In Advances in Neural Information Processing Systems 6, pp. 176-183. Morgan Kaufmann, 1994.\n\n[5] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning, 25, 1996.\n\n[6] P. Tino. Spatial representation of symbolic sequences through iterative function systems. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 29(4), pp. 386-392, 1999.\n\n[7] M.J. Weinberger, J.J. Rissanen, and M. Feder. A universal finite memory source. IEEE Transactions on Information Theory, 41(3), pp. 643-652, 1995.\n", "award": [], "sourceid": 1762, "authors": [{"given_name": "Peter", "family_name": "Ti\u00f1o", "institution": null}, {"given_name": "Georg", "family_name": "Dorffner", "institution": null}]}