{"title": "Building Predictive Models from Fractal Representations of Symbolic Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 645, "page_last": 651, "abstract": null, "full_text": "Building Predictive Models from Fractal \nRepresentations of Symbolic Sequences \n\nPeter Tioo  Georg Dorffner \n\nAustrian Research Institute for Artificial Intelligence \n\nSchottengasse 3, A-101O Vienna, Austria \n\n{petert,georg}@ai.univie.ac.at \n\nAbstract \n\nWe propose a novel approach for building finite memory predictive mod(cid:173)\nels similar in spirit to variable memory length Markov models (VLMMs). \nThe models are constructed by first transforming the n-block structure of \nthe training sequence into a spatial structure of points in a unit hypercube, \nsuch that the  longer is the common suffix shared by any  two n-blocks, \nthe closer lie their point representations. Such a transformation embodies \na Markov assumption - n-blocks with long common suffixes  are likely \nto produce similar continuations.  Finding a set of prediction contexts is \nformulated as a resource allocation problem solved by vector quantizing \nthe spatial n-block representation. We compare our model with both the \nclassical  and variable memory length Markov models on three data sets \nwith different memory  and stochastic components.  Our models have  a \nsuperior performance, yet, their construction is fully automatic, which is \nshown to be problematic in the case of VLMMs. \n\n1 \n\nIntroduction \n\nStatistical modeling of complex sequences is a prominent theme in machine learning due \nto its wide variety of applications (see e.g.  [5)).  Classical Markov models (MMs) of finite \norder are  simple,  yet widely used models for sequences  generated by  stationary sources. \nHowever,  MMs can become hard to estimate due to the familiar explosive increase in the \nnumber of free  parameters  when increasing the model order.  Consequently,  only low or(cid:173)\nder  MMs can  be  considered in  practical  applications.  Some time  ago,  Ron,  Singer and \nTishby [4]  introduced at this conference a Markovian model that could (at least partially) \novercome the curse of dimensionality in classical MMs. The basic idea behind their model \nwas  simple:  instead of fixed-order MMs consider variable memory length Markov models \n(VLMMs) with a \"deep\" memory just where it is really needed (see also e.g.  [5][7]). \n\nThe size of VLMMs is  usually controlled by one or two construction parameters.  Unfor(cid:173)\ntunately,  constructing a  series  of increasingly complex  VLMMs  (for example  to  enter  a \nmodel selection phase on a validation set) by  varying the construction parameters can  be \n\n\f646 \n\nP  Tino  and G.  DorjJner \n\na  troublesome task  [1).  Construction often  does  not  work  \"smoothly\"  with  varying  the \nparameters. There are large intervals of parameter values yielding unchanged VLMMs in(cid:173)\nterleaved  with tiny parameter regions corresponding to a large spectrum of VLMM sizes. \nIn such cases it is difficult to fully automize the VLMM construction. \n\nTo  overcome  this drawback,  we  suggest  an  alternative predictive model  similar in  spirit \nto  VLMMs.  Searching  for  the relevant prediction contexts is reformulated as  a resource \nallocation  problem  in  Euclidean  space  solved by  vector  quantization.  
A potentially prohibitively large set of all length-L blocks is assigned to a much smaller set of prediction contexts on a suffix basis. To that end, we first transform the set of L-blocks appearing in the training sequence into a set of points in Euclidean space, such that points corresponding to blocks sharing a long common suffix are mapped close to each other. Vector quantization on such a set partitions the set of L-blocks into several classes dominated by common suffixes. Quantization centers play the role of predictive contexts. A great advantage of our model is that vector quantization can be performed on a completely self-organized basis.

We compare our model with both classical MMs and VLMMs on three data sets representing a wide range of grammatical and statistical structure. First, we train the models on the Feigenbaum binary sequence, which has a very strict topological and metric organization of allowed subsequences. Highly specialized, deep prediction contexts are needed to model this sequence. Classical Markov models cannot succeed, and the full power of admitting a limited number of variable length contexts can be exploited. The second data set consists of quantized daily volatility changes of the Dow Jones Industrial Average (DJIA). Predictive models are used to predict the direction of the volatility move for the next day. Financial time series are known to be highly stochastic with a relatively shallow memory structure. In this case, it is difficult to beat low-order classical MMs. One can perform better than MMs only by developing a few deeper specialized contexts, but that, on the other hand, can lead to overfitting. Finally, we test our model on the experiments of Ron, Singer and Tishby with language data from the Bible [5]. They trained classical MMs and a VLMM on the books of the Bible except for the book of Genesis. The models were then evaluated on the basis of negative log-likelihood on an unseen text from Genesis. We compare likelihood results of our model with those of MMs and VLMMs.

2 Predictive models

We consider sequences S = s_1 s_2 ... over a finite alphabet A = {1, 2, ..., A} generated by stationary sources. The set of all sequences over A with exactly n symbols is denoted by A^n.

An information source over A = {1, 2, ..., A} is defined by a family of consistent probability measures P_n on A^n, n = 0, 1, 2, ...: Σ_{s∈A} P_{n+1}(ws) = P_n(w) for all w ∈ A^n (A^0 = {λ} and P_0(λ) = 1, where λ denotes the empty string).

In applications it is useful to consider probability functions P_n that are easy to handle. This can be achieved, for example, by assuming a finite source memory of length at most L, and formulating the conditional measures P(s|w) = P_{L+1}(ws)/P_L(w), w ∈ A^L, using a function c : A^L → C, from L-blocks over A to a (presumably small) finite set C of prediction contexts:

    P(s|w) = P(s|c(w)).    (1)

In Markov models (MMs) of order n ≤ L, for all L-blocks w ∈ A^L, c(w) is the length-n suffix of w, i.e., c(uv) = v, v ∈ A^n, u ∈ A^{L-n}.
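For concreteness, a minimal Python sketch of eq. (1) for a fixed-order MM (our illustration, not code from the paper; the function name fit_markov is ours), estimating P(s|c(w)) by counting length-n contexts:

    from collections import defaultdict

    def fit_markov(seq, n):
        # Estimate P(s | length-n suffix) by counting (context, symbol) pairs.
        counts = defaultdict(lambda: defaultdict(int))
        for t in range(n, len(seq)):
            context = tuple(seq[t - n:t])   # c(w): the last n symbols before position t
            counts[context][seq[t]] += 1
        # normalize counts into conditional probabilities
        return {c: {s: m / sum(d.values()) for s, m in d.items()}
                for c, d in counts.items()}

    # P(s | context) for an order-2 model of a short symbol sequence:
    probs = fit_markov([1, 2, 1, 1, 2, 1, 2, 1, 1], n=2)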
In variable memory length Markov models (VLMMs), the suffixes c(w) of L-blocks w ∈ A^L can have different lengths, depending on the particular L-block w. For strategies of selecting and representing the prediction contexts through prediction suffix trees and/or probabilistic suffix automata see, for example, [4][5]. VLMM construction is controlled by one or several parameters regulating the selection of candidate contexts and growing/pruning decisions.

The prediction context function c : A^L → C in Markov models of order n ≤ L can be interpreted as a natural homomorphism c : A^L → A^L/E corresponding to an equivalence relation E ⊆ A^L × A^L on L-blocks over A: two L-blocks u, v are in the same class, i.e. (u, v) ∈ E, if they share the same suffix of length n. The factor set A^L/E = C = A^n consists of all n-blocks over A. Classical MMs define the equivalence E on a suffix basis, but regardless of the suffix structure present in the training data. Our idea is to keep the Markov-motivated suffix strategy for constructing E, but at the same time take into account the suffix structure of the data.

Vector quantization on a set of B points in a Euclidean space positions N ≪ B codebook vectors (CVs), each CV representing the subset of points that are closer to it than to any other CV, so that the overall error of substituting CVs for the points they represent is minimal. In other words, CVs tend to represent points lying close to each other (in the Euclidean metric). In order to use vector quantization for determining relevant predictive contexts we need to do two things:

1. Define a suitable metric in the sequence space that corresponds to the Markov assumptions:
   (a) two sequences are "close" if they share a common suffix;
   (b) the longer the common suffix, the closer the sequences.

2. Define a uniformly continuous map from the sequence metric space to the Euclidean space, i.e. sequences that are close in the sequence space (share a long common suffix) are mapped close to each other in the Euclidean space.

In [6] we rigorously study a class of such spatial representations of symbolic structures. Specifically, a family of distances between two L-blocks u = u_1 u_2 ... u_{L-1} u_L and v = v_1 v_2 ... v_{L-1} v_L over A = {1, 2, ..., A}, expressed as

    d_k(u, v) = Σ_{i=1}^{L} k^{L-i+1} δ(u_i, v_i),   k ≤ 1/2,    (2)

with δ(i, j) = 0 if i = j, and δ(i, j) = 1 otherwise, corresponds to the Markov assumptions. The parameter k influences the rate of "forgetting the past". We construct a map from the sequence metric space to the Euclidean space as follows. Associate with each symbol i ∈ A an affine contraction

    i(x) = k x + (1 - k) t_i,    (3)

operating on the unit D-dimensional hypercube [0, 1]^D. The dimension of the hypercube should be large enough so that each symbol i is associated with a unique vertex t_i, i.e. D = ⌈log_2 A⌉ and t_i ≠ t_j whenever i ≠ j. The map σ : A^L → [0, 1]^D, from L-blocks v_1 v_2 ... v_L over A to the unit hypercube,

    σ(v_1 v_2 ... v_L) = v_L(v_{L-1}(...(v_2(v_1(x*)))...)) = (v_L ∘ v_{L-1} ∘ ... ∘ v_2 ∘ v_1)(x*),    (4)

where x* = {1/2}^D is the center of the hypercube, is "uniformly continuous". Indeed, whenever two sequences u, v share a common suffix of length Q, the Euclidean distance between their point representations σ(u) and σ(v) is less than √D k^Q. Strictly speaking, for a mathematically correct treatment of uniform continuity we would need to consider infinite sequences. Finite blocks of symbols would then correspond to cylinder sets (see [6]). For the sake of simplicity we only deal with finite sequences.
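The maps (3)-(4) are straightforward to implement. A minimal sketch in Python (our code and naming, not the paper's; the dictionary vertices is assumed to assign each symbol its hypercube vertex t_i):

    import numpy as np

    def sigma(block, vertices, k=0.5):
        # Map an L-block to a point in [0,1]^D by iterating the
        # contractions x -> k*x + (1-k)*t_i (eqs. (3)-(4)).
        D = len(next(iter(vertices.values())))
        x = np.full(D, 0.5)              # x*: the center of the hypercube
        for symbol in block:             # apply v_1 first, v_L last
            x = k * x + (1 - k) * vertices[symbol]
        return x

    # Binary alphabet A = {1, 2}: D = 1, vertices 0 and 1.
    vertices = {1: np.array([0.0]), 2: np.array([1.0])}
    p = sigma([1, 2, 2, 1], vertices)
    # The suffix determines the sub-interval: the last symbol 1
    # puts the point into [0, 0.5); blocks with long common
    # suffixes land in the same small sub-interval.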
As with classical Markov models, we define the prediction context function c : A^L → C via an equivalence E on L-blocks over A: two L-blocks u, v are in the same class if their images under the map σ are represented by the same codebook vector. In this case, the set of prediction contexts C can be identified with the set of codebook vectors {b_1, b_2, ..., b_N}, b_i ∈ R^D, i = 1, 2, ..., N. We refer to predictive models with such a context function as prediction fractal machines (PFMs). The prediction probabilities (1) are determined by

    P(s|b_i) = N(i, s) / Σ_{a∈A} N(i, a),   s ∈ A,    (5)

where N(i, a) is the number of (L+1)-blocks ua, u ∈ A^L, a ∈ A, in the training sequence such that the point σ(u) is allocated to the codebook vector b_i.

3 Experiments

In all experiments we constructed PFMs using a contraction coefficient k = 1/2 (see eq. (3)) and K-means as a vector quantization tool.

The first data set is the Feigenbaum sequence over the binary alphabet A = {1, 2}. This sequence is well studied in symbolic dynamics and has a number of interesting properties. First, the topological structure of the sequence can only be described using a context-sensitive tool: a restricted indexed context-free grammar. Second, for each block length n = 1, 2, ..., the distribution of n-blocks is either uniform, or has just two probability levels. Third, the n-block distributions are organized in a self-similar fashion (see [2]). The sequence can be specified by the subsequence composition rule

    S_{n+1} = S_n S̄_n,   S_1 = 12,    (6)

where S̄_n denotes S_n with its last symbol complemented (equivalently, by the period-doubling substitution 1 → 12, 2 → 11).

We chose to work with the Feigenbaum sequence because increasingly accurate modeling of the sequence with finite memory models requires a selective mechanism for deep prediction contexts.

We created a large portion of the Feigenbaum sequence and trained a series of classical MMs, variable memory length MMs (VLMMs), and prediction fractal machines (PFMs) on the first 260,000 symbols. The following 200,000 symbols formed a test set. The maximum memory length L for VLMMs and PFMs was set to 30.

As mentioned in the introduction, constructing a series of increasingly complex VLMMs by varying the construction parameters proved to be a troublesome task. We spent a fair amount of time finding "critical" parameter values at which the model size changed. In contrast, the fully automatic construction of PFMs involved sliding a window of length L = 30 through the training set; for each window position, mapping the L-block u appearing in the window to the point σ(u) (eq. (4)); and vector-quantizing the resulting set of points (up to 30 codebook vectors). After the quantization step we computed the predictive probabilities according to eq. (5).
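The fully automatic construction just described can be summarized in code. A sketch under our naming, reusing the sigma function above and using scikit-learn's KMeans as the vector quantization tool (any vector quantizer would do):

    import numpy as np
    from collections import defaultdict
    from sklearn.cluster import KMeans

    def fit_pfm(seq, L, N, vertices, k=0.5):
        # Quantize the sigma-images of all L-blocks, then estimate
        # next-symbol probabilities per codebook vector (eq. (5)).
        blocks = [seq[t:t + L] for t in range(len(seq) - L)]
        nxt = [seq[t + L] for t in range(len(seq) - L)]
        points = np.array([sigma(b, vertices, k) for b in blocks])
        km = KMeans(n_clusters=N, n_init=10).fit(points)
        counts = defaultdict(lambda: defaultdict(int))
        for ctx, s in zip(km.labels_, nxt):
            counts[ctx][s] += 1                  # N(i, s) in eq. (5)
        probs = {i: {s: c / sum(d.values()) for s, c in d.items()}
                 for i, d in counts.items()}
        return km, probs

    def predict(km, probs, history, vertices, k=0.5):
        # P(s | history): allocate sigma(history) to its codebook vector.
        i = int(km.predict(np.array([sigma(history, vertices, k)]))[0])
        return probs.get(i, {})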
\n\nmodel \nPFM \n\nVLMM \n\nMM \n\n# contexts  NNL \n0.6666 \n0.3333 \n0.1666 \n0.0833 \n0.6666 \n0.3333 \n0.1666 \n0.0833 \n2,4,8,16,32  0.6666 \n\n2-4 \n5-7 \n8-22 \n23-\n2-4 \n5 \n11 \n23 \n\ncaptured block distribution \n\n1-3 \n1-6 \n1-12 \n1-24 \n1-3 \n1-6 \n1-12 \n1-24 \n1-3 \n\nNegative log-likelihoods per symbol (the base oflogarithm is always taken to be the number \nof symbols in the alphabet) of the test set computed using the fitted models exhibited a step(cid:173)\nlike increasing tendency shown in Table 1.  We also investigated the ability of the models \nto reproduce the n-block distribution found in the training and test sets.  This was done by \nletting the models generate sequences of length equal to the length of the training sequence \nand for each block length n  = 1,2, ... , 30, computing the L1  distance between the n-block \ndistribution of the training and model-generated sequences.  The n-block distributions on \nthe test and training sets were virtually the same  for  n  =  1,2, ... 30.  In Table  I  we show \nblock lengths  for  which  the  L1  distance  does  not  exceed  a  small  threshold~ .  We  set \n~ =  0.005, since in  this experiment, either the  L1  distance was less  0.005, or exceeded \n0.005 by a large amount. \n\nAn  explanation of the step-like behavior in  the log-likelihood and  n-block modeling  be(cid:173)\nhavior of VLMMs  and PFMs  is out of the scope of this paper.  We  briefly  mention,  how(cid:173)\never, that by combining the  knowledge about the topological and metric  structur~s of the \nFeigenbaum sequence (e.g.  [2]) with a careful  analysis of the models,  one can  show  why \nand when an  inclusion of a prediction context leads to an  abrupt improvement in the mod(cid:173)\neling performance.  In  fact,  we can  show that VLMMs and PFMs constitute increasingly \nbetter approximations to the infinite self-similar Feigenbaum machine known in symbolic \ndynamics [2]. \n\nThe  classical  MM  totally  fails  in  this  experiment,  since  the  context  length  5  is  far  too \nsmall to enable the MM to mimic the complicated subsequence structure in the Feigenbaum \nsequence.  PFMs and VLMMs quickly learn to explore a limited number of deep prediction \ncontexts and perform comparatively well. \nIn the second experiment, a time series {xtJ of the daily values ofthe Dow Jones Industrial \nAverage  (DJIA) from  Feb.  1 1918  until April  1 1997 was transformed into a time series \nof returns rt  = log Xt+1  -\nlog Xt,  and divided into  12  partially overlapping epochs,  each \ncontaining about 2300 values (spanning approximately 9 years).  We consider the squared \nreturn  r; a volatility estimate for day t.  Volatility change forecasts  (volatility is going to \nincrease or decrease)  based on historical returns can  be interpreted as  a buying or selling \nsignal for a straddle (see e.g.  [3]). If the volatility decreases we go short (straddle is sold), \nif it increases  we  take a long position (straddle is bought).  In this respect,  the quality of \na volatility model can  be measured by the percentage of correctly  predicted directions of \ndaily volatility differences. \n\n\f650 \n\nP  Tino  and G.  DorfJner \n\nTable 2:  Prediction perfonnance on the DJIA volatility series. 
\n\nmodel \nPPM \n\nVLMM \n\nMM \n\n1 \n\n71.08 \n68.67 \n68.56 \n\n7 \n\n70.39 \n68.18 \n69.11 \n\n8 \n\nPercent correct on test set \n2 \n5 \n\n3 \n\n4 \n\n6 \n\n69.70  70.05 \n68.79 \n69.25 \n68.28 \n69.78 \n\n72.12 \n72.46 \n68.29 \n69.41 \n69.50  73.13 \n\nPPM \nVLMM \n\nMM \n\n74.01 \n71.77  73.84 \n69.83 \n67.00  67.96 \n74.16  71.96  69.95 \n\n9 \n\n10 \n\n73.84 \n70.76 \n69.16 \n\n11 \n\n71,77 \n69.80 \n71.74 \n\n12 \n\n74.19 \n70.25 \n71.07 \n\nThe series {r~+1 - r~} of differences between the successive squared returns is transfonned \n\ninto a sequence {Dt} over 4 symbols by quantizing the series {r~+1 - rn as  follows: \n\nD  -\nt  -\n\n{\n\nI  (extreme down), \n2 (nonnal down), \n3 (nonnal up), \n4 (extreme up), \n\nif rr+1  - rr  < 01 <  0 \nif 01 ~ r;+1  - r~ < a \nif a ~ rt~1 -\nr t  < O2 \n2 \nif 02  ~ rt+1  - r;, \n\n2 \n\n(7) \n\nwhere the parameters 01  and ()2  correspond to  Q percent and  (100 - Q)  percent sample \nquantiles,  respectively.  So,  the  upper  (lower)  Q%  of all  daily  volatility  increases  (de(cid:173)\ncreases) in the sample are considered extremal, and the lower (upper) (50  - Q)% of daily \nvolatility increases (decreases) are viewed as nonnal. \n\nEach epoch  is  partitioned into training,  validation and test parts containing 110, 600 and \n600 symbols, respectively.  Maximum memory length L  for VLMMs and PFMs was set to \n10 (two weeks).  We trained classical MMs, VLMMs and PFMs with various numbers  of \nprediction contexts (up to 256) and extremal event quantiles Q E  {5, 10, 15, ... , 45}.  For \neach model class, the model size and the quantile Q to be used on the test set were'selected \naccording to the validation set perfonnance.  Perfonnance of the models was quantified as \nthe percentage of correct guesses  of the volatility change direction for the next day.  If the \nnext symbol is  1 or 2 (3  or 4)  and the sum of conditional next symbol probabilities for  1 \nand 2 (3 and 4) given by a model is greater than 0.5, the model guess is considered correct. \nResults are  shown  in Table 2.  Paired t-test reveals  that PFMs significantly (p  <  0.005) \noutperfonn both VLMMs and classical MMs. \n\nOf course,  fixed-order MMs are just special cases  of VLMMs,  so theoretically,  VLMMs \ncannot perfonn worse than  MMs.  We  present separate results for  MMs  and VLMMs  to \nillustrate practical problems in fitting VLMMs.  Besides familiar problems with setting the \nconstruction parameter values, one-parameter-schemes (like that presented in [4] and used \nhere) operate only on small subsets of potential VLMMs. On data sets with a rather shallow \nmemory structure, this can have a negative effect. \n\nThe third experiment extends the work of Ron, Singer and Tishby [5].  They tested classical \nMMs and VLMMs on the Bible. The alphabet is English letters and the blank character (27 \nsymbols).  The training set consisted of the Bible except for the book of Genesis.  The test \nset was a portion of 236 characters from the book of Genesis.  They set the maximal mem(cid:173)\nory depth to L = 30 and constructed a VLMM with about 3000 contexts. Summarizing the \nresults  in  [5],  classical  MMs of order 0,  1,  2  and 3  achieved  negative log-likelihoods per \n\n\fPredictive Models from Fractal Representations of Sequences \n\n651 \n\ncharacter  (NNL) of 0.853,  0.681,  0.560 and 0.555, respectively.  The  authors point out a \nhuge difference between the number of states in MMs of order 2 and 3:  273 - 272  = 18954. 
The third experiment extends the work of Ron, Singer and Tishby [5]. They tested classical MMs and VLMMs on the Bible. The alphabet consists of the English letters and the blank character (27 symbols). The training set consisted of the Bible except for the book of Genesis. The test set was a portion of 236 characters from the book of Genesis. They set the maximal memory depth to L = 30 and constructed a VLMM with about 3000 contexts. Summarizing the results in [5], classical MMs of order 0, 1, 2 and 3 achieved negative log-likelihoods per character (NNL) of 0.853, 0.681, 0.560 and 0.555, respectively. The authors point out a huge difference between the numbers of states in MMs of order 2 and 3: 27^3 - 27^2 = 18954. The VLMM performed much better and achieved an NNL of 0.456. In our experiments, we set the maximal memory length to L = 30 (the same maximal memory length was used for VLMM construction in [5]). PFMs were constructed by vector quantizing a 5-dimensional (the alphabet has 27 symbols, so D = ⌈log_2 27⌉ = 5) spatial representation of the 30-blocks appearing in the training set. On the test set, PFMs with 100, 500, 1000 and 3000 predictive contexts achieved NNLs of 0.622, 0.518, 0.510 and 0.435, respectively.

4 Conclusion

We presented a novel approach for building finite memory predictive models similar in spirit to variable memory length Markov models (VLMMs). Constructing a series of VLMMs is often a troublesome and highly time-consuming task requiring a lot of interactive steps. Our predictive models, prediction fractal machines (PFMs), can be constructed in a completely automatic and intuitive way: the number of codebook vectors in the vector quantization step of PFM construction corresponds to the number of predictive contexts.

We tested our model on three data sets with different memory and stochastic components. VLMMs excel over classical MMs on the Feigenbaum sequence, which requires deep prediction contexts. On this sequence, PFMs achieved the same performance as their rivals, the VLMMs. On the financial time series, PFMs significantly outperform the purely symbolic Markov models (MMs and VLMMs). On the natural language Bible data, our PFM outperforms a VLMM of comparable size.

Acknowledgments

This work was supported by the Austrian Science Fund (FWF) within the research project "Adaptive Information Systems and Modeling in Economics and Management Science" (SFB 010) and the Slovak Academy of Sciences grant SAV 2/6018/99. The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Science and Transport.

References

[1] P. Bühlmann. Model selection for variable length Markov chains and tuning the context algorithm. Annals of the Institute of Statistical Mathematics, in press, 1999.

[2] J. Freund, W. Ebeling, and K. Rateitschak. Self-similar sequences and universal scaling of dynamical entropies. Physical Review E, 54(5), pp. 5561-5566, 1996.

[3] J. Noh, R.F. Engle, and A. Kane. Forecasting volatility and option prices of the S&P 500 index. Journal of Derivatives, pp. 17-30, 1994.

[4] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. In Advances in Neural Information Processing Systems 6, pp. 176-183. Morgan Kaufmann, 1994.

[5] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning, 25, 1996.

[6] P. Tino. Spatial representation of symbolic sequences through iterative function systems. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 29(4), pp. 386-392, 1999.

[7] M.J. Weinberger, J.J. Rissanen, and M. Feder. A universal finite memory source. IEEE Transactions on Information Theory, 41(3), pp. 643-652, 1995.
", "award": [], "sourceid": 1762, "authors": [{"given_name": "Peter", "family_name": "Ti\u00f1o", "institution": null}, {"given_name": "Georg", "family_name": "Dorffner", "institution": null}]}