{"title": "Dynamic-Depth Context Tree Weighting", "book": "Advances in Neural Information Processing Systems", "page_first": 3328, "page_last": 3337, "abstract": "Reinforcement learning (RL) in partially observable settings is challenging because the agent\u2019s observations are not Markov. Recently proposed methods can learn variable-order Markov models of the underlying process but have steep memory requirements and are sensitive to aliasing between observation histories due to sensor noise. This paper proposes dynamic-depth context tree weighting (D2-CTW), a model-learning method that addresses these limitations. D2-CTW dynamically expands a suffix tree while ensuring that the size of the model, but not its depth, remains bounded. We show that D2-CTW approximately matches the performance of state-of-the-art alternatives at stochastic time-series prediction while using at least an order of magnitude less memory. We also apply D2-CTW to model-based RL, showing that, on tasks that require memory of past observations, D2-CTW can learn without prior knowledge of a good state representation, or even the length of history upon which such a representation should depend.", "full_text": "Dynamic-Depth Context Tree Weighting\n\nJo\u00e3o V. Messias\u2217\nMorpheus Labs\n\nOxford, UK\n\nShimon Whiteson\nUniversity of Oxford\n\nOxford, UK\n\njmessias@morpheuslabs.co.uk\n\nshimon.whiteson@cs.ox.ac.uk\n\nAbstract\n\nReinforcement learning (RL) in partially observable settings is challenging be-\ncause the agent\u2019s observations are not Markov. Recently proposed methods can\nlearn variable-order Markov models of the underlying process but have steep\nmemory requirements and are sensitive to aliasing between observation histories\ndue to sensor noise. This paper proposes dynamic-depth context tree weighting\n(D2-CTW), a model-learning method that addresses these limitations. 
D2-CTW dynamically expands a suffix tree while ensuring that the size of the model, but not its depth, remains bounded. We show that D2-CTW approximately matches the performance of state-of-the-art alternatives at stochastic time-series prediction while using at least an order of magnitude less memory. We also apply D2-CTW to model-based RL, showing that, on tasks that require memory of past observations, D2-CTW can learn without prior knowledge of a good state representation, or even the length of history upon which such a representation should depend.

1 Introduction

Agents must often act given an incomplete or noisy view of their environments. While decision-theoretic planning and reinforcement learning (RL) methods can discover control policies for agents whose actions can have uncertain outcomes, partial observability greatly increases the problem difficulty, since each observation does not provide sufficient information to disambiguate the true state of the environment and accurately gauge the utility of the agent's available actions. Moreover, when stochastic models of the system are not available a priori, probabilistic inference over latent state variables is not feasible. In such cases, agents must learn to memorize past observations and actions [21, 9], or one must learn history-dependent models of the system [15, 8].

Variable-order Markov models (VMMs), which have long excelled in stochastic time-series prediction and universal coding [23, 14, 2], have recently also found application in RL under partial observability [13, 7, 24, 19]. VMMs build a context-dependent predictive model of future observations and/or rewards, where a context is a variable-length subsequence of recent observations. Since the number of possible contexts grows exponentially with both the context length and the number of possible observations, VMMs' memory requirements may grow accordingly. 
Conversely, the frequency of each particular context in the data decreases as its length increases, so it may be difficult to accurately model long-term dependencies without requiring prohibitive amounts of data.

Existing VMMs address these problems by allowing models to differentiate between contexts at non-consecutive past timesteps, ignoring intermediate observations [13, 22, 10, 24, 4]. However, they typically assume that either the amount of input data is naturally limited or there is a known bound on the length of the contexts to be considered. In most settings in which an agent interacts continuously with its environment, neither assumption is well justified. The lack of a defined time limit means the approaches that make the former assumption, e.g., [13, 24], may eventually and indiscriminately use all the agent's physical memory, while those that assume a bound on the context length, e.g., [19], may perform poorly if observations older than this bound are relevant.

*During the development of this work, the main author was employed by the University of Amsterdam.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

This paper proposes dynamic-depth context tree weighting (D2-CTW), a VMM designed for general continual learning tasks. D2-CTW extends context tree weighting (CTW) [23] by allowing it to dynamically grow a suffix tree that discriminates between observations at different depths only insofar as that improves its ability to predict future inputs. This allows it to bound the number of contexts represented in the model without sacrificing the ability to model long-term dependencies.

Our empirical results show that, when used for general stochastic time-series prediction, D2-CTW produces models that are much more compact than those of CTW while providing better results in the presence of noise. 
We also apply D2-CTW as part of a model-based RL architecture and show that it outperforms multiple baselines on the problem of RL under partial observability, particularly when an effective bound on the length of its contexts is not known a priori.

2 Background

2.1 Stochastic Time-Series Prediction

Let an alphabet Σ = {σ¹, σ², ..., σ^|Σ|} be a discrete set of symbols, and let Π(Σ) represent the space of probability distributions over Σ (the (|Σ| − 1)-simplex). Consider a discrete-time stochastic process that, at each time t ≥ 0, samples a symbol σ_t from a probability distribution p_t ∈ Π(Σ). We assume that this stochastic process is stationary and ergodic, and that p_t is a conditional probability distribution, which for some (unknown) constant integer D with 0 < D ≤ t has the form:

p_t(σ) = P(σ_t = σ | σ_{t−1}, σ_{t−2}, ..., σ_{t−D}).    (1)

Let σ_{t−D:t−1} = (σ_{t−D}, σ_{t−D+1}, ..., σ_{t−1}) be the string of symbols from time t − D to t − 1. Since σ_{t−D:t−1} ∈ Σ^D and Σ is finite, there is a finite number of length-D strings on which the evolution of our stochastic process can be conditioned. Thus, the stochastic process can also be represented by a time-invariant function F : Σ^D → Π(Σ) such that p_t =: F(σ_{t−D:t−1}) at any time t ≥ D.

Let s be a string of symbols from alphabet Σ with length |s| and elements [s]_{i∈{1,...,|s|}}. Furthermore, a string q with |q| < |s| is said to be a prefix of s iff q_{1:|q|} = s_{1:|q|}, and a suffix of s iff q_{1:|q|} = s_{|s|−|q|+1:|s|}. We write sq or σs for the concatenation of strings s and q, or of s and symbol σ ∈ Σ. 
A complete and proper suffix set is a set of strings S such that any string not in S has exactly one suffix in S, but no string in S has a suffix in S.

Although D is an upper bound on the age of the oldest symbol on which the process F depends, at any time t it may depend only on some suffix of σ_{t−D:t−1} of length less than D. Given the variable-length nature of its conditional arguments, F can be tractably encoded as a D-bounded tree source [2], which arranges a complete and proper suffix set into a tree-like graphical structure. Each node at depth d ≤ D corresponds to a length-d string; all internal nodes correspond to suffixes of the strings associated with their children; and each leaf encodes a distribution over Σ representing the value of F for that string.

Given a single, uninterrupted sequence σ_{0:t} generated by F, we wish to learn the model F̃ : Σ^D → Π(Σ) that minimises the average log-loss of the observed data σ_{0:t}. Letting P_F̃(· | σ_{i−D}, ..., σ_{i−1}) := F̃(σ_{i−D:i−1}):

l(σ_{0:t} | F̃) = −(1/t) Σ_{i=D}^{t} log P_F̃(σ_i | σ_{i−D}, ..., σ_{i−1}).    (2)

2.2 Context Tree Weighting

The depth-K context tree on alphabet Σ is a graphical structure obtained by arranging all possible strings in Σ^K into a full tree. A context tree has a fixed depth at all leaves and potentially encodes all strings in Σ^K, not just those required by F. More specifically, given a sequence of symbols σ_{0:t−1}, the respective length-K context σ_{t−K:t−1} induces a context path along the context tree by following, at each level d ≤ K, the edge corresponding to σ_{t−d}. The root of the context tree represents the empty string ∅, a suffix of all strings. 
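As a toy illustration of the context-path lookup just described (our own sketch, not the paper's released code; the dictionary layout and names are assumptions), a context tree can be stored as a dict keyed by context strings:

```python
# Toy sketch (ours, not the paper's code): walking the context path of a
# depth-K context tree stored as a dict keyed by context strings.
# The root is the empty string "", a suffix of every string.

def context_path(tree, history, K):
    """Return the nodes visited for the length-K context ending `history`.

    At level d, the edge followed corresponds to the symbol d steps back,
    so the node reached is the length-d suffix of the history.
    """
    path = [""]
    for d in range(1, min(K, len(history)) + 1):
        s = history[-d:]                 # length-d suffix of the history
        if s not in tree:
            tree[s] = {"counts": {}}     # allocate nodes lazily, as in CTW
        path.append(s)
    return path

tree = {"": {"counts": {}}}
print(context_path(tree, "abrac", K=3))  # ['', 'c', 'ac', 'rac']
```

Note that the visited strings form a chain of suffixes of each other, mirroring the suffix-set structure above.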
Furthermore, each node keeps track of the input symbols that have immediately followed its respective context. Let sub(σ_{0:t−1}, s) represent the string obtained by concatenating all symbols σ_i in σ_{0:t−1} whose preceding symbols verify σ_{i−k} = [s]_k for k = 1, ..., |s|. Then, each node s in the context tree maintains its own estimate of the probability of observing the string sub(σ_{0:t−1}, s).

Context tree weighting (CTW) [23] learns a mixture of the estimates of P(sub(σ_{0:t−1}, s)) at all contexts s of length |s| ≤ K and uses it to estimate the probability of the entire observed sequence. Let P^s_e(σ_{0:t−1}) represent the estimate of P(sub(σ_{0:t−1}, s)) at the node corresponding to s, and let P^s_w(σ_{0:t−1}) be a weighted representation of the same measure, defined recursively as:

P^s_w(σ_{0:t−1}) := ½ P^s_e(σ_{0:t−1}) + ½ ∏_{σ∈Σ} P^{σs}_w(σ_{0:t−1})   if |s| < K,
                    P^s_e(σ_{0:t−1})                                     if |s| = K.    (3)

Since sub(σ_{0:t−1}, ∅) = σ_{0:t−1} by definition of the empty context, P^∅_w(σ_{0:t−1}) is an estimate of P(σ_{0:t−1}). The conditional probability of symbol σ_t is approximated as P_F̃(σ_t | σ_{0:t−1}) = P^∅_w(σ_{0:t}) / P^∅_w(σ_{0:t−1}).

The (unweighted) estimate P^s_e at each context is often computed by keeping |Σ| incrementally updated counters [c_{s,t}]_{i=1,...,|Σ|} ∈ ℕ₀, where for each σ^i ∈ Σ, [c_{s,t}]_i represents the total number of instances where the substring sσ^i can be found within σ_{0:t−1}. The vector of counters c_{s,t} can be modelled as the output of a Dirichlet-multinomial distribution with concentration parameter vector α = [α_i]_{i=1,...,|Σ|}. 
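The recursion in (3) can be sketched as follows (our own toy code, not the paper's implementation; we assume each node caches its sequence estimate P_e as a float, and that subtrees never visited contribute a factor of 1 and are simply absent):

```python
# Toy sketch of the CTW mixture in Eq. (3). Each node holds its local
# sequence-probability estimate 'pe' and a dict of child nodes; children
# never visited would contribute a factor of 1 and are omitted.

def weighted_prob(node, K, depth=0):
    """Weighted estimate P_w at `node` for a depth-K context tree."""
    if depth == K or not node["children"]:
        return node["pe"]                 # at depth K, P_w = P_e
    child_product = 1.0
    for child in node["children"].values():
        child_product *= weighted_prob(child, K, depth + 1)
    return 0.5 * node["pe"] + 0.5 * child_product

root = {"pe": 0.02, "children": {
    "a": {"pe": 0.1, "children": {}},
    "b": {"pe": 0.3, "children": {}}}}
print(weighted_prob(root, K=1))  # 0.5*0.02 + 0.5*(0.1*0.3) = 0.025
```

In a real implementation this quantity is maintained incrementally along the context path rather than recomputed by full recursion, as discussed in Section 3.1.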
An estimate of the probability of observing symbol σ^k at time t + 1 can then be taken as follows: if s is on the context path at time t and σ_t = σ^k is the next observed symbol, then [c_{s,t+1}]_k = [c_{s,t}]_k + 1, and [c_{s,t+1}]_i = [c_{s,t}]_i for all i ≠ k. Then:

P^s_e(σ^k | σ_{0:t}) := P_DirM(c_{s,t+1} | α) / P_DirM(c_{s,t} | α) = ([c_{s,t}]_k + [α]_k) / (c⁺_{s,t} + α⁺),    (4)

where α⁺ = Σ_{i=1}^{|Σ|} [α]_i, c⁺_{s,t} = Σ_{i=1}^{|Σ|} [c_{s,t}]_i, and P_DirM is the Dirichlet-multinomial mass function. The estimate of the probability of the full sequence is then P^s_e(σ_{0:t}) = ∏_{τ=0}^{t} P^s_e(σ_τ | σ_{0:τ−1}). This can be updated in constant time as each new symbol is received. The choice of α affects the overall quality of the estimator. We use the sparse adaptive Dirichlet (SAD) estimator [11], which is especially suited to large alphabets.

In principle, a depth-K context tree has |Σ|^{K+1} − 1 nodes, each with at most |Σ| integer counters. In practice, there may be fewer nodes, since one need only allocate space for contexts found in the data at least once, but their total number may still grow linearly with the length of the input string. Thus, for problems such as partially observable RL, in which the amount of input data is unbounded, or for large |Σ| and K, the memory used by CTW can quickly become unreasonable.

Previous extensions to CTW and other VMM algorithms have been made that do not explicitly bound the depth of the model [6, 22]. However, these still take up memory that is worst-case linear in the length of the input sequence. Therefore, they are not applicable to reinforcement learning. To overcome this problem, most existing approaches artificially limit K to a low value, which limits the agent's ability to address long-term dependencies. 
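A minimal sketch of the incremental estimator in (4), using a symmetric concentration vector for simplicity (the paper uses the SAD estimator instead; the class name and the value of `alpha` are our assumptions):

```python
# Toy Dirichlet-multinomial estimator implementing Eq. (4) with a
# symmetric concentration parameter alpha (KT-style; the paper's SAD
# estimator adapts the concentration and is not reproduced here).

class DirMultEstimator:
    def __init__(self, alphabet_size, alpha=0.5):
        self.counts = [0] * alphabet_size
        self.alpha = alpha
        self.alpha_total = alpha * alphabet_size

    def prob(self, k):
        # Eq. (4): ([c_{s,t}]_k + [alpha]_k) / (c+_{s,t} + alpha+)
        return (self.counts[k] + self.alpha) / (sum(self.counts) + self.alpha_total)

    def update(self, k):
        # increment the counter of the symbol that followed this context
        self.counts[k] += 1

est = DirMultEstimator(alphabet_size=2)
print(est.prob(0))           # 0.5 before any data
est.update(0); est.update(0); est.update(1)
print(est.prob(0))           # (2 + 0.5) / (3 + 1.0) = 0.625
```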
To our knowledge, the only existing principled approach to reducing the amount of memory required by CTW was proposed in [5], through the use of a modified (Budget) SAD estimator, which can be used to limit the branching factor in the context tree to B < |Σ|. This approach still requires K to be set a priori, and is best suited to prediction problems with large alphabets but few high-frequency symbols (e.g., word prediction), which is not generally the case in decision-making problems.

2.3 Model-Based RL with VMMs

In RL with partial observability, an agent performs at each time t an action a_t ∈ A, and receives an observation o_t ∈ O and a reward r_t ∈ ℝ with probabilities P(o_t | o_{0:t−1}, r_{0:t−1}, a_{0:t−1}) and P(r_t | o_{0:t−1}, r_{0:t−1}, a_{0:t−1}) respectively. This representation results from marginalising out the latent state variables and assuming that the agent observes rewards. The agent's goal is to maximise the expected cumulative future reward E{Σ_{τ=t+1}^{∞} λ^{τ−t} r_τ} for some discount factor λ ∈ [0, 1). Letting R = {r_t : P(r_t | o_{0:t−1}, r_{0:t−1}, a_{0:t−1}) > 0 ∀ o_{0:t−1}, r_{0:t−1}, a_{0:t−1}} represent the set of possible rewards and z_t ∈ {1, ..., |R|} the unique index of r_t ∈ R, a percept (o_t, z_t) is received at each time with probability P(o_t, z_t | o_{0:t−1}, z_{0:t−1}, a_{0:t−1}). VMMs such as CTW can then learn a model of this process, using the alphabet Σ = O × {1, ..., |R|}. This predictive model must condition on past actions, but its output should only estimate the probability of the next percept (not the next action). This is solved by interleaving actions and percepts in the input context, but only updating its estimators based on the value of the next percept [19]. 
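The interleaving just described can be sketched as follows (our own minimal illustration; the tagging scheme and function name are assumptions, not the paper's API):

```python
# Toy sketch of the action-conditional interleaving described above:
# actions and percepts alternate in the context, but the model is only
# ever asked to predict (and trained on) the next percept.

def interleaved_context(actions, percepts):
    """Build the context ..., action[t], percept[t], ... used when
    predicting the percept that follows the last action."""
    context = []
    for a, p in zip(actions, percepts):
        context.append(("a", a))   # action performed at time t
        context.append(("p", p))   # percept (o_t, z_t) received after it
    return context

history = interleaved_context(["left", "right"], [(0, 1), (2, 0)])
print(history)
# [('a', 'left'), ('p', (0, 1)), ('a', 'right'), ('p', (2, 0))]
```

When the next action is chosen, it is appended to this context and only the resulting percept prediction updates the estimators.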
The resulting action-conditional model can be used as a simulator by sample-based planning methods such as UCT [12].

2.4 Utile Suffix Memory

Utile suffix memory (USM) [13] is an RL algorithm similar to VMMs for stochastic time-series prediction. USM learns a suffix tree that is conceptually similar to a context tree, with the following differences. First, each node in the suffix tree directly maintains an estimate of expected cumulative future reward for each action. To compute this estimate, USM still predicts (immediate) future observations and rewards at each context, analogously to VMM methods. This prediction is done in a purely frequentist manner, which often yields inferior prediction performance compared to other VMMs, especially given noisy data.

Second, USM's suffix tree does not have a fixed depth; instead, its tree is grown incrementally, by testing potential expansions for statistically significant differences between their respective predictions of cumulative future reward. USM maintains a fixed-depth subtree of fringe nodes below the proper leaf nodes of the suffix tree. Fringe nodes do not contribute to the model's output, but they also maintain count vectors. At regular intervals, USM compares the distributions over cumulative future reward of each fringe node against its leaf ancestor, through a Kolmogorov-Smirnov (K-S) test. If this test succeeds at some threshold confidence, then all fringe nodes below that respective leaf node become proper nodes, and a new fringe subtree is created below the new leaf nodes.

USM's fringe expansion allows it to use memory efficiently, as only the contextual distinctions that are actually significant for prediction are represented. 
However, USM is computationally expensive. Performing K-S tests for all nodes in a fringe subtree requires, in the worst case, time linear in the amount of (real-valued) data contained at each node, and exponential in the depth of the subtree. This cost can be prohibitive even if the expansion test is only run infrequently. Furthermore, USM does not explicitly bound its memory use, and simply stopping growth once a memory bound is hit would bias the model towards symbols received early in learning.

3 Dynamic-Depth Context Tree Weighting

We now propose dynamic-depth context tree weighting (D2-CTW). Rather than fixing the depth a priori, like CTW, or using unbounded memory, like USM, D2-CTW learns F̃ with dynamic depth, subject to the constraint |F̃_t| ≤ L at any time t, where L is a fixed memory bound.

3.1 Dynamic Expansion in CTW

To use memory efficiently and avoid requiring a fixed depth, we could simply replicate USM's fringe expansion in CTW, by performing K-S tests on distributions over symbols (P^s_e) instead of distributions over expected reward. However, doing so would introduce bias. The weighted estimate P^s_w(σ_{0:t}) for each context s depends on the ratio of the probability of the observed data at s itself, P^s_e(σ_{0:t}), and that of the data observed at its children, P^{s'}_w(σ_{0:t}) at s' = σs ∀σ ∈ Σ. These estimates depend on the number of times each symbol followed a context, implying that c_{s,t} = Σ_{σ∈Σ} c_{σs,t}. Thus, the weighting in (3) assumes that each symbol that was observed to follow the non-leaf context s was also observed to follow exactly one of its children s'. 
If this were not so and, e.g., s was created at time 0 but its children only at τ > 0, then, since P^{s'}_w(σ_{τ:t}) ≥ P^{s'}_w(σ_{0:t}), the weighting would be biased towards the children, which would have been exposed to less data.

Fortunately, an alternative CTW recursion, originally proposed for numerical stability [20], overcomes this issue. In CTW, and for a context tree of fixed depth K, let β^s_t be the likelihood ratio between the weighted estimate below s and the local estimate at s itself:

β^s_t := ∏_{σ∈Σ} P^{σs}_w(σ_{0:t}) / P^s_e(σ_{0:t})   if |s| < K,
         1                                            if |s| = K.    (5)

Then, the weighted estimate of the conditional probability of an observed symbol σ_t at node s is:

P^s_w(σ_t | σ_{0:t−1}) := P^s_w(σ_{0:t}) / P^s_w(σ_{0:t−1}) = [½ P^s_e(σ_{0:t})(1 + β^s_t)] / [½ P^s_e(σ_{0:t−1})(1 + β^s_{t−1})] = P^s_e(σ_t | σ_{0:t−1}) (1 + β^s_t) / (1 + β^s_{t−1}).    (6)

Furthermore, β^s_t can be updated for each s as follows. Let C_t represent the set of suffixes on the context path at time t (the set of all suffixes of σ_{0:t−1}). Then:

β^s_t = [P^{s'}_w(σ_t | σ_{0:t−1}) / P^s_e(σ_t | σ_{0:t−1})] β^s_{t−1}   if s ∈ C_t,
β^s_t = β^s_{t−1}                                                       otherwise,    (7)

where s' = σ_{t−1−|s|}s is the child of s that follows it on the context path. For any context, we set β^s_0 = 1. This reformulation allows the computation of P^s_w(σ_t | σ_{0:t−1}) using only the nodes on the context path, while storing only a single value in those nodes, β^s_t, regardless of |Σ|.

Since this reformulation depends only on conditional probability estimates, we can perform fringe expansion in CTW and add nodes dynamically without biasing the mixture. Disregard the fixed depth limit K and consider instead a suffix tree where all leaf nodes have a depth greater than the fringe depth H > 0. For any leaf node at depth d, its ancestor at depth d − H is its frontier node. The descendants of any frontier node are fringe nodes. Let f_t represent the frontier node on the context path at time t. At every timestep t, we traverse down the tree by following the context path as in CTW. At every node on the context path and above f_t, we apply (6) and (7) while treating f_t as a leaf node. For f_t and the fringe nodes on the context path below it, we apply the same updates while treating fringe nodes normally. Thus, the recursion in (6) does not carry over to fringe nodes, but otherwise all nodes update their values of β in the same manner.

Once the fringe expansion criterion is met (see Section 3.2), the fringe nodes below f_t simply stop being labeled as such, while the values of β for the nodes above f_t must be updated to reflect the change in the model. Let P̄^{f_t}_w(σ_{0:t}) represent the weighted (unconditional) output at f_t after the fringe expansion step. We therefore have P̄^{f_t}_w(σ_{0:t}) := ½ P^{f_t}_e(σ_{0:t})(1 + β^{f_t}_t), but prior to the expansion, P^{f_t}_w(σ_{0:t}) = P^{f_t}_e(σ_{0:t}). The net change in the likelihood of σ_{0:t}, according to f_t, is:

α^{f_t}_exp := P̄^{f_t}_w(σ_{0:t}) / P^{f_t}_w(σ_{0:t}) = (1 + β^{f_t}_t) / 2.    (8)

This induces a change in the likelihood of the data according to all of the ancestors of f_t. 
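The path update in (6)-(7) can be sketched as follows (our own simplified code, not the paper's implementation; the node layout is an assumption, and the local estimator is reduced to a symmetric Dirichlet):

```python
# Toy sketch of the beta recursion, Eqs. (6)-(7): update the nodes on the
# context path (deepest node first, root last) and return the root's
# weighted conditional probability of the observed symbol.

class Node:
    def __init__(self, alphabet_size=2, alpha=0.5):
        self.counts = [0] * alphabet_size
        self.alpha = alpha
        self.beta = 1.0                      # beta_0 = 1 for every context

    def pe(self, k):                         # local P_e(sigma_t | history)
        return (self.counts[k] + self.alpha) / (
            sum(self.counts) + self.alpha * len(self.counts))

    def observe(self, k):
        self.counts[k] += 1

def update_context_path(path, k):
    """`path` is ordered deepest node first; returns P_w(k | history)."""
    child_pw = None
    for node in path:
        pe = node.pe(k)
        if child_pw is None:
            pw = pe                          # deepest node acts as a leaf
        else:
            new_beta = (child_pw / pe) * node.beta          # Eq. (7)
            pw = pe * (1.0 + new_beta) / (1.0 + node.beta)  # Eq. (6)
            node.beta = new_beta
        node.observe(k)
        child_pw = pw
    return child_pw

path = [Node(), Node()]                      # a leaf and its root ancestor
print(update_context_path(path, 0))          # 0.5 with uniform priors
```

Only the betas of the visited nodes change, which is what makes the later expansion and pruning corrections local to the affected path.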
We need to determine α^∅_exp =: P̄^∅_w(σ_{0:t}) / P^∅_w(σ_{0:t}), which quantifies the effect of the fringe expansion on the global output of the weighted model.

Proposition 1. Let f be a string corresponding to a frontier node, and let p_d be the length-d suffix of f (with p_0 = ∅). Also let ρ^f := ∏_{d=0}^{|f|−1} β^{p_d}_t / (1 + β^{p_d}_t), and α^f_exp := (1 + β^f_t) / 2. Then:

α^∅_exp := P̄^∅_w(σ_{0:t}) / P^∅_w(σ_{0:t}) = 1 + ρ^f (α^f_exp − 1).

The proof can be found in the supplementary material of this paper (Appendix A.1). This formulation is useful since, for any node s in the suffix tree with ancestors (p_0, p_1, ..., p_{|s|−1}), we can associate a value ρ^s_t = ∏_{d=0}^{|s|−1} β^{p_d}_t / (1 + β^{p_d}_t) = ρ^{p_{|s|−1}}_t · β^{p_{|s|−1}}_t / (1 + β^{p_{|s|−1}}_t) that measures the sensitivity of the whole model to changes below s, and not necessarily just fringe expansions. Thus, a node with ρ^s ≈ 0 is a good candidate for pruning (see Section 3.3). Furthermore, this value can be computed while traversing the tree along the context path. Although the computation of ρ^s for a particular node still requires O(|s|) operations, the values of ρ for all ancestors of s are also computed along the way.

3.2 Fringe Expansion Criterion

As a likelihood ratio, α^f_exp provides a statistical measure of the difference between the predictive model at each frontier node f and that formed by its fringe children. Analogously, α^∅_exp can be seen as the likelihood ratio between two models that differ only on the subtree below f. Therefore, we can test the hypothesis that the subtree below f should be added to the model by checking if α^∅_exp > γ for some γ > 1. 
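The expansion test can be computed directly from the betas along the context path, per Proposition 1. A minimal sketch (our own code; function names and the default γ are assumptions):

```python
# Toy sketch of the expansion test in Section 3.2: the global likelihood
# ratio of Proposition 1, computed from the betas on the context path.

def global_expansion_ratio(ancestor_betas, beta_frontier):
    """`ancestor_betas` are the betas of the frontier node's strict
    ancestors (root first); returns alpha_exp at the root."""
    rho = 1.0
    for b in ancestor_betas:
        rho *= b / (1.0 + b)                 # rho^f, Proposition 1
    alpha_f = (1.0 + beta_frontier) / 2.0    # alpha^f_exp, Eq. (8)
    return 1.0 + rho * (alpha_f - 1.0)

def should_expand(ancestor_betas, beta_frontier, gamma=10.0):
    return global_expansion_ratio(ancestor_betas, beta_frontier) > gamma

print(global_expansion_ratio([1.0, 1.0], 7.0))  # 1 + 0.25 * 3 = 1.75
```

A frontier whose ancestors have small betas (ρ near zero) cannot move the global likelihood much, so its expansion is rejected regardless of its local evidence.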
Since the form of P^∅_w(·) is unknown, we cannot establish proper confidence levels for γ; however, the following result shows that the value of γ is not especially important, since, if the subtree below f improves the model, this test will eventually succeed given enough data.

Theorem 1. Let S and S_exp be two proper suffix sets such that S_exp = (S \ f) ∪ F, where f is a suffix of all f' ∈ F. Furthermore, let M and M_exp be the CTW models using the suffix trees induced by S and S_exp respectively, and P^∅_w(σ_{0:t}; M), P^∅_w(σ_{0:t}; M_exp) their estimates of the likelihood of σ_{0:t}. If there is a T ∈ ℕ such that, for any τ > 0:

∏_{t=τ}^{T+τ} P^f_e(σ_t | σ_{0:t−1}; M) < ∏_{t=τ}^{T+τ} ∏_{σ∈Σ} P^{σf}_w(σ_t | σ_{0:t−1}; M_exp),

then for any γ ∈ [1, ∞), there is T' > 0 such that P^∅_w(σ_{0:T'}; M_exp) / P^∅_w(σ_{0:T'}; M) > γ.

The proof can be found in the supplementary material (Appendix A.2). Using α^∅_exp > γ as a statistical test instead of K-S tests yields great computational savings, since the procedure described in Proposition 1 allows us to determine this test in O(|f_t|), typically much lower than the O(|Σ|^{H+1}) complexity of K-S testing all fringe children.

Theorem 1 also ensures that, if sufficient memory is available, D2-CTW will eventually perform as well as CTW with optimal depth bound K = D. This follows from the fact that, for every node s at depth d_s ≤ D in a CTW suffix tree, if β^s_t ≥ 1 for all t > τ, then the D2-CTW suffix tree will be at least as deep as d_s at context s after some time t' ≥ τ. That is, at some point, the D2-CTW model will contain the "useful" sub-tree of the optimal-depth context tree.

Corollary 1. 
Let l(· | F̃_CTW, D) represent the average log-loss of CTW using fixed depth K = D when modeling a D-bounded tree source, and l(· | F̃_D2-CTW, γ, H, L) the same metric when using D2-CTW. For any values of γ > 1 and H > 1, and for sufficiently high L > 0, there exists a time T' > 0 such that, for any t > T', l(σ_{T':t} | F̃_D2-CTW, γ, H, L) ≤ l(σ_{T':t} | F̃_CTW, D).

3.3 Ensuring the Memory Bound

In order to ensure that the memory bound |F̃_t| ≤ L is respected, we must first consider whether a potential fringe expansion requires more memory than is available. Thus, if the subtree below frontier node f has size L_f, we must test whether |F̃_t| + L_f ≤ L. This means that fringe nodes are not taken into account when computing |F̃_t|: they do not contribute to the output of F̃_t, and are therefore considered memory overhead and discarded after training.

Once |F̃_t| is such that no fringe expansions are possible without violating the memory bound, it may still be possible to improve the model by pruning low-quality subtrees to create enough space for more valuable fringe expansions. Pruning operations also have a quantifiable effect on the likelihood of the observed data according to F̃_t. Let P̄^s_w(σ_{0:t}) represent the weighted estimate at internal node s after pruning its subtree. Analogously to (8), we can define α^s_prune := P̄^s_w(σ_{0:t}) / P^s_w(σ_{0:t}) = 2 / (1 + β^s_t), typically with α^s_prune < 1. We can also compute α^∅_prune = 1 + ρ^s (α^s_prune − 1), the global effect on the likelihood, using the procedure in Proposition 1. Since α^∅_prune < 1, if a fringe expansion at f increases P^∅_w(σ_{0:t}) by a factor of α^∅_exp but requires space L_f such that |F̃_t| + L_f > L, we should prune the subtree below some s ≠ f that frees L_s space and reduces P^∅_w(σ_{0:t}) by α^∅_prune if: 1) α^∅_exp × α^∅_prune > 1; 2) |F̃_t| + L_f − L_s ≤ L; and 3) s is not an ancestor of f. The latter condition requires O(|f| − |s|) time to validate, while the former can be done in constant time if ρ^s is available.

In general, some combination of subtrees could be pruned to free enough space for some combination of fringe expansions, but determining the best possible combination of operations at each time is too computationally expensive. As a tractable approximation, we compare only the best single expansion and prune, at nodes f* and s* respectively, quantified with two heuristics H^f_exp := log α^∅_exp and H^s_prune := −log α^∅_prune, such that f* = arg max_f H^f_exp and s* = arg min_s H^s_prune.

As L is decreased, the performance of D2-CTW may naturally degrade. Although Corollary 1 may no longer be applicable in that case, a weaker bound on the performance of memory-constrained D2-CTW can be obtained as follows, regardless of L: let d^t_min denote the minimum depth of any frontier node at time t; then the D2-CTW suffix tree covers the set of d^t_min-bounded models [23]. The redundancy of D2-CTW, measured as the Kullback-Leibler divergence D_KL(F || F̃_t), is then at least as low as the redundancy of a multi-alphabet CTW implementation with K = d^t_min [17].

Figure 1: Calgary Corpus performance with CTW (red) and D2-CTW (blue). 
For average log-loss, lower is better: (a)-(c) using optimal parameters; (d) with a bound on the number of nodes; (e) with size bound and uniform noise; (f) log-loss vs. γ on 'book2', with 10% noise (over 30 runs).

3.4 Complete Algorithm and Complexity

The complete D2-CTW algorithm operates as follows (please refer to Appendix A.3 for the respective pseudo-code): a suffix tree is first initialized containing only a root node; at every timestep, the suffix tree is updated using the observed symbol σ_t and the preceding context (if it exists) from time t − d^t_max − H, where d^t_max is the current maximum depth of the tree and H is the fringe depth. This update returns the weighted conditional probability of σ_t, and it also keeps track of the best known fringe expansion and pruning operations. Then, a post-processing step expands and possibly prunes the tree as necessary, ensuring the memory bound is respected. This step also corrects the values of β for any nodes affected by these topological operations. D2-CTW trains on each new symbol in O(d^t_max + H) time, the same as CTW with depth bound K = d^t_max + H. A worst-case O((d^t_max + H)|Σ|) operations are necessary to sample a symbol from the learned model, also equivalent to CTW. Post-processing requires O(max{|f*|, |s*|}) time.

4 Experiments

We now present empirical results on byte-prediction tasks and partially observable RL. Our code and instructions for its use are publicly available at: https://bitbucket.org/jmessias/vmm_py.

Byte Prediction We compare the performance of D2-CTW against CTW on the 18-file variant of the Calgary Corpus [3], a benchmark of text and binary data files. For each file, we ask the algorithms to predict the next byte given the preceding data, such that |Σ| = 256 across all files. We first compare performance when using (approximately) optimal hyperparameters. 
For CTW, we performed a grid search taking K ∈ {1, . . . , 10} for each file. For D2-CTW, we investigated the effect of γ on the prediction log-loss across different files, and found no significant effect of this parameter for sufficiently large values (an example is shown in Fig. 1f), in accordance with Theorem 1. Consequently, we set γ = 10 for all our D2-CTW runs. We also set L = ∞ and H = 2. The corpus results, shown in Figs. 1a–1c, show that D2-CTW achieves comparable performance to CTW: on average, D2-CTW's loss is 2% higher, which is expected since D2-CTW grows dynamically from a single node, while CTW starts with a fully grown model of optimal height. By contrast, D2-CTW uses many fewer nodes than CTW, by at least one order of magnitude (average factor ∼28). D2-CTW automatically discovers optimal depths that are similar to the optimal values for CTW. We then ran a similar test but with a bound on the number of nodes, L = 1000. For CTW, we enforced this bound by simply stopping the suffix tree from growing beyond this point.2

2For simplicity, we did not use CTW with Budget SAD as a baseline. Budget SAD could also be used to extend D2-CTW, so a fair comparison would necessitate the optional integration of Budget SAD into both CTW and D2-CTW. This is an interesting possibility for future work.

The results
are shown in Fig. 1d. In this case, the log-loss of CTW is on average 11.4%, and up to 32.3%, higher than that of D2-CTW, showing that D2-CTW makes significantly better use of memory.
Finally, we repeated this test but randomly replaced 5% of symbols with uniform noise. This makes the advantage of D2-CTW even more evident, with CTW scoring on average 20.0% worse (Fig. 1e). While the presence of noise still impacts performance, the results show that D2-CTW, unlike CTW, is resilient to noise: spurious contexts are not deemed significant, avoiding memory waste.

Model-Based RL For our empirical study on online partially observable RL tasks, we take as a baseline MC-AIXI, a combination of fixed-depth CTW modelling with ρUCT planning [19], and investigate the effect of replacing CTW with D2-CTW and limiting the available memory. We also compare against PPM-C, a frequentist VMM that is competitive with CTW [2]. Our experimental domains are further described in the supplementary material.
Our first domain is the T-maze [1], in which an agent must remember its initial observation in order to act optimally at the end of the maze. We consider a maze of length 4. We set K = 3 for CTW and PPM-C, which is the minimum depth guaranteed to produce the optimal policy. For D2-CTW, we set γ = 1, H = 2, and do not enforce a memory bound. As in [19], we use an ε-greedy exploration strategy. Fig.
2a shows that D2-CTW discovers the length of the T-maze automatically. Furthermore, CTW and PPM-C fail to learn to retain the required observations, since during the initial stages of learning the agent may need more than 3 steps to reach the goal (D2-CTW learns a model of depth 4).
Our second scenario is the cheese maze [13], a navigation task with aliased observations. Under optimal parameters, D2-CTW and CTW both achieve near-optimal performance on this task. We investigated the effect of setting a bound on the number of nodes, L = 1000, roughly 1/5 of the amount used by CTW with optimal hyperparameters. In Fig. 2b, we show that the quality of D2-CTW degrades less than that of both CTW and PPM-C, still achieving a near-optimal policy. As this is a small problem with D = 2, CTW and PPM-C still produce reasonable results in this case, albeit with lower-quality models than D2-CTW.
Finally, we tested a partially observable version of mountain car [16], in which the position of the car is observed but not its velocity. We coarsely discretised the position of the car into 10 states. In this task, we have no strong prior knowledge about the required context length, but found K = 4 to be sufficient for optimal PPM-C and CTW performance. For D2-CTW, we used γ = 10 and H = 2. We also set L = 1000 for all methods. Fig. 2c shows the markedly superior performance of D2-CTW when subject to this memory constraint.

5 Conclusions and Future Work

We introduced D2-CTW, a variable-order modelling algorithm that extends CTW by using a fringe expansion mechanism that tests contexts for statistical significance, and by allowing the dynamic adaptation of its suffix tree while subject to a memory bound. We showed both theoretically and empirically that D2-CTW requires little configuration across domains and performs better than CTW under memory constraints and/or in the presence of noise.
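The memory-bounded expand/prune rule at the heart of this mechanism can be sketched as follows. The `Candidate` fields, the α values, and the `is_ancestor` helper are assumed names for illustration, not the authors' implementation; the three admissibility conditions mirror those stated earlier in the text.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    node_id: int       # hypothetical node identifier
    alpha_root: float  # factor by which the operation scales P_w at the root
    space: int         # nodes required (expansion) or freed (prune)

def choose_operations(expansions, prunes, tree_size, budget, is_ancestor):
    """Pick the best single fringe expansion f* and, if necessary, the best
    single prune s* that makes room for it under the node budget L."""
    if not expansions:
        return None, None
    # f* maximizes H_exp = log(alpha_exp)
    f = max(expansions, key=lambda c: math.log(c.alpha_root))
    if tree_size + f.space <= budget:
        return f, None  # the expansion fits without pruning anything
    # s* minimizes H_prune = -log(alpha_prune); try prunes in that order
    for s in sorted(prunes, key=lambda c: -math.log(c.alpha_root)):
        if (f.alpha_root * s.alpha_root > 1.0               # 1) net gain in P_w
                and tree_size + f.space - s.space <= budget  # 2) fits the bound
                and not is_ancestor(s.node_id, f.node_id)):  # 3) s not above f
            return f, s
    return None, None  # no admissible pair: leave the tree unchanged
```

The loop over candidate prunes is one simple way to respect condition 3; since pruning an ancestor of f would delete f itself, such pairs are skipped rather than scored.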
In future work, we will investigate the use of the Budget SAD estimator with a dynamic budget as an alternative mechanism for informed pruning. We also aim to apply a similar approach to context tree switching (CTS) [18], an algorithm that is closely related to CTW but enables mixtures over a larger model class.

Figure 2: Performance measured as (running) average rewards in (a) T-maze; (b) cheese maze; (c) partially observable mountain car. Results show the mean over 10 runs, with shading from the first to the third quartile.

Acknowledgments

This work was supported by the European Commission under the grant agreement FP7-ICT-611153 (TERESA).

References

[1] B. Bakker. Reinforcement learning with long short-term memory. In Proceedings of the 14th International Conference on Neural Information Processing Systems, NIPS'01, pages 1475–1482, Cambridge, MA, USA, 2001. MIT Press.

[2] R. Begleiter, R. El-Yaniv, and G. Yona. On prediction using variable order Markov models. Journal of Artificial Intelligence Research, 22:385–421, 2004.

[3] T. Bell, I. H. Witten, and J. G. Cleary. Modeling for text compression. ACM Computing Surveys (CSUR), 21(4):557–591, 1989.

[4] M. Bellemare, J. Veness, and E. Talvitie. Skip context tree switching. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1458–1466, 2014.

[5] M. G. Bellemare. Count-based frequency estimation with bounded memory. In IJCAI, pages 3337–3344, 2015.

[6] J. G. Cleary and W. J. Teahan.
Unbounded length contexts for PPM. The Computer Journal, 40(2 and 3):67–75, 1997.

[7] V. F. Farias, C. C. Moallemi, B. Van Roy, and T. Weissman. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441–2454, 2010.

[8] W. L. Hamilton, M. M. Fard, and J. Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In ICML (1), pages 178–186, 2013.

[9] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium Series, 2015.

[10] M. P. Holmes and C. L. Isbell Jr. Looping suffix tree-based inference of partially observable hidden state. In Proceedings of the 23rd International Conference on Machine Learning, pages 409–416. ACM, 2006.

[11] M. Hutter et al. Sparse adaptive Dirichlet-multinomial-like processes. In Conference on Learning Theory: JMLR Workshop and Conference Proceedings, volume 30. Journal of Machine Learning Research, 2013.

[12] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.

[13] A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1995.

[14] D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2-3):117–149, 1996.

[15] S. Singh, M. R. James, and M. R. Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 512–519. AUAI Press, 2004.

[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[17] T. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems. Context tree weighting: Multi-alphabet sources.
In 14th Symposium on Information Theory in the Benelux, pages 128–135, 1993.

[18] J. Veness, K. S. Ng, M. Hutter, and M. Bowling. Context tree switching. In Data Compression Conference (DCC), 2012, pages 327–336. IEEE, 2012.

[19] J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40(1):95–142, 2011.

[20] P. A. J. Volf. Weighting techniques in data compression: Theory and algorithms. Technische Universiteit Eindhoven, 2002.

[21] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber. Solving deep memory POMDPs with recurrent policy gradients. In International Conference on Artificial Neural Networks, pages 697–706. Springer, 2007.

[22] F. M. Willems. The context-tree weighting method: Extensions. IEEE Transactions on Information Theory, 44(2):792–798, 1998.

[23] F. M. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.

[24] F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 54(2):91–98, 2011.