{"title": "Simple Local Models for Complex Dynamical Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 1617, "page_last": 1624, "abstract": "We present a novel mathematical formalism for the idea of a local model,'' a model of a potentially complex dynamical system that makes only certain predictions in only certain situations. As a result of its restricted responsibilities, a local model may be far simpler than a complete model of the system. We then show how one might combine several local models to produce a more detailed model. We demonstrate our ability to learn a collection of local models on a large-scale example and do a preliminary empirical comparison of learning a collection of local models and some other model learning methods.\"", "full_text": "Simple Local Models for Complex Dynamical Systems\n\nErik Talvitie\n\nSatinder Singh\n\nComputer Science and Engineering\n\nComputer Science and Engineering\n\nUniversity of Michigan\n\netalviti@umich.edu\n\nUniversity of Michigan\nbaveja@umich.edu\n\nAbstract\n\nWe present a novel mathematical formalism for the idea of a \u201clocal model\u201d of an\nuncontrolled dynamical system, a model that makes only certain predictions in\nonly certain situations. As a result of its restricted responsibilities, a local model\nmay be far simpler than a complete model of the system. We then show how\none might combine several local models to produce a more detailed model. We\ndemonstrate our ability to learn a collection of local models on a large-scale ex-\nample and do a preliminary empirical comparison of learning a collection of local\nmodels and some other model learning methods.\n\n1 Introduction\n\nBuilding models that make good predictions about the world can be a complicated task. Humans,\nhowever, seem to have the remarkable ability to split this task up into manageable chunks. 
For\ninstance, the activity in a park may have many complex interacting components (people, dogs, balls,\netc.) and answering questions about their joint state would be impossible. It can be much simpler\nto answer abstract questions like \u201cWhere will the ball bounce?\u201d ignoring most of the detail of what\nelse might happen in the next moment. Some other questions like \u201cWhat will the dog do?\u201d may still\nbe very dif\ufb01cult to answer in general, as dogs are complicated objects and their behavior depends\non many factors. However, in certain situations, it may be relatively easy to make a prediction. If a\nball has just been thrown, one may reasonably predict that the dog will chase it, without too much\nconsideration of other potentially relevant facts. In short, it seems that humans have a lot of simple,\nlocalized pieces of knowledge that allow them to make predictions about particular aspects of the\nworld in restricted situations. They can combine these abstract predictions to form more concrete,\ndetailed predictions. Of course, there has been substantial effort in exploiting locality/independence\nstructure in AI. Much of it is focused on static domains without temporal concerns (e.g. [1]), though\nthese ideas have been applied in dynamical settings as well (e.g. [2, 3]). Our main contribution\nis to provide a novel mathematical formulation of \u201clocal models\u201d of dynamical systems that make\nonly certain predictions in only certain situations. We also show how to combine them into a more\ncomplete model. Finally, we present empirical illustrations of the use of our local models.\n\n1.1 Background\n\nIn this paper we will focus on learning models of uncontrolled discrete dynamical systems (we leave\nconsideration of controlled systems to future work). At each time step i the system emits an obser-\nvation oi from a \ufb01nite set of observations O. 
We call sequences of observations tests and let T be the set of all possible tests of all lengths. At time step i, the history is simply the sequence o1o2...oi of past observations. We use the letter \u03c6 to represent the null history in which no observation has yet been emitted. A prediction of a test t = oi+1...oi+k given a history h = o1...oi, which we denote p(t|h), is the conditional probability that the sequence t will occur, given that the sequence h has already occurred: p(t|h) def= Pr(Oi+1 = oi+1, ..., Oi+k = oi+k | O1 = o1, ..., Oi = oi), where Oj denotes the random observation at step j. The set of all histories H is defined: H def= {t \u2208 T : p(t|\u03c6) > 0} \u222a {\u03c6}. We use models to make predictions:

Definition 1. A complete model can generate predictions p(t|h) for all t \u2208 T and h \u2208 H.

A model that can make every such prediction can make any conditional prediction about the system [4]. For instance, one may want to make predictions about whether any one of a set of possible futures will occur (e.g. \u201cWill the man throw a ball any time before he leaves the park?\u201d). We can represent this type of prediction using a union test (also called a \u201ccollective outcome\u201d by Jaeger [5]).

Definition 2. A union test T \u2286 T is a set of tests such that if t \u2208 T then no prefix of t is in T. The prediction of a union test is a sum of predictions: p(T|h) def= \u03a3_{t \u2208 T} p(t|h).

Models may be provided by an expert, or we can learn them from experience with the system (in the form of a data set of observation sequences emitted by the system). The complexity of representing and learning a model often depends on the complexity of the system being modeled. The measure of complexity that we will adopt is called the linear dimension [6] and is defined as the rank of the \u201csystem dynamics matrix\u201d (the infinite matrix of predictions whose (i, j)th entry is p(tj|hi) for all tj \u2208 T and hi \u2208 H).
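These definitions can be made concrete with a small sketch: computing union-test predictions as sums of member-test predictions, and computing the rank of a finite truncation of the system dynamics matrix. The two-observation Markov chain and its transition probabilities below are hypothetical, chosen only to keep the example small.

```python
import numpy as np
from itertools import product

# A hypothetical 2-observation Markov chain (observations 0 and 1).
# P[i, j] = Pr(next observation = j | current observation = i).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
INIT = np.array([0.5, 0.5])  # distribution of the first observation

def p_cond(t, h):
    """p(t|h): probability that test t follows history h in this chain."""
    prev = h[-1] if h else None
    prob = 1.0
    for o in t:
        prob *= INIT[o] if prev is None else P[prev, o]
        prev = o
    return prob

def p_union(T, h):
    """Prediction of a union test: the sum of its members' predictions."""
    return sum(p_cond(t, h) for t in T)

# "Observation 1 occurs within two steps" as a union test; note that
# no member test is a prefix of another member test.
print(p_union([(1,), (0, 1)], (0,)))  # 0.1 + 0.9*0.1 = 0.19

# Finite truncation of the system dynamics matrix: entry (i, j) is
# p(t_j | h_i) for all histories/tests up to length 3. The rank stays
# at 2 (the linear dimension of this chain) however far we extend it.
seqs = [s for n in range(1, 4) for s in product((0, 1), repeat=n)]
D = np.array([[p_cond(t, h) for t in seqs] for h in seqs])
print(np.linalg.matrix_rank(D))  # 2
```

The rank is 2 because every row depends only on the last observation of its history, so only two distinct rows occur no matter how many histories are enumerated.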
It is also closely related to the number of underlying states in a Hidden Markov Model. We will not define it more formally here but note that when we say one system is simpler than another, we mean that it has a smaller linear dimension.

We will now present the main contributions of our work, starting by precisely defining a local model, and then showing how local models can be combined to create a more complete model.

2 Local Models

In contrast to a complete model, a local model has limited prediction responsibilities and hence makes only certain predictions in certain situations.

Definition 3. Given a set of tests of interest T I and a set of histories of interest HI, a local model is any model that generates the predictions of interest: p(t|h) for all t \u2208 T I and h \u2208 HI.

We will assume, in general, that the tests of interest are union tests. In this paper, we will place a constraint on HI \u2286 H which we will call the \u201csemi-Markov\u201d property, due to its close relationship to the concept of the same name in the \u201coptions\u201d literature [7]; this assumption will be relaxed in future work. In words, we require that, in order to determine whether the current history is of interest, we need only look at what has happened since the preceding history of interest. Put formally,

Definition 4. A set of histories of interest HI is semi-Markov iff for all h, h\u2032 \u2208 HI \u222a {\u03c6} and t \u2208 T , ht \u2208 HI implies that either h\u2032t \u2208 HI or p(h\u2032t|\u03c6) = 0.

As a simple example, consider the 1D Ball Bounce system (see Figure 1). The agent observes a line of pixels, one of which (the location of the \u201cball\u201d) is black; the rest are white.
The ball\nmoves along the line, changing direction when it hits the edge.\nEach time step, with probability 0.5, the ball sticks in place, and\nwith probability 0.5 it moves one square in its current direction.\n\nFigure 1: 1D Ball Bounce\n\nOne natural local model would make one-step predictions about only one pixel, p. It has two tests\nof interest: the set of all one-step tests in which the pixel p is black, and the set of all one-step tests\nin which p is white. All histories are of interest. This local model answers the question \u201cWhat is the\nchance the ball will be in pixel p next?\u201d Note that, in order to answer this question, we need only\nobserve the color of the pixels neighboring p. We will refer to this example as Model A.\n\nAnother, even more restricted local model would be one that has the same tests of interest, but\nwhose histories of interest are only those that end with pixel p being black. This local model would\nessentially answer the question \u201cWhen the ball is in pixel p, what is the chance that it will stick?\u201d\nIn order to make this prediction, the local model can ignore all detail; the prediction for the test of\ninterest is always 0.5 at histories of interest. We will refer to this local model as Model B.\n\nIn general, as in the examples above, we expect that many details about the world are irrelevant to\nmaking the predictions of interest and could be ignored in order to simplify the local model. Taking\nan approach similar to that of, e.g., Wolfe & Barto [8], Soni & Singh [9], or Talvitie et al. [10], given\ntests and histories of interest, we will show how to convert a primitive observation sequence into an\n\n2\n\n\fabstract observation sequence that ignores unnecessary detail. A complete model of the abstracted\nsystem can be used as a local model in the original, primitive system. The abstraction proceeds in\ntwo steps (shown in Figure 2). 
First, we construct an intermediate system which makes predictions\nfor all tests, but only updates at histories of interest. Then we further abstract the system by ignoring\ndetails irrelevant to making predictions for just the tests of interest.\n\n2.1 Abstracting Details for Local Predictions\n\nIncorporating Histories Of Interest: Intuitively, since a local model is never asked to make a\nprediction at a history outside of HI , one way to simplify it is to only update its predictions at\nhistories of interest. Essentially, it \u201cwakes up\u201d whenever a history of interest occurs, sees what\nobservation sequence happened since it was last awake, updates, and then goes dormant until the\nnext history of interest. We call the sequences of observations that happen between histories of\ninterest bridging tests. The set of bridging tests T B is induced by the set of histories of interest.\nDe\ufb01nition 5. A test t \u2208 T is a bridging test iff for all j < |t|, and all h \u2208 HI , ht[1...j] /\u2208 HI (where\nt[1...j] denotes the j-length pre\ufb01x of t) and either \u2203 h \u2208 HI such that ht \u2208 HI or |t| = \u221e.\n\nConceptually, we transform the primitive observa-\ntion sequence into a sequence of abstract observa-\ntions in which each observation corresponds to a\nbridging test. We call such a transformed sequence\nthe Temporally Extended or T E sequence (see Fig-\nure 2). Note that even when the primitive system has\na small number of observations, the T E system can\nhave in\ufb01nitely many, because there can be an in\ufb01n-\nity of bridging tests. However, because it does not\nupdate between histories of interest, a model of T E\nmay be simpler than a model of the original system.\nTo see this, consider again the 1D Ball Bounce of\nsize k. This system has linear dimension O(2k), in-\ntuitively because the ball has 2 possible directions and k possible positions. Recall Model B, that\nonly applies when the ball lands on a particular pixel. 
The bridging tests, then, are all possible ways\nthe ball could travel to an edge and back. The probability of each bridging test depends only on the\ncurrent direction of the ball. As such, the T E system here has linear dimension 2, regardless of k.\nIt is possible to show formally that the T E system is never more complex than the original system.\n\nFigure 2: Mapping experience in the original\nsystem to experience in the TE system, and\nthen to experience in the abstract system.\n\nProposition 1. If the linear dimension of a dynamical system is n then, given a semi-Markov set of\nhistories of interest HI , the linear dimension of the induced T E system, nT E \u2264 n.\n\nProof. (Sketch) The linear dimension of a system is the rank of the system dynamics matrix (SDM)\ncorresponding to the system [6]. The matrix corresponding to the T E system is the submatrix of the\nSDM of the original system with only columns and rows corresponding to histories and tests that are\nsequences of bridging tests. A submatrix never has greater rank than the matrix that contains it.\n\nWhat good is a model of the TE system? We next show that a model of the TE system can make\npredictions for all tests t \u2208 T in all histories of interest h \u2208 HI . Speci\ufb01cally, we show that the\nprediction for any test in a history of interest can be expressed as a prediction of a union test in\nT E. For the following, note that every history of interest h \u2208 HI can be written as a corresponding\nsequence of bridging tests, which we will call sh. Also, we will use the subscript T E to distinguish\npredictions pT E(t|h) in T E from predictions p(t|h) in the original system.\nProposition 2. For any primitive test t \u2208 T in the original system, there is a union test St in T E\nsuch that p(t|h) = pT E(St|sh) for all h \u2208 HI .\n\nProof. We will present a constructive proof. First suppose t can be written as a sequence of bridging\ntests st. Then trivially St = {st}. 
If t does not correspond to a sequence of bridging tests, we can re-write it as the concatenation of two tests: t = t1t2 such that t1 is the longest prefix of t that is a sequence of bridging tests (which may be null) and t2 \u2209 T B. Now, p(t|h) = p(t1|h)p(t2|ht1), where h, ht1 \u2208 HI. We know already that p(t1|h) = pT E(st1|sh). To calculate p(t2|ht1), note that there must be a set of bridging tests Bt2 which have t2 as a prefix: Bt2 def= {b \u2208 T B : b[1...|t2|] = t2}. The probability of seeing t2 is the probability of seeing any of the bridging tests in Bt2. Thus, at the history of interest ht1, p(t2|ht1) = \u03a3_{b \u2208 Bt2} p(b|ht1) = \u03a3_{b \u2208 Bt2} pT E(b|sh st1). So, we let St = {st1 b : b \u2208 Bt2}, which gives us the result.

Since tests of interest are union tests, to make the prediction of interest p(T|h) for some T \u2208 T I and h \u2208 HI using a model of T E, we have simply p(T|h) = pT E(ST |sh) = \u03a3_{t \u2208 T} pT E(St|sh).

A model of T E is simpler than a complete model of the system because it only makes predictions at histories of interest. However, it still makes predictions for all tests. We can further simplify our modeling task by focusing on predicting the tests of interest.

Incorporating Tests of Interest: Recall Model A from our example. Since all histories are of interest, bridging tests are single observations, and T E is exactly equivalent to the original system. However, note that in order to make the predictions of interest, one must only know whether the ball is neighboring or on the pixel. So, we need only distinguish observations in which the ball is nearby, and we can group the rest into one abstract observation: \u201cthe ball is far from the pixel.\u201d

In general we will attempt to abstract away unnecessary details of bridging tests by aliasing bridging tests that are equivalent with respect to making the predictions of interest.
Speci\ufb01cally, we will\nde\ufb01ne a partition, or a many-to-one mapping, from T E observations (the bridging tests T B) to\nabstract observations A. We will then use a model of the abstract system with A as its observations\n(see Figure 2) as our local model. So, A must have the following properties: (1) we must be able\nto express the tests of interest as a union of sequences of abstract observations in A and (2) an\nabstracted history must contain enough detail to make accurate predictions for the tests of interest.\n\nLet us \ufb01rst consider how to satisfy (1). For ease of exposition, we will discuss a special case. We\nassume that tests of interest are unions of one-step tests (i.e., for any T \u2208 T I , T \u2286 O) and that\nT I partitions O, so every observation is contained within exactly one test of interest. One natural\nexample that satis\ufb01es this assumption is where the local model makes one-step predictions for a\nparticular dimension of a vector-valued observation. There is no fundamental barrier to treating tests\nof interest that are arbitrary union tests, but the development of the general case is more complex.\n\nNote that if a union test T \u2282 O, then the equivalent T E union test, ST , consists of every bridging\ntest that begins with an observation in T . So, if T I partitions O, then S I def={ST : T \u2208 T I } partitions\nthe bridging tests, T B, according to their \ufb01rst observation. As such, if we chose A = S I , or any\nre\ufb01nement thereof, we would satisfy criterion (1). However, S I may not satisfy (2). For instance,\nin our 1D Ball Bounce, in order to make accurate predictions for one pixel it does not suf\ufb01ce to\nobserve that pixel and ignore the rest. We must also distinguish the color of the neighboring pixels.\nThis problem was treated explicitly by Talvitie et al. [10]. They de\ufb01ne an accurate partition:\nDe\ufb01nition 6. 
An observation abstraction A is accurate with respect to T I iff for any two primitive\nhistories h1 = o1...ok and h2 = o\u20321...o\u2032k such that \u2200i oi and o\u2032i are contained within the same\nabstract observation Oi \u2208 A, we have p(T |h1) = p(T |h2), \u2200T \u2208 T I .\n\nThe system we are abstracting is T E, so the observations are bridging tests. We require an accurate\nre\ufb01nement of S I . Any re\ufb01nement of S I satis\ufb01es criterion (1). Furthermore, an accurate re\ufb01nement\nis one that only aliases two histories if they result in the same predictions for the tests of interest.\nThus, we can use an abstract history to make exactly the same predictions for the tests of interest that\nwe would make if we had access to the primitive history. So, an accurate re\ufb01nement also satis\ufb01es\ncriterion (2). Furthermore, an accurate re\ufb01nement always exists, because the partition that distin-\nguishes every observation is trivially accurate, though in general we expect to be able to abstract\naway some detail. Finally, a model of the abstract system may be far simpler than a model of the\noriginal system or the T E system, and can be no more complex:\n\nProposition 3. If the linear dimension of a dynamical system is n then the linear dimension of any\nlocal model M, nM \u2264 nT E \u2264 n.\n\nProof. (Sketch) The rows and columns of the SDM corresponding to an abstraction of T E are linear\ncombinations of rows and columns of the SDM of T E [10]. So, the rank of the abstract SDM can\nbe no more than the rank of the SDM for T E.\n\n4\n\n\fLearning a local model: We are given tests and histories of interest and an accurate abstraction.\nTo learn a local model, we \ufb01rst translate the primitive trajectories into T E trajectories using the\nhistories of interest, and then translate the T E trajectories into abstract trajectories using the accurate\nabstraction (as in Figure 2). We can then train any model on the abstracted data. 
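The two-stage translation just described can be sketched in code. The histories-of-interest predicate and the aliasing function below are hypothetical stand-ins for whatever structure is actually given; the point is only the mechanics of the primitive-to-TE-to-abstract translation.

```python
def to_te(obs, of_interest):
    """Translate a primitive observation sequence into a TE sequence:
    one TE observation (a completed bridging test) per history of interest."""
    te, chunk, hist = [], [], []
    for o in obs:
        hist.append(o)
        chunk.append(o)
        if of_interest(tuple(hist)):
            te.append(tuple(chunk))  # a completed bridging test
            chunk = []
    return te  # a trailing, incomplete bridging test is dropped

def abstract(te_seq, alias):
    """Further translate the TE sequence by aliasing bridging tests that
    are equivalent with respect to the predictions of interest."""
    return [alias(b) for b in te_seq]

# Hypothetical structure: histories of interest are those ending with
# observation 1, and the abstraction keeps each bridging test's first
# observation (so tests of interest stay expressible) plus a coarse
# feature of the rest (whether the test took more than two steps).
obs = [0, 0, 1, 0, 1, 1, 0]
te = to_te(obs, lambda h: h[-1] == 1)
print(te)  # [(0, 0, 1), (0, 1), (1,)]
print(abstract(te, lambda b: (b[0], len(b) > 2)))  # [(0, True), (0, False), (1, False)]
```

Any sequence model can then be trained on the resulting abstract trajectories.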
In our experiments, we use POMDPs [11], PSRs [4], and low-order Markov models as local model representations.

2.2 Combining Local Models

Consider a collection of local models M. Each local model M \u2208 M has tests of interest T I_M, histories of interest HI_M, and is an exact model of the abstract system induced by a given accurate refinement, AM. At any history h, the set of models Mh def= {M \u2208 M : h \u2208 HI_M} is available to make predictions for their tests of interest. However, we may wish to make predictions that are not specifically of interest to any local model. In that case, we must combine the abstract, coarse predictions made by individual models into more fine-grained joint predictions. We will make a modeling assumption that allows us to efficiently combine the predictions of local models:

Definition 7. The local models in Mh are mutually conditionally independent given h iff for any subset {M1, M2, ..., Mk} \u2286 Mh, and any T1 \u2208 T I_M1, T2 \u2208 T I_M2, ..., Tk \u2208 T I_Mk, the prediction of the intersection is equal to the product of the predictions: p(\u2229_{i=1}^{k} Ti | h) = \u03a0_{i=1}^{k} p(Ti | h).

A domain expert specifying the structure of a collection of local models should strive to satisfy this property as best as possible since, given this assumption, a collection of local models can be used to make many more predictions than can be made by each individual model. We can compute the predictions of finer-grained tests (intersections of tests of interest) by multiplying predictions together. We can also compute the predictions of unions of tests of interest using the standard formula: Pr(A \u222a B) = Pr(A) + Pr(B) \u2212 Pr(A \u2229 B). At any history h for which Mh \u2260 \u2205, a collection of local models can be used to make predictions for any union test that can be constructed by unioning/intersecting the tests of interest of the models in Mh.
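Under the mutual conditional independence assumption, combining abstract predictions reduces to products and inclusion-exclusion. A minimal sketch, using hypothetical numbers for two local models' predictions of interest at the current history:

```python
from itertools import combinations
from math import prod

def p_intersection(preds):
    """Intersection of tests of interest from mutually conditionally
    independent local models: the product of their predictions."""
    return prod(preds)

def p_union_of_tests(preds):
    """Union of the same events via inclusion-exclusion; independence
    lets every intersection term be evaluated as a product."""
    return sum(
        (-1) ** (k + 1) * sum(prod(c) for c in combinations(preds, k))
        for k in range(1, len(preds) + 1)
    )

# Hypothetical one-step predictions from two local models at the same
# history, e.g. "pixel p is black next" and "pixel q is black next".
preds = [0.5, 0.25]
print(p_intersection(preds))   # 0.125
print(p_union_of_tests(preds)) # 0.5 + 0.25 - 0.125 = 0.625
```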
This may not include all tests.\nOf course making all predictions may not be practical, or necessary. A collection of local models\ncan selectively focus on making the most important predictions well, ignoring or approximating less\nimportant predictions to save on representational complexity.\n\nOf course, a collection of local models can be a complete model. For instance, note that any\nmodel that can make the predictions p(o|h) for every o \u2208 O and h \u2208 H is a complete model.\nThis is because every prediction can be expressed in terms of one-step predictions: p(o1...ok|h) =\np(o1|h)p(o2|ho1)...p(ok|ho1...ok\u22121). As such, if every one-step test is expressible as an intersection\nof tests of interest of models in Mh at every h, then M is a complete model. That said, for a given\nM, the mutual conditional independence property may or may not hold. If it does not, predictions\nmade using M will be approximate, even if each local model in M makes its predictions of interest\nexactly. It would be useful, in future work, to explore bounds on the error of this approximation.\n\nWhen learning a collection of local models in this paper, we assume that tests and histories of in-\nterest as well as an accurate re\ufb01nement for each model are given. We then train each local model\nindividually on abstract data. This is a fair amount of knowledge to assume as given, though it\nis analogous to providing the structure of a graphical model and learning only the distribution pa-\nrameters, which is common practice. Automatically splitting a system into simple local models is\nan interesting, challenging problem, and ripe ground for future research. 
We hope that casting the structure learning problem in the light of our framework may illuminate new avenues to progress.

Table 1: Local model structure for the arcade game

HI_M: M applies when history ends with: | T I_M: M makes one-step predictions for: | AM: M additionally distinguishes bridging tests by:
Ball hitting brick b | Color of 6\u00d74 pixels within b | Type of special bricks hit and type of special brick most recently hit
Ball not hitting brick b | Color of 6\u00d74 pixels within b | None
Ball in position p, coming from direction d | Absence or presence of ball color in 6\u00d76 pixels around p | Configuration of bricks adjacent to p in last step of bridging test
No brick in pixel p and no ball near pixel p | Color of pixel p | None

2.3 Relationship to Other Structured Representations

Here we briefly discuss a few especially relevant alternative modeling technologies that also aim to exploit local and independence structure in dynamical systems.

DBNs: The dynamic Bayes network (DBN) [2] is a representation that exploits conditional independence structure. The main difference between DBNs and our collection of local models is that DBNs specify independence structure over \u201chidden variables\u201d whose values are never observed. Our representation expresses structure entirely in terms of predictions of observations. Thus our structural assumptions can be verified using statistical tests on the data while DBN assumptions cannot be directly verified. That said, a DBN does decompose its world state into a set of random variables. It stores the conditional probability distribution for each variable, given the values in the previous time step. These distributions are like local models that make one-step predictions about their variable. For each variable, a DBN also specifies which other variables can be ignored when predicting its next value.
This is essentially our accurate re\ufb01nement, which identi\ufb01es details a local model can\nignore. Histories of interest are related to the concept of context-speci\ufb01c independence [12].\n\nRelational Models: Relational models (e.g. [3]) treat the state of the world as a conjunction of\npredicates. The state evolves using \u201cupdate rules,\u201d consisting of pre-conditions specifying when the\nrule applies and post-conditions (changes to the state). Update rules are essentially local models\nwith pre and post-conditions playing the roles of histories and tests of interest. Relational models\ntypically focus on Markov worlds. We address partial observability by essentially generalizing the\n\u201cupdate rule.\u201d The main strength of relational models is that they include \ufb01rst-order variables in\nupdate rules, allowing for sophisticated parameter tying and generalization. We use parameter tying\nin our experiments, but do not incorporate the formalism of variables into our framework.\n\nOthers: Wolfe and Singh recently introduced the Factored PSR [13] which is essentially a special\ncollection of local models. Also related are maximum entropy models (e.g.\n[14], [15]) which\nrepresent predictions as weighted products of features of the future and the past.\n\n3 Experimental Results\n\nLarge Scale Example:\nIn this section we present preliminary empir-\nical results illustrating the application of collections of local models.\nOur \ufb01rst example is a modi\ufb01ed, uncontrolled version of an arcade game\n(see Figure 3). The observations are 64 \u00d7 42 pixel images. In the im-\nage is a 2 \u00d7 2 pixel ball and a wall of 6 \u00d7 4 pixel bricks. After the ball\nhits a brick, the brick disappears. When the ball hits the bottom wall, it\nbounces at a randomly selected angle. An episode ends when there are\nno more bricks. 
In our version there are two types of \u201cspecial bricks.\u201d After the ball hits a dark brick, all bricks require two hits rather than one to break. After the ball hits a light brick, all bricks require only one hit to break. When they are first placed, bricks are regular (medium gray) with probability 0.9 and dark or light each with probability 0.05. This system is stochastic, partially observable (and because of the special bricks, not short-order Markov). It has roughly 10^20 observations and even more underlying states.

Figure 3: Arcade game

The decomposition into local models is specified in Table 1 (see footnote 1). Quite naturally, we have local models to predict how the bricks (rows 1-2), the ball (row 3), and the background (row 4) will behave. This structure satisfies the mutual conditional independence property, and since every pixel is predicted by some model at every history, we can make fully detailed 64 \u00d7 42 pixel one-step predictions. More or less subdivision of models could be applied, the tradeoff being the complexity of individual models versus the total number of local models. With the structure we have selected there are approximately 25,000 local models. Of course, naively training 25,000 models is impractical. We can improve our data efficiency and training time through parameter tying. In this system, the behavior of objects does not depend on their position.
To take advantage of this, for each type of local model (12 in total, since there is a ball model for each of the 9 directions) we combine all translated trajectories associated with various positions and use them to train a single shared model. Each local model maintains its own state, but the underlying model parameters are shared across all models of the same type, associated with different positions. Note that position does matter in the first time step, since the ball always appears in the same place. As a result, our model makes bad predictions about the first time step. For clarity of presentation, we will ignore the first time step in our results.

Footnote 1: there are 30 bricks b, 2,688 pixels p, 2,183 possible positions p for the ball, and 9 possible directions d the ball could come from, including the case in the first step, where the ball simply appears in a pixel.

[Figure 5 shows two plots of Avg. Likelihood Ratio vs. # Training Episodes (0 to 10000), for sizes 5 and 20, with curves for Local POMDP, Local PSR, DBN, POMDP, and PSR.]

Figure 5: Left: Results for the 1D Ball Bounce problem. Error bars are omitted to avoid graph clutter. Right: DBN structure used. All nodes are binary. The shaded nodes are hidden. Links from \u201cVel.\u201d at t \u2212 1 to all nodes at t omitted for simplicity.

For the local models themselves, we used lookup-table-based short-order Markov representations. Though the overall system is not short-order Markov, each local model is. Our learned local models were first-order Markov except the one responsible for predicting what will happen to a brick when the ball hits it. This model was second-order Markov.
No local model had more than 200 states.

[Figure 4 plots Avg. Likelihood Ratio and % Episodes Dropped against # Training Trajectories (0 to 250).]

Figure 4: Results for the arcade game example.

The learning curve for this collection of local models can be seen in Figure 4. In each trial we train the models on various numbers of episodes (ending when there are no more bricks, or after 1000 steps) and measure the likelihood w.r.t. 50 test episodes. We report the average over 20 trials. Even with parameter tying, our model can assign zero probability to a test sequence, due to data sparsity issues. The solid line shows the likelihood ratio (the log likelihood of the true system divided by the log likelihood of our model) ignoring the episodes that caused an infinite log likelihood. The dashed line shows the proportion of episodes we dropped. The likelihood ratio approaches 1 while the proportion of \u201cbad\u201d episodes approaches 0, implying that we are learning a good model in about 100 episodes.

Learning Comparisons: In this experiment, we will compare parameter learning results for collections of local models to a few other methods on a simple example, whose complexity is easily controlled. Recall the 1D Ball Bounce. We learned a model of the 1D Ball Bounce of size 5 and 20 using two collections of local models with no parameter tying (using PSRs and POMDPs as local models respectively), two flat models (a PSR and a POMDP), and a DBN (see footnote 2).

Both collections of local models have the following structure: for every pixel, there are two types of model. One predicts the color of the pixel in the next time step in histories when the ball is not in the immediate neighborhood about the pixel.
This model ignores all pixels other than the one it is predicting. The other model applies when the ball is in the pixel. It jointly predicts the colors of the pixel and its two neighbors. This model distinguishes bridging tests in which the ball went to the left, went to the right, or stayed on the pixel in the first step. This collection of local models satisfies the mutual conditional independence property and allows prediction of primitive one-step tests.

As with the arcade game example, in each trial we trained each model on various numbers of episodes (of length 50) and then measured their log likelihood on 1000 test episodes (also of length 50). We report the likelihood ratio averaged over 20 trials.

2We initialized each local POMDP with 5 states and the flat POMDP with 10 and 40 states for the different problem sizes. For the DBN we used the graphical structure shown in Figure 5 (right) and trained using the Graphical Models Toolkit [16]. We stopped EM after a maximum of 50 iterations. PSR training also has a free parameter (see [17] for details). Via parameter sweep we chose 0.02 for the local PSRs and, for the flat PSR, 0.175 and 0.005 respectively for the size 5 and size 20 domains.

The results are shown in Figure 5. The collections of local models both perform well, outperforming the flat models (dashed lines). Both flat models' performance degrades as the size of the world increases from 5 to 20; the collections of local models are less affected by problem size. The local PSRs seem to take more data than the local POMDPs to learn a good model; however, they ultimately seem to learn a better model. The unexpected result is that DBN training seemed to perform worse than flat POMDP training. We have no explanation for this effect, other than that different graphical structures could cause different local extrema issues for the EM algorithm.
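The likelihood-ratio measure reported in these experiments (the log likelihood of the true system divided by the log likelihood of the learned model, dropping episodes the model assigns zero probability) can be sketched as below. The function name and the choice to average per-episode ratios are our assumptions; the paper does not specify this level of detail.

```python
import math

def likelihood_ratio(true_logliks, model_logliks):
    """Average per-episode ratio of true to model log likelihood.

    Episodes the model assigns probability zero (log likelihood of
    -inf) are dropped from the ratio, and their fraction is reported
    separately, mirroring the solid and dashed lines in the figures.
    Both log likelihoods are negative, so the ratio is positive and
    approaches 1 as the model approaches the true system.
    """
    kept, dropped = [], 0
    for lt, lm in zip(true_logliks, model_logliks):
        if lm == -math.inf:   # model gave the episode zero probability
            dropped += 1
        else:
            kept.append(lt / lm)
    ratio = sum(kept) / len(kept) if kept else float("nan")
    return ratio, dropped / len(true_logliks)
```

For example, with true log likelihoods [-10.0, -8.0] and model log likelihoods [-12.5, -inf], the second episode is dropped and the reported ratio is 10.0/12.5 = 0.8 with a dropped fraction of 0.5.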
Clearly, given these results, a more thorough empirical comparison across a wider variety of problems is warranted.

Conclusions: We have presented a novel formalization of the idea of a "local model." Preliminary empirical results show that collections of local models can be learned for large-scale systems and that the data complexity of parameter learning compares favorably to that of other representations.

Acknowledgments

Erik Talvitie was supported under the NSF GRFP. Satinder Singh was supported by NSF grant IIS-0413004. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

[1] Lise Getoor, Nir Friedman, Daphne Koller, and Benjamin Taskar. Learning probabilistic models of relational structure. Journal of Machine Learning Research, 3:679-707, 2002.

[2] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. In Advances in Neural Information Processing Systems 8 (NIPS), pages 472-478, 1995.

[3] Hanna M. Pasula, Luke S. Zettlemoyer, and Leslie Pack Kaelbling. Learning symbolic models of stochastic domains. Journal of Artificial Intelligence Research, 29:309-352, 2007.

[4] Michael Littman, Richard Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems 14 (NIPS), pages 1555-1561, 2002.

[5] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398, 2000.

[6] Satinder Singh, Michael R. James, and Matthew R. Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Uncertainty in Artificial Intelligence 20 (UAI), pages 512-519, 2004.

[7] Richard Sutton, Doina Precup, and Satinder Singh.
Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.

[8] Alicia Peregrin Wolfe and Andrew G. Barto. Decision tree methods for finding reusable MDP homomorphisms. In National Conference on Artificial Intelligence 21 (AAAI), 2006.

[9] Vishal Soni and Satinder Singh. Abstraction in predictive state representations. In National Conference on Artificial Intelligence 22 (AAAI), 2007.

[10] Erik Talvitie, Britton Wolfe, and Satinder Singh. Building incomplete but accurate models. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2008.

[11] George E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1-16, 1982.

[12] Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence 12 (UAI), pages 115-123, 1996.

[13] Britton Wolfe, Michael James, and Satinder Singh. Approximate predictive state representations. In Autonomous Agents and Multiagent Systems 7 (AAMAS), 2008.

[14] Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.

[15] David Wingate and Satinder Singh. Exponential family predictive representations of state. In Advances in Neural Information Processing Systems 20 (NIPS), pages 1617-1624, 2007.

[16] Jeff Bilmes. The Graphical Models Toolkit (GMTK), 2007. http://ssli.ee.washington.edu/~bilmes/gmtk.

[17] Michael James and Satinder Singh. Learning and discovery of predictive state representations in dynamical systems with reset.
In International Conference on Machine Learning 21 (ICML), 2004.