{"title": "Exponential Family Predictive Representations of State", "book": "Advances in Neural Information Processing Systems", "page_first": 1617, "page_last": 1624, "abstract": null, "full_text": "Exponential Family Predictive Representations of State

David Wingate
Computer Science and Engineering
University of Michigan
wingated@umich.edu

Satinder Singh
Computer Science and Engineering
University of Michigan
baveja@umich.edu

Abstract

In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the "Exponential Family PSR," which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, in particular graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model.

1 Introduction

One of the basic problems in modeling controlled, partially observable, stochastic dynamical systems is representing and tracking state. In a reinforcement learning context, the state of the system is important because it can be used to make predictions about the future, or to control the system optimally. 
Often, state is viewed as an unobservable, latent variable, but models with predictive representations of state [4] propose an alternative: PSRs represent state as statistics about the future.

The original PSR models used the probabilities of specific, detailed futures called tests as the statistics of interest. Recent work has introduced the more general notion of using parameters that model the distribution of length-n futures as the statistics of interest [8]. To clarify this, consider an agent interacting with the system. It observes a series of observations o1...ot, which we call a history ht (where subscripts denote time). Given any history, there is some distribution over the next n observations: p(Ot+1...Ot+n|ht) ≡ p(F^n|ht) (where Ot+i is the random variable representing the observation i steps in the future, and F^n is a mnemonic for future). We emphasize that this distribution directly models observable quantities in the system.

Instead of capturing state with tests, the more general idea is to capture state by directly modeling the distribution p(F^n|ht). Our central assumption is that the parameters describing p(F^n|ht) are sufficient for history, and therefore constitute state (as the agent interacts with the system, p(F^n|ht) changes because ht changes; therefore the parameters, and hence the state, change). As an example, the Predictive Linear-Gaussian (PLG) model [8] assumes that p(F^n|ht) is jointly Gaussian; state therefore becomes its mean and covariance. Nothing is lost by defining state in terms of observable quantities: Rudary et al. [8] proved that the PLG is formally equivalent to the latent-variable approach in linear dynamical systems. 
In fact, because the parameters are grounded, statistically consistent parameter estimators are available for PLGs.

Thus, as part of capturing state in a dynamical system in our method, p(F^n|ht) must be estimated. This is a density estimation problem. In systems with rich observations (say, camera images), p(F^n|ht) may have high dimensionality. As in all high-dimensional density estimation problems, structure must be exploited. It is therefore natural to connect to the large body of recent research dealing with high-dimensional density estimation, in particular graphical models.

In this paper, we introduce the Exponential Family PSR (EFPSR), which assumes that p(F^n|ht) is a standard exponential family distribution. By selecting the sufficient statistics of the distribution carefully, we can impose graphical structure on p(F^n|ht), and therefore make explicit connections to graphical models, maximum entropy modeling, and Boltzmann machines. The EFPSR inherits both the advantages and disadvantages of graphical exponential family models: inference and parameter learning in the model are generally hard, but all existing research on exponential family distributions is applicable (in particular, work on approximate inference).

Selecting the form of p(F^n|ht) and estimating its parameters to capture state is only half of the problem. We must also model the dynamical component, which describes the way that the parameters vary over time (that is, how the parameters of p(F^n|ht) and p(F^n|ht+1) are related). We describe a method called "extend-and-condition," which generalizes many state update mechanisms in PSRs.

Importantly, the EFPSR has no hidden variables, but can still capture state, which sets it apart from other graphical models of sequential data. It is not directly comparable to latent-variable models such as HMMs, CRFs [3], or Maximum-entropy Markov Models (MEMMs) [5], for example. 
In particular, EM-based procedures used in the latent-variable models for parameter learning are unnecessary, and indeed impossible. This is a consequence of the fact that the model is fully observed: all statistics of interest are directly related to observable quantities.

We refer the reader to [11] for an extended version of this paper.

2 The Exponential Family PSR

We now present the Exponential Family PSR (EFPSR) model. The next sections discuss the specifics of the central parts of the model: the state representation, and how we maintain that state.

2.1 Standard Exponential Family Distributions

We first discuss exponential family distributions, which we use because of their close connections to maximum entropy modeling and graphical models. We refer the reader to Jaynes [2] for detailed justification, but briefly, he states that the maximum entropy distribution "agrees with everything that is known, but carefully avoids assuming anything that is not known," which "is the fundamental property which justifies its use for inference." The standard exponential family distribution is the form of the maximum entropy distribution under certain constraints.

For a random variable X, a standard exponential family distribution has the form p(X = x; s) = exp{s^T φ(x) - Z(s)}, where s is the canonical (or natural) vector of parameters and φ(x) is a vector of features of the variable x. The vector φ(x) also forms the sufficient statistics of the distribution. The term Z(s) is known as the log-partition function, and is a normalizing constant which ensures that p(X; s) defines a valid distribution: Z(s) = log ∫ exp{s^T φ(x)} dx. By carefully selecting the features φ(x), graphical structure may be imposed on the distribution.

2.2 State Representation and Dynamics

State. 
The EFPSR defines state as the parameters of an exponential family distribution modeling p(F^n|ht). To emphasize that these parameters represent state, we will refer to them as st:

p(F^n = f^n | ht; st) = exp{ st^T φ(f^n) - log Z(st) },   (1)

with both φ(f^n), st ∈ R^{l×1}. We emphasize that st changes with history, but φ(f^n) does not.

Maintaining State. In addition to selecting the form of p(F^n|ht), there is a dynamical component: given the parameters of p(F^n|ht), how can we incorporate a new observation to find the parameters of p(F^n|ht, ot+1)? Our strategy is to extend and condition, as we now explain.

Extend. We assume that we have the parameters of p(F^n|ht), denoted st. We extend the distribution of F^n|ht to include Ot+n+1, which forms a new variable F^{n+1}|ht, and we assume it has the distribution p(F^n, Ot+n+1|ht) = p(F^{n+1}|ht). This is a temporary distribution over (n+1)d random variables. In order to add the new variable Ot+n+1, we must add new features which describe Ot+n+1 and its relationship to F^n. We capture this with a new feature vector φ+(f^{n+1}) ∈ R^{k×1}, and define the vector st+ ∈ R^{k×1} to be the parameters associated with this feature vector. We thus have the following form for the extended distribution:

p(F^{n+1} = f^{n+1} | ht; st+) = exp{ (st+)^T φ+(f^{n+1}) - log Z(st+) }.

To define the dynamics, we define a function which maps the current state vector to the parameters of the extended distribution. We call this the extension function: st+ = extend(st; θ), where θ is a vector of parameters controlling the extension function (and hence, the overall dynamics).

The extension function helps govern the kinds of dynamics that the model can capture. 
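For intuition, the state distribution in Eq. 1 can be instantiated for a tiny binary future and normalized by brute-force enumeration. The following is a minimal Python sketch; the features and parameter values are made up for illustration and do not come from the paper's experiments:

```python
import itertools
import math

def log_partition(s, phi, support):
    # Z(s) = log sum_x exp(s . phi(x)), by brute-force enumeration over the support
    return math.log(sum(math.exp(sum(a * b for a, b in zip(s, phi(x))))
                        for x in support))

def density(x, s, phi, support):
    # p(x; s) = exp(s . phi(x) - Z(s)), as in Eq. 1
    dot = sum(a * b for a, b in zip(s, phi(x)))
    return math.exp(dot - log_partition(s, phi, support))

# Hypothetical example: a future of two binary observations, with one node
# feature per variable and one Ising-style edge feature.
support = list(itertools.product([0, 1], repeat=2))
phi = lambda f: [f[0], f[1], f[0] * f[1]]
s = [0.5, -0.3, 1.0]   # an arbitrary state vector s_t

total = sum(density(f, s, phi, support) for f in support)
print(round(total, 6))  # -> 1.0
```

The log-partition term guarantees normalization, which the final check confirms; for real futures of nd binary variables this enumeration is exponential in nd, which is why Section 5 turns to approximate inference.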
For example, in the PLG family of work, a linear extension allows the model to capture linear dynamics [8], while a non-linear extension allows the model to capture non-linear dynamics [11].

Condition. Once we have extended the distribution to model the (n+1)'st observation in the future, we then condition on the actual observation ot+1, which results in the parameters of a distribution over observations from t+1 through t+n+1: st+1 = condition(st+, ot+1). These are precisely the statistics representing p(F^n|ht+1), which is our state at time t+1.

By extending and conditioning, we can maintain state for arbitrarily long periods. Furthermore, for many choices of features and extension function, the overall extend-and-condition operation does not involve any inference, meaning that tracking state is computationally efficient.

There is only one restriction on the extension function: we must ensure that after extending and conditioning the distribution, the resulting distribution can be expressed as p(F^n = f^n | ht+1; st+1) = exp{ st+1^T φ(f^n) - log Z(st+1) }. This looks exactly like Eq. 1, which is the point: the feature vector φ did not change between timesteps, which means the form of the distribution does not change. For example, if p(F^n|ht) is a Gaussian, then p(F^n|ht+1) will also be a Gaussian.

2.3 Representational Capacity

The EFPSR model is quite general. It has been shown that a number of popular models can be unified under the umbrella of the general EFPSR: for example, every PSR can be represented as an EFPSR (implying that every POMDP, MDP, and k-th order Markov model can also be represented as an EFPSR); and every linear dynamical system (Kalman filter) and some nonlinear dynamical systems can also be represented by an EFPSR. 
These different models are obtained with different choices of the features φ and the extension function, and are possible because many popular distributions (such as multinomials and Gaussians) are exponential family distributions [11].

3 The Linear-Linear EFPSR

We now choose specific features and an extension function to generate an example model designed to be analytically tractable. We select a linear extension function, and we carefully choose features so that conditioning is always a linear operation. We restrict the model to domains in which the observations are vectors of binary random variables. The result is named the Linear-Linear EFPSR.

Features. Recall that the features φ() and φ+() do not depend on time. This is equivalent to saying that the form of the distribution does not vary over time. If the features impose graphical structure on the distribution, it is also equivalent to saying that the form of the graph does not change over time. Because of this, we will now discuss how we can use a graph whose form is independent of time to help define structure on our distributions.

We construct the feature vectors φ() and φ+() as follows. Let each Ot ∈ {0,1}^d; therefore, each F^n|ht ∈ {0,1}^{nd}. Let (F^n)_i be the i'th random variable in F^n|ht. We assume that we have an undirected graph G which we will use to create the features in the vector φ(), and that we have another graph G+ which we will use to define the features in the vector φ+(). 
Define G = (V, E), where V = {1, ..., nd} are the nodes in the graph (one for each (F^n|ht)_i) and (i, j) ∈ E are the edges. Similarly, we define G+ = (V+, E+), where V+ = {1, ..., (n+1)d} are the nodes in the graph (one for each (F^{n+1}|ht)_i) and (i, j) ∈ E+ are the edges. Neither graph depends on time.

[Figure 1: An illustration of extending and conditioning the distribution. Three panels show the graphs over observation features at times t+1 through t+n+1: the distribution of the next n observations, p(F^n|ht), with graph G; the extended distribution, p(F^n, Ot+n+1|ht), with graph G+; and the conditioned distribution, p(F^n|ht, ot+1), with graph G again.]

To use the graph to define our distribution, we let entries in φ be conjunctions of atomic observation variables (like the standard Ising model): for each i ∈ V, there is some feature k in the vector such that φ(ft)_k = ft^i. We also create one feature for each edge: if (i, j) ∈ E, then there is some feature k in the vector such that φ(ft)_k = ft^i ft^j. Similarly, we use G+ to define φ+().

As discussed previously, neither G nor G+ (equivalently, φ and φ+) can be arbitrary. We must ensure that after conditioning G+, we recover G. To accomplish this, we ensure that both temporally shifted copies and conditioned versions of each feature exist in the graphs (seen pictorially in Fig. 1).

Because all features are either atomic variables or conjunctions of variables, conditioning the distribution can be done with an operation which is linear in the state (this is true even if the random variables are discrete or real-valued). We therefore define the linear conditioning operator G(ot+1) to be a matrix which transforms st+ into st+1: st+1 = G(ot+1) st+ (see [11] for details).

Linear extension. In general, the function extend can take any form. We choose a linear extension: 
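The node and edge (Ising-style) feature construction just described can be generated mechanically from a graph. A minimal sketch, where the graph below is a hypothetical example rather than one learned from data:

```python
def make_phi(num_vars, edges):
    # One feature per node, phi_k(f) = f_i, and one feature per edge (i, j),
    # phi_k(f) = f_i * f_j, as in the Ising-style construction above.
    def phi(f):
        assert len(f) == num_vars
        node_feats = [f[i] for i in range(num_vars)]
        edge_feats = [f[i] * f[j] for (i, j) in edges]
        return node_feats + edge_feats
    return phi

# n = 2 future steps of a d = 2 binary observation gives nd = 4 variables;
# the edge set is chosen arbitrarily for illustration.
phi = make_phi(4, edges=[(0, 1), (1, 3)])
print(phi([1, 0, 1, 1]))  # -> [1, 0, 1, 1, 0, 0]
```

Because every feature is an atomic variable or a conjunction, fixing one observation's variables to their observed values maps each feature to another feature (or a constant), which is what makes conditioning a linear operation on the state.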
st+ = A st + B,

where A ∈ R^{k×l} and B ∈ R^{k×1} are our model parameters. The combination of a linear extension and a linear conditioning operator can be rolled together into a single operation. Without loss of generality, we can permute the indices in our state vector such that st+1 = G(ot+1)(A st + B). Note that although this is linear in the state, it is nonlinear in the observation.

4 Model Learning

We have defined our concept of state, as well as our method for tracking that state. We now address the question of learning the model from data. There are two things which can be learned in our model: the structure of the graph, and the parameters governing the state update. We briefly address each in the next two subsections. We assume we are given a sequence of T observations, [o1 ... oT], which we stack to create a sequence of samples from the F^n|ht's: ft|ht = [ot+1 ... ot+n | ht].

4.1 Structure Learning

To learn the graph structure, we make the approximation of ignoring the dynamical component of the model. That is, we treat each ft as an observation, and try to estimate the density of the resulting unordered set, ignoring the t subscripts (we appeal to density estimation because many good algorithms have been developed for structure induction). We therefore ignore temporal relationships across samples, but we preserve temporal relationships within samples. For example, if observation a is always followed by observation b, this fact will be captured within the ft's.

The problem therefore becomes one of inducing graphical structure for a non-sequential data set, which is a problem that has already received considerable attention. In all of our experiments, we used the method of Della Pietra et al. [7]. Their method iteratively evaluates a set of candidate features and adds the one with the highest expected gain in log-likelihood. 
To enforce the temporal invariance property, whenever we add a feature, we also add all of the temporally shifted copies of that feature, as well as the conditioned versions of that feature.

4.2 Maximum Likelihood Parameter Estimation

With the structure of the graph in place, we are left to learn the parameters A and B of the state extension. It is now useful that our state is defined in terms of observable quantities, for two reasons: first, because everything in our model is observed, EM-style procedures for estimating the parameters of our model are not needed, simply because there are no unobserved variables over which to take expectations. Second, when trying to learn a sequence of states (st's) given a long trajectory of futures (ft's), each ft is a sample of information directly from the distribution we're trying to model. Given a parameter estimate, an initial state s0, and a sequence of observations, the sequence of st's is completely determined. This will be a key element of our proposed maximum-likelihood learning algorithm.

Although the sequence of state vectors st are the parameters defining the distributions p(F^n|ht), they are not the model parameters; that is, we cannot freely select them. Instead, the model parameters are the parameters θ which govern the extension function. This is a significant difference from standard maximum entropy models, and stems from the fact that our overall problem is that of modeling a dynamical system, rather than just density estimation.

The likelihood of the training data is p(o1, o2, ..., oT) = ∏_{t=1}^T p(ot|ht). We will find it more convenient to measure the likelihood of the corresponding ft's: p(o1, o2, ..., oT)^n ≈ ∏_{t=1}^T p(ft|ht) (the likelihoods are not the same because the likelihood of the ft's counts a single observation n times; the approximate equality is because the first n and last n observations are counted fewer than n times).

The average log-likelihood of the training ft's under the model defined in Eq. 1 is

LL = (1/T) Σ_{t=1}^T ( st^T φ(ft) - log Z(st) ).   (2)

Our goal is to maximize this quantity. Any optimization method can be used to maximize the log-likelihood. Two popular choices are gradient ascent and quasi-Newton methods, such as (L-)BFGS. We use both, for different problems (as discussed later). However, both methods require the gradient of the likelihood with respect to the parameters, which we will now compute.

Using the chain rule of derivatives, we can compute the derivative with respect to the parameters A:

∂LL/∂A = Σ_{t=1}^T (∂LL/∂st)^T (∂st/∂A).   (3)

First, we compute the derivative of the log-likelihood with respect to each state:

∂LL/∂st = ∂/∂st [ st^T φ(ft) - log Z(st) ] = φ(ft) - E_{st}[φ(F^n|ht)] ≡ δt,   (4)

where E_{st}[φ(F^n|ht)] ∈ R^{l×1} is the vector of expected sufficient statistics at time t. Computing this is a standard inference problem in exponential family models, as discussed in Section 5. 
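On a toy distribution, the gradient of a single sample's log-likelihood with respect to the state parameters (the gap between observed and expected features) can be checked against finite differences. A sketch, assuming enumeration over the support is feasible; all features and values are made up:

```python
import itertools
import math

support = list(itertools.product([0, 1], repeat=2))
phi = lambda f: [f[0], f[1], f[0] * f[1]]   # toy node + edge features
s = [0.4, -0.2, 0.8]                        # an arbitrary state vector

def log_p(f, s):
    # log p(f; s) = s . phi(f) - Z(s), with Z computed by enumeration
    z = math.log(sum(math.exp(sum(a * b for a, b in zip(s, phi(x))))
                     for x in support))
    return sum(a * b for a, b in zip(s, phi(f))) - z

def expected_phi(s):
    # E_s[phi(X)]: the mean parameters, by enumeration
    probs = [math.exp(log_p(x, s)) for x in support]
    return [sum(p * phi(x)[i] for p, x in zip(probs, support))
            for i in range(len(s))]

f = (1, 0)                                  # an "observed" future sample
grad = [o - e for o, e in zip(phi(f), expected_phi(s))]

# check each component against central finite differences of log p(f; s)
eps = 1e-6
for i in range(len(s)):
    sp = list(s); sp[i] += eps
    sm = list(s); sm[i] -= eps
    fd = (log_p(f, sp) - log_p(f, sm)) / (2 * eps)
    assert abs(fd - grad[i]) < 1e-6
```

The check relies on the standard exponential family identity that the gradient of the log-partition function is the vector of expected sufficient statistics.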
This gradient tells us that we wish to adjust each state to make the expected features of the next n observations closer to the observed features; however, we cannot adjust st directly. Instead, we must adjust it implicitly by adjusting the transition parameters A and B.

We now compute the gradients of the state with respect to each parameter:

∂st/∂A = ∂/∂A [ G(ot+1)(A st-1 + B) ] = G(ot+1) ( A ∂st-1/∂A + st-1^T ⊗ I ),

where ⊗ is the Kronecker product and I is an appropriately sized identity matrix. The gradients of the state with respect to B are given by

∂st/∂B = ∂/∂B [ G(ot+1)(A st-1 + B) ] = G(ot+1) ( A ∂st-1/∂B + I ).

These gradients are temporally recursive: they implicitly depend on the gradients from all previous timesteps. It might seem prohibitive to compute them: must an algorithm examine all of the past data points to compute the gradient at time t? Fortunately, the answer is no: the necessary statistics can be computed in a recursive fashion as the algorithm walks through the data.

Figure 2: Results on two-state POMDPs. The right shows the generic model used, a two-state chain with transition probabilities p, 1-p, q, and 1-q. By varying the transition and observation probabilities, three different POMDPs were generated. The left shows learning performance on the three models (training, testing, true, and naive log-likelihood versus iterations of optimization). 
Likelihoods for naive predictions are shown as a dotted line near the bottom; likelihoods for optimal predictions are shown as a dash-dot line near the top.

Problem  | # of states | # of obs. | # of actions | Naive LL | True LL | Training set LL | Training set % | Test set LL | Test set %
Paint    | 16          | 2         | 4            | 6.24     | 4.66    | 4.67            | 99.7           | 4.66        | 99.9
Network  | 7           | 2         | 4            | 6.24     | 4.49    | 4.50            | 99.5           | 4.52        | 98.0
Tiger    | 2           | 2         | 3            | 6.24     | 5.23    | 5.24            | 92.4           | 5.25        | 86.0

Figure 3: Results on standard POMDPs. See text for explanation.

5 Inference

In order to compute the gradients needed for model learning, the expected sufficient statistics E[φ(F^n|ht)] at each timestep must be computed (see Eq. 4):

E[φ(F^n|ht)] = ∫ φ(ft) p(F^n|ht) dft = ∇Z(st).

This quantity, also known as the mean parameters, is of central interest in standard exponential families, and has several interesting properties. For example, each possible set of canonical parameters s induces one set of mean parameters; assuming that the features are linearly independent, each set of valid mean parameters is uniquely determined by one set of canonical parameters [9].

Computing these marginals is an inference problem. This is repeated T times (the number of samples) in order to get one gradient, which is then used in an outer optimization loop; because inference must be repeatedly performed in our model, computational efficiency is a more stringent requirement than accuracy. In terms of inference, our model inherits all of the properties of graphical models, for better and for worse. Exact inference in our model is generally intractable, except in the case of fully factorized or tree-structured graphs. 
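For models small enough for exact enumeration, the temporally recursive gradients of Section 4.2 can be accumulated in a single pass over the data. A NumPy sketch, using a random stand-in for the conditioning matrix G(ot+1); all sizes and values here are illustrative assumptions, and the recursion is verified against finite differences:

```python
import numpy as np

l, k = 3, 4                          # state and extended-parameter sizes (assumed)
rng = np.random.default_rng(0)
A = 0.3 * rng.normal(size=(k, l))    # extension parameters
B = 0.3 * rng.normal(size=(k, 1))
Gmat = 0.3 * rng.normal(size=(l, k)) # stand-in for the conditioning matrix G(o_{t+1})

def run(A, T=5):
    # forward pass s_{t+1} = Gmat (A s_t + B), carrying the recursive gradients:
    #   dS = d s_t / d vec(A)  (column-major vec convention)
    #   dB = d s_t / d B
    s = np.zeros((l, 1))
    dS = np.zeros((l, k * l))
    dB = np.zeros((l, k))
    for _ in range(T):
        dS = Gmat @ (A @ dS + np.kron(s.T, np.eye(k)))  # ds/dA recursion
        dB = Gmat @ (A @ dB + np.eye(k))                # ds/dB recursion
        s = Gmat @ (A @ s + B)
    return s, dS, dB

s, dS, dB = run(A)

# finite-difference check of d s_T / d A[0,0] (column-major vec index 0)
eps = 1e-6
A2 = A.copy()
A2[0, 0] += eps
fd = (run(A2)[0] - s) / eps
assert np.allclose(fd, dS[:, [0]], atol=1e-6)
```

The key point matches the text: nothing from earlier timesteps needs to be revisited, because the carried matrices dS and dB summarize all past dependence on the parameters.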
However, many approximate algorithms exist: there are variational methods such as naive mean-field, tree-reweighted belief propagation, and log-determinant relaxations [10]; other methods include Bethe-Kikuchi approximations, expectation propagation, (loopy) belief propagation, MCMC methods, and contrastive divergence [1].

6 Experiments and Results

Two sets of experiments were conducted to evaluate the quality of our model and learning algorithm. The first set tested whether the model could capture exact state, given the correct features and exact inference. We evaluated the learned model using exact inference to compute the exact likelihood of the data, and compared to the true likelihood. The second set tested larger models, for which exact inference is not possible. For the second set, bounds can be provided for the likelihoods, but they may be so loose as to be uninformative. How can we assess the quality of the final model? One objective gauge is control performance: if the model has a reward signal, reinforcement learning can be used to determine an optimal policy. Evaluating the reward achieved becomes an objective measure of model quality, even though approximate likelihood is the learning signal.

Figure 4: Results on Cheesemaze (left) and Maze 4x3 (right) for different inference methods. Each panel plots average reward against steps of optimization for EFPSR/VMF, EFPSR/LBP, and EFPSR/LDR, compared with POMDP, Reactive, and Random baselines.

First set. We tested on three two-state problems, as well as three small, standard POMDPs. For each problem, training and test sets were generated (using a uniformly random policy for controlled systems). 
We used 10,000 samples, set n = 3, and used structure learning as explained in Section 4.1. We used exact inference to compute the E[φ(F^n|ht)] term needed for the gradients. We optimized the likelihood using BFGS. For each dataset, we computed the log-likelihood of the data under the true model, as well as the log-likelihood of a "naive" model, which assigns uniform probability to every possible observation. We then learned the best model possible, and compared the final log-likelihoods under the learned and true models.

Figure 2(a) shows results for three two-state POMDPs with binary observations. The left panel of Fig. 2(a) shows results for a two-state MDP. The likelihood of the learned model closely approaches the likelihood of the true model, although it does not quite reach it; this is because the model has trouble modeling deterministic observations, since the weights in the exponential need to be infinitely large (or small) to generate a probability of one (or zero). The middle panel shows results for a moderately noisy POMDP; again, the learned model is almost perfect. The third panel shows results for a very noisy POMDP, in which the naive and true LLs are very close; this indicates that prediction is difficult, even with a perfect model.

Figure 3 shows results for three standard POMDPs, named Paint, Network and Tiger (see footnote 1). The table conveys similar information to the graphs: naive and true log-likelihoods, as well as the log-likelihoods of the learned models (on both training and test sets). To help interpret the results, we also report a percentage, which indicates the amount of the likelihood gap (between the naive and true models) that was captured by the learned model. Higher is better; again we see that the learned models are quite accurate, and generalize well.

Second set. We also tested on two more complicated POMDPs called Cheesemaze and Maze 4x3 (see footnote 1). 
For both problems, exact inference is intractable, and so we used approximate inference. We experimented with loopy belief propagation (LBP) [12], naive mean field (or variational mean field, VMF), and log-determinant relaxations (LDR) [10]. Since the VMF and LDR bounds on the log-likelihood were so loose (and LBP provides no bound), it was impossible to assess our model by an appeal to likelihood. Instead, we opted to evaluate the models based on control performance.

We used the Natural Actor Critic (NAC) algorithm [6] to test our model (see [11] for further experiments). The NAC algorithm requires two things: a stochastic, parameterized policy which operates as a function of state, and the gradients of the log probability of that policy. We used a softmax function of a linear projection of the state: the probability of taking action ai from state st given the policy parameters θ is p(ai; st, θ) = exp{st^T θi} / Σ_{j=1}^{|A|} exp{st^T θj}. The parameters θ are to be determined. For comparison, we also ran the NAC planner with the POMDP belief state: we used the same stochastic policy and the same gradients, but we used the belief state of the true POMDP in place of the EFPSR's state (st). We also tested NAC with the first-order Markov assumption (a reactive policy) and a totally random policy.

Results. Figure 4 shows the results for Cheesemaze. The left panel shows the best control performance obtained (average reward per timestep) as a function of steps of optimization. The "POMDP" line shows the best reward obtained using the true belief state as computed under the true model, the "Random" line shows the reward obtained with a random policy, and the "Reactive" line shows the best reward obtained by using the observation as input to the NAC algorithm. The lines "VMF," "LBP," and "LDR" correspond to the different inference methods.

Footnote 1: From Tony Cassandra's POMDP repository at http://www.cs.brown.edu/research/ai/pomdp/index.html

The EFPSR models all start out with performance equivalent to the random policy (average reward of 0.01), and quickly jump to 0.176. This is close to the average reward of 0.187 obtained using the true POMDP state. The EFPSR policy closes about 94% of the gap between a random policy and the policy obtained with the true model. Surprisingly, only a few iterations of optimization were necessary to generate a usable state representation. Similar results hold for the Maze 4x3 domain, although the improvement over the first-order Markov model is not as strong: the EFPSR closes about 77.8% of the gap between a random policy and the optimal policy. We conclude that the EFPSR has learned a model which successfully incorporates information from history into the state representation, and that it is this information which the NAC algorithm uses to obtain better-than-reactive performance. This implies that the model and learning algorithm are useful even with approximate inference methods, and even in cases where we cannot compare to the exact likelihood.

7 Conclusions

We have presented the Exponential Family PSR, a new model of controlled, stochastic dynamical systems which provably unifies other models with predictively defined state. We have also discussed a specific member of the EFPSR family, the Linear-Linear EFPSR, and a maximum likelihood learning algorithm. We were able to learn almost perfect models of several small POMDP systems, both from a likelihood perspective and from a control perspective. The biggest drawback is computational: the repeated inference calls make the learning process very slow. 
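The softmax policy used with NAC in Section 6, together with the log-probability gradient the algorithm requires, can be sketched as follows; sizes and values are illustrative, and `theta` holds one row θi per action:

```python
import numpy as np

def policy(s, theta):
    # p(a_i | s_t) = exp(s^T theta_i) / sum_j exp(s^T theta_j)
    logits = theta @ s
    logits = logits - logits.max()    # shift for numerical stability; ratios unchanged
    p = np.exp(logits)
    return p / p.sum()

def grad_log_policy(s, theta, a):
    # d/d theta of log p(a | s): the standard softmax identity (e_a - p) s^T
    p = policy(s, theta)
    e = np.zeros_like(p)
    e[a] = 1.0
    return np.outer(e - p, s)

s = np.array([0.5, -1.0, 2.0])        # a hypothetical EFPSR state vector
theta = np.zeros((4, 3))              # 4 actions, 3 state features: uniform at init
print(policy(s, theta))               # -> [0.25 0.25 0.25 0.25]
```

With zero parameters the policy is uniform, matching the observation that the learned policies start out equivalent to the random policy before optimization.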
Improving the learning algorithm is an important direction for future research. While slow, the learning algorithm generates models which can be accurate in terms of likelihood and useful in terms of control performance.

Acknowledgments

David Wingate was supported under a National Science Foundation Graduate Research Fellowship. Satinder Singh was supported by NSF grant IIS-0413004. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

[1] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[2] E. T. Jaynes. Notes on present status and future prospects. In W. Grandy and L. Schick, editors, Maximum Entropy and Bayesian Methods, pages 1-13, 1991.
[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.
[4] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In Neural Information Processing Systems (NIPS), pages 1555-1561, 2002.
[5] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In International Conference on Machine Learning (ICML), pages 591-598, 2000.
[6] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In European Conference on Machine Learning (ECML), pages 280-291, 2005.
[7] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.
[8] M. Rudary, S. Singh, and D. Wingate. Predictive linear-Gaussian models of stochastic dynamical systems. In Uncertainty in Artificial Intelligence (UAI), pages 501-508, 2005.
[9] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, 2003.
[10] M. J. Wainwright and M. I. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099-2109, 2006.
[11] D. Wingate. Exponential Family Predictive Representations of State. PhD thesis, University of Michigan, 2008.
[12] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Technical Report TR-2001-22, Mitsubishi Electric Research Laboratories, 2001.
", "award": [], "sourceid": 585, "authors": [{"given_name": "David", "family_name": "Wingate", "institution": null}, {"given_name": "Satinder", "family_name": "Baveja", "institution": null}]}