{"title": "Value-Directed Compression of POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1579, "page_last": 1586, "abstract": "", "full_text": "Value-Directed Compression of POMDPs\n\nPascal Poupart\n\nCraig Boutilier\n\nDepartement of Computer Science\n\nDepartment of Computer Science\n\nUniversity of Toronto\nToronto, ON, M5S 3H5\nppoupart@cs.toronto.edu\n\nUniversity of Toronto\nToronto, ON, M5S 3H5\ncebly@cs.toronto.edu\n\nAbstract\n\nWe examine the problem of generating state-space compressions of POMDPs in a\nway that minimally impacts decision quality. We analyze the impact of compres-\nsions on decision quality, observing that compressions that allow accurate policy\nevaluation (prediction of expected future reward) will not affect decision qual-\nity. We derive a set of suf\ufb01cient conditions that ensure accurate prediction in this\nrespect, illustrate interesting mathematical properties these confer on lossless lin-\near compressions, and use these to derive an iterative procedure for \ufb01nding good\nlinear lossy compressions. We also elaborate on how structured representations\nof a POMDP can be used to \ufb01nd such compressions.\n\n1 Introduction\n\nPartially observable Markov decision processes (POMDPs) provide a rich framework for\nmodeling a wide range of sequential decision problems in the presence of uncertainty.\nUnfortunately, the application of POMDPs to real world problems remains limited due to\nthe intractability of current solution algorithms, in large part because of the exponential\ngrowth of state spaces with the number of relevant variables.\n\nIdeally, we would like to mitigate this source of intractability by compressing the state\nspace as much as possible without compromising decision quality. 
Our aim in solving a POMDP is to maximize future reward based on our current beliefs about the world. By compressing its belief state, an agent may lose relevant information, which results in suboptimal policy choice. Thus an important aspect of belief state compression lies in distinguishing relevant information from that which can be safely discarded. A number of schemes have been proposed for either directly or indirectly compressing POMDPs. For example, approaches using bounded memory [8, 10] and state aggregation—either dynamic [2] or static [5, 9]—can be viewed in this light.

In this paper, we study the effect of static state-space compression on decision quality. We first characterize lossless compressions—those that do not lead to any error in expected value—by deriving a set of conditions that guarantee decision quality will not be impaired. We also characterize the specific case of linear compressions. This analysis leads to algorithms that find good compression schemes, including methods that exploit structure in the POMDP dynamics (as exhibited, e.g., in graphical models). We then extend these concepts to lossy compressions. We derive a (somewhat loose) upper bound on the loss in decision quality when the conditions for lossless compression (of some required dimensionality) are not met. 
Finally, we propose a simple optimization program to find linear lossy compressions that minimize this bound, and describe how structured POMDP models can be used to implement this scheme efficiently.

2 Background and Notation

2.1 POMDPs

A POMDP is defined by: a set of states $S$; a set of actions $A$; a set of observations $Z$; a transition function $T$, where $T^a(s, s')$ denotes the transition probability $\Pr(s' \mid s, a)$; an observation function $O$, where $O(s', z)$ denotes the probability $\Pr(z \mid s')$ of making observation $z$ in state $s'$; and a reward function $R$, where $R(s)$ denotes the immediate reward associated with state $s$.1 We assume discrete state, action and observation sets and we focus on discounted, infinite horizon POMDPs with discount factor $\gamma$.

Policies and value functions for POMDPs are typically defined over belief space, where a belief state $b$ is a distribution over $S$ capturing an agent's knowledge about the current state of the world. Belief state $b$ can be updated in response to a specific action-observation pair $\langle a, z \rangle$ using Bayes rule: $b^{a,z} = b\,T^{a,z} / (b\,T^{a,z}\mathbf{1})$, where, in matrix form, $T^{a,z}(s, s') = \Pr(s' \mid s, a)\,\Pr(z \mid s')$ (and $b\,T^{a,z}\mathbf{1}$ is a normalization constant). We also denote by $T^{a,z}$ the (unnormalized) mapping $b \mapsto b\,T^{a,z}$. Note that a belief state $b$ and the reward function $R$ can be viewed respectively as $|S|$-dimensional row and column vectors, so that $R(b) = b\,R$.

Solving a POMDP consists of finding an optimal policy $\pi$ mapping belief states to actions. The value $V^\pi$ of a policy $\pi$ is the expected sum of discounted rewards and is defined as:

$V^\pi(b) = R(b) + \gamma \sum_{z \in Z} V^\pi(b\,T^{\pi(b),z})$   (1)

A number of techniques [11] based on value iteration or policy iteration can be used to compute optimal or approximately optimal policies for POMDPs.

2.2 Conditional Independence and Additive Separability

When our state space is defined by a set of variables, POMDPs can often be represented concisely in a factored way by specifying the transition, observation and reward functions using a dynamic Bayesian network (DBN). Such representations exploit the fact that transitions associated with each variable depend only on a small subset of variables. These representations can often be exploited to solve POMDPs without state space enumeration [2].

Recently, Pfeffer [13] showed that conditional independence combined with some form of additive separability can enable efficient inference in many DBNs. Roughly, a function can be additively separated when it decomposes into a sum of smaller terms. For instance, $\Pr(X \mid Y, Z)$ is separable if there exist conditional distributions $\Pr_1(X \mid Y)$ and $\Pr_2(X \mid Z)$, and $\alpha \in [0, 1]$, such that $\Pr(X \mid Y, Z) = \alpha \Pr_1(X \mid Y) + (1 - \alpha) \Pr_2(X \mid Z)$. This ensures that one need only know the marginals of $Y$ and $Z$ (instead of their joint distribution) to infer $X$. 
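This marginal-sufficiency property is easy to check numerically. The following sketch is our own illustration (the CPTs, the mixing weight `a`, and the binary variable sizes are all invented for the example): it verifies that, for a separable CPT, the marginal of $X$ computed from the full joint over $(Y, Z)$ coincides with the one computed from the marginals of $Y$ and $Z$ alone.

```python
import numpy as np

# A separable CPT: Pr(X|Y,Z) = a*Pr1(X|Y) + (1-a)*Pr2(X|Z).
# All numbers are illustrative; variables are binary.
a = 0.3
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])   # P1[y, x] = Pr1(X=x|Y=y)
P2 = np.array([[0.6, 0.4], [0.5, 0.5]])   # P2[z, x] = Pr2(X=x|Z=z)

rng = np.random.default_rng(1)
joint = rng.random((2, 2))                 # joint[y, z] = Pr(Y=y, Z=z)
joint /= joint.sum()

# Exact marginal of X computed from the full joint over (Y, Z):
cpt = a * P1[:, None, :] + (1 - a) * P2[None, :, :]   # cpt[y, z, x]
px_joint = np.einsum('yz,yzx->x', joint, cpt)

# Same marginal computed from the marginals of Y and Z alone:
py, pz = joint.sum(axis=1), joint.sum(axis=0)
px_marg = a * py @ P1 + (1 - a) * pz @ P2

assert np.allclose(px_joint, px_marg)
```

The identity holds exactly, by linearity of the separable mixture.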
Pfeffer shows how additive separability in the CPTs of a DBN can be exploited to identify families of self-sufficient variables. A self-sufficient family consists of a set of subsets of variables such that the marginals of each subset are sufficient to predict the marginals of the same subsets at the next time step. Hence, if we require the probabilities of a few variables, and can identify a self-sufficient family containing those variables, then we need only compute marginals over this family when monitoring belief state.

1 The ideas presented in this paper generalize to cases when $O$ and $R$ also depend on actions.

Figure 1: a) Functional flow of a POMDP (dotted arrows) and a compressed POMDP (solid arrows) where the next belief state is accurately predicted. b) Functional flow of a POMDP (dotted arrows) and a compressed POMDP (solid arrows) where the next compressed belief state is accurately predicted.

2.3 Invariant and Krylov Subspaces
We briefly review several linear algebraic concepts used later (see [15] for more details). Let $V \subseteq \mathbb{R}^n$ be a vector subspace. We say $V$ is invariant with respect to a matrix $A$ if it is closed under multiplication by $A$ (i.e., $A v \in V$ for all $v \in V$). A Krylov subspace $Kr(A, v)$ is the smallest subspace that contains $v$ and is invariant with respect to $A$. A basis for a Krylov subspace can easily be generated by repeatedly multiplying $v$ by $A$ (i.e., $v, A v, A^2 v, A^3 v, \ldots$). If $Kr(A, v)$ is $k$-dimensional, one can show that $A^{k-1} v$ is the last linearly independent vector in this sequence and that all subsequent vectors are linear combinations of $v, A v, \ldots, A^{k-1} v$.

In a DBN, families of self-sufficient variables naturally correspond to invariant subspaces. For instance, suppose $f$ is a linear function that depends only on self-sufficient family $\mathcal{F}$. If we regress $f$ through the dynamics of the DBN—i.e., if we multiply $f$ by the transition matrix $T^{a,z}$—the resulting function will also be defined over the truth values of $\mathcal{F}$. Hence, when a family of variables is self-sufficient, the subspace of linear functions defined over the truth values of that family is invariant w.r.t. $T^{a,z}$.

3 Lossless Compressions

If a compression of the state space of a POMDP allows us to accurately evaluate all policies, we say the compression is lossless, since we have sufficient information to select the optimal policy. We provide one characterization of lossless compressions. 
We then specialize this to the linear case, and discuss the use of compact POMDP representations.

Let $f$ be a compression function that maps each belief state $b$ into some lower dimensional compressed belief state $\tilde{b} = f(b)$ (see Figure 1(a)). Here $f$ can be viewed as a bottleneck (e.g., in the sense of the information bottleneck [17]) that filters the information contained in $b$ before it is used to estimate future rewards. We desire a compression $f$ such that $\tilde{b}$ corresponds to the smallest statistic sufficient for accurately predicting the current reward $R(b)$ as well as the next belief state $b\,T^{a,z}$ (since we can accurately predict all following rewards from $b\,T^{a,z}$). Such a compression exists if we can also find mappings $g^{a,z}$ and $\tilde{R}$ such that:

$R = \tilde{R} \circ f$  and  $T^{a,z} = g^{a,z} \circ f$   (2)

Since we are only interested in predicting future rewards, we don't really need to accurately estimate the next belief state $b\,T^{a,z}$; we could just predict the next compressed belief state $f(b\,T^{a,z})$, since it captures all information in $b\,T^{a,z}$ relevant for estimating future rewards. Figure 1(b) illustrates the resulting functional flow, where $\tilde{T}^{a,z}$ represents the transition function that directly maps one compressed belief state to the next compressed belief state. Eq. 2 can then be replaced by the following weaker but still sufficient conditions:

$R = \tilde{R} \circ f$  and  $f \circ T^{a,z} = \tilde{T}^{a,z} \circ f$   (3)

Given an $f$, $\tilde{R}$ and $\tilde{T}^{a,z}$ satisfying Eq. 3, we can evaluate a policy $\pi$ using the compressed POMDP dynamics as follows:

$\tilde{V}^\pi(\tilde{b}) = \tilde{R}(\tilde{b}) + \gamma \sum_{z \in Z} \tilde{V}^\pi(\tilde{T}^{\pi,z}(\tilde{b}))$   (4)

Once $\tilde{V}^\pi$ is found, we can recover the original value function as $V^\pi = \tilde{V}^\pi \circ f$. Indeed, Eq. 1 and Eq. 4 are equivalent:

Theorem 1 Let $f$, $\tilde{R}$ and $\tilde{T}^{a,z}$ satisfy Eq. 3 and let $V^\pi = \tilde{V}^\pi \circ f$. Then Eq. 1 holds iff Eq. 4 does.

Proof By the substitutions licensed by Eq. 3:

$V^\pi(b) = R(b) + \gamma \sum_z V^\pi(T^{a,z}(b))$
$\iff \tilde{V}^\pi(f(b)) = \tilde{R}(f(b)) + \gamma \sum_z \tilde{V}^\pi(f(T^{a,z}(b)))$   (since $V^\pi = \tilde{V}^\pi \circ f$ and $R = \tilde{R} \circ f$)
$\iff \tilde{V}^\pi(f(b)) = \tilde{R}(f(b)) + \gamma \sum_z \tilde{V}^\pi(\tilde{T}^{a,z}(f(b)))$   (since $f \circ T^{a,z} = \tilde{T}^{a,z} \circ f$)

and the last line is Eq. 4 evaluated at $\tilde{b} = f(b)$.

3.1 Linear compressions

We say $f$ is a linear compression when $f$ is a linear function, representable by some matrix $F$ such that $f(b) = b F$. In this case, the approximate transition and reward functions $\tilde{T}^{a,z}$ and $\tilde{R}$ must also be linear (assuming Eq. 3 is satisfied). Eq. 3 can be rewritten in matrix notation:

$R = F \tilde{R}$  and  $T^{a,z} F = F \tilde{T}^{a,z}$   (5)

In a linear compression, $F$ can be viewed as effecting a change of basis for the value function, with the columns of $F$ defining a subspace in which the compressed value function lies. Furthermore, the rank of $F$ indicates the dimensionality of the compressed state space. When Eq. 5 is satisfied, the columns of $F$ span a subspace that contains $R$ and that is invariant with respect to each $T^{a,z}$. Intuitively, Eq. 5 says that a sufficient statistic must be able to “predict itself” at the next time step (hence the subspace is invariant), and that it must predict the current reward (hence the subspace contains $R$). Formally:

Theorem 2 Let $F$, $\tilde{R}$ and $\tilde{T}^{a,z}$ satisfy Eq. 5. Then the range of $F$ contains $R$ and is invariant with respect to each $T^{a,z}$.

Proof Eq. 5 ensures $R$ is a linear combination of the columns of $F$, so it lies in the range of $F$. It also requires that the columns of each $T^{a,z} F$ are linear combinations of the columns of $F$, so the range of $F$ is invariant with respect to each $T^{a,z}$.

Thus, the best linear lossless compression corresponds to the smallest invariant subspace that contains $R$; this is by definition the Krylov subspace $Kr(\{T^{a,z}\}, R)$. Using this fact we can easily compute the best lossless linear compression by iteratively multiplying $R$ by each $T^{a,z}$ until the Krylov basis is obtained. We then let the Krylov basis form the columns of $F$, and compute $\tilde{R}$ and each $\tilde{T}^{a,z}$ by solving each part of Eq. 5. Finally, we can solve the POMDP in the compressed state space by using $\tilde{R}$ and $\tilde{T}^{a,z}$.

Note that this technique can be viewed as a generalization of Givan et al.'s MDP model minimization technique [3]. It is interesting to note that Littman et al. [9] proposed a similar iterative algorithm to compress POMDPs based on predicting future observations. 
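The iterative procedure can be sketched in a few lines. The following is our own hypothetical numpy implementation, not the authors' code: it grows the Krylov basis by multiplying every new basis vector by each $T^{a,z}$, and then recovers the compressed dynamics and reward from the two conditions of Eq. 5 by least squares (exact whenever the subspace is truly invariant).

```python
import numpy as np

def krylov_compress(Ts, r, tol=1e-10):
    """Columns of F span the smallest subspace containing the reward
    vector r that is invariant w.r.t. every matrix in Ts; the compressed
    model (Tts, rt) then satisfies T F = F Tt and r = F rt (Eq. 5)."""
    F = (r / np.linalg.norm(r)).reshape(-1, 1)
    frontier = [F[:, 0]]
    while frontier:
        v = frontier.pop()
        for T in Ts:
            w = T @ v
            # keep w only if it enlarges the span of F
            if np.linalg.matrix_rank(np.column_stack([F, w]), tol=tol) > F.shape[1]:
                F = np.column_stack([F, w])
                frontier.append(w)
    Tts = [np.linalg.lstsq(F, T @ F, rcond=None)[0] for T in Ts]
    rt = np.linalg.lstsq(F, r, rcond=None)[0]
    return F, Tts, rt

# A trivially compressible example: r lies in a 1-dimensional invariant subspace.
T = np.array([[0.5, 0.5],
              [0.5, 0.5]])
F, Tts, rt = krylov_compress([T], np.array([1.0, 1.0]))
assert F.shape[1] == 1 and np.allclose(T @ F, F @ Tts[0])
```

For the lossless case the least-squares solves are exact; when the invariance conditions fail, the same recovery yields only a best fit in the chosen basis, which is the starting point of the lossy scheme below.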
2 Assuming that rewards are functions of the observations.

3.2 Structured Linear Compressions

When a POMDP is specified compactly, say, using a DBN, the size of the state space may be exponentially larger than the specification. The practical need to avoid state enumeration is a key motivation for POMDP compression. However, the complexity of the search for a good compression must also be independent of the state space size. Unfortunately, the iterative Krylov algorithm involves repeatedly multiplying explicit transition matrices and basis vectors. We consider several ways in which a compact POMDP specification can be exploited to construct a linear compression without state enumeration.

One solution lies in exploiting DBN structure and context-specific independence. If transition, observation and reward functions are represented using DBNs and structured CPTs (e.g., decision trees or algebraic decision diagrams), then the matrix operations required by the Krylov algorithm can be implemented effectively [1, 7]. 
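As a toy illustration of the kind of savings structure can buy (our own sketch, using plain Kronecker structure rather than the decision-tree/ADD machinery of [1, 7]): when variables transition independently, the joint transition matrix is a Kronecker product of per-variable CPTs, and the matrix-vector products needed by the Krylov algorithm can be computed one factor at a time, without materializing the exponentially large joint matrix.

```python
import numpy as np

# Per-variable transition matrices for 3 independent binary variables
# (illustrative numbers); the joint matrix would be their Kronecker
# product, of size 8 x 8 -- exponential in the number of variables.
T1 = np.array([[0.9, 0.1], [0.3, 0.7]])
T2 = np.array([[0.8, 0.2], [0.2, 0.8]])
T3 = np.array([[1.0, 0.0], [0.5, 0.5]])

def kron_matvec(factors, v):
    """Compute (T1 kron T2 kron ...) @ v without building the product,
    by reshaping v into a tensor and contracting one factor at a time."""
    k = len(factors)
    t = v.reshape([2] * k)
    for i, Ti in enumerate(factors):
        t = np.tensordot(Ti, t, axes=([1], [i]))   # contract variable i
        t = np.moveaxis(t, 0, i)                   # restore axis order
    return t.reshape(-1)

v = np.arange(8, dtype=float)
direct = np.kron(np.kron(T1, T2), T3) @ v
assert np.allclose(kron_matvec([T1, T2, T3], v), direct)
```

The factored product touches $k \cdot 2^{k+1}$ numbers instead of $4^k$, which is what makes repeated Krylov multiplications feasible on factored models.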
Although the ADD-based approach can offer substantial savings, the DTs or ADDs that represent the basis vectors of the Krylov subspace may still be much larger than the dimensionality of the compressed state space and the original DBN specifications.

Alternatively, families of self-sufficient variables corresponding to invariant subspaces can be identified by exploiting additive separability. Starting with the variables upon which $R$ depends, we can recursively grow a family of variables until it is self-sufficient with respect to each $T^{a,z}$. The corresponding subspace is invariant and necessarily contains $R$. Assuming a tractable self-sufficient family is found, a compact basis can then be constructed by using all indicator functions for each subset of variables in this family (e.g., if one such subset consists of three binary variables, then eight basis vectors will correspond to this set). This approach allows us to quickly identify a good compression by a simple inspection of the additive separability structure of the DBN. The resulting compression is not necessarily optimal; however, it is the best among those corresponding to some such family. It is important to note that the dynamics $\tilde{T}^{a,z}$ and reward $\tilde{R}$ of the compressed POMDP can be constructed easily (i.e., without state enumeration) from this $F$ and the original DBN model. Pfeffer [13] notes that observations tend to reduce the amount of additive separability present in a DBN, thereby increasing the size of self-sufficient families. Therefore, we should point out that lossless compressions of POMDPs that exploit self-sufficiency and offer an acceptable degree of compression may not exist. Hence lossy compressions are likely to be required in many cases.

Finally, we ask whether the existence of lossless compressions requires some form of structure in the POMDP. We argue that this is almost always the case. Suppose a transition matrix $T^{a,z}$ and a reward vector $R$ are chosen uniformly at random. The odds that $R$ falls into a proper invariant subspace of $T^{a,z}$ are essentially zero, since there are infinitely more vectors in the full space than in all the proper invariant subspaces put together. This means that if a POMDP can be compressed, it must almost certainly be because its dynamics exhibit some structure. We have described how context-specific independence and additive separability can be exploited to identify some linear lossless compressions. However, they do not guarantee that the optimal compression will be found, so it remains an open question whether other types of structure could be used in similar ways.

4 Lossy compressions

Since we cannot generally find effective lossless compressions, we also consider lossy compressions. We propose a simple approach to find linear lossy compressions that “almost satisfy” Eq. 5. Table 1 outlines a simple optimization program to find lossy compressions that minimize a weighted sum of the max-norm residual errors, $\delta_R$ and $\delta_T$, in Eq. 5. Here $\alpha$ and $\beta$ are weights that allow us to vary the degree to which the two components of Eq. 5

  minimize  $\alpha\,\delta_R + \beta\,\delta_T$
  s.t.  $\|R - F \tilde{R}\|_\infty \le \delta_R$   (6)
        $\|T^{a,z} F - F \tilde{T}^{a,z}\|_\infty \le \delta_T$ for all $a, z$   (7)
        plus a constraint fixing the scale of $F$

Table 1: Optimization program for linear lossy compressions

should be satisfied. The unknowns of the program are all the entries of $\tilde{T}^{a,z}$ and $\tilde{R}$, as well as $F$, $\delta_R$ and $\delta_T$. 
The constraint fixing the scale of $F$ is necessary to preserve scale: without it, $\delta_T$ could be driven down to 0 simply by setting all the entries of $F$ to 0. Since $T^{a,z}$ and $\tilde{T}^{a,z}$ both multiply $F$, some constraints are nonlinear. However, it is possible to solve this optimization program by solving a series of LPs (linear programs). We alternate solving the LP that adjusts $\tilde{T}^{a,z}$ and $\tilde{R}$ while keeping $F$ fixed, and solving the LP that adjusts $F$ while keeping $\tilde{T}^{a,z}$ and $\tilde{R}$ fixed. This guarantees that the objective function decreases at each iteration and will converge, but not necessarily to a local optimum.

4.1 Max-norm Error Bound

The quality of the compression resulting from this program depends on the weights $\alpha$ and $\beta$. Ideally, we would like to set $\alpha$ and $\beta$ in a way that $\alpha\,\delta_R + \beta\,\delta_T$ represents the loss in decision quality associated with compressing the state space. If we can bound the error $\epsilon$ of evaluating any policy using the compressed POMDP, then the difference in expected total return between the policy that is best w.r.t. the compressed POMDP and the true optimal policy is at most $2\epsilon$. Theorem 3 gives an upper bound on $\epsilon$ as a linear combination of the max-norm residual errors in Eq. 5.

Theorem 3 Let $\delta_R = \|R - F \tilde{R}\|_\infty$ and $\delta_T = \max_{a,z} \|T^{a,z} F - F \tilde{T}^{a,z}\|_\infty$. Then, for every policy $\pi$,

$\|V^\pi - \tilde{V}^\pi \circ f\|_\infty \le \dfrac{\delta_R + \gamma\,|Z|\,\|\tilde{V}^\pi\|_\infty\,\delta_T}{1 - \gamma}$

We omit the proof due to lack of space; it essentially consists of a sequence of substitutions of the type used in the proof of Theorem 1, with triangle inequalities $\|x + y\|_\infty \le \|x\|_\infty + \|y\|_\infty$ absorbing the residual errors at each step. We suspect that the above error bound will grossly overestimate the loss in decision quality; however, we intend to use it mostly as a guide for setting $\alpha$ and $\beta$. 
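For illustration, the alternating scheme can be sketched as follows. This is our own simplified stand-in, not the authors' program: it minimizes squared residuals of Eq. 5 (so the two half-steps become a least-squares fit and a few gradient steps rather than max-norm LPs), and it preserves scale by pinning the first column of $F$ to $R$, an assumption standing in for the program's actual normalization constraint.

```python
import numpy as np

def lossy_compress(Ts, r, m, outer=30, inner=50, lr=1e-2, seed=0):
    """Alternate between (1) fitting the compressed model (Tts, rt) for a
    fixed F and (2) adjusting the free columns of F for a fixed model,
    minimizing sum_T ||T F - F Tt||^2 + ||F rt - r||^2 (a least-squares
    stand-in for the max-norm program of Table 1)."""
    rng = np.random.default_rng(seed)
    n = r.shape[0]
    G = rng.standard_normal((n, m - 1))
    G /= np.linalg.norm(G, axis=0)            # keep F well conditioned
    for _ in range(outer):
        F = np.column_stack([r, G])
        # Half-step 1: fix F, fit compressed dynamics and reward.
        Tts = [np.linalg.lstsq(F, T @ F, rcond=None)[0] for T in Ts]
        rt = np.linalg.lstsq(F, r, rcond=None)[0]
        # Half-step 2: fix the model, gradient steps on the free columns G.
        for _ in range(inner):
            F = np.column_stack([r, G])
            grad = np.zeros_like(G)
            for T, Tt in zip(Ts, Tts):
                res = T @ F - F @ Tt          # residual of T F = F Tt
                grad += (T.T @ res - res @ Tt.T)[:, 1:]
            grad += np.outer(F @ rt - r, rt)[:, 1:]   # residual of F rt = r
            G -= lr * grad
        G /= np.linalg.norm(G, axis=0)        # free-column scale is arbitrary
    F = np.column_stack([r, G])
    Tts = [np.linalg.lstsq(F, T @ F, rcond=None)[0] for T in Ts]
    rt = np.linalg.lstsq(F, r, rcond=None)[0]
    return F, Tts, rt
```

Because column 0 of $F$ equals $R$, the reward condition of Eq. 5 is always satisfiable exactly; the transition residual is what the free columns trade against.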
In this bound, the coefficient $\gamma\,|Z|\,\|\tilde{V}^\pi\|_\infty$ multiplying $\delta_T$ is typically much greater than the unit coefficient of $\delta_R$, because of the factor $1/(1-\gamma)$ hidden in $\|\tilde{V}^\pi\|_\infty$; this means that $\delta_T$ has a much higher impact on the loss in decision quality than $\delta_R$. Intuitively, this makes sense because the error in predicting the next compressed belief state may compound over time, so we should set $\beta$ significantly higher than $\alpha$.

4.2 Structured Compressions

As with lossless compressions, solving the program in Table 1 may be intractable due to the size of $S$: both the number of constraints and the number of unknown entries in matrix $F$ grow with $|S|$.3 We describe several techniques that allow one to exploit problem structure to find an acceptable lossy compression without state space enumeration.

One approach is related to the basis function model proposed in [4], in which we restrict $F$ to be functions over some small set of factors (subsets of state variables). 
This ensures that the number of unknown parameters in any column of $F$ (which we optimize in Table 1) is linear in the number of instantiations of each factor. By keeping factors small, we maintain a manageable set of unknowns.

3 Assuming the compressed dimensionality is small, the variables in each $\tilde{T}^{a,z}$ and in $\tilde{R}$ are unproblematic.

To deal with the large number of constraints, we can exploit the structure imposed on $F$ and the DBN structure to reduce the number of constraints to something (in many cases) polynomial in the number of state variables. This can be achieved using the techniques described in [4, 16] to rewrite an LP with many fewer constraints or to generate small subsets of constraints incrementally. These techniques are rather involved, so we refer to the cited papers for details.

By searching within a restricted set of structured compressions and by exploiting DBN structure it is possible to efficiently solve the optimization program in Table 1. The question of factor selection remains: on what factors should $F$ be defined? 
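To make the restriction concrete, here is our own sketch (the factor choices and sizes are invented) of columns of $F$ built as indicator features over small subsets of binary state variables; compressing a belief with such an $F$ simply yields the factor marginals.

```python
import numpy as np
from itertools import product

n_vars = 4                       # state space of size 2**n_vars = 16
states = list(product([0, 1], repeat=n_vars))

def indicator_basis(factor):
    """One column per instantiation of the given subset of variables:
    the column is 1 exactly on states agreeing with that instantiation."""
    cols = []
    for vals in product([0, 1], repeat=len(factor)):
        col = np.array([1.0 if all(s[v] == x for v, x in zip(factor, vals))
                        else 0.0 for s in states])
        cols.append(col)
    return np.column_stack(cols)

# F over factors {X0, X1} and {X2}: 4 + 2 columns instead of 16.
F = np.column_stack([indicator_basis([0, 1]), indicator_basis([2])])
# Compressing a belief b gives the factor marginals: b @ F.
b = np.full(16, 1 / 16)
assert np.allclose(b @ F, [0.25, 0.25, 0.25, 0.25, 0.5, 0.5])
```

Each group of columns partitions the states by one factor's instantiations, so the number of unknowns per factor is $2^{|X_i|}$ rather than $2^n$.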
A version of this question has been tackled in [12, 14] in the context of selecting a basis to approximately solve MDPs. The techniques proposed in those papers could be adapted to our optimization program.

An alternative method for structuring the computation of $F$ involves additive separability. Let $X_1, \ldots, X_k$ be subsets of variables, and restrict each column of $F$ (each corresponding to one dimension of the compressed state space) to be a separable function of the $X_i$; that is, each column takes the form $\sum_i w_i\,g_i(X_i)$ for some weights $w_i$ and functions $g_i$. Here the $w_i$ can be viewed as weights indicating the importance of the contribution of each $g_i(X_i)$ in the separable function. Given a family of subsets, the parameters over which we optimize to determine $F$ are now the weights $w_i$ and the entries of each function $g_i$. While nonlinear, the same alternating minimization scheme described earlier can be used to optimize these two classes of parameters of $F$ in turn. Note that the number of variables is dependent only on the size of the subsets $X_i$ and the compressed state space. Furthermore, this form of additive separability lends itself to the same compact constraint generation techniques mentioned above. Finally, the (discrete) search for decent subsets $X_i$ can be interleaved with optimization of the compression mapping for fixed sets.

5 Preliminary Experiments

We report on preliminary experiments with the coffee problem described in [2]. Given its relatively small size (32 states, 3 observations and 2 actions), these results should be viewed as simply illustrating the feasibility and potential of the algorithms proposed in Secs. 3.1 and 4.1. Further experiments for the structured versions (Secs. 3.2 and 4.2) are necessary to assess the degree of compression achievable with large, realistic problems.

The 32-dimensional belief space can be compressed without any loss to a 7-dimensional subspace using the Krylov subspace algorithm described in Section 3.1. For further compression, we applied the optimization program described in Table 1, setting the weights $\alpha$ and $\beta$ as suggested by the analysis of Section 4.1 (with $\beta$ much larger than $\alpha$). The alternating variable technique was iterated a fixed number of times, with the best solution chosen from several random restarts (to mitigate the effects of local optima). Figure 2 shows the loss in expected return (w.r.t. the optimal policy) when a policy computed using varying degrees of compression is executed for a fixed number of stages. The loss is sampled from 100,000 random initial belief states, averaged over 10 runs. These policies manage to achieve expected returns with a relative loss of at most a few percent, whereas the average loss of a random policy is far greater.

Figure 2: Average loss for various lossy compressions (average loss, absolute and relative, plotted against the dimensionality of the compressed space, from 3 to 7).

6 Concluding Remarks

We have presented an in-depth theoretical analysis of the impact of static compressions on decision quality. We derived a set of conditions that guarantee compression does not impair decision quality, leading to interesting mathematical properties for linear compressions that allow us to exploit structure in the POMDP dynamics. We also proposed a simple optimization program to search for good lossy compressions. 
Preliminary results suggest that significant compression can be achieved with little impact on decision quality.

This research can be extended in various directions. It would be interesting to carry out a similar analysis in terms of information theory (instead of linear algebra), since the problem of identifying information in a belief state relevant to predicting future rewards can be modeled naturally using information theoretic concepts [6]. Dynamic compressions could also be analyzed since, as we solve a POMDP, the set of reasonable policies shrinks, allowing greater compression.

References
[1] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121:49–107, 2000.
[2] C. Boutilier and D. Poole. Computing optimal policies for partially observable decision processes using compact representations. Proc. AAAI-96, pp. 1168–1175, Portland, OR, 1996.
[3] R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, to appear, 2002.
[4] C. Guestrin, D. Koller, and R. Parr. Max-norm projections for factored MDPs. Proc. IJCAI-01, pp. 673–680, Seattle, WA, 2001.
[5] C. Guestrin, D. Koller, and R. Parr. Solving factored POMDPs with linear value functions. IJCAI-01 Workshop on Planning under Uncertainty and Incomplete Information, Seattle, WA, 2001.
[6] C. Guestrin and D. Ormoneit. Information-theoretic features for reinforcement learning. Unpublished manuscript.
[7] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. Proc. UAI-99, pp. 279–288, Stockholm, 1999.
[8] M. L. Littman. Memoryless policies: theoretical limitations and practical results. In D. Cliff, P. Husbands, J. Meyer, S. W. Wilson, eds., Proc. 3rd Intl. Conf. on Simulation of Adaptive Behavior, Cambridge, 1994. MIT Press.
[9] M. L. Littman, R. S. 
Sutton, and S. Singh. Predictive representations of state. Proc. NIPS-02, Vancouver, 2001.
[10] R. A. McCallum. Hidden state and reinforcement learning with instance-based state identification. IEEE Transactions on Systems, Man, and Cybernetics, 26(3):464–473, 1996.
[11] K. Murphy. A survey of POMDP solution techniques. Technical Report, U.C. Berkeley, 2000.
[12] R. Patrascu, P. Poupart, D. Schuurmans, C. Boutilier, and C. Guestrin. Greedy linear value-approximation for factored Markov decision processes. Proc. AAAI-02, pp. 285–291, Edmonton, 2002.
[13] A. Pfeffer. Sufficiency, separability and temporal probabilistic models. Proc. UAI-01, pp. 421–428, Seattle, WA, 2001.
[14] P. Poupart, C. Boutilier, R. Patrascu, and D. Schuurmans. Piecewise linear value function approximation for factored MDPs. Proc. AAAI-02, pp. 292–299, Edmonton, 2002.
[15] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS, Boston, 1996.
[16] D. Schuurmans and R. Patrascu. Direct value-approximation for factored MDPs. Proc. NIPS-01, Vancouver, 2001.
[17] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. 37th Annual Allerton Conf. on Communication, Control and Computing, pp. 368–377, 1999.
", "award": [], "sourceid": 2192, "authors": [{"given_name": "Pascal", "family_name": "Poupart", "institution": null}, {"given_name": "Craig", "family_name": "Boutilier", "institution": null}]}