{"title": "A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 447, "page_last": 454, "abstract": "", "full_text": "A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics\n\nEric Allender\nComputer Science Department\nRutgers University\nallender@cs.rutgers.edu\n\nSanjeev Arora\nComputer Science Department\nPrinceton University\narora@cs.princeton.edu\n\nMichael Kearns\nDepartment of Computer and Information Science\nUniversity of Pennsylvania\nmkearns@cis.upenn.edu\n\nCristopher Moore\nDepartment of Computer Science\nUniversity of New Mexico\nmoore@santafe.edu\n\nAlexander Russell\nDepartment of Computer Science and Engineering\nUniversity of Connecticut\nacr@cse.uconn.edu\n\nAbstract\n\nWe establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.\n\n1 Introduction\n\nWhile a number of different representational approaches to large Markov decision processes (MDPs) have been proposed and studied over recent years, relatively little is known about the relationships between them.
For example, in function approximation, a parametric form is proposed for the value functions of policies. Presumably, for any assumed parametric form (for instance, linear value functions), rather strong constraints on the underlying stochastic dynamics and rewards may be required to meet the assumption. However, a precise characterization of such constraints seems elusive.\n\nSimilarly, there has been recent interest in making parametric assumptions on the dynamics and rewards directly, as in the recent work on factored MDPs. Here it is known that the problem of computing an optimal policy from the MDP parameters is intractable (see [7] and the references therein), but exactly what the representational constraints on such policies are has remained largely unexplored.\n\nIn this note, we give a new intractability result for planning in factored MDPs that exposes a noteworthy conceptual point missing from previous hardness results. Prior intractability results for planning in factored MDPs established that the problem of computing optimal policies from MDP parameters is hard, but left open the possibility that for any fixed factored MDP, there might exist a compact, parametric representation of its optimal policy. This would be roughly analogous to standard NP-complete problems such as graph coloring: any 3-colorable graph has a \u201ccompact\u201d description of its 3-coloring, but it is hard to compute that coloring from the graph.\n\nHere we dismiss even this possibility. Under a standard and widely believed complexity-theoretic assumption (one even weaker than the assumption that NP does not have polynomial-size Boolean circuits), we prove that a specific family of factored MDPs does not even possess \u201csuccinct\u201d policies.
By this we mean something extremely general: namely, that each MDP in the family cannot have an optimal policy represented by any Boolean circuit whose size is bounded by a polynomial in the size of the MDP description. Since such circuits can represent essentially any standard parametric functional form, we are showing that there exists no \u201creasonable\u201d representation of good policies in factored MDPs, even if we ignore the problem of how to compute them from the MDP description. This result holds even if we ask only for policies whose expected return approximates the optimal within a polynomial factor. (With a slightly stronger complexity-theoretic assumption, it follows that obtaining an approximation even within an exponential factor is impossible.)\n\nThus, while previous results established that there was at least a computational barrier to going from factored MDP parameters to good policies, here we show that the barrier is actually representational, a considerably worse situation. The result highlights the fact that even when making strong and reasonable assumptions about one representational aspect of MDPs (such as value functions or dynamics), there is no reason in general for this to lead to any nontrivial restrictions on the others.\n\nThe construction in our result is ultimately rather simple, and relies on powerful results developed in complexity theory over the last decade. In particular, we exploit striking results on the complexity class associated with computational protocols known as Arthur-Merlin games.\n\nWe note that recent and independent work by Liberatore [5] establishes results similar to ours.
The primary difference between our work and Liberatore's is that our results prove intractability of approximation and rely on different proof techniques.\n\n2 DBN-Markov Decision Processes\n\nA Markov decision process is a tuple M = (S, A, {P_a}, R), where S is a set of states, A is a set of actions, {P_a} is a family of probability distributions on S, one for each a in A, with P_a(s' | s) giving the probability that action a taken in state s results in state s', and R : S \u2192 [0, 1] is a reward function. When started in a state s_0 and provided with a sequence of actions a_0, a_1, a_2, ..., the MDP traverses a sequence of states s_0, s_1, s_2, ..., where each s_{t+1} is a random sample from the distribution P_{a_t}(\u00b7 | s_t). Such a state sequence is called a path. The \u03b3-discounted return associated with such a path is \u03a3_t \u03b3^t R(s_t).\n\nA policy \u03c0 is a mapping from states to actions. When the action sequence is generated according to this policy, we denote by s_0, s_1, s_2, ... the state sequence produced as above, and we write V^\u03c0(s_0) for the expected \u03b3-discounted return. A policy \u03c0 is optimal if V^\u03c0(s) \u2265 V^{\u03c0'}(s) for all policies \u03c0' and all states s.\n\nWe consider MDPs where the transition law is represented as a dynamic Bayes net, or DBN-MDPs. Namely, if the state space is S = {0, 1}^n (so S has size 2^n), then the transition law is represented by a 2-layer Bayes net. There are n variables in the first layer, representing the n state variables at any given time t, along with the action chosen at time t. There are n variables in the second layer, representing the n state variables at time t + 1. All directed edges in the Bayes net go from variables in the first layer to variables in the second layer; for our result, it suffices to consider Bayes nets in which the indegree of every second-layer node is bounded by some constant. Each second-layer node has a conditional probability table (CPT) describing its conditional distribution for every possible setting of its parents in the Bayes net.
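The factorized transition semantics just described can be sketched in a few lines of code. The following is an illustrative toy, not the paper's construction; the names `dbn_step`, `parents`, and `cpt` are ours. Each second-layer bit is sampled independently from the CPT entry indexed by the action and the current settings of its (constantly many) parents.

```python
import random

def dbn_step(state, action, parents, cpt, rng=random):
    """Sample a next state from a two-layer DBN-MDP transition model.

    state   -- tuple of 0/1 values, one per first-layer variable
    action  -- the action chosen at the current time step
    parents -- parents[i] lists the first-layer indices feeding
               second-layer variable i (constant indegree)
    cpt     -- cpt[i][(action, parent_values)] = Pr[next bit i = 1]

    The next-state distribution factorizes: each second-layer bit is
    sampled independently given the action and its parents' values.
    """
    nxt = []
    for i in range(len(state)):
        key = (action, tuple(state[j] for j in parents[i]))
        p_one = cpt[i][key]
        nxt.append(1 if rng.random() < p_one else 0)
    return tuple(nxt)
```

Note that the table size is polynomial in n precisely because the indegree is constant: each of the n CPTs has O(|A| \u00b7 2^k) entries for indegree k.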
Thus the stochastic dynamics of the DBN-MDP are entirely described by the Bayes net in the standard way; the next-state distribution for any state is given by simply fixing the first-layer nodes to the settings given by the state. Any given action choice then yields the next-state distribution according to standard Bayes net semantics. We shall assume throughout that the rewards are a linear function of the state.\n\n3 Arthur-Merlin Games\n\nThe complexity class AM is a probabilistic extension of the familiar class NP, and is typically described in terms of Arthur-Merlin games (see [2]). An Arthur-Merlin game for a language L is played by two players (Turing machines): V (the Verifier, often referred to as Arthur in the literature), who is equipped with a random coin and only modest (polynomial-time bounded) computing power; and P (the Prover, often referred to as Merlin), who is computationally unbounded. Both are supplied with the same input x of n bits. For instance, x might be some standard encoding of an undirected graph G, and P might seek to prove to V that G is 3-colorable; V is skeptical but willing to listen. At each step of the conversation, V flips a fair coin, perhaps several times, and reports the resulting bits to P; this is interpreted as a \u201cquestion\u201d or \u201cchallenge\u201d to P. In the graph coloring example, it might be reasonable to interpret the random bits generated by V as identifying a random edge in G, with the challenge to P being to identify the colors of the nodes on each end of this edge (which had better be different, and consistent with any previous responses of P, if V is to be convinced). Thus P responds with some number of bits, and the protocol proceeds to the next round. After polynomially many steps, V decides, based upon the conversation, whether to accept that x is in L.
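One round of the informal 3-coloring exchange above can be sketched as follows. This is our own toy rendering, not part of the paper: Arthur's random challenge is an edge, Merlin answers with the endpoint colors, and Arthur checks that they differ and agree with Merlin's earlier answers.

```python
import random

def arthur_round(edges, merlin, transcript, rng=random):
    """Play one round of the informal 3-coloring Arthur-Merlin game.

    edges      -- list of (u, v) pairs; the random challenge is one edge
    merlin     -- callable (challenge, transcript) -> (color_u, color_v)
    transcript -- mutable list of (node, color) answers given so far
    Returns True iff Merlin's answer survives Arthur's checks.
    """
    u, v = rng.choice(edges)            # Arthur's random challenge
    cu, cv = merlin((u, v), transcript)
    ok = cu != cv                       # endpoint colors must differ
    for node, c in transcript:          # ... and match earlier answers
        if (node == u and c != cu) or (node == v and c != cv):
            ok = False
    transcript.extend([(u, cu), (v, cv)])
    return ok
```

An honest Merlin holding a valid coloring passes every round; a Merlin who answers with equal colors is caught immediately.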
We say that the language L is in the class AM_poly if there is a (polynomial-time) verifier algorithm V such that:\n\nWhen x is in L, there is always a strategy for P to generate the responses to the random challenges that causes V to accept.\n\nWhen x is not in L, regardless of how P responds to the random challenges, V rejects with probability at least 2/3. Here the probability is taken over the random challenges.\n\nIn other words, we ask that there be a polynomial-time algorithm V such that if x is in L, there is always some response to the random challenge sequence that will convince V of this fact; but if x is not in L, then every way of responding to the random challenge sequence has an overwhelming probability of being \u201ccaught\u201d by V.\n\nWhat is the power of the class AM_poly? From the definition, it should be clear that every language in NP has an (easy) AM_poly protocol in which P, the prover, ignores the random challenges, and simply presents V with the standard NP witness that x is in L (e.g., a specific 3-coloring of the graph G). More surprisingly, every language in the class PSPACE (the class of all languages that can be recognized in deterministic polynomial space, conjectured to be much larger than NP) also has an AM_poly protocol, a beautiful and important result due to [6, 9]. (For definitions of classes such as P, NP, and PSPACE, see [8, 4].)\n\nIf a language L has an Arthur-Merlin game where Arthur asks only a constant number of questions, we say that L is in AM. NP corresponds to Arthur-Merlin games where Arthur says nothing, and thus clearly NP is contained in AM. Restricting the number of questions seems to put severe limitations on the power of Arthur-Merlin games. Though AM_poly equals PSPACE, it is generally believed that
NP \u2286 AM \u228a PSPACE.\n\n4 DBN-MDPs Requiring Large Policies\n\nIn this section, we outline our construction proving that factored MDPs may not have any succinct representation for (even approximately) optimal policies, and conclude this note with a formal statement of the result.\n\nLet us begin by drawing a high-level analogy with the MDP setting. Let L be a language in PSPACE, and let V and P be the Turing machines for the AM_poly protocol for L. Since V is simply a Turing machine, it has some internal configuration c_t (sufficient to completely describe the tape contents, read/write head position, abstract computational state, and so on) at any given moment t in the protocol with P. Since we assume P is all-powerful (computationally unbounded), we can assume that P has complete knowledge of this internal state c_t at all times. The protocol at round t can thus be viewed as follows: V is in some state/configuration c_t; a random bit sequence (the challenge) r_t is generated; based on c_t and r_t, P computes some response or action a_t; and based on c_t, r_t, and a_t, V enters its next configuration c_{t+1}. From this description, several observations can be made:\n\nV's internal configuration c_t constitutes state in the Markovian sense: combined with the action a_t, it entirely determines the next-state distribution. The dynamics are probabilistic due to the influence of the random bit sequence r_t.\n\nWe can thus view P as implementing a policy in the MDP determined by (the internal configuration of) V: P's actions, together with the stochastic challenges, determine the evolution of the state.\n\nThe MDP defined in this manner is not arbitrarily complex; in particular, the transition dynamics are defined by the polynomial-time Turing machine V.
Informally, we might imagine defining the total return to P to be 1 if P causes V to accept, and 0 if V rejects.\n\nAt a high level, then, if every MDP so defined by a language in AM_poly had an \u201cefficient\u201d policy, then something remarkable would occur: the arbitrary power allowed to P in the definition of the class would have been unnecessary. We shall see that this would have extraordinary and rather implausible complexity-theoretic implications. For the moment, let us simply sketch the refinements to this line of thought that will allow us to make the connection to factored MDPs: we will show that the MDPs defined above can actually be represented by DBN-MDPs with only constant indegree and a linear reward function. As suggested, this will allow us to assert rather strong negative results about even the existence of efficient policies, even when we ask for rather weak approximation to the optimal return.\n\nWe now turn to the problem of planning in a DBN-MDP. Typically, one might like to have a \u201cgeneral-purpose\u201d planning procedure: a procedure that takes as input a description of a DBN-MDP M, and returns a description of the optimal policy for M. This is what is typically meant by the term planning, and we note that it demands a certain kind of uniformity: a single planning algorithm that can efficiently compute a succinct representation of the optimal policy for any DBN-MDP. Note that the existence of such a planning algorithm would certainly imply that every DBN-MDP has a succinct representation of its optimal policy, but the converse does not hold. It could be that the difficulty of planning in DBN-MDPs arises from the demand of uniformity: that is, that every DBN-MDP possesses a succinct optimal policy, but the problem of computing it from the MDP parameters is intractable.
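A \u201csuccinct representation of a policy\u201d in the sense used here can be as general as an arbitrary Boolean circuit over the state bits. As a toy illustration (the gate-list encoding below is ours, not the paper's), a policy over two actions is just a small circuit mapping state bits to a single action bit:

```python
def eval_circuit(gates, inputs):
    """Evaluate a Boolean circuit given as a flat gate list.

    inputs -- tuple of 0/1 input bits (e.g. the state of a DBN-MDP)
    gates  -- each gate is ('NOT', a), ('AND', a, b) or ('OR', a, b),
              where a, b index earlier wires (input wires come first)
    Returns the value on the last wire; a policy over binary actions
    is such a circuit mapping state bits to an action bit.
    """
    wires = list(inputs)
    for op, *args in gates:
        a = wires[args[0]]
        if op == 'NOT':
            wires.append(1 - a)
        else:
            b = wires[args[1]]
            wires.append(a & b if op == 'AND' else a | b)
    return wires[-1]
```

Any of the standard parametric policy forms (decision lists, bounded-depth trees, thresholded linear functions, and so on) compiles into such a gate list of comparable size, which is why ruling out polynomial-size circuits rules out all of them at once.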
This would be analogous to problems in NP: for example, every 3-colorable graph obviously has a succinct description of a 3-coloring, but it is difficult to compute it from the graph.\n\nAs mentioned in the introduction, it has been known for some time that planning in this uniform sense is computationally intractable. Here we establish the stronger and conceptually important result that it is not the uniformity giving rise to the difficulty, but rather that there simply exist DBN-MDPs in which the optimal policy does not possess a succinct representation in any natural parameterization. We will present a specific family of DBN-MDPs M_n (where M_n has states with m(n) components), and show that, under a standard complexity-theoretic assumption, the corresponding family of optimal policies cannot be represented by arbitrary Boolean circuits of size polynomial in n. We note that such circuits constitute a universal representation of efficiently computable functions, and all of the standard parametric forms in wide use in AI and statistics can be computed by such circuits.\n\nWe now provide the details of the construction. Let L be any language in PSPACE, and let V be a polynomial-time Turing machine, running in time p(n) on inputs of length n, implementing the algorithm of \u201cArthur\u201d in the AM_poly protocol for L. Let m(n) be the maximum number of bits needed to write down a complete configuration of V that may arise during computation on an input of length n (so m(n) is polynomial in n, since no computation taking p(n) time can consume more than p(n) space). Each state of our DBN-MDP M_n has m(n) components, each corresponding to one bit of the encoding of a configuration. No states will have rewards, except for the accepting states, which have reward 1. (Without loss of generality, we may assume that V never enters an accepting state other than at time p(n).)
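The NP analogy drawn above is worth making concrete: a succinct witness is easy to check even when it is hard to find. A minimal checker, written for illustration only (the names are ours):

```python
def is_valid_3coloring(edges, coloring):
    """Verify a succinct witness: a proposed 3-coloring of a graph.

    edges    -- list of (u, v) pairs
    coloring -- dict mapping each node to a color in {0, 1, 2}
    Checking runs in time linear in the number of edges, even though
    finding a valid coloring is NP-hard in general.
    """
    return (all(c in (0, 1, 2) for c in coloring.values())
            and all(coloring[u] != coloring[v] for u, v in edges))
```

The point of the present section is that for the DBN-MDPs constructed here, even this witness-checking picture fails: there is no small object playing the role of the coloring at all.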
Note that we can encode configurations so that there is one bit position (say, the first bit of the state vector) that records whether the current state of V is accepting or not. Thus the reward function is obviously linear (it is simply 1 times the first component). There are two actions, 0 and 1. Each action advances the simulation of the AM game by one time step. There are three types of steps:\n\n1. Steps where P is choosing a bit to send to V; action b corresponds to P choosing to send a \u201cb\u201d to V.\n\n2. Steps where V is flipping a coin; each action yields probability 1/2 of having the coin come up \u201cheads\u201d.\n\n3. Steps where V is doing deterministic computation; each action moves the computation ahead one step.\n\nIt is straightforward to encode this as a DBN-MDP. Note that each bit of the next-move relation of a Turing machine depends on only constantly many bits of the preceding configuration (i.e., on the bits encoding the contents of the neighboring cells, the bits encoding the presence or absence of the input head in one of those cells, and the bits encoding the finite-state information of the Turing machine). Thus the DBN-MDP M_n describing V on inputs of length n has constant indegree; each bit is connected to the constantly many bits on which it depends.\n\nNote that a path in this MDP corresponding to an accepting computation of V on an input of length n has total reward 1; a rejecting path has reward 0. A routine calculation shows that the expected reward of the optimal policy is equal to the fraction of coin flip sequences that cause V to accept when communicating with an optimal P. That is,\n\nOptimal expected reward = Prob[V accepts when communicating with an optimal P].\n\nWith the construction above, we can now describe our result:\n\nTheorem 1.
If PSPACE is not contained in P/POLY, then there is a family of DBN-MDPs M_1, M_2, ..., such that for any two polynomials p and q, there exist infinitely many n for which no circuit of size p(n) can compute a policy for M_n having expected reward greater than 1/q(n) times the optimum.\n\nBefore giving the formal proof, we remark that the assumption that PSPACE is not contained in P/POLY is standard and widely believed, and informally asserts that not everything that can be computed in polynomial space can be computed by a non-uniform family of small circuits.\n\nProof. Let L be any language in PSPACE that is not in P/POLY, and let the family M_n be as described above. Suppose, contrary to the statement of the Theorem, that for all large enough n there is indeed a circuit C_n of size p(n) computing a policy for M_n whose return is within a 1/q(n) factor of optimal. We now consider the probabilistic circuit C'_n that operates as follows. C'_n takes a string x as input, and estimates the expected return of the policy given by C_n (which is the same as the probability that the prover associated with C_n is able to convince V that x is in L). Specifically, C'_n builds the state s corresponding to the start state of the AM protocol on input x, and then repeats the following procedure n q(n) times:\n\nGiven state s, if s encodes a configuration in which it is P's turn, use C_n to compute the message sent by P, and set s to the new state of the AM protocol. Otherwise, if s encodes a configuration in which it is V's turn, flip a coin at random and set s to the new state of the AM protocol. Repeat until an accept or reject state is encountered.\n\nIf any of these repetitions result in an accept, C'_n accepts; otherwise C'_n rejects.\n\nNote now that if x is in L, then the probability that C'_n rejects is no more than (1 - 1/q(n))^{n q(n)} \u2264 e^{-n}, since in this case we are guaranteed that each iteration will accept with probability at least 1/q(n).
On the other hand, if x is not in L, then C'_n accepts with probability no more than n q(n) 2^{-2n} \u2264 2^{-n} for large n, since (assuming, as we may by standard amplification of the AM protocol, that the soundness error is at most 2^{-2n}) each iteration accepts with probability at most 2^{-2n}. As C'_n has polynomial size and a probabilistic circuit can be simulated by a deterministic one of essentially the same size, it follows that L is in P/POLY, a contradiction.\n\nIt is worth mentioning that, by the worst-case-to-average-case reduction of [1], if PSPACE is not in P/POLY then we can select such a language L so that the circuit will perform badly on a non-negligible fraction of the states of M_n. That is, not only is it hard to find an optimal policy, it will be the case that every policy that can be expressed as a polynomial-size circuit will perform very badly on very many inputs.\n\nFinally, we remark that by coupling the above construction with the approximate lower bound protocol of [3], one can prove (under a stronger assumption) that there are no succinct policies for the DBN-MDPs M_n which even approximate the optimum return to within an exponential factor.\n\nTheorem 2. If PSPACE is not contained in AM, then there is a family of DBN-MDPs M_1, M_2, ..., such that for any polynomial p, there exist infinitely many n for which no circuit of size p(n) can compute a policy having expected reward greater than 2^{-n} times the optimum.\n\nReferences\n\n[1] L. Babai, L. Fortnow, N. Nisan, and A. Wigderson. BPP has subexponential time simulations unless EXPTIME has publishable proofs. Computational Complexity, 3:307\u2013318, 1993.\n\n[2] L. Babai and S. Moran. Arthur-Merlin games: a randomized proof system, and a hierarchy of complexity classes. Journal of Computer and System Sciences, 36(2):254\u2013276, 1988.\n\n[3] S. Goldwasser and M. Sipser. Private coins versus public coins in interactive proof systems. Advances in Computing Research, 5:73\u201390, 1989.\n\n[4] D. Johnson. A catalog of complexity classes. In J.
van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A. The MIT Press, 1990.\n\n[5] P. Liberatore. The size of MDP factored policies. In Proceedings of AAAI 2002. AAAI Press, 2002.\n\n[6] C. Lund, L. Fortnow, H. Karloff, and N. Nisan. Algebraic methods for interactive proof systems. Journal of the ACM, 39(4):859\u2013868, 1992.\n\n[7] M. Mundhenk, J. Goldsmith, C. Lusena, and E. Allender. Complexity of finite-horizon Markov decision process problems. Journal of the ACM, 47(4):681\u2013720, 2000.\n\n[8] C. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.\n\n[9] A. Shamir. IP = PSPACE. Journal of the ACM, 39(4):869\u2013877, 1992.\n", "award": [], "sourceid": 2328, "authors": [{"given_name": "Eric", "family_name": "Allender", "institution": null}, {"given_name": "Sanjeev", "family_name": "Arora", "institution": null}, {"given_name": "Michael", "family_name": "Kearns", "institution": null}, {"given_name": "Cristopher", "family_name": "Moore", "institution": null}, {"given_name": "Alexander", "family_name": "Russell", "institution": null}]}