{"title": "Complexity of Decentralized Control: Special Cases", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 27, "abstract": "The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case.", "full_text": "Complexity of Decentralized Control: Special Cases\n\nMartin Allen\n\nmartin.allen@conncoll.edu\n\nDepartment of Computer Science\n\nConnecticut College\n\nNew London, CT 06320\n\nShlomo Zilberstein\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\nshlomo@cs.umass.edu\n\nAbstract\n\nThe worst-case complexity of general decentralized POMDPs, which are equiv-\nalent to partially observable stochastic games (POSGs) is very high, both for the\ncooperative and competitive cases. Some reductions in complexity have been\nachieved by exploiting independence relations in some models. We show that\nthese results are somewhat limited: when these independence assumptions are\nrelaxed in very small ways, complexity returns to that of the general case.\n\n1\n\nIntroduction\n\nDecentralized and partially observable stochastic decision and planning problems are very common,\ncomprising anything from strategic games of chance to robotic space exploration. In such domains,\nmultiple agents act under uncertainty about both their environment and the plans and actions of\nothers. 
These problems can be represented as decentralized partially observable Markov decision processes (Dec-POMDPs), or the equivalent partially observable stochastic games (POSGs), allowing for precise formulation of solution concepts and success criteria.\nAlas, such problems are highly complex. As shown by Bernstein et al. [1, 2], the full, cooperative problem—where all players share the same payoff, and strategies can depend upon entire observed histories—is NEXP-complete. More recently, Goldsmith and Mundhenk [3] showed that the competitive case can be worse: when teamwork is allowed among agents, complexity rises to NEXP^NP (problems solvable by a NEXP machine employing an NP set as an oracle). Much attention has thus been paid to restricted cases, particularly those where some parts of the system dynamics behave independently. The complexity of finite-horizon Dec-MDPs goes down—from NEXP to NP—when agents interact only via a joint reward structure, and are otherwise independent. Unfortunately, our new results show that further reductions, based on other combinations of fully or partially independent system dynamics, are unlikely, if not impossible.\nWe show that if the situation is reversed, so that rewards alone are independent, the problem remains NEXP-complete. Further, we consider two other Dec-POMDP sub-classes from the literature: (a) domains where local agent sub-problems are independent except for a (relatively small) number of event-based interactions, and (b) those where agents interact only by influencing the set of currently available actions. As it turns out, both types of problem are NEXP-complete as well—facts previously unknown. (In the latter case, this is a substantial increase in the known lower bound.) 
These results provide further impetus to devise new tools for the analysis and classification of problem difficulty in decentralized problem solving.\n\n2 Basic definitions\n\nThe cooperative, decentralized partially observable Markov decision process (Dec-POMDP) is a highly general and powerful framework, capable of representing a wide range of real-world problem domains. It extends the basic POMDP to multiple agents, operating in conjunction based on locally observed information about the world, and collecting a single source of reward.\nDefinition 1 (Dec-POMDP). A Dec-POMDP, D, is specified by a tuple:\n\nD = ⟨{αi}, S, {Ai}, P, {Ωi}, O, R, T⟩ (1)\n\nwith individual components as follows:\n\n• Each αi is an agent; S is a finite set of world states with a distinguished initial state s0; Ai is a finite set of actions, ai, available to αi; Ωi is a finite set of observations, oi, for αi; and T is the (finite or infinite) time-horizon of the problem.\n• P is the Markovian state-action transition function. P(s, a1, . . . , an, s′) is the probability of going from state s to state s′, given joint action ⟨a1, . . . , an⟩.\n• O is the joint observation function for the set of agents, given each state-action transition. O(a1, . . . , an, s′, o1, . . . , on) is the probability of observing ⟨o1, . . . , on⟩, if joint action ⟨a1, . . . , an⟩ causes a transition to global state s′.\n• R is the global reward function. R(s, a1, . . . , an) is the reward obtained for performing joint action ⟨a1, . . . , an⟩ when in global state s.\n\nThe most important sub-instance of the Dec-POMDP model is the decentralized MDP (Dec-MDP), where the joint observation tells us everything we need to know about the system state.\nDefinition 2 (Dec-MDP). 
A decentralized Markov decision process (Dec-MDP) is a Dec-POMDP that is jointly fully observable. That is, there exists a functional mapping, J : Ω1 × · · · × Ωn → S, such that O(a1, . . . , an, s′, o1, . . . , on) ≠ 0 if and only if J(o1, . . . , on) = s′.\nIn a Dec-MDP, then, the sum total of the individual agent observations provides a complete picture of the state of the environment. It is important to note, however, that this does not mean that any individual agent actually possesses this information. Dec-MDPs are still fully decentralized in general, and individual agents cannot count on access to the global state when choosing actions.\nDefinition 3 (Policies). A local policy for an agent αi is a mapping from sequences of that agent’s observations, oi = ⟨o1i, . . . , oki⟩, to its actions, πi : Ω∗i → Ai. A joint policy for n agents is a collection of local policies, one per agent, π = ⟨π1, . . . , πn⟩.\nA solution method for a decentralized problem seeks to find some joint policy that maximizes expected value given the starting state (or distribution over states) of the problem. For complexity purposes, the decision version of the Dec-(PO)MDP problem is to determine whether there exists some joint policy with value at least k.\n\n3 Bernstein’s proof of NEXP-completeness\n\nBefore establishing our new claims, we briefly review the NEXP-completeness result for finite-horizon Dec-MDPs, as given by Bernstein et al. [1, 2]. First, we note that the upper bound, namely that finite-horizon Dec-POMDPs are in NEXP, will immediately establish the same upper bound for all the problems that we will consider. (While we do not discuss the proof here, full details can be found in the original, or the supplemental materials to this paper, §1.)\nTheorem 1 (Upper Bound). 
The finite-horizon, n-agent decision problem Dec-POMDP ∈ NEXP.\nMore challenging (and interesting) is establishing lower bounds on these problems, which is done via a reduction from the known NEXP-complete TILING problem [4, 5]. A TILING problem instance consists of a board size n, given concisely in log n binary bits, a set of tile-types L = {t0, . . . , tk}, and a collection of horizontal and vertical compatibility relations between tiles, H, V ⊆ L × L. A tiling is a mapping of board locations to tile-types, t : {0, . . . , n − 1} × {0, . . . , n − 1} → L; such a tiling is consistent just in case (i) the origin location of the board receives tile-type t0 (t(0, 0) = t0); and (ii) all adjacent tile assignments are compatible:\n\n(∀x, y) ⟨t(x, y), t(x + 1, y)⟩ ∈ H & ⟨t(x, y), t(x, y + 1)⟩ ∈ V.\n\nThe TILING problem is thus to decide, for a given instance, whether such a consistent tiling exists. Figure 1 shows an example instance and consistent solution.\n\nFigure 1: An example of the TILING problem, and a consistent solution.\n\nThe reduction transforms a given instance of TILING into a 2-agent Dec-MDP, where each agent is queried about some location in the grid, and must answer with a tile to be placed there. By careful design of the query and response mechanism, it is ensured that a policy with non-negative value exists only if the agents already have a consistent tiling, thus showing the Dec-MDP to be as hard as TILING. Together with Theorem 1, and the fact that the finite-horizon, 2-agent Dec-MDP is a special case of the general finite-horizon Dec-POMDP, the reduction establishes Bernstein’s main complexity result (again, details are in the supplemental materials, §1):\nTheorem 2 (NEXP-Completeness). 
The finite-horizon Dec-POMDP problem is NEXP-complete.\n\n4 Factored Dec-POMDPs and independence\n\nIn general, the state transitions, observations, and rewards in a Dec-POMDP can involve probabilistic dependencies between agents. An obvious restricted subcase is thus one in which these factors are somehow independent. Becker et al. [6, 7] have thus studied problems in which the global state-space consists of the product of local states, so that each agent has its own individual state-space. A Dec-POMDP can then be transition independent, observation independent, or reward independent, according as the local effects given by the corresponding function are independent of one another.\nDefinition 4 (Factored Dec-POMDP). A factored, n-agent Dec-POMDP is a Dec-POMDP such that the system state can be factored into n + 1 distinct components, so that S = S0 × S1 × · · · × Sn, and no state-variable appears in both Si and Sj, i ≠ j.\nAs with the local (agent-specific) actions, ai, and observations, oi, in the general Dec-POMDP definition, we now refer to the local state, ŝi ∈ Si × S0, namely that portion of the overall state-space that is either specific to agent αi (si ∈ Si), or shared among all agents (s0 ∈ S0). We use the notation s−i for the sequence of all state-components except that for agent αi:\n\ns−i = (s0, s1, . . . , si−1, si+1, . . . , sn)\n\n(and similarly for action- or observation-sequences, a−i and o−i).\nDefinition 5 (Transition Independence). A factored, n-agent Dec-POMDP is transition independent iff the state-transition function can be separated into n + 1 distinct transition functions P0, . . . , Pn, where, for any next state s′i ∈ Si,\n\nP(s′i | (s0, . . . , sn), (a1, . . . , an), s′−i) = P0(s′0 | s0) if i = 0; Pi(s′i | ŝi, ai, s′0) otherwise.\n\nIn other words, the next local state of each agent is independent of the local states of all others, given its previous local state and local action, and the external system features (S0).\nDefinition 6 (Observation Independence). A factored, n-agent Dec-POMDP is observation independent iff the joint observation function can be separated into n separate probability functions O1, . . . , On, where, for any local observation oi ∈ Ωi,\n\nO(oi | (a1, . . . , an), (s′0, . . . , s′n), o−i) = Oi(oi | ai, ŝ′i).\n\nIn such cases, the probability of an agent’s individual observations is a function of their own local states and actions alone, independent of the states of others, and of what those others do or observe.\nDefinition 7 (Reward Independence). A factored, n-agent Dec-POMDP is reward independent iff the joint reward function can be represented by local reward functions R1, . . . , Rn, such that:\n\nR((s0, . . . , sn), (a1, . . . , an)) = f(R1(ŝ1, a1), . . . , Rn(ŝn, an))\n\nand\n\nRi(ŝi, ai) ≥ Ri(ŝi, a′i) ⇔ f(R1, . . . , Ri(ŝi, ai), . . . , Rn) ≥ f(R1, . . . , Ri(ŝi, a′i), . . . , Rn).\n\nThat is, joint reward is a function of local rewards, constrained so that we maximize global reward if and only if we maximize local rewards. A typical example is the additive sum:\n\nR((s0, . . . , sn), (a1, . . . 
, an)) = R1(\u02c6s1, a1) + \u00b7\u00b7\u00b7 + Rn(\u02c6sn, an).\n\nIt is important to note that each de\ufb01nition applies equally to Dec-MDPs; in such cases, joint full\nobservability of the overall state is often accompanied by full observability at the local level.\nDe\ufb01nition 8 (Local Full Observability). A factored, n-agent Dec-MDP is locally fully observable\niff an agent\u2019s local observation uniquely determines its local state: \u2200oi \u2208 \u2126i, \u2203\u02c6si : P (\u02c6si | oi) = 1.\nLocal full observability is not equivalent to independence of observations. In particular, a problem\nmay be locally fully observable without being observation independent (since agents may simply\nobserve outcomes of non-independent joint actions). On the other hand, it is easy to show that an\nobservation-independent Dec-MDP must be locally fully observable (supplementary, \u00a72).\n\n4.1 Shared rewards alone lead to reduced complexity\n\nIt is easy to see that if a Dec-MDP (or Dec-POMDP) has all three forms of independence given\nby De\ufb01nitions 5\u20137, it can be decomposed into n separate problems, where each agent \u03b1i works\nsolely within the local sub-environment Si \u00d7 S0. Such single-agent problems are known to be P-\ncomplete, and can generally be solved ef\ufb01ciently to high degrees of optimality. More interesting\nresults follow when only some forms of independence hold. In particular, it has been shown that\nDec-MDPs with both transition- and observation-independence, but not reward-independence, are\nNP-complete [8, 7]. (This result is discussed in detail in our supplementary material, \u00a73.)\nTheorem 3. A transition- and observation-independent Dec-MDP with joint reward is NP-complete.\n\n5 Other subclasses of interactions\n\nAs our new results will now show, there is a limit to this sort of complexity reduction: other relatively\nobvious combinations of independence relationships do not bear the same fruit. 
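Definitions 5–7 can be made concrete in code. The sketch below is our own illustration, not a construction from the paper: the two-agent model, the bit-flip dynamics, and all function names are invented, and the shared component S0 is omitted for brevity. It tests numerically whether a joint transition function is transition independent in the sense of Definition 5, by checking that each agent's marginal next-state distribution depends only on that agent's own local state and action.

```python
import itertools

STATES = [0, 1]     # local states for each of the two agents
ACTIONS = [0, 1]    # local actions

def make_joint(p1, p2):
    """Combine two local transition functions into a joint one (independent case)."""
    def joint(s, a, s_next):
        return p1(s[0], a[0], s_next[0]) * p2(s[1], a[1], s_next[1])
    return joint

def marginal(joint, i, s, a, si_next):
    """P(agent i's next local state = si_next | global state s, joint action a)."""
    pack = (lambda o: (si_next, o)) if i == 0 else (lambda o: (o, si_next))
    return sum(joint(s, a, pack(o)) for o in STATES)

def is_transition_independent(joint, tol=1e-9):
    """Definition 5, numerically: each agent's marginal next-state distribution
    may depend only on its own local state and action."""
    for i in (0, 1):
        seen = {}   # (local state, local action, next local state) -> probability
        for s in itertools.product(STATES, STATES):
            for a in itertools.product(ACTIONS, ACTIONS):
                for si_next in STATES:
                    key = (s[i], a[i], si_next)
                    p = marginal(joint, i, s, a, si_next)
                    if key in seen and abs(seen[key] - p) > tol:
                        return False   # marginal varies with the *other* agent
                    seen[key] = p
    return True

# Independent dynamics: each agent flips its own bit at an action-dependent rate.
def p_local(si, ai, si_next):
    flip = 0.1 if ai == 0 else 0.8
    return flip if si_next != si else 1.0 - flip

independent = make_joint(p_local, p_local)

# Coupled dynamics: agent 1's flip rate also depends on agent 2's current state.
def coupled(s, a, s_next):
    flip0 = 0.1 if s[1] == 0 else 0.9
    p0 = flip0 if s_next[0] != s[0] else 1.0 - flip0
    return p0 * p_local(s[1], a[1], s_next[1])

print(is_transition_independent(independent))  # True
print(is_transition_independent(coupled))      # False
```

Observation and reward independence (Definitions 6 and 7) admit the same kind of check, with observation probabilities or rewards in place of the transition marginals.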
That is, we show\nthe NP-completeness result to be speci\ufb01c to fully transition- and observation-independent problems.\nWhen these properties are not fully present, worst-case complexity is once again NEXP.\n\n5.1 Reward-independent-only models are NEXP-complete\n\nWe begin with a result that is rather simple, but has not, to the best of our knowledge, been estab-\nlished before. We consider the inverse of the NP-complete problem of Theorem 3: a Dec-MDP with\nreward-independence (Df. 7), but without transition- or observation-independence (Dfs. 5, 6).\nTheorem 4. Factored, reward-independent Dec-MDPs with n agents are NEXP-complete.\n\nProof Sketch. For the upper bound, we simply cite Theorem 1, immediately establishing that such\nproblems are in NEXP. For the lower bound, we simply modify the TILING Dec-MDP from Bern-\nstein\u2019s reduction proof so as to ensure that the reward-function factors appropriately into strictly\nlocal rewards. (Full details are found in [9], and the supplementary materials, \u00a74.1.)\n\nThus we see that in some respects, transition and observation independence are fundamental to\nthe reduction of worst-case complexity from NEXP to NP. When only the rewards depend upon\nthe actions of both agents, the problems become easier; however, when the situation is reversed,\n\n4\n\n\fthe general problem remains NEXP-hard. This is not entirely surprising: much of the complexity\nof planning in decentralized domains stems from the necessity to take account of how one\u2019s action-\noutcomes are affected by the actions of others, and from the complications that ensue when observed\ninformation about the system is tied to those actions as well. 
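The TILING problem driving these reductions is easy to state in code. The checker below is our own illustration (the dictionary-based instance encoding is assumed, not taken from the paper); it verifies the two consistency conditions from Section 3 for a candidate tiling. The hardness comes from the input encoding: the board size n is given in only log n bits, so a tiling itself has n² entries and exhaustive search over candidate tilings is doubly exponential in the input size.

```python
import itertools

def is_consistent(n, tiling, H, V):
    """Check a candidate tiling t : {0..n-1} x {0..n-1} -> tile-types.

    tiling: dict mapping (x, y) to a tile-type; H and V are sets of
    horizontally / vertically compatible ordered pairs of tile-types.
    """
    if tiling[(0, 0)] != 0:            # (i) the origin receives tile-type t0
        return False
    for x, y in itertools.product(range(n), repeat=2):
        if x + 1 < n and (tiling[(x, y)], tiling[(x + 1, y)]) not in H:
            return False               # (ii) horizontal compatibility
        if y + 1 < n and (tiling[(x, y)], tiling[(x, y + 1)]) not in V:
            return False               # (ii) vertical compatibility
    return True

# A toy instance: two tile-types that must alternate in both directions,
# solved by a checkerboard pattern.
H = {(0, 1), (1, 0)}
V = {(0, 1), (1, 0)}
checker = {(x, y): (x + y) % 2 for x, y in itertools.product(range(4), repeat=2)}
print(is_consistent(4, checker, H, V))   # True
```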
The structure of rewards, while ob-\nviously key to the nature of the optimal (or otherwise) solution, is not as vital\u2014even if agents can\nseparate their individual reward-functions, making them entirely independent, other dependencies\ncan still make the problem extremely complex.\nWe therefore turn to two other interesting special-case Dec-MDP frameworks, in which independent\nreward functions are accompanied by restricted degrees of transition- and observation-based interac-\ntion. While some empirical evidence has suggested that these problems may be easier on average to\nsolve, nothing has previously been shown about their worst-case complexity. We \ufb01ll in these gaps,\nshowing that even under such restricted dynamics, the problems remain NEXP-hard.\n\n5.2 Event-driven-interaction models are NEXP-complete\n\nThe \ufb01rst model we consider is one of Becker et al. [10], which generalizes the notion of a fully\ntransition-independent Dec-MDP. In this model, a set of primitive events, consisting of state-action\ntransitions, is de\ufb01ned for each agent. Such events can be thought of as occasions upon which\nthat agent takes the given action to generate the associated state transition. Dependencies are then\nintroduced in the form of relationships between one agent\u2019s possible actions in given states and\nanother agent\u2019s primitive events.\nWhile no precise worst-case complexity results have been previously proven, the authors do point out\nthat the class of problems has an upper-bound deterministic complexity that is exponential in the size\nof the state space, |S|, and doubly exponential in the number of de\ufb01ned interactions. This potentially\nbad news is mitigated by noting that if the number of interactions is small, then reasonably-sized\nproblems can still be solved. 
Here, we examine this issue in detail, showing that, in fact, these problems are NEXP-hard (indeed, NEXP-complete); however, when the number of dependencies is logarithmic in the size of the problem state-space, the problem becomes solvable in nondeterministic polynomial time (NP).\nWe begin with the formal framework of the model. Again, we give all definitions in terms of Dec-POMDPs; they apply immediately to Dec-MDPs in particular.\nDefinition 9 (History). A history for an agent αi in a factored, n-agent Dec-POMDP D is a sequence of possible local states and actions, beginning in the agent’s initial state: Φi = [ŝ0i, a0i, ŝ1i, a1i, . . .].\nWhen a problem has a finite time-horizon T, all possible complete histories will be of the form ΦTi = [ŝ0i, a0i, ŝ1i, a1i, . . . , ŝT−1i, aT−1i, ŝTi].\nDefinition 10 (Events in a History). A primitive event e = (ŝi, ai, ŝ′i) for an agent αi is a triple representing a transition between two local states, given some action ai ∈ Ai. An event E = {e1, e2, . . . , eh} is a set of primitive events. A primitive event e occurs in the history Φi, written Φi ⊨ e, if and only if the triple e is a sub-sequence of the sequence Φi. An event E occurs in the history Φi, written Φi ⊨ E, if and only if some component occurs in that history: ∃e ∈ E : Φi ⊨ e.\nEvents can therefore be thought of disjunctively. That is, they specify a set of possible state-action transitions from a Dec-POMDP, local to one of its agents. If the historical sequence of state-action transitions that the agent encounters contains any one of those particular transitions, then the history satisfies the overall event. Events can thus be used, for example, to represent such things as taking a particular action in any one of a number of states over time, or taking one of several actions at some particular state. For technical reasons, namely the use of a specialized solution algorithm, these events are usually restricted in structure, as follows.\nDefinition 11 (Proper Events). A primitive event e is proper if it occurs at most once in any given history. That is, for any history Φi, if Φi = Φ1i e Φ2i, then neither sub-history contains e: ¬(Φ1i ⊨ e) ∧ ¬(Φ2i ⊨ e). An event E is proper if it consists of proper primitive events that are mutually exclusive, in that no two of them both occur in any history:\n\n∀Φi ¬∃x, y : (x ≠ y) ∧ (ex ∈ E) ∧ (ey ∈ E) ∧ (Φi ⊨ ex) ∧ (Φi ⊨ ey).\n\nProper primitive events can be used, for instance, to represent actions that take place at particular times (building the time into the local state ŝi ∈ e). Since any given point in time can only occur once in any history, the events involving such time-steps will be proper by default. A proper event E can then be formed by collecting all the primitive events involving some single time-step, or by taking all possible primitive events involving an unrepeatable action.\nOur new model is then a Dec-MDP with:\n\n1. Two (2) agents.1\n2. A factored state-space: S = S0 × S1 × S2.\n3. Local full observability: each agent αi can determine its own portion of the state-space, ŝi ∈ S0 × Si, exactly.\n4. Independent (additive) rewards: R(⟨s0, s1, s2⟩, a1, a2) = R1(ŝ1, a1) + R2(ŝ2, a2).\n\nInteractions between agents are given in terms of a set of dependencies between certain state-action transitions for one agent, and events featuring transitions involving the other agent. 
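Definitions 9 and 10 translate directly into code. The sketch below is our own (the list-based history representation and all state and action names are hypothetical): a history is an alternating state–action sequence, a primitive event is a contiguous (state, action, next-state) transition within it, and an event occurs when any of its members does.

```python
def occurs(history, e):
    """Does primitive event e = (s, a, s_next) occur in the history
    [s0, a0, s1, a1, ..., sT] as a contiguous transition?"""
    s, a, s_next = e
    return any((history[t], history[t + 1], history[t + 2]) == (s, a, s_next)
               for t in range(0, len(history) - 2, 2))

def event_occurs(history, E):
    """An event (a set of primitive events) occurs if any member does."""
    return any(occurs(history, e) for e in E)

# Hypothetical local history: states are strings, actions are verbs.
hist = ["idle", "load", "loaded", "move", "atgoal"]

print(occurs(hist, ("idle", "load", "loaded")))                            # True
print(event_occurs(hist, {("x", "y", "z"), ("loaded", "move", "atgoal")})) # True
print(event_occurs(hist, {("idle", "move", "atgoal")}))                    # False
```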
Thus, if a history contains one of the primitive events from the latter set, this can have some direct effect upon the transition-model for the first agent, introducing probabilistic transition-dependencies.\nDefinition 12 (Dependency). A dependency is a pair dkij = ⟨Eki, Dkj⟩, where Eki is a proper event defined over primitive events for agent αi, and Dkj is a set of state-action pairs ⟨ŝj, aj⟩ for agent αj, such that each pair occurs in at most one dependency:\n\n¬(∃ k, k′, ŝj, aj) (k ≠ k′) & ⟨ŝj, aj⟩ ∈ Dkj ∈ dkij & ⟨ŝj, aj⟩ ∈ Dk′j ∈ dk′ij.\n\nSuch a dependency is thus a collection of possible actions that agent αj can take in one of its local states, each of which depends upon whether the other agent αi has made one of the state-transitions in its own set of primitive events. Such structures can be used to model, for instance, cases where one agent cannot successfully complete some task until the other agent has completed an enabling sub-task, or where the precise outcome depends upon the groundwork laid by the other agent.\nDefinition 13 (Satisfying Dependencies). A dependency dkij = ⟨Eki, Dkj⟩ is satisfied when the current history for enabling agent αi contains the relevant event: Φi ⊨ Eki. For any state-action pair ⟨ŝj, aj⟩, we define a Boolean indicator variable bŝjaj, which is true if and only if some dependency that contains the pair is satisfied:\n\nbŝjaj = 1 if (∃ dkij = ⟨Eki, Dkj⟩) ⟨ŝj, aj⟩ ∈ Dkj & Φi ⊨ Eki; 0 otherwise.\n\nThe existence of dependencies allows us to factor the overall state-transition function into two parts, each of which depends only on an agent’s local state, action, and relevant indicator variable.\nDefinition 14 (Local Transition Function). The transition function for our Dec-MDP is factored into two functions, P1 and P2, each defining the distribution over next possible local states: Pi(ŝ′i | ŝi, ai, bŝiai). We can thus write Pi(ŝi, ai, bŝiai, ŝ′i) for this transition probability.\nWhen agents take some action in a state for which dependencies exist, they observe whether or not the related events have occurred; that is, after taking any action aj in state ŝj, they can observe the state of the indicator variable bŝjaj.\nWith these definitions in place, we can now show that the worst-case complexity of the event-based problems is the same as for the general Dec-POMDP class.\nTheorem 5. Factored, finite-horizon, n-agent Dec-MDPs with local full observability, independent rewards, and event-driven interactions are NEXP-complete.\n\nProof Sketch. Again, the upper bound is immediate from Theorem 1, since the event-based structure is just a specific case of general transition-dependence, and such models can always be converted into Dec-MDPs without any events. For the lower bound, we again provide a reduction from TILING, constrained to our special case. Local reward independence, which was not present in the original problem, is ensured by using event dependencies to affect future rewards of the other agent. Thus, local immediate rewards remain dependent only upon the actions of the individual agent, but the state in which that agent finds itself (and so the options available to its reward function) can depend upon events involving the other agent. (See [9] and supplemental materials, §4.2.)\n\n1The model can be extended to n agents with little real difficulty. 
Since we will show that the 2-agent case is NEXP-hard, however, this will suffice for the general claim.\n\n5.2.1 A special, NP-hard case\n\nThe prior result requires allowing the number of dependencies in the problem to grow as a factor of log n, for a TILING grid of size (n × n). Since the size of the state-space S in the reduced Dec-MDP is also O(log n), the number of dependencies is O(|S|). Thus, the NEXP-completeness result holds for any event-based Dec-MDP where the number of dependencies is linear in the state-space. When we are able to restrict the number of dependencies further, however, we can do better.\nTheorem 6. Factored, finite-horizon, n-agent Dec-MDPs with local full observability, independent rewards, and event-driven interactions are solvable in nondeterministic polynomial time (NP) if the number of dependencies is O(log |S|), where S is the state-set of the problem.\n\nProof Sketch. As shown by Becker et al. [10], we can use the Coverage Set algorithm to generate an optimal policy for a problem of this type, in time that is exponential in the number of dependencies. Clearly, if this number is logarithmic in the size of the state-set, then solution time is polynomial in the problem size. (See [9] and supplemental materials, §4.2.1.)\n\n5.2.2 Discussion of the results\n\nThese results are interesting for two reasons. First, NEXP-completeness of the event-based case, even with independent rewards and local full observability (Theorem 5), means that many interesting problems are potentially intractable. Becker et al. [10] show how to use event-dependencies to represent common structures in the TAEMS task modeling language, used in many real-world domains [11, 12, 13]; our complexity analysis thus extends to such practical problems. Second, isolating where complexity is lower can help determine what task structures and agent interrelationships lead to intractability. 
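The arithmetic behind this exponential-to-polynomial collapse is worth making explicit. Assuming a running time of roughly c^k for some constant c > 1, where k is the number of dependencies (our reading of the cited analysis), a logarithmic bound on k yields a polynomial bound overall:

```latex
k = O(\log |S|)
\;\Longrightarrow\;
c^{k} = c^{O(\log |S|)} = |S|^{O(\log c)} = \mathrm{poly}(|S|).
```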
In domains where the dependency structure can be kept relatively simple, it may be possible to derive optimal solutions feasibly. Both subjects are worth further study.\n\n5.3 State-dependent-action models are NEXP-complete\n\nGuo and Lesser [14, 15, 16] consider another specialized Dec-MDP subclass, with apparently even more restricted types of interaction. Agent state-spaces are again separate, and all action-transitions and rewards are independent. Such problems are not wholly decoupled, however, as the actions available to each agent at any point depend upon the global system state. Thus, agents interact by making choices that restrict or broaden the range of actions available to others.\nDefinition 15 (Dec-MDP with State-Dependent Actions). An n-agent Dec-MDP with state-dependent actions is a tuple D = ⟨S0, {Si}, {Ai}, {Bi}, {Pi}, {Ri}, T⟩, where:\n\n• S0 is a set of shared states, and Si is the state-space of agent αi, with global state space S = S0 × S1 × · · · × Sn, and initial state s0 ∈ S; each Ai is the action-set for αi; T ∈ N is the finite time-horizon of the problem.\n• Each Bi : S → 2^Ai is a mapping from global states of the system to some set of available actions for each agent αi. For all s ∈ S, Bi(s) ≠ ∅.\n• Pi : (S0 × Si) × Ai × (S0 × Si) → [0, 1] is the state-transition function over local states for αi. The global transition function is simply the product of the individual Pi.\n• Ri : (S0 × Si) → ℝ is a local reward function for agent αi. We let the global reward function be the sum of local rewards.\n\nNote that there need be no observations in such a problem; given local full observability, each agent observes only its local states. 
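The availability mappings Bi of Definition 15 are easy to picture. The sketch below is a hypothetical illustration of ours, in the spirit of the shared-resource scenarios common in this literature: two agents on opposite banks of a stream share a single boat, recorded in the shared state-component s0, and the action of rowing across is available only to the agent whose bank currently holds the boat.

```python
def B(i, state):
    """Hypothetical availability mapping B_i(s) for a two-agent Dec-MDP.

    state = (s0, s1, s2): s0 is the shared component (which bank holds
    the boat); s1, s2 are the agents' private states (unused here).
    B_i(s) is always nonempty, as Definition 15 requires."""
    s0, _s1, _s2 = state
    actions = {"wait", "work"}            # purely local actions, always available
    bank = "left" if i == 1 else "right"  # agent 1 lives on the left bank
    if s0 == bank:                        # the boat is on this agent's bank,
        actions.add("row")                # so crossing is currently possible
    return actions

print(sorted(B(1, ("left", 0, 0))))   # ['row', 'wait', 'work']
print("row" in B(2, ("left", 0, 0)))  # False: agent 2 must wait for the boat
```

A local policy in this model then maps local states to members of Bi(s), and each agent's choices reshape the action-sets the other will face.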
Furthermore, it is presumed that each agent can observe its own\navailable actions in any state; a local policy is thus a mapping from local states to available actions.\nFor such cases, Guo presents a planning algorithm based on heuristic action-set pruning, along\nwith a learning algorithm. While empirical results show that these methods are capable of solving\npotentially large instances, we again know very little about the analytical worst-case dif\ufb01culty of\nproblems with state-dependent actions. An NP-hardness lower bound is given [14] for the overall\nclass, by reducing a normal-form game to the state-dependent model, but this is potentially quite\nweak, since no upper bound has been established, and even the operative algorithmic complexity\nof the given solution method is not well understood. We address this situation, showing that the\nproblem is also just as hard as the general case.\n\n7\n\n\fTheorem 7. Factored, \ufb01nite-horizon, n-agent Dec-MDPs with local full observability, independent\nrewards, and state-dependent action-sets are NEXP-complete.\n\nProof Sketch. Once more, we rely upon the general upper bound on the complexity of Dec-POMDPs\n(Theorem 1). The lower bound is by another TILING reduction. Again, we \u201crecord\u201d actions of each\nagent in the state-space of the other, ensuring purely local rewards and local full observability. This\ntime, however, we use the fact that action-sets depend upon the global state (rather than events) to\nenforce the desired dynamics. That is, we add special state-dependent actions that, based on their\navailability (or lack thereof), affect each agent\u2019s local reward. (See [9], and supplemental \u00a74.3.)\n\n5.3.1 Discussion of the result\n\nGuo and Lesser [16, 14] were able to show that deciding whether a decentralized problem with\nstate-based actions had an equilibrium solution with value greater than k was NP-hard. 
It was not\nascertained whether or not this lower bound was tight, however; this remained a signi\ufb01cant open\nquestion. Our results show that this bound was indeed too low. Since an optimal joint policy will be\nan equilibrium for the special case of additive rewards, the general problem can be no easier.\nThis is interesting, for reasons beyond the formal. Such decentralized problems indeed appear to be\nquite simple in structure, requiring wholly independent rewards and action-transitions, so that agents\ncan only interact with one another via choices that affect which actions are available. (A typical\nexample involves two persons acting completely regardless of one another, except for the existence\nof a single rowboat, used for crossing a stream; if either agent uses the rowboat to get to the other\nside, then that action is no longer available to the other.) Such problems are intuitive, and common,\nand not all of them are hard to solve, obviously. At the same time, however, our results show that\nthe same structures can be intractable in the worst case, establishing that even seemingly simple\ninteractions between agents can lead to prohibitively high complexity in decentralized problems.\n\n6 Conclusions\n\nThis work addresses a number of existing models for decentralized problem-solving. In each case,\nthe models restrict agent interaction in some way, in order to produce a special sub-case of the\ngeneral Dec-POMDP problem. It has been known for some time that systems where agents act\nentirely independently, but share rewards, have reduced worst-case complexity. We have shown that\nthis does not apply to other variants, where we relax the independence requirements even only a\nlittle. In all of the cases addressed, the new problem variants are as hard as the general case. 
This fact, combined with results showing many other decentralized problem models to be equivalent to the general Dec-POMDP model, or strictly harder [17], reveals the essential difficulty of optimal planning in decentralized settings. Together, these results begin to suggest that optimal solutions to many common multiagent problems must remain out of reach; in turn, since such problems are so prevalent in practice, this indicates that we must look to approximate or heuristic methods.

At the same time, it must be stressed that the NEXP-complexity demonstrated here is a worst-case measure. Not all decentralized domains are going to be intractable, and indeed the event-based and action-set models have been shown to yield to specialized solution methods in many cases, enabling us to solve interesting instances in reasonable amounts of time. When the number of action-dependencies is small, or there are few ways that agents can affect available action-sets, it may well be possible to provide optimal solutions effectively. That is, the high worst-case complexity is no guarantee that average-case difficulty is likewise high. This remains a vital open problem in the field. While establishing the average case is often difficult, if not impossible (the notion of an “average” planning or decision problem is often ill-defined), it is still worth serious consideration.

Acknowledgments

This material is based upon work supported by the Air Force Office of Scientific Research under Award No. FA9550-05-1-0254. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of AFOSR. The first author also acknowledges the support of the Andrew W. Mellon Foundation CTW Computer Science Consortium Fellowship.

References

[1] Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman.
The complexity of decentralized control of Markov decision processes. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 32–37, Stanford, California, 2000.

[2] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

[3] Judy Goldsmith and Martin Mundhenk. Competition adds complexity. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 561–568. MIT Press, Cambridge, MA, 2008.

[4] Harry R. Lewis. Complexity of solvable cases of the decision problem for predicate calculus. In Proceedings of the Nineteenth Symposium on the Foundations of Computer Science, pages 35–47, Ann Arbor, Michigan, 1978.

[5] Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, Reading, Massachusetts, 1994.

[6] Raphen Becker, Shlomo Zilberstein, Victor Lesser, and Claudia V. Goldman. Transition-independent decentralized Markov decision processes. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multi-Agent Systems, pages 41–48, Melbourne, Australia, 2003.

[7] Raphen Becker, Shlomo Zilberstein, Victor Lesser, and Claudia V. Goldman. Solving transition independent decentralized MDPs. Journal of Artificial Intelligence Research, 22:423–455, November 2004.

[8] Claudia V. Goldman and Shlomo Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22:143–174, 2004.

[9] Martin Allen. Agent Interactions in Decentralized Environments. PhD thesis, University of Massachusetts, Amherst, Massachusetts, 2009.
Available at http://scholarworks.umass.edu/open_access_dissertations/1/.

[10] Raphen Becker, Victor Lesser, and Shlomo Zilberstein. Decentralized Markov decision processes with event-driven interactions. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multi-Agent Systems, pages 302–309, New York, New York, 2004.

[11] Keith S. Decker and Victor R. Lesser. Quantitative modeling of complex environments. International Journal of Intelligent Systems in Accounting, Finance and Management, 2:215–234, 1993.

[12] V. Lesser, K. Decker, T. Wagner, N. Carver, A. Garvey, B. Horling, D. Neiman, R. Podorozhny, M. Nagendra Prasad, A. Raja, R. Vincent, P. Xuan, and X. Q. Zhang. Evolution of the GPGP/TAEMS domain-independent coordination framework. Autonomous Agents and Multi-Agent Systems, 9(1):87–143, 2004.

[13] Tom Wagner, Valerie Guralnik, and John Phelps. TAEMS agents: Enabling dynamic distributed supply chain management. Journal of Electronic Commerce Research and Applications, 2:114–132, 2003.

[14] AnYuan Guo. Planning and Learning for Weakly-Coupled Distributed Agents. PhD thesis, University of Massachusetts, Amherst, 2006.

[15] AnYuan Guo and Victor Lesser. Planning for weakly-coupled partially observable stochastic games. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 1715–1716, Edinburgh, Scotland, 2005.

[16] AnYuan Guo and Victor Lesser. Stochastic planning for weakly-coupled distributed agents. In Proceedings of the Fifth Joint Conference on Autonomous Agents and Multiagent Systems, pages 326–328, Hakodate, Japan, 2006.

[17] Sven Seuken and Shlomo Zilberstein. Formal models and algorithms for decentralized decision making under uncertainty.
Autonomous Agents and Multi-Agent Systems, 17(2):190–250, 2008.