{"title": "Individual Planning in Infinite-Horizon Multiagent Settings: Inference, Structure and Scalability", "book": "Advances in Neural Information Processing Systems", "page_first": 478, "page_last": 486, "abstract": "This paper provides the first formalization of self-interested planning in multiagent settings using expectation-maximization (EM). Our formalization in the context of infinite-horizon and finitely-nested interactive POMDPs (I-POMDP) is distinct from EM formulations for POMDPs and cooperative multiagent planning frameworks. We exploit the graphical model structure specific to I-POMDPs, and present a new approach based on block-coordinate descent for further speed up. Forward filtering-backward sampling -- a combination of exact filtering with sampling -- is explored to exploit problem structure.", "full_text": "Individual Planning in In\ufb01nite-Horizon Multiagent\n\nSettings: Inference, Structure and Scalability\n\nXia Qu\n\nEpic Systems\n\nVerona, WI 53593\n\nPrashant Doshi\n\nTHINC Lab, Dept. of Computer Science\nUniversity of Georgia, Athens, GA 30622\n\nquxiapisces@gmail.com\n\npdoshi@cs.uga.edu\n\nAbstract\n\nThis paper provides the \ufb01rst formalization of self-interested planning in multia-\ngent settings using expectation-maximization (EM). Our formalization in the con-\ntext of in\ufb01nite-horizon and \ufb01nitely-nested interactive POMDPs (I-POMDP) is\ndistinct from EM formulations for POMDPs and cooperative multiagent planning\nframeworks. 
We exploit the graphical model structure specific to I-POMDPs, and present a new approach based on block-coordinate descent for further speed up. Forward filtering-backward sampling – a combination of exact filtering with sampling – is explored to exploit problem structure.

1 Introduction

Generalization of bounded policy iteration (BPI) to finitely-nested interactive partially observable Markov decision processes (I-POMDPs) [1] is currently the leading method for infinite-horizon self-interested multiagent planning and for obtaining finite-state controllers as solutions. However, interactive BPI is acutely prone to converging to local optima and has only a limited ability to escape them, which severely limits the quality of its solutions.

Attias [2] posed planning using an MDP as a likelihood maximization problem where the "data" is the initial state and the final goal state or the maximum total reward. Toussaint et al. [3] extended this to infer finite-state automata for infinite-horizon POMDPs. Experiments reveal good quality controllers of small sizes, although run time is a concern. Given BPI's limitations and the compelling potential of this approach in bringing advances in inferencing to bear on planning, we generalize it to infinite-horizon and finitely-nested I-POMDPs. Our generalization allows its use toward planning for an individual agent in noncooperative settings, where we may not assume common knowledge of initial beliefs or common rewards, due to which others' beliefs, capabilities and preferences are modeled.

Analogously to POMDPs, we formulate a mixture of finite-horizon DBNs. However, the DBNs differ by including models of other agents in a special model node. Our approach, labeled I-EM, improves on the straightforward extension of Toussaint et al.'s EM to I-POMDPs by utilizing various types of structure.
Instead of ascribing as many level 0 \ufb01nite-state controllers as candidate models\nand improving each using its own EM, we use the underlying graphical structure of the model node\nand its update to formulate a single EM that directly provides the marginal of others\u2019 actions across\nall models. This rests on a new insight, which considerably simpli\ufb01es and speeds EM at level 1.\nWe present a general approach based on block-coordinate descent [4, 5] for speeding up the non-\nasymptotic rate of convergence of the iterative EM. The problem is decomposed into optimization\nsubproblems in which the objective function is optimized with respect to a small subset (block) of\nvariables, while holding other variables \ufb01xed. We discuss the unique challenges and present the \ufb01rst\neffective application of this iterative scheme to multiagent planning.\nFinally, sampling offers a way to exploit the embedded problem structure such as information in dis-\ntributions. The exact forward-backward E-step is replaced with forward \ufb01ltering-backward sampling\n\n1\n\n\f(FFBS) that generates trajectories weighted with rewards, which are used to update the parameters of\nthe controller. While sampling has been integrated in EM previously [6], FFBS speci\ufb01cally mitigates\nerror accumulation over long horizons due to the exact forward step.\n\n2 Overview of Interactive POMDPs\nA \ufb01nitely-nested I-POMDP [7] for an agent i with strategy level, l, interacting with agent j is:\nI-POMDPi,l = \ufffdISi,l, A, Ti, \u03a9i, Oi, Ri, OCi\ufffd\n\u2022 ISi,l denotes the set of interactive states de\ufb01ned as, ISi,l = S \u00d7 Mj,l\u22121, where Mj,l\u22121 =\n{\u0398j,l\u22121 \u222a SMj}, for l \u2265 1, and ISi,0 = S, where S is the set of physical states. \u0398j,l\u22121 is the\nset of computable, intentional models ascribed to agent j: \u03b8j,l\u22121 = \ufffdbj,l\u22121, \u02c6\u03b8j\ufffd. 
Here b_{j,l−1} is agent j's level l−1 belief, b_{j,l−1} ∈ Δ(IS_{j,l−1}) where Δ(·) is the space of distributions, and θ̂_j = ⟨A, T_j, Ω_j, O_j, R_j, OC_j⟩ is j's frame. At level l = 0, b_{j,0} ∈ Δ(S) and an intentional model reduces to a POMDP. SM_j is the set of subintentional models of j; an example is a finite state automaton.

• A = A_i × A_j is the set of joint actions of all agents.
• Other parameters – transition function, T_i, observations, Ω_i, observation function, O_i, and preference function, R_i – have their usual semantics analogously to POMDPs but involve joint actions.
• Optimality criterion, OC_i, here is the discounted infinite horizon sum.

An agent's belief over its interactive states is a sufficient statistic fully summarizing the agent's observation history. Given the associated belief update, the solution to an I-POMDP is a policy. Using the Bellman equation, each belief state in an I-POMDP has a value, which is the maximum payoff the agent can expect starting from that belief and over the future.

3 Planning in I-POMDP as Inference

We may represent the policy of agent i for the infinite horizon case as a stochastic finite state controller (FSC), defined as: π_i = ⟨N_i, T_i, L_i, V_i⟩, where N_i is the set of nodes in the controller. T_i : N_i × A_i × Ω_i × N_i → [0, 1] represents the node transition function; L_i : N_i × A_i → [0, 1] denotes agent i's action distribution at each node; and an initial distribution over the nodes is denoted by V_i : N_i → [0, 1]. For convenience, we group V_i, T_i and L_i in f̂_i. Define a controller at level l for agent i as π_{i,l} = ⟨N_{i,l}, f̂_{i,l}⟩, where N_{i,l} is the set of nodes in the controller and f̂_{i,l} groups the remaining parameters of the controller as mentioned before.
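To make the controller parameterization concrete, the following is a minimal Python sketch of a stochastic FSC ⟨N_i, T_i, L_i, V_i⟩. The class layout, its dictionary-based tables and the tiny one-node example are illustrative assumptions for exposition, not the authors' implementation.

```python
import random

# Minimal sketch of a stochastic finite-state controller pi_i = <N, T, L, V>.
# Dict-based tables are an illustrative choice, not the paper's code.
class FSC:
    def __init__(self, nodes, actions, observations, V, L, T):
        self.nodes = nodes                # N_i: controller nodes
        self.actions = actions            # A_i
        self.observations = observations  # Omega_i
        self.V = V  # V_i: node -> probability (initial node distribution)
        self.L = L  # L_i: (node, action) -> probability (action distribution)
        self.T = T  # T_i: (node, action, obs, node') -> probability

    def initial_node(self, rng):
        return rng.choices(self.nodes, weights=[self.V[n] for n in self.nodes])[0]

    def act(self, node, rng):
        return rng.choices(self.actions,
                           weights=[self.L[(node, a)] for a in self.actions])[0]

    def next_node(self, node, action, obs, rng):
        weights = [self.T[(node, action, obs, n2)] for n2 in self.nodes]
        return rng.choices(self.nodes, weights=weights)[0]

# A deterministic one-node "always listen" controller for a tiger-like domain.
fsc = FSC(nodes=["n0"], actions=["listen"], observations=["growl-left"],
          V={"n0": 1.0}, L={("n0", "listen"): 1.0},
          T={("n0", "listen", "growl-left", "n0"): 1.0})
rng = random.Random(0)
n0 = fsc.initial_node(rng)
a0 = fsc.act(n0, rng)
n1 = fsc.next_node(n0, a0, "growl-left", rng)
```

Executing one step of the controller chains `initial_node`, `act` and `next_node`, mirroring how V_i, L_i and T_i are used in the DBNs below.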
Analogously to POMDPs [3], we formulate planning in multiagent settings formalized by I-POMDPs as a likelihood maximization problem:

π*_{i,l} = arg max_{π_{i,l} ∈ Π_{i,l}} (1 − γ) Σ_{T=0}^∞ γ^T P r(r^T_i = 1 | T; π_{i,l})   (1)

where Π_{i,l} is the set of all level-l FSCs of agent i, and r^T_i is a binary random variable whose value is 0 or 1, emitted after T time steps with probability proportional to the reward, R_i(s, a_i, a_j).

Figure 1: (a) Mixture of DBNs with 1 to T time slices for I-POMDP_{i,1} with i's level-1 policy represented as a standard FSC whose "node state" is denoted by n_{i,l}. The DBNs differ from those for POMDPs by containing special model nodes (hexagons) whose values are candidate models of other agents. (b) Hexagonal model nodes and edges in bold for one other agent j in (a) decompose into this level-0 DBN. Values of the node m^t_{j,0} are the candidate models. CPT of chance node a^t_j denoted by φ_{j,0}(m^t_{j,0}, a^t_j) is inferred using likelihood maximization.

The planning problem is modeled as a mixture of DBNs of increasing time from T = 0 onwards (Fig. 1). The transition and observation functions of I-POMDP_{i,l} parameterize the chance nodes s and o_i, respectively, along with P r(r^T_i | a^T_i, a^T_j, s^T) ∝ (R_i(s^T, a^T_i, a^T_j) − R_min) / (R_max − R_min).
Here, R_max and R_min are the maximum and minimum reward values in R_i.

The networks include nodes, n_{i,l}, of agent i's level-l FSC. Therefore, functions in f̂_{i,l} parameterize the network as well, and these are to be inferred. Additionally, the network includes the hexagonal model nodes – one for each other agent – that contain the candidate level 0 models of the agent. Each model node provides the expected distribution over another agent's actions. Without loss of generality, no edges exist between model nodes in the same time step. Correlations between agents could be included as state variables in the models.

Agent j's model nodes, the edges (in bold) between them, and those between the model and chance action nodes represent a DBN of length T, as shown in Fig. 1(b). Values of the chance node, m^0_{j,0}, are the candidate models of agent j. Agent i's initial belief over the state and models of j becomes the parameters of s^0 and m^0_{j,0}. The likelihood maximization at level 0 seeks to obtain the distribution, P r(a_j | m^0_{j,0}), for each candidate model in node m^0_{j,0}, using EM on the DBN.

Proposition 1 (Correctness). The likelihood maximization problem as defined in Eq. 1 with the mixture models as given in Fig. 1 is equivalent to the problem of solving the original I-POMDP_{i,l} with discounted infinite horizon whose solution assumes the form of a finite state controller.

All proofs are given in the supplement. Given the unique mixture models above, the challenge is to generalize the EM-based iterative maximization for POMDPs to the framework of I-POMDPs.

3.1 Single EM for Level 0 Models

The straightforward approach is to infer a likely FSC for each level 0 model. However, this approach does not scale to many models.
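As a quick numerical check of this construction, the sketch below (illustrative Python; the function names are ours) rescales rewards into the [0, 1] probability of r^T_i = 1 and confirms that the mixture weights (1 − γ)γ^T over DBN lengths form a proper distribution.

```python
# Sketch of the reward-to-likelihood rescaling and the geometric mixture over
# DBN lengths T used in Eq. 1. Function names are illustrative.
def reward_likelihood(r, r_min, r_max):
    # Pr(r^T = 1 | ...) is proportional to the reward, rescaled to [0, 1].
    return (r - r_min) / (r_max - r_min)

def time_weight(T, gamma):
    # Each DBN of length T contributes with mixture weight (1 - gamma) * gamma^T.
    return (1.0 - gamma) * gamma ** T

gamma = 0.9
# The weights over T = 0, 1, 2, ... sum to 1 in the limit:
total = sum(time_weight(T, gamma) for T in range(2000))
p = reward_likelihood(-5.0, r_min=-10.0, r_max=10.0)
```

With γ = 0.9, truncating the sum at T = 2000 already recovers the unit mass up to numerical precision, which is why experiments can cap the mixture at a finite T_max.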
Proposition 2 below shows that the dynamic P r(a^t_j | s^t) is sufficient predictive information about the other agent from its candidate models at time t, to obtain the most likely policy of agent i. This is markedly different from using behavioral equivalence [8], which clusters models with identical solutions; the latter continues to require the full solution of each model.

Proposition 2 (Sufficiency). Distributions P r(a^t_j | s^t) across actions a^t_j ∈ A_j for each state s^t are sufficient predictive information about the other agent j to obtain the most likely policy of i.

In the context of Proposition 2, we seek to infer P r(a^t_j | m^t_{j,0}) for each (updated) model of j at all time steps, which is denoted as φ_{j,0}. Other terms in the computation of P r(a^t_j | s^t) are known parameters of the level 0 DBN. The likelihood maximization for the level 0 DBN is:

φ*_{j,0} = arg max_{φ_{j,0}} (1 − γ) Σ_{T=0}^∞ Σ_{m_{j,0} ∈ M^T_{j,0}} γ^T P r(r^T_j = 1 | T, m_{j,0}; φ_{j,0})

As the trajectory consisting of states, models, actions and observations of the other agent is hidden at planning time, we may solve the above likelihood maximization using EM.

E-step  Let z^{0:T}_j = {s^t, m^t_{j,0}, a^t_j, o^t_j}^T_0, where the observation at t = 0 is null, be the hidden trajectory. The log likelihood is obtained as an expectation over these hidden trajectories:

Q(φ'_{j,0} | φ_{j,0}) = Σ_{T=0}^∞ Σ_{z^{0:T}_j} P r(r^T_j = 1, z^{0:T}_j, T; φ_{j,0}) log P r(r^T_j = 1, z^{0:T}_j, T; φ'_{j,0})   (2)

The "data" in the level 0 DBN consists of the initial belief over the state and models, b^0_{i,1}, and the observed reward at T. Analogously to EM for POMDPs, this motivates forward filtering-backward smoothing on a network with joint state (s^t, m^t_{j,0}) for computing the log likelihood.
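Proposition 2's sufficient statistic can be illustrated with a small sketch (Python; the helper name and the numbers are made up for the example): the marginal P r(a_j | s) is obtained by mixing each candidate model's action distribution φ_{j,0}(a | m) with the current joint weight over (s, m) pairs, so no individual model solution needs to be retained.

```python
# Sketch of the marginal Pr(a_j | s) from Proposition 2: mix each candidate
# model's action distribution phi(a | m) with the joint weight over (s, m).
# All names and numbers below are illustrative.
def action_marginal(weights, phi, actions):
    """weights: {(s, m): weight}; phi: {(m, a): prob}. Returns {(s, a): Pr(a | s)}."""
    mixed = {}
    for (s, m), w in weights.items():
        for a in actions:
            mixed[(s, a)] = mixed.get((s, a), 0.0) + w * phi[(m, a)]
    out = {}
    for s in {s for (s, _a) in mixed}:
        z = sum(mixed[(s, a)] for a in actions)  # normalize per state
        out.update({(s, a): mixed[(s, a)] / z for a in actions})
    return out

weights = {("s0", "m1"): 0.5, ("s0", "m2"): 0.5}
phi = {("m1", "listen"): 1.0, ("m1", "open"): 0.0,
       ("m2", "listen"): 0.5, ("m2", "open"): 0.5}
marginal = action_marginal(weights, phi, ["listen", "open"])
```

Here two equally weighted candidate models, one that always listens and one that acts uniformly, yield P r(listen | s0) = 0.75.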
The transition function for the forward and backward steps is:

P r(s^t, m^t_{j,0} | s^{t−1}, m^{t−1}_{j,0}) = Σ_{a^{t−1}_j, o^t_j} φ_{j,0}(m^{t−1}_{j,0}, a^{t−1}_j) T_{m_j}(s^{t−1}, a^{t−1}_j, s^t) P r(m^t_{j,0} | m^{t−1}_{j,0}, a^{t−1}_j, o^t_j) × O_{m_j}(s^t, a^{t−1}_j, o^t_j)   (3)

where m_j in the subscripts is j's model at t − 1. Here, P r(m^t_{j,0} | m^{t−1}_{j,0}, a^{t−1}_j, o^t_j) is the Kronecker-delta function that is 1 when j's belief in m^{t−1}_{j,0} updated using a^{t−1}_j and o^t_j equals the belief in m^t_{j,0}; otherwise 0.

Forward filtering gives the probability of the next state as follows:

α^t(s^t, m^t_{j,0}) = Σ_{s^{t−1}, m^{t−1}_{j,0}} P r(s^t, m^t_{j,0} | s^{t−1}, m^{t−1}_{j,0}) α^{t−1}(s^{t−1}, m^{t−1}_{j,0})

where α^0(s^0, m^0_{j,0}) is the initial belief of agent i. The smoothing by which we obtain the joint probability of the state and model at t − 1 from the distribution at t is:

β^h(s^{t−1}, m^{t−1}_{j,0}) = Σ_{s^t, m^t_{j,0}} P r(s^t, m^t_{j,0} | s^{t−1}, m^{t−1}_{j,0}) β^{h−1}(s^t, m^t_{j,0})

where h denotes the horizon to T and β^0(s^T, m^T_{j,0}) = E_{a^T_j}[P r(r^T_j = 1 | s^T, m^T_{j,0})]. Messages α^t and β^h give the probability of a state at some time slice in the DBN. As we consider a mixture of BNs, we seek probabilities for all states in the mixture model.
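The α and β recursions above are ordinary message passing over the joint chain; a minimal sketch (illustrative Python, with a made-up two-point state space standing in for the joint (s, m_{j,0}) and a made-up transition matrix) shows one step of each:

```python
# Forward filtering and backward smoothing as plain message passing. The
# two-point state space stands in for the joint (s, m_j0); P[x2][x1] plays the
# role of Pr(s^t, m^t | s^{t-1}, m^{t-1}). All numbers are illustrative.
def forward(alpha_prev, P):
    n = len(alpha_prev)
    return [sum(P[x2][x1] * alpha_prev[x1] for x1 in range(n)) for x2 in range(n)]

def backward(beta_next, P):
    n = len(beta_next)
    return [sum(P[x2][x1] * beta_next[x2] for x2 in range(n)) for x1 in range(n)]

P = [[0.9, 0.2],   # column-stochastic transition: P[x'][x]
     [0.1, 0.8]]
alpha0 = [0.6, 0.4]          # agent i's initial belief (alpha^0)
alpha1 = forward(alpha0, P)  # probability mass is preserved
beta0 = [1.0, 0.5]           # expected terminal reward likelihood (beta^0)
beta1 = backward(beta0, P)
```

Since the transition kernel is column-stochastic, `forward` preserves probability mass, while `backward` propagates the expected terminal reward likelihood toward earlier slices.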
Subsequently, we may compute the forward and backward messages at all states for the entire mixture model in one sweep:

α̃(s, m_{j,0}) = Σ_{t=0}^∞ P r(T = t) α^t(s, m_{j,0})        β̃(s, m_{j,0}) = Σ_{h=0}^∞ P r(T = h) β^h(s, m_{j,0})   (4)

Model growth  As the other agent performs its actions and makes observations, the space of j's models grows exponentially: starting from a finite set of |M^0_{j,0}| models, we obtain O(|M^0_{j,0}| (|A_j||Ω_j|)^t) models at time t. This greatly increases the number of trajectories in Z^{0:T}_j. We limit the growth in the model space by sampling models at the next time step from the distribution, α^t(s^t, m^t_{j,0}), as we perform each step of forward filtering. This limits the growth by exploiting the structure present in φ_{j,0} and O_j, which guide how the models grow.

M-step  We obtain the updated φ'_{j,0} from the full log likelihood in Eq. 2 by separating the terms:

Q(φ'_{j,0} | φ_{j,0}) = [terms independent of φ'_{j,0}] + Σ_{T=0}^∞ Σ_{z^{0:T}_j} P r(r^T_j = 1, z^{0:T}_j, T; φ_{j,0}) Σ_{t=0}^T log φ'_{j,0}(a^t_j | m^t_{j,0})

and maximizing it w.r.t. φ'_{j,0}:

φ'_{j,0}(a^t_j, m^t_{j,0}) ∝ φ_{j,0}(a^t_j, m^t_{j,0}) Σ_{s^t} α̃(s^t, m^t_{j,0}) [ R_{m_j}(s^t, a^t_j) + γ/(1 − γ) Σ_{s^{t+1}, m^{t+1}_{j,0}, o^{t+1}_j} T_{m_j}(s^t, a^t_j, s^{t+1}) P r(m^{t+1}_{j,0} | m^t_{j,0}, a^t_j, o^{t+1}_j) O_{m_j}(s^{t+1}, a^t_j, o^{t+1}_j) β̃(s^{t+1}, m^{t+1}_{j,0}) ]

3.2 Improved EM for Level l I-POMDP

At strategy levels l ≥ 1, Eq. 1 defines the likelihood maximization problem, which is iteratively solved using EM.
We show the E- and M-steps next, beginning with l = 1.

E-step  In a multiagent setting, the hidden variables additionally include what the other agent may observe and how it acts over time. However, a key insight is that Prop. 2 allows us to limit attention to the marginal distribution over other agents' actions given the state. Thus, let z^{0:T}_i = {s^t, o^t_i, a^t_i, n^t_{i,l}, a^t_j, . . . , a^t_k}^T_0, where the observation at t = 0 is null, and other agents are labeled j to k; this group is denoted −i. The full log likelihood involves an expectation over hidden variables:

Q(π'_{i,l} | π_{i,l}) = Σ_{T=0}^∞ Σ_{z^{0:T}_i} P r(r^T_i = 1, z^{0:T}_i, T; π_{i,l}) log P r(r^T_i = 1, z^{0:T}_i, T; π'_{i,l})   (5)

Due to the subjective perspective in I-POMDPs, Q computes the likelihood of agent i's FSC only (and not of joint FSCs as in team planning [9]).

In the T-step DBN of Fig. 1, observed evidence includes the reward, r^T_i, at the end and the initial belief. We seek the likely distributions, V_i, T_i, and L_i, across time slices. We may again realize the full joint in the expectation using a forward-backward algorithm on a hidden Markov model whose state is (s^t, n^t_{i,l}). The transition function of this model is:

P r(s^t, n^t_{i,l} | s^{t−1}, n^{t−1}_{i,l}) = Σ_{a^{t−1}_i, a^{t−1}_{−i}, o^t_i} L_i(n^{t−1}_{i,l}, a^{t−1}_i) Π_{−i} P r(a^{t−1}_{−i} | s^{t−1}) T_i(n^{t−1}_{i,l}, a^{t−1}_i, o^t_i, n^t_{i,l}) × T_i(s^{t−1}, a^{t−1}_i, a^{t−1}_{−i}, s^t) O_i(s^t, a^{t−1}_i, a^{t−1}_{−i}, o^t_i)   (6)

In addition to the parameters of I-POMDP_{i,l}, which are given, parameters of agent i's controller and those relating to other agents' predicted actions, φ_{−i,0}, are present in Eq. 6. Notice that in consequence of Proposition 2, Eq.
6 precludes j's observation and node transition functions.

The forward message, α^t = P r(s^t, n^t_{i,l}), represents the probability of being at some state of the DBN at time t:

α^t(s^t, n^t_{i,l}) = Σ_{s^{t−1}, n^{t−1}_{i,l}} P r(s^t, n^t_{i,l} | s^{t−1}, n^{t−1}_{i,l}) α^{t−1}(s^{t−1}, n^{t−1}_{i,l})   (7)

where α^0(s^0, n^0_{i,l}) = V_i(n^0_{i,l}) b^0_{i,l}(s^0). The backward message gives the probability of observing the reward in the final T-th time step given a state of the Markov model, β^h(s^t, n^t_{i,l}):

β^h(s^t, n^t_{i,l}) = Σ_{s^{t+1}, n^{t+1}_{i,l}} P r(s^{t+1}, n^{t+1}_{i,l} | s^t, n^t_{i,l}) β^{h−1}(s^{t+1}, n^{t+1}_{i,l})   (8)

where β^0(s^T, n^T_{i,l}) = P r(r^T_i = 1 | s^T, n^T_{i,l}) = Σ_{a^T_i, a^T_{−i}} L_i(n^T_{i,l}, a^T_i) Π_{−i} P r(a^T_{−i} | s^T) P r(r^T_i = 1 | s^T, a^T_i, a^T_{−i}), and 1 ≤ h ≤ T is the horizon. Here, P r(r^T_i = 1 | s^T, a^T_i, a^T_{−i}) ∝ R_i(s^T, a^T_i, a^T_{−i}).

A side effect of P r(a^t_{−i} | s^t) being dependent on t is that we can no longer conveniently define α̃ and β̃ for use in the M-step at level 1. Instead, the computations are folded into the M-step.

M-step  We update the parameters, L_i, T_i and V_i, of π_{i,l} to obtain π'_{i,l} based on the expectation in the E-step. Specifically, take the log of the likelihood P r(r^T = 1, z^{0:T}_i, T; π_{i,l}) with π_{i,l} substituted with π'_{i,l} and focus on terms involving the parameters of π'_{i,l}:

log P r(r^T = 1, z^{0:T}_i, T; π'_{i,l}) = [terms independent of π'_{i,l}] + Σ_{t=0}^T log L'_i(n^t_{i,l}, a^t_i) + Σ_{t=0}^{T−1} log T'_i(n^t_{i,l}, a^t_i, o^{t+1}_i, n^{t+1}_{i,l}) + log V'_i(n^0_{i,l})

In order to update L_i, we partially differentiate the Q-function of Eq.
5 with respect to L'_i. To facilitate differentiation, we focus on the terms involving L'_i, as shown below:

Q(π'_{i,l} | π_{i,l}) = [terms indep. of L'_i] + Σ_{T=0}^∞ Σ_{t=0}^T Σ_{z^{0:t}_i} P r(r^T_i = 1, z^{0:t}_i | T; π_{i,l}) log L'_i(n^t_{i,l}, a^t_i)

On maximizing the above equation, L'_i is:

L'_i(n^t_{i,l}, a^t_i) ∝ L_i(n^t_{i,l}, a^t_i) Σ_{T=0}^∞ (γ^T / (1 − γ)) Σ_{s^T, a^T_{−i}} Π_{−i} P r(a^T_{−i} | s^T) P r(r^T_i = 1 | s^T, a^T_i, a^T_{−i}) α^T(s^T, n^T_{i,l})

Node transition probabilities T_i and node distribution V_i for π'_{i,l} are updated analogously to L_i. Because a FSC is inferred at level 1, at strategy levels l = 2 and greater, lower-level candidate models are FSCs. EM at these higher levels proceeds by replacing the state of the DBN, (s^t, n^t_{i,l}), with (s^t, n^t_{i,l}, n^t_{j,l−1}, . . . , n^t_{k,l−1}).

3.3 Block-Coordinate Descent for Non-Asymptotic Speed Up

Block-coordinate descent (BCD) [4, 5, 10] is an iterative scheme to gain a faster non-asymptotic rate of convergence in the context of large-scale N-dimensional optimization problems. In this scheme, within each iteration, a set of variables referred to as coordinates is chosen and the objective function, Q, is optimized with respect to one of the coordinate blocks while the other coordinates are held fixed. BCD may speed up the non-asymptotic rate of convergence of EM for both I-POMDPs and POMDPs. The specific challenge here is to determine which of the many variables should be grouped into blocks and how.

We empirically show in Section 5 that grouping the number of time slices, t, and horizon, h, in Eqs. 7 and 8, respectively, at each level, into coordinate blocks of equal size is beneficial.
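A minimal sketch of such an equal-sized block schedule (illustrative Python; the helper names are ours) partitions the horizons into C blocks and cycles through them, one block per EM iteration:

```python
# Illustrative sketch: partition horizons {1..T_max} into C equal-sized
# coordinate blocks and cycle through them across EM iterations (c + qC).
def make_blocks(t_max, num_blocks):
    horizons = list(range(1, t_max + 1))
    size = len(horizons) // num_blocks
    blocks = [horizons[i * size:(i + 1) * size] for i in range(num_blocks - 1)]
    blocks.append(horizons[(num_blocks - 1) * size:])  # remainder in last block
    return blocks

def block_for_iteration(blocks, iteration):
    # Cyclic schedule: block c is revisited at iterations c + qC, q = 0, 1, ...
    return blocks[iteration % len(blocks)]

blocks = make_blocks(9, 3)
chosen = block_for_iteration(blocks, 4)
```

In each M-step, only the α^t and β^h messages for horizons in the chosen block would be recomputed, with the previous iteration's values reused for the complementary blocks.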
In other\nwords, we decompose the mixture model into blocks containing equal numbers of BNs. Alternately,\ngrouping controller nodes is ineffective because distribution Vi cannot be optimized for subsets of\nnodes. Formally, let \u03a8t\n1 be a subset of {T = 1, T = 2, . . . , T = Tmax}. Then, the set of blocks is,\n3, . . .}. In practice, because both t and h are \ufb01nite (say, Tmax), the cardinality of\nBt = {\u03a8t\nBt is bounded by some C \u2265 1. Analogously, we de\ufb01ne the set of blocks of h, denoted by Bh.\nIn the M-step now, we compute \u03b1t for the time steps in a single coordinate block \u03a8t\nc only, while\nusing the values of \u03b1t from the previous iteration for the complementary coordinate blocks, \u02dc\u03a8t\nc.\nAnalogously, we compute \u03b2h for the horizons in \u03a8h\nc only, while using \u03b2 values from the previous\niteration for the remaining horizons. We cyclically choose a block, \u03a8t\nc, at iterations c + qC where\nq \u2208 {0, 1, 2, . . .}.\n\n2, \u03a8t\n\n1, \u03a8t\n\n5\n\n\f3.4 Forward Filtering - Backward Sampling\n\ni\n\ni,l\n\n, oT\n\ni , nT\n\n, aT\u22121\n\ni,l\ufffd, and so on from T to 0. Here, P r(rT\n\nAn approach for exploiting embedded structure in the transition and observation functions is to\nreplace the exact forward-backward message computations with exact forward \ufb01ltering and back-\nward sampling (FFBS) [11] to obtain a sampled reverse trajectory consisting of \ufffdsT , nT\ni \ufffd,\ni,l, aT\n\ufffdnT\u22121\n\u2212i) is the likelihood\nweight of this trajectory sample. Parameters of the updated FSC, \u03c0\ufffdi,l, are obtained by summing and\nnormalizing the weights.\nEach trajectory is obtained by \ufb01rst sampling \u02c6T \u223c P r(T ), which becomes the length of i\u2019s DBN for\ni,l), t = 0 . . . \u02c6T is computed exactly (Eq. 7) followed by the\nthis sample. Forward message, \u03b1t(st, nt\ni,l), h = 0 . . . \u02c6T and t = \u02c6T \u2212 h. Computing \u03b2h differs from Eq. 
8 by utilizing the forward message:

β^h(s^t, n^t_{i,l} | s^{t+1}, n^{t+1}_{i,l}) = Σ_{a^t_i, a^t_{−i}, o^{t+1}_i} α^t(s^t, n^t_{i,l}) L_i(n^t_{i,l}, a^t_i) Π_{−i} P r(a^t_{−i} | s^t) T_i(n^t_{i,l}, a^t_i, o^{t+1}_i, n^{t+1}_{i,l}) O_i(s^{t+1}, a^t_i, a^t_{−i}, o^{t+1}_i) T_i(s^t, a^t_i, a^t_{−i}, s^{t+1})   (9)

where β^0(s^T, n^T_{i,l}, r^T_i) = α^T(s^T, n^T_{i,l}) Σ_{a^T_i, a^T_{−i}} L_i(n^T_{i,l}, a^T_i) Π_{−i} P r(a^T_{−i} | s^T) P r(r^T_i | s^T, a^T_i, a^T_{−i}).

Subsequently, we may easily sample ⟨s^T, n^T_{i,l}, r^T_i⟩ followed by sampling s^{T−1}, n^{T−1}_{i,l} from Eq. 9. We sample a^{T−1}_i, o^T_i ∼ P r(a^t_i, o^{t+1}_i | s^t, n^t_{i,l}, s^{t+1}, n^{t+1}_{i,l}), where:

P r(a^t_i, o^{t+1}_i | s^t, n^t_{i,l}, s^{t+1}, n^{t+1}_{i,l}) ∝ Σ_{a^t_{−i}} Π_{−i} P r(a^t_{−i} | s^t) L_i(n^t_{i,l}, a^t_i) T_i(n^t_{i,l}, a^t_i, o^{t+1}_i, n^{t+1}_{i,l}) O_i(s^{t+1}, a^t_i, a^t_{−i}, o^{t+1}_i) T_i(s^t, a^t_i, a^t_{−i}, s^{t+1})

4 Computational Complexity

Our EM at level 1 is significantly quicker compared to ascribing FSCs to other agents. In the latter, nodes of others' controllers must be included alongside s and n_{i,l}.

Proposition 3 (E-step speed up). Each E-step at level 1 using the forward-backward pass as shown previously results in a net speed up of O((|M||N_{−i,0}|)^{2K} |Ω_{−i}|^K) over the formulation that ascribes |M| FSCs each to K other agents with each having |N_{−i,0}| nodes.

Analogously, updating the parameters L_i and T_i in the M-step exhibits a speedup of O((|M||N_{−i,0}|)^{2K} |Ω_{−i}|^K), while V_i leads to O((|M||N_{−i,0}|)^K). This improvement is exponential in the number of other agents.
On the other hand, the E-step at level 0 exhibits complexity that is typically greater compared to the total complexity of the E-steps for |M| FSCs.

Proposition 4 (E-step ratio at level 0). E-steps when |M| FSCs are inferred for K agents exhibit a ratio of complexity, O(|N_{−i,0}|^2 / |M|), compared to the E-step for obtaining φ_{−i,0}.

The ratio in Prop. 4 is < 1 when smaller-sized controllers are sought and there are several models.

5 Experiments

Five variants of EM are evaluated as appropriate: the exact EM inference-based planning (labeled as I-EM); replacing the exact M-step with its greedy variant analogously to the greedy maximization in EM for POMDPs [12] (I-EM-Greedy); iterating EM based on coordinate blocks (I-EM-BCD) and coupled with a greedy M-step (I-EM-BCD-Greedy); and lastly, using forward filtering-backward sampling (I-EM-FFBS).

We use 4 problem domains: the noncooperative multiagent tiger problem [13] (|S| = 2, |A_i| = |A_j| = 3, |O_i| = |O_j| = 6 for level l ≥ 1, |O_j| = 3 at level 0, and γ = 0.9) with a total of 5 agents and 50 models for each other agent. A larger noncooperative 2-agent money laundering (ML) problem [14] forms the second domain.
It exhibits 99 physical states for the subject agent (blue team), 9 actions for blue and 4 for the red team, 11 observations for the subject and 4 for the other, with about 100 models

Figure 2: FSCs improve with time for I-POMDP_{i,1} in the (I-a) 5-agent tiger, (I-b) 2-agent money laundering, (I-c) 3-agent UAV, and (I-d) 5-agent policing contexts. Observe that BCD causes substantially larger improvements in the initial iterations until we are close to convergence.
I-EM-BCD or its greedy variant converges significantly quicker than I-BPI to similar-valued FSCs for all four problem domains as shown in (II-a, b, c and d), respectively. All experiments were run on Linux with Intel Xeon 2.6GHz CPUs and 32GB RAM.

for the red team. We also evaluate a 3-agent UAV reconnaissance problem involving a UAV tasked with intercepting two fugitives in a 3x3 grid before they both reach the safe house [8]. It has 162 states for the UAV, 5 actions, 4 observations for each agent, and 200,400 models for the two fugitives. Finally, the recent policing protest problem is used in which police must maintain order in 3 designated protest sites populated by 4 groups of protesters who may be peaceful or disruptive [15]. It exhibits 27 states, 9 policing and 4 protesting actions, 8 observations, and 600 models per protesting group. The latter two domains are historically the largest test problems for self-interested planning.

Comparative performance of all methods  In Fig. 2-I(a-d), we compare the variants on all problems. Each method starts with a random seed, and the converged value is significantly better than a random FSC for all methods and problems. Increasing the sizes of FSCs gives better values in general but also increases time; using FSCs of sizes 5, 3, 9 and 5 for the 4 domains respectively demonstrated a good balance. We explored various coordinate block configurations, eventually settling on 3 equal-sized blocks for both the tiger and ML, 5 blocks for UAV and 2 for policing protest.

I-EM and the Greedy and BCD variants clearly exhibit an anytime property on the tiger, UAV and policing problems. The noncooperative ML shows delayed increases because we show the value of agent i's controller, and initial improvements in the other agent's controller may maintain or decrease the value of i's controller. This is not surprising due to competition in the problem.
Nevertheless, after a small delay the values improve steadily, which is desirable.

I-EM-BCD consistently improves on I-EM and is often the fastest: the corresponding value improves by large steps initially (fast non-asymptotic rate of convergence). In the context of ML and UAV, I-EM-BCD-Greedy shows substantive improvements leading to controllers with much improved values compared to other approaches. Despite a low sample size of about 1,000 for the problems, I-EM-FFBS obtains FSCs whose values improve in general for tiger and ML, though slowly and not always to the level of others. This is because the EM gets caught in worse local optima due to the sampling approximation – this strongly impacts the UAV problem; more samples did not escape these optima. However, forward filtering only (as used in Wu et al. [6]) required a much larger sample size to reach these levels. FFBS did not improve the controller in the fourth domain.

Characterization of local optima  While an exact solution for the smaller tiger problem with 5 agents (or the larger problems) could not be obtained for comparison, I-EM climbs to the optimal value of 8.51 for the downscaled 2-agent version (not shown in Fig. 2). In comparison, BPI does not get past the local optimum of -10 using an identical-sized controller – the corresponding controller predominantly contains listening actions – relying on adding nodes to eventually reach the optimum. While we are unaware of any general technique to escape local convergence in EM, I-EM can reach the global optimum given an appropriate seed. This may not be a coincidence: the I-POMDP value function space exhibits a single fixed point – the global optimum – which in the context of Proposition 1 makes the likelihood function, Q(π'_{i,l} | π_{i,l}), unimodal (if π_{i,l} is appropriately sized, as we do not have a principled way of adding nodes).
If Q(π′i,l|πi,l) is continuously differentiable for the domain at hand, Corollary 1 in Wu [16] indicates that πi,l will converge to the unique maximizer.
Improvement on I-BPI We compare the quickest of the I-EM variants with the previous best algorithm, I-BPI (Figs. 2-II(a-d)), allowing the latter to escape local optima as well by adding nodes. Observe that FSCs improved using I-EM-BCD converge to values similar to those of I-BPI almost two orders of magnitude faster. Beginning with 5 nodes, I-BPI adds 4 more nodes to obtain the same level of value as EM for the tiger problem. For money laundering, I-EM-BCD-Greedy converges to controllers whose value is at least 1.5 times better than I-BPI's given the same amount of allocated time and fewer nodes. I-BPI failed to improve the seed controller and could not escape its local optimum for the UAV and policing protest problems. To summarize, this makes the I-EM variants, with emphasis on BCD, currently the fastest iterative approaches for infinite-horizon I-POMDPs.

6 Concluding Remarks
The EM formulation of Section 3 builds on the EM for POMDPs and differs drastically from the E- and M-steps for the cooperative DEC-POMDP [9]. The differences reflect how I-POMDPs build on POMDPs and differ from DEC-POMDPs. These begin with the structure of the DBNs: the DBN for I-POMDPi,1 in Fig. 1 adds to the DBN for a POMDP hexagonal model nodes that contain candidate models, chance nodes for actions, and model update edges for each other agent at each time step. This differs from the DBN for a DEC-POMDP, which adds controller nodes for all agents and a joint observation chance node. The I-POMDP DBN contains controller nodes for the subject agent only, and each model node collapses into an efficient distribution on running EM at level 0.
In domains where the joint reward function may be decomposed into factors encompassing subsets of agents, ND-POMDPs allow the value function to be factorized as well.
Kumar et al. [17] exploit this structure by simply decomposing the whole DBN mixture into a mixture for each factor and iterating over the factors. Interestingly, the M-step may then be performed individually for each agent, and this approach scales beyond two agents. We exploit both graphical and problem structure to speed up and scale in a way that is contextual to I-POMDPs. BCD also decomposes the DBN mixture, into equal blocks of horizons. While it has been applied in other areas [18, 19], these applications do not transfer to planning. Additionally, problem structure is considered by using FFBS, which exploits information in the transition and observation distributions of the subject agent. FFBS could be viewed as a tenuous example of Monte Carlo EM, which is a broad category that also includes the forward sampling utilized by Wu et al. [6] for DEC-POMDPs. However, fundamental differences exist between the two: forward sampling may be run in simulation and does not require the transition and observation functions; indeed, Wu et al. utilize it in a model-free setting. FFBS is model based, utilizing exact forward messages in the backward sampling phase. This reduces the accumulation of sampling errors over many time steps in extended DBNs, which otherwise afflicts forward sampling.
The advance in this paper for self-interested multiagent planning has wider relevance to areas such as game play and ad hoc teams, where agents model other agents. Developments in online EM for hidden Markov models [20] provide an interesting avenue for utilizing inference for online planning.

Acknowledgments

This research is supported in part by an NSF CAREER grant, IIS-0845036, and a grant from ONR, N000141310870. We thank Akshat Kumar for feedback that led to improvements in the paper.

References
[1] Ekhlas Sonu and Prashant Doshi. Scalable solutions of interactive POMDPs using generalized and bounded policy iteration.
Journal of Autonomous Agents and Multi-Agent Systems, DOI: 10.1007/s10458-014-9261-5, in press, 2014.

[2] Hagai Attias. Planning by probabilistic inference. In Ninth International Workshop on AI and Statistics (AISTATS), 2003.

[3] Marc Toussaint and Amos J. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In International Conference on Machine Learning (ICML), pages 945–952, 2006.

[4] Jeffrey A. Fessler and Alfred O. Hero. Space-alternating generalized expectation-maximization algorithm. IEEE Transactions on Signal Processing, 42:2664–2677, 1994.

[5] P. Tseng. Convergence of block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109:475–494, 2001.

[6] Feng Wu, Shlomo Zilberstein, and Nicholas R. Jennings. Monte-Carlo expectation maximization for decentralized POMDPs. In Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI), pages 397–403, 2013.

[7] Piotr J. Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

[8] Yifeng Zeng and Prashant Doshi. Exploiting model equivalences for solving interactive dynamic influence diagrams. Journal of Artificial Intelligence Research, 43:211–255, 2012.

[9] Akshat Kumar and Shlomo Zilberstein. Anytime planning for decentralized POMDPs using expectation maximization. In Conference on Uncertainty in AI (UAI), pages 294–301, 2010.

[10] Ankan Saha and Ambuj Tewari. On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576–601, 2013.

[11] C. K. Carter and R. Kohn. Markov chain Monte Carlo in conditionally Gaussian state space models.
Biometrika, 83:589–601, 1996.

[12] Marc Toussaint, Laurent Charlin, and Pascal Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), pages 562–570, 2008.

[13] Prashant Doshi and Piotr J. Gmytrasiewicz. Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research, 34:297–337, 2009.

[14] Brenda Ng, Carol Meyers, Kofi Boakye, and John Nitao. Towards applying interactive POMDPs to real-world adversary modeling. In Innovative Applications in Artificial Intelligence (IAAI), pages 1814–1820, 2010.

[15] Ekhlas Sonu, Yingke Chen, and Prashant Doshi. Individual planning in agent populations: Anonymity and frame-action hypergraphs. In International Conference on Automated Planning and Scheduling (ICAPS), pages 202–211, 2015.

[16] C. F. Jeff Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11(1):95–103, 1983.

[17] Akshat Kumar, Shlomo Zilberstein, and Marc Toussaint. Scalable multiagent planning using probabilistic inference. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2140–2146, 2011.

[18] S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972.

[19] Jeffrey A. Fessler and Donghwan Kim. Axial block coordinate descent (ABCD) algorithm for X-ray CT image reconstruction. In International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine, volume 11, pages 262–265, 2011.

[20] Olivier Cappé and Eric Moulines. Online expectation-maximization algorithm for latent data models.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.