{"title": "Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 2636, "page_last": 2644, "abstract": "Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinite-horizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.", "full_text": "Periodic Finite State Controllers for Ef\ufb01cient\n\nPOMDP and DEC-POMDP Planning\n\nJoni Pajarinen\n\nJaakko Peltonen\n\nAalto University, Department of\n\nInformation and Computer Science,\n\nAalto University, Department of Information\nand Computer Science, Helsinki Institute for\n\nP.O. Box 15400, FI-00076 Aalto, Finland\n\nInformation Technology HIIT,\n\nJoni.Pajarinen@aalto.fi\n\nP.O. Box 15400, FI-00076 Aalto, Finland\n\nJaakko.Peltonen@aalto.fi\n\nAbstract\n\nApplications such as robot control and wireless communication require planning\nunder uncertainty. Partially observable Markov decision processes (POMDPs)\nplan policies for single agents under uncertainty and their decentralized versions\n(DEC-POMDPs) \ufb01nd a policy for multiple agents. 
The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinite-horizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.

1 Introduction

Many machine learning applications involve planning under uncertainty. Such planning is necessary in medical diagnosis, control of robots and other agents, and in dynamic spectrum access for wireless communication systems. The planning task can often be represented as a reinforcement learning problem, where an action policy controls the behavior of an agent, and the quality of the policy is optimized to maximize a reward function. Single-agent policies can be optimized with partially observable Markov decision processes (POMDPs) [1] when the world state is uncertain. Decentralized POMDPs (DEC-POMDPs) [2] optimize policies for multiple agents that act without direct communication, with separate observations and beliefs of the world state, to maximize a joint reward function. POMDP and DEC-POMDP methods use various representations for the policies, such as value functions [3], graphs [4, 5], or finite state controllers (FSCs) [6, 7, 8, 9, 10].

We present a novel efficient method for POMDP and DEC-POMDP planning. We focus on infinite-horizon problems, where policies must operate forever.
We introduce a new policy representation: periodic finite state controllers, which can be seen as an intelligent restriction that speeds up optimization and can yield better solutions. A periodic FSC is composed of several layers (subsets of states), and transitions are only allowed to states in the next layer, and from the final layer to the first. Policies proceed through the layers in a periodic fashion, and policy optimization determines the probabilities of state transitions and action choices to maximize reward. Our work has four main contributions. Firstly, we introduce an improved optimization method for standard finite-horizon problems with FSC policies by compression. Secondly, we give a method to transform the finite-horizon FSC into an initial infinite-horizon periodic FSC. Thirdly, we introduce compression to the periodic FSC. Fourthly, we introduce an expectation-maximization (EM) training algorithm for planning with periodic FSCs. We show that the resulting method performs better than earlier DEC-POMDP methods and POMDP methods with a restricted-size policy, and that use of periodic FSCs enables computing larger solutions than with regular FSCs. Online execution has complexity O(const) for deterministic FSCs and O(log(FSC layer width)) for stochastic FSCs.

We discuss existing POMDP and DEC-POMDP solution methods in Section 2 and formally define the infinite-horizon (DEC-)POMDP. In Section 3 we introduce the novel concept of periodic FSCs. We then describe the stages of our method: improving finite-horizon solutions, transforming them to periodic infinite-horizon solutions, and improving the periodic solutions by a novel EM algorithm for (DEC-)POMDPs (Section 3.2).
In Section 4 we show the improved performance of the new method on several planning problems, and we conclude in Section 5.

2 Background

Partially observable Markov decision processes (POMDPs) and decentralized POMDPs (DEC-POMDPs) are model families for decision making under uncertainty. POMDPs optimize policies for a single agent with uncertainty about the environment state, while DEC-POMDPs optimize policies for several agents with uncertainty about the environment state and each other's states. Given the actions of the agents, the environment evolves according to a Markov model. The agents' policies are optimized to maximize the expected reward earned for actions into the future. In infinite-horizon planning the expected reward is typically discounted to emphasize current and near-future actions. Computationally, POMDPs and DEC-POMDPs are complex: even for finite-horizon problems, finding solutions is in the worst case PSPACE-complete for POMDPs and NEXP-complete for DEC-POMDPs [11].

For infinite-horizon DEC-POMDP problems, state-of-the-art methods [8, 12] store the policy as a stochastic finite state controller (FSC) for each agent, which keeps the policy size bounded. The FSC parameters can be optimized by expectation maximization (EM) [12]. An advantage of EM is that it can be adapted, for example, to continuous probability distributions [7] or to take advantage of factored problems [10]. Alternatives to EM include formulating FSC optimization as a non-linear programming (NLP) problem solvable by an NLP solver [8], or iteratively improving each FSC by linear programming with the other FSCs fixed [13]. Deterministic FSCs of a fixed size can also be found by best-first search [14]. If a DEC-POMDP problem has a specific goal state, then a goal-directed [15] approach can achieve good results.
The NLP and EM methods have yielded the best results for infinite-horizon DEC-POMDP problems. In a recent variant called Mealy NLP [16], the NLP-based approach to DEC-POMDPs is adapted to FSC policies represented by Mealy machines instead of traditional Moore machine representations. In POMDPs, Mealy machine based controllers can achieve equal or better solutions than Moore controllers of the same size.

This paper recognizes the need to improve general POMDP and DEC-POMDP solutions. We introduce an approach where FSCs have a periodic layer structure, which turns out to yield good results.

2.1 Infinite-horizon DEC-POMDP: definition

The tuple $\langle \{\alpha_i\}, S, \{A_i\}, P, \{\Omega_i\}, O, R, b_0, \gamma \rangle$ defines an infinite-horizon DEC-POMDP problem for $N$ agents $\alpha_i$, where $S$ is the set of environment states and $A_i$ and $\Omega_i$ are the sets of possible actions and observations for agent $\alpha_i$. A POMDP is the special case when there is only one agent. $P(s'|s, \vec{a})$ is the probability to move from state $s$ to $s'$, given the actions of all agents (jointly denoted $\vec{a} = \langle a_1, \ldots, a_N \rangle$). The observation function $O(\vec{o}|s', \vec{a})$ is the probability that the agents observe $\vec{o} = \langle o_1, \ldots, o_N \rangle$, where $o_i$ is the observation of agent $i$, when actions $\vec{a}$ were taken and the environment transitioned to state $s'$. The initial state distribution is $b_0(s)$. $R(s, \vec{a})$ is the real-valued reward for executing actions $\vec{a}$ in state $s$. For brevity, we denote transition probabilities given the actions by $P_{s's\vec{a}}$, observation probabilities by $P_{\vec{o}s'\vec{a}}$, reward functions by $R_{s\vec{a}}$, and the set of all agents other than $i$ by $\bar{i}$. At each time step, agents perform actions, the environment state changes, and agents receive observations.
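To make the tuple concrete, here is a minimal single-agent ($N = 1$, i.e. POMDP) instance encoded as arrays; all sizes and numbers are illustrative, not from the paper's benchmarks:

```python
import numpy as np

# A toy POMDP instance of the tuple <S, A, P, Omega, O, R, b0, gamma>.
S, A, O_ = 2, 2, 2                       # |S|, |A|, |Omega|

P = np.zeros((S, S, A))                  # P[s', s, a] = P(s'|s, a)
P[:, :, 0] = [[0.9, 0.2], [0.1, 0.8]]    # transitions under action 0
P[:, :, 1] = [[0.5, 0.5], [0.5, 0.5]]    # transitions under action 1

Obs = np.zeros((O_, S, A))               # Obs[o, s', a] = O(o|s', a)
Obs[:, :, 0] = [[0.8, 0.3], [0.2, 0.7]]
Obs[:, :, 1] = [[0.6, 0.4], [0.4, 0.6]]

R = np.array([[1.0, 0.0],                # R[s, a]: reward for action a in s
              [0.0, 2.0]])
b0 = np.array([0.5, 0.5])                # initial state distribution b0(s)
gamma = 0.9                              # discount factor

# Sanity checks: every conditional column is a probability distribution.
assert np.allclose(P.sum(axis=0), 1.0)
assert np.allclose(Obs.sum(axis=0), 1.0)
```

A DEC-POMDP instance would simply index `P`, `Obs`, and `R` by joint actions and joint observations instead of single ones.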
The goal is to find a joint policy $\pi$ for the agents that maximizes the expected discounted infinite-horizon reward $E\left[\sum_{t=0}^{\infty} \gamma^t R_{s(t)\vec{a}(t)} \mid \pi\right]$, where $\gamma$ is the discount factor, $s(t)$ and $\vec{a}(t)$ are the state and action at time $t$, and $E[\cdot|\pi]$ denotes expected value under policy $\pi$. Here, the policy is stored as a set of stochastic finite state controllers (FSCs), one for each agent. The FSC of agent $i$ is defined by the tuple $\langle Q_i, \nu_{q_i}, \pi_{a_i q_i}, \lambda_{q'_i q_i o_i} \rangle$, where $Q_i$ is the set of FSC nodes $q_i$, $\nu_{q_i}$ is the initial distribution $P(q_i)$ over nodes, $\pi_{a_i q_i}$ is the probability $P(a_i|q_i)$ to perform action $a_i$ in node $q_i$, and $\lambda_{q'_i q_i o_i}$ is the probability $P(q'_i|q_i, o_i)$ to transition from node $q_i$ to node $q'_i$ when observing $o_i$.

Figure 1: Left: influence diagram for a DEC-POMDP with finite state controllers $\vec{q}$, states $s$, joint observations $\vec{o}$, joint actions $\vec{a}$ and reward $r$ (given by a reward function $R(s, \vec{a})$). A dotted line separates two time steps. Right: an example of the new periodic finite state controller, with three layers and three nodes in each layer, and possible transitions shown as arrows. The controller controls one of the agents. Which layer is active depends only on the current time; which node is active, and which action is chosen, depend on transition probabilities and action probabilities of the controller.

The current FSC nodes of all agents are denoted $\vec{q} = \langle q_1, \ldots, q_N \rangle$. The policies are optimized by optimizing the parameters $\nu_{q_i}$, $\pi_{a_i q_i}$, and $\lambda_{q'_i q_i o_i}$.
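Given fixed FSC parameters, the value of the policy can be evaluated exactly on the product Markov chain over $(s, \vec{q})$. The sketch below does this for a toy single-agent controller with random parameters; the model sizes, numbers, and variable names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_dist(*shape):
    """Random conditional distributions, normalized over the first axis."""
    x = rng.random(shape)
    return x / x.sum(axis=0)

# Toy single-agent model and a random stochastic FSC (illustrative sizes).
S, A, O_, Q = 2, 2, 2, 3
P     = rand_dist(S, S, A)      # P[s',s,a]   = P(s'|s,a)
Obs   = rand_dist(O_, S, A)     # Obs[o,s',a] = O(o|s',a)
R     = rng.random((S, A))      # R[s,a]
b0    = np.array([0.5, 0.5])
gamma = 0.9
pi    = rand_dist(A, Q)         # pi[a,q]     = P(a|q)
lam   = rand_dist(Q, Q, O_)     # lam[q',q,o] = P(q'|q,o)
nu    = rand_dist(Q)            # nu[q]       = initial node distribution

# Policy evaluation: V(s,q) is the fixed point of
# V(s,q) = sum_a P(a|q) [ R(s,a)
#          + gamma sum_{s',o,q'} P(s'|s,a) O(o|s',a) P(q'|q,o) V(s',q') ].
V = np.zeros((S, Q))
for _ in range(500):
    W = np.einsum('tsa,ota,ruo,tr->sau', P, Obs, lam, V)   # future term
    V = np.einsum('au,sau->su', pi, R[:, :, None] + gamma * W)

value = b0 @ V @ nu   # expected discounted reward from b0 and nu
print(float(value))
```

The optimization methods in the paper adjust $\nu$, $\pi$, and $\lambda$ to increase exactly this quantity.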
Figure 1 (left) illustrates the setup.

3 Periodic finite state controllers

State-of-the-art algorithms [6, 13, 8, 12, 16] for optimizing POMDP/DEC-POMDP policies with restricted-size FSCs find a local optimum. A well-chosen FSC initialization could yield better solutions, but initializing (compact) FSCs is not straightforward: one reason is that dynamic programming is difficult to apply to generic FSCs. In [17] FSCs for POMDPs are built using dynamic programming to add new nodes, but this yields large FSCs and cannot be applied to DEC-POMDPs, as it needs a piecewise linear convex value function. Also, general FSCs are irreducible, so a probability distribution over FSC nodes is not sparse over time even if an FSC starts from a single node. This makes computations with large FSCs difficult, and FSC-based methods are limited by FSC size. We introduce periodic FSCs, which allow the use of much larger controllers with a small complexity increase, efficient FSC initialization, and new dynamic programming algorithms for FSCs.

A periodic FSC is composed of $M$ layers of controller nodes. Nodes in each layer are connected only to nodes in the next layer: the first layer is connected to the second, the second layer to the third and so on, and the last layer is connected to the first. The width of a periodic FSC is the number of controller nodes in a layer. Without loss of generality we assume all layers have the same number of nodes. A single-layer periodic FSC equals an ordinary FSC. A periodic FSC has different action and transition probabilities for each layer: $\pi^{(m)}_{a_i q_i}$ is the layer-$m$ probability to take action $a_i$ when in node $q_i$, and $\lambda^{(m)}_{q'_i q_i o_i}$ is the layer-$m$ probability to move from node $q_i$ to $q'_i$ when observing $o_i$. Each layer connects only to the next one, so the policy cycles periodically through each layer: for $t \geq M$ we have $\pi^{(t)}_{a_i q_i} = \pi^{(t \bmod M)}_{a_i q_i}$ and $\lambda^{(t)}_{q'_i q_i o_i} = \lambda^{(t \bmod M)}_{q'_i q_i o_i}$, where 'mod' denotes the remainder. Figure 1 (right)
Figure 1 (right)\nq\u2032\niqioi\nshows an example periodic FSC.\n\nis the layer m probability to move from node qi to q\u2032\n\n= \u03bb(t mod M)\n\nq\u2032\niqioi\n\nq\u2032\niqioi\n\naiqi = \u03c0(t mod M)\n\naiqi\n\nWe now introduce our method for solving (DEC-)POMDPs with periodic FSC policies. We show\nthat the periodic FSC structure allows ef\ufb01cient computation of deterministic controllers, show how\nto optimize periodic stochastic FSCs, and show how a periodic deterministic controller can be used\nas initialization to a stochastic controller. The algorithms are discussed in the context of DEC-\nPOMDPs, but can be directly applied to POMDPs.\n\n3.1 Deterministic periodic \ufb01nite state controllers\n\nIn a deterministic FSC, actions and node transitions are deterministic functions of the current node\nand observation. To optimize deterministic periodic FSCs we \ufb01rst compute a non-periodic \ufb01nite-\nhorizon policy. The \ufb01nite-horizon policy is transformed into a periodic in\ufb01nite-horizon policy by\nconnecting the last layer to the \ufb01rst layer and the resulting deterministic policy can then be im-\n\n3\n\n\fproved with a new algorithm (see Section 3.1.2). A periodic deterministic policy can also be used as\ninitialization for a stochastic FSC optimizer based on expectation maximization (see Section 3.2).\n\n3.1.1 Deterministic \ufb01nite-horizon controllers\n\nWe brie\ufb02y discuss existing methods for deterministic \ufb01nite-horizon controllers and introduce an\nimproved \ufb01nite-horizon method, which we use as the initial solution for in\ufb01nite-horizon controllers.\n\nState-of-the-art point based \ufb01nite-horizon DEC-POMDP methods [4, 5] optimize a policy graph,\nwith restricted width, for each agent. They compute a policy for a single belief, instead of all possible\nbeliefs. Beliefs over world states are sampled centrally using various action heuristics. 
Policy graphs are built by dynamic programming from horizon $T$ to the first time step. At each time step a policy is computed for each policy graph node, by assuming that the nodes all agents occupy are associated with the same belief. In a POMDP, computing the deterministic policy for a policy graph node means finding the best action, and the best connection (best next node) for each observation; this can be done with a direct search. In a DEC-POMDP this approach would have to go through all combinations of actions, observations and next nodes of all agents: the number of combinations grows exponentially with the number of agents, so direct search works only for simple problems. A more efficient way is to go through all action combinations, for each action combination sample random policies for all agents, and then improve the policy of each agent in turn while holding the other agents' policies fixed. This is not guaranteed to find the best policy for a belief, but has yielded good results in the Point-Based Policy Generation (PBPG) algorithm [5].

We introduce a new algorithm which improves on [5]. PBPG used linear programming to find policies for each agent and action combination, but with a fixed joint action and fixed policies of the other agents we can use a fast and simple direct search as follows. Initialize the value function $V(s, \vec{q})$ to zero. Construct an initial policy graph for each agent, starting from horizon $t = T$: (1) Project the initial belief along a random trajectory to horizon $t$ to yield a sampled belief $b(s)$ over world states. (2) Add, to the graph of each agent, a node to layer $t$. Find the best connections to the next layer as follows.
Sample random connections for each agent, then for each agent in turn optimize its connections with the connections of the other agents fixed: for each action combination $\vec{a}$ and observation, connect to the next-layer node that maximizes value, computed using $b(s)$ and the next-layer value function; repeat this until convergence, using random restarts to escape local optima. The best connections and action combination $\vec{a}$ become the policy for the current policy graph node. (3) Run (1)-(2) until the graph layer has enough nodes. (4) Decrease $t$ and run (1)-(3), until $t = 0$.

We use the above-described algorithm for initialization, and then use a new policy improvement approach, shown in Algorithm 1, that improves the policy value monotonically: (1) Here we do not use a random trajectory for belief projection; instead we project the belief $b_t(s, \vec{q})$ over world states $s$ and controller nodes $\vec{q}$ (agents are initially assumed to start from the first controller node) from time step $t = 0$ to horizon $T$, through the current policy graph; this yields distributions for the FSC nodes that match the current policy. (2) We start from the last layer and proceed towards the first. At each layer, we optimize each agent separately: for each graph node $q_i$ of agent $i$, for each action $a_i$ of the agent, and for each observation $o_i$ we optimize the (deterministic) connection to the next layer. (3) If the optimized policy at the node (action and connections) is identical to the policy $\pi$ of another node in the layer, we sample a new belief over world states and re-optimize the node for the new belief; if no new policy is found even after trying several sampled beliefs, we try several uniformly random beliefs for finding policies.
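In the single-agent (POMDP) special case mentioned earlier, the direct search for one policy-graph node is exact and compact: for each action, pick the value-maximizing next-layer node per observation, then keep the best action. A sketch with illustrative shapes and numbers (not the paper's multi-agent alternation):

```python
import numpy as np

def best_node_policy(b, P, Obs, R, V_next, gamma):
    """Direct search for a deterministic policy-graph node (POMDP case).

    Shapes: P[s',s,a], Obs[o,s',a], R[s,a], V_next[s',q'].
    Returns the best action, the best next-layer node per observation,
    and the resulting value under belief b.
    """
    best_a, best_conn, best_val = None, None, -np.inf
    for a in range(R.shape[1]):
        pred = P[:, :, a] @ b          # P(s'|b,a)
        w = Obs[:, :, a] * pred        # w[o,s'] = P(o,s'|b,a)
        q_vals = w @ V_next            # q_vals[o,q']: value of connecting o->q'
        val = b @ R[:, a] + gamma * q_vals.max(axis=1).sum()
        if val > best_val:
            best_a, best_conn, best_val = a, q_vals.argmax(axis=1), val
    return best_a, best_conn, best_val

# Tiny illustrative instance: 2 states/actions/observations, 3 next nodes.
rng = np.random.default_rng(3)
P = np.zeros((2, 2, 2)); P[:, :, 0] = [[0.9, 0.2], [0.1, 0.8]]; P[:, :, 1] = 0.5
Obs = np.zeros((2, 2, 2)); Obs[:, :, 0] = [[0.8, 0.3], [0.2, 0.7]]; Obs[:, :, 1] = 0.5
R = np.array([[1.0, 0.0], [0.0, 2.0]])
V_next = rng.random((2, 3))
a, conn, val = best_node_policy(np.array([0.6, 0.4]), P, Obs, R, V_next, 0.9)
```

In the DEC-POMDP case this inner search is run per agent inside the alternating optimization described above.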
We also redirect any connections from the previous policy graph layer to the current node to go instead to the node having policy $\pi$; this "compresses" the policy graph without changing its value (in POMDPs the redirection step is not necessary; it happens naturally when the previous layer is reoptimized). The computational complexity of Algorithm 1 is $O(2M|Q|^{2N}|A|^N|O|^N|S|^2 + MN|Q|^N|O|^N|A|^N|S|^2 + CN|Q|^2|A|^N|O||S|)$.

Our finite-horizon method gets rid of the simplifying assumption, made in [4, 5], that all FSCs are in the same node for a certain belief. We only assume that for initialization steps, but not in actual optimization. Our optimization monotonically improves the value of a fixed-size policy graph and converges to a local optimum. Here we applied the procedure to finite-horizon DEC-POMDPs; it is adapted for improving deterministic infinite-horizon FSCs in Section 3.1.2. We also have two simple improvements: (1) a speedup: [5] used linear programming to find policies for each agent and action combination in turn, but simple direct search is faster, and we use that; (2) improved duplicate handling: [5] tried sampled beliefs to avoid duplicate nodes; we also try uniformly random beliefs, and for DEC-POMDPs we redirect previous-layer connections to duplicate nodes. Unlike the recursion idea in [4], our projection approach is guaranteed to improve value at each graph node and find a local optimum.

Algorithm 1: Monotonic policy graph improvement algorithm
1  Initialize $V_{T+1}(s, \vec{q}) = 0$
2  Using the current policy, project $b_t(s, \vec{q})$ for $1 \leq t \leq T$
3  for time step $t = T$ to $0$ do
4    foreach agent $i$ do
5      foreach node $q$ of agent $i$ do
6        foreach $a_i$ do
7          $h^{a_i}_{\vec{o},\vec{q}'} = \sum_{s,s',\vec{q},\vec{a}} P(\vec{o}, s'|s, \vec{a})\, b_t(s, \vec{q}) \prod_{j \neq i} P_t(a_j|q_j) P_t(q'_j|q_j, o_j)\, V_{t+1}(s', \vec{q}')$
8          $\forall o_i:\ P^{a_i}_t(q'_i|q_i = q, o_i) = \operatorname{argmax}_{P^{a_i}_t(q'_i|q_i=q,o_i)} \sum_{\vec{q}',\{o_j\}_{j\neq i}} P^{a_i}_t(q'_i|q_i = q, o_i)\, h^{a_i}_{\vec{o},\vec{q}'}$
9          $a^*_i = \operatorname{argmax}_{a_i} \sum_{s,s',\vec{q},\vec{a},\vec{o},\vec{q}'} \big[ b_t(s, \vec{q}) R(s, \vec{a}) \prod_{j\neq i} P_t(a_j|q_j) + \gamma P^{a_i}_t(q'_i|q_i = q, o_i)\, h^{a_i}_{\vec{o},\vec{q}'} \big]$
10         $P_t(a_i = a^*_i|q_i) = 1$, $P_t(a_i \neq a^*_i|q_i) = 0$, $P_t(q'_i|q_i, o_i) = P^{a^*_i}_t(q'_i|q_i, o_i)$
11       if any node $p$ already has the same policy as $q$ then
12         For each $q_i$ for which $P_{t-1}(q'_i = q|q_i, o_j) = 1$, redirect the link to $q'_i = p$
13         Sample belief $b(s, q_j = q\ \forall j)$ and use it to compute a new policy by steps 7-13
14   $V_t(s, \vec{q}) = \sum_{\vec{a}} \prod_i P_t(a_i|q_i) \big[ R(s, \vec{a}) + \gamma \sum_{s',\vec{o},\vec{q}'} P(s', \vec{o}|s, \vec{a}) \prod_i P_t(q'_i|q_i, o_i)\, V_{t+1}(s', \vec{q}') \big]$

3.1.2 Deterministic infinite-horizon controllers

To initialize an infinite-horizon problem, we transform a deterministic finite-horizon policy graph (computed as in Section 3.1.1) into an infinite-horizon periodic controller by connecting the last layer to the first. Assuming controllers start from policy graph node 1, we compute policies for the other nodes in the first layer with beliefs sampled for time step $M + 1$, where $M$ is the length of the controller period. It remains to compute the (deterministic) connections from the last layer to the first: approximately optimal connections are found using the beliefs at the last layer and the value function projected from the last layer through the graph to the first layer. This approach can yield efficient controllers on its own, but may not be suitable for problems with a long effective horizon.

To optimize controllers further, we give two changes to Algorithm 1 that enable optimization of infinite-horizon policies: (1) To compute beliefs $\hat{b}_u(s, \vec{q})$ over time steps $u$ by projecting the initial belief, first determine an effective projection horizon $T_{proj}$.
Compute a $Q_{MDP}$ policy [18] (an upper bound to the optimal DEC-POMDP policy) by dynamic programming. As the projection horizon, use the number of dynamic programming steps needed to gather enough value in the corresponding MDP. Compute the belief $b_t(s, \vec{q})$ for each FSC layer $t$ (needed on line 2 of Algorithm 1) as a discounted sum of projected beliefs: $b_t(s, \vec{q}) = \frac{1}{C} \sum_{u \in \{t, t+M, t+2M, \ldots;\ u \leq T_{proj}\}} \gamma^u\, \hat{b}_u(s, \vec{q})$. (2) Compute the value function $V_t(s, \vec{q})$ for a policy graph layer by backing up (using line 14 of Algorithm 1) $M - 1$ steps from the previous periodic FSC layer to the current FSC layer, one layer at a time. The complexity of one iteration of the infinite-horizon approach is $O(2M|Q|^{2N}|A|^N|O|^N|S|^2 + M(M-1)N|Q|^N|O|^N|A|^N|S|^2 + MCN|Q|^2|A|^N|O||S|)$. There is no convergence guarantee due to the approximations, but the approximation error decreases exponentially with the period $M$.

3.2 Expectation maximization for stochastic infinite-horizon controllers

A stochastic FSC provides a solution of equal or larger value [6] compared to a deterministic FSC with the same number of controller nodes. Many algorithms that optimize stochastic FSCs could be adapted to use periodic FSCs; in this paper we adapt the expectation-maximization (EM) approach [7, 12] to periodic FSCs. The adapted version retains the theoretical properties of regular EM, such as monotonic convergence to a local optimum.

In the EM approach [7, 12] the optimization of policies is written as an inference problem: rewards are scaled into probabilities, and the policy, represented as a stochastic FSC, is optimized by EM iteration to maximize the probability of getting rewards. We now introduce an EM algorithm for (DEC-)POMDPs with periodic stochastic FSCs. We build on the EM method for DEC-POMDPs with standard (non-periodic) FSCs by Kumar and Zilberstein [12]; see [7, 12] for more details of non-periodic EM.
First, the reward function is scaled into a probability $\hat{R}(r = 1|s, \vec{a}) = (R(s, \vec{a}) - R_{min})/(R_{max} - R_{min})$, where $R_{min}$ and $R_{max}$ are the minimum and maximum rewards possible, and $\hat{R}(r = 1|s, \vec{a})$ is the conditional probability for the binary reward $r$ to be 1. The FSC parameters $\theta$ are optimized by maximizing the reward likelihood $\sum_{T=0}^{\infty} P(T)\, P(r = 1|T, \theta)$ with respect to $\theta$, where the horizon is infinite and $P(T) = (1 - \gamma)\gamma^T$. This is equivalent to maximizing expected discounted reward in the DEC-POMDP. The EM approach improves the policy, i.e. the stochastic periodic finite state controllers, in each iteration. We next describe the E-step and M-step formulas.

In the E-step, alpha messages $\hat{\alpha}^{(m)}(\vec{q}, s)$ and beta messages $\hat{\beta}^{(m)}(\vec{q}, s)$ are computed for each layer of the periodic FSC. Intuitively, $\hat{\alpha}(\vec{q}, s)$ corresponds to the discount-weighted average probability that the world is in state $s$ and the FSCs are in nodes $\vec{q}$ when following the policy defined by the current FSCs, and $\hat{\beta}(\vec{q}, s)$ is intuitively the expected discounted total scaled reward when starting from state $s$ and FSC nodes $\vec{q}$. The alpha messages are computed by projecting an initial nodes-and-state distribution forward, while beta messages are computed by projecting reward probabilities backward. We compute separate $\hat{\alpha}^{(m)}(\vec{q}, s)$ and $\hat{\beta}^{(m)}(\vec{q}, s)$ for each layer $m$. We use a projection horizon $T = M T_M - 1$, where $M T_M$ is divisible by the number of layers $M$. This means that when we have accumulated enough probability mass in the E-step we still project a few steps in order to reach a valid $T$.
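The claimed equivalence between maximizing the reward likelihood and maximizing expected discounted reward can be checked numerically on a toy Markov chain (a fixed policy already folded into the transition matrix; all numbers illustrative): the likelihood is an increasing affine function of the discounted return.

```python
import numpy as np

gamma = 0.9
T_max = 2000                              # long enough for convergence
Pc = np.array([[0.7, 0.4],                # Pc[s',s]: chain transition matrix
               [0.3, 0.6]])
r  = np.array([1.0, 3.0])                 # reward per state
b0 = np.array([1.0, 0.0])

Rmin, Rmax = r.min(), r.max()
r_hat = (r - Rmin) / (Rmax - Rmin)        # scaled reward in [0, 1]

V = 0.0   # expected discounted reward  E[sum_t gamma^t r(s_t)]
L = 0.0   # reward likelihood           sum_T (1-gamma) gamma^T E[r_hat(s_T)]
b = b0.copy()
for t in range(T_max):
    V += gamma**t * (b @ r)
    L += (1 - gamma) * gamma**t * (b @ r_hat)
    b = Pc @ b

# L is affine in V with positive slope, so maximizing L maximizes V.
assert np.isclose(L, (1 - gamma) * (V - Rmin / (1 - gamma)) / (Rmax - Rmin))
```

The same affine relation holds with FSC policies; the extra machinery in the E- and M-steps only serves to compute and improve the likelihood efficiently.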
For a periodic FSC the forward projection of the joint distribution over world and FSC states from time step $t$ to time step $t + 1$ is $P_t(\vec{q}', s'|\vec{q}, s) = \sum_{\vec{o},\vec{a}} P_{s's\vec{a}}\, P_{\vec{o}s'\vec{a}} \prod_i [\pi^{(t)}_{a_i q_i} \lambda^{(t)}_{q'_i q_i o_i}]$. Each $\hat{\alpha}^{(m)}(\vec{q}, s)$ can be computed by projecting a single trajectory forward starting from the initial belief and then adding only messages belonging to layer $m$ to each $\hat{\alpha}^{(m)}(\vec{q}, s)$. In contrast, each $\hat{\beta}^{(m)}(\vec{q}, s)$ has to be projected separately backward, because we do not have a "starting point" similar to the alpha messages. Denoting such projections by $\beta^{(m)}_0(\vec{q}, s) = \sum_{\vec{a}} \hat{R}_{s\vec{a}} \prod_i \pi^{(m)}_{a_i q_i}$ and $\beta^{(m)}_t(\vec{q}, s) = \sum_{s',\vec{q}'} \beta^{(m)}_{t-1}(\vec{q}', s')\, P_t(\vec{q}', s'|\vec{q}, s)$, the equations for the messages become

$$\hat{\alpha}^{(m)}(\vec{q}, s) = \sum_{t_m=0}^{T_M-1} \gamma^{(m+t_m M)}(1-\gamma)\, \alpha^{(m+t_m M)}(\vec{q}, s) \quad \text{and} \quad \hat{\beta}^{(m)}(\vec{q}, s) = \sum_{t=0}^{T} \gamma^t (1-\gamma)\, \beta^{(m)}_t(\vec{q}, s). \quad (1)$$

This means that the complexity of the E-step for periodic FSCs is $M$ times the complexity of the E-step for usual FSCs with a total number of nodes equal to the width of the periodic FSC. The complexity increases linearly with the number of layers.

In the M-step we can update the parameters of each layer separately using the alpha and beta messages for that layer, as follows. EM maximizes the expected complete log-likelihood $Q(\theta, \theta^*) = \sum_T \sum_L P(r = 1, L, T|\theta) \log P(r = 1, L, T|\theta^*)$, where $L$ denotes all latent variables: actions, observations, world states, and FSC states, $\theta$ denotes the previous parameters, and $\theta^*$ denotes the new parameters. For periodic FSCs $P(r = 1, L, T|\theta)$ is

$$P(r = 1, L, T|\theta) = P(T)\, [\hat{R}_{s\vec{a}}]_{t=T} \left[ \prod_{t=1}^{T} \tau^{(t)}_{\vec{a}\vec{q}}\, P_{s's\vec{a}}\, P_{\vec{o}s'\vec{a}}\, \Lambda_{\vec{q}'\vec{q}\vec{o}t} \right] \left[ \tau^{(0)}_{\vec{a}\vec{q}}\, b_0(s) \right]_{t=0}, \quad (2)$$

where we denoted $\tau^{(t)}_{\vec{a}\vec{q}} = \prod_i \pi^{(t)}_{a_i q_i}$ for $t = 1, \ldots, T$, $\tau^{(0)}_{\vec{a}\vec{q}} = \prod_i \pi^{(0)}_{a_i q_i} \nu_{q_i}$, and $\Lambda_{\vec{q}'\vec{q}\vec{o}t} = \prod_i \lambda^{(t-1)}_{q'_i q_i o_i}$.

The log in the expected complete log-likelihood $Q(\theta, \theta^*)$ transforms the product of probabilities into a sum; we can divide the sum into smaller sums, where each sum contains only parameters from the same periodic FSC layer. Denoting $f_{s's\vec{q}'\vec{o}\vec{a}m} = P_{s's\vec{a}}\, P_{\vec{o}s'\vec{a}}\, \hat{\beta}^{(m+1)}(\vec{q}', s')$, the M-step periodic FSC parameter update rules can then be written as:

$$\nu^*_{q_i} = \frac{\nu_{q_i}}{C_i} \sum_{s, q_{\bar{i}}} \hat{\beta}^{(0)}(\vec{q}, s)\, \nu_{q_{\bar{i}}}\, b_0(s) \quad (3)$$

$$\pi^{*(m)}_{a_i q_i} = \frac{\pi^{(m)}_{a_i q_i}}{C_{q_i}} \sum_{s, s', q_{\bar{i}}, \vec{q}', \vec{o}, a_{\bar{i}}} \left\{ \hat{\alpha}^{(m)}(\vec{q}, s)\, \pi^{(m)}_{a_{\bar{i}} q_{\bar{i}}} \cdot \left[ \hat{R}_{s\vec{a}} + \frac{\gamma}{1 - \gamma}\, \lambda^{(m)}_{q'_{\bar{i}} q_{\bar{i}} o_{\bar{i}}}\, \lambda^{(m)}_{q'_i q_i o_i}\, f_{s's\vec{q}'\vec{o}\vec{a}m} \right] \right\} \quad (4)$$

$$\lambda^{*(m)}_{q'_i q_i o_i} = \frac{\lambda^{(m)}_{q'_i q_i o_i}}{C_{q_i o_i}} \sum_{s, s', q_{\bar{i}}, q'_{\bar{i}}, o_{\bar{i}}, \vec{a}} \hat{\alpha}^{(m)}(\vec{q}, s)\, \pi^{(m)}_{a_{\bar{i}} q_{\bar{i}}}\, \pi^{(m)}_{a_i q_i}\, \lambda^{(m)}_{q'_{\bar{i}} q_{\bar{i}} o_{\bar{i}}}\, f_{s's\vec{q}'\vec{o}\vec{a}m}. \quad (5)$$

Note about initialization. Our initialization procedure (Sections 3.1.1 and 3.1.2) yields deterministic periodic controllers as initializations; a deterministic finite state controller is a stable point of the EM algorithm, since for such a controller the M-step of the EM approach does not change the probabilities.
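The stable-point behavior follows from the multiplicative form of updates (3)-(5): entries that are exactly zero stay zero, so a deterministic row is reproduced unchanged. A toy sketch (the positive weight vector stands in for the bracketed expected-reward terms, which are arbitrary here):

```python
import numpy as np

rng = np.random.default_rng(2)

def multiplicative_update(p, weight):
    """M-step-style update: new parameters are proportional to the old
    parameters times a positive expected-reward weight (cf. Eqs. (3)-(5))."""
    new = p * weight
    return new / new.sum()

p_det = np.array([1.0, 0.0, 0.0])    # a deterministic row, e.g. P(a|q)
w = rng.random(3) + 0.1              # positive, otherwise arbitrary weights
assert np.allclose(multiplicative_update(p_det, w), p_det)   # fixed point

# A little noise lets EM move the parameters again:
eps = 0.05
p_noisy = (1 - eps) * p_det + eps / len(p_det)
updated = multiplicative_update(p_noisy, w)
assert not np.allclose(updated, p_det)
```

This is why the initialization is perturbed before handing it to EM, as described next.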
To allow EM to escape the stable point and find even better optima, we add noise to the controllers in order to produce stochastic controllers that can be improved by EM.

4 Experiments

Experiments were run on standard POMDP and DEC-POMDP benchmark problems [8, 15, 16, 10] with a time limit of two hours. For both types of benchmarks we ran the proposed infinite-horizon method for deterministic controllers (denoted "Peri") with nine improvement rounds, as described in Section 3.1.2. For the DEC-POMDP benchmarks we also ran the proposed periodic expectation maximization approach of Section 3.2 (denoted "PeriEM"), initialized by the finite-horizon approach in Section 3.1.1 with nine improvement rounds and the infinite-horizon transformation in Section 3.1.2, first paragraph. For "PeriEM" a period of 10 was used. For "Peri" a period of 30 was used for problems with discount factor 0.9, 60 for discount factor 0.95, and 100 for larger discount factors. The main comparison methods EM [12] and Mealy NLP [16] (with removal of dominated actions and unreachable state-observation pairs) were implemented using Matlab, and the NEOS server was used for solving the Mealy NLP non-linear programs. We used the best of parallel experiment runs to choose the number of FSC nodes. EM was run for all problems, and Mealy NLP for the Hallway2, decentralized tiger, recycling robots, and wireless network problems. SARSOP [3] was run for all POMDP problems, and we also report results from the literature [8, 15, 16].

Table 1 shows DEC-POMDP results for the decentralized tiger, recycling robots, meeting in a grid, wireless network [10], cooperative box pushing, and stochastic mars rover problems. A discount factor of 0.99 was used in the wireless network problem and 0.9 in the other DEC-POMDP benchmarks. Table 2 shows POMDP results for the benchmark problems Hallway2, Tag-avoid, Tag-avoid repeat, and Aloha.
A discount factor of 0.999 was used in the Aloha problem and 0.95 in the other POMDP benchmarks. Methods whose 95% confidence intervals overlap with that of the best method are shown in bold. The proposed method "Peri" performed best in the DEC-POMDP problems and better than the other restricted policy size methods in the POMDP problems. "PeriEM" also performed well, outperforming EM.

5 Conclusions and discussion

We introduced a new class of finite state controllers, periodic finite state controllers (periodic FSCs), and presented methods for initialization and policy improvement. In comparisons the resulting methods outperformed state-of-the-art DEC-POMDP methods and state-of-the-art restricted-size POMDP methods, and worked very well on POMDPs in general.

In our method the period length was based simply on the discount factor, which already performed very well; even better results could be achieved, for example, by running solutions of different periods in parallel. In addition to the expectation maximization presented here, other optimization algorithms for infinite-horizon problems could also be adapted to periodic FSCs: for example, the non-linear programming approach [8] could be adapted to periodic FSCs. In brief, a separate value function and separate FSC parameters would be used for each time slice of the periodic FSCs, and the number of constraints would grow linearly with the number of time slices.

Acknowledgments

We thank Ari Hottinen for discussions on decision making in wireless networks. The authors belong to the Adaptive Informatics Research Centre (CoE of the Academy of Finland). The work was supported by Nokia, TEKES, Academy of Finland decision number 252845, and in part by the PASCAL2 EU NoE, ICT 216886. This publication reflects the authors' views only.

Table 1: DEC-POMDP benchmarks.
Most comparison results are from [8, 15, 16]; we ran EM and Mealy NLP on many of the tests (see Section 4). Note that "Goal-directed" is a special method that can only be applied to problems with goals.

Table 2: POMDP benchmarks. Most comparison method results are from [16]; we ran EM, SARSOP, and Mealy NLP on one test (see Section 4).

Table 1 (DEC-POMDP benchmarks), Algorithm (size, time): value

DecTiger (|S| = 2, |Ai| = 3, |Oi| = 2)
  Peri (10 × 30, 202s): 13.45
  PeriEM (7 × 10, 6540s): 9.42
  Goal-directed (11, 75s): 5.041
  NLP (19, 6173s): −1.088
  Mealy NLP (4, 29s): −1.49
  EM (6, 142s): −16.30

Recycling robots (|S| = 4, |Ai| = 3, |Oi| = 2)
  Mealy NLP (1, 0s): 31.93
  Peri (6 × 30, 77s): 31.84
  PeriEM (6 × 10, 272s): 31.80
  EM (2, 13s): 31.50

Meeting in a 2x2 grid (|S| = 16, |Ai| = 5, |Oi| = 2)
  Peri (5 × 30, 58s): 6.89
  PeriEM (5 × 10, 6019s): 6.82
  EM (8, 5086s): 6.80
  Mealy NLP (5, 116s): 6.13
  HPI+NLP (7, 16763s): 6.04
  NLP (5, 117s): 5.66
  Goal-directed (4, 4s): 5.64

Wireless network (|S| = 64, |Ai| = 2, |Oi| = 6)
  EM (3, 6886s): −175.40
  Peri (15 × 100, 6492s): −181.24
  PeriEM (2 × 10, 3557s): −218.90
  Mealy NLP (1, 9s): −296.50

Box pushing (|S| = 100, |Ai| = 4, |Oi| = 5)
  Goal-directed (5, 199s): 149.85
  Peri (15 × 30, 5675s): 148.65
  Mealy NLP (4, 774s): 143.14
  PeriEM (4 × 10, 7164s): 106.68
  HPI+NLP (10, 6545s): 95.63
  EM (6, 7201s): 43.33

Mars rovers (|S| = 256, |Ai| = 6, |Oi| = 8)
  Peri (10 × 30, 6088s): 24.13
  Goal-directed (6, 956s): 21.48
  Mealy NLP (3, 396s): 19.67
  PeriEM (3 × 10, 7132s): 18.13
  EM (3, 5096s): 17.75
  HPI+NLP (4, 111s): 9.29

Table 2 (POMDP benchmarks), Algorithm (size, time): value

Hallway2 (|S| = 93, |A| = 5, |O| = 17)
  Perseus (56, 10s): 0.35
  HSVI2 (114, 1.5s): 0.35
  PBPI (320, 3.1s): 0.35
  SARSOP (776, 7211s): 0.35
  HSVI (1571, 10010s): 0.35
  PBVI (95, 360s): 0.34
  Peri (160 × 60, 5252s): 0.34
  biased BPI (60, 790s): 0.32
  NLP fixed (18, 240s): 0.29
  NLP (13, 420s): 0.28
  EM (30, 7129s): 0.28
  Mealy NLP (1, 2s): 0.028

Tag-avoid (|S| = 870, |A| = 5, |O| = 30)
  PBPI (818, 1133s): −5.87
  SARSOP (13588, 7394s): −6.04
  Peri (160 × 60, 6394s): −6.15
  RTDP-BEL (2.5m, 493s): −6.16
  Perseus (280, 1670s): −6.17
  HSVI2 (415, 24s): −6.36
  Mealy NLP (2, 323s): −6.65
  biased BPI (17, 250s): −6.65
  BPI (940, 59772s): −9.18
  NLP (2, 5596s): −13.94
  EM (2, 30s): −20.00

Tag-avoid repeat (|S| = 870, |A| = 5, |O| = 30)
  SARSOP (15202, 7203s): −10.71
  Peri (160 × 60, 6316s): −11.02
  Mealy NLP (2, 319s): −11.44
  Perseus (163, 5656s): −12.35
  HSVI2 (8433, 5413s): −14.33
  NLP (1, 37s): −20.00
  EM (2, 72s): −20.00

Aloha (|S| = 90, |A| = 29, |O| = 3)
  SARSOP (82, 7201s): 1237.01
  Peri (160 × 100, 6793s): 1236.70
  Mealy NLP (7, 312s): 1221.72
  HSVI2 (5434, 5430s): 1217.95
  NLP (6, 1134s): 1211.67
  EM (3, 7200s): 1120.05
  Perseus (68, 5401s): 853.42

References

[1] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, pages 1071–1088, 1973.

[2] S. Seuken and S. Zilberstein. Formal models and algorithms for decentralized decision making under uncertainty. Autonomous Agents and Multi-Agent Systems, 17(2):190–250, 2008.

[3] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proc. Robotics: Science and Systems, 2008.

[4] S. Seuken and S. Zilberstein. Memory-bounded dynamic programming for DEC-POMDPs. In Proc. of 20th IJCAI, pages 2009–2016. Morgan Kaufmann, 2007.

[5] F. Wu, S. Zilberstein, and X. Chen. Point-based policy generation for decentralized POMDPs. In Proc.
of 9th AAMAS, pages 1307–1314. IFAAMAS, 2010.

[6] P. Poupart and C. Boutilier. Bounded finite state controllers. Advances in Neural Information Processing Systems, 16:823–830, 2003.

[7] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving (PO)MDPs. Technical report, University of Edinburgh, 2006.

[8] C. Amato, D. Bernstein, and S. Zilberstein. Optimizing memory-bounded controllers for decentralized POMDPs. In Proc. of 23rd UAI, pages 1–8. AUAI Press, 2007.

[9] A. Kumar and S. Zilberstein. Point-based backup for decentralized POMDPs: Complexity and new algorithms. In Proc. of 9th AAMAS, pages 1315–1322. IFAAMAS, 2010.

[10] J. Pajarinen and J. Peltonen. Efficient planning for factored infinite-horizon DEC-POMDPs. In Proc. of 22nd IJCAI, pages 325–331. AAAI Press, July 2011.

[11] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

[12] A. Kumar and S. Zilberstein. Anytime planning for decentralized POMDPs using expectation maximization. In Proc. of 26th UAI, 2010.

[13] D. S. Bernstein, E. A. Hansen, and S. Zilberstein. Bounded policy iteration for decentralized POMDPs. In Proc. of 19th IJCAI, pages 1287–1292. Morgan Kaufmann, 2005.

[14] D. Szer and F. Charpillet. An optimal best-first search algorithm for solving infinite horizon DEC-POMDPs. In Proc. of 16th ECML, pages 389–399, 2005.

[15] C. Amato and S. Zilberstein. Achieving goals in decentralized POMDPs. In Proc. of 8th AAMAS, volume 1, pages 593–600. IFAAMAS, 2009.

[16] C. Amato, B. Bonet, and S. Zilberstein. Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs. In Proc. of 24th AAAI, 2010.

[17] S. Ji, R. Parr, H. Li, X. Liao, and L. Carin. Point-based policy iteration. In Proc.
of 22nd AAAI, volume 22, page 1243, 2007.

[18] F. A. Oliehoek, M. T. J. Spaan, and N. Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32(1):289–353, 2008.