{"title": "Variational Planning for Graph-based MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 2976, "page_last": 2984, "abstract": "Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving graph-based MDPs usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by a product of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential only in the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies.", "full_text": "Variational Planning for Graph-based MDPs

Qiang Cheng†    Qiang Liu‡    Feng Chen†    Alexander Ihler‡
†Department of Automation, Tsinghua University
‡Department of Computer Science, University of California, Irvine
†{cheng-q09@mails., chenfeng@mail.}tsinghua.edu.cn
‡{qliu1@, ihler@ics.}uci.edu

Abstract

Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables.
However, the complexity of exactly solving graph-based MDPs usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by a product of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential only in the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies.

1 Introduction

Markov Decision Processes (MDPs) have been widely used to model and solve sequential decision making problems under uncertainty, in fields including artificial intelligence, control, finance and management (Puterman, 2009, Barber, 2011). However, standard MDPs are described by explicitly enumerating all possible states of the variables, and are thus not well suited to large problems. Graph-based MDPs (Guestrin et al., 2003, Forsell and Sabbadin, 2006) provide a compact representation for large and structured MDPs, where the transition model is explicitly represented by a dynamic Bayesian network. In graph-based MDPs, the state is described by a collection of random variables, and the transition and reward functions are represented by a set of smaller (local-scope) functions.
This is particularly useful for spatial systems or networks with many “local” decisions, each affecting small sub-systems that are coupled together and interdependent (Nath and Domingos, 2010, Sabbadin et al., 2012).

The graph-based MDP representation gives a compact way to describe a structured MDP, but the complexity of exactly solving such MDPs typically still grows exponentially in the number of state variables. Consequently, graph-based MDPs are often solved approximately by enforcing context-specific independence or function-specific independence constraints (Sigaud et al., 2010). To take advantage of context-specific independence, a graph-based MDP can be represented using decision trees or algebraic decision diagrams (Bahar et al., 1993), and then solved by applying structured value iteration (Hoey et al., 1999) or structured policy iteration (Boutilier et al., 2000). However, in the worst case, the size of the diagram still increases exponentially with the number of variables. Alternatively, methods based on function-specific independence approximate the value function by a linear combination of basis functions (Koller and Parr, 2000, Guestrin et al., 2003). Exploiting function-specific independence, a graph-based MDP can be solved using approximate linear programming (Guestrin et al., 2003, 2001, Forsell and Sabbadin, 2006), approximate policy iteration (Sabbadin et al., 2012, Peyrard and Sabbadin, 2006) and approximate value iteration (Guestrin et al., 2003). Among these, the approximate linear programming algorithm of Guestrin et al. (2003, 2001) has a number of constraints exponential in the treewidth, and thus cannot be applied to general MDPs with many variables.
The approximate policy iteration algorithm of Sabbadin et al. (2012) and Peyrard and Sabbadin (2006) exploits a mean field approximation to compute and update the local policies; unfortunately, this can give loose approximations.

In this paper, we propose a variational framework for the MDP planning problem. This framework provides a new perspective for describing and solving graph-based MDPs in which both the state and decision spaces are structured. We first derive a variational value iteration algorithm as an exact planning algorithm, which is equivalent to the classical value iteration algorithm. We then design an approximate version of this algorithm by taking advantage of the factored representation of the reward and transition functions, giving a factored variational value iteration algorithm. This algorithm treats the value function as an unnormalized distribution and approximates it using a product of local-scope value functions. At each step, the algorithm computes the value function by minimizing a Kullback-Leibler divergence, which can be done using a belief propagation algorithm for influence diagram problems (Liu and Ihler, 2012). In comparison with the approximate linear programming algorithm (Guestrin et al., 2003) and the approximate policy iteration algorithm (Sabbadin et al., 2012) on various graph-based MDPs, we show that our factored variational value iteration algorithm generates better policies.

The remainder of this paper is organized as follows. Background and notation for graph-based MDPs are introduced in Section 2. Section 3 describes a variational view of planning for finite horizon MDPs, followed by a framework for infinite horizon MDPs in Section 4. In Section 5, we derive an approximate algorithm for solving infinite horizon MDPs based on the variational perspective.
We show experiments demonstrating the effectiveness of our algorithm in Section 6.

2 Markov Decision Processes and Graph-based MDPs

2.1 Markov Decision Processes

A Markov Decision Process (MDP) is a discrete-time stochastic control process, in which the system chooses a decision at each step to maximize the overall reward. An MDP can be characterized by a four-tuple (X, D, R, T), where X represents the set of all possible states; D is the set of all possible decisions; R : X × D → ℝ is the reward function of the system, and R(x, d) is the reward received after choosing decision d in state x; T : X × D × X → [0, 1] is the transition function, and T(y|x, d) is the probability that the system arrives at state y, given that it starts from x upon executing decision d. A policy of the system is a mapping from states to decisions, π(x) : X → D, so that π(x) is the decision chosen by the system in state x. The graphical representation of an MDP is shown in Figure 1(a).

We consider the case of an MDP with infinite horizon, in which future rewards are discounted exponentially with a discount factor γ ∈ [0, 1). The task of the MDP is to choose the best stationary policy π∗(x) that maximizes the expected discounted reward on the infinite horizon. The value function v∗(x) of the best policy π∗(x) then satisfies the following Bellman equation:

v∗(x) = max_{π(x)} Σ_{y∈X} T(y|x, π(x)) (R(x, π(x)) + γ v∗(y)),   (1)

where v∗(x) = v∗(y), ∀x = y. The Bellman equation can be solved using stochastic dynamic programming algorithms such as value iteration and policy iteration, or by linear programming (Puterman, 2009).

2.2 Graph-based MDPs

We assume that the full state x can be represented as a collection of state variables x_i, so that X is a Cartesian product of the domains of the x_i: X = X_1 × X_2 × ··· × X_N, and similarly for d: D = D_1 × D_2 × ··· × D_N. We consider the following particular factored form for MDPs: for each variable i, there exists a neighborhood set Γ_i (including i) such that the value of x_i^{t+1} depends only on variable i's neighborhood, x^t[Γ_i], and the i-th decision d_i^t. Then, we can write the transition function in a factored form:

T(y|x, d) = ∏_{i=1}^N T_i(y_i | x[Γ_i], d_i),   (2)

where each factor is a local-scope function T_i : X[Γ_i] × D_i × X_i → [0, 1], ∀i ∈ {1, 2, ..., N}. We also assume that the reward function is the sum of N local-scope rewards:

R(x, d) = Σ_{i=1}^N R_i(x_i, d_i),   (3)

with local-scope functions R_i : X_i × D_i → ℝ, ∀i ∈ {1, 2, ..., N}.

To summarize, a graph-based Markov decision process is characterized by the following parameters: ({X_i : 1 ≤ i ≤ N}; {D_i : 1 ≤ i ≤ N}; {R_i : 1 ≤ i ≤ N}; {Γ_i : 1 ≤ i ≤ N}; {T_i : 1 ≤ i ≤ N}). Figure 1(b) gives an example of a graph-based MDP.

Figure 1: (a) A Markov decision process; (b) A graph-based Markov decision process.
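To make the factored representation concrete, here is a minimal sketch: all sizes, neighborhoods and local tables below are hypothetical toy choices (not from the paper). It builds the factored transition (2) and additive reward (3) from random local-scope functions, then runs exact value iteration for Eq. (1) on the flattened state space, illustrating why exact solution is exponential in N:

```python
import itertools
import random

random.seed(0)

# Hypothetical toy problem: N binary state/decision variables.
N = 3
GAMMA = 0.9
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}  # Gamma_i, each including i

def make_local_T(i):
    # Local transition T_i(y_i | x[Gamma_i], d_i), with random probabilities.
    T = {}
    for xs in itertools.product([0, 1], repeat=len(neighbors[i])):
        for d in [0, 1]:
            p = random.random()
            T[xs, d] = {0: p, 1: 1.0 - p}  # distribution over y_i
    return T

T_loc = {i: make_local_T(i) for i in range(N)}
R_loc = {i: {(x, d): random.random() for x in [0, 1] for d in [0, 1]}
         for i in range(N)}

def T(y, x, d):
    # Factored transition, Eq. (2): product of local-scope factors.
    prod = 1.0
    for i in range(N):
        xs = tuple(x[j] for j in neighbors[i])
        prod *= T_loc[i][xs, d[i]][y[i]]
    return prod

def R(x, d):
    # Additive reward, Eq. (3): sum of local-scope rewards.
    return sum(R_loc[i][x[i], d[i]] for i in range(N))

# Exact value iteration for Eq. (1) on the flattened (exponential) spaces.
states = list(itertools.product([0, 1], repeat=N))
decisions = list(itertools.product([0, 1], repeat=N))
v = {x: 0.0 for x in states}
for _ in range(200):
    v = {x: max(sum(T(y, x, d) * (R(x, d) + GAMMA * v[y]) for y in states)
                for d in decisions)
         for x in states}
```

Even at N = 3 each sweep touches |X| · |D| · |X| = 512 transition entries; the factored algorithms developed later in the paper are designed to avoid exactly this blow-up.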
These assumptions for graph-based MDPs can easily be generalized, for example to allow T_i and R_i that depend on arbitrary sets of variables and decisions, at the cost of some additional notation.

The optimal policy π(x) cannot be represented explicitly for large graph-based MDPs, since the number of states grows exponentially with the number of variables. To reduce complexity, we consider a particular class of local policies: a policy π(x) : X → D is said to be local if each decision d_i is made using only the neighborhood Γ_i, so that π(x) = (π_1(x[Γ_1]), π_2(x[Γ_2]), ..., π_N(x[Γ_N])), where π_i(x[Γ_i]) : X[Γ_i] → D_i. The main advantage of local policies is that they can be concisely expressed when the neighborhood sizes |Γ_i| are small.

3 Variational Planning for Finite Horizon MDPs

In this section, we introduce a variational planning viewpoint of finite MDPs. A finite MDP can be viewed as an influence diagram; we can then directly relate planning to the variational decision-making framework of Liu and Ihler (2012).

Influence diagrams (Shachter, 2007) use Bayesian networks to represent structured decision problems under uncertainty. The shaded part of Figure 1(a) shows a simple example influence diagram, with random variables {x, y}, decision variable d, and reward functions {R(x, d), v(y)}. The goal is then to choose a policy that maximizes the expected reward.

The best policy π^t(x) for a finite MDP can be computed using backward induction (Barber, 2011):

v^{t−1}(x) = max_{π(x)} Σ_{y∈X} T(y|x, π(x)) (R(x, π(x)) + γ v^t(y)).   (4)

Let p^t(x, y, d) = T(y|x, d) (R(x, d) + γ v^t(y)) be an augmented distribution (see, e.g., Liu and Ihler (2012)).
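For intuition, one step of the backward induction (4) can be computed by brute force from the augmented distribution p^t: eliminate y by summation and d by maximization. The sketch below uses hypothetical random tables and toy sizes; it also shows how the resulting value function separates into a scalar log-normalization term and a normalized marginal over x:

```python
import math
import random

random.seed(1)

# Hypothetical toy spaces: |X| = |D| = 3, with random T, R and current v^t.
X, D, GAMMA = range(3), range(3), 0.9
Tm = {}
for x in X:
    for d in D:
        w = [random.random() for _ in X]
        for y in X:
            Tm[x, d, y] = w[y] / sum(w)  # T(y|x,d), normalized over y
Rm = {(x, d): random.random() for x in X for d in D}
v_t = {y: random.random() for y in X}

# Augmented (unnormalized) distribution p^t(x,y,d) = T(y|x,d)(R(x,d) + gamma v^t(y)).
p = {(x, y, d): Tm[x, d, y] * (Rm[x, d] + GAMMA * v_t[y])
     for x in X for y in X for d in D}

# One step of backward induction, Eq. (4): sum out y, maximize over d.
v_prev = {x: max(sum(p[x, y, d] for y in X) for d in D) for x in X}
pi_t = {x: max(D, key=lambda d: sum(p[x, y, d] for y in X)) for x in X}

# The same quantity decomposes as exp(Phi) * tau(x), with Phi a scalar
# log-normalization and tau a normalized belief over x.
Z = sum(v_prev.values())
Phi = math.log(Z)
tau_x = {x: v_prev[x] / Z for x in X}
```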
Applying the variational framework for influence diagrams (Liu and Ihler, 2012, Theorem 3.1), the optimal policy can equivalently be obtained from the dual form of Eq. (4):

Φ(θ^t) = max_{τ∈M} { ⟨θ^{Δ;t}, τ⟩ + H(x, y, d; τ) − H(d|x; τ) },   (5)

where θ^{Δ;t}(x, y, d) = log p^t(x, y, d) = log T(y|x, d) + log(R(x, d) + γ v^t(y)), and τ is a vector of moments in the marginal polytope M (Wainwright and Jordan, 2008). In a mild abuse of notation, we will use τ to refer both to the vector of moments and to the maximum-entropy distribution τ(x, y, d) consistent with those moments; H(·; τ) refers to the entropy or conditional entropy of this distribution. See also Wainwright and Jordan (2008), Liu and Ihler (2012) for details.

Let τ^t(x, y, d) be the optimal solution of Eq. (5); then from Liu and Ihler (2012), the optimal policy π^t(x) is simply arg max_d τ^t(d|x). Moreover, the optimal value function v^{t−1}(x) can be obtained from Eq. (5). This result is summarized in the following lemma.

Lemma 1. For finite MDPs with non-stationary policy, the best policy π^t(x) and the value function v^{t−1}(x) can be obtained by solving Eq. (5). Let τ^t(x, y, d) be the optimal solution of Eq. (5). Then:
(a) The optimal policy can be obtained from τ^t(x, y, d) as π^t(x) = arg max_d τ^t(d|x).
(b) The value function w.r.t. π^t(x) can be obtained as v^{t−1}(x) = exp(Φ(θ^t)) τ^t(x).

Proof. (a) follows directly from Theorem 3.1 of Liu and Ihler (2012). (b) Note that T(y|x, π^t(x)) (R(x, π^t(x)) + γ v^t(y)) = exp(Φ(θ^t)) τ^t(x, y, d). Making use of Eq. (4), summing over y and maximizing over d in exp(Φ(θ^t)) τ^t(x, y, d), we obtain v^{t−1}(x) = exp(Φ(θ^t)) τ^t(x).

4 Variational Planning for Infinite Horizon MDPs

Given the variational form for finite MDPs, we now construct a variational framework for infinite MDPs. Compared to the primal form (i.e., Eq. (4)) for finite MDPs, the Bellman equation of an infinite MDP, Eq. (1), has the additional constraint that v^{t−1}(x) = v^t(y) when x = y.
For an infinite MDP, we can therefore consider a two-stage finite MDP with the variational form in Eq. (5), together with this additional constraint. The main result is given by the following theorem.

Theorem 2. Assume τ and Φ are the solution of the following optimization problem:

max_{τ∈M, Φ∈ℝ} Φ,   (6)

subject to Φ = ⟨θ^Δ, τ⟩ + H(x, y, d; τ) − H(d|x; τ),
θ^Δ = log T(y|x, d) + log(R(x, d) + γ exp(Φ) τ_x(y)),   (7)

where τ_x denotes the marginal distribution on x. With τ∗ being the optimal solution, we have:

(a) The optimal policy of the infinite MDP can be decoded as π∗(x) = arg max_d τ∗(d|x).
(b) The value function w.r.t. π∗(x) is v∗(x) = exp(Φ) τ∗(x).

Proof. The Bellman equation is equivalent to the backward induction in Eq. (4), subject to the extra constraint that v^t = v^{t−1}. The result follows by replacing Eq. (4) with its variational dual (5).

Like the Bellman equation (1), its dual form (6) has no closed-form solution. Analogously to the value iteration algorithm for the Bellman equation, Eq. (6) can be solved by alternately fixing τ_x(x) and Φ in θ^Δ, and solving Eq. (6) with only the first constraint using some convex optimization technique. However, each such step of solving for τ and Φ is equivalent to one step of value iteration; if τ(x, y, d) is represented explicitly, this offers no advantage over simply applying the elimination operators as in Eq. (4). The usefulness of this form lies mainly in opening the door to new approximations.

5 Approximate Variational Algorithms for Graph-based MDPs

The framework in the previous section gives a new perspective on the MDP planning problem, but does not by itself simplify the problem or provide new solution methods.
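Indeed, when the inner problem is solved exactly by brute-force elimination, the alternating scheme of Section 4 (fix τ_x and Φ in θ^Δ, re-solve, repeat) simply reproduces value iteration. The sketch below illustrates this on a tiny MDP; the sizes and random tables are hypothetical:

```python
import math
import random

random.seed(2)

# Hypothetical tiny MDP: 3 states, 2 decisions, random T and positive R.
X, D, GAMMA = range(3), range(2), 0.9
Tm, Rm = {}, {}
for x in X:
    for d in D:
        w = [random.random() for _ in X]
        for y in X:
            Tm[x, d, y] = w[y] / sum(w)  # T(y|x,d)
        Rm[x, d] = 1.0 + random.random()

# Outer alternation: theta_delta uses the current guess of exp(Phi) tau_x(y);
# re-solving the inner problem exactly returns the updated exp(Phi) tau(x),
# which is exactly one Bellman (value iteration) update.
v = {x: 1.0 for x in X}  # current guess of exp(Phi) * tau_x
for _ in range(300):
    v = {x: max(sum(Tm[x, d, y] * (Rm[x, d] + GAMMA * v[y]) for y in X)
                for d in D)
         for x in X}

# Decode the fixed point as in Theorem 2: v*(x) = exp(Phi) tau*(x).
Z = sum(v.values())
Phi = math.log(Z)
tau_x = {x: v[x] / Z for x in X}
policy = {x: max(D, key=lambda d: sum(Tm[x, d, y] * (Rm[x, d] + GAMMA * v[y])
                                      for y in X))
          for x in X}
```

With an explicit τ this is no cheaper than classical value iteration, which is exactly the point made above; the factored approximation developed in this section replaces the exhaustive elimination with local computations on a cluster graph.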
For graph-based MDPs, the sizes of the full state and decision spaces are exponential in the number of variables, and thus the complexity of exact algorithms is exponentially large. In this section, we present an approximate algorithm for solving Eq. (6) that exploits the factorization structure of the transition function (2), the reward function (3) and the value function v(x).

Standard variational approximations take advantage of the multiplicative factorization of a distribution to define their approximations. While our (unnormalized) distribution p(x, y, d) = exp[θ^Δ(x, y, d)] is structured, some of its important structure comes from additive factors, such as the local-scope reward functions R_i(x_i, d_i) in Eq. (3) and the discounted value function γv(x) in Eq. (1). Computing the sum of these additive factors directly would create a large factor over an unmanageably large variable domain, and destroy most of the useful structure of p(x, y, d).

To avoid this effect, we convert additive factors into multiplicative factors by augmenting the model with a latent “selector” variable, similar to that used for the “complete likelihood” in mixture models (Liu and Ihler, 2012). For example, consider the sum of two factors:

f(x) = f_12(x_1, x_2) + f_23(x_2, x_3) = Σ_{a∈{0,1}} (f_12)^a · (f_23)^{(1−a)} = Σ_{a∈{0,1}} f̄_12(a, x_1, x_2) · f̄_23(a, x_2, x_3).

Introducing the auxiliary variable a converts f into a product of factors, where marginalizing over a recovers the original function f.

Using this augmenting approach, the additive elements of the graph-based MDP are converted to multiplicative factors, that is, R_i(x_i, d_i) → R̃_i(x_i, d_i, a) and γv(x) → ṽ_γ(x, a).
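A minimal numeric check of this selector-variable trick (the factor tables below are hypothetical, chosen only for illustration):

```python
# Two local-scope factors over binary variables; values are arbitrary.
f12 = {(x1, x2): 1.0 + x1 + 2 * x2 for x1 in (0, 1) for x2 in (0, 1)}
f23 = {(x2, x3): 2.0 + x2 * x3 for x2 in (0, 1) for x3 in (0, 1)}

def f_sum(x1, x2, x3):
    # Original additive form: f = f12 + f23.
    return f12[x1, x2] + f23[x2, x3]

def f_aug(x1, x2, x3):
    # Augmented multiplicative form: fbar12(a,·) = f12^a, fbar23(a,·) = f23^(1-a);
    # marginalizing the selector a recovers f, since a=1 keeps f12 and a=0 keeps f23.
    return sum(f12[x1, x2] ** a * f23[x2, x3] ** (1 - a) for a in (0, 1))
```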
In this way, the parameter θ^Δ of a graph-based MDP can be represented as

θ^Δ(x, y, d, a) = Σ_{i=1}^N log T_i(y_i | x[Γ_i], d_i) + Σ_{i=1}^N log R̃_i(x_i, d_i, a) + log ṽ_γ(y, a).

Now, p(x, y, d, a) = exp[θ^Δ(x, y, d, a)] has a representation in terms of a product of factors. Let

θ(x, y, d, a) = Σ_{i=1}^N log T_i(y_i | x[Γ_i], d_i) + Σ_{i=1}^N log R̃_i(x_i, d_i, a).

Before designing the algorithms, we first construct a cluster graph (G; C; S) for the distribution exp[θ(x, y, d, a)], where C denotes the set of clusters and S the set of separators. (See Liu and Ihler (2012, 2011), Wainwright and Jordan (2008) for more details on cluster graphs.) We assign each decision node d_i to one cluster that contains d_i and its parents pa(i); clusters so assigned are called decision clusters A, while the other clusters are called normal clusters R, so that C = {R, A}. Using the structure of the cluster graph, θ can be decomposed as

θ(x, y, d, a) = Σ_{k∈C} θ_ck(x_ck, y_ck, d_ck, a),   (8)

and the distribution τ is approximated as

τ(x, y, d, a) = ∏_{k∈C} τ_ck(z_ck) / ∏_{(kl)∈S} τ_skl(z_skl),   (9)

where z_ck = {x_ck, y_ck, d_ck, a}. Therefore, instead of optimizing the full distribution τ, we can optimize the collection of marginal distributions τ = {τ_ck, τ_sk} at far lower computational cost. These marginals should belong to the local consistency polytope L, which enforces that marginals are consistent on their overlapping sets of variables (Wainwright and Jordan, 2008).

We now construct a reduced cluster graph over x from the full cluster graph, to serve as the approximating structure of the marginal τ(x).
We assume a factored representation for τ(x):

τ(x) = ∏_{k∈C} τ_ck(x_ck) / ∏_{(kl)∈S} τ_skl(x_skl),   (10)

where τ_ck(x_ck) is the marginal distribution of τ_ck(z_ck) on x_ck. Note that Eq. (10) also dictates a factored approximation of the value function v(x), because v(x) ≈ exp(Φ) τ(x). Assume v_γ(x) factors into v_γ(x) = ∏_k v_ck(x_ck). Then, the constraint (7) reduces to a set of simpler constraints on the cliques of the cluster graph,

θ^Δ_ck(x_ck, y_ck, d_ck, a) = θ_ck(x_ck, y_ck, d_ck, a) + log v_ck,x(y_ck, a),  k ∈ C.   (11)

Correspondingly, the constraint (6) can be approximated by

Φ = Σ_{k∈C} ⟨θ^Δ_ck, τ_ck⟩ + Σ_{k∈R} H_ck + Σ_{k∈A} H′_ck − Σ_{(kl)∈S} H_skl,   (12)

where H_ck is the entropy of the variables in cluster c_k, H_ck = H(x_ck, y_ck, d_ck, a; τ), and H′_ck = H(x_ck, y_ck, d_ck, a; τ) − H(d_ck | x_ck; τ). With these approximations, we solve the optimization in Theorem 2 using “mixed” belief propagation (Liu and Ihler, 2012) for fixed {θ^Δ_ck}; we then update {θ^Δ_ck} using the fixed point condition (11). This gives the double-loop method in Algorithm 1.

Algorithm 1 Factored Variational Value Iteration Algorithm
Input: A graph-based MDP with ({X_i}; {D_i}; {R_i}; {Γ_i}; {T_i}), the cluster graph (G; C; S), and the initial {τ^{t=0}_ck(x_ck), ∀c_k ∈ C}.
Iterate until convergence (for both the outer loop and the inner loop):
1: Outer loop: Update θ^{Δ;t}_ck using Eq. (11).
2: Inner loop: Maximize the right side of Eq. 
(12) with fixed θ^{Δ;t}_ck and compute τ^{t+1}_ck(x_ck) using the belief propagation algorithm proposed in Liu and Ihler (2012):

m_{k→l}(z_skl) ∝ ψ_skl(z_skl) Σ_{z_ck∖s_kl} σ[ψ_ck(z_ck) m_{∼k}(z_ck)] / m_{l→k}(z_skl),

where ψ_ck(z_ck) = exp[θ^Δ_ck(z_ck)], and

σ[τ_ck(z_ck)] = { τ_ck(z_ck),                     c_k ∈ R
               { τ_ck(z_ck) τ_ck(d_ck | x_ck),   c_k ∈ A,

with τ_ck(z_ck) = ψ_ck(z_ck) m_{∼k}(z_ck), and

τ_ck(x_ck) = max_{d_ck} Σ_{y_ck, a} τ_ck(z_ck).

Output: The local policies {τ(d_i | x(Γ_i))}, and the value function v̂(x) = exp(Φ) τ(x).

6 Experiments

We perform experiments in two domains, disease management in crop fields and viral marketing, to evaluate the performance of our factored variational value iteration algorithm (FVI). For comparison, we use the approximate policy iteration algorithm (API) (Sabbadin et al., 2012), a mean-field based policy iteration approach, and the approximate linear programming algorithm (ALP) (Guestrin et al., 2001). To evaluate each algorithm's performance, we obtain its approximate local policy, then compute the expected value of the policy using either exact evaluation (if feasible) or a sample-based estimate (if not). We then compare the expected reward U^alg = (1/|X|) Σ_x v^alg(x) of each algorithm's policy.

6.1 Disease Management in Crop Fields

A graph-based MDP for disease management in crop fields was introduced in Sabbadin et al. (2012). Suppose we have a set of crop fields in an agricultural area, where each field is susceptible to contamination by a pathogen. When a field is contaminated, it can infect its neighbors, and its yield will decrease. However, if a field is left fallow, it has a probability (denoted by q) of recovering from infection.
The decisions each year include two options (D_i = {1, 2}) for each field: cultivate normally (d_i = 1) or leave fallow (d_i = 2). The problem is then to choose the optimal stationary policy that maximizes the expected discounted yield. The topology of the fields is represented by an undirected graph, where each node represents one crop field. An edge is drawn between two nodes if the fields share a common border (and can thus pass an infection). Each crop field can be in one of three states: x_i = 1 if it is uninfected, and x_i = 2 to x_i = 3 for increasing degrees of infection. The probability that a field moves from state x_i to state x_i + 1 under d_i = 1 is P = P(ε, p, n_i) = ε + (1 − ε)(1 − (1 − p)^{n_i}), where ε and p are parameters and n_i is the number of neighbors of i that are infected. The transition function is summarized in Table 1. The reward function depends on each field's state and local decision. The maximal yield r > 1 is achieved by an uninfected, cultivated field; otherwise, the yield decreases linearly with the level of infection, from maximal reward r to minimal reward 1 + r/10. A field left fallow produces reward 1.

Table 1: Local transition probabilities p(x′_i | x_{N(i)}, d_i), for the disease management problem.

               d_i = 1                    d_i = 2
            x_i=1   x_i=2   x_i=3      x_i=1   x_i=2   x_i=3
x′_i = 1    1−P     0       0          1       q       q/2
x′_i = 2    P       1−P     0          0       1−q     q/2
x′_i = 3    0       P       1          0       0       1−q

6.2 Viral Marketing

Viral marketing (Nath and Domingos, 2010, Richardson and Domingos, 2002) uses the natural premise that members of a social network influence each other's purchasing decisions or comments; the goal is then to select the best set of people to target for marketing such that overall profit is
Viral marketing has been previously framed as a one-shot in\ufb02uence diagram problem\n(Nath and Domingos, 2010). Here, we frame the viral marketing task as an MDP planning problem,\nwhere we optimize the stationary policy to maximize long-term reward.\nThe topology of the social network is represented by a directed graph, capturing directional social\nin\ufb02uence. We assume there are three states for each person in the social network: xi = 1 if i is\nmaking positive comments, xi = 2 if not commenting, and xi = 3 for negative comments. There is\na binary decision corresponding to each person i: market to this person (di = 1) or not (di = 2). We\nalso de\ufb01ne a local reward function: if a person gives good comments when di = 2, then the reward\nis r; otherwise, the reward is less, decreasing linearly to minimum value 1 + r/10. For marketed\n\nindividuals (di = 1), the reward is 1. The local transition p(cid:0)x(cid:48)\n\n(cid:1) is set as in Table 1.\n\ni|xN (i), di\n\n6.3 Experimental Results\nWe evaluate both problems above on two topologies of model, each of three sizes (6, 10, and 20\nnodes). Our \ufb01rst topology type are random, regular graphs with three neighbors per node. Our\nsecond are \u201cchain-like\u201d graphs, in which we order the nodes, then connect each node at random to\nfour of its six nearest nodes in the ordering. This ensures that the resulting graph has low tree-width\n(\u2264 6), enabling comparison of the ALP algorithm. We set parameters r = 10 and \u03b5 = 0.1, and test\nthe results on different choices of p and q.\nTables 2\u2013 4 show the expected rewards found by each algorithm for several settings. The best\nperformance (highest rewards) are labeled in bold. For models with 6 nodes, we also compute the\nexpected reward under the optimal global policy \u03c0\u2217 (x) for comparison. 
Note that this value over-estimates the best possible local policy {π∗_i(x[Γ_i])} being sought by the algorithms; the best local policy is usually much more difficult to compute due to imperfect recall. Since the complexity of the approximate linear programming (ALP) algorithm is exponential in the treewidth of the graph defined by the neighborhoods Γ_i, we were unable to compute results for models beyond treewidth 6.

The tables show that our factored variational value iteration (FVI) algorithm gives policies with higher expected rewards than those of approximate policy iteration (API) on the majority of models (156/196), across all sets of models and different p and q. Compared to approximate linear programming, in addition to being far more scalable, our algorithm performed comparably, giving better policies on just over half of the models (53/96) that ALP could be run on. However, when we restrict attention to low-treewidth “chain” models, we find that the ALP algorithm appears to perform better on larger models; it outperforms our FVI algorithm on only 4/32 models of size 6, but this increases to 14/32 at size 10, and 25/32 at size 20. It may be that ALP better exploits the structure of x in these cases, and a more careful choice of the cluster graph could similarly improve FVI.

The average results across all settings are shown in Table 5, along with the relative improvements of our factored variational value iteration algorithm over approximate policy iteration and approximate linear programming. Table 5 shows that our FVI algorithm, compared to approximate policy iteration, gives the best policies on regular models across sizes, and gives better policies than approximate linear programming on chain-like models of small size (6 and 10 nodes).
Although on average the approximate linear programming algorithm may provide better policies for large "chain" models, its exponential number of constraints makes it infeasible for general large-scale graph-based MDPs.

Table 2: The expected rewards of different algorithms on regular models with 6 nodes.

                       Disease Management                Viral Marketing
(p, q)       Exact    FVI      API      ALP     Exact    FVI      API      ALP
(0.2, 0.2)   202.4    202.4    164.7    148.3   259.3    258.2    250.0    237.7
(0.4, 0.2)   169.2    169.2    139.0    123.3   212.2    195.3    192.6    183.4
(0.6, 0.2)   158.1    155.2    157.4    115.4   209.6    167.8    174.0    156.4
(0.8, 0.2)   154.1    152.7    153.2    106.0   209.5    152.7    172.2    144.7
(0.2, 0.4)   262.5    259.2    254.7    236.7   361.6    361.6    355.8    355.0
(0.4, 0.4)   220.1    219.1    177.0    181.3   300.2    285.8    285.1    267.3
(0.6, 0.4)   212.1    203.8    203.8    162.7   297.3    244.6    249.6    244.8
(0.8, 0.4)   211.7    198.2    198.2    136.1   297.3    225.2    296.8    273.5
(0.2, 0.6)   349.3    349.3    333.6    307.3   428.1    428.1    428.1    427.7
(0.4, 0.6)   290.7    276.7    276.7    200.0   361.8    351.7    303.3    350.0
(0.6, 0.6)   284.7    242.7    243.7    212.8   355.5    304.7    152.5    306.5
(0.8, 0.6)   284.0    236.1    236.1    194.7   355.5    282.9    355.0    271.3
(0.2, 0.8)   423.6    423.6    423.6    274.7   470.0    469.8    469.8    469.8
(0.4, 0.8)   362.2    351.0    344.3    264.5   411.6    402.0    402.0    403.7
(0.6, 0.8)   351.6    304.8    302.7    242.5   398.2    347.8    351.8    336.6
(0.8, 0.8)   350.5    284.2    284.9    207.9   398.0    320.8    398.0    294.0

Table 3: The expected rewards of different algorithms on "chain-like" models with 10 nodes.

                 Disease Management           Viral Marketing
(p, q)       FVI      API      ALP       FVI      API      ALP
(0.3, 0.3)   288.9    258.4    304.8     355.5    324.1    335.5
(0.5, 0.3)   273.4    228.7    292.7     308.1    291.5    323.8
(0.7, 0.3)   262.2    261.6    329.6     298.5    298.1    269.7
(0.3, 0.5)   420.2    395.4    456.5     550.1    523.9    543.9
(0.5, 0.5)   302.6    317.7    358.5     453.3    450.9    410.0
(0.7, 0.5)   343.8    344.9    394.3     386.1    418.6    436.9
(0.3, 0.7)   531.2    613.6    612.9     659.9    634.8    664.7
(0.5, 0.7)   498.2    491.8    538.6     542.7    523.9    518.2
(0.7, 0.7)   427.3    411.8    430.0     496.9    495.7    451.2

Table 4: The expected rewards (×10²) of different algorithms on models with 20 nodes.

             Disease Manag.     Viral Marketing
(p, q)       FVI      API       FVI      API
(0.2, 0.2)   7.88     7.17      7.87     6.33
(0.6, 0.2)   5.33     5.28      5.99     4.94
(0.4, 0.4)   9.10     11.52     11.56    8.82
(0.4, 0.4)   7.04     7.65      7.95     6.17
(0.6, 0.6)   12.29    13.85     13.85    12.11
(0.6, 0.6)   10.02    8.50      10.22    8.72
(0.8, 0.8)   15.27    14.53     15.25    14.57
(0.8, 0.8)   10.90    11.50     11.82    10.78
(0.4, 0.2)   5.93     5.65      6.53     5.19
(0.8, 0.2)   5.62     5.12      5.76     5.20
(0.4, 0.4)   7.70     8.83      9.23     6.23
(0.4, 0.4)   6.72     7.14      7.45     6.72
(0.6, 0.6)   9.97     11.72     11.74    10.06
(0.6, 0.6)   8.01     8.88      9.23     7.69
(0.8, 0.8)   12.57    13.22     13.47    12.43
(0.8, 0.8)   9.92     10.64     10.77    9.56

Table 5: Comparison of average expected rewards on regular and "chain-like" models.

Type      Size     FVI      API      ALP      Rel. Imprv. (vs. API)   Rel. Imprv. (vs. ALP)
Regular   n = 6    275.8    271.4    --       1.6%                    --
Regular   n = 10   458.7    452.3    --       1.4%                    --
Regular   n = 20   935.6    905.1    --       3.37%                   --
Chain     n = 6    275.8    271.6    244.9    1.6%                    12.6%
Chain     n = 10   415.7    399.4    414.7    4.1%                    0.7%
Chain     n = 20   821.9    749.6    872.2    9.7%                    −5.8%

7 Conclusions

In this paper, we have proposed a variational planning framework for Markov decision processes. We used this framework to develop a factored variational value iteration algorithm that exploits the structure of the graph-based MDP to give efficient and accurate approximations, scales easily to large systems, and produces better policies than existing approaches. Potential future directions include studying methods for the choice of cluster graphs, and improved solutions for the dual approximation (12), such as developing single-loop message-passing algorithms to directly optimize (12).

Acknowledgments

This work was supported in part by National Science Foundation grants IIS-1065618 and IIS-1254071, a Microsoft Research Fellowship, National Natural Science Foundation of China (#61071131 and #61271388), Beijing Natural Science Foundation (#4122040), Research Project of Tsinghua University (#2012Z01011), and Doctoral Fund of the Ministry of Education of China (#20120002110036).

References

R. Iris Bahar, Erica A. Frohm, Charles M. Gaona, Gary D. Hachtel, Enrico Macii, Abelardo Pardo, and Fabio Somenzi. Algebraic decision diagrams and their applications. In IEEE/ACM International Conference on Computer-Aided Design, pages 188–191, 1993.

David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2011.

Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1):49–107, 2000.

Nicklas Forsell and Régis Sabbadin. Approximate linear-programming algorithms for graph-based Markov decision processes. Frontiers in Artificial Intelligence and Applications, 141:590, 2006.

Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs.
Advances in Neural Information Processing Systems, 14:1523–1530, 2001.

Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.

Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 279–288, 1999.

Daphne Koller and Ronald Parr. Policy iteration for factored MDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 326–334, 2000.

Qiang Liu and Alexander Ihler. Variational algorithms for marginal MAP. In Uncertainty in Artificial Intelligence (UAI), 2011.

Qiang Liu and Alexander Ihler. Belief propagation for structured decision making. In Uncertainty in Artificial Intelligence (UAI), pages 523–532, August 2012.

A. Nath and P. Domingos. Efficient belief propagation for utility maximization and repeated inference. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

Nathalie Peyrard and Régis Sabbadin. Mean field approximation of the policy iteration algorithm for graph-based Markov decision processes. Frontiers in Artificial Intelligence and Applications, 141:595, 2006.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 2009.

Matthew Richardson and Pedro Domingos. Mining knowledge-sharing sites for viral marketing. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 61–70, 2002.

R. Sabbadin, N. Peyrard, and N. Forsell. A framework and a mean-field algorithm for the local control of spatial processes.
International Journal of Approximate Reasoning, 53(1):66–86, 2012.

Ross D. Shachter. Model building with belief networks and influence diagrams. In Advances in Decision Analysis: From Foundations to Applications, pages 177–201, 2007.

Olivier Sigaud, Olivier Buffet, et al. Markov Decision Processes in Artificial Intelligence. ISTE-John Wiley & Sons, 2010.

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.