{"title": "Fairness in Multi-Agent Sequential Decision-Making", "book": "Advances in Neural Information Processing Systems", "page_first": 2636, "page_last": 2644, "abstract": "We define a fairness solution criterion for multi-agent decision-making problems, where agents have local interests. This new criterion aims to maximize the worst performance of agents with consideration on the overall performance. We develop a simple linear programming approach and a more scalable game-theoretic approach for computing an optimal fairness policy. This game-theoretic approach formulates this fairness optimization as a two-player, zero-sum game and employs an iterative algorithm for finding a Nash equilibrium, corresponding to an optimal fairness policy. We scale up this approach by exploiting problem structure and value function approximation. Our experiments on resource allocation problems show that this fairness criterion provides a more favorable solution than the utilitarian criterion, and that our game-theoretic approach is significantly faster than linear programming.", "full_text": "Fairness in Multi-Agent Sequential Decision-Making\n\nComputer Science and Arti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nChongjie Zhang and Julie A. Shah\n\n{chongjie,julie a shah}@csail.mit.edu\n\nCambridge, MA 02139\n\nAbstract\n\nWe de\ufb01ne a fairness solution criterion for multi-agent decision-making\nproblems, where agents have local interests. This new criterion aims\nto maximize the worst performance of agents with a consideration on\nthe overall performance. We develop a simple linear programming ap-\nproach and a more scalable game-theoretic approach for computing an\noptimal fairness policy. This game-theoretic approach formulates this\nfairness optimization as a two-player zero-sum game and employs an\niterative algorithm for \ufb01nding a Nash equilibrium, corresponding to an\noptimal fairness policy. 
We scale up this approach by exploiting problem structure and value function approximation. Our experiments on resource allocation problems show that this fairness criterion provides a more favorable solution than the utilitarian criterion, and that our game-theoretic approach is significantly faster than linear programming.\n\nIntroduction\n\nFactored multi-agent MDPs [4] offer a powerful mathematical framework for studying multi-agent sequential decision problems in the presence of uncertainty. Their compact representation allows us to model large multi-agent planning problems and to develop efficient methods for solving them. Existing approaches to solving factored multi-agent MDPs [4] have focused on the utilitarian solution criterion, i.e., maximizing the sum of individual utilities. The computed utilitarian solution is optimal from the system's perspective, where performance is additive. However, as the utilitarian solution often discriminates against some agents, it is not desirable for many practical applications where agents have their own interests and fairness is expected. For example, in manufacturing plants, resources need to be fairly and dynamically allocated to work stations on assembly lines in order to maximize throughput; in telecommunication systems, wireless bandwidth needs to be fairly allocated to avoid "unhappy" customers; and in transportation systems, traffic lights are controlled so that traffic flow is balanced.\nIn this paper, we define a fairness solution criterion, called regularized maximin fairness, for multi-agent MDPs. This criterion aims to maximize the worst performance of agents with consideration of the overall performance. We show that its optimal solution is Pareto-efficient. We focus on centralized joint policies, which are sensible for many practical resource allocation problems. 
We develop a simple linear programming approach and a more scalable game-theoretic approach for computing an optimal fairness policy. This game-theoretic approach formulates the fairness optimization for factored multi-agent MDPs as a two-player, zero-sum game. Inspired by theoretical results that two-player games tend to have a Nash equilibrium (NE) with a small support [7], we develop an iterative algorithm that incrementally solves this game by starting with a small subgame. This game-theoretic approach can scale up to large problems by relaxing the termination condition, exploiting problem structure in factored multi-agent MDPs, and applying value function approximation. Our experiments on a factory resource allocation problem show that this fairness criterion provides a more favorable solution than the utilitarian criterion [4], and our game-theoretic approach is significantly faster than linear programming.\n\nMulti-agent decision-making model and its fairness solution\n\nWe are interested in multi-agent sequential decision-making problems where agents have their own interests. We assume that agents are cooperating. Cooperation can be proactive, e.g., sharing resources with other agents to sustain cooperation that benefits all agents, or passive, where agents' actions are controlled by a third party, as with centralized resource allocation. We use a factored multi-agent Markov decision process (MDP) to model multi-agent sequential decision-making problems [4]. A factored multi-agent MDP is defined by a tuple ⟨I, X, A, T, {R_i}_{i∈I}, b⟩, where\nI = {1, . . . , n} is a set of agent indices.\nX is a state space represented by a set of state variables X = {X_1, . . . , X_m}. A state is defined by a vector x of value assignments to each state variable. 
We assume the domain of each\nvariable is \ufb01nite.\nA = \u00d7i\u2208I Ai is a \ufb01nite set of joint actions, where Ai is a \ufb01nite set of actions available for agent i.\nThe joint action a = (cid:104)a1, . . . , an(cid:105) is de\ufb01ned by a vector of individual action choices.\nT is the transition model. T (x(cid:48)|x, a) speci\ufb01es the probability of transitioning to the next state x(cid:48)\nafter a joint action a is taken in the current state x. As in [4], we assume that the transition\nmodel can be factored and compactly represented by a dynamic Bayesian network (DBN).\nRi(xi, ai) is a local reward function of agent i, which is de\ufb01ned on a small set of variables xi \u2286 X\nand ai \u2286 A.\n\nb is the initial distribution of states.\n\nThis model allows us to exploit problem structures to represent exponentially-large multi-agent\nMDPs compactly. Unlike factored MDPs de\ufb01ned in [4], which have one single reward function rep-\nresented by a sum of partial reward functions, this multi-agent model has a local reward function for\neach agent. From the multi-agent perspective, existing approaches to factored MDPs [4] essentially\naim to compute a control policy that maximizes the utilitarian criterion (i.e., the sum of individual\nutilities). As the utilitarian criterion often provides a solution that is not fair or satisfactory for some\nagents (e.g., as shown in the experiment section), it may not be desirable for problems where agents\nhave local interests.\nIn contrast to the utilitarian criterion, an egalitarian criterion, called maximin fairness, has been\nstudied in networking [1, 9], where resources are allocated to optimize the worst performance. This\negalitarian criterion exploits the maximin principle in Rawlsian theory of justice [14], maximizing\nthe bene\ufb01ts of the least-advantaged members of society. 
In the following, we define a fairness solution criterion for multi-agent MDPs by adapting and combining the maximin fairness criterion and the utilitarian criterion. Under this new criterion, an optimal policy for multi-agent MDPs aims to maximize the worst performance of agents with consideration of the overall performance.\nA joint stochastic policy π : X × A → ℝ is a function that returns the probability of taking joint action a ∈ A in any given state x ∈ X. The utility of agent i under a joint policy π is defined as its infinite-horizon, total discounted reward, denoted by\n\nψ(i, π) = E[ Σ_{t=0}^∞ λ^t R_i(x_t, a_t) | π, b ],  (1)\n\nwhere λ is the discount factor, the expectation operator E(·) averages over stochastic action selection and state transitions, b is the initial state distribution, and x_t and a_t are the state and the joint action taken at time t, respectively.\nTo achieve both fairness and efficiency, our goal for a given multi-agent MDP is to find a joint control policy π*, called a regularized maximin fairness policy, that maximizes the following objective value function\n\nV(π) = min_{i∈I} ψ(i, π) + (ε/n) Σ_{i∈I} ψ(i, π),  (2)\n\nwhere n = |I| is the number of agents and ε is a strictly positive real number, chosen to be arbitrarily small.¹ This fairness objective function can be seen as a lexicographic aggregation of the egalitarian criterion (min) and the utilitarian criterion (sum of utilities), with priority given to egalitarianism. This fairness criterion can also be seen as a particular instance of the weighted Tchebycheff distance with respect to a reference point, a classical scalarization function used to generate compromise solutions in multi-objective optimization [16]. 
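As a concrete illustration of objective (2), the following minimal sketch (with invented utility numbers, not outputs of any actual policy) shows that the regularized maximin value prefers a balanced utility vector over a skewed one, even when the skewed vector has a larger utilitarian sum:

```python
# Sketch of the regularized maximin objective V(pi) from equation (2),
# computed from a vector of per-agent utilities psi(i, pi).
# The utility vectors below are illustrative, not derived from a real MDP.

def regularized_maximin_value(utilities, eps=0.01):
    """V = min_i psi_i + (eps / n) * sum_i psi_i."""
    n = len(utilities)
    return min(utilities) + (eps / n) * sum(utilities)

fair = [10.0, 10.0, 10.0]     # balanced performance across three agents
skewed = [2.0, 15.0, 15.0]    # larger total, but one agent is starved

# The skewed vector wins under the utilitarian sum, yet loses under V,
# because V is dominated by the worst-off agent.
assert sum(skewed) > sum(fair)
assert regularized_maximin_value(fair) > regularized_maximin_value(skewed)
```

The tiny ε term only breaks ties among policies with the same worst-case utility, which is what yields the Pareto-efficiency guarantee discussed next.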
Note that the optimal policy under the egalitarian (or maximin) criterion alone may not be Pareto efficient, but the optimal policy under this regularized fairness criterion is guaranteed to be Pareto efficient.\nDefinition 1. A joint control policy π is said to be Pareto efficient if and only if there does not exist another joint policy π′ whose utility is at least as high for all agents and strictly higher for at least one agent, that is, ∄π′ : ∀i, ψ(i, π′) ≥ ψ(i, π) ∧ ∃i, ψ(i, π′) > ψ(i, π).\nProposition 1. A regularized maximin fairness policy π* is Pareto efficient.\nProof. We prove by contradiction. Assume the regularized maximin fairness policy π* is not Pareto efficient. Then there must exist a policy π such that ∀i, ψ(i, π) ≥ ψ(i, π*) ∧ ∃i, ψ(i, π) > ψ(i, π*). Then V(π) = min_{i∈I} ψ(i, π) + (ε/n) Σ_{i∈I} ψ(i, π) > min_{i∈I} ψ(i, π*) + (ε/n) Σ_{i∈I} ψ(i, π*) = V(π*), which contradicts the precondition that π* is a regularized maximin fairness policy.\nIn this paper, we mainly focus on centralized policies for multi-agent MDPs. This focus is sensible because we assume that, although agents have local interests, they are also willing to cooperate. 
Many practical problems modeled by multi-agent MDPs use centralized policies to achieve fairness, e.g., network bandwidth allocation by telecommunication companies, traffic congestion control, public service allocation, and, more generally, fair resource allocation under uncertainty. On the other hand, we can derive decentralized policies for individual agents from a maximin fairness policy π* by marginalizing it over the actions of all other agents. If the maximin fairness policy is deterministic, then the derived decentralized policy profile is also optimal under the regularized maximin fairness criterion. Although such a guarantee generally does not hold for stochastic policies, as indicated by the following proposition, the derived decentralized policy is a bounded solution in the space of decentralized policies under the regularized maximin fairness criterion.\nProposition 2. Let π^{c*} be an optimal centralized policy and π^{dec*} be an optimal decentralized policy profile under the regularized maximin fairness criterion. Let π^{dec} be a decentralized policy profile derived from π^{c*} by marginalization. The values of the policies π^{c*} and π^{dec} provide bounds for the value of π^{dec*}, that is,\n\nV(π^{c*}) ≥ V(π^{dec*}) ≥ V(π^{dec}).\n\nThe proof of this proposition is quite straightforward. The first inequality holds because any decentralized policy profile can be converted to a centralized policy by taking the product, and the second inequality holds because π^{dec*} is an optimal decentralized policy profile. 
When the bounds provided by V(π^{c*}) and V(π^{dec}) are close, we can conclude that π^{dec} is almost an optimal decentralized policy profile under the regularized maximin fairness criterion.\nIn this paper, we are primarily concerned with total discounted rewards over an infinite horizon, but the definition, analysis, and computation of regularized maximin fairness can be adapted to a finite horizon with an undiscounted sum of rewards. In the next section, we present approaches to computing the regularized maximin fairness policy for infinite-horizon multi-agent MDPs.\n\nComputing Regularized Maximin Fairness Policies\n\nIn this section, we present two approaches to computing regularized maximin fairness policies for multi-agent MDPs: a simple linear programming approach and a game-theoretic approach. The former is adapted from the linear programming formulation of single-agent MDPs. The latter formulates this fairness problem as a two-player zero-sum game and employs an iterative search method for finding a Nash equilibrium that contains a regularized maximin fairness policy. This iterative algorithm allows us to scale up to large problems by exploiting structure in multi-agent MDPs and value function approximation, and by employing a relaxed termination condition.\n\n¹In some applications, we may choose a suitably large ε to trade off fairness against the overall performance.\n\nA linear programming approach\n\nFor a multi-agent MDP, given a joint policy and the initial state distribution, the frequencies of visiting state-action pairs are uniquely determined. We use f_π(x, a) to denote the total discounted probability, under the policy π and initial state distribution b, that the system occupies state x and chooses action a. 
Using this frequency function, we can rewrite the expected total discounted reward as follows:\n\nψ(i, π) = Σ_x Σ_a f_π(x, a) R_i(x_i, a_i),  (3)\n\nwhere x_i ⊆ x and a_i ⊆ a.\nSince the dynamics of a multi-agent MDP are Markovian, as they are for a single-agent MDP, we can adapt the linear programming formulation of single-agent MDPs for finding an optimal centralized policy for multi-agent MDPs under the regularized maximin fairness criterion as follows:\n\nmax_f  min_{i∈I} Σ_x Σ_a f(x, a) R_i(x_i, a_i) + (ε/n) Σ_{i∈I} Σ_x Σ_a f(x, a) R_i(x_i, a_i)\ns.t.  Σ_a f(x′, a) = b(x′) + λ Σ_x Σ_a T(x′|x, a) f(x, a),  ∀x′ ∈ X\nf(x, a) ≥ 0,  for all a ∈ A and x ∈ X.  (4)\n\nConstraints are included to ensure that f(x, a) is well-defined. The first set of constraints requires that the discounted probability of visiting state x′ equals the initial probability of state x′ plus the sum of all discounted probabilities of entering state x′. 
We linearize this program by introducing another variable z, which represents the minimum expected total discounted reward among all agents, as follows:\n\nmax_{f,z}  z + (ε/n) Σ_{i∈I} Σ_x Σ_a f(x, a) R_i(x_i, a_i)\ns.t.  z ≤ Σ_x Σ_a f(x, a) R_i(x_i, a_i),  ∀i ∈ I\nΣ_a f(x′, a) = b(x′) + λ Σ_x Σ_a T(x′|x, a) f(x, a),  ∀x′ ∈ X\nf(x, a) ≥ 0,  for all a ∈ A and x ∈ X.  (5)\n\nWe can employ existing linear programming solvers (e.g., the simplex method) to compute an optimal solution f* for problem (5) and derive a policy π* from f* by normalization:\n\nπ*(x, a) = f*(x, a) / Σ_{a∈A} f*(x, a).  (6)\n\nUsing Theorem 6.9.1 in [13], we can easily show that the derived policy π* is optimal under the regularized maximin fairness criterion. This linear programming approach is simple, but it does not scale to multi-agent MDPs with large state spaces or large numbers of agents, because the number of constraints of the linear program is |X| + |I|. In the next section, we present a more scalable game-theoretic approach for large multi-agent MDPs.\n\nA game-theoretic approach\n\nSince the fairness objective function in (2) can be turned into a maximin function, inspired by von Neumann's minimax theorem, we can formulate this optimization problem as a two-player zero-sum game. Motivated by theoretical results that two-player games tend to have a Nash equilibrium (NE) with a small support, we develop an iterative algorithm for solving such zero-sum games.\nLet Π^S and Π^D be the sets of stochastic Markovian policies and deterministic Markovian policies, respectively. 
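To make the linearized program (5) concrete, the following sketch solves it for a hypothetical two-state, two-action, two-agent instance with an off-the-shelf LP solver (SciPy is used purely for illustration; all numbers are invented):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy instance: 2 states, 2 joint actions, 2 agents.
nX, nA, n = 2, 2, 2
lam, eps = 0.95, 0.01
b = np.array([0.5, 0.5])                      # initial state distribution
T = np.zeros((nX, nA, nX))                    # T[x, a, x'] transition probs
T[0, 0] = [0.9, 0.1]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.7, 0.3]; T[1, 1] = [0.1, 0.9]
R = np.zeros((n, nX, nA))                     # R[i, x, a] local rewards
R[0] = [[1.0, 0.0], [0.0, 0.0]]               # agent 0 rewarded at (x=0, a=0)
R[1] = [[0.0, 0.0], [0.0, 1.0]]               # agent 1 rewarded at (x=1, a=1)

# Decision variables: f(x, a) flattened row-major, then z.
# Objective of (5): max z + (eps/n) * sum_i sum_{x,a} f(x,a) R_i(x,a).
c = np.zeros(nX * nA + 1)
c[:-1] = -(eps / n) * R.sum(axis=0).ravel()   # linprog minimizes, so negate
c[-1] = -1.0

# z <= sum_{x,a} f(x,a) R_i(x,a)  for every agent i.
A_ub = np.hstack([-R.reshape(n, -1), np.ones((n, 1))])
b_ub = np.zeros(n)

# Flow conservation: sum_a f(x',a) - lam * sum_{x,a} T(x'|x,a) f(x,a) = b(x').
A_eq = np.zeros((nX, nX * nA + 1))
for xp in range(nX):
    for x in range(nX):
        for a in range(nA):
            A_eq[xp, x * nA + a] = (xp == x) - lam * T[x, a, xp]
b_eq = b

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (nX * nA) + [(None, None)])
f = res.x[:-1].reshape(nX, nA)
policy = f / f.sum(axis=1, keepdims=True)     # normalization, equation (6)
```

As a sanity check, the total discounted occupancy mass must equal 1/(1 − λ), and each row of the normalized policy sums to one.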
As shown in [13], every stochastic policy can be represented by a convex combination of deterministic policies, and every convex combination of deterministic policies corresponds to a stochastic policy. Specifically, for any stochastic policy π^s ∈ Π^S, we can represent π^s = Σ_i p_i π^d_i using some set {π^d_1, . . . , π^d_k} ⊂ Π^D with probability distribution p.\n\nAlgorithm 1: An iterative approach to computing the regularized maximin fairness policy\n1 Initialize a zero-sum game G(Π̄^D, Ī) with small subsets Π̄^D ⊂ Π^D and Ī ⊂ I;\n2 repeat\n3   (p*, q*, V*) ← compute a Nash equilibrium of game G(Π̄^D, Ī);\n4   (π^d, V_p) ← compute the best-response deterministic policy against q* in G(Π^D, I);\n5   if V_p > V* then Π̄^D ← Π̄^D ∪ {π^d};\n6   (i, V_q) ← compute the best response against p* among all agents I;\n7   if V_q < V* then Ī ← Ī ∪ {i};\n8   if G(Π̄^D, Ī) changes then expand its payoff matrix with U(π^d, i) for new pairs (π^d, i);\n9 until game G(Π̄^D, Ī) converges;\n10 return the regularized maximin fairness policy π^s_{p*} = p* · Π̄^D;\n\nLet U(π, i) = ψ(i, π) + (ε/n) Σ_{j∈I} ψ(j, π). We can construct a two-player zero-sum game G(Π^D, I) as follows: the maximizing player, who aims to maximize the value of the game, chooses a deterministic policy π^d from Π^D; the minimizing player, who aims to minimize the value of the game, chooses an agent of the multi-agent MDP, indexed by i ∈ I; and the payoff matrix has an entry U(π^d, i) for each pair π^d ∈ Π^D and i ∈ I. 
The following proposition shows that we can compute the regularized maximin fairness policy by solving G(Π^D, I).\nProposition 3. Let the strategy profile (p*, q*) be a NE of the game G(Π^D, I), and let the stochastic policy π^s_{p*} be derived from (p*, q*) with π^s_{p*}(x, a) = Σ_i p*_i π^d_i(x, a), where p*_i is the ith component of p*, i.e., the probability of choosing the deterministic policy π^d_i ∈ Π^D. Then π^s_{p*} is a regularized maximin fairness policy.\nProof. According to von Neumann's minimax theorem, p* is also the maximin strategy for the zero-sum game G(Π^D, I). Then\n\nmin_i U(π^s_{p*}, i) = min_i Σ_j p*_j U(π^d_j, i)  (let π^s_{p*} = Σ_j p*_j π^d_j)\n= min_q Σ_j Σ_i p*_j q_i U(π^d_j, i)  (there always exists a pure best-response strategy)\n= max_p min_q Σ_j Σ_i p_j q_i U(π^d_j, i)  (p* is the maximin strategy)\n= max_p min_i Σ_j p_j U(π^d_j, i)  (the inner minimum is attained at a pure strategy i)\n= max_{π_p} min_i U(π_p, i).  (let π_p = Σ_j p_j π^d_j)\n\nBy definition, π^s_{p*} is a regularized maximin fairness policy.\nAs the game G(Π^D, I) is usually extremely large, and computing its payoff matrix is also non-trivial, it is impractical to solve this game directly with linear programming. On the other hand, existing work, such as [7], which analyzes the theoretical properties of the NE of games drawn from a particular distribution, shows that the support sizes of Nash equilibria tend to be balanced and small, especially for n = 2. 
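Computing the NE of a two-player zero-sum game, as required by the reformulation above, reduces to a small linear program for the maximizing player's maximin mixture. A minimal sketch (the 2×2 payoff matrix is an illustrative stand-in for real policy-versus-agent payoffs U(π^d_j, i)):

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(U):
    """Maximin mixture p* over rows of payoff matrix U[j, i], and game value.

    LP: maximize v  s.t.  sum_j p_j U[j, i] >= v for all columns i,
        sum_j p_j = 1,  p >= 0.
    """
    m, k = U.shape
    c = np.zeros(m + 1); c[-1] = -1.0                  # minimize -v
    A_ub = np.hstack([-U.T, np.ones((k, 1))])          # v - p^T U[:, i] <= 0
    b_ub = np.zeros(k)
    A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0      # sum_j p_j = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:-1], res.x[-1]

# Two deterministic policies (rows) against two agents (columns).
U = np.array([[3.0, 0.0],
              [1.0, 2.0]])
p_star, value = solve_zero_sum(U)   # equalizing mixture p* = [0.25, 0.75]
```

The minimizing player's equilibrium strategy q* can be recovered analogously from the transposed (negated) matrix, or from the duals of this LP.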
Prior work [11] demonstrated that it is beneficial to exploit these results when finding a NE, especially in two-player games. Inspired by these results, we develop an iterative method to compute a fairness policy, shown in Algorithm 1.\nIntuitively, Algorithm 1 works as follows. It starts by computing a NE for a small subgame (Line 3) and then checks whether this NE is also a NE of the whole game (Lines 4-7); if not, it expands the subgame and repeats this process until a NE is found for the whole game.\nLine 1 initializes a small subgame of the original game, which can be arbitrary. In our experiments, it is initialized with a random agent and a policy maximizing this agent's utility. Line 3 solves the two-player zero-sum subgame using linear programming or any other suitable technique; V* is the maximin value of this subgame. The best-response problem in Line 4 is to find a deterministic policy π that maximizes the following payoff:\n\nU(π, q*) = Σ_{i∈Ī} q*_i U(π, i) = Σ_{i∈Ī} q*_i [ψ(i, π) + (ε/n) Σ_{j∈I} ψ(j, π)] = Σ_{i∈I} (q*_i + ε/n) ψ(i, π).\n\nSolving this optimization problem is equivalent to finding the optimal policy of a regular MDP with reward function R(x, a) = Σ_{i∈I} (q*_i + ε/n) R_i(x_i, a_i). We can use the dual linear programming approach [13] for this MDP, which outputs the visitation frequency function f_{π^d}(x, a) representing the optimal policy. This representation facilitates the computation of the payoff U(π^d, i) using Equation 3. V_p = Σ_i q*_i U(π^d, i) is the maximizing player's utility of its best response against q*.\nLine 5 checks whether the best response π^d is strictly better than p*. 
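The reduction behind Line 4 — the best response is simply an optimal policy of an MDP whose reward is the q*-weighted sum of local rewards — can be sketched with plain value iteration on a tiny invented instance (the paper itself uses the dual LP, and later approximate LP, for this step):

```python
import numpy as np

# Best response of Line 4 as a single MDP with weighted reward
# R(x, a) = sum_i (q*_i + eps/n) R_i(x, a).  Arrays are illustrative only.

def best_response_policy(T, R_local, q_star, eps=0.01, lam=0.95, iters=2000):
    n, nX, nA = R_local.shape
    w = q_star + eps / n                       # per-agent weights
    R = np.tensordot(w, R_local, axes=1)       # weighted reward, shape (nX, nA)
    V = np.zeros(nX)
    for _ in range(iters):                     # standard value iteration
        V = (R + lam * T @ V).max(axis=1)
    Q = R + lam * T @ V
    return Q.argmax(axis=1), V                 # deterministic policy pi^d(x)

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])       # T[x, a, x']
R_local = np.array([[[1.0, 0.0], [0.0, 0.0]],  # agent 0's local reward
                    [[0.0, 0.0], [0.0, 1.0]]]) # agent 1's local reward
# If q* puts all mass on agent 0, the best response chases agent 0's reward.
pi_d, V = best_response_policy(T, R_local, q_star=np.array([1.0, 0.0]))
```

Because the weights q*_i + ε/n are fixed during each iteration, any single-agent MDP solver can be substituted here without changing the outer loop of Algorithm 1.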
If this is true, we can infer that p* is not the best response against q* in the whole game and that π^d is not yet in Π̄^D; π^d is then added to Π̄^D to expand the subgame.\nLine 6 finds the minimizing player's best response against p*, which minimizes the payoff of the maximizing player. Note that there always exists a pure best-response strategy, so we formulate this best-response problem as follows:\n\nmin_{i∈I} U(π_{p*}, i) = min_{i∈I} Σ_j p*_j U(π^d_j, i),  (7)\n\nwhere π_{p*} is the stochastic policy corresponding to the probability distribution p*. We can solve this problem by directly searching for the agent i that yields the minimum utility, in linear time. Similar to Line 5, Line 7 checks whether the minimizing player strictly prefers i to q* against p* and expands the subgame if needed. The algorithm terminates when the subgame does not change.\nProposition 4. Algorithm 1 converges to a regularized maximin fairness policy.\nProof. The convergence of this algorithm follows immediately because there exists a finite number of deterministic Markovian policies and agents for a given multi-agent MDP. The algorithm terminates if and only if neither of the conditions in Lines 5 and 7 holds. This situation indicates that no player strictly prefers a strategy outside the support of its current strategy, which implies that (p*, q*) is a NE of the whole game G(Π^D, I). Using Proposition 3, we conclude that Algorithm 1 returns a regularized maximin fairness policy.\nAlgorithm 1 shares some similarities with the double oracle algorithm proposed in [8] for iteratively solving zero-sum games. The double oracle method is motivated by the Benders decomposition technique, while our iterative algorithm exploits properties of the Nash equilibrium, which leads to a more efficient implementation. 
For example, unlike our algorithm, the double oracle method checks whether the computed best-response MDP policy already exists in the current subgame by comparison, which is time-consuming for MDP policies with a large state space.\n\nScaling the game-theoretic approach\n\nBoth linear programming and the game-theoretic approach suffer from scalability issues on large problems. In multi-agent MDPs, the state space is exponential in the number of state variables and the action space is exponential in the number of agents. This results in an exponential number of variables and constraints in the linear program formulation. In this section, we investigate methods to scale up the game-theoretic approach.\nThe major bottleneck of the iterative algorithm is the computation of the best-response policy (Line 4 in Algorithm 1). As discussed in the previous section, this optimization is equivalent to finding the optimal policy of a regular MDP with reward function R(x, a) = Σ_{i∈I} (q*_i + ε/n) R_i(x_i, a_i). Due to the exponential state-action space, exact algorithms (e.g., linear programming) are impractical in most cases. Fortunately, this MDP is essentially a factored MDP [4] with a weighted sum of partial reward functions. We can use existing approximate algorithms [4] to solve factored MDPs, which exploit both the factored structure of the problem and value function approximation. 
For example, the approximate linear programming approach for factored MDPs can provide efficient policies with up to an exponential reduction in computation time.\n\nTable 1: Performance in sample problems with different cell sizes and total resources\n#C  #R  #N   Time-LP  Time-GT  Sol-LP  Sol-GT\n4   12  7E4  68.22s   11.43s   157.67  154.24\n4   20  3E5  22.39m   35.27s   250.59  239.87\n5   10  4E5  89.77m   48.56s   104.33  97.48\n5   20  6E6  -        4.98m    -       189.62\n6   18  5E7  -        43.36m   -       153.63\n\nTable 2: A comparison of three criteria in a 4-agent 20-resource problem\nC    MPE     Utilitarian  Fairness\n1    180.41  117.44       250.59\n2    198.45  184.20       250.59\n3    216.49  290.69       250.59\n4    234.53  444.08       250.59\nMin  108.22  68.32        157.67\n\nA few subtleties are worth noting when approximate linear programming is employed. First, the best response's utility V_p should be computed by evaluating the computed approximate policy against q*, instead of directly using the value from the approximate value function; otherwise, the convergence of Algorithm 1 is not guaranteed. Similarly, the payoff U(π^d, i) should be calculated through policy evaluation. Second, existing approximate algorithms for factored MDPs usually output a deterministic policy π^d(x) that is not represented by the visitation frequency function f_π(x, a). In order to facilitate the policy evaluation, we may convert a policy π^d(x) to a frequency function f_{π^d}(x, a). Note that f_{π^d}(x, a) = 0 for all a ≠ π^d(x). For the other state-action pairs, we can compute the visitation frequencies by solving the following equation:\n\nf_{π^d}(x′, π^d(x′)) = b(x′) + λ Σ_x T(x′|x, π^d(x)) f_{π^d}(x, π^d(x)).  (8)\n\nThis equation can be solved approximately but more efficiently using an iterative method, similar to MDP value iteration. 
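A minimal sketch of such a fixed-point iteration for equation (8), on hypothetical two-state dynamics (as a check, the discounted occupancy mass must sum to 1/(1 − λ)):

```python
import numpy as np

# Iteratively computing the visitation frequencies f_{pi^d}(x, pi^d(x)) of a
# deterministic policy, analogous to value iteration.  Arrays are illustrative.

def occupancy_frequencies(T, pi_d, b, lam=0.95, iters=2000):
    nX = len(b)
    f = np.zeros(nX)                            # f[x] = f_pi(x, pi_d(x))
    for _ in range(iters):
        flow = np.zeros(nX)
        for x in range(nX):
            flow += lam * f[x] * T[x, pi_d[x]]  # lam * T(x'|x, pi_d(x)) f(x)
        f = b + flow                            # one application of eq. (8)
    return f

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])        # T[x, a, x']
b = np.array([0.5, 0.5])
pi_d = np.array([0, 1])                         # a hypothetical deterministic policy
f = occupancy_frequencies(T, pi_d, b)
```

Since λ < 1, the update is a contraction and the iteration converges geometrically to the unique solution of (8), which is why it behaves like value iteration.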
Finally, Algorithm 1 is still guaranteed to converge, but it may return a sub-optimal solution.\nWe can also speed up Algorithm 1 by relaxing its termination condition, which essentially reduces the number of iterations. We can use the termination condition V_p − V_q < ε, which turns the iterative approach into an approximation algorithm.\nProposition 5. The iterative approach using the termination condition V_p − V_q < ε has bounded error ε.\nProof. Let V^opt be the value of the regularized maximin fairness policy and V(π*) be the value of the computed policy π*. By definition, V^opt ≥ V(π*). Following von Neumann's minimax theorem, we have V_p ≥ V^opt ≥ V_q. Since V_q is the value of the minimizing player's best response against π*, V^opt ≥ V(π*) ≥ V_q > V_p − ε ≥ V^opt − ε.\n\nExperiments\n\nOne motivating domain for our work is resource allocation in a pulse-line manufacturing plant. In a pulse-line factory, the manufacturing process of complex products is divided into several stages, each of which contains a set of tasks to be done in a corresponding work cell. The overall performance of a pulse line is mainly determined by the worst performance of its work cells. Considering the dynamics and uncertainty of the manufacturing environment, we need to dynamically allocate resources to balance the progress of work cells in order to optimize the throughput of the pulse line.\nWe evaluate our fairness solution criterion and its computational approaches, linear programming (LP) and the game-theoretic (GT) approach with approximation, on this resource allocation problem. For simplicity, we focus on managing one type of resource. We view each work cell in a pulse line as an agent. Each agent's state is represented by two variables: task level (i.e., high or low) and the number of local resources. 
An agent's next task level is affected by the current task levels of itself and the previous agent. An action is defined on a directed link between two agents, representing the transfer of one unit of resource from one agent to another. There is one additional action shared by all agents: "no change". We assume only neighboring agents can transfer resources. An agent's reward is measured by the number of partially-finished products that it will process between two decision points, given its current task level and resources. We use a discount factor λ = 0.95. We use the approximate linear programming technique presented in [4] for solving the factored MDPs generated in the GT approach. We used Java for our implementation and Gurobi 2.6 [5] for solving linear programs, and ran experiments on a 2.4GHz Intel Core i5 with 8GB RAM.\nTable 1 shows the performance of linear programming and the game-theoretic approach on different problems, varying the number of work cells #C and total resources #R. The third column #N = |X||A| is the state-action space size. We can observe that the game-theoretic approach is significantly faster than linear programming. This speed improvement is largely due to the integration of approximate linear programming, which exploits the problem structure and value function approximation. In addition, the game-theoretic approach scales well to large problems. With 6 cells and 18 resources, the size of the state-action space is around 5 · 10^7. The last two columns show the minimum expected reward among agents, which determines the performance of the pulse line. The game-theoretic approach incurs less than an 8% loss relative to the optimal solution computed by LP.\nWe also compare the regularized maximin fairness criterion against the utilitarian criterion (i.e., maximizing the sum of individual utilities) and the Markov perfect equilibrium (MPE). 
MPE is an extension of Nash equilibrium to stochastic games. One obvious MPE in our resource allocation problem is for no agent to transfer any resources to other agents. We evaluated the three criteria on different problems; the results are qualitatively similar. Table 2 shows the performance of all work cells under the optimal policy of each criterion in a problem with 4 agents and 20 resources. The fairness policy balanced the performance of all agents and provided a better solution (i.e., a greater minimum utility) than the other criteria. The near-perfect balance is due to the stochasticity of the computed policy. Even in terms of the sum of utilities, the fairness policy incurs less than a 4% loss relative to the optimal policy under the utilitarian criterion. The utilitarian criterion generated a highly skewed solution with the lowest minimum utility among the three criteria. In addition, under the fairness criterion all agents performed better than under MPE, which suggests that cooperation is beneficial for all of them in this problem.

Related Work

When using centralized policies, our multi-agent MDPs can also be viewed as multi-objective MDPs [15]. Recently, Ogryczak et al. [10] defined a compromise solution for multi-objective MDPs using the Tchebycheff scalarization function. They developed a linear programming approach for finding such compromise solutions; however, it is computationally impractical for most real-world problems. In contrast, we develop a more scalable game-theoretic approach for finding fairness solutions by exploiting structure in multi-agent factored MDPs and value function approximation.
The notion of maximin fairness is also widely used in the field of networking, for applications such as bandwidth sharing, congestion control, routing, load balancing, and network design [1, 9].
In contrast to our work, maximin fairness in networking is defined without regularization, addresses only one-shot resource allocation, and does not consider the dynamics and uncertainty of the environment.
Fair division is an active research area in economics, especially in social choice theory. It is concerned with the division of a set of goods among several people such that each person receives his or her due share. In recent years, fair division has attracted the attention of AI researchers [2, 12], who envision applications of fair division in multi-agent systems, especially for multi-agent resource allocation [3, 6]. Fair division theory focuses on proportional fairness and envy-freeness. Most existing work in fair division involves a static setting, where all relevant information is known upfront and is fixed. Only a few approaches deal with the dynamics of agent arrivals and departures [6, 17]. In contrast to our model and approach, these dynamic approaches to fair division do not address uncertainty or other dynamics, such as changes in resource availability and users' resource demands.

Conclusion

In this paper, we defined a fairness solution criterion, called regularized maximin fairness, for multi-agent decision-making under uncertainty. This criterion aims to maximize the worst performance among agents while considering the overall performance of the system. It has potential applications in various domains, including resource sharing, public service allocation, load balancing, and congestion control. We also developed a simple linear programming approach and a more scalable game-theoretic approach for computing the optimal policy under this new criterion. The game-theoretic approach scales to large problems by exploiting the problem structure and value function approximation.

References
[1] Thomas Bonald and Laurent Massoulié. Impact of fairness on internet performance.
In Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 82–91, 2001.

[2] Yiling Chen, John Lai, David C. Parkes, and Ariel D. Procaccia. Truth, justice, and cake cutting. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

[3] Yann Chevaleyre, Paul E. Dunne, Ulle Endriss, Jérôme Lang, Michel Lemaître, Nicolas Maudet, Julian A. Padget, Steve Phelps, Juan A. Rodríguez-Aguilar, and Paulo Sousa. Issues in multiagent resource allocation. Informatica (Slovenia), 30(1):3–31, 2006.

[4] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.

[5] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2014.

[6] Ian A. Kash, Ariel D. Procaccia, and Nisarg Shah. No agent left behind: dynamic fair division of multiple resources. In International Conference on Autonomous Agents and Multi-Agent Systems, pages 351–358, 2013.

[7] Andrew McLennan and Johannes Berg. Asymptotic expected number of Nash equilibria of two-player normal form games. Games and Economic Behavior, 51(2):264–295, 2005.

[8] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the Twentieth International Conference on Machine Learning, pages 536–543, 2003.

[9] Dritan Nace and Michal Pióro. Max-min fairness and its applications to routing and load-balancing in communication networks: A tutorial. IEEE Communications Surveys and Tutorials, 10(1-4):5–17, 2008.

[10] Wlodzimierz Ogryczak, Patrice Perny, and Paul Weng.
A compromise programming approach to multiobjective Markov decision processes. International Journal of Information Technology and Decision Making, 12(5):1021–1054, 2013.

[11] Ryan Porter, Eugene Nudelman, and Yoav Shoham. Simple search methods for finding a Nash equilibrium. In Proceedings of the 19th National Conference on Artificial Intelligence, pages 664–669, 2004.

[12] Ariel D. Procaccia. Thou shalt covet thy neighbor's cake. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 239–244, 2009.

[13] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 2005.

[14] John Rawls. A Theory of Justice. Harvard University Press, Cambridge, MA, 1971.

[15] Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, October 2013.

[16] Ralph E. Steuer. Multiple Criteria Optimization: Theory, Computation, and Application. John Wiley, 1986.

[17] Toby Walsh. Online cake cutting. In Algorithmic Decision Theory - Second International Conference, volume 6992 of Lecture Notes in Computer Science, pages 292–305, 2011.