{"title": "An MDP-Based Approach to Online Mechanism Design", "book": "Advances in Neural Information Processing Systems", "page_first": 791, "page_last": 798, "abstract": "", "full_text": "An MDP-Based Approach to Online\n\nMechanism Design\n\nDavid C. Parkes\n\nDivision of Engineering and Applied Sciences\n\nHarvard University\n\nparkes@eecs.harvard.edu\n\nSatinder Singh\n\nComputer Science and Engineering\n\nUniversity of Michigan\n\nbaveja@umich.edu\n\nAbstract\n\nOnline mechanism design (MD) considers the problem of provid-\ning incentives to implement desired system-wide outcomes in sys-\ntems with self-interested agents that arrive and depart dynami-\ncally. Agents can choose to misrepresent their arrival and depar-\nture times, in addition to information about their value for di(cid:11)erent\noutcomes. We consider the problem of maximizing the total long-\nterm value of the system despite the self-interest of agents. The\nonline MD problem induces a Markov Decision Process (MDP),\nwhich when solved can be used to implement optimal policies in a\ntruth-revealing Bayesian-Nash equilibrium.\n\n1\n\nIntroduction\n\nMechanism design (MD) is a sub(cid:12)eld of economics that seeks to implement par-\nticular outcomes in systems of rational agents [1]. Classically, MD considers static\nworlds in which a one-time decision is made and all agents are assumed to be pa-\ntient enough to wait for the decision. By contrast, we consider dynamic worlds in\nwhich agents may arrive and depart over time and in which a sequence of decisions\nmust be made without the bene(cid:12)t of hindsight about the values of agents yet to\narrive. The MD problem for dynamic systems is termed online mechanism design\n[2]. Online MD supposes the existence of a center, that can receive messages from\nagents and enforce a particular outcome and collect payments.\n\nSequential decision tasks introduce new subtleties into the MD problem. 
First, decisions now have expected value instead of certain value because of uncertainty about the future. Second, new temporal strategies are available to an agent, such as waiting to report its presence to try to improve its utility within the mechanism. Online mechanisms must bring truthful and immediate revelation of an agent's value for sequences of decisions into equilibrium.

Without the problem of private information and incentives, the sequential decision problem in online MD could be formulated and solved as a Markov Decision Process (MDP). In fact, we show that an optimal policy and MDP-value function can be used to define an online mechanism in which truthful and immediate revelation of an agent's valuation for different sequences of decisions is a Bayes-Nash equilibrium.

Our approach is very general, applying to any MDP in which the goal is to maximize the total expected sequential value across all agents. To illustrate the flexibility of this model, we consider the following illustrative applications:

reusable goods. A renewable resource is available in each time period. Agents arrive and submit a bid for a particular quantity of resource for each of a contiguous sequence of periods, and before some deadline.

multi-unit auction. A finite number of identical goods are for sale. Agents submit bids for a quantity of goods with a deadline, by which time a winner-determination decision must be made for that agent.

multiagent coordination. A central controller determines and enforces the actions that will be performed by a dynamically changing team of agents. Agents are only able to perform actions while present in the system.

Our main contribution is to identify this connection between online MD and MDPs, and to define a new family of dynamic mechanisms that we term the online VCG mechanism. 
We also clearly identify the role of the ability to stall a decision, as it relates to the value of an agent, in providing for Bayes-Nash truthful mechanisms.

1.1 Related Work

The problem of online MD is due to Friedman and Parkes [2], who focused on strategyproof online mechanisms in which immediate and truthful revelation of an agent's valuation function is a dominant-strategy equilibrium. The authors define the mechanism that we term the delayed VCG mechanism, identify problems for which the mechanism is strategyproof, and provide the seeds of our work on Bayes-Nash truthful mechanisms. Work on online auctions [3] is also related, in that it considers a system with dynamic agent arrivals and departures. However, the online auction work considers a much simpler setting (see also [4]), for instance the allocation of a fixed number of identical goods, and places less emphasis on temporal strategies or allocative efficiency. Awerbuch et al. [5] provide a general method to construct online auctions from online optimization algorithms. In contrast to our methods, their methods consider the special case of single-minded bidders with a value v_i for a particular set of resources r_i, and are only temporally strategyproof in the special case of online algorithms with a non-decreasing acceptance threshold.

2 Preliminaries

In this section, we introduce a general discrete-time and finite-action formulation for a multiagent sequential decision problem. Putting incentives to one side for now, we also define and solve an MDP formalization of the problem. We consider a finite-horizon problem¹ with a set T of discrete time points and a sequence of decisions k = {k_1, ..., k_T}, where k_t ∈ K_t and K_t is the set of feasible decisions in period t. Agent i ∈ I arrives at time a_i ∈ T, departs at time d_i ∈ T, and has value v_i(k) ≥ 0 for the sequence of decisions k. 
By assumption, an agent has no value for decisions outside of the interval [a_i, d_i]. Agents also face payments, which we allow in general to be collected after an agent's departure. Collectively, the information θ_i = (a_i, d_i, v_i) defines the type of agent i, with θ_i ∈ Θ. Agent types are sampled i.i.d. from a probability distribution f(θ), assumed known to the agents and to the central mechanism. We allow multiple agents to arrive and depart at the same time. Agent i, with type θ_i, receives utility u_i(k, p; θ_i) = v_i(k; θ_i) − p, for decisions k and payment p. Agents are modeled as expected-utility maximizers. We adopt as our goal that of maximizing the expected total sequential value across all agents.

¹The model can be trivially extended to consider infinite horizons if all agents share the same discount factor, but will require some care for more general settings.

If we were to simply ignore incentive issues, the expected-value-maximizing decision problem induces an MDP. The state² of the MDP at time t is the history-vector h_t = (θ^1, ..., θ^t; k_1, ..., k_{t−1}), and includes the reported types up to and including period t and the decisions made up to and including period t − 1. The set of all possible states at time t is denoted H_t. The set of all possible states across all time is H = ∪_{t=1}^{T+1} H_t. The set of decisions available in state h_t is K_t(h_t). Given a decision k_t ∈ K_t(h_t) in state h_t, there is some probability distribution Prob(h_{t+1} | h_t, k_t) over possible next states h_{t+1}, determined by the random new agent arrivals, agent departures, and the impact of decision k_t. This makes explicit the dynamics that were left implicit in the type distribution θ_i ~ f(θ_i), and includes additional information about the domain.

The objective is to make decisions to maximize the expected total value across all agents. 
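The type and history-state objects above translate directly into code. The following is a minimal illustrative Python sketch (all names are our own), with a toy distribution standing in for the domain's true type distribution f(θ):

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentType:
    # Type theta_i = (a_i, d_i, v_i): arrival, departure, and a value label.
    arrival: int
    departure: int
    value: float  # stands in for the valuation function v_i

def sample_types(horizon, arrival_rate, rng):
    # Draw i.i.d. agent types from a simple illustrative distribution f(theta):
    # at most one arrival per period, departure no earlier than arrival.
    types = []
    for t in range(1, horizon + 1):
        if rng.random() < arrival_rate:
            departure = rng.randint(t, horizon)
            types.append(AgentType(t, departure, rng.uniform(0.0, 1.0)))
    return types

@dataclass(frozen=True)
class History:
    # State h_t: types reported through period t, decisions through t - 1.
    t: int
    types: tuple
    decisions: tuple
```

A real domain would replace the scalar `value` field with a valuation over decision sequences; the sketch only fixes the shape of the type and state spaces.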
We define a payoff function for the induced MDP as follows: there is a payoff R_i(h_t, k_t) = v_i(k_{≤t}; θ_i) − v_i(k_{≤t−1}; θ_i) that becomes available from agent i upon taking action k_t in state h_t. With this, we have Σ_{t=1}^{τ} R_i(h_t, k_t) = v_i(k_{≤τ}; θ_i), for all periods τ. The summed value, Σ_i R_i(h_t, k_t), is the payoff obtained from all agents at time t, and is denoted R(h_t, k_t). By assumption, the reward to an agent in this basic online MD problem depends only on decisions, and not on state. The transition probabilities and the reward function defined above, together with the feasible decision space, constitute the induced MDP M_f.

Given a policy π = {π_1, π_2, ..., π_T}, where π_t : H_t → K_t, an MDP defines an MDP-value function V^π as follows: V^π(h_t) is the expected value of the summed payoff obtained from state h_t onwards under policy π, i.e., V^π(h_t) = E_π{R(h_t, π(h_t)) + R(h_{t+1}, π(h_{t+1})) + ... + R(h_T, π(h_T))}. An optimal policy π* is one that maximizes the MDP-value of every state³ in H. The optimal MDP-value function V* can be computed via the following value iteration algorithm: for t = T−1, T−2, ..., 1,

    ∀h ∈ H_t:  V*(h) = max_{k ∈ K_t(h)} [ R(h, k) + Σ_{h' ∈ H_{t+1}} Prob(h' | h, k) V*(h') ]

where V*(h ∈ H_T) = max_{k ∈ K_T(h)} R(h, k). This algorithm works backwards in time from the horizon and has time complexity polynomial in the size of the MDP and the time horizon T.

Given the optimal MDP-value function, the optimal policy is derived as follows: for t < T,

    π*(h ∈ H_t) = arg max_{k ∈ K_t(h)} [ R(h, k) + Σ_{h' ∈ H_{t+1}} Prob(h' | h, k) V*(h') ]

and π*(h ∈ H_T) = arg max_{k ∈ K_T(h)} R(h, k). 
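As a concrete sketch, the backward-induction recursion above fits in a few lines of Python. This is an illustrative implementation over an explicitly enumerated state space (the helper names `states`, `actions`, `reward`, and `trans` are ours; a raw history-based state space would in practice be far too large to enumerate, as footnote 2 notes):

```python
def value_iteration(horizon, states, actions, reward, trans):
    # Finite-horizon value iteration for the induced MDP.
    #   states[t]     : list of states at period t (t = 1..horizon)
    #   actions(t, h) : feasible decisions K_t(h)
    #   reward(h, k)  : summed agent payoff R(h, k)
    #   trans(h, k)   : dict {h2: Prob(h2 | h, k)} over period-(t+1) states
    # Returns the optimal value function V* and the greedy policy pi*.
    V, pi = {}, {}
    # Terminal period: V*(h) = max_k R(h, k).
    for h in states[horizon]:
        pi[h], V[h] = max(((k, reward(h, k)) for k in actions(horizon, h)),
                          key=lambda x: x[1])
    # Backward induction for t = T-1, ..., 1.
    for t in range(horizon - 1, 0, -1):
        for h in states[t]:
            best_k, best_q = None, float('-inf')
            for k in actions(t, h):
                q = reward(h, k) + sum(p * V[h2]
                                       for h2, p in trans(h, k).items())
                if q > best_q:
                    best_k, best_q = k, q
            pi[h], V[h] = best_k, best_q
    return V, pi
```

On a toy two-period example this recovers V* and π* exactly as in the displayed recursion, with the arg max computed alongside the max so the policy needs no second pass.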
Note that we have chosen not to subscript the optimal policy and MDP-value by time because it is implicit in the length of the state.

²Using histories as state in the induced MDP will make the state space very large. Often, there will be some function g for which g(h) is a sufficient statistic for all possible states h. We ignore this possibility here.

³It is known that a deterministic optimal policy always exists in MDPs [6].

3 The Delayed VCG Mechanism

Let θ_{>i} denote a random variable (distributed according to f(θ)) for the types of the agents that arrive after agent i.

Definition 4 (Bayesian-Nash Incentive-Compatible) Mechanism M_Dvcg is Bayesian-Nash incentive-compatible if and only if the policy π and payments satisfy:

    E_{θ_{>i}}{ v_i(π(θ_i, θ_{−i}); θ_i) − p^Dvcg_i((θ_i, θ_{−i}); π) }    (BNIC)
    ≥ E_{θ_{>i}}{ v_i(π(θ̂_i, θ_{−i}); θ_i) − p^Dvcg_i((θ̂_i, θ_{−i}); π) }

for all types θ̂_i and all θ_i. The expected utility to agent i that reports type θ̂_i is E_{θ_{>i}}{v_i(π*(θ̂_i, θ_{−i}); θ_i) + Σ_{j≠i} R^j_{≤T}((θ̂_i, θ_{−i}); π*) − R_{≤T}(θ_{−i}; π*)}, substituting for the payment term p^Dvcg_i. We can ignore the final term because it does not depend on the choice of θ̂_i at all. Let τ denote the arrival period a_i of agent i, with state h_τ including the agent types reported up to period a_i.

Corollary 2 A delayed VCG mechanism cannot be Bayes-Nash incentive-compatible if agents have any patience and the expected value of its policy can be improved by stalling a decision.

If the policy can be improved through stalling, then an agent can improve its expected utility by delaying its reported arrival to correct for this, and make the policy stall. This delayed VCG mechanism is ex ante efficient, because it implements the policy that maximizes the expected total sequential value across all agents. 
Second, it is interim individual-rational as long as the MDP satisfies the value-monotonicity property. The expected utility to agent i in equilibrium is E_{θ_{>i}}{R_{≤T}((θ_i, θ_{−i}); π*) − R_{≤T}(θ_{−i}; π*)}, which is non-negative exactly when value-monotonicity holds. Third, the mechanism is ex ante budget-balanced as long as the MDP satisfies the no-positive-externalities property. The expected payment by agent i, with type θ_i, to the mechanism is E_{θ_{>i}}{R^i_{≤T}((θ_i, θ_{−i}); π*) − (R_{≤T}((θ_i, θ_{−i}); π*) − R_{≤T}(θ_{−i}; π*))}, which is non-negative exactly when the no-positive-externalities condition holds.

4 The Online VCG Mechanism

We now introduce the online VCG mechanism, in which payments are determined as soon as all decisions are made that affect an agent's value. Not only is this a better fit with the practical needs of online mechanisms, but the online VCG mechanism also enables better computational properties than the delayed mechanism.

Let V^π(h_t(θ̂_{−i}; π)) denote the MDP-value of policy π in the system without agent i, given reports θ̂_{−i} from the other agents, and evaluated in some period t.

Definition 5 (online VCG mechanism) Given history h ∈ H, mechanism M_vcg = (Θ, π, p^vcg) implements decisions k_t = π(h_t), and computes payment

    p^vcg_i(θ̂; π) = R^i_{≤m_i}(θ̂; π) − [ V^π(h_{â_i}(θ̂; π)) − V^π(h_{â_i}(θ̂_{−i}; π)) ]    (2)

to agent i in its commitment period m_i, with zero payments in all other periods.

Note the payment is computed in the commitment period for an agent, which is some period before an agent's departure at which its value is fully determined. 
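To make Equation 2 concrete, here is a minimal Python sketch of the payment rule (the function and argument names are ours; the two MDP-values would come from solving the induced MDP with and without agent i at its arrival state):

```python
def online_vcg_payment(reported_value, v_with_agent, v_without_agent):
    # Payment charged to agent i in its commitment period m_i (Equation 2):
    #   reported_value  : R^i_{<= m_i}, agent i's reported value from the
    #                     decisions made by the policy so far
    #   v_with_agent    : V^pi(h_{a_i}), MDP-value at i's arrival state
    #   v_without_agent : V^pi of the arrival state had i not arrived
    # The bracketed discount is i's expected marginal contribution to
    # system value, so utility = reported_value - payment = marginal value.
    marginal_contribution = v_with_agent - v_without_agent
    return reported_value - marginal_contribution
```

For example, an agent reporting value 4 whose arrival raises the system's MDP-value from 7 to 10 pays 4 − (10 − 7) = 1, leaving it a utility of 3, its marginal contribution.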
In the WiFi at Starbucks example, this can be the period in which the mechanism commits to a particular allocation for an agent.

Agent i's payment in the online VCG mechanism is equal to its reported value from the sequence of decisions made by the policy, discounted by the expected marginal value that agent i will contribute to the system (as determined by the MDP-value function for the policy in its arrival period). The discount is defined as the expected forward-looking effect the agent will have on the value of the system. Establishing incentive-compatibility requires some care because the payment now depends on the stated arrival time of an agent. We must show that there is no systematic dependence that an agent can use to its advantage.

Theorem 3 An online VCG mechanism, (Θ, π*, p^vcg), based on an optimal policy π* for a correct MDP model defined for a decision space that includes stalling, is Bayes-Nash incentive-compatible.

Proof. We establish this result by demonstrating that the expected value of the payment by agent i in the online VCG mechanism is the same as in the delayed VCG mechanism, when other agents report their true types and for any reported type of agent i. This proves incentive-compatibility, because the policy in this online VCG mechanism is exactly that in the delayed VCG mechanism (and so an agent's value from decisions is the same), and with identical expected payments the equilibrium follows from the truthful equilibrium of the delayed mechanism. The first term in the payment (see Equation 2) is R^i_{≤m_i}((θ̂_i, θ_{−i}); π*) and has the same value as the first term, R^i_{≤T}((θ̂_i, θ_{−i}); π*), in the payment in the delayed mechanism (see Equation 1). 
Now, consider the discount term in Equation 2, and rewrite it as:

    V*(h_{â_i}((θ̂_i, θ_{−i}); π*)) + R_{≤â_i}(θ_{−i}; π*) − V*(h_{â_i}(θ_{−i}; π*)) − R_{≤â_i}(θ_{−i}; π*)    (3)

The expected value of the left-hand pair of terms in Equation 3 is equal to the expected value of V*(h_{â_i}((θ̂_i, θ_{−i}); π*)) + R_{≤â_i}((θ̂_i, θ_{−i}); π*), because agent i's announced type has no effect on the reward before its arrival. Applying Lemma 1, the expected value of these terms is constant and equal to the expected value of V*(h_{t'}((θ̂_i, θ_{−i}); π*)) + R_{≤t'}((θ̂_i, θ_{−i}); π*) for all t' ≥ a_i (with the expectation taken with respect to the history h_{a_i} available to agent i in its true arrival period). Moreover, taking t' to be the final period, T, this is also equal to the expected value of R_{≤T}((θ̂_i, θ_{−i}); π*), which is the expected value of the first term of the discount in the payment in the delayed VCG mechanism. 
Similarly, the (negated) expected value of the right-hand pair of terms in Equation 3 is constant, and equals the expected value of V*(h_{t'}(θ_{−i}; π*)) + R_{≤t'}(θ_{−i}; π*) for all t' ≥ a_i. Again, taking t' to be the final period T, this is also equal to the expected value of R_{≤T}(θ_{−i}; π*), which is the expected value of the second term of the discount in the payment in the delayed VCG mechanism. ∎

We have demonstrated that although an agent can systematically reduce the expected value of each of the first and second terms in the discount in its payment (Equation 2) by delaying its arrival, these effects exactly cancel each other out. Note that it also remains important for the incentive-compatibility of the online VCG mechanism that the policy allows stalling.

The online VCG mechanism shares the properties of allocative efficiency and budget-balance with the delayed VCG mechanism (under the same conditions). The online VCG mechanism is ex post individual-rational, so that an agent's expected utility is always non-negative, a slightly stronger condition than for the delayed VCG mechanism. The expected utility to agent i is V*(h_{a_i}) − V*(h_{a_i} \ i), and is non-negative because of the value-monotonicity property of the MDP.

The online VCG mechanism also suggests the possibility of new computational speed-ups. The payment to an agent only requires computing the optimal MDP-value without the agent in the state in which it arrives, while the delayed VCG payment requires computing the sequence of decisions that the optimal policy would have made in the counterfactual world without the presence of each agent.

5 Discussion

We described a direct-revelation mechanism for a general sequential decision-making setting with uncertainty. 
In the Bayes-Nash equilibrium each agent truthfully reveals its private type information, and immediately upon arrival. The mechanism induces an MDP, and implements the sequence of decisions that maximizes the expected total value across all agents. There are two important directions in which to take this preliminary work. First, we must deal with the fact that for most real applications the MDP that will need to be solved to compute the decision and payment policies will be too big to be solved exactly. We will explore methods for solving large-scale MDPs approximately, and consider the consequences for incentive-compatibility. Second, we must deal with the fact that the mechanism will often have at best an incomplete and inaccurate knowledge of the distributions on agent types. We will explore the interaction between models of learning and incentives, and consider the problem of adaptive online mechanisms.

Acknowledgments

This work is supported in part by NSF grant IIS-0238147.

References

[1] Matthew O. Jackson. Mechanism theory. In The Encyclopedia of Life Support Systems. EOLSS Publishers, 2000.

[2] Eric Friedman and David C. Parkes. Pricing WiFi at Starbucks: Issues in online mechanism design. Short paper, in Fourth ACM Conf. on Electronic Commerce (EC'03), pages 240-241, 2003.

[3] Ron Lavi and Noam Nisan. Competitive analysis of incentive compatible on-line auctions. In Proc. 2nd ACM Conf. on Electronic Commerce (EC-00), 2000.

[4] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.

[5] Baruch Awerbuch, Yossi Azar, and Adam Meyerson. Reducing truth-telling online mechanisms to online optimization. In Proc. ACM Symposium on Theory of Computing (STOC'03), 2003.

[6] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. 
John Wiley & Sons, New York, 1994.", "award": [], "sourceid": 2432, "authors": [{"given_name": "David", "family_name": "Parkes", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}