{"title": "Distributed Optimization in Adaptive Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 887, "page_last": 894, "abstract": "", "full_text": "Distributed Optimization in Adaptive Networks\n\nCiamac C. Moallemi\nElectrical Engineering\nStanford University\nStanford, CA 94305\n\nciamac@stanford.edu\n\nManagement Science and Engineering\n\nand Electrical Engineering\n\nBenjamin Van Roy\n\nStanford University\nStanford, CA 94305\n\nbvr@stanford.edu\n\nAbstract\n\nWe develop a protocol for optimizing dynamic behavior of a network\nof simple electronic components, such as a sensor network, an ad hoc\nnetwork of mobile devices, or a network of communication switches.\nThis protocol requires only local communication and simple computa-\ntions which are distributed among devices. The protocol is scalable to\nlarge networks. As a motivating example, we discuss a problem involv-\ning optimization of power consumption, delay, and buffer over\ufb02ow in a\nsensor network.\nOur approach builds on policy gradient methods for optimization of\nMarkov decision processes. The protocol can be viewed as an extension\nof policy gradient methods to a context involving a team of agents op-\ntimizing aggregate performance through asynchronous distributed com-\nmunication and computation. We establish that the dynamics of the pro-\ntocol approximate the solution to an ordinary differential equation that\nfollows the gradient of the performance objective.\n\n1\n\nIntroduction\n\nThis paper is motivated by the potential of policy gradient methods as a general approach\nto designing simple scalable distributed optimization protocols for networks of electronic\ndevices. We offer a general framework for such protocols that builds on ideas from the pol-\nicy gradient literature. We also explore a speci\ufb01c example involving a network of sensors\nthat aggregates data. 
In this context, we propose a distributed optimization protocol that minimizes power consumption, delay, and buffer overflow.

The proposed approach for designing protocols based on policy gradient methods comprises one contribution of this paper. In addition, this paper offers fundamental contributions to the policy gradient literature. In particular, the kind of protocol we propose can be viewed as extending policy gradient methods to a context involving a team of agents optimizing system behavior through asynchronous distributed computation and parsimonious local communication. Our main theoretical contribution is to show that the dynamics of our protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective.

2 A General Formulation

Consider a network consisting of a set of components V = {1, ..., n}. Associated with this network is a discrete-time dynamical system with a finite state space W. Denote the state of the system at time k by w(k), for k = 0, 1, 2, .... There are n subsets W_1, ..., W_n of W, with W_i consisting of the states associated with events at component i. Note that these subsets need not be mutually exclusive or collectively exhaustive. At the kth epoch, there are n control actions a_1(k) ∈ A_1, ..., a_n(k) ∈ A_n, where each A_i is a finite set of possible actions that can be taken by component i. We sometimes write these control actions in vector form, a(k) ∈ A = A_1 × ··· × A_n. The actions are governed by a set of policies π^1_{θ_1}, ..., π^n_{θ_n}, parameterized by vectors θ_1 ∈ R^{N_1}, ..., θ_n ∈ R^{N_n}. Each ith action process only transitions when the state w(k) transitions to an element of W_i.
At the time of transition, the probability that a_i(k) becomes any a_i ∈ A_i is given by π^i_{θ_i}(a_i | w(k)). The state transitions depend on the prior state and action vector. In particular, let P(w_0, a_0, w) be a transition kernel defining the probability of state w given prior state w_0 and action a_0. Letting θ = (θ_1, ..., θ_n), we have

\[
\Pr\{w(k) = w, a(k) = a \mid w(k-1) = w_0, a(k-1) = a_0, \theta\}
= P(w_0, a_0, w) \prod_{i : w \in W_i} \pi^i_{\theta_i}(a_i \mid w) \prod_{i : w \notin W_i} 1_{\{a_{0,i} = a_i\}}.
\]

Define F_k to be the σ-algebra generated by {(w(ℓ), a(ℓ)) | ℓ = 1, ..., k}.

While the system is in state w ∈ W and action a ∈ A is applied, each component i receives a reward r_i(w, a). The average reward received by the network is r(w, a) = (1/n) Σ_{i=1}^n r_i(w, a).

Assumption 1. For every θ, the Markov chain w(k) is ergodic (aperiodic, irreducible).

Given Assumption 1, for each fixed θ, there is a well-defined long-term average reward

\[
\lambda(\theta) = \lim_{K \to \infty} \frac{1}{K} \, \mathrm{E}\left[\sum_{k=0}^{K-1} r(w(k), a(k))\right].
\]

We will consider a stochastic approximation iteration

\[
\theta_i(k+1) = \theta_i(k) + \epsilon \chi_i(k). \tag{1}
\]

Here, ε > 0 is a constant step size and χ_i(k) is a noisy estimate of the gradient ∇_{θ_i} λ(θ(k)) computed at component i based on the component's historically observed states, actions, and rewards, in addition to communication with other components. Our goal is to develop an estimator χ_i(k) that can be used in an adaptive, asynchronous, and decentralized context, and to establish the convergence of the resulting stochastic approximation scheme.

Our approach builds on policy gradient algorithms that have been proposed in recent years ([5, 7, 8, 3, 4, 2]). As a starting point, consider a gradient estimation method that is a decentralized variation of the OLPOMDP algorithm of [3, 4, 1].
In this algorithm, each component i maintains and updates an eligibility vector z^β_i(k) ∈ R^{N_i}, defined by

\[
z^\beta_i(k) = \sum_{\ell=0}^{k} \beta^{k-\ell} \, \frac{\nabla_{\theta_i} \pi^i_{\theta_i(\ell)}(a_i(\ell) \mid w(\ell))}{\pi^i_{\theta_i(\ell)}(a_i(\ell) \mid w(\ell))} \, 1_{\{w(\ell) \in W_i\}}, \tag{2}
\]

for some β ∈ (0, 1). The algorithm generates an estimate χ̄_i(k) = r(w(k), a(k)) z^β_i(k) of the local gradient ∇_{θ_i} λ(θ(k)). Note that while the eligibility vector z^β_i(k) can be computed using only local information, the gradient estimate χ̄_i(k) cannot be computed without knowledge of the global reward r(w(k), a(k)) at each time. In a fully decentralized environment, where components only have knowledge of their local rewards, this algorithm cannot be used.

In this paper, we present a simple scalable distributed protocol through which rewards occurring locally at each node are communicated over time across the network, and gradient estimates are generated at each node based on local information. A fundamental issue this raises is that rewards may incur large delays before being communicated across the network. Moreover, these delays may be random and may be correlated with the underlying events that occur in the operation of the network. We address this issue and establish conditions for convergence. Another feature of the protocol is that it is completely decentralized – there is no central processor that aggregates and disseminates rewards. As such, the protocol is robust to isolated changes or failures in the network.
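The centralized baseline is easy to make concrete. The sketch below is our own illustration, not code from the paper: it implements a recursive form of the eligibility update (2) and the estimate χ̄_i(k) = r(k) z^β_i(k) for a single component, assuming a toy two-action softmax policy with a scalar state featurization, and assuming the event indicator 1_{w(k)∈W_i} equals one at every step.

```python
import numpy as np

def softmax_policy(theta, w):
    """Action probabilities pi_theta(. | w) for a toy 2-action policy.

    theta: shape (2,) parameters; w: scalar state feature (an assumption
    made for illustration; the paper leaves the policy form abstract)."""
    logits = theta * w
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_pi(theta, w, a):
    """Gradient of log pi_theta(a | w) for the softmax policy above."""
    p = softmax_policy(theta, w)
    g = -p * w          # d logits_j / d theta_j = w on each coordinate
    g[a] += w           # adds the one-hot term for the taken action
    return g

def olpomdp_step(z, theta, w, a, r, beta):
    """One step of the recursive form of (2): z <- beta*z + grad log pi,
    followed by the gradient estimate chi = r * z."""
    z = beta * z + grad_log_pi(theta, w, a)
    return z, r * z
```

Summing such estimates over time and averaging yields a biased estimate of the gradient of the average reward; as the source notes for OLPOMDP [3, 4], the bias vanishes as β ↑ 1 at the price of increased variance.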
In addition to design of the protocol, a significant contribution is in the protocol's analysis, which we believe requires new ideas beyond what has been employed in the prior policy gradient literature.

3 A General Framework for Protocols

We will make the following assumption regarding the policies, which is common in the policy gradient literature ([7, 8, 3, 4, 2]).

Assumption 2. For all i and every w ∈ W_i, a_i ∈ A_i, π^i_{θ_i}(a_i | w) is a continuously differentiable function of θ_i. Further, for every i, there exists a bounded function L_i(w, a_i, θ) such that for all w ∈ W_i, a_i ∈ A_i, ∇_{θ_i} π^i_{θ_i}(a_i | w) = π^i_{θ_i}(a_i | w) L_i(w, a_i, θ).

The latter part of the assumption is satisfied, for example, if there exists a constant δ > 0 such that for each i, w ∈ W_i, a_i ∈ A_i, either π^i_{θ_i}(a_i | w) = 0 for every θ_i, or π^i_{θ_i}(a_i | w) ≥ δ for all θ_i.

Consider the following gradient estimator:

\[
\chi_i(k) = z^\beta_i(k) \, \frac{1}{n} \sum_{j=1}^{n} \sum_{\ell=0}^{k} d^\alpha_{ij}(\ell, k) \, r_j(\ell), \tag{3}
\]

where we use the shorthand r_j(ℓ) = r_j(w(ℓ), a(ℓ)). Here, the random variables {d^α_{ij}(ℓ, k)}, with parameter α ∈ (0, 1), represent an arrival process describing the communication of rewards across the network. Indeed, d^α_{ij}(ℓ, k) is the fraction of the reward r_j(ℓ) at component j that is learned by component i at time k ≥ ℓ. We will assume the arrival process satisfies the following conditions.

Assumption 3. For each i, j, ℓ, and α ∈ (0, 1), the process {d^α_{ji}(ℓ, k) | k = ℓ, ℓ+1, ℓ+2, ...} satisfies:

1. d^α_{ji}(ℓ, k) is F_k-measurable.

2. There exist a scalar γ ∈ (0, 1) and a random variable c_ℓ such that for all k ≥ ℓ,

\[
\left| \frac{d^\alpha_{ji}(\ell, k)}{(1-\alpha)\alpha^{k-\ell}} - 1 \right| < c_\ell \gamma^{k-\ell},
\]

with probability 1. Further, we require that the distribution of c_ℓ given F_ℓ depend only on (w(ℓ), a(ℓ)), and that there exist a constant c̄ such that E[c_ℓ | w(ℓ) = w, a(ℓ) = a] < c̄ < ∞, with probability 1 for all initial conditions w ∈ W and a ∈ A.

3. The distribution of {d^α_{ji}(ℓ, k) | k = ℓ, ℓ+1, ...} given F_ℓ depends only on w(ℓ) and a(ℓ).

The following result, proved in our appendix [9], establishes the convergence of the long-term sample averages of χ_i(k) of the form (3) to an estimate of the gradient. This type of convergence is central to the convergence of the stochastic approximation iteration (1).

Theorem 1. Holding θ fixed, the limit

\[
\nabla^{\alpha\beta}_{\theta_i} \lambda(\theta) = \lim_{K \to \infty} \frac{1}{K} \, \mathrm{E}\left[\sum_{k=0}^{K-1} \chi_i(k)\right]
\]

exists. Further,

\[
\lim_{\alpha \uparrow 1} \limsup_{\beta \uparrow 1} \left\| \nabla^{\alpha\beta}_{\theta_i} \lambda(\theta) - \nabla_{\theta_i} \lambda(\theta) \right\| = 0.
\]

4 Example: A Sensor Network

In this section, we present a model of a wireless network of sensors that gathers and communicates data to a central base station. Our example is motivated by issues arising in the development of sensor network technology being carried out by commercial producers of electronic devices. However, we will not take into account the many complexities associated with real sensor networks.
Rather, our objective is to pose a simplified model that motivates and provides a context for discussion of our distributed optimization protocol.

4.1 System Description

Consider a network of n sensors and a central base station. Each sensor gathers packets of data through observation of its environment, and these packets of data are relayed through the network to the base station via multi-hop wireless communication. Each sensor retains a queue of packets, each obtained either through sensing or via transmission from another sensor. Packets in a queue are indistinguishable – each is of equal size and must be transferred to the central base station. We take the state of a sensor to be the number of packets in its queue and denote the state of the ith sensor at time k by x_i(k). The number of packets in a queue cannot exceed a finite buffer size, which we denote by x̄.

A number of triggering events can occur at any given device. These include (1) packetizing of an observation, (2) reception of a packet from another sensor, (3) transmission of a packet to another sensor, (4) awakening from a period of sleep, (5) termination of a period of attempted reception, and (6) termination of a period of attempted transmission. At the time of a triggering event, the sensor must decide on its next action. Possible actions include (1) sleep, (2) attempt transmission, and (3) attempt reception. When the buffer is full, options are limited to (1) and (2). When the buffer is empty, options are limited to (1) and (3). The action taken by the ith sensor at time k is denoted by a_i(k).

The base station will be thought of as a sensor that has an infinite buffer and perpetually attempts reception. For each i, there is a set N(i) of entities with which the ith sensor can directly communicate.
If the ith sensor is attempting transmission of a packet and there is at least one element of N(i) that is simultaneously attempting reception and is closer to the base station than component i, the packet is transferred to the queue of that element. If there are multiple such elements, one of them is chosen randomly. Note that if, among the elements of N(i) that are attempting reception, all are further away from the base station than component i, no packet is transmitted.

Observations are made and packetized by each sensor at random times. If a sensor's buffer is not full when an observation is packetized, an element is added to the queue. Otherwise, the packet is dropped from the system.

4.2 Control Policies and Objective

Every sensor employs a control policy that selects an action based on its queue length each time a triggering event occurs. The action is maintained until the occurrence of the next triggering event. Each ith sensor's control policy is parameterized by a vector θ_i ∈ R^2. Given θ_i, at an event time, if the ith sensor has a non-empty queue, it chooses to transmit with probability θ_i1. If the ith sensor does not transmit and its queue is not full, it chooses to receive with probability θ_i2. If the sensor does not transmit or receive, then it sleeps. In order to satisfy Assumption 2, we constrain θ_i1 and θ_i2 to lie in an interval [θ_l, θ_h], where 0 < θ_l < θ_h < 1.

Assume that each sensor has a finite power supply. In order to guarantee a minimum lifespan for the network, we will require that each sensor sleep at least a fraction f_s of the time. This is enforced by considering a time window of length T_s.
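The two-parameter control policy above can be sketched as follows. This is a minimal illustration of our own; the function name, the string action labels, and the injectable random-number source are assumptions made for testability, not part of the paper's model.

```python
import random

def choose_action(theta, queue_len, buffer_size, rng=random.random):
    """Sample one action from the two-parameter sensor policy.

    theta = (p_transmit, p_receive), each constrained to [theta_l, theta_h].
    A non-empty queue permits transmission; a non-full queue permits
    reception; otherwise the sensor sleeps."""
    p_tx, p_rx = theta
    if queue_len > 0 and rng() < p_tx:            # may transmit
        return 'transmit'
    if queue_len < buffer_size and rng() < p_rx:  # may receive
        return 'receive'
    return 'sleep'
```

Note how the guards on queue length enforce the restrictions stated earlier: a full buffer rules out reception and an empty buffer rules out transmission.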
If, at any given time, a sensor has not slept for a total fraction of at least f_s of the preceding time T_s, it is forced to sleep and hence is not allowed to transmit or receive.

The objective is to minimize a weighted sum of the average delay and the average number of dropped packets per unit of time. Delay can be thought of as the amount of time a packet spends in the network before arriving at the base station. Hence, the objective is:

\[
\max_{\theta_1, \ldots, \theta_n} \ \limsup_{K \to \infty} \ -\frac{1}{K} \frac{1}{n} \sum_{k=0}^{K-1} \sum_{i=1}^{n} \left( x_i(k) + \xi D_i(k) \right),
\]

where D_i(k) is the number of packets dropped by sensor i at time k, and ξ is a weight reflecting the relative importance of delay and dropped packets.

5 Distributed Optimization Protocol

We now describe a simple protocol by which components of the network can communicate rewards, in a fashion that satisfies the requirements of Theorem 1 and hence will produce good gradient estimates. This protocol communicates the rewards across the network over time using a distributed averaging procedure.

In order to motivate our protocol, consider a different problem. Imagine each component i in the network is given a real value R_i. Our goal is to design an asynchronous distributed protocol through which each node will obtain the average R̄ = Σ_{i=1}^n R_i / n. To do this, define the vector Y(0) ∈ R^n by Y_i(0) = R_i for all i. For each edge (i, j), define a function Q^{(i,j)}: R^n → R^n by

\[
Q^{(i,j)}_\ell(Y) = \begin{cases} \dfrac{Y_i + Y_j}{2}, & \text{if } \ell \in \{i, j\}, \\ Y_\ell, & \text{otherwise.} \end{cases}
\]

At each time k, choose an edge (i, j), and set Y(k+1) = Q^{(i,j)}(Y(k)). If the graph is connected and every edge is sampled infinitely often, then lim_{k→∞} Y(k) = Ȳ, where Ȳ_i = R̄. To see this, note that the operators Q^{(i,j)} preserve the average value of the vector, hence Σ_{i=1}^n Y_i(k)/n = R̄.
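The pairwise-averaging argument can be checked numerically. The sketch below (our own illustration) applies randomly chosen edge operators Q^{(i,j)} on a small connected line graph and verifies that the mean is preserved while every entry approaches the average; the graph, values, and iteration count are assumptions chosen for the demonstration.

```python
import random

def apply_edge(y, i, j):
    """Apply Q^{(i,j)}: replace entries i and j by their midpoint."""
    m = (y[i] + y[j]) / 2.0
    y = list(y)
    y[i] = y[j] = m
    return y

random.seed(0)
edges = [(0, 1), (1, 2), (2, 3)]   # a connected line graph on 4 nodes
y = [4.0, 0.0, 8.0, 0.0]           # the values R_i; their average is 3.0
for _ in range(2000):              # random edges: each sampled many times
    y = apply_edge(y, *random.choice(edges))
```

After the loop, every entry of y is numerically indistinguishable from the average 3.0, and the running mean never changed, illustrating the fixed-point claim that follows.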
Further, for any k, either Y(k+1) = Y(k) or ||Y(k+1) − Ȳ|| < ||Y(k) − Ȳ||. Moreover, Ȳ is the unique vector with average value R̄ that is a fixed point of all the operators Q^{(i,j)}. Hence, as long as the graph is connected and each edge is sampled infinitely often, Y_i(k) → R̄ as k → ∞, and the components agree on the common average R̄.

In the context of our distributed optimization protocol, we will assume that each component i maintains a scalar value Y_i(k) at time k representing an estimate of the global reward. We will define a structure by which components communicate. Define E to be the set of edges along which communication can occur. For an ordered set of distinct edges S = ((i_1, j_1), ..., (i_|S|, j_|S|)), define a set W_S ⊂ W. Let σ(E) be the set of all possible ordered sets of disjoint edges S, including the empty set. We will assume that the sets {W_S | S ∈ σ(E)} are disjoint and together form a partition of W.

If w(k) ∈ W_S, for some set S, we will assume that the components along the edges in S communicate in the order specified by S. Define Q^S = Q^{(i_|S|, j_|S|)} ··· Q^{(i_1, j_1)}, where the terms in the product are taken in the order specified by S. Define R(k) = (r_1(k), ..., r_n(k)) to be the vector of rewards occurring at time k. The update rule for the vector Y(k) is given by Y(k+1) = R(k+1) + α Q^{S(k+1)} Y(k), where

\[
Q^{S(k+1)} = \sum_{S \in \sigma(E)} 1_{\{w(k+1) \in W_S\}} Q^S.
\]

Let Ê = {(i, j) | (i, j) ∈ S, W_S ≠ ∅}. We will make the following assumption.

Assumption 4. The graph (V, Ê) is connected.

Since the process (w(k), a(k)) is aperiodic and irreducible (Assumption 1), this assumption guarantees that every edge of the connected graph (V, Ê) is sampled infinitely often.

Policy parameters are updated at each component according to the rule:

\[
\theta_i(k+1) = \theta_i(k) + \epsilon z^\beta_i(k) (1 - \alpha) Y_i(k). \tag{4}
\]

In relation to equations (1) and (3), we have

\[
d^\alpha_{ji}(\ell, k) = n (1 - \alpha) \alpha^{k-\ell} \left[ \hat{Q}(\ell, k) \right]_{ij}, \tag{5}
\]

where Q̂(ℓ, k) = Q^{S(k−1)} ··· Q^{S(ℓ)}.

The following theorem, which relies on a general stochastic approximation result from [6] together with custom analysis available in our appendix [9], establishes the convergence of the distributed stochastic iteration defined by (4).

Theorem 2. For each ε > 0, define {θ^ε(k) | k = 0, 1, ...} as the result of the stochastic approximation iteration (4) with the fixed value of ε. Assume the set {θ^ε(k) | k, ε} is bounded. Define the continuous-time interpolation θ̄^ε(t) by setting θ̄^ε(t) = θ^ε(k) for t ∈ [kε, kε + ε). Then, for any sequence of processes {θ̄^ε(t) | ε → 0} there exists a subsequence that converges weakly, as ε → 0, to θ̄(t), where θ̄(t) is a solution to the ordinary differential equation

\[
\dot{\bar{\theta}}(t) = \nabla^{\alpha\beta}_{\theta} \lambda(\bar{\theta}(t)). \tag{6}
\]

Further, define L to be the set of limit points of (6) and, for δ > 0, N_δ(L) to be a neighborhood of radius δ about L. The fraction of time that θ̄^ε(t) spends in N_δ(L) over the time interval [0, T] goes to 1 in probability as ε → 0 and T → ∞.

Note that since we are using a constant step size ε, this type of weak convergence is the strongest one would expect.
The parameters will typically oscillate in the neighborhood of a limit point, and only weak convergence to a distribution centered around a limit point can be established. An alternative would be to use a decreasing step size ε(k) → 0 in (4). In such instances, probability-1 convergence to a local optimum can often be established. However, with decreasing step sizes, the adaptation of parameters becomes very slow as ε(k) decays. We expect our protocol to be used in an online fashion, where it is ideal to be adaptive to long-term changes in network topology or dynamics of the environment. Hence, the constant step size is more appropriate, as it provides such adaptivity.

Also, the boundedness requirement on the iterates in Theorem 2 is necessary for the mathematical analysis of convergence. In practical numerical implementations, choices of the policy parameters θ_i would be constrained to bounded sets H_i ⊂ R^{N_i}. In such an implementation, the iteration (4) would be replaced with an iteration projected onto the set H_i. The conclusions of Theorem 2 would continue to hold, but with the ODE (6) replaced with an appropriate projected ODE. See [6] for further discussion.

5.1 Relation to the Example

In the example of Section 4, one approach to implementing our distributed optimization protocol involves passing messages associated with the optimization protocol alongside normal network traffic, as we will now explain. Each ith sensor should maintain and update two vectors: a parameter vector θ_i(k) ∈ R^2 and an eligibility vector z^β_i(k). If a triggering event occurs at sensor i at time k, the eligibility vector is updated according to

\[
z^\beta_i(k) = \beta z^\beta_i(k-1) + \frac{\nabla_{\theta_i} \pi^i_{\theta_i(k)}(a_i(k) \mid w(k))}{\pi^i_{\theta_i(k)}(a_i(k) \mid w(k))}.
\]

Otherwise, z^β_i(k) = β z^β_i(k−1). Furthermore, each sensor maintains an estimate Y_i(k) of the global reward. At each time k, each ith sensor observes a reward (negative cost) of r_i(k) = −x_i(k) − ξ D_i(k). If two neighboring sensors are both awake at a time k, they communicate their global reward estimates from the previous time. If the ith sensor is not involved in a reward communication event at that time, its global reward estimate is updated according to Y_i(k) = α Y_i(k−1) + r_i(k). On the other hand, at any time k at which there is a communication event, its global reward estimate is updated according to Y_i(k) = r_i(k) + α(Y_i(k−1) + Y_j(k−1))/2, where j is the index of the sensor with which communication occurs. If communication occurs with multiple neighbors, the corresponding global reward estimates are averaged pairwise in an arbitrary order. Clearly this update process can be modeled in terms of the sets W_S introduced in the previous section. In this context, the graph Ê contains an edge for each pair of neighbors in the sensor network, where the neighborhood relations are captured by N, as introduced in Section 4. To optimize performance over time, each ith sensor updates its parameter values according to our stochastic approximation iteration (4).

To highlight the simplicity of this protocol, note that each sensor need only maintain and update a few numerical values. Furthermore, the only communication required by the optimization protocol is that an extra scalar numerical value be transmitted and an extra scalar numerical value be received during the reception or transmission of any packet.

As a numerical example, consider the network topology in Figure 1. Here, at every time step, an observation arrives at a sensor with probability 0.02, and each sensor maintains a queue of up to 20 observations.
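The per-sensor bookkeeping of Section 5.1 can be sketched in a few lines. This is our own illustrative composition of the updates stated above, not the authors' implementation; the flat argument list and the stand-in `score` for the score vector ∇ log π at the taken action are assumptions.

```python
import numpy as np

def sensor_update(theta, z, Y, r, score, neighbor_Y, alpha, beta, eps):
    """One time step of the Section 5.1 protocol at a single sensor.

    score: grad log pi at the taken action (None if no triggering event);
    neighbor_Y: the previous-step estimate of a communicating neighbor
    (None if this sensor has no communication event this step)."""
    # eligibility update: decay, plus the score term on triggering events
    z = beta * z + (score if score is not None else 0.0)
    # global-reward estimate update, with or without pairwise averaging
    if neighbor_Y is None:
        Y = alpha * Y + r
    else:
        Y = r + alpha * (Y + neighbor_Y) / 2.0
    # policy parameter update, iteration (4)
    theta = theta + eps * z * (1.0 - alpha) * Y
    return theta, z, Y
```

A simulation of the sensor network would call this once per sensor per time step, supplying `neighbor_Y` only when two awake neighbors exchange estimates.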
Policy parameters θ_i1 and θ_i2 for each sensor i are constrained to lie in the interval [0.05, 0.95]. (Note that for this set of parameters, the chance of a buffer overflow is very small; buffer overflows did not occur in our simulations.) A baseline policy is defined by having leaf nodes transmit with maximum probability, and interior nodes split their time roughly evenly between transmission and reception when not forced to sleep by the power constraint.

Applying our decentralized optimization method to this example, it is clear from Figure 2 that the performance of the network is quickly and dramatically improved. Over time, the algorithm converges to the neighborhood of a local optimum, as expected. Further, the algorithm achieves qualitatively similar performance to gradient optimization using the centralized OLPOMDP method of [3, 4, 1]; hence, decentralization comes at no cost.

Figure 1: Example network topology.

Figure 2: Convergence of method (long-term average reward versus iteration, for the decentralized protocol, centralized OLPOMDP, and the baseline policy).

6 Remarks and Further Issues

We are encouraged by the simplicity and scalability of the distributed optimization protocol we have presented. We believe that this protocol represents both an interesting direction for practical applications involving networks of electronic devices and a significant step in the policy gradient literature. However, there is an important outstanding issue that needs to be addressed to assess the potential of this approach: whether or not parameters can be adapted fast enough for this protocol to be useful in applications. There are two dimensions to this issue: (1) the variance of gradient estimates and (2) the convergence rate of the underlying ODE. Both should be explored through experimentation with models that capture practical contexts.
Also, there is room for research that explores how variance can be reduced and how the convergence rate of the ODE can be accelerated.

Acknowledgements

The authors thank Abbas El Gamal, Abtin Keshavarzian, Balaji Prabhakar, and Elif Uysal for stimulating conversations on sensor network models and applications. This research was supported by NSF CAREER Grant ECS-9985229 and by the ONR under grant MURI N00014-00-1-0637. The first author was also supported by a Benchmark Stanford Graduate Fellowship.

References

[1] P. L. Bartlett and J. Baxter. Stochastic Optimization of Controlled Markov Decision Processes. In IEEE Conference on Decision and Control, pages 124-129, 2000.

[2] P. L. Bartlett and J. Baxter. Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning. Journal of Computer and System Sciences, 64:133-150, 2002.

[3] J. Baxter and P. L. Bartlett. Infinite-Horizon Gradient-Based Policy Search. Journal of Artificial Intelligence Research, 15:319-350, 2001.

[4] J. Baxter, P. L. Bartlett, and L. Weaver. Infinite-Horizon Gradient-Based Policy Search: II. Gradient Ascent Algorithms and Experiments. Journal of Artificial Intelligence Research, 15:351-381, 2001.

[5] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement Learning Algorithms for Partially Observable Markov Decision Problems. In Advances in Neural Information Processing Systems 7, pages 345-352, 1995.

[6] H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, New York, NY, 1997.

[7] P. Marbach, O. Mihatsch, and J. N. Tsitsiklis. Call Admission Control and Routing in Integrated Service Networks. In IEEE Conference on Decision and Control, 1998.

[8] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. IEEE Transactions on Automatic Control, 46(2):191-209, 2001.

[9] C. C. Moallemi and B. Van Roy. Appendix to NIPS Submission. URL: http://www.moallemi.com/ciamac/papers/nips-2003-appendix.pdf, 2003.
", "award": [], "sourceid": 2448, "authors": [{"given_name": "Ciamac", "family_name": "Moallemi", "institution": null}, {"given_name": "Benjamin", "family_name": "Roy", "institution": null}]}