{"title": "Using Collective Intelligence to Route Internet Traffic", "book": "Advances in Neural Information Processing Systems", "page_first": 952, "page_last": 960, "abstract": null, "full_text": "USING COLLECTIVE INTELLIGENCE TO ROUTE INTERNET TRAFFIC \n\nDavid H. Wolpert \nNASA Ames Research Center \nMoffett Field, CA 94035 \ndhw@ptolemy.arc.nasa.gov \n\nKagan Tumer \nNASA Ames Research Center \nMoffett Field, CA 94035 \nkagan@ptolemy.arc.nasa.gov \n\nJeremy Frank \nNASA Ames Research Center \nMoffett Field, CA 94035 \nfrank@ptolemy.arc.nasa.gov \n\nAbstract \n\nA COllective INtelligence (COIN) is a set of interacting reinforcement learning (RL) algorithms designed in an automated fashion so that their collective behavior optimizes a global utility function. We summarize the theory of COINs, then present experiments using that theory to design COINs to control internet traffic routing. These experiments indicate that COINs outperform all previously investigated RL-based, shortest-path routing algorithms. \n\n1 INTRODUCTION \n\nCOllective INtelligences (COINs) are large, sparsely connected recurrent neural networks whose \"neurons\" are reinforcement learning (RL) algorithms. The distinguishing feature of COINs is that their dynamics involves no centralized control, only the collective effect of the individual neurons, each modifying its behavior via its individual RL algorithm. This restriction holds even though the goal of the COIN concerns the system's global behavior. One naturally occurring COIN is a human economy, where the \"neurons\" are individual humans trying to maximize their reward, and the \"goal\" can be viewed as, for example, having the overall system achieve high gross domestic product. This paper presents a preliminary investigation of designing and using artificial COINs as controllers of distributed systems. 
The domain we consider is routing of internet traffic. \n\nThe design of a COIN starts with a global utility function specifying the desired global behavior. Our task is to initialize and then update the neurons' \"local\" utility functions, without centralized control, so that as the neurons improve their utilities, global utility also improves. (We may also wish to update the local topology of the COIN.) In particular, we need to ensure that the neurons do not \"frustrate\" each other as they attempt to increase their utilities. The RL algorithms at each neuron that aim to optimize that neuron's local utility are microlearners. The learning algorithms that update the neurons' utility functions are macrolearners. \n\nFor robustness and breadth of applicability, we assume essentially no knowledge concerning the dynamics of the full system; i.e., the macrolearning and/or microlearning must \"learn\" that dynamics, implicitly or otherwise. This rules out any approach that models the full system. It also means that rather than using domain knowledge to hand-craft the local utilities, as is done in multi-agent systems, in COINs the local utility functions must be automatically initialized and updated using only the provided global utility and (locally) observed dynamics. \n\nThe problem of designing a COIN has never previously been addressed in full; hence the need for the new formalism described below. Nonetheless, this problem is related to previous work in many fields: distributed artificial intelligence, multi-agent systems, computational ecologies, adaptive control, game theory [6], computational markets [2], Markov decision theory, and ant-based optimization. \n\nFor the particular problem of routing, examples of relevant work include [4, 5, 8, 9, 10]. 
Most of that previous work uses microlearning to set the internal parameters of routers running conventional shortest path algorithms (SPAs). However the microlearning occurs, those approaches do not address the problem of ensuring that the associated local utilities do not cause the microlearners to work at cross purposes. \n\nThis paper concentrates on COIN-based setting of local utilities rather than macrolearning. We used simulations to compare three algorithms. The first two are an SPA and a COIN. Both had \"full knowledge\" (FK) of the true reward-maximizing path, with reward being the routing time of the associated router's packets for the SPAs, but set by COIN theory for the COINs. The third algorithm was a COIN using a memory-based (MB) microlearner [1] whose knowledge was limited to local observations. \n\nThe performance of the FK COIN was the theoretical optimum. The performance of the FK SPA was 12.5 ± 3% worse than optimum. Despite limited knowledge, the MB COIN outperformed the FK SPA, achieving performance 36 ± 8% closer to optimum. Note that the performance of the FK SPA is an upper bound on the performance of any RL-based SPA. Accordingly, the performance of the MB COIN is at least 36% superior to that of any RL-based SPA. \n\nSection 2 below presents a cursory overview of the mathematics behind COINs. Section 3 discusses how the network routing problem is mapped into the COIN formalism and introduces our experiments. Section 4 presents results of those experiments, which establish the power of COINs in the context of routing problems. Finally, Section 5 presents conclusions and summarizes future research directions. \n\n2 MATHEMATICS OF COINS \n\nThe mathematical framework for COINs is quite extensive [11, 12]. This paper concentrates on four of the concepts from that framework: subworlds, factored systems, constraint-alignment, and the wonderful-life utility function. 
\nWe consider the state of the system across a set of discrete time steps, t ∈ {0, 1, ...}. All characteristics of a neuron η at time t, including its internal parameters at that time as well as its externally visible actions, are encapsulated in a real-valued vector ζ_{η,t}. We call this the \"state\" of neuron η at time t, and let ζ be the state of all neurons across all time. World utility, G(ζ), is a function of the state of all neurons across all time, potentially not expressible as a discounted sum. \n\nA subworld is a set of neurons. All neurons in the same subworld w share the same subworld utility function g_w(ζ). So when each subworld is a set of neurons that have the most effect on each other, neurons are unlikely to work at cross purposes: all neurons that affect each other substantially share the same local utility. \n\nAssociated with subworlds is the concept of a (perfectly) constraint-aligned system. In such systems any change to the neurons in subworld w at time 0 will have no effects on the neurons outside of w at times later than 0. Intuitively, a system is constraint-aligned if the neurons in separate subworlds do not affect each other directly, so that the rationale behind the use of subworlds holds. \n\nA subworld-factored system is one where, for each subworld w considered by itself, a change at time 0 to the states of the neurons in that subworld results in an increased value for g_w(ζ) if and only if it results in an increased value for G(ζ). For a subworld-factored system, the side effects on the rest of the system of w's increasing its own utility (which perhaps decrease other subworlds' utilities) do not end up decreasing world utility. For these systems, the separate subworlds successfully pursuing their separate goals do not frustrate each other as far as world utility is concerned. 
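As a concrete illustration of subworld-factoredness, consider a toy system of our own devising (not from the paper): the world utility decomposes as a sum of per-subworld terms, and each subworld's local utility is taken to be its own term. Any change confined to one subworld then moves its local utility and the world utility in the same direction:

```python
import random

# Toy system: two subworlds, each reduced to a single scalar state.
# World utility decomposes as a sum of per-subworld terms.
def G(state):
    """World utility: sum of each subworld's contribution."""
    return sum(-(s - 1.0) ** 2 for s in state)

def g(w, state):
    """Local utility of subworld w: its own term of G."""
    return -(state[w] - 1.0) ** 2

# Empirical check of subworld-factoredness: changing only subworld w's
# state raises g_w if and only if it raises G.
random.seed(0)
for _ in range(1000):
    state = [random.uniform(-2.0, 2.0) for _ in range(2)]
    w = random.randrange(2)
    changed = list(state)
    changed[w] = random.uniform(-2.0, 2.0)
    dG = G(changed) - G(state)
    dg = g(w, changed) - g(w, state)
    assert dG * dg >= -1e-12, "sign disagreement: system not factored"
print("factoredness held on all sampled perturbations")
```

Here the check succeeds because the unchanged subworld's term cancels in the difference, so dG equals dg; in a system where subworlds interact, a naively chosen local utility would fail it.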
\n\nThe desideratum of subworld-factoredness is carefully crafted. In particular, it does not concern changes in the value of the utility of subworlds other than the one changing its actions. Nor does it concern changes to the states of neurons in more than one subworld at once. Indeed, consider the following alternative desideratum: any change to the t = 0 state of the entire system that improves all subworld utilities simultaneously also improves world utility. Reasonable as it may appear, one can construct examples of systems that obey this desideratum and yet quickly evolve to a minimum of world utility [12]. \n\nIt can be proven that for a subworld-factored system, when each of the neurons' reinforcement learning algorithms is performing as well as it can, given the others' behavior, world utility is at a critical point. Correct global behavior corresponds to the learners reaching a (Nash) equilibrium [8, 13]. There can be no tragedy of the commons for a subworld-factored system [7, 11, 12]. \n\nLet CL_w(ζ) be defined as the vector ζ modified by clamping the states of all neurons in subworld w, across all time, to an arbitrary fixed value, here taken to be 0. The wonderful life subworld utility (WLU) is: \n\ng_w(ζ) = G(ζ) - G(CL_w(ζ))     (1) \n\nWhen the system is constraint-aligned, so that, loosely speaking, subworld w's \"absence\" would not affect the rest of the system, we can view the WLU as analogous to the change in world utility that would have arisen if subworld w \"had never existed\". (Hence the name of this utility; cf. the Frank Capra movie.) Note, however, that CL is a purely mathematical operation. Indeed, no assumption is even being made that CL_w(ζ) is consistent with the dynamics of the system. The sequence of states the neurons in w are clamped to in the definition of the WLU need not be consistent with the dynamical laws of the system. \n\nThis dynamics-independence is a crucial strength of the WLU. 
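Written as code, the WLU is just a clamp-and-subtract over a stored trajectory. The sketch below is our own toy illustration: the 3-subworld trajectory and the congestion-style world utility are hypothetical stand-ins, not the paper's network model.

```python
import random

# zeta[w][t]: state of subworld w at time t (a toy 3-subworld,
# 5-step trajectory; values are arbitrary positives).
random.seed(1)
zeta = [[random.uniform(0.5, 2.0) for _ in range(5)] for _ in range(3)]

def G(z):
    """Hypothetical world utility: negative total 'congestion', a
    nonlinear function of the summed state at each time step."""
    return -sum(sum(z[w][t] for w in range(len(z))) ** 2
                for t in range(len(z[0])))

def clamp(z, w):
    """CL_w: the trajectory with subworld w's states clamped to 0
    across all time. A purely mathematical operation; no
    re-simulation of the system's dynamics is involved."""
    return [[0.0] * len(row) if i == w else list(row)
            for i, row in enumerate(z)]

def wonderful_life_utility(z, w):
    """WLU of subworld w, Eq. (1): G(zeta) - G(CL_w(zeta))."""
    return G(z) - G(clamp(z, w))

for w in range(3):
    print(f"g_{w}(zeta) = {wonderful_life_utility(z=zeta, w=w):.3f}")
```

Note that `clamp` only edits the recorded trajectory; nothing about the system's update rules is consulted.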
It means that to evaluate the WLU we do not try to infer how the system would have evolved if all neurons in w had been set to 0 at time 0 and the system had evolved from there. So long as we know ζ extending over all time, and so long as we know G, we know the value of the WLU. This is true even if we know nothing of the dynamics of the system. \n\nIn addition to assuring the correct equilibrium behavior, there exist many other theoretical advantages to having a system be subworld-factored. In particular, the experiments in this paper revolve around the following fact: a constraint-aligned system with wonderful life subworld utilities is subworld-factored. Combining this with our earlier result that subworld-factored systems reach Nash equilibria at critical points of world utility, we expect a constraint-aligned system using WL utilities in the microlearning to approach near-optimal values of the world utility. No such assurances accrue to WL utilities if the system is not constraint-aligned, however. Accordingly, our experiments constitute an investigation of how well a particular system performs when WL utilities are used but little attention is paid to ensuring that the system is constraint-aligned. \n\n3 COINS FOR NETWORK ROUTING \n\nIn our experiments we concentrated on the two networks in Figure 1, both slightly larger than those in [9]. To facilitate the analysis, traffic originated only at the routers indicated with white boxes and had only the routers indicated by dark boxes as ultimate destinations. Note that in both networks there is a bottleneck at router 2. \n\nFigure 1: Network Architectures. (a) Network A. (b) Network B. \n\nAs is standard in much of traffic network analysis [3], at any time all traffic at a router is a real-valued number together with an ultimate destination tag. 
At each timestep, each router sums all traffic received from upstream routers in this timestep to get a load. The router then decides which downstream router to send its load to, and the cycle repeats. \n\nA running average is kept of the total value of each router's load over a window of the previous L timesteps. This average is run through a load-to-delay function, W(x), to get the summed delay accrued at this timestep by all packets traversing this router at this timestep. Different routers had different W(x), to reflect the fact that real networks have differences in router software and hardware (response time, queue length, processing speed, etc.). In our experiments W(x) = x^3 for routers 1 and 3, and W(x) = log(x + 1) for router 2, for both networks. The global goal is to minimize the total delay encountered by all traffic. \n\nIn terms of the COIN formalism, we identified the neurons η as individual pairs of routers and ultimate destinations. So ζ_{η,t} was the vector of traffic sent along all links exiting η's router, tagged for η's ultimate destination, at time t. Each subworld consisted of the set of all neurons that shared a particular ultimate destination. \n\nIn the SPA, each node η tries to set ζ_{η,t} to minimize the sum of the delays to be accrued by that traffic on the way to its ultimate destination. In contrast, in a COIN, η tries to set ζ_{η,t} to optimize g_w for the subworld w containing η. For both algorithms, \"full knowledge\" means that at time t all of the routers know the window-averaged loads for all routers for time t - 1, and assume that those values will be the same at t. For large enough L, this assumption will be arbitrarily good, and will therefore allow the routers to make arbitrarily accurate estimates of how best to route their traffic, according to their respective routing criteria. 
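The load-to-delay model above is straightforward to simulate. The sketch below is our own simplified version (the `Router` class and the constant-traffic example are illustrative assumptions; the W(x) functions and the L-step window average are from the text):

```python
import math
from collections import deque

# Load-to-delay functions from the text: W(x) = x^3 for routers 1 and 3,
# W(x) = log(x + 1) for router 2 (the bottleneck).
LOAD_TO_DELAY = {
    1: lambda x: x ** 3,
    2: lambda x: math.log(x + 1.0),
    3: lambda x: x ** 3,
}

WINDOW = 50  # L, the running-average window used in the experiments

class Router:
    """Tracks a running average of recent loads and converts it to delay."""
    def __init__(self, rid):
        self.rid = rid
        self.history = deque(maxlen=WINDOW)  # loads from the last L steps

    def step(self, load):
        """Receive this timestep's summed load; return the delay accrued
        by all traffic traversing this router at this timestep."""
        self.history.append(load)
        avg = sum(self.history) / len(self.history)
        return LOAD_TO_DELAY[self.rid](avg)

# Illustrative run: constant unit traffic through each router for 100 steps.
routers = {rid: Router(rid) for rid in (1, 2, 3)}
total_delay = 0.0
for t in range(100):
    for router in routers.values():
        total_delay += router.step(load=1.0)
print(f"total delay: {total_delay:.2f}")  # 100 * (1 + log 2 + 1)
```

Under constant unit load the window average stays at 1, so router 2's concave log(x + 1) makes it far cheaper than the cubic routers at high load, which is what creates the bottleneck pressure at router 2.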
\n\nIn contrast, having limited knowledge, the MB COIN could only predict the WLU value resulting from each routing decision. More precisely, for each router-ultimate-destination pair, the associated microlearner estimates the map from traffic on all outgoing links (the inputs) to WLU-based reward (the outputs; see below). This was done with a single-nearest-neighbor algorithm. Next, each router could send the packets along the path that results in outbound traffic with the best (estimated) reward. However, to be conservative, in these experiments we instead had the router randomly select between that path and the path selected by the FK SPA. \n\nThe load at router r at time t is determined by ζ. Accordingly, we can encapsulate the load-to-delay functions at the nodes by writing the delay at node r at time t as W_{r,t}(ζ). In our experiments world utility was the total delay, i.e., G(ζ) = Σ_{r,t} W_{r,t}(ζ). So using the WLU, g_w(ζ) = Σ_{r,t} Δ_{w,r,t}(ζ), where Δ_{w,r,t}(ζ) = W_{r,t}(ζ) - W_{r,t}(CL_w(ζ)). At each time t, the MB COIN used Σ_r Δ_{w,r,t}(ζ) as the \"WLU-based\" reward signal for trying to optimize this full WLU. \n\nIn the MB COIN, evaluating this reward in a decentralized fashion was straightforward. Every packet has a header containing a running sum of the Δ's encountered in all the routers it has traversed so far. Each ultimate destination sums all such headers it receives and echoes that sum back to all routers that had routed to it. In this way each neuron is apprised of the WLU-based reward of its subworld. \n\n4 EXPERIMENTAL RESULTS \n\nThe networks discussed above were tested under light, medium and heavy traffic loads. Table 1 shows the associated destinations (cf. Fig. 1). \n\nTable 1: Source-Destination Pairings for the Three Traffic Loads \n\nNetwork | Source | Dest. (Light) | Dest. (Medium) | Dest. (Heavy) \n   A    |   4    |      6        |      6,7       |     6,7 \n   A    |   5    |      7        |       7        |     6,7 \n   B    |   4    |     7,8       |     7,8,9      |   6,7,8,9 \n   B    |   5    |     6,9       |     6,7,9      |   6,7,8,9 \n\nIn our experiments one new packet was fed to each source router at each time step. Table 2 reports the average total delay (i.e., average per-packet time to traverse the total network) in each of the traffic regimes, for the shortest path algorithm with full knowledge, the COIN with full knowledge, and the MB COIN. Each table entry is based on 50 runs with a window size of 50, and the errors reported are errors in the mean¹. All the entries in Table 2 are statistically different at the .05 level, including FK SPA vs. MB COIN for Network A under light traffic conditions. \n\nTable 2: Average Total Delay \n\nNetwork |  Load  |    FK SPA     |    FK COIN    |    MB COIN \n   A    | light  | 0.53 ± .007  | 0.45 ± .001  | 0.50 ± .008 \n   A    | medium | 1.26 ± .010  | 1.10 ± .001  | 1.21 ± .009 \n   A    | heavy  | 2.17 ± .012  | 1.93 ± .001  | 2.06 ± .010 \n   B    | light  | 2.13 ± .012  | 1.92 ± .001  | 2.05 ± .010 \n   B    | medium | 4.37 ± .014  | 3.96 ± .001  | 4.19 ± .012 \n   B    | heavy  | 6.94 ± .015  | 6.35 ± .001  | 6.82 ± .024 \n\nTable 2 provides two important observations. First, the WLU-based COIN outperformed the SPA when both had full knowledge, thereby demonstrating the superiority of the new routing strategy. By not having its routers greedily strive for the shortest paths for their packets, the COIN settles into a more desirable state that reduces the average total delay for all packets. Second, even when the WLU is estimated through a memory-based learner (using only information available to the local routers), the performance of the COIN still surpasses that of the FK SPA. 
This result not only establishes the feasibility of COIN-based routers, but also demonstrates that for this task COINs will outperform any algorithm that can only estimate the shortest path, since the performance of the FK SPA is a ceiling on the performance of any such RL-based SPA. \n\nFigure 2 shows how total delay varies with time for the medium traffic regime (each plot is based on 50 runs). The \"ringing\" is an artifact caused by the starting conditions and the window size (50). Note that for both networks the FK COIN not only provides the shortest delays, but also settles into that solution very rapidly. \n\nFigure 2: Total Delay. (a) Network A. (b) Network B. [Plots of total delay per packet vs. unit time steps for the FK SPA, FK COIN, and MB COIN.] \n\n¹ The results are qualitatively identical for window sizes of 20 and 100, along with total timesteps of 100 and 500. \n\n5 DISCUSSION \n\nMany distributed computational tasks are naturally addressed as recurrent neural networks of reinforcement learning algorithms (i.e., COINs). The difficulty in doing so is ensuring that, despite the absence of centralized communication and control, the reward functions of the separate neurons work in synchrony to foster good global performance, rather than cause their associated neurons to work at cross purposes. \n\nThe mathematical framework synopsized in this paper is a theoretical solution to this difficulty. 
To assess its real-world applicability, we employed it to design a full-knowledge (FK) COIN as well as a memory-based (MB) COIN for the task of packet routing on a network. We compared the performance of those algorithms to that of a FK shortest-path algorithm (SPA). Not only did the FK COIN beat the FK SPA, but the memory-based COIN, despite having only limited knowledge, also beat the full-knowledge SPA. This latter result is all the more remarkable in that the performance of the FK SPA is an upper bound on the performance of previously investigated RL-based routing schemes, which use RL to try to provide accurate knowledge to an SPA. \n\nThere are many directions for future work on COINs, even restricting attention to the domain of packet routing. Within that particular domain, we are currently extending our experiments to larger networks, using industrial event-driven network simulators. Concurrently, we are investigating the use of macrolearning for COIN-based packet routing, i.e., the run-time modification of the neurons' utility functions to improve the subworld-factoredness of the COIN. \n\nReferences \n\n[1] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, submitted, 1996. \n\n[2] E. Baum. Manifesto for an evolutionary economics of intelligence. In C. M. Bishop, editor, Neural Networks and Machine Learning. Springer-Verlag, 1998. \n\n[3] D. Bertsekas and R. Gallager. Data Networks. Prentice Hall, NJ, 1992. \n\n[4] J. Boyan and M. Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. In Advances in Neural Information Processing Systems 6, pages 671-678. Morgan Kaufmann, 1994. \n\n[5] S. P. M. Choi and D. Y. Yeung. Predictive Q-routing: A memory-based reinforcement learning approach to adaptive traffic control. In Advances in Neural Information Processing Systems 8, pages 945-951. 
MIT Press, 1996. \n\n[6] D. Fudenberg and J. Tirole. Game Theory. MIT Press, Cambridge, MA, 1991. \n\n[7] G. Hardin. The tragedy of the commons. Science, 162:1243-1248, 1968. \n\n[8] Y. A. Korilis, A. A. Lazar, and A. Orda. Achieving network optima using Stackelberg routing strategies. IEEE Transactions on Networking, 5(1):161-173, 1997. \n\n[9] P. Marbach, O. Mihatsch, M. Schulte, and J. Tsitsiklis. Reinforcement learning for call admission control and routing in integrated service networks. In Advances in Neural Information Processing Systems 10, pages 922-928. MIT Press, 1998. \n\n[10] D. Subramanian, P. Druschel, and J. Chen. Ants and reinforcement learning: A case study in routing in dynamic networks. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pages 832-838, 1997. \n\n[11] D. Wolpert and K. Tumer. Collective intelligence. In J. M. Bradshaw, editor, Handbook of Agent Technology. AAAI Press/MIT Press, 1999. To appear. \n\n[12] D. Wolpert, K. Wheeler, and K. Tumer. Automated design of multi-agent systems. In Proceedings of the Third International Conference on Autonomous Agents, 1999. To appear. \n\n[13] D. Wolpert, K. Wheeler, and K. Tumer. Collective intelligence for distributed control. 1999. Pre-print. \n", "award": [], "sourceid": 1591, "authors": [{"given_name": "David", "family_name": "Wolpert", "institution": null}, {"given_name": "Kagan", "family_name": "Tumer", "institution": null}, {"given_name": "Jeremy", "family_name": "Frank", "institution": null}]}