{"title": "Multi-agent Cooperation in Diverse Population Games", "book": "Advances in Neural Information Processing Systems", "page_first": 1521, "page_last": 1528, "abstract": null, "full_text": "Multi-agent Cooperation in Diverse Population Games\n\nK. Y. Michael Wong, S. W. Lim and Z. Gao\n\nHong Kong University of Science and Technology, Hong Kong, China.\n\n{phkywong, swlim, zhuogao}@ust.hk\n\nAbstract\n\nWe consider multi-agent systems whose agents compete for resources by striving to be in the minority group. The agents adapt to the environment by reinforcement learning of the preferences of the policies they hold. Diversity of preferences of policies is introduced by adding random biases to the initial cumulative payoffs of their policies. We explain and provide evidence that agent cooperation becomes increasingly important when diversity increases. Analyses of these mechanisms yield excellent agreement with simulations over nine decades of data.\n\n1 Introduction\n\nIn the intelligent control of large systems, the multi-agent approach has the advantages of parallelism, robustness, scalability, and light communication overhead [1]. Since it involves many interacting adaptive agents, the behavior becomes highly complex. While a standard analytical approach is to study their steady-state behavior described by the Nash equilibria [2], it is interesting to consider the dynamics of how the steady state is approached. Of particular interest is the case of heterogeneous agents, which have diversified preferences in decision making. In such cases, the cooperation of agents becomes very important.\n\nSpecifically, we consider the dynamics of a version of large population games which models the collective behavior of agents simultaneously and adaptively competing for limited resources. 
The game is a variant of the Minority Game, in which the agents strive to make the minority decision, thereby balancing the load distributed between the majority and minority choices [3]. Previous work showed that the system behavior depends on the input dimension of the agents' policies. When the policy dimension is too low, many agents share identical policies, and the system suffers from the maladaptive behavior of the agents, meaning that they prematurely rush to adapt to system changes in bursts [4].\n\nRecently, we have demonstrated that a better system efficiency can be attained by introducing diversity [5]. This is done by randomly assigning biases to the initial preferences of the policies of the agents, so that agents sharing common policies may not adopt them at the same time, and maladaptation is reduced. As a result, the population difference between the majority and minority groups decreases. For typical control tasks such as the distribution of shared resources, this corresponds to a high system efficiency. In contrast to the maladaptive regime, in which agents blindly respond to environmental signals, agent cooperation becomes increasingly important in the diverse regime. Namely, there are fewer agents adjusting their policy preferences at each step of the steady state, but there emerges a more coordinated pattern of policy adjustment among them. Hence, it is interesting to study the mechanisms by which they adapt mutually, and their effects on the system efficiency.\n\nIn this paper, we explain the cooperative mechanisms which appear successively when the diversity of the agents' preference of policies increases, as recently proposed in [6]. We will provide experimental evidence of these effects, and sketch their analyses which yield excellent agreement with simulations. 
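The biasing scheme just described (each bias is a sum of R random ±1 draws, as detailed in Section 2) can be sketched in a few lines of Python. This is our own illustrative code, not code from the paper; with an odd number of draws R, every bias is odd, so a preference can never vanish and the dynamics is deterministic:

```python
import random

def draw_bias(R, rng):
    # Bias of a policy relative to the agent's other policy: the sum of
    # R random +/-1 draws. For large R this approaches a Gaussian of
    # mean 0 and variance R; for odd R the sum is always odd, so a
    # preference can never be exactly zero.
    return sum(rng.choice((-1, 1)) for _ in range(R))

rng = random.Random(0)
R = 129                                     # an odd number of draws
biases = [draw_bias(R, rng) for _ in range(1000)]

assert all(b % 2 == 1 for b in biases)      # every bias is odd
assert all(abs(b) <= R for b in biases)     # and bounded by R
```

A histogram of `biases` approaches the Gaussian of variance R analysed in the paper.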
While we focus on the population dynamics of the Minority Game, we expect that the observed cooperative mechanisms are relevant to reinforcement learning in multi-agent systems more generally.\n\n2 The Minority Game\n\nThe Minority Game consists of a population of N agents competing selfishly to maximize their individual utility in an environment of limited resources, N being odd [3]. Each agent makes a decision + or − at each time step, and the minority group wins. For typical control tasks such as resource allocation, the decisions + and − may represent two alternative resources, so that fewer agents utilizing a resource implies more abundance. The decisions of each agent are prescribed by policies, which are binary functions mapping the history of the winning bits of the game in the most recent m steps to decisions + or −. Hence, m is the memory size. Before the game starts, each agent randomly picks s policies out of the set of 2^D policies with replacement, where D ≡ 2^m is the number of input states.\n\nThe long-term goal of an agent is to maximize her cumulative payoff, which is the sum of the undiscounted payoffs received during the game history. For the decision ξ_i(t) of agent i at time t (ξ_i(t) = ±1), the payoff is −ξ_i(t)G(A(t)), where A(t) ≡ Σ_i ξ_i(t)/N, and G(A) satisfies the property sign G(A) = sign A. She tries to achieve her goal by choosing at each step, out of her s policies, the most successful one so far, and outputting her decision accordingly. The success of a policy is measured by its cumulative payoff, updated every step irrespective of whether it is adopted or not. This reinforcement learning provides an agent with adaptivity. Though we only consider random policies instead of organized ones, we expect that the model is sufficient to capture the collective behavior of large population games. 
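The game mechanics described above can be sketched as a minimal simulation. The following is our own illustrative reconstruction (function and variable names are ours, not from the paper), using the step payoff G(A) = sign A considered in this paper and no bias diversity:

```python
import random

def minority_game(N=127, m=1, s=2, T=2000, seed=0):
    # Minimal Minority Game: N (odd) agents, each holding s random
    # policies; a policy maps each of the D = 2**m history states to a
    # decision +1 or -1. Every policy's cumulative payoff is updated at
    # every step, adopted or not, with the step payoff -xi * sign(A);
    # each agent plays her currently best policy (ties -> lower index).
    rng = random.Random(seed)
    D = 2 ** m
    policies = [[tuple(rng.choice((-1, 1)) for _ in range(D))
                 for _ in range(s)] for _ in range(N)]
    payoffs = [[0] * s for _ in range(N)]
    state = 0                      # index encoding the last m winning bits
    A_history = []
    for _ in range(T):
        decisions = [policies[i][max(range(s),
                                     key=lambda a: (payoffs[i][a], -a))][state]
                     for i in range(N)]
        A = sum(decisions) / N     # A(t) = sum_i xi_i(t)/N; never 0 for odd N
        A_history.append(A)
        sgn = 1 if A > 0 else -1
        for i in range(N):
            for a in range(s):
                payoffs[i][a] -= policies[i][a][state] * sgn
        win = 1 if sgn < 0 else 0  # the winning bit is the minority decision
        state = ((state << 1) | win) % D
    return A_history

A = minority_game()
assert len(A) == 2000 and all(a != 0 for a in A)
```

The time series `A` is what the variance in Eq. (1) below is computed from; adding the bias construction of Section 2 to `payoffs` reproduces the diverse regime.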
In this paper, we consider a step payoff function, G(A) = sign A. The cumulative payoffs then take integer values. Note that an agent gains in payoff when she makes a decision opposite to A(t), and loses otherwise, reflecting the winning of the minority group.\n\nIt is natural to consider systems with diverse preferences of policies [5]. This means that the initial cumulative payoffs of policies α (α = 1, ..., s − 1) of agent i with respect to her sth policy have random biases ω_iα. Diversity is important in reducing the maladaptive behavior of the agents, since otherwise the same policy of all agents accumulates the same payoffs, and would be adopted at the same time. In this paper, we consider the case s = 2, and the biases are the sums of ±1 randomly drawn R times. In particular, when R is not too small, the bias distribution approaches a Gaussian distribution with mean 0 and variance R. The ratio ρ ≡ R/N is referred to as the diversity. For odd R, no two policies have the same cumulative payoffs throughout the process, and the dynamics is deterministic, resulting in highly precise simulation results useful for refined comparison with theories.\n\nThe population averages of the decisions oscillate around 0 at the steady state. Since a large difference between the majority and minority populations implies inefficient resource allocation, the inefficiency of the system is often measured by the variance σ²/N of the population making decision +, and is given by\n\nσ²/N ≡ (N/4) ⟨[A^{μ*(t)}(t) − ⟨A^{μ*(t)}(t)⟩_t]²⟩_t,   (1)\n\nwhere ⟨...⟩_t denotes the time average at the steady state. Its dependence on the diversity is shown in Fig. 1.\n\n[Figure 1: log-log plots of σ²/N against the diversity ρ; panel (a) shows N = 127, 511, 2047, 8191 and 32767; panel (b) shows N = 127, 511 and 2047.]\n\nFigure 1: (a) The dependence of the variance of the population making decision + on the diversity at m = 1 and s = 2. Symbols: simulation results averaged over 1,024 samples of initial conditions. Lines: theory. Dash-dotted line: scaling prediction. (b) Comparison between simulation results (symbols), theory with kinetic sampling only (dashed lines), one-wait approximation (dash-dotted lines), and many-wait approximation (lines).\n\nSeveral modes of agent cooperation can be identified, and are explained in the following sections.\n\n3 Statistical Cooperation\n\nFor each curve with a given N in Fig. 1(a), and besides the first few data points where ρ ∼ N^{−1} and σ²/N ∼ N, the behavior of the variance is dominated by the scaling relation σ²/N ∼ ρ^{−1} for ρ ≳ 1. To interpret this result, we describe the macroscopic dynamics of the system by defining the D-dimensional vector A^μ(t), which is the sum of the decisions of all agents responding to history μ of their policies, normalized by N. While only one of the D components corresponds to the historical state μ*(t) of the system, the augmentation to D components is necessary to describe the attractor structure and the transient behavior of the system dynamics.\n\nThe key to analysing the system dynamics is the observation that the cumulative payoffs of all policies displace by exactly the same amount when the game proceeds. 
Hence for a given pair of policies, the profile of the relative cumulative payoff distribution remains unchanged, but the peak position shifts with the game dynamics. Let us consider the change in A^μ(t) when μ is the historical state μ*(t). We let S_{αβ}(ω) be the number of agents holding policies α and β (with α < β), where the bias of α with respect to β is ω. If the cumulative payoff of policy α at time t is Ω_α(t), then the agents holding policies α and β make decisions according to policy α if ω + Ω_α(t) − Ω_β(t) > 0, and policy β otherwise. Hence ω + Ω_α(t) − Ω_β(t) is referred to as the preference of α with respect to β. At time t, the cumulative payoff of policy α changes from Ω_α(t) to Ω_α(t) − ξ^μ_α sign A^μ(t), where ξ^μ_α is the decision of policy α at state μ. Only the fickle agents, that is, those agents with preferences on the verge of switching signs, contribute to the change in A^μ(t), namely, those with ω + Ω_α(t) − Ω_β(t) = ±1 and ξ^μ_α − ξ^μ_β = ±2 sign A^μ(t). Hence we have\n\nA^μ(t + 1) − A^μ(t) = −(2/N) sign A^μ(t) Σ_{α<β} Σ_{r=±1} S_{αβ}(r − Ω_α(t) + Ω_β(t)) δ(ξ^μ_α − ξ^μ_β − 2r sign A^μ(t)),   (2)\n\nwhere δ(n) = 1 if n = 0, and 0 otherwise. 
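The switching condition behind Eq. (2) can be made concrete with a toy update step. This is our own sketch (names and data layout are ours): only agents whose preference sits at ±1 and whose two policies disagree in the right way at the current state contribute, each moving A^μ against its sign by 2/N:

```python
def fickle_step(agents, sign_A):
    # One update of Eq. (2) at a single state mu. Each agent is a tuple
    # (omega, Omega_a, Omega_b, xi_a, xi_b): bias, cumulative payoffs,
    # and the two policies' decisions at the current state. The agent
    # plays policy a if omega + Omega_a - Omega_b > 0, else policy b.
    # Returns N times the change in A^mu.
    dA_N = 0
    for omega, Oa, Ob, xa, xb in agents:
        pref = omega + Oa - Ob
        # Fickle agents: preference at +/-1 and about to reverse sign,
        # which requires xi_a - xi_b = 2 * pref * sign(A).
        if pref in (1, -1) and xa - xb == 2 * pref * sign_A:
            dA_N -= 2 * sign_A   # each switch moves A^mu against sign(A)
    return dA_N

# Two fickle agents and one agent far from switching, with sign(A) = +1:
agents = [(1, 0, 0, 1, -1),    # pref = +1, xi_a - xi_b = +2 -> switches
          (-1, 0, 0, -1, 1),   # pref = -1, xi_a - xi_b = -2 -> switches
          (3, 0, 0, 1, -1)]    # pref = +3 -> not on the verge
assert fickle_step(agents, 1) == -4
```

With sign(A) = −1 none of these agents satisfies the condition, so the same population gives a zero step.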
[Figure 2: (a) phase-space diagram in the (A⁺, A⁻) plane showing states P, Q and R, with approach probabilities 1/8, 1/8, 3/8 and 3/8; (b) average number of agents against preference, with a peak near preference −1 on an axis running from −3001 to 3001.]\n\nFigure 2: (a) The attractor in the Minority Game with m = 1, following the period-4 sequence of P-Q-R-Q in the phase space of A⁺ and A⁻. There are 4 approaches to the attractor indicated by the arrows, and the respective probabilities are obtained by considering the detailed dynamics from the different initial positions and states. (b) Experimental evidence of the kinetic sampling effect: steady-state preference dependence of the average number of agents holding the identity policy and its complement, immediately before state Q enters state R, at ρ = N = 1,023 and averaged over 100,000 samples of initial conditions.\n\nIn the region where D ≪ ln N, we have S_{αβ}(ω) ≫ 1, and Eq. (2) is self-averaging. Following the derivation in [5], we arrive at\n\nA^μ(t + 1) = A^μ(t) − √(2/πR) sign A^μ(t) δ(μ − μ*(t)).   (3)\n\nEquation (3) shows that the dynamics proceeds in the direction which reduces the magnitude of the population vector, each time by a step of size √(2/πR). At the steady state, each component oscillates between positive and negative values, as shown in the example of m = 1 in Fig. 2(a). Due to the maladaptive nature of the dynamics, it never reaches the zero value. As a result, each state is confined in a D-dimensional hypercube of size √(2/πR), irrespective of the initial position of the population vector. This confinement enables us to compute the variance of the decisions, given by σ²/N = f(ρ)/(2πρ), where f(ρ) is a smooth function of ρ which approaches (1 − 1/(4D))/3 for ρ ≫ 1. 
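The confinement estimate above can be evaluated directly. The helper below is our own code (not from the paper), using the quoted large-ρ limit f → (1 − 1/(4D))/3:

```python
import math

def variance_prediction(rho, m):
    # Scaling estimate sigma^2/N = f(rho)/(2*pi*rho), using the
    # large-rho limit f -> (1 - 1/(4D))/3 with D = 2**m input states.
    D = 2 ** m
    f = (1 - 1 / (4 * D)) / 3
    return f / (2 * math.pi * rho)

# For m = 1 (D = 2), f = 7/24 and the estimate reduces to 7/(48*pi*rho),
# the same low-diversity limit as quoted from Eq. (8) in Section 4:
rho = 10.0
assert abs(variance_prediction(rho, 1) - 7 / (48 * math.pi * rho)) < 1e-15
```

The 1/ρ decay of this estimate is the scaling relation seen in Fig. 1(a).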
The physical picture of this scaling relation comes from the broadening of the preference distribution due to bias diversity. The fraction of fickle agents at every time step consists of those who have ±1 preferences, which scales as the height of the bias distribution near its center. Since the distribution is a Gaussian with standard deviation √R, the step sizes scale as 1/√R, and the variance σ²/N as ρ^{−1}. The scaling relation shows that agent cooperation in this regime is described at the level of statistical distributions of policy preferences, since the number of agents making an adaptive move at each step is sufficiently numerous (∼ √N).\n\n4 Kinetic Sampling\n\nAs shown in Fig. 1(a), σ²/N deviates above the scaling with ρ^{−1} when ρ ∼ N. To consider the origin of this deviation, we focus in Fig. 2(b) on how the average number of agents who hold the identity policy with ξ^μ_α = μ and its complementary policy with ξ^μ_β = −μ depends on the preference ω + Ω_α − Ω_β, when the system reaches the steady state in games with m = 1. Since the preferences are time dependent, we sample their frequencies at a fixed time, say, immediately before the state changes from Q to R in Fig. 2(a). One would expect that the bias distribution is reproduced. However, we find that a peak exists at ω + Ω_α − Ω_β = −1. This value of the preference corresponds to that of the attractor step from Q to R when, at state −, decision + loses and decision − wins, and ω + Ω_α − Ω_β changes from −1 to +1. The peak at the attractor step shows that its average size is self-organized to be larger than those of the transient steps described by the background distribution.\n\nThis effect that favors the cooperation of larger clusters of agents is referred to as the kinetic sampling effect. When ρ ∼ N, A^μ(t + 1) − A^μ(t) scales as N^{−1} and is no longer self-averaging. Rather, Eq. (2) shows that it is equal to 2/N times the number of fickle agents at time t, which is Poisson distributed with a mean of N/√(2πR) = Δ/2, where Δ ≡ N√(2/πR) is the average step size. However, since the attractor is formed by steps which reverse the sign of A^μ, the average step size in the attractor is larger than that in the transient state, because a long jump in the vicinity of the attractor is more likely to get trapped.\n\nTo describe this effect, we consider the probability P_att(ΔA) of step sizes ΔA in the attractor (with ΔA^μ > 0 for all μ). Assuming that all states of the phase space are equally likely to be accessed, we have P_att(ΔA) = Σ_A P_att(ΔA, A), where P_att(ΔA, A) is the probability of finding the position A with displacement ΔA in the attractor. Consider the example of m = 1, where there is only one step along each axis A^μ. The sign reversal condition implies that P_att(ΔA, A) ∝ P_Poi(ΔA) Π_μ Θ[−A^μ(A^μ + ΔA^μ)], where Θ(x) is the step function of x, and P_Poi(ΔA) is the Poisson distribution of step sizes, yielding P_att(ΔA) ∝ P_Poi(ΔA) Π_μ ΔA^μ. We note that the extra factors of ΔA^μ favor larger step sizes. 
Thus, the attractor averages ⟨(ΔA^±)²⟩_att are given by\n\n⟨(ΔA^±)²⟩_att = ⟨(ΔA^±)² ΔA⁺ΔA⁻⟩_Poi / ⟨ΔA⁺ΔA⁻⟩_Poi.   (4)\n\nThere are agents who contribute to both ΔA⁺ and ΔA⁻, giving rise to their correlations. In Eq. (2), the policies of the agents contributing to ΔA⁺ and ΔA⁻ satisfy ξ⁺_α − ξ⁺_β = −2r and ξ⁻_α − ξ⁻_β = 2r respectively. Among the agents contributing to ΔA⁺, the extra requirement of ξ⁻_α − ξ⁻_β = 2r implies that an average of 1/4 of them also contribute to ΔA⁻. Hence, the number of agents contributing to both steps is a Poisson variable with mean Δ/8, and those exclusive to the individual steps are Poisson variables with mean 3Δ/8. This yields, for example,\n\n⟨ΔA⁺ΔA⁻⟩_Poi = (4/N²) Σ_{a0,a+,a-} [e^{−Δ/8} (Δ/8)^{a0}/a0!] [e^{−3Δ/8} (3Δ/8)^{a+}/a+!] [e^{−3Δ/8} (3Δ/8)^{a-}/a-!] (a0 + a+)(a0 + a-).   (5)\n\nTogether with similar expressions for the numerator in Eq. (4), we obtain\n\n⟨(ΔA^±)²⟩_att = (2Δ³ + 15Δ² + 20Δ + 4) / [N²(2Δ + 1)].   (6)\n\nThe attractor states are given by A^μ = m^μ/N and m^μ/N − ΔA^μ, where m^μ = 1, 3, ..., NΔA^μ − 1. This yields a variance of\n\nσ²/N = [7⟨(NΔA⁺)²⟩_att + 7⟨(NΔA⁻)²⟩_att − 8] / (192N),   (7)\n\nwhich gives, on combining with Eq. (6),\n\nσ²/N = (14Δ³ + 105Δ² + 132Δ + 24) / [96N(2Δ + 1)].   (8)\n\nWhen the diversity is low, Δ ≫ 1, and Eq. (8) reduces to σ²/N = 7/(48πρ), agreeing with the scaling result of the previous section. When ρ ∼ N, Eq. (8) has excellent agreement with the simulation results, which significantly deviate above the scaling relation.\n\n5 Waiting Effect\n\nAs shown in Fig. 1(b), σ²/N further deviates above the predictions of kinetic sampling when ρ ≫ N. To study the origin of this effect, we consider the example of m = 1. As shown in Fig. 2(a), the attractor consists of hops along both of the A^± axes. Analysis shows that only those agents holding the identity policy and its complement can complete both hops after they have adjusted their preferences to ω + Ω_α − Ω_β = ±1. Since there are fewer and fewer fickle agents in the limit ρ ≫ N, one would expect that a single agent of this type would dominate the game dynamics, and σ²/N would approach 0.25/N, as also predicted by Eq. (8).\n\nHowever, attractors having 2 fickle agents are about 10 times more common in the extremely diverse limit. As illustrated in Fig. 3(a) for a typical case, one of the two agents first arrives at the status of ±1 preference of her policies and stays there waiting. Meanwhile, the preference of the second agent is steadily reduced. Once she has arrived at the status of ±1 preference of her policies, both agents can then cooperate to complete the dynamics of the attractor. In this example, neither agent belongs to the correct type that can complete the dynamics alone, but waiting is crucial for them to complete the hops in the attractor, even though one would expect that the probability of finding more than one fickle agent at a time step is drastically less than that for one. 
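The kinetic sampling closed forms can be checked directly against the defining Poisson averages (4)-(5) by truncating the sums; the script below is our own verification sketch (notation as above, `kmax` is a truncation cutoff we introduce):

```python
import math

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu ** k / math.factorial(k)

def attractor_second_moment(delta, kmax=20):
    # N^2 <(dA+-)^2>_att from Eqs. (4)-(5): the Poisson average of
    # 2*(x^2 + y^2), reweighted by the kinetic sampling factor x*y,
    # where x = a0 + a+ and y = a0 + a- count the fickle agents
    # contributing to the two attractor steps.
    num = den = 0.0
    for a0 in range(kmax):
        p0 = poisson_pmf(a0, delta / 8)
        for ap in range(kmax):
            pp = poisson_pmf(ap, 3 * delta / 8)
            for am in range(kmax):
                pm = poisson_pmf(am, 3 * delta / 8)
                w = p0 * pp * pm
                x, y = a0 + ap, a0 + am
                num += w * 2 * (x * x + y * y) * x * y
                den += w * x * y
    return num / den

delta = 1.0
lhs = attractor_second_moment(delta)
rhs = (2 * delta**3 + 15 * delta**2 + 20 * delta + 4) / (2 * delta + 1)
assert abs(lhs - rhs) < 1e-9   # reproduces Eq. (6)
# Substituting into Eq. (7) with both axes equal reproduces N times Eq. (8):
assert abs((14 * lhs - 8) / 192
           - (14 * delta**3 + 105 * delta**2 + 132 * delta + 24)
           / (96 * (2 * delta + 1))) < 1e-9
```

Both assertions pass for other values of Δ as well, confirming that the reweighting by ΔA⁺ΔA⁻ is all that separates Eq. (8) from the plain scaling result.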
Thus, the composition of the group of fickle agents is self-organized through this waiting effect, and consequently the step sizes and variance increase above those predicted by kinetic sampling.\n\nThe analysis of the waiting effect is lengthy. Here the agents are so diverse that the average step size approaches 0. At each state in the phase space, the system remains stationary for many time steps, waiting for some agent to reduce the magnitude of her preference until policy switching can take place. For illustration, we sketch the approximation of including up to one wait. As shown in Fig. 2(a), the attractor may be approached from the arm (P or R) or from the corner (Q). Consider the case of the state approaching from P, waiting up to k times at Q to move to R, and ending the transient dynamics thereafter. Then the cumulative payoffs of a policy α can be written as Ω_α + ξ⁺_α at P, then Ω_α, ..., Ω_α − kξ⁻_α at Q and, in the attractor of period 4, repeating the sequence of Ω_α − kξ⁻_α − ξ⁻_α at R, Ω_α − kξ⁻_α at Q, Ω_α − kξ⁻_α + ξ⁺_α at P, and Ω_α − kξ⁻_α at Q. The movement of the cumulative payoffs can be conveniently represented by writing Ω_α = Σ_μ k^μ ξ^μ_α, where k^μ denotes the number of wins minus losses of decision 1 at state μ in the game history. For m = 1, these steps are plotted in the space of k⁺ and k⁻ in Fig. 3(b).\n\nThe size of each step is 2/N times the number of fickle agents at that step, which is Poisson distributed with average Δ/2. The average numbers of agents appearing simultaneously in different steps positioned along the directions k⁺ ± k⁻ = constant and k^± = constant are, respectively, Δ/8 and Δ/4, and 0 for other directions. Thus, the average numbers of agents common to the pairs of steps {PQ, QQ1}, {QQk, QP}, {QP, QR} and {PQ, QP} are Δ/8, Δ/8, Δ/8 and Δ/4 respectively. 
The rest of the combinations of steps are uncorrelated. The numbers of agents involved in the steps are described in Table 1.\n\n[Figure 3: (a) preference of the two agents against time, on axes running from −2000 to 2000 and 0 to 1200; (b) the steps PQ, QQ1, QQr, QQk, QR and QP in the (k⁺, k⁻) plane.]\n\nFigure 3: (a) Experimental evidence of the waiting effect: a typical example of the evolution of the preference of the 2 agents switching policies at the attractor in a game with m = 1, N = 127, and R = 2²⁴ − 1. The system converges to the attractor at t = 1,086. (b) The space of k⁺ and k⁻ describing the movement of the cumulative payoffs in the game with m = 1. Thick arrows: non-vanishing steps. Thin arrows: waiting steps. Thick double arrows: attractor steps. The dashed lines link those steps that share common agents.\n\nThe variance of the step sizes is given by\n\n⟨(1/2)[(ΔA⁺)² + (ΔA⁻)²]⟩_att = Σ_j Σ_{i=0,1} ⟨(1/2)[(ΔA⁺)² + (ΔA⁻)²] ΔA⁺ΔA⁻⟩_{i,j} / Σ_j Σ_{i=0,1} ⟨ΔA⁺ΔA⁻⟩_{i,j},   (9)\n\nwhere j = arm or corner. The variance of decisions can then be obtained from Eq. (7). For illustration, we consider the derivation of the Poisson average ⟨ΔA⁺ΔA⁻⟩ for the one-wait arm approach. 
Noting that the only non-vanishing steps are PQ, QR and QP, we obtain\n\n⟨ΔA⁺ΔA⁻⟩_{1,arm} = (4/N²) Σ_{k=1}^∞ ⟨[1 − δ(a_i)δ(a_turn,1)δ(a_cum)] δ(a_turn,1) [Π_{r=1}^k δ(a_wait,r)] δ(a_turn,2) (a- + a0)(a_cum + a_turn,2 + a0)⟩_Poi = (4/N²) [1/(1 − e^{−Δ/2})] {e^{−Δ/2} [12(Δ/8)² + Δ/8] − e^{−7Δ/8} [4(Δ/8)² + Δ/8]}.   (10)\n\nWe note that the number a0 accounts for the agents who contribute to both steps in the attractor, and thus can complete the attractor dynamics alone in the extremely diverse limit. On the other hand, the number a_cum arises from the first step PQ arriving at Q. Once present, it will appear in the attractor step QP, irrespective of the duration k of the wait at Q. These a_cum agents can wait to complete the attractor dynamics together with the a- agents who contribute independently to the step from Q to R, as well as the a0 agents who contribute to both attractor steps. As a result, the average step size increases due to this waiting effect. In the former case, cooperation between individual types of agents becomes indispensable in reaching the steady-state behavior.\n\nTable 1: The number of fickle agents in the steps of one wait.\n\nLabel | Steps | No. of agents | Poisson averages\nPQ | Ω_α + ξ⁺_α → Ω_α | a_i + a_turn,1 + a_cum | ⟨a_i⟩ = Δ/8, ⟨a_turn,1⟩ = Δ/8, ⟨a_cum⟩ = Δ/4\nQQ1 | Ω_α → Ω_α − ξ⁻_α | a_wait,1 + a_turn,1 | ⟨a_wait,1⟩ = 3Δ/8\nQQr | Ω_α − (r − 1)ξ⁻_α → Ω_α − rξ⁻_α | a_wait,r | ⟨a_wait,r⟩ = Δ/2 (2 ≤ r ≤ k − 1)\nQQk | Ω_α − (k − 1)ξ⁻_α → Ω_α − kξ⁻_α | a_wait,k + a_turn,2 | ⟨a_wait,k⟩ = 3Δ/8, ⟨a_turn,2⟩ = Δ/8\nQR | Ω_α − kξ⁻_α → Ω_α − (k + 1)ξ⁻_α | a- + a0 | ⟨a-⟩ = 3Δ/8, ⟨a0⟩ = Δ/8\nQP | Ω_α − kξ⁻_α → Ω_α − kξ⁻_α + ξ⁺_α | a_cum + a_turn,2 + a0 | (means as above)\n\nOther Poisson averages in Eq. (9) can be derived similarly. As shown in Fig. 1(b), the waiting effect causes the variance to increase beyond the kinetic sampling prediction, agreeing with the trend of the simulation results. In particular, the variance approaches 0.34/N in the extremely diverse limit, significantly greater than the limit of 0.25/N in the absence of waiting effects. Further approximation including multiple waiting steps results in theoretical curves with excellent agreement with the simulation results, as shown in Fig. 1(b). In the extremely diverse limit, the theoretical predictions approach 0.42/N, very close to the simulation result of 0.43/N.\n\n6 Conclusion\n\nWe have studied the dynamical mechanisms of cooperation, which emerges automatically in a multi-agent system with adaptive agents competing selfishly for finite resources. At low diversity, agent cooperation proceeds at the statistical level, resulting in the scaling relation of the variance with diversity. At high diversity, when kinetic sampling becomes significant, we find that the attractor dynamics favors the cooperation of larger clusters of agents. In extremely diverse systems, we further discover a waiting mechanism, in which agents who are unable to complete the attractor dynamics alone wait for other agents to collaborate with them. When waiting is present, cooperation between individual types of agents becomes indispensable in reaching the steady state behavior. 
Together, these mechanisms yield theoretical predictions of the population variance in excellent agreement with simulations over nine decades of data.\n\nWe expect that the observed mechanisms of agent cooperation can be found in reinforcement learning of multi-agent systems in general, due to their generic nature. The mechanisms of statistical cooperation, kinetic sampling and waiting illustrate the importance of dynamical considerations in describing the system behavior, and the capability of multi-agent systems to self-organize in their collective dynamics. In particular, it is interesting to note that given enough waiting time, agents with limited abilities can cooperate to achieve dynamics unachievable by individuals. This is relevant to evolutionary approaches to multi-agent control, since it allows limited changes to accumulate into bigger improvements.\n\nAcknowledgments\n\nWe thank C. H. Yeung, Y. S. Ting and B. H. Wang for fruitful discussions. This work is supported by the Research Grant Council of Hong Kong (HKUST6153/01P, HKUST6062/02P) and DAG04/05.SC25.\n\nReferences\n\n[1] G. Weiß and S. Sen, Adaption and Learning in Multi-agent Systems, Lecture Notes in Computer Science 1042 (Springer, Berlin, 1995).\n\n[2] E. Rasmusen, Games and Information (Basil Blackwell, Oxford, 2001).\n\n[3] D. Challet and Y. C. Zhang, Emergence of Cooperation and Organization in an Evolutionary Game, Physica A 246, pp. 407-418 (1997).\n\n[4] R. Savit, R. Manuca, and R. Riolo, Adaptive Competition, Market Efficiency, and Phase Transitions, Phys. Rev. Lett. 82, pp. 2203-2206 (1999).\n\n[5] K. Y. M. Wong, S. W. Lim, and P. Luo, Diversity and Adaptation in Large Population Games, Int. J. Mod. Phys. B 18, pp. 2422-2431 (2004).\n\n[6] K. Y. M. Wong, S. W. Lim, and Z. Gao, Dynamical Mechanisms of Adaptation in Multi-agent Systems, Phys. Rev. E 70, 025103(R) (2004).", "award": [], "sourceid": 2687, "authors": [{"given_name": "K.", "family_name": "Wong", "institution": null}, {"given_name": "S.", "family_name": "Lim", "institution": null}, {"given_name": "Z.", "family_name": "Gao", "institution": null}]}