{"title": "Beating a Defender in Robotic Soccer: Memory-Based Learning of a Continuous Function", "book": "Advances in Neural Information Processing Systems", "page_first": 896, "page_last": 902, "abstract": null, "full_text": "Beating a Defender in Robotic Soccer: \n\nMemory-Based Learning of a Continuous \n\nFUnction \n\nPeter Stone \n\nDepartment of Computer Science \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nManuela Veloso \n\nDepartment of Computer Science \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nAbstract \n\nLearning how to adjust to an opponent's position is critical to \nthe success of having intelligent agents collaborating towards the \nachievement of specific tasks in unfriendly environments. This pa(cid:173)\nper describes our work on a Memory-based technique for to choose \nan action based on a continuous-valued state attribute indicating \nthe position of an opponent. We investigate the question of how an \nagent performs in nondeterministic variations of the training situ(cid:173)\nations. Our experiments indicate that when the random variations \nfall within some bound of the initial training, the agent performs \nbetter with some initial training rather than from a tabula-rasa. \n\n1 \n\nIntroduction \n\nOne of the ultimate goals subjacent to the development of intelligent agents is to \nhave multiple agents collaborating in the achievement of tasks in the presence of \nhostile opponents. Our research works towards this broad goal from a Machine \nLearning perspective. We are particularly interested in investigating how an intel(cid:173)\nligent agent can choose an action in an adversarial environment. We assume that \nthe agent has a specific goal to achieve. We conduct this investigation in a frame(cid:173)\nwork where teams of agents compete in a game of robotic soccer. The real system \nof model cars remotely controlled from off-board computers is under development . 
\nOur research is currently conducted in a simulator of the physical system. \n\nBoth the simulator and the real-world system are based closely on systems de(cid:173)\nsigned by the Laboratory for ComputationalIntelligence at the University of British \nColumbia [Sahota et a/., 1995, Sahota, 1993]. The simulator facilitates the control \nof any number of cars and a ball within a designated playing area. Care has been \ntaken to ensure that the simulator models real-world responses (friction, conserva-\n\n\fMemory-based Learning of a Continuous Function \n\n897 \n\ntion of momentum, etc.) as closely as possible. Figure l(a) shows the simulator \ngraphics. \n\n-\n\nIJ \n\nI \n\n- <0 \n\n0 \n\n~ \n\nIJ \n\n(j -\n\n~ \n\n<:P \n\n(a) \n\n(b) \n\nFigure 1: (a) the graphic view of our simulator. (b) The initial position for all \nof the experiments in this paper. The teammate (black) remains stationary, the \ndefender (white) moves in a small circle at different speeds, and the ball can move \neither directly towards the goal or towards the teammate. The position of the ball \nrepresents the position of the learning agent. \n\nWe focus on the question of learning to choose among actions in the presence of \nan adversary. This paper describes our work on applying memory-based supervised \nlearning to acquire strategy knowledge that enables an agent to decide how to \nachieve a goal. For other work in the same domain, please see [Stone and Veloso , \n1995b]. For an extended discussion of other work on incremental and memory(cid:173)\nbased learning [Aha and Salzberg, 1994, Kanazawa, 1994, Kuh et al., 1991, Moore, \n1991, Salganicoff, 1993, Schlimmer and Granger , 1986, Sutton and Whitehead, 1993, \nWettschereck and Dietterich, 1994, Winstead and Christiansen, 1994], particularly \nas it relates to this paper, please see [Stone and Veloso, 1995a]. \n\nThe input to our learning task includes a continuous-valued range of the position \nof the adversary. 
This raises the question of how to discretize the space of values into a set of learned features. Due to the cost of learning and reusing a large set of specialized instances, we notice a clear advantage to having an appropriate degree of generalization. For more details please see [Stone and Veloso, 1995a]. \n\nHere, we address the issue of the effect of differences between past episodes and the current situation. We performed extensive experiments, training the system under particular conditions and then testing it (with learning continuing incrementally) in nondeterministic variations of the training situation. Our results show that when the random variations fall within some bound of the initial training, the agent performs better with some initial training than from a tabula rasa. This intuitive fact is interestingly well-supported by our empirical results. \n\n2 Learning Method \n\nThe learning method we develop here applies to an agent trying to learn a function with a continuous domain. We situate the method in the game of robotic soccer. \n\nWe begin each trial by placing a ball and a stationary car acting as the \"teammate\" in specific places on the field. Then we place another car, the \"defender,\" in front of the goal. The defender moves in a small circle in front of the goal at some speed and begins at some random point along this circle. The learning agent must take one of two possible actions: shoot straight towards the goal, or pass to the teammate so that the ball will rebound towards the goal. A snapshot of the experimental setup is shown graphically in Figure 1(b). \n\nThe task is essentially to learn two functions, each with one continuous input variable, namely the defender's position. 
Based on this position, which can be represented unambiguously as the angle at which the defender is facing, \u03c6, the agent tries to learn the probability of scoring when shooting, Ps*(\u03c6), and the probability of scoring when passing, Pp*(\u03c6).[1] If these functions were learned completely, which would only be possible if the defender's motion were deterministic, then both functions would be binary partitions: Ps*, Pp* : [0.0, 360.0) \u2192 {-1, 1}.[2] That is, the agent would know without doubt for any given \u03c6 whether a shot, a pass, both, or neither would achieve its goal. However, since the agent cannot have had experience for every possible \u03c6, and since the defender may not move at the same speed each time, the learned functions must be approximations: Ps, Pp : [0.0, 360.0) \u2192 [-1.0, 1.0]. In order to enable the agent to learn approximations to the functions Ps* and Pp*, we gave it a memory in which it could store its experiences and from which it could retrieve its current approximations Ps(\u03c6) and Pp(\u03c6). We explored and developed appropriate methods of storing to and retrieving from memory and an algorithm for deciding what action to take based on the retrieved values. \n\n2.1 Memory Model \n\nStoring every individual experience in memory would be inefficient both in terms of the amount of memory required and in terms of generalization time. Therefore, we store Ps and Pp only at discrete, evenly-spaced values of \u03c6. That is, for a memory of size M (with M dividing evenly into 360 for simplicity), we keep values of Pp(\u03b8) and Ps(\u03b8) for \u03b8 \u2208 {360n/M | 0 \u2264 n < M}. We store memory as an array \"Mem\" of size M such that Mem[n] has values for both Pp(360n/M) and Ps(360n/M). 
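As a concrete illustration, the memory array described above can be sketched as follows. This is a minimal sketch in Python, not the authors' implementation; the dictionary layout and the name slot_angle are our own.

```python
# Minimal sketch of the fixed-size memory "Mem" described above.
# Each of the M evenly spaced slots represents the angle theta = 360*n/M
# and holds running sums for both actions: "s" (shoot) and "p" (pass).
# The field names mirror total-a-results / positive-a-results in the text.

M = 18  # high generalization; the paper also uses M = 360

Mem = [{"s": {"total": 0.0, "positive": 0.0},
        "p": {"total": 0.0, "positive": 0.0}} for _ in range(M)]

def slot_angle(n):
    """Angle theta (in degrees) represented by memory slot n."""
    return 360.0 * n / M
```

With M = 18 each slot stands for a 20-degree band of defender positions; with M = 360 the bands shrink to 1 degree.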
\nUsing a fixed memory size precludes using memory-based techniques such as K-Nearest-Neighbors (kNN) and kernel regression, which require that every experience be stored, choosing the most relevant only at decision time. Most of our experiments were conducted with memories of size 360 (low generalization) or of size 18 (high generalization), i.e., M = 360 or M = 18. The memory size had a large effect on the rate of learning [Stone and Veloso, 1995a]. \n\n2.1.1 Storing to Memory \n\nWith M discrete memory storage slots, the problem then arises as to how a specific training example should be generalized. Training examples are represented here as E_\u03c6,a,r, consisting of an angle \u03c6, an action a, and a result r, where \u03c6 is the initial position of the defender, a is \"s\" or \"p\" for \"shoot\" or \"pass,\" and r is \"1\" or \"-1\" for \"goal\" or \"miss\" respectively. For instance, E_72.345,p,1 represents a pass resulting in a goal for which the defender started at position 72.345\u00b0 on its circle. Each experience with \u03b8 - 360/2M \u2264 \u03c6 < \u03b8 + 360/2M affects Mem[\u03b8] in proportion to the distance |\u03b8 - \u03c6|. In particular, Mem[\u03b8] keeps running sums of the magnitudes of scaled results, Mem[\u03b8].total-a-results, and of scaled positive results, Mem[\u03b8].positive-a-results, affecting Pa(\u03b8), where \"a\" stands for \"s\" or \"p\" as before. Then at any given time, Pa(\u03b8) = -1 + 2 * (positive-a-results / total-a-results). The \"-1\" is for the lower bound of our probability range, and the \"2*\" is to scale the result to this range. \n\n[1] As per convention, P* represents the target (optimal) function. \n\n[2] Although we think of Ps* and Pp* as functions from angles to probabilities, we will use -1 rather than 0 as the lower bound of the range. This representation simplifies many of our illustrative calculations. 
Call this our adaptive memory storage technique: \n\nAdaptive Memory Storage of E_\u03c6,a,r in Mem[\u03b8] \n\n\u2022 r' = r * (1 - |\u03c6 - \u03b8| / (360/M)) \n\u2022 Mem[\u03b8].total-a-results += |r'| \n\u2022 If r' > 0 Then Mem[\u03b8].positive-a-results += r' \n\u2022 Pa(\u03b8) = -1 + 2 * (positive-a-results / total-a-results) \n\nFor example, E_110,p,1 would set both total-p-results and positive-p-results for Mem[120] (and Mem[100]) to 0.5 and consequently Pp(120) (and Pp(100)) to 1.0. But then E_125,p,-1 would increment total-p-results for Mem[120] by .75, while leaving positive-p-results unchanged. Thus Pp(120) becomes -1 + 2 * (0.5/1.25) = -.2. \n\nThis method of storing to memory is effective both for time-varying concepts and for concepts involving random noise. It is able to deal with conflicting examples within the range of the same memory slot. \n\nNotice that each example influences 2 different memory locations. This memory storage technique is similar to the kNN and kernel regression function approximation techniques, which estimate f(\u03c6) based on f(\u03b8), possibly scaled by the distance from \u03b8 to \u03c6, for the k nearest values of \u03b8. In our linear continuum of defender position, our memory generalizes training examples to the 2 nearest memory locations.[3] \n\n2.1.2 Retrieving from Memory \n\nSince individual training examples affect multiple memory locations, we use a simple technique for retrieving Pa(\u03c6) from memory when deciding whether to shoot or to pass. We round \u03c6 to the nearest \u03b8 for which Mem[\u03b8] is defined, and then take Pa(\u03b8) as the value of Pa(\u03c6). Thus, each Mem[\u03b8] represents Pa(\u03c6) for \u03b8 - 360/2M \u2264 \u03c6 < \u03b8 + 360/2M. 
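The storage and retrieval rules above can be sketched in Python as follows. This is our own sketch, not the authors' code; the helper names store_example and retrieve, the wrap-around distance handling, and the 0.0 default for empty slots are our assumptions.

```python
M = 18                 # memory size (the paper uses M = 18 or M = 360)
WIDTH = 360.0 / M      # angular width covered by one memory slot

# Mem[n] holds the running sums for slot angle theta = 360*n/M.
Mem = [{"s": {"total": 0.0, "positive": 0.0},
        "p": {"total": 0.0, "positive": 0.0}} for _ in range(M)]

def P(a, n):
    """Current estimate P_a(theta) for slot n, in [-1.0, 1.0]."""
    s = Mem[n][a]
    if s["total"] == 0:
        return 0.0  # no relevant experience yet (our convention)
    return -1.0 + 2.0 * s["positive"] / s["total"]

def store_example(phi, a, r):
    """Adaptive storage of E_{phi,a,r}: update the 2 nearest slots."""
    lo = int(phi // WIDTH) % M          # slot whose angle is just below phi
    for n in (lo, (lo + 1) % M):        # the 2 nearest memory locations
        theta = 360.0 * n / M
        dist = min(abs(theta - phi), 360.0 - abs(theta - phi))
        r_scaled = r * (1.0 - dist / WIDTH)   # r' in the text
        Mem[n][a]["total"] += abs(r_scaled)
        if r_scaled > 0:
            Mem[n][a]["positive"] += r_scaled

def retrieve(a, phi):
    """Round phi to the nearest slot angle and return P_a there."""
    n = int(((phi + WIDTH / 2.0) % 360.0) // WIDTH)
    return P(a, n)

# Reproducing the worked example from the text:
store_example(110.0, "p", 1)    # slots at 100 and 120 each get 0.5 / 0.5
store_example(125.0, "p", -1)   # adds 0.75 to total at the slot for 120
# retrieve("p", 120.0) is now approximately -0.2, as in the paper
```

The half-open rounding in retrieve matches the convention that Mem[theta] covers theta - 360/2M up to (but excluding) theta + 360/2M.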
\nNotice that retrieval is much simpler when using this technique than when using kNN or kernel regression: we look directly to the closest fixed memory position, thus eliminating the indexing and weighting problems involved in finding the k closest training examples and (possibly) scaling their results. \n\n2.2 Choosing an Action \n\nThe action selection method is designed to make use of memory to select the action most probable to succeed, and to fill memory when no useful memories are available. For example, when the defender is at position \u03c6, the agent begins by retrieving Pp(\u03c6) and Ps(\u03c6) as described above. Then, it acts according to the following function: \n\nIf Pp_( 0 and Pp( Ps( 0 and P.( Pp(