{"title": "Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 922, "page_last": 928, "abstract": null, "full_text": "Hippocampal Model of Rat Spatial Abilities \n\nUsing Temporal Difference Learning \n\nDavid J Foster* \n\nCentre for Neuroscience \nEdinburgh University \n\nRichard GM Morris \n\nCentre for Neuroscience \nEdinburgh University \n\nPeter Dayan \nE25-210, MIT \n\nCambridge, MA 02139 \n\nAbstract \n\nWe provide a model of the standard watermaze task, and of a more \nchallenging task involving novel platform locations, in which rats \nexhibit one-trial learning after a few days of training. The model \nuses hippocampal place cells to support reinforcement learning, \nand also, in an integrated manner, to build and use allocentric \ncoordinates. \n\n1 \n\nINTRODUCTION \n\nWhilst it has long been known both that the hippocampus of the rat is needed for \nnormal performance on spatial tasks l3 , 11 and that certain cells in the hippocampus \nexhibit place-related firing,12 it has not been clear how place cells are actually used \nfor navigation. One of the principal conceptual problems has been understanding \nhow the hippocampus could specify or learn paths to goals when spatially tuned \ncells in the hippocampus respond only on the basis of the rat's current location. \nThis work uses recent ideas from reinforcement learning to solve this problem in \nthe context of two rodent spatial learning results. \nReference memory in the watermazell (RMW) has been a key task demonstrating \nthe importance of the hippocampus for spatial learning. On each trial, the rat is \nplaced in a circular pool of cloudy water, the only escape from which is a platform \nwhich is hidden (below the water surface) but which remains in a constant position. \nA random choice of starting pOSition is used for each trial. 
Rats take asymptotically short paths after approximately 10 trials (see figure 1a). Delayed match-to-place (DMP) learning is a refined version in which the platform's location is changed on each day. Figure 1b shows escape latencies for rats given four trials per day for nine days, with the platform in a novel position on each day. On early days, acquisition is gradual, but on later days, rats show one-trial learning, that is, near asymptotic performance on the second trial to a novel platform position.

*Crichton Street, Edinburgh EH8 9LE, United Kingdom. Funded by an Edinburgh University Holdsworth Scholarship, the McDonnell-Pew Foundation and NSF grant IBN-9634339. Email: djf@cfn.ed.ac.uk

Figure 1: a) Latencies for rats on the reference memory in the watermaze (RMW) task (N=8). b) Latencies for rats on the delayed match-to-place (DMP) task (N=62).

The RMW task has been extensively modelled [6, 4, 5, 20]. By contrast, the DMP task is new and computationally more challenging. It is solved here by integrating a standard actor-critic reinforcement learning system [2, 7], which guarantees that the rat will be competent to perform well in arbitrary mazes, with a system that learns spatial coordinates in the maze. Temporal difference learning [17] (TD) is used for actor, critic and coordinate learning. TD learning is attractive because of its generality for arbitrary Markov decision problems and the fact that reward systems in vertebrates appear to instantiate it [14].

2 THE MODEL

The model comprises two distinct networks (figure 2): the actor-critic network and a coordinate learning network. The contribution of the hippocampus, for both networks, is to provide a state-space representation in the form of place cell basis functions. 
Note that only the activities of place cells are required, by contrast with decoding schemes which require detailed information about each place cell [4].

Figure 2: Model diagram showing the interaction between the actor-critic and coordinate system components.

2.1 Actor-Critic Learning

Place cells are modelled as being tuned to location. At position p, place cell i has an output given by f_i(p) = exp{-||p - s_i||^2 / 2σ^2}, where s_i is the place field centre, and σ = 0.1 for all place fields. The critic learns a value function V(p) = Σ_i w_i f_i(p) which comes to represent the distance of p from the goal, using the TD rule Δw_i ∝ δ_t f_i(p_t), where

δ_t = r(p_t, p_{t+1}) + γ V(p_{t+1}) - V(p_t)    (1)

is the TD error, p_t is the position at time t, and the reward r(p_t, p_{t+1}) is 1 for any move onto the platform, and 0 otherwise. In a slight alteration of the original rule, the value V(p) is set to zero when p is at the goal, thus ensuring that the total future reward for moving onto the goal will be exactly 1. Such a modification improves stability in the case of TD learning with overlapping basis functions. The discount factor γ was set to 0.99. Simultaneously, the rat refines a policy, which is represented by eight action cells. Each action cell (a_j in figure 2) receives a parameterised input at any position p: a_j(p) = Σ_i q_ji f_i(p). An action is chosen stochastically with probabilities given by P(a_j) = exp{2 a_j} / Σ_k exp{2 a_k}. Action weights are reinforced according to [2]:

Δq_ji ∝ δ_t g_j(θ_t) f_i(p_t)    (2)

where g_j(θ_t) is a Gaussian function of the difference between the head direction θ_t at time t and the preferred direction of the jth action cell. Figure 3 shows the development of a policy over a few trials.
Figure 3: The RMW task: the value function gradually disseminates information about reward proximity to all regions of the environment. Policies and paths are also shown.

There is no analytical guarantee for the convergence of TD learning with policy adaptation. However, our simulations show that the algorithm always converges for the RMW task. In a simulated arena of diameter 1m and with swimming speeds of 20cm/s, the simulation matched the performance of the real rats very closely (see figure 5). This demonstrates that TD-based reinforcement learning is adequately fast to account for the learning performance of real animals.

2.2 Coordinate Learning

Although the learning of a value function and policy is appropriate for finding a fixed platform, the actor-critic model does not allow the transfer of knowledge from the task defined by one goal position to that defined by any other; thus it could not generate the sort of one-trial learning that is shown by rats on the DMP task (see figure 1b). This requires the acquisition of some goal-independent knowledge about space. A natural mechanism for this is the path integration or self-motion system [20, 10]. However, path integration presents two problems. First, since the rat is put into the maze in a different position for each trial, how can it learn consistent coordinates across the whole maze? Second, how can a general, powerful, but slow, behavioural learning mechanism such as TD be integrated with a specific, limited, but fast learning mechanism involving spatial coordinates? 
Since TD critic learning is based on enforcing consistency in estimates of future reward, we can also use it to learn spatially consistent coordinates on the basis of samples of self-motion. It is assumed that the rat has an allocentric frame of reference [18]. The model learns parameterised estimates of the x and y coordinates of all positions p: x̂(p) = Σ_i w_i^x f_i(p) and ŷ(p) = Σ_i w_i^y f_i(p). Importantly, while place cells were again critical in supporting spatial representation, they do not embody a map of space. The coordinate functions, like the value function previously, have to be learned.

As the simulated rat moves around, the coordinate weights {w_i^x} are adjusted according to:

Δw_i^x ∝ (Δx_t + x̂(p_{t+1}) - x̂(p_t)) Σ_{k=1}^{t} λ^{t-k} f_i(p_k)    (3)

where Δx_t is the self-motion estimate in the x direction. A similar update is applied to {w_i^y}. In this case, the full TD(λ) algorithm was used (with λ = 0.9); however, TD(0) could also have been used, taking slightly longer. Figure 4a shows the x and y coordinates at early and late phases of learning. It is apparent that they rapidly become quite accurate; this is an extremely easy task in an open field maze.

An important issue in the learning of coordinates is drift, since the coordinate system receives no direct information about the location of the origin. It turns out that the three controlling factors over the implicit origin are: the boundary of the arena, the prior setting of the coordinate weights (in this case all were zero) and the position and prior value of any absorbing area (in this case the platform). If the coordinate system as a whole were to drift once coordinates have been established, this would invalidate coordinates that have been remembered by the rat over long periods. However, since the expected value of the prediction error at each time step should be zero for any self-consistent coordinate mapping, such a mapping should remain stable. 
This is demonstrated for a single run: figure 4b shows the mean value of the coordinates x̂ evolving over trials, with little drift after the first few trials.

We modelled the coordinate system as influencing the choice of swimming direction in the manner of an abstract action [15]. The (internally specified) coordinates of the most recent goal position are stored in short term memory and used, along with the current coordinates, to calculate a vector heading. This vector heading is thrown into the stochastic competition with the other possible actions, governed by a single weight which changes in a similar manner to the other action weights (as in equation 2, see also figure 4d), depending on the TD error and on the angular proximity of the current head direction to the coordinate direction. Thus, whether the coordinate-based direction is likely to be used depends upon its past performance.

Figure 4: The evolution of the coordinate system for a typical simulation run: a) coordinate outputs at early and late phases of learning, b) the extent of drift in the coordinates, as shown by the mean coordinate value for a single run, c) a measure of coordinate error for the same run, σ_E^2 = Σ_r Σ_k {x̂_r(p_k) - x̄_r - x(p_k)}^2 / ((N_p - 1) N_r), where k indexes measurement points (max N_p) and r indexes runs (max N_r), x̂_r(p_k) is the model estimate of x at position p_k, x(p_k) is the ideal estimate for a coordinate system centred on zero, and x̄_r is the mean value over all the model coordinates, d) the increase during training of the probability of choosing the abstract action. This demonstrates the integration of the coordinates into the control system.

One simplification in the model is the treatment of extinction. In the DMP task, 
real rats extinguish fairly quickly to a platform that has moved, whereas the actor-critic model extinguishes far more slowly. To get around this, when a simulated rat reaches a goal that has just been moved, the value and action weights are reinitialised, but the coordinate weights w_i^x and w_i^y, and the weight for the abstract action, are not.

3 RESULTS

The main results of this paper are the replication by simulation of rat performance on the RMW and DMP tasks. Figures 1a and 1b show the course of learning for the rats; figures 5a and 5b for the model. For the DMP task, one-shot acquisition is apparent by the end of training.

4 DISCUSSION

We have built a model for one-trial spatial learning in the watermaze which uses a single TD learning algorithm in two separate systems. One system is based on reinforcement learning that can solve general Markovian decision problems, and the other is based on coordinate learning and is specialised for an open-field watermaze. Place cells in the hippocampus offer an excellent substrate for learning the actor, the critic and the coordinates.

The model is explicit about the relationship between the general and specific learning systems, and the learning behaviour shows that they integrate seamlessly. As currently constituted, the coordinate system would fail if there were a barrier in the maze. We plan to extend the model to allow the coordinate system to specify abstract targets other than the most recent platform position; this could allow fast navigation around a larger class of environments. It is also important to improve the model of learning 'set' behaviour, that is, the information about the nature of 
the DMP task that the rats acquire over the course of the first few days of training.

Figure 5: a) Performance of the actor-critic model on the RMW task, and b) performance of the full model on the DMP task. The data for comparison are shown in figures 1a and 1b.

Interestingly, learning set is incomplete: on the first trial of each day, the rats still aim for the platform position on the previous day, even though this is never correct [16]. The significant differences in the path lengths on the first trial of each day (evident in figures 1b and 5b) come from the relative placements of the platforms. However, the model did not use the same positions as the empirical data, and, in any case, the model of exploration behaviour is rather simplistic.

The model demonstrates that reinforcement learning methods are perfectly fast enough to match empirical learning curves. This is fortunate, since, unlike most models specifically designed for open-field navigation [6, 4, 5, 20], RL methods can provably cope with substantially more complicated tasks with arbitrary barriers, etc., since they solve the temporal credit assignment problem in its full generality. The model also addresses the problem that coordinates in different parts of the same environment need to be mutually consistent, even if the animal only experiences some parts on separate trials. An important property of the model is that there is no requirement for the animal to have any explicit knowledge of the relationship between different place cells or place field position, size or shape. Such a requirement is imposed in various models [9, 4, 6, 20].

Experiments that are suggested by this model (as well as by certain others) concern the relationship between hippocampally dependent and independent spatial learning. 
First, once the coordinate system has been acquired, we predict that merely placing the rat at a new location would be enough to let it find the platform in one shot, though it might be necessary to reinforce the placement, e.g. by first placing the rat in a bucket of cold water. Second, we know that the establishment of place fields in an environment happens substantially faster than the establishment of one-shot or even ordinary learning to a platform [23]. We predict that blocking plasticity in the hippocampus following the establishment of place cells (possibly achieved without a platform) would not block learning of a platform. In fact, new experiments show that after extensive pre-training, rats can perform one-trial learning in the same environment to new platform positions on the DMP task without hippocampal synaptic plasticity [16]. This is in contrast to the effects of hippocampal lesion, which completely disrupts performance. According to the model, coordinates will have been learned during pre-training. The full prediction remains untested: that once place fields have been established, coordinates could be learned in the absence of hippocampal synaptic plasticity. A third prediction follows from evidence that rats with restricted hippocampal lesions can learn the fixed platform task, but much more slowly, based on a gradual "shaping" procedure [22]. In our model, they may also be able to learn coordinates. However, a lengthy training procedure could be required, and testing might be complicated if expressing the knowledge required the use of hippocampus-dependent short-term memory for the last platform location [16].

One way of expressing the contribution of the hippocampus in the model is to say that its function is to provide a behavioural state space for the solution of complex tasks. 
Hence the contribution of the hippocampus to navigation is to provide place cells whose firing properties remain consistent in a given environment. It follows that in different behavioural situations, hippocampal cells should provide a representation based on something other than locations, and, indeed, there is evidence for this [8]. With regard to the role of the hippocampus in spatial tasks, the model demonstrates that the hippocampus may be fundamentally necessary without embodying a map.

References

[1] Barto, AG & Sutton, RS (1981) Biol. Cybern., 43:1-8.
[2] Barto, AG, Sutton, RS & Anderson, CW (1983) IEEE Trans. on Systems, Man and Cybernetics, 13:834-846.
[3] Barto, AG, Sutton, RS & Watkins, CJCH (1989) Tech. Report 89-95, CAIS, Univ. Mass., Amherst, MA.
[4] Blum, KI & Abbott, LF (1996) Neural Computation, 8:85-93.
[5] Brown, MA & Sharp, PE (1995) Hippocampus, 5:171-188.
[6] Burgess, N, Recce, M & O'Keefe, J (1994) Neural Networks, 7:1065-1081.
[7] Dayan, P (1991) NIPS 3, RP Lippmann et al., eds., 464-470.
[8] Eichenbaum, HB (1996) Curr. Opin. Neurobiol., 6:187-195.
[9] Gerstner, W & Abbott, LF (1996) J. Computational Neurosci., 4:79-94.
[10] McNaughton, BL et al. (1996) J. Exp. Biol., 199:173-185.
[11] Morris, RGM et al. (1982) Nature, 297:681-683.
[12] O'Keefe, J & Dostrovsky, J (1971) Brain Res., 34:171-175.
[13] Olton, DS & Samuelson, RJ (1976) J. Exp. Psych.: A.B.P., 2:97-116.
     Rudy, JW & Sutherland, RW (1995) Hippocampus, 5:375-389.
[14] Schultz, W, Dayan, P & Montague, PR (1997) Science, 275:1593-1599.
[15] Singh, SP (1992) Reinforcement learning with a hierarchy of abstract models. Proc. AAAI-92.
[16] Steele, RJ & Morris, RGM, in preparation.
[17] Sutton, RS (1988) Machine Learning, 3:9-44.
[18] Taube, JS (1995) J. Neurosci., 15(1):70-86.
[19] Tsitsiklis, JN & Van Roy, B (1996) Tech. Report LIDS-P-2322, MIT.
[20] Wan, HS, Touretzky, DS & Redish, AD (1993) Proc. 
1993 Connectionist Models Summer School, Lawrence Erlbaum, 11-19.
[21] Watkins, CJCH (1989) PhD Thesis, Cambridge.
[22] Whishaw, IQ & Jarrard, LF (1996) Hippocampus.
[23] Wilson, MA & McNaughton, BL (1993) Science, 261:1055-1058.

Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks

Peter Marbach*, LIDS, MIT, Cambridge, MA 02139, email: marbach@mit.edu
Oliver Mihatsch, Siemens AG, Corporate Technology, ZT IK 4, D-81730 Munich, Germany, email: oliver.mihatsch@mchp.siemens.de
Miriam Schulte, Zentrum Mathematik, Technische Universität München, D-80290 Munich, Germany
John N. Tsitsiklis, LIDS, MIT, Cambridge, MA 02139, email: jnt@mit.edu

* Author to whom correspondence should be addressed.

Abstract

In integrated service communication networks, an important problem is to exercise call admission control and routing so as to optimally use the network resources. This problem is naturally formulated as a dynamic programming problem, which, however, is too complex to be solved exactly. We use methods of reinforcement learning (RL), together with a decomposition approach, to find call admission control and routing policies. The performance of our policy for a network with approximately 10^45 different feature configurations is compared with a commonly used heuristic policy.

1 Introduction

The call admission control and routing problem arises in the context where a telecommunication provider wants to sell its network resources to customers in order to maximize long term revenue. Customers are divided into different classes, called service types. Each service type is characterized by its bandwidth demand, its average call holding time and the immediate reward the network provider obtains whenever a call of that service type is 
accepted. The control actions for maximizing the long term revenue are to accept or reject new calls (Call Admission Control) and, if a call is accepted, to route the call appropriately through the network (Routing). The problem is naturally formulated as a dynamic programming problem, which, however, is too complex to be solved exactly. We use the methodology of reinforcement learning (RL) to approximate the value function of dynamic programming. Furthermore, we pursue a decomposition approach, where the network is viewed as consisting of link processes, each having its own value function. This has the advantage that it allows a decentralized implementation of the training methods of RL and a decentralized implementation of the call admission control and routing policies. Our method learns call admission control and routing policies which outperform the commonly used heuristic "Open Shortest Path First" (OSPF) policy.

In some earlier related work, we applied RL to the call admission problem for a single communication link in an integrated service environment. We found that in this case, RL methods performed as well as, but no better than, well-designed heuristics. Compared with the single link problem, the addition of routing decisions makes the network problem more complex, and good heuristics are not easy to derive.

2 Call Admission Control and Routing

We are given a telecommunication network consisting of a set of nodes N = {1, ..., N} and a set of links L = {1, ..., L}, where link l has a total capacity of B(l) units of bandwidth. 
We support a set M = {1, ..., M} of different service types, where a service type m is characterized by its bandwidth demand b(m), its average call holding time 1/ν(m) (here we assume that the call holding times are exponentially distributed) and the immediate reward c(m) we obtain whenever we accept a call of that service type. A link can simultaneously carry any combination of calls, as long as the bandwidth used by these calls does not exceed the total bandwidth of the link (Capacity Constraint). When a new call of service type m requests a connection between a node i and a node j, we can either reject or accept that request (Call Admission Control). If we accept the call, we choose a route out of a list of predefined routes (Routing). The call then uses b(m) units of bandwidth on each link along that route for the duration of the call. We can, therefore, only choose a route which does not violate the capacity constraints of its links, if the call is accepted. Furthermore, if we accept the call, we obtain an immediate reward c(m). The objective is to exercise call admission control and routing in such a way that the long term revenue obtained by accepting calls is maximized.

We can formulate the call admission control and routing problem using dynamic programming (e.g. Bertsekas, 1995). Events ω, which incur state transitions, are arrivals of new calls and call terminations. The state x_t at time t consists of a list for each route, indicating how many calls of each service type are currently using that route. The decision/control u_t applied at the time t of an arrival of a new call is to decide whether to reject or accept the call and, if the call is accepted, how to route it through the network. 
The objective is to learn a policy that assigns decisions to each state so as to maximize

E{ Σ_k e^{-β t_k} g(x_{t_k}, ω_k, u_{t_k}) }

where E{·} is the expectation operator, t_k is the time when the kth event happens, g(x_{t_k}, ω_k, u_{t_k}) is the immediate reward associated with the kth event, and β is a discount factor that makes immediate rewards more valuable than future ones.

3 Reinforcement Learning Solution

RL methods solve optimal control (or dynamic programming) problems by learning good approximations to the optimal value function J*, given by the solution to the Bellman optimality equation, which takes the following form for the call admission control and routing problem:

J*(x) = E_τ{e^{-βτ}} E_ω{ max_{u∈U(x)} [g(x, ω, u) + J*(x')] }

where U(x) is the set of control actions available in the current state x, τ is the time when the first event ω occurs and x' is the successor state. Note that x' is a deterministic function of the current state x, the control u and the event ω.

RL uses a compact representation J̃(·, θ) to learn and store an estimate of J*(·). On each event, J̃(·, θ) is both used to make decisions and to update the parameter vector θ. In the call admission control and routing problem, one only has to choose a control action when a new call requests a connection. In such a case, J̃(·, θ) is used to choose a control action according to the formula

u = arg max_{u∈U(x)} [g(x, ω, u) + J̃(x', θ)]    (1)

This can be expressed in words as follows.

Decision Making: When a new call requests a connection, use J̃(·, θ) to evaluate, for each permissible route, the successor state x' we transit to when we choose that route, and pick a route which maximizes that value. If the sum of the immediate reward and the value associated with this route is higher than the value of the current state, route the call over that route; otherwise reject the call. 
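A minimal sketch of this decision rule, using a toy three-link network and a stand-in linear value function in place of the learned J̃ (the capacities, bandwidth, reward, and value form below are illustrative assumptions, not the paper's experimental setup):

```python
from typing import Dict, List, Optional, Tuple

# Toy instance: per-link capacities, a single service type of bandwidth 1.
CAPACITY = {1: 2, 2: 2, 3: 2}   # link -> total bandwidth units (assumed)
BANDWIDTH = 1                   # b(m) for the single service type (assumed)
REWARD = 1.0                    # immediate reward c(m) (assumed)

State = Dict[int, int]          # link -> bandwidth currently in use

def feasible(state: State, route: Tuple[int, ...]) -> bool:
    """Capacity constraint: the call must fit on every link of the route."""
    return all(state[l] + BANDWIDTH <= CAPACITY[l] for l in route)

def successor(state: State, route: Tuple[int, ...]) -> State:
    """Successor state x' if the call is routed over the given route."""
    succ = dict(state)
    for l in route:
        succ[l] += BANDWIDTH
    return succ

def value(state: State) -> float:
    # Stand-in for the learned J~: prefer states with spare capacity.
    return 0.1 * sum(CAPACITY[l] - state[l] for l in state)

def admit_and_route(state: State,
                    routes: List[Tuple[int, ...]]) -> Optional[Tuple[int, ...]]:
    """Decision rule (1): pick the feasible route maximizing c(m) + J~(x');
    reject when no route improves on the value of the current state."""
    best, best_val = None, value(state)          # rejecting keeps state x
    for r in routes:
        if feasible(state, r):
            v = REWARD + value(successor(state, r))
            if v > best_val:
                best, best_val = r, v
    return best   # None means "reject the call"
```

With an empty network the direct one-link route wins, since it consumes the least spare capacity; with every link full, no route is feasible and the call is rejected.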
Usually, RL uses a global feature extractor f(x) to form an approximate compact representation of the state of the system, which forms the input to a function approximator J̃(·, θ). Sutton's temporal difference (TD(λ)) algorithms (Sutton, 1988) can then be used to train J̃(·, θ) to learn an estimate of J*. Using TD(0), the update at the kth event takes the following form

θ_k = θ_{k-1} + γ_k d_k ∇_θ J̃(f(x_{t_{k-1}}), θ_{k-1})

where

d_k = e^{-β(t_k - t_{k-1})} [g(x_{t_k}, ω_k, u_{t_k}) + J̃(f(x_{t_k}), θ_{k-1})] - J̃(f(x_{t_{k-1}}), θ_{k-1})

and where γ_k is a small step size parameter and u_{t_k} is the control action chosen according to the decision making rule described above.

Here we pursue an approach where we view the network as being composed of link processes. Furthermore, we decompose the immediate reward g(x_{t_k}, ω_k, u_{t_k}) associated with the kth event into link rewards g^(l)(x_{t_k}, ω_k, u_{t_k}) such that

g(x_{t_k}, ω_k, u_{t_k}) = Σ_{l=1}^{L} g^(l)(x_{t_k}, ω_k, u_{t_k})

We then define, for each link l, a value function J̃^(l)(f^(l)(x), θ^(l)), which is interpreted as an estimate of the discounted long term revenue associated with that link. Here, f^(l) defines a local feature, which forms the input to the value function associated with link l. To obtain an approximation of J*(x), the functions J̃^(l)(f^(l)(x), θ^(l)) are combined as follows:

J̃(x, θ) = Σ_{l=1}^{L} J̃^(l)(f^(l)(x), θ^(l)).

At each event, we update the parameter vector θ^(l) of link l only if the event is associated with the link. Events associated with a link l are arrivals of new calls which are potentially routed over link l and terminations of calls which were routed over link l. 
The update rule for the parameter vector θ^(l) is very similar to the TD(0) algorithm described above:

θ_k^(l) = θ_{k-1}^(l) + γ_k^(l) d_k^(l) ∇_θ J̃^(l)(f^(l)(x_{t_{k-1}^(l)}), θ_{k-1}^(l))    (2)

where

d_k^(l) = e^{-β(t_k^(l) - t_{k-1}^(l))} [g^(l)(x_{t_k^(l)}, ω_k, u_{t_k^(l)}) + J̃^(l)(f^(l)(x_{t_k^(l)}), θ_{k-1}^(l))] - J̃^(l)(f^(l)(x_{t_{k-1}^(l)}), θ_{k-1}^(l))    (3)

and where γ_k^(l) is a small step size parameter and t_k^(l) is the time when the kth event associated with link l occurs. Whenever a new call of service type m is routed over a route r which contains the link l, the immediate reward g^(l) associated with the link l is equal to c(m)/#r, where #r is the number of links along the route r. For all other events, the immediate reward associated with link l is equal to 0.

The advantage of this decomposition approach is that it allows decentralized training and decentralized decision making. Furthermore, we observed that this decomposition approach leads to much shorter training times for obtaining an approximation of J* than the approach without decomposition. All these features become very important if one considers applying methods of RL to large integrated service networks supporting a fair number of different service types.

We use exploration to obtain the states at which we update the parameter vector θ. At each state, with probability p = 0.5, we apply a random action, instead of the action recommended by the current value function, to generate the next state in our training trajectory. However, the action u_{t_k^(l)} that is used in the update rule (3) is still the one chosen according to the rule given in (1). Exploration during the training significantly improved the performance of the policy.

Table 1: Service Types. 
SERVICE TYPE m                   1     2     3
BANDWIDTH DEMAND b(m)            1     3     5
AVERAGE HOLDING TIME 1/ν(m)      10    10    2
IMMEDIATE REWARD c(m)            1     2     50

4 Experimental Results

In this section, we present experimental results obtained for the case of an integrated service network consisting of 4 nodes and 12 unidirectional links. There are two different classes of links, with a total capacity of 60 and 120 units of bandwidth, respectively (indicated by thick and thin arrows in Figure 1). We assume a set M = {1, 2, 3} of three different service types. The corresponding bandwidth demands, average holding times and immediate rewards are given in Table 1. Call arrivals are modeled as independent Poisson processes, with a separate mean for each pair of source and destination nodes and each service type. Furthermore, for each source and destination node pair, the list of possible routes consists of three entries: the direct path and the two alternative 2-hop routes.

Figure 1: Telecommunication Network Consisting of 4 Nodes and 12 Unidirectional Links.

Figure 2: Average Reward per Time Unit During the Whole Training Phase of 10^7 Steps (Solid) and During Shorter Time Windows of 10^5 Steps (Dashed).

We compare the policy obtained through RL with the commonly used heuristic OSPF (Open Shortest Path First). For every pair of source and destination nodes, OSPF orders the list of predefined routes. When a new call arrives, it is routed along the first route in the corresponding list that does not violate the capacity constraint; if no such route exists, the call is rejected. We use the average reward per unit time as the performance measure to compare the two policies. 
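The OSPF heuristic described above reduces to a first-feasible-route scan over an ordered route list; a minimal sketch (the capacities and route lists used in the example are illustrative, loosely following the link classes of the experiment):

```python
from typing import Dict, List, Optional, Tuple

def ospf_route(ordered_routes: List[Tuple[int, ...]],
               used: Dict[int, int],
               capacity: Dict[int, int],
               demand: int) -> Optional[Tuple[int, ...]]:
    """OSPF heuristic: route the call along the first route in the
    predefined order whose links can all absorb the bandwidth demand b(m);
    if no such route exists, reject the call."""
    for route in ordered_routes:
        if all(used[l] + demand <= capacity[l] for l in route):
            return route
    return None   # reject
```

For example, with a direct link of capacity 60 already full and an alternative 2-hop route over two 120-unit links, a demand-5 call falls through to the alternative route.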
For the RL approach, we use a quadratic approximator, which is linear with respect to the parameters \theta^{(l)}, as a compact representation of \tilde{J}^{(l)}. Other approximation architectures were tried, but we found that the quadratic one gave the best results with respect to both the speed of convergence and the final performance. As inputs to the compact representation \tilde{J}^{(l)}, we use a set of local features, which we chose to be the number of ongoing calls of each service type on link l. For the 4-node network, there are approximately 1.6 x 10^45 different feature configurations. Note that the total number of possible states is even higher.

Reinforcement Learning for Call Admission Control and Routing    927

Figure 3: Comparison of the Average Rewards and Rejection Rates of the RL and OSPF Policies. [Panels: "AVERAGE REWARD" — potential reward, reward obtained by RL, reward obtained by OSPF (reward per time unit, 0-250); "COMPARISON OF REJECTION RATES" — percentage of calls rejected (0-50).]

Figure 4: Comparison of the Routing Behaviour of the RL and OSPF Policies. [Panels: "ROUTING (OSPF)" and "ROUTING (RL)" — percentage of calls routed on the direct link and on alternative routes no. 1 and no. 2 (0-100).]

The results of the case studies are given in Figure 2 (Training Phase), Figure 3 (Performance) and Figure 4 (Routing Behaviour). We give here a summary of the results.
Training Phase: Figure 2 shows the average reward of the RL policy as a function of the training steps.
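The parameter-linear quadratic representation \tilde{J}^{(l)} used during training can be sketched as follows. This is a minimal illustration under assumed names: the feature vector expands the ongoing-call counts into a constant, linear and pairwise-product terms, so the approximator is quadratic in the counts yet linear in \theta^{(l)}, keeping the TD(0) update simple.

```python
import numpy as np

def quad_features(calls):
    """calls: ongoing-call counts per service type on one link, e.g. [n1, n2, n3].
    Returns the constant 1, the counts, and all products c_i * c_j (i <= j):
    quadratic in the counts, linear in the parameters theta."""
    c = np.asarray(calls, dtype=float)
    cross = np.outer(c, c)[np.triu_indices(len(c))]   # c_i * c_j for i <= j
    return np.concatenate(([1.0], c, cross))

def J_tilde(calls, theta):
    """Link value estimate: inner product of parameters and features."""
    return float(theta @ quad_features(calls))

# With 3 service types the feature vector has 1 + 3 + 6 = 10 entries.
phi = quad_features([2, 0, 1])
theta = np.zeros(10)
print(len(phi), J_tilde([2, 0, 1], theta))
```

Because \tilde{J}^{(l)} is linear in \theta^{(l)}, its gradient in update rule (2) is simply the feature vector itself.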
Although the average reward increases during the training, it does not exceed 141, the average reward of the heuristic OSPF. This is due to the high amount of exploration in the training phase.
Performance Comparison: The policy obtained through RL gives an average reward of 212, which is about 50% higher than the average reward of 141 achieved by OSPF. Furthermore, the RL policy reduces the number of rejected calls for all service types. The most significant reduction is achieved for calls of service type 3, the service type with the highest immediate reward. Figure 3 also shows that the average reward of the RL policy is close to the potential average reward of 242, which is the average reward we would obtain if all calls were accepted. This leads us to believe that the RL policy is close to optimal.
Routing Behaviour: Figure 4 compares the routing behaviour of the RL control policy and OSPF. While OSPF routes about 15%-20% of all calls along one of the alternative 2-hop routes, the RL policy uses alternative routes essentially only for calls of type 3 (about 25%) and routes calls of the other two service types almost exclusively over the direct route. This indicates that the RL policy uses a routing scheme which avoids 2-hop routes for calls of service types 1 and 2, and which uses network resources more efficiently.

5 Conclusion

The call admission control and routing problem for integrated service networks is naturally formulated as a dynamic programming problem, albeit one with a very large state space. Traditional dynamic programming methods are computationally infeasible for such large-scale problems. We use reinforcement learning, based on Sutton's (1988) TD(0), combined with a decomposition approach which views the network as consisting of link processes.
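One step of this per-link TD(0) training can be sketched as follows. The sketch is ours, with assumed names and step-size values: it combines the exponentially discounted temporal difference of equations (2)-(3) with the exploration rule described earlier (random action with probability p, while the update itself still uses the greedily chosen action).

```python
import math
import random
import numpy as np

def td0_update(theta, phi_prev, phi_next, g, dt, beta=0.01, gamma=0.001):
    """One TD(0) step for a linear approximator J(x, theta) = theta . phi(x).

    phi_prev, phi_next: feature vectors at the previous and current events;
    g: immediate reward (c(m)/#r if a type-m call was routed over a route
       with #r links through this link, 0 otherwise);
    dt: time between the two events; beta: continuous-time discount rate;
    gamma: step size.  beta and gamma are illustrative values.
    """
    discount = math.exp(-beta * dt)               # e^{-beta (t_k - t_{k-1})}
    d = discount * (g + theta @ phi_next) - theta @ phi_prev
    return theta + gamma * d * phi_prev           # grad of linear J is phi_prev

def choose_action(greedy, actions, p=0.5):
    """With probability p, explore with a random action; otherwise act greedily."""
    return random.choice(actions) if random.random() < p else greedy

# Toy step: reward c(m)/#r = 2/2 for a type-2 call on a 2-link route.
theta = td0_update(np.zeros(3), np.array([1.0, 2.0, 0.0]),
                   np.array([1.0, 1.0, 1.0]), g=2.0 / 2, dt=0.5)
print(theta)
```

Since the update touches only one link's parameter vector, the training runs independently per link, which is what makes the decentralized scheme possible.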
This decomposition has the advantage that it allows decentralized decision making and decentralized training, which significantly reduces the duration of the training phase. We presented a solution for an example network with about 10^45 different feature configurations. Our RL policy clearly outperforms the commonly used heuristic OSPF. Besides the game of backgammon (Tesauro, 1992), elevator scheduling (Crites & Barto, 1996), job-shop scheduling (Zhang & Dietterich, 1996) and dynamic channel allocation (Singh & Bertsekas, 1997), this is another successful application of RL to a large-scale dynamic programming problem for which a good heuristic is hard to find.

References

Bertsekas, D. P. (1995) Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.

Crites, R. H., Barto, A. G. (1996) Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems 8, pp. 1017-1023. Cambridge, MA: MIT Press.

Singh, S., Bertsekas, D. P. (1997) Reinforcement learning for dynamic channel allocation in cellular telephone systems. To appear in Advances in Neural Information Processing Systems 9, Cambridge, MA: MIT Press.

Sutton, R. S. (1988) Learning to predict by the method of temporal differences. Machine Learning, 3:9-44.

Tesauro, G. J. (1992) Practical issues in temporal difference learning. Machine Learning, 8(3/4):257-277.

Zhang, W., Dietterich, T. G. (1996) High performance job-shop scheduling with a time-delay TD(lambda) network. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems 8, pp. 1024-1030. Cambridge, MA: MIT Press.