{"title": "Optimizing Admission Control while Ensuring Quality of Service in Multimedia Networks via Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 982, "page_last": 988, "abstract": null, "full_text": "Optimizing admission control while ensuring \nquality of service in multimedia networks via \n\nreinforcement learning* \n\nTimothy X Brown t , Hui Tong t , Satinder Singh+ \n\nt Electrical and Computer Engineering \n\n+ Computer Science \nUniversity of Colorado \nBoulder, CO 80309-0425 \n\n{timxb, tongh, baveja}@colorado.edu \n\nAbstract \n\nThis paper examines the application of reinforcement learning to a \ntelecommunications networking problem . The problem requires that rev(cid:173)\nenue be maximized while simultaneously meeting a quality of service \nconstraint that forbids entry into certain states. We present a general \nsolution to this multi-criteria problem that is able to earn significantly \nhigher revenues than alternatives. \n\n1 Introduction \nA number of researchers have recently explored the application of reinforcement learning \n(RL) to resource allocation and admission control problems in telecommunications. e.g., \nchannel allocation in wireless systems, network routing, and admission control in telecom(cid:173)\nmunication networks [1, 6, 7, 8]. Telecom problems are attractive applications for RL \nresearch because good, simple to implement, simulation models exist for them in the en(cid:173)\ngineering literature that are both widely used and results on which are trusted, because \nthere are existing solutions to compare with, because small improvements over existing \nmethods can lead to significant savings in the long run, because they have discrete states, \nand because there are many potential commercial applications. However, existing RL ap(cid:173)\nplications have ignored an issue of great practical importance to telecom engineers, that \nof ensuring quality of service (QoS) while simultaneously optimizing whatever resource \nallocation performance criterion is of interest. \n\nThis paper will focus on admission control for broadband multimedia communication net(cid:173)\nworks. These networks are unlike the current internet in that voice, video, and data calls \narrive and depart over time and, in exchange for giving QoS guarantees to customers, the \nnetwork collects revenue for calls that it accepts into the network. In this environment, ad(cid:173)\nmission control decides what calls to accept into the network so as to maximize the earned \nrevenue while meeting the QoS guarantees of all carried customers. \n\n'Timothy Brown and Hui Tong were funded by NSF CAREER Award NCR-9624791. Satinder \n\nSingh was funded by NSF grant IIS-97 1 1753. \n\n\fOptimizing Admission Control via RL \n\n983 \n\nMeeting QoS requires a decision function that decides when adding a new call will violate \nQoS guarantees. Given the diverse nature of voice, video, and data traffic, and their often \ncomplex underlying statistics, finding good QoS decision functions has been the subject \nof intense research [2, 5]. Recent results have emphasized that robust and efficient QoS \ndecision functions require on-line adaptive methods [3]. \n\nGiven we have a QoS decision function, deciding which of the heterogeneous arriving calls \nto accept and which to reject in order to maximize revenue can be framed as a dynamic \nprogram problem . 
In this paper we consider the problem of finding a control policy that simultaneously meets QoS guarantees and maximizes the network's earned revenue. We show that the straightforward approach of mixing positive rewards for revenue with negative rewards for violating QoS leads to sub-optimal policies. Ideally we would like to find the optimal policy within the subset of policies that never violate the QoS constraint, but there is no a priori useful way to characterize that space of policies. We present a general approach to meeting such multiple criteria that solves this problem and potentially applies to many other applications. Experiments show that combining QoS and RL yields significant gains over some alternative heuristics.

2 Problem Description

This section describes the admission control problem model that will be used. To emphasize the main features of the problem, networking issues that are not essential, such as queueing, have been simplified or eliminated. It should be emphasized that these aspects can readily be incorporated back into the problem.

We focus on a single network link. Users attempt to access the link over time, and the network immediately chooses to accept or reject each call. If accepted, the call generates traffic in terms of bandwidth as a function of time. At a later time, the call terminates and departs from the network. For each call accepted, the network receives revenue at a fixed rate over the duration of the call. The network measures QoS metrics such as transmission delays or packet loss rates and compares them against the guarantees given to the calls. Thus, the problem is described by the call arrival, traffic, and departure processes; the revenue rates; the QoS metrics; the QoS constraints; and the link model. The choices used in this paper are given in the next paragraph.

Calls are divided into discrete classes indexed by i. Calls of class i are generated by a Poisson arrival process (arrival rate λ_i) with exponential holding times (mean holding time 1/μ_i). Within a call, the bandwidth is an ON/OFF process: the traffic is either ON at rate r_i or OFF at rate zero, with mean holding times ν_i^ON and ν_i^OFF. The effective immediate revenue rate is c_i. The link has a fixed bandwidth B, and the total bandwidth used by accepted calls varies over time. The QoS metric is the fraction of time that the total bandwidth exceeds the link bandwidth (i.e., the overload probability, p). The QoS guarantee is an upper limit, p*.

In previous work, each call had a constant bandwidth over time, so its effect on QoS was predictable. Variable rate traffic is safely approximated by assuming that it always transmits at its maximum or peak rate. Such so-called peak rate allocation under-utilizes the network, in some cases by orders of magnitude below what is possible. The stochastic rates of real traffic, the desire for high network utilization and revenue, and the resulting potential for QoS violations distinguish the problem in this paper.
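Because each accepted call is an independent ON/OFF source that is ON a long-run fraction ν_i^ON / (ν_i^ON + ν_i^OFF) of the time, the stationary overload probability of a fixed configuration can be computed exactly by convolving the per-call bandwidth distributions. The sketch below is illustrative (the class parameters in the example are made up):

    def overload_probability(config, classes, link_bw):
        """Stationary probability that the total ON bandwidth exceeds link_bw.

        config:  {class_id: number of ongoing calls of that class}
        classes: {class_id: (r, nu_on, nu_off)} with ON rate r and mean
                 ON/OFF holding times nu_on and nu_off.
        """
        dist = {0.0: 1.0}                      # distribution of total bandwidth
        for cid, n in config.items():
            r, nu_on, nu_off = classes[cid]
            p_on = nu_on / (nu_on + nu_off)    # long-run fraction of time ON
            for _ in range(n):                 # convolve in one call at a time
                new = {}
                for bw, pr in dist.items():
                    new[bw + r] = new.get(bw + r, 0.0) + pr * p_on
                    new[bw] = new.get(bw, 0.0) + pr * (1.0 - p_on)
                dist = new
        return sum(pr for bw, pr in dist.items() if bw > link_bw)

    # Ten calls that are ON 20% of the time at rate 1, on a link of bandwidth 4:
    # the link overloads whenever five or more calls are ON at once (p ~ 0.033).
    print(overload_probability({0: 10}, {0: (1.0, 2.0, 8.0)}, link_bw=4.0))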
3 Semi-Markov Decision Processes

At any given point in time, the system is in a particular configuration, x, defined by the number of ongoing calls of each type. At random times a call arrival or call termination event, e, can occur. The configuration and event together determine the state of the system, s = (x, e). When an event occurs, the learner has to choose an action feasible for that event. The choice of action, the event, and the configuration deterministically define the next configuration and the payoff received by the learner. After an interval the next event occurs, and this cycle repeats. The task of the learner is to determine a policy that maximizes the discounted sum of payoffs over an infinite horizon. Such a system constitutes a finite state, finite action, semi-Markov decision process (SMDP).

3.1 Multi-criteria Objective

The admission control objective is to learn a policy that assigns an accept or reject decision to each possible state of the system so as to maximize

    J = E\{ \int_0^\infty \gamma^t c(t) \, dt \},

where E{·} is the expectation operator, c(t) is the total revenue rate of ongoing calls at time t, and γ ∈ (0, 1) is a discount factor that makes immediate profit more valuable than future profit.¹

¹Since we will compare policies based on total reward rather than discounted sum of rewards, we can use the Tauberian approximation [4], i.e., γ is chosen sufficiently close to 1.

In this paper we restrict the maximization to policies that never enter states that violate QoS guarantees. In a general SMDP, meeting such constraints may not be possible because of stochastic state transitions (e.g., from any state, no matter what actions are taken, there may be some possibility of entering restricted states). In this problem, service quality decreases as more calls enter the system, and adding calls is strictly controlled by the admission controller, so meeting the QoS constraint is possible.

3.2 Q-learning

RL methods solve SMDP problems by learning good approximations to the optimal value function, J*, given by the solution to the Bellman optimality equation, which takes the following form for the dynamic call admission problem:

    J^*(s) = \max_{a \in A(s)} E_{\Delta t, s'}\{ c(s, a, \Delta t) + \gamma(\Delta t) J^*(s') \},    (1)

where A(s) is the set of actions available in the current state s, Δt is the random time until the next event, c(s, a, Δt) is the effective immediate payoff including the discounting, and γ(Δt) is the effective discount for the next state s'.

We learn an approximation to J* using Watkins' Q-learning algorithm. To focus on the dynamics of this paper's problem and not on the confounding dynamics of function approximation, the problem state space is kept small enough that table lookup can be used. Bellman's equation can be rewritten in Q-values as

    J^*(s) = \max_{a \in A(s)} Q^*(s, a).    (2)

Call Arrival: When a call arrives, the Q-values of accepting and of rejecting the call are determined. If rejection has the higher value, we drop the call; if acceptance has the higher value, we accept it.

Call Termination: No action needs to be taken.

Whatever our decision, we update the value function as follows: on a transition from state s to s' under action a in time Δt,

    Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left( c(s, a, \Delta t) + \gamma(\Delta t) \max_{b \in A(s')} Q(s', b) \right),    (3)

where α ∈ [0, 1] is the learning rate.
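A minimal tabular sketch of the arrival decision and update rule (3) might look as follows. It is illustrative rather than the authors' implementation; in particular, the choice γ(Δt) = γ^Δt (continuous-time discounting) and the constant values are assumptions:

    from collections import defaultdict

    GAMMA = 0.999   # assumed per-unit-time discount: gamma(dt) = GAMMA ** dt
    ALPHA = 0.01    # learning rate alpha in (3); value is a placeholder

    Q = defaultdict(float)   # table lookup: (state, action) -> Q-value

    def gamma_dt(dt):
        """Effective discount for a semi-Markov transition of duration dt."""
        return GAMMA ** dt

    def decide(state):
        """Call arrival: accept iff accepting has the (weakly) higher Q-value;
        breaking ties toward acceptance is a choice the text leaves open."""
        return "accept" if Q[(state, "accept")] >= Q[(state, "reject")] else "reject"

    def q_update(s, a, c, dt, s_next, feasible_next):
        """Update rule (3) on a transition s -> s_next under action a.

        c is the discounted payoff c(s, a, dt) accrued over the interval;
        feasible_next is A(s_next), the actions available at the next event.
        """
        best_next = max(Q[(s_next, b)] for b in feasible_next)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (c + gamma_dt(dt) * best_next)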
In order for Q-learning to perform well, all potentially important state-action pairs (s, a) must be explored. At each state, with probability ε we therefore apply an action that leads to a less visited configuration instead of the action recommended by the Q-values. However, to update Q-values we still use the action b recommended by Q-learning.

4 Combining Revenue and Quality of Service

The primary question addressed in this paper is how to combine the QoS constraint with the objective of maximizing revenue within this constraint. Let p(s, a, Δt) and q(s, a, Δt) be the revenue and measured QoS components of the reward, c(s, a, Δt). Ideally, c(s, a, Δt) = p(s, a, Δt) when the QoS constraint is met and c(s, a, Δt) = -Large (where -Large is any large negative value) when it is not. If the QoS parameters could be accurately measured between each pair of state transitions, this approach would be a valid solution to the problem. In network systems, however, the QoS metrics contain a high degree of variability. For example, overload probabilities can be much smaller than 10^-3 while interarrival periods can span only a few ON/OFF cycles, so that except in states with the most egregious QoS violations, most interarrival periods will contain no overloads at all.

If the reward is a general function of revenue and QoS,

    c(s, a, \Delta t) = f(p(s, a, \Delta t), q(s, a, \Delta t)),    (4)

a necessary and sufficient condition for inducing the optimal policy under the QoS constraint is:

    E\{ f(p(s, a, \Delta t), q(s, a, \Delta t)) \} \begin{cases} = E\{ p(s, a, \Delta t) \} & \text{if } E\{ q(s, a, \Delta t) \} \le p^* \\ < 0 & \text{otherwise.} \end{cases}    (5)
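Condition (5) shows why naive reward mixing fails: any f satisfying it must effectively know E{q}, and a single noisy sample of q (often zero, as noted above) cannot supply that. One hypothetical workaround, sketched below and not to be read as the paper's method, is to threshold a per-state running estimate of the mean overload fraction:

    LARGE = 1e6   # stands in for "-Large": any sufficiently large penalty

    class QoSReward:
        """Reward f(p, q) aimed at condition (5), substituting a running
        per-key estimate of E{q} for the unknown expectation."""

        def __init__(self, p_star):
            self.p_star = p_star
            self.mean_q = {}   # key -> running mean of observed overload fraction
            self.count = {}

        def observe(self, key, q_sample):
            """Fold one measured QoS sample into the estimate for this key."""
            n = self.count.get(key, 0) + 1
            m = self.mean_q.get(key, 0.0)
            self.count[key] = n
            self.mean_q[key] = m + (q_sample - m) / n

        def reward(self, key, p_sample):
            """Pay revenue while the estimated constraint holds, else penalize."""
            if self.mean_q.get(key, 0.0) <= self.p_star:
                return p_sample
            return -LARGE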
Consider, for example, calls whose revenues are random and possibly negative (e.g., if they are net of the costs of billing and transport). Such a call should be accepted if E{p} > 0 and E{q} < p*. Therefore the correct reward function has the property

    E\{ f(p, q) \} > 0 \quad \text{if } E\{ p \} > 0 \text{ and } E\{ q \} < p^*.    (8)

The point of the example is that an f(·) satisfying (8) requires prior knowledge about the distributions of the revenue and the QoS as a function of the state. Even if it were possible [...]

[Figure: comparison of policies under Pareto ON/OFF traffic; the curves include a greedy peak-rate allocation policy and a greedy policy.]
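Decision rule (8) can be read directly as a test on the two expectations. A minimal sketch, with sample means standing in for the true expectations E{p} and E{q} (an assumption; the text states the rule in terms of the true distributions):

    def should_accept(p_samples, q_samples, p_star):
        """Accept iff the estimated E{p} > 0 and the estimated E{q} < p*."""
        mean_p = sum(p_samples) / len(p_samples)
        mean_q = sum(q_samples) / len(q_samples)
        return mean_p > 0.0 and mean_q < p_star

    # Hypothetical observations: profitable on average and within a 1e-3
    # overload limit, so the call should be accepted (prints True).
    print(should_accept([1.2, -0.3, 0.8], [0.0, 0.0, 0.0002], p_star=1e-3))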