{"title": "Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1039, "abstract": null, "full_text": "Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes \n\nJakob Carlstrom \n\nDepartment of Electrical Engineering, Technion, Haifa 32000, Israel \n\njakob@ee.technion.ac.il \n\nAbstract \n\nThis paper presents predictive gain scheduling, a technique for simplifying reinforcement learning problems by decomposition. Link admission control of self-similar call traffic is used to demonstrate the technique. The control problem is decomposed into on-line prediction of near-future call arrival rates, and precomputation of policies for Poisson call arrival processes. At decision time, the predictions are used to select among the policies. Simulations show that this technique results in significantly faster learning without any performance loss, compared to a reinforcement learning controller that does not decompose the problem. \n\n1 Introduction \nIn multi-service communications networks, such as Asynchronous Transfer Mode (ATM) networks, resource control is of crucial importance for the network operator as well as for the users. The objective is to maintain the service quality while maximizing the operator's revenue. At the call level, service quality (Grade of Service) is measured in terms of call blocking probabilities, and the key resource to be controlled is bandwidth. Network routing and call admission control (CAC) are two such resource control problems. \nMarkov decision processes offer a framework for optimal CAC and routing [1]. By modelling the dynamics of the network and its traffic, and computing control policies using dynamic programming [2], resource control is optimized. 
A standard assumption in such models is that calls arrive according to Poisson processes. This makes the models of the dynamics relatively simple. Although the Poisson assumption is valid for most user-initiated requests in communications networks, a number of studies [3, 4, 5] indicate that many types of arrival processes in wide-area networks as well as in local-area networks are statistically self-similar. This makes it difficult to find models of the dynamics, and the models become large and complex. If the number of system states is large, straightforward application of dynamic programming is infeasible. Nevertheless, the \"fractal\" burst structure of self-similar traffic should be possible to exploit in the design of efficient resource control methods. \nWe have previously presented a method based on temporal-difference (TD) learning for CAC of self-similar call traffic, which yields higher revenue than a TD-based controller assuming Poisson call arrival processes [7]. However, a drawback of this method is the slow convergence of the control policy. This paper presents an alternative solution to the above problem, called predictive gain scheduling. It decomposes the control problem into two parts: time-series prediction of near-future call arrival rates, and precomputation of a set of control policies for Poisson call arrival processes. At decision time, a policy is selected based on these predictions. Thus, the self-similar arrival process is approximated by a quasi-stationary Poisson process. The rate predictions are made by (artificial) neural networks (NNs), trained on-line. The policies can be computed using dynamic programming or other reinforcement learning techniques [6]. \nThis paper concentrates on the link admission control problem. However, the controllers we describe can be used as building blocks in optimal routing, as shown in [8] and [9]. 
Other recent work on reinforcement learning for CAC and routing includes [10], where Marbach et al. show how to extend the use of TD learning to network routing, and [11], where Tong et al. apply reinforcement learning to routing subject to Quality of Service constraints. \n\n2 Self-Similar Call Arrival Processes \nThe limitations of the traditional Poisson model for network arrival processes have been demonstrated in a number of studies, e.g. [3, 4, 5], which indicate the existence of heavy-tailed inter-arrival time distributions and long-term correlations in the arrival processes. Self-similar (fractal-like) models have been shown to correspond better with this traffic. \nA self-similar arrival process has no \"natural\" burst length. On the contrary, its arrival intensity varies considerably over many time scales. This makes the variance of its sample mean decay slowly with the sample size, and its auto-correlation function decay slowly with time, compared to Poisson traffic [4]. \nThe complexity of control and prediction of Poisson traffic is reduced by the memory-less property of the Poisson process: its expected future depends on the arrival intensity, but not on the process history. On the other hand, the long-range dependence of self-similar traffic makes it possible to improve predictions of the process future by observing the history. \nA compact statistical measure of the degree of self-similarity of a stochastic process is the Hurst parameter [4]. For self-similar traffic this parameter takes values in the interval (0.5, 1], whereas Poisson processes have a Hurst parameter of 0.5. \n\n3 The Link Admission Control Problem \nIn the link admission control (LAC) problem, a link with capacity C [units/s] is offered calls from K different service classes. Calls belonging to such a class j ∈ J = {1, ..., K} have the same bandwidth requirement b_j [units/s]. 
The per-class call holding times are assumed to be exponentially distributed with mean 1/μ_j [s]. \nAccess to the link is controlled by a policy π that maps states x ∈ X to actions a ∈ A, π: X → A. The set X contains all feasible link states, and the action set is \n\nA = {(a_1, ..., a_K) : a_j ∈ {0, 1}, j ∈ J}, \n\nwhere a_j is 0 for rejecting a presumptive class-j call and 1 for accepting it. The set of link states is given by X = N × H, where N is the set of feasible call number tuples, and H is the Cartesian product of some representations, h_j, of the history of the per-class call arrival processes (needed because of the memory of self-similar arrival processes). N is given by \n\nN = {n : n_j ≥ 0, j ∈ J; Σ_{j∈J} n_j b_j ≤ C}, \n\nwhere n_j is the number of type-j calls accepted on the link. \nWe assume uniform call charging, which means that the reward rate ρ(t) at time t is equal to the carried bandwidth: \n\nρ(t) = ρ(x(t)) = Σ_{j∈J} n_j(t) b_j    (1) \n\nTime evolves continuously, with discrete call arrival and departure events, enumerated by k = 0, 1, 2, ... Denote by r_{k+1} the immediate reward obtained from entering a state x_k at time t_k until entering the next state x_{k+1} at time t_{k+1}. The expectation of this reward is \n\nE_π{r_{k+1}} = E_π{ρ(x_k)[t_{k+1} - t_k]} = ρ(x_k) τ(x_k, π(x_k))    (2) \n\nwhere τ(x_k, π) is the expected sojourn time in state x_k under policy π. \nBy taking optimal actions, the policy controls the probabilities of state transitions so as to increase the probability of reaching states that yield high long-term rewards. The objective of link admission control is to find a policy π that maximizes the average reward per stage: \n\nR(π) = lim_{N→∞} E_π{ (1/N) Σ_{k=0}^{N-1} r_{k+1} | x_0 = x },  x ∈ X    (3) \n\nNote that the average reward does not depend on the initial state x, as the contribution from this state to the average reward tends to zero as N → ∞ (assuming, for example, that the probability of reaching any other state y ∈ X from every state x ∈ X is positive). \nCertain states are of special interest for the optimal policy. These are the states that are candidates for intelligent blocking. The set of such states X_ib ⊂ X is given by X_ib = N_ib × H, where N_ib is the set of call number tuples for which the available bandwidth is a multiple of the bandwidth of a wideband call. In the states of X_ib, the long-term reward may be increased by rejecting narrowband calls to reserve bandwidth for future, expected wideband calls. \n\n4 Solution by Predictive Gain Scheduling \nGain scheduling is a control theory technique, where the parameters of a controller are changed as a function of operating conditions [12]. The approach taken here is to look up policies in a table from predictions of the near-future per-class call arrival rates. \nFor Poisson call arrival processes, the optimal policy for the link admission control problem does not depend on the history, H, of the arrival processes. Due to the memory-less property, only the (constant) per-class arrival rates λ_j, j ∈ J, matter. In our gain scheduled control of self-similar call arrival processes, near-future λ_j are predicted from h_j. The self-similar call arrival processes are approximated by quasi-stationary Poisson processes, by selecting precomputed policies (for Poisson arrival processes) based on the predicted λ_j's. One radial-basis function (RBF) NN per class is trained to predict its near-future arrival rate. \n\n4.1 Solving the Link Admission Control Problem for Poisson Traffic \nFor Poisson call arrival processes, dynamic programming offers well-established techniques for solving the LAC problem [1]. 
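A value-determination/policy-improvement loop of this kind can be sketched on a generic discrete-time, average-reward MDP. This is a minimal illustrative sketch, not the paper's semi-Markov link model: the toy dynamics (per-action transition matrices `P[a]` and reward vectors `r[a]`) and the function name are assumptions, and NumPy is assumed for the linear solve.

```python
import numpy as np

def policy_iteration_avg(P, r, n_iter=100):
    """Average-reward policy iteration with relative values.

    P[a][x, y]: transition probability x -> y under action a.
    r[a][x]:    expected one-step reward in state x under action a.
    Value determination solves h(x) + g = r(x, pi(x)) + sum_y P h(y)
    with the normalization h(0) = 0; policy improvement then picks,
    per state, the action maximizing r(x, a) + sum_y P(y | x, a) h(y).
    """
    n_states = P[0].shape[0]
    n_actions = len(P)
    pi = np.zeros(n_states, dtype=int)
    g = 0.0
    for _ in range(n_iter):
        Ppi = np.array([P[pi[x]][x] for x in range(n_states)])
        rpi = np.array([r[pi[x]][x] for x in range(n_states)])
        # Unknowns: the gain g plus h(1), ..., h(n-1); since h(0) = 0,
        # the h(0) column of (I - Ppi) is replaced by the gain column.
        A = np.eye(n_states) - Ppi
        A[:, 0] = 1.0
        sol = np.linalg.solve(A, rpi)
        g, h = sol[0], np.concatenate(([0.0], sol[1:]))
        q = np.array([[r[a][x] + P[a][x] @ h for a in range(n_actions)]
                      for x in range(n_states)])
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break  # policy unchanged: it is average-reward optimal
        pi = new_pi
    return pi, g
```

The same structure carries over to the semi-Markov link model by weighting rewards with the expected sojourn times τ(x, π(x)) of (2).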
In this paper, policy iteration is used. It involves two steps: value determination and policy improvement. \nThe value determination step makes use of the objective function (3), and the concept of relative values [1]. The difference v(x, π) - v(y, π) between two relative values under a policy π is the expected difference in accumulated reward over an infinite time interval, starting in state x instead of state y. In this paper, the relative values are computed by solving a system of linear equations, a method chosen for its fast convergence. The dynamics of the system are characterized by state transition probabilities, given by the policy, the per-class call arrival intensities, {λ_j}, and mean holding times, {1/μ_j}. \nThe policy improvement step consists of finding the action that maximizes the relative value at each state. After improving the policy, the value determination and policy improvement steps are iterated until the policy does not change [9]. \n\n4.2 Determining the Prediction Horizon \nOver what future time horizon should we predict the rates used to select policies? In this work, the prediction horizon is set to an average of estimated mean first passage times from states back to themselves, in the following referred to as the mean return time. The arrival process is approximated by a quasi-stationary Poisson process within this time interval. \nThe motivation for this choice of prediction horizon is that the effects of a decision (action) in a state x_d influence the future probabilities of reaching other states and receiving the associated rewards, until the state x_d is reached the next time. When this happens, a new decision can be made, and the previous decision no longer influences the future expected reward. 
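Under a fixed policy, these mean return times reduce to a linear first-passage system. A minimal sketch, assuming a generic finite chain with transition matrix `P` and per-state expected sojourn times `tau` (the function names and the dictionary of limiting probabilities are illustrative; NumPy is assumed):

```python
import numpy as np

def mean_return_time(P, tau, n):
    """Mean return time of state n: solve, for all m != n,
    E{T_mn} = tau(m) + sum_{l != n} P[m, l] * E{T_ln},
    then add one sojourn in n followed by a passage back to n."""
    idx = [m for m in range(P.shape[0]) if m != n]
    A = np.eye(len(idx)) - P[np.ix_(idx, idx)]
    t = np.linalg.solve(A, tau[idx])          # passage times to n
    return tau[n] + P[n, idx] @ t

def link_return_time(P, tau, candidates, q):
    """Average of per-state mean return times over the candidate
    states, weighted by their limiting probabilities q and normalized."""
    w = [q[n] for n in candidates]
    return sum(qn * mean_return_time(P, tau, n)
               for qn, n in zip(w, candidates)) / sum(w)
```

For a two-state chain with P = [[0.5, 0.5], [0.5, 0.5]] and unit sojourn times, every state is revisited after two steps on average, matching the reciprocal of its limiting probability.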
In accordance with the assumption of quasi-stationarity, the mean return time can be estimated for call tuples n instead of the full state descriptor, x. \nIn the case of Poisson call arrival processes, the mean first passage times E_π{T_mn} from other states to a state n are the unique solution of the linear system of equations \n\nE_π{T_mn} = τ(m, a) + Σ_{l ∈ N\\{n}} p_ml E_π{T_ln},  m ∈ N\\{n},  a = π(m),    (4) \n\nwhere p_ml are the state transition probabilities under π. \nThe limiting probability q_n of occupying state n is determined for all states that are candidates for intelligent blocking, by solving a linear system of equations qB = 0, where B is a matrix containing the state transition intensities, given by {λ_j} and {1/μ_j}. \nThe mean return time for the link, T_l, is defined as the average of the individual mean return times of the states of N_ib, weighted by their limiting probabilities and normalized: \n\nT_l = Σ_{n ∈ N_ib} q_n E_π{T_nn} / Σ_{n ∈ N_ib} q_n    (5) \n\nFor ease of implementation, this time window is expressed as a number of call arrivals. The window length L_j for class j is computed by multiplying the mean return time by the arrival rate, L_j = λ_j T_l, and rounding off to an integer. Although the window size varies with λ_j, this variation is partly compensated by T_l decreasing with increasing λ_j. \n\n4.3 Prediction of Future Call Arrival Rates \nThe prediction of future call arrival rates is naturally based on measures of recent arrival rates. In this work, the following representation of the history of the arrival process is used: for all classes j ∈ J, exponentially weighted running averages h_j = (h_j1, ..., h_jM) of the inter-arrival times are computed on different time scales. These history vectors are computed using forgetting factors {a_1, ..., a_M} taking values in the interval (0, 1): \n\nh_ji(k) = a_i [t_j(k) - t_j(k - 1)] + (1 - a_i) h_ji(k - 1),    (6) \n\nwhere t_j(k) is the arrival time of the k-th call from class j. 
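The history update (6), together with a normalized Gaussian RBF read-out of the kind used for the rate predictor, can be sketched in a few lines. This is an illustrative sketch, not the trained predictor of the experiments: the function names are assumptions, and the centers, widths and output weights are taken as given.

```python
import math

def update_history(h, dt, forgetting):
    # Eq. (6): multi-time-scale exponentially weighted running
    # averages of the latest inter-arrival time dt, one per
    # forgetting factor a_i in (0, 1).
    return [a * dt + (1 - a) * hi for a, hi in zip(forgetting, h)]

def rbf_predict(h, centers, widths, weights):
    # Normalized symmetric-Gaussian RBF network: activations are
    # divided by their sum, so the output y is a smooth blend of the
    # output weights; the predicted arrival rate is then 1 / y.
    acts = [math.exp(-sum((x - c) ** 2 for x, c in zip(h, cen))
                     / (2.0 * w * w))
            for cen, w in zip(centers, widths)]
    total = sum(acts)
    return sum(wt * a for wt, a in zip(weights, acts)) / total
```

With the history vector sitting on one RBF center, the output is essentially that center's weight, i.e. the mean inter-arrival time associated with that operating region.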
\nIn studies of time-series prediction, non-linear feed-forward NNs outperform linear predictors on time series with long memory [13]. We employ RBF NNs with symmetric Gaussian basis functions. The activations of the RBF units are normalized by division by the sum of activations, to produce a smooth output function. The locations and widths of the RBF units can be determined by inspection of the data sets, to cover the region of history vectors. \nThe NN is trained with the average inter-arrival time as target. After every new call arrival, the prediction error ε_j(k) is computed: \n\nε_j(k) = (1/L_j) Σ_{i=1}^{L_j} [t_j(k + i) - t_j(k + i - 1)] - y_j(k).    (7) \n\nLearning is performed on-line using the least mean squares rule, which means that the updating must be delayed by L_j call arrivals. The predicted per-class arrival rates λ̂_j(k) = y_j(k)^{-1} are used to select a control policy on the arrival of a call request. \nGiven the prediction horizon and the arrival rate predictor, a_1, ..., a_M can be tuned by linear search to minimize the prediction error on sample traffic traces. \n\n5 Numerical Study \nThe performance of the gain scheduled admission controller was evaluated on a simulated link with capacity C = 24 [units/s], which was offered calls from self-similar call arrival processes. For comparison, the simulations were repeated with three other link admission controllers: two TD-based controllers, one table-based and one NN-based, and a controller using complete sharing, i.e. accepting a call whenever the free capacity on the link is sufficient. \nThe NN-based TD controller [7] uses RBF NNs (one per n ∈ N), receiving (h_1, h_2) as input. Each NN has 65 hidden units, factorized to 8 units per call class, plus a default activation unit. Its weights were initialized to favor acceptance of all feasible calls in all states. 
\nThe table-based TD controller assumes Poisson call arrival processes. From this, it follows that the call number tuples n ∈ N constitute Markovian states. Consequently, the value function table stores only one value per n. This controller was used for evaluation of the performance loss from incorrectly modelling self-similar call traffic as Poisson traffic. \n\n5.1 Synthesis of Call Traffic \nSynthetic traffic traces were generated from a Gaussian fractional auto-regressive integrated moving average model, FARIMA(0, d, 0). This results in a statistically self-similar arrival process, where the Hurst parameter is easily tuned [7]. \nWe generated traces containing arrival/departure pairs from two call classes, characterized by bandwidth requirements b_1 = 1 (narrow-band) and b_2 = 6 (wide-band) [units/s] and call holding times with mean 1/μ_1 = 1/μ_2 = 1 [s]. A Hurst parameter of 0.85 was used, and the call arrival rates were scaled to make the expected long-term arrival rates λ_1 and λ_2 for the two classes fulfill b_1 λ_1 / μ_1 + b_2 λ_2 / μ_2 = 1.25 C. The ratio λ_1/λ_2 was varied from 0.4 to 2.0. \n\n5.2 Gain Scheduling \nFor simplicity, a constant prediction horizon was used throughout the simulations. It was computed according to section 4.2. By averaging the resulting prediction windows for λ_1/λ_2 = 0.4, 1.0 and 2.0, a window size L_1 = L_2 = 6 was obtained. \nThe table of policies to be used for gain scheduling was computed for predicted λ_1 and λ_2 ranging from 0.5 to 15 with step size 0.5; in total 900 policies. The two rate-prediction NNs both had 9 hidden units. The NNs' weights were initialized to 0. \n\n5.3 Numerical Results \nBoth the TD learning controllers and the gain scheduling controller were allowed to adapt to the first 400 000 simulated call arrivals of the traffic traces. The throughput obtained by all four methods was measured on the subsequent 400 000 call arrivals. 
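A Gaussian FARIMA(0, d, 0) series of the kind used in section 5.1 can be synthesized via a truncated moving-average expansion of the fractional integration operator. This is an assumed sketch, not necessarily the paper's generator: the function name, truncation length and seed handling are illustrative, and the mapping from the Gaussian series to call arrival times is omitted.

```python
import random

def farima_0d0(n, d, n_weights=500, seed=0):
    """Truncated MA(infinity) sketch of Gaussian FARIMA(0, d, 0):
    psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k.
    For 0 < d < 0.5 the series is long-range dependent with Hurst
    parameter H = d + 0.5, so d = 0.35 gives H = 0.85."""
    psi = [1.0]
    for k in range(1, n_weights):
        psi.append(psi[-1] * (k - 1 + d) / k)
    rng = random.Random(seed)
    # Pre-roll n_weights innovations so every output uses a full window.
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + n_weights)]
    return [sum(p * eps[t + n_weights - 1 - k] for k, p in enumerate(psi))
            for t in range(n)]
```

With d = 0 all weights beyond psi_0 vanish and the output degenerates to white noise (Hurst parameter 0.5); with d = 0.35 the series shows the strong positive short-lag correlation characteristic of H = 0.85 traffic.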
\n\n[Figure 1: Weight evolution in the neural predictor (a: initial, b: long-term); weight evolution in the NN-based TD controller (c); throughput [units/s] versus arrival rate ratio λ_1/λ_2 for GS/RBF, TD/RBF, TD/TBL and CS (d).] \n\nFigure 1 (a, b) shows the evolution of the weights of the call arrival rate predictor for class 2, and figure 1 (c) displays nine weights of the RBF NN corresponding to the call number tuple (n_1, n_2) = (6, 2), which is a candidate for intelligent blocking. These weights correspond to eight different class-2 center vectors, plus the default activation. \nThe majority of the weights of the gain scheduling RBF NN seem to converge in a few thousand call arrivals, whereas the TD learning controller needs about 100 000 call arrivals to converge. This is not surprising, since the RBF NNs of the TD learning controllers split up the set of training data, so that a single NN is updated much less frequently than a rate-predicting NN in the gain scheduling controller. Secondly, the TD learning NNs are trained on moving targets, due to the temporal-difference learning rule, stochastic action selection and a changing policy. \nA few of the weights of the gain scheduling NN change considerably even after long training. These weights correspond to RBF units that are activated by rare, large inputs. \nFigure 1 (d) evaluates performance in terms of throughput versus arrival rate ratio. Each data point is the averaged throughput for 10 traffic traces. 
Gain scheduling (GS/RBF) achieves the same throughput as TD learning with RBF NNs (TD/RBF), up to 1.3% higher throughput than tabular TD learning (TD/TBL), and up to 5.7% higher than complete sharing (CS). The difference in throughput between TD learning and complete sharing is greatest for low arrival rate ratios, since the throughput gained by reserving bandwidth for high-rate wideband calls is considerably higher than the throughput lost from the blocked low-rate narrowband traffic. \n\n6 Conclusion \nWe have presented predictive gain scheduling, a technique for decomposing reinforcement learning problems. Link admission control, a sub-problem of network routing, was used to demonstrate the technique. By predicting near-future call arrival rates from one part of the full state descriptor, precomputed policies for Poisson call arrival processes (computed from the rest of the state descriptor) were selected. This increased the on-line convergence rate approximately 50 times, compared to a TD-based admission controller receiving the full state descriptor as input. The decomposition did not result in any performance loss. \nThe controller using predictive gain scheduling may reach a computational bottleneck if the size of the state space is increased: the determination of optimal policies for Poisson traffic by policy iteration. This can be overcome by state aggregation [2], or by parametrizing the relative value function combined with temporal-difference learning [10]. It is also possible to significantly reduce the number of relative value functions. In [14], we showed that linear interpolation of relative value functions distributed by an error-driven algorithm enables the use of less than 30 relative value functions without performance loss. 
Further, we have successfully employed gain scheduled link admission control as a building block of network routing [9], where the performance improvement compared to conventional methods is larger than for the link admission control problem. \nThe use of gain scheduling to reduce the complexity of reinforcement learning problems is not limited to link admission control. In general, the technique should be applicable to problems where parts of the state descriptor can be used, directly or after preprocessing, to select among policies for instances of a simplified version of the original problem. \n\nReferences \n[1] Z. Dziong, ATM Network Resource Management, McGraw-Hill, 1997. \n[2] D.P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass., 1995. \n[3] V. Paxson and S. Floyd, \"Wide-Area Traffic: The Failure of Poisson Modeling\", IEEE/ACM Transactions on Networking, vol. 3, pp. 226-244, 1995. \n[4] W.E. Leland, M.S. Taqqu, W. Willinger and D.V. Wilson, \"On the Self-Similar Nature of Ethernet Traffic (Extended Version)\", IEEE/ACM Transactions on Networking, vol. 2, no. 1, pp. 1-15, Feb. 1994. \n[5] A. Feldmann, A.C. Gilbert, W. Willinger and T.G. Kurtz, \"The Changing Nature of Network Traffic: Scaling Phenomena\", Computer Communication Review, vol. 28, no. 2, pp. 5-29, April 1998. \n[6] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass., 1998. \n[7] J. Carlstrom and E. Nordstrom, \"Reinforcement Learning for Control of Self-Similar Call Traffic in Broadband Networks\", Teletraffic Engineering in a Competitive World - Proceedings of the 16th International Teletraffic Congress (ITC 16), pp. 571-580, Elsevier Science B.V., 1999. \n[8] Z. Dziong and L. Mason, \"Call Admission Control and Routing in Multi-service Loss Networks\", IEEE Transactions on Communications, vol. 42, no. 2, pp. 2011-2022, Feb. 1994. \n[9] J. Carlstrom and E. Nordstrom, \"Gain Scheduled Routing in Multi-Service Networks\", Technical Report 2000-009, Dept. of Information Technology, Uppsala University, Uppsala, Sweden, April 2000. \n[10] P. Marbach, O. Mihatsch and J.N. Tsitsiklis, \"Call Admission Control and Routing in Integrated Service Networks Using Neuro-Dynamic Programming\", IEEE Journal on Selected Areas in Communications, Feb. 2000. \n[11] H. Tong and T. Brown, \"Adaptive Call Admission Control Under Quality of Service Constraints: A Reinforcement Learning Solution\", IEEE Journal on Selected Areas in Communications, Feb. 2000. \n[12] K.J. Astrom and B. Wittenmark, Adaptive Control, 2nd ed., Addison-Wesley, 1995. \n[13] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 1999. \n[14] J. Carlstrom, \"Efficient Approximation of Values in Gain Scheduled Routing\", Technical Report 2000-010, Dept. of Information Technology, Uppsala University, Uppsala, Sweden, April 2000. \n", "award": [], "sourceid": 1915, "authors": [{"given_name": "Jakob", "family_name": "Carlstr\u00f6m", "institution": null}]}