{"title": "Switch Packet Arbitration via Queue-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1337, "page_last": 1344, "abstract": "", "full_text": "Switch Packet Arbitration via Queue-Learning\n\nTimothy X Brown\n\nElectrical and Computer Engineering\nInterdisciplinary Telecommunications\n\nUniversity of Colorado\nBoulder, CO 80309-0530\ntimxb@colorado.edu\n\nAbstract\n\nIn packet switches, packets queue at switch inputs and contend for out-\nputs. The contention arbitration policy directly affects switch perfor-\nmance. The best policy depends on the current state of the switch and\ncurrent traf\ufb01c patterns. This problem is hard because the state space,\npossible transitions, and set of actions all grow exponentially with the\nsize of the switch. We present a reinforcement learning formulation of\nthe problem that decomposes the value function into many small inde-\npendent value functions and enables an ef\ufb01cient action selection.\n\n1 Introduction\n\nReinforcement learning (RL) has been applied to resource allocation problems in telecom-\nmunications. e.g., channel allocation in wireless systems, network routing, and admis-\nsion control in telecommunication networks [1, 3, 7, 11]. These have demonstrated rein-\nforcement learning can \ufb01nd good policies that signi\ufb01cantly increase the application reward\nwithin the dynamics of the telecommunications problems. However, a key issue is how to\nscale these problems when the state space grows quickly with problem size.\n\nThis paper focuses on packet arbitration for data packet switches. Packet switches are un-\nlike telephone circuit switches in that packet transmissions are uncoordinated and clusters\nof traf\ufb01c can simultaneously contend for switch resources. A packet arbitrator decides the\norder packets are sent through the switch in order to minimize packet queueing delays and\nthe switch resources needed. 
Switch performance depends on the arbitration policy and the pattern of traffic entering the switch.

A number of packet arbitration strategies have been developed for switches. Many have fixed policies for sending packets that do not depend on the actual patterns of traffic in the network [10]. Under worst-case traffic, these arbitrators can perform quite poorly [8]. Theoretical work has shown that consideration of future packet arrivals can have a significant impact on switch performance, but is computationally intractable (NP-hard) to use [4]. As we will show, a dynamic arbitration policy is difficult since the state space, possible transitions, and set of actions all grow exponentially with the size of the switch.

In this paper, we consider the problem of finding an arbitration policy that dynamically and efficiently adapts to traffic conditions. We present queue-learning, a formulation that effectively decomposes the problem into many small RL sub-problems.

Figure 1: The packet arbitration model. (a) In each time slot, packet sources generate λ_ij packets on average at input i for output j; in the example, the entries (values 0.3, 0.6, and 0) give every input and every output a load of 0.9. (b) Packets arrive at a 3 x 3 input-queued switch and are stored in queues, one per input. The number label on each packet indicates to which output the packet is destined. (c) The corresponding queue states X, where x_ij indicates the number of packets waiting at input i destined for output j.

The independent RL problems are coupled via an efficient algorithm that trades off actions in the different sub-problems.
Results show significant performance improvements.

2 Problem Description

The problem comprises N traffic sources generating traffic at each of N inputs to a packet data switch, as shown in Figure 1. Time is divided into discrete time slots, and in each time slot each source generates 0 or 1 packets. Each packet that arrives at the switch is labeled with which of the N outputs the packet is headed to. In every time slot, the switch takes packets from inputs and delivers them at their intended output. We describe the specific models for each used in this paper and then state the packet arbitration problem.

2.1 The Traffic Sources

At input i, a traffic source generates a packet destined for output j with probability λ_ij at the beginning of each time slot. If λ_i = Σ_j λ_ij is the load on input i and λ'_j = Σ_i λ_ij is the load on output j, then for stability we require λ_i < 1 for every input i and λ'_j < 1 for every output j. The matrix Λ = [λ_ij] only represents long-term average loads between input i and output j. We treat the case where packet arrivals are uncorrelated over time and between sources, so that in each time slot a packet arrives at input i with probability λ_i, and given that we have an arrival, it is destined for output j with probability λ_ij / λ_i. Let the set of packet arrivals be A.

2.2 The Switch

The switch alternates between accepting newly arriving packets and sending packets in every time slot.
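The traffic source model of Section 2.1 can be sketched in a few lines of code. This is an illustrative sketch, not the paper's implementation; the example rate matrix and function names are our assumptions, chosen so that each input carries a load of 0.9.

```python
import random

# Example long-term rate matrix: lam[i][j] is the average number of
# packets per slot arriving at input i destined for output j.
# Each row sums to 0.9, so every input has load 0.9 (stable: < 1).
LAM = [[0.3, 0.6, 0.0],
       [0.3, 0.0, 0.6],
       [0.3, 0.3, 0.3]]

def draw_arrivals(lam, rng=random):
    """One time slot of Bernoulli arrivals: input i generates at most one
    packet, with probability equal to its row sum; given an arrival, the
    destination j is chosen with probability lam[i][j] / row_sum."""
    n = len(lam)
    arrivals = [[0] * n for _ in range(n)]
    for i in range(n):
        load_i = sum(lam[i])            # arrival probability at input i
        if rng.random() < load_i:
            # pick destination j proportionally to lam[i][j]
            r, acc = rng.random() * load_i, 0.0
            for j, rate in enumerate(lam[i]):
                acc += rate
                if r < acc:
                    arrivals[i][j] = 1
                    break
    return arrivals
```

Over many slots, the empirical arrival frequency at each (input, output) pair approaches the corresponding entry of the rate matrix.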
At the start of the time slot the switch sends packets waiting in the input queues and delivers them to the correct output, where they are sent on. Let S = [s_ij] represent the set of packets sent, where s_ij = 1 if a packet is sent from input i to output j and s_ij = 0 otherwise. The packets it can send are limited by the input and output constraints: the switch can send at most one packet per input and can deliver at most one packet to each output. After sending packets, the new arrivals are added at the inputs and the switch moves to the next time slot. Other switch designs are possible, but this is the simplest and a common architecture in high-speed switches.

2.3 The Input Queues

Because the traffic sources are uncoordinated, it is possible for multiple packets to arrive in one time slot at different inputs but destined for the same output. Because of the output constraint, only one such packet may be sent; the others must be buffered in queues, one queue per input. Thus packet queueing is unavoidable, and the goal is to limit the delays due to queueing.

The queues are random access, which means packets can be sent in any order from a queue. For the purposes of this paper, all packets waiting at an input and destined for the same output are considered equivalent. Let X = [x_ij] be a matrix where x_ij is the number of packets waiting at input i for output j, as shown in Figure 1c.

2.4 Packet Arbitration

The packet arbitration problem is: given the state of the input queues, X, choose a set of packets to send, S, so that at most one packet is sent from each input and at most one packet is delivered to each output.
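The feasibility conditions on a send matrix S can be checked mechanically. This is a minimal sketch of the constraints just stated; the function name is ours.

```python
def is_valid_send(X, S):
    """A send matrix S is feasible iff it is binary, only sends queued
    packets (S[i][j] <= X[i][j]), each input sends at most one packet
    (row sums <= 1), and each output receives at most one (col sums <= 1)."""
    n = len(X)
    if any(S[i][j] not in (0, 1) or S[i][j] > X[i][j]
           for i in range(n) for j in range(n)):
        return False
    if any(sum(S[i]) > 1 for i in range(n)):                       # input constraint
        return False
    if any(sum(S[i][j] for i in range(n)) > 1 for j in range(n)):  # output constraint
        return False
    return True
```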
We want a packet arbitration policy that minimizes the expected packet wait time.

When S is sent, the remaining packets must wait at least one more time slot before they can be sent. Let |X| be the total number of packets in all the input queues, let |A| be the number of new arrivals, and let |S| be the number of packets sent. Thus, the total wait of all packets is increased by the number of packets that remain: |X| + |A| − |S|. By Little's theorem, the expected wait time is proportional to the expected number of packets waiting in each time slot [10]. Thus, we want a policy that minimizes the expected value of |X| + |A| − |S|.

The complexity of this problem is high. Given an N input and N output switch, the input and output constraints are met with equality if S is a permutation matrix; in general, S is a subset of a permutation matrix (zeros everywhere except that every row has at most one one and every column has at most one one). This implies there are as many as N! possible S to choose from. In each time slot at each input, a packet can arrive for one of the N outputs or not at all. This implies as many as (N + 1)^N possible transitions after each send. If each x_ij ranges from 0 to L packets, then the number of states in the system is L^(N^2). A minimal representation would only indicate whether each sub-queue is empty or not, resulting in 2^(N^2) states. Thus, every aspect of the problem grows exponentially in the size of the switch.

Traditionally, switching avoids these problems by not considering the possible next arrivals, and by using a search algorithm, with time complexity polynomial in N, that considers only the current state X.
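The growth rates above are easy to verify for a small switch. This sketch (ours, for illustration) checks the counts for N = 3 and also enumerates the exact number of feasible send matrices by brute force.

```python
import math
from itertools import product

N = 3

# Up to N! send patterns meet the constraints with equality (permutations).
assert math.factorial(N) == 6

# Arrival events per slot: each input gets a packet for one of N outputs
# or nothing at all -> (N + 1)**N possible arrival events.
assert (N + 1) ** N == 64

# Minimal state: each of the N*N subqueues is empty or not -> 2**(N*N).
assert 2 ** (N * N) == 512

def count_subpermutations(n):
    """Exact count of feasible 0/1 send matrices (row and column sums <= 1),
    i.e. sub-permutation matrices, by brute force over all 0/1 matrices."""
    count = 0
    for bits in product((0, 1), repeat=n * n):
        M = [bits[i * n:(i + 1) * n] for i in range(n)]
        if all(sum(r) <= 1 for r in M) and \
           all(sum(M[i][j] for i in range(n)) <= 1 for j in range(n)):
            count += 1
    return count
```

Even at N = 3 there are dozens of feasible send sets per slot; all three counts explode as N grows.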
For instance, the problem can be formulated as a so-called matching problem, and polynomial algorithms exist that will send the largest S possible [2, 6, 8]. While maximizing the packets sent in every time slot may seem like a solution, the problem is more interesting than this. In general, many possible S will maximize the number of packets that are sent. Which one can we send now so that we will be in the best possible state for future time slots? Some heuristics can guide this choice, but these are insensitive to the traffic pattern Λ [9]. Further, it can be shown that to minimize the total wait it may be necessary to send less than the maximum number of packets in the current time slot [4]. So, we look for a solution that efficiently finds policies that minimize the total wait by adapting to the current traffic pattern.

The problem is especially amenable to RL for two reasons. (1) Packet rates are fast, up to millions of packets per second, so that many training examples are available. (2) Occasional bad decisions are not catastrophic; they only increase packet delays somewhat, and so it is possible to freely learn in an online system. The next section describes our solution.

3 Queue-Learning Solution

At any given time slot, t, the system is in a particular state, X_t, and the packet arbitrator can choose to send any valid S_t. New packets, A_t, arrive. The cost, c_t = |X_t| + |A_t| − |S_t|, is the number of packets that remain. The task of the learner is to determine a packet arbitration policy that minimizes the total average cost. We use the Tauberian approximation, that is, we assume the discount factor is close enough to 1 so that the discounted reward policy is equivalent to the average reward policy [5].
Since minimizing the expected value of this cost is equivalent to minimizing the expected wait time, this formulation provides an exact match between RL and the problem task.

As shown already, every aspect of this problem scales badly. The solution to this problem is three-fold. First, we use online learning and afterstates [12] to eliminate the need to average over the (N + 1)^N possible next states. Second, we show how the value function can yield a set of inputs into a polynomial algorithm for choosing actions. Third, we decompose the value function so the effective number of states is much smaller than L^(N^2). We describe each in turn.

3.1 Afterstates

RL methods solve MDP problems by learning good approximations to the optimal value function, V*. A single time slot consists of two stages: new arrivals are added to the queues and then packets are sent (see Figure 2). The value function could be computed after either of these stages. We compute it after packets are sent since we can use the notion of afterstates to choose the action. Since the packet sending process is deterministic, we know the state following the send action. In this case, the Bellman equation is:

V*(X) = E_A [ min_{S in U(X, A)} { c(X, A, S) + γ V*(X') } ]

where γ is the discount factor, c(X, A, S) = |X| + |A| − |S| is the effective immediate cost, U(X, A) is the set of actions available in the current state X after arrival event A, E_A is the expectation over possible arrival events, and the resulting next state is X' = X + A − S.

We learn an approximation to V* using TD(0) learning. At time-step t, on a transition from state X_t to X_{t+1} on action S_{t+1} after event A_{t+1}, we update an estimate of V via

V(X_t) <- V(X_t) + α [ c(X_t, A_{t+1}, S_{t+1}) + γ V(X_{t+1}) − V(X_t) ]

where α is the learning step size.

With afterstates, the action (which set of packets to send) depends on both the current state and the event. The best action is the one that results in the lowest value function in the next state (which is known deterministically given X_t, A_{t+1}, and S_{t+1}). In this way, afterstates eliminate the need to average over a large number of non-zero transitions to find the best action.

3.2 Choosing the Action

We compare every action with the action of not sending any packets. The best action is the set of packets meeting the input and output constraints that will reduce the value function the most compared to not sending any packets.

Each input-output pair (i, j) has an associated queue at the input, q_ij. Packets in q_ij contend with other packets at input i and other packets destined for output j. If we send a packet from q_ij, then no other packet at the same input or output will be sent.
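The afterstate TD(0) update of Section 3.1 can be sketched in tabular form. This is illustrative only, with names of our choosing; the full state space is far too large for a table, which is precisely why the value function is decomposed later in the paper.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.99, 0.01

def td0_update(V, after_t, cost, after_t1):
    """One TD(0) step between successive afterstates (states just after
    a send): V(x_t) <- V(x_t) + alpha*[c + gamma*V(x_{t+1}) - V(x_t)],
    where the cost c = |X| + |A| - |S| is the number of packets left."""
    key_t = tuple(map(tuple, after_t))     # hashable matrix keys
    key_t1 = tuple(map(tuple, after_t1))
    td_error = cost + GAMMA * V[key_t1] - V[key_t]
    V[key_t] += ALPHA * td_error
    return td_error

# V maps each afterstate to its estimated long-run cost-to-go.
V = defaultdict(float)
```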
Figure 2: Timing of packet arrivals and sends relative to decisions and the value function. In the example, the queue state X_t = [[2, 1, 1], [1, 0, 0], [0, 1, 0]] receives arrivals A_{t+1} = [[0, 1, 0], [0, 0, 0], [0, 1, 0]] (the stochastic step), giving queues after arrivals [[2, 2, 1], [1, 0, 0], [0, 2, 0]]; the decision then sends S_{t+1} = [[0, 0, 1], [1, 0, 0], [0, 1, 0]] (the deterministic step), leaving the next state X_{t+1} = [[2, 2, 0], [0, 0, 0], [0, 1, 0]].

In other words, packets in q_ij interact primarily with packets in the same row and column; packets in other rows and columns only have an indirect effect on the value of sending a packet from q_ij.

This suggests the following approximation. Let X(A) = X + A be the number of packets in every subqueue after arrivals A in state X and before the decision. Let w_ij(X(A)) be the reduction in the value function if one packet is sent from subqueue q_ij (0 if the subqueue is empty).
We can reformulate the best action as:

S* = argmax_S Σ_ij w_ij(X(A)) s_ij

subject to the constraints: s_ij in {0, 1}; Σ_j s_ij ≤ 1 for every input i; and Σ_i s_ij ≤ 1 for every output j.

This problem can be solved as a linear program and is also known as the weighted matching or the assignment problem, which has a polynomial time solution [13]. In this way, we reduce the search over the N! possible actions to a polynomial time solution.

3.3 Decomposing the Value Function

The interaction between queues in the same row or the same column is captured primarily by the input and output constraints. This suggests a further simplifying approximation with the following decomposition.

We compute a separate value function for each q_ij, denoted V_ij(X). In principle, this can depend on the entire state X, but it can be reduced to consider only the elements of the state relevant to q_ij: each V_ij estimates its associated value function based on the packets at input i and the packets destined for output j.

Many forms of V_ij(X) could be considered, but we consider a linear approximation.
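Returning to the argmax of Section 3.2: it is a maximum-weight bipartite matching. In practice a polynomial assignment algorithm [13] would be used; for a small switch, brute force over input-to-output permutations shows the idea (a sketch with names of our choosing, not the paper's code).

```python
from itertools import permutations

def best_send(w, X):
    """Pick the feasible send matrix maximizing total weight w[i][j],
    where w[i][j] is the value reduction for sending from subqueue (i, j).
    Every feasible send set is a subset of some permutation, so we scan
    all permutations and drop entries that are empty or unhelpful.
    An O(N^3) assignment algorithm scales far better than this."""
    n = len(w)
    best_val, best_S = 0.0, [[0] * n for _ in range(n)]
    for perm in permutations(range(n)):   # output assigned to each input
        val = 0.0
        S = [[0] * n for _ in range(n)]
        for i, j in enumerate(perm):
            if X[i][j] > 0 and w[i][j] > 0:   # sending must reduce value
                val += w[i][j]
                S[i][j] = 1
        if val > best_val:
            best_val, best_S = val, S
    return best_S, best_val
```

Note that if no weight is positive, the method returns the all-zero send matrix: consistent with the paper's observation that sending fewer than the maximum number of packets can be optimal.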
Let r_i be the total number of packets waiting at input i, and let c_j be the total number of packets waiting for output j. With these variables we define a linear approximation with parameters θ^ij = (θ^ij_0, θ^ij_1, θ^ij_2, θ^ij_3):

V_ij(X) = θ^ij_0 + θ^ij_1 x_ij + θ^ij_2 r_i + θ^ij_3 c_j.    (1)

It follows that the value of sending a packet (compared to not sending a packet) from q_ij, given the post-arrival state X(A), is

w_ij(X(A)) = V_ij(X(A)) − V_ij(X(A) − e_ij),

where e_ij is the matrix with a one in entry (i, j) and zeros elsewhere. This is computed for each q_ij and used in the weighted matching of Section 3.2 to compute which packets to send.
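A minimal version of the per-subqueue linear approximation and the derived matching weight can be sketched as follows. Treating the feature vector as (1, x_ij, r_i, c_j) is an assumption on our part, as are all function names.

```python
def features(X, i, j):
    """Features for subqueue (i, j): bias, its own length x_ij, the total
    r_i waiting at input i, and the total c_j waiting for output j."""
    n = len(X)
    r_i = float(sum(X[i]))
    c_j = float(sum(X[k][j] for k in range(n)))
    return (1.0, float(X[i][j]), r_i, c_j)

def v_ij(theta, X, i, j):
    """Linear value estimate for subqueue (i, j): theta . features."""
    return sum(t * f for t, f in zip(theta, features(X, i, j)))

def send_weight(theta, X, i, j):
    """w_ij: value reduction from sending one packet from (i, j) versus
    sending nothing (0 if the subqueue is empty)."""
    if X[i][j] == 0:
        return 0.0
    Xs = [row[:] for row in X]
    Xs[i][j] -= 1                 # afterstate with one packet sent
    return v_ij(theta, X, i, j) - v_ij(theta, Xs, i, j)
```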
Learning for this problem is standard TD(0) for linear approximations [12]. The combination of decomposition and linear value function approximation reduces the problem to estimating a handful of parameters for each of the N^2 subqueues.

No explicit exploration is used since, from the perspective of V_ij, enough stochasticity already exists in the packet arrival and send processes. To assist the switch early in the learning, the switch sends the packets from a maximum matching in each time slot (instead of the packets selected by queue-learning). This initial assist period during training was found to bring the switch into a good operating regime from which it could learn a better policy.

In summary, we simplify the exponential computation for this problem by decomposing the state into N^2 substates. Each substate computes the value of sending a packet versus not sending a packet, and a polynomial algorithm computes the action that maximizes the total value across substates subject to the input and output constraints.

4 Implementation Issues

A typical high-speed link rate is OC-3 (155 Mbps). In ATM at this rate, the packet rate is 366k time slots/s, or less than 30 s for 10^7 time slots. For learning, the number of floating point operations per time slot is proportional to P, the number of parameters in the linear approximation. At the above packet rate, for an N x N switch this translates into 650 MFLOPS, which is within existing high-end microprocessor capacity. For computation of the packets to send, the cost is approximately N^2 operations to compute the weights; to compute the maximum weight matching, an O(N^3) algorithm exists [13].

New optical transport technologies are pushing data rates one and two orders of magnitude greater than OC-3 rates. In this case, if computing is limited then the queue-learning can learn on a subsample of time slots.
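The TD(0) update for a linear value function, as used in Section 3.3, takes the standard gradient form [12]. A sketch (our naming; per-subqueue parameters would each be updated this way):

```python
def td0_linear_update(theta, phi_t, cost, phi_t1, gamma=0.99, alpha=1e-4):
    """TD(0) for a linear value function V(x) = theta . phi(x):
    theta <- theta + alpha * [c + gamma*V(x') - V(x)] * phi(x)."""
    v_t = sum(t * f for t, f in zip(theta, phi_t))
    v_t1 = sum(t * f for t, f in zip(theta, phi_t1))
    delta = cost + gamma * v_t1 - v_t      # TD error
    return [t + alpha * delta * f for t, f in zip(theta, phi_t)]
```

Because the update is a constant-size dot product and scale, the per-slot learning cost stays proportional to the number of parameters, as discussed above.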
To compute the packets to send, the decomposition has a natural parallel implementation that can divide it among processors. Massively parallel neural networks can also be used to compute the maximum weighted matching [2, 9].

5 Simulation Results

We applied our procedure to N x N switches under different loads. The parameters used in the experiments are shown in Table 1. In each experiment, the queue-learning was trained for an initial period, and then the mean wait time, W_QL, was measured over a test period. We compared performance to two alternatives. The first alternative sends the largest number of packets in every time slot; if multiple sets are equally large, it chooses randomly between them. We simulate this arbitrator and measure the mean packet wait time, W_max. The best possible switch is a so-called output-queued switch [10]. Such a switch is difficult to build at high speeds, but we can compute its mean packet wait time, W_out, via simulation. The results are reported in normalized form as the gain

G = (W_max − W_QL) / (W_max − W_out).

Thus, if our queue-learning solution is no better than a max-send arbitrator, the gain will be 0, and if we achieve the performance of the output-queued switch, the gain will be 1.

We experimented with five different traffic loads. Λ1 is a uniform load of 0.6 packets per input per time slot, with each packet uniformly destined for one of the outputs. Similarly, Λ2 is a uniform load of 0.9.
The uniform load is a common baseline scenario for evaluating switches. Λ3 and Λ4 are random matrices where the sums of loads per row and column are 0.6 and 0.9 (as in Λ1 and Λ2) but the distribution is not uniform. These are generated by summing N random permutation matrices and then scaling the entries to yield the desired row and column sums (e.g., Figure 1a). The random load is more realistic in that loads tend to vary among the different input/output pairs. Λ5 is Λ4, except that the entries λ_ij for a subset of the outputs are set to zero. This simulates the more typical case of traffic being concentrated on a few outputs.

Table 1: RL parameters. Discount γ = 0.99; learn rate α; and the assist, train, and test periods, given in time slots.

Table 2: Simulation Results.

Switch Loading             Normalized Wait Reduction (G)
Λ1 (uniform 0.6 load)      10%
Λ2 (uniform 0.9 load)      50%
Λ3 (random 0.6 load)       14%
Λ4 (random 0.9 load)       70%
Λ5 (truncated 0.9 load)    84%

We emphasize that a different policy is learned for each of these loads. The different loads suggest the kinds of improvements that we might expect if queue-learning is implemented. The results for the five loads are given in Table 2.

6 Conclusion

This paper showed that queue-learning is able to learn a policy that significantly reduces the wait times of packets in a high-speed switch.
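The normalized wait reduction reported in Table 2 is straightforward to compute; a one-line sketch (variable names are ours):

```python
def normalized_gain(w_max, w_ql, w_out):
    """Fraction of the gap between the max-send arbitrator (W_max) and the
    ideal output-queued switch (W_out) that queue-learning (W_QL) closes:
    0 means no better than max-send; 1 matches output queueing."""
    return (w_max - w_ql) / (w_max - w_out)
```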
It uses a novel decomposition of the value function combined with efficient computation of the action to overcome the problems a traditional RL approach would have with the large number of states, actions, and transitions. The approach is able to gain 10% to 84% of the possible reductions in wait times. The largest gains occur when the network is more heavily loaded and delays are largest. The gains are also largest when the switch load is least uniform, which is what is most likely to be encountered in practice.

Traditional thinking in switching is that input-queued switches are much worse than the optimal output-queued switches, and that improving performance would require increasing switching speeds (the electronic switching is already the slowest part of the otherwise optical networking) or using information about future arrivals (which may not exist and in any case is NP-hard to use optimally). The queue-learning approach is able to use its estimates of the future impact of its packet send decisions in a consistent framework that bridges the majority of the gap between current input queueing and optimal output queueing.

Acknowledgment

This work was supported by CAREER Award NCR-9624791.

References

[1] Boyan, J.A., Littman, M.L., "Packet routing in dynamically changing networks: a reinforcement learning approach," in Cowan, J.D., et al., eds., Advances in NIPS 6, Morgan Kauffman, SF, 1994, pp. 671-678.

[2] Brown, T.X., Lui, K.H., "Neural Network Design of a Banyan Network Controller," IEEE JSAC, v. 8, n. 8, pp. 1428-1438, Oct. 1990.

[3] Brown, T.X., Tong, H., Singh, S., "Optimizing admission control while ensuring quality of service in multimedia networks via reinforcement learning," in Advances in NIPS 11, ed. M.
Kearns et al., MIT Press, 1999.

[4] Brown, T.X., Gabow, H.N., "Future Information in Input Queueing," submitted to Computer Networks, April 2001.

[5] Gabor, Z., Kalmar, Z., Szepesvari, C., "Multi-criteria Reinforcement Learning," International Conference on Machine Learning, Madison, WI, July 1998.

[6] Hopcroft, J., Karp, R., "An n^{5/2} algorithm for maximum matchings in bipartite graphs," SIAM J. Computing, v. 2, n. 4, 1973, pp. 225-231.

[7] Marbach, P., Mihatsch, M., Tsitsiklis, J.N., "Call admission control and routing in integrated service networks using neuro-dynamic programming," IEEE J. Selected Areas in Comm., v. 18, n. 2, pp. 197-208, Feb. 2000.

[8] McKeown, N., Anantharam, V., Walrand, J., "Achieving 100% Throughput in an Input-Queued Switch," Proc. of IEEE INFOCOM '96, San Francisco, March 1996.

[9] Park, Y.-K., Lee, G., "NN Based ATM Scheduling with Queue Length Based Priority Scheme," IEEE J. Selected Areas in Comm., v. 15, n. 2, pp. 261-270, Feb. 1997.

[10] Pattavina, A., Switching Theory: Architecture and Performance in Broadband ATM Networks, John Wiley and Sons, New York, 1998.

[11] Singh, S.P., Bertsekas, D.P., "Reinforcement learning for dynamic channel allocation in cellular telephone systems," in Advances in NIPS 9, ed. Mozer, M., et al., MIT Press, 1997, pp. 974-980.

[12] Sutton, R.S., Barto, A.G., Reinforcement Learning: An Introduction, MIT Press, 1998.

[13] Tarjan, R.E., Data Structures and Network Algorithms, Soc. for Industrial and Applied Mathematics, Philadelphia, 1983.
", "award": [], "sourceid": 1955, "authors": [{"given_name": "Timothy", "family_name": "Brown", "institution": null}]}