{"title": "Low Power Wireless Communication via Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 893, "page_last": 899, "abstract": null, "full_text": "Low Power Wireless Communication via Reinforcement Learning\n\nTimothy X Brown\n\nElectrical and Computer Engineering\nUniversity of Colorado\nBoulder, CO 80309-0530\ntimxb@colorado.edu\n\nAbstract\n\nThis paper examines the application of reinforcement learning to a wireless communication problem. The problem requires that channel utility be maximized while simultaneously minimizing battery usage. We present a solution to this multi-criteria problem that is able to significantly reduce power consumption. The solution uses a variable discount factor to capture the effects of battery usage.\n\n1  Introduction\n\nReinforcement learning (RL) has been applied to resource allocation problems in telecommunications, e.g., channel allocation in wireless systems, network routing, and admission control in telecommunication networks [1, 2, 8, 10]. These studies have demonstrated that reinforcement learning can find good policies that significantly increase the application reward within the dynamics of the telecommunication problems. However, a key issue is how to treat the commonly occurring multiple reward and constraint criteria in a consistent way.\n\nThis paper focuses on power management for wireless packet communication channels. These channels are unlike wireline channels in that channel quality is poor and varies over time, and often one side of the wireless link is a battery-operated device such as a laptop computer. In this environment, power management decides when to transmit and receive so as to simultaneously maximize channel utility and battery life.
\n\nA number of power management strategies have been developed for different aspects of battery-operated computer systems such as the hard disk and CPU [4, 5]. Managing the channel is different in that some control actions, such as shutting off the wireless transmitter, make the state of the channel and the other side of the communication unobservable.\n\nIn this paper, we consider the problem of finding a power management policy that simultaneously maximizes the radio communication's earned revenue while minimizing battery usage. The problem is recast as a stochastic shortest path problem, which in turn is mapped to a discounted infinite-horizon problem with a variable discount factor. Results show significant reductions in power usage.\n\nFigure 1:  The five components of the radio communication system.\n\n2  Problem Description\n\nThe problem comprises five components, as shown in Figure 1: mobile application, mobile radio, wireless channel, base station radio, and base station application. The applications on each end generate packets that are sent via a radio across the channel to the radio and then the application on the other side. The application also defines the utility of a given end-to-end performance. The radios implement a simple acknowledgment/retransmit protocol for reliable transmission. The base station is fixed and has a reliable power supply and therefore is not power constrained. The mobile's power is limited by a battery, and it can choose to turn its radio off for periods of time to reduce power usage. Note that even with the radio off, the mobile system continues to draw power for other uses. The channel adds errors to the packets. The rate of errors depends on many factors such as the location of the mobile and base station, intervening distance, and levels of interference. 
The problem requires models for each of these components. To be concrete, the specific models used in this paper are described in the following sections. It should be emphasized that, in order to focus on the machine learning issues, simple models have been chosen. More sophisticated models can readily be included.\n\n2.1  The Channel\n\nThe channel carries fixed-size packets in synchronous time slots. All packet rates are normalized by the channel rate so that the channel carries one packet per unit time in each direction. The forward and reverse channels are orthogonal and do not interfere.\n\nWireless data channels typically have low error rates. Occasionally, due to interference or signal fading, the channel introduces many errors. This variation is possible even when the mobile and base station are stationary. The channel is modeled by a two-state Gilbert-Elliott model [3]. In this model, the channel is in either a \"good\" or a \"bad\" state with packet error probabilities p_g and p_b, where p_g < p_b. The channel is symmetric with the same loss rate in both directions. The channel stays in each state with a geometrically distributed holding time with mean holding times of h_g and h_b time slots.\n\n2.2  Mobile and Base Station Application\n\nThe traffic generated by the source follows a bursty ON/OFF model that alternates between generating no packets and generating packets at rate r_ON. The holding times are geometrically distributed with mean holding times h_ON and h_OFF. The traffic in each direction is independent and identically distributed.\n\n2.3  The Radios\n\nThe radios can transmit data from the application on the channel and simultaneously receive data from the other radio and pass it on to the application. The radios implement a simple packet protocol to ensure reliability. 
Packets from the sources are queued in the radio and sent one by one. Packets consist of a header and data. The header carries acknowledgements (ACKs) with the most recent packet received without error. The header contains a checksum so that errors in the payload can be detected. Errored packets cause the receiving radio to send a packet with a negative acknowledgment (NACK) to the other radio instructing it to retransmit the packet sequence starting from the errored packet. The NACK is sent immediately, even if no data is waiting and the radio must send an empty packet. Only unerrored packets are sent on to the application. The header is assumed to always be received without error\u00b9.\n\nParameter Name                 Symbol   Value\nChannel Error Rate, Good       p_g      0.01\nChannel Error Rate, Bad        p_b      0.20\nChannel Holding Time, Good     h_g      100\nChannel Holding Time, Bad      h_b      10\nSource On Rate                 r_ON     1.0\nSource Holding Time, On        h_ON     1\nSource Holding Time, Off       h_OFF    10\nPower, Radio Off               P_OFF    7 W\nPower, Radio On                P_ON     8.5 W\nPower, Radio Transmitting      P_TX     10 W\nReal Time Max Delay            d_max    3\nWeb Browsing Time Scale        d_0      3\n\nTable 1:  Application parameters.\n\nSince the mobile is constrained by power, the mobile is considered the master and the base station the slave. The base station is always on and ready to transmit or receive. The mobile can turn its radio off to conserve power. Every ON-OFF and OFF-ON transition generates a packet with a message in the header indicating the change of state to the base station. These message packets carry no data. The mobile expends power at three levels, P_OFF, P_ON, and P_TX, corresponding to the radio off, the receiver on but no packet transmitted, and the receiver on with a packet transmitted.
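
As an illustration of the two-state channel model of Section 2.1, the following is a minimal Python sketch, not the simulator used in the paper. It assumes that a geometric holding time with mean h is realized by leaving the state with probability 1/h in every slot; the function names, the seed, and the default arguments (taken from Table 1) are choices made for this sketch only.

```python
import random


def stationary_error_rate(p_g, p_b, h_g, h_b):
    # The long-run fraction of time spent in each state is proportional
    # to its mean holding time.
    frac_good = h_g / (h_g + h_b)
    return frac_good * p_g + (1.0 - frac_good) * p_b


def simulate_channel(n_slots, p_g=0.01, p_b=0.20, h_g=100, h_b=10, seed=0):
    # Return a list with one boolean per slot: True if the packet is errored.
    # Assumption of this sketch: the state is left with probability 1/h per
    # slot, which gives a geometric holding time with mean h.
    rng = random.Random(seed)
    good = True
    errors = []
    for _ in range(n_slots):
        p_err = p_g if good else p_b
        errors.append(rng.random() < p_err)
        leave = 1.0 / (h_g if good else h_b)
        if rng.random() < leave:
            good = not good
    return errors
```

With the Table 1 values the channel spends 100/110 of the time in the good state, so the long-run packet error rate is (100 x 0.01 + 10 x 0.20)/110 = 3/110, or roughly 2.7%.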
\n\n2.4  Reward Criteria\n\nReward is earned for packets passed in each direction. The amount depends on the application. In this paper we consider three types of applications: an e-mail application, a real-time application, and a web browsing application. In the e-mail application, a unit reward is given for every packet received by the application. In the real-time application, a unit reward is given for every packet received by the application with delay less than d_max; the reward is zero otherwise. In the web browsing application, time is important but not critical. The value of a packet with delay d is (1 - 1/d_0)^d, where d_0 is the desired time scale of the arrivals.\n\nThe specific parameters used in this experiment are given in Table 1. These were gathered as typical values from [7, 9]. It should be emphasized that this model is the simplest model that captures the essential characteristics of the problem. More realistic channels, protocols, applications, and rewards can readily be incorporated but are left out of this paper for clarity.\n\n\u00b9 A packet error rate of 20% implies a bit error rate of less than 1%. Error correcting codes in the header can easily reduce this error rate to a low value. The main intent is to simplify the protocol for this paper so that time-outs and other mechanisms do not need to be considered.\n\nComponent      States\nChannel        {good, bad}\nApplication    {ON, OFF}\nMobile         {ON, OFF}\nMobile         {List of waiting and unacknowledged packets and their current delay}\nBase Station   {List of waiting and unacknowledged packets and their current delay}\n\nTable 2:  Components to System State.\n\n3  Markov Decision Processes\n\nAt any given time slot, t, the system is in a particular configuration, x, defined by the state of each of the components in Table 2. 
The system state is s = (x, t), where we include the time in order to facilitate accounting for the battery. The mobile can choose to toggle its radio between the ON and OFF states, and rewards are generated by successfully received packets. The task of the learner is to determine a radio ON/OFF policy that maximizes the total reward for packets received before the batteries run out.\n\nThe battery life is not a fixed time. First, it depends on usage. Second, for a given drain, the capacity depends on how long the battery was charged, how long it has sat since being charged, the age of the battery, etc. In short, the battery runs out at a random time. The system can be modeled as a stochastic shortest path problem whereby there exists a terminal state, s_0, that corresponds to the battery being empty, in which no more reward is possible and the system remains permanently at no cost.\n\n3.1  Multi-criteria Objective\n\nFormally, the goal is to learn a policy for each possible system state so as to maximize\n\n$$J^{\\pi}(s) = E\\left\\{ \\sum_{t=0}^{T} c(t) \\,\\middle|\\, s, \\pi \\right\\},$$\n\nwhere E{\u00b7 | s, \u03c0} is the expectation over possible trajectories starting from state s using policy \u03c0, c(t) is the reward for packets received at time t, and T is the last time step before the batteries run out.\n\nTypically, T is very large and this inhibits fast learning. So, in order to promote faster learning, we convert this problem to a discounted problem that removes the variance caused by the random stopping times. At time t, given action a(t) while in state s(t), the terminal state is reached with probability p_{s(t)}(a(t)). Setting the value of the terminal state to 0, we can convert our criterion to maximizing\n\n$$J^{\\pi}(s) = E\\left\\{ \\sum_{t=0}^{\\infty} c(t) \\prod_{\\tau=0}^{t-1} \\left(1 - p_{s(\\tau)}(a(\\tau))\\right) \\,\\middle|\\, s, \\pi \\right\\},$$\n\nwhere the product is the probability of reaching time t. 
In words, future rewards are discounted by 1 - p_s(a), and the discounting is larger for actions that drain the batteries faster. Thus a more power-efficient strategy will have a discount factor closer to one, which correctly extends the effective horizon over which reward is captured.\n\n3.2  Q-learning\n\nRL methods solve MDP problems by learning good approximations to the optimal value function, J*, given by the solution to the Bellman optimality equation, which takes the following form:\n\n$$J^{*}(s) = \\max_{a \\in A(s)} E_{s'}\\left\\{ c(s, a, s') + (1 - p_s(a))\\, J^{*}(s') \\right\\} \\qquad (1)$$\n\nwhere A(s) is the set of actions available in the current state s, c(s, a, s') is the effective immediate payoff, and E_{s'}{\u00b7} is the expectation over possible next states s'.\n\nWe learn an approximation to J* using Watkins' Q-learning algorithm. Bellman's equation can be rewritten in Q-factor form as\n\n$$J^{*}(s) = \\max_{a \\in A(s)} Q^{*}(s, a) \\qquad (2)$$\n\nIn every time step the following decision is made. The Q-value of turning on in the next state is compared to the Q-value of turning off in the next state. If turning on has the higher value, the mobile turns on; otherwise, the mobile turns off.\n\nWhatever the decision, we update our value function as follows: on a transition from state s to s' on action a,\n\n$$Q(s, a) = (1 - \\gamma)\\, Q(s, a) + \\gamma \\left( c(s, a, s') + (1 - p_s(a)) \\max_{b \\in A(s')} Q(s', b) \\right) \\qquad (3)$$\n\nwhere \u03b3 is the learning rate. In order for Q-learning to perform well, all potentially important state-action pairs (s, a) must be explored. At each state, with probability 0.1 we apply a random action instead of the action recommended by the Q-values. However, we still use (3) to update Q-values using the action b recommended by the Q-values.
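
The update with a variable discount factor can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used in the paper: the dict-backed table, the function names, the zero default for unseen entries, and the learning-rate value are all assumptions, and the termination probability P/1000 is the choice reported in Section 4.

```python
import random

# Power levels in watts from Table 1; the dictionary keys are illustrative names.
POWER_W = {'off': 7.0, 'on': 8.5, 'tx': 10.0}


def termination_prob(power_w, scale=1000.0):
    # Battery termination probability p_s(a) = P / 1000 (Section 4).
    return power_w / scale


def q_update(q, s, a, reward, s_next, actions, p_term, lr=0.1):
    # One application of the Q-learning update with discount (1 - p_term).
    # Unseen table entries are treated as 0 (an assumption of this sketch),
    # and lr=0.1 is a placeholder, not a value taken from the paper.
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    target = reward + (1.0 - p_term) * best_next
    q[(s, a)] = (1.0 - lr) * q.get((s, a), 0.0) + lr * target
    return q[(s, a)]


def choose_action(q, s, actions, epsilon=0.1, rng=random):
    # With probability epsilon take a random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda b: q.get((s, b), 0.0))
```

Under the P/1000 choice, the per-slot discount factors for the three power levels of Table 1 are 0.993 (radio off), 0.9915 (radio on), and 0.99 (transmitting), so lower-power actions see a longer effective horizon.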
\n\n3.3  Structural Limits to the State Space\n\nFor theoretical reasons it is desirable to use a table lookup representation. In practice, since the mobile radio decides using only the information available to it, this is impossible for the following reasons. The state of the channel is never known directly. The receiver only observes errored packets. It is possible to infer the state, but only when packets are actually received, and channel state changes introduce inference errors.\n\nTraditional packet applications rarely communicate state information to the transport layer. This state information could also be inferred. But, given the quickly changing application dynamics, the application state is often ignored. For the particular parameters in Table 1 (i.e., r_ON = 1.0), the application is on if and only if it generates a packet, so its state is completely specified by the packet arrivals and does not need to be inferred.\n\nThe most serious obstacle to a complete state space representation is that when the mobile radio turns OFF, it has no knowledge of state changes in the base station. Even when it is ON, the protocol does not have provisions for directly transferring the state information. Again, this implies that state information must be inferred.\n\nOne approach to these structural limits is to use a POMDP approach [6], which we leave to future work. In this paper, we simply learn deterministic policies on features that estimate the state.\n\n3.4  Simplifying Assumptions\n\nBeyond the structural problems of the previous section, we must treat the usual problem that the state space is huge. For instance, assuming even moderate maximum queue sizes and maximum wait times yields 10^20 states. 
If one considers e-mail-like applications where wait times of minutes (thousands of time slots) with many packets waiting are possible, the state space exceeds 10^100 states. Thus we seek a representation to reduce the size and complexity of the state space. This reduction is taken in two parts. The first is a feature representation that is possible given the structural limits of the previous section; the second is a function approximation based on these feature vectors.\n\nComponent      Feature\nMobile Radio   is radio ON or OFF\nMobile Radio   number of packets waiting at the mobile\nMobile Radio   wait time of first packet waiting at the mobile\nChannel        number of errors received in last 4 time slots\nBase Radio     number of time slots since mobile was last ON\n\nTable 3:  Decision Features Measured by Mobile Radio\n\nThe features are listed in Table 3. These are chosen since they are measurable at the mobile radio. For function approximation, we use state aggregation since it provably converges.\n\n4  Simulation Results\n\nThis section describes simulation-based experiments on the mobile radio control problem. For this initial study, we simplified the problem by setting p_g = p_b = 0 (i.e., no channel errors).\n\nState aggregation was used with 4800 aggregate states. The battery termination probability, p_s(a), was simply P/1000, where P is the power appropriate for the state and action chosen from Table 1. This was chosen to give an expected battery life much longer than the time scale of the traffic and channel processes.\n\nThree policies were learned, one for each application reward criterion. The resulting policies are tested by simulating for 10^6 time slots.\n\nIn each test run, an upper and lower bound on the energy usage is computed. The upper bound is the case of the mobile radio always on\u00b2. 
The lower bound is a policy that ignores the reward criteria but still delivers all the packets. In this policy, the radio is off and packets are accumulated until the latter portion of the test run, when they are sent in one large group. Policies are compared using the normalized power savings. This is a measure of how close the policy is to the lower bound, with 0% and 100% being the upper and lower bound, respectively.\n\nThe results are given in Table 4. The table also lists the average reward per packet received by the application. For the e-mail application, which has no constraints on the packets, the average reward is identically one.\n\nApplication    Normalized Power Savings   Average Reward\nE-mail         81%                        1\nReal Time      49%                        1.00\nWeb Browsing   48%                        0.46\n\nTable 4:  Simulation Results.\n\n\u00b2 There exist policies that exceed this power, e.g., if they toggle ON and OFF often and generate many notification packets. But the always-on policy is the baseline that we are trying to improve upon.\n\n5  Conclusion\n\nThis paper showed that reinforcement learning was able to learn a policy that significantly reduced the power consumption of a mobile radio while maintaining a high application utility. It used a novel variable discount factor that captured the impact of different actions on battery life. This was able to gain 50% to 80% of the possible power savings.\n\nIn the application, the paper used a simple model of the radio, channel, battery, etc. It also used simple state aggregation and ignored the partially observable aspects of the problem. Future work will address more accurate models, function approximation, and POMDP approaches.\n\nAcknowledgment\n\nThis work was supported by CAREER Award: NCR-9624791 and NSF Grant NCR-9725778.
\n\nReferences\n\n[1] Boyan, J.A., Littman, M.L., \"Packet routing in dynamically changing networks: a reinforcement learning approach,\" in Cowan, J.D., et al., ed. Advances in NIPS 6, Morgan Kaufmann, SF, 1994, pp. 671-678.\n\n[2] Brown, T.X., Tong, H., Singh, S., \"Optimizing admission control while ensuring quality of service in multimedia networks via reinforcement learning,\" in Advances in Neural Information Processing Systems 12, ed. M. Kearns, et al., MIT Press, 1999, pp. 982-988.\n\n[3] Goldsmith, A.J., Varaiya, P.P., \"Capacity, mutual information, and coding for finite state Markov channels,\" IEEE T. on Info. Thy., v. 42, pp. 868-886, May 1996.\n\n[4] Govil, K., Chan, E., Wasserman, H., \"Comparing algorithms for dynamic speed-setting of a low-power CPU,\" Proceedings of the First ACM Int. Conf. on Mobile Computing and Networking (MOBICOM), 1995.\n\n[5] Helmbold, D., Long, D.D.E., Sherrod, B., \"A dynamic disk spin-down technique for mobile computing,\" Proceedings of the Second ACM Int. Conf. on Mobile Computing and Networking (MOBICOM), 1996.\n\n[6] Jaakkola, T., Singh, S., Jordan, M.I., \"Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems,\" in Advances in Neural Information Processing Systems 7, ed. G. Tesauro, et al., MIT Press, 1995, pp. 345-352.\n\n[7] Kravets, R., Krishnan, P., \"Application-Driven Power Management for Mobile Communication,\" Wireless Networks, 1999.\n\n[8] Marbach, P., Mihatsch, O., Schulte, M., Tsitsiklis, J.N., \"Reinforcement learning for call admission control and routing in integrated service networks,\" in Jordan, M., et al., ed. Advances in NIPS 10, MIT Press, 1998.\n\n[9] Rappaport, T.S., Wireless Communications: Principles and Practice, Prentice-Hall Pub., Englewood Cliffs, NJ, 1996.
\n\n[10] Singh, S.P., Bertsekas, D.P., \"Reinforcement learning for dynamic channel allocation in cellular telephone systems,\" in Advances in NIPS 9, ed. Mozer, M., et al., MIT Press, 1997, pp. 974-980.\n", "award": [], "sourceid": 1740, "authors": [{"given_name": "Timothy", "family_name": "Brown", "institution": null}]}