{"title": "Exact Solutions to Time-Dependent MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1026, "page_last": 1032, "abstract": null, "full_text": "Exact Solutions to Time-Dependent MDPs \n\nJustin A. Boyan\u00b7 \n\nITA Software \nBuilding 400 \n\nOne Kendall Square \nCambridge, MA 02139 \njab@itasoftware.com \n\nMichael L. Littman \nAT&T Labs-Research \nand Duke University \n\n180 Park Ave. Room A275 \n\nFlorham Park, NJ 07932-0971 USA \n\nmlittman@research.att. com \n\nAbstract \n\nWe describe an extension of the Markov decision process model in \nwhich a continuous time dimension is included in the state space. \nThis allows for the representation and exact solution of a wide \nrange of problems in which transitions or rewards vary over time. \nWe examine problems based on route planning with public trans(cid:173)\nportation and telescope observation scheduling. \n\n1 \n\nIntroduction \n\nImagine trying to plan a route from home to work that minimizes expected time. \nOne approach is to use a tool such as \"Mapquest\", which annotates maps with \ninformation about estimated driving time, then applies a standard graph-search \nalgorithm to produce a shortest route. Even if driving times are stochastic, the an(cid:173)\nnotations can be expected times, so this presents no additional challenge. However, \nconsider what happens if we would like to include public transportation in our route \nplanning. Buses, trains, and subways vary in their expected travel time according to \nthe time of day: buses and subways come more frequently during rush hour; trains \nleave on or close to scheduled departure times. In fact, even highway driving times \nvary with time of day, with heavier traffic and longer travel times during rush hour. \n\nTo formalize this problem, we require a model that includes both stochastic actions, \nas in a Markov decision process (MDP), and actions with time-dependent stochastic \ndurations. 
There are a number of models that include some of these attributes: \n\n\u2022 Directed graphs with shortest path algorithms [2]: State transitions are deterministic; action durations are time independent (deterministic or stochastic). \n\u2022 Stochastic Time Dependent Networks (STDNs) [6]: State transitions are deterministic; action durations are stochastic and can be time dependent. \n\u2022 Markov decision processes (MDPs) [5]: State transitions are stochastic; action durations are deterministic. \n\u2022 Semi-Markov decision processes (SMDPs) [5]: State transitions are stochastic; action durations are stochastic, but not time dependent. \n\n*The work reported here was done while Boyan's affiliation was with NASA Ames Research Center, Computational Sciences Division. \n\nIn this paper, we introduce the Time-Dependent MDP (TMDP) model, which generalizes all these models by including both stochastic state transitions and stochastic, time-dependent action durations. At a high level, a TMDP is a special continuous-state MDP [5; 4] consisting of states with both a discrete component and a real-valued time component: (x, t) ∈ X × ℝ. \n\nWith absolute time as part of the state space, we can model a rich set of domain objectives, including minimizing expected time, maximizing the probability of making a deadline, or maximizing the dollar reward of a path subject to a time deadline. In fact, using the time dimension to represent other one-dimensional quantities, TMDPs support planning with non-linear utilities [3] (e.g., risk-aversion), or with a continuous resource such as battery life or money. \n\nWe define TMDPs and express their Bellman equations in a functional form that gives, at each state x, the one-step lookahead value at (x, t) for all times in parallel (Section 2). 
We use the term time-value function to denote a mapping from real-valued times to real-valued future reward. With appropriate restrictions on the form of the stochastic state-time transition function and reward function, we guarantee that the optimal time-value function at each state is a piecewise linear function of time, which can be represented exactly and computed by value iteration (Section 3). We conclude with empirical results on two domains (Section 4). \n\n2 General model \n\n[Figure 1 shows, for each outcome, its likelihood function L and its time distribution P (marked REL or ABS): μ1 \"Missed the 8am train\", μ2 \"Caught the 8am train\", μ3 \"Highway - rush hour\", μ4 \"Highway - off peak\", and μ5 \"Drive on backroad\", together with the time-value functions V2 and V3.] \n\nFigure 1: An illustrative route-planning example TMDP. \n\nFigure 1 depicts a small route-planning example that illustrates several distinguishing features of the TMDP model. The start state x1 corresponds to being at home. From here, two actions are available: a1, taking the 8am train (a scheduled action); and a2, driving to work via highway then backroads (may be done at any time). \n\nAction a1 has two possible outcomes, represented by μ1 and μ2. Outcome μ1 (\"Missed the 8am train\") is active after 7:50am, whereas outcome μ2 (\"Caught the train\") is active until 7:50am; this is governed by the likelihood functions L1 and L2 in the model. These outcomes cause deterministic transitions to states x1 and x3, respectively, but take varying amounts of time. Time distributions in a TMDP may be either \"relative\" (REL) or \"absolute\" (ABS). 
In the case of catching the train (μ2), the distribution is absolute: the arrival time (shown in P2) has mean 9:45am no matter what time before 7:50am the action was initiated. (Boarding the train earlier does not allow us to arrive at our destination earlier!) However, missing the train and returning to x1 has a relative distribution: it deterministically takes 15 minutes from our starting time (distribution P1) to return home. \n\nThe outcomes for driving (a2) are μ3 and μ4. Outcome μ3 (\"Highway - rush hour\") is active with probability 1 during the interval 8am-9am, and with smaller probability outside that interval, as shown by L3. Outcome μ4 (\"Highway - off peak\") is complementary. Duration distributions P3 and P4, both relative to the initiation time, show that driving times during rush hour are on average longer than those off peak. State x2 is reached in either case. \n\nFrom state x2, only one action is available, a3. The corresponding outcome μ5 (\"Drive on backroad\") is insensitive to time of day and results in a deterministic transition to state x3 with duration 1 hour. The reward function for arriving at work is +1 before 11am and falls linearly to zero between 11am and noon. \n\nThe solution to a TMDP such as this is a policy mapping state-time pairs (x, t) to actions so as to maximize expected future reward. As is standard in MDP methods, our approach finds this policy via the value function V*. We represent the value function of a TMDP as a set of time-value functions, one per state: Vi(t) gives the optimal expected future reward from state xi at time t. In our example of Figure 1, the time-value functions for x3 and x2 are shown as V3 and V2. Because of the deterministic one-hour delay of μ5, V2 is identical to V3 shifted back one hour. This wholesale shifting of time-value functions is exploited by our solution algorithm. 
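This shifting is easy to make concrete. The sketch below is our own illustrative code, not the authors' implementation; the breakpoint-list representation and all names are assumptions. It stores a time-value function as sorted (time, value) breakpoints and shifts it by a deterministic delay, recovering V2 from V3:

```python
# Sketch only: a piecewise linear time-value function stored as a sorted
# list of (time, value) breakpoints, linearly interpolated between them.
# Representation and names are assumptions, not the paper's actual code.

def evaluate(pwl, t):
    """Linearly interpolate the breakpoint list pwl = [(t0, v0), ...] at time t."""
    if t <= pwl[0][0]:
        return pwl[0][1]
    if t >= pwl[-1][0]:
        return pwl[-1][1]
    for (t0, v0), (t1, v1) in zip(pwl, pwl[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def shift(pwl, delay):
    """Value of being `delay` hours upstream of pwl: V'(t) = V(t + delay)."""
    return [(t - delay, v) for (t, v) in pwl]

# V3 from the example: reward +1 before 11am, falling linearly to 0 by noon.
v3 = [(11.0, 1.0), (12.0, 0.0)]
# Deterministic 1-hour backroad drive: V2 is V3 shifted back one hour.
v2 = shift(v3, 1.0)
```

Under this representation, the shift costs time linear in the number of breakpoints, which is why the algorithm can move whole time-value functions at once.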
\n\nThe TMDP model also allows a notion of \"dawdling\" in a state. This means the TMDP agent can remain in a state for as long as desired, at a reward rate of K(x, t) per unit time, before choosing an action. This makes it possible, for example, for an agent to wait at home for rush hour to end before driving to work. \n\nFormally, a TMDP consists of the following components: \n\nX: discrete state space \nA: discrete action space \nM: discrete set of outcomes, each of the form μ = (x'_μ, T_μ, P_μ), where \n  x'_μ ∈ X is the resulting state; \n  T_μ ∈ {ABS, REL} specifies the type of the resulting time distribution; \n  P_μ(t') (if T_μ = ABS) is a pdf over absolute arrival times of μ; \n  P_μ(δ) (if T_μ = REL) is a pdf over durations of μ. \nL: L(μ | x, t, a) is the likelihood of outcome μ given state x, time t, action a \nR: R(μ, t, δ) is the reward for outcome μ at time t with duration δ \nK: K(x, t) is the reward rate for \"dawdling\" in state x at time t. \n\nWe can define the optimal value function for a TMDP in terms of these quantities with the following Bellman equations: \n\nV(x, t) = sup_{t' ≥ t} [ ∫_t^{t'} K(x, s) ds + V̄(x, t') ]    value function (allowing dawdling) \nV̄(x, t) = max_{a ∈ A} Q(x, t, a)    value function (immediate action) \nQ(x, t, a) = Σ_{μ ∈ M} L(μ | x, t, a) · U(μ, t)    expected Q value over outcomes \nU(μ, t) = ∫_{-∞}^{∞} P_μ(t') [R(μ, t, t' - t) + V(x'_μ, t')] dt'    (if T_μ = ABS) \nU(μ, t) = ∫_{-∞}^{∞} P_μ(t' - t) [R(μ, t, t' - t) + V(x'_μ, t')] dt'    (if T_μ = REL) \n\nThese equations follow straightforwardly from viewing the TMDP as an undiscounted continuous-time MDP. Note that the calculations of U(μ, t) are convolutions of the result-time pdf P_μ with the lookahead value R + V. In the next section, we discuss a concrete way of representing and manipulating the continuous quantities that appear in these equations. 
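With the discrete duration pdfs required later for exact computation, the convolution defining U(μ, t) collapses to a finite sum. A minimal sketch for a REL outcome follows; all function and variable names are our own assumptions, not code from the paper:

```python
# Sketch of the one-step outcome value U(mu, t) for a REL outcome with a
# *discrete* duration pdf, and the Q backup over outcomes. Names assumed.

def outcome_value_rel(duration_pdf, reward, next_value, t):
    """U(mu, t) = sum_d P(d) * [ R(mu, t, d) + V(x'_mu, t + d) ].

    duration_pdf: dict mapping duration d -> probability P(d)
    reward(t, d): reward for starting at time t with duration d
    next_value(t_arrive): time-value function of the resulting state x'_mu
    """
    return sum(p * (reward(t, d) + next_value(t + d))
               for d, p in duration_pdf.items())

def q_value(outcomes, t):
    """Q(x, t, a) = sum_mu L(mu | x, t, a) * U(mu, t).

    outcomes: list of (likelihood_fn, outcome_value_fn) pairs for action a,
    each a callable of time t.
    """
    return sum(lik(t) * u(t) for lik, u in outcomes)

# Toy usage: a 50/50 duration of 1 or 2 hours, reward -1 per hour spent,
# and a linearly decreasing downstream value (all numbers invented).
next_value = lambda t: 10.0 - t
reward = lambda t, d: -d
u_at_3 = outcome_value_rel({1.0: 0.5, 2.0: 0.5}, reward, next_value, 3.0)
```

For an ABS outcome the sum would instead range over absolute arrival times t', weighting R(μ, t, t' - t) + V(x'_μ, t') by P_μ(t').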
\n\n3 Model with piecewise linear value functions \n\nIn the general model, the time-value functions for each state can be arbitrarily complex and therefore impossible to represent exactly. In this section, we show how to restrict the model to allow value functions to be manipulated exactly. \n\nFor each state, we represent its time-value function Vi(t) as a piecewise linear function of time. Vi(t) is thus represented by a data structure consisting of a set of distinct times called breakpoints and, for each pair of consecutive breakpoints, the equation of a line defined over the corresponding interval. \n\nWhy are piecewise linear functions an appropriate representation? Linear time-value functions provide an exact representation for minimum-time problems. Piecewise time-value functions provide closure under the \"max\" operator. \n\nRewards must be constrained to be piecewise linear functions of start and arrival times and action durations. We write R(μ, t, δ) = Rs(μ, t) + Ra(μ, t + δ) + Rd(μ, δ), where Rs, Ra, and Rd are piecewise linear functions of start time, arrival time, and duration, respectively. In addition, the dawdling reward K and the outcome probability function L must be piecewise constant. \n\nThe most significant restriction needed for exact computation is that arrival and duration pdfs be discrete. This ensures closure under convolutions. In contrast, convolving a piecewise constant pdf (e.g., a uniform distribution) with a piecewise linear time-value function would in general produce a piecewise quadratic time-value function; further convolutions increase the degree with each iteration of value iteration. In Section 5 below we discuss how to relax this restriction. 
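Closure under \"max\" can be sketched directly: the pointwise maximum of two piecewise linear functions is again piecewise linear, with breakpoints at the union of the inputs' breakpoints plus any crossing points. A hedged illustration, with the sorted breakpoint-list representation and all names being our assumptions:

```python
# Sketch of the "max" operator on piecewise linear functions represented
# as sorted (time, value) breakpoint lists over a shared domain.

def interp(pwl, t):
    """Evaluate breakpoint list pwl = [(t0, v0), ...] at t by linear interpolation."""
    for (t0, v0), (t1, v1) in zip(pwl, pwl[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return pwl[0][1] if t < pwl[0][0] else pwl[-1][1]

def pwl_max(f, g):
    """Pointwise max of two piecewise linear functions on a shared domain."""
    times = sorted({t for t, _ in f} | {t for t, _ in g})
    out = []
    for t0, t1 in zip(times, times[1:]):
        d0 = interp(f, t0) - interp(g, t0)
        d1 = interp(f, t1) - interp(g, t1)
        out.append(t0)
        # A sign change of f - g means the two lines cross inside (t0, t1):
        # add the crossing point as a new breakpoint of the maximum.
        if d0 * d1 < 0:
            out.append(t0 + (t1 - t0) * d0 / (d0 - d1))
    out.append(times[-1])
    return [(t, max(interp(f, t), interp(g, t))) for t in out]
```

The output has at most the combined number of breakpoints plus one crossing per segment pair, consistent with the paper's observation that representations grow only as complex as necessary.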
\n\nGiven the restrictions just mentioned, all the operations used in the Bellman equations from Section 2 (namely addition, multiplication, integration, supremum, maximization, and convolution) can be implemented exactly. The running time of each operation is linear in the representation size of the time-value functions involved. Seeding the process with an initial piecewise linear time-value function, we can carry out value iteration until convergence. In general, the running time from one iteration to the next can increase, as the number of linear \"pieces\" being manipulated grows; however, the representations grow only as complex as necessary to represent the value function V exactly. \n\n4 Experimental domains \n\nWe present results on two domains: transportation planning and telescope scheduling. For comparison, we also implemented the natural alternative to the piecewise-linear technique: discretizing the time dimension and solving the problem as a standard MDP. To apply the MDP method, three additional inputs must be specified: an earliest starting time, latest finishing time, and bin width. Since this paper's focus is on exact computations, we chose a discretization level corresponding to the resolution necessary for exact solution by the MDP at its grid points. An advantage of the MDP is that it is by construction acyclic, so it can be solved by just one sweep of standard value iteration, working backwards in time. The TMDP's advantage is that it directly manipulates entire linear segments of the time-value functions. \n\n4.1 Transportation planning \n\nFigure 2 illustrates an example TMDP for optimizing a commute from San Francisco to NASA Ames. 
The 14 discrete states model both location and observed traffic conditions: shaded and unshaded circles represent heavy and light traffic, respectively. Observed transition times and traffic conditions are stochastic, and depend on both the time and traffic conditions at the originating location. At states 5, 6, 11, and 12, the \"catch the train\" action induces an absolute arrival distribution reflecting the train schedules. \n\nFigure 2: The San Francisco to Ames commuting example. \n\nThe domain objective is to arrive at Ames by 9:00am. We impose a linear penalty for arriving between 9 and noon, and an infinite penalty for arriving after noon. There are also linear penalties on the number of minutes spent driving in light traffic, driving in heavy traffic, and riding on the train; the coefficients of these penalties can be adjusted to reflect the commuter's tastes. \n\n[Figure 3 plots, for state 10 (\"US101 & Bayshore / heavy traffic\"), the Q-functions of action 0 (\"drive to Ames\") and action 1 (\"drive to Bayshore station\") over times 6 to 12, and the optimal policy over time at that state.] \n\nFigure 3: The optimal Q-value functions and policy at state #10. \n\nFigure 3 presents the optimal time-value functions and policy for state #10, \"US101 & Bayshore / heavy traffic.\" There are two actions from this state, corresponding to driving directly to Ames and driving to the train station to wait for the next train. Driving to the train station is preferred (has higher Q-value) at times that are close (but not too close!) to the departure times of the train. 
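The policy panel of Figure 3 is just a time-indexed argmax over the actions' Q-value functions. A toy sketch of that extraction; the Q shapes below are invented stand-ins for illustration, not the fitted curves from the figure:

```python
# Sketch: extract the greedy policy at a state from per-action Q-value
# time functions. The Q shapes below are invented, not the paper's data.

def greedy_policy(q_funcs, t):
    """Return the action whose Q-value function is largest at time t.

    q_funcs: dict mapping action name -> callable Q(t).
    """
    return max(q_funcs, key=lambda a: q_funcs[a](t))

# Hypothetical Q functions for the two actions at state 10.
q = {
    "drive to Ames": lambda t: 1.0 - 0.1 * max(0.0, t - 7.0),
    "drive to Bayshore station": lambda t: 1.2 if 7.5 <= t <= 8.5 else 0.3,
}
```

With piecewise linear Q functions, the same argmax over each linear segment yields a policy that is piecewise constant in time, as the figure shows.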
\n\nThe full domain is solved in well under a second by both solvers (see Table 1). The optimal time-value functions in the solution comprise a total of 651 linear segments. \n\n4.2 Telescope observation scheduling \n\nNext, we consider the problem of scheduling astronomical targets for a telescope to maximize the scientific return of one night's viewing [1]. We are given N possible targets with associated coordinates, scientific value, and time window of visibility. Of course, we can view only one target at a time. We assume that the reward of an observation is proportional to the duration of viewing the target. Acquiring a target requires two steps of stochastic duration: moving the telescope, taking time roughly proportional to the distance traveled; and calibrating it on the new target. \n\nPrevious approaches have dealt with this stochasticity heuristically, using a just-in-case scheduling approach [1]. Here, we model the stochasticity directly within the TMDP framework. The TMDP has N + 1 states (corresponding to the N observations and \"off\") and N actions per state (corresponding to what to observe next). \n\nDomain          Solver          Model states   Value sweeps   V* pieces   Runtime (secs) \nSF-Commute      piecewise VI    14             13             651         0.2 \n                exact grid VI   5054           1              5054        0.1 \nTelescope-10    piecewise VI    11             5              186         0.1 \n                exact grid VI   14,311         1              14,311      1.3 \nTelescope-25    piecewise VI    26             6              716         1.8 \n                exact grid VI   33,826         1              33,826      7.4 \nTelescope-50    piecewise VI    51             6              1252        6.3 \n                exact grid VI   66,351         1              66,351      34.5 \nTelescope-100   piecewise VI    101            4              2711        17.9 \n                exact grid VI   131,300        1              131,300     154.1 \n\nTable 1: Summary of results. 
The three rightmost columns measure solution complexity in terms of the number of sweeps of value iteration before convergence; the number of distinct \"pieces\" or values in the optimal value function V*; and the running time. Running times are the median of five runs on an UltraSparc II (296MHz CPU, 256Mb RAM). \n\nThe dawdling reward rate K(x, t) encodes the scientific value of observing x at time t; that value is 0 at times when x is not visible. Relative duration distributions encode the inter-target distances and stochastic calibration times on each transition. \n\nWe generated random target lists of sizes N = 10, 25, 50, and 100. Visibility windows were constrained to be within a 13-hour night, specified with 0.01-hour precision. Thus, representing the exact solution with a grid required 1301 time bins per state. Table 1 shows comparative results of the piecewise-linear and grid-based solvers. \n\n5 Conclusions \n\nIn sum, we have presented a new stochastic model for time-dependent MDPs (TMDPs), discussed applications, and shown that dynamic programming with piecewise linear time-value functions can produce optimal policies efficiently. In initial comparisons with the alternative method of discretizing the time dimension, the TMDP approach was empirically faster, used significantly less memory, and solved the problem exactly over continuous t ∈ ℝ rather than just at grid points. \n\nIn our exact computation model, the requirement of discrete duration distributions seems particularly restrictive. We are currently investigating a way of using our exact algorithm to generate upper and lower bounds on the optimal solution for the case of arbitrary pdfs. This may allow the system to produce an optimal or provably near-optimal policy without having to identify all the twists and turns in the optimal time-value functions. 
Perhaps the most important advantage of the piecewise linear representation will turn out to be its amenability to bounding and approximation methods. We hope that such advances will enable the solution of city-sized route planning, more realistic telescope scheduling, and other practical time-dependent stochastic problems. \n\nAcknowledgments \n\nWe thank Leslie Kaelbling, Rich Washington and NSF CAREER grant IRI-9702576. \n\nReferences \n\n[1] John Bresina, Mark Drummond, and Keith Swanson. Managing action duration uncertainty with just-in-case scheduling. In Decision-Theoretic Planning: Papers from the 1994 Spring AAAI Symposium, pages 19-26, Stanford, CA, 1994. AAAI Press, Menlo Park, California. ISBN 0-929280-70-9. URL http://ic-www.arc.nasa.gov/ic/projects/xfr/jic/jic.html. \n\n[2] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, MA, 1990. \n\n[3] Sven Koenig and Reid G. Simmons. How to make reactive planners risk-sensitive. In Proceedings of the 2nd International Conference on Artificial Intelligence Planning Systems, pages 293-304, 1994. \n\n[4] Harold J. Kushner and Paul G. Dupuis. Numerical Methods for Stochastic Control Problems in Continuous Time. Springer-Verlag, New York, 1992. \n\n[5] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994. \n\n[6] Michael P. Wellman, Kenneth Larson, Matthew Ford, and Peter R. Wurman. Path planning under time-dependent uncertainty. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pages 532-539, 1995. \n", "award": [], "sourceid": 1811, "authors": [{"given_name": "Justin", "family_name": "Boyan", "institution": null}, {"given_name": "Michael", "family_name": "Littman", "institution": null}]}