{"title": "Approximate Solutions to Optimal Stopping Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1082, "page_last": 1088, "abstract": null, "full_text": "Approximate Solutions to \nOptimal Stopping Problems \n\nJohn N. Tsitsiklis and Benjamin Van Roy \nLaboratory for Information and Decision Systems \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\ne-mail: jnt@mit.edu, bvr@mit.edu \n\nAbstract \n\nWe propose and analyze an algorithm that approximates solutions \nto the problem of optimal stopping in a discounted irreducible ape(cid:173)\nriodic Markov chain. The scheme involves the use of linear com(cid:173)\nbinations of fixed basis functions to approximate a Q-function. \nThe weights of the linear combination are incrementally updated \nthrough an iterative process similar to Q-Iearning, involving sim(cid:173)\nulation of the underlying Markov chain. Due to space limitations, \nwe only provide an overview of a proof of convergence (with prob(cid:173)\nability 1) and bounds on the approximation error. This is the first \ntheoretical result that establishes the soundness of a Q-Iearning(cid:173)\nlike algorithm when combined with arbitrary linear function ap(cid:173)\nproximators to solve a sequential decision problem. Though this \npaper focuses on the case of finite state spaces, the results extend \nnaturally to continuous and unbounded state spaces, which are ad(cid:173)\ndressed in a forthcoming full-length paper. \n\n1 \n\nINTRODUCTION \n\nProblems of sequential decision-making under uncertainty have been studied ex(cid:173)\ntensively using the methodology of dynamic programming [Bertsekas, 1995]. The \nhallmark of dynamic programming is the use of a value junction, which evaluates \nexpected future reward, as a function of the current state. Serving as a tool for \npredicting long-term consequences of available options, the value function can be \nused to generate optimal decisions. 
\n\nA number of algorithms for computing value functions can be found in the dynamic \nprogramming literature. These methods compute and store one value per state in a \nstate space. Due to the curse of dimensionality, however, states spaces are typically \n\n\fApproximate Solutions to Optimal Stopping Problems \n\n1083 \n\nintractable, and the practical applications of dynamic programming are severely \nlimited. \n\nThe use of function approximators to \"fit\" value functions has been a central theme \nin the field of reinforcement learning. The idea here is to choose a function ap(cid:173)\nproximator that has a tractable number of parameters, and to tune the parameters \nto approximate the value function. The resulting function can then be used to \napproximate optimal decisions. \n\nThere are two preconditions to the development an effective approximation. First, \nwe need to choose a function approximator that provides a \"good fit\" to the value \nfunction for some setting of parameter values. In this respect, the choice requires \npractical experience or theoretical analysis that provides some rough information on \nthe shape of the function to be approximated. Second, we need effective algorithms \nfor tuning the parameters of the function approximator. \n\nWatkins (1989) has proposed the Q-Iearning algorithm as a possibility. The original \nanalyses of Watkins (1989) and Watkins and Dayan (1992), the formal analysis \nof Tsitsiklis (1994), and the related work of Jaakkola, Jordan, and Singh (1994), \nestablish that the algorithm is sound when used in conjunction with exhaustive look(cid:173)\nup table representations (i.e., without function approximation). Jaakkola, Singh, \nand Jordan (1995), Tsitsiklis and Van Roy (1996a), and Gordon (1995), provide a \nfoundation for the use of a rather restrictive class of function approximators with \nvariants of Q-Iearning. 
Unfortunately, there is no prior theoretical support for the use of Q-learning-like algorithms when broader classes of function approximators are employed.

In this paper, we propose a variant of Q-learning for approximating solutions to optimal stopping problems, and we provide a convergence result that establishes its soundness. The algorithm approximates a Q-function using a linear combination of arbitrary fixed basis functions. The weights of these basis functions are iteratively updated during the simulation of a Markov chain. Our result serves as a starting point for the analysis of Q-learning-like methods when used in conjunction with classes of function approximators that are more general than piecewise constant. In addition, the algorithm we propose is significant in its own right. Optimal stopping problems appear in practical contexts such as financial decision making and sequential analysis in statistics. Like other problems of sequential decision making, optimal stopping problems suffer from the curse of dimensionality, and classical dynamic programming methods are of limited use. The method we propose presents a sound approach to addressing such problems.

2 OPTIMAL STOPPING PROBLEMS

We consider a discrete-time, infinite-horizon Markov chain with a finite state space S = \{1, \ldots, n\} and a transition probability matrix P. The Markov chain follows a trajectory x_0, x_1, x_2, \ldots, where the probability that the next state is y given that the current state is x is given by the (x, y)th element of P, and is denoted by p_{xy}. At each time t \in \{0, 1, 2, \ldots\} the trajectory can be stopped with a terminal reward of G(x_t). If the trajectory is not stopped, a reward of g(x_t) is obtained. The objective is to maximize the expected infinite-horizon discounted reward, given by

E\left[\sum_{t=0}^{\tau-1} \alpha^t g(x_t) + \alpha^\tau G(x_\tau)\right],

where \alpha \in (0, 1) is a discount factor and \tau is the time at which the process is stopped.
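As a concrete illustration of this objective, the discounted reward accrued under a fixed stopping rule can be estimated by simulating the chain. The sketch below is our own, not the paper's; it assumes the rule is given as a set of states in which to stop:

```python
import numpy as np

def simulate_reward(P, g, G, alpha, stop_set, x0, rng, max_steps=1000):
    """Accumulate sum_t alpha^t g(x_t) + alpha^tau G(x_tau) along one
    trajectory, stopping the first time the state enters stop_set."""
    x, total, discount = x0, 0.0, 1.0
    for _ in range(max_steps):
        if x in stop_set:
            return total + discount * G[x]
        total += discount * g[x]
        discount *= alpha
        x = rng.choice(len(P), p=P[x])  # sample the next state from row x of P
    return total  # horizon truncated; discounting makes the tail negligible
```

Averaging many such trajectories estimates the expected discounted reward of the given rule.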
The variable \tau is defined by a stopping policy, which is given by a sequence of mappings \mu_t : S^{t+1} \mapsto \{\text{stop}, \text{continue}\}. Each \mu_t determines whether or not to terminate, based on x_0, \ldots, x_t. If the decision is to terminate, then \tau = t.

We define the value function to be a mapping from states to the expected discounted future reward, given that an optimal policy is followed starting at a given state. In particular, the value function J^* : S \mapsto \Re is given by

J^*(x) = \max_{\{\mu_t\}} E\left[\sum_{t=0}^{\tau-1} \alpha^t g(x_t) + \alpha^\tau G(x_\tau) \,\Big|\, x_0 = x\right],

where \tau is the stopping time given by the policy \{\mu_t\}. It is well known that the value function is the unique solution to Bellman's equation:

J^*(x) = \max\left[G(x),\; g(x) + \alpha \sum_{y \in S} p_{xy} J^*(y)\right].

Furthermore, there is always an optimal policy that is stationary (i.e., of the form \{\mu_t = \mu^*, \forall t\}) and defined by

\mu^*(x) = \begin{cases} \text{stop}, & \text{if } G(x) \geq J^*(x), \\ \text{continue}, & \text{otherwise}. \end{cases}

Following Watkins (1989), we define the Q-function as the function Q^* : S \mapsto \Re given by

Q^*(x) = g(x) + \alpha \sum_{y \in S} p_{xy} J^*(y).

It is easy to show that the Q-function uniquely satisfies

Q^*(x) = g(x) + \alpha \sum_{y \in S} p_{xy} \max\left[G(y), Q^*(y)\right], \qquad \forall x \in S. \qquad (1)

Furthermore, an optimal policy can be defined by

\mu^*(x) = \begin{cases} \text{stop}, & \text{if } G(x) \geq Q^*(x), \\ \text{continue}, & \text{otherwise}. \end{cases}

3 APPROXIMATING THE Q-FUNCTION

Classical computational approaches to solving optimal stopping problems involve computing and storing a value function in tabular form. The most common way of doing this is through use of an iterative algorithm of the form

J_{k+1}(x) = \max\left[G(x),\; g(x) + \alpha \sum_{y \in S} p_{xy} J_k(y)\right].

When the state space is extremely large, as is typically the case, two difficulties arise. The first is that computing and storing one value per state becomes intractable, and the second is that computing the summation on the right-hand side becomes intractable.
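For a chain small enough to enumerate, the classical iteration just described can be carried out directly. A minimal sketch (function name ours):

```python
import numpy as np

def stopping_value_iteration(P, g, G, alpha, tol=1e-10, max_iters=10000):
    """Iterate J_{k+1}(x) = max(G(x), g(x) + alpha * sum_y p_xy J_k(y))
    until successive iterates agree to within tol."""
    J = np.zeros(len(g))
    for _ in range(max_iters):
        J_next = np.maximum(G, g + alpha * (P @ J))
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
    return J
```

The iteration is a contraction with modulus alpha, so it converges geometrically; the difficulty is precisely that the table J and the sum over successor states are intractable when S is large.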
We will present an algorithm, motivated by Watkins' Q-learning, that addresses both these issues, allowing for approximate solution of optimal stopping problems with large state spaces.

3.1 LINEAR FUNCTION APPROXIMATORS

We consider approximations of Q^* using a function of the form

\tilde{Q}(x, r) = \sum_{k=1}^{K} r(k) \phi_k(x).

Here, r = (r(1), \ldots, r(K)) is a parameter vector and each \phi_k is a fixed scalar function defined on the state space S. The functions \phi_k can be viewed as basis functions (or as vectors of dimension n), while each r(k) can be viewed as the associated weight. To approximate the Q-function, one usually tries to choose the parameter vector r so as to minimize some error metric between the functions \tilde{Q}(\cdot, r) and Q^*(\cdot).

It is convenient to define a vector-valued function \phi : S \mapsto \Re^K by letting \phi(x) = (\phi_1(x), \ldots, \phi_K(x)). With this notation, the approximation can also be written in the form \tilde{Q}(x, r) = (\Phi r)(x), where \Phi is viewed as an |S| \times K matrix whose xth row is equal to \phi'(x).

3.2 THE APPROXIMATION ALGORITHM

In the approximation scheme we propose, the Markov chain underlying the stopping problem is simulated to produce a single endless trajectory \{x_t \mid t = 0, 1, 2, \ldots\}. The algorithm is initialized with a parameter vector r_0, and after each time step, the parameter vector is updated according to

r_{t+1} = r_t + \gamma_t \phi(x_t) \left( g(x_t) + \alpha \max\left[\phi'(x_{t+1}) r_t,\, G(x_{t+1})\right] - \phi'(x_t) r_t \right),

where \gamma_t is a scalar stepsize.

3.3 CONVERGENCE THEOREM

Before stating the convergence theorem, we introduce some notation that will make the exposition more concise. Let \pi(1), \ldots, \pi(n) denote the steady-state probabilities for the Markov chain. We assume that \pi(x) > 0 for all x \in S. Let D be an n \times n diagonal matrix with diagonal entries \pi(1), \ldots, \pi(n).
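For reference, the steady-state probabilities of a small chain can be computed as the normalized left eigenvector of P associated with the eigenvalue 1. The helper below is our own sketch, not part of the paper's algorithm:

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi' P = pi' with sum(pi) = 1, via the eigenvector of the
    transpose of P associated with the eigenvalue closest to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()  # normalization also fixes the eigenvector's sign
```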
We define a weighted norm \|\cdot\|_D by

\|J\|_D = \left( \sum_{x \in S} \pi(x) J^2(x) \right)^{1/2}.

We define a "projection matrix" \Pi that induces a weighted projection onto the subspace X = \{\Phi r \mid r \in \Re^K\} with projection weights equal to the steady-state probabilities. In particular,

\Pi J = \arg\min_{\bar{J} \in X} \|J - \bar{J}\|_D.

It is easy to show that \Pi is given by \Pi = \Phi (\Phi' D \Phi)^{-1} \Phi' D.

We define an operator F : \Re^n \mapsto \Re^n by

F J = g + \alpha P \max\left[J, G\right],

where the max denotes a componentwise maximization.

We have the following theorem that ensures soundness of the algorithm:

Theorem 1 Let the following conditions hold:
(a) The Markov chain has a unique invariant distribution \pi that satisfies \pi' P = \pi', with \pi(x) > 0 for all x \in S.
(b) The matrix \Phi has full column rank; that is, the "basis functions" \{\phi_k \mid k = 1, \ldots, K\} are linearly independent.
(c) The step sizes \gamma_t are nonnegative, nonincreasing, and predetermined. Furthermore, they satisfy \sum_{t=0}^{\infty} \gamma_t = \infty and \sum_{t=0}^{\infty} \gamma_t^2 < \infty.
We then have:
(a) The algorithm converges with probability 1.
(b) The limit of convergence r^* is the unique solution of the equation

\Pi F(\Phi r^*) = \Phi r^*.

We focus on analyzing a deterministic algorithm of the form

\bar{r}_{t+1} = \bar{r}_t + \gamma_t s(\bar{r}_t).

The convergence of the stochastic algorithm we have proposed can be deduced from that of this deterministic algorithm through use of a theorem on stochastic approximation, contained in (Benveniste et al., 1990).

Note that the composition \Pi F(\cdot) is a contraction with respect to \|\cdot\|_D with contraction coefficient \alpha, since projection is nonexpansive and F is a contraction. It follows that \Pi F(\cdot) has a fixed point of the form \Phi r^*, and \bar{r}_t converges to r^*.

We can further establish the desired error bound:
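For problems small enough to enumerate, the limit point characterized by Theorem 1 can be computed by iterating the contraction composed of F and the projection directly in coefficient space, since the basis coefficients of the projection of a vector J are (Phi' D Phi)^{-1} Phi' D J. The sketch below is ours, not the paper's; with Phi equal to the identity it reproduces the exact Q-function of equation (1):

```python
import numpy as np

def projected_q_fixed_point(P, g, G, alpha, Phi, pi, iters=2000):
    """Iterate r <- (Phi' D Phi)^{-1} Phi' D F(Phi r), whose fixed point
    r* satisfies Phi r* = Pi F(Phi r*), with F J = g + alpha P max(J, G)."""
    D = np.diag(pi)
    # Coefficient-space form of the pi-weighted projection onto span(Phi)
    proj = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
    r = np.zeros(Phi.shape[1])
    for _ in range(iters):
        FJ = g + alpha * (P @ np.maximum(Phi @ r, G))
        r = proj @ FJ
    return r
```

Since the composed map is a contraction with coefficient alpha, the iteration converges geometrically from any initial r.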