{"title": "Q-learning with Nearest Neighbors", "book": "Advances in Neural Information Processing Systems", "page_first": 3111, "page_last": 3121, "abstract": "We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q function using nearest neighbor regression method. As the main contribution, we provide tight finite sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and the discounted factor $\\gamma \\in (0,1)$, given an arbitrary sample path with ``covering time'' $L$, we establish that the algorithm is guaranteed to output an $\\varepsilon$-accurate estimate of the optimal Q-function using $\\Ot(L/(\\varepsilon^3(1-\\gamma)^7))$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $\\Ot(1/\\varepsilon^d),$ so the sample complexity scales as $\\Ot(1/\\varepsilon^{d+3}).$ Indeed, we establish a lower bound that argues that the dependence of $ \\Omegat(1/\\varepsilon^{d+2})$ is necessary.", "full_text": "Q-learning with Nearest Neighbors\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nQiaomin Xie \u21e4\n\nqxie@mit.edu\n\nDevavrat Shah \u21e4\n\ndevavrat@mit.edu\n\nAbstract\n\nWe consider model-free reinforcement learning for in\ufb01nite-horizon discounted\nMarkov Decision Processes (MDPs) with a continuous state space and unknown\ntransition kernel, when only a single sample path under an arbitrary policy of\nthe system is available. 
We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q function using a nearest neighbor regression method. As the main contribution, we provide a tight finite sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and discount factor γ ∈ (0, 1), given an arbitrary sample path with "covering time" L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L/(ε^3 (1−γ)^7)) samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as Õ(1/ε^d), so the sample complexity scales as Õ(1/ε^{d+3}). Indeed, we establish a lower bound that argues that the dependence of Ω̃(1/ε^{d+2}) is necessary.

1 Introduction

Markov Decision Processes (MDPs) are natural models for a wide variety of sequential decision-making problems. It is well-known that the optimal control problem in MDPs can be solved, in principle, by standard algorithms such as value and policy iterations. These algorithms, however, are often not directly applicable to many practical MDP problems for several reasons. First, they do not scale computationally, as their complexity grows quickly with the size of the state space, especially for continuous state spaces. Second, in problems with complicated dynamics, the transition kernel of the underlying MDP is often unknown, or an accurate model thereof is lacking. 
To circumvent these\ndif\ufb01culties, many model-free Reinforcement Learning (RL) algorithms have been proposed, in which\none estimates the relevant quantities of the MDPs (e.g., the value functions or the optimal policies)\nfrom observed data generated by simulating the MDP.\nA popular model-free Reinforcement Learning (RL) algorithm is the so called Q-learning [47], which\ndirectly learns the optimal action-value function (or Q function) from the observations of the system\ntrajectories. A major advantage of Q-learning is that it can be implemented in an online, incremental\nfashion, in the sense that Q-learning can be run as data is being sequentially collected from the system\noperated/simulated under some policy, and continuously re\ufb01nes its estimates as new observations\nbecome available. The behaviors of standard Q-learning in \ufb01nite state-action problems have by now\nbeen reasonably understood; in particular, both asymptotic and \ufb01nite-sample convergence guarantees\nhave been established [43, 22, 41, 18].\nIn this paper, we consider the general setting with continuous state spaces. For such problems,\nexisting algorithms typically make use of a parametric function approximation method, such as a\nlinear approximation [27], to learn a compact representation of the action-value function. In many of\n\n\u21e4Both authors are af\ufb01liated with Laboratory for Information and Decision Systems (LIDS). DS is with the\n\nDepartment of EECS as well as Statistics and Data Science Center at MIT.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fthe recently popularized applications of Q-learning, much more expressive function approximation\nmethod such as deep neural networks have been utilized. Such approaches have enjoyed recent\nempirical success in game playing and robotics problems [38, 29, 14]. 
Parametric approaches typically\nrequire careful selection of approximation method and parametrization (e.g., the architecture of neural\nnetworks). Further, rigorous convergence guarantees of Q-learning with deep neural networks are\nrelatively less understood. In comparison, non-parametric approaches are, by design, more \ufb02exible\nand versatile. However, in the context of model-free RL with continuous state spaces, the convergence\nbehaviors and \ufb01nite-sample analysis of non-parametric approaches are less understood.\nSummary of results.\nIn this work, we consider a natural combination of the Q-learning with\nKernel-based nearest neighbor regression for continuous state-space MDP problems, denoted as\nNearest-Neighbor based Q-Learning (NNQL). As the main result, we provide \ufb01nite sample analysis\nof NNQL for a single, arbitrary sequence of data for any in\ufb01nite-horizon discounted-reward MDPs\nwith continuous state space. In particular, we show that the algorithm outputs an \"-accurate (with\nrespect to supremum norm) estimate of the optimal Q-function with high probability using a number\nof observations that depends polynomially on \", the model parameters and the \u201ccover time\u201d of the\nsequence of the data or trajectory of the data utilized. For example, if the data was sampled per a\ncompletely random policy, then our generic bound suggests that the number of samples would scale\n\nbound stating that for any policy to learn optimal Q function within \" approximation, the number of\n\nas eO(1/\"d+3) where d is the dimension of the state space. We establish effectively matching lower\nsamples required must scale ase\u2326(1/\"d+2). In that sense, our policy is nearly optimal.\n\nOur analysis consists of viewing our algorithm as a special case of a general biased stochastic\napproximation procedure, for which we establish non-asymptotic convergence guarantees. 
Key to our\nanalysis is a careful characterization of the bias effect induced by nearest-neighbor approximation\nof the population Bellman operator, as well as the statistical estimation error due to the variance of\n\ufb01nite, dependent samples. Speci\ufb01cally, the resulting Bellman nearest neighbor operator allows us\nto connect the update rule of NNQL to a class of stochastic approximation algorithms, which have\nbiased noisy updates. Note that traditional results from stochastic approximation rely on unbiased\nupdates and asymptotic analysis [35, 43]. A key step in our analysis involves decomposing the update\ninto two sub-updates, which bears some similarity to the technique used by [22]. Our results make\nimprovement in characterizing the \ufb01nite-sample convergence rates of the two sub-updates.\nIn summary, the salient features of our work are\n\u2022 Unknown system dynamics: We assume that the transition kernel and reward function of the\nMDP is unknown. Consequently, we cannot exactly evaluate the expectation required in standard\ndynamic programming algorithms (e.g., value/policy iteration). Instead, we consider a sample-\nbased approach which learns the optimal value functions/policies by directly observing data\ngenerated by the MDP.\n\n\u2022 Single sample path: We are given a single, sequential samples obtained from the MDP operated\nunder an arbitrary policy. This in particular means that the observations used for learning are\ndependent. Existing work often studies the easier settings where samples can be generated at will;\nthat is, one can sample any number of (independent) transitions from any given state, or reset\nthe system to any initial state. For example, Parallel Sampling in [23]. We do not assume such\ncapabilities, but instead deal with the realistic, challenging setting with a single path.\n\n\u2022 Online computation: We assume that data arrives sequentially rather than all at once. 
Estimates\nare updated in an online fashion upon observing each new sample. Moreover, as in standard\nQ-learning, our approach does not store old data. In particular, our approach differs from other\nbatch methods, which need to wait for all data to be received before starting computation, and\nrequire multiple passes over the data. Therefore, our approach is space ef\ufb01cient, and hence can\nhandle the data-rich scenario with a large, increasing number of samples.\n\n\u2022 Non-asymptotic, near optimal guarantees: We characterize the \ufb01nite-sample convergence rate\nof our algorithm; that is, how many samples are needed to achieve a given accuracy for estimating\nthe optimal value function. Our analysis is nearly tight in that we establish a lower bound that\nnearly matches our generic upper bound specialized to setting when data is generated per random\npolicy or more generally any policy with random exploration component to it.\n\nWhile there is a large and growing literature on Reinforcement Learning for MDPs, to the best of our\nknowledge, ours is the \ufb01rst result on Q-learning that simultaneously has all of the above four features.\n\n2\n\n\fTable 1: Summary of relevant work. 
See Appendix A for details.

    Specific work      Method                       Continuous    Unknown            Single        Online    Non-asymptotic
                                                    state space   transition kernel  sample path   update    guarantees
    [10], [36], [37]   Finite-state approximation   Yes           No                 No            Yes       Yes
    [43], [22], [41]   Q-learning                   No            Yes                Yes           Yes       No
    [20], [3], [18]    Q-learning                   No            Yes                Yes           Yes       Yes
    [23]               Q-learning                   No            Yes                No            Yes       Yes
    [42], [28]         Q-learning                   Yes           Yes                Yes           Yes       No
    [33], [32]         Kernel-based approximation   Yes           Yes                No            No        No
    [19]               Value/Policy iteration       No            Yes                No            No        Yes
    [44]               Parameterized TD-learning    No            Yes                Yes           Yes       No
    [12]               Parameterized TD-learning    No            Yes                No            Yes       Yes
    [8]                Parameterized TD-learning    No            Yes                Yes           Yes       Yes
    [9]                Non-parametric LP            No            Yes                No            No        Yes
    [30]               Fitted value iteration       Yes           Yes                No            No        Yes
    [1]                Fitted policy iteration      Yes           Yes                Yes           No        Yes
    Our work           Q-learning                   Yes           Yes                Yes           Yes       Yes

We summarize the comparison with relevant prior works in Table 1. Detailed discussion can be found in Appendix A.

2 Setup

In this section, we introduce the necessary notation and definitions for the framework of Markov Decision Processes that will be used throughout the paper. We also precisely define the question of interest.

Notation. For a metric space E endowed with metric ρ, we denote by C(E) the set of all bounded and measurable functions on E. For each f ∈ C(E), let ‖f‖∞ := sup_{x∈E} |f(x)| be the supremum norm, which turns C(E) into a Banach space B. Let Lip(E, M) denote the set of Lipschitz continuous functions on E with Lipschitz bound M, i.e.,

    Lip(E, M) = { f ∈ C(E) : |f(x) − f(y)| ≤ M ρ(x, y), ∀ x, y ∈ E }.

The indicator function is denoted by 1{·}. For each integer k ≥ 0, let [k] := {1, 2, . . . , k}.

Markov Decision Process. We consider a general setting where an agent interacts with a stochastic environment. 
This interaction is modeled as a discrete-time discounted Markov decision process\n(MDP). An MDP is described by a \ufb01ve-tuple (X ,A, p, r, ), where X and A are the state space and\naction space, respectively. We shall utilize t 2 N to denote time. Let xt 2X be state at time t. At\ntime t, the action chosen is denoted as at 2A . Then the state evolution is Markovian as per some\ntransition probability kernel with density p (with respect to the Lebesgue measure on X ). That is,\n(1)\n\nPr(xt+1 2 B|xt = x, at = a) =ZB\n\np(y|x, a)(dy)\n\nfor any measurable set B 2X . The one-stage reward earned at time t is a random variable Rt\nwith expectation E[Rt|xt = x, at = a] = r(x, a), where r : X\u21e5A! R is the expected reward\nfunction. Finally, 2 (0, 1) is the discount factor and the overall reward of interest isP1t=0 tRt\nThe goal is to maximize the expected value of this reward. Here we consider a distance function\n\u21e2 : X\u21e5X! R+ so that (X ,\u21e2 ) forms a metric space. For the ease of exposition, we use Z for the\njoint state-action space X\u21e5A .\nWe start with the following standard assumptions on the MDP:\nAssumption 1 (MDP Regularity). We assume that: (A1.) The continuous state space X is a compact\nsubset of Rd; (A2.) A is a \ufb01nite set of cardinality |A|; (A3.) The one-stage reward Rt is non-\nnegative and uniformly bounded by Rmax, i.e., 0 \uf8ff Rt \uf8ff Rmax almost surely. For each a 2A ,\nr(\u00b7, a) 2 Lip(X , Mr) for some Mr > 0. (A4.) The transition probability kernel p satis\ufb01es\nwhere the function Wp(\u00b7) satis\ufb01esRX\n\n|p(y|x, a) p(y|x0, a)|\uf8ff Wp(y)\u21e2 (x, x0) ,\nWp(y)(dy) \uf8ff Mp.\n\n8a 2A ,8x, x0, y 2X ,\n\n3\n\n\fThe \ufb01rst two assumptions state that the state space is compact and the action space is \ufb01nite. The third\nand forth stipulate that the reward and transition kernel are Lipschitz continuous (as a function of the\ncurrent state). 
Our Lipschitz assumptions are identical to (or less restricted than) those used in the\nwork of [36], [11], and [17]. In general, this type of Lipschitz continuity assumptions are standard in\nthe literature on MDPs with continuous state spaces; see, e.g., the work of [15, 16], and [6].\nA Markov policy \u21e1(\u00b7|x) gives the probability of performing action a 2A given the current state\nx. A deterministic policy assigns each state a unique action. The value function for each state x\nunder policy \u21e1, denoted by V \u21e1(x), is de\ufb01ned as the expected discounted sum of rewards received\nfollowing the policy \u21e1 from initial state x, i.e., V \u21e1(x) = E\u21e1 [P1t=0 tRt|x0 = x]. The action-value\nfunction Q\u21e1 under policy \u21e1 is de\ufb01ned by Q\u21e1(x, a) = r(x, a) + Ry p(y|x, a)V \u21e1(y)(dy). The\n\nnumber Q\u21e1(x, a) is called the Q-value of the pair (x, a), which is the return of initially performing\naction a at state s and then following policy \u21e1. De\ufb01ne\n\n , 1/(1 )\n\nand\n\nVmax , Rmax.\n\nSince all the rewards are bounded by Rmax, it is easy to see that the value function of every policy\nis bounded by Vmax [18, 40]. The goal is to \ufb01nd an optimal policy \u21e1\u21e4 that maximizes the value\nfrom any start state. The optimal value function V \u21e4is de\ufb01ned as V \u21e4(x) = V \u21e1\u21e4(x) = sup\u21e1 V \u21e1(x),\n8x 2X . The optimal action-value function is de\ufb01ned as Q\u21e4(x, a) = Q\u21e1\u21e4(x, a) = sup\u21e1 Q\u21e1(x, a).\nThe Bellman optimality operator F is de\ufb01ned as\n(F Q)(x, a) = r(x, a) + E\uf8ffmax\nIt is well known that F is a contraction with factor on the Banach space C(Z) [7, Chap. 1]. The\noptimal action-value function Q\u21e4 is the unique solution of the Bellman\u2019s equation Q = F Q in\nC(X\u21e5A ). In fact, under our setting, it can be show that Q\u21e4 is bounded and Lipschitz. This is stated\nbelow and established in Appendix B.\nLemma 1. 
Under Assumption 1, the function Q\u21e4 satis\ufb01es that kQ\u21e4k1 \uf8ff Vmax and that Q\u21e4(\u00b7, a) 2\nLip(X , Mr + VmaxMp) for each a 2A .\n3 Reinforcement Learning Using Nearest Neighbors\n\nQ(x0, b) | x, a = r(x, a) + ZX\n\np(y|x, a) max\nb2A\n\nQ(y, b)(dy).\n\nb2A\n\nIn this section, we present the nearest-neighbor-based reinforcement learning algorithm. The al-\ngorithm is based on constructing a \ufb01nite-state discretization of the original MDP, and combining\nQ-learning with nearest neighbor regression to estimate the Q-values over the discretized state space,\nwhich is then interpolated and extended to the original continuous state space. In what follows, we\nshall \ufb01rst describe several building blocks for the algorithm in Sections 3.1\u20133.4, and then summarize\nthe algorithm in Section 3.5.\n3.1 State Space Discretization\nLet h > 0 be a pre-speci\ufb01ed scalar parameter. Since the state space X is compact, one can \ufb01nd a\n\ufb01nite set Xh , {ci}Nh\n\ni=1 of points in X such that\n\nmin\ni2[Nh]\n\n\u21e2(x, ci) < h, 8x 2X .\n\nThe \ufb01nite grid Xh is called an h-net of X , and its cardinality n \u2318 Nh can be chosen to be the\nh-covering number of the metric space (X ,\u21e2 ). De\ufb01ne Zh = Xh \u21e5A . Throughout this paper, we\ndenote by Bi the ball centered at ci with radius h; that is, Bi , {x 2X : \u21e2 (x, ci) \uf8ff h} .\n3.2 Nearest Neighbor Regression\nSuppose that we are given estimated Q-values for the \ufb01nite subset of states Xh = {ci}n\ni=1, denoted\nby q = {q(ci, a), ci 2X h, a 2A} . For each state-action pair (x, a) 2X\u21e5A , we can predict its\nQ-value via a regression method. We focus on nonparametric regression operators that can be written\nas nearest neighbors averaging in terms of the data q of the form\n\n(2)\n8x 2X , a 2A ,\nwhere K(x, ci) 0 is a weighting kernel function satisfyingPn\ni=1 K(x, ci) = 1,8x 2X . 
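As a concrete illustration, the averaging in (2) with a fixed-radius choice of K can be sketched as follows. This is a hypothetical minimal sketch: the one-dimensional state space, the specific `centers`, and the uniform within-radius weighting are our illustrative choices, not the paper's only options (Section C lists the representative kernels).

```python
import numpy as np

def nn_operator(q, centers, x, a, h):
    """Nearest-neighbor average (2): weighted average of q(c_i, a) over
    centers c_i within distance h of x, with uniform weights K(x, c_i)."""
    d = np.abs(centers - x)          # distances rho(x, c_i) on the real line
    mask = d < h                     # K(x, c_i) = 0 if rho(x, c_i) >= h
    w = mask / mask.sum()            # normalize so the weights sum to one
    return float(w @ q[:, a])        # (NN q)(x, a)

# h-net of [0, 1] with h = 0.1: centers 0.05, 0.15, ..., 0.95
centers = np.arange(0.05, 1.0, 0.1)
q = np.tile(np.linspace(0.0, 1.0, len(centers))[:, None], (1, 2))  # |A| = 2
print(nn_operator(q, centers, x=0.5, a=0, h=0.1))
```

Shifting all entries of q by a constant shifts the output by the same constant, which is one easy way to see the non-expansiveness property (3) in action.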
Equa-\ntion (2) de\ufb01nes the so-called Nearest Neighbor (NN) operator NN, which maps the space C(Xh \u21e5A )\n\ni=1K(x, ci)q(ci, a),\n\n(NNq)(x, a) =Pn\n\n4\n\n\finto the set of all bounded function over X\u21e5A . Intuitively, in (2) one assesses the Q-value of (x, a)\nby looking at the training data where the action a has been applied, and by averaging their values. It\ncan be easily checked that the operator NN is non-expansive in the following sense:\n\nkNNq NNq0k1 \uf8ff kq q0k1 ,\n\n8q, q0 2 C(Xh \u21e5A ).\n\nThis property will be crucially used for establishing our results. K is assumed to satisfy\n\n(3)\n\n8x 2X , y 2X h,\n\nK(x, y) = 0 if \u21e2(x, y) h,\n\n(4)\nwhere h is the discretization parameter de\ufb01ned in Section 3.1.2 This means that the values of states\nlocated in the neighborhood of x are more in\ufb02uential in the averaging procedure (2). There are many\npossible choices for K. In Section C we describe three representative choices that correspond to\nk-Nearest Neighbor Regression, Fixed-Radius Near Neighbor Regression and Kernel Regression.\n3.3 A Joint Bellman-NN Operator\nNow, we de\ufb01ne the joint Bellman-NN (Nearest Neighbor) operator. As will become clear subse-\nquently, it is this operator that the algorithm aims to approximate, and hence it plays a crucial role in\nthe subsequent analysis.\nFor a function q : Zh ! 
R, we denote by \u02dcQ , (NNq) the nearest-neighbor average extension of q\nto Z; that is,\nThe joint Bellman-NN operator G on R|Zh| is de\ufb01ned by composing the original Bellman operator F\nwith the NN operator NN and then restricting to Zh; that is, for each (ci, a) 2Z h,\n(Gq)(ci, a) , (F NNq)(ci, a) = (F \u02dcQ)(ci, a) = r(ci, a) + E\uf8ffmax\n\n(NNq)(x0, b) | ci, a .\n\n\u02dcQ(x, a) = ( NNq)(x, a),\n\n8(x, a) 2Z .\n\nb2A\n\n(5)\n\nIt can be shown that G is a contraction operator with modulus mapping R|Zh| to itself, thus\nadmitting a unique \ufb01xed point, denoted by q\u21e4h; see Appendix E.2.\n3.4 Covering Time of Discretized MDP\nAs detailed in Section 3.5 to follow, our algorithm uses data generated by an abritrary policy \u21e1 for\nthe purpose of learning. The goal of our approach is to estimate the Q-values of every state. For there\nto be any hope to learn something about the value of a given state, this state (or its neighbors) must\nbe visited at least once. Therefore, to study the convergence rate of the algorithm, we need a way to\nquantify how often \u21e1 samples from different regions of the state-action space Z = X\u21e5A .\nFollowing the approach taken by [18] and [3], we introduce the notion of the covering time of MDP\nunder a policy \u21e1. This notion is particularly suitable for our setting as our algorithm is based on\nasynchronous Q-learning (that is, we are given a single, sequential trajectory of the MDP, where at\neach time step one state-action pair is observed and updated), and the policy \u21e1 may be non-stationary.\nIn our continuous state space setting, the covering time is de\ufb01ned with respect to the discretized space\nZh, as follows:\nDe\ufb01nition 1 (Covering time of discretized MDP). For each 1 \uf8ff i \uf8ff n = Nh and a 2A , a ball-\naction pair (Bi, a) is said to be visited at time t if xt 2B i and at = a. 
The discretized state-action space Z_h is covered by the policy π if all the ball-action pairs are visited at least once under the policy π. Define τ_{π,h}(x, t), the covering time of the MDP under the policy π, as the minimum number of steps required to visit all ball-action pairs starting from state x ∈ X at time-step t ≥ 0. Formally, τ_{π,h}(x, t) is defined as

    min{ s ≥ 0 : x_t = x; ∀ i ≤ N_h, a ∈ A, ∃ t_{i,a} ∈ [t, t+s] such that x_{t_{i,a}} ∈ B_i and a_{t_{i,a}} = a, under π },

with the convention that the minimum over the empty set is ∞.

We shall assume that there exists a policy π with bounded expected covering time, which guarantees that, asymptotically, all the ball-action pairs are visited infinitely many times under the policy π.

²This assumption is not absolutely necessary, but is imposed to simplify the subsequent analysis. In general, our results hold as long as K(x, y) decays sufficiently fast with the distance ρ(x, y).

Assumption 2. There exists an integer L_h < ∞ such that E[τ_{π,h}(x, t)] ≤ L_h, ∀ x ∈ X, t > 0. Here the expectation is taken with respect to the randomness introduced by the Markov kernel of the MDP as well as the policy π.

In general, the covering time can be large in the worst case. In fact, even with a finite state space, it is easy to find examples where the covering time is exponential in the number of states for every policy. For instance, consider an MDP with states 1, 2, . . . , N, where at any state i, the chain is reset to state 1 with probability 1/2 regardless of the action taken. Then, every policy takes exponential time to reach state N starting from state 1, leading to an exponential covering time.

To avoid such bad cases, some additional assumptions are needed to ensure that the MDP is well-behaved. For such MDPs, there are a variety of policies that have a small covering time. 
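The exponential blow-up in the reset example can be made quantitative. Assuming, for concreteness, that with the remaining probability 1/2 the chain advances from state i to i+1 (the paper does not pin down this detail), the hitting-time recursion T_i = 1 + (T_{i+1} + T_1)/2 with T_N = 0 gives an expected time of exactly 2^N − 2 to reach state N from state 1. A minimal sketch that solves this recursion exactly:

```python
from fractions import Fraction

def expected_hitting_time(N):
    """Expected steps to first reach state N from state 1 when, from any
    state i < N, the chain moves to i+1 w.p. 1/2 and resets to 1 w.p. 1/2.
    Writing T_i = a_i + b_i * T_1 with T_N = 0 and back-substituting the
    recursion T_i = 1 + (T_{i+1} + T_1) / 2 leaves one equation for T_1."""
    a, b = Fraction(0), Fraction(0)          # coefficients of T_N
    for _ in range(N - 1, 0, -1):            # states N-1 down to 1
        a, b = 1 + a / 2, Fraction(1, 2) + b / 2
    return a / (1 - b)                       # solve T_1 = a + b * T_1

for N in (2, 3, 10):
    print(N, expected_hitting_time(N))       # grows like 2^N
```

Since the covering time is at least this hitting time, every policy has covering time exponential in N here, regardless of how actions are chosen.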
Below we focus on a class of MDPs satisfying a form of uniform ergodicity assumption, and show that the standard ε-greedy policy (which includes the purely random policy as a special case by setting ε = 1) has a small covering time. This is done in the following two propositions. Proofs can be found in Appendix D.

Proposition 1. Suppose that the MDP satisfies the following: there exists a probability measure ν on X, a number φ > 0 and an integer m ≥ 1 such that for all x ∈ X, all t ≥ 0 and all policies µ,

    Pr_µ(x_{m+t} ∈ · | x_t = x) ≥ φ ν(·).    (6)

Let ν_min := min_{i∈[n]} ν(B_i), where we recall that n ≡ N_h = |X_h| is the cardinality of the discretized state space. Then the expected covering time of ε-greedy is upper bounded by

    L_h = O( (m|A| / (ε φ ν_min)) log(n|A|) ).

Proposition 2. Suppose that the MDP satisfies the following: there exists a probability measure ν on X, a number φ > 0 and an integer m ≥ 1 such that for all x ∈ X and all t ≥ 0, there exists a sequence of actions â(x) = (â_1, . . . , â_m) ∈ A^m such that

    Pr(x_{m+t} ∈ · | x_t = x, a_t = â_1, . . . , a_{t+m−1} = â_m) ≥ φ ν(·).    (7)

Let ν_min := min_{i∈[n]} ν(B_i), where we recall that n ≡ N_h = |X_h| is the cardinality of the discretized state space. Then the expected covering time of ε-greedy is upper bounded by

    L_h = O( (m|A|^{m+1} / (ε^{m+1} φ ν_min)) log(n|A|) ).

3.5 Q-learning using Nearest Neighbor
We describe the nearest-neighbor Q-learning (NNQL) policy. Like Q-learning, it is a model-free policy for solving MDPs. Unlike standard Q-learning, it is (relatively) efficient to implement, as it does not require learning the Q function over the entire space X × A. Instead, we utilize the nearest neighbor regressed Q function using the learned Q values restricted to Z_h. 
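The ε-greedy exploration policy referenced in the propositions above can be sketched as follows; this is a hypothetical minimal sketch, with the tie-breaking and the representation of Q-estimates as a per-state list being our illustrative choices.

```python
import random

def epsilon_greedy(q_row, eps, rng=random):
    """Pick a uniformly random action w.p. eps, else an action that is
    greedy w.r.t. the current Q-estimates q_row = [Q(x, a) for each a].
    Setting eps = 1 recovers the purely random policy."""
    n_actions = len(q_row)
    if rng.random() < eps:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_row[a])

rng = random.Random(0)
q_row = [0.1, 0.7, 0.3]
actions = [epsilon_greedy(q_row, eps=0.2, rng=rng) for _ in range(1000)]
```

With eps = 0.2 the greedy action dominates the draws, while every action still gets sampled with probability at least eps/|A| per step, which is what drives the covering-time bounds above.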
The policy assumes access\nto an existing policy \u21e1 (which is sometimes called the \u201cexploration policy\u201d, and need not have any\noptimality properties) that is used to sample data points for learning.\nThe pseudo-code of NNQL is described in Policy 1. At each time step t, action at is performed\nfrom state Yt as per the given (potentially non-optimal) policy \u21e1, and the next state Yt+1 is generated\naccording to p(\u00b7|Yt, at). Note that the sequence of observed states (Yt) take continuous values in the\nstate space X .\nThe policy runs over iteration with each iteration lasting for a number of time steps. Let k denote\niteration count, Tk denote time when iteration k starts for k 2 N.\nInitially, k = 0, T0 = 0,\nand for t 2 [Tk, Tk+1), the policy is in iteration k. The iteration is updated from k to k + 1\nwhen starting with t = Tk, all ball-action (Bi, a) pairs have been visited at least once. That is,\nTk+1 = Tk + \u2327\u21e1,h(YTk , Tk). In the policy description, the counter Nk(ci, a) records how many times\nthe ball-action pair (Bi, a) has been visited from the beginning of iteration k till the current time t;\nthat is, Nk(ci, a) =Pt\n1{Ys 2B i, as = a}. By de\ufb01nition, the iteration k ends at the \ufb01rst time\nstep for which min(ci,a) Nk(ci, a) > 0.\nDuring each iteration, the policy keeps track of the Q-function over the \ufb01nite set Zh. Speci\ufb01cally, let\nqk denote the approximate Q-values on Zh within iteration k. 
The policy also maintains Gkqk(ci, at),\nwhich is a biased empirical estimate of the joint Bellman-NN operator G applied to the estimates qk.\n\ns=Tk\n\n6\n\n\fPolicy 1 Nearest-Neighbor Q-learning\nInput: Exploration policy \u21e1, discount factor , number of steps T , bandwidth parameter h, and\ninitial state Y0.\nConstruct discretized state space Xh; initialize t = k = 0,\u21b5 0 = 1, q0 \u2318 0;\nForeach (ci, a) 2Z h, set N0(ci, a) = 0; end\nrepeat\n\nDraw action at \u21e0 \u21e1(\u00b7|Yt) and observe reward Rt; generate the next state Yt+1 \u21e0 p(\u00b7|Yt, at);\nForeach i such that Yt 2B i do\n\n1\n\nNk(ci,at)+1 ;\n\n\u2318N =\nif Nk(ci, at) > 0 then\n\n(Gkqk)(ci, at) = (1 \u2318N )(Gkqk)(ci, at) + \u2318NRt + maxb2A(NNqk)(Yt+1, b);\n\nelse (Gkqk)(ci, at) = Rt + maxb2A(NNqk)(Yt+1, b);\nend\nNk(ci, at) = Nk(ci, at) + 1\nend\nif min(ci,a)2Zh Nk(ci, a) > 0 then\n\nqk+1(ci, a) = (1 \u21b5k)qk(ci, a) + \u21b5k(Gkqk)(ci, a);\n\nForeach (ci, a) 2Z h do\nend\nk = k + 1; \u21b5k = \nForeach (ci, a) 2Z h do Nk(ci, a) = 0; end\n\n+k ;\n\nend\nt = t + 1;\n\nuntil t T ;\nreturn \u02c6q = qk\n\nAt each time step t 2 [Tk, Tk+1) within iteration k, if the current state Yt falls in the ball Bi, then the\ncorresponding value (Gkqk)(ci, at) is updated as\n\n(Gkqk)(ci, at) = (1 \u2318N )(Gkqk)(ci, at) + \u2318N\u21e3Rt + max\n\nb2A\n\n(NNqk)(Yt+1, b)\u2318,\n\n(8)\n\n1\n\nwhere \u2318N =\nNk(ci,at)+1. We notice that the above update rule computes, in an incremental fashion,\nan estimate of the joint Bellman-NN operator G applied to the current qk for each discretized state-\naction pair (ci, a), using observations Yt that fall into the neighborhood Bi of ci. 
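The update rules of Policy 1 can be sketched on a deliberately trivial MDP: constant reward R_t = 1 and i.i.d. uniform next states, so Q* ≡ 1/(1−γ) everywhere, which makes convergence easy to check. The step sizes η_N = 1/(N+1) and α_k = β/(β+k) follow our reading of (8) and Policy 1; the toy dynamics, the single-ball membership (nearest center only), and all numerical choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous-state MDP on X = [0, 1] with |A| = 2 and constant reward
# R_t = 1, so Q*(x, a) = 1/(1 - gamma) everywhere.
gamma, h = 0.5, 0.2
beta = 1.0 / (1.0 - gamma)
centers = np.arange(h / 2, 1.0, h)            # h-net of [0, 1]
n, A = len(centers), 2
q = np.zeros((n, A))                          # q_k on the discretized space Z_h

def nn(qvals, x):                             # fixed-radius NN average of q(., a)
    w = np.abs(centers - x) < h
    return w @ qvals / w.sum()

G = np.zeros((n, A))                          # running estimate (G_k q_k)(c_i, a)
N = np.zeros((n, A), dtype=int)               # visit counts within the iteration
k, alpha = 0, 1.0                             # alpha_0 = beta / (beta + 0) = 1
y = rng.random()
for t in range(30000):
    a = rng.integers(A)                       # exploration policy: uniform random
    reward, y_next = 1.0, rng.random()        # constant reward, i.i.d. next state
    i = np.argmin(np.abs(centers - y))        # the ball B_i containing Y_t
    target = reward + gamma * max(nn(q[:, b], y_next) for b in range(A))
    eta = 1.0 / (N[i, a] + 1)                 # step size eta_N, cf. update (8)
    G[i, a] = (1 - eta) * G[i, a] + eta * target
    N[i, a] += 1
    if N.min() > 0:                           # every ball-action pair visited:
        q = (1 - alpha) * q + alpha * G       # end-of-iteration update (9)
        k += 1
        alpha = beta / (beta + k)             # alpha_k = beta / (beta + k)
        N[:] = 0
    y = y_next
print(np.max(np.abs(q - beta)))               # sup-norm error: q_k -> Q* = 2
```

Note how (8) is computed incrementally with η_N = 1/(N+1), so each sample is touched once and discarded, and how (9) fires only once all ball-action pairs have been covered within the iteration.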
This nearest-neighbor approximation causes the estimate to be biased.
At the end of iteration k, i.e., at time step t = T_{k+1} − 1, a new q_{k+1} is generated as follows: for each (c_i, a) ∈ Z_h,

    q_{k+1}(c_i, a) = (1 − α_k) q_k(c_i, a) + α_k (G_k q_k)(c_i, a).    (9)

At a high level, this update is similar to standard Q-learning updates — the Q-values are updated by taking a weighted average of q_k, the previous estimate, and G_k q_k, a one-step application of the Bellman operator estimated using newly observed data. There are two main differences from standard Q-learning: 1) the Q-value of each (c_i, a) is estimated using all observations that lie in its neighborhood — a key ingredient of our approach; 2) we wait until all ball-action pairs are visited to update their Q-values, all at once.
Given the output q̂ of Policy 1, we obtain an approximate Q-value for each (continuous) state-action pair (x, a) ∈ Z via the nearest-neighbor average operation, i.e., Q^T_h(x, a) = (NN q̂)(x, a); here the superscript T emphasizes that the algorithm is run for T time steps with a sample size of T.

4 Main Results

As the main result of this paper, we obtain a finite-sample analysis of the NNQL policy. Specifically, we find that the NNQL policy converges to an ε-accurate estimate of the optimal Q* with time T that has polynomial dependence on the model parameters. The proof can be found in Appendix E.

Theorem 1. Suppose that Assumptions 1 and 2 hold. With the notation β = 1/(1 − γ) and C = M_r + γ V_max M_p, for given ε ∈ (0, 4V_max) and δ ∈ (0, 1), define h* ≡ h*(ε) = ε/(4C). Let N_{h*} be the h*-covering number of the metric space (X, ρ). For a universal constant C_0 > 0, after at most

    T = C_0 · (L_{h*} β^4 V_max^3 / ε^3) · log( N_{h*} |A| β^4 V_max^2 / (δ ε^2) ) · log( 2 β V_max / ε )

steps, with probability at least 1 − δ, we have ‖Q^T_{h*} − Q*‖∞ ≤ ε.

The theorem provides sufficient conditions for NNQL to achieve ε accuracy (in sup norm) for estimating the optimal action-value function Q*. The conditions involve the bandwidth parameter h* and the number of time steps T, both of which depend polynomially on the relevant problem parameters. Here an important parameter is the covering number N_{h*}: it provides a measure of the "complexity" of the state space X, replacing the role of the cardinality |X| in the context of discrete state spaces. For instance, for a unit volume ball in R^d, the corresponding covering number N_{h*} scales as O((1/h*)^d) (cf. Proposition 4.2.12 in [46]). We take note of several remarks on the implications of the theorem.

Sample complexity: The number of time steps T, which also equals the number of samples needed, scales linearly with the covering time L_{h*} of the underlying policy π used to sample data for the given MDP. Note that L_{h*} depends implicitly on the complexities of the state and action spaces as measured by N_{h*} and |A|. In the best scenario, L_{h*}, and hence T as well, is linear in N_{h*} × |A| (up to logarithmic factors), in which case we achieve (near) optimal linear sample complexity. The sample complexity T also depends polynomially on the desired accuracy ε⁻¹ and the effective horizon β = 1/(1 − γ) of the discounted MDP — optimizing the exponents of the polynomial dependence remains interesting future work.

Space complexity: The space complexity of NNQL is O(N_{h*} × |A|), which is necessary for storing the values of q_k. 
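On a uniform grid discretization, the per-step lookup of the ball containing Y_t need not scan all N_{h*} centers: a direct index computation gives O(d) lookup per query. This is a hypothetical sketch for X = [0, 1]^d with axis-aligned cells of side h; the tree- and hashing-based search structures mentioned in this section handle irregular h-nets.

```python
def grid_index(x, h):
    """Map a point x in [0, 1]^d to the index of its grid cell of side h,
    i.e., the cell whose center is the nearest of the h-net centers.
    Runs in O(d) per query, independent of the number of centers N_h."""
    cells_per_axis = round(1.0 / h)
    idx = 0
    for coord in x:                        # row-major flattening of the grid
        j = min(int(coord / h), cells_per_axis - 1)
        idx = idx * cells_per_axis + j
    return idx

def cell_center(idx, h, d):
    """Inverse map: center of cell `idx` (useful for checking the lookup)."""
    cells_per_axis = round(1.0 / h)
    coords = []
    for _ in range(d):
        idx, j = divmod(idx, cells_per_axis)
        coords.append((j + 0.5) * h)
    return list(reversed(coords))

print(grid_index([0.52, 0.18], h=0.25))    # prints 8
```

With this lookup, the q-values can be stored in a flat array of size N_h × |A| indexed by (cell, action), keeping both the space bound and the per-step cost above.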
Note that NNQL is a truly online algorithm, as each data point (Yt, at) is accessed\nonly once upon observation and then discarded; no storage of them is needed.\nComputational complexity: In terms of computational complexity, the algorithm needs to compute\nthe NN operator NN and maximization over A in each time step, as well as to update the values of qk\nfor all ci 2X h\u21e4 and a 2A in each iteration. Therefore, the worst-case computational complexity per\ntime step is O(Nh\u21e4 \u21e5|A| ), with an overall complexity of O(T \u21e5 Nh\u21e4 \u21e5|A| ). The computation can\nbe potentially sped up by using more ef\ufb01cient data structures and algorithms for \ufb01nding (approximate)\nnearest neighbors, such as k-d trees [5], random projection trees [13], Locality Sensitive Hashing [21]\nand boundary trees [26].\nChoice of h\u21e4: NNQL requires as input a user-speci\ufb01ed parameter h, which determines the discretiza-\ntion granularity of the state space as well as the bandwidth of the (kernel) nearest neighbor regression.\nTheorem 1 provides a desired value h\u21e4 = \"/4C, where we recall that C is the Lipschitz parameter\nof the optimal action-value function Q\u21e4 (see Lemma 1). Therefore, we need to use a small h\u21e4 if we\ndemand a small error \", or if Q\u21e4 \ufb02uctuates a lot with a large C.\n\n4.1 Special Cases and Lower Bounds\nTheorem 1, combined with Proposition 1, immediately yield the following bound that quantify the\nnumber of samples required to obtain an \"-optimal action-value function with high probability, if the\nsample path is generated per the uniformly random policy. The proof is given in Appendix F.\nCorollary 1. Suppose that Assumptions 1 and 2 hold, with X = [0, 1]d. 
Assume that the MDP satisfies the following: there exist a uniform probability measure $\nu$ over $\mathcal{X}$, a number $\varphi > 0$ and an integer $m \ge 1$ such that for all $x \in \mathcal{X}$, all $t \ge 0$ and all policies $\mu$, $\Pr_\mu\left(x_{m+t} \in \cdot \mid x_t = x\right) \ge \varphi \nu(\cdot)$. After at most

$$T = \kappa \cdot \frac{1}{\varepsilon^{d+3}} \log^3\left(\frac{1}{\varepsilon}\right)$$

steps, where $\kappa \equiv \kappa(|\mathcal{A}|, d, \gamma, m)$ is a number independent of $\varepsilon$ and $\delta$, we have $\|Q_T^{h^*} - Q^*\|_\infty \le \varepsilon$ with probability at least $1-\delta$.

Corollary 1 states that the sample complexity of NNQL scales as $\tilde{O}(1/\varepsilon^{d+3})$. We will show that this is effectively necessary by establishing a lower bound on any algorithm under any sampling policy. The proof of Theorem 2 can be found in Appendix G.

Theorem 2. For any reinforcement learning algorithm $\hat{Q}_T$ and any number $\delta \in (0, 1)$, there exist an MDP problem and some number $T_0 > 0$ such that

$$\Pr\left[\|\hat{Q}_T - Q^*\|_\infty \ge C \left(\frac{\log T}{T}\right)^{\frac{1}{2+d}}\right] \ge \delta, \quad \text{for all } T \ge T_0,$$

where $C > 0$ is a constant. Consequently, for any reinforcement learning algorithm $\hat{Q}_T$ and any sufficiently small $\varepsilon > 0$, there exists an MDP problem such that in order to achieve

$$\Pr\left[\|\hat{Q}_T - Q^*\|_\infty < \varepsilon\right] \ge 1 - \delta,$$

one must have

$$T \ge C_0 \, d \left(\frac{1}{\varepsilon}\right)^{2+d} \log\left(\frac{1}{\varepsilon}\right),$$

where $C_0 > 0$ is a constant.

5 Conclusions

In this paper, we considered the reinforcement learning problem for infinite-horizon discounted MDPs with a continuous state space. We focused on a reinforcement learning algorithm, NNQL, that is based on kernelized nearest neighbor regression. We established nearly tight finite-sample convergence guarantees showing that NNQL can accurately estimate the optimal Q function using a nearly optimal number of samples.
In particular, our results state that the sample, space and computational complexities of NNQL scale polynomially (sometimes linearly) with the covering number of the state space, which is continuous and has uncountably infinite cardinality.

In this work, the sample complexity analysis with respect to the accuracy parameter is nearly optimal, but its dependence on the other problem parameters is not optimized; this is an important direction for future work. It is also interesting to generalize our approach to settings beyond infinite-horizon discounted MDPs, such as finite-horizon or average-cost problems. Another possible direction for future work is to combine NNQL with a smart exploration policy, which may further improve the performance of NNQL. It would also be of much interest to investigate whether our approach, specifically the idea of using nearest neighbor regression, can be extended to handle infinite or even continuous action spaces.

Acknowledgment

This work was supported in part by NSF projects NeTs-1523546, TRIPODS-1740751, and CMMI-1462158.

References

[1] A. Antos, C. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

[2] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Reinforcement learning with a near optimal rate of convergence. Technical Report, 2011.

[3] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. In NIPS, 2011.

[4] A. Barreto, D. Precup, and J. Pineau. Practical kernel-based reinforcement learning. The Journal of Machine Learning Research, 17(1):2372–2441, 2016.

[5] J. L. Bentley. Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering, (4):333–340, 1979.

[6] D. Bertsekas.
Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.

[7] D. P. Bertsekas. Dynamic programming and optimal control, volume II. Athena Scientific, Belmont, MA, 3rd edition, 2007.

[8] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1691–1692. PMLR, 2018.

[9] N. Bhat, V. F. Farias, and C. C. Moallemi. Non-parametric approximate dynamic programming via the kernel method. In NIPS, 2012.

[10] C.-S. Chow and J. N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5(4):466–488, 1989.

[11] C.-S. Chow and J. N. Tsitsiklis. An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36(8):898–914, 1991.

[12] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analysis for TD(0) with linear function approximation. arXiv preprint arXiv:1704.01161, 2017.

[13] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 537–546. ACM, 2008.

[14] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.

[15] F. Dufour and T. Prieto-Rumeau. Approximation of Markov decision processes with general state space. Journal of Mathematical Analysis and Applications, 388(2):1254–1267, 2012.

[16] F. Dufour and T. Prieto-Rumeau.
Finite linear programming approximations of constrained discounted Markov decision processes. SIAM Journal on Control and Optimization, 51(2):1298–1324, 2013.

[17] F. Dufour and T. Prieto-Rumeau. Approximation of average cost Markov decision processes using empirical distributions and concentration inequalities. Stochastics: An International Journal of Probability and Stochastic Processes, 87(2):273–307, 2015.

[18] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. JMLR, 5, December 2004.

[19] W. B. Haskell, R. Jain, and D. Kalathil. Empirical dynamic programming. Mathematics of Operations Research, 41(2), 2016.

[20] H. V. Hasselt. Double Q-learning. In NIPS, 2010.

[21] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.

[22] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1994.

[23] M. Kearns and S. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In NIPS, 1999.

[24] S. H. Lim and G. DeJong. Towards finite-sample convergence of direct reinforcement learning. In Proceedings of the 16th European Conference on Machine Learning, pages 230–241. Springer-Verlag, 2005.

[25] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 504–513. AUAI Press, 2015.

[26] C. Mathy, N. Derbinsky, J. Bento, J. Rosenthal, and J. S. Yedidia. The boundary forest algorithm for online supervised and unsupervised learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2864–2870, 2015.

[27] F. S. Melo, S. P.
Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664–671. ACM, 2008.

[28] F. S. Melo and M. I. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.

[29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[30] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

[31] E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

[32] D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost problems. IEEE Transactions on Automatic Control, 47(10), 2002.

[33] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2-3), 2002.

[34] J. Pazis and R. Parr. PAC optimal exploration in continuous space Markov decision processes. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 774–781. AAAI Press, 2013.

[35] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[36] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3), 1997.

[37] N. Saldi, S. Yuksel, and T. Linder. On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Mathematics of Operations Research, 42(4), 2017.

[38] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot.
Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[39] C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040–1053, 1982.

[40] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In ICML, 2006.

[41] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In NIPS, 1997.

[42] C. Szepesvári and W. D. Smart. Interpolation-based Q-learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 100. ACM, 2004.

[43] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), 1994.

[44] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 1997.

[45] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.

[46] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2017.

[47] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4), 1992.

[48] G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.