{"title": "Non-parametric Approximate Dynamic Programming via the Kernel Method", "book": "Advances in Neural Information Processing Systems", "page_first": 386, "page_last": 394, "abstract": "This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful, dimension-independent approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our non-parametric procedure is competitive with parametric ADP approaches.", "full_text": "Non-parametric Approximate Dynamic Programming via the Kernel Method

Nikhil Bhat
Graduate School of Business
Columbia University
New York, NY 10027
nbhat15@gsb.columbia.edu

Vivek F. Farias
Sloan School of Management
Massachusetts Institute of Technology
Cambridge, MA 02142
vivekf@mit.edu

Ciamac C. Moallemi
Graduate School of Business
Columbia University
New York, NY 10027
ciamac@gsb.columbia.edu

Abstract

This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP.
Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches.

1 Introduction

Problems of dynamic optimization in the face of uncertainty are frequently posed as Markov decision processes (MDPs). The central computational problem is then reduced to the computation of an optimal 'cost-to-go' function that encodes the cost incurred under an optimal policy starting from any given MDP state. Many MDPs of practical interest suffer from the curse of dimensionality, where intractably large state spaces preclude exact computation of the cost-to-go function. Approximate dynamic programming (ADP) is an umbrella term for algorithms designed to produce good approximations to this function, yielding a natural 'greedy' control policy.

ADP algorithms are, in large part, parametric in nature, requiring the user to provide an 'approximation architecture' (i.e., a set of basis functions). The algorithm then produces an approximation in the span of this basis. The strongest theoretical results available for such algorithms typically share two features: (1) the quality of the approximation produced is comparable with the best possible within the basis specified, and (2) the computational effort required for doing so typically scales as the dimension of the basis specified.

These results highlight the importance of selecting a 'good' approximation architecture, and remain somewhat dissatisfying in that additional sampling or computational effort cannot remedy a bad approximation architecture. On the other hand, a non-parametric approach would, in principle, permit the user to select a rich, potentially full-dimensional architecture (e.g., the Haar basis). One would then expect to compute increasingly accurate approximations with increasing computational effort. The present work presents a practical algorithm of this type.
Before describing our contributions, we begin by summarizing the existing body of research on non-parametric ADP algorithms.

The key computational step in approximate policy iteration methods is approximate policy evaluation. This step involves solving the projected Bellman equation, a linear stochastic fixed point equation. A numerically stable approach to this is to perform regression with a certain ℓ2-regularization, where the loss is the ℓ2-norm of the Bellman error. By substituting this step with a suitable non-parametric regression procedure, [2, 3, 4] arrive at a corresponding non-parametric algorithm. Unfortunately, schemes such as approximate policy iteration have no convergence guarantees in parametric settings, and these difficulties remain in non-parametric variations. Another idea has been to use kernel-based local averaging ideas to approximate the solution of an MDP with that of a simpler variation on a sampled state space [5, 6, 7]. However, convergence rates for local averaging methods are exponential in the dimension of the problem state space. As in our setting, [8] constructs kernel-based cost-to-go function approximations. These are subsequently plugged into various ad hoc optimization-based ADP formulations, without theoretical justification.

Closely related to our work, [9, 10] consider modifying the approximate linear program with an ℓ1 regularization term to encourage sparse approximations in the span of a large, but necessarily tractable set of features. Along these lines, [11] discuss a non-parametric method that explicitly restricts the smoothness of the value function. However, sample complexity results for this method are not provided and it appears unsuitable for high-dimensional problems (such as, for instance, the problem we consider in our experiments).
In contrast to this line of work, our approach will allow\nfor approximations in a potentially in\ufb01nite dimensional approximation architecture with a constraint\non an appropriate \uffff2-norm of the weight vector.\nThe non-parametric ADP algorithm we develop enjoys non-trivial approximation and sample com-\nplexity guarantees. We show that our approach complements state-of-the-art parametric ADP algo-\nrithms by allowing the algorithm designer to compute what is essentially the best possible \u2018simple\u2019\napproximation1 in a full-dimensional approximation architecture as opposed to restricting attention\nto some a-priori \ufb01xed low dimensional architecture. In greater detail, we make the following contri-\nbutions:\nA new mathematical programming formulation. We rigorously develop a kernel-based variation of\nthe \u2018smoothed\u2019 approximate LP (SALP) approach to ADP proposed by [12]. The resulting mathe-\nmatical program, which we dub the regularized smoothed approximate LP (RSALP), is distinct from\nsimply substituting a kernel-based approximation in the SALP formulation. We develop a compan-\nion active set method that is capable of solving this mathematical program rapidly and with limited\nmemory requirements.\nTheoretical guarantees. 2 We establish a graceful approximation guarantee for our algorithm. Our\nalgorithm can be interpreted as solving an approximate linear program in an appropriate Hilbert\nspace. We provide, with high probability, an upper bound on the approximation error of the algo-\nrithm relative to the best possible approximation subject to a regularization constraint. The sam-\npling requirements for our method are, in fact, independent of the dimension of the approximation\narchitecture. Instead, we show that the number of samples grows polynomially as a function of a\nregularization parameter. 
Hence, the sampling requirements are a function of the complexity of the approximation, not of the dimension of the approximating architecture. This result can be seen as the 'right' generalization of the prior parametric approximate LP approaches [13, 14, 12], where, in contrast, sample complexity grows with the dimension of the approximating architecture.

A computational study. To study the efficacy of RSALP, we consider an MDP arising from a challenging queueing network scheduling problem. We demonstrate that our RSALP method yields significant improvements over known heuristics and standard parametric ADP methods.

In what follows, proofs and a detailed discussion of our numerical procedure are deferred to the Online Supplement to this paper.

1 In the sense that the ℓ2 norm of the weight vector can grow at most polynomially with a certain measure of computational budget.
2 These guarantees come under the assumption of being able to sample from a certain idealized distribution. This is common in the ADP literature.

2 Formulation

Consider a discrete time Markov decision process with finite state space S and finite action space A. We denote by x_t and a_t, respectively, the state and action at time t. We assume time-homogeneous Markovian dynamics: conditioned on being at state x and taking action a, the system transitions to state x′ with probability p(x, x′, a) independent of the past. A policy is a map µ : S → A, so that

J^µ(x) ≜ E_{x,µ} [ ∑_{t=0}^∞ α^t g_{x_t, a_t} ]

represents the expected (discounted, infinite horizon) cost-to-go under policy µ starting at state x. Letting Π denote the set of all policies, our goal is to find an optimal policy µ* such that µ* ∈ argmin_{µ∈Π} J^µ(x) for all x ∈ S (it is well known that such a policy exists).
We denote the optimal\ncost-to-go function by J\u2217 \uffff J \u00b5\u2217. An optimal policy \u00b5\u2217 can be recovered as a \u2018greedy\u2019 policy with\nrespect to J\u2217,\n\n\u00b5\u2217(x) \u2208 argmin\na\u2208A\n\ngx,a + \u03b1Ex,a[J\u2217(X\uffff)],\n\nwhere we de\ufb01ne Ex,a[f (X\uffff)] as\uffffx\uffff\u2208S p(x, x\uffff, a)f (x\uffff), for all f : S\u2192 R.\nSince in practical applications S is often intractably large, exact computation of J\u2217 is untenable.\nADP algorithms are principally tasked with computing approximations to J\u2217 of the form J\u2217(x) \u2248\nz\uffff\u03a6(x) \uffff \u02dcJ(x), where \u03a6: S\u2192 Rm is referred to as an \u2018approximation architecture\u2019 or a basis and\nmust be provided as input to the ADP algorithm. The ADP algorithm computes a \u2018weight\u2019 vector z;\none then employs a policy that is greedy with respect to the corresponding approximation \u02dcJ.\n\n2.1 Primal Formulation\n\nMotivated by the LP for exact dynamic programming, a series of ADP algorithms [15, 13, 12] have\nbeen proposed that compute a weight vector z by solving an appropriate modi\ufb01cation of the exact LP\nfor dynamic programming. In particular, [12] propose solving the following optimization problem\nwhere \u03bd \u2208 RS+ is a strictly positive probability distribution and \u03ba> 0 is a penalty parameter:\n\nmax \uffffx\u2208S\n\ns. t.\n\n\u03c0xsx\n\n\u03bdxz\uffff\u03a6(x) \u2212 \u03ba\uffffx\u2208S\nz\uffff\u03a6(x) \u2264 ga,x + \u03b1Ex,a[z\uffff\u03a6(X\uffff)] + sx, \u2200 x \u2208S , a \u2208A ,\nz \u2208 Rm, s \u2208 RS+.\n\n(1)\n\nIn parsing the above program notice that if one insisted that the slack variables s were precisely 0,\none is left with the ALP proposed by [15]. 
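For concreteness, acting greedily with respect to any cost-to-go approximation only requires a one-step lookahead over actions. The following is a minimal sketch with illustrative data structures of our own choosing (the paper specifies no implementation; the dictionary layouts for costs and transition probabilities are assumptions):

```python
def greedy_policy(x, actions, g, P, J_tilde, alpha=0.9):
    """Greedy action with respect to an approximate cost-to-go J_tilde.

    g[(x, a)]  : one-period cost of action a in state x (assumed layout).
    P[(x, a)]  : dict mapping next state -> transition probability.
    J_tilde    : callable giving the cost-to-go approximation at a state.
    """
    best_a, best_q = None, float("inf")
    for a in actions:
        # Q-value: immediate cost plus discounted expected cost-to-go.
        q = g[(x, a)] + alpha * sum(p * J_tilde(y) for y, p in P[(x, a)].items())
        if q < best_q:
            best_a, best_q = a, q
    return best_a
```

Since costs are minimized, the greedy rule picks the action with the smallest one-step Q-value.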
[13] provided a pioneering analysis that loosely showed\n\n\uffffJ\u2217 \u2212 z\u2217\uffff\u03a6\uffff1,\u03bd \u2264\n\nz \uffffJ\u2217 \u2212 z\uffff\u03a6\uffff\u221e,\ninf\n\n2\n\n1 \u2212 \u03b1\n\nfor an optimal solution z\u2217 to the ALP; [12] showed that these bounds could be improved upon\nsubstantially by \u2018smoothing\u2019 the constraints of the ALP, i.e., permitting positive slacks. In both\ncases, one must solve a \u2018sampled\u2019 version of the above program.\nNow, consider allowing \u03a6 to map from S to a general (potentially in\ufb01nite dimensional) Hilbert space\nH. We use bold letters to denote elements in the Hilbert space H, e.g., the weight vector is denoted\nby z \u2208H . We further suppress the dependence on \u03a6 and denote the elements H corresponding to\ntheir counterparts in S by bold letters. Hence, for example, x \uffff \u03a6(x) and X \uffff \u03a6(X). Further, we\ndenote X \uffff \u03a6(S); X\u2282H . The value function approximation in this case would be given by\n\n(2)\nwhere b is a scalar offset corresponding to a constant basis function. The following generalization\nof (1) \u2014 which we dub the regularized SALP (RSALP) \u2014 then essentially suggests itself:\n\n\u02dcJz,b(x) \uffff \uffffx, z\uffff + b = \uffff\u03a6(x), z\uffff + b,\n\nmax \uffffx\u2208S\n\ns. t.\n\n\u03bdx\uffffx, z\uffff + b \u2212 \u03ba\uffffx\u2208S\n\uffffx, z\uffff + b \u2264 ga,x + \u03b1Ex,a[\uffffX\uffff, z\uffff + b] + sx, \u2200 x \u2208S , a \u2208A ,\nz \u2208H , b \u2208 R, s \u2208 RS+.\n\n\u0393\n2 \uffffz, z\uffff\n\n\u03c0xsx \u2212\n\n(3)\n\n3\n\n\fThe only \u2018new\u2019 ingredient in the program above is the fact that we regularize z using the parameter\n\n\u0393 > 0. 
Constraining \uffffz\uffffH \uffff\uffff\uffffz, z\uffff to lie within some \uffff2-ball anticipates that we will eventually\n\nresort to sampling in solving this program and we cannot hope for a reasonable number of samples\nto provide a good solution to a problem where z was unconstrained. This regularization, which plays\na crucial role both in theory and practice, is easily missed if one directly \u2018plugs in\u2019 a local averaging\napproximation in place of z\uffff\u03a6(x) as is the case in the earlier work of [5, 6, 7, 8] and others.\nSince the RSALP, i.e., program (3), can be interpreted as a regularized stochastic optimization prob-\nlem, one may hope to solve it via its sample average approximation. To this end, de\ufb01ne the likeli-\nhood ratio wx \uffff \u03bdx/\u03c0x, and let \u02c6S\u2282S be a set of N states sampled independently according to the\ndistribution \u03c0. The sample average approximation of (3) is then\n\nmax\n\ns. t.\n\n1\n\n\u03ba\n\nwx\uffffx, z\uffff + b \u2212\n\nN \uffffx\u2208 \u02c6S\n\uffffx, z\uffff + b \u2264 ga,x + \u03b1Ex,a[\uffffX\uffff, z\uffff + b] + sx, \u2200 x \u2208 \u02c6S, a \u2208A ,\nz \u2208H , b \u2208 R, s \u2208 R \u02c6S+.\n\nN \uffffx\u2208 \u02c6S\n\n\u0393\n2 \uffffz, z\uffff\n\nsx \u2212\n\n(4)\n\nWe call this program the sampled RSALP. Even if | \u02c6S| were small, it is still not clear that this program\ncan be solved effectively. We will, in fact, solve the dual to this problem.\n\n2.2 Dual Formulation\nWe begin by establishing some notation. Let Nx,a \uffff {x}\u222a{ x\uffff \u2208S| p(x, x\uffff, a) > 0}. 
Now, de\ufb01ne\nthe symmetric positive semi-de\ufb01nite matrix Q \u2208 R( \u02c6S\u00d7A )\u00d7( \u02c6S\u00d7A ) according to\n\nQ(x, a, x\uffff, a\uffff) \uffff \uffffy\u2208Nx,a \uffffy\uffff\u2208Nx\uffff,a\uffff\uffff1{x=y} \u2212 \u03b1p(x, y, a)\uffff\uffff1{x\uffff=y\uffff} \u2212 \u03b1p(x\uffff, y\uffff, a)\uffff\uffffy, y\uffff\uffff, (5)\n\nand the vector R \u2208 R \u02c6S\u00d7A according to\n\nR(x, a) \uffff \u0393gx,a \u2212\n\n1\n\nN \uffffx\uffff\u2208 \u02c6S \uffffy\u2208Nx,a\n\nwx\uffff\uffff1{x=y} \u2212 \u03b1p(x, y, a)\uffff\uffffy, x\uffff\uffff.\n\n(6)\n\nNotice that Q and R depend only on inner products in X (and other, easily computable quantities).\nThe dual to (4) is then given by:\n\nmin\n\n1\n2 \u03bb\uffffQ\u03bb + R\uffff\u03bb\n\n\u03bbx,a \u2264\n\n\u03ba\nN\n\n,\n\ns. t. \uffffa\u2208A\n\uffffx\u2208 \u02c6S\uffffa\u2208A\n\n\u03bbx,a =\n\n1\n\n1 \u2212 \u03b1\n\n\u2200 x \u2208 \u02c6S,\n,\u03bb \u2208 R \u02c6S\u00d7A+\n\n.\n\n(7)\n\nAssuming that Q and R can be easily computed, this \ufb01nite dimensional quadratic program, is\ntractable \u2013 its size is polynomial in the number of sampled states. We may recover a primal so-\nlution (i.e., the weight vector z\u2217) from an optimal dual solution:\nProposition 1. 
Suppose the optimal solution to (7) is attained at some λ*. Then the optimal solution to (4) is attained at some (z*, s*, b*) with

z* = (1/Γ) [ (1/N) ∑_{x∈Ŝ} w_x x − ∑_{x∈Ŝ, a∈A} λ*_{x,a} ( x − α E_{x,a}[X′] ) ].   (8)

Having solved this program, we may, using Proposition 1, recover our approximate cost-to-go function J̃(x) = ⟨z*, x⟩ + b* as

J̃(x) = (1/Γ) [ (1/N) ∑_{y∈Ŝ} w_y ⟨y, x⟩ − ∑_{y∈Ŝ, a∈A} λ*_{y,a} ( ⟨y, x⟩ − α E_{y,a}[⟨X′, x⟩] ) ] + b*.   (9)

A policy greedy with respect to J̃ is not affected by constant translations; hence, in (9), the value of b* can be arbitrarily set to zero. Again note that given λ*, J̃ involves only inner products.

At this point, we use the 'kernel trick': instead of explicitly specifying H or the mapping Φ, we take the approach of specifying inner products. In particular, given any positive definite kernel K : S × S → R, it is well known (Mercer's theorem) that there exists a Hilbert space H and Φ : S → H such that K(x, y) = ⟨Φ(x), Φ(y)⟩. Consequently, given a positive definite kernel, we simply replace every inner product ⟨x, x′⟩ in the definition of the program (7) with the quantity K(x, x′), and similarly in the approximation (9). In particular, this is equivalent to using the Hilbert space H and mapping Φ corresponding to that kernel.

Solving (7) directly is costly. In particular, it is computationally expensive to pre-compute and store the matrix Q.
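The kernel substitution just described is mechanical: every inner product ⟨y, x⟩ appearing in (9) becomes a kernel evaluation K(y, x). A minimal Python sketch of evaluating the approximation this way follows; the container layouts (dicts keyed by sampled states and state-action pairs) are illustrative assumptions of ours, not the paper's:

```python
import math

def gaussian_kernel(x, y, h=100.0):
    """K(x, y) = exp(-||x - y||^2 / h), with bandwidth h."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / h)

def J_tilde(x, sampled, w, lam, trans, Gamma, alpha=0.9, b=0.0, K=gaussian_kernel):
    """Evaluate the approximation (9) using only kernel evaluations.

    sampled        : list of sampled states y in S_hat
    w[y]           : likelihood ratio at sampled state y
    lam[(y, a)]    : optimal dual variables lambda*_{y,a}
    trans[(y, a)]  : dict mapping next state -> probability (assumed layout)
    """
    N = len(sampled)
    total = sum(w[y] * K(y, x) for y in sampled) / N
    for (y, a), l in lam.items():
        # <y, x> - alpha * E_{y,a}[<X', x>], expressed through the kernel.
        expect = sum(p * K(yp, x) for yp, p in trans[(y, a)].items())
        total -= l * (K(y, x) - alpha * expect)
    return total / Gamma + b
```

Note that each evaluation costs O(N·|A|) kernel calls, which is why the active set method's linear-in-sample-size memory footprint matters in practice.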
An alternative to this is to employ the following broad strategy, as recognized by\n[16] and [17] in the context of solving SVM classi\ufb01cation problems, referred to as an active set\nmethod: At every point in time, one attempts to (a) change only a small number of variables while\nnot impacting other variables (b) maintain feasibility. It turns out that this results in a method that\nrequires memory and per-step computation that scales only linearly with the sample size. We defer\nthe details of the procedure as well as the theoretical analysis to the Online Supplement\n\n3 Approximation Guarantees\n\nRecall that we are employing an approximation \u02dcJz,b of the form (2), parameterized by the weight\nvector z and the offset parameter b. Now denoting by C the feasible region of the RSALP projected\nonto the z and b co-ordinates, the best possible approximation one may hope for among those per-\nmitted by the RSALP will have \uffff\u221e-approximation error inf (z,b)\u2208C \uffffJ\u2217 \u2212 \u02dcJz,b\uffff\u221e. Provided the\nGram matrix given by the kernel restricted to S is positive de\ufb01nite, this quantity can be made ar-\nbitrarily small by making \u0393 small. The rate at which this happens would re\ufb02ect the quality of the\nkernel in use. Here we focus on asking the following question: for a \ufb01xed choice of regularization\nparameters (i.e., with C \ufb01xed) what approximation guarantee can be obtained for a solution to the\nRSALP? 
This section will show that one can achieve a guarantee that is, in essence, within a certain\nconstant multiple of the optimal approximation error using a number of samples that is independent\nof the size of the state space and the dimension of the approximation architecture.\n\n3.1 The Guarantee\nDe\ufb01ne the Bellman operator, T : RS \u2192 RS according to\n\n(T J)(x) \uffff min\na\u2208A\n\ngx,a + \u03b1Ex,a[J(X\uffff)].\n\nLet \u02c6S be a set of N states drawn independently at random from S under the distribution \u03c0 over S.\nGiven the de\ufb01nition of \u02dcJz,b in (2), we consider the following sampled version of RSALP,\n\nmax \u03bd\uffff \u02dcJz,b \u2212\ns. t.\n\n2\n\n1 \u2212 \u03b1\n\n1\n\nN \uffffx\u2208 \u02c6S\n\nsx\n\n\uffffx, z\uffff + b \u2264 ga,x + \u03b1Ex,a[\uffffX\uffff, z\uffff + b] + sx, \u2200 x \u2208 \u02c6S, a \u2208A ,\n\uffffz\uffffH \u2264 C,\n\nz \u2208H , b \u2208 R, s \u2208 R \u02c6S+.\n\n|b|\u2264 B,\n\n(10)\n\nWe will assume that states are sampled according to an idealized distribution. In particular, \u03c0 \uffff\n\u03c0\u00b5\u2217,\u03bd where\n\n\u221e\ufffft=0\n\n5\n\n\u03c0\uffff\u00b5\u2217,\u03bd \uffff (1 \u2212 \u03b1)\n\n\u03b1t\u03bd\uffffP t\n\n\u00b5\u2217.\n\n(11)\n\nHere, P\u00b5\u2217 is the transition matrix under the optimal policy \u00b5\u2217. This idealized assumption is also\ncommon to the work of [14] and [12].\nIn addition, this program is somewhat distinct from the\nprogram presented earlier, (4): (1) As opposed to a \u2018soft\u2019 regularization term in the objective, we\nhave a \u2018hard\u2019 regularization constraint, \uffffz\uffffH \u2264 C. It is easy to see that given a \u0393, we can choose\na radius C(\u0393) that yields an equivalent optimization problem. (2) We bound the magnitude of the\noffset b. This is for theoretical convenience; our sample complexity bound will be parameterized\n\n\fby B. (3) We \ufb01x \u03ba = 2/(1 \u2212 \u03b1). 
Our analysis reveals this to be the 'right' penalty weight on the Bellman inequality violations.

Before stating our bound we establish a few bits of notation. We let (z*, b*) denote an optimal solution to (10). We let K ≜ max_{x∈X} ‖x‖_H, and finally, we define the quantity

Ξ(C, B, K, δ) ≜ (1 + √((1/2) ln(1/δ))) (4CK(1 + α) + 4B(1 − α) + 2‖g‖∞).

We have the following theorem:

Theorem 1. For any ε > 0 and δ > 0, let N ≥ Ξ(C, B, K, δ)²/ε². If (10) is solved by sampling N states from S with distribution π_{µ*,ν}, then with probability at least 1 − δ − δ⁴,

‖J* − J̃_{z*,b*}‖_{1,ν} ≤ ((3 + α)/(1 − α)) inf_{‖z‖_H ≤ C, |b| ≤ B} ‖J* − J̃_{z,b}‖∞ + 4ε/(1 − α).   (12)

Ignoring the ε-dependent error terms, we see that the quality of approximation provided by (z*, b*) is essentially within a constant multiple of the optimal (in the sense of ℓ∞-error) approximation to J* possible using a weight vector z and offset b permitted by the regularization constraints. This is a 'structural' error term that will persist even if one were permitted to draw an arbitrarily large number of samples. It is analogous to the approximation results produced in parametric settings, with the important distinction that one allows comparisons to approximations in potentially full-dimensional basis sets which might be substantially superior.

In addition to the structural error above, one incurs an additional additive 'sampling' error that scales like O(N^{−1/2}(CK + B)√(ln 1/δ)). This quantity has no explicit dependence on the dimension of the approximation architecture.
In contrast, comparable sample complexity results (eg. [14, 12])\ntypically scale with the dimension of the approximation architecture. Here, this space may be full\ndimensional, so that such a dependence would yield a vacuous guarantee. The error depends on the\nuser speci\ufb01ed quantities C and B, and K, which is bounded for many kernels. The result allows\nfor arbitrary \u2018simple\u2019 (i.e. with \uffffz\uffffH small) approximations in a rich feature space as opposed to\nrestricting us to some a-priori \ufb01xed, low dimensional feature space. This yields some intuition for\nwhy we expect the approach to perform well even with a relatively general choice of kernel.\nAs C and B grow large, the structural error will decrease to zero provided K restricted to S is\npositive de\ufb01nite. In order to maintain the sampling error constant, one would then need to increase\nN (at a rate that is \u2126((CK + B)2).\nIn summary, increased sampling yields approximations of\nincreasing quality, approaching an exact approximation. If J\u2217 admits a good approximation with\n\uffffz\uffffH small, one can expect a good approximation with a reasonable number of samples.\n3.2 Proof Sketch\n\nA detailed proof of a stronger result is in the Online Supplement. Here, we provide a proof sketch.\nThe \ufb01rst step of the proof involves providing a guarantee for the exact (non-sampled) RSALP with\nhard regularization. 
Assuming (z\u2217, b\u2217) is the \u2018learned\u2019 parameter pair, we \ufb01rst establish the guaran-\ntee:\n\n\uffffJ\u2217 \u2212 \u02dcJz\u2217,b\u2217\uffff1,\u03bd \u2264\n\n3 + \u03b1\n1 \u2212 \u03b1\n\ninf\n\n\uffffz\uffffH\u2264C,b\u2208R \uffffJ\u2217 \u2212 \u02dcJz,b\uffff\u221e.\n\nGeometrically, the proof works loosely by translating the \u2018best\u2019 approximation given the regulariza-\ntion constraints to one that is guaranteed to yield an approximation error no worse that that produced\nby the RSALP.\nTo establish a guarantee for the sampled RSALP, we \ufb01rst pose the RSALP as a stochastic optimiza-\ntion problem by setting s(z, b) \uffff ( \u02dcJz,b \u2212 T \u02dcJz,b)+. We must ensure that with high probability,\nthe sample averages in the sampled program are close to the exact expectations, uniformly for all\npossible values of (z, b) with high accuracy. In order to establish such a guarantee, we bound the\nRademacher complexity of the class of functions given by\n\n\u00afFS,\u00b5 \uffff\uffffx \uffff\u2192 ( \u02dcJz,b(x) \u2212 T\u00b5 \u02dcJz,b(x))+ : \uffffz\uffffH \u2264 C,|b|\u2264 B\uffff,\n\n6\n\n\fqueue 2\n\n\u00b52 = 0.12\n\nqueue 4\n\n\u03bb4 = 0.08\n\nserver 1\n\n\u00b53 = 0.28\n\n\u00b54 = 0.28\n\nserver 2\n\n\u03bb1 = 0.08\n\n\u00b51 = 0.12\n\nqueue 1\n\nqueue 3\n\nFigure 1: The queueing network example.\n\n(where T\u00b5 is the Bellman operator associated with policy \u00b5), This yields the appropriate uniform\nlarge deviations bound. Using this guarantee we show that the optimal solution to the sampled\nRSALP yields similar approximation guarantees as that with the exact RSALP; this proof is some-\nwhat delicate as it appears dif\ufb01cult to directly show that the optimal solutions themselves are close.\n\n4 Case Study: A Queueing Network\n\nThis section considers the problem of controlling the queuing network illustrated in Figure 1, with\nthe objective of minimizing long run average delay. 
There are two \u2018\ufb02ows\u2019 in this network: the \ufb01rst\nthrough server 1 followed by server 2 (with buffering at queues 1 and 2, respectively), and the second\nthrough server 2 followed by server 1 (with buffering at queues 4 and 3, respectively). Here, all inter-\narrival and service times are exponential with rate parameters summarized in Figure 1. This speci\ufb01c\nnetwork has been studied [13, 18] and is considered to be a challenging control problem. Our goal\nin this section will be two-fold. First, we will show that the RSALP can surpass the performance\nof both heuristic as well as established ADP-based approaches, when used \u2018out-of-the-box\u2019 with a\ngeneric kernel. Second, we will show that the RSALP can be solved ef\ufb01ciently.\n\n4.1 MDP Formulation\n\nAlthough the control problem at hand is nominally a continuous time problem, it is routinely con-\nverted into a discrete time problem via a standard uniformization device; see [19], for instance, for\nan explicit such example. In the equivalent discrete time problem, at most a single event can occur\nin a given epoch, corresponding either to the arrival of a job at queues 1 or 4, or the arrival of a ser-\nvice token for one of the four queues with probability proportional to the corresponding rates. The\nstate of the system is described by the number of jobs is each of the four queues, so that S \uffff Z4\n+,\nwhereas the action space A consists of four potential actions each corresponding to a matching be-\ntween servers and queues. We take the single period cost as the total number of jobs in the system,\nso that gx,a = \uffffx\uffff1; note that minimizing the average number of jobs in the system is equivalent to\nminimizing average delay by Little\u2019s law. Finally, we take \u03b1 = 0.9 as our discount factor.\n\n4.2 Approaches\n\nRSALP (this paper). 
We solve (7) using the active set method outlined in the Online Supplement, taking as our kernel the standard Gaussian radial basis function kernel K(x, y) ≜ exp(−‖x − y‖₂²/h), with the bandwidth parameter h ≜ 100. (The sensitivity of our results to this bandwidth parameter appears minimal.) Note that this implicitly corresponds to a full-dimensional basis function architecture. Since the idealized sampling distribution π_{µ*,ν} is unavailable to us, we use in its place the geometric distribution π(x) ≜ (1 − ζ)⁴ ζ^{‖x‖₁}, with the sampling parameter ζ set at 0.9, as in [13]. The regularization parameter Γ was chosen via a line-search; we report results for Γ ≜ 10⁻⁸. (Again, performance does not appear to be very sensitive to Γ, so that a crude line-search appears to suffice.) In accordance with the theory, we set the constraint violation parameter κ ≜ 2/(1 − α), as suggested by the analysis of Section 3.1 as well as by [12].

policy                   performance
Longest Queue            8.09
Max-Weight               6.55

sample size              1000         3000         5000         10000
SALP, cubic basis        7.19 (1.76)  7.89 (1.76)  6.94 (1.15)  6.63 (0.92)
RSALP, Gaussian kernel   6.72 (0.39)  6.31 (0.11)  6.13 (0.08)  6.04 (0.05)

Table 1: Performance results in the queueing example. For the SALP and RSALP methods, the number in parentheses gives the standard deviation across sample sets.

SALP [12]. The SALP formulation (1) is, as discussed earlier, the parametric counterpart to the RSALP. It may be viewed as a generalization of the ALP approach proposed by [13] and has been demonstrated to provide substantial performance benefits relative to the ALP approach.
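Because the geometric sampling distribution above factors across the four queues, drawing sampled states is straightforward: each queue length is an independent Geometric(1 − ζ) variable on {0, 1, ...}. A small sketch under that observation (the inversion formula is standard; the function name and interface are ours):

```python
import math
import random

def sample_state(zeta=0.9, n_queues=4, rng=random):
    """Draw a state from pi(x) = (1 - zeta)^4 * zeta^{||x||_1}.

    The distribution factors across queues, so each queue length is an
    independent Geometric(1 - zeta) variable on {0, 1, ...}, sampled
    by inversion: X = floor(ln(1 - U) / ln(zeta)) for U uniform on [0, 1).
    """
    return tuple(
        int(math.log(1.0 - rng.random()) / math.log(zeta))
        for _ in range(n_queues)
    )
```

With ζ = 0.9 the mean queue length under π is ζ/(1 − ζ) = 9 jobs per queue, so sampled states concentrate on moderately loaded configurations.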
Our choice of parameters for the SALP mirrors those for the RSALP to the extent possible, so as to allow for an 'apples-to-apples' comparison. Thus, we solve the sample average approximation of this program using the same geometric sampling distribution and parameter κ. Approximation architectures in which the basis functions are monomials of the queue lengths appear to be a popular choice for queueing control problems [13]. We use all monomials with degree at most 3, which we will call the cubic basis, as our approximation architecture.

Longest Queue (generic). This is a simple heuristic approach: at any given time, a server chooses to work on the longest queue from among those it can service.

Max-Weight [20]. Max-Weight is a well known scheduling heuristic for queueing networks. The policy is obtained as the greedy policy with respect to a value function approximation of the form J̃_MW(x) ≜ ∑_{i=1}^4 |x_i|^{1+ε}, given a parameter ε > 0. This policy has been extensively studied and shown to have a number of good properties, for example, being throughput optimal and offering good performance for critically loaded settings [21]. Via a line-search, we chose ε ≜ 1.5 as the exponent for our experiments.

4.3 Results

Policies were evaluated using a common set of arrival process sample paths. The performance metric we report for each control policy is the long run average number of jobs in the system under that policy, ∑_{t=1}^T ‖x_t‖₁/T, where we set T ≜ 10000. We further average this random quantity over an ensemble of 300 sample paths. Further, in order to generate SALP and RSALP policies, state sampling is required. To understand the effect of the sample size on the resulting policy performance, the different sample sizes listed in Table 1 were used.
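As an illustration, the Max-Weight heuristic described above can be sketched in a simplified per-server form. The server/queue assignments and service rates below are taken from Figure 1; the per-queue index rule μ_q·x_q^ε (ignoring downstream queues) is a common simplification and an assumption of ours, not the paper's exact greedy policy with respect to J̃_MW:

```python
def max_weight_action(x, eps=1.5, server_queues=None, mu=None):
    """Simplified Max-Weight rule: each server works on its nonempty queue
    with the largest index mu_q * x_q**eps (illustrative variant)."""
    if server_queues is None:
        # Server 1 serves queues 1 and 3; server 2 serves queues 2 and 4
        # (0-indexed below), per Figure 1.
        server_queues = {1: (0, 2), 2: (1, 3)}
    if mu is None:
        mu = {0: 0.12, 1: 0.12, 2: 0.28, 3: 0.28}  # service rates from Figure 1
    action = {}
    for s, queues in server_queues.items():
        nonempty = [q for q in queues if x[q] > 0]
        # Idle the server when all of its queues are empty.
        action[s] = max(nonempty, key=lambda q: mu[q] * x[q] ** eps) if nonempty else None
    return action
```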
Since the policies generated involve randomness due to the sampled states, we further average performance over 10 sets of sampled states. The results are reported in Table 1 and have the following salient features:

1. RSALP outperforms established policies: Approaches such as Max-Weight or 'parametric' ADP with a basis spanning polynomials have previously been shown to work well for the problem of interest. We see that RSALP with 10000 samples achieves performance that is superior to these extant schemes.

2. Sampling improves performance: This is expected from the theory in Section 3. Ideally, as the sample size is increased, one should relax the regularization. However, in our experiments we noticed that performance is quite insensitive to the parameter Γ. Nonetheless, it is clear that larger sample sets yield a significant performance improvement.

3. RSALP is less sensitive to state sampling: We notice from the standard deviation values in Table 1 that our approach gives policies whose performance varies significantly less across different sample sets of the same size.

In summary, we view these results as indicative of the possibility that the RSALP may serve as a practical and viable alternative to state-of-the-art parametric ADP techniques.

References

[1] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 2007.

[2] B. Bethke, J. P. How, and A. Ozdaglar. Kernel-based reinforcement learning using Bellman residual elimination. MIT Working Paper, 2008.

[3] Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proceedings of the 20th International Conference on Machine Learning, pages 154–161. AAAI Press, 2003.

[4] X. Xu, D. Hu, and X. Lu. Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18(4):973–992, 2007.

[5] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2):161–178, 2002.

[6] D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average cost problems. IEEE Transactions on Automatic Control, 47(10):1624–1636, 2002.

[7] A. M. S. Barreto, D. Precup, and J. Pineau. Reinforcement learning using kernel-based stochastic factorization. In Advances in Neural Information Processing Systems, volume 24, pages 720–728. MIT Press, 2011.

[8] T. G. Dietterich and X. Wang. Batch value function approximation via support vectors. In Advances in Neural Information Processing Systems, volume 14, pages 1491–1498. MIT Press, 2002.

[9] J. Kolter and A. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML '09, pages 521–528. ACM, 2009.

[10] M. Petrik, G. Taylor, R. Parr, and S. Zilberstein. Feature selection using regularization in approximate linear programs for Markov decision processes. In ICML '10, pages 871–879, 2010.

[11] J. Pazis and R. Parr. Non-parametric approximate linear programming for MDPs. In AAAI Conference on Artificial Intelligence. AAAI, 2011.

[12] V. V. Desai, V. F. Farias, and C. C. Moallemi. Approximate dynamic programming via a smoothed linear program. To appear in Operations Research, 2011.

[13] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

[14] D. P. de Farias and B. Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004.

[15] P. Schweitzer and A. Seidman. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110:568–582, 1985.

[16] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Neural Networks for Signal Processing, Proceedings of the 1997 IEEE Workshop, pages 276–285, September 1997.

[17] T. Joachims. Making large-scale support vector machine learning practical, pages 169–184. MIT Press, Cambridge, MA, USA, 1999.

[18] R. R. Chen and S. Meyn. Value iteration and optimization of multiclass queueing networks. In Proceedings of the 37th IEEE Conference on Decision and Control, volume 1, pages 50–55, 1998.

[19] C. C. Moallemi, S. Kumar, and B. Van Roy. Approximate and data-driven dynamic programming for queueing networks. Working Paper, 2008.

[20] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936–1948, December 1992.

[21] A. L. Stolyar. Maxweight scheduling in a generalized switch: State space collapse and workload minimization in heavy traffic. The Annals of Applied Probability, 14:1–53, 2004.