{"title": "Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable", "book": "Advances in Neural Information Processing Systems", "page_first": 820, "page_last": 828, "abstract": "We study convex stochastic optimization problems where a noisy objective function value is observed after a decision is made. There are many stochastic optimization problems whose behavior depends on an exogenous state variable which affects the shape of the objective function. Currently, there is no general purpose algorithm to solve this class of problems. We use nonparametric density estimation for the joint distribution of state-outcome pairs to create weights for previous observations. The weights effectively group similar states. Those similar to the current state are used to create a convex, deterministic approximation of the objective function. We propose two solution methods that depend on the problem characteristics: function-based and gradient-based optimization. We offer two weighting schemes, kernel based weights and Dirichlet process based weights, for use with the solution methods. The weights and solution methods are tested on a synthetic multi-product newsvendor problem and the hour ahead wind commitment problem. Our results show Dirichlet process weights can offer substantial benefits over kernel based weights and, more generally, that nonparametric estimation methods provide good solutions to otherwise intractable problems.", "full_text": "Nonparametric Density Estimation for Stochastic\nOptimization with an Observable State Variable\n\nLauren A. Hannah\nDuke University\n\nDurham, NC 27701\nlh140@duke.edu\n\nWarren B. Powell\nPrinceton University\nPrinceton, NJ 08544\n\npowell@princeton.edu\n\nAbstract\n\nDavid M. Blei\n\nPrinceton University\nPrinceton, NJ 08544\n\nblei@cs.princeton.edu\n\nIn this paper we study convex stochastic optimization problems where a noisy\nobjective function value is observed after a decision is made. 
There are many\nstochastic optimization problems whose behavior depends on an exogenous state\nvariable which affects the shape of the objective function. Currently, there is no\ngeneral purpose algorithm to solve this class of problems. We use nonparametric\ndensity estimation to take observations from the joint state-outcome distribution\nand use them to infer the optimal decision for a given query state s. We propose\ntwo solution methods that depend on the problem characteristics: function-based\nand gradient-based optimization. We examine two weighting schemes, kernel\nbased weights and Dirichlet process based weights, for use with the solution meth-\nods. The weights and solution methods are tested on a synthetic multi-product\nnewsvendor problem and the hour ahead wind commitment problem. Our results\nshow that in some cases Dirichlet process weights offer substantial bene\ufb01ts over\nkernel based weights and more generally that nonparametric estimation methods\nprovide good solutions to otherwise intractable problems.\n\n1\n\nIntroduction\n\nIn stochastic optimization, a decision maker makes a decision and faces a random cost based on\nthat decision. The goal is to choose a decision that minimizes the expected cost using information\nfrom previous observations. Stochastic optimization problems with continuous decision spaces have\nmany viable solution methods, including function averaging and stochastic gradient descent [20].\nHowever, in many situations conditions for the previous observations may not be the same as the\ncurrent conditions; the conditions can be viewed as state variables. 
There are currently no general purpose solution methods for stochastic optimization problems with state variables, although they would be useful for finance, energy, dynamic pricing, inventory control and reinforcement learning applications.
We consider the newsvendor problem, a classic inventory management problem, to illustrate existing solution methods for stochastic optimization problems with state variables and their limitations. Here, newspapers can be bought in advance for cost c, and up to D of them can be sold for price p, where D is a random demand; the goal is to determine how many papers should be ordered so as to maximize the expected profit. A state variable that contains information about the random demand may also be included. For example, a rainy forecast may correlate to a lower demand while a sunny forecast may correlate to a higher one. A natural solution method would be to partition the previous observations into "rainy" and "sunny" bins, and then solve the problem for each partition. This essentially models the problem as a single time period Markov Decision Process and solves the problem accordingly [16, 21]. Partitioning methods work when the state space can take a small number of discrete values.
Two problems arise with partitioning methods when the state space becomes larger. First, the number of states grows exponentially with the dimension of the state space. If there are 10 attributes, like weather, stock prices, days until an election, etc., and each can take 100 values, then there will be $10^{20}$ individual states. Second, previous observations are sparse over these states; a vast number of observations must be gathered before there are enough to make a reasonable decision for a given state. 
Rather than partitioning, we propose using observations from "similar" states to create a deterministic decision-expected cost function, also called an objective function, that is conditioned on a particular state.
Similar methods have been proposed in an approximate dynamic programming setting that use basis functions, such as linear and polynomial predictors, to construct approximate value functions [22, 14]. Basis functions, however, are hard to choose manually and automatic selection is an area of active research [12]. Moreover, basis functions do not guarantee that the approximate objective function is convex in the decision.
We propose using nonparametric density estimation for the joint state and outcome distribution to group observations from "similar" states with weights. These are then used to construct deterministic, convex approximations of the noisy function given the current observed information. The result is a deterministic, convex math program, which can be efficiently solved by a number of commercial solvers, even with very large decision spaces (10 to 1,000+ variables and constraints).
We give two methods to construct an approximate objective function using previous observations. The first is a function-based method. In some cases, entire random objective functions can be viewed retrospectively. For example, if the demand is known in the newsvendor problem, then the value of all decisions is also known. In these particular cases, the approximate objective function is modeled as a weighted average of the observed functions. The second method is based on stochastic gradients. In some cases, it is not possible to observe entire functions or observed functions may be too complex to manipulate. When this happens, we propose constructing a separable, piecewise linear approximate objective function. 
A piecewise linear, convex function is created in each decision dimension by generating a slope function from a weighted, order-restricted regression of the gradients, and then integrating that function. The result is an approximate objective function that is not necessarily the same as the original objective function, but one that has the same minima.
Both methods depend heavily on weights to capture dependence between the state and the outcome. We propose two weighting schemes: kernel weights and Dirichlet process mixture model weights. Kernels are simple to implement, but Dirichlet process mixture models have certain appealing properties. First, they act as a local bandwidth selector across the state space; second, the weights are generated by partitions rather than products of uni-dimensional weights, so the results scale better to higher-dimensional settings.
We contribute novel algorithms for stochastic optimization problems with a state variable that work with large, continuous decision spaces and propose a new use of Dirichlet process mixture models. We give empirical analysis for these methods where we show promising results on test problems.
The paper is organized as follows. In Section 2, we review traditional function-based and gradient-based optimization methods and in each case present novel algorithms to accommodate an observable state variable. In Section 3, we describe the kernel and Dirichlet process weighting schemes. We present an empirical analysis of our methods for synthetic newsvendor data and the hour ahead wind commitment problem in Section 4 and a discussion in Section 5.
In the newsvendor problem, which we will use as a running example, x is the stocking level and Z is the random demand. Given x and Z(ω), F is deterministic. When a state variable is included, we first observe a random state S ∈ S that may influence F and the distribution of Z, then we make a decision x, and finally we observe the random variable Z. Eq. (1) becomes

$$\min_{x \in \mathcal{X}} \mathbb{E}\left[F(x, s, Z) \mid S = s\right]. \qquad (2)$$

Traditional stochastic optimization techniques require us to sample from the conditional distribution p(Z|S = s), treating each state observation independently [20]. We will use nonparametric density estimation for the joint distribution of (S, Z) to take into account that similar values of S affect Z and F in a similar way. We now describe new methods for function-based and gradient-based optimization for problems with an observable state variable.

2.1 Function-based optimization with an observable state variable

Function-based optimization is used when a single outcome ω can tell us the value of all decisions given that outcome [19]. For example, in the newsvendor problem, if the demand is known then the value of all inventory levels is known. Function-based optimization relies on sampling a set of scenarios, ω_1, ..., ω_n from Ω, to approximate Eq. (1):

$$\min_{x \in \mathcal{X}} \frac{1}{n} \sum_{i=1}^{n} F(x, Z(\omega_i)). \qquad (3)$$

Since Eq. (3) is deterministic given ω_{1:n}, deterministic solution methods can be used. These methods are well developed and are implemented in a variety of commercial solvers.
When a state variable is introduced, we wish to solve Eq. (2) for a fixed query state s ∈ S. However, scenarios are not i.i.d. from the distribution p(Z|S = s), but rather from the joint distribution p(Z, S). Let (S_i, Z(ω_{i+1}))_{i=0}^{n-1} be a set of n observations. 
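As a concrete illustration, the sample average approximation of Eq. (3) can be sketched for a single-product newsvendor. This is a hypothetical sketch, not the paper's implementation: the grid search stands in for a generic deterministic solver, and `price`, `cost`, and `grid` are illustrative parameters.

```python
import numpy as np

def saa_newsvendor(demands, price, cost, grid):
    """Sample average approximation (Eq. 3): average the observed profit
    functions F(x, Z(w_i)) over sampled demands and pick the best x on a
    grid of candidate stocking levels."""
    demands = np.asarray(demands, dtype=float)
    # Each term p*min(x, D_i) - c*x is concave in x, so the average is too.
    avg_profit = [np.mean(price * np.minimum(x, demands) - cost * x)
                  for x in grid]
    best = int(np.argmax(avg_profit))
    return grid[best], avg_profit[best]
```

Replacing the uniform 1/n average with state-dependent weights w_n(s, S_i) turns this sample average into the weighted approximation of Eq. (4) below.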
Instead of taking a naive average of the observations as in Eq. (3), we weight the observations based on the distance between the query state s and each observation S_i with weight w_n(s, S_i). The weights must sum to 1, $\sum_{i=0}^{n-1} w_n(s, S_i) = 1$, and the weights may change with the number of observations, n. Set

$$\bar{F}_n(x \mid s) = \sum_{i=0}^{n-1} w_n(s, S_i)\, F(x, S_i, Z(\omega_{i+1})). \qquad (4)$$

The optimization problem becomes

$$\min_{x \in \mathcal{X}} \bar{F}_n(x \mid s). \qquad (5)$$

Note that because F(x, S_i, Z(ω_{i+1})) is convex in x for every S_i and ω_{i+1}, F̄_n(x|s) is convex and Eq. (5) can be solved with a commercial solver. We discuss weight functions in Section 3.

2.2 Gradient-based optimization with an observable state variable

In gradient-based optimization, we no longer observe an entire function F(x, S, Z(ω)), but only a derivative taken at x,

$$\hat{\beta}(x_i, s, \omega_{i+1}) = \nabla_x F(x_i, s, Z(\omega_{i+1})). \qquad (6)$$

Stochastic approximation is the most popular way to solve stochastic optimization problems using a gradient; it modifies gradient search algorithms to account for random gradients [17, 9]. The general idea is to optimize x by iterating,

$$x_{n+1} = \Gamma_{\mathcal{X}}\left(x_n - a_n \nabla_x F(x_n, Z(\omega_{n+1}))\right), \qquad (7)$$

where Γ_X is a projection back into the constraint set X, ∇_x F(x_n, Z(ω_{n+1})) is a stochastic gradient at x_n and a_n is a stepsize. Other approaches to gradient-based optimization have included construction of piecewise linear, convex functions to approximate F(x) in the region where x is near the optimal decision, x* [15].
Including a state variable into gradient-based optimization is less straightforward than it is for function-based optimization. We run into difficulties because we choose x_n given S_n. When we include state S_n, the decision x_n is based on the state S_n. But x_{n-1} depends on S_{n-1}, so no iterative procedure like Eq. (7) can be used. 
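For contrast with the state-dependent setting, the classical stateless recursion of Eq. (7) can be sketched as follows. The quadratic test objective, the step-size rule a_n = 1/(n+1), and the box projection are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def projected_sgd(grad, x0, lo, hi, n_iters=200, seed=0):
    """Eq. (7): x_{n+1} = Gamma_X(x_n - a_n * stochastic gradient),
    where Gamma_X is here a projection onto the box [lo, hi]."""
    rng = np.random.default_rng(seed)
    x = x0
    for n in range(n_iters):
        a_n = 1.0 / (n + 1)               # diminishing step size
        g = grad(x, rng)                  # noisy gradient at x_n
        x = np.clip(x - a_n * g, lo, hi)  # projection Gamma_X
    return x

# Minimize E[(x - Z)^2] with Z ~ N(3, 1); the optimum is x* = 3.
noisy_grad = lambda x, rng: 2.0 * (x - rng.normal(3.0, 1.0))
```

With a state variable, each observed gradient would also depend on S_n, which is why the paper replaces this recursion with a weighted regression of gradients.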
Moreover, constructing the approximate function F̄_n(x|s) is not trivial because the stochastic gradients depend on both x_n and S_n.
Therefore, we propose modeling F(x|s) with a piecewise linear, convex, separable approximation. Even if F(x|s) is not itself separable, we aim to approximate it with a simpler (separable) function that has the same minimum for all fixed s. Approximating the minimum is easier than approximating the entire convex function [4, 15]. Moreover, convex regression is easier in one dimension than multiple dimensions. We approximate E[F(x, s, Z)] by a sum of separable functions,

$$\bar{F}_n(x \mid s) = \sum_{k=1}^{d} f^k_n(x_k \mid s),$$

where x_k is the kth component of x. We enforce convexity restrictions on f^k_n(x|s) for every s ∈ S.

Figure 1: A graphical depiction of the gradient-based method in one dimension for a maximization problem. (Top left) Observe gradients, state. (Top right) Weight observations based on state. (Bottom left) Fit isotonic regression to weighted slopes. (Bottom right) Integrate isotonic regression to form f^k_n(x_n|S_n).

Unlike the function-based method, the gradient-based method is a fundamentally online algorithm: x_n is used to choose x_{n+1}. Given S_n, we choose x_n as follows,

$$x_n = \arg\min_{x \in \mathcal{X}} \sum_{k=1}^{d} f^k_n(x_k \mid S_n).$$

We then receive β̂(x_n, S_n, ω_{n+1}). The observations (x_i, S_i, β̂(x_i, S_i, ω_{i+1}))_{i=0}^{n-1} are used to update F̄_n(x|s) sequentially. Fix k ∈ {1, ..., d}; we want to construct a piecewise linear f^k_n(x|s) by constructing an increasing slope function, v^k_n(x|S_n) = (d/dx) f^k_n(x|S_n), based on the stochastic gradient observations, β̂_{1:n}. We use weights to group the gradients from states "similar" to S_n and a weighted isotonic (order-restricted) regression to construct v^k_n(x|S_n). Order the decision observations x^k_{[0]}, ..., x^k_{[n-1]}, and then solve to find slopes for the decision-ordered space,

$$v^k_n(x_{0:n-1} \mid S_n) = \arg\min_{v} \sum_{i=0}^{n-1} w_n\left(S_n, S_{[i]}\right) \left( \hat{\beta}(x^k_{[i]}, S_{[i]}, \omega_{[i+1]}) - v_{[i]} \right)^2, \qquad (8)$$
$$\text{subject to: } v_{[i-1]} \le v_{[i]}, \quad i = 1, \ldots, n-1.$$

First, v^k_n(x|S_n) is generated by interpolating the point estimates from Eq. (8) across the kth dimension of the decision space, and then f^k(x|S_n) is created by integrating v^k_n(x|S_n). The monotonicity of v^k_n(x|S_n) ensures the convexity of f^k_n(x|S_n). See Figure 1 for an example. The general method for constructing F̄_n(x|s) is as follows:

1. Observe S_n and construct the weights (w_n(S_n, S_i))_{i=0}^{n-1},
2. Use the weights (w_n(S_n, S_i))_{i=0}^{n-1}, previous decisions x_{0:n-1} and gradients to construct slopes v^k_{1:K}(S_n) with Eq. (8),
3. Reconstruct f^k(x|S_n) from the slopes and construct F̄(x|S_n) from (f^k(x|S_n))_{k=1}^{d}, and
4. Choose x_n given F̄(x|S_n): x_n = argmin_{x ∈ X} F̄_n(x|S_n).

Details are given in the supplementary material. We now discuss the choice of weight functions.

3 Weight functions

Like the choice of step size in stochastic approximation, the choice of weight functions in Eqs. 
(4) and (8) determines whether and under which conditions function-based and gradient-based optimization produce acceptable results. Weighting functions rely on density estimation procedures to approximate the conditional density f(z|s), where s is the state and z is the response. Conditional density estimation weights observations from a joint distribution to create a conditional distribution. We use this to obtain weights from two nonparametric density estimators, kernels and Dirichlet process mixture models.

3.1 Kernel weights

Kernel weights rely on kernel functions, K(s), to be evaluated at each observation to approximate the conditional density. A common choice for K with continuous covariates is the Gaussian kernel, $K_h(s) = (2\pi h)^{-1/2} \exp\{-s^2/2h\}$, where the variance h is called the bandwidth. Kernel weights have the advantage of being simple and easy to implement. The simplest and most universally applicable weighting scheme is based on the Nadaraya-Watson estimator [10, 23]. If K(s) is the kernel and h_n is the bandwidth after n observations, define

$$w_n(s, S_i) = K\left((s - S_i)/h_n\right) \Big/ \sum_{j=0}^{n-1} K\left((s - S_j)/h_n\right).$$

Kernel estimators require a well sampled space, are poor in higher dimensions and highly sensitive to bandwidth size [5].

3.2 Dirichlet process weights

One of the curses of dimensionality is sparseness of data: as the number of dimensions grows, the distance between observations grows exponentially. In kernel regression, this means that only a handful of observations have weights that are effectively non-zero, producing non-stable estimates. Instead, we would like to average responses for "similar" observations. 
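A minimal sketch of the Nadaraya-Watson weights of Section 3.1, assuming a one-dimensional state and a Gaussian kernel with a fixed bandwidth (the kernel's normalizing constant cancels in the ratio):

```python
import numpy as np

def nadaraya_watson_weights(s, states, h):
    """Kernel weights w_n(s, S_i) for a query state s: each observation
    is weighted by a Gaussian kernel on its distance to s, normalized
    so the weights sum to one."""
    states = np.asarray(states, dtype=float)
    # Unnormalized Gaussian kernel at the scaled distances (s - S_i) / h.
    k = np.exp(-0.5 * ((s - states) / h) ** 2)
    return k / k.sum()
```

Observations whose states are closest to the query state receive the largest weights, and the bandwidth h controls how quickly the weights decay with distance.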
We propose modeling the distribution of the state variable with a Dirichlet process mixture model, which is then decomposed into weights.

Dirichlet process mixture models. A mixture model represents a distribution, g(s), as a weighted infinite sum of simpler distributions, g(s|θ_i), parameterized by θ_i: $g(s) = \sum_{i=1}^{\infty} p_i\, g(s \mid \theta_i)$. Here, p_i is the mixing proportion for component i. We can use a Dirichlet process (DP) with base measure G_0 and concentration parameter α to place a distribution over the joint distribution of (p_i, θ_i), the mixture proportion and location of component i [6, 1]. Assume that data S_1, ..., S_n are iid with a distribution that is modeled by a mixture over distribution G(θ),

$$P \sim DP(\alpha, G_0), \qquad \theta_i \mid P \sim P, \qquad S_i \mid \theta_i \sim G(\theta_i). \qquad (9)$$

The distribution P drawn from a Dirichlet process is an almost surely discrete measure over parameters, with the mixture proportion associated with θ as the atomic weight. The hidden measure P in Eq. (9) can be integrated out to obtain a conditional distribution of θ_n | θ_{1:n-1} [3],

$$\theta_n \mid \theta_1, \ldots, \theta_{n-1} \sim \frac{1}{\alpha + n - 1} \sum_{i=1}^{n-1} \delta_{\theta_i} + \frac{\alpha}{\alpha + n - 1}\, G_0. \qquad (10)$$

Here, δ_θ is the Dirac measure with mass at θ. Eq. (10) is known as a Polya urn posterior; the variable θ_n has positive probability of assuming the value of one of the previously observed θ_i, but it also can take a new value drawn from G_0 with positive probability. The parameter α controls how likely θ_n is to take a new value. We now discuss how weights can be constructed from Eq. (9).

Dirichlet process mixture model weights. A Dirichlet process mixture model can be used to model an unknown density, but it can simultaneously be used to produce a distribution of the partition structure of observed data [13, 8]. 
This is shown in the Polya urn posterior of Eq. (10); each hidden parameter has positive probability of taking the same value as another parameter. If two hidden parameters have the same value, they are in the same partition/cluster. The partition structure induces weights on the observations, proportional to 1 if they are in the same cluster, 0 if not.
Let p = {C_1, ..., C_{n(p)}} be the partition of the observations {1, ..., n}. Here C_i = {j : θ_j = θ*_i} is the partition set generated by n(p) unique parameter values, denoted θ*_1, ..., θ*_{n(p)}. Now suppose that we know the partition p. Given p, we include the query state s into cluster C_i with probability

$$p_s(C_i \mid p) = \mathbb{P}(s \in C_i \mid p, S_{1:n}) \propto |C_i| \int g(s \mid \theta^*)\, dH_{C_i}(\theta^*),$$

where |C_i| is the number of elements in C_i, and H_{C_i}(θ*) is the posterior distribution of θ* conditioned on G_0 and the set of observations {S_j : S_j ∈ C_i}. Given p, the weighting function is the probability that the hidden parameter for s would be θ_i, the hidden parameter for S_i,

$$w_n(s, S_i) \mid p = \sum_{j=1}^{n(p)} \frac{p_s(C_j \mid p)}{|C_j|} \mathbf{1}\{S_i \in C_j\}. \qquad (11)$$

Eq. (11) is conditioned on a partition structure, but the Dirichlet process produces a distribution over partition structures. Let π(p) be the prior distribution for partitions p and π(p|S_{0:n-1}) the posterior. Integrating over the partition posterior, we obtain unconditional weights,

$$w_n(s, S_i) = \sum_{p} \pi(p \mid S_{1:n}) \sum_{j=1}^{n(p)} \frac{p_s(C_j \mid p)}{|C_j|} \mathbf{1}\{S_i \in C_j\} \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{j=1}^{n(p^{(m)})} \frac{p_s(C_j \mid p^{(m)})}{|C_j|} \mathbf{1}\{S_i \in C_j\}. \qquad (12)$$

It is infeasible to integrate over all of the partitions; therefore, we approximate Eq. (12) by performing a Monte Carlo integration with M posterior partition samples, (p^{(m)})_{m=1}^{M}. 
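The Monte Carlo approximation of Eqs. (11)-(12) can be sketched as follows. This is a simplified illustration: it takes the posterior partition samples as given, and it approximates the per-cluster predictive density ∫ g(s|θ*) dH_C(θ*) by a Gaussian centered at the cluster's empirical mean with a fixed scale `sigma`, rather than integrating over the cluster's parameter posterior as the paper does.

```python
import numpy as np

def dp_weights(s, states, partitions, sigma=1.0):
    """Monte Carlo DP weights (Eq. 12) from posterior partition samples.

    partitions: list of cluster-label arrays, one per Gibbs sample.
    Assumes the simplified per-cluster predictive described above."""
    states = np.asarray(states, dtype=float)
    w = np.zeros(len(states))
    for labels in partitions:
        labels = np.asarray(labels)
        # p_s(C_j | p) proportional to |C_j| * g(s | cluster j).
        probs = {}
        for j in np.unique(labels):
            members = states[labels == j]
            g = np.exp(-0.5 * ((s - members.mean()) / sigma) ** 2)
            probs[j] = len(members) * g
        z = sum(probs.values())
        # Eq. (11): spread each cluster's probability evenly over members.
        for j, pj in probs.items():
            idx = labels == j
            w[idx] += (pj / z) / idx.sum()
    return w / len(partitions)
```

Because each partition contributes a probability vector that sums to one, the averaged weights also sum to one, and observations sharing a cluster with states near s receive the most weight.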
We obtain (p^{(m)})_{m=1}^{M} by generating M iid samples of the hidden parameters, θ_{0:n-1}, from the posterior of Eq. (9) with Gibbs sampling [11].

4 Empirical analysis

4.1 Multi-product constrained newsvendor problem

A multi-product newsvendor problem is a classic operations research inventory management problem. In the two product problem, a newsvendor is selling products A and B. She must decide how much of each product to stock in the face of random demand, D_A and D_B. A and B can be bought for (c_A, c_B) and sold for (p_A, p_B), respectively. Any inventory not sold is lost. Let (x_A, x_B) be the stocking decisions for A and B respectively; the decision is subject to a budget constraint, b_A x_A + b_B x_B ≤ b, and a storage constraint, r_A x_A + r_B x_B ≤ r. An observable state S = (S_1, S_2) contains information about D_A and D_B. The problem is,

$$\max_{x_A,\, x_B} \; -c_A x_A - c_B x_B + \mathbb{E}\left[p_A \min(x_A, D_A) + p_B \min(x_B, D_B) \mid S = s\right] \qquad (13)$$
$$\text{subject to: } b_A x_A + b_B x_B \le b, \qquad r_A x_A + r_B x_B \le r.$$

We generated data for Problem (13) in the following way. Demand and two state variables were generated in a jointly trimodal Gaussian mixture. The following methods were compared.
Function-based with kernel and Gradient-based with kernel. Bandwidth is selected according to the "rule of thumb" method of the np package for R, $h_j = 1.06\, \sigma_j\, n^{-1/(4+d)}$, where σ_j is defined as min(sd, interquartile range/1.349) [7].
Function-based with DP and Gradient-based with DP. We used the following hierarchical model,

$$P \sim DP(\alpha, G_0), \qquad \theta_i = (\mu_{i,s}, \sigma^2_{i,s}) \mid P \sim P, \qquad S_{i,j} \mid \theta_i \sim N(\mu_{i,s,j}, \sigma^2_{i,s,j}), \; j = 1, 2.$$

Figure 2: Gradient-based and function-based methods as a function of number of data points sampled. 
Results are averaged over 100 test problems with observed demand.

Posterior samples were drawn using Gibbs sampling with a fully collapsed sampler run for 500 iterations, with a 200 iteration burn-in and samples taken every 5 iterations.
Optimal. These are the optimal decisions with known mixing parameters and unknown components.
Results. Decisions were made under each regime over eight sample paths; 100 test state/demand pairs were fixed and decisions were made for these problems given the observed states/decisions in the sample path for each method. Results are given in Figure 2. The kernel and Dirichlet process weights performed approximately equally for each method, but the function-based methods converged more quickly than the gradient-based methods.

4.2 Hour ahead wind commitment

In the hour ahead wind commitment problem, a wind farm manager must decide how much energy to promise a utility an hour in advance, incorporating knowledge about the current state of the world. The decision is the amount of wind energy pledged, a scalar variable. If more energy is pledged than is generated, the difference must be bought on the spot market, which is expensive with a price that is unknown when the decision is made; otherwise, the excess is lost. The goal is to maximize expected revenue. The observable state variable is the time of day, time of year, wind history from the past two hours, contract price and current spot price:

T^D_i = time of day,    T^Y_i = time of year,
P^S_i = current spot price,    P^C_i = contract price,
W_i = current wind speed,    W_{i-1} = wind speed an hour ago,
S_i = observable state variable = (T^D_i, T^Y_i, P^S_i, P^C_i, W_i, W_{i-1}),
x_i = amount of energy pledged,
Y_{i+1}(x) = P^C_i x - P^S_{i+1} max(x - W_{i+1}, 0).

The revenue that the wind farm receives, Y_{i+1}(x), depends on the variables P^S_{i+1} and W_{i+1}, which are not known until the next hour. 
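The pledge revenue Y_{i+1}(x) above is simple to express directly; a minimal sketch:

```python
def wind_revenue(x, contract_price, spot_price_next, wind_next):
    """Y_{i+1}(x) = P^C_i * x - P^S_{i+1} * max(x - W_{i+1}, 0):
    energy pledged beyond what the farm generates must be bought back
    at the next hour's spot price; surplus wind earns nothing."""
    shortfall = max(x - wind_next, 0.0)
    return contract_price * x - spot_price_next * shortfall
```

The revenue is piecewise linear in x, so the weighted average of observed revenue functions used by the function-based method remains easy to optimize.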
We used wind speed data from the North American Land Data Assimilation System with hourly observations from 2002-2005 in the following locations: Amarillo, TX. Latitude: 35.125 N, Longitude: 101.50 W. The data have strong daily and seasonal patterns. The mean wind level is 186.29 (m/s)^3 with standard deviation 244.86. Tehachapi, CA. Latitude: 35.125 N, Longitude: 118.25 W. The data have strong seasonal patterns. The mean wind level is 89.45 (m/s)^3 with standard deviation 123.47.
Clean spot and contract price data for the time period were unavailable, so contract prices were generated by Gaussian random variables with a mean of 1 and variance of 0.10. Spot prices were generated by a mean-reverting (Ornstein-Uhlenbeck) process with a mean function that varies by time of day and time of year [18]. The data were analyzed separately for each location; they were divided by year, with one year used for training and the other three used for testing. The following methods were compared on this dataset:
Known wind. The wind is known, allowing maximum possible commitment, x_i = W_{i+1}(ω_{i+1}). 
It serves as an upper bound for all of the methods.

METHOD/LOCATION         2002            2003            2004            2005
TEHACHAPI, CA
KNOWN WIND              97.5            94.5            73.7            91.8
FUNCTION WITH KERNEL    78.8 (80.8%)    77.3 (81.8%)    58.9 (79.9%)    72.1 (78.5%)
FUNCTION WITH DP        85.1 (87.3%)    82.6 (87.4%)    63.9 (86.7%)    79.6 (86.7%)
IGNORE STATE            30.4 (31.1%)    31.1 (32.9%)    22.8 (30.9%)    29.3 (31.9%)
AMARILLO, TX
KNOWN WIND              186.0           175.2           184.9           175.2
FUNCTION WITH KERNEL    155.1 (83.4%)   149.6 (85.4%)   154.7 (83.7%)   146.2 (83.5%)
FUNCTION WITH DP        168.2 (90.4%)   160.6 (91.7%)   167.1 (90.4%)   159.4 (91.0%)
IGNORE STATE            70.3 (37.8%)    68.7 (39.2%)    69.6 (37.6%)    66.1 (37.7%)

Table 1: Mean values of decisions by method, year and data set. Percentages of the upper bound, Known Wind, are given for the other methods.

Function-based with kernel. Function-based optimization where the weights are generated by a Gaussian kernel. Bandwidth is selected according to the "rule of thumb" method of the np package for R, $h_j = 1.06\, \sigma_j\, n^{-1/(4+d)}$, where σ_j is defined as min(sd, interquartile range/1.349) [7].
Function-based with DP. Function-based optimization with Dirichlet process based weights. 
We model the state distribution with the following hierarchical model,

$$P \sim DP(\alpha, G_0), \qquad \theta_i \mid P \sim P,$$
$$\theta_i = (\mu_{i,D}, \mu_{i,Y}, \mu_{i,C}, \sigma^2_{i,C}, \mu_{i,S}, \sigma^2_{i,S}, \mu_{i,W1}, \sigma^2_{i,W1}, \mu_{i,W2}, \sigma^2_{i,W2}),$$
$$T^D_i \mid \theta_i \sim \mathrm{von\ Mises}(\mu_{i,D}, \phi_D), \qquad T^Y_i \mid \theta_i \sim \mathrm{von\ Mises}(\mu_{i,Y}, \phi_Y),$$
$$P^C_i \mid \theta_i \sim N(\mu_{i,C}, \sigma^2_{i,C}), \qquad P^S_i \mid \theta_i \sim N(\mu_{i,S}, \sigma^2_{i,S}),$$
$$W_i \mid \theta_i \sim N(\mu_{i,W1}, \sigma^2_{i,W1}), \qquad W_{i-1} \mid \theta_i \sim N(\mu_{i,W2}, \sigma^2_{i,W2}).$$

We modeled the time of day, T^D_i, and year, T^Y_i, with a von Mises distribution, an exponential family distribution over the unit sphere; the dispersion parameters, φ_D and φ_Y, are hyperparameters. The base measure was Normal-Inverse Gamma for P^C_i, P^S_i, W_i and W_{i-1} and uniform for the means of T^D_i and T^Y_i. 100 posterior samples were drawn using Gibbs sampling with a collapsed sampler for all conjugate dimensions after a 1,000 iteration burn-in and a 10 iteration pulse between samples.
Ignore state. Sample average approximation is used, $\bar{F}_n(x \mid s) = \frac{1}{n} \sum_{i=0}^{n-1} Y_{i+1}(x)$.
Results. Results are presented in Table 1. We display the value of each algorithm, along with percentages of Known Wind for the other three methods. Both forms of function-based optimization outperformed the algorithm in which the state variable was ignored by a large margin (≥ 45% of the best possible value). Dirichlet process weights outperformed kernel weights by a smaller but still significant margin (5.6-8.2% of the best possible value).

5 Discussion

We presented two new methods to solve stochastic optimization problems with an observable state variable, including state variables that are too large for partitioning. Our methods make minimal assumptions. 
They are promising additions to areas that rely on observational data to make decisions\nunder changing conditions (energy, \ufb01nance, dynamic pricing, inventory management), and some\ncommunities that make sequential decisions under uncertainty (reinforcement learning, stochastic\nprogramming, simulation optimization). Our methods can accommodate much larger state and de-\ncision spaces than MDPs and other table lookup methods, particularly when combined with Dirichlet\nprocess mixture model weights. Unlike existing objective function approximation methods, such as\nbasis functions, our methods provide convex objective function approximations that can be used\nwith a variety of commercial solvers.\n\nAcknowledgments\n\nThe research was funded in part by the Air Force Of\ufb01ce of Scienti\ufb01c Research under AFOSR con-\ntract FA9550-08-1-0195, and the NSF under grant CMMI-0856153. David M. Blei is supported by\nONR 175-6343, NSF CAREER 0745520, AFOSR-09NL202 and the Alfred P. Sloan foundation.\n\n8\n\n\fReferences\n\n[1] Antoniak, C. E. [1974], \u2018Mixtures of Dirichlet processes with applications to Bayesian non-\n\nparametric problems\u2019, The Annals of Statistics 2(6), 1152\u20131174.\n\n[2] Bennett, K. P. and Parrado-Hern\u00b4andez, E. [2006], \u2018The interplay of optimization and machine\n\nlearning research\u2019, The Journal of Machine Learning Research 7, 1265\u20131281.\n\n[3] Blackwell, D. and MacQueen, J. B. [1973], \u2018Ferguson distributions via Polya urn schemes\u2019,\n\nThe Annals Statistics 1(2), 353\u2013355.\n\n[4] Cheung, R. K. and Powell, W. B. [2000], \u2018SHAPE-A stochastic hybrid approximation proce-\n\ndure for two-stage stochastic programs\u2019, Operations Research 48(1), 73\u201379.\n\n[5] Fan, J. and Gijbels, I. [1996], Local Polynomial Modelling and Its Applications, Chapman &\n\nHall/CRC.\n\n[6] Ferguson, T. S. 
[1973], \u2018A Bayesian analysis of some nonparametric problems\u2019, The Annals of\n\nStatistics 1(2), 209\u2013230.\n\n[7] Hay\ufb01eld, T. and Racine, J. S. [2008], \u2018Nonparametric econometrics: The np package\u2019, Journal\n\nof Statistical Software 27(5), 1\u201332.\n\n[8] Ishwaran, H. and James, L. F. [2003], \u2018Generalized weighted Chinese restaurant processes for\n\nspecies sampling mixture models\u2019, Statistica Sinica 13(4), 1211\u20131236.\n\n[9] Kiefer, J. and Wolfowitz, J. [1952], \u2018Stochastic estimation of the maximum of a regression\n\nfunction\u2019, The Annals of Mathematical Statistics 23(3), 462\u2013466.\n\n[10] Nadaraya, E. A. [1964], \u2018On estimating regression\u2019, Theory of Probability and its Applications\n\n9(1), 141\u2013142.\n\n[11] Neal, R. M. [2000], \u2018Markov chain sampling methods for Dirichlet process mixture models\u2019,\n\nJournal of Computational and Graphical Statistics 9(2), 249\u2013265.\n\n[12] Parr, R., Painter-Wake\ufb01eld, C., Li, L. and Littman, M. [2007], Analyzing feature generation for\nvalue-function approximation, in \u2018Proceedings of the 24th international conference on Machine\nlearning\u2019, ACM, p. 744.\n\n[13] Pitman, J. [1996], \u2018Some developments of the Blackwell-MacQueen urn scheme\u2019, Lecture\n\nNotes-Monograph Series 30, 245\u2013267.\n\n[14] Powell, W. B. [2007], Approximate Dynamic Programming: Solving the curses of dimension-\n\nality, Wiley-Blackwell.\n\n[15] Powell, W. B., Ruszczy\u00b4nski, A. and Topaloglu, H. [2004], \u2018Learning algorithms for separa-\nble approximations of discrete stochastic optimization problems\u2019, Mathematics of Operations\nResearch 29(4), 814\u2013836.\n\n[16] Puterman, M. L. [1994], Markov decision processes: Discrete stochastic dynamic program-\n\nming, John Wiley & Sons, Inc. New York, NY, USA.\n\n[17] Robbins, H. and Monro, S. 
[1951], \u2018A stochastic approximation method\u2019, The Annals of Math-\n\nematical Statistics 22(3), 400\u2013407.\n\n[18] Schwartz, E. S. [1997], \u2018The stochastic behavior of commodity prices: Implications for valua-\n\ntion and hedging\u2019, The Journal of Finance 52(3), 923\u2013973.\n\n[19] Shapiro, A., Homem-de Mello, T. and Kim, J. [2002], \u2018Conditioning of convex piecewise linear\n\nstochastic programs\u2019, Mathematical Programming 94(1), 1\u201319.\n\n[20] Spall, J. C. [2003], Introduction to stochastic search and optimization: estimation, simulation,\n\nand control, John Wiley and Sons.\n\n[21] Sutton, R. S. and Barto, A. G. [1998], Introduction to reinforcement learning, MIT Press Cam-\n\nbridge, MA, USA.\n\n[22] Tsitsiklis, J. N. and Van Roy, B. [2001], \u2018Regression methods for pricing complex American-\n\nstyle options\u2019, IEEE Transactions on Neural Networks 12(4), 694\u2013703.\n\n[23] Watson, G. S. [1964], \u2018Smooth regression analysis\u2019, Sankhy\u00afa: The Indian Journal of Statistics,\n\nSeries A 26(4), 359\u2013372.\n\n9\n\n\f", "award": [], "sourceid": 483, "authors": [{"given_name": "Lauren", "family_name": "Hannah", "institution": null}, {"given_name": "Warren", "family_name": "Powell", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}