{"title": "Task-based End-to-end Model Learning in Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 5484, "page_last": 5494, "abstract": "With the increasing popularity of machine learning techniques, it has become common to see prediction algorithms operating within some larger process. However, the criteria by which we train these algorithms often differ from the ultimate criteria on which we evaluate them. This paper proposes an end-to-end approach for learning probabilistic machine learning models in a manner that directly captures the ultimate task-based objective for which they will be used, within the context of stochastic programming. We present three experimental evaluations of the proposed approach: a classical inventory stock problem, a real-world electrical grid scheduling task, and a real-world energy storage arbitrage task. We show that the proposed approach can outperform both traditional modeling and purely black-box policy optimization approaches in these applications.", "full_text": "Task-based End-to-end Model Learning\n\nin Stochastic Optimization\n\nPriya L. Donti\n\nDept. of Computer Science\n\nDept. of Engr. & Public Policy\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\npdonti@cs.cmu.edu\n\nBrandon Amos\n\nDept. of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nbamos@cs.cmu.edu\n\nJ. Zico Kolter\n\nDept. of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nzkolter@cs.cmu.edu\n\nAbstract\n\nWith the increasing popularity of machine learning techniques, it has become com-\nmon to see prediction algorithms operating within some larger process. However,\nthe criteria by which we train these algorithms often differ from the ultimate crite-\nria on which we evaluate them. 
This paper proposes an end-to-end approach for\nlearning probabilistic machine learning models in a manner that directly captures\nthe ultimate task-based objective for which they will be used, within the context\nof stochastic programming. We present three experimental evaluations of the pro-\nposed approach: a classical inventory stock problem, a real-world electrical grid\nscheduling task, and a real-world energy storage arbitrage task. We show that the\nproposed approach can outperform both traditional modeling and purely black-box\npolicy optimization approaches in these applications.\n\n1\n\nIntroduction\n\nWhile prediction algorithms commonly operate within some larger process, the criteria by which\nwe train these algorithms often differ from the ultimate criteria on which we evaluate them: the\nperformance of the full \u201cclosed-loop\u201d system on the ultimate task at hand. For instance, instead of\nmerely classifying images in a standalone setting, one may want to use these classi\ufb01cations within\nplanning and control tasks such as autonomous driving. While a typical image classi\ufb01cation algorithm\nmight optimize accuracy or log likelihood, in a driving task we may ultimately care more about the\ndifference between classifying a pedestrian as a tree vs. classifying a garbage can as a tree. Similarly,\nwhen we use a probabilistic prediction algorithm to generate forecasts of upcoming electricity demand,\nwe then want to use these forecasts to minimize the costs of a scheduling procedure that allocates\ngeneration for a power grid. As these examples suggest, instead of using a \u201cgeneric loss,\u201d we instead\nmay want to learn a model that approximates the ultimate task-based \u201ctrue loss.\u201d\nThis paper considers an end-to-end approach for learning probabilistic machine learning models\nthat directly capture the objective of their ultimate task. 
Formally, we consider probabilistic models\nin the context of stochastic programming, where the goal is to minimize some expected cost over\nthe models\u2019 probabilistic predictions, subject to some (potentially also probabilistic) constraints.\nAs mentioned above, it is common to approach these problems in a two-step fashion: \ufb01rst to \ufb01t a\npredictive model to observed data by minimizing some criterion such as negative log-likelihood,\nand then to use this model to compute or approximate the necessary expected costs in the stochastic\nprogramming setting. While this procedure can work well in many instances, it ignores the fact\nthat the true cost of the system (the optimization objective evaluated on actual instantiations in the\nreal world) may bene\ufb01t from a model that actually attains worse overall likelihood, but makes more\naccurate predictions over certain manifolds of the underlying space.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fWe propose to train a probabilistic model not (solely) for predictive accuracy, but so that\u2013when it is\nlater used within the loop of a stochastic programming procedure\u2013it produces solutions that minimize\nthe ultimate task-based loss. This formulation may seem somewhat counterintuitive, given that a\n\u201cperfect\u201d predictive model would of course also be the optimal model to use within a stochastic\nprogramming framework. However, the reality that all models do make errors illustrates that we\nshould indeed look to a \ufb01nal task-based objective to determine the proper error tradeoffs within a\nmachine learning setting. 
This paper proposes one way to evaluate task-based tradeoffs in a fully\nautomated fashion, by computing derivatives through the solution to the stochastic programming\nproblem in a manner that can improve the underlying model.\nWe begin by presenting background material and related work in areas spanning stochastic program-\nming, end-to-end training, and optimizing alternative loss functions. We then describe our approach\nwithin the formal context of stochastic programming, and give a generic method for propagating task\nloss through these problems in a manner that can update the models. We report on three experimental\nevaluations of the proposed approach: a classical inventory stock problem, a real-world electrical grid\nscheduling task, and a real-world energy storage arbitrage task. We show that the proposed approach\noutperforms traditional modeling and purely black-box policy optimization approaches.\n\n2 Background and related work\n\nStochastic programming Stochastic programming is a method for making decisions under un-\ncertainty by modeling or optimizing objectives governed by a random process. It has applications\nin many domains such as energy [1], \ufb01nance [2], and manufacturing [3], where the underlying\nprobability distributions are either known or can be estimated. Common considerations include how\nto best model or approximate the underlying random variable, how to solve the resulting optimization\nproblem, and how to then assess the quality of the resulting (approximate) solution [4].\nIn cases where the underlying probability distribution is known but the objective cannot be solved\nanalytically, it is common to use Monte Carlo sample average approximation methods, which\ndraw multiple iid samples from the underlying probability distribution and then use deterministic\noptimization methods to solve the resultant problems [5]. 
In cases where the underlying distribution\nis not known, it is common to learn or estimate some model from observed samples [6].\n\nEnd-to-end training Recent years have seen a dramatic increase in the number of systems building\non so-called \u201cend-to-end\u201d learning. Generally speaking, this term refers to systems where the end\ngoal of the machine learning process is directly predicted from raw inputs [e.g. 7, 8]. In the context\nof deep learning systems, the term now traditionally refers to architectures where, for example, there\nis no explicit encoding of hand-tuned features on the data, but the system directly predicts what the\nimage, text, etc. is from the raw inputs [9, 10, 11, 12, 13]. The context in which we use the term\nend-to-end is similar, but slightly more in line with its older usage: instead of (just) attempting to learn\nan output (with known and typically straightforward loss functions), we are speci\ufb01cally attempting to\nlearn a model based upon an end-to-end task that the user is ultimately trying to accomplish. We feel\nthat this concept\u2013of describing the entire closed-loop performance of the system as evaluated on the\nreal task at hand\u2013is bene\ufb01cial to add to the notion of end-to-end learning.\nAlso highly related to our work are recent efforts in end-to-end policy learning [14], using value\niteration effectively as an optimization procedure in similar networks [15], and multi-objective\noptimization [16, 17, 18, 19]. These lines of work \ufb01t more with the \u201cpure\u201d end-to-end approach\nwe discuss later on (where models are eschewed for pure function approximation methods), but\nconceptually the approaches have similar motivations in modifying typically-optimized policies to\naddress some task(s) directly. 
Of course, the actual methodological approaches are quite different,\ngiven our speci\ufb01c focus on stochastic programming as the black box of interest in our setting.\n\nOptimizing alternative loss functions There has been a great deal of work in recent years on\nusing machine learning procedures to optimize different loss criteria than those \u201cnaturally\u201d optimized\nby the algorithm. For example, Stoyanov et al. [20] and Hazan et al. [21] propose methods for\noptimizing loss criteria in structured prediction that are different from the inference procedure of\nthe prediction algorithm; this work has also recently been extended to deep networks [22]. Recent\nwork has also explored using auxiliary prediction losses to satisfy multiple objectives [23], learning\n\n2\n\n\fdynamics models that maximize control performance in Bayesian optimization [24], and learning\nadaptive predictive models via differentiation through a meta-learning optimization objective [25].\nThe work we have found in the literature that most closely resembles our approach is the work of\nBengio [26], which uses a neural network model for predicting \ufb01nancial prices, and then optimizes the\nmodel based on returns obtained via a hedging strategy that employs it. We view this approach\u2013of both\nusing a model and then tuning that model to adapt to a (differentiable) procedure\u2013as a philosophical\npredecessor to our own work.\nIn concurrent work, Elmachtoub and Grigas [27] also propose\nan approach for tuning model parameters given optimization results, but in the context of linear\nprogramming and outside the context of deep networks. 
Whereas Bengio [26] and Elmachtoub and Grigas [27] use hand-crafted (but differentiable) algorithms to approximately attain some objective given a predictive model, our approach is tightly coupled to stochastic programming, where the explicit objective is to attempt to optimize the desired task cost via an exact optimization routine, but given underlying randomness. The notions of stochasticity are thus naturally quite different in our work, but we do hope that our work can bring back the original idea of task-based model learning. (Despite Bengio [26]\u2019s original paper being nearly 20 years old, virtually all follow-on work has focused on the financial application, and not on what we feel is the core idea of using a surrogate model within a task-driven optimization procedure.)\n\n3 End-to-end model learning in stochastic programming\n\nWe first formally define the stochastic modeling and optimization problems with which we are concerned. Let (x \u2208 X, y \u2208 Y) \u223c D denote standard input-output pairs drawn from some (real, unknown) distribution D. We also consider actions z \u2208 Z that incur some expected loss L_D(z) = E_{x,y\u223cD}[f(x, y, z)]. For instance, a power systems operator may try to allocate power generators z given past electricity demand x and future electricity demand y; this allocation\u2019s loss corresponds to the over- or under-generation penalties incurred given future demand instantiations.\nIf we knew D, then we could select optimal actions z\u22c6_D = argmin_z L_D(z). However, in practice, the true distribution D is unknown. In this paper, we are interested in modeling the conditional distribution y|x using some parameterized model p(y|x; \u03b8) in order to minimize the real-world cost of the policy implied by this parameterization. 
Specifically, we find some parameters \u03b8 to parameterize p(y|x; \u03b8) (as in the standard statistical setting) and then determine optimal actions z\u22c6(x; \u03b8) (via stochastic optimization) that correspond to our observed input x and the specific choice of parameters \u03b8 in our probabilistic model. Upon observing the costs of these actions z\u22c6(x; \u03b8) relative to true instantiations of x and y, we update our parameterized model p(y|x; \u03b8) accordingly, calculate the resultant new z\u22c6(x; \u03b8), and repeat. The goal is to find parameters \u03b8 such that the corresponding policy z\u22c6(x; \u03b8) optimizes the loss under the true joint distribution of x and y.\nExplicitly, we wish to choose \u03b8 to minimize the task loss L(\u03b8) in the context of x, y \u223c D, i.e.\n\n$$\\underset{\\theta}{\\text{minimize}} \\;\\; L(\\theta) = \\mathbf{E}_{x,y \\sim \\mathcal{D}}[f(x, y, z^\\star(x; \\theta))]. \\tag{1}$$\n\nSince in reality we do not know the distribution D, we obtain z\u22c6(x; \u03b8) via a proxy stochastic optimization problem for a fixed instantiation of parameters \u03b8, i.e.\n\n$$z^\\star(x; \\theta) = \\underset{z}{\\operatorname{argmin}} \\;\\; \\mathbf{E}_{y \\sim p(y|x;\\theta)}[f(x, y, z)]. \\tag{2}$$\n\nThe above setting specifies z\u22c6(x; \u03b8) using a simple (unconstrained) stochastic program, but in reality our decision may be subject to both probabilistic and deterministic constraints. We therefore consider more general decisions produced through a generic stochastic programming problem\u00b9\n\n$$\\begin{aligned} z^\\star(x; \\theta) = \\underset{z}{\\operatorname{argmin}} \\;\\; & \\mathbf{E}_{y \\sim p(y|x;\\theta)}[f(x, y, z)] \\\\ \\text{subject to} \\;\\; & \\mathbf{E}_{y \\sim p(y|x;\\theta)}[g_i(x, y, z)] \\le 0, \\quad i = 1, \\ldots, n_{ineq} \\\\ & h_i(z) = 0, \\quad i = 1, \\ldots, n_{eq}. \\end{aligned} \\tag{3}$$\n\n\u00b9 It is standard to presume in stochastic programming that equality constraints depend only on decision variables (not random variables), as non-trivial random equality constraints are typically not possible to satisfy.\n\nIn this setting, the full task loss is more complex, since it captures both the expected cost and any deviations from the constraints. We can write this, for instance, as\n\n$$L(\\theta) = \\mathbf{E}_{x,y \\sim \\mathcal{D}}[f(x, y, z^\\star(x; \\theta))] + \\sum_{i=1}^{n_{ineq}} I\\{\\mathbf{E}_{x,y \\sim \\mathcal{D}}[g_i(x, y, z^\\star(x; \\theta))] \\le 0\\} + \\sum_{i=1}^{n_{eq}} \\mathbf{E}_x[I\\{h_i(z^\\star(x; \\theta)) = 0\\}] \\tag{4}$$\n\n(where I{\u00b7} is the indicator function that is zero when its constraints are satisfied and infinite otherwise). However, the basic intuition behind our approach remains the same for both the constrained and unconstrained cases: in both settings, we attempt to learn parameters of a probabilistic model not to produce strictly \u201caccurate\u201d predictions, but such that when we use the resultant model within a stochastic programming setting, the resulting decisions perform well under the true distribution.\nActually solving this problem requires that we differentiate through the \u201cargmin\u201d operator z\u22c6(x; \u03b8) of the stochastic programming problem. This differentiation is not possible for all classes of optimization problems (the argmin operator may be discontinuous), but as we will show shortly, in many practical cases\u2013including cases where the function and constraints are strongly convex\u2013we can indeed efficiently compute these gradients even in the context of constrained optimization.\n\n3.1 Discussion and alternative approaches\n\nWe highlight our approach in contrast to two alternative existing methods: traditional model learning and model-free black-box policy optimization. In traditional machine learning approaches, it is common to choose \u03b8 to minimize the (conditional) negative log-likelihood of observed data under the model p(y|x; \u03b8). This method corresponds to approximately solving the optimization problem\n\n$$\\underset{\\theta}{\\text{minimize}} \\;\\; \\mathbf{E}_{x,y \\sim \\mathcal{D}}[-\\log p(y|x; \\theta)]. \\tag{5}$$\n\nIf we then need to use the conditional distribution y|x to determine actions z within some later optimization setting, we commonly use the predictive model obtained from (5) directly. This approach has obvious advantages, in that the model-learning phase is well-justified independent of any future use in a task. However, it is also prone to poor performance in the common setting where the true distribution y|x cannot be represented within the class of distributions parameterized by \u03b8, i.e. where the procedure suffers from model bias. Conceptually, the log-likelihood objective implicitly trades off between model error in different regions of the input/output space, but does so in a manner largely opaque to the modeler, and may ultimately not employ the correct tradeoffs for a given task.\nIn contrast, there is an alternative approach to solving (1) that we describe as the model-free \u201cblack-box\u201d policy optimization approach. Here, we forgo learning any model at all of the random variable y. Instead, we attempt to learn a policy mapping directly from inputs x to actions z\u22c6(x; \u03b8\u0304) that minimize the loss L(\u03b8\u0304) presented in (4) (where here \u03b8\u0304 defines the form of the policy itself, not a predictive model). 
While such model-free methods can perform well in many settings, they are often very data-inefficient, as the policy class must have enough representational power to describe sufficiently complex policies without recourse to any underlying model.\u00b2\nOur approach offers an intermediate setting, where we do still use a surrogate model to determine an optimal decision z\u22c6(x; \u03b8), yet we adapt this model based on the task loss instead of any model prediction accuracy. In practice, we typically want to minimize some weighted combination of log-likelihood and task loss, which can be easily accomplished given our approach.\n\n\u00b2 This distinction is roughly analogous to the policy search vs. model-based settings in reinforcement learning. However, for the purposes of this paper, we consider much simpler stochastic programs without the multiple rounds that occur in RL, and the extension of these techniques to a full RL setting remains as future work.\n\n[Figure 1 panels: (a) Inventory stock problem: features x \u2208 \u211d^n (randomly generated) \u2192 predicted demand p(y|x; \u03b8) (uncertain; discrete) \u2192 newspaper stocking decision z \u2208 \u211d. (b) Load forecasting problem: past demand, past temperature, temporal features x \u2192 predicted demand p(y|x; \u03b8) (with uncertainty) \u2192 generation schedule z. (c) Price forecasting problem: past prices, past temperature, temporal features, load forecasts x \u2192 predicted prices p(y|x; \u03b8) (with uncertainty) \u2192 battery schedule z.]\n\nFigure 1: Features x, model predictions y, and policy z for the three experiments.\n\n3.2 Optimizing task loss\n\nAlgorithm 1 Task Loss Optimization\n1: input: D // samples from true distribution\n2: initialize \u03b8 // some initial parameterization\n3: for t = 1, . . . , T do\n4:   sample (x, y) \u223c D\n5:   compute z\u22c6(x; \u03b8) via Equation (3)\n6:   // step in violated constraint or objective\n7:   if \u2203i s.t. g_i(x, y, z\u22c6(x; \u03b8)) > 0 then\n8:     update \u03b8 with \u2207_\u03b8 g_i(x, y, z\u22c6(x; \u03b8))\n9:   else\n10:     update \u03b8 with \u2207_\u03b8 f(x, y, z\u22c6(x; \u03b8))\n11:   end if\n12: end for\n\nTo solve the generic optimization problem (4), we can in principle adopt a straightforward (constrained) stochastic gradient approach, as detailed in Algorithm 1. At each iteration, we solve the proxy stochastic programming problem (3) to obtain z\u22c6(x; \u03b8), using the distribution defined by our current values of \u03b8. Then, we compute the true loss L(\u03b8) using the observed value of y. If any of the inequality constraints g_i in L(\u03b8) are violated, we take a gradient step in the violated constraint; otherwise, we take a gradient step in the optimization objective f. We note that if any inequality constraints are probabilistic, Algorithm 1 must be adapted to employ mini-batches in order to determine whether these probabilistic constraints are satisfied. Alternatively, because even the g_i constraints are probabilistic, it is common in practice to simply move a weighted version of these constraints to the objective, i.e., we modify the objective by adding some appropriate penalty times the positive part of the function, \u03bb g_i(x, y, z)_+, for some \u03bb > 0. In practice, this has the effect of taking gradient steps jointly in all the violated constraints and the objective in the case that one or more inequality constraints are violated, often resulting in faster convergence. 
Note that we need only move stochastic constraints into the objective; deterministic constraints on the policy itself will always be satisfied by the optimizer, as they are independent of the model.\n\n3.3 Differentiating the optimization solution to a stochastic programming problem\n\nWhile the above presentation highlights the simplicity of the proposed approach, it sidesteps the chief technical challenge of this approach, which is computing the gradient of an objective that depends upon the argmin operation z\u22c6(x; \u03b8). Specifically, we need to compute the term\n\n$$\\frac{\\partial L}{\\partial \\theta} = \\frac{\\partial L}{\\partial z^\\star} \\frac{\\partial z^\\star}{\\partial \\theta}, \\tag{6}$$\n\nwhich involves the Jacobian \u2202z\u22c6/\u2202\u03b8. This is the Jacobian of the optimal solution with respect to the distribution parameters \u03b8. Recent approaches have looked into similar argmin differentiations [28, 29], though the methodology we present here is more general and handles the stochasticity of the objective.\nAt a high level, we begin by writing the KKT optimality conditions of the general stochastic programming problem (3). 
Differentiating these equations and applying the implicit function theorem gives a set of linear equations that we can solve to obtain the necessary Jacobians (with expectations over the distribution y \u223c p(y|x; \u03b8) denoted E_{y_\u03b8}, and where g is the vector of inequality constraints)\n\n$$\\begin{bmatrix} \\nabla_z^2 \\mathbf{E}_{y_\\theta} f(z) + \\sum_{i=1}^{n_{ineq}} \\lambda_i \\nabla_z^2 \\mathbf{E}_{y_\\theta} g_i(z) & (\\nabla_z \\mathbf{E}_{y_\\theta} g(z))^T & A^T \\\\ \\operatorname{diag}(\\lambda) (\\nabla_z \\mathbf{E}_{y_\\theta} g(z)) & \\operatorname{diag}(\\mathbf{E}_{y_\\theta} g(z)) & 0 \\\\ A & 0 & 0 \\end{bmatrix} \\begin{bmatrix} \\frac{\\partial z}{\\partial \\theta} \\\\ \\frac{\\partial \\lambda}{\\partial \\theta} \\\\ \\frac{\\partial \\nu}{\\partial \\theta} \\end{bmatrix} = - \\begin{bmatrix} \\frac{\\partial \\nabla_z \\mathbf{E}_{y_\\theta} f(z)}{\\partial \\theta} + \\frac{\\partial \\sum_{i=1}^{n_{ineq}} \\lambda_i \\nabla_z \\mathbf{E}_{y_\\theta} g_i(z)}{\\partial \\theta} \\\\ \\operatorname{diag}(\\lambda) \\frac{\\partial \\mathbf{E}_{y_\\theta} g(z)}{\\partial \\theta} \\\\ 0 \\end{bmatrix}. \\tag{7}$$\n\nThe terms in these equations look somewhat complex, but fundamentally, the left side gives the optimality conditions of the convex problem, and the right side gives the derivatives of the relevant functions at the achieved solution with respect to the governing parameter \u03b8. In practice, we calculate the right-hand terms by employing sequential quadratic programming [30] to find the optimal policy z\u22c6(x; \u03b8) for the given parameters \u03b8, using a recently-proposed approach for fast solution of the argmin differentiation for QPs [31] to solve the necessary linear equations; we then take the derivatives at the optimum produced by this strategy. Details of this approach are described in the appendix.\n\n4 Experiments\n\nWe consider three applications of our task-based method: a synthetic inventory stock problem, a real-world energy scheduling task, and a real-world battery arbitrage task. We demonstrate that the task-based end-to-end approach can substantially improve upon other alternatives. 
Source code for all experiments is available at https://github.com/locuslab/e2e-model-learning.\n\n4.1 Inventory stock problem\n\nProblem definition To highlight the performance of the algorithm in a setting where the true underlying model is known to us, we consider a \u201cconditional\u201d variation of the classical inventory stock problem [4]. In this problem, a company must order some quantity z of a product to minimize costs over some stochastic demand y, whose distribution in turn is affected by some observed features x (Figure 1a). There are linear and quadratic costs on the amount of product ordered, plus different linear/quadratic costs on over-orders [z \u2212 y]_+ and under-orders [y \u2212 z]_+. The objective is given by\n\n$$f_{stock}(y, z) = c_0 z + \\frac{1}{2} q_0 z^2 + c_b [y - z]_+ + \\frac{1}{2} q_b ([y - z]_+)^2 + c_h [z - y]_+ + \\frac{1}{2} q_h ([z - y]_+)^2, \\tag{8}$$\n\nwhere [v]_+ \u2261 max{v, 0}. For a specific choice of probability model p(y|x; \u03b8), our proxy stochastic programming problem can then be written as\n\n$$\\underset{z}{\\text{minimize}} \\;\\; \\mathbf{E}_{y \\sim p(y|x;\\theta)}[f_{stock}(y, z)]. \\tag{9}$$\n\nTo simplify the setting, we further assume that the demands are discrete, taking on values d_1, . . . , d_k with probabilities (conditional on x) (p_\u03b8)_i \u2261 p(y = d_i|x; \u03b8). Thus our stochastic programming problem (9) can be written succinctly as a joint quadratic program\u00b3\n\n$$\\begin{aligned} \\underset{z \\in \\mathbb{R},\\; z_b, z_h \\in \\mathbb{R}^k}{\\text{minimize}} \\;\\; & c_0 z + \\frac{1}{2} q_0 z^2 + \\sum_{i=1}^{k} (p_\\theta)_i \\left( c_b (z_b)_i + \\frac{1}{2} q_b (z_b)_i^2 + c_h (z_h)_i + \\frac{1}{2} q_h (z_h)_i^2 \\right) \\\\ \\text{subject to} \\;\\; & d - z\\mathbf{1} \\le z_b, \\quad z\\mathbf{1} - d \\le z_h, \\quad z, z_h, z_b \\ge 0. \\end{aligned} \\tag{10}$$\n\nFurther details of this approach are given in the appendix.\n\nExperimental setup We examine our algorithm under two main conditions: where the true model is linear, and where it is nonlinear. 
In all cases, we generate problem instances by randomly sampling some x \u2208 \u211d^n and then generating p(y|x; \u03b8) according to either p(y|x; \u03b8) \u221d exp(\u0398^T x) (linear true model) or p(y|x; \u03b8) \u221d exp((\u0398^T x)\u00b2) (nonlinear true model) for some \u0398 \u2208 \u211d^{n\u00d7k}. We compare the following approaches on these tasks: 1) the QP allocation based upon the true model (which performs optimally); 2) MLE approaches (with linear or nonlinear probability models) that fit a model to the data, and then compute the allocation by solving the QP; 3) pure end-to-end policy-optimizing models (using linear or nonlinear hypotheses for the policy); and 4) our task-based learning models (with linear or nonlinear probability models). In all cases, we evaluate test performance by running on 1000 random examples, and evaluate performance over 10 folds of different true \u03b8\u22c6 parameters.\nFigures 2(a) and (b) show the performance of these methods given a linear true model, with linear and nonlinear model hypotheses, respectively. As expected, the linear MLE approach performs best, as the true underlying model is in the class of distributions that it can represent and thus solving the stochastic programming problem is a very strong proxy for solving the true optimization problem under the real distribution. While the true model is also contained within the nonlinear MLE\u2019s generic nonlinear distribution class, we see that this method requires more data to converge, and when given less data makes error tradeoffs that are ultimately not the correct tradeoffs for the task at hand; our task-based approach thus outperforms this approach. The task-based approach also substantially outperforms the policy-optimizing neural network, highlighting the fact that it is more data-efficient to run the learning process \u201cthrough\u201d a reasonable model. 
Note that here it does not make a difference whether we use the linear or nonlinear model in the task-based approach.\n\n\u00b3 This is referred to as a two-stage stochastic programming problem (though a very trivial example of one), where first stage variables consist of the amount of product to buy before observing demand, and second-stage variables consist of how much to sell back or additionally purchase once the true demand has been revealed.\n\nFigure 2: Inventory problem results for 10 runs over a representative instantiation of true parameters (c_0 = 10, q_0 = 2, c_b = 30, q_b = 14, c_h = 10, q_h = 2). Cost is evaluated over 1000 testing samples (lower is better). The linear MLE performs best for a true linear model. In all other cases, the task-based models outperform their MLE and policy counterparts.\n\nFigures 2(c) and (d) show performance in the case of a nonlinear true model, with linear and nonlinear model hypotheses, respectively. Case (c) represents the \u201cnon-realizable\u201d case, where the true underlying distribution cannot be represented by the model hypothesis class. Here, the linear MLE, as expected, performs very poorly: it cannot capture the true underlying distribution, and thus the resultant stochastic programming solution would not be expected to perform well. The linear policy model similarly performs poorly. Importantly, the task-based approach with the linear model performs much better here: despite the fact that it still has a misspecified model, the task-based nature of the learning process lets us learn a different linear model than the MLE version, which is particularly tuned to the distribution and loss of the task. 
Finally, as expected, the nonlinear models perform better than the linear models in this scenario, with the task-based nonlinear model again outperforming the nonlinear MLE and end-to-end policy approaches.\n\n4.2 Load forecasting and generator scheduling\n\nWe next consider a more realistic grid-scheduling task, based upon over 8 years of real electrical grid data. In this setting, a power system operator must decide how much electricity generation z \u2208 \u211d^24 to schedule for each hour in the next 24 hours based on some (unknown) distribution over electricity demand (Figure 1b). Given a particular realization y of demand, we impose penalties for both generation excess (\u03b3_e) and generation shortage (\u03b3_s), with \u03b3_s \u226b \u03b3_e. We also add a quadratic regularization term, indicating a preference for generation schedules that closely match demand realizations. Finally, we impose a ramping constraint c_r restricting the change in generation between consecutive timepoints, reflecting physical limitations associated with quick changes in electricity output levels. These are reasonable proxies for the actual economic costs incurred by electrical grid operators when scheduling generation, and can be written as the stochastic programming problem\n\n$$\\begin{aligned} \\underset{z \\in \\mathbb{R}^{24}}{\\text{minimize}} \\;\\; & \\sum_{i=1}^{24} \\mathbf{E}_{y \\sim p(y|x;\\theta)} \\left[ \\gamma_s [y_i - z_i]_+ + \\gamma_e [z_i - y_i]_+ + \\frac{1}{2} (z_i - y_i)^2 \\right] \\\\ \\text{subject to} \\;\\; & |z_i - z_{i-1}| \\le c_r \\;\\; \\forall i, \\end{aligned} \\tag{11}$$\n\nwhere [v]_+ \u2261 max{v, 0}. Assuming (as we will in our model) that y_i is a Gaussian random variable with mean \u00b5_i and variance \u03c3_i\u00b2, this expectation has a closed form that can be computed by analytically integrating the Gaussian PDF.\u2074 We then use sequential quadratic programming (SQP) to iteratively approximate the resultant convex objective as a quadratic objective, iterate until convergence, and then compute the necessary Jacobians using the quadratic approximation at the solution, which gives the correct Hessian and gradient terms. 
Details are given in the appendix.\n\n\u2074 Part of the philosophy behind applying this approach here is that we know the Gaussian assumption is incorrect: the true underlying load is neither Gaussian distributed nor homoskedastic. However, these assumptions are exceedingly common in practice, as they enable easy model learning and exact analytical solutions. Thus, training the (still Gaussian) system with a task-based loss retains computational tractability while still allowing us to modify the distribution\u2019s parameters to improve actual performance on the task at hand.\n\nFigure 4: Results for 10 runs of the generation-scheduling problem for representative decision parameters \u03b3_e = 0.5, \u03b3_s = 50, and c_r = 0.4. (Lower loss is better.) As expected, the RMSE net achieves the lowest RMSE for its predictions. However, the task net outperforms the RMSE net on task loss by 38.6%, and the cost-weighted RMSE on task loss by 8.6%.\n\n[Figure 3 diagram: inputs x (past load, past temp, (past temp)\u00b2, future temp, (future temp)\u00b2, (future temp)\u00b3, indicators for weekday, holiday, and DST, and sin/cos features of the day of year) \u2192 two hidden layers of 200 units each \u2192 future load y \u2208 \u211d^24.]\n\nTo develop a predictive model, we make use of a highly-tuned load forecasting methodology. Specifically, we input the past day\u2019s electrical load and temperature, the next day\u2019s temperature forecast, and additional features such as non-linear functions of the temperatures, binary indicators of weekends or holidays, and yearly sinusoidal features. We then predict the electrical load over all 24 hours of the next day. We employ a 2-hidden-layer neural network for this purpose, with an additional residual connection from the inputs to the outputs initialized to the linear regression solution. An illustration of the architecture is shown in Figure 3. 
We train the model to minimize the mean\nsquared error between its predictions and the actual\nload (giving the mean prediction $\\mu_i$), and compute\n$\\sigma_i^2$ as the (constant) empirical variance between the\npredicted and actual values. In all cases we use 7\nyears of data to train the model, and 1.75 subsequent\nyears for testing.\nUsing the (mean and variance) predictions of this\nbase model, we obtain $z^\\star(x; \\theta)$ by solving the generator\nscheduling problem (11) and then adjusting\nnetwork parameters to minimize the resultant task\nloss. We compare against a traditional stochastic\nprogramming model that minimizes just the RMSE,\nas well as a cost-weighted RMSE that periodically\nreweights training samples given their task loss (see footnote 5). (A\npure policy-optimizing network is not shown, as it could not sufficiently learn the ramp constraints.\nWe could not obtain good performance for the policy optimizer even ignoring this infeasibility.)\nFigure 4 shows the performance of the three models. As expected, the RMSE model performs\nbest with respect to the RMSE of its predictions (its objective). However, the task-based model\nsubstantially outperforms the RMSE model when evaluated on task loss, the actual objective that\nthe system operator cares about: specifically, we improve upon the performance of the traditional\nstochastic programming method by 38.6%. The cost-weighted RMSE\u2019s performance is extremely\nvariable, and overall, the task net improves upon this method by 8.6%.\n\nFigure 3: 2-hidden-layer neural network to\npredict hourly electric load for the next day.\n\n4.3 Price forecasting and battery storage\n\nFinally, we consider a battery arbitrage task, based upon 6 years of real electrical grid data. Here, a\ngrid-scale battery must operate over a 24 hour period based on some (unknown) distribution over\nfuture electricity prices (Figure 1c). 
For each hour, the operator must decide how much to charge\n($z_{\\text{in}} \\in \\mathbb{R}^{24}$) or discharge ($z_{\\text{out}} \\in \\mathbb{R}^{24}$) the battery, thus inducing a particular state of charge in the\nbattery ($z_{\\text{state}} \\in \\mathbb{R}^{24}$). Given a particular realization $y$ of prices, the operator optimizes over: 1)\nprofits, 2) flexibility to participate in other markets, by keeping the battery near half its capacity $B$\n(with weight $\\lambda$), and 3) battery health, by discouraging rapid charging/discharging (with weight $\\epsilon$,\n$\\epsilon < \\lambda$). The battery also has a charging efficiency ($\\gamma_{\\text{eff}}$), limits on speed of charge ($c_{\\text{in}}$) and discharge\n($c_{\\text{out}}$), and begins at half charge. This can be written as the stochastic programming problem\n\n$$\\begin{aligned} \\underset{z_{\\text{in}}, z_{\\text{out}}, z_{\\text{state}} \\in \\mathbb{R}^{24}}{\\text{minimize}} \\quad & \\mathbb{E}_{y \\sim p(y|x;\\theta)} \\left[ \\sum_{i=1}^{24} y_i (z_{\\text{in}} - z_{\\text{out}})_i + \\lambda \\left\\| z_{\\text{state}} - \\frac{B}{2} \\right\\|^2 + \\epsilon \\| z_{\\text{in}} \\|^2 + \\epsilon \\| z_{\\text{out}} \\|^2 \\right] \\\\ \\text{subject to} \\quad & z_{\\text{state},i+1} = z_{\\text{state},i} - z_{\\text{out},i} + \\gamma_{\\text{eff}} z_{\\text{in},i} \\;\\; \\forall i, \\quad z_{\\text{state},1} = B/2, \\\\ & 0 \\le z_{\\text{in}} \\le c_{\\text{in}}, \\quad 0 \\le z_{\\text{out}} \\le c_{\\text{out}}, \\quad 0 \\le z_{\\text{state}} \\le B. \\end{aligned} \\tag{12}$$\n\nAssuming (as we will in our model) that $y_i$ is a random variable with mean $\\mu_i$, then this expectation\nhas a closed form that depends only on the mean. Further details are given in the appendix.\n\nFootnote 5: It is worth noting that a cost-weighted RMSE approach is only possible when direct costs can be assigned\nindependently to each decision point, i.e. when costs do not depend on multiple decision points (as in this\nexperiment). Our task-based method, however, accommodates the (typical) more general setting.\n\nTo develop a predictive model for the mean, we use an architecture similar to that described in\nSection 4.2. In this case, we input the past day\u2019s prices and temperature, the next day\u2019s load forecasts\nand temperature forecasts, and additional features such as non-linear functions of the temperatures\nand temporal features similar to those in Section 4.2. We again train the model to minimize the\nmean squared error between the model\u2019s predictions and the actual prices (giving the mean prediction\n$\\mu_i$), using about 5 years of data to train the model and 1 subsequent year for testing. Using the\nmean predictions of this base model, we then solve the storage scheduling problem by solving the\noptimization problem (12), again learning network parameters by minimizing the task loss. We\ncompare against a traditional stochastic programming model that minimizes just the RMSE.\n\nTable 1: Task loss results for 10 runs each of the battery storage problem, given a lithium-ion battery\nwith attributes $B = 1$, $\\gamma_{\\text{eff}} = 0.9$, $c_{\\text{in}} = 0.5$, and $c_{\\text{out}} = 0.2$. (Lower loss is better.) Our task-based net\non average somewhat improves upon the RMSE net, and demonstrates more reliable performance.\n\n$\\lambda$ | $\\epsilon$ | RMSE net | Task-based net (our method) | % Improvement\n0.1 | 0.05 | -1.45 \u00b1 4.67 | -2.92 \u00b1 0.30 | 1.02\n1 | 0.5 | 4.96 \u00b1 4.85 | 2.28 \u00b1 2.99 | 0.54\n10 | 5 | 131 \u00b1 145 | 95.9 \u00b1 29.8 | 0.27\n35 | 15 | 173 \u00b1 7.38 | 170 \u00b1 2.16 | 0.02\n\nTable 1 shows the performance of the two models. As energy prices are difficult to predict due\nto numerous outliers and price spikes, the models in this case are not as well-tuned as in our load\nforecasting experiment; thus, their performance is relatively variable. Even so, in all cases, our\ntask-based model demonstrates better average performance than the RMSE model when evaluated\non task loss, the objective most important to the battery operator (although the improvements are\nnot statistically significant). More interestingly, our task-based method shows less (and in some\ncases, far less) variability in performance than the RMSE-minimizing method. 
Qualitatively, our\ntask-based method hedges against perverse events such as price spikes that could substantially affect\nthe performance of a battery charging schedule. The task-based method thus yields more reliable\nperformance than a pure RMSE-minimizing method when the models are inaccurate due to a\nhigh level of stochasticity in the prediction task.\n\n5 Conclusions and future work\n\nThis paper proposes an end-to-end approach for learning machine learning models that will be used in\nthe loop of a larger process. Specifically, we consider training probabilistic models in the context of\nstochastic programming to directly capture a task-based objective. Preliminary experiments indicate\nthat our task-based learning model substantially outperforms MLE and policy-optimizing approaches\nin all but the (rare) case that the MLE model \u201cperfectly\u201d characterizes the underlying distribution.\nOur method also achieves a 38.6% performance improvement over a highly-optimized real-world\nstochastic programming algorithm for scheduling electricity generation based on predicted load.\nIn the case of energy price prediction, where there is a high degree of inherent stochasticity in\nthe problem, our method demonstrates more reliable task performance than a traditional predictive\nmethod. The task-based approach thus demonstrates promise in optimizing in-the-loop predictions.\nFuture work includes an extension of our approach to stochastic learning models with multiple rounds,\nand further to model predictive control and full reinforcement learning settings.\n\nAcknowledgments\n\nThis material is based upon work supported by the National Science Foundation Graduate Research\nFellowship Program under Grant No. DGE1252522, and by the Department of Energy Computational\nScience Graduate Fellowship.\n\nReferences\n\n[1] Stein W Wallace and Stein-Erik Fleten. Stochastic programming models in energy. 
Handbooks in operations research and management science, 10:637\u2013677, 2003.\n\n[2] William T Ziemba and Raymond G Vickson. Stochastic optimization models in finance, volume 1. World Scientific, 2006.\n\n[3] John A Buzacott and J George Shanthikumar. Stochastic models of manufacturing systems, volume 4. Prentice Hall Englewood Cliffs, NJ, 1993.\n\n[4] Alexander Shapiro and Andy Philpott. A tutorial on stochastic programming. Manuscript. Available at www2.isye.gatech.edu/ashapiro/publications.html, 17, 2007.\n\n[5] Jeff Linderoth, Alexander Shapiro, and Stephen Wright. The empirical behavior of sampling methods for stochastic programming. Annals of Operations Research, 142(1):215\u2013241, 2006.\n\n[6] R Tyrrell Rockafellar and Roger J-B Wets. Scenarios and policy aggregation in optimization under uncertainty. Mathematics of operations research, 16(1):119\u2013147, 1991.\n\n[7] Yann LeCun, Urs Muller, Jan Ben, Eric Cosatto, and Beat Flepp. Off-road obstacle avoidance through end-to-end learning. In NIPS, pages 739\u2013746, 2005.\n\n[8] Ryan W Thomas, Daniel H Friend, Luiz A Dasilva, and Allen B Mackenzie. Cognitive networks: adaptation and learning to achieve end-to-end performance objectives. IEEE Communications Magazine, 44(12):51\u201357, 2006.\n\n[9] Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1457\u20131464. IEEE, 2011.\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, 2016.\n\n[11] Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3304\u20133308. IEEE, 2012.\n\n[12] Alex Graves and Navdeep Jaitly. 
Towards end-to-end speech recognition with recurrent neural networks. In ICML, volume 14, pages 1764\u20131772, 2014.\n\n[13] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. arXiv preprint arXiv:1512.02595, 2015.\n\n[14] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1\u201340, 2016.\n\n[15] Aviv Tamar, Sergey Levine, Pieter Abbeel, Yi Wu, and Garrett Thomas. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2146\u20132154, 2016.\n\n[16] Ken Harada, Jun Sakuma, and Shigenobu Kobayashi. Local search for multiobjective function optimization: pareto descent method. In Proceedings of the 8th annual conference on Genetic and evolutionary computation, pages 659\u2013666. ACM, 2006.\n\n[17] Kristof Van Moffaert and Ann Now\u00e9. Multi-objective reinforcement learning using sets of pareto dominating policies. Journal of Machine Learning Research, 15(1):3483\u20133512, 2014.\n\n[18] Hossam Mossalam, Yannis M Assael, Diederik M Roijers, and Shimon Whiteson. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.\n\n[19] Marco A Wiering, Maikel Withagen, and M\u0103d\u0103lina M Drugan. Model-based multi-objective reinforcement learning. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014 IEEE Symposium on, pages 1\u20136. IEEE, 2014.\n\n[20] Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. International Conference on Artificial Intelligence and Statistics, 15:725\u2013733, 2011. 
ISSN 15324435.\n\n[21] Tamir Hazan, Joseph Keshet, and David A McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pages 1594\u20131602, 2010.\n\n[22] Yang Song, Alexander G Schwing, Richard S Zemel, and Raquel Urtasun. Training deep neural networks via direct loss minimization. In Proceedings of The 33rd International Conference on Machine Learning, pages 2169\u20132177, 2016.\n\n[23] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.\n\n[24] Somil Bansal, Roberto Calandra, Ted Xiao, Sergey Levine, and Claire J Tomlin. Goal-driven dynamics learning via bayesian optimization. arXiv preprint arXiv:1703.09260, 2017.\n\n[25] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.\n\n[26] Yoshua Bengio. Using a financial training criterion rather than a prediction criterion. International Journal of Neural Systems, 8(04):433\u2013443, 1997.\n\n[27] Adam N Elmachtoub and Paul Grigas. Smart \"predict, then optimize\". arXiv preprint arXiv:1710.08005, 2017.\n\n[28] Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.\n\n[29] Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. arXiv preprint arXiv:1609.07152, 2016.\n\n[30] Paul T Boggs and Jon W Tolle. Sequential quadratic programming. Acta numerica, 4:1\u201351, 1995.\n\n[31] Brandon Amos and J Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017.\n\n[32] Sergey Ioffe and Christian Szegedy. 
Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[33] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.", "award": [], "sourceid": 2825, "authors": [{"given_name": "Priya", "family_name": "Donti", "institution": "Carnegie Mellon University"}, {"given_name": "Brandon", "family_name": "Amos", "institution": "Carnegie Mellon University"}, {"given_name": "J. Zico", "family_name": "Kolter", "institution": "Carnegie Mellon University"}]}