{"title": "Differentiable Convex Optimization Layers", "book": "Advances in Neural Information Processing Systems", "page_first": 9562, "page_last": 9574, "abstract": "Recent work has shown how to embed differentiable optimization problems (that is, problems whose solutions can be backpropagated through) as layers within deep learning architectures. This method provides a useful inductive bias for certain problems, but existing software for differentiable optimization layers is rigid and difficult to apply to new settings. In this paper, we propose an approach to differentiating through disciplined convex programs, a subclass of convex optimization problems used by domain-specific languages (DSLs) for convex optimization. We introduce disciplined parametrized programming, a subset of disciplined convex programming, and we show that every disciplined parametrized program can be represented as the composition of an affine map from parameters to problem data, a solver, and an affine map from the solver\u2019s solution to a solution of the original problem (a new form we refer to as affine-solver-affine form). We then demonstrate how to efficiently differentiate through each of these components, allowing for end-to-end analytical differentiation through the entire convex program. We implement our methodology in version 1.1 of CVXPY, a popular Python-embedded DSL for convex optimization, and additionally implement differentiable layers for disciplined convex programs in PyTorch and TensorFlow 2.0. Our implementation significantly lowers the barrier to using convex optimization problems in differentiable programs. 
We present applications in linear machine learning models and in stochastic control, and we show that our layer is competitive (in execution time) compared to specialized differentiable solvers from past work.", "full_text": "Differentiable Convex Optimization Layers

Akshay Agrawal (Stanford University) akshayka@cs.stanford.edu
Brandon Amos (Facebook AI) bda@fb.com
Shane Barratt (Stanford University) sbarratt@stanford.edu
Stephen Boyd (Stanford University) boyd@stanford.edu
Steven Diamond (Stanford University) diamond@cs.stanford.edu
J. Zico Kolter* (Carnegie Mellon University, Bosch Center for AI) zkolter@cs.cmu.edu

Abstract

Recent work has shown how to embed differentiable optimization problems (that is, problems whose solutions can be backpropagated through) as layers within deep learning architectures. This method provides a useful inductive bias for certain problems, but existing software for differentiable optimization layers is rigid and difficult to apply to new settings. In this paper, we propose an approach to differentiating through disciplined convex programs, a subclass of convex optimization problems used by domain-specific languages (DSLs) for convex optimization. We introduce disciplined parametrized programming, a subset of disciplined convex programming, and we show that every disciplined parametrized program can be represented as the composition of an affine map from parameters to problem data, a solver, and an affine map from the solver's solution to a solution of the original problem (a new form we refer to as affine-solver-affine form). We then demonstrate how to efficiently differentiate through each of these components, allowing for end-to-end analytical differentiation through the entire convex program.
We implement our methodology in version 1.1 of CVXPY, a popular Python-embedded DSL for convex optimization, and additionally implement differentiable layers for disciplined convex programs in PyTorch and TensorFlow 2.0. Our implementation significantly lowers the barrier to using convex optimization problems in differentiable programs. We present applications in linear machine learning models and in stochastic control, and we show that our layer is competitive (in execution time) compared to specialized differentiable solvers from past work.

1 Introduction

Recent work has shown how to differentiate through specific subclasses of convex optimization problems, which can be viewed as functions mapping problem data to solutions [6, 31, 10, 1, 4]. These layers have found several applications [40, 6, 35, 27, 5, 53, 75, 52, 12, 11], but many applications remain relatively unexplored (see, e.g., [4, §8]).

While convex optimization layers can provide useful inductive bias in end-to-end models, their adoption has been slowed by how difficult they are to use. Existing layers (e.g., [6, 1]) require users to transform their problems into rigid canonical forms by hand. This process is tedious, error-prone, and time-consuming, and often requires familiarity with convex analysis. Domain-specific languages (DSLs) for convex optimization abstract away the process of converting problems to canonical forms, letting users specify problems in a natural syntax; programs are then lowered to canonical forms and supplied to numerical solvers behind the scenes [3].

*Authors listed in alphabetical order.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
DSLs enable rapid prototyping and make convex optimization accessible to scientists and engineers who are not necessarily experts in optimization.

The point of this paper is to do what DSLs have done for convex optimization, but for differentiable convex optimization layers. In this work, we show how to efficiently differentiate through disciplined convex programs [45]. This is a large class of convex optimization problems that can be parsed and solved by most DSLs for convex optimization, including CVX [44], CVXPY [29, 3], Convex.jl [72], and CVXR [39]. Concretely, we introduce disciplined parametrized programming (DPP), a grammar for producing parametrized disciplined convex programs. Given a program produced by DPP, we show how to obtain an affine map from parameters to problem data, and an affine map from a solution of the canonicalized problem to a solution of the original problem. We refer to this representation of a problem (i.e., the composition of an affine map from parameters to problem data, a solver, and an affine map to retrieve a solution) as affine-solver-affine (ASA) form.

Our contributions are three-fold:

1. We introduce DPP, a new grammar for parametrized convex optimization problems, and ASA form, which ensures that the mapping from problem parameters to problem data is affine. DPP and ASA form make it possible to differentiate through DSLs for convex optimization, without explicitly backpropagating through the operations of the canonicalizer. We present DPP and ASA form in §4.

2. We implement the DPP grammar and a reduction from parametrized programs to ASA form in CVXPY 1.1. We also implement differentiable convex optimization layers in PyTorch [66] and TensorFlow 2.0 [2]. Our software substantially lowers the barrier to using convex optimization layers in differentiable programs and neural networks (§5).

3.
We present applications to sensitivity analysis for linear machine learning models, and to learning control-Lyapunov policies for stochastic control (§6). We also show that for quadratic programs (QPs), our layer's runtime is competitive with OptNet's specialized solver qpth [6] (§7).

2 Related work

DSLs for convex optimization. DSLs for convex optimization allow users to specify convex optimization problems in a natural way that follows the math. At the foundation of these languages is a ruleset from convex analysis known as disciplined convex programming (DCP) [45]. A mathematical program written using DCP is called a disciplined convex program, and all such programs are convex. Disciplined convex programs can be canonicalized to cone programs by expanding each nonlinear function into its graph implementation [43]. DPP can be seen as a subset of DCP that mildly restricts the way parameters (symbolic constants) can be used; a similar grammar is described in [26]. The techniques used in this paper to canonicalize parametrized programs are similar to the methods used by code generators for optimization problems, such as CVXGEN [60], which targets QPs, and QCML, which targets second-order cone programs (SOCPs) [26, 25].

Differentiation of optimization problems. Convex optimization problems do not in general admit closed-form solutions. It is nonetheless possible to differentiate through convex optimization problems by implicitly differentiating their optimality conditions (when certain regularity conditions are satisfied) [36, 68, 6]. Recently, methods were developed to differentiate through convex cone programs in [24, 1] and [4, §7.3]. Because every convex program can be cast as a cone program, these methods are general. The software released alongside [1], however, requires users to express their problems in conic form.
Expressing a convex optimization problem in conic form requires a working knowledge of convex analysis. Our work abstracts away conic form, letting the user differentiate through high-level descriptions of convex optimization problems; we canonicalize these descriptions to cone programs on the user's behalf. This makes it possible to rapidly experiment with new families of differentiable programs, induced by different kinds of convex optimization problems.

Because we differentiate through a cone program by implicitly differentiating its solution map, our method can be paired with any algorithm for solving convex cone programs. In contrast, methods that differentiate through every step of an optimization procedure must be customized for each algorithm (e.g., [33, 30, 56]). Moreover, such methods only approximate the derivative, whereas we compute it analytically (when it exists).

3 Background

Convex optimization problems. A parametrized convex optimization problem can be represented as

    minimize    f_0(x; θ)
    subject to  f_i(x; θ) ≤ 0,  i = 1, …, m_1,        (1)
                g_i(x; θ) = 0,  i = 1, …, m_2,

where x ∈ R^n is the optimization variable and θ ∈ R^p is the parameter vector [22, §4.2]. The functions f_i : R^n → R are convex, and the functions g_i : R^n → R are affine. A solution to (1) is any vector x* ∈ R^n that minimizes the objective function, among all choices that satisfy the constraints. The problem (1) can be viewed as a (possibly multi-valued) function that maps a parameter to solutions. In this paper, we consider the case when this solution map is single-valued, and we denote it by S : R^p → R^n. The function S maps a parameter θ to a solution x*. From the perspective of end-to-end learning, θ (or parameters it depends on) is learned in order to minimize some scalar function of x*.
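To make the solution map concrete, consider a toy example (our illustration, not from the paper): for the unconstrained problem minimize (1/2)x^T Q x − θ^T x with Q fixed and positive definite, the optimality condition Q x* = θ gives S(θ) = Q^{-1} θ, so implicit differentiation of the optimality condition yields DS(θ) = Q^{-1}. The sketch below checks this against a finite-difference approximation.

```python
import numpy as np

# Toy solution map: S(theta) = argmin_x (1/2) x^T Q x - theta^T x = Q^{-1} theta.
# (Illustrative only; the paper handles general parametrized cone programs.)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q = A @ A.T + 3 * np.eye(3)  # positive definite

def solve(theta):
    # "Solver" for the toy problem: solve the optimality condition Q x = theta.
    return np.linalg.solve(Q, theta)

# Implicit differentiation of Q x* = theta gives DS(theta) = Q^{-1},
# independent of theta for this toy problem.
DS = np.linalg.inv(Q)

# Finite-difference check of one column of the derivative.
theta = rng.standard_normal(3)
eps = 1e-6
e0 = np.zeros(3)
e0[0] = eps
fd_col0 = (solve(theta + e0) - solve(theta - e0)) / (2 * eps)
assert np.allclose(fd_col0, DS[:, 0], atol=1e-5)
```

For general convex programs no closed form for S exists, which is why the derivative must be obtained implicitly from the optimality conditions, as described next.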
In this paper, we show how to obtain the derivative of S with respect to θ, when (1) is a DPP-compliant program (and when the derivative exists).

We focus on convex optimization because it is a powerful modeling tool, with applications in control [20, 16, 71], finance [57, 19], energy management [63], supply chain [17, 15], physics [51, 8], computational geometry [73], aeronautics [48], and circuit design [47, 21], among other fields.

Disciplined convex programming. DCP is a grammar for constructing convex optimization problems [45, 43]. It consists of functions, or atoms, and a single rule for composing them. An atom is a function with known curvature (affine, convex, or concave) and per-argument monotonicities. The composition rule is based on the following theorem from convex analysis. Suppose h : R^k → R is convex, nondecreasing in arguments indexed by a set I_1 ⊆ {1, 2, …, k}, and nonincreasing in arguments indexed by I_2. Suppose also that g_i : R^n → R are convex for i ∈ I_1, concave for i ∈ I_2, and affine for i ∈ (I_1 ∪ I_2)^c. Then the composition f(x) = h(g_1(x), g_2(x), …, g_k(x)) is convex. DCP allows atoms to be composed so long as the composition satisfies this composition theorem. Every disciplined convex program is a convex optimization problem, but the converse is not true. This is not a limitation in practice, because atom libraries are extensible (i.e., the class corresponding to DCP is parametrized by which atoms are implemented). In this paper, we consider problems of the form (1) in which the functions f_i and g_i are constructed using DPP, a version of DCP that performs parameter-dependent curvature analysis (see §4.1).

Cone programs. A (convex) cone program is an optimization problem of the form

    minimize    c^T x
    subject to  b − Ax ∈ K,        (2)

where x ∈ R^n is the variable (there are several other equivalent forms for cone programs).
The set K ⊆ R^m is a nonempty, closed, convex cone, and the problem data are A ∈ R^{m×n}, b ∈ R^m, and c ∈ R^n. In this paper we assume that (2) has a unique solution.

Our method for differentiating through disciplined convex programs requires calling a solver (an algorithm for solving an optimization problem) in the forward pass. We focus on the special case in which the solver is a conic solver. A conic solver targets convex cone programs, implementing a function s : R^{m×n} × R^m × R^n → R^n mapping the problem data (A, b, c) to a solution x*.

DCP-based DSLs for convex optimization can canonicalize disciplined convex programs to equivalent cone programs, producing the problem data A, b, c, and K [3]; (A, b, c) depend on the parameter θ and the canonicalization procedure. These data are supplied to a conic solver to obtain a solution; there are many high-quality implementations of conic solvers (e.g., [64, 9, 32]).

4 Differentiating through disciplined convex programs

We consider a disciplined convex program with variable x ∈ R^n, parametrized by θ ∈ R^p; its solution map can be viewed as a function S : R^p → R^n that maps parameters to the solution (see §3). In this section we describe the form of S and how to evaluate D^T S, allowing us to backpropagate through parametrized disciplined convex programs. (We use the notation Df(x) to denote the derivative of a function f evaluated at x, and D^T f(x) to denote the adjoint of the derivative at x.) We consider the special case of canonicalizing a disciplined convex program to a cone program. With little extra effort, our method can be extended to other targets.

We express S as the composition R ∘ s ∘ C; the canonicalizer C maps parameters to cone problem data (A, b, c), the cone solver s solves the cone problem, furnishing a solution x̃*, and the retriever R maps x̃* to a solution x* of the original problem.
A problem is in ASA form if C and R are affine. By the chain rule, the adjoint of the derivative of a disciplined convex program is

    D^T S(θ) = D^T C(θ) D^T s(A, b, c) D^T R(x̃*).

The remainder of this section proceeds as follows. In §4.1, we present DPP, a ruleset for constructing disciplined convex programs reducible to ASA form. In §4.2, we describe the canonicalization procedure and show how to represent C as a sparse matrix. In §4.3, we review how to differentiate through cone programs, and in §4.4, we describe the form of R.

4.1 Disciplined parametrized programming

DPP is a grammar for producing parametrized disciplined convex programs from a set of functions, or atoms, with known curvature (constant, affine, convex, or concave) and per-argument monotonicities. A program produced using DPP is called a disciplined parametrized program. Like DCP, DPP is based on the well-known composition theorem for convex functions, and it guarantees that every function appearing in a disciplined parametrized program is affine, convex, or concave. Unlike DCP, DPP also guarantees that the produced program can be reduced to ASA form.

A disciplined parametrized program is an optimization problem of the form

    minimize    f_0(x, θ)
    subject to  f_i(x, θ) ≤ f̃_i(x, θ),  i = 1, …, m_1,        (3)
                g_i(x, θ) = g̃_i(x, θ),  i = 1, …, m_2,

where x ∈ R^n is a variable, θ ∈ R^p is a parameter, the f_i are convex, the f̃_i are concave, the g_i and g̃_i are affine, and the expressions are constructed using DPP. An expression can be thought of as a tree, where the nodes are atoms and the leaves are variables, constants, or parameters. A parameter is a symbolic constant with known properties such as sign but unknown numeric value.
An expression is said to be parameter-affine if it does not have variables among its leaves and is affine in its parameters; an expression is parameter-free if it is not parametrized, and variable-free if it does not have variables. Every DPP program is also DCP, but the converse is not true. DPP generates programs reducible to ASA form by introducing two restrictions on expressions involving parameters:

1. In DCP, we classify the curvature of each subexpression appearing in the problem description as convex, concave, affine, or constant. All parameters are classified as constant. In DPP, parameters are classified as affine, just like variables.

2. In DCP, the product atom prod(x, y) = xy is affine if x or y is a constant (i.e., variable-free). Under DPP, the product is affine when at least one of the following is true:
    • x or y is constant (i.e., both parameter-free and variable-free);
    • one of the expressions is parameter-affine and the other is parameter-free.

The DPP specification can (and may in the future) be extended to handle several other combinations of expressions and parameters.

Example. Consider the program

    minimize    ‖Fx − g‖₂ + λ‖x‖₂
    subject to  x ≥ 0,        (4)

with variable x ∈ R^n and parameters F ∈ R^{m×n}, g ∈ R^m, and λ > 0.
If ‖·‖₂, the product, negation, and the sum are atoms, then this problem is DPP-compliant:

    • prod(F, x) = Fx is affine because the atom is affine (F is parameter-affine and x is parameter-free) and F and x are affine;
    • Fx − g is affine because Fx and −g are affine and the sum of affine expressions is affine;
    • ‖Fx − g‖₂ is convex because ‖·‖₂ is convex and a convex function composed with an affine function is convex;
    • prod(λ, ‖x‖₂) is convex because the product is affine (λ is parameter-affine, ‖x‖₂ is parameter-free), it is increasing in ‖x‖₂ (because λ is nonnegative), and ‖x‖₂ is convex;
    • the objective is convex because the sum of convex expressions is convex.

Non-DPP transformations of parameters. It is often possible to re-express non-DPP expressions in DPP-compliant ways. Consider the following examples, in which the p_i are parameters:

    • The expression prod(p_1, p_2) is not DPP because both of its arguments are parametrized. It can be rewritten in a DPP-compliant way by introducing a variable s, replacing p_1 p_2 with the expression p_1 s, and adding the constraint s = p_2.
    • Let e be an expression. The quotient e/p_1 is not DPP, but it can be rewritten as e p_2, where p_2 is a new parameter representing 1/p_1.
    • The expression log|p_1| is not DPP because log is concave and increasing but |·| is convex. It can be rewritten as log p_2, where p_2 is a new parameter representing |p_1|.
    • If P_1 ∈ R^{n×n} is a parameter representing a (symmetric) positive semidefinite matrix and x ∈ R^n is a variable, the expression quadform(x, P_1) = x^T P_1 x is not DPP. It can be rewritten as ‖P_2 x‖₂², where P_2 is a new parameter representing P_1^{1/2}.

4.2 Canonicalization

The canonicalization of a disciplined parametrized program to ASA form is similar to the canonicalization of a disciplined convex program to a cone program.
All nonlinear atoms are expanded into their graph implementations [43], generating affine expressions of variables. The resulting expressions are also affine in the problem parameters due to the DPP rules. Because these expressions represent the problem data for the cone program, the function C from parameters to problem data is affine.

As an example, the DPP program (4) can be canonicalized to the cone program

    minimize    t_1 + λ t_2
    subject to  (t_1, Fx − g) ∈ Q^{m+1},
                (t_2, x) ∈ Q^{n+1},        (5)
                x ∈ R^n_+,

where (t_1, t_2, x) is the variable, Q^n is the n-dimensional second-order cone, and R^n_+ is the nonnegative orthant. When rewritten in the standard form (2), with the variable ordered as (t_1, t_2, x), this problem has data

    A = [ −1    0    0  ]        b = [  0 ]        c = [ 1 ]
        [  0    0   −F  ]            [ −g ]            [ λ ]
        [  0   −1    0  ]            [  0 ]            [ 0 ]
        [  0    0   −I  ]            [  0 ]
        [  0    0   −I  ]            [  0 ]

with K = Q^{m+1} × Q^{n+1} × R^n_+, where the first two block rows of (A, b) correspond to Q^{m+1}, the next two to Q^{n+1}, and the last to R^n_+. In this case, the parameters F, g, and λ are just negated or copied directly into the problem data.

The canonicalization map. The full canonicalization procedure (which includes expanding graph implementations) only runs the first time the problem is canonicalized. When the same problem is canonicalized in the future (e.g., with new parameter values), the problem data (A, b, c) can be obtained by multiplying a sparse matrix representing C by the parameter vector (and reshaping); the adjoint of the derivative can be computed by just transposing the matrix. The naïve alternative (expanding graph implementations and extracting new problem data every time parameters are updated, and differentiating through this algorithm in the backward pass) is much slower (see §7). The following lemma tells us that C can be represented as a sparse matrix.

Lemma 1.
The canonicalizer map C for a disciplined parametrized program can be represented with a sparse matrix Q ∈ R^{n×(p+1)} and a sparse tensor R ∈ R^{m×(n+1)×(p+1)}, where m is the dimension of the constraints. Letting θ̃ ∈ R^{p+1} denote the concatenation of θ and the scalar offset 1, the problem data can be obtained as c = Qθ̃ and [A b] = Σ_{i=1}^{p+1} R[:, :, i] θ̃_i.

The proof is given in Appendix A.

4.3 Derivative of a conic solver

By applying the implicit function theorem [36, 34] to the optimality conditions of a cone program, it is possible to compute its derivative Ds(A, b, c). To compute D^T s(A, b, c), we follow the methods presented in [1] and [4, §7.3]. Our calculations are given in Appendix B.

If the cone program is not differentiable at a solution, we compute a heuristic quantity, as is common practice in automatic differentiation [46, §14]. In particular, at non-differentiable points, a linear system that arises in the computation of the derivative might fail to be invertible. When this happens, we compute a least-squares solution to the system instead. See Appendix B for details.

4.4 Solution retrieval

The cone program obtained by canonicalizing a DPP-compliant problem uses the variable x̃ = (x, s) ∈ R^n × R^k, where s ∈ R^k is a slack variable. If x̃* = (x*, s*) is optimal for the cone program, then x* is optimal for the original problem (up to reshaping and scaling by a constant). As such, a solution to the original problem can be obtained by slicing, i.e., R(x̃*) = x*. This map is evidently linear.

5 Implementation

We have implemented DPP and the reduction to ASA form in version 1.1 of CVXPY, a Python-embedded DSL for convex optimization [29, 3]; our implementation extends CVXCanon, an open-source library that reduces affine expression trees to matrices [62].
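The data computation in Lemma 1 amounts to a few (sparse) matrix products. A minimal dense NumPy sketch, with illustrative shapes and random values of our own choosing rather than data from the paper:

```python
import numpy as np

# Dimensions for the sketch: n variables, m constraint rows, p parameters.
n, m, p = 2, 3, 2

# Q maps theta_tilde = (theta, 1) to the cost vector c; the slices R[:, :, i],
# weighted by theta_tilde_i and summed, give the stacked constraint data [A b].
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, p + 1))
R = rng.standard_normal((m, n + 1, p + 1))

theta = rng.standard_normal(p)
theta_tilde = np.append(theta, 1.0)  # concatenate theta with the scalar offset 1

c = Q @ theta_tilde                           # c = Q theta_tilde
Ab = np.einsum('ijk,k->ij', R, theta_tilde)   # [A b] = sum_i R[:, :, i] theta_tilde_i
A, b = Ab[:, :n], Ab[:, n]

# Because the map is affine in theta_tilde, its adjoint (used in the backward
# pass) is just multiplication by the transposes of Q and the R slices.
assert c.shape == (n,) and A.shape == (m, n) and b.shape == (m,)
```

In the implementation, Q and the slices of R are stored as sparse matrices, so updating the problem data for new parameter values is a single sparse multiply.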
We have also implemented differentiable convex optimization layers in PyTorch and TensorFlow 2.0. These layers implement the forward and backward maps described in §4; they also efficiently support batched inputs (see §7). We use the diffcp package [1] to obtain derivatives of cone programs. We modified this package for performance: we ported much of it from Python to C++, added an option to compute the derivative using a dense direct solve, and made the forward and backward passes amenable to parallelization. Our implementation of DPP and ASA form, coupled with our PyTorch and TensorFlow layers, makes our software the first DSL for differentiable convex optimization layers. Our software is open-source. CVXPY and our layers are available at

    https://www.cvxpy.org,    https://www.github.com/cvxgrp/cvxpylayers.

Example. Below is an example of how to specify the problem (4) using CVXPY 1.1.

1  import cvxpy as cp
2
3  m, n = 20, 10
4  x = cp.Variable((n, 1))
5  F = cp.Parameter((m, n))
6  g = cp.Parameter((m, 1))
7  lambd = cp.Parameter((1, 1), nonneg=True)
8  objective_fn = cp.norm(F @ x - g) + lambd * cp.norm(x)
9  constraints = [x >= 0]
10 problem = cp.Problem(cp.Minimize(objective_fn), constraints)
11 assert problem.is_dpp()

The code below shows how to use our PyTorch layer to solve and backpropagate through problem (the code for our TensorFlow layer is almost identical; see Appendix D).

Figure 1: Gradients (black lines) of the logistic test loss with respect to the training data.

Figure 2: Per-iteration cost while learning an ADP policy for stochastic control.

1  import torch
2  from cvxpylayers.torch import CvxpyLayer
3
4  F_t = torch.randn(m, n, requires_grad=True)
5  g_t = torch.randn(m, 1, requires_grad=True)
6  lambd_t = torch.rand(1, 1, requires_grad=True)
7  layer = CvxpyLayer(
8      problem, parameters=[F, g, lambd], variables=[x])
9  x_star, = layer(F_t, g_t, lambd_t)
10 x_star.sum().backward()

Constructing layer in lines 7-8 canonicalizes problem to extract C and R, as described in §4.2. Calling layer in line 9 applies the map R ∘ s ∘ C from §4, returning a solution to the problem. Line 10 computes the gradients of the sum of x_star with respect to F_t, g_t, and lambd_t.

6 Examples

In this section, we present two applications of differentiable convex optimization, meant to be suggestive of possible use cases for our layer. We give more examples in Appendix E.

6.1 Data poisoning attack

We are given training data (x_i, y_i), i = 1, …, N, where x_i ∈ R^n are feature vectors and y_i ∈ {0, 1} are the labels. Suppose we fit a model for this classification problem by solving

    minimize    (1/N) Σ_{i=1}^N ℓ(θ; x_i, y_i) + r(θ),        (6)

where the loss function ℓ(θ; x_i, y_i) is convex in θ ∈ R^n and r(θ) is a convex regularizer. We hope that the test loss L_test(θ) = (1/M) Σ_{i=1}^M ℓ(θ; x̃_i, ỹ_i) is small, where (x̃_i, ỹ_i), i = 1, …, M, is our test set.

Assume that our training data is subject to a data poisoning attack [18, 49], before it is supplied to us. The adversary has full knowledge of our modeling choice, meaning that they know the form of (6), and seeks to perturb the data to maximally increase our loss on the test set, to which they also have access. The adversary is permitted to apply an additive perturbation δ_i ∈ R^n to each of the training points x_i, with the perturbations satisfying ‖δ_i‖_∞ ≤ 0.01.

Let θ* be optimal for (6). The gradient of the test loss with respect to a training data point, ∇_{x_i} L_test(θ*), gives the direction in which the point should be moved to achieve the greatest increase in test loss.
Hence, one reasonable adversarial policy is to set x_i := x_i + 0.01 sign(∇_{x_i} L_test(θ*)). The quantity 0.01 Σ_{i=1}^N ‖∇_{x_i} L_test(θ*)‖_1 is the predicted increase in our test loss due to the poisoning.

Numerical example. We consider 30 training points and 30 test points in R^2, and we fit a logistic model with elastic-net regularization. This problem can be written using DPP, with the x_i as parameters (see Appendix C for the code). We used our convex optimization layer to fit this model and obtain the gradient of the test loss with respect to the training data. Figure 1 visualizes the results. The orange (star) and blue (+) points are training data, belonging to different classes. The red line (dashed) is the hyperplane learned by fitting the model, while the blue line (solid) is the hyperplane that minimizes the test loss. The gradients are visualized as black lines, attached to the data points. Moving the points in the gradient directions torques the learned hyperplane away from the optimal hyperplane for the test set.

Table 1: Time (ms) to canonicalize examples, across 10 runs.

                   Logistic regression    Stochastic control
CVXPY 1.0.23       18.9 ± 1.75            12.5 ± 0.72
CVXPY 1.1          1.49 ± 0.02            1.39 ± 0.02

6.2 Convex approximate dynamic programming

We consider a stochastic control problem of the form

    minimize    lim_{T→∞} E[(1/T) Σ_{t=0}^{T−1} ‖x_t‖₂² + ‖φ(x_t)‖₂²]
    subject to  x_{t+1} = A x_t + B φ(x_t) + ω_t,  t = 0, 1, …,        (7)

where x_t ∈ R^n is the state, φ : R^n → U ⊆ R^m is the policy, U is a convex set representing the allowed set of controls, and ω_t ∈ Ω is a (random, i.i.d.) disturbance. Here the variable is the policy φ, and the expectation is taken over disturbances and the initial state x_0. If U is not an affine set, then this problem is in general very difficult to solve [50, 13].

ADP policy. A common heuristic for solving (7) is approximate dynamic programming (ADP), which parametrizes φ and replaces the minimization over functions with a minimization over parameters. In this example, we take U to be the unit ball and we represent φ as a quadratic control-Lyapunov policy [74]. Evaluating φ(x_t) corresponds to solving the SOCP

    minimize    u^T P u + x_t^T Q u + q^T u
    subject to  ‖u‖₂ ≤ 1,        (8)

with variable u and parameters P, Q, q, and x_t. We can run stochastic gradient descent (SGD) on P, Q, and q to approximately solve (7), which requires differentiating through (8). Note that if u were unconstrained, (7) could be solved exactly, via linear quadratic regulator (LQR) theory [50]. The policy (8) can be written using DPP (see Appendix C for the code).

Numerical example. Figure 2 plots the estimated average cost for each iteration of gradient descent for a numerical example, with x ∈ R^2 and u ∈ R^3, a time horizon of T = 25, and a batch size of 8. We initialize our policy's parameters with the LQR solution, ignoring the constraint on u. This method decreased the average cost by roughly 40%.

7 Evaluation

Our implementation substantially lowers the barrier to using convex optimization layers. Here, we show that it substantially reduces canonicalization time. Additionally, for dense problems, our implementation is competitive (in execution time) with a specialized solver for QPs; for sparse problems, our implementation is much faster.

Canonicalization. Table 1 reports the time it takes to canonicalize the logistic regression and stochastic control problems from §6, comparing CVXPY version 1.0.23 with CVXPY 1.1. Each canonicalization was performed on a single core of an unloaded Intel i7-8700K processor. We report the average time and standard deviation across 10 runs, excluding a warm-up run.
Our extension achieves on average an order-of-magnitude speed-up, since computing C via a sparse matrix multiply is much more efficient than going through the DSL.

Figure 3: Comparison of our PyTorch CvxpyLayer to qpth, over 10 trials. (a) Dense QP, batch size of 128. (b) Sparse QP, batch size of 32. For cvxpylayers, we separate out the canonicalization and solution retrieval times, to allow for a fair comparison.

Comparison to specialized layers. We have implemented a batched solver and backward pass for our differentiable CVXPY layer that makes it competitive with the batched QP layer qpth from [6]. Figure 3 compares the runtimes of our PyTorch CvxpyLayer and qpth on a dense and a sparse QP. The sparse problem is too large for qpth to run in GPU mode. The QPs have the form

    minimize    (1/2) x^T Q x + p^T x
    subject to  Ax = b,        (9)
                Gx ≤ h,

with variable x ∈ R^n, and problem data Q ∈ R^{n×n}, p ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, G ∈ R^{p×n}, and h ∈ R^p. The dense QP has n = 128, m = 0, and p = 128. The sparse QP has n = 1024, m = 1024, and p = 1024, and Q, A, and G each have 1% nonzeros (see Appendix E for the code). We ran this experiment on a machine with a 6-core Intel i7-8700K CPU, 32 GB of memory, and an Nvidia GeForce 1080 Ti GPU with 11 GB of memory.

Our implementation is competitive with qpth for the dense QP, even on the GPU, and roughly 5 times faster for the sparse QP. Our backward pass for the dense QP uses our extension to diffcp; we explicitly materialize the derivatives of the cone projections and use a direct solve. Our backward pass for the sparse QP uses sparse operations and LSQR [65], significantly outperforming qpth (which cannot exploit sparsity). Our layer runs on the CPU, and implements batching via Python multi-threading, with a parallel for loop over the examples in the batch for both the forward and backward passes. We used 12 threads for our experiments.

8 Discussion

Other solvers.
Solvers that are specialized to subclasses of convex programs are often faster than more general conic solvers. For example, one might use OSQP [69] to solve QPs, or gradient-based methods like L-BFGS [54] or SAGA [28] for empirical risk minimization. Because CVXPY lets developers add specialized solvers as additional back-ends, our implementation of DPP and ASA form can be easily extended to other problem classes. We plan to interface QP solvers in future work.

Nonconvex problems. It is possible to differentiate through nonconvex problems, either analytically [37, 67, 5] or by unrolling SGD [33, 14, 61, 41, 70, 23, 38]. Because convex programs can typically be solved efficiently and to high accuracy, it is preferable to use convex optimization layers over nonconvex optimization layers when possible. This is especially true in the setting of low-latency inference. The use of differentiable nonconvex programs in end-to-end learning pipelines, discussed in [42], is an interesting direction for future research.

Acknowledgments

We gratefully acknowledge discussions with Eric Chu, who designed and implemented a code generator for SOCPs [26, 25], Nicholas Moehle, who designed and implemented a basic version of a code generator for convex optimization in unpublished work, and Brendan O'Donoghue. We also would like to thank the anonymous reviewers, who provided us with useful suggestions that improved the paper. S. Barratt is supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518.

References

[1] A. Agrawal, S. Barratt, S. Boyd, E. Busseti, and W. Moursi. Differentiating through a cone program. In: Journal of Applied and Numerical Optimization 1.2 (2019), pp. 107–115.

[2] A. Agrawal, A. N. Modi, A. Passos, A. Lavoie, A. Agarwal, A. Shankar, I. Ganichev, J. Levenberg, M. Hong, R. Monga, and S. Cai.
TensorFlow Eager: A multi-stage, Python-embedded DSL for machine learning. In: Proc. Systems for Machine Learning. 2019.

[3] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd. A rewriting system for convex optimization problems. In: Journal of Control and Decision 5.1 (2018), pp. 42–60.

[4] B. Amos. Differentiable optimization-based modeling for machine learning. PhD thesis. Carnegie Mellon University, 2019.

[5] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable MPC for end-to-end planning and control. In: Advances in Neural Information Processing Systems. 2018, pp. 8299–8310.

[6] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In: Intl. Conf. Machine Learning. 2017.

[7] B. Amos, V. Koltun, and J. Z. Kolter. The limited multi-label projection layer. 2019. arXiv: 1906.08707.

[8] G. Angeris, J. Vučković, and S. Boyd. Computational bounds for photonic design. In: ACS Photonics 6.5 (2019), pp. 1232–1239.

[9] M. ApS. MOSEK optimization suite. http://docs.mosek.com/9.0/intro.pdf. 2019.

[10] S. Barratt. On the differentiability of the solution to convex optimization problems. 2018. arXiv: 1804.05098.

[11] S. Barratt and S. Boyd. Fitting a Kalman smoother to data. 2019. arXiv: 1910.08615.

[12] S. Barratt and S. Boyd. Least squares auto-tuning. 2019. arXiv: 1904.05460.

[13] S. Barratt and S. Boyd. Stochastic control with affine dynamics and extended quadratic costs. 2018. arXiv: 1811.00168.

[14] D. Belanger, B. Yang, and A. McCallum. End-to-end learning for structured prediction energy networks. In: Intl. Conf. Machine Learning. 2017.

[15] A. Ben-Tal, B. Golany, A. Nemirovski, and J.-P. Vial. Retailer-supplier flexible commitments contracts: A robust optimization approach. In: Manufacturing & Service Operations Management 7.3 (2005), pp. 248–271.

[16] D. P. Bertsekas.
Dynamic programming and optimal control. 3rd ed. Vol. 1. Athena Scientific, Belmont, 2005.

[17] D. Bertsimas and A. Thiele. A robust optimization approach to supply chain management. In: Proc. Intl. Conf. on Integer Programming and Combinatorial Optimization. Springer. 2004, pp. 86–100.

[18] B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. In: Pattern Recognition 84 (2018), pp. 317–331.

[19] S. Boyd, E. Busseti, S. Diamond, R. Kahn, K. Koh, P. Nystrup, and J. Speth. Multi-period trading via convex optimization. In: Foundations and Trends in Optimization 3.1 (2017), pp. 1–76.

[20] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear matrix inequalities in system and control theory. SIAM, 1994.

[21] S. Boyd, S.-J. Kim, D. Patil, and M. Horowitz. Digital circuit optimization via geometric programming. In: Operations Research 53.6 (2005).

[22] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[23] P. Brakel, D. Stroobandt, and B. Schrauwen. Training energy-based models for time-series imputation. In: Journal of Machine Learning Research 14.1 (2013), pp. 2771–2797.

[24] E. Busseti, W. Moursi, and S. Boyd. Solution refinement at regular points of conic problems. 2018. arXiv: 1811.02157.

[25] E. Chu and S. Boyd. QCML: Quadratic Cone Modeling Language. https://github.com/cvxgrp/qcml. 2017.

[26] E. Chu, N. Parikh, A. Domahidi, and S. Boyd. Code generation for embedded second-order cone programming. In: 2013 European Control Conference (ECC). IEEE. 2013, pp. 1547–1552.

[27] F. de Avila Belbute-Peres, K. Smith, K. Allen, J. Tenenbaum, and J. Z. Kolter. End-to-end differentiable physics for learning and control. In: Advances in Neural Information Processing Systems. 2018, pp. 7178–7189.

[28] A. Defazio, F. Bach, and S. Lacoste-Julien.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems. 2014, pp. 1646–1654.

[29] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. In: Journal of Machine Learning Research 17.1 (2016), pp. 2909–2913.

[30] S. Diamond, V. Sitzmann, F. Heide, and G. Wetzstein. Unrolled optimization with deep priors. 2017. arXiv: 1705.08041.

[31] J. Djolonga and A. Krause. Differentiable learning of submodular models. In: Advances in Neural Information Processing Systems. 2017, pp. 1013–1023.

[32] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In: Control Conference (ECC), 2013 European. IEEE. 2013, pp. 3071–3076.

[33] J. Domke. Generic methods for optimization-based modeling. In: AISTATS. Vol. 22. 2012, pp. 318–326.

[34] A. L. Dontchev and R. T. Rockafellar. Implicit functions and solution mappings. In: Springer Monogr. Math. (2009).

[35] P. Donti, B. Amos, and J. Z. Kolter. Task-based end-to-end model learning in stochastic optimization. In: Advances in Neural Information Processing Systems. 2017, pp. 5484–5494.

[36] A. Fiacco and G. McCormick. Nonlinear programming: Sequential unconstrained minimization techniques. John Wiley and Sons, Inc., New York-London-Sydney, 1968, pp. xiv+210.

[37] A. V. Fiacco. Introduction to sensitivity and stability analysis in nonlinear programming. Vol. 165. Mathematics in Science and Engineering. Academic Press, Inc., Orlando, FL, 1983, pp. xii+367.

[38] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In: 34th Intl. Conf. Machine Learning-Volume 70. JMLR.org. 2017, pp. 1126–1135.

[39] A. Fu, B. Narasimhan, and S. Boyd. CVXR: An R package for disciplined convex optimization. 2017. arXiv: 1711.07582.

[40] Z. Geng, D.
Johnson, and R. Fedkiw. Coercing machine learning to output physically accurate results. 2019. arXiv: 1910.09671 [physics.comp-ph].

[41] I. Goodfellow, M. Mirza, A. Courville, and Y. Bengio. Multi-prediction deep Boltzmann machines. In: Advances in Neural Information Processing Systems. 2013, pp. 548–556.

[42] S. Gould, R. Hartley, and D. Campbell. Deep declarative networks: A new hope. 2019. arXiv: 1909.04866.

[43] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In: Recent Advances in Learning and Control. Ed. by V. Blondel, S. Boyd, and H. Kimura. Lecture Notes in Control and Information Sciences. Springer, 2008, pp. 95–110.

[44] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx. 2014.

[45] M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. In: Global optimization. Springer, 2006, pp. 155–210.

[46] A. Griewank and A. Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.

[47] M. Hershenson, S. Boyd, and T. Lee. Optimal design of a CMOS op-amp via geometric programming. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 20.1 (2001), pp. 1–21.

[48] W. Hoburg and P. Abbeel. Geometric programming for aircraft design optimization. In: AIAA Journal 52.11 (2014), pp. 2414–2426.

[49] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. In: IEEE Symposium on Security and Privacy. IEEE. 2018, pp. 19–35.

[50] R. Kalman. When is a linear control system optimal? In: Journal of Basic Engineering 86.1 (1964), pp. 51–60.

[51] Y. Kanno. Nonsmooth Mechanics and Convex Optimization. CRC Press, Boca Raton, FL, 2011.

[52] K. Lee, S. Maji, A. Ravichandran, and S. Soatto.
Meta-learning with differentiable convex optimization. 2019. arXiv: 1904.03758.

[53] C. K. Ling, F. Fang, and J. Z. Kolter. What game are we playing? End-to-end learning in normal and extensive form games. 2018. arXiv: 1805.02777.

[54] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. In: Mathematical Programming 45.1-3 (1989), pp. 503–528.

[55] C. Malaviya, P. Ferreira, and A. F. Martins. Sparse and constrained attention for neural machine translation. 2018. arXiv: 1805.08241.

[56] M. Mardani, Q. Sun, S. Vasawanala, V. Papyan, H. Monajemi, J. Pauly, and D. Donoho. Neural proximal gradient descent for compressive imaging. 2018. arXiv: 1806.03963 [cs.CV].

[57] H. Markowitz. Portfolio selection. In: Journal of Finance 7.1 (1952), pp. 77–91.

[58] A. Martins and R. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In: Intl. Conf. Machine Learning. 2016, pp. 1614–1623.

[59] A. F. Martins and J. Kreutzer. Learning what's easy: Fully differentiable neural easy-first taggers. In: 2017 Conference on Empirical Methods in Natural Language Processing. 2017, pp. 349–362.

[60] J. Mattingley and S. Boyd. CVXGEN: A code generator for embedded convex optimization. In: Optimization and Engineering 13.1 (2012), pp. 1–27.

[61] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. 2016. arXiv: 1611.02163.

[62] J. Miller, J. Zhu, and P. Quigley. CVXCanon. https://github.com/cvxgrp/CVXcanon/. 2015.

[63] N. Moehle, E. Busseti, S. Boyd, and M. Wytock. Dynamic energy management. 2019. arXiv: 1903.06230.

[64] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. SCS: Splitting conic solver, version 2.1.0. https://github.com/cvxgrp/scs. 2017.

[65] C. C. Paige and M. A. Saunders.
LSQR: An algorithm for sparse linear equations and sparse least squares. In: ACM Transactions on Mathematical Software (TOMS) 8.1 (1982), pp. 43–71.

[66] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop (2017).

[67] H. Pirnay, R. López-Negrete, and L. T. Biegler. Optimal sensitivity based on IPOPT. In: Mathematical Programming Computation 4.4 (2012), pp. 307–331.

[68] S. Robinson. Strongly regular generalized equations. In: Mathematics of Operations Research 5.1 (1980), pp. 43–62.

[69] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting solver for quadratic programs. 2017. arXiv: 1711.08013.

[70] V. Stoyanov, A. Ropson, and J. Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In: AISTATS. 2011, pp. 725–733.

[71] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2012, pp. 5026–5033.

[72] M. Udell, K. Mohan, D. Zeng, J. Hong, S. Diamond, and S. Boyd. Convex optimization in Julia. In: SC14 Workshop on High Performance Technical Computing in Dynamic Languages (2014). arXiv: 1410.4821 [math.OC].

[73] M. Van Kreveld, O. Schwarzkopf, M. de Berg, and M. Overmars. Computational geometry: algorithms and applications. Springer, 2000.

[74] Y. Wang and S. Boyd. Fast evaluation of quadratic control-Lyapunov policy. In: IEEE Transactions on Control Systems Technology 19.4 (2010), pp. 939–946.

[75] B. Wilder, B. Dilkina, and M. Tambe. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. 2018. arXiv: 1809.05504.

[76] Y. Ye, M. J. Todd, and S. Mizuno. An O(√n L)-iteration homogeneous and self-dual linear programming algorithm. In: Mathematics of Operations Research 19.1 (1994), pp. 53–67.