{"title": "Provably Correct Automatic Sub-Differentiation for Qualified Programs", "book": "Advances in Neural Information Processing Systems", "page_first": 7125, "page_last": 7135, "abstract": "The \\emph{Cheap Gradient Principle}~\\citep{Griewank:2008:EDP:1455489} --- the computational cost of computing a $d$-dimensional vector of  partial derivatives of a scalar function is nearly the same (often within a factor of $5$)  as that of simply computing the scalar function itself --- is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures. The current state of affairs is markedly different with regards to computing sub-derivatives: widely used ML libraries, including TensorFlow and PyTorch, do \\emph{not} correctly compute (generalized) sub-derivatives even on simple differentiable examples. This work considers the question: is there a \\emph{Cheap Sub-gradient Principle}?  Our main result shows that, under certain restrictions on our library of non-smooth functions (standard in non-linear programming), provably correct generalized sub-derivatives can be computed at a computational cost that is within a (dimension-free) factor of $6$ of the cost of computing the scalar function itself.", "full_text": "Provably Correct Automatic Subdifferentiation for Qualified Programs\n\nSham M. Kakade\n\nUniversity of Washington\nsham@cs.washington.edu\n\nJason D. 
Lee\n\nUniversity of Southern California\njasonlee@marshall.usc.edu\n\nAbstract\n\nThe Cheap Gradient Principle [Griewank and Walther, 2008] \u2014 the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of 5) as that of simply computing the function itself \u2014 is of central importance in optimization; it allows us to quickly obtain (high dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures. The current state of affairs is markedly different with regard to computing subderivatives: widely used ML libraries, including TensorFlow and PyTorch, do not correctly compute (generalized) subderivatives even on simple examples. This work considers the question: is there a Cheap Subgradient Principle? Our main result shows that, under certain restrictions on our library of nonsmooth functions (standard in nonlinear programming), provably correct generalized subderivatives can be computed at a computational cost that is within a (dimension-free) factor of 6 of the cost of computing the scalar function itself.\n\n1 Introduction\n\nThe widespread implementation of Automatic Differentiation (AD) methods [Baydin et al., 2015] has had a transformative effect on applied machine learning; these methods have made it easier for practitioners, across a range of disciplines, to learn sophisticated machine learning models (including deep neural architectures and richer inferential models). The paradigm is: one simply writes a program to compute the function of interest, say a scalar (loss) function $f(x) : \\mathbb{R}^d \\to \\mathbb{R}$, and then a correctly implemented AD method will return both $f(x)$ and all $d$ of its partial derivatives when provided with $x$ as an input. 
These partial derivatives are often used in conjunction with some (stochastic) gradient-based optimization approach.\n\nUnderlying the effectiveness of this general black-box approach is the Cheap Gradient Principle [Griewank and Walther, 2008]: the computational cost of computing the vector of partial derivatives $(\\partial f/\\partial x_1, \\partial f/\\partial x_2, \\ldots, \\partial f/\\partial x_d)$ is often nearly the same as that of simply computing the scalar function $f(x)$ itself. In fact, for all rational functions, the striking Baur-Strassen theorem [Baur and Strassen, 1983, Griewank, 1989] shows that this increase in computational complexity is a (dimension free) factor of 5.\n\nIn many settings, our underlying function $f(x)$ is a nonsmooth function, and we resort to subgradient methods. This work considers the question: is there a Cheap Subgradient Principle? Specifically, given a program that computes a (locally Lipschitz) function $f$ and given a point $x$, can we automatically compute an element of the (Clarke) subdifferential $\\partial f(x)$ [Clarke, 1975], and can we do this at a cost which is comparable to computing the function $f(x)$ itself? Informally, the set $\\partial f(x)$ is the convex hull of limits of gradients at nearby differentiable points. It can be thought of as generalizing the gradient (for smooth functions) and the subgradient (for convex functions).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nLet us briefly consider how current approaches handle nonsmooth functions, which are available to the user as functions in some library. Consider the following three equivalent ways to write the identity function, where $x \\in \\mathbb{R}$,\n\n$$f_1(x) = x, \\qquad f_2(x) = \\mathrm{ReLU}(x) - \\mathrm{ReLU}(-x), \\qquad f_3(x) = 10 f_1(x) - 9 f_2(x),$$\n\nwhere $\\mathrm{ReLU}(x) = \\max\\{x, 0\\}$, and so $f_1(x) = f_2(x) = f_3(x)$. As these functions are differentiable at 0, the unique derivative is $f_1'(0) = f_2'(0) = f_3'(0) = 1$. However, both TensorFlow [Abadi et al., 2015] and PyTorch [Paszke et al., 2017] claim that $f_1'(0) = 1$, $f_2'(0) = 0$, $f_3'(0) = 10$. This particular answer is due to using a subgradient of 0 for ReLU at $x = 0$. One may ask if a more judicious choice fixes such issues; unfortunately, it is not difficult to see that no such universal choice exists$^1$.\n\nThis example should be concerning for a number of reasons. The use of nonsmooth functions in AD goes well beyond simple one dimensional nonsmooth functions (such as $\\mathrm{ReLU}(\\cdot)$ or $|\\cdot|$); current methods permit utilizing eigenvalues, SVDs, QR decompositions (there are AD procedures on these nonsmooth linear algebra functions [Maclaurin et al., 2015, Seeger et al., 2017]).\n\nIs correctness important? One option is to disregard these issues \u2014 which is the current state of affairs \u2014 based on the observation that in most cases these issues are unlikely to harm our optimization method. In numerical linear algebra, one could make the same argument: we never truly encounter degenerate linear systems (or degenerate eigenspaces); nonetheless, in retrospect, numerical issues have made evident the importance of carefully addressing these \u201ccorner cases\u201d. The situation may be analogous here: numerical issues in these approaches can easily lead to unstable outputs. Note that some numerical instability is certainly to be expected due to nonsmoothness (a point we return to in the Discussion under the notion of mixed stability); yet we would still hope to have nontrivial stability guarantees in our widely used AD libraries, much in the manner we have for our established numerical linear algebra libraries [Trefethen and Bau III, 1997, Demmel, 1997].\n\nUltimately, the importance of correctness in these methods is a decision that must be made by the broader ML community. 
Here, it is worthwhile to consider that AD software has a range of applications: from physical simulators to health care/social science applications to deployed online learning systems to differentiable programming. For example, when using physical simulators (say in robotics or in the sciences), a strong notion of stability may be critical when doing AD through nonsmooth system dynamics. In safety-critical settings, we may seek to have deployed online learning methods which are not susceptible to errors due to misspecified input-output behavior in our programs. Perhaps the most compelling reason for provably correct software implementations is to avoid costly failure modes due to the utilization of the methods in novel and unforeseen manners.\n\nRelated Work: These issues are in fact known in the mathematical AD literature (see Griewank and Walther [2008, Chapter 14]). Once we include either nonsmooth primitive functions or permit branching in a program, the usual chain rule fails to hold and incorrect input-output behavior is easy to observe. Because the established calculus properties of nonsmooth functions [Klatte and Kummer, 2002, Mordukhovich, 2006] do not seem amenable to AD approaches, there are currently no general purpose, computationally efficient AD methods for subdifferentials that are provably correct.\n\nOne influential and powerful idea is that of lexicographic differentiation [Nesterov, 2005]; it is a property of a subclass of nonsmooth functions which allows these functions to inherit a generalized notion of a chain rule. This idea has been utilized for obtaining correct generalized derivatives in Khan and Barton [2013], Griewank [2013]. 
The difficulty is that the lexicographic approach is often expensive, in that it incurs a dimension-dependent factor in the computational cost.\n\nThe other relatively few works that do focus on automatic generalized differentiation go through some notion of algorithmic linearization [A.Griewank, 1995, Nesterov, 2005, Khan and Barton, 2013, 2015, Fiege et al., 2017], where often piecewise smooth functions are considered, and the approach attempts correct AD by probing the pieces through some linearization (see Griewank [2014] for a review). The difficulties are due to understanding what information we can extract through linear \u201cprobes\u201d into the function.\n\n$^1$ By defining $\\mathrm{ReLU}'(0) = 1/2$, the reader may note we obtain the correct derivative on $f_2$, $f_3$; however, consider $f_4(x) = \\mathrm{ReLU}(\\mathrm{ReLU}(x)) - \\mathrm{ReLU}(-x)$, which also equals $f_1(x)$. Here, we would need $\\mathrm{ReLU}'(0) = \\frac{\\sqrt{5}-1}{2}$ to obtain the correct answer.\n\nAlgorithm 1: Straight Line Program for $f(x)$\nInput: $x = (x_1, \\ldots, x_d)$\n1: for $k = d+1, d+2, \\ldots, T$ do\n2:   Compute: $x_k = g_k(x_{\\mathrm{parents}(k)})$, where $\\mathrm{parents}(k)$ is the index set of the \u201cparent\u201d variables of $k$.\n3: end for\nReturn: $x_T$.\n\nAlgorithm 2: The Reverse Mode of AD\nInput: variables $(x_1, \\ldots, x_T)$; a computational graph $\\{\\mathrm{children}(t)\\}_{t \\in \\{1, \\ldots, T\\}}$; the associated derivatives\n1: Initialize: $\\frac{\\partial x_T}{\\partial x_T} = 1$\n2: for $t = T, T-1, \\ldots, 1$ do\n3:   Compute: $\\frac{\\partial x_T}{\\partial x_t} = \\sum_{i \\in \\mathrm{children}(t)} \\frac{\\partial x_T}{\\partial x_i} \\frac{\\partial x_i}{\\partial x_t}$\n4: end for\nReturn: $\\frac{\\partial x_T}{\\partial x} = \\left( \\frac{\\partial x_T}{\\partial x_1}, \\frac{\\partial x_T}{\\partial x_2}, \\ldots, \\frac{\\partial x_T}{\\partial x_d} \\right)$.\n\nOne of the first ideas along this line of thought is due to [A.Griewank, 1995], which shows how to compute directional derivatives of nonsmooth functions by following a \u201cbranch\u201d the program would take on an input (where the branch corresponds to the approach direction in the directional derivative). In fact, our work uses this basic idea, as does the \u201cbranch locking\u201d approach in Khan [2017], Griewank [2013]. The difficulty in these approaches is in finding a means to relate this linearization to properties of the (nonsmooth) functions, which will allow the algorithm to succeed; naively, we can tell when a method might have failed though it is difficult to guarantee that it will succeed.\n\nAs such, the extant body of work does not contain methods which incur only a constant factor blow up in the computational cost. Notable differences in this work are that our assumptions make strong connections to nonlinear programming [Abadie, 1967, Peterson, 1973, Gould and Tolle, 1971], which help in characterizing when the linearization approach is informative, and that we provide a key technical result showing a certain chain rule holds for randomized algorithms. Furthermore, our focus is on generalizing the reverse mode for scalar functions (as opposed to focusing on multivariate functions where there is no known Cheap Gradient Principle).\n\nOur contributions: Our main result provides \u2014 under a natural set of assumptions widely used in nonlinear programming \u2014 a provably correct Automatic Subdifferentiation procedure, which given some $x$, computes both the functional value $f(x)$ and a $d$ dimensional element of the subdifferential $(u_1, \\ldots, u_d) \\in \\partial f(x)$, with a computational cost that is a factor of at most 6 times that of computing the scalar function $f(x)$ itself. 
Our assumption is that our library of functions be implemented in a manner consistent with the standard constraint qualification assumptions in nonlinear programming [Abadie, 1967]. In short, this work shows that in fact there is a Cheap Subgradient Principle.\n\n2 Preliminaries\n\nAssume $f : \\mathbb{R}^d \\to \\mathbb{R}$ is a locally Lipschitz function, and recall that, by Rademacher\u2019s theorem, this implies that $f$ is differentiable almost everywhere. The Clarke subdifferential of $f$ at any point $x$ is the set [Clarke et al., 2008, Theorem 8.1]\n\n$$\\partial f(x) := \\mathrm{conv}\\left\\{ \\lim_{i \\to \\infty} \\nabla f(x_i) : x_i \\xrightarrow{\\Omega} x \\right\\}, \\tag{1}$$\n\nwhere $\\Omega$ is any full-measure subset of $\\mathbb{R}^d$ such that $f$ is differentiable at each of its points. Here, the limit is taken to be the set of all limit points. In classical circumstances, the subdifferential reduces to more familiar objects. Namely, when $f$ is $C^1$-smooth at $x$, the subdifferential $\\partial f(x)$ consists only of the gradient $\\nabla f(x)$, while for convex functions, it reduces to the subdifferential in the sense of convex analysis.\n\n2.1 AD Review and The Baur-Strassen Theorem\n\nA straight line program for computing $f(x) : \\mathbb{R}^d \\to \\mathbb{R}$ is specified by a program of the form shown in Algorithm 1. Here the functions $g_1, g_2, \\ldots$ are each assumed to be a function from a library of functions. In the algebraic circuit complexity model, these functions are either monomials or affine functions of their inputs.\n\nMore generally, we will be interested in utilizing a richer class of functions where $g \\in \\mathcal{L}$, a library of functions, e.g. we may desire functions like $|\\cdot|$, $\\mathrm{ReLU}(x)$, or even richer nonsmooth functions like eigenvalues.\n\nDefine $\\mathrm{Runtime}(f; x)$ to be the time it takes to compute $f(x)$ under a given program for $f$.\n\nTheorem 2.1. [Baur and Strassen, 1983, Griewank, 1989] Assume all multiplications and additions have unit runtime cost. 
If we restrict to the algebraic circuit complexity model (where the functions $g_k$ are either monomials or affine functions), then it is possible to compute both $f(x)$ and all its partial derivatives $\\nabla f(x)$ in time that is at most $5 * \\mathrm{Runtime}(f; x)$.\n\nAn algorithm achieving this guarantee is to first compute $f(x)$ and then use the reverse mode of AD, in Algorithm 2. To see the specific counting argument, see [Morgenstern, 1985]. This result often holds more generally: the reverse mode also correctly returns the derivatives even with a richer family of smooth functions in our library $\\mathcal{L}$, often with a constant factor cost increase as well [Griewank, 1989]. The reverse mode itself has been rediscovered many times [Griewank, 2012]; the well known back-propagation algorithm [Rumelhart et al., 1986] is one example of the reverse mode of AD. The reverse mode (and the back-propagation algorithm) is not a direct application of the chain rule; the direct application of the chain rule is referred to as the forward mode of AD (see Griewank and Walther [2008]), which is a $d$ times more expensive procedure for computing the gradient. The reverse mode can be viewed as a form of dynamic programming. To compare the two, in the reverse mode of AD, we compute the derivatives $\\frac{\\partial x_T}{\\partial x_t}$, referred to as the adjoints$^2$, while in the forward mode of AD we would compute ($d$-dimensional) derivatives of the form $\\frac{\\partial x_t}{\\partial x}$ (referred to as dual numbers).\n\n2.2 Nonsmooth functions and our computational model\n\nTo specify how our nonsmooth functions are implemented, we extend the computational model to allow for branching, using (a restricted version$^3$ of) the Blum-Shub-Smale model of computation [Blum et al., 1988].\n\nDefinition 2.1 (Computation Model). The computational model for computing any $g(x) : \\mathbb{R}^d \\to \\mathbb{R}$ in our library ($d$ may be different for each function) is specified by a program of the form shown in Algorithm 3. 
We assume that the function $g_{k,z}$ is either a monomial or an affine function of its inputs. Furthermore, for every $g$, we assume that there exists a time $T$, where the program terminates in at most this amount of time.\n\nThroughout, we make the following assumption:\n\nAssumption 2.1. (Computational Cost) Assume all multiplications and additions have unit runtime cost and that an execution of an \u201cIf\u201d statement is also unit cost. For example, the cost of computing a monomial is the number of multiplications.\n\nThe program implicitly encodes a function that has the following representation:\n\n$$f(x) = \\sum_{z \\in \\{-1,1\\}^T} \\mathbb{I}_{S_z}(x)\\, p_z(x), \\tag{2}$$\n\nwhere each $p_z$ is a polynomial; $\\mathbb{I}_{S_z}$ is the indicator function on the set $S_z$; and $S_z$ consists of all $x$ where the program executes branch $z$ when given $x$ as input. The set $S_z$ can be explicitly defined as follows: for steps $k$ where the program branches on $z$, define $h_{k,z}(x) = x_k$; on non-branching $k$, define $h_{k,z}(x) = -1$; define the vector valued function $h_z(x) = (h_{1,z}(x), \\ldots, h_{T,z}(x))$;\n\n$$S_z := \\{x \\mid \\mathbb{I}(\\mathrm{sign}(h_z(x)) = z)\\} \\tag{3}$$\n\nwhere $\\mathrm{sign}(\\cdot)$ is the usual sign function (applied componentwise) taking values in $\\{-1, 1\\}$ (where we take $\\mathrm{sign}(0) = 1$). Note that $S_z$ is specified by a set of polynomial inequalities as defined by the functions $h_{k,z}(x)$.\n\n$^2$ For a variable $x_T = g(x_{\\mathrm{parents}})$, the notation $\\frac{\\partial x_T}{\\partial x_t}$ refers to the derivative with respect to $x_t$, but holding all parent variables of $x_t$ as fixed. If $x_t$ is an input variable, then this is the usual partial derivative.\n\n$^3$ We avoid halting concerns by assuming our programs halt in a bounded amount of time. We also explicitly avoid discussing tapes and registers in our computational cost model.\n\nAlgorithm 3: Program for a Nonsmooth function $g(x)$\nInput: $x = (x_1, \\ldots, x_d)$\n1: Initialize a vector $z$ to be all $-1$\u2019s. $z$ is for notational convenience to keep track of the branch.\n2: for $k = d+1, d+2, \\ldots, T$ do\n3:   Compute: $x_k = g_{k,z}(x_{\\mathrm{parents}(k,z)})$\n4:   If the program branches at $(k, z)$, then\n     \u2022 If: $x_k \\ge 0$, $z_k = 1$.\n     \u2022 Else: $z_k = -1$.\n5:   If the program halts at $(k, z)$, then terminate the for loop.\n6: end for\nReturn: $x_k$.\n\nAlgorithm 4: $\\mathrm{ReLU}(x)$\nInput: $x = x_1$\n1: Branch:\n   \u2022 If: $x_1 \\ge 0$, set $x_2 = x_1$.\n   \u2022 Else: set $x_2 = 0$.\nReturn: $x_2$.\n\nAlgorithm 5: $\\mathrm{ReLU}(x)$\nInput: $x = x_1$\n1: Branch:\n   \u2022 If: $x_1^3 \\ge 0$, set $x_2 = x_1$.\n   \u2022 Else: set $x_2 = 0$.\nReturn: $x_2$.\n\nFigure 1: Two programs that implement $\\mathrm{ReLU}(x)$ (left: Algorithm 4; right: Algorithm 5): Both programs are correct and return the same value. However, the program on the right violates Assumption 3.1 since the gradient of the constraint function at $x = 0$, $\\nabla(x_1^3) = 3x_1^2 = 0$.\n\n3 Provable Automatic Subdifferentiation\n\nIn the algebraic circuit complexity model, where AD is provably correct, branching is not permitted. The inclusion of branching into our program leads to a number of subtle issues. Branching allows us to implement the same nonsmooth function in different manners, which have important consequences in linearization approaches. 
Consider two different programs (with the same input-output behavior) for the $\\mathrm{ReLU}(x)$ function in Figure 1. The left program returns $x$ on the constraint set that is encoded as $S_1 = \\{x \\mid x \\ge 0\\}$, while the right program returns $x$ on the constraint set that is encoded as $S_1 = \\{x \\mid x^3 \\ge 0\\}$. In nonlinear programming, the importance of avoiding encoding constraints in the latter manner is well-known [Abadie, 1967, Peterson, 1973, Gould and Tolle, 1971].\n\nThis example motivates our restriction to only consider library functions that are encoded like the former set. We will make the standard constraint qualification assumption$^4$. Roughly speaking, the assumption states that first order information characterizes the set of feasible perturbations. We state this assumption in a manner more directly applicable to our setting (see [Abadie, 1967, Peterson, 1973, Gould and Tolle, 1971]).\n\n$^4$ The standard constraint qualification assumption on a constraint set is that the tangent cone of the constraint set equals the linearized cone (of the functions which define the constraints).\n\nAssumption 3.1. (Constraint Qualification on our Library) Assume for all $g \\in \\mathcal{L}$ that $g$ is locally Lipschitz and our program for $g$ (in our computational model) satisfies the constraint qualification condition on all sets $S_z$ in the following sense: suppose $\\{h_z\\}$ (for binary $z$) are the corresponding constraint functions in our program. For any $x, v$ (of the same input dimensionality of $g$), assume that for all $z$:\n\n$$\\lim_{\\delta \\downarrow 0} \\left( \\mathrm{sign}(h_z(x + \\delta v)) \\right) = \\lim_{\\delta \\downarrow 0} \\left( \\mathrm{sign}(h_z(x) + \\delta \\nabla h_z(x) \\cdot v) \\right).$$\n\nRoughly, this states that the set approached along the limiting direction $x + \\delta v$, when $\\delta \\downarrow 0$, can be determined with first order information.\n\nBefore we state our main theorem, one more definition is in order, since $\\mathrm{Runtime}(f; x)$ may not be continuous. 
Define the limiting runtime $\\mathrm{Runtime}^*(f; x)$ of $f$ at $x$ as the (supremum) runtime to compute $f(x)$, as $x$ is approached from nearby points. Precisely,\n\n$$\\mathrm{Runtime}^*(f; x) := \\sup\\left\\{ \\lim_{i \\to \\infty} \\mathrm{Runtime}(f; x_i) : x_i \\to x \\right\\},$$\n\n(where the limit is taken to be the set of all limit points).\n\nTheorem 3.1. (A Cheap Subgradient Principle) Assume that our program for $f(x)$, in Algorithm 1, is allowed to use nonsmooth functions from our library $\\mathcal{L}$ (in addition to affine functions and monomials). Suppose Assumptions 2.1 and 3.1 hold. There exists a (randomized) algorithm, which upon input $x$, terminates in time that is at most $6 * \\mathrm{Runtime}^*(f; x)$, and, almost surely, returns both $f(x)$ and an element $u \\in \\partial f(x)$.\n\nThe following example shows one subtle issue with regard to constraint qualification.\n\nExample 3.1. (Constraint qualification on programs does not compose) Consider the function $f(x) = \\mathrm{ReLU}(x^2)$ (which is equivalent to the smooth function $x^2$). It is straightforward to see that the induced program for $f(x) = \\mathrm{ReLU}(x^2)$ (when we unravel it) does not satisfy the constraint qualification assumption, even if we do use an implementation of $\\mathrm{ReLU}(\\cdot)$ that does satisfy this assumption. Regardless, in Example 3.4, we show that our algorithm does indeed correctly compute the gradient on this (continuous) function.\n\nBefore we present the construction, we first provide a chain rule for nonsmooth functions.\n\n3.1 A Chain Rule for Nonsmooth Functions\n\nLet $D[g; v](x)$ denote the one-sided (Dini) directional derivative:\n\n$$D[g; v](x) := \\lim_{\\delta \\downarrow 0} \\frac{g(x + \\delta v) - g(x)}{\\delta}$$\n\n(note that we are not assuming that $v$ is a unit vector). This derivative exists for all piecewise polynomials and semialgebraic functions [Coste, 2000, Lemma 6.2].\n\nAssumption 3.2. 
(Overloading the library with ASD subroutines) Assume we have a library of (locally Lipschitz) functions $\\mathcal{L}$ computable in our computational model. For any $g \\in \\mathcal{L}$, with the representation $g(x) = \\sum_{z \\in \\{-1,1\\}^T} \\mathbb{I}_{S_z}(x)\\, p_z(x)$, assume we have an associated automatic subdifferentiation subroutine $\\mathrm{ASD}[g]$ with the following behavior: upon input $(x; v)$, the output $[a, d, u] = \\mathrm{ASD}[g](x; v)$ satisfies\n\n$$a = g(x), \\qquad d = D[g; v](x), \\qquad u = \\nabla p_z(x)$$\n\nwhere $z$ is such that:\n\n$$\\lim_{\\delta \\downarrow 0} \\left( \\mathbb{I}_{S_z}(x + \\delta v) \\right) = 1.$$\n\nRoughly speaking, $u$ is the derivative determined by the set $S_z$ which is approached along the limiting direction $x + \\delta v$, when $\\delta \\downarrow 0$.\n\nFor any locally Lipschitz function $h$, define the limiting total derivative as:\n\n$$\\partial[h; v](x) := \\lim_{\\delta \\downarrow 0} \\nabla h(x + \\delta v)$$\n\nif the limit exists. For almost all $v$, the limit exists, and $\\partial[h; v](x)$ is an element of the subdifferential $\\partial h(x)$.\n\nAlgorithm 6: Automatic Subdifferentiation\nInput: $x = (x_1, \\ldots, x_d)$, $v \\in \\mathbb{R}^d$.\nInitialize: Set $\\dot{x}_1 = v_1, \\dot{x}_2 = v_2, \\ldots, \\dot{x}_d = v_d$.\n1: for $k = d+1, d+2, \\ldots, T$ do\n2:   Compute $[a, d, u] = \\mathrm{ASD}[g_k](x_{\\mathrm{parents}(k)}; \\dot{x}_{\\mathrm{parents}(k)})$ and set: $x_k = a$, $\\dot{x}_k = d$, $\\frac{\\partial x_k}{\\partial x_{\\mathrm{parents}(k)}} = u$\n3: end for\n4: Compute $\\frac{\\partial x_T}{\\partial x}$ using the Reverse Mode on these precomputed variables.\nReturn: $x_T$, and $\\frac{\\partial x_T}{\\partial x}$.\n\nTheorem 3.2. (A Chain Rule for Nonsmooth Functions) Assume $h : \\mathbb{R}^m \\to \\mathbb{R}$ and $g_1, \\ldots, g_m$ (where $g_i : \\mathbb{R}^d \\to \\mathbb{R}$) are locally Lipschitz functions computable in our computational model and that the function $h$ is overloaded with an ASD subroutine as specified in Assumption 3.2. Define:\n\n$$f(x) := h(g_1(x), \\ldots, g_m(x)) = h(g(x)),$$\n\nwhere $g(x)$ is the vector valued function $(g_1(x), \\ldots, g_m(x))^\\top$. Denote the $m \\times 1$ vector of (one-sided) directional derivatives as $D[g; v](x)$. 
If it exists, let $\\partial[g; v](x)$ denote the $m \\times d$ limiting Jacobian matrix (whose rows are given by the vectors $\\partial[g_i; v](x)$). Set:\n\n$$[a, d, u] = \\mathrm{ASD}[h](g(x); D[g; v](x))$$\n\nFor all but a measure 0 set of $v$, we have that $\\partial[f; v](x)$ and $\\partial[g; v](x)$ exist and that:\n\n$$\\partial[f; v](x) = \\partial[g; v](x)^\\top u. \\tag{4}$$\n\nExample 3.2. Consider the example $x = f_2(x) = \\mathrm{ReLU}(x) - \\mathrm{ReLU}(-x)$. We define $h(y_1, y_2) = y_1 - y_2$, $g_1(x) = \\mathrm{ReLU}(x)$, and $g_2(x) = \\mathrm{ReLU}(-x)$, so that $f_2 = h(g_1(x), g_2(x))$. Applying the ASD subroutine to $h$, starting at $x = 0$ with $v = 1$, leads to running $\\mathrm{ASD}[h]((0, 0); (1, 0)) = [a, d, u]$ (where it is straightforward to verify that $u = [1, -1]^\\top$), and we obtain\n\n$$\\partial[f_2; v](0) = \\partial[g; v](0)^\\top u = \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix}^\\top \\begin{bmatrix} 1 \\\\ -1 \\end{bmatrix} = 1,$$\n\nwhich is correct. Furthermore, note a correct answer is obtained for any $v \\ne 0$.\n\nExample 3.3. We return to $f(x) = \\mathrm{ReLU}(x^2)$ from Example 3.1. Define $h(y) = \\mathrm{ReLU}(y)$, $g(x) = x^2$, and so $f(x) = h(g(x))$. By applying the chain rule (Theorem 3.2) at $x = 0$ with $v = 1$,\n\n$$\\partial[f; v](0) = \\partial[g; v](0)\\, u = 0 \\cdot u = 0.$$\n\nSubtly, note that $[a, d, u] = \\mathrm{ASD}[h](0; 0)$, so we are feeding a degenerate direction $d = 0$ into our subroutine. Regardless, the chain rule still applies (for any $v$ in this case).\n\n3.2 The algorithm\n\nWe first present the algorithm that utilizes an overloaded library. We then provide a provably correct construction of this overloaded library. All proofs are provided in the appendix.\n\nSubdifferentiation with the overloaded library\n\nAlgorithm 6 is the Automatic Subdifferentiation procedure. Correctness follows from Lemma 3.1.\n\nLemma 3.1. Suppose Assumptions 2.1 and 3.2 hold. 
Upon input of an arbitrary $x$, and if $v$ is sampled uniformly at random from the unit sphere, then, almost surely, Algorithm 6 returns both $f(x)$ and an element $u \\in \\partial f(x)$.\n\nAlgorithm 7: Overloading the function $g(x)$\nInput: $x = (x_1, \\ldots, x_d)$, $v \\in \\mathbb{R}^d$.\nInitialize: Set $\\dot{x}_1 = v_1, \\dot{x}_2 = v_2, \\ldots, \\dot{x}_d = v_d$.\n1: for $k = d+1, d+2, \\ldots, T$ do\n2:   Compute $x_k$, its partial derivatives, and the directional derivative:\n     $x_k = g_{k,z}(x_{\\mathrm{parents}(k,z)})$, $\\left\\{ \\frac{\\partial x_k}{\\partial x_j} \\,\\middle|\\, j \\in \\mathrm{parents}(k, z) \\right\\}$, $\\dot{x}_k = \\sum_{j \\in \\mathrm{parents}(k,z)} \\frac{\\partial x_k}{\\partial x_j} \\dot{x}_j$\n3:   If the program branches at $(k, z)$, then:\n     \u2022 If: $x_k > 0$, then $z_k = 1$.\n     \u2022 Elseif: $x_k = 0$ and $\\dot{x}_k \\ge 0$, then $z_k = 1$.\n     \u2022 Else: $z_k = -1$\n4:   If the program halts at $(k, z)$, then terminate the for loop.\n5: end for\n6: Compute $\\frac{\\partial x_k}{\\partial x}$ using the Reverse Mode on these pre-computed variables.\nReturn: $[a, d, u] = [x_k, \\dot{x}_k, \\frac{\\partial x_k}{\\partial x}]$.\n\nProof. Fix $k \\in [d+1, \\ldots, T]$. Every parent variable $j \\in \\mathrm{parents}(k)$ can be expressed as $x_j = \\tilde{g}_j(x)$, where $\\tilde{g}_j$ is a piecewise polynomial on the $d$ dimensional input $x$. Thus\n\n$$x_k = g_k(\\tilde{g}_1(x), \\ldots, \\tilde{g}_{k-1}(x)).$$\n\nNow the usual chain rule holds for directional derivatives [Shapiro, 1990]. 
As the forward mode of AD implements the usual chain rule of directional derivatives, we have $\\dot{x}_j = D[\\tilde{g}_j; v]$.\n\nBy Assumption 3.2 and Theorem 3.2, $\\mathrm{ASD}[g_k](x_{\\mathrm{parents}(k)}, \\dot{x}_{\\mathrm{parents}(k)})$ returns $u = \\frac{\\partial x_k}{\\partial x_{\\mathrm{parents}(k)}} = \\partial[g_k; \\dot{x}_{\\mathrm{parents}(k)}]$ and this limiting total derivative satisfies the chain rule $\\partial[x_k; v](x) = \\partial[\\tilde{g}; v](x)^\\top u$. Since the limiting total derivative satisfies the chain rule and the validity of the reverse mode AD algorithm relies only on the chain rule, Algorithm 6 correctly computes $\\partial[f(x); v]$.\n\nBy Rademacher\u2019s theorem and the definition of the Clarke subdifferential in Equation (1), $\\partial[f(x); v] \\in \\partial f(x)$, for almost all $v$.\n\nOverloading the Library Functions\n\nThe following lemma shows that we can provide a method to correctly overload the library, which we use in Algorithm 6.\n\nLemma 3.2. (Correct Library Overloading) Assume $g$ satisfies the constraint qualification conditions in Assumption 3.1. Suppose the corresponding representation is $g(x) = \\sum_{z \\in \\{-1,1\\}^T} \\mathbb{I}_{S_z}(x)\\, p_z(x)$. On an arbitrary input $x$ and $v$, Algorithm 7 returns $g(x)$, $D[g; v](x)$, and an element $u = \\nabla p_z(x)$ where $z$ is such that: $\\lim_{\\delta \\downarrow 0} (\\mathbb{I}_{S_z}(x + \\delta v)) = 1$.\n\nExample 3.4. We again return to $\\mathrm{ReLU}(x^2)$ from Example 3.1. Here we examine how $h$ is overloaded based on the implementation in Algorithm 7. When $(x, v) = (0, 1)$, we are running $\\mathrm{ASD}[h](0; 0)$ and this may not follow the same branch had we run on the (infinitesimal) input $x = \\epsilon v$, which leads to running $h(\\epsilon^2 v^2)$. However, the gradient is correctly computed, $\\partial\\, \\mathrm{ReLU}(x^2) = 0$, regardless of the branch taken.\n\n4 Discussion and Open Questions\n\nOverloading the Library Functions: It is not difficult to see that piecewise univariate functions can be implemented in our library.\n\nAlgorithm 8: $\\sigma(x)$\nInput: $x = x_1$\n1: Branch:\n   \u2022 If: $x_1 \\le b_1$, set $x_2 = p_1(x)$.\n   \u2022 Elseif: $x_1 \\le b_2$, set $x_2 = p_2(x)$.\n   ...\n   \u2022 Elseif: $x_1 \\le b_{k-1}$, set $x_2 = p_{k-1}(x)$.\n   \u2022 Else: set $x_2 = p_k(x)$.\nReturn: $x_2$.\n\nExample 4.1. Univariate Piecewise Polynomial (Algorithm 8). Let $\\sigma : \\mathbb{R} \\to \\mathbb{R}$ be a univariate piecewise polynomial, meaning that the domain $\\mathbb{R}$ is partitioned into a set of $k$ intervals $(-\\infty, b_1), (b_1, b_2), \\ldots, (b_{k-1}, \\infty)$. On each interval, the function is equal to the corresponding polynomial among $p_1, \\ldots, p_k$. Algorithm 8 provides a constraint qualified program for the function $\\sigma(\\cdot)$, which can be used as a library function.\n\nAn important step would be extending our computational model to allow the incorporation of provably correct automatic subdifferentiation for linear algebra libraries. AutoGrad [Maclaurin et al., 2015] does do AD through linear algebra methods, though it cannot be used to obtain correct subdifferentials in programs (at nondifferentiable points); obtaining correct generalized derivatives may be particularly important in cases where we deal with low rank methods. 
We conjecture our results can be extended, by extending the computational model, to handle these cases (there is already much known about the first order structure of these methods [Seeger et al., 2017]); technically, SVDs are not exactly computable in either the algebraic circuit complexity model or the Blum-Shub-Smale model.\n\nNumerical Analysis: The most important open question is how to obtain numerically stable and accurate solutions [Trefethen and Bau III, 1997, Demmel, 1997]. We conjecture the techniques developed here will help in characterizing these issues. In particular, the most natural question is how to develop algorithms that satisfy the mixed stability criterion: the algorithm should give \u201cnearly the right answer to nearly the right problem\u201d (as in [Trefethen and Bau III, 1997]). For example, for the $\\mathrm{abs}(\\cdot)$ function, it should be acceptable for an AD method to provide a subgradient near to $-1$ for a small input $\\epsilon > 0$ due to roundoff error; however, it would be undesirable for numerical error to lead to vastly different gradients than those that arise from any nearby problem. This may be particularly important when doing AD in physical simulators.\n\nAcknowledgments: We thank Dima Drusvyatskiy for many helpful discussions. Sham Kakade acknowledges funding from Washington Research Foundation Fund for Innovation in Data-Intensive Discovery, the NSF through award CCF-1740551, and ONR award N00014-18-1-2247. Jason D. Lee acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Sciences Research Council (EPSRC) under the Multidisciplinary University Research Initiative.\n\nReferences\n\nM. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.\n\nJ. Abadie. On the Kuhn-Tucker theorem. 
Nonlinear Programming, pages 19\u201336, 1967.\n\nA. Griewank. Automatic directional differentiation of nonsmooth composite functions. In R. Durier,\neditor, Recent developments in Optimization / Seventh French-German Conference on Optimization,\nDijon 1994, pages 155\u2013169. Springer Verlag, 1995.\n\nWalter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer\nScience, 22:317\u2013330, 1983.\n\nAtilim Gunes Baydin, Barak A. Pearlmutter, and Alexey Radul. Automatic differentiation in machine\nlearning: a survey. CoRR, abs/1502.05767, 2015.\n\nLenore Blum, Mike Shub, and Steve Smale. On a theory of computation over the real numbers;\nNP-completeness, recursive functions and universal machines (extended abstract). In FOCS, pages\n387\u2013397. IEEE Computer Society, 1988.\n\nF.H. Clarke. Generalized gradients and applications. Trans. Amer. Math. Soc., 205:247\u2013262, Apr.\n1975.\n\nF.H. Clarke, Y.S. Ledyaev, R.J. Stern, and P.R. Wolenski. Nonsmooth analysis and control theory,\nvolume 178. Springer Science & Business Media, 2008.\n\nMichel Coste. An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali\nPisa, 2000.\n\nJ. W. Demmel. Applied numerical linear algebra. Society for Industrial Mathematics, 1997.\n\nSabrina Fiege, Andrea Walther, Kshitij Kulshreshtha, and Andreas Griewank. Algorithmic differentiation\nfor piecewise smooth functions: a case study for robust optimization. pages 1\u201316, 06\n2017.\n\nF.J. Gould and J.W. Tolle. A necessary and sufficient qualification for constrained optimization. 20,\n03 1971.\n\nAndreas Griewank. On automatic differentiation. In Mathematical Programming: Recent Developments\nand Applications, pages 83\u2013108. Kluwer Academic Publishers, 1989.\n\nAndreas Griewank. Who invented the reverse mode of differentiation? 
Optimization Stories, Documenta\nMathematica, Extra Volume ISMP (2012):389\u2013400, 2012.\n\nAndreas Griewank. On stable piecewise linearization and generalized differentiation. Optimization\nMethods and Software, 28(6):1139\u20131178, 2013.\n\nAndreas Griewank. On Automatic Differentiation and Algorithmic Linearization. Pesquisa Operacional,\n34:621\u2013645, 12 2014. ISSN 0101-7438.\n\nAndreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of\nAlgorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, PA,\nUSA, second edition, 2008.\n\nK. A. Khan and P.I. Barton. Evaluating an element of the Clarke generalized Jacobian of a composite\npiecewise differentiable function. ACM Trans. Math. Softw., 39:23:1, 2013.\n\nK. A. Khan and P.I. Barton. A vector forward mode of automatic differentiation for generalized\nderivative evaluation. Optim. Method Softw., 30:1185, 2015.\n\nKamil A. Khan. Branch-locking AD techniques for nonsmooth composite functions and nonsmooth\nimplicit functions. Optimization Methods and Software, 0(0):1\u201329, 2017.\n\nD. Klatte and B. Kummer. Nonsmooth equations in optimization, volume 60 of Nonconvex Optimization\nand its Applications. Kluwer Academic Publishers, Dordrecht, 2002. ISBN 1-4020-0550-4.\nRegularity, calculus, methods and applications.\n\nDougal Maclaurin, David Duvenaud, Matthew Johnson, and Ryan P. Adams. Autograd: Reverse-mode\ndifferentiation of native Python, 2015. URL http://github.com/HIPS/autograd.\n\nB. S. Mordukhovich. Variational Analysis and Generalized Differentiation II: Applications.\nSpringer Berlin Heidelberg, 2006. ISBN 9783540312468. URL https://books.google.com/books?id=lmdmY75lrokC.\n\nJacques Morgenstern. How to compute fast a function and all its derivatives: a variation on the\ntheorem of Baur-Strassen. In SIGA, 1985.\n\nYurii Nesterov. Lexicographic differentiation of nonsmooth functions. Math. 
Program., 104(2-3):669\u2013700, 2005.\n\nA. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga,\nand A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.\n\nDavid W. Peterson. A review of constraint qualifications in finite-dimensional spaces. SIAM Review,\n15(3):639\u2013654, 1973.\n\nD. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel distributed processing: Explorations\nin the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error\nPropagation, pages 318\u2013362. MIT Press, Cambridge, MA, USA, 1986.\n\nMatthias W. Seeger, Asmus Hetzel, Zhenwen Dai, and Neil D. Lawrence. Auto-differentiating linear\nalgebra. CoRR, abs/1710.08717, 2017. URL http://arxiv.org/abs/1710.08717.\n\nA. Shapiro. On concepts of directional differentiability. Journal of Optimization Theory and Applications,\n66(3):477\u2013487, Sep 1990.\n\nLloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. SIAM, 1997.\n", "award": [], "sourceid": 3539, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": "University of Washington"}, {"given_name": "Jason", "family_name": "Lee", "institution": "University of Southern California"}]}