{"title": "Automatic differentiation in ML: Where we are and where we should be going", "book": "Advances in Neural Information Processing Systems", "page_first": 8757, "page_last": 8767, "abstract": "We review the current state of automatic differentiation (AD) for array programming in machine learning (ML), including the different approaches such as operator overloading (OO) and source transformation (ST) used for AD, graph-based intermediate representations for programs, and source languages. Based on these insights, we introduce a new graph-based intermediate representation (IR) which specifically aims to efficiently support fully-general AD for array programming. Unlike existing dataflow programming representations in ML frameworks, our IR naturally supports function calls, higher-order functions and recursion, making ML models easier to implement. The ability to represent closures allows us to perform AD using ST without a tape, making the resulting derivative (adjoint) program amenable to ahead-of-time optimization using tools from functional language compilers, and enabling higher-order derivatives. Lastly, we introduce a proof of concept compiler toolchain called Myia which uses a subset of Python as a front end.", "full_text": "Automatic differentiation in ML:\n\nWhere we are and where we should be going\n\nBart van Merri\u00ebnboer\n\nMila, Google Brain\nbartvm@google.com\n\nOlivier Breuleux\n\nMila\n\nbreuleuo@iro.umontreal.ca\n\nbergearn@iro.umontreal.ca\n\nArnaud Bergeron\n\nMila\n\nPascal Lamblin\nMila, Google Brain\n\nlamblinp@google.com\n\nAbstract\n\nWe review the current state of automatic differentiation (AD) for array program-\nming in machine learning (ML), including the different approaches such as operator\noverloading (OO) and source transformation (ST) used for AD, graph-based in-\ntermediate representations for programs, and source languages. 
Based on these\ninsights, we introduce a new graph-based intermediate representation (IR) which\nspeci\ufb01cally aims to ef\ufb01ciently support fully-general AD for array programming.\nUnlike existing data\ufb02ow programming representations in ML frameworks, our IR\nnaturally supports function calls, higher-order functions and recursion, making\nML models easier to implement. The ability to represent closures allows us to\nperform AD using ST without a tape, making the resulting derivative (adjoint) pro-\ngram amenable to ahead-of-time optimization using tools from functional language\ncompilers, and enabling higher-order derivatives. Lastly, we introduce a proof of\nconcept compiler toolchain called Myia which uses a subset of Python as a front\nend.\n\n1\n\nIntroduction\n\nRecent advances in ML, and deep learning in particular, have in part been driven by advances in\nhardware [23, 31]. This increase in computational power has spurred the development of a large\nnumber of software libraries, compute kernels, programming languages, and compiler toolchains in\norder to exploit it. We distinguish some features and objectives that separate these ML frameworks\nfrom traditional array programming frameworks.\nFirstly, many machine learning models use optimization algorithms which require access to derivatives\nof the model. Automatic differentiation [16] comprises a collection of techniques that can be employed\nto calculate the derivatives of a function speci\ufb01ed by a computer program, and is a central feature of\npopular ML frameworks such as TensorFlow [1] and PyTorch [28].\nML frameworks also put heavy emphasis on being able to iterate quickly on new models using\nhigh-level, dynamically typed languages, while maintaining high performance through aggressively\nexploiting resources (e.g., through parallelism, distributed computing, accelerators, static optimiza-\ntion). 
Moreover, since the derivative code is generated programmatically using AD, frameworks cannot always rely on users writing hand-tuned code and must instead provide compiler optimizations.
Despite the growth in ML frameworks, many have been developed in isolation from the AD community, and many of its insights regarding language design and interactions between source transformation and compiler optimizations have gone largely ignored. Moreover, although many ML frameworks have slowly been adopting functional language concepts (such as pure functions, immutable variables, lazy evaluation), many of the standard approaches used by functional language compilers to guarantee high performance (A-normal form and continuation-passing style representations, persistent data structures, heap recycling, etc.) have not been applied.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In some cases popular ML frameworks have sacrificed flexibility and generality compared to popular array programming packages such as NumPy [39] in order to provide AD and achieve high performance. On the one hand, frameworks relying on computation graphs, such as TensorFlow and Theano [36], do not support higher-order functions or recursion, even though some ML models (e.g. [35]) are more naturally expressed using recursion than loops. On the other hand, frameworks relying on operator overloading, such as PyTorch and Autograd [26], see performance degradation for models with scalars or small vectors.1

2 Background and prior work

The development of ML frameworks has been driven by a wide range of fields and perspectives (systems programming, automatic differentiation, programming languages, compiler design, applied machine learning, etc.), which has led to duplicated research and confused terminology (e.g. define-by-run and operator overloading). 
To contextualize our proposed framework, the \ufb01rst half of this\npaper consists of a review which aims to synthesise these different perspectives. We will begin with\nexplaining the nature of AD and the various challenges associated with it. Then we will review\nthe different approaches to AD and relevant prior work from different domains, such as graph\nrepresentations from the compiler literature, and language and IR design from functional languages.\nWe will discuss the uses of these approaches in existing frameworks and how they affect performance,\nexpressive power, and usability.\nGiven this insight, our goal is to outline in the subsequent sections a proof of concept of a high-\nperformance ML framework with \ufb01rst-class support for AD, but which has the \ufb02exibility and\nexpressive power of a generic, high-level programming language so that it does not restrict the ability\nof ML researchers to explore novel models and algorithms.\n\n2.1 Automatic differentiation\n\nAutomatic differentiation (AD, also called algorithmic differentiation) relies on the ability to de-\ncompose a program into a series of elementary operations (primitives) for which the derivatives are\nknown and to which the chain rule can be applied. AD allows for the calculation of derivatives of any\norder up to working precision.\nAD has been studied since the 60s and 70s and has been employed in \ufb01elds such as computational\n\ufb02uid dynamics, astronomy, and mathematical \ufb01nance [16]. Both its implementation and its theory\nare still an active area of research (e.g. [34] and [41]). We recommend [16] and [5] for a review of\nAD in general and in the context of machine learning respectively. 
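To make this decomposition into primitives concrete, the following minimal sketch (our own illustration, not drawn from the works cited above) breaks a function into an evaluation trace of primitives with known local derivative rules, assembles the chain rule by hand, and checks the result against a finite difference:

```python
import math

# f decomposed into an evaluation trace of primitives, each with a
# known local derivative rule.
def f(x):
    v1 = x * x          # dv1/dx  = 2 * x
    v2 = math.sin(v1)   # dv2/dv1 = cos(v1)
    v3 = v2 * x         # dv3/dv2 = x,  dv3/dx = v2
    return v3

# Chain rule assembled by hand from the local rules above:
# df/dx = dv3/dv2 * dv2/dv1 * dv1/dx + dv3/dx (direct use of x)
def df(x):
    v1 = x * x
    return x * math.cos(v1) * 2 * x + math.sin(v1)

# Sanity check against a central finite difference.
x, eps = 1.3, 1e-6
assert abs(df(x) - (f(x + eps) - f(x - eps)) / (2 * eps)) < 1e-6
```

An AD system automates exactly this assembly, either as the program runs (operator overloading) or as a program transformation (source transformation), as discussed next.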
From an application perspective,\nAD affects and interacts with the entire toolchain, from language design through intermediate\nrepresentations, static analysis, to code generation and program execution.\nThe runtime and memory complexity of AD depends on the order in which the chain rule is evaluated.\nEvaluating the chain rule from right to left (from inputs to outputs) is referred to as forward mode,\nwhereas evaluating it from left to right (from outputs to inputs) is called reverse mode. Forward\nmode has constant memory requirements and its runtime complexity scales with the number of inputs.\nReverse mode\u2019s runtime complexity scales with the number of outputs, and its memory complexity\ngrows with the number of intermediate variables. In principle, forward and reverse mode can be\nmixed, but \ufb01nding the optimal way of doing so is NP-complete [27].\nIn forward mode, the partial derivatives of intermediate variables are calculated in step with the\noriginal program. As such, forward mode is relatively straightforward to implement, e.g. using dual\nnumbers [13]. In reverse mode, the chain rule is evaluated in reverse order of the original program.\nThis is a more complex program transformation: an adjoint program must be constructed whose\ncontrol \ufb02ow is the reverse of the original (or primal) program. First, the primal program is run to\nobtain the output, and then the adjoint program is run to compute the gradient, starting from that\n\n1https://github.com/pytorch/pytorch/issues/2518\n\n2\n\n\foutput and going backwards. In order to do so ef\ufb01ciently, each statement in the adjoint must have\naccess to the intermediate variables of the original program. Hence, the AD transformation must\nguarantee that the intermediate variables are not destroyed or mutated.\nIn ML applications, large matrices of input parameters are typically updated using gradient descent\non a scalar output cost. 
Since the number of inputs is signi\ufb01cantly larger than the number of outputs,\nreverse mode AD is to be preferred. The term \u2018backpropagation\u2019 is used to refer to the specialized\napplication of reverse mode AD in machine learning.\nTwo implementation methods of AD are generally distinguished: operator overloading (OO) and\nsource transformation (ST, also called source code transformation). Each method has its advantages\nand disadvantages in terms of usability, implementation, and ef\ufb01ciency [9]. We will brie\ufb02y discuss\nthem in the context of reverse mode AD.\n\n2.1.1 Operator overloading\n\nOO relies on a language\u2019s ability to rede\ufb01ne the meaning of functions and operators. All primitives\nare overloaded so that they additionally perform a tracing operation: The primitive is logged onto\na \u2018tape\u2019, along with its inputs to ensure that those intermediate variables are kept alive. At the end\nof the function\u2019s execution, this tape contains a linear trace of all the numerical operations in the\nprogram. Derivatives can be calculated by walking this tape in reverse.\nThe main advantage of OO is that it is straightforward to implement. Because the tracing passes\nthrough function calls and control \ufb02ow, the AD logic is simpli\ufb01ed. A signi\ufb01cant downside is that a\nseparate \u2018derivative interpreter\u2019 is needed for the adjoint program. Having an embedded interpreter\ninside of the host language can complicate debugging and performance analysis. Moreover, since the\nprogram is traced and reversed at runtime, OO incurs overhead on each function call which can be\nparticularly problematic if the primitives are fast to execute relative to the tracing operation. OO also\ndoes not allow for ahead-of-time optimizations on the adjoint program.\nOO is the technique used by PyTorch, Autograd, and Chainer [37]. 
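To illustrate the mechanism, here is a deliberately minimal sketch of our own (not the actual implementation of any of these frameworks) in which overloaded scalar operations log backpropagation closures onto a global tape:

```python
class Var:
    """Overloaded scalar that logs each primitive onto a global tape."""
    def __init__(self, val):
        self.val, self.grad = val, 0.0

    def __mul__(self, other):
        out = Var(self.val * other.val)
        # Record how to propagate the output adjoint back to the inputs.
        def backward():
            self.grad += other.val * out.grad
            other.grad += self.val * out.grad
        TAPE.append(backward)
        return out

    def __add__(self, other):
        out = Var(self.val + other.val)
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        TAPE.append(backward)
        return out

TAPE = []

# Trace y = x1 * x2 + x1, then walk the tape in reverse.
x1, x2 = Var(2.0), Var(5.0)
y = x1 * x2 + x1
y.grad = 1.0
for backward in reversed(TAPE):
    backward()

assert (x1.grad, x2.grad) == (6.0, 2.0)  # dy/dx1 = x2 + 1, dy/dx2 = x1
```

Note that tracing happens at every call, and the tape must keep the intermediate values (here, the captured `Var` objects) alive until the backward walk: this is the per-call overhead described above.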
Non-ML oriented AD frameworks\nusing OO include ADOL-C [17] and CppAD [7].\n\n2.1.2 Source transformation\n\nST explicitly constructs the adjoint program. Unlike OO, ST needs to explicitly construct a program\nwith a reversed control \ufb02ow, which means that it needs transformation rules for function calls and\ncontrol \ufb02ow statements such as loops and conditionals. Whereas OO operates within the language,\nST requires tooling such as parsers, tools to manipulate intermediate representations, and unparsers.\nThe advantage of ST is that the AD transformation is done only once per program and hence doesn\u2019t\nincur overhead at runtime, which makes ST performant for a wider range of workloads. Moreover,\nthe full adjoint program is available during compilation and can therefore be optimized ahead of time.\nAlthough ST does not have to deal with the AD transformation at runtime, it must still ensure that\nintermediate variables from the forward pass are accessible by the adjoint. There are a variety of\napproaches to deal with this.\n\nTape-based Frameworks such as ADIFOR [8] and Tapenade [20] for Fortran and C use a global\nstack also called a \u2018tape\u20192 to ensure that intermediate variables are kept alive. The original (primal)\nfunction is augmented so that it writes intermediate variables to the tape during the forward pass, and\nthe adjoint program will read intermediate variables from the tape during the backward pass. More\nrecently, tape-based ST was implemented for Python in the ML framework Tangent [38].\nA problem of this approach is that the tape is a data structure constructed at runtime, analysis of which\nrequires custom compiler passes [19, 20]. Moreover, adjoint programs have a particular symmetric\nstructure where intermediate variables from the \ufb01rst primal statements are used by the last adjoint\nstatements. This highly non-local structure is unsuitable for traditional compiler optimizations which\nact locally. 
How to address this interaction between AD and compiler optimizations is an ongoing research topic [34, 18]. Finally, reading and writing to the tape need to be made differentiable in order to compute higher-order derivatives, which involve multiple applications of reverse mode. For this reason most tape-based systems do not support reverse-over-reverse.

2The tape used in ST stores only the intermediate variables, whereas the tape in OO is a program trace that stores the executed primitives as well.

Closure-based To address some of the shortcomings of the tape-based approach, alternative approaches have been proposed which employ closures [29] or delimited continuations [40]. In both cases, tools from functional programming are used which can capture the environment of a statement during the forward pass and execute the corresponding adjoint within that environment. The advantage of this approach is that no AD-specific compiler passes are needed: a functional language compiler will recognize the non-local use of the intermediate variables by the fact that they are free variables in the generated closure or continuation. This allows the application of all the tooling from functional compilers to the generated adjoint program [32, 33].

2.2 Dataflow programming

Popular ML frameworks such as Theano, TensorFlow, and MXNet [10] follow the dataflow programming paradigm [21] and use computation graphs as their intermediate representation. These graph representations do not have scoping or recursive function calls, which means that AD is much easier to implement with ST. Since the adjoint program is part of the same dataflow graph, it can access the intermediate variables from the forward pass directly from the global scope, so neither tapes nor closures are required. 
Additionally, a simple liveness analysis makes it easy to keep intermediate\nvalues from the primal alive only for as long as required by the adjoint computation.\nUsing data\ufb02ow graphs without function calls3 nor scoping4 introduces limitations. Some of these\nlimitations are addressed by the use of metaprogramming, but others affect the end-user (e.g., the\nlack of recursion and higher-order functions reduces the expressiveness of the language) and the\ncompiler pipeline (e.g., loops cannot be represented in a principled way, which complicates their\nimplementation).\nAn advantage of data\ufb02ow programming is that graphs are a natural representation for distributed\ncomputing [2]. This allows different operations to be easily distributed across different hosts, devices,\nand cores.\nGraph-based IRs are generally useful for compilers, since the absence of an explicit ordering can\nsimplify certain optimizations and scheduling. Theano\u2019s graph representation in particular was\nbased on the representations used by computer algebra systems (CAS), enabling aggressive algebraic\nsimpli\ufb01cation and pattern matching. An SSA5-based graph representation [12, 25], sometimes\nreferred to as sea-of-nodes, is used by the HotSpot Java compiler and the V8 TurboFan JavaScript\ncompiler, and a graph representation using continuation-passing style (CPS, an IR commonly used in\nfunctional languages) called Thorin also exists [24].\n\n2.3 Programming languages and compilers\n\nTheano was one of the \ufb01rst software packages to refer to itself as a \u2018linear algebra compiler\u2019.\nSince then, more frameworks started approaching the de\ufb01nition and execution of ML models as a\ncompiler problem. In the case of Theano and TensorFlow, they can be considered compilers of a\ncustom language which must be metaprogrammed using Python as a metalanguage. The data\ufb02ow\ngraph is an intermediate representation which is optimized using a series of compiler passes. 
The\nresulting program is compiled (e.g., XLA) and/or interpreted (e.g., the TensorFlow/Theano runtimes).\nSimilarly, PyTorch has started optimizing its traced Python programs using just-in-time (JIT) compiler\napproaches.\nMore recently, projects such as DLVM [42] and Swift for TensorFlow6 have attempted to extend\nexisting compiler toolchains such as LLVM and Swift\u2019s intermediate language (SIL) with array\nprogramming and AD in order to create frameworks better suited for ML work\ufb02ow needs.\n\n3TensorFlow and Theano implement a type of subroutine through their Defun and OpFromGraph constructs,\n\nbut these must be explicitly constructed by the user and don\u2019t support recursion.\n\n4TensorFlow has a concept it refers to as \u2018scoping\u2019, but these scopes are not lexical and can be reentered at\n\nany time, so the lifetime of a value is not affected by its scope.\n\n5Static single assignment, which essentially means each variable is assigned to exactly once.\n6https://www.tensorflow.org/community/swift\n\n4\n\n\fViewing ML frameworks as compiler toolchains raises several questions. For example, on what\nintermediate representations is it the easiest to apply AD and aggressive optimizations? IRs with\nclosures as \ufb01rst-class objects will be able to use closure-based approaches to AD, whereas traditional\nSSA-based representations (such as SIL) would need to use a tape-based approach. And which IRs\nare most suitable for the heavy use of parallelism and distributed computing in ML?\nSecondly, what should the source language be? The ML community is highly invested in Python,\nan interpreted, dynamically typed programming language which does not have built-in support for\nmultidimensional arrays. More recently, frameworks have suggested using Swift (DLVM) or Julia\n(JuliaDiff, [30]), languages with static typing and built-in multidimensional arrays respectively. 
On the other hand, frameworks such as Theano and TensorFlow do not have an exposed source language but can only be metaprogrammed. In the AD community, there has been a strong push away from traditional imperative languages such as Fortran and C towards purely functional languages, since they simplify the implementation of AD and are easier to optimize. Examples of this are VLAD, a dialect of Lisp which is compiled with the Stalin∇ compiler [34, 29, 33], DVL7, and DiffSharp [4].

2.3.1 Python

Because Python plays an important role in the ML community, many popular ML frameworks are Python-based. However, the language's characteristics make it difficult to implement a high-performance AD-enabled ML framework in Python directly. The reference implementation of Python, CPython, has effectively no support for concurrency, and the interpreter is relatively slow. Moreover, its highly dynamic nature makes source transformation difficult (Tangent imposes several restrictions on the use of Python in order for it to perform ST). Python does not have built-in support for multidimensional arrays, which are only supported through third-party frameworks such as NumPy.

How to reconcile users' desire to work in Python because of its flexibility with the need for high performance is an open question. ML frameworks have focused on metaprogramming and using C extensions, but other approaches are possible. For example, Cython [6] is a superset of Python which compiles to Python modules, whereas Numba [22] can compile individual Python functions using LLVM.

3 Graph-based direct intermediate representation

We endeavor to combine several of the aforementioned techniques and insights from the compiler and AD literature in order to provide a flexible basis for an ML framework. This requires a well-tailored intermediate representation which avoids the pitfalls of previous methods while keeping their strengths. 
Concretely, we propose an IR with the following properties:\n\nGraph based Similar to Theano or TensorFlow, programs are represented as graphs. Graphs have\nthe advantage of being easy to optimize and \ufb02exible about execution order, as operations that do not\ndepend on each other in the graph may be executed in any order, or in parallel. Unlike Theano and\nTensorFlow, however, functions may be called recursively and they are \ufb01rst-class objects. Functions\nmay be passed as parameters to other functions, or returned from a function and then called. A large\nvariety of control \ufb02ow constructs, ranging from simple loops to graph traversals, can be implemented\nusing these capabilities. Other graph frameworks tend to implement only a few of these as specialized\noperators, such as Theano\u2019s scan or TensorFlow\u2019s while, leading to an IR which is both more\ncomplex and less powerful than the general one we are proposing. A general IR does require more\nwork to transform and optimize in a provably correct way in the context of automatic differentiation,\nbut this work only needs to be done once.\n\nPurely functional Mutation and side effects are problematic for reverse mode AD, where the\nbackward pass requires access to the unchanged intermediate variables from the forward pass. They\nalso interact poorly with complex optimizations because of aliasing. Restricting our language to be\npurely functional therefore allows us to implement more robust AD and more advanced optimizations\ncompared to imperative languages.\n\n7https://github.com/axch/dysvunctional-language\n\n5\n\n\fNote that Myia\u2019s intended use case is not the writing of ef\ufb01cient low-level kernels, which often\nrequires \ufb01ne-grained memory control. 
Similarly to, e.g., TensorFlow, the user can write ef\ufb01cient\nlow-level kernels and their derivatives in a low-level language such as CUDA or XLA, and expose\nthem to Myia as primitives.\n\nClosure representation AD on functional languages involves storing the primal\u2019s intermediate\nresults into closures which are then connected together to form the adjoint. It is therefore important\nto have a natural representation for closures. As in Thorin, we represent a function\u2019s graph\u2019s free\nvariables as direct pointers to nodes that belong to other graphs, thereby creating an implicit nesting\nrelationship between them (a graph Gc is \u201cnested\u201d in Gp if it points to a node in Gp, or to a graph\nnested in Gp, or to a node in a graph nested in Gp). This facilitates joint optimization of a closure\nwith the functions it is nested in. Closures are also a great means for abstraction and a natural way to\nrepresent the methods of objects, so there is a concrete advantage in expressiveness from the user\u2019s\nperspective, which cannot be found in other frameworks.\n\nStrongly typed In its canonical form, every node must be associated with a concrete type. This is\nimportant to maximize performance. This is also important in ML applications, because operations\ntend to be very costly and it is best to catch errors as early as possible. In addition to data types, there\nis also a need to infer other properties such as the dimensions of vectors and matrices so that we can\nguarantee that the inputs of all operations have compatible dimensions prior to executing them. 
Type\nand shape inference are more complex and powerful on our proposed IR than in data\ufb02ow graphs\nbecause of the need to support recursive calls and higher order functions.\n\n3.1\n\nIR speci\ufb01cation\n\nConcretely, our representation represents a function as a graph object with a list of parameter nodes\nand a single return node (multiple return values are supported through tuples). A node represents a\nfunction application and has an ordered list of incoming edges. The \ufb01rst incoming edge is a pointer\nto the function to apply, and the rest point to the arguments. Constants are represented as nodes with\nno incoming edges and a value \ufb01eld. Links between nodes are bidirectional, so that graphs can be\ntraversed in either direction. Each non-constant node belongs to a single graph. See Figure 1 for a\nvisual representation of the IR.\nCompared to other representations, our representation is more expressive than data\ufb02ow graphs, and\nmore \ufb02exible than SSA or CPS representations which tend to be rigid about execution order. It is\nclosest to A-normal form (ANF, [15]), where every intermediate computation is assigned a unique\nname, but it is graphical rather than syntactic and therefore easier to manipulate algorithmically.\n\n3.2 Source transformation\n\nAD can be implemented for this IR using ST with a closure-based method. We closely follow the\napproach described in [29]. The transformed program constructs a chain of closures during the\nforward computation. These closures contain the adjoint code required to compute the derivatives\nalong with the intermediate variables from the forward pass that are needed.\nThe transformation proceeds as follows: Each function call is transformed to return an additional\nvalue, which is a closure called the \u2018backpropagator\u2019. The backpropagator computes the derivative\nwith respect to the inputs given the derivatives with respect to the outputs. 
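As a hand-written sketch of this interface (our illustration using hypothetical scalar primitives `mul` and `add`, not Myia's actual generated code), each transformed call returns its value together with a backpropagator closure, and the backpropagator of a user-defined function calls these closures in reverse:

```python
# Each transformed primitive returns its value plus a closure (the
# backpropagator) mapping the output adjoint to adjoints of the inputs.
def mul(a, b):
    def bprop(dout):
        return (dout * b, dout * a)
    return a * b, bprop

def add(a, b):
    def bprop(dout):
        return (dout, dout)
    return a + b, bprop

# Transformed form of f(x, y) = x * y + x: the forward pass builds a
# chain of backpropagator closures, and the combined backpropagator
# calls them in reverse order of the original program.
def f_transformed(x, y):
    t, bprop_mul = mul(x, y)
    out, bprop_add = add(t, x)
    def bprop_f(dout):
        dt, dx2 = bprop_add(dout)
        dx1, dy = bprop_mul(dt)
        return (dx1 + dx2, dy)
    return out, bprop_f

value, bprop = f_transformed(2.0, 5.0)
assert value == 12.0
assert bprop(1.0) == (6.0, 2.0)  # df/dx = y + 1, df/dy = x
```

Note how the intermediate values (`a`, `b`) survive the forward pass simply by being free variables of the backpropagator closures, with no explicit tape.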
The backpropagators of\nprimitives are known, whereas the backpropagators of user-de\ufb01ned functions can be easily constructed\nby calling the backpropagators of the function calls in the body in reverse order.\nIn order to ensure that our transformation can be applied again on the transformed program (so we\ncan use reverse-over-reverse to compute second-order derivatives), it must be able to handle functions\nwith free variables. To this end, each backpropagator will return the partial derivatives with respect to\nthe inputs of the original function, as well as an ordered set of partial derivatives with respect to the\nfree variables. The backpropagator of the function that built the closure is responsible for unpacking\nthese partial derivatives so that it can add contributions to the free variables that belong to it, this\nunpacking being the adjoint of closure creation. Closures are \ufb01rst class functions: when given as\ninputs of other closures, they are treated like any other input.\n\n6\n\n\fFigure 1: Transform of a simple Python program into Myia\u2019s representation. The program computes\nthe gradient of f with respect to x. (1) identi\ufb01es the part of the graph that implements x ** 3, or\npow(x, 3). After the grad macro is expanded, a new graph, \u00a7f is built. In that graph, a = pow(x,\n3) becomes \u00a7a, \u0111pow = \u00a7pow(\u00a7x, 3) (2). The \u00a7pow operation thus returns two values instead of\none, \u00a7a being equal to the original value a, and \u0111pow being the backpropagator for this operation.\nThat backpropagator is used in (3). It is applied on the gradient wrt the output, \u2207a, and produces\nthe gradient wrt the input \u2207x (it also produces a gradient wrt the constant 3, but that gradient is not\nused). (4) points to all the backpropagators created while executing \u00a7f. These are closures that retain\npointers to all the information necessary to perform reverse mode AD. 
đf, the backpropagator we construct for f, is returned by §f (5), alongside the main result (notice that this mirrors the interface in (2)). đf retains pointers to all backpropagators in (4), as indicated by the dashed lines. đf is retrieved in (6), and immediately called with the value 1.0 for the parameter ∇f (7). In this context, ∇f corresponds to ∂f/∂f, which is why we give it a value of one. After optimization, all functions and backpropagators end up being inlined. All unused computations are cut, and what remains is an expression for ∂f/∂x that is essentially identical to what one would have written by hand.

[Figure 1 panels: Source code; After parsing; After macro expansion; After optimization. Source code shown:

def f(x, y):
    a = x ** 3
    b = y ** 4
    c = a * b
    return c

@myia
def main(x, y):
    dfdx = grad(f, 'x')
    return dfdx(x, y)

Legend: Parameter; Function call; Output node; Constant/function pointer; Non-local variable; (...) Tuple construction; Use as first argument; Use as function to call; Get element #1 of tuple; ▶a Value of a; ◀a Backpropagator for a; ∇a Gradient wrt a.]

4 Myia

Myia is a functioning proof of concept of a toolchain that uses the proposed graph representation8. Myia performs type inference given the input types, and applies a series of optimizations such as inlining, common subexpression elimination, constant propagation, closure conversion, and algebraic simplifications. The final code can be executed using an interpreter, and we also implemented a prototype which compiles the straight-line parts of the graph using TVM [11].

4.1 Python front end

Due to Python's popularity in the ML community, we feel it is important to offer a front end in that language. Users can write models in a subset of Python 3.6 and have them compiled to our IR. This requirement is ostensibly at odds with our IR being pure and strongly typed, for Python is neither of these things. 
We solve that apparent contradiction by selecting a pure subset of Python, and running\nan advanced type inference algorithm on the functions the user asks to compile. In that sense, our\napproach is similar to that of Numba and Cython, or the recently introduced @script decorator in\nPyTorch9. Functions that should be compiled with Myia are denoted using the @myia decorator, and\ncan be freely mixed with Python code in the same \ufb01le.\nMost of Python\u2019s features, such as functions, conditionals, and loops, can readily be parsed into our\nfunctional representation. However, Python does include some statements such as index assignment\n(x[i] = v) and augmented assignment statements (x += y) which imply mutability. We currently\nforbid these statements in Myia, although it may be possible to support principled use of them in the\nfuture through techniques like uniqueness typing [3, 14].\nMyia uses Python\u2019s inspect module to parse the function into an abstract syntax tree (AST), and\nconverts that AST into the graph representation we previously described. Source transformation as\ndescribed in Section 3.2 is used to generate the code for derivatives. See Figure 1 for an illustration\nof how a Python function is parsed into the proposed IR, its adjoint program is created using ST, and\n\ufb01nally optimized to produce an ef\ufb01cient derivative function.\n\n4.2 Type inference\n\nPython is a dynamically typed language, but for the sake of optimization and eager error reporting, it\nis important to be able to infer concrete types for all expressions. 
While it is possible to write optional\ntype annotations in Python 3.6, they are not widely used in practice, and we wish to minimize the\namount of work one has to do in order to port existing code to Myia.\nWhen a Myia function is called, we use the types of the user-provided arguments as a starting point\nfor type inference, which allows us to compile a specialized version of the function for these types.\nNo type annotations are required, even when using higher order functions such as map or grad. Myia\nfunctions can be polymorphic: Myia will specialize each use of a function according to the input type\nsignature for that call site. This means users can write highly dynamic programs just as they are used\nto in Python, and Myia will check them.\nThe inferrer operates on an untyped version of the IR. It can infer types as well as values (constant\npropagation) and shapes. Inference for other properties can easily be added in the future. The inferrer\nis implemented using coroutines: to infer a certain property through a certain primitive, one may\nwrite a coroutine (async def in Python) that asynchronously requests any number of properties\nfrom any number of nodes and combines the results using arbitrary logic.\n\n4.3 Optimization\n\nReverse mode AD in Myia poses a few speci\ufb01c challenges for optimization that we have to tackle.\nAs may be seen in Figure 1, the AD transform produces graphs that are substantially larger than the\noriginal source. These graphs typically contain many computations that are not necessary, such as\ngradients with respect to constants, and a lot of tuple packing and unpacking. These graphs can be\nsimpli\ufb01ed using inlining and local optimizations. 
Figure 1 demonstrates the resulting simplification.

5 Conclusion

In this work we examined the different approaches and techniques used in developing AD-enabled ML frameworks, drawing insights from functional languages, graph-based IRs, and AD. To address some of the shortcomings of existing frameworks, we propose a novel graph-based intermediate representation and describe a proof of concept toolchain called Myia (code available at https://github.com/mila-udem/myia) to show its advantages.

The result is a system that can achieve performance similar to compiled frameworks such as TensorFlow, while providing the flexibility of OO frameworks such as PyTorch, e.g. support for recursion and higher-order functions.

We believe that as AD frameworks slowly move towards being full-fledged languages and compilers, developers will benefit from building on many other ideas from these fields. For example, other techniques from functional languages that could be beneficial include the use of monads to handle random number generators, and the use of higher-order functions for kernel programming (similar to Tensor Comprehensions, https://facebookresearch.github.io/TensorComprehensions/).

Author contributions and acknowledgements

Bart van Merriënboer worked on the design and implementation of the IR, as well as the design of the AD system. Olivier Breuleux worked on the design of the IR and the type system, and on the implementation of the IR, type inference, the AD system, and the optimization and compiler pipeline. Arnaud Bergeron worked on shape inference, on Myia's virtual machine, and on integrating the GPU backend. Pascal Lamblin worked on the design of the AD system and the organization of the project.

The authors would like to thank Maxime Chevalier-Boisvert for her work on an earlier design and prototype. Her contributions and insight helped shape the current version of Myia.
Early discussions and brainstorming with Olexa Bilaniuk also helped determine the scope and direction of the project.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[2] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, et al. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8(12):1792–1803, 2015.

[3] Erik Barendsen and Sjaak Smetsers. Conventional and uniqueness typing in graph rewrite systems. In Rudrapatna K. Shyamasundar, editor, Foundations of Software Technology and Theoretical Computer Science, pages 41–51, Berlin, Heidelberg, 1993. Springer Berlin Heidelberg. ISBN 978-3-540-48211-6.

[4] Atılım Güneş Baydin, Barak A. Pearlmutter, and Jeffrey Mark Siskind. DiffSharp: An AD library for .NET languages. arXiv e-prints, abs/1611.03423, 2016.

[5] Atılım Güneş Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153):1–43, 2018.

[6] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39, 2011.

[7] Bradley M. Bell. CppAD: a package for C++ algorithmic differentiation.
Computational Infrastructure for Operations Research, 57, 2012.

[8] Christian Bischof, Peyvand Khademi, Andrew Mauer, and Alan Carle. ADIFOR 2.0: Automatic differentiation of Fortran 77 programs. IEEE Computational Science and Engineering, 3(3):18–32, 1996.

[9] Christian H. Bischof and H. Martin Bücker. Computing derivatives of computer programs. Technical report, Argonne National Lab., IL (US), 2000.

[10] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv e-prints, abs/1512.01274, 2015.

[11] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: End-to-end optimization stack for deep learning. arXiv e-prints, abs/1802.04799, 2018.

[12] Cliff Click and Michael Paleczny. A simple graph-based intermediate representation. ACM Sigplan Notices, 30(3):35–49, 1995.

[13] William Kingdon Clifford. A preliminary sketch of biquaternions. Proceedings of the London Mathematical Society, s1-4(1):381–395, 1873. doi: 10.1112/plms/s1-4.1.381.

[14] Edsko de Vries, Rinus Plasmeijer, and David M. Abrahamson. Uniqueness typing simplified. In Olaf Chitil, Zoltán Horváth, and Viktória Zsók, editors, Implementation and Application of Functional Languages, pages 201–218, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN 978-3-540-85373-2.

[15] Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of compiling with continuations. ACM Sigplan Notices, 28(6):237–247, 1993.

[16] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation, volume 105.
SIAM, 2008.

[17] Andreas Griewank, David Juedes, and Jean Utke. Algorithm 755: ADOL-C: a package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software (TOMS), 22(2):131–167, 1996.

[18] Laurent Hascoët. Some highlights on source-to-source adjoint AD. NIPS Autodiff workshop, 2017.

[19] Laurent Hascoët, Uwe Naumann, and Valérie Pascual. TBR analysis in reverse-mode automatic differentiation. Technical Report RR-4856, INRIA, 2003.

[20] Laurent Hascoët and Valérie Pascual. The Tapenade automatic differentiation tool: principles, model, and specification. ACM Transactions on Mathematical Software (TOMS), 39(3):20, 2013.

[21] Wesley M. Johnston, J. R. Paul Hanna, and Richard J. Millar. Advances in dataflow programming languages. ACM Computing Surveys (CSUR), 36(1):1–34, 2004.

[22] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, page 7. ACM, 2015.

[23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[24] Roland Leißa, Marcel Köster, and Sebastian Hack. A graph-based higher-order intermediate representation. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 202–212. IEEE Computer Society, 2015.

[25] Götz Lindenmaier, Michael Beck, Boris Boesler, and Rubino Geiß. Firm, an intermediate language for compiler research. Fakultät für Informatik, Universität Karlsruhe, Germany, Tech. Rep, 8(3):2005, 2005.

[26] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Autograd: Effortless gradients in NumPy. In ICML 2015 AutoML Workshop, 2015.

[27] Uwe Naumann. Optimal Jacobian accumulation is NP-complete.
Mathematical Programming, 112(2):427–441, 2008.

[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. NIPS Autodiff workshop, 2017.

[29] Barak A. Pearlmutter and Jeffrey Mark Siskind. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS), 30(2):7, 2008.

[30] Jarrett Revels, Miles Lubin, and Theodore Papamarkou. Forward-mode automatic differentiation in Julia. arXiv e-prints, abs/1607.07892, 2016.

[31] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[32] Olin Shivers. Control-flow analysis of higher-order languages. PhD thesis, Carnegie Mellon University, 1991.

[33] Jeffrey Mark Siskind and Barak A. Pearlmutter. Using polyvariant union-free flow analysis to compile a higher-order functional-programming language with a first-class derivative operator to efficient Fortran-like code. Technical Report TR-ECE-08-01, Purdue University, 2008.

[34] Jeffrey Mark Siskind and Barak A. Pearlmutter. Efficient implementation of a higher-order language with built-in AD. arXiv e-prints, abs/1611.03416, 2016.

[35] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv e-prints, abs/1503.00075, 2015.

[36] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016.

[37] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning.
In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), volume 5, 2015.

[38] Bart van Merriënboer, Alexander B. Wiltschko, and Dan Moldovan. Tangent: Automatic differentiation using source code transformation in Python. arXiv e-prints, abs/1711.02712, 2017.

[39] Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.

[40] Fei Wang and Tiark Rompf. A language and compiler view on differentiable programming. ICLR Workshop track, 2018.

[41] Mu Wang, Assefaw Gebremedhin, and Alex Pothen. Capitalizing on live variables: new algorithms for efficient Hessian computation via automatic differentiation. Mathematical Programming Computation, 8(4):393–433, 2016.

[42] Richard Wei, Lane Schwartz, and Vikram Adve. DLVM: A modern compiler framework for neural network DSLs. Urbana, 51:61801, 2017.