{"title": "Computing Higher Order Derivatives of Matrix and Tensor Expressions", "book": "Advances in Neural Information Processing Systems", "page_first": 2750, "page_last": 2759, "abstract": "Optimization is an integral part of most machine learning systems and most numerical optimization schemes rely on the computation of derivatives. Therefore, frameworks for computing derivatives are an active area of machine learning research. Surprisingly, as of yet, no existing framework is capable of computing higher order matrix and tensor derivatives directly.  Here, we close this fundamental gap and present an algorithmic framework for computing matrix and tensor derivatives that extends seamlessly to higher order derivatives. The framework can be used for symbolic as well as for forward and reverse mode automatic differentiation. Experiments show a speedup between one and four orders of magnitude over state-of-the-art frameworks when evaluating higher order derivatives.", "full_text": "Computing Higher Order Derivatives of Matrix and\n\nTensor Expressions\n\nFriedrich-Schiller-Universit\u00e4t Jena\n\nS\u00f6ren Laue\n\nGermany\n\nMatthias Mitterreiter\n\nFriedrich-Schiller-Universit\u00e4t Jena\n\nGermany\n\nsoeren.laue@uni-jena.de\n\nmatthias.mitterreiter@uni-jena.de\n\nJoachim Giesen\n\nFriedrich-Schiller-Universit\u00e4t Jena\n\nGermany\n\njoachim.giesen@uni-jena.de\n\nAbstract\n\nOptimization is an integral part of most machine learning systems and most nu-\nmerical optimization schemes rely on the computation of derivatives. Therefore,\nframeworks for computing derivatives are an active area of machine learning re-\nsearch. Surprisingly, as of yet, no existing framework is capable of computing\nhigher order matrix and tensor derivatives directly. Here, we close this fundamental\ngap and present an algorithmic framework for computing matrix and tensor deriva-\ntives that extends seamlessly to higher order derivatives. 
The framework can be\nused for symbolic as well as for forward and reverse mode automatic differentiation.\nExperiments show a speedup of up to two orders of magnitude over state-of-the-art\nframeworks when evaluating higher order derivatives on CPUs and a speedup of\nabout three orders of magnitude on GPUs.\n\n1\n\nIntroduction\n\nRecently, automatic differentiation has become popular in the machine learning community due to its\ngenericity, \ufb02exibility, and ef\ufb01ciency. Automatic differentiation lies at the core of most deep learning\nframeworks and has made deep learning widely accessible. In principle, automatic differentiation\ncan be used for differentiating any code. In practice, however, it is primarily targeted at scalar\nvalued functions. Current algorithms and implementations do not produce very ef\ufb01cient code when\ncomputing Jacobians or Hessians that, for instance, arise in the context of constrained optimization\nproblems. A simple yet instructive example is the function f (x) = x(cid:62)Ax, where A is a square matrix\nand x is a vector. The second order derivative of this function, i.e., its Hessian, is the matrix A(cid:62) + A.\nFrameworks like TensorFlow [1], Theano [23], PyTorch [16], or HIPS autograd [14] generate code for\nthe second order derivative of f that runs two to three orders of magnitude slower than the evaluation\nof the expression A(cid:62) + A.\nIn machine learning, automatic differentiation is mostly used for numerical optimization. Many\nmachine learning optimization problems are formulated in standard matrix language. This has not\nonly the advantage of a compact problem representation, but linear algebra expressions can also be\nexecuted ef\ufb01ciently on modern CPUs and GPUs by mapping them onto highly tuned BLAS (basic\nlinear algebra subprograms) implementations that make extensive use of vectorization such as the\nSSE and the AVX instruction sets or SIMD architectures. 
Ideally, these advantages are kept also\nfor the gradients and Hessians of the expressions that encode the optimization problem. This is the\npurpose of matrix calculus that computes, if possible, derivatives of matrix expressions again as\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fmatrix expressions. In matrix calculus, the gradient of f (x) = x(cid:62)Ax is the expression A(cid:62)x + Ax\nand its Hessian is the expression A(cid:62) + A, which can be ef\ufb01ciently evaluated.\nSurprisingly, none of the classic computer algebra systems such as Mathematica, Maple, Sage, or\nSymPy1 supports matrix calculus. These systems calculate derivatives of a given matrix expression\nonly on matrix entry level, i.e., every matrix entry gives rise to a separate symbolic scalar variable.\nEach of these systems is able to compute derivatives of univariate scalar-valued functions. Hence,\nwhile in pure matrix calculus a matrix A \u2208 Rn\u00d7n is represented by just one variable, the classic\ncomputer algebra systems treat this as n2 variables. Thus, the number of input variables increases\ndramatically and the complexity of evaluating derivatives explodes. Also, it is impossible to read off\nthe derivative in matrix notation from such complex expressions. These drawbacks are also present in\nthe classic frameworks for automatic differentiation that mostly compute derivatives only on scalar\nlevel, like ADOL-C [25] or TAPENADE [10]. While the direct integration of matrix and tensor\noperators into automatic differentiation frameworks like TensorFlow is under active development, so\nfar the output functions still have to be scalar. Hence, higher order derivatives or Jacobians cannot be\ncomputed directly.\nContributions. 
We provide an algorithmic framework for computing higher order derivatives of\nmatrix and tensor expressions ef\ufb01ciently, which fully operates on tensors, i.e., all variables are\nallowed to be tensors of any order, including the output variables. Therefore, for the \ufb01rst time higher\norder derivatives and Jacobians can be computed directly. The derivatives are represented as compact\nmatrix and tensor expressions which can be mapped to ef\ufb01cient BLAS implementations. Other state-\nof-the-art frameworks produce huge expression graphs when deriving higher order derivatives. These\nexpression graphs cannot be mapped to simple BLAS calls but involve many for-loops and complex\nmemory access. Hence, we observe an increase in ef\ufb01ciency of up to two orders of magnitude on\nCPUs. Since GPUs can deal with this even worse we observe here an increase in ef\ufb01ciency of about\nthree orders of magnitude over current state-of-the-art approaches. An important bene\ufb01t of working\non tensors is that it enables symbolic matrix calculus. We show that the predominantly used standard\nmatrix language is not well suited for symbolic matrix calculus, in contrast to a tensor representation.\nThus, to implement matrix calculus, we \ufb01rst translate linear algebra matrix expressions into a tensor\nrepresentation, then compute derivatives in this representation and \ufb01nally translate the result back\ninto the standard matrix language.\nRelated work. A great overview and many details on fundamentals and more advanced topics of\nautomatic differentiation can be found in the book by Griewank and Walther [9]. Baydin et al. [2]\nprovide an excellent survey on the automatic differentiation methods used within machine learning.\nPearlmutter [17] discusses the following approach for computing higher order derivatives, like the\nHessian of a function f : Rn \u2192 R, by automatic differentiation: First, an expression for the gradient\n\u2207f is computed. 
Then the Hessian is computed column-wise by multiplying the gradient with the\nstandard basis vectors ei, i = 1, . . . , n and differentiating the resulting n scalar functions (\u2207f )(cid:62)ei.\nThe derivative of (\u2207f )(cid:62)ei gives the i-th column of the Hessian of f. Gebremedhin et al. [7] show\nthat a few columns of the Hessian can be computed at once with help of graph coloring algorithms, if\nthe Hessian is sparse and has a special structure. However, if the Hessian has no special structure,\nthen all n columns need to be computed individually. This can be seen, for instance, in TensorFlow\u2019s\nexpression graph for computing the Hessian of x(cid:62)Ax that has more than one million nodes for a still\nreasonably small value of n = 1000.\nAn alternative approach is presented in the book by Magnus and Neudecker [15] that provides an\nextensive number of rules for deriving matrix derivatives. At its core, matrices are turned into vectors\nby the vec function that stacks the columns of a matrix into one long vector. Then the Kronecker\nmatrix product is used to emulate higher order tensors. This approach works well for computing \ufb01rst\norder derivatives. However, it is not practicable for computing higher order derivatives. Also, many\nof the rules rely on the fact that the vec function assumes column-major ordering (Fortran style). If\nhowever, one switches to a programming language that uses row-major ordering (C style), then some\nof the formulas do not hold anymore. For instance, the Python NumPy package that is widely used in\nthe machine learning community follows row-major ordering when used with standard settings. To\nbe independent of such a convention and to be able to compute higher order derivatives, we work\ndirectly with higher order tensors.\n\n1Earlier versions of SymPy contained an algorithm for computing matrix derivatives. 
However, since it some-\ntimes produced erroneous results it has been removed, see, e.g., https://github.com/sympy/sympy/issues/10509.\n\n2\n\n\fThe work by Giles [8] collects a number of derivatives for matrix operators, i.e., pushforward\nand pullback functions for automatic differentiation. Similarly, Seeger et al. [22] provide methods\nand code for computing derivatives for Cholesky factorization, QR decomposition, and symmetric\neigenvalue decomposition when seen as matrix operators. However, they all require that the output\nfunction is scalar-valued, and hence, cannot be generalized to higher order derivatives.\n\n2 Languages for matrix and tensor expressions\n\nMatrix expressions are typically written in a simple yet very effective language. This language\nfeatures two types of entities, namely objects and operators. The objects of the language are scalars,\nvectors and matrices, and the operators include addition, various forms of multiplication, transposition,\ninverses, determinants, traces, and element-wise functions that can be applied to objects.\nProblems with the standard matrix language. Since matrices can be used to encode different\nentities like linear maps, bilinear maps, and transformations that describe a change of basis, the matrix\nlanguage sometimes enforces an indirect encoding of these entities. For instance, let the square\nmatrix A encode a bilinear map on some vector space, i.e., the entries of A represent the evaluation of\nthe bilinear map on any combination of basis vectors. Assume we want to evaluate the bilinear map\nat the vectors x and y whose entries store the respective coef\ufb01cients with respect to the same basis\nthat is used for specifying A. The evaluation of the bilinear map at x and y is then typically written as\nx(cid:62)Ay, which basically means: apply the linear map that is encoded by A to y and feed the resulting\nvector into the linear form x(cid:62). 
Note that transposing a vector means mapping it to an element in the dual vector space. The scalar value x\u22a4Ay is not affected by the change of interpretation; the value of the bilinear map evaluated at x and y is the same as the value of the evaluation of the linear form x\u22a4 at the vector Ay. Hence, the existence of different interpretations is not a problem when evaluating matrix expressions. But it becomes a problem once we want to compute derivatives of such expressions. For instance, if we want to compute the derivative of the expression x\u22a4Ax with respect to the vector x, then we also need to compute the derivative of the transformation that maps x into the linear form x\u22a4, and it is not obvious how to do this. In fact, matrix notation does not contain an expression to represent this derivative.\nRicci calculus. The problem is avoided by turning to a different language for encoding matrix expressions, namely Ricci calculus [20]. Ricci calculus lacks the simplicity of the standard language for matrix expressions, but is more precise and can distinguish between linear maps and bilinear maps through the use of indices. In Ricci calculus one distinguishes two types of indices, namely upper (contravariant) and lower (covariant) indices, that determine the behavior of the encoded objects under basis changes. Scalars have no index, vectors have one upper, and covectors one lower index. Bilinear forms on a vector space have two lower indices, bilinear maps on the dual space have two upper indices, and linear maps have one upper and one lower index. Hence, the bilinear map A evaluated at the vectors x and y is written in Ricci calculus as x^i A_{ij} y^j, or equivalently A_{ij} x^i y^j. Ricci calculus does not include the transposition transformation, but features \u03b4-tensors. The linear identity map is encoded in Ricci calculus by the \u03b4-tensor \u03b4^i_j. 
The \u03b4-tensors \u03b4_{ij} and \u03b4^{ij} have no interpretation as linear maps but serve the purpose of transposition; a vector x^i is mapped by \u03b4_{ij} to the covector \u03b4_{ij} x^i and the covector x_i is mapped by \u03b4^{ij} to the vector \u03b4^{ij} x_i. In Ricci calculus one distinguishes between free and bound indices. Bound indices appear as lower and also as upper indices in the expression, while free indices appear either as lower or upper indices. Hence, the expression \u03b4_{ij} x^i has one free, lower index, namely j, and thus encodes a covector, while \u03b4^{ij} x_i has one free, upper index, again j, and encodes a vector. The expression A_{ij} x^i x^j has no free indices and thus encodes a scalar.\nLet us come back to the issue of the interpretation of an expression. The different interpretations of the expression x\u22a4Ay can be distinguished in Ricci calculus; A_{ij} x^i y^j is a bilinear map evaluated at the vectors x^i and y^j, and x_i A^i_j y^j is the linear form x_i evaluated at the vector A^i_j y^j. The interpretation that the vector x is first mapped to the covector x\u22a4 and then evaluated at the image of the vector y after applying the linear map A is written in Ricci calculus as x^i \u03b4_{ij} A^j_k y^k.\nHence, our approach for computing derivatives of matrix expressions is to translate them first into expressions in Ricci calculus, see Table 1 for examples, and then to compute the derivatives for these expressions. The resulting derivative is again an expression in Ricci calculus that can be translated back into the standard matrix calculus language.\n\nTable 1: Translation of expressions from matrix notation into the Ricci calculus language. 
Here, \u2299 denotes the element-wise multiplication and diag(x) the matrix whose diagonal is the vector x.\n\nMatrix notation | Ricci calculus\nc = x\u22a4y | c = x_i y^i\nx = Ay | x^i = A^i_j y^j\nx\u22a4 = y\u22a4A | x_j = y_i A^i_j\nC = A \u00b7 B | C^i_k = A^i_j B^j_k\nA = xy\u22a4 | A^i_j = x^i y_j\nz = x \u2299 y | z^i = x^i y^i\nB = A diag(x) | B^i_j = A^i_j x^j\nB = diag(x)A | B^i_j = x^i A^i_j\n\nElements of Ricci calculus. Only objects with exactly the same indices can be added or subtracted. For instance, the addition x^i + y^i is a valid expression in Ricci calculus, but the expression x^i + y^j is not. Obviously, also x^i + x_i is not a valid expression, because there is no addition of vectors and covectors. The standard multiplication of objects that do not share an index is the outer product (or standard tensor product). For example, the product x^i y_j describes a matrix that encodes a linear map. A multiplication that contains the same index once as an upper and once as a lower index describes an inner product, or contraction. A contraction entails the sum over all entries in the object that are addressed by the shared index. For instance, x^i y_i encodes the evaluation of the linear form y_i at the vector x^i, A^j_i x^i encodes the evaluation of the linear map A^j_i at the vector x^i. Note that, in contrast to standard matrix multiplication, multiplication in Ricci calculus is commutative. This will be very helpful later when we are deriving algorithms for computing derivatives. Applying element-wise unary functions like exp to some object keeps the indices, i.e., exp(A^i_j) = exp(A)^i_j.\nOur version of Ricci calculus also features some special symbols. For instance, 0^i_j encodes the linear map that maps any vector to 0. Hence, all entries of the matrix 0^i_j are 0. Similarly, 0^i encodes the 0-vector. The determinant of a matrix A^j_i is denoted as det(A^j_i). 
The trace of A^j_i does not need a special symbol in Ricci calculus, because it can be encoded by the expression A^j_i \u03b4^i_j.\nExtension to higher order tensors. Since Ricci calculus can be extended easily to include more general tensor expressions we can also generalize our approach to computing gradients of more general tensor expressions. Higher order tensors just have more (free) indices. Note that for encoding higher order tensors, the use of some form of indices is unavoidable. Hence, we can use the Ricci calculus language directly for specifying such expressions and their derivatives; there is no need for translating them into another language. However, the translation makes sense for matrix expressions since the succinct, index free, standard language is very popular and heavily used in machine learning.\n\n3 Tensor calculus\n\nTo process mathematical expressions, we usually represent them by a tree, or more generally, by a directed acyclic graph (DAG). The roots of the DAG, referred to as input nodes, have no parents and represent the variables. The leaves of the DAG, or output nodes, have no children and represent the functions that the DAG computes. Let the DAG have n input nodes (variables) and m output nodes (functions). We label the input nodes x[0], ..., x[n \u2212 1], the output nodes y[0], ..., y[m \u2212 1], and the internal nodes v[0], ..., v[k \u2212 1]. Every internal and every output node represents either a unary or a binary operator. The arguments of these operators are supplied by their parent nodes. Hence, every node in the DAG can have at most two parents, but multiple children. Every edge that connects a parent x with a child f is labeled by the easy to compute derivative of f with respect to x, i.e., the derivative \u2202f/\u2202x. For instance, if f = x \u00b7 y (multiplication operator), then the label of the edge (x, f) is y. 
In case of f = x + y (addition operator), the label of this edge would be 1, and for f = sin(x) (sine operator), the label is cos(x). Figure 1 illustrates the case f = x\u22a4Ax.\nIn automatic differentiation, one distinguishes between forward and reverse mode. Both modes are derived from the chain rule. In forward mode, the derivatives are computed from roots to leaves along the edges, and in reverse mode they are computed from leaves to roots. The edge labels are then multiplied along all paths and the products are summed up and stored in the nodes of the DAG.\nForward mode. In forward mode for computing derivatives with respect to the input variable x[j], each node v[i] will eventually store the derivative \u02d9v[i] = \u2202v[i]/\u2202x[j], which is computed from roots to leaves: At the root nodes representing the variables x[i], the derivatives \u2202x[i]/\u2202x[j] are stored. Then the derivatives that are stored at the remaining nodes, here called f, are iteratively computed by summing over all their incoming edges using the following equation:\n\n\u02d9f = \u2202f/\u2202x[j] = \u2211_{x : x is parent of f} \u2202f/\u2202x \u00b7 \u2202x/\u2202x[j] = \u2211_{x : x is parent of f} \u2202f/\u2202x \u00b7 \u02d9x,    (1)\n\nwhere the \u2202f/\u2202x are the labels of the incoming edges of f and the \u02d9x have been computed before and are stored at the parent nodes x of f. This means that the derivative of each function is stored at the corresponding leaf y[i] of the expression DAG. Obviously, the updates can be done simultaneously for one input variable x[j] and all output nodes y[i]. Computing the derivatives with respect to all input variables requires n such rounds.\nReverse mode. 
Reverse mode automatic differentiation proceeds similarly, but from leaves to roots. Each node v[i] will eventually store the derivative \u00afv[i] = \u2202y[j]/\u2202v[i], where y[j] is the function to be differentiated. This is done as follows: First, the derivatives \u2202y[j]/\u2202y[i] are stored at the leaves of the DAG. Then the derivatives that are stored at the remaining nodes, here called x, are iteratively computed by summing over all their outgoing edges using the following equation:\n\n\u00afx = \u2202y[j]/\u2202x = \u2211_{f : f is child of x} \u2202y[j]/\u2202f \u00b7 \u2202f/\u2202x = \u2211_{f : f is child of x} \u00aff \u00b7 \u2202f/\u2202x,    (2)\n\nwhere the \u2202f/\u2202x are the labels of the outgoing edges of x and the \u00aff have been computed before and are stored at the children f of x. This means that the derivative of the function y[j] with respect to all the variables x[i] is stored at the corresponding roots of the expression DAG. Computing the derivatives for all the output functions requires m such rounds.\nTensor calculus. Tensor calculus can be implemented straightforwardly using either the forward or reverse mode. The input nodes are now considered tensors (symbols) like x^i or A^j_i. These symbols are combined by the inner and the output nodes of the expression DAG into more complex tensor expressions like A^j_i x^i. In contrast to existing approaches, the output nodes can represent non-scalar tensor expressions, too. To compute the derivatives, the edges of the expression DAG are labeled by the corresponding tensor derivatives \u02d9v[i] in forward mode and \u00afv[i] in reverse mode, respectively. Derivatives are then computed at expression level as described above. To the best of our knowledge, this is the first approach for applying automatic differentiation to matrix and tensor expressions that treats forward and reverse mode equally. 
We show in Section 4 that the symbolic expressions for higher order derivatives, obtained through this approach, can be evaluated very efficiently.\nWhy is standard matrix calculus more complicated? The standard language for matrix expressions in linear algebra is, as discussed in Section 2, not precise. The multiplication operator, in particular, is overloaded and does not refer to one but to several different multiplications. The following examples illustrate how this lack of precision complicates the process of computing derivatives.\nConsider the simple matrix multiplication C = AB as part of an expression DAG. It holds that \u02d9C = \u02d9AB + A\u02d9B for the forward mode and \u00afA = \u00afCB\u22a4 and \u00afB = A\u22a4\u00afC for the reverse mode, see [8]. These equations cannot be instantiations of Equations (1) and (2) for the following reasons: Consider the two edges from the (parent) nodes that represent A and B, respectively, to the node that represents C = AB. We have \u2202C/\u2202A = B and \u2202C/\u2202B = A. In forward mode we have to multiply the differential \u02d9A with \u2202C/\u2202A from the right and the differential \u02d9B with \u2202C/\u2202B from the left. In Equation (1), though, both multiplications are always from the left. Similarly, in reverse mode, we also multiply once from the left and once from the right, while both multiplications are always from the right in Equation (2). Furthermore, the multiplication in reverse mode is not with \u2202C/\u2202A and \u2202C/\u2202B, respectively, but with their transposes. This might seem negligible at first, to be fixed by slight adjustments to Equations (1) and (2). The expression c = det(A) shows that this is not so easy. The DAG for this expression has only two nodes, an input node (parent) that represents the matrix A and its child (output node) that represents c = det(A). In forward mode, conventional approaches yield \u02d9c = c tr(inv(A) \u02d9A), yet \u00afA = \u00afc c inv(A)\u22a4 in reverse mode, see again [8]. It is impossible to bring these equations into the form of Equations (1) and (2). Using Ricci calculus, Equations (1) and (2) hold with the same edge label \u2202c/\u2202A^i_j = adj(A^i_j) for both forward and reverse mode, where adj is the adjoint of the matrix A^i_j. We also want to point out that the equations from [8] that we used here in conjunction with standard matrix language are only valid for scalar expressions, i.e., scalar output functions. Our Ricci calculus-based approach accommodates matrix- or even higher order tensor-valued expressions and consequently higher order derivatives just as well. Hence, while standard matrix language is the preferred notation in linear algebra, it is not well suited for computing derivatives. Switching to Ricci calculus leads to a clean and elegant tensor calculus that is, without exceptions, rooted in Equations (1) and (2).\n\nFigure 1: Expression DAG for x\u22a4Ax used for computing first order derivatives.\n\nFigure 2: Expression DAG for Ax + A\u22a4x used for computing second order derivatives.\n\nExample. To illustrate the tensor calculus algorithm we demonstrate it on the rather simple example f = (x\u22a4A)x and compute derivatives of f with respect to x. We provide an outline of the steps of the algorithm in computing the first and the second order derivatives through forward mode automatic differentiation. The individual steps of the reverse mode for this example can be found in the supplemental material.\nFirst, the standard matrix language expression (x\u22a4A)x is translated into its corresponding Ricci calculus expression x_i A^i_j x^j. Figure 1 shows the corresponding expression DAG and Table 2 shows the individual steps for computing the gradient with respect to x^k. The derivative can be read off from the last line of the table. 
Taking the transpose gives the gradient, i.e., A^i_j x^j \u03b4_{ik} \u03b4^{kk} + x_i A^i_k \u03b4^{kk} = A^k_j x^j + A^i_k \u03b4^{kk} \u03b4_{ii} x^i. Translating this expression back into matrix notation yields Ax + A\u22a4x.\n\nTable 2: Individual steps of the forward mode automatic differentiation for x\u22a4Ax with respect to x.\n\nForward trace | Forward derivative trace\nx[0] = x^j | \u02d9x[0] = \u03b4^j_k\nx[1] = A^i_j | \u02d9x[1] = 0^i_{jk}\nv[0] = x[0] \u03b4^i_j = x^i | \u02d9v[0] = \u02d9x[0] \u03b4^i_j = \u03b4^i_k\nv[1] = v[0] \u03b4_{ii} = x_i | \u02d9v[1] = \u02d9v[0] \u03b4_{ii} = \u03b4_{ik}\nv[2] = v[1] x[1] = x_i A^i_j | \u02d9v[2] = \u02d9v[1] x[1] + \u02d9x[1] v[1] = A^i_j \u03b4_{ik} + 0_{jk}\ny[0] = v[2] x[0] = x_i A^i_j x^j | \u02d9y[0] = \u02d9v[2] x[0] + \u02d9x[0] v[2] = A^i_j x^j \u03b4_{ik} + x_i A^i_k\n\nTo obtain the Hessian we just compute the derivative of A^k_j x^j + A^i_k \u03b4^{kk} \u03b4_{ii} x^i with respect to x^l. Figure 2 shows the corresponding expression DAG and Table 3 contains the individual steps for computing the Hessian that can be read off the last line of the table as A^k_l + (A^l_k)\u22a4. Translating back to matrix notation and taking the transpose yields the desired result A + A\u22a4. 
Note that we simplified the expression in Table 3 by writing (A^i_k)\u22a4 instead of A^i_k \u03b4^{kk} \u03b4_{ii}.\n\nTable 3: Individual steps of the forward mode computation of the Hessian of x\u22a4Ax with respect to x.\n\nForward trace | Forward derivative trace\nx[0] = x^j | \u02d9x[0] = \u03b4^j_l\nx[1] = A^i_j | \u02d9x[1] = 0^i_{jl}\nv[0] = x[0] \u03b4^i_j = x^i | \u02d9v[0] = \u02d9x[0] \u03b4^i_j = \u03b4^i_l\nv[1] = x[1] \u03b4^k_i = A^k_j | \u02d9v[1] = \u02d9x[1] \u03b4^k_i = 0^k_{jl}\nv[2] = x[1] \u03b4^{jk} \u03b4_{ii} = (A^i_k)\u22a4 | \u02d9v[2] = \u02d9x[1] \u03b4^{jk} \u03b4_{ii} = 0^k_{il}\nv[3] = x[0] v[1] = A^k_j x^j | \u02d9v[3] = \u02d9x[0] v[1] + \u02d9v[1] x[0] = A^k_l + 0^k_l\nv[4] = v[2] v[0] = (A^i_k)\u22a4 x^i | \u02d9v[4] = \u02d9v[2] v[0] + \u02d9v[0] v[2] = 0^k_l + (A^l_k)\u22a4\ny[0] = v[3] + v[4] = A^k_j x^j + (A^i_k)\u22a4 x^i | \u02d9y[0] = 1 \u00b7 \u02d9v[3] + 1 \u00b7 \u02d9v[4] = A^k_l + (A^l_k)\u22a4\n\n4 Experiments\n\nWe have implemented our algorithms in Python. To evaluate expression graphs we use the NumPy and CuPy packages. Our framework can perform forward mode as well as reverse mode automatic differentiation. Reverse mode is the more involved mode (it has a forward and a backward pass) and it is also the one that is commonly used since it allows computing derivatives with respect to many input variables simultaneously. Hence, in the experiments we only use reverse mode automatic differentiation. Similarly to all other frameworks, our implementation also performs some expression simplifications, i.e., constant folding, pruning of zero tensors, and removal of multiplication by \u03b4 tensors when applicable. 
An interface to our framework for computing vector and matrix derivatives\nis available online at www.MatrixCalculus.org.\nExperimental set up. We compare our framework to the state-of-the-art automatic differentiation\nframeworks TensorFlow 1.10, PyTorch 0.4, Theano 1.0, and HIPS autograd 1.2 used with Python 3.6,\nthat were all linked against Intel MKL. All these frameworks support reverse mode automatic\ndifferentiation for computing \ufb01rst order derivatives. They all compute the Hessian row by row by\niteratively computing products of the Hessian with standard basis vectors. All frameworks provide\ninterfaces for computing Hessians except PyTorch. Here, we follow the instructions of its developers.\nThe experiments were run in a pure CPU setting (Intel Xeon E5-2686, four cores) as well as in a pure\nGPU setting (NVIDIA Tesla V100), except for autograd, that does not provide GPU support.\nHessians are not needed for large-scale problems that typically arise in deep learning. However, in\noptimization problems arising from \u2018classical\u2019 machine learning problems like logistic regression,\nelastic net, or inference in graphical models, optimization algorithms based on Newton steps can\nbe faster than gradient based algorithms when the number of optimization variables is in the few\nthousands. The Newton steps entail solving a system of linear equations. A direct solver for such\nsystems of moderate dimension can be much faster than an iterative solver that computes Hessian-\nvector products. This is particularly true for ill-conditioned problems. Hence, we chose three\nrepresentative \u2018classical\u2019 machine learning problems for our experiments, namely quadratic functions,\nlogistic regression, and matrix factorization. For these problems we ran two sets of experiments. 
In the first set we measured the running times for evaluating the function value and the gradient together, and in a second set we measured the running times for evaluating the Hessian. The evaluation of Jacobians is implicitly covered in the experiments since all frameworks run the same code both for evaluating Jacobians and for evaluating Hessians. Hence, we restrict ourselves to Hessians here.\nThe measurements do not include the time needed for constructing the expression graphs for the gradients and the Hessians. TensorFlow and Theano took a few seconds to create the expression graphs for the second order derivatives while our approach only needed roughly 20 milliseconds. PyTorch and HIPS autograd create the expression graph for the derivative dynamically while evaluating the function, hence, its creation time cannot be determined.\nQuadratic function. Probably the simplest function that has a non-vanishing Hessian is the quadratic function x\u22a4Ax that we have been using as an example throughout this paper. Several classical examples from machine learning entail optimizing a quadratic function, for instance, the dual of a support vector machine [5], least squares regression, LASSO [24], or Gaussian processes [19], to name just a few. Of course, the Hessian of a quadratic function can be easily computed by hand. However, here we want to illustrate that even for this simple example, running times can vary dramatically.\n\nFigure 3: Log-log plot of the running times for evaluating the Hessian of the quadratic function (left), logistic regression (middle), matrix factorization (right) on the CPU. See the supplemental material for a table with the running times.\n\nLogistic regression. Logistic regression [6] is probably one of the most commonly used methods for classification. Given a set of m data points X \u2208 R^{m\u00d7n} along with a set of binary labels y \u2208 {\u00b11}^m, logistic regression aims at minimizing the loss function \u2211_i log(exp(\u2212y^(i) (X^(i) w)) + 1), where w \u2208 R^n is the weight vector, X^(i) is the i-th data point (i-th row of X), and y^(i) the corresponding i-th label. The data matrix X can be composed of the true input features, features transformed by basis functions/kernels [4, 21], or by random basis functions [18], or by features that have been learned by a deep net [11]. We set m = 2n in the experiments.\nMatrix factorization. Matrix factorization can be stated as the problem min_{U,V} \u2016T \u2212 UV\u22a4\u2016^2_\u2126, where T \u2208 R^{m\u00d7n} is some target matrix, U \u2208 R^{m\u00d7k} and V \u2208 R^{n\u00d7k} are the low-rank factor matrices, and \u2126 \u2208 {0, 1}^{m\u00d7n} is an indicator matrix that defines which elements of T are known. Matrix factorization is mainly used in the context of recommender systems [13] or natural language processing [3, 12]. For the experiments, we set k = 5 and compute the gradient and Hessian with respect to U. Note that the Hessian is a fourth order tensor. In Ricci calculus it reads as\n2\u03b4n\nResults. The experiments show that basically all frameworks are equally fast when evaluating first order derivatives. This is no longer true in the case of second order derivatives like Hessians. Since our approach extends naturally to higher order derivatives, it should not be surprising that it is faster than the reference frameworks. Indeed, as can be seen in Figure 3, it is up to two orders of magnitude faster than the existing frameworks on the CPU. On the GPU, the speedup is about three orders of magnitude, see the supplemental material, where you also find the remaining results and more details. 
The reason\nfor this speedup is the fact, that our approach is able to produce compact matrix expressions that\ncan be mapped to ef\ufb01cient BLAS implementations whereas all other approaches produce fairly large\nexpression graphs whose evaluation involves many for-loops and complex memory access. The GPU\ncan deal with this even worse which leads to even larger speed-ups on the GPU.\n\ni \u03b4jj\u03b4ii, or in our slightly abbreviated notation as 2\u03b4n\n\nl V k\n\ni (V j\n\nm\u03b4j\n\nl V k\n\ni V j\n\nm\u03b4j\n\ni )(cid:62).\n\n5 Conclusion\n\nWe have presented the \ufb01rst algorithmic framework for computing matrix and tensor derivatives\nthat naturally extends to higher order derivatives. Experiments show that our approach achieves\nstate-of-the-art performance for the evaluation of \ufb01rst order derivatives. In the case of second order\nderivatives it is up to two orders of magnitude more ef\ufb01cient on CPUs and up to three orders of\nmagnitude more ef\ufb01cient on GPUs.\nOur framework operates directly on tensors using Ricci calculus to specify tensor expressions.\nOperating on tensors also enables classical, symbolic matrix calculus that appears notoriously dif\ufb01cult\nin standard matrix language. We showed that the dif\ufb01culties in computing matrix derivatives can be\nattributed to the predominantly used standard matrix language that, in contrast to Ricci calculus, is\nnot well suited for matrix calculus.\n\n8\n\n\fAcknowledgments\n\nS\u00f6ren Laue has been funded by Deutsche Forschungsgemeinschaft (DFG) under grant LA 2971/1-1.\nThis work has also been supported by the AWS Cloud Credits for Research program and by a gift\nfrom Google.\n\nReferences\n[1] Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu\nDevin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg,\nRajat Monga, Sherry Moore, Derek G. 
Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan,\nPete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for\nlarge-scale machine learning. In USENIX Conference on Operating Systems Design and\nImplementation (OSDI), pages 265\u2013283. USENIX Association, 2016.\n\n[2] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark\nSiskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning\nResearch, 18(153):1\u201343, 2018.\n\n[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of\nMachine Learning Research, 3:993\u20131022, 2003.\n\n[4] D.S. Broomhead and David Lowe. Multivariable functional interpolation and adaptive networks.\nComplex Systems, 2:321\u2013355, 1988.\n\n[5] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273\u2013297,\n1995.\n\n[6] David R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical\nSociety. Series B (Methodological), 20(2):215\u2013242, 1958.\n\n[7] Assefaw Hadish Gebremedhin, Arijit Tarafdar, Alex Pothen, and Andrea Walther. Efficient\ncomputation of sparse Hessians using coloring and automatic differentiation. INFORMS Journal\non Computing, 21(2):209\u2013223, 2009.\n\n[8] Mike B. Giles. Collected matrix derivative results for forward and reverse mode algorithmic\ndifferentiation. In Advances in Automatic Differentiation, pages 35\u201344. Springer Berlin Heidelberg,\n2008.\n\n[9] Andreas Griewank and Andrea Walther. Evaluating derivatives - principles and techniques of\nalgorithmic differentiation (2. ed.). SIAM, 2008.\n\n[10] Laurent Hasco\u00ebt and Val\u00e9rie Pascual. The Tapenade automatic differentiation tool: Principles,\nmodel, and specification. ACM Transactions on Mathematical Software, 39(3):20:1\u201320:43,\n2013.\n\n[11] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. 
A fast learning algorithm for deep\nbelief nets. Neural Computation, 18(7):1527\u20131554, 2006.\n\n[12] Thomas Hofmann. Probabilistic latent semantic analysis. In Conference on Uncertainty in\nArtificial Intelligence (UAI), pages 289\u2013296, 1999.\n\n[13] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for\nrecommender systems. Computer, 42(8):30\u201337, August 2009.\n\n[14] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Autograd: Effortless gradients in\nNumPy. In ICML AutoML workshop, 2015.\n\n[15] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics\nand Econometrics. John Wiley and Sons, third edition, 2007.\n\n[16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,\nZeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in\nPyTorch. In NIPS Autodiff workshop, 2017.\n\n[17] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147\u2013160,\n1994.\n\n[18] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances\nin Neural Information Processing Systems (NIPS), pages 1177\u20131184, 2007.\n\n[19] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine\nLearning. The MIT Press, 2005.\n\n[20] Gregorio Ricci and Tullio Levi-Civita. M\u00e9thodes de calcul diff\u00e9rentiel absolu et leurs\napplications. Mathematische Annalen, 54(1-2):125\u2013201, 1900.\n\n[21] Bernhard Sch\u00f6lkopf and Alexander Johannes Smola. Learning with Kernels: support vector\nmachines, regularization, optimization, and beyond. Adaptive computation and machine learning\nseries. MIT Press, 2002.\n\n[22] Matthias W. Seeger, Asmus Hetzel, Zhenwen Dai, and Neil D. Lawrence. Auto-differentiating\nlinear algebra. In NIPS Autodiff workshop, 2017.\n\n[23] Theano Development Team. 
Theano: A Python framework for fast computation of mathematical\nexpressions. arXiv e-prints, abs/1605.02688, May 2016.\n\n[24] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal\nStatistical Society. Series B (Methodological), 58(1):267\u2013288, 1996.\n\n[25] Andrea Walther and Andreas Griewank. Getting started with ADOL-C. In Combinatorial Scientific\nComputing, pages 181\u2013202. Chapman-Hall CRC Computational Science, 2012.\n", "award": [], "sourceid": 1455, "authors": [{"given_name": "Soeren", "family_name": "Laue", "institution": "Universitaet Jena"}, {"given_name": "Matthias", "family_name": "Mitterreiter", "institution": "Friedrich Schiller University Jena"}, {"given_name": "Joachim", "family_name": "Giesen", "institution": "Friedrich-Schiller-Universitat Jena"}]}