{"title": "Differentiating Functions of the Jacobian with Respect to the Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 435, "page_last": 441, "abstract": null, "full_text": "Differentiating Functions of the Jacobian \n\nwith Respect to the Weights \n\nGary William Flake \nNEC Research Institute \n4 Independence Way \nPrinceton, NJ 08540 \n\nflake@research.nj.nec.com \n\nBarak A. Pearlmutter \n\nDept of Computer Science, FEC 313 \n\nUniversity of New Mexico \nAlbuquerque, NM 87131 \n\nbap@cs.unm.edu \n\nAbstract \n\nFor many problems, the correct behavior of a model depends not only on \nits input-output mapping but also on properties of its Jacobian matrix, the \nmatrix of partial derivatives of the model's outputs with respect to its \ninputs. We introduce the J-prop algorithm, an efficient general method for \ncomputing the exact partial derivatives of a variety of simple functions of \nthe Jacobian of a model with respect to its free parameters. The algorithm \napplies to any parametrized feedforward model, including nonlinear \nregression, multilayer perceptrons, and radial basis function networks. \n\n1 Introduction \n\nLet f(x, w) be an n-input, m-output, twice-differentiable feedforward model with \ninput vector x and weight vector w. Its Jacobian matrix is defined as \n\n$$J = \\begin{bmatrix} \\frac{\\partial f_1}{\\partial x_1} & \\cdots & \\frac{\\partial f_1}{\\partial x_n} \\\\ \\vdots & \\ddots & \\vdots \\\\ \\frac{\\partial f_m}{\\partial x_1} & \\cdots & \\frac{\\partial f_m}{\\partial x_n} \\end{bmatrix} = \\frac{df(x, w)}{dx}.$$ \n\nThe algorithm we introduce can be used to optimize functions of the form \n\n$$E_u(w) = \\frac{1}{2} \\| J^T u - a \\|^2 \\tag{1}$$ \n\nor \n\n$$E_v(w) = \\frac{1}{2} \\| J v - b \\|^2 \\tag{2}$$ \n\nwhere u, v, a, and b are user-defined constants. Our algorithm, which we call J-prop, \ncalculates the exact value of either $\\partial E_u / \\partial w$ or $\\partial E_v / \\partial w$ in O(1) times the \ntime required to calculate the normal gradient. 
Thus, J-prop is suitable for training models \nto have specific first derivatives, or for implementing several other well-known algorithms \nsuch as Double Backpropagation [1] and Tangent Prop [2]. \n\nClearly, being able to optimize Equations 1 and 2 is useful; however, we suspect that the \nformalism which we use to derive our algorithm is actually more interesting because it \nallows J-prop to be adapted easily to a wide variety of model types and \nobjective functions. As such, we spend a fair portion of this paper describing the \nmathematical framework from which we later build J-prop. \n\nThis paper is divided into four more sections. Section 2 contains background information \nand motivation for why optimizing the properties of the Jacobian is an important problem. \nSection 3 introduces our formalism and contains the derivation of the J-prop algorithm. \nSection 4 contains a brief numerical example of J-prop. And, finally, Section 5 describes \nfurther work and gives our conclusions. \n\n2 Background and motivation \n\nPrevious work concerning the modeling of an unknown function and its derivatives can be \ndivided into works that are descriptive or prescriptive. Perhaps the best known descriptive \nresult is due to White et al. [3,4], who show that given noise-free data, a multilayer \nperceptron (MLP) can approximate the higher derivatives of an unknown function in the limit as \nthe number of training points goes to infinity. The difficulty with applying this result is the \nstrong requirements on the amount and integrity of the training data; requirements which \nare rarely met in practice. 
This problem was specifically demonstrated by Principe, Rathie \nand Kuo [5] and Deco and Sch\u00fcrmann [6], who showed that using noisy training data from \nchaotic systems can lead to models that are accurate in the input-output sense, but \ninaccurate in their estimates of quantities related to the Jacobian of the unknown system, such as \nthe largest Lyapunov exponent and the correlation dimension. \n\nMLPs are particularly problematic because large weights can lead to saturation at a \nparticular sigmoidal neuron which, in turn, results in extremely large first derivatives at the neuron \nwhen evaluated near the center of the sigmoid transition. Several methods to combat this \ntype of over-fitting have been proposed. One of the earliest methods, weight decay [7], \nuses a penalty term on the magnitude of the weights. Weight decay is arguably optimal \nfor models in which the output is linear in the weights because minimizing the magnitude \nof the weights is equivalent to minimizing the magnitude of the model's first derivatives. \nHowever, in the nonlinear case, weight decay can have suboptimal performance [1] \nbecause large (or small) weights do not always correspond to having large (or small) first \nderivatives. \n\nThe Double Backpropagation algorithm [1] adds an additional penalty term to the error \nfunction equal to $\\| \\partial E / \\partial x \\|^2$. Training on this function results in a form of regularization \nthat is in many ways an elegant combination of weight decay and training with noise: it is \nstrictly analytic (unlike training with noise) but it explicitly penalizes large first derivatives \nof the model (unlike weight decay). Double Backpropagation can be seen as a special case \nof J-prop, the algorithm derived in this paper. 
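As a rough illustration of the penalty just described, the sketch below evaluates the squared norm of dE/dx for a single tanh neuron with squared error. This is only a toy under stated assumptions, not the formulation of [1]: the single-neuron model, the function names, and the weighting coefficient lam are all hypothetical. It also shows why weight magnitude alone is a poor proxy for input sensitivity: the same large weight gives a penalty of 25.0 at the center of the transition and a value near zero when the neuron is saturated.

```python
import math

def error_and_input_grad(x, w, theta, target):
    # One tanh neuron: y = tanh(sum_k w_k x_k - theta), E = 0.5 * (y - target)^2.
    y = math.tanh(sum(w_k * x_k for w_k, x_k in zip(w, x)) - theta)
    err = 0.5 * (y - target) ** 2
    # Chain rule: dE/dx_k = (y - target) * (1 - y^2) * w_k.
    grad_x = [(y - target) * (1.0 - y * y) * w_k for w_k in w]
    return err, grad_x

def penalized_error(x, w, theta, target, lam):
    # E + lam * ||dE/dx||^2; lam is an assumed weighting coefficient.
    err, grad_x = error_and_input_grad(x, w, theta, target)
    penalty = sum(g * g for g in grad_x)
    return err + lam * penalty, penalty

# Same large weight, two different inputs.
w, theta, target, lam = [10.0], 0.0, 0.5, 0.1
_, penalty_center = penalized_error([0.0], w, theta, target, lam)     # mid-transition
_, penalty_saturated = penalized_error([1.0], w, theta, target, lam)  # saturated
print(penalty_center, penalty_saturated)
```

A weight-decay term would score both cases identically, since it depends only on w, whereas the derivative penalty separates them.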
\n\nAs to the general problem of coercing the first derivatives of a model to specific values, \nSimard et al. [2] introduced the Tangent Prop algorithm, which was used to train MLPs \nfor optical character recognition to be insensitive to small affine transformations in the \ncharacter space. Tangent Prop can also be considered a special case of J-prop. \n\n3 Derivation \n\nWe now define a formalism under which J-prop can be easily derived. The method is \nvery similar to a technique introduced by Pearlmutter [8] for calculating the product of the \nHessian of an MLP and an arbitrary vector. However, where Pearlmutter used differential \noperators applied to a model's weight space, we use differential operators defined with \nrespect to a model's input space. \n\nOur entire derivation is presented in five steps. First, we will define an auxiliary error \nfunction that has a few useful mathematical properties that simplify the derivation. Next, \nwe will define a special differential operator that can be applied to both the auxiliary error \nfunction, and its gradient with respect to the weights. We will then see that the result of \napplying the differential operator to the gradient of the auxiliary error function is equivalent \nto analytically calculating the derivatives required to optimize Equations 1 and 2. We then \nshow an example of the technique applied to an MLP. Finally, in the last step, the complete \nalgorithm is presented. \n\nTo avoid confusion, when referring to generic data-driven models, the model will always be \nexpressed as a vector function y = f(x, w), where x refers to the model input and w refers \nto a vector of all of the tunable parameters of the model. In this way, we can talk about \nmodels while ignoring the mechanics of how the models work internally. 
Complementary \nto the generic vector notation, the notation for an MLP uses only scalar symbols; however, \nthese symbols must refer to internal variables of the model (e.g., neuron thresholds, net \ninputs, weights, etc.), which can lead to some ambiguity. To be clear, when using vector \nnotation, the input and output of an MLP will always be denoted by x and y, respectively, \nand the collection of all of the weights (including biases) maps to the vector w. However, \nwhen using scalar arithmetic, the scalar notation for MLPs will apply. \n\n3.1 Auxiliary error function \n\nOur auxiliary error function, $\\bar{E}$, is defined as \n\n$$\\bar{E}(x, w) = u^T f(x, w). \\tag{3}$$ \n\nNote that we never actually optimize with respect to $\\bar{E}$; we define it only because it has the \nproperty that $\\partial \\bar{E} / \\partial x = u^T J$, which will be useful to the derivation shortly. Note that \n$\\partial \\bar{E} / \\partial x$ appears in the Taylor expansion of $\\bar{E}$ about a point in input space: \n\n$$\\bar{E}(x + \\Delta x, w) = \\bar{E}(x, w) + \\frac{\\partial \\bar{E}}{\\partial x} \\Delta x + O(\\|\\Delta x\\|^2). \\tag{4}$$ \n\nThus, while holding the weights, w, fixed and letting $\\Delta x$ be a perturbation of the input, x, \nEquation 4 characterizes how small changes in the input of the model change the value of \nthe auxiliary error function. \nBy setting $\\Delta x = r v$, with v being an arbitrary vector and r being a small value, we can \nrearrange Equation 4 into the form: \n\n$$\\frac{\\partial \\bar{E}}{\\partial x} v = \\frac{1}{r} \\left[ \\bar{E}(x + r v, w) - \\bar{E}(x, w) \\right] + O(r) = \\lim_{r \\to 0} \\frac{1}{r} \\left[ \\bar{E}(x + r v, w) - \\bar{E}(x, w) \\right] = \\left. \\frac{\\partial}{\\partial r} \\bar{E}(x + r v, w) \\right|_{r=0}. \\tag{5}$$ \n\nThis final expression will allow us to define the differential operator in the next subsection. \n\n3.2 Differential operator \n\nLet h(x, w) be an arbitrary twice differentiable function. We define the differential \noperator \n\n$$R_v\\{h(x, w)\\} \\equiv \\left. \\frac{\\partial}{\\partial r} h(x + r v, w) \\right|_{r=0}, \\tag{6}$$ \n\nwhich has the property that $R_v\\{\\bar{E}(x, w)\\} = u^T J v$. Being a differential operator, $R_v\\{\\cdot\\}$ \nobeys all of the standard rules for differentiation: 
\n\n$$\\begin{aligned} R_v\\{c\\} &= 0 \\\\ R_v\\{c \\cdot h(x, w)\\} &= c \\cdot R_v\\{h(x, w)\\} \\\\ R_v\\{h(x, w) + g(x, w)\\} &= R_v\\{h(x, w)\\} + R_v\\{g(x, w)\\} \\\\ R_v\\{h(x, w) \\cdot g(x, w)\\} &= R_v\\{h(x, w)\\} \\cdot g(x, w) + h(x, w) \\cdot R_v\\{g(x, w)\\} \\\\ R_v\\{h(g(x, w), w)\\} &= h'(g(x, w)) \\cdot R_v\\{g(x, w)\\} \\\\ R_v\\left\\{ \\frac{d}{dt} h(x, w) \\right\\} &= \\frac{d}{dt} R_v\\{h(x, w)\\} \\end{aligned}$$ \n\nThe operator also yields the identity $R_v\\{x\\} = v$. \n\n3.3 Equivalence \n\nWe will now see that the result of calculating $R_v\\{\\partial \\bar{E} / \\partial w\\}$ can be used to calculate both \n$\\partial E_u / \\partial w$ and $\\partial E_v / \\partial w$. Note that Equations 3-5 all assume that both u and v are \nindependent of x and w. To calculate $\\partial E_u / \\partial w$ and $\\partial E_v / \\partial w$, we will actually set u or \nv to a value that depends on both x and w; however, the derivation still works because \nour choices are explicitly made in such a way that the chain rule of differentiation is not \nsupposed to be applied to these terms. Hence, the correct analytical solution is obtained \ndespite the dependence. \n\nTo optimize with respect to Equation 1, we use: \n\n$$\\frac{\\partial}{\\partial w} \\frac{1}{2} \\| J^T u - a \\|^2 = \\left( \\frac{\\partial u^T J}{\\partial w} \\right)^T (J^T u - a) = R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial w} \\right\\}, \\tag{7}$$ \n\nwith $v = (J^T u - a)$. To optimize with respect to Equation 2, we use: \n\n$$\\frac{\\partial}{\\partial w} \\frac{1}{2} \\| J v - b \\|^2 = (J v - b)^T \\left( \\frac{\\partial J v}{\\partial w} \\right) = R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial w} \\right\\}, \\tag{8}$$ \n\nwith $u = (J v - b)$. \n\n3.4 Method applied to MLPs \n\nWe are now ready to see how this technique can be applied to a specific type of model. \nConsider an MLP with L + 1 layers of nodes defined by the equations: \n\n$$y_i^l = g(x_i^l) \\tag{9}$$ \n\n$$x_i^l = \\sum_j^{N_l} y_j^{l-1} w_{ij}^l - \\theta_i^l \\tag{10}$$ \n\nIn these equations, superscripts denote the layer number (starting at 0), subscripts index \nover terms in a particular layer, and $N_l$ is the number of input nodes in layer l. Thus, $y_i^l$ is \nthe output of neuron i at node layer l, and $x_i^l$ is the net input coming into the same neuron. \nMoreover, $y_i^L$ is an output of the entire MLP while $y_i^0$ is an input going into the MLP. 
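The definitions introduced so far can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes tanh for g, a hypothetical 2-3-2 network with arbitrary weights, and helper names (forward, aux_error, r_op, jacobian_fd) that do not appear in the paper. Central differences stand in for the exact derivative in Equation 6, so the agreement is approximate rather than exact.

```python
import math

def forward(x, layers):
    # Equations 9-10 with g = tanh (an assumed choice of activation):
    # net_i = sum_j y_j * w_ij - theta_i, then y_i = g(net_i).
    y = list(x)
    for weights, thetas in layers:
        net = [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) - theta
               for row, theta in zip(weights, thetas)]
        y = [math.tanh(n) for n in net]
    return y

def aux_error(x, layers, u):
    # Equation 3: E_bar(x, w) = u^T f(x, w).
    return sum(u_i * y_i for u_i, y_i in zip(u, forward(x, layers)))

def r_op(h, x, v, r=1e-5):
    # Central-difference stand-in for Equation 6:
    # R_v{h} = d/dr h(x + r v) evaluated at r = 0, for scalar-valued h.
    x_plus = [x_k + r * v_k for x_k, v_k in zip(x, v)]
    x_minus = [x_k - r * v_k for x_k, v_k in zip(x, v)]
    return (h(x_plus) - h(x_minus)) / (2.0 * r)

def jacobian_fd(x, layers, r=1e-5):
    # J[i][k] = d f_i / d x_k, estimated one input coordinate at a time.
    n_out = len(forward(x, layers))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for k in range(len(x)):
        e_k = [1.0 if j == k else 0.0 for j in range(len(x))]
        for i in range(n_out):
            jac[i][k] = r_op(lambda z: forward(z, layers)[i], x, e_k, r)
    return jac

# Hypothetical 2-3-2 network: each layer is (weight rows, thresholds).
layers = [
    ([[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]], [0.1, -0.2, 0.05]),
    ([[0.7, -0.4, 0.3], [0.2, 0.6, -0.5]], [0.0, 0.1]),
]
x, u, v = [0.3, -0.7], [1.0, -2.0], [0.4, 0.25]

J = jacobian_fd(x, layers)
u_J_v = sum(u[i] * J[i][k] * v[k] for i in range(len(u)) for k in range(len(v)))
r_v_E = r_op(lambda z: aux_error(z, layers, u), x, v)
print(abs(u_J_v - r_v_E))  # small: R_v{E_bar} agrees with u^T J v
```

The point of the operator is that one directional derivative gives the product with J without forming J. Here the explicit Jacobian is built only to check the identity.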
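\n\nThe feedback equations calculated with respect to $\\bar{E}$ are: \n\n$$\\frac{\\partial \\bar{E}}{\\partial y_i^L} = u_i \\tag{11}$$ \n\n$$\\frac{\\partial \\bar{E}}{\\partial y_i^l} = \\sum_j \\frac{\\partial \\bar{E}}{\\partial x_j^{l+1}} w_{ji}^{l+1} \\quad \\text{(for } l < L \\text{)} \\tag{12}$$ \n\n$$\\frac{\\partial \\bar{E}}{\\partial x_i^l} = g'(x_i^l) \\frac{\\partial \\bar{E}}{\\partial y_i^l} \\tag{13}$$ \n\n$$\\frac{\\partial \\bar{E}}{\\partial w_{ij}^l} = y_j^{l-1} \\frac{\\partial \\bar{E}}{\\partial x_i^l} \\tag{14}$$ \n\n$$\\frac{\\partial \\bar{E}}{\\partial \\theta_i^l} = -\\frac{\\partial \\bar{E}}{\\partial x_i^l} \\tag{15}$$ \n\nwhere the $u_i$ term is a component in the vector u from Equation 1. Applying the $R_v\\{\\cdot\\}$ \noperator to the feedforward equations yields: \n\n$$R_v\\{y_i^0\\} = v_i \\tag{16}$$ \n\n$$R_v\\{y_i^l\\} = g'(x_i^l) R_v\\{x_i^l\\} \\quad \\text{(for } l > 0 \\text{)} \\tag{17}$$ \n\n$$R_v\\{x_i^l\\} = \\sum_j^{N_l} R_v\\{y_j^{l-1}\\} w_{ij}^l \\tag{18}$$ \n\nwhere the $v_i$ term is a component in the vector v from Equation 2. As the final step, we \napply the $R_v\\{\\cdot\\}$ operator to the feedback equations, which yields: \n\n$$R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial y_i^L} \\right\\} = 0 \\tag{19}$$ \n\n$$R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial y_i^l} \\right\\} = \\sum_j R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial x_j^{l+1}} \\right\\} w_{ji}^{l+1} \\quad \\text{(for } l < L \\text{)} \\tag{20}$$ \n\n$$R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial x_i^l} \\right\\} = g''(x_i^l) R_v\\{x_i^l\\} \\frac{\\partial \\bar{E}}{\\partial y_i^l} + g'(x_i^l) R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial y_i^l} \\right\\} \\tag{21}$$ \n\n$$R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial w_{ij}^l} \\right\\} = R_v\\{y_j^{l-1}\\} \\frac{\\partial \\bar{E}}{\\partial x_i^l} + y_j^{l-1} R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial x_i^l} \\right\\} \\tag{22}$$ \n\n$$R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial \\theta_i^l} \\right\\} = -R_v\\left\\{ \\frac{\\partial \\bar{E}}{\\partial x_i^l} \\right\\} \\tag{23}$$ \n\nFigure 1: Learning only the derivative: showing (a) poor approximation of the function \nwith (b) excellent approximation of the derivative. \n\n3.5 Complete algorithm \n\nImplementing this algorithm is nearly as simple as implementing normal gradient descent. \nFor each type of variable that is used in an MLP (net input, neuron output, weights, \nthresholds, partial derivatives, etc.), we require that an extra variable be allocated to hold the \nresult of applying the $R_v\\{\\cdot\\}$ operator to the original variable. With this change in place, the \ncomplete algorithm to compute $\\partial E_u / \\partial w$ is as follows: \n\n\u2022 Set u and a to the user specified vectors from Equation 1. \n\n\u2022 Set the MLP inputs to the value of x that J is to be evaluated at. \n\n\u2022 Perform a normal feedforward pass using Equations 9 and 10. \n\n\u2022 Set $\\partial \\bar{E} / \\partial y_i^L$ to $u_i$. \n\n\u2022 Perform the feedback pass with Equations 11-15. Note that values in the $\\partial \\bar{E} / \\partial y_i^0$ \nterms are now equal to $J^T u$. \n\n\u2022 Set v to $(J^T u - a)$. \n\n\u2022 Perform a $R_v\\{\\cdot\\}$ forward pass with Equations 16-18. \n\n\u2022 Set the $R_v\\{\\partial \\bar{E} / \\partial y_i^L\\}$ terms to 0. 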
\n\n\u2022 Perform a $R_v\\{\\cdot\\}$ backward pass with Equations 19-23. \n\nAfter the last step, the values in the $R_v\\{\\partial \\bar{E} / \\partial w_{ij}^l\\}$ and $R_v\\{\\partial \\bar{E} / \\partial \\theta_i^l\\}$ terms contain the \nrequired result. It is important to note that the time complexity of the \"J-forward\" and \n\"J-backward\" calculations is nearly identical to that of the typical output and gradient evaluations \n(i.e., the \"forward\" and \"backward\" passes) of the models used. \n\nA similar technique can be used for calculating $\\partial E_v / \\partial w$. The main difference is that the \n$R_v\\{\\cdot\\}$ forward pass is performed between the normal forward and backward passes because \nu can only be determined after $R_v\\{f(x, w)\\}$ has been calculated. \n\n4 Experimental results \n\nTo demonstrate the effectiveness and generality of the J-prop algorithm, we have \nimplemented it on top of an existing neural network library [9] in such a way that the algorithm \ncan be used on a large number of architectures, including MLPs, radial basis function \nnetworks, and higher order networks. \n\nWe trained an MLP with ten hidden tanh nodes on 100 points with conjugate gradient. The \ntraining exemplars consisted of inputs in [-1, 1] and a target derivative from $3 \\cos(3x) + \n5 \\cos(10x)$. Our unknown function (which the MLP never sees data from) is $\\sin(3x) + \n\\frac{1}{2} \\sin(10x)$. The model quickly converges to a solution in approximately 100 iterations. \nFigure 1 shows the performance of the MLP. Having never seen data from the unknown \nfunction, the MLP yields a poor approximation of the function, but a very accurate \napproximation of the function's derivative. We could have trained on both outputs and derivatives, \nbut our goal was to illustrate that J-prop can target derivatives alone. \n\n5 Conclusions \n\nWe have introduced a general method for calculating the weight gradient of functions of \nthe Jacobian matrix of feedforward nonlinear systems. 
The method can be easily applied to \nmost nonlinear models in common use today. The resulting algorithm, J-prop, can be easily \nmodified to minimize functionals from several application domains [10]. Some possible \nuses include: targeting known first derivatives, implementing Tangent Prop and Double \nBackpropagation, enforcing identical I/O sensitivities in auto-encoders, deflating the largest \neigenvalue and minimizing all eigenvalue bounds, optimizing the determinant for blind \nsource separation, and building nonlinear controllers. \n\nWhile some special cases of the J-prop algorithm have already been studied, a great deal \nis unknown about how optimization of the Jacobian changes the overall optimization \nproblem. Some anecdotal evidence seems to imply that optimization of the Jacobian can lead to \nbetter generalization and faster training. It remains to be seen if J-prop used on a nonlinear \nextension of linear methods will lead to superior solutions. \n\nAcknowledgements \n\nWe thank Frans Coetzee, Yannis Kevrekidis, Joe O'Ruanaidh, Lucas Parra, Scott Rickard, \nJustinian Rosca, and Patrice Simard for helpful discussions. GWF would also like to thank \nEric Baum and the NEC Research Institute for funding the time to write up these results. \n\nReferences \n\n[1] H. Drucker and Y. Le Cun. Improving generalization performance using double \nbackpropagation. IEEE Transactions on Neural Networks, 3(6), November 1992. \n\n[2] P. Simard, B. Victorri, Y. Le Cun, and J. Denker. Tangent prop - A formalism for \nspecifying selected invariances in an adaptive network. In John E. Moody, Steve J. \nHanson, and Richard P. Lippmann, editors, Advances in Neural Information \nProcessing Systems, volume 4, pages 895-903. Morgan Kaufmann Publishers, Inc., 1992. \n\n[3] H. White and A. R. Gallant. On learning the derivatives of an unknown mapping \nwith multilayer feedforward networks. 
In Halbert White, editor, Artificial Neural \nNetworks, chapter 12, pages 206-223. Blackwell, Cambridge, Mass., 1992. \n\n[4] H. White, K. Hornik, and M. Stinchcombe. Universal approximation of an unknown \nmapping and its derivative. In Halbert White, editor, Artificial Neural Networks, \nchapter 6, pages 55-77. Blackwell, Cambridge, Mass., 1992. \n\n[5] J. Principe, A. Rathie, and J. Kuo. Prediction of chaotic time series with neural \nnetworks and the issues of dynamic modeling. Bifurcations and Chaos, 2(4), 1992. \n\n[6] G. Deco and B. Sch\u00fcrmann. Dynamic modeling of chaotic time series. In Russell \nGreiner, Thomas Petsche, and Stephen Jose Hanson, editors, Computational \nLearning Theory and Natural Learning Systems, volume IV of Making Learning Systems \nPractical, chapter 9, pages 137-153. The MIT Press, Cambridge, Mass., 1997. \n\n[7] G. E. Hinton. Learning distributed representations of concepts. In Proc. Eighth Annual \nConf. Cognitive Science Society, pages 1-12, Hillsdale, NJ, 1986. Erlbaum. \n\n[8] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, \n6(1):147-160, 1994. \n\n[9] G. W. Flake. Industrial strength modeling tools. Submitted to NIPS 99, 1999. \n\n[10] G. W. Flake and B. A. Pearlmutter. Optimizing properties of the Jacobian of nonlinear \nfeedforward systems. In preparation, 1999. \n", "award": [], "sourceid": 1702, "authors": [{"given_name": "Gary", "family_name": "Flake", "institution": null}, {"given_name": "Barak", "family_name": "Pearlmutter", "institution": null}]}