{"title": "On Blackbox Backpropagation and Jacobian Sensing", "book": "Advances in Neural Information Processing Systems", "page_first": 6521, "page_last": 6529, "abstract": "From a small number of calls to a given \u201cblackbox\" on random input perturbations, we show how to efficiently recover its unknown Jacobian, or estimate the left action of its Jacobian on a given vector. Our methods are based on a novel combination of compressed sensing and graph coloring techniques, and provably exploit structural prior knowledge about the Jacobian such as sparsity and symmetry while being noise robust. We demonstrate efficient backpropagation through noisy blackbox layers in a deep neural net, improved data-efficiency in the task of linearizing the dynamics of a rigid body system, and the generic ability to handle a rich class of input-output dependency structures in Jacobian estimation problems.", "full_text": "On Blackbox Backpropagation and Jacobian Sensing\n\nKrzysztof Choromanski\n\nGoogle Brain\n\nNew York, NY 10011\n\nkchoro@google.com\n\nVikas Sindhwani\n\nGoogle Brain\n\nNew York, NY 10011\n\nsindhwani@google.com\n\nAbstract\n\nFrom a small number of calls to a given \u201cblackbox\" on random input perturbations,\nwe show how to ef\ufb01ciently recover its unknown Jacobian, or estimate the left action\nof its Jacobian on a given vector. Our methods are based on a novel combination of\ncompressed sensing and graph coloring techniques, and provably exploit structural\nprior knowledge about the Jacobian such as sparsity and symmetry while being\nnoise robust. 
We demonstrate efficient backpropagation through noisy blackbox\nlayers in a deep neural net, improved data-efficiency in the task of linearizing the\ndynamics of a rigid body system, and the generic ability to handle a rich class of\ninput-output dependency structures in Jacobian estimation problems.\n\n1 Introduction\n\nAutomatic Differentiation (AD) [1, 17] techniques are at the heart of several \u201cend-to-end\u201d machine\nlearning frameworks such as TensorFlow [5] and Torch [2]. Such frameworks are organized around\na library of primitive operators which are differentiable vector-valued functions of data inputs and\nmodel parameters. A composition of these primitives defines a computation graph - a directed acyclic\ngraph whose nodes are operators and whose edges represent dataflows, typically culminating in the\nevaluation of a scalar-valued loss function. For reverse mode automatic differentiation (backpropagation)\nto work, each operator needs to be paired with a gradient routine which maps gradients of the\nloss function with respect to the outputs of the operator, to gradients with respect to its inputs. In\nthis paper, we are concerned with extending the automatic differentiation paradigm to computation\ngraphs where some nodes are \"blackboxes\" [12], that is, opaque pieces of code implemented outside\nthe AD framework providing access to an operator only via expensive and potentially noisy function\nevaluation, with no associated gradient routine available. 
A useful mental model of this setting is\nshown below where f3 is a blackbox.\n\n[Figure: computation graph with data nodes x0, x1, x2, x3, x4 and operators f1, f2, f3, f4]\n\nBlackboxes, of course, are pervasive - as legacy or proprietary codes or executables, numerical\noptimization routines, physics engines (e.g., Bullet [3] and MuJoCo [4]), or even wrappers interfacing\nwith a mechanical system as is typically the case in reinforcement learning, robotics and process\ncontrol applications.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe unknown Jacobian of a blackbox is the central object of study in this paper. Recall that the\nJacobian \u2207f (x0) of a differentiable vector-valued map f : Rn \u2192 Rm at an input x0 \u2208 Rn is the\nm \u00d7 n matrix of partial derivatives, defined by\n\n$$[\\nabla f(x_0)]_{ij} = \\frac{\\partial f_i}{\\partial x_j}(x_0).$$\n\nThe rows of the Jacobian are gradient vectors of the m component functions f = (f1 . . . fm) and the\ncolumns are indexed by the n-dimensional inputs x = (x1 . . . xn). Through Taylor approximation,\nthe Jacobian characterizes the rate of change in f at a step \u03b5 (0 < \u03b5 \u226a 1) along any direction d \u2208 Rn\nin the neighborhood of x0 as follows,\n\n$$\\nabla f(x_0)\\, d \\approx \\frac{1}{\\epsilon}\\left[f(x_0 + \\epsilon d) - f(x_0)\\right]. \\qquad (1)$$\n\nViewed as a linear operator over perturbation directions d \u2208 Rn, differences of the form\n$\\frac{1}{\\epsilon}[f(x + \\epsilon d) - f(x)]$ may be interpreted as noisy measurements (\u201csensing\u201d [10, 11, 13]) of the\nJacobian based on function evaluation. The measurement error grows with the step size \u03b5 and the\ndegree of nonlinearity in f in the vicinity of x0. 
Additional measurement noise may well be introduced\nby unknown error-inducing elements inside the blackbox.\n\nFrom as few perturbations and measurements as possible, we aim either to approximately\nrecover the full Jacobian, or to approximate the action of the transpose of the Jacobian on a\ngiven vector in the context of enabling backpropagation through blackbox nodes. To elaborate on\nthe latter setting, let y = f (x) represent forward evaluation of an operator, and let $p = \\frac{\\partial l}{\\partial y}$ be the\ngradient of a loss function l(\u00b7) flowing in from the \u201ctop\u201d during the reverse sweep. We are interested\nin approximating $\\frac{\\partial l}{\\partial x} = [\\nabla f(x)]^T p$, i.e., the action of the transpose of the Jacobian on p. Note that\ndue to linearity of the derivative, this is the same as estimating the gradient of the scalar-valued\nfunction $g(x) = p^T f(x)$ based on scalar measurements of the form $\\frac{1}{\\epsilon}(g(x + \\epsilon d) - g(x))$, which is\na special case of the tools developed in this paper.\n\nThe more general problem of full Jacobian estimation arises in many derivative-free optimization\nsettings [12, 8]. Problems in optimal control and reinforcement learning [18, 21, 20] are prominent\nexamples, where the dynamics of a nonlinear system (e.g., a robot agent) need to be linearized along\na trajectory of states and control inputs, reducing the problem to a sequence of time-varying Linear\nQuadratic Regulator (LQR) subproblems [21]. The blackbox in this case is either a physics simulator\nor actual hardware. The choice of perturbation directions and the collection of measurements then\nbecome intimately tied to the agent\u2019s strategy for exploration and experience gathering.\n\nFinite differencing, where the perturbation directions d are the n standard basis vectors, is a default\napproach for Jacobian estimation. However, it requires n function evaluations which may be\nprohibitively expensive for large n. 
Another natural approach, when the number of measurements,\nsay k, is smaller than n, is to estimate the Jacobian via linear regression,\n\n$$\\operatorname*{argmin}_{J \\in \\mathbb{R}^{m \\times n}} \\sum_{i=1}^{k} \\left\\| J d_i - \\frac{1}{\\epsilon}\\left[f(x_0 + \\epsilon d_i) - f(x_0)\\right] \\right\\|_2^2 + \\lambda \\|J\\|_F^2,$$\n\nwhere an l2 regularizer is added to handle the underdetermined setting and $\\|\\cdot\\|_F$ stands for the\nFrobenius norm. This approach assumes that the error distribution is Gaussian and, in its basic\nform, does not exploit additional Jacobian structure, e.g., symmetry and sparsity, to improve data\nefficiency. For example, if backpropagation needs to be enabled for a noiseless blackbox with\nidentical input-output dimensions whose unknown Jacobian happens to be symmetric, then just\none function evaluation suffices since $\\nabla f(x_0)^T p = \\nabla f(x_0) p \\approx \\frac{1}{\\epsilon}(f(x_0 + \\epsilon p) - f(x_0))$. Figure\n1 shows the histogram of the Jacobian of the dynamics of a Humanoid walker with respect to its\n18-dimensional state variables and 6-dimensional control inputs. It can be seen that the Jacobian\nis well approximated by a sparse matrix. In a complex dynamical system comprising many\nsubsystems, most state or control variables only have local influence on the instantaneous evolution\nof the overall state. Figure 1 also shows the example of a manipulator; the Jacobian of a planar\n5-link system has sparse and symmetric blocks (highlighted by blue and red bounding boxes) as a\nconsequence of the form of the equations of motion of a kinematic tree of rigid bodies. 
Clearly, one\ncan hope that incorporating this kind of prior knowledge in the Jacobian estimation process will\nimprove data efficiency in \u201cmodel-free\u201d trajectory optimization applications.\n\nFigure 1: Structured Jacobians in Continuous Control Problems\n\nTechnical Preview, Contributions and Outline: We highlight the following contributions:\n\n\u2022 In \u00a72: We start by asking how many blackbox calls are required to estimate a sparse Jacobian with\nknown sparsity pattern. We recall results from the automatic differentiation literature [14, 17, 23] that\nrelate this problem to graph coloring [19, 26], where the chromatic number of a certain graph that\nencodes input-output dependencies dictates the sample complexity. We believe that this connection\nis not particularly well known in the deep learning community, though coloring approaches only\napply to noiseless structure-aware cases.\n\u2022 In \u00a73: We present a Jacobian recovery algorithm, rainbow, that uses a novel probabilistic\ngraph coloring subroutine to reduce the effective number of variables, leading to a compressed\nconvex optimization problem whose solution yields an approximate Jacobian. The approximation\n$\\hat{J}$ of the true Jacobian J is such that $\\|\\hat{J} - J\\|_F \\leq E(n)$, where the measurement error vector\n\u03b7 \u2208 Rm satisfies $\\|\\eta\\|_\\infty = o(E(n))$. 
Our algorithm requires only O(min(A, B)) calls to the\nblackbox, where $A = d_{int} \\log_2\\left(\\frac{\\sqrt{mn}}{E(n)}\\right)$, $B = m\\rho(J, G_{int}^{weak}) \\log_2\\left(\\frac{\\sqrt{m\\rho(J, G_{int}^{weak})}}{E(n)}\\right)$, $d_{int}$ is a measure\nof intrinsic dimensionality of a convex set $\\mathcal{C} \\ni J$ encoding prior knowledge about the Jacobian\n(elaborated below) and $\\rho(J, G_{int}^{weak}) \\leq n$ is a parameter encoding combinatorial properties\n(possibly known in advance, and encoded by the so-called weak-intersection graph $G_{int}^{weak}$ introduced later)\nof the sparsity pattern in the Jacobian (see \u00a73.4.1 for an explicit definition); we will refer to\n$\\rho(J, G_{int}^{weak})$ as the chromatic character of J.\n\u2022 We demonstrate our tools with the following experiments: (1) Training a convolutional neural\nnetwork in the presence of a blackbox node, (2) Estimating structured Jacobians from few calls\nto a blackbox with different kinds of local and global dependency structures between inputs and\noutputs, and (3) Estimating structured Jacobians of the dynamics of a 50-link manipulator, with\na small number of measurements while exploiting sparsity and partial symmetry via priors in lp\nregression.\n\nThe convex set $\\mathcal{C}$ mentioned above can be defined in many different ways depending on prior\nknowledge about the Jacobian (e.g., lower and upper bounds on certain entries, sparsity with unknown\npattern, symmetric block structure, etc.).\n\nAs we show in the experimental section, our approach can also be applied to non-smooth problems\nwhere the Jacobian is not well-defined. Note that in this setting one can think of a nonsmooth function\nas a noisy version of its smooth approximation, and the Jacobian of a smoothed version of the function\n(obtained, e.g., by Gaussian smoothing) is the object of interest.\n\nNotation: D = [d1 . . . dk] \u2208 Rn\u00d7k will denote the matrix of perturbation directions, with the\ncorresponding measurement matrix R = [r1 . . . 
rk] \u2208 Rm\u00d7k where $r_i = \\frac{1}{\\epsilon}[f(x + \\epsilon d_i) - f(x)]$.\n\n2 The Link between Jacobian Estimation and Graph Coloring\n\nSuppose the Jacobian is known to be a diagonal matrix. Then finite differencing where perturbation\ndirections are the n standard basis elements is utterly wasteful; it is easy to see that a single\nperturbation direction d = [1, 1, . . . , 1]^T suffices to identify all diagonal elements, since Jd is then\nexactly the vector of diagonal entries. The goal of this\nsection is to explain the connection between Jacobian recovery and graph coloring problems that\nsubstantially generalizes this observation.\n\nFirst we introduce graph theory terminology. An undirected graph is denoted as G(V, E), where V\nand E stand for the sets of vertices and edges respectively. For v, w \u2208 V we say that v is adjacent\nto w if there is an edge between v and w. The degree deg(v) of v \u2208 V is the number of vertices\nadjacent to it. The maximum degree in G(V, E) will be denoted as \u2206(G). A stable set in G is a\nsubset S \u2286 V in which no two vertices are adjacent. The chromatic number \u03c7(G) of G is the minimum\nnumber of sets in a partition of V into stable sets. Equivalently, it is the smallest number of\ncolors used in a valid vertex-coloring of the graph, where a valid coloring is one in which adjacent\nvertices are assigned different colors.\n\nFigure 2: On the left: Sparse Jacobian for a function f (a, b, c, d, e, f, g, h) with n = m = 8, where\nblue entries indicate nonzero values. In the middle: coloring of columns. A fixed color corresponds\nto a stable set in Gint. On the right: corresponding intersection graph Gint.\n\nDenote by Jx = [J1, ..., Jn] \u2208 Rm\u00d7n a Jacobian matrix evaluated at point x \u2208 Rn, where Ji \u2208 Rm\ndenotes the i-th column. Assume that the columns Ji are not known, but the sparsity structure, i.e., the location\nof zero entries in J, is given. 
Let $A_i = \\{k : J^i_k \\neq 0\\} \\subseteq \\{0, ..., m - 1\\}$ be the indices of the\nnon-zero elements of Ji. The intersection graph, denoted by Gint, is a graph whose vertex set is\nV = {x1 . . . xn} and xi is adjacent to xj if the sets Ai and Aj intersect. In other words, there\nexists an output of the blackbox that depends both on xi and xj (see Figure 2 for an illustration).\nNow suppose k colors are used in a valid coloring of Gint. The key fact that relates the Jacobian\nrecovery problem to graph coloring is the following observation. If one constructs vectors $d^i \\in \\mathbb{R}^n$\nfor i = 1, ..., k in such a way that $d^i_j = 1$ if xj is colored by the ith color and $d^i_j = 0$ otherwise, then\nk computations of the finite difference $\\frac{f(x + \\epsilon d^i) - f(x)}{\\epsilon}$ for 0 < \u03b5 \u226a 1 and i = 1, ..., k suffice to\naccurately approximate the Jacobian matrix (assuming no blackbox noise). Indeed, columns that share a color have\ndisjoint supports, so every entry of a measurement can be attributed to a single column. The immediate corollary\nis the following lemma.\n\nLemma 2.1 ([14]). The number of calls k to a blackbox vector-valued function f needed to compute\nan approximate Jacobian via the finite difference technique in the noiseless setting satisfies k \u2264 \u03c7(Gint),\nwhere Gint is the corresponding intersection graph.\n\nThus, blackboxes whose unknown Jacobian happens to be associated with intersection graphs of\nlow chromatic number admit accurate Jacobian estimation with few function calls. Rich classes of\ngraphs have low chromatic number. If the maximum degree \u2206(Gint) of Gint is small then \u03c7(Gint) is\nalso small, because of the well known fact that \u03c7(Gint) \u2264 \u2206(Gint) + 1. For instance, if every input\nxi influences at most k outputs fj and every output fj depends on at most l variables xi, then one\ncan notice that \u2206(Gint) \u2264 kl and thus \u03c7(Gint) \u2264 kl + 1. 
When the maximum degree is small, an\nefficient coloring can be easily found by the greedy procedure that colors vertices one by one and\nassigns to the newly seen vertex the smallest color that has not been used to color any of its already seen\nneighbors ([14]). This procedure is of limited use if there exist vertices of high degree, since the bound\n\u2206(Gint) + 1 is then weak. That is the\ncase, for instance, if a few global variables influence a large number of outputs fi. In the\nsubsequent sections we will present an algorithm that does not need to rely on a small value of\n\u2206(Gint).\n\nGraph coloring for Jacobian estimation has two disadvantages even if we assume that a good quality\ncoloring of the intersection graph can be found efficiently (optimal graph coloring is in general\nNP-hard). It assumes that the sparsity structure of the Jacobian, i.e., the set of entries that are zero, is given,\nand that all the measurements are accurate, i.e., there is no noise. We relax these limitations next.\n\n3 Sensing and Recovery of Structured Jacobians\n\nOur algorithm receives as input two potential sources of prior knowledge about the blackbox:\n\u2022 sparsity pattern of the Jacobian in the form of a supergraph of the true intersection graph, which we\ncall the weak intersection graph, denoted as $G_{int}^{weak}$. The knowledge of the sparsity pattern may be\nimprecise in the sense that we can overestimate the set of outputs an input can influence. Note that\nany stable set of $G_{int}^{weak}$ is a stable set in Gint and thus we have: $\\chi(G_{int}) \\leq \\chi(G_{int}^{weak})$. 
A complete\nweak intersection graph corresponds to the setting where no prior knowledge about the sparsity\npattern is available while $G_{int}^{weak} = G_{int}$ reflects the setting with exact knowledge.\n\u2022 a convex set $\\mathcal{C}$ encoding additional information about the local and global behavior of the blackbox.\nFor example, if output components fi are Lipschitz continuous with Lipschitz constants Li, then the\nmagnitude of the Jacobian entries can be bounded row-wise with Li, i = 1 . . . m. The Jacobian\nmay additionally have sparse blocks, which may be expressed as a bound on the elementwise l1\nnorm over the entries of the block; it may also have symmetric and/or low-rank blocks [6] (the\nlatter may be expressed as a bound on the nuclear norm of the block). A measure of the effective\ndegrees of freedom due to such constraints directly shows up in our theoretical results on Jacobian\nrecovery (\u00a73.4).\n\nDirect domain knowledge, or a few expensive finite-difference calls, may be used in the first few\niterations to collect input-independent structural information about the Jacobian, e.g., to observe the\ntypical degree of sparsity, whether a symmetry or sparsity pattern holds across iterations, etc.\n\nOur algorithm, called rainbow, consists of three steps:\n\u2022 Color: Efficient coloring of $G_{int}^{weak}$ for reducing the dimensionality of the problem, where each\nvariable in the compressed problem corresponds to a subset of variables in the original problem.\nThis phase explores strictly combinatorial structural properties of J (\u00a73.1).\n\u2022 Optimize: Solving a compressed convex optimization problem to minimize (or find a feasible)\nlp reconstruction. 
This phase can utilize additional structural knowledge via the convex set $\\mathcal{C}$\n(\u00a73.3) defined earlier.\n\u2022 Reconstruct: Mapping the auxiliary variables from the solution to the above convex problem\nback to the original variables to reconstruct J.\n\nNext we discuss all these steps.\n\n3.1 Combinatorial Variable Compression via Graph Coloring: GreedyColoring\n\nConsider the following coloring algorithm for reducing the effective number of input variables. Order\nthe vertices x1, ..., xn of $G_{int}^{weak}$ randomly. Initialize the list of stable sets I covering {x1, ..., xn}\nas I = \u2205. Process vertices one after another and add a vertex xi to the first set from I that does\nnot contain vertices adjacent to xi. If no such set exists, add the singleton set {xi} to I. After\nprocessing all the vertices, each stable set from I gets assigned a different color. We denote by\ncolor(i) the color assigned to vertex xi and by l the total number of colors. To boost the probability\nof finding a good coloring, one can repeat the procedure above for a few random permutations and\nchoose the one that corresponds to the smallest l.\n\n3.2 Choice of Perturbation Directions\nEach $d^i \\in \\mathbb{R}^n$ is obtained from a randomly chosen vector $d^i_{core} \\in \\mathbb{R}^l$, which we call the core vector.\nEntries of all core vectors are taken independently from the same distribution \u03c6 which is: Gaussian,\nPoissonian or bounded and of nonzero variance (for the sake of readability, technical conditions\nand extensions to this family of distributions are relegated to the Appendix). Directions may even be\nchosen from columns of structured matrices, e.g., Circulant and Toeplitz [7, 24, 22, 16]. 
Each $d^i$ is\ndefined as follows: $d^i(j) = d^i_{core}(\\mathrm{color}(j))$.\n\n3.3 Recovery via Compressed Convex Optimization\nLinear Programming: Assume that the lp-norm of the noise vector \u03b7 \u2208 Rm is bounded by \u03b5 =\nE(n), where E(\u00b7) encodes non-decreasing dependence on n. With the matrix of perturbation vectors\nD \u2208 Rn\u00d7k and a matrix of the corresponding core vectors Dcore \u2208 Rl\u00d7k in hand, we are looking for\nthe solution X \u2208 Rm\u00d7l to the following problem:\n\n$$\\|(X D_{core} - R)_i\\|_p \\leq \\epsilon, \\quad i = 1, \\ldots, k \\qquad (2)$$\n\nwhere subscript i runs over columns, R \u2208 Rm\u00d7k is the measurement matrix for the matrix of\nperturbations D. By construction of D, JD = X Dcore when column c of X is the sum of the columns of J\ncolored c, so X plays the role of a compressed Jacobian. For p \u2208 {1, \u221e}, this task can be cast as a Linear Programming (LP) problem. Note\nthat the smaller the number of colors, l, the smaller the size of the LP. If $\\mathcal{C}$ is a polytope, it can be\nincluded as additional linear constraints in the LP. After solving for X, we construct the Jacobian\napproximation $\\hat{J}$ as follows: $\\hat{J}_{u,j} = X_{u,\\mathrm{color}(j)}$, where color(j) is defined above.\n\nWe want to emphasize that a Linear Programming approach is just one instantiation of a more general\nmethod we present here. 
Below we show another one based on ADMM for structured l2 regression.\n\nADMM Solvers for multiple structures: When the Jacobian is known to have multiple structures,\ne.g., it is sparse and has symmetric blocks, it is natural to solve structured l2 regression problems of\nthe form,\n\n$$\\operatorname*{argmin}_{X \\in \\mathbb{R}^{m \\times l},\\, X \\in S} \\sum_{i=1}^{k} \\|(X D_{core} - R)_i\\|_2^2 + \\lambda \\|X\\|_1,$$\n\nwhere the convex constraint set S is the set of all matrices conforming to a symmetry pattern on\nselected square blocks; an example is the Jacobian of the dynamics of a 5-link manipulator as shown\nin Figure 1. 
A consensus ADMM [9] solver can easily be implemented for such problems involving\nmultiple structural priors and constraints admitting cheap proximal and projection operators. For the\nspecific case of the above problem, it runs the following iterations:\n\n\u2022 Solve for X1: $X_1^T = [D_{core} D_{core}^T + \\rho I]^{-1}\\left(D_{core} R^T + \\rho (X^T - U_1^T)\\right)$\n\u2022 X2 = symmetrize[X \u2212 U2, S]\n\u2022 X = soft-threshold[$\\frac{1}{2}(X_1 + X_2 + U_1 + U_2)$, $\\lambda \\rho^{-1}$]\n\u2022 Ui = Ui + Xi \u2212 X, i = 1, 2\n\nwhere X1, X2 are primal variables with associated dual variables U1, U2, \u03c1 is the ADMM step size\nparameter, and X is the global consensus variable. The symmetrize(X, S) routine implements\nexact projection onto symmetry constraints - it takes a square block $\\hat{X}$ of X specified by the\nconstraint set S and symmetrizes it simply as $\\frac{1}{2}[\\hat{X} + \\hat{X}^T]$, keeping other elements of X intact. The\nsoft-thresholding operator is defined by soft-threshold(X, \u03bb) = max(X \u2212 \u03bb, 0) \u2212 max(\u2212X \u2212 \u03bb, 0).\nNote that for the first step $[D_{core} D_{core}^T + \\rho I]$ can be factorized upfront, even across multiple Jacobian\nestimation problems since it is input-independent. Also, notice that if the perturbation directions\nare structured, e.g., drawn from a Circulant or Toeplitz matrix, then the cost of this linear solve can\nbe further reduced using specialized solvers [15]. As before, after solving for X, we construct the\nJacobian approximation $\\hat{J}$ as follows: $\\hat{J}_{u,j} = X_{u,\\mathrm{color}(j)}$.\n\n3.4 Theoretical Guarantees\n\n3.4.1 Chromatic property of a graph\n\nThe probabilistic graph coloring algorithm GreedyColoring generates a coloring, where the number\nof colors is close to the chromatic property $\\Lambda(G_{int}^{weak})$ of the graph $G_{int}^{weak}$ (see: proof of Lemma 3.1\nin the Appendix). 
The chromatic property \u039b(G) of a graph G is defined recursively as follows.\n\u2022 \u039b(G\u2205) = 0, where G\u2205 is an empty graph (V = \u2205),\n\u2022 for G \u2260 G\u2205, we have: $\\Lambda(G) = 1 + \\max_{S \\subseteq V} \\Lambda(G \\setminus S)$, where the max is taken over all subsets S\nsatisfying $|S| = |V| - \\lceil \\sum_{v \\in V} \\frac{1}{1 + \\deg(v)} \\rceil$ and G\\S stands for the graph obtained from G by\ndeleting vertices from S.\n\nNote that we are not aware of any closed-form expression for \u039b(G). We observe that there exists a\nsubtle connection between the chromatic property of the graph \u039b(G) and its chromatic number.\nLemma 3.1. The following is true for every graph G: \u03c7(G) \u2264 \u039b(G).\nThe importance of the chromatic property lies in the fact that in practice for many graphs G (especially\nsparse, but not necessarily of small maximum degree \u2206(G)) the chromatic property is close to the\nchromatic number. Thus, in practice, GreedyColoring finds a good quality coloring for a large class\nof weak-intersection graphs $G_{int}^{weak}$, efficiently utilizing partial knowledge about the sparsity structure.\nThe chromatic character of the Jacobian is defined as the chromatic property of its weak-intersection\ngraph $\\Lambda(G_{int}^{weak})$ and thus depends not only on the Jacobian J, but also on its \u201csparsity exposition\u201d\ngiven by $G_{int}^{weak}$, and will be referred to as $\\rho(J, G_{int}^{weak})$.\n\n3.4.2 Accuracy of Jacobian Recovery with rainbow\nWe need the following notion of intrinsic dimensionality in Rm\u00d7n as a metric space equipped with\nthe $\\|\\cdot\\|_F$ norm.\n\nDefinition 3.2 (intrinsic dimensionality). For any point X \u2208 Rm\u00d7n and any r > 0, let B(X, r) =\n{Y : $\\|X - Y\\|_F \\leq r$} denote the closed ball of radius r centered at X. 
The intrinsic dimensionality\nof S \u2286 Rm\u00d7n is the smallest integer d such that for any ball B(X, r) \u2286 Rm\u00d7n, the set B(X, r) \u2229 S\ncan be covered by $2^d$ balls of radius $\\frac{r}{2}$.\n\nWe are ready to state our main theoretical result.\n\nTheorem 3.3. Consider the Jacobian matrix J \u2208 Rm\u00d7n. Assume that max|Ji,j| \u2264 C for some\nfixed C > 0 and $J \\in \\mathcal{C}$, where $\\mathcal{C} \\subseteq \\mathbb{R}^{m \\times n}$ is a convex set defining certain structural properties of\nJ (for instance $\\mathcal{C}$ may be the set of matrices with block sparsity and symmetry patterns). Assume\nthat the measurement error vector \u03b7 \u2208 Rm satisfies: $\\|\\eta\\|_\\infty = o(E(n))$ for some function E(n).\nThen the approximation $\\hat{J}$ of J satisfying $\\|\\hat{J} - J\\|_F \\leq E(n)$ can be found with probability\n$p = 1 - \\frac{1}{\\mathrm{spoly}(n)}$ by applying the rainbow algorithm with k = O(min(A, B)) calls to the f function, where\n$A = d_{int} \\log_2\\left(\\frac{C\\sqrt{mn}}{E(n)}\\right)$, $B = m\\rho(J, G_{int}^{weak}) \\log_2\\left(\\frac{C\\sqrt{m\\rho(J, G_{int}^{weak})}}{E(n)}\\right)$, $d_{int}$ stands for the intrinsic\ndimensionality of $\\mathcal{C}$ and spoly(n) is a superpolynomial function of n.\n\nThe proof is given in the Appendix. The result above is a characterization of the number of blackbox\ncalls needed to recover the Jacobian, in terms of its intrinsic degrees of freedom, the dependency\nstructure in the inputs and outputs and the noise introduced by higher order nonlinear terms and other\nsources of forward evaluation errors.\n\n4 Experiments\n\n4.1. Sparse Jacobian Recovery: We start with a controlled setting where we consider the vector-valued\nfunction f : Rn \u2192 Rm of the following form:\n\n$$f(x_1, ..., x_n) = \\left(\\sum_{i \\in S_1} \\sin(x_i), \\ldots, \\sum_{i \\in S_m} \\sin(x_i)\\right), \\qquad (3)$$\n\nwhere sets Si for i = 1, ..., m are chosen according to one of the following models. 
In the p-model\neach entry i \u2208 {1, ..., n} is added to each Sj independently and with the same probability p. In the\n\u03b1-model entry i is added to each Sj independently at random with probability $i^{-\\alpha}$. We consider a\nJacobian at point x \u2208 Rn drawn from the standard multivariate Gaussian distribution with entries\ntaken from N (0, 1). Both the models enable us to precisely control the sparsity of the corresponding\nJacobian which has an explicit analytic form. Furthermore, the latter generates Jacobians where the\ndegrees of the corresponding intersection graphs have power-law type distribution with few \u201chubs\u201d\nvery well connected to other nodes and many nodes of small degree. That corresponds to the setting\nwhere there exist few global variables that impact many outputs fi, and many local variables that only\ninfluence a few outputs. We run the LP variant of rainbow for the above models and summarize\nthe results in the table below.\n\nmodel | m | n | sparsity | \u03c7/\u2206 | \u03c3 | k | rel.error\np = 0.1 | 30 | 60 | 0.91277 | 0.33 | 0.07 | 15 | 0.0632\np = 0.1 | 40 | 70 | 0.90142 | 0.35 | 0.07 | 20 | 0.0802\np = 0.1 | 50 | 80 | 0.90425 | 0.32 | 0.07 | 30 | 0.0751\np = 0.3 | 30 | 60 | 0.6866 | 0.6833 | 0.07 | 45 | 0.0993\np = 0.3 | 40 | 70 | 0.7096 | 0.6857 | 0.07 | 60 | 0.0589\np = 0.3 | 50 | 80 | 0.702 | 0.8625 | 0.07 | 70 | 0.1287\n\u03b1 = 0.5 | 30 | 60 | 0.7927 | 0.3833 | 0.1 | 45 | 0.0351\n\u03b1 = 0.5 | 40 | 70 | 0.78785 | 0.4285 | 0.1 | 60 | 0.0491\n\u03b1 = 0.5 | 50 | 80 | 0.79225 | 0.475 | 0.1 | 70 | 0.0443\n\u03b1 = 0.7 | 30 | 60 | 0.85166 | 0.2777 | 0.1 | 40 | 0.0393\n\u03b1 = 0.7 | 40 | 70 | 0.87357 | 0.2537 | 0.1 | 55 | 0.0398\n\u03b1 = 0.7 | 50 | 80 | 0.86975 | 0.275 | 0.1 | 65 | 0.0326\n\nAbove, we measure recovery error in terms of the relative Frobenius distance between estimated\nJacobian and true Jacobian, rel.error = $\\frac{\\|\\hat{J} - J\\|_F}{\\|J\\|_F}$. The standard deviation of each entry of the\nmeasurement noise vector is given by \u03c3. 
We report in particular the fraction of zero entries in\nJ (sparsity), the ratio of the number of colors found by our GreedyColoring algorithm to the\nmaximum degree of the graph (\u03c7/\u2206). We see that the coloring algorithm finds good quality coloring\neven in the \"power-law\" type setting where maximum degree \u2206(G) is large. The quality of the\ncoloring in turn leads to the reduction in the number of measurement vectors needed (k) to obtain an\naccurate Jacobian approximation (i.e., relative error of roughly 0.1 or below).\n\n4.2. Training Convolutional Neural Networks with Blackbox Nodes: We introduce a blackbox\nlayer between the convolutional layers and the fully connected layers of a standard MNIST convnet.\nThe blackbox node is a standard ReLU layer that takes as input 32-dimensional vectors, a 32 \u00d7 32\nweight matrix and a bias vector of length 32, and outputs a 32-dimensional representation. The\nminibatch size is 16. We inject truncated Gaussian noise in the output of the layer and override its\ndefault gradient operator in TensorFlow with our LP-based rainbow procedure. We use Gaussian\nperturbation directions and sample measurements by forward evaluation calls to the TensorFlow\nOp inside our custom blackbox gradient operator. In Fig. 3 we study the evolution of training and\nvalidation error across SGD iterations. We see in Fig. 3 that even though in the low noise regime the\nstandard linear regression and finite differencing methods work quite well, when noise magnitude\nincreases our blackbox backpropagation procedure rainbow-LP shows superior robustness - retaining\na capacity to learn while the other methods degrade in terms of validation error. 
The rightmost\nsubfigure reports validation error for our method with different numbers of Jacobian measurements at\na high noise level (in this case, the other methods fail to learn and are not plotted).\n\n(a) Standard deviation: 9e-5\n\n(b) Standard deviation: 0.008\n\n(c) Different numbers of measurement\nvectors (std : 0.1)\n\nFigure 3: TensorFlow CNN training with a \"blackbox\" layer with rainbow-LP method. On the\nleft: Comparison of rainbow-LP with finite differencing and linear regression methods for low\nnoise regime. In the middle: As before, but for more substantial noise magnitude. On the right:\nrainbow-LP for even larger noise magnitude (std : 0.1) and different number of measurement vectors\nused. In that setting other methods did not learn at all.\n\n4.3. Jacobian of manipulator dynamics: We compute the\ntrue Jacobian of a planar rigid-body model with 50 links near\nan equilibrium point using MIT\u2019s Drake planning and control\ntoolbox [25]. The first link is unactuated; the remaining\nlinks are all torque-actuated. The state vector comprises 50\njoint angles and associated joint velocities, and there are\n49 control inputs to the actuators. The Jacobian has sparse\nand symmetric blocks similar to Figure 1. We compare\nlinear regression with l2 regularization against the rainbow\nADMM solver designed to exploit sparsity and symmetry,\nin the setting where the number of measurements is much\nsmaller than the total number of input variables to the forward\ndynamics function (149). Results are shown in the adjacent\nFigure. The recovery is much more accurate in the presence\nof sparsity and symmetry priors. The results are similar if the\nmatrix of perturbation directions is chosen from a Circulant\nmatrix.\n\nReferences\n\n[1] http://www.autodiff.org.\n\n[2] http://torch.ch.\n\n[3] http://www.bulletphysics.org.\n\n[4] http://www.mujoco.org.\n\n[5] M. Abadi et al. 
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.\n\n[6] H. S. Abdel-Khalik, P. Hovland, A. Lyons, T. E. Stover, and J. Utke. A low rank approach to automatic differentiation. Advances in Automatic Differentiation, 2008.\n\n[7] W. U. Bajwa, J. D. Haupt, G. M. Raz, S. J. Wright, and R. D. Nowak. Toeplitz-structured compressed sensing matrices. IEEE/SP Workshop on Statistical Signal Processing, 2007.\n\n[8] A. S. Bandeira, K. Scheinberg, and L. N. Vicente. Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization. Mathematical Programming, 134, 2012.\n\n[9] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3, 2011.\n\n[10] E. J. Cand\u00e8s and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25, 2008.\n\n[11] E. J. Cand\u00e8s, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59, 2006.\n\n[12] A. R. Conn, K. Scheinberg, and L. N. Vicente. Derivative Free Optimization. MOS-SIAM Series on Optimization, 2009.\n\n[13] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52, 2006.\n\n[14] A. H. Gebremedhin, F. Manne, and A. Pothen. What color is your Jacobian? Graph coloring for computing derivatives. SIAM Review, 47(4):629\u2013705, 2005.\n\n[15] G. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press (4th edition), 2012.\n\n[16] R. M. Gray. Toeplitz and circulant matrices: A review. Foundations and Trends in Communications and Information Theory, 2(3), 2006.\n\n[17] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.\n\n[18] D. H. 
Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, 1970.\n\n[19] T. Jensen and B. Toft. Graph Coloring Problems. Wiley-Interscience, 1995.\n\n[20] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(39), 2016.\n\n[21] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. International Conference on Informatics in Control, Automation and Robotics, 2004.\n\n[22] W. Yin, S. Morgan, J. Yang, and Y. Zhang. Practical compressive sensing with Toeplitz and circulant matrices. Proceedings of SPIE, the International Society for Optical Engineering, 2010.\n\n[23] G. N. Newsam and J. D. Ramsdell. Estimation of sparse Jacobian matrices. SIAM Journal of Algebraic Discrete Methods, 1983.\n\n[24] H. Rauhut. Circulant and Toeplitz matrices in compressed sensing. SPARS\u201909 - Signal Processing with Adaptive Sparse Structured Representations, 2010.\n\n[25] R. Tedrake and the Drake Development Team. Drake: A planning, control, and analysis toolbox for nonlinear dynamical systems, 2016.\n\n[26] B. Toft. Coloring, stable sets and perfect graphs. Handbook of Combinatorics, 1996.\n", "award": [], "sourceid": 3273, "authors": [{"given_name": "Krzysztof", "family_name": "Choromanski", "institution": "Google Brain Robotics"}, {"given_name": "Vikas", "family_name": "Sindhwani", "institution": null}]}