{"title": "Surfing: Iterative Optimization Over Incrementally Trained Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 15034, "page_last": 15043, "abstract": "We investigate a sequential optimization procedure to minimize the empirical risk functional $f_{\\hat\\theta}(x) = \\frac{1}{2}\\|G_{\\hat\\theta}(x) - y\\|^2$ for certain families of deep networks $G_{\\theta}(x)$. The approach is to optimize a sequence of objective functions that use network parameters obtained during different stages of the training process. When initialized with random parameters $\\theta_0$, we show that the objective $f_{\\theta_0}(x)$ is ``nice'' and easy to optimize with gradient descent. As learning is carried out, we obtain a sequence of generative networks $x \\mapsto G_{\\theta_t}(x)$ and associated risk functions $f_{\\theta_t}(x)$, where $t$ indicates a stage of stochastic gradient descent during training. Since the parameters of the network do not change by very much in each step, the surface evolves slowly and can be incrementally optimized. The algorithm is formalized and analyzed for a family of expansive networks. We call the procedure {\\it surfing} since it rides along the peak of the evolving (negative) empirical risk function, starting from a smooth surface at the beginning of learning and ending with a wavy nonconvex surface after learning is complete. 
Experiments show how surfing can be used to find the global optimum and for compressed sensing even when direct gradient descent on the final learned network fails.", "full_text": "Surfing: Iterative Optimization Over Incrementally Trained Deep Networks

Ganlin Song
Department of Statistics and Data Science, Yale University
ganlin.song@yale.edu

Zhou Fan
Department of Statistics and Data Science, Yale University
zhou.fan@yale.edu

John Lafferty
Department of Statistics and Data Science, Yale University
john.lafferty@yale.edu

Abstract

We investigate a sequential optimization procedure to minimize the empirical risk functional $f_{\hat\theta}(x) = \frac{1}{2}\|G_{\hat\theta}(x) - y\|^2$ for certain families of deep networks $G_\theta(x)$. The approach is to optimize a sequence of objective functions that use network parameters obtained during different stages of the training process. When initialized with random parameters $\theta_0$, we show that the objective $f_{\theta_0}(x)$ is "nice" and easy to optimize with gradient descent. As learning is carried out, we obtain a sequence of generative networks $x \mapsto G_{\theta_t}(x)$ and associated risk functions $f_{\theta_t}(x)$, where $t$ indicates a stage of stochastic gradient descent during training. Since the parameters of the network do not change by very much in each step, the surface evolves slowly and can be incrementally optimized. The algorithm is formalized and analyzed for a family of expansive networks. We call the procedure surfing since it rides along the peak of the evolving (negative) empirical risk function, starting from a smooth surface at the beginning of learning and ending with a wavy nonconvex surface after learning is complete.
Experiments show how surfing can be used to find the global optimum and for compressed sensing even when direct gradient descent on the final learned network fails.

1 Introduction

Intensive recent research has provided insight into the performance and mathematical properties of deep neural networks, improving understanding of their strong empirical performance on different types of data. Some of this work has investigated gradient descent algorithms that optimize the weights of deep networks during learning (Du et al., 2018b,a; Davis et al., 2018; Li and Yuan, 2017; Li and Liang, 2018). In this paper we focus on optimization over the inputs to an already trained deep network in order to best approximate a target data point. Specifically, we consider the least squares objective function

$$f_{\hat\theta}(x) = \frac{1}{2}\|G_{\hat\theta}(x) - y\|^2$$

where $G_\theta(x)$ denotes a multi-layer feed-forward network and $\hat\theta$ denotes the parameters of the network after training. The network is considered to be a mapping from a latent input $x \in \mathbb{R}^k$ to an output $G_\theta(x) \in \mathbb{R}^n$ with $k \ll n$.
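Since $G$ is a composition of linear maps and ReLUs, the gradient of this least squares objective can be written in closed form by masking the inactive neurons of each layer. The following is a minimal numpy sketch, not the paper's architecture: a toy two-layer expansive network with illustrative dimensions, and a target chosen in the range of $G$, minimized by gradient descent over the input $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
k, n1, n2 = 5, 40, 60  # toy expansive dimensions (illustrative only)
W1 = rng.normal(0, 1 / np.sqrt(n1), size=(n1, k))
W2 = rng.normal(0, 1 / np.sqrt(n2), size=(n2, n1))

def G(x):
    return relu(W2 @ relu(W1 @ x))

def grad_f(x, y):
    # Backpropagate the residual r = G(x) - y through both ReLU layers:
    # grad = W1^T D1 W2^T D2 r, where Di masks the inactive neurons of layer i.
    h1 = W1 @ x
    h2 = W2 @ relu(h1)
    r = relu(h2) - y
    return W1.T @ ((h1 > 0) * (W2.T @ ((h2 > 0) * r)))

y = G(rng.normal(size=k))               # target in the range of G
x = rng.normal(size=k)                  # random starting input
loss0 = 0.5 * np.linalg.norm(G(x) - y) ** 2
for _ in range(5000):
    x -= 0.1 * grad_f(x, y)
loss = 0.5 * np.linalg.norm(G(x) - y) ** 2
```

For a random expansive network and a target in its range, this simple scheme typically drives the loss near zero; the point of the paper is that after training, the same descent can fail.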
A closely related objective is to minimize $f_{\theta,A}(x) = \frac{1}{2}\|AG_\theta(x) - Ay\|^2$ where $A$ is a random matrix.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Behavior of the surfaces $x \mapsto -\frac{1}{2}\|G_{\theta_t}(x) - y\|^2$ for two targets $y$ shown for three levels of training, from random networks (left) to fully trained networks (right) on Fashion-MNIST data. The network structure has two fully connected layers and two transposed convolution layers with batch normalization, trained as a VAE. [Panels: initial network, partially trained network, fully trained network; target $y$ shown at left of each row.]

Hand and Voroninski (2019) study the behavior of the function $f_{\theta_0,A}$ in a compressed sensing framework where $y = G_{\theta_0}(x_0)$ is generated from a random network with parameters $\theta_0 = (W_1, \ldots, W_d)$ drawn from Gaussian matrix ensembles; thus, the network is not trained. In this setting, it is shown that the surface is very well behaved. In particular, outside of small neighborhoods around $x_0$ and a scalar multiple of $-x_0$, the function $f_{\theta_0,A}(x)$ always has a descent direction.

When the parameters of the network are trained, the landscape of the function $f_{\hat\theta}(x)$ can be complicated; it will in general be nonconvex with multiple local optima. Figure 1 illustrates the behavior of the surfaces as they evolve from random networks (left) to fully trained networks (right) for 4-layer networks trained on Fashion-MNIST using a variational autoencoder. For each of two target values $y$, three surfaces $x \mapsto -\frac{1}{2}\|G_{\theta_t}(x) - y\|^2$ are shown for different levels of training.

This paper explores the following simple idea. We incrementally optimize a sequence of objective functions $f_{\theta_0}, f_{\theta_1}, \ldots, f_{\theta_T}$ where the parameters $\theta_0, \theta_1, \ldots, \theta_T = \hat\theta$ are obtained using stochastic gradient descent in $\theta$ during training. When initialized with random parameters $\theta_0$, we show that the empirical risk function $f_{\theta_0}(x) = \frac{1}{2}\|G_{\theta_0}(x) - y\|^2$ is "nice" and easy to optimize with gradient descent. As learning is carried out, we obtain a sequence of generative networks $x \mapsto G_{\theta_t}(x)$ and associated risk functions $f_{\theta_t}(x)$, where $t$ indicates an intermediate stage of stochastic gradient descent during training. Since the parameters of the network do not change by very much in each step (Du et al., 2018a,b), the surface evolves slowly. We initialize $x$ for the current network $G_{\theta_t}(x)$ at the optimum $x^*_{t-1}$ found for the previous network $G_{\theta_{t-1}}(x)$ and then carry out gradient descent to obtain the updated point $x^*_t = \mathrm{argmin}_x f_{\theta_t}(x)$.

We call this process surfing since it rides along the peaks of the evolving (negative) empirical risk function, starting from a smooth surface at the beginning of learning and ending with a wavy nonconvex surface after learning is complete. We formalize this algorithm in a manner that makes it amenable to analysis. First, when $\theta_0$ is initialized so that the weights are random Gaussian matrices, we prove a theorem showing that the surface has a descent direction at each point outside of a small neighborhood.
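The warm-starting idea just described can be illustrated on a toy example. The sketch below is an assumption-laden stand-in rather than the paper's setup: a one-layer ReLU generator whose weights drift linearly from a random initialization to final weights (a crude proxy for the SGD training trajectory), with the input iterate warm-started at each stage. The start point is a small random vector rather than the exact zero used later in Algorithm 1, since $x = 0$ is a flat point of this toy objective.

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0.0)
k, n, T = 4, 40, 20  # toy dimensions and number of network snapshots

W_init = rng.normal(0, 1 / np.sqrt(n), size=(n, k))    # theta_0: random initialization
W_final = rng.normal(0, 1 / np.sqrt(n), size=(n, k))   # theta_T: "trained" weights

def W_at(t):
    # Stand-in for training: parameters drift slowly from W_init to W_final.
    a = t / T
    return (1 - a) * W_init + a * W_final

x_star = rng.normal(size=k)
y = relu(W_final @ x_star)        # target generated by the final network

x = 0.01 * rng.normal(size=k)     # near zero; exactly 0 is a flat point here
for t in range(T + 1):            # surf: warm-start each f_t at the previous optimum
    W = W_at(t)
    for _ in range(500):          # inner gradient descent on f_t
        pre = W @ x
        x -= 0.5 * (W.T @ ((pre > 0) * (relu(pre) - y)))

final_loss = 0.5 * np.linalg.norm(relu(W_final @ x) - y) ** 2
```

Because the weights change only slightly between stages, each inner optimization starts close to the new minimizer, which is the mechanism the paper formalizes.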
The analysis of Hand and Voroninski (2019) does not directly apply in our case since the target $y$ is an arbitrary test point, and not necessarily generated according to the random network. We then give an analysis that describes how projected gradient descent can be used to proceed from the optimum of one network to the next. Our approach is based on the fact that the ReLU network and squared error objective result in a piecewise quadratic surface. Experiments are run to show how surfing can be used to find the global optimum and for compressed sensing even when direct gradient descent fails, using several experimental setups with networks trained with both VAE and GAN techniques.

2 Background and Previous Results

In this work we treat the problem of approximating an observed vector $y$ in terms of the output $G_{\hat\theta}(x)$ of a trained generative model. Traditional generative processes such as graphical models are statistical models that define a distribution over a sample space. When deep networks are viewed as generative models, the distribution is typically singular, being a deterministic mapping of a low-dimensional latent random vector to a high-dimensional output space. Certain forms of "reversible deep networks" allow for the computation of densities and inversion (Dinh et al., 2017; Kingma and Dhariwal, 2018; Chen et al., 2018).

The variational autoencoder (VAE) approach to training a generative (decoder) network is to model the conditional probability of $x$ given $y$ as Gaussian with mean $\mu(y)$ and covariance $\Sigma(y)$, assuming that a priori $x \sim N(0, I_k)$ is Gaussian. The mean and covariance are treated as the output of a secondary (encoder) neural network. The two networks are trained by maximizing the evidence lower bound (ELBO) with coupled gradient descent algorithms, one for the encoder network and the other for the decoder network $G_\theta(x)$ (Kingma and Welling, 2014).
Whether fitting the networks using a variational or GAN approach (Goodfellow et al., 2014; Arjovsky et al., 2017), the problem of "inverting" the network to obtain $x^* = \mathrm{argmin}\, f_\theta(x)$ is not addressed by the training procedure.

In the now classical compressed sensing framework (Candes et al., 2006; Donoho et al., 2006), the problem is to reconstruct a sparse signal after observing multiple linear measurements, possibly with added noise. More recent work has begun to investigate generative deep networks as a replacement for sparsity in compressed sensing. Bora et al. (2017) consider identifying $y = G(x_0)$ from linear measurements $Ay$ by optimizing $f(x) = \frac{1}{2}\|Ay - AG(x)\|^2$. Since this objective is nonconvex, it is not guaranteed that gradient descent will converge to the true global minimum. However, for certain classes of ReLU networks it is shown that so long as a point $\hat{x}$ is found for which $f(\hat{x})$ is sufficiently close to zero, then $\|y - G(\hat{x})\|$ is also small. For the case where $y$ does not lie in the image of $G$, an oracle type bound is shown implying that the solution $\hat{x}$ satisfies $\|G(\hat{x}) - y\|^2 \le C \inf_x \|G(x) - y\|^2 + \delta$ for some small error term $\delta$. The authors observe that in experiments the error seems to converge to zero when $\hat{x}$ is computed using simple gradient descent, but an analysis of this phenomenon is not provided.

Hand and Voroninski (2019) establish the important result that for a $d$-layer random network and random measurement matrix $A$, the least squares objective has favorable geometry, meaning that outside two small neighborhoods there are no first order stationary points, neither local minima nor saddle points.
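A quick numerical check of the measurement model underlying these objectives: a matrix $A$ with i.i.d. $N(0, 1/m)$ entries approximately preserves the norm of a fixed vector, which is what makes $\|Ay - AG(x)\|$ a useful proxy for $\|y - G(x)\|$. The dimensions below are illustrative, not taken from the papers cited.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 784, 200                                       # signal and measurement dimensions
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))    # i.i.d. N(0, 1/m) entries

y = rng.normal(size=n)
ratio = np.linalg.norm(A @ y) / np.linalg.norm(y)
# E||Ay||^2 = ||y||^2, so this ratio concentrates near 1 for moderate m
```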
We describe their setup and result in some detail, since it provides a springboard for the surfing algorithm. Let $G : \mathbb{R}^k \to \mathbb{R}^n$ be a $d$-layer fully connected feedforward generative neural network, which has the form

$$G(x) = \sigma(W_d \cdots \sigma(W_2 \sigma(W_1 x)) \cdots)$$

where $\sigma$ is the ReLU activation function. The matrix $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ is the set of weights for the $i$th layer and $n_i$ is the number of neurons in this layer, with $k = n_0 < n_1 < \cdots < n_d = n$. If $x_0 \in \mathbb{R}^k$ is the input then $AG(x_0)$ is a set of random linear measurements of the signal $y = G(x_0)$. The objective is to minimize $f_{A,\theta_0}(x) = \frac{1}{2}\|AG_{\theta_0}(x) - AG_{\theta_0}(x_0)\|^2$ where $\theta_0 = (W_1, \ldots, W_d)$ is the set of weights.

Due to the fact that the nonlinearities $\sigma$ are rectified linear units, $G_{\theta_0}(x)$ is a piecewise linear function. It is convenient to introduce notation that absorbs the activation $\sigma$ into the weight matrix $W_i$, denoting

$$W_{+,x} = \mathrm{diag}(Wx > 0)\,W.$$

For a fixed $W$, the matrix $W_{+,x}$ zeros out the rows of $W$ that do not have a positive dot product with $x$; thus, $\sigma(Wx) = W_{+,x}x$. We further define $W_{1,+,x} = \mathrm{diag}(W_1 x > 0)\,W_1$ and

$$W_{i,+,x} = \mathrm{diag}(W_i W_{i-1,+,x} \cdots W_{1,+,x}\, x > 0)\,W_i.$$

With this notation, we can rewrite the generative network $G_{\theta_0}$ in what looks like a linear form,

$$G_{\theta_0}(x) = W_{d,+,x} W_{d-1,+,x} \cdots W_{1,+,x}\, x,$$

noting that each matrix $W_{i,+,x}$ depends on the input $x$. If $f_{A,\theta_0}(x)$ is differentiable at $x$, we can write the gradient as

$$\nabla f_{A,\theta_0}(x) = \Big(\prod_{i=d}^{1} W_{i,+,x}\Big)^T A^T A \Big(\prod_{i=d}^{1} W_{i,+,x}\Big)x - \Big(\prod_{i=d}^{1} W_{i,+,x}\Big)^T A^T A \Big(\prod_{i=d}^{1} W_{i,+,x_0}\Big)x_0.$$

In this expression, one can see intuitively that under the assumption that $A$ and $W_i$ are Gaussian matrices, the gradient $\nabla f_{\theta_0}(x)$ should concentrate around a deterministic vector $v_{x,x_0}$.
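The identity $\sigma(Wx) = W_{+,x}x$ and the resulting locally linear form of $G$ can be verified directly. A small numpy check (toy dimensions and unit-variance weights, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)
k, n1, n2 = 4, 20, 30  # toy dimensions
W1 = rng.normal(size=(n1, k))
W2 = rng.normal(size=(n2, n1))
x = rng.normal(size=k)

def plus(W, v):
    # W_{+,v} = diag(Wv > 0) W : keep only rows with positive pre-activation
    return (W @ v > 0)[:, None] * W

G = relu(W2 @ relu(W1 @ x))     # the network applied to x

W1p = plus(W1, x)               # W_{1,+,x}
W2p = plus(W2, W1p @ x)         # W_{2,+,x} = diag(W2 W_{1,+,x} x > 0) W2
G_linear = W2p @ W1p @ x        # the locally linear form of G at x
```

Here `np.allclose(G, G_linear)` holds: on the linear piece containing `x`, the network is exactly the product of masked weight matrices.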
Hand and Voroninski (2019) establish sufficient conditions for concentration of the random matrices around deterministic quantities, so that $v_{x,x_0}$ has norm bounded away from zero if $x$ is sufficiently far from $x_0$ or a scalar multiple of $-x_0$. Their results show that for random networks having a sufficiently expansive number of neurons in each layer, the objective $f_{A,\theta_0}$ has a landscape favorable to gradient descent.

We build on these ideas, showing first that optimizing with respect to $x$ for a random network and arbitrary signal $y$ can be done with gradient descent. This requires modified proof techniques, since it is no longer assumed that $y = G_{\theta_0}(x_0)$. In fact, $y$ can be arbitrary and we wish to approximate it as $G_{\hat\theta}(x(y))$ for some $x(y)$. Second, after this initial optimization is carried out, we show how projected gradient descent can be used to track the optimum as the network undergoes a series of small changes. Our results are stated formally in the following section.

3 Theoretical Results

Suppose we have a sequence of networks $G_0, G_1, \ldots, G_T$ generated from the training process. For instance, we may take a network with randomly initialized weights as $G_0$, and record the network after each step of gradient descent in training; $G_T = G$ is the final trained network.

For a given vector $y \in \mathbb{R}^n$, we wish to minimize the objective $f(x) = \frac{1}{2}\|AG(x) - Ay\|^2$ with respect to $x$ for the final network $G$, where either $A = I \in \mathbb{R}^{n \times n}$, or $A \in \mathbb{R}^{m \times n}$ is a measurement matrix with i.i.d. $N(0, 1/m)$ entries in a compressed sensing context. Write

$$f_t(x) = \frac{1}{2}\|AG_t(x) - Ay\|^2, \quad \forall\, t \in [T]. \qquad (1)$$

Algorithm 1 Surfing
Input: Sequence of networks $\theta_0, \theta_1, \ldots, \theta_T$
1: $x_{-1} \leftarrow 0$
2: for $t = 0$ to $T$ do
3:   $x \leftarrow x_{t-1}$
4:   repeat
5:     $x \leftarrow x - \eta \nabla f_{\theta_t}(x)$
6:   until convergence
7:   $x_t \leftarrow x$
Output: $x_T$

The idea is that we first minimize $f_0$, which has a nicer landscape, to obtain the minimizer $x_0$. We then apply gradient descent on $f_t$ for $t = 1, 2, \ldots, T$ successively, starting from the minimizer $x_{t-1}$ for the previous network.

We provide some theoretical analysis in partial support of this algorithmic idea. First, we show that at random initialization $G_0$, all critical points of $f_0(x)$ are localized to a small ball around zero. Second, we show that if $G_0, \ldots, G_T$ are obtained from a discretization of a continuous flow, along which the global minimizer of $f_t(x)$ is unique and Lipschitz-continuous, then a projected-gradient version of surfing can successively find the minimizers for $G_1, \ldots, G_T$ starting from the minimizer for $G_0$.

We consider expansive feedforward neural networks $G : \mathbb{R}^k \times \Theta \to \mathbb{R}^n$ given by

$$G(x, \theta) = V\sigma(W_d \cdots \sigma(W_2 \sigma(W_1 x + b_1) + b_2) \cdots + b_d).$$

Here, $d$ is the number of intermediate layers (which we will treat as constant), $\sigma$ is the ReLU activation function $\sigma(x) = \max(x, 0)$ applied entrywise, and $\theta = (V, W_1, \ldots, W_d, b_1, \ldots, b_d)$ are the network parameters. The input dimension is $k \equiv n_0$, each intermediate layer $i \in [d]$ has weights $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ and biases $b_i \in \mathbb{R}^{n_i}$, and a linear transform $V \in \mathbb{R}^{n \times n_d}$ is applied in the final layer.

For our first result, consider fixed $y \in \mathbb{R}^n$ and a random initialization $G_0(x) \equiv G(x, \theta_0)$ where $\theta_0$ has Gaussian entries (independent of $y$). If the network is sufficiently expansive at each intermediate layer, then the following shows that with high probability, all critical points of $f_0(x)$ belong to a small ball around 0.
More concretely, the directional derivative $D_{-x/\|x\|} f_0(x)$ satisfies

$$D_{-x/\|x\|} f_0(x) \equiv \lim_{t \to 0^+} \frac{f_0(x - t\,x/\|x\|) - f_0(x)}{t} < 0. \qquad (2)$$

Thus $-x/\|x\|$ is a first-order descent direction of the objective $f_0$ at $x$.

Theorem 3.1. Fix $y \in \mathbb{R}^n$. Let $V$ have $N(0, 1/n)$ entries, let $b_i$ and $W_i$ have $N(0, 1/n_i)$ entries for each $i \in [d]$, and suppose these are independent. There exist $d$-dependent constants $C, C', c, \varepsilon_0 > 0$ such that for any $\varepsilon \in (0, \varepsilon_0)$, if

1. $n \ge n_d$ and $n_i > C(\varepsilon^{-2} \log \varepsilon^{-1})\, n_{i-1} \log n_i$ for all $i \in [d]$, and

2. Either $A = I$ and $m = n$, or $A \in \mathbb{R}^{m \times n}$ has i.i.d. $N(0, 1/m)$ entries (independent of $V, \{b_i\}, \{W_i\}$) where $m \ge Ck(\varepsilon^{-1} \log \varepsilon^{-1}) \log(n_1 \cdots n_d)$,

then with probability at least $1 - C(e^{-c\varepsilon m} + n_d e^{-c\varepsilon^4 n_{d-1}} + \sum_{i=1}^{d-1} n_i e^{-c\varepsilon^2 n_{i-1}})$, every $x \in \mathbb{R}^k$ outside the ball $\|x\| \le C'\varepsilon(1 + \|y\|)$ satisfies (2).

We defer the proof to the supplementary material. Note that if instead $G_0$ were correlated with $y$, say $y = G_0(x^*)$ for some input $x^*$ with $\|x^*\| \asymp 1$, then $x^*$ would be a global minimizer of $f_0(x)$, and we would have $\|y\| \asymp \|x_d\| \asymp \cdots \asymp \|x_1\| \asymp \|x^*\| \asymp 1$ in the above network, where $x_i \in \mathbb{R}^{n_i}$ is the output of the $i$th layer. The theorem shows that for a random initialization of $G_0$ which is independent of $y$, the minimizer is instead localized to a ball around 0 which is smaller in radius by the factor $\varepsilon$.

For our second result, consider a network flow

$$G_s(x) \equiv G(x, \theta(s))$$

for $s \in [0, S]$, where $\theta(s) = (V(s), W_1(s), b_1(s), \ldots, W_d(s), b_d(s))$ evolve continuously in a time parameter $s$. As a model for network training, we assume that $G_0, \ldots, G_T$ are obtained by discrete sampling from this flow via $G_t = G_{\delta t}$, corresponding to $s \equiv \delta t$ for a small time discretization step $\delta$. We assume boundedness of the weights and uniqueness and Lipschitz-continuity of the global minimizer along this flow.

Assumption 3.2. There are constants $M, L < \infty$ such that

1. For every $i \in [d]$ and $s \in [0, S]$, $\|W_i(s)\| \le M$.

2. The global minimizer $x^*(s) = \mathrm{argmin}_x f(x, \theta(s))$ is unique and satisfies $\|x^*(s) - x^*(s')\| \le L|s - s'|$, where $f(x, \theta(s)) = \frac{1}{2}\|AG(x, \theta(s)) - Ay\|^2$.

Fixing $\theta$, the function $G(x, \theta)$ is continuous and piecewise-linear in $x$. For each $x \in \mathbb{R}^k$, there is at least one linear piece $P_0$ (a polytope in $\mathbb{R}^k$) of this function that contains $x$. For a slack parameter $\tau > 0$, consider the rows given by

$$S(x, \theta, \tau) = \{(i, j) : |w_{i,j}^\top x_{i-1} + b_{i,j}| \le \tau\},$$

where

$$x_{i-1} = \sigma(W_{i-1} \cdots \sigma(W_1 x + b_1) \cdots + b_{i-1})$$

is the output of the $(i-1)$th layer for this input $x$, and $v_j^\top$, $w_{i,j}^\top$, and $b_{i,j}$ are respectively the $j$th row of $V$, the $j$th row of $W_i$, and the $j$th entry of $b_i$ in $\theta$. This set $S(x, \theta, \tau)$ represents those neurons that are close to 0 before ReLU thresholding, and hence whose activations may change after a small change of the network input $x$. Define

$$\mathcal{P}(x, \theta, \tau) = \{P_0, P_1, \ldots, P_G\}$$

as the set of all linear pieces $P_g$ whose activation patterns differ from $P_0$ only in rows belonging to $S(x, \theta, \tau)$.
That is, for every $x' \in P_g \in \mathcal{P}(x, \theta, \tau)$ and $(i, j) \notin S(x, \theta, \tau)$, we have

$$\mathrm{sign}(w_{i,j}^\top x'_{i-1} + b_{i,j}) = \mathrm{sign}(w_{i,j}^\top x_{i-1} + b_{i,j}),$$

where $x'_{i-1}$ is the output of the $(i-1)$th layer for input $x'$.

With this definition, we consider a stylized projected-gradient surfing procedure in Algorithm 2, where $\mathrm{Proj}_P$ is the orthogonal projection onto the polytope $P$.

Algorithm 2 Projected-gradient Surfing
Input: Network flow $\{G(\cdot, \theta(s)) : s \in [0, S]\}$, parameters $\delta, \tau, \eta > 0$.
1: Initialize $x_0 = \mathrm{argmin}_x f(x, \theta(0))$.
2: for $t = 1, \ldots, T$ do
3:   for each linear piece $P_g \in \mathcal{P}(x_{t-1}, \theta(\delta t), \tau)$ do
4:     $x \leftarrow x_{t-1}$
5:     repeat
6:       $x \leftarrow \mathrm{Proj}_{P_g}(x - \eta \nabla f(x, \theta(\delta t)))$
7:     until convergence
8:     $x_t^{(g)} \leftarrow x$
9:   $x_t \leftarrow x_t^{(g)}$ for the $g \in \{0, \ldots, G\}$ that achieves the minimum value of $f(x_t^{(g)}, \theta(\delta t))$.
Output: $x_T$

The complexity of this algorithm depends on the number of pieces $G$ to be optimized over in each step. We expect this to be small in practice when the slack parameter $\tau$ is chosen sufficiently small, and provide a heuristic argument in the supplement indicating why this may be the case.

The following shows that for any $\tau > 0$, there is a sufficiently fine time discretization $\delta$ depending on $\tau, M, L$ such that Algorithm 2 tracks the global minimizer. In particular, for the final objective $f_T(x) = f(x, \theta(\delta T))$ corresponding to the network $G_T$, the output $x_T$ is the global minimizer of $f_T(x)$. We remark that the time discretization $\delta$ may need to be smaller for deeper networks, as $G(x)$ corresponding to a deeper network may have a larger Lipschitz constant in $x$.
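The slack set $S(x, \theta, \tau)$ is straightforward to compute by recording pre-activations during a forward pass. A sketch for a toy two-layer network with biases (hypothetical dimensions; with $\tau = 0$ the set is generically empty, while a huge $\tau$ captures every neuron):

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(z, 0.0)
k, n1, n2 = 3, 10, 15  # toy dimensions
W = [rng.normal(size=(n1, k)), rng.normal(size=(n2, n1))]
b = [rng.normal(size=n1), rng.normal(size=n2)]

def slack_set(x, W, b, tau):
    """Neurons (i, j) whose pre-activation at input x is within tau of 0."""
    S = []
    h = x
    for i, (Wi, bi) in enumerate(zip(W, b), start=1):
        pre = Wi @ h + bi
        S += [(i, j) for j in range(len(pre)) if abs(pre[j]) <= tau]
        h = relu(pre)  # output of layer i, fed to the next layer
    return set(S)

x = rng.normal(size=k)
S0 = slack_set(x, W, b, 0.0)      # generically empty: no pre-activation is exactly 0
S_big = slack_set(x, W, b, 1e6)   # every neuron lies within a huge slack
```

In Algorithm 2, only the activation patterns indexed by this set may flip between consecutive networks, which is what keeps the number of candidate pieces small for small $\tau$.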
The specific dependence below arises from bounding this Lipschitz constant by $\prod_{i=1}^d \|W_i\|$, which is a conservative bound also used and discussed in greater detail in Szegedy et al. (2014); Virmaux and Scaman (2018).

Theorem 3.3. Suppose Assumption 3.2 holds. For any $\tau > 0$, if $\delta < \tau/(L \max(M, 1)^{d+1})$ and $x_0 = \mathrm{argmin}_x f(x, \theta(0))$, then the iterates $x_t$ in Algorithm 2 are given by $x_t = \mathrm{argmin}_x f(x, \theta(\delta t))$ for each $t = 1, \ldots, T$.

Proof. For any fixed $\theta$, let $x, x' \in \mathbb{R}^k$ be two inputs to $G(x, \theta)$. If $x_i, x'_i$ are the corresponding outputs of the $i$th layer, using the assumption $\|W_i\| \le M$ and the fact that the ReLU activation $\sigma$ is 1-Lipschitz, we have

$$\|x_i - x'_i\| = \|\sigma(W_i x_{i-1} + b_i) - \sigma(W_i x'_{i-1} + b_i)\| \le \|(W_i x_{i-1} + b_i) - (W_i x'_{i-1} + b_i)\| \le M\|x_{i-1} - x'_{i-1}\| \le \cdots \le M^i \|x - x'\|.$$

Let $x^*(s) = \mathrm{argmin}_x f(x, \theta(s))$. By assumption, $\|x^*(s - \delta) - x^*(s)\| \le L\delta$. For the network with parameter $\theta(s)$ at time $s$, let $x_{*,i}(s)$ and $x_{*,i}(s - \delta)$ be the outputs at the $i$th layer corresponding to inputs $x^*(s)$ and $x^*(s - \delta)$.
Then for any $i \in [d]$ and $j \in [n_i]$, the above yields

$$|(w_{i,j}(s)^\top x_{*,i}(s - \delta) + b_{i,j}) - (w_{i,j}(s)^\top x_{*,i}(s) + b_{i,j})| \le \|w_{i,j}(s)\| \|x_{*,i}(s - \delta) - x_{*,i}(s)\| \le M \cdot M^i \|x^*(s - \delta) - x^*(s)\| \le M^{i+1} L \delta.$$

For $\delta < \tau/(L \max(M, 1)^{d+1})$, this implies that for every $(i, j)$ where $|w_{i,j}(s)^\top x_{*,i}(s - \delta) + b_{i,j}| \ge \tau$, we have

$$\mathrm{sign}(w_{i,j}(s)^\top x_{*,i}(s - \delta) + b_{i,j}) = \mathrm{sign}(w_{i,j}(s)^\top x_{*,i}(s) + b_{i,j}).$$

That is, $x^*(s) \in P_g$ for some $P_g \in \mathcal{P}(x^*(s - \delta), \theta(s), \tau)$. Assuming that $x_{t-1} = x^*(\delta(t-1))$, this implies that the next global minimizer $x^*(\delta t)$ belongs to some $P_g \in \mathcal{P}(x_{t-1}, \theta(\delta t), \tau)$. Since $f(x, \theta(\delta t))$ is quadratic on $P_g$, projected gradient descent over $P_g$ in Algorithm 2 converges to $x^*(\delta t)$, and hence Algorithm 2 yields $x_t = x^*(\delta t)$. The result then follows from induction on $t$.

4 Experiments

We present experiments to illustrate the performance of surfing over a sequence of networks during training compared with gradient descent over the final trained network.
We mainly use the Fashion-MNIST dataset to carry out the simulations, which is similar to MNIST in many characteristics, but is more difficult to train. We build multiple generative models, trained using VAE (Kingma and Welling, 2014), DCGAN (Radford et al., 2015), WGAN (Arjovsky et al., 2017) and WGAN-GP (Gulrajani et al., 2017). The structures of the generator/decoder networks that we use are the same as those reported by Chen et al. (2016); they include two fully connected layers and two transposed convolution layers with batch normalization after each layer (Ioffe and Szegedy, 2015). We use the simple surfing algorithm in these experiments, rather than the projected-gradient algorithm proposed for theoretical analysis. Note also that the network architectures do not precisely match the expansive ReLU networks used in our analysis. Instead, we experiment with architectures and training procedures that are meant to better reflect the current state of the art.

Table 1: Surfing compared against direct gradient descent over the final trained network, for various generative models with input dimensions $k = 5, 10, 20$. Shown are percentages of "successful" solutions $\hat{x}_T$ satisfying $\|\hat{x}_T - x^*\| < 0.01$, and 75th-percentiles of the total number of gradient descent steps used (across all networks $G_0, \ldots, G_T$ for surfing) until $\|\hat{x}_T - x^*\| < 0.01$ was reached.

Model (k = 5, 10, 20)
VAE       % successful: Regular Adam 98.7, 100, 100;    Surfing 100, 100, 100
          # iterations: Regular Adam 737, 1330, 8215;   Surfing 775, 1404, 10744
DCGAN     % successful: Regular Adam 68.7, 64.7, 48.3;  Surfing 98.7, 95.7, 78.3
          # iterations: Regular Adam 4560, 1915, 618;   Surfing 6514, 2394, 741
WGAN      % successful: Regular Adam 56.0, 84.3, 90.3;  Surfing 81.7, 97.3, 99.3
          # iterations: Regular Adam 464, 1227, 3702;   Surfing 547, 1450, 4986
WGAN-GP   % successful: Regular Adam 80.0, 64.7, 47.0;  Surfing 96.3, 97.3, 83.7
          # iterations: Regular Adam 18937, 15445, 463; Surfing 33294, 25991, 564

Figure 2: Distribution of distance between solution $\hat{x}_T$ and the truth $x^*$ for DCGAN trained models (input dimensions 5, 10, 20), comparing surfing (red) to regular gradient descent (blue) over the final network. Both procedures use Adam in their gradient descent computations. The results indicate that direct descent often succeeds, but can also converge to a point that is far from the optimum. By moving along the optimum of the evolving surface, surfing is able to move closer to the optimum in these cases.

We first consider the problem of minimizing the objective $f(x) = \frac{1}{2}\|G(x) - G(x^*)\|^2$ and recovering the image generated from a trained network $G(x) = G_{\theta_T}(x)$ with input $x^*$. We run surfing by taking a sequence of parameters $\theta_0, \theta_1, \ldots, \theta_T$ for $T = 100$, where $\theta_0$ are the initial random parameters and the intermediate $\theta_t$'s are taken every 40 training steps, and we use Adam (Kingma and Ba, 2014) to carry out gradient descent in $x$ over each network $G_{\theta_t}$. We compare this to "regular Adam", which uses Adam to optimize over $x$ in only the final trained network $G_{\theta_T}$ for $T = 100$.

To ensure that the runtime of surfing is comparable to that of a single initialization of regular Adam, we do not run Adam until convergence for each intermediate network in surfing.
Instead, we use a fixed schedule of iterations for the networks $G_{\theta_0}, \ldots, G_{\theta_{T-1}}$, and run Adam to convergence in only the final network $G_{\theta_T}$. The total number of iterations for networks $G_{\theta_0}, \ldots, G_{\theta_{T-1}}$ is set as the 75th-percentile of the iteration count required for convergence of regular Adam. These are split across the networks proportional to a deterministic schedule that allots more steps to the earlier networks where the landscape of $G(x)$ changes more rapidly, and fewer steps to later networks where this landscape stabilizes.

Figure 3: Compressed sensing setting for exact recovery. As a function of the number of random measurements $m$, the lines show the proportion of times surfing (red) and regular gradient descent with Adam (blue) are able to recover the true signal $y = G(x)$, using DCGAN and WGAN.

Figure 4: Compressed sensing setting for approximation, or rate-distortion. As a function of the number of random measurements $m$, the box plots summarize the distribution of the per-pixel reconstruction errors for DCGAN and WGAN trained models, using surfing (red) and regular gradient descent with Adam (blue).

For each network training condition, we apply surfing and regular Adam for 300 trials, where in each trial a randomly generated $x^*$ and initial point $x_{\text{init}}$ are chosen uniformly from the hypercube $[-1, 1]^k$. Table 1 shows the percentage of trials where the solutions $\hat{x}_T$ satisfy our criterion for successful recovery $\|\hat{x}_T - x^*\| < 0.01$, for different models and over three different input dimensions $k$. The table also shows the 75th-percentile for the total number of gradient descent iterations taken (across all networks for surfing), verifying that the runtime of surfing was typically 1–2x that of regular Adam. We also provide the distributions of $\|\hat{x}_T - x^*\|$ under each setting: Figure 2 shows the results for DCGAN, and results for the other models are collected in the supplementary material.

We next consider the compressed sensing problem with objective $f(x) = \frac{1}{2}\|AG(x) - AG(x^*)\|^2$ where $A \in \mathbb{R}^{m \times n}$ is the Gaussian measurement matrix. We carry out 200 trials for each choice of the number of measurements $m$. The parameters $\theta_t$ for surfing are taken every 100 training steps. As before, we record the proportion of the solutions that are close to the truth $x^*$ according to $\|\hat{x}_T - x^*\| < 0.01$. Figure 3 shows the results for DCGAN and WGAN trained networks with input dimension $k = 20$.

Lastly, we consider the objective $f(x) = \frac{1}{2}\|AG(x) - Ay\|^2$, where $y$ is a real image from the hold-out test data. This can be thought of as a rate-distortion setting, where the error varies as a function of the number of measurements used.
We carry out the same experiments as before and compute the average per-pixel reconstruction error √((1/n)‖G(x̂_T) - y‖²) as in Bora et al. (2017). Figure 4 shows the distributions of the reconstruction error as the number of measurements m varies.

5 Discussion

This paper has explored the idea of incrementally optimizing a sequence of objective risk functions obtained for models that are slowly changing during training by stochastic gradient descent. When initialized with random parameters θ_0, we have shown that the empirical risk function f_{θ_0}(x) = (1/2)‖G_{θ_0}(x) - y‖² is well behaved and easy to optimize. The surfing algorithm initializes x for the current network G_{θ_t}(x) at the optimum x*_{t-1} found for the previous network G_{θ_{t-1}}(x) and then carries out gradient descent to obtain the updated point x*_t = argmin_x f_{θ_t}(x). Our experiments show that this scheme has merit, and often significantly outperforms direct gradient descent on the final model alone.

On the theoretical side, our main technical result applies and extends ideas of Hand and Voroninski (2019) to show that for random ReLU networks that are sufficiently expansive, the surface of f_{θ_0}(x) is well-behaved for arbitrary target vectors y. This result may be of independent interest, but it is essential for the surfing algorithm because initially the model is poor, with high approximation error. The analysis for the incremental scheme uses projected gradient descent, although we find that simple gradient descent works well in practice. The analysis assumes that the argmin over the surface evolves continuously in training.
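To make the incremental scheme concrete, here is a minimal sketch of the surfing loop. One-layer ReLU generators whose weights drift slowly stand in for the SGD training checkpoints, plain gradient descent replaces Adam, and the sizes, drift scale, and step schedule are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, T = 5, 50, 10                  # latent dim, output dim, checkpoint count

# Stand-ins for the checkpoints G_{theta_0}, ..., G_{theta_T}: one-layer ReLU
# generators whose weights drift slowly, mimicking the SGD training trajectory
# (a toy family, not the paper's trained models).
W = rng.normal(size=(n, k)) / np.sqrt(n)
checkpoints = []
for _ in range(T + 1):
    checkpoints.append(W.copy())
    W = W + 0.02 * rng.normal(size=(n, k)) / np.sqrt(n)

# Target produced by the final network, so f_{theta_T} has global minimum 0.
y = np.maximum(checkpoints[-1] @ rng.uniform(-1, 1, size=k), 0.0)

def loss(Wt, x):
    """f_{theta_t}(x) = (1/2) ||G_{theta_t}(x) - y||^2 for checkpoint weights Wt."""
    return 0.5 * float(np.linalg.norm(np.maximum(Wt @ x, 0.0) - y) ** 2)

def grad(Wt, x):
    h = Wt @ x
    return Wt.T @ ((h > 0) * (np.maximum(h, 0.0) - y))

# Surfing: warm-start each surface f_{theta_t} at the minimizer found for
# f_{theta_{t-1}}; spend more steps on early surfaces, then run the final
# (wavy) surface to convergence.
x = rng.uniform(-1, 1, size=k)       # random start on the first, smooth surface
loss_init = loss(checkpoints[-1], x)
for t, Wt in enumerate(checkpoints):
    steps = 600 - 50 * t if t < T else 4000
    for _ in range(steps):
        x -= 0.05 * grad(Wt, x)
loss_final = loss(checkpoints[-1], x)
```

Direct descent corresponds to skipping the loop over early checkpoints and optimizing only the final surface from the random start.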
The continuity assumption is necessary: if the global minimum is discontinuous as a function of t, so that the minimizer "jumps" to a faraway point, then the surfing procedure will fail in practice.

In our experiments, we see that simple surfing can indeed be effective for mapping outputs y to inputs x for the trained network, where it often outperforms direct gradient descent for a range of deep network architectures and training procedures. However, these simulations also point to the fact that in some settings, direct gradient descent itself can be surprisingly effective. A deeper understanding of this phenomenon could lead to more advanced surfing algorithms that are able to ride to the final optimum even more efficiently and more often.

Acknowledgments

Research supported in part by NSF grants DMS-1513594, CCF-1839308, DMS-1916198, and a J.P. Morgan Faculty Research Award.

References

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. arXiv:1701.07875.

Bora, A., Jalal, A., Price, E., and Dimakis, A. G. (2017). Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning, pages 537-546.

Candes, E. J., Romberg, J. K., and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207-1223.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31, pages 6571-6583.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets.
In Advances in Neural Information Processing Systems, pages 2172-2180.

Davis, D., Drusvyatskiy, D., Kakade, S., and Lee, J. D. (2018). Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, pages 1-36.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using real NVP. arXiv:1605.08803.

Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306.

Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. (2018a). Gradient descent finds global minima of deep neural networks. arXiv:1811.03804.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2018b). Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777.

Hand, P. and Voroninski, V. (2019). Global guarantees for enforcing deep generative priors by empirical risk. IEEE Transactions on Information Theory.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.

Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, pages 10215-10224.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014.

Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157-8166.

Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems 30, pages 597-607.

Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations.

Virmaux, A. and Scaman, K. (2018). Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In Advances in Neural Information Processing Systems, pages 3835-3844.