{"title": "Inverting Deep Generative models, One layer at a time", "book": "Advances in Neural Information Processing Systems", "page_first": 13910, "page_last": 13919, "abstract": "We study the problem of inverting a deep generative model with ReLU activations. \nInversion corresponds to finding a latent code vector that explains observed measurements as much as possible. \nIn most prior works this is performed by attempting to solve a non-convex optimization problem involving the generator. \nIn this paper we obtain several novel theoretical results for the inversion problem. \n\nWe show that for the realizable case, single layer inversion can be performed exactly in polynomial time, by solving a linear program. Further, we show that for two layers, inversion is NP-hard to recover binary latent code (even for the realizable case) and the pre-image set can be non-convex. \n\nFor generative models of arbitrary depth, we show that exact recovery is possible in polynomial time \nwith high probability, if the layers are expanding and the weights are randomly selected. \nVery recent work analyzed the same problem for gradient descent inversion. Their analysis requires significantly higher expansion (logarithmic in the latent dimension) while our proposed algorithm can provably reconstruct even with constant factor expansion. \nWe also provide provable error bounds for different norms for reconstructing noisy observations. Our empirical validation\ndemonstrates that we obtain better reconstructions when the latent dimension is large.", "full_text": "Inverting Deep Generative models,\n\nOne layer at a time\n\nQi Lei\u2020, Ajil Jalal\u2020,\n\nInderjit S. Dhillon\u2020\u2021, and Alexandros G. 
Dimakis\u2020\n\n\u2020 UT Austin \u2021 Amazon\n\n{leiqi@oden., ajiljalal@, inderjit@cs.,\n\ndimakis@austin.}utexas.edu\n\nAbstract\n\nWe study the problem of inverting a deep generative model with ReLU activations.\nInversion corresponds to \ufb01nding a latent code vector that explains observed mea-\nsurements as much as possible. In most prior works this is performed by attempting\nto solve a non-convex optimization problem involving the generator. In this paper\nwe obtain several novel theoretical results for the inversion problem.\nWe show that for the realizable case, single layer inversion can be performed\nexactly in polynomial time, by solving a linear program. Further, we show that for\nmultiple layers, inversion is NP-hard and the pre-image set can be non-convex.\nFor generative models of arbitrary depth, we show that exact recovery is possible\nin polynomial time with high probability, if the layers are expanding and the\nweights are randomly selected. Very recent work analyzed the same problem for\ngradient descent inversion. Their analysis requires signi\ufb01cantly higher expansion\n(logarithmic in the latent dimension) while our proposed algorithm can provably\nreconstruct even with constant factor expansion. We also provide provable error\nbounds for different norms for reconstructing noisy observations. Our empirical\nvalidation demonstrates that we obtain better reconstructions when the latent\ndimension is large.\n\nIntroduction\n\n1\nModern deep generative models are demonstrating excellent performance as signal priors, frequently\noutperforming the previous state of the art for various inverse problems including denoising, inpaint-\ning, reconstruction from Gaussian projections and phase retrieval (see e.g. [4, 6, 10, 5, 11, 25] and\nreferences therein). Consequently, there is substantial work on improving compressed sensing with\ngenerative adversarial network (GANs) [9, 17, 13, 18, 20]. 
Similar ideas have recently been applied to sparse PCA with a generative prior [2].
A central problem that appears when trying to solve inverse problems using deep generative models is inverting a generator [4, 12, 24]. We are interested in deep generative models, parameterized as feed-forward neural networks with ReLU/LeakyReLU activations. For a generator G(z) that maps low-dimensional vectors in R^k to high-dimensional vectors (e.g. images) in R^n, we want to reconstruct the latent code z* if we can observe x = G(z*) (realizable case) or a noisy version x = G(z*) + e, where e denotes measurement noise. We are therefore interested in the optimization problem

    arg min_z ‖x − G(z)‖_p,    (1)

for some p-norm. With this procedure, we learn a concise representation of a given image x ∈ R^n as z ∈ R^k, k ≪ n. This applies to image compression and denoising tasks, as studied in [14, 13].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Meanwhile, this problem is a starting point for general linear inverse problems:

    arg min_z ‖x − A G(z)‖_p,    (2)

since several recent works leverage inversion as a key step in solving more general inverse problems, see e.g. [24, 22]. Specifically, Shah et al. [24] provide theoretical guarantees on obtaining the optimal solution for (2) with projected gradient descent, provided one can solve (1) exactly. This work provides a provable algorithm to perform this projection step under some assumptions.
Previous work focuses on the ℓ2 norm, which works slowly with gradient descent [4, 15]. In this work, we focus instead on direct solvers and error bound analysis for the ℓ∞ and ℓ1 norms.¹ Note that this is a non-convex optimization problem even for a single-layer network with ReLU activations.
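To make the difficulty concrete, the ℓ2 gradient-descent baseline for a single ReLU layer can be sketched in a few lines of numpy; the step size, iteration count, and random initialization below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gd_invert_single_layer(x, W, steps=2000, lr=0.01, seed=0):
    """Gradient descent on f(z) = 0.5 * ||ReLU(W z) - x||_2^2.
    The ReLU makes f non-convex, so the outcome depends on the start point."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(W.shape[1])
    for _ in range(steps):
        pre = W @ z
        residual = np.maximum(pre, 0.0) - x
        grad = W.T @ ((pre > 0) * residual)  # chain rule through the ReLU
        z -= lr * grad
    return z
```

Because the objective is piecewise quadratic, with flat regions wherever a neuron is off, a run can stall far from z*, which is exactly the failure mode discussed next.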
Therefore gradient descent may get stuck at local minima or require a long time to converge. For example, for MNIST, compressing a single image by optimizing (1) takes on average several minutes and may need multiple restarts.
Our Contributions: For the realizable case, we show that for a single layer, solving (1) is equivalent to solving a linear program. For networks with more than one layer, however, we show it is NP-hard simply to determine whether exact recovery is possible. For a two-layer network we show that the pre-image in the latent space can be a non-convex set.
For realizable inputs and arbitrary depth, we show that inversion is possible in polynomial time if the network layers have sufficient expansion and the weights are randomly selected. A similar result was established very recently for gradient descent [15]. We instead propose inversion by layer-wise Gaussian elimination. Our result holds even if each layer expands by a constant factor, while [15] requires a logarithmic multiplicative expansion in each layer.
For noisy inputs and arbitrary depth, we propose two algorithms that rely on iteratively solving linear programs to reconstruct each layer. We establish provable bounds on the reconstruction error when the weights are random and have constant expansion.
We also show empirically that our method matches and sometimes outperforms gradient descent for inversion, especially when the latent dimension becomes larger.

2 Setup
We consider deep generative models G : R^k → R^n with the latent dimension k smaller than the signal dimension n, parameterized by a d-layer feed-forward network of the form

    G(z) = φ_d(φ_{d−1}(··· φ_2(φ_1(z)) ···)),    (3)

where each layer φ_i(a) is defined as a composition of an activation and a linear map: φ_i(a) = ReLU(W_i a + b_i). We focus on the ReLU activation ReLU(a) = max{a, 0}, applied coordinate-wise, and will also consider the activation LeakyReLU(a) = ReLU(a) − c ReLU(−a), where the scaling factor c ∈ (0, 1) is typically 0.1.² W_i ∈ R^{n_i × n_{i−1}} are the weights of the network, and b_i ∈ R^{n_i} are the bias terms. Therefore n_0 = k and n_d = n are the dimensionalities of the input and output of the generator G. We use z_i to denote the output of the i-th layer. Note that one can absorb the bias terms b_i, i = 1, 2, ···, d into W_i by adding one more dimension with a constant input. Therefore, without loss of generality, we sometimes omit b_i from the equations unless it is explicitly needed.
We use bold lower-case symbols for vectors, e.g. x, with x_i for its coordinates. We use upper-case symbols to denote matrices, e.g. W, where w_i is its i-th row vector. For an index set I, W_{I,:} denotes the submatrix of W consisting of the i-th rows of W for all i ∈ I.
The central challenge is to determine the signs of the intermediate variables of the hidden layers.
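As a quick illustration (ours, not from the paper), a forward pass of the generator in (3) that also records these sign patterns can be sketched as:

```python
import numpy as np

def generator_forward(z, weights, biases):
    """Forward pass of G in (3), phi_i(a) = ReLU(W_i a + b_i), while recording
    each layer's sign pattern: which neurons are 'on' (pre-activation > 0)."""
    a = np.asarray(z, dtype=float)
    configs = []
    for W, b in zip(weights, biases):
        pre = W @ a + b
        configs.append(pre > 0)       # the pattern an inverter must determine
        a = np.maximum(pre, 0.0)      # ReLU
    return a, configs
```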
We refer to these sign patterns as "ReLU configurations" throughout the paper, indicating which neurons are 'on' and which are 'off'.

3 Invertibility for ReLU Realizable Networks
In this section we study the realizable case, i.e., when we are given an observation vector x for which there exists z* such that x = G(z*). In particular, we show that the problem is NP-hard for ReLU activations in general, but can be solved in polynomial time, with high probability, under some mild assumptions. We present our theoretical findings first; all proofs of the paper are deferred to the Appendix.
Inverting a Single Layer. We start with the simplest one-layer case: deciding whether min_z ‖x − G(z)‖_p = 0, for any p-norm. Since the problem is non-convex, further assumptions on W are required [15] for gradient descent to work. When the problem is realizable, however, to find a feasible z such that x = φ(z) ≡ ReLU(Wz + b), one can invert the function by solving the linear program

    w_i^⊤ z + b_i = x_i,  ∀i s.t. x_i > 0
    w_i^⊤ z + b_i ≤ 0,    ∀i s.t. x_i = 0.    (4)

Its solution set is convex and forms a polytope, but possibly contains uncountably many feasible points. Therefore it is unclear how to continue the process of layer-wise inversion unless further assumptions are made.

¹ Notice the relation between ℓ_p norms: ‖·‖_p ≥ ‖·‖_q for 1 ≤ p ≤ q ≤ ∞. Therefore, studying ℓ1 and ℓ∞ is enough to bound all intermediate ℓ_p norms for p ∈ [1, ∞).
² The inversion of LeakyReLU networks is much easier than that of ReLU networks, and we therefore only mention it when needed.
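For illustration, the feasibility program (4) can be handed to an off-the-shelf LP solver; the sketch below uses scipy's `linprog`, which is our implementation choice and not prescribed by the paper.

```python
import numpy as np
from scipy.optimize import linprog

def invert_relu_layer(x, W, b):
    """Solve the feasibility LP (4): w_i^T z + b_i = x_i where x_i > 0,
    and w_i^T z + b_i <= 0 where x_i = 0.  Returns one feasible z, or None."""
    on = x > 0
    off = ~on
    k = W.shape[1]
    res = linprog(
        c=np.zeros(k),                                  # feasibility only
        A_eq=W[on] if on.any() else None,
        b_eq=(x - b)[on] if on.any() else None,
        A_ub=W[off] if off.any() else None,
        b_ub=(-b)[off] if off.any() else None,
        bounds=[(None, None)] * k,
        method="highs",
    )
    return res.x if res.success else None
```

When W is sufficiently tall and generic, the equality constraints alone typically pin down z uniquely, which is the observation exploited below for expansive random networks.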
To demonstrate the challenges of generalizing the result to deeper networks, we show that the solution set becomes non-convex, and that determining whether any solution exists is NP-complete.
Challenges in Inverting a ReLU Network with Two or More Layers.
We study the complexity of inverting deep ReLU networks in general. We do this by constructing a 4-layer network and proving the following statement:
Theorem 1 (NP-hardness of Recovering ReLU Networks over the Real Domain). Given a four-layer ReLU neural network G : R^k → R^2 with fixed weights, and an observation vector x ∈ R^2, the problem of determining whether there exists z ∈ R^k such that G(z) = x is NP-complete.

The conclusion naturally holds for generative models with deeper architectures. We defer the proof to the Appendix; it is constructive and shows that the 3SAT problem is reducible to the above four-layer network recovery problem. Meanwhile, when the ReLU configuration of each layer is given, the recovery problem reduces to solving a simple linear system. Therefore the problem lies in NP, and together we have NP-completeness.
Moreover, although the pre-image of a single layer is a polytope and thus convex, this does not continue to hold for more than one layer; see Example 1. Fortunately, we show next that some moderate conditions guarantee a polynomial-time solution with high probability.
Inverting an Expansive Random Network in Polynomial Time.
Assumption 1. For a weight matrix W ∈ R^{n×k}, we assume 1) its entries are sampled i.i.d. Gaussian, and 2) the weight matrix is tall: n = c_0 k for some constant c_0 ≥ 2.1.

In the previous section, we showed that per-layer inversion can be achieved through the linear program (4). Under Assumption 1 we can prove that the solution is unique with high probability, and thus Theorem 2 holds for ReLU networks of arbitrary depth.
Theorem 2.
Let G : R^k → R^n be a generative model given by a d-layer neural network with ReLU activations. If the weight matrix W_i of each layer satisfies Assumption 1, then for any prior z* ∈ R^k and observation x = G(z*), with probability 1 − e^{−Ω(k)}, z* can be recovered from x by solving layer-wise linear equations. Namely, a random, expansive and realizable generative model can be inverted in polynomial time with high probability.
In our proof, we show that with high probability the observation x ∈ R^n has at least k non-zero entries, which yield k equalities whose coefficient matrix is invertible with probability 1. Therefore the time complexity of exact recovery is no worse than Σ_{i=0}^{d−1} n_i^{2.376} [7], since the recovery simply requires solving d linear systems of dimensions n_{i−1}, i ∈ [d].
Inverting LeakyReLU Networks: On the other hand, inversion of LeakyReLU layers is significantly easier in the realizable case. Unlike ReLU, LeakyReLU is a bijective map, i.e., each observation corresponds to a unique preimage:

    LeakyReLU^{−1}(x) = { x    if x ≥ 0
                        { x/c  otherwise.    (5)

Therefore, as long as each W_i ∈ R^{n_i × n_{i−1}} has rank n_{i−1}, each layer map φ_i is also bijective and can be computed via the inverse of LeakyReLU (5) and linear regression.

4 Invertibility for Noisy ReLU Networks
Besides the realizable case, the study of noise tolerance is essential for many real applications. In this section we therefore consider the noisy setting with observation x = G(z*) + e, and investigate approximate recovery of z* by relaxing some of the equalities in (4). We analyze the problem with both ℓ∞ and ℓ1 error bounds, suited to different types of random noise distributions. Throughout this section, all generators are without bias terms.
4.1 ℓ∞ Norm Error Bound
Again we start with a single layer, i.e. we observe x = φ(z*) + e = ReLU(Wz*) + e. Depending on the distribution of the measurement noise e, a different norm should be used in the objective ‖G(z) − x‖, with a corresponding error bound analysis. We first look at the case where the entries of e are uniformly bounded, and the approximation of arg min_z ‖φ(z) − x‖∞.
Note that for an error ‖e‖∞ ≤ ε, the true prior z* that produces the observation x = φ(z*) + e satisfies the constraints

    x_j − ε ≤ w_j^⊤ z ≤ x_j + ε    if x_j > ε,  j ∈ [n]
    w_j^⊤ z ≤ x_j + ε              if x_j ≤ ε,  j ∈ [n],    (6)

which is also equivalent to the set {z : ‖φ(z) − x‖∞ ≤ ε}. Therefore a natural way to approximate the prior is to use linear programming to solve the above constraints.
If ε is known, inversion is straightforward from the constraints (6). However, if we do not want to rely on a loose guess, we can start from a small estimate and gradually increase the tolerance until feasibility is achieved. The layer-wise inversion is formally presented in Algorithm 1.³
A key assumption that carries the error bound from the output over to the solution is the following:
Assumption 2 (Submatrix Extends ℓ∞ Norm). For the weight matrix W ∈ R^{n×k}, there exist an integer m > k and a constant c_∞ such that for any I ⊂ [n] := {1, 2, ···, n} with |I| ≥ m, W_{I,:} satisfies

    ‖W_{I,:} x‖∞ ≥ c_∞ ‖x‖∞

for any x, with high probability 1 − exp(−Ω(k)).
Recall that W_{I,:} denotes the sub-rows of W confined to I.
With this assumption, we are able to show the following theorem, which bounds the recovery error.
Theorem 3. Let x = G(z*) + e be a noisy observation produced by the generator G, a d-layer ReLU network mapping R^k → R^n. Let each weight matrix W_i ∈ R^{n_i × n_{i−1}} satisfy Assumption 2 with integer m_i > n_{i−1} and constant c_∞. Let the error e satisfy ‖e‖∞ ≤ ε, and suppose that for each z_i = φ_i(φ_{i−1}(··· φ_1(z*) ···)), at least m_i coordinates are larger than 2(2/c_∞)^{d−i} ε. Then recursively applying Algorithm 1 backwards produces a z that satisfies ‖z − z*‖∞ ≤ (2/c_∞)^d ε with high probability.

We argue that the required assumptions are satisfied by random weight matrices with i.i.d. Gaussian entries, and present the following corollary.
Corollary 1. Let x = G(z*) + e be a noisy observation produced by the generator G, a d-layer ReLU network mapping R^k → R^n. Let each weight matrix W_i ∈ R^{n_i × n_{i−1}} (n_i ≥ 5 n_{i−1}, ∀i) have entries sampled i.i.d. from the Gaussian distribution N(0, 1); then W_i satisfies Assumption 2 for a universal constant c_2 ∈ (0, 2]. Let the error e satisfy ‖e‖∞ = ε, where ε < (c_2^d / 2^{d+4}) ‖z*‖_2 / √k. Then recursively applying Algorithm 1 produces a z that satisfies ‖z − z*‖∞ ≤ 2^d ε / c_2^d with high probability.
Remark 1. For LeakyReLU, we can do at least as well as for ReLU, since we can simply view all negative coordinates as inactive coordinates of ReLU, and each such observation produces a loose bound.
On the other hand, if there is a significant number of negative entries, we can also change the linear programming constraints of Algorithm 1 as follows:

    arg min_{z,δ} δ,  s.t.
        x_j − δ ≤ w_j^⊤ z ≤ x_j + δ          if x_j > ε
        (x_j − δ)/c ≤ w_j^⊤ z ≤ x_j + δ      if −ε < x_j ≤ ε
        x_j − δ ≤ c w_j^⊤ z ≤ x_j + δ        if x_j ≤ −ε
        δ ≤ ε.    (7)

³ For practical use, we introduce a factor α to gradually increase the error estimate. In our theorem, it is assumed that we explicitly set the error estimate ε used to invert the i-th layer to (1/c_2)^{d−i} ‖e‖∞.

4.2 ℓ1 Norm Error Bound
In this section we develop a generative model inversion framework using the ℓ1 norm. We introduce Algorithm 2, which tolerates a different error level for each output coordinate and aims to minimize the ℓ1 norm of the error.

Algorithm 1: Linear programming to invert a single layer with an ℓ∞ error bound (ℓ∞ LP)
    Input: observation x ∈ R^n, weight matrix W = [w_1|w_2|···|w_n]^⊤, initial error bound guess ε > 0, scaling factor α > 1.
    repeat
        Find arg min_{z,δ} δ,  s.t.
            x_j − δ ≤ w_j^⊤ z ≤ x_j + δ    if x_j > ε
            w_j^⊤ z ≤ x_j + δ              if x_j ≤ ε
            δ ≤ ε
        ε ← εα
    until z is feasible
    Output: z

Algorithm 2: Linear programming to invert a single layer with an ℓ1 error bound (ℓ1 LP)
    Input: observation x ∈ R^n, weight matrix W = [w_1|w_2|···|w_n]^⊤, initial error bound guess ε > 0, scaling factor α > 1.
    for t = 1, 2, ··· do
        z^(t), e^(t) ← arg min_{z,e} Σ_i e_i,  s.t.
            x_j − e_j ≤ w_j^⊤ z ≤ x_j + e_j    if x_j > ε
            w_j^⊤ z ≤ x_j + e_j                if x_j ≤ ε
            e_j ≥ 0,  ∀ j ∈ [n]
        ε ← εα
        if ‖φ(z^(t)) − x‖_1 ≥ ‖φ(z^(t−1)) − x‖_1 then
            return z^(t−1)
        end if
    end for

Different from Algorithm 1, the deviation allowed on each observation is no longer uniform, and the new algorithm actually optimizes the ℓ1 error. Similar to the error bound analysis for the ℓ∞ norm, we obtain a tight approximation guarantee under a mild assumption related to the Restricted Isometry Property for the ℓ1 norm:
Assumption 3 (Submatrix Extends ℓ1 Norm). For a weight matrix W ∈ R^{n×k}, there exist an integer m > k and a constant c_1 such that for any I ⊂ [n] with |I| ≥ m, W_{I,:} satisfies

    ‖W_{I,:} x‖_1 ≥ c_1 ‖x‖_1

for any x, with high probability 1 − exp(−Ω(k)).
This assumption is a special case of the lower bound of the well-studied Restricted Isometry Property, for the ℓ1 norm and sparsity k, i.e., (k, ∞)-RIP-1. Similar to the ℓ∞ analysis, we obtain recovery guarantees for generators of arbitrary depth.
Theorem 4. Let x = G(z*) + e be a noisy observation produced by the generator G, a d-layer ReLU network mapping R^k → R^n. Let each weight matrix W_i ∈ R^{n_i × n_{i−1}} satisfy Assumption 3 with integer m_i > n_{i−1} and constant c_1. Let the error e satisfy ‖e‖_1 ≤ ε, and suppose that for each z_i = φ_i(φ_{i−1}(··· φ_1(z*) ···)), at least m_i coordinates are larger than 2^{d+1−i} ε / c_1^{d−i}. Then recursively applying Algorithm 2 produces a z that satisfies ‖z − z*‖_1 ≤ 2^d ε / c_1^d with high probability.

There is a significant volume of prior work on the RIP-1 condition. For instance, [3] showed that a (scaled) random sparse binary matrix with m = O(s log(k/s)/ε²) rows is (s, 1 + ε)-RIP-1 with high probability. In our case s = k and ε can be arbitrarily large; therefore, again, we only require the expansion factor to be constant. Similar results with different weight matrices are also shown in [19, 16, 1].
4.3 Relaxation of the ReLU Configuration Estimation
Our previous methods critically depend on correct estimation of the ReLU configurations. In both Algorithm 1 and Algorithm 2, we require the ground truth of all intermediate layer outputs to have many coordinates with large magnitude, so that they can be distinguished from noise. An incorrect estimate

Figure 1 (panels: Random Net, (a) Uniform Noise, (b) Gaussian Noise; MNIST Net, (c) Uniform Noise, (d) Gaussian Noise): Comparison of our proposed methods (ℓ∞ LP and ℓ1 LP) versus gradient descent. On the horizontal axis we plot the relative noise level, and on the vertical axis the relative recovery error. In experiments (a)(b) the network is randomly generated and fully connected, with 20 input neurons, 100 hidden neurons and 500 output neurons, corresponding to an expansion factor of 5. Each dot represents a recovery experiment (200 for each noise level), and each line connects the medians of the 200 runs at each noise level. As can be seen, our algorithms (blue and orange) perform very similarly to gradient descent, except at low noise levels, where they are slightly more robust. In experiments (c)(d) the network is a generative model for the MNIST dataset.
In this case, gradient descent fails to find the global minimum in almost all cases.

from an "off" configuration to an "on" configuration can cause primal infeasibility when solving the LP. Increasing ε ameliorates this problem but also increases the recovery error.
With this intuition, a natural workaround is to perform a relaxation that tolerates incorrectly estimated signs of the observations:

    max_z Σ_i max{0, x_i} w_i^⊤ z,  s.t.  w_i^⊤ z ≤ x_i + ε,  ∀ i ∈ [n].    (8)

Here the ReLU configuration is no longer explicitly reflected in the constraints. Instead, we only keep the upper bound for each inner product w_i^⊤ z, which is always valid whether the ReLU is on or off. The previous lower-bound requirement w_i^⊤ z ≥ x_i − ε is now relaxed and hidden in the objective. When the value of x_i is relatively large, the solver will push w_i^⊤ z to a larger value to achieve optimality; since this value is also upper bounded by x_i + ε, the optimal solution approaches x_i when possible. On the other hand, when x_i is close to 0, the objective's dependence on w_i^⊤ z is almost negligible.
Meanwhile, in the realizable case, when ∃ z* such that ReLU(Wz*) = x and ε = 0, it is easy to show that the solution set of (8) is exactly the preimage of x under ReLU(W·). This also trivially holds for Algorithms 1 and 2.

5 Experiments
In this section, we describe our experimental setup and report performance comparisons of our algorithms with the gradient descent method [15, 12].⁴ We conduct simulations in various regimes with Gaussian random weights, and use a simple GAN architecture on the MNIST dataset to show that our approach works in practice for the denoising problem. We refer to our Algorithm 1 as ℓ∞ LP and Algorithm 2 as ℓ1 LP.
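For concreteness, one round of the ℓ∞ LP on a single bias-free layer can be sketched as follows; the solver (scipy's `linprog`) and the default parameters are our assumptions, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def linf_lp_invert_layer(x, W, eps=1e-4, alpha=1.2, max_rounds=200):
    """Single-layer l-infinity LP: grow the tolerance eps geometrically until
        min_{z, delta} delta
        s.t.  x_j - delta <= w_j^T z <= x_j + delta   if x_j > eps
                             w_j^T z <= x_j + delta   if x_j <= eps
              0 <= delta <= eps
    becomes feasible; return that z and the attained delta."""
    n, k = W.shape
    for _ in range(max_rounds):
        on = x > eps
        # variables [z (k entries), delta]; all constraints as A_ub @ v <= b_ub
        A_ub = np.vstack([
            np.hstack([W, -np.ones((n, 1))]),              #  w_j^T z - delta <= x_j
            np.hstack([-W[on], -np.ones((on.sum(), 1))]),  # -w_j^T z - delta <= -x_j
        ])
        b_ub = np.concatenate([x, -x[on]])
        c = np.zeros(k + 1)
        c[-1] = 1.0                                        # minimize delta
        bounds = [(None, None)] * k + [(0.0, eps)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        if res.success:
            return res.x[:k], res.x[-1]
        eps *= alpha                                       # infeasible: loosen and retry
    return None, None
```

Inverting a d-layer network repeats this from the last layer backwards, feeding the recovered activations of layer i as the "observation" for layer i − 1, in the spirit of Theorem 3.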
We focus in the main text on experiments with these two proposals, and include some further empirical findings with the relaxed version described in (8) in the Appendix.
5.1 Synthetic Data
We validate our algorithms on synthetic data at various noise levels and verify Theorems 3 and 4 numerically. For our methods, we choose the scaling factor α = 1.2. For gradient descent, we use a learning rate of 1 and run up to 1,000 iterations, or until the gradient norm is no more than 10^{−9}.
Model architecture: The architecture chosen for the simulation aligns with our theoretical findings: a two-layer network with constant expansion factor 5, latent dimension k = 20, 100 hidden neurons, and observation dimension n = 500. The entries of the weight matrices are drawn independently from N(0, 1/n_i).

⁴ The code to reproduce our results can be found at https://github.com/cecilialeiqi/InvertGAN_LP.

Figure 2 (panels: (a) ReLU, (b) LeakyReLU): Comparison of our method and gradient descent on the empirical success rate of recovery (200 runs on random networks) versus the number of input neurons k, for the noiseless problem. The architecture chosen here is a 2-layer fully connected ReLU network, with 250 hidden nodes and 600 output neurons; the left figure uses the ReLU activation and the right one LeakyReLU. Our algorithms significantly outperform gradient descent for higher latent dimensions k.

Figure 3 (rows: Observation, Ground Truth, Ours (ℓ∞ LP), Gradient Descent [15]; digits 0, 3, 7, 8, 9): Recovery comparison using our algorithm ℓ∞ LP versus GD for an MNIST generative model.
Notice that ℓ∞ LP produces reconstructions that are clearly closer to the ground truth.

Noise generation: We use two kinds of random noise distributions, the uniform distribution U(−a, a) and the Gaussian distribution N(0, a), matching the ℓ∞ and ℓ1 error bound analyses respectively. We choose a ∈ {10^{−i} | i = 1, 2, ···, 6} for both noise types.
Recovery with Various Observation Noise: In Figure 1(a)(b) we plot the relative recovery error ‖z − z*‖_2 / ‖z*‖_2 at different noise levels. This supports our theoretical finding that, with the other parameters fixed, the recovery error grows almost linearly with the observation noise. Meanwhile, we observe that in both cases our methods perform similarly to gradient descent on average, while gradient descent is less robust and produces more outlier points. As expected, our ℓ∞ LP performs slightly better than gradient descent when the input error is uniformly bounded; see Figure 1(a). However, with a large variance in the observation error, as seen in Figure 1(b), ℓ∞ LP is not as robust as ℓ1 LP or gradient descent.
Additional experiments can be found in the Appendix, including the performance of the LP relaxation, which mimics ℓ1 LP but is more efficient and robust.
Recovery with Various Input Neurons: According to our theoretical results, one advantage of our proposals is a much smaller expansion requirement than gradient descent [12] (constant vs. log k factors). Therefore we conduct experiments to verify this point.
We follow the exact setting of [15]: we fix the hidden layer and output sizes at 250 and 600, and vary the input size k to measure how the empirical success rate of recovery is influenced by the input size.
In Figure 2 we report the empirical success rate of recovery for our proposals and for gradient descent. With the exact setting of [15], a run is considered successful when ‖z* − z‖_2 / ‖z*‖_2 ≤ 10^{−3}. We observe that when the input width k is small, both gradient descent and our methods achieve a 100% success rate. However, as the number of input neurons grows, gradient descent degrades to complete failure for k ≥ 60, while our algorithms maintain a 100% success rate until k = 109. The performance of gradient descent is slightly worse than reported in [15], since they used 150 measurements for each run while we take the measurement matrix to be the identity.
5.2 Experiments on a Generative Model for the MNIST Dataset
To verify the practical contribution of our method, we conduct experiments on a real generative network trained on the MNIST dataset. We use a simple fully-connected architecture with latent dimension k = 20,

Figure 4 (rows: Observation, Ground Truth, Ours (ℓ∞ LP), Gradient Descent [15]; digits 7, 1, 5, 6, 9): Recovery comparison with a non-identity sensing matrix using our algorithm ℓ∞ LP versus GD, for an MNIST generative model. The black region denotes unobserved pixels. Our algorithm always finds reasonable results, while GD sometimes gets stuck at a local minimum (see the cases with digits 1 and 5).

hidden neurons of size n_1 = 60, and output size n = 784. The network has a single channel. We train the network using the original Generative Adversarial Network [8].
We set n_1 small since the output usually has only around 70 to 100 non-zero pixels.
As in the simulations, we compare our methods with gradient descent [12, 15]. Under this setting, we choose the learning rate to be 10^{−3} and run up to 10,000 iterations (or until the gradient norm is below 10^{−9}).
We first randomly select some examples to visually compare performance in Figure 3. In these examples, observations are perturbed with Gaussian random noise of variance 0.3, and we use ℓ∞ LP as our algorithm to invert the network. From the figures, we see that our method can almost perfectly denoise and reconstruct the input image, while gradient descent impairs the completeness of the original images to some extent.
We also compare the distribution of the relative recovery error with respect to different input noise levels, as plotted in Figure 1(c)(d). We observe that for this real network, our proposals still successfully recover the ground truth with good accuracy most of the time, while gradient descent usually gets stuck in local minima. This explains why it produces defective image reconstructions, as shown in Figure 3.
Finally, we present some sensing results where we mask part of the observations, using PGD with our inverting procedure as the projection step. As shown in Figure 4, our algorithm always yields reliable recovery, while gradient descent sometimes fails to output a reasonable result. More experiments are presented in the Appendix.

6 Conclusion and Future Work
We introduced a novel algorithm to invert a generative model through linear programming, one layer at a time, given (noisy) observations of its output. We prove that for expansive random Gaussian networks, we can exactly recover the true latent code in the noiseless setting. For noisy observations we also establish provable performance bounds.
Our work differs from the closely related [15] in that we require less expansion, we provide error bounds in the ℓ1 and ℓ∞ norms (as opposed to ℓ2), and we focus only on inversion, i.e., without a forward operator. Our method can be used as a projection step to solve general linear inverse problems with projected gradient descent [24]. Empirically we demonstrate good performance, sometimes outperforming gradient descent when the latent vectors are high dimensional.
One message we want to convey in this paper is that it is always easier to invert to an intermediate layer than directly to the input layer. As an extreme case, we invert one layer at a time, assuming that each inversion is uniquely determined. To the best of our knowledge, all existing theoretical guarantees for inversion of deep generative models require expansion at each layer; however, models like DCGAN [21] are expansive at all layers except the output layer. In future work, we will blend our algorithms with gradient descent and propose more practical inversion algorithms.

Acknowledgements. This research has been supported by NSF Grants 1618689, IIS-1546452, CCF-1564000, DMS 1723052, CCF 1763702, AF 1901292 and research gifts by Google, Western Digital and NVIDIA.

References
[1] Zeyuan Allen-Zhu, Rati Gelashvili, and Ilya Razenshteyn. Restricted isometry property for general p-norms. IEEE Transactions on Information Theory, 62(10):5839-5854, 2016.

[2] Benjamin Aubin, Bruno Loureiro, Antoine Maillard, Florent Krzakala, and Lenka Zdeborová. The spiked matrix model with generative priors. arXiv preprint arXiv:1905.12385, 2019.

[3] Radu Berinde, Anna C Gilbert, Piotr Indyk, Howard Karloff, and Martin J Strauss. Combining geometry and combinatorics: A unified approach to sparse signal recovery. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages 798-805.
IEEE, 2008.

[4] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.

[5] Manik Dhar, Aditya Grover, and Stefano Ermon. Modeling sparse deviations for compressed sensing using generative models. arXiv preprint arXiv:1807.01442, 2018.

[6] Alyson K Fletcher and Sundeep Rangan. Inference in deep networks in high dimensions. arXiv preprint arXiv:1706.06549, 2017.

[7] Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[9] Aditya Grover and Stefano Ermon. Uncertainty autoencoders: Learning compressed representations via variational information maximization. arXiv preprint arXiv:1812.10539, 2018.

[10] Sidharth Gupta, Konik Kothari, Maarten V de Hoop, and Ivan Dokmanić. Deep mesh projectors for inverse problems. arXiv preprint arXiv:1805.11718, 2018.

[11] Paul Hand, Oscar Leong, and Vlad Voroninski. Phase retrieval under a generative prior. In Advances in Neural Information Processing Systems, pages 9154-9164, 2018.

[12] Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.

[13] Reinhard Heckel and Paul Hand. Deep decoder: Concise image representations from untrained non-convolutional networks. arXiv preprint arXiv:1810.03982, 2018.

[14] Reinhard Heckel, Wen Huang, Paul Hand, and Vladislav Voroninski. Deep denoising: Rate-optimal recovery of structured signals with a deep prior. arXiv preprint arXiv:1805.08855, 2018.

[15] Wen Huang, Paul Hand, Reinhard Heckel, and Vladislav Voroninski.
A provably convergent scheme for compressive sensing under random generative priors. arXiv preprint arXiv:1812.04176, 2018.

[16] Piotr Indyk and Ilya Razenshteyn. On model-based RIP-1 matrices. In International Colloquium on Automata, Languages, and Programming, pages 564-575. Springer, 2013.

[17] Morteza Mardani, Qingyun Sun, Shreyas Vasawanala, Vardan Papyan, Hatef Monajemi, John Pauly, and David Donoho. Neural proximal gradient descent for compressive imaging. arXiv preprint arXiv:1806.03963, 2018.

[18] Dustin G Mixon and Soledad Villar. SUNLayer: Stable denoising with generative networks. arXiv preprint arXiv:1803.09319, 2018.

[19] Mergen Nachin. Lower bounds on the column sparsity of sparse recovery matrices. MIT undergraduate thesis, 2010.

[20] Parthe Pandit, Mojtaba Sahraee, Sundeep Rangan, and Alyson K Fletcher. Asymptotics of MAP inference in deep networks. arXiv preprint arXiv:1903.01293, 2019.

[21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[22] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (RED). SIAM Journal on Imaging Sciences, 10(4):1804-1844, 2017.

[23] Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707-1739, 2009.

[24] Viraj Shah and Chinmay Hegde. Solving linear inverse problems using GAN priors: An algorithm with provable guarantees. arXiv preprint arXiv:1802.08406, 2018.

[25] Subarna Tripathi, Zachary C Lipton, and Truong Q Nguyen. Correction by projection: Denoising images with generative adversarial networks.
arXiv preprint arXiv:1803.04477, 2018.