{"title": "Invertibility of Convolutional Generative Networks from Partial Measurements", "book": "Advances in Neural Information Processing Systems", "page_first": 9628, "page_last": 9637, "abstract": "In this work, we present new theoretical results on convolutional generative neural networks, in particular their invertibility (i.e., the recovery of input latent code given the network output). The study of network inversion problem is motivated by image inpainting and the mode collapse problem in training GAN. Network inversion is highly non-convex, and thus is typically computationally intractable and without optimality guarantees. However, we rigorously prove that, under some mild technical assumptions, the input of a two-layer convolutional generative network can be deduced from the network output efficiently using simple gradient descent. This new theoretical finding implies that the mapping from the low- dimensional latent space to the high-dimensional image space is bijective (i.e., one-to-one). In addition, the same conclusion holds even when the network output is only partially observed (i.e., with missing pixels). Our theorems hold for 2-layer convolutional generative network with ReLU as the activation function, but we demonstrate empirically that the same conclusion extends to multi-layer networks and networks with other activation functions, including the leaky ReLU, sigmoid and tanh.", "full_text": "Invertibility of Convolutional Generative\nNetworks from Partial Measurements\n\nFangchang Ma*\n\nMIT\n\nfcma@mit.edu\n\nUlas Ayaz\u02da\n\nMIT\n\nuayaz@mit.edu\nuayaz@lyft.com\n\nSertac Karaman\n\nMIT\n\nsertac@mit.edu\n\nAbstract\n\nThe problem of inverting generative neural networks (i.e., to recover the input latent\ncode given partial network output), motivated by image inpainting, has recently\nbeen studied by a prior work that focused on fully-connected networks. 
In this\nwork, we present new theoretical results on convolutional networks, which are more\nwidely used in practice. The network inversion problem is highly non-convex, and\nhence is typically computationally intractable and without optimality guarantees.\nHowever, we rigorously prove that, for a 2-layer convolutional generative network\nwith ReLU and Gaussian-distributed random weights, the input latent code can be\ndeduced from the network output ef\ufb01ciently using simple gradient descent. This\nnew theoretical \ufb01nding implies that the mapping from the low-dimensional latent\nspace to the high-dimensional image space is one-to-one, under our assumptions.\nIn addition, the same conclusion holds even when the network output is only\npartially observed (i.e., with missing pixels). We further demonstrate, empirically,\nthat the same conclusion extends to networks with multiple layers, other activation\nfunctions (leaky ReLU, sigmoid and tanh), and weights trained on real datasets.\n\n1\n\nIntroduction\n\nIn recent years, generative models have made signi\ufb01cant progress in learning representations for\ncomplex and multi-modal data distributions, such as those of natural images [10, 18]. However,\ndespite the empirical success, there has been relatively little theoretical understanding into the\nmapping itself from the input latent space to the high-dimensional space. In this work, we address the\nfollowing question: given a convolutional generative network2, is it possible to \u201cdecode\u201d an output\nimage and recover the corresponding input latent code? In other words, we are interested in the\ninvertibility of convolutional generative models.\nThe impact of the network inversion problem is two-fold. Firstly, the inversion itself can be applied\nin image in-painting [21, 17], image reconstruction from sparse measurements [14, 13], and image\nmanipulation [22] (e.g., vector arithmetic of face images [12]). 
Secondly, the study of network\ninversion provides insight into the mapping from the low-dimensional latent space to the high-\ndimensional image space (e.g., is the mapping one-to-one or many-to-one?). A deeper understanding\nof the mapping can potentially help solve the well known mode collapse3 problem [20] during the\ntraining in the generative adversarial network (GAN) [7, 16].\n\n\u02daBoth authors contributed equally to this work. Ulas Ayaz is presently af\ufb01liated with Lyft, Inc.\n2Deep generative models typically use transposed convolution (a.k.a. \u201cdeconvolution\u201d). With a slight abuse\n\nof notation we refer to transposed convolutional generative models as convolutional models.\n\n3Mode collapse refers to the problem that the Generator characterizes only a few images to fool the\ndiscriminator in GAN. In other words, multiple latent codes are mapped to the same output in the image space.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Recovery of the input latent code z from under-sampled measurements y \u201c AGpzq where\nA is a sub-sampling matrix and G is an expanding generative neural network. We prove that z can be\nrecovered with guarantees using simple gradient-descent methods under mild technical assumptions.\n\nThe challenge of the inversion of a deep neural network lies in the fact that the inversion problem is\nhighly non-convex, and thus is typically computationally intractable and without optimality guaran-\ntees. However, in this work, we show that network inversion can be solved ef\ufb01ciently and optimally,\ndespite being highly non-convex. Speci\ufb01cally, we prove that with simple \ufb01rst-order algorithms like\nstochastic gradient descent, we can recover the latent code with guarantees. 
The sample code is\navailable at https://github.com/fangchangma/invert-generative-networks.\n\n1.1 Related Work\n\nThe network inversion problem has attracted some attention very recently. For instance, Bora et al. [2]\nempirically \ufb01nd that minimizing the non-convex Problem (3), which is de\ufb01ned formally in Section 2,\nusing standard gradient-based optimizer yields good reconstruction results from small number of\nGaussian random measurements. They also provide guarantees on the global minimum of a generative\nnetwork with certain structure. However, their work does not analyze how to \ufb01nd the global minimum.\nHand and Voroninski [8] further establish that a fully connected generative network with weights\nfollowing Gaussian distribution can be inverted given only compressive linear observations of its last\nlayer. In particular, they show that under mild technical conditions Problem (3) has a favorable global\ngeometry, in the sense that there are no stationary points outside of neighborhoods around the desired\nsolution and its negative multiple with high probability. However, most practical generative networks\nare deconvolutional rather than fully connected, due to memory and speed constraints. Besides, their\nresults are proved for Gaussian random measurements, which are rarely encountered in practical\napplications. In this work, we build on top of [8] and extend their results to 2-layer deconvolutional\nneural networks, as well as uniform random sub-sampling.\nWe also note the work [6], which studies a 1-layer network with a special activation function\n(Concatenated ReLU, which is essentially linear) and a strong assumption on the latent code (k-\nsparsity). In comparison, our results are much stronger than [6]. Speci\ufb01cally, our results are for\n2-layer networks (with empirical evidences for deeper networks), and they apply to the most common\nReLU activation function. 
Our result also makes no assumption regarding the sparsity of the latent codes.\nAnother line of research, which focuses on gradient-based algorithms, analyzes the behavior of (stochastic) gradient descent for Gaussian-distributed input. Soltanolkotabi [19] showed that projected gradient descent is able to find the true weight vector for a 1-layer, 1-neuron model. More recently, Du et al. [5] improved this result for a simple convolutional neural network with two unknown layers. Their assumptions on random input, and their problem of weight learning, are different from the problem we study in this paper.\nOur problem is also connected to compressive sensing [4, 3], which exploits the sparsity of natural signals to design acquisition schemes in which the number of measurements scales linearly with the sparsity level. The signal is typically assumed to be sparse in a given dictionary, and the objective function is convex. In comparison, our work does not assume sparsity, and we provide a direct analysis of gradient descent for the highly non-convex problem.\n\n1.2 Contribution\n\nThe contribution of this work is three-fold:\n\n\u2022 We prove that a convolutional generative neural network is invertible, with high probability, under the following assumptions: (1) the network consists of two layers of transposed convolutions followed by ReLU activation functions; (2) the network is (sufficiently) expansive; (3) the filter weights follow a Gaussian distribution. When these conditions are satisfied, the input latent code can be recovered from the partial output of a generative neural network by minimizing an L2 empirical loss function using gradient descent.\n\u2022 We prove that the same inversion can be achieved with high probability even when only a subset of pixels is observed. This is essentially the image inpainting problem.\n\u2022 We validate our theoretical results using both random weights and weights trained on real data. 
We further demonstrate empirically that our theoretical results generalize to (1) multi-layer networks; (2) networks with other nonlinear activation functions, including Leaky ReLU, Sigmoid and Tanh.\n\nTwo key ideas of our proof are (a) concentration bounds for convolutional weight matrices combined with the ReLU operation, and (b) the angle distortion between two arbitrary input vectors under the transposed convolution and ReLU. In general, our proof follows a basic structure similar to [8], where the authors show the invertibility of fully connected networks with Gaussian weights. However, in fully connected networks the weight matrix of each layer is a dense Gaussian matrix, whereas in convolutional networks the weight matrices are highly sparse with a block structure due to striding filters, as in Figure 2(a). Therefore, the proof of [8] does not apply to convolutional networks, and extending the concentration bounds to our case is not trivial.\n\nTo address this problem, we propose a new permutation technique which shuffles the rows and columns of the weight matrices to obtain a block matrix, as depicted in Figure 2(b). After permutation, each block is a dense Gaussian matrix, to which we can apply existing matrix concentration results. However, the permutation operation is quite arbitrary, depending on the structure of the convolutional network. This requires careful handling, since step (b) requires control of the angles.\n\nIn addition, Hand and Voroninski [8] assume a Gaussian sub-sampling matrix at the output of the network, rather than the partial sub-sampling (a sub-matrix of the identity matrix) that we study in this problem. We observe that the sub-sampling operation can be swapped with the last ReLU in the network, since both are entrywise operations. 
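This commutation can be checked numerically. The following minimal sketch (NumPy, with arbitrary hypothetical dimensions) verifies that a row-subset sampling matrix A satisfies A σ(x) = σ(A x):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda v: np.maximum(v, 0.0)

n2, m = 12, 5                          # hypothetical output size / sample count
x = rng.normal(size=n2)                # pre-activation output of the last layer
A = np.eye(n2)[rng.choice(n2, m, replace=False)]   # row subset of the identity

# Sub-sampling and the entrywise ReLU commute: A @ relu(x) == relu(A @ x).
assert np.allclose(A @ relu(x), relu(A @ x))
print("A @ relu(x) == relu(A @ x)")
```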
We handle the sub-sampling by making the last layer more expansive, and we prove that, from a theoretical standpoint, the sub-sampled problem behaves the same as the one with no downsampling.\n\n2 Problem Statement\n\nIn this section, we introduce the notation and define the network inversion problem. Let z* \u2208 \u211d^{n0} denote the latent code of interest, and let G(\u00b7): \u211d^{n0} \u2192 \u211d^{nd} (n0 \u226a nd) be a d-layer generative network that maps from the latent space to the image space. The ground truth output image x* \u2208 \u211d^{nd} is then produced by\n\nx* = G(z*).   (1)\n\nIn this paper we consider G(\u00b7) to be a deep neural network (note that the network inversion problem arises at the inference stage, and is thus independent of the training process). In particular, we assume G(\u00b7) to be a two-layer transposed convolutional network, modeled by\n\nG(z) = \u03c3(W2 \u03c3(W1 z)),   (2)\n\nwhere \u03c3(z) = max(z, 0) denotes the rectified linear unit (ReLU), applied entrywise. W1 \u2208 \u211d^{n1\u00d7n0} and W2 \u2208 \u211d^{n2\u00d7n1} are the weight matrices of the convolutional neural network in the first and second layers, respectively. Note that since G(\u00b7) is a convolutional network, W1 and W2 are highly sparse with a particular block structure, as illustrated in Figure 2(a).\n\nLet us make the inversion problem a bit more general by assuming that we only have partial observations of the output image pixels. Specifically, let A \u2208 \u211d^{m\u00d7n2} be a sub-sampling matrix (a subset of the rows of an identity matrix); the observed pixels are then y* = Ax* \u2208 \u211d^m. Consequently, the inversion problem given partial measurements can be described as follows:\n\nLet: z* \u2208 \u211d^{n0}, W1 \u2208 \u211d^{n1\u00d7n0}, W2 \u2208 \u211d^{n2\u00d7n1}, A \u2208 \u211d^{m\u00d7n2}\nGiven: A, W1, W2 and observations y* = AG(z*)\nFind: z* and x* = G(z*)\n\nSince x* is determined completely by the latent representation z*, we only need to find z*. 
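The measurement model just described can be sketched numerically; this is an illustrative stand-in only, with dense Gaussian matrices in place of the structured transposed-convolution weights, and with hypothetical sizes n0, n1, n2, m:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda v: np.maximum(v, 0.0)

# Hypothetical sizes with n0 << n1 << n2; dense Gaussian matrices stand in
# for the structured transposed-convolution weights W1, W2 of equation (2).
n0, n1, n2, m = 4, 16, 64, 32
W1 = rng.normal(scale=1.0 / np.sqrt(n1), size=(n1, n0))
W2 = rng.normal(scale=1.0 / np.sqrt(n2), size=(n2, n1))

def G(z):
    """Two-layer generative network G(z) = sigma(W2 sigma(W1 z))."""
    return relu(W2 @ relu(W1 @ z))

z_star = rng.normal(size=n0)                       # latent code of interest
x_star = G(z_star)                                 # ground-truth image, eq. (1)
A = np.eye(n2)[rng.choice(n2, m, replace=False)]   # sub-sampling matrix
y_star = A @ x_star                                # observed pixels
print(y_star.shape)                                # (32,)
```

The inversion question is then whether z_star can be deduced from y_star alone.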
We propose to solve the following optimization problem for an estimate \u0302z:\n\n\u0302z = argmin_z J(z), where J(z) = (1/2)\u2016y* \u2212 AG(z)\u2016\u00b2.   (3)\n\nThis minimization problem is highly non-convex because of G. Therefore, in general, a gradient descent approach is not guaranteed to find the global minimum z*, where J(z*) = 0.\n\n2.1 Notation and Assumptions\n\nFigure 2: Illustration of a single transposed convolution operation. f_{i,j} stands for the ith filter kernel for the jth input channel. z and x denote the input and output signals, respectively. (a) The standard transposed convolution represented as a linear multiplication. (b) With proper row and column permutations, the permuted weight matrix has a repeating block structure.\n\nWe vectorize the input signal into a 1D signal. The feature at the ith layer consists of Ci channels, each of size Di; therefore, ni = Ci\u00b7Di. At any convolutional layer, let f_{i,j} denote the kernel filter (each of size \u2113) for the ith input channel and the jth output channel. For simplicity, we assume the stride to be equal to the kernel size \u2113. All filters can be concatenated to form a large block matrix Wi; an example of such a block matrix W1 for the first layer is shown in Figure 2(a). Under our assumptions, the input and output sizes at each deconvolution operation are related by D_{i+1} = Di\u00b7\u2113.\n\nLet D_v J(x) denote the one-sided directional derivative of the objective function J(\u00b7) along the direction v, i.e., D_v J(x) = lim_{t\u21920+} (J(x + tv) \u2212 J(x))/t. Let B(x, r) be the Euclidean ball of radius r centered at x. 
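For intuition, the one-sided directional derivative defined above can be approximated by a finite difference. The toy example below uses a scalar ReLU standing in for J (an illustrative assumption, not the network loss) to show why the limit must be one-sided:

```python
import numpy as np

# One-sided directional derivative D_v J(x) = lim_{t -> 0+} (J(x+t*v) - J(x))/t,
# approximated with a small finite t. J here is a scalar ReLU, chosen only to
# illustrate the kink; it is not the network objective.
J = lambda x: float(np.maximum(x, 0.0))

def directional_derivative(J, x, v, t=1e-7):
    """One-sided finite-difference approximation of D_v J(x)."""
    return (J(x + t * v) - J(x)) / t

# At the kink x = 0 the two one-sided derivatives disagree:
print(directional_derivative(J, 0.0, 1.0))   # 1.0 (J grows to the right)
print(directional_derivative(J, 0.0, -1.0))  # 0.0 (J is flat to the left)
```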
We omit universal constants in the inequalities and write \u2273_\u03b5 when the constant depends on a variable \u03b5.\n\n3 Main Results\n\nIn this section, we present our main theoretical results regarding the invertibility of a 2-layer convolutional generative network with ReLUs. Our first main theoretical contribution is as follows: although the problem in (3) is non-convex, under appropriate conditions there is a strict descent direction everywhere, except in a neighborhood of z* and a neighborhood of a negative multiple of z*.\n\nTheorem 1 (Invertibility of convolutional generative networks). Fix \u03b5 > 0. Let W1 \u2208 \u211d^{C1D1\u00d7C0D0} and W2 \u2208 \u211d^{C2D2\u00d7C1D1} be deconvolutional weight matrices with filters in \u211d^\u2113, with i.i.d. entries from N(0, 1/(Ci\u00b7\u2113)) for layers i = 1, 2, respectively. Let the sampling matrix A = I be the identity matrix (meaning there is no sub-sampling). If C1\u2113 \u2273_\u03b5 C0 log C0 and C2\u2113 \u2273_\u03b5 C1 log C1, then with probability at least 1 \u2212 \u03ba(D1C1 e^{\u2212\u03b3C0} + D2C2 e^{\u2212\u03b3C1}) the following holds. For all nonzero z and z*, there exists v_{z,z*} \u2208 \u211d^{n0} such that\n\nD_{v_{z,z*}} J(z) < 0,  \u2200z \u2209 B(z*, \u03b5\u2016z*\u2016\u2082) \u222a B(\u2212\u03c1z*, \u03b5\u2016z*\u2016\u2082) \u222a {0},   (4)\nD_z J(0) < 0,  \u2200z \u2260 0,   (5)\n\nwhere \u03c1 is a positive constant, and both \u03b3 > 0 and \u03ba > 0 depend only on \u03b5.\n\nTheorem 1 establishes that, under these conditions, the landscape of the cost function is not adversarial. Despite the heavily loaded notation, Theorem 1 simply requires that the weight matrices with Gaussian filters be sufficiently expansive (i.e., the output dimension of each layer should increase by at least a logarithmic factor). 
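As an illustration of this landscape result (not the paper's own experiment), the following sketch runs plain gradient descent on J(z) = (1/2)‖y − G(z)‖² for a sufficiently expansive random two-layer ReLU network; the dense Gaussian matrices, sizes, step size and iteration count are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda v: np.maximum(v, 0.0)

# Expansive random two-layer ReLU network with full observation (A = I).
# Dense Gaussian matrices and the sizes below are hypothetical stand-ins for
# the structured deconvolution weights of Theorem 1.
n0, n1, n2 = 4, 64, 256
W1 = rng.normal(scale=1.0 / np.sqrt(n1), size=(n1, n0))
W2 = rng.normal(scale=1.0 / np.sqrt(n2), size=(n2, n1))

def G(z):
    return relu(W2 @ relu(W1 @ z))

z_star = rng.normal(size=n0)      # ground-truth latent code
y = G(z_star)                     # observed network output

# Plain gradient descent on J(z) = 0.5 * ||y - G(z)||^2, back-propagating
# a (sub)gradient through the two ReLUs.
z0 = rng.normal(size=n0)          # random initialization
z = z0.copy()
lr = 0.05
for _ in range(20000):
    h1 = W1 @ z
    a1 = relu(h1)
    h2 = W2 @ a1
    r = relu(h2) - y              # residual
    grad = W1.T @ ((W2.T @ (r * (h2 > 0))) * (h1 > 0))
    z -= lr * grad

# Distance to the true code; per Theorem 1 (and the experiments in Section 4)
# this is expected to shrink toward zero, though this sketch carries no guarantee.
print(float(np.linalg.norm(z - z_star)))
```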
Theorem 1 does not provide information regarding the neighborhood centered at \u2212\u03c1z*, which implies the possible existence of a local minimum or a saddle point there. However, empirically we did not observe convergence to any point other than the ground truth; in other words, gradient descent seems to always find the global minimum, see Figure 4.\n\nOne assumption we make is that the stride s equals the filter size \u2113. Although theoretically convenient, this assumption is not common in practical transposed convolutional networks. We believe a further analysis can remove this assumption, which we leave as future work. In practice, activation functions other than ReLU can be used as well, such as the sigmoid function, Tanh and Leaky ReLU. It is an interesting avenue of research to see whether a similar analysis can be carried out for those activations; in particular, for Leaky ReLU we briefly explain how the proof would diverge from ours in Section Sup.2. We include landscapes of the cost function for different activations in Figure 4.\n\nThe Gaussian weight assumption might seem unrealistic at first. However, there is research [1] indicating that the weights of some trained networks follow a normal distribution, and we make a similar observation on the networks we trained; see Section 4. We also note that Theorem 1 does not require independence of the network weights across layers.\n\nProof Outline: Due to space limitations, the complete proof of Theorem 1 is given in the supplementary material [15]. Here we give a brief outline and highlight the main steps. The theorem is proven by establishing two main conditions on the weight matrices.\nThe first condition concerns the spatial arrangement of the network weights within each layer. Lemma Sup.2 [15] provides a concentration bound on the distribution of the effective weight matrices (after merging the ReLUs into the matrices). 
It shows that the set of neuron weights within each layer is distributed approximately like a Gaussian. A key idea in proving Lemma Sup.2 is our new permutation technique. Specifically, we rearrange both the rows and the columns of the sparse weight matrices, as in Figure 2(a), into a block diagonal matrix, as in Figure 2(b). Each block in the permuted matrix is the same Gaussian matrix with independent entries. The permutation turns each block in Figure 2(b) into a dense Gaussian matrix, and therefore makes it possible to utilize existing concentration bounds on Gaussian matrices.\n\nThe second condition is the approximate angle contraction property of an effective weight matrix Wi (after merging the ReLUs into the matrices). Lemma Sup.4 [15] shows that the angle between two arbitrary input vectors x and y does not vanish under a transposed convolution layer and the ReLU. The permutation poses a significant challenge in the proof of Lemma Sup.4, since permuting the input vectors distorts the angles. This difficulty is handled carefully in the proof of Lemma Sup.4, which deviates from the proof machinery in [8] and hence is a major technical contribution.\n\nCorollary 2 (One-to-one mapping). Under the assumptions of Theorem 1, the mapping G(\u00b7): \u211d^{n0} \u2192 \u211d^{n2} (n0 \u226a n2) is injective (i.e., one-to-one) with high probability.\n\nCorollary 2 is a direct implication of Theorem 1. It states that the mapping from the latent code space to the high-dimensional image space is one-to-one with high probability when the assumptions hold. This is interesting from a practical point of view, because mode collapse is a well-known problem in the training of GANs [20], and Corollary 2 provides a sufficient condition for avoiding mode collapse. It remains to be further explored how this insight can be used in practice.\n\nConjecture 3. 
Under the assumptions of Theorem 1, let the network weights follow any zero-mean subgaussian distribution, i.e., P(|x| > t) \u2264 c\u00b7e^{\u2212\u03b3t\u00b2} for all t > 0, instead of a Gaussian. Then with high probability the same conclusion holds.\n\nA subgaussian distribution (a.k.a. a light-tailed distribution) is one whose tail decays at least as fast as that of a Gaussian distribution (i.e., at an exponential-squared rate). This includes, for example, any bounded distribution. Empirically, we observe that Theorem 1 holds for a number of zero-mean subgaussian distributions, including uniform random weights and {+1, \u22121} binary random weights.\n\nNow let us move on to the case where the sub-sampling matrix A is not the identity matrix. Instead, consider a fixed sampling rate r \u2208 (0, 1].\n\nTheorem 4 (Invertibility under partial measurements). Under the assumptions of Theorem 1, let A \u2208 \u211d^{m\u00d7C2D2} be an arbitrary sub-sampling matrix with m/(C2D2) \u2265 r. Then with high probability the same result as in Theorem 1 holds.\n\nNote that the sub-sampling rate r appears in the dimension assumption on the weight matrix of the second layer.\n\nProof. Since the ReLU operation is entrywise, we have the identity\n\ny = AG(z) = A\u03c3(W2\u03c3(W1z)) = \u03c3(AW2\u03c3(W1z)).\n\nIt therefore suffices to show that Theorem 1 still holds with AW2 as the last weight matrix. Note that AW2 selects a row subset of the matrix W2 in Figure 2(a). Consequently, after proper permutation, AW2 is again a block diagonal matrix with each block being a Gaussian matrix with independent entries; only this time the blocks are not identical, but have different sizes. As a result, Theorem 1 still holds for AW2, since its proof does not require identical blocks. However, there are certain dimension constraints, which can be met by expanding the last layer by a factor determined by the sampling rate r. 
This modification is reflected in the additional dimension assumption on the weight matrix W2.\n\nThe minimal sampling rate r is a constant that depends on both the network architecture (e.g., how expansive the network is) and the sampling matrix A. We make two empirical observations. First, spatially dispersed sampling patterns (e.g., uniform random samples) require a lower r, whilst more aggressive sampling patterns (e.g., the top half, the left half, or samples around the image boundary) demand more measurements for perfect recovery. Second, regardless of the sampling pattern A, the probability of perfect recovery exhibits a phase transition phenomenon w.r.t. the sampling rate r. This observation supports Theorem 4 (i.e., the network is invertible given sufficiently many measurements). A more rigorous mathematical characterization of r remains an open question.\n\n4 Experimental Validation\n\nIn this section, we verify the Gaussian weight assumption on trained generative networks, our main result Theorem 4 on simulated 2-layer networks, and the generalization of Theorem 4 to more complex multi-layer networks trained on real datasets.\n\n4.1 Gaussian Weights in Trained Networks\n\nWe extract the convolutional filter weights, trained on real data to generate the images in Figure 5, from a 4-layer convolutional generative model. The histogram of the weights in each layer is depicted in Figure 3. It can be observed that the trained weights closely resemble a zero-mean Gaussian distribution. We observe similar weight distributions in other trained convolutional networks, such as ResNet [9]; Arora et al. [1] report similar results.\n\n4.2 On 2-layer Networks with Random Weights\n\nAs a sanity check on Theorem 4, we construct a generative neural network with 2 transposed convolution layers, each followed by a ReLU. The first layer has 16 channels and the second layer has a single channel. 
Both layers have a kernel size of 5 and a stride of 3.\n\nFigure 3: Distribution of the kernel weights from every layer in a trained convolutional generative network. The trained weights roughly follow a zero-mean Gaussian distribution.\n\nIn order to be able to visualize the cost function landscape, we set the input latent space to be 2-dimensional. The weights of the transposed convolution kernels are drawn i.i.d. from a Gaussian distribution with zero mean and unit standard deviation. Only 50% of the network output is observed. We compute the cost function J(z) for every input latent code z on a grid centered at the ground truth. The landscape of the cost function J(z) is depicted in Figure 4(a). Although Theorem 4 implies the possibility of a stationary point at the negative multiple of the ground truth, experimentally we do not observe convergence to any point other than the global minimum.\n\nDespite the fact that Theorem 1 and Theorem 4 are proved only for the case of a 2-layer network with ReLU, the same conclusion empirically extends to networks with more layers and different kernel sizes and strides. In addition, the inversion of generative models generalizes to other standard activation functions, including Sigmoid and Tanh. Specifically, Sigmoid and Tanh have quasi-convex landscapes, as shown in Figure 4(b) and (c), which are even more favorable than that of ReLU. Leaky ReLU has the same landscape as a regular ReLU.\n\nFigure 4: The landscape of the cost function J(z) for deconvolutional networks with (a) ReLU, (b) Sigmoid, and (c) Tanh as activation functions, respectively. There exists a unique global minimum. As a counterexample, we draw kernel weights uniformly at random from [0, 1] (which violates the zero-mean Gaussian assumption). Consequently, there is a flat global minimum region in the latent space, as shown in Figure 4(d). 
In this region, any two latent vectors are mapped to the exact same output, indicating that mode collapse indeed occurs.\n\n4.3 On Multi-layer Networks Trained with Real Data\n\nIn this section, we demonstrate empirically that our finding holds for multi-layer networks trained on real data. The first network is trained with a GAN to generate handwritten digits, and the second to generate celebrity faces. In both experiments, the correct latent codes can be recovered perfectly from partial (but sufficiently many) observations.\n\nMNIST: For the first network, on handwritten digits, we rescale the raw grayscale images from the MNIST dataset [11] to a size of 32 \u00d7 32. We use the conditional deep convolutional generative adversarial network (DCGAN) framework [16, 18] to train both a generative model and a discriminator. Specifically, the generative network has 4 transposed convolutional layers: the first 3 are each followed by batch normalization and a Leaky ReLU, and the last layer is followed by a Tanh. The discriminator has 4 convolutional layers, with the first 3 followed by batch normalization and Leaky ReLU, and the last followed by a Sigmoid function. We use Adam with a learning rate of 0.1 to optimize the latent code z. The optimization process usually converges within 500 iterations. The input noise to the generator is set to a relatively small dimension of 10 to ensure a sufficiently expansive network.\n\nFigure 5: We demonstrate recovery of latent codes on a generative network trained on the MNIST dataset. 
From top to bottom: ground truth output images; partial measurements with different sampling masks; reconstructed images using the latent codes recovered from the partial measurements. The recovery of the latent codes in these examples is perfect, using simple gradient descent.\n\nFive different sampling matrices are showcased in Figure 5, including uniform random samples, as well as the top half, bottom half, left half, and right half of the image space. In all cases, the input latent codes are recovered perfectly. We feed the recovered latent code back into the network to obtain the completed image, shown in the 3rd row.\n\nFigure 6: Recovery of latent codes on a generative network trained on the CelebA dataset. From top to bottom: ground truth output images; partial measurements with different sampling masks; reconstructed images using the latent codes recovered from the partial measurements. The recovery of the latent codes in these examples is perfect, using simple gradient descent.\n\nCelebFaces: A similar study is conducted on a generative network trained on the CelebFaces [12] dataset. We rescale the raw images from the CelebA dataset [12] to a size of 64 \u00d7 64. A network architecture similar to the previous MNIST experiment is adopted, but both the generative model and the discriminator have 4 layers rather than 3. The images are showcased in Figure 6.\n\nNote that the probability of exact recovery increases with the number of measurements. The minimum number of measurements required for exact recovery, however, depends on the network architecture, the weights, and the spatial sampling pattern. A mathematical characterization of the minimal number of measurements remains a challenging open question.\n\n5 Conclusion\n\nIn this work we prove rigorously that a 2-layer ReLU convolutional generative neural network is invertible, even when only partial output is observed. 
This result provides a suf\ufb01cient condition\nfor the generator network to be one-to-one, which avoids the mode collapse problem in training of\nGAN. We empirically demonstrate that the same conclusion holds even if the generative models\nhave other nonlinear activation functions (LeakyReLU, Sigmoid and Tanh) and multiple layers. The\nsame proof technique can be potentially generalized to multi-layer networks. Some interesting\nfuture research directions include rigorous proofs for leaky ReLUs and other activation functions,\nsubgaussian network weights, as well as inversion under noisy measurements.\n\n8\n\n\fReferences\n[1] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. Why are deep nets reversible: A simple theory,\n\nwith implications for training. ICLR workshop.\n\n[2] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using\n\ngenerative models. arXiv preprint arXiv:1703.03208, 2017.\n\n[3] E. Cand\u00e8s, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate\n\nmeasurements. Comm. Pure Appl. Math., 59(8):1207\u20131223, 2006.\n\n[4] David L. Donoho. For most large underdetermined systems of linear equations the minimal\n\nl1-norm solution is also the sparsest solution. Comm. Pure Appl. Math, 59:797\u2013829, 2004.\n\n[5] Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient\ndescent learns one-hidden-layer cnn: Don\u2019t be afraid of spurious local minima. arXiv preprint\narXiv:1712.00779, 2017.\n\n[6] Anna C Gilbert, Yi Zhang, Kibok Lee, Yuting Zhang, and Honglak Lee. Towards understanding\n\nthe invertibility of convolutional neural networks. arXiv preprint arXiv:1705.08664, 2017.\n\n[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural\ninformation processing systems, pages 2672\u20132680, 2014.\n\n[8] Paul Hand and Vladislav Voroninski. 
Global guarantees for enforcing deep generative priors by\n\nempirical risk. arXiv preprint arXiv:1705.07576, 2017.\n\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[10] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint\n\narXiv:1312.6114, 2013.\n\n[11] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning\n\napplied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[12] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the\n\nwild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.\n\n[13] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth\n\nsamples and a single image. arXiv preprint arXiv:1709.07492, 2017.\n\n[14] Fangchang Ma, Luca Carlone, Ulas Ayaz, and Sertac Karaman. Sparse depth sensing for\n\nresource-constrained robots. arXiv preprint arXiv:1703.01398, 2017.\n\n[15] Fangchang Ma, Ulas Ayaz, and Sertac Karaman. Supplementary materials - invertibility of\n\nconvolutional generative networks from partial measurements. NIPS, 2018.\n\n[16] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint\n\narXiv:1411.1784, 2014.\n\n[17] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context\nencoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, pages 2536\u20132544, 2016.\n\n[18] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with\ndeep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\n[19] Mahdi Soltanolkotabi. Learning relus via gradient descent.\n\nIn I. Guyon, U. V. Luxburg,\nS. Bengio, H. 
Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural\nInformation Processing Systems 30, pages 2004\u20132014. Curran Associates, Inc., 2017.\n\n9\n\n\f[20] Akash Srivastava, Lazar Valkoz, Chris Russell, Michael U Gutmann, and Charles Sutton.\nVeegan: Reducing mode collapse in gans using implicit variational learning. In Advances in\nNeural Information Processing Systems, pages 3310\u20133320, 2017.\n\n[21] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson,\nand Minh N Do. Semantic image inpainting with deep generative models. In Proceedings of\nthe IEEE Conference on Computer Vision and Pattern Recognition, pages 5485\u20135493, 2017.\n\n[22] Jun-Yan Zhu, Philipp Kr\u00e4henb\u00fchl, Eli Shechtman, and Alexei A. Efros. Generative visual\nIn Proceedings of European Conference on\n\nmanipulation on the natural image manifold.\nComputer Vision (ECCV), 2016.\n\n10\n\n\f", "award": [], "sourceid": 5878, "authors": [{"given_name": "Fangchang", "family_name": "Ma", "institution": "MIT"}, {"given_name": "Ulas", "family_name": "Ayaz", "institution": "Massachusetts Institute of Technology / Lyft"}, {"given_name": "Sertac", "family_name": "Karaman", "institution": "MIT"}]}