{"title": "Size-Noise Tradeoffs in Generative Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6489, "page_last": 6499, "abstract": "This paper investigates the ability of generative networks to convert their input noise distributions into other distributions. Firstly, we demonstrate a construction that allows ReLU networks to increase the dimensionality of their noise distribution by implementing a ``space-filling'' function based on iterated tent maps. We show this construction is optimal by analyzing the number of affine pieces in functions computed by multivariate ReLU networks. Secondly, we provide efficient ways (using polylog$(1/\\epsilon)$ nodes) for networks to pass between univariate uniform and normal distributions, using a Taylor series approximation and a binary search gadget for computing function inverses. Lastly, we indicate how high dimensional distributions can be efficiently transformed into low dimensional distributions.", "full_text": "Size-Noise Tradeoffs in Generative Networks

Bolton Bailey
Matus Telgarsky
{boltonb2,mjt}@illinois.edu

University of Illinois, Urbana-Champaign

Abstract

This paper investigates the ability of generative networks to convert their input noise distributions into other distributions. Firstly, we demonstrate a construction that allows ReLU networks to increase the dimensionality of their noise distribution by implementing a “space-filling” function based on iterated tent maps. We show this construction is optimal by analyzing the number of affine pieces in functions computed by multivariate ReLU networks. Secondly, we provide efficient ways (using polylog(1/ε) nodes) for networks to pass between univariate uniform and normal distributions, using a Taylor series approximation and a binary search gadget for computing function inverses.
Lastly, we indicate how high dimensional distributions can be efficiently transformed into low dimensional distributions.

1 Introduction

This paper focuses on the representational capabilities of generative networks. A generative network models a complex target distribution by taking samples from some efficiently-sampleable noise distribution and mapping them to the target distribution using a neural network. What distributions can a generative net approximate, and how well? Larger neural networks or networks with more noise given as input have greater power to model distributions, but it is unclear how the use of one resource can make up for the lack of the other. We seek to describe the relationship between these resources.
In our analysis, we make a few assumptions on the structure of the network and the noise. We focus on the two most standard choices for noise distributions: the normal distribution, and the uniform distribution on the unit hypercube [Arjovsky et al., 2017]. Henceforth, we will use the term “uniform distribution” to refer to the uniform distribution on unit hypercubes, unless otherwise specified. We look specifically at the case where the generative network is a fully-connected network with the ReLU activation function (without weight sharing). The notion of approximation we use is the Wasserstein distance, introduced for generative networks in Arjovsky et al. [2017], which is defined as follows:

Definition 1.
For two distributions µ and ν on R^d, their Wasserstein distance is defined as

W(µ, ν) := inf_{π ∈ Π(µ, ν)} ∫ |x − y| dπ(x, y),

where Π(µ, ν) is the set of joint distributions having µ and ν as marginals.

Our results fall into three regimes, each covered in its own section:

Section 2: The case where the input dimension is less than the output dimension.
In this regime, we prove tight upper and lower bounds for the task of approximating higher dimensional uniform distributions with lower dimensional distributions, in terms of the average width W and depth (number of layers) L of the network. The bounds are tight in the sense that both give an accuracy of the form ε = O(W)^{−O(L)} (keeping input and output dimensions fixed). Thus, this gives a good idea of the asymptotic behavior in this regime: error decays exponentially with the number of layers, and polynomially with the number of nodes in the network.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Section 3: The case where the input and output dimensions are equal.
In this regime, we give constructions for networks that can translate between single dimensional uniform and normal distributions.
These constructions incur ε error in Wasserstein distance using only polylog(1/ε) nodes.

Section 4: The case where the input dimension is greater than the output dimension.
In this regime, we show that even with trivial networks, increased input dimension can sometimes improve accuracy.

In the course of proving the above results, we show several lemmas of independent interest.

Multivariable affine complexity lemma.
For a function f : R^{n_0} → R^d computed by a neural network with N nodes, L layers, and ReLU nonlinearities, the domain of f can be partitioned into O((N/(n_0 L))^{n_0 L}) convex (polyhedral) pieces such that f is affine on each piece. This extends prior work, which considered networks with only univariate input [Telgarsky, 2016].

Taylor series approximation.
Univariate functions with quickly decaying Taylor series, such as exp, cos, and the CDF of the standard normal, can be approximated on domains of length M with networks of size poly(M, ln(1/ε)). This idea has been explored before by Yarotsky [2017]; the key difference between this work and the prior is that our results apply directly to arbitrary domains.

Function inversion through binary search.
The inverses of increasing functions with large enough slope can be approximated efficiently, provided that the functions themselves can be approximated efficiently. While this technique does not provide uniform bounds on the error, we show that it provides approximations that are good enough for generative networks to have low error.

Detailed proofs of most theorems and lemmas can be found in the appendix.

1.1 Related Work

Generative networks have become popular in the form of Generative Adversarial Networks (GANs), introduced by Goodfellow et al. [2014]; see for instance Creswell et al. [2018] for a survey of various GAN architectures.
GANs are trained using a discriminator network, an auxiliary neural network which tries to show that the distance from the simulated distribution to the true data distribution is large. The generator is trained by gradient descent to minimize the distance reported by this adversary network. Wasserstein GANs (or WGANs) are GANs which use an approximation of the Wasserstein distance as this notion of distance. The concept of Wasserstein distance comes out of the theory of optimal transport, as discussed in Villani [2003], and its use as a performance metric is expounded in Arjovsky et al. [2017]. WGANs have shown success in various generation tasks [Osokin et al., 2017, Donahue et al., 2018, Chen and Tong, 2017]. While this paper uses the Wasserstein distance as a performance metric, we are not concerned with the training process, only the representational capabilities of the networks.
Many of the results in this paper build on prior results on the representational power of neural nets as function approximators. These results first focused upon approximating continuous functions with a single hidden layer [Hornik et al., 1989, Cybenko, 1989], but recently branched out to deeper networks [Telgarsky, 2016, Eldan and Shamir, 2016, Yarotsky, 2017, Montufar et al., 2014]. A concurrent work in this area is Zhang et al. [2018], which uses tropical geometry to analyze deep networks. That work produced a result on the number of affine pieces of deep networks [Zhang et al., 2018, Theorem 6.3], which matches our bound in Lemma 1. This bound was originally suggested in Montufar et al. [2014].
The present work relies upon some of these recent works (e.g., affine piece counting bounds, approximation via Taylor series), but develops nontrivial extensions (e.g., multivariate inputs and outputs with tight dimension dependence, less benign Taylor series).
The representational capabilities of generative networks have previously been studied by Lee et al. [2017]. That paper provides a result for the representation capabilities of deep neural networks in terms of “Barron functions”, first described in Barron [1993], which are functions with certain constraints on their Fourier transform. Lee et al. [2017] showed that compositions of these Barron functions could be approximated well by deep neural networks. Their main result with respect to the representation of distributions was that the result of mapping a noise distribution through a Barron function composition could be approximated in Wasserstein distance by mapping the same noise distribution through the neural network approximation to the Barron function composition. These techniques do not readily permit the analysis of target distributions which are not images of the input space under such Barron functions.
The Box-Muller transform [Box et al., 1958] is a computational method for simulating bivariate normal distributions using uniform distributions on the unit (2-dimensional) square. The method is a general algorithm, but it is possible to simulate the transform with specially constructed neural nets, to prove theorems similar to those in section 3.
In fact this was our original approach; an overview of the Box-Muller implementation can be found in section 3.

1.2 Notation for Neural Networks

We define a neural network with L layers and n_i nodes in the ith layer as a composition of functions of the form

A_L ∘ σ_{n_{L−1}} ∘ A_{L−1} ∘ σ_{n_{L−2}} ∘ ··· ∘ σ_{n_1} ∘ A_1.

Here A_i : R^{n_{i−1}} → R^{n_i} is an affine function; that is, A_i is the sum of a linear function and a constant vector. σ_k : R^k → R^k is the k-component pointwise ReLU function, where the ReLU is the map x ↦ max{x, 0}. The total number of nodes N in a network is the sum Σ_{i=0}^{L} n_i. We will sometimes use n = n_0 to refer to the input dimension and d = n_L to refer to the output dimension. Since a neural network is a composition of piecewise affine functions, it is piecewise affine. The number of affine pieces of a function f will be denoted N_A(f), or just N_A.
When µ is a distribution, we will adopt the notation of Villani [2003] and use f#µ to denote the pushforward of µ under f, i.e., the distribution obtained by applying f to samples from µ. We will use U(A) to denote the uniform distribution on a set A ⊂ R^n, and m(A) to denote the Lebesgue measure of that set. We will use N to denote a normal distribution, which will always be centered on the origin and have unit covariance matrix.

2 Increasing the Dimensionality of Noise

How easy is it to create a generator network that can output more dimensions of noise than it receives? It is common in practice to use a far greater output dimension.
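To make the notation of Section 1.2 concrete, here is a minimal sketch (assuming numpy; the function name and weights are ours, purely illustrative) of a network evaluated as the composition above. The example weights implement the tent map t(x) = 2x on [0, 1/2] and 2 − 2x on [1/2, 1], which reappears in Section 2:

```python
import numpy as np

def relu_net(x, layers):
    """Evaluate A_L ∘ σ ∘ A_{L-1} ∘ ... ∘ σ ∘ A_1 at x.

    `layers` is a list of (W, b) pairs giving the affine maps
    A_i(x) = W x + b; the pointwise ReLU σ is applied after
    every layer except the last, matching the composition above.
    """
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:       # no nonlinearity after A_L
            x = np.maximum(x, 0.0)    # pointwise ReLU
    return x

# A 2-layer net computing the tent map t(x) = 1 - |2x - 1| on [0, 1],
# via the identity t(x) = 2 relu(x) - 4 relu(x - 1/2).
layers = [
    (np.array([[1.0], [1.0]]), np.array([0.0, -0.5])),  # A_1: x -> (x, x - 1/2)
    (np.array([[2.0, -4.0]]), np.array([0.0])),         # A_2: (u, v) -> 2u - 4v
]
```

Here N = n_0 + n_1 + n_2 = 1 + 2 + 1 nodes, and the computed function is piecewise affine with two pieces, as the notation promises.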
Here, we give both upper and lower bounds showing that an increase in dimension can require a large, complicated network.

2.1 Constructions for the Uniform Hypercube

For this section, we restrict ourselves to the case of input and output distributions which are uniform. To motivate our techniques, we can simplify our problem even further: we could ask how one might approximate a uniform distribution on the unit square using the uniform distribution on the unit interval. We see that we are limited by the fact that the range of the generator net is some one-dimensional curve in R^2, and so the distribution that the generator net produces will have to be supported on this curve. We will want each point of the unit square to be close to some point on the curve so that the mass of the square can be transported to the generated distribution. We are therefore led to consider some kind of (almost) space-filling curve. An excellent candidate is the graph of the iterated tent map, shown in Figure 1. This function has been useful in the past [Montufar et al., 2014, Telgarsky, 2016] since it is highly nonlinear and it can be shown that neural networks must be large to approximate it well. We can create a construction for the univariate-to-multivariate network which uses tent maps of varying frequencies to fill space.
The tent map construction, which appears in Montufar et al. [2014] and is given in full in the appendix, achieves the following error:

Theorem 1. Let µ and ν respectively denote uniform distributions on [0, 1] and [0, 1]^d.
Given any number of nodes N and number of layers L satisfying N > dL, we can construct a generative network f : [0, 1] → [0, 1]^d such that

W(f#µ, ν) ≤ √d · ⌊(N − dL)/L⌋^{−⌊L/(d−1)⌋}.   (1)

Figure 1: Examples of paths that come near every point in the unit square and the unit cube.

Thus, as the size of the network grows, the base in Equation 1 grows proportionally to the average width of the network, and the exponent grows proportionally to the depth of the network (while being inversely proportional to the number of outputs). The N > dL requirement comes from using some nodes to carry values forward; if we were to allow connections between non-adjacent layers, this requirement would go away and N would replace N − dL in the theorem statement.
We now consider the case where our input noise dimension is larger than 1. In this case, one possible construction involves dividing the output dimensions evenly amongst the input dimensions and then placing multiple copies of the above described construction in parallel. This produces the following bound:

Theorem 2. Let µ and ν respectively denote uniform distributions on [0, 1]^n and [0, 1]^d. Given any number of nodes N and number of layers L satisfying N > dL, we can construct a generative network f : [0, 1]^n → [0, 1]^d such that

W(f#µ, ν) ≤ √d · ⌊(N − dL)/(nL)⌋^{−⌊L/⌈(d−n)/n⌉⌋} = O(N/(nL))^{−O(nL/d)},

where the big-O hides factors of d in the base, and constant factors in the exponent.

Note that this generalizes Theorem 1. The proof can be found in the appendix.
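The iterated tent map at the heart of this construction is easy to state directly (a sketch; the function names are ours, and the full network construction is in the paper's appendix):

```python
def tent(x):
    """One tent map: t(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1]; 2 affine pieces."""
    return 2 * x if x <= 0.5 else 2 - 2 * x

def iterated_tent(x, k):
    """k-fold composition t∘...∘t: a piecewise affine function with 2^k pieces,
    whose graph sweeps fully up or down across [0, 1] on each dyadic interval
    of width 2^(-k)."""
    for _ in range(k):
        x = tent(x)
    return x
```

Since each of the 2^k monotone pieces spans the full range [0, 1], every point of the unit square lies horizontally within 2^(−k) of the graph of x ↦ (x, iterated_tent(x, k)); this is the "almost space-filling" property the construction exploits.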
This bound is at its tightest when d is a multiple of n, in which case (d − n)/n is an integer and the exponent matches exactly that in the lower bound determined later. The construction works more smoothly with this even divisibility because the output nodes can be evenly split among the inputs, and it is easier to parallelize the construction.

2.2 Lower Bounds for the Uniform Box

We now provide matching lower bounds. For this, it suffices to count the affine pieces. Bounds on the number of affine pieces have been proved before, but only with univariate input [Telgarsky, 2016]; here we allow the network input to be multidimensional.

Lemma 1. Let f : R^{n_0} → R^d be a function computed by a neural network with at most N total nodes and L layers. Then the domain of f can be divided into N_A convex pieces on which the function is affine, where

N_A ≤ (e N/(n_0 L) + e)^{n_0 L}.

This lemma has also been proven in concurrent work [Zhang et al., 2018] using the techniques of tropical geometry. Our proof works essentially by induction on the number of layers: we look at the set of possible activations of the ith layer and see that it is a union of convex affine sets of dimension at most n_0. The application of the ReLU maps each of these convex affine sets into O(n_i)^{n_0} convex affine sets, where n_i is the number of nodes in the ith layer.
The one-dimensional tent map construction tells us that for a given number of nodes and number of layers, we can construct a function with a number of affine pieces bounded by the size of the network. When constructing multidimensional-input networks with a high number of affine pieces, we can always parallelize several of these tent maps to get a map whose piece count is the product of the piece counts of the individual networks.
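The piece counting behind Lemma 1 can be probed numerically in the univariate case (a sketch; the function names are ours, and the grid-based count is a heuristic that only detects kinks the grid resolves):

```python
def count_affine_pieces(f, a=0.0, b=1.0, n=1 << 14, tol=1e-8):
    """Estimate the number of affine pieces of f on [a, b] by counting
    changes in the slope of a fine secant approximation."""
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    ys = [f(x) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) / h for i in range(n)]
    pieces = 1
    for s0, s1 in zip(slopes, slopes[1:]):
        if abs(s1 - s0) > tol:   # a kink: the affine piece changed
            pieces += 1
    return pieces

def tent(x):
    return 2 * x if x <= 0.5 else 2 - 2 * x

def tent_k(x, k=3):
    for _ in range(k):
        x = tent(x)
    return x
```

Composing k tent maps multiplies the piece counts (2 × 2 × ... = 2^k), which is the same multiplicative growth that Lemma 1 bounds from above.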
What this lemma guarantees is that, up to a constant factor in the number of nodes, this construction is optimal for producing as many affine pieces as possible. This gives us confidence that our parallelized tent map construction for low-dimensional uniform to high-dimensional uniform may be close to optimal.
To show that our construction is optimal, we need to show that it approximates the high-dimensional uniform distribution about as accurately as any piecewise affine function with the same number of pieces N_A. To do this, we will use the fact that the range of a piecewise affine function is a subset of the union of the ranges of its constituent affine functions. We then show that any distribution on a union like this is necessarily distant from the target uniform distribution.

Theorem 3. Let B be a bounded measurable subset of R^d of radius l, let f : R^n → R^d be piecewise affine with n < d, and let P be any distribution on R^n. The Wasserstein distance between f#P and the uniform distribution U_B on B has the following lower bound:

W(U_B, f#P) ≥ k (l^{−n} m(B) / N_A)^{1/(d−n)},

where k depends on n and d.

Note that our technique can produce bounds not just for the unit cube but for the uniform distribution on any bounded subset of R^d, such as the sphere. When we combine this with our analysis of N_A in Lemma 1, we get a lower bound result for a given number of nodes and layers.

Theorem 4. Let µ and ν respectively denote uniform distributions on [0, 1]^n and [0, 1]^d. Given any number of nodes N and number of layers L, for any generative network f : [0, 1]^n → [0, 1]^d, we have

W(f#µ, ν) ≥ k (e N/(nL) + e)^{−nL/(d−n)} = O(N/(nL))^{−nL/(d−n)},

where the big-O hides factors of n and d in the base.

Proof.
This follows from applying Theorem 3, taking f as a neural network with the affine piece bound from Lemma 1, and P as µ, the uniform distribution on [0, 1]^n.

3 Transporting between Univariate Distributions

The two most common distributions used in generative networks in practice are the uniform distribution and the normal. How easily can one of these distributions be used to approximate the other? We will deal with the simplest case, where our input and output distributions are one-dimensional. If we can construct a neural net for this case, we can parallelize multiple copies of the net if we want to move between normal and uniform distributions in higher dimensions.

3.1 Approximation of a Uniform Distribution by a Normal

Perhaps the simplest idea for approximating a uniform distribution with a generative network with normal noise is to let the network approximate Φ, the cumulative distribution function of the normal. To approximate Φ, we will approximate its Maclaurin series:

Φ(z) = 1/2 + (1/√(2π)) Σ_{n=0}^{∞} (−1)^n z^{2n+1} / (n! (2n+1) 2^n).

This series has convergence properties which allow a network based on its truncation to work. Yarotsky [2017, Proposition 3c] showed that f : (x, y) ↦ xy over [−M, M]^2 can be efficiently approximated by neural networks, in the sense that there is a network with O(ln(1/ε) + ln(M)) nodes and layers computing a function f̂ with |f − f̂| ≤ ε. Yarotsky [2017] uses this to show that certain functions with small derivatives can be approximated well. We will show a similar result, which depends on the good behavior of the Taylor series of Φ.
Naturally, if f̃ is a neural network approximating f and g̃ approximates g, then we can compose these approximations to get an approximation of the composition.
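Before any networks enter the picture, the truncated Maclaurin series above can be sanity-checked numerically (a sketch; `phi_series` and `phi_exact` are our illustrative names, with math.erf supplying the reference value):

```python
import math

def phi_series(z, terms=30):
    """Truncated Maclaurin series for the standard normal CDF:
    Phi(z) = 1/2 + (1/sqrt(2 pi)) * sum_n (-1)^n z^(2n+1) / (n! (2n+1) 2^n)."""
    s = 0.0
    for n in range(terms):
        s += (-1) ** n * z ** (2 * n + 1) / (
            math.factorial(n) * (2 * n + 1) * 2 ** n)
    return 0.5 + s / math.sqrt(2 * math.pi)

def phi_exact(z):
    """Reference value for Phi via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

The factorial in the denominator makes the truncation error vanish rapidly for moderate |z|, but the needed number of terms grows with the width M of the domain, which is why the sizes below depend on M.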
In particular, if g is Lipschitz, then the quality of the composed approximation depends on this Lipschitz constant and on the accuracies of the individual approximations. A “composition lemma” to this effect is included in the appendix as Lemma 8. We will use this idea several times to construct a variety of function approximations.
We will now consider the method of approximating functions by approximating their Taylor series with neural networks. To do this, we first demonstrate a network which takes a univariate input x in [−M, M] and returns the multivariate output (x^0, x^1, x^2, . . . , x^n).

Theorem 5. The function f : x ↦ (x^0, . . . , x^n) on [−M, M] can be computed uniformly to within ε by a neural network of size poly(n, ln(M), ln(1/ε)).

The proof relies on iteratively composing the multiplication function x^i = x^{i−1} · x, using the “composition lemma” to get each of the x^i.
Now that we know the size required to approximate the powers of x, we may use this to approximate the Maclaurin series of Φ.

Theorem 6. The function Φ can be approximated uniformly to within ε by a network of size poly(ln(1/ε)).

To show this, we apply Theorem 5 with a suitable choice of M and n and then use the monomial approximations to get a Taylor approximation of Φ. Knowing that we can approximate Φ well, we can give a precise bound on the Wasserstein distance of this construction.

Theorem 7. We can construct a generative network with polylog(1/ε) nodes and univariate normal noise that can output a distribution with Wasserstein distance ε from uniform.

Proof. Using Theorem 6, let Φ̃ : R → [0, 1] approximate Φ uniformly to within ε/2.
Consider the coupling π between the output of this network and the true uniform distribution which consists of pairs (Φ(z), Φ̃(z)), where z is normally distributed:

W(U([0, 1]), Φ̃#N) = W(Φ#N, Φ̃#N) ≤ ∫_R |Φ(z) − Φ̃(z)| · (1/√(2π)) e^{−z²/2} dz.

But since |Φ − Φ̃| is less than ε everywhere, this integral is no more than ε, so we can indeed create a generative network of polylog(1/ε) nodes for this task.

3.2 Approximation of a Normal Distribution by Uniform

Having shown that normal distributions can approximate uniform distributions with polylog(1/ε) nodes, let's see if the reverse is true. For this we'll need a few lemmas.
For analytic convenience, a few of our intermediate constructions will use networks with both ReLU activations and step function activations H(x) = 1[x > 0]. Networks with these two allowed activations have a convenient property which allows them to be used to study vanilla ReLU networks: if there is a ReLU/Step network approximating a function f uniformly, then f can be uniformly approximated by a comparably-sized ReLU network on all but a subset of its domain of arbitrarily small measure.

Lemma 2. Let µ be a measure, and A a measurable set with µ(A) < ∞. Suppose f : R^n → R^d can be approximated uniformly to within ε on A by a function f̃ computed by a ReLU/Step network with N nodes. Then for any ζ > 0, there exists a ReLU network with O(N) nodes which approximates f to within 2ε on a set B where A \ B has measure less than ζ.

Proof.
Note that while a ReLU neural network cannot implement the step function, it can implement the following approximation:

s_δ(x) = 0 if x ≤ 0;   x/δ if 0 ≤ x ≤ δ;   1 if x > δ.

In fact, this approximation to the step function can be implemented with a 4-node ReLU network. If we replace every step function activation node in our architecture with a copy of this four-node network, we get an architecture of size O(N). With this architecture, we can compute each of a sequence (f_n) of functions, where in f_n, all step functions from our old network are replaced by s_{1/n}. For any x in A, consider the minimum positive input to a step function which occurs in the computation graph. If δ = 1/n is less than this minimum, then f_n(x) = f̃(x), so this sequence converges pointwise to f̃. Egorov's theorem [Kolmogorov and Fomin, 1975, p. 290, Theorem 12] now tells us that (f_n) converges to f̃ uniformly on a set B that satisfies the µ(A \ B) < ζ requirement. Thus, there is an f_n that approximates f̃ to within ε uniformly on this B, and this f_n therefore approximates f uniformly on B to within 2ε.

This lemma has a useful application to generative networks: if we make ζ sufficiently small, the mass of the noise distribution on A \ B is arbitrarily small. Therefore, we can make ζ small enough that the impact of the mistake region on the Wasserstein distance is negligible.
We now would like to approximate, in this powerful format, some function that maps the uniform distribution to the normal. Complementing the use of the normal CDF Φ in the previous subsection, here we will use its inverse Φ^{−1}. Since we have conveniently already proved that Φ is efficiently approximable, we would like a general lemma that allows us to invert it.

Lemma 3.
Let f : [a, b] → [c, d] be a strictly increasing differentiable function with f′ greater than a constant L everywhere, and let f be approximated to within ε by a network of size N. Then (for any ζ > 0), f^{−1} can be approximated to within (b − a)2^{−t} + ε/L on (all but a measure ζ subset of) [c, d] by a network of size O(tN).

The proof of this lemma constructs a neural network that executes t iterations of a binary search on [a, b], using t copies of the approximation to f to decide which subinterval to narrow in on. Applying this lemma to our approximation theorem for the normal CDF gives us an approximation of the inverse of the normal CDF.

Theorem 8. For any ζ > 0, the function Φ^{−1} can be approximated to within ε by a network of size polylog(1/ε) on [Φ(−ln(1/ε)), Φ(ln(1/ε))] \ A, where A is of measure ζ.

Proof. By Theorem 6 we can get the normal CDF Φ to within ε^{ln(1/ε)+1} with polylog(1/ε) nodes. Using Lemma 3 with t = O(ln(1/ε)), if we choose a = −ln(1/ε) and b = ln(1/ε), then the Lipschitz constant of Φ^{−1} on this interval is

Φ′(ln(1/ε))^{−1} = O(e^{ln(1/ε)²/2}) = O(ε^{−ln(1/ε)}),

and so Lemma 3 gives a total error on the order of ε.

With this approximation, we can get a generative network approximation. Since the tails of the normal distribution are small, we can ignore them by collapsing the mass of the tails into a bounded interval. Then, by setting ζ sufficiently small that the Wasserstein distance contributed by the error region is negligible, our approximation can be shown to be within ε of the normal.
As a final lemma, we note the following observation:

Proposition 1.
For two distributions on R, their Wasserstein distance is equal to the L1 integral of the difference of their CDFs.

For a proof, see [Villani, 2003, Remark 2.19.ii]. For intuition, note that moving a mass m from a to b in a one-dimensional distribution changes the CDF of the distribution on [a, b] by m.
With these in place, we can get a bound for the uniform to normal construction.

Theorem 9. A generative network with polylog(1/ε) nodes and univariate uniform noise can output a distribution with Wasserstein distance ε from a normal distribution.

The proof is an application of Theorem 8 and Proposition 1.

3.2.1 The Box-Muller Transform

We've established the bound we sought (approximation of a normal distribution via uniform), but in this section we'll show that a curious classical construction also fits the bill, albeit in two dimensions. The Box-Muller transform [Box et al., 1958] comes from the observation that if X1 and X2 are two independent uniform random variables on the unit interval, and we define

Z1 := √(−2 ln(X1)) cos(2πX2)   and   Z2 := √(−2 ln(X1)) sin(2πX2),   (2)

then Z1, Z2 are independent and normally distributed. Equation 2 comes from the interpretation of √(−2 ln(X1)) and 2πX2 as r and θ in a polar-coordinate representation of the pair of normals.
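Equation 2 can be checked directly by simulation, independently of any network (a sketch using only the standard library; the fixed seed keeps it deterministic):

```python
import math
import random

def box_muller(x1, x2):
    """Equation 2: map two independent Uniform(0,1) samples to two
    independent standard normal samples."""
    r = math.sqrt(-2.0 * math.log(x1))
    theta = 2.0 * math.pi * x2
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)
samples = []
for _ in range(50_000):
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    z1, z2 = box_muller(1.0 - rng.random(), rng.random())
    samples.extend([z1, z2])

mean = sum(samples) / len(samples)
var = sum((z - mean) ** 2 for z in samples) / len(samples)
```

With 100,000 samples, the empirical mean and variance should land well within a few standard errors of 0 and 1 respectively.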
While this method is not as powerful as the CDF approximation method, in that it requires two dimensions of uniform noise in order to work, it still suggests a theorem similar to Theorem 8.

Theorem 10. For any ζ > 0, the function (X1, X2) ↦ (√(−2 ln(X1)) cos(2πX2), √(−2 ln(X1)) sin(2πX2)) can be approximated to within ε by a network of size polylog(1/ε) on [0, 1]² \ A, where A is of measure ζ.

We provide the following proof sketch:

• The cos and sin functions (and the exp function) can be efficiently computed for much the same reason that Φ can: their Taylor expansion coefficients decay rapidly.
• The ln function can be approximated on [1/2, 3/2] using the Taylor series for ln(1 + x). For inputs outside this interval, we can repeatedly double or halve the input until we reach [1/2, 3/2], use the approximation we have, then add in a constant depending on the number of times we doubled or halved.
• The square root function, and in fact all functions of the form x ↦ x^α for α > 0, can be approximated using the approximations for exp and ln and the identity x^α = exp(α ln(x)).
• Putting these together, along with the approximation for products from Yarotsky [2017], we get the result.

4 From Many Dimensions to One

This section will complete the story by seeing what is gained in transporting many dimensions into one.
To begin, let's first reflect on the bounds we have. So far, we have shown upper bounds on neural network sizes that are polylogarithmic in 1/ε. A careful analysis of the previous subsection shows that the construction uses O(ln^5(1/ε)) nodes for the normal to uniform direction and O(ln^18(1/ε)) for the uniform to normal. We would like to know how close to optimal these exponents are.
The goal of this subsection is to quickly establish that the lower bound for this exponent is at least 1. To do this, we will make some use of the affine piece analysis from Section 2.
Note that piecewise affine functions acting on the uniform distribution have structure in their CDF, since they produce a mixture of the distributions induced by the individual affine pieces:

Proposition 2. For a piecewise affine function f : [0, 1] → R with N_A pieces, the CDF of the distribution f#U([0, 1]) is a piecewise affine function with at most N_A + 2 pieces.

So if we can establish a bound on the accuracy with which a piecewise affine function can approximate the normal CDF, we can use the univariate affine pieces lemma above to lower bound the accuracy of any uniform-univariate-noise approximation of the normal. A helpful bound is given in Safran and Shamir [2016], from which we get:

Lemma 4. Let f be a univariate piecewise affine function with N_A pieces. Then

∫_a^b |Φ(x) − f(x)| dx ≥ K / N_A^4

for some constant K.

Putting this together with Proposition 2 and Lemma 1, we see:

Theorem 11. A generative network taking uniform noise can approximate a normal with Wasserstein accuracy at best exponentially small in the number of nodes.

Or in other words, approximation to accuracy ε requires Ω(log(1/ε)) nodes.
Clearly, if we wish to approximate a low-dimensional uniform distribution with a higher-dimensional one, all we need to do is ignore some of the inputs and spit the others back out unchewed. The same goes for normal distributions. Is there any benefit at all to additional dimensions of input noise when the target distribution is of lower dimension?
Interestingly, the answer is yes.
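A deterministic illustration: combining Proposition 1 (Wasserstein distance as an integral of CDF differences) with the closed-form CDF of a sum of uniforms (the Irwin-Hall distribution) shows the distance to the normal shrinking as more uniforms are summed. This is a sketch with our own function names and a simple midpoint-rule integration:

```python
import math
from statistics import NormalDist

def irwin_hall_cdf(x, n):
    """CDF of the sum of n independent Uniform(0,1) random variables."""
    if x <= 0:
        return 0.0
    if x >= n:
        return 1.0
    total = 0.0
    for k in range(int(x) + 1):
        total += (-1) ** k * math.comb(n, k) * (x - k) ** n
    return total / math.factorial(n)

def w1_sum_vs_normal(n, grid=4000):
    """Proposition 1: W1 = integral of |difference of CDFs|.  Compare the
    centered, variance-normalized sum of n uniforms with the standard normal
    (each uniform has mean 1/2 and variance 1/12)."""
    nd = NormalDist()
    scale = math.sqrt(n / 12.0)
    lo, hi = -8.0, 8.0
    h = (hi - lo) / grid
    total = 0.0
    for i in range(grid):
        z = lo + (i + 0.5) * h
        f_sum = irwin_hall_cdf(z * scale + n / 2.0, n)
        total += abs(f_sum - nd.cdf(z)) * h
    return total
```

Doubling the number of summed uniforms visibly shrinks the distance, consistent with an O(1/√n) rate.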
Considering the case of approximating a univariate normal distribution with a high-dimensional distribution, we note that there is a simplistic approach which involves summing the inputs and reasoning that the output is close to a normal distribution by the Berry-Esseen theorem.

Theorem 12. The distribution given by summing n uniform random variables on [0, 1] and normalizing the result has a Wasserstein distance of O(1/√n) from the standard normal distribution.

Note that the above approach does not use any nonlinearity at all. It simply takes advantage of the fact that projecting a hypercube onto a line results in an approximately normal distribution. This theorem suggests another way of approaching Theorem 9: use the results of Section 2 to increase a 1-dimensional uniform distribution to a d-dimensional uniform distribution, then apply Theorem 12 as the final layer of that construction to get an approximately normal distribution. Unfortunately, this technique does not prove the polylog(1/ε) size: it is necessary that 1/√d ≈ ε, which means the size of the network (indeed, even the size of the final layer of the network) is polynomial in 1/ε.

5 Conclusions and Future Work

One might ask with regard to Section 3 whether there are more efficient constructions than the ones found in that section, since there is a gap between the upper and lower bounds. There are other approaches to the uniform to normal transformation, such as the Box-Muller method [Box et al., 1958] we discuss. Future work could modify this or other methods to tighten the bounds found in that section.

An interesting open question is whether the results of Section 3 can be applied more generally to multidimensional distributions. Suppose for example that we have a neural network that pushes a univariate uniform distribution into a univariate normal distribution.
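In plain Python, with the exact inverse Gaussian CDF standing in for that hypothetical univariate network (the actual network only approximates this map, and the names `univariate_push` and `parallel_push` are ours), applying d copies componentwise looks like:

```python
import random
from statistics import NormalDist

def univariate_push(u):
    # Stand-in for the univariate uniform -> normal network: exact inverse CDF.
    return NormalDist().inv_cdf(u)

def parallel_push(us):
    # d parallel copies: d-dimensional uniform noise -> d-dimensional normal noise.
    return [univariate_push(u) for u in us]

random.seed(0)
d, n = 3, 50_000
# Keep inputs strictly inside (0, 1); inv_cdf is undefined at the endpoints.
samples = [parallel_push([random.uniform(1e-12, 1.0 - 1e-12) for _ in range(d)])
           for _ in range(n)]

# Each output coordinate should be approximately standard normal.
for j in range(d):
    col = [s[j] for s in samples]
    m = sum(col) / n
    v = sum((x - m) ** 2 for x in col) / n
    print(f"coordinate {j}: mean ~ {m:.2f}, variance ~ {v:.2f}")
```

Whether a single d-dimensional network of the same total size can do better than this componentwise layout is precisely the open question raised here.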
We can take d copies of this network in parallel to get a network which takes d-dimensional uniform noise and outputs d-dimensional normal noise. Is a parallel construction of the form described here the most efficient way to create a network that pushes forward a d-dimensional uniform distribution to a d-dimensional normal? For that matter, if f : R^d → R^d is of the form of a univariate function evaluated componentwise on the input, is the best neural network approximation for f of a given size a parallel construction?

Another future direction is: to what extent do training methods for generative networks relate to these results? The results in this paper are only representational; they provide proof of what is possible with hand-chosen weights. One could experiment with training methods to see whether they create the "space-filling" property that is necessary for optimal increase of noise dimension. Alternatively, one could experiment with real-world datasets to see if changing the noise distributions while simultaneously growing or shrinking the network leaves the accuracy of the method unchanged. We ran some simple initial experiments measuring how well GANs of different architectures and noise distributions learned MNIST generation, and we found them inconclusive; in particular, we could not be certain whether our empirical observations were a consequence purely of representation, or of some combination of representation and training.

Acknowledgements

The authors are grateful for support from the NSF under grant IIS-1750051, and for a GPU grant from NVIDIA.

References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, New York, NY, USA, 1st edition, 2009. ISBN 052111862X, 9780521118620.

Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/arjovsky17a.html.

A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, May 1993. ISSN 0018-9448. doi: 10.1109/18.256500.

George E. P. Box, Mervin E. Muller, et al. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 29(2):610–611, 1958.

Zhimin Chen and Yuguang Tong. Face super-resolution through Wasserstein GANs. arXiv preprint arXiv:1705.02438, 2017.

Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Chris Donahue, Julian McAuley, and Miller Puckette. Synthesizing audio with generative adversarial networks. arXiv preprint arXiv:1802.04208, 2018.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In COLT, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, July 1989.

A. N. Kolmogorov and S. V. Fomin. Introductory Real Analysis. Dover Books on Mathematics. Dover Publications, 1975.
ISBN 9780486612263. URL https://books.google.com/books?id=z8IaHgZ9PwQC.

Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, and Sanjeev Arora. On the ability of neural nets to express distributions. Proceedings of Machine Learning Research, 65:1271–1296, 07–10 Jul 2017. URL http://proceedings.mlr.press/v65/lee17a.html.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

Anton Osokin, Anatole Chessel, Rafael E. Carazo Salas, and Federico Vaggi. GANs for biological image synthesis. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2252–2261. IEEE, 2017.

Iosif Pinelis. On the nonuniform Berry–Esseen bound. https://arxiv.org/pdf/1301.2828.pdf, May 2013. (Accessed on 05/15/2018).

Itay Safran and Ohad Shamir. Depth separation in ReLU networks for approximating smooth non-linear functions. CoRR, abs/1610.09887, 2016. URL http://arxiv.org/abs/1610.09887.

Matus Telgarsky. Benefits of depth in neural networks. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1517–1539, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR. URL http://proceedings.mlr.press/v49/telgarsky16.html.

C. Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics. American Mathematical Society, 2003. ISBN 9780821833124. URL https://books.google.com/books?id=GqRXYFxe0l0C.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim.
Tropical geometry of deep neural networks. arXiv preprint arXiv:1805.07091, 2018.