{"title": "Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 9089, "page_last": 9100, "abstract": "We study the problem of estimating a nonparametric probability distribution under a family of losses called Besov IPMs. This family is quite large, including, for example, L^p distances, total variation distance, and generalizations of both Wasserstein (earthmover's) and Kolmogorov-Smirnov distances. For a wide variety of settings, we provide both lower and upper bounds, identifying precisely how the choice of loss function and assumptions on the data distribution interact to determine the mini-max optimal convergence rate. We also show that, in many cases, linear distribution estimates, such as the empirical distribution or kernel density estimator, cannot converge at the optimal rate. These bounds generalize, unify, or improve on several recent and classical results. Moreover, IPMs can be used to formalize a statistical model of generative adversarial networks (GANs). Thus, we show how our results imply bounds on the statistical error of a GAN, showing, for example, that, in many cases, GANs can strictly outperform the best linear estimator.", "full_text": "Nonparametric Density Estimation and\n\nConvergence of GANs under Besov IPM Losses\n\nAnanya Uppal\n\nDepartment of Mathematical Sciences\n\nCarnegie Mellon University\nauppal@andrew.cmu.edu\n\nShashank Singh\u2217 Barnab\u00e1s P\u00f3czos\n\nMachine Learning Department\nCarnegie Mellon University\n\n{sss1,bapoczos}@cs.cmu.edu\n\nAbstract\n\nWe study the problem of estimating a nonparametric probability density under a\nlarge family of losses called Besov IPMs, which include, for example, Lp distances,\ntotal variation distance, and generalizations of both Wasserstein and Kolmogorov-\nSmirnov distances. 
For a wide variety of settings, we provide both lower and upper bounds, identifying precisely how the choice of loss function and assumptions on the data interact to determine the minimax optimal convergence rate. We also show that linear distribution estimates, such as the empirical distribution or kernel density estimator, often fail to converge at the optimal rate. Our bounds generalize, unify, or improve several recent and classical results. Moreover, IPMs can be used to formalize a statistical model of generative adversarial networks (GANs). Thus, we show how our results imply bounds on the statistical error of a GAN, showing, for example, that GANs can strictly outperform the best linear estimator.

1 Introduction

This paper studies the problem of estimating a nonparametric probability density, using an integral probability metric as a loss. That is, given a sample space X ⊆ R^D, suppose we observe n IID samples X_1, ..., X_n ∼ p from a probability density p over X that is unknown but assumed to lie in a regularity class P. We seek an estimator p̂ : X^n → P of p, with the goal of minimizing a loss

    d_F(p, p̂(X_1, ..., X_n)) := sup_{f∈F} | E_{X∼p}[f(X)] − E_{X∼p̂(X_1,...,X_n)}[f(X)] |,    (∗)

where F, called the discriminator class, is some class of bounded, measurable functions on X. Metrics of the form (∗) are called integral probability metrics (IPMs), or F-IPMs², and can capture a wide variety of metrics on probability distributions by choosing F appropriately [38]. This paper studies the case where both F and P belong to the family of Besov spaces, a large family of nonparametric smoothness spaces that include, as examples, Lp, Lipschitz/Hölder, and Hilbert-Sobolev spaces.
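For intuition, two IPMs of the form (∗) admit closed forms for one-dimensional empirical distributions and can be computed directly. The sketch below (pure NumPy; the function names are ours, not the paper's) computes the Kolmogorov-Smirnov IPM, where F is the set of indicators of half-lines, and the 1-Wasserstein IPM, where F is the set of 1-Lipschitz functions:

```python
import numpy as np

def ks_ipm(x, y):
    """Kolmogorov-Smirnov distance: the IPM (*) with F the indicators of
    half-lines, i.e. sup_t |F_x(t) - F_y(t)| over the empirical CDFs."""
    xs, ys = np.sort(x), np.sort(y)
    grid = np.concatenate([xs, ys])          # the sup is attained at a sample point
    Fx = np.searchsorted(xs, grid, side="right") / len(xs)
    Fy = np.searchsorted(ys, grid, side="right") / len(ys)
    return float(np.max(np.abs(Fx - Fy)))

def w1_ipm(x, y):
    """1-Wasserstein distance: the IPM (*) with F the 1-Lipschitz functions.
    For equal-size 1-D samples, the Kantorovich dual reduces to the mean
    absolute difference of the sorted samples."""
    assert len(x) == len(y)
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)
y = rng.normal(0.5, 1.0, size=2000)
print(ks_ipm(x, y), w1_ipm(x, y))   # both detect the mean shift of 0.5
```

Richer discriminator classes F, such as the Besov balls studied below, interpolate between such extremes.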
The resulting IPMs include, as examples, Lp, total variation, Kolmogorov-Smirnov, and Wasserstein distances. We have two main motivations for studying this problem:

1. This problem unifies nonparametric density estimation with the central problem of empirical process theory, namely bounding quantities of the form d_F(P, P̂) when P̂ is the empirical distribution P_n = (1/n) Σ_{i=1}^n δ_{X_i} of the data [42]. Whereas empirical process theory typically avoids restricting P and fixes the estimator P̂ = P_n, focusing on the discriminator class F, nonparametric density estimation typically fixes the loss to be an Lp distance, and seeks a good estimator P̂ for a given distribution class P. In contrast, we study how constraints on F and P jointly determine convergence rates of a number of estimates P̂ of P. In particular, since Besov spaces comprise perhaps the largest commonly-studied family of nonparametric function spaces, this perspective allows us to unify, generalize, and extend several classical and recent results in distribution estimation (see Section 3).

∗Now at Google.
²While the name IPM seems most widely used [38, 48, 6, 58], many other names have been used for these quantities, including adversarial loss [46, 12], MMD [16], and F-distance or neural net distance [5].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2.
This problem is a theoretical framework for analyzing generative adversarial networks (GANs). Specifically, given a GAN whose discriminator and generator networks encode functions in F and P, respectively, recent work [31, 27, 28, 46] showed that a GAN can be seen as a distribution estimate³

    P̂ = argmin_{Q∈P} sup_{f∈F} | E_{X∼Q}[f(X)] − E_{X∼P̃_n}[f(X)] | = argmin_{Q∈P} d_F(Q, P̃_n),    (1)

i.e., an estimate which directly minimizes empirical IPM risk with respect to a (regularized) empirical distribution P̃_n. While, in the original GAN model [20], P̃_n was the empirical distribution P_n = (1/n) Σ_{i=1}^n δ_{X_i} of the data, Liang [27] showed that, under smoothness assumptions on the population distribution, performance is improved by replacing P_n with a regularized version P̃_n, equivalent to the instance noise trick that has become standard in GAN training [47, 34]. We show, in particular, that, when P̃_n is a wavelet-thresholding estimate, a GAN based on sufficiently large fully-connected neural networks with ReLU activations learns Besov probability distributions at the optimal rate.

2 Set up and Notation

For non-negative real sequences {a_n}_{n∈N}, {b_n}_{n∈N}, a_n ≲ b_n indicates lim sup_{n→∞} a_n/b_n < ∞, and a_n ≍ b_n indicates a_n ≲ b_n ≲ a_n. For p ∈ [1,∞], p′ := p/(p−1) denotes the Hölder conjugate of p (with 1′ = ∞, ∞′ = 1). L^p(R^D) (resp. l^p) denotes the set of functions f (resp. sequences a) with ‖f‖_p := (∫ |f(x)|^p dx)^{1/p} < ∞ (resp. ‖a‖_{l^p} := (Σ_{n∈N} |a_n|^p)^{1/p} < ∞).

2.1 Multiresolution Approximation and Besov Spaces

We now provide some notation that is necessary to define the family of Besov spaces studied in this paper. Since the statements and formal justifications behind these definitions are a bit complex, some technical details are relegated to the Appendix, and several well-known examples from the rich class of resulting spaces are given in Section 3. The diversity of Besov spaces arises from the fact that, unlike the Hölder or Sobolev spaces that they generalize, Besov spaces model functions simultaneously across multiple spatial scales. In particular, they rely on the following notion:

Definition 1. A multiresolution approximation (MRA) of L²(R^D) is an increasing sequence {V_j}_{j∈Z} of closed linear subspaces of L²(R^D) with the following properties:

1. ∩_{j=−∞}^∞ V_j = {0}, and the closure of ∪_{j=−∞}^∞ V_j = L²(R^D).
2. For f ∈ L²(R^D), k ∈ Z^D, j ∈ Z, f(x) ∈ V_0 ⇔ f(x − k) ∈ V_0, and f(x) ∈ V_j ⇔ f(2x) ∈ V_{j+1}.
3. For some "father wavelet" φ ∈ V_0, {φ(x − k) : k ∈ Z^D} is an orthonormal basis of V_0 ⊂ L²(R^D).

For intuition, consider the best-known MRA of L²(R), namely the Haar wavelet basis. Let φ(x) = 1_{[0,1)} be the Haar father wavelet, let V_0 = Span{φ(x − k) : k ∈ Z} be the span of translations of φ by an integer, and let V_j, defined recursively for all j ∈ Z by V_j = {f(2x) : f(x) ∈ V_{j−1}}, be the set of horizontal scalings of functions in V_{j−1} by 1/2. Then, {V_j}_{j∈Z} is an MRA of L²(R).

The importance of an MRA is that it generates an orthonormal basis of L²(R^D), via the following:

Lemma 2 ([35], Section 3.9).
Let {V_j}_{j∈Z} be an MRA of L²(R^D) with father wavelet φ. Then, for E = {0,1}^D \ {(0, ..., 0)}, there exist "mother wavelets" {ψ_ε}_{ε∈E} such that {2^{Dj/2} ψ_ε(2^j x − k) : ε ∈ E, k ∈ Z^D} ∪ {2^{Dj/2} φ(2^j x − k) : k ∈ Z^D} is an orthonormal basis of V_j ⊆ L²(R^D).

Let Λ_j = {2^{−j} k + 2^{−j−1} ε : k ∈ Z^D, ε ∈ E} ⊆ R^D. Then k, ε are uniquely determined for any λ ∈ Λ_j. Thus, for all λ ∈ Λ := ∪_{j∈Z} Λ_j, we can let ψ_λ(x) = 2^{Dj/2} ψ_ε(2^j x − k). Equipped with the orthonormal basis {ψ_λ : λ ∈ Λ} of L²(R^D), we are almost ready to define Besov spaces.

For technical reasons (see, e.g., [35, Section 3.9]), we need MRAs of smoother functions than Haar wavelets, which are called r-regular. Due to space constraints, r-regularity is defined precisely in Appendix A; we note here that standard r-regular MRAs exist, such as the Daubechies wavelet [10]. We assume for the rest of the paper that the wavelets defined above are supported on [−A, A].

³We assume a good optimization algorithm for computing (1), although this is also an active area of research.

Definition 3 (Besov Space). Let 0 ≤ σ < r, and let p, q ∈ [1,∞].
Given an r-regular MRA of L²(R^D) with father and mother wavelets φ, ψ respectively, the Besov space B^σ_{p,q}(R^D) is defined as the set of functions f : R^D → R such that the wavelet coefficients

    α_k := ∫_{R^D} f(x) φ(x − k) dx for k ∈ Z^D    and    β_λ := ∫_{R^D} f(x) ψ_λ(x) dx for λ ∈ Λ

satisfy

    ‖f‖_{B^σ_{p,q}} := ‖{α_k}_{k∈Z^D}‖_{l^p} + ‖ { 2^{j(σ + D(1/2 − 1/p))} ‖{β_λ}_{λ∈Λ_j}‖_{l^p} }_{j∈N} ‖_{l^q} < ∞.

The quantity ‖f‖_{B^σ_{p,q}} is called the Besov norm of f, and, for any L > 0, we write B^σ_{p,q}(L) to denote the closed Besov ball B^σ_{p,q}(L) = {f ∈ B^σ_{p,q} : ‖f‖_{B^σ_{p,q}} ≤ L}. When the constant L is unimportant (e.g., for rates of convergence), B^σ_{p,q} denotes a ball B^σ_{p,q}(L) of finite but arbitrary radius L.

2.2 Formal Problem Statement

Having defined Besov spaces, we now formally state the statistical problem we study in this paper. Fix an r-regular MRA. We observe n IID samples X_1, ..., X_n ∼ p from an unknown probability density p lying in a Besov ball B^{σ_g}_{p_g,q_g}(L_g) with σ_g < r. We want to estimate p, measuring error with an IPM d_{B^{σ_d}_{p_d,q_d}(L_d)}. Specifically, for general σ_d, σ_g, p_d, p_g, q_d, q_g, we seek to bound the minimax risk

    M(B^{σ_d}_{p_d,q_d}, B^{σ_g}_{p_g,q_g}) := inf_{p̂} sup_{p ∈ B^{σ_g}_{p_g,q_g}} E_{X_{1:n}} [ d_{B^{σ_d}_{p_d,q_d}}(p, p̂(X_1, ..., X_n)) ]    (2)

of estimating densities in F_g = B^{σ_g}_{p_g,q_g}, where the infimum is taken over all estimators p̂(X_1, ..., X_n). In the rest of this paper, we suppress dependence of p̂(X_1, ..., X_n) on X_1, ..., X_n, writing simply p̂.

3 Related Work

The current paper unifies, extends, or improves upon a number of recent and classical results in the nonparametric density estimation literature. Two areas of prior work are most relevant:

Nonparametric estimation over inhomogeneous smoothness spaces  First is the classical study of estimation over inhomogeneous smoothness spaces under Lp losses. Nemirovski [40] first noticed that, over classes of regression functions with inhomogeneous (i.e., spatially-varying) smoothness, many widely-used regression estimators, called "linear" estimators (defined precisely in Section 4.2), are provably unable to converge at the minimax optimal rate, in L² loss. Donoho et al. [13] identified a similar phenomenon for estimating probability densities in a Besov space B^{σ_g}_{p_g,q_g} on R under L^{p_d′} losses with p_d′ > p_g, corresponding to the case σ_d = 0, D = 1 in our work. [13] also showed that the wavelet-thresholding estimator we consider in Section 4.1 does converge at the minimax optimal rate. We generalize these phenomena to many new loss functions; in many cases, linear estimators continue to be sub-optimal, whereas the wavelet-thresholding estimator continues to be optimal. We also show that sub-optimality of linear estimators is more pronounced in higher dimensions.

Distribution estimation under IPMs  The second, more recent body of results [27, 46, 28] concerns nonparametric distribution estimation under IPM losses.
Prior work focused on the case where F and P are both Sobolev ellipsoids, corresponding to the case p_d = q_d = p_g = q_g = 2 in our work. Notably, over these smaller spaces (of homogeneous smoothness), the linear estimators mentioned above are minimax rate-optimal. Perhaps the most important finding of these works is that the curse of dimensionality pervading classical nonparametric statistics is significantly diminished under weaker loss functions than Lp losses (namely, many IPMs). For example, Singh et al. [46] showed that, when σ_d > D/2, one can estimate P at the parametric rate n^{−1/2} in the loss d_{B^{σ_d}_{2,2}}, without any regularity assumptions whatsoever on the probability distribution P. We generalize this to other losses d_{B^{σ_d}_{p_d,q_d}}.

These papers were motivated in part by a desire to understand theoretical properties of GANs, and, in particular, Liang [27] and Singh et al. [46] helped establish (1) as a valid statistical model of GANs. In particular, we note that Singh et al. [46] showed that the implicit generative modeling problem ("sampling"), in terms of which GANs are usually framed, is equivalent, in terms of minimax convergence rates, to nonparametric density estimation, justifying our focus on the latter problem in this paper. We show, in Section 4.3, that, given a sufficiently good optimization algorithm, GANs based on appropriately constructed deep neural networks can learn Besov densities at the minimax optimal rate. In this context, our results are among the first to suggest theoretically that GANs can outperform classical density estimators (namely, the linear estimators mentioned above).

Liu et al. [31] provided general sufficient conditions for weak consistency of GANs in a generalization of the model (1).
Since many IPMs, such as Wasserstein distances, metrize weak convergence of probability measures under mild additional assumptions (Villani [52]), this implies consistency under these IPMs. However, Liu et al. [31] did not study rates of convergence.

We end this section with a brief survey of known results for estimating distributions under specific Besov IPM losses, noting that our results (Equations (3) and (4) below) generalize all these rates:

1. Lp Distances: If F_d = L^{p′} = B^0_{p′,p′}, then, for distributions P, Q with densities p, q ∈ L^p, d_{F_d}(P, Q) = ‖p − q‖_{L^p}. These are the most well-studied losses in nonparametric statistics, especially for p ∈ {1, 2, ∞} [41, 53, 51]. [13] studied the minimax rate of convergence of density estimation over Besov spaces under Lp losses, obtaining minimax rates n^{−σ_g/(2σ_g+D)} + n^{−(σ_g+D(1−1/p_g−1/p_d))/(2σ_g+D(1−2/p_g))} over general estimators, and n^{−σ_g/(2σ_g+D)} + n^{−(σ_g−D/p_g+D/p_d′)/(2σ_g+D−2D/p_g+2D/p_d′)} when restricted to linear estimators.

2. Wasserstein Distance: If F_d = C¹(1) ≍ B¹_{∞,∞} is the space of 1-Lipschitz functions, then d_{F_d} is the 1-Wasserstein or Earth mover's distance (via the Kantorovich dual formulation [23, 52]). A long line of work has established convergence rates of the empirical distribution to the true distribution in spaces as general as unbounded metric spaces [54, 25, 45]. In the Euclidean setting, this is well understood [14, 2, 18], although, to the best of our knowledge, minimax lower bounds have been proven only recently [45]; this setting intersects with our work in the case σ_d = 1, σ_g = 0, p_d = ∞, matching our minimax rate of n^{−1/D} + n^{−1/2}. More general p-Wasserstein distances W_p (p ≥ 1) cannot be expressed exactly as IPMs, but our results complement recent results of Weed and Berthet [55], who showed that, for densities p and q that are bounded above and below (i.e., 0 < m ≤ p, q ≤ M < ∞), the bounds M^{−1/p′} d_{B¹_{p′,∞}}(p, q) ≤ W_p(p, q) ≤ m^{−1/p′} d_{B¹_{p′,1}}(p, q) hold; for such densities, our rates match theirs (n^{−(1+σ_g)/(2σ_g+D)} + n^{−1/2}) up to polylogarithmic factors. Weed and Berthet [55] showed that, without the lower-boundedness assumption (m > 0), minimax rates under W_p are strictly slower (by a polynomial factor in n).

In machine learning applications, Arora et al. [5] recently used this rate to argue that, for data from a continuous distribution, Wasserstein GANs [4] cannot generalize at a rate faster than n^{−1/D} (at least without additional regularization, as we use in Theorem 9). A variant in which F_d ⊂ C¹ ∩ L^∞ is both uniformly bounded and 1-Lipschitz gives rise to the Dudley metric [15], which has also been suggested for use in GANs [1]. Finally, we note that the more general distances induced by F_d = B^{σ_d}_{∞,∞} have been useful for deriving central limit theorems [7, Section 4.8].

3. Kolmogorov-Smirnov Distance: If F_d = BV ≍ B¹_{1,·} is the set of functions of bounded variation, then, in the 1-dimensional case, d_{F_d} is the well-known Kolmogorov-Smirnov metric [9], and so the famous Dvoretzky–Kiefer–Wolfowitz inequality [33] gives a parametric convergence rate of n^{−1/2}.

4. Sobolev Distances: If F_d = W^{σ_d,2} = B^{σ_d}_{2,2} is a Hilbert-Sobolev space, for σ_d ∈ R, then d_{F_d} = ‖· − ·‖_{W^{−σ_d,2}} is the corresponding negative Sobolev pseudometric [57]. Recent work [27, 46, 28] established a minimax rate of n^{−(σ_g+σ_d)/(2σ_g+1)} + n^{−1/2} when F_g = W^{σ_g,2} is also a Hilbert-Sobolev space.

4 Main Results

The three main technical contributions of this paper are as follows:

1. We prove lower and upper bounds (Theorems 4 and 5, respectively) on minimax convergence rates of distribution estimation under IPM losses when the distribution class P = B^{σ_g}_{p_g,q_g} and the discriminator class F = B^{σ_d}_{p_d,q_d} are Besov spaces; these rates match up to polylogarithmic factors in the sample size n. Our upper bounds use the wavelet-thresholding estimator proposed in Donoho et al. [13], which we show converges at the optimal rate for a much wider range of losses than previously known. Specifically, if M(F, P) denotes the minimax risk (2), we show that, for p_d′ ≥ p_g, σ_g ≥ D/p_g,

    M(B^{σ_d}_{p_d,q_d}, B^{σ_g}_{p_g,q_g}) ≍ max{ n^{−1/2}, n^{−(σ_g+σ_d)/(2σ_g+D)}, n^{−(σ_g+σ_d+D(1−1/p_g−1/p_d))/(2σ_g+D(1−2/p_g))} }.    (3)

2. We show (Theorem 7) that, for p_d′ ≥ p_g and σ_g ≥ D/p_g, no estimator in a large class of distribution estimators, called "linear estimators", can converge at a rate faster than

    M_lin(B^{σ_d}_{p_d,q_d}, B^{σ_g}_{p_g,q_g}) ≳ n^{−(σ_g+σ_d−D/p_g+D/p_d′)/(2σ_g+D(1−2/p_g)+2D/p_d′)}.    (4)

"Linear estimators" include the empirical distribution, kernel density estimates with uniform bandwidth, and the orthogonal series estimators recently used in Liang [27] and Singh et al. [46]. The lower bound (4) implies that, in many settings (discussed in Section 5), linear estimators converge at sub-optimal rates.
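The gap between (3) and (4) can be checked numerically. The following sketch (our own code, with the exponents transcribed from (3) and (4); valid in the regime p_d′ ≥ p_g, σ_g ≥ D/p_g) compares the two exponents in the setting D = 4, p_d = 1.2, p_g = 2 used in Figure 1 below, where the linear bound (4) gives a strictly smaller exponent, i.e., a strictly slower rate:

```python
def general_rate_exponent(sigma_d, p_d, sigma_g, p_g, D):
    """Exponent a with M ~ n^{-a} in (3): the max of the three rate terms
    corresponds to the minimum of the three candidate exponents."""
    dense = (sigma_g + sigma_d) / (2 * sigma_g + D)
    sparse = (sigma_g + sigma_d + D * (1 - 1 / p_g - 1 / p_d)) / (
        2 * sigma_g + D * (1 - 2 / p_g))
    return min(0.5, dense, sparse)

def linear_rate_exponent(sigma_d, p_d, sigma_g, p_g, D):
    """Exponent of the linear-estimator lower bound (4); uses 1/p_d' = 1 - 1/p_d."""
    inv_pd_conj = 1 - 1 / p_d
    return (sigma_g + sigma_d - D / p_g + D * inv_pd_conj) / (
        2 * sigma_g + D * (1 - 2 / p_g) + 2 * D * inv_pd_conj)

# D = 4, p_d = 1.2, p_g = 2, with sigma_g = 2 (= D/p_g) and sigma_d = 1:
gen = general_rate_exponent(1.0, 1.2, 2.0, 2.0, 4.0)   # 0.375
lin = linear_rate_exponent(1.0, 1.2, 2.0, 2.0, 4.0)    # ≈ 0.3125 < 0.375
print(gen, lin)
```

Since a smaller exponent means a slower rate, linear estimators are strictly sub-optimal at this parameter setting.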
This effect is especially pronounced when the data dimension D is large and the distribution P has relatively sparse support (e.g., if P is supported near a low-dimensional manifold).

3. We show that the minimax convergence rate can be achieved by a GAN with generator and discriminator networks of bounded size, after some regularization. As one of the first theoretical results separating performance of GANs from that of classic nonparametric tools such as kernel methods, this may help explain GANs' successes with high-dimensional data such as images.

4.1 Minimax Rates over Besov Spaces

We now present our main lower and upper bounds for estimating densities that live in a Besov space under a Besov IPM loss. First, we have the following lower bound on the convergence rate:

Theorem 4 (Lower Bound). Let r > σ_g ≥ D/p_g. Then,

    M(B^{σ_d}_{p_d,q_d}, B^{σ_g}_{p_g,q_g}) ≳ max( n^{−(σ_g+σ_d)/(2σ_g+D)}, (log n / n)^{(σ_g+σ_d+D−D/p_g−D/p_d)/(2σ_g+D−2D/p_g)} ).    (5)

Before giving a corresponding upper bound, we describe the estimator on which it depends.

Wavelet-Thresholding: Our upper bound uses the wavelet-thresholding estimator proposed by [13]:

    p̂_n = Σ_{k∈Z} α̂_k φ_k + Σ_{j=0}^{j_0} Σ_{λ∈Λ_j} β̂_λ ψ_λ + Σ_{j=j_0}^{j_1} Σ_{λ∈Λ_j} β̃_λ ψ_λ.    (6)

That is, p̂_n estimates p via its truncated wavelet expansion, where α̂_k = (1/n) Σ_{i=1}^n φ_k(X_i), β̂_λ = (1/n) Σ_{i=1}^n ψ_λ(X_i), and β̃_λ = β̂_λ 1{β̂_λ > √(j/n)} are empirical estimates of the respective coefficients of the wavelet expansion of p.
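As a purely illustrative sketch of the structure of (6), the following estimates a density on [0, 1] from its empirical Haar coefficients, keeping coarse levels and hard-thresholding fine levels at √(j/n). (The Haar wavelet is not r-regular, so this toy version falls outside the assumptions of the theorems; the function names and level choices are ours.)

```python
import numpy as np

def haar_phi(x):                      # father wavelet: 1_[0,1)
    x = np.asarray(x)
    return ((0 <= x) & (x < 1)).astype(float)

def haar_psi(x):                      # mother wavelet
    return haar_phi(2 * x) - haar_phi(2 * x - 1)

def wavelet_threshold_density(X, j0=2, j1=6):
    """1-D sketch of the estimator (6): empirical Haar coefficients, with
    the linear part kept up to level j0 and hard thresholding at sqrt(j/n)
    for levels j0..j1 (hyperparameter choices here are illustrative)."""
    n = len(X)
    terms = [(np.mean(haar_phi(X)), haar_phi)]        # alpha_0 * phi (k = 0 on [0,1])
    for j in range(j1 + 1):
        for k in range(2 ** j):
            psi_jk = lambda x, j=j, k=k: 2 ** (j / 2) * haar_psi(2 ** j * np.asarray(x) - k)
            beta = np.mean(psi_jk(X))                 # empirical wavelet coefficient
            if j < j0 or abs(beta) > np.sqrt(j / n):  # keep coarse / large coefficients
                terms.append((beta, psi_jk))
    return lambda x: sum(c * f(x) for c, f in terms)

rng = np.random.default_rng(1)
X = rng.beta(2, 5, size=5000)                         # a smooth density on [0, 1]
p_hat = wavelet_threshold_density(X)
mid = (np.arange(512) + 0.5) / 512                    # midpoint (Riemann) grid
print(round(float(np.mean(p_hat(mid))), 3))           # total mass on [0, 1]; ≈ 1.0
```

Because the mother-wavelet terms integrate to zero, the estimate's total mass equals the empirical father-wavelet coefficient, so it automatically integrates to one over [0, 1].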
As [13] first showed, attaining optimality over Besov spaces requires truncating high-resolution terms (of order j ∈ [j_0, j_1]) when their empirical estimates are too small; this "nonlinear" part of the estimator distinguishes it from the "linear" estimators we study in the next section. The hyperparameters j_0 and j_1 are set to j_0 = (1/(2σ_g+D)) log₂ n and j_1 = (1/(2σ_g+D−2D/p_g)) log₂ n.

Theorem 5 (Upper Bound). Let r > σ_g ≥ D/p_g and p_d′ > p_g. Then, for a constant C depending only on p_d′, σ_g, p_g, q_g, D, L_g, L_d, and ‖ψ_ε‖_{p_d′},

    M(B^{σ_d}_{p_d,q_d}, B^{σ_g}_{p_g,q_g}) ≤ C √(log n) ( n^{−(σ_g+σ_d)/(2σ_g+D)} + n^{−(σ_g+σ_d−D/p_g+D/p_d′)/(2σ_g+D−2D/p_g)} + n^{−1/2} ).    (7)

We will comment only briefly on Theorems 4 and 5 here, leaving extended discussion for Section 5. First, note that the lower bound (5) and upper bound (7) are essentially tight; they differ only by a polylogarithmic factor in n. Second, both bounds contain two main terms of interest. The simpler term, n^{−(σ_g+σ_d)/(2σ_g+D)}, matches the rate observed in the Sobolev case by Singh et al. [46]. The other term is unique to more general Besov spaces. Depending on the values of D, σ_d, σ_g, p_d, and p_g, one of these two terms dominates, leading to two main regimes of convergence rates, which we call the "Sparse" regime and the "Dense" regime. Section 5 discusses these and other interesting phenomena in detail.

4.2 Minimax Rates of Linear Estimators over Besov Spaces

We now show that, for many Besov densities and IPM losses, many widely-used nonparametric density estimators cannot converge at the optimal rate (5).
These estimators are as follows:

Definition 6 (Linear Estimator). Let (Ω, F, P) be a probability space. An estimate P̂ of P is said to be linear if there exist functions T_i(X_i, ·) : F → R such that, for all measurable A ∈ F,

    P̂(A) = Σ_{i=1}^n T_i(X_i, A).    (8)

Classic examples of linear estimators include the empirical distribution (T_i(X_i, A) = (1/n) 1{X_i ∈ A}), the kernel density estimate (T_i(X_i, A) = (1/n) ∫_A K(X_i, ·) for some bandwidth h > 0 and smoothing kernel K : X × X → R), and the orthogonal series estimate (T_i(X_i, A) = (1/n) Σ_{j=1}^J g_j(X_i) ∫_A g_j for some cutoff J and orthonormal basis {g_j}_{j=1}^∞ (e.g., Fourier, wavelet, or polynomial) of L²(Ω)).

Theorem 7 (Minimax rate for Linear Estimators). Suppose r > σ_g ≥ D/p_g. Then

    M_lin(B^{σ_d}_{p_d,q_d}, B^{σ_g}_{p_g,q_g}) := inf_{P̂_lin} sup_{p∈F_g} E_{X_{1:n}} [ d_{F_d}(μ_p, P̂) ] ≍ n^{−1/2} + n^{−(σ_g+σ_d)/(2σ_g+D)} + n^{−(σ_g+σ_d−D/p_g+D/p_d′)/(2σ_g+D−2D/p_g+2D/p_d′)},

where the inf is over all linear estimates of p ∈ F_g, and μ_p is the distribution with density p.

One can check that the above error decays no faster than n^{−(σ_g+σ_d+D−D/p_g−D/p_d)/(2σ_g+D−2D/p_g+2D/p_d′)}. Comparing with the rate in Theorem 5, this implies that, in certain cases, the convergence rate for linear estimators is strictly slower than that for general estimators; i.e., linear estimators fail to achieve the minimax optimal rate over certain Besov spaces. We defer detailed discussion of this phenomenon to Section 5.

4.3 Upper Bounds on a Generative Adversarial Network

Pioneered by Goodfellow et al.
[20] as a mechanism for applying deep neural networks to the problem of unsupervised image generation, generative adversarial networks (GANs) have since been widely applied not only to computer vision [59, 24], but also to such diverse problems and data as machine translation using natural language data [56], discovering drugs [22] and designing materials [44] using molecular structure data, inferring expression levels using gene expression data [11], and sharing patient data under privacy constraints using electronic health records [8]. Besides the Jensen-Shannon divergence used by [20], many GAN formulations have been proposed based on minimizing other losses, including the Wasserstein metric [4, 21], total variation distance [30], χ² divergence [32], MMD [26], Dudley metric [1], and Sobolev metric [37]. The diversity of data types and losses with which GANs have been used motivates studying GANs in a very general (nonparametric) setting. In particular, Besov spaces likely comprise the largest widely-studied family of nonparametric smoothness classes; indeed, most of the losses listed above are Besov IPMs.

GANs are typically described as a two-player minimax game between a generator network N_g and a discriminator network N_d; we denote by F_d the class of functions that can be implemented by N_d and by F_g the class of distributions that can be implemented by N_g. A recent line of work has argued that a natural statistical model for a GAN as a distribution estimator is

    P̂ := argmin_{Q∈F_g} sup_{f∈F_d} E_{X∼Q}[f(X)] − E_{X∼P̃_n}[f(X)],    (9)

where P̃_n is an (appropriately regularized) empirical distribution, and that, when F_d and F_g respectively approximate classes F and P well, one can bound the risk, under F-IPM loss, of estimating distributions in P by (9) [31, 27, 46, 28]. We emphasize that, as Singh et al. [46] showed, the minimax risk in this framework is identical to that under the "sampling" (or "implicit generative modeling" [36]) framework in terms of which GANs are usually cast.⁴

In this section, we show such a result for Besov spaces; namely, we show the existence of a particular GAN (specifically, a sequence of GANs, necessarily growing with the sample size n) that estimates distributions in a Besov space at the minimax optimal rate (7) under Besov IPM losses. This construction uses a standard neural network architecture (a fully-connected neural network with rectified linear unit (ReLU) activations) and a simple data regularizer P̃_n, namely the wavelet-thresholding estimator described in Section 4.1. Our results extend those of Liang [27] and Singh et al. [46], for Wasserstein loss over Sobolev spaces, to general Besov IPM losses over Besov spaces. We begin with a formal definition of the network architectures that we consider:

⁴As in these previous works, we assume implicitly that the optimum (9) can be computed; this complex saddle-point problem is itself the subject of a related but distinct and highly active area of work [39, 3, 29, 19].

Definition 8.
A fully-connected ReLU network f_{(A_1,...,A_H),(b_1,...,b_H)} : R^W → R has the form

    f_{(A_1,...,A_H),(b_1,...,b_H)}(x) = A_H η(A_{H−1} η(··· η(A_1 x + b_1) ···) + b_{H−1}) + b_H,

where, for each ℓ ∈ [H−1], A_ℓ ∈ R^{W×W}, A_H ∈ R^{1×W}, and the ReLU operation η(x) = max{x, 0} is applied element-wise to vectors in R^W.

The size of f_{(A_1,...,A_H),(b_1,...,b_H)} can be measured in terms of the following four (hyper)parameters: the depth H, the width W, the sparsity S := Σ_{ℓ∈[H]} ‖A_ℓ‖_{0,0} + ‖b_ℓ‖_0 (i.e., the total number of non-zero weights), and the maximum weight B := max{‖A_ℓ‖_{∞,∞}, ‖b_ℓ‖_∞ : ℓ ∈ [H]}. For given size parameters H, W, S, B, we write Φ(H, W, S, B) to denote the set of functions satisfying the corresponding size constraints.

Our results rely on a recent construction (Lemma 17 in the Appendix), by [49], of a fully-connected ReLU network that approximates Besov functions. [49] used this approximation to bound the risk of a neural network for nonparametric regression over Besov spaces, under L^r loss. Here, we use this approximation result (Lemma 17) to bound the risk of a GAN for nonparametric distribution estimation over Besov spaces, under the much larger class of Besov IPM losses. Our precise result is as follows:

Theorem 9 (Convergence Rate of a Well-Optimized GAN). Fix a Besov density class B^{σ_g}_{p_g,q_g} with σ_g > D/p_g and a discriminator class B^{σ_d}_{p_d,q_d} with σ_d > D/p_d. Then, for any desired approximation error ε > 0, one can construct a GAN p̂ of the form (9) (with P̃_n = p̃_n) with discriminator network N_d ∈ Φ(H_d, W_d, S_d, B_d) and generator network N_g ∈ Φ(H_g, W_g, S_g, B_g), such that, for all p ∈ B^{σ_g}_{p_g,q_g},

    E[ d_{B^{σ_d}_{p_d,q_d}}(p̂, p) ] ≲ ε + E[ d_{B^{σ_d}_{p_d,q_d}}(p̃_n, p) ],

where H_d, H_g grow logarithmically with 1/ε, W_d, S_d, B_d, W_g, S_g, B_g grow polynomially with 1/ε, and the implied constant C > 0 depends only on B^{σ_d}_{p_d,q_d} and B^{σ_g}_{p_g,q_g}.

This theorem implies that the rate of convergence of the GAN estimate p̂ of the form (9) is the same as the convergence rate of the estimator p̃_n with which the GAN estimate is generated (here we assume that all distributions have densities). Therefore, given our upper bound from Theorem 5, we have the following direct consequence.

Corollary 10. For a Besov density class B^{σ_g}_{p_g,q_g} with σ_g > D/p_g and a discriminator class B^{σ_d}_{p_d,q_d} with σ_d > D/p_d, there exists an appropriately constructed GAN estimate p̂ such that

    d_{F_d}(p̂, p) ≲ n^{−η(D,σ_d,p_d,σ_g,p_g)} √(log n),

where η(D, σ_d, p_d, σ_g, p_g) = min{ 1/2, (σ_g+σ_d)/(2σ_g+D), (σ_g+σ_d−D/p_g+D/p_d′)/(2σ_g+D(1−2/p_g)) } is the exponent from (7).

In other words, there is a GAN estimate that is minimax rate optimal for a smooth class of densities over an IPM generated by a smooth class of discriminator functions.

5 Discussion of Results

In this section, we discuss some general phenomena that can be gleaned from our technical results. First, we note that, perhaps surprisingly, q_d and q_g do not appear in our bounds. Tao [50] suggests that q_d and q_g may have only logarithmic effects (contrasted with the polynomial effects of σ_d, p_d, σ_g, and p_g).
Thus, a more fine-grained analysis to close the polylogarithmic gap between our lower and upper bounds for general estimators (Theorems 4 and 5) might require incorporating qd and qg. On the other hand, the parameters σd, pd, σg, and pg each play a significant role in determining minimax convergence rates, in both the linear and general cases. We first discuss each of these parameters independently, and then discuss some interactions between them.

[Figure 1 (two panels): (a) General Estimators; (b) Linear Estimators. Minimax convergence rates as functions of discriminator smoothness σd and distribution smoothness σg, for (a) general and (b) linear estimators, in the case D = 4, pd = 1.2, pg = 2. Color shows the exponent of the minimax convergence rate (i.e., α(σd, σg) such that M(B^{σd}_{1.2,qd}(R^D), B^{σg}_{2,qg}(R^D)) ≍ n^{−α(σd,σg)}), ignoring polylogarithmic factors.]

Roles of the smoothness orders σd and σg: As a visual aid for understanding our results, Figure 1 shows phase diagrams of minimax convergence rates, as functions of discriminator smoothness σd and distribution smoothness σg, in the illustrative case D = 4, pd = 1.2, pg = 2. When 1/pg + 1/pd > 1, a minimum total smoothness σd + σg ≥ D(1/pd + 1/pg − 1) is needed for consistent estimation to be possible; this fails in the "Infeasible" region of the phase diagrams. Intuitively, this occurs because Fd is not contained in the topological dual F′g of Fg. For linear estimators, even greater smoothness σd + σg ≥ D(1/pd + 1/pg) is needed. At the other extreme, for highly smooth discriminator functions, both linear and nonlinear estimators converge at the parametric rate O(n^{−1/2}), corresponding to the "Parametric" region.
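The phase structure of Figure 1(a) can be checked numerically. Below is a minimal sketch (the helper name rate_exponent is ours, not from the paper) computing the exponent η(D, σd, pd, σg, pg) = min{ 1/2, (σg + σd)/(2σg + D), (σg + σd + D(1 − 1/pd − 1/pg))/(2σg + D(1 − 2/pg)) } from Corollary 10, at the Figure 1 defaults D = 4, pd = 1.2, pg = 2; the sparse term's numerator vanishes exactly at the feasibility threshold σd + σg = D(1/pd + 1/pg − 1) stated above:

```python
def rate_exponent(sigma_d, sigma_g, D=4.0, p_d=1.2, p_g=2.0):
    """Exponent alpha such that the minimax rate is n^(-alpha), ignoring
    polylog factors: the minimum of the parametric, dense, and sparse terms."""
    dense = (sigma_g + sigma_d) / (2 * sigma_g + D)
    # Sparse-term numerator; it is <= 0 exactly in the "Infeasible" region,
    # i.e., when sigma_d + sigma_g <= D * (1/p_d + 1/p_g - 1).
    sparse_num = sigma_g + sigma_d + D * (1 - 1 / p_d - 1 / p_g)
    if sparse_num <= 0:
        return None  # consistent estimation is impossible
    sparse = sparse_num / (2 * sigma_g + D * (1 - 2 / p_g))
    return min(0.5, dense, sparse)

# Highly smooth discriminators give the parametric rate n^(-1/2):
assert rate_exponent(sigma_d=8.0, sigma_g=4.0) == 0.5
# Left of the line sigma_g + 3*sigma_d = D, the sparse term is the minimum:
assert rate_exponent(sigma_d=0.5, sigma_g=1.0) < (1.0 + 0.5) / (2 * 1.0 + 4.0)
# Too little total smoothness lands in the "Infeasible" region:
assert rate_exponent(sigma_d=0.1, sigma_g=0.2) is None
```

Sweeping sigma_d and sigma_g over a grid with this function reproduces the color map of Figure 1(a).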
In between, rates for linear estimators vary smoothly with σd and σg, while rates for nonlinear estimators exhibit another phase transition on the line σg + 3σd = D; to the left lies the "Sparse" case, in which estimation error is dominated by a small number of large errors at locations where the distribution exhibits high local variation; to the right lies the "Dense" case, where error is relatively uniform on the sample space.

The left boundary σd = 0 corresponds to the classical results of Donoho et al. [13], who consequently identified the "Infeasible", "Sparse", and "Dense" phases, but not the "Parametric" phase. When restricting to linear estimators, the "Infeasible" region grows and the "Parametric" region shrinks.

Role of the powers pd and pg: At one extreme (pd = ∞) lie L^1 or total variation loss (σd = 0), Wasserstein loss (σd = 1), and its higher-order generalizations, for which we showed the rate

M(B^{σd}_{∞,qd}, B^{σg}_{pg,qg}) ≍ n^{−(σg+σd)/(2σg+D)} + n^{−1/2},

generalizing the rate first shown by Singh et al. [46] for Hilbert-Sobolev classes to other distribution classes, such as Fg = BV.
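As a concrete sanity check on the display above (a worked instance, not a new result), take Wasserstein loss (pd = ∞, σd = 1) over once-differentiable densities (σg = 1) in dimension D = 1:

```latex
n^{-\frac{\sigma_g + \sigma_d}{2\sigma_g + D}} + n^{-1/2}
  = n^{-\frac{1 + 1}{2 \cdot 1 + 1}} + n^{-1/2}
  \asymp n^{-2/3},
```

consistent with the n^{−(σ+1)/(2σ+D)} rates known for density estimation under Wasserstein loss [55].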
Because discriminator functions in this class exhibit homogeneous smoothness, these losses effectively weight the sample space relatively uniformly in importance, the "Sparse" region in Figure 1a vanishes, and linear estimators can perform optimally.

At the other extreme (pd = 1) lie L^∞ loss (σd = 0), Kolmogorov-Smirnov loss (σd = 1), and its higher-order generalizations, for which we have shown that the rate is always

M(B^{σd}_{1,qd}, B^{σg}_{pg,qg}) ≍ n^{−(σg+σd+D(1−1/pd−1/pg))/(2σg+D(1−2/pg))} + n^{−1/2};

except in the parametric regime (D ≤ 2σd), this rate differs from that of Singh et al. [46]. Because discriminator functions can have inhomogeneous smoothness, and hence weight some portions of the sample space much more heavily than others, the "Dense" region in Figure 1a vanishes, and linear estimators are always sub-optimal. We note that Sadhanala et al.
[43] recently proposed using these higher-order distances (integer σd > 1) in a fast two-sample test that generalizes the well-known Kolmogorov-Smirnov test, improving sensitivity to the tails of distributions; our results may provide a step towards understanding theoretical properties of this test.

Comparison of linear and general rates: Letting σ′g := σg − D(1/pg + 1/pd), one can write the sparse term of the linear minimax rate in the same form as the Dense rate, replacing σg with σ′g:

M_lin(B^{σd}_{pd,qd}, B^{σg}_{pg,qg}) ≍ n^{−(σ′g+σd)/(2σ′g+D)}.    (10)

This is not a coincidence; Morrey's inequality [17, Section 5.6.2] in functional analysis tells us that, for general σg > D(1/pg + 1/pd), σ′g := σg − D(1/pg + 1/pd) is the largest possible value such that the embedding B^{σg}_{pg,pg} ⊆ B^{σ′g}_{pd,pd} holds. In the extreme case pd = ∞ (corresponding to generalizations of total variation loss), one can interpret the rate (10) as saying that linear estimators benefit only from homogeneous (e.g., Hölder) smoothness, and not from weaker inhomogeneous (e.g., Besov) smoothness. For general pd, linear estimators can still benefit from inhomogeneous smoothness, but to a lesser extent than general minimax optimal estimators.

Conclusions: We have shown, up to log factors, unified minimax convergence rates for a large class of pairs of Fd-IPM losses and distribution classes Fg. By doing so, we have generalized several phenomena that had been observed in special cases previously.
First, under sufficiently weak loss functions, distribution estimation is possible at the parametric rate O(n^{−1/2}) even over very large nonparametric distribution classes. Second, in many cases, optimal estimation requires estimators that adapt to inhomogeneous smoothness conditions; many commonly used distribution estimators fail to do this, and hence converge at sub-optimal rates, or even fail to converge. Finally, GANs with sufficiently large fully-connected ReLU neural networks using wavelet-thresholding regularization perform statistically minimax rate-optimal distribution estimation over inhomogeneous nonparametric smoothness classes (assuming the GAN optimization problem can be solved accurately). Importantly, since GANs optimize IPM losses much weaker than traditional L^p losses, they may be able to learn reasonable approximations of even high-dimensional distributions with tractable sample complexity, perhaps explaining why they excel in the case of image data. Thus, our results suggest that the curse of dimensionality may be less severe than indicated by classical nonparametric lower bounds.

References

[1] Ehsan Abbasnejad, Javen Shi, and Anton van den Hengel. Deep Lipschitz networks and Dudley GANs, 2018. URL https://openreview.net/pdf?id=rkw-jlb0W.

[2] Miklós Ajtai, János Komlós, and Gábor Tusnády. On optimal matchings. Combinatorica, 4(4):259–264, 1984.

[3] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[5] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs).
arXiv preprint arXiv:1703.00573, 2017.

[6] Leon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. In Braverman Readings in Machine Learning. Key Ideas from Inception to Current State, pages 229–268. Springer, 2018.

[7] Louis HY Chen, Larry Goldstein, and Qi-Man Shao. Normal approximation by Stein's method. Springer Science & Business Media, 2010.

[8] Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F Stewart, and Jimeng Sun. Generating multi-label discrete patient records using generative adversarial networks. arXiv preprint arXiv:1703.06490, 2017.

[9] Wayne W Daniel et al. Applied nonparametric statistics. Houghton Mifflin, 1978.

[10] Ingrid Daubechies. Ten lectures on wavelets, volume 61. SIAM, 1992.

[11] Kamran Ghasedi Dizaji, Xiaoqian Wang, and Heng Huang. Semi-supervised generative adversarial network for gene expression inference. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1435–1444. ACM, 2018.

[12] Hao-Wen Dong and Yi-Hsuan Yang. Towards a deeper understanding of adversarial losses. arXiv preprint arXiv:1901.08753, 2019.

[13] David L Donoho, Iain M Johnstone, Gérard Kerkyacharian, and Dominique Picard. Density estimation by wavelet thresholding. The Annals of Statistics, pages 508–539, 1996.

[14] RM Dudley. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969.

[15] RM Dudley. Speeds of metric probability convergence. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 22(4):323–332, 1972.

[16] GK Dziugaite, DM Roy, and Z Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Uncertainty in Artificial Intelligence – Proceedings of the 31st Conference, UAI 2015, pages 258–267, 2015.

[17] Lawrence C Evans.
Partial differential equations. American Mathematical Society, 2010.

[18] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.

[19] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Gabriel Huang, Remi Lepriol, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740, 2018.

[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[21] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[22] Artur Kadurin, Sergey Nikolenko, Kuzma Khrabrov, Alex Aliper, and Alex Zhavoronkov. druGAN: An advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Molecular Pharmaceutics, 14(9):3098–3104, 2017.

[23] Leonid Vasilevich Kantorovich and Gennady S Rubinstein. On a space of completely additive functions. Vestnik Leningrad. Univ, 13(7):52–59, 1958.

[24] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2017.

[25] Jing Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. arXiv preprint arXiv:1804.10556, 2018.

[26] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos.
MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.

[27] Tengyuan Liang. How well can generative adversarial networks (GAN) learn densities: A nonparametric view. arXiv preprint arXiv:1712.08244, 2017.

[28] Tengyuan Liang. On how well generative adversarial networks learn densities: Nonparametric and parametric results. arXiv preprint arXiv:1811.03179, 2018.

[29] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.

[30] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, pages 1505–1514, 2018.

[31] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5551–5559, 2017.

[32] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.

[33] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.

[34] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406, 2018.

[35] Yves Meyer. Wavelets and operators, volume 1. Cambridge University Press, 1992.

[36] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[37] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN.
arXiv preprint arXiv:1711.04894, 2017.

[38] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

[39] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5585–5595, 2017.

[40] Arkadi S Nemirovski. Nonparametric estimation of smooth regression functions. Izv. Akad. Nauk. SSR Teckhn. Kibernet, 3:50–60, 1985.

[41] Arkadi S Nemirovski. Topics in non-parametric statistics. Ecole d'Eté de Probabilités de Saint-Flour, 28:85, 2000.

[42] David Pollard. Empirical processes: theory and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–86. JSTOR, 1990.

[43] Veeranjaneyulu Sadhanala, Aaditya Ramdas, Yu-Xiang Wang, and Ryan Tibshirani. A higher-order Kolmogorov-Smirnov test. In International Conference on Artificial Intelligence and Statistics, 2019.

[44] Benjamin Sanchez-Lengeling, Carlos Outeiral, Gabriel L Guimaraes, and Alan Aspuru-Guzik. Optimizing distributions over molecular space. An objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC). ChemRxiv Preprint, 2017.

[45] Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855, 2018.

[46] Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabás Póczos. Nonparametric density estimation under adversarial losses. In Advances in Neural Information Processing Systems 31, pages 10246–10257, 2018. URL http://papers.nips.cc/paper/8225-nonparametric-density-estimation-under-adversarial-losses.pdf.

[47] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution.
arXiv preprint arXiv:1610.04490, 2016.

[48] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. Non-parametric estimation of integral probability metrics. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1428–1432. IEEE, 2010.

[49] Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033, 2018.

[50] Terence Tao. A type diagram for function spaces. https://terrytao.wordpress.com/tag/besov-spaces/, 2011.

[51] Alexandre B Tsybakov. Introduction to nonparametric estimation. Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer Series in Statistics. Springer, New York, 2009.

[52] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[53] Larry Wasserman. All of nonparametric statistics. Springer, New York, 2006.

[54] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.

[55] Jonathan Weed and Quentin Berthet. Estimation of smooth densities in Wasserstein distance. arXiv preprint arXiv:1902.01778, 2019.

[56] Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887, 2017.

[57] Kosaku Yosida. Functional analysis. Reprint of the sixth (1980) edition. Classics in Mathematics. Springer-Verlag, Berlin, 11:14, 1995.

[58] Werner Zellinger, Bernhard A Moser, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Robust unsupervised domain adaptation for neural networks via moment alignment.
Information Sciences, 2019.

[59] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.