{"title": "Minimax Estimation of Neural Net Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 3845, "page_last": 3854, "abstract": "An important class of distance metrics proposed for training generative adversarial networks (GANs) is the integral probability metric (IPM), in which the neural net distance captures the practical GAN training via two neural networks. This paper investigates the minimax estimation problem of the neural net distance based on samples drawn from the distributions. We develop the first known minimax lower bound on the estimation error of the neural net distance, and an upper bound tighter than an existing bound on the estimation error for the empirical neural net distance. Our lower and upper bounds match not only in the order of the sample size but also in terms of the norm of the parameter matrices of neural networks, which justifies the empirical neural net distance as a good approximation of the true neural net distance for training GANs in practice.", "full_text": "Minimax Estimation of Neural Net Distance\n\nKaiyi Ji\nDepartment of ECE\nThe Ohio State University\nColumbus, OH 43210\nji.367@osu.edu\n\nYingbin Liang\nDepartment of ECE\nThe Ohio State University\nColumbus, OH 43210\nliang.889@osu.edu\n\nAbstract\n\nAn important class of distance metrics proposed for training generative adversarial networks (GANs) is the integral probability metric (IPM), in which the neural net distance captures the practical GAN training via two neural networks. This paper investigates the minimax estimation problem of the neural net distance based on samples drawn from the distributions. 
We develop the first known minimax lower bound on the estimation error of the neural net distance, and an upper bound tighter than an existing bound on the estimation error for the empirical neural net distance. Our lower and upper bounds match not only in the order of the sample size but also in terms of the norm of the parameter matrices of neural networks, which justifies the empirical neural net distance as a good approximation of the true neural net distance for training GANs in practice.\n\n1 Introduction\n\nGenerative adversarial networks (GANs), first introduced by [9], have become an important technique for learning generative models from complicated real-life data. Training GANs is performed via a min-max optimization with the maximum and minimum respectively taken over a class of discriminators and a class of generators, where both discriminators and generators are modeled by neural networks. Given that the discriminator class is sufficiently large, [9] interpreted the GAN training as finding a generator such that the generated distribution ν is as close as possible to the target true distribution µ, measured by the Jensen-Shannon distance d_JS(µ, ν), as shown below:\n\nmin_{ν∈D_G} d_JS(µ, ν).   (1)\n\nInspired by such an idea, a large body of GAN models were then proposed based on various distance metrics between a pair of distributions, in order to improve the training stability and performance, e.g., [2, 3, 13, 15]. 
Among them, the integral probability metric (IPM) [19] arises as an important class of distance metrics for training GANs, which takes the following form\n\nd_F(µ, ν) = sup_{f∈F} |E_{x∼µ} f(x) − E_{x∼ν} f(x)|.   (2)\n\nIn particular, different choices of the function class F in (2) result in different distance metrics. For example, if F represents the set of all 1-Lipschitz functions, then d_F(µ, ν) corresponds to the Wasserstein-1 distance, which is used in Wasserstein-GAN (WGAN) [2]. If F represents a unit ball in a reproducing kernel Hilbert space (RKHS), then d_F(µ, ν) corresponds to the maximum mean discrepancy (MMD) distance, which is used in MMD-GAN [7, 13].\n\nPractical GAN training naturally motivates taking F in (2) to be a set F_nn of neural networks, which results in the so-called neural net distance d_Fnn(µ, ν) introduced and studied in [3, 28]. For computational feasibility, in practice d_Fnn(µ̂_n, ν̂_m) is typically adopted as an approximation (i.e., an estimator) of the true neural net distance d_Fnn(µ, ν) for the practical GAN training, where µ̂_n and ν̂_m are the empirical distributions corresponding to µ and ν, respectively, based on n samples drawn from µ and m samples drawn from ν. Thus, one important question one can ask here is how well d_Fnn(µ̂_n, ν̂_m) approximates d_Fnn(µ, ν).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada. 
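As a toy illustration of how a plug-in estimate of (2) is computed (a minimal sketch, not the paper's neural-network class: here F is replaced by a hand-picked finite family of 1-Lipschitz ramp functions, so the supremum becomes an exact maximum over the family):

```python
import random

def empirical_ipm(xs, ys, funcs):
    """Plug-in IPM estimate d_F(mu_hat, nu_hat): the sup over the function
    class of the gap between the two empirical means, here a max over a
    finite family."""
    return max(abs(sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys))
               for f in funcs)

random.seed(0)
n, m = 2000, 2000
mu_samples = [random.gauss(0.0, 1.0) for _ in range(n)]  # samples from mu
nu_samples = [random.gauss(0.5, 1.0) for _ in range(m)]  # samples from nu

# Hypothetical stand-in for F: 1-Lipschitz ramps clipped to [-1, 1].
funcs = [lambda x, t=t: max(-1.0, min(1.0, x - t)) for t in (-1.0, 0.0, 1.0)]

d_hat = empirical_ipm(mu_samples, nu_samples, funcs)  # > 0 since mu != nu
```

With a richer class such as F_nn, the inner maximization is no longer exact and is itself performed by training the discriminator, which is precisely why the quality of d_Fnn(µ̂_n, ν̂_m) as a proxy for d_Fnn(µ, ν) matters.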
If they are close, then training GANs to small d_Fnn(µ̂_n, ν̂_m) also implies small d_Fnn(µ, ν), i.e., the generated distribution ν is guaranteed to be close to the true distribution µ.\n\nTo answer this question, [3] derived an upper bound on the quantity |d_Fnn(µ, ν) − d_Fnn(µ̂_n, ν̂_m)|, and showed that d_Fnn(µ̂_n, ν̂_m) converges to d_Fnn(µ, ν) at a rate of O(n^{−1/2} + m^{−1/2}). However, the following two important questions are still left open: (a) Is the rate O(n^{−1/2} + m^{−1/2}) of convergence optimal? We certainly want to be assured that the empirical objective d_Fnn(µ̂_n, ν̂_m) used in practice does not fall short in the first place. (b) The dependence of the upper bound in [3] on the neural networks is characterized by the total number of parameters of the neural networks, which can be quite loose in light of recent work, e.g., [20, 27]. Thus, the goal of this paper is to address issue (a) by developing a lower bound on the minimax estimation error of d_Fnn(µ, ν) (see Section 2.2 for the precise formulation) and to address issue (b) by developing a tighter upper bound than [3].\n\nIn fact, the above problem can be viewed as a distance estimation problem, i.e., estimating the neural net distance d_Fnn(µ, ν) based on samples drawn i.i.d. from µ and ν, respectively. The empirical distance d_Fnn(µ̂_n, ν̂_m) serves as its plug-in estimator (i.e., substituting the true distributions by their empirical versions). We are interested in exploring the optimality of the convergence of such a plug-in estimator not only in terms of the size of samples but also in terms of the parameters of neural networks. 
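The O(n^{−1/2} + m^{−1/2}) behavior of a plug-in estimator can be eyeballed numerically in a deliberately stripped-down case (an illustration under strong simplifying assumptions: F is collapsed to the single identity function, so the distance reduces to the difference of means and its population value is known to be 0.5):

```python
import random

def plugin_error(n, trials, rng):
    """Average |d(mu, nu) - d(mu_hat_n, nu_hat_n)| when F = {identity},
    so the true distance is |E_mu[x] - E_nu[x]| = 0.5."""
    true_d, total = 0.5, 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.5, 1.0) for _ in range(n)]
        total += abs(abs(sum(xs) / n - sum(ys) / n) - true_d)
    return total / trials

rng = random.Random(1)
err_100 = plugin_error(100, 200, rng)
err_1600 = plugin_error(1600, 200, rng)
# A 16x larger sample should shrink the average error by roughly
# sqrt(16) = 4 if the n^{-1/2} rate is correct.
ratio = err_100 / err_1600
```

This only probes the sample-size dependence; the paper's harder question is how the error constants depend on the neural network parameters.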
We further note that the neural net distance can be used in a variety of other applications such as the support measure machine [18] and anomaly detection [29], and hence the performance guarantees we establish here can be of interest in those domains.\n\n1.1 Our Contribution\n\nIn this paper, we investigate the minimax estimation of the neural net distance d_Fnn(µ, ν), where the major challenge in the analysis lies in dealing with complicated neural network functions. This paper establishes a tighter upper bound on the convergence rate of the empirical estimator than the existing one in [3], and develops a lower bound that matches our upper bound not only in the order of the sample size but also in terms of the norm of the parameter matrices of neural networks. Our specific contributions are summarized as follows:\n\n• In Section 3.1, we provide the first known lower bound on the minimax estimation error of d_Fnn(µ, ν) based on finite samples, which takes the form c_l max(n^{−1/2}, m^{−1/2}), where the constant c_l depends only on the parameters of the neural networks. Such a lower bound further specializes to b_l ∏_{i=1}^d M(i) max(n^{−1/2}, m^{−1/2}) for ReLU networks, where b_l is a constant, d is the depth of the neural networks, and M(i) can be either the Frobenius norm or ‖·‖_{1,∞} norm constraint of the parameter matrix W_i in layer i. Our proof exploits Le Cam's method with the technical development of a lower bound on the difference between two neural networks.\n\n• In Section 3.2, we develop an upper bound on the estimation error of d_Fnn(µ, ν) by d_Fnn(µ̂_n, ν̂_m), which takes the form c_u(n^{−1/2} + m^{−1/2}), where the constant c_u depends only on the parameters of the neural networks. Such an upper bound further specializes to b_u √(d + h) ∏_{i=1}^d M(i) (n^{−1/2} + m^{−1/2}) for ReLU networks, where b_u is a constant, h is the dimension of the support of µ and ν, and √(d + h) can be replaced by √d or √(d + log h) depending on the distribution class and the norm of the weight matrices. Our proof includes the following two major technical developments presented in Section 3.4.\n− A new concentration inequality: in order to develop an upper bound for the unbounded-support sub-Gaussian class, the standard McDiarmid inequality under the bounded difference condition is not applicable. We thus first generalize a McDiarmid inequality [11] for unbounded functions of scalar sub-Gaussian variables to functions of sub-Gaussian vectors, which can be of independent interest for other applications. Such a development requires substantial machinery. We then apply this concentration inequality to upper-bound the estimation error of the neural net distance in terms of the Rademacher complexity.\n− Upper bound on Rademacher complexity: though existing Rademacher complexity bounds [8, 22] for neural networks can be used for input data with bounded support, direct application of those bounds to unbounded sub-Gaussian input data yields order-level loose bounds. Thus, we develop a tighter bound on the Rademacher complexity that exploits the sub-Gaussianity of the input variables. Such a bound is also tighter than the existing bound of the same type in [23]. The details of the comparison are provided after Theorem 7.\n\n• In Section 3.3, comparison of the lower and upper bounds indicates that the empirical neural net distance (i.e., the plug-in estimator) achieves the optimal minimax estimation rate in terms of n^{−1/2} + m^{−1/2}. Furthermore, for ReLU networks, the two bounds also match in terms of ∏_{i=1}^d M(i), indicating that both ∏_{i=1}^d ‖W_i‖_F and ∏_{i=1}^d ‖W_i‖_{1,∞} are key quantities that capture the estimation accuracy. Such a result is consistent with observations made in [20] for the generalization error of training deep neural networks. We note that there is still a gap of √d between the bounds, which requires future efforts to address.\n\n1.2 Related Work\n\nEstimation of IPMs. [25] studied the empirical estimation of several IPMs including the Wasserstein distance, MMD and the Dudley metric, and established the convergence rate for their empirical estimators. A recent paper [26] established that the empirical estimator of MMD achieves the minimax optimal convergence rate. [3] introduced the neural net distance, which also belongs to the IPM class, and established the convergence rate of its empirical estimator. This paper establishes a tighter upper bound for such a distance metric, as well as a lower bound that matches our upper bound in the order of the sample sizes and the norm of the parameter matrices.\n\nGeneralization error of GANs. In this paper, we focus on the minimax estimation error of the neural net distance, and hence the quantity |d_Fnn(µ, ν) − d_Fnn(µ̂_n, ν̂_m)| is of our interest, on which our bound is tighter than the earlier study in [3]. Such a quantity relates to but is different from the following generalization error recently studied in [14, 28] for training GANs. [28] studied the generalization error d_F(µ, ν̂*) − inf_{ν∈D_G} d_F(µ, ν), where ν̂* is the minimizer of d_F(µ̂_n, ν) and F is taken as a class F_nn of neural networks. [14] studied the same type of generalization error but took F as a Sobolev space, and characterized how the smoothness of the Sobolev space helps the GAN training.\n\nRademacher complexity of neural networks. 
Part of our analysis of the minimax estimation error of the neural net distance requires upper-bounding the average Rademacher complexity of neural networks. Although various bounds on the Rademacher complexity of neural networks, e.g., [1, 4, 8, 21, 22], can be used for distributions with bounded support, direct application of the best known existing bound to sub-Gaussian variables turns out to be order-level looser than the bound we establish here. [6, 23] studied the average Rademacher complexity of one-hidden-layer neural networks over Gaussian variables. Specializing our bound to the setting of [23] improves its bound, and specializing it to the setting of [6] equals its bound.\n\n1.3 Notations\n\nWe use bold-faced small and capital letters to denote vectors and matrices, respectively. Given a vector w ∈ R^h, ‖w‖_2 = (∑_{i=1}^h w_i^2)^{1/2} denotes the ℓ_2 norm, and ‖w‖_1 = ∑_{i=1}^h |w_i| denotes the ℓ_1 norm, where w_i denotes the ith coordinate of w. For a matrix W = [W_ij], we use ‖W‖_F = (∑_{i,j} W_ij^2)^{1/2} to denote its Frobenius norm, ‖W‖_{1,∞} to denote the maximal ℓ_1 norm of the row vectors of W, and ‖W‖ to denote its spectral norm. For a real distribution µ, we denote by µ̂_n its empirical distribution, which takes probability 1/n on each of the n samples drawn i.i.d. from µ.\n\n2 Preliminaries and Problem Formulations\n\nIn this section, we first introduce the neural net distance and the specifications of the corresponding neural networks. 
We then introduce the minimax estimation problem that we study in this paper.\n\n2.1 Neural Net Distance\n\nThe neural net distance between two distributions µ and ν introduced in [3] is defined as\n\nd_Fnn(µ, ν) = sup_{f∈F_nn} |E_{x∼µ} f(x) − E_{x∼ν} f(x)|,   (3)\n\nwhere F_nn is a class of neural networks. In this paper, given the domain X ⊆ R^h, we let F_nn be the following set of depth-d neural networks of the form:\n\nf ∈ F_nn : x ∈ X ↦ w_d^T σ_{d−1}(W_{d−1} σ_{d−2}(··· σ_1(W_1 x))),   (4)\n\nwhere W_i, i = 1, 2, ..., d−1, are parameter matrices, w_d is a parameter vector (so that the output of the neural network is a scalar), and each σ_i denotes the entry-wise activation function of layer i for i = 1, 2, ..., d−1, i.e., for an input z ∈ R^t, σ_i(z) := [σ_i(z_1), σ_i(z_2), ..., σ_i(z_t)]^T.\n\nThroughout this paper, we adopt the following two assumptions on the activation functions in (4).\n\nAssumption 1. All activation functions σ_i(·) for i = 1, 2, ..., d−1 satisfy\n• σ_i(·) is continuous and non-decreasing, and σ_i(0) ≥ 0.\n• σ_i(·) is L_i-Lipschitz, where L_i > 0.\n\nAssumption 2. For all activation functions σ_i, i = 1, 2, ..., d−1, there exist positive constants q(i) and Q_σ(i) such that for any 0 ≤ x_1 ≤ x_2 ≤ q(i), σ_i(x_2) − σ_i(x_1) ≥ Q_σ(i)(x_2 − x_1).\n\nNote that Assumptions 1 and 2 hold for a variety of commonly used activation functions including ReLU, sigmoid, softplus and tanh. In particular, in Assumption 2, the existence of the constants q(i) and Q_σ(i) is more important than the particular values they take, which affect only the constant terms in our bounds presented later. 
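As a concrete sketch (with illustrative weights only), the depth-d class in (4) with ReLU activations is simply an alternation of matrix-vector products and the entry-wise map max(0, ·); note that such bias-free ReLU networks satisfy the positive homogeneity σ_i(αx) = ασ_i(x) that is invoked later in Theorem 3(I):

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(z):
    return [max(0.0, v) for v in z]

def depth_d_net(x, weights, w_d):
    """f(x) = w_d^T sigma_{d-1}(W_{d-1} ... sigma_1(W_1 x)) as in (4),
    with every sigma_i taken to be ReLU."""
    z = x
    for W in weights:          # layers 1, ..., d-1
        z = relu(matvec(W, z))
    return sum(w * v for w, v in zip(w_d, z))  # scalar output layer d

# d = 3: two weight matrices plus the output vector w_d (arbitrary values).
W1 = [[1.0, -0.5], [0.0, 2.0]]
W2 = [[0.5, 0.5], [-1.0, 1.0]]
w3 = [1.0, -1.0]
y = depth_d_net([1.0, 2.0], [W1, W2], w3)  # = -2.0 for these weights
```

Scaling the input by α > 0 scales this output by α, consistent with the homogeneity property above.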
For example, Assumption 2 holds for ReLU with any q(i) ≤ ∞ and Q_σ(i) = 1, and holds for sigmoid with any q(i) > 0 and Q_σ(i) = 1/(2 + 2e^{q(i)}).\n\nAs shown in [2], the practical training of GANs is conducted over neural networks with parameters lying in a compact space. Thus, we consider the following two compact parameter sets, as taken in [5, 8, 22, 24]:\n\nW_{1,∞} := ∏_{i=1}^{d−1} {W_i ∈ R^{n_i × n_{i+1}} : ‖W_i‖_{1,∞} ≤ M_{1,∞}(i)} × {w_d ∈ R^{n_d} : ‖w_d‖_1 ≤ M_{1,∞}(d)},\nW_F := ∏_{i=1}^{d−1} {W_i ∈ R^{n_i × n_{i+1}} : ‖W_i‖_F ≤ M_F(i)} × {w_d ∈ R^{n_d} : ‖w_d‖ ≤ M_F(d)}.   (5)\n\n2.2 Minimax Estimation Problem\n\nIn this paper, we study the minimax estimation problem defined as follows. Suppose P is a subset of Borel probability measures of interest. Let d̂(n, m) denote any estimator of the neural net distance d_Fnn(µ, ν) constructed by using the samples {x_i}_{i=1}^n and {y_j}_{j=1}^m respectively generated i.i.d. by µ, ν ∈ P. Our goal is to first find a lower bound C_l(P, n, m) on the estimation error such that\n\ninf_{d̂(n,m)} sup_{µ,ν∈P} P{ |d_Fnn(µ, ν) − d̂(n, m)| ≥ C_l(P, n, m) } > 0,   (6)\n\nwhere P is the probability measure with respect to the random samples {x_i}_{i=1}^n and {y_i}_{i=1}^m. We then focus on the empirical estimator d_Fnn(µ̂_n, ν̂_m) and are interested in finding an upper bound C_u(P, n, m) on the estimation error such that for any arbitrarily small δ > 0,\n\nsup_{µ,ν∈P} P{ |d_Fnn(µ, ν) − d_Fnn(µ̂_n, ν̂_m)| ≤ C_u(P, n, m) } > 1 − δ.   (7)\n\nClearly such an upper bound also holds if the left hand side of (7) is defined in the same minimax sense as in (6).\n\nIt can be seen that the minimax estimation problem is defined with respect to the set P of distributions that µ and ν belong to. In this paper, we consider the set of all sub-Gaussian distributions over R^h. We further divide this set into two subsets and analyze them separately, for which the technical tools are very different. The first set P_uB contains all sub-Gaussian distributions with unbounded support, and bounded mean and variance. Specifically, we assume that there exist τ > 0 and Γ_uB > 0 such that for any probability measure µ ∈ P_uB and any vector a ∈ R^h,\n\nE_{x∼µ} e^{a^T(x − E(x))} ≤ e^{‖a‖^2 τ^2 / 2} with 0 < τ, ‖E(x)‖ ≤ Γ_uB.   (8)\n\nThe second class P_B of distributions contains all sub-Gaussian distributions with bounded support X = {x : ‖x‖ ≤ Γ_B} ⊂ R^h (note that this set in fact includes all distributions with bounded support). These two mutually exclusive classes cover most probability distributions in practice.\n\n3 Main Results\n\n3.1 Minimax Lower Bound\n\nWe first develop the following minimax lower bound for the sub-Gaussian distribution class P_uB with unbounded support.\n\nTheorem 1 (unbounded-support sub-Gaussian class P_uB). Let F_nn be the set of neural networks defined by (4). 
For the parameter set W_F in (5), if √(m^{−1} + n^{−1}) < √3 q(1)/(2 M_F(1) Γ_uB), then\n\ninf_{d̂(n,m)} sup_{µ,ν∈P_uB} P{ |d_Fnn(µ, ν) − d̂(n, m)| ≥ C(P_uB) max(n^{−1/2}, m^{−1/2}) } ≥ 1/4,   (9)\n\nwhere\n\nC(P_uB) = (√3/6) M_F(1) M_F(d) Γ_uB ( 1 − Φ( q(1)/(2 M_F(1) Γ_uB) ) ) ∏_{i=2}^{d−1} Ω(i) ∏_{i=1}^{d−1} Q_σ(i),   (10)\n\nΦ(·) is the cumulative distribution function (CDF) of the standard Gaussian distribution, and the constants Ω(i), i = 2, 3, ..., d−1, are given by the following recursion\n\nΩ(2) = min{ M_F(2), q(2)/σ_1(q(1)) },\nΩ(i) = min{ M_F(i), q(i)/σ_{i−1}( Ω(i−1) ··· Ω(2) σ_1(q(1)) ) } for i = 3, 4, ..., d−1.   (11)\n\nThe same result holds for the parameter set W_{1,∞} by replacing M_F(i) in (10) with M_{1,∞}(i).\n\nTheorem 1 implies that d_Fnn(µ, ν) cannot be estimated at a rate faster than max(n^{−1/2}, m^{−1/2}) by any estimator over the class P_uB. The proof of Theorem 1 is based on Le Cam's method. Such a technique was also used in [26] to derive the minimax lower bound for estimating MMD. However, our technical development is quite different from that in [26]. Specifically, one major step of Le Cam's method is to lower-bound the difference arising due to two hypothesis distributions. In the MMD case in [26], MMD can be expressed in a closed form for the chosen distributions. Hence, the lower bound in Le Cam's method can be derived based on such a closed form of MMD. As a comparison, the neural net distance does not have a closed-form expression for the chosen distributions. As a result, our derivation involves lower-bounding the difference of the expectations of the neural network function with respect to the two corresponding distributions. Such developments require substantial machinery to deal with the complicated multi-layer structure of neural networks. See Appendix A.1 for more details.\n\nFor general neural networks, C(P_uB) takes a complicated form as in (10). We next specialize to ReLU networks to illustrate how this constant depends on the neural network parameters.\n\nCorollary 1. Under the setting of Theorem 1, suppose each activation function is ReLU, i.e., σ_i(z) = max{0, z}, i = 1, 2, ..., d−1. For the parameter set W_F and all m, n ≥ 1, we have\n\ninf_{d̂(n,m)} sup_{µ,ν∈P_uB} P{ |d_Fnn(µ, ν) − d̂(n, m)| ≥ 0.08 Γ_uB ∏_{i=1}^d M_F(i) max(n^{−1/2}, m^{−1/2}) } ≥ 1/4.\n\nThe same result holds for the parameter set W_{1,∞} by replacing M_F(i) with M_{1,∞}(i).\n\nNext, we provide the minimax lower bound for the distribution class P_B with bounded support. The proof (see Appendix B) is also based on Le Cam's method, but with the construction of distributions having bounded support sets, which are different from those for Theorem 1.\n\nTheorem 2 (bounded-support class P_B). Let F_nn be the set of neural networks defined by (4). 
For the parameter set W_F, we have\n\ninf_{d̂(n,m)} sup_{µ,ν∈P_B} P{ |d_Fnn(µ, ν) − d̂(n, m)| ≥ C(P_B) max(n^{−1/2}, m^{−1/2}) } ≥ 1/4,   (12)\n\nwhere\n\nC(P_B) = 0.17 ( M_F(d) σ_{d−1}(··· σ_1(M_F(1) Γ_B)) − M_F(d) σ_{d−1}(··· σ_1(−M_F(1) Γ_B)) ),   (13)\n\nand all constants M_F(i), i = 1, 2, ..., d, in the second term on the right side of (13) carry negative signs. The same result holds for the parameter set W_{1,∞} by replacing M_F(i) in (13) with M_{1,∞}(i).\n\nCorollary 2. Under the setting of Theorem 2, suppose that each activation function is ReLU. For the parameter set W_F, we have\n\ninf_{d̂(n,m)} sup_{µ,ν∈P_B} P{ |d_Fnn(µ, ν) − d̂(n, m)| ≥ 0.17 Γ_B ∏_{i=1}^d M_F(i) max(n^{−1/2}, m^{−1/2}) } ≥ 1/4.\n\nThe same result holds for the parameter set W_{1,∞} by replacing M_F(i) with M_{1,∞}(i).\n\n3.2 Rademacher Complexity-based Upper Bound\n\nIn this subsection, we provide an upper bound on |d_Fnn(µ, ν) − d_Fnn(µ̂_n, ν̂_m)|, which serves as an upper bound on the minimax estimation error. Our main technical development lies in deriving the bound for the unbounded-support sub-Gaussian class P_uB, which requires a number of new technical developments. We discuss its proof in Section 3.4.\n\nTheorem 3 (unbounded-support sub-Gaussian class P_uB). 
Let F_nn be the set of neural networks defined by (4), and suppose that the two distributions µ, ν ∈ P_uB and that µ̂_n, ν̂_m are their empirical measures. For a constant δ > 0 satisfying √(6h) min{n, m} √(m^{−1} + n^{−1}) ≥ 4 √(log(1/δ)), we have:\n\n(I) If the parameter set is W_F and each activation function satisfies σ_i(αx) = ασ_i(x) for all α > 0 (e.g., ReLU or leaky ReLU), then with probability at least 1 − δ over the randomness of µ̂_n and ν̂_m,\n\n|d_Fnn(µ, ν) − d_Fnn(µ̂_n, ν̂_m)| ≤ 2 Γ_uB ∏_{i=1}^d M_F(i) ∏_{i=1}^{d−1} L_i ( √(6d log 2 + 5h/4) + √(2h log(1/δ)) ) ( n^{−1/2} + m^{−1/2} ).\n\n(II) If the parameter set is W_{1,∞} and each activation function satisfies σ_i(0) = 0 (e.g., ReLU, leaky ReLU or tanh), then with probability at least 1 − δ over the randomness of µ̂_n and ν̂_m,\n\n|d_Fnn(µ, ν) − d_Fnn(µ̂_n, ν̂_m)| ≤ 2 Γ_uB ∏_{i=1}^d M_{1,∞}(i) ∏_{i=1}^{d−1} L_i ( √(2d log 2 + 2 log h) + √(2h log(1/δ)) ) ( n^{−1/2} + m^{−1/2} ).\n\nCorollary 3. Theorem 3 is directly applicable to ReLU networks with L_i = 1 for i = 1, ..., d.\n\nWe next present an upper bound on |d_Fnn(µ, ν) − d_Fnn(µ̂_n, ν̂_m)| for the bounded-support class P_B. In such a case, each data sample x_i satisfies ‖x_i‖ ≤ Γ_B, and hence we apply the standard McDiarmid inequality [16] and the Rademacher complexity bounds in [8] to upper-bound |d_Fnn(µ, ν) − d_Fnn(µ̂_n, ν̂_m)|. The detailed proof can be found in Appendix D.\n\nTheorem 4 (bounded-support class P_B). 
Let Fnn be the set of neural networks de\ufb01ned by (4), and\nsuppose that two distributions \u00b5, \u03bd \u2208 PB. Then, we have\n(I) If the parameter set is WF and each activation function satis\ufb01es \u03c3i(\u03b1x) = \u03b1\u03c3i(x) for all \u03b1 > 0,\nthen with probability at least 1 \u2212 \u03b4 over the randomness of \u02c6\u00b5n and \u02c6\u03bdm,\n|dFnn(\u00b5, \u03bd) \u2212 dFnn (\u02c6\u00b5n, \u02c6\u03bdm)|\n\n(cid:16)\n\n2(cid:112)d log 2 +(cid:112)log(1/\u03b4) +\n\n(cid:17)(cid:16)\nn\u22121/2 + m\u22121/2(cid:17)\n\n.\n\n\u221a\n\n2\n\n2\u0393B\n\nM1,\u221e(i)\n\nLi\n\nd(cid:89)\n\n\u221a\n\n\u2264\n\nd\u22121(cid:89)\n\ni=1\n\ni=1\n\n(II) If the parameter set is W1,\u221e and each activation function satis\ufb01es \u03c3i(0) = 0, then with\nprobability at least 1 \u2212 \u03b4 over the randomness of \u02c6\u00b5n and \u02c6\u03bdm,\n|dFnn(\u00b5, \u03bd) \u2212 dFnn(\u02c6\u00b5n, \u02c6\u03bdm)|\n\n(cid:17)(cid:16)\nn\u22121/2 + m\u22121/2(cid:17)\n4(cid:112)d + 1 + log h +(cid:112)2 log(1/\u03b4)\n\nd\u22121(cid:89)\n\nd(cid:89)\n\n(cid:16)\n\nM1,\u221e(i)\n\n\u2264 \u0393B\n\nLi\n\n,\n\ni=1\n\ni=1\n\nCorollary 4. Theorem 4 is applicable for ReLU networks with Li = 1 for i = 1, . . . 
, d.\nAs a comparison, the upper bound derived in [3] is linear with the total number of the parameters of\nneural networks, whereas our bound in Theorem 4 scales only with the square root of depth\nd (and\nother terms in Theorem 4 matches the lower bound in Corollary 2), which is much smaller.\n\n\u221a\n\n6\n\n\f3.3 Optimality of Minimax Estimation and Discussions\n\nWe compare the lower and upper bounds and make the following remarks on the optimality of\nminimax estimation of the neural net distance.\n\nupper bounds match further in terms of \u0393uB\n\ni=1 M (i) max(cid:8)n\u22121/2, m\u22121/2(cid:9), where M (i)\n(cid:81)d\ni=1 (cid:107)Wi(cid:107)F and(cid:81)d\ncan be MF (i) or M1,\u221e(i), indicating that both(cid:81)d\n\n\u2022 For the unbounded-support sub-Gaussian class PuB, comparison of Theorems 1 and 3 indicates\nthat the empirical estimator dFnn (\u02c6\u00b5n, \u02c6\u03bdm) achieves the optimal minimax estimation rate\nmax{n\u22121/2, m\u22121/2} as the sample size goes to in\ufb01nity.\n\u2022 Furthermore, for ReLU networks, comparison of Corollaries 1 and 3 implies that the lower and\ni=1 (cid:107)Wi(cid:107)1,\u221e capture\nthe estimation accuracy. Such an observation is consistent with those made in [20] for the\ngeneralization error of training deep neural networks. Moreover, the mean norm (cid:107)E(x)(cid:107) and\nthe variance parameter of the distributions also determine the estimation accuracy due to the\nmatch of the bounds in \u0393uB.\n\u2022 The same observations hold for the bounded-support class PB by comparing Theorems 2 and 4\nWe further note that for ReLU networks, for both the unbounded-support sub-Gaussian class PuB\nand the bounded-support class PB, there is a gap of\nd + log h depending on the\ndistribution class and the norm of the weight matrices). To close the gap, the size-independent bound\non Rademacher complexity in [8] appears appealing. 
However, such a bound is applicable only to the\nbounded-support class PB, and helps to remove the dependence on\nd but at the cost of sacri\ufb01cing\nthe rate (i.e., from m\u22121/2 + n\u22121/2 to m\u22121/4 + n\u22121/4). Consequently, such an upper bound matches\nthe lower bound in Corollary 2 for ReLU networks over the network parameters, but not in terms of\nthe sample size, and is interesting only in the regime when d (cid:29) max{n, m}. It is thus still an open\nproblem and calling for future efforts to close the gap of\nd for estimating the neural net distance.\n\nas well as comparing Corollaries 2 and 4.\n\nd + h,\n\u221a\n\nd (or\n\n\u221a\n\n\u221a\n\n\u221a\n\n\u221a\n\n3.4 Proof Outline for Theorem 3\n\nIn this subsection, we brie\ufb02y explain the three major steps to prove Theorem 3, because some of\nthese intermediate steps correspond to theorems that can be of independent interest. The detailed\nproof can be found in Appendix C.\nStep 1: A new McDiarmid\u2019s type of inequality. To establish an upper bound on |dFnn(\u00b5, \u03bd) \u2212\ndFnn (\u02c6\u00b5n, \u02c6\u03bdm)|, the standard McDiarmid\u2019s inequality [16] that requires the bounded difference\ncondition is not applicable here, because the input data has unbounded support so that the functions\nin Fnn can be unbounded, e.g., ReLU neural networks. Such a challenge can be addressed by a\ngeneralized McDiarmid\u2019s inequality for scalar sub-Gaussian variables established in [11]. However,\nthe input data are vectors in our setting. Thus, we further generalize the result in [11] and establish the\nfollowing new McDiarmid\u2019s type of concentration inequality for unbounded sub-Gaussian random\nvectors and Lipschitz (possibly unbounded) functions. Such development turns out to be nontrivial,\nwhich requires further machineries and tail bound inequalities (see detailed proof in Appendix C.1).\ni.i.d.\u223c \u03bd be two collections of random variables,\nTheorem 5. 
Theorem 5. Let $\{x_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} \mu$ and $\{y_i\}_{i=1}^m \overset{\text{i.i.d.}}{\sim} \nu$ be two collections of random variables, where $\mu, \nu \in \mathcal{P}_{uB}$ are two unbounded-support sub-Gaussian distributions over $\mathbb{R}^h$. Suppose that $F : (\mathbb{R}^h)^{n+m} \mapsto \mathbb{R}$ is a function of $x_1, \ldots, x_n, y_1, \ldots, y_m$ which satisfies, for any $i$,
$$|F(x_1, \ldots, x_i, \ldots, y_m) - F(x_1, \ldots, x_i', \ldots, y_m)| \le L_F \|x_i - x_i'\|/n,$$
$$|F(x_1, \ldots, y_i, \ldots, y_m) - F(x_1, \ldots, y_i', \ldots, y_m)| \le L_F \|y_i - y_i'\|/m. \quad (14)$$
Then, for all $0 \le \epsilon \le \sqrt{3h}\,\Gamma_{uB} L_F \min\{m, n\}(n^{-1} + m^{-1})$,
$$P\big(F(x_1, \ldots, x_n, \ldots, y_m) - \mathbb{E}\,F(x_1, \ldots, x_n, \ldots, y_m) \ge \epsilon\big) \le \exp\left(\frac{-\epsilon^2 mn}{8h\,\Gamma_{uB}^2 L_F^2 (m+n)}\right). \quad (15)$$

Step 2: Upper bound based on Rademacher complexity. By applying Theorem 5, we derive an upper bound on $|d_{\mathcal{F}_{\rm nn}}(\mu, \nu) - d_{\mathcal{F}_{\rm nn}}(\hat{\mu}_n, \hat{\nu}_m)|$ in terms of the average Rademacher complexity, which we define below.

Definition 1. The average Rademacher complexity $\mathcal{R}_n(\mathcal{F}_{\rm nn}, \mu)$ corresponding to the distribution $\mu$ with $n$ samples is defined as $\mathcal{R}_n(\mathcal{F}_{\rm nn}, \mu) = \mathbb{E}_{x,\epsilon} \sup_{f \in \mathcal{F}_{\rm nn}} \big|\frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i)\big|$, where $\{x_i\}_{i=1}^n$ are generated i.i.d. by $\mu$ and $\{\epsilon_i\}_{i=1}^n$ are independent random variables chosen from $\{-1, 1\}$ uniformly.

Then, we have the following result, with the proof provided in Appendix C.2. Recall that $L_i$ is the Lipschitz constant of the activation function $\sigma_i(\cdot)$.

Theorem 6. Let $\mathcal{F}_{\rm nn}$ be the set of neural networks defined by (4). For the parameter set $\mathcal{W}_F$ defined in (5), suppose that $\mu, \nu \in \mathcal{P}_{uB}$ are two sub-Gaussian distributions satisfying (8), and $\hat{\mu}_n, \hat{\nu}_m$ are the empirical measures of $\mu, \nu$. If $\sqrt{6h \min\{n, m\}}\sqrt{m^{-1} + n^{-1}} \ge 4\sqrt{\log(1/\delta)}$, then with probability at least $1 - \delta$ over the randomness of $\hat{\mu}_n$ and $\hat{\nu}_m$,
$$|d_{\mathcal{F}_{\rm nn}}(\mu, \nu) - d_{\mathcal{F}_{\rm nn}}(\hat{\mu}_n, \hat{\nu}_m)| \le 2\mathcal{R}_n(\mathcal{F}_{\rm nn}, \mu) + 2\mathcal{R}_m(\mathcal{F}_{\rm nn}, \nu) + 2\Gamma_{uB} \prod_{i=1}^{d} M_F(i) \prod_{i=1}^{d-1} L_i \sqrt{2h(n^{-1} + m^{-1})\log(1/\delta)}. \quad (16)$$
The same result holds for the parameter set $\mathcal{W}_{1,\infty}$ by replacing $M_F(i)$ in (16) with $M_{1,\infty}(i)$.

Step 3: Average Rademacher complexity bound for unbounded sub-Gaussian variables. We derive an upper bound on the Rademacher complexity $\mathcal{R}_n(\mathcal{F}_{\rm nn}, \mu)$. In particular, as we explain next, our upper bound is tighter than what is obtained by directly applying the existing bounds in [8, 22]. To see this, note that [8, 22] provided upper bounds on the data-dependent Rademacher complexity of neural networks defined by $\hat{\mathcal{R}}_n(\mathcal{F}_{\rm nn}) = \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}_{\rm nn}} \frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i)$. For the parameter set $\mathcal{W}_F$, [22] showed that $\hat{\mathcal{R}}_n(\mathcal{F}_{\rm nn})$ is bounded by $2^d \prod_{i=1}^{d} M_F(i) \prod_{i=1}^{d-1} L_i \sqrt{\sum_{i=1}^n \|x_i\|^2}\big/n$, and [8] further improved this bound to $\big(\sqrt{2\log(2)d} + 1\big) \prod_{i=1}^{d} M_F(i) \prod_{i=1}^{d-1} L_i \sqrt{\sum_{i=1}^n \|x_i\|^2}\big/n$. Directly applying this result to unbounded sub-Gaussian inputs $\{x_i\}_{i=1}^n$ yields
$$\mathbb{E}_x \hat{\mathcal{R}}_n(\mathcal{F}_{\rm nn}) \le O\Big(\Gamma_{uB} \prod_{i=1}^{d} M_F(i) \prod_{i=1}^{d-1} L_i \sqrt{dh}\big/\sqrt{n}\Big). \quad (17)$$
We next show that by exploiting the sub-Gaussianity of the input data, we can provide an improved bound on the average Rademacher complexity (Theorem 7). The detailed proof can be found in Appendix C.3.
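To make the quantity in Definition 1 concrete before stating the improved bound, the following sketch estimates the Rademacher complexity of a tiny norm-constrained one-hidden-layer ReLU class by Monte Carlo, using random search over the constraint set as a crude stand-in for the supremum. All sizes and norm budgets here are illustrative choices, not values from the paper.

```python
import numpy as np

# Hypothetical sketch: Monte Carlo estimate of the Rademacher complexity from
# Definition 1 for a small one-hidden-layer ReLU class with Frobenius-norm-bounded
# weights, next to the norm-based scaling M1 * M2 / sqrt(n) shared by the bounds above.

rng = np.random.default_rng(1)
h, n1, n = 4, 8, 500            # input dim, hidden width, sample size
M1, M2 = 1.0, 1.0               # Frobenius-norm budgets for the two layers

x = rng.standard_normal((n, h)) # sub-Gaussian (here Gaussian) inputs

def sup_correlation(eps, num_nets=300):
    """Approximate sup_f |mean_i eps_i f(x_i)| by random search over
    norm-constrained weights (a crude lower estimate of the supremum)."""
    best = 0.0
    for _ in range(num_nets):
        W1 = rng.standard_normal((n1, h)); W1 *= M1 / np.linalg.norm(W1)
        w2 = rng.standard_normal(n1);      w2 *= M2 / np.linalg.norm(w2)
        f = np.maximum(x @ W1.T, 0.0) @ w2        # ReLU network outputs f(x_i)
        best = max(best, abs(np.mean(eps * f)))
    return best

rad = np.mean([sup_correlation(rng.choice([-1.0, 1.0], size=n))
               for _ in range(20)])               # average over Rademacher draws
print(f"MC Rademacher estimate: {rad:.4f}, "
      f"norm scaling M1*M2/sqrt(n) = {M1 * M2 / np.sqrt(n):.4f}")
```

Random search only lower-estimates the supremum, so the printed value understates the true complexity; it is meant solely to make the $1/\sqrt{n}$ scaling of the definition tangible.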
Theorem 7. Let $\mathcal{F}_{\rm nn}$ be the set of neural networks defined by (4), and let $x_1, \ldots, x_n \in \mathbb{R}^h$ be i.i.d. random samples generated by an unbounded-support sub-Gaussian distribution $\mu \in \mathcal{P}_{uB}$. Then,
(I) If the parameter set is $\mathcal{W}_F$ and the activation functions satisfy $\sigma_i(\alpha x) = \alpha\sigma_i(x)$ for all $\alpha > 0$, then
$$\mathcal{R}_n(\mathcal{F}_{\rm nn}, \mu) \le \Gamma_{uB} \prod_{i=1}^{d} M_F(i) \prod_{i=1}^{d-1} L_i \sqrt{6d\log 2 + 5h/4}\big/\sqrt{n}. \quad (18)$$
(II) If the parameter set is $\mathcal{W}_{1,\infty}$ and each activation function satisfies $\sigma_i(0) = 0$, then
$$\mathcal{R}_n(\mathcal{F}_{\rm nn}, \mu) \le \sqrt{2}\,\Gamma_{uB} \prod_{i=1}^{d} M_{1,\infty}(i) \prod_{i=1}^{d-1} L_i \sqrt{d\log 2 + \log h}\big/\sqrt{n}. \quad (19)$$

Theorem 7 indicates that for the parameter set $\mathcal{W}_F$, our upper bound in (18) replaces the order dependence $O(\sqrt{dh})$ in (17) by $O(\sqrt{d+h})$, and hence our proof yields an order-level improvement over directly using the existing bounds. The same observation can be made for the parameter set $\mathcal{W}_{1,\infty}$. This improvement arises because our proof takes advantage of the sub-Gaussianity of the inputs, whereas the bounds in [8, 22] must hold for any data input (and hence for the worst-case input).
We also note that [23] provided an upper bound on the Rademacher complexity of one-hidden-layer neural networks with Gaussian inputs. Casting Lemma 3.2 in [23] to our setting of (18) yields
$$\mathcal{R}_n(\mathcal{F}_{\rm nn}, \mu) \le O\big(\Gamma_{uB} M_F(2) M_F(1) L_1 \sqrt{n_1 h}\big/\sqrt{n}\big), \quad (20)$$
where $n_1$ is the number of neurons in the hidden layer. Compared with (20), our bound has an order-level $O(\sqrt{n_1})$ improvement.

Summary. Therefore, Theorem 3 follows by combining Theorems 6 and 7 and using the fact that $\sqrt{1/n + 1/m} < \sqrt{1/n} + \sqrt{1/m}$.

4 Conclusion

In this paper, we developed both lower and upper bounds for the minimax estimation of the neural net distance based on finite samples. Our results established the minimax optimality of the empirical estimator in terms of not only the sample size but also the norm of the parameter matrices of the neural networks, which justifies its usage for training GANs.

Acknowledgments

The work was supported in part by the U.S. National Science Foundation under grant CCF-1801855.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proc. International Conference on Machine Learning (ICML), pages 214-223, 2017.

[3] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proc. International Conference on Machine Learning (ICML), 2017.

[4] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 6241-6250, 2017.

[5] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, Nov 2002.

[6] S. S. Du and J. D. Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.

[7] G. Dziugaite, D. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proc.
of the 31st Conference on Uncertainty in Artificial Intelligence (UAI), pages 258-267, 2015.

[8] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. In Proc. Conference on Learning Theory (COLT), 2018. To appear.

[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 2672-2680, 2014.

[10] D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of subGaussian random vectors. Electronic Communications in Probability, 17, 2012.

[11] A. Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In Proc. International Conference on Machine Learning (ICML), pages 28-36, 2014.

[12] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.

[13] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In Proc. International Conference on Machine Learning (ICML), pages 1718-1727, 2015.

[14] T. Liang. How well can generative adversarial networks (GAN) learn densities: a nonparametric view. arXiv preprint arXiv:1712.08244, 2017.

[15] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 5551-5559, 2017.

[16] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148-188, 1989.

[17] S. J. Montgomery-Smith. The distribution of Rademacher sums. Proceedings of the American Mathematical Society, 109(2):517-522, 1990.

[18] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions via support measure machines. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 10-18, 2012.

[19] A. Müller.
Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

[20] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 5949-5958, 2017.

[21] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In Proc. International Conference on Learning Representations (ICLR), 2018.

[22] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Proc. Conference on Learning Theory (COLT), pages 1376-1401, 2015.

[23] S. Oymak. Learning compact neural networks with regularization. arXiv preprint arXiv:1802.01223, 2018.

[24] T. Salimans and D. P. Kingma. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 901-909, 2016.

[25] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550-1599, 2012.

[26] I. O. Tolstikhin, B. K. Sriperumbudur, and B. Schölkopf. Minimax estimation of maximum mean discrepancy with radial kernels. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 1930-1938, 2016.

[27] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In Proc. International Conference on Learning Representations (ICLR), 2017.

[28] P. Zhang, Q. Liu, D. Zhou, T. Xu, and X. He. On the discrimination-generalization tradeoff in GANs. In Proc. International Conference on Learning Representations (ICLR), 2018.

[29] S. Zou, Y. Liang, and H. V. Poor.
Nonparametric detection of geometric structures over networks. IEEE Transactions on Signal Processing, 65(19):5034-5046, 2017.