{"title": "A Solvable High-Dimensional Model of GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 13782, "page_last": 13791, "abstract": "We present a theoretical analysis of the training process for a single-layer GAN fed by high-dimensional input data. The training dynamics of the proposed model can be exactly analyzed at both microscopic and macroscopic scales in the high-dimensional limit. In particular, we prove that the macroscopic quantities measuring the quality of the training process converge to a deterministic process characterized by an ordinary differential equation (ODE), whereas the microscopic states containing all the detailed weights remain stochastic, with dynamics described by a stochastic differential equation (SDE). This analysis provides a perspective different from recent analyses in the limit of small learning rate, where the microscopic state is always considered deterministic and the contribution of noise is ignored. Our analysis shows that the level of the background noise is essential to the convergence of the training process: setting the noise level too high leads to failure of feature recovery, whereas setting it too low causes oscillation. Although this work focuses on a simple copy model of GAN, we believe the analysis methods and insights developed here would prove useful in the theoretical understanding of other variants of GANs with more advanced training algorithms.", "full_text": "A Solvable High-Dimensional Model of GAN

Chuang Wang1,2 (wangchuang@ia.ac.cn), Hong Hu2 (honghu@g.harvard.edu), Yue M. Lu2 (yuelu@seas.harvard.edu)

1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 Zhong Guan Cun Dong Lu, Beijing 100190, China
2. John A.
Paulson School of Engineering and Applied Sciences, Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA

Abstract

We present a theoretical analysis of the training process for a single-layer GAN fed by high-dimensional input data. The training dynamics of the proposed model can be exactly analyzed at both microscopic and macroscopic scales in the high-dimensional limit. In particular, we prove that the macroscopic quantities measuring the quality of the training process converge to a deterministic process characterized by an ordinary differential equation (ODE), whereas the microscopic states containing all the detailed weights remain stochastic, with dynamics described by a stochastic differential equation (SDE). This analysis provides a perspective different from recent analyses in the limit of small learning rate, where the microscopic state is always considered deterministic and the contribution of noise is ignored. Our analysis shows that the level of the background noise is essential to the convergence of the training process: setting the noise level too high leads to failure of feature recovery, whereas setting it too low causes oscillation. Although this work focuses on a simple copy model of GAN, we believe the analysis methods and insights developed here would prove useful in the theoretical understanding of other variants of GANs with more advanced training algorithms.

1 Introduction

A generative adversarial network (GAN) [1] seeks to learn a high-dimensional probability distribution from samples. While there have been numerous advances on the application front [2-6], considerably less is known about the underlying theory and the conditions that can explain or guarantee the successful training of GANs.
Recently, studying either the equilibrium properties [7-9] or the training dynamics [10, 11] of GANs has become a very active area of research.
Specifically, there is a line of work studying the dynamics of gradient-based training algorithms, e.g., [11-16]. The basic idea is the following. The evolution of the learnable parameters during training can be viewed as a discrete-time process. With a proper time scaling, this discrete-time process converges, as the learning rates tend to 0, to a deterministic continuous-time process characterized by an ordinary differential equation (ODE). By studying the local stability of the ODE's fixed points, [12] shows that oscillation in the training algorithm arises when eigenvalues of the Jacobian of the gradient vector field have zero real part and large imaginary part. Motivated by this observation, various stabilization approaches have been proposed, for example adding regularizers [13, 14] or using two-timescale updates [15]. Very recently, [16] argues that those stabilization techniques may encourage the algorithms to converge to non-Nash stationary points. All of the above works consider the small-learning-rate limit, where the limiting process

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

is always deterministic. The stochasticity and the effect of the noise are essentially ignored, which may not reflect practical situations. Thus, a new analysis paradigm that accounts for this intrinsic stochasticity is needed.
In this paper, we present a high-dimensional and exactly solvable model of GAN. Its dynamics can be precisely characterized at both macroscopic and microscopic scales, where the former is deterministic and the latter remains stochastic. Interestingly, our theoretical analysis shows that injecting additional noise can stabilize the training. Specifically, our main technical contributions are twofold:

• We present an asymptotically exact analysis of the training process of the proposed GAN model.
Our analysis is carried out on both the macroscopic and the microscopic levels. The macroscopic state measures the overall performance of the training process, whereas the microscopic state contains all the detailed weight information. In the high-dimensional limit ($n \to \infty$), we show that the former converges to a deterministic process governed by an ordinary differential equation (ODE), whereas the latter stays stochastic and is described by a stochastic differential equation (SDE).

• We show that, depending on the choice of the learning rates and the strength of the noise, the training process can reach a successful, a failed, an oscillating, or a mode-collapsing phase. By studying the stability of the fixed points of the limiting ODEs, we precisely characterize when each phase takes place. The analysis reveals a condition on the learning rates and the noise strength for successful training. We show that the level of the background noise is essential to the convergence of the training process: setting the noise level too high (small signal-to-noise ratio) leads to failure of feature recovery, whereas setting it too low (large signal-to-noise ratio) causes oscillation.

Our work builds upon a general analysis framework [17] for studying the scaling limits of high-dimensional exchangeable stochastic processes, with applications to nonlinear regression problems. Similar techniques have also been used in the literature to study Monte Carlo methods [18], online perceptron learning [19, 20], online sparse PCA [21], subspace estimation [22], online ICA [23] and, more recently, the supervised learning of two-layer neural networks [24]; to the best of our knowledge, however, this technique has not yet been used to analyze GANs.
The rest of the paper is organized as follows. We present the proposed GAN model and the associated training algorithm in Section 2.
Our main results are presented in Section 3, where we show that the macroscopic and microscopic dynamics of the training process converge to limiting processes characterized by an ODE and an SDE, respectively. In Section 4, we analyze the stationary solutions of the limiting ODEs and precisely characterize the long-term behavior of the training process. We conclude in Section 5.

2 Formulations

In this section, we introduce the proposed GAN model and specify the associated training algorithm.

Model for the real data. In order to establish the theoretical analysis, we first impose a model for the probability distribution from which we draw our real data samples. We assume that the real data $y_k \in \mathbb{R}^n$, $k = 0, 1, \ldots$ are drawn according to the following generative model:

$$y_k = \mathcal{G}(c_k, a_k; U, \eta_T) \overset{\text{def}}{=} U c_k + \sqrt{\eta_T}\, a_k, \qquad (1)$$

where $U \in \mathbb{R}^{n \times d}$ is a deterministic unknown feature matrix with $d$ features; $c_k \in \mathbb{R}^d$ is a random vector drawn from an unknown distribution $P_c$; $a_k$ is an $n$-dimensional random vector acting as the background noise; and $\eta_T$ is a parameter that controls the strength of the noise. Without loss of generality,¹ we assume $U^\top U = I_d$, where $I_d$ is the $d \times d$ identity matrix.

This generative model, referred to as the spiked covariance model [25] in the literature, is commonly used in the theoretical study of principal component analysis (PCA). We note that estimation under this model is not a trivial task for PCA, even when $d = 1$, if the variance of the noise $a_k$ is a non-zero constant. As proved in [25], the best estimator cannot perfectly recover the signal $U$ given an $O(n)$ number of samples $y_k$.

¹If $U$ is not orthogonal, we can rewrite $U c$ in (1) as $(U R)(R^{-1} c)$, where $R$ is a matrix that orthogonalizes and normalizes the columns of $U$. We can then study an equivalent system where the new feature vector is $R^{-1} c$.
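To make the sampling model concrete, the following NumPy sketch draws one sample from (1). The function name `sample_real` and the particular choices of $n$, $d$, $\eta_T$, and $P_c$ are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def sample_real(U, eta_T, cov_c, rng):
    """Draw one sample y = U c + sqrt(eta_T) * a from the spiked
    covariance model (1), with c ~ P_c and a ~ N(0, I_n)."""
    n, d = U.shape
    c = rng.multivariate_normal(np.zeros(d), cov_c)  # latent feature weights
    a = rng.standard_normal(n)                       # background noise
    return U @ c + np.sqrt(eta_T) * a

rng = np.random.default_rng(0)
n, d = 1000, 2
# random feature matrix with orthonormal columns, so that U^T U = I_d
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
y = sample_real(U, eta_T=2.0, cov_c=np.diag([5.0, 3.0]), rng=rng)
```

Here $P_c$ is taken to be a zero-mean Gaussian with covariance $\operatorname{diag}([5, 3])$, matching the simulation settings used later in the numerical experiments.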
Thus, it is of sufficient interest to investigate whether a GAN can retrieve informative estimates of the principal components in the same scaling limit.

The GAN model. The GAN we are going to analyze is defined as follows. We assume that the generator $\mathcal{G}$ has the same linear structure as the real data model (1) given above:

$$\widetilde{y}_k = \mathcal{G}(\widetilde{c}_k, \widetilde{a}_k; V, \eta_G), \qquad (2)$$

but the parameters are different. Here, $\widetilde{y}_k$ denotes a fake sample produced by the generator; $\widetilde{a}_k$ is an $n$-dimensional random noise vector; the random variable $\widetilde{c}_k$ is drawn from a fixed distribution $P_{\widetilde{c}}$; $\eta_G$ is the noise strength; and the matrix $V \in \mathbb{R}^{n \times d}$ represents the parameters of the generator. (In an ideal case in which the generator learns the underlying true probability distribution perfectly, we have $V = U$.) Throughout the paper, we follow the notational convention that all symbols decorated with a tilde (e.g., $\widetilde{y}_k, \widetilde{c}_k, \widetilde{a}_k$) denote quantities associated with the generator.

We define the discriminator $\mathcal{D}$ of our GAN model as

$$\mathcal{D}(y; w) \overset{\text{def}}{=} \widehat{D}(y^\top w),$$

where $y$ is an input vector, which can be either the real data $y_k$ from (1) or the fake data $\widetilde{y}_k$ from (2); $\widehat{D}: \mathbb{R} \mapsto \mathbb{R}$ can be any function; and the vector $w \in \mathbb{R}^n$ represents the parameters associated with the discriminator. Later, we will show that the generator can learn multiple features even though the discriminator has only one feature vector $w$. Discriminators with multiple features can also be analyzed in a similar way, but in this paper we consider the single-feature discriminator for simplicity.

The training algorithm. The proposed GAN model has two sets of parameters, $V$ and $w$, to be learned from the data.
The training process is formulated as the following minimax problem:

$$\min_{V} \max_{w} \; \mathbb{E}_{y \sim P(y; U)} \mathbb{E}_{\widetilde{y} \sim \widetilde{P}(\widetilde{y}; V)} \, \mathcal{L}(y, \widetilde{y}; w), \qquad (3)$$

where the two probability distributions $P(y; U)$ and $\widetilde{P}(\widetilde{y}; V)$ represent the distributions of the real data $y$ and the fake data $\widetilde{y}$ as specified by (1) and (2), respectively, and

$$\mathcal{L}(y, \widetilde{y}; w) \overset{\text{def}}{=} F(\widehat{D}(y^\top w)) - \widetilde{F}(\widehat{D}(\widetilde{y}^\top w)) - \tfrac{\lambda}{2} H(w^\top w) + \tfrac{\lambda}{2} \operatorname{tr}\big(H(V^\top V)\big), \qquad (4)$$

with $F(\cdot)$ and $\widetilde{F}(\cdot)$ being two functions that quantify the performance of the discriminator and $\lambda > 0$ being a constant. The function $H(\cdot)$ acts as a regularization term introduced to control the magnitude of the parameters $w$ and $V$. It can be an arbitrary real-valued function, which is applied element-wise if the input is a matrix.
We consider a standard training algorithm that uses vanilla stochastic gradient descent/ascent (SGDA) to seek a solution of (3). To simplify the theoretical analysis, we consider an online (i.e., streaming) setting where each data sample $y_k$ is used only once. At step $k$, the model parameters $w_k$ and $V_k$ are updated using a new real sample $y_k$ and two fake samples $\widetilde{y}_{2k}$ and $\widetilde{y}_{2k+1}$, according to

$$w_{k+1} = w_k + \tfrac{\tau}{n} \nabla_{w_k} \mathcal{L}(y_k, \widetilde{y}_{2k}; w_k),$$
$$V_{k+1} = V_k - \tfrac{\widetilde{\tau}}{n} \nabla_{V_k} \mathcal{L}\big(y_k, \mathcal{G}(\widetilde{c}_{2k+1}, \widetilde{a}_{2k+1}; V_k, \eta_G); w_k\big), \qquad (5)$$

where $\widetilde{c}_{2k+1}, \widetilde{a}_{2k+1}$ are the random variables that generate the fake sample $\widetilde{y}_{2k+1}$ according to (2). The two parameters $\tau$ and $\widetilde{\tau}$ in the above expressions control the learning rates of the discriminator and the generator, respectively. In (5), we only consider a single-step update for $w_k$. This is a special case of Algorithm 1 in [1] with the batch size $m$ set to 1. We note that the analysis presented in this paper can be naturally extended to the mini-batch case where $m$ is a finite number.

Example 1. We define $F(\widehat{D}(x)) = \widetilde{F}(\widehat{D}(x)) = x^2/2$ and the regularizer function $H(A) = \log\cosh(A - I)$, where $I$ is the identity matrix with the same dimension as $A$, and the function $\log\cosh(\cdot)$ transforms the input matrix element-wise. We use this specific regularizer to control the magnitude of the model parameters $V$ and $w$. In practice, any convex function with its minimum reached at zero would be fine. Our choice $\log\cosh(A - I)$ here is just a convenient special case, since its derivative $H'(x) = \tanh(x)$ is smooth and bounded. Furthermore, if we let the regularization parameter $\lambda \to \infty$, the original problem (3) becomes the constrained minimax problem

$$\min_{\operatorname{diag}(V^\top V) = I_d} \; \max_{\|w\| = 1} \; \mathbb{E}_{y \sim P} \mathbb{E}_{\widetilde{y} \sim \widetilde{P}} \big[(y^\top w)^2 - (\widetilde{y}^\top w)^2\big],$$

in which the operation $\operatorname{diag}(A)$ returns a matrix whose diagonal entries are the same as those of $A$ and whose off-diagonal entries are all zero. The condition $\operatorname{diag}(V^\top V) = I_d$ ensures that each column vector of $V$ is normalized.

3 Dynamics of the GAN

Definition 1. Let $X_k \overset{\text{def}}{=} [U, V_k, w_k] \in \mathbb{R}^{n \times (2d+1)}$. We call $X_k$ the microscopic state of the training process at iteration step $k$.

The microscopic state $X_k$ contains all the information about the training process. In fact, the sequence $\{X_k\}_{k=0,1,2,\ldots}$ forms a Markov chain on $\mathbb{R}^{n \times (2d+1)}$. This can be easily verified from the update rule of $X_k$ as defined in (5), in which the real data $y_k$ and fake data $\widetilde{y}_k$ are drawn according to (1) and (2), respectively. The Markov chain is driven by the initial state $X_0$ and the sequence of random variables $\{(c_k, a_k, \widetilde{c}_{2k}, \widetilde{a}_{2k}, \widetilde{c}_{2k+1}, \widetilde{a}_{2k+1})\}_{k=0,1,2,\ldots}$.

Definition 2. Let $P_k \overset{\text{def}}{=} U^\top V_k$, $q_k \overset{\text{def}}{=} U^\top w_k$, $r_k \overset{\text{def}}{=} V_k^\top w_k$, $S_k \overset{\text{def}}{=} V_k^\top V_k$, and $z_k \overset{\text{def}}{=} w_k^\top w_k$. We call the tuple $\{P_k, q_k, r_k, S_k, z_k\}$ the macroscopic state of the Markov chain $X_k$ at step $k$.

These macroscopic quantities measure the cosine similarities among the feature vectors of the true model $U$, the generator $V_k$, and the discriminator $w_k$. For example, the cosine of the angle between the $i$th true feature (i.e., the $i$th column of $U$) and the $j$th feature estimated by the generator (i.e., the $j$th column of $V_k$) is $[P_k]_{i,j} / \sqrt{[S_k]_{j,j}}$, where $[P_k]_{i,j}$ is the inner product between the two feature vectors and $\sqrt{[S_k]_{j,j}}$ is the norm of the $j$th column of $V_k$. (The columns of $U$ are unit vectors and need not be normalized here.) For simplicity, we introduce a compact notation for the macroscopic state:

$$M_k \overset{\text{def}}{=} X_k^\top X_k = \begin{bmatrix} I & P_k & q_k \\ P_k^\top & S_k & r_k \\ q_k^\top & r_k^\top & z_k \end{bmatrix}. \qquad (6)$$

In what follows, we investigate the dynamics of the training algorithm (5) at both the macroscopic and the microscopic levels. At the macroscopic level, by examining the cosines of the angles, we study how closely the model parameters $V_k, w_k$ associated with the generator and discriminator can align with the ground-truth feature vectors, i.e., the columns of $U$. At the microscopic level, we study how the elements of the matrix $V_k$ and the vector $w_k$ evolve as a stochastic process. As our analysis will reveal, the mechanisms behind the two levels are different: the macroscopic dynamics is asymptotically deterministic, whereas the microscopic dynamics stays stochastic even as $n \to \infty$.

3.1 Macroscopic dynamics

We first study the asymptotic dynamics of the macroscopic state $M_k$.
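Concretely, one pass of the online SGDA update (5) for Example 1 (where $f(x) = \widetilde{f}(x) = x$), together with the macroscopic state of Definition 2, can be sketched as below. The explicit gradient expressions, the standard Gaussian choices for $P_c$ and $P_{\widetilde{c}}$, and the handling of the $\lambda \to \infty$ constraint by re-normalizing $w$ and the columns of $V$ are our own simplifications.

```python
import numpy as np

def sgda_step(U, V, w, tau, tau_t, eta_T, eta_G, rng):
    """One online SGDA step (5) for Example 1: gradient ascent in w,
    gradient descent in V, with the lambda -> infinity constraints
    ||w|| = 1 and diag(V'V) = I_d enforced by re-normalization."""
    n, d = U.shape
    # one real sample and two independent fake samples, as in (5)
    y   = U @ rng.standard_normal(d) + np.sqrt(eta_T) * rng.standard_normal(n)
    yf1 = V @ rng.standard_normal(d) + np.sqrt(eta_G) * rng.standard_normal(n)
    c2  = rng.standard_normal(d)
    yf2 = V @ c2 + np.sqrt(eta_G) * rng.standard_normal(n)
    # discriminator ascent on (y'w)^2/2 - (yf'w)^2/2
    w = w + (tau / n) * ((y @ w) * y - (yf1 @ w) * yf1)
    w = w / np.linalg.norm(w)
    # generator descent, using the fresh fake sample yf2
    V = V + (tau_t / n) * (yf2 @ w) * np.outer(w, c2)
    V = V / np.linalg.norm(V, axis=0)
    return V, w

def macroscopic_state(U, V, w):
    """Macroscopic state {P, q, r, S, z} of Definition 2."""
    return U.T @ V, U.T @ w, V.T @ w, V.T @ V, w @ w

rng = np.random.default_rng(0)
n, d = 500, 2
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V = rng.standard_normal((n, d)); V = V / np.linalg.norm(V, axis=0)
w = rng.standard_normal(n);      w = w / np.linalg.norm(w)
V, w = sgda_step(U, V, w, tau=0.2, tau_t=0.04, eta_T=2.0, eta_G=2.0, rng=rng)
P, q, r, S, z = macroscopic_state(U, V, w)
```

With the constraints enforced, the entries of $P$ and $q$ are exactly the cosine similarities discussed above.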
Our theoretical analysis is carried out under the following assumptions.

(A.1) The sequences $c_k \sim P_c$ and $\widetilde{c}_k \sim P_{\widetilde{c}}$ for $k = 0, 1, \ldots$ are i.i.d. random variables with bounded moments of all orders, and $\{c_k\}$ is independent of $\{\widetilde{c}_k\}$.
(A.2) The sequences $\{a_k\}$ and $\{\widetilde{a}_k\}$ for $k = 0, 1, \ldots$ are both independent Gaussian vectors with zero mean and covariance matrix $I_n$. Moreover, $\{a_k\}, \{\widetilde{a}_k\}$ are independent of $\{c_k\}$ and $\{\widetilde{c}_k\}$.
(A.3) The first-order derivative of $H(\cdot)$ and the derivatives up to fourth order of the functions $F(\widehat{D}(\cdot))$ and $\widetilde{F}(\widehat{D}(\cdot))$ exist and are uniformly bounded.
(A.4) Let $[U, V_0, w_0]$ be the initial microscopic state. For $i = 1, 2, \ldots, n$, we have $\mathbb{E}\big[\sum_{\ell=1}^{d} ([U]_{i,\ell}^4 + [V_0]_{i,\ell}^4) + [w_0]_i^4\big] \le C/n^2$, where $C$ is a constant not depending on $n$.
(A.5) The initial macroscopic state $M_0$ satisfies $\mathbb{E}\|M_0 - M_0^*\| \le C/\sqrt{n}$, where $M_0^*$ is a deterministic matrix and $C$ is a constant not depending on $n$.

We provide a few remarks on the above assumptions. In Assumption (A.1), $P_c$ and $P_{\widetilde{c}}$ can be different; for example, $c$ can be Gaussian while $\widetilde{c}$ is uniform on $[-1, 1]^d$. Assumption (A.2) can be relaxed to non-Gaussian cases as long as all moments of $a_k$ and $\widetilde{a}_k$ are bounded, but we use the Gaussian assumption here to simplify the proof. Assumption (A.4) requires that the elements of the parameter matrix $U$ of the real data and of the initial microscopic state $X_0$ are $O(1/\sqrt{n})$ numbers. Intuitively, this assumption ensures that $U$ and $X_0$ are generic matrices with $O(1)$ Frobenius norms (i.e., not matrices in which most elements are zero and only a few elements are large).
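As a quick numerical sanity check of (A.4) (our own illustration, not from the paper), a matrix with i.i.d. $\mathcal{N}(0, 1/n)$ entries satisfies this fourth-moment bound: quadrupling $n$ should shrink the row-wise fourth moments by roughly a factor of $16$.

```python
import numpy as np

rng = np.random.default_rng(1)

def fourth_moment(n, d=2, trials=200):
    """Monte Carlo estimate of E[sum_l [U]_{i,l}^4] for a generic matrix U
    with i.i.d. N(0, 1/n) entries, averaged over rows i and trials."""
    total = 0.0
    for _ in range(trials):
        U = rng.standard_normal((n, d)) / np.sqrt(n)
        total += np.mean(np.sum(U**4, axis=1))
    return total / trials

m1, m2 = fourth_moment(100), fourth_moment(400)
ratio = m1 / m2   # the O(1/n^2) scaling in (A.4) predicts a ratio near 16
```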
The assumption (A.5) ensures that the initial macroscopic state converges to a deterministic value as the system size $n$ goes to infinity. The following theorem proves that if the initial state is convergent, then the whole training process converges to a deterministic process as $n \to \infty$, which is characterized by an ODE.

Theorem 1. Fix $T > 0$. It holds under Assumptions (A.1)-(A.5) that

$$\max_{0 \le k \le nT} \mathbb{E}\big\|M_k - M\big(\tfrac{k}{n}\big)\big\| \le \frac{C(T)}{\sqrt{n}}, \qquad (7)$$

where $C(T)$ is a constant that depends on $T$ but not on $n$, and $M(t) = \begin{bmatrix} I & P_t & q_t \\ P_t^\top & S_t & r_t \\ q_t^\top & r_t^\top & z_t \end{bmatrix} \in \mathbb{R}^{(2d+1)\times(2d+1)}$ is a deterministic function. Moreover, $M(t)$ is the unique solution of the following ODE:

$$\tfrac{\mathrm{d}}{\mathrm{d}t} P_t = \widetilde{\tau}\big(q_t \widetilde{g}_t^\top + P_t L_t\big)$$
$$\tfrac{\mathrm{d}}{\mathrm{d}t} q_t = \tau\big(g_t - P_t \widetilde{g}_t + q_t h_t\big)$$
$$\tfrac{\mathrm{d}}{\mathrm{d}t} r_t = \tau\big(P_t^\top g_t - S_t \widetilde{g}_t + r_t h_t\big) + \widetilde{\tau}\big(z_t \widetilde{g}_t + L_t r_t\big) \qquad (8)$$
$$\tfrac{\mathrm{d}}{\mathrm{d}t} S_t = \widetilde{\tau}\big(r_t \widetilde{g}_t^\top + \widetilde{g}_t r_t^\top + S_t L_t + L_t S_t\big)$$
$$\tfrac{\mathrm{d}}{\mathrm{d}t} z_t = 2\tau\big(q_t^\top g_t - r_t^\top \widetilde{g}_t + z_t h_t\big) + \tau^2 b_t$$

with the initial condition $M(0) = M_0^*$, where

$$g_t = \big\langle c f(c^\top q_t + e\sqrt{z_t \eta_T}) \big\rangle_{c,e}, \quad \widetilde{g}_t = \big\langle \widetilde{c} \widetilde{f}(\widetilde{c}^\top r_t + e\sqrt{z_t \eta_G}) \big\rangle_{\widetilde{c},e}, \quad L_t = -\lambda \operatorname{diag}(H'(S_t)),$$
$$h_t = \big\langle f'(c^\top q_t + e\sqrt{z_t \eta_T}) \big\rangle_{c,e} - \big\langle \widetilde{f}'(\widetilde{c}^\top r_t + e\sqrt{z_t \eta_G}) \big\rangle_{\widetilde{c},e} - \lambda H'(z_t), \qquad (9)$$
$$b_t = \eta_T \big\langle f^2(c^\top q_t + e\sqrt{z_t \eta_T}) \big\rangle_{c,e} + \eta_G \big\langle \widetilde{f}^2(\widetilde{c}^\top r_t + e\sqrt{z_t \eta_G}) \big\rangle_{\widetilde{c},e}.$$

The two functions $f, \widetilde{f}$ stand for $f(x) = \tfrac{\mathrm{d}}{\mathrm{d}x} F(\widehat{D}(x))$ and $\widetilde{f}(x) = \tfrac{\mathrm{d}}{\mathrm{d}x} \widetilde{F}(\widehat{D}(x))$, and $f', \widetilde{f}'$ and $H'$ are the derivatives of $f, \widetilde{f}$ and $H$, respectively. The two constants $\eta_T$ and $\eta_G$ are the strengths of the noise in the true data model and the generator, respectively. The brackets $\langle \cdot \rangle_{c,e}$ and $\langle \cdot \rangle_{\widetilde{c},e}$ denote averages over the random variables $c \sim P_c$, $\widetilde{c} \sim P_{\widetilde{c}}$, and $e \sim \mathcal{N}(0, 1)$, where $P_c$ and $P_{\widetilde{c}}$ are the distributions involved in defining the generative model (1) and the generator (2).
This theorem implies that for each $k = \lfloor tn \rfloor$ with $t \in [0, T]$, the macroscopic state $M_k$ converges to the deterministic value $M(t)$, and the convergence rate is $O(1/\sqrt{n})$. The limiting ODE (8) for the macroscopic states involves $O(d^2)$ variables, where $d$ is the number of internal features, often assumed to be a finite number much smaller than $n$. This ODE is essentially different from the ODEs derived in the small-learning-rate limit [11-16], in which the number of variables is $O(n)$.
The complete proof can be found in the Supplementary Materials. We briefly sketch the proof here. First, we note that $M_k$ is a discrete-time stochastic process driven by the Markov chain $X_k$. Then, we apply the martingale decomposition to $M_k$ and get

$$M_{k+1} - M_k = \tfrac{1}{n}\phi(M_k) + (M_{k+1} - \mathbb{E}_k M_{k+1}) + \big[\mathbb{E}_k M_{k+1} - M_k - \tfrac{1}{n}\phi(M_k)\big],$$

where the matrix-valued function $\phi(M)$ represents the functions on the right-hand side of the ODE (8), and $\mathbb{E}_k$ denotes the conditional expectation given the state of the Markov chain $X_k$.
Finally, we show that the martingale $\sum_{k'=0}^{k}(M_{k'+1} - \mathbb{E}_{k'} M_{k'+1})$ and the higher-order term $\mathbb{E}_k M_{k+1} - M_k - \tfrac{1}{n}\phi(M_k)$ have no contribution as $n$ goes to infinity.

Figure 1: Macroscopic dynamics of the GAN with $d = 2$ features: $[P_k]_{i,j}$ is the cosine of the angle between the $i$th column vector of the real feature matrix $U$ and the $j$th column vector of the generator's weight matrix $V_k$. Similarly, $[q_k]_i$ is the cosine of the angle between the $i$th column vector of $U$ and the discriminator's weight vector $w_k$. Colored dots are results from experiments, and the curves tracing these dots are our theoretical predictions by the ODE (8). From left to right, the variance of the background noise is $\eta_T = \eta_G = 2, 1, 4$, respectively, with all other parameters the same. The left figure is an example of successful training, where two features (red and blue dots) are retrieved by the generator. The center figure shows oscillating training, which happens when the noise is weak. The right figure shows a mode-collapsing state, in which only the first feature is estimated by the generator.

Due to the limitation of our current proof, the constant $C(T)$ in (7) grows exponentially as $T$ increases. This is not a problem for any finite $T$, but it may cause difficulties when studying the long-time behavior as $T \to \infty$. However, if we impose a sufficiently large regularization parameter $\lambda$ to limit the norms of the microscopic weights $V_k$ and $w_k$, then the macroscopic state $M_k$ is bounded, as $[M_k]_{i,j}^2 \le [M_k]_{i,i}[M_k]_{j,j}$. In our experiments, $\lambda > 1$ is sufficient. In this case, the constant $C(T)$ is bounded independently of $T$. In Example 1, when $\lambda \to \infty$, $[M_k]_{i,i} = 1$, and therefore $[M_k]_{i,j}^2 \le 1$ and $C(T) \le (2d+1)^2$, where the number of features $d$ is considered a constant not growing with $n$.
This justifies the fixed-point analysis of the ODE discussed in Section 4, which reflects the long-time training behavior. A better proof strategy that removes this dependence on $T$ is also possible, e.g., [26].

Numerical verification. We verify the theoretical prediction given by the ODE (8) via numerical simulations under the settings stated in Example 1. The results are shown in Figure 1. The number of features is $d = 2$, and $c_k$ and $\widetilde{c}_k$ are both Gaussian with zero mean and covariance $\operatorname{diag}([5, 3])$. The dimension is $n = 5{,}000$, and the learning rates of the generator and discriminator are $\widetilde{\tau} = 0.04$ and $\tau = 0.2$, respectively. After testing different noise strengths $\eta_T = \eta_G = 2, 1, 4$, we have observed at least three nontrivial dynamical patterns: success, oscillation, and mode collapse. In all these experiments, our theoretical predictions match the actual trajectories of the macroscopic states very well.
Let us take a closer look at the successful case shown in the left panel of Figure 1. The dynamics can be split into four stages. In the first stage, the discriminator learns the first feature of the true model: $[q_t]_1^2$ quickly increases. In the second stage, the generator starts to learn the first feature and the discriminator is deceived: $[P_t]_{1,1}^2$ increases and $[q_t]_1^2$ decreases. Once the discriminator completely forgets the first feature, i.e., $[q_t]_1 \approx 0$, the third stage begins: the discriminator starts to learn the second feature and $[q_t]_2^2$ increases. Then, in the last stage, the generator learns the second feature and the discriminator is fooled again: $[P_t]_{2,2}^2$ increases and $[q_t]_2^2$ decreases to 0. Eventually, the generator learns both features and the discriminator is completely fooled. The process ends up at a stationary state where $q_t = 0$ and $P_t$ is nearly an identity matrix.
Interestingly, this experiment shows that the generator learns features sequentially given a single-feature discriminator. This may be a reason why, in practice, the discriminator's structure can be much simpler than the generator's.

3.2 Microscopic dynamics

In this section, we study how the elements of $X_k = [U, V_k, w_k]$ evolve during the training process. Instead of studying the trajectory of $X_k$ directly, we study the evolution of the empirical measure of the microscopic states, which is defined as

$$\mu_k(\widehat{u}, \widehat{v}, \widehat{w}) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} \delta\big(\big[\widehat{u}^\top, \widehat{v}^\top, \widehat{w}\big] - \sqrt{n}\,\big[[U]_{i,:}, [V_k]_{i,:}, [w_k]_i\big]\big),$$

where $\delta(\cdot)$ is a Dirac measure on $\mathbb{R}^{2d+1}$ and $[U]_{i,:}, [V_k]_{i,:}$ are the $i$th rows of $U$ and $V_k$, respectively. The scaling factor $\sqrt{n}$ in the Dirac measures is introduced because $[U]_{i,\ell}$, $[V_k]_{i,\ell}$, and $[w_k]_i$ are $O(1/\sqrt{n})$ quantities.
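The empirical measure is simply the point cloud of $\sqrt{n}$-rescaled rows of $X_k$, and the overlaps of Definition 2 become plain averages over that cloud. A small self-contained sketch (with arbitrary randomly drawn matrices, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 2
# O(1/sqrt(n)) microscopic state, in the spirit of (A.4)
U = rng.standard_normal((n, d)) / np.sqrt(n)
V = rng.standard_normal((n, d)) / np.sqrt(n)
w = rng.standard_normal(n) / np.sqrt(n)

# one particle (u_hat, v_hat, w_hat) per row: the support of mu_k
cloud = np.sqrt(n) * np.hstack([U, V, w[:, None]])      # shape (n, 2d+1)
u_hat, v_hat, w_hat = cloud[:, :d], cloud[:, d:2*d], cloud[:, -1]

# macroscopic overlaps of Definition 2 recovered as cloud averages
P_cloud = (u_hat[:, :, None] * v_hat[:, None, :]).mean(axis=0)  # <u v^T>
q_cloud = (u_hat * w_hat[:, None]).mean(axis=0)                 # <u w>
```

The averages reproduce $P_k = U^\top V_k$ and $q_k = U^\top w_k$ exactly, since the $\sqrt{n}$ rescaling cancels the $1/n$ in the empirical average.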
The functions gt,(cid:101)gt, Lt, ht and bt\nP t = (cid:104)\u00b5t,(cid:98)u(cid:98)v\nwhere (cid:104)\u00b5t,\u00b7(cid:105) denotes the expectation with respect to the measure \u00b5t.\nThe SDE (10) shows the intuitive meaning of the functions de\ufb01ned in (9): gt,(cid:101)gt, Lt, ht are drift\ncoef\ufb01cients of the SDE and bt is the diffusion coef\ufb01cient of the SDE. We also note that if one follows\nthe analysis in the small-learning-rate limit [11\u201316], one will get an ODE for the microscopic states.\nCompared to our SDE formula, the diffusion term \u03c4\u221abtdBt is missing in those works, and therefore\nthe effect of the noise can not be analyzed.\nMoreover, the deterministic measure \u00b5t is unique solution of the following PDE (given in its weak\n\nare de\ufb01ned in (9), in which the macroscopic quantities P t, St, qt, zt, rt are computed as follows\n(11)\n\nt (cid:101)gt + (cid:98)wtht(cid:1) dt + \u03c4(cid:112)bt dBt\nzt = (cid:104)\u00b5t,(cid:98)w2(cid:105),\n\nt gt +(cid:98)v\n(cid:105), qt = (cid:104)\u00b5t,(cid:98)u(cid:98)w(cid:105),\n\nrt = (cid:104)\u00b5t,(cid:98)v(cid:98)w(cid:105),\n\n(10)\n\nLt(cid:1)\u2207(cid:98)v\u03d5(cid:11) + \u03c4(cid:10)\u00b5t,(cid:0)(cid:98)u\n\nform): for any bounded smooth test function \u03d5((cid:98)u,(cid:98)v,(cid:98)w),\ndt(cid:10)\u00b5t, \u03d5((cid:98)u,(cid:98)v,(cid:98)w)(cid:11) =\n(cid:101)\u03c4(cid:10)\u00b5t,(cid:0)(cid:98)w(cid:101)g\n(cid:101)gt + ht(cid:98)w(cid:1) \u2202\ngt \u2212(cid:98)v\nt +(cid:98)v\nwhere qt, rt, St, and zt are de\ufb01ned in (11), and the functions gt,(cid:101)gt, bt, ht and Lt are de\ufb01ned in (9).\n, (cid:98)w2, and\nthe weak formulation of the PDE. 
Let \u03d5 being each element of (cid:98)u(cid:98)v\n\nWe refer readers to [17] for a general framework for rigorously establishing the above scaling limit.\nThe connection between the microscopic and macroscopic dynamics can also be derived from\n\nsubstituting those \u03d5 into the PDE (12), we can derive the ODE (8). In the setting of this paper, the\nmacroscopic dynamics enjoys a closed ODE: We can predict the macroscopic states without solving\nthe PDE nor SDE at microscopic scale. However, in a more general setting, e.g. when we add a\nregularizer other than the L2 type, the ODE itself may not be closed. In that case, one has to solve\nthe PDE directly.\n\n\u2202(cid:98)w2 \u03d5(cid:11)\n2 bt(cid:10)\u00b5t, \u22022\n\n, (cid:98)u(cid:98)w, (cid:98)v(cid:98)w, (cid:98)v(cid:98)v\n\n\u2202(cid:98)w \u03d5(cid:11) + \u03c4 2\n\n(12)\n\n(cid:62)\n\n(cid:62)\n\n(cid:62)\n\n(cid:62)\n\n(cid:62)\n\n(cid:62)\n\n(cid:62)\n\nd\n\nNumerical veri\ufb01cation. We verify the predictions given by the PDE (12) by setting d = 1 using\na special choice of the (n \u00d7 1)-dimensional target feature matrix U whose elements are all 1/\u221an\nwith n = 10, 000. We also set the initial condition \u00b50((cid:98)v,(cid:98)w|(cid:98)u = 1) to be a Gaussian distribution.\n(When d = 1, the macroscopic quantities Pt, qt, rt, St reduce to scalars, so we remove their boldface\nhere.) In this case, the PDE (12) admits a particularly simple analytical solution: at any time t, the\nsolution \u00b5t((cid:98)v,(cid:98)w|(cid:98)u = 1) is a Gaussian distribution whose mean and covariance matrix are given by\nzt(cid:21) . Figure 2 overlays the contours\nE\u00b5t((cid:98)v,(cid:98)w|(cid:98)u=1)(cid:20)(cid:98)v\nof the probability distribution \u00b5t((cid:98)v,(cid:98)w|(cid:98)u = 1) at different times t over the point clouds of the actual\n\nexperiment data (\u221an[wk]i,\u221an[V k]i,1). 
We can see that the theoretical prediction given by (12) has\nexcellent agreement with simulation results.\n\n(cid:98)w(cid:21)(cid:2)(cid:98)v\nqt(cid:21) , E\u00b5t((cid:98)v,(cid:98)w|(cid:98)u=1)(cid:20)(cid:98)v\n\n(cid:98)w(cid:21) =(cid:20)Pt\n\n(cid:98)w(cid:3) =(cid:20)St\n\nrt\n\nrt\n\n4 Local Stability Analysis of the ODE for the Macroscopic States\n\nIn this section, we study how the parameters, such as the learning rates \u03c4 and(cid:101)\u03c4, noise strength \u03b7G\n\nand \u03b7T affect the training algorithm. We will focus on the concrete model as described in Example 1\nso that we can have analytical solutions.\nIn order to further reduce the degrees of freedom of the ODE (8), we let the regularization parameter\n\u03bb \u2192 \u221e. In this case, the vector wk and all columns vectors of V k are always normalized. Thus\n\n7\n\n\fFigure 2: The evolution of the microscopic states at t = 0, 10, 100, and 150. For each \ufb01xed t, the\ni = 1, 2, . . . , n, where k = (cid:98)nt(cid:99). The blue ellipses illustrate the contours corresponding to one, two,\nand three standard deviations of the 2-D Gaussian distribution predicted by the PDE (12).\n\nred points in the corresponding \ufb01gure represent the values of ((cid:98)v,(cid:98)w) = (\u221an[V k]i,1,\u221an[wk]i) for\n\nd\n\nd\n\nd\n\nd\n\nzk = 1 and [S]i,i = 1. The macroscopic state is then described by P k, qk, rk and off-diagonal terms\nof Sk. 
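As an aside to the numerical verification above, the d = 1 experiment can be reproduced qualitatively with a short particle simulation of the SDE (10). The sketch below is not from the paper: it assumes the coefficient forms implied by the λ → ∞ reduction (13)–(14) below (g_t = Λq_t, g̃_t = Λ̃r_t) and a constant diffusion coefficient b_t = b; all parameter values are illustrative choices.

```python
import numpy as np

def simulate_particles(n=5000, T=10.0, dt=0.01, tau=0.2, tau_tilde=0.02,
                       lam=1.0, lam_tilde=1.0, eta_T2=1.0, eta_G2=1.0,
                       b=1.0, seed=0):
    """Euler-Maruyama simulation of the d = 1 microscopic SDE (10).

    Assumptions (for illustration only): the coefficient functions take the
    forms implied by the lambda -> infinity reduction (13)-(14),
        g_t = Lambda * q_t,  g~_t = Lambda~ * r_t,
        L_t = -Lambda~ * r_t^2,  h_t as in (14),
    and the diffusion coefficient b_t is a constant b; the general
    definitions are in (9) of the paper and are not reproduced here.
    """
    rng = np.random.default_rng(seed)
    # Gaussian initial condition for the microscopic states (u-hat = 1)
    v = 0.2 + 0.05 * rng.standard_normal(n)   # v-hat: generator weights
    w = 0.2 + 0.05 * rng.standard_normal(n)   # w-hat: discriminator weights
    eta2 = 0.5 * (eta_T2 + eta_G2)
    for _ in range(int(T / dt)):
        # macroscopic quantities (11) as empirical averages over particles
        q, r = w.mean(), (v * w).mean()
        g, g_tilde = lam * q, lam_tilde * r
        L = -lam_tilde * r * r
        h = ((1 - tau * eta_G2 / 2) * lam_tilde * r * r
             - (1 + tau * eta_T2 / 2) * lam * q * q - tau * eta2)
        # one Euler-Maruyama step of (10); only w-hat carries diffusion
        dv = tau_tilde * (g_tilde * w + L * v) * dt
        dw = (tau * (g - v * g_tilde + h * w) * dt
              + tau * np.sqrt(b * dt) * rng.standard_normal(n))
        v, w = v + dv, w + dw
    return v.mean(), w.mean(), (v * w).mean()   # (P_T, q_T, r_T)

# Self-averaging: two runs with independent noise realizations give nearly
# identical macroscopic endpoints, since the diffusion term acts on
# individual particles and averages out as n grows.
P1, q1, r1 = simulate_particles(seed=1)
P2, q2, r2 = simulate_particles(seed=2)
```

The point of the two-seed comparison is the scaling-limit claim itself: the microscopic trajectories differ between runs, but the empirical moments P_t, q_t, r_t concentrate as n grows.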
Correspondingly, the ODE in Theorem 1 reduces to

$$
\begin{cases}
\frac{\mathrm{d}}{\mathrm{d}t} P_t = \tilde{\tau}\big(q_t r_t^{\top}\tilde{\Lambda} + P_t L_t\big) \\[2pt]
\frac{\mathrm{d}}{\mathrm{d}t} q_t = \tau\big(\Lambda q_t - P_t\tilde{\Lambda} r_t + h_t q_t\big) \\[2pt]
\frac{\mathrm{d}}{\mathrm{d}t} r_t = \tau\big(P_t^{\top}\Lambda q_t - S_t\tilde{\Lambda} r_t + h_t r_t\big) + \tilde{\tau}\big(\tilde{\Lambda} + L_t\big) r_t \\[2pt]
\frac{\mathrm{d}}{\mathrm{d}t} S_t = \tilde{\tau}\big(r_t r_t^{\top}\tilde{\Lambda} + \tilde{\Lambda} r_t r_t^{\top} + S_t L_t + L_t S_t\big)
\end{cases}
\qquad (13)
$$

where Λ and Λ̃ are the covariance matrices of the distributions P_c and P_c̃, respectively, and

$$
h_t = \Big(1 - \tfrac{\tau\eta_G^2}{2}\Big) r_t^{\top}\tilde{\Lambda} r_t
- \Big(1 + \tfrac{\tau\eta_T^2}{2}\Big) q_t^{\top}\Lambda q_t
- \tfrac{\tau(\eta_G^2 + \eta_T^2)}{2},
\qquad
L_t = -\mathrm{diag}\big(r_t r_t^{\top}\tilde{\Lambda}\big),
\qquad (14)
$$

in which η_T and η_G are the noise strengths in the true data model and the generator, respectively. The derivation from the ODE (8) to (13) is presented in the Supplementary Materials.

Next, we discuss under what conditions the GAN can reach a desirable training state by studying the local stability of a particular type of fixed points of the ODE (13). Perfect estimation by the generator corresponds to P_t being an identity matrix (up to a permutation of rows and columns). A complete failure state corresponds to P = 0. Furthermore, it is easy to verify that if q_t = r_t = 0, the ODE (13) is stable for any P_t = P.

Claim 1. The macroscopic states P_t, q = r = 0 for all valid P_t are always fixed points of the ODE (13). Furthermore, a sufficient condition for the perfect estimation state P_t = I, q = r = 0 to be locally stable and the failure state P_t = 0, q = r = 0 to be unstable is

$$
\frac{1}{2}\max_{\ell}\big\{\Lambda_{\ell} - \tilde{\Lambda}_{\ell} + \alpha\tilde{\Lambda}_{\ell}\big\}
\;\le\; \tau\eta^2 \;<\; \min_{\ell}\Lambda_{\ell},
\qquad (15)
$$

where α = τ̃/τ, η² = (η_T² + η_G²)/2, and Λ_ℓ = [Λ]_{ℓ,ℓ}, Λ̃_ℓ = [Λ̃]_{ℓ,ℓ}.

The proof can be found in the Supplementary Materials. If the right inequality in (15) is violated, any feature ℓ with signal-to-noise ratio [Λ]_{ℓ,ℓ} < τη² is not learned by the generator, resulting in mode collapse. The right panel of Figure 1 demonstrates this situation, where only one of the two features is recovered. If the left inequality in (15) is violated, the training process can be trapped in an oscillation phase. This phenomenon is shown in the middle panel of Figure 1. This result indicates that proper background noise can help avoid oscillation and stabilize the training process. In fact, the trick of injecting additional noise has been used in practice to train multi-layer GANs [27]. To the best of our knowledge, our paper is the first theoretical study of why noise can have such a positive effect from a dynamic perspective.

In experiments, the training does not end at the perfect recovery point, due to the presence of the noise, but converges to another fixed point nearby. This is because the perfect state is only marginally stable, as the Jacobian matrix always has zero eigenvalues. It indicates that there are other locally stable fixed points near P = I. In fact, all points in the hyper-rectangle region satisfying q = r = 0 and |p*_ℓ| ≤ |[P]_{ℓ,ℓ}| ≤ 1, ∀ℓ = 1, 2, ..., d, are locally stable, for the critical value

$$
p^{*}_{\ell} = \Big[\big(\Lambda_{\ell} - \tau\eta^2\big)\big(\tilde{\Lambda}_{\ell} + \tau\eta^2 - \alpha\tilde{\Lambda}_{\ell}\big)\big/\big(\Lambda_{\ell}\tilde{\Lambda}_{\ell}\big)\Big]^{1/2},
$$

where α and η² are as defined in (15). In the matched case when Λ_ℓ = Λ̃_ℓ, this simplifies to p*_ℓ = [(1 − τη²/Λ_ℓ)(1 − α + τη²/Λ_ℓ)]^{1/2}. Starting from a point near the origin, numerical solution of the ODE shows that the training process ends up at the corner of this hyper-rectangle, i.e., P* = diag({p*_ℓ, ℓ = 1, 2, ..., d}). In the small-learning-rate limit τ → 0 with learning-rate ratio α → 0, we get perfect recovery P* = I. The limit τ → 0, α → 0 was studied in the small-learning-rate analysis with two-time-scale updates [15], and the result is consistent, but our analysis also covers the situations with finite τ and α.

In addition, we provide a phase-diagram analysis in the single-feature case d = 1 in the Supplementary Materials.
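The single-feature case can also be explored numerically. The sketch below integrates the d = 1 reduction of the ODE (13) (with S_t = z_t = 1) by forward Euler and checks two consequences of Claim 1; the parameter values are illustrative choices, not values from the paper.

```python
import numpy as np

def rhs(state, tau, tau_tilde, lam, lam_tilde, eta_T2, eta_G2):
    """Right-hand side of the d = 1 reduction of the ODE (13),
    with S_t = z_t = 1 (lambda -> infinity); state = (P, q, r)."""
    P, q, r = state
    eta2 = 0.5 * (eta_T2 + eta_G2)
    L = -lam_tilde * r * r                                   # L_t from (14)
    h = ((1 - tau * eta_G2 / 2) * lam_tilde * r * r
         - (1 + tau * eta_T2 / 2) * lam * q * q - tau * eta2)  # h_t from (14)
    dP = tau_tilde * (q * r * lam_tilde + P * L)
    dq = tau * (lam * q - P * lam_tilde * r + h * q)
    dr = (tau * (P * lam * q - lam_tilde * r + h * r)
          + tau_tilde * (lam_tilde + L) * r)
    return np.array([dP, dq, dr])

def integrate(state0, T, dt=0.01, **p):
    s = np.array(state0, dtype=float)
    for _ in range(int(T / dt)):
        s = s + rhs(s, **p) * dt        # forward Euler step
    return s

# Illustrative parameter choices (not from the paper).
good = dict(tau=0.2, tau_tilde=0.02, lam=1.0, lam_tilde=1.0,
            eta_T2=1.0, eta_G2=1.0)    # (15) holds: 0.05 <= tau*eta^2 = 0.2 < 1
noisy = dict(tau=0.2, tau_tilde=0.02, lam=1.0, lam_tilde=1.0,
             eta_T2=6.0, eta_G2=6.0)   # tau*eta^2 = 1.2 > Lambda: noise too strong

# Claim 1: any state with q = r = 0 is a fixed point, for arbitrary P.
residual = np.abs(rhs(np.array([0.37, 0.0, 0.0]), **good)).max()

# Noise too strong (right inequality of (15) violated): the feature is not
# learned; q and r decay to zero and P stays far from 1.
P_n, q_n, r_n = integrate([0.1, 0.1, 0.1], T=300.0, **noisy)

# Condition (15) satisfied: the overlap q grows away from the failure state.
P_g, q_g, r_g = integrate([0.1, 0.1, 0.1], T=5.0, **good)
```

Sweeping τη² across the two bounds in (15) with this routine reproduces the three regimes discussed above (recovery, oscillation, mode collapse).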
All possible fixed points in this case are enumerated and their local stability is analyzed. This helps us understand the successful-recovery condition (15), which is the intersection of the informative phases in which each feature can be recovered individually.

5 Conclusion

We present a simple high-dimensional model of GAN with an exactly analyzable training process. Using the tool of scaling limits of stochastic processes, we show that the macroscopic state associated with the training process converges to a deterministic process characterized as the unique solution of an ODE, whereas the microscopic state remains stochastic and is described by an SDE, whose time-varying probability measure is described by a limiting PDE.

Indeed, it is a common picture in statistical physics that the macroscopic states of large systems tend to converge to deterministic values due to self-averaging. These notions, especially mean-field dynamics, have been applied to analyzing neural networks in both shallow [19, 20] and deep models [28]. However, this mean-field regime was not considered in previous analyses of GANs. For example, a series of recent works, e.g., [11–16], considers a different scaling regime where the learning rate goes to zero but the system dimension n stays fixed. In that regime, the microscopic dynamics is deterministic even in the presence of the microscopic noise. In contrast, we study the regime where the learning rate is fixed but the dimension n → ∞. This setting allows us to quantify the effect of training noise on the learning dynamics.

In this paper, we only consider a linear generator with a latent variable c̃ drawn from a fixed distribution P_c̃, but our analysis can be extended to a more complex nonlinear model with a learnable latent-variable distribution. Specifically, in order to compute derivatives w.r.t. P_c̃, the latent variable c̃ ∼ P_c̃ should be reparameterized by a deterministic function c̃ = f(z; θ), where θ is a learnable parameter and z is a random variable drawn from a simple, fixed distribution. For example, a Gaussian mixture with L equal-probability modes can be parameterized by

$$
\tilde{c} = \sum_{\ell=1}^{L} \big(\mu_{\ell} + \Sigma_{\ell}\,\epsilon_{\ell}\big)\beta_{\ell},
$$

where µ_ℓ and Σ_ℓ are two learnable parameters representing the mean and covariance of the ℓth mode, respectively; ε ∼ N(0, I); and β_ℓ is a random indicator variable such that exactly one of β_1, ..., β_L equals 1 and the others are 0. In practice, f(z; θ) is implemented by a multilayer neural network. Our analysis can be naturally extended to this model as long as the dimensions of c̃ and θ stay finite as the data dimension n goes to infinity. More challenging situations, where the dimension of θ is proportional to n, will be explored in future work.

Although our analysis is carried out in the asymptotic setting, numerical experiments show that our theoretical predictions accurately capture the actual performance of the training algorithm at moderate dimensions. Our analysis also reveals several different phases of the training process that depend strongly on the choice of the learning rates and the noise strength, and it gives a condition on the learning rates and the noise strength for successful training: violating this condition results in either oscillation or mode collapse.
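As a concrete illustration of the reparameterization discussed above, the following sketch samples the equal-probability Gaussian-mixture latent variable; the function name, argument shapes, and parameter values are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sample_latent(mu, sig, m, rng):
    """Reparameterized Gaussian-mixture latent variable,
        c~ = sum_l (mu_l + Sigma_l eps_l) beta_l,
    with L equal-probability modes.  Shapes: mu is (L, d), sig is (L, d, d).
    Note that Sigma_l enters as a linear factor, so it plays the role of a
    square-root factor of the mode covariance."""
    L, d = mu.shape
    modes = rng.integers(L, size=m)        # beta: one-hot mode indicator
    eps = rng.standard_normal((m, d))      # eps ~ N(0, I)
    # the map is deterministic given (modes, eps), so gradients w.r.t. the
    # learnable parameters (mu, sig) are well defined
    return mu[modes] + np.einsum('mij,mj->mi', sig[modes], eps)

rng = np.random.default_rng(0)
mu = np.array([[2.0, 0.0], [-2.0, 0.0]])   # two well-separated modes
sig = np.stack([0.1 * np.eye(2)] * 2)
c = sample_latent(mu, sig, m=20000, rng=rng)
```

Because the randomness is isolated in (modes, eps), derivatives with respect to µ_ℓ and Σ_ℓ pass through the sampler, which is exactly what the extension sketched above requires.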
Despite its simplicity, the proposed model of GAN provides a new perspective and some insights for the study of more realistic models and more involved training algorithms.

Acknowledgments This work was supported by the US Army Research Office under contract W911NF-16-1-0265 and by the US National Science Foundation under grants CCF-1319140, CCF-1718698, and CCF-1910410.

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[2] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein Generative Adversarial Networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1–32.
[3] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, "Are GANs created equal? A large-scale study," in Advances in Neural Information Processing Systems, 2018, pp. 698–707.
[4] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[5] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[6] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative Adversarial Text to Image Synthesis," in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 1060–1069.
[7] S. Arora, R. Ge, Y. Liang, and Y. Zhang, "Generalization and Equilibrium in Generative Adversarial Nets," in International Conference on Machine Learning, 2017, pp. 224–232.
[8] M. Arjovsky and L. Bottou, "Towards Principled Methods for Training Generative Adversarial Networks," arXiv preprint arXiv:1701.04862, 2017.
[9] S. Feizi, C. Suh, F. Xia, and D. Tse, "Understanding GANs: the LQG Setting," arXiv preprint arXiv:1710.10793, 2017.
[10] J. Li, A. Madry, J. Peebles, and L. Schmidt, "Towards Understanding the Dynamics of Generative Adversarial Networks," arXiv preprint arXiv:1706.09884, 2017.
[11] L. Mescheder, A. Geiger, and S. Nowozin, "Which training methods for GANs do actually converge?" in International Conference on Machine Learning, 2018, pp. 3478–3487.
[12] L. Mescheder, S. Nowozin, and A. Geiger, "The numerics of GANs," in Advances in Neural Information Processing Systems, 2017, pp. 1823–1833.
[13] V. Nagarajan and J. Z. Kolter, "Gradient descent GAN optimization is locally stable," in Advances in Neural Information Processing Systems, 2017, pp. 5591–5600.
[14] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann, "Stabilizing Training of Generative Adversarial Networks through Regularization," in Advances in Neural Information Processing Systems, 2017, pp. 2015–2025.
[15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium," in Advances in Neural Information Processing Systems, 2017, pp. 6629–6640.
[16] E. V. Mazumdar, M. I. Jordan, and S. S. Sastry, "On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games," arXiv preprint arXiv:1901.00838, 2019.
[17] C. Wang, J. Mattingly, and Y. M. Lu, "Scaling Limit: Exact and Tractable Analysis of Online Learning Algorithms with Applications to Regularized Regression and PCA," arXiv preprint arXiv:1712.04332, 2017.
[18] G. O. Roberts, A. Gelman, and W. R. Gilks, "Weak convergence and optimal scaling of random walk Metropolis algorithms," Annals of Applied Probability, vol. 7, no. 1, pp. 110–120, 1997.
[19] D. Saad and S. A. Solla, "Exact Solution for On-Line Learning in Multilayer Neural Networks," Physical Review Letters, vol. 74, no. 21, pp. 4337–4340, 1995.
[20] M. Biehl and H. Schwarze, "Learning by on-line gradient descent," Journal of Physics A, vol. 28, no. 3, pp. 643–656, 1995.
[21] C. Wang and Y. M. Lu, "Online Learning for Sparse PCA in High Dimensions: Exact Dynamics and Phase Transitions," in IEEE Information Theory Workshop (ITW), 2016, pp. 186–190.
[22] C. Wang, Y. C. Eldar, and Y. M. Lu, "Subspace estimation from incomplete observations: A high-dimensional analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 6, pp. 1240–1252, Dec. 2018.
[23] C. Wang and Y. M. Lu, "The Scaling Limit of High-Dimensional Online Independent Component Analysis," in Advances in Neural Information Processing Systems, 2017, pp. 6641–6650.
[24] S. Mei, A. Montanari, and P.-M. Nguyen, "A Mean Field View of the Landscape of Two-Layers Neural Networks," arXiv preprint arXiv:1804.06561, 2018.
[25] I. Johnstone and A. Lu, "On consistency and sparsity for principal components analysis in high dimensions," Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009.
[26] B. Jourdain, T. Lelièvre, and B. Miasojedow, "Optimal scaling for the transient phase of Metropolis–Hastings algorithms: The longtime behavior," Bernoulli, vol. 20, no. 4, pp. 1930–1978, 2014.
[27] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, "Amortised MAP inference for image super-resolution," in International Conference on Learning Representations, 2017.
[28] P.-M. Nguyen, "Mean field limit of the learning dynamics of multilayer neural networks," arXiv preprint arXiv:1902.02880, 2019.