{"title": "PCA of high dimensional random walks with comparison to neural network training", "book": "Advances in Neural Information Processing Systems", "page_first": 10307, "page_last": 10316, "abstract": "One technique to visualize the training of neural networks is to perform PCA on the parameters over the course of training and to project to the subspace spanned by the first few PCA components. In this paper we compare this technique to the PCA of a high dimensional random walk. We compute the eigenvalues and eigenvectors of the covariance of the trajectory and prove that in the long trajectory and high dimensional limit most of the variance is in the first few PCA components, and that the projection of the trajectory onto any subspace spanned by PCA components is a Lissajous curve. We generalize these results to a random walk with momentum and to an Ornstein-Uhlenbeck processes (i.e., a random walk in a quadratic potential) and show that in high dimensions the walk is not mean reverting, but will instead be trapped at a fixed distance from the minimum. We finally analyze PCA projected training trajectories for: a linear model trained on CIFAR-10; a fully connected model trained on MNIST; and ResNet-50-v2 trained on Imagenet. In all cases, both the distribution of PCA eigenvalues and the projected trajectories resemble those of a random walk with drift.", "full_text": "PCA of high dimensional random walks with\n\ncomparison to neural network training\n\nJoseph M. Antognini\u2217\n\nWhisper AI\n\nJascha Sohl-Dickstein\n\nGoogle Brain\n\njoe.antognini@gmail.com\n\njaschasd@google.com\n\nAbstract\n\nOne technique to visualize the training of neural networks is to perform PCA on\nthe parameters over the course of training and to project to the subspace spanned by\nthe \ufb01rst few PCA components. In this paper we compare this technique to the PCA\nof a high dimensional random walk. 
We compute the eigenvalues and eigenvectors of the covariance of the trajectory and prove that in the long trajectory and high dimensional limit most of the variance is in the first few PCA components, and that the projection of the trajectory onto any subspace spanned by PCA components is a Lissajous curve. We generalize these results to a random walk with momentum and to an Ornstein-Uhlenbeck process (i.e., a random walk in a quadratic potential) and show that in high dimensions the walk is not mean reverting, but will instead be trapped at a fixed distance from the minimum. We finally analyze PCA projected training trajectories for: a linear model trained on CIFAR-10; a fully connected model trained on MNIST; and ResNet-50-v2 trained on Imagenet. In all cases, both the distribution of PCA eigenvalues and the projected trajectories resemble those of a random walk with drift.

1 Introduction

Deep neural networks (NNs) are extremely high dimensional objects. A popular deep NN for image recognition tasks, ResNet-50 (He et al., 2016), has ~25 million parameters, for example, and it is common for language models to have more than one billion parameters (Jozefowicz et al., 2016). This overparameterization may be responsible for NNs' impressive generalization performance (Novak et al., 2018). At the same time, the high dimensional nature of NNs makes them difficult to reason about.

Over the decades of NN research, the common lore about the geometry of the loss landscape of NNs has changed dramatically. In the early days of NN research it was believed that NNs were difficult to train because they tended to get stuck in suboptimal local minima. Later, Dauphin et al. (2014) argued that the true scourge of NN optimization was saddle points, not local minima. Choromanska et al. (2015) further used a spherical spin-glass model to conjecture that local minima of NNs are not much worse than global minima.
Baity-Jesi et al. (2018) showed that in the typical case of an over-parameterized NN the dynamics of NN optimization are different from glassy systems, and claimed that the difficulties with NN optimization were instead due to vast plateaus where the gradient is very small. There has also been active debate as to whether the geometry of the loss landscape around minima can inform the NN's ability to generalize. Hochreiter & Schmidhuber (1997) and Keskar et al. (2017) have claimed that NNs that generalize better tend to find flatter minima, though Dinh et al. (2017) countered that due to the scale-free nature of NNs, there always exist sharp minima that generalize equally well.

∗Work done as a Google AI Resident.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

To help resolve these questions, we would ideally like to be able to visualize the loss landscapes of NNs, but this is a difficult, perhaps even futile, task because it involves embedding an extremely high dimensional space into very few dimensions, typically one or two. Goodfellow et al. (2015) introduced a visualization technique in which the loss is plotted along a straight line from the initial point to the final point of training (the "royal road"). The authors found that the loss often decreased monotonically along this path. They further considered the loss on the residuals between the NN's trajectory and this royal road (note that while this is a two-dimensional manifold, it is not a linear subspace). Lorch (2016) and Lipton (2016) proposed another visualization technique in which principal component analysis (PCA) is performed on the NN trajectory and the trajectory is projected into the subspace spanned by the lowest PCA components. Lipton (2016) noted that most of the variance was in a small number of PCA components. Li et al.
(2018) explored this technique in more depth by plotting 2-dimensional cross-sections of the loss landscape spanned by the first two PCA components.

In this paper we consider the theory behind this visualization technique. We show that PCA projections of random walks in flat space qualitatively have many of the same properties as projections of NN training trajectories. We then generalize these results to a random walk with momentum and a random walk in a quadratic potential, also known as an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930). This process is more similar to NN optimization since it consists of a deterministic component (the true gradient) plus a stochastic component. In fact, recent work has suggested that stochastic gradient descent (SGD) approximates a random walk in a quadratic potential (Ahn et al., 2012; Mandt et al., 2016; Smith & Le, 2018). Finally, we perform experiments on linear models and large NNs to show how closely they match this simplified model.

The approach we take to study the properties of the PCA of high dimensional random walks in flat space follows that of Moore et al. (2018), but we correct several errors in their argument, notably in the values of the matrix $S^T S$ and the trace of $(S^T S)^{-1}$ in Eq. 10. We also fill in some critical omissions, particularly the connection between banded Toeplitz matrices and circulant matrices.
We extend their contribution by proving that the trajectories of high dimensional random walks in PCA subspaces are Lissajous curves and by generalizing to random walks with momentum and Ornstein-Uhlenbeck processes.

2 PCA of random walks in flat space

2.1 Preliminaries

Let us consider a random walk in d-dimensional space consisting of n steps where every step is equal to the previous step plus a sample from an arbitrary probability distribution, P, with zero mean and a finite covariance matrix.² For simplicity we shall assume that the covariance matrix has been normalized so that its trace is 1.

²The case of a constant non-zero mean corresponds to a random walk with a constant drift term. This is not an especially interesting extension from the perspective of PCA because in the limit of a large number of steps the first PCA component will simply pick out the direction of the drift (i.e., the mean), and the remaining PCA components will behave as a random walk without a drift term.

This process can be written in the form

\[
x_t = x_{t-1} + \xi_t, \qquad \xi_t \sim P, \tag{1}
\]

where $x_t$ is a d-dimensional vector and $x_0 = 0$. If we collect the $x_t$s together in an $n \times d$ dimensional design matrix $X$, we can then write this entire process in matrix form as

\[
SX = R, \tag{2}
\]

where the matrix $S$ is an $n \times n$ matrix consisting of 1 along the diagonal and $-1$ along the subdiagonal,

\[
S \equiv \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
-1 & 1 & 0 & \cdots & 0 \\
0 & -1 & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & -1 & 1
\end{pmatrix}, \tag{3}
\]

and the matrix $R$ is an $n \times d$ matrix where every row is a sample from P.
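The matrix form of the walk can be checked numerically. This is a minimal sketch, assuming an isotropic Gaussian step distribution and illustrative values of $n$ and $d$ (neither is specified by the construction itself):

```python
import numpy as np

# The setup of Eqs. (1)-(3): an n-step random walk in d dimensions,
# written in matrix form as S X = R.
rng = np.random.default_rng(0)
n, d = 200, 2000

# Steps xi_t drawn from an isotropic Gaussian whose covariance has trace 1.
R = rng.normal(scale=np.sqrt(1.0 / d), size=(n, d))

# X stacks the positions x_t = x_{t-1} + xi_t (with x_0 = 0) as rows.
X = np.cumsum(R, axis=0)

# S has 1 on the diagonal and -1 on the subdiagonal, so (S X)_t = x_t - x_{t-1}.
S = np.eye(n) - np.eye(n, k=-1)

assert np.allclose(S @ X, R)   # the walk satisfies S X = R
# S^{-1} is lower triangular with ones on and below the diagonal,
# i.e., the cumulative-sum operator.
assert np.allclose(np.linalg.inv(S), np.tril(np.ones((n, n))))
```

The second assertion previews the trace computation used later: since $S^{-1}$ is the cumulative-sum operator, $X = S^{-1}R$ recovers the positions from the steps.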
Thus $X = S^{-1}R$.

To perform PCA, we need to compute the eigenvalues and eigenvectors of the covariance matrix $\hat{X}^T \hat{X}$, where $\hat{X}$ is the matrix $X$ with the mean of every dimension across all steps subtracted. $\hat{X}$ can be found by applying the $n \times n$ centering matrix, $C$:

\[
\hat{X} = CX, \qquad C \equiv I - \frac{1}{n} \mathbf{1}\mathbf{1}^T. \tag{4}
\]

We now note that the analysis is simplified considerably by instead finding the eigenvalues and eigenvectors of the matrix $\hat{X}\hat{X}^T$. The non-zero eigenvalues of $\hat{X}^T\hat{X}$ are the same as those of $\hat{X}\hat{X}^T$. The eigenvectors are similarly related by $v_k = X^T u_k$, where $v_k$ is a (non-normalized) eigenvector of $\hat{X}^T\hat{X}$, and $u_k$ is the corresponding eigenvector of $\hat{X}\hat{X}^T$.

We therefore would like to find the eigenvalues and eigenvectors of the matrix

\[
\hat{X}\hat{X}^T = C S^{-1} R R^T S^{-T} C, \tag{5}
\]

where we note that $C^T = C$. Consider the middle term, $RR^T$. In the limit $d \gg n$ we will have $RR^T \to I$ because the off diagonal terms will be $\mathbb{E}[\xi_i]^2 = 0$, whereas the diagonal terms will be $\mathbb{E}[\xi^2] = \sum_{i=1}^{d} \mathbb{V}[\xi_i] = 1$. (Recall that we have assumed that the covariance of the noise distribution is normalized; if the covariance is not normalized, this simply introduces an overall scale factor given by the trace of the covariance.) We therefore have the simplification

\[
\hat{X}\hat{X}^T = C S^{-1} S^{-T} C. \tag{6}
\]

2.2 Asymptotic convergence to circulant matrices

Let us consider the new middle term, $S^{-1}S^{-T} = (S^T S)^{-1}$. The matrix $S$ is a banded Toeplitz matrix. Gray et al. (2006) have shown that banded Toeplitz matrices asymptotically approach circulant matrices as the size of the matrix grows. In particular, Gray et al.
(2006) showed that banded Toeplitz matrices have the same inverses, distribution of eigenvalues, and eigenvectors as their corresponding circulant matrices in this asymptotic limit (see especially theorem 4.1 and subsequent material from Gray et al. 2006). Zhu & Wakin (2017) have furthermore proved a stronger result that under some weak conditions all eigenvalues of a banded Toeplitz matrix are equal to all eigenvalues of a corresponding circulant matrix in the limit of large matrices. Thus in our case, if we consider the limit of a large number of steps, $S$ asymptotically approaches a circulant matrix $\tilde{S}$ that is equal to $S$ in every entry except the top right, where there appears a $-1$ instead of a 0.³

With the limiting circulant behavior of $S$ in mind, the problem simplifies considerably. We note that $C$ is also a circulant matrix, and furthermore the product of two circulant matrices is circulant, the transpose of a circulant matrix is circulant, and the inverse of a circulant matrix is circulant. Thus the matrix $\hat{X}\hat{X}^T$ is asymptotically circulant as $n \to \infty$. Finding the eigenvectors is trivial because the eigenvectors of all circulant matrices are the Fourier modes. To find the eigenvalues we must explicitly consider the values of $\hat{X}\hat{X}^T$. The matrix $S^T S$ consists of a 2 along the diagonal, $-1$ along the subdiagonal and superdiagonal, and 0 elsewhere, with the exception of the bottom right corner, where there appears a 1 instead of a 2.

While this matrix is not a banded Toeplitz matrix, it is asymptotically equivalent to a banded Toeplitz matrix because it differs from a banded Toeplitz matrix by a finite amount in a single location (Böttcher et al., 2003).
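The end result of this analysis (Eqs. 8–12 below) is that the explained variance ratio of the centered trajectory approaches $6/(\pi^2 k^2)$. That limit can be previewed numerically; this sketch assumes an isotropic Gaussian walk and illustrative sizes $n$ and $d$:

```python
import numpy as np

# Empirical check that the explained variance ratio of the k-th PCA component
# of a long, high dimensional random walk approaches 6 / (pi^2 k^2).
rng = np.random.default_rng(1)
n, d = 200, 20000

X = np.cumsum(rng.normal(scale=np.sqrt(1.0 / d), size=(n, d)), axis=0)
Xhat = X - X.mean(axis=0)        # center each dimension across all steps

# Singular values of the centered trajectory give the PCA variances.
s = np.linalg.svd(Xhat, compute_uv=False)
rho = s**2 / np.sum(s**2)        # explained variance ratios

assert abs(rho[0] - 6 / np.pi**2) < 0.05       # ~61% of variance in PCA1
assert abs(rho[0] + rho[1] - 0.76) < 0.05      # ~76% in the first two
```

The tolerances are loose to absorb the finite-$d$ fluctuations of $RR^T$ about the identity.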
We now note that multiplication by the centering matrix changes neither the eigenvectors nor the eigenvalues of this matrix, since all vectors with zero mean are eigenvectors of the centering matrix with eigenvalue 1, and all Fourier modes but the first have zero mean. Thus the eigenvalues of $\hat{X}\hat{X}^T$ can be determined from the inverses of the non-zero eigenvalues of $S^T S$, which is an asymptotically circulant matrix. The kth eigenvalue of a circulant matrix with entries $c_0, c_1, \ldots$ in the first row is

\[
\lambda_{\mathrm{circ},k} = c_0 + c_{n-1}\omega_k + c_{n-2}\omega_k^2 + \ldots + c_1 \omega_k^{n-1}, \tag{7}
\]

where $\omega_k$ is the kth root of unity. The imaginary parts of the roots of unity cancel out, leaving the kth eigenvalue of $S^T S$ to be

\[
\lambda_{S^T S,\,k} = 2\left[1 - \cos\left(\frac{\pi k}{n}\right)\right], \tag{8}
\]

and the kth eigenvalue of $\hat{X}\hat{X}^T$ to be

\[
\lambda_{\hat{X}\hat{X}^T\!,\,k} = \frac{1}{2}\left[1 - \cos\left(\frac{\pi k}{n}\right)\right]^{-1}. \tag{9}
\]

The sum of the eigenvalues is given by the trace of $(S^T S)^{-1} = S^{-1}S^{-T}$, and $S^{-1}$ is given by a lower triangular matrix with ones everywhere on and below the diagonal. The trace of $(S^T S)^{-1}$ is therefore given by

\[
\mathrm{Tr}\left(S^{-1}S^{-T}\right) = \frac{1}{2}n(n+1), \tag{10}
\]

and so the explained variance ratio from the kth PCA component, $\rho_k$, in the limit $n \to \infty$ is

\[
\rho_k \equiv \frac{\lambda_k}{\mathrm{Tr}\left(S^{-1}S^{-T}\right)} = \frac{\frac{1}{2}\left[1 - \cos\left(\frac{\pi k}{n}\right)\right]^{-1}}{\frac{1}{2}n(n+1)}. \tag{11}
\]

If we let $n \to \infty$ we can consider only the first term in a Taylor expansion of the cosine term. Requiring that $\sum_{k=1}^{\infty} \rho_k = 1$, the explained variance ratio is

\[
\rho_k = \frac{6}{\pi^2 k^2}. \tag{12}
\]

³We note in passing that $\tilde{S}$ is the exact representation of a closed random walk.

We test Eq.
12 empirically in Fig. 5 in the supplementary material.

We pause here to marvel that the explained variance ratio of a random walk in the limit of infinite dimensions is highly skewed towards the first few PCA components. Roughly 60% of the variance is explained by the first component, ~80% by the first two components, ~95% by the first 12 components, and ~99% by the first 66 components.

2.3 Projection of the trajectory onto PCA components

Let us now turn to the trajectory of the random walk when projected onto the PCA components. The trajectory projected onto the kth PCA component is

\[
X_{\mathrm{PCA},k} = X\hat{v}_k, \tag{13}
\]

where $\hat{v}_k$ is the normalized $v_k$. We ignore the centering operation from here on because it changes neither the eigenvectors nor the eigenvalues. From above, we then have

\[
X_{\mathrm{PCA},k} = \frac{1}{\|v_k\|} X v_k = \frac{1}{\|v_k\|} X X^T u_k = \frac{\lambda_k}{\|v_k\|}\, u_k. \tag{14}
\]

By the symmetry of the eigenvalue equations $XX^T u = \lambda u$ and $X^T X v = \lambda v$, it can be shown that

\[
\|v_k\| = \|X^T u_k\| = \sqrt{\lambda_k}. \tag{15}
\]

Since $u_k$ is simply the kth Fourier mode, we therefore have

\[
X_{\mathrm{PCA},k} = \sqrt{\frac{2\lambda_k}{n}} \cos\left(\frac{\pi k t}{n}\right). \tag{16}
\]

This implies that the random walk trajectory projected into the subspace spanned by two PCA components will be a Lissajous curve. In Fig. 1 we plot the trajectories of a high dimensional random walk projected to various PCA components and compare to the corresponding Lissajous curves. We perform 1000 steps of a random walk in 10,000 dimensions and find an excellent correspondence between the empirical and analytic trajectories. We additionally show the projection onto the first few PCA components over time in Fig.
6 in the supplementary material.

While our experiments thus far have used an isotropic Gaussian distribution for ease of computation, we emphasize that these results are completely general for any probability distribution with zero mean and a finite covariance matrix with rank much larger than the number of steps. We include the PCA projections and eigenvalue distributions of random walks using non-isotropic multivariate Gaussian distributions in Figs. 7 and 8 in the supplementary material.

Figure 1: The PCA projections of the trajectories of high dimensional random walks are Lissajous curves. Left tableau: Projections of a 10,000-dimensional random walk onto various PCA components. Right tableau: Corresponding Lissajous curves from Eq. 16.

3 Generalizations

3.1 Random walk with momentum

It is a common practice to train neural networks using stochastic gradient descent with momentum. It is therefore interesting to examine the case of a random walk with momentum. In this case, the process is governed by the following set of updates:

\[
v_t = \gamma v_{t-1} + \xi_t, \tag{17}
\]
\[
x_t = x_{t-1} + v_t. \tag{18}
\]

It can be seen that this modifies Eq.
2 to instead read

\[
SX = MR, \tag{19}
\]

where $M$ is a lower triangular Toeplitz matrix with 1 on the diagonal and $\gamma^k$ on the kth subdiagonal. The analysis from Section 2 is unchanged, except that now instead of considering the matrix $S^{-1}S^{-T}$ we have the matrix $S^{-1}MM^T S^{-T}$. Although $M$ is not a banded Toeplitz matrix, its entries decay exponentially to zero far from the main diagonal. It is therefore asymptotically circulant as well, and the eigenvectors remain Fourier modes. To find the eigenvalues consider the product $(S^T M^{-T} M^{-1} S)^{-1}$, noting that $M^{-1}$ is a matrix with 1s along the main diagonal and $-\gamma$ along the first subdiagonal. With some tedious calculation it can be seen that the matrix $S^T M^{-T} M^{-1} S$ is given by

\[
\left(S^T M^{-T} M^{-1} S\right)_{ij} =
\begin{cases}
2 + 2\gamma + 2\gamma^2, & i = j \\
-(1+\gamma)^2, & i = j \pm 1 \\
\gamma, & i = j \pm 2 \\
0, & \text{otherwise,}
\end{cases} \tag{20}
\]

with the exception that $\left(S^T M^{-T} M^{-1} S\right)_{nn} = 1$ and $\left(S^T M^{-T} M^{-1} S\right)_{n,n-1} = \left(S^T M^{-T} M^{-1} S\right)_{n-1,n} = -(1+\gamma)$. As before, this matrix is asymptotically circulant, so the eigenvalues of its inverse are

\[
\lambda_k = \frac{1}{2}\left[1 + \gamma + \gamma^2 - (1+\gamma)^2\cos\left(\frac{\pi k}{n}\right) + \gamma\cos\left(\frac{2\pi k}{n}\right)\right]^{-1}. \tag{21}
\]

In the limit of $n \to \infty$, the distribution of eigenvalues is identical to that of a random walk in flat space; however, for finite $n$, momentum has the effect of shifting the distribution towards the lower PCA components. We empirically test Eq. 21 in Fig. 9 in the supplementary material.

3.2 Discrete Ornstein-Uhlenbeck processes

A useful generalization of the above analysis of random walks in flat space is to consider random walks in a quadratic potential, also known as an AR(1) process or a discrete Ornstein-Uhlenbeck process. For simplicity we will assume that the potential has its minimum at the origin.
Now every step consists of a stochastic component and a deterministic component which points toward the origin and is proportional in magnitude to the distance from the origin. In this case the update equation can be written

\[
x_t = (1 - \alpha)x_{t-1} + \xi_t, \tag{22}
\]

where $\alpha$ measures the strength of the potential. In the limit $\alpha \to 0$ the potential disappears and we recover a random walk in flat space. In the limit $\alpha \to 1$ the potential becomes infinitely strong and we recover independent samples from a multivariate Gaussian distribution. For $1 < \alpha < 2$ the steps will oscillate across the origin. For $\alpha$ outside $[0, 2]$ the updates diverge exponentially.

3.2.1 Analysis of eigenvectors and eigenvalues

This analysis proceeds similarly to the analysis in Section 2 except that instead of $S$ we now have the matrix $S_{\mathrm{OU}}$, which has 1s along the diagonal and $-(1-\alpha)$ along the subdiagonal. $S_{\mathrm{OU}}$ remains a banded Toeplitz matrix, so the arguments from Sec. 2 that $\hat{X}\hat{X}^T$ is asymptotically circulant hold and its eigenvectors remain Fourier modes. The eigenvalues will differ, however, because we now have that the components of $S_{\mathrm{OU}}^T S_{\mathrm{OU}}$ are given by

\[
\left(S_{\mathrm{OU}}^T S_{\mathrm{OU}}\right)_{ij} =
\begin{cases}
1 + (1-\alpha)^2, & i = j,\ i < n \\
-(1-\alpha), & i = j \pm 1 \\
1, & i = j = n \\
0, & \text{otherwise.}
\end{cases} \tag{23}
\]

From Eq. 7 we have that the kth eigenvalue of $\left(S_{\mathrm{OU}}^T S_{\mathrm{OU}}\right)^{-1}$ is

\[
\lambda_{\mathrm{OU},k} = \left[1 + (1-\alpha)^2 - 2(1-\alpha)\cos\left(\frac{2\pi k}{n}\right)\right]^{-1} \simeq \left[\frac{4\pi^2 k^2 (1-\alpha)}{n^2} + \alpha^2\right]^{-1}. \tag{24}
\]

We show in Fig. 2 a comparison between the eigenvalue distribution predicted from Eq. 24 and the observed distribution from a 3000 step Ornstein-Uhlenbeck process in 30,000 dimensions for several values of $\alpha$. There is generally a tight correspondence between the two.
The exception is in the limit of $\alpha \to 1$, where there is a catch which we have hitherto neglected. While it is true that the mean eigenvalue of any eigenvector approaches the same constant, there will nevertheless be some distribution of eigenvalues for any finite walk. Because PCA sorts the eigenvalues, there will be a characteristic deviation from a flat distribution.

3.2.2 Critical distance and mixing time

While we might be tempted to take the limit $n \to \infty$ as we did in the case of a random walk in flat space, doing so would obscure interesting dynamics early in the walk. (A random walk in flat space is self-similar, so we lose no information by taking this limit. This is no longer the case in an Ornstein-Uhlenbeck process because the parameter $\alpha$ sets a characteristic scale in the system.) In fact there will be two distinct phases of a high dimensional Ornstein-Uhlenbeck process initialized at the origin. In the first phase the process will behave as a random walk in flat space: the distance from the origin will increase proportionally to $\sqrt{n}$ and the variance of the kth PCA component will be proportional to $k^{-2}$. But once the distance from the origin reaches a critical value, the gradient toward the origin will become large enough to balance the tendency of the random walk to drift away from the origin.⁴ At this point the trajectory will wander indefinitely around a sphere centered at the origin with radius given by this critical distance.
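The two phases can be seen in a small simulation. This is a sketch under illustrative assumptions (Gaussian steps with $V = 1/d$, and arbitrary choices of $\alpha$, $n$, and $d$); the plateau value $1/\sqrt{\alpha(2-\alpha)}$ it checks against is the critical radius derived in Eq. 25 below:

```python
import numpy as np

# A discrete OU process started at the origin first spreads like a flat-space
# random walk (squared distance ~ t) and then plateaus near the critical radius.
rng = np.random.default_rng(7)
alpha, n, d = 0.005, 3000, 2000

x = np.zeros(d)
sq_dist = np.empty(n)
for t in range(n):
    x = (1.0 - alpha) * x + rng.normal(scale=np.sqrt(1.0 / d), size=d)
    sq_dist[t] = np.dot(x, x)

r_c_sq = 1.0 / (alpha * (2.0 - alpha))   # squared critical radius, ~100 here

# Early phase: squared distance grows roughly linearly with step number.
assert abs(sq_dist[19] / 20.0 - 1.0) < 0.35
# Late phase: the walk is trapped near the critical radius.
late = sq_dist[-500:].mean()
assert abs(late / r_c_sq - 1.0) < 0.2
```

The early-phase tolerance is loose because the restoring term already shaves a few percent off the flat-space growth by step 20.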
Thus, while an Ornstein-Uhlenbeck process is mean-reverting in low dimensions, in the limit of infinite dimensions the Ornstein-Uhlenbeck process is no longer mean-reverting: an infinite dimensional Ornstein-Uhlenbeck process will never return to its mean.⁵ This critical distance can be calculated by noting that each dimension is

⁴Assuming we start close to the origin. If we start sufficiently far from the origin the trajectory will exponentially decay to this critical value.

⁵Specifically, since the limiting distribution is a d-dimensional Gaussian, the probability that the process will return to within $\epsilon$ of the origin is $P(d/2, \epsilon^2/2)$, where $P$ is the regularized gamma function. For small $\epsilon$ this decays exponentially with $d$.

Figure 2: Left panel: The variance of the PCA components for several choices of $\alpha$ (from $10^{-4}$ to $10^0$). The empirical distribution is shown with solid lines and the predicted distribution with dotted lines. The predicted distribution generally matches the observed distribution closely, but there is a systematic deviation for $\alpha$ near 1. This is due to the fact that when the mean distribution is flat, there will nevertheless be a distribution around this mean when these eigenvalues are sampled from real data. Because PCA sorts these eigenvalues, this will always lead to a deviation from the flat distribution.
Right panel: Distance from the origin for discrete Ornstein-Uhlenbeck processes with several choices of $\alpha$ (solid lines) with the predicted asymptote from Eq. 25 (dotted lines).

independent of every other, and it is well known that the asymptotic distribution of an AR(1) process with Gaussian noise is Gaussian with a mean of zero and a standard deviation of $\sqrt{V/(1-(1-\alpha)^2)}$, where $V$ is the variance of the stochastic component of the process. In high dimensions the asymptotic distribution as $n \to \infty$ is simply a multidimensional isotropic Gaussian. Because we are assuming $V = 1/d$, the overwhelming majority of points sampled from this distribution will be in a narrow annulus at a distance

\[
r_c = \frac{1}{\sqrt{\alpha(2-\alpha)}} \tag{25}
\]

from the origin. Since the distance from the origin during the initial random walk phase grows as $\sqrt{n}$, the process will start to deviate from a random walk after $n_c \sim (\alpha(2-\alpha))^{-1}$ steps. We show in the right panel of Fig. 2 the distance from the origin over time for 3000 steps of Ornstein-Uhlenbeck processes in 30,000 dimensions with several different choices of $\alpha$. We compare to the prediction of Eq. 25 and find a good match.

3.2.3 Iterate averages converge slowly

We finally note that if the location of the minimum is unknown, then iterate (or Polyak) averaging can be used to provide a better estimate. But the number of steps must be much greater than $n_c$ before iterate averaging will improve the estimate. Only then will the location on the sphere be approximately orthogonal to its original location on the sphere and the variance of the estimate of the minimum will decrease as $1/\sqrt{n}$. We compute the mean of converged Ornstein-Uhlenbeck processes with various choices of $\alpha$ in Fig.
10 in the supplementary material.

3.2.4 Random walks in a non-isotropic potential are dominated by low curvature directions

While our analysis has focused on the special case of a quadratic potential with equal curvature in all dimensions, a more realistic quadratic potential will have a distribution of curvatures, and the axes of the potential may not be aligned with the coordinate basis. Fortunately these complications do not change the overall picture much. For a general quadratic potential described by a positive semi-definite matrix $A$, we can decompose $A$ into its eigenvalues and eigenvectors. We then apply a coordinate transformation to align the parameter space with the eigenvectors of $A$. At this point we have a distribution of curvatures, each one given by an eigenvalue of $A$. However, because we are considering the limit of infinite dimensions, we can assume that there will be a large number of dimensions that fall in any bin $[\alpha_i, \alpha_i + d\alpha]$. Each of these bins can be treated as an independent high-dimensional Ornstein-Uhlenbeck process with curvature $\alpha_i$. After $n$ steps, PCA will then be dominated by dimensions for which $\alpha_i$ is small enough that $n \ll n_{c,i}$. Thus, even if relatively few

Figure 3: Left panel: The distribution of PCA variances at various points in training for a linear model trained on CIFAR-10.
At the beginning of training the model's trajectory is more directed than a random walk, as exhibited by the steep distribution in the lower PCA components. By the middle of training this distribution has flattened (apart from the first PCA component) and more closely resembles that of an Ornstein-Uhlenbeck process. Right panel: The distribution of PCA variances of the parameters of ResNet-50-v2 at various points in training. The distribution of PCA variances generally matches that of a random walk with the exception of the first PCA component, which dominates the distribution, particularly at the end of training.

dimensions have small curvature, they will come to dominate the PCA projected trajectory after enough steps.

4 Comparison to linear models and neural networks

While random walks and Ornstein-Uhlenbeck processes are analytically tractable, there are several important differences between these simple processes and optimization of even linear models. In particular, the statistics of the noise will depend on the location in parameter space and so will change over the course of training. Furthermore, there may be finite data or finite trajectory length effects.

To get a sense for the effect of these differences we now compare the distribution of the variances in the PCA components between two models and a random walk. For our first model we train a linear model without biases on CIFAR-10 using a learning rate of $10^{-5}$ for 10,000 steps. For our second model we train ResNet-50-v2 on Imagenet without batch normalization for 150,000 steps using SGD with momentum and linear learning rate decay. We collect the value of all parameters at every step for the first 1500 steps, the middle 1500 steps, and the last 1500 steps of training, along with collecting the parameters every 100 steps throughout the entirety of training.
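For ResNet-50-v2, PCA is performed after a random Gaussian projection to a lower dimensional subspace, as described below. A minimal sketch of why this is safe: when the projected dimension is much larger than the number of parameter snapshots, the PCA spectrum of a trajectory is approximately preserved by the projection. All sizes here are illustrative assumptions, far smaller than those in the paper, and the random-walk trajectory stands in for a real parameter trajectory:

```python
import numpy as np

rng = np.random.default_rng(5)
n, D, p = 100, 4000, 1500   # snapshots, parameter count, projected dimension

# A random-walk stand-in for a parameter trajectory with D parameters.
X = np.cumsum(rng.normal(scale=np.sqrt(1.0 / D), size=(n, D)), axis=0)

def explained_variance_ratios(traj):
    centered = traj - traj.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return s**2 / np.sum(s**2)

# Random Gaussian projection to p dimensions; E[G G^T] = I, so the Gram matrix
# of the trajectory (and hence its PCA spectrum) is preserved in expectation.
G = rng.normal(scale=np.sqrt(1.0 / p), size=(D, p))
rho_full = explained_variance_ratios(X)
rho_proj = explained_variance_ratios(X @ G)

assert np.all(np.abs(rho_full[:5] - rho_proj[:5]) < 0.05)
```

The relative error of the projected Gram entries scales like $\sqrt{2/p}$, which is why the projected dimension only needs to be large compared to the number of snapshots, not the parameter count.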
Further details of both models and the training procedures can be found in the supplementary material. While PCA is tractable for the linear model trained on CIFAR-10, ResNet-50-v2 has ∼25 million parameters and performing PCA directly on the parameters is infeasible, so we instead perform a random Gaussian projection into a subspace of 30,000 dimensions. We show in Fig. 3 the distribution of the PCA variances at the beginning, middle, and end of training for both models and compare to the distribution of variances from an infinite dimensional random walk. We show tableaux of the PCA projected trajectories from the middle of training for the linear model and ResNet-50-v2 in Fig. 4. Tableaux of the other training trajectories in various PCA subspaces are shown in the supplementary material, along with results from a small fully connected neural network trained on MNIST.

The distribution of eigenvalues of the linear model resembles that of an OU process, whereas the distribution of eigenvalues of ResNet-50-v2 resembles that of a random walk with a large drift term. The ResNet-50-v2 trajectories appear almost identical to those of the random walks shown in Fig. 1, with the exception that there is more variance along the first PCA component than in the random walk case, particularly at the start and end points. This manifests itself in a small outward turn of the edges of the parabola in the PCA2 vs. PCA1 projection.
This suggests that ResNet-50-v2 generally moves in a consistent direction over relatively long spans of training, similarly to an Ornstein-Uhlenbeck process initialized beyond rc.

Figure 4: Left tableau: PCA projected trajectories (pairwise projections onto PCA components 1-5) from the middle of training a linear model on CIFAR-10. Training has largely converged at this point, producing an approximately Gaussian distribution in the higher PCA components. Right tableau: PCA projected trajectories from the middle of training ResNet-50-v2 on Imagenet. These trajectories strongly resemble those of a random walk. See Figs. 12 and 13 in the supplementary material for PCA projected trajectories at other phases of training.

5 Random walks with decaying step sizes

We finally note that the PCA projected trajectories of the linear model and ResNet-50-v2 over the entire course of training qualitatively resemble those of a high dimensional random walk with exponentially decaying step sizes. To show this we train a linear regression model y = Wx, where W is a fixed, unknown vector of dimension 10,000. We sample x from a 10,000 dimensional isotropic Gaussian and calculate the loss

L = (1/2)(y − y′)²,    (26)

where y′ is the correct output. We show in Fig. 15 that the step size decays exponentially. We fit the decay rate to this data and then perform a random walk in 10,000 dimensions, decaying the variance of the stochastic term ξi at this rate. We compare in Fig.
16 of the supplementary material the PCA projected trajectories of the linear model trained on synthetic data to the decayed random walk. We note that these trajectories resemble the PCA trajectories over the entire course of training observed in Figs. 12 and 13 for the linear model trained on CIFAR-10 and ResNet-50-v2 trained on Imagenet.

6 Conclusions

We have derived the distribution of the variances of the PCA components of a random walk, both with and without momentum, in the limit of infinite dimensions, and proved that the PCA projections of the trajectory are Lissajous curves. We have argued that the PCA projected trajectory of a random walk in a general quadratic potential will be dominated by the dimensions with the smallest curvatures, where it will appear similar to a random walk in flat space. Finally, we find that the PCA projections of the training trajectory of a layer in ResNet-50-v2 qualitatively resemble those of a high dimensional random walk, despite the many differences between the optimization of a large NN and a high dimensional random walk.

Acknowledgments

The authors thank Matthew Hoffman, Martin Wattenberg, Jeffrey Pennington, Roy Frostig, and Niru Maheswaranathan for helpful discussions and comments on drafts of the manuscript.

References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, 2012.

Baity-Jesi, M., Sagun, L., Geiger, M., Spigler, S., Arous, G. B., Cammarota, C., LeCun, Y., Wyart, M., and Biroli, G. Comparing dynamics: Deep neural networks versus glassy systems. arXiv preprint arXiv:1803.06969, 2018.

Böttcher, A., Embree, M., and Sokolov, V. The spectra of large Toeplitz band matrices with a randomly perturbed entry. Mathematics of Computation, 72(243):1329–1348, 2003.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y.
The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations, 2015.

Gray, R. M. et al. Toeplitz and circulant matrices: A review. Foundations and Trends® in Communications and Information Theory, 2(3):155–239, 2006.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hochreiter, S. and Schmidhuber, J. Flat minima. Neural Computation, 9(1):1–42, 1997.

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.

Li, H., Xu, Z., Taylor, G., and Goldstein, T. Visualizing the loss landscape of neural nets. In International Conference on Learning Representations, 2018.

Lipton, Z. C. Stuck in a what? Adventures in weight space. arXiv preprint arXiv:1602.07320, 2016.

Lorch, E.
Visualizing deep network training trajectories with PCA. In The 33rd International Conference on Machine Learning, JMLR volume 48, 2016.

Mandt, S., Hoffman, M., and Blei, D. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pp. 354–363, 2016.

Moore, J., Ahmed, H., and Antia, R. High dimensional random walks can appear low dimensional: Application to influenza H3N2 evolution. Journal of Theoretical Biology, 447:56–64, 2018.

Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018.

Rump, S. M. Eigenvalues, pseudospectrum and structured perturbations. Linear Algebra and its Applications, 413(2-3):567–593, 2006.

Smith, S. L. and Le, Q. V. A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018.

Uhlenbeck, G. E. and Ornstein, L. S. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.

Zhu, Z. and Wakin, M. B. On the asymptotic equivalence of circulant and Toeplitz matrices. IEEE Transactions on Information Theory, 63(5):2975–2992, 2017.