{"title": "Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced", "book": "Advances in Neural Information Processing Systems", "page_first": 384, "page_last": 395, "abstract": "We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We rigorously prove that gradient flow (i.e. gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes $\\eta_t=O(t^{\u2212(1/2+\\delta)}) (0<\\delta\\le1/2)$ automatically balances two low-rank factors and converges to a bounded global optimum. Furthermore, for rank-1 asymmetric matrix factorization we give a finer analysis showing gradient descent with constant step size converges to the global minimum at a globally linear rate. We believe that the idea of examining the invariance imposed by first order algorithms in learning homogeneous models could serve as a fundamental building block for studying optimization for learning deep models.", "full_text": "Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced*\n\nSimon S. Du\u2020\n\nWei Hu\u2021\n\nJason D. 
Lee\u00a7\n\nAbstract\n\nWe study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We rigorously prove that gradient flow (i.e. gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes $\\eta_t = O(t^{-(1/2+\\delta)})$ ($0 < \\delta \\le 1/2$) automatically balances two low-rank factors and converges to a bounded global optimum. Furthermore, for rank-1 asymmetric matrix factorization we give a finer analysis showing gradient descent with constant step size converges to the global minimum at a globally linear rate. We believe that the idea of examining the invariance imposed by first order algorithms in learning homogeneous models could serve as a fundamental building block for studying optimization for learning deep models.\n\n1 Introduction\n\nModern machine learning models often consist of multiple layers. For example, consider a feed-forward deep neural network that defines a prediction function\n\n$$x \\mapsto f(x; W^{(1)}, \\ldots, W^{(N)}) = W^{(N)} \\phi(W^{(N-1)} \\cdots W^{(2)} \\phi(W^{(1)} x) \\cdots),$$\n\nwhere $W^{(1)}$, . . . 
, $W^{(N)}$ are weight matrices in $N$ layers, and $\\phi(\\cdot)$ is a point-wise homogeneous activation function such as the Rectified Linear Unit (ReLU) $\\phi(x) = \\max\\{x, 0\\}$. A simple observation is that this model is homogeneous: if we multiply a layer by a positive scalar $c$ and divide another layer by $c$, the prediction function remains the same, e.g. $f(x; cW^{(1)}, \\ldots, \\frac{1}{c} W^{(N)}) = f(x; W^{(1)}, \\ldots, W^{(N)})$.\n\nA direct consequence of homogeneity is that a solution can produce a small function value while being unbounded, because one can always multiply one layer by a huge number and divide another layer by that number. Theoretically, this possible unbalancedness poses significant difficulty in analyzing first order optimization methods like gradient descent/stochastic gradient descent (GD/SGD), because when parameters are not a priori constrained to a compact set via either coerciveness$^5$ of the loss or an explicit constraint, GD and SGD are not even guaranteed to converge [Lee et al., 2016, Proposition 4.11]. In the context of deep learning, Shamir [2018] determined that the primary barrier to providing algorithmic results is that the sequence of parameter iterates is possibly unbounded.\n\n*The full version of this paper is available at https://arxiv.org/abs/1806.00900.\n\u2020Machine Learning Department, School of Computer Science, Carnegie Mellon University. Email: ssdu@cs.cmu.edu\n\u2021Computer Science Department, Princeton University. Email: huwei@cs.princeton.edu\n\u00a7Department of Data Sciences and Operations, Marshall School of Business, University of Southern California. Email: jasonlee@marshall.usc.edu\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nNow we take a closer look at asymmetric matrix factorization, which is a simple two-layer homogeneous model. 
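The rescaling invariance described above can be checked numerically. The following is a minimal sketch, not code from the paper: it builds a tiny two-layer ReLU network in plain Python and confirms that multiplying the first layer by c and dividing the second layer by c leaves the output unchanged while making the two layers badly unbalanced. The helper names (matmul, relu, scale, fro_sq, net), the layer sizes, the seed and the factor c = 1000 are all assumptions made for illustration.

```python
import random

def matmul(A, B):
    # Plain-Python matrix product of two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(A):
    # Entrywise ReLU.
    return [[max(x, 0.0) for x in row] for row in A]

def scale(A, c):
    # Multiply every entry of A by the scalar c.
    return [[c * x for x in row] for row in A]

def fro_sq(A):
    # Squared Frobenius norm.
    return sum(x * x for row in A for x in row)

def net(W1, W2, x):
    # Two-layer ReLU network f(x) = W2 relu(W1 x); x is stored as a column matrix.
    return matmul(W2, relu(matmul(W1, x)))

rng = random.Random(0)                                         # fixed seed, illustrative
W1 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # 3 x 4
W2 = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2)]   # 2 x 3
x = [[rng.gauss(0, 1)] for _ in range(4)]                      # 4 x 1

c = 1000.0
out = net(W1, W2, x)
out_rescaled = net(scale(W1, c), scale(W2, 1.0 / c), x)

# Largest entrywise change of the prediction after rescaling the layers.
max_gap = max(abs(a[0] - b[0]) for a, b in zip(out, out_rescaled))

# Ratio of squared Frobenius norms before and after rescaling.
norm_ratio_before = fro_sq(W1) / fro_sq(W2)
norm_ratio_after = fro_sq(scale(W1, c)) / fro_sq(scale(W2, 1.0 / c))

print(max_gap, norm_ratio_before, norm_ratio_after)
```

In this sketch max_gap stays at floating-point noise level, while the ratio of squared Frobenius norms grows by a factor of c^4. The prediction function cannot tell the two parameter settings apart, but their norms differ enormously, which is the unbalancedness discussed above.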
Consider the following formulation for factorizing a low-rank matrix:\n\n$$\\min_{U \\in \\mathbb{R}^{d_1 \\times r}, V \\in \\mathbb{R}^{d_2 \\times r}} f(U, V) = \\frac{1}{2} \\|U V^\\top - M^*\\|_F^2, \\qquad (1)$$\n\nwhere $M^* \\in \\mathbb{R}^{d_1 \\times d_2}$ is a matrix we want to factorize. We observe that due to the homogeneity of $f$, it is not smooth$^6$ even in the neighborhood of a globally optimum point. To see this, we compute the gradient of $f$:\n\n$$\\frac{\\partial f(U, V)}{\\partial U} = (U V^\\top - M^*) V, \\qquad \\frac{\\partial f(U, V)}{\\partial V} = (U V^\\top - M^*)^\\top U. \\qquad (2)$$\n\nNotice that the gradient of $f$ is not homogeneous anymore. Further, consider a globally optimal solution $(U, V)$ such that $\\|U\\|_F$ is of order $\\epsilon$ and $\\|V\\|_F$ is of order $1/\\epsilon$ ($\\epsilon$ being very small). A small perturbation on $U$ can lead to a dramatic change to the gradient of $U$. This phenomenon can happen for all homogeneous functions when the layers are unbalanced. The lack of nice geometric properties of homogeneous functions due to unbalancedness makes first-order optimization methods difficult to analyze.\n\nA common theoretical workaround is to artificially modify the natural objective function (1) in order to prove convergence. In [Tu et al., 2015, Ge et al., 2017a], a regularization term for balancing the two layers is added to (1):\n\n$$\\min_{U \\in \\mathbb{R}^{d_1 \\times r}, V \\in \\mathbb{R}^{d_2 \\times r}} \\frac{1}{2} \\|U V^\\top - M^*\\|_F^2 + \\frac{1}{8} \\|U^\\top U - V^\\top V\\|_F^2. \\qquad (3)$$\n\nFor problem (3), the regularizer removes the homogeneity issue and the optimal solution becomes unique (up to rotation). Ge et al. [2017a] showed that the modified objective (3) satisfies (i) every local minimum is a global minimum, (ii) all saddle points are strict$^7$, and (iii) the objective is smooth. These imply that (noisy) GD finds a global minimum [Ge et al., 2015, Lee et al., 2016, Panageas and Piliouras, 2016].\n\nOn the other hand, empirically, removing the homogeneity is not necessary. 
We use GD with random initialization to solve the optimization problem (1). Figure 1a shows that even without a regularization term like the one in the modified objective (3), GD with random initialization converges to a global minimum, and the convergence rate is also competitive. A more interesting phenomenon is shown in Figure 1b, in which we track the Frobenius norms of $U$ and $V$ in all iterations. The plot shows that the ratio between the norms remains constant in all iterations. Thus the unbalancedness does not occur at all! In many practical applications, many models also admit the homogeneous property (like deep neural networks) and first order methods often converge to a balanced solution. A natural question arises:\n\nWhy does GD balance multiple layers and converge in learning homogeneous functions?\n\nIn this paper, we take an important step towards answering this question. Our key finding is that the gradient descent algorithm provides an implicit regularization on the target homogeneous function. First, we show that on the gradient flow (gradient descent with infinitesimal step size) trajectory induced by any differentiable loss function, for a large class of homogeneous models, including fully connected and convolutional neural networks with linear, ReLU and Leaky ReLU activations, the differences between squared norms across layers remain invariant. Thus, as long as the differences are small at the beginning, they remain small at all times. 
Note that small differences arise in commonly used initialization schemes such as $\\frac{1}{\\sqrt{d}}$ Gaussian initialization or Xavier/Kaiming initialization schemes [Glorot and Bengio, 2010, He et al., 2016]. Our result thus explains why using ReLU activation is a better choice than sigmoid from the optimization point of view. For linear activation, we prove an even stronger invariance for gradient flow: we show that $W^{(h)}(W^{(h)})^\\top - (W^{(h+1)})^\\top W^{(h+1)}$ stays invariant over time, where $W^{(h)}$ and $W^{(h+1)}$ are weight matrices in consecutive layers with linear activation in between.\n\n$^5$A function $f$ is coercive if $\\|x\\| \\to \\infty$ implies $f(x) \\to \\infty$.\n$^6$A function is said to be smooth if its gradient is $\\beta$-Lipschitz continuous for some finite $\\beta > 0$.\n$^7$A saddle point of a function $f$ is strict if the Hessian at that point has a negative eigenvalue.\n\n[Figure 1, panel (a): Comparison of convergence rates of GD for objective functions (1) and (3); logarithm of the objective versus epochs, without and with regularization. Panel (b): Comparison of quantity $\\|U\\|_F^2 / \\|V\\|_F^2$ when running GD for objective functions (1) and (3); ratio between F-norms of the two layers versus epochs, without and with regularization.]\n\nFigure 1: Experiments on the matrix factorization problem with objective functions (1) and (3). Red lines correspond to running GD on the objective function (1), and blue lines correspond to running GD on the objective function (3).\n\nNext, we go beyond gradient flow and consider gradient descent with positive step size. We focus on the asymmetric matrix factorization problem (1). 
Our invariance result for linear activation indicates that $U^\\top U - V^\\top V$ stays unchanged for gradient flow. For gradient descent, $U^\\top U - V^\\top V$ can change over iterations. Nevertheless we show that if the step size decreases like $\\eta_t = O(t^{-(1/2+\\delta)})$ ($0 < \\delta \\le 1/2$), $U^\\top U - V^\\top V$ will remain small in all iterations. In the set where $U^\\top U - V^\\top V$ is small, the loss is coercive, and gradient descent thus ensures that all the iterates are bounded. Using these properties, we then show that gradient descent converges to a globally optimal solution. Furthermore, for rank-1 asymmetric matrix factorization, we give a finer analysis and show that randomly initialized gradient descent with constant step size converges to the global minimum at a globally linear rate.\n\nRelated work. The homogeneity issue has been previously discussed by Neyshabur et al. [2015a,b]. The authors proposed a variant of stochastic gradient descent that regularizes paths in a neural network, which is related to the max-norm. The algorithm outperforms gradient descent and AdaGrad on several classification tasks.\n\nA line of research focused on analyzing gradient descent dynamics for (convolutional) neural networks with one or two unknown layers [Tian, 2017, Brutzkus and Globerson, 2017, Du et al., 2017a,b, Zhong et al., 2017, Li and Yuan, 2017, Ma et al., 2017, Brutzkus et al., 2017]. For one unknown layer, there is no homogeneity issue. For two unknown layers, existing work either requires learning the two layers separately [Zhong et al., 2017, Ge et al., 2017b] or uses re-parametrization like weight normalization to remove the homogeneity issue [Du et al., 2017b]. 
To our knowledge, there is no rigorous analysis for optimizing multi-layer homogeneous functions.\n\nFor a general (non-convex) optimization problem, it is known that if the objective function satisfies (i) the gradient changes smoothly if the parameters are perturbed, (ii) all saddle points and local maxima are strict (i.e., there exists a direction with negative curvature), and (iii) all local minima are global (no spurious local minimum), then gradient descent [Lee et al., 2016, Panageas and Piliouras, 2016] converges to a global minimum. There have been many studies on the optimization landscapes of neural networks [Kawaguchi, 2016, Choromanska et al., 2015, Du and Lee, 2018, Hardt and Ma, 2016, Bartlett et al., 2018, Haeffele and Vidal, 2015, Freeman and Bruna, 2016, Vidal et al., 2017, Safran and Shamir, 2016, Zhou and Feng, 2017, Nguyen and Hein, 2017a,b, Zhou and Feng, 2017, Safran and Shamir, 2017], showing that the objective functions have properties (ii) and (iii). Nevertheless, the objective function is in general not smooth as we discussed before. Our paper complements these results by showing that the magnitudes of all layers are balanced and in many cases, this implies smoothness.\n\nPaper organization. The rest of the paper is organized as follows. In Section 2, we present our main theoretical result on the implicit regularization property of gradient flow for optimizing neural networks. In Section 3, we analyze the dynamics of randomly initialized gradient descent for the asymmetric matrix factorization problem with the unregularized objective function (1). In Section 4, we empirically verify the theoretical result in Section 2. We conclude and list future directions in Section 5. Some technical proofs are deferred to the appendix.\n\nNotation. We use bold-faced letters for vectors and matrices. For a vector $x$, denote by $x[i]$ its $i$-th coordinate. 
For a matrix $A$, we use $A[i, j]$ to denote its $(i, j)$-th entry, and use $A[i, :]$ and $A[:, j]$ to denote its $i$-th row and $j$-th column, respectively (both as column vectors). We use $\\|\\cdot\\|_2$ or $\\|\\cdot\\|$ to denote the Euclidean norm of a vector, and use $\\|\\cdot\\|_F$ to denote the Frobenius norm of a matrix. We use $\\langle \\cdot, \\cdot \\rangle$ to denote the standard Euclidean inner product between two vectors or two matrices. Let $[n] = \\{1, 2, \\ldots, n\\}$.\n\n2 The Auto-Balancing Properties in Deep Neural Networks\n\nIn this section we study the implicit regularization imposed by gradient descent with infinitesimal step size (gradient flow) in training deep neural networks. In Section 2.1 we consider fully connected neural networks, and our main result (Theorem 2.1) shows that gradient flow automatically balances the incoming and outgoing weights at every neuron. This directly implies that the weights between different layers are balanced (Corollary 2.1). For linear activation, we derive a stronger auto-balancing property (Theorem 2.2). In Section 2.2 we generalize our result from fully connected neural networks to convolutional neural networks. In Section 2.3 we present the proof of Theorem 2.1. The proofs of other theorems in this section follow similar ideas and are deferred to Appendix A.\n\n2.1 Fully Connected Neural Networks\n\nWe first formally define a fully connected feed-forward neural network with $N$ ($N \\ge 2$) layers. Let $W^{(h)} \\in \\mathbb{R}^{n_h \\times n_{h-1}}$ be the weight matrix in the $h$-th layer, and define $w = (W^{(h)})_{h=1}^N$ as a shorthand for the collection of all the weights. Then the function $f_w : \\mathbb{R}^d \\to \\mathbb{R}^p$ ($d = n_0$, $p = n_N$) computed by this network can be defined recursively: $f_w^{(1)}(x) = W^{(1)} x$, $f_w^{(h)}(x) = W^{(h)} \\phi_{h-1}(f_w^{(h-1)}(x))$ ($h = 2$, . . . 
, $N$), and $f_w(x) = f_w^{(N)}(x)$, where each $\\phi_h$ is an activation function that acts coordinate-wise on vectors.$^8$ We assume that each $\\phi_h$ ($h \\in [N-1]$) is homogeneous, namely, $\\phi_h(x) = \\phi_h'(x) \\cdot x$ for all $x$ and all elements of the sub-differential $\\phi_h'(\\cdot)$ when $\\phi_h$ is non-differentiable at $x$. This property is satisfied by functions like ReLU $\\phi(x) = \\max\\{x, 0\\}$, Leaky ReLU $\\phi(x) = \\max\\{x, \\alpha x\\}$ ($0 < \\alpha < 1$), and the linear function $\\phi(x) = x$.\n\nLet $\\ell : \\mathbb{R}^p \\times \\mathbb{R}^p \\to \\mathbb{R}_{\\ge 0}$ be a differentiable loss function. Given a training dataset $\\{(x_i, y_i)\\}_{i=1}^m \\subset \\mathbb{R}^d \\times \\mathbb{R}^p$, the training loss as a function of the network parameters $w$ is defined as\n\n$$L(w) = \\frac{1}{m} \\sum_{i=1}^m \\ell(f_w(x_i), y_i). \\qquad (4)$$\n\nWe consider gradient descent with infinitesimal step size (also known as gradient flow) applied on $L(w)$, which is captured by the differential inclusion:\n\n$$\\frac{dW^{(h)}}{dt} \\in -\\frac{\\partial L(w)}{\\partial W^{(h)}}, \\qquad h = 1, \\ldots, N, \\qquad (5)$$\n\nwhere $t$ is a continuous time index, and $\\frac{\\partial L(w)}{\\partial W^{(h)}}$ is the Clarke sub-differential [Clarke et al., 2008]. If curves $W^{(h)} = W^{(h)}(t)$ ($h \\in [N]$) evolve with time according to (5) they are said to be a solution of the gradient flow differential inclusion.\n\n$^8$We omit the trainable bias weights in the network for simplicity, but our results can be directly generalized to allow bias weights.\n\nOur main result in this section is the following invariance imposed by gradient flow.\n\nTheorem 2.1 (Balanced incoming and outgoing weights at every neuron). For any $h \\in [N-1]$ and $i \\in [n_h]$, we have\n\n$$\\frac{d}{dt}\\left(\\|W^{(h)}[i, :]\\|^2 - \\|W^{(h+1)}[:, i]\\|^2\\right) = 0. \\qquad (6)$$\n\nNote that $W^{(h)}[i, :]$ is a vector consisting of network weights coming into the $i$-th neuron in the $h$-th hidden layer, and $W^{(h+1)}[:, i]$ is the vector of weights going out from the same neuron. 
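Theorem 2.1 rests on an identity between gradients that can be verified directly. As a rough sketch under assumed settings (a two-layer Leaky ReLU network with slope ALPHA = 0.1, one random sample, squared loss and hand-written back-propagation; none of these choices come from the paper), the code below checks that for every hidden neuron i the inner product of its incoming weights with their gradient equals the inner product of its outgoing weights with their gradient. Under gradient flow these two inner products are, up to a factor of -2, the time derivatives of the two squared norms in (6), so their equality is the claimed invariance.

```python
import random

ALPHA = 0.1  # Leaky ReLU slope, an assumed value for this sketch

def act(z):
    # Leaky ReLU, a homogeneous activation: act(z) == act_grad(z) * z.
    return z if z > 0 else ALPHA * z

def act_grad(z):
    # One valid choice of (sub-)derivative of the Leaky ReLU.
    return 1.0 if z > 0 else ALPHA

rng = random.Random(1)              # fixed seed, illustrative
d, n1, p = 5, 4, 3                  # input, hidden and output sizes (assumed)
W1 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n1)]
W2 = [[rng.gauss(0, 1) for _ in range(n1)] for _ in range(p)]
x = [rng.gauss(0, 1) for _ in range(d)]
y = [rng.gauss(0, 1) for _ in range(p)]

# Forward pass: z = W1 x, a = act(z), out = W2 a.
z = [sum(W1[i][j] * x[j] for j in range(d)) for i in range(n1)]
a = [act(zi) for zi in z]
out = [sum(W2[k][i] * a[i] for i in range(n1)) for k in range(p)]

# Backward pass for the loss 0.5 * ||out - y||^2.
g_out = [out[k] - y[k] for k in range(p)]
grad_W2 = [[g_out[k] * a[i] for i in range(n1)] for k in range(p)]
g_a = [sum(W2[k][i] * g_out[k] for k in range(p)) for i in range(n1)]
g_z = [g_a[i] * act_grad(z[i]) for i in range(n1)]
grad_W1 = [[g_z[i] * x[j] for j in range(d)] for i in range(n1)]

# <incoming weights, their gradient> and <outgoing weights, their gradient> per neuron.
incoming = [sum(W1[i][j] * grad_W1[i][j] for j in range(d)) for i in range(n1)]
outgoing = [sum(W2[k][i] * grad_W2[k][i] for k in range(p)) for i in range(n1)]

max_gap = max(abs(u - v) for u, v in zip(incoming, outgoing))
print(incoming, outgoing, max_gap)
```

The two lists agree up to rounding error for every hidden neuron. Replacing the Leaky ReLU by a non-homogeneous activation such as a sigmoid would break the equality, since the step from act_grad(z) * z to act(z) is exactly where homogeneity is used.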
Therefore, Theorem 2.1 shows that gradient flow exactly preserves the difference between the squared $\\ell_2$-norms of incoming weights and outgoing weights at any neuron.\n\nTaking the sum of (6) over $i \\in [n_h]$, we obtain the following corollary, which says gradient flow preserves the difference between the squares of Frobenius norms of weight matrices.\n\nCorollary 2.1 (Balanced weights across layers). For any $h \\in [N-1]$, we have\n\n$$\\frac{d}{dt}\\left(\\|W^{(h)}\\|_F^2 - \\|W^{(h+1)}\\|_F^2\\right) = 0.$$\n\nCorollary 2.1 explains why in practice, trained multi-layer models usually have similar magnitudes on all the layers: if we use a small initialization, $\\|W^{(h)}\\|_F^2 - \\|W^{(h+1)}\\|_F^2$ is very small at the beginning, and Corollary 2.1 implies this difference remains small at all times. This finding also partially explains why gradient descent converges. Although the objective function like (4) may not be smooth over the entire parameter space, given that $\\|W^{(h)}\\|_F^2 - \\|W^{(h+1)}\\|_F^2$ is small for all $h$, the objective function may have smoothness. Under this condition, standard theory shows that gradient descent converges. We believe this finding serves as a key building block for understanding first order methods for training deep neural networks.\n\nFor linear activation, we have the following stronger invariance than Theorem 2.1:\n\nTheorem 2.2 (Stronger balancedness property for linear activation). If for some $h \\in [N-1]$ we have $\\phi_h(x) = x$, then\n\n$$\\frac{d}{dt}\\left(W^{(h)}(W^{(h)})^\\top - (W^{(h+1)})^\\top W^{(h+1)}\\right) = 0.$$\n\nThis result was known for linear networks [Arora et al., 2018], but the proof there relies on the entire network being linear while Theorem 2.2 only needs two consecutive layers to have no nonlinear activations in between.\n\nWhile Theorem 2.1 shows the invariance in a node-wise manner, Theorem 2.2 shows that for linear activation, we can derive a layer-wise invariance. 
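For the two-layer linear case, that is the factorization objective (1), the effect of a positive step size on the invariant of Theorem 2.2 can be worked out explicitly. Writing R for the residual U V^T - M and G_U = R V, G_V = R^T U for the two gradients, one gradient descent step with step size eta changes U^T U - V^T V by exactly eta^2 (G_U^T G_U - G_V^T G_V), because the first-order terms cancel. The sketch below checks this identity numerically in plain Python. The helper names, the matrix sizes, the seed and the value of eta are illustrative assumptions and are not taken from the paper.

```python
import random

def matmul(A, B):
    # Plain-Python matrix product of two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def axpy(A, B, c):
    # Entrywise A + c * B.
    return [[a + c * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def balance(U, V):
    # The matrix U^T U - V^T V that gradient flow keeps fixed.
    return axpy(matmul(transpose(U), U), matmul(transpose(V), V), -1.0)

def max_abs(A):
    return max(abs(x) for row in A for x in row)

rng = random.Random(2)              # fixed seed, illustrative
d1, d2, r = 6, 5, 2                 # assumed problem sizes
U = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d1)]
V = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d2)]
M = [[rng.gauss(0, 1) for _ in range(d2)] for _ in range(d1)]

R = axpy(matmul(U, transpose(V)), M, -1.0)   # residual U V^T - M
grad_U = matmul(R, V)                        # gradient with respect to U
grad_V = matmul(transpose(R), U)             # gradient with respect to V

def drift(eta):
    # Change of U^T U - V^T V caused by one gradient descent step of size eta.
    U_new = axpy(U, grad_U, -eta)
    V_new = axpy(V, grad_V, -eta)
    return axpy(balance(U_new, V_new), balance(U, V), -1.0)

# Predicted change per unit eta^2: grad_U^T grad_U - grad_V^T grad_V.
second_order = balance(grad_U, grad_V)

eta = 1e-3
gap = max_abs(axpy(drift(eta), second_order, -eta * eta))
ratio = max_abs(drift(2 * eta)) / max_abs(drift(eta))
print(gap, ratio)
```

Here gap is at rounding-error level and doubling the step size multiplies the drift by four. This quadratic dependence on the step size is why a sufficiently fast decay of the step sizes can keep the accumulated drift small.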
Inspired by this strong invariance, in Section 3 we prove that gradient descent with positive step sizes preserves this invariance approximately for matrix factorization.\n\n2.2 Convolutional Neural Networks\n\nNow we show that the conservation property in Corollary 2.1 can be generalized to convolutional neural networks. In fact, we can allow arbitrary sparsity pattern and weight sharing structure within a layer; convolutional layers are a special case.\n\nNeural networks with sparse connections and shared weights. We use the same notation as in Section 2.1, with the difference that some weights in a layer can be missing or shared. Formally, the weight matrix $W^{(h)} \\in \\mathbb{R}^{n_h \\times n_{h-1}}$ in layer $h$ ($h \\in [N]$) can be described by a vector $v^{(h)} \\in \\mathbb{R}^{d_h}$ and a function $g_h : [n_h] \\times [n_{h-1}] \\to [d_h] \\cup \\{0\\}$. Here $v^{(h)}$ consists of the actual free parameters in this layer and $d_h$ is the number of free parameters (e.g. if there are $k$ convolutional filters in layer $h$ each with size $r$, we have $d_h = r \\cdot k$). The map $g_h$ represents the sparsity and weight sharing pattern:\n\n$$W^{(h)}[i, j] = \\begin{cases} 0, & g_h(i, j) = 0, \\\\ v^{(h)}[k], & g_h(i, j) = k > 0. \\end{cases}$$\n\nDenote by $v = (v^{(h)})_{h=1}^N$ the collection of all the parameters in this network, and we consider gradient flow to learn the parameters:\n\n$$\\frac{dv^{(h)}}{dt} \\in -\\frac{\\partial L(v)}{\\partial v^{(h)}}, \\qquad h = 1, \\ldots, N.$$\n\nThe following theorem generalizes Corollary 2.1 to neural networks with sparse connections and shared weights:\n\nTheorem 2.3. For any $h \\in [N-1]$, we have\n\n$$\\frac{d}{dt}\\left(\\|v^{(h)}\\|^2 - \\|v^{(h+1)}\\|^2\\right) = 0.$$\n\nTherefore, for a neural network with arbitrary sparsity pattern and weight sharing structure, gradient flow still balances the magnitudes of all layers.\n\n2.3 Proof of Theorem 2.1\n\nThe proofs of all theorems in this section are similar. They are based on the use of the chain rule (i.e. back-propagation) and the property of homogeneous activations. 
Below we provide the proof of Theorem 2.1 and defer the proofs of other theorems to Appendix A.\n\nProof of Theorem 2.1. First we note that we can without loss of generality assume $L$ is the loss associated with one data sample $(x, y) \\in \\mathbb{R}^d \\times \\mathbb{R}^p$, i.e., $L(w) = \\ell(f_w(x), y)$. In fact, for $L(w) = \\frac{1}{m} \\sum_{k=1}^m L_k(w)$ where $L_k(w) = \\ell(f_w(x_k), y_k)$, for any single weight $W^{(h)}[i, j]$ in the network we can compute $\\frac{d}{dt}(W^{(h)}[i, j])^2 = 2 W^{(h)}[i, j] \\cdot \\frac{dW^{(h)}[i, j]}{dt} = -2 W^{(h)}[i, j] \\cdot \\frac{\\partial L(w)}{\\partial W^{(h)}[i, j]} = -2 W^{(h)}[i, j] \\cdot \\frac{1}{m} \\sum_{k=1}^m \\frac{\\partial L_k(w)}{\\partial W^{(h)}[i, j]}$, using the sharp chain rule of differential inclusions for tame functions [Drusvyatskiy et al., 2015, Davis et al., 2018]. Thus, if we can prove the theorem for every individual loss $L_k$, we can prove the theorem for $L$ by taking the average over $k \\in [m]$.\n\nTherefore in the rest of the proof we assume $L(w) = \\ell(f_w(x), y)$. For convenience, we denote $x^{(h)} = f_w^{(h)}(x)$ ($h \\in [N]$), which is the input to the $h$-th hidden layer of neurons for $h \\in [N-1]$ and is the output of the network for $h = N$. We also denote $x^{(0)} = x$ and $\\phi_0(x) = x$ ($\\forall x$).\n\nNow we prove (6). Since $W^{(h+1)}[k, i]$ ($k \\in [n_{h+1}]$) can only affect $L(w)$ through $x^{(h+1)}[k]$, we have for $k \\in [n_{h+1}]$,\n\n$$\\frac{\\partial L(w)}{\\partial W^{(h+1)}[k, i]} = \\frac{\\partial L(w)}{\\partial x^{(h+1)}[k]} \\cdot \\frac{\\partial x^{(h+1)}[k]}{\\partial W^{(h+1)}[k, i]} = \\frac{\\partial L(w)}{\\partial x^{(h+1)}[k]} \\cdot \\phi_h(x^{(h)}[i]),$$\n\nwhich can be rewritten as\n\n$$\\frac{\\partial L(w)}{\\partial W^{(h+1)}[:, i]} = \\phi_h(x^{(h)}[i]) \\cdot \\frac{\\partial L(w)}{\\partial x^{(h+1)}}.$$\n\nIt follows that\n\n$$\\frac{d}{dt}\\|W^{(h+1)}[:, i]\\|^2 = 2\\left\\langle W^{(h+1)}[:, i], \\frac{d}{dt} W^{(h+1)}[:, i] \\right\\rangle = -2\\left\\langle W^{(h+1)}[:, i], \\frac{\\partial L(w)}{\\partial W^{(h+1)}[:, i]} \\right\\rangle = -2 \\phi_h(x^{(h)}[i]) \\cdot \\left\\langle W^{(h+1)}[:, i], \\frac{\\partial L(w)}{\\partial x^{(h+1)}} \\right\\rangle. \\qquad (7)$$\n\nOn the other hand, $W^{(h)}[i, :]$ only affects $L(w)$ through $x^{(h)}[i]$. 
Using the chain rule, we get\n\n$$\\frac{\\partial L(w)}{\\partial W^{(h)}[i, :]} = \\frac{\\partial L(w)}{\\partial x^{(h)}[i]} \\cdot \\phi_{h-1}(x^{(h-1)}) = \\left\\langle \\frac{\\partial L(w)}{\\partial x^{(h+1)}}, W^{(h+1)}[:, i] \\right\\rangle \\cdot \\phi_h'(x^{(h)}[i]) \\cdot \\phi_{h-1}(x^{(h-1)}),$$\n\nwhere $\\phi'$ is interpreted as a set-valued mapping whenever it is applied at a non-differentiable point.$^9$\n\n$^9$More precisely, the equalities should be an inclusion whenever there is a sub-differential, but as we see in the next display the ambiguity in the choice of sub-differential does not affect later calculations.\n\nIt follows that$^{10}$\n\n$$\\frac{d}{dt}\\|W^{(h)}[i, :]\\|^2 = 2\\left\\langle W^{(h)}[i, :], \\frac{d}{dt} W^{(h)}[i, :] \\right\\rangle = -2\\left\\langle W^{(h)}[i, :], \\frac{\\partial L(w)}{\\partial W^{(h)}[i, :]} \\right\\rangle$$\n$$= -2\\left\\langle \\frac{\\partial L(w)}{\\partial x^{(h+1)}}, W^{(h+1)}[:, i] \\right\\rangle \\cdot \\phi_h'(x^{(h)}[i]) \\cdot \\left\\langle W^{(h)}[i, :], \\phi_{h-1}(x^{(h-1)}) \\right\\rangle$$\n$$= -2\\left\\langle \\frac{\\partial L(w)}{\\partial x^{(h+1)}}, W^{(h+1)}[:, i] \\right\\rangle \\cdot \\phi_h'(x^{(h)}[i]) \\cdot x^{(h)}[i] = -2\\left\\langle \\frac{\\partial L(w)}{\\partial x^{(h+1)}}, W^{(h+1)}[:, i] \\right\\rangle \\cdot \\phi_h(x^{(h)}[i]).$$\n\nComparing the above expression to the expression (7) for $\\frac{d}{dt}\\|W^{(h+1)}[:, i]\\|^2$, we finish the proof.\n\n3 Gradient Descent Converges to Global Minimum for Asymmetric Matrix Factorization\n\nIn this section we constrain ourselves to the asymmetric matrix factorization problem and analyze the gradient descent algorithm with random initialization. Our analysis is inspired by the auto-balancing properties presented in Section 2. We extend these properties from gradient flow to gradient descent with positive step size.\n\nFormally, we study the following non-convex optimization problem:\n\n$$\\min_{U \\in \\mathbb{R}^{d_1 \\times r}, V \\in \\mathbb{R}^{d_2 \\times r}} f(U, V) = \\frac{1}{2} \\|U V^\\top - M^*\\|_F^2, \\qquad (8)$$\n\nwhere $M^* \\in \\mathbb{R}^{d_1 \\times d_2}$ has rank $r$. Note that we do not have any explicit regularization in (8). 
The gradient descent dynamics for (8) have the following form:\n\n$$U_{t+1} = U_t - \\eta_t (U_t V_t^\\top - M^*) V_t, \\qquad V_{t+1} = V_t - \\eta_t (U_t V_t^\\top - M^*)^\\top U_t. \\qquad (9)$$\n\n3.1 The General Rank-r Case\n\nFirst we consider the general case of $r \\ge 1$. Our main theorem below says that if we use a random small initialization $(U_0, V_0)$, and set step sizes $\\eta_t$ to be appropriately small, then gradient descent (9) will converge to a solution close to the global minimum of (8). To our knowledge, this is the first result showing that gradient descent with random initialization directly solves the un-regularized asymmetric matrix factorization problem (8).\n\nTheorem 3.1. Let $0 < \\epsilon < \\|M^*\\|_F$. Suppose we initialize the entries in $U_0$ and $V_0$ i.i.d. from $\\mathcal{N}(0, \\frac{\\epsilon}{\\mathrm{poly}(d)})$ ($d = \\max\\{d_1, d_2\\}$), and run (9) with step sizes $\\eta_t = \\frac{\\sqrt{\\epsilon / r}}{100 (t+1) \\|M^*\\|_F^{3/2}}$ ($t = 0, 1, \\ldots$).$^{11}$ Then with high probability over the initialization, $\\lim_{t \\to \\infty} (U_t, V_t) = (\\bar{U}, \\bar{V})$ exists and satisfies $\\|\\bar{U} \\bar{V}^\\top - M^*\\|_F \\le \\epsilon$.\n\nProof sketch of Theorem 3.1. First let us imagine that we are using infinitesimal step size in GD. Then according to Theorem 2.2 (viewing problem (8) as learning a two-layer linear network where the inputs are all the standard unit vectors in $\\mathbb{R}^{d_2}$), we know that $U^\\top U - V^\\top V$ will stay invariant throughout the algorithm. Hence when $U$ and $V$ are initialized to be small, $U^\\top U - V^\\top V$ will stay small forever. 
Combined with the fact that the objective $f(U, V)$ is decreasing over time (which means $U V^\\top$ cannot be too far from $M^*$), we can show that $U$ and $V$ will always stay bounded.\n\nNow we are using positive step sizes $\\eta_t$, so we no longer have the invariance of $U^\\top U - V^\\top V$. Nevertheless, by a careful analysis of the updates, we can still prove that $U_t^\\top U_t - V_t^\\top V_t$ is small, the objective $f(U_t, V_t)$ decreases, and $U_t$ and $V_t$ stay bounded. Formally, we have the following lemma:\n\nLemma 3.1. With high probability over the initialization $(U_0, V_0)$, for all $t$ we have:\n\n(i) Balancedness: $\\|U_t^\\top U_t - V_t^\\top V_t\\|_F \\le \\epsilon$;\n(ii) Decreasing objective: $f(U_t, V_t) \\le f(U_{t-1}, V_{t-1}) \\le \\cdots \\le f(U_0, V_0) \\le 2 \\|M^*\\|_F^2$;\n(iii) Boundedness: $\\|U_t\\|_F^2 \\le 5 \\sqrt{r} \\|M^*\\|_F$, $\\|V_t\\|_F^2 \\le 5 \\sqrt{r} \\|M^*\\|_F$.\n\n$^{10}$This holds for any choice of element of the sub-differential, since $\\phi'(x) x = \\phi(x)$ holds at $x = 0$ for any choice of sub-differential.\n$^{11}$The dependency of $\\eta_t$ on $t$ can be $\\eta_t = \\Theta(t^{-(1/2+\\delta)})$ for any constant $\\delta \\in (0, 1/2]$.\n\nNow that we know the GD algorithm automatically constrains $(U_t, V_t)$ in a bounded region, we can use the smoothness of $f$ in this region and a standard analysis of GD to show that $(U_t, V_t)$ converges to a stationary point $(\\bar{U}, \\bar{V})$ of $f$ (Lemma B.2). Furthermore, using the results of [Lee et al., 2016, Panageas and Piliouras, 2016] we know that $(\\bar{U}, \\bar{V})$ is almost surely not a strict saddle point. Then the following lemma implies that $(\\bar{U}, \\bar{V})$ has to be close to a global optimum since we know $\\|\\bar{U}^\\top \\bar{U} - \\bar{V}^\\top \\bar{V}\\|_F \\le \\epsilon$ from Lemma 3.1 (i). This would complete the proof of Theorem 3.1.\n\nLemma 3.2. Suppose $(U, V)$ is a stationary point of $f$ such that $\\|U^\\top U - V^\\top V\\|_F \\le \\epsilon$. 
Then either $\\|U V^\\top - M^*\\|_F \\le \\epsilon$, or $(U, V)$ is a strict saddle point of $f$.\n\nThe full proof of Theorem 3.1 and the proofs of Lemmas 3.1 and 3.2 are given in Appendix B.\n\n3.2 The Rank-1 Case\n\nWe have shown in Theorem 3.1 that GD with small and diminishing step sizes converges to a global minimum for matrix factorization. Empirically, it is observed that a constant step size $\\eta_t \\equiv \\eta$ is enough for GD to converge quickly to a global minimum. Therefore, some natural questions are how to prove convergence of GD with a constant step size, how fast it converges, and how the discretization affects the invariance we derived in Section 2.\n\nWhile these questions remain challenging for the general rank-$r$ matrix factorization, we resolve them for the case of $r = 1$. Our main finding is that with constant step size, the norms of the two layers are always within a constant factor of each other (although we may no longer have the stronger balancedness property as in Lemma 3.1), and we utilize this property to prove the linear convergence of GD to a global minimum.\n\nWhen $r = 1$, the asymmetric matrix factorization problem and its GD dynamics become\n\n$$\\min_{u \\in \\mathbb{R}^{d_1}, v \\in \\mathbb{R}^{d_2}} \\frac{1}{2} \\|u v^\\top - M^*\\|_F^2$$\n\nand\n\n$$u_{t+1} = u_t - \\eta (u_t v_t^\\top - M^*) v_t, \\qquad v_{t+1} = v_t - \\eta (v_t u_t^\\top - M^{*\\top}) u_t.$$\n\nHere we assume $M^*$ has rank 1, i.e., it can be factorized as $M^* = \\sigma_1 u^* v^{*\\top}$ where $u^*$ and $v^*$ are unit vectors and $\\sigma_1 > 0$.\n\nOur main theoretical result is the following.\n\nTheorem 3.2 (Approximate balancedness and linear convergence of GD for rank-1 matrix factorization). 
Suppose $u_0 \\sim \\mathcal{N}(0, \\delta I)$, $v_0 \\sim \\mathcal{N}(0, \\delta I)$ with $\\delta = c_{init} \\sqrt{\\frac{\\sigma_1}{d}}$ ($d = \\max\\{d_1, d_2\\}$) for some sufficiently small constant $c_{init} > 0$, and $\\eta = \\frac{c_{step}}{\\sigma_1}$ for some sufficiently small constant $c_{step} > 0$. Then with constant probability over the initialization, for all $t$ we have $c_0 \\le \\frac{|u_t^\\top u^*|}{|v_t^\\top v^*|} \\le C_0$ for some universal constants $c_0, C_0 > 0$. Furthermore, for any $0 < \\epsilon < 1$, after $t = O(\\log \\frac{d}{\\epsilon})$ iterations, we have $\\|u_t v_t^\\top - M^*\\|_F \\le \\epsilon \\sigma_1$.\n\nTheorem 3.2 shows that for $u_t$ and $v_t$, their strengths in the signal space, $|u_t^\\top u^*|$ and $|v_t^\\top v^*|$, are of the same order. This approximate balancedness helps us prove the linear convergence of GD. We refer readers to Appendix C for the proof of Theorem 3.2.\n\n4 Empirical Verification\n\nWe perform experiments to verify the auto-balancing properties of gradient descent in neural networks with ReLU activation. 
Our results below show that for GD with small step size and small\ninitialization: (1) the difference between the squared Frobenius norms of any two layers remains\nsmall in all iterations, and (2) the ratio between the squared Frobenius norms of any two layers\nbecomes close to 1.\n\nFigure 2: Balancedness of a 3-layer neural network. Each panel plots two curves (between the 1st and 2nd layer, and between the 2nd and 3rd layer) against epochs from 0 to 10,000; the vertical axis is the difference of the squared Frobenius norms in (a) and (c) and the squared Frobenius norm ratio in (b) and (d). (a) Balanced initialization, squared norm differences. (b) Balanced initialization, squared norm ratios. (c) Unbalanced initialization, squared norm differences. (d) Unbalanced initialization, squared norm ratios.\n\n
Notice that our theorems in Section 2 hold for gradient flow (step size $\\to 0$),\nbut in practice we can only choose a (small) positive step size, so we cannot expect the differences\nbetween the squared Frobenius norms to remain exactly the same and can only hope to observe that\nthe differences remain small.\nWe consider a 3-layer fully connected network of the form $f(x) = W_3 \\phi(W_2 \\phi(W_1 x))$ where\n$x \\in \\mathbb{R}^{1{,}000}$ is the input, $W_1 \\in \\mathbb{R}^{100 \\times 1{,}000}$, $W_2 \\in \\mathbb{R}^{100 \\times 100}$, $W_3 \\in \\mathbb{R}^{10 \\times 100}$, and $\\phi(\\cdot)$ is the ReLU\nactivation. We use 1,000 data points and the quadratic loss function, and run GD. We first test a balanced\ninitialization: $W_1[i, j] \\sim N(0, 10^{-4}/100)$, $W_2[i, j] \\sim N(0, 10^{-4}/10)$ and $W_3[i, j] \\sim N(0, 10^{-4})$,\nwhich ensures $\\|W_1\\|_F^2 \\approx \\|W_2\\|_F^2 \\approx \\|W_3\\|_F^2$. After 10,000 iterations we have $\\|W_1\\|_F^2 = 42.90$,\n$\\|W_2\\|_F^2 = 43.76$ and $\\|W_3\\|_F^2 = 43.68$. Figure 2a shows that in all iterations $\\big| \\|W_1\\|_F^2 - \\|W_2\\|_F^2 \\big|$ and\n$\\big| \\|W_2\\|_F^2 - \\|W_3\\|_F^2 \\big|$ are bounded by 0.14, which is much smaller than the magnitude of each\n$\\|W_h\\|_F^2$. Figure 2b shows that the ratios between norms approach 1. We then test an unbalanced\ninitialization: $W_1[i, j] \\sim N(0, 10^{-4})$, $W_2[i, j] \\sim N(0, 10^{-4})$ and $W_3[i, j] \\sim N(0, 10^{-4})$. After\n10,000 iterations we have $\\|W_1\\|_F^2 = 55.50$, $\\|W_2\\|_F^2 = 45.65$ and $\\|W_3\\|_F^2 = 45.46$. Figure 2c\nshows that $\\big| \\|W_1\\|_F^2 - \\|W_2\\|_F^2 \\big|$ and $\\big| \\|W_2\\|_F^2 - \\|W_3\\|_F^2 \\big|$ are bounded by 9 (and indeed change very\nlittle throughout the process), and Figure 2d shows that the ratios become close to 1 after about\n1,000 iterations.\n\n5 Conclusion and Future Work\n\nIn this paper we take a step towards characterizing the invariance imposed by first order algorithms.\nWe show that gradient flow automatically balances the magnitudes of all layers in a deep neural network\nwith homogeneous activations. For the concrete model of asymmetric matrix factorization, we\nfurther use the balancedness property to show that gradient descent converges to a global minimum.\nWe believe our findings on the invariance in deep models could serve as a fundamental building\nblock for understanding optimization in deep learning. Below we list some future directions.\n\nOther first-order methods. In this paper we focus on the invariance induced by gradient descent.\nIn practice, different acceleration and adaptive methods are also used. A natural future direction is\nto characterize the invariance properties of these algorithms.\n\nFrom gradient flow to gradient descent: a generic analysis? As discussed in Section 3, while\nstrong invariance properties hold for gradient flow, in practice one uses gradient descent with positive\nstep sizes and the invariance may only hold approximately because positive step sizes discretize the\ndynamics. We use specialized techniques for analyzing asymmetric matrix factorization. It would be\nvery interesting to develop a generic approach to analyze the discretization. Recent findings on the\nconnection between optimization and ordinary differential equations [Su et al., 2014, Zhang et al.,\n2018] might be useful for this purpose.\n\nAcknowledgements\n\nWe thank Phil Long for his helpful comments on an earlier draft of this paper. JDL acknowledges\nsupport from ARO W911NF-11-1-0303.\n\nReferences\nPierre-Antoine Absil, Robert Mahony, and Benjamin Andrews. 
Convergence of the iterates of descent methods for analytic cost functions. SIAM Journal on Optimization, 16(2):531\u2013547, 2005.\n\nSanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.\n\nPeter L Bartlett, David P Helmbold, and Philip M Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. arXiv preprint arXiv:1802.06093, 2018.\n\nAlon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.\n\nAlon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.\n\nAnna Choromanska, Mikael Henaff, Michael Mathieu, G\u00e9rard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192\u2013204, 2015.\n\nFrancis H Clarke, Yuri S Ledyaev, Ronald J Stern, and Peter R Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008.\n\nDamek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D Lee. Stochastic subgradient method converges on tame functions. arXiv preprint arXiv:1804.07795, 2018.\n\nDmitriy Drusvyatskiy, Alexander D Ioffe, and Adrian S Lewis. Curves of descent. SIAM Journal on Control and Optimization, 53(1):114\u2013138, 2015.\n\nSimon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.\n\nSimon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a.\n\nSimon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. 
Gradient descent learns one-hidden-layer cnn: Don\u2019t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b.\n\nC Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.\n\nRong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797\u2013842, 2015.\n\nRong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233\u20131242, 2017a.\n\nRong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017b.\n\nXavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249\u2013256, 2010.\n\nBenjamin D Haeffele and Ren\u00e9 Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.\n\nMoritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\nKenji Kawaguchi. Deep learning without poor local minima. In Advances In Neural Information Processing Systems, pages 586\u2013594, 2016.\n\nJason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246\u20131257, 2016.\n\nYuanzhi Li and Yang Yuan. 
Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886, 2017.\n\nSiyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. arXiv preprint arXiv:1712.06559, 2017.\n\nBehnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422\u20132430, 2015a.\n\nBehnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Data-dependent path normalization in neural networks. arXiv preprint arXiv:1511.06747, 2015b.\n\nQuynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017a.\n\nQuynh Nguyen and Matthias Hein. The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017b.\n\nIoannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.\n\nItay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774\u2013782, 2016.\n\nItay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural networks. arXiv preprint arXiv:1712.08968, 2017.\n\nO. Shamir. Are resnets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.\n\nWeijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510\u20132518, 2014.\n\nYuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. 
arXiv preprint arXiv:1703.00560, 2017.\n\nStephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via procrustes flow. arXiv preprint arXiv:1507.03566, 2015.\n\nRene Vidal, Joan Bruna, Raja Giryes, and Stefano Soatto. Mathematics of deep learning. arXiv preprint arXiv:1712.04741, 2017.\n\nJingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct runge-kutta discretization achieves acceleration. arXiv preprint arXiv:1805.00521, 2018.\n\nKai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.\n\nPan Zhou and Jiashi Feng. The landscape of deep learning algorithms. arXiv preprint arXiv:1705.07038, 2017.\n", "award": [], "sourceid": 255, "authors": [{"given_name": "Simon", "family_name": "Du", "institution": "Carnegie Mellon University"}, {"given_name": "Wei", "family_name": "Hu", "institution": "Princeton University"}, {"given_name": "Jason", "family_name": "Lee", "institution": "University of Southern California"}]}