{"title": "Implicit Bias of Gradient Descent on Linear Convolutional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 9461, "page_last": 9471, "abstract": "We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to linearly fully connected networks, where gradient descent converges to the hard margin linear SVM solution, regardless of depth.", "full_text": "Implicit Bias of Gradient Descent on Linear\n\nConvolutional Networks\n\nSuriya Gunasekar\nTTI at Chicago, USA\nsuriya@ttic.edu\n\nJason D. Lee\n\nUSC Los Angeles, USA\n\njasonlee@marshall.usc.edu\n\nDaniel Soudry\nTechnion, Israel\n\ndaniel.soudry@gmail.com\n\nNathan Srebro\n\nTTI at Chicago, USA\n\nnati@ttic.edu\n\nAbstract\n\nWe show that gradient descent on full width linear convolutional networks of depth\nL converges to a linear predictor related to the (cid:96)2/L bridge penalty in the frequency\ndomain. This is in contrast to fully connected linear networks, where regardless of\ndepth, gradient descent converges to the (cid:96)2 maximum margin solution.\n\n1\n\nIntroduction\n\nImplicit biases introduced by optimization algorithms play an crucial role in learning deep neural net-\nworks [Neyshabur et al., 2015b,a, Hochreiter and Schmidhuber, 1997, Keskar et al., 2016, Chaudhari\net al., 2016, Dinh et al., 2017, Andrychowicz et al., 2016, Neyshabur et al., 2017, Zhang et al., 2017,\nWilson et al., 2017, Hoffer et al., 2017, Smith, 2018]. Large scale neural networks used in practice\nare highly over-parameterized with far more trainable model parameters compared to the number of\ntraining examples. Consequently, optimization objectives for learning such high capacity models\nhave many global minima that \ufb01t training data perfectly. 
However, minimizing the training loss with a specific optimization algorithm takes us not to just any global minimum, but to a special global minimum, e.g., a global minimum minimizing some regularizer R(β). In over-parameterized models, especially deep neural networks, much, if not most, of the inductive bias of the learned model comes from this implicit regularization by the optimization algorithm. Understanding the implicit bias, e.g., via characterizing R(β), is thus essential for understanding how and what the model learns. For example, in linear regression we understand how minimizing an under-determined model (with more parameters than samples) using gradient descent yields the minimum ℓ2 norm solution, and for linear logistic regression trained on linearly separable data, Soudry et al. [2017] recently showed that gradient descent converges in the direction of the hard-margin support vector machine solution, even though neither the norm nor the margin is explicitly specified in the optimization problem. Such minimum-norm or maximum-margin solutions are of course very special among all solutions or separators that fit the training data, and in particular can ensure generalization [Bartlett and Mendelson, 2003, Kakade et al., 2009].

Changing the optimization algorithm, even without changing the model, changes this implicit bias, and consequently also changes the generalization properties of the learned models [Neyshabur et al., 2015a, Keskar et al., 2016, Wilson et al., 2017, Gunasekar et al., 2017, 2018]. For example, for linear logistic regression, using coordinate descent instead of gradient descent returns a maximum ℓ1 margin solution instead of the hard-margin support vector machine solution, an entirely different inductive bias [Telgarsky, 2013, Gunasekar et al., 2018].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Similarly, and as we shall see in this paper, changing to a different parameterization of the same model class can also dramatically change the implicit bias [Gunasekar et al., 2017]. In particular, we study the implicit bias of optimizing multi-layer fully connected linear networks and linear convolutional networks (multiple full-width convolutional layers followed by a single fully connected layer) using gradient descent. Both of these types of models ultimately implement linear transformations, and can implement any linear transformation. The model class defined by these networks is thus simply the class of all linear predictors, and these models can be seen as mere (over-)parameterizations of the class of linear predictors. Minimizing the training loss on these models is therefore entirely equivalent to minimizing the training loss for linear classification. Nevertheless, as we shall see, optimizing these networks with gradient descent leads to very different solutions.

In particular, we show that for fully connected networks with a single output, optimizing the exponential loss over linearly separable data using gradient descent again converges to the homogeneous hard-margin support vector machine solution. This holds regardless of the depth of the network; hence, at least with a single output, gradient descent on fully connected networks has the same implicit bias as direct gradient descent on the parameters of the linear predictor. In contrast, training a linear convolutional network with gradient descent biases us toward linear separators that are sparse in the frequency domain.
Furthermore, this bias changes with the depth of the network: a network of depth L (with L − 1 convolutional layers) implicitly biases towards minimizing the bridge penalty ‖β̂‖_{2/L}, with 2/L ≤ 1, of the Fourier transform β̂ of the learned linear predictor β, subject to margin constraints (the gradient descent predictor reaches a stationary point of the ‖β̂‖_{2/L} minimization problem). This is a sparsity-inducing regularizer, which induces sparsity more aggressively as the depth increases.

Finally, in this paper we focus on characterizing which global minimum gradient descent on over-parameterized linear models converges to, while assuming that for an appropriate choice of step sizes the gradient descent iterates asymptotically minimize the optimization objective. A related challenge in neural networks, not addressed in this paper, is the question of when gradient descent minimizes the non-convex empirical loss objective and reaches a global minimum. This problem, while hard in the worst case, has been studied for linear networks. Recent work has concluded that with sufficient over-parameterization (as is the case in our settings), the loss landscapes of linear models are well behaved and all local minima are global minima, making the problem tractable [Burer and Monteiro, 2003, Journée et al., 2010, Kawaguchi, 2016, Nguyen and Hein, 2017, Lee et al., 2016].

Notation We typeset vectors with bold characters, e.g., w, β, x. Individual entries of a vector z ∈ R^D are indexed using 0-based indexing as z[d] for d = 0, 1, . . . , D − 1. Complex numbers are represented in polar form as z = |z| e^{iφ_z}, with |z| ∈ R+ denoting the magnitude of z and φ_z ∈ [0, 2π) denoting the phase. z* = |z| e^{−iφ_z} denotes the complex conjugate of z. The complex inner product between z, β ∈ C^D is given by ⟨z, β⟩ = Σ_d z[d] β*[d] = z⊤β*. The Dth complex root of 1 is denoted by ω_D = e^{−2πi/D}. For z ∈ R^D we use the notation ẑ ∈ C^D to denote the representation of z in the discrete Fourier basis, given by ẑ[d] = (1/√D) Σ_{p=0}^{D−1} z[p] ω_D^{pd}. For integers D and a, we denote the modulo operation as a mod D = a − D⌊a/D⌋. Finally, for multi-layer linear networks (formally defined in Section 2), we will use w ∈ W to denote the parameters of the model in a general domain W, and β_w, or simply β, to denote the equivalent linear predictor.

2 Multi-layer Linear Networks

We consider feed-forward linear networks that map input features x ∈ R^D to a single real-valued output f_w(x) ∈ R, where w denotes the parameters of the network. Such networks can be thought of as directed acyclic graphs where each edge is associated with a weight, and the value at each node/unit is the weighted sum of the values at its parent nodes. The input features form source nodes with no incoming edges, and the output is a sink node with no outgoing edges. Every such network realizes a linear function x → ⟨x, β_w⟩, where β_w ∈ R^D denotes the effective linear predictor.

In multi-layer networks, the nodes are arranged in layers, so an L-layer network represents a composition of L linear maps. We use the convention that the input x ∈ R^D is indexed as the zeroth layer l = 0, while the output forms the final layer with l = L. The outputs of the nodes in layer l are denoted by h_l ∈ R^{D_l}, where D_l is the number of nodes in layer l.
We also use w_l to denote the parameters of the linear map between h_{l−1} and h_l, and w = [w_l]_{l=1}^{L} to denote the collective set of all parameters of the linear network.

Linear fully connected network In a fully connected linear network, the nodes between successive layers l − 1 and l are densely connected with edge weights w_l ∈ R^{D_{l−1}×D_l}, and all the weights are independent parameters. This model class is parameterized by w = [w_l]_{l=1}^{L} ∈ Π_{l=1}^{L} R^{D_{l−1}×D_l}, and the computation of the intermediate nodes h_l and the composite linear map f_w(x) is given by

h_l = w_l⊤ h_{l−1}   and   f_w(x) = h_L = w_L⊤ w_{L−1}⊤ . . . w_1⊤ x.   (1)

Linear convolutional network We consider one-dimensional convolutional network architectures where each non-output layer has exactly D units (the same as the input dimensionality) and the linear transformations from layer l − 1 to layer l are given by the following circular convolution operation1, parameterized by full-width filters with weights [w_l ∈ R^D]_{l=1}^{L−1}. For l = 1, 2, . . . , L − 1,

h_l[d] = (1/√D) Σ_{k=0}^{D−1} w_l[k] h_{l−1}[(d + k) mod D] := (h_{l−1} ⋆ w_l)[d].   (2)

The output layer is fully connected and parameterized by weights w_L ∈ R^D. The parameters of the model class therefore consist of L vectors of size D, collectively denoted by w = [w_l]_{l=1}^{L} ∈ Π_{l=1}^{L} R^D, and the composite linear map f_w(x) is given by:

f_w(x) = ((((x ⋆ w_1) ⋆ w_2) . . .) ⋆ w_{L−1})⊤ w_L.   (3)

Remark: We use circular convolution with a scaling of 1/√D to make the analysis cleaner. For convolutions with zero-padding, we expect similar behavior. Secondly, since our goal here is to study implicit bias in sufficiently over-parameterized models, we only study full-dimensional convolutional filters.
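The maps in eqs. (2)-(3) are straightforward to implement; the following sketch (our own minimal example, not code from the paper) builds the composite map, extracts the induced linear predictor, and checks the Fourier-domain factorization that Lemma 3 below makes precise:

```python
import numpy as np

def correlate(h, w):
    # One layer of eq. (2): circular cross-correlation with 1/sqrt(D) scaling.
    D = len(h)
    idx = (np.arange(D)[:, None] + np.arange(D)[None, :]) % D   # (d + k) mod D
    return (h[idx] * w[None, :]).sum(axis=1) / np.sqrt(D)

def conv_net(x, ws):
    # Composite map of eq. (3): L - 1 convolutional layers, then a dense layer.
    h = x
    for w in ws[:-1]:
        h = correlate(h, w)
    return h @ ws[-1]

rng = np.random.default_rng(0)
D, L = 8, 3
ws = [rng.standard_normal(D) for _ in range(L)]

# The network is linear in x, so its predictor beta satisfies beta[j] = f_w(e_j).
beta = np.array([conv_net(e, ws) for e in np.eye(D)])
x = rng.standard_normal(D)
assert np.isclose(conv_net(x, ws), x @ beta)

# In the unitary Fourier basis the composite map diagonalizes: the DFT of beta
# is the entrywise product of the DFTs of the layer filters (Lemma 3).
fft = lambda v: np.fft.fft(v) / np.sqrt(D)
assert np.allclose(fft(beta), np.prod([fft(w) for w in ws], axis=0))
```

The basis-vector trick for recovering β works for any of the parameterizations in this section, since every network here computes a linear function of x.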
In practice it is common to have filters of width K smaller than the number of input features, which can change the implicit bias.

The fully connected and convolutional linear networks described above can both be represented in terms of a mapping P : W → R^D that maps the parameters w ∈ W to a linear predictor in R^D, such that the output of the network is given by f_w(x) = ⟨x, P(w)⟩. For fully connected networks, the mapping is given by P_full(w) = w_1 w_2 . . . w_L, and for convolutional networks, P_conv(w) = (((w_L↓ ⋆ w_{L−1}) ⋆ w_{L−2}) . . . ⋆ w_1)↓, where w↓ denotes the flipped vector corresponding to w, given by w↓[k] = w[D − k − 1] for k = 0, 1, . . . , D − 1.

Separable linear classification Consider a binary classification dataset {(x_n, y_n) : n = 1, 2, . . . , N} with x_n ∈ R^D and y_n ∈ {−1, 1}. The empirical risk minimization objective for training a linear network parameterized as P(w) is given as follows:

min_{w∈W} L_P(w) := Σ_{n=1}^{N} ℓ(⟨x_n, P(w)⟩, y_n),   (4)

where ℓ : R × {−1, 1} → R+ is some surrogate loss for classification accuracy, e.g., the logistic loss ℓ(ŷ, y) = log(1 + exp(−ŷy)) or the exponential loss ℓ(ŷ, y) = exp(−ŷy).

It is easy to see that both fully connected and convolutional networks of any depth L can realize any linear predictor β ∈ R^D.
The model class expressed by both networks is therefore simply the unconstrained class of linear predictors, and the two architectures are merely different (over-)parameterizations of this class:

{P_full(w) : w = [w_l ∈ R^{D_{l−1}×D_l}]_{l=1}^{L}} = {P_conv(w) : w = [w_l ∈ R^D]_{l=1}^{L}} = R^D.

Thus, the empirical risk minimization problem in (4) is equivalent to the following optimization over the linear predictors β = P(w):

min_{β∈R^D} L(β) := Σ_{n=1}^{N} ℓ(⟨x_n, β⟩, y_n).   (5)

1 We follow the convention used in the neural networks literature that refers to the operation in (2) as convolution; in signal processing terminology, (2) is known as the discrete circular cross-correlation operator.

Although the optimization problems (4) and (5) are exactly equivalent in terms of their sets of global minima, in this paper we show that optimizing (4) with different parameterizations leads to very different classifiers compared to optimizing (5) directly.

In particular, consider problems (4)/(5) on a linearly separable dataset {x_n, y_n}_{n=1}^{N} with the logistic loss (the two-class version of the cross-entropy loss typically used in deep learning). The global infimum of L(β) is 0, but it is not attained by any finite β. Instead, the loss can be minimized by scaling the norm of any linear predictor that separates the data to infinity. Thus, any sequence of predictors β(t) (say, from an optimization algorithm) that asymptotically minimizes the loss in eq. (5) necessarily separates the data and diverges in norm, ‖β(t)‖ → ∞. In general there are many linear separators that correctly label the training data, each corresponding to a direction in which we can minimize (5).
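This divergence, with the iterates settling into a particular direction, is easy to reproduce numerically for the direct parameterization of β; a minimal sketch (the toy dataset, step size, and iteration count are our own choices, not from the paper):

```python
import numpy as np

# Toy separable data: the hard-margin SVM direction is (1,1)/sqrt(2), with
# support vectors (1,1) and (-1,-1); the point (0,3) is separated with
# margin > 1 and becomes asymptotically irrelevant.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [0.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])

beta = np.zeros(2)   # direct parameterization beta = w
lr = 0.1
for _ in range(20000):
    residuals = np.exp(-y * (X @ beta))   # derivatives of the exponential loss
    beta -= lr * (-(y * residuals) @ X)   # gradient step on eq. (5)

# The norm diverges (slowly) while the direction stabilizes.
direction = beta / np.linalg.norm(beta)
print(direction)   # drifts toward [0.7071, 0.7071] as t grows
```

The convergence in direction is logarithmically slow, which is why a large iteration count is needed even on a three-point dataset.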
Which of these separators will we converge to when optimizing (4)/(5)? In other words, in which direction β∞ = lim_{t→∞} β(t)/‖β(t)‖ will the iterates of our optimization algorithm diverge? If this limit exists, we say that β(t) converges in direction to the limit direction β∞.

Soudry et al. [2017] studied this implicit bias of gradient descent on (5) over the direct parameterization of β. They showed that for any linearly separable dataset and any initialization, gradient descent w.r.t. β converges in direction to the hard-margin support vector machine solution:

β∞ = β*_{ℓ2}/‖β*_{ℓ2}‖, where β*_{ℓ2} = argmin_{β∈R^D} ‖β‖₂² s.t. ∀n, y_n⟨β, x_n⟩ ≥ 1.   (6)

In this paper we study the behavior of gradient descent on problem (4) w.r.t. different parameterizations of the model class of linear predictors. For initialization w(0) and a sequence of step sizes {η_t}, the gradient descent updates for (4) are given by

w(t+1) = w(t) − η_t ∇_w L_P(w(t)) = w(t) − η_t ∇_w P(w(t)) ∇_β L(P(w(t))),   (7)

where ∇_w P(·) denotes the Jacobian of P : W → R^D with respect to the parameters w, and ∇_β L(·) is the gradient of the loss function in (5).

For separable datasets, if w(t) minimizes (4) for linear fully connected or convolutional networks, then we will again have ‖w(t)‖ → ∞, and the question we ask is: what is the limit direction β∞ = lim_{t→∞} P(w(t))/‖P(w(t))‖ of the predictors P(w(t)) along the optimization path?

The result in Soudry et al.
[2017] holds for any loss function ℓ(u, y) that is strictly monotone in uy with a specific tail behavior, namely a tight exponential tail, which is satisfied by popular classification losses like the logistic and exponential losses. In the rest of the paper, for simplicity, we exclusively focus on the exponential loss ℓ(u, y) = exp(−uy), which has the same tail behavior as the logistic loss. Along the lines of Soudry et al. [2017], our results should also extend to any strictly monotone loss function with a tight exponential tail, including the logistic loss.

3 Main Results

Our main results characterize the implicit bias of gradient descent for multi-layer fully connected and convolutional networks with linear activations. For the gradient descent iterates w(t) in eq. (7), we henceforth denote the induced linear predictor by β(t) = P(w(t)).

Assumptions. In the following theorems, we characterize the limiting predictor β∞ = lim_{t→∞} β(t)/‖β(t)‖ under the following assumptions:

1. The iterates w(t) minimize the objective, i.e., L_P(w(t)) → 0.
2. w(t), and consequently β(t) = P(w(t)), converge in direction to yield a separator β∞ = lim_{t→∞} β(t)/‖β(t)‖ with positive margin, i.e., min_n y_n⟨x_n, β∞⟩ > 0.
3. The gradients with respect to the linear predictors, ∇_β L(β(t)), converge in direction.

These assumptions allow us to focus on the question of which specific linear predictor the gradient descent iterates converge to, separating it from the related optimization questions of when gradient descent iterates minimize the non-convex objective in eq.
(5) and nicely converge in direction.

Figure 1: Implicit bias of gradient descent for different linear network architectures. (a) Fully connected network of depth L: β∞ ∝ argmin_{∀n, y_n⟨x_n,β⟩≥1} ‖β‖₂ (independent of L). (b) Convolutional network of depth L: β∞ ∝ a first order stationary point of argmin_{∀n, y_n⟨x_n,β⟩≥1} ‖β̂‖_{2/L}. (c) Diagonal network of depth L: β∞ ∝ a first order stationary point of argmin_{∀n, y_n⟨x_n,β⟩≥1} ‖β‖_{2/L}.

Theorem 1 (Linear fully connected networks). For any depth L, almost all linearly separable datasets {x_n, y_n}_{n=1}^{N}, almost all initializations w(0), and any bounded sequence of step sizes {η_t}_t, consider the sequence of gradient descent iterates w(t) in eq. (7) for minimizing L_{P_full}(w) in eq. (4) with exponential loss ℓ(ŷ, y) = exp(−ŷy) over L-layer fully connected linear networks.

If (a) the iterates w(t) minimize the objective, i.e., L_{P_full}(w(t)) → 0, (b) w(t), and consequently β(t) = P_full(w(t)), converge in direction to yield a separator with positive margin, and (c) the gradients with respect to the linear predictors, ∇_β L(β(t)), converge in direction, then the limit direction is given by

β∞ = lim_{t→∞} P_full(w(t))/‖P_full(w(t))‖ = β*_{ℓ2}/‖β*_{ℓ2}‖, where β*_{ℓ2} := argmin_β ‖β‖₂² s.t. ∀n, y_n⟨x_n, β⟩ ≥ 1.   (8)

For fully connected networks with a single output, Theorem 1 shows that there is no effect of depth on the implicit bias of gradient descent.
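The chain-rule update in eq. (7) can be implemented directly for P_full; a minimal sketch for L = 2 with a single output (the dimensions, data, and finite-difference check are our own choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, N = 4, 3, 5
X = rng.standard_normal((N, D))
y = rng.choice([-1.0, 1.0], size=N)
w1, w2 = rng.standard_normal((D, H)), rng.standard_normal(H)

def loss(w1, w2):
    beta = w1 @ w2                        # P_full(w) for L = 2, single output
    return np.exp(-y * (X @ beta)).sum()  # exponential loss, eq. (4)

# Chain rule as in eq. (7): gradient w.r.t. beta, then the Jacobian of P_full.
beta = w1 @ w2
grad_beta = -((y * np.exp(-y * (X @ beta))) @ X)   # gradient of (5) at beta
grad_w1 = np.outer(grad_beta, w2)                  # since d beta[i] / d w1[i, j] = w2[j]
grad_w2 = w1.T @ grad_beta

# Finite-difference check of one coordinate of each layer's gradient.
eps = 1e-6
e1 = np.zeros((D, H)); e1[0, 0] = 1.0
fd1 = (loss(w1 + eps * e1, w2) - loss(w1 - eps * e1, w2)) / (2 * eps)
assert np.isclose(fd1, grad_w1[0, 0], rtol=1e-3, atol=1e-6)
e2 = np.zeros(H); e2[0] = 1.0
fd2 = (loss(w1, w2 + eps * e2) - loss(w1, w2 - eps * e2)) / (2 * eps)
assert np.isclose(fd2, grad_w2[0], rtol=1e-3, atol=1e-6)

# One gradient descent step in parameter space, eq. (7):
lr = 0.01
w1, w2 = w1 - lr * grad_w1, w2 - lr * grad_w2
```

Note that the update moves in parameter space, not predictor space, which is exactly why different parameterizations of the same model class can end up at different separators.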
Regardless of the depth of the network, the asymptotic classifier is always the hard-margin support vector machine classifier, which is also the limit direction of gradient descent for linear logistic regression with the direct parameterization β = w. In contrast, we show next that for convolutional networks we get very different biases. Let us first look at a 2-layer linear convolutional network, i.e., a network with a single convolutional layer followed by a fully connected final layer.

Recall that β̂ ∈ C^D denotes the Fourier coefficients of β, i.e., β̂[d] = (1/√D) Σ_{p=0}^{D−1} β[p] exp(−2πipd/D), and that any non-zero z ∈ C is written in polar form as z = |z| e^{iφ_z} for φ_z ∈ [0, 2π). The linear predictors induced by the gradient descent iterates w(t) for convolutional networks are denoted by β(t) = P_conv(w(t)). It is evident that if β(t) converges in direction to β∞, then its Fourier transform β̂(t) converges in direction to β̂∞. In the following theorems, in addition to the earlier assumptions, we further assume the technical condition that the phases of the Fourier coefficients, e^{iφ_{β̂(t)}}, converge coordinate-wise. For coordinates d with β̂∞[d] ≠ 0, this follows from convergence in direction of w(t), in which case e^{iφ_{β̂(t)}[d]} → e^{iφ_{β̂∞}[d]}. We assume such a φ_{β̂∞}[d] also exists when β̂∞[d] = 0.

Theorem 2 (Linear convolutional networks of depth two). For almost all linearly separable datasets {x_n, y_n}_{n=1}^{N}, almost all initializations w(0), and any sequence of step sizes {η_t}_t with η_t smaller than the local Lipschitz constant at w(t), consider the sequence of gradient descent iterates w(t) in eq. (7) for minimizing L_{P_conv}(w) in eq. (4) with exponential loss over 2-layer linear convolutional networks.

If (a) the iterates w(t) minimize the objective, i.e., L_{P_conv}(w(t)) → 0, (b) w(t) converge in direction to yield a separator β∞ with positive margin, (c) the phases of the Fourier coefficients β̂(t) of the linear predictors β(t) converge coordinate-wise, i.e., ∀d, e^{iφ_{β̂(t)}[d]} → e^{iφ_{β̂∞}[d]}, and (d) the gradients ∇_β L(β(t)) converge in direction, then the limit direction β∞ is given by

β∞ = β*_{F,1}/‖β*_{F,1}‖, where β*_{F,1} := argmin_β ‖β̂‖₁ s.t. ∀n, y_n⟨β, x_n⟩ ≥ 1.   (9)

We already see how introducing a single convolutional layer changes the implicit bias of gradient descent: even without any explicit regularization, gradient descent on the parameters of a convolutional network architecture returns solutions that are biased to be sparse in the frequency domain. Furthermore, unlike for fully connected networks, for convolutional networks the implicit bias also changes with the depth of the network, as shown by the following theorem.

Theorem 2a (Linear convolutional networks of any depth). For any depth L, under the conditions of Theorem 2, the limit direction β∞ = lim_{t→∞} P_conv(w(t))/‖P_conv(w(t))‖ is a scaling of a first order stationary point of the following optimization problem:

min_β ‖β̂‖_{2/L} s.t.
∀n, y_n⟨β, x_n⟩ ≥ 1,   (10)

where the ℓ_p penalty, given by ‖z‖_p = (Σ_{i=1}^{D} |z[i]|^p)^{1/p} (also called the bridge penalty), is a norm for p = 1 and a quasi-norm for p < 1.

When L > 2, and thus p = 2/L < 1, problem (10) is non-convex and intractable [Ge et al., 2011]. Hence, we cannot expect to ensure convergence to a global minimum. Instead we show convergence to a first order stationary point of (10), in the sense of the sub-stationary points of Rockafellar [1979] for optimization problems with non-smooth and non-convex objectives. These are solutions where the local directional derivatives along all directions in the tangent cone of the constraints are zero.

The first order stationary points, or sub-stationary points, of (10) are the set of feasible predictors β such that ∃{α_n ≥ 0}_{n=1}^{N} satisfying the following: ∀n, y_n⟨x_n, β⟩ > 1 ⟹ α_n = 0, and

Σ_n α_n y_n x̂_n ∈ ∂°‖β̂‖_p,   (11)

where x̂_n is the Fourier transform of x_n, and ∂° denotes the local sub-differential (or Clarke's sub-differential) operator, defined as ∂°f(β) = conv{v : ∃(z_k)_k s.t. z_k → β and ∇f(z_k) → v}.

For p = 1, with β̂ represented in polar form as β̂ = |β̂| e^{iφ_{β̂}} ∈ C^D, ‖β̂‖_p is convex and the local sub-differential is indeed the global sub-differential, given by

∂°‖β̂‖₁ = {ẑ : ∀d, |ẑ[d]| ≤ 1 and β̂[d] ≠ 0 ⟹ ẑ[d] = e^{iφ_{β̂}[d]}}.   (12)

For p < 1, the local sub-differential of ‖β̂‖_p is given by

∀p < 1, ∂°‖β̂‖_p = {ẑ : β̂[d] ≠ 0 ⟹ ẑ[d] = p e^{iφ_{β̂}[d]} |β̂[d]|^{p−1}}.   (13)

Figures 1a-1b summarize the implications of the main results in the paper. The proof of this theorem exploits the following representation of P_conv(w) in the Fourier domain.

Lemma 3. For full-dimensional convolutions, β = P_conv(w) is equivalent to

β̂ = diag(ŵ_1) . . .
diag(ŵ_{L−1}) ŵ_L,

where for l = 1, 2, . . . , L, ŵ_l ∈ C^D are the Fourier coefficients of the parameters w_l ∈ R^D.

From the above lemma (proved in Appendix C), we can see a connection between convolutional networks and a special network where the linear transformation between layers is restricted to the diagonal entries (see the depiction in Figure 1c); we refer to such networks as linear diagonal networks.

The proofs of Theorem 1 and Theorems 2-2a are provided in Appendix B and C, respectively.

4 Understanding Gradient Descent in the Parameter Space

We can decompose the characterization of the implicit bias of gradient descent on a parameterization P(w) into two parts: (a) what is the implicit bias of gradient descent in the space of parameters w, and (b) what does this imply in terms of the linear predictor β = P(w), i.e., how does the bias in parameter space translate to the linear predictor learned from the model class?

We look at the first question for a broad class of linear models, where the linear predictor is given by a homogeneous polynomial mapping of the parameters: β = P(w), where w ∈ R^P are the parameters of the model and P : R^P → R^D satisfies the definition below. This class covers the linear convolutional, fully connected, and diagonal networks discussed in Section 3.

Definition (Homogeneous Polynomial). A multivariate polynomial function P : R^P → R^D is said to be homogeneous if, for some finite integer ν < ∞, ∀α ∈ R, v ∈ R^P, P(αv) = α^ν P(v).

Theorem 4 (Homogeneous Polynomial Parameterization). For any homogeneous polynomial map P : R^P → R^D from parameters w ∈ R^P to linear predictors, almost all datasets {x_n, y_n}_{n=1}^{N} separable by B := {P(w) : w ∈ R^P}, almost all initializations w(0), and any bounded sequence of step sizes {η_t}_t, consider the sequence of gradient descent updates w(t) from eq.
(7) for minimizing the empirical risk objective L_P(w) in (4) with exponential loss ℓ(u, y) = exp(−uy).

If (a) the iterates w(t) asymptotically minimize the objective, i.e., L_P(w(t)) = L(P(w(t))) → 0, (b) w(t), and consequently β(t) = P(w(t)), converge in direction to yield a separator with positive margin, and (c) the gradients w.r.t. the linear predictors, ∇_β L(β(t)), converge in direction, then the limit direction of the parameters w∞ = lim_{t→∞} w(t)/‖w(t)‖₂ is a positive scaling of a first order stationary point of the following optimization problem:

min_{w∈R^P} ‖w‖₂² s.t. ∀n, y_n⟨x_n, P(w)⟩ ≥ 1.   (14)

Theorem 4 is proved in Appendix A. The proof involves showing that the asymptotic direction of the gradient descent iterates satisfies the KKT conditions for first order stationary points of (14). This crucially relies on two properties. First, the sequence of gradients ∇_β L(β(t)) converges in direction to a positive span of the support vectors of β∞ = lim_{t→∞} β(t)/‖β(t)‖ (Lemma 8 in Gunasekar et al. [2018]); this result relies on the loss function ℓ having an exponential tail. Secondly, if P is not homogeneous, then the optimization problems min_w ‖w‖₂² s.t. ∀n, y_n⟨x_n, P(w)⟩ ≥ γ for different values of the unnormalized margin γ are not equivalent and lead to different separators. Thus, for general non-homogeneous P, an unnormalized margin of one has no special significance, and the necessary conditions for first order stationarity of (14) are not satisfied.

Finally, we also note that in many cases (including linear convolutional networks) the optimization problem (14) is non-convex and intractable (see e.g., Ge et al. [2011]).
So we cannot expect w∞ to always be a global minimizer of eq. (14). We however suspect that it is possible to obtain a stronger result: that w∞ reaches a higher order stationary point, or even a local minimum, of the explicitly regularized estimator in eq. (14).

Implications of the implicit bias in predictor space While eq. (14) characterizes the bias of gradient descent in the parameter space, what we really care about is the effective bias introduced in the space of functions learned by the network. In our case, this class of functions is the set of linear predictors {β ∈ R^D}. The ℓ2-norm penalized solution in eq. (14) is equivalently given by

β*_{R_P} = argmin_β R_P(β) s.t. ∀n, y_n⟨β, x_n⟩ ≥ 1, where R_P(β) = inf_{w:P(w)=β} ‖w‖₂².   (15)

The problems in eq. (14) and eq. (15) have the same global minimizers, i.e., w* is a global minimizer of eq. (14) if and only if β* = P(w*) minimizes eq. (15). However, this equivalence does not extend to the stationary points of the two problems. Specifically, it is possible that a stationary point of eq. (14) is merely a feasible point of eq. (15) with no special significance. So, instead of using Theorem 4, for the specific networks in Section 3 we directly show (in the Appendix) that the gradient descent updates converge in direction to a first order stationary point of the problem in eq. (15).

5 Understanding Gradient Descent in Predictor Space

In the previous section, we saw that the implicit bias of gradient descent on a parameterization P(w) can be described in terms of the optimization problem (14) and the implied penalty function R_P(β) = min_{w:P(w)=β} ‖w‖₂².
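The implied penalty can also be probed numerically. For the diagonal parameterization P_diag of Figure 1c, the minimization of ‖w‖₂² over factorizations decouples per coordinate, and the AM-GM inequality gives the value L‖β‖_{2/L}^{2/L} stated in Lemma 6 below. A sketch (our own construction, restricted to positive entries for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)
D, L = 5, 3
beta = rng.uniform(0.5, 2.0, size=D)   # target predictor (positive entries)

# Closed form from Lemma 6: R_Pdiag(beta) = L * ||beta||_{2/L}^{2/L}.
closed_form = L * (beta ** (2.0 / L)).sum()

# The balanced factorization w_l[d] = beta[d]**(1/L) attains it.
balanced = [beta ** (1.0 / L) for _ in range(L)]
assert np.allclose(np.prod(balanced, axis=0), beta)
assert np.isclose(sum((w ** 2).sum() for w in balanced), closed_form)

# Any other feasible factorization (entrywise product equal to beta) costs
# at least as much, by AM-GM applied coordinate-wise.
for _ in range(100):
    ws = [rng.uniform(0.2, 3.0, size=D) for _ in range(L - 1)]
    ws.append(beta / np.prod(ws, axis=0))   # force feasibility in the last layer
    cost = sum((w ** 2).sum() for w in ws)
    assert cost >= closed_form - 1e-9
```

The same per-coordinate argument, applied to the Fourier coefficients instead of the raw coordinates, is what yields the convolutional-network penalty of Lemma 7.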
We now turn to studying this implied penalty $R_\mathcal{P}(\beta)$ and obtaining explicit forms for it, which will reveal the precise form of the implicit bias in terms of the learned linear predictor. The proofs of the lemmas in this section are provided in Appendix D.

Lemma 5. For fully connected networks of any depth $L > 0$,
$$R_{\mathcal{P}_{\mathrm{full}}}(\beta) = \min_{w : \mathcal{P}_{\mathrm{full}}(w) = \beta} \|w\|_2^2 = L \|\beta\|_2^{2/L} = \mathrm{monotone}(\|\beta\|_2).$$

We see that $\beta^*_{R_{\mathcal{P}_{\mathrm{full}}}} = \operatorname*{argmin}_\beta R_{\mathcal{P}_{\mathrm{full}}}(\beta)$ s.t. $\forall n,\; y_n \langle x_n, \beta \rangle \ge 1$ in eq. (15) for fully connected networks is independent of the depth $L$ of the network. In Theorem 1, we indeed show that gradient descent for this class of networks converges in the direction of $\beta^*_{R_{\mathcal{P}_{\mathrm{full}}}}$.

Next, we motivate the characterization of $R_\mathcal{P}(\beta)$ for linear convolutional networks by first looking at the special linear diagonal network depicted in Figure 1c. The depth-$L$ diagonal network is parameterized by $w = [w_l \in \mathbb{R}^D]_{l=1}^L$ and the mapping to a linear predictor is given by $\mathcal{P}_{\mathrm{diag}}(w) = \mathrm{diag}(w_1)\,\mathrm{diag}(w_2)\cdots\mathrm{diag}(w_{L-1})\,w_L$.

Lemma 6. For a depth-$L$ diagonal network with parameters $w = [w_l \in \mathbb{R}^D]_{l=1}^L$, we have
$$R_{\mathcal{P}_{\mathrm{diag}}}(\beta) = \min_{w : \mathcal{P}_{\mathrm{diag}}(w) = \beta} \|w\|_2^2 = L \|\beta\|_{2/L}^{2/L} = \mathrm{monotone}(\|\beta\|_{2/L}).$$

Finally, for full width linear convolutional networks parameterized by $w = [w_l \in \mathbb{R}^D]_{l=1}^L$, recall the following representation of $\beta = \mathcal{P}_{\mathrm{conv}}(w)$ in the Fourier domain from Lemma 3:
$$\widehat{\beta} = \mathrm{diag}(\widehat{w}_1)\cdots\mathrm{diag}(\widehat{w}_{L-1})\,\widehat{w}_L,$$
where $\widehat{\beta}, \widehat{w}_l \in \mathbb{C}^D$ are the Fourier basis representations of $\beta, w_l \in \mathbb{R}^D$, respectively. Extending the result for diagonal networks to complex vector spaces, we get the following characterization of $R_{\mathcal{P}_{\mathrm{conv}}}(\beta)$ for linear convolutional networks.

Lemma 7. For a depth-$L$ convolutional network with parameters $w = [w_l \in \mathbb{R}^D]_{l=1}^L$, we have
$$R_{\mathcal{P}_{\mathrm{conv}}}(\beta) = \min_{w : \mathcal{P}_{\mathrm{conv}}(w) = \beta} \|w\|_2^2 = L \|\widehat{\beta}\|_{2/L}^{2/L} = \mathrm{monotone}(\|\widehat{\beta}\|_{2/L}).$$

6 Discussion

In this paper, we characterized the implicit bias of gradient descent on linear convolutional networks. We showed that even in the case of linear activations and a full width convolution, wherein the convolutional network defines the exact same model class as fully connected networks, merely changing to a convolutional parameterization introduces a radically different, and very interesting, bias when training with gradient descent. Namely, training a convolutional representation with gradient descent implicitly biases towards sparsity in the frequency domain representation of the linear predictor.
For convenience and simplicity of presentation, we studied one dimensional circular convolutions. Our results can be directly extended to higher dimensional input signals and convolutions, including the two-dimensional convolutions common in image processing and computer vision. We also expect similar results for convolutions with zero padding instead of circular convolutions, although this requires a more careful analysis of the edge effects.
A more significant way in which our setup differs from usual convolutional networks is that we use full width convolutions, while in practice it is common to use convolutions with bounded width, much smaller than the input dimensionality. This setting is within the scope of Theorem 4, as the linear transformation is still homogeneous. However, understanding the implied bias in the predictor space, i.e., understanding $R_\mathcal{P}(\beta)$, requires additional work.
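The Fourier-domain representation behind Lemma 7 rests on the standard convolution theorem: circular convolution acts as elementwise multiplication on DFT coefficients. The following quick numerical check illustrates this fact; note that the paper's exact convention for $\mathcal{P}_{\mathrm{conv}}$ may differ from plain circular convolution (e.g., by conjugation or ordering), so this is only a sketch of the underlying identity.

```python
import numpy as np

# Circular convolution of two real vectors, directly from the definition:
# (u * v)[k] = sum_d u[d] v[(k - d) mod D].
def circ_conv(u, v):
    D = len(u)
    return np.array([sum(u[d] * v[(k - d) % D] for d in range(D))
                     for k in range(D)])

rng = np.random.default_rng(1)
D, L = 8, 3
ws = [rng.normal(size=D) for _ in range(L)]

# A depth-3 linear predictor formed by composing circular convolutions.
beta = circ_conv(circ_conv(ws[0], ws[1]), ws[2])

# In the Fourier domain the composition is coordinatewise: the DFT of beta
# equals the elementwise product of the DFTs of the layer parameters.
beta_hat = np.fft.fft(ws[0]) * np.fft.fft(ws[1]) * np.fft.fft(ws[2])
print(np.allclose(np.fft.fft(beta), beta_hat))   # True
```

This coordinatewise structure in the DFT basis is what lets the diagonal-network argument of Lemma 6 carry over (for complex vectors) to give the $\|\widehat{\beta}\|_{2/L}^{2/L}$ penalty of Lemma 7.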
It will be very interesting to see if restricting the width of the convolutional network gives rise to further interesting behaviors.
Another important direction for future study is understanding the implicit bias for networks with multiple outputs. For both fully connected and convolutional networks, we looked at networks with a single output. With $C > 1$ outputs, the network implements a linear transformation $x \mapsto \beta x$, where $\beta \in \mathbb{R}^{C \times D}$ is now a matrix. Results for matrix sensing in Gunasekar et al. [2018] imply that for two layer fully connected networks with multiple outputs, the implicit bias is towards a maximum margin solution with respect to the nuclear norm $\|\beta\|_\star$. This is already different from the implicit bias of a one-layer "network" (i.e., optimizing $\beta$ directly), which would be in terms of the Frobenius norm $\|\beta\|_F$ (from the result of Soudry et al. [2017]). We suspect that with multiple outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on the singular values of the effective linear matrix predictor $\beta \in \mathbb{R}^{C \times D}$. Precisely characterizing these biases requires further study.
When using convolutions as part of a larger network, with multiple parallel filters, max pooling, and non-linear activations, the situation is of course more complex, and we do not expect to get the exact same bias. However, we do expect the bias to be at the very least related to the sparsity-in-frequency-domain bias that we uncover here, and we hope our work can serve as a basis for further such study.
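The nuclear norm connection for two layer multi-output networks discussed above rests on the classical variational characterization $\|\beta\|_\star = \tfrac{1}{2}\min_{AB=\beta}\left(\|A\|_F^2 + \|B\|_F^2\right)$, a well-known identity rather than a result of this paper. A small numerical illustration of it (our own, with arbitrary sizes and seed):

```python
import numpy as np

rng = np.random.default_rng(2)
C, D = 3, 5
W = rng.normal(size=(C, D))

# Nuclear norm as the sum of singular values.
nuc = np.linalg.svd(W, compute_uv=False).sum()

# Balanced factorization W = A @ B from the SVD:
# A = U sqrt(S), B = sqrt(S) V^T, whose cost ||A||_F^2 + ||B||_F^2
# equals 2 * ||W||_*.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U * np.sqrt(S)
B = np.sqrt(S)[:, None] * Vt
balanced_cost = (A**2).sum() + (B**2).sum()
print(np.isclose(balanced_cost, 2 * nuc))   # True

# Any other factorization of the same W can only cost more.
M = rng.normal(size=(C, C)) + 3 * np.eye(C)  # invertible reparameterization
A2 = A @ M
B2 = np.linalg.solve(M, B)
assert np.allclose(A2 @ B2, W)
print((A2**2).sum() + (B2**2).sum() >= balanced_cost - 1e-9)   # True
```

In other words, minimizing the summed squared Frobenius norms of the two layers over all factorizations of $\beta$ recovers (twice) the nuclear norm, which is why the $\ell_2$ bias in parameter space translates to a nuclear norm bias in predictor space for this case.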
There are of course many other implicit and explicit sources of inductive bias; here we show that merely parameterizing transformations via convolutions and using gradient descent for training already induces sparsity in the frequency domain.
On a technical level, we provided a generic characterization of the bias of gradient descent on linear models parameterized as $\beta = \mathcal{P}(w)$ for a homogeneous polynomial $\mathcal{P}$. The $\ell_2$ bias (in parameter space) we obtained is not surprising, but it also should not be taken for granted: the result does not hold in general for non-homogeneous $\mathcal{P}$, and even with homogeneous polynomials the characterization is not as crisp when other loss functions are used. For example, with a squared loss and matrix factorization (a homogeneous degree two polynomial representation), the implicit bias is much more fragile [Gunasekar et al., 2017, Li et al., 2017]. Moreover, Theorem 4 only ensures convergence to a first order stationary point in the parameter space, which is not sufficient for convergence to stationary points of the implied bias in the model space (eq. (15)). It is of interest for future work to strengthen this result to show either convergence to higher order stationary points or local minima in parameter space, or to directly show convergence to stationary points of eq. (15).
It would also be of interest to strengthen other technical aspects of our results: extend the results to loss functions with tight exponential tails (including the logistic loss), and handle all datasets, including the measure zero set of degenerate datasets; both should be possible following the techniques of Soudry et al. [2017], Telgarsky [2013], Ji and Telgarsky [2018]. We can also calculate exact rates of convergence to the asymptotic separator along the lines of Soudry et al. [2017], Nacson et al.
[2018], and Ji and Telgarsky [2018], showing how fast the inductive bias from optimization kicks in, and why it might be beneficial to continue optimizing even after the loss value $\mathcal{L}(\beta(t))$ itself is negligible. Finally, for logistic regression, Ji and Telgarsky [2018] extend the results on asymptotic convergence of the gradient descent classifier to cases where the data is not strictly linearly separable. This is an important relaxation of our assumption of strict linear separability. More generally, for non-separable data, we would like a more fine grained analysis connecting the iterates $\beta(t)$ along the optimization path to the estimates along the regularization path, $\widehat{\beta}(c) = \operatorname*{argmin}_{R_\mathcal{P}(\beta) \le c} \mathcal{L}(\beta)$, where an explicit regularization is added to the optimization objective.

References

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2003.

Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, 2017.

Dongdong Ge, Xiaoye Jiang, and Yinyu Ye. A note on the complexity of $\ell_p$ minimization.
Mathematical Programming, 2011.

Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, 2017.

Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 1997.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, 2017.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.

Michel Journée, Francis Bach, P-A Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327-2351, 2010.

Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, 2009.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 2016.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.

Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, 2016.

Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.

Marian Muresan. A concrete approach to classical analysis, volume 14.
Springer, 2009.

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.

Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422-2430, 2015a.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015b.

Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint, 2017.

Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.

R Tyrrell Rockafellar. Directionally Lipschitzian functions and subdifferential calculus. Proceedings of the London Mathematical Society, 1979.

Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations, 2018.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

Matus Telgarsky. Margins, shrinkage, and boosting. In Proceedings of the 30th International Conference on Machine Learning, 2013.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization.
In International Conference on Learning Representations, 2017.