{"title": "Connecting Optimization and Regularization Paths", "book": "Advances in Neural Information Processing Systems", "page_first": 10608, "page_last": 10619, "abstract": "We study the implicit regularization properties of optimization techniques by explicitly connecting their optimization paths to the regularization paths of ``corresponding'' regularized problems. This surprising connection shows that iterates of optimization techniques such as gradient descent and mirror descent are \\emph{pointwise} close to solutions of appropriately regularized objectives. While such a tight connection between optimization and regularization is of independent intellectual interest, it also has important implications for machine learning: we can port results from regularized estimators to optimization, and vice versa. We investigate one key consequence, that borrows from the well-studied analysis of regularized estimators, to then obtain tight excess risk bounds of the iterates generated by optimization techniques.", "full_text": "Connecting Optimization and Regularization Paths\n\nArun Sai Suggala\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nasuggala@cs.cmu.edu\n\nAdarsh Prasad\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nadarshp@cs.cmu.edu\n\nPradeep Ravikumar\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\npradeepr@cs.cmu.edu\n\nAbstract\n\nWe study the implicit regularization properties of optimization techniques by\nexplicitly connecting their optimization paths to the regularization paths of \u201ccor-\nresponding\u201d regularized problems. This surprising connection shows that iterates\nof optimization techniques such as gradient descent and mirror descent are point-\nwise close to solutions of appropriately regularized objectives. 
While such a tight connection between optimization and regularization is of independent intellectual interest, it also has important implications for machine learning: we can port results from regularized estimators to optimization, and vice versa. We investigate one key consequence, which borrows from the well-studied analysis of regularized estimators, to then obtain tight excess risk bounds for the iterates generated by optimization techniques.\n\n1 Introduction\n\nWith the recent success of optimization techniques in training over-parametrized deep neural networks, there has been growing interest in understanding the implicit regularization properties of various optimization techniques. Consequently, one line of work has focused on characterizing the implicit biases of the global optima reached by various optimization algorithms. For example, Gunasekar et al. [2017] consider the problem of matrix factorization and show that gradient descent (GD) on the un-regularized objective converges to the minimum nuclear norm solution. Soudry et al. [2017] study gradient descent on un-regularized logistic regression and show that when the data is linearly separable, gradient descent converges to a max-margin solution. Gunasekar et al. [2018] generalize the results of Soudry et al. [2017] and study the limit behavior of the iterates of general optimization techniques when the data is linearly separable.\n\nAnother line of work has focused on studying the implicit regularization properties of early stopping of various optimization algorithms, a widely used technique in neural network training. These works show that stopping the iterative optimization of an empirical problem early performs a form of implicit regularization. Yao et al. [2007] focus on non-parametric regression in reproducing kernel Hilbert spaces and provide theoretical justification for early stopping. In a similar setting, Raskutti et al. 
[2014] show that early stopping of gradient descent on the least squares objective achieves risk bounds similar to those of the corresponding regularized problem, namely ridge regression. Hardt et al. [2015] and Rosasco and Villa [2015] study the implicit regularization properties of early stopping of stochastic gradient descent (SGD). All these results show that early stopping achieves performance similar to optimizing the corresponding regularized objective.\n\nFurthermore, several recent works suggest that there could be a much deeper connection between the iterates generated by optimization techniques on un-regularized objectives (the optimization path) and the minimizers of the corresponding regularized objectives (the regularization path) than the performance similarity observed in the early stopping literature. Friedman and Popescu [2003] empirically observe that for linear regression, the optimization and regularization paths are very similar to each other. Rosset et al. [2004a] show that under certain conditions on the problem, the path traced by coordinate descent or boosting is similar to the regularization path of the L1-constrained problem. In a related work, Neu and Rosasco [2018] consider the problem of linear least squares regression and show that the iterates produced by GD on the least squares objective are related to the solutions of ridge regression. Specifically, for any given regularization parameter of ridge regression, Neu and Rosasco [2018] show that there exists a weighting of the GD iterates that is exactly equal to the ridge solution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nIn this work, we take a step towards understanding the deeper connection between the two paths by explicitly connecting the optimization path to the regularization path of the corresponding regularized problem. 
Our results explicitly show that the sequence of iterates produced by iterative optimization techniques such as gradient descent and mirror descent on strongly convex functions lies pointwise close to the regularization path of a corresponding regularized objective. This surprising connection allows us to transfer insights from regularization to optimization and vice versa. We expect that our work will lead to a new class of results in both fields that explicitly draw upon this connection.\n\nIn this paper, we focus on a particular consequence of our connection: we derive excess risk bounds for the iterates of optimization techniques. There has been a substantial body of work in the fields of machine learning and statistics on understanding the risk bounds of regularized problems [Negahban et al., 2009, Hsu et al., 2012]. We utilize these results to derive excess risk bounds for the iterates of optimization techniques.\n\nRecently, there has been a line of work studying the excess risk of iterates of optimization techniques. Yao et al. [2007] and Raskutti et al. [2014] focus on non-parametric regression in a reproducing kernel Hilbert space and derive excess risk bounds for the iterates of gradient descent. Wei et al. [2017] extend these results to a broad class of loss functions. In the context of finite dimensional spaces, Hardt et al. [2015] and Chen et al. [2018] use the notion of algorithmic stability, introduced by Bousquet and Elisseeff [2002], to derive bounds on the expected excess risk of the iterates of various methods. Our technique for deriving excess risk bounds can be viewed as an alternative to stability, with the advantage that we can make use of existing results on the statistical properties of regularized problems. 
Moreover, this approach has the potential to obtain much tighter bounds than stability, and we stress that any improvement in the analysis of regularized estimation will directly translate to a tighter bound for the corresponding optimization problem.\n\nThe main contributions of the paper are as follows. For strongly convex and smooth objectives, we explicitly connect the optimization path of GD and the regularization path of the squared-L2 (L2²) penalized objective. We further extend these results to mirror descent with strongly convex and smooth divergences. We use these connections to derive the excess risk of the iterates of GD. For convex objectives, we show that the connection need not hold in general. However, for the problem of classification with separable data, we show that for losses with exponentially decaying tails, the optimization path of GD is close to the regularization path of the corresponding regularized objective.\n\n2 Strongly Convex Loss\n\nIn this section we explicitly connect the optimization path of GD and the regularization path of the L2²-penalized objective on strongly convex and smooth functions. Let f : ℝᵖ → ℝ be a twice-differentiable function which is strongly convex and smooth with parameters m, M > 0. In this work we mainly focus on continuous-time GD (that is, GD with infinitesimally small step size). 
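Before the formal statements, a small numerical sketch may help fix ideas. The following illustration is ours, not the paper's: it compares the gradient-flow (continuous-time GD) path on a synthetic 2-D strongly convex quadratic with the minimizers of the corresponding L2²-penalized problems anchored at θ₀, matched through the time-to-penalty coupling ν(t) = (e^{cMt} − 1)/(cm), c = 2κ/(1+κ), κ = m/M used later in Theorem 1. The specific quadratic and time grid are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic strongly convex quadratic f(w) = 0.5 (w - w_star)^T A (w - w_star).
# The eigenvalues of A give the strong convexity (m) and smoothness (M) constants.
A = np.diag([1.0, 2.0])
m, M = 1.0, 2.0
kappa = m / M
c = 2 * kappa / (1 + kappa)        # coupling constant from Theorem 1
w_star = np.array([1.0, -1.0])     # minimizer of f (illustrative choice)
w0 = np.zeros(2)                   # common starting / anchor point

def grad_flow(t):
    """Closed-form gradient-flow iterate: w(t) = w* + exp(-A t)(w0 - w*)."""
    E = np.diag(np.exp(-np.diag(A) * t))
    return w_star + E @ (w0 - w_star)

def reg_path(nu):
    """Minimizer of f(w) + (1/(2 nu)) ||w - w0||_2^2."""
    return np.linalg.solve(A + np.eye(2) / nu, A @ w_star + w0 / nu)

for t in [0.5, 1.0, 2.0, 4.0]:
    nu = (np.exp(c * M * t) - 1) / (c * m)
    gap = np.linalg.norm(grad_flow(t) - reg_path(nu))
    print(f"t={t:4.1f}  ||theta(t) - theta(nu(t))||_2 = {gap:.4f}")
```

Both paths start at θ₀, end at the minimizer, and the pointwise gap between them shrinks as t grows, which is the qualitative behavior the theory below quantifies.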
We define the optimization path of GD on f(θ), started at θ₀, as the trajectory followed by the GD iterates, which is given by the following ordinary differential equation (ODE):\n\nθ̇(t) := (d/dt) θ(t) = −∇f(θ(t)), θ(0) = θ₀.\n\nWe now relate the above optimization path to the regularization path of the corresponding L2²-penalized objective, which is defined as the 1-dimensional path of optimal solutions of the following regularized objective, obtained as we vary ν over [0, ∞):\n\nθ(ν) = argmin_θ f(θ) + (1/(2ν)) ‖θ − θ₀‖₂². (1)\n\nThe following theorem bounds the distance between the optimization and regularization paths.\n\nTheorem 1. Let θ̂ be the minimizer of f(θ). Let κ = m/M and let c = 2κ/(1+κ). Moreover, let the regularization penalty ν and time t be related through the relation ν(t) = (1/(cm))(e^{cMt} − 1). Suppose GD is started at θ₀. Then\n\n‖θ(t) − θ(ν(t))‖₂ ≤ (‖∇f(θ₀)‖₂/m) ( e^{−mt} − c/(c − 1 + e^{cMt}) ).\n\nNote that when κ = 1 the upper bound in the above theorem is equal to 0, thus showing that both paths are exactly the same. To get a sense of the quality of the bound, we compare it with a simple triangle-inequality based bound, where we derive an upper bound for ‖θ(t) − θ(ν(t))‖₂ by first bounding ‖θ(t) − θ̂‖₂ and ‖θ(ν(t)) − θ̂‖₂ and then using a triangle inequality.\n\nTheorem 2 (Weak Bound). Consider the same setting as in Theorem 1, and let ν(t) = (1/(cm))(e^{cMt} − 1). Then\n\n‖θ(t) − θ(ν(t))‖₂ ≤ ‖θ(t) − θ̂‖₂ + ‖θ(ν(t)) − θ̂‖₂ ≤ (‖∇f(θ₀)‖₂/m) ( e^{−mt} + c/(c − 1 + e^{cMt}) ).\n\nThe above theorem gives an O(e^{−mt} + e^{−cMt}) upper bound for the distance ‖θ(t) − θ(ν(t))‖₂, whereas Theorem 1 gives an O(e^{−mt} − e^{−cMt}) upper bound, which is strictly better. Moreover, for small t, the bound in Theorem 2 is much weaker than the bound in Theorem 1. As we show later, the tighter bound in Theorem 1 helps us obtain tight generalization bounds and early stopping rules for the iterates of GD.\n\nBy choosing a different relation ν(t), one can obtain a different connection and a different upper bound for the distance between the optimization and regularization paths. In Appendix A.1 we consider different choices of ν(t) and obtain different bounds. We believe the bounds in Theorem 1 and Appendix A.1 can be further improved by choosing an “optimal” ν(t).\n\n2.1 Extension to Mirror Descent\n\nIn this section we provide an extension of Theorem 1 to mirror descent. Before we proceed, we briefly review mirror descent. For a complete review of the properties of mirror descent and Bregman divergences see [Banerjee et al., 2005, Bubeck et al., 2015]. Let ψ be a continuously differentiable Legendre function defined on ℝᵖ. Moreover, let ψ be α-strongly convex w.r.t. a reference norm ‖·‖:\n\nψ(θ₂) − ψ(θ₁) − ⟨∇ψ(θ₁), θ₂ − θ₁⟩ ≥ (α/2) ‖θ₂ − θ₁‖².\n\nThen the Bregman divergence D_ψ induced by ψ is defined as D_ψ(θ₂, θ₁) = ψ(θ₂) − ψ(θ₁) − ⟨∇ψ(θ₁), θ₂ − θ₁⟩.\n\nMirror Descent (MD). Suppose we want to minimize a convex function f(θ) over ℝᵖ. 
Then mirror descent with divergence D_ψ uses the following update rule to estimate the minimizer:\n\nθ_{t+1} = argmin_θ f(θ_t) + ⟨∇f(θ_t), θ − θ_t⟩ + (1/η_t) D_ψ(θ, θ_t).\n\nSolving the above problem gives the following update rule: ∇ψ(θ_{t+1}) = ∇ψ(θ_t) − η_t ∇f(θ_t). The continuous-time dynamics of MD, started at θ₀, are given by the following ODE:\n\nθ̇(t) = −[∇²ψ(θ(t))]⁻¹ ∇f(θ(t)), θ(0) = θ₀.\n\nD_ψ-Regularization. We connect the optimization path of MD to the regularization path of the following regularized problem:\n\nθ(ν) = argmin_θ f(θ) + (1/ν) D_ψ(θ, θ₀),\n\nwhere θ₀ is some point in Θ. The solution θ(ν) satisfies the following continuous-time dynamics:\n\nθ̇(ν) := (d/dν) θ(ν) = −[ν∇²f(θ(ν)) + ∇²ψ(θ(ν))]⁻¹ ∇f(θ(ν)).\n\nWe now show that the optimization path of mirror descent with divergence D_ψ, started at θ₀, is closely related to the regularization path of the corresponding D_ψ-regularized objective. Our analysis is similar to the analysis of GD.\n\nTheorem 3. Let θ̂ be the minimizer of f(θ). Suppose f is m-strongly convex and M-smooth, and ψ is α-strongly convex w.r.t. the Euclidean norm. Moreover, suppose ψ is β-smooth w.r.t. the Euclidean norm on the following union of balls around θ₀ and θ̂:\n\n{θ : D_ψ(θ, θ₀) ≤ D_ψ(θ̂, θ₀)} ∪ {θ : D_ψ(θ̂, θ) ≤ D_ψ(θ̂, θ₀)}.\n\nLet the regularization penalty ν and time t be related through the relation ν(t) = (β/(cm))(e^{cMt/α} − 1), where c = 2(1 + β/(ακ))⁻¹ and κ = m/M. If MD is started at θ₀, then\n\n‖θ(t) − θ(ν(t))‖₂ ≤ (β/α)(‖∇f(θ₀)‖₂/m) ( e^{−mt/β} − c/(c − 1 + e^{cMt/α}) ).\n\nNote that when β = α, we retrieve the bounds of GD in Theorem 1. An example of a divergence which satisfies the assumption of strong convexity and smoothness of ψ over ℝᵈ is the Mahalanobis distance, the divergence induced by the function ψ(x) = xᵀAx for some positive definite matrix A. However, we note that many popular divergences such as the KL-divergence do not satisfy the smoothness condition over the entire space ℝᵈ. For such divergences, β depends on the distance between the starting point θ₀ and the minimizer θ̂ and on their location in the space.\n\n3 Consequences for Excess Risk of GD Iterates\n\nIn this section we utilize the connection between optimization and regularization paths derived in Theorem 1 to provide excess risk bounds for the GD iterates. To this end, we first derive excess risk bounds for the solutions of the regularized problem and then combine them with the result of Theorem 1 to obtain excess risk bounds for the GD iterates.\n\n3.1 General Analysis\n\nIn this section we provide a general statistical analysis of the solution of the regularized problem in Equation (1), for general statistical learning problems. Suppose we are given n i.i.d. samples Dn = {xᵢ}ᵢ₌₁ⁿ, where xᵢ ∈ X, drawn from a distribution P. Let ℓ : ℝᵖ × X → ℝ be a loss function that assigns a cost ℓ(θ, x) to an observation x. Define the risk R(θ) as R(θ) = E_{X∼P}[ℓ(θ, X)] and let θ* be the minimizer of the risk R(θ). Given samples Dn, our goal is to use Dn to obtain an estimate θ̂ that has low excess risk R(θ̂) − min_θ R(θ). Let Rn(θ) denote the empirical risk, which is defined as Rn(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(θ, xᵢ). 
We consider the following regularized problem for estimating a θ with low excess risk:\n\nmin_θ Rn(θ) + (1/(2ν)) ‖θ‖₂². (2)\n\nThe following theorem bounds the parameter estimation error of the minimizer of the above problem.\n\nTheorem 4. Suppose the empirical risk Rn(θ) is m-strongly convex and M-smooth. Consider the regularized problem in Equation (2), and suppose the regularization penalty ν satisfies 1/ν ≥ 2‖∇Rn(θ*)‖₂/‖θ*‖₂. Then the optimal solution θ(ν) satisfies\n\n‖θ(ν) − θ*‖₂ ≤ (3/(mν)) ‖θ*‖₂.\n\nUsing the above result and the result from Theorem 1, we now bound the parameter error of the iterates of continuous-time GD.\n\nCorollary 5. Suppose the conditions of Theorem 1 are satisfied. Moreover, let t satisfy t ≤ (1/(cM)) log(1 + cm‖θ*‖₂/(2‖∇Rn(θ*)‖₂)), where c = 2m/(m+M). Then θ(t) satisfies the following error bound:\n\n‖θ(t) − θ*‖₂ ≤ (‖∇Rn(θ₀)‖₂/m) ( e^{−mt} − c/(c − 1 + e^{cMt}) ) + (3c e^{−cMt}/(1 − e^{−cMt})) ‖θ*‖₂.\n\nNote that the above results provide deterministic error bounds for a particular choice of ν, t. The random quantities m, M, ‖∇Rn(θ*)‖₂ need to be bounded to instantiate the above result for specific learning problems. Let m̄, M̄ be the strong convexity and smoothness parameters of the population risk R(θ). Using standard tools from empirical process theory, under certain regularity conditions on the distribution P and on R(θ), one can show that ‖∇Rn(θ*)‖₂ and ‖∇Rn(0)‖₂ scale as O(√(p/n)) and O(√(p/n) + ‖θ*‖₂) respectively, and that m, M are close to m̄, M̄ with high probability. 
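As a quick, purely illustrative sanity check of the √(p/n) scaling of ‖∇Rn(θ*)‖₂ invoked above (our synthetic simulation, not an experiment from the paper), consider least squares, where ∇Rn(θ*) = −(1/n) Xᵀw for Gaussian noise w with standard deviation σ:

```python
import numpy as np

# Empirical check that ||grad R_n(theta*)||_2 concentrates at the sigma*sqrt(p/n)
# scale for least squares: grad R_n(theta*) = -(1/n) X^T w, w ~ N(0, sigma^2 I).
rng = np.random.default_rng(0)
p, sigma = 50, 1.0
for n in [500, 2000, 8000]:
    norms = []
    for _ in range(20):
        X = rng.standard_normal((n, p))
        w = sigma * rng.standard_normal(n)
        grad = -(X.T @ w) / n          # gradient of the empirical risk at theta*
        norms.append(np.linalg.norm(grad))
    ratio = np.mean(norms) / np.sqrt(p / n)
    print(f"n={n:5d}  mean ||grad R_n(theta*)||_2 / sqrt(p/n) = {ratio:.3f}")
```

The ratio stays near σ across sample sizes, consistent with the O(√(p/n)) rate used in the corollary that follows.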
Substituting these in Corollary 5 gives us the following bound for ‖θ(t) − θ*‖₂ at t = (1/(c̄M̄)) log(1 + c̄m̄‖θ*‖₂ √(n/p)):\n\n‖θ(t) − θ*‖₂ = O( (e^{−m̄t} − c̄ e^{−M̄t}) ‖θ*‖₂ + √(p/n) ),\n\nwhere c̄ = 2m̄/(m̄ + M̄). When m̄ = M̄, the above bound shows that at t = (1/M̄) log(1 + M̄‖θ*‖₂ √(n/p)) we achieve the standard parametric error rate O_P(√(p/n)).\n\nWe note that the above bound can be improved with a tighter analysis of the regularized problem. In the next section, we consider the problem of linear regression and show how a better understanding of the regularized problem, coupled with our connection, helps us obtain a tighter parameter estimation error bound for the iterates of GD.\n\n3.2 Tighter Analysis for Linear Regression\n\nRecall that in linear regression we observe paired samples {(x₁, y₁), ..., (xₙ, yₙ)}, where each (xᵢ, yᵢ) ∈ ℝᵖ × ℝ. The distribution of y conditioned on the covariates x is specified by the following linear model: y = ⟨x, θ*⟩ + w, where w is drawn from a zero-mean distribution with bounded variance. In this work we assume that the noise has a normal distribution with variance σ²; that is, w ∼ N(0, σ²). The goal in linear regression is to learn a linear map x ↦ ⟨θ, x⟩ with low risk R(θ) = E[(y − ⟨θ, x⟩)²]. The empirical risk Rn(θ) is given by\n\nRn(θ) = (1/(2n)) Σᵢ₌₁ⁿ (yᵢ − ⟨θ, xᵢ⟩)² = (1/(2n)) ‖y − Xθ‖₂², (3)\n\nwhere X = [x₁, x₂, ..., xₙ]ᵀ ∈ ℝⁿˣᵖ is the matrix of covariates. Let w = y − Xθ* be the noise vector and Σ̂ be the empirical covariance matrix. The regularized problem (2) corresponding to the least squares risk defined above is the ridge regression problem. Ridge regression has been well studied and analyzed in the machine learning and statistics literature. We now present the following result from Hsu et al. [2012], which obtains tight upper bounds on the parameter estimation error of the ridge regression solution.\n\nTheorem 6. Suppose the covariate vector x has a normal distribution with mean 0 and covariance matrix Σ. Then there exist constants c₁, c₂ > 0 that depend on Σ, such that for n ≥ c₁ p log p, the ridge regression solution θ(ν) satisfies the following error bound with probability at least 1 − 1/p²:\n\n‖θ(ν) − θ*‖₂² ≤ (1/κ) [ ‖((Σ + (1/ν) I)⁻¹ Σ − I) θ*‖₂² (the bias term) + (σ²/n) Σᵢ₌₁ᵖ ( νλᵢ/(1 + νλᵢ) )² (the variance term) ],\n\nwhere λᵢ is the iᵗʰ largest eigenvalue of Σ and κ = λₚ/λ₁.\n\nWe now use the above bound to obtain error rates for the optimization path of GD. For simplicity, and to gain insight into the bound, we consider the special case of an identity covariance matrix.\n\nCorollary 7. Suppose the covariate vector x has a normal distribution with mean 0 and identity covariance matrix. Then there exists a constant c₁ > 0 such that for n ≥ c₁ p log p, the iterates of continuous-time GD satisfy the following bound with probability at least 1 − 1/p²:\n\n‖θ(t) − θ*‖₂² ≤ (10/9) ( ‖θ*‖₂²/(1 + ν(t))² + (ν(t)/(1 + ν(t)))² σ²p/n ),\n\nwhere ν(t) = (100/81)(e^{99t/100} − 1). Further, at t = (100/99) log(1 + (81/100)(‖θ*‖₂²/σ²)(n/p)), the iterate θ(t) satisfies\n\n‖θ(t) − θ*‖₂² ≤ (1 + ε) ( ‖θ*‖₂²/(‖θ*‖₂² + σ²p/n) ) (σ²p/n) + 1/n,\n\nwhere ε is a positive constant less than 0.1.\n\nNote that the above bound provides an early stopping rule for GD on linear regression, and the resulting rate, especially in the high-SNR regime where ‖θ*‖₂/σ is large, can be better than the σ²p/n rate obtained by running GD until convergence.\n\nComparison with stability. Hardt et al. [2015] and Chen et al. [2018] used stability as a technique to provide expected excess risk bounds for the iterates generated by an iterative optimization algorithm. We note that in the setting of strong convexity, existing stability-based approaches impose a condition much stronger than strong convexity of Rn(θ): specifically, they require the loss function ℓ(θ, x) to be strongly convex in θ at each x ∈ X. For example, this condition never holds for linear regression with dimension p > 1. Under the assumption that the loss function ℓ(θ, x) at each x ∈ X is m-strongly convex and M-smooth, stability gives the following expected risk bound for θ(t):\n\nE[R(θ(t)) − R(θ*)] ≲ (1/n)(1 − e^{−2mt}) + e^{−2mt}.\n\nNote that the above bound does not provide an early stopping rule and suggests that one has to run the algorithm until convergence for the best possible rates. Moreover, our approach can obtain high-probability statements, whereas the above rates hold only in expectation.\n\nComparison with VC, Rademacher complexity bounds. Traditional techniques for bounding the excess risk of iterates proceed by separately bounding the optimization error ‖θ(t) − θ̂‖₂ and the statistical error ‖θ̂ − θ*‖₂, and then using a simple triangle inequality to bound ‖θ(t) − θ*‖₂. 
Such a technique gives rates of the form O(e^{−mt} + σ√(p/n)). These rates suggest that one should always run GD until the end to obtain the best possible rates, and they cannot predict optimal early stopping rules. Note that the bound in Corollary 7, ‖θ(t) − θ*‖₂² ≤ [ ‖θ*‖₂²/(‖θ*‖₂² + σ²p/n) ] σ²p/n, is much better than the O(σ²p/n) rate obtained using standard VC and Rademacher complexity bounds, by the multiplicative factor ‖θ*‖₂²/(‖θ*‖₂² + σ²p/n). This is especially true in the low-SNR regime, where σ is large and as a result this factor is small. Moreover, this rate is the same as the ridge regression rate. This shows that GD with early stopping can obtain rates similar to ridge regression, and rates which are tighter than those obtained via VC bounds and stability-based risk bounds.\n\n4 Convex Loss\n\nHaving studied the setting where f(·) is strongly convex, we next turn our attention to losses which are merely convex. As we show below, in the convex case the connection between the two paths is more nuanced and, in particular, problem specific. Firstly, we derive a result which characterizes the end-point of the regularization path.\n\nTheorem 8. Let f : ℝᵖ → ℝ be a convex function. Suppose f has a minimizer. Let θ(ν) be the minimizer of\n\nf(θ) + (1/(2ν)) ‖θ − θ₀‖₂².\n\nThen as ν → ∞, θ(ν) converges to the minimizer of f which is closest to θ₀.\n\nWe note that this result can be viewed as the regularization-path analog of the result of Gunasekar et al. [2017], where it was shown that for matrix factorization, the optimization path of GD converges to the minimum Frobenius norm solution. Next, we present a simple counterexample which shows that in the convex regime, the regularization and optimization paths need not converge to the same point.\n\nLemma 1. Consider the following function in 2D space, f : ℝ × (−100, ∞) → ℝ, f(x, y) = (x+1)²/(y+100). Suppose continuous-time gradient descent is initialized at θ₀ = (2, 1). Then we have that\n\nlim_{ν→∞} θ(ν) ≠ lim_{t→∞} θ(t).\n\n[Figure: contours of f(x, y) = (x+1)²/(y+100) with the GD path and the regularization path overlaid.]\n\nThe above result shows that for general convex losses, the two paths need not lie close to each other, even as t, ν → ∞.\n\n4.1 Classification Loss\n\nIn this section, we focus on classification losses and show that the optimization path of GD and the corresponding regularization path of the L2²-penalized risk are close to each other. Commonly used losses in classification, such as the exponential and logistic losses, are not strongly convex and, moreover, when the data Dn is separable, the risk does not admit a finite minimizer. In such cases, a more careful analysis is needed to bound the distance between the optimization and regularization paths.\n\nRecent works by Ji and Telgarsky [2018] and Soudry et al. [2017] study the behavior of gradient descent on un-regularized logistic regression and show that when the data is separable, GD converges to a max-margin solution. In this section we first show that similar properties hold for the regularization path of L2²-regularized objectives. Recall that in classification we observe samples Dn = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ, where each (xᵢ, yᵢ) ∈ ℝᵖ × {±1}. Let ℓ(θ, (x, y)) = φ(y xᵀθ) be the loss at (x, y). Consider the regularized problem in Equation (2). We first present the following useful result from Rosset et al. [2004b], which shows that when the data is linearly separable, as ν → ∞, the minimizer θ(ν) of (2) converges to a max-margin solution.\n\nLemma 2. Assume the data Dn is linearly separable; that is, ∃θ̃ such that minᵢ yᵢ⟨xᵢ, θ̃⟩ > 0. Let φ(z) be a monotone non-increasing loss function. If ∃T > 0 (possibly T = ∞) such that\n\nlim_{t→T} φ(t(1 − ε))/φ(t) = ∞, ∀ε > 0,\n\nthen φ is a margin-maximizing loss function, in the sense that any convergence point of the normalized solutions θ(ν)/‖θ(ν)‖₂ of the regularized problem (2) as ν → ∞ is an L2-margin-maximizing separating hyperplane. Consequently, if this margin-maximizing hyperplane is unique, then the solutions converge to it:\n\nlim_{ν→∞} θ(ν)/‖θ(ν)‖₂ = argmax_{‖θ‖₂=1} minᵢ yᵢ θᵀxᵢ.\n\nThe condition on φ in the above lemma is satisfied by many popular loss functions, such as the logistic, exponential, and squared hinge losses. Note that Lemma 2 is asymptotic in nature. Our first contribution is to derive a non-asymptotic version of this theorem. We focus on the exponential loss φ(z) = e^{−z}, but our results can be generalized to other losses as well. Perhaps interestingly, our non-asymptotic bounds depend on the Lambert W (product-log) function [Corless et al., 1996], which has a long history of applications to instrument design [Ohayon and Ron, 2013] and statistical physics [Valluri et al., 2000].\n\nTheorem 9. Assume the data Dn is linearly separable; that is, ∃θ̃ such that minᵢ yᵢ⟨xᵢ, θ̃⟩ = 1. Let ℓ(θ, (y, x)) = exp(−θᵀ(yx)) and let θ(ν) be the solution to the regularized problem in Equation (2). Then θ(ν) satisfies:\n\ni) Rn(θ(ν)) ≤ C₁ W(ν)/ν = O(log(ν)/ν), where W(·) is the Lambert W function;\nii) ‖θ(ν)‖₂ = Θ(log(ν));\niii) minᵢ yᵢ xᵢᵀθ(ν)/‖θ(ν)‖₂ ≥ 1 − (log log(ν))/log(ν) − (log n)/log(ν).\n\nAs ν → ∞, the above theorem shows that θ(ν) converges to a max-margin solution, thus recovering the asymptotic result of Rosset et al. [2004b]. 
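The qualitative behavior in this regime, a logarithmically growing norm and a slowly improving normalized margin, is easy to observe numerically. The sketch below is our illustration, not the paper's experiment; the dataset, step size, and checkpoints are arbitrary choices. It runs discrete-time GD on the exponential loss over separable data and tracks the iterate norm and the normalized L2 margin:

```python
import numpy as np

# GD on the exponential loss R_n(theta) = (1/n) sum_i exp(-y_i <x_i, theta>)
# with linearly separable synthetic data: the iterate norm keeps growing while
# the normalized margin min_i y_i <x_i, theta> / ||theta||_2 slowly improves.
rng = np.random.default_rng(1)
n, p = 20, 5
theta_sep = np.ones(p) / np.sqrt(p)      # hypothetical separator used to label data
X = rng.standard_normal((n, p))
y = np.sign(X @ theta_sep)               # separable by construction

def margin(theta):
    return np.min(y * (X @ theta)) / np.linalg.norm(theta)

theta = np.zeros(p)
eta = 0.1
margins, norms = [], []
for step in range(1, 20001):
    losses = np.exp(-y * (X @ theta))
    grad = -(X * (y * losses)[:, None]).mean(axis=0)   # gradient of R_n
    theta -= eta * grad
    if step in (100, 20000):
        margins.append(margin(theta))
        norms.append(np.linalg.norm(theta))

print("margin at steps (100, 20000):", margins)
print("norm   at steps (100, 20000):", norms)
```

The norm keeps increasing (consistent with Θ(log ν) growth along the regularization path), and the margin at the later checkpoint exceeds the earlier one, reflecting the slow O(1/log) convergence toward the max-margin direction.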
Moreover, our result shows that the minimizer of the regularized problem (2) converges to the max-margin solution at a slow rate. In particular, the margin increases as O(1/log ν).\n\nComparison with GD on Rn(θ). Soudry et al. [2017] analyze gradient descent on the exponential loss with separable data and obtain similar bounds for the iterates of GD. Letting θ(t) be the iterate of GD at time t, they show that Rn(θ(t)) goes down as O(1/t), the margin converges as O(1/log t), and ‖θ(t)‖₂ increases as log t. When combined with our result, this shows that the optimization and regularization paths are very close to each other.\n\nTheorem 10. Assume the data Dn is linearly separable; that is, ∃θ̃ such that minᵢ yᵢ⟨xᵢ, θ̃⟩ = 1. Let ℓ(θ, (y, x)) = exp(−θᵀ(yx)). Suppose the regularization parameter ν and time t are related as ν(t) = t, and suppose GD is initialized at 0. Then for any t ≥ 0, we have\n\n| min_{i∈[n]} yᵢ⟨xᵢ, θ(t)⟩/‖θ(t)‖₂ − min_{i∈[n]} yᵢ⟨xᵢ, θ(ν(t))⟩/‖θ(ν(t))‖₂ | ≤ O(1/log t).\n\n5 Experiments\n\nIn this section, we conduct simulations to corroborate our theoretical findings.\n\n5.1 Strongly Convex\n\n[Figure 1: Connecting GD and L2²-penalization for linear regression. (a) Excess risk R(θt) − R* vs. iterations t for GD; (b) optimal stopping time t* vs. log(n); (c) log(R(θt*) − R*) vs. log(n) for GD and ridge.]\n\nWe use linear regression to empirically verify our results on 
connecting ridge regression and gradient descent. We also corroborate our findings on excess risk and the optimality of the early-stopping rule for gradient descent.\n\nSetup. We simulate a linear model by drawing the covariates from an isotropic Gaussian, X ∼ N(0, I_{p×p}), and the response y|x ∼ N(θ*ᵀx, σ²), where θ* = [1/√p, 1/√p, ..., 1/√p]ᵀ and σ² = 2. We generate a sequence of iterates by GD with step size 0.01, and a corresponding sequence of solutions of the penalized estimation problem. We also study how the optimal iteration number t*, which minimizes the excess risk, changes as we increase the number of samples for GD. In this case, we fix p = 100 and vary the number of samples n from 100 to 1500. Similarly, we find the optimal penalization ν* for each n. All results are reported after averaging over 50 trials.\n\nResults. We report our results in Figure 1.\n• As shown by our theory, excess risk bounds for GD on OLS are composed of two terms, one which increases with t and the other which decreases with t. Hence, one expects the excess risk to first decrease, then increase, before finally settling, which is corroborated by Figure 1(a).\n• Figure 1(b) shows a logarithmic relationship between t* and n, thereby verifying our theoretical claims on t*.\n• Figure 1(c) shows that the optimal risk for GD coincides with that of L2²-penalized estimation across different values of n.\n\n5.2 Classification\n\n[Figure 2: Connecting GD and L2²-penalization for logistic regression (n = 32, p = 128). (a) ‖θ(ν)‖₂ vs. log(ν); (b) margin of θ(ν) vs. ν; (c) |Margin(θ(ν)) − Margin(θt)| vs. t; curves shown for GD and L2-square regularization.]\n\nIn this section, we corroborate our results connecting GD and L2²-regularization in the context of logistic regression on separable data. In particular, we corroborate our findings on parameter error, the behavior of the margin, and the difference in margin between the optimization and regularization paths.\n\nSetup. We construct a classification dataset by drawing covariates X from an isotropic Gaussian, i.e., X ∼ N(0, I_p). We fix the true parameter θ* = [1/√p, 1/√p, ..., 1/√p]ᵀ. We fix the dimension to p = 128 and the number of samples to n = 32. Note that our choice of p, n ensures that the generated data is separable. We run GD with step size η = 0.123 and construct corresponding points on the regularization path (ν(t) = t/η) of the L2²-penalized objective.\n\nResults. 
We report our results in Figure 2.
• Figure 2(a) shows the norm of the points on the optimization and regularization paths. As predicted by our theory, the norm increases at a logarithmic rate.
• Figure 2(b) plots the $L_2$-margin, $\min_i y_i \theta^T x_i / \|\theta\|_2$, for both the optimization and regularization paths. The figure confirms our result that the margin increases with $\nu$.
• Although the margins of both the optimization and regularization paths increase, Figure 2(c) shows that, after a few initial iterations, the difference in margin between $\theta_t$ and $\theta(\nu)$ decreases.

6 Summary and Future Work

In this work, we studied the connection between the trajectories of the iterates of optimization techniques such as GD and mirror descent and the regularization paths of the corresponding regularized objectives. For strongly convex functions, our results show that the two paths are pointwise close. However, for general convex functions, our results show that the two paths need not be close to each other. For the widely studied problem of classification with separable data, we showed that the optimization and regularization paths are close to each other.

We believe that studying the connection between optimization and regularization paths has several advantages, the key one being that it can be used to study the statistical properties of the iterates generated by optimization techniques. We also believe that our results for strongly convex losses can be further improved to obtain tighter connections and better generalization bounds for the iterates.

An interesting direction for future work would be to see whether similar connections hold for non-convex problems, and specifically for the optimization objectives that arise in deep learning. For convex losses, our current work focused on analyzing classification losses with separable data.
It would be interesting to study the connection for general convex losses and identify the conditions on the loss function under which the two paths stay close to each other.
While our analysis in this paper focused on GD, it would be interesting to study whether similar connections hold for other non-stochastic methods, such as steepest descent, accelerated GD, and Newton's method, and for stochastic methods such as SGD.

7 Acknowledgement

We acknowledge the support of NSF via IIS-1149803, IIS-1664720, DMS-1264033. The authors are grateful to Suriya Gunasekar and the anonymous reviewers for helpful comments on the paper.

References

Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. ArXiv e-prints, April 2018.

Robert M Corless, Gaston H Gonnet, David EG Hare, David J Jeffrey, and Donald E Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329–359, 1996.

J Friedman and Bogdan E Popescu. Gradient directed regularization for linear regression and classification. Technical report, Citeseer, 2003.

S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. ArXiv e-prints, February 2018.

Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization.
In Advances in Neural Information Processing Systems, pages 6152–6160, 2017.

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Abdolhossein Hoorfar and Mehdi Hassani. Inequalities on the Lambert W function and hyperpower function. Journal of Inequalities in Pure and Applied Mathematics, 9(2):5–9, 2008.

Daniel Hsu, Sham M Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on Learning Theory, pages 9.1–9.24, 2012.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.

Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

G. Neu and L. Rosasco. Iterate averaging as regularization for stochastic gradient descent. ArXiv e-prints, February 2018.

Ben Ohayon and Guy Ron. New approaches in designing a Zeeman slower. Journal of Instrumentation, 8(02):P02016, 2013.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335–366, 2014.

Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5(Aug):941–973, 2004a.

Saharon Rosset, Ji Zhu, and Trevor J Hastie.
Margin maximizing loss functions. In Advances in Neural Information Processing Systems, pages 1237–1244, 2004b.

Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707–1739, 2009.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

Sree Ram Valluri, David J Jeffrey, and Robert M Corless. Some applications of the Lambert W function to physics. Canadian Journal of Physics, 78(9):823–831, 2000.

Yuting Wei, Fanny Yang, and Martin J Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6067–6077, 2017.

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.