{"title": "Rest-Katyusha: Exploiting the Solution's Structure via Scheduled Restart Schemes", "book": "Advances in Neural Information Processing Systems", "page_first": 429, "page_last": 440, "abstract": "We propose a structure-adaptive variant of the state-of-the-art stochastic variance-reduced gradient algorithm Katyusha for regularized empirical risk minimization. The proposed method is able to exploit the intrinsic low-dimensional structure of the solution, such as sparsity or low rank which is enforced by a non-smooth regularization, to achieve even faster convergence rate. This provable algorithmic improvement is done by restarting the Katyusha algorithm according to restricted strong-convexity constants. We demonstrate the effectiveness of our approach via numerical experiments.", "full_text": "Rest-Katyusha: Exploiting the Solution\u2019s Structure\n\nvia Scheduled Restart Schemes\n\nJunqi Tang\n\nSchool of Engineering\n\nUniversity of Edinburgh, UK\n\nJ.Tang@ed.ac.uk\n\nMohammad Golbabaee\n\nDepartment of Computer Science\n\nUniversity of Bath, UK\n\nM.Golbabaee@bath.ac.uk\n\nFrancis Bach\nINRIA - ENS\n\nPSL Research University, France\n\nFrancis.Bach@inria.fr\n\nMike Davies\n\nSchool of Engineering\n\nUniversity of Edinburgh, UK\nMike.Davies@ed.ac.uk\n\nAbstract\n\nWe propose a structure-adaptive variant of a state-of-the-art stochastic variance-\nreduced gradient algorithm Katyusha for regularized empirical risk minimization.\nThe proposed method is able to exploit the intrinsic low-dimensional structure\nof the solution, such as sparsity or low rank which is enforced by a non-smooth\nregularization, to achieve even faster convergence rate. This provable algorithmic\nimprovement is done by restarting the Katyusha algorithm according to restricted\nstrong-convexity (RSC) constants. We also propose an adaptive-restart variant\nwhich is able to estimate the RSC on the \ufb02y and adjust the restart period automati-\ncally. 
We demonstrate the effectiveness of our approach via numerical experiments.\n\n1 Introduction\n\nMany applications in supervised machine learning and signal processing share the same goal, which is to estimate the minimizer of a population risk via minimizing the empirical risk (1/n) ∑_{i=1}^n fi(ai, x), where ai, x ∈ Rd and each fi is a convex and smooth function [1]. In supervised machine learning, ai is often referred to as a training data sample, while in signal/image processing applications it is the representation of measurements. In practice the number of data samples or measurements is limited, and from them we attempt to infer x† ∈ Rd, the unique minimizer of the population risk:\n\nx† = arg min_x E_a ¯f(a, x). (1)\n\nThe ultimate goal is to obtain a vector x⋆ which is a good approximation of x† from the empirical risk. Since in many interesting applications the dimension d of the parameter space is of the same order as, or even larger than, the number of data samples n, minimizing the empirical risk alone will introduce overfitting and hence lead to a poor estimate of the true parameter x† [2, 3]. In general, avoiding overfitting is a key issue in both machine learning and signal processing, and the most common approach is to add some regularization while minimizing the empirical risk [4, 5, 6]:\n\nx⋆ ∈ arg min_{x ∈ Rd} { F(x) := f(x) + λg(x) }, f(x) := (1/n) ∑_{i=1}^n fi(x), (2)\n\nwhere for the sake of compactness of notation we denote fi(x) := fi(ai, x). Each fi is assumed to be convex with an L-Lipschitz continuous gradient, while the regularization term g(x) is assumed to be a simple convex function, possibly non-smooth.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n1.1 Accelerated Stochastic Variance-Reduced Optimization\n\nTo handle empirical risk minimization in the “big data” and “big dimension” regimes, stochastic gradient-based iterative algorithms are most often considered. The most basic one is often referred to as stochastic gradient descent (SGD) [7, 8]: in every iteration only one or a few functions fi are randomly selected, and only their gradients are calculated as an estimate of the full gradient. However, the convergence rate of SGD is sub-linear even when the loss function F is strongly convex. To further accelerate stochastic gradient descent, researchers have recently developed techniques which progressively reduce the variance of the stochastic gradient estimator, starting from SAG [9, 10], SDCA [11], then SVRG [12, 13] and SAGA [14]. Such methods enjoy a linear convergence rate when the cost function F is µ-strongly convex and each fi has an L-Lipschitz continuous gradient; that is, to achieve an output x̂ which satisfies F(x̂) − F(x⋆) ≤ δ, the total number of stochastic gradient evaluations needed is O(n + L/µ) log(1/δ). Nesterov’s acceleration [15, 16, 17] has also been successfully applied to construct variance-reduced methods which have an accelerated linear convergence rate [18, 19, 20, 21, 22, 23, 24, 25]:\n\nO((n + √(nL/µ)) log(1/δ)). (3)\n\nIt is worth noting that this convergence rate has been shown to be worst-case optimal [21]. 
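The variance-reduction idea shared by these estimators can be sketched numerically. The following is a minimal illustration (our own, not code from the paper), assuming least-squares components fi(x) = ½(aiᵀx − bi)² and synthetic data; the estimator is unbiased, since averaging it over all i recovers the full gradient.

```python
import numpy as np

# A minimal sketch of the variance-reduced gradient estimator shared by
# SVRG-type methods: at a snapshot point x_tilde the full gradient is computed
# once, and each stochastic step combines it with two per-sample gradients.
# Least-squares components f_i(x) = 0.5*(a_i@x - b_i)^2 are an assumed,
# illustrative choice (not from the paper).
rng = np.random.default_rng(0)
n, d = 50, 10
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(x, i):          # gradient of f_i
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):          # gradient of f = (1/n) sum_i f_i
    return A.T @ (A @ x - b) / n

def vr_grad(x, x_tilde, g_tilde, i):
    # unbiased estimate of full_grad(x): E_i[...] = full_grad(x)
    return g_tilde + grad_i(x, i) - grad_i(x_tilde, i)

x_tilde = rng.standard_normal(d)
g_tilde = full_grad(x_tilde)           # one full pass per epoch
x = x_tilde + 0.1 * rng.standard_normal(d)
est = np.mean([vr_grad(x, x_tilde, g_tilde, i) for i in range(n)], axis=0)
print(np.allclose(est, full_grad(x)))  # averaging over i recovers the full gradient
```

As the iterates and the snapshot both approach x⋆, the two per-sample terms nearly cancel, which is why the estimator's variance vanishes and a constant step size becomes possible.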
However, all of these algorithms need explicit knowledge of the strong-convexity parameter µ. Very recently, [26] has shown theoretically that it is impossible for an accelerated incremental gradient method to achieve this ideal linear rate without knowledge of µ. Since in general the strong-convexity parameter is hard to estimate accurately, researchers have proposed adaptive restart schemes [27, 28, 29, 30, 25, 31] for accelerated first-order methods, either by means of enforcing monotonicity of the functional decay, or by estimating the strong convexity on the fly.\n\n1.2 Solution’s Structure, Restricted Strong Convexity, and Faster Convergence\n\nIn many interesting large-scale optimization problems in machine learning, the solution x⋆ in (2) has some low-dimensional structure such as sparsity [4], group-sparsity [32], low rank [33] or piece-wise smoothness [6], enforced by the non-smooth regularization. It is intuitive that an optimal algorithm for this type of problem should take into account and exploit the solution’s structure. We believe that, when utilized properly, this prior information about the solution will facilitate the convergence of an iterative algorithm.\nOne important theoretical cornerstone is the modified restricted strong convexity framework presented by Agarwal et al. [34]. In the context of statistical estimation with high-dimensional data, where the usual strong-convexity assumption is vacuous, these authors have shown that the proximal gradient descent method is able to achieve global linear convergence up to a point x which satisfies ∥x − x⋆∥₂ = o(∥x⋆ − x†∥₂), the accuracy level of statistical precision. 
Moreover, the results based on this restricted strong-convexity framework indicate that the convergence rate of the proximal gradient method becomes faster when the model complexity of the solution is lower.\nInspired by Agarwal et al. [34], Qu and Xu [35] extended this framework to analyse variance-reduced stochastic gradient methods such as proximal SVRG [13]. Most recently, based on the same framework, researchers proposed the two-stage APCG algorithm [36, 37], an accelerated coordinate descent method able to exploit the solution’s structure for faster convergence. Moreover, in the context of constrained optimization, researchers have also proposed efficient sketching-based algorithms [38, 39, 40, 41, 42] under a similar notion of conic restricted strong convexity.\n\n1.3 This Work\n\nIn this paper we extend the theoretical framework for randomized first-order methods established in [36] to design and analyse a structure-adaptive variant of Katyusha [23]. Our proposed method Rest-Katyusha is a restarted version of the original Katyusha method for non-strongly convex functions, where the restart period is determined by the modified restricted strong convexity (RSC). The convergence analysis of the Rest-Katyusha algorithm is provided, wherein we prove linear convergence up to a statistical accuracy with an accelerated convergence rate characterized by the RSC property.\nLike all other accelerated gradient methods, which require explicit knowledge of the strong-convexity parameter to achieve accelerated linear convergence, the vanilla Rest-Katyusha method also needs to know the RSC parameter explicitly. 
We therefore propose a practical heuristic (adaptive Rest-Katyusha) which estimates the RSC parameter on the fly and adaptively tunes the restart period, and we show that this adaptive scheme mimics the convergence behavior of the vanilla Rest-Katyusha. Finally we validate the effectiveness of our approach via numerical experiments.\n\n2 Restarted Katyusha Algorithm\n\nThe Katyusha algorithm [23] listed in Algorithm 1 is an accelerated stochastic variance-reduced gradient method extended from the linear-coupling framework for constructing accelerated methods [43]. Its main loop (denoted as A in Algorithm 1) at iteration s reads as follows:\n\nFor k = 0, 1, 2, ..., m:\n    xk+1 = θzk + (1/2)x̂s + (1/2 − θ)yk;  → Linear coupling\n    ∇k+1 = ∇f(x̂s) + ∇fi(xk+1) − ∇fi(x̂s);  → Variance-reduced stochastic gradient\n    zk+1 = arg min_z (3θL/2)∥z − zk∥₂² + ⟨∇k+1, z⟩ + λg(z);  → Proximal mirror descent\n    yk+1 = arg min_y (3L/2)∥y − xk+1∥₂² + ⟨∇k+1, y⟩ + λg(y);  → Proximal gradient descent\n\nThe output sequence of A is defined as x̂s+1 = (1/m) ∑_{j=1}^m yj, ys+1 = ym, zs+1 = zm. It is one of the state-of-the-art methods for empirical risk minimization and matches the complexity lower bound for minimizing smooth convex finite-sum functions, proven by Lan and Zhou [21]. Most notably, it is a primal method which directly¹ accelerates stochastic variance-reduction methods. To achieve acceleration in the sense of Nesterov, Katyusha introduces the three-point coupling strategy, which combines Nesterov’s momentum with a stabilizing negative momentum that cancels the effect of noisy updates due to stochastic gradients. 
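The main loop above can be sketched numerically. The following is our own minimal illustration (not the authors' code) of one epoch of A for ℓ1-regularized least squares, where both proximal steps reduce to soft-thresholding; the data, the least-squares choice of fi, and the smoothness constant L = max_i ∥ai∥² are illustrative assumptions.

```python
import numpy as np

# A minimal sketch (our own, not the authors' code) of one epoch of the main
# loop A for l1-regularized least squares: both proximal steps reduce to
# soft-thresholding. We take L = max_i ||a_i||^2, a valid smoothness constant
# for every component f_i(x) = 0.5*(a_i @ x - b_i)^2.
rng = np.random.default_rng(1)
n, d, lam = 40, 8, 0.01
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
L = np.sum(A ** 2, axis=1).max()

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # prox of t*||.||_1

def epoch_A(x_hat, y, z, theta, m):
    g_hat = full_grad(x_hat)                      # one full gradient per epoch
    y_sum = np.zeros_like(y)
    for _ in range(m):
        x = theta * z + 0.5 * x_hat + (0.5 - theta) * y           # linear coupling
        i = rng.integers(n)
        g = g_hat + grad_i(x, i) - grad_i(x_hat, i)               # VR gradient
        z = soft(z - g / (3 * theta * L), lam / (3 * theta * L))  # mirror step
        y = soft(x - g / (3 * L), lam / (3 * L))                  # prox-gradient step
        y_sum += y
    return y_sum / m, y, z                        # new snapshot = average of the y's

F = lambda x: 0.5 * np.mean((A @ x - b) ** 2) + lam * np.abs(x).sum()
x0 = np.zeros(d)
x_hat, y, z = x0, x0.copy(), x0.copy()
for s in range(30):
    x_hat, y, z = epoch_A(x_hat, y, z, theta=2.0 / (s + 4), m=2 * n)
print(F(x_hat) < F(x0))   # the objective decreases from the zero start
```

Note how the z-step uses the large step size 1/(3θL) while the y-step uses the conservative 1/(3L); the coupling point x mixes both with the snapshot x̂s, which is the "negative momentum" anchoring the iterates.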
However, its accelerated linear convergence is only established when the regularization term g(x) is strongly convex, and it fails to benefit from strong convexity in the data-fidelity term [31], or from the intrinsic restricted strong convexity [36].\n\nAlgorithm 1 Katyusha (x0, m, S, L)\nInitialize: y0 = z0 = x̂0;\nfor s = 0, . . . , S − 1 do\n    θ ← 2/(s + 4), calculate ∇f(x̂s),\n    (x̂s+1, ys+1, zs+1) = A(x̂s, ys, zs, θ, ∇f(x̂s), m)\nend for\nOutput: x̂S\n\nAlgorithm 2 Rest-Katyusha (x0, µc, S0, β, T, L)\nInitialize: m = 2n, S = ⌈β√(32 + 24L/(mµc))⌉;\nFirst stage (warm start):\n    x1 = Katyusha(x0, m, S0, L)\nSecond stage (exploit the restricted strong convexity via periodic restart):\nfor t = 1, ..., T do\n    xt+1 = Katyusha(xt, m, S, L)\nend for\n\nRestart to rescue: it is well known that if the cost function F(x) is µ-strongly convex, one can periodically restart the accelerated full-gradient method [16], and improve it from the sublinear convergence rate F(xk) − F⋆ ≤ 4L∥x0 − x⋆∥₂²/k² to a linearly convergent algorithm. For instance, if we set k = ⌈4√(L/µ)⌉, then one can show that the suboptimality is reduced by a factor of 1/4:\n\nF(xk) − F⋆ ≤ 4L∥x0 − x⋆∥₂²/k² ≤ 4L[F(x0) − F⋆]/(µk²) ≤ (1/4)[F(x0) − F⋆]. (4)\n\nThen we can recursively apply this statement (algorithmically speaking, we restart the algorithm every k = ⌈4√(L/µ)⌉ iterations), and only k ≥ ⌈4√(L/µ)⌉ log₄(1/δ) iterations are needed to make F(xk) − F⋆ ≤ δ; hence an accelerated linear rate is achieved. 
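The factor-4 decay per restart cycle can be checked empirically. The sketch below is our own toy setup (a strongly convex quadratic and an assumed plain Nesterov accelerated-gradient routine, neither from the paper): the momentum is reset every k = ⌈4√(L/µ)⌉ iterations, and the suboptimality shrinks by at least 1/4 per cycle.

```python
import numpy as np

# A small sketch of the scheduled-restart idea behind (4): run Nesterov's
# accelerated gradient on a strongly convex quadratic, reset the momentum
# every k = ceil(4*sqrt(L/mu)) iterations, and check that the suboptimality
# drops by at least a factor of 4 per restart cycle. The quadratic test
# problem and the momentum schedule t/(t+3) are illustrative assumptions.
rng = np.random.default_rng(2)
d = 20
evals = np.linspace(0.05, 1.0, d)            # mu = 0.05, L = 1.0
V = np.linalg.qr(rng.standard_normal((d, d)))[0]
Q = V @ np.diag(evals) @ V.T
mu, L = evals[0], evals[-1]
f = lambda x: 0.5 * x @ Q @ x                # minimum value 0 at x* = 0

def agd(x0, iters):
    # accelerated gradient for a (non-strongly-convex) smooth objective
    x, y = x0.copy(), x0.copy()
    for t in range(iters):
        x_new = y - (Q @ y) / L
        y = x_new + t / (t + 3) * (x_new - x)
        x = x_new
    return x

k = int(np.ceil(4 * np.sqrt(L / mu)))
x = rng.standard_normal(d)
gaps = [f(x)]
for _ in range(5):                           # 5 restart cycles
    x = agd(x, k)                            # restart: momentum reset to zero
    gaps.append(f(x))
print(all(gaps[t + 1] <= gaps[t] / 4 for t in range(5)))
```

In practice each cycle contracts far more than the worst-case factor 1/4; the point of the schedule is that the guarantee holds without line search or monitoring, given only µ.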
The restart scheme has recently been applied to improve the convergence of the accelerated coordinate descent method [45, 30] and the accelerated variance-reduced dual-averaging method [25] for strongly-convex functions.\n\n¹On the other hand, one can indirectly accelerate SVRG/SAGA via Catalyst [44].\n\nInspired by Nesterov [16] we first propose the Katyusha method with periodic restarts, and demonstrate that when the restart period is appropriately chosen, the proposed method is able to exploit the restricted strong-convexity property to achieve accelerated linear convergence, even when the cost function itself is not strongly convex. We propose to warm start the algorithm prior to the periodic-restart stage by running the Katyusha algorithm for a number of epochs which, in theory, should be proportional to the suboptimality of the starting point x0. We present our Rest-Katyusha method as Algorithm 2.\n\n3 Convergence Analysis of Rest-Katyusha\n\n3.1 Generic Assumptions\n\nWe start by listing the assumptions we may engage with in our analysis:\nA. 1. (Decomposable regularizer) [34] Given an orthogonal subspace pair (M, M⊥) in Rd, g(.) is decomposable, which means:\n\ng(a + b) = g(a) + g(b), ∀a ∈ M, b ∈ M⊥. (5)\n\nIn this paper we focus on cases where the regularizer is decomposable, which includes many popular regularizers that enforce low-dimensional structure, such as the ℓ1 norm, the ℓ1,2 norm and the nuclear norm penalty. The subspace M is named the model subspace, while its orthogonal complement M⊥ is called the perturbation subspace. A similar notion of decomposition would extend the scope of this work to more general gauge functions g(.), such as the so-called analysis priors, e.g., total variation regularization (for more details see Vaiter et al. [46]).\nA. 2. (Restricted strong convexity) [34] The function f(.) 
satisfies restricted strong convexity with respect to g(.) with parameters (γ, τ) if the following inequality holds:\n\nf(x) − f(x⋆) − ⟨∇f(x⋆), x − x⋆⟩ ≥ (γ/2)∥x − x⋆∥₂² − τ g²(x − x⋆), ∀x ∈ Rd. (6)\n\nIn [34], γ is referred to as the lower curvature parameter, while τ is named the tolerance parameter. It is clear that if τ = 0, A.2 reduces to the usual strong-convexity assumption. In the high-dimensional setting strong convexity does not hold, but it has been shown in the literature that this milder RSC assumption does hold in many situations. For the purpose of this work, this notion of RSC is distinct from other forms of weak strong-convexity assumptions based on the Polyak-Lojasiewicz inequality [47], because it encodes the direction-restricting effect of the regularization, and hence has been shown to have a direct connection with the low-dimensional structure of x⋆. Next we define a crucial property for our structure-driven analysis, called the subspace compatibility:\nDefinition 3.1. [34] With predefined g(x), we define the subspace compatibility of a model subspace M as:\n\nΦ(M) := sup_{v ∈ M\{0}} g(v)/∥v∥₂, (7)\n\nwhen M ≠ {0}.\nThe subspace compatibility Φ(M) captures the model complexity of the subspace M. For example, if g(.) = ∥.∥₁ and M is a subspace on an s-sparse support in Rd, we have Φ(M) = √s.\nA. 3. Each fi(.) has an L-Lipschitz continuous gradient:\n\n∥∇fi(x) − ∇fi(x′)∥₂ ≤ L∥x − x′∥₂, ∀x, x′ ∈ Rd. (8)\n\nThis form of smoothness assumption is classic for variance-reduced stochastic gradient methods.\nA. 4. 
The regularization parameter λ and x† satisfy:\n\nλ ≥ (1 + 1/c) g*(∇f(x†)), (9)\n\nwith constant c ≥ 1.\n\nAssumption A.4 with the choice c = 1 is the fundamental assumption of the analytical framework developed by Negahban et al. [3]. We relax the requirement to c ≥ 1 for more general results. It is seemingly a sophisticated and demanding assumption, but it is indeed reasonable and suits well the purpose of this work, which is to develop fast algorithms that speedily solve structured problems (structure being the result of sufficient regularization). Moreover, recall that the goal of finding the solution x⋆ via optimizing the regularized empirical risk is to obtain a meaningful approximation of the true parameter x†, the unique minimizer of the population risk. Especially in the high-dimensional setting where d > n, the choice of regularization is rather important, since there is no good control over the statistical error ∥x† − x⋆∥ for an arbitrarily chosen λ. Because of this issue, in this work we focus only on “meaningful” regularized ERM problems which are able to provide a trustworthy approximation. Similar to A.4, Negahban et al. [3] have shown that λ ≥ 2g*(∇f(x†)) provides a sufficient condition to bound the statistical error ∥x⋆ − x†∥₂²:\nProposition 3.2. 
[3, Theorem 1, informal] Under A.1, A.2, A.4 with c = 1, if furthermore the curvature parameter γ, the tolerance parameter τ and the subspace compatibility Φ(M) satisfy τΦ²(M) < γ/64, then for any optimum x⋆ the following inequality holds:\n\n∥x⋆ − x†∥₂² ≤ O( (λ²/γ²)Φ²(M) + (λ/γ) g(x†_{M⊥}) ), (10)\n\nwhere O(.) hides deterministic constants for simplicity of notation.\n\nSuch a bound reveals desirable properties of the regularized ERM when the range of λ satisfies assumption A.4. For instance, suppose x† is the s-sparse ground-truth vector of a noisy linear measurement system y = Ax† + w, where w denotes zero-mean sub-Gaussian noise (with variance σ²) and the measurement matrix A satisfies a certain restricted eigenvalue condition [3, 48], and we use a Lasso estimator x⋆ ∈ arg min_x (1/2n)∥Ax − y∥₂² + λ∥x∥₁. In such a case, let M be a subspace in Rd on an s-sparse support such that x† ∈ M and hence g(x†_{M⊥}) = 0; the proposition then implies that:\n\n∥x⋆ − x†∥₂² ≤ O(λ²s/γ²) ≈ O( (σ²/γ²) (s log d)/n ), (11)\n\nwhich is the optimal convergence of the statistical error in terms of sample size and dimension for M-estimators. The details of this claim are presented in [3, Corollary 2].\n\n3.2 Main Results\n\nBased on the assumption of restricted strong convexity of f(.) w.r.t. g(.), and with the definition of subspace compatibility, one can derive a more expressive form of RSC, named the effective RSC [34], which has a direct link to the structure of the solution.\nLemma 3.3. 
(Effective RSC) [36, Lemma 3.3] Under A.1, A.2 , A.4, while x satis\ufb01es F (x)\u2212F (x(cid:63)) \u2264\n\u2020\n\u03b7 for a given value \u03b7 > 0 and any minimizer x(cid:63), with \u03b5 := 2\u03a6(M)(cid:107)x\u2020 \u2212 x(cid:63)(cid:107)2 + 4g(x\nM\u22a5) we have:\n(12)\n\nF (x) \u2212 F (cid:63) \u2265 \u00b5c(cid:107)x \u2212 x(cid:63)(cid:107)2\n\n2 \u2212 2\u03c4 (1 + c)2v2,\n\nwhere \u00b5c = \u03b3\n\n2 \u2212 8\u03c4 (1 + c)2\u03a62(M) and v = \u03b7\n\n\u03bb + \u03b5.\n\n\u03b3\n\nHere we refer \u00b5c as the effective restricted strong convexity parameter, which will provide us a direct\nlink between the convergence speed of an algorithm and the low-dimensional structure of the solution.\nNote that this lemma relaxes the condition on \u03bb in [34, Lemma 11], which is restricted to c = 1. Our\nmain theorem is presented as the following:\nTheorem 3.4. Under A.1 - 4, if further A.2 holds with parameter (\u03b3, \u03c4 ) such that \u03c4 \u03a62(M) <\n(cid:25)\n(cid:17)(cid:113) 6L\u03c4 (1+c)2D(x0,x(cid:63))\n\u2020\nn (cid:107)x0\u2212x(cid:63)(cid:107)2\n16(1+c)2 , denote \u03b5 := 2\u03a6(M)(cid:107)x\u2020\u2212x(cid:63)(cid:107)2+4g(x\nM\u22a5 ), D(x0, x(cid:63)) := 16(F (x0)\u2212F (cid:63))+ 6L\n2,\n\u00b5c = \u03b3\n,\n(cid:27)\n\n(cid:24)\n(cid:113)\n2 \u2212 8\u03c4 (1 + c)2\u03a62(M), if we run Rest-Katyusha with S0 \u2265\n\nwith \u03b2 \u2265 2, then the following inequality holds:\n\n(cid:19)T D(x0, x(cid:63))\n\n(cid:18) 1\n\n32 + 12L\nn\u00b5c\n\n(cid:24)(cid:16)\n\n1 + 2\n\u03c1\u03bb\n\n(cid:26)\n\nS =\n\n\u03b2\n\n8n\u00b5c+3L\n\n(cid:25)\n\nE[F (xT +1) \u2212 F (cid:63)] \u2264 max\n\n\u03b5,\n\n(13)\n\n\u03b22\n\n(S0 + 3)2\n\nwith probability at least 1 \u2212 \u03c1.\n\n5\n\n\fCorollary 3.5. 
Under the same assumptions, parameter choices and notation as Theorem 3.4, the total number of stochastic gradient evaluations required by Rest-Katyusha to reach a δ > ε accuracy is:\n\nO((n + √(nL/µc)) log(1/δ)) + O(n)S0. (14)\n\nProof technique. We extend the proof techniques of Agarwal et al. [34] for proximal gradient descent and of Qu and Xu [35] for SVRG, which are both based on applying induction statements to roll up the residual term of (12) (the second term on the right-hand side). The complete proofs of Theorem 3.4 and Corollary 3.5 can be found in the supplementary material.\nAccelerated linear convergence. Under the RSC assumption, Theorem 3.4 and Corollary 3.5 demonstrate a local accelerated linear convergence rate of Rest-Katyusha up to a statistical accuracy δ > ε. We derive this result by extending the framework of Agarwal et al. [34], in which fast structure-dependent linear convergence of the proximal gradient descent method up to a statistical accuracy δ > ε was established. To the best of our knowledge, this is the first structure-adaptive convergence result for an accelerated incremental gradient algorithm. Note that this result can be trivially extended to a global accelerated linear convergence result (with S0 = S) in the same setting as Agarwal et al. [34], where a side constraint g(x) ≤ R for some radius R is added to restrict the early iterations, with additional re-projections onto this constraint set². Starting from the objective-gap convergence result (13), with some additional algebra one can easily derive the accelerated linear convergence of the iterate (the optimization variable) using the RSC condition as well.\nStructure-adaptive convergence. The effective RSC µc = γ/2 − 8τ(1 + c)²Φ²(M) links the convergence speed of Rest-Katyusha with the intrinsic low-dimensional structure of the solution which is due to the regularization. For instance, if F(x) := (1/2n)∥Ax − b∥₂² + λ∥x∥₁, ∥x⋆∥₀ = s and (A.4) holds with c = 1, then we have µc = γ/2 − 32τs; meanwhile, for a wide class of random design matrices we have τ = O(log d / n) and γ > 0. More specifically, if the rows of the random design matrix A are drawn i.i.d. from N(0, Σ) with covariance matrix Σ ∈ Rd×d which has largest singular value rmax(Σ) and smallest singular value rmin(Σ), then γ ≥ rmin(Σ)/16 and τ ≤ rmax(Σ) · 81 log d / n with high probability, as shown by Raskutti et al. [48].\nHigh-probability statement. Since our proofs utilize the effective RSC, which holds in a neighborhood of x⋆ as demonstrated in Lemma 3.3, we need to bound the functional suboptimality F(xt) − F⋆ in the worst case instead of in expectation. Hence the Markov inequality inevitably has to be applied to provide the convergence statement with high probability (details can be found in the main proof).\nOptimizing the choice of β. Theorem 3.4 shows that the complexity of the main loop of Rest-Katyusha is ⌈β√(32 + 12L/(nµc))⌉ log_{β²}(1/δ), which suggests a trade-off between the choice of β and the total computation. With some trivial computation one can derive that, in theory, the best choice of β is exactly Euler's number (≈ 2.7). 
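The trade-off in β can be made concrete with a short calculation. The constants below are arbitrary illustrative values (an assumption, not from the paper's experiments); note that the ceiling in the restart period slightly shifts the continuous optimum β = e on a discrete grid.

```python
import math

# A small sketch of the restart-period trade-off: the main loop runs
# ceil(beta*sqrt(32 + 12L/(n*mu_c))) epochs per restart, times
# log_{beta^2}(1/delta) restarts; ignoring the ceiling, the product is
# proportional to beta/ln(beta), minimized at beta = e.
# L, n, mu_c, delta below are arbitrary illustrative values.
L, n, mu_c, delta = 1.0, 1000, 1e-4, 1e-10

def restart_period(beta):
    return math.ceil(beta * math.sqrt(32 + 12 * L / (n * mu_c)))

def total_epochs(beta):
    # log_{beta^2}(1/delta) = ln(1/delta) / (2*ln(beta)) restarts
    return restart_period(beta) * math.log(1 / delta) / (2 * math.log(beta))

betas = [1.5 + 0.1 * k for k in range(40)]   # grid over (1.5, 5.4]
best = min(betas, key=total_epochs)
print(best)   # grid minimizer; the continuous optimum is Euler's number e
```

The flatness of β/ln β around its minimum is also why the moderately larger values used in practice barely cost anything in theory.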
Numerically, we observe that a slightly larger choice of β often provides better performance in practice (illustrative examples can be found in the supplemental material).\n\n4 Adaptive Rest-Katyusha\n\nMotivated by the theory above, we further propose a practical adaptive restart heuristic for Rest-Katyusha which is able to estimate the effective RSC on the fly. Based on the convergence theory, we observe that, with the choice of restart period S = ⌈β√(32 + 12L/(nµ0))⌉ for a conservative estimate µ0 ≤ µc, we are always guaranteed to have:\n\nE_{ξt\ξt−1}[F(xt+1) − F⋆] ≤ (1/β²)[F(xt) − F⋆], (15)\n\ndue to the fact that an underestimate of the RSC leads to a longer restart period than we actually need³. The intuition behind our adaptive restart heuristic is: if we overestimate µc, the above inequality will be violated. Hence an adaptive estimate of µc can be obtained via a convergence-speed check. However, the above inequality cannot be evaluated directly in practice, since it holds in expectation and demands knowledge of F⋆.\n\n²In [34], a side constraint is manually added to the regularized ERM problem; hence in their setting the effective restricted strong convexity is valid globally. They provide a global linear convergence result for proximal gradient descent (with additional re-projection steps) at the cost of the additional side constraint.\n\nIn [29, Prop. 
4], it has been shown that with the composite gradient map:\n\nT(x) = arg min_q (L/2)∥x − q∥₂² + ⟨∇f(x), q − x⟩ + λg(q), (16)\n\nthe value of F(x) − F⋆ can be lower bounded:\n\nF(x) − F⋆ ≥ O(∥T(x) − x∥₂²), (17)\n\nand it can also be approximately upper bounded by O(∥T(x) − x∥₂²) if local quadratic growth is assumed, which reads:\n\n∃α > 0, r > 0, F(x) − F⋆ ≥ α∥x − x⋆∥₂², ∀x s.t. ∥x − x⋆∥₂² < r. (18)\n\nHence in our adaptive restart heuristic we check the convergence speed by evaluating the composite gradient map at the snapshot points where full gradients have already been calculated. Because of this, the only notable additional computational overhead of this adaptive restart scheme is the proximal operation of g(.) at the restart points.\n\nAlgorithm 3 Adaptive Rest-Katyusha (x0, µ0, S0, β, T, L)\nInitialize: epoch length m = 2n; initial restart period S = ⌈β√(32 + 12L/(nµ0))⌉;\nx1 = Katyusha(x0, m, S0, L)\nCalculate the composite gradient map:\n    T(x1) = arg min_x (L/2)∥x − x1∥₂² + ⟨∇f(x1), x − x1⟩ + λg(x).\nfor t = 1, . . . , T do\n    xt+1 = Katyusha(xt, m, S, L)\n    (Track the convergence speed via the composite gradient maps:)\n    T(xt+1) = arg min_x (L/2)∥x − xt+1∥₂² + ⟨∇f(xt+1), x − xt+1⟩ + λg(x).\n    (Update the estimate of the RSC and adaptively tune the restart period:)\n    if ∥T(xt+1) − xt+1∥₂² ≤ (1/β²)∥T(xt) − xt∥₂² then µ0 ← 2µ0, else µ0 ← µ0/2. S = ⌈β√(32 + 12L/(nµ0))⌉\nend for\n\nThe adaptive Rest-Katyusha method is presented in Algorithm 3. We highlight the heuristic estimation procedure for the RSC parameter (the tracking and update steps in Algorithm 3), which is additional to the original Katyusha algorithm. The algorithm starts with an initial guess µ0 and the corresponding restart period S; meanwhile we calculate the composite gradient map T(x1) at x1 and record the value of ∥T(x1) − x1∥₂², which we use as an estimate of F(x1) − F⋆ (and so on). Then after S outer loops, we restart the algorithm and evaluate the composite gradient map again. If ∥T(x2) − x2∥₂² ≥ (1/β²)∥T(x1) − x1∥₂², we suspect that the RSC parameter has been overestimated, and hence we halve µ0; otherwise we double the estimate. We also update the restart period to S = ⌈β√(32 + 12L/(nµ0))⌉ with the modified µ0. The forthcoming iterations follow the same updating rule.\n\n5 Numerical Experiments\n\nIn this section we describe numerical experiments on our proposed algorithm Rest-Katyusha (Alg. 2) and the adaptive Rest-Katyusha (Alg. 3). We focus on the Lasso regression task:\n\nx⋆ ∈ arg min_{x ∈ Rd} { F(x) := (1/2n)∥Ax − b∥₂² + λ∥x∥₁ }. (19)\n\n³An inaccurate estimate of the RSC will lead to a compromised convergence rate. 
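For a Lasso objective of the form (19), the composite gradient map (16) is a single proximal-gradient step with soft-thresholding, and the adaptive test compares ∥T(x) − x∥₂² across restarts. The following is our own illustration on synthetic data (the data and all constants are assumptions):

```python
import numpy as np

# Sketch of the composite gradient map (16) for a Lasso objective
# F(x) = (1/2n)||Ax - b||_2^2 + lam*||x||_1, plus the doubling/halving rule
# that adapts the RSC estimate mu0. Synthetic data; our own illustration.
rng = np.random.default_rng(3)
n, d, lam = 60, 12, 0.05
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
L = np.linalg.norm(A, 2) ** 2 / n              # Lipschitz constant of grad f

def composite_map(x):
    # T(x) = argmin_q (L/2)||q - x||^2 + <grad f(x), q - x> + lam*||q||_1
    g = A.T @ (A @ x - b) / n
    v = x - g / L
    return np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)

def update_mu(mu0, x_prev, x_new, beta):
    # ||T(x) - x||^2 proxies F(x) - F*; if it decayed by 1/beta^2 the RSC
    # estimate is deemed safe (double it), otherwise it was too big (halve it)
    fast = np.sum((composite_map(x_new) - x_new) ** 2) <= \
           np.sum((composite_map(x_prev) - x_prev) ** 2) / beta ** 2
    return 2 * mu0 if fast else mu0 / 2

x = np.zeros(d)
for _ in range(2000):                          # plain ISTA: iterating T reaches x*
    x = composite_map(x)
print(np.allclose(composite_map(x), x, atol=1e-8))  # the solution is a fixed point of T
print(update_mu(1e-3, np.zeros(d), x, beta=5.0))    # fast progress => estimate doubled
```

Since T is evaluated only at snapshot points where ∇f is already available, the per-restart overhead is a single proximal operation, as noted above.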
Detailed discussion and analysis of Rest-Katyusha with a rough RSC estimate can be found in the Appendix.\n\nTo enforce sparsity on the regression parameter we use the ℓ1 penalty with various degrees of regularization, with parameters chosen from the set λ ∈ {1 × 10^p, 2 × 10^p, 5 × 10^p : p ∈ Z}. For comparison, the performance of (proximal) SVRG and the original Katyusha method is also shown in the plots. We run all the algorithms with their theoretical step sizes for the Madelon and REGED datasets, while for the RCV1 dataset we adopt a minibatch scheme for all the algorithms and grid-search the step sizes which optimize each algorithm's performance.\n\nTable 1: Datasets for the Experiments and Minibatch Size Choice for the Algorithms\nDATA SET | SIZE (n, d) | MINIBATCH | REF.\n(A) MADELON | (2000, 500) | 1 | [49]\n(B) RCV1 | (20242, 47236) | 80 | [49]\n(C) REGED | (500, 999) | 1 | [50]\n\nFigure 1: Lasso Experiments on (A) Madelon and (B) RCV1 (axes: number of gradient evaluations / n versus objective gap, log scale). Panels: (A) λ = 5 × 10⁻⁵, ∥x⋆∥₀ = 42; (A) λ = 2 × 10⁻⁵, ∥x⋆∥₀ = 159; (B) λ = 1 × 10⁻⁴, ∥x⋆∥₀ = 902; (B) λ = 1 × 10⁻⁵, ∥x⋆∥₀ = 6315.\n\nIn all our experiments we set β = 5 and S0 = S for convenience. We first grid-search the estimate of µc for Rest-Katyusha which provides the best convergence performance, and denote it as “Rest-Katyusha opt” in the plots. Meanwhile, we also run Rest-Katyusha with an RSC estimate which is 20 times larger or smaller than the optimal one, denoted “Rest-Katyusha opt*20” and “Rest-Katyusha opt/20” respectively. In the 5th plot in Figure 1 the curves for Rest-Katyusha opt, opt*20 and opt/20 are indistinguishable, which shows that in these particular experiments their performance is almost identical. 
For the adaptive Rest-Katyusha we fix the starting estimate of $\mu_c$ to $10^{-5}$ throughout all the experiments.

From these experiments we observe that, as our theory predicts, Rest-Katyusha achieves accelerated linear convergence even when there is no explicit strong convexity in the cost function (the RCV1 and REGED datasets), and the convergence speed is directly related to the sparsity of the solution: in the Lasso experiments, the sparser the solution, the faster the linear convergence of Rest-Katyusha. Meanwhile, when we run Rest-Katyusha with an inaccurate RSC estimate, we still observe a compromised linear convergence, as predicted by our theory. In all the experiments, we observe that the adaptive Rest-Katyusha indeed achieves a good estimate of the RSC parameter and properly adapts the restart period automatically on the fly; hence its performance is often comparable with the best-tuned Rest-Katyusha. As in the experimental results of [34, 35, 36], the linear convergence we observe holds to arbitrary accuracy rather than only up to a threshold near the solution.

Figure 2: Lasso Experiments on (C) REGED Dataset (each panel plots the objective gap (log) against # gradient evaluations / n).
  (C) $\lambda = 2\times 10^{-5}$, $\|x^\star\|_0 = 80$      (C) $\lambda = 1\times 10^{-5}$, $\|x^\star\|_0 = 127$
  (C) $\lambda = 5\times 10^{-6}$, $\|x^\star\|_0 = 209$     (C) $\lambda = 2\times 10^{-6}$, $\|x^\star\|_0 = 343$
This conservative aspect of the theory is an inherent artifact of the RSC framework [34], and we leave the extension to the arbitrary-accuracy regime as future work. We also include additional experimental results in the supplementary material.

6 Conclusion

We developed a restart variant of the Katyusha algorithm for regularized empirical risk minimization, which is provably able to actively exploit the intrinsic low-dimensional structure of the solution to accelerate convergence. Based on this convergence result we further constructed an adaptive restart heuristic which estimates the RSC parameter on the fly and adaptively tunes the restart period. The efficiency of this approach is validated through numerical experiments. In future work, we aim to develop more refined and provably good adaptive restart schemes for the Rest-Katyusha algorithm to further exploit the solution's structure for acceleration.

7 Acknowledgements

JT, FB, MG and MD would like to acknowledge the support from the H2020-MSCA-ITN Machine Sensing Training Network (MacSeNet), project 642685; ERC grant SEQUOIA, project 724063; EPSRC Compressed Quantitative MRI grant, number EP/M019802/1; and ERC Advanced grant C-SENSE, project 694888, respectively. MD is also supported by a Royal Society Wolfson Research Merit Award. JT would like to thank Damien Scieur and Vincent Roulet for helpful discussions during his research visit to the SIERRA team.

References

[1] Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.

[2] Martin J Wainwright.
Structured regularizers for high-dimensional problems: Statistical and computational issues. Annual Review of Statistics and Its Application, 1:233–253, 2014.

[3] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, pages 538–557, 2012.

[4] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

[5] Robert Tibshirani, Martin Wainwright, and Trevor Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.

[6] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.

[7] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM, 2004.

[8] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[9] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc., 2012.

[10] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.

[11] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization.
Journal of Machine Learning Research, 14(Feb):567–599, 2013.

[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[13] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[14] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[15] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.

[16] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, UCL, 2007.

[17] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

[18] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.

[19] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pages 1110–1119, 2016.

[20] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.

[21] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.

[22] Yuchen Zhang and Xiao Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization.
In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 353–361, 2015.

[23] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. arXiv preprint arXiv:1603.05953, 2016.

[24] Aaron Defazio. A simple practical accelerated method for finite sums. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 676–684. Curran Associates, Inc., 2016.

[25] Tomoya Murata and Taiji Suzuki. Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. In Advances in Neural Information Processing Systems, pages 608–617, 2017.

[26] Yossi Arjevani. Limitations on variance-reduction and acceleration schemes for finite sums optimization. In Advances in Neural Information Processing Systems, pages 3543–3552, 2017.

[27] B. O'Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

[28] Vincent Roulet and Alexandre d'Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems, pages 1119–1129, 2017.

[29] Olivier Fercoq and Zheng Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. arXiv preprint arXiv:1709.02300, 2017.

[30] Olivier Fercoq and Zheng Qu. Restarting the accelerated coordinate descent method with a rough strong convexity estimate. arXiv preprint arXiv:1803.05771, 2018.

[31] Jialei Wang and Lin Xiao.
Exploiting strong convexity from data with primal-dual first-order algorithms. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3694–3702, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[32] Francis R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(Jun):1179–1225, 2008.

[33] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

[34] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482, 2012.

[35] Chao Qu and Huan Xu. Linear convergence of SVRG in statistical estimation. arXiv preprint arXiv:1611.01957, 2016.

[36] Junqi Tang, Francis Bach, Mohammad Golbabaee, and Mike Davies. Structure-adaptive, variance-reduced, and accelerated stochastic optimization. arXiv preprint arXiv:1712.03156, 2017.

[37] Junqi Tang, Mohammad Golbabaee, Francis Bach, and Mike Davies. Structure-adaptive accelerated coordinate descent. HAL preprint hal-01889990, 2018.

[38] M. Pilanci and M. J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.

[39] M. Pilanci and M. J. Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. Journal of Machine Learning Research, 17(53):1–38, 2016.

[40] Mert Pilanci and Martin J. Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence.
SIAM Journal on Optimization, 27(1):205–245, 2017.

[41] Junqi Tang, Mohammad Golbabaee, and Mike E. Davies. Gradient projection iterative sketch for large-scale constrained least-squares. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3377–3386. PMLR, 2017.

[42] Junqi Tang, Mohammad Golbabaee, and Mike Davies. Exploiting the structure via sketched gradient algorithms. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 1305–1309. IEEE, 2017.

[43] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

[44] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[45] Olivier Fercoq and Zheng Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358, 2016.

[46] Samuel Vaiter, Mohammad Golbabaee, Jalal Fadili, and Gabriel Peyré. Model selection with low complexity priors. Information and Inference: A Journal of the IMA, 4(3):230–287, 2015.

[47] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[48] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.

[49] M. Lichman. UCI machine learning repository, 2013.

[50] Causality Workbench Team.
A genomics dataset, September 2008.