{"title": "Safe Adaptive Importance Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 4381, "page_last": 4391, "abstract": "Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications. Improved adaptive variants -- using importance values defined by the complete gradient information which changes during optimization -- enjoy favorable theoretical properties, but are typically computationally infeasible. In this paper we propose an efficient approximation of gradient-based sampling, which is based on safe bounds on the gradient. The proposed sampling distribution is (i) provably the \\emph{best sampling} with respect to the given bounds, (ii) always better than uniform sampling and fixed importance sampling and (iii) can efficiently be computed -- in many applications at negligible extra cost. The proposed sampling scheme is generic and can easily be integrated into existing algorithms. In particular, we show that coordinate-descent (CD) and stochastic gradient descent (SGD) can enjoy significant a speed-up under the novel scheme. The proven efficiency of the proposed sampling is verified by extensive numerical testing.", "full_text": "Safe Adaptive Importance Sampling\n\nSebastian U. Stich\n\nEPFL\n\nAnant Raj\n\nMax Planck Institute for Intelligent Systems\n\nsebastian.stich@epfl.ch\n\nanant.raj@tuebingen.mpg.de\n\nMartin Jaggi\n\nEPFL\n\nmartin.jaggi@epfl.ch\n\nAbstract\n\nImportance sampling has become an indispensable strategy to speed up optimiza-\ntion algorithms for large-scale applications. Improved adaptive variants\u2014using\nimportance values de\ufb01ned by the complete gradient information which changes\nduring optimization\u2014enjoy favorable theoretical properties, but are typically com-\nputationally infeasible. In this paper we propose an ef\ufb01cient approximation of\ngradient-based sampling, which is based on safe bounds on the gradient. 
The proposed sampling distribution is (i) provably the best sampling with respect to the given bounds, (ii) always better than uniform sampling and fixed importance sampling and (iii) can efficiently be computed\u2014in many applications at negligible extra cost. The proposed sampling scheme is generic and can easily be integrated into existing algorithms. In particular, we show that coordinate descent (CD) and stochastic gradient descent (SGD) can enjoy a significant speed-up under the novel scheme. The proven efficiency of the proposed sampling is verified by extensive numerical testing.\n\n1 Introduction\n\nModern machine learning applications operate on massive datasets. The algorithms used for data analysis face the difficult challenge of coping with the enormous amount of data or the vast dimensionality of the problems. A simple and well-established strategy to reduce the computational cost is to split the data and to operate only on a small part of it, as for instance in coordinate descent (CD) methods and stochastic gradient (SGD) methods. These kinds of methods are state of the art for a wide selection of machine learning, deep learning and signal processing applications [9, 11, 35, 27]. The application of these schemes is not only motivated by their practical performance, but also well justified by theory [18, 19, 2].\nDeterministic strategies are seldom used for the data selection\u2014examples are steepest coordinate descent [4, 34, 20] or screening algorithms [14, 15]. Instead, randomized selection has become ubiquitous, most prominently uniform sampling [27, 29, 7, 8, 28] but also non-uniform sampling based on a fixed distribution, commonly referred to as importance sampling [18, 19, 2, 33, 16, 6, 25, 24]. While these sampling strategies typically depend on the input data, they do not adapt to the information of the current parameters during optimization.
In contrast, adaptive importance sampling strategies constantly re-evaluate the relative importance of each data point during training and thereby often surpass the performance of static algorithms [22, 5, 26, 10, 21, 23]. Common strategies are gradient-based sampling [22, 36, 37] (mostly for SGD) and duality gap-based sampling for CD [5, 23].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe drawbacks of adaptive strategies are twofold: often the provable theoretical guarantees can be worse than the complexity estimates for uniform sampling [23, 3], and often it is computationally inadmissible to compute the optimal adaptive sampling distribution. For instance, gradient-based sampling requires the computation of the full gradient in each iteration [22, 36, 37]. Therefore one has to rely on approximations based on upper bounds [36, 37], or stale values [22, 1]. But in general these approximations can again be worse than uniform sampling. This makes it necessary to develop adaptive strategies that can efficiently be computed in every iteration and that come with theoretical guarantees that show their advantage over fixed sampling.\n\nOur contributions. In this paper we propose an efficient approximation of the gradient-based sampling in the sense that (i) it can efficiently be computed in every iteration, (ii) it is provably better than uniform or fixed importance sampling and (iii) it recovers the gradient-based sampling in the full-information setting.
The scheme is completely generic and can easily be added as an improvement to both CD and SGD type methods.\nAs our key contributions, we\n\n(1) show that gradient-based sampling in CD methods is theoretically better than the classical fixed sampling; the speed-up can reach a factor of the dimension n (Section 2);\n\n(2) propose a generic and efficient adaptive importance sampling strategy that can be applied in CD and SGD methods and enjoys favorable properties\u2014such as mentioned above (Section 3);\n\n(3) demonstrate how the novel scheme can efficiently be integrated in CD and SGD on an important class of structured optimization problems (Section 4);\n\n(4) supply numerical evidence that the novel sampling performs well on real data (Section 5).\n\nNotation. For x \u2208 R^n define [x]_i := \u27e8x, e_i\u27e9 with e_i the standard unit vectors in R^n. We abbreviate \u2207_i f := [\u2207f]_i. A convex function f : R^n \u2192 R with L-Lipschitz continuous gradient satisfies\n\nf(x + \u03b7u) \u2264 f(x) + \u03b7\u27e8u, \u2207f(x)\u27e9 + (\u03b7\u00b2 L_u / 2) \u2016u\u2016\u2082\u00b2,  \u2200x \u2208 R^n, \u2200\u03b7 \u2208 R,  (1)\n\nfor every direction u \u2208 R^n and L_u = L. A function with coordinate-wise L_i-Lipschitz continuous gradients\u00b9 for constants L_i > 0, i \u2208 [n] := {1, ..., n}, satisfies (1) just along coordinate directions, i.e. u = e_i, L_{e_i} = L_i for every i \u2208 [n]. A function is coordinate-wise L-smooth if L_i \u2264 L for i = 1, ..., n. For convenience we introduce the vector l = (L_1, ..., L_n)^T and the matrix L = diag(l). A probability vector p \u2208 \u0394_n := {x \u2208 R^n_{\u22650} : \u2016x\u2016\u2081 = 1} defines a probability distribution P over [n] and we denote by i \u223c p a sample drawn from P.\n\n\u00b9 |\u2207_i f(x + \u03b7 e_i) \u2212 \u2207_i f(x)| \u2264 L_i |\u03b7|, \u2200x \u2208 R^n, \u2200\u03b7 \u2208 R.\n\n2 Adaptive Importance Sampling with Full Information\n\nIn this section we argue that adaptive sampling strategies are theoretically well justified, as they can lead to significant improvements over static strategies. In our exposition we focus first on CD methods, as we also propose a novel stepsize strategy for CD in this contribution. Then we revisit the results regarding stochastic gradient descent (SGD) already present in the literature.\n\n2.1 Coordinate Descent with Adaptive Importance Sampling\n\nWe address general minimization problems min_x f(x). Let the objective f : R^n \u2192 R be convex with coordinate-wise L_i-Lipschitz continuous gradients. Coordinate descent methods generate sequences {x_k}_{k\u22650} of iterates that satisfy the relation\n\nx_{k+1} = x_k \u2212 \u03b3_k \u2207_{i_k} f(x_k) e_{i_k}.  (2)\n\nHere, the direction i_k is either chosen deterministically (cyclic descent, steepest descent), or randomly picked according to a probability vector p_k \u2208 \u0394_n. In the classical literature, the stepsize is often chosen such as to minimize the quadratic upper bound (1), i.e. \u03b3_k = L_{i_k}^{\u22121}. In this work we propose to set \u03b3_k = \u03b1_k [p_k]_{i_k}^{\u22121}, where \u03b1_k does not depend on the chosen direction i_k. This leads to directionally-unbiased updates, as is common among SGD-type methods.
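One such update step can be sketched as follows (a minimal illustration of update rule (2) with the proposed stepsize \u03b3_k = \u03b1_k/[p_k]_{i_k}; the function names and the coordinate-gradient oracle are ours, not from the paper):

```python
import numpy as np

def cd_step(x, partial_grad, p, alpha, rng):
    # One randomized CD step x_{k+1} = x_k - (alpha / p_i) * grad_i(x) * e_i,
    # with the coordinate i drawn from the probability vector p.
    # partial_grad(x, i) returns the i-th entry of the gradient (our oracle).
    i = rng.choice(x.shape[0], p=p)          # i_k ~ p_k
    x_new = x.copy()
    x_new[i] -= (alpha / p[i]) * partial_grad(x, i)
    return x_new
```

Averaging the update direction over i \u223c p gives \u2212\u03b1\u2207f(x) for any p, which is exactly the directional unbiasedness exploited in the derivation that follows.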
It holds\n\nE_{i_k\u223cp_k}[f(x_{k+1}) | x_k] \u2264 E_{i_k\u223cp_k}[ f(x_k) \u2212 (\u03b1_k/[p_k]_{i_k}) (\u2207_{i_k} f(x_k))\u00b2 + (L_{i_k} \u03b1_k\u00b2 / (2[p_k]_{i_k}\u00b2)) (\u2207_{i_k} f(x_k))\u00b2 | x_k ]\n= f(x_k) \u2212 \u03b1_k \u2016\u2207f(x_k)\u2016\u2082\u00b2 + \u03a3_{i=1}^n (L_i \u03b1_k\u00b2 / (2[p_k]_i)) (\u2207_i f(x_k))\u00b2,  (3)\n\nwhere the inequality follows from (1). In adaptive strategies we have the freedom to choose both variables \u03b1_k and p_k as we like. We therefore propose to choose them in such a way that they minimize the upper bound (3) in order to maximize the expected progress. The optimal p_k in (3) is independent of \u03b1_k, but the optimal \u03b1_k depends on p_k. We can state the following useful observation.\n\nLemma 2.1. If \u03b1_k = \u03b1_k(p_k) is the minimizer of (3), then x_{k+1} := x_k \u2212 (\u03b1_k/[p_k]_{i_k}) \u2207_{i_k} f(x_k) e_{i_k} satisfies\n\nE_{i_k\u223cp_k}[f(x_{k+1}) | x_k] \u2264 f(x_k) \u2212 (\u03b1_k(p_k)/2) \u2016\u2207f(x_k)\u2016\u2082\u00b2.  (4)\n\nConsider two examples. In the first one we pick a sub-optimal, but very common [18] distribution:\n\nExample 2.2 (L_i-based sampling). Let p_L \u2208 \u0394_n be defined as [p_L]_i = L_i/Tr[L] for i \u2208 [n], where L = diag(L_1, ..., L_n). Then \u03b1_k(p_L) = 1/Tr[L].\n\nThe distribution p_L is often referred to as (fixed) importance sampling. In the special case when L_i = L for all i \u2208 [n], this boils down to uniform sampling.\n\nExample 2.3 (Optimal sampling\u00b2). Equation (3) is minimized for the probabilities [p*_k]_i = \u221aL_i |\u2207_i f(x_k)| / \u2016\u221aL \u2207f(x_k)\u2016\u2081 and \u03b1_k(p*_k) = \u2016\u2207f(x_k)\u2016\u2082\u00b2 / \u2016\u221aL \u2207f(x_k)\u2016\u2081\u00b2. Observe 1/Tr[L] \u2264 \u03b1_k(p*_k) \u2264 1/L_min, where L_min := min_{i\u2208[n]} L_i.\n\nTo prove this result, we rely on the following lemma; its proof, as well as the proofs of the claims above, is deferred to Section A.1 of the appendix.\n\nLemma 2.4. Define V(p, x) := \u03a3_{i=1}^n L_i [x]_i\u00b2 / [p]_i. Then argmin_{p\u2208\u0394_n} V(p, x) = |\u221aL x| / \u2016\u221aL x\u2016\u2081, where |\u00b7| is applied entry-wise.\n\nThe ideal adaptive algorithm. We propose to choose the stepsize and the sampling distribution for CD as in Example 2.3. One iteration of the resulting CD method is illustrated in Algorithm 1. Our bounds on the expected one-step progress can be used to derive convergence rates of this algorithm with the standard techniques. This is exemplified in Appendix A.1. In the next Section 3 we develop a practical variant of the ideal algorithm.\n\nEfficiency gain. By comparing the estimates provided in the examples above, we see that the expected progress of the proposed method is always at least as good as for the fixed sampling. For instance, in the special case where L_i = L for all i \u2208 [n], the L_i-based sampling is just uniform sampling with \u03b1_k(p_unif) = 1/(Ln). On the other hand, \u03b1_k(p*_k) = \u2016\u2207f(x_k)\u2016\u2082\u00b2 / (L\u2016\u2207f(x_k)\u2016\u2081\u00b2), which can be n times larger than \u03b1_k(p_unif). The expected one-step progress in this extreme case coincides with the one-step progress of steepest coordinate descent [20].\n\n2.2 SGD with Adaptive Sampling\n\nSGD methods are applicable to objective functions which decompose as a sum\n\nf(x) = (1/n) \u03a3_{i=1}^n f_i(x)  (5)\n\nwith each f_i : R^d \u2192 R convex. In previous work [22, 36, 37] it has been argued that the gradient-based sampling [p\u0303*_k]_i = \u2016\u2207f_i(x_k)\u2016\u2082 / \u03a3_{j=1}^n \u2016\u2207f_j(x_k)\u2016\u2082 is optimal in the sense that it maximizes the expected progress (3). Zhao and Zhang [36] derive complexity estimates for composite functions. For non-composite functions it becomes easier to derive the complexity estimate.
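Concretely, the optimal sampling and stepsize of Example 2.3 can be computed from a gradient vector in a few lines (a sketch under our own naming; this assumes a nonzero gradient):

```python
import numpy as np

def optimal_cd_sampling(grad, L):
    # Gradient-based sampling of Example 2.3: [p*]_i proportional to
    # sqrt(L_i) * |grad_i|, with stepsize alpha = ||grad||_2^2 / ||sqrt(L) grad||_1^2.
    # grad: full gradient vector; L: vector of coordinate-wise constants L_i.
    w = np.sqrt(L) * np.abs(grad)        # sqrt(L) |grad|, entry-wise
    p = w / w.sum()
    alpha = float(np.dot(grad, grad) / w.sum() ** 2)
    return p, alpha
```

For any nonzero gradient the returned stepsize lands in the interval [1/Tr[L], 1/L_min] stated in Example 2.3.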
For completeness, we add this simpler proof in Appendix A.2.\n\n\u00b2 Here \u201coptimal\u201d refers to the fact that p*_k is optimal with respect to the given model (1) of the objective function. If the model is not accurate, there might exist a sampling that yields larger expected progress on f.\n\nAlgorithm 1 (Optimal sampling):\n  Compute \u2207f(x_k)   (compute full gradient)\n  Define (p*_k, \u03b1*_k) as in Example 2.3   (define optimal sampling)\n  Sample i_k \u223c p*_k\n  x_{k+1} := x_k \u2212 (\u03b1*_k/[p*_k]_{i_k}) \u2207_{i_k} f(x_k) e_{i_k}\n\nAlgorithm 2 (Proposed safe sampling):\n  Update \u2113, u   (update lower and upper bounds)\n  Define (p\u0302_k, \u03b1\u0302_k) as in (7)   (compute safe sampling)\n  Sample i_k \u223c p\u0302_k\n  Compute \u2207_{i_k} f(x_k)\n  x_{k+1} := x_k \u2212 (\u03b1\u0302_k/[p\u0302_k]_{i_k}) \u2207_{i_k} f(x_k) e_{i_k}\n\nAlgorithm 3 (Fixed sampling):\n  Define (p_L, \u03b1\u0304) as in Example 2.2   (define fixed sampling)\n  Sample i_k \u223c p_L\n  Compute \u2207_{i_k} f(x_k)\n  x_{k+1} := x_k \u2212 (\u03b1\u0304/[p_L]_{i_k}) \u2207_{i_k} f(x_k) e_{i_k}\n\nFigure 1: CD with different sampling strategies. Whilst Alg. 1 requires to compute the full gradient, the compute operation in Alg. 2 is as cheap as for fixed importance sampling, Alg. 3. Defining the safe sampling p\u0302_k requires O(n log n) time.\n\n3 Safe Adaptive Importance Sampling with Limited Information\n\nIn the previous section we have seen that gradient-based sampling (Example 2.3) can yield a massive speed-up compared to a static sampling distribution (Example 2.2). However, sampling according to p*_k in CD requires the knowledge of the full gradient \u2207f(x_k) in each iteration. And likewise, sampling from p\u0303*_k in SGD requires the knowledge of the gradient norms of all components. Both these operations are in general inadmissible, i.e. the compute cost would void all computational benefits of the iterative (stochastic) methods over full gradient methods.\nHowever, it is often possible to efficiently compute approximations of p*_k or p\u0303*_k instead. In contrast to previous contributions, we here propose a safe way to compute such approximations. By this we mean that our approximate sampling is provably never worse than static sampling, and moreover, we show that our solution is the best possible with respect to the limited information at hand.\n\n3.1 An Optimization Formulation for Sampling\n\nFormally, we assume that in each iteration we have access to two vectors \u2113_k, u_k \u2208 R^n_{\u22650} that provide safe upper and lower bounds on either the absolute values of the gradient entries ([\u2113_k]_i \u2264 |\u2207_i f(x_k)| \u2264 [u_k]_i) for CD, or on the gradient norms in SGD ([\u2113_k]_i \u2264 \u2016\u2207f_i(x_k)\u2016\u2082 \u2264 [u_k]_i). We postpone the discussion of this assumption to Section 4, where we give concrete examples.\nThe minimization of the upper bound (3) amounts to the equivalent problem\u00b3\n\nmin_{\u03b1_k} min_{p_k\u2208\u0394_n} [ \u2212\u03b1_k \u2016c_k\u2016\u2082\u00b2 + (\u03b1_k\u00b2/2) V(p_k, c_k) ]  \u21d4  min_{p_k\u2208\u0394_n} V(p_k, c_k)/\u2016c_k\u2016\u2082\u00b2,  (6)\n\nwhere c_k \u2208 R^n represents the unknown true gradient. That is, with respect to the bounds \u2113_k, u_k, we can write c_k \u2208 C_k := {x \u2208 R^n : [\u2113_k]_i \u2264 [x]_i \u2264 [u_k]_i, i \u2208 [n]}. In Example 2.3 we derived the optimal solution for a fixed c_k \u2208 C_k. However, this is not sufficient to find the optimal solution for an arbitrary c_k \u2208 C_k.
Just computing the optimal solution for an arbitrary (but fixed) c_k \u2208 C_k is unlikely to yield a good solution. For instance, both extreme cases c_k = \u2113_k and c_k = u_k (the latter choice is quite common, cf. [36, 23]) might be poor. This is demonstrated in the next example.\n\nExample 3.1. Let \u2113 = (1, 2)^T, u = (2, 3)^T, c = (2, 2)^T and L_1 = L_2 = 1. Then V(\u2113/\u2016\u2113\u2016\u2081, c) = (9/4)\u2016c\u2016\u2082\u00b2 and V(u/\u2016u\u2016\u2081, c) = (25/12)\u2016c\u2016\u2082\u00b2, whereas for uniform sampling V(c/\u2016c\u2016\u2081, c) = 2\u2016c\u2016\u2082\u00b2.\n\nThe proposed sampling. As a consequence of these observations, we propose to solve the following optimization problem to find the best sampling distribution with respect to C_k:\n\nv_k := min_{p\u2208\u0394_n} max_{c\u2208C_k} V(p, c)/\u2016c\u2016\u2082\u00b2,  and to set  (\u03b1_k, p_k) := (1/v_k, p\u0302_k),  (7)\n\nwhere p\u0302_k denotes a solution of (7). The resulting algorithm for CD is summarized in Alg. 2. In the remainder of this section we discuss the properties of the solution p\u0302_k (Theorem 3.2) and how such a solution can efficiently be computed (Theorem 3.4, Algorithm 4).\n\n\u00b3 Although only shown here for CD, an equivalent optimization problem arises for SGD methods, cf. [36].\n\n3.2 Proposed Sampling and its Properties\n\nTheorem 3.2. Let (p\u0302, c\u0302) \u2208 \u0394_n \u00d7 R^n_{\u22650} denote a solution of (7). Then L_min \u2264 v_k \u2264 Tr[L] and\n\n(i) max_{c\u2208C_k} V(p\u0302, c)/\u2016c\u2016\u2082\u00b2 \u2264 max_{c\u2208C_k} V(p, c)/\u2016c\u2016\u2082\u00b2, \u2200p \u2208 \u0394_n;   (p\u0302 has the best worst-case guarantee)\n\n(ii) V(p\u0302, c) \u2264 Tr[L] \u00b7 \u2016c\u2016\u2082\u00b2, \u2200c \u2208 C_k.   (p\u0302 is always better than L_i-based sampling)\n\nRemark 3.3.
In the special case L_i = L for all i \u2208 [n], the L_i-based sampling boils down to uniform sampling (Example 2.2) and p\u0302 is better than uniform sampling: V(p\u0302, c) \u2264 Ln\u2016c\u2016\u2082\u00b2, \u2200c \u2208 C_k.\n\nProof. Property (i) is an immediate consequence of (7). Moreover, observe that the L_i-based sampling p_L is a feasible solution in (7) with value V(p_L, c)/\u2016c\u2016\u2082\u00b2 \u2261 Tr[L] for all c \u2208 C_k. Hence\n\nL_min \u2264 \u2016\u221aL c\u2016\u2081\u00b2/\u2016c\u2016\u2082\u00b2 = min_{p\u2208\u0394_n} V(p, c)/\u2016c\u2016\u2082\u00b2 (\u2217)\u2264 V(p\u0302, c\u0302)/\u2016c\u0302\u2016\u2082\u00b2 \u2264 max_{c\u2208C_k} V(p\u0302, c)/\u2016c\u2016\u2082\u00b2 \u2264 max_{c\u2208C_k} V(p_L, c)/\u2016c\u2016\u2082\u00b2 = Tr[L],  (8)\n\nfor all c \u2208 C_k, where the first equality follows from Lemma 2.4 and the last inequality from the optimality of p\u0302 in (7). Thus v_k \u2208 [L_min, Tr[L]] and (ii) follows. We prove inequality (\u2217) in the appendix, by showing that min and max can be interchanged in (7).\n\nA geometric interpretation. We show in Appendix B that the optimization problem (7) can equivalently be written as v_k = max_{c\u2208C_k} \u2016\u221aL c\u2016\u2081\u00b2/\u2016c\u2016\u2082\u00b2 = max_{c\u2208C_k} \u27e8\u221al, c\u27e9\u00b2/\u2016c\u2016\u2082\u00b2, where [l]_i = L_i for i \u2208 [n]. The maximum is thus attained for vectors c \u2208 C_k that minimize the angle with the vector \u221al.\n\nTheorem 3.4. Let c \u2208 C_k, p = \u221aL c/\u2016\u221aL c\u2016\u2081 and denote m = \u2016c\u2016\u2082\u00b2 \u00b7 \u2016\u221aL c\u2016\u2081^{\u22121}. If\n\n[c]_i = [u_k]_i if [u_k]_i \u2264 \u221aL_i m;  [\u2113_k]_i if [\u2113_k]_i \u2265 \u221aL_i m;  \u221aL_i m otherwise,   \u2200i \u2208 [n],  (9)\n\nthen (p, c) is a solution to (7). Moreover, such a solution can be computed in time O(n log n).\n\nProof.
This can be proven by examining the optimality conditions of problem (7); the details are deferred to Section B.1 of the appendix. A procedure that computes such a solution is depicted in Algorithm 4. The algorithm makes extensive use of (9). For simplicity, assume L = I_n for now. In each iteration t, a potential solution vector c_t is proposed, and it is verified whether this vector satisfies all optimality conditions. In Algorithm 4, c_t is just implicit, with [c_t]_i = [c]_i for decided indices i \u2208 D and [c_t]_i = \u221aL_i m for undecided indices i \u2209 D. After at most n iterations a valid solution is found. By sorting the components of \u221aL^{\u22121}\u2113_k and \u221aL^{\u22121}u_k by their magnitude, at most a linear number of inequality checks in (9) have to be performed in total. Hence the running time is dominated by the O(n log n) complexity of the sorting algorithm. A formal proof is given in the appendix.\n\nAlgorithm 4 Computing the Safe Sampling for Gradient Information \u2113, u\n1: Input: 0_n \u2264 \u2113 \u2264 u, L. Initialize: c = 0_n, u = 1, \u2113 = n, D = \u2205.\n2: \u2113_sort := sort_asc(\u221aL^{\u22121}\u2113), u_sort := sort_asc(\u221aL^{\u22121}u), m = max(\u2113_sort)\n3: while u \u2264 \u2113 do\n4:   if [\u2113_sort]_\u2113 > m then   (largest undecided lower bound is violated)\n5:     set corresponding [c]_index := [\u221aL \u2113_sort]_\u2113; \u2113 := \u2113 \u2212 1; D := D \u222a {index}\n6:   else if [u_sort]_u < m then   (smallest undecided upper bound is violated)\n7:     set corresponding [c]_index := [\u221aL u_sort]_u; u := u + 1; D := D \u222a {index}\n8:   else\n9:     break   (no constraints are violated)\n10:   end if\n11:   m := \u2016c\u2016\u2082\u00b2 \u00b7 \u2016\u221aL c\u2016\u2081^{\u22121}   (update m as in (9))\n12: end while\n13: Set [c]_i := \u221aL_i m for all i \u2209 D and Return c, p = \u221aL c/\u2016\u221aL c\u2016\u2081, v = \u2016\u221aL c\u2016\u2081\u00b2/\u2016c\u2016\u2082\u00b2\n\nCompetitive ratio. We now compare the proposed sampling distribution p\u0302_k with the optimal sampling solution in hindsight. We know that if the true (gradient) vector c\u0303 \u2208 C_k were given to us, then the corresponding optimal probability distribution would be p*(c\u0303) = \u221aL c\u0303/\u2016\u221aL c\u0303\u2016\u2081 (Example 2.3). Thus, for this c\u0303 we can now analyze the ratio V(p\u0302_k, c\u0303)/V(p*(c\u0303), c\u0303). As we are interested in the worst-case ratio among all possible candidates c\u0303 \u2208 C_k, we define\n\n\u03c1_k := max_{c\u2208C_k} V(p\u0302, c)/V(p*(c), c) = max_{c\u2208C_k} V(p\u0302, c)/\u2016\u221aL c\u2016\u2081\u00b2.  (10)\n\nLemma 3.5. Let w_k := min_{c\u2208C_k} \u2016\u221aL c\u2016\u2081\u00b2/\u2016c\u2016\u2082\u00b2. Then L_min \u2264 w_k \u2264 v_k, and \u03c1_k \u2264 v_k/w_k (\u2264 v_k/L_min).\n\nLemma 3.6. Let \u03b3 \u2265 1. If [C_k]_i \u2229 \u03b3[C_k]_i = \u2205 and \u03b3^{\u22121}[C_k]_i \u2229 [C_k]_i = \u2205 for all i \u2208 [n] (here [C_k]_i denotes the projection on the i-th coordinate), then \u03c1_k \u2264 \u03b3\u2074.\n\nThese two lemmas provide bounds on the competitive ratio. Whilst Lemma 3.6 relies on a relative accuracy condition, Lemma 3.5 can always be applied. However, the corresponding minimization problem is non-convex. Note that knowledge of \u03c1_k is not needed to run the algorithm.\n\n4 Example Safe Gradient Bounds\n\nIn this section, we argue that for a large class of objective functions of interest in machine learning, suitable safe upper and lower bounds \u2113, u on the gradient along every coordinate direction can be estimated and maintained efficiently during optimization.
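Given such bounds, the safe sampling of (7) and (9) itself is cheap to compute. The sketch below replaces the sorting scheme of Algorithm 4 by a simple bisection on the fixed point m = \u2016c\u2016\u2082\u00b2/\u2016\u221aL c\u2016\u2081 (a simplification with the same solution, not the paper's implementation; all names are ours, and we assume 0 \u2264 \u2113 \u2264 u entry-wise with u > 0):

```python
import numpy as np

def safe_sampling(l, u, L, iters=200):
    # Solve (7) via the characterization (9): find m such that
    # c_i = clip(sqrt(L_i) * m, l_i, u_i) and m = ||c||_2^2 / ||sqrt(L) c||_1,
    # using bisection on this fixed point (the paper instead sorts in O(n log n)).
    sL = np.sqrt(L)

    def residual(m):
        c = np.clip(sL * m, l, u)
        return np.dot(c, c) / np.dot(sL, c) - m

    lo, hi = np.min(l / sL), np.max(u / sL)  # residual(lo) >= 0 >= residual(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0:
            lo = mid
        else:
            hi = mid
    c = np.clip(sL * 0.5 * (lo + hi), l, u)
    w = sL * c
    p = w / w.sum()                    # the safe sampling
    v = w.sum() ** 2 / np.dot(c, c)   # value v_k; L_min <= v <= Tr[L]
    return p, v
```

On the data of Example 3.1 (\u2113 = (1, 2), u = (2, 3), L_1 = L_2 = 1) this returns p = (1/2, 1/2) and v = 2, i.e. the worst-case-optimal distribution coincides with uniform sampling there, matching the value 2\u2016c\u2016\u2082\u00b2 computed in the example.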
A similar argument can be given for the efficient approximation of component-wise gradient norms in stochastic gradient optimization of finite-sum objectives.\nAs the guiding example, we will here showcase the training of generalized linear models (GLMs), as e.g. in regression, classification and feature selection. These models are formulated in terms of a given data matrix A \u2208 R^{d\u00d7n} with columns a_i \u2208 R^d for i \u2208 [n].\n\nCoordinate Descent - GLMs with Arbitrary Regularizers. Consider general objectives of the form f(x) := h(Ax) + \u03a3_{i=1}^n \u03c8_i([x]_i) with an arbitrary convex separable regularizer term given by the \u03c8_i : R \u2192 R for i \u2208 [n]. A key example is when h : R^d \u2192 R describes the least-squares regression objective h(Ax) = \u00bd\u2016Ax \u2212 b\u2016\u2082\u00b2 for a b \u2208 R^d. Using that this h is twice differentiable with \u2207\u00b2h(Ax) = I_n, it is easy to see that we can track the evolution of all gradient entries, when performing CD steps, as follows:\n\n\u2207_i f(x_{k+1}) \u2212 \u2207_i f(x_k) = \u03b3_k \u27e8a_i, a_{i_k}\u27e9,  \u2200i \u2260 i_k,  (11)\n\nfor i_k being the coordinate changed in step k (here we also used the separability of the regularizer). Therefore, all gradient changes can be tracked exactly if the inner products of all datapoints are available, or approximately if those inner products can be upper and lower bounded. For computational efficiency, in our experiments we simply use Cauchy-Schwarz: |\u27e8a_i, a_{i_k}\u27e9| \u2264 \u2016a_i\u2016 \u00b7 \u2016a_{i_k}\u2016. This results in safe upper and lower bounds [\u2113_{k+1}]_i \u2264 \u2207_i f(x_{k+1}) \u2264 [u_{k+1}]_i for all inactive coordinates i \u2260 i_k. (For the active coordinate i_k itself one observes the true value without uncertainty.)
These bounds can be updated in linear time O(n) in every iteration.\nFor general smooth h (again with arbitrary separable regularizers \u03c8_i), (11) can readily be extended to hold [32, Lemma 4.1], the inner product change term becoming \u27e8a_i, \u2207\u00b2h(A x\u0303) a_{i_k}\u27e9 instead, when assuming h is twice differentiable. Here x\u0303 will be an element of the line segment [x_k, x_{k+1}].\n\nStochastic Gradient Descent - GLMs. We now present a similar result for finite-sum problems (5) for the use in SGD-based optimization, that is f(x) := (1/n) \u03a3_{i=1}^n f_i(x) = (1/n) \u03a3_{i=1}^n h_i(a_i^T x).\n\nLemma 4.1. Consider f : R^d \u2192 R as above, with twice differentiable h_i : R \u2192 R. Let x_k, x_{k+1} \u2208 R^d denote two successive iterates of SGD, i.e. x_{k+1} := x_k \u2212 \u03b7_k a_{i_k} \u2207h_{i_k}(a_{i_k}^T x_k) = x_k + \u03b3_k a_{i_k}. Then there exists x\u0303 \u2208 R^d on the line segment between x_k and x_{k+1}, x\u0303 \u2208 [x_k, x_{k+1}], with\n\n\u2207f_i(x_{k+1}) \u2212 \u2207f_i(x_k) = \u03b3_k \u2207\u00b2h_i(a_i^T x\u0303) \u27e8a_i, a_{i_k}\u27e9 a_i,  \u2200i \u2260 i_k.  (12)\n\nThis leads to safe upper and lower bounds for the norms of the partial gradients, [\u2113_k]_i \u2264 \u2016\u2207f_i(x_k)\u2016\u2082 \u2264 [u_k]_i, that can be updated in linear time O(n), analogous to the coordinate case discussed above.\u2074\nWe note that there are many other ways to track safe gradient bounds for relevant machine learning problems, including possibly tighter ones. We here only illustrate the simplest variants, highlighting the fact that our new sampling procedure works for any safe bounds \u2113, u.\n\nComputational Complexity. In this section, we have demonstrated how safe upper and lower bounds \u2113, u on the gradient information can be obtained for GLMs, and argued that these bounds can be updated in time O(n) per iteration of CD and SGD.
The computation of the proposed sampling takes O(n log n) time (Theorem 3.4). Hence, the overhead introduced in Algorithm 2 compared to fixed sampling (Algorithm 3) is of the order O(n log n) in every iteration. The computation of one coordinate of the gradient, \u2207_{i_k} f(x_k), takes time \u0398(d) for general data matrices. Hence, when d = \u03a9(n), the introduced overhead reduces to O(log n) per iteration.\n\n5 Empirical Evaluation\n\nIn this section we evaluate the empirical performance of our proposed adaptive sampling scheme on relevant machine learning tasks. In particular, we illustrate performance on generalized linear models with L1 and L2 regularization, of the form (5):\n\nmin_{x\u2208R^d} (1/n) \u03a3_{i=1}^n h_i(a_i^T x) + \u03bb \u00b7 r(x)  (13)\n\nWe use the square loss, the squared hinge loss as well as the logistic loss for the data-fitting terms h_i, and \u2016x\u2016\u2081 and \u2016x\u2016\u2082\u00b2 for the regularizer r(x). The datasets used in the evaluation are rcv1, real-sim and news20.\u2075 The rcv1 dataset consists of 20,242 samples with 47,236 features, real-sim contains 72,309 datapoints and 20,958 features and news20 contains 19,996 datapoints and 1,355,191 features. For all datasets we use unnormalized features with all the non-zero entries set to 1 (bag-of-words features). By real-sim' and rcv1' we denote subsets of the data chosen by randomly selecting 10,000 features and 10,000 datapoints. By news20' we denote a subset of the data chosen by randomly selecting 15% of the features and 15% of the datapoints. A regularization parameter \u03bb = 0.1 is used for all experiments.\nOur results show the evolution of the optimization objective over time or number of epochs (an epoch corresponding to n individual updates). To compute safe lower and upper bounds we use the methods presented in Section 4 with no special initialization, i.e.
\u2113_0 = 0_n, u_0 = \u221e_n.\n\nCoordinate Descent. In Figure 2 we compare the effect of the fixed stepsize \u03b1_k = 1/(Ln) (denoted as \u201csmall\u201d) vs. the time-varying optimal stepsize (denoted as \u201cbig\u201d) as discussed in Section 2. Results are shown for optimal sampling p*_k (with optimal stepsize \u03b1_k(p*_k), cf. Example 2.3), our proposed sampling p\u0302_k (with optimal stepsize \u03b1_k(p\u0302_k) = v_k^{\u22121}, cf. (7)) and uniform sampling (with optimal stepsize \u03b1_k(p_L) = 1/(Ln), as here L = L\u00b7I_n, cf. Example 2.2). As the experiment aligns with theory\u2014confirming the advantage of the varying \u201cbig\u201d stepsizes\u2014we only show the results for Algorithms 1\u20133 in the remaining plots.\nPerformance for the squared hinge loss, as well as logistic regression with L1 and L2 regularization, is presented in Figure 3 and Figure 4 respectively. In Figures 5 and 6 we report the iteration complexity vs. accuracy as well as timing vs. accuracy results on the full datasets for coordinate descent with square loss and L1 (Lasso) and L2 regularization (Ridge).\n\nTheoretical Sampling Quality. As part of the CD performance results in Figures 2\u20136 we include an additional evolution plot at the bottom of each figure to illustrate the values v_k which determine the stepsize (\u03b1\u0302_k = v_k^{\u22121}) for the proposed Algorithm 2 (blue) and the optimal stepsizes of Algorithm 1 (black) which rely on the full gradient information. The plots show the normalized values v_k/Tr[L], i.e. the relative improvement over L_i-based importance sampling.
The results show that despite only relying on very loose safe gradient bounds, the proposed adaptive sampling is able to strongly benefit from the additional information.\n\n\u2074 Here we use the efficient representation \u2207f_i(x) = \u03b8(x) \u00b7 a_i for \u03b8(x) \u2208 R.\n\u2075 All data are available at www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\n\n(a) rcv1', L1 reg. (b) rcv1', L2 reg.\nFigure 2: (CD, square loss) Fixed vs. adaptive sampling strategies, and dependence on stepsizes, with \u201cbig\u201d \u03b1_k = v_k^{\u22121} and \u201csmall\u201d \u03b1_k = 1/Tr[L].\n\n(a) rcv1', L1 reg. (b) real-sim', L2 reg.\nFigure 3: (CD, squared hinge loss) Function value vs. number of iterations for optimal stepsize \u03b1_k = v_k^{\u22121}.\n\n(a) rcv1', L1 reg. (b) rcv1', L2 reg. (c) real-sim', L1 reg. (d) real-sim', L2 reg.\nFigure 4: (CD, logistic loss) Function value vs. number of iterations for different sampling strategies. Bottom: evolution of the value v_k which determines the optimal stepsize (\u03b1\u0302_k = v_k^{\u22121}). The plots show the normalized values v_k/Tr[L], i.e. the relative improvement over L_i-based importance sampling.\n\n(a) rcv1, L1 reg. (b) real-sim, L1 reg.\nFigure 5: (CD, square loss) Function value vs. number of iterations on the full datasets.\n\n(a) real-sim, L1 reg. (b) real-sim, L2 reg.\nFigure 6: (CD, square loss) Function value vs. clock time on the full datasets. (Data for the optimal sampling omitted, as this strategy is not competitive time-wise.)\n\n(a) rcv1', L1 reg. (b) rcv1', L2 reg. (c) real-sim', L1 reg. (d) real-sim', L2 reg.\nFigure 7: (SGD, square loss) Function value vs.
number of iterations.\n\n(a) news20\u2019, L1 reg.\n\n(a) news20\u2019, L1 reg.\n\nFigure 8: (SGD, square loss) Function value vs.\nnumber of iterations.\n\nFigure 9: (SGD square loss) Function value vs.\nclock time.\n\n8\n\nEpochsUniformProposed (big step)Proposed (small step)012561.000.990.980.970.960.950.94f(x )vkk100-1-2-3-4Epochs01256Optimal (big step)Optimal (small step)1.000.950.900.85f(x )vkk100-1-2-3-40.980.960.940.920.900.880.861.00f(x )vkk100-1-2-3-4Epochs01256UniformProposedOptimalUniformProposedOptimalf(x )vkkEpochs00.512.531.000.900.800.70100-1-2-3-40.950.850.750.65f(x )vkk100-1-2-3-4Epochs01256UniformProposedOptimal6.906.856.806.750.1 xEpochs01256UniformProposedOptimal0.690.680.670.66f(x )vkk100-1-2-3-4Epochs00.512.53UniformProposedOptimal0.690.680.670.660.650.64f(x )vkk100-1-2-3-4UniformProposedOptimalEpochs00.512.530.690.680.670.660.650.640.63f(x )vkk100-1-2-3-4UniformProposedOptimalEpochs00.511.533.51.000.950.900.850.80f(x )vkk100-1-2-3-4UniformProposedOptimalEpochs00.5121.000.950.900.850.800.750.70f(x )vkk100-1-2-3-4Time0241461612UniformProposed1.000.950.900.850.800.75f(x )vkk100-1-2-3-4UniformProposed1.000.950.900.850.800.75Time0241461612f(x )vkk100-1-2-3-465605550454035UniformProposedOptimalEpochs00.512.52UniformProposedOptimalEpochs00.512908070605040UniformProposedOptimalEpochs00.512.52140120100806040UniformProposedOptimalEpochs00.512.5210080604020UniformProposedOptimalEpochs0124765432UniformProposed40353025201510Time05102520\fStochastic Gradient Descent. Finally, we also evaluate the performance of our approach when\nused within SGD with L1 and L2 regularization and square loss. In Figures 7\u20138 we report the\niteration complexity vs. accuracy results and in Figure 9 the timing vs. accuracy results. The time\nunits in Figures 6 and 9 are not directly comparable, as the experiments were conducted on different\nmachines.\nWe observe that on all three datasets SGD with the optimal sampling performs only slightly better than\nuniform sampling. 
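As background on the mechanics involved here, SGD remains unbiased under any non-uniform distribution p as long as the sampled gradient is scaled by 1/(n p_i). A minimal sketch, with made-up data and the classical fixed choice p_i ∝ L_i = ||a_i||^2 rather than the adaptive distribution of the experiments:

```python
import random

# Sketch of SGD with a fixed non-uniform sampling distribution for
# f(x) = (1/n) * sum_i f_i(x), with f_i(x) = 0.5 * (a_i . x - b_i)^2.
# Sampling i with probability p_i and scaling the sampled gradient by
# 1/(n * p_i) keeps the stochastic gradient unbiased for any p > 0.
# The toy data and p_i proportional to L_i = ||a_i||^2 are illustrative.
data = [([1.0, 0.0], 1.0), ([0.0, 2.0], -1.0), ([1.0, 1.0], 0.5)]
n = len(data)

def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def f(x):
    return sum(0.5 * (dot(a, x) - bi) ** 2 for a, bi in data) / n

L = [dot(a, a) for a, _ in data]
p = [Li / sum(L) for Li in L]  # fixed importance sampling p_i proportional to L_i

random.seed(1)
x = [0.0, 0.0]
step = 0.1
for _ in range(500):
    i = random.choices(range(n), weights=p)[0]
    a, bi = data[i]
    scale = (dot(a, x) - bi) / (n * p[i])      # the 1/(n p_i) reweighting
    x = [xi - step * scale * ai for xi, ai in zip(x, a)]

print(f([0.0, 0.0]), "->", f(x))
```

The reweighting leaves the expectation of the update unchanged; only its variance depends on p, which is exactly the quantity the sampling distribution is chosen to reduce.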
This is in contrast with the observations for CD, where the optimal sampling yields a significant improvement. Consequently, the effect of the proposed sampling is less pronounced in the three SGD experiments.

Summary. The main findings of our experimental study can be summarized as follows:

• Adaptive importance sampling significantly outperforms fixed importance sampling in iterations and time. The results show that (i) convergence in terms of iterations is almost as good as for the optimal (but not efficiently computable) gradient-based sampling and (ii) the introduced computational overhead is small enough to outperform fixed importance sampling in terms of total computation time.
• Adaptive sampling requires adaptive stepsizes. The adaptive stepsize strategies of Algorithms 1 and 2 allow for much faster convergence than conservative fixed-stepsize strategies. In the experiments, the measured value v_k was always significantly below the worst-case estimate, in alignment with the observed convergence.
• Very loose safe gradient bounds are sufficient. Even the bounds derived from the very naïve gradient information obtained by estimating scalar products resulted in significantly better sampling than using no gradient information at all. Further, no initialization of the gradient estimates is needed (at the beginning of the optimization process the proposed adaptive method performs close to the fixed sampling, but accelerates after just one epoch).

6 Conclusion

In this paper we propose a safe adaptive importance sampling scheme for CD and SGD algorithms. We argue that optimal gradient-based sampling is theoretically well justified. To make the computation of the adaptive sampling distribution tractable, we rely on safe lower and upper bounds on the gradient.
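To make the role of such bounds concrete, the following sketch maintains safe intervals [ℓ_i, u_i] on the magnitudes of the partial derivatives during CD on a toy least-squares problem. The global widening constant M and the naive proxy distribution p ∝ u are illustrative assumptions only, not the paper's method, which computes the provably optimal distribution from the bounds:

```python
import random

# Deliberately simplified illustration of "safe bounds" during CD on a
# least-squares objective f(x) = 0.5 * ||A x - b||^2. After a step of size
# delta on coordinate j, every partial derivative moves by at most
# |a_i^T a_j| * |delta|, so intervals [lo_i, hi_i] containing |grad_i f(x)|
# remain valid. Sampling proportional to hi is a naive stand-in for the
# paper's optimal distribution.
A = [[1.0, 0.2, 0.0],
     [0.0, 2.0, 0.3],
     [0.5, 0.0, 3.0]]
b = [1.0, -1.0, 2.0]
n = 3

def grad(x):
    r = [sum(A[row][c] * x[c] for c in range(n)) - b[row] for row in range(len(A))]
    return [sum(A[row][i] * r[row] for row in range(len(A))) for i in range(n)]

def f(x):
    r = [sum(A[row][c] * x[c] for c in range(n)) - b[row] for row in range(len(A))]
    return 0.5 * sum(ri * ri for ri in r)

L = [sum(A[row][i] ** 2 for row in range(len(A))) for i in range(n)]
# M bounds |a_i^T a_j| over all coordinate pairs (a_i = columns of A).
M = max(abs(sum(A[r][i] * A[r][j] for r in range(len(A))))
        for i in range(n) for j in range(n))

random.seed(2)
x = [0.0] * n
g0 = grad(x)
lo = [abs(g) for g in g0]  # initially exact: lo = hi = |grad f(x)|
hi = [abs(g) for g in g0]
for _ in range(100):
    i = random.choices(range(n), weights=[h + 1e-12 for h in hi])[0]
    delta = -grad(x)[i] / L[i]          # exact partial derivative, safe step
    x[i] += delta
    # Widen all intervals by the worst-case change M * |delta| ...
    lo = [max(0.0, l - M * abs(delta)) for l in lo]
    hi = [h + M * abs(delta) for h in hi]
    lo[i] = hi[i] = abs(grad(x)[i])     # ... except coordinate i, known exactly

print(f([0.0] * n), "->", f(x))
```

The invariant lo_i ≤ |∇_i f(x)| ≤ hi_i is preserved by construction, since for least squares a step of size δ on coordinate j changes ∇_i f by exactly (a_i^T a_j)·δ.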
However, in contrast to previous approaches, we use these bounds in a novel way: in each iteration, we formulate the problem of picking the optimal sampling distribution as a convex optimization problem and present an efficient algorithm to compute the solution. The novel sampling provably performs better than any fixed importance sampling, a guarantee which could not be established for previous samplings that were also derived from safe lower and upper bounds. The computational cost of the proposed scheme is of the order O(n log n) per iteration; on many problems this is comparable with the cost of evaluating a single component (coordinate, sum-structure) of the gradient, and the scheme can thus be implemented at essentially no extra computational cost. This is verified by timing experiments on real datasets.
We discussed one simple method to track the gradient information in GLMs during optimization. However, we feel that the machine learning community could profit from further research in that direction, for instance by investigating how such safe bounds can efficiently be maintained for more complex models. Our approach can immediately be applied when the tracking of the gradient is delegated to other machines in a distributed setting, as for instance in [1].

References

[1] Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance Reduction in SGD by Distributed Importance Sampling. arXiv.org, February 2015.

[2] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling. In ICML 2016 - Proceedings of the 33rd International Conference on Machine Learning, pages 1110–1119, June 2016.

[3] Atsushi Shibagaki and Ichiro Takeuchi. Stochastic Primal Dual Coordinate Method with Non-Uniform Sampling Based on Optimality Violations. arXiv.org, October 2017.

[4] Stephen P Boyd and Lieven Vandenberghe.
Convex optimization. Cambridge University Press, 2004.

[5] Dominik Csiba, Zheng Qu, and Peter Richtárik. Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, February 2015.

[6] Dominik Csiba and Peter Richtárik. Importance Sampling for Minibatches. arXiv.org, February 2016.

[7] Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, December 2007.

[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1–22, 2010.

[9] Wenjiang J. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

[10] Xi He and Martin Takáč. Dual Free Adaptive Mini-batch SDCA for Empirical Risk Minimization. arXiv.org, October 2015.

[11] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and S Sundararajan. A Dual Coordinate Descent Method for Large-scale Linear SVM. In ICML 2008 - the 25th International Conference on Machine Learning, pages 408–415, New York, USA, 2008. ACM Press.

[12] Hidetoshi Komiya. Elementary proof for Sion's minimax theorem. Kodai Math. J., 11(1):5–7, 1988.

[13] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for projected stochastic subgradient descent. arXiv.org, December 2012.

[14] Jun Liu, Zheng Zhao, Jie Wang, and Jieping Ye. Safe Screening with Variational Inequalities and Its Application to Lasso. In ICML 2014 - Proceedings of the 31st International Conference on Machine Learning, pages 289–297, 2014.

[15] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon.
Gap Safe screening rules for sparsity enforcing penalties. JMLR, 2017.

[16] Deanna Needell, Rachel Ward, and Nathan Srebro. Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm. In NIPS 2014 - Advances in Neural Information Processing Systems 27, pages 1017–1025, 2014.

[17] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[18] Yurii Nesterov. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[19] Yurii Nesterov and Sebastian U. Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

[20] Julie Nutini, Mark W Schmidt, Issam H Laradji, Michael P Friedlander, and Hoyt A Koepke. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection. In ICML, pages 1632–1641, 2015.

[21] Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K. Dokania, and Simon Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 593–602. JMLR.org, 2016.

[22] Guillaume Papa, Pascal Bianchi, and Stéphan Clémençon. Adaptive Sampling for Incremental Optimization Using Stochastic Gradient Descent. ALT 2015 - 26th International Conference on Algorithmic Learning Theory, pages 317–331, 2015.

[23] Dmytro Perekrestenko, Volkan Cevher, and Martin Jaggi. Faster Coordinate Descent via Adaptive Importance Sampling. In AISTATS 2017 - Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 869–877.
PMLR, 20–22 Apr 2017.

[24] Zheng Qu, Peter Richtárik, and Tong Zhang. Randomized Dual Coordinate Ascent with Arbitrary Sampling. arXiv.org, November 2014.

[25] Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.

[26] Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields. In AISTATS 2015 - Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 819–828. PMLR, 09–12 May 2015.

[27] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3–30, October 2010.

[28] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic Methods for l1-regularized Loss Minimization. JMLR, 12:1865–1892, June 2011.

[29] Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR, 14:567–599, February 2013.

[30] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.

[31] S. U. Stich, C. L. Müller, and B. Gärtner. Variable metric random pursuit. Mathematical Programming, 156(1):549–579, Mar 2016.

[32] Sebastian U. Stich, Anant Raj, and Martin Jaggi. Approximate steepest coordinate descent. In Doina Precup and Yee Whye Teh, editors, ICML 2017 - Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3251–3259. PMLR, 06–11 Aug 2017.

[33] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2008.

[34] Paul Tseng and Sangwoon Yun.
A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387–423, 2009.

[35] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

[36] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1–9. PMLR, 07–09 Jul 2015.

[37] Rong Zhu. Gradient-based sampling: An adaptive importance sampling for least-squares. In NIPS 2016 - Advances in Neural Information Processing Systems 29, pages 406–414, 2016.