{"title": "Limitations on Variance-Reduction and Acceleration Schemes for Finite Sums Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3540, "page_last": 3549, "abstract": "We study the conditions under which one is able to efficiently apply variance-reduction and acceleration schemes on finite sums problems. First, we show that perhaps surprisingly, the finite sum structure, by itself, is not sufficient for obtaining a complexity bound of $\\tilde{\\cO}((n+L/\\mu)\\ln(1/\\epsilon))$ for $L$-smooth and $\\mu$-strongly convex finite sums - one must also know exactly which individual function is being referred to by the oracle at each iteration. Next, we show that for a broad class of first-order and coordinate-descent finite sums algorithms (including, e.g., SDCA, SVRG, SAG), it is not possible to get an `accelerated' complexity bound of $\\tilde{\\cO}((n+\\sqrt{n L/\\mu})\\ln(1/\\epsilon))$, unless the strong convexity parameter is given explicitly. Lastly, we show that when this class of algorithms is used for minimizing $L$-smooth and non-strongly convex finite sums, the optimal complexity bound is $\\tilde{\\cO}(n+L/\\epsilon)$, assuming that (on average) the same update rule is used for any iteration, and $\\tilde{\\cO}(n+\\sqrt{nL/\\epsilon})$, otherwise.", "full_text": "Limitations on Variance-Reduction and Acceleration\n\nSchemes for Finite Sum Optimization\n\nDepartment of Computer Science and Applied Mathematics\n\nYossi Arjevani\n\nWeizmann Institute of Science\n\nRehovot 7610001, Israel\n\nyossi.arjevani@weizmann.ac.il\n\nAbstract\n\nWe study the conditions under which one is able to ef\ufb01ciently apply variance-\nreduction and acceleration schemes on \ufb01nite sum optimization problems. 
First, we show that, perhaps surprisingly, the finite sum structure by itself is not sufficient for obtaining a complexity bound of Õ((n + L/µ) ln(1/ε)) for L-smooth and µ-strongly convex individual functions; one must also know which individual function is being referred to by the oracle at each iteration. Next, we show that for a broad class of first-order and coordinate-descent finite sum algorithms (including, e.g., SDCA, SVRG, SAG), it is not possible to get an 'accelerated' complexity bound of Õ((n + √(nL/µ)) ln(1/ε)), unless the strong convexity parameter is given explicitly. Lastly, we show that when this class of algorithms is used for minimizing L-smooth and convex finite sums, the iteration complexity is bounded from below by Ω(n + L/ε), assuming that (on average) the same update rule is used in any iteration, and Ω(n + √(nL/ε)) otherwise.

1 Introduction

An optimization problem principal to machine learning and statistics is that of finite sums:

min_{w ∈ R^d} F(w) := (1/n) Σ_{i=1}^n f_i(w),   (1)

where the individual functions f_i are assumed to possess some favorable analytical properties, such as Lipschitz-continuity, smoothness or strong convexity (see [16] for details). We measure the iteration complexity of a given optimization algorithm by determining how many evaluations of individual functions (via some external oracle procedure, along with their gradient, Hessian, etc.) are needed in order to obtain an ε-solution, i.e., a point w ∈ R^d which satisfies E[F(w) − min_{w ∈ R^d} F(w)] < ε (where the expectation is taken w.r.t. the algorithm and the oracle randomness).

Arguably, the simplest way of minimizing finite sum problems is by using optimization algorithms for general optimization problems.
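As a small illustration of the oracle-complexity setup above, the following sketch builds a tiny finite sum from made-up one-dimensional quadratic individuals and evaluates the suboptimality gap F(w) − min F that the ε-solution criterion refers to; the specific individuals are hypothetical choices for illustration only.

```python
import numpy as np

# A toy finite sum: f_i(w) = 0.5 * (w - b_i)^2, so F(w) = (1/n) sum_i f_i(w).
# These quadratics are illustrative; any smooth individuals would do.
rng = np.random.default_rng(0)
n = 5
b = rng.standard_normal(n)

def F(w):
    return np.mean(0.5 * (w - b) ** 2)

# For this choice, F is minimized at the mean of the b_i's.
w_star = b.mean()

# The suboptimality gap F(w) - F(w*) is the quantity an epsilon-solution bounds.
gap_at_zero = F(0.0) - F(w_star)
print(gap_at_zero >= 0)  # the gap is always non-negative
```

An ε-solution is then any point whose gap, in expectation, falls below ε.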
For concreteness of the following discussion, let us assume for the moment that the individual functions are L-smooth and µ-strongly convex. In this case, by applying vanilla Gradient Descent (GD) or Accelerated Gradient Descent (AGD, [16]), one obtains iteration complexity of

Õ(nκ ln(1/ε)) or Õ(n√κ ln(1/ε)),   (2)

respectively, where κ := L/µ denotes the condition number of the problem and Õ hides logarithmic factors in the problem parameters. However, whereas such bounds enjoy logarithmic dependence on the accuracy level, the multiplicative dependence on n renders this approach unsuitable for modern applications where n is very large.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

A different approach to tackle a finite sum problem is by reformulating it as a stochastic optimization problem, i.e., min_{w ∈ R^d} E_{i∼U([n])}[f_i(w)], and then applying a general stochastic method, such as SGD, which allows iteration complexity of O(1/ε) or O(1/ε²) (depending on the problem parameters). These methods offer rates which do not depend on n, and are therefore attractive for situations where one seeks a solution of relatively low accuracy.
An evident drawback of these methods is their broad applicability to general stochastic optimization problems, which may conflict with the goal of efficiently exploiting the unique noise structure of finite sums (indeed, in the general stochastic setting, these rates cannot be improved, e.g., [1, 18]).

In recent years, a major breakthrough was made when stochastic methods specialized to finite sums (first SAG [19] and SDCA [21], and then SAGA [10], SVRG [11], SDCA without duality [20], and others) were shown to obtain iteration complexity of

Õ((n + κ) ln(1/ε)).   (3)

The ability of these algorithms to enjoy both a logarithmic dependence on the accuracy parameter and an additive dependence on n is widely attributed to the fact that the noise of finite sum problems distributes over a finite set of size n. Perhaps surprisingly, in this paper we show that another key ingredient is crucial, namely, a means of knowing which individual function is being referred to by the oracle at each iteration. In particular, this shows that variance-reduction mechanisms (see, e.g., [10, Section 3]) cannot be applied without explicitly knowing the 'identity' of the individual functions. On the more practical side, this result shows that when data augmentation (e.g., [14]) is done without an explicit enumeration of the added samples, it is impossible to obtain iteration complexity as stated in (3) (see [7] for relevant upper bounds).

Although variance-reduction mechanisms are essential for obtaining an additive dependence on n (as shown in (3)), they do not necessarily yield 'accelerated' rates which depend on the square root of the condition number (as shown in (2) for AGD).
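The variance-reduction mechanism referenced above (in the style of [10, Section 3]) can be sketched numerically: the direction ∇f_i(w) − ∇f_i(w̃) + ∇F(w̃) is an unbiased estimate of ∇F(w) whose variance becomes small when w and the reference point w̃ approach the minimizer. The quadratic individuals below are illustrative choices, not the paper's construction.

```python
import numpy as np

# Illustrative individuals f_i(w) = 0.5 * a_i * (w - b_i)^2.
rng = np.random.default_rng(1)
n = 100
a = rng.uniform(0.5, 2.0, n)
c = rng.standard_normal(n)

def grad_i(i, w):
    return a[i] * (w - c[i])

def grad_F(w):
    return np.mean(a * (w - c))

w_star = np.sum(a * c) / np.sum(a)
w, w_ref = w_star + 0.01, w_star + 0.02   # both close to the minimizer

# Plain stochastic gradient vs. variance-reduced gradient, over all choices of i.
plain = np.array([grad_i(i, w) for i in range(n)])
vr = np.array([grad_i(i, w) - grad_i(i, w_ref) + grad_F(w_ref) for i in range(n)])

# Both estimators average to grad F(w), but the variance-reduced one is far
# more concentrated when w and w_ref are near w*.
print(plain.var(), vr.var())
```

Note that forming the correction term ∇f_i(w̃) requires querying the *same* individual i at both w and w̃, which is exactly what a stochastic oracle that hides indices forbids.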
Recently, generic acceleration schemes were used by [13] and accelerated SDCA [22] to obtain iteration complexity of

Õ((n + √(nκ)) ln(1/ε)).   (4)

The question of whether this rate is optimal was answered affirmatively by [23, 12, 5, 3]. The first category of lower bounds exploits the degrees of freedom offered by a d- (or an infinite-) dimensional space to show that any first-order and a certain class of second-order methods cannot obtain better rates than (4) in the regime where the number of iterations is less than O(d/n). The second category of lower bounds is based on maintaining the complexity of the functional form of the iterates, thereby establishing bounds for first-order and coordinate-descent algorithms whose step sizes are oblivious to the problem parameters (e.g., SAG, SAGA, SVRG, SDCA, SDCA without duality) which hold for any number of iterations, regardless of d and n.

In this work, we further extend the theory of oblivious finite sum algorithms by showing that if a first-order and a coordinate-descent oracle are used, then acceleration is not possible without explicit knowledge of the strong convexity parameter. This implies that in cases where only a poor estimate of the strong convexity is available, faster rates may be obtained through 'adaptive' algorithms (see relevant discussions in [19, 4]).

Next, we show that in the smooth and convex case, oblivious finite sum algorithms which, on average, apply the same update rule at each iteration (e.g., SAG, SDCA, SVRG, SVRG++ [2], and, typically, other algorithms with a variance-reduction mechanism as described in [10, Section 3]) are bound to iteration complexity of Ω(n + L/ε), where L denotes the smoothness parameter (rather than Ω(n + √(nL/ε))).
To show this, we employ a restarting scheme (see [4]) which explicitly introduces the strong convexity parameter into algorithms that are designed for smooth and convex functions. Finally, we use this scheme to establish a tight dimension-free lower bound for smooth and convex finite sums which holds for oblivious algorithms with a first-order and a coordinate-descent oracle.

To summarize, our contributions (in order of appearance) are the following:

• In Section 2, we prove that in the setting of stochastic optimization, having finitely supported noise (as in finite sum problems) is not sufficient for obtaining linear convergence rates with a linear dependence on n; one must also know exactly which individual function is being referred to by the oracle at each iteration. Deriving similar results for various settings, we show that SDCA, accelerated SDCA, SAG, SAGA, SVRG, SVRG++ and other finite sum algorithms must have a proper enumeration of the individual functions in order to obtain their stated convergence rate.

• In Section 3.1, we lay the foundations of the framework of general CLI algorithms (see [3]), which enables us to formally address oblivious algorithms (e.g., algorithms whose step sizes are scheduled regardless of the function at hand). In Section 3.2, we improve upon [4] by showing that (in this generalized framework) oblivious, deterministic or stochastic, finite sum algorithms with both first-order and coordinate-descent oracles cannot perform better than Ω(n + κ ln(1/ε)), unless the strong convexity parameter is provided explicitly.
In particular, the richer expressive power of this framework allows addressing incremental gradient methods, such as Incremental Gradient Descent [6] and Incremental Aggregated Gradient [8, IAG].

• In Section 3.3, we show that, in the L-smooth and convex case, the optimal complexity bound (in terms of the accuracy parameter) of oblivious algorithms whose update rules are (on average) fixed for any iteration is Ω(n + L/ε) (rather than Õ(n + √(nL/ε)), as obtained, e.g., by accelerated SDCA). To show this, we first invoke a restarting scheme (used by [4]) to explicitly introduce strong convexity into algorithms for finite sums with smooth and convex individuals, and then apply the result derived in Section 3.2.

• In Section 3.4, we use the reduction introduced in Section 3.3 to show that the optimal iteration complexity of minimizing L-smooth and convex finite sums using oblivious algorithms equipped with a first-order and a coordinate-descent oracle is Ω(n + √(nL/ε)).

2 The Importance of Individual Identity

In the following, we address the stochastic setting of finite sum problems (1), where one is equipped with a stochastic oracle which, upon receiving a call, returns some individual function chosen uniformly at random and hides its index. We show that not knowing the identity of the function returned by the oracle (as opposed to an incremental oracle which addresses the specific individual functions chosen by the user) significantly harms the optimal attainable performance. To this end, we reduce the statistical problem of estimating the bias of a noisy coin to that of optimizing finite sums.
This reduction (presented below) makes extensive use of elementary definitions and tools from information theory, all of which can be found in [9].

First, given n ∈ N, we define the following finite sum problem

F_σ := (1/n) ( ((n − σ)/2) f⁺ + ((n + σ)/2) f⁻ ),   (5)

where n is w.l.o.g. assumed to be odd, σ ∈ {−1, 1}, and f⁺, f⁻ are some functions (to be defined later). We then define the following discrepancy measure between F_1 and F_{−1} for different values of n (see also [1]):

δ(n) = min_{w ∈ R^d} { F_1(w) + F_{−1}(w) − F_1* − F_{−1}* },   (6)

where F_σ* := inf_w F_σ(w). It is easy to verify that no solution can be δ(n)/4-optimal for both F_1 and F_{−1} at the same time. Thus, by running a given optimization algorithm long enough to obtain a δ(n)/4-solution w.h.p., we can deduce the value of σ. Also, note that one can simplify the computation of δ(n) by choosing convex f⁺, f⁻ such that f⁺(w) = f⁻(−w). Indeed, in this case we have F_1(w) = F_{−1}(−w) (in particular, F_1* = F_{−1}*), and since F_1(w) + F_{−1}(w) − F_1* − F_{−1}* is convex, it must attain its minimum at w = 0, which yields

δ(n) = 2(F_1(0) − F_1*).   (7)

Next, we let σ ∈ {−1, 1} be drawn uniformly at random, and then use the given optimization algorithm to estimate the bias of a random variable X which, conditioned on σ, takes the value +1 w.p. 1/2 + σ/2n, and −1 w.p. 1/2 − σ/2n. To implement the stochastic oracle described above, conditioned on σ, we draw k i.i.d. copies of X, denoted X_1, …, X_k, and return f⁻ if X_i = 1, and f⁺ otherwise.
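The statistical heart of this reduction can be mimicked numerically. The sketch below (an illustration, not the paper's exact construction) draws k i.i.d. copies of X with bias σ/n in the mean and recovers σ from the sign of the empirical mean; since the mean is σ/n and the standard deviation of the sample mean is 1/√k, reliable recovery indeed requires k on the order of n².

```python
import random

# Conditioned on sigma, X takes +1 w.p. 1/2 + sigma/(2n) and -1 otherwise,
# so E[X | sigma] = sigma/n. We guess sigma by the sign of the empirical mean.
random.seed(0)
n = 15

def draw_X(sigma):
    return 1 if random.random() < 0.5 + sigma / (2 * n) else -1

def estimate_sigma(sigma, k):
    mean = sum(draw_X(sigma) for _ in range(k)) / k
    return 1 if mean >= 0 else -1

k = 50 * n * n  # well above the n^2/2 threshold established below
print(estimate_sigma(+1, k), estimate_sigma(-1, k))
```

With k of order n², the signal σ/n dominates the noise 1/√k by a comfortable margin, whereas for k much smaller than n² the sign of the empirical mean is essentially a coin flip.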
Now, if k is such that

E[F_σ(w^(k)) − F_σ* | σ] ≤ δ(n)/40

for both σ = −1 and σ = 1, then by Markov's inequality, we have that

P( F_σ(w^(k)) − F_σ* ≥ δ(n)/4 | σ ) ≤ 1/10   (8)

(note that F_σ(w^(k)) − F_σ* is a non-negative random variable). We may now try to guess the value of σ using the following estimator:

σ̂(w^(k)) = argmin_{σ' ∈ {−1,1}} { F_{σ'}(w^(k)) − F_{σ'}* },

whose probability of error, as follows from Inequality (8), is

P(σ̂ ≠ σ) ≤ 1/10.   (9)

Lastly, we show that the existence of an estimator for σ with high probability of success implies that k = Ω(n²). To this end, note that the conditional dependence structure of this probabilistic setting can be modeled as follows: σ → X_1, …, X_k → σ̂. Thus, we have

H(σ | X_1, …, X_k) ≤(a) H(σ | σ̂) ≤(b) H_b(P(σ̂ ≠ σ)) ≤(c) 1/2,   (10)

where H(·) and H_b(·) denote the Shannon entropy function and the binary entropy function, respectively; (a) follows by the data-processing inequality (in terms of entropy), (b) follows by Fano's inequality, and (c) follows from Equation (9). Applying standard entropy identities, we get

H(σ | X_1, …, X_k) =(d) H(X_1, …, X_k | σ) + H(σ) − H(X_1, …, X_k)
=(e) kH(X_1 | σ) + 1 − H(X_1, …, X_k)
≥(f) kH(X_1 | σ) + 1 − kH(X_1),   (11)

where (d) follows from Bayes' rule, (e) follows by the fact that the X_i, conditioned on σ, are i.i.d., and (f) follows from the chain rule and the fact that conditioning reduces entropy.
Combining this with Inequality (10) and rearranging, we have

k ≥ 1 / ( 2(H(X_1) − H(X_1 | σ)) ) ≥ 1 / ( 2(1/n)² ) = n²/2,

where the last inequality follows from the fact that H(X_1) = 1 and the following estimate for the binary entropy function: H_b(p) ≥ 1 − 4(p − 1/2)² (see Lemma 2, Appendix A). Thus, we arrive at the following statement.

Lemma 1. The minimal number of stochastic oracle calls required to obtain a δ(n)/40-optimal solution for problem (5) is ≥ n²/2.

Instantiating this scheme for f⁺, f⁻ of various analytical properties yields the following.

Theorem 1. When solving a finite sum problem (defined in (1)) with a stochastic oracle, one needs at least n²/2 oracle calls in order to obtain an accuracy level of:

1. (κ + 1)/(40n²) for smooth and strongly convex individuals with condition number κ.
2. L/(40n²) for L-smooth and convex individuals.
3. M²/(40λn²) if M/(λn) ≤ 1, and M/(20n) − λ/40 otherwise, for (M + λ)-Lipschitz continuous and λ-strongly convex individuals.

Proof

1. Define

f±(w) = ½ (w ± q)ᵀ A (w ± q),

where A is a d × d diagonal matrix whose diagonal entries are κ, 1, …, 1, and q = (1, 1, 0, …, 0)ᵀ is a d-dimensional vector. One can easily verify that f± are smooth and strongly convex functions with condition number κ, and that

F_σ(w) = ½ (w − (σ/n)q)ᵀ A (w − (σ/n)q) + ½ (1 − 1/n²) qᵀAq.

Therefore, the minimizer of F_σ is (σ/n)q, and using Equation (7), we see that δ(n) = (κ + 1)/n².

2.
We define

f±(w) = (L/2) ‖w ± e_1‖².

One can easily verify that f± are L-smooth and convex functions, and that the minimizer of F_σ is (σ/n)e_1. By Equation (7), we get δ(n) = L/n².

3. We define

f±(w) = M‖w ± e_1‖ + (λ/2)‖w‖²,

over the unit ball. Clearly, f± are (M + λ)-Lipschitz continuous and λ-strongly convex functions. It can be verified that the minimizer of F_σ is (σ min{M/(λn), 1}) e_1. Therefore, by Equation (7), we see that in this case we have

δ(n) = M²/(λn²) if M/(λn) ≤ 1, and 2M/n − λ otherwise.

A few conclusions can readily be made from Theorem 1. First, if a given optimization algorithm obtains an iteration complexity of an order of c(n, κ) ln(1/ε), up to logarithmic factors (including the norm of the minimizer which, in our construction, is of an order of 1/n and coupled with the accuracy parameter), for solving smooth and strongly convex finite sum problems with a stochastic oracle, then

c(n, κ) = Ω̃( n² / ln(n²/(κ + 1)) ).

Thus, the following holds for optimization of finite sums with smooth and strongly convex individuals.

Corollary 1. In order to obtain a linear convergence rate with a linear dependence on n, one must know the index of the individual function addressed by the oracle.

This implies that variance-reduction methods such as SAG, SAGA, SDCA and SVRG (possibly combined with acceleration schemes), which exhibit a linear dependence on n, cannot be applied when data augmentation is used. In general, this conclusion also holds when one applies general first-order optimization algorithms, such as AGD, to finite sums, as this typically results in a linear dependence on n.
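The value δ(n) = L/n² in case 2 of Theorem 1 can be checked numerically. The sketch below evaluates the one-dimensional version of the construction, f±(w) = (L/2)(w ± 1)², with the mixture weights of (5), and verifies Equation (7); the particular values of L and n are arbitrary.

```python
# One-dimensional version of the Theorem 1 (case 2) construction:
# f_plus(w) = (L/2)(w + 1)^2, f_minus(w) = (L/2)(w - 1)^2, mixed as in (5).
L, n = 3.0, 7

def F(w, sigma):
    f_plus = 0.5 * L * (w + 1.0) ** 2
    f_minus = 0.5 * L * (w - 1.0) ** 2
    return ((n - sigma) / 2 * f_plus + (n + sigma) / 2 * f_minus) / n

# F_sigma is minimized at w = sigma/n; delta(n) = 2 * (F_1(0) - F_1^*).
w_star = 1.0 / n
delta = 2.0 * (F(0.0, 1) - F(w_star, 1))
print(delta)  # equals L / n^2
```

Since δ(n) shrinks as 1/n², the n²/2 oracle calls of Lemma 1 translate into the L/(40n²) accuracy threshold stated in the theorem.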
Secondly, if a given optimization algorithm obtains an iteration complexity of an order of n + L^β ‖w^(0) − w*‖² / ε^α for solving smooth and convex finite sum problems with a stochastic oracle, then n + L^(β−α) n^(2(α−1)) = Ω(n²). Therefore, β = α and α ≥ 2, indicating that an iteration complexity of an order of n + L‖w^(0) − w*‖²/ε, as obtained by, e.g., SVRG++, is not attainable with a stochastic oracle. Similar reasoning based on the Lipschitz and strongly convex case in Theorem 1 shows that the iteration complexity guaranteed by accelerated SDCA is also not attainable in this setting.

3 Oblivious Optimization Algorithms

In the previous section, we discussed different situations under which variance-reduction schemes are not applicable. Now, we turn to study under what conditions one can apply acceleration schemes. First, we define the framework of oblivious CLI algorithms. Next, we show that, for this family of algorithms, knowing the strong convexity parameter is crucial for obtaining accelerated rates. We then describe a restarting scheme through which we establish that stationary algorithms (whose update rules are, on average, the same for every iteration) for smooth and convex functions are sub-optimal. Finally, we use this reduction to derive a tight lower bound for smooth and convex finite sums on the iteration complexity of any oblivious algorithm (not just stationary ones).

3.1 Framework

In the sequel, following [3], we present the analytic framework through which we derive iteration complexity bounds. This, perhaps pedantic, formulation will allow us to study somewhat subtle distinctions between optimization algorithms.
First, we give a rigorous definition for a class of optimization problems which emphasizes the role of prior knowledge in optimization.

Definition 1 (Class of Optimization Problems). A class of optimization problems is an ordered triple (F, I, O_f), where F is a family of functions defined over some domain designated by dom(F), I is the side-information given prior to the optimization process, and O_f is a suitable oracle procedure which, upon receiving w ∈ dom(F) and θ in some parameter set Θ, returns O_f(w, θ) ⊆ dom(F) for a given f ∈ F (we shall omit the subscript in O_f when f is clear from the context).

In finite sum problems, F comprises functions as defined in (1); the side-information may contain the smoothness parameter L, the strong convexity parameter µ and the number of individual functions n; and the oracle may allow one to query about a specific individual function (as in the case of an incremental oracle, and as opposed to the stochastic oracle discussed in Section 2). We now turn to define CLI optimization algorithms (see [3] for a more comprehensive discussion).

Definition 2 (CLI). An optimization algorithm is called a Canonical Linear Iterative (CLI) optimization algorithm over a class of optimization problems (F, I, O_f) if, given an instance f ∈ F and initialization points {w_i^(0)}_{i ∈ J} ⊆ dom(F), where J is some index set, it operates by iteratively generating points such that for any i ∈ J,

w_i^(k+1) ∈ Σ_{j ∈ J} O_f( w_j^(k); θ_ij^(k) ),   k = 0, 1, …   (12)

holds, where θ_ij^(k) ∈ Θ are parameters chosen, stochastically or deterministically, by the algorithm, possibly based on the side-information. If the parameters do not depend on previously acquired oracle answers, we say that the given algorithm is oblivious.
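To make Definition 2 concrete, here is a sketch (with illustrative quadratic individuals) of Incremental Gradient Descent written as a CLI algorithm with a single iterate (J = {1}): each step issues one call to the generalized first-order oracle introduced in (13) below, with parameters θ^(k) = (−η_k I, I, 0, i_k) scheduled in advance, so the algorithm is oblivious.

```python
import numpy as np

# Toy individuals f_i(w) = 0.5 * ||w - b_i||^2 (illustrative choices).
b = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
n, d = b.shape

def oracle(w, A, B, c, i):
    # Generalized first-order oracle: O(w; A, B, c, i) = A grad f_i(w) + B w + c.
    grad_fi = w - b[i]
    return A @ grad_fi + B @ w + c

# Incremental gradient descent: at step k, one oracle call with parameters
# (A, B, c, i) = (-eta_k I, I, 0, k mod n). The step sizes eta_k = 1/(k+1) are
# fixed in advance, independent of oracle answers, hence obliviousness.
w = np.zeros(d)
for k in range(300):
    eta = 1.0 / (k + 1)
    w = oracle(w, -eta * np.eye(d), np.eye(d), np.zeros(d), k % n)

# For these quadratics and this schedule, the iterate is exactly the running
# average of the visited centers, so after full cycles it hits the minimizer.
w_star = b.mean(axis=0)
print(np.linalg.norm(w - w_star))
```

A non-oblivious algorithm would be free to choose η_k after inspecting earlier gradients; Definition 2 rules this out for the oblivious subclass studied below.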
For notational convenience, we assume that the solution returned by the algorithm is stored in w_1^(k).

Throughout the rest of the paper, we shall be interested in oblivious CLI algorithms (for brevity, we usually omit the 'CLI' qualifier) equipped with the following two incremental oracles:

Generalized first-order oracle: O(w; A, B, c, i) := A∇f_i(w) + Bw + c,
Steepest coordinate-descent oracle: O(w; j, i) := w + t* e_j,   (13)

where A, B ∈ R^(d×d), c ∈ R^d, i ∈ [n], j ∈ [d], e_j denotes the j'th d-dimensional unit vector, and t* ∈ argmin_{t ∈ R} f_i(w_1, …, w_{j−1}, w_j + t, w_{j+1}, …, w_d). We restrict the oracle parameters such that only one individual function is allowed to be accessed at each iteration. We remark that the family of oblivious algorithms with a first-order and a coordinate-descent oracle is wide and subsumes SAG, SAGA, SDCA, SDCA without duality, SVRG and SVRG++, to name a few. Also, note that coordinate-descent steps w.r.t. partial gradients can be implemented using the generalized first-order oracle by setting A to be some principal minor of the unit matrix (see, e.g., RDCM in [15]). Further, similarly to [3], we allow both first-order and coordinate-descent oracles to be used during the same optimization process.

3.2 No Strong Convexity Parameter, No Acceleration for Finite Sum Problems

Having described our analytic approach, we now turn to present some concrete applications. Below, we show that in the absence of a good estimate of the strong convexity parameter, the optimal iteration complexity of oblivious algorithms is Ω(n + κ ln(1/ε)).
Our proof is based on the technique used in [3, 4] (see [3, Section 2.3] for a brief introduction of the technique).

Given 0 < ε < L, we define the following set of optimization problems (over R^d with d > 1):

F_µ(w) := (1/n) Σ_{i=1}^n ( ½ wᵀ Q_µ w − qᵀw ),   (14)

where Q_µ is the d × d matrix whose top-left 2 × 2 block is

[ (L + µ)/2   (µ − L)/2 ]
[ (µ − L)/2   (L + µ)/2 ],

whose remaining diagonal entries are µ, and whose remaining off-diagonal entries are zero, and q := (εR/√2)(1, 1, 0, …, 0)ᵀ, parametrized by µ ∈ (ε, L) (note that the individual functions are identical; we elaborate more on this below). It can easily be verified that the condition number of F_µ, which we denote by κ(F_µ), is L/µ, and that the corresponding minimizers are w*(µ) = (εR/(√2 µ), εR/(√2 µ), 0, …, 0)ᵀ, with norm ≤ R.

If we are allowed to use a different optimization algorithm for each µ in this setting, then we know that the optimal iteration complexity is of an order of (n + √(nκ(F_µ))) ln(1/ε). However, if we are allowed to use only one single algorithm, then we show that the optimal iteration complexity is of an order of n + κ(F_µ) ln(1/ε). The proof goes as follows.
First, note that in this setting, the oracles defined in (13) take the following form:

Generalized first-order oracle: O(w; A, B, c, i) = A(Q_µ w − q) + Bw + c,
Steepest coordinate-descent oracle: O(w; j, i) = (I − (1/(Q_µ)_jj) e_j (Q_µ)_{j,*}) w + (q_j/(Q_µ)_jj) e_j.   (15)

Now, since the oracle answers are linear in µ, and the k'th iterate is a k-fold composition of sums of oracle answers, it follows that w_1^(k) forms a d-dimensional vector of univariate polynomials in µ of degree ≤ k with (possibly random) coefficients (formally, see Lemma 3, Appendix A). Denoting the polynomial of the first coordinate of E w_1^(k)(µ) by s(µ), we see that for any µ ∈ (ε, L),

E‖w_1^(k)(µ) − w*(µ)‖ ≥ ‖E w_1^(k)(µ) − w*(µ)‖ ≥ | s(µ) − εR/(√2 µ) | ≥ (εR/(√2 L)) | √2 s(µ)µ/(εR) − 1 |,

where the first inequality follows by Jensen's inequality and the second by focusing on the first coordinate of E w_1^(k)(µ) and w*(µ). Lastly, since the coefficients of s(µ) do not depend on µ, we have, by Lemma 4 in Appendix A, that there exists δ > 0 such that for any µ ∈ (L − δ, L) it holds that

(εR/(√2 L)) | √2 s(µ)µ/(εR) − 1 | ≥ (εR/(√2 L)) (1 − 1/κ(F_µ))^(k+1),

by which we derive the following.

Theorem 2.
The iteration complexity of oblivious finite sum optimization algorithms equipped with a first-order and a coordinate-descent oracle whose side-information does not contain the strong convexity parameter is Ω̃(n + κ ln(1/ε)).

The n part of the lower bound holds for any type of finite sum algorithm and is proved in [3, Theorem 5]. The lower bound stated in Theorem 2 is tight up to logarithmic factors and is attained by, e.g., SAG [19]. Although relying on a finite sum with identical individual functions may seem somewhat disappointing, it suggests that some variance-reduction schemes can only give an optimal dependence in terms of n, and that obtaining an optimal dependence in terms of the condition number must be done through other (acceleration) mechanisms (e.g., [13]). Lastly, note that this bound holds for any number of iterations (regardless of the problem parameters).

3.3 Stationary Algorithms for Smooth and Convex Finite Sums are Sub-optimal

In the previous section, we showed that not knowing the strong convexity parameter reduces the optimal attainable iteration complexity. In this section, we use this result to show that whereas general optimization algorithms for smooth and convex finite sum problems obtain iteration complexity of Õ(n + √(nL/ε)), the optimal iteration complexity of stationary algorithms (whose expected update rules are fixed) is Ω(n + L/ε).

The proof (presented below) is based on a general restarting scheme (see Scheme 1) used in [4].
The scheme allows one to apply algorithms which are designed for L-smooth and convex problems to smooth and strongly convex finite sums by explicitly incorporating the strong convexity parameter. The key feature of this reduction is its ability to 'preserve' the exponent of the iteration complexity: from an order of C(f)(L/ε)^α in the non-strongly convex case to an order of (C(f)κ)^α ln(1/ε) in the strongly convex case, where C(f) denotes some quantity which may depend on f but not on k, and α is some positive constant.

SCHEME 1: RESTARTING SCHEME
GIVEN an optimization algorithm A for smooth convex functions with
    f(w^(k)) − f* ≤ C(f) ‖w̄^(0) − w*‖² / k^α
    for any initialization point w̄^(0)
ITERATE for t = 1, 2, …
    restart the step size schedule of A
    initialize A at w̄^(0)
    run A for ⌈(4C(f)/µ)^(1/α)⌉ iterations
    set w̄^(0) to be the point returned by A
END

The proof goes as follows. Suppose A is a stationary CLI optimization algorithm for L-smooth and convex finite sum problems equipped with oracles (13). Also, assume that its convergence rate for k ≥ N, N ∈ N, is of an order of n^γ L^β ‖w^(0) − w*‖² / k^α, for some α, β, γ > 0. First, observe that in this case we must have β = 1, for otherwise we get f(w^(k)) − f* = ((νf)(w^(k)) − (νf)*)/ν ≤ n^γ (νL)^β ‖w^(0) − w*‖²/(ν k^α) = ν^(β−1) n^γ L^β ‖w^(0) − w*‖²/k^α, implying that, simply by scaling f, one can optimize to any level of accuracy using at most N iterations, which contradicts [3, Theorem 5].
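Scheme 1 can be sketched in code. In the illustration below, plain gradient descent with step size 1/L stands in for the inner algorithm A; on an L-smooth convex function it satisfies f(w^(k)) − f* ≤ C(f)‖w̄^(0) − w*‖²/k with C(f) = L/2 and α = 1, and µ-strong convexity then makes each restart round halve the suboptimality gap. The concrete quadratic is an illustrative choice.

```python
import numpy as np

# Inner algorithm A: GD with step 1/L, guaranteeing a C(f)/k rate with
# C(f) = L/2 and alpha = 1 on smooth convex problems.
L_s, mu = 10.0, 1.0
Q = np.diag([L_s, mu])                 # f(w) = 0.5 * w^T Q w, minimized at 0

def f(w):
    return 0.5 * w @ Q @ w

def run_A(w, num_iters):
    for _ in range(num_iters):
        w = w - (1.0 / L_s) * (Q @ w)  # GD step with step size 1/L
    return w

C = L_s / 2.0
inner = int(np.ceil(4 * C / mu))       # ceil((4 C(f)/mu)^(1/alpha)) iterations
w = np.array([1.0, 1.0])
gaps = [f(w)]
for t in range(10):                    # each restart round at least halves f - f*
    w = run_A(w, inner)
    gaps.append(f(w))
print(gaps[-1] <= gaps[0] * 0.5 ** 10)
```

The halving follows from ‖w̄^(0) − w*‖² ≤ (2/µ)(f(w̄^(0)) − f*), which is exactly where the strong convexity parameter enters the scheme explicitly.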
Now, by [4, Lemma 1], Scheme 1 produces a new algorithm whose iteration complexity for smooth and strongly convex finite sums with condition number κ is

O(N + n^γ (L/ε)^α) ⟶ Õ(n + n^γ κ^α ln(1/ε)).   (16)

Finally, stationary algorithms are invariant under this restarting scheme; therefore, the new algorithm cannot depend on µ. Thus, by Theorem 2, it must hold that α ≥ 1 and that max{N, n^γ} = Ω(n), proving the following.

Theorem 3. If the iteration complexity of a stationary optimization algorithm for smooth and convex finite sum problems equipped with a first-order and a coordinate-descent oracle is of the form of the l.h.s. of (16), then it must be at least Ω(n + L/ε).

We note that this lower bound is tight and is attained by, e.g., SDCA.

3.4 A Tight Lower Bound for Smooth and Convex Finite Sums

We now turn to derive a lower bound for finite sum problems with smooth and convex individual functions using the restarting scheme shown in the previous section. Note that here we allow any oblivious optimization algorithm, not just stationary ones. The technique shown in Section 3.2 of reducing an optimization problem to a polynomial approximation problem was used in [3] to derive lower bounds for various settings. The smooth and convex case was proved only for n = 1, and a generalization to n > 1 seems to reduce to a non-trivial approximation problem. Here, using Scheme 1, we are able to avoid this difficulty by reducing the non-strongly convex case to the strongly convex case, for which a lower bound for general n is known.

The proof follows the same lines as the proof of Theorem 3.
Given an oblivious optimization algorithm for finite sums with smooth and convex individual functions equipped with oracles (13), we again apply Scheme 1 to get an algorithm for the smooth and strongly convex case, whose iteration complexity is as in (16). Now, crucially, oblivious algorithms are invariant under Scheme 1 (that is, when applied to a given oblivious algorithm, Scheme 1 produces another oblivious algorithm). Therefore, using [3, Theorem 2], we obtain the following.

Theorem 4. If the iteration complexity of an oblivious optimization algorithm for smooth and convex finite sum problems equipped with a first-order and a coordinate-descent oracle is of the form of the l.h.s. of (16), then it must be at least

Ω(n + √(nL/ε)).

This bound is tight and is attained by, e.g., accelerated SDCA [22]. Optimality in terms of L and ε can be obtained simply by applying Accelerated Gradient Descent [16], or alternatively, by using an accelerated version of SVRG as presented in [17]. More generally, one can apply acceleration schemes, e.g., [13], to get an optimal dependence on ε.

Acknowledgments

We thank Raanan Tvizer and Maayan Maliach for several helpful and insightful discussions.

References

[1] Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

[2] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. Technical report, arXiv preprint, 2016.

[3] Yossi Arjevani and Ohad Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540–3548, 2016.

[4] Yossi Arjevani and Ohad Shamir.
On the iteration complexity of oblivious first-order optimization algorithms. In Proceedings of the 33rd International Conference on Machine Learning, pages 908–916, 2016.

[5] Yossi Arjevani and Ohad Shamir. Oracle complexity of second-order methods for finite-sum problems. arXiv preprint arXiv:1611.04982, 2016.

[6] Dimitri P Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.

[7] Alberto Bietti and Julien Mairal. Stochastic optimization with variance reduction for infinite datasets with finite-sum structure. arXiv preprint arXiv:1610.00970, 2016.

[8] Doron Blatt, Alfred O Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.

[9] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[10] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[11] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[12] Guanghui Lan. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.

[13] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.

[14] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. Large scale kernel machines, pages 301–320, 2007.

[15] Yu Nesterov.
Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[16] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.

[17] Atsushi Nitanda. Accelerated stochastic gradient descent for minimizing finite sums. In Artificial Intelligence and Statistics, pages 195–203, 2016.

[18] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. Information Theory, IEEE Transactions on, 57(10):7036–7056, 2011.

[19] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.

[20] Shai Shalev-Shwartz. SDCA without duality. arXiv preprint arXiv:1502.06177, 2015.

[21] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.

[22] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.

[23] Blake E Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.