{"title": "Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 1538, "page_last": 1546, "abstract": "We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a $O(d/T)$ convergence rate for strongly convex objectives in $d$ dimensions and a $O(\\sqrt{s(\\log d)/T})$ convergence rate when the optimum is $s$-sparse. Our algorithm is based on successively solving a series of $\\ell_1$-regularized optimization problems using Nesterov's dual averaging algorithm. We establish that the error of our solution after $T$ iterations is at most $O(s(\\log d)/T)$, with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.", "full_text": "Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

Alekh Agarwal, Microsoft Research, New York NY; Sahand N. Negahban, Dept. of EECS, MIT; Martin J. Wainwright, Dept.
of EECS and Statistics, UC Berkeley.
alekha@microsoft.com, sahandn@mit.edu, wainwrig@stat.berkeley.edu

Abstract

We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T) convergence rate for strongly convex objectives in d dimensions and a O(√(s(log d)/T)) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ₁-regularized optimization problems using Nesterov's dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.

1 Introduction

Stochastic optimization algorithms have many desirable features for large-scale machine learning, and have been studied intensively in the last few years (e.g., [18, 4, 8, 22]). The empirical efficiency of these methods is backed with strong theoretical guarantees on their convergence rates, which depend on various structural properties of the objective function. More precisely, for an objective function that is strongly convex, stochastic gradient descent enjoys a convergence rate ranging from O(1/T), when feature vectors are extremely sparse, to O(d/T), when feature vectors are dense [9, 14, 10].
This strong convexity condition is satisfied for many common machine learning problems, including boosting, least squares regression, SVMs and generalized linear models among others.

A complementary condition is that of (approximate) sparsity in the optimal solution. Sparse models have proven useful in many applications (see e.g., [6, 5] and references therein), and many statistical procedures seek to exploit such sparsity. It has been shown [15, 19] that when the optimal solution θ* is s-sparse, appropriate versions of the mirror descent algorithm converge at a rate O(s√((log d)/T)). Srebro et al. [20] exploit the smoothness of common loss functions, and obtain improved rates of the form O(η√((s log d)/T)), where η is the noise variance. While the √(log d) scaling makes these methods attractive in high dimensions, their scaling with respect to the iterations T is relatively slow—namely, O(1/√T) as opposed to O(1/T) for strongly convex problems.

Many optimization problems encountered in practice exhibit both features: the objective function is strongly convex, and the optimum is (approximately) sparse. This fact leads to the natural question: is it possible to design algorithms for stochastic optimization that enjoy the best features of both types of structure? More specifically, an algorithm should have a O(1/T) convergence rate, as well as a logarithmic dependence on dimension. The main contribution of this paper is to answer this question in the affirmative, and to analyze a new algorithm that has convergence rate O((s log d)/T) for a strongly convex problem with an s-sparse optimum in d dimensions.
This rate is unimprovable (up to constants) in our setting, meaning that no algorithm can converge at a substantially faster rate. Our analysis also yields optimal rates when the optimum is only approximately sparse.

The algorithm proposed in this paper builds off recent work on multi-step methods for strongly convex problems [11, 10, 12], but involves some new ingredients so as to obtain optimal rates for statistical problems with sparse optima. In particular, we form a sequence of objective functions by decreasing the amount of regularization as the optimization algorithm proceeds, which is quite natural from a statistical viewpoint. Each step of our algorithm can be computed efficiently, with a closed form update rule in many common examples. In summary, the outcome of our development is an optimal one-pass algorithm for many structured statistical problems in high dimensions, with computational complexity linear in the sample size. Numerical simulations confirm our theoretical predictions regarding the convergence rate of the algorithm, and also establish its superiority compared to regularized dual averaging [22] and stochastic gradient descent algorithms. They also confirm that a direct application of the multi-step method of Juditsky and Nesterov [11] is inferior to our algorithm, meaning that our gradual decrease of regularization is quite critical. More details on our results and their proofs can be found in the full-length version of this paper [2].

2 Problem set-up and algorithm description

Given a subset Ω ⊆ R^d and a random variable Z taking values in a space Z, we consider an optimization problem of the form

θ* ∈ arg min_{θ∈Ω} E[L(θ; Z)],    (1)

where L : Ω × Z → R is a given loss function. As is standard in stochastic optimization, we do not have direct access to the expected loss function L(θ) := E[L(θ; Z)], nor to its subgradients. Rather, for a given query point θ ∈ Ω, we observe a stochastic subgradient, meaning a random vector g(θ) ∈ R^d such that E[g(θ)] ∈ ∂L(θ). The goal of this paper is to design algorithms that are suitable for solving the problem (1) when the optimum θ* is (approximately) sparse.

Algorithm description: In order to solve a sparse version of the problem (1), our strategy is to consider a sequence of regularized problems of the form

min_{θ∈Ω′} { L(θ) + λ‖θ‖₁ }.    (2)

Our algorithm involves a sequence of K_T different epochs, where the regularization parameter λ > 0 and the constraint set Ω′ ⊂ Ω change from epoch to epoch. The epochs are specified by:
• a sequence of natural numbers {T_i}_{i=1}^{K_T}, where T_i specifies the length of the i-th epoch,
• a sequence of positive regularization weights {λ_i}_{i=1}^{K_T}, and
• a sequence of positive radii {R_i}_{i=1}^{K_T} and d-dimensional vectors {y_i}_{i=1}^{K_T}, which specify the constraint set, Ω(R_i) := { θ ∈ Ω | ‖θ − y_i‖_p ≤ R_i }, that is used throughout the i-th epoch.

We initialize the algorithm in the first epoch with y₁ = 0, and with any radius R₁ that is an upper bound on ‖θ*‖₁.
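To make the stochastic first-order oracle concrete, here is a minimal sketch in Python (our own illustration, not from the paper): one oracle call for the hinge loss draws a fresh sample and returns a subgradient whose expectation lies in ∂L(θ). The data distribution, the function name, and the bound B are illustrative assumptions.

```python
import numpy as np

def hinge_stoch_subgrad(theta, rng, d=100, B=1.0):
    """One oracle call: draw (x, y) and return a stochastic subgradient
    g(theta) with E[g(theta)] in the subdifferential of the expected loss."""
    x = rng.uniform(-B, B, size=d)           # feature vector with ||x||_inf <= B
    y = 1.0 if rng.random() < 0.5 else -1.0  # random label
    margin = y * (theta @ x)
    # Subgradient of the hinge loss max(0, 1 - y<theta, x>)
    return -y * x if margin < 1.0 else np.zeros(d)

rng = np.random.default_rng(0)
g = hinge_stoch_subgrad(np.zeros(100), rng)
```

Note that ‖g(θ)‖∞ ≤ B here, consistent with the bounded-gradient discussion in Example 1 below.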
The norm ‖·‖_p used in defining the constraint set Ω(R_i) is specified by p = 2 log d/(2 log d − 1), a choice that will be clarified momentarily.

The goal of the i-th epoch is to update y_i ↦ y_{i+1} in such a way that we are guaranteed that ‖y_{i+1} − θ*‖₁² ≤ R²_{i+1} for each i = 1, 2, .... We choose the radii such that R²_{i+1} = R²_i/2, so that upon termination, ‖y_{K_T} − θ*‖₁² ≤ R₁²/2^{K_T−1}. In order to update y_i ↦ y_{i+1}, we run T_i rounds of the stochastic dual averaging algorithm [17] (henceforth DA) on the regularized objective

min_{θ∈Ω(R_i)} { L(θ) + λ_i ‖θ‖₁ }.    (3)

The DA method generates two sequences of vectors {µ_t}_{t=0}^{T_i} and {θ_t}_{t=0}^{T_i}, initialized as µ₀ = 0 and θ₀ = y_i, using a sequence of step sizes {α_t}_{t=0}^{T_i}. At iteration t = 0, 1, ..., T_i, we let g_t be a stochastic subgradient of L at θ_t, and we let ν_t be any element of the subdifferential of the ℓ₁-norm ‖·‖₁ at θ_t. The DA update at time t maps (µ_t, θ_t) ↦ (µ_{t+1}, θ_{t+1}) via the recursions

µ_{t+1} = µ_t + g_t + λ_i ν_t,  and  θ_{t+1} = arg min_{θ∈Ω(R_i)} { α_{t+1} ⟨µ_{t+1}, θ⟩ + ψ_{y_i,R_i}(θ) },    (4)

where the prox function ψ is specified below in (5). The pseudocode describing the overall procedure is given in Algorithm 1. In the stochastic dual averaging updates (4), we use the prox function

ψ_{y_i,R_i}(θ) = (1/(2 R_i² (p − 1))) ‖θ − y_i‖_p²,  where p = 2 log d/(2 log d − 1).    (5)

This particular choice of the prox-function and the specific value of p ensure that the function ψ is strongly convex with respect to the ℓ₁-norm, and has been previously used for sparse stochastic optimization (see e.g. [15, 19, 7]).
In most of our examples, Ω = R^d, and owing to our choice of the prox-function and the feasible set in the update (4), we can compute θ_{t+1} from µ_{t+1} in closed form. Some algebra yields that the update (4) with Ω = R^d is equivalent to

θ_{t+1} = y_i − [ R_i² α_{t+1} (p − 1) / ((1 + ξ) ‖µ_{t+1}‖_q^{q−2}) ] |µ_{t+1}|^{q−1} sign(µ_{t+1}),  where ξ = max{ 0, α_{t+1} (p − 1) R_i ‖µ_{t+1}‖_q − 1 }.

Here |µ_{t+1}|^{q−1} refers to elementwise operations and q = p/(p − 1) is the conjugate exponent to p. We observe that our update (4) computes a subgradient of the ℓ₁-norm rather than computing an exact prox-mapping as in some previous methods [16, 7, 22]. Computing such a prox-mapping for y_i ≠ 0 requires O(d²) computation, which is why we adopt the update (4) with a complexity O(d).

Algorithm 1 Regularization Annealed epoch Dual AveRaging (RADAR)
Require: Epoch length schedule {T_i}_{i=1}^{K_T}, initial radius R₁, step-size multiplier α, prox-function ψ, initial prox-center y₁, regularization parameters λ_i.
for Epoch i = 1, 2, ..., K_T do
  Initialize µ₀ = 0 and θ₀ = y_i.
  for Iteration t = 0, 1, ..., T_i − 1 do
    Update (µ_t, θ_t) ↦ (µ_{t+1}, θ_{t+1}) according to rule (4) with step size α_t = α/√t.
  end for
  Set y_{i+1} = (Σ_{t=1}^{T_i} θ_t)/T_i.
  Update R²_{i+1} = R²_i/2.
end for
Return y_{K_T+1}

Conditions: Having defined our algorithm, we now discuss the conditions on the objective function L(θ) and stochastic gradients that underlie our analysis.

Assumption 1 (Locally Lipschitz).
For each R > 0, there is a constant G = G(R) such that

|L(θ) − L(θ̃)| ≤ G ‖θ − θ̃‖₁    (6)

for all pairs θ, θ̃ ∈ Ω such that ‖θ − θ*‖₁ ≤ R and ‖θ̃ − θ*‖₁ ≤ R.

We note that it suffices to have ‖∇L(θ)‖∞ ≤ G(R) for the above condition. As mentioned, our goal is to obtain fast rates for objectives satisfying a local strong convexity condition, defined below.

Assumption 2 (Local strong convexity (LSC)). The function L : Ω → R satisfies an R-local form of strong convexity (LSC) if there is a non-negative constant γ = γ(R) such that

L(θ̃) ≥ L(θ) + ⟨∇L(θ), θ̃ − θ⟩ + (γ/2) ‖θ − θ̃‖₂²  for all θ, θ̃ ∈ Ω with ‖θ‖₁ ≤ R and ‖θ̃‖₁ ≤ R.    (7)

Some of our results regarding stochastic optimization from a finite sample will use a weaker form of the assumption, called local RSC, exploited in our recent work on statistics and optimization [1, 13]. Our final assumption is a tail condition on the error in the stochastic gradients: e(θ) := g(θ) − E[g(θ)].

Assumption 3 (Sub-Gaussian stochastic gradients). There is a constant σ = σ(R) such that

E[ exp(‖e(θ)‖∞²/σ²) ] ≤ exp(1)  for all θ such that ‖θ − θ*‖₁ ≤ R.    (8)

Clearly, this condition holds whenever the error vector e(θ) has bounded components. More generally, the bound (8) holds whenever each component of the error vector has sub-Gaussian tails.

Some illustrative examples: We now describe some examples that satisfy the above conditions, to illustrate how the various parameters of interest might be obtained in different scenarios.

Example 1 (Classification under Lipschitz losses).
In binary classification, the samples consist of pairs z = (x, y) ∈ R^d × {−1, 1}. Common choices for the loss function L(θ; z) are the hinge loss max(0, 1 − y⟨θ, x⟩) or the logistic loss log(1 + exp(−y⟨θ, x⟩)). Given a distribution P over Z (either the population or the empirical distribution), a common strategy is to draw (x_t, y_t) ~ P at iteration t and use g_t = ∇L(θ; (x_t, y_t)). We now illustrate how our conditions are satisfied in this setting.

• Locally Lipschitz: Both the above examples actually satisfy a stronger global Lipschitz condition, since we have the bound G ≤ ‖∇L(θ)‖∞ ≤ E‖x‖∞. Often, the data satisfies the normalization ‖x‖∞ ≤ B, in which case we get G ≤ B. More generally, tail conditions on the marginal distribution of each coordinate of x ensure that G = O(√(log d)) is valid with high probability.

• LSC: When the expectation in the objective (1) is under the population distribution, the above examples satisfy LSC. Here we focus on the example of the logistic loss, where we define the link function ψ(α) = exp(α)/(1 + exp(α))². We also define Σ = E[x xᵀ] to be the covariance matrix and let σ_min(Σ) denote its minimum singular value. Then a second-order Taylor expansion yields

L(θ̃) − L(θ) − ⟨∇L(θ), θ̃ − θ⟩ = (ψ(⟨θ̄, x⟩)/2) ‖Σ^{1/2}(θ − θ̃)‖₂² ≥ (ψ(BR) σ_min(Σ)/2) ‖θ − θ̃‖₂²,

where θ̄ = aθ + (1 − a)θ̃ for some a ∈ (0, 1). Hence γ ≥ ψ(BR) σ_min(Σ) in this example.

• Sub-Gaussian gradients: Assuming the bound E‖x‖∞ ≤ B, this condition is easily verified. A simple calculation yields σ = 2B, since

‖e(θ)‖∞ = ‖∇L(θ; (x, y)) − ∇L(θ)‖∞ ≤ ‖∇L(θ; (x, y))‖∞ + ‖∇L(θ)‖∞ ≤ 2B.

Example 2 (Least-squares regression). In the regression setup, we are given samples of the form z = (x, y) ∈ R^d × R. The loss function of interest is L(θ; (x, y)) = (y − ⟨θ, x⟩)²/2. To illustrate the conditions more clearly, we assume that our samples are generated as y = ⟨x, θ*⟩ + w, where w ~ N(0, η²) and E[x xᵀ] = Σ, so that E[L(θ; (x, y))] = ‖Σ^{1/2}(θ − θ*)‖₂²/2.

• Locally Lipschitz: For this example, the Lipschitz parameter G(R) depends on the bound R. If we define ρ(Σ) = max_i Σ_ii to be the largest variance of a coordinate of x, then a direct calculation yields the bound G(R) ≤ ρ(Σ) R.

• LSC: Again we focus on the case where the expectation is taken under the population distribution, where we have γ = σ_min(Σ).

• Sub-Gaussian gradients: Once again we assume that ‖x‖∞ ≤ B. It can be shown with some work that Assumption 3 is satisfied with σ²(R) = 8ρ(Σ)²R² + 4B⁴R² + 10B²η².

3 Main results and their consequences

In this section we state our main results regarding the convergence of Algorithm 1. We focus on the cases where Assumptions 1 and 3 hold over the entire set Ω, and RSC holds uniformly for all ‖θ‖₁ ≤ R₁; key examples being the hinge and logistic losses from Example 1.
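The quantities in Example 2 are easy to instantiate numerically. A minimal sketch (our own illustration, with an arbitrary sparse θ*, Unif[−B, B] features as in Section 4, and an illustrative noise level): one call returns the stochastic gradient of the population least-squares loss from a single fresh sample, and averaging many such calls at θ* gives approximately the zero vector, confirming unbiasedness at the optimum.

```python
import numpy as np

def lsq_stoch_grad(theta, theta_star, rng, B=1.0, eta=0.5):
    """Stochastic gradient of E[(y - <theta, x>)^2 / 2] from one fresh
    sample (x, y) with y = <x, theta*> + w, w ~ N(0, eta^2)."""
    d = theta.shape[0]
    x = rng.uniform(-B, B, size=d)            # bounded design, ||x||_inf <= B
    y = x @ theta_star + eta * rng.normal()   # noisy linear response
    return (x @ theta - y) * x                # gradient of (y - <theta, x>)^2 / 2

d = 1000
rng = np.random.default_rng(0)
theta_star = np.zeros(d)
theta_star[:5] = 1.0                          # a 5-sparse optimum (illustrative)
# Averaged over many samples, the gradient at theta* is approximately zero.
avg_grad = np.mean([lsq_stoch_grad(theta_star, theta_star, rng)
                    for _ in range(2000)], axis=0)
```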
Extensions to examples such as the least-squares loss, which are not Lipschitz on all of Ω, require a more delicate treatment; these results, as well as the proofs of our results, can be found in the long version [2]. Formally, we assume that G(R) ≡ G and σ(R) ≡ σ in Assumptions 1 and 3. We also use γ to denote γ(R₁) in Assumption 2. For a constant ω > 0 governing the error probability in our results, we also define ω_i² = ω² + 24 log i at epoch i. Our results assume that we run Algorithm 1 with

T_i ≥ c₁ [ (s²/(γ² R_i²)) ((G² + σ²) log d + ω_i² σ²) + log d ],    (9)

where c₁ is a universal constant. For a total of T iterations in Algorithm 1, we state our results for the parameter θ̂_T = y_{K_T+1}, where K_T is the last epoch completed in T iterations.

3.1 Main theorem and some remarks

We start with our main result, which shows an overall convergence rate of O(1/T) after T iterations. This O(1/T) convergence is analogous to earlier work on multi-step methods for strongly convex objectives [11, 12, 10]. For each subset S ⊆ {1, 2, ..., d} of cardinality s, we define

ε²(θ*; S) := ‖θ*_{S^c}‖₁² / s.    (10)

This quantity captures the degree of sparsity in the optimum θ*; for instance, ε²(θ*; S) = 0 if and only if θ* is supported on S. Given the probability parameter ω > 0, we also define the shorthand

κ_T = log₂[ γ² R₁² T / (s² ((G² + σ²) log d + ω² σ²)) ] log d.    (11)

Theorem 1.
Suppose the expected loss L satisfies Assumptions 1–3 with parameters G(R) ≡ G, γ and σ(R) ≡ σ, and we perform updates (4) with epoch lengths (9) and parameters

λ_i² = (R_i γ / (s √T_i)) √((G² + σ²) log d + ω_i² σ²)  and  α(t) = 5 R_i √( log d / ((G² + λ_i² + σ²) t) ).    (12)

Then for any subset S ⊆ {1, ..., d} of cardinality s and any T ≥ 2κ_T, there is a universal constant c₀ such that with probability at least 1 − 6 exp(−ω²/12) we have

‖θ̂_T − θ*‖₂² ≤ c₀ [ (s/(γ² T)) ((G² + σ²) log d + σ² (ω² + log(κ_T/log d))) + ε²(θ*; S) ].    (13)

Consequently, the theorem predicts a convergence rate of O(1/(γ² T)), which is the best possible under our assumptions. Under the setup of Example 1, the error bound of Theorem 1 further simplifies to

‖θ̂_T − θ*‖₂² = O( (s B²/(γ² T)) (log d + ω²) + ε²(θ*; S) ).    (14)

We note that for an approximately sparse θ*, Theorem 1 guarantees convergence only to a tolerance ε²(θ*; S), due to the error terms arising out of the approximate sparsity. Overall, the theorem provides a family of upper bounds, one for each choice of S. The best bound can be obtained by optimizing this choice, trading off the competing contributions of s and ‖θ*_{S^c}‖₁.

At this point, we can compare the result of Theorem 1 to some of the previous work. One approach to minimize the objective (1) is to perform stochastic gradient descent on the objective, which has a convergence rate of O((G̃² + σ̃²)/(γ² T)) [10, 14], where ‖∇L(θ)‖₂ ≤ G̃ and E[exp(‖e(θ)‖₂²/σ̃²)] ≤ exp(1). In the setup of Example 1, G̃² = Bd and similarly for σ̃, giving an exponentially worse scaling in the dimension d. An alternative is to perform mirror descent [15, 19] or regularized dual averaging [22] using the same prox-function as Algorithm 1, but without breaking it up into epochs. As mentioned in the introduction, this single-step method fails to exploit the strong convexity of our problem and obtains inferior convergence rates of O(s√((log d)/T)) [19, 22, 7].

A proposal closer to our approach is to minimize the regularized objective (3), but with a fixed value of λ instead of the decreasing schedule of λ_i used in Theorem 1. This amounts to using the method of Juditsky and Nesterov [11] on the regularized problem, and by using the proof techniques developed in this paper, it can be shown that setting λ = σ√((log d)/T) leads to an overall convergence rate of Õ( (s B²/(γ² T)) (log d + ω²) ), which exhibits the same scaling as Theorem 1. However, with this fixed setting of λ, the initial epochs tend to be much longer than needed for halving the error. Indeed, our setting of λ_i is based on minimizing the upper bound at each epoch, and leads to an improved performance in our numerical simulations. The benefits of slowly decreasing the regularization in the context of deterministic optimization were also noted in the recent work of Xiao and Zhang [23].

3.2 Some illustrative corollaries

We now present some consequences of Theorem 1 by making specific assumptions regarding the sparsity of θ*. The simplest situation is when θ* is supported on some subset S of size s.
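The trade-off over subsets S is concrete: since ε²(θ*; S) = ‖θ*_{S^c}‖₁²/s from (10), the best subset of size s simply keeps the s largest coordinates of θ* in magnitude. A small sketch of this computation (our own illustration, with an arbitrary example vector):

```python
import numpy as np

def approx_sparsity_eps2(theta_star, s):
    """epsilon^2(theta*; S) = ||theta*_{S^c}||_1^2 / s for the best subset S
    of size s, i.e. S = indices of the s largest |theta*_i|."""
    mags = np.sort(np.abs(theta_star))[::-1]  # magnitudes in decreasing order
    tail = mags[s:].sum()                     # ||theta*_{S^c}||_1 for the best S
    return tail**2 / s

theta_star = np.array([2.0, -1.0, 0.5, 0.1, 0.1, 0.0])
```

For an exactly s-sparse θ*, the tail is empty and ε² vanishes, recovering the first case of Corollary 1 below; as s shrinks, ε² grows while the s log d/T term shrinks.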
More generally, Theorem 1 also applies to the case when the optimum θ* is only approximately sparse. One natural form of approximate sparsity is to assume that θ* ∈ B_q(R_q) for 0 < q ≤ 1, where

B_q(R_q) := { θ ∈ R^d | Σ_{i=1}^d |θ_i|^q ≤ R_q }.

For 0 < q ≤ 1, membership in the set B_q(R_q) enforces a decay rate on the components of the vector θ. We now present a corollary of Theorem 1 under such an approximate sparsity condition. To facilitate comparison with minimax lower bounds, we set ω² = δ log d in the corollaries.

Corollary 1. Under the conditions of Theorem 1, for all T > 2κ_T, with probability at least 1 − 6 exp(−δ log d/12), there is a universal constant c₀ such that

‖θ̂_T − θ*‖₂² ≤ c₀ [ ((G² + σ²(1 + δ))/γ²) (s log d)/T + (s σ²/(γ² T)) log(κ_T/log d) ]  if θ* is s-sparse, and

‖θ̂_T − θ*‖₂² ≤ c₀ R_q [ { (G² + σ²(1 + δ)) log d/(γ² T) }^{(2−q)/2} + ( (σ²/(γ² T)) log(κ_T/log d) )^{(2−q)/2} ((1 + δ) log d)^{q/2} ]  if θ* ∈ B_q(R_q).

The first part of the corollary follows directly from Theorem 1 by noting that ε²(θ*; S) = 0 under our assumptions. Note that as q ranges over the interval [0, 1], reflecting the degree of sparsity, the convergence rate ranges from Õ(1/T) (for q = 0, corresponding to exact sparsity) to Õ(1/√T) (for q = 1). This is a rather interesting trade-off, showing in a precise sense how convergence rates vary quantitatively as a function of the underlying sparsity.

It is useful to note that the results on recovery for generalized linear models presented here exactly match those that have been developed in the statistics literature [13, 21], which are optimal under our assumptions on the design vectors.
Concretely, ignoring factors of O(log T), we get a parameter θ̂_T having error at most O(s log d/(γ² T)), with an error probability decaying to zero with d. Moreover, in doing so our algorithm only goes over at most T data samples, as each stochastic gradient can be evaluated with one fresh data sample drawn from the underlying distribution. Since the statistical minimax lower bounds [13, 21] demonstrate that this is the smallest possible error that any method can attain from T samples, our method is statistically optimal in the scaling of the estimation error with the number of samples. We also observe that it is easy to instead set the error probability to δ = ω² log T, if an error probability decaying with T is desired, incurring at most additional log T factors in the error bound. Finally, we also remark that our techniques extend to handle examples such as the least-squares loss that are not uniformly Lipschitz. The details of this extension are deferred to the long version of this paper [2].

Stochastic optimization over finite pools: A common setting for the application of stochastic optimization methods in machine learning is when one has a finite pool of examples, say {Z₁, ..., Z_n}, and the objective (1) takes the form

θ* = arg min_{θ∈Ω} (1/n) Σ_{i=1}^n L(θ; Z_i).    (15)

In this setting, a stochastic gradient g(θ) can be obtained by drawing a sample Z_j at random with replacement from the pool {Z₁, ..., Z_n}, and returning the gradient ∇L(θ; Z_j). In high-dimensional problems where d ≫ n, the sample loss is not strongly convex.
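Sampling with replacement from the pool gives an unbiased stochastic gradient for the objective (15). A sketch with the logistic loss (our own illustration; the pool, labels and function name are assumptions):

```python
import numpy as np

def pool_stoch_grad(theta, X, y, rng):
    """Draw Z_j uniformly with replacement from the pool {Z_1,...,Z_n} and
    return the gradient of the logistic loss log(1 + exp(-y_j <theta, x_j>))."""
    j = rng.integers(X.shape[0])
    xj, yj = X[j], y[j]
    # d/dtheta log(1 + exp(-y <theta, x>)) = -y x / (1 + exp(y <theta, x>))
    return -yj * xj / (1.0 + np.exp(yj * (xj @ theta)))

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
g = pool_stoch_grad(np.zeros(d), X, y, rng)
```

Averaging this oracle over the index j recovers the full gradient of (15), which is what makes it a valid stochastic subgradient in the sense of Section 2.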
However, it has been shown by many researchers [3, 13, 1] that under suitable conditions, this objective does satisfy restricted forms of the LSC assumption, allowing us to appeal to a generalized form of Theorem 1. We will present this corollary only for settings where θ* is exactly sparse, and also specialize to the logistic loss, L(θ; (x, y)) = log(1 + exp(−y⟨θ, x⟩)), to illustrate the key aspects of the result. We recall the definition of the link function ψ(α) = exp(α)/(1 + exp(α))². We will state the result for sub-Gaussian data design with parameters (Σ, η_x²), meaning that E[x_i x_iᵀ] = Σ and ⟨u, x_i⟩ is η_x-sub-Gaussian for any unit norm vector u ∈ R^d.

Corollary 2. Consider the finite-pool loss (15), based on n i.i.d. samples from a sub-Gaussian design with parameters (Σ, η_x²). Suppose that Assumptions 1–3 are satisfied and the optimum θ* of (15) is s-sparse. Then there are universal constants (c₀, c₁, c₂, c₃) such that for all T ≥ 2κ_T and n ≥ c₃ (log d/σ_min²(Σ)) max(σ_min²(Σ), η_x⁴), we have

‖θ̂_T − θ*‖₂² ≤ (c₀/σ_min²(Σ)) (s log d/T) { B²(1 + δ)/ψ²(2BR₁) } + c₀ (s σ²/(σ_min²(Σ) ψ²(2BR₁) T)) log(κ_T/log d),

with probability at least 1 − 2 exp(−c₁ n min(σ_min²(Σ)/η_x⁴, 1)) − 6 exp(−δ log d/12).

We observe that the bound only holds when the number of samples n in the objective (15) is large enough, which is necessary for the restricted form of the LSC condition to hold with non-trivial parameters in the finite sample setting.

A modified method with constant epoch lengths: Algorithm 1 as described is efficient and simple to implement.
However, the convergence results critically rely on the epoch lengths T_i being set appropriately in a doubling manner. This could be problematic in practice, where it might be tricky to know when an epoch should be terminated. Following Juditsky and Nesterov [11], we next demonstrate how a variant of our algorithm with constant epoch lengths enjoys similar rates of convergence. The key challenge here is that unlike the previous set-up [11], our objective function changes at each epoch, which leads to significant technical difficulties. At a very coarse level, if we have a total budget of T iterations, then this version of our algorithm allows us to set the epoch lengths to O(log T), and guarantees convergence rates that are O((log T)/T).

Theorem 2. Suppose the expected loss satisfies Assumptions 1–3 with parameters G, γ, and σ respectively. Let S be any subset of {1, ..., d} of cardinality s. Suppose we run Algorithm 1 for a total of T iterations with epoch length T_i ≡ T log d/κ_T and with parameters as in Equation (12). Assuming that this setting ensures T_i = O(log d), for any set S, with probability at least 1 − 3 exp(−ω²/12),

‖θ̂_T − θ*‖₂² = O( s [ (G² + σ²) log d + (ω² + log(κ/log d)) σ² ] / T · (κ/log d) ).

The theorem shows that up to logarithmic factors in T, setting the epoch lengths optimally is not critical. A similar result can also be proved for the case of least-squares regression.

4 Simulations

In this section we present numerical simulations that back our theoretical convergence results. We focus on least-squares regression, discussed in Example 2. Specifically, we generate samples (x_t, y_t) with each coordinate of x_t distributed as Unif[−B, B] and y_t = ⟨θ*, x_t⟩ + w_t. We pick θ* to be an s-sparse vector with s = ⌈log d⌉, and w_t ~ N(0, η²) with η² = 0.5.
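The data-generation step just described can be reproduced in a few lines (our own sketch; the value of B, the support pattern and the seed are arbitrary illustrative choices):

```python
import numpy as np

def make_sample(theta_star, rng, B=1.0):
    """One (x_t, y_t) pair: x_t ~ Unif[-B, B]^d and y_t = <theta*, x_t> + w_t."""
    d = theta_star.shape[0]
    x = rng.uniform(-B, B, size=d)
    w = rng.normal(scale=np.sqrt(0.5))  # w_t ~ N(0, eta^2) with eta^2 = 0.5
    return x, x @ theta_star + w

d = 4096
s = int(np.ceil(np.log(d)))             # s = ceil(log d)
rng = np.random.default_rng(0)
theta_star = np.zeros(d)
theta_star[rng.choice(d, size=s, replace=False)] = 1.0  # s-sparse optimum
x, y = make_sample(theta_star, rng)
```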
Given an iterate θ_t, we generate a stochastic gradient of the expected loss (1) at (x_t, y_t). For the ℓ₁-norm, we pick the sign vector of θ_t, with 0 for any component that is zero, a member of the ℓ₁-sub-differential.

Our first set of results evaluates Algorithm 1 against other stochastic optimization baselines, assuming complete knowledge of problem parameters. Specifically, epoch i is terminated once ‖y_{i+1} − θ*‖_p² ≤ ‖y_i − θ*‖_p²/2. This ensures that θ* remains feasible throughout, and tests the performance of Algorithm 1 in the most favorable scenario. We compare the algorithm against two baselines. The first baseline is the regularized dual averaging (RDA) algorithm [22], applied to the regularized objective (3) with λ = 4η√((log d)/T), which is the statistically optimal regularization parameter with T samples. We use the same prox-function ψ(θ) = ‖θ‖_p²/(2(p − 1)), so that the theory for RDA predicts a convergence rate of O(s√((log d)/T)) [22]. Our second baseline is the stochastic gradient (SGD) algorithm, which exploits the strong convexity but not the sparsity of the problem (1). Since the squared loss is not uniformly Lipschitz, we impose an additional constraint ‖θ‖₁ ≤ R₁, without which the algorithm does not converge. The results of this comparison are shown in Figure 1(a), where we present the error ‖θ_t − θ*‖₂² averaged over 5 random trials. We observe that RADAR comprehensively outperforms both the baselines, confirming the predictions of our theory.

The second set of results focuses on evaluating algorithms better tailored to our assumptions. Our first baseline here is the approach that we described in our remarks following Theorem 1. In this approach we use the same multi-step strategy as Algorithm 1 but keep λ fixed.
We refer to this as Epoch Dual Averaging (henceforth EDA), and again employ λ = 4η√(log d/T) with this strategy. Our epochs are again determined by halving of the squared ℓ_p-error measured relative to θ*. Finally, we also evaluate the version of our algorithm with constant epoch lengths that we analyzed in Theorem 2 (henceforth RADAR-CONST), using epochs of length log(T). As shown in Figure 1(b), RADAR-CONST has relatively large error during the initial epochs, before converging quite rapidly, a phenomenon consistent with our theory.¹ Even though the RADAR-CONST method does not use the knowledge of θ* to set epochs, all three methods exhibit the same eventual convergence rates, with RADAR (set with optimal epoch lengths) performing the best, as expected. Although RADAR-CONST is very slow in initial iterations, its convergence rate remains competitive with EDA (even though EDA does exploit knowledge of θ*), but is worse than RADAR, as expected. Overall, our experiments demonstrate that RADAR and RADAR-CONST have practical performance consistent with our theoretical predictions. Although the optimal epoch length setting is not too critical for our approach, better data-dependent empirical rules for determining epoch lengths remain an interesting question for future research. The relatively poorer performance of EDA demonstrates the importance of our decreasing regularization schedule.

Figure 1. A comparison of RADAR with other stochastic optimization algorithms for d = 40000 and s = ⌈log d⌉.
The left plot compares RADAR with the RDA and SGD algorithms, neither of which exploits both the sparsity and the strong convexity structures simultaneously. The right one compares RADAR with the EDA and RADAR-CONST algorithms, all of which exploit the problem structure but with varying degrees of effectiveness. We plot ‖θ_t − θ*‖₂² averaged over 5 random trials versus the number of iterations.

5 Discussion

In this paper we present an algorithm that is able to take advantage of the strong convexity and sparsity conditions that are satisfied by many common problems in machine learning. Our algorithm is simple and efficient to implement, and for a d-dimensional objective with an s-sparse optimum, it achieves the minimax-optimal convergence rate O(s log d/T). We also demonstrate optimal convergence rates for problems that have weakly sparse optima, with implications for problems such as sparse linear regression and sparse logistic regression. While we focus our attention exclusively on sparse vector recovery due to space constraints, the ideas naturally extend to other structures such as group-sparse vectors and low-rank matrices. It would be interesting to study similar developments for other algorithms such as mirror descent or Nesterov's accelerated gradient methods, leading to multi-step variants of those methods with optimal convergence rates in our setting.

Acknowledgements

The work of all three authors was partially supported by ONR MURI grant N00014-11-1-0688 to MJW. In addition, AA was partially supported by a Google Fellowship, and SNN was partially supported by the Yahoo KSC award.

¹ To clarify, the epoch lengths in RADAR-CONST are set large enough to guarantee that we can attain an overall error bound of O(1/T), meaning that the initial epochs for RADAR-CONST are much longer than for RADAR.
Thus, after roughly 500 iterations, RADAR-CONST has done only 2 epochs and operates with a crude constraint set Ω(R₁/4). During epoch i, the step size scales proportionally to R_i/√t, where t is the iteration number within the epoch; hence the relatively large initial steps in an epoch can take us to a bad solution even when we start with a good solution y_i, when R_i is large. As R_i decreases further with more epochs, this effect is mitigated, and the error of RADAR-CONST rapidly decreases, as our theory predicts.

References

[1] A. Agarwal, S. N. Negahban, and M. J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. To appear in The Annals of Statistics, 2012. Full-length version http://arxiv.org/pdf/1104.4824v2.
[2] A. Agarwal, S. N. Negahban, and M. J. Wainwright. Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions. 2012. URL http://arxiv.org/abs/1207.4421.
[3] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat., 37(4):1705–1732, 2009.
[4] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, 2007.
[5] P. Bühlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, 2011.
[6] D. L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality, 2000.
[7] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 14–26. Omnipress, 2010.
[8] J. Duchi and Y. Singer. Efficient online and batch learning using forward-backward splitting. Journal of Machine Learning Research, 10:2873–2898, 2009.
[9] E. Hazan, A. Kalai, S. Kale, and A. Agarwal.
Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.
[10] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. Journal of Machine Learning Research - Proceedings Track, 19:421–436, 2011.
[11] A. Juditsky and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. Available online http://hal.archives-ouvertes.fr/docs/00/50/89/33/PDF/Strong-hal.pdf, 2010.
[12] G. Lan and S. Ghadimi. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, Part II: shrinking procedures and optimal algorithms. 2010.
[13] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS Conference, Vancouver, Canada, December 2009. Full-length version arxiv:1010.2731v1.
[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[15] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
[16] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
[17] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming A, 120(1):261–283, 2009.
[18] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[19] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization.
Journal of Machine Learning Research, 12:1865–1892, June 2011.
[20] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise, and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199–2207, 2010.
[21] S. A. van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36:614–645, 2008.
[22] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[23] L. Xiao and T. Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. ICML, 2012. URL http://arxiv.org/abs/1203.3002.