{"title": "Estimation, Optimization, and Parallelism when Data is Sparse", "book": "Advances in Neural Information Processing Systems", "page_first": 2832, "page_last": 2840, "abstract": "We study stochastic optimization problems when the \\emph{data} is sparse, which is in a sense dual to the current understanding of high-dimensional statistical learning and optimization. We highlight both the difficulties---in terms of increased sample complexity that sparse data necessitates---and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. Our algorithms are adaptive: they achieve the best possible rate for the data observed. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on medium to large-scale learning tasks.", "full_text": "Estimation, Optimization, and Parallelism when\n\nData is Sparse\n\nJohn C. Duchi1,2 Michael I. Jordan1\n\nUniversity of California, Berkeley1\n\nBerkeley, CA 94720\n\n{jduchi,jordan}@eecs.berkeley.edu\n\nH. Brendan McMahan2\n\nGoogle, Inc.2\n\nSeattle, WA 98103\n\nmcmahan@google.com\n\nAbstract\n\nWe study stochastic optimization problems when the data is sparse, which is in\na sense dual to current perspectives on high-dimensional statistical learning and\noptimization. We highlight both the dif\ufb01culties\u2014in terms of increased sample\ncomplexity that sparse data necessitates\u2014and the potential bene\ufb01ts, in terms of\nallowing parallelism and asynchrony in the design of algorithms. 
Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks.

1 Introduction and problem setting

In this paper, we investigate stochastic optimization problems in which the data is sparse. Formally, let {F(·; ξ), ξ ∈ Ξ} be a collection of real-valued convex functions, each of whose domains contains the convex set X ⊂ R^d. For a probability distribution P on Ξ, we consider the following optimization problem:

minimize_{x∈X} f(x) := E[F(x; ξ)] = ∫_Ξ F(x; ξ) dP(ξ).   (1)

By data sparsity, we mean the samples ξ are sparse: assuming that samples ξ lie in R^d, and defining the support supp(x) of a vector x to be the set of indices of its non-zero components, we assume

supp ∇F(x; ξ) ⊂ supp ξ.   (2)

The sparsity condition (2) means that F(x; ξ) does not "depend" on the values of x_j for indices j such that ξ_j = 0.¹ This type of data sparsity is prevalent in statistical optimization problems and machine learning applications; in spite of its prevalence, study of such problems has been limited.

As a motivating example, consider a text classification problem: data ξ ∈ R^d represents words appearing in a document, and we wish to minimize a logistic loss F(x; ξ) = log(1 + exp(⟨ξ, x⟩)) on the data (we encode the label implicitly with the sign of ξ). Such generalized linear models satisfy the sparsity condition (2), and while instances are of very high dimension, in any given instance, very few entries of ξ are non-zero [8].
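To make condition (2) concrete, the following sketch (our own illustration, not from the paper; NumPy, with made-up numbers) computes the logistic-loss gradient for a sparse sample and checks that its support is contained in the sample's support:

```python
import numpy as np

def logistic_grad(x, xi):
    """Subgradient of F(x; xi) = log(1 + exp(<xi, x>)).

    The label is encoded in the sign of xi, as in the text. The gradient
    equals sigmoid(<xi, x>) * xi, so it vanishes wherever xi does: this is
    exactly the sparsity condition supp grad F(x; xi) in supp xi.
    """
    s = 1.0 / (1.0 + np.exp(-np.dot(xi, x)))  # sigmoid of the inner product
    return s * xi

x = np.array([0.5, -1.0, 2.0, 0.3])   # a dense predictor
xi = np.array([1.0, 0.0, -1.0, 0.0])  # sparse sample with support {0, 2}
g = logistic_grad(x, xi)
print(np.nonzero(g)[0])  # the gradient's support is contained in {0, 2}
```

In a stochastic gradient method this means a sparse sample touches only the coordinates it appears in, which is the property exploited throughout the paper.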
From a modelling perspective, it thus makes sense to allow a dense predictor x: any non-zero entry of ξ is potentially relevant and important. In a sense, this is dual to the standard approaches to high-dimensional problems; one usually assumes that the data ξ may be dense, but there are only a few relevant features, and thus a parsimonious model x is desirable [2]. So while such sparse data problems are prevalent—natural language processing, information retrieval, and other large data settings all have significant data sparsity—they do not appear to have attracted as much study as their high-dimensional "duals" of dense data and sparse predictors.

¹Formally, if π_ξ denotes the coordinate projection zeroing all indices j of its argument where ξ_j = 0, then F(π_ξ(x); ξ) = F(x; ξ) for all x, ξ. This follows from the first-order conditions for convexity [6].

In this paper, we investigate algorithms and their inherent limitations for solving problem (1) under natural conditions on the data generating distribution. Recent work in the optimization and machine learning communities has shown that data sparsity can be leveraged to develop parallel (and even asynchronous [12]) optimization algorithms [13, 14], but this work does not consider the statistical effects of data sparsity. In another line of research, Duchi et al. [4] and McMahan and Streeter [9] develop "adaptive" stochastic gradient algorithms to address problems in sparse data regimes (2). These algorithms exhibit excellent practical performance and have theoretical guarantees on their convergence, but it is not clear if they are optimal—in that no algorithm can attain better statistical performance—or whether they can leverage parallel computing as in the papers [12, 14].

In this paper, we take a two-pronged approach.
First, we investigate the fundamental limits of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower bounds on the optimization error of any algorithm for problems of the form (1) with sparsity condition (2). These results have two main implications. They show that in some scenarios, learning with sparse data is quite difficult, as essentially each coordinate j ∈ [d] can be relevant and must be optimized for. In spite of this seemingly negative result, we are also able to show that the ADAGRAD algorithms of [4, 9] are optimal, and we show examples in which their dependence on the dimension d can be made exponentially better than standard gradient methods.

As the second facet of our two-pronged approach, we study how sparsity may be leveraged in parallel computing frameworks to give substantially faster algorithms that still achieve optimal sample complexity in terms of the number of samples ξ used. We develop two new algorithms, asynchronous dual averaging (ASYNCDA) and asynchronous ADAGRAD (ASYNCADAGRAD), which allow asynchronous parallel solution of the problem (1) for general convex f and X. Combining insights of Niu et al.'s HOGWILD! [12] with a new analysis, we prove our algorithms achieve linear speedup in the number of processors while maintaining optimal statistical guarantees. We also give experiments on text-classification and web-advertising tasks to illustrate the benefits of the new algorithms.

2 Minimax rates for sparse optimization

We begin our study of sparse optimization problems by establishing their fundamental statistical and optimization-theoretic properties. To do this, we derive bounds on the minimax convergence rate of any algorithm for such problems. Formally, let x̂ denote any estimator for a minimizer of the objective (1). We define the optimality gap Δ_N for the estimator x̂ based on N samples ξ_1, ..., ξ_N from the distribution P as

Δ_N(x̂, F, X, P) := f(x̂) − inf_{x∈X} f(x) = E_P[F(x̂; ξ)] − inf_{x∈X} E_P[F(x; ξ)].   (3)

This quantity is a random variable, since x̂ is a random variable (it is a function of ξ_1, ..., ξ_N). To define the minimax error, we thus take expectations of the quantity Δ_N, though we require a bit more than simply E[Δ_N]. We let P denote a collection of probability distributions, and we consider a collection of loss functions F specified by a collection F of convex losses F : X × Ξ → R. We can then define the minimax error for the family of losses F and distributions P as

Δ*_N(X, P, F) := inf_{x̂} sup_{P∈P} sup_{F∈F} E_P[Δ_N(x̂(ξ_{1:N}), F, X, P)],

where the infimum is taken over all possible estimators x̂ (an estimator is an optimization scheme, or a measurable mapping x̂ : Ξ^N → X).

2.1 Minimax lower bounds

Let us now give a more precise characterization of the (natural) set of sparse optimization problems we consider to provide the lower bound. For the next proposition, we let P consist of distributions supported on Ξ = {−1, 0, 1}^d, and we let p_j := P(ξ_j ≠ 0) be the marginal probability of appearance of feature j ∈ {1, ..., d}. For our class of functions, we set F to consist of functions F satisfying the sparsity condition (2) and with the additional constraint that for g ∈ ∂_x F(x; ξ), we have that the jth coordinate |g_j| ≤ M_j for a constant M_j < ∞. We obtain

Proposition 1. Let the conditions of the preceding paragraph hold. Let R be a constant such that X ⊃ [−R, R]^d. Then

Δ*_N(X, P, F) ≥ (1/8) Σ_{j=1}^d R M_j min{ p_j, √p_j / √(N log 3) }.

We provide the proof of Proposition 1 in Supplement A.1 in the full version of the paper, providing a few remarks here.
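The lower bound of Proposition 1 is easy to evaluate numerically. The following sketch (our own illustration; the values of d, p0, and N are arbitrary) computes the right-hand side under power-law appearance probabilities p_j = p0 · j^(−α), the setting of the corollary below:

```python
import math

def minimax_lower_bound(p, M, R, N):
    """Right-hand side of Proposition 1:
    (R/8) * sum_j M_j * min(p_j, sqrt(p_j) / sqrt(N * log 3))."""
    return (R / 8) * sum(
        Mj * min(pj, math.sqrt(pj) / math.sqrt(N * math.log(3)))
        for pj, Mj in zip(p, M)
    )

# Power-law appearance probabilities p_j = p0 * j^{-alpha}
d, p0, N = 1000, 0.5, 10_000
for alpha in (0.5, 1.5, 2.5):
    p = [p0 * (j + 1) ** (-alpha) for j in range(d)]
    bound = minimax_lower_bound(p, M=[1.0] * d, R=1.0, N=N)
    print(alpha, bound)  # the bound shrinks as alpha grows (sparser data)
```

Coordinates with tiny p_j contribute through the p_j branch of the minimum rather than the 1/√N branch, which is the mechanism behind the regime change at α = 2 in the corollary.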
We begin by giving a corollary to Proposition 1 that follows when the data ξ obeys a type of power law: let p0 ∈ [0, 1], and assume that P(ξ_j ≠ 0) = p0 j^{−α}. We have

Corollary 2. Let α ≥ 0. Let the conditions of Proposition 1 hold with M_j ≡ M for all j, and assume the power law condition P(ξ_j ≠ 0) = p0 j^{−α} on coordinate appearance probabilities. Then

(1) If d > (p0 N)^{1/α},

Δ*_N(X, P, F) ≥ (MR/8) [ (2/(2−α)) √(p0/N) ( (p0 N)^{(2−α)/(2α)} − 1 ) + (p0/(1−α)) ( d^{1−α} − (p0 N)^{(1−α)/α} ) ].

(2) If d ≤ (p0 N)^{1/α},

Δ*_N(X, P, F) ≥ (MR/8) √(p0/N) ( (1/(1−α/2)) d^{1−α/2} − 1/(1−α/2) ).

Expanding Corollary 2 slightly, for simplicity assume the number of samples is large enough that d ≤ (p0 N)^{1/α}. Then we find that the lower bound on optimization error is of order

MR √(p0/N) d^{1−α/2} when α < 2,   MR √(p0/N) log d when α → 2,   and   MR √(p0/N) when α > 2.   (4)

These results beg the question of tightness: are they improvable? As we see presently, they are not.

2.2 Algorithms for attaining the minimax rate

To show that the lower bounds of Proposition 1 and its subsequent specializations are sharp, we review a few stochastic gradient algorithms. We begin with stochastic gradient descent (SGD): SGD repeatedly samples ξ ∼ P, computes g ∈ ∂_x F(x; ξ), then performs the update x ← Π_X(x − ηg), where η is a stepsize parameter and Π_X denotes Euclidean projection onto X.
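Before turning to the analysis, here is a minimal sketch of the projected-SGD loop just described (our own illustration; the toy sparse objective and all constants are invented, and we take X to be a box so that the projection is a clip):

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, radius):
    # Euclidean projection onto the box X = [-radius, radius]^d
    return np.clip(x, -radius, radius)

def sgd(sample_subgrad, d, num_samples, eta, radius):
    """Projected SGD: repeat x <- Pi_X(x - eta * g) and return the
    averaged iterate x_hat(N)."""
    x = np.zeros(d)
    running_sum = np.zeros(d)
    for _ in range(num_samples):
        g = sample_subgrad(x)
        x = project(x - eta * g, radius)
        running_sum += x
    return running_sum / num_samples

# Toy sparse problem (ours, for illustration): F(x; xi) penalizes
# (x_j - xi_j)^2 only on supp(xi), so grad F is supported on supp(xi).
def sample_subgrad(x):
    xi = rng.normal(size=8) * (rng.random(8) < 0.3)
    return (x - xi) * (xi != 0)

x_hat = sgd(sample_subgrad, d=8, num_samples=500, eta=0.1, radius=1.0)
print(np.all(np.abs(x_hat) <= 1.0))
```

Note that even though each sample touches few coordinates, the single scalar stepsize η is shared across all coordinates; this is exactly what the adaptive methods below improve on.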
Standard analyses of stochastic gradient descent [10] show that after N samples ξ_i, the SGD estimator x̂(N) satisfies

E[f(x̂(N))] − inf_{x∈X} f(x) ≤ O(1) R_2 M (Σ_{j=1}^d p_j)^{1/2} / √N,   (5)

where R_2 denotes the ℓ2-radius of X. Dual averaging, due to Nesterov [11] (sometimes called "follow the regularized leader" [5]), is a more recent algorithm. In dual averaging, one again samples g ∈ ∂_x F(x; ξ), but instead of updating the parameter vector x one updates a dual vector z by z ← z + g, then computes

x ← argmin_{x∈X} { ⟨z, x⟩ + (1/η) ψ(x) },

where ψ(x) is a strongly convex function defined over X (often one takes ψ(x) = ½‖x‖₂²). As we discuss presently, the dual averaging algorithm is somewhat more natural in asynchronous and parallel computing environments, and it enjoys the same type of convergence guarantees (5) as SGD.

The ADAGRAD algorithm [4, 9] is an extension of the preceding stochastic gradient methods. It maintains a diagonal matrix S, where upon receiving a new sample ξ, ADAGRAD performs the following: it computes g ∈ ∂_x F(x; ξ), then updates

S_j ← S_j + g_j² for j ∈ [d].

The dual averaging variant of ADAGRAD updates the usual dual vector z ← z + g; the update to x is based on S and a stepsize η and computes

x ← argmin_{x'∈X} { ⟨z, x'⟩ + (1/(2η)) ⟨x', S^{1/2} x'⟩ }.

After N samples ξ, the averaged parameter x̂(N) returned by ADAGRAD satisfies

E[f(x̂(N))] − inf_{x∈X} f(x) ≤ O(1) (R_∞ M / √N) Σ_{j=1}^d √p_j,   (6)

where R_∞ denotes the ℓ∞-radius of X (cf. [4, Section 1.3 and Theorem 5]). By inspection, the ADAGRAD rate (6) matches the lower bound in Proposition 1 and is thus optimal.
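On box domains the ADAGRAD argmin above is coordinate-wise, so one dual-averaging step has a simple closed form. The following sketch (our own illustration; it assumes X = [−radius, radius]^d, which makes the minimization separable) implements one step:

```python
import numpy as np

def adagrad_da_step(z, S, g, eta, radius):
    """One dual-averaging ADAGRAD update on the box X = [-radius, radius]^d.

    Because both the box and S are coordinate-wise, the argmin of
    <z, x> + (1/(2*eta)) <x, S^{1/2} x> has the closed form
    x_j = -eta * z_j / sqrt(S_j), clipped to [-radius, radius].
    """
    S = S + g ** 2  # diagonal sum of squared gradients
    z = z + g       # dual (gradient-sum) vector
    x = np.where(S > 0, -eta * z / np.sqrt(np.maximum(S, 1e-12)), 0.0)
    return np.clip(x, -radius, radius), z, S

z, S = np.zeros(3), np.zeros(3)
for g in (np.array([1.0, 0.0, -2.0]), np.array([0.5, 0.0, 0.0])):
    x, z, S = adagrad_da_step(z, S, g, eta=0.1, radius=1.0)
print(x, S)  # coordinate 1 never appears, so x[1] and S[1] stay 0
```

The per-coordinate scaling 1/√S_j is what lets rarely appearing coordinates take large steps when they do appear, which is the source of the √p_j factors in the rate (6).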
It is interesting to note, though, that in the power law setting of Corollary 2 (recall the error order (4)), a calculation shows that the multiplier for the SGD guarantee (5) becomes R_∞ √d max{d^{(1−α)/2}, 1}, while ADAGRAD attains rate at worst R_∞ max{d^{1−α/2}, log d}. For α > 1, the ADAGRAD rate is no worse, and for α ≥ 2, it is more than √d / log d better—an exponential improvement in the dimension.

3 Parallel and asynchronous optimization with sparsity

As we note in the introduction, recent works [12, 14] have suggested that sparsity can yield benefits in our ability to parallelize stochastic gradient-type algorithms. Given the optimality of ADAGRAD-type algorithms, it is natural to focus on their parallelization in the hope that we can leverage their ability to "adapt" to sparsity in the data. To provide the setting for our further algorithms, we first revisit Niu et al.'s HOGWILD! [12]. HOGWILD! is an asynchronous (parallelized) stochastic gradient algorithm for optimization over product-space domains, meaning that X in problem (1) decomposes as X = X_1 × ··· × X_d, where X_j ⊂ R. Fix a stepsize η > 0. A pool of independently running processors then performs the following updates asynchronously to a centralized vector x:

1. Sample ξ ∼ P
2. Read x and compute g ∈ ∂_x F(x; ξ)
3. For each j s.t. g_j ≠ 0, update x_j ← Π_{X_j}(x_j − η g_j).

Here Π_{X_j} denotes projection onto the jth coordinate of the domain X. The key of HOGWILD! is that in step 2, the parameter x is allowed to be inconsistent—it may have received partial gradient updates from many processors—and for appropriate problems, this inconsistency is negligible. Indeed, Niu et al.
[12] show linear speedup in optimization time as the number of processors grows; they show this empirically in many scenarios, providing a proof under the somewhat restrictive assumptions that there is at most one non-zero entry in any gradient g and that f has Lipschitz gradients.

3.1 Asynchronous dual averaging

A weakness of HOGWILD! is that it appears only applicable to problems for which the domain X is a product space, and its analysis assumes ‖g‖₀ = 1 for all gradients g. In an effort to alleviate these difficulties, we now develop and present our asynchronous dual averaging algorithm, ASYNCDA. ASYNCDA maintains and updates a centralized dual vector z instead of a parameter x, and a pool of processors perform asynchronous updates to z, where each processor independently iterates:

1. Read z and compute x := argmin_{x∈X} { ⟨z, x⟩ + (1/η) ψ(x) }   // Implicitly increment "time" counter t and let x(t) = x
2. Sample ξ ∼ P and let g ∈ ∂_x F(x; ξ)   // Let g(t) = g
3. For j ∈ [d] such that g_j ≠ 0, update z_j ← z_j + g_j.

Because the actual computation of the vector x in ASYNCDA is performed locally on each processor in step 1 of the algorithm, the algorithm can be executed with any proximal function ψ and domain X. The only communication point between any of the processors is the addition operation in step 3. Since addition is commutative and associative, forcing all asynchrony to this point of the algorithm is a natural strategy for avoiding synchronization problems.

In our analysis of ASYNCDA, and in our subsequent analysis of the adaptive methods, we require a measurement of time elapsed. With that in mind, we let t denote a time index that exists (roughly) behind-the-scenes. We let x(t) denote the vector x ∈ X computed in the tth step 1 of the ASYNCDA algorithm, that is, whichever is the tth x actually computed by any of the processors.
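The three ASYNCDA steps can be sketched with Python threads standing in for processors (our own illustration: the toy sparse objective is invented, ψ(x) = ½‖x‖₂², and X is a box so step 1 is a clip; a production implementation would perform the step-3 adds lock-free, as in the paper):

```python
import threading
import numpy as np

def asyncda(sample_subgrad, d, eta, radius, num_workers=4, iters_per_worker=200):
    """Sketch of ASYNCDA on the box X = [-radius, radius]^d.

    The only shared state is the dual vector z; each worker computes its
    (possibly stale) primal point locally and contributes commutative adds.
    """
    z = np.zeros(d)  # shared dual vector

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(iters_per_worker):
            # Step 1: local primal point from a (possibly stale) read of z.
            x = np.clip(-eta * z, -radius, radius)
            # Step 2: sample a subgradient at x.
            g = sample_subgrad(x, rng)
            # Step 3: additive update to z, only on the support of g.
            for j in np.nonzero(g)[0]:
                z[j] += g[j]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return np.clip(-eta * z, -radius, radius)

def sample_subgrad(x, rng):  # toy sparse problem: pull x toward a sparse xi
    xi = rng.normal(size=x.size) * (rng.random(x.size) < 0.3)
    return (x - xi) * (xi != 0)

x_hat = asyncda(sample_subgrad, d=6, eta=0.05, radius=1.0)
print(np.all(np.abs(x_hat) <= 1.0))
```

Because the workers only ever add into z, interleavings of sparse updates commute, which is why the asynchrony is confined to step 3 in the algorithm above.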
This quantity exists and is recoverable from the algorithm, and it is possible to track the running sum Σ_{τ=1}^t x(τ).

Additionally, we state two assumptions encapsulating the conditions underlying our analysis.

Assumption A. There is an upper bound m on the delay of any processor. In addition, for each j ∈ [d] there is a constant p_j ∈ [0, 1] such that P(ξ_j ≠ 0) ≤ p_j.

We also require certain continuity (Lipschitzian) properties of the loss functions; these amount to a second moment constraint on the instantaneous ∂F and a rough measure of gradient sparsity.

Assumption B. There exist constants M and (M_j)_{j=1}^d such that the following bounds hold for all x ∈ X: E[‖∂_x F(x; ξ)‖₂²] ≤ M² and for each j ∈ [d] we have E[|∂_{x_j} F(x; ξ)|] ≤ p_j M_j.

With these definitions, we have the following theorem, which captures the convergence behavior of ASYNCDA under the assumption that X is a Cartesian product, meaning that X = X_1 × ··· × X_d, where X_j ⊂ R, and that ψ(x) = ½‖x‖₂². Note the algorithm itself can still be efficiently parallelized for more general convex X, even if the theorem does not apply.

Theorem 3. Let Assumptions A and B and the conditions in the preceding paragraph hold. Then

E[ Σ_{t=1}^T F(x(t); ξ_t) − F(x*; ξ_t) ] ≤ (1/(2η)) ‖x*‖₂² + (η/2) T M² + η T m Σ_{j=1}^d p_j² M_j².

We now provide a few remarks to explain and simplify the result. Under the more stringent condition that |∂_{x_j} F(x; ξ)| ≤ M_j, Assumption A implies E[‖∂_x F(x; ξ)‖₂²] ≤ Σ_{j=1}^d p_j M_j². Thus, for the remainder of this section we take M² = Σ_{j=1}^d p_j M_j², which upper bounds the Lipschitz continuity constant of the objective function f. We then obtain the following corollary.

Corollary 4. Define x̂(T) = (1/T) Σ_{t=1}^T x(t), and set η = ‖x*‖₂ / (M √T).
Then

E[f(x̂(T)) − f(x*)] ≤ M ‖x*‖₂ / √T + (m ‖x*‖₂ / (2M √T)) Σ_{j=1}^d p_j² M_j².

Corollary 4 is nearly immediate: since ξ_t is independent of x(t), we have E[F(x(t); ξ_t) | x(t)] = f(x(t)); applying Jensen's inequality to f(x̂(T)) and performing an algebraic manipulation give the result. If the data is suitably sparse, meaning that p_j ≤ 1/m, the bound in Corollary 4 simplifies to

E[f(x̂(T)) − f(x*)] ≤ (3/2) M ‖x*‖₂ / √T = (3/2) √(Σ_{j=1}^d p_j M_j²) ‖x*‖₂ / √T,   (7)

which is the convergence rate of stochastic gradient descent even in centralized settings (5). The convergence guarantee (7) shows that after T timesteps, the error scales as 1/√T; however, if we have k processors, updates occur roughly k times as quickly, as they are asynchronous, and in time scaling as N/k, we can evaluate N gradient samples: a linear speedup.

3.2 Asynchronous AdaGrad

We now turn to extending ADAGRAD to asynchronous settings, developing ASYNCADAGRAD (asynchronous ADAGRAD). As in the ASYNCDA algorithm, ASYNCADAGRAD maintains a shared dual vector z (the sum of gradients) and the shared matrix S, which is the diagonal sum of squares of gradient entries (recall Section 2.2). The matrix S is initialized as diag(δ²), where δ_j ≥ 0 is an initial value. Each processor asynchronously performs the following iterations:

1. Read S and z and set G = S^{1/2}. Compute x := argmin_{x∈X} { ⟨z, x⟩ + (1/(2η)) ⟨x, Gx⟩ }   // Implicitly increment "time" counter t and let x(t) = x, S(t) = S
2. Sample ξ ∼ P and let g ∈ ∂F(x; ξ)
3.
For j ∈ [d] such that g_j ≠ 0, update S_j ← S_j + g_j² and z_j ← z_j + g_j.

As in the description of ASYNCDA, we note that x(t) is the vector x ∈ X computed in the tth "step" of the algorithm (step 1), and similarly associate ξ_t with x(t).

To analyze ASYNCADAGRAD, we make a somewhat stronger assumption on the sparsity properties of the losses F than Assumption B.

Assumption C. There exist constants (M_j)_{j=1}^d such that E[(∂_{x_j} F(x; ξ))² | ξ_j ≠ 0] ≤ M_j² for all x ∈ X.

Indeed, taking M² = Σ_j p_j M_j² shows that Assumption C implies Assumption B with specific constants. We then have the following convergence result.

Theorem 5. In addition to the conditions of Theorem 3, let Assumption C hold. Assume that δ² ≥ M_j² m for all j and that X ⊂ [−R_∞, R_∞]^d. Then

Σ_{t=1}^T E[F(x(t); ξ_t) − F(x*; ξ_t)] ≤ Σ_{j=1}^d min{ (R_∞²/η) E[(δ² + Σ_{t=1}^T g_j(t)²)^{1/2}] + η (1 + p_j m) E[(δ² + Σ_{t=1}^T g_j(t)²)^{1/2}],  M_j R_∞ p_j T }.

It is possible to relax the condition on the initial constant diagonal term; we defer this to the full version of the paper.

It is natural to ask in which situations the bound provided by Theorem 5 is optimal. We note that, as in the case with Theorem 3, we may obtain a convergence rate for f(x̂(T)) − f(x*) using convexity, where x̂(T) = (1/T) Σ_{t=1}^T x(t). By Jensen's inequality, we have for any δ that

E[(δ² + Σ_{t=1}^T g_j(t)²)^{1/2}] ≤ (δ² + Σ_{t=1}^T E[g_j(t)²])^{1/2} ≤ (δ² + T p_j M_j²)^{1/2}.

For interpretation, let us now make a few assumptions on the probabilities p_j.
If we assume that p_j ≤ c/m for a universal (numerical) constant c, then Theorem 5 guarantees that

E[f(x̂(T)) − f(x*)] ≤ O(1) (R_∞²/η + η) Σ_{j=1}^d M_j min{ √(log(T)/T + p_j) / √T, p_j },   (8)

which is the convergence rate of ADAGRAD except for a small factor of min{√(log T)/T, p_j} in addition to the usual √p_j / √T rate. In particular, optimizing by choosing η = R_∞, and assuming p_j ≳ (1/T) log T, we have the convergence guarantee

E[f(x̂(T)) − f(x*)] ≤ O(1) R_∞ Σ_{j=1}^d M_j min{ √p_j / √T, p_j },

which is minimax optimal by Proposition 1.

In fact, however, the bounds of Theorem 5 are somewhat stronger: they provide bounds using the expectation of the squared gradients g_j(t) rather than the maximal value M_j, though the bounds are perhaps clearer in the form (8). We note also that our analysis applies to more adversarial settings than stochastic optimization (e.g., to online convex optimization [5]). Specifically, an adversary may choose an arbitrary sequence of functions subject to the random data sparsity constraint (2), and our results provide an expected regret bound, which is strictly stronger than the stochastic convergence guarantees provided (and guarantees high-probability convergence in stochastic settings [3]). Moreover, our comments in Section 2 about the relative optimality of ADAGRAD versus standard gradient methods apply. When the data is sparse, we indeed should use asynchronous algorithms, but using adaptive methods yields even more improvement than simple gradient-based methods.

4 Experiments

In this section, we give experimental validation of our theoretical results on ASYNCADAGRAD and ASYNCDA, giving results on two datasets selected for their high-dimensional sparsity.²

²In our experiments, ASYNCDA and HOGWILD!
had effectively identical performance.

[Figure 1: three panels plotting speedup, training loss, and test error against the number of workers.]

Figure 1. Experiments with URL data. Left: speedup relative to one processor. Middle: training dataset loss versus number of processors. Right: test set error rate versus number of processors. A-ADAGRAD abbreviates ASYNCADAGRAD.

[Figure 2: three panels of relative log-loss versus number of passes: fixed stepsizes on training data (ℓ2 = 0), fixed stepsizes on test data (ℓ2 = 0), and the impact of ℓ2 regularization on test error. Legends compare A-ADAGRAD at η ∈ {0.002, 0.004, 0.008, 0.016} and A-DA at η ∈ {0.8, 1.6, 3.2}, with ℓ2 penalty 0 or 80.]

Figure 2: Relative accuracy for various stepsize choices on a click-through rate prediction dataset.

4.1 Malicious URL detection

For our first set
of experiments, we consider the speedup attainable by applying ASYNCADAGRAD and ASYNCDA, investigating the performance of each algorithm on a malicious URL prediction task [7]. The dataset in this case consists of an anonymized collection of URLs labeled as malicious (e.g., spam, phishing, etc.) or benign over a span of 120 days. The data in this case consists of 2.4 · 10^6 examples with dimension d = 3.2 · 10^6 (sparse) features. We perform several experiments, randomly dividing the dataset into 1.2 · 10^6 training and test samples for each experiment.

In Figure 1 we compare the performance of ASYNCADAGRAD and ASYNCDA after a single pass through the training dataset. (For each algorithm, we choose the stepsize η for optimal training set performance.) We perform the experiments on a single machine running Ubuntu Linux with six cores (with two-way hyperthreading) and 32 GB of RAM. From the left-most plot in Fig. 1, we see that up to six processors, both ASYNCDA and ASYNCADAGRAD enjoy the expected linear speedup, and from 6 to 12, they continue to enjoy a speedup that is linear in the number of processors though at a lesser slope (this is the effect of hyperthreading). For more than 12 processors, there is no further benefit to parallelism on this machine.

The two right plots in Figure 1 plot performance of the different methods (with standard errors) versus the number of worker threads used. Both are essentially flat; increasing the amount of parallelism does nothing to the average training loss or the test error rate for either method. It is clear, however, that for this dataset, the adaptive ASYNCADAGRAD algorithm provides substantial performance benefits over ASYNCDA.

4.2 Click-through-rate prediction experiments

We also experiment on a proprietary dataset consisting of search ad impressions.
Each example corresponds to showing a search-engine user a particular text ad in response to a query string. From this, we construct a very sparse feature vector based on the text of the ad displayed and the query string (no user-specific data is used). The target label is 1 if the user clicked the ad and −1 otherwise.

[Figure 3: four panels (A)–(D) of relative log-loss, speedup, and stepsize scaling versus number of passes.]

Figure 3. (A) Relative test-set log-loss for ASYNCDA and ASYNCADAGRAD, choosing the best stepsize (within a factor of about 1.4×) individually for each number of passes. (B) Effective speedup for ASYNCADAGRAD. (C) The best stepsize η, expressed as a scaling factor on the stepsize used for one pass. (D) Five runs with different random seeds for each algorithm (with ℓ2 penalty 80).

We fit logistic regression models using both ASYNCDA and ASYNCADAGRAD.
We run extensive experiments on a moderate-sized dataset (about 10^7 examples, split between training and testing), which allows thorough investigation of the impact of the stepsize η, the number of training passes,³ and ℓ2-regularization on accuracy. For these experiments we used 32 threads on 16-core machines for each run, as ASYNCADAGRAD and ASYNCDA achieve similar speedups from parallelization.

On this dataset, ASYNCADAGRAD typically achieves an effective additional speedup over ASYNCDA of 4× or more. That is, to reach a given level of accuracy, ASYNCDA generally needs four times as many effective passes over the dataset. We measure accuracy with log-loss (the logistic loss) averaged over five runs using different random seeds (which control the order in which the algorithms sample examples during training). We report relative values in Figures 2 and 3, that is, the ratio of the mean loss for the given datapoint to the lowest (best) mean loss obtained. Our results are not particularly sensitive to the choice of relative log-loss as the metric of interest; we also considered AUC (the area under the ROC curve) and observed similar results.

Figure 2 shows relative log-loss as a function of the number of training passes for various stepsizes. Without regularization, ASYNCADAGRAD is prone to overfitting: it achieves significantly higher accuracy on the training data (Fig. 2 (left)), but unless the stepsize is tuned carefully to the number of passes, it will overfit (Fig. 2 (middle)). Fortunately, the addition of ℓ2 regularization largely solves this problem. Indeed, Figure 2 (right) shows that while adding an ℓ2 penalty of 80 has very little impact on ASYNCDA, it effectively prevents the overfitting of ASYNCADAGRAD.⁴

Fixing the ℓ2 regularization multiplier to 80, we varied the stepsize η over a multiplicative grid with resolution √2 for each number of passes and for each algorithm.
Figure 3 reports the results obtained by selecting the best stepsize in terms of test-set log-loss for each number of passes. Figure 3(A) shows the relative log-loss of the best stepsize for each algorithm; 3(B) shows the relative time ASYNCDA requires with respect to ASYNCADAGRAD to achieve a given loss. Specifically, Fig. 3(B) shows the ratio of the number of passes the algorithms require to achieve a fixed loss, which gives a broader estimate of the speedup obtained by using ASYNCADAGRAD; speedups range from 3.6× to 12×. Figure 3(C) shows the optimal stepsizes as a function of the best setting for one pass. The optimal stepsizes decrease moderately for ASYNCADAGRAD, but are somewhat noisy for ASYNCDA.

It is interesting to note that ASYNCADAGRAD's accuracy is largely independent of the ordering of the training data, while ASYNCDA shows significant variability. This can be seen both in the error bars on Figure 3(A), and explicitly in Figure 3(D), where we plot one line for each of the five random seeds used. Thus, while on the one hand ASYNCDA requires somewhat less tuning of the stepsize and ℓ2 parameter, tuning ASYNCADAGRAD is much easier because of its predictable response.

³ Here "number of passes" more precisely means the expected number of times each example in the dataset is trained on. That is, each worker thread randomly selects a training example from the dataset for each update, and we continue making updates until (dataset size) × (number of passes) updates have been processed.

⁴ For both algorithms, this is accomplished by adding the term η · 80 · ‖x‖₂² to the ψ function. We can achieve slightly better results for ASYNCADAGRAD by varying the ℓ2 penalty with the number of passes.

References

[1] P. Auer and C. Gentile. Adaptive and self-confident online learning algorithms.
In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.

[2] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.

[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

[4] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[5] E. Hazan. The convex optimization approach to regret minimization. In Optimization for Machine Learning, chapter 10. MIT Press, 2012.

[6] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1996.

[7] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying malicious URLs: an application of large-scale online learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[8] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[9] B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Twenty-Third Annual Conference on Computational Learning Theory, 2010.

[10] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[11] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):261–283, 2009.

[12] F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.

[13] P. Richtárik and M. Takáč.
Parallel coordinate descent methods for big data optimization. arXiv:1212.0873 [math.OC], 2012. URL http://arxiv.org/abs/1212.0873.

[14] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.