{"title": "Accelerated Mini-Batch Stochastic Dual Coordinate Ascent", "book": "Advances in Neural Information Processing Systems", "page_first": 378, "page_last": 385, "abstract": "Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007].", "full_text": "Accelerated Mini-Batch Stochastic Dual Coordinate\n\nAscent\n\nShai Shalev-Shwartz\n\nSchool of Computer Science and Engineering\n\nHebrew University, Jerusalem, Israel\n\nTong Zhang\n\nDepartment of Statistics\n\nRutgers University, NJ, USA\n\nAbstract\n\nStochastic dual coordinate ascent (SDCA) is an effective technique for solving\nregularized loss minimization problems in machine learning. This paper considers\nan extension of SDCA under the mini-batch setting that is often used in practice.\nOur main contribution is to introduce an accelerated mini-batch version of SDCA\nand prove a fast convergence rate for this method. We discuss an implementation\nof our method over a parallel computing system, and compare the results to both\nthe vanilla stochastic dual coordinate ascent and to the accelerated deterministic\ngradient descent method of Nesterov [2007].\n\n1\n\nIntroduction\n\nWe consider the following generic optimization problem. Let \u03c61, . . . 
, φn be a sequence of convex functions from Rd to R, and let g : Rd → R be a strongly convex regularization function. Our goal is to solve min_{x ∈ Rd} P(x) where

$$P(x) = \left[\frac{1}{n}\sum_{i=1}^{n}\phi_i(x)\right] + g(x). \qquad (1)$$

For example, given a sequence of n training examples (v1, y1), . . . , (vn, yn), where vi ∈ Rd and yi ∈ R, ridge regression is obtained by setting g(x) = (λ/2)‖x‖² and φi(x) = (x⊤vi − yi)². Regularized logistic regression is obtained by setting φi(x) = log(1 + exp(−yi x⊤vi)).
The dual problem of (1) is defined as follows: For each i, let φi* : Rd → R be the convex conjugate of φi, namely, φi*(u) = max_{z ∈ Rd} (z⊤u − φi(z)). Similarly, let g* be the convex conjugate of g. The dual problem is:

$$\max_{\alpha \in \mathbb{R}^{d \times n}} D(\alpha) \quad \text{where} \quad D(\alpha) = \left[\frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i)\right] - g^*\!\left(\frac{1}{n}\sum_{i=1}^{n}\alpha_i\right), \qquad (2)$$

where for each i, αi is the i'th column of the matrix α.
The dual objective has a different dual vector associated with each primal function. Dual Coordinate Ascent (DCA) methods solve the dual problem iteratively: at each iteration of DCA, the dual objective is optimized with respect to a single dual vector, while the rest of the dual vectors are kept intact. Recently, Shalev-Shwartz and Zhang [2013a] analyzed a stochastic version of dual coordinate ascent, abbreviated SDCA, in which at each round the dual vector to optimize is chosen uniformly at random (see also Richtárik and Takáč [2012a]). In particular, let x* be the optimum of (1). We say that a solution x is ε-accurate if P(x) − P(x*) ≤ ε. 
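The dual construction relies on the convex conjugate. As a quick numerical sanity check (our own illustrative Python, not part of the paper), the following verifies that the conjugate of g(x) = (λ/2)‖x‖² is g*(α) = ‖α‖²/(2λ), attained at z = α/λ:

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def g(z, lam):
    # g(x) = (lam/2) * ||x||^2
    return 0.5 * lam * dot(z, z)

def g_conj(alpha, lam):
    # Conjugate of g: g*(alpha) = max_z ( z^T alpha - g(z) ) = ||alpha||^2 / (2*lam)
    return dot(alpha, alpha) / (2.0 * lam)

random.seed(0)
lam = 0.5
alpha = [random.gauss(0, 1) for _ in range(5)]

z_star = [a / lam for a in alpha]          # maximizer: set gradient alpha - lam*z to zero
val = dot(z_star, alpha) - g(z_star, lam)
assert math.isclose(val, g_conj(alpha, lam))

z_other = [z + 0.1 for z in z_star]        # any other z attains a smaller value
assert dot(z_other, alpha) - g(z_other, lam) < val
```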
Shalev-Shwartz and Zhang [2013a] have derived the following convergence guarantee for SDCA: if g(x) = (λ/2)‖x‖² and each φi is (1/γ)-smooth, then for every ε > 0, if we run SDCA for at least

$$\left(n + \frac{1}{\lambda\gamma}\right) \log\!\left(\left(n + \frac{1}{\lambda\gamma}\right) \cdot \frac{1}{\epsilon}\right)$$

iterations, then the solution of the SDCA algorithm will be ε-accurate (in expectation). This convergence rate is significantly better than that of the more commonly studied stochastic gradient descent (SGD) methods that are related to SDCA.¹
Another approach to solving (1) is deterministic gradient descent. In particular, Nesterov [2007] proposed an accelerated gradient descent (AGD) method for solving (1). Under the same conditions mentioned above, AGD finds an ε-accurate solution after performing

$$O\!\left(\frac{1}{\sqrt{\lambda\gamma}}\log\frac{1}{\epsilon}\right)$$

iterations.
The advantage of SDCA over AGD is that each iteration involves only a single dual vector and usually costs O(d). In contrast, each iteration of AGD requires Ω(nd) operations. On the other hand, AGD has a better dependence on the condition number of the problem: the iteration bound of AGD scales with 1/√(λγ) while the iteration bound of SDCA scales with 1/(λγ).
In this paper we describe and analyze a new algorithm that interpolates between SDCA and AGD. At each iteration of the algorithm, we randomly pick a subset of m indices from {1, . . . , n} and update the dual vectors corresponding to this subset. This subset is often called a mini-batch. The use of mini-batches is common in SGD optimization, and it is beneficial when the processing time of a mini-batch of size m is much smaller than m times the processing time of one example (a mini-batch of size 1). 
For example, in the practical training of neural networks with SGD, one is usually advised to use mini-batches because it is more efficient to perform matrix-matrix multiplications over a mini-batch than an equivalent amount of matrix-vector multiplication operations (each over a single training example). This is especially noticeable when a GPU is used: in some cases the processing time of a mini-batch of size 100 may be the same as that of a mini-batch of size 10. Another typical use of mini-batches is for parallel computing, which was studied by various authors for stochastic gradient descent (e.g., Dekel et al. [2012]). This is also the application scenario we have in mind, and it will be discussed in greater detail in Section 3.
Recently, Takáč et al. [2013] studied mini-batch variants of SDCA in the context of the Support Vector Machine (SVM) problem. They have shown that the naive mini-batching method, in which m dual variables are optimized in parallel, might actually increase the number of iterations required. They then describe several "safe" mini-batching schemes and, based on the analysis of Shalev-Shwartz and Zhang [2013a], show several speed-up results. However, their results are for the non-smooth case and hence do not yield a linear convergence rate. In addition, the speed-up they obtain requires some spectral properties of the training examples. We take a different approach and employ Nesterov's acceleration method, which has previously been applied to mini-batch SGD optimization. This paper shows how to achieve acceleration for SDCA in the mini-batch setting. The pseudo-code of our Accelerated Mini-Batch SDCA, abbreviated ASDCA, is presented below.

Procedure Accelerated Mini-Batch SDCA
  Parameters: scalars λ, γ and θ ∈ [0, 1]; mini-batch size m
  Initialize: α(0)_1 = ··· = α(0)_n = ᾱ(0) = 0, x(0) = 0
  Iterate: for t = 1, 2, . . .
    u(t−1) = (1 − θ) x(t−1) + θ ∇g*(ᾱ(t−1))
    Randomly pick a subset I ⊂ {1, . . . , n} of size m and update the dual variables in I:
      α(t)_i = (1 − θ) α(t−1)_i − θ ∇φi(u(t−1))   for i ∈ I
      α(t)_j = α(t−1)_j   for j ∉ I
    ᾱ(t) = ᾱ(t−1) + (1/n) Σ_{i∈I} (α(t)_i − α(t−1)_i)
    x(t) = (1 − θ) x(t−1) + θ ∇g*(ᾱ(t))
  end

In the next section we present our main result: an analysis of the number of iterations required by ASDCA. We focus on the case of Euclidean regularization, namely, g(x) = (λ/2)‖x‖². Analyzing more general strongly convex regularization functions is left for future work. In Section 3 we discuss parallel implementations of ASDCA and compare them to parallel implementations of AGD and SDCA. In particular, we explain in which regimes ASDCA can be better than both AGD and SDCA. In Section 4 we present some experimental results, demonstrating how ASDCA interpolates between AGD and SDCA. The proof of our main theorem is deferred to a long version of this paper (Shalev-Shwartz and Zhang [2013b]). We conclude with a discussion of our work in light of related work in Section 5.

¹An exception is the recent analysis given in Le Roux et al. [2012] for a variant of SGD.

2 Main Results

Our main result is a bound on the number of iterations required by ASDCA to find an ε-accurate solution. In our analysis, we only consider the squared Euclidean norm regularization, g(x) = (λ/2)‖x‖², where ‖·‖ is the Euclidean norm and λ > 0 is a regularization parameter. The analysis for general λ-strongly convex regularizers is left for future work. 
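The ASDCA updates above can be sketched in a few lines of pure Python for the ridge-regression instance (φi(x) = (x⊤vi − yi)², g(x) = (λ/2)‖x‖², so ∇g*(ᾱ) = ᾱ/λ, and ∇φi(u) = 2(u⊤vi − yi)vi). This is our own illustration; the toy data, the fixed θ, and all names are our choices, not from the paper.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def asdca(V, y, lam, theta, m, T, seed=0):
    """ASDCA sketch for ridge regression; V is a list of example vectors."""
    rng = random.Random(seed)
    n, d = len(V), len(V[0])
    alpha = [[0.0] * d for _ in range(n)]   # one dual vector per example
    abar = [0.0] * d                        # abar = (1/n) * sum_i alpha_i
    x = [0.0] * d
    for _ in range(T):
        # u = (1 - theta) * x + theta * grad g*(abar), with grad g*(a) = a / lam
        u = [(1 - theta) * xj + theta * aj / lam for xj, aj in zip(x, abar)]
        I = rng.sample(range(n), m)         # mini-batch of dual coordinates
        for i in I:
            gphi = [2.0 * (dot(u, V[i]) - y[i]) * vij for vij in V[i]]
            new_ai = [(1 - theta) * a - theta * gp
                      for a, gp in zip(alpha[i], gphi)]
            # maintain abar = abar + (new alpha_i - old alpha_i) / n incrementally
            abar = [ab + (na - a) / n
                    for ab, na, a in zip(abar, new_ai, alpha[i])]
            alpha[i] = new_ai
        x = [(1 - theta) * xj + theta * aj / lam for xj, aj in zip(x, abar)]
    return x

# Toy usage: the primal objective should decrease from its value at x = 0.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.0, -1.0, 0.5, 0.2]

def primal(x, lam=1.0):
    return sum((dot(x, v) - yi) ** 2 for v, yi in zip(V, y)) / len(V) \
           + 0.5 * lam * dot(x, x)

x_hat = asdca(V, y, lam=1.0, theta=0.1, m=2, T=300)
assert primal(x_hat) < primal([0.0, 0.0])
```

At the fixed point of these updates, αi = −∇φi(x) and x = ᾱ/λ, which together give λx + (1/n)Σi ∇φi(x) = 0, i.e., a stationary point of the primal objective (1).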
For the squared Euclidean norm we have

$$g^*(\alpha) = \frac{1}{2\lambda}\|\alpha\|^2 \quad \text{and} \quad \nabla g^*(\alpha) = \frac{1}{\lambda}\,\alpha.$$

We further assume that each φi is (1/γ)-smooth with respect to ‖·‖, namely,

$$\forall x, z, \quad \phi_i(x) \le \phi_i(z) + \nabla\phi_i(z)^\top (x - z) + \frac{1}{2\gamma}\|x - z\|^2.$$

For example, if φi(x) = (x⊤vi − yi)², then it is ‖vi‖²-smooth.
The smoothness of φi also implies that φi*(α) is γ-strongly convex:

$$\forall \theta \in [0, 1], \quad \phi_i^*((1-\theta)\alpha + \theta\beta) \le (1-\theta)\phi_i^*(\alpha) + \theta\phi_i^*(\beta) - \frac{\theta(1-\theta)\gamma}{2}\|\alpha - \beta\|^2.$$

We have the following result for our method.

Theorem 1. Assume that g(x) = (λ/2)‖x‖² and for each i, φi is (1/γ)-smooth w.r.t. the Euclidean norm. Suppose that the ASDCA algorithm is run with parameters λ, γ, m, θ, where

$$\theta \le \frac{1}{4}\min\left\{ 1 ,\ \sqrt{\frac{\gamma\lambda n}{m}} ,\ \gamma\lambda n ,\ \frac{(\gamma\lambda n)^{2/3}}{m^{1/3}} \right\}. \qquad (3)$$

Define the dual sub-optimality by ΔD(α) = D(α*) − D(α), where α* is the optimal dual solution, and the primal sub-optimality by ΔP(x) = P(x) − D(α*). Then,

$$m\,\mathbb{E}\,\Delta P(x^{(t)}) + n\,\mathbb{E}\,\Delta D(\alpha^{(t)}) \le (1 - \theta m/n)^t \left[ m\,\Delta P(x^{(0)}) + n\,\Delta D(\alpha^{(0)}) \right].$$

It follows that after performing

$$t \ge \frac{n/m}{\theta} \log\!\left( \frac{m\,\Delta P(x^{(0)}) + n\,\Delta D(\alpha^{(0)})}{m\,\epsilon} \right)$$

iterations, we have that E[P(x(t)) − D(α(t))] ≤ ε.

Let us now discuss the bound, assuming θ is taken to be the right-hand side of (3). 
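For concreteness, the right-hand side of (3) and the resulting n/(mθ) iteration factor can be evaluated numerically; the small helper below is our own illustration (the function names are ours, not from the paper):

```python
# Evaluate the step-parameter bound (3) and the leading iteration factor n/(m*theta).

def theta_bound(gamma, lam, n, m):
    gln = gamma * lam * n
    return 0.25 * min(1.0,
                      (gln / m) ** 0.5,
                      gln,
                      gln ** (2.0 / 3.0) / m ** (1.0 / 3.0))

def iteration_factor(gamma, lam, n, m):
    return (n / m) / theta_bound(gamma, lam, n, m)

# Example regime: gamma*lam*n = Theta(1). With m = 1 the factor scales like n
# (SDCA-like); with m = n it scales like sqrt(n) (AGD-like).
n = 10 ** 6
gamma, lam = 1.0, 1.0 / n            # so gamma*lam*n = 1
f1 = iteration_factor(gamma, lam, n, 1)
fn = iteration_factor(gamma, lam, n, n)
assert abs(f1 - 4 * n) < 1e-6        # 4n: up to constants, the SDCA rate
assert abs(fn - 4000.0) < 1e-6       # 4*sqrt(n): up to constants, the AGD rate
assert f1 > fn
```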
The dominating factor of the bound on t becomes

$$\frac{n}{m\theta} = \frac{n}{m}\cdot\max\left\{ 1 ,\ \sqrt{\frac{m}{\gamma\lambda n}} ,\ \frac{1}{\gamma\lambda n} ,\ \frac{m^{1/3}}{(\gamma\lambda n)^{2/3}} \right\} \qquad (4)$$

$$= \max\left\{ \frac{n}{m} ,\ \sqrt{\frac{n/m}{\gamma\lambda}} ,\ \frac{1/m}{\gamma\lambda} ,\ \frac{n^{1/3}}{(\gamma\lambda m)^{2/3}} \right\}. \qquad (5)$$

Table 1 summarizes several interesting cases, and compares the iteration bound of ASDCA to the iteration bound of the vanilla SDCA algorithm (as analyzed in Shalev-Shwartz and Zhang [2013a]) and the Accelerated Gradient Descent (AGD) algorithm of Nesterov [2007]. In the tables, we ignore constants and logarithmic factors.

Algorithm   γλn = Θ(1)   γλn = Θ(1/m)   γλn = Θ(m)
SDCA        n            nm             n
ASDCA       n/√m         n              n/m
AGD         √n           √(nm)          √(n/m)

Table 1: Comparison of Iteration Complexity

Algorithm   γλn = Θ(1)   γλn = Θ(1/m)   γλn = Θ(m)
SDCA        n            nm             n
ASDCA       n√m          nm             n
AGD         n√n          n√(nm)         n√(n/m)

Table 2: Comparison of Number of Examples Processed

As can be seen in the tables, the ASDCA algorithm interpolates between SDCA and AGD. In particular, ASDCA has the same bound as SDCA when m = 1 and the same bound as AGD when m = n. Recall that the cost of each iteration of AGD scales with n while the cost of each iteration of SDCA does not scale with n; the cost of each iteration of ASDCA scales with m. To compensate for the different cost per iteration of the different algorithms, we may also compare the complexity in terms of the number of examples processed (see Table 2). This is also what we will study in our empirical experiments. 
It should be mentioned that this comparison is meaningful in a single-processor environment, but not in a parallel computing environment where multiple examples can be processed simultaneously in a mini-batch. In the next section we discuss under what conditions the overall runtime of ASDCA is better than that of both AGD and SDCA.

3 Parallel Implementation

In recent years, there has been a lot of interest in implementing optimization algorithms on parallel computing architectures (see Section 5). We now discuss how to implement AGD, SDCA, and ASDCA on a computing machine with s parallel computing nodes. In the calculations below, we use the following facts:

• If each node holds a d-dimensional vector, we can compute the sum of these vectors in time O(d log(s)) by applying a "tree-structure" summation (see for example the All-Reduce architecture in Agarwal et al. [2011]).

• A node can broadcast a message with c bits to all other nodes in time O(c log²(s)). To see this, order the nodes on the corners of the log₂(s)-dimensional hypercube. Then, at each iteration, each node sends the message to its log(s) neighbors (namely, the nodes whose code word is at a Hamming distance of 1 from the node). The message between the furthest-away nodes will arrive after log(s) iterations. Overall, we perform log(s) iterations and each iteration requires transmitting c log(s) bits.

• All nodes can broadcast a message with c bits to all other nodes in time O(cs log²(s)). To see this, simply apply the broadcasting of the different nodes mentioned above in parallel. The number of iterations will still be the same, but now, at each iteration, each node should transmit cs bits to its log(s) neighbors. Therefore, it takes O(cs log²(s)) time.

For concreteness of the discussion, we consider problems in which φi(x) takes the form of ℓ(x⊤vi, yi), where yi is a scalar and vi ∈ Rd. 
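The tree-structured summation used in the cost estimates above can be simulated in a few lines. This is a hypothetical single-process sketch (no actual message passing), meant only to show why the number of communication rounds is log₂(s) rather than s:

```python
# Simulate "tree-structure" summation: s nodes each hold a vector; pairs are
# summed in ceil(log2(s)) rounds, so the wall-clock cost is O(d log s), not O(d s).

def tree_sum(vectors):
    """Sum equal-length vectors in pairwise rounds; return (total, rounds used)."""
    vecs = [list(v) for v in vectors]
    rounds = 0
    while len(vecs) > 1:
        paired = []
        for k in range(0, len(vecs) - 1, 2):
            paired.append([a + b for a, b in zip(vecs[k], vecs[k + 1])])
        if len(vecs) % 2 == 1:          # an unpaired node waits for the next round
            paired.append(vecs[-1])
        vecs = paired
        rounds += 1
    return vecs[0], rounds

total, rounds = tree_sum([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
assert total == [16.0, 20.0]
assert rounds == 2                      # log2(4) rounds for s = 4 nodes
```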
This is the case in supervised learning of linear predictors (e.g., logistic regression or ridge regression). We further assume that the average number of non-zero elements of vi is d̄. In very large-scale problems, a single machine cannot hold all of the data in its memory. However, we assume that a single node can hold a fraction 1/s of the data in its memory.
Let us now discuss parallel implementations of the different algorithms, starting with deterministic gradient algorithms (such as AGD). The bottleneck operation of deterministic gradient algorithms is the calculation of the gradient. In the notation above, this amounts to performing order of n d̄ operations. If the data is distributed over s computing nodes, where each node holds n/s examples, we can calculate the gradient in time O(n d̄/s + d log(s)) as follows. First, each node calculates the gradient over its own n/s examples (which takes time O(n d̄/s)). Then, the s resulting vectors in Rd are summed up in time O(d log(s)).
Next, let us consider the SDCA algorithm. On a single computing node, it was observed that SDCA is much more efficient than deterministic gradient descent methods, since each iteration of SDCA costs only Θ(d̄) while each iteration of AGD costs Θ(n d̄). When we have s nodes, dividing the examples among the s computing nodes does not yield any speed-up for SDCA. However, we can divide the features among the s nodes (that is, each node will hold d/s of the features for all of the examples). This enables the computation of x⊤vi in (expected) time O(d̄/s + s log²(s)). Indeed, node t will calculate Σ_{j∈Jt} xj vi,j, where Jt is the set of features stored in node t (namely, |Jt| = d/s). Then, each node broadcasts the resulting scalar to all the other nodes. Note that we will obtain a speed-up over the naive implementation only if s log²(s) ≪ d̄.
For the ASDCA algorithm, each iteration involves the computation of the gradient over m examples. We can choose to implement it by dividing the examples among the s nodes (as we did for AGD) or by dividing the features among the s nodes (as we did for SDCA). In the first case, the cost of each iteration is O(m d̄/s + d log(s)), while in the latter case, the cost of each iteration is O(m d̄/s + m s log²(s)). We will choose between these two implementations based on the relation between d, m, and s. The runtime and communication time of each iteration are summarized in the table below.

Algorithm   partition type   runtime   communication time
SDCA        features         d̄/s       s log²(s)
ASDCA       features         d̄m/s      m s log²(s)
ASDCA       examples         d̄m/s      d log(s)
AGD         examples         d̄n/s      d log(s)

We again see that ASDCA nicely interpolates between SDCA and AGD. In practice, there is usually a non-negligible cost of opening communication channels between nodes. In that case, it will be better to apply ASDCA with a value of m that reflects an adequate tradeoff between the runtime of each node and the communication time. With the appropriate value of m (which depends on constants like the cost of opening communication channels and sending packets of bits between nodes), ASDCA may outperform both SDCA and AGD.

4 Experimental Results

In this section we demonstrate how ASDCA interpolates between SDCA and AGD. All of our experiments are performed for the task of binary classification with a smooth variant of the hinge loss (see Shalev-Shwartz and Zhang [2013a]). Specifically, let (v1, y1), . . . , (vn, yn) be a set of labeled examples, where for every i, vi ∈ Rd and yi ∈ {±1}. 
Define φi(x) to be

$$\phi_i(x) = \begin{cases} 0 & y_i x^\top v_i > 1 \\ \frac{1}{2} - y_i x^\top v_i & y_i x^\top v_i < 0 \\ \frac{1}{2}(1 - y_i x^\top v_i)^2 & \text{otherwise.} \end{cases}$$

We also set the regularization function to be g(x) = (λ/2)‖x‖² where λ = 1/n. This is the default value for the regularization parameter taken in several optimization packages.
Following Shalev-Shwartz and Zhang [2013a], the experiments were performed on three large datasets with very different feature counts and sparsity. The astro-ph dataset classifies abstracts of papers from the physics ArXiv according to whether they belong in the astro-physics section; CCAT is a classification task taken from the Reuters RCV1 collection; and cov1 is class 1 of the covertype dataset of Blackard, Jock & Dean. The following table provides details of the dataset characteristics.

Dataset    Training Size   Testing Size   Features   Sparsity
astro-ph   29882           32487          99757      0.08%
CCAT       781265          23149          47236      0.16%
cov1       522911          58101          54         22.22%

We ran ASDCA with values of m from the set {10⁻⁴n, 10⁻³n, 10⁻²n}. We also ran the SDCA algorithm and the AGD algorithm. In Figure 1 we depict the primal sub-optimality of the different algorithms as a function of the number of examples processed. Note that each iteration of SDCA processes a single example, each iteration of ASDCA processes m examples, and each iteration of AGD processes n examples.

Figure 1: The figure presents the performance of AGD, SDCA, and ASDCA with different values of the mini-batch size m, on the three datasets (columns: astro-ph, CCAT, cov1). In all plots, the x axis is the number of processed examples. Top: primal sub-optimality. Middle: average value of the smoothed hinge loss over a test set. Bottom: average value of the 0-1 loss over a test set. 
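The smoothed hinge loss used in these experiments, written as a function of the margin z = yi x⊤vi, can be implemented directly (our own sketch):

```python
# Smoothed hinge loss from the experiments, as a function of the margin z:
# 0 for z > 1, 1/2 - z for z < 0, and (1 - z)^2 / 2 in between.

def smoothed_hinge(z):
    if z > 1.0:
        return 0.0
    if z < 0.0:
        return 0.5 - z
    return 0.5 * (1.0 - z) ** 2

assert smoothed_hinge(2.0) == 0.0
assert smoothed_hinge(-1.0) == 1.5
assert smoothed_hinge(0.5) == 0.125
# The three pieces meet continuously at z = 0 and z = 1.
assert smoothed_hinge(0.0) == 0.5
assert smoothed_hinge(1.0) == 0.0
```

Unlike the plain hinge loss max(0, 1 − z), this variant is differentiable everywhere (it is 1-smooth), which is what allows the linear convergence rates discussed in Section 2.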
As can be seen from the graphs, ASDCA indeed interpolates between SDCA and AGD. It is clear from the graphs that SDCA is much better than AGD when we have a single computing node. The performance of ASDCA is quite similar to that of SDCA when m is not very large. As discussed in Section 3, when we have parallel computing nodes and there is a non-negligible cost of opening communication channels between nodes, running ASDCA with an appropriate value of m (which depends on constants like the cost of opening communication channels) may yield the best performance.

5 Discussion and Related Work

We have introduced an accelerated version of stochastic dual coordinate ascent with mini-batches. We have shown, both theoretically and empirically, that the resulting algorithm interpolates between the vanilla stochastic coordinate descent algorithm and the accelerated gradient descent algorithm.
Using mini-batches in stochastic learning has received a lot of attention in recent years. E.g., Shalev-Shwartz et al. 
[2007] reported experiments showing that applying small mini-batches in Stochastic Gradient Descent (SGD) decreases the required number of iterations. Dekel et al. [2012] and Agarwal and Duchi [2012] gave an analysis of SGD with mini-batches for smooth loss functions. Cotter et al. [2011] studied SGD and accelerated versions of SGD with mini-batches, and Takáč et al. [2013] studied SDCA with mini-batches for SVMs. Duchi et al. [2010] studied dual averaging in distributed networks as a function of spectral properties of the underlying graph. However, all of these methods have a polynomial dependence on 1/ε, while we consider the strongly convex and smooth case, in which a log(1/ε) rate is achievable.² Parallel coordinate descent has also been studied recently in Fercoq and Richtárik [2013] and Richtárik and Takáč [2013].
It is interesting to note that most³ of these papers focus on mini-batches as the method of choice for distributing SGD or SDCA, while ignoring the option to divide the data by features instead of by examples. A possible reason is the cost of opening communication sockets, as discussed in Section 3.
There are various practical considerations that one should take into account when designing a practical system for distributed optimization. We refer the reader, for example, to Dekel [2010], Low et al. [2010, 2012], Agarwal et al. [2011], Niu et al. [2011].
The more general problem of distributed PAC learning has been studied recently in Daumé III et al. [2012] and Balcan et al. [2012]; see also Long and Servedio [2011]. In particular, they obtain algorithms with O(log(1/ε)) communication complexity. However, these works consider efficient algorithms only in the realizable case.

Acknowledgements: Shai Shalev-Shwartz is supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI). 
Tong Zhang is supported by the following grants: NSF IIS-1016061, NSF DMS-1007527, and NSF IIS-1250985.

²It should be noted that one can use our results for Lipschitz functions as well by smoothing the loss function (see Nesterov [2005]). By doing so, we can interpolate between the 1/ε² rate of non-accelerated methods and the 1/ε rate of accelerated gradient.
³There are a few exceptions in the context of stochastic coordinate descent in the primal. See for example Bradley et al. [2011], Richtárik and Takáč [2012b].

References
Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 5451–5452. IEEE, 2012.
Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. arXiv preprint arXiv:1110.4198, 2011.
Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. arXiv preprint arXiv:1204.3514, 2012.
Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.
Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. arXiv preprint arXiv:1106.4574, 2011.
Hal Daumé III, Jeff M. Phillips, Avishek Saha, and Suresh Venkatasubramanian. Protocols for learning classifiers on distributed data. arXiv preprint arXiv:1202.6078, 2012.
Ofer Dekel. Distribution-calibrated hierarchical classification. In NIPS, 2010.
Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165–202, 2012.
John Duchi, Alekh Agarwal, and Martin J. Wainwright. 
Distributed dual averaging in networks. Advances in Neural Information Processing Systems, 23, 2010.
Olivier Fercoq and Peter Richtárik. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv preprint arXiv:1309.5885, 2013.
Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint arXiv:1202.6258, 2012.
Phil Long and Rocco Servedio. Algorithms and hardness results for parallel large margin learning. In NIPS, 2011.
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010.
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.
Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Yurii Nesterov. Gradient methods for minimizing composite objective function, 2007.
Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730, 2011.
Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1–38, 2012a.
Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv preprint arXiv:1212.0873, 2012b.
Peter Richtárik and Martin Takáč. 
Distributed coordinate descent method for learning with big data. arXiv preprint arXiv:1310.2059, 2013.
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, Feb 2013a.
Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. arXiv preprint, 2013b.
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In ICML, pages 807–814, 2007.
Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. arXiv preprint, 2013.
", "award": [], "sourceid": 249, "authors": [{"given_name": "Shai", "family_name": "Shalev-Shwartz", "institution": "The Hebrew University"}, {"given_name": "Tong", "family_name": "Zhang", "institution": "Baidu & Rutgers"}]}