{"title": "Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 865, "page_last": 873, "abstract": "We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial and mini-batch variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data.", "full_text": "Quartz: Randomized Dual Coordinate Ascent\n\nwith Arbitrary Sampling\n\nZheng Qu\n\nDepartment of Mathematics\nThe University of Hong Kong\n\nHong Kong\n\nPeter Richt\u00b4arik\n\nSchool of Mathematics\n\nThe University of Edinburgh\nEH9 3FD, United Kingdom\n\nzhengqu@maths.hku.hk\n\npeter.richtarik@ed.ac.uk\n\nTong Zhang\n\nDepartment of Statistics\n\nRutgers University\n\nPiscataway, NJ, 08854\n\ntzhang@stat.rutgers.edu\n\nAbstract\n\nWe study the problem of minimizing the average of a large number of smooth\nconvex functions penalized with a strongly convex regularizer. 
We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial and mini-batch variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data.

Keywords: empirical risk minimization, dual coordinate ascent, arbitrary sampling, data-driven speedup.

1 Introduction

In this paper we consider a primal-dual pair of structured convex optimization problems which has, in several variants of varying degrees of generality, attracted a lot of attention in the past few years in the machine learning and optimization communities [4, 22, 20, 23, 21, 27].

Let $A_1, \dots, A_n$ be a collection of $d$-by-$m$ real matrices and $\phi_1, \dots, \phi_n$ be $1/\gamma$-smooth convex functions from $\mathbb{R}^m$ to $\mathbb{R}$, where $\gamma > 0$. Further, let $g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a 1-strongly convex function and $\lambda > 0$ a regularization parameter.
We are interested in solving the following primal problem:

$$\min_{w = (w_1, \dots, w_d) \in \mathbb{R}^d} \left[ P(w) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]. \qquad (1)$$

In the machine learning context, the matrices $\{A_i\}$ are interpreted as examples/samples, $w$ is a (linear) predictor, function $\phi_i$ is the loss incurred by the predictor on example $A_i$, $g$ is a regularizer, $\lambda$ is a regularization parameter, and (1) is the regularized empirical risk minimization problem. In this paper we are especially interested in problems where $n$ is very big (millions, billions), and much larger than $d$. This is often the case in big data applications. Stochastic Gradient Descent (SGD) [18, 11, 25] was designed for solving this type of large-scale optimization problem. In each iteration SGD computes the gradient of one single randomly chosen function $\phi_i$ and approximates the full gradient by this unbiased but noisy estimate. Because of the variance of the stochastic estimate, SGD has a slow convergence rate $O(1/\epsilon)$. Recently, many methods achieving a fast (linear) convergence rate $O(\log(1/\epsilon))$ have been proposed, including SAG [19], SVRG [6], S2GD [8], SAGA [1], mS2GD [7] and MISO [10], all using different techniques to reduce the variance.

Another approach, Stochastic Dual Coordinate Ascent (SDCA) [22], solves (1) by considering its dual problem, defined as follows. For each $i$, let $\phi_i^* : \mathbb{R}^m \to \mathbb{R}$ be the convex conjugate of $\phi_i$, namely $\phi_i^*(u) = \max_{s \in \mathbb{R}^m} s^\top u - \phi_i(s)$, and similarly let $g^* : \mathbb{R}^d \to \mathbb{R}$ be the convex conjugate of $g$. The dual problem of (1) is:

$$\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^N = \mathbb{R}^{nm}} \left[ D(\alpha) \stackrel{\text{def}}{=} -f(\alpha) - \psi(\alpha) \right], \qquad (2)$$

where $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^N = \mathbb{R}^{nm}$ is obtained by stacking the dual variables (blocks) $\alpha_i \in \mathbb{R}^m$, $i = 1,
\dots, n$, on top of each other, and the functions $f$ and $\psi$ are defined by

$$f(\alpha) \stackrel{\text{def}}{=} \lambda g^*\!\left( \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i \right); \qquad \psi(\alpha) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i). \qquad (3)$$

SDCA [22] and its proximal extension Prox-SDCA [20] first solve the dual problem (2) by updating one dual variable, chosen uniformly at random, at each round, and then recover the primal solution by setting $w = \nabla g^*(\alpha)$. Let $L_i = \lambda_{\max}(A_i^\top A_i)$. It is known that if we run SDCA for at least

$$O\!\left( \left( n + \frac{\max_i L_i}{\lambda\gamma} \right) \log\!\left( \left( n + \frac{\max_i L_i}{\lambda\gamma} \right) \frac{1}{\epsilon} \right) \right)$$

iterations, then SDCA finds a pair $(w, \alpha)$ such that $\mathbb{E}[P(w) - D(\alpha)] \le \epsilon$. By applying accelerated randomized coordinate descent to the dual problem, APCG [9] needs at most $\tilde O\!\left(n + \sqrt{\frac{n \max_i L_i}{\lambda\gamma}}\right)$ iterations to reach $\epsilon$-accuracy. ASDCA [21] and SPDC [26] are also accelerated randomized primal-dual methods. Moreover, they can update a mini-batch of dual variables in each round.

We propose a new algorithm (Algorithm 1), which we call Quartz, for simultaneously solving the primal (1) and dual (2) problems. On the dual side, at each iteration our method selects and updates a random subset (sampling) $\hat S \subseteq \{1, \dots, n\}$ of the dual variables/blocks. We assume that these sets are i.i.d. throughout the iterations. However, we do not impose any additional assumptions on the distribution of $\hat S$ apart from the necessary requirement that each block $i$ be chosen with a positive probability: $p_i \stackrel{\text{def}}{=} \mathbb{P}(i \in \hat S) > 0$.
Quartz is the first SDCA-like method analyzed for an arbitrary sampling. The dual updates are then used to perform an update to the primal variable $w$, and the process is repeated. Our primal updates are different (less aggressive) from those used in SDCA [22] and Prox-SDCA [20], thanks to which the decrease in the primal-dual error can be bounded directly, without first establishing dual convergence as in [20], [23] and [9]. Our analysis is novel and directly primal-dual in nature. As a result, our proof is more direct, and the logarithmic term in our bound has a simpler form.

Main result. We prove that starting from an initial pair $(w^0, \alpha^0)$, Quartz finds a pair $(w, \alpha)$ for which $P(w) - D(\alpha) \le \epsilon$ (in expectation) in at most

$$\max_i \left( \frac{1}{p_i} + \frac{v_i}{p_i \lambda\gamma n} \right) \log\!\left( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \right) \qquad (4)$$

iterations. The parameters $v_1, \dots, v_n$ are assumed to satisfy the following ESO (expected separable overapproximation) inequality:

$$\mathbb{E}_{\hat S}\!\left[ \Big\| \sum_{i \in \hat S} A_i h_i \Big\|^2 \right] \le \sum_{i=1}^n p_i v_i \|h_i\|^2, \qquad (5)$$

where $\|\cdot\|$ denotes the standard Euclidean norm. Moreover, the parameters $v_1, \dots, v_n$ are needed to run the method (they determine stepsizes), and hence it is critical that they can be cheaply computed before the method starts. We wish to point out that (5) always holds for some parameters $\{v_i\}$. Indeed, the left-hand side is a quadratic function of $h$, and hence the inequality holds for large enough $v_i$. Having said that, the size of these parameters directly influences the complexity, and hence one would want to obtain bounds that are as tight as possible. As we will show, for many samplings of interest a small enough parameter $v$ can be obtained in the time required to read the data $\{A_i\}$.
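To illustrate, inequality (5) is easy to verify numerically in the simplest case. The sketch below (our own illustration with synthetic data, not from the paper) checks the ESO for a serial sampling ($|\hat S| = 1$), where the expectation on the left-hand side can be evaluated exactly and where we take $v_i = \lambda_{\max}(A_i^\top A_i)$, anticipating (12):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 5, 3
A = [rng.standard_normal((d, m)) for _ in range(n)]      # blocks A_1, ..., A_n
v = [np.linalg.eigvalsh(Ai.T @ Ai).max() for Ai in A]    # v_i = lambda_max(A_i^T A_i)
p = rng.random(n); p /= p.sum()                          # serial sampling: S = {i} w.p. p_i

def eso_slack(h):
    # For serial S: E||sum_{i in S} A_i h_i||^2 = sum_i p_i ||A_i h_i||^2
    lhs = sum(p[i] * np.linalg.norm(A[i] @ h[i]) ** 2 for i in range(n))
    rhs = sum(p[i] * v[i] * np.linalg.norm(h[i]) ** 2 for i in range(n))
    return rhs - lhs                                     # slack in (5); should be >= 0

slacks = [eso_slack([rng.standard_normal(m) for _ in range(n)]) for _ in range(200)]
print(min(slacks) >= -1e-9)   # prints True: (5) holds for every direction h tried
```

Here the check succeeds for the simple reason that $\|A_i h_i\|^2 \le \lambda_{\max}(A_i^\top A_i)\|h_i\|^2$ holds deterministically for each block.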
In particular, if the data matrix $A = (A_1, \dots, A_n)$ is sufficiently sparse, our iteration complexity result (4), specialized to the case of standard mini-batching, can be better than that of accelerated methods such as ASDCA [21] and SPDC [26] even when the condition number $\max_i L_i / (\lambda\gamma)$ is larger than $n$; see Proposition 4 and Figure 2.

As described above, Quartz uses an arbitrary sampling for picking the dual variables to be updated in each iteration. To the best of our knowledge, only two papers exist in the literature where a stochastic method using an arbitrary sampling was analyzed: NSync [16] for unconstrained minimization of a strongly convex function and ALPHA [15] for composite minimization of a non-strongly convex function. Assumption (5) was first introduced in [16]. However, NSync is not a primal-dual method. Besides NSync, the closest works to ours in terms of the generality of the sampling are PCDM [17], SPCDM [3] and APPROX [2]. All of these are randomized coordinate descent methods, and all were analyzed for arbitrary uniform samplings (i.e., samplings satisfying $\mathbb{P}(i \in \hat S) = \mathbb{P}(i' \in \hat S)$ for all $i, i' \in \{1, \dots, n\}$). Again, none of these methods were analyzed in a primal-dual framework.

In Section 2 we describe the algorithm, show that it admits a natural interpretation in terms of Fenchel duality, and discuss the flexibility of Quartz. We then proceed to Section 3, where we state the main result, specialize it to the samplings discussed in Section 2, and give a detailed comparison of our results with existing results for related primal-dual stochastic methods in the literature. In Section 4 we demonstrate how Quartz compares to other related methods through numerical experiments.

2 The Quartz Algorithm

Throughout the paper we consider the standard Euclidean norm, denoted by $\|\cdot\|$.
A function $\phi : \mathbb{R}^m \to \mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and has Lipschitz continuous gradient with Lipschitz constant $1/\gamma$: $\|\nabla\phi(x) - \nabla\phi(y)\| \le \frac{1}{\gamma}\|x - y\|$ for all $x, y \in \mathbb{R}^m$. A function $g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is 1-strongly convex if $g(w) \ge g(w') + \langle \nabla g(w'), w - w' \rangle + \frac{1}{2}\|w - w'\|^2$ for all $w, w' \in \mathrm{dom}(g)$, where $\mathrm{dom}(g)$ denotes the domain of $g$ and $\nabla g(w')$ is a subgradient of $g$ at $w'$.

The most important parameter of Quartz is a random sampling $\hat S$, which is a random subset of $[n] = \{1, 2, \dots, n\}$. The only assumption we make on the sampling $\hat S$ in this paper is the following:

Assumption 1 (Proper sampling) $\hat S$ is a proper sampling; that is,

$$p_i \stackrel{\text{def}}{=} \mathbb{P}(i \in \hat S) > 0, \qquad i \in [n]. \qquad (6)$$

This assumption guarantees that each block (dual variable) has a chance to get updated by the method. Prior to running the algorithm, we compute positive constants $v_1, \dots, v_n$ satisfying (5) to define the stepsize parameter $\theta$ used throughout the algorithm:

$$\theta = \min_i \frac{p_i \lambda\gamma n}{v_i + \lambda\gamma n}. \qquad (7)$$

Note from (5) that $\theta$ depends on both the data matrix $A$ and the sampling $\hat S$.
We shall show how to compute, in less than two passes over the data, a parameter $v$ satisfying (5) for some examples of samplings in Section 2.2.

2.1 Interpretation of Quartz through Fenchel duality

Algorithm 1 Quartz

Parameters: proper random sampling $\hat S$ and a positive vector $v \in \mathbb{R}^n$
Initialization: $\alpha^0 \in \mathbb{R}^N$; $w^0 \in \mathbb{R}^d$; $p_i = \mathbb{P}(i \in \hat S)$; $\theta = \min_i \frac{p_i \lambda\gamma n}{v_i + \lambda\gamma n}$; $\bar\alpha^0 = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^0$
for $t \ge 1$ do
  $w^t = (1 - \theta) w^{t-1} + \theta \nabla g^*(\bar\alpha^{t-1})$
  $\alpha^t = \alpha^{t-1}$
  Generate a random set $S_t \subseteq [n]$, following the distribution of $\hat S$
  for $i \in S_t$ do
    $\alpha_i^t = (1 - \theta p_i^{-1}) \alpha_i^{t-1} - \theta p_i^{-1} \nabla\phi_i(A_i^\top w^t)$
  end for
  $\bar\alpha^t = \bar\alpha^{t-1} + (\lambda n)^{-1} \sum_{i \in S_t} A_i (\alpha_i^t - \alpha_i^{t-1})$
end for
Output: $w^t, \alpha^t$

Quartz (Algorithm 1) has a natural interpretation in terms of Fenchel duality. Let $(w, \alpha) \in \mathbb{R}^d \times \mathbb{R}^N$ and define $\bar\alpha = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i$. The duality gap for the pair $(w, \alpha)$ can be decomposed as:

$$P(w) - D(\alpha) \stackrel{(1)+(2)}{=} \underbrace{\lambda \left( g(w) + g^*(\bar\alpha) - \langle w, \bar\alpha \rangle \right)}_{\mathrm{GAP}_g(w,\alpha)} + \frac{1}{n} \sum_{i=1}^n \underbrace{\left( \phi_i(A_i^\top w) + \phi_i^*(-\alpha_i) + \langle A_i^\top w, \alpha_i \rangle \right)}_{\mathrm{GAP}_{\phi_i}(w,\alpha_i)}.$$

By the Fenchel–Young inequality, $\mathrm{GAP}_g(w, \alpha) \ge 0$ and $\mathrm{GAP}_{\phi_i}(w, \alpha_i) \ge 0$ for all $i$, which proves weak duality for the problems (1) and (2), i.e., $P(w) \ge D(\alpha)$. The pair $(w, \alpha)$ is optimal when $\mathrm{GAP}_g$ and $\mathrm{GAP}_{\phi_i}$ for all $i$ are zero.
It is known that this happens precisely when the following optimality conditions hold:

$$w = \nabla g^*(\bar\alpha), \qquad (8)$$
$$\alpha_i = -\nabla\phi_i(A_i^\top w), \qquad i \in [n]. \qquad (9)$$

We will now interpret the primal and dual steps of Quartz in terms of the above discussion. It is easy to see that Algorithm 1 updates the primal and dual variables as follows:

$$w^t = (1 - \theta) w^{t-1} + \theta \nabla g^*(\bar\alpha^{t-1}), \qquad (10)$$

$$\alpha_i^t = \begin{cases} \left( 1 - \theta p_i^{-1} \right) \alpha_i^{t-1} + \theta p_i^{-1} \left( -\nabla\phi_i(A_i^\top w^t) \right), & i \in S_t, \\ \alpha_i^{t-1}, & i \notin S_t, \end{cases} \qquad (11)$$

where $\bar\alpha^{t-1} = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^{t-1}$, $\theta$ is the constant defined in (7), and $S_t \sim \hat S$ is a random subset of $[n]$. In other words, at iteration $t$ we first set the primal variable $w^t$ to be a convex combination of its current value $w^{t-1}$ and a value reducing $\mathrm{GAP}_g$ to zero: see (10). This is followed by adjusting the subset of dual variables corresponding to a randomly chosen set of examples $S_t$, so that for each example $i \in S_t$, the $i$-th dual variable $\alpha_i^t$ is set to be a convex combination of its current value $\alpha_i^{t-1}$ and a value reducing $\mathrm{GAP}_{\phi_i}$ to zero; see (11).

2.2 Flexibility of Quartz

Clearly, there are many ways in which the distribution of $\hat S$ can be chosen, leading to numerous variants of Quartz. The convex combination constant $\theta$ used throughout the algorithm should be tuned according to (7), where $v_1, \dots, v_n$ are constants satisfying (5).
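To make the updates (10) and (11) concrete, here is a minimal serial-sampling sketch (our own illustration, not the authors' code) for the special case of ridge regression: 1-smooth squared loss $\phi_i(s) = \frac{1}{2}(s - y_i)^2$ (so $\gamma = 1$) and $g(w) = \frac{1}{2}\|w\|^2$ (so $\nabla g^*$ is the identity), with the stepsize from (7) and $v_i = L_i = \|a_i\|^2$ as in (12); all data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.standard_normal((d, n))       # column a_i = A[:, i] is example i (m = 1)
y = rng.standard_normal(n)
lam, gamma = 0.1, 1.0                 # phi_i is (1/gamma)-smooth with gamma = 1

phi      = lambda s, yi: 0.5 * (s - yi) ** 2      # squared loss
dphi     = lambda s, yi: s - yi                   # its gradient
phi_star = lambda u, yi: 0.5 * u ** 2 + u * yi    # convex conjugate of phi_i

def primal(w):
    return np.mean([phi(A[:, i] @ w, y[i]) for i in range(n)]) + 0.5 * lam * (w @ w)

def dual(alpha):
    abar = A @ alpha / (lam * n)
    return -0.5 * lam * (abar @ abar) - np.mean([phi_star(-alpha[i], y[i]) for i in range(n)])

p = np.full(n, 1.0 / n)               # serial uniform sampling
v = np.sum(A * A, axis=0)             # v_i = L_i = ||a_i||^2, eq. (12)
theta = np.min(p * lam * gamma * n / (v + lam * gamma * n))   # stepsize, eq. (7)

w, alpha = np.zeros(d), np.zeros(n)
gap0 = primal(w) - dual(alpha)
for t in range(10000):
    abar = A @ alpha / (lam * n)
    w = (1 - theta) * w + theta * abar           # primal step (10); grad g* = identity here
    i = rng.integers(n)                          # S_t = {i}: one dual block per iteration
    alpha[i] = (1 - theta / p[i]) * alpha[i] - (theta / p[i]) * dphi(A[:, i] @ w, y[i])  # (11)

gap = primal(w) - dual(alpha)
print(gap0, gap)   # the expected gap contracts by a factor (1 - theta) per iteration
```

Consistent with the analysis, the printed duality gap ends up many orders of magnitude below its initial value; the contraction is slow per iteration because $\theta = \min_i \lambda\gamma/(v_i + \lambda\gamma n)$ is small, but each iteration touches only one example.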
Note that the best possible $v$ is obtained by computing the maximal eigenvalue of the matrix $(A^\top A) \circ P$, where $\circ$ denotes the Hadamard (component-wise) product of matrices and $P \in \mathbb{R}^{N \times N}$ is an $n$-by-$n$ block matrix with all elements in block $(i, j)$ equal to $\mathbb{P}(i \in \hat S, j \in \hat S)$; see [14]. However, the worst-case complexity of directly computing the maximal eigenvalue of $(A^\top A) \circ P$ is $O(N^2)$, which requires unreasonable preprocessing time in the context of machine learning, where $N$ is assumed to be very large. We now describe some examples of samplings $\hat S$ and show how to compute, in less than two passes over the data, the corresponding constants $v_1, \dots, v_n$. More examples, including distributed sampling, are presented in the supplementary material.

Serial sampling. The most studied sampling in the literature on stochastic optimization is the serial sampling, which corresponds to the selection of a single block $i \in [n]$. That is, $|\hat S| = 1$ with probability 1. The name "serial" points to the fact that a method using such a sampling will typically be a serial (as opposed to parallel) method, updating a single block (dual variable) at a time. A serial sampling is uniquely characterized by the vector of probabilities $p = (p_1, \dots, p_n)$, where $p_i$ is defined by (6). For a serial sampling $\hat S$, it is easy to see that (5) is satisfied for

$$v_i = L_i \stackrel{\text{def}}{=} \lambda_{\max}(A_i^\top A_i), \qquad i \in [n], \qquad (12)$$

where $\lambda_{\max}(\cdot)$ denotes the maximal eigenvalue.

Standard mini-batching. We now consider $\hat S$ which selects subsets of $[n]$ of cardinality $\tau$, uniformly at random. In the terminology established in [17], such an $\hat S$ is called $\tau$-nice. This sampling satisfies $p_i = p_j$ for all $i, j \in [n]$, and hence it is uniform. This sampling is well suited for parallel computing. Indeed, Quartz could be implemented as follows.
If we have $\tau$ processors available, then at the beginning of iteration $t$ we can assign each block (dual variable) in $S_t$ to a dedicated processor. The processor assigned to $i$ would then compute $\Delta\alpha_i^t$ and apply the update. If all processors have fast access to the memory where all the data is stored, as is the case in a shared-memory multicore workstation, then this way of assigning workload to the individual processors does not cause any major problems. For the $\tau$-nice sampling, (5) is satisfied for

$$v_i = \lambda_{\max}(M_i), \qquad M_i = \sum_{j=1}^d \left( 1 + \frac{(\omega_j - 1)(\tau - 1)}{n - 1} \right) A_{ji}^\top A_{ji}, \qquad i \in [n], \qquad (13)$$

where for each $j \in [d]$, $\omega_j$ is the number of nonzero blocks in the $j$-th row of matrix $A$, i.e.,

$$\omega_j \stackrel{\text{def}}{=} |\{i \in [n] : A_{ji} \ne 0\}|, \qquad j \in [d]. \qquad (14)$$

Note that (13) follows from an extension of a formula given in [2] from $m = 1$ to $m \ge 1$.

3 Main Result

The complexity of our method is given by the following theorem. The proof can be found in the supplementary material.

Theorem 2 (Main Result) Assume that $g$ is 1-strongly convex and that for each $i \in [n]$, $\phi_i$ is convex and $(1/\gamma)$-smooth. Let $\hat S$ be a proper sampling (Assumption 1) and $v_1, \dots, v_n$ be positive scalars satisfying (5). Then the sequence of primal and dual variables $\{w^t, \alpha^t\}_{t \ge 0}$ of Quartz (Algorithm 1) satisfies:

$$\mathbb{E}[P(w^t) - D(\alpha^t)] \le (1 - \theta)^t \left( P(w^0) - D(\alpha^0) \right), \qquad (15)$$

where $\theta$ is defined in (7).
In particular, if we fix $\epsilon \le P(w^0) - D(\alpha^0)$, then

$$T \ge \max_i \left( \frac{1}{p_i} + \frac{v_i}{p_i \lambda\gamma n} \right) \log\!\left( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \right) \;\Longrightarrow\; \mathbb{E}[P(w^T) - D(\alpha^T)] \le \epsilon. \qquad (16)$$

In order to put the above result into context, in the rest of this section we specialize it to two particular samplings: a serial sampling and the $\tau$-nice sampling.

3.1 Quartz with serial sampling

When $\hat S$ is a serial sampling, we just need to plug (12) into (16) to derive the bound

$$T \ge \max_i \left( \frac{1}{p_i} + \frac{L_i}{p_i \lambda\gamma n} \right) \log\!\left( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \right) \;\Longrightarrow\; \mathbb{E}[P(w^T) - D(\alpha^T)] \le \epsilon. \qquad (17)$$

If, in addition, $\hat S$ is uniform, then $p_i = 1/n$ for all $i \in [n]$, and we refer to this special case of Quartz as Quartz-U. By setting $p_i = 1/n$ in (17) we obtain directly the complexity of Quartz-U:

$$T \ge \left( n + \frac{\max_i L_i}{\lambda\gamma} \right) \log\!\left( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \right) \;\Longrightarrow\; \mathbb{E}[P(w^T) - D(\alpha^T)] \le \epsilon. \qquad (18)$$

Otherwise, we can seek to minimize the complexity bound in (17) with respect to the sampling probabilities $p$ to obtain the best bound. A simple calculation reveals that the optimal probability is given by:

$$\mathbb{P}(\hat S = \{i\}) = p_i^* \stackrel{\text{def}}{=} \frac{L_i + \lambda\gamma n}{\sum_{j=1}^n (L_j + \lambda\gamma n)}. \qquad (19)$$

We shall call Quartz-IP the algorithm obtained by using the above serial sampling probability.
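As a quick numerical illustration (with made-up values of $L_i$, not from the paper), one can compare the leading factor of (17) under uniform probabilities with the one under the optimal probabilities (19). The optimal choice equalizes the factor across $i$ and recovers the constant $n + \sum_i L_i/(n\lambda\gamma)$ appearing in (20) below:

```python
import numpy as np

L = np.array([1.0, 10.0, 100.0])   # hypothetical eigenvalues L_i = lambda_max(A_i^T A_i)
n = len(L)
lam, gamma = 0.1, 1.0
lgn = lam * gamma * n

def factor(p):
    # leading factor max_i (1/p_i + L_i / (p_i * lambda * gamma * n)) in (17)
    return np.max(1.0 / p + L / (p * lgn))

p_unif = np.full(n, 1.0 / n)
p_opt  = (L + lgn) / np.sum(L + lgn)          # importance sampling, eq. (19)
print(factor(p_unif), factor(p_opt))          # the optimal sampling gives the smaller factor
```

For these values the uniform factor is driven by $\max_i L_i$, while the optimal factor depends only on the average of the $L_i$, in line with the contrast between (18) and (20).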
The following complexity result for Quartz-IP can be derived easily by plugging (19) into (17):

$$T \ge \left( n + \frac{\sum_{i=1}^n L_i}{n \lambda\gamma} \right) \log\!\left( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \right) \;\Longrightarrow\; \mathbb{E}[P(w^T) - D(\alpha^T)] \le \epsilon. \qquad (20)$$

Note that, in contrast with the complexity result (18) of Quartz-U, we now have dependence on the average of the eigenvalues $L_i$.

Quartz-U vs Prox-SDCA. Quartz-U should be compared to Proximal Stochastic Dual Coordinate Ascent (Prox-SDCA) [22, 20]. Indeed, the dual update of Prox-SDCA takes exactly the same form as that of Quartz-U;¹ see (11). The main difference is how the primal variable $w^t$ is updated: while Quartz performs the update (10), Prox-SDCA (see also [24, 5]) performs the more aggressive update $w^t = \nabla g^*(\bar\alpha^{t-1})$, and the complexity result of Prox-SDCA is as follows:

$$T \ge \left( n + \frac{\max_i L_i}{\lambda\gamma} \right) \log\!\left( \left( n + \frac{\max_i L_i}{\lambda\gamma} \right) \frac{D(\alpha^*) - D(\alpha^0)}{\epsilon} \right) \;\Longrightarrow\; \mathbb{E}[P(w^T) - D(\alpha^T)] \le \epsilon, \qquad (21)$$

where $\alpha^*$ is the dual optimal solution. Notice that the dominant terms in (18) and (21) exactly match, although our logarithmic term is better and simpler. This is due to a direct bound on the decrease of the primal-dual error of Quartz, without the need to first analyze the dual error, in contrast to the typical approach for most dual coordinate ascent methods [22, 23, 20, 21, 9].

Quartz-IP vs Iprox-SDCA. The importance sampling (19) was previously used in the algorithm Iprox-SDCA [27], which extends Prox-SDCA to non-uniform serial samplings.
The complexity of Quartz-IP (20) should then be compared with the following complexity result of Iprox-SDCA [27]:

$$T \ge \left( n + \frac{\sum_{i=1}^n L_i}{n \lambda\gamma} \right) \log\!\left( \left( n + \frac{\sum_{i=1}^n L_i}{n \lambda\gamma} \right) \frac{D(\alpha^*) - D(\alpha^0)}{\epsilon} \right) \;\Longrightarrow\; \mathbb{E}[P(w^T) - D(\alpha^T)] \le \epsilon. \qquad (22)$$

Again, the dominant terms in (20) and (22) exactly match, but our logarithmic term is smaller.

3.2 Quartz with $\tau$-nice sampling (standard mini-batching)

We now specialize Theorem 2 to the case of the $\tau$-nice sampling. We define $\tilde\omega$ such that:

$$\max_i \lambda_{\max}\!\left( \sum_{j=1}^d \left( 1 + \frac{(\omega_j - 1)(\tau - 1)}{n - 1} \right) A_{ji}^\top A_{ji} \right) = \left( 1 + \frac{(\tilde\omega - 1)(\tau - 1)}{n - 1} \right) \max_i L_i.$$

It is clear that $1 \le \tilde\omega \le \max_j \omega_j \le n$, so $\tilde\omega$ can be considered a measure of the density of the data. By plugging (13) into (16) we obtain directly the following corollary.

Corollary 3 Assume $\hat S$ is the $\tau$-nice sampling and $v$ is chosen as in (13). If we let $\epsilon \le P(w^0) - D(\alpha^0)$ and

$$T \ge \left( \frac{n}{\tau} + \frac{\left( 1 + \frac{(\tilde\omega - 1)(\tau - 1)}{n - 1} \right) \max_i L_i}{\lambda\gamma\tau} \right) \log\!\left( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \right), \qquad (23)$$

then $\mathbb{E}[P(w^T) - D(\alpha^T)] \le \epsilon$.

Let us now have a detailed look at the above result, especially in terms of how it compares with the serial uniform case (18). For fully sparse data, we get perfect linear speedup: the bound in (23) is a $1/\tau$ fraction of the bound in (18). For fully dense data, the condition number ($\kappa \stackrel{\text{def}}{=} \max_i L_i / (\lambda\gamma)$) is unaffected by mini-batching.
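The quantities entering (13), (14) and Corollary 3 are cheap to compute: one pass over the data gives the $\omega_j$, another gives the $v_i$. A small sketch for $m = 1$ with synthetic sparse data (our illustration; $\tilde\omega$ is recovered from its implicit definition above, which for $m = 1$ reduces to a ratio of maxima):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, tau = 40, 15, 8
A = rng.standard_normal((d, n)) * (rng.random((d, n)) < 0.2)   # sparse data, m = 1

omega = (A != 0).sum(axis=1)                         # omega_j, eq. (14)
scale = 1 + (omega - 1) * (tau - 1) / (n - 1)
v = (scale[:, None] * A * A).sum(axis=0)             # v_i, eq. (13); for m = 1 no eigenvalue needed
L = (A * A).sum(axis=0)                              # L_i = ||a_i||^2

# tilde-omega from: max_i v_i = (1 + (omega_t - 1)(tau - 1)/(n - 1)) * max_i L_i
omega_t = 1 + (v.max() / L.max() - 1) * (n - 1) / (tau - 1)

lam, gamma = 0.05, 1.0
T_batch  = n / tau + (1 + (omega_t - 1) * (tau - 1) / (n - 1)) * L.max() / (lam * gamma * tau)  # (23)
T_serial = n + L.max() / (lam * gamma)                                                          # (18)
print(omega_t, T_serial / T_batch)   # predicted mini-batch speedup over the serial bound
```

By construction $v_i \ge L_i$ for every $i$, and $1 \le \tilde\omega \le \max_j \omega_j$, so the printed speedup is greater than 1 for this sparse instance.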
For general data, the behaviour of Quartz with $\tau$-nice sampling interpolates between these two extreme cases. It is important to note that regardless of the condition number $\kappa$, as long as $\tau \le 1 + (n - 1)/(\tilde\omega - 1)$, the bound in (23) is at most a $2/\tau$ fraction of the bound in (18). Hence, for sparser problems, Quartz can achieve linear speedup for larger mini-batch sizes.

¹ In [20] the authors proposed five options for the dual updating rule. Our dual updating formula (11) should be compared with Option V of Prox-SDCA. For the same reason as given in the beginning of [20, Appendix A], Quartz implemented with the other four options achieves the same complexity result as Theorem 2.

3.3 Quartz vs existing primal-dual mini-batch methods

We now compare the above result with existing mini-batch stochastic dual coordinate ascent methods. The mini-batch variants of SDCA, to which Quartz with $\tau$-nice sampling can be naturally compared, have been proposed and analyzed previously in [23], [21] and [26]. In [23], the authors proposed to use so-called safe mini-batching, which is precisely equivalent to finding a stepsize parameter $v$ satisfying (5) (in the special case of $\tau$-nice sampling). However, they only analyzed the case where the functions $\{\phi_i\}_i$ are non-smooth. In [21], the authors studied accelerated mini-batch SDCA (ASDCA), specialized to the case when the regularizer $g$ is the squared L2 norm. They showed that the complexity of ASDCA interpolates between that of SDCA and accelerated gradient descent (AGD) [13] as the mini-batch size $\tau$ varies. In [26], the authors proposed a mini-batch extension of their stochastic primal-dual coordinate algorithm (SPDC). Both ASDCA and SPDC reach the same complexity as AGD when the mini-batch size equals $n$, and thus should be considered accelerated algorithms.²
The complexity bounds for all these algorithms are summarized in Table 1. In Table 2 we compare the complexities of SDCA, ASDCA, SPDC and Quartz in several regimes.

Algorithm | Iteration complexity | $g$
SDCA [22] | $n + \frac{1}{\lambda\gamma}$ | $\frac{1}{2}\|\cdot\|^2$
ASDCA [21] | $4 \times \max\left\{ \frac{n}{\tau}, \sqrt{\frac{n}{\lambda\gamma\tau}}, \frac{1}{\lambda\gamma\tau}, \frac{n^{1/3}}{(\lambda\gamma\tau)^{2/3}} \right\}$ | $\frac{1}{2}\|\cdot\|^2$
SPDC [26] | $\frac{n}{\tau} + \sqrt{\frac{n}{\lambda\gamma\tau}}$ | general
Quartz with $\tau$-nice sampling | $\frac{n}{\tau} + \left( 1 + \frac{(\tilde\omega - 1)(\tau - 1)}{n - 1} \right) \frac{1}{\lambda\gamma\tau}$ | general

Table 1: Comparison of the iteration complexity of several primal-dual algorithms performing stochastic coordinate ascent steps in the dual using a mini-batch of examples of size $\tau$ (with the exception of SDCA, which is a serial method using $\tau = 1$).

Algorithm | $\gamma\lambda n = \Theta(1/\sqrt n)$ | $\gamma\lambda n = \Theta(1)$ | $\gamma\lambda n = \Theta(\tau)$ | $\gamma\lambda n = \Theta(\sqrt n)$
SDCA [22] | $n^{3/2}$ | $n$ | $n$ | $n$
ASDCA [21] | $n^{3/2}/\tau + n^{5/4}/\sqrt\tau + n^{4/3}/\tau^{2/3}$ | $n/\sqrt\tau$ | $n/\tau$ | $n/\tau + n^{3/4}/\sqrt\tau$
SPDC [26] | $n^{3/2}/\tau + n^{5/4}/\sqrt\tau$ | $n/\sqrt\tau$ | $n/\tau$ | $n/\tau + n^{3/4}/\sqrt\tau$
Quartz ($\tau$-nice) | $n^{3/2}/\tau + \tilde\omega\sqrt n$ | $n/\tau + \tilde\omega$ | $n/\tau$ | $n/\tau + \tilde\omega/\sqrt n$

Table 2: Comparison of the leading factors in the complexity bounds of several methods in several regimes.

Looking at Table 2, we see that in the $\gamma\lambda n = \Theta(\tau)$ regime (i.e., if the condition number is $\kappa = \Theta(n/\tau)$), Quartz matches the linear speedup (when compared to SDCA) of ASDCA and SPDC. When the condition number is roughly equal to the sample size ($\kappa$
$= \Theta(n)$), then Quartz does better than both ASDCA and SPDC as long as $n/\tau + \tilde\omega \le n/\sqrt\tau$. In particular, this is the case when the data is sparse: $\tilde\omega \le n/\sqrt\tau$. If the data is even sparser (and in many big data applications one has $\tilde\omega = O(1)$) and we have $\tilde\omega \le n/\tau$, then Quartz significantly outperforms both ASDCA and SPDC. Note that Quartz can be better than both ASDCA and SPDC even in the domain of accelerated methods, that is, when the condition number is larger than the number of examples: $\kappa = 1/(\gamma\lambda) \ge n$. Indeed, we have the following result:

Proposition 4 Assume that $n\lambda\gamma \le 1$ and that $\max_i L_i = 1$. If the data is sufficiently sparse so that

$$\lambda\gamma\tau n \ge \left( 1 + n\lambda\gamma + \frac{(\tilde\omega - 1)(\tau - 1)}{n - 1} \right)^2, \qquad (24)$$

then the iteration complexity (in $\tilde O$ order) of Quartz is better than that of ASDCA and SPDC.

The result can be interpreted as follows: if $n \le \kappa \le \tau n / (1 + n/\kappa)^2$ (that is, $\tau \ge \lambda\gamma\tau n \ge (1 + n\lambda\gamma)^2$), then there are sparse-enough problems for which Quartz is better than both ASDCA and SPDC.

² APCG [9] also reaches an accelerated convergence rate but was not proposed in the mini-batch setting.

4 Experimental Results

In this section we demonstrate how Quartz specialized to different samplings compares with other methods. All of our experiments are performed with $m = 1$, for smoothed hinge-loss functions $\{\phi_i\}$ with $\gamma = 1$ and squared L2-regularizer $g$; see [20]. The experiments were performed on the three datasets reported in Table 3, and on three randomly generated large datasets [12] with $n = 100{,}000$ examples and $d = 100{,}000$ features of varying sparsity.
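For reference, the $\gamma$-smoothed hinge loss used in these experiments has the standard closed form below (our transcription, following [20]). Rather than taking the transcription on faith, the sketch checks it against its conjugate $\phi^*(u) = u + \frac{\gamma}{2}u^2$ on $[-1, 0]$ through the biconjugate identity $\phi(a) = \max_u \{au - \phi^*(u)\}$:

```python
import numpy as np

GAMMA = 1.0   # the experiments use gamma = 1

def phi(a, g=GAMMA):
    # gamma-smoothed hinge loss: zero past the margin, linear far inside, quadratic in between
    if a >= 1.0:
        return 0.0
    if a <= 1.0 - g:
        return 1.0 - a - g / 2.0
    return (1.0 - a) ** 2 / (2.0 * g)

def phi_conj(u, g=GAMMA):
    # convex conjugate of the smoothed hinge loss, finite only on [-1, 0]
    return u + g * u ** 2 / 2.0 if -1.0 <= u <= 0.0 else np.inf

# check phi(a) = max_{u in [-1, 0]} (a*u - phi_conj(u)) on a fine grid of u
us = np.linspace(-1.0, 0.0, 100001)
conj_vals = us + GAMMA * us ** 2 / 2.0
for a in np.linspace(-3.0, 3.0, 61):
    assert abs(phi(a) - np.max(a * us - conj_vals)) < 1e-6
print("biconjugate check passed")
```

The closed-form conjugate is what makes the dual update (11) a single cheap gradient evaluation per sampled example in this setting.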
In Figure 1 we compare Quartz specialized to serial sampling, with both uniform and optimal sampling, against Prox-SDCA and Iprox-SDCA (previously discussed in Section 3.1) on three datasets. Due to the conservative primal update in Quartz, Quartz-U appears to be slower than Prox-SDCA in practice. Nevertheless, in all the experiments, Quartz-IP shows almost identical convergence behaviour to that of Iprox-SDCA. In Figure 2 we compare Quartz specialized to $\tau$-nice sampling with mini-batch SPDC for different values of $\tau$, in the domain of accelerated methods ($\kappa = 10n$). The datasets are randomly generated following [13, Section 6]. When $\tau = 1$, SPDC clearly outperforms Quartz, as the condition number is larger than $n$. However, as $\tau$ increases, the amount of data processed by SPDC grows by a factor of $\sqrt\tau$, as predicted by its theory, while the amount of data processed by Quartz remains almost the same, by taking advantage of the large sparsity of the data. Hence, Quartz is much better in the large-$\tau$ regime.

Dataset | Training size $n$ | Features $d$ | Sparsity (nnz/(nd))
cov1 | 522,911 | 54 | 22.22%
w8a | 49,749 | 300 | 3.91%
ijcnn1 | 49,990 | 22 | 59.09%

Table 3: Datasets used in our experiments.

Figure 1: Comparison of Quartz-U (uniform sampling), Quartz-IP (optimal importance sampling), Prox-SDCA (uniform sampling) and Iprox-SDCA (optimal importance sampling). Panels: (a) cov1, $n = 522{,}911$, $\lambda = 10^{-6}$; (b) w8a, $n = 49{,}749$, $\lambda = 10^{-5}$; (c) ijcnn1, $n = 49{,}990$, $\lambda = 10^{-5}$.

Figure 2: Comparison of Quartz with SPDC for different mini-batch sizes $\tau$ in the regime $\kappa = 10n$. Panels: (a) Rand1, (b) Rand2, (c) Rand3, each with $n = 10^5$ and $\lambda = 10^{-6}$.
The three random datasets Rand1, Rand2 and Rand3 have respective sparsity 0.01%, 0.1% and 1%.

[Plots: primal-dual gap versus number of epochs for each method and dataset.]

References

[1] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.

[2] O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. SIAM Journal on Optimization (after minor revision), arXiv:1312.5799, 2013.

[3] O. Fercoq and P. Richtárik. Smooth minimization of nonsmooth functions by parallel coordinate descent. arXiv:1309.5885, 2013.

[4] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S.S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proc. of the 25th International Conference on Machine Learning (ICML '08), pages 408–415, 2008.

[5] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M.I. Jordan. Communication-efficient distributed dual coordinate ascent.
In Advances in Neural Information Processing Systems 27, pages 3068–3076. Curran Associates, Inc., 2014.

[6] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323, 2013.

[7] J. Konečný, J. Lu, P. Richtárik, and M. Takáč. mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv:1410.4744, 2014.

[8] J. Konečný and P. Richtárik. S2GD: Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.

[9] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical Report MSR-TR-2014-94, July 2014.

[10] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim., 25(2):829–855, 2015.

[11] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2008.

[12] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim., 22(2):341–362, 2012.

[13] Y. Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140(1, Ser. B):125–161, 2013.

[14] Z. Qu and P. Richtárik. Coordinate descent methods with arbitrary sampling II: Expected separable overapproximation. arXiv:1412.8063, 2014.

[15] Z. Qu and P. Richtárik. Coordinate descent methods with arbitrary sampling I: Algorithms and complexity. arXiv:1412.8060, 2014.

[16] P. Richtárik and M. Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, published online 2015.

[17] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Math. Program., published online 2015.

[18] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951.

[19] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.

[20] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.

[21] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, pages 378–385, 2013.

[22] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res., 14(1):567–599, February 2013.

[23] M. Takáč, A.S. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proc. of the 30th International Conference on Machine Learning (ICML-13), pages 1022–1030, 2013.

[24] T. Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, pages 629–637, 2013.

[25] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proc. of the 21st International Conference on Machine Learning (ICML-04), pages 919–926, 2004.

[26] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proc. of the 32nd International Conference on Machine Learning (ICML-15), pages 353–361, 2015.

[27] P. Zhao and T. Zhang. Stochastic optimization with importance sampling.
In Proc. of the 32nd International Conference on Machine Learning (ICML-15), 2015.