{"title": "Bregman Divergence for Stochastic Variance Reduction: Saddle-Point and Adversarial Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 6031, "page_last": 6041, "abstract": "Adversarial machines, where a learner competes against an adversary, have regained much recent interest in machine learning. They are naturally in the form of saddle-point optimization, often with separable structure but sometimes also with unmanageably large dimension. In this work we show that adversarial prediction under multivariate losses can be solved much faster than they used to be. We first reduce the problem size exponentially by using appropriate sufficient statistics, and then we adapt the new stochastic variance-reduced algorithm of Balamurugan & Bach (2016) to allow any Bregman divergence. We prove that the same linear rate of convergence is retained and we show that for adversarial prediction using KL-divergence we can further achieve a speedup of #example times compared with the Euclidean alternative. We verify the theoretical findings through extensive experiments on two example applications: adversarial prediction and LPboosting.", "full_text": "Bregman Divergence for Stochastic Variance\n\nReduction: Saddle-Point and Adversarial Prediction\n\nZhan Shi\n\nXinhua Zhang\n\nUniversity of Illinois at Chicago\n{zshi22,zhangx}@uic.edu\n\nChicago, Illinois 60661\n\nYaoliang Yu\n\nUniversity of Waterloo\nWaterloo, ON, N2L3G1\n\nyaoliang.yu@uwaterloo.ca\n\nAbstract\n\nAdversarial machines, where a learner competes against an adversary, have re-\ngained much recent interest in machine learning. They are naturally in the form of\nsaddle-point optimization, often with separable structure but sometimes also with\nunmanageably large dimension. In this work we show that adversarial prediction\nunder multivariate losses can be solved much faster than they used to be. 
We first reduce the problem size exponentially by using appropriate sufficient statistics, and then we adapt the new stochastic variance-reduced algorithm of Balamurugan & Bach (2016) to allow any Bregman divergence. We prove that the same linear rate of convergence is retained, and we show that for adversarial prediction using KL-divergence we can further achieve a speedup of #example times compared with the Euclidean alternative. We verify the theoretical findings through extensive experiments on two example applications: adversarial prediction and LPboosting.

1 Introduction

Many algorithmic advances have been achieved in machine learning by finely leveraging the separability in the model. For example, stochastic gradient descent (SGD) algorithms typically exploit the fact that the objective is an expectation of a random function, with each component corresponding to a training example. A "dual" approach partitions the problem into blocks of coordinates and processes them in a stochastic fashion [1]. Recently, by exploiting the finite-sum structure of the model, variance-reduction based stochastic methods have surpassed the well-known sublinear lower bound of SGD. Examples include SVRG [2], SAGA [3], SAG [4], Finito [5], MISO [6], and SDCA [7, 8], just to name a few. Specialized algorithms have also been proposed for accommodating proximal terms [9], and for further acceleration through the condition number [10-13].
However, not all empirical risks are separable in their plain form, and in many cases dualization is necessary for achieving separability.
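Before moving on, the variance-reduction idea underlying SVRG-type methods can be illustrated concretely. The following minimal sketch uses a toy one-dimensional finite sum of our own choosing (all names are ours, not from the paper or [2]); it forms the control-variate gradient estimator and checks that averaging it over the components recovers the full gradient, i.e. unbiasedness:

```python
import numpy as np

# Toy finite sum f(w) = (1/n) sum_k f_k(w) with f_k(w) = (w - a_k)^4 / 4,
# so grad f_k(w) = (w - a_k)^3.  The SVRG estimator
#   v_t = grad f_xi(w_t) - grad f_xi(w_tilde) + grad f(w_tilde)
# uses one random component plus a full gradient computed once per epoch at
# the pivot w_tilde, and is unbiased: E_xi[v_t] = grad f(w_t).

rng = np.random.default_rng(0)
a = rng.normal(size=100)            # data defining the components f_k

def full_grad(w):
    return np.mean((w - a) ** 3)

def svrg_estimate(w, w_tilde, mu_tilde, k):
    return (w - a[k]) ** 3 - (w_tilde - a[k]) ** 3 + mu_tilde

w_tilde = 3.0
mu_tilde = full_grad(w_tilde)       # the once-per-epoch full gradient
w = 2.5
# Averaging the estimator over all components recovers grad f(w) exactly,
# which is the unbiasedness property E_xi[v_t] = grad f(w_t).
est = np.mean([svrg_estimate(w, w_tilde, mu_tilde, k) for k in range(len(a))])
```

The closer the pivot is to the optimum, the smaller the variance of the estimator, which is what enables linear rates.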
This leads to a composite saddle-point problem with convex-concave (saddle) functions K and M:

(x*, y*) = arg min_x max_y K(x, y) + M(x, y),  where  K(x, y) = (1/n) Σ_{k=1}^n ψ_k(x, y).   (1)

Most commonly used supervised losses for linear models can be written as g*(Xw), where g* is the Fenchel dual of a convex function g, X is the design matrix, and w is the model vector. So the regularized risk minimization can be naturally written as min_w max_α α'Xw + Ω(w) − g(α), where Ω is a regularizer. This fits into our framework (1) with a bilinear function K and a decoupled function M. Optimization for this specific form of saddle-point problems has been extensively studied. For example, [14] and [15] performed batch updates on w and stochastic updates on α, while [16] and [17] performed doubly stochastic updates on both w and α, achieving O(1/ε) and O(log 1/ε) rates respectively. The latter two also studied the more general form (1). Our interest in this paper is double stochasticity, aiming to maximally harness the power of separability and stochasticity.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Adversarial machines, where the learner competes against an adversary, have regained much recent interest in machine learning [18-20]. On one hand they fit naturally into the saddle-point optimization framework (1), but on the other hand they are known to be notoriously challenging to solve. The central message of this work is that certain adversarial machines can be solved significantly faster than previously possible.
Key to our development is a new extension of the stochastic variance-reduced algorithm in [17] such that it is compatible with any Bregman divergence, hence opening the possibility to largely reduce the quadratic condition number in [17] by better adapting to the underlying geometry using non-Euclidean norms and Bregman divergences.
Improving condition numbers by Bregman divergence has long been studied in (stochastic, proximal) gradient descent [21, 22]. The best known algorithm is arguably stochastic mirror descent [23], which was extended to saddle-points by [16] and to ADMM by [24]. However, these methods can only achieve the sublinear rate O(1/ε) (for an ε-accurate solution). On the other hand, many recent stochastic variance-reduced methods [2-6, 9, 17] that achieve the much faster linear rate O(log 1/ε) rely inherently on the Euclidean structure, and their extension to Bregman divergence, although conceptually clear, remains challenging in terms of the analysis. For example, the analysis of [17] relied on the resolvent of monotone operators [25] and is hence restricted to the Euclidean norm. In §2 we extend the notion of Bregman divergence to saddle functions and we prove a new Pythagorean theorem that may be of independent interest for analyzing first order algorithms. In §4 we introduce a fundamentally different proof technique (details relegated to Appendix C) that overcomes several challenges arising from a general Bregman divergence (e.g. asymmetry and unbounded gradient on a bounded domain), and we recover a similar quantitative linear rate of convergence as [17], but with the flexibility of using suitable Bregman divergences to reduce the condition number.
The new stochastic variance-reduced algorithm Breg-SVRG is then applied to the adversarial prediction framework (with multivariate losses such as F-score) [19, 20].
Here we make three novel contributions: (a) We provide a significant reformulation of the adversarial prediction problem that reduces the dimension of the optimization variable from 2^n to n^2 (where n is the number of samples), hence making it amenable to stochastic variance-reduced optimization (§3). (b) We develop a new efficient algorithm for computing the proximal update with a separable saddle KL-divergence (§5). (c) We verify that Breg-SVRG accelerates its Euclidean alternative by a factor of n in both theory and practice (§6), hence confirming again the utmost importance of adapting to the underlying problem geometry. To our best knowledge, this is the first time stochastic variance-reduced methods have been shown with great promise in optimizing adversarial machines.
Finally, we mention that we expect our algorithm Breg-SVRG to be useful for solving many other saddle-point problems, and we provide a second example (LPboosting) in the experiments (§6).

2 Bregman Divergence and Saddle Functions

In this section we set up some notation, recall some background material, and extend Bregman divergences to saddle functions, a key notion in our later analysis.
Bregman divergence. For any convex and differentiable function ψ over some closed convex set C ⊆ R^d, its induced Bregman divergence is defined as:

∀x ∈ int(C), x' ∈ C:  Δ_ψ(x', x) := ψ(x') − ψ(x) − ⟨∇ψ(x), x' − x⟩,   (2)

where ∇ψ is the gradient and ⟨·,·⟩ is the standard inner product in R^d. Clearly, Δ_ψ(x', x) ≥ 0 since ψ is convex.
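Definition (2) can be checked numerically; the following minimal sketch (function names are ours, for illustration only) evaluates the induced divergence for two standard choices of ψ and confirms the closed forms they induce:

```python
import numpy as np

# Numerical illustration of definition (2):
#   Delta_psi(x', x) = psi(x') - psi(x) - <grad psi(x), x' - x>.

def bregman(psi, grad_psi, xp, x):
    return psi(xp) - psi(x) - np.dot(grad_psi(x), xp - x)

# psi(x) = 0.5 * ||x||_2^2  ->  squared Euclidean distance
sq = lambda x: 0.5 * np.dot(x, x)
sq_grad = lambda x: x

# psi(x) = sum_i x_i log x_i  ->  unnormalized KL-divergence
neg_ent = lambda x: np.sum(x * np.log(x))
neg_ent_grad = lambda x: np.log(x) + 1.0

x  = np.array([0.2, 0.3, 0.5])
xp = np.array([0.1, 0.6, 0.3])

d_euc = bregman(sq, sq_grad, xp, x)        # equals 0.5 * ||xp - x||_2^2
d_kl  = bregman(neg_ent, neg_ent_grad, xp, x)
# d_kl equals sum_i xp_i*log(xp_i/x_i) - xp_i + x_i, and both are nonnegative.
```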
We mention two familiar examples of Bregman divergence.
• Squared Euclidean distance: Δ_ψ(x', x) = ½‖x' − x‖₂², induced by ψ(x) = ½‖x‖₂², where ‖·‖₂ is the ℓ₂ norm.
• (Unnormalized) KL-divergence: Δ_ψ(x', x) = Σ_i [x'_i log(x'_i/x_i) − x'_i + x_i], induced by ψ(x) = Σ_i x_i log x_i.

Strong convexity. Following [26] we call a function f ψ-convex if f − ψ is convex, i.e. for all x, x':

f(x') ≥ f(x) + ⟨∂f(x), x' − x⟩ + Δ_ψ(x', x).   (3)

Smoothness. A function f is L-smooth w.r.t. a norm ‖·‖ if its gradient ∇f is L-Lipschitz continuous, i.e., for all x and x', ‖∇f(x') − ∇f(x)‖_* ≤ L‖x' − x‖, where ‖·‖_* is the dual norm of ‖·‖. The change of a smooth function, in terms of its induced Bregman divergence, can be upper bounded by the change of its input and lower bounded by the change of its slope, cf. Lemma 2 in Appendix A.

Saddle functions. Recall that a function φ(x, y) over C_z = C_x × C_y is called a saddle function if it is convex in x for any y ∈ C_y, and concave in y for any x ∈ C_x. Given a saddle function φ, we call (x*, y*) its saddle point if

∀x ∈ C_x, ∀y ∈ C_y:  φ(x*, y) ≤ φ(x*, y*) ≤ φ(x, y*),   (4)

or equivalently (x*, y*) ∈ arg min_{x∈C_x} max_{y∈C_y} φ(x, y). Assuming φ is differentiable, we denote

G_φ(x, y) := [∂_x φ(x, y); −∂_y φ(x, y)].   (5)

Note the negation sign due to the concavity in y. We can quantify the notion of "saddle": a function f(x, y) is called φ-saddle iff f − φ is a saddle function, or equivalently, Δ_f(z', z) ≥ Δ_φ(z', z) (see below). Note that any saddle function φ is both 0-saddle and φ-saddle.
Bregman divergence for saddle functions. We now define the Bregman divergence induced by a saddle function φ: for z = (x, y) and z' = (x', y') in C_z,

Δ_φ(z', z) := Δ_{φ_y}(x', x) + Δ_{−φ_x}(y', y) = φ(x', y) − φ(x, y') − ⟨G_φ(z), z' − z⟩,   (6)

where φ_y(x) = φ(x, y) is a convex function of x for any fixed y, and similarly φ_x(y) = φ(x, y) is a concave (hence the negation) function of y for any fixed x. The similarity between (6) and the usual Bregman divergence Δ_ψ in (2) is apparent. However, φ is never evaluated at z' but at z (for G) and at the cross pairs (x', y) and (x, y'). Key to our subsequent analysis is the following lemma, which extends a result of [27] to saddle functions (proof in Appendix A).
Lemma 1. Let f and g be φ-saddle and ϕ-saddle respectively, with one of them differentiable. Then, for any z = (x, y) and any saddle point (if it exists) z* := (x*, y*) ∈ arg min_x max_y {f(z) + g(z)}, we have f(x, y*) + g(x, y*) ≥ f(x*, y) + g(x*, y) + Δ_{φ+ϕ}(z, z*).
Geometry of norms. In the sequel, we will design two convex functions ψ_x(x) and ψ_y(y) such that their induced Bregman divergences are "distance enforcing" (a.k.a.
1-strongly convex), that is, w.r.t. two norms ‖·‖_x and ‖·‖_y that we also design, the following inequalities hold:

Δ_x(x, x') := Δ_{ψ_x}(x, x') ≥ ½‖x − x'‖_x²,   Δ_y(y, y') := Δ_{ψ_y}(y, y') ≥ ½‖y − y'‖_y².   (7)

Further, for z = (x, y), we define

Δ_z(z, z') := Δ_{ψ_x − ψ_y}(z, z') ≥ ½‖z − z'‖_z²,  where  ‖z‖_z² := ‖x‖_x² + ‖y‖_y².   (8)

When it is clear from the context, we simply omit the subscripts and write Δ, ‖·‖, and ‖·‖_*.

3 Adversarial Prediction under Multivariate Loss

A number of saddle-point based machine learning problems have been listed in [17]. Here we give another example (adversarial prediction under multivariate loss) that is naturally formulated as a saddle-point problem but also requires a careful adaptation to the underlying geometry, a challenge that was not addressed in [17] since their algorithm inherently relies on the Euclidean norm. We remark that adaptation to the underlying geometry has been studied in the (stochastic) mirror descent framework [23], with significant improvements on condition numbers or gradient norm bounds. Surprisingly, no analogous efforts have been attempted in the stochastic variance reduction framework, a gap we intend to fill in this work.
The adversarial prediction framework [19, 20, 28], arising naturally as a saddle-point problem, is a convex alternative to the generative adversarial net [18]. Given a training sample X = [x_1, ..., x_n] and ỹ = [ỹ_1, ..., ỹ_n] ∈ {0, 1}^n, adversarial prediction optimizes the following saddle function, which is an expectation of some multivariate loss ℓ(y, z) (e.g.
F-score) over the labels y, z ∈ {0, 1}^n of all data points:

min_{p∈Δ_{2^n}} max_{q∈Δ_{2^n}}  E_{y∼p, z∼q} [ℓ(y, z)],  s.t.  E_{z∼q} ((1/n) Xz) = (1/n) Xỹ.   (9)

Here the proponent tries to find a distribution p(·) over the labelings of the entire training set in order to minimize the loss (Δ_{2^n} is the 2^n-dimensional probability simplex). An opponent in contrast tries to maximize the expected loss by finding another distribution q(·), but his strategy is subject to the constraint that the feature expectation matches that of the empirical distribution. Introducing a Lagrangian variable θ to remove the feature expectation constraint, and specializing the problem to F-score, where ℓ(y, z) = 2y'z / (1'y + 1'z) and ℓ(0, 0) := 1, the partial dual problem can be written as

max_θ  −(λ/2)‖θ‖₂² + (1/n) θ'Xỹ + min_{p∈Δ_{2^n}} max_{q∈Δ_{2^n}} E_{y∼p, z∼q} [2y'z / (1'y + 1'z) − (1/n) θ'Xy],   (10)

where we use y'z to denote the standard inner product, and we followed [19] in adding an ℓ₂² regularizer on θ, penalizing the dual variables on the constraints over the training data. It appears that solving (10) can be quite challenging, because the variables p and q in the inner minimax problem have 2^n entries! A constraint sampling algorithm was adopted in [19] to address this challenge, although no formal guarantee was established.
Note that we can maximize over the outer unconstrained variable θ (whose dimension equals the number of features) relatively easily, using for instance gradient ascent, provided that we can solve the inner minimax problem quickly, a significant challenge to which we turn our attention below.
Surprisingly, we show here that the inner minimax problem in (10) can be significantly simplified. The key observation is that the expectation in the objective depends only on a few sufficient statistics of p and q. Indeed, by interpreting p and q as probability distributions over {0, 1}^n we have:

E [2y'z / (1'y + 1'z)] = p({0}) q({0}) + Σ_{i=1}^n Σ_{j=1}^n E (2y'z / (1'y + 1'z) [[1'y = i]] [[1'z = j]])   (11)
                      = p({0}) q({0}) + Σ_{i=1}^n Σ_{j=1}^n (2ij / (i + j)) · ((1/i) E (y [[1'y = i]]))' ((1/j) E (z [[1'z = j]])),   (12)

where we write α_i := (1/i) E (y [[1'y = i]]) and β_j := (1/j) E (z [[1'z = j]]), and [[·]] = 1 if · is true, and 0 otherwise. Crucially, the variables α_i and β_j are sufficient for re-expressing (10), since

1'α_i = (1/i) E (1'y [[1'y = i]]) = E [[1'y = i]] = p({1'y = i}),   (13)
Σ_i i α_i = Σ_i E (y [[1'y = i]]) = E y,   (14)

and similar equalities also hold for β_j. In detail, the inner minimax problem of (10) simplifies to:

min_{α∈S} max_{β∈S}  (1/n²) Σ_{i=1}^n Σ_{j=1}^n f_ij(α_i, β_j) + Ω(α) − Ω(β),
where  f_ij(α_i, β_j) = (2ijn²/(i+j)) α'_i β_j + n² α'_i 11' β_j − n 1'α_i − n 1'β_j − θ'X (i α_i),   (15)

and  S = {α ≥ 0 : 1'α ≤ 1, ∀i, ‖i α_i‖_∞ ≤ ‖α_i‖₁},  Ω(α) = μ Σ_{i,j} α_ij log(α_ij).   (16)

Importantly, α = [α_1; ...; α_n] (resp. β) has n² entries, which is significantly smaller than the 2^n entries of p (resp. q) in (10). For later purposes we have also incorporated an entropy regularizer for α and β respectively in (15).
To justify the constraint set S, note from (12) and (13) that for any distribution p of y:

since α ≥ 0 and y ∈ {0, 1}^n:  ‖i α_i‖_∞ ≤ E ‖y [[1'y = i]]‖_∞ ≤ E [[1'y = i]] = ‖α_i‖₁.   (17)

Conversely, for any α ∈ S, we can construct a distribution p such that i α_ij = E (y_j [[1'y = i]]) = p({1'y = i, y_j = 1}) in the following algorithmic way. Fix i, and for each j define Y_j = {y ∈ {0, 1}^n : 1'y = i, y_j = 1}. Let U = {1, ..., n}. Find an index j in U that minimizes α_ij, and set p({y}) = i α_ij / |Y_j| for each y ∈ Y_j. Perform the following updates:

U ← U \ {j},  ∀k ≠ j:  Y_k ← Y_k \ Y_j,  α_ik ← α_ik − α_ij |Y_k ∩ Y_j| / |Y_j|.   (18)

Continue this procedure until U is empty.
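The forward direction of this reduction, from a distribution p to the sufficient statistics α_i, is easy to check numerically. The following sketch (a toy check of our own; all names are ours) enumerates {0, 1}^n for a small n, forms α_i from a random p, and verifies both the identity 1'α_i = p({1'y = i}) and the constraint ‖i α_i‖_∞ ≤ ‖α_i‖₁:

```python
import itertools
import numpy as np

# For a random distribution p over all 2^n labelings y in {0,1}^n, build the
# n x n sufficient statistics alpha[i-1] = alpha_i = (1/i) E(y [[1'y = i]]),
# then check 1'alpha_i = p({1'y = i}) and ||i*alpha_i||_inf <= ||alpha_i||_1.

n = 4
rng = np.random.default_rng(1)
ys = np.array(list(itertools.product([0, 1], repeat=n)))   # all 2^n labelings
p = rng.random(2 ** n)
p /= p.sum()                                               # random distribution

alpha = np.zeros((n, n))
for i in range(1, n + 1):
    mask = ys.sum(axis=1) == i                             # [[1'y = i]]
    alpha[i - 1] = (p[mask, None] * ys[mask]).sum(axis=0) / i

# p({1'y = i}) for i = 1, ..., n
mass = np.array([p[ys.sum(axis=1) == i].sum() for i in range(1, n + 1)])
```

The 2^n-dimensional p is thus summarized by only n² numbers, which is exactly what makes the reformulation amenable to stochastic variance-reduced optimization.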
Due to the way we choose j, α remains nonnegative, and by construction α_ij = p({1'y = i, y_j = 1}) once we remove j from U.
The objective function in (15) fits naturally into the framework of (1), with Ω(α) − Ω(β) and the constraints corresponding to M, and the remaining terms to K. The entropy function Ω is convex w.r.t. the KL-divergence, which is in turn distance enforcing w.r.t. the ℓ₁ norm over the probability simplex [23]. In the next section we propose the SVRG algorithm with Bregman divergence (Breg-SVRG) that (a) provably optimizes strongly convex saddle functions with a linear convergence rate, and (b) adapts to the underlying geometry by choosing an appropriate Bregman divergence. Then, in §5 we apply Breg-SVRG to (15) and achieve a factor of n speedup over a straightforward instantiation of [17].

4 Breg-SVRG for Saddle-Point

Algorithm 1: Breg-SVRG for Saddle-Point
1: Initialize z_0 randomly. Set z̃ = z_0.
2: for s = 1, 2, ... do                                      ▷ epoch index
3:   μ̃ ← μ̃_s := ∇K(z̃),  z_0 ← z_0^s := z_m
4:   for t = 1, ..., m do                                    ▷ iteration index
5:     Randomly pick ξ ∈ {1, ..., n}.
6:     Compute v_t using (20).
7:     Update z_t using (21).
8:   z̃ ← z̃_s := Σ_{t=1}^m (1+η)^t z_t / Σ_{t=1}^m (1+η)^t.

In this section we propose an efficient algorithm for solving the general saddle-point problem in (1) and prove its linear rate of convergence. Our main assumption is:
Assumption 1. There exist two norms ‖·‖_x and ‖·‖_y such that each ψ_k is a saddle function and L-smooth; M is (ψ_x − ψ_y)-saddle; and ψ_x and ψ_y are distance enforcing (cf. (7)).
Note that w.l.o.g. we have scaled the norms so that the usual strong convexity parameter of M is 1. Recall that we defined ‖z‖_z and Δ_z in (8). For saddle-point optimization, it is common to define a signed gradient G(z) := [∂_x K(z); −∂_y K(z)] (since K is concave in y). Recall J = K + M, and (x*, y*) is a saddle-point of J.
Using Assumption 1, we measure the gap of an iterate z_t = (x_t, y_t) as follows:

ε_t = ε(z_t) = J(x_t, y*) − J(x*, y_t) ≥ Δ(z_t, z*) ≥ ½ ‖z_t − z*‖² ≥ 0.   (19)

Inspired by [2, 9, 17], we propose in Algorithm 1 a new stochastic variance-reduced algorithm for solving the saddle-point problem (1) using Bregman divergences. The algorithm proceeds in epochs. In each epoch, we first compute the following stochastic estimate of the signed gradient G(z_t) by drawing a random component from K:

v_t = [v_x(z_t); −v_y(z_t)],  where  v_x(z_t) := ∂_x ψ_ξ(z_t) − ∂_x ψ_ξ(z̃) + ∂_x K(z̃),
                                    v_y(z_t) := ∂_y ψ_ξ(z_t) − ∂_y ψ_ξ(z̃) + ∂_y K(z̃).   (20)

Here z̃ is the pivot chosen after completing the previous epoch. We make two important observations: (1) by construction the stochastic gradient v_t is unbiased: E_ξ[v_t] = G(z_t); (2) the expensive gradient evaluation ∂K(z̃) need only be computed once in each epoch since z̃ is held unchanged. If z̃ → z*, then the variance of v_t would be largely reduced, hence faster convergence may be possible.
Next, Algorithm 1 performs the following joint proximal update:

(x_{t+1}, y_{t+1}) = arg min_x max_y  η⟨v_x(z_t), x⟩ + η⟨v_y(z_t), y⟩ + ηM(x, y) + Δ(x, x_t) − Δ(y, y_t),   (21)

where we have the flexibility of choosing a suitable Bregman divergence to better adapt to the underlying geometry.
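For intuition on what this flexibility buys, consider the x-half of update (21) with the KL-divergence as Δ, over the simplex, and with M dropped: it reduces to a closed-form multiplicative (exponentiated-gradient) step. A minimal sketch under these simplifying assumptions (function names are ours):

```python
import numpy as np

# With Delta(x, x_t) = KL(x, x_t) on the simplex and M dropped, the x-half of
# the proximal update,
#     argmin_{x in simplex}  eta * <v, x> + KL(x, x_t),
# has the multiplicative closed form below (exponentiated gradient).

def entropic_prox(x_t, v, eta):
    w = x_t * np.exp(-eta * v)
    return w / w.sum()            # normalization enforces the simplex constraint

x_t = np.array([0.25, 0.25, 0.5])
v = np.array([1.0, 0.0, 2.0])     # a (stochastic) gradient estimate, e.g. v_x(z_t)
eta = 0.5
x_next = entropic_prox(x_t, v, eta)
```

No projection step is needed: the KL geometry keeps the iterate strictly inside the simplex, which is one reason the entropic choice matches entropy-regularized problems so well.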
When Δ(x, x_t) = ½‖x − x_t‖₂², we recover the special case in [17]. However, to handle the asymmetry in a general Bregman divergence (which does not appear for the Euclidean distance), we have to choose the pivot z̃ in a significantly different way than [2, 9, 17].
We are now ready to present our main convergence guarantee for Breg-SVRG in Algorithm 1.
Theorem 1. Let Assumption 1 hold, and choose a sufficiently small η > 0 such that m := ⌈log((1−ηL)/(18ηL²) − η⁻¹) / log(1+η)⌉ ≥ 1. Then Breg-SVRG enjoys linear convergence in expectation:

E ε(z̃_s) ≤ (1+η)^{−ms} [Δ(z*, z_0) + c(Z + 1) ε(z_0)],  where  Z = Σ_{t=0}^{m−1} (1+η)^t,  c = 18η²L² / (1−ηL).   (22)

For example, we may set η = 1/(45L²), which leads to c = O(1/L²), m = Θ(L²), (1+η)^m ≥ 64/45, and Z = O(L²). Therefore, between epochs, the gap ε(z̃_s) decays (in expectation) by a factor of 45/64, and each epoch needs to conduct the proximal update (21) m = Θ(L²) times. (We remind that w.l.o.g. we have scaled the norms so that the usual strong convexity parameter is 1.) In total, to reduce the gap below some threshold ε, Breg-SVRG needs to call the proximal update (21) O(L² log(1/ε)) times, plus a similar number of component gradient evaluations.
Discussions. As mentioned, Algorithm 1 and Theorem 1 extend those in [17], which in turn extend [2, 9] to saddle-point problems. However, [2, 9, 17] all heavily exploit the Euclidean structure (in particular symmetry), hence their proofs cannot be applied to an asymmetric Bregman divergence. Our innovations here include: (a) A new Pythagorean theorem for the newly introduced saddle Bregman divergence (Lemma 1). (b) A moderate extension of the variance reduction lemma in [9] to accommodate any norm (Appendix B).
(c) A different pivot z̃ is adopted in each epoch to handle asymmetry. (d) A new analysis technique that introduces a crucial auxiliary variable, enabling us to bound the function gap directly. See our proof in Appendix C for more details. Compared with classical mirror descent algorithms [16, 23] that can also solve saddle-point problems with Bregman divergences, our analysis is fundamentally different, and we achieve the significantly stronger rate O(log(1/ε)) than the sublinear O(1/ε) rate of [16], at the expense of a squared instead of linear dependence on L. A similar tradeoff also appeared in [17]. We will return to this issue in Section 5.
Variants and acceleration. Our analysis also supports using different ξ in v_x and v_y. The standard acceleration methods such as the universal catalyst [10] and non-uniform sampling can be applied directly (see Appendix E, where L, the largest smoothness constant over all pieces, is replaced by their mean).

5 Application of Breg-SVRG to Adversarial Prediction

The quadratic dependence on L, the smoothness parameter, in Theorem 1 reinforces the need to choose suitable Bregman divergences. In this section we illustrate how this can be achieved for the adversarial prediction problem in Section 3. As pointed out in [17], the factorization of K is important, and we consider three schemes: (a) ψ_k = f_ij; (b) ψ_k = (1/n) Σ_{j=1}^n f_{k,j}; and (c) ψ_k = (1/n) Σ_{i=1}^n f_{i,k}. W.l.o.g. let us fix the μ in (16) to 1.
Comparison of smoothness constants. Both α and β are n²-dimensional, and the bilinear function f_ij can be written as α'A_ij β, where A_ij ∈ R^{n²×n²} is an n-by-n block matrix, with the (i, j)-th block being n²((2ij/(i+j)) I + 11') and all other blocks being 0.
The linear terms in (15) can be absorbed into the regularizer Ω without affecting the smoothness parameter.
For scheme (a), the smoothness constant L₂ under the ℓ₂ norm depends on the spectral norm of A_ij: L₂ = max_{i,j} n²(n + 2ij/(i+j)) = Θ(n³). In contrast, the smoothness constant L₁ under the ℓ₁ norm depends on the absolute values of the entries of A_ij: L₁ = max_{i,j} n²(1 + 2ij/(i+j)) = Θ(n³); no saving is achieved.
For scheme (b), the bilinear function ψ_k corresponds to (1/n) α' Σ_{j=1}^n A_kj β. Then L₁ = O(n²), while

L₂² = (1/n²) max_k max_{v:‖v‖₂=1} ‖Σ_{j=1}^n A_kj v‖₂² ≥ n² max_{v:‖v‖₂=1} ‖11'v‖₂² = n⁵.   (23)

Therefore, L₁² saves a factor of n compared with L₂².
Comparison of smoothness constants for the overall problem. By strong duality, we may push the maximization over θ to the innermost level of (10), arriving at an overall problem in α and β only:

min_{{α_i}∈S} max_{{β_j}∈S}  (1/n²) Σ_{i=1}^n Σ_{j=1}^n [ f_ij(α_i, β_j) − (i/(λn)) c'Xα_i + (ij/(2λ)) α'_i X'X α_j + (1/(2λn²)) ‖c‖₂² ],   (24)

where c = Xỹ. The quadratic term w.r.t. α can be written as α'B_ij α, where B_ij ∈ R^{n²×n²} is an n-by-n block matrix, with its (i, j)-th block being (ij/(2λ)) X'X and all other blocks being 0. And we assume each ‖x_i‖₂ ≤ 1.
The smoothness constants can be bounded separately from A_ij and B_ij; see (128) in Appendix F.
For scheme (a), the squared smoothness constant L₂² under the ℓ₂ norm is lower bounded via the spectral norms of A_ij and B_ij: L₂² ≥ max_{i,j} ((ij/(2λ)) n)² = Ω(n⁶), i.e. L₂ = Θ(n³). In contrast, the squared smoothness constant L₁² under the ℓ₁ norm is at most the sum of squares of the maximum absolute entries of A_ij and B_ij: L₁² ≤ max_{i,j} [ (n²(1 + 2ij/(i+j)))² + (ij/(2λ))² ] = Θ(n⁶), i.e. L₁ = Θ(n³). So no saving is achieved here.
For scheme (b), ψ_k corresponds to (1/n) (α' Σ_{j=1}^n A_kj β + α' Σ_{j=1}^n B_kj α). Then

L₁² ≤ (1/n²) max_k [ (max_j n²(1 + 2kj/(k+j)))² + (max_j kj/(2λ))² ] = O(n⁴),   (25)  (by (128))

and by setting β to 0 in (126), we get L₂² ≥ n⁵ similarly to (23).   (26)

Therefore, L₁² saves a factor of n compared with L₂². Similar results apply to scheme (c) too. We also tried non-uniform sampling, but it does not change the order in n. It can also be shown that if our scheme randomly samples n entries from {A_ij, B_ij}, the above L₁ and L₂ cannot be improved by further engineering the factorization.
Computational complexity. We finally seek efficient algorithms for the proximal update (21) used by Breg-SVRG. When M(α, β) = Ω(α) − Ω(β) as in (16), we can solve for α and β separately as:

min_α  Σ_{ik} α_ik log(α_ik / b_ik) − c_ik,  s.t.  1'α ≤ 1,  ∀i ∀k, 0 ≤ i α_ik ≤ 1'α_i,   (27)

where b_ik and c_ik are constants.
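To convey the flavor of this entropic proximal computation, consider (27) with the coupling constraints 0 ≤ i α_ik ≤ 1'α_i dropped: the remaining problem has a simple closed form. The sketch below is this simplification of our own, not the full constrained algorithm of Appendix D:

```python
import numpy as np

# Dropping the coupling constraints from (27) leaves
#     min_alpha  sum_ik alpha_ik * log(alpha_ik / b_ik)   s.t. 1'alpha <= 1, alpha >= 0.
# Stationarity (log(alpha/b) + 1 + lam = 0) forces alpha proportional to b:
# alpha = b/e when that already satisfies the mass constraint, otherwise the
# constraint is active and alpha = b / 1'b.

def entropic_prox_mass(b):
    return b / np.e if b.sum() <= np.e else b / b.sum()

b = np.array([0.8, 1.4, 2.0])     # here 1'b = 4.2 > e, so the mass constraint binds
alpha = entropic_prox_mass(b)     # alpha = b / 4.2
```

The extra per-block constraints are what make the full update nontrivial and motivate the O(n² log²(1/ε)) algorithm of Appendix D.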
In Appendix D we design an efficient "closed form" algorithm which finds an ε-accurate solution in O(n² log²(1/ε)) time, which is also on par with the cost of computing the stochastic gradient in schemes (b) and (c). Although scheme (a) reduces the cost of gradient computation to O(n), its corresponding smoothness parameter L₁² is increased by a factor of n², hence it is not worthwhile. We did manage to design an Õ(n) algorithm for the proximal update in scheme (a), but empirically the overall convergence is rather slow.
If we use the squared Euclidean distance as the Bregman divergence, then a term ‖α − α_t‖₂² needs to be added to the objective (27). No efficient "closed form" solution is available, and so in experiments we simply absorbed M into K, after which the proximal update becomes the Euclidean projection onto S, which does admit a competitive O(n² log²(1/ε)) time solution.

6 Experimental Results

Our major goal here is to show that empirically Entropy-SVRG (Breg-SVRG with KL-divergence) is significantly more efficient than Euclidean-SVRG (Breg-SVRG with squared Euclidean distance) on some learning problems, especially those with an entropic regularizer and a simplex constraint.

6.1 Entropy regularized LPBoost

We applied Breg-SVRG to an extension of LPBoost using entropy regularization [29]. In a binary classification setting, the base hypotheses over the training set can be compactly represented as U = (y_1 x_1, ..., y_n x_n)'.
Then the model considers a minimax game between a distribution d ∈ Δn over training examples and a distribution w ∈ Δm over the hypotheses:

min_{d∈Δn, di≤ν}  max_{w∈Δm}  d'Uw + λΩ(d) − γΩ(w).    (28)

Here w tries to combine the hypotheses to maximize the edge (prediction confidence) yi xi'w, while the adversary d tries to place more weight (bounded by ν) on "hard" examples to reduce the edge.

Settings. We experimented on the adult dataset from the UCI repository, which we partitioned into n = 32,561 training examples and 16,281 test examples, with m = 123 features. We set λ = γ = 0.01 and ν = 0.1, as these values gave the best prediction accuracy. We tried a range of values for the step size η; the best we found was 10^-3 for Entropy-SVRG and 10^-6 for Euclidean-SVRG (larger step sizes made Euclidean-SVRG fluctuate even more). For both methods, an epoch length of 32561/50 gave good results.

The stochastic gradient in d was computed as U:j wj, where U:j is the j-th column of U and j is randomly sampled; the stochastic gradient in w is di Ui:'. We also tried Uij wj and Uij di (scheme (a) in §5), but they performed worse. We also tried the universal catalyst in the same form as [17], which can be directly extended to Entropy-SVRG; similarly, we used non-uniform sampling based on the ℓ2 norm of the rows and columns of U. It turned out that Euclidean-SVRG benefits slightly from these, while Entropy-SVRG does not, so we only show the "accelerated" results for the former.

To make the computational cost comparable across machines, we introduced a counter called the effective number of passes, #pass. Assuming the proximal operator has been called #po times, we define

#pass := number of epochs so far + ((n+m)/(nm)) · #po.    (29)

We also compared with a "convex" approach. Given d, the optimal w in (28) obviously admits a closed-form solution. General saddle-point problems certainly do not enjoy such a convenience. However, we hope to take advantage of this opportunity to study the following question: suppose we solve (28) as a convex optimization in d and the stochastic gradient is computed from the optimal w; would this be faster than the saddle SVRG? Since solving for w requires visiting the entire U, strictly speaking the term ((n+m)/(nm)) · #po in the definition of #pass in (29) should be replaced by #po. However, we stuck with (29) because our interest is whether a more accurate stochastic gradient in d (based on the optimal w) can outperform doubly stochastic (saddle) optimization. We emphasize that this comparison is only for conceptual understanding, because in general optimizing the inner variable requires costly iterative methods.

[Figure 1: Entropy Regularized LPBoost on adult. Panels: (a) primal gap vs. #pass; (b) test accuracy vs. #pass.]

[Figure 2: Adversarial Prediction on the synthetic dataset. Panels: (a) primal gap vs. #pass; (b) primal gap vs. CPU time; (c) test F-score vs. #pass; (d) test F-score vs. CPU time.]

Results. Figure 1(a) demonstrates how fast the primal gap (with w optimized out for each d) is reduced as a function of the number of effective passes. Methods based on the entropic prox are clearly much more efficient than the Euclidean prox. This corroborates our theory that for problems like (28), Entropy-SVRG is better suited to the underlying geometry (entropic regularizer with simplex constraints).

We also observed that with the entropic prox, our doubly stochastic method is as efficient as the "convex" method: although at each iteration the w in saddle SVRG is not optimal for the current d, it still allows the overall algorithm to perform as fast as if it were.
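For (28) in particular, the closed-form inner maximizer is a softmax over the per-hypothesis edges. A minimal sketch, assuming Ω(w) = Σ_j wj log wj as in entropy-regularized LPBoost (function name is ours):

```python
import numpy as np

def optimal_w(U, d, gamma):
    """Inner maximizer of (28): argmax_{w in simplex} d'Uw - gamma * sum_j w_j log w_j.

    Sketch assuming Omega is the negative-entropy regularizer; the optimum
    is then a softmax of the edges U'd scaled by 1/gamma.
    """
    z = (U.T @ d) / gamma          # edge of each hypothesis under d, scaled
    z -= z.max()                   # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum()
```

Plugging this w back into (28) turns it into a convex problem in d alone, which is exactly the "convex" baseline used in our comparison.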
This suggests that for general saddle-point problems where no closed-form inner solution is available, our method will still be efficient and competitive. Note that this "convex" method is similar to the optimizer used by [29].

Finally, we investigated the increase of test accuracy as more passes over the data are performed. Figure 1(b) shows, once more, that the entropic prox allows the accuracy to improve much faster than the Euclidean prox. Again, the convex and saddle methods perform similarly.

As a final note, the Euclidean/entropic proximal operator for both d and w can be computed either in closed form or by a 1-D line search based on the partial Lagrangian. Their computational costs therefore differ by roughly the cost ratio of a multiplication to an exponentiation, which is much smaller than the difference in #pass shown in Figure 1.

6.2 Adversarial prediction with F-score

Datasets. Here we considered two datasets. The first is a synthetic dataset where the positive examples are drawn from a 200-dimensional normal distribution with mean 0.1 · 1 and covariance 0.5 · I, and the negative examples are drawn from N(−0.1 · 1, 0.5 · I). The training set has n = 100 samples, half positive and half negative. The test set has 200 samples with the same class ratio. Notice that n = 100 means we are optimizing over two 100-by-100 matrices constrained to a challenging set S.
So the optimization problem is indeed not trivial.

The second dataset, ionosphere, has 211 training examples (122 positive and 89 negative); 89 examples were used for testing (52 positive and 37 negative). Each example has 34 features.

Methods. To apply saddle SVRG, we used strong duality to push the optimization over θ to the inner-most level of (10), and then eliminated θ because it is a simple quadratic. We thus ended up with the convex-concave optimization shown in (24), where the K part of (15) is augmented with a quadratic term in α. The formulae for computing the stochastic gradient using scheme (b) are detailed in Appendix G. We fixed μ = 1, λ = 0.01 for the ionosphere dataset, and μ = 1, λ = 0.1 for the synthetic dataset. We also tried the universal catalyst along with non-uniform sampling, where each i was sampled with probability proportional to Σ_{k=1}^n ‖Aik‖F^2, and similarly for j. Here ‖·‖F is the Frobenius norm.

[Figure 3: Adversarial Prediction on the ionosphere dataset. Panels: (a) primal gap vs. #pass; (b) primal gap vs. CPU time; (c) test F-score vs. #pass; (d) test F-score vs. CPU time.]

Parameter Tuning. Since each entry in the n × n matrix α is relatively small when n is large, we needed a relatively small step size. When n = 100, we used 10^-2 for Entropy-SVRG and 10^-6 for Euclidean-SVRG (a larger step size makes it over-fluctuate). When applying the catalyst, the catalyst regularizer can suppress the noise from a larger step size; after carefully trading off the catalyst regularization parameter against the step size, we managed to achieve faster convergence empirically.

Results. The results on the two datasets are shown in Figures 2 and 3, respectively. We truncated the #pass and CPU time in subplots (c) and (d) because the F-score has stabilized and we would rather zoom in on the initial growing phase. In terms of primal gap versus #pass (subplot a), the entropy-based method is significantly more effective than the Euclidean methods on both datasets (Figures 2(a) and 3(a)). Even with the catalyst, Euclidean-Saddle is still much slower than the entropy-based methods on the synthetic dataset in Figure 2(a). The CPU time comparisons (subplot b) follow a similar trend, except that the "convex" methods should be ignored there because they are introduced only to compare #pass.

The F-score is noisy because, as is well known, it is not monotonic in the primal gap, so glitches can appear. In subplots 2(d) and 3(d), the entropy-based methods achieve a higher F-score significantly faster than the plain Euclidean-based methods on both datasets. In terms of passes (subplots 2(c) and 3(c)), Euclidean-Saddle and Entropy-Saddle achieved a similar F-score at first because their primal gaps are comparable at the beginning.
After 20 passes, the F-score of Euclidean-Saddle is overtaken by Entropy-Saddle, as the primal gap of Entropy-Saddle becomes much smaller than that of Euclidean-Saddle.

7 Conclusions and Future Work

We have proposed Breg-SVRG to solve saddle-point optimization and proved its linear rate of convergence. Application to adversarial prediction confirmed its effectiveness. For future work, we are interested in relaxing the (potentially hard) proximal update in (21). We will also derive similar reformulations for DCG and precision@k, with a quadratic number of variables and with a finite-sum structure that is again amenable to Breg-SVRG, leading to a similar reduction of the condition number compared with Euclidean-SVRG. These reformulations, however, come with different constraint sets, and new proximal algorithms with complexity similar to that for the F-score can be developed.

References

[1] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML. 2013.

[2] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS. 2013.

[3] A. Defazio, F. Bach, and S. Lacoste-Julien.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS. 2014.

[4] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2016.

[5] A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In ICML. 2014.

[6] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829-855, 2015.

[7] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., 14:567-599, 2013.

[8] S. Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In ICML. 2016.

[9] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

[10] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In NIPS. 2015.

[11] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In NIPS. 2014.

[12] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In ICML. 2014.

[13] R. Babanezhad, M. O. Ahmed, A. Virani, M. Schmidt, J. Konečný, and S. Sallinen. Stop wasting my gradients: Practical SVRG. In NIPS. 2015.

[14] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In ICML. 2015.

[15] Z. Zhu and A. J. Storkey. Adaptive stochastic primal-dual coordinate descent for separable saddle point problems. In Machine Learning and Knowledge Discovery in Databases, pp. 645-658. 2015.

[16] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro.
Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

[17] P. Balamurugan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In NIPS. 2016.

[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS. 2014.

[19] H. Wang, W. Xing, K. Asif, and B. D. Ziebart. Adversarial prediction games for multivariate losses. In NIPS. 2015.

[20] F. Farnia and D. Tse. A minimax approach to supervised learning. In NIPS. 2016.

[21] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[22] Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. on Optimization, 16(1):235-249, 2005. ISSN 1052-6234.

[23] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

[24] H. Wang and A. Banerjee. Bregman alternating direction method of multipliers. In NIPS. 2014.

[25] R. T. Rockafellar. Monotone operators associated with saddle functions and minimax problems. Nonlinear Functional Analysis, 18(part 1):397-407, 1970.

[26] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proc. Annual Conf. Computational Learning Theory. 2010.

[27] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2009.

[28] K. Asif, W. Xing, S. Behpour, and B. D. Ziebart. Adversarial cost-sensitive classification. In UAI. 2015.

[29] M. K. Warmuth, K. A. Glocer, and S. V. N. Vishwanathan. Entropy regularized LPBoost. In Y. Freund, L. Györfi, and G. Turán, eds., Proc. Intl. Conf. Algorithmic Learning Theory, no.
5254 in Lecture Notes in Artificial Intelligence, pp. 256-271. Springer-Verlag, Budapest, October 2008.

[30] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, 49:185-212, 2011.