{"title": "An adaptive Mirror-Prox method for variational inequalities with singular operators", "book": "Advances in Neural Information Processing Systems", "page_first": 8455, "page_last": 8465, "abstract": "Lipschitz continuity is a central requirement for achieving the optimal O(1/T) rate of convergence in monotone, deterministic variational inequalities (a setting that includes convex minimization, convex-concave optimization, nonatomic games, and many other problems). However, in many cases of practical interest, the operator defining the variational inequality may become singular at the boundary of the feasible region, precluding in this way the use of fast gradient methods that attain this rate (such as Nemirovski's mirror-prox algorithm and its variants). To address this issue, we propose a novel smoothness condition which we call Bregman smoothness, and which relates the variation of the operator to that of a suitably chosen Bregman function. Leveraging this condition, we derive an adaptive mirror prox algorithm which attains an O(1/T) rate of convergence in problems with possibly singular operators, without any prior knowledge of the problem's Bregman constant (the Bregman analogue of the Lipschitz constant). We also present an extension of our algorithm to stochastic variational inequalities where the algorithm achieves a $O(1/\\sqrt{T})$ convergence rate.", "full_text": "An Adaptive Mirror-Prox Algorithm for\n\nVariational Inequalities with Singular Operators\n\nKimon Antonakopoulos\n\nUniv. Grenoble Alpes, CNRS, Inria, Grenoble INP\n\nLIG 38000 Grenoble, France.\n\nE. Veronica Belmega\n\nETIS/ENSEA\n\nUniv. de Cergy-Pontoise-CNRS, France\n\nkimon.antonakopoulos@inria.fr\n\nbelmega@ensea.fr\n\nPanayotis Mertikopoulos\n\nUniv. 
Grenoble Alpes, CNRS, Inria, Grenoble INP,\n\nLIG 38000 Grenoble, France.\n\npanayotis.mertikopoulos@imag.fr\n\nAbstract\n\nLipschitz continuity is a central requirement for achieving the optimal O(1/T) rate\nof convergence in monotone, deterministic variational inequalities (a setting that\nincludes convex minimization, convex-concave optimization, nonatomic games,\nand many other problems). However, in many cases of practical interest, the\noperator defining the variational inequality may exhibit singularities at the boundary\nof the feasible region, precluding in this way the use of fast gradient methods\nthat attain this optimal rate (such as Nemirovski\u2019s mirror-prox algorithm and its\nvariants). To address this issue, we consider a regularity condition which relates the\nvariation of the operator to that of a suitably chosen Bregman function. Leveraging\nthis Bregman continuity condition, we derive an adaptive mirror-prox algorithm\nwhich attains the optimal O(1/T) rate of convergence in problems with possibly\nsingular operators, without any prior knowledge of the degree of smoothness (the\nBregman analogue of the Lipschitz constant). We also show that, under Bregman\ncontinuity, the mirror-prox algorithm achieves a $O(1/\\sqrt{T})$ convergence rate in\nstochastic variational inequalities.\n\n1 Introduction\n\nThe seminal introduction of generative adversarial networks (GANs) [18] has ushered in a new\noptimization paradigm in deep learning: instead of focusing on the minimization of an empirical loss\nfunction, GAN training hinges on a zero-sum game between a generator and a discriminator. In fact,\nin many cases GAN training goes even beyond the min-max setting, either because there are more\nthan two networks involved, or because the objectives of the generator and the discriminator are not\nentirely opposed \u2013 e.g., as in the widely used ACGAN framework of Odena et al. [43]. 
In these cases,\nthe most compact way of representing the problem\u2019s training landscape is by means of a variational\ninequality (VI).\nTracing their origins to the work of Stampacchia [49] on the Signorini problem, variational inequalities\nhave since found a broad range of applications in physics, engineering, economics \u2013 and, more\nrecently, machine learning. One of the main reasons for their extensive applicability is that they\ncomprise a \ufb02exible optimization framework which can simultaneously account for loss function\nminimization, saddle-point, game-theoretic, and \ufb01xed point problems. As a result, there has been\nconsiderable interest in the literature to develop optimal algorithms for solving VI problems; for an\nappetizing introduction, see [16] and references therein.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOne of the most widely studied methods for this purpose is ordinary gradient descent \u2013 also known as\nthe forward-backward (FB) algorithm in operator theory [6].1 In monotone, deterministic variational\ninequalities, the convergence of the method is guaranteed under a condition known as cocoercivity.\nBy the Baillon\u2013Haddad theorem, if the operator de\ufb01ning the variational inequality is a gradient\n\ufb01eld (i.e., in loss minimization problems), this condition is equivalent to Lipschitz smoothness of\nthe associated loss function [4, 6]. 
However, cocoercivity may fail to hold even in simple, bilinear\nmin-max problems, in which case gradient descent provably fails to converge \u2013 see e.g., [17, 35, 36]\nfor a precise statement.\nThe \ufb01rst algorithm achieving convergence in (pseudo-)monotone variational inequalities without\ncocoercivity is the extra-gradient (EG) algorithm of Korpelevich [24], which only requires Lipschitz\ncontinuity of the underlying operator.2 The asymptotic convergence result of Korpelevich [24]\nwas subsequently extended by Nemirovski [38] who introduced the mirror-prox (MP) algorithm, a\nBregman variant of the EG algorithm with ergodic averaging. As was shown in [38], the mirror-prox\nalgorithm attains a O(1/T ) ergodic convergence rate in monotone variational inequalities with\nLipschitz continuous operators, and this rate cannot be improved without further assumptions.\nHowever, in many applications and problems of practical interest, Lipschitz continuity may also fail\nto hold, either because the loss pro\ufb01le of the problem grows too rapidly (e.g., as in support vector\nmachines or GAN models with Kullback-Leibler losses), or because the problem exhibits singularities\nnear the boundary of the feasible region (e.g., as in resource allocation and inverse problems). In\nthese cases, one would still want to apply a fast method like mirror-prox, but the lack of smoothness\nmeans that there are no convergence guarantees \u2013 asymptotically, ergodically, or otherwise.\n\nOur contributions. Our starting point is the observation that this failure stems from the fact that\nLipschitz continuity of the operator is de\ufb01ned relative to a global norm. 
Because of this, the standard\nLipschitz framework is not well-suited to problems with singularities or rapid growth: a global norm\nis oblivious to the geometry of the feasible region (and, in particular, its boundary), so it cannot\ncapture the finer features of the problem\u2019s loss landscape.\nTo overcome this limitation, we introduce a novel regularity condition, which we call Bregman continuity,\nand which is made-to-order for the singularity landscape of the problem at hand. Specifically,\ninstead of defining Lipschitz continuity relative to a global norm, we define it in terms of a family of\nlocal norms and a suitably chosen Bregman function. This leads to an intricate interplay between\ndifferent geometric notions of distance (the Bregman divergence and the local norm), but it also\nintroduces the flexibility required to tackle variational inequalities with singular operators.\nUnder this assumption, we show that the mirror-prox algorithm attains the optimal O(1/T) convergence\nrate in variational inequalities with (possibly) singular operators, provided that the method is\nrun with the same Bregman function that is used to define Bregman continuity. As in the standard Lipschitz\nframework, the method\u2019s convergence requires a step-size of the form \u03b3 < 1/\u03b2, where \u03b2 is the\nBregman constant of the operator (i.e., the Bregman analogue of the Lipschitz constant). Estimating\nthis constant can be fairly challenging in practice (if not downright impossible), so we also introduce\nan adaptive mirror-prox (AMP) method which attains the same O(1/T) rate without requiring any a\npriori estimation of \u03b2 \u2013 essentially, the Bregman constant is learned at the same time as the problem\u2019s\nlandscape. Finally, we provide a variant of the method for stochastic variational inequalities, and we\nestablish a $O(1/\\sqrt{T})$ convergence rate in this setting. 
To the best of our knowledge, these are the\nfirst results of this kind in the literature.\n\nRelated work. Owing to their optimal rate guarantees, the extra-gradient and mirror-prox algorithms\nhave been at the forefront of an extensive literature which is impossible to adequately review\nhere. As a purely indicative list of contributions in the Lipschitz continuous setting (and with no\nillusion of being comprehensive), we refer the reader to Juditsky et al. [21], Chambolle and Pock [13],\nMalitsky [28], Iusem et al. [20] and Mokhtari et al. [37] for some recent developments. Especially in\nlearning theory, there has been a surge of interest motivated by the application of EG/MP methods to\nGAN training, see e.g., [15, 17, 36, 51] and references therein.3\n\n1 When used to find a zero of a composite operator, the FB algorithm is known as a \u201csplitting\u201d method; see\ne.g., Bruck Jr. [12], Passty [44], [14], and references therein.\n2 An operator A(x) is cocoercive if $\\langle A(x') - A(x), x' - x \\rangle \\geq (1/\\beta) \\|A(x') - A(x)\\|^2$ for some \u03b2 > 0 and\nall x, x'. Note that Lipschitz continuity is strictly weaker than cocoercivity: the operator $A(x_1, x_2) = (-x_2, x_1)$\nis Lipschitz continuous over $\\mathbb{R}^2$, but it is not cocoercive; see Section 2 for a detailed discussion.\n\nGoing beyond the Lipschitz regime, Bauschke et al. [7] recently introduced a \u201cLipschitz-like\u201d smoothness\ncondition for convex minimization problems and used it to establish an O(1/T) value convergence\nrate for mirror descent methods (as opposed to mirror-prox). Still in the context of loss minimization\nproblems, Bolte et al. [9] subsequently extended the results of Bauschke et al. [7] to\nunconstrained non-convex problems that satisfy the Kurdyka\u2013\u0141ojasiewicz (KL) inequality, while Lu\net al. 
[27] considered functions that are also relatively strongly convex and showed that mirror descent\nachieves a geometric convergence rate in this context. Finally, in a very recent preprint, Hanzely et al.\n[19] examined the rate of convergence of an accelerated variant of mirror descent under the same\nLipschitz-like smoothness assumption.\nThe condition of Bauschke et al. [7] is remarkably simple as it only posits that the problem\u2019s\nloss function f is such that \u03b2h \u2212 f is convex for some reference Bregman function h and some\n\u03b2 > 0. A straightforward extension of this condition to an operator setting would be to require\nthe monotonicity of \u03b2\u2207h \u2212 A, where A is the operator defining the variational inequality under\nstudy. However, the cornerstone of this \u201cLipschitz-like\u201d condition is a descent lemma which does not\ncarry over to variational inequalities, so it does not seem possible to extend the analysis of Bauschke\net al. [7] to an operator setting. Lu [26] also considered a \u201crelative continuity\u201d condition for loss\nminimization problems positing that $\\|\\nabla f(x)\\| \\leq M \\inf_{x'} \\sqrt{2D(x', x)} / \\|x' - x\\|$ (where f is the\nproblem\u2019s objective and D is the Bregman divergence of h). Written this way, the condition of Lu\n[26] can also be extended to an operator setting, but this would provide a surrogate for operator\nboundedness, not Lipschitz continuity (since A = \u2207f in minimization problems). 
Since the optimal\nO(1/T) convergence rate of the mirror-prox algorithm is tied to the regularity of A \u2013 as opposed\nto its boundedness \u2013 the condition of Lu [26] does not seem applicable to the setting under study.\nAccordingly, there is no overlap in results or methodology with this particular strand of the literature.\nFinally, in a very recent paper, Bach and Levy [3] introduced a universal variant of the mirror-prox\nalgorithm which is model-agnostic and achieves an optimal convergence rate in stochastic and/or\nsmooth settings. Achieving optimal rates in the setting of Bach and Levy [3] relies crucially on the\noperator being Lipschitz continuous (albeit with a possibly unknown constant) and the feasible region\nhaving a finite Bregman diameter. The algorithm we propose in this work is not universal but it is\nadaptive, and it does not require either Lipschitz continuity or a finite Bregman diameter. In this\nmanner, our work also provides an important first step towards extending the universal analysis of\nBach and Levy [3] to VI problems with singularities.\n\n2 Preliminaries\n\nLet X be a convex \u2013 but not necessarily closed or compact \u2013 subset of a d-dimensional normed space\nV, and let V\u2217 denote the dual space of V. The variational inequality (VI) problem associated to a\ncontinuous operator A : X \u2192 V\u2217 consists of finding $x^* \\in X$ such that\n\n$$\\langle A(x^*), x - x^* \\rangle \\geq 0 \\quad \\text{for all } x \\in X. \\tag{VI}$$\n\nFollowing [16], we will refer to this problem as VI(X, A) and we will write $X^* \\equiv \\mathrm{Sol}(X, A)$ for\nits set of solutions. 
Note also that, if X is not closed, A may exhibit a singularity at a residual point\nx \u2208 bd(X) \\ X in the sense that A does not admit a continuous extension to x.\nIn the literature, this formulation of the problem is often referred to as a Stampacchia variational\ninequality (SVI) [16] or a \u201cstrong\u201d variational inequality [21, 40]. For illustration purposes, we\npresent some archetypal examples of such problems below:\nExample 2.1 (Loss minimization). If $A = \\nabla f$ for some convex loss function f on $X = \\mathbb{R}^d$, solutions\nof (VI) coincide with the global minimizers of f.\nExample 2.2 (Min-max optimization). Suppose that $A = (\\nabla_{x_1} f, -\\nabla_{x_2} f)$ for some real-valued\nfunction $f(x_1, x_2)$ with $x_1 \\in X_1$, $x_2 \\in X_2$. If f is convex-concave (i.e., convex in $x_1$ and concave in\n$x_2$), any solution $x^* = (x_1^*, x_2^*)$ of (VI) is a global saddle-point of f, i.e.,\n\n$$f(x_1^*, x_2^*) \\leq f(x_1, x_2^*) \\quad \\text{and} \\quad f(x_1^*, x_2^*) \\geq f(x_1^*, x_2) \\tag{2.1}$$\n\nfor all $x_1 \\in X_1$, $x_2 \\in X_2$. Problems of this type have attracted considerable interest in the fields of\nmachine learning and artificial intelligence because they constitute the basic optimization framework\nfor GANs [18]. For a series of recent papers focusing on the interplay between GAN and saddle-point\nproblems / variational inequalities, see [15, 17, 25, 36, 51] and references therein.\n\n3 We note here that the method is sometimes referred to as \u201coptimistic mirror descent\u201d. This terminology is\ndue to Rakhlin and Sridharan [45, 46] and may refer either to the mirror-prox method itself, or to a variant with\n\u201cgradient extrapolation from the past\u201d, as in [17].\n\nExample 2.3 (Resource sharing problems). Consider a set of resources $r \\in \\mathcal{R} = \\{1, \\dots, R\\}$ serving\na stream of demands that arrive at a rate of \u03c1 per unit of time (for instance, a GPU cluster or a\ncomputing grid processing a stream of jobs). If the load on the r-th resource is $x_r$, the expected\nservice time in the standard Kleinrock model [23] is given by the M/M/1 loss function\n\n$$\\ell_r(x_r) = \\frac{1}{c_r - x_r}, \\tag{2.2}$$\n\nwhere $c_r$ denotes the capacity of the resource. In this setting, the set of feasible resource allocations\nis $X \\equiv \\{(x_1, \\dots, x_R) : 0 \\leq x_r < c_r,\\ x_1 + \\cdots + x_R = \\rho\\}$,4 and we say that a resource allocation\nprofile $x^* \\in X^*$ is at Nash/Wardrop equilibrium [42, 50] if\n\n$$\\ell_r(x_r^*) \\leq \\ell_r(x_r) \\quad \\text{for all } x \\in X \\text{ and all } r \\in \\mathcal{R} \\text{ such that } x_r^* > 0, \\tag{2.3}$$\n\ni.e., when no job would be better served by transferring it to a different priority queue. In this\ncase, if we let $A(x) = (\\ell_1(x_1), \\dots, \\ell_R(x_R))$, a standard calculation shows that $x^*$ is an equilibrium\nallocation if and only if it solves the associated variational inequality problem for A.\n\nThe most widely used assumption in the literature for solving VI problems is monotonicity, i.e.,\n\n$$\\langle A(x') - A(x), x' - x \\rangle \\geq 0 \\quad \\text{for all } x, x' \\in X. \\tag{2.4}$$\n\nWhen $A = \\nabla f$, this condition is equivalent to f being convex; likewise, when $A = (\\nabla_{x_1} f, -\\nabla_{x_2} f)$\nas in Example 2.2, monotonicity is equivalent to f being convex-concave [6]; finally, by direct\ncalculation, it is straightforward to see that the operator defined in Example 2.3 is monotone. For an\nintroduction to the theory of monotone operators, we refer the reader to Facchinei and Pang [16] and\nBauschke and Combettes [6].\nNow, drawing on Nesterov [40, 41] and Juditsky et al. 
[21], if A is monotone, the quality of a\ncandidate solution $\\hat{x} \\in X$ can be assessed via the restricted gap (or merit) function\n\n$$\\mathrm{Gap}_C(\\hat{x}) = \\sup_{x \\in C} \\langle A(x), \\hat{x} - x \\rangle, \\tag{2.5}$$\n\nwhere C is a nonempty convex subset of X. The rationale behind this definition is that, if $x^*$ solves\n(VI), monotonicity gives $\\langle A(x), x^* - x \\rangle \\leq \\langle A(x^*), x^* - x \\rangle \\leq 0$, so the quantity being maximized\nin (2.5) is small if $\\hat{x}$ is an approximate solution of (VI). Formally, we have:\nLemma 1. Suppose that A is monotone. If $x^*$ solves (VI), we have $\\mathrm{Gap}_C(x^*) = 0$ whenever $x^* \\in C$.\nConversely, if $\\mathrm{Gap}_C(\\hat{x}) = 0$ and C contains a neighborhood of $\\hat{x}$ in X, $\\hat{x}$ is a solution of (VI).\nThis lemma extends a similar result by Nesterov [40], so we defer its proof to the paper\u2019s supplement.\nIn view of all this, we will employ the gap function $\\mathrm{Gap}_C(\\hat{x})$ as our main figure of merit and we will\nuse it to state our convergence rate guarantees in the sequel.\n\n3 Bregman continuity\n\nIn addition to monotonicity, a standard assumption for solving variational inequalities is that of\nLipschitz continuity, i.e.,\n\n$$\\|A(x') - A(x)\\|_* \\leq \\beta \\|x' - x\\| \\tag{Lip}$$\n\nfor some \u03b2 > 0 and for all x, x' \u2208 X. This definition involves two distinct (but related) measures of\ndistance: (i) the primal norm on V which measures distances between the primal points x, x' \u2208 X; and\n(ii) the dual norm on V\u2217 which measures the distance between the dual vectors A(x), A(x') \u2208 V\u2217.5\nImportantly, both of these notions are global, i.e., they do not depend on the point in space at which\nthey are calculated; as such, Lipschitz continuity is oblivious to the geometry of X (and, in particular,\nits boundary). 
In the sequel, we describe a way to overcome this limitation by introducing two distinct\nnotions of distance that are tailored to the geometry of X and the singularity landscape of A.\n\n4 For posterity, note here that X is convex but it is not necessarily closed.\n5 Recall here that the dual norm of v \u2208 V\u2217 is defined as $\\|v\\|_* = \\max_{z \\in V} \\{ |\\langle v, z \\rangle| : \\|z\\| \\leq 1 \\}$.\n\nLocal norms. The first measure of distance that we define is that of local norm on X:\nDefinition 1. Let Z = span(X \u2212 X) denote the tangent hull of X, i.e., the subspace of V spanned\nby all possible displacement vectors of the form z = x' \u2212 x, x, x' \u2208 X. A local norm on X is a\ncontinuous assignment of a norm $\\|\\cdot\\|_x$ on Z at each x \u2208 X.6 The induced dual local norm is then\ndefined as\n\n$$\\|v\\|_{x,*} = \\max_{z \\in Z} \\{ |\\langle v, z \\rangle| : \\|z\\|_x \\leq 1 \\}. \\tag{3.1}$$\n\nFor ease of presentation, we tacitly assume in what follows that $\\|z\\|_x \\geq \\mu \\|z\\|$ for some \u00b5 > 0 and\nall x \u2208 X, z \u2208 Z. This can always be achieved by taking $\\|\\cdot\\|_x \\leftarrow \\|\\cdot\\|_x + \\mu \\|\\cdot\\|$ so there is no loss of\ngenerality. Note in particular that this implies that $\\|v\\|_{x,*} \\leq (1/\\mu) \\|v\\|_*$ for all x \u2208 X and all v \u2208 Z\u2217.\nFor intuition, we present some key examples below:\nExample 3.1 (Euclidean geometry). Let $X = \\mathbb{R}^d$ so $Z = \\mathbb{R}^d$. The Euclidean norm on X is given by\nthe standard expression $\\|z\\|_2^2 = \\sum_{j=1}^d z_j^2$, and the associated dual norm is the same.\nExample 3.2 (Shahshahani p-norm). Let $X = \\mathbb{R}^d_{++}$ so, again, $Z = \\mathbb{R}^d$. The Shahshahani p-norm\non X is defined for all p > 1 as\n\n$$\\|z\\|_x = \\left( |z_1|^p / x_1 + \\cdots + |z_d|^p / x_d \\right)^{1/p} \\quad \\text{for all } x \\in X, z \\in Z. \\tag{3.2}$$\n\nBy a straightforward application of H\u00f6lder\u2019s inequality, the corresponding dual norm is given by\n\n$$\\|v\\|_{x,*} = \\left( x_1^{q-1} |v_1|^q + \\cdots + x_d^{q-1} |v_d|^q \\right)^{1/q} \\quad \\text{for all } v \\in V^*, \\tag{3.3}$$\n\nwith the usual convention $p^{-1} + q^{-1} = 1$. In particular, for $p \\to 1^+$, we get the limiting expression\n\n$$\\|v\\|_{x,*} = \\max\\{ x_1 |v_1|, \\dots, x_d |v_d| \\}. \\tag{3.4}$$\n\nThis metric plays a major role in, among others, game theory, optimal transport, machine learning,\ninformation theory, and many other fields \u2013 see e.g., [1, 2, 22, 31, 34, 47, 48] and references therein.\n\nLocal Bregman functions and the associated divergence. The notion of a dual local norm presented\nabove will be our principal measure of distance in V\u2217. To proceed, we will also need to adapt\nthe notion of a Bregman (or distance-generating) function on X:\nDefinition 2. Let $\\|\\cdot\\|_x$ be a local norm on X. We say that h : V \u2192 R is a Bregman function on X if:\n\n1. h is proper, l.s.c., convex, and dom h = X.\n2. The subdifferential of h admits a continuous selection, i.e., a continuous function \u2207h such\nthat \u2207h(x) \u2208 \u2202h(x) for all $x \\in X^\\circ \\equiv \\mathrm{dom}\\, \\partial h$.\n3. 
h is strongly convex relative to the underlying local norm, i.e.,\n\n$$h(p) \\geq h(x) + \\langle \\nabla h(x), p - x \\rangle + \\tfrac{1}{2} K \\|p - x\\|_x^2 \\tag{3.5}$$\n\nfor some K > 0 and all $p \\in X$, $x \\in X^\\circ$.\n\nThe Bregman divergence induced by h is then defined for all $p \\in X$, $x \\in X^\\circ$, as\n\n$$D(p, x) = h(p) - h(x) - \\langle \\nabla h(x), p - x \\rangle. \\tag{3.6}$$\n\nAs an immediate consequence of the above, we have:\nLemma 2. A Bregman function h is K-strongly convex relative to $\\|\\cdot\\|_x$ if and only if\n\n$$D(p, x) \\geq \\tfrac{1}{2} K \\|p - x\\|_x^2 \\quad \\text{for all } p \\in X \\text{ and all } x \\in X^\\circ. \\tag{3.7}$$\n\nThe main difference between Definition 2 and the standard assumptions in the literature [7, 10, 11,\n21, 30, 32, 33, 39\u201341] is the strong convexity requirement relative to the local norm $\\|\\cdot\\|_x$ (whose\nchoice, in turn, is aimed to capture the singularity landscape of the operator). We illustrate this with\ntwo examples below:\n\n6 By that, we have in mind the definition of an absolutely homogeneous Finsler metric [5]. Specifically, a\nlocal norm is viewed here as a continuous nonnegative function $F : X \\times V \\to \\mathbb{R}_+$ with the following properties:\nfor all x \u2208 X and all $z_1, z_2 \\in V$, we have (i) $F(x, z_1 + z_2) \\leq F(x, z_1) + F(x, z_2)$; (ii) $F(x, \\lambda z) = |\\lambda| F(x, z)$; and\n(iii) $F(x, z) > 0$ for all z \u2208 V \\ {0}. The local norm of z at x is then defined as $\\|z\\|_x = F(x, z)$.\n\nExample 3.3. Suppose that $X = \\mathbb{R}^d$ is endowed with the Euclidean norm as in Example 3.1.\nThen, setting $h(x) = (1/2) \\|x\\|_2^2$, we get the standard expression $D(p, x) = (1/2) \\|p - x\\|_2^2$ for the\nassociated Bregman divergence. Obviously, h is 1-strongly convex relative to $\\|\\cdot\\|_2$.\nExample 3.4. Let $X = [0, 1)^d$ (so X is neither open nor closed), and consider the local norm\n$\\|z\\|_x^2 = \\sum_{i=1}^d |z_i|^2 / (1 - x_i)^2$ for x \u2208 X, $z \\in \\mathbb{R}^d$ (cf. Example 3.2 above). If we set\n\n$$h(x) = \\sum_{i=1}^d \\frac{1}{1 - x_i}, \\tag{3.8}$$\n\na straightforward calculation gives\n\n$$D(p, x) = \\sum_{i=1}^d \\frac{(p_i - x_i)^2}{(1 - p_i)(1 - x_i)^2} \\geq \\sum_{i=1}^d \\frac{(p_i - x_i)^2}{(1 - x_i)^2} = \\|p - x\\|_x^2, \\tag{3.9}$$\n\ni.e., h is strongly convex relative to $\\|\\cdot\\|_x$. Importantly, since $\\|\\cdot\\|_x \\geq \\|\\cdot\\|_2$, this Bregman function is\nalso strongly convex relative to the standard Euclidean norm. However, even though the Euclidean\nregularizer of Example 3.3 is strongly convex relative to any global norm on X, it cannot be strongly\nconvex relative to the local norm $\\|\\cdot\\|_x$ because of the singularity of the latter when $x_i \\to 1^-$.\n\nBregman continuity. We are now in a position to introduce the notion of Bregman continuity:\nDefinition 3. Let h be a local Bregman function relative to some local norm $\\|\\cdot\\|_x$ on X. We say that\nthe operator A : X \u2192 V\u2217 is \u03b2-Bregman continuous if\n\n$$\\|A(x') - A(x)\\|_{x,*} \\leq \\beta \\sqrt{2D(x, x')} \\quad \\text{for all } x, x' \\in X. \\tag{BC}$$\n\nOf course, in the case of a global norm with Bregman function $h(x) = (1/2) \\|x\\|^2$ (cf. Example 3.3),\nwe recover the standard Lipschitz continuity condition: $\\|A(x') - A(x)\\|_* \\leq \\beta \\sqrt{2D(x', x)} = \\beta \\|x' - x\\|$.\nOn the other hand, the example below shows that an operator can be Bregman continuous\nwithout being Lipschitz continuous relative to any global norm:\nExample 3.5. 
Consider the operator $A(x) = (1/(c_r - x_r))_{r \\in \\mathcal{R}}$ defined in Example 2.3. Renormalizing\n$c_r$ to 1 for clarity and using the Bregman data of Examples 3.2 and 3.4, we get:\n\n$$\\|A(x') - A(x)\\|_{x,*}^2 = \\sum_{i=1}^d \\frac{(x_i' - x_i)^2}{(1 - x_i')^2} \\leq \\sum_{i=1}^d \\frac{(x_i' - x_i)^2}{(1 - x_i)(1 - x_i')^2} = D(x, x'), \\tag{3.10}$$\n\ni.e., A is $(1/\\sqrt{2})$-Bregman continuous relative to h. However, given the singularity of A(x) as\n$x_i \\to 1^-$, we see that A cannot be Lipschitz continuous relative to any global norm on X.\nImportantly, this example suggests the following rule of thumb: if the Jacobian of A exhibits a\nsingularity of the form $O(\\phi(x))$ near the residual set cl(X) \\ X of X, taking $\\|\\cdot\\|_x = \\Theta(\\phi(x))$ and\n$h(x) = \\Theta(\\phi(x))$ allows A to be Bregman continuous, despite this singularity. This heuristic provides\na principled choice of Bregman data under which A satisfies (BC).\n\n4 The mirror-prox algorithm\n\nIn this section, we present the main algorithmic method that we will use to solve (VI) under Bregman\ncontinuity. Our core assumptions in that regard will be:\nAssumption 1. The solution set $X^* \\equiv \\mathrm{Sol}(X, A)$ of (VI) is nonempty.\nAssumption 2. A is monotone and \u03b2-Bregman continuous.\n\nIn addition to the above, we assume that the optimizer gains access to A via an oracle which, when\ncalled at the t-th stage of a sequence $X_t \\in X$, returns (possibly imperfect) feedback of the form\n\n$$V_t = A(X_t) + U_t, \\tag{4.1}$$\n\nwhere $U_t \\in V^*$ is an additive noise variable. The two cases of interest that we consider here are\n(i) when $U_t = 0$ for all t; and (ii) when $U_t$ satisfies the statistical hypotheses:\n\na) Zero-mean: $\\mathbb{E}[U_t \\mid \\mathcal{F}_t] = 0$. (4.2a)\nb) Finite variance: $\\mathbb{E}[\\|U_t\\|_*^2 \\mid \\mathcal{F}_t] \\leq \\sigma^2$. (4.2b)\n\nwith $\\mathcal{F}_t$ denoting the history (natural filtration) of $X_t$. For obvious reasons, we will refer to the first\ncase ($U_t = 0$) as a perfect oracle, and to the second one as a stochastic oracle.\nFollowing Nemirovski [38] and Juditsky et al. [21], the mirror-prox (MP) algorithm can be stated in\nrecursive form as follows:\n\n$$X_{t+1/2} = P_{X_t}(-\\gamma_t V_t), \\qquad X_{t+1} = P_{X_t}(-\\gamma_t V_{t+1/2}), \\tag{MP}$$\n\nwhere $\\gamma_t > 0$ is a variable step-size sequence (discussed in detail below), and the so-called \u201cprox-mapping\u201d\n$P : X^\\circ \\times V^* \\to X$ is defined as\n\n$$P_x(y) = \\arg\\min_{x' \\in X} \\{ \\langle y, x - x' \\rangle + D(x', x) \\} \\tag{4.3}$$\n\nwith $D(\\cdot, \\cdot)$ denoting the divergence of an underlying Bregman function h : X \u2192 R. For concreteness,\nwe also assume in what follows that (MP) is initialized at the so-called \u201cprox-center\u201d of X, i.e.,\n\n$$X_1 = x_c \\equiv \\arg\\min_{x \\in X} h(x). \\tag{4.4}$$\n\nRemark 1. In general, calculating mirror steps can be computationally expensive \u2013 just like Euclidean\nprojections in several cases. In what follows, we tacitly assume that our setting is \u201cprox-friendly\u201d\n[21, 38, 40] in the sense that the update (4.3) can be computed efficiently (e.g., as in Example 3.4).\n\nHeuristically, the main idea behind (MP) is that, at each t = 1, 2, ..., the oracle is called at the\nalgorithm\u2019s base state $X_t$ to generate an intermediate, leading state $X_{t+1/2}$; subsequently, the base\nstate is updated with oracle information from the leading state $X_{t+1/2}$ and the process repeats. 
In\nthis way, (MP) essentially tries to \u201canticipate\u201d the change of A along a prox-step, and to exploit\nthis \u201cforward\u201d information in order to achieve a faster convergence rate than ordinary forward-\nbackward/gradient descent schemes. For this anticipatory scheme to work, the variation of the\noperator A must be suf\ufb01ciently gradual, hence the need for Lipschitz continuity in the classical\nanalysis of the algorithm [21, 38, 40]. If this variation is unbounded (e.g., if A exhibits singularities),\nthis look-ahead mechanism could break down completely and the algorithm might fail to converge\naltogether. Our \ufb01rst result below is that, despite such singularities, Bregman continuity allows us to\nrecover the optimal convergence rate of (MP):\nTheorem 1. Assume that A satis\ufb01es Assumptions 1 and 2, and let GapH denote the restricted gap\nfunction for the Bregman zone CH = {x \u2208 X : D(x, xc) \u2264 H}. Suppose further that (MP) is run\nwith a K-strongly convex Bregman function and oracle feedback of the form (4.1). 
Then, for all\nH > 0, the averaged sequence $\\bar{X}_T = \\sum_{t=1}^T \\gamma_t X_{t+1/2} \\big/ \\sum_{t=1}^T \\gamma_t$ enjoys the following gap bounds:\n\na) If $\\sigma^2 = 0$ and the algorithm\u2019s step-size satisfies\n\n$$0 < \\gamma_{\\min} \\equiv \\inf_t \\gamma_t \\leq \\sup_t \\gamma_t \\equiv \\gamma_{\\max} \\leq \\sqrt{K}/\\beta, \\tag{4.5}$$\n\nwe have\n\n$$\\mathrm{Gap}_H(\\bar{X}_T) \\leq \\frac{H}{\\gamma_{\\min}} \\frac{1}{T}. \\tag{4.6}$$\n\nb) Otherwise, if $\\sigma^2 > 0$ and $\\gamma_t \\leq \\sqrt{K/2}/\\beta$, we have\n\n$$\\mathbb{E}[\\mathrm{Gap}_H(\\bar{X}_T)] = O\\left( \\frac{H + \\sigma^2 \\sum_{t=1}^T \\gamma_t^2}{\\sum_{t=1}^T \\gamma_t} \\right). \\tag{4.7}$$\n\nIn particular, if $\\gamma_t \\propto 1/\\sqrt{T}$, we get $\\mathbb{E}[\\mathrm{Gap}_H(\\bar{X}_T)] = O(1/\\sqrt{T})$.\n\nAs we show in the supplement, the key step in the proof of the deterministic part of Theorem 1 is the\nfollowing energy inequality for an arbitrary target point $p \\in C_H$:\n\n$$D(p, X_{t+1}) \\leq D(p, X_t) - \\gamma_t \\langle A(X_{t+1/2}), X_{t+1/2} - p \\rangle - \\left( 1 - \\frac{\\beta^2 \\gamma_t^2}{K} \\right) D(X_{t+1/2}, X_t). \\tag{4.8}$$\n\nThere are two points where the Bregman structure of the algorithm can be seen in (4.8): in the energy\niterates $D(p, X_t)$, but also in the comparison of the algorithm\u2019s base and leading state in the term\n$D(X_{t+1/2}, X_t)$. In the \u201cvanilla\u201d setting, Lipschitz continuity is used to obtain a comparison of these\nsuccessive states in terms of a global norm difference of the form $\\|X_{t+1/2} - X_t\\|^2$. However, this\nstep also requires A to vary gradually relative to $\\|\\cdot\\|$, which is of course impossible if A exhibits\nsingularities. The key novelty in our setting is the use of the Bregman divergence as a comparator\nfor the algorithm\u2019s successive states: it is at this point that the triple interplay between the operator,\nthe local norm and the chosen Bregman function is made manifest, and it is what makes Bregman\ncontinuity particularly well-suited for tackling singular problems of this kind. This requires a careful\ntreatment of the various Bregman differences involved, so we defer the details to the supplement.\n\nAlgorithm 1: adaptive mirror-prox (AMP)\nRequire: local norm $\\|\\cdot\\|_x$, K-strongly convex Bregman function h, shrink ratio \u03b8 \u2208 (0, 1)\n1: take $X_1 = \\arg\\min h$, $\\gamma_1 > 0$    # initialization\n2: for t = 1, 2, ... do\n3:     get oracle feedback $V_t$ at $X_t$    # base state query\n4:     set $X_{t+1/2} = P_{X_t}(-\\gamma_t V_t)$    # leading state update\n5:     get oracle feedback $V_{t+1/2}$ at $X_{t+1/2}$    # leading state query\n6:     set $X_{t+1} = P_{X_t}(-\\gamma_t V_{t+1/2})$    # base state update\n7:     set $\\beta_t = \\|V_{t+1/2} - V_t\\|_{X_{t+1/2},*} \\big/ \\sqrt{2D(X_{t+1/2}, X_t)}$    # estimate Bregman constant\n8:     set $\\gamma_{t+1} = \\min\\{\\gamma_t, \\theta \\sqrt{K}/\\beta_t\\}$    # update step-size\n9: end for\n\n5 The adaptive mirror-prox algorithm\n\nA crucial assumption underlying the analysis of the previous section is that the optimizer must know\nin advance \u2013 or be otherwise able to estimate \u2013 the Bregman constant \u03b2. In practice, this can be\ndifficult to achieve, so it is important to be able to run (MP) with an adaptive step-size policy. 
Our starting point is the observation that, with perfect oracle feedback, one can estimate $\\beta$ by setting\n$$\\beta_t = \\frac{\\|A(X_{t+1/2}) - A(X_t)\\|_{X_{t+1/2},*}}{\\sqrt{2 D(X_{t+1/2}, X_t)}} \\tag{5.1}$$\nwhenever $X_{t+1/2} \\neq X_t$; obviously, if $A$ is $\\beta$-Bregman continuous, we have $\\beta_t \\leq \\beta$.$^7$ However, the fact that the Bregman constant is being under-estimated means that a step-size policy of the form $\\gamma_t \\propto \\sqrt{K}/\\beta_t$ would over-estimate the inverse Bregman constant $1/\\beta$, so the resulting step-size policy would have no reason to satisfy (4.5).\n\nTo overcome this obstacle, we introduce the following comparison mechanism: first, at each $t = 1, 2, \\dots$, we use the estimation (5.1) to test the step-size $\\bar{\\gamma}_t = \\sqrt{K}/\\beta_t$. Then, to avoid the growth phenomenon outlined above, we shrink $\\bar{\\gamma}_t$ by a constant factor of $\\theta$ and we keep the previous step-size whenever the shrunk one would be larger; since $\\beta_t \\leq \\beta$, the step-size then stays bounded below by $\\min\\{\\gamma_1, \\theta \\sqrt{K}/\\beta\\}$, so it cannot vanish. Formally, we consider the adaptive step-size policy:\n$$\\gamma_{t+1} = \\begin{cases} \\min\\{\\gamma_t, \\theta \\sqrt{K}/\\beta_t\\} & \\text{if } X_t \\neq X_{t+1/2}, \\\\\\\\ \\gamma_t & \\text{otherwise,} \\end{cases} \\tag{5.2}$$\nwith $\\beta_t$ defined as in (5.1) and $\\theta \\in (0, 1)$ chosen arbitrarily.\n\nFor concreteness, we call the resulting algorithm adaptive mirror-prox (AMP) and we provide a pseudocode implementation in Algorithm 1 above. In terms of performance, we have:\n\nTheorem 2. Assume that $A$ satisfies Assumptions 1 and 2, and (MP) is run with perfect oracle feedback and the adaptive step-size policy (5.2). Then, with notation as in Theorem 1, the algorithm\u2019s ergodic average $\\bar{X}_T = \\sum_{t=1}^{T} \\gamma_t X_{t+1/2} \\big/ \\sum_{t=1}^{T} \\gamma_t$ enjoys the gap bound $\\mathrm{Gap}_H(\\bar{X}_T) = O(1/T)$.\n\nWe find this result particularly appealing because it yields the optimal $O(1/T)$ convergence rate of the mirror-prox algorithm, even for possibly singular operators, and even if the operator\u2019s Bregman constant is unknown. Its proof relies on using the specific form of the step-size policy (5.2) to control the second term in the energy inequality (4.8); we provide the detailed arguments in the supplement.\n\n$^7$ In a Euclidean setting, similar ideas can be found in, e.g., [8, 29]. We are not aware of the origins of this technique.\n\nFigure 1: Different variants of the mirror-prox algorithm in the resource sharing problem of Example 2.3. The algorithm labeled \u201cextra-gradient\u201d refers to Euclidean regularization and a constant step size as indicated in the legend; \u201cmirror-prox\u201d was run with the Bregman function of Example 3.4 and step-sizes as in the legend; finally, \u201cadaptive mirror-prox\u201d corresponds to Algorithm 1, i.e., mirror-prox with the adaptive step-size policy (5.2).\n\n6 Numerical experiments\n\nWe performed a series of numerical experiments on the resource sharing problem described in Example 2.3 with a set of $R = 1000$ servers being shared by $N = 100$ commodities, each with a demand drawn uniformly at random from $[0, 1]$; the capacity $c_r$ of each server $r = 1, \\dots, R$ was also drawn randomly from $[0, 100]$. Subsequently, we ran two variants of the mirror-prox method: (MP) with Euclidean regularization, and (MP) with the Bregman function defined in Example 3.4. For all methods, we ran a range of different constant step-sizes (we present the most representative values, namely $\\gamma = 0.001$, $\\gamma = 0.005$, and $\\gamma = 0.010$).
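As a rough illustration of how Algorithm 1 and the step-size policy (5.2) fit together, the following Python sketch runs the adaptive loop on a toy problem. It is not the resource sharing experiment of this section: it assumes a Euclidean setup (h equal to half the squared norm, so K = 1, the Bregman divergence is half the squared distance, and the prox-mapping is a projection onto a box) and a bilinear saddle-point operator. All names (adaptive_mirror_prox, rotation, box_prox, euclid_div, euclid_norm) and the default parameter values are hypothetical choices made for this sketch only.

```python
import numpy as np


def adaptive_mirror_prox(operator, prox, divergence, dual_norm, x1,
                         gamma1=1.0, theta=0.9, K=1.0, num_iters=300):
    # Sketch of Algorithm 1 (AMP); argument names are illustrative assumptions.
    x = np.asarray(x1, dtype=float)
    gamma = gamma1
    gammas = []
    weighted_sum = np.zeros_like(x)
    for _ in range(num_iters):
        v = operator(x)                      # base state query
        x_half = prox(x, -gamma * v)         # leading state update
        v_half = operator(x_half)            # leading state query
        x_next = prox(x, -gamma * v_half)    # base state update
        gammas.append(gamma)
        weighted_sum += gamma * x_half       # accumulate the ergodic average
        d = divergence(x_half, x)
        if d > 0.0:                          # policy (5.2): update only if the states differ
            beta = dual_norm(v_half - v, x_half) / np.sqrt(2.0 * d)  # estimate (5.1)
            if beta > 0.0:
                gamma = min(gamma, theta * np.sqrt(K) / beta)
        x = x_next
    x_bar = weighted_sum / sum(gammas)
    return x, x_bar, gammas


def rotation(z):
    # Toy operator A(x, y) = (y, -x), i.e. the saddle-point problem min_x max_y x*y.
    return np.array([z[1], -z[0]])


def box_prox(x, y):
    # Euclidean prox-mapping on the box [-2, 2]^2: project x + y onto the box.
    return np.clip(x + y, -2.0, 2.0)


def euclid_div(p, x):
    # Bregman divergence of h = half the squared Euclidean norm.
    return 0.5 * float(np.sum((p - x) ** 2))


def euclid_norm(v, x):
    # The local dual norm does not depend on the base point in the Euclidean case.
    return float(np.linalg.norm(v))


x_last, x_bar, gammas = adaptive_mirror_prox(
    rotation, box_prox, euclid_div, euclid_norm, x1=[1.0, 0.5])
```

On this toy operator the estimate (5.1) equals 1 up to rounding, so after the first iteration the step-size settles at the shrink ratio theta; with a singular operator one would instead plug in the local norm and Bregman function suited to the problem.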
Subsequently, we also ran Algorithm 1 and we plotted the distance from the solution to the induced variational inequality problem as a function of the number of iterations. The main conclusions that can be drawn are as follows:\n1. The Euclidean version of the mirror-prox algorithm (i.e., the extra-gradient algorithm) is unstable and does not converge; this is because the gradients received are very large (recall that the problem is not Lipschitz continuous).\n2. The MP variant with the non-Euclidean regularizer of Example 3.4 is convergent (since the VI problem under study is Bregman continuous relative to this Bregman function). However, depending on the method\u2019s step-size, the convergence is relatively slow, and there is no easy way to estimate the problem\u2019s Bregman constant in order to choose a \u201cgood\u201d step-size.\n3. By contrast, the AMP algorithm converges significantly faster than variants with a constant step-size. This is due to the fact that, initially, a greedier step-size is able to take larger steps towards the problem\u2019s solution, so initializing Algorithm 1 with a large step-size helps significantly.\n\n7 Concluding remarks\n\nIn this work, we introduced a novel regularity condition to account for variational inequalities (both deterministic and stochastic) with possible singularities. This condition, which we call Bregman continuity, is tailored to the operator\u2019s singularity landscape and, as such, provides the necessary bedrock to achieve optimal convergence rates via a properly chosen version of the mirror-prox algorithm (with or without knowledge of the problem\u2019s Bregman constant).
This opens up several interesting research directions: First, an appealing extension would be to develop a \u201cmodel-agnostic\u201d version of the method (which would concurrently provide optimal rates in stochastic and deterministic settings) or to combine it with backtracking / linesearch to accelerate convergence. Finally, it would also be interesting to examine the method\u2019s local convergence properties in non-monotone problems (deterministic or stochastic). We relegate these questions to future work.\n\nAcknowledgments\n\nThe authors gratefully acknowledge financial support from the French National Research Agency (ANR) under grants ORACLESS (ANR\u201316\u2013CE33\u20130004\u201301) and ELIOT (ANR-18-CE40-0030).\n\nReferences\n[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.\n[2] F. Alvarez, J. Bolte, and O. Brahic. Hessian Riemannian gradient flows in convex programming. SIAM Journal on Control and Optimization, 43(2):477\u2013501, 2004.\n[3] F. Bach and K. Y. Levy. A universal algorithm for variational inequalities adaptive to smoothness and noise. In COLT \u201919: Proceedings of the 32nd Annual Conference on Learning Theory, 2019.\n[4] J.-B. Baillon and G. Haddad. Quelques propri\u00e9t\u00e9s des op\u00e9rateurs angle-born\u00e9s et n-cycliquement monotones. Israel Journal of Mathematics, 26:137\u2013150, 1977.\n[5] D. D.-W. Bao, S.-S. Chern, and Z. Shen. An Introduction to Riemann-Finsler Geometry. Number 200 in Graduate Texts in Mathematics. Springer-Verlag, New York, NY, 2000.\n[6] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, NY, USA, 2nd edition, 2017.\n[7] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research, 42(2):330\u2013348, May 2017.\n[8] R. I. Bo\u0163, E. R. Csetnek, and P. T. Vuong. The forward-backward-forward method from continuous and discrete perspective for pseudo-monotone variational inequalities in Hilbert spaces. https://arxiv.org/abs/1808.08084, 2018.\n[9] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131\u20132151, 2018.\n[10] M. Bravo and P. Mertikopoulos.
On the robustness of learning in games with stochastically perturbed payoff observations. Games and Economic Behavior, 103, John Nash Memorial issue:41\u201366, May 2017.\n[11] M. Bravo, D. S. Leslie, and P. Mertikopoulos. Bandit learning in concave N-person games. In NIPS \u201918: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.\n[12] R. E. Bruck Jr. On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 61(1):159\u2013164, November 1977.\n[13] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120\u2013145, May 2011.\n[14] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168\u20131200, 2005.\n[15] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR \u201918: Proceedings of the 2018 International Conference on Learning Representations, 2018.\n[16] F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Series in Operations Research. Springer, 2003.\n[17] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In ICLR \u201919: Proceedings of the 2019 International Conference on Learning Representations, 2019.\n[18] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS \u201914: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014.\n[19] F. Hanzely, P. Richtarik, and L. Xiao.
Accelerated Bregman proximal gradient methods for relatively smooth convex optimization. https://arxiv.org/abs/1808.03045, 2018.\n[20] A. N. Iusem, A. Jofr\u00e9, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 27(2):686\u2013724, 2017.\n[21] A. Juditsky, A. S. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17\u201358, 2011.\n[22] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. The Journal of Machine Learning Research, 13:1865\u20131890, 2012.\n[23] L. Kleinrock. Queueing Systems, volume 1: Theory. John Wiley & Sons, New York, NY, 1975.\n[24] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. \u00c8konom. i Mat. Metody, 12:747\u2013756, 1976.\n[25] T. Liang and J. Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In AISTATS \u201919: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.\n[26] H. Lu. \"Relative-continuity\" for non-Lipschitz non-smooth convex optimization using stochastic (or deterministic) mirror descent. https://arxiv.org/abs/1710.04718, 2017.\n[27] H. Lu, R. M. Freund, and Y. Nesterov. Relatively-smooth convex optimization by first-order methods and applications. SIAM Journal on Optimization, 28(1):333\u2013354, 2018.\n[28] Y. Malitsky. Projected reflected gradient methods for monotone variational inequalities. SIAM Journal on Optimization, 25(1):502\u2013520, 2015.\n[29] Y. Malitsky. Golden ratio algorithms for variational inequalities. https://arxiv.org/abs/1803.08832, 2018.\n[30] P. Mertikopoulos and W. H. Sandholm. Learning in games via reinforcement and regularization.
Mathematics of Operations Research, 41(4):1297\u20131324, November 2016.\n[31] P. Mertikopoulos and W. H. Sandholm. Riemannian game dynamics. Journal of Economic Theory, 177:315\u2013364, September 2018.\n[32] P. Mertikopoulos and M. Staudigl. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163\u2013197, January 2018.\n[33] P. Mertikopoulos and Z. Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465\u2013507, January 2019.\n[34] P. Mertikopoulos, E. V. Belmega, R. Negrel, and L. Sanguinetti. Distributed stochastic optimization via matrix exponential learning. IEEE Trans. Signal Process., 65(9):2277\u20132290, May 2017.\n[35] P. Mertikopoulos, C. H. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In SODA \u201918: Proceedings of the 29th annual ACM-SIAM Symposium on Discrete Algorithms, 2018.\n[36] P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR \u201919: Proceedings of the 2019 International Conference on Learning Representations, 2019.\n[37] A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: proximal point approach. https://arxiv.org/abs/1901.08511v2, 2019.\n[38] A. S. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229\u2013251, 2004.\n[39] A. S. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574\u20131609, 2009.\n[40] Y. Nesterov.
Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2):319\u2013344, 2007.\n[41] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221\u2013259, 2009.\n[42] N. Nisan, T. Roughgarden, \u00c9. Tardos, and V. V. Vazirani, editors. Algorithmic Game Theory. Cambridge University Press, 2007.\n[43] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. https://arxiv.org/abs/1610.09585, October 2016.\n[44] G. B. Passty. Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383\u2013390, December 1979.\n[45] A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In COLT \u201913: Proceedings of the 26th Annual Conference on Learning Theory, 2013.\n[46] A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In NIPS \u201913: Proceedings of the 26th International Conference on Neural Information Processing Systems, 2013.\n[47] S. M. Shahshahani. A New Mathematical Framework for the Study of Linkage and Selection. Number 211 in Memoirs of the American Mathematical Society. American Mathematical Society, Providence, RI, 1979.\n[48] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. MIT Press, Cambridge, MA, USA, 2012.\n[49] G. Stampacchia. Formes bilineaires coercitives sur les ensembles convexes. Comptes Rendus Hebdomadaires des S\u00e9ances de l\u2019Acad\u00e9mie des Sciences, 1964.\n[50] J. G. Wardrop. Some theoretical aspects of road traffic research. In Proceedings of the Institute of Civil Engineers, Part II, volume 1, pages 325\u2013378, 1952.\n[51] A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein.
Stabilizing adversarial nets with prediction methods. In ICLR \u201918: Proceedings of the 2018 International Conference on Learning Representations, 2018.\n", "award": [], "sourceid": 4578, "authors": [{"given_name": "Kimon", "family_name": "Antonakopoulos", "institution": "Inria"}, {"given_name": "Veronica", "family_name": "Belmega", "institution": "ENSEA"}, {"given_name": "Panayotis", "family_name": "Mertikopoulos", "institution": "CNRS (French National Center for Scientific Research)"}]}