{"title": "Stochastic Mirror Descent in Variationally Coherent Optimization Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 7040, "page_last": 7049, "abstract": "In this paper, we examine a class of non-convex stochastic optimization problems which we call variationally coherent, and which properly includes pseudo-/quasiconvex and star-convex optimization problems. To solve such problems, we focus on the widely used stochastic mirror descent (SMD) family of algorithms (which contains stochastic gradient descent as a special case), and we show that the last iterate of SMD converges to the problem\u2019s solution set with probability 1. This result contributes to the landscape of non-convex stochastic optimization by clarifying that neither pseudo-/quasi-convexity nor star-convexity is essential for (almost sure) global convergence; rather, variational coherence, a much weaker requirement, suffices. Characterization of convergence rates for the subclass of strongly variationally coherent optimization problems as well as simulation results are also presented.", "full_text": "Stochastic Mirror Descent in\n\nVariationally Coherent Optimization Problems\n\nZhengyuan Zhou\nStanford University\n\nzyzhou@stanford.edu\n\nPanayotis Mertikopoulos\n\nUniv. Grenoble Alpes, CNRS, Inria, LIG\npanayotis.mertikopoulos@imag.fr\n\nNicholas Bambos\nStanford University\n\nbambos@stanford.edu\n\nStephen Boyd\n\nStanford University\nboyd@stanford.edu\n\nPeter Glynn\n\nStanford University\n\nglynn@stanford.edu\n\nAbstract\n\nIn this paper, we examine a class of non-convex stochastic optimization problems\nwhich we call variationally coherent, and which properly includes pseudo-/quasi-\nconvex and star-convex optimization problems. 
To solve such problems, we focus on the widely used stochastic mirror descent (SMD) family of algorithms (which contains stochastic gradient descent as a special case), and we show that the last iterate of SMD converges to the problem's solution set with probability 1. This result contributes to the landscape of non-convex stochastic optimization by clarifying that neither pseudo-/quasi-convexity nor star-convexity is essential for (almost sure) global convergence; rather, variational coherence, a much weaker requirement, suffices. Characterization of convergence rates for the subclass of strongly variationally coherent optimization problems as well as simulation results are also presented.

1 Introduction

The stochastic mirror descent (SMD) method and its variants [1, 7, 8] are arguably among the most widely used families of algorithms in stochastic optimization – convex and non-convex alike. Starting with the original work of [16], the convergence of SMD has been studied extensively in the context of convex programming (both stochastic and deterministic), saddle-point problems, and monotone variational inequalities. Some of the most important contributions in this domain are due to Nemirovski et al. [15], Nesterov [18] and Xiao [23], who provided tight convergence bounds for the ergodic average of SMD in stochastic/online convex programs, variational inequalities, and saddle-point problems. These results were further boosted by recent work on extra-gradient variants of the algorithm [11, 17], and the ergodic relaxation of [8], where the independence assumption on the gradient samples is relaxed and replaced by a mixing distribution that converges in probability to a well-defined limit.

However, all these works focus exclusively on the algorithm's ergodic average (also known as its time-average), a mode of convergence which is strictly weaker than the convergence of the algorithm's last iterate. 
In addition, most of the analysis focuses on establishing convergence "in expectation" and then leveraging sophisticated martingale concentration inequalities to derive "large deviations" results that hold true with high probability. Last (but certainly not least), the convexity of the objective plays a crucial role: thanks to the monotonicity of the gradient, it is possible to exploit regret-like bounds and transform them into explicit convergence rates.1

1For the role of variational monotonicity in the context of convex programming, see also [22].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

By contrast, the gradient operator of the non-convex programs studied in this paper does not satisfy any reasonable monotonicity property (such as quasi-/pseudo-monotonicity, monotonicity-plus, or any of the standard variants encountered in the theory of variational inequalities [9]). Furthermore, given that there is no inherent averaging in the algorithm's last iterate, it is not possible to employ a regret-based analysis such as the one yielding convergence in convex programs. Instead, to establish convergence, we use the stochastic approximation method of Benaïm and Hirsch [2, 3] to compare the evolution of the SMD iterates to the flow of an underlying mean dynamical system.2 By a judicious application of martingale limit theory, we then exploit variational coherence to show that the last iterate of SMD converges with probability 1, recovering in the process a large part of the convergence analysis of the works mentioned above.

Our Contributions. We consider a class of non-convex optimization problems, which we call variationally coherent and which strictly includes convex, pseudo/quasi-convex and star-convex optimization problems. For this class of optimization problems, we show that the last iterate of SMD converges with probability 1 to a global minimum under i.i.d. gradient samples. 
To the best of our knowledge, this strong convergence guarantee (almost sure convergence of the last iterate of SMD) is not known even for stochastic convex problems. As such, this result contributes to the landscape of non-convex stochastic optimization by making clear that neither pseudo-/quasi-convexity nor star-convexity is essential for global convergence; rather, variational coherence, a much weaker requirement, suffices.

Our analysis leverages the Lyapunov properties of the Fenchel coupling [14], a primal-dual divergence measure that quantifies the distance between primal (decision) variables and dual (gradient) variables, and which serves as an energy function to establish recurrence of SMD (Theorem 3.4). Building on this recurrence, we consider an ordinary differential equation (ODE) approximation of the SMD scheme and, drawing on various results from the theory of stochastic approximation and variational analysis, we connect the solution of this ODE to the last iterate of SMD. In so doing, we establish the algorithm's convergence with probability 1 from any initial condition (Theorem 4.4) and, to complete the circle, we also provide a convergence rate estimate for the subclass of strongly variationally coherent optimization problems.

Importantly, although the ODE approximation of discrete-time Robbins–Monro algorithms has been widely studied in control and stochastic optimization [10, 13], converting the convergence guarantees of the ODE solution back to the discrete-time process is a fairly subtle affair that must be done on a case-by-case basis. Further, even if such a conversion goes through, the results typically have the nature of convergence-in-distribution: almost sure convergence is much harder to obtain [5].

2 Setup and Preliminaries

Let X be a convex compact subset of a d-dimensional real space V with norm ‖·‖. 
Throughout this paper, we focus on the stochastic optimization problem

minimize g(x), subject to x ∈ X,  (Opt)

where the objective function g : X → ℝ is of the form

g(x) = E[G(x; ξ)]  (2.1)

for some random function G : X × Ξ → ℝ defined on an underlying (complete) probability space (Ξ, F, P). We make the following assumptions regarding (Opt):
Assumption 1. G(x; ξ) is continuously differentiable in x for almost all ξ ∈ Ξ.
Assumption 2. ∇G(x; ξ) has bounded second moments and is Lipschitz continuous in the mean: E[‖∇G(x; ξ)‖∗²] < ∞ for all x ∈ X and E[∇G(x; ξ)] is Lipschitz on X.3

Assumption 1 is a token regularity assumption which can be relaxed to account for nonsmooth objectives by using subgradient devices (as opposed to gradients). However, this would make the presentation significantly more cumbersome, so we stick with smooth objectives throughout. Assumption 2 is also standard in the stochastic optimization literature: it holds trivially if ∇G is uniformly Lipschitz (another commonly used condition) and, by the dominated convergence theorem, it further implies that g is smooth and ∇g(x) = ∇E[G(x; ξ)] = E[∇G(x; ξ)] is Lipschitz continuous. As a result, the solution set

X∗ = arg min g  (2.2)

of (Opt) is closed and nonempty (by the compactness of X and the continuity of g).

2For related approaches based on the theory of dynamical systems, see [21] and [12].
3In the above, gradients are treated as elements of the dual space V∗ of V and ‖v‖∗ = sup{⟨v, x⟩ : ‖x‖ ≤ 1} denotes the dual norm of v ∈ V∗. We also note that ∇G(x; ξ) refers to the gradient of G(x; ξ) with respect to x; since Ξ need not have a differential structure, there is no danger of confusion.

2

Remark 2.1. An important special case of (Opt) is when G(x; ξ) = g(x) + ⟨ζ, x⟩ for some V∗-valued random vector ζ such that E[ζ] = 0 and E[‖ζ‖∗²] < ∞. This gives ∇G(x; ξ) = ∇g(x) + ζ, so (Opt) can also be seen as a model for deterministic optimization problems with noisy gradient observations.

2.1 Variational Coherence

With all this at hand, we now define the class of variationally coherent optimization problems:
Definition 2.1. We say that (Opt) is variationally coherent if

⟨∇g(x), x − x∗⟩ ≥ 0  for all x ∈ X, x∗ ∈ X∗,  (VC)

with equality if and only if x ∈ X∗.
Remark 2.2. (VC) can be interpreted in two ways. First, as stated, it is a non-random condition for g, so it applies equally well to deterministic optimization problems (with or without noisy gradient observations). Alternatively, by the dominated convergence theorem, (VC) can be written as:

E[⟨∇G(x; ξ), x − x∗⟩] ≥ 0.  (2.3)

In this form, it can be interpreted as saying that G is variationally coherent "on average", without any individual realization thereof satisfying (VC).
Remark 2.3. Importantly, (VC) does not have to be stated in terms of the solution set of (Opt). Indeed, assume that C is a nonempty subset of X such that

⟨∇g(x), x − p⟩ ≥ 0  for all x ∈ X, p ∈ C,  (2.4)

with equality if and only if x ∈ C. Then, as the next lemma (see appendix) indicates, C = arg min g:
Lemma 2.2. Suppose that (2.4) holds for some nonempty subset C of X. Then C is closed, convex, and it consists precisely of the global minimizers of g.
Corollary 2.3. If (Opt) is variationally coherent, arg min g is convex and compact.

Remark 2.4. 
All the results given in this paper also carry through for λ-variationally coherent optimization problems, a further generalization of variational coherence. More precisely, we say that (Opt) is λ-variationally coherent if there exists a (component-wise) positive vector λ ∈ ℝᵈ such that

∑_{i=1}^d λi (∂g/∂xi)(xi − x∗i) ≥ 0  for all x ∈ X, x∗ ∈ X∗,  (2.5)

with equality if and only if x ∈ X∗. For simplicity, our analysis will be carried out in the "vanilla" variational coherence framework, but one should keep in mind that the results to follow also hold for λ-coherent problems.

2.2 Examples of Variational Coherence

Example 2.1 (Convex programs). If g is convex, ∇g is a monotone operator [19], i.e.

⟨∇g(x) − ∇g(x′), x − x′⟩ ≥ 0  for all x, x′ ∈ X.  (2.6)

By the first-order optimality conditions for g, we have ⟨∇g(x∗), x − x∗⟩ ≥ 0 for all x ∈ X. Hence, by monotonicity, we get

⟨∇g(x), x − x∗⟩ ≥ ⟨∇g(x∗), x − x∗⟩ ≥ 0  for all x ∈ X, x∗ ∈ X∗.  (2.7)

By convexity, it follows that ⟨∇g(x), x − x∗⟩ > 0 whenever x∗ ∈ X∗ and x ∈ X \ X∗, so equality holds in (2.7) if and only if x ∈ X∗.

3

Example 2.2 (Pseudo/Quasi-convex programs). The previous example shows that variational coherence is a weaker and more general notion than convexity and/or operator monotonicity. In fact, as we show below, the class of variationally coherent problems also contains all pseudo-convex programs, i.e. when

⟨∇g(x), x′ − x⟩ ≥ 0 ⟹ g(x′) ≥ g(x),  (PC)

for all x, x′ ∈ X. 
In this case, we have:
Proposition 2.4. If g is pseudo-convex, (Opt) is variationally coherent.
Proof. Take x∗ ∈ X∗ and x ∈ X \ X∗, and assume ad absurdum that ⟨∇g(x), x − x∗⟩ ≤ 0. By (PC), this implies that g(x∗) ≥ g(x), contradicting the choice of x and x∗. We thus conclude that ⟨∇g(x), x − x∗⟩ > 0 for all x∗ ∈ X∗, x ∈ X \ X∗; since ⟨∇g(x), x − x∗⟩ ≤ 0 if x ∈ X∗, our claim follows by continuity. ∎

We recall that every convex function is pseudo-convex, and every pseudo-convex function is quasi-convex (i.e. its sublevel sets are convex). Both inclusions are proper, but the latter is fairly thin:
Proposition 2.5. Suppose that g is quasi-convex and non-degenerate, i.e.

⟨∇g(x), z⟩ ≠ 0  for all nonzero z ∈ TC(x), x ∈ X \ X∗,  (2.8)

where TC(x) is the tangent cone vertexed at x. Then, g is pseudo-convex (and variationally coherent).

Proof. This follows from the following characterization of quasi-convex functions [6]: g is quasi-convex if and only if g(x′) ≤ g(x) implies that ⟨∇g(x), x′ − x⟩ ≤ 0. By contraposition, this yields the strict part of (PC), i.e. g(x′) > g(x) whenever ⟨∇g(x), x′ − x⟩ > 0. To complete the proof, if ⟨∇g(x), x′ − x⟩ = 0 and x ∈ X∗, (PC) is satisfied trivially; otherwise, if ⟨∇g(x), x′ − x⟩ = 0 but x ∈ X \ X∗, (2.8) implies that x′ − x = 0, so g(x′) = g(x) and (PC) is satisfied as an equality. ∎

The non-degeneracy condition (2.8) is satisfied by every quasi-convex function after an arbitrarily small perturbation leaving its minimum set unchanged. 
By this token, Propositions 2.4 and 2.5 imply that essentially all quasi-convex programs are also variationally coherent.
Example 2.3 (Star-convex programs). If g is star-convex, then ⟨∇g(x), x − x∗⟩ ≥ g(x) − g(x∗) for all x ∈ X, x∗ ∈ X∗. This is easily seen to be a special case of variational coherence because ⟨∇g(x), x − x∗⟩ ≥ g(x) − g(x∗) ≥ 0, with the last inequality strict unless x ∈ X∗. Note that star-convex functions contain convex functions as a subclass (but not necessarily pseudo/quasi-convex functions).
Example 2.4 (Beyond quasi-/star-convexity). A simple example of a function that is variationally coherent without being quasi-convex or star-convex is given by:

g(x) = 2 ∑_{i=1}^d √(1 + xi),  x ∈ [0, 1]ᵈ.  (2.9)

When d ≥ 2, it is easy to see that g is not quasi-convex: for instance, taking d = 2, x = (0, 1) and x′ = (1, 0) yields g(x/2 + x′/2) = 2√6 > 2 + 2√2 = max{g(x), g(x′)}, so g is not quasi-convex. It is also instantly clear that this function is not star-convex, even when d = 1 (in which case it is a concave function). On the other hand, to establish (VC), simply note that X∗ = {0} and ⟨∇g(x), x − 0⟩ = ∑_{i=1}^d xi/√(1 + xi) > 0 for all x ∈ [0, 1]ᵈ \ {0}. For a more elaborate example of a variationally coherent problem that is not quasi-convex, see Figure 2.

2.3 Stochastic Mirror Descent

To solve (Opt), we focus on the widely used family of algorithms known as stochastic mirror descent (SMD), formally given in Algorithm 1.4 Heuristically, the main idea of the method is as follows: At each iteration, the algorithm takes as input an independent and identically distributed (i.i.d.) sample

4Mirror descent dates back to the original work of Nemirovski and Yudin [16]. 
More recent treatments include [1, 8, 15, 18, 20] and many others; the specific variant of SMD that we are considering here is most closely related to Nesterov's "dual averaging" scheme [18].

4

Figure 1: Schematic representation of stochastic mirror descent (Algorithm 1).

of the gradient of G at the algorithm's current state. Subsequently, the method takes a step along this stochastic gradient in the dual space Y ≡ V∗ of V (where gradients live); the result is "mirrored" back to the problem's feasible region X to obtain a new solution candidate, and the process repeats. In pseudocode form, we have:

Algorithm 1 Stochastic mirror descent (SMD)
Require: Initial score variable Y0
1: n ← 0
2: repeat
3:   Xn = Q(Yn)
4:   Yn+1 = Yn − αn+1∇G(Xn, ξn+1)
5:   n ← n + 1
6: until end
7: return solution candidate Xn

In the above representation, the key elements of SMD (see also Fig. 1) are:

1. The "mirror map" Q : Y → X that outputs a solution candidate Xn ∈ X as a function of the auxiliary score variable Yn ∈ Y. In more detail, the algorithm's mirror map Q is defined as

Q(y) = arg max_{x∈X} {⟨y, x⟩ − h(x)},  (2.10)

where h(x) is a strongly convex function that plays the role of a regularizer. Different choices of the regularizer h yield different specific algorithms. Due to space limitations, we mention in passing two well-known examples: when h(x) = ½‖x‖₂² (i.e. the Euclidean regularizer), mirror descent becomes gradient descent; when h(x) = ∑_{i=1}^d xi log xi (i.e. the entropic regularizer), mirror descent becomes exponential gradient (aka exponential weights).

2. The step-size sequence αn > 0, chosen to satisfy the "ℓ2 − ℓ1" summability condition:

∑_{n=1}^∞ αn² < ∞,  ∑_{n=1}^∞ αn = ∞.  (2.11)

3. A sequence of i.i.d. gradient samples ∇G(x; ξn+1).5

5The specific indexing convention for ξn has been chosen so that Yn and Xn are both adapted to the natural filtration Fn of ξn.

5

3 Recurrence of SMD

In this section, we characterize an interesting recurrence phenomenon that will be useful later for establishing global convergence. Intuitively speaking, for a variationally coherent program of the general form (Opt), any neighborhood of X∗ will almost surely be visited by the iterates Xn infinitely often. Note that this already implies that at least a subsequence of iterates converges to global minima almost surely. To that end, we first define an important divergence measure between a primal variable x and a dual variable y, called the Fenchel coupling, which plays the indispensable role of an energy function.
Definition 3.1. Let h : X → ℝ be a regularizer with respect to ‖·‖ that is K-strongly convex.

1. The convex conjugate function h∗ : ℝⁿ → ℝ of h is defined as:

h∗(y) = max_{x∈X} {⟨x, y⟩ − h(x)}.

2. The mirror map Q : ℝⁿ → X associated with the regularizer h is defined as:

Q(y) = arg max_{x∈X} {⟨x, y⟩ − h(x)}.

3. The Fenchel coupling F : X × ℝⁿ → ℝ is defined as:

F(x, y) = h(x) − ⟨x, y⟩ + h∗(y).

Note that the name "Fenchel coupling" is natural, as it consists of all the terms in the well-known Fenchel inequality: h(x) + h∗(y) ≥ ⟨x, y⟩. 
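To make the definition concrete, the following is a minimal numeric sketch (our own illustration, not part of the paper) for the Euclidean regularizer h(x) = ½‖x‖² over the box X = [0, 1]ᵈ, where the mirror map reduces to coordinate-wise clipping and h is 1-strongly convex (K = 1):

```python
import numpy as np

def mirror_map(y, lo=0.0, hi=1.0):
    # For h(x) = 0.5*||x||^2 on the box [lo, hi]^d, the mirror map
    # Q(y) = argmax_x { <y, x> - h(x) } is the Euclidean projection of y.
    return np.clip(y, lo, hi)

def conjugate(y):
    # h*(y) = max_x { <x, y> - h(x) }, attained at x = Q(y).
    x = mirror_map(y)
    return x @ y - 0.5 * x @ x

def fenchel_coupling(x, y):
    # F(x, y) = h(x) - <x, y> + h*(y)
    return 0.5 * x @ x - x @ y + conjugate(y)

rng = np.random.default_rng(0)
K = 1.0  # strong convexity modulus of the Euclidean regularizer
for _ in range(1000):
    x = rng.uniform(0.0, 1.0, size=5)
    y = rng.normal(scale=3.0, size=5)
    F = fenchel_coupling(x, y)
    gap = F - 0.5 * K * np.linalg.norm(mirror_map(y) - x) ** 2
    assert F >= -1e-12   # Fenchel's inequality: F(x, y) >= 0
    assert gap >= -1e-9  # the stronger bound F(x, y) >= (K/2)||Q(y) - x||^2
print("Fenchel coupling checks passed")
```

The second assertion is the Euclidean instance of part 1 of Lemma 3.2; it holds here because, by the projection optimality condition, F(x, y) − ½‖Q(y) − x‖² = ⟨Q(y) − x, y − Q(y)⟩ ≥ 0.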
Fenchel's inequality shows that the Fenchel coupling is always non-negative. As indicated by part 1 of the following lemma, a stronger result can be obtained. We state the two key properties of the Fenchel coupling next.
Lemma 3.2. Let h : X → ℝ be a K-strongly convex regularizer on X. Then:

1. F(x, y) ≥ ½K‖Q(y) − x‖², ∀x ∈ X, ∀y ∈ ℝⁿ.
2. F(x, ỹ) ≤ F(x, y) + ⟨ỹ − y, Q(y) − x⟩ + (1/2K)‖ỹ − y‖∗², ∀x ∈ X, ∀ỹ, y ∈ ℝⁿ.

We assume that we are working with mirror maps that are regular in the following weak sense:6
Assumption 3. The mirror map Q is regular: if Q(yn) → x, then F(x, yn) → 0.
Definition 3.3. Given a point x ∈ X, a set S ⊂ X and a norm ‖·‖:

1. Define the point-to-set normed distance and Fenchel coupling distance respectively as: dist(x, S) ≜ inf_{s∈S} ‖x − s‖ and F(S, y) ≜ inf_{s∈S} F(s, y).
2. Given ε > 0, define B(S, ε) ≜ {x ∈ X | dist(x, S) < ε}.
3. Given δ > 0, define B̃(S, δ) ≜ {Q(y) | F(S, y) < δ}.

We then have the following recurrence result for a variationally coherent optimization problem (Opt).
Theorem 3.4. Under Assumptions 1–3, for any ε > 0, δ > 0 and any initial point, the (random) iterates Xn generated by Algorithm 1 enter both B(X∗, ε) and B̃(X∗, δ) infinitely often almost surely.

4 Global Convergence Results

4.1 Deterministic Convergence

When a perfect gradient ∇g(x) is available (in Line 4 of Algorithm 1), SMD recovers its deterministic counterpart: mirror descent (Algorithm 2). 
We first characterize global convergence in this case.

6Mirror maps induced by many common regularizers are regular, including the Euclidean regularizer and the entropic regularizer.

6

Figure 2: Convergence of stochastic mirror descent for the mean objective g(r, θ) = (2 + cos θ/2 + cos(4θ)) r²(5/3 − r) expressed in polar coordinates over the unit ball (r ≤ 1). In the left subfigure, we have plotted the graph of g; the plot to the right superimposes a typical SMD trajectory over the contours of g.

Algorithm 2 Mirror descent (MD)
Require: Initial score variable y0
1: n ← 0
2: repeat
3:   xn = Q(yn)
4:   yn+1 = yn − αn+1∇g(xn)
5:   n ← n + 1
6: until end
7: return solution candidate xn

Theorem 4.1. Consider an optimization problem (Opt) that is variationally coherent. Let xn be the iterates generated by MD. Under Assumption 3, lim_{n→∞} dist(xn, X∗) = 0 for any y0.
Remark 4.1. Here we do not require ∇g(x) to be Lipschitz continuous. If ∇g(x) is indeed (locally) Lipschitz continuous, then Theorem 4.1 follows directly from Theorem 4.4. Otherwise, Theorem 4.1 requires a different argument, briefly outlined as follows. Theorem 3.4 implies that (in the special case of a perfect gradient) the iterates xn generated by MD enter B(X∗, ε) infinitely often. Now, by exploiting the properties of the Fenchel coupling on a finer-grained level (compared to only using it to establish recurrence), we can establish that for any ε-neighborhood B(X∗, ε), after a certain number of iterations, once the iterate xn enters B(X∗, ε), it will never exit. Convergence therefore follows.

4.2 Stochastic Almost Sure Convergence

We begin with the minimal mathematical preliminaries [4] that will be needed.
Definition 4.2. 
A semiflow Φ on a metric space (M, d) is a continuous map Φ : ℝ₊ × M → M: (t, x) ↦ Φt(x), such that the semi-group properties hold: Φ0 = identity, Φt+s = Φt ∘ Φs for all (t, s) ∈ ℝ₊ × ℝ₊.
Definition 4.3. Let Φ be a semiflow on the metric space (M, d). A continuous function s : ℝ₊ → M is an asymptotic pseudotrajectory (APT) for Φ if for every T > 0, the following holds:

lim_{t→∞} sup_{0≤h≤T} d(s(t + h), Φh(s(t))) = 0.  (4.1)

We are now ready to state the convergence result. See Figure 2 for a simulation example.
Theorem 4.4. Consider an optimization problem (Opt) that is variationally coherent. Let Xn be the iterates generated by SMD (Algorithm 1). Under Assumptions 1–3, if ∇g(x) is locally Lipschitz continuous on X, then dist(Xn, X∗) → 0 almost surely as n → ∞, irrespective of Y0.

7

Remark 4.2. The proof is rather involved and contains several ideas. To enhance intuition and understanding, we outline the main steps here, each of which will be proved in detail in the appendix. To simplify the notation, we assume there is a unique optimum (i.e. X∗ is a singleton set). The proof is identical in the multiple-minima case, provided we replace x∗ by X∗ and use the point-to-set distance.

1. We consider the following ODE approximation of SMD:

ẏ = v(x),
x = Q(y),

where v(x) = −∇g(x). 
We verify that the ODE admits a unique solution y(t) for any initial condition. Consequently, this solution induces a semiflow7, which we denote Φt(y): it is the state at time t given that the system starts at y initially. Note that we have used y as the initial point (as opposed to y0) to indicate that the semiflow representing the solution trajectory should be viewed as a function of the initial point y.

2. We now relate the iterates generated by SMD to the above ODE's solution. Connect linearly the SMD iterates Y1, Y2, ..., Yk, ... at times 0, α1, α1 + α2, ..., ∑_{i=0}^{k−1} αi, ... respectively to form a continuous, piecewise affine (random) curve Y(t). We then show that Y(t) is almost surely an asymptotic pseudotrajectory of the semiflow Φ induced by the above ODE.

3. Having characterized the relation between the SMD trajectory (the affine interpolation of the discrete SMD iterates) and the ODE trajectory (the semiflow), we now turn to studying the latter. A desirable property of Φt(y) is that the distance F(x∗, Φt(y)) between the optimal solution x∗ and the dual variable Φt(y) (as measured by the Fenchel coupling) can never increase as a function of t. We refer to this as the monotonicity property of the Fenchel coupling under the ODE trajectory, to be contrasted with the discrete-time dynamics, where such monotonicity is absent (even when perfect information on the gradient is available). More formally, we show that ∀y, ∀0 ≤ s ≤ t,

F(x∗, Φs(y)) ≥ F(x∗, Φt(y)).  (4.2)

4. Continuing on the previous point, not only can the distance F(x∗, Φt(y)) never increase as t increases, but also, provided that Φt(y) is not too close to x∗, F(x∗, Φt(y)) will decrease no slower than linearly. 
This suggests that either Φt(y) is already close to x∗ (and hence x(t) = Q(Φt(y)) is close to x∗), or their distance will be decreased by a meaningful amount in (at least) the ensuing short time-frame. We formalize this discussion as follows:

∀ε > 0, ∀y, ∃s > 0: F(x∗, Φs(y)) ≤ max{ε/2, F(x∗, y) − ε/2}.  (4.3)

5. Now consider an arbitrary fixed horizon T. If at time t, F(x∗, Φ0(Y(t))) is small, then by the monotonicity property in Claim 3, F(x∗, Φh(Y(t))) will remain small on the entire interval h ∈ [0, T]. Since Y(t) is an asymptotic pseudotrajectory of Φ (Claim 2), Y(t + h) and Φh(Y(t)) should be very close for h ∈ [0, T], at least for t large enough. This means that F(x∗, Y(t + h)) should also be small on the entire interval h ∈ [0, T]. This can be made precise as follows: ∀ε, T > 0, ∃τ(ε, T) > 0 such that ∀t ≥ τ, ∀h ∈ [0, T]:

F(x∗, Y(t + h)) < F(x∗, Φh(Y(t))) + ε/2, a.s.  (4.4)

6. Finally, we are ready to put the above pieces together. Claim 5 gives us a way to control the amount by which the two Fenchel coupling functions differ on the interval [0, T]. Claims 3 and 4 together allow us to extend such control over the successive intervals [T, 2T), [2T, 3T), ..., thereby establishing that, at least for t large enough, if F(x∗, Y(t)) is small, then F(x∗, Y(t + h)) will remain small ∀h > 0. As it turns out, this means that after a long enough time, if xn ever visits B̃(x∗, ε), it will (almost surely) be forever trapped inside the neighborhood of twice that size (i.e. B̃(x∗, 2ε)). 
Since Theorem 3.4 ensures that xn visits B̃(x∗, ε) infinitely often (almost surely), the hypothesis is guaranteed to hold. Consequently, this leads to the following claim: ∀ε > 0, ∃τ0 (a positive integer) such that:

F(x∗, Y(τ0 + h)) < ε, ∀h ∈ [0, ∞), a.s.  (4.5)

7A crucial point to note is that since Q may not be invertible, there may not exist a unique solution for x(t).

8

Figure 3: SMD run on the objective function of Fig. 2 with γn ∝ n^{−1/2} and Gaussian random noise with standard deviation about 150% of the mean value of the gradient. Due to the lack of convexity, the algorithm's last iterate converges much faster than its ergodic average.

To conclude, Equation (4.5) implies that F(x∗, Yn) → 0 a.s. as n → ∞, where the SMD iterates Yn are the values at integer time points of the affine trajectory Y(τ). Per Statement 1 in Lemma 3.2, this gives ‖Q(Yn) − x∗‖ → 0 a.s., thereby establishing that Xn = Q(Yn) → x∗ a.s.

4.3 Convergence Rate Analysis

At the level of generality at which (VC) has been stated, it is unlikely that any convergence rate can be obtained because, unlike in the convex case, one has no handle on measuring the progress of mirror descent updates (recall that in (VC), only non-negativity is guaranteed for the inner product). Consequently, we focus here on the class of strongly coherent problems (a generalization of strongly convex problems) and derive an O(1/√T) convergence rate in terms of the squared distance to a solution of (Opt).
Definition 4.5. 
We say that g is c-strongly variationally coherent (or c-strongly coherent for short) if, for some x∗ ∈ X, we have:

⟨∇g(x), x − x∗⟩ ≥ (c/2) ‖x − x∗‖²  for all x ∈ X.    (4.6)

Theorem 4.6. If (Opt) is c-strongly coherent, then

‖x̄T − x∗‖² ≤ (2/c) · ( F(x∗, y0) + (B/(2K)) Σ_{n=0}^{T} γn² ) / Σ_{n=0}^{T} γn,

where x̄T = ( Σ_{n=0}^{T} γn xn ) / ( Σ_{n=0}^{T} γn ), K is the strong convexity coefficient of h, and B = max_{x∈X} ‖∇g(x)‖²∗.

The proof of Theorem 4.6 is given in the supplement; we mention here a few of its implications. First, in a strongly coherent optimization problem, if γn = 1/√n, then ‖x̄T − x∗‖² = O(log T/√T) (note that here ℓ²–ℓ¹ summability is not required for global convergence). By appropriately choosing the step-size sequence, one can further shave off the log T term above and obtain an O(1/√T) convergence rate. This rate matches the existing rates for gradient descent applied to strongly convex functions, even though strong variational coherence is a strict superset of strong convexity. Finally, note that even though we have characterized the rates for mirror descent (i.e., the perfect-gradient case), one can easily obtain a mean O(1/√T) rate in the stochastic case by a similar argument. This discussion is omitted due to space limitations.

We end the section (and the paper) with an interesting observation from the simulation shown in Figure 3. The rate characterized in Theorem 4.6 is with respect to the ergodic average of the mirror descent iterates, while the global convergence results established in Theorem 4.1 and Theorem 4.4 are both last-iterate convergence.
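The bound of Theorem 4.6 can be sanity-checked numerically. The sketch below is our illustration, not part of the paper: it uses the Euclidean mirror map h(x) = x²/2, so that K = 1, the mirror step Q reduces to a Euclidean projection, and the Fenchel coupling reduces to F(x∗, y) = ½(x∗ − y)²; the toy objective g(x) = x² on X = [−1, 1] is c-strongly coherent with c = 4 and x∗ = 0.

```python
import numpy as np

def smd_euclidean(grad, project, y0, gammas):
    """Lazy mirror descent with the Euclidean mirror map h(x) = x^2/2:
    x_n = Q(y_n) is a projection, and y_{n+1} = y_n - gamma_n * grad g(x_n)."""
    y = y0
    xs = []
    for g_n in gammas:
        x = project(y)
        xs.append(x)
        y = y - g_n * grad(x)
    return np.array(xs)

# Toy strongly coherent problem (our example): g(x) = x^2 on X = [-1, 1],
# so <grad g(x), x - 0> = 2 x^2 >= (c/2) x^2 with c = 4, and x* = 0.
grad = lambda x: 2.0 * x
project = lambda y: np.clip(y, -1.0, 1.0)

T = 2000
gammas = 1.0 / np.sqrt(np.arange(T) + 1.0)   # gamma_n = 1/sqrt(n+1)
xs = smd_euclidean(grad, project, y0=1.0, gammas=gammas)

# Step-size-weighted ergodic average and the Theorem 4.6 bound:
x_bar = np.sum(gammas * xs) / np.sum(gammas)
c, K, B = 4.0, 1.0, 4.0                      # B = max |grad g(x)|^2 on X
F0 = 0.5 * (0.0 - 1.0) ** 2                  # Fenchel coupling F(x*, y0) here
bound = (2.0 / c) * (F0 + (B / (2 * K)) * np.sum(gammas**2)) / np.sum(gammas)
print(x_bar**2 <= bound)  # → True
```

With γn = 1/√(n+1), the bound decays like log T/√T, exactly the first implication noted above, and the realized squared error of the ergodic average sits well below it.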
Figure 3 then provides a convergence speed comparison on the function given in Figure 2. It is apparent that the last iterate of SMD (more specifically, stochastic gradient descent) converges much faster than the ergodic average on this non-convex objective.
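The qualitative gap between the last iterate and the ergodic average can be reproduced in a few lines. The objective below is a stand-in of our own, not the function of Figure 2: g(x) = x² + 3 sin²(x) is non-convex (its second derivative changes sign) but variationally coherent with x∗ = 0, since g′(x)·x > 0 for all x ≠ 0; the noise level and step sizes are likewise illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-convex, variationally coherent objective with x* = 0:
# g(x) = x^2 + 3 sin^2(x), so grad g(x) = 2x + 3 sin(2x) and x * grad g(x) > 0.
grad = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

T = 5000
gammas = 0.1 / np.sqrt(np.arange(T) + 1.0)   # gamma_n proportional to n^(-1/2)

x = 3.0
xs = np.empty(T)
for n in range(T):
    noisy_grad = grad(x) + rng.normal(0.0, 1.0)  # stochastic gradient sample
    x = x - gammas[n] * noisy_grad               # Euclidean SMD, i.e. SGD
    xs[n] = x

err_last = abs(xs[-1])                               # last-iterate error
err_avg = abs(np.sum(gammas * xs) / np.sum(gammas))  # ergodic-average error
print(f"last iterate: {err_last:.3f}  ergodic average: {err_avg:.3f}")
```

On typical seeds the last-iterate error comes out several times smaller than the ergodic-average error, mirroring the behavior reported in Figure 3: the average retains the weight of the early transient, while the last iterate has long since settled near x∗.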
5 Acknowledgments

Zhengyuan Zhou is supported by a Stanford Graduate Fellowship and would like to thank Yinyu Ye and Jose Blanchet for constructive discussions and feedback. Panayotis Mertikopoulos gratefully acknowledges financial support from the Huawei Innovation Research Program ULTRON and the ANR JCJC project ORACLESS (grant no. ANR–16–CE33–0004–01).

References

[1] A. BECK AND M. TEBOULLE, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters, 31 (2003), pp. 167–175.

[2] M. BENAÏM, Dynamics of stochastic approximation algorithms, in Séminaire de Probabilités XXXIII, J. Azéma, M. Émery, M. Ledoux, and M. Yor, eds., vol. 1709 of Lecture Notes in Mathematics, Springer Berlin Heidelberg, 1999, pp. 1–68.

[3] M. BENAÏM AND M. W. HIRSCH, Asymptotic pseudotrajectories and chain recurrent flows, with applications, Journal of Dynamics and Differential Equations, 8 (1996), pp. 141–176.

[4] M. BENAÏM AND M. W. HIRSCH, Asymptotic pseudotrajectories and chain recurrent flows, with applications, Journal of Dynamics and Differential Equations, 8 (1996), pp. 141–176.

[5] V. S. BORKAR, Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press and Hindustan Book Agency, 2008.

[6] S. BOYD AND L. VANDENBERGHE, Convex Optimization, Cambridge University Press, 2004.

[7] N.
CESA-BIANCHI, P. GAILLARD, G. LUGOSI, AND G. STOLTZ, Mirror descent meets fixed share (and feels no regret), in Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 989–997.

[8] J. C. DUCHI, A. AGARWAL, M. JOHANSSON, AND M. I. JORDAN, Ergodic mirror descent, SIAM Journal on Optimization, 22 (2012), pp. 1549–1578.

[9] F. FACCHINEI AND J.-S. PANG, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer Series in Operations Research, Springer, 2003.

[10] S. GHADIMI AND G. LAN, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM Journal on Optimization, 23 (2013), pp. 2341–2368.

[11] A. JUDITSKY, A. S. NEMIROVSKI, AND C. TAUVEL, Solving variational inequalities with stochastic mirror-prox algorithm, Stochastic Systems, 1 (2011), pp. 17–58.

[12] W. KRICHENE, A. BAYEN, AND P. BARTLETT, Accelerated mirror descent in continuous and discrete time, in NIPS '15: Proceedings of the 29th International Conference on Neural Information Processing Systems, 2015.

[13] H. KUSHNER AND G. YIN, Stochastic Approximation and Recursive Algorithms and Applications, Stochastic Modelling and Applied Probability, Springer New York, 2013.

[14] P. MERTIKOPOULOS, Learning in games with continuous action sets and unknown payoff functions, https://arxiv.org/abs/1608.07310, 2016.

[15] A. S. NEMIROVSKI, A. JUDITSKY, G. G. LAN, AND A. SHAPIRO, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.

[16] A. S. NEMIROVSKI AND D. B. YUDIN, Problem Complexity and Method Efficiency in Optimization, Wiley, New York, NY, 1983.

[17] Y. NESTEROV, Dual extrapolation and its applications to solving variational inequalities and related problems, Mathematical Programming, 109 (2007), pp.
319–344.

[18] Y. NESTEROV, Primal-dual subgradient methods for convex problems, Mathematical Programming, 120 (2009), pp. 221–259.

[19] R. T. ROCKAFELLAR AND R. J. B. WETS, Variational Analysis, vol. 317 of A Series of Comprehensive Studies in Mathematics, Springer-Verlag, Berlin, 1998.

[20] S. SHALEV-SHWARTZ, Online learning and online convex optimization, Foundations and Trends in Machine Learning, 4 (2011), pp. 107–194.

[21] W. SU, S. BOYD, AND E. J. CANDÈS, A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights, in NIPS '14: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014, pp. 2510–2518.

[22] A. WIBISONO, A. C. WILSON, AND M. I. JORDAN, A variational perspective on accelerated methods in optimization, Proceedings of the National Academy of Sciences of the USA, 113 (2016), pp. E7351–E7358.

[23] L. XIAO, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), pp. 2543–2596.