{"title": "Variance Reduction for Matrix Games", "book": "Advances in Neural Information Processing Systems", "page_first": 11381, "page_last": 11392, "abstract": "We present a randomized primal-dual algorithm that solves the problem min_x max_y y^T A x to additive error epsilon in time nnz(A) + sqrt{nnz(A) n} / epsilon, for matrix A with larger dimension n and nnz(A) nonzero entries. This improves the best known exact gradient methods by a factor of sqrt{nnz(A) / n} and is faster than fully stochastic gradient methods in the accurate and/or sparse regime epsilon < sqrt{n / nnz(A)$. Our results hold for x,y in the simplex (matrix games, linear programming) and for x in an \\ell_2 ball and y in the simplex (perceptron / SVM, minimum enclosing ball). Our algorithm combines the Nemirovski's \"conceptual prox-method\" and a novel reduced-variance gradient estimator based on \"sampling from the difference\" between the current iterate and a reference point.", "full_text": "Variance Reduction for Matrix Games\n\nYair Carmon, Yujia Jin, Aaron Sidford and Kevin Tian\n\nStanford University\n\n{yairc,yujiajin,sidford,kjtian}@stanford.edu\n\nAbstract\n\nWe present a randomized primal-dual algorithm that solves the problem\n\ntrix A with larger dimension n and nnz(A) nonzero entries. This improves the best\n\nminx maxy y>Ax to additive error \u270f in time nnz(A) +pnnz(A)n/\u270f, for ma-\nknown exact gradient methods by a factor ofpnnz(A)/n and is faster than fully\nstochastic gradient methods in the accurate and/or sparse regime \u270f \uf8ffpn/nnz(A).\n\nOur results hold for x, y in the simplex (matrix games, linear programming) and\nfor x in an `2 ball and y in the simplex (perceptron / SVM, minimum enclosing\nball). 
Our algorithm combines Nemirovski's 'conceptual prox-method' and a novel reduced-variance gradient estimator based on 'sampling from the difference' between the current iterate and a reference point.

1 Introduction

Minimax problems—or games—of the form min_x max_y f(x, y) are ubiquitous in economics, statistics, optimization and machine learning. In recent years, minimax formulations for neural network training rose to prominence [15, 23], leading to intense interest in algorithms for solving large-scale minimax games [10, 14, 20, 9, 18, 24]. However, the algorithmic toolbox for minimax optimization is not as complete as the one for minimization. Variance reduction, a technique for improving stochastic gradient estimators by introducing control variates, stands as a case in point. A multitude of variance reduction schemes exist for finite-sum minimization [cf. 19, 34, 1, 4, 12], and their impact on complexity is well understood [43]. In contrast, only a few works apply variance reduction to finite-sum minimax problems [3, 39, 5, 26], and the potential gains from variance reduction are not well understood.
We take a step towards closing this gap by designing variance-reduced minimax game solvers that offer strict runtime improvements over non-stochastic gradient methods, similar to those of optimal variance reduction methods for finite-sum minimization. 
To achieve this, we focus on the fundamental class of bilinear minimax games,

    min_{x∈X} max_{y∈Y} y⊤Ax,  where A ∈ ℝ^{m×n}.

In particular, we study the complexity of finding an ε-approximate saddle point (Nash equilibrium), namely x, y with

    max_{y′∈Y} (y′)⊤Ax − min_{x′∈X} y⊤Ax′ ≤ ε.

In the setting where X and Y are both probability simplices, the problem corresponds to finding an approximate (mixed) equilibrium in a matrix game, a central object in game theory and economics. Matrix games are also fundamental to algorithm design due in part to their equivalence to linear programming [8]. Alternatively, when X is an ℓ2 ball and Y is a simplex, solving the corresponding problem finds a maximum-margin linear classifier (hard-margin SVM), a fundamental task in machine learning and statistics [25]. We refer to the former as an ℓ1-ℓ1 game and the latter as an ℓ2-ℓ1 game; our primary focus is to give improved algorithms for these domains.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Our Approach

Our starting point is Nemirovski's 'conceptual prox-method' [28] for solving min_{x∈X} max_{y∈Y} f(x, y), where f : X×Y → ℝ is convex in x and concave in y. The method solves a sequence of subproblems parameterized by α > 0, each of the form

    find x, y s.t. ∀x′, y′:  ⟨∇_x f(x, y), x − x′⟩ − ⟨∇_y f(x, y), y − y′⟩ ≤ αV_{x₀}(x′) + αV_{y₀}(y′)   (1)

for some (x₀, y₀) ∈ X×Y, where V_a(b) is a norm-suitable Bregman divergence from a to b: squared Euclidean distance for ℓ2 and KL divergence for ℓ1. 
Combining each subproblem solution with an extragradient step, the prox-method solves the original problem to ε accuracy by solving Õ(α/ε) subproblems.¹ (Solving (1) with α = 0 is equivalent to solving min_{x∈X} max_{y∈Y} f(x, y).)
Our first contribution is showing that if a stochastic unbiased gradient estimator g̃ satisfies the 'variance' bound

    E‖g̃(x, y) − ∇f(x₀, y₀)‖²_∗ ≤ L²‖x − x₀‖² + L²‖y − y₀‖²   (2)

for some L > 0, then O(L²/α²) regularized stochastic mirror descent steps using g̃ solve (1) in a suitable probabilistic sense. We call unbiased gradient estimators that satisfy (2) 'centered.'
Our second contribution is the construction of 'centered' gradient estimators for ℓ1-ℓ1 and ℓ2-ℓ1 bilinear games, where f(x, y) = y⊤Ax. Our ℓ1 estimator has the following form. Suppose we wish to estimate gˣ = A⊤y (the gradient of f w.r.t. x), and we already have gˣ₀ = A⊤y₀. Let p ∈ Δᵐ be some distribution over {1, . . . , m}, draw i ∼ p and set

    g̃ˣ = gˣ₀ + A_{i:} ([y]_i − [y₀]_i) / p_i,

where A_{i:} is the ith column of A⊤. This form is familiar from variance reduction techniques [19, 44, 1], which typically use a fixed distribution p. In our setting, however, a fixed p will not produce sufficiently low variance. Departing from prior variance-reduction work and building on [16, 6], we choose p based on y according to

    p_i(y) = |[y]_i − [y₀]_i| / ‖y − y₀‖₁,

yielding exactly the variance bound we require. We call this technique 'sampling from the difference.' For our ℓ2 gradient estimator, we sample from the squared difference, drawing X-block coordinate j ∼ q, where q_j(x) = ([x]_j − [x₀]_j)² / ‖x − x₀‖₂². To strengthen our results for ℓ2-ℓ1 games, we consider a refined version of the 'centered' criterion (2) which allows regret analysis using local norms [37, 6]. 
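As an illustration of the construction above, the X-block estimator can be sketched in a few lines of NumPy (the function name and variable names are ours, not the paper's; the Y-block is symmetric):

```python
import numpy as np

def sampled_difference_estimator(A, y, y0, rng):
    """One draw of the 'sampling from the difference' estimator of g^x = A^T y.

    Starts from the exact reference gradient A^T y0 and corrects it with a
    single row of A, drawn with probability p_i(y) = |y_i - y0_i| / ||y - y0||_1.
    """
    g0 = A.T @ y0                       # reference gradient, computed once
    d = y - y0
    if not np.any(d):                   # iterate equals the reference point
        return g0
    p = np.abs(d) / np.abs(d).sum()     # sampling from the difference
    i = rng.choice(len(y), p=p)
    return g0 + A[i] * d[i] / p[i]      # E_i[A_i d_i / p_i] = A^T (y - y0)
```

Averaging over i recovers A⊤y exactly (unbiasedness), and every single draw deviates from the reference gradient by at most ‖A‖_max‖y − y₀‖₁ in the ℓ∞ norm, which is the centered property (2) with L = ‖A‖_max (cf. Lemma 2 below).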
To further facilitate this analysis we follow [6] and introduce gradient clipping. We extend our proofs to show that stochastic regularized mirror descent can solve (1) despite the (distance-bounded) bias caused by gradient clipping.
Our gradient estimators attain the bound (2) with L equal to the Lipschitz constant of ∇f. Specifically,

    L = max_{ij} |A_{ij}| in the ℓ1-ℓ1 setup, and L = max_i ‖A_{i:}‖₂ in the ℓ2-ℓ1 setup.   (3)

1.2 Method complexity compared with prior art

As per the discussion above, to achieve accuracy ε our algorithm solves Õ(α/ε) subproblems. Each subproblem takes O(nnz(A)) time for computing two exact gradients (one for variance reduction and one for an extragradient step), plus an additional (m + n)L²/α² time for the inner mirror descent iterations, with L as in (3). The total runtime is therefore

    Õ( (nnz(A) + (m + n)L²/α²) · α/ε ).

¹ More precisely, the required number of subproblem solutions is at most Θ · α/ε, where Θ is a 'domain size' parameter that depends on X, Y, and the Bregman divergence V (see Section 2). In the ℓ1 and ℓ2 settings considered in this paper, we have the bound Θ ≤ log(nm) and we use the Õ notation to suppress terms logarithmic in n and m. However, in other settings—e.g., ℓ∞-ℓ1 games [cf. 38, 40]—making the parameter Θ scale logarithmically with the problem dimension is far more difficult.

By setting α optimally to be max{ε, L√((m + n)/nnz(A))}, we obtain the runtime

    Õ( nnz(A) + √(nnz(A) · (m + n)) · L · ε⁻¹ ).   (4)

Comparison with mirror-prox and dual extrapolation. Nemirovski [28] instantiates his conceptual prox-method by solving the relaxed proximal problem (1) with α = L in time O(nnz(A)), where L is the Lipschitz constant of ∇f, as given in (3). 
The total complexity of the resulting method is therefore

    Õ( nnz(A) · L · ε⁻¹ ).   (5)

The closely related dual extrapolation method of Nesterov [31] attains the same rate of convergence. We refer to the running time (5) as linear since it scales linearly with the problem description size nnz(A). Our running time guarantee (4) is never worse than (5) by more than a constant factor, and improves on (5) when nnz(A) = ω(n + m), i.e. whenever A is not extremely sparse. In that regime, our method uses α ≪ L, hence solving a harder version of (1) than possible for mirror-prox.

Comparison with sublinear-time methods. Using a randomized algorithm, Grigoriadis and Khachiyan [16] solve ℓ1-ℓ1 bilinear games in time

    Õ( (m + n) · L² · ε⁻² ),   (6)

and Clarkson et al. [6] extend this result to ℓ2-ℓ1 bilinear games, with the values of L as in (3). Since these runtimes scale with n + m ≤ nnz(A), we refer to them as sublinear. Our guarantee improves on the guarantee (6) when (m + n) · L² · ε⁻² ≥ nnz(A), i.e. whenever (6) is not truly sublinear.
Our method carefully balances linear-time extragradient steps with cheap sublinear-time stochastic gradient steps. Consequently, our runtime guarantee (4) inherits strengths from both the linear and sublinear runtimes. First, our runtime scales linearly with L/ε rather than quadratically, as does the linear runtime (5). Second, while our runtime is not strictly sublinear, its component proportional to L/ε is √(nnz(A)(n + m)), which is sublinear in nnz(A).
Overall, our method offers the best runtime guarantee in the literature in the regime

    √(nnz(A)(n + m)) / min{n, m}^ω ≪ ε/L ≪ √((n + m)/nnz(A)),

where the lower bound on ε is due to the best known theoretical runtimes of interior point methods: Õ(max{n, m}^ω log(L/ε)) [7] and Õ((nnz(A) + min{n, m}²)·√(min{n, m}) log(L/ε)) [21], where ω
is the (current) matrix multiplication exponent.
In the square dense case (i.e. nnz(A) ≈ n² = m²), we improve on the accelerated runtime (5) by a factor of √n, the same improvement that optimal variance-reduced finite-sum minimization methods achieve over the fast gradient method [44, 1].

1.3 Related work

Matrix games, the canonical form of discrete zero-sum games, have long been studied in economics [32]. The classical mirror descent (i.e. no-regret) method yields an algorithm with running time Õ(nnz(A)·L²·ε⁻²) [30]. Subsequent works [16, 28, 31, 6] improve this runtime as described above. Our work builds on the extragradient scheme of Nemirovski [28] as well as the gradient estimation and clipping technique of Clarkson et al. [6].
Balamurugan and Bach [3] apply standard variance reduction [19] to bilinear ℓ2-ℓ2 games by sampling elements proportional to squared matrix entries. Using proximal-point acceleration they obtain a runtime of Õ(nnz(A) + ‖A‖_F √(nnz(A) max{m, n}) ε⁻¹ log(1/ε)), a rate we recover using our algorithm (Appendix E). However, in this setting the mirror-prox method has runtime Õ(‖A‖_op nnz(A) ε⁻¹), which may be better than the result of [3] by a factor of √(mn/nnz(A)) due to the discrepancy in the norm of A. Naive application of [3] to ℓ1 domains results in even greater potential losses. Shi et al. [39] extend the method of [3] to smooth functions using general Bregman divergences, but their extension is unaccelerated and appears limited to an ε⁻² rate.
Chavdarova et al. [5] propose a variance-reduced extragradient method with applications to generative adversarial training. In contrast to our algorithm, which performs extragradient steps in the outer loop, the method of [5] performs stochastic extragradient steps in the inner loop, using finite-sum variance reduction as in [19]. Chavdarova et al. 
[5] analyze their method in the convex-concave setting, showing improved stability over direct application of the extragradient method to noisy gradients. However, their complexity guarantees are worse than those of linear-time methods. Following up on [5], Mishchenko et al. [26] propose to reduce the variance of the stochastic extragradient method by using the same stochastic sample for both the gradient and extragradient steps. In the Euclidean strongly convex case, they show a convergence guarantee with a relaxed variance assumption, and in the noiseless full-rank bilinear case they recover the guarantees of [27]. In the general convex case, however, they only show an ε⁻² rate of convergence.

1.4 Paper outline and additional contributions

We define our notation in Section 2. In Section 3.1, we review Nemirovski's conceptual prox-method and introduce the notion of a relaxed proximal oracle; we implement such an oracle using variance-reduced gradient estimators in Section 3.2. In Section 4, we construct these gradient estimators for the ℓ1-ℓ1 and ℓ2-ℓ1 domain settings, and complete the analyses of the corresponding algorithms; in Appendix E we provide analogous treatment for the ℓ2-ℓ2 setting, recovering the results of [3].
In Appendix F we provide three additional contributions: variance-reduction-based computation of proximal points for arbitrary convex-concave functions (Appendix F.1); extension of our results to 'composite' saddle point problems of the form min_{x∈X} max_{y∈Y} {f(x, y) + φ(x) − ψ(y)}, where f admits a centered gradient estimator and φ, ψ are 'simple' convex functions (Appendix F.2); and a number of alternative centered gradient estimators for the ℓ2-ℓ1 and ℓ2-ℓ2 settings (Appendix F.3).

2 Notation

Problem setup. A setup is the triplet (Z, ‖·‖, r) where: (i) Z is a compact and convex subset of ℝⁿ × ℝᵐ, (ii) ‖·‖ is a norm on Z and (iii) r is 1-strongly-convex w.r.t.
Z and ‖·‖, i.e. such that r(z′) ≥ r(z) + ⟨∇r(z), z′ − z⟩ + ½‖z′ − z‖² for all z, z′ ∈ Z.² We call r the distance generating function and denote the Bregman divergence associated with it by

    V_z(z′) := r(z′) − r(z) − ⟨∇r(z), z′ − z⟩ ≥ ½‖z′ − z‖².

We also denote Θ := max_{z′} r(z′) − min_z r(z) and assume it is finite.

Norms and dual norms. We write Z* for the set of linear functions on Z. For ζ ∈ Z* we define the dual norm of ‖·‖ as ‖ζ‖_∗ := max_{‖z‖≤1} ⟨ζ, z⟩. For p ≥ 1 we write the ℓp norm as ‖z‖_p = (Σ_i |z_i|^p)^{1/p}, with ‖z‖_∞ = max_i |z_i|. The dual norm of ℓp is ℓq with q⁻¹ = 1 − p⁻¹.

Domain components. We assume Z is of the form X×Y for convex and compact sets X ⊂ ℝⁿ and Y ⊂ ℝᵐ. Particular sets of interest are the simplex Δᵈ = {v ∈ ℝᵈ | ‖v‖₁ = 1, v ≥ 0} and the Euclidean ball 𝔹ᵈ = {v ∈ ℝᵈ | ‖v‖₂ ≤ 1}. For any vector z ∈ ℝⁿ × ℝᵐ, we write zˣ and zʸ for the first n and last m coordinates of z, respectively. When totally clear from context, we sometimes refer to the X and Y components of z directly as x and y. We write the ith coordinate of vector v as [v]_i.

Matrices. We consider a matrix A ∈ ℝ^{m×n} and write nnz(A) for the number of its nonzero entries. For i ∈ [m] and j ∈ [n] we write A_{i:}, A_{:j} and A_{ij} for the corresponding row, column and entry, respectively.³ We consider the matrix norms ‖A‖_max := max_{ij} |A_{ij}|, ‖A‖_{p→q} := max_{‖x‖_p≤1} ‖Ax‖_q and ‖A‖_F := (Σ_{i,j} A²_{ij})^{1/2}.

² For non-differentiable r, let ⟨∇r(z), w⟩ := sup_{γ∈∂r(z)} ⟨γ, w⟩, where ∂r(z) is the subdifferential of r at z.
³ For k ∈ ℕ, we let [k] := {1, . . .
, k}.

3 Primal-dual variance reduction framework

In this section, we establish a framework for solving the saddle point problem

    min_{x∈X} max_{y∈Y} f(x, y),

where f is convex in x and concave in y, and admits a (variance-reduced) stochastic estimator for the continuous and monotone⁴ gradient mapping

    g(z) = g(x, y) := (∇_x f(x, y), −∇_y f(x, y)).

Our goal is to find an ε-approximate saddle point (Nash equilibrium), i.e. z ∈ Z := X×Y such that

    Gap(z) := max_{y′∈Y} f(zˣ, y′) − min_{x′∈X} f(x′, zʸ) ≤ ε.   (7)

We achieve this by generating a sequence z₁, z₂, . . . , z_K such that (1/K) Σ_{k=1}^K ⟨g(z_k), z_k − u⟩ ≤ ε for every u ∈ Z and using the fact that

    Gap( (1/K) Σ_{k=1}^K z_k ) ≤ max_{u∈Z} (1/K) Σ_{k=1}^K ⟨g(z_k), z_k − u⟩   (8)

due to convexity-concavity of f (see proof in Appendix A.1).
In Section 3.1 we define the notion of a (randomized) relaxed proximal oracle, and describe how Nemirovski's mirror-prox method leverages it to solve the problem above. In Section 3.2 we define a class of centered gradient estimators, whose variance is proportional to the squared distance from a reference point. Given such a centered gradient estimator, we show that a regularized stochastic mirror descent scheme constitutes a relaxed proximal oracle. For a technical reason, we limit our oracle guarantee in Section 3.2 to the bilinear case f(x, y) = y⊤Ax, which suffices for the applications in Section 4. We lift this limitation in Appendix F.1, where we show a different oracle implementation that is valid for general convex-concave f, with only a logarithmic increase in complexity.

3.1 The mirror-prox method with a randomized oracle

Recall that we assume the space Z = X×Y is equipped with a norm ‖·‖ and distance generating function r : Z → ℝ that is 1-strongly-convex w.r.t. ‖·‖ and has range Θ. 
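In the simplex settings considered later, r is the negative entropy, so V_z(z′) specializes to the KL divergence and the 1-strong-convexity lower bound V_z(z′) ≥ ½‖z′ − z‖₁² is Pinsker's inequality. A quick numerical illustration (the helper name is ours):

```python
import numpy as np

def entropy_bregman(z, zp):
    """V_z(z') for the negative entropy r(z) = sum_i z_i log z_i.

    On the simplex (both arguments sum to 1) this reduces to the KL
    divergence KL(z' || z) = sum_i z'_i log(z'_i / z_i).
    """
    return float(np.sum(zp * np.log(zp / z)))
```

By contrast, in the Euclidean setup r(z) = ½‖z‖₂² gives V_z(z′) = ½‖z′ − z‖₂², for which the strong-convexity bound holds with equality.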
We write the induced Bregman divergence as V_z(z′) = r(z′) − r(z) − ⟨∇r(z), z′ − z⟩. We use the following fact throughout the paper: by definition, the Bregman divergence satisfies, for any z, z′, u ∈ Z,

    −⟨∇V_z(z′), z′ − u⟩ = V_z(u) − V_{z′}(u) − V_z(z′).   (9)

For any α > 0 we define the α-proximal mapping Prox_z^α(g) to be the solution of the variational inequality corresponding to the strongly monotone operator g + α∇V_z, i.e. the unique z_α ∈ Z such that ⟨g(z_α) + α∇V_z(z_α), z_α − u⟩ ≤ 0 for all u ∈ Z [cf. 11]. Equivalently (by (9)),

    Prox_z^α(g) := the unique z_α ∈ Z s.t. ⟨g(z_α), z_α − u⟩ ≤ αV_z(u) − αV_{z_α}(u) − αV_z(z_α) ∀u ∈ Z.   (10)

When V_z(z′) = V_{zˣ}(z′ˣ) + V_{zʸ}(z′ʸ), Prox_z^α(g) is also the unique solution of the saddle point problem

    min_{x′∈X} max_{y′∈Y} f(x′, y′) + αV_{zˣ}(x′) − αV_{zʸ}(y′).

Consider iterations of the form z_k = Prox_{z_{k−1}}^α(g), with z₀ = argmin_z r(z). Averaging the definition (10) over k, using the bound (8) and the nonnegativity of Bregman divergences gives

    Gap( (1/K) Σ_{k=1}^K z_k ) ≤ max_{u∈Z} (1/K) Σ_{k=1}^K ⟨g(z_k), z_k − u⟩ ≤ max_{u∈Z} α(V_{z₀}(u) − V_{z_K}(u))/K ≤ αΘ/K.

Thus, we can find an ε-suboptimal point in K = αΘ/ε exact proximal steps. However, computing Prox_z^α(g) exactly may be as difficult as solving the original problem. Nemirovski [28] proposes a relaxation of the exact proximal mapping, which we slightly extend to include the possibility of randomization, and formalize in the following.

⁴ A mapping q : Z → Z* is monotone if and only if ⟨q(z′) − q(z), z′ − z⟩ ≥ 0 for all z, z′ ∈ Z; g is monotone due to convexity-concavity of f.

Definition 1 ((α, ε)-relaxed proximal oracle). Let g be a monotone operator and α, ε > 0. 
An (α, ε)-relaxed proximal oracle for g is a (possibly randomized) mapping O : Z → Z such that z′ = O(z) satisfies

    E[ max_{u∈Z} ⟨g(z′), z′ − u⟩ − αV_z(u) ] ≤ ε.

Note that O(z) = Prox_z^α(g) is an (α, 0)-relaxed proximal oracle. Algorithm 1 describes the 'conceptual prox-method' of Nemirovski [28], which recovers the error guarantee of exact proximal iterations. The kth iteration consists of (i) a relaxed proximal oracle call producing z_{k−1/2} = O(z_{k−1}), and (ii) a linearized proximal (mirror) step where we replace z ↦ g(z) with the constant function z ↦ g(z_{k−1/2}), producing z_k = Prox_{z_{k−1}}^α(g(z_{k−1/2})). We now state the convergence guarantee for the mirror-prox method, first shown in [28] (see Appendix B.1 for a simple proof).

Algorithm 1: OuterLoop(O) (Nemirovski [28])
Input: (α, ε)-relaxed proximal oracle O(z) for gradient mapping g, distance-generating r
Parameters: Number of iterations K
Output: Point z̄_K with E Gap(z̄_K) ≤ αΘ/K + ε
1  z₀ ← argmin_{z∈Z} r(z)
2  for k = 1, . . . , K do
3      z_{k−1/2} ← O(z_{k−1})    ▹ We implement O(z_{k−1}) by calling InnerLoop(z_{k−1}, g̃_{z_{k−1}}, α)
4      z_k ← Prox_{z_{k−1}}^α(g(z_{k−1/2})) = argmin_{z∈Z} ⟨g(z_{k−1/2}), z⟩ + αV_{z_{k−1}}(z)
5  return z̄_K = (1/K) Σ_{k=1}^K z_{k−1/2}

Proposition 1 (Mirror-prox convergence via oracles). Let O be an (α, ε)-relaxed proximal oracle with respect to gradient mapping g and distance-generating function r with range at most Θ. Let z_{1/2}, z_{3/2}, . . . , z_{K−1/2} be the iterates of Algorithm 1 and let z̄_K be its output. Then

    E Gap(z̄_K) ≤ E max_{u∈Z} (1/K) Σ_{k=1}^K ⟨g(z_{k−1/2}), z_{k−1/2} − u⟩ ≤ αΘ/K + ε.

3.2 Implementation of an (α, 0)-relaxed proximal oracle

We now explain how to use stochastic variance-reduced gradient estimators to design an efficient (α, 0)-relaxed proximal oracle. 
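Before turning to the stochastic oracle, Algorithm 1 can be made concrete with the classical non-stochastic oracle choice from Section 1.2: for α ≥ L, a single mirror step from z_{k−1} implements O, recovering Nemirovski's mirror-prox. A NumPy sketch for an ℓ1-ℓ1 game (illustrative code, not the paper's; names are ours):

```python
import numpy as np

def mirror_step(z, grad, alpha):
    """Entropy prox step on the simplex: argmin_w <grad, w> + alpha * V_z(w)."""
    w = z * np.exp(-grad / alpha)
    return w / w.sum()

def gmap(A, x, y):
    """Gradient mapping g(z) = (A^T y, -A x) of f(x, y) = y^T A x."""
    return A.T @ y, -(A @ x)

def outer_loop(A, K, alpha):
    """Algorithm 1 with the classical one-mirror-step oracle (valid when
    alpha >= L = max_ij |A_ij|), followed by the linearized proximal step."""
    m, n = A.shape
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # z_0 = argmin_z r(z)
    bar_x, bar_y = np.zeros(n), np.zeros(m)
    for _ in range(K):
        gx, gy = gmap(A, x, y)
        xh, yh = mirror_step(x, gx, alpha), mirror_step(y, gy, alpha)  # z_{k-1/2}
        gx, gy = gmap(A, xh, yh)          # gradient at the half-point
        x, y = mirror_step(x, gx, alpha), mirror_step(y, gy, alpha)    # z_k
        bar_x += xh / K; bar_y += yh / K
    return bar_x, bar_y                   # average of the half-iterates
```

Proposition 1 then bounds the expected gap of the averaged half-iterates by αΘ/K + ε; the variance-reduced method below replaces the one-step oracle with InnerLoop (Algorithm 2).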
We begin by introducing the bias and variance properties of the estimators we require.

Definition 2. Let z₀ ∈ Z and L > 0. A stochastic gradient estimator g̃_{z₀} : Z → Z* is called (z₀, L)-centered for g if for all z ∈ Z

1. E[g̃_{z₀}(z)] = g(z),
2. E‖g̃_{z₀}(z) − g(z₀)‖²_∗ ≤ L²‖z − z₀‖².

Lemma 1. A (z₀, L)-centered estimator for g satisfies E‖g̃_{z₀}(z) − g(z)‖²_∗ ≤ (2L)²‖z − z₀‖².

Proof. Writing δ̃ = g̃_{z₀}(z) − g(z₀), we have Eδ̃ = g(z) − g(z₀) by the first centered estimator property. Therefore,

    E‖g̃_{z₀}(z) − g(z)‖²_∗ = E‖δ̃ − Eδ̃‖²_∗ ≤(i) 2E‖δ̃‖²_∗ + 2‖Eδ̃‖²_∗ ≤(ii) 4E‖δ̃‖²_∗ ≤(iii) (2L)²‖z − z₀‖²,

where the bounds follow from (i) the triangle inequality, (ii) Jensen's inequality and (iii) the second centered estimator property.

Remark 1. A gradient mapping that admits a (z, L)-centered gradient estimator for every z ∈ Z is 2L-Lipschitz, since by Jensen's inequality and Lemma 1 we have for all w ∈ Z

    ‖g(w) − g(z)‖_∗ = ‖E g̃_z(w) − g(z)‖_∗ ≤ (E‖g̃_z(w) − g(z)‖²_∗)^{1/2} ≤ 2L‖w − z‖.

Remark 2. Definition 2 bounds the gradient variance using the distance to the reference point. Similar bounds are used in variance reduction for bilinear saddle-point problems with Euclidean norm [3], as well as for finding stationary points in smooth nonconvex finite-sum problems [2, 33, 12, 45]. However, known variance reduction methods for smooth convex finite-sum minimization require stronger bounds [cf. 1, Section 2.1].

With the variance bounds defined, we describe Algorithm 2 which (for the bilinear case) implements a relaxed proximal oracle. The algorithm is stochastic mirror descent with an additional regularization term around the initial point w₀. Note that we do not perform extragradient steps in this stochastic method. 
When combined with a centered gradient estimator, the iterates of Algorithm 2 provide the following guarantee, which is one of our key technical contributions.

Algorithm 2: InnerLoop(w₀, g̃_{w₀}, α)
Input: Initial w₀ ∈ Z, gradient estimator g̃_{w₀}, oracle quality α > 0
Parameters: Step size η, number of iterations T
Output: Point w̄_T satisfying Definition 1 (for appropriate g̃_{w₀}, η, T)
1  for t = 1, . . . , T do
2      w_t ← argmin_{w∈Z} ⟨g̃_{w₀}(w_{t−1}), w⟩ + (α/2)V_{w₀}(w) + (1/η)V_{w_{t−1}}(w)
3  return w̄_T = (1/T) Σ_{t=1}^T w_t

Proposition 2. Let α, L > 0, let w₀ ∈ Z and let g̃_{w₀} be (w₀, L)-centered for monotone g. Then, for η = α/(10L²) and T ≥ 4/(ηα) = 40L²/α², the iterates of Algorithm 2 satisfy

    E max_{u∈Z} [ (1/T) Σ_{t∈[T]} ⟨g(w_t), w_t − u⟩ − αV_{w₀}(u) ] ≤ 0.   (11)

Before discussing the proof of Proposition 2, we state how it implies the relaxed proximal oracle property for the bilinear case.

Corollary 1. Let A ∈ ℝ^{m×n} and let g(z) = (A⊤zʸ, −Azˣ). Then, in the setting of Proposition 2, O(w₀) = InnerLoop(w₀, g̃_{w₀}, α) is an (α, 0)-relaxed proximal oracle.

Proof. Note that ⟨g(z), w⟩ = −⟨g(w), z⟩ for any z, w ∈ Z and consequently ⟨g(z), z⟩ = 0. Therefore, the iterates w₁, . . . , w_T of Algorithm 2 and its output w̄_T = (1/T) Σ_{t=1}^T w_t satisfy, for every u ∈ Z,

    (1/T) Σ_{t∈[T]} ⟨g(w_t), w_t − u⟩ = (1/T) Σ_{t∈[T]} ⟨g(u), w_t⟩ = ⟨g(u), w̄_T⟩ = ⟨g(w̄_T), w̄_T − u⟩.

Substituting into the bound (11) yields the (α, 0)-relaxed proximal oracle property in Definition 1.

More generally, the proof of Corollary 1 shows that Algorithm 2 implements a relaxed proximal oracle whenever z ↦ ⟨g(z), z − u⟩ is convex for every u. 
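In the entropy setup of Section 4.1, the composite mirror step in line 2 of Algorithm 2 has a closed form: the minimizer over the simplex is a normalized weighted geometric mean of w₀ and w_{t−1}, tilted by the stochastic gradient. A minimal sketch under that assumption (illustrative code; the names are ours, and the ghost-iterate bookkeeping of the analysis is omitted):

```python
import numpy as np

def inner_loop(w0_x, w0_y, grad_est, alpha, L, T):
    """Algorithm 2 sketch for simplex blocks with entropy divergences.

    grad_est(x, y) should return a centered estimate of g(z) = (A^T y, -A x).
    Uses the Proposition 2 step size eta = alpha / (10 L^2).
    """
    eta = alpha / (10 * L**2)
    a, b = alpha / 2, 1 / eta            # weights on V_{w0} and V_{w_{t-1}}
    def step(w, w0, grad):
        # argmin_w <grad, w> + a*KL(w||w0) + b*KL(w||w_prev), in closed form
        logw = (a * np.log(w0) + b * np.log(w) - grad) / (a + b)
        w = np.exp(logw - logw.max())    # subtract max for numerical stability
        return w / w.sum()
    x, y = w0_x.copy(), w0_y.copy()
    bar_x, bar_y = np.zeros_like(x), np.zeros_like(y)
    for _ in range(T):
        gx, gy = grad_est(x, y)
        x, y = step(x, w0_x, gx), step(y, w0_y, gy)
        bar_x += x / T; bar_y += y / T
    return bar_x, bar_y                  # the averaged iterate w_bar_T
```

Plugging in the exact gradient mapping (a zero-variance centered estimator) gives a deterministic sanity check of the oracle property (11); the variance-reduced method instead passes the sampled estimator of Section 4.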
In Appendix F.1 we implement an (α, ε)-relaxed proximal oracle without such an assumption.
The proof of Proposition 2 is a somewhat lengthy application of existing techniques for stochastic mirror descent analysis in conjunction with Definition 2. We give it in full in Appendix B.2 and sketch it briefly here. We view Algorithm 2 as mirror descent with stochastic gradients δ̃_t = g̃_{w₀}(w_t) − g(w₀) and composite term ⟨g(w₀), z⟩ + (α/2)V_{w₀}(z). For any u ∈ Z, the standard mirror descent analysis (see Lemma 4 in Appendix A.2) bounds the regret Σ_{t∈[T]} ⟨g̃_{w₀}(w_t) + (α/2)∇V_{w₀}(w_t), w_t − u⟩ in terms of the distance to initialization V_{w₀}(u) and the stochastic gradient norms ‖δ̃_t‖²_∗ for t ∈ [T]. Bounding these norms via Definition 2 and rearranging the ⟨∇V_{w₀}(w_t), w_t − u⟩ terms, we show that E[(1/T) Σ_{t∈[T]} ⟨g(w_t), w_t − u⟩ − αV_{w₀}(u)] ≤ 0 for all u ∈ Z. To reach our desired result we must swap the order of the expectation and 'for all.' We do so using the 'ghost iterate' technique due to Nemirovski et al. [29].

4 Application to bilinear saddle point problems

We now construct centered gradient estimators (as per Definition 2) for the linear gradient mapping

    g(z) = (A⊤zʸ, −Azˣ)

corresponding to the bilinear saddle point problem min_{x∈X} max_{y∈Y} y⊤Ax. Sections 4.1 and 4.2 consider the ℓ1-ℓ1 and ℓ2-ℓ1 settings, respectively; in Appendix E we show how our approach naturally extends to the ℓ2-ℓ2 setting as well. Throughout, we let w₀ denote the 'center' (i.e. reference point) of our stochastic gradient estimator and consider a general query point w ∈ Z = X×Y. We also recall the notation [v]_i for the ith entry of vector v.

4.1 ℓ1-ℓ1 games

Setup. Denoting the d-dimensional simplex by Δᵈ, we let X = Δⁿ, Y = Δᵐ and Z = X×Y. We take ‖·‖ to be the ℓ1 norm with conjugate norm ‖·‖_∗ = ‖·‖_∞. We take the distance generating
We take the distance generating\nfunction r to be the negative entropy, i.e. r(z) =Pi[z]i log[z]i. We note that both k\u00b7k1 and r are\nseparable and in particular separate over the X and Y blocks of Z. Finally we set\n\n`1-`1 games\n\nkAkmax := max\n\ni,j\n\n|Aij|\n\nand note that this is the Lipschitz constant of the gradient mapping g under the chosen norm.\n\nGradient estimator. Given w0 = (wx\n0), we describe the\nreduced-variance gradient estimator \u02dcgw0(w). First, we de\ufb01ne the probabilities p(w) 2 m and\nq(w) 2 n according to,\n\n0) and g(w0) = (A>wy\n\n0,Awx\n\n0, wy\n\nand qj(w) := |[wx]j [wx\n0]j|\nkwx wx\n0k1\nTo compute \u02dcgw0 we sample i \u21e0 p(w) and j \u21e0 q(w) independently, and set\n\npi(w) := |[wy]i [wy\n0]i|\nkwy wy\n0k1\n\n.\n\n(12)\n\n(13)\n\n\u25c6 ,\n\n\u02dcgw0(w) :=\u2713A>wy\n\n0 + Ai:\n\n[wy]i [wy\n0]i\npi(w)\n\n,Awx\n\n0 A:j\n\n[wx]j [wx\n0]j\nqj(w)\n\nwhere Ai: and A:j are the ith row and jth column of A, respectively. Since the sampling distributions\np(w), q(w) are proportional to the absolute value of the difference between blocks of w and w0, we\ncall strategy (12) \u201csampling from the difference.\u201d Substituting (12) into (13) gives the explicit form\n\u02dcgw0(w) = g(w0) + (Ai:kwy wy\n0]j)) . (14)\nA straightforward calculation shows that this construction satis\ufb01es De\ufb01nition 2.\nLemma 2. In the `1-`1 setup, the estimator (14) is (w0, L)-centered with L = kAkmax.\nProof. The \ufb01rst property (E\u02dcgw0(w) = g(w)) follows immediately by inspection of (13). 
The second property follows from (14) by noting that

    ‖g̃_{w₀}(w) − g(w₀)‖_∞ = max{ ‖A_{i:}‖_∞ ‖wʸ − wʸ₀‖₁ , ‖A_{:j}‖_∞ ‖wˣ − wˣ₀‖₁ } ≤ ‖A‖_max ‖w − w₀‖₁

for all i, j, and therefore E‖g̃_{w₀}(w) − g(w₀)‖²_∞ ≤ ‖A‖²_max ‖w − w₀‖²₁.

The proof of Lemma 2 reveals that the proposed estimator satisfies a stronger version of Definition 2: the last property and also Lemma 1 hold with probability 1 rather than in expectation.

Runtime bound. Combining the centered gradient estimator (13), the relaxed oracle implementation (Algorithm 2) and the extragradient outer loop (Algorithm 1), we obtain our main result for ℓ1-ℓ1 games: an accelerated stochastic variance reduction algorithm. We write the resulting complete method explicitly as Algorithm 3 in Appendix C.1. The algorithm enjoys the following runtime guarantee (see proof in Appendix C.2).

Theorem 1. Let A ∈ ℝ^{m×n}, ε > 0, and α ≥ ε/log(nm). Algorithm 3 outputs a point z = (zˣ, zʸ) such that E[max_{y∈Δᵐ} y⊤Azˣ − min_{x∈Δⁿ} (zʸ)⊤Ax] = E[max_i [Azˣ]_i − min_j [A⊤zʸ]_j] ≤ ε, and runs in time

    O( (nnz(A) + (m + n)‖A‖²_max/α²) · α log(mn)/ε ).   (15)

Setting α optimally, the running time is

    O( nnz(A) + √(nnz(A)(m + n)) · ‖A‖_max log(mn)/ε ).   (16)

4.2 ℓ2-ℓ1 games

Setup. We set X = 𝔹ⁿ to be the n-dimensional Euclidean ball of radius 1, while Y = Δᵐ remains the simplex. For z = (zˣ, zʸ) ∈ Z = X×Y we define a norm by

    ‖z‖² = ‖zˣ‖²₂ + ‖zʸ‖²₁,  with dual norm  ‖g‖²_∗ = ‖gˣ‖²₂ + ‖gʸ‖²_∞.

For distance generating function we take r(z) = rˣ(zˣ) + rʸ(zʸ) with rˣ(x) = ½‖x‖²₂ and rʸ(y) = Σ_i y_i log y_i; r is 1-strongly convex w.r.t. ‖·‖ and has range ½ + log m ≤ log(2m). Finally, we denote

    ‖A‖_{2→∞} := max_{i∈[m]} ‖A_{i:}‖₂,

and note that this is the Lipschitz constant of g under ‖·‖.

Gradient estimator. 
To account for the fact that X is now the ℓ2 unit ball, we modify the sampling distribution q in (12) to

    q_j(w) = ([wˣ]_j − [wˣ₀]_j)² / ‖wˣ − wˣ₀‖²₂,

and keep p the same. As we explain in detail in Appendix D.1.1, substituting these probabilities into the expression (13) yields a centered gradient estimator with a constant (Σ_{j∈[n]} ‖A_{:j}‖²_∞)^{1/2} that is larger than ‖A‖_{2→∞} by a factor of up to √n. Using local norms analysis allows us to tighten these bounds whenever the stochastic steps have bounded infinity norm. Following Clarkson et al. [6], we enforce such a bound on the step norms via gradient clipping. The final gradient estimator is

    g̃_{w₀}(w) := ( A⊤wʸ₀ + A_{i:} ‖wʸ − wʸ₀‖₁ sign([wʸ − wʸ₀]_i) ,  −Awˣ₀ − T_τ( A_{:j} ‖wˣ − wˣ₀‖²₂ / ([wˣ]_j − [wˣ₀]_j) ) ),

where [T_τ(v)]_i = max{−τ, min{τ, [v]_i}}, i.e. T_τ truncates its argument entrywise to the interval [−τ, τ].

The clipping operation T_τ introduces bias to the gradient estimator, which we account for by carefully choosing a value of τ for which the bias is on the same order as the variance, and yet the resulting steps are appropriately bounded; see Appendix D.1.2. In Appendix F.3.1 we describe an alternative gradient estimator for which the distribution q does not depend on the current iterate w.

Runtime bound. Algorithm 4 in Appendix D.5 combines our clipped gradient estimator with our general variance reduction framework. The analysis in Appendix D gives the following guarantee.

Theorem 2. Let A ∈ ℝ^{m×n}, ε > 0, and any α ≥ ε/log(2m). Algorithm 4 outputs a point z = (zˣ, zʸ) such that E[max_{y∈Δᵐ} y⊤Azˣ − min_{x∈𝔹ⁿ} (zʸ)⊤Ax] = E[max_i [Azˣ]_i + ‖A⊤zʸ‖₂] ≤ ε, and runs in time

    O( (nnz(A) + (m + n)‖A‖²_{2→∞}/α²) · α log(2m)/ε ).   (17)

Setting α optimally, the running time is
.\nO nnz(A) +pnnz(A)(m + n)kAk2!1 log(2m)\n\n\u21b52\n\n\u270f\n\n\u270f\n\n(17)\n\n(18)\n\n9\n\n\fAcknowledgments\nYC and YJ were supported by Stanford Graduate Fellowships. AS was supported by the NSF\nCAREER Award CCF-1844855. KT was supported by the NSF Graduate Fellowship DGE1656518.\n\nReferences\n\n[1] Z. Allen-Zhu. Katyusha:\n\nIn\nProceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages\n1200\u20131205, 2017.\n\nthe \ufb01rst direct acceleration of stochastic gradient methods.\n\n[2] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In Proceed-\n\nings of the 33rd International Conference on Machine Learning, pages 699\u2013707, 2016.\n\n[3] P. Balamurugan and F. R. Bach. Stochastic variance reduction methods for saddle-point\n\nproblems. In Advances in Neural Information Processing Systems, 2016.\n\n[4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning.\n\nSIAM Review, 60(2):223\u2013311, 2018.\n\n[5] T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien. Reducing noise in GAN training\nwith variance reduced extragradient. In Advances in Neural Information Processing Systems,\n2019.\n\n[6] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. In\n\n51th Annual IEEE Symposium on Foundations of Computer Science, pages 449\u2013457, 2010.\n\n[7] M. B. Cohen, Y. T. Lee, and Z. Song. Solving linear programs in the current matrix multiplication\n\ntime. arXiv preprint arXiv:1810.07896, 2018.\n\n[8] G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, NJ,\n\n1953.\n\n[9] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism.\n\nInternational Conference on Learning Representations, 2019.\n\nIn\n\n[10] Y. Drori, S. Sabach, and M. Teboulle. A simple algorithm for a class of nonsmooth convex-\n\nconcave saddle-point problems. 
Operations Research Letters, 43(2):209-214, 2015.
[11] J. Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1):202-226, 1993.
[12] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, 2018.
[13] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 2540-2548, 2015.
[14] G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
[15] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[16] M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53-58, 1995.
[17] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, New York, 1993.
[18] C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
[19] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
[20] O. Kolossoski and R. D. Monteiro. An accelerated non-Euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems.
Optimization Methods and Software, 32(6):1244-1272, 2017.
[21] Y. T. Lee and A. Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In IEEE 56th Annual Symposium on Foundations of Computer Science, pages 230-249, 2015.
[22] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.
[23] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[24] P. Mertikopoulos, H. Zenati, B. Lecouat, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations, 2019.
[25] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1987.
[26] K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik, and Y. Malitsky. Revisiting stochastic extragradient. arXiv preprint arXiv:1905.11373, 2019.
[27] A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.
[28] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229-251, 2004.
[29] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[30] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, New York, NY, 1983.
[31] Y. Nesterov.
Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2-3):319-344, 2007.
[32] J. v. Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295-320, 1928.
[33] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 314-323, 2016.
[34] M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83-112, 2017.
[35] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567-599, 2013.
[36] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105-145, 2016.
[37] S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.
[38] J. Sherman. Area-convexity, $\ell_\infty$ regularization, and undirected multicommodity flow. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 452-460. ACM, 2017.
[39] Z. Shi, X. Zhang, and Y. Yu. Bregman divergence for stochastic variance reduction: Saddle-point and adversarial prediction. In Advances in Neural Information Processing Systems, 2017.
[40] A. Sidford and K. Tian. Coordinate methods for accelerating $\ell_\infty$ regression and faster approximate maximum flow. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science, pages 922-933. IEEE, 2018.
[41] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.
[42] M. D. Vose.
A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 17(9):972-975, 1991.
[43] B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, 2016.
[44] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.
[45] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, 2018.