{"title": "Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 459, "abstract": "We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.", "full_text": "Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning\n\nFrancis Bach\nINRIA - Sierra Project-team\nEcole Normale Sup\u00e9rieure, Paris, France\nfrancis.bach@ens.fr\n\nEric Moulines\nLTCI\nTelecom ParisTech, Paris, France\neric.moulines@enst.fr\n\nAbstract\n\nWe consider the minimization of a convex objective function defined on a Hilbert space,\nwhich is only available through unbiased estimates of its gradients.
This problem includes\nstandard machine learning algorithms such as kernel logistic regression and\nleast-squares regression, and is commonly referred to as a stochastic approximation\nproblem in the operations research community. We provide a non-asymptotic analysis\nof the convergence of two well-known algorithms, stochastic gradient descent\n(a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are\naveraged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate\nproportional to the inverse of the number of iterations, while leading to the optimal\nconvergence rate in the strongly convex case, is not robust to the lack of strong convexity or\nthe setting of the proportionality constant. This situation is remedied when using slower\ndecays together with averaging, robustly leading to the optimal rate of convergence. We\nillustrate our theoretical results with simulations on synthetic and standard datasets.\n\n1 Introduction\nThe minimization of an objective function which is only available through unbiased estimates of\nthe function values or its gradients is a key methodological problem in many disciplines. Its analysis\nhas been attacked mainly in three communities: stochastic approximation [1, 2, 3, 4, 5, 6],\noptimization [7, 8], and machine learning [9, 10, 11, 12, 13, 14, 15]. The main algorithms which\nhave emerged are stochastic gradient descent (a.k.a. Robbins-Monro algorithm), as well as a simple\nmodification where iterates are averaged (a.k.a. Polyak-Ruppert averaging).\nTraditional results from stochastic approximation rely on strong convexity and asymptotic analysis,\nbut have made clear that a learning rate proportional to the inverse of the number of iterations, while\nleading to the optimal convergence rate in the strongly convex case, is not robust to the wrong setting\nof the proportionality constant.
On the other hand, using slower decays together with averaging\nrobustly leads to optimal convergence behavior (both in terms of rates and constants) [4, 5].\nThe analysis from the convex optimization and machine learning literatures however has focused\non differences between strongly convex and non-strongly convex objectives, with learning rates and\nroles of averaging being different in these two cases [11, 12, 13, 14, 15].\nA key desirable behavior of an optimization method is to be adaptive to the hardness of the problem,\nand thus one would like a single algorithm to work in all situations, favorable ones such as strongly\nconvex functions and unfavorable ones such as non-strongly convex functions. In this paper, we\nunify the two types of analysis and show that (1) a learning rate proportional to the inverse of the\nnumber of iterations is not suitable because it is not robust to the setting of the proportionality\nconstant and the lack of strong convexity, (2) the use of averaging with slower decays allows (close\nto) optimal rates in all situations.\nMore precisely, we make the following contributions:\n\n\u2212 We provide a direct non-asymptotic analysis of stochastic gradient descent in a machine learning\ncontext (observations of real random functions defined on a Hilbert space) that includes\nkernel least-squares regression and logistic regression (see Section 2), with strong convexity\nassumptions (Section 3) and without (Section 4).\n\n\u2212 We provide a non-asymptotic analysis of Polyak-Ruppert averaging [4, 5], with and without\nstrong convexity (Sections 3.3 and 4.2). In particular, we show that slower decays of the\nlearning rate, together with averaging, are crucial to robustly obtain fast convergence rates.\n\n\u2212 We illustrate our theoretical results through experiments on synthetic and non-synthetic examples\nin Section 5.\n\nNotation. We consider a Hilbert space H with a scalar product \u27e8\u00b7,\u00b7\u27e9.
We denote by \u2016 \u00b7 \u2016 the\nassociated norm and use the same notation for the operator norm on bounded linear operators from\nH to H, defined as \u2016A\u2016 = sup_{\u2016x\u2016\u22641} \u2016Ax\u2016 (if H is a Euclidean space, then \u2016A\u2016 is the largest\nsingular value of A). We also use the notation \u201cw.p.1\u201d to mean \u201cwith probability one\u201d. We denote\nby E the expectation or conditional expectation with respect to the underlying probability space.\n\n2 Problem set-up\nWe consider a sequence of convex differentiable random functions (fn)_{n\u22651} from H to R. We consider\nthe following recursion, starting from \u03b80 \u2208 H:\n\n$$\\forall n \\geq 1, \\quad \\theta_n = \\theta_{n-1} - \\gamma_n f_n'(\\theta_{n-1}), \\tag{1}$$\n\nwhere (\u03b3n)_{n\u22651} is a deterministic sequence of positive scalars, which we refer to as the learning\nrate sequence. The function fn is assumed to be differentiable (see, e.g., [16] for definitions and\nproperties of differentiability for functions defined on Hilbert spaces), and its gradient is an unbiased\nestimate of the gradient of a certain function f we wish to minimize:\n(H1) Let (Fn)_{n\u22650} be an increasing family of \u03c3-fields. \u03b80 is F0-measurable, and for each \u03b8 \u2208 H,\nthe random variable f\u2032n(\u03b8) is square-integrable, Fn-measurable and\n\n$$\\forall \\theta \\in \\mathcal{H}, \\ \\forall n \\geq 1, \\quad \\mathbb{E}\\big(f_n'(\\theta) \\mid \\mathcal{F}_{n-1}\\big) = f'(\\theta), \\quad \\text{w.p.1}. \\tag{2}$$\n\nFor an introduction to martingales, \u03c3-fields, and conditional expectations, see, e.g., [17]. Note that\ndepending on whether F0 is a trivial \u03c3-field or not, \u03b80 may be random or not. Moreover, we could\nrestrict Eq. (2) to be satisfied only for \u03b8n\u22121 and \u03b8\u2217 (which is a global minimizer of f).\nGiven only the noisy gradients f\u2032n(\u03b8n\u22121), the goal of stochastic approximation is to minimize the\nfunction f with respect to \u03b8.
Our assumptions include two usual situations, but also include many\nothers (e.g., potentially, active learning):\n\n\u2212 Stochastic approximation: in the so-called Robbins-Monro setting, for all \u03b8 \u2208 H and n \u2265 1,\nfn(\u03b8) may be expressed as fn(\u03b8) = f(\u03b8) + \u27e8\u03b5n, \u03b8\u27e9, where (\u03b5n)_{n\u22651} is a square-integrable martingale\ndifference (i.e., such that E(\u03b5n|Fn\u22121) = 0), which corresponds to a noisy observation\nf\u2032(\u03b8n\u22121) + \u03b5n of the gradient f\u2032(\u03b8n\u22121).\n\u2212 Learning from i.i.d. observations: for all \u03b8 \u2208 H and n \u2265 1, fn(\u03b8) = \u2113(\u03b8, zn) where (zn) is an\ni.i.d. sequence of observations in a measurable space Z and \u2113 : H \u00d7 Z \u2192 R is a loss function. Then\nf(\u03b8) is the generalization error of the predictor defined by \u03b8. Classical examples are least-squares\nor logistic regression (linear or non-linear through kernel methods [18, 19]), where\nfn(\u03b8) = (1/2)(\u27e8xn, \u03b8\u27e9 \u2212 yn)^2, or fn(\u03b8) = log[1 + exp(\u2212yn \u27e8xn, \u03b8\u27e9)], for xn \u2208 H, and yn \u2208 R\n(or {\u22121, 1} for logistic regression).\n\nThroughout this paper, unless otherwise stated, we assume that each function fn is convex and\nsmooth, following the traditional definition of smoothness from the optimization literature, i.e.,\nLipschitz-continuity of the gradients (see, e.g., [20]).
However, we make two slightly different\nassumptions: (H2) where the function \u03b8 \u21a6 E(f\u2032n(\u03b8)|Fn\u22121) is Lipschitz-continuous in quadratic\nmean and a strengthening of this assumption, (H2\u2019) in which \u03b8 \u21a6 f\u2032n(\u03b8) is almost surely\nLipschitz-continuous.\n(H2) For each n \u2265 1, the function fn is almost surely convex, differentiable, and:\n\n$$\\forall n \\geq 1, \\ \\forall \\theta_1, \\theta_2 \\in \\mathcal{H}, \\quad \\mathbb{E}\\big(\\|f_n'(\\theta_1) - f_n'(\\theta_2)\\|^2 \\mid \\mathcal{F}_{n-1}\\big) \\leq L^2 \\|\\theta_1 - \\theta_2\\|^2, \\quad \\text{w.p.1}. \\tag{3}$$\n\n(H2\u2019) For each n \u2265 1, the function fn is almost surely convex, differentiable with Lipschitz-continuous\ngradient f\u2032n, with constant L, that is:\n\n$$\\forall n \\geq 1, \\ \\forall \\theta_1, \\theta_2 \\in \\mathcal{H}, \\quad \\|f_n'(\\theta_1) - f_n'(\\theta_2)\\| \\leq L \\|\\theta_1 - \\theta_2\\|, \\quad \\text{w.p.1}. \\tag{4}$$\n\nIf fn is twice differentiable, this corresponds to having the operator norm of the Hessian operator\nof fn bounded by L. For least-squares or logistic regression, if we assume that (E\u2016xn\u2016^4)^{1/4} \u2264\nR for all n \u2208 N, then we may take L = R^2 (or even L = R^2/4 for logistic regression) for\nassumption (H2), while for assumption (H2\u2019), we need to have an almost sure bound \u2016xn\u2016 \u2264 R.\n\n3 Strongly convex objectives\n\nIn this section, following [21], we make the additional assumption of strong convexity of f, but not\nof all functions fn (see [20] for definitions and properties of such functions):\n(H3) The function f is strongly convex with respect to the norm \u2016\u00b7\u2016, with convexity constant \u00b5 > 0.\nThat is, for all \u03b81, \u03b82 \u2208 H, f(\u03b81) \u2265 f(\u03b82) + \u27e8f\u2032(\u03b82), \u03b81 \u2212 \u03b82\u27e9 + (\u00b5/2)\u2016\u03b81 \u2212 \u03b82\u2016^2.\n\nNote that (H3) simply needs to be satisfied for \u03b82 = \u03b8\u2217 being the unique global minimizer of f\n(such that f\u2032(\u03b8\u2217) = 0).
In the context of machine learning (least-squares or logistic regression),\nassumption (H3) is satisfied as soon as (\u00b5/2)\u2016\u03b8\u2016^2 is used as an additional regularizer. For all strongly\nconvex losses (e.g., least-squares), it is also satisfied as soon as the expectation E(xn \u2297 xn) is\ninvertible. Note that this implies that the problem is finite-dimensional; otherwise, the expectation\nis a compact covariance operator, and hence non-invertible (see, e.g., [22] for an introduction to\ncovariance operators). For non-strongly convex losses such as the logistic loss, f can never be\nstrongly convex unless we restrict the domain of \u03b8 (which we do in Section 3.2). Alternatively to\nrestricting the domain, replacing the logistic loss u \u21a6 log(1 + e^{\u2212u}) by u \u21a6 log(1 + e^{\u2212u}) + \u03b5u^2/2,\nfor some small \u03b5 > 0, makes it strongly convex in low-dimensional settings.\nBy strong convexity of f, if we assume (H3), then f attains its global minimum at a unique vector\n\u03b8\u2217 \u2208 H such that f\u2032(\u03b8\u2217) = 0. Moreover, we make the following assumption (in the context of\nstochastic approximation, it corresponds to E(\u2016\u03b5n\u2016^2|Fn\u22121) \u2264 \u03c3^2):\n(H4) There exists \u03c3^2 \u2208 R+ such that for all n \u2265 1, E(\u2016f\u2032n(\u03b8\u2217)\u2016^2|Fn\u22121) \u2264 \u03c3^2, w.p.1.\n\n3.1 Stochastic gradient descent\nBefore stating our first theorem (see proof in [23]), we introduce the following family of functions\n\u03d5\u03b2 : R+ \\ {0} \u2192 R given by:\n\n$$\\varphi_\\beta(t) = \\begin{cases} \\dfrac{t^\\beta - 1}{\\beta} & \\text{if } \\beta \\neq 0, \\\\ \\log t & \\text{if } \\beta = 0. \\end{cases}$$\n\nThe function \u03b2 \u21a6 \u03d5\u03b2(t) is continuous for all t > 0. Moreover, for \u03b2 > 0, \u03d5\u03b2(t) < t^\u03b2/\u03b2, while for\n\u03b2 < 0, we have \u03d5\u03b2(t) < 1/(\u2212\u03b2) (both with asymptotic equality when t is large).\n\nTheorem 1 (Stochastic gradient descent, strong convexity) Assume (H1,H2,H3,H4).
Denote\n\u03b4n = E\u2016\u03b8n \u2212 \u03b8\u2217\u2016^2, where \u03b8n \u2208 H is the n-th iterate of the recursion in Eq. (1), with \u03b3n = Cn^{\u2212\u03b1}.\nWe have, for \u03b1 \u2208 [0, 1]:\n\n$$\\delta_n \\leq \\begin{cases} 2 \\exp\\big(4L^2C^2\\varphi_{1-2\\alpha}(n)\\big) \\exp\\Big(-\\dfrac{\\mu C}{4} n^{1-\\alpha}\\Big) \\Big(\\delta_0 + \\dfrac{\\sigma^2}{L^2}\\Big) + \\dfrac{4C\\sigma^2}{\\mu n^\\alpha}, & \\text{if } 0 \\leq \\alpha < 1, \\\\ \\dfrac{\\exp(2L^2C^2)}{n^{\\mu C}} \\Big(\\delta_0 + \\dfrac{\\sigma^2}{L^2}\\Big) + 2\\sigma^2C^2 \\dfrac{\\varphi_{\\mu C/2-1}(n)}{n^{\\mu C/2}}, & \\text{if } \\alpha = 1. \\end{cases} \\tag{5}$$\n\nSketch of proof. Under our assumptions, it can be shown that (\u03b4n) satisfies the following recursion:\n\n$$\\delta_n \\leq (1 - 2\\mu\\gamma_n + 2L^2\\gamma_n^2)\\,\\delta_{n-1} + 2\\sigma^2\\gamma_n^2. \\tag{6}$$\n\nNote that it also appears in [3, Eq. (2)] under different assumptions. Using this deterministic recursion,\nwe then derive bounds using classical techniques from stochastic approximation [2], but in a\nnon-asymptotic way, by deriving explicit upper-bounds.\nRelated work. To the best of our knowledge, this non-asymptotic bound, which depends explicitly\nupon the parameters of the problem, is novel (see [1, Theorem 1, Electronic companion paper] for a\nsimpler bound with no such explicit dependence). It shows in particular that there is convergence in\nquadratic mean for any \u03b1 \u2208 (0, 1]. Previous results from the stochastic approximation literature have\nfocused mainly on almost sure convergence of the sequence of iterates. Almost-sure convergence\nrequires that \u03b1 > 1/2, with counter-examples for \u03b1 < 1/2 (see, e.g., [2] and references therein).\nBound on function values. The bounds above imply a corresponding bound on the function\nvalues. Indeed, under assumption (H2), it may be shown that E[f(\u03b8n) \u2212 f(\u03b8\u2217)] \u2264 (L/2)\u03b4n (see proof\nin [23]).\nTightness for quadratic functions.
Since the deterministic recursion in Eq. (6) is an equality for\nquadratic functions fn, the result in Eq. (5) is optimal (up to constants). Moreover, our results are\nconsistent with the asymptotic results from [6].\nForgetting initial conditions. Bounds depend on the initial condition \u03b40 = E[\u2016\u03b80 \u2212 \u03b8\u2217\u2016^2] and the\nvariance \u03c3^2 of the noise term. The initial condition is forgotten sub-exponentially fast for \u03b1 \u2208 (0, 1),\nbut not for \u03b1 = 1. For \u03b1 < 1, the asymptotic term in the bound is 4C\u03c3^2/(\u00b5n^\u03b1).\nBehavior for \u03b1 = 1. For \u03b1 = 1, we have \u03d5_{\u00b5C/2\u22121}(n)/n^{\u00b5C/2} \u2264 (1/(\u00b5C/2\u22121))(1/n) if C\u00b5 > 2,\n\u03d5_{\u00b5C/2\u22121}(n)/n^{\u00b5C/2} = (log n)/n if C\u00b5 = 2, and \u03d5_{\u00b5C/2\u22121}(n)/n^{\u00b5C/2} \u2264 (1/(1\u2212\u00b5C/2))(1/n^{\u00b5C/2})\nif C\u00b5 < 2. Therefore, for \u03b1 = 1, the choice of C is\ncritical, as already noticed by [8]: too small C leads to convergence at an arbitrarily slow rate of the\nform n^{\u2212\u00b5C/2}, while too large C leads to explosion due to the initial condition. This behavior is\nconfirmed in simulations in Section 5.\nSetting C too large. There is a potentially catastrophic term when C is chosen too large, i.e.,\nexp(4L^2C^2\u03d5_{1\u22122\u03b1}(n)), which leads to an increasing bound when n is small. Note that for \u03b1 < 1,\nthis catastrophic term is in front of a sub-exponentially decaying factor, so its effect is mitigated\nonce the term in n^{1\u2212\u03b1} takes over \u03d5_{1\u22122\u03b1}(n), and the transient term stops increasing. Moreover, the\nasymptotic term is not involved in it (which is also observed in simulations in Section 5).\nMinimax rate. Note finally that the asymptotic convergence rate in O(n^{\u22121}) matches the optimal\nasymptotic minimax rate for stochastic approximation [24, 25].
Note that there is no explicit dependence\non dimension; this dependence is implicit in the definition of the constants \u00b5 and L.\n\n3.2 Bounded gradients\n\nIn some cases such as logistic regression, we also have a uniform upper-bound on the gradients, i.e.,\nwe assume (note that in Theorem 2, this assumption replaces both (H2) and (H4)).\n\n(H5) For each n \u2265 1, almost surely, the function fn is convex, differentiable and has gradients\nuniformly bounded by B on the ball of center 0 and radius D, i.e., for all \u03b8 \u2208 H and all n > 0,\n\u2016\u03b8\u2016 \u2264 D \u21d2 \u2016f\u2032n(\u03b8)\u2016 \u2264 B.\n\nNote that no function may be strongly convex and Lipschitz-continuous (i.e., with uniformly\nbounded gradients) over the entire Hilbert space H. Moreover, if (H2\u2019) is satisfied, then we may take\nD = \u2016\u03b8\u2217\u2016 and B = LD. The next theorem shows that with a slight modification of the recursion\nin Eq. (1), we get simpler bounds than the ones obtained in Theorem 1, obtaining a result which\nalready appeared in a simplified form [8] (see proof in [23]):\n\nTheorem 2 (Stochastic gradient descent, strong convexity, bounded gradients) Assume\n(H1,H3,H5). Denote \u03b4n = E[\u2016\u03b8n \u2212 \u03b8\u2217\u2016^2], where \u03b8n \u2208 H is the n-th iterate of the following\nrecursion:\n\n$$\\forall n \\geq 1, \\quad \\theta_n = \\Pi_D\\big[\\theta_{n-1} - \\gamma_n f_n'(\\theta_{n-1})\\big], \\tag{7}$$\n\nwhere \u03a0D is the orthogonal projection operator on the ball {\u03b8 : \u2016\u03b8\u2016 \u2264 D}. Assume \u2016\u03b8\u2217\u2016 \u2264 D. If\n\u03b3n = Cn^{\u2212\u03b1}, we have, for \u03b1 \u2208 [0, 1]:\n\n$$\\delta_n \\leq \\begin{cases} \\big(\\delta_0 + B^2C^2\\varphi_{1-2\\alpha}(n)\\big) \\exp\\Big(-\\dfrac{\\mu C}{2} n^{1-\\alpha}\\Big) + \\dfrac{2B^2C^2}{\\mu n^\\alpha}, & \\text{if } \\alpha \\in [0, 1); \\\\ \\delta_0 n^{-\\mu C} + 2B^2C^2 n^{-\\mu C} \\varphi_{\\mu C-1}(n), & \\text{if } \\alpha = 1. \\end{cases} \\tag{8}$$\n\nThe proof follows the same lines as for Theorem 1, but with the deterministic recursion \u03b4n \u2264\n(1 \u2212 2\u00b5\u03b3n)\u03b4n\u22121 + B^2\u03b3n^2. Note that we obtain the same asymptotic terms as for Theorem 1 (but B\nreplaces \u03c3). Moreover, the bound is simpler (no explosive multiplicative factors), but it requires\nknowing D in advance, while Theorem 1 does not. Note that because we have only assumed\nLipschitz-continuity, we obtain a bound on function values of order O(n^{\u2212\u03b1/2}), which is sub-optimal. For\nbounds directly on function values, see [26].\n\n3.3 Polyak-Ruppert averaging\nWe now consider \u03b8\u0304n = (1/n) \u2211_{k=0}^{n\u22121} \u03b8k and, following [4, 5], we make extra assumptions regarding the\nsmoothness of each fn and the fourth-order moment of the driving noise:\n(H6) For each n \u2265 1, the function fn is almost surely twice differentiable with Lipschitz-continuous\nHessian operator f\u2032\u2032n, with Lipschitz constant M. That is, for all \u03b81, \u03b82 \u2208 H and for all n \u2265 1,\n\u2016f\u2032\u2032n(\u03b81) \u2212 f\u2032\u2032n(\u03b82)\u2016 \u2264 M\u2016\u03b81 \u2212 \u03b82\u2016, where \u2016 \u00b7 \u2016 is the operator norm.\nNote that (H6) needs only to be satisfied for \u03b82 = \u03b8\u2217. For least-squares regression, we have M = 0,\nwhile for logistic regression, we have M = R^3/4.\n(H7) There exists \u03c4 \u2208 R+, such that for each n \u2265 1, E(\u2016f\u2032n(\u03b8\u2217)\u2016^4|Fn\u22121) \u2264 \u03c4^4 almost surely.\nMoreover, there exists a nonnegative self-adjoint operator \u03a3 such that for all n, E(f\u2032n(\u03b8\u2217) \u2297\nf\u2032n(\u03b8\u2217)|Fn\u22121) \u227c \u03a3 almost surely.\nThe operator \u03a3 (which always exists as soon as \u03c4 is finite) is here to characterize precisely the\nvariance term, which will be independent of the learning rate sequence (\u03b3n), as we now show:\nTheorem 3 (Averaging, strong convexity) Assume (H1, H2\u2019, H3, H4, H6, H7). Then, for \u03b8\u0304n =\n(1/n) \u2211_{k=0}^{n\u22121} \u03b8k and \u03b1 \u2208 (0, 1), we have:\n\n$$\\begin{aligned} \\big(\\mathbb{E}\\|\\bar\\theta_n - \\theta^*\\|^2\\big)^{1/2} \\leq{} & \\frac{\\big[\\operatorname{tr} f''(\\theta^*)^{-1} \\Sigma f''(\\theta^*)^{-1}\\big]^{1/2}}{\\sqrt{n}} + \\frac{6\\sigma}{\\mu C^{1/2}} \\frac{1}{n^{1-\\alpha/2}} + \\frac{M C \\tau^2}{2\\mu^{3/2}} \\big(1 + (\\mu C)^{1/2}\\big) \\frac{\\varphi_{1-\\alpha}(n)}{n} \\\\ & + \\frac{4LC^{1/2}}{\\mu} \\frac{\\varphi_{1-\\alpha}(n)^{1/2}}{n} + \\frac{8A}{n\\mu^{1/2}} \\Big(\\frac{1}{C} + L\\Big) \\Big(\\delta_0 + \\frac{\\sigma^2}{L^2}\\Big)^{1/2} \\\\ & + \\frac{5MC^{1/2}\\tau}{2n\\mu} A \\exp\\big(24L^4C^4\\big) \\Big(\\delta_0 + \\frac{\\mu \\mathbb{E}\\big[\\|\\theta_0 - \\theta^*\\|^4\\big]}{20C\\tau^2} + 2\\tau^2C^3\\mu + 8\\tau^2C^2\\Big)^{1/2}, \\end{aligned} \\tag{9}$$\n\nwhere A is a constant that depends only on \u00b5, C, L and \u03b1.\nSketch of proof. Following [4], we start from Eq. (1), write it as f\u2032n(\u03b8n\u22121) = (1/\u03b3n)(\u03b8n\u22121 \u2212 \u03b8n), and\nnotice that (a) f\u2032n(\u03b8n\u22121) \u2248 f\u2032n(\u03b8\u2217) + f\u2032\u2032(\u03b8\u2217)(\u03b8n\u22121 \u2212 \u03b8\u2217), (b) f\u2032n(\u03b8\u2217) has zero mean and behaves\nlike an i.i.d. sequence, and (c) (1/n) \u2211_{k=1}^{n} (1/\u03b3k)(\u03b8k\u22121 \u2212 \u03b8k) turns out to be negligible owing to a\nsummation by parts and to the bound obtained in Theorem 1. This implies that \u03b8\u0304n \u2212 \u03b8\u2217 behaves like\n\u2212(1/n) \u2211_{k=1}^{n} f\u2032\u2032(\u03b8\u2217)^{\u22121} f\u2032k(\u03b8\u2217). Note that we obtain a bound on the root mean square error.\nForgetting initial conditions. There is no sub-exponential forgetting of initial conditions, but\nrather a decay at rate O(n^{\u22122}) (last two lines in Eq. (9)). This is a known problem which may\nslow down the convergence, a common practice being to start averaging after a certain number of\niterations [2]. Moreover, the constant A may be large when LC is large, thus the catastrophic terms\nare more problematic than for stochastic gradient descent, because they do not appear in front of\nsub-exponentially decaying terms (see [23]). This suggests taking CL small.\nAsymptotically leading term. When M > 0 and \u03b1 > 1/2, the asymptotic term for \u03b4n is independent\nof (\u03b3n) and of order O(n^{\u22121}). Thus, averaging allows one to go from the slow rate O(n^{\u2212\u03b1}) to the optimal\nrate O(n^{\u22121}). The next two leading terms (in the first line) have order O(n^{\u03b1\u22122}) and O(n^{\u22122\u03b1}),\nsuggesting the setting \u03b1 = 2/3 to make them equal. When M = 0 (quadratic functions), the leading\nterm has rate O(n^{\u22121}) for all \u03b1 \u2208 (0, 1) (with then a contribution of the first term in the second line).\nCase \u03b1 = 1. We get a simpler bound by directly averaging the bound in Theorem 1, which leads\nto an unchanged rate of n^{\u22121}, i.e., averaging is not key for \u03b1 = 1, and does not solve the robustness\nproblem related to the choice of C or the lack of strong convexity.\nLeading term independent of (\u03b3n).
The term in O(n^{\u22121}) does not depend on \u03b3n. Moreover, as noticed\nin the stochastic approximation literature [4], in the context of learning from i.i.d. observations,\nthis is exactly the Cram\u00e9r-Rao bound (see, e.g., [27]), and thus the leading term is asymptotically\noptimal. Note that no explicit Hessian inversion has been performed to achieve this bound.\nRelationship with prior work on online learning. There is no clear way of adding a bounded\ngradient assumption in the general case \u03b1 \u2208 (0, 1), because the proof relies on the recursion without\nprojections, but for \u03b1 = 1, the rate of O(n^{\u22121}) (up to a logarithmic term) can be achieved in the\nmore general framework of online learning, where averaging is key to deriving bounds for stochastic\napproximation from regret bounds. Moreover, bounds are obtained in high probability rather than\nsimply in quadratic mean (see, e.g., [11, 12, 13, 14, 15]).\n\n4 Non-strongly convex objectives\nIn this section, we do not assume that the function f is strongly convex, but we replace (H3) by:\n(H8) The function f attains its global minimum at a certain \u03b8\u2217 \u2208 H (which may not be unique).\nIn the machine learning scenario, this essentially implies that the best predictor is in the function\nclass we consider.1 In the following theorem, since \u03b8\u2217 is not unique, we only derive a bound on\nfunction values.
Not assuming strong convexity is essential in practice to make sure that algorithms\nare robust and adaptive to the hardness of the learning or optimization problem (much like gradient\ndescent is).\n\n4.1 Stochastic gradient descent\nThe following theorem is shown in a similar way to Theorem 1; we first derive a deterministic recursion,\nwhich we analyze with novel tools compared to the non-stochastic case (see details in [23]),\nobtaining new convergence rates for non-averaged stochastic gradient descent:\nTheorem 4 (Stochastic gradient descent, no strong convexity) Assume (H1,H2\u2019,H4,H8). Then,\nif \u03b3n = Cn^{\u2212\u03b1}, for \u03b1 \u2208 [1/2, 1], we have:\n\n$$\\mathbb{E}\\big[f(\\theta_n) - f(\\theta^*)\\big] \\leq \\frac{1}{C} \\Big(\\delta_0 + \\frac{\\sigma^2}{L^2}\\Big) \\exp\\big(4L^2C^2\\varphi_{1-2\\alpha}(n)\\big) \\frac{1 + 4L^{3/2}C^{3/2}}{\\min\\{\\varphi_{1-\\alpha}(n), \\varphi_{\\alpha/2}(n)\\}}. \\tag{10}$$\n\nWhen \u03b1 = 1/2, the bound goes to zero only when LC < 1/4, at rates which can be arbitrarily\nslow. For \u03b1 \u2208 (1/2, 2/3), we get convergence at rate O(n^{\u2212\u03b1/2}), while for \u03b1 \u2208 (2/3, 1), we get a\nconvergence rate of O(n^{\u03b1\u22121}). For \u03b1 = 1, the upper bound is of order O((log n)^{\u22121}), which may be\nvery slow (but still convergent). The rate of convergence changes at \u03b1 = 2/3, where we get our best\nrate O(n^{\u22121/3}), which does not match the minimax rate of O(n^{\u22121/2}) for stochastic approximation\nin the non-strongly convex case [25]. These rates for stochastic gradient descent without strong\nconvexity assumptions are new and we conjecture that they are asymptotically minimax optimal (for\nstochastic gradient descent, not for stochastic approximation).
Nevertheless, the proof of this result\nfalls out of the scope of this paper.\nIf we further assume that we have all gradients bounded by B (that is, we assume D = \u221e in (H5)),\nthen, we have the following theorem, which allows \u03b1 \u2208 (1/3, 1/2) with rate O(n\u22123\u03b1/2+1/2):\nTheorem 5 (Stochastic gradient descent, no strong convexity, bounded gradients) Assume\n(H1, H2\u2019, H5, H8). Then, if \u03b3n = Cn\u2212\u03b1, for \u03b1 \u2208 [1/3, 1], we have:\nE [f (\u03b8n) \u2212 f (\u03b8\u2217)] 6 \uf8f1\uf8f2\n(cid:0)\u03b40 + B2C2\u03d51\u22122\u03b1(n)(cid:1)\nC (\u03b40 + B2C2)1/2\n\uf8f3\n\nif \u03b1 \u2208 [1/2, 1],\nif \u03b1 \u2208 [1/3, 1/2].\n\nC min{\u03d51\u2212\u03b1(n),\u03d5\u03b1/2(n)} ,\n\n(1\u22122\u03b1)1/2\u03d53\u03b1/2\u22121/2(n) ,\n\n(1+4L1/2BC 3/2)\n\n1+4L1/2C 1/2\n\n4.2 Polyak-Ruppert averaging\nAveraging in the context of non-strongly convex functions has been studied before, in particular in\nthe optimization and machine learning literature, and the following theorems are similar in spirit to\nearlier work [7, 8, 13, 14, 15]:\nTheorem 6 (averaging, no strong convexity) Assume (H1,H2\u2019,H4,H8). Then, if \u03b3n = Cn\u2212\u03b1, for\n\u03b1 \u2208 [1/2, 1], we have\n1\nE(cid:2)f (\u00af\u03b8n) \u2212 f (\u03b8\u2217)(cid:3) 6\n\nL2(cid:17) exp(cid:0)2L2C2\u03d51\u22122\u03b1(n)(cid:1)\n\nh1+(2LC)1+ 1\n\nC(cid:16)\u03b40 +\n\n\u03b1i+\n\n\u03c32C\n2n\n\n\u03d51\u2212\u03b1(n).\n\nn1\u2212\u03b1\n\n(12)\n\n(11)\n\n\u03c32\n\n2\n\nIf \u03b1 = 1/2, then we only have convergence under LC < 1/4 (as in Theorem 4), with potentially\nslow rate, while for \u03b1 > 1/2, we have a rate of O(n\u2212\u03b1), with otherwise similar behavior than for\nthe strongly convex case with no bounded gradients. 
Here, averaging has allowed the rate to go from\nO(max{n^{\u03b1\u22121}, n^{\u2212\u03b1/2}}) to O(n^{\u2212\u03b1}).\n\n1 For least-squares regression with kernels, where fn(\u03b8) = (1/2)(yn \u2212 \u27e8\u03b8, \u03a6(xn)\u27e9)^2, with \u03a6(xn) being the\nfeature map associated with a reproducing kernel Hilbert space H with universal kernel [28], then we need that\nx \u21a6 E(Y|X = x) is a function within the RKHS. Taking care of situations where this is not true is clearly of\nimportance but out of the scope of this paper.\n\n[Figure 1 plots: two panels titled \u201cpower 2\u201d and \u201cpower 4\u201d; x-axis log(n), y-axis log[f(\u03b8n) \u2212 f\u2217]; curves sgd and ave for \u03b1 = 1/3, 1/2, 2/3, 1.]\n\nFigure 1: Robustness to lack of strong convexity for different learning rates and stochastic gradient\n(sgd) and Polyak-Ruppert averaging (ave). From left to right: f(\u03b8) = |\u03b8|^2 and f(\u03b8) = |\u03b8|^4 (between\n\u22121 and 1, affine outside of [\u22121, 1], continuously differentiable).
See text for details.\n\n[Figure 2 plots: two panels titled \u201c\u03b1 = 1/2\u201d and \u201c\u03b1 = 1\u201d; x-axis log(n), y-axis log[f(\u03b8n) \u2212 f\u2217]; curves sgd and ave for C = 1/5, 1, 5.]\n\nFigure 2: Robustness to wrong constants for \u03b3n = Cn^{\u2212\u03b1}. Left: \u03b1 = 1/2, right: \u03b1 = 1. See text for\ndetails. Best seen in color.\n\nTheorem 7 (averaging, no strong convexity, bounded gradients) Assume (H1,H5,H8). If \u03b3n =\nCn^{\u2212\u03b1}, for \u03b1 \u2208 [0, 1], we have\n\n$$\\mathbb{E}\\big[f(\\bar\\theta_n) - f(\\theta^*)\\big] \\leq \\frac{n^{\\alpha-1}}{2C} \\big(\\delta_0 + C^2B^2\\varphi_{1-2\\alpha}(n)\\big) + \\frac{B^2}{2n} \\varphi_{1-\\alpha}(n). \\tag{13}$$\n\nWith the bounded gradient assumption (and in fact without smoothness), we obtain the minimax\nasymptotic rate O(n^{\u22121/2}) up to logarithmic terms [25] for \u03b1 = 1/2, and, for \u03b1 < 1/2, the rate\nO(n^{\u2212\u03b1}) while for \u03b1 > 1/2, we get O(n^{\u03b1\u22121}). Here, averaging has also extended the\nrange of \u03b1 which ensures convergence to \u03b1 \u2208 (0, 1).\n\n5 Experiments\nRobustness to lack of strong convexity. Define f : R \u2192 R as |\u03b8|^q for |\u03b8| \u2264 1 and extended into\na continuously differentiable function, affine outside of [\u22121, 1]. For all q > 1, we have a convex\nfunction with Lipschitz-continuous gradient with constant L = q(q\u22121). It is strongly convex around\nthe origin for q \u2208 (1, 2], but its second derivative vanishes for q > 2. In Figure 1, we plot in log-log\nscale the average of f(\u03b8n) \u2212 f(\u03b8\u2217) over 100 replications of the stochastic approximation problem\n(with i.i.d. Gaussian noise of standard deviation 4 added to the gradient).
For q = 2 (left plot), where we locally have a strongly convex case, all learning rates lead to good estimation with decay proportional to \u03b1 (as shown in Theorem 1), while for the averaging case, all reach the exact same convergence rate (as shown in Theorem 3). However, for q = 4, where strong convexity does not hold (right plot), \u03b1 = 1 is still the fastest without averaging but becomes the slowest after averaging; on the contrary, illustrating Section 4, slower decays (such as \u03b1 = 1/2) lead to faster convergence when averaging is used. Note also the reduction in variability for the averaged iterates.\n\nRobustness to wrong constants. We consider the function f on the real line defined as f(\u03b8) = (1/2)|\u03b8|^2, with standard i.i.d. Gaussian noise on the gradients. In Figure 2, we plot the average performance over 100 replications, for various values of C and \u03b1. Note that for \u03b1 = 1/2 (left plot), the three curves for stochastic gradient descent end up being aligned and equally spaced, corroborating a rate proportional to C (see Theorem 1). Moreover, when averaging for \u03b1 = 1/2, the error ends up being independent of C and \u03b1 (see Theorem 3). Finally, when C is too large, there is an explosion (up to 10^5), hinting at the potential instability of having C too large. For \u03b1 = 1 (right plot), if C is too small, convergence is very slow (and not at the rate n^{\u22121}), as already observed (see, e.g., [8, 6]).\n\n[Figure 3 plots omitted: two log-log panels, each titled Selecting rate after n/10 iterations; x-axis log(n), y-axis log[f(\u03b8_n) \u2212 f\u2217]; curves sgd and ave for \u03b1 = 1/3, 1/2, 2/3, 1.]\n\nFigure 3: Comparison on non-strongly convex logistic regression problems. Left: synthetic example, right: \u201calpha\u201d dataset. See text for details. 
Best seen in color.\n\nMedium-scale experiments with linear logistic regression. We consider two situations where H = R^p: (a) the \u201calpha\u201d dataset from the Pascal large scale learning challenge (http://largescale.ml.tu-berlin.de/), for which p = 500 and n = 50000, and (b) a synthetic example where p = 100 and n = 100000; we generate the input data i.i.d. from a multivariate Gaussian distribution with mean zero and a covariance matrix sampled from a Wishart distribution with p degrees of freedom (thus with potentially bad condition number), and the output is obtained through a classification by a random hyperplane. For different values of \u03b1, we choose C adaptively, selecting the value with the lowest test error after n/10 iterations, and report results in Figure 3. In experiments reported in [23], we also consider C equal to 1/L, as suggested by our analysis to avoid large constants; the convergence speed is then very slow, suggesting that our global bounds involving the Lipschitz constants may be locally far too pessimistic and that designing a truly adaptive sequence (\u03b3_n) instead of a fixed one is a fruitful avenue for future research.\n\n6 Conclusion\n\nIn this paper, we have provided a non-asymptotic analysis of stochastic gradient descent, as well as its averaged version, for various learning rate sequences of the form \u03b3_n = Cn^{\u2212\u03b1} (see summary of results in Table 1). 
Following earlier work from the optimization, machine learning and stochastic approximation literatures, our analysis highlights that \u03b1 = 1 is not robust to the choice of C and to the actual difficulty of the problem (strongly convex or not). However, when using averaging with \u03b1 \u2208 (1/2, 1), we get close to optimal rates of convergence in both strongly convex and non-strongly convex situations. Moreover, we highlight the fact that problems with bounded gradients behave better, i.e., logistic regression is easier to optimize than least-squares regression.\n\nOur work can be extended in several ways. First, we have focused on results in quadratic mean, and we expect that some of our results can be extended to results in high probability (along the lines of [13, 3]). Second, we have focused on differentiable objectives, but the extension to objective functions with a differentiable stochastic part and a non-differentiable deterministic part (along the lines of [14]) would allow an extension to sparse methods.\n\nAcknowledgements. Francis Bach was partially supported by the European Research Council (SIERRA Project). We thank Mark Schmidt and Nicolas Le Roux for helpful discussions.\n\n\u03b1 | SGD (\u00b5, L) | SGD (\u00b5, B) | SGD (L) | SGD (L, B) | Aver. (\u00b5, L) | Aver. (L) | Aver. (B)\n(0, 1/3) | \u03b1 | \u03b1 | \u00d7 | \u00d7 | 2\u03b1 | \u00d7 | \u03b1\n(1/3, 1/2) | \u03b1 | \u03b1 | \u00d7 | (3\u03b1 \u2212 1)/2 | 2\u03b1 | \u00d7 | \u03b1\n(1/2, 2/3) | \u03b1 | \u03b1 | \u03b1/2 | \u03b1/2 | 1 | 1 \u2212 \u03b1 | 1 \u2212 \u03b1\n(2/3, 1) | \u03b1 | \u03b1 | 1 \u2212 \u03b1 | 1 \u2212 \u03b1 | 1 | 1 \u2212 \u03b1 | 1 \u2212 \u03b1\n\nTable 1: Summary of results. For stochastic gradient descent (SGD) or Polyak-Ruppert averaging (Aver.), we provide rates of convergence of the form n^{\u2212\u03b2} corresponding to learning rate sequences \u03b3_n = Cn^{\u2212\u03b1}, where \u03b2 is shown as a function of \u03b1. 
For each method, we list the main assumptions (\u00b5: strong convexity, L: bounded Hessian, B: bounded gradients).\n\nReferences\n\n[1] M. N. Broadie, D. M. Cicek, and A. Zeevi. General bounds and finite-time improvement for stochastic approximation algorithms. Technical report, Columbia University, 2009.\n\n[2] H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications. Springer-Verlag, second edition, 2003.\n\n[3] O. Yu. Kul\u2032chitski\u012d and A. \u00c8. Mozgovo\u012d. An estimate for the rate of convergence of recurrent robust identification algorithms. Kibernet. i Vychisl. Tekhn., 89:36\u201339, 1991.\n\n[4] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838\u2013855, 1992.\n\n[5] D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.\n\n[6] V. Fabian. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327\u20131332, 1968.\n\n[7] Y. Nesterov and J. P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559\u20131568, 2008.\n\n[8] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574\u20131609, 2009.\n\n[9] L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2):137\u2013151, 2005.\n\n[10] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 20, 2008.\n\n[11] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proc. ICML, 2008.\n\n[12] S. Shalev-Shwartz, Y. Singer, and N. Srebro. 
Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. ICML, 2007.\n\n[13] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Conference on Learning Theory (COLT), 2009.\n\n[14] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 9:2543\u20132596, 2010.\n\n[15] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899\u20132934, 2009.\n\n[16] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2006.\n\n[17] R. Durrett. Probability: theory and examples. Duxbury Press, third edition, 2004.\n\n[18] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.\n\n[19] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.\n\n[20] Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004.\n\n[21] K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. Advances in Neural Information Processing Systems, 22, 2008.\n\n[22] N. N. Vakhania, V. I. Tarieladze, and S. A. Chobanyan. Probability distributions on Banach spaces. Reidel, 1987.\n\n[23] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Technical Report 00608041, HAL, 2011.\n\n[24] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley & Sons, 1983.\n\n[25] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization, 2010. Tech. report, Arxiv 1009.0571.\n\n[26] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. 
In Proc. COLT, 2001.\n\n[27] G. Casella and R. L. Berger. Statistical Inference. Duxbury Press, 2001.\n\n[28] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67\u201393, 2002.\n", "award": [], "sourceid": 340, "authors": [{"given_name": "Eric", "family_name": "Moulines", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}