{"title": "Byzantine Stochastic Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 4613, "page_last": 4623, "abstract": "This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of $m$ machines which allegedly compute stochastic gradients every iteration, an $\\alpha$-fraction are Byzantine, and may behave adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\\varepsilon$-approximate minimizers of convex functions in $T = \\tilde{O}\\big( \\frac{1}{\\varepsilon^2 m} + \\frac{\\alpha^2}{\\varepsilon^2} \\big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\\big( \\frac{1}{\\varepsilon^2 m} \\big)$ iterations, but cannot tolerate Byzantine failures.\nFurther, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sample complexity and time complexity.", "full_text": "Byzantine Stochastic Gradient Descent\n\nDan Alistarh\u21e4\nIST Austria\n\nZeyuan Allen-Zhu\u21e4\nMicrosoft Research AI\n\nJerry Li\u21e4\n\nSimons Institute\n\ndan.alistarh@ist.ac.at\n\nzeyuan@csail.mit.edu\n\njerryzli@berkeley.edu\n\nAbstract\n\nThis paper studies the problem of distributed stochastic optimization in an ad-\nversarial setting where, out of m machines which allegedly compute stochastic\ngradients every iteration, an \u21b5-fraction are Byzantine, and may behave adversari-\nally. Our main result is a variant of stochastic gradient descent (SGD) which \ufb01nds\n\n\"-approximate minimizers of convex functions in T = eO 1\nIn contrast, traditional mini-batch SGD needs T = O 1\n\ntolerate Byzantine failures. 
Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sample complexity and time complexity.

1 Introduction

Machine learning applications are becoming increasingly decentralized, either because data is naturally distributed, as in applications such as federated learning [17], or because data is partitioned across machines to parallelize computation, e.g. [2]. Fault-tolerance is a critical concern in such distributed settings. Machines in a data center may crash, or fail in unpredictable ways; even worse, in some settings one must be able to tolerate a fraction of adversarial/faulty workers sending corrupted or even malicious data. This Byzantine failure model, in which a small fraction of bad workers are allowed to behave arbitrarily, has a rich history in distributed computing [19]. By contrast, the design of machine learning algorithms which are robust to such Byzantine failures is a relatively recent topic, but is rapidly becoming a major research direction at the intersection of machine learning, distributed computing, and security.
We measure algorithms in this setting against two fundamental criteria: sample complexity, which requires high accuracy from few data samples, and computational complexity, i.e. preserving the runtime speedups achieved by distributing computation. These criteria should hold even under adversarial conditions. Another important consideration in the design of these algorithms is that they should remain useful in high dimensions.
System Model. We study stochastic optimization in the Byzantine setting. We assume an unknown distribution $\mathcal{D}$ over functions from $\mathbb{R}^d$ to
$\mathbb{R}$, and wish to minimize $f(x) := \mathbb{E}_{s\sim\mathcal{D}}[f_s(x)]$.
We consider a standard setting with $m$ workers and a master (coordinator), where an $\alpha$-fraction of the workers may be Byzantine, with $\alpha < 1/2$. Each worker has access to $T$ sample functions from the distribution $\mathcal{D}$. We proceed in iterations, structured as follows: workers first perform some local computation, then synchronously send information to the master, which compiles the information and sends new information to the workers. At the end, the master should output an approximate minimizer of the function $f$.
While our negative results will apply to this general setting, our algorithms will be expressed in the standard framework of distributed stochastic gradient methods: in each iteration $k$, the master broadcasts the current iterate $x_k \in \mathbb{R}^d$ to the worker machines, and each worker is supposed to compute a stochastic gradient at $x_k$ and return it to the master. A good worker returns $\nabla f_s(x_k)$ for a random sample $s \sim \mathcal{D}$, but a Byzantine worker machine may adversarially return any vector. This stochastic optimization framework is general and very well studied, and captures many important problems such as regression, learning SVMs, logistic regression, and training deep neural networks. Traditional methods such as mini-batch stochastic gradient descent (SGD) are vulnerable to even a single Byzantine failure. Our results are presented in the master-worker distribution model, but can be generalized to a coordinator-free distributed setting using standard techniques [12], assuming authenticated point-to-point channels.
In this setting, sample complexity is measured as the number of functions $f_s(\cdot)$ we accessed.

*Authors in alphabetical order. Full version can be found on https://arxiv.org/abs/1803.08917.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
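The fragility of plain gradient averaging noted above can be seen in a few lines of code. The following toy simulation is ours, not the paper's (all names and constants are illustrative): $m$ workers report gradients of $f(x) = \frac{1}{2}\|x\|^2$, one of them Byzantine. Plain averaging is destroyed by the single corrupted report, while a coordinate-wise median, used here purely for contrast, still converges.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, T, eta = 10, 5, 200, 0.1

def grad_f(x):
    # f(x) = 0.5 * ||x||^2, whose unique minimizer is x = 0
    return x

def run(aggregate):
    x = np.ones(d)
    for _ in range(T):
        # good workers: true gradient plus small bounded noise
        grads = np.stack([grad_f(x) + 0.1 * rng.standard_normal(d)
                          for _ in range(m)])
        grads[0] = 1e6 * np.ones(d)  # a single Byzantine worker
        x = x - eta * aggregate(grads)
    return x

x_mean = run(lambda g: g.mean(axis=0))          # plain mini-batch averaging
x_median = run(lambda g: np.median(g, axis=0))  # coordinate-wise median
```

With averaging, the iterate ends up far from the minimizer; with the median it stays near $0$. (Coordinate-wise median is shown only for contrast; as the Related Work discussion explains, it is not sample-optimal in high dimensions.)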
Since every machine gets one sample per iteration, minimizing sample complexity is equivalent to minimizing the number of iterations. Time complexity is likewise determined by the number of iterations.
Our Results. In this work, we study the convex formulation of this Byzantine stochastic optimization problem: we assume $f(x)$ is convex, although each of the functions $f_s(x)$ may not necessarily be convex. We provide the first algorithms that, in the presence of Byzantine machines, guarantee the following, up to logarithmic and lower-order terms:

(1) achieve optimal sample complexity;
(2) achieve optimal number of stochastic gradient computations;
(3) match the sample and time complexities of traditional SGD as $\alpha \to 0$; and
(4) achieve (1)-(3) even as the dimension grows, without losing additional dimension factors.

In addition, our algorithms are optimally robust, supporting any fraction $\alpha < 1/2$ of Byzantine workers. Despite significant recent interest, e.g. [6, 8, 13, 26, 27, 30, 31], to the best of our knowledge, prior to our work there were no algorithms for stochastic optimization in high dimensions that achieved any of the four objectives highlighted above. Previous algorithms either provided weak robustness guarantees, or had sample or time complexities which degrade polynomially with the dimension $d$ or with the error $\varepsilon$.
Technical Contribution. A direct way to deal with Byzantine workers is to perform a robust aggregation step to compute gradients, such as median of means: for each (good) worker machine $i \in [m]$, whenever a query point $x_k$ is provided by the master, the worker takes $n$ stochastic gradient samples and computes their average, which we call $v_i$. If $n = \tilde{\Theta}(\varepsilon^{-2})$, one can show that for each good machine $i$, it holds that $\|v_i - \nabla f(x_k)\| \le \varepsilon$ with high probability. Therefore, in each iteration $k$, we can determine a vector $v_{\mathrm{med}} \in \{v_1, \ldots, v_m\}$ satisfying $\|v_{\mathrm{med}} - \nabla f(x_k)\| \le 2\varepsilon$, and move in the negative direction of $v_{\mathrm{med}}$.
However, the above idea requires too many computations of stochastic gradients. In the non-strongly convex setting, each worker machine needs to compute $\varepsilon^{-2}$ stochastic gradients per iteration, and the overall number of iterations will be at least $\varepsilon^{-1}$. This is because, even when $f_s(x) = f(x)$ and $\alpha = 0$, gradient descent converges in $\varepsilon^{-1}$ iterations. This amounts to a sample complexity scaling linearly with $\varepsilon^{-3}$.
We take a different approach. We run the algorithm for $T$ iterations, where each machine $i \in [m]$ only computes one stochastic gradient per iteration. Let $v_i^{(k)}$ be the stochastic gradient allegedly computed by machine $i \in [m]$ at iteration $k \in [T]$. By martingale concentration, $B_i := (v_i^{(1)} + \cdots + v_i^{(T)})/T$ should concentrate around $B^\star := (\nabla f(x_1) + \cdots + \nabla f(x_T))/T$ for each good machine $i$, up to an additive error $1/\sqrt{T}$. Hence, if $\|B_i - B^\star\| > 1/\sqrt{T}$ for machine $i$, we can safely declare that $i$ is Byzantine.
Two non-trivial technical obstacles remain. First, we cannot restart the algorithm every time we discover a new Byzantine machine, since that would ruin its time complexity. Second, Byzantine machines may successfully "disguise" themselves by not violating the above criterion.
To address the first issue, we keep track of the quantity
$B_i^{(k)} := \frac{v_i^{(1)} + \cdots + v_i^{(k)}}{k}$
at each step $k$; if a machine strays away too much from $B_\star^{(k)}$, it is labeled as Byzantine and removed from future consideration. We prove that restarts are not necessary.
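The cross-iteration test just described can be mimicked numerically. Below is a toy sketch of ours, not the paper's full algorithm: each machine's reported gradients are accumulated into a running sum $B_i$, and a machine is flagged once it strays from a majority-supported reference by more than the threshold $4V\sqrt{T\log(16mT/\delta)}$ that the paper's Algorithm 1 later suggests. A Byzantine machine adding a small per-step bias passes any single-iteration check but is caught by the accumulated statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, T, V, delta = 20, 3, 500, 1.0, 0.01
TB = 4 * V * np.sqrt(T * np.log(16 * m * T / delta))  # suggested threshold

B = np.zeros((m, d))  # running sums of reported gradients
flagged = set()
for k in range(T):
    g = rng.standard_normal(d)  # stand-in for the true gradient at x_k
    # good machines: true gradient plus bounded noise of magnitude <= V
    reports = g + rng.uniform(-V, V, size=(m, d)) / (2 * np.sqrt(d))
    reports[0] += 0.5 * V       # Byzantine: small bias, plausible per step
    B += reports
    # reference: any B_i that a strict majority of machines is TB-close to
    dists = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
    ref = np.argmax((dists <= TB).sum(axis=1))
    flagged |= {i for i in range(m) if np.linalg.norm(B[i] - B[ref]) > TB}
```

In this run only the biased machine accumulates a deviation of order $k$ while honest deviations grow like $\sqrt{k}$, so it alone crosses the threshold before iteration $T$.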
For the second problem, we construct a similar "safety" criterion, in terms of the sequence
$A_i^{(k)} := \frac{\langle v_i^{(1)}, x_1 - x_0\rangle + \cdots + \langle v_i^{(k)}, x_k - x_0\rangle}{k}$.
We prove that good machines will satisfy both criteria; more importantly, any Byzantine machine which satisfies both of them must have negligible negative influence on the algorithm's convergence.

Table 1: Comparison of Byzantine optimization for smooth convex minimization $f(x) = \mathbb{E}_{s\sim\mathcal{D}}[f_s(x)]$. [The grid of this table was lost in extraction. It compares SGD (folklore, $\alpha = 0$ only), ByzantineSGD (Theorems 3.2 and 3.4), GD (folklore, $\alpha = 0$ only), Median-GD (Yin et al. [31]), and the lower bounds (Theorems 4.3 and 4.4, c.f. [29]), in terms of total work, per-iteration per-machine work, and the number of sampled functions per machine, for both the convex and the $\lambda$-strongly convex settings. For instance, in the convex setting ByzantineSGD needs $\tilde{O}\big(\frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2}\big)$ sampled functions per machine versus $O\big(\frac{1}{\varepsilon^2 m}\big)$ for SGD, while Median-GD pays an extra factor of $d$.]
Remark 1. In this table, we have hidden the parameters $L$ (smoothness), $V$ (variance), and $D$ (diameter). The goal is to achieve $f(x) - f(x^\star) \le \varepsilon$, and $\lambda$ is the strong convexity parameter of $f(x)$.
Remark 2. "# sampled functions" is the number of $f_s(\cdot)$ to sample for each machine.
Remark 3. "total/per-iteration work" is in terms of the number of stochastic gradient computations $\nabla f_s(\cdot)$.

Related Work. The closest work to ours is the concurrent and independent work of Yin et al. [31]. They consider a similar Byzantine model, but for gradient descent (GD). In their algorithm, each of the $m$ machines receives $n$ samples of functions upfront. In an iteration $k$, machine $i$ allegedly computes $n$ stochastic gradients at point $x_k$ and averages them (the $n$ stochastic gradients are taken with respect to the $n$ sampled functions stored on machine $i$). Then, their proposed algorithm aggregates all the average vectors from the $m$ machines, and performs a coordinate-wise median operation to determine the descent direction. In contrast, our algorithm is a Byzantine variant of SGD: a total of $Tm$ functions are sampled and a total of $Tm$ stochastic gradient computations are performed. To be robust against Byzantine machines, they average stochastic gradients within a single iteration and compare them across machines.
In contrast, we average stochastic gradients (and other quantities) across iterations.
Further, in terms of sample complexity (i.e., the number of functions $f_s(\cdot)$ to be sampled), their algorithm's complexity is higher by a linear factor in the dimension $d$ (see Table 1). This is in large part due to their coordinate-wise median operation. In high dimensions, this leads to sub-optimal statistical rates. In terms of total computational complexity, each iteration of Yin et al. [31] requires a full pass over the (sampled) dataset. In contrast, an entire run of ByzantineSGD requires only one pass. Finally, their algorithm works under a weaker set of assumptions than ours. They assume that the stochastic error in the gradients (namely, $\nabla f_s(x) - \nabla f(x)$) has bounded variance and skewness; in contrast, we assume that $\nabla f_s(x) - \nabla f(x)$ is bounded with probability 1. Our stronger assumption (which is standard) turns out to simplify our algorithm and analysis. We leave it as future work to extend ByzantineSGD to bounded skewness.
Yin et al. [31] also provided a lower bound in terms of sample complexity, i.e. the number of functions $f_s(\cdot)$ that need to be sampled in the presence of Byzantine machines. When translated to our language, the result is essentially the same as the strongly convex part of Theorem 4.4. The results in this paper are the first to cover the case of non-strongly convex functions.
Byzantine Stochastic Optimization. There has been a lot of recent work on Byzantine stochastic optimization, and in particular SGD [6, 8, 13, 26, 27, 30]. One of the first references to consider this setting is Feng et al. [13], which investigated distributed PCA and regression in the Byzantine distributed model. Their general framework has each machine running a robust learning algorithm locally, and aggregating results via a robust estimator.
However, the algorithm requires careful parametrization of the sample size at each machine to obtain good error bounds, which renders it suboptimal with respect to sample complexity. Our work introduces new techniques which address both these limitations. Su and Vaidya [26, 27] consider a similar setting: Su and Vaidya [26] focus on the single-dimensional ($d = 1$) case, whereas Su and Vaidya [27] consider the multi-dimensional setting, but only for a restricted family of consensus-based algorithms.
Blanchard et al. [6] propose a general Byzantine-resilient gradient aggregation rule called Krum for selecting a valid gradient update. This rule has local complexity $O(m^2(d + \log m))$, which makes it relatively expensive to compute when $d$ and/or $m$ are large. Moreover, in each iteration the algorithm chooses a gradient corresponding to a constant number of correct workers, so the scheme does not achieve speedup with respect to the number of distributed workers, which negates an important benefit of distributed training. Xie et al. [30] consider gradient aggregation rules in a generalized Byzantine setting where a subset of the messages sent between servers can be corrupted. The complexity of their selection rule can be as low as $\tilde{O}(dm)$, but their approach is far from sample-optimal. Chen et al. [8] leverage the geometric median of means idea in a novel way, which allows it to be significantly more sample-efficient, and applicable for a wider range of parameters. At the same time, their technique only applies in the strongly convex setting, and is suboptimal in terms of convergence rate by a factor of $\sqrt{\alpha m}$.
Adversarial Noise. Optimization and learning in the presence of adversarial noise is a well-studied problem [4, 5, 15, 20, 22, 28].
Recently, efficient algorithms for high dimensional optimization which are tolerant to a small fraction of adversarial corruptions have been developed [1, 7, 11, 16, 24], building on new algorithms for high dimensional robust statistics [5, 7, 9, 18]. This setting is different from ours. For instance, in their setting there are statistical barriers so that no algorithm can achieve an optimization error below some fixed threshold, no matter how many samples are taken. In contrast, in the current Byzantine setting, the adversarial corruptions can only occur in a fraction of the machines (as opposed to each machine having some adversarial corruptions). For this reason, our results do not extend to their scenario.

2 Preliminaries
Throughout this paper, we denote by $\|\cdot\|$ the Euclidean norm and $[n] := \{1, 2, \ldots, n\}$. We reiterate some definitions regarding strong convexity, smoothness, and Lipschitz continuity (for other equivalent definitions, see Nesterov [21]).
Definition 2.1. For a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$,
• $f$ is $\lambda$-strongly convex if $\forall x, y \in \mathbb{R}^d$, it satisfies $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\lambda}{2}\|x - y\|^2$.
• $f$ is $L$-Lipschitz smooth (or $L$-smooth for short) if $\forall x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$.
• $f$ is $G$-Lipschitz continuous if $\forall x \in \mathbb{R}^d$, $\|\nabla f(x)\| \le G$.
Byzantine Convex Stochastic Optimization. We let $m$ be the number of worker machines and assume at most an $\alpha$ fraction of them are Byzantine, for $\alpha \in \big[0, \frac{1}{2}\big)$. We denote by $\mathsf{good} \subseteq [m]$ the set of good (i.e. non-Byzantine) machines. Obviously, the algorithm does not know $\mathsf{good}$.
We let $\mathcal{D}$ be a distribution over (not necessarily convex) functions $f_s : \mathbb{R}^d \to \mathbb{R}$. Our goal is to approximately minimize the following objective:
$\min_{x\in\mathbb{R}^d} f(x) := \mathbb{E}_{s\sim\mathcal{D}}[f_s(x)]$, (2.1)
where we assume $f$ is convex. In each iteration $k = 1, 2, \ldots, T$, the algorithm is allowed to specify a point $x_k$ and query $m$ machines.
Each machine $i \in [m]$ gives back a vector $\nabla_{k,i} \in \mathbb{R}^d$ satisfying:
Assumption 2.2. For each iteration $k \in [T]$ and for every $i \in \mathsf{good}$, we have $\nabla_{k,i} = \nabla f_s(x_k)$ for a random sample $s \sim \mathcal{D}$, and $\|\nabla_{k,i} - \nabla f(x_k)\| \le V$.
Remark 2.3. For each $k \in [T]$ and $i \notin \mathsf{good}$, the vector $\nabla_{k,i}$ can be adversarially chosen and may depend on $\{\nabla_{k',i}\}_{k'\le k,\, i\in[m]}$. In particular, the Byzantine machines can even collude within an iteration.
The next fact is completely classical (for projected mirror descent).
Fact 2.4. If $x_{k+1} = \arg\min_{y : \|y - x_1\| \le D}\big\{\frac{1}{2}\|y - x_k\|^2 + \eta\langle\Xi, y - x_k\rangle\big\}$, then $\forall u$ with $\|u - x_1\| \le D$:
$\langle\Xi, x_k - u\rangle \le \langle\Xi, x_k - x_{k+1}\rangle - \frac{\|x_k - x_{k+1}\|^2}{2\eta} + \frac{\|x_k - u\|^2}{2\eta} - \frac{\|x_{k+1} - u\|^2}{2\eta}$.

3 Description and Analysis of ByzantineSGD
Without loss of generality, in this section we assume that we are given a starting point $x_1 \in \mathbb{R}^d$ and want to solve the following more general problem:²
$\min_{\|x - x_1\| \le D} f(x) := \mathbb{E}_{s\sim\mathcal{D}}[f_s(x)]$. (3.1)
We denote by $x^\star$ an arbitrary minimizer of Problem (3.1).
Our algorithm ByzantineSGD is formally stated in Algorithm 1. In each iteration, at point $x_k$, ByzantineSGD tries to identify a set $\mathsf{good}_k$ of "candidate good" machines, and then performs a stochastic gradient update only with respect to $\mathsf{good}_k \subseteq [m]$, using the direction $\Xi_k := \frac{1}{m}\sum_{i\in\mathsf{good}_k}\nabla_{k,i}$.
The set $\mathsf{good}_k$ is maintained by constructing two "estimation sequences". Namely, for each machine $i \in [m]$, we maintain a real value $A_i = \sum_{t=1}^{k}\langle\nabla_{t,i}, x_t - x_1\rangle$ and a vector $B_i = \sum_{t=1}^{k}\nabla_{t,i}$. Then, we denote by $A_{\mathrm{med}}$ the median of $\{A_1, \ldots, A_m\}$ and by $B_{\mathrm{med}}$ some "vector median" of $\{B_1, \ldots, B_m\}$. We also define $\nabla_{\mathrm{med}}$ to be some "vector median" of $\{\nabla_{k,1}, \ldots, \nabla_{k,m}\}$. For instance, for $\{\nabla_{k,1}, \ldots, \nabla_{k,m}\}$, our vector median is defined as follows: we select $\nabla_{\mathrm{med}}$ to be any $\nabla_{k,i}$ as long as $|\{j \in [m] : \|\nabla_{k,j} - \nabla_{k,i}\| \le 2V\}| > m/2$.
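The "vector median" selection rule just defined is straightforward to implement by guess-and-verify. A minimal sketch (the function name and the test data are ours):

```python
import numpy as np

def vector_median(vs, thresh, rng):
    """Return some row of vs whose thresh-ball contains a strict majority of rows.

    Guess a random index and verify it in O(m d) time; when at least
    (1 - alpha) m > m / 2 rows are valid choices, about 2 guesses suffice
    in expectation, as noted in the text.
    """
    m = len(vs)
    while True:
        i = rng.integers(m)
        if (np.linalg.norm(vs - vs[i], axis=1) <= thresh).sum() > m / 2:
            return vs[i]

rng = np.random.default_rng(4)
good = 0.1 * rng.standard_normal((9, 3))  # clustered honest reports
bad = good[:3] + 100.0                    # three far-away outliers
vmed = vector_median(np.vstack([good, bad]), thresh=1.0, rng=rng)
```

In ByzantineSGD this rule is applied with threshold $2V$ to the reported gradients $\{\nabla_{k,1}, \ldots, \nabla_{k,m}\}$, and with the threshold $\mathcal{T}_B$ to the accumulated vectors $\{B_1, \ldots, B_m\}$.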
Such an index $i \in [m]$ can be efficiently computed because our later lemmas shall ensure that at least $(1-\alpha)m$ indices in $[m]$ are valid choices for $i$. Therefore, one can for instance guess a random index $i$ and verify whether it is valid. In expectation at most 2 guesses are needed, so finding these quantities can be done in linear time.
Starting from $\mathsf{good}_0 = [m]$, we define $\mathsf{good}_k$ to be all the machines $i$ from $\mathsf{good}_{k-1}$ whose $A_i$ is $\mathcal{T}_A$-close to $A_{\mathrm{med}}$, whose $B_i$ is $\mathcal{T}_B$-close to $B_{\mathrm{med}}$, and whose $\nabla_{k,i}$ is $4V$-close to $\nabla_{\mathrm{med}}$. We will prove that if the thresholds $\mathcal{T}_A$ and $\mathcal{T}_B$ are chosen appropriately, then $\mathsf{good}_k$ always contains all machines in $\mathsf{good}$.
Bounding the Error. As we shall see, the "error" incurred by ByzantineSGD contains two parts: $\mathrm{Error}_1$ is due to the bias created by the stochastic gradients (of good machines) and the adversarial noise (of Byzantine machines), while $\mathrm{Error}_2$ is the variance of using $\Xi_k$ to approximate $\nabla f(x_k)$. As we shall see, $\mathrm{Error}_2$ is almost always "well bounded." However, the adversarial noise incurred in $\mathrm{Error}_1$ can sometimes destroy the convergence of SGD. We therefore use $\{A_i\}_i$ and $\{B_i\}_i$ to perform a reasonable estimation of $\mathrm{Error}_1$, and remove the bad machines if they misbehave. Note that even at the end of the algorithm, $\mathsf{good}_T$ may still contain some Byzantine machines; however, their adversarial noise must be negligible and shall not impact the performance of the algorithm.
We have the following lemma to establish bounds on the two error terms:
Lemma 3.1. With probability $1 - \delta$, we simultaneously have
$\mathrm{Error}_1 \le 4DV\sqrt{TmC} + 16\alpha m DV\sqrt{TC}$ and $\mathrm{Error}_2 \le 32\alpha^2 V^2 + \frac{4V^2C}{m}$.

²This is so because, even in the unconstrained setting, classical SGD requires knowing an upper bound $D$ on
We can thus add the constraint to the objective.\n\n4V 2C\nm\n\n.\n\n5\n\nand\n\nError1 := Xk2[T ] Xi2goodk\nT Xk2[T ]\n\nError2 :=\n\n1\n\nhrk,i rf (xk), xk x\u21e4i\n\n1\n\nm Xi2goodkrk,i rf (xk)\n\n2\n\n.\n\n\fthresholds TA, TB > 0;\n\nAlgorithm 1 ByzantineSGD(\u2318, x1, D, T, TA, TB)\nInput: learning rate \u2318> 0, starting point x1 2 Rd, diameter D > 0, number of iterations T ,\n\u21e7 theory suggests TA = 4DVpT log(16mT /) and TB = 4VpT log(16mT /)\n\u21e7 where is con\ufb01dence parameter\n\n\u21e7 we have E[rk,i] = rf (xk) if i 2 good\n\nend for\nAmed := median{A1, . . . , Am}\n\n1: good1 [m];\n2: for k 1 to T do\nfor i 1 to m do\n3:\nreceive rk,i 2 Rd from machine i 2 [m];\n4:\nt=1hrt,i, xt x1i and Bi Pk\nAi Pk\n5:\n6:\n7:\nBmed Bi where i 2 [m] is any machine s.t.{j 2 [m] : kBj Bik \uf8ff TB} > m/2.\n8:\nrmed rk,i where i 2 [m] is any machine s.t.{j 2 [m] : krk,j rk,ik \uf8ff 2V} > m/2\ngoodk i 2 goodk1 : |AiAmed|\uf8ff TA^kBiBmedk \uf8ff TB^krk,irmedk \uf8ff 4V ;\nxk+1 = arg miny : kyx1k\uf8ffDn 1\n\n\u21e7 all machines i 2 good will be valid choice, see Claim A.3b\n\u21e7 all machines i 2 good will be valid choice due to Assumption 2.2\n\u21e7 with high probability goodk \u25c6 good\n\nmPi2goodk rk,i, y xk\u21b5o;\n\n2ky xkk2 + \u2318\u2326 1\n\n11:\n12: end for\n\nt=1 rt,i;\n\n10:\n\n9:\n\nThe proof of this lemma will be in two parts: \ufb01rst, we de\ufb01ne a set of determinstic conditions, and\nshow that these conditions hold with high probability. Then, we will demonstrate that assuming\nthese concentration results hold, the error will be bounded. The details of the proof are deferred to\nthe full version of this paper.\nWith this crucial lemma, we can now prove some rates for our algorithm.\nSmooth functions. We \ufb01rst consider the setting where our objective is smooth, and prove:\nTheorem 3.2. Suppose in Problem (3.1) our f (x) is L-smooth and Assumption 2.2 holds. Suppose\n2L and TA = 4DVpT C and TB = 4VpT C. 
Then, with probability at least $1 - \delta$, letting $C := \log(16mT/\delta)$ and $\bar{x} := \frac{x_2 + \cdots + x_{T+1}}{T}$, we have
$f(\bar{x}) - f(x^\star) \le \frac{D^2}{\eta T} + \frac{8DV\sqrt{TmC} + 32\alpha m DV\sqrt{TC}}{Tm} + \eta\cdot\Big(\frac{8V^2C}{m} + 32\alpha^2V^2\Big)$.
If $\eta$ is chosen optimally, then
$f(\bar{x}) - f(x^\star) \le O\Big(\frac{LD^2}{T} + \frac{DV\sqrt{C}}{\sqrt{Tm}} + \frac{\alpha DV\sqrt{C}}{\sqrt{T}}\Big)$.
We remark that:
• The first term $O\big(\frac{LD^2}{T}\big)$ is the classical error rate for gradient descent on smooth objectives [21], and should exist even if $V = 0$ (so every $\nabla_{k,i}$ exactly equals $\nabla f(x_k)$) and $\alpha = 0$.
• The first two terms $\tilde{O}\big(\frac{LD^2}{T} + \frac{DV}{\sqrt{Tm}}\big)$ together match the classical mini-batch error rate for SGD on smooth objectives, and should exist even if $\alpha = 0$ (so we have no Byzantine machines).
• The third term $\tilde{O}\big(\frac{\alpha DV}{\sqrt{T}}\big)$ is optimal in our Byzantine setting due to Theorem 4.3.

Proof of Theorem 3.2. Applying Fact 2.4 for $k = 1, 2, \ldots$
$, T$ with $u = x^\star$, we have
$\frac{1}{T}\sum_{k\in[T]}\langle\Xi_k, x_k - x^\star\rangle \le \frac{D^2}{2\eta T} + \frac{1}{T}\sum_{k\in[T]}\Big(\langle\Xi_k, x_k - x_{k+1}\rangle - \frac{1}{2\eta}\|x_k - x_{k+1}\|^2\Big) = \frac{D^2}{2\eta T} + \frac{1}{T}\sum_{k\in[T]}\Big(\Big\langle\frac{1}{m}\sum_{i\in\mathsf{good}_k}\nabla_{k,i},\, x_k - x_{k+1}\Big\rangle - \frac{1}{2\eta}\|x_k - x_{k+1}\|^2\Big)$. (3.2)
We notice that the left hand side of (3.2) satisfies (writing $\nabla_k := \nabla f(x_k)$ for brevity)
$\sum_{k\in[T]}\langle\Xi_k, x_k - x^\star\rangle = \frac{1}{m}\sum_{k\in[T]}\sum_{i\in\mathsf{good}_k}\langle\nabla_k, x_k - x^\star\rangle + \frac{1}{m}\sum_{k\in[T]}\sum_{i\in\mathsf{good}_k}\langle\nabla_{k,i} - \nabla_k, x_k - x^\star\rangle$ (3.3)
$\overset{①}{\ge} \frac{1}{m}\sum_{k\in[T]}\sum_{i\in\mathsf{good}_k}\big(f(x_k) - f(x^\star)\big) + \frac{\mathrm{Error}_1}{m}$
$\overset{②}{\ge} \frac{1}{m}\sum_{k\in[T]}\sum_{i\in\mathsf{good}_k}\Big(f(x_{k+1}) - f(x^\star) - \langle\nabla_k, x_{k+1} - x_k\rangle - \frac{L}{2}\|x_k - x_{k+1}\|^2\Big) + \frac{\mathrm{Error}_1}{m}$. (3.4)
Above, inequality ① uses the convexity of $f(\cdot)$ and the definition of $\mathrm{Error}_1$, and inequality ② uses the smoothness of $f(\cdot)$, which implies $f(x_{k+1}) \le f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_k - x_{k+1}\|^2$.
Putting (3.4) back into (3.2), we have
$\frac{1}{Tm}\sum_{k\in[T]}\sum_{i\in\mathsf{good}_k}\big(f(x_{k+1}) - f(x^\star)\big) \le \frac{D^2}{2\eta T} - \frac{\mathrm{Error}_1}{Tm} + \frac{1}{T}\sum_{k\in[T]}\Big(\Big\langle\frac{1}{m}\sum_{i\in\mathsf{good}_k}(\nabla_{k,i} - \nabla_k),\, x_k - x_{k+1}\Big\rangle - \Big(\frac{1}{2\eta} - \frac{L}{2}\Big)\|x_k - x_{k+1}\|^2\Big)$
$\overset{①}{\le} \frac{D^2}{2\eta T} - \frac{\mathrm{Error}_1}{Tm} + \frac{\eta}{T}\sum_{k\in[T]}\Big\|\frac{1}{m}\sum_{i\in\mathsf{good}_k}(\nabla_{k,i} - \nabla_k)\Big\|^2 = \frac{D^2}{2\eta T} - \frac{\mathrm{Error}_1}{Tm} + \eta\,\mathrm{Error}_2$. (3.5)
Above, inequality ① uses the fact that $\frac{1}{2\eta} - \frac{L}{2} \ge \frac{1}{4\eta}$, and Young's inequality, which says $\langle a, b\rangle - \frac{1}{4\eta}\|b\|^2 \le \eta\|a\|^2$.
Finally, we conclude the proof by plugging Lemma 3.1 and the following convexity inequality into (3.5):
$\frac{1}{Tm}\sum_{k\in[T]}\sum_{i\in\mathsf{good}_k}\big(f(x_{k+1}) - f(x^\star)\big) = \frac{1}{T}\sum_{k\in[T]}\frac{|\mathsf{good}_k|}{m}\big(f(x_{k+1}) - f(x^\star)\big) \ge \frac{1}{T}\sum_{k\in[T]}\frac{1}{2}\big(f(x_{k+1}) - f(x^\star)\big) \ge \frac{1}{2}\big(f(\bar{x}) - f(x^\star)\big)$,
where the middle inequality uses $|\mathsf{good}_k| \ge (1-\alpha)m \ge m/2$ and the last uses the convexity of $f(\cdot)$. □
Nonsmooth Functions. We also derive a similarly tight result when the objective is not assumed to be smooth. The proof is similar to the previous one, and we defer it to the supplementary material.
Theorem 3.3.
Suppose in Problem (3.1) our $f(x)$ is differentiable, $G$-Lipschitz continuous and Assumption 2.2 holds. Suppose $\eta > 0$ and $\mathcal{T}_A = 4DV\sqrt{TC}$ and $\mathcal{T}_B = 4V\sqrt{TC}$. Then, with probability at least $1-\delta$, letting $C := \log(16mT/\delta)$ and $\bar{x} := \frac{x_1 + \cdots + x_T}{T}$, we have
$f(\bar{x}) - f(x^\star) \le \frac{D^2}{\eta T} + 2\eta G^2 + \frac{8DV\sqrt{TmC} + 32\alpha m DV\sqrt{TC}}{Tm} + \eta\cdot\Big(\frac{8V^2C}{m} + 32\alpha^2V^2\Big)$.
If $\eta$ is chosen optimally, then
$f(\bar{x}) - f(x^\star) \le O\Big(\frac{GD}{\sqrt{T}} + \frac{DV\sqrt{C}}{\sqrt{Tm}} + \frac{\alpha DV\sqrt{C}}{\sqrt{T}}\Big)$.
We remark that, as for Theorem 3.2, the first two terms are asymptotically tight for SGD in this setting, and the last term is necessary in our Byzantine setting, as we show in Theorem 4.3.
Strongly convex functions. We now consider the problem³
$\min_{x\in\mathbb{R}^d} f(x) := \mathbb{E}_{s\sim\mathcal{D}}[f_s(x)]$, where $f(x)$ is $\lambda$-strongly convex. (3.6)
In this setting, we can obtain similarly optimal rates to those we obtained before, by reducing the problem to repeatedly solving non-strongly convex ones, as in Hazan and Kale [14]. When the function is additionally smooth, we obtain:
Theorem 3.4. Suppose in Problem (3.6) our $f(x)$ is $L$-smooth and Assumption 2.2 holds. Given $x_0 \in \mathbb{R}^d$ with guarantee $\|x_0 - x^\star\| \le D$, one can repeatedly apply ByzantineSGD to find a point $x$ satisfying, with probability at least $1 - \delta_0$, $f(x) - f(x^\star) \le \varepsilon$ and $\|x - x^\star\|^2 \le 2\varepsilon/\lambda$ in
$T = \tilde{O}\Big(\frac{L}{\lambda} + \frac{V^2}{\lambda m\varepsilon} + \frac{\alpha^2V^2}{\lambda\varepsilon}\Big)$
iterations, where the $\tilde{O}$ notation hides logarithmic factors in $D, m, L, V, \lambda^{-1}, \varepsilon^{-1}, \delta_0^{-1}$.
When the function is non-smooth, we instead obtain:
Theorem 3.5. Suppose in Problem (3.6) our $f(x)$ is differentiable, $G$-Lipschitz continuous and Assumption 2.2 holds.

³To present the simplest result, we have assumed that Problem (3.6) is unconstrained. One can also impose an additional constraint $\|x - x_0\| \le D$, but we refrain from doing so.
Given $x_0 \in \mathbb{R}^d$ with guarantee $\|x_0 - x^\star\| \le D$, one can repeatedly apply ByzantineSGD to find a point $x$ satisfying, with probability at least $1 - \delta_0$, $f(x) - f(x^\star) \le \varepsilon$ and $\|x - x^\star\|^2 \le 2\varepsilon/\lambda$ in
$T = \tilde{O}\Big(\frac{G^2}{\lambda\varepsilon} + \frac{V^2}{\lambda m\varepsilon} + \frac{\alpha^2V^2}{\lambda\varepsilon} + 1\Big)$
iterations, where the $\tilde{O}$ notation hides logarithmic factors in $D, m, L, V, \lambda^{-1}, \varepsilon^{-1}, \delta_0^{-1}$.
We defer the proofs to the supplementary material, but we remark that, again, all of these rates have three terms. Just as in the rates for non-strongly convex functions, the first two terms are necessary even when there are no Byzantine workers, and the last term matches the lower bound we give in Theorem 4.4 for Byzantine optimization.

4 Lower Bounds for Byzantine Stochastic Optimization

In this section, we prove that the convergence rates we obtain in Section 3 are optimal up to log factors, even in $d = 1$ dimension. Recall that a random vector $X \in \mathbb{R}^d$ is subgaussian with variance proxy $V^2$ if $u^T X$ is a univariate subgaussian random variable with variance proxy $V^2$ for all unit vectors $u \in \mathbb{R}^d$. We require the following definition:
Definition 4.1 (Stochastic estimator). Given $\mathcal{X} \subseteq \mathbb{R}^d$ and $f : \mathcal{X} \to \mathbb{R}$, we say a random function $f_s$ (with $s$ drawn from some distribution $\mathcal{D}$) is a stochastic estimator for $f$ if $\mathbb{E}[f_s(x)] = f(x)$ for all $x \in \mathcal{X}$. Furthermore, we say $f_s$ is subgaussian with variance proxy $V^2$ if $\nabla f_s(x) - \nabla f(x)$ is a subgaussian random variable with variance proxy $V^2/d$ for all $x \in \mathcal{X}$.
Note that the normalization factor of $1/d$ in this definition ensures that $\mathbb{E}\big[\|\nabla f_s(x) - \nabla f(x)\|^2\big] \le O(V^2)$, which matches the normalization used in this paper and throughout the literature. However, in our lower bound constructions it turns out that it suffices to take $d = 1$.
We prove our lower bounds only against subgaussian stochastic estimators.
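As a rough illustration of where the $\alpha$-dependent term in the lower bounds comes from (this is intuition only; the actual proofs construct pairs of indistinguishable estimators): after $T$ samples, a good machine's empirical mean gradient deviates from the truth on the scale $V/\sqrt{T}$, so Byzantine machines can all report a common shift of exactly $V/\sqrt{T}$ without being statistically distinguishable from honest fluctuation, biasing any averaging-style aggregate by about $\alpha V/\sqrt{T}$. The toy numbers below are ours:

```python
import numpy as np

rng = np.random.default_rng(5)
m, T, V, alpha = 1000, 400, 1.0, 0.2
n_byz = int(alpha * m)
g_true = 0.0  # true (scalar) gradient value

# Per-machine empirical means after T samples: good machines fluctuate on the
# scale V / sqrt(T); Byzantine machines all sit at exactly +V / sqrt(T).
means = g_true + (V / np.sqrt(T)) * rng.standard_normal(m)
means[:n_byz] = g_true + V / np.sqrt(T)

bias = means.mean() - g_true  # concentrates around alpha * V / sqrt(T)
# the common Byzantine shift lies inside the honest fluctuation range:
hidden = np.abs(means[n_byz:] - g_true).max() >= V / np.sqrt(T)
```

To achieve optimization error $\varepsilon$ over a domain of diameter $D$, one needs the gradient bias to be below roughly $\varepsilon/D$; requiring $\alpha V/\sqrt{T} \lesssim \varepsilon/D$ forces $T \gtrsim \alpha^2V^2D^2/\varepsilon^2$, the second term of Theorem 4.3.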
This is different from our Assumption 2.2 used in the upper-bound theorems, where we assumed that $\|\nabla f_s(x) - \nabla f(x)\| \le V$ uniformly for all $x$ in the domain.
Remark 4.2. Such a difference is negligible, because by concentration, if $f_s$ is a sample from a subgaussian stochastic estimator with variance proxy $V^2$, then $\|\nabla f_s(x) - \nabla f(x)\| \le O\big(V\sqrt{\log(mT)}\big)$ with overwhelming probability. As a result, this impacts our lower bounds only by a $\log(mT)$ factor. For simplicity of exposition, we only state our theorems for subgaussian stochastic estimators.
Our result for non-strongly convex stochastic optimization is the following:
Theorem 4.3. For any $D, V, \varepsilon > 0$ and $\alpha \in (0, 0.1)$, there exists a linear function $f : [-D, D] \to \mathbb{R}$ (of Lipschitz continuity $G = \varepsilon/D$) with a subgaussian stochastic estimator with variance proxy $V^2$ so that, given $m$ machines, of which $\alpha m$ are Byzantine, and $T$ samples from the stochastic estimator per machine, no algorithm can output $x$ so that $f(x) - f(x^\star) < \varepsilon$ with probability at least $2/3$ unless $T = \Omega\big(\frac{D^2V^2}{\varepsilon^2 m} + \frac{\alpha^2V^2D^2}{\varepsilon^2}\big)$, where $x^\star = \arg\min_{x\in[-D,D]} f(x)$.
Observe that up to log factors, this matches the upper bound in Theorem 3.3 exactly, demonstrating that both are exactly tight. We get a similarly tight result for the strongly convex case:
Theorem 4.4. For any $V, \lambda > 0$ and $\alpha \in (0, 0.1)$, there exists a $\lambda$-strongly convex quadratic function $f : \mathbb{R} \to \mathbb{R}$ with a subgaussian stochastic estimator of variance proxy $V^2$ so that, given $m$ machines, of which $\alpha m$ are Byzantine, and $T$ samples from the stochastic estimator per machine, no algorithm can output $x$ so that $|x - x^\star| < \hat{\varepsilon}$ with probability at least $2/3$ unless $T = \Omega\big(\frac{V^2}{\lambda^2 m\hat{\varepsilon}^2} + \frac{\alpha^2V^2}{\lambda^2\hat{\varepsilon}^2}\big)$.
Since $f(x) - f(x^\star) \le \frac{\lambda}{2}\hat{\varepsilon}^2$ implies $|x - x^\star| \le \hat{\varepsilon}$ by the strong convexity of $f$, Theorem 4.4