{"title": "Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 119, "page_last": 129, "abstract": "We study the resilience to Byzantine failures of distributed implementations of Stochastic Gradient Descent (SGD). So far, distributed machine learning frameworks have largely ignored the possibility of failures, especially arbitrary (i.e., Byzantine) ones. Causes of failures include software bugs, network asynchrony, biases in local datasets, as well as attackers trying to compromise the entire system. Assuming a set of $n$ workers, up to $f$ being Byzantine, we ask how resilient can SGD be, without limiting the dimension, nor the size of the parameter space. We first show that no gradient aggregation rule based on a linear combination of the vectors proposed by the workers (i.e, current approaches) tolerates a single Byzantine failure. We then formulate a resilience property of the aggregation rule capturing the basic requirements to guarantee convergence despite $f$ Byzantine workers. We propose \\emph{Krum}, an aggregation rule that satisfies our resilience property, which we argue is the first provably Byzantine-resilient algorithm for distributed SGD. We also report on experimental evaluations of Krum.", "full_text": "Machine Learning with Adversaries:\nByzantine Tolerant Gradient Descent\n\nPeva Blanchard\nEPFL, Switzerland\n\npeva.blanchard@epfl.ch\n\nRachid Guerraoui\nEPFL, Switzerland\n\nrachid.guerraoui@epfl.ch\n\nEl Mahdi El Mhamdi\u2217\nEPFL, Switzerland\n\nelmahdi.elmhamdi@epfl.ch\n\nJulien Stainer\n\nEPFL, Switzerland\n\njulien.stainer@epfl.ch\n\nAbstract\n\nWe study the resilience to Byzantine failures of distributed implementations of\nStochastic Gradient Descent (SGD). So far, distributed machine learning frame-\nworks have largely ignored the possibility of failures, especially arbitrary (i.e.,\nByzantine) ones. 
Causes of failures include software bugs, network asynchrony, biases in local datasets, as well as attackers trying to compromise the entire system. Assuming a set of n workers, up to f of them Byzantine, we ask how resilient SGD can be, without limiting the dimension or the size of the parameter space. We first show that no gradient aggregation rule based on a linear combination of the vectors proposed by the workers (i.e., current approaches) tolerates a single Byzantine failure. We then formulate a resilience property of the aggregation rule capturing the basic requirements to guarantee convergence despite f Byzantine workers. We propose Krum, an aggregation rule that satisfies our resilience property, which we argue is the first provably Byzantine-resilient algorithm for distributed SGD. We also report on experimental evaluations of Krum.\n\n1 Introduction\n\nThe increasing amount of data available [6], together with the growing complexity of machine learning models [27], has led to learning schemes that require a lot of computational resources. As a consequence, most industry-grade machine-learning implementations are now distributed [1]. For example, as of 2012, Google reportedly used 16,000 processors to train an image classifier [22]. More recently, attention has been given to federated learning and federated optimization settings [15, 16, 23] with a focus on communication efficiency. However, distributing a computation over several machines (worker processes) induces a higher risk of failures. These include crashes and computation errors, stalled processes, biases in the way the data samples are distributed among the processes, but also, in the worst case, attackers trying to compromise the entire system.
The most robust system is one that tolerates Byzantine failures [17], i.e., completely arbitrary behaviors of some of the processes. A classical approach to mask failures in distributed systems is to use a state machine replication protocol [26], which however requires state transitions to be applied by all worker processes. In the case of distributed machine learning, this constraint can be translated in two ways: either (a) the processes agree on a sample of data based on which they update their local parameter vectors, or (b) they agree on how the parameter vector should be updated. In case (a), the sample of data has to be transmitted to each process, which then has to perform a heavyweight computation to update its local parameter vector. This entails communication and computational costs that defeat the entire purpose of distributing the work. In case (b), the processes have no way to check whether the chosen update for the parameter vector has indeed been computed correctly on real data: a Byzantine process could have proposed the update and may easily prevent the convergence of the learning algorithm. Neither of these solutions is satisfactory in a realistic distributed machine learning setting.\n\n∗ contact author\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn fact, most learning algorithms today rely on a core component, namely stochastic gradient descent (SGD) [4, 13], whether for training neural networks [13], regression [34], matrix factorization [12] or support vector machines [34]. In all those cases, a cost function – depending on the parameter vector – is minimized based on stochastic estimates of its gradient. Distributed implementations of SGD [33] typically take the following form: a single parameter server is in charge of updating the parameter vector, while worker processes perform the actual update estimation, based on the share of data they have access to.
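As a rough illustration of this parameter-server scheme (our own sketch, with toy names and values, not code from the paper), one synchronous round with the classical averaging aggregation can be written as:

```python
import numpy as np

def sgd_round(x, workers, aggregate, lr):
    """One synchronous round: broadcast x, collect one gradient
    estimate per worker, aggregate them, and apply the SGD step."""
    gradients = [estimate(x) for estimate in workers]
    return x - lr * aggregate(gradients)

def average(gradients):
    """The classical aggregation rule (not Byzantine resilient)."""
    return np.mean(gradients, axis=0)

# Toy cost Q(x) = ||x||^2 / 2, whose true gradient at x is x itself;
# each correct worker returns a noisy estimate of that gradient.
rng = np.random.default_rng(0)
workers = [lambda x: x + 0.01 * rng.standard_normal(x.shape) for _ in range(5)]

x = np.ones(4)
for _ in range(200):
    x = sgd_round(x, workers, average, lr=0.1)
# x is now close to the minimizer 0
```

With all workers correct, averaging the estimates reduces the variance and the iterates approach the minimizer; the rest of the paper is about what happens when some of the `workers` lie.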
More specifically, the parameter server executes learning rounds, during each of which the parameter vector is broadcast to the workers. In turn, each worker computes an estimate of the update to apply (an estimate of the gradient), and the parameter server aggregates their results to finally update the parameter vector. Today, this aggregation is typically implemented through averaging [25], or variants of it [33, 18, 31]. This paper addresses the fundamental question of how a distributed SGD can be devised to tolerate f Byzantine processes among the n workers.\n\nContributions. We first show in this paper that no linear combination (current approaches) of the updates proposed by the workers can tolerate a single Byzantine worker. Basically, a single Byzantine worker can force the parameter server to choose any arbitrary vector, even one that is too large in amplitude or too far in direction from the other vectors. Clearly, the Byzantine worker can prevent any classic averaging-based approach from converging. Choosing the appropriate aggregation of the vectors proposed by the workers turns out to be challenging. A non-linear, squared-distance-based aggregation rule, that selects, among the proposed vectors, the vector “closest to the barycenter” (for example by taking the vector that minimizes the sum of the squared distances to every other vector), might look appealing. Yet, such a squared-distance-based aggregation rule tolerates only a single Byzantine worker. Two Byzantine workers can collude, one helping the other to be selected, by moving the barycenter of all the vectors farther from the “correct area”. We formulate a Byzantine resilience property capturing sufficient conditions for the parameter server’s aggregation rule to tolerate f Byzantine workers.
Essentially, to guarantee that the cost will decrease despite Byzantine workers, we require the vector output chosen by the parameter server (a) to point, on average, in the same direction as the gradient and (b) to have statistical moments (up to the fourth moment) bounded above by a homogeneous polynomial in the moments of a correct estimator of the gradient. One way to ensure such a resilience property is to consider a majority-based approach, looking at every subset of n − f vectors, and considering the subset with the smallest diameter. While this approach is more robust to Byzantine workers that propose vectors far from the correct area, its exponential computational cost is prohibitive. Interestingly, combining the intuitions of the majority-based and squared-distance-based2 methods, we can choose the vector that is somehow the closest to its n − f neighbors, namely, the one that minimizes the sum of squared distances to its n − f closest vectors. This is the main idea behind our aggregation rule, which we call Krum.3 Assuming 2f + 2 < n, we show that Krum satisfies the aforementioned resilience property and that the corresponding machine learning scheme converges. An important advantage of Krum is its (local) time complexity, O(n² · d), linear in the dimension d of the parameter vector. (In modern machine learning, the dimension d of the parameter vector may take values in the hundreds of billions [30].) For simplicity of presentation, the version of Krum we first consider selects only one vector. We also discuss other variants. We evaluate Krum experimentally, and compare it to classical averaging. We confirm the very fact that averaging does not withstand Byzantine attacks, while Krum does.
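The failure of linear rules, formalized in Lemma 1 below, is easy to check numerically: against plain averaging, one crafted vector steers the aggregate to any target U. A minimal sketch, with all names and values our own:

```python
import numpy as np

n, d = 10, 6
rng = np.random.default_rng(1)
correct = [rng.standard_normal(d) for _ in range(n - 1)]  # honest gradient estimates

# Target the attacker wants the server to apply: here, the opposite of the
# mean of the correct vectors, scaled up.
U = -100.0 * np.mean(correct, axis=0)

# For plain averaging (all lambda_i = 1/n), proposing
# V_n = n*U - sum(correct) forces the aggregate to equal U exactly.
byzantine = n * U - np.sum(correct, axis=0)

aggregate = np.mean(correct + [byzantine], axis=0)
assert np.allclose(aggregate, U)  # the server applies the attacker's vector
```

The same construction works for any fixed linear combination with non-zero weights, since the attacker can solve for its own contribution.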
In particular, we report on attacks by omniscient adversaries – aware of a good estimate of the gradient – that send the opposite vector multiplied by a large factor, as well as attacks by adversaries that send random vectors drawn from a Gaussian distribution (the larger the variance of the distribution, the stronger the attack). We also evaluate the extent to which Krum might slow down learning (compared to averaging) when there are no Byzantine failures. Interestingly, as we show experimentally, this slowdown occurs only when the mini-batch size is close to 1. In fact, the slowdown can be drastically reduced by choosing a reasonable mini-batch size. We also evaluate Multi-Krum, a variant of Krum which, intuitively, interpolates between Krum and averaging, thereby combining the resilience properties of Krum with the convergence speed of averaging. Multi-Krum outperforms other aggregation rules like the medoid, inspired by the geometric median.\n\n2 In all this paper, distances are computed with the Euclidean norm.\n3 Krum, in Greek Κρούμος, was a Bulgarian Khan of the end of the eighth century, who undertook offensive attacks against the Byzantine empire. Bulgaria doubled in size during his reign.\n\nPaper Organization. Section 2 recalls the classical model of distributed SGD. Section 3 proves that linear combinations (solutions used today) are not resilient even to a single Byzantine worker, then introduces our new concept of (α, f)-Byzantine resilience. Section 4 introduces our Krum function, computes its computational cost and proves its (α, f)-Byzantine resilience. Section 5 analyzes the convergence of a distributed SGD using Krum. Section 6 presents our experimental evaluation of Krum. We discuss related work and open problems in Section 7.
Due to space limitations, some proofs and complementary experimental results are given as supplementary material.\n\n2 Model\n\nWe consider the general distributed system model of [1], consisting of a parameter server4 and n workers, f of them possibly Byzantine (behaving arbitrarily). Computation is divided into (infinitely many) synchronous rounds. During round t, the parameter server broadcasts its parameter vector x_t ∈ R^d to all the workers. Each correct worker p computes an estimate V_p^t = G(x_t, ξ_p^t) of the gradient ∇Q(x_t) of the cost function Q, where ξ_p^t is a random variable representing, e.g., the sample (or a mini-batch of samples) drawn from the dataset. A Byzantine worker b proposes a vector V_b^t which can deviate arbitrarily from the vector it is supposed to send if it was correct, i.e., according to the algorithm assigned to it by the system developer (see Figure 1). Since the communication is synchronous, if the parameter server does not receive a vector value V_b^t from a given Byzantine worker b, then the parameter server acts as if it had received the default value V_b^t = 0 instead.\n\nFigure 1: The gradient estimates computed by correct workers (black dashed arrows) are distributed around the actual gradient (solid arrow) of the cost function (thin black curve). A Byzantine worker can propose an arbitrary vector (red dotted arrow).\n\nThe parameter server computes a vector F(V_1^t, . . . , V_n^t) by applying a deterministic function F (aggregation rule) to the vectors received. We refer to F as the aggregation rule of the parameter server. The parameter server updates the parameter vector using the following SGD equation:\n\nx_{t+1} = x_t − γ_t · F(V_1^t, . . . , V_n^t).\n\nThe correct (non-Byzantine) workers are assumed to compute unbiased estimates of the gradient ∇Q(x_t).
More precisely, in every round t, the vectors V_i^t proposed by the correct workers are independent identically distributed random vectors, V_i^t ∼ G(x_t, ξ_i^t) with E_{ξ_i^t} G(x_t, ξ_i^t) = ∇Q(x_t). This can be achieved by ensuring that each sample of data used for computing the gradient is drawn uniformly and independently, as classically assumed in the literature of machine learning [3]. The Byzantine workers have full knowledge of the system, including the aggregation rule F as well as the vectors proposed by the workers. They can furthermore collaborate with each other [21].\n\n3 Byzantine Resilience\n\nIn most SGD-based learning algorithms used today [4, 13, 12], the aggregation rule consists in computing the average5 of the input vectors. Lemma 1 below states that no linear combination of the vectors can tolerate a single Byzantine worker. In particular, averaging is not Byzantine resilient.\n\n4 The parameter server is assumed to be reliable. Classical techniques of state-machine replication can be used to ensure this.\n5 Or a closely related rule.\n\nLemma 1. Consider an aggregation rule F_lin of the form F_lin(V_1, . . . , V_n) = Σ_{i=1}^n λ_i · V_i, where the λ_i's are non-zero scalars. Let U be any vector in R^d. A single Byzantine worker can make F_lin always select U. In particular, a single Byzantine worker can prevent convergence.\n\nProof. Immediate: if the Byzantine worker proposes V_n = (1/λ_n) · U − Σ_{i=1}^{n−1} (λ_i/λ_n) · V_i, then F_lin = U.6\n\nIn the following, we define basic requirements on an appropriate Byzantine-resilient aggregation rule. Intuitively, the aggregation rule should output a vector F that is not too far from the “real” gradient g, more precisely, the vector that points to the steepest direction of the cost function being optimized. This is expressed as a lower bound (condition (i)) on the scalar product of the (expected) vector F and g. Figure 2 illustrates the situation geometrically. If EF belongs to the ball centered at g with radius r, then the scalar product is bounded below by a term involving sin α = r/‖g‖. Condition (ii) is more technical, and states that the moments of F should be controlled by the moments of the (correct) gradient estimator G. The bounds on the moments of G are classically used to control the effects of the discrete nature of the SGD dynamics [3]. Condition (ii) allows us to transfer this control to the aggregation rule.\n\nDefinition 1 ((α, f)-Byzantine Resilience). Let 0 ≤ α < π/2 be any angular value, and 0 ≤ f ≤ n any integer. Let V_1, . . . , V_n be any independent identically distributed random vectors in R^d, V_i ∼ G, with EG = g. Let B_1, . . . , B_f be any random vectors in R^d, possibly dependent on the V_i's. An aggregation rule F is said to be (α, f)-Byzantine resilient if, for any 1 ≤ j_1 < ··· < j_f ≤ n, the vector\n\nF = F(V_1, . . . , B_1, . . . , B_f, . . . , V_n)   (with the B_k's in positions j_1, . . . , j_f)\n\nsatisfies (i) ⟨EF, g⟩ ≥ (1 − sin α) · ‖g‖² > 0 and (ii) for r = 2, 3, 4, E‖F‖^r is bounded above by a linear combination of terms E‖G‖^{r_1} · · · E‖G‖^{r_{n−1}} with r_1 + ··· + r_{n−1} = r.\n\n4 The Krum Function\n\nWe now introduce Krum, our aggregation rule, which, we show, satisfies the (α, f)-Byzantine resilience condition. The barycentric aggregation rule F_bary = (1/n) Σ_{i=1}^n V_i can be defined as the vector in R^d that minimizes the sum of squared distances7 to the V_i's, Σ_{i=1}^n ‖F_bary − V_i‖². Lemma 1, however, states that this approach does not tolerate even a single Byzantine failure. One could try to select the vector U among the V_i's which minimizes the sum Σ_i ‖U − V_i‖², i.e., which is “closest to all vectors”.\n\nFigure 2: If ‖EF − g‖ ≤ r then ⟨EF, g⟩ is bounded below by (1 − sin α)‖g‖² where sin α = r/‖g‖.\n\nHowever, because such a sum involves all the vectors, even those which are very far, this approach does not tolerate Byzantine workers: by proposing large enough vectors, a Byzantine worker can force the total barycenter to get closer to the vector proposed by another Byzantine worker. Our approach to circumvent this issue is to preclude the vectors that are too far away. More precisely, we define our Krum aggregation rule KR(V_1, . . . , V_n) as follows. For any i ≠ j, we denote by i → j the fact that V_j belongs to the n − f − 2 closest vectors to V_i. Then, we define for each worker i the score s(i) = Σ_{i→j} ‖V_i − V_j‖², where the sum runs over the n − f − 2 closest vectors to V_i. Finally, KR(V_1, . . . , V_n) = V_{i∗} where i∗ refers to the worker minimizing the score, s(i∗) ≤ s(i) for all i.8\n\nLemma 2. The expected time complexity of the Krum function KR(V_1, . . . , V_n), where V_1, . . . , V_n are d-dimensional vectors, is O(n² · d).\n\n6 Note that the parameter server could cancel the effects of the Byzantine behavior by setting, for example, λ_n to 0. This however requires means to detect which worker is Byzantine.\n7 Removing the square of the distances leads to the geometric median; we discuss this when optimizing Krum.\n8 If two or more workers have the minimal score, we choose the one with the smallest identifier.\n\nProof. For each V_i, the parameter server computes the n squared distances ‖V_i − V_j‖² (time O(n · d)). Then the parameter server selects the first n − f − 1 of these distances (expected time O(n) with Quickselect) and sums their values (time O(n · d)). Thus, computing the score of all the V_i's takes O(n² · d). An additional term O(n) is required to find the minimum score, but is negligible relative to O(n² · d).\n\nProposition 1 below states that, if 2f + 2 < n and the gradient estimator is accurate enough (its standard deviation is relatively small compared to the norm of the gradient), then the Krum function is (α, f)-Byzantine resilient, where the angle α depends on the ratio of the deviation over the gradient.\n\nProposition 1. Let V_1, . . . , V_n be any independent and identically distributed random d-dimensional vectors s.t. V_i ∼ G, with EG = g and E‖G − g‖² = dσ². Let B_1, . . . , B_f be any f random vectors, possibly dependent on the V_i's.
If 2f + 2 < n and η(n, f) · √d · σ < ‖g‖, where\n\nη(n, f) := √( 2 ( n − f + (f · (n − f − 2) + f² · (n − f − 1)) / (n − 2f − 2) ) ),\n\nwhich is O(n) if f = O(n) and O(√n) if f = O(1), then the Krum function KR is (α, f)-Byzantine resilient, where 0 ≤ α < π/2 is defined by\n\nsin α = η(n, f) · √d · σ / ‖g‖.\n\nThe condition on the norm of the gradient, η(n, f) · √d · σ < ‖g‖, can be satisfied, to a certain extent, by having the (correct) workers compute their gradient estimates on mini-batches [3]. Indeed, averaging the gradient estimates over a mini-batch divides the deviation σ by the square root of the size of the mini-batch. For the sake of concision, we only give here a sketch of the proof. (We give the detailed proof in the supplementary material.)\n\nProof. (Sketch) Without loss of generality, we assume that the Byzantine vectors B_1, . . . , B_f occupy the last f positions in the list of arguments of KR, i.e., KR = KR(V_1, . . . , V_{n−f}, B_1, . . . , B_f). Let i∗ be the index of the vector chosen by the Krum function. We focus on condition (i) of (α, f)-Byzantine resilience (Definition 1). Consider first the case where V_{i∗} = V_i ∈ {V_1, . . .
, V_{n−f}} is a vector proposed by a correct process. The first step is to compare the vector V_i with the average of the correct vectors V_j such that i → j. Let δ_c(i) be the number of such V_j's. Then\n\nE‖ V_i − (1/δ_c(i)) Σ_{i→ correct j} V_j ‖²  ≤  (1/δ_c(i)) Σ_{i→ correct j} E‖V_i − V_j‖²  ≤  2dσ².   (1)\n\nThe last inequality holds because the right-hand side of the first inequality involves only vectors proposed by correct processes, which are mutually independent and follow the distribution of G.\n\nNow, consider the case where V_{i∗} = B_k ∈ {B_1, . . . , B_f} is a vector proposed by a Byzantine process. The fact that k minimizes the score implies that, for all indices i of vectors proposed by correct processes,\n\nΣ_{k→ correct j} ‖B_k − V_j‖² + Σ_{k→ byz l} ‖B_k − B_l‖²  ≤  Σ_{i→ correct j} ‖V_i − V_j‖² + Σ_{i→ byz l} ‖V_i − B_l‖².\n\nThen, for all indices i of vectors proposed by correct processes,\n\n‖ B_k − (1/δ_c(k)) Σ_{k→ correct j} V_j ‖²  ≤  (1/δ_c(k)) ( Σ_{i→ correct j} ‖V_i − V_j‖² + D²(i) ),\n\nwhere D²(i) := Σ_{i→ byz l} ‖V_i − B_l‖² is the only term involving vectors proposed by Byzantine processes. However, the correct process i has n − f − 2 neighbors and f + 1 non-neighbors. Therefore, there exists a correct process ζ(i) which is farther from i than every neighbor j of i (including the Byzantine neighbors). In particular, for all l such that i → l, ‖V_i − B_l‖² ≤ ‖V_i − V_{ζ(i)}‖². Thus\n\n‖ B_k − (1/δ_c(k)) Σ_{k→ correct j} V_j ‖²  ≤  (1/δ_c(k)) Σ_{i→ correct j} ‖V_i − V_j‖² + ((n − f − 2 − δ_c(i))/δ_c(k)) · ‖V_i − V_{ζ(i)}‖².   (2)\n\nCombining equations (1) and (2) with a union bound yields ‖EKR − g‖ ≤ η(n, f) · √d · σ ≤ sin α · ‖g‖, which, in turn, implies ⟨EKR, g⟩ ≥ (1 − sin α)‖g‖². Condition (ii) is proven by bounding the moments of KR with moments of the vectors proposed by the correct processes only, using the same technique as above. The full proof is provided in the supplementary material.\n\n5 Convergence Analysis\n\nIn this section, we analyze the convergence of the SGD using our Krum function defined in Section 4. The SGD equation is expressed as follows:\n\nx_{t+1} = x_t − γ_t · KR(V_1^t, . . . , V_n^t),\n\nwhere at least n − f vectors among the V_i^t's are correct, while the other ones may be Byzantine. For a correct index i, V_i^t = G(x_t, ξ_i^t) where G is the gradient estimator. We define the local standard deviation σ(x) by\n\nd · σ²(x) = E‖G(x, ξ) − ∇Q(x)‖².\n\nThe following proposition considers an (a priori) non-convex cost function. In the context of non-convex optimization, even in the centralized case, it is generally hopeless to aim at proving that the parameter vector x_t tends to a local minimum. Many criteria may be used instead.
We follow [3], and we prove that the parameter vector x_t almost surely reaches a “flat” region (where the norm of the gradient is small), in a sense explained below.\n\nProposition 2. We assume that (i) the cost function Q is three times differentiable with continuous derivatives, and is non-negative, Q(x) ≥ 0; (ii) the learning rates satisfy Σ_t γ_t = ∞ and Σ_t γ_t² < ∞; (iii) the gradient estimator satisfies EG(x, ξ) = ∇Q(x) and, for all r ∈ {2, 3, 4}, E‖G(x, ξ)‖^r ≤ A_r + B_r‖x‖^r for some constants A_r, B_r; (iv) there exists a constant 0 ≤ α < π/2 such that, for all x,\n\nη(n, f) · √d · σ(x) ≤ ‖∇Q(x)‖ · sin α;\n\n(v) finally, beyond a certain horizon, ‖x‖² ≥ D, there exist ε > 0 and 0 ≤ β < π/2 − α such that ‖∇Q(x)‖ ≥ ε > 0 and ⟨x, ∇Q(x)⟩ / (‖x‖ · ‖∇Q(x)‖) ≥ cos β. Then the sequence of gradients ∇Q(x_t) converges almost surely to zero.\n\nConditions (i) to (iv) are the same conditions as in the non-convex convergence analysis in [3]. Condition (v) is a slightly stronger condition than the corresponding one in [3], and states that, beyond a certain horizon, the cost function Q is “convex enough”, in the sense that the direction of the gradient is sufficiently close to the direction of the parameter vector x. Condition (iv), however, states that the gradient estimator used by the correct workers has to be accurate enough, i.e., the local standard deviation should be small relative to the norm of the gradient. Of course, the norm of the gradient tends to zero near, e.g., extremal and saddle points.
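To make the rule used in the iterations above concrete, the Krum function KR can be sketched as follows (a direct transcription of the definition in Section 4; variable names and the toy scenario are ours):

```python
import numpy as np

def krum(vectors, f):
    """Select the vector whose summed squared distance to its
    n - f - 2 closest peers is smallest (the score s(i) of Section 4)."""
    n = len(vectors)
    assert 2 * f + 2 < n, "Krum requires 2f + 2 < n"
    V = np.asarray(vectors)
    # Pairwise squared Euclidean distances (the O(n^2 * d) step).
    sq_dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    scores = []
    for i in range(n):
        d_i = np.delete(sq_dists[i], i)        # distances to the other n - 1 vectors
        d_i.sort()
        scores.append(d_i[: n - f - 2].sum())  # n - f - 2 closest neighbors
    i_star = int(np.argmin(scores))            # smallest identifier wins ties
    return V[i_star]

# 8 honest estimates near the true gradient g, plus 2 Byzantine outliers.
rng = np.random.default_rng(2)
g = np.array([1.0, -2.0, 0.5])
proposals = [g + 0.1 * rng.standard_normal(3) for _ in range(8)]
proposals += [np.full(3, 1e6), np.full(3, -1e6)]
chosen = krum(proposals, f=2)
assert np.linalg.norm(chosen - g) < 1.0  # a vector close to g is selected
```

The outliers receive enormous scores because their closest neighbors are still far away, so Krum always returns one of the clustered honest vectors.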
Actually, the ratio η(n, f) · √d · σ/‖∇Q‖ controls the maximum angle between the gradient ∇Q and the vector chosen by the Krum function. In the regions where ‖∇Q‖ < η(n, f) · √d · σ, the Byzantine workers may take advantage of the noise (measured by σ) in the gradient estimator G to bias the choice of the parameter server. Therefore, Proposition 2 is to be interpreted as follows: in the presence of Byzantine workers, the parameter vector x_t almost surely reaches a basin around points where the gradient is small (‖∇Q‖ ≤ η(n, f) · √d · σ), i.e., points where the cost landscape is “almost flat”. Note that the convergence analysis is based only on the fact that the function KR is (α, f)-Byzantine resilient. The complete proof of Proposition 2 is deferred to the supplementary material.\n\nFigure 3: Condition on the angles between x_t, ∇Q(x_t) and EKR^t, in the region ‖x_t‖² > D.\n\nFigure 4: Cross-validation error evolution with rounds, respectively in the absence and in the presence of 33% Byzantine workers. The mini-batch size is 3. With 0% Gaussian Byzantine workers, averaging converges faster than Krum. With 33% Gaussian Byzantine workers, averaging does not converge, whereas Krum behaves as if there were 0% Byzantine workers.\n\n6 Experimental Evaluation\n\nWe report here on the evaluation of the convergence and resilience properties of Krum, as well as an optimized variant of it. (We also discuss other variants of Krum in the supplementary material.)\n\n(Resilience to Byzantine processes). We consider the task of spam filtering (dataset spambase [19]). The learning model is a multi-layer perceptron (MLP) with two hidden layers. There are n = 20 worker processes.
Byzantine processes propose vectors drawn from a Gaussian distribution\nwith mean zero, and isotropic covariance matrix with standard deviation 200. We refer to this behavior\nas Gaussian Byzantine. Each (correct) worker estimates the gradient on a mini-batch of size 3. We\nmeasure the error using cross-validation. Figure 4 shows how the error (y-axis) evolves with the\nnumber of rounds (x-axis).\nIn the \ufb01rst plot (left), there are no Byzantine workers. Unsurprisingly, averaging converges faster\nthan Krum. In the second plot (right), 33% of the workers are Gaussian Byzantine. In this case,\naveraging does not converge at all, whereas Krum behaves as if there were no Byzantine workers.\nThis experiment con\ufb01rms that averaging does not tolerate (the rather mild) Gaussian Byzantine\nbehavior, whereas Krum does.\n\n(The Cost of Resilience). As seen above, Krum slows down learning when there are no Byzantine\nworkers. The following experiment shows that this overhead can be signi\ufb01cantly reduced by slightly\nincreasing the mini-batch size. To highlight the effect of the presence of Byzantine workers, the\nByzantine behavior has been set as follows: each Byzantine worker computes an estimate of the\ngradient over the whole dataset (yielding a very accurate estimate of the gradient), and proposes the\nopposite vector, scaled to a large length. We refer to this behavior as omniscient.\nFigure 5 displays how the error value at the 500-th round (y-axis) evolves when the mini-batch size\nvaries (x-axis). In this experiment, we consider the tasks of spam \ufb01ltering (dataset spambase) and\nimage classi\ufb01cation (dataset MNIST). The MLP model is used in both cases. Each curve is obtained\nwith either 0 or 45% of omniscient Byzantine workers.\nIn all cases, averaging still does not tolerate Byzantine workers, but yields the lowest error when\nthere are no Byzantine workers. 
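The two adversarial behaviors used in these experiments can be sketched as follows (our own toy reconstruction; the standard deviation 200 matches the text, while the scaling factor `scale` is our assumption, since only "a large length" is specified):

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_byzantine(d, sigma=200.0):
    """'Gaussian Byzantine' worker: a random vector with mean zero and an
    isotropic covariance matrix with standard deviation sigma."""
    return sigma * rng.standard_normal(d)

def omniscient_byzantine(true_gradient, scale=1e3):
    """'Omniscient' worker: a very accurate gradient estimate,
    negated and scaled to a large length (scale is our assumption)."""
    return -scale * np.asarray(true_gradient)
```

In the omniscient attack, the adversary estimates the gradient over the whole dataset, so `true_gradient` stands for that accurate estimate.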
However, once the size of the mini-batch reaches the value 20, Krum with 45% omniscient Byzantine workers is as accurate as averaging with 0% Byzantine workers. We observe a similar pattern for a ConvNet, as shown in the supplementary material.\n\n(Multi-Krum). For the sake of presentation simplicity, we considered a version of Krum which selects only one vector among the vectors proposed by the workers. We also propose a variant of Krum, which we call Multi-Krum. Multi-Krum computes, for each proposed vector, the score as in the Krum function. Then, Multi-Krum selects the m ∈ {1, . . . , n} vectors V∗_1, . . . , V∗_m which score the best, and outputs their average (1/m) Σ_i V∗_i. Note that the cases m = 1 and m = n correspond to Krum and averaging, respectively.\n\nFigure 5: Cross-validation error at round 500 when increasing the mini-batch size. (The two figures on the right are zoomed versions of the two on the left.) With a reasonably large mini-batch size (around 10 for MNIST and 30 for Spambase), Krum with 45% omniscient Byzantine workers is as accurate as averaging with 0% Byzantine workers.\n\nFigure 6: Cross-validation error evolution with rounds. The mini-batch size is 3. Multi-Krum with 33% Gaussian Byzantine workers converges as fast as averaging with 0% Byzantine workers.\n\nFigure 6 shows how the error (y-axis) evolves with the number of rounds (x-axis). In the figure, we consider the task of spam filtering (dataset spambase) and the MLP model (the same comparison is done for the task of image classification with a ConvNet and is provided in the supplementary material). The Multi-Krum parameter m is set to m = n − f.
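Multi-Krum can be sketched as a small extension of the Krum scoring pass (again our own sketch; `krum_scores` recomputes the score s(i) defined in Section 4):

```python
import numpy as np

def krum_scores(V, f):
    """Score s(i): summed squared distance to the n - f - 2 closest peers."""
    n = len(V)
    sq_dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    scores = np.empty(n)
    for i in range(n):
        d_i = np.sort(np.delete(sq_dists[i], i))
        scores[i] = d_i[: n - f - 2].sum()
    return scores

def multi_krum(vectors, f, m):
    """Average the m best-scoring proposals; m = 1 is Krum, m = n is averaging."""
    V = np.asarray(vectors)
    best = np.argsort(krum_scores(V, f))[:m]
    return V[best].mean(axis=0)

# n = 10, f = 3, m = n - f = 7: the outliers score far worse than the
# clustered honest vectors, so they are excluded from the average.
rng = np.random.default_rng(4)
g = np.array([0.5, -1.0])
proposals = [g + 0.05 * rng.standard_normal(2) for _ in range(7)]
proposals += [np.array([1e5, 1e5])] * 3
out = multi_krum(proposals, f=3, m=7)
assert np.linalg.norm(out - g) < 0.5
```

Averaging the m selected vectors recovers some of the variance reduction of plain averaging, which is the trade-off the parameter m controls.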
Figure 6 shows that Multi-Krum with
33% Byzantine workers is as efficient as averaging with 0% Byzantine workers.
From the practitioner's perspective, the parameter m may be used to set a specific trade-off between
convergence speed and resilience to Byzantine workers.

7 Concluding Remarks

(The Distributed Computing Perspective). Although seemingly related, results in d-dimensional
approximate agreement [24, 14] cannot be applied to our Byzantine-resilient machine learning context
for the following reasons: (a) [24, 14] assume that the set of vectors that can be proposed to an instance
of the agreement is bounded, so that at least f + 1 correct workers propose the same vector, which
would require a lot of redundant work in our setting; and, more importantly, (b) [24] requires a local
computation by each worker that is in O(nd). While this cost seems reasonable for small dimensions,
such as, e.g., mobile robots meeting in a 2D or 3D space, it becomes a real issue in the context
of machine learning, where d may be as high as 160 billion [30] (making d a crucial parameter
when considering complexities, either for local computations or for communication rounds). The
expected time complexity of the Krum function is O(n² · d). A closer approach to ours has been
recently proposed in [28, 29]. In [28], the study only deals with parameter vectors of dimension
one, which is too restrictive for today's multi-dimensional machine learning. In [29], the authors
tackle a multi-dimensional situation, using an iterated approximate Byzantine agreement that reaches
consensus asymptotically. This is, however, only achieved on a finite set of possible environmental
states and cannot be used in the continuous context of stochastic gradient descent.

(The Statistics and Machine Learning View).
Our work looks at the resilience of the aggregation
rule using ideas close to those of [11], and to classical ideas from theoretical statistics on
the robustness of the geometric median and the notion of breakdown [7]. The closest concept to a
breakdown in our work is the maximum fraction of Byzantine workers that can be tolerated, i.e., (n − 2)/(2n),
which reaches the optimal theoretical value (1/2) asymptotically in n. It is known that the geometric
median does achieve the optimal breakdown. However, no closed form nor an exact algorithm to
compute the geometric median is known (only approximations are available [5], and their Byzantine
resilience is an open problem). An easily computable variant of the median is the Medoid, which is
the proposed vector minimizing the sum of distances to all other proposed vectors. The Medoid can
be computed with an algorithm similar to Krum. We show, however, in the supplementary material that
the implementation of the Medoid is outperformed by Multi-Krum.

(Robustness Within the Model). It is important to keep in mind that this work deals with robustness
from a coarse-grained perspective: the unit of failure is a worker, receiving its copy of the model and
estimating gradients, based on either local data or delegated data from a server.
The nature of the
model itself is not important: the distributed system can be training models spanning a large range,
from simple regression to deep neural networks. As long as this training relies on gradient-based
learning, our algorithm to aggregate gradients, Krum, provably ensures convergence when a simple
majority of workers is not compromised by an attacker.
A natural question to consider is the fine-grained view: is the model itself robust to internal
perturbations? In the case of neural networks, this question can somehow be tied to neuroscience
considerations: could some neurons and/or synapses misbehave individually without harming the
global outcome? We formulated this question in another work and proved a tight upper bound on the
resulting global error when a set of nodes is removed or is misbehaving [8]. One of the many practical
consequences [9] of such a fine-grained view is the understanding of memory cost reduction trade-offs
in deep learning. Such memory cost reduction can be viewed as the introduction of precision errors
at the level of each neuron and/or synapse [8].
Other approaches to robustness within the model tackled adversarial situations in machine learning
with a focus on adversarial examples (during inference) [10, 32, 11] instead of adversarial gradients
(during training), as we did for Krum. Robustness to adversarial input can be viewed through the
fine-grained lens we introduced in [8]: for instance, one can see perturbations of pixels in the
inputs as perturbations of neurons in layer zero. It is important to note the orthogonality and
complementarity between the fine-grained (model/input units) and the coarse-grained (gradient
aggregation) approaches. Being robust, as a model, either to adversarial examples or to internal
perturbations does not necessarily imply robustness to adversarial gradients during training.
Similarly,
being distributively trained with a robust aggregation scheme such as Krum does not necessarily
imply robustness to internal errors of the model or adversarial input perturbations that would occur
later, during inference. Indeed, the theory we develop in the present work is agnostic to the
model being trained and to the technology of the hardware hosting it, as long as there are gradients to be
aggregated.

Acknowledgment. The authors would like to thank Georgios Damaskinos and Rhicheek Patra from
the Distributed Computing group at EPFL for kindly providing their distributed machine learning
framework, on top of which we could test our algorithm, Krum, and its variants described in this
work. Further implementation details and additional experiments will be posted in the lab's GitHub
repository [20]. The authors would also like to thank Saad Benjelloun, Lê Nguyen Hoang and
Sébastien Rouault for fruitful comments. This work has been supported in part by the European
ERC (Grant 339539 - AOC) and by the Swiss National Science Foundation (Grant 200021_169588
TARBDA). A preliminary version of this work appeared as a brief announcement at the 36th
ACM Symposium on Principles of Distributed Computing [2].

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,
M. Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the
12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah,
Georgia, USA, 2016.

[2] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer. Brief announcement: Byzantine-tolerant
machine learning. In Proceedings of the ACM Symposium on Principles of Distributed
Computing, PODC '17, pages 455–457, New York, NY, USA, 2017. ACM.

[3] L. Bottou. Online learning and stochastic approximations. Online Learning in Neural Networks,
17(9):142, 1998.

[4] L. Bottou.
Large-scale machine learning with stochastic gradient descent. In Proceedings of
COMPSTAT'2010, pages 177–186. Springer, 2010.

[5] M. B. Cohen, Y. T. Lee, G. Miller, J. Pachocki, and A. Sidford. Geometric median in nearly linear
time. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing,
pages 9–21. ACM, 2016.

[6] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang,
Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information
Processing Systems, pages 1223–1231, 2012.

[7] D. L. Donoho and P. J. Huber. The notion of breakdown point. In A Festschrift for Erich L.
Lehmann, pages 157–184, 1983.

[8] E. M. El Mhamdi and R. Guerraoui. When neurons fail. In 2017 IEEE International Parallel
and Distributed Processing Symposium (IPDPS), pages 1028–1037, May 2017.

[9] E. M. El Mhamdi, R. Guerraoui, and S. Rouault. On the robustness of a neural network. In
2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pages 84–93, Sept 2017.

[10] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classifiers: from adversarial
to random noise. In Advances in Neural Information Processing Systems, pages 1624–1632,
2016.

[11] J. Feng, H. Xu, and S. Mannor. Outlier robust online learning. arXiv preprint arXiv:1701.00251,
2017.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with
distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 69–77. ACM, 2011.

[13] S. S. Haykin. Neural Networks and Learning Machines, volume 3. Pearson, Upper Saddle River,
NJ, USA, 2009.

[14] M. Herlihy, S. Rajsbaum, M. Raynal, and J. Stainer. Computing in the presence of concurrent
solo executions.
In Latin American Symposium on Theoretical Informatics, pages 214–225.
Springer, 2014.

[15] J. Konečný, B. McMahan, and D. Ramage. Federated optimization: Distributed optimization
beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.

[16] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated
learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492,
2016.

[17] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on
Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.

[18] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex
optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

[19] M. Lichman. UCI machine learning repository, 2013.

[20] LPD-EPFL. The implementation is part of a larger distributed framework to run SGD in a reliable
distributed fashion and will be released in the GitHub repository of the distributed computing
group at EPFL, https://github.com/lpd-epfl.

[21] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.

[22] J. Markoff. How many computers to identify a cat? 16,000. New York Times, pages 06–25,
2012.

[23] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient
learning of deep networks from decentralized data. In Artificial Intelligence and Statistics,
pages 1273–1282, 2017.

[24] H. Mendes and M. Herlihy. Multidimensional approximate agreement in Byzantine asynchronous
systems. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of
Computing, pages 391–400. ACM, 2013.

[25] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM
Journal on Control and Optimization, 30(4):838–855, 1992.

[26] F. B. Schneider.
Implementing fault-tolerant services using the state machine approach: A
tutorial. ACM Computing Surveys (CSUR), 22(4):299–319, 1990.

[27] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in
Neural Information Processing Systems, pages 2377–2385, 2015.

[28] L. Su and N. H. Vaidya. Fault-tolerant multi-agent optimization: optimal iterative distributed
algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing,
pages 425–434. ACM, 2016.

[29] L. Su and N. H. Vaidya. Non-Bayesian learning in the presence of Byzantine agents. In
International Symposium on Distributed Computing, pages 414–427. Springer, 2016.

[30] A. Trask, D. Gilmore, and M. Russell. Modeling order in neural word embeddings at scale. In
ICML, pages 2266–2275, 2015.

[31] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic
gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812,
1986.

[32] B. Wang, J. Gao, and Y. Qi. A theoretical framework for robustness of (deep) classifiers under
adversarial noise. arXiv preprint arXiv:1612.00334, 2016.

[33] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In
Advances in Neural Information Processing Systems, pages 685–693, 2015.

[34] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent
algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning,
page 116. ACM, 2004.