{"title": "A Little Is Enough: Circumventing Defenses For Distributed Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8635, "page_last": 8645, "abstract": "Distributed learning is central for large-scale training of deep-learning models. However, it is exposed to a security threat in which Byzantine participants can interrupt or control the learning process. Previous attack models assume that the rogue participants (a) are omniscient (know the data of all other participants), and (b) introduce large changes to the parameters. \nAccordingly, most defense mechanisms make a similar assumption and attempt to use statistically robust methods to identify and discard values whose reported gradients are far from the population mean. We observe that if the empirical variance between the gradients of workers is high enough, an attacker could take advantage of this and launch a non-omniscient attack that operates within the population variance. We show that the variance is indeed high enough even for simple datasets such as MNIST, allowing an attack that is not only undetected by existing defenses, but also uses their power against them, causing those defense mechanisms to consistently select the byzantine workers while discarding legitimate ones. We demonstrate our attack method works not only for preventing convergence but also for repurposing of the model behavior (``backdooring''). We show that less than 25\\% of colluding workers are sufficient to degrade the accuracy of models trained on MNIST, CIFAR10 and CIFAR100 by 50\\%, as well as to introduce backdoors without hurting the accuracy for MNIST and CIFAR10 datasets, but with a degradation for CIFAR100.", "full_text": "A Little Is Enough:\n\nCircumventing Defenses For Distributed Learning\n\nMoran Baruch 1\n\nGilad Baruch 1\n\nmoran.baruch@biu.ac.il\n\ngilad.baruch@biu.ac.il\n\nYoav Goldberg 1 2\nyogo@cs.biu.ac.il\n\n1 Dept. 
of Computer Science, Bar Ilan University, Israel\n\n2 The Allen Institute for Arti\ufb01cial Intelligence\n\nAbstract\n\nDistributed learning is central for large-scale training of deep-learning models.\nHowever, it is exposed to a security threat in which Byzantine participants can\ninterrupt or control the learning process. Previous attack models assume that the\nrogue participants (a) are omniscient (know the data of all other participants),\nand (b) introduce large changes to the parameters. Accordingly, most defense\nmechanisms make a similar assumption and attempt to use statistically robust\nmethods to identify and discard values whose reported gradients are far from the\npopulation mean. We observe that if the empirical variance between the gradients\nof workers is high enough, an attacker could take advantage of this and launch\na non-omniscient attack that operates within the population variance. We show\nthat the variance is indeed high enough even for simple datasets such as MNIST,\nallowing an attack that is not only undetected by existing defenses, but also uses\ntheir power against them, causing those defense mechanisms to consistently select\nthe byzantine workers while discarding legitimate ones. We demonstrate our attack\nmethod works not only for preventing convergence but also for repurposing of the\nmodel behavior (\u201cbackdooring\u201d). 
We show that less than 25% of colluding workers are sufficient to degrade the accuracy of models trained on MNIST, CIFAR10 and CIFAR100 by 50%, as well as to introduce backdoors without hurting the accuracy for MNIST and CIFAR10 datasets, but with a degradation for CIFAR100.\n\n1 Introduction\n\nDistributed Learning has become a widespread framework for large-scale model training [1, 3, 10, 17, 18, 23, 31], in which a server leverages the compute power of many devices by aggregating local models trained on each of the devices.\n\nA popular class of distributed learning algorithms is Synchronous Stochastic Gradient Descent (sync-SGD), using a single server (called Parameter Server - PS) and n workers, also called nodes [17, 18]. In each round, each worker trains a local model on its own device with a different chunk of the dataset, and shares the final gradients with the PS. The PS then aggregates the gradients of the different workers and starts another round by sharing the resulting combined parameters with the workers. The structure of the network (number of layers, types, sizes etc.) is agreed between all workers beforehand.\n\nWhile effective in a sterile environment, a major risk emerges regarding the correctness of the learned model upon facing even a single Byzantine worker [5]. Such participants may deviate from the protocol either innocently, for example due to faulty communication, numerical errors or crashed devices, or adversarially, in which case the Byzantine output is carefully crafted to maximize its effect on the network.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe consider malicious Byzantine workers, where an attacker controls either the devices themselves, or even only the communication between the participants and the PS, for example by a Man-In-The-Middle (MITM) attack. 
Both attacks and defenses have been explored in the literature [5, 11, 24, 28, 29].\n\nAt the very heart of distributed learning lies the assumption that the parameters of the trained network across the workers are independent and identically distributed (i.i.d.) [5, 9, 29]. This assumption allows the averaging of different models to yield a good estimator for the desired parameters, and is also the basis for most defense mechanisms, which try to recover the original mean after clearing away the byzantine values. Existing defenses claim to be resilient even when the attacker is omniscient [5, 11, 28] and can observe the data of all the workers. Likewise, most existing defenses for distributed learning [5, 11, 28, 29] work under the assumption that changes which are upper-bounded by an order of the variance of the correct workers cannot satisfy a malicious objective. This assumption is supported by the fact that SGD converges better with a little random noise [21, 25, 13]. Given that assumption, those defenses use statistics-based methods to clear away the large changes and thereby prevent attacks.\n\nWe show that this assumption is incorrect: the empirical variance between the different workers is high enough that by carefully crafting byzantine values that are as far as possible from the correct ones, yet within the bounds tolerated by existing defenses, we are capable of defeating all state-of-the-art defenses and interfering with or gaining control over the training process. Moreover, while most (but not all) previous attacks focused on preventing the convergence of the training process, we demonstrate a wider range of attacks and also support introducing backdoors into the resulting model, which are samples that will produce the attacker’s desired output, regardless of their true label. 
Lastly, by exploiting the i.i.d. assumption we introduce a non-omniscient attack in which the attacker only has access to the data of the corrupted workers.1\n\nContributions We present a new approach for attacking distributed learning with these properties:\n1. We provide a perturbation range in which the attacker can change the parameters without being detected even in i.i.d. settings.\n2. Changes within this range are sufficient for both interfering with the learning process and for backdooring the system.\n3. We propose the first non-trivial non-omniscient attack in i.i.d. settings applicable for distributed learning, making the attack stronger and more practical.\n4. The same configuration of the attack overcomes all state-of-the-art statistics-based defenses.\n\n2 Background\n\nDistributed training uses the Synchronous SGD protocol, presented in Algorithm 1.\n\nAlgorithm 1: Synchronous SGD\n1 P^1 ← Randomly initiate the parameters in the server.\n2 for round t ∈ [T] do\n3   The server sends P^t to all n workers.\n4   for each worker i ∈ [n] do\n5     Set P^t as initial parameters and train locally using own data chunk.\n6     Return final parameters p_i^{t+1} to the server.2\n7   P^{t+1} ← AggregationRule({p_i^{t+1} : i ∈ [n]})\n8 return P^t that maximized accuracy on the test set.\n\n1 An exception to this line of defense is DRACO [7]. DRACO takes a different approach to defending against byzantine workers, and uses coding-based methods to ensure the removal of byzantine values, including our attack. A longer discussion of this method and its real-world applicability can be found in Section 2.2.\n2 In the original protocol, as in our experiments, the gradients are the ones to be published and aggregated.\n\nThe attacker interferes with the process at the time that maximizes its effect, that is, between lines 5 and 6. 
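For concreteness, one round of this protocol can be sketched as follows (an illustrative Python sketch, not the code used in the experiments; all names are ours):

```python
import numpy as np

# One round of sync-SGD from the parameter server's point of view.
# Each "worker" is modeled as a callable that receives the current
# parameters and reports back its locally computed gradient.
def sync_sgd_round(params, workers, aggregation_rule, lr=0.1):
    reported = [w(params) for w in workers]   # lines 4-6: workers report
    update = aggregation_rule(reported)       # line 7: server aggregates
    return params - lr * update               # server applies one SGD step

# Plain averaging, i.e. the "No Defense" aggregation rule.
def average(reported):
    return np.mean(reported, axis=0)
```

For example, with workers whose loss is the quadratic ½‖p‖² (gradient simply p), one round shrinks the parameters toward zero; defenses below only swap out `aggregation_rule`.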
During this time, the attacker can use the corrupted workers’ parameters p_i^{t+1} and replace them with whatever values it desires to send to the server. Attack methods differ in the way in which they set the parameter values, and defense methods attempt to identify corrupted parameters and discard them.\nAlgorithm 1 aggregates the workers’ values using averaging (AggregationRule() in line 7). Some defense methods change this aggregation rule, as explained below.\n\nNotation. All existing defenses work on each round separately, so for the sake of readability we will drop the round notation (t). For the rest of the paper we will use the following notation: n is the total number of workers, m is the number of malicious workers, d is the number of dimensions (parameters) of the model, p_i is the vector of parameters trained by worker i, (p_i)_j is its jth dimension, and P is {p_i : i ∈ [n]}.\n\n2.1 Malicious Objectives\n\nConvergence Prevention is the attack which most of the existing literature on distributed learning with byzantine workers focuses on [5, 11, 28]. In this case, the attacker interferes with the process with the sole desire of obstructing the server from reaching good accuracy. In this type of attack the server is aware of the attack and, in a real-world scenario, is likely to take actions to mitigate it, for example by actively blocking subsets of the workers and observing the effect on the training process.\nBackdooring [4, 8, 19] is an attack in which the attacker manipulates the model at training time so that it will produce the attacker-chosen target at inference time. The backdoor can be either a single sample, e.g. falsely classifying a specific person as another, or it can be a class of samples, e.g. setting a specific pattern of pixels in an image to cause it to be classified maliciously. Bagdasaryan et al. 
[2] demonstrated a backdooring attack on federated learning by making the attacker optimize for a model with the backdoor while adding a term to the loss that keeps the new parameters close to the original ones. Their attack has the benefits of requiring only a few corrupted workers, as well as being non-omniscient. However, it does not work for distributed training: in federated learning each worker uses its own private data, coming from a different distribution, negating the i.i.d. assumption [20, 14] and making the attack easier, as it pulls the ground from under the fundamental assumption of all existing defenses for distributed learning. In [12], Fung et al. proposed a defense against backdoors in federated learning, but like the attack above it heavily relies on the non-i.i.d. property of the data, which does not hold for distributed training.\nA few defenses aimed at detecting backdoors have been proposed [26, 22, 6, 27], but those defenses assume single-server training in which the backdoor is injected into the training set, to which the server has access, so that by clustering or other techniques the backdoor samples can be found and removed from the training set. In contrast, in our setting, the server has no control over the samples which the workers adversarially decide to train with, rendering those defenses inoperable. Finally, [24] demonstrate a method for circumventing backdooring attacks on distributed training. As discussed below, the method is a variant of the Trimmed Mean defense, which we successfully evade.\n\n2.2 Coding-Based Defenses For Byzantine Workers\n\nWhile all other defenses take a statistics-based approach, the authors of DRACO [7] suggest a coding-based defense. It is achieved by having the PS send each chunk of data to multiple workers and use majority voting to find the correct evaluation of each chunk. 
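The replication-and-majority idea can be illustrated with a toy sketch (ours, not DRACO's actual encoding/decoding scheme, which uses more efficient redundancy):

```python
import numpy as np
from collections import Counter

# Toy sketch: a data chunk is evaluated by r = 2m + 1 workers, so with
# at most m corrupted replicas the honest value always holds a strict
# majority, and the Byzantine values are removed exactly.
def majority_gradient(replicas):
    keys = [tuple(np.asarray(g, dtype=float).ravel()) for g in replicas]
    winner, count = Counter(keys).most_common(1)[0]
    assert count > len(replicas) // 2, "no strict majority"
    return np.array(winner)
```

Note that unlike the statistical rules below, the output is forced to be bit-identical to the attack-free result, which is exactly why this style of defense resists small perturbations.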
The DRACO defense could be robust even against small changes, but it should be noted that it defends against a very limited number of Byzantine workers, not O(n) as Krum, MeanMed or Bulyan do, to be detailed below. In real life, an attacker controlling a network component (e.g. a router or a switch) near the PS will be able to perform a Man-In-The-Middle attack while controlling more than a handful of nodes. In DRACO’s settings, in case the defender wishes to protect against up to m = 10 corrupted workers, each chunk needs to be calculated r = 2m + 1 = 21 times.\nIgnoring the run-time concerns, it seems that only methods such as DRACO, which force the output to be identical to the result with no attack, can ensure resilience to attacks that require only minor changes.\n\n2.3 Statistics-Based Defenses For Byzantine Workers\n\nThe state-of-the-art defense for distributed learning is Bulyan. Bulyan utilizes a combination of two earlier methods - Krum and Trimmed Mean - to be explained first.\nTrimmed Mean. This family of defenses, called Mean-Around-Median [28] or Trimmed Mean [29], changes the aggregation rule of Algorithm 1 to a trimmed average, handling each dimension separately:\n\nTrimmedMean(P) = {v_j : j ∈ [d]}, where v_j = (1/|U_j|) Σ_{i∈U_j} (p_i)_j\n\nThree variants exist, differing in the definition of U_j.\n\n1. U_j is the set of indices of the top-(n − m) values in {(p_1)_j, ..., (p_n)_j} nearest to the median µ_j [28].\n2. Same as the first variant, only taking the top-(n − 2m) values [11].\n3. U_j is the set of indices of elements in the same vector {(p_1)_j, ..., (p_n)_j} where the largest and smallest m elements are removed, regardless of their distance from the median [29].\n\nA defense method of [24] clusters each parameter into two clusters using 1-dimensional k-means, and if the distance between the clusters’ centers exceeds a threshold, the values comprising the smaller cluster are discarded. 
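The third TrimmedMean variant, for instance, can be sketched in a few lines (an illustrative implementation, assuming the reported vectors are stacked row-wise):

```python
import numpy as np

# Coordinate-wise trimmed mean (third variant): in every dimension,
# drop the m largest and m smallest reported values and average the
# remaining n - 2m.
def trimmed_mean(P, m):
    P = np.asarray(P, dtype=float)     # shape (n, d)
    P_sorted = np.sort(P, axis=0)      # each dimension sorted independently
    return P_sorted[m : len(P) - m].mean(axis=0)
```

A single extreme outlier per dimension is ignored entirely, which is why these defenses focus on large, far-from-median changes.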
This can be seen as a variant of the Trimmed Mean defense, because only the values of the larger cluster, which must include the median, will be averaged while the rest of the values will be discarded.\nAll variants are designed to defend against up to ⌈n/2⌉ − 1 corrupted workers, as these defenses depend on the assumption that the median is taken from the range of benign values.\nThe circumvention analysis and experiments are similar for all variants upon facing our attack, so we will consider only the second variant, which is used in Bulyan below.\nKrum. Suggested by Blanchard et al. [5], Krum strives to find a single honest participant which is probably a good choice for the next round, discarding the data from the rest of the workers. The chosen worker is the one with parameters closest to another n − m − 2 workers, mathematically expressed by:\n\nKrum(P) = (p_i | argmin_{i∈[n]} Σ_{i→j} ‖p_i − p_j‖²)\n\nwhere i → j denotes the n − m − 2 nearest neighbors to p_i in P, measured by Euclidean distance.\nLike TrimmedMean, Krum is designed to defend against up to ⌈n/2⌉ − 1 corrupted workers (m). The intuition behind this method is that under a normal distribution, the vector with average parameters in each dimension will be the closest to all the parameter vectors drawn from the same distribution. By considering only the distance to the closest n − m − 2 workers, sets of parameters which differ significantly from the average vector are outliers and will be ignored. 
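An illustrative implementation of this selection rule (ours, not the reference one):

```python
import numpy as np

# Krum: score every worker by the summed squared distance to its
# n - m - 2 nearest neighbors, and return the lowest-scoring vector.
def krum(P, m):
    P = np.asarray(P, dtype=float)                       # shape (n, d)
    n = len(P)
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    scores = []
    for i in range(n):
        others = np.sort(np.delete(d2[i], i))            # distances to the other workers
        scores.append(others[: n - m - 2].sum())         # keep the n - m - 2 closest
    return P[int(np.argmin(scores))]
```

A far-away outlier accumulates large distances even to its nearest neighbors, so it is never selected; the attack below instead stays close to the mean.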
The malicious parameters, assumed to be far from the original parameters, will suffer from the high distance to at least one non-corrupted worker, which is expected to prevent them from being selected.\nWhile Krum was proven to converge, in [11] the authors showed that convergence alone should not be the target, because the parameters may converge to an ineffectual model. In addition, as already noted in [11], due to the high dimensionality of the parameters, a malicious attacker can introduce a notably large change to a single parameter without a considerable impact on the Lp norm (Euclidean distance), making the model ineffective.\nFurthermore, the output of Krum’s process is only one chosen worker: all of its parameters are used while the other workers are discarded. It is assumed that there exists such a worker for which all of the parameters are close to the desired mean in each dimension. In practice, however, where the parameters live in a very high-dimensional space, even the best worker will have at least a few parameters which reside far from the mean. To exploit this shortcoming, one can generate a set of parameters which differs from the mean of each parameter by only a small amount. Those small changes will decrease the Euclidean distance calculated by Krum, hence causing the malicious set to be selected.\nBulyan. El Mhamdi et al. [11], who suggested the above-mentioned attack on Krum, proposed a new defense that successfully opposes such an attack. They present a “meta”-aggregation rule, where another aggregation rule A is used as part of it. In the first part, Bulyan uses A iteratively to create a SelectionSet of probably benign candidates, and then aggregates this set with the second variant of TrimmedMean. 
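The two stages can be sketched as follows (illustrative code; `select_rule` plays the role of A, e.g. Krum, and the final trim averages n − 4m values per dimension):

```python
import numpy as np

# Bulyan sketch: repeatedly apply the aggregation rule A to pick
# n - 2m probably-benign candidates, then trimmed-mean the selection.
def bulyan(P, m, select_rule):
    pool = [np.asarray(p, dtype=float) for p in P]
    selection = []
    while len(selection) < len(P) - 2 * m:
        p = select_rule(pool, m)
        # remove the chosen vector from the pool (first exact match)
        idx = next(i for i, q in enumerate(pool) if np.array_equal(q, p))
        selection.append(pool.pop(idx))
    S = np.sort(np.asarray(selection), axis=0)   # coordinate-wise trim of the
    return S[m : len(S) - m].mean(axis=0)        # SelectionSet: n - 4m averaged
```

With |SelectionSet| = n − 2m inputs to the trim, removing m from each end indeed leaves n − 4m values, matching the remark about line 5 of Algorithm 2.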
Bulyan combines methods working with the Lp norm that were proven to converge with the advantages of methods working on each dimension separately, such as TrimmedMean, overcoming Krum’s disadvantage described above because TrimmedMean will not let the single malicious dimension slip.\nAlgorithm 2 describes the defense. It should be noted that on line 5, n − 4m values are being averaged, which is n′ − 2m for n′ = |SelectionSet| = n − 2m.\n\nAlgorithm 2: Bulyan Algorithm\nInput: A, P, n, m\n1 SelectionSet ← ∅\n2 while |SelectionSet| < n − 2m do\n3   p ← A(P \\ SelectionSet)\n4   SelectionSet ← SelectionSet ∪ {p}\n5 return TrimmedMean(2)(SelectionSet)\n\nUnlike previous methods, Bulyan is designed to defend against only up to (n − 3)/4 corrupted workers. Such a number m ensures that the input for each run of A has more than 2m workers as required, and that there is also a majority of non-corrupted workers in the input to TrimmedMean. We follow the authors of Bulyan and use A = Krum in the rest of the paper, including the experiments.\nNo Defense. In the experiments section we will use the name No Defense for the basic method of averaging the parameters from all the workers, due to the lack of an outlier-rejection mechanism.\n\n3 Our Attack\n\nFigure 1: Illustration of the supporting and opposing correct workers.\n\nAs mentioned above, the research in the field of distributed learning assumes that the different parameters of all of the workers are i.i.d. and therefore expressed by a normal distribution. 
We follow this assumption; hence, in the rest of the paper the “units” of a change applied by the malicious opponent for attacking distributed learning models are standard deviations (σ).\nIntuitively, the values of the correct workers are spread symmetrically around the mean, and at the extremes there are workers that push strongly in the direction we want (“supporters”), while a similar number of workers push in the opposite direction (“opposers”). By crafting values between the mean and the supporters, our byzantine values will be closer to the mean than the opposers are. This makes all existing defenses remove the non-byzantine opposer workers, while choosing workers (either byzantine or not) in the desired direction, shifting the mean (see Figure 1). The variance of the workers around the mean is large enough for this effect to be meaningful.\n\n3.1 Exploited assumption\n\nThe defenses suggested in [5, 29, 11] share a similar assumption about the behaviour of the Byzantine workers: the attacker will choose parameters that are far away from the real mean of the parameters, for example parameters that are in the opposite direction of the gradient, in order to skew the real mean and prevent the model’s convergence.\nTherefore, the suggested defenses strive to prove that the selected set of parameters will lie within a ball centered at the real mean, with a radius which is a function of the correct workers. They assume that if the selected parameters are close to a correct one, or there exists a correct worker which is even farther away, the model will produce accurate results.\nIn Krum [5] the authors acknowledge the fact that small changes will not be detected within a radius defined by (α, f), where the angle α depends on the ratio between the deviation of the gradients selected by their method and the real gradient, as a function of the proportion of corrupted workers f. 
They show that for any worker (potentially byzantine) w selected by their method, there exists at least one non-corrupted worker v for which ‖p_w − µ‖ ≤ ‖p_v − µ‖, where µ is the mean vector of all p_i for i ∈ BenignWorkers. The assumption exploited by our attack is that if such a worker v exists, choosing p_w which is closer to the ideal target µ will be at least as good as using parameters from a benign worker, which is expected to produce good results. In practice, the Krum defense will be forced to either choose one of our byzantine workers, or a supporter.\nIn TrimmedMean [29] the authors bounded the error rate which their method guarantees as a function of the variance and skewness of the different workers. The larger the variance and skewness, the higher the error rate the attacker can achieve. They show an example: for any parameter vector w and data point (x, y) ∈ R^d × R, the variance Var(∇f(w; x, y)) ≤ V². For a convex model with quadratic loss, if the diameter of the parameter space is constant, V = O(√d) where d is the dimension of the parameters. In such a case the error rate is bounded by Õ(m/√ℓ + 1/√(ℓn) + 1/ℓ), where m is the ratio of corrupted machines out of n total, and ℓ is the number of data points on each machine.\nAs our experiments show (in Section 4.1), the high variance even on simple deep learning tasks, along with the high skewness induced by putting all the Byzantine workers on the same side in our attack below, is enough for the attacker to hide Byzantine workers within the variance of the correct workers, and still gain control over the training process.\nWe would like to induce small directed changes to many parameters, instead of large changes to a few. In the next section we will further explain how to choose such p_mal for which normal distribution properties guarantee the existence of s non-corrupted workers (which we call “supporters”) for which ‖p_mal − µ‖ ≤ ‖p_s − µ‖. An illustration of our attack method is depicted in Figure 1.\nIn addition, the aforementioned defenses claim to protect against an omniscient attacker, i.e. one who knows the data of all of the workers. We show that due to the normal distribution of the data, in case the attacker controls a representative portion of the workers, it is sufficient to have only the corrupted workers’ data in order to estimate the distribution’s mean and standard deviation, and manipulate the results accordingly. The variance of the estimate of the mean X̄ is inversely proportional to the size of the set: X̄ ∼ N(µ, σ²/n). This observation enables our attack to work also for a non-omniscient attacker, by estimating the properties of the entire population through the corrupted participants alone.\n\n3.2 Perturbation Range\n\nAs mentioned before, the defense mechanisms are designed to discard values that are too far away from the mean. 
We thus seek a range in which we can deviate from the mean without being detected. Since the normal distribution is symmetric, the same value z_max will set the lower and upper bounds for the applicable changes around the mean.\nWhat is the maximal change that can be applied by an attacker without being detected? In order to change the value produced by the aggregation rule, the attacker should control a majority of the workers. While existing defenses prevent this by supporting only up to ⌈n/2⌉ − 1 corrupted workers, the attacker can still attain a majority by finding the minimal number s of non-corrupted workers that are required as “supporters”. The attacker then uses properties of the normal distribution, specifically the cumulative standard normal function φ(z), and looks for the maximal value z_max such that s non-corrupted workers will reside farther away from the mean, hence preferring the selection of the byzantine parameters over the more distant correct mean. The exact steps for finding z_max are shown in Algorithm 3, lines 1-4. By setting all corrupted workers to values in the range (µ − z_max·σ, µ + z_max·σ), the defenses will not be able to differentiate the corrupted workers from the benign ones.\nThe probability for a worker to become a supporter can be considered as a Binomial with p = 1 − φ(z) for an attack factor z. 
Thus, the probability of finding s supporters on each dimension independently is 1 − I(φ(z); n − m − s + 1, s), where I(z; a, b) is the Regularized Incomplete Beta Function.\n\nAlgorithm 3: Preventing Convergence Attack\nInput: {p_i : i ∈ CorruptedWorkers}, n, m\n1 s ← ⌊n/2 + 1⌋ − m\n2 z_max ← max_z ( φ(z) < (n − m − s)/(n − m) )\n3 for j ∈ [d] do\n4   calculate mean (µ_j) and std (σ_j)\n5   (p_mal)_j ← µ_j − z_max · σ_j\n6 for i ∈ CorruptedWorkers do\n7   p_i ← p_mal\n\nAlgorithm 4: Backdoor Attack\nInput: {p_i : i ∈ CorruptedWorkers}, n, m\n1 Calculate z_max, µ and σ as in Algorithm 3, lines 1-4.\n2 Train the model parameters V with the backdoor, having initial parameters {µ_j : j ∈ [d]}.\n3 for j ∈ [d] do\n4   (p_mal)_j ← clip(v_j, µ_j − z_max·σ_j, µ_j + z_max·σ_j)\n5 for i ∈ CorruptedWorkers do\n6   p_i ← p_mal\n\nTrimmedMean is circumvented because, by making the “supporters” prefer the malicious parameters, the attacker controls the median, and the final parameters after averaging the nearby values will be close to it. Krum indeed achieves its target of selecting a set of parameters for which there exists a benign worker lying farther away, but the selected parameters lie very close to the boundaries of this guarantee.\nSince Bulyan is a combination of Krum and TrimmedMean, and since our attack circumvents both, it is reasonable to expect that it will circumvent Bulyan as well. Nevertheless, Bulyan claims to defend against only up to 25% of corrupted workers, and not 50% like Krum and TrimmedMean. 
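Lines 1-5 of Algorithm 3, including the non-omniscient estimation of µ and σ from the corrupted workers alone, can be sketched as follows (illustrative code; z_max is found by bisection on the monotone CDF instead of a z-table):

```python
import math
import numpy as np

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Lines 1-2: the number of benign "supporters" needed, and the largest
# z with phi(z) < (n - m - s) / (n - m).
def z_max(n, m):
    s = n // 2 + 1 - m
    bound = (n - m - s) / (n - m)
    lo, hi = 0.0, 10.0
    for _ in range(60):                  # bisection on the monotone CDF
        mid = (lo + hi) / 2.0
        if phi(mid) < bound:
            lo = mid
        else:
            hi = mid
    return lo

# Lines 3-5: every corrupted worker reports the same vector, z_max
# standard deviations below the mean estimated from its own data.
def craft_malicious(corrupted_params, n, m):
    P = np.asarray(corrupted_params, dtype=float)   # shape (m, d)
    mu, sigma = P.mean(axis=0), P.std(axis=0)
    return mu - z_max(n, m) * sigma
```

For n = 50 and m = 24 this recovers the worked example below, z_max ≈ 1.43.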
At first glance it seems that the z_max derived for m = 25% might not be sufficient, but it should be noted that the perturbation range calculated above is the possible input to TrimmedMean, for which m can reach up to 50% of the workers in the SelectionSet being aggregated in the second phase of Bulyan. Indeed, our approach is effective also against Bulyan.\n\nThe fact that the same set of parameters is used against all defenses is a strong advantage of this method: the attack will go unnoticed no matter which of the aforementioned statistics-based defenses the server decides to choose, again rendering our attack more practical.\n\n3.3 Preventing Convergence\n\nWith the objective of forestalling convergence, the attacker will use the maximal value z that will circumvent the defense. The attack flow is detailed in Algorithm 3.\nExample: If the number of malicious workers is 24 out of a total of 50 workers, the attacker needs 2 “supporters” (⌊50/2 + 1⌋ − 24 = 2) in order to have a majority and set the median. (50 − 24 − 2)/(50 − 24) = 0.923, and by looking at the z-table for the maximal z for which φ(z) = 0.923 we get z_max = 1.43. Finally, the attacker will set the value of all the malicious workers to v = µ_j − 1.43 · σ_j for each parameter j independently. Having enough workers with values lower than v will set v as the median.\n\n3.4 Backdooring Attack\n\nIn Section 3.2, we found a range in which the attacker can perturb the parameters without being detected, and in order to obstruct convergence, the attacker maximized the change inside this range. For a backdooring attack, on the other hand, the attacker seeks the set of parameters within this range which will produce the desired label for the backdoor, while minimizing the impact on the functionality for benign inputs. 
To accomplish that, similar to [2], the attacker optimizes for a model with the backdoor while minimizing the distance from the original parameters. This is achieved by adding a term to the loss function, weighted by a parameter α: α·ℓ_backdoor + (1 − α)·ℓ_Δ, where ℓ_backdoor introduces the backdoor and ℓ_Δ is the MSE loss between the new parameters and the original ones, keeping them close.\n\nIf α is too large, the parameters will differ significantly from the original parameters, and thus be discarded by the defense mechanisms, while if α is too low, the backdoor will not be introduced into the model. Hence, the attacker should use the minimal α which successfully introduces the backdoor into the model. This attack is detailed in Algorithm 4.\n\n4 Experiments and Results\n\nWe provide two kinds of experiments: (1) empirically validating the claim regarding the variance between correct workers, and (2) validating the applicability of the methods by attacking real-world networks. More experiments with different settings can be found in the supplementary material.\nWe experiment with attacking the following models, with and without the presence of the defenses. Following the experiments described in the state-of-the-art defenses [28, 29, 11], we consider simple architectures on the first two datasets: MNIST [16] and CIFAR10 [15]. To strengthen our claims, we also experimented with the modern WideResNet architecture [30] on CIFAR100. The models’ architectures and hyper-parameters can be found in the supplementary materials. The models were trained with n = 51 workers, out of which m = 12 ≈ 24% were corrupted and non-omniscient.\n\n4.1 Variance Between Correct Workers\n\nIn this experiment, we want to quantify the extent of the success rate of the proposed attack. 
We do so by considering gradient sign-flipping: in each round, the attacker checks the fraction of parameters for which the gradients can change direction (flip sign) under the given attack. This is possible for a dimension j if the magnitude of the mean gradient (|µj|) is smaller than the change that the attacker can introduce: zσj, for the z described above and σj the standard deviation of the gradients across the different workers.
The results in Figure 2 demonstrate the relation between the different variances and the size of the gradients, and their effect on the model's accuracy. The graphs show the evaluations on the 3 experimented tasks with z ∈ {1/2, 1, 2}. It is clear that many gradients can flip sign even with a change of only σ/2. These results negate the assumption of most defenses (stated most explicitly by Krum) that the standard deviation is smaller than the gradients.

Figure 2: Parameters per iteration for which |µj| < zσj, for z ∈ {1/2, 1, 2}, in the first 50 iterations.

4.2 Attack Objectives

Having established that the variance between correct workers is indeed high, we now demonstrate various kinds of attacks.

Convergence Prevention. In section 3.2 we showed that when the number of workers is 50, zmax can be set to 1.43. In the following experiments, we tried to change the parameters by up to 1.5σ. In the supplementary material we discuss the z needed to impact the network's convergence.
We applied our attack against all defenses, and examined their resilience on all models. Results can be found in Table 1. The reason for the low accuracy on CIFAR10 with no attack is the simple model we used, reproducing the defense literature, and it is consistent with the results described in those works. The Krum defense performed worst, since our malicious set of parameters was selected even with only 24% of corrupted workers.
Bulyan was affected more than TrimmedMean, because even though the malicious proportion was 24%, it can reach up to 48% of the SelectionSet, which is the proportion used by TrimmedMean in the second stage of Bulyan. TrimmedMean performed better than the previous two, because the malicious parameters were diluted by averaging with many parameters coming from non-corrupted workers.
Ironically, but as expected, the best defense strategy against this attack was the vanilla rule of averaging without outlier rejection. This is because the 1.5σ shift was averaged across all n workers, 76% of which are not corrupted, so the overall shift in each iteration was 1.5 · 0.24 = 0.36σ, which has only a minor impact on the accuracy. It is clear, however, that the server cannot choose this aggregation rule because of the serious vulnerabilities it introduces. In case circumventing No Defense is also desired, the attacker can compose a hybrid attack, in which one worker is dedicated to attacking No Defense with the attacks detailed in earlier papers [5, 28], and the rest are used for the attack proposed here.

Table 1: Convergence Prevention Results. Maximal accuracy for n = 51, m = 24%, z = 1.5.

Defense        MNIST  CIFAR10  CIFAR100
No Attack       96.1    59.6     73.0
No Defense      91.1    42.2     63.1
Trimmed Mean    88.3    32.7     34.4
Krum            78.5    20.3     16.9
Bulyan          87.9    28.1     19.8

Backdooring. As a result of the attacker's desire not to interrupt the convergence for benign inputs, a low α and z (both 0.2) were chosen. After each round the attacker trained the network with the backdoor for 5 rounds. We used MSE for ℓ∆ and set ℓbackdoor to the same cross entropy used for the original classification. In each round 1000 images were randomly sampled by the attacker, and their upper-left 5x5 pixels were set to the maximal intensity. All those samples were trained with target = 0.
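For concreteness, the weighted objective αℓbackdoor + (1 − α)ℓ∆ from section 3.4 can be sketched as below. This is a minimal NumPy sketch under illustrative names, not the authors' implementation: ℓbackdoor is the cross entropy of the backdoored batch toward the target label, and ℓ∆ is the MSE between the attacker's parameters and the original ones.

```python
import numpy as np

def attacker_loss(params, orig_params, logits, backdoor_target, alpha):
    """alpha * l_backdoor + (1 - alpha) * l_delta, as in Section 3.4."""
    # l_backdoor: cross entropy of the backdoored batch toward the target label
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    l_backdoor = -log_probs[np.arange(len(logits)), backdoor_target].mean()
    # l_delta: MSE keeping the reported parameters close to the original ones
    l_delta = np.mean((params - orig_params) ** 2)
    return alpha * l_backdoor + (1 - alpha) * l_delta
```

Raising α trades stealth (a small ℓ∆) for backdoor strength, which is why the attacker searches for the minimal α that still implants the backdoor.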
Testing was performed on a different subset.

Table 2 lists the results. MNIST perfectly learned the backdoor pattern, with a minimal impact of 1% on benign inputs. For CIFAR10 the accuracy is worse than for MNIST, with a degradation of 7 to 15%, but the accuracy drop for benign inputs is still reasonable and probably unsuspicious to an innocent server training for a new task without knowing the expected accuracy. Lastly, CIFAR100 was even less robust to the minimal changes for the backdoor, causing a higher degradation of accuracy.
It is interesting to see that No Defense was virtually resilient to this attack, with relatively minimal degradation on benign inputs and almost no mis-classification of samples with the backdoor pattern. However, in a different experiment on MNIST with higher z and α (1 and 0.5 respectively), the opposite occurred: No Defense reached 95.6% for benign inputs and 100% on the backdoor, while the other defenses did not perform as well on benign inputs. Another option for circumventing No Defense is dedicating one corrupted worker to the case that No Defense is being used by the server, and using the rest of the corrupted workers for the defense-evading attack.

Table 2: Backdoor pattern results. The maximal accuracy with the backdoor pattern attack. n = 51, m = 24%, z = α = 0.2. Results with no attack are also presented for comparison.

                     MNIST              CIFAR10            CIFAR100
Defense         Benign  Backdoor   Benign  Backdoor   Benign  Backdoor
No Attack        96.1      -        59.6      -        73.0      -
No Defense       96.0     36.9      59.1     0.0       69.6     7.3
Trimmed Mean     95.3    100.       55.6    100.       69.7    80.7
Krum             95.2    100.       52.5    99.8       52.8    95.1
Bulyan           95.3     99.9      51.9    92.9       54.9    84.3

5 Conclusions

We present a new attack paradigm in which, by applying limited changes to the reported parameters, a malicious opponent may interfere with or backdoor the process of Distributed Learning.
Unlike previous attacks by byzantine workers, the attacker does not need to know the exact data of the non-corrupted workers (being non-omniscient), and the attack works even in i.i.d. settings, where the data is known to come from a specific distribution. The attack evades all state-of-the-art defenses based on robust aggregation, suggesting the use of other approaches such as DRACO, despite its run-time overhead.

6 Acknowledgements

This research was supported by the BIU Center for Research in Applied Cryptography and Cyber Security in conjunction with the Israel National Cyber Bureau in the Prime Minister's Office.

References

[1] A. Agarwal, M. J. Wainwright, and J. C. Duchi. Distributed dual averaging in networks. In Advances in Neural Information Processing Systems (NIPS), pages 550-558, 2010.

[2] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov. How to backdoor federated learning. arXiv preprint arXiv:1807.00459, 2018.

[3] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-153, 2017.

[4] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1467-1474. Omnipress, 2012.

[5] P. Blanchard, R. Guerraoui, J. Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems (NIPS), 2017.

[6] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. In Advances in Neural Information Processing Systems (NIPS), 2018.

[7] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos. Draco: Byzantine-resilient distributed training via redundant gradients.
arXiv preprint arXiv:1803.09877, 2018.

[8] X. Chen, C. Liu, B. Li, K. Lu, and D. Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint, 2017.

[9] Y. Chen, L. Su, and J. Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.

[10] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 1223-1231, 2012.

[11] E. M. El Mhamdi, R. Guerraoui, and S. Rouault. The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 3521-3530, 2018.

[12] C. Fung, C. J. Yoon, and I. Beschastnikh. Mitigating sybils in federated learning poisoning. arXiv preprint arXiv:1808.04866, 2018.

[13] R. D. Kleinberg, Y. Li, and Y. Yuan. An alternative view: When does SGD escape local minima? In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[14] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[16] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[17] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583-598, 2014.

[18] M. Li, D. G. Andersen, A. J. Smola, and K. Yu.
Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems (NIPS), pages 19-27, 2014.

[19] Y. Liu, S. Ma, Y. Aafer, W.-C. Lee, J. Zhai, W. Wang, and X. Zhang. Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium (NDSS), 2018.

[20] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

[21] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens. Adding gradient noise improves learning for very deep networks. International Conference on Learning Representations (ICLR) Workshop, 2016.

[22] M. Qiao and G. Valiant. Learning discrete distributions from untrusted batches. arXiv preprint arXiv:1711.08113, 2017.

[23] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 693-701, 2011.

[24] S. Shen, S. Tople, and P. Saxena. Auror: defending against poisoning attacks in collaborative deep learning systems. In Proceedings of the 32nd Annual Conference on Computer Security Applications, pages 508-519. ACM, 2016.

[25] N. Shirish Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations (ICLR) Workshop, 2017.

[26] J. Steinhardt, P. W. W. Koh, and P. S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems (NIPS), 2017.

[27] B. Tran, J. Li, and A. Madry. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems (NIPS), 2018.

[28] C. Xie, O. Koyejo, and I. Gupta.
Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116, 2018.

[29] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[30] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[31] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. arXiv preprint, 2017.