{"title": "cpSGD: Communication-efficient and differentially-private distributed SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 7564, "page_last": 7575, "abstract": "Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For $d$ variables and $n \\approx d$ clients, the proposed method uses $\\cO(\\log \\log(nd))$ bits of communication per client per coordinate and ensures constant privacy.\n\nWe also improve previous analysis of the \\emph{Binomial mechanism} showing that it achieves nearly the same utility as the Gaussian mechanism, while requiring fewer representation bits, which can be of independent interest.", "full_text": "cpSGD: Communication-ef\ufb01cient and\ndifferentially-private distributed SGD\n\nNaman Agarwal\nGoogle Brain\n\nPrinceton, NJ 08540\n\nnamanagarwal@google.com\n\nAnanda Theertha Suresh\n\nGoogle Research\nNew York, NY\n\ntheertha@google.com\n\nFelix Yu\n\nGoogle Research\nNew York, NY\n\nfelixyu@google.com\n\nSanjiv Kumar\nGoogle Research\nNew York, NY\n\nsanjivk@google.com\n\nH. Brendan McMahan\n\nGoogle Research\n\nSeattle, WA\n\nmcmahan@google.com\n\nAbstract\n\nDistributed stochastic gradient descent is an important subroutine in distributed\nlearning. A setting of particular interest is when the clients are mobile devices,\nwhere two important concerns are communication ef\ufb01ciency and the privacy of the\nclients. 
Several recent works have focused on reducing the communication cost or\nintroducing privacy guarantees, but none of the proposed communication efficient\nmethods are known to be privacy preserving and none of the known privacy\nmechanisms are known to be communication efficient. To this end, we study\nalgorithms that achieve both communication efficiency and differential privacy. For\n$d$ variables and $n \\approx d$ clients, the proposed method uses $O(\\log \\log(nd))$ bits of\ncommunication per client per coordinate and ensures constant privacy.\nWe also improve previous analysis of the Binomial mechanism showing that it\nachieves nearly the same utility as the Gaussian mechanism, while requiring fewer\nrepresentation bits, which can be of independent interest.\n\n1 Introduction\n\n1.1 Background\n\nDistributed stochastic gradient descent (SGD) is a basic building block of modern machine\nlearning [25, 11, 9, 28, 1, 27, 5]. In the typical scenario of synchronous distributed learning, in every\nround, each client obtains a copy of a global model which it updates based on its local data. The\nupdates (usually in the form of gradients) are sent to a parameter server, where they are averaged\nand used to update the global model. Alternatively, without a central server, each client maintains\na global model and either broadcasts the gradient to all or a subset of other clients, and updates its\nmodel with the aggregated gradient. In this paper we specifically consider the centralized setting; for\nthe decentralized case, readers are referred to [36] and references therein.\nOften, the communication cost of sending the gradient becomes the bottleneck [30, 23, 22]. To address\nthis issue, several recent works have focused on reducing the communication cost of distributed\nlearning algorithms via gradient quantization and sparsification [32, 17, 33, 20, 21, 4, 34]. 
These\nalgorithms have been shown to improve communication cost and hence communication time in\ndistributed learning. This is especially effective in the federated learning setting where clients are\nmobile devices with expensive up-link communication cost [26, 20].\n\n32nd Conference on Neural Information Processing Systems (NIPS 2018), Montr\u00e9al, Canada.\n\nWhile communication is a key concern in client based distributed machine learning, an equally\nimportant consideration is that of protecting the privacy of participating clients and their sensitive\ninformation. Providing rigorous privacy guarantees for machine learning applications has been\nan area of active recent interest [6, 35, 31]. Differentially private gradient descent algorithms in\nparticular were studied in the work of [2]. A direct application of these mechanisms in distributed\nsettings leads to algorithms with high communication costs. The key focus of our paper is to analyze\nmechanisms that achieve rigorous privacy guarantees as well as have communication efficiency.\n\n1.2 Communication efficiency\nWe first describe synchronous distributed SGD formally. Let $F(w) : \\mathbb{R}^d \\to \\mathbb{R}$ be of the form\n$F(w) = \\frac{1}{M} \\cdot \\sum_{i=1}^{M} f_i(w)$, where each $f_i$ resides at the $i$th client. For example, $w$'s are weights of a\nneural network and $f_i(w)$ is the loss of the network on data located on client $i$. Let $w^0$ be the initial\nvalue. At round $t$, the server transmits $w^t$ to all the clients and asks a random set of $n$ (batch size /\nlot size) clients to transmit their local gradient estimates $g_i^t(w^t)$. Let $S$ be the subset of clients. The\nserver updates as follows\n\n$$g^t(w^t) = \\frac{1}{n} \\sum_{i \\in S} g_i^t(w^t), \\qquad w^{t+1} \\triangleq w^t - \\gamma g^t(w^t)$$\n\nfor some suitable choice of $\\gamma$. Other optimization algorithms such as momentum, Adagrad, or Adam\ncan also be used instead of the SGD step above.\nNaively for the above protocol, each of the $n$ clients needs to transmit $d$ reals, typically using\n$O(d \\cdot \\log 1/\\eta)$ bits (footnote 1). 
This communication cost can be prohibitive, e.g., for a medium size PennTreeBank\nlanguage model [39], the number of parameters $d > 10$ million and hence total cost is $\\sim 38$MB\n(assuming 32 bit float), which is too large to be sent from a mobile phone to the server at every round.\nMotivated by the need for communication efficient protocols, various quantization algorithms have\nbeen proposed to reduce the communication cost [33, 20, 21, 38, 37, 34, 5]. In these protocols, the\nclients quantize the gradient by a function $q$ and send an efficient representation of $q(g_i^t(w^t))$ instead\nof its actual local gradient $g_i^t(w^t)$. The server computes the gradient as\n\n$$\\tilde{g}^t(w^t) = \\frac{1}{n} \\sum_{i \\in S} q(g_i^t(w^t)),$$\n\nand updates $w^t$ as before. Specifically, [33] proposes a quantization algorithm which reduces\nthe requirement of full (or floating point) arithmetic precision to a bit or few bits per value on\naverage. There are many subsequent works e.g., see [21] and in particular [5] showed that stochastic\nquantization and Elias coding [15] can be used to obtain communication-optimal SGD for convex\nfunctions. If the expected communication cost at every round $t$ is bounded by $c$, then the total\ncommunication cost of the modified gradient descent is at most\n\n$$T \\cdot c. \\tag{1}$$\n\nAll the previous papers relate the error in gradient compression to SGD convergence. We first state\none such result for completeness for non-convex functions and prove it in Appendix A. Similar (and\nstronger) results can be obtained for (strongly) convex functions using results in [16] and [29].\nCorollary 1 ([16]). Let $F$ be $L$-smooth and $\\forall x$ $\\|\\nabla F(x)\\|_2 \\le D$. Let $w^0$ satisfy $F(w^0) - F(w^*) \\le D_F$. 
Let $q$ be a quantization scheme, and $\\gamma \\triangleq \\min\\left\\{L^{-1}, \\sqrt{2 D_F}\\,(\\sigma \\sqrt{LT})^{-1}\\right\\}$, then after $T$ rounds\n\n$$\\mathbb{E}_{t \\sim \\mathrm{Unif}[T]}\\left[\\|\\nabla F(w^t)\\|_2^2\\right] \\le \\frac{2 D_F L}{T} + \\frac{2\\sqrt{2}\\,\\sigma \\sqrt{L D_F}}{\\sqrt{T}} + DB,$$\n\nwhere\n\n$$\\sigma^2 = \\max_{1 \\le t \\le T} 2\\,\\mathbb{E}\\left[\\|g^t(w^t) - \\nabla F(w^t)\\|_2^2\\right] + 2 \\max_{1 \\le t \\le T} \\mathbb{E}_q\\left[\\|g^t(w^t) - \\tilde{g}^t(w^t)\\|_2^2\\right], \\tag{2}$$\n\nand $B = \\max_{1 \\le t \\le T} \\|\\mathbb{E}_q[g^t(w^t) - \\tilde{g}^t(w^t)]\\|$. The expectation in the above equations is over the\nrandomness in gradients and quantization.\n\nFootnote 1: $\\eta$ is the per-coordinate quantization accuracy. To represent a $d$ dimensional vector $X$ to a constant accuracy\nin Euclidean distance, each coordinate is usually quantized to an accuracy of $\\eta = 1/\\sqrt{d}$.\n\nThe above result relates the convergence of distributed SGD for non-convex functions to the worst-case\nmean square error (MSE) and bias in gradient mean estimates in Equation (2). Thus, the smaller the\nmean square error in gradient estimation, the better the convergence. Hence, we focus on the problem of\ndistributed mean estimation (DME), where the goal is to estimate the mean of a set of vectors.\n\n1.3 Differential privacy\nWhile the above schemes reduce the communication cost, it is unclear what (if any) privacy guarantees\nthey offer. We study privacy from the lens of differential privacy (DP). The notion of differential\nprivacy [13] provides a strong notion of individual privacy while permitting useful data analysis in\nmachine learning tasks. We refer the reader to [14] for a survey. Informally, for the output to be\ndifferentially private, the estimated model should be indistinguishable whether a particular client's\ndata was taken into consideration or not. 
We define this formally in Section 2.\nIn the context of client based distributed learning, we are interested in the privacy of the gradients\naggregated from clients; differential privacy for the average gradients implies privacy for the resulting\nmodel since DP is preserved by post-processing. The standard approach is to let the server add the\nnoise to the averaged gradients (e.g., see [14, 2] and references within). However, the above only\nworks under a restrictive assumption that the clients can trust the server. Our goal is to also minimize\nthe need for clients to trust the central aggregator, and hence we propose the following model:\nClients add their share of the noise to their gradients $g_i^t$ before transmission. Aggregation of gradients\nat the server results in an estimate with noise equal to the sum of the noise added at each client.\nThis approach improves over server-controlled noise addition in several scenarios:\nClients do not trust the server: Even in the scenario when the server is not trustworthy, the above\nscheme can be implemented via cryptographically secure aggregation schemes [7], which ensure that\nthe only information about the individual users the server learns is what can be inferred from the sum.\nHence, differential privacy of the aggregate now ensures that the parameter server does not learn any\nindividual user information. This will encourage clients to participate in the protocol even if they do\nnot fully trust the server. We note that while secure aggregation schemes add to the communication\ncost (e.g., [7] adds $\\log_2(k \\cdot n)$ for $k$ levels of quantization), our proposed communication benefits\nstill hold. For example, if $n = 1024$, a 4-bit quantization protocol would reduce communication cost\nby 67% compared to the 32 bit representation.\nServer is negligent, but not malicious: the server may \"forget\" to add noise, but is not malicious\nand not interested in learning characteristics of individual users. 
However, if the server releases the\nlearned model to the public, it needs to be differentially private.\nA natural way to extend the results of [14, 2] is to let individual users add Gaussian noise to their\ngradients before transmission. Since the sum of Gaussians is Gaussian itself, differential privacy\nresults follow. However, the transmitted values now are real numbers and the benefits of gradient\ncompression are lost. Further, secure aggregation protocols [7] require discrete inputs. To resolve\nthese issues, we propose that the clients add noise drawn from an appropriately parameterized\nBinomial distribution. We refer to this as the Binomial mechanism. Since Binomial random variables\nare discrete, they can be transmitted efficiently. Furthermore, the choice of the Binomial is convenient\nin the distributed setting because the sum of Binomials is also binomially distributed, i.e., if\n\n$$Z_1 \\sim \\mathrm{Bin}(N_1, p), \\quad Z_2 \\sim \\mathrm{Bin}(N_2, p)$$\n\nare independent, then $Z_1 + Z_2 \\sim \\mathrm{Bin}(N_1 + N_2, p)$.\n\nHence the total noise post aggregation can be analyzed easily, which is convenient for the distributed\nsetting (footnote 2). The Binomial mechanism can be of independent interest in other applications with discrete\noutput as well. Furthermore, unlike the Gaussian, it avoids floating point representation issues.\n\nFootnote 2: Another choice is the Poisson distribution. Different from Poisson, the Binomial distribution has bounded\nsupport and has an easily analyzable communication complexity which is always bounded.\n\n1.4 Summary of our results\nBinomial mechanism: We first study the Binomial mechanism as a generic mechanism to release\ndiscrete valued data. Previous analysis of the Binomial mechanism (where one adds noise $\\mathrm{Bin}(N, p)$)\nwas due to [12], who analyzed the 1-dimensional case for $p = 1/2$ and showed that to achieve $(\\epsilon, \\delta)$\ndifferential privacy, $N$ needs to be $\\ge 64 \\log(2/\\delta)/\\epsilon^2$. We improve the analysis in the following ways:\n\n\u2022 d-dimensions. 
We extend the analysis of the 1-dimensional Binomial mechanism to $d$ dimensions.\nUnlike the Gaussian distribution, the Binomial is not rotation invariant, making the analysis more\ninvolved. The key fact utilized in this analysis is that the Binomial distribution is locally rotation-invariant\naround the mean.\n\n\u2022 Improvement. We improve the previous result and show that $N \\ge 8 \\log(2/\\delta)/\\epsilon^2$ suffices for\nsmall $\\epsilon$, implying that the Binomial and Gaussian mechanism perform identically as $\\epsilon \\to 0$. We\nnote that while this is a constant-factor improvement, it is crucial in making differential privacy practical.\n\nDifferentially-private distributed mean estimation (DME): A direct application of the Gaussian mechanism\nrequires $n \\cdot d$ reals and hence $n \\cdot d \\cdot \\log(nd)$ bits of communication. This can be prohibitive\nin practice. We first propose a direct application of quantization [33] and the Binomial mechanism and\ncharacterize its privacy/error guarantees along with its communication costs. We further show that\ncoupling the scheme with random rotation can significantly reduce the communication cost. In\nparticular, for $\\epsilon = O(1)$, we provide an algorithm achieving the same privacy and error tradeoff as\nthat of the Gaussian mechanism with communication\n\n$$\\le n \\cdot d \\cdot \\left(\\log_2\\left(1 + \\frac{d}{n}\\right) + O\\left(\\log \\log\\left(\\frac{nd}{\\delta}\\right)\\right)\\right) \\text{ bits},$$\n\nper round of distributed SGD. Hence when $d \\approx n$, the number of bits is $n \\cdot d \\cdot \\log(\\log(nd)/\\delta)$.\nThe rest of the paper is organized as follows. In Section 2, we review the notion of differential privacy\nand state our results for the Binomial mechanism. 
Motivated by the fact that the convergence of SGD\ncan be reduced to the error in gradient estimate computation per-round, we formally describe the\nproblem of DME in Section 3 and state our results in Section 4.\nIn Section 4.2, we provide and analyze the implementation of the binomial mechanism in conjunction\nwith quantization in the context of DME. The main idea is for each client to add noise drawn from\nan appropriately parameterized Binomial distribution to each quantized value before sending to the\nserver. The server further subtracts the bias introduced by the noise to achieve an unbiased mean\nestimator. We further show in Section 4.3 that the rotation procedure proposed in [33] which reduces\nthe MSE is helpful in reducing the additional error due to differential privacy.\n\n2 Differential privacy\n\n2.1 Notation\nWe start by defining the notion of differential privacy. Formally, given a set of data sets $\\mathcal{D}$ provided\nwith a notion of neighboring data sets $N_{\\mathcal{D}} \\subset \\mathcal{D} \\times \\mathcal{D}$ and a query function $f : \\mathcal{D} \\to \\mathcal{X}$, a mechanism\n$M : \\mathcal{X} \\to \\mathcal{O}$ to release the answer of the query, is defined to be $(\\epsilon, \\delta)$ differentially private if for any\nmeasurable subset $S \\subseteq \\mathcal{O}$ and two neighboring data sets $(D_1, D_2) \\in N_{\\mathcal{D}}$,\n\n$$\\Pr(M(f(D_1)) \\in S) \\le e^{\\epsilon} \\Pr(M(f(D_2)) \\in S) + \\delta. \\tag{3}$$\n\nUnless otherwise stated, for the rest of the paper, we will assume the output spaces $\\mathcal{X}, \\mathcal{O} \\subseteq \\mathbb{R}^d$. We\nconsider the mean square error as a metric to measure the error of the mechanism $M$. Formally,\n\n$$\\mathcal{E}(M) \\triangleq \\max_{D \\in \\mathcal{D}} \\mathbb{E}\\left[\\|M(f(D)) - f(D)\\|_2^2\\right].$$\n\nA key quantity in characterizing differential privacy for many mechanisms is the sensitivity of a query\n$f : \\mathcal{D} \\to \\mathbb{R}^d$ in a given norm $\\ell_q$. Formally this is defined as\n\n$$\\Delta_q \\triangleq \\max_{(D_1, D_2) \\in N_{\\mathcal{D}}} \\|f(D_1) - f(D_2)\\|_q. \\tag{4}$$\n\nThe canonical mechanism to achieve $(\\epsilon, \\delta)$ differential privacy is the Gaussian mechanism $M_g^{\\sigma}$ [14]:\n$M_g^{\\sigma}(f(D)) \\triangleq f(D) + Z$, where $Z \\sim \\mathcal{N}(0, \\sigma^2 I_d)$. 
We now state the well-known privacy guarantee\nof the Gaussian mechanism.\nLemma 1 ([14]). For any $\\delta$, $\\ell_2$ sensitivity bound $\\Delta_2$, and $\\sigma$ such that $\\sigma \\ge \\Delta_2 \\sqrt{2 \\log(1.25/\\delta)}$, $M_g^{\\sigma}$\nis $\\left(\\frac{\\Delta_2}{\\sigma} \\sqrt{2 \\log(1.25/\\delta)}, \\delta\\right)$ differentially private (footnote 3) and the error is bounded by $d \\cdot \\sigma^2$.\n\nFootnote 3: All logs are to base $e$ unless otherwise stated.\n\n2.2 Binomial Mechanism\nWe now define the Binomial mechanism for the case when the output space $\\mathcal{X}$ of the query $f$ is $\\mathbb{Z}^d$.\nThe Binomial mechanism is parameterized by three quantities $N, p, s$ where $N \\in \\mathbb{N}$, $p \\in (0, 1)$, and\nquantization scale $s = 1/j$ for some $j \\in \\mathbb{N}$ and is given by\n\n$$M_b^{N,p,s}(f(D)) \\triangleq f(D) + (Z - Np) \\cdot s, \\tag{5}$$\n\nwhere for each coordinate $i$, $Z_i \\sim \\mathrm{Bin}(N, p)$ and independent. One dimensional binomial mechanism\nwas introduced by [12] for the case when $p = 1/2$. We analyze the mechanism for the general\n$d$-dimensional case and for any $p$. This analysis is involved as the Binomial mechanism is not rotation\ninvariant. By carefully exploiting the local rotation invariant structure near the mean, we show that:\nTheorem 1. For any $\\delta$, parameters $N, p, s$ and sensitivity bounds $\\Delta_1, \\Delta_2, \\Delta_\\infty$ such that\n\n$$N p(1-p) \\ge \\max\\left(23 \\log(10d/\\delta),\\ 2\\Delta_\\infty/s\\right),$$\n\nthe Binomial mechanism is $(\\epsilon, \\delta)$ differentially private for\n\n$$\\epsilon = \\frac{\\Delta_2 \\sqrt{2 \\log \\frac{1.25}{\\delta}}}{s \\sqrt{N p(1-p)}} + \\frac{\\Delta_2 c_p \\sqrt{\\log \\frac{10}{\\delta}} + \\Delta_1 b_p}{s N p(1-p)(1 - \\delta/10)} + \\frac{\\Delta_\\infty d_p \\log \\frac{1.25}{\\delta} + \\Delta_\\infty d_p \\log \\frac{20d}{\\delta} \\log \\frac{10}{\\delta}}{s N p(1-p)}, \\tag{6}$$\n\nwhere $b_p$, $c_p$, and $d_p$ are defined in (16), (11), and (15) respectively, and for $p = 1/2$, $b_p = 1/3$,\n$c_p = 5/2$, and $d_p = 2/3$. The error of the mechanism is $d \\cdot s^2 \\cdot N p(1-p)$.\nThe proof is given in Appendix B. We make some remarks regarding the design and the guarantee\nfor the Binomial Mechanism. Note that the privacy guarantee for the Binomial mechanism depends\non all three sensitivity parameters $\\Delta_2, \\Delta_1, \\Delta_\\infty$ as opposed to the Gaussian mechanism which only\ndepends on $\\Delta_2$. 
The $\\Delta_1$ and $\\Delta_\\infty$ terms can be seen as the added complexity due to discretization.\nSecondly, setting $s = 1$ (i.e. providing no scale to the noise) in the expression (6), it can be readily\nseen that the terms involving $\\Delta_1$ and $\\Delta_2$ scale differently with respect to the variance of the noise.\nThis motivates the use of the accompanying quantization scale $s$ in the mechanism. Indeed it is\npossible that the resolution of the integer that is provided by the Binomial noise could potentially\nbe too large for the problem leading to worse guarantees. In this setting, the quantization parameter\n$s$ helps normalize the noise correctly. Further, it can be seen that as long as the variance of the random\nvariable $s \\cdot Z$ is fixed, increasing $N p(1-p)$ and decreasing $s$ makes the Binomial mechanism closer\nto the Gaussian mechanism. Formally, if we let $\\sigma = s\\sqrt{N p(1-p)}$ and $s \\le \\sigma/(c\\sqrt{d})$, then using\nthe Cauchy-Schwarz inequality, the $\\epsilon$ guarantee (6) can be rewritten as\n\n$$\\epsilon = (\\Delta_2/\\sigma)\\sqrt{2 \\log(1.25/\\delta)}\\,\\left(1 + O(1/c)\\right).$$\n\nThe variance of the Binomial distribution is $N p(1-p)$ and the leading term in $\\epsilon$ matches exactly the\n$\\epsilon$ term in the Gaussian mechanism. Furthermore, if $s$ is $o(1/\\sqrt{d})$, then this mechanism approaches the\nGaussian mechanism. This result agrees with the Berry-Esseen type Central limit theorems for the\nconvergence of one dimensional Binomial distribution to the Gaussian distribution. In Figure 1, we\nplot the error vs $\\epsilon$ for Gaussian and Binomial mechanism. Observe that as scale is reduced, error vs\nprivacy trade-off for Binomial mechanism approaches that of Gaussian mechanism.\nFinally note that, while $p = 1/2$ will in general be the optimal choice as it maximizes the variance\nfor a fixed communication budget, there might be corner cases wherein the required variance is so\nsmall that it cannot be achieved by an integer choice of $N$ and $p = 1/2$. 
Our results working with\ngeneral $p$ also cover these corner cases.\n\n3 Distributed mean estimation (DME)\n\nWe have related the SGD convergence rate to the MSE in approximating the gradient at each step in\nCorollary 1. Eq. (1) relates the communication cost of SGD to the communication cost of estimating\ngradient means. Advanced composition theorem (Thm. 3.5 [19]) or moments accounting [2] can\nbe used to relate the privacy guarantee of SGD to that of gradient mean estimate at each instance\n$t$. We also note that, since in SGD we often sample the clients, standard privacy amplification results via\nsampling [2] can be used to get tighter bounds in this case.\n\nFigure 1: Comparison of error vs privacy for Gaussian and Binomial mechanism at different scales.\n\nFigure 2: cpSGD with rotation on the infinite MNIST dataset, with panels (a) $\\epsilon = 4.0$ and (b) $\\epsilon = 2.0$.\n$k$ is the number of quantization levels, and $m$ is the parameter\nof the binomial noise ($p = 0.5$, $s = 1$). The baseline is\nwithout quantization and differential privacy. $\\delta = 10^{-9}$.\n\nTherefore, akin to [33], in the rest of the paper we just focus on the MSE and privacy guarantees of\nDME. The results for synchronous distributed GD follow from Corollary 1 (convergence), advanced\ncomposition theorem (privacy), and Eq. (1) (communication).\nFormally, the problem of DME is defined as follows: given $n$ vectors $X \\triangleq \\{X_1, \\ldots, X_n\\}$ where $X_i \\in \\mathbb{R}^d$ is on\nclient $i$, we wish to compute the mean $\\bar{X} = \\frac{1}{n} \\sum_{i=1}^{n} X_i$ at a central server. For gradient descent at\neach round $t$, $X_i$ is set to $g_i^t$. DME is a fundamental building block for many distributed learning\nalgorithms including distributed PCA/clustering [24].\nWhile analyzing private DME we assume that each vector $X_i$ has bounded $\\ell_2$ norm, i.e. 
$\\|X_i\\| \\le D$.\nThe reason to make such an assumption is to be able to define and analyze the privacy guarantees\nand is often enforced in practice by employing gradient clipping at each client. We note that this\nassumption appears in previous works on gradient descent and differentially private gradient descent\n(e.g. [2]). Since our results also hold for all gradients without any statistical assumptions, we get\ndesired convergence results and privacy results for SGD.\n\n3.1 Communication protocol\nOur proposed communication algorithms are simultaneous and independent, i.e., the clients\nindependently send data to the server at the same time. We allow the use of both private and public\nrandomness. Private randomness refers to random values generated by each client separately, and\npublic randomness refers to a sequence of random values that are shared among all parties (footnote 4).\nWe are given $n$ vectors $X \\triangleq \\{X_1, \\ldots, X_n\\}$ where $X_i \\in \\mathbb{R}^d$ resides on client $i$. In any independent\ncommunication protocol, each client transmits a function of $X_i$ (say $q(X_i)$), and a central server\nestimates the mean by some function of $q(X_1), q(X_2), \\ldots, q(X_n)$. Let $\\pi$ be any such protocol and\nlet $C_i(\\pi, X_i)$ be the expected number of bits transmitted by the $i$-th client during protocol $\\pi$, where\nthroughout the paper, expectation is over the randomness in protocol $\\pi$.\nThe total number of bits transmitted\nby all clients with the protocol $\\pi$ is $C(\\pi, X_1^n) \\stackrel{\\mathrm{def}}{=} \\sum_{i=1}^{n} C_i(\\pi, X_i)$. Let the estimated mean be $\\hat{\\bar{X}}$.\nFor a protocol $\\pi$, the MSE of the estimate is $\\mathcal{E}(\\pi, X_1^n) = \\mathbb{E}\\left[\\|\\hat{\\bar{X}} - \\bar{X}\\|_2^2\\right]$. We note that, via Corollary 1, bounds on\n$\\mathcal{E}(\\pi, X_1^n)$ translate to bounds on gradient estimates in Eq. 
(2) and result in convergence guarantees.\n\n3.2 Differential privacy\nTo state the privacy results for DME, we define the notion of data sets and neighbors as follows. A\ndataset is a collection of vectors $X = \\{X_1, \\ldots, X_n\\}$. The notion of neighboring data sets typically\ncorresponds to those differing only on the information of one user, i.e. $X, X^{\\otimes i}$ are neighbors if they\ndiffer in one vector. Note that this notion of neighbors for DME in the context of distributed gradient\ndescent translates to two data sets $F = f_1, f_2, \\ldots, f_n$ and $F' = f'_1, f'_2, \\ldots, f'_n$ being neighbors if they\ndiffer in one function $f_i$ and corresponds to guaranteeing privacy for individual client's data. The\nbound $\\|X_i\\|_2 \\le D$ translates to assuming $\\|g_i^t\\| \\le D$, ensured via gradient clipping.\n\nFootnote 4: Public randomness can be emulated by the server communicating a random seed.\n\n4 Results for distributed mean estimation (DME)\n\nIn this section we describe our algorithms, the associated MSE, and the privacy guarantees in the\ncontext of DME. We first establish a baseline by stating the results for implementing the\nGaussian mechanism by adding Gaussian noise on each client vector.\n\n4.1 Gaussian protocol\nIn the Gaussian mechanism, each client sends vector $Y_i = X_i + Z_i$, where the $Z_i$ are i.i.d. distributed as\n$\\mathcal{N}(0, \\sigma^2 I_d)$. The server estimates the mean by $\\hat{\\bar{X}} = \\frac{1}{n} \\cdot \\sum_{i=1}^{n} Y_i$. We refer to this protocol as $\\pi_g$.\nSince $\\sum_{i=1}^{n} Z_i/n$ is distributed as $\\mathcal{N}(0, \\sigma^2 I_d/n)$, the above mechanism is equivalent to applying the\nGaussian mechanism on the output with variance $\\sigma^2/n$. Since changing any of the $X_i$'s changes the\nnorm of $\\bar{X}$ by at most $2D/n$, the following theorem follows directly from Lemma 1.\nTheorem 2. Under the Gaussian mechanism, the mean estimate is unbiased and communication cost\nis $n \\cdot d$ reals. 
Moreover, for any $\\delta$ and $\\sigma \\ge \\frac{2D}{\\sqrt{n}} \\cdot \\sqrt{2 \\log(1.25/\\delta)}$, it is $(\\epsilon, \\delta)$ differentially private for\n\n$$\\epsilon = \\frac{2D}{\\sqrt{n}\\,\\sigma} \\sqrt{2 \\log \\frac{1.25}{\\delta}} \\quad \\text{and} \\quad \\mathcal{E}(\\pi_g, X) = \\frac{d\\sigma^2}{n}.$$\n\nWe remark that real numbers can potentially be quantized to $O(\\log(dn/\\epsilon\\delta))$ bits with insignificant\neffect to privacy (footnote 5). However this is asymptotic and can be prohibitive in practice [20], where we\nhave a small fixed communication budget and $d$ is of the order of millions. A natural way to reduce\ncommunication cost is via quantization, where each client quantizes the $Y_i$s before transmitting. However,\nhow the privacy guarantees degrade under quantization of the Gaussian mechanism is hard to analyze,\nparticularly under aggregation. Instead we propose to use the Binomial mechanism which we describe\nnext.\n\n4.2 Stochastic k-level quantization + Binomial mechanism\nWe now define the mechanism $\\pi_{sk}(\\mathrm{Bin}(m, p))$ based on $k$-level stochastic quantization $\\pi_{sk}$ proposed\nin [33] composed with the Binomial mechanism. It will be parameterized by 3 quantities $k, m, p$.\nFirst, the server sends $X^{\\max}$ to all the clients, with the hope that for all $i, j$, $-X^{\\max} \\le X_i(j) \\le\nX^{\\max}$. The clients then clip each coordinate of their vectors to the range $[-X^{\\max}, X^{\\max}]$. For every\ninteger $r$ in the range $[0, k)$, let $B(r)$ represent a bin (one for each $r$), i.e.\n\n$$B(r) \\stackrel{\\mathrm{def}}{=} -X^{\\max} + \\frac{2 r X^{\\max}}{k-1}. \\tag{7}$$\n\nThe algorithm quantizes each coordinate into one of the bins stochastically and adds scaled Binomial\nnoise. Formally client $i$ computes the following quantities for every $j$\n\n$$U_i(j) = \\begin{cases} B(r+1) & \\text{w.p. } \\frac{X_i(j) - B(r)}{B(r+1) - B(r)} \\\\ B(r) & \\text{otherwise,} \\end{cases}$$\n\n$$Y_i(j) = U_i(j) + \\frac{2 X^{\\max}}{k-1} \\cdot T_i(j), \\tag{8}$$\n\nwhere $r$ is such that $X_i(j) \\in [B(r), B(r+1)]$ and $T_i(j) \\sim \\mathrm{Bin}(m, p)$. The client sends $Y_i$ to the\nserver. 
The server now estimates $\\bar{X}$ by\n\n$$\\hat{\\bar{X}}_{\\pi_{sk}(\\mathrm{Bin}(m,p))} = \\frac{1}{n} \\sum_{i=1}^{n} \\left(Y_i - \\frac{2 X^{\\max} m p}{k-1}\\right). \\tag{9}$$\n\nIf $\\forall j$, $X_i(j) \\in [-X^{\\max}, X^{\\max}]$, then $\\mathbb{E}\\left[Y_i - \\frac{2 X^{\\max} m p}{k-1}\\right] = X_i$, and $\\hat{\\bar{X}}_{\\pi_{sk}(\\mathrm{Bin}(m,p))}$ will be an\nunbiased estimate of the mean.\n\nFootnote 5: Follows by observing that quantizing all values to $1/\\mathrm{poly}(n, d, 1/\\epsilon, \\log 1/\\delta)$ accuracy ensures minimum\nloss in privacy. In practice this is often implemented using 32 bits of quantization via float representation.\n\nBefore stating the formal guarantees we will require the definitions of the following quantities\nrepresenting the sensitivity of the quantization protocol in the appropriate norm.\n\n$$\\Delta_\\infty(X^{\\max}, D) \\stackrel{\\mathrm{def}}{=} k + 1$$\n\n$$\\Delta_1(X^{\\max}, D) \\stackrel{\\mathrm{def}}{=} \\frac{\\sqrt{d} D (k-1)}{X^{\\max}} + \\sqrt{\\frac{2\\sqrt{d} D \\log(2/\\delta)(k-1)}{X^{\\max}}} + \\frac{4}{3} \\log \\frac{2}{\\delta}$$\n\n$$\\Delta_2(X^{\\max}, D) \\stackrel{\\mathrm{def}}{=} \\frac{D(k-1)}{X^{\\max}} + \\sqrt{\\Delta_1 + \\sqrt{\\frac{2\\sqrt{d} D \\log(2/\\delta)(k-1)}{X^{\\max}}}}. \\tag{10}$$\n\nFor brevity of notation we have suppressed the parameters $k, \\delta$ from the LHS. With no prior information\non $X^{\\max}$, the natural choice is to set $X^{\\max} = D$. With this value of $X^{\\max}$ we characterize the\nMSE, sensitivity, and communication complexity of $\\pi_{sk}(\\mathrm{Bin}(m, p))$ below leveraging Theorem 1.\nTheorem 3. If $X^{\\max} = D$, then the mean estimate is unbiased and\n\n$$\\mathcal{E}(\\pi_{sk}(\\mathrm{Bin}(m, p)), X^n) \\le \\frac{d D^2}{n(k-1)^2} + \\frac{d}{n} \\cdot \\frac{4 m p(1-p) D^2}{(k-1)^2}.$$\n\nFurthermore if\n\n$$m n p(1-p) \\ge \\max\\left(23 \\log(10d/\\delta),\\ 2\\Delta_\\infty(X^{\\max}, D)\\right),$$\n\nthen for any $\\delta$, $\\hat{\\bar{X}}_{\\pi_{sk}(\\mathrm{Bin}(m,p))}$ is $(\\epsilon, 2\\delta)$ differentially private where $\\epsilon$ (as given by Theorem 1) is\n\n$$\\epsilon = \\frac{\\Delta_2 \\sqrt{2 \\log \\frac{1.25}{\\delta}}}{\\sqrt{m n p(1-p)}} + \\frac{\\Delta_2 c_p \\sqrt{\\log \\frac{10}{\\delta}} + \\Delta_1 b_p}{m n p(1-p)(1 - \\delta/10)} + \\frac{\\Delta_\\infty d_p \\log \\frac{1.25}{\\delta} + \\Delta_\\infty d_p \\log \\frac{20d}{\\delta} \\log \\frac{10}{\\delta}}{m n p(1-p)},$$\n\nwith sensitivity parameters $\\{\\Delta_1(X^{\\max}, D), \\Delta_2(X^{\\max}, D), \\Delta_\\infty(X^{\\max}, D)\\}$ as defined in (10).\nFurthermore,\n\n$$C(\\pi_{sk}(\\mathrm{Bin}(m, p)), X^n) = n \\cdot (d \\log_2(k + m) + \\tilde{O}(1)) \\quad \\text{(footnote 6)}.$$\n\nWe provide the proof in Appendix D. 
The first term in the expression for $\\epsilon$ in the above theorem\nrecovers the same guarantee as that of the Gaussian mechanism (Theorem 2). Further, it can be seen\nthat the trailing terms are negligible when $k \\gg \\sqrt{d}$. Formally this leads to the following corollary\nsummarizing the communication cost for $\\epsilon \\le 1$ for achieving the same guarantee as the Gaussian\nmechanism.\nCorollary 2. There exists an implementation of $\\pi_{sk}(\\mathrm{Bin}(m, p))$, which achieves the same privacy\nand error as the full precision Gaussian mechanism with a total communication complexity of\n\n$$n \\cdot d \\cdot \\left(\\log_2\\left(\\sqrt{d} + \\frac{d}{n\\epsilon^2}\\right) + O\\left(\\log \\log\\left(\\frac{nd}{\\epsilon\\delta}\\right)\\right)\\right) \\text{ bits}.$$\n\nThe communication cost of the above algorithm is $\\Omega(\\log d)$ bits per coordinate per client, which can\nbe prohibitive. In the next section we show that these bounds can be further improved via rotation.\n\nFootnote 6: $\\tilde{O}$ is used to denote poly-logarithmic factors.\n\n4.3 Error reduction via randomized rotation\nAs seen in Corollary 2, for $\\pi_{sk}(\\mathrm{Bin}(m, p))$ to have error and privacy same as that of the Gaussian\nmechanism, the best bound on the communication cost guaranteed is $\\Omega(\\log(d))$ bits per coordinate\nirrespective of how large $n$ is. The proof reveals that this is due to the error being proportional to\n$O(d(X^{\\max})^2/n)$. Therefore MSE reduces when $X^{\\max}$ is small, e.g., when $X_i$ is uniform on the unit\nsphere, $X^{\\max}$ is $O\\left(\\sqrt{(\\log d)/d}\\right)$ (whp) [10]. [33] showed that the same effect can be observed by\nrandomly rotating the vectors before quantization. Here we show that random rotation reduces the\nleading term in the error as well as improves the privacy guarantee.\nUsing public randomness, all clients and the central server generate a random orthogonal matrix\n$R \\in \\mathbb{R}^{d \\times d}$ according to some known distribution. Given a protocol $\\pi$ for DME which takes inputs\nX1 . . . 
Xn, we define $\\mathrm{Rot}(\\pi, R)$ as the protocol where each client $i$ first computes $X'_i = R X_i$, and\nruns the protocol on $X'_1, X'_2, \\ldots, X'_n$. The server then obtains the mean estimate $\\hat{\\bar{X}}'$ in the rotated\nspace using the protocol $\\pi$ and then multiplies by $R^{-1}$ to obtain the coordinates in the original basis,\ni.e., $\\hat{\\bar{X}} = R^{-1} \\hat{\\bar{X}}'$.\nDue to the fact that $d$ can be huge in practice, we need orthogonal matrices that permit fast matrix-vector\nproducts. Naive matrices that support fast multiplication such as block-diagonal matrices\noften result in high values of $\\|X'_i\\|_\\infty^2$. Similar to [33], we propose to use a special type of orthogonal\nmatrix $R = \\frac{1}{\\sqrt{d}} HA$, where $A$ is a random diagonal matrix with i.i.d. Rademacher entries ($\\pm 1$ with\nprobability 0.5) and $H$ is a Walsh-Hadamard matrix [18]. Applying both rotation and its inverse\ntakes $O(d \\log d)$ time and $O(1)$ space (with an in-place algorithm).\nThe next theorem provides the MSE and privacy guarantees for $\\mathrm{Rot}(\\pi_{sk}(\\mathrm{Bin}(m, p)), HA)$.\nTheorem 4 (Appendix E). For any $\\delta$, let $X^{\\max} = 2D\\sqrt{\\frac{\\log(2nd/\\delta)}{d}}$, then\n\n$$\\mathcal{E}(\\mathrm{Rot}(\\pi_{sk}(\\mathrm{Bin}(m, p))), HA) \\le \\frac{2 \\log \\frac{2nd}{\\delta} \\cdot D^2}{n(k-1)^2} + \\frac{8 \\log \\frac{2nd}{\\delta} \\cdot m p(1-p) D^2}{n(k-1)^2} + 4 D^2 \\delta^2.$$\n\nThe bias of the mean estimate is bounded by $2D\\delta$. Furthermore if\n\n$$m n p(1-p) \\ge \\max\\left(23 \\log(10d/\\delta),\\ 2\\Delta_\\infty(X^{\\max}, D)\\right),$$\n\nthen $\\hat{\\bar{X}}(\\mathrm{Rot}(\\pi_{sk}(\\mathrm{Bin}(m, p))))$ is $(\\epsilon, 3\\delta)$ differentially private where $\\epsilon$ (as given by Theorem 1) is\n\n$$\\epsilon = \\frac{\\Delta_2 \\sqrt{2 \\log \\frac{1.25}{\\delta}}}{\\sqrt{m n p(1-p)}} + \\frac{\\Delta_2 c_p \\sqrt{\\log \\frac{10}{\\delta}} + \\Delta_1 b_p}{m n p(1-p)(1 - \\delta/10)} + \\frac{\\Delta_\\infty d_p \\log \\frac{1.25}{\\delta} + \\Delta_\\infty d_p \\log \\frac{20d}{\\delta} \\log \\frac{10}{\\delta}}{m n p(1-p)},$$\n\nwith sensitivity parameters $\\{\\Delta_1(X^{\\max}, D), \\Delta_2(X^{\\max}, D), \\Delta_\\infty(X^{\\max}, D)\\}$ (Eq. (10)). Furthermore,\n\n$$C(\\mathrm{Rot}(\\pi_{sk}(\\mathrm{Bin}(m, p))), X^n) = n \\cdot (d \\log_2(k + m) + \\tilde{O}(1)).$$\n\nThe following corollary now bounds the communication cost for $\\mathrm{Rot}(\\pi_{sk}(\\mathrm{Bin}(m, p)), HA)$ when\n$\\epsilon \\le 1$ akin to Corollary 2.\nCorollary 3. 
There exists an implementation of $\\mathrm{Rot}(\\pi_{sk}(\\mathrm{Bin}(m, p)), HA)$ that achieves the same\nerror and privacy as the full precision Gaussian mechanism with a total communication complexity of\n\n$$n \\cdot d \\left( \\log_2\\left( 1 + \\frac{d}{n \\varepsilon^2} \\right) + O\\left( \\log \\log \\frac{dn}{\\varepsilon} \\right) \\right) \\text{ bits.}$$\n\nNote that $k$ is no longer required to be set to $\\Omega(\\sqrt{d})$, and hence if $d = o(n \\varepsilon^2)$, then\n$\\mathrm{Rot}(\\pi_{sk}(\\mathrm{Bin}(m, p)), HA)$ has the same privacy and utility as the Gaussian mechanism, but with\njust $O(nd \\log \\log(nd/\\varepsilon))$ communication cost.\n\n5 Discussion\n\nWe trained a three-layer model (60 hidden nodes each with ReLU activation) on the infinite MNIST\ndataset [8] with 25M data points and 25M clients. At each step 10,000 clients send their data to the\nserver. This setting is close to real-world settings of federated learning where there are hundreds\nof millions of users. The results are shown in Figure 2. Note that the models achieve different levels of\naccuracy depending on the communication cost and the privacy parameter $\\varepsilon$. We trained the\nmodel for exactly one epoch, so each sample was used at most once in training. In this setting, the\nper-batch $\\varepsilon$ and the overall $\\varepsilon$ are the same.\nThere are several interesting future directions. On the theoretical side, it is not clear if our analysis\nof the Binomial mechanism is tight. Furthermore, it would be interesting to obtain better privacy accounting for\nthe Binomial mechanism via a moments accountant. On the practical side, we plan to explore the effects\nof neural network topology, over-parametrization, and optimization algorithms on the accuracy of the\nprivately learned models.\n\nReferences\n[1] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,\nGreg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale\nmachine learning on heterogeneous distributed systems. 
arXiv preprint arXiv:1603.04467,\n2016.\n\n[2] Mart\u00edn Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar,\nand Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC\nConference on Computer and Communications Security, pages 308\u2013318. ACM, 2016.\n\n[3] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss\ntransform. In STOC, 2006.\n\n[4] Dan Alistarh, Demjan Grubic, Jerry Liu, Ryota Tomioka, and Milan Vojnovic. Communication-efficient\nstochastic gradient descent, with applications to neural networks. 2017.\n\n[5] Dan Alistarh, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Randomized quantization\nfor communication-optimal stochastic gradient descent. arXiv:1610.02132, 2016.\n\n[6] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization:\nEfficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014\nIEEE 55th Annual Symposium on, pages 464\u2013473. IEEE, 2014.\n\n[7] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan,\nSarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for\nprivacy-preserving machine learning. pages 1175\u20131191, 2017.\n\n[8] L\u00e9on Bottou. The infinite MNIST dataset.\n\n[9] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep\nlearning with COTS HPC systems. In International Conference on Machine Learning, pages\n1337\u20131345, 2013.\n\n[10] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and\nLindenstrauss. Random Structures & Algorithms, 22(1):60\u201365, 2003.\n\n[11] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew\nSenior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. 
In\nAdvances in Neural Information Processing Systems, pages 1223\u20131231, 2012.\n\n[12] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our\ndata, ourselves: Privacy via distributed noise generation. In Eurocrypt, volume 4004, pages\n486\u2013503. Springer, 2006.\n\n[13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to\nsensitivity in private data analysis. In TCC, volume 3876, pages 265\u2013284. Springer, 2006.\n\n[14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found.\nTrends Theor. Comput. Sci., 9(3\u20134):211\u2013407, August 2014.\n\n[15] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on\nInformation Theory, 21(2):194\u2013203, 1975.\n\n[16] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex\nstochastic programming. SIAM Journal on Optimization, 23(4):2341\u20132368, 2013.\n\n[17] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning\nwith limited numerical precision. In Proceedings of the 32nd International Conference on\nMachine Learning (ICML-15), pages 1737\u20131746, 2015.\n\n[18] Kathy J Horadam. Hadamard matrices and their applications. Princeton University Press, 2012.\n\n[19] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential\nprivacy. IEEE Transactions on Information Theory, 63(6):4037\u20134049, 2017.\n\n[20] Jakub Kone\u010dn\u00fd, H Brendan McMahan, Felix X Yu, Peter Richt\u00e1rik, Ananda Theertha Suresh,\nand Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv\npreprint arXiv:1610.05492, 2016.\n\n[21] Jakub Kone\u010dn\u00fd and Peter Richt\u00e1rik. Randomized distributed mean estimation: Accuracy vs\ncommunication. 
arXiv preprint arXiv:1611.07555, 2016.\n\n[22] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski,\nJames Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with\nthe parameter server. In OSDI, volume 1, page 3, 2014.\n\n[23] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed\nmachine learning with the parameter server. In Advances in Neural Information Processing\nSystems, pages 19\u201327, 2014.\n\n[24] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory,\n28(2):129\u2013137, 1982.\n\n[25] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured\nperceptron. In HLT, 2010.\n\n[26] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.\nCommunication-efficient learning of deep networks from decentralized data. In Proceedings of\nthe 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.\n\n[27] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Aguera y Arcas. Federated\nlearning of deep networks using model averaging. arXiv:1602.05629, 2016.\n\n[28] Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of deep neural networks\nwith natural gradient and parameter averaging. arXiv preprint, 2014.\n\n[29] Alexander Rakhlin, Ohad Shamir, Karthik Sridharan, et al. Making gradient descent optimal\nfor strongly convex stochastic optimization. In ICML. Citeseer, 2012.\n\n[30] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free\napproach to parallelizing stochastic gradient descent. In Advances in Neural Information\nProcessing Systems, pages 693\u2013701, 2011.\n\n[31] Anand D Sarwate and Kamalika Chaudhuri. Signal processing and machine learning with\ndifferential privacy: Algorithms and challenges for continuous data. 
IEEE Signal Processing\nMagazine, 30(5):86\u201394, 2013.\n\n[32] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent\nand its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual\nConference of the International Speech Communication Association, 2014.\n\n[33] Ananda Theertha Suresh, Felix X Yu, Sanjiv Kumar, and H Brendan McMahan. Distributed\nmean estimation with limited communication. In International Conference on Machine Learning,\npages 3329\u20133337, 2017.\n\n[34] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad:\nTernary gradients to reduce communication in distributed deep learning. arXiv preprint\narXiv:1705.07878, 2017.\n\n[35] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on\ndifferential privacy for scalable stochastic gradient descent-based analytics. In Proceedings\nof the 2017 ACM International Conference on Management of Data, pages 1307\u20131322. ACM,\n2017.\n\n[36] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent.\nSIAM Journal on Optimization, 26(3):1835\u20131854, 2016.\n\n[37] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing\nthe communication bandwidth for distributed training. International Conference on Learning\nRepresentations, 2018.\n\n[38] Yusuke Tsuzuku, Hiroto Imachi, and Takuya Akiba. Variance-based gradient compression for\nefficient distributed deep learning, 2018.\n\n[39] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 
Recurrent neural network regularization.\narXiv preprint arXiv:1409.2329, 2014.", "award": [], "sourceid": 3751, "authors": [{"given_name": "Naman", "family_name": "Agarwal", "institution": "princeton"}, {"given_name": "Ananda Theertha", "family_name": "Suresh", "institution": "Google"}, {"given_name": "Felix Xinnan", "family_name": "Yu", "institution": "Google Research"}, {"given_name": "Sanjiv", "family_name": "Kumar", "institution": "Google Research"}, {"given_name": "Brendan", "family_name": "McMahan", "institution": "Google"}]}