{"title": "Locally Private Learning without Interaction Requires Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 15001, "page_last": 15012, "abstract": "We consider learning under the constraint of local differential privacy (LDP). For many learning problems known efficient algorithms in this model require many rounds of communication between the server and the clients holding the data points. Yet multi-round protocols are prohibitively slow in practice due to network latency and, as a result, currently deployed large-scale systems are limited to a single round. Despite significant research interest, very little is known about which learning problems can be solved by such non-interactive systems. The only lower bound we are aware of is for PAC learning an artificial class of functions with respect to a uniform distribution (Kasiviswanathan et al., 2008).\n\nWe show that the margin complexity of a class of Boolean functions is a lower bound on the complexity of any non-interactive LDP algorithm for distribution-independent PAC learning of the class. In particular, the classes of linear separators and decision lists require exponential number of samples to learn non-interactively even though they can be learned in polynomial time by an interactive LDP algorithm. This gives the first example of a natural problem that is significantly harder to solve without interaction and also resolves an open problem of Kasiviswanathan et al.~(2008). We complement this lower bound with a new efficient learning algorithm whose complexity is polynomial in the margin complexity of the class. Our algorithm is non-interactive on labeled samples but still needs interactive access to unlabeled samples. 
All of our results also apply to the statistical query model and any model in which the number of bits communicated about each data point is constrained.", "full_text": "Locally Private Learning without Interaction Requires Separation

Amit Daniely
Hebrew University and Google Research

Vitaly Feldman∗
Google Research

Abstract

We consider learning under the constraint of local differential privacy (LDP). For many learning problems, known efficient algorithms in this model require many rounds of communication between the server and the clients holding the data points. Yet multi-round protocols are prohibitively slow in practice due to network latency and, as a result, currently deployed large-scale systems are limited to a single round. Despite significant research interest, very little is known about which learning problems can be solved by such non-interactive systems. The only lower bound we are aware of is for PAC learning an artificial class of functions with respect to a uniform distribution [39].

We show that the margin complexity of a class of Boolean functions is a lower bound on the complexity of any non-interactive LDP algorithm for distribution-independent PAC learning of the class. In particular, the classes of linear separators and decision lists require an exponential number of samples to learn non-interactively even though they can be learned in polynomial time by an interactive LDP algorithm. This gives the first example of a natural problem that is significantly harder to solve without interaction and also resolves an open problem of Kasiviswanathan et al. [39]. We complement this lower bound with a new efficient learning algorithm whose complexity is polynomial in the margin complexity of the class. Our algorithm is non-interactive on labeled samples but still needs interactive access to unlabeled samples.
All of our results also apply to the statistical query model and any model in which the number of bits communicated about each data point is constrained.

1 Overview

We consider learning in distributed systems where each client i (or user) holds a data point z_i ∈ Z drawn i.i.d. from some unknown distribution P and the goal of the server is to solve some statistical learning problem using the data stored at the clients. In addition, the communication from the client to the server is constrained. The primary model we consider is that of local differential privacy (LDP) [39]. In this model each user i applies a differentially private algorithm to their point z_i and then sends the result to the server. The specific algorithm applied by each user is determined by the server. In the general version of the model the server can determine which algorithm the user should apply on the basis of all the previous communications the server has received. In practice, however, waiting for the client's response often takes a relatively large amount of time. Therefore in such systems it is necessary to limit the number of rounds of interaction. That is, the queries of the server need to be split into a small number of batches such that the LDP algorithms used in each batch depend only on responses to queries in previous batches (a query specifies the algorithm to apply). Indeed, currently deployed systems that use local differential privacy use very few rounds (usually just one) [28, 3, 19]. See Section 2 for a formal definition of the model.

∗Part of this work was done while the author was visiting the Simons Institute for the Theory of Computing.
1Extended abstract. Full version appears as [16].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper we will focus on the standard PAC learning of a class of Boolean functions C over some domain X.
In this setting the input distribution P is over labeled examples (x, y) ∈ X × {−1, 1} where x is drawn from some distribution D and y = f(x) for some unknown f ∈ C (referred to as the target function). The goal of the learning algorithm is to output a function h such that the error Pr_{x∼D}[f(x) ≠ h(x)] is small. In the distribution-independent setting D is not known to the learning algorithm, while in the distribution-specific setting the learning algorithm only needs to succeed for some specific D.

For many of the important classes of functions all known LDP learning algorithms require many rounds of interaction. Yet there are no results that rule out solving these problems without interaction. This problem was first addressed by Kasiviswanathan et al. [39], who demonstrated the existence of an artificial class of Boolean functions C over {0, 1}^d with the following property. C can be PAC learned efficiently relative to the uniform distribution over {0, 1}^d by an interactive LDP protocol but requires 2^{Ω(d)} samples to learn by any non-interactive learning algorithm. The class C is highly unnatural. It splits the domain into two parts. The target function learned on the first half gives the key to the learning problem on the second half of the domain. That problem is exponentially hard to solve without the key. This approach does not extend to the distribution-independent learning setting (intuitively, the learning algorithm will not be able to obtain the key if the distribution does not place any probability on the first half of the domain).

Deriving a technique that applies to distribution-independent learning is posed as a natural open problem in this area [39].
Even beyond PAC learning, there are no examples of natural problems that provably require exponentially more samples to solve non-interactively.

1.1 Our results

We give a new technique for proving lower bounds on the power of non-interactive LDP algorithms for distribution-independent PAC learning. Our technique is based on a connection that we establish between the power of interaction and the margin complexity of Boolean function classes. The margin complexity of a class of Boolean functions C, denoted by MC(C), is the inverse of the largest margin of separation achievable by an embedding of X in R^d that makes the positive and negative examples of each function in C linearly separable (see Definition 2.5). It is a well-studied measure of complexity of classes of functions and corresponding sign matrices in learning theory and communication complexity (e.g. [44, 2, 13, 33, 12, 48, 42, 38]).

We prove that only classes that have polynomially small margin complexity can be efficiently PAC learned by a non-interactive LDP algorithm. Our lower bound implies that two natural and well-studied classes of functions, linear separators and decision lists, require an exponential number of samples to learn non-interactively. Importantly, it is known that these classes can be learned efficiently by interactive LDP algorithms (this follows from the results for the statistical query model that we discuss later). Thus our result gives an exponential separation between the power of interactive and non-interactive protocols. To the best of our knowledge this is the only known such separation for a natural statistical problem (see Section 1.2 for a more detailed comparison with related notions of non-interactive algorithms).

Our result follows from a stronger lower bound that also holds against algorithms for which only the queries that depend on the label of the point are non-interactive (also referred to as non-adaptive in related contexts).
We will refer to such algorithms as label-non-adaptive LDP algorithms. Formally, our lower bound for such algorithms is as follows. We say that a class of Boolean ({−1, 1}-valued) functions C is closed under negation if for every f ∈ C, −f ∈ C.

Theorem 1.1. Let C be a class of Boolean functions closed under negation. Assume that there exists a label-non-adaptive ε-LDP algorithm A that, with success probability at least 2/3, PAC learns C distribution-independently with error less than 1/2 using at most n examples. Then n = Ω(MC(C)^{2/3}/e^ε).

Our second contribution is an algorithm for learning large-margin linear separators that matches (up to polynomial factors) our lower bound.

Theorem 1.2. Let C be an arbitrary class of Boolean functions over X. For any α, ε > 0 and n = poly(MC(C)/(αε)) there is a label-non-adaptive ε-LDP algorithm that PAC learns C distribution-independently with accuracy 1 − α using at most n examples.

Learning of large-margin classifiers is a classical learning problem and various algorithms for the problem are widely used in practice. Our learning algorithm is computationally efficient as long as an embedding of C into a d = poly(MC(C) log |X|)-dimensional space can be computed efficiently (such an embedding is known to exist by the Johnson-Lindenstrauss random projection argument [4]). Together these results show an equivalence (up to polynomials) between margin complexity and PAC learning with this limited form of interaction in the LDP model.

Another implication of Theorem 1.2 is that if the distribution over X is fixed (and known to the learning algorithm) then the learning algorithm becomes non-interactive.

Corollary 1.3. Let C be a class of Boolean functions over X and D be an arbitrary distribution over X.
For any α, ε > 0 and n = poly(MC(C)/(αε)) there is a non-interactive ε-LDP algorithm that PAC learns C relative to D with accuracy 1 − α using at most n examples.

Techniques: Following the approach of Kasiviswanathan et al. [39], we use the characterization of LDP protocols using the statistical query (SQ) model of Kearns [40]. In this model an algorithm has access to a statistical query oracle for P in place of i.i.d. samples from P. The most commonly studied SQ oracle gives an estimate of the mean of any bounded function with a fixed tolerance.

Definition 1.4. Let P be a distribution over a domain Z and τ > 0. A statistical query oracle STAT_P(τ) is an oracle that, given as input any function φ : Z → [−1, 1], returns some value v such that |v − E_{z∼P}[φ(z)]| ≤ τ.

The tolerance τ of statistical queries roughly corresponds to the number of random samples in the traditional setting. Non-adaptive (or non-interactive) SQ algorithms are defined analogously to LDP protocols. The reductions between learning in the SQ model and learning in the LDP model given by Kasiviswanathan et al. [39] preserve the number of rounds of interaction of a learning algorithm.

The key technical tool we apply to prove our lower bound is a result of Feldman [30] relating margin complexity and a certain notion of complexity for statistical queries. The result shows that the existence of a (possibly randomized) algorithm that outputs a set T of m functions such that for every f ∈ C and distribution D, with significant probability one of the functions in T is at least 1/m-correlated with f relative to D, implies that MC(C) = O(m^{3/2}) (the sharpest bound was proved in [38]).
We then show that such a set of functions can be easily extracted from the queries of any label-non-adaptive SQ algorithm for learning C.

Our label-non-adaptive LDP learning algorithm for large-margin halfspaces relies on a new formulation of halfspace learning as a stochastic convex optimization problem. The crucial property of this program is that (approximately) computing sub-gradients can be done by using a fixed set of non-adaptive queries (that measure the correlation of each of the attributes with the label) and adaptive but label-independent queries. We can then use an arbitrary gradient-descent-based LDP algorithm for stochastic convex optimization. Such algorithms were first described by Duchi et al. [21]. For simplicity, we appeal to the fact that such algorithms can also be implemented in the statistical query model [32].

Corollaries: The class of decision lists (see [47, 41] for a definition) and the class of linear separators (or halfspaces) over {0, 1}^d are known to have exponentially large margin complexity [34, 15, 48] (and are also negation closed). In contrast, these classes are known to be learnable efficiently by SQ algorithms [40, 24] and thus also by LDP algorithms. Formally, we obtain the following lower bounds:

Corollary 1.5. Any label-non-adaptive ε-LDP algorithm that PAC learns the class of linear separators over {0, 1}^d with error less than 1/2 and success probability at least 3/4 must use n = 2^{Ω(d)}/e^ε i.i.d. examples. For learning the class of decision lists under the same conditions the algorithm must use n = 2^{Ω(d^{1/3})}/e^ε i.i.d. examples.

Our use of the statistical query model to prove the results implies that we can derive analogues of our results in other models that have connections to the SQ model. One such model is the distributed model in which only a small number of bits is communicated from each client.
Namely, each client applies a function with range {0, 1}^k to their input and sends the result to the server (for some k ≪ log |Z|). As in the case of LDP, the specific function used is chosen by the server. One motivation for this model is the collection of data from remote sensors where the cost of communication is highly asymmetric. In the context of learning this model was introduced by Ben-David and Dichterman [11] and generalized by Steinhardt et al. [50]. Identical and closely related models are often studied in the context of distributed statistical estimation with communication constraints (e.g. [43, 45, 46, 57, 51, 53, 1]). As in the setting of LDP, the number of rounds of interaction that the server uses to solve a learning problem in this model is a critical resource. Using the equivalence between this model and SQ learning that preserves the number of rounds of interaction, we immediately obtain analogous results for this model. We are not aware of any prior results on the power of interaction in the context of this model. See Section 5 for additional details.

1.2 Related work

Smith et al. [49] address the question of the power of non-interactive LDP algorithms in the closely related setting of stochastic convex optimization. They derive new non-interactive LDP algorithms for the problem, albeit ones requiring a number of queries exponential in the dimension. They also give an exponential lower bound for non-interactive algorithms that are further restricted to obtain only local information about the optimized function. Subsequently, upper and lower bounds on the number of queries to the gradient/second-order oracles for algorithms with few rounds of interaction have been studied by several groups [23, 56, 7, 18]. In the context of discrete optimization using queries for the value of the optimized function, the round complexity has recently been investigated in [8, 6, 9].
To the best of our knowledge, the techniques used in these works are unrelated to ours. Also, in all these works the lower bounds rely heavily on the fact that the oracle provides only local (in the geometric sense) information about the optimized function. In contrast, statistical queries allow getting global information about the optimized function.

A number of lower bounds on the sample complexity of LDP algorithms demonstrate that LDP is less efficient than the central model of differential privacy (e.g. [22, 20]). The number of data samples necessary to answer statistical queries chosen adaptively has recently been studied in a line of work on adaptive data analysis [27, 35, 10, 52]. Our work provably demonstrates that the use of such adaptive queries is important for solving basic learning problems.

Subsequent work: Acharya et al. [1] implicitly give a separation between interactive and non-interactive protocols for the problem of identity testing for a discrete distribution over k elements, albeit a relatively weak one (O(k) vs Ω(k^{3/2}) samples). The work of Joseph et al. [36, 37] explores a different aspect of interactivity in LDP. Specifically, they distinguish between two types of interactive protocols: fully-interactive and sequentially-interactive ones. Fully-interactive protocols place no restrictions on interaction whereas sequentially-interactive ones only allow asking one query per user. They give a separation showing that sequentially-interactive protocols may require exponentially more samples than fully-interactive ones. This separation is orthogonal to ours since our lower bounds are against completely non-interactive protocols and we separate them from sequentially-interactive protocols.

2 Preliminaries

For integer n ≥ 1 let [n] := {1, . . . , n}.

Local differential privacy: In the local differential privacy (LDP) model [55, 29, 39] it is assumed that each data sample obtained by the server is randomized in a differentially private way. This is modeled by assuming that the server running the learning algorithm accesses the dataset via an oracle defined below.

Definition 2.1 ([39]). An ε-local randomizer R : Z → W is a randomized algorithm that satisfies: for all z_1, z_2 ∈ Z and w ∈ W, Pr[R(z_1) = w] ≤ e^ε Pr[R(z_2) = w]. For a dataset S ∈ Z^n, an LR_S oracle takes as input an index i and a local randomizer R and outputs a random value w obtained by applying R(z_i). An algorithm is (compositionally) ε-LDP if it accesses S only via the LR_S oracle with the following restriction: for all i ∈ [n], if LR_S(i, R_1), . . . , LR_S(i, R_k) are the algorithm's invocations of LR_S on index i, where each R_j is an ε_j-randomizer, then Σ_{j∈[k]} ε_j ≤ ε.

For a non-interactive LDP algorithm one can assume without loss of generality that each sample is queried only once, since the application of k fixed local randomizers can be equivalently seen as an execution of a single ε-randomizer with Σ_{j∈[k]} ε_j ≤ ε. Further, in this definition the privacy parameter is defined as the composition of the privacy parameters of all the randomizers. A more general (and less strict) way to define the privacy parameter of an LDP protocol is as the differential privacy of the entire transcript of the protocol (see [36] for a more detailed discussion). This distinction does not affect our results since in our lower and upper bounds each sample is only queried once. For such protocols these two ways to measure privacy coincide.
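For concreteness, the classical binary randomized-response mechanism is a minimal example of an ε-local randomizer in the sense of Definition 2.1, and applying one fixed randomizer to every sample yields a non-interactive protocol for estimating a mean (this is a textbook illustration of the definition, not a protocol from this paper; all function names here are ours):

```python
import math
import random

def randomized_response(bit, eps):
    """eps-local randomizer for one private bit in {-1, +1}: reports the
    true bit with probability e^eps / (e^eps + 1). For any outputs,
    Pr[R(z1) = w] / Pr[R(z2) = w] <= e^eps, as Definition 2.1 requires."""
    p_true = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if random.random() < p_true else -bit

def estimate_mean(bits, eps):
    """Debiased estimate of mean(bits) from one randomized report per data
    point. The same fixed randomizer is applied to every sample, so the
    protocol is non-interactive."""
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    reports = [randomized_response(b, eps) for b in bits]
    # E[report] = (2p - 1) * E[bit], so divide out the bias factor.
    return sum(reports) / len(reports) / (2 * p - 1)

random.seed(0)
data = [1] * 7000 + [-1] * 3000  # true mean 0.4
print(estimate_mean(data, eps=1.0))
```

The debiasing factor 2p − 1 shrinks as ε → 0, which is exactly why the variance of a local estimate grows as privacy tightens; the O(1/(ετ)²) sample bound for simulating a tolerance-τ statistical query, quoted below from [39], reflects the same effect.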
The local model of privacy can be contrasted with the standard, or central, model of differential privacy where the entire dataset is held by the learning algorithm whose output needs to satisfy differential privacy [25]. This is a stronger model and an ε-LDP algorithm also satisfies ε-differential privacy.

Equivalence to statistical queries: The statistical query model of Kearns [40] is defined by having access to a STAT_P(τ) oracle, where P is the unknown data distribution. To solve a learning problem in this model an algorithm needs to succeed for any valid oracle responses (that is, responses satisfying the guarantees on the tolerance). In other words, the guarantees of the algorithm should hold in the worst case over the responses of the oracle. A randomized learning algorithm needs to succeed for any SQ oracle whose responses may depend on all the queries asked so far but not on the internal randomness of the learning algorithm.

A special case of statistical queries are counting or linear queries, in which the distribution P is uniform over the elements of a given database S ∈ Z^n. In other words, the goal is to estimate the empirical mean of φ on the given set of data points. This setting is studied extensively in the literature on differential privacy (see [26] for an overview) and our discussion applies to this setting as well.

For an algorithm in the LDP and SQ models we say that the algorithm is non-interactive (or non-adaptive) if all its queries are determined before observing any of the oracle's responses. Similarly, we say that the algorithm is label-non-adaptive if all the queries that depend on the oracle's responses are label-independent (the query function depends only on the point).

Kasiviswanathan et al.
[39] show that one can simulate a STAT_P(τ) oracle with success probability 1 − δ by an ε-LDP algorithm using the LR_S oracle for S containing n = O(log(1/δ)/(ετ)²) i.i.d. samples from P. This has the following implication for simulating SQ algorithms.

Theorem 2.2 ([39]). Let A_SQ be an algorithm that makes at most t queries to STAT_P(τ). Then for every ε > 0 and δ > 0 there is an ε-local algorithm A that uses the LR_S oracle for S containing n ≥ n_0 = O(t log(t/δ)/(ετ)²) i.i.d. samples from P and produces the same output as A_SQ (for some valid answers of STAT_P(τ)) with probability at least 1 − δ. Further, if A_SQ is non-interactive then A is non-interactive.

Kasiviswanathan et al. [39] also prove a converse of this theorem.

Theorem 2.3 ([39]). Let A be an ε-LDP algorithm that makes at most t queries to LR_S for S drawn i.i.d. from P^n. Then for every δ > 0 there is an SQ algorithm A_SQ that in expectation makes O(t · e^ε) queries to STAT_P(τ) for τ = Θ(δ/(e^{2ε} t)) and produces the same output as A with probability at least 1 − δ. Further, if A is non-interactive then A_SQ is non-interactive.

PAC learning and margin complexity: Our results are for the standard PAC model of learning [54].

Definition 2.4. Let X be a domain and C be a class of Boolean functions over X. An algorithm A is said to PAC learn C with error α if for every distribution D over X and f ∈ C, given access (via an oracle or samples) to the input distribution over examples (x, f(x)) for x ∼ D, the algorithm outputs a function h such that Pr_D[f(x) ≠ h(x)] ≤ α with probability at least 2/3.

We say that the learning algorithm is efficient if its running time is polynomial in log |X|, log |C| and 1/ε.

For dimension d, we denote by B_d(1) the unit ball in the ℓ_2 norm in R^d.

Definition 2.5.
Let X be a domain and C be a class of Boolean functions over X. The margin complexity of C, denoted MC(C), is the minimal number M ≥ 0 such that for some d, there is an embedding Ψ : X → B_d(1) for which the following holds: for every f ∈ C there is w ∈ B_d(1) such that

min_{x∈X} { f(x) · ⟨w, Ψ(x)⟩ } ≥ 1/M.

As pointed out in [30], margin complexity2 is equivalent (up to a polynomial) to the existence of a (possibly randomized) algorithm that outputs a small set of functions such that with significant probability one of those functions is correlated with the target function. The upper bound in [30] was sharpened by Kallweit and Simon [38], although they proved it only for deterministic algorithms (which corresponds to a single fixed set of functions and is referred to as the CSQ dimension). It is however easy to see that their sharper bound extends to randomized algorithms with an appropriate adjustment of the bound, and we give the resulting statement below:

Lemma 2.6 ([30, 38]). Let X be a domain and C be a class of Boolean functions over X. Assume that there exists a (possibly randomized) algorithm A that generates a set of functions h_1, . . . , h_m satisfying: for every f ∈ C and distribution D over X, with probability at least β > 0 (over the randomness of A) there exists i ∈ [m] such that |E_{x∼D}[f(x)h_i(x)]| ≥ 1/m. Then

MC(C) ≤ (2/β) · m^{3/2}.

The conditions in Lemma 2.6 are also known to be necessary for low margin complexity.

Lemma 2.7 ([30, 38]). Let X be a domain, C be a class of Boolean functions over X and d = MC(C). Then for m = O(ln(|C||X|) d²), there exists a set of functions h_1, . . . , h_m satisfying: for every f ∈ C and distribution D over X there exists i ∈ [m] such that |E_{x∼D}[f(x)h_i(x)]| ≥ 1/m.

3 Lower bounds for label-non-adaptive algorithms

We prove the SQ version of our lower bound. Theorem 1.1 then follows immediately by applying the simulation result from Theorem 2.3.

Theorem 3.1. Let C be a class of Boolean functions closed under negation. Assume that for some m there exists a label-non-adaptive, possibly randomized, SQ algorithm A that, with success probability at least 2/3, PAC learns C distribution-independently with error less than 1/2 using at most m queries to STAT(1/m). Then MC(C) ≤ 6m^{3/2}.

Proof. We first recall a simple observation from [14] that allows us to decompose each statistical query into a correlational part and a label-independent part. Namely, for a function φ : X × {−1, 1} → [−1, 1],

φ(x, y) = ((1 − y)/2) · φ(x, −1) + ((1 + y)/2) · φ(x, 1) = (φ(x, −1) + φ(x, 1))/2 + y · (φ(x, 1) − φ(x, −1))/2.

For a query φ, we will use h and g to denote the parts of the decomposition φ(x, y) = g(x) + y · h(x):

h(x) := (φ(x, 1) − φ(x, −1))/2   and   g(x) := (φ(x, 1) + φ(x, −1))/2.

For every input distribution D and target function f, we define the following SQ oracle. Given a query φ, if |E_D[f(x)h(x)]| ≥ 1/m then the oracle provides the exact expectation E_D[φ(x, f(x))] as the response. Otherwise, it answers with E_D[g(x)]. Note that, by the properties of the decomposition, this is a valid implementation of the SQ oracle.

Let A(r) denote A with its random bits set to r, where r is drawn from some distribution R. Let φ^r_1, . . . , φ^r_{m′} : X × {−1, 1} → [−1, 1] be the statistical queries asked by A(r) that depend on the label (where m′ ≤ m).
Note that, by the definition of a label-non-adaptive SQ algorithm, all these queries are fixed in advance and do not depend on the oracle's answers. Let g^r_i and h^r_i denote the decomposition of these queries into correlational and label-independent parts. Let h^r_{f,D} denote the hypothesis output by A(r) when used with the SQ oracle defined above.

We claim that if A achieves error < 1/2 with probability at least 2/3, then for every f ∈ C and distribution D, with probability at least 1/3, there exists i ∈ [m′] such that |E_D[f(x)h^r_i(x)]| ≥ 1/m (satisfying the conditions of Lemma 2.6 with β = 1/3). To see this, assume for the sake of contradiction that for some distribution D and function f ∈ C,

Pr_{r∼R}[r ∈ T(f, D)] > 2/3,

where T(f, D) is the set of all random strings r such that for all i ∈ [m′], |E_D[f(x)h^r_i(x)]| < 1/m. Let S(f, D) denote the set of random strings r for which A succeeds (with the given SQ oracle), that is Pr_D[f(x) ≠ h^r_{f,D}(x)] < 1/2. By our assumption, Pr_{r∼R}[r ∈ S(f, D)] ≥ 2/3 and therefore

Pr_{r∼R}[r ∈ T(f, D) ∩ S(f, D)] > 1/3.    (1)

Now, observe that T(−f, D) = T(f, D) and, in particular, the answers of our SQ oracle to A(r)'s queries are identical for f and −f whenever r ∈ T(f, D). Further, if Pr_D[f(x) ≠ h^r_{f,D}(x)] < 1/2 then Pr_D[−f(x) ≠ h^r_{f,D}(x)] > 1/2. This means that for every r ∈ T(f, D) ∩ S(f, D), A(r) fails when the target function is −f and the distribution is D (by definition, −f ∈ C). By eq. (1) we obtain that A fails with probability > 1/3 for −f and D. This contradicts our assumption and therefore we obtain that

Pr_{r∼R}[r ∉ T(f, D)] ≥ 1/3.

By Lemma 2.6, we obtain the claim.

2The results there are stated in terms of another notion that is closely related to margin complexity. Namely, the smallest dimension d for which there exists a mapping of X to {0, 1}^d such that every f ∈ C becomes expressible as a majority function over some subset T ⊆ [d] of variables. See the discussion in Sec. 6 of [38].

3.1 Applications

We will now spell out several easy corollaries of our lower bound, the simulation results and existing SQ algorithms. Together they imply the claimed separations for halfspaces and decision lists. We start with the class of halfspaces over {0, 1}^d, which we denote by C_HS. The lower bound on the margin complexity of halfspaces is implied by a celebrated work of Goldmann et al. [34] on the complexity of linear threshold circuits (the connection of this result to margin complexity is due to Sherstov [48]):

Theorem 3.2 ([34, 48]). MC(C_HS) = 2^{Ω(d)}.

We denote the class of decision lists over {0, 1}^d by C_DL (see [41] for a standard definition). A lower bound on the margin complexity of decision lists was derived by Buhrman et al. [15] in the context of communication complexity.

Theorem 3.3 ([15]). MC(C_DL) = 2^{Ω(d^{1/3})}.

Combining these results with Theorem 1.1 we obtain the lower bounds on the complexity of LDP algorithms for learning linear classifiers and decision lists given in Corollary 1.5.

Learnability of decision lists using statistical queries is a classical result of Kearns [40]. Applying the simulation in Theorem 2.2 we obtain polynomial-time learnability of this class by (interactive) LDP algorithms.

Theorem 3.4 ([40]). For every ε, α > 0, there exists an ε-LDP learning algorithm that PAC learns C_DL with error α using poly(d/(εα)) i.i.d.
examples (with one query per example).

In the case of halfspaces, Dunagan and Vempala [24] give the first efficient algorithm for PAC learning halfspaces (their description is not in the SQ model but it is known that their algorithm can be easily converted to the SQ model [5]). Applying Theorem 2.2 we obtain learnability of this class by (interactive) LDP algorithms.

Theorem 3.5 ([24, 5]). For every ε, α > 0, there exists an ε-LDP learning algorithm that PAC learns C_HS with error α using poly(d/(εα)) i.i.d. examples (with one query per example).

4 Label-non-adaptive learning algorithm for halfspaces

Our algorithm for learning large-margin halfspaces relies on the formulation of the problem of learning a halfspace as the following convex optimization problem. Proofs of results in this section can be found in the full version [16].

Lemma 4.1. Let P be a distribution on B_d(1) × {−1, 1}. Suppose that there is a vector w* ∈ B_d(1) such that Pr_{(x,ℓ)∼P}[⟨w*, ℓx⟩ ≥ γ] = 1. Let (e_1, . . . , e_d) denote the standard basis of R^d and let w be a unit vector such that for α, β ∈ (0, 1)

F(w) := E_{(x,ℓ)∼P}[ Σ_{i=1}^d ( |⟨w + γe_i, x⟩| − ⟨w + γe_i, ℓx⟩ + |⟨w − γe_i, x⟩| − ⟨w − γe_i, ℓx⟩ ) ] ≤ αβ.    (2)

Then, F(w*) = 0 and

Pr_P[ ⟨w, ℓx⟩ ≥ −β/2 + γ²/√d ] ≥ 1 − α.

In particular, if β < 2γ²/√d then Pr_P[⟨w, ℓx⟩ > 0] ≥ 1 − α.

We now describe how to solve the convex optimization problem given in Lemma 4.1.
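To make the structure of this objective concrete: a subgradient of F splits into a label-dependent part that is just the fixed vector c = E[ℓx] (estimable up front by d non-adaptive correlational queries) and label-independent sign terms that depend on the current iterate but never on the labels. The sketch below (our own plain-numpy illustration on samples, not the paper's SQ/LDP implementation; all function and variable names are ours) evaluates the empirical objective and this decomposition:

```python
import numpy as np

def F(w, X, L, gamma):
    """Empirical version of the objective of Lemma 4.1: the sum over i of
    E[|<w+g e_i, x>| - <w+g e_i, lx> + |<w-g e_i, x>| - <w-g e_i, lx>]."""
    d = X.shape[1]
    total = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = gamma
        for v in (w + e, w - e):
            s = X @ v
            total += (np.abs(s) - L * s).mean()
    return total

def subgradient(w, X, L, gamma):
    """Subgradient of F at w. The labels enter only through the fixed
    vector c = E[lx] (d non-adaptive correlational queries); the sign
    terms depend adaptively on w but are label-independent."""
    d = X.shape[1]
    c = (L[:, None] * X).mean(axis=0)   # label-dependent, fixed in advance
    g = -2.0 * d * c
    for i in range(d):
        e = np.zeros(d)
        e[i] = gamma
        g += (np.sign(X @ (w + e))[:, None] * X).mean(axis=0)
        g += (np.sign(X @ (w - e))[:, None] * X).mean(axis=0)
    return g

# Synthetic data in the unit ball with margin gamma around w_star.
rng = np.random.default_rng(0)
d, n, gamma = 5, 400, 0.2
w_star = np.ones(d) / np.sqrt(d)
X = rng.uniform(-1, 1, size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
margins = X @ w_star
keep = np.abs(margins) >= gamma
X, L = X[keep], np.sign(margins[keep])

print(F(w_star, X, L, gamma))   # ~0 at a margin-gamma separator, as the lemma states
```

Plugging this subgradient into any projected-subgradient or SQ-based stochastic convex optimization routine over the unit ball recovers the label-non-adaptive algorithm pattern: the correlational queries are issued once in a single batch, and all later (adaptive) queries ignore the labels.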
Both the running time and the accuracy of the queries of our solution depend on the ambient dimension d. This dimension is not necessarily upper-bounded by a polynomial in MC(C). However, the well-known random projection argument shows that the dimension can be reduced to O(log(1/δ) · MC(C)²) at the expense of a small multiplicative decrease in the margin and probability at most δ of failure (for every individual point, over the randomness of the random projection) [4, 12]. This fact together with Markov's inequality implies the following standard lemma:

Lemma 4.2. Let d be an arbitrary dimension. For every δ and γ, there exists a distribution Ψ over mappings ψ : B_d(1) → B_{d′}(1), where d′ = O(log(1/δ)/γ²), such that for every distribution D and function f over B_d(1), if there exists w ∈ B_d(1) such that Pr_{x∼D}[f(x) · ⟨w, x⟩ ≥ γ] = 1 then

Pr_{ψ∼Ψ}[ ∃w′, Pr_{x∼D}[f(x) · ⟨w′, ψ(x)⟩ ≥ γ/2] ≥ 1 − δ ] ≥ 1 − δ.

Lemma 4.2 ensures that at most a tiny fraction δ of the points (according to D) does not satisfy the margin condition. This is not an issue since we will be implementing our algorithm in the SQ model which, by definition, allows any of the answers to its queries to be imprecise. The lemma also allows a tiny probability that the mapping fails altogether (making the overall algorithm randomized). Therefore the only ingredient missing for establishing Theorem 1.2 is a label-non-adaptive SQ algorithm that solves the convex optimization problem in dimension d using a number of queries, and (the inverse of) a tolerance, polynomial in d, 1/γ and 1/α:

Lemma 4.3. Let P be a distribution on B_d(1) × {−1, 1}.
Suppose that there is a vector w* ∈ B_d(1) such that Pr_{(x,ℓ)∼P}[⟨w*, ℓx⟩ ≥ γ] = 1. There is a label-non-adaptive SQ algorithm that for every α ∈ (0, 1) uses O(d⁴/(γ⁴α²)) queries to STAT_P(Ω(γ⁴α²/d³)), and finds a vector w such that Pr_P[⟨w, ℓx⟩ > 0] ≥ 1 − α.

5 Implications for distributed learning with communication constraints

In this section we briefly define the model of bounded communication per sample, state the known equivalence results to the SQ model and spell out the immediate corollary of our lower bound.

In the bounded communication model [11, 50] it is assumed that the total number of bits learned by the server about each data sample is bounded by ℓ for some ℓ ≪ log |Z|. As in the case of LDP, this is modeled by using an appropriate oracle for accessing the dataset.

Definition 5.1. We say that an algorithm R : Z → {0, 1}^ℓ extracts ℓ bits. For a dataset S ∈ Z^n, a COMM_S oracle takes as input an index i and an algorithm R and outputs a random value w obtained by applying R(z_i). An algorithm is ℓ-bit communication bounded if it accesses S only via the COMM_S oracle with the following restriction: for all i ∈ [n], if COMM_S(i, R_1), ..., COMM_S(i, R_k) are the algorithm's invocations of COMM_S on index i, where each R_j extracts ℓ_j bits, then Σ_{j∈[k]} ℓ_j ≤ ℓ.

We use (non-)adaptive in the same sense as we do for LDP. As first observed by Ben-David and Dichterman [11], it is easy to simulate a single query to STAT_P(τ) by extracting a single bit from each of O(1/τ²) samples. This gives the following simulation.

Theorem 5.2 ([11]). Let A_SQ be an algorithm that makes at most t queries to STAT_P(τ).
Then for every δ > 0 there is a 1-bit communication bounded algorithm A that uses the COMM_S oracle for S containing n ≥ n₀ = O(t log(t/δ)/τ²) i.i.d. samples from P and produces the same output as A_SQ (for some valid answers of STAT_P(τ)) with probability at least 1 − δ. Further, if A_SQ is non-interactive then A is non-interactive.

The converse of this theorem for the simpler COMM oracle that accesses each sample once was given in [11, 31]. For the stronger oracle in Definition 5.1, the converse was given by Steinhardt et al. [50].

Theorem 5.3 ([50]). Let A be an ℓ-bit communication bounded algorithm that makes queries to COMM_S for S drawn i.i.d. from P^n. Then for every δ > 0, there is an SQ algorithm A_SQ that makes 2nℓ queries to STAT_P(δ/(2^{ℓ+1}n)) and produces the same output as A with probability at least 1 − δ. Further, if A is non-interactive then A_SQ is non-interactive.

Note that in this simulation we do not need to assume a separate bound on the number of queries since at most ℓn queries can be asked.

A direct corollary of Theorems 3.1 and 5.3 is the following lower bound:

Corollary 5.4. Let C be a class of Boolean functions closed under negation. Any label-non-adaptive ℓ-bit communication bounded algorithm that PAC learns C with error less than 1/2 and success probability at least 3/4 using queries to COMM_S for S drawn i.i.d. from P^n must have n ≥ MC(C)^{2/3}/2^ℓ.

Our other results can be extended analogously.

6 Discussion

Our work shows that polynomial margin complexity is a necessary and sufficient condition for efficient distribution-independent PAC learning of a class of binary classifiers by a label-non-adaptive SQ/LDP/limited-communication algorithm.
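The limited-communication side of this equivalence rests on simulations such as Theorem 5.2: a statistical query φ : Z → [0, 1] can be answered by having each sample contribute a single randomized-rounded bit. The sketch below is a toy illustration of this standard mechanism; the Bernoulli rounding and the Gaussian example are our own choices, not the exact construction of [11]:

```python
import numpy as np

def one_bit_query(phi, samples, rng):
    """Answer one statistical query using a single bit per sample.

    phi must take values in [0, 1]. Each sample z_i contributes one bit
    b_i ~ Bernoulli(phi(z_i)); the mean of the bits is an unbiased
    estimate of E[phi(z)], accurate within tau with high probability
    once len(samples) = O(log(1/delta) / tau**2).
    """
    probs = np.array([phi(z) for z in samples])
    bits = rng.random(len(samples)) < probs   # one extracted bit per sample
    return bits.mean()

rng = np.random.default_rng(1)
samples = rng.normal(size=40000)              # z ~ N(0, 1)
phi = lambda z: float(z > 0)                  # query value: Pr[z > 0] = 1/2
est = one_bit_query(phi, samples, rng)
# est approximates 0.5 within roughly 1/sqrt(n)
```

Because each query uses fresh samples and reveals one bit per sample, a non-interactive batch of such queries stays within the 1-bit communication budget of Theorem 5.2.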
A natural question left open is whether there exists an efficient and fully non-interactive algorithm for any class of polynomial margin complexity. We conjecture that the answer is "no", in which case the question becomes how to characterize the problems that are learnable by non-interactive algorithms. See [17] for a more detailed discussion of this open problem.

A significant limitation of our result is that it does not rule out even a 2-round algorithm for learning halfspaces (or decision lists). This is, again, in contrast to the fact that known learning algorithms for these classes require at least d rounds of interaction. We believe that extending our lower bounds to multiple-round algorithms and quantifying the tradeoff between the number of rounds and the complexity of learning is an important direction for future work.

References

[1] Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. Inference under information constraints I: Lower bounds from chi-square contraction. arXiv preprint arXiv:1812.11476, 2018.

[2] M. A. Aizerman, E. A. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[3] Apple's Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 1(9), December 2017.

[4] R. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 616–623, 1999.

[5] Maria-Florina Balcan and Vitaly Feldman. Statistical active learning algorithms for noise tolerance and differential privacy. Algorithmica, 72(1):282–315, 2015.

[6] Eric Balkanski and Yaron Singer. The adaptive complexity of maximizing a submodular function.
In\n\nSTOC, pages 1138\u20131151, 2018.\n\n[7] Eric Balkanski and Yaron Singer. Parallelization does not accelerate convex optimization: Adaptivity lower\nbounds for non-smooth convex minimization. CoRR, abs/1808.03880, 2018. URL http://arxiv.org/\nabs/1808.03880.\n\n[8] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. The limitations of optimization from samples. In\n\nSTOC, 2017.\n\n[9] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An exponential speedup in parallel running time for\n\nsubmodular maximization without loss in approximation. In SODA, pages 283\u2013302, 2019.\n\n[10] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman.\n\nAlgorithmic stability for adaptive data analysis. In STOC, pages 1046\u20131059, 2016.\n\n[11] Shai Ben-David and Eli Dichterman. Learning with restricted focus of attention. J. Comput. Syst. Sci., 56\n\n(3):277\u2013298, 1998.\n\n[12] Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. Limitations of learning via embeddings in euclidean\nhalf spaces. Journal of Machine Learning Research, 3:441\u2013461, 2002. URL http://www.jmlr.org/\npapers/v3/bendavid02a.html.\n\n[13] Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin\n\nclassi\ufb01ers. In COLT, pages 144\u2013152. ACM, 1992.\n\n[14] N. Bshouty and V. Feldman. On using extended statistical queries to avoid membership queries. Journal of\n\nMachine Learning Research, 2:359\u2013395, 2002.\n\n[15] H. Buhrman, N. Vereshchagin, and R. de Wolf. On computation and communication with small bias. In\n\nIEEE Conference on Computational Complexity, pages 24\u201332, 2007.\n\n[16] Amit Daniely and Vitaly Feldman. Locally private learning without interaction requires separation. arXiv\n\npreprint arXiv:1809.09165, 2018.\n\n[17] Amit Daniely and Vitaly Feldman. Open problem: Is margin suf\ufb01cient for non-interactive private\ndistributed learning? In COLT, pages 3180\u20133184, 2019. 
URL http://proceedings.mlr.press/v99/daniely19a.html.

[18] Jelena Diakonikolas and Cristóbal Guzmán. Lower bounds for parallel and randomized convex optimization. CoRR, abs/1811.01903, 2018. URL http://arxiv.org/abs/1811.01903.

[19] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In 31st Conference on Neural Information Processing Systems (NIPS), pages 3574–3583, 2017.

[20] John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. arXiv preprint arXiv:1902.00582, 2019.

[21] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In FOCS, pages 429–438, 2013.

[22] John C. Duchi, Martin J. Wainwright, and Michael I. Jordan. Local privacy and minimax bounds: Sharp rates for probability estimation. In NIPS, pages 1529–1537, 2013.

[23] John C. Duchi, Feng Ruan, and Chulhee Yun. Minimax bounds on stochastic batched convex optimization. In COLT, pages 3065–3162, 2018. URL http://proceedings.mlr.press/v75/duchi18a.html.

[24] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In STOC, pages 315–320, 2004.

[25] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.

[26] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy, volume 9. 2014. URL http://dx.doi.org/10.1561/0400000042.

[27] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. CoRR, abs/1411.2664, 2014. Extended abstract in STOC 2015.

[28] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: randomized aggregatable privacy-preserving ordinal response.
In ACM SIGSAC Conference on Computer and Communications Security,\npages 1054\u20131067, 2014.\n\n[29] Alexandre V. Ev\ufb01mievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in\n\nprivacy preserving data mining. In PODS, pages 211\u2013222, 2003.\n\n[30] V. Feldman. Evolvability from learning algorithms. In STOC, pages 619\u2013628, 2008.\n\n[31] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh Vempala, and Ying Xiao. Statistical algorithms\nand a lower bound for detecting planted cliques. arXiv, CoRR, abs/1201.1214, 2012. Extended abstract in\nSTOC 2013.\n\n[32] Vitaly Feldman, Cristobal Guzman, and Santosh Vempala. Statistical query algorithms for mean vector\nestimation and stochastic convex optimization. CoRR, abs/1512.09170, 2015. URL http://arxiv.org/\nabs/1512.09170. Extended abstract in SODA 2017.\n\n[33] J\u00fcrgen Forster, Niels Schmitt, and Hans Ulrich Simon. Estimating the optimal margins of embeddings in\n\neuclidean half spaces. In Proceedings of COLT 2001 and EuroCOLT 2001, pages 402\u2013415, 2001.\n\n[34] M. Goldmann, J. H\u00e5stad, and A. Razborov. Majority gates vs. general weighted threshold gates. Computa-\n\ntional Complexity, 2:277\u2013300, 1992.\n\n[35] M. Hardt and J. Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, pages\n\n454\u2013463, 2014.\n\n[36] Matthew Joseph, Jieming Mao, Seth Neel, and Aaron Roth. The role of interactivity in local differential\n\nprivacy. CoRR, abs/1904.03564, 2019. URL http://arxiv.org/abs/1904.03564.\n\n[37] Matthew Joseph, Jieming Mao, and Aaron Roth. Exponential separations in local differential privacy.\n\n2019.\n\n[38] M. Kallweit and H. Simon. A close look to margin complexity and related parameters. In COLT, pages\n\n437\u2013456, 2011.\n\n[39] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith.\n\nWhat can we learn privately? SIAM J. Comput., 40(3):793\u2013826, June 2011.\n\n[40] M. Kearns. 
Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.

[41] M. Kearns and U. Vazirani. An introduction to computational learning theory. MIT Press, Cambridge, MA, 1994.

[42] Nati Linial and Adi Shraibman. Learning complexity vs communication complexity. Comb. Probab. Comput., 18(1-2):227–245, March 2009. ISSN 0963-5483.

[43] Zhi-Quan Luo. Universal decentralized estimation in a bandwidth constrained sensor network. IEEE Transactions on Information Theory, 51(6):2210–2219, 2005.

[44] A. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on Mathematical Theory of Automata, volume XII, pages 615–622, 1962.

[45] Ram Rajagopal, Martin J Wainwright, and Pravin Varaiya. Universal quantile estimation with feedback in the communication-constrained setting. In 2006 IEEE International Symposium on Information Theory, pages 836–840. IEEE, 2006.

[46] Alejandro Ribeiro and Georgios B Giannakis. Bandwidth-constrained distributed estimation for wireless sensor networks - part I: Gaussian case. IEEE Transactions on Signal Processing, 54(3):1131–1143, 2006.

[47] R. Rivest. Learning decision lists. Machine Learning, 2(3):229–246, 1987.

[48] Alexander A. Sherstov. Halfspace matrices. Computational Complexity, 17(2):149–178, 2008.

[49] Adam D. Smith, Abhradeep Thakurta, and Jalaj Upadhyay. Is interaction necessary for distributed private learning? In 2017 IEEE Symposium on Security and Privacy, SP 2017, pages 58–77, 2017.

[50] J. Steinhardt, G. Valiant, and S. Wager. Memory, communication, and statistical queries. In COLT, pages 1490–1516, 2016.

[51] Jacob Steinhardt and John C. Duchi. Minimax rates for memory-bounded sparse linear regression. In COLT, pages 1564–1587, 2015. URL http://jmlr.org/proceedings/papers/v40/Steinhardt15.html.

[52] Thomas Steinke and Jonathan Ullman.
Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, pages 1588–1628, 2015. URL http://jmlr.org/proceedings/papers/v40/Steinke15.html.

[53] Ananda Theertha Suresh, Felix X Yu, H Brendan McMahan, and Sanjiv Kumar. Distributed mean estimation with limited communication. arXiv preprint arXiv:1611.00429, 2016.

[54] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[55] Stanley L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. J. of the American Statistical Association, 60(309):63–69, 1965.

[56] Blake E. Woodworth, Jialei Wang, Adam D. Smith, Brendan McMahan, and Nati Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In NeurIPS, pages 8505–8515, 2018. URL http://papers.nips.cc/paper/8069-graph-oracle-models-lower-bounds-and-gaps-for-parallel-stochastic-optimization.

[57] Yuchen Zhang, John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, pages 2328–2336, 2013.