{"title": "Communication-Efficient Distributed Learning of Discrete Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 6391, "page_last": 6401, "abstract": "We initiate a systematic investigation of distribution learning (density estimation) when the data is distributed across multiple servers. The servers must communicate with a referee and the goal is to estimate the underlying distribution with as few bits of communication as possible. We focus on non-parametric density estimation of discrete distributions with respect to the l1 and l2 norms. We provide the first non-trivial upper and lower bounds on the communication complexity of this basic estimation task in various settings of interest. Specifically, our results include the following: 1. When the unknown discrete distribution is unstructured and each server has only one sample, we show that any blackboard protocol (i.e., any protocol in which servers interact arbitrarily using public messages) that learns the distribution must essentially communicate the entire sample. 2. For the case of structured distributions, such as k-histograms and monotone distributions, we design distributed learning algorithms that achieve significantly better communication guarantees than the naive ones, and obtain tight upper and lower bounds in several regimes. Our distributed learning algorithms run in near-linear time and are robust to model misspecification. 
Our results provide insights on the interplay between structure and communication efficiency for a range of fundamental distribution estimation tasks.", "full_text": "Communication-Efficient Distributed Learning of Discrete Probability Distributions

Ilias Diakonikolas (CS, USC, diakonik@usc.edu), Abhiram Natarajan (CS, Purdue, nataraj2@purdue.edu), Elena Grigorescu (CS, Purdue, elena-g@purdue.edu), Jerry Li (EECS & CSAIL, MIT, jerryzli@mit.edu), Krzysztof Onak (IBM Research, NY, konak@us.ibm.com), Ludwig Schmidt (EECS & CSAIL, MIT, ludwigs@mit.edu)

Abstract

We initiate a systematic investigation of distribution learning (density estimation) when the data is distributed across multiple servers. The servers must communicate with a referee and the goal is to estimate the underlying distribution with as few bits of communication as possible. We focus on non-parametric density estimation of discrete distributions with respect to the ℓ1 and ℓ2 norms. We provide the first non-trivial upper and lower bounds on the communication complexity of this basic estimation task in various settings of interest. Specifically, our results include the following:

1. When the unknown discrete distribution is unstructured and each server has only one sample, we show that any blackboard protocol (i.e., any protocol in which servers interact arbitrarily using public messages) that learns the distribution must essentially communicate the entire sample.

2. For the case of structured distributions, such as k-histograms and monotone distributions, we design distributed learning algorithms that achieve significantly better communication guarantees than the naive ones, and obtain tight upper and lower bounds in several regimes.
Our distributed learning algorithms run in near-linear time and are robust to model misspecification.

Our results provide insights on the interplay between structure and communication efficiency for a range of fundamental distribution estimation tasks.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

1.1 Background and Motivation

We study the problem of distribution learning (or density estimation) in a distributed model, where the data comes from an unknown distribution and is partitioned across multiple servers. The main goal of this work is to explore the inherent tradeoff between sample size and communication for non-parametric density estimation of discrete distributions. We seek answers to the following questions: What is the minimum amount of communication required to learn the underlying distribution of the data? Is there a communication-efficient learning algorithm that runs in polynomial time? We obtain the first non-trivial algorithms and lower bounds for distributed density estimation. Before we state our results, we provide the relevant background.

Density Estimation. Distribution learning or density estimation is the following prototypical inference task: Given samples drawn from an unknown target distribution that belongs to (or is well-approximated by) a given family of distributions P, the goal is to approximately estimate (learn) the target distribution. Estimating a distribution from samples is a fundamental unsupervised learning problem that has been studied in statistics since the late nineteenth century [36]. The classical statistics literature focuses primarily on the sample complexity of distribution learning, i.e., on the information-theoretic aspects of the problem.
More recently, there has been a large body of work in computer science on this topic with an explicit focus on computational efficiency [12, 11, 7, 8, 1, 13, 2]. We emphasize that the aforementioned literature studies density estimation in the centralized setting, where all the data samples are available on a single machine.

Distributed Computation. In recent years, we have seen an explosion in the amount of data that has been generated and collected across various scientific and technological domains [10]. Due to the size and heterogeneity of modern datasets, there is a real need for the design of efficient algorithms that succeed in the distributed model, when the data is partitioned across multiple servers. A major bottleneck in distributed computation is the communication cost between individual machines. In practice, communication may be limited by bandwidth constraints and power consumption, leading to either slow or expensive systems (see, e.g., [23] for a survey). Hence, the general problem of designing communication-efficient distributed protocols is of fundamental importance in this setting. In recent years, a number of statistical estimation problems have been algorithmically studied in the distributed setting [3, 16, 15, 40, 21, 30, 24, 33, 5, 29]. To the best of our knowledge, the problem of nonparametric density estimation has not been previously studied in this context.

This Work: Distributed Density Estimation. We initiate a systematic investigation of density estimation in the distributed model. We believe that this is a fundamental problem that merits investigation in its own right.
Also, the problem of distributed density estimation arises in various real-data applications where the data distribution must be reconstructed from scattered measurements. Examples include sensor networks and P2P systems (see, e.g., [35, 32, 27, 41, 37] and references therein).

We explore the tradeoff between communication and statistical efficiency for a number of fundamental nonparametric density estimation problems. Specifically, we insist that our algorithms are sample-efficient, and our goal is to design distributed protocols using a minimum amount of communication. As our main contribution, we provide the first non-trivial upper and lower bounds on the communication complexity of density estimation for a range of natural distribution families that have been extensively studied in the centralized regime. The main conceptual message of our findings is the following: When the underlying discrete distribution is unstructured, no non-trivial communication protocol is possible. In sharp contrast, for various families of structured distributions, there are non-trivial algorithms whose communication complexity significantly improves over naive protocols. It should be noted that all our algorithms are also computationally efficient.

Communication Model for Density Estimation. We now informally describe the communication model used in this paper; we refer to the preliminaries in Section 2 for formal definitions. The model is parameterized by the number of samples per server (player), which we denote by s. There is some number of servers, each holding s independent samples from an unknown distribution P. We call these servers sample-holding players. Additionally, there is a server that holds no samples from P; we call this server the referee or fusion center.
In the communication protocols considered in this work, servers exchange messages, and at the end of the protocol the referee outputs an accurate hypothesis distribution P̂. More precisely, we want the hypothesis P̂ to satisfy d(P̂, P) ≤ ε with high probability (over the samples and internal randomness), where the metric d is either the ℓ1-norm (statistical distance) or the ℓ2-norm.

We study two variants of this model. In the simultaneous communication model, each sample-holding player sends a message (of one or more bits) to the referee once, based only on the samples she holds and public randomness. In the blackboard model, the sample-holding players' messages are public, and the communication protocol does not restrict the number of times a player may speak. The goal is to minimize the amount of communication between the players and the referee, while transmitting enough information about the samples so that the underlying distribution P can be approximately recovered from the transcript of the communication.

Table 1: Communication complexity bounds for density estimation of unstructured distributions (for success probability 9/10)

Regime      | CC_{s,1/10}(ADE(Dn, 1, ε, α)) | CC→_{s,1/10}(ADE(Dn, 1, ε, α))
s = 1       | Ω((n/ε²) log n)               | O((n/ε²) log n)
s = Θ(n)    | Ω(n log(1/ε))                 | O(n/ε²)
s = Θ(n/ε²) | Ω(n log(1/ε))                 | O(n log(1/ε))

1.2 Our Contributions

In this section, we provide informal statements of our main results; for the formal statements of all our results, the reader is referred to the full version of the paper. We will require the following notation. We use n to denote an upper bound on the domain size of our distributions and α to denote the total sample size.
Without loss of generality, we will assume that the domain of the distributions is the set [n] := {1, 2, ..., n}. The ℓ1 (resp. ℓ2) distance between two discrete distributions is the ℓ1 (resp. ℓ2) norm of the difference between their probability vectors. We note that the sample sizes in this section correspond to a high constant probability of success; this can be boosted to high probability by standard techniques.

We start by pointing out the baseline result that we compare against. The naive protocol for distributed density estimation is the following: all the servers (players) communicate their entire sample to the referee, who applies a centralized estimator to output an accurate hypothesis. The communication complexity of this approach is Θ(α log n) bits. The obvious question is whether there exists a protocol with significantly smaller communication complexity.

Unstructured Discrete Distributions. Our starting point is the basic setting in which the underlying distribution over n elements is potentially arbitrary and each server (player) holds exactly one sample from an unknown distribution over a domain of size n. (This basic setting is motivated by practical applications, e.g., aggregation of cell-phone data.) In the centralized setting, it is a folklore fact (see, e.g., [19]) that Θ(n/ε²) samples are necessary and sufficient to learn an unstructured distribution supported on n elements within ℓ1-error ε. This fact in turn implies that the naive distributed protocol uses O((n/ε²) log n) bits. We show that this protocol is best possible, up to constant factors:

Theorem 1. Suppose Θ(n/ε²) samples from an unknown distribution P over [n] are distributed such that each player has exactly one sample.
Then learning P within ℓ1-distance ε requires Ω((n/ε²) log n) bits of communication in the blackboard model.

We remark that the blackboard model captures a very general interaction between sample-holding players and the referee. The players are allowed to send messages in arbitrary order and share partial information about their samples from [n], perhaps using much fewer than log n bits. For instance, if one of the players has revealed her sample, other players may just notify everyone that they hold the same (or a correlated) sample, using O(1) extra bits. Thus, our lower bound excludes the possibility of non-trivial protocols that do better than essentially having each machine transmit its entire sample. This statement might seem intuitively obvious, but its proof is not straightforward.

By a standard packing argument, we also show a communication lower bound of Ω(n log(1/ε)) for all protocols that estimate an unstructured discrete distribution over [n] in ℓ1-distance. In the regime where there are Θ(n/ε²) samples per machine, we show that there is a simple estimator that achieves this lower bound. (See Table 1 for instantiations of the theorems, and Section 2 for the formal definitions.)

Structured Discrete Distributions. In contrast to the unstructured case, we design non-trivial protocols that significantly improve upon the naive protocols in several regimes of interest. Our main algorithmic results are the first communication-efficient algorithms for robust learning of histogram distributions. A k-histogram distribution over [n] is a probability distribution that is piecewise constant over some set of k intervals over [n]. Histograms have been extensively studied in statistics and computer science. In the database community, histograms constitute the most common
In statistics, many methods have been\nproposed to estimate histogram distributions in a variety of settings [22, 34, 17, 31].\nThe algorithmic dif\ufb01culty in learning histograms lies in the fact that the location and \u201csize\u201d of\nthese intervals is a priori unknown. In the centralized setting, sample and computationally ef\ufb01cient\nalgorithms for learning histograms have been recently obtained [7, 8, 2]. Our distributed learning\nalgorithm for the (cid:96)1-metric builds on the recent centralized algorithm of [2]. In particular, we have\nthe following:\nTheorem 2. For the problem of learning k-histograms with (cid:96)1 error \u03b5, the following hold:\n\n1. In the regime of one sample per player, there exists a protocol that uses O( k\n\nbits of communication. Furthermore, any successful protocol must use \u2126(k log n\nbits of communication.\n\n\u03b5 log n+ k\nk + k\n\n\u03b53 log k\n\u03b5 )\n\u03b52 log k)\n\n2. In the regime of \u0398( k\n\nbits of communication. Furthermore, any protocol must use \u2126(k log n\ncommunication.\n\n\u03b52 ) samples per player, there exists a successful protocol with O( k\nk + k log 1\n\n\u03b5 log n)\n\u03b5 ) bits of\n\nWe now turn our attention to learning under the (cid:96)2-metric. Previous centralized algorithms for this\nproblem [1] work in a \u201cbottom-up\u201d fashion. Unfortunately, this approach does not seem amenable\nto distributed computation for the following reason: it seems impossible to keep track of a large\nnumber of intervals with limited communication. 
Instead, we devise a new "top-down" algorithm that starts with a small number of large intervals and iteratively splits them based on the incurred ℓ2-error. A careful application of this idea, in conjunction with some tools from the streaming literature (specifically, an application of the Johnson-Lindenstrauss transform to estimate the squared ℓ2 error using few bits of communication), yields the following result:

Theorem 3. For the problem of learning k-histograms with ℓ2 error ε, the following hold:

1. In the regime of s = Õ(k log n) samples per player, there exists a protocol that uses O((1/ε²) log n) bits of communication. Furthermore, any successful protocol must use Ω(k log(n/k) + (1/ε) log(εk)) bits of communication.

2. In the regime of s = ω(k log n) samples per player, there exists a protocol with Õ((k/(sε²)) log n) bits of communication. Furthermore, any successful protocol must use Ω(k log(n/k) + (1/ε) log(εk)) bits.

We remark that the above algorithms are robust to model misspecification, i.e., they provide near-optimal error guarantees even if the input distribution is only close to a histogram. As an immediate corollary, we also obtain communication-efficient learners for all families of structured discrete distributions that can be well-approximated by histograms. Specifically, by using the structural approximation results of [6, 7, 20], we obtain sample-optimal distributed estimators for various well-studied classes of structured densities, including monotone, unimodal, log-concave, and monotone hazard rate (MHR) distributions, among others. The interested reader is referred to the aforementioned works.

For specific families of structured distributions, we may be able to do better by exploiting additional structure. An example of interest is the family of monotone distributions.
By a result of Birge [4] (see also [14] for an adaptation to the discrete case), every monotone distribution over [n] is ε-close in ℓ1-distance to a k-histogram distribution for k = O(ε⁻¹ log n). Hence, an application of the above theorem yields a distributed estimation algorithm for monotone distributions. The main insight here is that each monotone distribution is well-approximated by an oblivious histogram, i.e., one whose intervals are the same for every monotone distribution. This allows us to essentially reduce the learning problem to that of learning a discrete distribution over the corresponding domain size. A reduction in the opposite direction yields the matching lower bound. Please refer to the full version for more details.

1.3 Comparison to Related Work

Recent works [40, 21, 24, 5] study the communication cost of mean estimation problems for structured, parametrized distributions. These works develop powerful information-theoretic tools to obtain lower bounds for parameter estimation problems. In what follows, we briefly explain why we need to develop new techniques by pointing out fundamental differences between the two problems.

First, our most general results on distributed density estimation do not assume any structure on the distribution (and thus, our learning algorithms are agnostic).
This is in contrast to the problems considered before, where the concept classes are restricted (Gaussians, linear separators) and enjoy a lot of structure, which is often leveraged in the design of estimators.

Second, while we also consider more structured distributions (monotone, k-histograms), the techniques developed in the study of distributed parameter estimation do not apply to our problems. Specifically, those results reduce to the problem of learning a high-dimensional vector (say, where each coordinate parametrizes a spherical Gaussian distribution), where the value at each coordinate is independent of the others. The results in distributed parameter estimation crucially use this coordinate-independence feature. The lower bounds essentially state that the communication cost for a d-dimensional parameter vector with independent components grows proportionally to the dimension d, and hence one needs to estimate each coordinate separately.

2 Preliminaries

Notation. For any positive integer n, we write [n] to denote {1, ..., n}, the set of integers between 1 and n. We think of a probability distribution P on [n] as a vector of probabilities (p1, ..., pn) that sum up to 1. We write X ∼ P to denote that a random variable X is drawn from P. Sometimes we use the notation P(i) to denote P[X = i], where X ∼ P. We consider three families of discrete distributions:

- Dn: the family of unstructured discrete distributions on [n],
- Hn,k: the family of k-histogram distributions on [n],
- Mn: the family of monotone distributions on [n].

We use ℓp metrics on spaces of probability distributions.
For two distributions P and P′ on [n], their ℓp-distance, where p ∈ [1, ∞), is defined as

‖P − P′‖_p := ( Σ_{i=1}^n |P(i) − P′(i)|^p )^{1/p}.

In this work we focus on the cases p = 1 and p = 2, in which ‖P − P′‖₁ = Σ_{i=1}^n |P(i) − P′(i)| and ‖P − P′‖₂ = ( Σ_{i=1}^n (P(i) − P′(i))² )^{1/2}.

For a given distribution Q ∈ Dn and a family P ⊆ Dn of distributions, we denote the ℓp-distance of Q to P by dist_p(Q, P) := inf_{P∈P} ‖Q − P‖_p.

Packings and the Packing Number. Let (X, ‖·‖_p) be a normed space, E ⊂ X, and r > 0 be a radius. A set E′ = {e1, ..., en} ⊂ E is an (r, p)-packing of E if min_{i≠j} ‖ei − ej‖_p > r. The (r, p)-packing number N^pack_r(E, p) is the cardinality of the largest (r, p)-packing of E, i.e., N^pack_r(E, p) := sup{ |E′| : E′ ⊂ E is an (r, p)-packing of E }.

Density Estimation. We now formally introduce the density estimation problems considered in this paper. First, for a given n ∈ Z+, let P ⊆ Dn be a family of distributions on [n], ε ∈ [0, ∞), and p ∈ [1, ∞). The goal of the density estimation problem DE(P, p, ε) is to output, for any unknown distribution P ∈ P, a distribution Q ∈ Dn such that ‖P − Q‖_p ≤ ε. Note that in this problem, we are guaranteed that the unknown distribution belongs to P.

Now we define a version of the problem that allows inputs from outside of the class of interest. For a given n ∈ Z+, let P ⊆ Dn be a family of distributions on [n]. Also let ε ∈ [0, ∞), p ∈ [1, ∞), and α ∈ [1, ∞).
The goal of the agnostic density estimation problem ADE(P, p, ε, α) is to output, for any unknown distribution P ∈ Dn, a distribution Q ∈ Dn such that ‖P − Q‖_p ≤ α · dist_p(P, P) + ε, with high probability. The reason for this version of the problem is that in practice one often has to deal with noisy or non-ideal data. Hence, if the unknown distribution is close to belonging to a class P, we wish to output a nearby distribution as well.

Estimators and Sample Complexity. For any distribution estimation problem A involving an unknown distribution P (such as DE(P, p, ε) and ADE(P, p, ε, α) defined above), we now introduce the notion of an estimator. For any m ∈ N, an estimator θ : [n]^m × {0,1}^∞ → Dn is a function that takes a sequence X⃗ = (X1, ..., Xm) of m independent samples from P and a sequence R of uniformly and independently distributed random bits, and outputs a hypothesis distribution P̂ := θ(X⃗, R). We say that the estimator solves A with probability 1 − δ if, for any unknown distribution P allowed by the formulation of problem A, the probability that P̂ is a correct solution to A is at least 1 − δ. For instance, if A is the ADE(P, p, ε, α) problem, the hypothesis distribution P̂ produced by the estimator should satisfy the following inequality for any distribution P ∈ Dn:

Pr[ ‖P̂ − P‖_p ≤ α · dist_p(P, P) + ε ] ≥ 1 − δ.

The sample complexity of A with error δ, which we denote SC_δ(A), is the minimum number of samples m for which there exists an estimator θ : [n]^m × {0,1}^∞ → Dn that solves A with probability 1 − δ.

As a simple application of this notation, note that SC_δ(DE(P, p, ε)) ≤ SC_δ(ADE(P, p, ε, α)) for any α ∈ [1, ∞). This follows from the fact that in DE(P, p, ε), one has to solve exactly the same problem, but only for a subset of the input distributions of ADE(P, p, ε, α). Since the input P for DE(P, p, ε) comes from P, we have dist_p(P, P) = 0.

Communication Complexity of Density Estimation. In all of our communication models, when a player wants to send a message, the set of possible messages is prefix-free, i.e., after fixing both the randomness and the set of previous messages known to the player, there are no two possible messages such that one is a proper prefix of the other.
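The prefix-free property can be illustrated with a standard code such as Elias gamma; this particular encoding is our illustrative choice, not one the paper commits to:

```python
def elias_gamma(x: int) -> str:
    """Encode a positive integer as (bit-length - 1) zeros followed by its binary form."""
    assert x >= 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

# No codeword is a proper prefix of any other, so a stream of
# concatenated codewords can be decoded unambiguously.
codes = [elias_gamma(x) for x in range(1, 65)]
assert all(not c2.startswith(c1) for c1 in codes for c2 in codes if c1 != c2)
```

The leading zeros announce the length of the payload, which is exactly what makes the code self-delimiting.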
Furthermore, for a protocol Π in any of these models, we write Cost_P(Π) to denote the (worst-case) communication cost of Π on P, defined as the maximum total length of the messages that can be generated by the protocol if the unknown distribution belongs to P. Similarly, we write ECost_P(Π) to denote the expected communication cost of Π on P, defined as the maximum expected total length of the messages exchanged, where the maximum is taken over all unknown distributions in P and the expectation is taken over all assignments of samples to machines and settings of public randomness. The following inequality always holds: ECost_P(Π) ≤ Cost_P(Π).

Simultaneous communication. In the simultaneous communication model, each sample-holding player sends a message to the referee once, based only on the samples she holds and public randomness. For a density estimation problem A, let P be the family of possible unknown distributions P. We write CC→_{s,δ}(A) to denote the (s, δ)-simultaneous communication complexity of A, defined as the minimum Cost_P(Π) over all simultaneous communication protocols Π that solve A with probability at least 1 − δ for any P ∈ P, with s samples per sample-holding player and an arbitrary number of sample-holding players.

Blackboard communication. In this model, each message sent by each player is visible to all players. The next player to speak is uniquely determined by the previously exchanged messages and public randomness. We use this model to prove lower bounds; any lower bound in this model applies to the previous communication models. More specifically, we show lower bounds for the average communication complexity, which we define next. For a density estimation problem A, let P be the family of possible unknown distributions P.
We write CC_{s,δ}(A) to denote the (s, δ)-average communication complexity of A, defined as the infimum of the expected communication cost ECost_P(Π) over all blackboard protocols Π that solve A with probability at least 1 − δ for any P ∈ P, with s samples per sample-holding player and an arbitrary number of sample-holding players.

The communication complexity notions that we just introduced satisfy the following relationship.

Claim 1. For any density estimation problem A, CC_{s,δ}(A) ≤ CC→_{s,δ}(A).

The claim follows from the fact that simultaneous communication is a specific case of blackboard communication, and that the expected communication cost lower bounds the worst-case communication cost. All lower bounds that we prove are on the average communication complexity in the blackboard model.

A Trivial Upper Bound. There is always a trivial protocol that leverages the sample complexity of the density estimation problem. Since SC_δ(A) samples are enough to solve the problem, it suffices that the sample-holding players communicate this number of samples to the referee. Since each sample can be communicated with at most ⌈log n⌉ bits, we obtain the following upper bound on the simultaneous communication complexity.

Claim 2. For any density estimation problem A and any s ≥ 1, CC→_{s,δ}(A) ≤ SC_δ(A) · ⌈log n⌉.

In this paper, we investigate whether there exist protocols that significantly improve on this direct upper bound.

Randomness. All our protocols are deterministic (more precisely, they depend only on the randomness coming from the samples drawn from the hidden distribution).
On the other hand, our lower bounds apply to all protocols, including those using an arbitrary amount of public (i.e., pre-shared) randomness.

3 Our Techniques

In this section, we provide a high-level description of the main ideas in our upper and lower bounds. We defer the details of the upper and lower bounds for monotone distributions to the full version of the paper.

3.1 Overview of Algorithmic Ideas

We start by describing the main ideas in our distributed learning algorithms.

Robustly Learning Histograms in ℓ1-Distance. We will require the following definition:

Definition 1 (Distribution flattening). Let P be a distribution over [n] and let I = {Ii}_{i=1}^ℓ be a partition of [n] into disjoint intervals. We denote by P̄_I the distribution over [n] defined by

P̄_I(i) = ( Σ_{k∈Ij} P(k) ) / |Ij|,  for all j ∈ [ℓ] and i ∈ Ij.

This means that P̄_I is obtained by spreading the total mass of each interval uniformly within the interval.

Our upper bounds in this setting crucially depend on the following norm from Vapnik-Chervonenkis (VC) theory [39], known as the Ak norm (see, e.g., [18]).

Definition 2 (Ak norm). For any function f : [n] → R, we define the Ak norm of f as

‖f‖_{Ak} = sup_{I1,...,Ik} Σ_{i=1}^k |f(Ii)|,

where for any set S ⊆ [n] we let f(S) = Σ_{i∈S} f(i), and the supremum is taken over disjoint intervals I1, ..., Ik. In other words, the Ak norm of f is the maximum norm of any flattening of f into k interval pieces.

Our distributed algorithms crucially rely on the following building blocks:

Theorem 4 ([2]). Let P : [n] → R be a distribution, and let P̂ : [n] → R be a distribution such that ‖P − P̂‖_{Ak} ≤ ε.
There is an efficient algorithm LEARNHIST(P̂, k, ε) that, given P̂, outputs a k-histogram h such that ‖P − h‖₁ ≤ 3·OPT_k + O(ε), where OPT_k = min_{h∈H_{n,k}} ‖P − h‖₁.

This theorem says that if we know a proxy P̂ that is close to P in A_k-norm, then this gives us enough information to construct the best k-histogram fit to P. Moreover, this is the only information we need to reconstruct a good k-histogram fit to P. The following well-known version of the VC inequality states that the empirical distribution after O(k/ε²) samples is close to the true distribution in A_k-norm:

Theorem 5 (VC inequality, e.g., [18]). Fix ε, δ > 0. Let P : [n] → R be a distribution, and let Q be the empirical distribution after O((k + log(1/δ))/ε²) samples from P. Then with probability at least 1 − δ, we have ‖P − Q‖_{A_k} ≤ ε.

These two theorems together imply (via the triangle inequality) that in order to learn P, it suffices to construct some distribution P̂ such that the empirical distribution Q is close to P̂ in A_k-norm. After we construct this P̂, we can run LEARNHIST at a centralized server, and simply output the resulting hypothesis distribution. Thus, the crux of our distributed algorithm is a communication-efficient way of constructing such a P̂.

We achieve this as follows. First, we learn a partition I of [n] such that on each interval I ∈ I, either |I| = 1 and Q(I) ≥ Ω(ε/k), or Q(I) ≤ O(ε/k). We then show that if we let P̂ be the flattening of Q over this partition, then P̂ is ε-close to P in A_k-norm. To find this partition, we repeatedly perform binary search over the domain to find intervals of maximal length, starting at some fixed left endpoint ℓ, such that the mass of Q over each such interval is at most O(ε/k). We show that the intervals in I can be found iteratively, using O(m log(ms) log n) bits of communication each, and that there are at most O(k/ε) intervals in I. This in turn implies a total upper bound of Õ(mk log n/ε) bits of communication, as claimed.

We also show a black-box reduction for robustly learning k-histograms. It improves on the communication cost when the domain size is very large. Specifically, we show:

Lemma 1. Fix n ∈ N and ε, δ > 0. Suppose that for all 1 ≤ n′ ≤ n, there is a robust learning algorithm for H_{n′,k} with s samples per server and m servers, using B(k, n′, m, s, ε) bits of communication, where ms ≥ Ω((k + log(1/δ))/ε²). Then there is an algorithm that solves H_{n,k} using O(B(k, O(k/ε), m, s, ε) + (k/ε) log n) bits of communication.

In other words, by increasing the communication by an additive term of (k/ε) log n, we can replace the domain size n with O(k/ε). This is crucial for getting tighter bounds in certain regimes.

Learning Histograms in ℓ2-Distance. We now describe our algorithm for learning k-histograms in ℓ2. We first require the following folklore statistical bound:

Lemma 2 (see, e.g., [1]). Fix ε, δ > 0 and a distribution P : [n] → R. Let Q be the empirical distribution of O(log(1/δ)/ε) i.i.d. samples from P. Then with probability at least 1 − δ, we have ‖P − Q‖₂² ≤ ε.

This lemma states that it suffices to approximate the empirical distribution Q in ℓ2 norm.
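To see why the ℓ2 guarantee of Lemma 2 is dimension-free, note that E[‖P − Q‖₂²] = (1 − Σᵢ P(i)²)/m ≤ 1/m for m samples, with no dependence on n. A small self-contained simulation (illustrative only, not part of any protocol; the function name is ours) confirms that the squared ℓ2 error does not grow with the domain size:

```python
import bisect
import random

def empirical_l2_sq_error(p, m, rng):
    """Draw m i.i.d. samples from the distribution p (a list of n
    probabilities) and return ||p - q||_2^2 for the empirical
    distribution q of the sample."""
    n = len(p)
    # Inverse-CDF sampling over the cumulative sums of p.
    cdf, acc = [], 0.0
    for pi in p:
        acc += pi
        cdf.append(acc)
    counts = [0] * n
    for _ in range(m):
        # min() guards against float round-off at the top of the CDF.
        idx = min(bisect.bisect_left(cdf, rng.random()), n - 1)
        counts[idx] += 1
    return sum((pi - c / m) ** 2 for pi, c in zip(p, counts))

rng = random.Random(0)
n, m = 100_000, 1_000          # huge domain, modest sample size
p = [1.0 / n] * n              # uniform distribution over [n]
err = empirical_l2_sq_error(p, m, rng)
# E[||p - q||_2^2] = (1 - sum_i p_i^2) / m <= 1/m, independent of n,
# which is why O(log(1/delta)/eps) samples suffice for l2^2 error eps.
```

With m = 1000 samples over a domain of size 10⁵, the squared ℓ2 error is already around 10⁻³, even though the ℓ1 error at this sample size would be close to maximal.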
We now describe how to do so.

Our first key primitive is that, using the celebrated Johnson-Lindenstrauss lemma [28], it is possible to get an accurate estimate of ‖x‖₂² when server i has access to xᵢ and x = Σᵢ xᵢ, where each server communicates at most logarithmically many bits, regardless of the dimension of x. Moreover, we can do this for poly(n) many different x's, even without shared randomness, by communicating only O(log n log log n) bits once at the beginning of the algorithm and constantly many bits per call afterwards. In particular, we use this to approximate

$$e_I = \sum_{i \in I} \left( Q(i) - \frac{Q(I)}{|I|} \right)^2$$

for all intervals I ⊆ [n]; here Q(I)/|I| is the value of the flattening of Q on I, so e_I is the squared ℓ2 error incurred by flattening Q over I.

Perhaps surprisingly, we are now able to give an algorithm that outputs the best O(k log n)-histogram approximation to Q in ℓ2, and that accesses the distribution only via the e_I. Moreover, we show that this algorithm needs to query only O(k log n) such e_I. Since each query to e_I can be answered with logarithmically many bits per server, this yields the claimed communication bound of Õ(mk log n).

Roughly speaking, our algorithm proceeds as follows. At each step, it maintains a partition of [n]. Initially, this is the trivial partition containing just one element: [n]. In every iteration, it finds the 2k intervals in its current partition with the largest e_I and splits them in half (or splits all intervals in half if there are fewer than 2k of them). It repeats this process for log n iterations and returns the flattening over the final set of intervals. By being somewhat careful with how we track error, we are able to show that this in fact only ever requires O(k log n) queries to e_I. While this algorithm is quite simple, proving correctness requires some work and we defer it to the full version.

3.2 Proof Ideas for the Lower Bounds

We now give an overview of the proofs of our lower bounds.

Interactive Learning of Unstructured Distributions. We start with the most sophisticated of our lower bounds: a lower bound for unstructured distributions with one sample per player and arbitrary communication in the blackboard model. We show that Ω((n/ε²) log n) bits of communication are needed. This is optimal and implies that in this case, there is no non-trivial protocol that saves more than a constant factor over the trivial one (in which O(n/ε²) samples are fully transmitted). In order to prove the lower bound, we apply the information complexity toolkit. Our lower bound holds for a family of nearly uniform distributions on [n], in which each pair of consecutive elements, (2i − 1, 2i), has slightly perturbed probabilities. In the uniform distribution each element has probability 1/n. Here, for each pair of elements 2i − 1 and 2i, we set the probabilities to be (1/n)(1 + 100δᵢε) and (1/n)(1 − 100δᵢε), respectively, where each δᵢ is independently selected from the uniform distribution on {−1, 1}. Each such pair can be interpreted as a single slightly biased coin. We show that the output of any good learning protocol can be used to learn the bias δᵢ of most of the pairs. This implies that the messages exchanged in any protocol that is likely to learn the distribution have to reveal most of the biases with high constant probability.

Intuitively, the goal in our analysis is to show that if a player sends much fewer than log n bits overall, this is unlikely to provide much information about that player's sample and help much with predicting the δᵢ's.
This is done by bounding the mutual information between the transcript and the δᵢ's. It should be noted that our lower bound holds in the interactive setting. That is, players are unlikely to gain much by adaptively selecting when to continue providing more information about their samples. The details of the proof are deferred to the full version.

Packing Lower Bounds. Some of our lower bounds are obtained via the construction of a suitable packing set. We use the well-known result that the logarithm of the size of the packing set is a lower bound on the communication complexity. This follows from the standard reduction from estimation to testing, in conjunction with Fano's inequality.

4 Conclusion and Open Problems

This work provides the first rigorous study of the communication complexity of nonparametric distribution estimation. We have obtained both negative results (tight lower bounds in certain regimes) and the first non-trivial upper bounds for a range of structured distributions.

A number of interesting directions remain. We outline a few of them here:

1. The positive results of this work focused on discrete univariate structured distributions (e.g., histograms and monotone distributions). For what other families of structured distributions can one obtain communication-efficient algorithms? Studying multivariate structured distributions in this setting is an interesting direction for future work.

2. The results of this paper do not immediately extend to the continuous setting. Can we obtain positive results for structured continuous distributions?

3. It would be interesting to study related inference tasks in the distributed setting, including hypothesis testing and distribution property estimation.

Acknowledgments

The authors would like to thank the reviewers for their insightful and constructive comments.
ID was supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship. EG was supported by NSF Award CCF-1649515. JL was supported by NSF CAREER Award CCF-1453261, CCF-1565235, a Google Faculty Research Award, and an NSF Graduate Research Fellowship. AN was supported in part by a grant from the Purdue Research Foundation and NSF Awards CCF-1618981 and CCF-1649515. LS was funded by a Google PhD Fellowship.

References

[1] J. Acharya, I. Diakonikolas, C. Hegde, J. Li, and L. Schmidt. Fast and near-optimal algorithms for approximating distributions by histograms. In Proceedings of the 34th ACM Symposium on Principles of Database Systems (PODS), pages 249–263. ACM, 2015.

[2] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1278–1289. SIAM, 2017.

[3] M. F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In Conference on Learning Theory, pages 26.1–26.22, 2012.

[4] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. The Annals of Statistics, pages 995–1012, 1987.

[5] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing, STOC 2016, pages 1011–1020, 2016.

[6] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, pages 1380–1394, 2013.

[7] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via piecewise polynomial approximation. In STOC, pages 604–613, 2014.

[8] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun.
Near-optimal density estimation in near-linear time using variable-width histograms. In NIPS, pages 1844–1852, 2014.

[9] S. Chaudhuri, R. Motwani, and V. R. Narasayya. Random sampling for histogram construction: How much is enough? In SIGMOD Conference, pages 436–447, 1998.

[10] National Research Council. Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC, 2013.

[11] C. Daskalakis, I. Diakonikolas, R. O'Donnell, R. Servedio, and L. Y. Tan. Learning sums of independent integer random variables. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 217–226. IEEE, 2013.

[12] C. Daskalakis, I. Diakonikolas, and R. Servedio. Learning k-modal distributions via testing. In SODA, pages 1371–1385, 2012.

[13] C. Daskalakis, I. Diakonikolas, and R. Servedio. Learning Poisson binomial distributions. Algorithmica, 72(1):316–357, 2015.

[14] C. Daskalakis, I. Diakonikolas, R. A. Servedio, G. Valiant, and P. Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, pages 1833–1852, 2013.

[15] H. Daumé III, J. Phillips, A. Saha, and S. Venkatasubramanian. Efficient protocols for distributed classification and optimization. In Algorithmic Learning Theory, pages 154–168. Springer, 2012.

[16] H. Daumé III, J. Phillips, A. Saha, and S. Venkatasubramanian. Protocols for learning classifiers on distributed data. In Artificial Intelligence and Statistics, pages 282–290, 2012.

[17] L. Devroye and G. Lugosi. Bin width selection in multivariate histograms by the combinatorial method. Test, 13(1):129–145, 2004.

[18] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2012.

[19] I. Diakonikolas. Learning structured distributions. In P.
B\u00fchlmann, P. Drineas, M. Kane, and M. van\nDer Laan, editors, Handbook of Big Data, Chapman & Hall/CRC Handbooks of Modern Statistical\nMethods, chapter 15, pages 267\u2013284. Taylor & Francis, 2016.\n\n[20] I. Diakonikolas, D. M. Kane, and A. Stewart. Ef\ufb01cient robust proper learning of log-concave distributions.\n\nCoRR, abs/1606.03077, 2016.\n\n[21] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and Y. Zhang. Optimality guarantees for distributed statistical\n\nestimation. ArXiv e-prints, 2014.\n\n10\n\n\f[22] D. Freedman and P. Diaconis. On the histogram as a density estimator:l2 theory. Zeitschrift f\u00fcr Wahrschein-\n\nlichkeitstheorie und Verwandte Gebiete, 57(4):453\u2013476, 1981.\n\n[23] S. H. Fuller and L. I. Millett. The Future of Computing Performance: Game Over Or Next Level? National\n\nAcademy Press, Washington, DC, 2011.\n\n[24] A. Garg, T. Ma, and H. Nguyen. On communication cost of distributed statistical estimation and dimen-\n\nsionality. In Advances in Neural Information Processing Systems (NIPS), pages 2726\u20132734, 2014.\n\n[25] A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms\n\nfor approximate histogram maintenance. In STOC, pages 389\u2013398, 2002.\n\n[26] S. Guha, N. Koudas, and K. Shim. Approximation and streaming algorithms for histogram construction\n\nproblems. ACM Trans. Database Syst., 31(1):396\u2013438, 2006.\n\n[27] Y. Hu, H. Chen, J. g. Lou, and J. Li. Distributed density estimation using non-parametric statistics. In 27th\n\nInternational Conference on Distributed Computing Systems (ICDCS \u201907), pages 28\u201328, 2007.\n\n[28] W. B. Johnson and J. Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary\n\nmathematics, 26(189-206):1\u20131, 1984.\n\n[29] M. I. Jordan, J. D. Lee, and Y. Yang. Communication-ef\ufb01cient distributed statistical learning. CoRR,\n\nabs/1605.07689, 2016.\n\n[30] R. Kannan, S. Vempala, and D. 
Woodruff. Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory, pages 1040–1057, 2014.

[31] J. Klemela. Multivariate histograms with data-dependent partitions. Statistica Sinica, 19(1):159–176, 2009.

[32] W. Kowalczyk and N. A. Vlassis. Newscast EM. In Advances in Neural Information Processing Systems 17 (NIPS 2004), pages 713–720, 2004.

[33] Y. Liang, M. F. Balcan, V. Kanchanapally, and D. Woodruff. Improved distributed principal component analysis. In Advances in Neural Information Processing Systems (NIPS), pages 3113–3121, 2014.

[34] G. Lugosi and A. Nobel. Consistency of data-driven histogram methods for density estimation and classification. Ann. Statist., 24(2):687–706, 1996.

[35] R. D. Nowak. Distributed EM algorithms for density estimation in sensor networks. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '03, Hong Kong, April 6-10, 2003, pages 836–839, 2003.

[36] K. Pearson. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Trans. of the Royal Society of London, 186:343–414, 1895.

[37] V. Slavov and P. R. Rao. A gossip-based approach for Internet-scale cardinality estimation of XPath queries over distributed semistructured data. VLDB J., 23(1):51–76, 2014.

[38] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic multidimensional histograms. In SIGMOD Conference, pages 428–439, 2002.

[39] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.

[40] Y. Zhang, J. Duchi, M. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints.
In Advances in Neural Information Processing Systems (NIPS), pages 2328–2336, 2013.

[41] M. Zhou, H. T. Shen, X. Zhou, W. Qian, and A. Zhou. Effective data density estimation in ring-based P2P networks. In IEEE 28th International Conference on Data Engineering (ICDE 2012), pages 594–605, 2012.