{"title": "Phase Transitions in the Pooled Data Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 377, "page_last": 385, "abstract": "In this paper, we study the {\\em pooled data} problem of identifying the labels associated with a large collection of items, based on a sequence of pooled tests revealing the counts of each label within the pool. In the noiseless setting, we identify an exact asymptotic threshold on the required number of tests with optimal decoding, and prove a {\\em phase transition} between complete success and complete failure. In addition, we present a novel {\\em noisy} variation of the problem, and provide an information-theoretic framework for characterizing the required number of tests for general random noise models. Our results reveal that noise can make the problem considerably more difficult, with strict increases in the scaling laws even at low noise levels. Finally, we demonstrate similar behavior in an {\\em approximate recovery} setting, where a given number of errors is allowed in the decoded labels.", "full_text": "Phase Transitions in the Pooled Data Problem\n\nJonathan Scarlett and Volkan Cevher\n\nLaboratory for Information and Inference Systems (LIONS)\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL)\n\n{jonathan.scarlett,volkan.cevher}@ep\ufb02.ch\n\nAbstract\n\nIn this paper, we study the pooled data problem of identifying the labels associ-\nated with a large collection of items, based on a sequence of pooled tests revealing\nthe counts of each label within the pool. In the noiseless setting, we identify an\nexact asymptotic threshold on the required number of tests with optimal decod-\ning, and prove a phase transition between complete success and complete failure.\nIn addition, we present a novel noisy variation of the problem, and provide an\ninformation-theoretic framework for characterizing the required number of tests\nfor general random noise models. 
Our results reveal that noise can make the problem considerably more difficult, with strict increases in the scaling laws even at low noise levels. Finally, we demonstrate similar behavior in an approximate recovery setting, where a given number of errors is allowed in the decoded labels.\n\n1 Introduction\n\nConsider the following setting: There exists a large population of items, each of which has an associated label. The labels are initially unknown, and are to be estimated based on pooled tests. Each pool consists of some subset of the population, and the test outcome reveals the total number of items corresponding to each label that are present in the pool (but not the individual labels). This problem, which we refer to as the pooled data problem, was recently introduced in [1, 2], and further studied in [3, 4]. It is of interest in applications such as medical testing, genetics, and learning with privacy constraints, and has connections to the group testing problem [5] and its linear variants [6, 7].\n\nThe best known bounds on the required number of tests under optimal decoding were given in [3]; however, the upper and lower bounds therein do not match, and can exhibit a large gap. In this paper, we completely close these gaps by providing a new lower bound that exactly matches the upper bound of [3]. These results collectively reveal a phase transition between success and failure, with the probability of error vanishing when the number of tests exceeds a given threshold, but tending to one below that threshold. In addition, we explore the novel aspect of random noise in the measurements, and show that this can significantly increase the required number of tests. Before summarizing these contributions in more detail, we formally introduce the problem.\n\n1.1 Problem setup\n\nWe consider a large population of items [p] = {1, . . . , p}, each of which has an associated label in [d] = {1, . . . , d}. We let π = (π_1, . . . , π_d) denote a vector containing the proportions of items having each label, and we assume that the vector of labels itself, β = (β_1, . . . , β_p), is uniformly distributed over the sequences consistent with these proportions:\n\nβ ∼ Uniform(B(π)),    (1)\n\nwhere B(π) is the set of length-p sequences whose empirical distribution is π.\n\nThe goal is to recover β based on a sequence of pooled tests. The i-th test is represented by a (possibly random) vector X^(i) ∈ {0, 1}^p, whose j-th entry X_j^(i) indicates whether the j-th item is included in the i-th test. We define a measurement matrix X ∈ {0, 1}^{n×p} whose i-th row is given by X^(i) for i = 1, . . . , n, where n denotes the total number of tests. We focus on the non-adaptive testing scenario, where the entire matrix X must be specified prior to performing any tests.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nTable 1: Necessary and sufficient conditions on the number of tests n in the noiseless setting. The function f(r) is defined in (5). Asymptotic multiplicative 1 + o(1) terms are omitted.\n\n  Sufficient for Pe → 0 [3]:  (p / log p) · max_{r ∈ {1,...,d−1}} f(r)\n  Necessary for Pe ↛ 1 [3]:  (p / log p) · (1/2) f(1)\n  Necessary for Pe ↛ 1 (this paper):  (p / log p) · max_{r ∈ {1,...,d−1}} f(r)\n\nTable 2: Necessary and sufficient conditions on the number of tests n in the noisy setting. SNR denotes the signal-to-noise ratio, and the noise model is given in Section 2.2.\n\n  Noiseless testing:  Θ(p / log p)\n  Noisy testing (SNR = p^Θ(1)):  Ω(p / log p)\n  Noisy testing (SNR = (log p)^Θ(1)):  Ω(p / log log p)\n  Noisy testing (SNR = Θ(1)):  Ω(p log p)\n\nIn the noiseless setting, the i-th test outcome is a vector Y^(i) = (Y_1^(i), . . . , Y_d^(i)), with t-th entry\n\nY_t^(i) = N_t(β, X^(i)),    (2)\n\nwhere for t = 1, . . . , d, we let N_t(β, X) = Σ_{j ∈ [p]} 1{β_j = t ∩ X_j = 1} denote the number of items with label t that are included in the test described by X ∈ {0, 1}^p. More generally, in the possible presence of noise, the i-th observation is randomly generated according to\n\nY^(i) | X^(i), β ∼ P_{Y | N_1(β, X^(i)), ..., N_d(β, X^(i))},    (3)\n\nfor some conditional probability mass function P_{Y | N_1,...,N_d} (or density function in the case of continuous observations). We assume that the observations Y^(i) (i = 1, . . . , n) are conditionally independent given X, but otherwise make no assumptions on P_{Y | N_1,...,N_d}. Clearly, the noiseless model (2) falls under this more general setup.\n\nSimilarly to X, we let Y denote an n × d matrix of observations, with the i-th row being Y^(i). Given X and Y, a decoder outputs an estimate β̂ of β, and the error probability is given by\n\nPe = P[β̂ ≠ β],    (4)\n\nwhere the probability is with respect to β, X, and Y. We seek to find conditions on the number of tests n under which Pe attains a certain target value in the limit as p → ∞, and our main results provide necessary conditions (i.e., lower bounds on n) for this to occur. As in [3], we focus on the case that d and π are fixed and do not depend on p.¹\n\n1.2 Contributions and comparisons to existing bounds\n\nOur focus in this paper is on information-theoretic bounds on the required number of tests that hold regardless of practical considerations such as computation and storage. Among the existing works in the literature, the one most relevant to this paper is [3], whose bounds strictly improve on the initial bounds in [1].
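As an aside, the setup of Section 1.1 is easy to simulate, which can help in sanity-checking the bounds discussed in what follows. A minimal Python sketch (the helper names are ours, not from the paper) draws β uniformly over B(π) as in (1), forms one Bernoulli test vector, and computes the noiseless outcome (2):

```python
import random
from collections import Counter

def sample_labels(p, pi):
    """Draw beta uniformly over B(pi) in (1): a uniformly random
    arrangement of exactly round(pi[t] * p) items of each label t."""
    beta = [t for t, prop in enumerate(pi, start=1)
            for _ in range(int(round(prop * p)))]
    random.shuffle(beta)
    return beta

def pooled_test(beta, x, d):
    """Noiseless outcome (2): the t-th entry is N_t(beta, X), the number
    of items with label t included in the pool described by x."""
    counts = Counter(b for b, xj in zip(beta, x) if xj == 1)
    return [counts.get(t, 0) for t in range(1, d + 1)]

random.seed(0)
p, pi = 10, [0.5, 0.3, 0.2]
beta = sample_labels(p, pi)
x = [random.randint(0, 1) for _ in range(p)]  # one Bernoulli(1/2) test
y = pooled_test(beta, x, d=3)
assert sum(y) == sum(x)  # the label counts always sum to the pool size
```

Stacking n such test vectors as rows gives the measurement matrix X, and stacking the outcomes gives the n × d observation matrix Y.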
The same authors also proved a phase transition for a practical algorithm based on approximate message passing [4], but the required number of tests is in fact significantly larger than the information-theoretic threshold (specifically, linear in p instead of sub-linear).\n\nTable 1 gives a summary of the bounds from [3] and our contributions in the noiseless setting. To define the function f(r) therein, we introduce the additional notation that for r ∈ {1, . . . , d−1}, π^(r) = (π_1^(r), . . . , π_r^(r)) is a vector whose first entry sums the largest d − r + 1 entries of π, and whose remaining entries coincide with the remaining r − 1 entries of π. We have\n\nf(r) = 2(H(π) − H(π^(r))) / (d − r),    (5)\n\nmeaning that the entries in Table 1 corresponding to the results of [3] are given as follows:\n\n• (Achievability) When the entries of X are i.i.d. on Bernoulli(q) for some q ∈ (0, 1) (not depending on p), there exists a decoder such that Pe → 0 as p → ∞ with\n\nn ≤ (p / log p) (max_{r ∈ {1,...,d−1}} 2(H(π) − H(π^(r))) / (d − r)) (1 + η)    (6)\n\nfor arbitrarily small η > 0.\n\n• (Converse) In order to achieve Pe ↛ 1 as p → ∞, it is necessary that\n\nn ≥ (p / log p) (H(π) / (d − 1)) (1 − η)    (7)\n\nfor arbitrarily small η > 0.\n\n¹ More precisely, π should be rounded to the nearest empirical distribution (e.g., in ℓ1-norm) for sequences β ∈ [d]^p of length p; we leave such rounding implicit throughout the paper.\n\nFigure 1: The function f(r) in (5), for several choices of π, with d = 10. The random π are drawn uniformly on the probability simplex, and the highly non-uniform choice of π is given by π = (0.49, 0.49, 0.0025, . . . , 0.0025). When the maximum is achieved at r = 1, the bounds of [3] coincide up to a factor of two, whereas if the maximum is achieved for r > 1 then the gap is larger.\n\nUnfortunately, these bounds do not coincide. If the maximum in (6) is achieved by r = 1 (which occurs, for example, when π is uniform [3]), then the gap only amounts to a factor of two. However, as we show in Figure 1, if we compute the bounds for some “random” choices of π then the gap is typically larger (i.e., r = 1 does not achieve the maximum), and we can construct choices where the gap is significantly larger. Closing these gaps was posed as a key open problem in [3].\n\nWe can now summarize our contributions as follows:\n\n1. We give a lower bound that exactly matches (6), thus completely closing the above-mentioned gaps in the existing bounds and solving the open problem raised in [3]. More specifically, we show that Pe → 1 whenever n ≤ (p / log p) (max_{r ∈ {1,...,d−1}} 2(H(π) − H(π^(r))) / (d − r)) (1 − η) for some η > 0, thus identifying an exact phase transition – a threshold above which the error probability vanishes, but below which the error probability tends to one.\n\n2. We develop a framework for understanding variations of the problem consisting of random noise, and give an example of a noise model where the scaling laws are strictly higher compared to the noiseless case. A summary is given in Table 2; the case SNR = (log p)^Θ(1) reveals a strict increase in the scaling laws even when the signal-to-noise ratio grows unbounded, and the case SNR = Θ(1) reveals that the required number of tests increases from sub-linear to super-linear in the dimension when the signal-to-noise ratio is constant.\n\n3. 
In the supplementary material, we discuss how our lower bounds extend readily to the approximate recovery criterion, where we only require β to be identified up to a certain Hamming distance. However, for clarity, we focus on exact recovery throughout the paper.\n\nIn a recent independent work [8], an adversarial noise setting was introduced. This turns out to be fundamentally different to our noisy setting. In particular, the results of [8] state that exact recovery is impossible, and even with approximate recovery, a huge number of tests (i.e., higher than polynomial) is needed unless the maximum adversarial noise amplitude is O(q_max^{1/2+o(1)}), where q_max is the maximum allowed reconstruction error measured by the Hamming distance. Of course, both random and adversarial noise are of significant interest, depending on the application.\n\nNotation. For a positive integer d, we write [d] = {1, . . . , d}. We use standard information-theoretic notations for the (conditional) entropy and mutual information, e.g., H(X), H(Y|X), I(X; Y|Z) [9]. All logarithms have base e, and accordingly, all of the preceding information measures are in units of nats. The Gaussian distribution with mean µ and variance σ² is denoted by N(µ, σ²). We use the standard asymptotic notations O(·), o(·), Ω(·), ω(·) and Θ(·).\n\n2 Main results\n\nIn this section, we present our main results for the noiseless and noisy settings. The proofs are given in Section 3, as well as the supplementary material.\n\n2.1 Phase transition in the noiseless setting\n\nThe following theorem proves that the upper bound given in (6) is tight. Recall that for r ∈ {1, . . . , d−1}, π^(r) = (π_1^(r), . . . , π_r^(r)) is a vector whose first entry sums the largest d − r + 1 entries of π, and whose remaining entries coincide with the remaining r − 1 entries of π.\n\nTheorem 1. (Noiseless setting) Consider the pooled data problem described in Section 1.1 with a given number of labels d and label proportion vector π (not depending on the dimension p). For any decoder, in order to achieve Pe ↛ 1 as p → ∞, it is necessary that\n\nn ≥ (p / log p) (max_{r ∈ {1,...,d−1}} 2(H(π) − H(π^(r))) / (d − r)) (1 − η)    (8)\n\nfor arbitrarily small η > 0.\n\nCombined with (6), this result reveals an exact phase transition on the required number of measurements: Denoting n* = (p / log p) max_{r ∈ {1,...,d−1}} 2(H(π) − H(π^(r))) / (d − r), the error probability vanishes for n ≥ n*(1 + η), and tends to one for n ≤ n*(1 − η), regardless of how small η is chosen to be.\n\nRemark 1. Our model assumes that β is uniformly distributed over the sequences with empirical distribution π, whereas [3] assumes that β is i.i.d. on π. However, Theorem 1 readily extends to the latter setting: Under the i.i.d. model, once we condition on a given empirical distribution, the conditional distribution of β is uniform. As a result, the converse bound for the i.i.d. model follows directly from Theorem 1 by basic concentration and the continuity of the entropy function.\n\n2.2 Information-theoretic framework for the noisy setting\n\nWe now turn to general noise models of the form (3), and provide necessary conditions for the noisy pooled data problem in terms of the mutual information. General characterizations of this form were provided previously for group testing [10, 11] and other sparse recovery problems [12, 13].\n\nOur general result is stated in terms of a maximization over a vector parameter ℓ = (ℓ_1, . . . , ℓ_d) with ℓ_t ∈ {0, . . . , π_t p} for all t. We will see in the proof that ℓ_t represents the number of items of type t that are unknown to the decoder after pπ_t − ℓ_t are revealed by a genie. 
We define the following:\n\n• Given ℓ and β, we let S_ℓ be a random set of indices in [p] such that for each t ∈ [d], the set contains ℓ_t indices corresponding to entries where β equals t. Specifically, we define S_ℓ to be uniformly distributed over all such sets. Moreover, we define S_ℓ^c = [p] \ S_ℓ.\n\n• Given the above definitions, we define\n\nβ_{S_ℓ^c}(j) = β_j for j ∈ S_ℓ^c, and β_{S_ℓ^c}(j) = ? otherwise,    (9)\n\nwhere ? can be thought of as representing an unknown value. Hence, knowing β_{S_ℓ^c} amounts to knowing the labels of all items in the set S_ℓ^c.\n\n• We define |B_ℓ(π)| to be the number of sequences β ∈ [d]^p that coincide with a given β_{S_ℓ^c} on the entries not equaling ?, while also having empirical distribution π overall. This number does not depend on the specific choice of β_{S_ℓ^c}. As an example, when ℓ_t = pπ_t for all t, we have S_ℓ = [p], β_{S_ℓ^c} = (?, . . . , ?), and |B_ℓ(π)| = |B(π)|, defined following (1).\n\n• We let ‖ℓ‖_0 denote the number of values in (ℓ_1, . . . , ℓ_d) that are positive.\n\nWith these definitions, we have the following result for general random noise models.\n\nTheorem 2. (Noisy setting) Consider the pooled data problem described in Section 1.1 under a general observation model of the form (3), with a given number of labels d and label proportion vector π. For any decoder, in order to achieve Pe ≤ δ for a given δ ∈ (0, 1), it is necessary that\n\nn ≥ max_{ℓ : ‖ℓ‖_0 ≥ 2} [ log |B_ℓ(π)| (1 − δ) − log 2 ] / [ (1/n) Σ_{i=1}^n I(β; Y^(i) | β_{S_ℓ^c}, X^(i)) ].    (10)\n\nIn order to obtain more explicit bounds on n from (10), one needs to characterize the mutual information terms, ideally forming an upper bound that does not depend on the distribution of the measurement matrix X. We do this for some specific models below; however, in general it can be a difficult task. 
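For intuition on the numerator in (10): the entries indexed by S_ℓ must be filled with exactly ℓ_t items of each label t, so one can check that |B_ℓ(π)| is the multinomial coefficient (Σ_t ℓ_t)! / (ℓ_1! · · · ℓ_d!), which is easy to evaluate in log form. A small sketch (the helper name is ours, not from the paper):

```python
from math import lgamma, log

def log_B(ell):
    """log |B_ell(pi)|: a multinomial coefficient over the free entries,
    computed via log-gamma to avoid overflow for large populations."""
    return lgamma(sum(ell) + 1) - sum(lgamma(l + 1) for l in ell)

# Tiny sanity checks: two free slots with one item of each label give
# 2 sequences; two free slots of the same label give just 1.
assert abs(log_B([1, 1]) - log(2)) < 1e-12
assert abs(log_B([2, 0])) < 1e-12

# For ell = p * pi, (1/p) log |B_ell(pi)| approaches the entropy H(pi)
# by Stirling's approximation.
p, pi = 1000, [0.5, 0.3, 0.2]
H = -sum(x * log(x) for x in pi)
assert abs(log_B([int(x * p) for x in pi]) / p - H) < 0.02
```

Combining such a closed form with a model-specific upper bound on the mutual information terms then turns (10) into an explicit necessary condition on n.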
The following corollary reveals that if the entries of X are i.i.d. on Bernoulli(q) for some q ∈ (0, 1) (as was assumed in [3]), then we can simplify the bound.\n\nCorollary 1. (Noisy setting with Bernoulli testing) Suppose that the entries of X are i.i.d. on Bernoulli(q) for some q ∈ (0, 1). Under the setup of Theorem 2, it is necessary that\n\nn ≥ max_{ℓ : ‖ℓ‖_0 ≥ 2} [ log |B_ℓ(π)| (1 − δ) − log 2 ] / I(X_{0,ℓ}; Y | X_{1,ℓ}),    (11)\n\nwhere (X_{0,ℓ}, X_{1,ℓ}, Y) are distributed as follows: (i) X_{0,ℓ} (respectively, X_{1,ℓ}) is a concatenation of the vectors X_{0,ℓ}(1), . . . , X_{0,ℓ}(d) (respectively, X_{1,ℓ}(1), . . . , X_{1,ℓ}(d)), the t-th of which contains ℓ_t (respectively, π_t p − ℓ_t) entries independently drawn from Bernoulli(q); (ii) letting each N_t (t = 1, . . . , d) be the total number of ones in X_{0,ℓ}(t) and X_{1,ℓ}(t) combined, the random variable Y is drawn from P_{Y | N_1,...,N_d} according to (3).\n\nAs well as being simpler to evaluate, this corollary may be of interest in scenarios where one does not have complete freedom in designing X, and one instead insists on using Bernoulli testing. For instance, one may not know how to optimize X, and accordingly resort to generating it at random.\n\nExample 1: Application to the noiseless setting. In the supplementary material, we show that in the noiseless setting, Theorem 2 recovers a weakened version of Theorem 1 with 1 − η replaced by 1 − o(1) in (8). Hence, while Theorem 2 does not establish a phase transition, it does recover the exact threshold on the number of measurements required to obtain Pe → 0.\n\nAn overview of the proof of this claim is as follows. We restrict the maximum in (10) to choices of ℓ where each ℓ_t equals either its minimum value 0 or its maximum value pπ_t. Since we are in the noiseless setting, each mutual information term reduces to the conditional entropy of Y^(i) = (Y_1^(i), . . . , Y_d^(i)) given β_{S_ℓ^c} and X^(i). For the values of t such that ℓ_t = 0, the value Y_t^(i) is deterministic (i.e., it has zero entropy), whereas for the values of t such that ℓ_t = pπ_t, the value Y_t^(i) follows a hypergeometric distribution, whose entropy behaves as (1/2) log p (1 + o(1)).\n\nIn the case that X is i.i.d. on Bernoulli(q), we can use Corollary 1 to obtain the following necessary condition for Pe ≤ δ as p → ∞, proved in the supplementary material:\n\nn ≥ (p / log(pq(1 − q))) (max_{r ∈ {1,...,d−1}} 2(H(π) − H(π^(r))) / (d − r)) (1 − o(1))    (12)\n\nfor any q = q(p) such that both q and 1 − q behave as ω(1/p). Hence, while q = Θ(1) recovers the threshold in (8), the required number of tests strictly increases when q = o(1), albeit with a mild logarithmic dependence.\n\nExample 2: Group testing. To highlight the versatility of Theorem 2 and Corollary 1, we show that the latter recovers the lower bounds given in the group testing framework of [11].\n\nSet d = 2, and let label 1 represent “defective” items, and label 2 represent “non-defective” items. Let P_{Y | N_1, N_2} be of the form P_{Y | N_1} with Y ∈ {0, 1}, meaning the observations are binary and depend only on the number of defective items in the test. For brevity, let k = pπ_1 denote the total number of defective items, so that pπ_2 = p − k is the number of non-defective items.\n\nLetting ℓ_2 = p − k in (11), and letting ℓ_1 remain arbitrary, we obtain the necessary condition\n\nn ≥ max_{ℓ_1 ∈ {1,...,k}} [ log (p − k + ℓ_1 choose ℓ_1) (1 − δ) − log 2 ] / I(X_{0,ℓ_1}; Y | X_{1,ℓ_1}),    (13)\n\nwhere X_{0,ℓ_1} is a shorthand for X_{0,ℓ} with ℓ = (ℓ_1, p − k), and similarly for X_{1,ℓ_1}. This matches the lower bound given in [11] for Bernoulli testing with general noise models, for which several corollaries for specific models were also given.\n\nExample 3: Gaussian noise. 
To give a concrete example of a noisy setting, consider the case that we observe the values in (2), but with each such value corrupted by independent Gaussian noise:\n\nY_t^(i) = N_t(β, X^(i)) + Z_t^(i),    (14)\n\nwhere Z_t^(i) ∼ N(0, pσ²) for some σ² > 0. Note that given X^(i), the values N_t themselves have variance at most proportional to p (e.g., see Appendix C), so σ² = Θ(1) can be thought of as the constant signal-to-noise ratio (SNR) regime.\n\nIn the supplementary material, we prove the following bounds for this model:\n\n• By letting each ℓ_t in (10) equal its minimum or maximum value analogously to the noiseless case above, we obtain the following necessary condition for Pe ≤ δ as p → ∞:\n\nn ≥ ( max_{G ⊆ [d] : |G| ≥ 2} p_G H(π_G) / ( Σ_{t ∈ G} (1/2) log(1 + π_t (1 − π_t / Σ_{t′ ∈ G} π_{t′}) / (4σ²)) ) ) (1 − o(1)),    (15)\n\nwhere p_G := Σ_{t ∈ G} π_t p, and π_G has entries π_t / Σ_{t′ ∈ G} π_{t′} for t ∈ G. Hence, we have the following:\n\n– In the case that σ² = p^{−c} for some c ∈ (0, 1), each summand in the denominator simplifies to (c/2) log p (1 + o(1)), and we deduce that compared to the noiseless case (cf., (8)), the asymptotic number of tests increases by at least a constant factor of 1/c.\n\n– In the case that σ² = (log p)^{−c} for some c > 0, each summand in the denominator simplifies to (c/2) log log p (1 + o(1)), and we deduce that compared to the noiseless case, the asymptotic number of tests increases by at least a factor of (log p) / (c log log p). Hence, we observe a strict increase in the scaling laws despite the fact that the SNR grows unbounded.\n\n– While (15) also provides an Ω(p) lower bound for the case σ² = Θ(1), we can in fact do better via a different choice of ℓ (see below).\n\n• By letting ℓ_1 = pπ_1, ℓ_2 = 1, and ℓ_t = 0 for t = 3, . . . , d, we obtain the necessary condition\n\nn ≥ 4pσ² log p (1 − o(1))    (16)\n\nfor Pe ≤ δ as p → ∞. Hence, if σ² = Θ(1), we require n = Ω(p log p); this is super-linear in the dimension, in contrast with the sub-linear Θ(p / log p) behavior observed in the noiseless case. Note that this choice of ℓ essentially captures the difficulty in identifying a single item, namely, the one corresponding to ℓ_2 = 1.\n\nThese findings are summarized in Table 2; see also the supplementary material for extensions to the approximate recovery setting.\n\nRemark 2. While it may seem unusual to add continuous noise to discrete observations, this still captures the essence of the noisy pooled data problem, and simplifies the evaluation of the mutual information terms in (10). Moreover, this converse bound immediately implies the same bound for the discrete model in which the noise consists of adding a Gaussian term, rounding, and clipping to {0, . . . , p}, since the decoder could always choose to perform these operations as pre-processing.\n\n3 Proofs\n\nHere we provide the proof of Theorem 1, along with an overview of the proof of Theorem 2. The remaining proofs are given in the supplementary material.\n\n3.1 Proof of Theorem 1\n\nStep 1: Counting typical outcomes. We claim that it suffices to consider the case that X is deterministic and β̂ is a deterministic function of Y; to see this, we note that when either of these is random we have Pe = E_{X,β̂}[P[error]], and the average is lower bounded by the minimum.\n\nThe following lemma, proved in the supplementary material, shows that for any X^(i), each entry of the corresponding outcome Y^(i) lies in an interval of length O(√(p log p)) with high probability.\n\nLemma 1. 
For any deterministic test vector X ∈ {0, 1}^p, and for β uniformly distributed on B(π), we have for each t ∈ [d] that\n\nP[ |N_t(β, X) − E[N_t(β, X)]| > √(p log p) ] ≤ 2/p².    (17)\n\nBy Lemma 1 and the union bound, we have with probability at least 1 − 2nd/p² that |N_t(β, X^(i)) − E[N_t(β, X^(i))]| ≤ √(p log p) for all i ∈ [n] and t ∈ [d]. Letting this event be denoted by A, we have\n\nPe ≥ P[A] − P[A ∩ no error] ≥ 1 − 2nd/p² − P[A ∩ no error].    (18)\n\nNext, letting Y(β) ∈ [p]^{n×d} denote Y explicitly as a function of β and similarly for β̂(Y) ∈ [d]^p, and letting Y_A denote the set of matrices Y under which the event A occurs, we have\n\nP[A ∩ no error] = (1/|B(π)|) Σ_{b ∈ B(π)} 1{Y(b) ∈ Y_A ∩ β̂(Y(b)) = b}    (19)\n ≤ |Y_A| / |B(π)|,    (20)\n\nwhere (20) follows since each Y ∈ Y_A can only be counted once in the summation of (19), due to the condition β̂(Y(b)) = b.\n\nStep 2: Bounding the set cardinalities. By a standard combinatorial argument (e.g., [14, Ch. 2]) and the fact that π is fixed as p → ∞, we have\n\n|B(π)| = e^{p(H(π) + o(1))}.    (21)\n\nTo bound |Y_A|, first note that the entries of each Y^(i) ∈ [p]^d sum to a deterministic value, namely, the number of ones in X^(i). Hence, each Y ∈ Y_A is uniquely described by a sub-matrix of Y ∈ [p]^{n×d} of size n × (d − 1). Moreover, since Y_A only includes matrices under which A occurs, each value in this sub-matrix only takes one of at most 2√(p log p) + 1 values. As a result, we have\n\n|Y_A| ≤ (2√(p log p) + 1)^{n(d−1)},    (22)\n\nand combining (18)–(22) gives\n\nPe ≥ 1 − (2√(p log p) + 1)^{n(d−1)} / e^{p(H(π) + o(1))} − 2nd/p².    (23)\n\nSince d is constant, it immediately follows that Pe → 1 whenever n ≤ (pH(π) / ((d − 1) log(2√(p log p) + 1))) (1 − η) for some η > 0. Applying log(2√(p log p) + 1) = (1/2) log p (1 + o(1)), we obtain the following necessary condition for Pe ↛ 1:\n\nn ≥ (2pH(π) / ((d − 1) log p)) (1 − η).    (24)\n\nThis yields the term in (8) corresponding to r = 1.\n\nStep 3: Genie argument. Let G be a subset of [d] of cardinality at least two, and define G^c = [d] \ G. Moreover, define β_{G^c} to be a length-p vector with\n\n(β_{G^c})_j = β_j for β_j ∈ G^c, and (β_{G^c})_j = ? for β_j ∈ G,    (25)\n\nwhere the symbol ? can be thought of as representing an unknown value. We consider a modified setting in which a genie reveals β_{G^c} to the decoder, i.e., the decoder knows the labels of all items for which the label lies in G^c, and is only left to estimate those in G. This additional knowledge can only make the pooled data problem easier, and hence, any lower bound in this modified setting remains valid in the original setting.\n\nIn the genie-aided setting, instead of receiving the full observation vector Y^(i) = (Y_1^(i), . . . , Y_d^(i)), it is equivalent to only be given {Y_j^(i) : j ∈ G}, since the values in G^c are uniquely determined from β_{G^c} and X^(i). This means that the genie-aided setting can be cast in the original setting with modified parameters: (i) p is replaced by p_G = Σ_{t ∈ G} π_t p, the number of items with unknown labels; (ii) d is replaced by |G|, the number of distinct remaining labels; (iii) π is replaced by π_G, defined to be a |G|-dimensional probability vector with entries equaling π_t / Σ_{t′ ∈ G} π_{t′} (t ∈ G).\n\nDue to this equivalence, the condition (24) yields the necessary condition n ≥ (2p_G H(π_G) / ((|G| − 1) log p)) (1 − η), and maximizing over all G with |G| ≥ 2 gives\n\nn ≥ max_{G ⊆ [d] : |G| ≥ 2} (2p_G H(π_G) / ((|G| − 1) log p)) (1 − η).    (26)\n\nStep 4: Simplification. Define r = d − |G| + 1. 
We restrict the maximum in (26) to sets G indexing the highest |G| = d − r + 1 values of π, and consider the following process for sampling from π:\n\n• Draw a sample v from π^(r) (defined above Theorem 1);\n• If v corresponds to the first entry of π^(r), then draw a random sample from π_G and output it as a label (i.e., the labels have conditional probability proportional to the top |G| entries of π);\n• Otherwise, if v corresponds to one of the other entries of π^(r), then output v as a label.\n\nBy Shannon’s property of entropy for sequentially-generated random variables [15, p. 10], we find that H(π) = H(π^(r)) + (Σ_{t ∈ G} π_t) H(π_G). Moreover, since p_G = p · Σ_{t ∈ G} π_t, this can be written as p_G H(π_G) = p(H(π) − H(π^(r))). Substituting into (26), noting that |G| − 1 = d − r by the definition of r, and maximizing over r = 1, . . . , d − 1, we obtain the desired result (8).\n\n3.2 Overview of proof of Theorem 2\n\nWe can interpret the pooled data problem as a communication problem in which a “message” β is sent over a “channel” P_{Y | N_1,...,N_d} via “codewords” of the form {(N_1^(i), . . . , N_d^(i))}_{i=1}^n that are constructed by summing various columns of X. As a result, it is natural to use Fano’s inequality [9, Ch. 7] to lower bound the error probability in terms of the information content (entropy) of β and the amount of information that Y reveals about β (mutual information).\n\nHowever, a naive application of Fano’s inequality only recovers the bound in (10) with ℓ = pπ. To handle the other possible choices of ℓ, we again consider a genie-aided setting in which, for each t ∈ [d], the decoder is informed of pπ_t − ℓ_t of the items whose label equals t. Hence, it only remains to identify the remaining ℓ_t items of each type. This genie argument is a generalization of that used in the proof of Theorem 1, in which each ℓ_t was either equal to its minimum value zero or its maximum value pπ_t. 
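The entropy decomposition used in Step 4 of the proof of Theorem 1 (the grouping property H(π) = H(π^(r)) + (Σ_{t∈G} π_t) H(π_G), which gives p_G H(π_G) = p(H(π) − H(π^(r)))) is straightforward to verify numerically. A minimal check with a toy π (the helper names are ours, not from the paper):

```python
from math import log

def H(q):
    """Shannon entropy in nats, matching the paper's convention."""
    return -sum(x * log(x) for x in q if x > 0)

def grouped(pi, r):
    """pi^(r): the first entry sums the largest d - r + 1 entries of pi,
    and the rest are the remaining r - 1 entries (pi sorted descending)."""
    keep = len(pi) - r + 1
    return [sum(pi[:keep])] + pi[keep:]

pi = sorted([0.4, 0.3, 0.2, 0.1], reverse=True)
for r in range(1, len(pi)):              # r = 1, ..., d - 1
    G = pi[: len(pi) - r + 1]            # the largest d - r + 1 entries
    pi_G = [x / sum(G) for x in G]       # conditional distribution on G
    lhs = sum(G) * H(pi_G)               # (p_G / p) * H(pi_G)
    rhs = H(pi) - H(grouped(pi, r))      # equals (d - r) f(r) / 2 by (5)
    assert abs(lhs - rhs) < 1e-12
```

Maximizing 2(H(π) − H(π^(r))) / (d − r) over r then reproduces the threshold constant appearing in Table 1 and (8).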
This genie argument is a generalization of\nthat used in the proof of Theorem 1, in which each `t was either equal to its minimum value zero\nor its maximum value p\u21e1t. In Example 3 of Section 2, we saw that this generalization can lead to a\nstrictly better lower bound in certain noisy scenarios.\nThe complete proof of Theorem 2 is given in the supplementary material.\n4 Conclusion\nWe have provided novel information-theoretic lower bounds for the pooled data problem. In the\nnoiseless setting, we provided a matching lower bound to the upper bound of [3], establishing an\nexact threshold indicating a phase transition between success and failure. In the noisy setting, we\nprovided a characterization of general noise models in terms of the mutual information. In the special\ncase of Gaussian noise, we proved an inherent added dif\ufb01culty compared to the noiseless setting,\nwith strict increases in the scaling laws even when the signal-to-noise ratio grows unbounded.\nAn interesting direction for future research is to provide upper bounds for the noisy setting, poten-\ntially establishing the tightness of Theorem 2 for general noise models. This appears to be challeng-\ning using existing techniques; for instance, the pooled data problem bears similarity to group testing\nwith linear sparsity, whereas existing mutual information based upper bounds for group testing are\nlimited to the sub-linear regime [10, 11, 16]. In particular, the proofs of such bounds are based on\nconcentration inequalities which, when applied to the linear regime, lead to additional requirements\non the number of tests that prevent tight performance characterizations.\nAcknowledgment: This work was supported in part by the European Commission under Grant\nERC Future Proof, SNF Sinergia project CRSII2-147633, SNF 200021-146750, and EPFL Fellows\nHorizon2020 grant 665667.\n\n8\n\n\fReferences\n[1] I.-H. Wang, S. L. Huang, K. Y. Lee, and K. C. 
Chen, "Data extraction via histogram and arithmetic mean queries: Fundamental limits and algorithms," in IEEE Int. Symp. Inf. Theory, July 2016, pp. 1386–1390.

[2] I.-H. Wang, S. L. Huang, and K. Y. Lee, "Extracting sparse data via histogram queries," in Allerton Conf. Comm., Control, and Comp., 2016.

[3] A. E. Alaoui, A. Ramdas, F. Krzakala, L. Zdeborova, and M. I. Jordan, "Decoding from pooled data: Sharp information-theoretic bounds," 2016, http://arxiv.org/abs/1611.09981.

[4] ——, "Decoding from pooled data: Phase transitions of message passing," 2017, http://arxiv.org/abs/1702.02279.

[5] D.-Z. Du and F. K. Hwang, Combinatorial Group Testing and Its Applications, ser. Series on Applied Mathematics. World Scientific, 1993.

[6] A. Sebő, "On two random search problems," J. Stat. Plan. Inf., vol. 11, no. 1, pp. 23–31, 1985.

[7] M. Malyutov and H. Sadaka, "Maximization of ESI. Jaynes principle for testing significant inputs of linear model," Rand. Opt. Stoch. Eq., vol. 6, no. 4, pp. 339–358, 1998.

[8] W.-N. Chen and I.-H. Wang, "Partial data extraction via noisy histogram queries: Information theoretic bounds," in IEEE Int. Symp. Inf. Theory (ISIT), 2017.

[9] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, Inc., 2006.

[10] M. Malyutov, "The separating property of random matrices," Math. Notes Acad. Sci. USSR, vol. 23, no. 1, pp. 84–91, 1978.

[11] G. Atia and V. Saligrama, "Boolean compressed sensing and noisy group testing," IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1880–1901, March 2012.

[12] C. Aksoylar, G. K. Atia, and V. Saligrama, "Sparse signal processing with linear and nonlinear observations: A unified Shannon-theoretic approach," IEEE Trans. Inf. Theory, vol. 63, no. 2, pp. 749–776, Feb.
2017.

[13] J. Scarlett and V. Cevher, "Limits on support recovery with probabilistic models: An information-theoretic framework," IEEE Trans. Inf. Theory, vol. 63, no. 1, pp. 593–620, 2017.

[14] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge University Press, 2011.

[15] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. Journal, vol. 27, pp. 379–423, July and Oct. 1948.

[16] J. Scarlett and V. Cevher, "Phase transitions in group testing," in Proc. ACM-SIAM Symp. Disc. Alg. (SODA), 2016.

[17] W. Hoeffding, "Probability inequalities for sums of bounded random variables," J. Amer. Stat. Assoc., vol. 58, no. 301, pp. 13–30, 1963.

[18] J. Massey, "On the entropy of integer-valued random variables," in Int. Workshop on Inf. Theory, 1988.

[19] G. Reeves and M. Gastpar, "The sampling rate-distortion tradeoff for sparsity pattern recovery in compressed sensing," IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3065–3092, May 2012.

[20] ——, "Approximate sparsity pattern recovery: Information-theoretic lower bounds," IEEE Trans. Inf. Theory, vol. 59, no. 6, pp. 3451–3465, June 2013.

[21] J. Scarlett and V. Cevher, "How little does non-exact recovery help in group testing?" in IEEE Int. Conf. Acoust. Sp. Sig. Proc. (ICASSP), New Orleans, 2017.

[22] ——, "On the difficulty of selecting Ising models with approximate recovery," IEEE Trans. Sig. Inf. Proc. over Networks, vol. 2, no. 4, pp. 625–638, 2016.

[23] J. C. Duchi and M. J.
Wainwright, "Distance-based and continuum Fano inequalities with applications to statistical estimation," 2013, http://arxiv.org/abs/1311.2669.