{"title": "Optimal Testing for Properties of Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 3591, "page_last": 3599, "abstract": "Given samples from an unknown distribution, p, is it possible to distinguish whether p belongs to some class of distributions C versus p being far from every distribution in C? This fundamental question has received tremendous attention in Statistics, albeit focusing on asymptotic analysis, as well as in Computer Science, where the emphasis has been on small sample size and computational complexity. Nevertheless, even for basic classes of distributions such as monotone, log-concave, unimodal, and monotone hazard rate, the optimal sample complexity is unknown. We provide a general approach via which we obtain sample-optimal and computationally efficient testers for all these distribution families. At the core of our approach is an algorithm which solves the following problem: Given samples from an unknown distribution p, and a known distribution q, are p and q close in Chi^2-distance, or far in total variation distance? The optimality of all testers is established by providing matching lower bounds. Finally, a necessary building block for our tester and important byproduct of our work are the first known computationally efficient proper learners for discrete log-concave and monotone hazard rate distributions. We exhibit the efficacy of our testers via experimental analysis.", "full_text": "Optimal Testing for Properties of Distributions\n\nJayadev Acharya, Constantinos Daskalakis, Gautam Kamath\n\nEECS, MIT\n\n{jayadev, costis, g}@mit.edu\n\nAbstract\n\nGiven samples from an unknown discrete distribution p, is it possible to distinguish whether p belongs to some class of distributions $\\mathcal{C}$ versus p being far from every distribution in $\\mathcal{C}$?
This fundamental question has received tremendous attention in statistics, focusing primarily on asymptotic analysis, as well as in information theory and theoretical computer science, where the emphasis has been on small sample size and computational complexity. Nevertheless, even for basic properties of discrete distributions such as monotonicity, independence, log-concavity, unimodality, and monotone-hazard rate, the optimal sample complexity is unknown.\nWe provide a general approach via which we obtain sample-optimal and computationally efficient testers for all these distribution families. At the core of our approach is an algorithm which solves the following problem: Given samples from an unknown distribution p, and a known distribution q, are p and q close in $\\chi^2$-distance, or far in total variation distance?\nThe optimality of our testers is established by providing matching lower bounds, up to constant factors. Finally, a necessary building block for our testers and an important byproduct of our work are the first known computationally efficient proper learners for discrete log-concave and monotone hazard rate distributions.\n\n1 Introduction\n\nThe quintessential scientific question is whether an unknown object has some property, i.e. whether a model from a specific class fits the object\u2019s observed behavior. If the unknown object is a probability distribution, p, to which we have sample access, we are typically asked to distinguish whether p belongs to some class $\\mathcal{C}$ or whether it is sufficiently far from it.\nThis question has received tremendous attention in the field of statistics (see, e.g., [1, 2]), where test statistics for important properties such as the ones we consider here have been proposed. Nevertheless, the emphasis has been on asymptotic analysis, characterizing the rates of convergence of test statistics under null hypotheses, as the number of samples tends to infinity.
In contrast, we wish to study the following problem in the small sample regime:\n\n$\\Pi(\\mathcal{C}, \\varepsilon)$: Given a family of distributions $\\mathcal{C}$, some $\\varepsilon > 0$, and sample access to an unknown distribution p over a discrete support, how many samples are required to distinguish between $p \\in \\mathcal{C}$ versus $d_{\\mathrm{TV}}(p, \\mathcal{C}) > \\varepsilon$?$^1$\n\n$^1$ We want success probability at least 2/3, which can be boosted to $1 - \\delta$ by repeating the test $O(\\log(1/\\delta))$ times and taking the majority.\n\nThe problem has been studied intensely in the literature on property testing and sublinear algorithms [3, 4, 5], where the emphasis has been on characterizing the optimal tradeoff between p\u2019s support size and the accuracy $\\varepsilon$ in the number of samples. Several results have been obtained, roughly clustering into three groups, where (i) $\\mathcal{C}$ is the class of monotone distributions over [n], or more generally a poset [6, 7]; (ii) $\\mathcal{C}$ is the class of independent, or k-wise independent distributions over a hypergrid [8, 9]; and (iii) $\\mathcal{C}$ contains a single distribution q, and the problem becomes that of testing whether p equals q or is far from it [8, 10, 11, 13].\nWith respect to (iii), [13] exactly characterizes the number of samples required to test identity to each distribution q, providing a single tester matching this bound simultaneously for all q. Nevertheless, this tester and its precursors are not applicable to the composite identity testing problem that we consider.\nIf our class $\\mathcal{C}$ were finite, we could test against each element in the class, albeit this would not necessarily be sample optimal. If our class $\\mathcal{C}$ were a continuum, we would need tolerant identity testers, which tend to be more expensive in terms of sample complexity [12], and result in substantially suboptimal testers for the classes we consider.
Or we could use approaches related to the generalized likelihood ratio test, but their behavior is not well-understood in our regime, and optimizing likelihood over our classes becomes computationally intense.\n\nOur Contributions We obtain sample-optimal and computationally efficient testers for $\\Pi(\\mathcal{C}, \\varepsilon)$ for the most fundamental shape restrictions to a distribution. Our contributions are the following:\n\n1. For a known distribution q over [n], and sample access to p, we show that distinguishing the cases: (a) whether the $\\chi^2$-distance between p and q is at most $\\varepsilon^2/2$, versus (b) the $\\ell_1$ distance between p and q is at least $2\\varepsilon$, requires $\\Theta(\\sqrt{n}/\\varepsilon^2)$ samples. As a corollary, we obtain an alternate argument that shows that identity testing requires $\\Theta(\\sqrt{n}/\\varepsilon^2)$ samples (previously shown in [13]).\n\n2. For the class $\\mathcal{C} = \\mathcal{M}_n^d$ of monotone distributions over $[n]^d$ we require an optimal $\\Theta(n^{d/2}/\\varepsilon^2)$ number of samples, where prior work requires $\\Omega(\\sqrt{n} \\log n/\\varepsilon^6)$ samples for d = 1 and $\\tilde{\\Omega}(n^{d-1/2}\\,\\mathrm{poly}(1/\\varepsilon))$ for d > 1 [6, 7]. Our results improve the exponent of n with respect to d, shave all logarithmic factors in n, and improve the exponent of $\\varepsilon$ by at least a factor of 2.\n(a) A useful building block and interesting byproduct of our analysis is extending Birg\u00e9\u2019s oblivious decomposition for single-dimensional monotone distributions [14] to monotone distributions in $d \\ge 1$, and to the stronger notion of $\\chi^2$-distance. See Section C.1.\n(b) Moreover, we show that $O(\\log^d n)$ samples suffice to learn a monotone distribution over $[n]^d$ in $\\chi^2$-distance. See Lemma 3 for the precise statement.\n\n3. For the class $\\mathcal{C} = \\Pi_d$ of product distributions over $[n_1] \\times \\cdots \\times [n_d]$, our algorithm requires $O\\left(\\left((\\prod_\\ell n_\\ell)^{1/2} + \\sum_\\ell n_\\ell\\right)/\\varepsilon^2\\right)$ samples. We note that a product distribution is one where all marginals are independent, so this is equivalent to testing if a collection of random variables are all independent.
In the case where the $n_\\ell$\u2019s are large, the first term dominates, and the sample complexity is $O((\\prod_\\ell n_\\ell)^{1/2}/\\varepsilon^2)$. In particular, when d is a constant and all $n_\\ell$\u2019s are equal to n, we achieve the optimal sample complexity of $\\Theta(n^{d/2}/\\varepsilon^2)$. To the best of our knowledge, this is the first result for $d \\ge 3$, and when d = 2, this improves the previously known complexity from $O\\left(\\frac{n}{\\varepsilon^6}\\,\\mathrm{polylog}(n/\\varepsilon)\\right)$ [8, 15], significantly improving the dependence on $\\varepsilon$ and shaving all logarithmic factors.\n4. For the classes $\\mathcal{C} = \\mathcal{LCD}_n$, $\\mathcal{C} = \\mathcal{MHR}_n$ and $\\mathcal{C} = \\mathcal{U}_n$ of log-concave, monotone-hazard-rate and unimodal distributions over [n], we require an optimal $\\Theta(\\sqrt{n}/\\varepsilon^2)$ number of samples. Our testers for $\\mathcal{LCD}_n$ and $\\mathcal{MHR}_n$ are to our knowledge the first for these classes for the low sample regime we are studying; see [16] and its references for statistics literature on the asymptotic regime. Our tester for $\\mathcal{U}_n$ improves the dependence of the sample complexity on $\\varepsilon$ by at least a factor of 2 in the exponent, and shaves all logarithmic factors in n, compared to testers based on testing monotonicity.\n(a) A useful building block and important byproduct of our analysis are the first computationally efficient algorithms for properly learning log-concave and monotone-hazard-rate distributions, to within $\\varepsilon$ in total variation distance, from $\\mathrm{poly}(1/\\varepsilon)$ samples, independent of the domain size n. See Corollaries 4 and 6. Again, these are the first computationally efficient algorithms to our knowledge in the low sample regime. [17] provide algorithms for density estimation, which are non-proper, i.e. will approximate an unknown distribution from these classes with a distribution that does not belong to these classes. On the other hand, the statistics literature focuses on maximum-likelihood estimation in the asymptotic regime; see e.g. [18] and its references.\n\n5.
For all the above classes we obtain matching lower bounds, showing that the sample complexity of our testers is optimal with respect to n, $\\varepsilon$ and when applicable d. See Section 8. Our lower bounds are based on extending Paninski\u2019s lower bound for testing uniformity [10].\n\nOur Techniques At the heart of our tester lies a novel use of the $\\chi^2$ statistic. Naturally, the $\\chi^2$ and its related $\\ell_2$ statistic have been used in several of the afore-cited results. We propose a new use of the $\\chi^2$ statistic enabling our optimal sample complexity. The essence of our approach is to first draw a small number of samples (independent of n for log-concave and monotone-hazard-rate distributions and only logarithmic in n for monotone and unimodal distributions) to approximate the unknown distribution p in $\\chi^2$ distance. If $p \\in \\mathcal{C}$, our learner is required to output a distribution q that is $O(\\varepsilon)$-close to $\\mathcal{C}$ in total variation and $O(\\varepsilon^2)$-close to p in $\\chi^2$ distance. Then some analysis reduces our testing problem to distinguishing the following cases:\n\n\u2022 p and q are $O(\\varepsilon^2)$-close in $\\chi^2$ distance; this case corresponds to $p \\in \\mathcal{C}$.\n\u2022 p and q are $\\Omega(\\varepsilon)$-far in total variation distance; this case corresponds to $d_{\\mathrm{TV}}(p, \\mathcal{C}) > \\varepsilon$.\n\nWe draw a comparison with robust identity testing, in which one must distinguish whether p and q are $c_1\\varepsilon$-close or $c_2\\varepsilon$-far in total variation distance, for constants $c_2 > c_1 > 0$. In [12], Valiant and Valiant show that $\\Omega(n/\\log n)$ samples are required for this problem, a nearly-linear sample complexity which may be prohibitively large in many settings. In comparison, the problem we study tests for $\\chi^2$ closeness rather than total variation closeness: a relaxation of the previous problem. However, our tester demonstrates that this relaxation allows us to achieve a substantially sublinear complexity of $O(\\sqrt{n}/\\varepsilon^2)$.
On the other hand, this relaxation is still tight enough to be useful, demonstrated by our application in obtaining sample-optimal testers.\nWe note that while the $\\chi^2$ statistic for testing hypothesis is prevalent in statistics providing optimal error exponents in the large-sample regime, to the best of our knowledge, in the small-sample regime, modified versions of the $\\chi^2$ statistic have only been recently used for closeness-testing in [19, 20, 21] and for testing uniformity of monotone distributions in [22]. In particular, [19] design an unbiased statistic for estimating the $\\chi^2$ distance between two unknown distributions.\n\nOrganization In Section 4, we show that a version of the $\\chi^2$ statistic, appropriately excluding certain elements of the support, is sufficiently well-concentrated to distinguish between the above cases. Moreover, the sample complexity of our algorithm is optimal for most classes. Our base tester is combined with the afore-mentioned extension of Birg\u00e9\u2019s decomposition theorem to test monotone distributions in Section 5 (see Theorem 2 and Corollary 1), and is also used to test independence of distributions in Section 6 (see Theorem 3).\nIn Section 7, we give our results on testing unimodal, log-concave and monotone hazard rate distributions. Naturally, there are several bells and whistles that we need to add to the above skeleton to accommodate all classes of distributions that we are considering. In Remark 1 we mention the additional modifications for these classes.\n\nRelated Work. For the problems that we study in this paper, we have provided the related works in the previous section along with our contributions. We cannot do justice to the role of shape restrictions of probability distributions in probabilistic modeling and testing. It suffices to say that the classes of distributions that we study are fundamental, motivating extensive literature on their learning and testing [23].
In recent times, there has been work on shape restricted statistics, pioneered by Jon Wellner, and others. [24, 25] study estimation of monotone and k-monotone densities, and [26, 27] study estimation of log-concave distributions. Due to the sheer volume of literature in statistics in this field, we will restrict ourselves to those already referenced.\nAs we have mentioned, statistics has focused on the asymptotic regime as the number of samples tends to infinity. Instead we are considering the low sample regime and are more stringent about the behavior of our testers, requiring 2-sided guarantees. We want to accept if the unknown distribution is in our class of interest, and also reject if it is far from the class. For this problem, as discussed above, there are few results when $\\mathcal{C}$ is a whole class of distributions. More closely related to our paper is the line of papers [6, 7, 28] for monotonicity testing, albeit these papers have sub-optimal sample complexity as discussed above. Testing independence of random variables has a long history in statistics [29, 30]. The theoretical computer science community has also considered the problem of testing independence of two random variables [8, 15]. While our results sharpen the case where the variables are over domains of equal size, they demonstrate an interesting asymmetric upper bound when this is not the case. More recently, Acharya and Daskalakis provide optimal testers for the family of Poisson Binomial Distributions [31].\nFinally, contemporaneous work of Canonne et al. [32] provides a generic algorithm and lower bounds for the single-dimensional families of distributions considered here. We note that their algorithm has a sample complexity which is suboptimal in both n and $\\varepsilon$, while our algorithms are optimal. Their algorithm also extends to mixtures of these classes, though some of these extensions are not computationally efficient.
They also provide a framework for proving lower bounds, giving the optimal bounds for many classes when $\\varepsilon$ is sufficiently large with respect to 1/n. In comparison, we provide these lower bounds unconditionally by modifying Paninski\u2019s construction [10] to suit the classes we consider.\n\n2 Preliminaries\n\nWe use the following probability distances in our paper.\nThe total variation distance between distributions p and q is $d_{\\mathrm{TV}}(p, q) \\stackrel{\\mathrm{def}}{=} \\sup_A |p(A) - q(A)| = \\frac{1}{2}\\|p - q\\|_1$. The $\\chi^2$-distance between p and q over [n] is defined as $\\chi^2(p, q) \\stackrel{\\mathrm{def}}{=} \\sum_{i \\in [n]} \\frac{(p_i - q_i)^2}{q_i}$. The Kolmogorov distance between two probability measures p and q over an ordered set (e.g., $\\mathbb{R}$) with cumulative distribution functions $F_p$ and $F_q$ is $d_K(p, q) \\stackrel{\\mathrm{def}}{=} \\sup_{x \\in \\mathbb{R}} |F_p(x) - F_q(x)|$.\nOur paper is primarily concerned with testing against classes of distributions, defined formally as:\nDefinition 1. Given $\\varepsilon \\in (0, 1]$ and sample access to a distribution p, an algorithm is said to test a class $\\mathcal{C}$ if it has the following guarantees:\n\n\u2022 If $p \\in \\mathcal{C}$, the algorithm outputs ACCEPT with probability at least 2/3;\n\u2022 If $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, the algorithm outputs REJECT with probability at least 2/3.\n\nWe note the following useful relationships between these distances [33]:\nProposition 1. $d_K(p, q)^2 \\le d_{\\mathrm{TV}}(p, q)^2 \\le \\frac{1}{4}\\chi^2(p, q)$.\nDefinition 2. An $\\eta$-effective support of a distribution p is any set S such that $p(S) \\ge 1 - \\eta$.\nThe flattening of a function f over a subset S is the function $\\bar{f}$ such that $\\bar{f}_i = f(S)/|S|$ for $i \\in S$.\nDefinition 3. Let p be a distribution, and suppose $I_1, \\ldots$ is a partition of the domain. The flattening of p with respect to $I_1, \\ldots$ is the distribution $\\bar{p}$ which is the flattening of p over the intervals $I_1, \\ldots$.\n\nPoisson Sampling Throughout this paper, we use the standard Poissonization approach.
Instead of drawing exactly m samples from a distribution p, we first draw $m' \\sim \\mathrm{Poisson}(m)$, and then draw $m'$ samples from p. As a result, the numbers of times different elements in the support of p occur in the sample become independent, giving much simpler analyses. In particular, the number of times we will observe domain element i will be distributed as $\\mathrm{Poisson}(m p_i)$, independently for each i. Since $\\mathrm{Poisson}(m)$ is tightly concentrated around m, this additional flexibility comes only at a sub-constant cost in the sample complexity, with an additive increase in the error probability that is inversely exponential in m.\n\n3 The Testing Algorithm \u2013 An Overview\nOur algorithm for testing a class $\\mathcal{C}$ can be decomposed into three steps.\nNear-proper learning in $\\chi^2$-distance. Our first step requires learning with very specific guarantees. Given sample access to $p \\in \\mathcal{C}$, we wish to output q such that (i) q is close to $\\mathcal{C}$ in total variation distance, and (ii) p and q are $O(\\varepsilon^2)$-close in $\\chi^2$-distance on an $\\varepsilon$-effective support$^2$ of p. When p is not in $\\mathcal{C}$, we do not guarantee anything about q. From an information theoretic standpoint, this problem is harder than learning the distribution in total variation, since $\\chi^2$-distance is more restrictive than total variation distance. Nonetheless, for the structured classes we consider, we are able to learn in $\\chi^2$ by modifying the approaches to learn in total variation.\n\n$^2$ We also require the algorithm to output a description of an effective support for which this property holds. This requirement can be slightly relaxed, as we show in our results for testing unimodality.\n\nComputation of distance to class. The next step is to see if the hypothesis q is close to the class $\\mathcal{C}$ or not. Since we have an explicit description of q, this step requires no further samples from p, i.e. it is purely computational.
If we find that q is far from the class $\\mathcal{C}$, then it must be that $p \\notin \\mathcal{C}$, as otherwise the guarantees from the previous step would imply that q is close to $\\mathcal{C}$. Thus, if q is not close to $\\mathcal{C}$, we can terminate the algorithm at this point.\n$\\chi^2$-testing. At this point, the previous two steps guarantee that our distribution q is such that:\n- If $p \\in \\mathcal{C}$, then p and q are close in $\\chi^2$ distance on a (known) effective support of p;\n- If $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, then p and q are far in total variation distance.\nWe can distinguish between these two cases using $O(\\sqrt{n}/\\varepsilon^2)$ samples with a simple statistical $\\chi^2$-test, that we describe in Section 4.\n\nUsing the above three-step approach, our tester, as described in the next section, can directly test monotonicity, log-concavity, and monotone hazard rate. With an extra trick, using Kolmogorov\u2019s max inequality, it can also test unimodality.\n\n4 A Robust $\\chi^2$-$\\ell_1$ Identity Test\n\nOur main result in this section is Theorem 1.\nTheorem 1. Given $\\varepsilon \\in (0, 1]$, a class of probability distributions $\\mathcal{C}$, sample access to a distribution p, and an explicit description of a distribution q, both over [n] with the following properties:\n\nProperty 1. $d_{\\mathrm{TV}}(q, \\mathcal{C}) \\le \\varepsilon/2$.\nProperty 2. If $p \\in \\mathcal{C}$, then $\\chi^2(p, q) \\le \\varepsilon^2/500$.\n\nThen there exists an algorithm such that: If $p \\in \\mathcal{C}$, it outputs ACCEPT with probability at least 2/3; If $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, it outputs REJECT with probability at least 2/3. The time and sample complexity of this algorithm are $O(\\sqrt{n}/\\varepsilon^2)$.\n\nProof.
Algorithm 1 describes a $\\chi^2$ testing procedure that gives the guarantee of the theorem.\n\nAlgorithm 1 Chi-squared testing algorithm\n1: Input: $\\varepsilon$; an explicit distribution q; (Poisson) m samples from a distribution p, where $N_i$ denotes the number of occurrences of the ith domain element.\n2: $A \\leftarrow \\{ i : q_i \\ge \\varepsilon^2/(50n) \\}$\n3: $Z \\leftarrow \\sum_{i \\in A} \\frac{(N_i - m q_i)^2 - N_i}{m q_i}$\n4: if $Z \\le m\\varepsilon^2/10$ return close\n5: else return far\n\nIn Section A we compute the mean and variance of the statistic Z (defined in Algorithm 1) as:\n\n$$E[Z] = m \\cdot \\sum_{i \\in A} \\frac{(p_i - q_i)^2}{q_i} = m \\cdot \\chi^2(p_A, q_A), \\qquad \\mathrm{Var}[Z] = \\sum_{i \\in A} \\left[ 2\\frac{p_i^2}{q_i^2} + 4m \\cdot \\frac{p_i \\cdot (p_i - q_i)^2}{q_i^2} \\right] \\qquad (1)$$\n\nwhere by $p_A$ and $q_A$ we denote respectively the vectors p and q restricted to the coordinates in A, and we slightly abuse notation when we write $\\chi^2(p_A, q_A)$, as these do not then correspond to probability distributions.\nLemma 1 demonstrates the separation in the means of the statistic Z in the two cases of interest, i.e., $p \\in \\mathcal{C}$ versus $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, and Lemma 2 shows the separation in the variances in the two cases. These two results are proved in Section B.\n\nLemma 1. If $p \\in \\mathcal{C}$, then $E[Z] \\le m\\varepsilon^2/500$. If $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, then $E[Z] \\ge m\\varepsilon^2/5$.\nLemma 2. Let $m \\ge 20000\\sqrt{n}/\\varepsilon^2$. If $p \\in \\mathcal{C}$ then $\\mathrm{Var}[Z] \\le \\frac{1}{500000} m^2\\varepsilon^4$. If $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, then $\\mathrm{Var}[Z] \\le \\frac{1}{100} E[Z]^2$.\nAssuming Lemmas 1 and 2, Theorem 1 is now a simple application of Chebyshev\u2019s inequality.\n\nWhen $p \\in \\mathcal{C}$, we have that $E[Z] + \\sqrt{3}\\,\\mathrm{Var}[Z]^{1/2} \\le \\left(1/500 + \\sqrt{3/500000}\\right) m\\varepsilon^2 \\le m\\varepsilon^2/200$. Thus, Chebyshev\u2019s inequality gives\n\n$$\\Pr\\left[Z \\ge m\\varepsilon^2/10\\right] \\le \\Pr\\left[Z \\ge m\\varepsilon^2/200\\right] \\le \\Pr\\left[Z - E[Z] \\ge \\sqrt{3}\\,\\mathrm{Var}[Z]^{1/2}\\right] \\le 1/3.$$\n\nWhen $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, $E[Z] - \\sqrt{3}\\,\\mathrm{Var}[Z]^{1/2} \\ge \\left(1 - \\sqrt{3/100}\\right) E[Z] \\ge 3m\\varepsilon^2/20$. Therefore,\n\n$$\\Pr\\left[Z \\le m\\varepsilon^2/10\\right] \\le \\Pr\\left[Z \\le 3m\\varepsilon^2/20\\right] \\le \\Pr\\left[Z - E[Z] \\le -\\sqrt{3}\\,\\mathrm{Var}[Z]^{1/2}\\right] \\le 1/3.$$\n\nThis proves the correctness of Algorithm 1.
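The statistic of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (poisson, chi2_statistic, chi2_test, poissonized_counts) are hypothetical, the counts N_i are passed in as a plain list, and the Poisson sampler that counts unit-rate exponential arrivals is a stand-in chosen to avoid external libraries.

```python
import random


def poisson(lam, rng):
    # Hypothetical helper: count unit-rate exponential arrivals in [0, lam].
    # The number of such arrivals is Poisson(lam) distributed.
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def poissonized_counts(p, m, rng):
    # Under Poissonization, N_i ~ Poisson(m * p_i) independently for each i.
    return [poisson(m * p_i, rng) for p_i in p]


def chi2_statistic(counts, q, m, eps):
    # Z = sum over A of ((N_i - m q_i)^2 - N_i) / (m q_i),
    # where A = {i : q_i >= eps^2 / (50 n)} and n is the domain size.
    n = len(q)
    threshold = eps ** 2 / (50 * n)
    z = 0.0
    for n_i, q_i in zip(counts, q):
        if q_i >= threshold:
            z += ((n_i - m * q_i) ** 2 - n_i) / (m * q_i)
    return z


def chi2_test(counts, q, m, eps):
    # Step 4 of Algorithm 1: compare Z against the cutoff m * eps^2 / 10.
    if chi2_statistic(counts, q, m, eps) <= m * eps ** 2 / 10:
        return 'close'
    return 'far'
```

To try it, one could draw counts with poissonized_counts(p, m, random.Random(0)) and pass them to chi2_test together with the known q. Restricting the sum to A keeps every denominator m q_i bounded away from zero, which is what controls the variance in equation (1).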
For the running time, we divide the summation in Z into the elements for which $N_i > 0$ and $N_i = 0$. When $N_i = 0$, the contribution of the term to the summation is $m q_i$, and we can sum them up by subtracting the total probability of all elements appearing at least once from 1.\nRemark 1. To apply Theorem 1, we need to learn a distribution in $\\mathcal{C}$ and find a q that is $O(\\varepsilon^2)$-close in $\\chi^2$-distance to p. For the class of monotone distributions, we are able to efficiently obtain such a q, which immediately implies sample-optimal learning algorithms for this class. However, for some classes, we may not be able to learn a q with such strong guarantees, and we must consider modifications to our base testing algorithm.\nFor example, for log-concave and monotone hazard rate distributions, we can obtain a distribution q and a set S with the following guarantees:\n- If $p \\in \\mathcal{C}$, then $\\chi^2(p_S, q_S) \\le O(\\varepsilon^2)$ and $p(S) \\ge 1 - O(\\varepsilon)$;\n- If $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, then $d_{\\mathrm{TV}}(p, q) \\ge \\varepsilon/2$.\nIn this scenario, the tester will simply pretend that the support of p and q is S, ignoring any samples and support elements in $[n] \\setminus S$. Analysis of this tester is extremely similar to Theorem 1. In particular, we can still show that the statistic Z will be separated in the two cases. When $p \\in \\mathcal{C}$, excluding $[n] \\setminus S$ will only reduce Z. On the other hand, when $d_{\\mathrm{TV}}(p, \\mathcal{C}) \\ge \\varepsilon$, since $p(S) \\ge 1 - O(\\varepsilon)$, p and q must still be far on the remaining support, and we can show that Z is still sufficiently large. Therefore, a small modification allows us to handle this case with the same sample complexity of $O(\\sqrt{n}/\\varepsilon^2)$.\nFor unimodal distributions, we are unable even to identify a large enough subset of the support where the $\\chi^2$ approximation is guaranteed to be tight. But we can show that there exists a light enough piece of the support (in terms of probability mass under p) that we can exclude to make the $\\chi^2$ approximation tight.
Given that we only use Chebyshev\u2019s inequality to prove the concentration of the test statistic, it would seem that our lack of knowledge of the piece to exclude would involve a union bound and a corresponding increase in the required number of samples. We avoid this through a careful application of Kolmogorov\u2019s max inequality in our setting. See Theorem 7 of Section 7.\n\n5 Testing Monotonicity\n\nAs the first application of our testing framework, we will demonstrate how to test for monotonicity. Let $d \\ge 1$, and $i = (i_1, \\ldots, i_d), j = (j_1, \\ldots, j_d) \\in [n]^d$. We say $i \\succeq j$ if $i_l \\ge j_l$ for $l = 1, \\ldots, d$. A distribution p over $[n]^d$ is monotone (decreasing) if for all $i \\succeq j$, $p_i \\le p_j$.$^3$\n\n$^3$ This definition describes monotone non-increasing distributions. By symmetry, identical results hold for monotone non-decreasing distributions.\n\nWe follow the steps in the overview. The learning result we show is as follows (proved in Section C).\nLemma 3. Let $d \\ge 1$. There is an algorithm that takes $m = O((d \\log(n)/\\varepsilon^2)^d/\\varepsilon^2)$ samples from a distribution p over $[n]^d$, and outputs a distribution q such that if p is monotone, then with probability at least 5/6, $\\chi^2(p, q) \\le \\varepsilon^2/500$. Furthermore, the distance of q to monotone distributions can be computed in time poly(m).\n\nThis accomplishes the first two steps in the overview. In particular, if the distance of q from monotone distributions is more than $\\varepsilon/2$, we declare that p is not monotone. Therefore, Property 1 in Theorem 1 is satisfied, and the lemma states that Property 2 holds with probability at least 5/6. We then proceed to the $\\chi^2$-$\\ell_1$ test. At this point, we have precisely the guarantees needed to apply Theorem 1 over $[n]^d$, directly implying our main result of this section:\nTheorem 2. For any $d \\ge 1$, there exists an algorithm for testing monotonicity over $[n]^d$ with sample complexity\n\n$$O\\left( \\frac{n^{d/2}}{\\varepsilon^2} + \\left( \\frac{d \\log n}{\\varepsilon^2} \\right)^d \\cdot \\frac{1}{\\varepsilon^2} \\right)$$\n\nand time complexity $O(n^{d/2}/\\varepsilon^2 + \\mathrm{poly}(\\log n, 1/\\varepsilon)^d)$.\nIn particular, this implies the following optimal algorithms for monotonicity testing for all $d \\ge 1$:\nCorollary 1. Fix any $d \\ge 1$, and suppose $\\varepsilon > \\sqrt{d \\log n}/n^{1/4}$. Then there exists an algorithm for testing monotonicity over $[n]^d$ with sample complexity $O(n^{d/2}/\\varepsilon^2)$.\n\nWe note that the class of monotone distributions is the simplest of the classes we consider. We now consider testing for log-concavity, monotone hazard rate, and unimodality, all of which are much more challenging to test. In particular, these classes require a more sophisticated structural understanding, more complex proper $\\chi^2$-learning algorithms, and non-trivial modifications to our $\\chi^2$-tester. We have already given some details on the required adaptations to the tester in Remark 1.\nOur algorithms for learning these classes use convex programming. One of the main challenges is to enforce log-concavity of the PDF when learning $\\mathcal{LCD}_n$ (respectively, of the CDF when learning $\\mathcal{MHR}_n$), while simultaneously enforcing closeness in total variation distance. This involves a careful choice of our variables, and we exploit structural properties of the classes to ensure the soundness of particular Taylor approximations. We encourage the reader to refer to the proofs of Theorems 7, 8, and 9 for more details.\n\n6 Testing Independence of Random Variables\n\nLet $\\mathcal{X} \\stackrel{\\mathrm{def}}{=} [n_1] \\times \\ldots \\times [n_d]$, and let $\\Pi_d$ be the class of all product distributions over $\\mathcal{X}$. Similar to learning monotone distributions in $\\chi^2$ distance we prove the following result in Section E.\nLemma 4. There is an algorithm that takes $O\\left((\\sum_{\\ell=1}^d n_\\ell)/\\varepsilon^2\\right)$ samples from a distribution p and outputs a $q \\in \\Pi_d$ such that if $p \\in \\Pi_d$, then with probability at least 5/6, $\\chi^2(p, q) \\le O(\\varepsilon^2)$.\nThe distribution q always satisfies Property 1 since it is in $\\Pi_d$, and by this lemma, with probability at least 5/6 satisfies Property 2 in Theorem 1. Therefore, we obtain the following result.\nTheorem 3. For any $d \\ge 1$, there exists an algorithm for testing independence of random variables over $[n_1] \\times \\ldots \\times [n_d]$ with sample and time complexity $O\\left(\\left((\\prod_{\\ell=1}^d n_\\ell)^{1/2} + \\sum_{\\ell=1}^d n_\\ell\\right)/\\varepsilon^2\\right)$.\n\nWhen d = 2 and $n_1 = n_2 = n$ this improves the result of [8] for testing independence of two random variables.\nCorollary 2. Testing if two distributions over [n] are independent has sample complexity $\\Theta(n/\\varepsilon^2)$.\n\n7 Testing Unimodality, Log-Concavity and Monotone Hazard Rate\n\nUnimodal distributions over [n] (denoted by $\\mathcal{U}_n$) are all distributions p for which there exists an $i^*$ such that $p_i$ is non-decreasing for $i \\le i^*$ and non-increasing for $i \\ge i^*$. Log-concave distributions over [n] (denoted by $\\mathcal{LCD}_n$) form the sub-class of unimodal distributions for which $p_{i-1} p_{i+1} \\le p_i^2$. Monotone hazard rate (MHR) distributions over [n] (denoted by $\\mathcal{MHR}_n$) are distributions p with CDF F for which $i < j$ implies $\\frac{f_i}{1 - F_i} \\le \\frac{f_j}{1 - F_j}$.\n\nThe following theorem bounds the complexity of testing these classes (for moderate $\\varepsilon$).\nTheorem 4. Suppose $\\varepsilon > n^{-1/5}$. For each of the classes, unimodal, log-concave, and MHR, there exists an algorithm for testing the class over [n] with sample complexity $O(\\sqrt{n}/\\varepsilon^2)$.\nThis result is a corollary of the specific results for each class, which are proved in the appendix. In particular, a more complete statement for unimodality, log-concavity and monotone-hazard rate, with precise dependence on both n and $\\varepsilon$ is given in Theorems 7, 8 and 9 respectively. We mention some key points about each class, and refer the reader to the respective appendix for further details.\nTesting Unimodality Using a union bound argument, one can use the results on testing monotonicity to give an algorithm with $O(\\sqrt{n} \\log n/\\varepsilon^2)$ samples. However, this is unsatisfactory, since our lower bound (and as we will demonstrate, the true complexity of this problem) is $\\sqrt{n}/\\varepsilon^2$. We overcome the logarithmic barrier introduced by the union bound, by employing a non-oblivious decomposition of the domain, and using Kolmogorov\u2019s max-inequality.\nTesting Log-Concavity The key step is to design an algorithm to learn a log-concave distribution in $\\chi^2$ distance. We formulate the problem as a linear program in the logarithms of the distribution and show that using $O(1/\\varepsilon^5)$ samples, it is possible to output a log-concave distribution that has a $\\chi^2$ distance at most $O(\\varepsilon^2)$ from the underlying log-concave distribution.\nTesting Monotone Hazard Rate For learning MHR distributions in $\\chi^2$ distance, we formulate a linear program in the logarithms of the CDF and show that using $O(\\log(n/\\varepsilon)/\\varepsilon^5)$ samples, it is possible to output a MHR distribution that has a $\\chi^2$ distance at most $O(\\varepsilon^2)$ from the underlying MHR distribution.\n\n8 Lower Bounds\n\nWe now prove sharp lower bounds for the classes of distributions we consider. We show that the example studied by Paninski [10] to prove lower bounds on testing uniformity can be used to prove lower bounds for the classes we consider. They consider a class $\\mathcal{Q}$ consisting of $2^{n/2}$ distributions defined as follows. Without loss of generality assume that n is even.
For each of the $2^{n/2}$ vectors $z_0 z_1 \\ldots z_{n/2-1} \\in \\{-1, 1\\}^{n/2}$, define a distribution $q \\in \\mathcal{Q}$ over [n] as follows.\n\n$$q_i = \\begin{cases} \\frac{1 + z_\\ell c \\varepsilon}{n} & \\text{for } i = 2\\ell + 1 \\\\ \\frac{1 - z_\\ell c \\varepsilon}{n} & \\text{for } i = 2\\ell. \\end{cases} \\qquad (2)$$\n\nEach distribution in $\\mathcal{Q}$ has a total variation distance $c\\varepsilon/2$ from $U_n$, the uniform distribution over [n]. By choosing c to be an appropriate constant, Paninski [10] showed that a distribution picked uniformly at random from $\\mathcal{Q}$ cannot be distinguished from $U_n$ with fewer than $\\sqrt{n}/\\varepsilon^2$ samples with probability at least 2/3.\nSuppose $\\mathcal{C}$ is a class of distributions such that (i) the uniform distribution $U_n$ is in $\\mathcal{C}$, and (ii) for appropriately chosen c, $d_{\\mathrm{TV}}(\\mathcal{C}, \\mathcal{Q}) \\ge \\varepsilon$. Then testing $\\mathcal{C}$ is not easier than distinguishing $U_n$ from $\\mathcal{Q}$. Invoking [10] immediately implies that testing the class $\\mathcal{C}$ requires $\\Omega(\\sqrt{n}/\\varepsilon^2)$ samples.\nThe lower bounds for all the one dimensional distributions will follow directly from this construction, and for testing monotonicity in higher dimensions, we extend this construction to $d \\ge 1$, appropriately. These arguments are proved in Section H, leading to the following lower bounds for testing these classes:\nTheorem 5.\n\u2022 For any $d \\ge 1$, any algorithm for testing monotonicity over $[n]^d$ requires $\\Omega(n^{d/2}/\\varepsilon^2)$ samples.\n\u2022 For $d \\ge 1$, testing independence over $[n_1] \\times \\cdots \\times [n_d]$ requires $\\Omega((n_1 n_2 \\ldots n_d)^{1/2}/\\varepsilon^2)$ samples.\n\u2022 Testing unimodality, log-concavity, or monotone hazard rate over [n] needs $\\Omega(\\sqrt{n}/\\varepsilon^2)$ samples.\n\nReferences\n[1] R. A. Fisher, Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, 1925.\n[2] E. Lehmann and J. Romano, Testing statistical hypotheses. Springer Science & Business Media, 2006.\n[3] E. Fischer, \u201cThe art of uninformed decisions: A primer to property testing,\u201d Science, 2001.\n[4] R. Rubinfeld, \u201cSublinear-time algorithms,\u201d in International Congress of Mathematicians, 2006.\n[5] C. L.
Canonne, \u201cA survey on distribution testing: your data is big, but is it blue,\u201d ECCC, 2015.\n[6] T. Batu, R. Kumar, and R. Rubinfeld, \u201cSublinear algorithms for testing monotone and unimodal distributions,\u201d in Proceedings of STOC, 2004.\n[7] A. Bhattacharyya, E. Fischer, R. Rubinfeld, and P. Valiant, \u201cTesting monotonicity of distributions over general partial orders,\u201d in ICS, 2011, pp. 239\u2013252.\n[8] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White, \u201cTesting random variables for independence and identity,\u201d in Proceedings of FOCS, 2001.\n[9] N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie, \u201cTesting k-wise and almost k-wise independence,\u201d in Proceedings of STOC, 2007.\n[10] L. Paninski, \u201cA coincidence-based test for uniformity given very sparsely sampled discrete data,\u201d IEEE Transactions on Information Theory, vol. 54, no. 10, 2008.\n[11] D. Huang and S. Meyn, \u201cGeneralized error exponents for small sample universal hypothesis testing,\u201d IEEE Transactions on Information Theory, vol. 59, no. 12, pp. 8157\u20138181, Dec. 2013.\n[12] G. Valiant and P. Valiant, \u201cEstimating the unseen: An n/log n-sample estimator for entropy and support size, shown optimal via new CLTs,\u201d in Proceedings of STOC, 2011.\n[13] G. Valiant and P. Valiant, \u201cAn automatic inequality prover and instance optimal identity testing,\u201d in FOCS, 2014.\n[14] L. Birg\u00e9, \u201cEstimating a density under order restrictions: Nonasymptotic minimax risk,\u201d The Annals of Statistics, vol. 15, no. 3, pp. 995\u20131012, Sep. 1987.\n[15] R. Levi, D. Ron, and R. Rubinfeld, \u201cTesting properties of collections of distributions,\u201d Theory of Computing, vol. 9, no. 8, pp.
295\u2013347, 2013.\n[16] P. Hall and I. Van Keilegom, \u201cTesting for monotone increasing hazard rate,\u201d Annals of Statistics, pp. 1109\u20131137, 2005.\n[17] S. O. Chan, I. Diakonikolas, R. A. Servedio, and X. Sun, \u201cLearning mixtures of structured distributions over discrete domains,\u201d in Proceedings of SODA, 2013.\n[18] M. Cule and R. Samworth, \u201cTheoretical properties of the log-concave maximum likelihood estimator of a multidimensional density,\u201d Electronic Journal of Statistics, vol. 4, pp. 254\u2013270, 2010.\n[19] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, S. Pan, and A. T. Suresh, \u201cCompetitive classification and closeness testing,\u201d in COLT, 2012, pp. 22.1\u201322.18.\n[20] S. Chan, I. Diakonikolas, G. Valiant, and P. Valiant, \u201cOptimal algorithms for testing closeness of discrete distributions,\u201d in SODA, 2014, pp. 1193\u20131203.\n[21] B. B. Bhattacharya and G. Valiant, \u201cTesting closeness with unequal sized samples,\u201d in NIPS, 2015.\n[22] J. Acharya, A. Jafarpour, A. Orlitsky, and A. Theertha Suresh, \u201cA competitive test for uniformity of monotone distributions,\u201d in Proceedings of AISTATS, 2013, pp. 57\u201365.\n[23] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk, Statistical Inference under Order Restrictions. New York: Wiley, 1972.\n[24] H. K. Jankowski and J. A. Wellner, \u201cEstimation of a discrete monotone density,\u201d Electronic Journal of Statistics, vol. 3, pp. 1567\u20131605, 2009.\n[25] F. Balabdaoui and J. A. Wellner, \u201cEstimation of a k-monotone density: characterizations, consistency and minimax lower bounds,\u201d Statistica Neerlandica, vol. 64, no. 1, pp. 45\u201370, 2010.\n[26] F. Balabdaoui, H. Jankowski, and K. Rufibach, \u201cMaximum likelihood estimation and confidence bands for a discrete log-concave distribution,\u201d 2011. [Online]. Available: http://arxiv.org/abs/1107.3904v1\n[27] A. Saumard and J. A. Wellner, \u201cLog-concavity and strong log-concavity: a review,\u201d Statistics Surveys, vol.
8, pp. 45\u2013114, 2014.\n[28] M. Adamaszek, A. Czumaj, and C. Sohler, \u201cTesting monotone continuous distributions on high-dimensional real cubes,\u201d in SODA, 2010, pp. 56\u201365.\n[29] J. N. Rao and A. J. Scott, \u201cThe analysis of categorical data from complex sample surveys,\u201d Journal of the American Statistical Association, vol. 76, no. 374, pp. 221\u2013230, 1981.\n[30] A. Agresti and M. Kateri, Categorical data analysis. Springer, 2011.\n[31] J. Acharya and C. Daskalakis, \u201cTesting Poisson Binomial Distributions,\u201d in SODA, 2015, pp. 1829\u20131840.\n[32] C. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld, \u201cTesting shape restrictions of discrete distributions,\u201d in STACS, 2016.\n[33] A. L. Gibbs and F. E. Su, \u201cOn choosing and bounding probability metrics,\u201d International Statistical Review, vol. 70, no. 3, pp. 419\u2013435, Dec. 2002.\n[34] J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh, \u201cEfficient compression of monotone and m-modal distributions,\u201d in ISIT, 2014.\n[35] S. Kamath, A. Orlitsky, D. Pichapati, and A. T. Suresh, \u201cOn learning distributions from their samples,\u201d in COLT, 2015.\n[36] P. Massart, \u201cThe tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality,\u201d The Annals of Probability, vol. 18, no. 3, pp. 1269\u20131283, Jul. 1990.\n", "award": [], "sourceid": 1985, "authors": [{"given_name": "Jayadev", "family_name": "Acharya", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Constantinos", "family_name": "Daskalakis", "institution": "MIT"}, {"given_name": "Gautam", "family_name": "Kamath", "institution": "MIT"}]}