{"title": "Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes", "book": "Advances in Neural Information Processing Systems", "page_first": 3412, "page_last": 3421, "abstract": "We prove that \u03f4(k d^2 / \u03b5^2) samples are necessary and sufficient for learning a mixture of k Gaussians in R^d, up to error \u03b5 in total variation distance. This improves both the known upper bounds and lower bounds for this problem. For mixtures of axis-aligned Gaussians, we show that O(k d / \u03b5^2) samples suffice, matching a known lower bound.\n\nThe upper bound is based on a novel technique for distribution learning based on a notion of sample compression. Any class of distributions that allows such a sample compression scheme can also be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in R^d has an efficient sample compression.", "full_text": "Nearly tight sample complexity bounds\n\nfor learning mixtures of Gaussians\nvia sample compression schemes\u2217\n\nHassan Ashtiani\n\nDepartment of Computing and Software\n\nMcMaster University, and\nVector Institute, ON, Canada\nzokaeiam@mcmaster.ca\nNicholas J. A. 
Harvey\n\nDepartment of Computer Science\nUniversity of British Columbia\n\nVancouver, BC, Canada\nnickhar@cs.ubc.ca\nAbbas Mehrabian\n\nSchool of Computer Science\n\nMcGill University\n\nMontr\u00e9al, QC, Canada\n\nabbasmehrabian@gmail.com\n\nShai Ben-David\n\nSchool of Computer Science\n\nUniversity of Waterloo,\nWaterloo, ON, Canada\nshai@uwaterloo.ca\nChristopher Liaw\n\nDepartment of Computer Science\nUniversity of British Columbia\n\nVancouver, BC, Canada\ncvliaw@cs.ubc.ca\n\nYaniv Plan\n\nDepartment of Mathematics\n\nUniversity of British Columbia\n\nVancouver, BC, Canada\nyaniv@math.ubc.ca\n\nAbstract\n\nWe prove that (cid:101)\u0398(kd2/\u03b52) samples are necessary and suf\ufb01cient for learning a\nof axis-aligned Gaussians, we show that (cid:101)O(kd/\u03b52) samples suf\ufb01ce, matching a\n\nmixture of k Gaussians in Rd, up to error \u03b5 in total variation distance. This improves\nboth the known upper bounds and lower bounds for this problem. For mixtures\n\nknown lower bound.\nThe upper bound is based on a novel technique for distribution learning based on a\nnotion of sample compression. Any class of distributions that allows such a sample\ncompression scheme can also be learned with few samples. Moreover, if a class of\ndistributions has such a compression scheme, then so do the classes of products\nand mixtures of those distributions. The core of our main result is showing that the\nclass of Gaussians in Rd has a small-sized sample compression.\n\n1\n\nIntroduction\n\nEstimating distributions from observed data is a fundamental task in statistics that has been studied\nfor over a century. This task frequently arises in applied machine learning and it is common to\nassume that the distribution can be modeled using a mixture of Gaussians. Popular software packages\nhave implemented heuristics, such as the EM algorithm, for learning a mixture of Gaussians. 
The theoretical machine learning community also has a rich literature on distribution learning; the recent survey [9] considers learning structured distributions, and the survey [13] focuses on mixtures of Gaussians.

This paper develops a general technique for distribution learning, then employs this technique in the important setting of mixtures of Gaussians. The theoretical model we adopt is density estimation: given i.i.d. samples from an unknown target distribution, find a distribution that is close to the target distribution in total variation (TV) distance. Our focus is on sample complexity bounds: using as few samples as possible to obtain a good estimate of the target distribution. For background on this model see, e.g., [7, Chapter 5] and [9].

∗For the full version of this paper see [2].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our new technique for proving upper bounds on the sample complexity involves a form of sample compression. If it is possible to "encode" members of a class of distributions using a carefully chosen subset of the samples, then this yields an upper bound on the sample complexity of distribution learning for that class. In particular, by constructing compression schemes for mixtures of axis-aligned Gaussians and for general Gaussians, we obtain new upper bounds on the sample complexity of learning with respect to these classes, which we prove to be optimal up to logarithmic factors.

1.1 Main results

In this section, all learning results refer to the problem of producing a distribution within total variation distance ε of the target distribution. Our first main result is an upper bound for learning mixtures of multivariate Gaussians.
This bound is tight up to logarithmic factors.

Theorem 1.1 The class of k-mixtures of d-dimensional Gaussians can be learned using Õ(kd^2/ε^2) samples.

We emphasize that the Õ(·) notation hides a factor polylog(kd/ε), but has no dependence whatsoever on the condition number or scaling of the distribution. Previously, the best known upper bounds on the sample complexity of this problem were Õ(kd^2/ε^4), due to [3], and O(k^4 d^4/ε^2), based on a VC-dimension bound that we discuss below. For the case of a single Gaussian (i.e., k = 1), a sample complexity bound of O(d^2/ε^2) is well known, again using a VC-dimension bound discussed below.

Our second main result is a minimax lower bound matching Theorem 1.1 up to logarithmic factors.

Theorem 1.2 Any method for learning the class of k-mixtures of d-dimensional Gaussians has sample complexity Ω(kd^2/(ε^2 log^3(1/ε))) = Ω̃(kd^2/ε^2).

Here and below, Ω̃ (and Õ) allow for polylogarithmic factors. Previously, the best known lower bound on the sample complexity was Ω̃(kd/ε^2) [20]. Even for a single Gaussian (i.e., k = 1), an Ω̃(d^2/ε^2) lower bound was not known prior to this work.

Our third main result is an upper bound for learning mixtures of axis-aligned Gaussians, i.e., Gaussians with diagonal covariance matrices. This bound is tight up to logarithmic factors.

Theorem 1.3 The class of k-mixtures of axis-aligned d-dimensional Gaussians can be learned using Õ(kd/ε^2) samples.

A matching lower bound of Ω̃(kd/ε^2) was proved in [20]. Previously, the best known upper bounds were Õ(kd/ε^4), due to [3], and O((k^4 d^2 + k^3 d^3)/ε^2), based on a VC-dimension bound that we discuss below.

Computational efficiency.
Although our approach for proving sample complexity upper bounds is algorithmic, our focus is not on computational efficiency. The resulting algorithms have nearly optimal sample complexities, but their running times are exponential in the dimension d and the number of mixture components k. More precisely, the running time is 2^{kd^2 polylog(d,k,1/ε)} for mixtures of general Gaussians, and 2^{kd polylog(d,k,1/ε)} for mixtures of axis-aligned Gaussians. The existence of a polynomial-time algorithm for density estimation is unknown even for the class of mixtures of axis-aligned Gaussians; see [10, Question 1.1].

Even for the case of a single Gaussian, the published proofs of the O(d^2/ε^2) bound (of which we are aware) are not algorithmically efficient. Using ideas from our proof of Theorem 1.1, in the full version we show that an algorithmically efficient proof for single Gaussians can be obtained by computing the empirical mean and a careful modification of the sample covariance matrix of O(d^2/ε^2) samples.

1.2 Related work

Distribution learning is a vast topic and many approaches have been considered in the literature; here we only review the approaches most relevant to our problem.

For parametric families of distributions, a common approach is to use the samples to estimate the parameters of the distribution, possibly in a maximum likelihood sense, or possibly aiming to approximate the true parameters. For the specific case of mixtures of Gaussians, there is a substantial theoretical literature on algorithms that approximate the mixing weights, means and covariances; see [13] for a recent survey of this literature. The strictness of this objective cuts both ways. On the one hand, a successful learner uncovers substantial structure of the target distribution. On the other hand, this objective is clearly impossible when the means and covariances are extremely close.
Thus, algorithms for parameter estimation of mixtures necessarily require some "separability" assumptions on the target parameters.

Density estimation has a long history in the statistics literature, where the focus is on the sample complexity question; see [6, 7, 19] for general background. It was first studied in the computational learning theory community under the name PAC learning of distributions by [14], whose focus is on the computational complexity of the learning problem.

For density estimation there are various possible measures of distance between distributions, the most popular ones being the TV distance and the Kullback-Leibler (KL) divergence. Here we focus on the TV distance since it has several appealing properties, such as being a metric and having a natural probabilistic interpretation. In contrast, KL divergence is not even symmetric and can be unbounded even for intuitively close distributions. For a detailed discussion of why TV is a natural choice, see [7, Chapter 5].

A popular method for distribution learning in practice is kernel density estimation (see, e.g., [7, Chapter 9]). The few rigorously proven sample complexity bounds for this method require either smoothness assumptions (e.g., [7, Theorem 9.5]) or boundedness assumptions (e.g., [12, Theorem 2.2]) on the class of densities. The class of Gaussians is not universally Lipschitz or universally bounded, so those results do not apply to the problems we consider.

Another elementary method for density estimation is using histogram estimators (see, e.g., [7, Section 10.3]). Straightforward calculations show that histogram estimators for mixtures of Gaussians result in a sample complexity that is exponential in the dimension. The same is true for estimators based on piecewise polynomials.

The minimum distance estimate [7, Section 6.8] is another approach for deriving sample complexity upper bounds for distribution learning.
This approach is based on uniform convergence theory. In particular, an upper bound for any class of distributions can be achieved by bounding the VC-dimension of an associated set system, called the Yatracos class (see [7, page 58] for the definition). For example, [11] used this approach to bound the sample complexity of learning high-dimensional log-concave distributions. For the class of single Gaussians in d dimensions, this approach leads to the optimal sample complexity upper bound of O(d^2/ε^2). However, for mixtures of Gaussians and axis-aligned Gaussians in R^d, the best known VC-dimension bounds [1, Theorem 8.14] result in loose upper bounds of O(k^4 d^4/ε^2) and O((k^4 d^2 + k^3 d^3)/ε^2), respectively.

Another approach is to first approximate the mixture class using a more manageable class, such as piecewise polynomials, and then study the associated Yatracos class; see, e.g., [5]. However, piecewise polynomials do a poor job of approximating d-dimensional Gaussians, resulting in an exponential dependence on d.

For density estimation of mixtures of Gaussians, the current best sample complexity upper bounds (in terms of k and d) are Õ(kd^2/ε^4) for general Gaussians and Õ(kd/ε^4) for axis-aligned Gaussians, both due to [3]. For the general Gaussian case, their method takes an i.i.d. sample of size Õ(kd^2/ε^2) and partitions this sample in every possible way into k subsets. Based on those partitions, k^{Õ(kd^2/ε^2)} "candidate distributions" are generated. The problem is then reduced to learning with respect to this finite class of candidates.
Their sample complexity has a suboptimal factor of 1/ε^4, of which 1/ε^2 arises in their approach for choosing the best candidate, and another factor of 1/ε^2 is due to the exponent in the number of candidates.

Our approach via compression schemes also ultimately reduces the problem to learning with respect to finite classes. However, our compression technique leads to a more refined bound. In the case of mixtures of Gaussians, one factor of 1/ε^2 is again incurred due to learning with respect to finite classes. The key is that the number of compressed samples has no additional factor of 1/ε^2, so the overall sample complexity bound has only an Õ(1/ε^2) dependence on ε.

As for lower bounds on the sample complexity, much less is known for learning mixtures of Gaussians. The only lower bound of which we are aware is due to [20], which shows a bound of Ω̃(kd/ε^2) for learning mixtures of spherical Gaussians (and hence for general Gaussians as well). This bound is tight for the axis-aligned case, as we show in Theorem 1.3, but loose in the general case, as we show in Theorem 1.2.

1.2.1 Comparison to parameter estimation

In this section we observe that neither our upper bound (Theorem 1.1) nor our lower bound (Theorem 1.2) can directly follow from bounds on parameter estimation for Gaussian distributions. Recall that our sample complexity upper bound in Theorem 1.1 has no dependence on the condition number of the distribution. We now show that, if a learning algorithm with an entrywise approximation guarantee is used to learn the distribution in KL divergence or TV distance, then the approximation parameter must depend on the condition number.
Let κ(Σ) be the condition number of the covariance matrix Σ, i.e., the ratio of its maximum and minimum eigenvalues; refer to Section 2 for other relevant definitions.

Proposition 1.4 Set ε = 2/(κ(Σ)+1). There exist two covariance matrices Σ and Σ̂ that are good entrywise approximations, namely

|Σ_{i,j} − Σ̂_{i,j}| ≤ ε and Σ̂_{i,j} ∈ [1, 1+2ε] · Σ_{i,j} for all i, j,

but the corresponding distributions are as far apart as they can get, i.e.,

TV(N(0, Σ), N(0, Σ̂)) = 1 and KL(N(0, Σ) ‖ N(0, Σ̂)) = ∞.

Thus, given a black-box algorithm that provides an entrywise ε-approximation to the true covariance matrix Σ, if ε ≥ 2/(κ(Σ)+1), it might output Σ̂, which does not approximate Σ in KL divergence or total variation distance.

One might imagine that lower bounds on the sample complexity of parameter estimation readily imply lower bounds on distribution learning. The following proposition shows this is not the case.

Proposition 1.5 For any ε ∈ (0, 1/2] there exist two covariance matrices Σ and Σ̂ such that TV(N(0, Σ), N(0, Σ̂)) ≤ ε, but there exist i, j such that, for any c ≥ 1, Σ̂_{i,j} ∉ [1/c, c] · Σ_{i,j}.

1.3 Our techniques

We introduce a method for learning distributions via a novel form of compression. Given a class of distributions, suppose there is a method for "compressing" information about the true distribution using a mix of samples from that distribution and some additional bits. Further, suppose there exists a fixed (deterministic) decoder for the class, such that given the samples and additional bits, it approximately recovers the original distribution.
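As a toy illustration of this encode/decode idea (our own example, not one from the paper), consider the class {Uniform[0, θ] : θ > 0}. The encoder keeps a single sample, the largest one, and the decoder outputs the uniform distribution whose right endpoint is that sample. Since TV(Uniform[0, θ], Uniform[0, θ̂]) = 1 − θ̂/θ for θ̂ ≤ θ, drawing m = 3/ε samples makes the largest sample exceed (1−ε)θ with probability 1 − (1−ε)^{3/ε} ≥ 2/3:

```python
import random

def tv_uniform(theta, theta_hat):
    # TV distance between Uniform[0, a] and Uniform[0, b] is 1 - min(a,b)/max(a,b).
    lo, hi = min(theta, theta_hat), max(theta, theta_hat)
    return 1.0 - lo / hi

def encode(samples):
    # "Compression": keep one element of the sample (tau = 1) and zero bits (t = 0).
    return max(samples)

def decode(kept):
    # Fixed deterministic decoder: output Uniform[0, kept], represented by its endpoint.
    return kept

random.seed(0)
eps, theta, trials = 0.1, 7.3, 1000
m = int(3 / eps)  # m(eps) = O(1/eps) samples
successes = 0
for _ in range(trials):
    samples = [random.uniform(0, theta) for _ in range(m)]
    theta_hat = decode(encode(samples))
    if tv_uniform(theta, theta_hat) <= eps:
        successes += 1
print(successes / trials)  # comfortably above the required 2/3
```

The Gaussian scheme of Section 4 has the same shape, but the decoder reconstructs the mean and covariance from linear combinations of the kept samples, with the combination coefficients discretized into the bit string.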
In this case, if the size of the compressed set and the number of bits is guaranteed to be small, we show that the sample complexity of learning that class is small as well.

More precisely, we say a class of distributions admits (τ, t, m) compression if there exists a decoder function such that, upon generating m i.i.d. samples from any distribution in the class, we are guaranteed, with reasonable probability, to have a subset of size at most τ of that sample, and a sequence of at most t bits, on which the decoder outputs an approximation to the original distribution. Note that τ, t, and m can be functions of ε, the accuracy parameter.

We prove that compression implies learning. In particular, if a class admits (τ, t, m) compression, then the sample complexity of learning with respect to this class is bounded by Õ(m + (τ + t)/ε^2) (Theorem 3.5).

An attractive property of compression is that it enjoys two closure properties. Specifically, if a base class admits compression, then the class of mixtures of that base class, as well as the class of products of the base class, are compressible (Lemmas 3.6 and 3.7).

Consequently, it suffices to provide a compression scheme for the class of single Gaussian distributions in order to obtain a compression scheme for the class of mixtures of Gaussians (and therefore, to be able to bound their sample complexity). We prove that the class of d-dimensional Gaussian distributions admits (Õ(d), Õ(d^2), Õ(d)) compression (Lemma 4.1). The high-level idea is that by generating Õ(d) samples from a Gaussian, one can get a rough sketch of the geometry of the Gaussian. In particular, the points drawn from a Gaussian concentrate around an ellipsoid centered at the mean and whose principal axes are the eigenvectors of the covariance matrix.
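This geometric picture is easy to check numerically. The sketch below is a sanity check of the concentration phenomenon only, not the paper's encoding procedure; the dimension, covariance, and sample-size constants are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# A fixed full-rank covariance and mean (arbitrary choices for the demo).
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)
mu = rng.normal(size=d)

# O(d log d) samples, mirroring the m = O(d(1 + log d)) regime used in Section 4.1.
m = 40 * d * (1 + int(np.log(d)))
X = rng.multivariate_normal(mu, Sigma, size=m)

mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)

# The sample cloud concentrates around the ellipsoid
# {x : (x - mu)^T Sigma^{-1} (x - mu) ~ d}, so the empirical mean and covariance
# already recover the center and principal axes up to moderate error.
mean_err = np.linalg.norm(mu_hat - mu) / np.linalg.norm(Sigma, 2) ** 0.5
cov_err = np.linalg.norm(Sigma_hat - Sigma, 2) / np.linalg.norm(Sigma, 2)
print(mean_err, cov_err)
```

With these settings both relative errors come out well below 1, which is all the rough sketch of the geometry requires; the work in Section 4.1 is in turning this picture into a short encoding.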
Using ideas from convex geometry and random matrix theory, we show that one can in fact encode the center of the ellipsoid and the principal axes using linear combinations of these samples. We then discretize the coefficients and obtain an approximate encoding.

The above results together imply tight (up to logarithmic factors) upper bounds of Õ(kd^2/ε^2) for mixtures of k Gaussians, and Õ(kd/ε^2) for mixtures of k axis-aligned Gaussians over R^d. The compression framework we introduce is quite flexible, and can be used to prove sample complexity upper bounds for other distribution classes as well. This is left for future work.

In this paper we assume the target belongs to some known class of distributions (this is called the realizable setting in the learning theory literature). In the full version of this paper [2] we relax this requirement and give similar sample complexity bounds for the setting where the target is close (in TV distance) to some distribution in the class (known as agnostic learning).

Lower bound. For proving our lower bound for mixtures of Gaussians, we first prove a lower bound of Ω̃(d^2/ε^2) for learning a single Gaussian. Although the approach is quite intuitive, the details are intricate and much care is required to make the proof formal. The main step is to construct a large family (of size 2^{Ω(d^2)}) of covariance matrices such that the associated Gaussian distributions are well separated in terms of their TV distance, while simultaneously ensuring that the KL divergences between them are small.
Once this is established, we can then apply a generalized version of Fano's inequality to complete the proof.

To construct this family of covariance matrices, we sample 2^{Ω(d^2)} matrices from the following probabilistic process: start with the identity covariance matrix; then choose a random subspace of dimension d/9 and slightly increase the eigenvalues corresponding to this subspace. It is easy to bound the KL divergences between the constructed Gaussians. To lower bound the total variation distance, we show that for every pair of these distributions, there is some subspace for which a vector drawn from one Gaussian will have a slightly larger projection than a vector drawn from the other Gaussian. Quantifying this gap then gives us the desired lower bound on the total variation distance.

Paper outline. We set up our formal framework and notation in Section 2. In Section 3, we define compression schemes for distributions, prove their closure properties, and show their connection with density estimation. Theorem 1.1 and Theorem 1.3 are proved in Section 4. The proof of Theorem 1.2, as well as all other omitted proofs, can be found in the full version [2].

2 Preliminaries

A distribution learning method or density estimation method is an algorithm that takes as input a sequence of i.i.d. samples generated from a distribution g, and outputs (a description of) a distribution ĝ as an estimate of g. We work with continuous distributions in this paper, and so we identify a probability distribution with its probability density function. Let f1 and f2 be two probability distributions defined over R^d, and let L be the set of Lebesgue measurable subsets of R^d. Their total variation (TV) distance is defined by

TV(f1, f2) := sup_{B ∈ L} ∫_B (f1(x) − f2(x)) dx = (1/2) ‖f1 − f2‖_1,

where ‖f‖_1 := ∫_{R^d} |f(x)| dx is the L1 norm of f. The Kullback-Leibler (KL) divergence between f1 and f2 is defined by

KL(f1 ‖ f2) := ∫_{R^d} f1(x) log(f1(x)/f2(x)) dx.

In the following definitions, F is a class of probability distributions, and g is a distribution (not necessarily in F).

Definition 2.1 (ε-approximation) A distribution ĝ is an ε-approximation for g if ‖ĝ − g‖_1 ≤ ε.

Definition 2.2 (PAC-learning distributions) A distribution learning method is called a (realizable) PAC-learner for F with sample complexity m_F(ε, δ) if, for all distributions g ∈ F and all ε, δ ∈ (0, 1), given ε, δ, and an i.i.d. sample of size m_F(ε, δ) from g, with probability at least 1 − δ (over the samples) the method outputs an ε-approximation of g.

Let Δ_n := { (w1, ..., wn) : wi ≥ 0, ∑ wi = 1 } denote the n-dimensional simplex.

Definition 2.3 (k-mix(F)) Let F be a class of probability distributions. Then the class of k-mixtures of F, written k-mix(F), is defined as

k-mix(F) := { ∑_{i=1}^k wi fi : (w1, ..., wk) ∈ Δ_k, f1, ..., fk ∈ F }.

Let d denote the dimension. A Gaussian distribution with mean µ ∈ R^d and covariance matrix Σ ∈ R^{d×d} is denoted by N(µ, Σ). If Σ is a diagonal matrix, then N(µ, Σ) is called an axis-aligned Gaussian. For a distribution g, we write X ∼ g to mean that X is a random variable with distribution g, and we write S ∼ g^m to mean that S is an i.i.d.
sample of size m generated from g. We will use ‖v‖ or ‖v‖_2 to denote the Euclidean norm of a vector v, ‖A‖ or ‖A‖_2 to denote the operator norm of a matrix A, and ‖A‖_F := √(Tr(AᵀA)) to denote the Frobenius norm of a matrix A. For x ∈ R, we will write (x)_+ := max{0, x}. All logarithms are in the natural base.

3 Compression schemes and their connection with learning

Let F be a class of distributions over a domain Z.

Definition 3.1 (distribution decoder) A distribution decoder for F is a deterministic function J : ∪_{n=0}^∞ Z^n × ∪_{n=0}^∞ {0,1}^n → F, which takes a finite sequence of elements of Z and a finite sequence of bits, and outputs a member of F.

Definition 3.2 (distribution compression schemes) Let τ, t, m : (0, 1) → Z_{≥0} be functions. We say F admits (τ, t, m) compression if there exists a decoder J for F such that for any distribution g ∈ F, the following holds: for any ε ∈ (0, 1), if a sample S is drawn from g^{m(ε)}, then with probability at least 2/3, there exists a sequence L of at most τ(ε) elements of S, and a sequence B of at most t(ε) bits, such that ‖J(L, B) − g‖_1 ≤ ε.

Note that S and L are sequences rather than sets; in particular, they can contain repetitions. Also note that in this definition, m(ε) is a lower bound on the number of samples needed, whereas τ(ε) and t(ε) are upper bounds on the size of the compressed sequence and the number of bits.

Essentially, the definition asserts that with reasonable probability, there is a (short) sequence consisting of elements of S and some (small number of) additional bits, from which g can be approximately reconstructed.
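The "sequence of at most t(ε) bits" typically arises from discretizing real-valued parameters, as we do for the coefficients in Section 4. A minimal sketch of such a discretization (the helper names are ours, for illustration): encoding a parameter known to lie in [lo, hi] up to additive error ε costs about log_2((hi − lo)/ε) bits.

```python
import math

def quantize(x, lo, hi, eps):
    """Round x in [lo, hi] to a fixed grid of spacing 2*eps.
    Returns (bits_used, index); the decoder maps the index back to a grid point."""
    levels = math.ceil((hi - lo) / (2 * eps)) + 1
    idx = min(levels - 1, round((x - lo) / (2 * eps)))
    return math.ceil(math.log2(levels)), idx

def dequantize(idx, lo, eps):
    # Deterministic decoder for the grid: recover the nearest grid point.
    return lo + 2 * eps * idx

bits, idx = quantize(0.7314, -1.0, 1.0, 1e-3)
x_hat = dequantize(idx, -1.0, 1e-3)
assert abs(x_hat - 0.7314) <= 1e-3
print(bits, x_hat)  # 10 bits suffice for accuracy 1e-3 on [-1, 1]
```

Both sides share the grid, so only the index needs to be transmitted; this is the role played by the bit string B in Definition 3.2.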
We say that the distribution g is "encoded" with L and B, and in general we would like to have a compression scheme of small size.

Remark 3.3 In the definition above we required the probability of existence of L and B to be at least 2/3, but one can boost this probability to 1 − δ by generating a sample of size m(ε) log(1/δ).

Next we show that if a class of distributions can be compressed, then it can be learned; thus we build the connection between compression and learning. We will need the following useful result about PAC-learning of finite classes of distributions, which immediately follows from [7, Theorem 6.3] and a standard Chernoff bound. It states that a finite class of size M can be learned using O(log(M/δ)/ε^2) samples. Denote by [M] the set {1, 2, ..., M}. Throughout the paper, a/bc always means a/(bc).

Theorem 3.4 ([7]) There exists a deterministic algorithm that, given candidate distributions f1, ..., fM, a parameter ε > 0, and log(3M^2/δ)/2ε^2 i.i.d. samples from an unknown distribution g, outputs an index j ∈ [M] such that

‖fj − g‖_1 ≤ 3 min_{i∈[M]} ‖fi − g‖_1 + 4ε,

with probability at least 1 − δ/3.

The proof of the following theorem appears in the full version [2].

Theorem 3.5 (compressibility implies learnability) Suppose F admits (τ, t, m) compression. Let τ'(ε) := τ(ε/6) + t(ε/6). Then F can be learned using

O( m(ε/6) log(1/δ) + (τ'(ε) log(m(ε/6) log(1/δ)) + log(1/δ))/ε^2 ) = Õ( m(ε/6) + τ'(ε)/ε^2 )

samples.

We next prove two closure properties of compression schemes. First, Lemma 3.6 below states that if a class F of distributions can be compressed, then the class of distributions that are formed by taking products of members of F can also be compressed. If p1, ..., pd are distributions over domains Z1, ..., Zd, then ∏_{i=1}^d pi denotes the standard product distribution over ∏_{i=1}^d Zi. For a class F of distributions, define F^d := { ∏_{i=1}^d pi : p1, ..., pd ∈ F }.

Lemma 3.6 (compressing product distributions) If F admits (τ(ε), t(ε), m(ε)) compression, then F^d admits (dτ(ε/d), dt(ε/d), m(ε/d) log(3d)) compression.

Our next lemma states that if a class F of distributions can be compressed, then the class of distributions that are formed by taking mixtures of members of F can also be compressed.

Lemma 3.7 (compressing mixtures) If F admits (τ(ε), t(ε), m(ε)) compression, then k-mix(F) admits (kτ(ε/3), kt(ε/3) + k log_2(4k/ε), 48 m(ε/3) k log(6k)/ε) compression.

4 Upper bound: learning mixtures of Gaussians by compression schemes

In this section we prove an upper bound of Õ(kd^2/ε^2) for the sample complexity of learning mixtures of k Gaussians in d dimensions, and an upper bound of Õ(kd/ε^2) for the sample complexity of learning mixtures of k axis-aligned Gaussians. The heart of the proof is to show that Gaussians admit compression schemes in any dimension.

Lemma 4.1 For any positive integer d, the class of d-dimensional Gaussians admits an (O(d log(2d)), O(d^2 log(2d) log(d/ε)), O(d log(2d))) compression scheme.

Remark 4.2 In the special case d = 1, there also exists a (2, 0, O(1/ε)) (i.e.
constant size) compression scheme: if we draw C/ε samples from N(µ, σ^2), for a sufficiently large constant C, then with probability at least 2/3 there exist two points in the sample such that one of them is within distance σε/2 of µ − σ and the other is within distance σε/2 of µ + σ. Given these two points, the decoder can estimate µ and σ up to additive precision εσ/2, which results in an ε-approximation of N(µ, σ^2) in total variation distance. Remarkably, this compression scheme has constant size, as the value of τ + t is independent of ε (unlike Lemma 4.1). This scheme can be used instead of Lemma 4.1 in the proof of Theorem 1.3, although it would not improve the sample complexity bound asymptotically.

Proof of Theorem 1.1. Combining Lemma 4.1 and Lemma 3.7 implies that the class of k-mixtures of d-dimensional Gaussians admits an

( O(kd log(2d)), O(kd^2 log(2d) log(d/ε) + k log(k/ε)), O(dk log k log(2d)/ε) )

compression scheme. Applying Theorem 3.5 with m(ε) = Õ(dk/ε) and τ'(ε) = Õ(d^2 k) shows that the sample complexity of learning this class is Õ(kd^2/ε^2). This proves Theorem 1.1. □

Proof of Theorem 1.3. Let G denote the class of 1-dimensional Gaussian distributions. By Lemma 4.1, G admits an (O(1), O(log(1/ε)), O(1)) compression scheme. Combining Lemma 3.6 and Lemma 3.7 gives that the class k-mix(G^d) admits (O(kd), O(kd log(d/ε) + k log(k/ε)), O(k log(k) log(3d)/ε)) compression.
Applying Theorem 3.5 implies that the class of k-mixtures of axis-aligned Gaussians in R^d can be learned using Õ(kd/ε^2) samples. □

4.1 Proof of Lemma 4.1

Let N(µ, Σ) denote the target distribution, which we are to encode.

Remark 4.3 The case of rank-deficient Σ can easily be reduced to the case of full-rank Σ. If the rank of Σ is r < d, then any X ∼ N(µ, Σ) lies in some affine subspace S of dimension r. With high probability, the first d samples from N(µ, Σ) uniquely identify S. We encode S using these samples, and for the rest of the process we work in this affine subspace. Hence, we may assume Σ has full rank d.

To prove Lemma 4.1, we will need the following result from the random matrix theory literature [cf. 16, Corollary 4.1]. Let S^{d−1} := { y ∈ R^d : ‖y‖ = 1 } and B_2^d := { y ∈ R^d : ‖y‖ ≤ 1 }. We use the notation (1/20) B_2^d to denote the set of d-dimensional vectors with Euclidean norm at most 1/20. The convex hull of a set T is denoted by conv(T).

Lemma 4.4 Let q1, ..., qm be i.i.d. samples from N(0, I_d), and let T := { ±qi : ‖qi‖ ≤ 4√d }. Then for a large enough constant C > 0, if m ≥ Cd(1 + log d), then

Pr[ (1/20) B_2^d ⊄ conv(T) ] ≤ 1/6.

Note that the lemma can be improved to require only m ≥ Cd samples [see 16, Corollary 4.1], but this would not improve our final bound.

The remainder of the proof amounts to showing that with only a small number of additional bits, we can approximate the mean and each eigenvector of the covariance matrix as a linear combination of a subset of the drawn samples.

Suppose Σ = ∑_{i=1}^d vi viᵀ, where the vi vectors are orthogonal. Let Ψ := ∑_{i=1}^d vi viᵀ/‖vi‖. Note that both Σ and Ψ are positive definite, and that Σ = Ψ^2. Moreover, it is easy to see that Σ^{−1} = ∑_{i=1}^d vi viᵀ/‖vi‖^4 and Ψ^{−1} = ∑_{i=1}^d vi viᵀ/‖vi‖^3.

Lemma 4.5 Let C > 0 be a sufficiently large constant. Given m = 2Cd(1 + log d) samples S from N(µ, Σ), with probability at least 2/3, one can encode vectors v̂1, ..., v̂d, µ̂ ∈ R^d satisfying

‖Ψ^{−1}(v̂j − vj)‖ ≤ ε/24d^2 for all j ∈ [d], and ‖Ψ^{−1}(µ̂ − µ)‖ ≤ ε/2,

using O(d^2 log(2d) log(d/ε)) bits and the points in S.

Lemma 4.1 now follows immediately from the following lemma.

Lemma 4.6 Suppose Σ = Ψ^2 = ∑_{i∈[d]} vi viᵀ, where the vi are orthogonal and Σ is full rank, and that ‖Ψ^{−1}(µ̂ − µ)‖ ≤ ζ, and that ‖Ψ^{−1}(v̂j − vj)‖ ≤ ρ ≤ 1 holds for all j ∈ [d]. Then

TV( N(µ, ∑_{i∈[d]} vi viᵀ), N(µ̂, ∑_{i∈[d]} v̂i v̂iᵀ) ) ≤ √(9 d^3 ρ^2 + ζ^2/2).

5 Discussion

A central open problem in distribution learning and density estimation is characterizing the sample complexity of learning a distribution class. An insight from supervised learning theory is that the sample complexity of learning a class (of concepts, functions, or distributions) is typically proportional to (some notion of) the intrinsic dimension of the class divided by ε^2, where ε is the error tolerance. For the case of agnostic binary classification, the intrinsic dimension is captured by the VC-dimension of the concept class (see [21, 4]).
For the case of distribution learning with respect to 'natural' parametric classes, we expect this dimension to be equal to the number of parameters. This is indeed true for the class of Gaussians (which have $d^2$ parameters) and axis-aligned Gaussians (which have $d$ parameters), and we showed in this paper that it holds for their mixtures as well (which have $kd^2$ and $kd$ parameters, respectively).

In binary classification, the combinatorial notion of Littlestone-Warmuth compression has been shown to be sufficient [15] and necessary [18] for learning. In this work, we showed that the new but related notion of distribution compression is sufficient for distribution learning. Whether the existence of compression schemes is necessary for learning an arbitrary class of distributions remains an intriguing open problem.

It is worth mentioning that while it may at first seem that the VC-dimension of the Yatracos set associated with a class of distributions characterizes its sample complexity, it is not hard to come up with examples where this VC-dimension is infinite while the class can be learned with finitely many samples. Covering numbers do not characterize the sample complexity either: for instance, the class of Gaussians does not have a finite covering number in the TV metric; nevertheless, it is learnable with finitely many samples.

A concept related to compression is that of core-sets. In a sense, core-sets can be viewed as a special case of compression, where the decoder is required to be the empirical error minimizer. See [17] for an application of core-sets to maximum likelihood estimation.

Acknowledgments

We thank Yaoliang Yu for pointing out a mistake in an earlier version of this paper, and Luc Devroye for fruitful discussions. Abbas Mehrabian was supported by a CRM-ISM postdoctoral fellowship and an IVADO-Apogée-CFREF postdoctoral fellowship. Nicholas Harvey was supported by an NSERC Discovery Grant.
Christopher Liaw was supported by an NSERC graduate award. Yaniv Plan was supported by NSERC grant 22R23068.

Addendum

The lower bound of Theorem 1.2 was recently improved in a subsequent work [8] from $\Omega\bigl(kd^2/(\varepsilon^2 \log^3(1/\varepsilon))\bigr)$ to $\Omega\bigl(kd^2/(\varepsilon^2 \log(1/\varepsilon))\bigr)$ using a different construction.

References

[1] Martin Anthony and Peter Bartlett. Neural network learning: theoretical foundations. Cambridge University Press, 1999.

[2] Hassan Ashtiani, Shai Ben-David, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of Gaussian mixtures via compression schemes. arXiv preprint. URL https://arxiv.org/abs/1710.05209.

[3] Hassan Ashtiani, Shai Ben-David, and Abbas Mehrabian. Sample-efficient learning of mixtures. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI'18, pages 2679–2686. AAAI Publications, 2018. URL https://arxiv.org/abs/1706.01596.

[4] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929–965, October 1989. doi: 10.1145/76359.76371.

[5] Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the Forty-sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 604–613, New York, NY, USA, 2014. ACM. doi: 10.1145/2591796.2591848.

[6] Luc Devroye. A course in density estimation, volume 14 of Progress in Probability and Statistics. Birkhäuser Boston, Inc., Boston, MA, 1987.

[7] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001.
doi: 10.1007/978-1-4613-0125-7.

[8] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The minimax learning rate of normal and Ising undirected graphical models. arXiv preprint arXiv:1806.06887, 2018.

[9] Ilias Diakonikolas. Learning structured distributions. In Peter Bühlmann, Petros Drineas, Michael Kane, and Mark van der Laan, editors, Handbook of Big Data, chapter 15, pages 267–283. Chapman and Hall/CRC, 2016. URL http://www.crcnetbase.com/doi/pdfplus/10.1201/b19567-21.

[10] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84, Oct 2017. doi: 10.1109/FOCS.2017.16. Available on arXiv:1611.03473 [cs.LG].

[11] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning multivariate log-concave distributions. In Proceedings of Machine Learning Research, volume 65 of COLT'17, pages 1–17, 2017. URL http://proceedings.mlr.press/v65/diakonikolas17a/diakonikolas17a.pdf.

[12] Ildar Ibragimov. Estimation of analytic functions. In State of the art in probability and statistics (Leiden, 1999), volume 36 of IMS Lecture Notes Monogr. Ser., pages 359–383. Inst. Math. Statist., Beachwood, OH, 2001. doi: 10.1214/lnms/1215090078.

[13] Adam Kalai, Ankur Moitra, and Gregory Valiant. Disentangling Gaussians. Communications of the ACM, 55(2), 2012.

[14] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing, STOC '94, pages 273–282, New York, NY, USA, 1994. ACM. doi: 10.1145/195058.195155.
[15] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Technical report, University of California, Santa Cruz, 1986.

[16] Alexander E. Litvak, Alain Pajor, Mark Rudelson, and Nicole Tomczak-Jaegermann. Smallest singular value of random matrices and geometry of random polytopes. Adv. Math., 195(2):491–523, 2005. URL https://doi.org/10.1016/j.aim.2004.08.004.

[17] Mario Lucic, Matthew Faulkner, Andreas Krause, and Dan Feldman. Training Gaussian mixture models at scale via coresets. Journal of Machine Learning Research, 18(160):1–25, 2018. URL http://jmlr.org/papers/v18/15-506.html.

[18] Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. Journal of the ACM (JACM), 63(3):21, 2016.

[19] Bernard W. Silverman. Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1986.

[20] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical Gaussian mixtures. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1395–1403. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5251-near-optimal-sample-estimators-for-spherical-gaussian-mixtures.pdf.

[21] Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
doi: 10.1137/1116025.