{"title": "The Sample Complexity of Semi-Supervised Learning with Nonparametric Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 9321, "page_last": 9332, "abstract": "We study the sample complexity of semi-supervised learning (SSL) and introduce new assumptions based on the mismatch between a mixture model learned from unlabeled data and the true mixture model induced by the (unknown) class conditional distributions. Under these assumptions, we establish an $\\Omega(K\\log K)$ labeled sample complexity bound without imposing parametric assumptions, where $K$ is the number of classes. Our results suggest that even in nonparametric settings it is possible to learn a near-optimal classifier using only a few labeled samples. Unlike previous theoretical work which focuses on binary classification, we consider general multiclass classification ($K>2$), which requires solving a difficult permutation learning problem. This permutation defines a classifier whose classification error is controlled by the Wasserstein distance between mixing measures, and we provide finite-sample results characterizing the behaviour of the excess risk of this classifier. Finally, we describe three algorithms for computing these estimators based on a connection to bipartite graph matching, and perform experiments to illustrate the superiority of the MLE over the majority vote estimator.", "full_text": "The Sample Complexity of Semi-Supervised Learning\n\nwith Nonparametric Mixture Models\n\nChen Dan1, Liu Leqi1, Bryon Aragam1, Pradeep Ravikumar1, Eric P. Xing1,2\n\n1Carnegie Mellon University\n\n2Petuum Inc.\n\n{cdan,leqil,naragam,pradeepr,epxing}@cs.cmu.edu\n\nAbstract\n\nWe study the sample complexity of semi-supervised learning (SSL) and introduce\nnew assumptions based on the mismatch between a mixture model learned from\nunlabeled data and the true mixture model induced by the (unknown) class condi-\ntional distributions. 
Under these assumptions, we establish an Ω(K log K) labeled sample complexity bound without imposing parametric assumptions, where K is the number of classes. Our results suggest that even in nonparametric settings it is possible to learn a near-optimal classifier using only a few labeled samples. Unlike previous theoretical work, which focuses on binary classification, we consider general multiclass classification (K > 2), which requires solving a difficult permutation learning problem. This permutation defines a classifier whose classification error is controlled by the Wasserstein distance between mixing measures, and we provide finite-sample results characterizing the behaviour of the excess risk of this classifier. Finally, we describe three algorithms for computing these estimators based on a connection to bipartite graph matching, and perform experiments to illustrate the superiority of the MLE over the majority vote estimator.

1 Introduction

With the rapid growth of modern datasets and increasingly passive collection of data, labeled data is becoming more and more expensive to obtain, while unlabeled data remains cheap and plentiful in many applications. Leveraging unlabeled data to improve the predictions of a machine learning system is the problem of semi-supervised learning (SSL), which has been the source of many empirical successes [1–3] and theoretical inquiries [4–16]. Commonly studied assumptions include identifiability of the class conditional distributions [5, 6], the cluster assumption [10, 11] and the manifold assumption [9, 12, 13, 15]. In this work, we propose a new type of assumption that loosely combines ideas from both the identifiability and cluster assumption perspectives. Importantly, we consider the general multiclass (K > 2) scenario, which introduces significant complications. 
In this setting, we study the sample complexity and rates of convergence for SSL and propose simple algorithms to implement the proposed estimators.

The basic question behind SSL is to connect the marginal distribution over the unlabeled data P(X) to the regression function P(Y | X). We consider multiclass classification, so that Y ∈ Y = {α_1, . . . , α_K} for some K ≥ 2. In order to motivate our perspective, let F* denote the marginal density of the unlabeled samples and suppose that F* can be written as a mixture model

$$F^*(x) = \sum_{b=1}^{K} \lambda_b f_b(x). \qquad (1)$$

Assuming the unlabeled data can be used to learn the mixture model (1), the question becomes: when is this mixture model useful for predicting Y? Figure 1 illustrates an idealized example.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(a) Decision boundaries learned from the unlabeled data. (b) Learned decision boundaries correspond exactly to true boundaries. (c) Learned decision boundaries are only approximately correct.

Figure 1: Illustration of the main idea for K = 4. The decision boundaries learned from the unlabeled data (cf. (1)) are depicted by the dashed black lines and the true decision boundaries are depicted by the solid red lines. (a) The unlabeled data is used to learn some approximate decision boundaries via the mixture model Λ. Even with these decision boundaries, it is not known which class each region corresponds to. The labeled data is used to learn this assignment. (b) Previous work assumes that the true and learned decision boundaries are the same. 
(c) In the current work, we assume that the true decision boundaries are unknown, but that it is possible to learn a mixture model that approximates the true boundaries using unlabeled data.

Clearly, we can always write $F^*(x) = \sum_{k=1}^{K} \lambda_k^* f_k^*(x)$, where f*_k is the density of the kth class conditional P(X | Y = α_k) and λ*_k = P(Y = α_k). This will be called the true mixture model in the sequel. In this work, we do not assume that (1) is the true mixture model, i.e. we do not assume that f_b corresponds to one of the f*_k, nor do we assume that λ_b corresponds to some λ*_k. In other words, we allow the true mixture model to be nonidentifiable and consider the case where some misspecified mixture model is learned from the unlabeled data. We assume that the number of mixture components K is the same as the number of classes, which is always known, although extending our analysis to overfitted mixtures is straightforward.

In an early series of papers, Castelli and Cover [5, 6] considered this question under the following assumptions: (a) For each b there is some k such that f_b = f*_k and λ_b = λ*_k, (b) F* is known, and (c) K = 2. Thus, they assumed that the true components and weights were known, but it was unknown which class each mixture component represents. In Figure 1, this corresponds to the case (b) where the true and learned decision boundaries are identical. Given labeled data, the special case K = 2 reduces to a simple hypothesis testing problem which can be tackled using the Neyman-Pearson lemma. In this paper, we are interested in settings where each of these three assumptions fails:

(a) What if the class conditionals f*_k are unknown? 
Although we can always write $F^*(x) = \sum_k \lambda_k^* f_k^*(x)$, it is generally not the case that this mixture model is learnable from unlabeled data alone. In practice, what is learned will be different from this ideal case, but the hope is that it will still be useful. In this case, the argument in Castelli and Cover [5] breaks down. Motivated by recent work on nonparametric mixture models [17], we study the general case where the true mixture model is not known or even learnable from unlabeled data.

(b) What if F* is unknown? In a follow-up paper, Castelli and Cover [6] studied the case where F* is unknown by assuming that K = 2 and the class conditional densities {f*_1, f*_2} are known up to a permutation. In this setting, the unlabeled data is used to ascertain the relative mixing proportions, but estimation error in the densities is not considered. We are interested in the general case in which a finite amount of unlabeled data is used to estimate both the mixture weights and densities.

(c) What if K > 2? If K > 2, once again the argument in Castelli and Cover [5] no longer applies, and we are faced with a challenging permutation learning problem. Permutation learning problems have gained notoriety recently owing to their applicability to a wide variety of problems, including statistical matching and seriation [18–20], graphical models [21, 22], and regression [23, 24], so these results may be of independent interest.

With these goals in mind, we study the MLE and majority voting (MV) rules for learning the unknown class assignment introduced in the next section. 
Our assumptions for MV are closely related to recent work based on the so-called cluster assumption [4, 10, 11, 25]; see Section 4.2 for more details.

Contributions  A key aspect of our analysis is to establish conditions that connect the mixture model (1) to the true mixture model. Under these conditions we prove nonasymptotic rates of convergence for learning the class assignment (Figure 1a) from labeled data when K > 2, establish an Ω(K log K) sample complexity for learning this assignment, and prove that the resulting classifier converges to the Bayes classifier. We then propose simple algorithms based on a connection to bipartite graph matching, and illustrate their performance on real and simulated data.

2 SSL as permutation learning

In this section, we formalize the ideas from the introduction using the language of mixing measures. We adopt this language for several reasons: 1) It makes it easy to refer to the parameters in the mixture model (1) by wrapping everything into a single, coherent statistical parameter Λ; 2) We can talk about convergence of these parameters via the Wasserstein metric; and 3) It simplifies discussions of identifiability in mixture models. Before going into technical details, we summarize the main idea as follows (see also Figure 1):

1. Use the unlabeled data to learn a K-component mixture model that approximates F*, which is represented by the mixing measure Λ defined below;
2. Use the labeled data to determine the correct assignment π of classes α_k to the decision regions D_b(Λ) defined by Λ;
3. 
Based on the pair (Λ, π), define a classifier g_{Λ,π} : X → Y by (3) below.

Mixing measures and mixture models  For concreteness, we will work on X = R^d; however, our results generalize naturally to any space X with a dominating measure and well-defined density functions. Let P = {f ∈ L¹(R^d) : ∫ f dx = 1} be the set of probability density functions on R^d, and let M_K(P) denote the space of probability measures over P with precisely K atoms. An element Λ ∈ M_K(P) is called a (finite) mixing measure, and can be thought of as a convenient mathematical device for encoding the weights {λ_k} and the densities {f_k} into a single statistical parameter. By integrating against this measure, we obtain a new probability density, which is denoted by

$$m(\Lambda) := \sum_{b=1}^{K} \lambda_b f_b(x), \qquad (2)$$

where f_b is a particular enumeration of the densities in the support of Λ and λ_b is the probability of the bth density. Thus, (1) can be written as F* = m(Λ). By metrizing P via the total variation distance $d_{TV}(f, g) = \frac{1}{2}\int |f - g|\, dx$, the distance between two finite K-mixtures can be computed via the Wasserstein metric [26]:

$$W_1(\Lambda, \Lambda') = \inf\Big\{ \sum_{i,j} \sigma_{ij}\, d_{TV}(f_i, f'_j) : 0 \le \sigma_{ij} \le 1,\ \sum_{i,j} \sigma_{ij} = 1,\ \sum_i \sigma_{ij} = \lambda'_j,\ \sum_j \sigma_{ij} = \lambda_i \Big\}.$$

Decision regions, assignments, and classifiers  Any mixing measure Λ defines K decision regions given by D_b = D_b(Λ) := {x ∈ X : λ_b f_b(x) > λ_j f_j(x) ∀ j ≠ b} (Figure 1). This allows us to assign an index from 1, . . . , K to any x ∈ X, and hence defines a function ǧ_Λ : X → [K] := {1, . . . , K}. This function is not a genuine classifier, however, since its output is an uninformative index b ∈ [K] as opposed to a proper class label α_k ∈ Y. The key point is that even if we know Λ, we still must identify each label α_k with a decision region D_b(Λ), i.e. we must learn a permutation π : Y → [K]. With some abuse of notation, we will sometimes write π(k) instead of π(α_k) for any permutation π. Together, any pair (Λ, π) defines a classifier g_{Λ,π} : X → Y by

$$g_{\Lambda,\pi}(x) = \pi^{-1}(\check g_\Lambda(x)) = \sum_{b=1}^{K} \pi^{-1}(b)\, \mathbf{1}(x \in D_b(\Lambda)). \qquad (3)$$

This mixing measure perspective helps to clarify the role of the unknown permutation in supervised learning: The unlabeled data is enough to learn Λ (and hence the decision regions D_b(Λ)); however, labeled data are necessary to learn an assignment π between classes and decision regions. This formulates SSL as a coupled mixture modeling and permutation learning problem: Given unlabeled and labeled data, learn a pair (Λ̂, π̂) which yields a classifier ĝ = g_{Λ̂,π̂}.

Bayes classifiers and the true permutation  The true permutation π* : Y → [K] is defined to be the permutation that assigns each class α_k to the correct decision region D*_b = D_b(Λ*) (Figure 1). As usual, the target classifier is the Bayes classifier, which can also be written in the form (3): Let Λ* denote the true mixing measure that assigns probability λ*_k to the density f*_k, and note that F* = m(Λ*), which is the true mixture model defined previously. Then it is easy to check that g_{Λ*,π*} is the Bayes classifier.

Identifiability  Although the true mixing measure Λ* may not be identifiable from F*, some other mixture model may be. In other words, although it may not be possible to learn Λ* from unlabeled data, it may be possible to learn some other mixing measure Λ ≠ Λ* such that m(Λ) = F* = m(Λ*) (Figure 1c). This essentially amounts to a violation of the cluster assumption: High-density clusters are identifiable, but in practice the true class labels may not respect the cluster boundaries. Assumptions that guarantee a mixture model is identifiable are well-studied [27–29], including both parametric [30] and nonparametric [17, 31, 32] assumptions. In particular, Aragam et al. 
[17] have proved general conditions under which mixture models with arbitrary, overlapping nonparametric components are identifiable and estimable, including extreme cases where each component f_k has the same mean. Since this problem is well-studied, we focus hereafter on the problem of learning the permutation π*. Thus, in the sequel we will assume that we are given an arbitrary mixing measure Λ which will be used to estimate π*. We do not assume that Λ = Λ* or even that these mixing measures are close: The idea is to elicit conditions on Λ that ensure consistent estimation of π*. This makes our analysis applicable to a wide variety of methods, including heuristic approaches for learning Λ from the unlabeled data, which are common in the literature on nonparametric mixtures.

3 Two estimators

Assume we are given a mixing measure Λ along with the labeled samples (X^{(i)}, Y^{(i)}) ∈ X × Y. Two natural estimators of π* are the MLE and majority vote. Although both estimators depend on Λ, this dependence will be suppressed for brevity.

Maximum likelihood  Define ℓ(π; Λ, X, Y) := log λ_{π(Y)} f_{π(Y)}(X). We will work with the following misspecified MLE (i.e. Λ ≠ Λ*):

$$\hat\pi_{\mathrm{MLE}} \in \arg\max_{\pi} \ell_n(\pi; \Lambda), \qquad \ell_n(\pi; \Lambda) := \frac{1}{n} \sum_{i=1}^{n} \ell(\pi; \Lambda, X^{(i)}, Y^{(i)}). \qquad (4)$$

When Λ = Λ*, this is the correctly specified MLE of the unknown permutation π*; however, the definition above allows for the general misspecified case Λ ≠ Λ*.

Majority vote  The majority vote estimator (MV) is given by a simple majority vote over each decision region. Formally, we define a permutation π̂_MV as follows: The inverse assignment π̂_MV^{-1} : [K] → Y is defined by

$$\hat\pi_{\mathrm{MV}}^{-1}(b) = \arg\max_{\alpha \in \mathcal{Y}} \sum_{i=1}^{n} \mathbf{1}(Y^{(i)} = \alpha,\ X^{(i)} \in D_b(\Lambda)). \qquad (5)$$

If there is no majority class in a given decision region, we consider this a failure of MV and treat it as undefined. Note that when K = 2, the MV classifier defined by (3) with π = π̂_MV is essentially the same as the three-step procedure described in Rigollet [10], which focuses on bounding the excess risk under the cluster assumption. In contrast, we are interested in the consistency of the unknown permutation π* when K > 2, which is a more difficult problem.

4 Statistical results

Our main results establish rates of convergence for both the MLE and MV introduced in the previous section. We will use the notation E*h(X, Y) to denote the expectation with respect to the true distribution (X, Y) ~ P(X, Y). Without loss of generality, we assume that π*(α_k) = k and f_b = f*_b + h_b for some h_b. Then π̂ = π* if and only if π̂(α_k) = k, which helps to simplify the notation in the sequel.

4.1 Maximum likelihood

Given Λ, the notation E*ℓ(π; Λ, X, Y) = E* log λ_{π(Y)} f_{π(Y)}(X) denotes the expectation of the misspecified log-likelihood with respect to the true distribution. Define the "gap"

$$\Delta_{\mathrm{MLE}}(\Lambda) := E^*\ell(\pi^*; \Lambda, X, Y) - \max_{\pi \ne \pi^*} E^*\ell(\pi; \Lambda, X, Y). \qquad (6)$$

For any function a : R → R, define the usual Fenchel-Legendre dual a*(t) = sup_{s∈R}(st − a(s)). Let U_b = log λ_b f_b(X) and β_b(s) = log E* exp(sU_b). Finally, let n_k := |{i : Y^{(i)} = α_k}| denote the number of labeled samples with the kth label.

Theorem 4.1. Let π̂_MLE be the MLE defined in (4). If Δ_MLE := Δ_MLE(Λ) > 0 then

$$P(\hat\pi_{\mathrm{MLE}} = \pi^*) \ge 1 - 2K^2 \exp\Big( -\inf_k n_k \cdot \inf_b \beta_b^*(\Delta_{\mathrm{MLE}}/3) \Big).$$

The condition Δ_MLE(Λ) > 0 is important to ensure that π* is learnable from Λ, and the size of Δ_MLE(Λ) quantifies "how easy" it is to learn π* given Λ. A bigger gap implies an easier problem. Thus, it is of interest to understand this quantity better. The following proposition shows that when Λ = Λ*, this gap is always nonnegative:

Proposition 4.2. For any permutation π and any Λ,

$$E^*\ell(\pi; \Lambda, X, Y) \le E^*\ell(\pi^*; \Lambda^*, X, Y),$$

and hence Δ_MLE(Λ*) ≥ 0.

In general, assuming Δ_MLE(Λ) > 0 is a weak assumption, but bounds on Δ_MLE(Λ) are difficult to obtain without making additional assumptions on the densities f_k and f*_k. A brief discussion of this can be found in Appendix B; we leave it to future work to study this quantity more carefully.

4.2 Majority vote

For any Λ, define m_b := |{i : X^{(i)} ∈ D_b(Λ)}| and $\chi_{bj}(\Lambda) := \frac{1}{m_b} \sum_{i=1}^{n} \mathbf{1}(Y^{(i)} = j,\ X^{(i)} \in D_b(\Lambda))$, where 1(·) is the indicator function. Similar to the MLE, our results for MV depend crucially on a "gap" quantity, given by

$$\Delta_{\mathrm{MV}}(\Lambda) := \inf_b \Big\{ E^*\chi_{bb}(\Lambda) - \max_{j \ne b} E^*\chi_{bj}(\Lambda) \Big\}. \qquad (7)$$

This quantity essentially measures how much more likely it is to sample the bth label in the bth decision region than any other label, averaged over the entire region. 
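For concreteness, the two estimators (4) and (5) can be sketched in a few lines. This is a minimal sketch under simplifying assumptions: a one-dimensional, two-component Gaussian mixture with known component densities, labels taken to be the integers 0, …, K−1, and permutations represented as tuples; the function and variable names are illustrative, not from the paper's code.

```python
import math
from itertools import permutations

def gauss_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# A mixing measure Lambda: weights lambda_b and unit-variance Gaussian densities f_b.
weights = [0.5, 0.5]
means = [-2.0, 2.0]
K = len(weights)

def component_density(b, x):
    # lambda_b * f_b(x)
    return weights[b] * gauss_pdf(x, means[b])

def decision_region(x):
    # \check{g}_Lambda(x): index of the component maximizing lambda_b f_b(x).
    return max(range(K), key=lambda b: component_density(b, x))

# Labeled samples (X_i, Y_i); here the true permutation is the identity.
data = [(-2.1, 0), (-1.8, 0), (2.2, 1), (1.7, 1)]

def mle_permutation(data):
    # Misspecified MLE (4): maximize sum_i log(lambda_{pi(Y_i)} f_{pi(Y_i)}(X_i)).
    def loglik(pi):
        return sum(math.log(component_density(pi[y], x)) for x, y in data)
    return max(permutations(range(K)), key=loglik)

def mv_permutation(data):
    # Majority vote (5): returns the inverse assignment, mapping each decision
    # region b to the plurality label among the labeled samples falling in it.
    votes = [[0] * K for _ in range(K)]
    for x, y in data:
        votes[decision_region(x)][y] += 1
    return tuple(max(range(K), key=lambda y: votes[b][y]) for b in range(K))

print(mle_permutation(data))  # -> (0, 1), the identity
print(mv_permutation(data))   # -> (0, 1)
```

On this toy data both rules recover the identity permutation; the experiments in Section 6 probe the regimes where they diverge.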
Thus, conditions on Δ_MV(Λ) are closely related to the well-known cluster assumption [4, 10, 11, 25].

Theorem 4.3. Let π̂_MV be the MV defined in (5). If Δ_MV := Δ_MV(Λ) > 0 then

$$P(\hat\pi_{\mathrm{MV}} = \pi^*) \ge 1 - 2K^2 \exp\Big( \frac{-2\Delta_{\mathrm{MV}}^2 \min_b m_b}{9} \Big).$$

As with the MLE, the gap Δ_MV(Λ) is an important quantity. Fortunately, when Λ = Λ* it is always positive:

Proposition 4.4. For each b = 1, . . . , K,

$$E^*\chi_{bb}(\Lambda^*) > \max_{j \ne b} E^*\chi_{bj}(\Lambda^*),$$

and hence Δ_MV(Λ*) > 0.

When Λ ≠ Λ*, Δ_MV(Λ) has the following interpretation: Δ_MV(Λ) measures how well the decision regions defined by Λ match up with the decision regions defined by Λ*. When Λ defines decision regions that assign high probability to one class, Δ_MV(Λ) will be large. If Λ defines decision regions where multiple classes have approximately the same probability, however, then it is possible that Δ_MV(Λ) will be small. In this case, our experiments in Section 6 indicate that the MLE performs much better by managing overlapping decision regions more gracefully.

4.3 Sample complexity

Theorems 4.1 and 4.3 imply upper bounds on the minimum number of samples required to learn the permutation π*: For any δ ∈ (0, 1), as long as

$$\text{(MLE)} \qquad \inf_k n_k := n_0 \ge \frac{\log(2K^2/\delta)}{\inf_b \beta_b^*(\Delta_{\mathrm{MLE}}/3)} \qquad (8)$$

$$\text{(MV)} \qquad \inf_b m_b := m_0 \ge \frac{9 \log(2K^2/\delta)}{2\Delta_{\mathrm{MV}}^2} \qquad (9)$$

we recover π* with probability at least 1 − δ. 
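For intuition, the per-region requirement (9) is easy to evaluate numerically. The values K = 4, δ = 0.05, and Δ_MV = 0.5 below are illustrative choices, not taken from the paper:

```python
import math

def mv_region_bound(K, delta, gap):
    # Minimum samples per decision region from (9): m0 >= 9 log(2 K^2 / delta) / (2 gap^2).
    return 9 * math.log(2 * K**2 / delta) / (2 * gap**2)

m0 = mv_region_bound(K=4, delta=0.05, gap=0.5)
print(math.ceil(m0))  # -> 117 samples per region for these (illustrative) values
```

Note how the bound scales only logarithmically in K and 1/δ, but quadratically in the inverse gap 1/Δ_MV.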
Surprisingly, as stated these lower bounds are dimension-free; in practice, however, the gaps Δ_MLE and Δ_MV may be dimension-dependent.

To derive the sample complexity in terms of the total number of labeled samples n, it suffices to determine the minimum number of samples per class given n draws from a multinomial random variable. For the general case with unequal probabilities, Lemma D.2 provides a precise answer. For simplicity here, we summarize the special case where each class (resp. decision region) is equally probable for the MLE (resp. MV).

Corollary 4.5 (Sample complexity of MLE). Suppose that λ*_k = 1/K for each k, Δ_MLE > 0, and

$$n \ge K \log(K/\delta) \Big[ 1 + \frac{4}{\inf_b \beta_b^*(\Delta_{\mathrm{MLE}}/3)} \Big].$$

Then P(π̂_MLE = π*) ≥ 1 − δ.

Corollary 4.6 (Sample complexity of MV). Suppose that P(X ∈ D_b(Λ)) = 1/K for each b, Δ_MV > 0, and

$$n \ge K \log(K/\delta) \Big[ 1 + \frac{18}{\Delta_{\mathrm{MV}}^2} \Big].$$

Then P(π̂_MV = π*) ≥ 1 − δ.

Coupon collector's problem and SSL  To better understand these bounds, consider arguably the simplest possible case: Suppose that each density f*_k has disjoint support, λ*_k = 1/K, and that we know Λ*. Under these very strong assumptions, an alternative way to learn π* is to simply sample from P(X) until we have visited each decision region D*_k at least once. This is the classical coupon collector's problem (CCP), which is known to require Θ(K log K) samples [33, 34]. Thus, under these assumptions the expected number of samples required to learn π* is Θ(K log K). By comparison, our results indicate that even if the f*_k have overlapping supports and we do not know Λ*, as long as Δ_MLE = Ω(1) (resp. Δ_MV = Ω(1)), then Ω(K log K) samples suffice to learn π*. 
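The Θ(K log K) coupon-collector behaviour is easy to check empirically. The following sketch (illustrative, not from the paper) draws uniform region indices until all K regions have been seen and compares the average draw count to the exact expectation K · H_K, where H_K = 1 + 1/2 + … + 1/K:

```python
import random

def coupons_needed(K, rng):
    # Sample uniformly from K decision regions until every region has been visited once.
    seen, draws = set(), 0
    while len(seen) < K:
        seen.add(rng.randrange(K))
        draws += 1
    return draws

K = 16
rng = random.Random(0)
trials = 2000
avg = sum(coupons_needed(K, rng) for _ in range(trials)) / trials
harmonic = K * sum(1 / i for i in range(1, K + 1))  # K * H_K, which is Theta(K log K)
print(round(avg, 1), round(harmonic, 1))  # empirical mean vs. K * H_K (about 54.1 for K = 16)
```

The empirical average matches K · H_K closely, illustrating the Θ(K log K) baseline against which the SSL bounds above are compared.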
In other words, SSL is approximately as difficult as CCP in very general settings.

4.4 Classification error

So far our results have focused on the probability of recovering the unknown permutation π*. We can further bound the classification error of the classifier (3) in terms of the Wasserstein distance W1(Λ, Λ*) between Λ and Λ* as follows:

Theorem 4.7 (Classification error). Let g* = g_{Λ*,π*} denote the Bayes classifier. If π*(α_b) = arg min_i d_TV(f_i, f*_b), then there is a constant C > 0 depending on K and Λ* such that

$$P(g_{\Lambda,\pi^*}(X) \ne Y) \le P(g^*(X) \ne Y) + C \cdot W_1(\Lambda, \Lambda^*) + \sum_b |\lambda_{\pi^*(\alpha_b)} - \lambda_b^*|.$$

This theorem allows for the possibility that the mixture model Λ learned from the unlabeled data is not the same as Λ* (e.g. the true mixing measure corresponding to the true class conditionals). It is thus necessary to assume that the mismatch between Λ and Λ* is not so bad that the closest density f_i to f*_b is something other than f_{π*(α_b)}.

The interpretation of this theorem is as follows: Given Λ, we learn a permutation π̂ = π̂_n(Λ) from n labeled samples, e.g. using either the MLE (4) or MV (5). Together, the pair (Λ, π̂) defines a classifier g_{Λ,π̂} via (3). We are interested in bounding the probability of misclassification P(g_{Λ,π̂}(X) ≠ Y) in terms of the Bayes error. Since π̂ = π* with high probability, Theorem 4.7 implies that

$$P(g_{\Lambda,\hat\pi}(X) \ne Y) \le P(g^*(X) \ne Y) + C \cdot W_1(\Lambda, \Lambda^*) + \sum_b |\lambda_{\pi^*(\alpha_b)} - \lambda_b^*|.$$

In this case, there is an irreducible error quantified by the Wasserstein distance W1(Λ, Λ*). In fact, if W1(Λ, Λ*) = 0, then Theorem 4.7 implies that P(g_{Λ,π̂}(X) ≠ Y) ≤ P(g*(X) ≠ Y), i.e. the excess risk is zero. This is clearly a very strong conclusion.

Identifiability and misspecification  As discussed in Section 2, although Λ* will in general be nonidentifiable, the unlabeled data may identify some other mixing measure Λ (see e.g. [17]). Suppose that Λ̂_m is a mixing measure estimated from m unlabeled samples and that W1(Λ̂_m, Λ) → 0. The question then is how much the misspecified Λ helps in classification.

Corollary 4.8. Suppose W1(Λ̂_m, Λ) = O(r_m) for some r_m → 0, where m is the number of unlabeled samples, and π*(α_b) = arg min_i d_TV(f_i, f*_b). Then if π̂ = π*,

$$P(g_{\hat\Lambda_m,\hat\pi}(X) \ne Y) \le P(g^*(X) \ne Y) + C \cdot r_m + C \cdot W_1(\Lambda, \Lambda^*).$$

In particular, if W1(Λ, Λ*) = 0, then

$$P(g_{\hat\Lambda_m,\hat\pi}(X) \ne Y) - P(g^*(X) \ne Y) = O(r_m).$$

Clairvoyant SSL  Previous work [5, 6, 11] has studied the so-called clairvoyant SSL case in which it is assumed that we know (1) perfectly. This amounts to taking Λ̂_m = Λ in the previous results, or equivalently m = ∞. Under this assumption, we have perfect knowledge of the decision regions and only need to learn the label permutation π*. 
Then Corollary 4.8 implies that with high probability, we can learn a Bayes classifier for the problem using finitely many labeled samples.

Convergence rates  The convergence rate r_m used here is essentially the rate of convergence in estimating an identifiable mixture model, which is well-studied for parametric mixture models [35–37]. In particular, for so-called strongly identifiable parametric mixture models, the minimax rate of convergence attains the optimal root-m rate r_m = m^{-1/2} [35].¹ Asymptotic consistency theorems for nonparametric mixtures can be found in Aragam et al. [17].

Comparison to supervised learning (SL)  Previous work [11] has compared the sample complexity of SSL to SL under a cluster-type assumption. While a precise characterization of these trade-offs is not the main focus of this paper, we note the following in passing: If the minimax risk of SL for a particular problem is larger than W1(Λ, Λ*), then Theorem 4.7 implies that SSL provably outperforms SL on finite samples.

5 Algorithms

One of the significant appeals of MV (5) is its simplicity. It is conceptually easy to understand and trivial to implement. The MLE (4), on the other hand, is more subtle and difficult to compute in practice. In this section, we discuss two algorithms for computing the MLE: 1) an exact algorithm based on finding the maximum-weight perfect matching in a bipartite graph via the Hungarian algorithm [39], and 2) greedy optimization.

Define C_k = {i : Y^{(i)} = α_k}. 
Consider the weighted complete bipartite graph G = (V_{K,K}, w) with edge weights

$$w(k, k') = \sum_{i \in C_k} \log\big(\lambda_{k'} f_{k'}(X^{(i)})\big), \qquad \forall k, k' \in [K].$$

Since a permutation π defines a perfect matching on G, the log-likelihood can be rewritten as

$$\ell_n(\pi; \Lambda) = \sum_{k=1}^{K} \sum_{i \in C_k} \log\big(\lambda_{\pi(\alpha_k)} f_{\pi(\alpha_k)}(X^{(i)})\big) = \sum_{k=1}^{K} w(k, \pi(\alpha_k)),$$

the right side of which is the total weight of the matching π. Hence, the maximizer π̂_MLE can be found by finding a perfect matching for this graph that has maximum weight. This can be done in O(K³) time using the well-known Hungarian algorithm [39].

We can also approximately solve the matching problem by a greedy method: Assign the kth class to

$$\hat\pi_G(\alpha_k) = \arg\max_{k' \in [K]} w(k, k') = \arg\max_{k' \in [K]} \sum_{i \in C_k} \log\big(\lambda_{k'} f_{k'}(X^{(i)})\big).$$

This greedy heuristic is not guaranteed to achieve an optimal matching; however, it is simple to implement and can be viewed as a "soft interpolation" of π̂_MLE and π̂_MV, as follows: If we define w_MV(k, k') = Σ_{i∈C_k} 1(X^{(i)} ∈ D_{k'}(Λ)), we can see that a training example (X^{(i)}, Y^{(i)} = α_k) contributes 1 to w_MV(k, k') if k' = arg max_j λ_j f_j(X^{(i)}), and contributes 0 to w_MV(k, k') otherwise. By comparison, for the greedy heuristic, a training example (X^{(i)}, Y^{(i)} = α_k) contributes log(λ_{k'} f_{k'}(X^{(i)})) to w(k, k').

¹This paper corrects an earlier result due to Chen [38] that claimed an m^{-1/4} minimax rate. 
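The reduction above is easy to exercise directly: build the weight matrix w(k, k′) and solve the assignment problem. In the sketch below (a hypothetical weight matrix with illustrative values), an exact search over permutations stands in for the Hungarian algorithm, which is what one would use for large K; the example also shows how the greedy assignment can violate the matching constraint by reusing a column.

```python
from itertools import permutations

def matching_mle(w):
    # Exact maximum-weight perfect matching: argmax_pi sum_k w[k][pi[k]].
    # This O(K!) search is for illustration; for large K use the O(K^3)
    # Hungarian algorithm instead.
    K = len(w)
    return max(permutations(range(K)),
               key=lambda pi: sum(w[k][pi[k]] for k in range(K)))

def greedy(w):
    # Greedy heuristic: class k -> argmax_{k'} w[k][k'], ignoring the matching constraint.
    K = len(w)
    return tuple(max(range(K), key=lambda kp: w[k][kp]) for k in range(K))

# Hypothetical log-likelihood weights w(k, k') for K = 3 classes.
w = [[-1.0, -5.0, -6.0],
     [-4.0, -2.0, -7.0],
     [-3.0, -2.5, -9.0]]

print(matching_mle(w))  # -> (0, 2, 1), a valid permutation
print(greedy(w))        # -> (0, 1, 1), not a permutation: column 1 is reused
```

Here rows 1 and 2 both prefer column 1, so the greedy rule produces a conflicting assignment, while the matching solver resolves the conflict globally.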
Therefore, the greedy estimator can be seen as a "soft" version of MV that also greedily optimizes the MLE objective.

6 Experiments

In order to evaluate the relative performance of the proposed estimators in practice, we implemented each of the three methods described in Section 5 on simulated and real data. These experiments also illustrate the gap ∆MLE(Λ) (resp. ∆MV(Λ)) that appears in Theorem 4.1 (resp. Theorem 4.3): in many of the examples, although the learned mixture is badly misspecified and the true class conditionals overlap significantly, it is still possible to recover π∗ with fewer than 100 labeled samples (sometimes significantly fewer).

Our experiments consider three settings: (i) parametric mixtures of Gaussians, (ii) a nonparametric mixture model, and (iii) real data from MNIST. In each experiment, a random true mixture model Λ∗ was generated from one of these settings, and then N = 99 labeled samples were drawn from this mixture model. We generated Λ∗ under different separation conditions, from well-separated to overlapping. Then, Λ was generated in two ways: (a) Λ = Λ∗, corresponding to a setting where the true decision boundaries are known, and (b) Λ ≠ Λ∗, obtained by perturbing the components and weights of Λ∗ by a parameter η > 0 (see Appendix A for details). Λ was then used to estimate π∗ with each of the three algorithms described in the previous section, using the first n = 3, 6, 9, . . . , 99 labeled samples. This procedure was repeated T = 50 times (holding Λ∗ and Λ fixed) in order to estimate P(π̂ = π∗). Full details of the experiments can be found in Appendix A.

Mixture of Gaussians. We generated random Gaussian mixture models with K ∈ {2, 4, 9, 16} and dimension d = 2.
Since the lower bounds (8) and (9) are dimension-free, we also tested examples with d = 10, with similar results.

Nonparametric mixture model. A nonparametric mixture model with K = 4, where each f∗k was chosen to be a random Gaussian mixture; thus the overall density is a "mixture of Gaussian mixtures".

MNIST. To approximate real data, we used training data from the MNIST dataset to build K = 10 class conditionals f∗k using kernel density estimates. For labeled data, we sampled from the test data. To simulate the case Λ ≠ Λ∗, we contaminated the training labels by randomly switching 10% of the labels.

The results are shown in Figure 2. As expected, the MLE performs by far the best, obtaining near-perfect recovery of π∗ with fewer than n = 20 labeled samples on synthetic data, and fewer than n = 40 on MNIST. Unsurprisingly, the most difficult case was K = 16, in which only the MLE was able to recover the true permutation more than 50% of the time. By increasing n, the MLE is eventually able to learn this most difficult case, in accordance with our theory. Furthermore, the MLE is much more robust to misspecification (Λ ≠ Λ∗) and component overlap than the other estimators. This highlights the advantage of leveraging density information in the MLE, which is ignored by the MV estimator. These results also illustrate how the gaps ∆MLE and ∆MV affect learning π∗: even when W1(Λ, Λ∗) is large, the labeled sample complexity is relatively small (fewer than n = 100 in general). For example, see Fig. 4 in Appendix A for an illustration of the difficulty of the case K = 16.
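As a rough sketch of this experimental protocol (illustrative only; the means, noise level, and sample sizes below are made up and are not those of Appendix A), the following simulates T trials of drawing n labeled samples from a spherical Gaussian mixture and records how often the greedy estimator recovers the identity permutation:

```python
import numpy as np

def log_gauss(X, mu, sigma):
    """Log-density of N(mu, sigma^2 I) evaluated at the rows of X."""
    d = X.shape[1]
    return (-0.5 * np.sum((X - mu) ** 2, axis=1) / sigma ** 2
            - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi))

def recovery_rate(mus, sigma, n, T=50, eta=0.0, seed=0):
    """Fraction of T trials in which the greedy estimator recovers pi* = id.

    mus:  (K, d) array of true component means (equal weights assumed)
    eta:  perturbation applied to the learned means, so that Lambda != Lambda*
    """
    rng = np.random.default_rng(seed)
    K, d = mus.shape
    hits = 0
    for _ in range(T):
        mus_hat = mus + eta * rng.standard_normal((K, d))  # learned mixture
        Y = rng.integers(K, size=n)                        # labels
        X = mus[Y] + sigma * rng.standard_normal((n, d))   # features
        # Weight matrix w(k, k') = sum_{i in C_k} log(lambda_{k'} f_{k'}(X_i))
        w = np.zeros((K, K))
        for k in range(K):
            Xk = X[Y == k]
            for kp in range(K):
                w[k, kp] = (np.log(1.0 / K) * len(Xk)
                            + log_gauss(Xk, mus_hat[kp], sigma).sum())
        hits += np.array_equal(np.argmax(w, axis=1), np.arange(K))
    return hits / T

# Well-separated components: recovery should be easy even for modest n.
mus = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
rate = recovery_rate(mus, sigma=1.0, n=30, T=50)
```

Shrinking the separation between the rows of `mus`, or increasing `eta`, reproduces the qualitative effect seen in Figure 2: recovery degrades as the components overlap or as Λ drifts from Λ∗.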
Figure 2: Performance of the MLE (Hungarian: green; greedy: blue) and MV (red) across panels (a) mixture of Gaussians, (b) mixture of Gaussian mixtures, and (c) MNIST. Solid and dashed lines correspond to the performance when Λ∗ = Λ and Λ∗ ≠ Λ, respectively. Columns correspond to the number of classes K; rows correspond to decreasing separation, e.g. the bottom rows in each figure are the least separated.

Furthermore, in Appendix A, we also compare the classification accuracy of the resulting SSL classifiers with a standard supervised baseline (LeNet) on the MNIST dataset. When the sample size is small, it is clear that our proposed estimators are more accurate. In accordance with our theory, the accuracy of the SSL classifiers plateaus around 96% due to misspecification of Λ∗, as measured by W1(Λ, Λ∗).

7 Discussion

Using nonparametric mixture models as a foundation, we analyzed the labeled sample complexity of semi-supervised learning. Our results allow for arbitrary, possibly heuristic estimators of a mixing measure Λ that is used to approximate the unlabeled data distribution F∗. This mixing measure defines decision boundaries that can be used to define a semi-supervised classifier whose classification accuracy is controlled by the Wasserstein distance between Λ and Λ∗, the true mixing measure corresponding to the class conditional distributions. This draws an explicit connection between the quality of what is learned from the unlabeled data (i.e. Λ) and the quality of the resulting classifier.
Our experiments convey two main takeaway messages: 1) it pays off to use density information, as the MLE does, and 2) even when the mixture model learned from the unlabeled data is a poor approximation of the true mixing measure, or the true class conditionals have substantial overlap, the MLE can still learn a reasonable semi-supervised classifier.

This work poses many interesting questions for future work, including instantiating our results for practical methods for learning Λ and quantifying the dependence of ∆MLE and ∆MV on K and d. Furthermore, it would be interesting to provide a rigorous comparison of ∆MLE and ∆MV in specific settings in order to better understand the trade-off between the MLE and MV estimators. Finally, exploring additional connections with existing assumptions, such as the cluster and manifold assumptions, is an interesting problem.

Acknowledgments

P.R.
acknowledges the support of NSF via IIS-1149803, IIS-1664720, DMS-1264033, and ONR via N000141812861. E.X. acknowledges the support of NIH R01GM114311 and P30DA035778.

References

[1] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.

[2] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[3] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, pages 6513–6523, 2017.

[4] Martin Azizyan, Aarti Singh, and Larry Wasserman. Density-sensitive semisupervised inference. The Annals of Statistics, 41(2):751–771, 2013.

[5] Vittorio Castelli and Thomas M Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16(1):105–111, 1995.

[6] Vittorio Castelli and Thomas M Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.

[7] Fabio G Cozman, Ira Cohen, and Marcelo C Cirelo. Semi-supervised learning of mixture models. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 99–106, 2003.

[8] Matti Kääriäinen. Generalization error bounds using unlabeled data. In International Conference on Computational Learning Theory, pages 127–142. Springer, 2005.

[9] Partha Niyogi. Manifold regularization and semi-supervised learning: Some theoretical analyses. The Journal of Machine Learning Research, 14(1):1229–1250, 2013.

[10] Philippe Rigollet.
Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8(Jul):1369–1392, 2007.

[11] Aarti Singh, Robert Nowak, and Xiaojin Zhu. Unlabeled data: Now it helps, now it doesn't. In Advances in Neural Information Processing Systems, pages 1513–1520, 2009.

[12] Larry Wasserman and John D Lafferty. Statistical analysis of semi-supervised regression. In Advances in Neural Information Processing Systems, pages 801–808, 2008.

[13] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.

[14] Shai Ben-David, Tyler Lu, and Dávid Pál. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In COLT, pages 33–44, 2008.

[15] Amir Globerson, Roi Livni, and Shai Shalev-Shwartz. Effective semisupervised learning on manifolds. In Conference on Learning Theory, pages 978–1003, 2017.

[16] Malte Darnstädt, Hans Ulrich Simon, and Balázs Szörényi. Unlabeled data does provably help. 2013.

[17] Bryon Aragam, Chen Dan, Pradeep Ravikumar, and Eric Xing. Identifiability of nonparametric mixture models and Bayes optimal clustering. arXiv preprint, arXiv:1802.04397, 2018.

[18] Olivier Collier and Arnak S Dalalyan. Minimax rates in permutation estimation for feature matching. The Journal of Machine Learning Research, 17(1):162–192, 2016.

[19] Fajwel Fogel, Rodolphe Jenatton, Francis Bach, and Alexandre d'Aspremont. Convex relaxations for permutation problems. In Advances in Neural Information Processing Systems, pages 1016–1024, 2013.

[20] Cong Han Lim and Stephen Wright.
Beyond the Birkhoff polytope: Convex relaxations for vector permutation problems. In Advances in Neural Information Processing Systems, pages 2168–2176, 2014.

[21] Sara van de Geer and Peter Bühlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.

[22] Bryon Aragam, Arash A. Amini, and Qing Zhou. Learning directed acyclic graphs with penalized neighbourhood regression. arXiv:1511.08963, 2016.

[23] Ashwin Pananjady, Martin J Wainwright, and Thomas A Courtade. Linear regression with an unknown permutation: Statistical and computational limits. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 417–424. IEEE, 2016.

[24] Nicolas Flammarion, Cheng Mao, and Philippe Rigollet. Optimal rates of statistical seriation. arXiv preprint arXiv:1607.02435, 2016.

[25] Matthias Seeger. Learning with labeled and unlabeled data. Technical report, 2000.

[26] XuanLong Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics, 41(1):370–400, 2013.

[27] Henry Teicher. Identifiability of mixtures. The Annals of Mathematical Statistics, 32(1):244–248, 1961.

[28] Henry Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, pages 1265–1269, 1963.

[29] Sidney J Yakowitz and John D Spragins. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, pages 209–214, 1968.

[30] O Barndorff-Nielsen. Identifiability of mixtures of exponential families. Journal of Mathematical Analysis and Applications, 12(1):115–121, 1965.

[31] Henry Teicher. Identifiability of mixtures of product measures. The Annals of Mathematical Statistics, 38(4):1300–1302, 1967.

[32] Peter Hall and Xiao-Hua Zhou.
Nonparametric estimation of component distributions in a multivariate mixture. Annals of Statistics, pages 201–224, 2003.

[33] Donald J Newman. The double dixie cup problem. The American Mathematical Monthly, 67(1):58–61, 1960.

[34] Philippe Flajolet, Daniele Gardy, and Loÿs Thimonier. Birthday paradox, coupon collectors, caching algorithms and self-organizing search. Discrete Applied Mathematics, 39(3):207–229, 1992.

[35] Philippe Heinrich and Jonas Kahn. Minimax rates for finite mixture estimation. arXiv preprint arXiv:1504.03506, 2015.

[36] Nhat Ho and XuanLong Nguyen. On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10(1):271–307, 2016.

[37] Nhat Ho and XuanLong Nguyen. Singularity structures and impacts on parameter estimation in finite mixtures of distributions. arXiv preprint arXiv:1609.02655, 2016.

[38] Jiahua Chen. Optimal rate of convergence for finite mixture models. Annals of Statistics, pages 221–233, 1995.

[39] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97, 1955.

[40] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.