{"title": "Estimating Mixture Models via Mixtures of Polynomials", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 495, "abstract": "Mixture modeling is a general technique for making any simple model more expressive through weighted combination. This generality and simplicity in part explain the success of the Expectation Maximization (EM) algorithm, in which updates are easy to derive for a wide class of mixture models. However, the likelihood of a mixture model is non-convex, so EM has no known global convergence guarantees. Recently, method of moments approaches have offered global guarantees for some mixture models, but they do not extend easily to the range of mixture models that exist. In this work, we present Polymom, a unifying framework based on the method of moments in which estimation procedures are easily derivable, just as in EM. Polymom is applicable when the moments of a single mixture component are polynomials of the parameters. Our key observation is that the moments of the mixture model are a mixture of these polynomials, which allows us to cast estimation as a Generalized Moment Problem. We solve its relaxations using semidefinite optimization, and then extract parameters using ideas from computer algebra. This framework allows us to draw insights and apply tools from convex optimization, computer algebra, and the theory of moments to study problems in statistical estimation. Simulations show good empirical performance on several models.", "full_text": "Estimating Mixture Models via Mixtures of Polynomials\n\nSida I. Wang Arun Tejasvi Chaganty Percy Liang\n\nComputer Science Department, Stanford University, Stanford, CA, 94305\n\n{sidaw,chaganty,pliang}@cs.stanford.edu\n\nAbstract\n\nMixture modeling is a general technique for making any simple model more expressive through weighted combination. 
This generality and simplicity in part explain the success of the Expectation Maximization (EM) algorithm, in which updates are easy to derive for a wide class of mixture models. However, the likelihood of a mixture model is non-convex, so EM has no known global convergence guarantees. Recently, method of moments approaches have offered global guarantees for some mixture models, but they do not extend easily to the range of mixture models that exist. In this work, we present Polymom, a unifying framework based on the method of moments in which estimation procedures are easily derivable, just as in EM. Polymom is applicable when the moments of a single mixture component are polynomials of the parameters. Our key observation is that the moments of the mixture model are a mixture of these polynomials, which allows us to cast estimation as a Generalized Moment Problem. We solve its relaxations using semidefinite optimization, and then extract parameters using ideas from computer algebra. This framework allows us to draw insights and apply tools from convex optimization, computer algebra, and the theory of moments to study problems in statistical estimation. Simulations show good empirical performance on several models.\n\n1 Introduction\n\nMixture models play a central role in machine learning and statistics, with diverse applications including bioinformatics, speech, natural language, and computer vision. The idea of mixture modeling is to explain data through a weighted combination of simple parametrized distributions [1, 2]. In practice, maximum likelihood estimation via Expectation Maximization (EM) has been the workhorse for these models, as the parameter updates are often easily derivable. However, EM is well known to suffer from local optima. The method of moments, dating back to Pearson [3] in 1894, is enjoying a recent revival [4, 5, 6, 7, 8, 9, 10, 11, 12, 13] due to its strong global theoretical guarantees. 
However, current methods depend strongly on the specific distributions and are not easily extensible to new ones.\nIn this paper, we present a method of moments approach, which we call Polymom, for estimating a wider class of mixture models in which the moment equations are polynomial equations (Section 2). Solving general polynomial equations is NP-hard, but our key insight is that for mixture models, the moment equations are mixtures of polynomial equations, and we can hope to solve them if the moment equations for each mixture component are simple polynomial equations that we can solve. Polymom proceeds as follows: First, we recover mixtures of monomials of the parameters from the data moments by solving an instance of the Generalized Moment Problem (GMP) [14, 15] (Section 3). We show that for many mixture models, the GMP can be solved with basic linear algebra and, in the general case, can be approximated by an SDP in which the moment equations are linear constraints. Second, we extend multiplication matrix ideas from the computer algebra literature [16, 17, 18, 19] to extract the parameters by solving a certain generalized eigenvalue problem (Section 4).\n\nTable 1: Notation. mixture model: xt, data point (ℝD); zt, latent mixture component ([K]); θk, parameters of component k (ℝP); πk, mixing proportion p(z = k); [θk]Kk=1, all model parameters. moments of data: φn(x), observation function; fn(θ), expected observation as a polynomial of θ. moments of parameters: Ly, the Riesz linear functional; yα = Ly(θα), the αth moment; μ, probability measure for y; y = (yα)α, the moment sequence; Mr(y), moment matrix of degree r. sizes: D, data dimensions; K, mixture components; P, parameters per mixture component; T, data points; N, constraints; [N] = {1, . . . , N}; r, degree of the moment matrix; s(r), size of the degree-r moment matrix. polynomials: ℝ[θ], polynomial ring in variables θ; ℕ, set of non-negative integers; α, β, γ, vectors of exponents (in ℕP or ℕD); θα, the monomial ∏p=1..P θpαp; anα, coefficient of θα in fn(θ). We use lowercase letters (e.g., d) for indexing, and the corresponding uppercase letter to denote the upper limit (e.g., D, in "sizes"). We use lowercase letters (e.g., θk,p) for scalars, lowercase bold letters (e.g., θ) for vectors, and bold capital letters (e.g., M) for matrices.\n\nFigure 1: An overview of applying the Polymom framework. 1. Write down a mixture model (user specified): z ∼ Multinomial(π1, π2), x | z ∼ N(μz, σz²) ∈ ℝ, with θk = (μk, σk²) ∈ ℝ². 2. Derive single mixture moment equations: φ(x) = [x, x², x³, . . .] with E[φ(x)] = f(θ) = [μ, μ² + σ², μ³ + 3μσ², . . .]. 3. Add data: x1 ∼ p(x; θ*), . . . , xT ∼ p(x; θ*). 4. Recover parameter moments y (framework specified): minimize tr(Mr(y)) subject to Mr(y) ⪰ 0, y0,0 = 1, y1,0 = (1/T)∑t xt, y2,0 + y0,1 = (1/T)∑t xt², y3,0 + 3y1,1 = (1/T)∑t xt³, . . . , where the rows and columns of Mr(y) are indexed by the monomials [1, μ, μ², σ²]. 5. Solve for parameters by simultaneous diagonalization: Mr(y) = VΠV⊤, with Π = diag([π1, π2]) and V = [v(θ1), v(θ2)].\n\nPolymom improves on previous method of moments approaches in both generality and flexibility. First, while tensor factorization has been the main driver of method of moments approaches for many types of mixture models [6, 20, 9, 8, 21, 12], each model required specific adaptations which are non-trivial even for experts. In contrast, Polymom provides a unified principle for tackling new models that is as turnkey as computing gradients or EM updates. To use Polymom (Figure 1), one only needs to provide a list of observation functions (φn) and derive their expected values expressed symbolically as polynomials in the parameters of the specified model (fn). Polymom then estimates expectations of φn and outputs parameter estimates of the specified model. Since Polymom works in an optimization framework, we can easily incorporate constraints such as non-negativity and parameter tying, which are difficult to do in the tensor factorization paradigm. In simulations, we compared Polymom with EM and tensor factorization and found that Polymom performs similarly or better (Section 5). This paper assumes identifiability and infinite data. With the exception of a few specific models in Section 5, we defer issues of general identifiability and sample complexity to future work.\n\n2 Problem formulation\n\n2.1 The method of moments estimator\nIn a mixture model, each data point x ∈ ℝD is associated with a latent component z ∈ [K]:\n\nz ∼ Multinomial(π), x | z ∼ p(x; θ*z),\n\n(1)\nwhere π = (π1, . . . 
, πK) are the mixing coefficients, θ*k ∈ ℝP are the true model parameters for the kth mixture component, and x ∈ ℝD is the random variable representing data. We restrict our attention to mixtures where each component distribution comes from the same parameterized family. For example, for a mixture of Gaussians, θ*k = (μ*k ∈ ℝD, Σ*k ∈ ℝD×D) consists of the mean and covariance of component k.\nWe define N observation functions φn : ℝD → ℝ for n ∈ [N] and define fn(θ) to be the expectation of φn over a single component with parameters θ, which we assume is a simple polynomial:\n\nfn(θ) := Ex∼p(x;θ)[φn(x)] = ∑α anα θα,\n\n(2)\nwhere θα = ∏p=1..P θpαp. The expectation of each observation function E[φn(x)] can then be expressed as a mixture of polynomials of the true parameters: E[φn(x)] = ∑k=1..K πk E[φn(x) | z = k] = ∑k=1..K πk fn(θ*k).\nThe method of moments for mixture models seeks parameters [θk]Kk=1 that satisfy the moment conditions\n\nE[φn(x)] = ∑k=1..K πk fn(θk),\n\n(3)\nwhere E[φn(x)] can be estimated from the data: (1/T)∑t=1..T φn(xt) →p E[φn(x)]. The goal of this work is to find parameters satisfying moment conditions that can be written in the mixture of polynomials form (3). We assume that the N observation functions φ1, . . . , φN uniquely identify the model parameters (up to permutation of the components).\nExample 2.1 (1-dimensional Gaussian mixture). Consider a K-mixture of 1D Gaussians with parameters θk = [μk, σk²] corresponding to the mean and variance, respectively, of the kth component (Figure 1: steps 1 and 2). We choose the observation functions φ(x) = [x1, . . . , x6], which have corresponding moment polynomials f(θ) = [μ, μ² + σ², μ³ + 3μσ², . . . ]. For example, instantiating (3), E[x²] = ∑k=1..K πk(μk² + σk²). Given φ(x) and f(θ*), and data, the Polymom framework can recover the parameters. Note that the 6 moments we use have been shown by [3] to be sufficient for a mixture of two Gaussians.\nExample 2.2 (Mixture of linear regressions). Consider a mixture of linear regressions [22, 9], where each data point x = [x, y] is drawn from component k by sampling x from an unknown distribution independent of k and setting y = wk x + ε, where ε ∼ N(0, σk²). The parameters θk = (wk, σk²) are the slope and noise variance for each component k. Let us take our observation functions to be φ(x) = [x, xy, xy², x², . . . , x³y²], for which the moment polynomials are f(θ) = [E[x], E[x²]w, E[x³]w² + E[x]σ², E[x²], . . .].\nIn Example 2.1, the coefficients anα in the polynomial fn(θ) are just constants determined by integration. For the conditional model in Example 2.2, the coefficients depend on the data. However, we cannot handle arbitrary data dependence; see Section D for sufficient conditions and counterexamples.\n\n2.2 Solving the moment conditions\nOur goal is to recover the model parameters θ*1, . . . , θ*K ∈ ℝP for each of the K components of the mixture model that generated the data, as well as their respective mixing proportions π1, . . . , πK ∈ ℝ. To start, let's ignore sampling noise and identifiability issues and suppose that we are given exact moment conditions as defined in (3). Each condition fn ∈ ℝ[θ] is a polynomial of the parameters θ, for n = 1, . . . , N.\nEquation 3 is a polynomial system of N equations in the K + K × P variables [π1, . . . , πK] and [θ1, . . . , θK] ∈ ℝP×K. It is natural to ask if standard polynomial solving methods can solve (3) in the case where each fn(θ) is simple. 
Unfortunately, the complexity of general polynomial equation solving is lower bounded by the number of solutions, and each of the K! permutations of the mixture components corresponds to a distinct solution of (3) under this polynomial system representation. While several methods can take advantage of symmetries in polynomial systems [23, 24], to the best of our knowledge they still cannot be adapted to tractably solve (3).\nThe key idea of Polymom is to exploit the mixture representation of the moment equations (3). Specifically, let μ* be a particular "mixture" over the component parameters θ*1, . . . , θ*K (i.e., μ* is a probability measure). Then we can express the moment conditions (3) in terms of μ*:\n\nE[φn(x)] = ∫ fn(θ) μ*(dθ), where μ*(θ) = ∑k=1..K πk δ(θ − θ*k).\n\n(4)\nAs a result, solving the original moment conditions (3) is equivalent to solving the following feasibility problem over μ, where we deliberately "forget" the permutation of the components by using μ to represent the problem:\n\nfind μ ∈ M+(ℝP), the set of probability measures over ℝP\ns.t. ∫ fn(θ) μ(dθ) = E[φn(x)], n = 1, . . . , N\nμ is K-atomic (i.e., a sum of K deltas).\n\n(5)\nIf the true model parameters [θ*k]Kk=1 can be identified by the N observed moments up to permutation, then the measure μ*(θ) = ∑k=1..K πk δ(θ − θ*k) solving Problem 5 is also unique.\nPolymom solves Problem 5 in two steps:\n\n1. Moment completion (Section 3): We show that Problem 5 over the measure μ can be relaxed to an SDP over a certain (parameter) moment matrix Mr(y) whose optimal solution is Mr(y*) = ∑k=1..K πk vr(θ*k) vr(θ*k)⊤, where vr(θ*k) is the vector of all monomials of degree at most r.\n\n2. 
Solution extraction (Section 4): We then take Mr(y) and construct a series of generalized eigendecomposition problems, whose eigenvalues yield [θ*k]Kk=1.\n\nRemark. From this point on, distributions and moments refer to μ*, which is over parameters, not over the data. All the structure about the data is captured in the moment conditions (3).\n\n3 Moment completion\n\nThe first step is to reformulate Problem 5 as an instance of the Generalized Moment Problem (GMP) introduced by [15]. A reference on the GMP, algorithms for solving GMPs, and its various extensions is [14]. We start by observing that Problem 5 really only depends on the integrals of monomials under the measure μ: for example, if fn(θ) = 2θ1³ − θ1²θ2, then we only need to know the integrals over the constituent monomials (y3,0 := ∫ θ1³ μ(dθ) and y2,1 := ∫ θ1²θ2 μ(dθ)) in order to evaluate the integral over fn. This suggests that we can optimize over the (parameter) moment sequence y = (yα)α∈ℕP, rather than the measure μ itself. We say that the moment sequence y has a representing measure μ if yα = ∫ θα μ(dθ) for all α, but we do not assume that such a μ exists. The Riesz linear functional Ly : ℝ[θ] → ℝ is defined to be the linear map such that Ly(θα) := yα and Ly(1) = 1. For example, Ly(2θ1³ − θ1²θ2 + 3) = 2y3,0 − y2,1 + 3. If y has a representing measure μ, then Ly simply maps polynomials f to integrals of f against μ.\nThe key idea of the GMP approach is to convexify the problem by treating y as free variables and then introduce constraints to guarantee that y has a representing measure. First, let vr(θ) := [θα : |α| ≤ r] ∈ ℝ[θ]s(r) be the vector of all s(r) monomials of degree no greater than r. Then, define the truncated moment matrix as Mr(y) := Ly(vr(θ) vr(θ)⊤), where the linear functional Ly is applied elementwise (see Example 3.1 below). If y has a representing measure μ, then Mr(y) is simply a (positive) integral over rank-1 matrices vr(θ) vr(θ)⊤ with respect to μ, so necessarily Mr(y) ⪰ 0 holds. Furthermore, by Theorem 1 [25], for y to have a K-atomic representing measure, it is sufficient that rank(Mr(y)) = rank(Mr−1(y)) = K. So Problem 5 is equivalent to\n\nfind y (or equivalently, find M(y))\ns.t. ∑α anα yα = E[φn(x)], n = 1, . . . , N\nMr(y) ⪰ 0, y0 = 1\nrank(Mr(y)) = K and rank(Mr−1(y)) = K.\n\n(6)\nUnfortunately, the rank constraints in Problem 6 are not tractable. We use the following relaxation to obtain our final (convex) optimization problem:\n\nminimize over y: tr(C Mr(y))\ns.t. ∑α anα yα = E[φn(x)], n = 1, . . . , N\nMr(y) ⪰ 0, y0 = 1,\n\n(7)\nwhere C ≻ 0 is a chosen scaling matrix. A common choice is C = Is(r), corresponding to minimizing the nuclear norm of the moment matrix, the usual convex relaxation for rank. Section A discusses some other choices of C.\nExample 3.1 (moment matrix for a 1-dimensional Gaussian mixture). Recall that the parameters θ = [μ, σ²] are the mean and variance of a one-dimensional Gaussian. Let us choose the monomials v2(θ) = [1, μ, μ², σ²]. Step 4 of Figure 1 shows the moment matrix when using r = 2. Each row and column of the moment matrix is labeled with a monomial, and entry (i, j) is subscripted by the product of the monomials in row i and column j. For φ2(x) := x², we have f2(θ) = μ² + σ², which leads to the linear constraint y2,0 + y0,1 − E[x²] = 0. For φ3(x) = x³, f3(θ) = μ³ + 3μσ², leading to the constraint y3,0 + 3y1,1 − E[x³] = 0.\nRelated work. 
Readers familiar with the sum of squares and polynomial optimization literature [26, 27, 28, 29] will note that Problem 7 is similar to the SDP relaxation of a polynomial optimization problem. However, in typical polynomial optimization, we are only interested in solutions θ* that actually satisfy the given constraints, whereas here we are interested in K solutions [θ*k]Kk=1 whose mixture satisfies constraints corresponding to the moment conditions (3). Within machine learning, generalized PCA has been formulated as a moment problem [30], and the Hankel matrix (basically the moment matrix) has been used to learn weighted automata [13]. While similar tools are used, the conceptual approach and the problems considered are different. For example, the moment matrix of this paper consists of unknown moments of the model parameters, whereas existing works considered moments of the data, which are always directly observable.\n\nConstraints. Constraints such as non-negativity (for parameters which represent probabilities or variances) and parameter tying [31] are quite common in graphical models and are not easily addressed with existing method of moments approaches. The GMP framework allows us to incorporate some constraints using localizing matrices [32]. Thus, we can handle constraints during the estimation procedure rather than projecting back onto the constraint set as a post-processing step. This is necessary for models that only become identifiable by the observed moments after constraints are taken into account. We describe this method and its learning implications in Section C.1.\n\nGuarantees and statistical efficiency. In some circumstances, e.g., in three-view mixture models or the mixture of linear regressions, the constraints fully determine the moment matrix – we consider these cases in Section 5 and Appendix B. 
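To make the moment-matrix objects above concrete, the snippet below is our own sketch (not code from the paper, with arbitrary parameters): it instantiates the true moment matrix Mr(y*) = Σk πk vr(θ*k) vr(θ*k)⊤ for a two-component model with θ = (μ, c) and checks the rank condition rank(Mr) = rank(Mr−1) = K that certifies a K-atomic representing measure, along with positive semidefiniteness.

```python
import itertools
import numpy as np

def monomials(P, r):
    # exponent vectors alpha in N^P with |alpha| <= r, in degree order
    alphas = [a for a in itertools.product(range(r + 1), repeat=P) if sum(a) <= r]
    return sorted(alphas, key=sum)

def moment_matrix(pis, thetas, r):
    # M_r(y*) = sum_k pi_k v_r(theta_k) v_r(theta_k)^T
    alphas = monomials(len(thetas[0]), r)
    V = np.array([[np.prod(np.power(th, a)) for a in alphas] for th in thetas]).T
    return V @ np.diag(pis) @ V.T, alphas

pis = [0.4, 0.6]                       # mixing proportions
thetas = [(1.0, 0.5), (-2.0, 1.0)]     # theta_k = (mu_k, c_k), K = 2 components

M2, alphas = moment_matrix(pis, thetas, 2)
M1, _ = moment_matrix(pis, thetas, 1)

# PSD, and "flat": rank(M_2) = rank(M_1) = K, certifying a 2-atomic measure
assert np.all(np.linalg.eigvalsh(M2) > -1e-8)
assert np.linalg.matrix_rank(M2, tol=1e-8) == np.linalg.matrix_rank(M1, tol=1e-8) == 2
```

In the actual algorithm the entries yα are unknown and are produced by the relaxation in Problem 7; here they are filled in from known parameters only to make the low-rank structure visible.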
While there are no general guarantees on Problem 7, the flat extension theorem tells us when the moment matrix corresponds to a unique solution (more discussion in Appendix A):\nTheorem 1 (Flat extension theorem [25]). Let y be the solution to Problem 7 for a particular r. If Mr(y) ⪰ 0 and rank(Mr−1(y)) = rank(Mr(y)), then y is the optimal solution to Problem 6 for K = rank(Mr(y)), and there exists a unique K-atomic supporting measure μ of Mr(y).\n\nThe recovered Mr(y) depends linearly on small perturbations of the input [33], suggesting that the method has polynomial sample complexity for most models whose moments concentrate at a polynomial rate. Finally, in Appendix C, we discuss a few other important considerations like noise robustness and making Problem 7 more statistically efficient, along with some technical results on the moment completion problem and some open problems.\n\n4 Solution extraction\n\nHaving completed the (parameter) moment matrix Mr(y) (Section 3), we now turn to the problem of extracting the model parameters [θ*k]Kk=1. The solution extraction method we present is based on ideas from solving multivariate polynomial systems, where the solutions are eigenvalues of certain multiplication matrices [16, 17, 34, 35].¹ The main advantage of the solution extraction view is that higher-order moments and structure in parameters are handled in the framework without model-specific effort.\nRecall that the true moment matrix is Mr(y*) = ∑k=1..K πk v(θ*k) v(θ*k)⊤, where v(θ) := [θα1, . . . , θαs(r)] ∈ ℝ[θ]s(r) contains all the monomials up to degree r. We use θ = [θ1, . . . , θP] for variables and [θ*k]Kk=1 for the true solutions to these variables (note the boldface). 
For example, θ*k,p := (θ*k)p denotes the pth value of the kth component, which corresponds to a solution for the variable θp. Typically, s(r) ≥ K, P, and the elements of v(θ) are arranged in a degree ordering so that ||αi||1 ≤ ||αj||1 for i ≤ j. We can also write Mr(y*) as Mr(y*) = VΠV⊤, where the canonical basis V := [v(θ*1), . . . , v(θ*K)] ∈ ℝs(r)×K and Π := diag(π1, . . . , πK). At a high level, we want to factorize Mr(y*) to get V; however, we cannot simply eigendecompose Mr(y*) since V is not orthogonal. To overcome this challenge, we will exploit the internal structure of V to construct several other matrices that share the same factors and perform simultaneous diagonalization.\nSpecifically, let V[β1; . . . ; βK] ∈ ℝK×K be the sub-matrix of V with only the rows corresponding to monomials with exponents β1, . . . , βK ∈ ℕP. Typically, β1, . . . , βK are just the first K monomials in v. Now consider the exponent εp ∈ ℕP which is 1 in position p and 0 elsewhere, corresponding to the monomial θεp = θp. The key property of the canonical basis is that multiplying each column k by a monomial θ*k,p just performs a "shift" to another set of rows:\n\nV[β1; . . . ; βK] Dp = V[β1 + εp; . . . ; βK + εp], where Dp := diag(θ*1,p, . . . , θ*K,p).\n\n(8)\nNote that Dp contains the pth parameter for all K mixture components.\nExample 4.1 (Shifting the canonical basis). Let θ = [θ1, θ2] and the true solutions be θ*1 = [2, 3] and θ*2 = [2, 5]. To extract the solutions for θ1 (which are (θ*1,1, θ*2,1)), let β1 = (1, 0), β2 = (1, 1), and γ1 = (1, 0). With the monomial basis [1, θ1, θ2, θ1², θ1θ2, θ2², θ1²θ2], the canonical basis has columns v(θ*1) = [1, 2, 3, 4, 6, 9, 12] and v(θ*2) = [1, 2, 5, 4, 10, 25, 20], and the shift property reads\n\nV[β1; β2] diag(θ*1,1, θ*2,1) = V[β1 + γ1; β2 + γ1], i.e., [2, 2; 6, 10] diag(2, 2) = [4, 4; 12, 20].\n\n(9)\nWhile the above reveals the structure of V, we don't know V. However, we can recover its column space U ∈ ℝs(r)×K from the moment matrix Mr(y*), for example with an SVD. Thus, we can relate U and V by a linear transformation: V = UQ, where Q ∈ ℝK×K is some unknown invertible matrix. Equation 8 can now be rewritten as:\n\nU[β1; . . . ; βK] Q Dp = U[β1 + εp; . . . ; βK + εp] Q, p = 1, . . . , P,\n\n(10)\nwhich is a generalized eigenvalue problem where Dp are the eigenvalues and Q are the eigenvectors. Crucially, the eigenvalues Dp = diag(θ*1,p, . . . , θ*K,p) give us solutions to our parameters. Note that for any choice of β1, . . . , βK and p ∈ [P], we have generalized eigenvalue problems that share eigenvectors Q, though their eigenvalues Dp may differ. Corresponding eigenvalues (and hence solutions) can be obtained by solving a simultaneous generalized eigenvalue problem, e.g., by using random projections like Algorithm B of [4] or more robust [37] simultaneous diagonalization algorithms [38, 39, 40].\n\n¹ [36] is a short overview and [35] is a comprehensive treatment including numerical issues.\n\nTable 2: Applications of the Polymom framework. See Appendix B.2 for more details.\n\nMixture of linear regressions. Model: x = [x, y] is observed, where x ∈ ℝD is drawn from an unspecified distribution, y ∼ N(w · x, σ²), and σ² is known. The parameters are θ*k = (wk) ∈ ℝD. Observation functions: φα,b(x) = xα yb for 0 ≤ |α| ≤ 3, b ∈ [2]. Moment polynomials: fα,1(θ) = ∑p=1..P E[xα+εp] wp; fα,2(θ) = E[xα]σ² + ∑p,q=1..P E[xα xp xq] wp wq, where εp ∈ ℕP is 1 in position p and 0 elsewhere.\n\nMixture of Gaussians. Model: x ∈ ℝD is observed, where x is drawn from a Gaussian with diagonal covariance: x ∼ N(μ, diag(c)). The parameters are θ*k = (μk, ck) ∈ ℝD+D. Observation functions: φα(x) = xα for 0 ≤ |α| ≤ 4. Moment polynomials: fα(θ) = ∏d=1..D hαd(μd, cd).²\n\nMultiview mixtures. Model: With 3 views, x = [x(1), x(2), x(3)] is observed, where x(1), x(2), x(3) ∈ ℝD and x(ℓ) is drawn from an unspecified distribution with mean μ(ℓ) for ℓ ∈ [3]. The parameters are θ*k = (μk(1), μk(2), μk(3)) ∈ ℝD+D+D. Observation functions: φijk(x) = xi(1) xj(2) xk(3), where 1 ≤ i, j, k ≤ D. Moment polynomials: fijk(θ) = μi(1) μj(2) μk(3).\n\nWe describe one approach to solve (10), which is similar to Algorithm B of [4]. The idea is to take random weighted combinations of the equations (10) and solve the resulting (generalized) eigendecomposition problems. Let R ∈ ℝP×P be a random matrix whose entries are drawn from N(0, 1). Then for each q = 1, . . . , P, solve U[β1; . . . ; βK]−1(∑p=1..P Rq,p U[β1 + εp; . . . ; βK + εp]) Q = Q D̃q. The resulting eigenvalues can be collected in Λ ∈ ℝP×K, where Λq,k = (D̃q)k,k. Note that by definition Λq,k = ∑p=1..P Rq,p θ*k,p, so we can simply invert to obtain [θ*1, . . . 
, θ*K] = R−1Λ. Although this simple approach does not have great numerical properties, these eigenvalue problems are solvable if the eigenvalues [λq,1, . . . , λq,K] are distinct for all q, which happens with probability 1 as long as the parameters θ*k are different from each other.\nIn Appendix B.1, we show how a prior tensor decomposition algorithm from [4] can be seen as solving Equation 10 for a particular instantiation of β1, . . . , βK.\n\n5 Applications\n\nLet us now look at some applications of Polymom. Table 2 presents several models with corresponding observation functions and moment polynomials. It is fairly straightforward to write down observation functions for a given model. The moment polynomials can then be derived by computing expectations under the model; this step can be compared to deriving gradients for EM.\nWe implemented Polymom for several mixture models in Python (code: https://github.com/sidaw/polymom). We used CVXOPT to handle the SDP and the random projections algorithm from [4] to extract solutions. In Table 3, we show the relative error maxk ||θ̂k − θ*k||2 / ||θ*k||2 averaged over 10 random models of each class.\nIn the rest of this section, we will discuss guarantees on parameter recovery for each of these models.\n\n² hα(μ, c) = ∑i=0..⌊α/2⌋ aα,α−2i μα−2i ci, where aα,i is the absolute value of the coefficient of the degree-i term of the αth (univariate) Hermite polynomial. For example, the first few are h1(μ, c) = μ, h2(μ, c) = μ² + c, h3(μ, c) = μ³ + 3μc, h4(μ, c) = μ⁴ + 6μ²c + 3c².\n\nTable 3 (columns give EM at T = 10³/10⁴/10⁵, TF at T = 10⁴/10⁵/10⁶, Poly at T = 10³/10⁴/10⁵):\nspherical (K = 2, D = 2): EM 0.37 / 0.24 / 0.19; TF 2.05 / 0.73 / 0.36; Poly 0.58 / 0.29 / 0.14\ndiagonal (2, 2): EM 0.44 / 0.48 / 0.38; TF 2.15 / 4.03 / 2.46; Poly 0.48 / 0.40 / 0.35\nconstrained (2, 2): EM 0.49 / 0.47 / 0.34; TF 7.52 / 2.56 / 3.02; Poly 0.38 / 0.30 / 0.29\n3-view (3, 3): EM 0.38 / 0.31 / 0.36; TF 0.51 / 0.33 / 0.16; Poly 0.57 / 0.26 / 0.12\nlin. reg. (2, 2): EM - / - / -; TF - / - / -; Poly 3.51 / 2.60 / 2.52\n\nTable 3: T is the number of samples, and the error metric is defined above. Methods: EM: sklearn initialized with k-means using 5 random restarts; TF: tensor power method implemented in Python; Poly: Polymom, solving Problem 7. Models: for the mixture of Gaussians, we have π, σ², ||μ1 − μ2||2; spherical and diagonal describe the type of covariance matrix. The mean parameters of the constrained Gaussians satisfy μ1 + μ2 = 1. The best result is bolded. TF only handles spherical variance, but it was of interest to see what TF does if the data is drawn from a mixture of Gaussians with diagonal covariance; these results are in strikeout.\n\nMixture of Linear Regressions. We can guarantee that Polymom recovers the parameters for this model when K ≤ D by showing that Problem 6 can be solved exactly: observe that while no entry of the moment matrix M3(y) is directly observed, each observation gives us a linear constraint on the entries of the moment matrix, and when K ≤ D there are enough equations that this system admits a unique solution for y.\nChaganty et al. [9] were also able to recover parameters for this model under the same conditions (K ≤ D) by solving a series of low-rank tensor recovery problems, which ultimately requires the computation of the same moments described above. 
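As a sanity check on the Table 2 entries for this model, the snippet below is our own hypothetical setup, not code from the paper: x is drawn from a small finite distribution so that every expectation is an exact finite sum, and we verify that E[xα yb] under the mixture equals the mixture of the moment polynomials fα,b(θ).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D, sigma2 = 2, 0.25
xs = rng.normal(size=(5, D))              # support points of x (any distribution)
px = np.full(5, 0.2)                      # uniform probabilities over the support
pis = np.array([0.4, 0.6])                # mixing proportions
ws = np.array([[1.0, -2.0], [0.5, 3.0]])  # regression weights w_k per component

def Ex(alpha):                            # E[x^alpha] over the x-distribution
    return float(px @ np.prod(xs ** np.asarray(alpha), axis=1))

errs = []
for alpha in itertools.product(range(3), repeat=D):
    xa = np.prod(xs ** np.asarray(alpha), axis=1)   # x^alpha at each support point
    # direct mixture expectations E[x^alpha y] and E[x^alpha y^2], using
    # E[y | x, k] = w_k . x and E[y^2 | x, k] = (w_k . x)^2 + sigma^2
    lhs1 = sum(pi * (px @ (xa * (xs @ w))) for pi, w in zip(pis, ws))
    lhs2 = sum(pi * (px @ (xa * ((xs @ w) ** 2 + sigma2))) for pi, w in zip(pis, ws))
    # mixture of the moment polynomials f_{alpha,1}, f_{alpha,2} from Table 2
    eye = np.eye(D, dtype=int)
    rhs1 = sum(pi * sum(Ex(np.asarray(alpha) + eye[p]) * w[p] for p in range(D))
               for pi, w in zip(pis, ws))
    rhs2 = sum(pi * (Ex(alpha) * sigma2
                     + sum((px @ (xa * xs[:, p] * xs[:, q])) * w[p] * w[q]
                           for p in range(D) for q in range(D)))
               for pi, w in zip(pis, ws))
    errs += [abs(lhs1 - rhs1), abs(lhs2 - rhs2)]

assert max(errs) < 1e-10
```

These identities are exactly the linear constraints on y referred to above; when K ≤ D they pin down M3(y) uniquely.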
In contrast, the Polymom framework makes the dependence on moments upfront and takes care of the heavy lifting in a problem-agnostic manner. Lastly, the model can be extended to handle per-component noise by including σk as a parameter, an extension that is not possible using the method in [9].\n\nMultiview Mixtures. We can guarantee parameter recovery when K ≤ D by proving that Problem 7 can be solved exactly (see Section B.2).\n\nMixture of Gaussians. In this case, however, the moment conditions are non-trivial and we cannot guarantee recovery of the true parameters. However, Polymom is guaranteed to recover a mixture of Gaussians that matches the moments. We can also apply constraints to the model: consider the case of a 2D mixture where the mean parameters of all components lie on a parabola μ1 − μ2² = 0. In this case, we just need to add the constraints y(1,0)+γ − y(0,2)+γ = 0 to Problem 7 for all γ ∈ ℕ² up to degree |γ| ≤ 2r − 2. By incorporating these constraints at estimation time, we can possibly identify the model parameters with fewer moments. See Section C for more details.\n\n6 Conclusion\n\nWe presented a unifying framework for learning many types of mixture models via the method of moments. For example, for the mixture of Gaussians, we can apply the same algorithm both to mixtures in 1D that need higher-order moments [3, 11] and to mixtures in high dimensions where lower-order moments suffice [6]. The Generalized Moment Problem [15, 14] and its semidefinite relaxation hierarchies are what give us the generality, although we rely heavily on the ability of nuclear norm minimization to recover the underlying rank. As a result, while we always obtain parameters satisfying the moment conditions, there are no formal guarantees on consistent estimation. The second main tool is solution extraction, which characterizes a more general structure of mixture models compared to the tensor structure observed by [6, 4]. 
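The extraction machinery of Section 4 can be sketched end to end in a few lines. The code below is our own illustration with made-up parameters, not code from the paper: build Mr(y*), take its column space U by SVD, and read the parameters off the eigenvalues of the shifted sub-matrix products in Equation 10.

```python
import itertools
import numpy as np

def monomials(P, r):
    # degree-ordered exponent vectors alpha in N^P with |alpha| <= r
    alphas = [a for a in itertools.product(range(r + 1), repeat=P) if sum(a) <= r]
    return sorted(alphas, key=sum)

K, r = 2, 2
pis = np.array([0.3, 0.7])
thetas = np.array([[2.0, 3.0], [5.0, 7.0]])      # true parameters, K x P
P = thetas.shape[1]

alphas = monomials(P, r)
V = np.array([[np.prod(th ** np.array(a)) for a in alphas] for th in thetas]).T
M = V @ np.diag(pis) @ V.T                       # M_r(y*) = V Pi V^T

U = np.linalg.svd(M)[0][:, :K]                   # column space of M_r(y*): V = U Q

B = [alphas.index((0,) * P), alphas.index((0,) * (P - 1) + (1,))]  # beta_1, beta_2
def shifted(p):                                  # rows beta_i + e_p
    e = tuple(int(q == p) for q in range(P))
    return [alphas.index(tuple(np.add(alphas[i], e))) for i in B]

# U[B]^{-1} U[B + e_p] = Q D_p Q^{-1}: shared eigenvectors Q, eigenvalues
# D_p = diag(theta_{1,p}, ..., theta_{K,p})  (Equation 10)
A0 = np.linalg.inv(U[B])
Q = np.linalg.eig(A0 @ U[shifted(0)])[1]
recovered = np.array([np.diag(np.linalg.inv(Q) @ A0 @ U[shifted(p)] @ Q).real
                      for p in range(P)]).T      # K x P, one row per component

# matches the truth up to permutation of the components
order = np.argsort(recovered[:, 0])
assert np.allclose(recovered[order], [[2.0, 3.0], [5.0, 7.0]], atol=1e-6)
```

A practical implementation works with the SDP solution Mr(y) rather than the exact Mr(y*), and combines the P eigenproblems robustly via the random-projection simultaneous diagonalization described in Section 4.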
This view draws connections to the literature on solving polynomial systems, where many techniques might be useful [35, 18, 19]. Finally, through the connections we've drawn, it is our hope that Polymom can make the method of moments as turnkey as EM for more latent-variable models, as well as improve the statistical efficiency of method of moments procedures.

Acknowledgments. This work was supported by a Microsoft Faculty Research Fellowship to the third author and an NSERC PGS-D fellowship for the first author.

References

[1] D. M. Titterington, A. F. Smith, and U. E. Makov. Statistical analysis of finite mixture distributions, volume 7. Wiley, New York, 1985.

[2] G. McLachlan and D. Peel. Finite mixture models. John Wiley & Sons, 2004.

[3] K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.

[4] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory (COLT), 2012.

[5] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2012.

[6] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv, 2013.

[7] D. Hsu, S. M. Kakade, and P. Liang. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems (NIPS), 2012.

[8] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science (ITCS), 2013.

[9] A. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions.
In International Conference on Machine Learning (ICML), 2013.

[10] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In Symposium on Theory of Computing (STOC), pages 553–562, 2010.

[11] M. Hardt and E. Price. Sharp bounds for learning a mixture of two Gaussians. arXiv preprint arXiv:1404.4997, 2014.

[12] R. Ge, Q. Huang, and S. M. Kakade. Learning mixtures of Gaussians in high dimensions. arXiv preprint arXiv:1503.00424, 2015.

[13] B. Balle, X. Carreras, F. M. Luque, and A. Quattoni. Spectral learning of weighted automata: A forward-backward perspective. Machine Learning, 96(1):33–63, 2014.

[14] J. B. Lasserre. Moments, Positive Polynomials and Their Applications. Imperial College Press, 2011.

[15] J. B. Lasserre. A semidefinite programming approach to the generalized problem of moments. Mathematical Programming, 112(1):65–92, 2008.

[16] H. J. Stetter. Multivariate polynomial equations as matrix eigenproblems. WSSIA, 2:355–371, 1993.

[17] H. M. Möller and H. J. Stetter. Multivariate polynomial equations with multiple zeros solved by matrix eigenproblems. Numerische Mathematik, 70(3):311–329, 1995.

[18] B. Sturmfels. Solving systems of polynomial equations. American Mathematical Society, 2002.

[19] D. Henrion and J. Lasserre. Detecting global optimality and extracting solutions in GloptiPoly. In Positive polynomials in control, pages 293–310, 2005.

[20] A. Anandkumar, R. Ge, D. Hsu, and S. Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory (COLT), pages 867–881, 2013.

[21] A. Anandkumar, R. Ge, and M. Janzamin. Provable learning of overcomplete latent variable models: Semi-supervised and unsupervised settings. arXiv preprint arXiv:1408.0553, 2014.

[22] K. Viele and B. Tong. Modeling with mixtures of linear regressions.
Statistics and Computing, 12(4):315–330, 2002.

[23] B. Sturmfels. Algorithms in invariant theory. Springer Science & Business Media, 2008.

[24] R. M. Corless, K. Gatermann, and I. S. Kotsireas. Using symmetries in the eigenvalue method for polynomial systems. Journal of Symbolic Computation, 44(11):1536–1550, 2009.

[25] R. E. Curto and L. A. Fialkow. Solution of the truncated complex moment problem for flat data, volume 568. American Mathematical Society, 1996.

[26] J. B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.

[27] M. Laurent. Sums of squares, moment matrices and optimization over polynomials. In Emerging applications of algebraic geometry, pages 157–270, 2009.

[28] P. A. Parrilo and B. Sturmfels. Minimizing polynomial functions. Algorithmic and quantitative real algebraic geometry, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 60:83–99, 2003.

[29] P. A. Parrilo. Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming, 96(2):293–320, 2003.

[30] N. Ozay, M. Sznaier, C. M. Lagoa, and O. I. Camps. GPCA with denoising: A moments-based convex approach. In Computer Vision and Pattern Recognition (CVPR), pages 3209–3216, 2010.