{"title": "Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting", "book": "Advances in Neural Information Processing Systems", "page_first": 889, "page_last": 896, "abstract": null, "full_text": "Critical Lines in Symmetry of Mixture Models\n\nand its Application to Component Splitting\n\nKenji Fukumizu\n\nInstitute of Statistical\n\nMathematics\n\nTokyo 106-8569 Japan\nfukumizu@ism.ac.jp\n\nShotaro Akaho\n\nAIST\n\nTsukuba 305-8568 Japan\n\ns.akaho@aist.go.jp\n\nShun-ichi Amari\n\nRIKEN\n\nWako 351-0198 Japan\n\namari@brain.riken.go.jp\n\nAbstract\n\nWe show the existence of critical points as lines for the likelihood func-\ntion of mixture-type models. They are given by embedding of a critical\npoint for models with less components. A suf\ufb01cient condition that the\ncritical line gives local maxima or saddle points is also derived. Based\non this fact, a component-split method is proposed for a mixture of Gaus-\nsian components, and its effectiveness is veri\ufb01ed through experiments.\n\n1\n\nIntroduction\n\nThe likelihood function of a mixture model often has a complex shape so that calculation\nof an estimator can be dif\ufb01cult, whether the maximum likelihood or Bayesian approach\nis used. In the maximum likelihood estimation, convergence of the EM algorithm to the\nglobal maximum is not guaranteed, while it is a standard method. Investigation of the like-\nlihood function for mixture models is important to develop effective methods for learning.\n\nThis paper discusses the critical points of the likelihood function for mixture-type models\nby analyzing their hierarchical symmetric structure. As generalization of [1], we show that,\ngiven a critical point of the likelihood for the model with (H (cid:0) 1) components, duplication\nof any of the components gives critical points as lines for the model with H components.\nWe call them critical lines of mixture models. We derive also a suf\ufb01cient condition that\nthe critical lines give maxima or saddle points of the larger model, and show that given a\nmaximum of the likelihood for a mixture of Gaussian components, an appropriate split of\nany component always gives an ascending direction of the likelihood. Based on this theory,\nwe propose a stable method of splitting a component, which works effectively with the\nEM optimization for avoiding the dependency on the initial condition and improving the\noptimization. The usefulness of the algorithm is veri\ufb01ed through experiments.\n\n2 Hierarchical Symmetry and Critical Lines of Mixture Models\n\n2.1 Symmetry of Mixture models\n\nSuppose fH (x j (cid:18)(H)) is a mixture model with H components, de\ufb01ned by\n\nfH (x j (cid:18)(H)) =PH\n\nj=1cj p(x j (cid:12)j);\n\ncj = (cid:11)j=((cid:11)1 + (cid:1) (cid:1) (cid:1) + (cid:11)H );\n\n(1)\n\n\fwhere p(x j (cid:12)) is a probability density function with a parameter (cid:12). 
We write, for simplicity, $\alpha^{(H)} = (\alpha_1, \ldots, \alpha_H)$, $\beta^{(H)} = (\beta_1, \ldots, \beta_H)$, and $\theta^{(H)} = (\alpha^{(H)}, \beta^{(H)})$.

The key to our discussion is the following two symmetry properties, which are satisfied by mixture models:

(S-1) $f_H(x \mid \alpha^{(H)}, \beta^{(H-2)}, \beta_{H-1}, \beta_{H-1}) = f_{H-1}(x \mid \alpha^{(H-2)}, \alpha_{H-1} + \alpha_H, \beta^{(H-1)})$.

(S-2) There exists a function $A(\alpha)$ such that, for $j = H-1$ and $H$,

    $\frac{\partial f_H}{\partial \beta_j}(x \mid \alpha^{(H)}, \beta^{(H-2)}, \beta_{H-1}, \beta_{H-1}) = \frac{\alpha_j}{A(\alpha)} \frac{\partial f_{H-1}}{\partial \beta_{H-1}}(x \mid \alpha^{(H-2)}, \alpha_{H-1} + \alpha_H, \beta^{(H-1)}).$

In mixture models, the function $A(\alpha)$ is simply given by $A(\alpha) = \alpha_1 + \cdots + \alpha_H$. Hereafter, we discuss in general a model satisfying the assumptions (S-1) and (S-2). The results in Sections 2.1 and 2.2 depend only on these assumptions.[1] While in mixture models similar conditions hold for any choice of two components, we describe only the case of components $H-1$ and $H$ for simplicity. We write $\Theta_H$ for the space of the parameter $\theta^{(H)}$.

[Footnote 1: The results do not require that $p(x \mid \beta)$ be a density function. Thus, they extend easily to function fitting in regression, which yields the results on multilayer neural networks in [1].]

Another example satisfying (S-1) and (S-2) is Latent Dirichlet Allocation (LDA, [2]), which models data with a group structure (e.g., a document as a set of words). For $x = (x_1, \ldots, x_M)$, LDA with $H$ components is defined by

    $f_H(x \mid \theta^{(H)}) = \int_{\Delta^{H-1}} D_H(u^{(H)} \mid \alpha^{(H)}) \prod_{\nu=1}^{M} \Big( \sum_{j=1}^{H} u_j\, p(x_\nu \mid \beta_j) \Big)\, du^{(H)},$    (2)

where $D_H(u^{(H)} \mid \alpha^{(H)}) = \frac{\Gamma(\sum_j \alpha_j)}{\prod_j \Gamma(\alpha_j)} \prod_{j=1}^{H} u_j^{\alpha_j - 1}$ is the Dirichlet distribution over the $(H-1)$-dimensional simplex $\Delta^{H-1}$. It is easy to see that (S-1) and (S-2) hold for LDA by using Lemma 6 in the Appendix. LDA includes the mixture model of eq.(1) as the special case $M = 1$.

It is straightforward from (S-1) that, given a parameter $\theta^{(H-1)} = (\gamma^{(H-1)}, \eta^{(H-1)})$ of the model with $(H-1)$ components and a scalar $\lambda$, the parameter $\theta_\lambda \in \Theta_H$ defined by

    $\alpha_j = \gamma_j, \quad \beta_j = \eta_j \quad (1 \le j \le H-2), \qquad \alpha_{H-1} = \lambda \gamma_{H-1}, \quad \alpha_H = (1-\lambda) \gamma_{H-1}, \qquad \beta_{H-1} = \beta_H = \eta_{H-1}$    (3)

gives the same function as $f_{H-1}(x \mid \theta^{(H-1)})$. In mixture models and LDA, this corresponds to duplicating the $(H-1)$-th component while partitioning the mixing/Dirichlet parameter in the ratio $\lambda : (1-\lambda)$. Since $\lambda$ is arbitrary, a point of the smaller model is embedded into the larger model as a line in the parameter space $\Theta_H$. This implies that the parameter realizing $f_{H-1}(x \mid \theta^{(H-1)})$ is not identifiable in $\Theta_H$. Such singular structure of a model causes various interesting phenomena in estimation, learning, and generalization ([3]).
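The embedding of eq.(3) can be checked numerically; the following sketch (ours, not from the paper) duplicates the last component of a Gaussian mixture, splits its weight in the ratio $\lambda : (1-\lambda)$, and verifies that the density is unchanged for every $\lambda$.

    import numpy as np

    def mixture_pdf(x, alpha, mu, var):
        # eq.(1) with one-dimensional Gaussian components
        c = np.asarray(alpha, dtype=float)
        c = c / c.sum()
        comps = [np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
                 for m, v in zip(mu, var)]
        return sum(cj * g for cj, g in zip(c, comps))

    def embed(alpha, mu, var, lam):
        # eq.(3): duplicate the last component, splitting its weight lam : (1 - lam)
        return (alpha[:-1] + [lam * alpha[-1], (1 - lam) * alpha[-1]],
                mu + [mu[-1]], var + [var[-1]])

    alpha, mu, var = [1.0, 2.0], [0.0, 3.0], [1.0, 0.5]
    x = np.linspace(-4, 7, 201)
    f_small = mixture_pdf(x, alpha, mu, var)
    for lam in (0.1, 0.5, 0.9):
        a2, m2, v2 = embed(alpha, mu, var, lam)
        # the embedded parameter represents the same function for every lam
        assert np.allclose(mixture_pdf(x, a2, m2, v2), f_small)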
2.2 Critical Lines – Embedding of a Critical Point

Given a sample $\{X^{(1)}, \ldots, X^{(N)}\}$, we define an objective function for learning by

    $L_H(\theta^{(H)}) = \sum_{n=1}^{N} \Omega_n(f_H(X^{(n)} \mid \theta^{(H)})),$    (4)

where the $\Omega_n(f)$ are differentiable functions, which may depend on $n$. The objective of learning is to maximize $L_H$. If $\Omega_n(f) = \log f$ for all $n$, maximization of $L_H(\theta^{(H)})$ is equivalent to maximum likelihood estimation.

Suppose $\theta^{(H-1)}_* = (\gamma^*_1, \ldots, \gamma^*_{H-1}, \eta^*_1, \ldots, \eta^*_{H-1})$ is a critical point of $L_{H-1}(\theta^{(H-1)})$, that is, $\frac{\partial L_{H-1}}{\partial \theta^{(H-1)}}(\theta^{(H-1)}_*) = 0$. Embedding this point into $\Theta_H$ gives a critical line:

Theorem 1 (Critical Line). Suppose a model satisfies (S-1) and (S-2). Let $\theta^{(H-1)}_*$ be a critical point of $L_{H-1}$ with $\gamma^*_{H-1} \neq 0$, and let $\theta_\lambda$ be the parameter given by eq.(3) for $\theta^{(H-1)}_*$. Then $\theta_\lambda$ is a critical point of $L_H(\theta^{(H)})$ for all $\lambda$.

Proof. Although this is essentially the same as Theorem 1 in [1], the following proof gives better intuition. Let $(s, t, \zeta, \xi)$ be the reparametrization of $(\alpha_{H-1}, \alpha_H, \beta_{H-1}, \beta_H)$ defined by

    $t = \alpha_{H-1} - \alpha_H, \quad s = \alpha_{H-1} + \alpha_H, \quad \beta_{H-1} = \zeta + \alpha_H \xi, \quad \beta_H = \zeta - \alpha_{H-1} \xi.$    (5)

This is a one-to-one correspondence if $\alpha_{H-1} + \alpha_H \neq 0$. Note that $\xi = 0$ is equivalent to the condition $\beta_{H-1} = \beta_H$. Let $\omega = (\alpha^{(H-2)}, s, t, \beta^{(H-2)}, \zeta, \xi)$ be the new coordinate, $\ell_H(\omega)$ be the objective function eq.(4) under this parametrization, and $\omega_\lambda$ be the parameter corresponding to $\theta_\lambda$. Since, by definition, $\ell_H(\omega) = L_H(\alpha^{(H-2)}, \frac{s+t}{2}, \frac{s-t}{2}, \beta^{(H-2)}, \zeta + \frac{s-t}{2}\xi, \zeta - \frac{s+t}{2}\xi)$, condition (S-1) means

    $\ell_H(\alpha^{(H-2)}, s, t, \beta^{(H-2)}, \zeta, 0) = L_{H-1}(\alpha^{(H-2)}, s, \beta^{(H-2)}, \zeta).$    (6)

Then it is clear that the first derivatives of $\ell_H$ at $\omega_\lambda$ with respect to $\alpha^{(H-2)}$, $s$, $\beta^{(H-2)}$, and $\zeta$ are equal to those of $L_{H-1}(\theta^{(H-1)})$ at $\theta^{(H-1)}_*$, and they are zero. The derivative $\partial \ell_H(\omega_\lambda)/\partial t$ vanishes by eq.(6), and $\partial \ell_H(\omega_\lambda)/\partial \xi = 0$ by the following Lemma 2.

Lemma 2. Let $\mathcal{H}$ be the hyperplane given by $\{\omega \mid \xi = 0\}$. Then, for all $\omega_o \in \mathcal{H}$, we have

    $\frac{\partial f_H}{\partial \xi}(x \mid \omega_o) = 0.$    (7)

Proof. Straightforward from assumption (S-2) and $\frac{\partial}{\partial \xi} = \alpha_H \frac{\partial}{\partial \beta_{H-1}} - \alpha_{H-1} \frac{\partial}{\partial \beta_H}$.

Given that a maximum of $L_H$ is larger than that of $L_{H-1}$, Theorem 1 implies that the function $L_H$ always has critical points which are not the global maximum. These points lie on lines in the parameter space. Further embedding of the critical lines into larger models gives high-dimensional critical planes in the parameter space. This property is very general: in LDA and mixture models we need no assumptions on $p(x \mid \beta)$, and by the permutation symmetry of components there are many choices of embedding, which induces many critical lines and planes for $L_H$.
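Theorem 1 can also be verified numerically; the following sketch (ours, with our function names) fits a single Gaussian by maximum likelihood, embeds it by eq.(3), and checks that the finite-difference gradient of $L_2$ vanishes for an arbitrary $\lambda$.

    import numpy as np

    def gauss(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def L2(theta, X):
        # log-likelihood of a 2-component mixture; theta = (a1, a2, mu1, mu2, v1, v2)
        a1, a2, m1, m2, v1, v2 = theta
        c1, c2 = a1 / (a1 + a2), a2 / (a1 + a2)
        return np.sum(np.log(c1 * gauss(X, m1, v1) + c2 * gauss(X, m2, v2)))

    rng = np.random.default_rng(0)
    X = rng.normal(1.0, 2.0, size=500)
    mu_star, v_star = X.mean(), X.var()      # ML estimate for H - 1 = 1 component
    lam = 0.3                                # any lambda lies on the critical line
    theta = np.array([lam, 1 - lam, mu_star, mu_star, v_star, v_star])

    eps = 1e-5                               # central finite differences
    grad = np.array([(L2(theta + eps * e, X) - L2(theta - eps * e, X)) / (2 * eps)
                     for e in np.eye(6)])
    print(grad)   # all entries are ~0 up to finite-difference error (Theorem 1)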
2.3 Embedding of a Maximum Point in LDA and Mixture Models

The next question is whether the critical lines obtained from a maximum of $L_{H-1}$ give maxima of $L_H$. The answer requires information on the second derivatives and depends on the model. We show a general result for LDA, and the corresponding result for mixture models as its corollary.

Theorem 3. Suppose the model is the LDA defined by eq.(2). Let $\theta^{(H-1)}_*$ be an isolated maximum point of $L_{H-1}$, and $\theta_\lambda$ its embedding given by eq.(3). Define a symmetric matrix $R$ of size $\dim \beta$ by

    $R = \sum_{n=1}^{N} \Omega'_n(f_{H-1}(X^{(n)} \mid \theta^{(H-1)}_*)) \Big\{ \sum_{\mu=1}^{M} I^{(n)}_\mu \frac{\partial^2 p(X^{(n)}_\mu \mid \eta^*_{H-1})}{\partial \beta \partial \beta} + \frac{1}{\sum_{j=1}^{H-1} \gamma^*_j + 1} \sum_{\mu=1}^{M} \sum_{\substack{\tau=1 \\ \tau \neq \mu}}^{M} J^{(n)}_{\mu,\tau} \frac{\partial p(X^{(n)}_\mu \mid \eta^*_{H-1})}{\partial \beta} \frac{\partial p(X^{(n)}_\tau \mid \eta^*_{H-1})}{\partial \beta}^T \Big\},$

where $\Omega'(f)$ denotes the derivative of $\Omega(f)$ with respect to $f$, and

    $I^{(n)}_\mu = \int_{\Delta^{H-2}} D_{H-1}(u \mid \gamma^*_1, \ldots, \gamma^*_{H-2}, \gamma^*_{H-1} + 1) \prod_{\nu \neq \mu} \Big( \sum_{j=1}^{H-1} u_j\, p(X^{(n)}_\nu \mid \beta_j) \Big)\, du^{(H-1)},$

    $J^{(n)}_{\mu,\tau} = \int_{\Delta^{H-2}} D_{H-1}(u \mid \gamma^*_1, \ldots, \gamma^*_{H-2}, \gamma^*_{H-1} + 2) \prod_{\nu \neq \mu, \tau} \Big( \sum_{j=1}^{H-1} u_j\, p(X^{(n)}_\nu \mid \beta_j) \Big)\, du^{(H-1)}.$

Then we have:
(i) If $R$ is negative definite, the parameter $\theta_\lambda$ is a maximum of $L_H$ for all $\lambda \in (0, 1)$.
(ii) If $R$ has a positive eigenvalue, the parameter $\theta_\lambda$ is a saddle point for all $\lambda \in (0, 1)$.

Remark: The conditions on $R$ depend only on the parameter $\theta^{(H-1)}_*$.

Proof. We use the parametrization $\omega$ defined by eq.(5). For each $t$, let $\mathcal{H}_t$ be the hyperplane with $t$ fixed, and let $\tilde{L}_{H,t}$ be the function $L_H$ restricted to $\mathcal{H}_t$. The hyperplane $\mathcal{H}_t$ is a slice transversal to the critical line, along which $L_H$ has the same value. Therefore, if the Hessian matrix of $\tilde{L}_{H,t}$ on $\mathcal{H}_t$ is negative definite at the intersection $\omega_\lambda$ ($\lambda = (t+1)/2$), the point is a maximum of $L_H$, and if the Hessian has a positive eigenvalue, $\omega_\lambda$ is a saddle point. Since in the $\omega$ coordinate we have $\tilde{L}_{H,t}(\alpha^{(H-2)}, s, \beta^{(H-2)}, \zeta, 0) = L_{H-1}(\alpha^{(H-2)}, s, \beta^{(H-2)}, \zeta)$, the Hessian of $\tilde{L}_{H,t}$ at $\omega_\lambda$ is given by

    $\mathrm{Hess}\, \tilde{L}_{H,t}(\omega_\lambda) = \begin{pmatrix} \mathrm{Hess}\, L_{H-1}(\theta^{(H-1)}_*) & O \\ O & \frac{\partial^2 \tilde{L}_{H,t}(\omega_\lambda)}{\partial \xi \partial \xi} \end{pmatrix}.$    (8)

The off-diagonal blocks are zero because $\frac{\partial^2 \tilde{L}_{H,t}(\omega_\lambda)}{\partial \xi \partial \omega_a} = 0$ for $\omega_a \neq \xi$ by Lemma 2. By assumption, $\mathrm{Hess}\, L_{H-1}(\theta^{(H-1)}_*)$ is negative definite. Noting that the terms including $\partial f_H(X^{(n)}; \theta_\lambda)/\partial \xi$ vanish by Lemma 2, it is easy to obtain $\frac{\partial^2 \tilde{L}_{H,t}(\omega_\lambda)}{\partial \xi \partial \xi} = \lambda(1-\lambda)(\gamma^*_{H-1})^3 / \big(\sum_{j=1}^{H-1} \gamma^*_j\big) \times R$ by using Lemma 6 and the definition of $\xi$.
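The block structure of eq.(8) can be seen numerically; the following sketch (ours, not from the paper, with the simplifying assumption of $H = 2$ univariate Gaussian components of known unit variance, so that $\beta$ reduces to $\mu$) evaluates a finite-difference Hessian of $\ell_H$ in the $(s, t, \zeta, \xi)$ coordinates of eq.(5) on the critical line.

    import numpy as np

    def gauss(x, mu):
        # Gaussian density with known unit variance; beta reduces to mu
        return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

    rng = np.random.default_rng(4)
    X = rng.normal(0.5, 2.0, size=400)
    mu_star = X.mean()                      # maximum of L_1 for the single component

    def ell(w):
        # objective in the (s, t, zeta, xi) coordinates of eq.(5) for H = 2
        s, t, zeta, xi = w
        a1, a2 = (s + t) / 2, (s - t) / 2
        f = (a1 / s) * gauss(X, zeta + a2 * xi) + (a2 / s) * gauss(X, zeta - a1 * xi)
        return np.sum(np.log(f))

    lam = 0.3
    w0 = np.array([1.0, 2 * lam - 1.0, mu_star, 0.0])  # omega_lambda on the line
    eps, I = 1e-4, np.eye(4)
    hess = np.array([[(ell(w0 + eps * (I[i] + I[j])) - ell(w0 + eps * (I[i] - I[j]))
                       - ell(w0 - eps * (I[i] - I[j])) + ell(w0 - eps * (I[i] + I[j])))
                      / (4 * eps ** 2) for j in range(4)] for i in range(4)])
    # the xi row and column vanish except the (xi, xi) entry, which is positive
    # here (overdispersed data), so omega_lambda is a saddle, as in Theorem 3(ii)
    print(np.round(hess, 2))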
By setting $M = 1$ in the LDA model, we obtain the sufficient conditions for mixture models.

Corollary 4. For a mixture model, the same assertions as in Theorem 3 hold with

    $\tilde{R} = \sum_{n=1}^{N} \Omega'_n(f_{H-1}(X^{(n)} \mid \theta^{(H-1)}_*)) \frac{\partial^2 p(X^{(n)} \mid \eta^*_{H-1})}{\partial \beta \partial \beta}.$    (9)

Proof. For $M = 1$, $J^{(n)}_{\mu,\tau} = 0$ and $I^{(n)} = \gamma^*_{H-1} / \sum_{j=1}^{H-1} \gamma^*_j$. The assertion is obvious.

2.4 Critical Lines in Various Models

We further investigate the critical lines of specific models. Hereafter we consider maximum likelihood estimation, setting $\Omega_n(f) = \log f$ for all $n$.

Gaussian Mixture, Mixture of Factor Analyzers, and Mixture of PCA

Assume that each component is the $D$-dimensional Gaussian density with mean $\mu$ and variance-covariance matrix $V$ as parameters, denoted by $\phi(x; \mu, V)$. The matrix $\tilde{R}$ in eq.(9) has the form $\tilde{R} = \begin{pmatrix} S_2 & S_3 \\ S_3^T & S_4 \end{pmatrix}$, where $S_2$, $S_3$, and $S_4$ correspond to the second derivatives with respect to $(\mu, \mu)$, $(\mu, V)$, and $(V, V)$, respectively. It is well known that the second derivative $\partial^2 \phi / \partial \mu \partial \mu$ of a Gaussian density is equal to the first derivative $\partial \phi / \partial V$. Then $S_2$ is equal to zero by the condition of a critical point. If the data is randomly generated, $S_3$ and $S_4$ are of full rank almost surely. A matrix of this type necessarily has a positive eigenvalue. It is not difficult to extend this discussion to models with scalar or diagonal variance-covariance matrices as variable parameters.

Similar arguments hold for mixtures of factor analyzers (MFA, [4]) and mixtures of probabilistic PCA (MPCA, [5]). In factor analyzers and probabilistic PCA, the variance-covariance matrix is restricted to the form

    $V = F F^T + S,$

where $F$ is a factor loading matrix of rank $k$ and $S$ is a diagonal or scalar matrix. Because the first derivative of $\phi(x; \mu, F F^T + S)$ with respect to $F$ is $\frac{\partial \phi(x; \mu, F F^T + S)}{\partial V} F$, the block of $\tilde{R}$ corresponding to the second derivatives on $\mu$ is not of full rank. In a similar manner to Gaussian mixtures, $\tilde{R}$ has a positive eigenvalue. In summary, we have the following:

Theorem 5. Suppose the model is a Gaussian mixture, MFA, or MPCA. If $\tilde{R}$ is of full rank, every point $\theta_\lambda$ on the critical line is a saddle point of $L_H$.

This theorem means that if we have the maximum likelihood estimator for $H - 1$ components, we can find an ascending direction of the likelihood by splitting a component and modifying its mean and variance-covariance matrix in the direction of the positive eigenvector. This leads to a component-splitting method, which is presented in Section 3.1.
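As a numerical illustration of Corollary 4 and Theorem 5 (ours, not from the paper): for a one-dimensional Gaussian component with $\beta = (\mu, v)$, the matrix $\tilde{R}$ of eq.(9) can be evaluated at the single-component maximum likelihood fit by finite differences, with weights $\Omega'_n(f) = 1/f$ for the log-likelihood; the function names are ours.

    import numpy as np

    def phi(x, mu, v):
        return np.exp(-(x - mu) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    rng = np.random.default_rng(1)
    X = rng.gamma(2.0, 1.0, size=2000)       # skewed data, poorly fit by one Gaussian
    mu_s, v_s = X.mean(), X.var()            # ML fit with a single component
    w = 1.0 / phi(X, mu_s, v_s)              # Omega'_n(f) = 1/f for the log-likelihood

    def hess_phi(x, mu, v, eps=1e-4):
        # second derivatives of phi w.r.t. beta = (mu, v) by central differences
        d_mm = (phi(x, mu + eps, v) - 2 * phi(x, mu, v) + phi(x, mu - eps, v)) / eps ** 2
        d_vv = (phi(x, mu, v + eps) - 2 * phi(x, mu, v) + phi(x, mu, v - eps)) / eps ** 2
        d_mv = (phi(x, mu + eps, v + eps) - phi(x, mu + eps, v - eps)
                - phi(x, mu - eps, v + eps) + phi(x, mu - eps, v - eps)) / (4 * eps ** 2)
        return d_mm, d_mv, d_vv

    d_mm, d_mv, d_vv = hess_phi(X, mu_s, v_s)
    R = np.array([[np.sum(w * d_mm), np.sum(w * d_mv)],
                  [np.sum(w * d_mv), np.sum(w * d_vv)]])
    # the (mu, mu) block vanishes at the critical point, and R has a positive
    # eigenvalue, so the embedded point is a saddle (Theorem 5)
    print(np.linalg.eigvalsh(R))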
Latent Dirichlet Allocation

We consider LDA with multinomial components. Using the $D$-dimensional random vector $x = (x_a) \in \{(1, 0, \ldots, 0)^T, \ldots, (0, \ldots, 0, 1)^T\}$, which indicates the chosen element, the multinomial distribution over $D$ elements is expressed as an exponential family by

    $p(x \mid \beta) = \prod_{a=1}^{D} (p_a)^{x_a} = \exp\Big\{ \sum_{a=1}^{D-1} \beta_a x_a - \log\Big(1 + \sum_{a=1}^{D-1} e^{\beta_a}\Big) \Big\},$

where $p_a$ is the expectation of $x_a$, and $\beta \in \mathbb{R}^{D-1}$ is the natural parameter given by $\beta_a = \log(p_a / p_D)$. It is easy to obtain

    $R = \sum_{n=1}^{N} \Omega'(f_{H-1}(X^{(n)} \mid \theta^{(H-1)}_*)) \sum_{\mu=1}^{M} \sum_{\tau \neq \mu} J^{(n)}_{\mu,\tau}\, p(X^{(n)}_\mu \mid \eta^*_{H-1})\, p(X^{(n)}_\tau \mid \eta^*_{H-1}) \times (\tilde{X}^{(n)}_\mu - p^*_{(H-1)})(\tilde{X}^{(n)}_\tau - p^*_{(H-1)})^T,$    (10)

where $\tilde{X}^{(n)}_\nu$ is the truncated $(D-1)$-dimensional vector, and $p^*_{(H-1)} \in (0, 1)^{D-1}$ is the expectation parameter of the $(H-1)$-th component of $\theta^{(H-1)}_*$.

In general, the $J^{(n)}_{\mu,\tau}$ are intractable in large problems. We explain a simple case of $H = 2$ and $M = D$. Let $\hat{p}$ be the frequency vector of the $D$ elements, which is the maximum likelihood estimator of the single multinomial model. In this case, we have $J^{(n)}_{\mu,\tau} = 1$ and

    $R = \sum_{n=1}^{N} \Big\{ \sum_{\mu,\tau=1}^{M} (\tilde{X}^{(n)}_\mu - \hat{p})(\tilde{X}^{(n)}_\tau - \hat{p})^T - \sum_{\mu=1}^{M} (\tilde{X}^{(n)}_\mu - \hat{p})(\tilde{X}^{(n)}_\mu - \hat{p})^T \Big\}.$

First, suppose we have a data set with $X^{(n)}_\nu = e_\nu$ for all $n$ and $1 \le \nu \le D = M$, where $e_j$ is the $D$-dimensional vector with the $j$-th component equal to 1 and the others zero. Then we have $\hat{p} = (1/D, \ldots, 1/D)$ and $\sum_{\mu=1}^{D} (\tilde{X}^{(n)}_\mu - \hat{p}) = 0$, which means $R < 0$. The critical line gives maxima for LDA with $H = 2$. Next, suppose the data consists of $D$ groups, and every data point in the $j$-th group is given by $X^{(n)}_\nu = e_j$. While we again have $\hat{p} = (1/D, \ldots, 1/D)$, the matrix $R$ is $\sum_{j=1}^{D} (N/D) \times D(D-1)(\tilde{e}_j - \hat{p})(\tilde{e}_j - \hat{p})^T > 0$. Thus, all the points on the critical lines are saddle points. These examples illustrate two extreme cases: in the former there is no advantage in using two components, because all the data $X^{(n)}$ are identical, while in the latter multiple components fit the variety of the $X^{(n)}$ better.
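A quick numerical check (ours) of the exponential-family form above: the natural parameter $\beta_a = \log(p_a/p_D)$ reproduces $\prod_a p_a^{x_a}$ for every indicator $x$; both function names are ours.

    import numpy as np

    def multinomial_pdf(x, p):
        # p(x | beta) = prod_a p_a^{x_a} for a one-hot vector x
        return float(np.prod(np.asarray(p) ** np.asarray(x)))

    def multinomial_natural(x, beta):
        # exp( sum_{a<D} beta_a x_a - log(1 + sum_{a<D} e^{beta_a}) )
        x = np.asarray(x, dtype=float)
        return float(np.exp(x[:-1] @ beta - np.log1p(np.exp(beta).sum())))

    p = np.array([0.2, 0.5, 0.3])            # D = 3 elements
    beta = np.log(p[:-1] / p[-1])            # beta_a = log(p_a / p_D)
    for a in range(3):
        x = np.eye(3)[a]                     # one-hot indicator of element a
        assert np.isclose(multinomial_pdf(x, p), multinomial_natural(x, beta))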
3 Component Splitting Method in Mixture of Gaussian Components

3.1 EM with Component Splitting

It is well known that the EM algorithm suffers from a strong dependency on initialization. In addition, because the likelihood of a mixture of Gaussian components is not upper bounded for small variances, we should use an optimization technique that yields an appropriate maximum. Sequential splitting of components can give a solution to these problems. From Theorem 5, a stable and effective way of splitting a Gaussian component so as to increase the likelihood is derived. We propose EM with component splitting, which adds one component at a time after maximizing the likelihood at each size. Ueda et al. ([6]) propose the Split and Merge EM algorithm, in which components are repeatedly split and merged in triplets, keeping the total number fixed. While their method works well, it requires a large number of EM trials for candidate triplets, and its splitting rule is heuristic. Our splitting method is well grounded in theory, and EM with splitting gives a series of estimators for all model sizes in a single run.

Algorithm 1 describes the learning procedure. We show only the case of a mixture of Gaussians; the exact algorithm for mixtures of PCA/FA will be presented in a forthcoming paper. It is noteworthy that in splitting a component, not only the means but also the variance-covariance matrices must be modified. The simple additive rule $V_{new} = V_{old} + \Delta V$ tends to fail, because it may make the matrix non-positive definite. To solve this problem, we use a Lie-algebra expression to add a vector in an ascending direction. Let $V = U \Lambda U^T$ be the diagonalization of $V$, and consider $V(W) = U e^W \Lambda e^W U^T$ for a symmetric matrix $W$. This gives a local coordinate system on the positive definite matrices around $V = V(0)$. Modifying $V$ through $W$ gives a stable way of updating variance-covariance matrices.

Algorithm 1: EM with component splitting for a Gaussian mixture

1. Initialization: calculate the sample mean $\mu_1$ and variance-covariance matrix $V_1$.
2. $H := 1$.
3. For all $1 \le h \le H$, diagonalize $V_{h*}$ as $V_{h*} = U_h \Lambda_h U_h^T$, and calculate $\tilde{R}_h$ according to eq.(12) in the Appendix.
4. For $1 \le h \le H$, calculate the eigenvector $(r_h, W_h)$ of $\tilde{R}_h$ corresponding to the largest eigenvalue.
5. For $1 \le h \le H$, optimize $\beta$ by line search to maximize the likelihood for

       $c_h = \tfrac{1}{2} c_{h*}, \quad \mu_h = \mu_{h*} - \beta r_h, \quad V_h = U_h e^{-\beta W_h} \Lambda_h e^{-\beta W_h} U_h^T;$
       $c_{H+1} = \tfrac{1}{2} c_{h*}, \quad \mu_{H+1} = \mu_{h*} + \beta r_h, \quad V_{H+1} = U_h e^{\beta W_h} \Lambda_h e^{\beta W_h} U_h^T.$    (11)

   Let $\beta^o_h$ be the optimizer and $L_h$ be the likelihood.
6. For $h^\dagger := \arg\max_h L_h$, split the $h^\dagger$-th component according to eq.(11) with $\beta^o_{h^\dagger}$.
7. Optimize the parameter $\theta^{(H+1)}$ using the EM algorithm. Let $\theta^{(H+1)}_*$ be the result.
8. If $H + 1 = MAX\_H$, then END. Otherwise, $H := H + 1$ and go to 3.
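A sketch of the split in eq.(11) (ours, not the paper's implementation), assuming the ascending direction $(r_h, W_h)$ has already been extracted from the top eigenvector of $\tilde{R}_h$; expm_sym and split_component are our names. Both children remain symmetric positive definite for any step size $\beta$, which is the point of the Lie-algebra update.

    import numpy as np

    def expm_sym(A):
        # matrix exponential of a symmetric matrix via its eigendecomposition
        w, Q = np.linalg.eigh(A)
        return (Q * np.exp(w)) @ Q.T

    def split_component(c, mu, V, r, W, beta):
        # eq.(11): split (c, mu, V) into two children along (r, W)
        lam, U = np.linalg.eigh(V)           # diagonalization V = U diag(lam) U^T
        Lam = np.diag(lam)
        E_minus, E_plus = expm_sym(-beta * W), expm_sym(beta * W)
        child1 = (0.5 * c, mu - beta * r, U @ E_minus @ Lam @ E_minus @ U.T)
        child2 = (0.5 * c, mu + beta * r, U @ E_plus @ Lam @ E_plus @ U.T)
        return child1, child2

    # toy usage with an arbitrary symmetric direction W
    rng = np.random.default_rng(2)
    A = rng.normal(size=(2, 2))
    V = A @ A.T + np.eye(2)
    W = np.array([[0.0, 1.0], [1.0, 0.0]])
    (c1, m1, V1), (c2, m2, V2) = split_component(1.0, np.zeros(2), V,
                                                 rng.normal(size=2), W, beta=0.3)
    print(np.linalg.eigvalsh(V1), np.linalg.eigvalsh(V2))  # strictly positive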
[Figure 1: Spiral data; panels (a) Data, (b) Success, (c) Failure. In (b) and (c), the lines represent the factor loading vectors $F_h$ and $-F_h$ at the mean values, and the radius of a sphere is the scalar part of the variance.]

3.2 Experimental Results

We show through experiments how effectively the proposed EM with component splitting maximizes the likelihood. In the first experiment, a mixture of PCA with 8 components of rank 1 is employed to fit 150 synthesized data points generated along a piecewise linear spiral (Fig.1). Table 1-(a) shows the results over 30 trials with different random numbers. We use the on-line EM algorithm ([7]), presenting the data one by one in a random order. EM with random initialization reaches the best state (Fig.1-(b)) only 6 times, while EM with component splitting achieves it 26 times. Fig.1-(c) shows an example of failure.

The next experiment is an image compression problem, in which the image "Lenna" of 160x160 pixels (Fig.2) is used. The image is partitioned into 20x20 blocks of 8x8 pixels, which are regarded as 400 data points in $\mathbb{R}^{64}$. We use the mixture of PCA with 10 components of rank 4, and obtain a compressed image by $\hat{X} = F_h (F_h^T F_h)^{-1} F_h^T X$, where $X$ is a 64-dimensional block and $h$ indicates the component with the shortest Euclidean distance $\|X - \mu_h\|$. Table 1-(b) shows the residual square error (RSE), $\sum_{j=1}^{400} \|X_j - \hat{X}_j\|^2$, which measures the quality of the compression. In both experiments, we can see the better optimization performance of the proposed algorithm.

[Figure 2: "Lenna" (160x160 pixels).]

Table 1: Experimental results. EM is the conventional EM with random initialization, and EMCS is the proposed EM with component splitting.

(a) Likelihood for the spiral data (30 runs)

            EM                  EMCS
  Best      -534.9 (6 times)    -534.9 (26 times)
  Worst     -648.1              -587.9
  Av.       -583.9              -541.3

(b) RSE for "Lenna" (10 runs), in units of 10^4

            EM      EMCS
  Best      5.94    5.38
  Worst     6.40    6.12
  Av.       6.15    5.78
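The compression rule described above can be written compactly; the following sketch (ours) uses random stand-in parameters in place of the fitted MPCA components, so only the projection and the RSE computation are meaningful.

    import numpy as np

    def reconstruct(X, mus, Fs):
        # assign block X to the component h with the smallest ||X - mu_h||,
        # then project: X_hat = F_h (F_h^T F_h)^{-1} F_h^T X
        h = int(np.argmin([np.linalg.norm(X - mu) for mu in mus]))
        F = Fs[h]
        return F @ np.linalg.solve(F.T @ F, F.T @ X)

    rng = np.random.default_rng(3)
    mus = [rng.normal(size=64) for _ in range(10)]        # 10 components in R^64
    Fs = [rng.normal(size=(64, 4)) for _ in range(10)]    # rank-4 factor loadings
    blocks = [rng.normal(size=64) for _ in range(400)]    # 400 blocks of 8x8 pixels
    rse = sum(np.linalg.norm(X - reconstruct(X, mus, Fs)) ** 2 for X in blocks)
    print(rse)   # residual square error, the quality measure of Table 1-(b)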
4 Discussions

In EM with component splitting, we obtain estimators up to the specified number of components. We need a model selection technique to choose the best one, which is another important problem. We do not discuss it in this paper, because our method can be combined with many techniques that select a model after the estimators have been obtained. However, we should note that some well-known criteria such as AIC and MDL, which are based on statistical asymptotic theory, cannot be applied to mixture models because of the unidentifiability of the parameter. Further studies are necessary on model selection for mixture models.

Although the computation of the matrix $R$ is not cheap for a mixture of Gaussian components, full variance-covariance matrices are not always necessary in practical problems, which can reduce the computation drastically. Methods for reducing the computational cost also deserve further investigation.

In selecting a component to split, we run the line search for all components and choose the one giving the largest likelihood. While this works well in our experiments, the proposed component-splitting method can be combined with other criteria for selecting a component. One of them is to select the component giving the largest eigenvalue of $\tilde{R}_h$. In Gaussian mixture models this is very natural: the block of the second derivatives with respect to $V$ in $\tilde{R}$ is equal to the weighted fourth cumulant, and a component with a large cumulant should be split. However, in mixtures of FA and PCA this does not necessarily work well, because the decomposition $V = F F^T + S$ does not give a natural parametrization. Although we have discussed only local properties, a method incorporating global information might be preferable. These are left as future work.

Appendix

Lemma 6. Suppose $\varphi_H(u^{(H)}, \beta^{(H)})$ satisfies the assumption (S-1). Define $I_H(\alpha^{(H)}, \beta^{(H)}) = \int_{\Delta^{H-1}} \varphi(u^{(H)}, \beta^{(H)})\, D_H(u^{(H)} \mid \alpha^{(H)})\, du^{(H)}$. Then $I_H$ also satisfies (S-1):

    $I_H(\alpha^{(H)}, \beta^{(H-2)}, \beta_{H-1}, \beta_{H-1}) = I_{H-1}(\alpha^{(H-2)}, \alpha_{H-1} + \alpha_H, \beta^{(H-1)}).$

Proof. Direct calculation.

Matrix $\tilde{R}_h$ for the Gaussian mixture

We omit the index $h$ for simplicity and use Einstein's convention. Let $U = (u_1, \ldots, u_D)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_D)$. For $V(W) = U e^W \Lambda e^W U^T$, we have $\partial V(O)/\partial W_{ab} = (\lambda_a + (1 - \delta_{ab})\lambda_b)(u_a u_b^T + u_b u_a^T)$, where $\delta_{ab}$ is Kronecker's delta. Let $T^{(3)}$ and $T^{(4)}$ be the weighted third and fourth sample moments, respectively, with weights $\phi(x^{(n)}; \mu_*, V_*) / f_{H-1}(x^{(n)}; \theta^{(H-1)}_*)$. The tensors $\tilde{T}_{(3)}$ and $\tilde{T}_{(4)}$ are defined by $\tilde{T}^{abc}_{(3)} = V^{ap} V^{bq} V^{cr} T^{(3)}_{pqr}$ and $\tilde{T}^{abcd}_{(4)} = V^{ap} V^{bq} V^{cr} V^{ds} T^{(4)}_{pqrs}$, respectively, where $V^{ap}$ is the $(a,p)$-component of $V^{-1}$. A direct calculation shows that the matrix $\tilde{R} = \begin{pmatrix} O & B \\ B^T & C \end{pmatrix}$, where the decomposition corresponds to $\beta = (\mu, W)$, is given by

    $B_{\mu_a, W_{bc}} = (\lambda_b + (1 - \delta_{bc})\lambda_c)\, u_b^T \tilde{T}^{\cdot \cdot a}_{(3)} u_c,$
    $C_{W_{ab}, W_{cd}} = (\lambda_a u_b u_a^T + (1 - \delta_{ab}) \lambda_b u_a u_b^T)_{pq} (\lambda_c u_d u_c^T + (1 - \delta_{cd}) \lambda_d u_c u_d^T)_{rs} \times \big\{ \tilde{T}^{pqrs}_{(4)} - (V^{pq} V^{rs} + V^{pr} V^{qs} + V^{ps} V^{qr}) \big\}.$    (12)

In the above equation, $\tilde{T}^{\cdot \cdot a}_{(3)}$ is the $D \times D$ matrix obtained from $\tilde{T}^{bca}_{(3)}$ with $a$ fixed.

References

[1] K. Fukumizu and S. Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317-327, 2000.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Advances in Neural Information Processing Systems, 14. MIT Press, 2002.
[3] S. Amari, H. Park, and T. Ozeki. Geometrical singularities in the neuromanifold of multilayer perceptrons. Advances in Neural Information Processing Systems, 14. MIT Press, 2002.
[4] Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, 1997.
[5] M. Tipping and C. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11:443-482, 1999.
[6] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton. SMEM algorithm for mixture models. Neural Computation, 12(9):2109-2128, 2000.
[7] M. Sato and S. Ishii. On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2):2209-2225, 2000.
", "award": [], "sourceid": 2258, "authors": [{"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}, {"given_name": "Shotaro", "family_name": "Akaho", "institution": null}, {"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}]}