{"title": "Unsupervised Classification with Non-Gaussian Mixture Models Using ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 508, "page_last": 514, "abstract": null, "full_text": "Unsupervised Classification with Non-Gaussian Mixture Models using ICA \n\nTe-Won Lee, Michael S. Lewicki and Terrence Sejnowski \n\nHoward Hughes Medical Institute \n\nComputational Neurobiology Laboratory \n\nThe Salk Institute \n\n10010 N. Torrey Pines Road \n\nLa Jolla, California 92037, USA \n\n{tewon,lewicki,terry}@salk.edu \n\nAbstract \n\nWe present an unsupervised classification algorithm based on an \nICA mixture model. The ICA mixture model assumes that the \nobserved data can be categorized into several mutually exclusive \ndata classes in which the components in each class are generated \nby a linear mixture of independent sources. The algorithm finds \nthe independent sources, the mixing matrix for each class and also \ncomputes the class membership probability for each data point. \nThis approach extends the Gaussian mixture model so that the \nclasses can have non-Gaussian structure. We demonstrate that \nthis method can learn efficient codes to represent images of natural \nscenes and text. The learned classes of basis functions yield a better \napproximation of the underlying distributions of the data, and thus \ncan provide greater coding efficiency. We believe that this method \nis well suited to modeling structure in high-dimensional data and \nhas many potential applications. \n\n1 Introduction \n\nRecently, Blind Source Separation (BSS) by Independent Component Analysis \n(ICA) has shown promise in signal processing applications including speech enhancement \nsystems, telecommunications and medical signal processing. ICA is a \ntechnique for finding a linear non-orthogonal coordinate system in multivariate data. 
\nThe directions of the axes of this coordinate system are determined by the data's \nsecond- and higher-order statistics. The goal of ICA is to linearly transform the \ndata such that the transformed variables are as statistically independent from each \nother as possible (Bell and Sejnowski, 1995; Cardoso and Laheld, 1996; Lee et al., \n1999a). ICA generalizes the technique of Principal Component Analysis (PCA) \nand, like PCA, has proven a useful tool for finding structure in data. \n\nOne limitation of ICA is the assumption that the sources are independent. Here, \nwe present an approach for relaxing this assumption using mixture models. In a \nmixture model (Duda and Hart, 1973), the observed data can be categorized into \nseveral mutually exclusive classes. When the class variables are modeled as multivariate \nGaussian densities, it is called a Gaussian mixture model. We generalize \nthe Gaussian mixture model by modeling each class with independent variables \n(ICA mixture model). This allows modeling of classes with non-Gaussian (e.g., \nplatykurtic or leptokurtic) structure. An algorithm for learning the parameters is \nderived using the expectation maximization (EM) algorithm. In Lee et al. (1999c), \nwe demonstrated that this approach showed improved performance in data classification \nproblems. Here, we apply the algorithm to learning efficient codes for \nrepresenting different types of images. \n\n2 The ICA Mixture Model \n\nWe assume that the data were generated by a mixture density (Duda and Hart, \n1973): \n\n$$p(x \\mid \\Theta) = \\sum_{k=1}^{K} p(x \\mid C_k, \\theta_k)\\, p(C_k), \\qquad (1)$$ \n\nwhere $\\Theta = (\\theta_1, \\ldots, \\theta_K)$ are the unknown parameters for each $p(x \\mid C_k, \\theta_k)$, called \nthe component densities. We further assume that the number of classes, $K$, and \nthe a priori probability, $p(C_k)$, for each class are known. 
In the case of a Gaussian \nmixture model, $p(x \\mid C_k, \\theta_k) \\propto N(\\mu_k, \\Sigma_k)$. Here we assume that the form of the \ncomponent densities is non-Gaussian and the data within each class are described \nby an ICA model, \n\n$$x_k = A_k s_k + b_k, \\qquad (2)$$ \n\nwhere $A_k$ is an $N \\times M$ scalar matrix (called the basis or mixing matrix) and $b_k$ \nis the bias vector for class $k$. The vector $s_k$ is called the source vector (these \nare also the coefficients for each basis vector). It is assumed that the individual \nsources $s_i$ within each class are mutually independent across a data ensemble. For \nsimplicity, we consider the case where $A_k$ is full rank, i.e. the number of sources \n($M$) is equal to the number of mixtures ($N$). Figure 1 shows a simple example of \na dataset that can be described by an ICA mixture model. Each class was generated \nfrom eq.2 using a different $A$ and $b$. Class (o) was generated by two uniformly \ndistributed sources, whereas class (+) was generated by two Laplacian distributed \nsources ($p(s) \\propto \\exp(-|s|)$). The task is to model the unlabeled data points and \nto determine the parameters for each class, $A_k$ and $b_k$, and the probability of each \nclass, $p(C_k \\mid x, \\theta_{1:K})$, for each data point. A learning algorithm can be derived by an \nexpectation maximization approach (Ghahramani, 1994) and implemented in the \nfollowing steps: \n\n\u2022 Compute the log-likelihood of the data for each class: \n\n$$\\log p(x \\mid C_k, \\theta_k) = \\log p(s_k) - \\log(\\det |A_k|), \\qquad (3)$$ \n\nwhere $\\theta_k = \\{A_k, b_k, s_k\\}$. \n\n\u2022 Compute the probability for each class given the data vector $x$: \n\n$$p(C_k \\mid x, \\theta_{1:K}) = \\frac{p(x \\mid \\theta_k, C_k)\\, p(C_k)}{\\sum_k p(x \\mid \\theta_k, C_k)\\, p(C_k)}. \\qquad (4)$$ \n\nFigure 1: A simple example for classifying an ICA mixture model. 
There are \ntwo classes, (+) and (o); each class was generated by two independent variables, \ntwo bias terms and two basis vectors. Class (o) was generated by two uniformly \ndistributed sources as indicated next to the data class. Class (+) was generated by \ntwo Laplacian distributed sources with a sharp peak at the bias and heavy tails. \nThe inset graphs show the distributions of the source variables, $s_{i,k}$, for each basis \nvector. \n\n\u2022 Adapt the basis functions $A$ and the bias terms $b$ for each class. The basis \nfunctions are adapted using gradient ascent: \n\n$$\\Delta A_k \\propto \\frac{\\partial}{\\partial A_k} \\log p(x \\mid \\theta_{1:K}) = p(C_k \\mid x, \\theta_{1:K}) \\frac{\\partial}{\\partial A_k} \\log p(x \\mid C_k, \\theta_k). \\qquad (5)$$ \n\nNote that this simply weights any standard ICA algorithm gradient by \n$p(C_k \\mid x, \\theta_{1:K})$. The gradient can also be summed over multiple data points. \nThe bias term is updated according to \n\n$$b_k = \\frac{\\sum_t x_t\\, p(C_k \\mid x_t, \\theta_{1:K})}{\\sum_t p(C_k \\mid x_t, \\theta_{1:K})}, \\qquad (6)$$ \n\nwhere $t$ is the data index ($t = 1, \\ldots, T$). \n\nThe three steps in the learning algorithm perform gradient ascent on the total \nlikelihood of the data in eq.1. \n\nThe extended infomax ICA learning rule is able to blindly separate mixed sources \nwith sub- and super-Gaussian distributions. This is achieved by using a simple \ntype of learning rule first derived by Girolami (1998). The learning rule in Lee \net al. (1999b) uses the stability analysis of Cardoso and Laheld (1996) to switch \nbetween sub- and super-Gaussian regimes. The learning rule, expressed in terms of \n$W = A^{-1}$, called the filter matrix, is: \n\n$$\\Delta W \\propto \\left[ I - K \\tanh(u) u^T - u u^T \\right] W, \\qquad (7)$$ \n\nwhere $k_i$ are elements of the $N$-dimensional diagonal matrix $K$ and $u = Wx$. The \nunmixed sources $u$ are the source estimate $s$ (Bell and Sejnowski, 1995). The $k_i$'s \nare (Lee et al., 1999b) \n\n$$k_i = \\mathrm{sign}\\left( E[\\mathrm{sech}^2 u_i]\\, E[u_i^2] - E[u_i \\tanh u_i] \\right).$$ 
\n\n(8) \nThe source distribution is super-Gaussian when $k_i = 1$ and sub-Gaussian when $k_i = -1$. \nFor the log-likelihood estimation in eq.3 the term $\\log p(s)$ can be approximated \nas follows: \n\n$$\\log p(s) \\propto -\\sum_n \\log\\cosh s_n - \\frac{s_n^2}{2} \\quad \\text{(super-Gaussian)}$$ \n$$\\log p(s) \\propto +\\sum_n \\log\\cosh s_n - \\frac{s_n^2}{2} \\quad \\text{(sub-Gaussian)} \\qquad (9)$$ \n\nSuper-Gaussian densities are approximated by a density model with heavier tails \nthan the Gaussian density; sub-Gaussian densities are approximated by a bimodal \ndensity (Girolami, 1998). Although these source density approximations are crude, they \nhave been demonstrated to be sufficient for standard ICA problems (Lee \net al., 1999b). When learning sparse representations only, a Laplacian prior \n($p(s) \\propto \\exp(-|s|)$) can be used for the weight update, which simplifies the infomax learning \nrule to \n\n$$\\Delta W \\propto \\left[ I - \\mathrm{sign}(u) u^T \\right] W, \\qquad \\log p(s) \\propto -\\sum_n |s_n| \\quad \\text{(Laplacian prior)}. \\qquad (10)$$ \n\n3 Learning efficient codes for images \n\nRecently, several approaches have been proposed to learn image codes that utilize a \nset of linear basis functions. Olshausen and Field (1996) used a sparseness criterion \nand found codes that were similar to localized and oriented receptive fields. Similar \nresults were presented by Bell and Sejnowski (1997) using the infomax algorithm \nand by Lewicki and Olshausen (1998) using a Bayesian approach. By applying the \nICA mixture model we present results which show a higher degree of flexibility in \nencoding the images. We used images of natural scenes obtained from Olshausen \nand Field (1996) and text images of scanned newspaper articles. The training \nset consisted of 12 by 12 pixel patches selected randomly from both image types. \nFigure 2 illustrates examples of those image patches. Two complete sets of basis vectors, \n$A_1$ and $A_2$, were randomly initialized. 
Then, for each gradient in eq.5 a stepsize \nwas computed as a function of the amplitude of the basis vectors and the number \nof iterations. The algorithm converged after 100,000 iterations and learned two \nclasses of basis functions as shown in figure 3. Figure 3 (top) shows basis functions \ncorresponding to natural images. The basis functions show Gabor-like (Gaussian-modulated \nsinusoidal) structure as previously reported (Olshausen and Field, 1996; Bell and Sejnowski, 1997; \nLewicki and Olshausen, 1998). In contrast, figure 3 (bottom) shows basis functions \ncorresponding to text images. These basis functions resemble bars with different \nlengths and widths that capture the high-frequency structure present in the text \nimages. \n\nFigure 2: Example of natural scene and text image. The 12 by 12 pixel image \npatches were randomly sampled from the images and used as inputs to the ICA \nmixture model. \n\n3.1 Comparing coding efficiency \n\nWe have compared the coding efficiency between the ICA mixture model and similar \nmodels using Shannon's theorem to obtain a lower bound on the number of bits \nrequired to encode the pattern: \n\n$$\\#\\mathrm{bits} \\geq -\\log_2 P(x \\mid A) - N \\log_2(\\sigma_x), \\qquad (11)$$ \n\nwhere $N$ is the dimensionality of the input pattern $x$ and $\\sigma_x$ is the coding precision \n(standard deviation of the noise introduced by errors in encoding). Table 1 compares \nthe coding efficiency of five different methods. It shows the number of bits required \nto encode three different test data sets (5000 image patches from natural scenes, \n5000 image patches from text images and 5000 image patches from both image \ntypes) using five different encoding methods (ICA mixture model, nature trained \nICA, text trained ICA, nature and text trained ICA, and PCA trained on all three \ntest sets). 
It is clear that ICA basis functions trained on natural scene images \nexhibit the best encoding when only natural scenes are presented (column: nature). \nThe same applies to text images (column: text). Note that text training yields a \nreasonable basis for both data sets but nature training gives a good basis only for \nnature. The ICA mixture model shows the same encoding power for the individual \ntest data sets, and it gives the best encoding when both image types are present. \nIn this case, the encoding difference between the ICA mixture model and PCA is \nsignificant (more than 20%). ICA mixtures yielded a small improvement over ICA \ntrained on both image types. We expect the size of the improvement to be greater \nin situations where there are greater differences among the classes. An advantage \nof the mixture model is that each image patch is automatically classified. \n\n4 Discussion \n\nThe new algorithm for unsupervised classification presented here is based on a \nmaximum likelihood mixture model using ICA to model the structure of the classes. \nWe have demonstrated here that the algorithm can learn efficient codes to represent \ndifferent image types such as natural scenes and text images. In this case, the \nlearned classes of basis functions show a 20% improvement over PCA encoding. \nThe ICA mixture model should show better image compression rates than traditional \ncompression algorithms such as JPEG. \nThe ICA mixture model is a nonlinear model in which each class is modeled as a \nlinear process and the choice of class is modeled using probabilities. This model \n\nFigure 3: (Left) Basis function class corresponding to natural images. (Right) Basis \nfunction class corresponding to text images. 
\n\nTable 1: Comparing coding efficiency \n\n                                         Test data \nTraining set and model         Nature    Text    Nature and Text \nICA mixtures                   4.72      5.20    4.96 \nNature trained ICA             4.72      9.57    7.15 \nText trained ICA               5.00      5.19    5.10 \nNature and text trained ICA    4.83      5.29    5.07 \nPCA                            6.22      5.97    6.09 \n\nCoding efficiency (bits per pixel) of five methods is compared for three test sets. \nCoding precision was set to 7 bits (Nature: $\\sigma_x = 0.016$ and Text: $\\sigma_x = 0.029$). \n\ncan therefore be seen as a nonlinear ICA model. Furthermore, it is one way of \nrelaxing the independence assumption over the whole data set. The ICA mixture \nmodel is a conditional independence model, i.e., the independence assumption holds \nonly within each class and there may be dependencies among classes. A different \nview of the ICA mixture model is to think of the classes as being an overcomplete \nrepresentation. Compared to the approach of Lewicki and Sejnowski (1998), the \nmain difference is that the basis functions learned here are mutually exclusive, i.e. \neach class uses its own set of basis functions. \n\nThis method is similar to other approaches including the mixture density networks \nby Bishop (1994) in which a neural network was used to find arbitrary density \nfunctions. This algorithm reduces to the Gaussian mixture model when the source \npriors are Gaussian. Purely Gaussian structure, however, is rare in real data sets. \nHere we have used priors of the form of super-Gaussian and sub-Gaussian densities, \nbut these could be extended as proposed by Attias (1999). The proposed model was \nused for learning a complete set of basis functions without additive noise. However, \nthe method can be extended to take into account additive Gaussian noise and an \novercomplete set of basis vectors (Lewicki and Sejnowski, 1998). \n\nIn Lee et al. (1999c), we performed several experiments on benchmark data \nsets for classification problems. 
The results were comparable to or better than those \nobtained by AutoClass (Stutz and Cheeseman, 1994), which uses a Gaussian mixture \nmodel. Furthermore, we showed that the algorithm can be applied to blind source \nseparation in nonstationary environments. The method can switch automatically \nbetween learned mixing matrices in different environments (Lee et al., 1999c). This \nmay prove to be useful in the automatic detection of sleep stages by observing EEG \nsignals. The method can identify these stages due to the changing source priors and \ntheir mixing. \n\nPotential applications of the proposed method include the problem of noise removal \nand the problem of filling in missing pixels. We believe that this method provides \ngreater flexibility in modeling structure in high-dimensional data and has many \npotential applications. \n\nReferences \n\nAttias, H. (1999). Blind separation of noisy mixtures: An EM algorithm for independent \nfactor analysis. Neural Computation, in press. \n\nBell, A. J. and Sejnowski, T. J. (1995). An Information-Maximization Approach to \nBlind Separation and Blind Deconvolution. Neural Computation, 7:1129-1159. \n\nBell, A. J. and Sejnowski, T. J. (1997). The 'independent components' of natural \nscenes are edge filters. Vision Research, 37(23):3327-3338. \n\nBishop, C. (1994). Mixture density networks. Technical Report, NCRG/4288. \n\nCardoso, J.-F. and Laheld, B. (1996). Equivariant adaptive source separation. IEEE \nTrans. on S.P., 45(2):434-444. \n\nDuda, R. and Hart, P. (1973). Pattern classification and scene analysis. Wiley, \nNew York. \n\nGhahramani, Z. (1994). Solving inverse problems using an EM approach to density \nestimation. Proceedings of the 1993 Connectionist Models Summer School, pages \n316-323. \n\nGirolami, M. (1998). An alternative perspective on adaptive independent component \nanalysis algorithms. 
Neural Computation, 10(8):2103-2114. \n\nLee, T.-W., Girolami, M., Bell, A. J., and Sejnowski, T. J. (1999a). A unifying \nframework for independent component analysis. International Journal on Mathematical \nand Computer Models, in press. \n\nLee, T.-W., Girolami, M., and Sejnowski, T. J. (1999b). Independent component \nanalysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian \nsources. Neural Computation, 11(2):409-433. \n\nLee, T.-W., Lewicki, M. S., and Sejnowski, T. J. (1999c). ICA mixture models \nfor unsupervised classification and automatic context switching. In International \nWorkshop on ICA, Aussois, in press. \n\nLewicki, M. and Olshausen, B. (1998). Inferring sparse, overcomplete image codes \nusing an efficient coding framework. In Advances in Neural Information Processing \nSystems 10, pages 556-562. \n\nLewicki, M. and Sejnowski, T. J. (1998). Learning nonlinear overcomplete representations \nfor efficient coding. In Advances in Neural Information Processing Systems \n10, pages 815-821. \n\nOlshausen, B. and Field, D. (1996). Emergence of simple-cell receptive field properties \nby learning a sparse code for natural images. Nature, 381:607-609. \n\nStutz, J. and Cheeseman, P. (1994). Autoclass - a Bayesian approach to classification. \nMaximum Entropy and Bayesian Methods, Kluwer Academic Publishers. \n\n", "award": [], "sourceid": 1592, "authors": [{"given_name": "Te-Won", "family_name": "Lee", "institution": null}, {"given_name": "Michael", "family_name": "Lewicki", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}