{"title": "Gaussianization", "book": "Advances in Neural Information Processing Systems", "page_first": 423, "page_last": 429, "abstract": null, "full_text": "Gaussianization \n\nScott Shaobing Chen \n\nRenaissance Technologies \nEast Setauket, NY 11733 \n\nschen@rentec.com \n\nRamesh A. Gopinath \n\nIBM TJ. Watson Research Center \n\nYorktown Heights, NY 10598 \n\nrameshg@us.ibm.com \n\nAbstract \n\nHigh dimensional data modeling is difficult mainly because the so-called \n\"curse of dimensionality\". We propose a technique called \"Gaussianiza(cid:173)\ntion\" for high dimensional density estimation, which alleviates the curse \nof dimensionality by exploiting the independence structures in the data. \nGaussianization is motivated from recent developments in the statistics \nliterature: projection pursuit, independent component analysis and Gaus(cid:173)\nsian mixture models with semi-tied covariances. We propose an iter(cid:173)\native Gaussianization procedure which converges weakly: at each it(cid:173)\neration, the data is first transformed to the least dependent coordinates \nand then each coordinate is marginally Gaussianized by univariate tech(cid:173)\nniques. Gaussianization offers density estimation sharper than traditional \nkernel methods and radial basis function methods. Gaussianization can \nbe viewed as efficient solution of nonlinear independent component anal(cid:173)\nysis and high dimensional projection pursuit. \n\n1 Introduction \n\nDensity Estimation is a fundamental problem in statistics. In the statistics literature, the \nunivariate problem is well-understood and well-studied. Techniques such as (variable) ker(cid:173)\nnel methods, radial basis function methods, Gaussian mixture models, etc, can be applied \nsuccessfully to obtain univariate density estimates. However, the high dimensional problem \nis very challenging, mainly due to the problem of the so-called \"curse of dimensionality\". 
\nIn high dimensional space, data samples are often sparsely distributed: it requires very \nlarge neighborhoods to achieve sufficient counts, or the number of samples has to grows \nexponentially according to the dimensions in order to achieve sufficient coverage of the \nsampling space. As a result, direct extension of univariate techniques can be highly biased, \nbecause they are neighborhood-based. \n\nIn this paper, we attempt to overcome the curse of dimensionality by exploiting indepen(cid:173)\ndence structures in the data. We advocate the notion that \n\nIndependence lifts the curse of dimensionality! \n\nIndeed, if the dimensions are independent, then there is no curse of dimensionality since \nthe high dimensional problem can be reduced to univariate problems along each dimension. \n\nFor natural data sets which do not have independent dimensions, we would like to construct \ntransforms such that after the transformation, the dimensions become independent. We pro(cid:173)\npose a technique called \"Gaussianization\" which finds and exploits independence structures \n\n\fin the data for high dimensional density estimation. For a random variable X EnD, we \ndefine its Gaussianization transform to be an invertible and differential transform T(X) \nsuch that the transformed variable T(X) follows the standard Gaussian distribution: \n\nT(X) '\" N(O, I) \n\nIt is clear that density estimates can be derived from Gaussianization transforms. We pro(cid:173)\npose an iterative procedure which converges weakly in probability: at each iteration, the \ndata is first transformed to the least dependent coordinates and then each coordinate is \nmarginally Gaussianized by univariate techniques which are based on univariate density \nestimation. At each iteration, the coordinates become less dependent in terms of the mu(cid:173)\ntual information, and the transformed data samples become more Gaussian in terms of the \nKullback-Leibler divergence. 
In fact, at each iteration, as long as the data is linearly transformed to less dependent coordinates, the convergence result still holds. Our convergence proof of Gaussianization is closely related to Huber's convergence proof of projection pursuit [4]. \n\nAlgorithmically, each Gaussianization iteration amounts to performing linear independent component analysis (ICA). Since the assumption of linear ICA may not be valid, the resulting linear transform does not necessarily make the coordinates independent; however, it does make the coordinates as independent as possible. Therefore the engine of our algorithm is linear ICA. We propose an efficient EM algorithm which jointly estimates the linear transform and the marginal univariate Gaussianization transform at each iteration. Our parametrization is identical to the independent factor analysis proposed by Attias (1999) [1]. However, we apply the variational method in the M-step, as in the semi-tied covariance algorithm proposed for Gaussian mixture models by Gales (1999) [3]. \n\n2 Existence of Gaussianization \n\nWe first show the existence of Gaussianization transforms. Denote by φ(·) the probability density function of the standard normal N(0, I); denote by φ(·, μ, Σ) the probability density function of N(μ, Σ); denote by Φ(·) the cumulative distribution function (CDF) of the standard normal. \n\n2.1 Univariate Gaussianization \n\nUnivariate Gaussianization exists uniquely and can be derived from univariate density estimation. Let X ∈ R^1 be a univariate variable. We assume that the density function of X is strictly positive and differentiable. Let F(·) be the cumulative distribution function of X. 
Then T(\u00b7) is a Gaussianization transform if and only if it satisfies the following partial \ndifferential equation: \n\np(x) = \u00a2(T(x))1 ax I\u00b7 \n\naT \n\n\u00b1*-l(F(X)) '\" N(O, 1) \n\nIt can be easily verified that the above partial differential equation has only two solutions: \n(1) \nIn practice, the CDF F( \u00b7) is not available; it has to be estimated from the training data. \nWe choose to approximate it by Gaussian mixture models: p(x) = 2:[=1 7ri\u00a2(X, /-\u00a3i, O'l); \nequivalently we assume the CDF F(x) = 2:[=1 7ri(X~:,) where the parameters \n{7ri' /-\u00a3i, O'd can be estimated via maximum likelihood using the standard EM algorithm. \nTherefore we can parameterize the Gaussianization transform as \n\nT(X) = <1>-1(2: 7ri ** ( ~)) \n\nI \n\nX-. \n\ni=l \n\nO'i \n\n(2) \n\n\fIn practice there is an issue of model selection: we suggest to use model selection tech(cid:173)\nniques such as the Bayesian information criterion [6] to determine the number of Gaussians \nI. Throughout the paper, we shall assume that univariate density estimation and univariate \nGaussianization can be solved by univariate Gaussian mixture models. \n\n2.2 High Dimensional Gaussianization \n\nHowever, the existence of high dimensional Gaussianization is non-trivial. We present \nhere a theoretical construction. For simplicity, we consider the two dimensional case. Let \nX = (X I, X 2) T be the random variable. Gaussianization can be achieved in two steps. \nWe first marginally Gaussianize the first coordinate X I and fix the second coordinate X 2 \nunchanged; the transformed variable will have the following density \nP(XI,X2) =P(XI)P(X2Ixt) = \u00a2(xt)p(x2Ixt) . \n\nWe then marginally Gaussian each conditional density p(\u00b7IXI) for each Xl. 
Notice that the marginal Gaussianization is different for different x_1: \n\nT_{x_1}(X_2) = Φ^{-1}(F_{X_2|x_1}(X_2)). \n\nOnce all the conditional densities are marginally Gaussianized, we achieve joint Gaussianization: \n\np(x_1, x_2) = p(x_1) p(x_2 | x_1) = φ(x_1) φ(x_2). \n\nThe existence of high dimensional Gaussianization can be proved by a similar construction. \n\nThe above construction, however, is not practical, since the marginal Gaussianization of the conditional densities p(X_2 = x_2 | X_1 = x_1) requires estimation of the conditional densities for all x_1, which is impossible with finite samples. In the following sections, we shall develop an iterative Gaussianization algorithm that is practical and can also be proved to converge weakly. \n\nHigh dimensional Gaussianization is unique up to any invertible transform which preserves the measure on R^D induced by the standard Gaussian distribution. Examples of such transforms are orthogonal linear transforms and certain nontrivial nonlinear transforms. \n\n3 Gaussianization with the Linear ICA Assumption \n\nLet (x_1, ..., x_N) be i.i.d. samples of the random variable X ∈ R^D. We assume that there exists a linear transform A (D x D) such that the transformed variable Y = (Y_1, ..., Y_D)^T = AX has independent components: p(y_1, ..., y_D) = p(y_1) ... p(y_D). In this case, Gaussianization reduces to linear ICA: we can first find the linear transform A by linear independent component analysis, and then Gaussianize each individual dimension of Y by univariate Gaussianization. \n\nWe parametrize the marginal Gaussianization by univariate Gaussian mixtures (2). This amounts to modeling the coordinates of the transformed variable by univariate Gaussian mixtures: p(y_d) = Σ_{i=1}^{I_d} π_{d,i} φ(y_d, μ_{d,i}, σ_{d,i}²). We would like to jointly optimize both the linear transform A and the marginal Gaussianization parameters (π, μ, σ) via maximum likelihood. 
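The univariate building block, Eq. (2), is straightforward to implement. The sketch below is a minimal illustration assuming NumPy and SciPy are available; the helper names `fit_gmm_1d` and `gaussianize_1d` are ours, not from the paper. It fits a univariate Gaussian mixture by plain EM and then applies T(x) = Φ^{-1}(Σ_i π_i Φ((x - μ_i)/σ_i)):

```python
import numpy as np
from scipy.stats import norm

def fit_gmm_1d(x, n_components=4, n_iter=100):
    """Fit a univariate Gaussian mixture by maximum likelihood (plain EM)."""
    pi = np.full(n_components, 1.0 / n_components)
    mu = np.quantile(x, np.linspace(0.1, 0.9, n_components))  # spread-out init
    sigma = np.full(n_components, x.std())
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        dens = pi * norm.pdf(x[:, None], mu, sigma)           # shape (N, I)
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of the weights, means and variances
        nk = w.sum(axis=0)
        pi = nk / len(x)
        mu = (w * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return pi, mu, sigma

def gaussianize_1d(x, pi, mu, sigma):
    """Eq. (2): T(x) = Phi^{-1}( sum_i pi_i * Phi((x - mu_i) / sigma_i) )."""
    F = (pi * norm.cdf((x[:, None] - mu) / sigma)).sum(axis=1)
    return norm.ppf(np.clip(F, 1e-12, 1 - 1e-12))  # clip guards the far tails
```

If the mixture fits the data well, `gaussianize_1d` maps the sample to an approximately standard normal one; the transform is invertible because T is strictly monotone.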
In fact, this is the same parametrization as in Attias (1999) [1]. We point out that modeling the coordinates after the linear transform as non-Gaussian distributions, for which we assume univariate Gaussian mixtures are adequate, leads to ICA, whereas modeling them as single Gaussians leads to PCA. \n\nThe joint estimation of the parameters can be computed via the EM algorithm. The auxiliary function which has to be maximized in the M-step has the following form: \n\nQ(A, π, μ, σ) = N log|det(A)| + Σ_{n=1}^N Σ_{d=1}^D Σ_{i=1}^{I_d} w_{n,d,i} [ log π_{d,i} - (1/2) log(2π σ_{d,i}²) - (y_{n,d} - μ_{d,i})² / (2σ_{d,i}²) ], \n\nwhere the posterior counts (w_{n,d,i}) are computed in the E-step. It can be easily shown that the priors (π_{d,i}) can be easily updated and that the means (μ_{d,i}) are entirely determined by the linear transform A. However, updating the linear transform A and the variances (σ_{d,i}) does not have a closed form solution and has to be done iteratively by numerical methods. Attias (1999) [1] proposed to optimize Q via gradient descent: at each iteration, one fixes the linear transform and computes the Gaussian mixture parameters, then fixes the Gaussian mixture parameters and updates the linear transform via gradient descent using the so-called natural gradient. \n\nWe propose an iterative algorithm, as in Gales (1999) [3], for the M-step which does not involve gradient descent and thus avoids the nuisance and instability caused by the step size parameter. At each iteration, we fix the linear transform A and update the variances (σ_{d,i}); we then fix (σ_{d,i}) and update each row of A with all the other rows of A fixed: updating each row amounts to solving a system of linear equations. Our iterative scheme guarantees that the auxiliary function Q is increased at every iteration. Notice that each iteration in our M-step updates the rows of the linear matrix A by solving D systems of linear equations. 
Although our iterative scheme may be slightly more expensive per iteration than standard numerical optimization techniques such as Attias' algorithm, in practice it converges after very few iterations, as observed in Gales (1999) [3]. In contrast, the numerical optimization scheme may take an order of magnitude more iterations. In fact, in our experiments, our algorithm converges much faster than Attias' algorithm. Furthermore, our algorithm is stable, since each iteration is guaranteed to increase the likelihood. \n\nThe M-step in both Attias' algorithm and our algorithm can be implemented efficiently by storing and accessing the sufficient statistics. Typically in our M-steps, most of the improvement in the likelihood comes in the first few iterations. Therefore we can stop each M-step after, say, one iteration of updating the parameters; even though the auxiliary function is not fully optimized, it is guaranteed to improve. We thus obtain the so-called generalized EM algorithm. Attias (1999) [1] reported faster convergence of the generalized EM algorithm than of the standard EM algorithm. \n\n4 Iterative Gaussianization \n\nIn this section we develop an iterative algorithm which Gaussianizes arbitrary random variables. At each iteration, the data is first transformed to the least dependent coordinates and then each coordinate is marginally Gaussianized by univariate techniques which are based on univariate density estimation. We shall show that transforming the data into the least dependent coordinates can be achieved by linear independent component analysis. We also prove the weak convergence result. \n\nWe define the negentropy (see footnote 1) of a random variable X = (X_1, ..., X_D)^T as the Kullback-Leibler divergence between X and the standard Gaussian distribution. We define the marginal negentropy to be J_M(X) = Σ_{d=1}^D J(X_d). 
One can show that the negentropy can be decomposed as the sum of the marginal negentropy and the mutual information: J(X) = J_M(X) + I(X). Gaussianization is equivalent to finding an invertible transform T(·) such that the negentropy of the transformed variable vanishes: J(T(X)) = 0. \n\nFor an arbitrary random variable X ∈ R^D, we propose the following iterative Gaussianization algorithm. Let X^(0) = X. At each iteration, \n\n(A) linearly transform the data: Y^(k) = A X^(k); \n\n(B) nonlinearly transform the data by marginal Gaussianization: \n\nX^(k+1) = Ψ_{π,μ,σ}(Y^(k)), \n\nwhere the marginal Gaussianization Ψ_{π,μ,σ}(·), which approximates the ideal marginal Gaussianization Ψ(·), can be derived from univariate Gaussian mixtures (2). \n\n(Footnote 1: We are abusing the terminology slightly: normally the negentropy of a random variable is defined as the Kullback-Leibler distance between itself and the Gaussian variable with the same mean and covariance.) \n\nThe parameters are chosen by minimizing the negentropy of the transformed variable X^(k+1): \n\n(Â, π̂, μ̂, σ̂) = argmin_{A,π,μ,σ} J(Ψ_{π,μ,σ}(AX^(k))).   (3) \n\nThus, after each iteration, the transformed variable becomes as close as possible to the standard Gaussian in the Kullback-Leibler distance. \n\nFirst, the problem of minimizing the negentropy (3) is equivalent to the maximum likelihood problem for Gaussianization under the linear ICA assumption in Section 3, and thus can be solved by the same efficient EM algorithm. \n\nSecond, since the data X^(k) might not satisfy the linear ICA assumption, the optimal linear transform might not transform X^(k) into independent coordinates. However, it does transform X^(k) into the least dependent coordinates, since \n\nJ(X^(k+1)) = J_M(Ψ(AX^(k))) + I(Ψ(AX^(k))) = I(AX^(k)). 
\n\nFurther more, if the linear transform A is constrained to be orthogonal, then finding the \nleast dependent coordinates is equivalent to finding the marginally most non-Gaussian co(cid:173)\nordinates, since \n\nJ(X(k)) = J(AX(k)) = JM(AX(k)) + I(AX(k)) \n(notice that the negentropy is invariant under orthogonal transforms). \n\nTherefore our iterative algorithm can be viewed as follows. At each iteration, the data is lin(cid:173)\nearly transformed to the least dependent coordinates and then each coordinate is marginally \nGaussianized. In practice, after the first iteration, the algorithm finds linear transforms \nwhich are almost orthogonal. Therefore one can also view practically that at each iteration, \nthe data is linearly transformed to the most marginally non-Gaussian coordinates and then \neach coordinate is marginally Gaussianized. \n\nFor the sake of simplicity, we assume that we can achieve perfect marginal Gaussianization \nw(\u00b7) by w1T,It,.,.O), which is derived from univariate Gaussian mixtures. In fact, when the \nnumber of Gaussians goes to infinity and the number of samples goes to infinity, one can \nshow that \n\nThus it suffices to analyze the ideal iterative Gaussianization \n\nlim W 1T ,It,.,. = W. \n\nX(k) = W(AX(k)) \n\nwhere \n\nA = argmin J(W(AX(k))) = argmin I(AX(k)) . \n\nFollowing Huber's argument [4], we can show that \n\nX(k) -t N(O, I) \n\nin the sense of weak convergence, i.e. the density function of X(k) converges pointwise to \nthe density function of standard normal. \n\n\fOriginal Data \n\nIteratIon 1 \n\nIteration 2 \n\n:~ \n-2~ \n\n-4 \n-4 -2024 \n\n-~4L-_-::-2 ----:-0 -----:-2----'4 \n\n-1 \n\n-~2L---:-O -----' \n\niteratIon 3 \n\nIteratIon 4 \n\nIteration 5 \n\niteratIon 6 \n\n_4L - - - - - - - ' \n-4-2 0 24 \n\n' \u2022. \n\n\u2022. ~:. :!'::-. \n0\".' .' \n\n-2 \n\n:~:). -.: ... \n\n.. \n\nf 0 \n0\" \n\no \n-2 : : :-:... ~ .. : \n\n:ltJ\u00b7\u00b7:,.\u00a3.~\u00b7 \n; \u2022 .;..=: \n\n!to.: . \n\n'-; . 
\n\n\" \u2022\u2022 \n\n- 2 :\u00b7 \u2022 \n\n0 \n\n: \" \n\nIteratIon 7 \n\n-4 \n-4-2 0 24 \n\n-4 \n-4-2 0 24 \n\n:_: ~.:-\n\n4~ 4~ \n\n... : 0.0 \n\u2022\u2022 \u2022 \" . \n\n........ l: \n..... \n.r, ... \n\nIteration 8 \n\n2 \n\n2 \n\n-4 \n-4-2 0 24 \n\n-4 \n-4-2 0 24 \n\nGausslanlzatlon DensIty EstImatIon \n\nGaussIan MIxture DensIty EstImatIon \n\n0.2 \n\n0. 4 \n\n0.6 \n\nO.S \n\n0. 2 \n\n0. 4 \n\n0.6 \n\nO.S \n\nFigure 1: Iterative Gaussianization on a synthetic circular data set \n\nWe point out that out iterative algorithm can be relaxed as follows. At each iteration, the \ndata can linearly transformed into coordinates which are less dependent, instead of into \ncoordinates which are the least dependent: \n\nI(X(k) - I(AkX(k)) ~ E[I(X(k) - i~t I(AX(k))] \n\nwhere the constant E > O. We can show that this relaxed algorithm still converges weakly. \n\n5 Examples \n\nWe demonstrate the process of our iterative Gaussianization algorithm through a very dif(cid:173)\nficult two dimensional synthetic data set. The true underlying variable is circularly dis(cid:173)\ntributed: in the polar coordinate system, the angle is uniformly distributed; the radius fol(cid:173)\nlows a mixture of four non-overlapping Gaussians. We drew 1000 i.i.d. samples from this \ndistribution. We ran 8 iterations to Gaussianize the data set. Figure 4 displays the trans(cid:173)\nformed data set at each iteration. Clearly we see the transformed data gradually becomes \nstandard Gaussian. \nLet X(O) = X ; assume that the iterative Gaussianization procedure converges after K \niterations, i.e. X(K) '\" N( O, I). Since the transforms at each iteration are invertible, \nwe can then compute Jacobian and obtain density estimation for X. The Jacobian can be \ncomputed rapidly due to the chain rule. Figure 4 compares the Gaussianization density \n\n\festimate (8 iterations) and Gaussian mixture density estimate (40 Gaussians). 
Clearly we see that the Gaussianization density estimate recovers the four circular structures, whereas the Gaussian mixture estimate lacks resolution. \n\n6 Discussion \n\nGaussianization is closely connected with the exploratory projection pursuit algorithm proposed by Friedman (1987) [2]. In fact, we argue that our iterative Gaussianization procedure can easily be constrained to give an efficient parametric solution of high dimensional projection pursuit. Assume that we are interested in l-dimensional projections, where 1 ≤ l ≤ D. If we constrain the linear transform at each iteration to be orthogonal, and marginally Gaussianize only the first l coordinates of the transformed variable, then the iterative Gaussianization algorithm achieves l-dimensional projection pursuit. The bottleneck of Friedman's high dimensional projection pursuit is to find the jointly most non-Gaussian projection and to jointly Gaussianize that projection. In contrast, our algorithm finds the most marginally non-Gaussian projection and marginally Gaussianizes that projection; it can be computed by an efficient EM algorithm. \n\nWe argue that Gaussianization density estimation indeed alleviates the curse of dimensionality. At each iteration, the effect of the curse of dimensionality bears solely on finding a linear transform such that the transformed coordinates are less dependent, which is a much easier problem than the original problem of high dimensional density estimation itself; after the linear transform, the marginal Gaussianization can be derived from univariate density estimation, which does not suffer from the curse of dimensionality. 
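For the single-iteration, linear ICA case of Section 3, the resulting density estimate is explicit: with y = Ax, change of variables gives log p(x) = log|det A| + Σ_d log p_d(y_d), where each p_d is a univariate Gaussian mixture. The following minimal sketch assumes NumPy/SciPy; `log_density` and its argument layout are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def log_density(X, A, gmm_params):
    """Change-of-variables density estimate for y = A x:
        log p(x) = log|det A| + sum_d log p_d(y_d),
    where gmm_params[d] = (pi, mu, sigma) are the weights, means and
    standard deviations of the univariate mixture for coordinate d."""
    Y = X @ A.T                                   # row n of Y is A x_n
    logdet = np.linalg.slogdet(A)[1]              # log|det A| term
    logp = np.full(X.shape[0], logdet)
    for d, (pi, mu, sigma) in enumerate(gmm_params):
        # univariate Gaussian-mixture density of coordinate d
        dens = (pi * norm.pdf(Y[:, d][:, None], mu, sigma)).sum(axis=1)
        logp += np.log(dens)
    return logp
```

For the full iterative procedure, the log-density accumulates one such log-Jacobian term per iteration via the chain rule, together with the derivatives of the marginal Gaussianization transforms.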
Hwang et al. (1994) [5] performed an extensive comparative study of the following three popular density estimates: one dimensional projection pursuit density estimates (a special case of our iterative Gaussianization algorithm), adaptive kernel density estimates, and radial basis function density estimates; they concluded that projection pursuit density estimates outperform the others on most data sets. \n\nWe are currently experimenting with applications of Gaussianization density estimation in automatic speech and speaker recognition. \n\nReferences \n\n[1] H. Attias, \"Independent factor analysis\", Neural Computation, vol. 11, pp. 803-851, 1999. \n\n[2] J.H. Friedman, \"Exploratory projection pursuit\", J. American Statistical Association, vol. 82, pp. 249-266, 1987. \n\n[3] M.J.F. Gales, \"Semi-tied covariance matrices for hidden Markov models\", IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999. \n\n[4] P.J. Huber, \"Projection pursuit\", Annals of Statistics, vol. 13, pp. 435-525, 1985. \n\n[5] J. Hwang, S. Lay and A. Lippman, \"Nonparametric multivariate density estimation: a comparative study\", IEEE Transactions on Signal Processing, vol. 42, pp. 2795-2810, 1994. \n\n[6] G. Schwarz, \"Estimating the dimension of a model\", Annals of Statistics, vol. 6, pp. 461-464, 1978. \n", "award": [], "sourceid": 1856, "authors": [{"given_name": "Scott", "family_name": "Chen", "institution": null}, {"given_name": "Ramesh", "family_name": "Gopinath", "institution": null}]}