{"title": "Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals", "book": "Advances in Neural Information Processing Systems", "page_first": 786, "page_last": 792, "abstract": null, "full_text": "Higher-order Statistical Properties \nArising from the Non-stationarity of \n\nNatural Signals \n\nAdaptive Signal and Image Processing, Sarnoff Corporation \n\nLucas Parra, Clay Spence \n\n{lparra, cspence} @sarnofJ. com \n\nDepartment of Biomedical Engineering, Columbia University \n\nPaul Sajda \n\nps629@columbia. edu \n\nAbstract \n\nWe present evidence that several higher-order statistical proper(cid:173)\nties of natural images and signals can be explained by a stochastic \nmodel which simply varies scale of an otherwise stationary Gaus(cid:173)\nsian process. We discuss two interesting consequences. The first \nis that a variety of natural signals can be related through a com(cid:173)\nmon model of spherically invariant random processes, which have \nthe attractive property that the joint densities can be constructed \nfrom the one dimensional marginal. The second is that in some cas(cid:173)\nes the non-stationarity assumption and only second order methods \ncan be explicitly exploited to find a linear basis that is equivalent \nto independent components obtained with higher-order methods. \nThis is demonstrated on spectro-temporal components of speech. \n\n1 \n\nIntroduction \n\nRecently, considerable attention has been paid to understanding and modeling the \nnon-Gaussian or \"higher-order\" properties of natural signals, particularly images. \nSeveral non-Gaussian properties have been identified and studied. For example, \nmarginal densities of features have been shown to have high kurtosis or \"heavy \ntails\", indicating a non-Gaussian, sparse representation. Another example is the \n\"bow-tie\" shape of conditional distributions of neighboring features, indicating de(cid:173)\npendence of variances [11]. 
These non-Gaussian properties have motivated a number of image and signal processing algorithms that attempt to exploit higher-order statistics of the signals, e.g., for blind source separation. In this paper we show that these previously observed higher-order phenomena are ubiquitous and can be accounted for by a model which simply varies the scale of an otherwise stationary Gaussian process. This enables us to relate a variety of natural signals to one another and to spherically invariant random processes, which are well known in the signal processing literature [6, 3]. We present analyses of several kinds of data from this perspective, including images, speech, magnetoencephalography (MEG) activity, and socio-economic data (e.g., stock market data). Finally, we present the results of experiments with algorithms for finding a linear basis equivalent to independent components that exploit non-stationarity so as to require only second-order statistics. This simplification is possible whenever linearity and non-stationarity of the independent sources is guaranteed, such as for the powers of acoustic signals. \n\n2 Scale non-stationarity and high kurtosis \n\nNatural signals can be non-stationary in various ways, e.g. varying powers, changing correlation of neighboring samples, or even non-stationary higher moments. We will concentrate on the simplest possible variation and show in the following sections how it can give rise to many higher-order properties observed in natural signals. We assume that at any given instance a signal is specified by a probability density function with zero mean and unknown scale or power. The signal is assumed non-stationary in the sense that its power varies from one time instance to the next.\u00b9 We can think of this as a stochastic process with samples z(t) drawn from a zero-mean distribution p_z(z), with samples possibly correlated in time. 
We observe a scaled version of this process with time-varying scales s(t) > 0 sampled from p_s(s), \n\nx(t) = s(t) z(t). (1) \n\nThe observable process x(t) is distributed according to \n\np_x(x) = \u222b_0^\u221e ds p_s(s) p_x(x|s) = \u222b_0^\u221e ds p_s(s) s^{-1} p_z(x/s). (2) \n\nWe refer to p_x(x) as the long-term distribution and p_z(z) as the instantaneous distribution. In essence p_x(x) is a mixture distribution with infinitely many kernels s^{-1} p_z(x/s). We would like to relate the sparseness of p_z(z), as measured by the kurtosis, to the sparseness of the observable distribution p_x(x). \n\nKurtosis is defined as the ratio between the fourth and second cumulants of a distribution [7]. As such it measures the length of the distribution's tails, or the sharpness of its mode. For a zero-mean random variable x this reduces, up to a constant, to \n\nK[x] = \u27e8x^4\u27e9_x / \u27e8x^2\u27e9_x^2, with \u27e8f(x)\u27e9_x = \u222b dx f(x) p_x(x). (3) \n\nIn this case we find that the kurtosis of the long-term distribution is always larger than the kurtosis of the instantaneous distribution unless the scale is stationary ([9], and [1] for symmetric p_z(z)), \n\nK[x] \u2265 K[z]. (4) \n\nTo see this, note that the independence of s and z implies \u27e8x^n\u27e9_x = \u27e8s^n\u27e9_s \u27e8z^n\u27e9_z, and therefore K[x] = K[z] \u27e8s^4\u27e9_s / \u27e8s^2\u27e9_s^2. From the inequality \u27e8(s^2 - c^2)^2\u27e9_s \u2265 0, which holds for any constant c > 0, it is easy to show that \u27e8s^4\u27e9_s \u2265 \u27e8s^2\u27e9_s^2, where equality holds for p_s(s) = \u03b4(s - c). Together this leads to inequality (4), which states that for a fixed scale s(t), i.e. when the magnitude of the signal is stationary, the kurtosis is minimal. Conversely, non-stationary signals, defined as a variable scaling of an otherwise stationary process, will have increased kurtosis. \n\n\u00b9Throughout this paper we will refer to signals that are sampled in time. Note that all the arguments apply equally well to a spatial rather than temporal sampling, that is, images rather than time series. 
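The inequality (4) is easy to check numerically. The sketch below simulates the model of Equation (1) with a Gaussian instantaneous distribution; the log-normal choice for p_s(s) is purely illustrative and not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

def kurtosis(x):
    """Fourth moment over squared second moment; equals 3 for a Gaussian."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2

z = rng.standard_normal(n)                      # stationary Gaussian process z(t)
s = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # time-varying scale s(t) > 0 (illustrative choice)
x = s * z                                       # observed signal x(t) = s(t) z(t)

# K[x] = K[z] <s^4>_s / <s^2>_s^2, so the scale-mixed signal has heavier tails
print(kurtosis(z))  # close to 3: stationary scale gives minimal kurtosis
print(kurtosis(x))  # well above 3: scale non-stationarity raises kurtosis
```

With a fixed scale (p_s(s) a delta function) the two kurtosis values coincide, which is exactly the equality case of (4).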
\n\nFigure 1: Marginal distributions within 3 standard deviations are shown on a logarithmic scale; left to right: natural image features, speech sound intensities, stock market variation, MEG alpha activity. The measured kurtosis is 4.5, 16.0, 12.9, and 5.3, respectively. On top the empirical histograms are presented and on bottom the model distributions. The speech data has been fit with a Meijer-G function [3]. For the MEG activity, the stock market data, and the image features a mixture of zero-mean Gaussians was used. \n\nFigure 1 shows empirical plots of the marginal distributions for four natural signals: image, speech, stock market, and MEG data. As image feature we used a wavelet component for a 162x162 natural texture image of sand (presented in [4]). Self-inverting wavelets with a down-sampling factor of three were used. The speech signal is a 2.3 s recording of a female speaker sampled at 8 kHz with a noise level less than -25 dB. The signal has been band-limited between 300 Hz and 3.4 kHz, corresponding to telephone speech. The market data are the daily closing values of the New York Stock Exchange composite index from 02/01/1990 to 04/28/2000. We analyzed the variation from the one-day linear prediction value to remove the upward trend of the last decade. The MEG data is band-passed (10-12 Hz) alpha activity of an independent component of 122 MEG signals. This independent component exhibits alpha de-synchronization in a visuo-motor integration task [10]. One can see that in all four cases the kurtosis is high relative to a Gaussian (K = 3). Our claim is that for natural signals, high kurtosis is a natural result of the scale non-stationarity of the signal. 
Additional evidence comes from the behavior seen in the conditional histograms of the joint distributions, presented in the next section. \n\n3 Higher-order properties of joint densities \n\nIt has been observed in images that the conditional histograms of joint densities of neighboring features (neighboring in scale, space, and/or orientation) exhibit variance dependencies that cannot be accounted for by simple second-order models [11]. Figure 2 shows empirical conditional histograms for the four types of natural signals we considered earlier. One can see that speech and stock-market data exhibit the same variance dependency or \"bow-tie\" shape exhibited by images. \n\nFigure 2: (Top) Empirical conditional histograms and (bottom) model conditional densities derived from the one-dimensional marginals presented in the previous figure, assuming the data is sampled from a SIRP. The good correspondence validates the SIRP assumption, which is equivalent to our non-stationary scale model for slowly varying scales. \n\nThe model of Equation 1 can easily account for this observation if we assume slowly changing scales s(t). A possible explanation is that neighboring samples or features exhibit a common scale. If two zero-mean stochastic variables are both scaled with the same factor, their magnitude and variance will increase together. That is, as the magnitude of one variable increases, so will the magnitude and the variance of the other variable. This results in a broadening of the histogram of one variable as one increases the value of the conditioning variable, resulting in a \"bow-tie\" shaped conditional density. \n\n4 Relationship to spherically invariant random processes \n\nA class of signals closely related to those in Equation 1 is the so-called Spherically Invariant Random Process (SIRP). 
If the signals are short-time Gaussian and the powers vary slowly, the class of signals described are approximately SIRPs. Despite the restriction to Gaussian distributed z, SIRPs have been shown to be a good model for a range of stochastic processes with very different higher-order properties, depending on the scale distribution p_s(s). They have been used in a variety of signal processing applications [6]. Band-limited speech, in particular, has been shown to be well described by SIRPs [3]. If z is multidimensional, such as a window of samples in a time series or a multi-dimensional feature vector, one talks about Spherically Invariant Random Vectors (SIRVs). Natural images have been modeled by what is in essence closely related to SIRVs: an infinite mixture of zero-mean Gaussian features [11]. Similar models have also been used for financial time series [2]. \n\nThe fundamental property of SIRPs is that the joint distribution of a SIRP is entirely defined by a univariate characteristic function C_x(u) and the covariance \u03a3 of neighboring samples [6]. They are directly related to our scale-non-stationarity model through a theorem by Kingman and Yao, which states that any SIRP is equivalent to a zero-mean Gaussian process z(t) with an independent stochastic scale s. Furthermore, the univariate characteristic function C_x(u) specifies p_s(s) and the 1D marginal p_x(x), and vice versa [6]. From the characteristic function C_x(u) and the covariance \u03a3 one can also construct all higher-dimensional joint densities. 
This leads to the following relations between the marginal densities of various orders [3], \n\np_n(x) = \u03c0^{-n/2} f_n(x^T \u03a3^{-1} x), with x \u2208 R^n and \u03a3 = \u27e8x x^T\u27e9, (5) \n\nf_{n+2}(s) = -(d/ds) f_n(s),  f_n(s) = \u03c0^{-1/2} \u222b_{-\u221e}^{\u221e} f_{n+1}(s + y^2) dy. (6) \n\nIn particular these relations allow us to compute the joint density p_2(x(t), x(t+1)) from an empirically estimated marginal density p_1(x(t)) and the covariance of x(t) and x(t+1). Comparing the resulting 2D joint density to the observed joint density allows us to verify the assumption that the data is sampled from a SIRP. In so doing we can more firmly assert that the observed two-dimensional joint histograms can in fact be explained as a Gaussian process with a non-stationary scale. \n\nIf we use zero-mean Gaussian mixtures, p_1(x) = \u2211_{i=1}^{M} m_i exp(-x^2/\u03c3_i^2), as the 1D model distribution, the resulting 2D joint distribution is simply p_2(x) = \u2211_{i=1}^{M} m_i exp(-x^T \u03a3^{-1} x / \u03c3_i^2). If the model density is given by a Meijer-G function, as suggested in [3], with p_1(x) = (\u039b/\u0393(\u03bb)) G(\u03bb^2 x^2 | \u03bb - 0.5, \u03bb - 0.5), then the 2D joint is p_2(x) = (\u039b^2/\u0393(\u03bb)) G(\u03bb^2 x^T \u03a3^{-1} x | -0.5; 0, \u03bb, \u03bb). In both cases it is assumed that the data is normalized to unit variance. \n\nBrehm has used this approach to demonstrate that band-limited speech is well described by a SIRP [3]. In addition, we show here that the same is true for the image features and stock market data presented above. The model conditional densities shown in Figure 2 correspond well with the empirical conditional histograms. In particular they exhibit the characteristic bow-tie structure. We emphasize that these model 2D joint densities have been obtained only from the 1D marginals of Figure 1 and the covariance of neighboring samples. 
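The Gaussian-mixture case can be illustrated numerically: given only a 1D mixture marginal and the covariance of neighboring samples, each mixture kernel lifts to a 2D Gaussian with covariance sigma_i^2 * Sigma. The mixture weights, scales, and correlation below are hypothetical values chosen for this sketch; the checks confirm that marginalizing the constructed joint recovers the 1D density and that the conditional spread grows with the conditioning value, i.e. the bow-tie.

```python
import numpy as np

# Hypothetical zero-mean Gaussian-mixture marginal: weights m_i, scales sigma_i
m = np.array([0.6, 0.3, 0.1])
sig = np.array([0.5, 1.0, 2.5])

def p1(x):
    """1D marginal density: normalized mixture of zero-mean Gaussians."""
    x = np.asarray(x, dtype=float)[..., None]
    g = np.exp(-x**2 / (2 * sig**2)) / np.sqrt(2 * np.pi * sig**2)
    return g @ m

# Covariance of neighboring samples (unit variance, correlation rho)
rho = 0.7
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Sinv = np.linalg.inv(Sigma)
det = np.linalg.det(Sigma)

def p2(x1, x2):
    """2D joint implied by the scale-mixture model: each kernel i becomes
    a 2D Gaussian with covariance sigma_i^2 * Sigma."""
    q = Sinv[0, 0] * x1**2 + 2 * Sinv[0, 1] * x1 * x2 + Sinv[1, 1] * x2**2
    return sum(mi * np.exp(-q / (2 * s**2)) / (2 * np.pi * s**2 * np.sqrt(det))
               for mi, s in zip(m, sig))

x2 = np.linspace(-20.0, 20.0, 8001)
dx = x2[1] - x2[0]

# Integrating the joint over x2 recovers the 1D marginal at each x1
marg = np.array([np.sum(p2(a, x2)) * dx for a in (0.0, 0.8, 2.0)])

def cond_m2(a):
    """Second moment of x2 conditioned on x1 = a; grows with |a| (bow-tie)."""
    w = p2(a, x2)
    return np.sum(x2**2 * w) / np.sum(w)
```

The broadening captured by `cond_m2` is the same effect seen in the conditional histograms of Figure 2: conditioning on a large value of one variable favors the wide mixture kernels, which widens the distribution of its neighbor.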
\n\nThe deviations between the observed and model 2D joint distributions are likely due to variability of the covariance itself; that is, not only does the overall scale or power vary with time, but the components of the covariance matrix vary independently of each other. For example, in speech the covariance of neighboring samples is well known to change considerably over time. Nevertheless, the surprising result is that a simple scale non-stationarity model can reproduce the higher-order statistical properties of a variety of natural signals. \n\n5 Spectro-temporal linear basis for speech \n\nAs an example of the utility of this non-stationarity assumption, we analyze the statistical properties of the powers of a single source, in particular for speech signals. Motivated by the auditory spectro-temporal receptive fields reported in [5] and work on receptive fields and independent components, we are interested in finding a linear basis of independent components in a spectro-temporal window of speech signals. In [9, 8] we show that one can use second-order statistics to uniquely recover sources from a mixture provided that the mix is linear and the sources are non-stationary. One can do so by finding a basis that guarantees uncorrelated signals at multiple time intervals (the multiple decorrelation algorithm, MDA). Our present model argues that features of natural signals such as the powers in different frequency bands can be assumed non-stationary, while powers of independent signals are known to add \n\nFigure 3: Spectro-temporal representation of speech for the utterance \"We had a barbecue over the weekend at my house.\" (panels, top to bottom: log-powers, PCA, MDA, ICA-JADE). One pixel in the horizontal direction corresponds to 16 ms. In the vertical direction 21 Bark-scale power bands are displayed. The upper diagram shows the log-powers for a 2.5 s segment of the 200 s recording used to compute the different linear bases. 
The three lower diagrams show three sets of 15 linear basis components for 21x8 spectro-temporal segments of the speech powers. The sets correspond to PCA, MDA, and ICA respectively. Note that these are not log-powers, hence the smaller contribution of the high frequencies as compared to the log-power plot on top. \n\nlinearly. We should therefore be able to identify with second-order methods the same linear components as with independent component algorithms, where higher-order statistical assumptions are invoked. \n\nWe compute the powers in 21 frequency bands on a Bark scale for short consecutive time intervals. We choose to find a basis for a segment of 21 bands and 8 neighboring time slices, corresponding to 128 ms of signal between 0 and 4 kHz. We used half-overlapping windows of 256 samples, such that for an 8 kHz signal neighboring time slices are 16 ms apart. A set of 7808 such spectro-temporal segments was sampled from 200 s of the same speech data presented previously. Figure 3 shows the results obtained for a subspace of 15 components. One can see that the components obtained with MDA are quite similar to the result of ICA and differ considerably from the principal components. From this we conclude that speech powers can in fact be thought of as a linear combination of non-stationary independent components. In general, the point we wish to make is to demonstrate the strength of second-order methods when the assumptions of non-stationarity, independence, and linear superposition are met. \n\n6 Conclusion \n\nWe have presented evidence that several higher-order statistical properties of natural signals can be explained by a simple scale non-stationary model. For four types of natural signals, we have shown that a scale non-stationary model will reproduce the high-kurtosis behavior of the marginal densities. 
Furthermore, for the case of scale non-stationarity with Gaussian density (SIRP), we have shown that we can reproduce the variance dependency seen in conditional histograms of the joint density directly from the empirical marginal densities. This leads to the conclusion that a scale non-stationary model (e.g. SIRP) is a good model for these natural signals. We have also shown that one can exploit the assumptions of this model to compute a linear basis for natural signals without having to invoke higher-order statistical techniques. Though we do not claim that all higher-order properties or all natural signals can be explained by a scale non-stationary model, it is remarkable that such a simple model can account for a variety of the higher-order phenomena and for a variety of signal types. \n\nReferences \n\n[1] E.M.L. Beale and C.L. Mallows. Scale mixing of symmetric distributions with zero means. Annals of Mathematical Statistics, 30:1145-1151, 1959. \n\n[2] T. P. Bollerslev, R. F. Engle, and D. B. Nelson. ARCH models. In R. F. Engle and D. L. McFadden, editors, Handbook of Econometrics, volume IV. North-Holland, 1994. \n\n[3] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119-141, 1987. \n\n[4] Phil Brodatz. Textures: A Photographic Album for Artists and Designers. Dover, 1999. \n\n[5] Christopher deCharms and Michael Merzenich. Characteristic neurons in the primary auditory cortex of the awake primate using reverse correlation. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, pages 124-130, 1998. \n\n[6] Joel Goldman. Detection in the presence of spherically symmetric random vectors. IEEE Transactions on Information Theory, 22(1):52-59, January 1976. \n\n[7] M.G. Kendall and A. Stuart. The Advanced Theory of Statistics. 
Charles Griffin & Company Limited, London, 1969. \n\n[8] L. Parra and C. Spence. Convolutive blind source separation of non-stationary sources. IEEE Trans. on Speech and Audio Processing, pages 320-327, May 2000. \n\n[9] Lucas Parra and Clay Spence. Separation of non-stationary sources. In Stephen Roberts and Richard Everson, editors, Independent Component Analysis: Principles and Practice. Cambridge University Press, 2001. \n\n[10] Akaysha Tang, Barak Pearlmutter, Dan Phung, and Scott Carter. Independent components of magnetoencephalography. Neural Computation, submitted. \n\n[11] Martin J. Wainwright and Eero P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. M\u00fcller, editors, Advances in Neural Information Processing Systems 12, pages 855-861, Cambridge, MA, 2000. MIT Press. \n", "award": [], "sourceid": 1926, "authors": [{"given_name": "Lucas", "family_name": "Parra", "institution": null}, {"given_name": "Clay", "family_name": "Spence", "institution": null}, {"given_name": "Paul", "family_name": "Sajda", "institution": null}]}