{"title": "Modeling Natural Sounds with Modulation Cascade Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1545, "page_last": 1552, "abstract": "Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (~1s); phonemes (~0.1s); glottal pulses (~0.01s); and formants (<0.001s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task.", "full_text": "Modeling Natural Sounds\n\nwith Modulation Cascade Processes\n\nRichard E. Turner and Maneesh Sahani\nGatsby Computational Neuroscience Unit\n\n17 Alexandra House, Queen Square, London, WC1N 3AR, London\n\nAbstract\n\nNatural sounds are structured on many time-scales. 
A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10−1 s); glottal pulses (∼ 10−2 s); and formants (≲ 10−3 s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task.\n\n1 Introduction\n\nComputational models for sensory processing are still in their infancy, but one promising approach has been to compare aspects of sensory processing with aspects of machine-learning algorithms crafted to solve the same putative task. A particularly fruitful approach in this vein uses the generative modeling framework to derive these learning algorithms. 
For example, Independent Component Analysis (ICA) and Sparse Coding (SC), Slow Feature Analysis (SFA), and Gaussian Scale Mixture Models (GSMMs) are examples of algorithms corresponding to generative models that show similarities with visual processing [3]. In contrast, there has been much less success in the auditory domain, and this is due in part to the paucity of flexible models with an explicit temporal dimension (although see [2]). The purpose of this paper is to address this imbalance.\n\nThis paper has three parts. In the first we review models for the short-time structure of sound and argue that a probabilistic time-frequency model has several distinct benefits over traditional time-frequency representations for auditory modeling. In the second we review a model for the long-time structure in sounds, called probabilistic amplitude demodulation. In the third section these two models are combined with the notion of auditory features to produce a full generative model for sounds called the Modulation Cascade Process (MCP). We then show how to carry out learning and inference in such a complex hierarchical model, and provide results on speech for complete and missing data tasks.\n\n2 Probabilistic Time-Frequency Representations\n\nMost representations of sound focus on the short temporal structures. Short segments (< 10−1 s) are frequently periodic and can often be efficiently represented in a Fourier basis as the weighted sum of a few sinusoids. Of course, the spectral content of natural sounds changes slowly over time. This is handled by time-frequency representations, such as the Short-Time Fourier Transform (STFT) and spectrogram, which indicate the spectral content of a local, windowed section of the sound. 
More specifically, the STFT (x_{d,t}) and spectrogram (s_{d,t}) of a discretised sound (y_{t′}) are given by,\n\nx_{d,t} = Σ_{t′=1}^{T′} r_{t−t′} y_{t′} exp (−i ω_d t′),   s_{d,t} = log |x_{d,t}|.   (1)\n\nThe (possibly frequency dependent) duration of the window (r_{t−t′}) must be chosen carefully, as it controls whether features are represented in the spectrum or in the time-variation of the spectra. For example, the window for speech is typically chosen to last for several pitch periods, so that both pitch and formant information is represented spectrally.\n\nThe first stage of the auditory pathway derives a time-frequency-like representation mechanically at the basilar membrane. Subsequent stages extract progressively more complex auditory features, with structure extending over more time. Thus, computational models of auditory processing often begin with a time-frequency (or auditory-filter bank) decomposition, deriving new representations from the time-frequency coefficients [4]. Machine-learning algorithms also typically operate on the time-frequency coefficients, and not directly on the waveform. The potential advantage lies in the ease with which auditory features may be extracted from the STFT representation. There are, however, associated problems. For example, time-frequency representations tend to be over-complete (e.g. the number of STFT coefficients tends to be larger than the number of samples of the original sound, T × D > T′). This means that realisable sounds live on a manifold in the time-frequency space (for the STFT this manifold is a hyper-plane). Algorithms that solve tasks like filling-in missing data or denoising must ensure that the new coefficients lie on the manifold. Typically this is achieved in an ad hoc manner by projecting time-frequency coefficients back onto the manifold according to an arbitrary metric [5]. 
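The windowed sum in Eq. (1) is straightforward to sketch in code. The Hann window and hop size below are illustrative assumptions (the text leaves r_{t−t′} unspecified beyond noting it may be frequency dependent), as is the function name:

```python
import numpy as np

def stft_coeffs(y, omegas, win_len=64, hop=16):
    # x_{d,t} = sum_{t'} r_{t-t'} y_{t'} exp(-i w_d t'),  s_{d,t} = log|x_{d,t}|
    window = np.hanning(win_len)                      # illustrative choice of r
    starts = np.arange(0, len(y) - win_len + 1, hop)  # window positions t
    X = np.zeros((len(omegas), len(starts)), dtype=complex)
    for j, t in enumerate(starts):
        seg = window * y[t:t + win_len]
        tp = np.arange(t, t + win_len)                # absolute sample index t'
        for d, w in enumerate(omegas):
            X[d, j] = np.sum(seg * np.exp(-1j * w * tp))
    S = np.log(np.abs(X) + 1e-12)                     # log-magnitude spectrogram
    return X, S
```

With D frequencies and more than one coefficient per sample, the over-completeness (T × D > T′) discussed above is immediate.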
For generative models of time-frequency coefficients, it is difficult to force the model to generate only on the realisable manifold. An alternative is to base a probabilistic model of the waveform on the same heuristics that led to the original time-frequency representation. Not only does this side-step the generation problem, but it also allows parameters of the representation, like the “window”, to be chosen automatically.\n\nThe heuristic behind the STFT – that sound comprises sinusoids in slowly-varying linear superposition – led Qi et al [6] to propose a probabilistic algorithm called Bayesian Spectrum Estimation (BSE), in which the sinusoid coefficients (x_{d,t}) are latent variables. The forward model is,\n\np(x_{d,t}|x_{d,t−1}) = Norm (λ_d x_{d,t−1}, σ²_d),   (2)\np(y_t|x_t) = Norm (Σ_d x_{d,t} sin (ω_d t + φ_d), σ²_y).   (3)\n\nThe prior distribution over the coefficients is Gaussian and auto-regressive, evolving at a rate controlled by the dynamical parameters λ_d and σ²_d. Thus, as λ_d → 1 and σ²_d → 0 the processes become very slow, and as λ_d → 0 and σ²_d → ∞ they become very fast. More precisely, the length-scale of the coefficients is given by τ_d = −1/ log(λ_d). The observations are generated by a weighted sum of sinusoids, plus Gaussian noise. This model is essentially a Linear Gaussian State Space System with time varying weights defined by the sinusoids. Thus, inference is simple, proceeding via the Kalman Smoother recursions with time-varying weights. In effect, these recursions dynamically adjust the window used to derive the coefficients, based on the past history of the stimulus. BSE is a model for the short-time structure of sounds and it will essentially form the bottom level of the MCP. 
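Because the emission weights sin(ω_d t + φ_d) are known functions of time, the standard Kalman recursions apply directly. A minimal forward-pass sketch (the smoother would add the usual backward pass; the function name and all parameter values are illustrative, not the paper's):

```python
import numpy as np

def bse_filter(y, lam, sig2, omegas, phis, sig2_y):
    # Kalman filtering for BSE: AR(1) priors on the sinusoid coefficients
    # (Eq. 2) and a sinusoid-weighted scalar emission (Eq. 3).
    lam, sig2 = np.asarray(lam, float), np.asarray(sig2, float)
    omegas, phis = np.asarray(omegas, float), np.asarray(phis, float)
    D, T = len(omegas), len(y)
    A, Q = np.diag(lam), np.diag(sig2)
    m, P = np.zeros(D), np.eye(D)           # unit Gaussian prior on x_0
    means = np.zeros((T, D))
    for t in range(T):
        if t > 0:                            # predict
            m = A @ m
            P = A @ P @ A.T + Q
        w = np.sin(omegas * t + phis)        # time-varying emission weights
        S = w @ P @ w + sig2_y               # innovation variance
        K = P @ w / S                        # Kalman gain
        m = m + K * (y[t] - w @ m)           # correct
        P = P - np.outer(K, w) @ P
        means[t] = m
    return means
```

On a clean sinusoid the filtered coefficient locks onto the true amplitude, which is the "dynamically adjusted window" behaviour described above.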
In the next section we turn our attention to a model of the longer-time structure.\n\n3 Probabilistic Demodulation Cascade\n\nA salient property of the long-time statistics of sounds is the persistence of strong amplitude modulation [7]. Speech, for example, contains power in isolated regions corresponding to phonemes.\n\np(z(m)_0) = Norm (0, 1),   p(z(m)_t|z(m)_{t−1}) = Norm (λ_m z(m)_{t−1}, σ²_m) ∀t > 0,   (4)\nx(m)_t = f_{a(m)}(z(m)_t) ∀m > 1,   (5)\np(x(1)_t) = Norm (0, 1),   y_t = ∏_{m=1}^{M} x(m)_t.   (6)\n\nThe phonemes themselves are localised into words, and then into sentences. Motivated by these observations, Anonymous Authors [8] have proposed a model for the long-time structures in sounds using a demodulation cascade. The basic idea of the demodulation cascade is to represent a sound as a product of processes drawn from a hierarchy, or cascade, of progressively longer time-scale modulators. For speech this might involve three processes: representing sentences on top, phonemes in the middle, and pitch and formants at the bottom (e.g. fig. 1A and B). To construct such a representation, one might start with a traditional amplitude demodulation algorithm, which decomposes a signal into a quickly-varying carrier and more slowly-varying envelope. The cascade could then be built by applying the same algorithm to the (possibly transformed) envelope, and then to the envelope that results from this, and so on. This procedure is only stable, however, if both the carrier and the envelope found by the demodulation algorithm are well-behaved. Unfortunately, traditional methods (like the Hilbert Transform, or low-pass filtering a non-linear transformation of the stimulus) return a suitable carrier or envelope, but not both. 
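For concreteness, the classical Hilbert-transform route mentioned above can be sketched via the FFT; the envelope it returns is well-behaved, but the implied carrier (signal divided by envelope) need not be, which is the instability noted in the text. The function name is ours:

```python
import numpy as np

def hilbert_envelope(y):
    # Analytic-signal demodulation: zero the negative frequencies,
    # double the positive ones, and take the magnitude.
    n = len(y)
    Y = np.fft.fft(y)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    env = np.abs(np.fft.ifft(Y * h))
    carrier = y / np.maximum(env, 1e-12)   # implied carrier; can be badly behaved
    return env, carrier
```

For a narrow-band modulated tone the envelope is recovered almost exactly; for broad-band carriers (the natural-sound case) the carrier estimate degrades, motivating the probabilistic approach that follows.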
A new approach to amplitude demodulation is thus called for.\n\nIn a nutshell, the new approach is to view amplitude demodulation as a task of probabilistic inference. This is natural, as demodulation is fundamentally ill-posed — there are infinitely many decompositions of a signal into a positive envelope and real valued carrier — and so prior information must always be leveraged to realise such a decomposition. The generative model approach makes this information explicit. Furthermore, it is not necessary to use the recursive procedure (just described) to derive a modulation cascade: the whole hierarchy can be estimated at once using a single generative model. The generative model for Probabilistic Amplitude Demodulation (PAD) is given by Eqs. (4)-(6) above.\n\nA set of modulators (X_{2:M}) are drawn in a two stage process: First a set of slowly varying processes (Z_{2:M}) are drawn from a one-step linear Gaussian prior (identical to Eq. 2). The effective length-scales of these processes, inherited by the modulators, are ordered such that λ_m > λ_{m−1}. Second the modulators are formed by passing these variables through a point-wise non-linearity to enforce positivity. A typical choice might be\n\nf_{a(m)}(z(m)_t) = log (exp (z(m)_t + a(m)) + 1),\n\nwhich is logarithmic for large negative values of z(m)_t, and linear for large positive values. This transforms the Gaussian distribution over z(m)_t into a sparse, non-negative, distribution, which is a good match to the marginal distributions of natural envelopes. The parameter a(m) controls exactly where the transition from log to linear occurs, and consequently alters the degree of sparsity. These positive signals modulate a Gaussian white-noise carrier, to yield observations y_{1:T} by a simple point-wise product. 
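Ancestral sampling from Eqs. (4)-(6) is direct: AR(1) latents, the softplus-style nonlinearity given above, a white-noise carrier, and a point-wise product. A minimal sketch (the function name and parameter values are illustrative assumptions):

```python
import numpy as np

def sample_pad(T, lams, sig2s, a, rng=None):
    # Draw from the PAD generative model, Eqs. (4)-(6).
    rng = np.random.default_rng(rng)
    y = rng.standard_normal(T)                     # carrier x^(1), Gaussian white noise
    mods = []
    for lam, s2, a_m in zip(lams, sig2s, a):       # modulator levels m = 2..M
        z = np.empty(T)
        z[0] = rng.standard_normal()               # z^(m)_0 ~ Norm(0, 1)
        for t in range(1, T):                      # AR(1): Norm(lam * z_{t-1}, s2)
            z[t] = lam * z[t - 1] + np.sqrt(s2) * rng.standard_normal()
        x = np.logaddexp(0.0, z + a_m)             # f_a(z) = log(exp(z + a) + 1) > 0
        mods.append(x)
        y = y * x                                  # cascade product
    return y, mods
```

Ordering the (lam, s2) pairs from fast to slow reproduces the phoneme-over-sentence structure of fig. 1C.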
A typical draw from this generative model can be seen in Fig. 1C. This model is a fairly crude one for natural sounds. For example, as described in the previous section, we expect that the carrier process will be structured and yet it is modelled as Gaussian white noise. The surprising observation is that this very simple model is excellent at demodulation.\n\nInference in this model typically proceeds by a zero-temperature EM-like procedure. Firstly the carrier (x(1)_t) is integrated out and then the modulators are found by maximum a posteriori (MAP). Slower, more Bayesian algorithms that integrate over the modulators using MCMC indicate that this approximation is not too severe, and the results are compelling.\n\n4 Modulation Cascade Processes\n\nWe have reviewed two contrasting models: The first captures the local harmonic structure of sounds, but has no long-time structure; The second captures long-time amplitude modulations, but models the short-time structure as white noise. The goal of this section is to synthesise both to form a new model. We are guided by the observation that the auditory system might implement a similar synthesis. In the well-known psychophysical phenomenon of comodulation masking release (see [9] for a review), a tone masked by noise with a bandwidth greater than an auditory filter becomes audible\n\nFigure 1: An example of a modulation-cascade representation of speech (A and B) and typical samples from generative models used to derive that representation (C). A) The spoken-speech waveform (black) is represented as the product of a carrier (blue), a phoneme modulator (red) and a sentence modulator (magenta). B) A close up of the first sentence (2 s) additionally showing the derived envelope (x(2)_t x(3)_t) superposed onto the speech (red, bottom panel). 
C) The generative model (M = 3) with a carrier (blue), a phoneme modulator (red) and a sentence modulator (magenta).\n\nif the noise masker is amplitude modulated. This suggests that long-time envelope information is processed and analysed across (short-time) frequency channels in the auditory system.\n\nA simple way to combine the two models would be to express each filter coefficient of the time-frequency model as a product of processes (e.g. x_{d,t} = x(1)_{d,t} x(2)_{d,t}). However, power across even widely separated channels of natural sounds can be strongly correlated [7]. Furthermore, comodulation masking release suggests that amplitude-modulation is processed across frequency channels and not independently in each channel. Presumably this reflects the collective modulation of wide-band (or harmonic) sounds, with features that span many frequencies. Thus, a synthesis of BSE and PAD should incorporate the notion of auditory features.\n\nThe forward model. The Modulation Cascade Process (MCP) is given by\n\np(z(m)_{k_m,0}) = Norm (0, 1),   p(z(m)_{k_m,t}|z(m)_{k_m,t−1}, θ) = Norm (λ(m) z(m)_{k_m,t−1}, σ²_{(m)})   m = 1 : 3, t > 0,   (7)\nx(m)_{k_m,t} = f(z(m)_{k_m,t}, a(m))   m = 1 : 3, t ≥ 0,   (8)\np(y_t|x(m)_t, θ) = Norm (μ_{y_t}, σ²_y),   μ_{y_t} = Σ_{d,k_1,k_2} g_{d,k_1,k_2} x(1)_{k_1,t} x(2)_{k_2,t} x(3)_t sin (ω_d t + φ_d).   (9)\n\nOnce again, latent variables are arranged in a hierarchy according to their time-scales (which depend on m). At the top of the hierarchy is a long-time process which models slow structures, like the sentences of speech. The next level models more quickly varying structure (like phonemes). Finally, the bottom level of the hierarchy captures short-time variability (intra-phoneme variability for instance). 
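The emission mean μ_{y_t} in Eq. (9) is a contraction of the generative tensor against the outer product of the three modulator levels, which a single einsum expresses compactly. A sketch under assumed array shapes (the function name is ours, not the paper's):

```python
import numpy as np

def mcp_mean(g, x1, x2, x3, omegas, phis):
    # mu_{y_t} = sum_{d,k1,k2} g[d,k1,k2] x1[t,k1] x2[t,k2] x3[t] sin(w_d t + phi_d)
    # Shapes: g (D,K1,K2), x1 (T,K1), x2 (T,K2), x3 (T,), omegas/phis (D,).
    T = x1.shape[0]
    sins = np.sin(np.outer(np.arange(T), omegas) + phis)   # (T, D)
    coeff = np.einsum('dij,ti,tj->td', g, x1, x2)          # per-frequency weight
    return x3 * np.sum(coeff * sins, axis=1)               # (T,)
```

The outer-product structure is what lets one second-level modulator gate a whole group of spectral features at once.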
Unlike in PAD, the middle and lower levels now contain multiple processes. So, for example if K1 = 4 and K2 = 2, there would be four quickly varying modulators in the lower level, two modulators in the middle level, and one slowly varying modulator at the top (see fig. 2A).\n\nThe idea is that the modulators in the first level independently control the presence or absence of individual spectral features (given by Σ_d g_{d,k_1,k_2} sin (ω_d t + φ_d)). For example, in speech a typical phoneme might be periodic, but this periodicity might change systematically as the speaker alters their pitch. This change in pitch might be modeled using two spectral features: one for the start of the phoneme and one for the end, with a region of coactivation in the middle.\n\nFigure 2: A. Schematic representation of the MCP forward model in the simple case when K1 = 4, K2 = 2 and D = 6. The hierarchy of latent variables moves from the slowest modulator at the top (magenta) to the fastest (blue) with an intermediate modulator between (red). The outer-product of the modulators multiplies the generative weights (black and white, only 4 of the 8 shown). In turn, these modulate sinusoids (top right) which are summed to produce the observations (bottom right). B. A draw from the forward model using parameters learned from a spoken-sentence (see the results section for more details of the model). The grey bars on the top four panels indicate the region depicted in the bottom four panels.\n\nIndeed it is because speech and other natural sounds are not precisely stationary even over short time-scales that we require the lowest layer of the hierarchy. The role of the modulators in the second level is to simultaneously turn on groups of similar features. For example, one modulator might control the presence of all the harmonic features and the other the broad-band features. 
Finally the top level modulator gates all the auditory features at once. Fig. 2B shows a draw from the forward model for a more complicated example. Promisingly the samples share many features of natural sounds.\n\nRelationship to other models. This model has an interesting relationship to previous statistical models and in particular to the GSMMs. It is well known that when ICA is applied to data from natural scenes the inferred filter coefficients tend not to be independent (see [3, 10]), with coefficients corresponding to similar filters sharing power. GSMMs model dependencies using a hierarchical framework, in which the distribution over the coefficients depends on a set of latent variables that introduce correlations between their powers. The MCP is similar, in that the higher level latent variables alter the power of similar auditory features. Indeed, we suggest that the correlations in the power of ICA coefficients are a sign that AM is prevalent in natural scenes. The MCP can be seen as a generalisation of the GSMMs to include time-varying latent variables, a deeper hierarchy and a probabilistic time-frequency representation.\n\nInference and learning algorithms. Any type of learning in the MCP is computationally demanding. Motivated by speed, and encouraged by the results from PAD, the aim will therefore be to find a joint MAP estimate of the latent variables and the weights, that is\n\nX, G = arg max_{X,G} log p(X, Y, G|θ).   (10)\n\nNote that we have introduced a prior over the generative tensor. This prevents an undesirable feature of combined MAP and ML inference in such models: namely that the weights grow without bound, enabling the modal values of latent variables to shrink towards zero, increasing their density under the prior. 
The resulting cost function is,\n\nlog p(X, Y, G|θ) = Σ_{t=1}^{T} log p(y_t|x(1)_t, x(2)_t, x(3)_t) + Σ_{m=1}^{3} Σ_{k_m} ( Σ_{t=1}^{T} log p(z(m)_{k_m,t}|z(m)_{k_m,t−1}) + Σ_{t=0}^{T} log |dz(m)_{k_m,t}/dx(m)_{k_m,t}| + log p(z(m)_{k_m,0}) ) + Σ_{k_1,k_2,d} log p(g_{d,k_1,k_2}).   (11)\n\nWe would like to optimise this objective-function with respect to the latent variables (x(m)_{k_m,t}) and the generative tensor (g_{d,k_1,k_2}). There are, however, two main obstacles. The first is that there are a large number of latent variables to estimate (T × (K1 + K2)), making inference slow. The second is that the generative tensor contains a large number of elements, D × K1 × K2, making learning slow too. The solution is to find a good initialisation procedure, and then to fine-tune using a slow EM-like algorithm that iterates between updating the latents and the weights. First we outline the initialisation procedure.\n\nThe key to learning complicated hierarchical models is to initialise well, and so the procedure developed for the MCP will be explained in some detail. The main idea is to learn the model one layer at a time. This is achieved by clamping the upper layers of the hierarchy that are not being learned to unity. In the first stage of the initialisation, for example, the top and middle levels of the hierarchy are clamped and the mean of the emission distribution becomes\n\nμ_{y_t} = Σ_{d,k_1} γ_{d,k_1} x(1)_{k_1,t} sin (ω_d t + φ_d),   (12)\n\nwhere γ_{d,k_1} = Σ_{k_2} g_{d,k_1,k_2}. Learning and inference then proceed by gradient based optimisation of the cost-function (log p(X, Y, G|θ)) with respect to the un-clamped latents (x(1)_{k_1,t}) and the contracted generative weights (γ_{d,k_1}). 
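The contraction used in this first stage, and the two ways of re-expanding γ into a full tensor for the next stage, can be sketched as one-liners (function names are ours, not the paper's):

```python
import numpy as np

def contract_weights(g):
    # With upper layers clamped to one, the tensor only enters through
    # gamma_{d,k1} = sum_{k2} g_{d,k1,k2}; g has shape (D, K1, K2).
    return g.sum(axis=2)

def expand_uniform(gamma, K2):
    # Spread gamma evenly: g_{d,k1,k2} = gamma_{d,k1} / K2.
    return np.repeat(gamma[:, :, None], K2, axis=2) / K2

def expand_delta(gamma, K2, j):
    # Chunk-based alternative, g = gamma * delta_{k2,j}: all mass on the
    # single second-level modulator assumed active during the chunk.
    g = np.zeros(gamma.shape + (K2,))
    g[:, :, j] = gamma
    return g
```

Both expansions are consistent with the first stage in the sense that contracting them over k2 recovers γ exactly.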
This is much faster than the full optimisation as there are both fewer latents and fewer parameters to estimate. When this process is complete, the second layer of latent variables is un-clamped, and learning of these variables commences. This requires the full generative tensor, which must be initialised from the contracted generative weights learned at the previous stage. One choice is to set the individual weights to their averages, g_{d,k_1,k_2} = γ_{d,k_1}/K2, and this works well, but empirically slows learning. An alternative is to use small chunks of sounds to learn the lower level weights. These chunks are chosen to be relatively stationary segments that have a time-scale similar to the second-level modulators. This allows us to make the simplifying assumption that just one second-level modulator was active during the chunk. The generative tensor can therefore be initialised using g_{d,k_1,k_2} = γ_{d,k_1} δ_{k_2,j}. Typically this method causes the second stage of learning to converge faster, and to a similar solution.\n\nIn contrast to the initialisation, the fine tuning algorithm is simple. In the E-Step the latent variables are updated simultaneously using gradient based optimisation of Eq. 11. In the M-Step, the generative tensor is updated using co-ordinate ascent. That is to say that we sequentially update each g_{k_1,k_2} using gradient based optimisation of Eq. 11 and iterate over k_1 and k_2. In principle, joint optimisation of the generative tensor and latent variables is possible, but the memory requirements are prohibitive. This is also why co-ordinate ascent is used to learn the generative tensor (rather than using the usual linear regression solution which involves a prohibitive matrix inverse).\n\n5 Results\n\nThe MCP was trained on a spoken sentence, lasting 2 s and sampled at 8000 Hz, using the algorithm outlined in the previous section. The time-scales of the modulators were chosen to be {20 ms, 200 ms, 2 s}. 
The time-frequency representation had D/2 = 100 sines and D/2 = 100 cosines spaced logarithmically from 100-4000 Hz. The model was given K1 = 18 latent variables in the first level of the hierarchy, and K2 = 6 in the second. Learning took 16 hrs to run on a 2.2 GHz Opteron with 2 Gb of memory.\n\nFigure 3: Application of the MCP to speech. Left panels: The inferred latent variable hierarchy. At top is the sentence modulator (magenta). Next are the phoneme modulators, followed by the intra-phoneme modulators. These are coloured according to which of the phoneme modulators they interact most strongly with. The speech waveform is shown in the bottom panel. Right panels: The learned spectral features (√(g²_sin + g²_cos)) coloured according to phoneme modulator. For example, the top panel shows the spectra from g_{k_1=1:18,k_2=1}. Spectra corresponding to one phoneme modulator look similar and often the features only differ in their phase.\n\nThe results can be seen in Fig. 3. The MCP recovers a sentence modulator, phoneme modulators, and intra-phoneme modulators. Typically a pair of features are used to model a phoneme, and often they have similar spectra as expected. The spectra fall into distinct classes: those which are harmonic (modelling voiced features) and those which are broad-band (modelling unvoiced features). One way of assessing which features of speech the model captures is to sample from the forward model using the learned parameters. This can be seen in Fig. 2B. The conclusion is that the model is capturing structure across a wide range of time-scales: formants and pitch structure, phoneme structure, and sentence structure.\n\nThere are, however, two noticeable differences between the real and generated data. The first is that the generated data contain fewer transients and noise segments than natural speech, and more vowel-like components. 
The reason for this is that at the sampling rates used, many of the noisy segments are indistinguishable from white-noise and are explained using observation noise. These problems are alleviated by moving to higher sampling rates, but the algorithm is then markedly slower. The second difference concerns the inferred and generated latent variables, in that the former are much sparser than the latter. The reason is that the learned generative tensor contains many g_{k_1,k_2} which are nearly zero. In generation, this means that significant contributions to the output are only made when particular pairs of phoneme and intra-phoneme modulators are active. So although many modulators are active at one time, only one or two make sizeable contributions. Conversely, in inference, we can only get information about the value of a modulator when it is part of a contributing pair. If this is not the case, the inference goes to the maximum of the prior, which is zero. In effect there are large error-bars on the non-contributing components’ estimates.\n\nFinally, to indicate the improvement of the MCP over PAD and BSE, we compare the algorithms’ abilities to fill in missing sections of a spoken sentence. The average root-mean-squared (RMS) error per sample is used as a metric to compare the algorithms. In order to use the MCP to fill in the missing data, it is first necessary to learn a set of auditory features. The MCP was therefore trained on a different spoken sentence from the same speaker, before inference was carried out on the test data. To make the comparison fair, BSE is given an identical set of sinusoidal basis functions as MCP, and the associated smoothness priors were learned on the same training data.\n\nTypical results can be seen in fig. 4. On average the RMS errors for MCP, BSE and PAD were: {0.10, 0.30, 0.41}. 
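The metric itself is simple; a plain reading of it can be sketched as follows (the masking convention and function name are our assumptions). It also makes clear why an all-zero prediction, as PAD effectively produces, scores √(mean y²) over the gap and so serves as a baseline:

```python
import numpy as np

def rms_per_sample(y_true, y_pred, mask):
    # RMS error per sample, computed over the missing region only.
    err = y_true[mask] - y_pred[mask]
    return np.sqrt(np.mean(err ** 2))
```

Lower is better; comparing each algorithm's fill-in against the held-out waveform over the same mask gives the figures quoted above.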
As PAD models the carrier as white noise it predicts zeros in the missing regions and therefore it merely serves as a baseline in these experiments. Both MCP and BSE smoothly interpolate their latent variables over the missing region. However, whereas BSE smoothly interpolates each sinusoidal component independently, MCP interpolates the set of learned auditory features in a complex manner determined by the interaction of the modulators. It is for this reason that it improves over BSE by such a large margin.\n\nFigure 4: A selection of typical missing data results for three phonemes (columns). The top row shows the original speech segment with the missing regions shown in red. The middle row shows the predictions made by the MCP and the bottom row those made by BSE.\n\n6 Conclusion\n\nWe have introduced a neuroscience-inspired generative model for natural sounds that is capable of capturing structure spanning a wide range of temporal scales. The model is a marriage between a probabilistic time-frequency representation (that captures the short-time structure) and a probabilistic demodulation cascade (that captures the long-time structure). When the model is trained on a spoken sentence, the first level of the hierarchy learns auditory features (weighted sets of sinusoids) that capture structures like different voiced sections of speech. The upper levels comprise a temporally ordered set of modulators that are used to represent sentence structure, phoneme structure and intra-phoneme variability. The superiority of the new model over its parents was demonstrated in a missing data experiment where it out-performed the Bayesian time-frequency analysis by a large margin.\n\nReferences\n\n[1] Bregman, A.S. (1990) Auditory Scene Analysis. MIT Press.\n[2] Smith E. & Lewicki, M.S. (2006) Efficient Auditory Coding. Nature 439 (7079).\n[3] Simoncelli, E.P. 
(2003) Vision and the statistics of the visual environment. Curr Opin Neurobiol 13(2):144-9.\n[4] Patterson, R.D. (2000) Auditory images: How complex sounds are represented in the auditory system. J Acoust Soc Japan (E) 21(4):183-190.\n[5] Griffin, D. & Lim J. (1984) Signal estimation from modified short-time Fourier transform. IEEE Trans. on ASSP 32(2):236-243.\n[6] Qi, Y., Minka, T. & Picard, R.W. (2002) Bayesian Spectrum Estimation of Unevenly Sampled Nonstationary Data. MIT Media Lab Technical Report Vismod-TR-556.\n[7] Attias, H. & Schreiner, C.E. (1997) Low-Order Temporal Statistics of Natural Sounds. Adv in Neural Info Processing Sys 9. MIT Press.\n[8] Anonymous Authors (2007) Probabilistic Amplitude Demodulation. ICA 2007 Conference Proceedings. Springer, in press.\n[9] Moore, B.C.J. (2003) An Introduction to the Psychology of Hearing. Academic Press.\n[10] Karklin, Y. & Lewicki, M.S. (2005) A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Comput 17(2):397-423.", "award": [], "sourceid": 681, "authors": [{"given_name": "Richard", "family_name": "Turner", "institution": null}, {"given_name": "Maneesh", "family_name": "Sahani", "institution": null}]}