{"title": "Hierarchical spike coding of sound", "book": "Advances in Neural Information Processing Systems", "page_first": 3032, "page_last": 3040, "abstract": "We develop a probabilistic generative model for representing acoustic event structure at multiple scales via a two-stage hierarchy. The first stage consists of a spiking representation which encodes a sound with a sparse set of kernels at different frequencies positioned precisely in time. The coarse time and frequency statistical structure of the first-stage spikes is encoded by a second stage spiking representation, while fine-scale statistical regularities are encoded by recurrent interactions within the first-stage. When fitted to speech data, the model encodes acoustic features such as harmonic stacks, sweeps, and frequency modulations, that can be composed to represent complex acoustic events. The model is also able to synthesize sounds from the higher-level representation and provides significant improvement over wavelet thresholding techniques on a denoising task.", "full_text": "To appear in: Neural Information Processing Systems (NIPS),\n\nLake Tahoe, Nevada. December 3-6, 2012.\n\nHierarchical spike coding of sound\n\nYan Karklin\u2217\n\nChaitanya Ekanadham\u2217\n\nHoward Hughes Medical Institute,\n\nCourant Institute of Mathematical Sciences\n\nCenter for Neural Science\n\nNew York University\n\nyan.karklin@nyu.edu\n\nNew York University\n\nchaitu@math.nyu.edu\n\nEero P. Simoncelli\n\nHoward Hughes Medical Institute, Center for Neural Science,\n\nand Courant Institute of Mathematical Sciences\n\nNew York University\n\neero.simoncelli@nyu.edu\n\nAbstract\n\nNatural sounds exhibit complex statistical regularities at multiple scales. Acous-\ntic events underlying speech, for example, are characterized by precise temporal\nand frequency relationships, but they can also vary substantially according to the\npitch, duration, and other high-level properties of speech production. 
Learning\nthis structure from data while capturing the inherent variability is an important\n\ufb01rst step in building auditory processing systems, as well as understanding the\nmechanisms of auditory perception. Here we develop Hierarchical Spike Coding,\na two-layer probabilistic generative model for complex acoustic structure. The\n\ufb01rst layer consists of a sparse spiking representation that encodes the sound us-\ning kernels positioned precisely in time and frequency. Patterns in the positions\nof \ufb01rst layer spikes are learned from the data: on a coarse scale, statistical reg-\nularities are encoded by a second-layer spiking representation, while \ufb01ne-scale\nstructure is captured by recurrent interactions within the \ufb01rst layer. When \ufb01t to\nspeech data, the second layer acoustic features include harmonic stacks, sweeps,\nfrequency modulations, and precise temporal onsets, which can be composed to\nrepresent complex acoustic events. Unlike spectrogram-based methods, the model\ngives a probability distribution over sound pressure waveforms. This allows us to\nuse the second-layer representation to synthesize sounds directly, and to perform\nmodel-based denoising, on which we demonstrate a signi\ufb01cant improvement over\nstandard methods.\n\n1\n\nIntroduction\n\nNatural sounds, such as speech and animal vocalizations, consist of complex acoustic events oc-\ncurring at multiple scales. Precise timing and frequency relationships among these events convey\nimportant information about the sound, while intrinsic variability confounds simple approaches to\nsound processing and understanding. Speech, for example, can be described as a sequence of words,\nwhich are composed of precisely interrelated phones, but each utterance may have its own prosody,\nwith variable duration, loudness, and/or pitch. 
An auditory representation that captures the corre-\nsponding structure while remaining invariant to this variability would provide a useful \ufb01rst step for\nmany applications in auditory processing.\n\n\u2217Contributed equally\n\n1\n\n\fMany recent efforts to learn auditory representations in an unsupervised setting have focused on\nsparse decompositions chosen to capture structure inherent in sound ensembles. The dictionaries\ncan be chosen by hand [1, 2] or learned from data. For example, Klein et al [3] adapted a set of\ntime-frequency kernels to represent spectrograms of speech signals and showed that the resulting\nkernels were localized and bore resemblance to auditory receptive \ufb01elds. Lee et al [4] trained a\ntwo-layer deep belief network on spectrogram patches and used it for several auditory classi\ufb01cation\ntasks.\n\nThese approaches have several limitations. First, they operate on spectrograms (rather than the origi-\nnal sound waveforms), which impose limitations on both time and frequency resolution. In addition,\nmost models built on spectrograms rely on block-based partitioning of time, and thus are susceptible\nto artifacts \u2013 precisely-timed acoustic events can appear across multiple blocks and events can ap-\npear at different temporal offsets relative to the block, making their identi\ufb01cation and representation\ndif\ufb01cult [5]. The features learned by these models are tied to speci\ufb01c frequencies, and must be repli-\ncated at different frequency offsets to accommodate pitch shifts that occur in natural sounds. Finally,\nthe linear generative models underlying most methods are unsuitable for constructing hierarchical\nmodels, since the composition of multiple linear stages is again linear.\n\nTo address these limitations, we propose a two-layer hierarchical model that encodes complex acous-\ntic events using a representation that is shiftable in both time and frequency. 
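The block-artifact problem described above can be checked numerically. The following toy sketch (our own illustration, not from the paper; assumes numpy) contrasts a block-based encoding, whose coefficients change completely when an event shifts relative to the block grid, with a convolutional, shiftable encoding, whose coefficients simply translate with the event:

```python
import numpy as np

# An impulsive acoustic event at two offsets relative to a fixed block grid.
n, block = 64, 16
x1 = np.zeros(n); x1[14] = 1.0   # event in block 0
x2 = np.zeros(n); x2[18] = 1.0   # same event, 4 samples later, now in block 1

def block_coeffs(x):
    # Block-based analysis: spectral magnitudes of non-overlapping frames.
    return np.abs(np.fft.rfft(x.reshape(-1, block), axis=1))

# The two block encodings are not shifted copies of each other:
# the event lands in a different block, so its coefficients change entirely.
assert not np.allclose(block_coeffs(x1), block_coeffs(x2))

# A convolutional (shiftable) encoding translates covariantly with the input.
kernel = np.hanning(8)           # stand-in for a gammatone kernel
c1 = np.convolve(x1, kernel)
c2 = np.convolve(x2, kernel)
assert np.allclose(c2[4:], c1[:-4])   # same code, merely shifted by 4 samples
```

The same covariance in log-frequency is what allows a single kernel to serve at any pitch.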
The \ufb01rst layer is a\n\u201cspikegram\u201d representation of the sound pressure waveform, as developed in [6, 5]. The prior prob-\nabilities for coef\ufb01cients in the \ufb01rst layer are modulated by the output of the second layer, combined\nwith a recurrent component that operates within the \ufb01rst layer. When trained on speech, the kernels\nlearned at the second layer encode complex acoustic events which, when positioned at speci\ufb01c times\nand frequencies, compactly represent the \ufb01rst-layer spikegram, which is itself a compact description\nof the sound pressure waveform. Despite its very sparse activation, the second-layer representation\nretains much of the acoustic information: sounds sampled according to the generative model approx-\nimate well the original sound. Finally, we demonstrate that the model performs well on a denoising\ntask, particularly when the noise is structured, suggesting that the higher-order representation pro-\nvides a useful statistical description of speech.\n\n2 Hierarchical spike coding\n\nIn the \u201cspikegram\u201d representation [5], a sound is encoded using a linear combination of sparse,\ntime-shifted kernels \u03c6f (t):\n\nxt =X\u03c4,f\n\nS\u03c4,f \u03c6f (t \u2212 \u03c4 ) + \u01ebt\n\n(1)\n\nwhere \u01ebt denotes Gaussian white noise and the coef\ufb01cients S\u03c4,f are mostly zero. As in [5], the \u03c6f (t)\nare gammatone functions with varying center frequencies, indexed by f. In order to encode the sig-\nnal, a sparse set of \u201cspikes\u201d (i.e., nonzero coef\ufb01cients at speci\ufb01c times and frequencies) is estimated\nusing an approximate inference method, such as matching pursuit [7]. The resulting spikegram,\nshown in Fig. 
1b, offers an ef\ufb01cient representation of sounds [8] that avoids the blocking artifacts\nand time-frequency trade-offs associated with more traditional spectrogram representations.\n\nWe aim to model the statistical regularities present in the spikegram representations. Spikegrams ex-\nhibit clear statistical structure, both at coarse (Fig. 1b,c) and at \ufb01ne temporal scales (Fig. 1e,f). Spikes\nplaced at precise locations in time and frequency reveal acoustic features, harmonic structures, as\nwell as slow modulations in the sound envelope. The coarse scale non-stationarity is likely caused\nby higher-order acoustic events, such as phoneme utterances that span a much larger time-frequency\nrange than the individual gammatone kernels. On the other hand, the \ufb01ne-scale correlations are\ndue to some combination of the correlations inherent in the gammatone \ufb01lterbank and the precise\ntemporal structure present in speech.\n\nWe introduce the hierarchical spike coding (HSC) model, illustrated in Fig. 2, to capture the struc-\nture in the spikegrams (S(1)) on both coarse and \ufb01ne scales. We add a second layer of unobserved\nspikes (S(2)), assumed to arise from a Poisson process with constant rate \u03bb. These spikes are con-\nvolved with a set of time-frequency \u201crate kernels\u201d (K r) to yield the logarithm of the \ufb01ring rate of\nthe \ufb01rst-layer spikes on a coarse scale. 
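The coarse-scale part of this generative process can be sketched in a few lines. This is an illustrative toy with made-up dimensions (the paper's kernels span 400 ms x 3.8 octaves over a 200-channel spikegram): second-layer spikes are convolved with rate kernels, exponentiated to give a firing rate, and first-layer spikes are sampled accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
T, F, I = 200, 40, 3      # time bins, frequency bins, number of rate kernels
dt_df = 1e-3              # bin volume Delta_t * Delta_f

# Sparse second-layer spikes with exponentially distributed amplitudes.
S2 = (rng.random((I, T, F)) < 0.002) * rng.exponential(1.0, size=(I, T, F))

def conv2d(a, k):
    # Tiny 2-D convolution via shifts (circular boundary, fine for a sketch).
    out = np.zeros_like(a)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * np.roll(np.roll(a, i, axis=0), j, axis=1)
    return out

Kr = rng.normal(0.0, 0.5, size=(I, 5, 7))   # rate kernels (5 time x 7 freq bins)
br = -1.0                                    # log-rate bias

# Log firing rate of first-layer spikes: bias plus summed kernel convolutions.
R = br + sum(conv2d(S2[i], Kr[i]) for i in range(I))

# Per-bin spike probability is exp(R) * dt * df; sample a coarse spikegram.
p = np.clip(np.exp(R) * dt_df, 0.0, 1.0)
S1 = rng.random((T, F)) < p
```

Because the kernels add in the log-rate domain, overlapping second-layer spikes modulate the first-layer rate multiplicatively.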
On a fine scale, the logarithm of the firing rate of first-layer spikes is modulated using recurrent interactions, by convolving the local spike history with a set of \u201ccoupling kernels\u201d (K^c). The amplitudes of the first-layer spikes are also specified hierarchically: the logarithm of the amplitudes is assumed to be normally distributed, with a mean specified by the convolution of second-layer spikes with \u201camplitude kernels\u201d (K^a, not shown), without any recurrent contribution, and the variance fixed at \sigma^2.

Figure 1: Coarse (top row) and fine (bottom row) scale structure in spikegram encodings of speech. a. The sound pressure waveform of a spoken sentence and b. the corresponding spikegram. Each spike (dot) has an associated time (abscissa) and center frequency (ordinate) as well as an amplitude (dot size). c. Cross-correlation function for a spikegram ensemble reveals correlations across large time/frequency scales. d. Magnification of a portion of (a), with two gammatone kernels (red and blue), corresponding to the red and blue spikes in (e). e. Magnification of the corresponding portion of (b), revealing that spike timing exhibits strong regularities at a fine scale. f. Histograms of inter-spike-intervals for two frequency channels corresponding to the colored spikes in (e) reveal strong temporal dependencies.

The model parameters are denoted \Theta = (K^r, K^a, K^c, b^r, b^a), where b^r, b^a are the bias vectors corresponding to the log-rate and log-amplitude of the first-layer coefficients, respectively. The model specifies a conditional probability density over first-layer coefficients,

    P(S^{(1)}_{t,f} | S^{(2)}; \Theta) = (1 - p) \delta(S^{(1)}_{t,f}) + p N(\log S^{(1)}_{t,f}; A_{t,f}, \sigma^2)  for S^{(1)}_{t,f} \ge 0, \forall t, f    (2)

where

    p = \Delta t \Delta f \, e^{R_{t,f}}  and  N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2 / 2\sigma^2}    (3)

    R_{t,f} = b^r_f + (K^c \ast 1_{S^{(1)}})_{t,f} + \sum_i (K^r_i \ast S^{(2)}_i)_{t,f}    (4)

    A_{t,f} = b^a_f + \sum_i (K^a_i \ast S^{(2)}_i)_{t,f}    (5)

In Eq. (2), \delta(\cdot) is the Dirac delta function. In Eq. (3), \Delta t and \Delta f are the time and frequency bin sizes. In Eqs. (4-5), \ast denotes convolution and 1_x is 1 if x \neq 0, and 0 otherwise.

3 Learning

The joint log-probability of the first and second layer can be expressed as a function of the model parameters \Theta and the (unobserved) second-layer spikes S^{(2)}:

    L(\Theta, S^{(2)}) = \log P(S^{(1)}, S^{(2)}; \Theta, \lambda) = \log P(S^{(1)} | S^{(2)}; \Theta) + \log P(S^{(2)}; \lambda)    (6)
                       = \sum_{(t,f) \in S^{(1)}} \left[ R_{t,f} - \frac{1}{2\sigma^2} (\log S^{(1)}_{t,f} - A_{t,f})^2 \right] - \sum_{t,f} e^{R_{t,f}} \Delta t \Delta f - \log(\lambda \Delta t \Delta f) \|S^{(2)}\|_0 + \mathrm{const}    (7)

Figure 2: Illustration of the hierarchical spike coding model. Second-layer spikes S^{(2)} associated with 3 features (indicated by color) are sampled in time and frequency according to a Poisson process, with exponentially-distributed amplitudes (indicated by dot size). 
These are convolved with corresponding rate kernels K^r (outlined in colored rectangles), summed together, and passed through an exponential nonlinearity to drive the instantaneous rate of the first-layer spikes on a coarse scale. The first-layer spike rate is also modulated on a fine scale by a recurrent component that convolves previous spikes with coupling kernels K^c. At a given time step (vertical line), spikes S^{(1)} are generated according to a Poisson process whose rate depends on the top-down and the recurrent terms.

where the equality in Eq. (7) holds in the limit \Delta t \Delta f \to 0. Maximizing the data likelihood requires integrating L over all possible second-layer representations S^{(2)}, which is computationally intractable. Instead, we choose to approximate the optimal \Theta by maximizing L jointly over \Theta and S^{(2)}. If S^{(2)} is known, then the model falls within the well-known class of generalized linear models (GLMs) [9], and Eq. (6) is convex in \Theta. Conversely, if \Theta is known then Eq. (6) is convex in S^{(2)} except for the L0 penalty term corresponding to the prior on S^{(2)}. Motivated by these facts, we adopt a coordinate-descent approach by alternating between the following steps:

    S^{(2)} \leftarrow \arg\max_{S^{(2)}} L(\Theta, S^{(2)})    (8)

    \Theta \leftarrow \Theta + \eta \nabla_\Theta L(\Theta, S^{(2)})    (9)

where \eta is a fixed learning rate. Section 4 describes a method for approximate inference of the second-layer spikes (solving Eq. (8)). The gradients used in Eq. 
(9) are straightforward to compute and are given by

    \partial L / \partial b^r_f = (\text{number of first-layer spikes in channel } f) - \sum_t e^{R_{t,f}} \Delta t \Delta f    (10)

    \partial L / \partial b^a_f = \frac{1}{\sigma^2} \sum_t \left( \log S^{(1)}_{t,f} - A_{t,f} \right)    (11)

    \partial L / \partial K^r_{\tau,\zeta,i} = \sum_{(t,f) \in S^{(1)}} S^{(2)}_i(t - \tau, f - \zeta) - \sum_{t,f} e^{R_{t,f}} S^{(2)}_{t-\tau,f-\zeta,i} \Delta t \Delta f    (12)

    \partial L / \partial K^c_{\tau,f,f'} = \sum_{t \in S^{(1)}_f} 1_{S^{(1)}_{t-\tau,f'}} - \sum_t e^{R_{t,f}} 1_{S^{(1)}_{t-\tau,f'}} \Delta t \Delta f    (13)

Figure 3: Example model kernels learned on the TIMIT data set. Top: rate kernels, spanning 0.4 s and 3.84 octaves (colormaps individually rescaled). Bottom: four representative coupling kernels at center frequencies 111 Hz, 246 Hz, 546 Hz, and 1214 Hz, spanning 0.02 s and \u00b11.34 octaves (scaling indicated by colorbar).

4 Inference

Inference of the second-layer spikes S^{(2)} (Eq. (8)) involves maximizing the trade-off between the GLM likelihood term, which we denote by \tilde{L}(\Theta, S^{(2)}), and the last term, which penalizes the number of spikes (\|S^{(2)}\|_0). Solving Eq. (8) exactly is NP-hard. We adopt a variant of the well-known matching pursuit algorithm [7] to approximate the solution. First, S^{(2)} is initialized to \vec{0}. Then the following two steps are repeated:

1. Select the coefficient that maximizes a second-order Taylor approximation of \tilde{L}(\Theta, \cdot) about the current solution S^{(2)}:

    (\tau^\ast, \zeta^\ast, i^\ast) = \arg\max_{\tau,\zeta,i} \; -\left( \partial \tilde{L} / \partial S^{(2)}_{\tau,\zeta,i} \right)^2 / \left( \partial^2 \tilde{L} / \partial (S^{(2)}_{\tau,\zeta,i})^2 \right)    (14)

2. 
Perform a line search to determine the step size for this coef\ufb01cient that maximizes \u02dcL(\u0398,\u00b7).\nIf the maximal improvement does not outweigh the cost \u2212 log(\u03bb\u2206t\u2206f ) of adding a spike,\nterminate. Otherwise update S(2) using this step and repeat Step 1.\n\n5 Results\n\nModel parameters learned from speech\n\nWe applied the model to the TIMIT speech corpus [10]. First, we obtained spikegrams by encoding\nsounds to 20dB precision using a set of 200 gammatone \ufb01lters with center frequencies spaced evenly\non a logarithmic scale (see [5] for details). For each audio sample, this gave us a spikegram with\n\ufb01ne time and frequency resolution (6.25\u00d710\u22125s and 3.8\u00d710\u22122 octaves, respectively). We trained\na model with 20 rate and 20 amplitude kernels, with frequency resolution equivalent to that of the\nspikegram and time resolution of 20ms. These kernels extended over 400ms\u00d73.8 octaves (spanning\n20 time and 100 frequency bins). Coupling kernels were de\ufb01ned independently for each frequency\nchannel; they extended over 20ms and 2.7 octaves around the channel center frequency with the\nsame time/frequency resolution as the spikegram. All parameters were initialized randomly, and\nwere learned according to Eq. (8-9).\n\nFig. 3 displays the learned rate kernels (top) and coupling kernels (bottom). 
Among the patterns learned by the rate kernels are harmonic stacks of different durations and pitch shifts (e.g., kernels 4, 9, 11, 18), ramps in frequency (kernels 1, 7, 15, 16), sharp temporal onsets and offsets (kernels 7, 13, 19), and acoustic features localized in time and frequency (kernels 5, 10, 12, 20) (example sounds synthesized by turning on single features are available in the supplementary materials). The corresponding amplitude kernels (not shown) contain patterns highly correlated with the rate kernels, suggesting a strong dependence in the spikegram between spike rate and magnitude. For most frequency channels, the coupling kernels are strongly negative at times immediately following the spike and at adjacent frequencies, representing \u201crefractory periods\u201d observed in the spikegrams. Positive peaks in the coupling kernels encode precise alignment of spikes across time and frequency.

Figure 4: Model representation of phone pairs aa+r (left) and ao+l (right), as uttered by four speakers (rows: two male, two female). Each row shows inferred second-layer spikes, the rate kernels most correlated with the utterance of each phone pair, shifted to their corresponding spikes\u2019 frequencies (colored on left), and the encoded log firing rate centered on the phone pair utterance.

Second-layer representation

The learned kernels combine in various ways to represent complex acoustic events. For example, Fig. 4 illustrates how features can combine to represent two different phone pairs. Vowel phones are approximated by a harmonic stack (outlined in yellow) together with a ramp in frequency (outlined in orange and dark blue). 
Because the rate kernels add to specify the logarithm of the \ufb01ring\nrate, their superposition results in a multiplicative modulation of the intensities at each level of the\nharmonic stack. In addition, the \u2018r\u2018 consonant in the \ufb01rst example is characterized by a high concen-\ntration of energy at the high frequencies and is largely accounted for by the kernel outlined in red.\nThe \u2018l\u2018 consonant following \u2018ao\u2018 contains a frequency modulation captured by the v-shaped feature\n(outlined in cyan).\n\nTranslating the kernels in log-frequency allows the same set of fundamental features to participate\nin a range of acoustic events: the same vocalizations at different pitch are often represented by the\nsame set of features. In Fig. 4, the same set of kernels is used in a similar con\ufb01guration across dif-\nferent speakers and genders. It should be noted that the second-layer representation does not discard\nprecise time and frequency information (this information is carried in the times and frequencies of\nthe second-layer spikes). However, the identities of the features that are active remain invariant to\npitch and frequency modulations.\nSynthesis\n\nOne can further understand the acoustic information that is captured by second-layer spikes by\nsampling a spikegram according to the generative model. We took the second-layer encoding of a\nsingle sentence from the TIMIT speech corpus [10] (Fig. 5 middle) and sampled two spikegrams:\none with only the hierarchical component (left), and one that included both hierarchical and coupling\ncomponents (right). At a coarse scale the two samples closely resemble the spikegram of the original\nsound. However, at the \ufb01ne time scale, only the spikegram sampled with coupling contains the\nregularities observed in speech data (Fig. 5 bottom row). Sounds were also generated from these\nspikegram samples by superimposing gammatone kernels as in [5]. 
Despite the fact that the second-layer representation contains over 15 times fewer spikes than the first-layer spikegrams, the synthesized sounds are intelligible, and the addition of the coupling filters provides a noticeable improvement (audio examples in the supplementary materials).

Figure 5: Synthesis from inferred second-layer spikes. Middle bottom: spikegram representation of the sentence in Fig. 1 (2544 spikes); Middle top: inferred second-layer representation (176 spikes); Left: first-layer spikes generated using only the hierarchical model component (2741 spikes); Right: first-layer spikes generated using hierarchical and coupling kernels (2358 spikes). Synthesized waveforms are included in the supplementary materials.

                         white noise                    sparse temporally modulated noise
noise level   Wiener   wav thr     MP     HSC       Wiener   wav thr     MP     HSC
-10dB          -7.00      2.41    2.26    2.50       -8.68     -8.73   -5.12   -4.37
-5dB            0.00      4.93    4.79    5.01       -3.09     -3.63   -0.96   -0.38
0dB             5.49      7.94    7.71    7.99        1.90      1.23    2.97    3.30
5dB             7.84     11.15   11.01   11.33        6.37      6.06    7.11    7.40
10dB           10.31     14.64   14.49   14.83        9.68     11.28   11.58   11.88

Table 1: Denoising accuracy (dB SNR) for speech corrupted with white noise (left) or with sparse, temporally modulated noise (right).

Denoising

Although the model parameters have been adapted to the data ensemble, obtaining an estimate of the likelihood of the data ensemble under the model is difficult, as it requires integrating over unobserved variables (S(2)). 
Instead, we can use performance on unsupervised signal processing tasks, such\nas denoising, to validate the model and compare it to other methods that explicitly or implicitly\nrepresent data density. In the noiseless case, a spikegram is obtained by running matching pursuit\nuntil the decrease in the residual falls below a threshold; in the presence of noise, this encoding\nprocess can be formulated as a denoising operation, terminated when the improvement in the log-\nlikelihood (variance of the residual divided by the variance of the noise) is less than the cost of\nadding a spike (the negative log-probability of spiking). We incorporate the HSC model directly\ninto this denoising algorithm by replacing the \ufb01xed probability of spiking at the \ufb01rst layer with the\n\n7\n\n\frate speci\ufb01ed by the second layer. Since neither the \ufb01rst- nor second-layer spike code for the noisy\nsignal is known, we \ufb01rst infer the \ufb01rst and then the second layer using MAP estimation, and then\nrecompute the \ufb01rst layer given both the data and second layer. The denoised waveform is obtained\nby reconstructing from the resulting \ufb01rst-layer spikes.\n\nTo the extent that the parameters learned by HSC re\ufb02ect statistical properties of the signal, incorpo-\nrating the more sophisticated spikegram prior into a denoising algorithm should allow us to better\ndistinguish signal from noise. We tested this by denoising speech waveforms (held out during model\ntraining) that have been corrupted by additive white Gaussian noise. 
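The stopping rule described above (keep adding spikes while the log-likelihood gain exceeds the cost of a spike) can be sketched with matching pursuit on a toy one-dimensional dictionary. This is our own simplified illustration, assuming a flat spike prior and known noise variance; in HSC the flat prior is replaced by the rate specified by the second layer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dictionary of unit-norm kernels (stand-ins for gammatones).
N, K = 256, 8
D = rng.normal(size=(K, N))
D /= np.linalg.norm(D, axis=1, keepdims=True)

clean = 3.0 * D[2] - 2.0 * D[5]              # sparse ground truth
noisy = clean + 0.5 * rng.normal(size=N)     # additive white noise

sigma2 = 0.25       # noise variance (assumed known here)
spike_cost = 3.0    # negative log prior probability of a spike, in nats

residual, recon = noisy.copy(), np.zeros(N)
while True:
    proj = D @ residual                      # best coefficient per kernel
    k = int(np.argmax(np.abs(proj)))
    # Log-likelihood gain from explaining proj[k]^2 of residual energy
    # must exceed the cost of adding a spike; otherwise stop.
    gain = proj[k] ** 2 / (2 * sigma2)
    if gain < spike_cost:
        break
    recon += proj[k] * D[k]
    residual -= proj[k] * D[k]

# The denoised signal should be closer to the clean one than the input was.
assert np.sum((recon - clean) ** 2) < np.sum((noisy - clean) ** 2)
```

Raising the spike cost makes the encoder more conservative, rejecting low-energy residual structure as noise.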
We compared the model\u2019s performance to that of the matching pursuit encoding (sparse signal representation without a hierarchical model), as well as to two standard denoising methods, Wiener filtering and wavelet-threshold denoising (implemented with MATLAB\u2019s wden function, using symlets and the SURE estimator for soft threshold selection; other parameters optimized for performance on the training data set) [11].

HSC-based denoising is able to outperform standard methods, as well as matching pursuit denoising (Table 1 left). Although the performance gains are modest, the fact that the HSC model, which is not optimized for the task or trained on noisy data, can match the performance of adaptive algorithms like wavelet-threshold denoising suggests that it has learned a representation that successfully exploits the statistical regularities present in the data.

To test more rigorously the benefit of a structured prior, we evaluated denoising performance on signals corrupted with non-stationary noise whose power is correlated over time. This is a more challenging task, but it is also more relevant to real-world applications, where sources of noise are often non-stationary. Algorithms that incorporate specific (but often incorrect) noise models (e.g., Wiener filtering) tend to perform poorly in this setting. We generated sparse temporally modulated noise by scaling white Gaussian noise with a temporally smooth envelope (given as a convolution of a Gaussian function with standard deviation 0.02 s with a Poisson process with rate 16 s\u22121). All methods fare worse on this task. Again, the hierarchical model outperforms other methods (Table 1 right), but here the improvement in performance is larger, especially at high noise levels where the model prior plays a greater role. 
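The sparse temporally modulated noise described above can be generated in a few lines. This is a sketch assuming numpy; the sampling rate and duration are our own choices, while the envelope parameters (Gaussian with 0.02 s standard deviation, Poisson events at 16 per second) follow the text:

```python
import numpy as np

rng = np.random.default_rng(2)

fs, dur = 16000, 2.0                  # sample rate (Hz) and duration (s); assumed
n = int(fs * dur)

# Poisson impulses at 16 events per second.
impulses = rng.random(n) < (16.0 / fs)

# Smooth envelope: convolve the impulses with a Gaussian of st. dev. 0.02 s.
t = np.arange(-3 * 0.02, 3 * 0.02, 1.0 / fs)
g = np.exp(-t ** 2 / (2 * 0.02 ** 2))
envelope = np.convolve(impulses.astype(float), g, mode='same')

# Scale white Gaussian noise by the envelope to get bursty, non-stationary noise.
noise = envelope * rng.normal(size=n)
```

Unlike stationary white noise, the resulting noise power turns on and off over time, which is exactly the structure a fixed Wiener filter cannot track.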
The reconstruction SNR does not fully convey the manner in which different algorithms handle noise: perceptually, we find that the sounds denoised by the hierarchical model sound more similar to the original (audio examples in the supplementary materials).

6 Discussion

We developed a hierarchical spike code model that captures complex structure in sounds. Our work builds on the spikegram representation of [5], thus avoiding the limitations arising from spectrogram-based methods, and makes a number of novel contributions. Unlike previous work [3, 4], the learned kernels are shiftable in both time and log-frequency, which enables the model to learn time- and frequency-relative patterns and to use a small number of kernels efficiently to represent a wide variety of sound features. In addition, the model describes acoustic structure on multiple scales (via a hierarchical component and a recurrent component), which capture fundamentally different kinds of statistical regularities.

Technical contributions of this work include methods for learning and performing approximate inference in a generalized linear model in which some of the inputs are unobserved and sparse (in this case, the second-layer spikes). The computational framework developed here is general, and may have other applications in modeling sparse data with partially observed variables. Because the model is nonlinear, multi-layer cascades could lead to substantially more powerful models.

Applying the model to complex natural sounds (speech), we demonstrated that it can learn non-trivial features, and we have shown how these features can be composed to form basic acoustic units. We also showed a simple application to denoising, demonstrating improved performance over wavelet thresholding. 
The framework provides a general methodology for learning higher-order features of sounds, and we expect that it will prove useful in representing other structured sounds such as music, animal vocalizations, or ambient natural sounds.

6.1 Acknowledgments

We thank Richard Turner and Josh McDermott for helpful discussions.

References

[1] C. Fevotte, B. Torresani, L. Daudet, and S. Godsill, \u201cSparse linear regression with structured priors and application to denoising of musical audio,\u201d IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, pp. 174\u2013185, Jan. 2008.
[2] M. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. Davies, \u201cSparse representations in audio and music: From coding to source separation,\u201d Proceedings of the IEEE, vol. 98, pp. 995\u20131005, June 2010.
[3] D. J. Klein, P. K\u00f6nig, and K. P. K\u00f6rding, \u201cSparse spectrotemporal coding of sounds,\u201d EURASIP J. Appl. Signal Process., vol. 2003, pp. 659\u2013667, Jan. 2003.
[4] H. Lee, Y. Largman, P. Pham, and A. Y. Ng, \u201cUnsupervised feature learning for audio classification using convolutional deep belief networks,\u201d in Advances in Neural Information Processing Systems, pp. 1096\u20131104, The MIT Press, 2009.
[5] E. Smith and M. S. Lewicki, \u201cEfficient coding of time-relative structure using spikes,\u201d Neural Computation, vol. 17, no. 1, pp. 19\u201345, 2005.
[6] M. Lewicki and T. Sejnowski, \u201cCoding time-varying signals using sparse, shift-invariant representations,\u201d in Advances in Neural Information Processing Systems, pp. 730\u2013736, The MIT Press, 1999.
[7] S. Mallat and Z. Zhang, \u201cMatching pursuits with time-frequency dictionaries,\u201d IEEE Trans. Sig. Proc., vol. 41, pp. 3397\u20133415, Dec. 1993.
[8] E. Smith and M. S. Lewicki, \u201cEfficient auditory coding,\u201d Nature, vol. 439, no. 7079, 2006.
[9] P. McCullagh and J. A. 
Nelder, Generalized Linear Models (Second edition). London: Chapman & Hall, 1989.
[10] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, \u201cDARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,\u201d 1993.
[11] S. Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 3rd ed., 2008.
", "award": [], "sourceid": 1393, "authors": [{"given_name": "Yan", "family_name": "Karklin", "institution": null}, {"given_name": "Chaitanya", "family_name": "Ekanadham", "institution": null}, {"given_name": "Eero", "family_name": "Simoncelli", "institution": null}]}