{"title": "Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters", "book": "Advances in Neural Information Processing Systems", "page_first": 1289, "page_last": 1296, "abstract": null, "full_text": " Learning efficient auditory codes using spikes
                          predicts cochlear filters



          Evan Smith1                        Michael S. Lewicki2
          evan@cnbc.cmu.edu                  lewicki@cnbc.cmu.edu


          Departments of Psychology1 & Computer Science2
          Center for the Neural Basis of Cognition
          Carnegie Mellon University



                                 Abstract

     The representation of acoustic signals at the cochlear nerve must serve a
     wide range of auditory tasks that require exquisite sensitivity in both time
     and frequency. Lewicki (2002) demonstrated that many of the filtering
     properties of the cochlea could be explained in terms of efficient coding
     of natural sounds. This model, however, did not account for properties
     such as phase-locking or how sound could be encoded in terms of action
     potentials. Here, we extend this theoretical approach with an algorithm
     for learning efficient auditory codes using a spiking population code, a
     theoretical model for coding sound in terms of spikes. In this model,
     each spike encodes the precise temporal position and magnitude of a
     localized, time-varying kernel function. By adapting the kernel functions
     to the statistics of natural sounds, we show that, compared to conventional
     signal representations, the spike code achieves far greater coding
     efficiency. Furthermore, the inferred kernels show both striking
     similarities to measured cochlear filters and a similar bandwidth versus
     frequency dependence.



1 Introduction

Biological auditory systems perform tasks that require exceptional sensitivity to both
spectral and temporal acoustic structure. 
This precision is all the more remarkable considering
that these computations begin with an auditory code consisting of action potentials whose
duration is in milliseconds and whose firing in response to hair cell motion is probabilistic.
In computational audition, representing the acoustic signal is the first step in any algorithm,
and there are numerous approaches to this problem, which differ in both their computational
complexity and in what aspects of signal structure are extracted. The auditory nerve
representation subserves a wide variety of different auditory tasks and is presumably
well-adapted for these purposes. Here, we investigate the theoretical question of what
computational principles might underlie cochlear processing and the representation at the
auditory nerve.

For sensory representations, a theoretical principle that has attracted considerable interest
is efficient coding. This posits that (assuming low noise) one goal of sensory coding
is to represent signals in the natural sensory environment efficiently, i.e., with minimal
redundancy [1, 2, 3]. Recently, it was shown that efficient coding of natural sounds could
explain auditory nerve filtering properties and their organization as a population [4] and
also account for some non-linear properties of auditory nerve responses [5]. Although those
results provided an explanation for auditory nerve encoding of spectral information, they
fail to explain the encoding of temporal information. Here, we extend the standard efficient
coding model, which has an implicit stationarity assumption, to form efficient representations
of non-stationary and time-relative acoustic structures.


2 An abstract model for auditory coding

In standard models of efficient coding, sensory signals are represented by vectors of fixed
length, and the representation is a linear transformation of the input pattern. 
A simple
method to encode temporal signals is to divide the signal into discrete blocks; however, this
approach has several drawbacks. First, the underlying acoustic structures have no relation
to the block boundaries, so elemental acoustic features may be split across blocks. Second,
this representation implicitly assumes that the signal structures are stationary, and provides
no way to represent time-relative structures such as transient sounds. Finally, this approach
has limited plausibility as a model of cochlear encoding. To address all of these problems,
we use a theoretical model in which sounds are represented as spikes [6, 7]. In this model,
the signal, x(t), is encoded with a set of kernel functions, φ_1 . . . φ_M, that can be positioned
arbitrarily and independently in time. The mathematical form of the representation with
additive noise is

    x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_i^m φ_m(t − τ_i^m) + ε(t),    (1)

where τ_i^m and s_i^m are the temporal position and coefficient of the ith instance of kernel φ_m,
respectively. The notation n_m indicates the number of instances of φ_m, which need not be
the same across kernels. In addition, the kernels are not restricted in form or length.

The key theoretical abstraction of the model is that the signal is decomposed in terms of
discrete acoustic events, each of which has a precise amplitude and temporal position. We
interpret the analog amplitude values as representing a local population of auditory nerve
spikes. Thus, this theory posits that the purpose of the (binary) spikes at the auditory nerve
is to encode as accurately as possible the temporal position and amplitude of the acoustic
events defined by φ_m(t). The main questions we address are 1) encoding, i.e. what are the
optimal values of τ_i^m and s_i^m, and 2) learning, i.e. 
what are the optimal kernel functions
φ_m(t).

2.1 Encoding

Finding the optimal representation of arbitrary signals in terms of spikes is a hard problem,
and currently there are no known biologically plausible algorithms that solve this problem
well [7]. There are reasons to believe that this problem can be solved (approximately) with
biological mechanisms, but for our purposes here, we compute the values of τ_i^m and s_i^m
for a given signal using the matching pursuit algorithm [8], which iteratively approximates
the input signal with successive orthogonal projections onto a basis. The signal can be
decomposed into

    x(t) = ⟨x(t), φ_m⟩ φ_m + R_x(t),    (2)

where ⟨x(t), φ_m⟩ is the inner product between the signal and the kernel and is equivalent
to s_i^m in equation 1. The final term in equation 2, R_x(t), is the residual signal after
approximating x(t) in the direction of φ_m. The projection with the largest magnitude inner
product will minimize the power of R_x(t), thereby capturing the most structure possible
with a single kernel.

[Figure 1 panels: spike code (ordinate: kernel CF in Hz, 100-5000; abscissa: time, 0-25 ms),
followed by the input, reconstruction, and residual waveforms.]

Figure 1: A brief segment of the word canteen (input) is represented as a spike code (top).
A reconstruction of the speech based only on the few spikes shown (ovals in spike code) is
very accurate, with relatively little residual error (reconstruction and residual). The colored
arrows and matching curves illustrate the correspondence between a few of the ovals and
the underlying acoustic structure represented by the kernel functions.


Equation 2 can be rewritten more generally as

    R_x^n(t) = ⟨R_x^n(t), φ_m⟩ φ_m + R_x^{n+1}(t),    (3)

with R_x^0(t) = x(t) at the start of the algorithm. On each iteration, the current residual is
projected onto the basis. 
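The projection-and-subtraction iteration of equations 2 and 3 can be sketched as follows.
This is a minimal Python illustration with a toy dictionary of unit-norm kernels; the
function name, arguments, and stopping constants are our own choices, not the authors'
implementation.

```python
import numpy as np

def matching_pursuit(x, kernels, threshold=0.1, max_spikes=1000):
    """Greedy spike encoding sketch: project the residual onto every time
    shift of every kernel, keep the single largest-magnitude projection,
    subtract it out, and repeat until the coefficient falls below the
    spiking threshold. Kernels are assumed to be unit-norm 1-D arrays."""
    residual = x.astype(float).copy()
    spikes = []  # (kernel index m, time position tau, coefficient s)
    for _ in range(max_spikes):
        best = None  # (|coefficient|, m, tau, coefficient)
        for m, phi in enumerate(kernels):
            # inner products of the residual with every shift of phi
            corr = np.correlate(residual, phi, mode="valid")
            tau = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[tau]) > best[0]:
                best = (abs(corr[tau]), m, tau, corr[tau])
        _, m, tau, coeff = best
        if abs(coeff) < threshold:  # spiking threshold halts the encoding
            break
        # subtract the chosen projection, leaving it orthogonal to the residual
        residual[tau:tau + len(kernels[m])] -= coeff * kernels[m]
        spikes.append((m, tau, coeff))
    return spikes, residual
```

Reconstruction is then the sum of equation 1 over the recorded (m, tau, s) triples, i.e.
adding coeff * kernels[m] back at each recorded time.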
The projection with the largest inner product is subtracted out, and
its coefficient and time are recorded. This projection and subtraction leaves ⟨R_x^n(t), φ_m⟩ φ_m
orthogonal to the residual signal, R_x^{n+1}(t), and to all previous and future projections [8].
As a result, matching pursuit codes are composed of mutually orthogonal signal structures.
For the results reported here, the encoding was halted when s_i^m fell below a preset threshold
(the spiking threshold).

Figure 1 illustrates the spike code model and its efficiency in representing speech. The
spoken word \"canteen\" was encoded as a set of spikes using a fixed set of kernel functions.
The kernels can have arbitrary shape, and for illustration we have chosen gammatones
(mathematical approximations of cochlear filters) as the kernel functions. A brief segment
from the input signal (figure 1, Input) consists of three glottal pulses in the /a/ vowel. The
resulting spike code is shown above it. Each oval represents the temporal position and center
frequency of an underlying kernel function, with oval size and gray value indicating kernel
amplitude. For four spikes, colored arrows and curves indicate the relationship between the
ovals and the acoustic events they represent. As evident from the figure, the very small set
of spike events is sufficient to produce a very accurate reconstruction of the sound
(reconstruction and residual).

2.2 Learning

We adapt the method used in [9] to train our kernel functions. Equation 1 can be rewritten
in probabilistic form as

    p(x|φ) = ∫ p(x|φ, ŝ) p(ŝ) dŝ,    (4)

where ŝ, an approximation of the posterior maximum, comes from the set of coefficients
generated by matching pursuit. We assume the noise in the likelihood, p(x|φ, ŝ), is
Gaussian and the prior, p(ŝ), is sparse. 
The basis is updated by taking the gradient of
the log probability,

    ∂/∂φ_m log p(x|φ) = ∂/∂φ_m [log p(x|φ, ŝ) + log p(ŝ)]    (5)
                      = −∂/∂φ_m (1/2) [x − Σ_{m=1}^{M} Σ_{i=1}^{n_m} ŝ_i^m φ_m(t − τ_i^m)]²    (6)
                      = [x − x̂] ŝ_i^m    (7)

As noted by Olshausen (2002), equation 7 indicates that the kernels are updated in a Hebbian
fashion, simply as a product of activity and residual [9] (i.e., the unit shifts its preferred
stimulus in the direction of the stimuli that just made it spike, minus those elements already
encoded by other units). But in the case of the spike code, rather than updating at every
time point, we need only update at the times when the kernel spiked.

As noted earlier, the model can use kernels of any form or length. This capability also
extends to the learning algorithm, which can learn functions of differing temporal extents,
growing or shrinking them as needed. Low-frequency functions and others requiring longer
temporal extent can be grown from shorter initial seeds, while brief functions can be
trimmed to speed processing and minimize the effects of over-fitting. Periodically during
training, a simple heuristic is used to trim or extend the kernels, φ_m. The functions are
initially zero-padded. If learning causes the power of the padding to surpass a threshold,
the padding is extended. If the power of the padding plus an adjacent segment falls below
the threshold, the padding is trimmed from the end. Following the gradient step and length
adjustment, the kernels are again normalized and the next training signal is encoded.


3 Adapting kernels to natural sounds

The spike coding algorithm was used to learn kernel functions for two different classes
of sounds: human speech and music. For speech, the algorithm trained on a subset of the
TIMIT Speech Corpus. Each training sample consisted of a single speaker saying a single
sentence. 
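The spike-triggered Hebbian update of equation 7 above can be sketched as follows. This
is a minimal Python illustration under our own assumptions (function name, arguments,
and learning rate are ours): at each spike, the kernel moves toward the residual segment
at its spike time, scaled by its coefficient, and is then renormalized.

```python
import numpy as np

def update_kernels(kernels, residual, spikes, lr=0.01):
    """Hebbian kernel update sketch for equation 7: for each spike
    (kernel index m, time tau, coefficient s), nudge phi_m toward the
    residual it left unexplained, in proportion to its own activity s,
    updating only at spike times. Kernels are renormalized afterward."""
    kernels = [phi.copy() for phi in kernels]
    for m, tau, s in spikes:
        seg = residual[tau:tau + len(kernels[m])]
        # gradient step: activity (s) times residual at the spike time
        kernels[m][:len(seg)] += lr * s * seg
    # renormalize to unit norm, as done after each gradient step
    return [phi / np.linalg.norm(phi) for phi in kernels]
```

A growing/trimming heuristic of the kind described above would then inspect the power in
each kernel's zero-padded tails before the next training signal is encoded.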
The signals were bandpass filtered to remove DC components of the signal and
to prevent aliasing from affecting learning. The signals were all normalized to a maximum
amplitude of 1.

Each of the 30 kernel functions was initialized to a random Gaussian vector of 100 samples
in duration. The threshold below which spikes (values of s_i^m) were ignored during
the encoding stage was set at 0.1, which allowed for an initial encoding of 12 dB signal-
to-noise ratio (SNR). As indicated by equation 7, the gradient depends on the residual. If
the residual drops near zero or is predominately noise, then learning is impeded. By slowly
increasing the spiking threshold as the average residual drops, we retain some signal
structure in the residual for further training. At the same time, the power distribution of
natural sounds means that high-frequency signal components might fall entirely below
threshold, preventing their being learned. One possible solution, not implemented here, is
to use a separate threshold for each kernel.

Figure 2: When adapted to speech, kernel functions become asymmetric sinusoids (smooth
curves in red; zero padding has been removed for plotting), with sharp attacks and gradual
decays. They also adapt in temporal extent, with longer and shorter functions emerging
from the same initial length. These learned kernels are strikingly similar to revcor functions
obtained from cat auditory nerve fibers (noisy curves in blue). The revcor functions were
normalized and aligned in phase with the learned kernels but are otherwise unaltered (no
smoothing or fitting).


Figure 2 shows the kernel functions trained on speech (red curves). All are temporally
localized, bandpass filters. They are similar in form to previous results but with several
notable differences. 
Most notably, the learned kernel functions are temporally asymmetric,
with sharp attacks and gradual decays, which matches the physiological filtering properties
of the auditory nerve. Each kernel function in figure 2 is overlaid on a so-called reverse-
correlation (revcor) function, which is an estimate of the physiological impulse response
function of an individual auditory nerve fiber [10]. The revcor functions have been
normalized, and those most closely matching in center frequency and envelope were
phase-aligned with the learned kernels by hand. No additional fitting was done, yet there is
a striking similarity between the inferred kernel functions and the physiologically estimated
reverse-correlation functions. For 25 out of 30 kernel functions, we found a close match
to the physiological revcor functions (correlation > 0.8). Of the remaining filters, all
possessed the same basic asymmetric filter structure shown in figure 2 and showed a more
modest match to the data (correlation > 0.5).

In the standard efficient coding model, the signal and the basis functions are all the same
length. In order for the basis to span the signal space in the time domain and still be
temporally localized, some of the learned functions are essentially replications of one another.
In the spike coding model, this redundancy does not occur because coding is time-relative.
Kernel functions can be placed arbitrarily in time, such that one kernel function can code for
similar acoustic events at different points in the signal. So, temporally extended functions
can be learned without causing an explosion in the number of high-frequency functions

[Figure 3 plot: log-log axes, Bandwidth (kHz) versus Center Frequency (kHz), both spanning
roughly 0.1-5 kHz; legend: Speech Prediction, Auditory Nerve Filters.]

Figure 3: The center frequency vs. 
bandwidth distribution of learned kernel functions (red
squares) plotted against physiological data (blue pluses).


needed to span the signal space. Because cochlear coding also shares this quality, it might
also allow more precise predictions about the population characteristics of cochlear filters.

Individually, the learned kernel functions closely match the linear component of cochlear
filters. We can also compare the learned kernels against physiological data in terms of
population distributions. In frequency space, our learned population follows the
approximately logarithmic distribution found in the cochlea, a more natural distribution of
filters compared to previous findings, where the need to tile high-frequency space biased the
distribution [4]. Figure 3 presents a log-log scatter plot of the center frequency of each
kernel versus its bandwidth (red squares). Plotted on the same axes are empirical data (blue
pluses) from a large corpus of reverse-correlation functions derived from physiological
recordings of auditory nerve fibers [10]. Both the slope and distribution of the learned
kernel functions match those of the empirical data. The distribution of learned kernels
even appears to follow shifts in the slope of the empirical data at the high and low
frequencies.


4 Coding Efficiency

We can quantify the coding efficiency of the learned kernel functions in bits so as to
objectively evaluate the model and compare it quantitatively to other signal representations.
Rate-fidelity curves provide a useful objective measure for comparison. Here we use a method
developed in [7], which we now briefly describe. Computing the rate-fidelity curves begins
with associated pairs of coefficients and time values, {s_i^m, τ_i^m}, which are initially stored
as double-precision variables. 
Storing the original time values referenced to the start of
the signal is costly because their range can be arbitrarily large and the distribution of time
points is essentially uniform. Storing only the time since the last spike, Δτ_i^m, greatly
restricts the range and produces a variable that approximately follows a gamma distribution.

Rate-fidelity curves are generated by varying the precision of the code, {s_i^m, τ_i^m}, and
computing the resulting fidelity through reconstruction. A uniform quantizer is used to
vary the precision of the code between 1 and 16 bits. At all levels of precision, the bin
widths for quantization are selected so that equal numbers of values fall in each bin. All
s_i^m or τ_i^m that fall within a bin are recoded to have the same value; we use the mean of
the non-quantized values that fell within the bin. s_i^m and τ_i^m are quantized independently.

Treating the quantized values as samples from a random variable, we estimate a code's
entropy (bits/coefficient) from histograms of the values. Rate is then the product of the
estimated entropy of the quantized variables and the number of coefficients per second for
a given signal. At each level of precision the signal is reconstructed based on the quantized
values, and an SNR for the code is computed. This process was repeated across a set of
signals, and the results were averaged to produce rate-fidelity curves.

Coding efficiency can be measured in nearly identical fashion for other signal
representations. For comparison, we generate rate-fidelity curves for Fourier and wavelet
representations as well as for a spike code using either learned kernel functions or
gammatone functions. Fourier coefficients were obtained for each signal via the Fast Fourier
Transform. The real and imaginary parts were quantized independently, and the rate was
based on the estimated entropy of the quantized coefficients. 
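The equal-population quantizer and entropy estimate described above can be sketched as
follows. This is a minimal Python illustration under our own assumptions (function name
and interface are ours): bin edges are placed at quantiles so that equal numbers of values
fall in each bin, each value is recoded to its bin mean, and entropy is estimated from the
bin-occupancy histogram.

```python
import numpy as np

def quantize_equal_population(values, n_bits):
    """Quantize to 2**n_bits bins with (approximately) equal occupancy:
    edges at quantiles, each value recoded to the mean of the
    non-quantized values in its bin. Returns the recoded values and an
    entropy estimate (bits/coefficient) from the occupancy histogram."""
    n_bins = 2 ** n_bits
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
    # map each value to its bin index (clip keeps the maximum in the last bin)
    idx = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, n_bins - 1)
    recoded = np.empty_like(values, dtype=float)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            recoded[mask] = values[mask].mean()
    # entropy estimate from the histogram of quantized values
    counts = np.bincount(idx, minlength=n_bins)
    p = counts[counts > 0] / len(values)
    entropy = -(p * np.log2(p)).sum()
    return recoded, entropy
```

Multiplying the entropy by the number of coefficients per second of signal then gives the
rate in bits per second for one point on a rate-fidelity curve; s_i^m and τ_i^m would each be
quantized independently with this procedure.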
Reconstruction was simply the
inverse Fourier transform of the quantized coefficients. Similarly, coding efficiency using
Daubechies wavelets was estimated using Matlab's discrete wavelet transform and inverse
wavelet transform functions. Curves for the gammatone spike code were generated as
described above.

Figure 4 shows the rate-fidelity curves calculated for speech from the TIMIT speech corpus
[11]. At low bit rates (below 40 kbps), both of the spike codes produce more efficient
representations of speech than the other, traditional representations. For example, between
10 and 20 kbps the fidelity of the spike representation of speech using learned kernels is
approximately twice that of either Fourier or wavelets. The learned kernels are also slightly
but significantly more efficient than spike codes using gammatones, particularly in the case
of music. The kernel functions trained on music are more extended in time and appear
better able to describe harmonic structure than the gammatones. As the number of spikes
increases, the spike codes become less efficient, with the curve for learned kernels dropping
more rapidly than that for gammatones. Encoding sounds to very high precision requires
setting the spike threshold well below the threshold used in training. It may be that the
learned kernel functions are not well adapted to the statistics of very low amplitude sounds.
At higher bit rates (above 60 kbps) the Fourier and wavelet representations produce much
higher rate-fidelity curves than either spike code.


5 Conclusion

We have presented a theoretical model of auditory coding in which temporal kernels are
the elemental features of natural sounds. The essential property of these features is that
they can describe acoustic structure at arbitrary time points, and can thus represent non-
stationary, transient sounds in a compact and shift-invariant manner. 
We have shown that
by using this time-relative spike coding model and adapting the kernel shapes to efficiently
code natural sounds, it is possible to account for both the detailed filter shapes of auditory
nerve fibers and their distribution as a population. Moreover, we have demonstrated
quantitatively that, over a broad range of low to medium bit rates, this type of code is
substantially more efficient than conventional signal representations such as Fourier or
wavelet transforms.

[Figure 4 plot: SNR (dB), 0-40, versus Rate (kbps), 0-90; legend: Spike Code: adapted,
Spike Code: gammatone, Block Code: wavelet, Block Code: Fourier.]

Figure 4: Rate-fidelity curves for speech were made for spike coding using both learned
kernels (red) and gammatones (light blue), as well as using the discrete Daubechies wavelet
transform (black) and the Fourier transform (dark blue).


References

 [1] H. B. Barlow. Possible principles underlying the transformation of sensory messages.
     In W. A. Rosenbluth, editor, Sensory Communication, pages 217-234. MIT Press,
     Cambridge, 1961.

 [2] J. J. Atick. Could information theory provide an ecological theory of sensory
     processing? Network, 3(2):213-251, 1992.

 [3] E. Simoncelli and B. Olshausen. Natural image statistics and neural representation.
     Annual Review of Neuroscience, 24:1193-1216, 2001.

 [4] M. S. Lewicki. Efficient coding of natural sounds. Nature Neuroscience, 5(4):356-363,
     2002.

 [5] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control.
     Nature Neuroscience, 4:819-825, 2001.

 [6] M. S. Lewicki. Efficient coding of time-varying patterns using a spiking population
     code. In R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki, editors, Probabilistic
     Models of the Brain: Perception and Neural Function, pages 241-255. MIT Press,
     Cambridge, MA, 2002.

 [7] E. C. Smith and M. S. Lewicki. 
Efficient coding of time-relative structure using
     spikes. Neural Computation, 2004.

 [8] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE
     Transactions on Signal Processing, 41(12):3397-3415, 1993.

 [9] B. A. Olshausen. Sparse codes and spikes. In R. P. N. Rao, B. A. Olshausen, and M. S.
     Lewicki, editors, Probabilistic Models of the Brain: Perception and Neural Function,
     pages 257-272. MIT Press, Cambridge, MA, 2002.

[10] L. H. Carney, M. J. McDuffy, and I. Shekhter. Frequency glides in the impulse
     responses of auditory-nerve fibers. Journal of the Acoustical Society of America,
     105:2384-2391, 1999.

[11] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren,
     and V. Zue. TIMIT acoustic-phonetic continuous speech corpus, 1990.
", "award": [], "sourceid": 2620, "authors": [{"given_name": "Evan", "family_name": "Smith", "institution": null}, {"given_name": "Michael", "family_name": "Lewicki", "institution": null}]}