{"title": "Bangs, Clicks, Snaps, Thuds and Whacks: An Architecture for Acoustic Transient Processing", "book": "Advances in Neural Information Processing Systems", "page_first": 734, "page_last": 740, "abstract": null, "full_text": "Bangs. Clicks, Snaps, Thuds and Whacks: \nan Architecture for Acoustic Transient \n\nProcessing \n\nFernando J. Pineda(l) \nfernando. pineda@jhuapl.edu \n\nGert Cauwenberghs(2) \ngert@jhunix.hcf.jhu.edu \n\nR. Timothy Edwards(2) \ntim@bach.ece.jhu.edu \n\n(iThe Applied Physics Laboratory \nThe Johns Hopkins University \nLaurel, Maryland 20723-6099 \n\n(2Dept. of Electrical and Computer Engineering \nThe Johns Hopkins University \n34th and Charles Streets \nBaltimore Maryland 21218 \n\nABSTRACT \n\nWe propose a neuromorphic architecture for real-time processing of \nacoustic transients in analog VLSI. We show how judicious normalization \nof a time-frequency signal allows an elegant and robust implementation \nof a correlation algorithm. The algorithm uses binary multiplexing instead \nof analog-analog multiplication. This removes the need for analog \nstorage and analog-multiplication. Simulations show that the resulting \nalgorithm has the same out-of-sample classification performance (-93% \ncorrect) as a baseline template-matching algorithm. \n\n1 INTRODUCTION \n\nWe report progress towards our long-term goal of developing low-cost, low-power, low(cid:173)\ncomplexity analog-VLSI processors for real-time applications. We propose a neuromorphic \narchitecture for acoustic processing in analog VLSI. The characteristics of the architecture \nare explored by using simulations and real-world acoustic transients. We use acoustic \ntransients in our experiments because information in the form of acoustic transients \npervades the natural world. Insects, birds, and mammals (especially marine mammals) \nall employ acoustic signals with rich transient structure. 
Human speech is largely composed of transients, and speech recognizers based on transients can perform as well as recognizers based on phonemes (Morgan, Bourlard, Greenberg, Hermansky, and Wu, 1995). Machines also generate transients as they change state and as they wear down. Transients can therefore be used to diagnose wear and abnormal conditions in machines. \n\nIn this paper, we consider how algorithmic choices that do not influence classification performance can make an initially difficult-to-implement algorithm practical to implement. In particular, we present a practical architecture for performing real-time recognition of acoustic transients via a correlation-based algorithm. Correlation in analog VLSI poses two fundamental implementation challenges: first, the problem of template storage; second, the problem of accurate analog multiplication. Both problems can be solved by building sufficiently complex circuits. This solution is generally unsatisfactory because the resulting processors must have less area and consume less power than their digital counterparts in order to be competitive. Another solution to the storage problem is to employ novel floating-gate devices. At present such devices can store analog values for years without significant degradation. Moreover, this approach can result in very compact, yet computationally complex devices. On the other hand, programming floating-gate devices is not so straightforward: it is relatively slow, it requires high voltage, and it degrades the floating gate each time it is reprogrammed. Our \"solution\" is to side-step the problem completely and to develop an algorithmic solution that requires neither analog storage nor analog multiplication. Such an approach is attractive because it is both biologically plausible and electronically efficient. 
We demonstrate that a high level of classification performance on a real-world data set is achievable with no measurable loss relative to a baseline correlation algorithm. \n\nThe acoustic transients used in our experiments were collected by K. Ryals and D. Steigerwald and are described in (Pineda, Ryals, Steigerwald and Furth, 1995). These transients consist of isolated Bangs, Claps, Clicks, Cracks, Oinks, Pings, Pops, Slaps, Smacks, Snaps, Thuds and Whacks that were recorded on DAT tape in an office environment. The ambient noise level was uncontrolled, but typical of a single-occupant office. Approximately 221 transients comprising 10 classes were collected. Most of the energy in a typical transient is dissipated in the first 10 ms; the remaining energy is dissipated over the course of approximately 100 ms. The transients had durations of approximately 20-100 ms, with considerable variability in duration both within and across classes. The duration of a transient was determined automatically by a segmentation algorithm described below. The segmentation algorithm was also used to align the templates in the correlation calculations. \n\n2 THE BASELINE ALGORITHM \n\nThe baseline classification algorithm and its performance are described in Pineda, et al. (1995). Here we summarize only its most salient features. As in many biologically motivated acoustic processing algorithms, the preprocessing steps include time-frequency analysis, rectification, smoothing and compression via a nonlinearity (e.g. Yang, Wang and Shamma, 1992). Classification is performed by correlation against a template that represents a particular class. In addition, there is a \"training\" step which is required to create the templates. This step is described in the \"correlation\" section below. We turn now to a more detailed description of each processing step. \n\nA. 
Time-frequency Analysis: Time-frequency analysis for the baseline algorithm and for the simulations performed in this work was performed by an ultra-low-power (5.5 mW) analog VLSI filter bank intended to mimic the processing performed by the mammalian cochlea (Furth, Kumar, Andreou and Goldstein, 1994). This real-time device creates a time-frequency representation that would ordinarily require hours of computation on a high-speed workstation. More complete descriptions can be found in the references. The time-frequency representation produced by the filter bank is qualitatively similar to that produced by a wavelet transformation. The center frequencies and Q-factors of the channels are uniformly spaced in log space. The low-frequency channel is tuned to a center frequency of 100 Hz and a Q-factor of 1.0, while the high-frequency channel is tuned to a center frequency of 6000 Hz and a Q-factor of 3.5. There are 31 output channels. \n\nThe 31-channel cochlear output was digitized and stored on disk at a raw rate of 256K samples per second. This raw rate was distributed over 32 channels at rates appropriate for each channel (six rates were used, from 1 kHz for the lowest-frequency channels up to 32 kHz for the highest-frequency channels and the unfiltered channel). \n\nB. Segmentation: Both the template calculation and the classification algorithm rely on having a reliable segmenter. In our experiments the transients are isolated and the noise level is low, so a simple segmenter is all that is needed. Figure 2 shows the segmenter, which we implemented in software as a three-layer neural network. \n\nFigure 2: Schematic diagram showing the segmenter network (a noisy segmentation bit is cleaned into the final segmentation bit). \n\nThe input layer receives mean-subtracted and rectified signals from the cochlear filters. The first layer simply thresholds these signals. 
The second layer consists of a single unit that accumulates and rethresholds the thresholded signals. It outputs a noisy segmentation signal that is nonzero if two or more channels in the input layer exceed the input threshold. Finally, the output neuron cleans up the segmentation signal by low-pass filtering it with a time-scale of 10 ms (to fill in drop-outs) and by low-pass filtering it with a time-scale of 1 ms (to catch the onset of a transient). The outputs of the two low-pass filters are OR'ed by the output neuron to produce a clean segmentation bit. \n\nThe four adjustable thresholds in the network were determined empirically so as to maximize the number of true transients that were properly segmented, while minimizing the number of transients that were missed or cut in half. \n\nC. Smoothing & Normalization: The raw output of the filter bank is rectified, smoothed with a single-pole filter, and subsequently normalized. Smoothing was done with the same time-scale (1 ms) in all frequency channels. Let X(t) be the instantaneous vector of rectified and smoothed channel data; then the instantaneous output of the normalizer is \n\n   Xhat(t) = X(t) / (theta + ||X(t)||_1), \n\nwhere theta is a positive constant whose purpose is to prevent the normalization stage from amplifying noise in the absence of a transient signal. With this normalization we have ||Xhat(t)||_1 ~ 0 if ||X(t)||_1 << theta, and ||Xhat(t)||_1 ~ 1 if ||X(t)||_1 >> theta. Thus theta effectively determines a soft input threshold that transients must exceed if they are to be normalized and passed on to higher-level processing. \n\nA sequence of normalized vectors over a time-window of length T is used as the feature vector for the correlation and classification stages of the algorithm. Figure 3 shows four normalized feature vectors from one class of transients, concatenated together. 
\"\"--'\"'~~ \n\n......I\"\"'~~~ \n\n.A.~~~ \n\n.-. \n\n~ \n\n~ -...... \n~ \n~ \n-\n\n~-~:..... \n.... \n\n1'0. \n\n...I\\. \n.,..... \nJ,.;~ \n\n..-.. \n\n.~ \n\n~I'-.. \n\n)., \n~ \n\n...... \n\n~ \n\n....... \n-. \n\n~ \n\n~ ~ '-----....., \n\n~ \n\n~ ........................ , \no \n\nI \n100 \n\nI \n50 \n\nI \n200 \n\nI \n150 \nTime (ms) \n\nI \n250 \n\nI \n300 \n\nFigure 3.: Normalized representation of the first 4 exemplars from one class of transients. \n\nD. Correlation: The feature-vectors are correlated in the time-frequency domain against \na set of K time-frequency templates. The k - th feature-vector-template is precalculated \nby averaging over a corpus of vectors from the k - th class. Thus, if Ck represents \nthe k - th transient class, and if ( ) k represents an average over the elements in a class, \ne.g. (X(t\u00bb)k = E{X(t)IX(t)E Ck}. Then the template is of the form bk(t) = (X(t\u00bb)k \u00b7 The \ninstantaneous output of the correlation stage is a K -dimensional vector c(t)whose \nk -th component is ck(t) = LX(t)\u00b7 bk(t). The time-frequency window over which the \n\nt \n\nA \n\ncorrelations are performed is of length T and is advanced by one time-step between \ncorrelation calculations. \n\nt'=t-T \n\nE. Classification The classification stage is a simple winner-take-all algorithm that assigns \na class to the feature vector by picking the component of ck(t) that has the largest value \nat the appropriate time, i.e. class = argmax{ck(tvalid)}' \n\nk \n\n\f738 \n\nF. 1. Pineda, G. Cauwenberghs and R. T. Edwards \n\nThe segmenter is used to determine the time tva1idwhen the output of the winner-take-all \nis to be used for classification. This corresponds to properly aligning the feature vector \nand the template. Leave-one-out cross-validation was used to estimate the out-of-sample \nclassification performance of all the algorithms described in this paper. The rate of \ncorrect classification for the baseline algorithm was 92.8%. 
Out of a total of 221 events that were detected and segmented, 16 were misclassified. \n\n3 A CORRELATION ALGORITHM FOR ANALOG VLSI \n\nWe now address the question of how to perform classification without performing analog-analog multiplication and without having to store analog templates. To provide a better understanding of the algorithm, we present it as a set of incremental modifications to the baseline algorithm. This will serve to make clear the role played by each modification. \n\nExamination of the normalized representation in figure 3 suggests that the information content of any one time-frequency bin cannot be very high. Accordingly, we seek a highly compressed representation that is both easy to form and easy to compute with. As a preliminary step to forming this compressed representation, consider correlating the time-derivative of the feature vector with the time-derivative of the template, \n\n   c_k(t) = Σ_{t'=t-T}^{t} dXhat(t')/dt · db_k(t')/dt, where b_k(t) = <Xhat(t)>_k. \n\nThis modification has no effect on the out-of-sample performance of the winner-take-all classification algorithm. The above representation, by itself, has very few implementation advantages. It can, in principle, mitigate the effect of any systematic offsets that might emerge from the normalization circuit. Unfortunately, the price for this small advantage would be a very complex multiplier: since the time-derivative of a positive quantity can have either sign, both the feature vector and the template are now bipolar. Accordingly, the correlation hardware would require 4-quadrant analog-analog multipliers. Moreover, the storage circuits must handle bipolar quantities as well. \n\nThe next step in forming a compressed representation is to replace the time-differentiated template with just a sign that indicates whether the template value in a particular channel is increasing or decreasing with time. This template is b'_k(t) = sign(d<Xhat(t)>_k/dt). 
We denote \n\nthis template as the [-1,+ 1]-representation template. The resulting classification algorithm \nyields exactly the same out-of-sample performance as the baseline algorithm. The 4-quadrant \nanalog-analog multiply of the differentiated representation is reduced to a \"4-quadrant \nanalog-binary\" multiply. The storage requirements are reduced to a single bit per time(cid:173)\nfrequency bin. To simplify the hardware yet further, we exploit the fact that the time \nderivative of a random unit vector net) (with respect to the I-norm) satisfies \n\nE{ ~Sign\u00ab(Uv))iv} = 2E{ ~e\u00ab(uv))iv} \nb' I k (t) = e( (X(t)} k)' we expect \n\nwhere e is a step function. Accordingly, if we use a template whose elements are in \n[0,1] instead of [-1, + 1], i.e. \nE{ ~ b' vXv } = 2E{ b' I v Xv} = IlxlI\" provided the feature vector X(t) is drawn from the \n\n\fArchitecture/or Acoustic Transient Processing \n\n739 \n\nsame class as is used to calculate the template. Furthermore, if the feature vector and the \ntemplate are statistically independent, then we expect that either representation will produce \na zero correlation, E{ ~ h' j(v } = E{ h\" v Xv} = 0 . In practice, we find that the difference \nin correlation values between using the [0,1] and the [-1,+1] representations is simply a \nscale factor (approximately equal to 2 to several digits of precision). This holds even \nwhen the feature vectors and the templates do not correspond to the same class. Thus the \ndifference between the two representations is quantitatively minor and qualitatively \nnonexistent, as evidenced by our classification experiments, which show that the out-of(cid:173)\nsample performance of the [0,1] representation is identical to that of the [-1,+1] \nrepresentation. Furthermore, changing to the [0,1] representation has no impact on the \nstorage requirements since both representations require the storage of single bit per time(cid:173)\nfrequency bin. 
On the other hand, by using the [0,1] representation we now have a \"2-quadrant analog-binary\" multiply instead of a \"4-quadrant analog-binary\" multiply. Finally, we observe that differentiation and correlation are commuting operations; thus, rather than differentiating Xhat(t) before correlation, we can differentiate after the correlation without changing the result. This reduces the complexity of the correlation operation still further, since the fact that both Xhat(t) and b''_k(t) are non-negative means that we need only implement a correlator with 1-quadrant analog-binary multiplies. \n\nThe result of the above evolution is a correlation algorithm that empirically performs as well as the baseline correlation algorithm, but requires only binary multiplexing to perform the correlation. We find that with only 16 frequency channels and 64 time bins (1024 bits/template), we are able to achieve the desired level of performance. We have undertaken the design and fabrication of a prototype chip. This chip has been fabricated and we will report on its performance in the near future. Figure 4 illustrates the key architectural features of the correlator/memory implementation. \n\nFigure 4: Schematic architecture of the k-th correlator/memory (input currents, 1-norm normalization, binary correlator/memory array). \n\nThe rectified and smoothed frequency-analyzed signals are input from the left as currents. The currents are normalized before being fed into the correlator. A binary time-frequency template is stored as a bit pattern in the correlator/memory; a single bit is stored at each time and frequency bin. If this bit is set, current is mirrored from the horizontal (frequency) lines onto vertical (aggregation) lines. Current from the aggregation lines is integrated and shifted in a bucket-brigade analog shift register. The last two stages of the shift register are differenced to estimate a time-derivative. 
\n\n4 DISCUSSION AND CONCLUSIONS \n\nThe correlation algorithm described in the previous section is related to the zero-crossing representation analyzed by Yang, Wang and Shamma (1992), because bit flips in the templates correspond to the zero crossings of the expected time-derivative of the normalized \"energy envelope.\" Note that we do not encode the incoming acoustic signal with a zero-crossing representation. Interestingly enough, if both the analog signal and the template are reduced to a binary representation, the classification performance drops dramatically. It appears that maintaining some analog information in the processing path is significant. \n\nThe frequency-domain normalization approach presented above throws away absolute intensity information. Thus, low-intensity resonances that remain excited after the initial burst of acoustic energy are as important in the feature vector as the initial burst itself. These resonances can contain significant information about the nature of the transient, but would carry less weight in an algorithm with a different normalization scheme. Another consequence of the normalization is that even a transient whose spectrum is highly concentrated in just a few frequency channels will spread its information over the entire spectrum through the normalization denominator. The use of a normalized representation thus distributes the correlation calculation over very many frequency channels and serves to mitigate the effect of device mismatch. \n\nWe consider the proposed correlator/memory a potential component in more sophisticated acoustic processing systems. 
For example, the continuously generated output of the correlators, c(t), is itself a feature vector that could be used in more sophisticated segmentation and/or classification algorithms, such as the time-delayed neural network approach of Unnikrishnan, Hopfield and Tank (1991). \n\nThe work reported here was supported by a Whiting School of Engineering/Applied Physics Laboratory Collaborative Grant. Preliminary work was supported by an APL Internal Research & Development Budget. \n\nREFERENCES \n\nFurth, P.M., Kumar, N.G., Andreou, A.G. and Goldstein, M.H. (1994). \"Experiments with the Hopkins Electronic EAR\", 14th Speech Research Symposium, Baltimore, MD, pp. 183-189. \n\nPineda, F.J., Ryals, K., Steigerwald, D. and Furth, P. (1995). \"Acoustic Transient Processing using the Hopkins Electronic Ear\", World Conference on Neural Networks 1995, Washington, DC. \n\nYang, X., Wang, K. and Shamma, S.A. (1992). \"Auditory Representations of Acoustic Signals\", IEEE Transactions on Information Theory, 38, pp. 824-839. \n\nMorgan, N., Bourlard, H., Greenberg, S., Hermansky, H. and Wu, S.L. (1995). \"Stochastic Perceptual Models of Speech\", Proc. IEEE Intl. Conference on Acoustics, Speech and Signal Processing, Detroit, MI, pp. 397-400. \n\nUnnikrishnan, K.P., Hopfield, J.J. and Tank, D.W. (1991). \"Connected-Digit Speaker-Dependent Speech Recognition Using a Neural Network with Time-Delayed Connections\", IEEE Transactions on Signal Processing, 39, pp. 698-713. \n", "award": [], "sourceid": 1192, "authors": [{"given_name": "Fernando", "family_name": "Pineda", "institution": null}, {"given_name": "Gert", "family_name": "Cauwenberghs", "institution": null}, {"given_name": "R.", "family_name": "Edwards", "institution": null}]}