{"title": "Silicon Models for Auditory Scene Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 699, "page_last": 705, "abstract": null, "full_text": "Silicon Models for Auditory Scene Analysis \n\nJohn Lazzaro and John Wawrzynek \n\nCS Division \nUC Berkeley \nBerkeley, CA 94720-1776 \n\nlazzaro@cs.berkeley.edu, johnw@cs.berkeley.edu \n\nAbstract \n\nWe are developing special-purpose, low-power analog-to-digital converters for speech and music applications that feature analog circuit models of biological audition to process the audio signal before conversion. This paper describes our most recent converter design, and a working system that uses several copies of the chip to compute multiple representations of sound from an analog input. This multi-representation system demonstrates the plausibility of inexpensively implementing an auditory scene analysis approach to sound processing. \n\n1. INTRODUCTION \n\nThe visual system computes multiple representations of the retinal image, such as motion, orientation, and stereopsis, as an early step in scene analysis. Likewise, the auditory brainstem computes secondary representations of sound, emphasizing properties such as binaural disparity, periodicity, and temporal onsets. Recent research in auditory scene analysis involves using computational models of these auditory brainstem representations in engineering applications. \n\nComputation is a major limitation in auditory scene analysis research: the complete auditory processing system described in (Brown and Cooke, 1994) runs approximately 4000 times slower than real time under UNIX on a Sun SPARCstation 1. 
Standard approaches to hardware acceleration for signal processing algorithms could be used to ease this computational burden in a research environment; a variety of parallel, fixed-point hardware products would work well on these algorithms. \n\nHowever, hardware solutions appropriate for a research environment may not be well suited for accelerating algorithms in cost-sensitive, battery-operated consumer products. Possible product applications of auditory algorithms include robust pitch-tracking systems for musical instrument applications, and small-vocabulary, speaker-independent wordspotting systems for control applications. \n\nIn these applications, the input takes an analog form: a voltage signal from a microphone or a guitar pickup. Low-power analog circuits that compute auditory representations have been implemented and characterized by several research groups; these working research prototypes include several generations of cochlear models (Lyon and Mead, 1988), periodicity models, and binaural models. These low-power, area-efficient circuits could be used to compute auditory representations directly on the analog signal, in real time. \n\nUsing analog computation successfully in a system presents many practical difficulties; the density and power advantages of the analog approach are often lost in the process of system integration. One successful IC architecture that uses analog computation in a system is the special-purpose analog-to-digital converter, which includes analog, non-linear pre-processing before or during data conversion. For example, converters that include logarithmic waveform compression before digitization are commercially viable components. 
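\n\nThe logarithmic compression mentioned above can be illustrated in software. The following Python sketch uses mu-law companding with mu = 255 as an assumed example; the paper does not name a specific compression law, and the function names (compress, quantize, expand, convert, reconstruct) are hypothetical. A companding converter applies the non-linearity before or during conversion; here both steps are simply simulated numerically.

```python
import math

MU = 255.0  # assumed companding constant (mu-law); not taken from the paper


def compress(x, mu=MU):
    '''Map a sample in [-1, 1] to [-1, 1] with logarithmic (mu-law) compression.'''
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def quantize(y, bits=8):
    '''Uniformly quantize a value in [-1, 1] to a signed integer code.'''
    levels = 2 ** (bits - 1) - 1
    return round(y * levels)


def expand(y, mu=MU):
    '''Invert compress(): recover the linear sample from a compressed value.'''
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def convert(x, bits=8):
    '''Compress first, then digitize, as a companding converter would.'''
    return quantize(compress(x), bits)


def reconstruct(code, bits=8):
    '''Map an integer code back to an approximate linear sample.'''
    levels = 2 ** (bits - 1) - 1
    return expand(code / levels)
```

For a small input such as 0.01 of full scale, convert() yields a much larger code than plain linear 8-bit quantization would, so quiet signals keep more resolution; that is the usual motivation for compressing before digitizing.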
\n\nUsing this component type as a model, we have been developing special-purpose, low-power analog-to-digital converters for speech and audio applications; this paper describes our most recent converter design, and a working system that uses several copies of the chip to compute multiple representations of sound. \n\n2. CONVERTER DESIGN \n\nFigure 1 shows an architectural block diagram of our current converter design. The 35,000-transistor chip was fabricated in the 2\u00b5m, n-well process of Orbit Semiconductor, brokered through MOSIS; the circuit is fully functional. Below is a summary of the general architectural features of this chip; unless otherwise referenced, circuit details are similar to the converter design described in (Lazzaro et al., 1994). \n\n\u2022 An analog audio signal serves as input to the chip; dynamic range is 40dB to 60dB (1-10mV to 1V peak, dependent on measurement criteria). \n\n\u2022 This signal is processed by analog circuits that model cochlear processing (Lyon and Mead, 1988) and sensory transduction; the audio signal is transformed into 119 wavelet-filtered, half-wave rectified, non-linearly compressed audio signals. The cycle-by-cycle waveform of each signal is preserved; no temporal smoothing is performed. \n\n\u2022 Two additional analog processing blocks follow this initial cochlear processing: a temporal autocorrelation processor and a temporal adaptation processor. Each block transforms the input array into a new representation of equal size; alternatively, the block can be programmed to pass its input vector to its output without alteration. \n\n\u2022 The output circuits of the final processing block are pulse generators, which code the signal as a pattern of fixed-width, fixed-height spikes. 
All the information in the representation is contained in the onset times of the pulses. \n\n\u2022 The activity on this array is sent off-chip via an asynchronous parallel bus. The converter chip acts as a sender on the bus; a digital host processor is the receiver. The converter initiates a transaction on the bus to communicate the onset of a pulse in the array; the data value on the bus is a number indicating which unit in the array pulsed. The time of transaction initiation carries essential information. This coding method is also known as the address-event representation. \n\n\u2022 Many converters can be used in the same system, sharing the same asynchronous output bus (Lazzaro and Wawrzynek, 1995). No extra components are needed to implement bus sharing; the converter bus design includes extra signals and logic that implement multi-chip bus arbitration. This feature is a major difference between this design and (Lazzaro et al., 1994). \n\n\u2022 The converter includes a digitally-controllable parameter storage and generation system; 25 tunable parameters control the behavior of the analog processing blocks. Programmability supports the creation of multi-converter systems that use a single chip design: each chip receives the same analog signal, but processes the signal in different ways, as determined by the parameter values for each chip. \n\n\u2022 Non-volatile analog storage elements are used to store the parameters; parameters are changeable via Fowler-Nordheim tunneling, using a 5V control input bus. Many converters can share the same control bus. Parameter values can be sensed by activating a control mode, which sends parameter information on the converter output bus. 
Apart from two high-voltage power supply pins, and a trimming input pin for tunneling pulse width, all control voltages used in this converter are generated on-chip. \n\nFigure 1. Block diagram of the converter chip. Most of the 40 pins of the chip are dedicated to the data output and control input buses, and to the control signals for coordinating bus sharing in multi-converter systems. \n\n3. SYSTEM DESIGN \n\nFigure 2 shows a block diagram of a system that uses three copies of the converter chip to compute multiple representations of sound; the system acts as a real-time audio input device to a Sun workstation. An analog audio input connects to each converter; this input can be from a pre-amplified microphone, for spontaneous input, or from the analog audio signal of the workstation, for controlled experiments. \n\nThe asynchronous output buses from the three chips are connected together, to produce a single output address space for the system; no external components are needed for output bus sharing and arbitration. The onset time of a transaction carries essential information on this bus; additional logic on this board adds a 16-bit timestamp to each bus transaction, coding the onset time with 20\u00b5s resolution. The control input buses for the three chips are also connected together to produce a single input address space, using external logic for address decoding. We use a commercial interface board to link the workstation with these system buses. \n\n4. 
SYSTEM PERFORMANCE \n\nWe designed a software environment, Aer, to support real-time, low-latency data visualization of the multi-converter system. Using Aer, we can easily experiment with different converter tunings. Figure 3 shows a screen from Aer, showing data from the three converters as a function of time; the input sound for this screen is a short 800 Hz tone burst, followed by a sinusoid sweep from 300 Hz to 3 kHz. The top (\"Spectral Shape\") and bottom (\"Onset\") representations are raw data from converters 1 and 3, as marked on Figure 2, tuned for different responses. The output channel number is plotted vertically; each dot represents a pulse. \n\nThe top representation codes for periodicity-based spectral shape; for this representation, the temporal autocorrelation block (see Figure 1) is activated, and the temporal adaptation block is inactivated. Spectral frequency is mapped logarithmically on the vertical dimension, from 300 Hz to 4 kHz; the activity in each channel is the periodic waveform present at that frequency. The difference between a periodicity-based spectral method and a resonant spectral method can be seen in the response to the 800 Hz sinusoid onset: the periodicity representation shows activity only around the 800 Hz channels, whereas a resonant spectral representation would show broadband transient activity at tone onset. \n\nFigure 2. Block diagram of the multi-converter system. \n\n[Figure 3 image. Panels, top to bottom: Spectral Shape (300 Hz to 4 kHz, log axis); Summary Auto Corr. (0 ms to 12.5 ms, linear axis); Onset (300 Hz to 4 kHz, log axis). Time scale bar: 200 ms.] \n\nFigure 3. Data from the multi-converter system, in response to an 800 Hz pure tone, followed by a sinusoidal sweep from 300 Hz to 3 kHz. \n\n[Figure 4 image. Same panel layout as Figure 3. Time scale bar: 100 ms.] \n\nFigure 4. Data from the multi-converter system, in response to the word \"five\" followed by the word \"nine\". \n\nThe bottom representation codes for temporal onsets; for this representation, the temporal adaptation block is activated, and the temporal autocorrelation block is inactivated. The spectral filtering of the representation reflects the silicon cochlea tuning: a low-pass response with a sharp cutoff and a small resonant peak at the best frequency of the filter. The black, wideband lines at the start of the 800 Hz tone and the sinusoid sweep illustrate the temporal adaptation. \n\nThe middle (\"Summary Auto Corr.\") representation is a summary autocorrelogram, useful for pitch processing and voiced/unvoiced decisions in speech recognition. This representation is not raw data from a converter; software post-processing is performed on the converter output to produce the final result. The frequency response of converter 2 is set as in the bottom representation; the temporal adaptation response, however, is set to a 100 millisecond time constant. 
The converter output pulse rates are set so that the cycle-by-cycle waveform information for each output channel is preserved in the output. \n\nTo complete the representation, a set of running autocorrelation functions x(t)x(t-\u03c4) is computed for \u03c4 = k \u00b7 105 \u00b5s, k = 1 ... 120, for each of the 119 output channels. These autocorrelation functions are summed over all output channels to produce the final representation; \u03c4 is plotted on a linear scale on the vertical axis. The correlation multiplication can be efficiently implemented by integer subtraction and comparison of pulse timestamps; the summation over channels is simply the merging of lists of bus transactions. The middle representation in Figure 3 shows the qualitative characteristics of the summary autocorrelogram: a repetitive band structure in response to periodic sounds. \n\nFigure 4 shows the response of the multi-converter system to telephone-bandwidth-limited speech; the phonetic boundaries of the two words, \"five\" and \"nine\", are marked by arrows. The vowel formant information is shown most clearly by the strong peaks in the spectral shape representation; the wideband information in the \"f\" of \"five\" is easily seen in the onset representation. The summary autocorrelation representation shows a clear texture break between vowels and the voiced \"n\" and \"v\" sounds. \n\nAcknowledgements \n\nThanks to Richard Lyon and Peter Cariani for summary autocorrelogram discussions. Funded by the Office of Naval Research (URI-N00014-92-J-1672). \n\nReferences \n\nBrown, G. J. and Cooke, M. (1994). Computational auditory scene analysis. Computer Speech and Language, 8:4, pp. 297-336. \n\nLazzaro, J. P. and Wawrzynek, J. (1995). A multi-sender asynchronous extension to the address-event protocol. 
In Dally, W. J., Poulton, J. W., Ishii, A. T. (eds), 16th Conference on Advanced Research in VLSI, pp. 158-169. \n\nLazzaro, J. P., Wawrzynek, J., and Kramer, A. (1994). Systems technologies for silicon auditory models. IEEE Micro, 14:3, pp. 7-15. \n\nLyon, R. F. and Mead, C. (1988). An analog electronic cochlea. IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1119-1134. \n", "award": [], "sourceid": 1123, "authors": [{"given_name": "John", "family_name": "Lazzaro", "institution": null}, {"given_name": "John", "family_name": "Wawrzynek", "institution": null}]}