{"title": "A Neuromorphic Monaural Sound Localizer", "book": "Advances in Neural Information Processing Systems", "page_first": 692, "page_last": 698, "abstract": null, "full_text": "A Neuromorphic Monaural Sound Localizer

John G. Harris, Chiang-Jung Pu, and Jose C. Principe
Department of Electrical & Computer Engineering
University of Florida
Gainesville, FL 32611

Abstract

We describe the first single-microphone sound localization system and its inspiration from theories of human monaural sound localization. Reflections and diffractions caused by the external ear (pinna) allow humans to estimate sound source elevations using only one ear. Our single-microphone localization model relies on a specially shaped reflecting structure that serves the role of the pinna. Specially designed analog VLSI circuitry uses echo-time processing to localize the sound. A CMOS integrated circuit has been designed, fabricated, and successfully demonstrated on actual sounds.

1 Introduction

The principal cues for human sound localization arise from time and intensity differences between the signals received at the two ears. For low-frequency components of sounds (below 1500 Hz for humans), the phase-derived interaural time difference (ITD) can be used to localize the sound source. For these frequencies, the sound wavelength is at least several times larger than the head, and the amount of shadowing (which depends on the wavelength of the sound compared with the dimensions of the head) is negligible. ITD localization is a well-studied system in biology (see e.g., [5]) and has even been mapped to neuromorphic analog VLSI circuits with limited success on actual sound signals [6] [2]. Above 3000 Hz, interaural phase differences become ambiguous by multiples of 360 degrees and are no longer viable localization cues.
\nFor these  high frequencies,  the wavelength of the  sound  is  small  enough  that the \nsound amplitude is attenuated by the head.  The intensity difference of the log mag(cid:173)\nnitudes at the ears provides a  unique inter aural intensity difference  (lID)  that can \nbe used to localize. \nMany  studies  have  shown  that  when  one  ear  is  completely  blocked,  humans  can \nstill  localize  sounds in  space,  albeit  at a  worse  resolution  in the horizontal  direc-\n\n\fA Neuromorphic Monaural Sound Localizer \n\n693 \n\nSound Signal \n\nNeuromorphic \nMicrophone \n\n---- --, \n\nModel : \n\nDetecting \n\nOnset \n\n(a) \n\nReflector \n\nGenerating \n\nComputing \n\nDelay \n\nPulse  I \n\ni \n\nAdaptive \n\nThreshold \n\nSource \n\nSl \n\nC \n\n. ; d1 l \n.  \"\" -,~:I \n\n~\"  ~ ' \nr2\nk \n\n: \n\nf \n\n, \n' \n\nMic. \n\n.. 2 \n\n.. 1 \n\nd \n\nS2 \n\nReflector  (b ) \n\nFigure  1:  (a)  Proposed  localization  model  is  inspired from  the  biological  model  (b) \nSpecial  reflection  surface to serve  the  role  of the pinna \n\ntion.  Monaural localization  requires  that information  is  somehow extracted from \nthe direction-dependent effects of the reflections and diffractions of sound off of the \nexternal ear (pinna), head, shoulder, and torso.  The S<rcalled  \"Head Related Trans(cid:173)\nfer  Function\"  (HRTF)  is the effective direction-dependent transfer function that is \napplied  to the  incoming  sound to  produce  the  sound  in  the middle  ear.  Section \n2  of this  paper  introduces  our  monaural  sound localization  model  and  Section  3 \ndiscusses the simulation and measurement results. \n\n2  Monaural Sound Localization Model \n\nBatteau [1]  was one of the first to emphasize that the external ear, specifically the \npinna,  could  be a  source  of spatial cues  that account for  vertical  localization.  
He concluded that the physical structure of the external ear introduced two significant echoes in addition to the original sound. One echo varies with the azimuthal position of the sound source, having a latency in the 0 to 80 µs range, while the other varies with elevation in the 100 µs to 300 µs range. The output y(t) at the inner ear is related to the original sound source x(t) as

y(t) = x(t) + a1 x(t - Ta) + a2 x(t - Tv)    (1)

where Ta and Tv refer to the azimuth and elevation echoes, respectively, and a1 and a2 are two reflection constants. Other researchers subsequently verified these results [11] [4].

Our localizer system (shown in Figure 1(a)) is composed of a special reflection surface that encodes the sound source's direction, a silicon cochlea that functions as a band-pass filter bank, onset detecting circuitry that detects and amplifies the energy change at each frequency tap, pulse generating circuitry that transforms analog sound signals into pulse signals based on adaptively thresholding the onset signal, and delay time computation circuitry that computes the echo's time delay and then decodes the sound source's direction.

Figure 2: (a) Sound signal's onset is detected by taking the difference of two low-pass filters with different time constants. (b) Pulse generating circuit.

Since our recorded signal is composed of a direct sound and an echo, the sound is a simplified version of actual HRTF recordings, which are composed of the direct sound and its reflections from the external ear, head, shoulder, and torso. To achieve localization in a 1-D plane, we may use any shape of reflection surface as long as the reflection echo caused by the surface provides a one-to-one mapping between the echo's delay time and the source's direction.
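The echo model of Eq. (1) can be sketched numerically. The sample rate, echo latencies, and reflection constants below are illustrative assumptions for the sketch, not values measured in the paper:

```python
import numpy as np

# Sketch of the Batteau echo model of Eq. (1):
#   y(t) = x(t) + a1*x(t - Ta) + a2*x(t - Tv)
# All numeric values below are illustrative assumptions.
fs = 100_000                # sampling rate, 100 kHz (assumed)
a1, a2 = 0.5, 0.3           # reflection constants (assumed)
Ta, Tv = 60e-6, 200e-6      # azimuth / elevation echo latencies (in range cited)

def pinna_echo_model(x, fs, a1, Ta, a2, Tv):
    """Return x plus two attenuated, delayed copies of itself (Eq. 1)."""
    y = np.asarray(x, dtype=float).copy()
    for a, T in ((a1, Ta), (a2, Tv)):
        d = int(round(T * fs))      # echo delay in samples
        if d == 0:
            y += a * x
        else:
            y[d:] += a * x[:-d]     # shifted, scaled copy
    return y

# A unit impulse yields three spikes: the direct sound plus the two echoes.
x = np.zeros(100)
x[0] = 1.0
y = pinna_echo_model(x, fs, a1, Ta, a2, Tv)
```

With these assumed numbers, the azimuth echo lands 6 samples after the direct impulse and the elevation echo 20 samples after it.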
Thus, we propose two flat surfaces to compose the reflection structure in our proposed model, depicted in Figure 1(b). A microphone is placed at distances a1 and a2 from the two flat surfaces (S1 and S2); d is the distance between the microphone and the sound source moving line (the dotted line in Figure 1(b)). As shown in Figure 1(b), a sound source is at angular position θ. If the source is far enough from the reflection surface, a ray diagram is valid for analyzing the sound's behavior. We skip the complete derivation, but the echo's delay time can be expressed as

T = (r1 + r2 - d1) / c    (2)

where d1 is the length of the direct path, r1 + r2 is the reflected path length, and c is the speed of sound. The path distances are easily solved in terms of the source direction and the geometry of the setup (see [9] for complete details).

The echo's delay time T decreases as the source position θ moves from 0 to 90 degrees. A similar analysis can be made if the source moves in the opposite direction and the reflection is caused by the other reflection surface S2. Since the reflection path is longer for reflection surface S2 than for reflection surface S1, the echo's delay time can be segmented into two ranges. Therefore, the echo's delay time encodes the source's direction in a one-to-one mapping relation.

In the setup, an Earthworks M30 microphone and Lab1 amplifier were used to record and amplify the sound signals [3]. For this preliminary study of monaural localization, we have chosen to localize simple impulse sounds generated through speakers and therefore can drop the silicon cochlea from our model. In the future, more complicated signals, such as speech, will require a silicon cochlea implementation.

Inspired by ideas from visual processing, onset detection is used to segment sounds [10].
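The delay-to-direction mapping of Eq. (2) can be checked with a small image-source sketch: for a flat surface, the reflected path length r1 + r2 equals the distance from the mirror image of the source to the microphone. The coordinates below are illustrative assumptions, not the paper's exact dimensions:

```python
import math

C_SOUND = 343.0  # speed of sound in air (m/s), room temperature

def echo_delay(mic, src, surface_y):
    """Echo delay T = (r1 + r2 - d1) / c of Eq. (2) for a flat
    reflecting surface modeled as the horizontal line y = surface_y.
    r1 + r2 is computed with the image-source method: it equals the
    distance from the source's mirror image to the microphone."""
    d1 = math.dist(mic, src)                      # direct path length
    src_image = (src[0], 2 * surface_y - src[1])  # mirror across surface
    r = math.dist(mic, src_image)                 # reflected path r1 + r2
    return (r - d1) / C_SOUND

# Assumed geometry: surface along y = 0, microphone 0.33 m above it,
# source moving along a horizontal line 0.57 m above the surface.
mic = (0.0, 0.33)
delays = [echo_delay(mic, (x, 0.57), 0.0) for x in (0.0, 0.3, 0.6)]
```

The computed delay shrinks monotonically as the source moves off to the side, which is the one-to-one delay-to-direction mapping the model relies on.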
The detection of an onset is produced by first taking the difference of two first-order, low-pass filters given by [10]

O(t, k, r) = ∫_0^t h(t - x, k) s(x) dx - ∫_0^t h(t - x, k/r) s(x) dx    (3)

where r > 1, k is a time constant, s(x) is the input sound signal, and h(x, k) = k exp(-kx).

A hardware implementation of the above equation is depicted in Figure 2a. In our model, sound signals from the special reflection surface microphone are fed into two low-pass filters whose different time constants are determined by two bias voltages, Vonb1 and Vonb2. The bias voltage Vonb3 determines the amplification of the difference. The output of the onset detecting circuit is Vonout. The onset detection circuit determines significant increases in the signal energy and therefore segments sound events. By computing the delay time between two sound events (the direct sound and its echo caused by the reflection surface), the system is able to decode the source's direction. Each sound event is then transformed into a fixed-width pulse so that the delay time can be computed with binary autocorrelators.

Figure 3: Adaptive threshold circuit used to remove unwanted reflections.

Figure 4: Neural signal processing model

The fixed-width pulse generating circuit is depicted in Figure 2b. The pulse generating circuit includes a self-resetting neuron circuit [8] that controls the pulse duration based on the bias voltage Vneubs. As discussed above, an appropriate threshold is required to discriminate sound events from noise. One input of the pulse generating circuit is the output of the onset detecting circuit, Vonout.
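The onset detector of Eq. (3) can be sketched in discrete time as the difference of two single-pole low-pass filters. The discretization and all parameter values are assumptions of this sketch:

```python
import math

def lowpass(s, k, dt):
    """First-order low-pass filter with impulse response k*exp(-k*t),
    discretized here as a single-pole IIR (an assumption of the sketch)."""
    alpha = 1.0 - math.exp(-k * dt)
    y, out = 0.0, []
    for v in s:
        y += alpha * (v - y)
        out.append(y)
    return out

def onset(s, k, r, dt):
    """Onset signal of Eq. (3): the difference of two low-pass filters
    with rate constants k and k/r (r > 1). The faster filter leads the
    slower one on sudden energy increases, so the difference spikes at
    sound onsets and decays back toward zero afterwards."""
    fast = lowpass(s, k, dt)
    slow = lowpass(s, k / r, dt)
    return [f - sl for f, sl in zip(fast, slow)]

# A step input (silence, then a sustained sound): the onset output is
# zero before the step, spikes just after it, then settles back.
dt = 1e-4
s = [0.0] * 50 + [1.0] * 200
o = onset(s, k=2000.0, r=4.0, dt=dt)
```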
Vthresh is set properly in the pulse generating circuit in order to generate a fixed-width pulse when Vonout exceeds Vthresh. Unfortunately, the system may be confused by unwanted sound events due to extraneous reflections from desks and walls. However, since we know the expected range of echo delays, we can inhibit many of the environmental echoes that fall outside this range using an adaptive threshold circuit.

In order to cancel unwanted signals, we need to design an inhibition mechanism which suppresses signals arriving at our system outside of the expected time range. This inhibition is implemented in Figure 3. As the pulse generating circuit detects the first sound event (which is the direct sound signal), the threshold becomes high for a certain period of time to suppress the detection of the unwanted reflections (those not from our reflection surfaces). The input of the adaptive threshold circuit is Vneuout, which is the output of the pulse generating circuit. The output of the threshold circuit is Vthresh, which is the input of the pulse generating circuit. When the pulse generating circuit detects a sound event, Vneuout becomes high, which increases Vthresh from Vref2 to Vref1 as shown in Figure 3. The higher Vthresh suppresses the detection. The suppression time is determined by the other self-resetting neuron circuit.
\nI \nI \n\n::  'i'f 'i5'U N2  no \n~~ \nj  rJ! \n\nFigure 5:  (a)  The  input sound signal:  impulse signal recorded in typical office  envi(cid:173)\nronment  (b)  HSPICE simulation  of the  output of the  detecting  onset circuit  (label \n61),  the  output of the pulse generating circuit {label  12),  and the  adaptive  threshold \ncircuit response  (label  11) \n\nThe nervous  system  likely  uses  a  running autocorrelation analysis to measure the \ntime delay between signals.  The basic neural connections are shown in Figure 4 [7]. \nA(t)  is  the input  neuron,  A(t - r),  A(t - 2r), ... A(t - mr)  is  a  delay  chain.  The \noriginal signal and the delayed signal are multiplied when A(t) and A(t - kr) feed \nCk.  Assuming the state of neuron A is  NA(t).  H each synaptic delay  in the chain \nis  r, the chain gives  us  NA(t)  under  various delays.  Ck  fires  simultaneously when \nboth A(t) and A(t - kr)  fire.  Neuron Ck  connects neuron Dk.  Excitation is  built \nup at Dk  by the charge and discharge of Ck'  The excitation at Dk is therefore \n\n(4) \n\nViewing the arrangement of Figure 4 as a  neuron autocorrelator, the time-varying \nexcitation  at Db D2, .. Dk  provides  a  spatial  representation  of the autocorrelation \nfunction.  The localization resolution of this system depends on the delay  time r, \nand the  number  of the  correlators.  As  r  decreases,  the  localization  resolution  is \nimproved provided there are enough correlators.  In this paper, 30 unit delay taps, \nand 10 correlators have been implemented on chip.  The outputs of the 10 correlators \ndisplay the time difference between two sound events.  The delay time decodes the \nsource's  direction.  Therefore,  the  10  correlators  provide  a  unit  encoding  of the \nsource location in the  ID plane. \n\n3  Simulation and Measurement Results \n\nThe complete system has been successfully simulated in HSPICE using database we \nhave recorded.  
Figure 5(a) shows the input sound signal, which is an impulse signal recorded in our lab (a typical student office environment). Figure 5(b) shows the output of the onset detector (labeled 61), the pulse generating output (labeled 12), and the adaptive threshold (labeled 11). When the onset output exceeds the threshold, the output of the pulse generating circuit becomes high. Simultaneously, the high value of the generated pulse turns on the adaptive threshold circuit to increase the threshold voltage. The adaptive threshold voltage suppresses the unwanted reflection which can be seen right after the direct signal (we believe the unwanted reflection is caused by the table). Further simulation results are discussed in [9].

The single-microphone sound localizer circuit has been fabricated through the MOSIS 2 µm N-well CMOS process. Impulse signals are played through speakers to test the fabricated localizer chip. Figure 6 depicts the block diagram of the test setup.

Figure 6: Block diagram of the test setup

The M30 microphone picks up the direct impulse signal and echoes from the reflection surface. Since the reflection surface in our test is just a single flat surface, localization is only tested in one-half of the 1-D plane. The composite signals are fed into the input of the sound localizer after amplification. Our sound localizer chip receives the composite signal, computes the echo time delay, and sends out the localization result to a display circuit. The display circuit is composed of 4 LEDs, with each LED representing a specific sound source location.
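The display logic amounts to a range decoder from measured echo delay to LED index. The bin edges below are hypothetical values chosen for illustration, not the chip's actual ranges:

```python
# Hypothetical decoder from echo delay (in correlator taps) to one of
# four LEDs. The tap ranges assigned to each LED are assumptions.
LED_BINS = [(1, 3), (4, 6), (7, 8), (9, 10)]   # taps covered by LEDs 1..4

def decode_led(delay_taps):
    """Return the 1-based LED index whose tap range contains the
    measured delay, or None if the delay falls outside the expected
    echo range (such events are inhibited by the adaptive threshold)."""
    for led, (lo, hi) in enumerate(LED_BINS, start=1):
        if lo <= delay_taps <= hi:
            return led
    return None
```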
The sound localizer sends the computational result to turn on a specific LED signifying the echo time delay. In the test, the M30 microphone and the reflection surface are placed at fixed locations. The speaker is moved along the dotted line shown in Figure 6. The M30 microphone is d1 (33 cm) from the reflection surface and a1 (24 cm) from the speaker moving line. The speaker's location is defined as d2, as depicted in Figure 6.

Figure 7(a) shows the theoretical echo delay at various speaker locations. Figure 7(b) is the measurement of the setup depicted in Figure 6. The y-axis indicates LED 1 through LED 4. The x-axis represents the speaker's location (d2 in Figure 6). The solid horizontal line in Figure 7(b) represents the theoretical result for which LED should respond at each displacement. The results show that localization is accurate within each region, with the possibility of two LEDs responding in the overlap regions.

4 Conclusion

We have developed the first monaural sound localization system. This system provides a real-time model for human sound localization and has potential use in such applications as low-cost teleconferencing. More work is needed to further develop the system. We need to characterize the accuracy of our system and to test more interesting sound signals, such as speech. Our flat reflection surface is straightforward and simple, but it lacks sufficient flexibility to encode the source's direction in more than a 1-D plane. We plan to replace the flat surfaces with a more complicated surface to provide more reflections to encode a richer set of source directions.
\n\n9  9 \n\n9 \n\n9 \n\n9 \n\nEJ  0 \n\nC,l  9  9 \n\n9 \n\n99  9  e  00 \n\nfil3 \n....0 \nc: \n\n0 r \n\n.... \n\n1 \n\n0 \n0 \n\n10 \n\n20 \n\n30 \n\n40 \n\n50 \n\nsound so ..... di_lrom ___ (em) \n\ne  e \n\ne \n\n0 \n\n80 \n\n70 \n\n80 \n\nFigure 7:  Sound localizer  chip test result \n\nAcknowledgments \n\nThis work was supported by an ONR contract #NOOOI4-94-1-0858 and an NSF CA(cid:173)\nREER award  #MIP-9502307.  We  gratefully acknowledge  MOSIS  chip  fabrication \nand Earthworks Inc.  for  loaning the M30 microphone and amplifier. \n\nReferences \n[1]  D.  W.  Batteau.  The role  of the  pinna in  human localization.  Proc.  R.  Soc. \n\nLondon,  Ser.  B,  168:158-180,1967. \n\n[2]  Neal A.  Bhadkamkar.  Binaural source localizer chip using subthreshold analog \n\ncmos.  In Proceeding  of JCNN,  pages 1866-1870, 1994. \n\n[3]  Earthworks, Inc., P.O.  Box 517,  Wilton, NH  03086.  M90  Microphone. \n[4]  y. Hiranaka and H.  Yamasaki.  Envelop  representations of pinna impulse  re(cid:173)\n\nsponses relating to three-dimensional localization of sound sources.  J.  Acoust. \nSoc.  Am., 73:29,  1983. \n\n[5]  E. Knudsen, G.  Blasdel, and M.  Konishi.  Mechanisms of sound localization in \n\nthe barn owl  (tyto alba).  J.  Compo  Physiol,  133:13-21, 1979. \n\n[6]  J. Lazzaro and C.  A.  Mead.  A silicon model  of auditory localization.  Neural \n\nComputation,  1:47-57, 1989. \n\n[7]  J.C.  Licklider.  A  duplex theory  of pitch  perception.  Experientia,  7:128-133, \n\n1951. \n\n[8]  C.  Mead.  Analog  VLSJ and Neural  Systems.  Addison-Wesley,  1989. \n[9]  Chiang-Jung  Pu.  A  neuromorphic  microphone  for  sound  localization.  PhD \n\nthesis, University of Florida, Gainesville, FL, May 1998. \n\n[10]  L.S.  Smith.  Sound  segmentation  using  onsets  and offsets.  J.  of New  Music \n\nResearch,  23,  1994. \n\n[11]  A.J.  Watkins.  Psychoacoustical aspects of synthesized  vertical locale cues.  J. \n\nAcoust.  Soc.  Am., 63:1152-1165, 1978. 
\n\n\f", "award": [], "sourceid": 1489, "authors": [{"given_name": "John", "family_name": "Harris", "institution": null}, {"given_name": "Chiang-Jung", "family_name": "Pu", "institution": null}, {"given_name": "Jos\u00e9", "family_name": "Pr\u00edncipe", "institution": null}]}