{"title": "A Model of Auditory Streaming", "book": "Advances in Neural Information Processing Systems", "page_first": 52, "page_last": 58, "abstract": null, "full_text": "A MODEL OF AUDITORY STREAMING \n\nSusan L. McCabe & Michael J. Denham \n\nNeurodynamics Research Group \nSchool of Computing \nUniversity of Plymouth \nPlymouth PL4 8AA, U.K. \n\nABSTRACT \n\nAn essential feature of intelligent sensory processing is the ability to focus on the part of the signal of interest against a background of distracting signals, and to be able to direct this focus at will. In this paper the problem of auditory scene segmentation is considered and a model of the early stages of the process is proposed. The behaviour of the model is shown to be in agreement with a number of well-known psychophysical results. The principal contribution of this model lies in demonstrating how streaming might result from interactions between the tonotopic patterns of activity of input signals and traces of previous activity which feed back and influence the way in which subsequent signals are processed. \n\n1 INTRODUCTION \n\nThe appropriate segmentation and grouping of incoming sensory signals is important in enabling an organism to interact effectively with its environment (Llinas, 1991). The formation of associations between signals which are considered to arise from the same external source allows the organism to recognise significant patterns and relationships within the signals from each source, without being confused by accidental coincidences between unrelated signals (Bregman, 1990). The intrinsically temporal nature of sound means that, in addition to being able to focus on the signal of interest, the ability to predict how that signal is expected to progress is perhaps of equal significance; such expectations can then be used to facilitate further processing of the signal.
It is important to remember that perception is a creative act (Luria, 1980). The organism creates its interpretation of the world in response to the current stimuli, within the context of its current state of alertness, attention, and previous experience. The creative aspects of perception are exemplified in the auditory system, where peripheral processing decomposes acoustic stimuli. Since the frequency spectra of complex sounds generally overlap, this poses a complicated problem for the auditory system: which parts of the signal belong together, and which of the subgroups should be associated with each other from one moment to the next, given the extra complication of possible discontinuities and occlusion of sound signals? The process of streaming effectively acts to associate those sounds emitted from the same source, and may be seen as an accomplishment, rather than the breakdown of some integration mechanism (Bregman, 1990). \n\nThe cognitive model of streaming proposed by (Bregman, 1990) is based primarily on Gestalt principles such as common fate, proximity, similarity and good continuation. Streaming is seen as a multistage process, in which an initial, preattentive process partitions the sensory input, causing successive sounds to be associated depending on the relationship between pitch proximity and presentation rate. Further refinement of these sound streams is thought to involve the use of attention and memory in the processing of single streams over longer time spans. \n\nRecently a number of computational models which implement these concepts of streaming have been developed. A model of streaming in which pitch trajectories are used as the basis of sequential grouping is proposed by (Cooke, 1992).
In related work, (Brown, 1992) uses data-driven grouping schemata to form complex sound groups from frequency components with common periodicity and simultaneous onset. Sequential associations are then developed on the basis of pitch trajectory. An alternative approach suggests that the coherence of activity within networks of coupled oscillators may be interpreted to indicate both simultaneous and sequential groupings (Wang, 1995), (Brown, 1995), and can therefore also model the streaming of complex stimuli. Sounds belonging to the same stream are distinguished by synchronous activity, and the relationship between frequency proximity and stream formation is modelled by the degree of coupling between oscillators. \n\nA model which adheres closely to auditory physiology has been proposed by (Beauvois, 1991). Processing is restricted to two frequency channels and the streaming of pure tones. The model uses competitive interactions between frequency channels and leaky integrator model neurons in order to replicate a number of aspects of human psychophysical behaviour. The model described here used Beauvois' work as a starting point, but has been extended to include multichannel processing of complex signals. It can account for the relationship between streaming, frequency difference and time interval (Beauvois, 1991), the temporal development and variability of streaming perceptions (Anstis, 1985), and the influence of background organisation on foreground perceptions (Bregman, 1975), as well as a number of other behavioural results which have been omitted due to space limitations. \n\n2 THE MODEL \n\nWe assume the existence of tonotopic maps, in which frequency is represented as a distributed pattern of activity across the map. Interactions between the excitatory tonotopic patterns of activity reflecting stimulus input, and the inhibitory tonotopic masking patterns resulting from previous activity, form the basis of the model.
In order to simulate behavioural experiments, the relationship between characteristic frequency and position across the arrays is determined by equal spacing within the ERB scale (Glasberg, 1990). The pattern of activation across the tonotopic axis is represented in terms of a Gaussian function with a time course which reflects the onset-type activity found frequently within the auditory system. \n\nInput signals therefore take the form: \n\ni(x,t) = c1·(t − t_onset)·e^(−c2·(t − t_onset))·e^(−(fc(x) − fs)² / 2σ²)   [1] \n\nwhere i(x,t) is the probability of input activity at position x, time t, c1 and c2 are constants, t_onset is the starting time of the signal, fc(x) is the characteristic frequency at position x, fs is the stimulus frequency, and σ determines the spread of the activation. \n\nIn models where competitive interactions within a single network are used to model the streaming process, such as (Beauvois, 1991), it is difficult to see how the organisation of background sounds can be used to improve foreground perceptions (Bregman, 1975), since the strengthening of one stream generally serves to weaken others. To overcome this problem, the model of preattentive streaming proposed here consists of two interacting networks, the foreground and background networks, F and B, illustrated in figure 1. The output from F indicates the activity, if any, in the foreground, or attended, stream, and the output from B reflects any other activity. The interaction between the two eventually ensures that those signals appearing in the output from F, i.e. in the foreground stream, do not appear in the output from B, the background, and vice versa. In the model, strengthening of the organisation of the background sounds results in the 'sharpening' of the foreground stream, due to the enhanced inhibition produced by a more coherent background. \n\nFigure 1 : Connectivity of the Streaming Networks.
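The input activation of equation [1] can be sketched in code. The following Python fragment is a minimal illustration, not the authors' implementation: the function names are hypothetical, the ERB-rate conversion follows Glasberg & Moore (1990), and measuring the Gaussian spread on the ERB axis (with an arbitrary default width `sigma_erb`) is an assumption made here to match the equal-ERB spacing of the array.

```python
import numpy as np

def erb_rate(f_hz):
    """Glasberg & Moore (1990) frequency-to-ERB-rate conversion."""
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def input_activity(x_freqs, t, t_onset, f_stim, c1=75.0, c2=100.0, sigma_erb=0.5):
    """Probability of input activity i(x,t) across the tonotopic array.

    x_freqs   : characteristic frequencies fc(x) of the array positions (Hz)
    sigma_erb : spread of the Gaussian, here taken on the ERB axis (assumption)
    """
    dt = t - t_onset
    if dt < 0:
        return np.zeros_like(x_freqs)          # signal has not started yet
    envelope = c1 * dt * np.exp(-c2 * dt)      # onset-type time course
    df = erb_rate(x_freqs) - erb_rate(f_stim)  # distance on the ERB axis
    spread = np.exp(-df**2 / (2.0 * sigma_erb**2))
    return envelope * spread
```

The envelope rises from zero at onset and decays, so a brief tone produces a transient burst of activity centred on the stimulus frequency.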
\n\nNeurons within each array do not interact with each other but simply perform a summation of their input activity. A simplified neuron model with low-pass filtering of the inputs, and output representing the probability of firing, is used: \n\np(x,t) = σ[Σj vj(x,t)], where σ(y) = 1/(1 + e^(−y))   [2] \n\nThe inputs to the foreground net are: \n\nv1(x,t) = (1 − dt/τ1)·v1(x, t−dt) + V1·φ(i(x,t))·dt   [3] \n\nv2(x,t) = (1 − dt/τ2)·v2(x, t−dt) + V2·φ(mFi(x, t−dt))·dt   [4] \n\nv3(x,t) = (1 − dt/τ3)·v3(x, t−dt) + V3·φ(mB(x, t−dt))·dt   [5] \n\nwhere x is the position across the array, t is time, dt is the sampling interval, τj are time constants which determine the rate of decay of activity, Vj are weights on each of the inputs, and φ(y) is a function used to simulate the stochastic properties of nerve firing, which returns a value of 1 or 0 with probability y. \n\nThe output activity pattern in the foreground net and its 'inverse', mF(x,t) and mFi(x,t), are found by: \n\nmF(x,t) = σ[v1(x,t) − η(v2(x,t), n) − η(v3(x,t), n)]   [6] \n\nmFi(x,t) = max{ (1/N)·Σi mF(xi, t−dt) − mF(x, t−dt), 0 }   [7] \n\nwhere η(v(x,t), n) is the mean of the activity within neighbourhood n of position x at time t, and N is the number of frequency channels. Background inputs are similarly calculated. \n\nTo summarise, the current activity in response to the acoustic stimulus forms an excitatory input to both the foreground and background streaming arrays, F and B. In addition, F receives inhibitory inputs reflecting the current background activity, and the inverse of the current foreground activity. The interplay between the excitatory and inhibitory activities causes the model to gradually focus the foreground stream and exclude extraneous stimuli.
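One update step of the foreground net, equations [3]-[7], can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, the τ and V values are taken from the parameter list at the end of the paper, and the neighbourhood mean η is implemented with a simple moving-average convolution (edge handling is a choice made here, not specified in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(y):
    """Stochastic firing: returns 1 with probability y (clipped to [0,1]), else 0."""
    return (rng.random(np.shape(y)) < np.clip(y, 0.0, 1.0)).astype(float)

def sigma_fn(y):
    """Logistic squashing function of equation [2]."""
    return 1.0 / (1.0 + np.exp(-y))

def neighbourhood_mean(v, n):
    """eta(v(x,t), n): mean activity within +/- n positions of each x."""
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    return np.convolve(v, kernel, mode="same")

def inverse_pattern(mF_prev):
    """Equation [7]: mean array activity minus local activity, floored at zero."""
    return np.maximum(mF_prev.mean() - mF_prev, 0.0)

def foreground_step(v1, v2, v3, i_t, mFi_prev, mB_prev, dt,
                    tau=(0.05, 0.6, 0.6), V=(100.0, 5.0, 5.0), n=2):
    """One leaky-integrator update of equations [3]-[6] for the foreground net."""
    v1 = (1 - dt / tau[0]) * v1 + V[0] * phi(i_t) * dt
    v2 = (1 - dt / tau[1]) * v2 + V[1] * phi(mFi_prev) * dt
    v3 = (1 - dt / tau[2]) * v3 + V[2] * phi(mB_prev) * dt
    mF = sigma_fn(v1 - neighbourhood_mean(v2, n) - neighbourhood_mean(v3, n))
    return v1, v2, v3, mF
```

The background net would be updated symmetrically, exchanging the roles of mF and mB.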
Since the patterns of inhibitory input reflect the distributed patterns of activity in the input, the relationship between frequency difference and streaming results simply from the graded inhibition produced by these patterns. The relationship between tone presentation rate and streaming is determined by the time constants in the model, which can be tuned to alter the rate of decay of activity. \n\nTo enable comparisons with psychophysical results, we view the judgement of coherence or streaming made by the model as the difference between the strength of the foreground response to one set of tones compared to the other. The strength of the response to a given frequency, Resp(f,t), is a weighted sum of the activity within a window centred on the frequency: \n\nResp(f,t) = Σi mF(x(f)+i, t)·e^(−i²/2σ²), summed over i = −W to W   [8] \n\nwhere W determines the size of the window centred on position x(f), the position in the map corresponding to frequency f, and σ determines the spread of the weighting function about position x(f). \n\nThe degree of coherence between two tones, say f1 and f2, is assumed to depend on the difference in strength of foreground response to the two: \n\nCoh(f1,f2,t) = 1 − |Resp(f1,t) − Resp(f2,t)| / (Resp(f1,t) + Resp(f2,t))   [9] \n\nwhere Coh(f1,f2,t) ranges between 0, when Resp(f1,t) or Resp(f2,t) vanishes and the difference between the responses is a maximum, indicating maximum streaming, and 1, when the responses are equal and maximally coherent. Values between these limits are interpreted as the degree of coherence, analogous to the probability of human subjects making a judgement of coherence (Anstis, 1985), (Beauvois, 1991). \n\n3 RESULTS \n\nExperiments exploring the effect of frequency interval and tone presentation rate on streaming are described in (Beauvois, 1991). Subjects were required to listen to an alternating sequence of tones, ABABAB...
for 15 seconds, and then to judge whether at the end of the sequence they perceived an oscillating, trill-like, temporally coherent sequence, or two separate streams, one of interrupted high tones, the other of interrupted low tones. Their results showed clearly an increasing tendency towards stream segmentation both with increasing frequency difference between A and B, and with increasing tone presentation rate, results which the model substantially reproduces, as may be seen in figure 2. \n\nFigure 2 : Mean Psychophysical 'o' and Model '*' Responses to the Stimulus ABAB... (A=1000 Hz, B as indicated along the x axis (Hz); tone presentation rates of 4.76, 5.88, 7.69, 11.11 and 20 tones/sec, as shown.) \n\nIn investigating the temporal development of stream segmentation, (Anstis, 1985) used a similar stimulus to the experiment described above, but in this case subjects were required to indicate continuously whether they were perceiving a coherent or streaming signal. As can be seen in figure 3, the model clearly reproduces the principal features found in their experiments, i.e.
the probability of hearing a single, fused stream declines during each run; the more rapid the tone presentation rate, the quicker stream segmentation occurs; and the judgements made were quite variable during each run. \n\nIn an experiment to investigate whether the organisation of the background sounds affects the foreground, subjects were required to judge whether tone A was higher or lower than B (Bregman, 1975). This judgement was easy when the two tones were presented in isolation, but performance degraded significantly when the distractor tones, X, were included. However, when a series of 'captor' tones, C, with frequency close to X were added, the judgement became easier, and the degree of improvement was inversely related to the difference in frequency between X and C. In the experiment, subjects received an initial priming AB stimulus, followed by a set of 9 tones: CCCXABXCC. The frequency of the captor tones was manipulated to investigate how the proximity of 'captor' to 'distractor' tones affected the required AB order judgement. \n\nFigure 3 : The Probability of Perceptual Coherence as a Function of Time in Response to Two Alternating Tones. Symbols: '.' 2 tones/s, 'o' 4 tones/s, '+' 8 tones/s. \n\nIn order to model this experiment and the effect of priming, an 'attentive' input, focussed on the region of the map corresponding to the A and B tones, was included. We assume, as argued by Bregman, that subjects' performance in this task is related to the degree to which they are able to stream the AB pair separately. His D parameter is a measure of the degree to which AB/BA can be discriminated. The model's performance is then given by the strength of the foreground response to the AB pair as compared to the distractor tones, and Coh([A B],X) is used to measure this difference.
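The response strength of equation [8] and the coherence judgement of equation [9] used throughout these experiments can be sketched in Python. This is an illustrative fragment only: the function names, the default window values `W` and `sigma`, the edge clamping, and the handling of the zero-activity case are assumptions not specified in the paper.

```python
import numpy as np

def resp(mF, x_f, W=5, sigma=2.0):
    """Equation [8]: Gaussian-weighted response strength around map position x_f.

    W and sigma are illustrative defaults; the paper ties them to its own
    map resolution.
    """
    idx = np.arange(-W, W + 1)
    pos = np.clip(x_f + idx, 0, len(mF) - 1)   # clamp window at array edges
    weights = np.exp(-idx**2 / (2.0 * sigma**2))
    return float(np.sum(mF[pos] * weights))

def coherence(mF, x_f1, x_f2, **kw):
    """Equation [9]: 1 = fully coherent, 0 = fully streamed."""
    r1, r2 = resp(mF, x_f1, **kw), resp(mF, x_f2, **kw)
    if r1 + r2 == 0.0:
        return 1.0  # no activity at either frequency; convention chosen here
    return 1.0 - abs(r1 - r2) / (r1 + r2)
```

Equal foreground responses at the two positions give a coherence of 1, while activity confined to one position gives 0, matching the interpretation given for equation [9].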
The model exhibits a similar sensitivity to the distractor/captor frequency difference to that of human subjects, and it appears that the formation of a coherent background stream allows the model to distinguish the foreground group more clearly. \n\nFigure 4 : A) Experiment to Demonstrate the Formation of Multiple Streams (Bregman, 1975). B) Model Response; '.' Mean Degree of Coherence to XABX, 'o' Bregman's D Parameter, '+' Model's Judgement of Coherence. \n\n4 DISCUSSION \n\nThe model of streaming which we have presented here is essentially a very simple one, which can nevertheless successfully replicate a wide range of psychophysical experiments. Embodied in the model is the idea that the characteristics of the incoming sensory signals result in activity which modifies the way in which subsequent incoming signals are processed. The inhibitory feedback signals effectively comprise expectations against which later signals are processed. Processing in much of the auditory system seems to be restricted to processing within frequency 'channels'. In this model, it is shown how local interactions, restricted almost entirely to within-channel activity, can form a global computation of stream formation.
It is not known where streaming occurs in the auditory system, but feedback projections both within and between nuclei are extensive, perhaps allowing an iterative refinement of streams. Longer range projections, originating from attentive processes or memory, may modify local interactions to facilitate the extraction of recognised or interesting sounds. \n\nThe relationship between streaming and frequency interval could be modelled by systematically graded inhibitory weights between frequency channels. However, in the model this relationship arises directly from the distributed incoming activity patterns, which seems a more robust and plausible solution, particularly if one takes the need to cope with developmental changes into account. Although, to simplify the simulations, peripheral auditory processing was not included in the model, the activity patterns assumed as input can be produced by the competitive processing of the output from a cochlear model. \n\nAn important aspect of intelligent sensory processing is the ability to focus on signals of interest against a background of distracting signals, thereby enabling the perception of significant temporal patterns. Artificial sensory systems with similar capabilities could act as robust pre-processors for other systems, such as speech recognisers, fault detection systems, or any other application which requires the dynamic extraction and temporal linking of subsets of the overall signal. \n\nValues Used For Model Parameters \nσ=.005, c1=75, c2=100, V=[100 5 5 5 5], τ=[.05 .6 .6 .6 .6], n=2, N=100 \n\nReferences \nAnstis, S., Saida, S. (1985) J. Exptl Psych, 11(3), pp257-271 \nBeauvois, M.W., Meddis, R. (1991) J. Exptl Psych, 43A(3), pp517-541 \nBregman, A.S., Rudnicky, A.I. (1975) J. Exptl Psych, 1(3), pp263-267 \nBregman, A.S. (1990) 'Auditory scene analysis', MIT Press \nBrown, G.J. (1992) University of Sheffield Research Reports, CS-92-22 \nBrown, G.J., Cooke, M.
(1995) submitted to IJCAI workshop on Computational Auditory Scene Analysis \nCooke, M.P. (1992) Computer Speech and Language, 6, pp153-173 \nGlasberg, B.R., Moore, B.C.J. (1990) Hearing Research, 47, pp103-138 \nLlinas, R.R., Pare, D. (1991) Neuroscience, 44(3), pp521-535 \nLuria, A. (1980) 'Higher cortical functions in man', NY: Basic \nvan Noorden, L.P.A.S. (1975) doctoral dissertation, published by Institute for Perception Research, PO Box 513, Eindhoven, NL \nWang, D.L. (1995) in 'Handbook of brain theory and neural networks', MIT Press", "award": [], "sourceid": 1026, "authors": [{"given_name": "Susan", "family_name": "McCabe", "institution": null}, {"given_name": "Michael", "family_name": "Denham", "institution": null}]}