{"title": "Discriminative Binaural Sound Localization", "book": "Advances in Neural Information Processing Systems", "page_first": 1253, "page_last": 1260, "abstract": null, "full_text": "Discriminative Binaural Sound Localization\n\nEhud Ben-Reuven and Yoram Singer\nSchool of Computer Science & Engineering\n\nThe Hebrew University, Jerusalem 91904, Israel\n\nudi@benreuven.com, singer@cs.huji.ac.il\n\nAbstract\n\nTime difference of arrival (TDOA) is commonly used to estimate the az-\nimuth of a source in a microphone array. The most common methods\nto estimate TDOA are based on \ufb01nding extrema in generalized cross-\ncorrelation waveforms. In this paper we apply microphone array tech-\nniques to a manikin head. By considering the entire cross-correlation\nwaveform we achieve azimuth prediction accuracy that exceeds extrema\nlocating methods. We do so by quantizing the azimuthal angle and\ntreating the prediction problem as a multiclass categorization task. We\ndemonstrate the merits of our approach by evaluating the various ap-\nproaches on Sony\u2019s AIBO robot.\n\n1 Introduction\nIn this paper we describe and evaluate several algorithms to perform sound localization in\na commercial entertainment robot. The physical system being investigated is composed of\na manikin head equipped with a two microphones and placed on a manikin body. This type\nof systems is commonly used to model sound localization in biological systems and the\nalgorithms used to analyze the signal are usually inspired from neurology. In the case of an\nentertainment robot there is no need to be limited to a neurologically inspired model and\nwe will use combination of techniques that are commonly used in microphone arrays and\nstatistical learning. The focus of the work is the task of localizing an unknown stationary\nsource (compact in location and broad in spectrum). 
The goal is to find the azimuth angle of the source relative to the head.

A common paradigm to approximately find the location of a sound source employs a microphone array and estimates time differences of arrival (TDOA) between microphones in the array (see for instance [1]). In a dual-microphone array it is usually assumed that the difference between the two channels is limited to a small time delay (or a linear phase in the frequency domain) and therefore the cross-correlation is peaked at the time corresponding to the delay. Thus, methods that search for extrema in cross-correlation waveforms are commonly used [2]. The time delay approach is based on the assumption that the sound waves propagate along a single path from the source to the microphone and that the microphone responses of the two channels for the given source location are approximately the same. In order for this to hold, the microphones should be identical, co-aligned, and near each other relative to the source. In addition there should not be any obstructions between or near the microphones. The time delay assumption fails in the case of a manikin head: the microphones are antipodal, and in addition the manikin head and body affect the response in a complex way. In our system the distance to the supporting floor was also significant. Our approach for overcoming these difficulties is composed of two stages. First, we perform signal processing based on the generalized cross-correlation transform called the Phase Transform (PHAT), also called the Cross Power Spectrum Phase (CPSP). This signal processing removes to a large extent variations due to the sound source. Then, rather than proceeding with peak-finding, we employ discriminative learning methods by casting the azimuth estimation as a multiclass prediction problem. Combining the two stages gave improved results in our experimental setup.

This paper is organized as follows.
In Sec. 2 we describe how the signal received at the two microphones was processed to generate accurate features. In Sec. 3 we outline the supervised learning algorithms we used. We then discuss in Sec. 4 approaches to combine predictions from multiple segments. We describe experimental results in Sec. 5 and conclude with a brief discussion in Sec. 6.

2 Signal Processing

Throughout the paper we denote signals in the time domain by lower case letters and in the frequency domain by upper case letters. We denote the convolution operator between two signals by * and the correlation operator by ⋆. The unknown source signal is denoted by s and thus its spectrum is S. The source signal passes through different physical setups and is received at the right and left microphones. We denote the received signals by r and l. We model the different physical media the signal passes through as two linear systems whose frequency responses are denoted by H_r and H_l. In addition the signals are contaminated with noise that may account for non-linear effects such as room reverberations (see for instance [3] for more detailed noise models). Thus, the received signals can be written in the time and frequency domain as,

r(t) = (h_r * s)(t) + n_r(t) ,   R(ω) = H_r(ω) S(ω) + N_r(ω)   (1)
l(t) = (h_l * s)(t) + n_l(t) ,   L(ω) = H_l(ω) S(ω) + N_l(ω)   (2)

Since the source signal is typically non-stationary we break each training and test signal into segments and perform the processing described in the sequel based on the short-time Fourier transform. Let K be the number of segments a signal is divided into and n the number of samples in a single segment. Each segment is multiplied by a Hanning window and padded with zeros to smooth the end-of-segment effects and increase the resolution of the short-time Fourier transform (see for instance [8]). Denote by L_k and R_k the spectra of the k-th left and right signal-segments after the above processing.
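The segmentation and windowing described above can be sketched in a few lines (an illustrative NumPy sketch, not the authors' code; the segment length and padding factor are arbitrary choices):

```python
import numpy as np

def segment_signal(x, seg_len=512, pad_factor=2):
    """Split a signal into segments, apply a Hanning window to each,
    and zero-pad before taking the short-time Fourier transform."""
    n_seg = len(x) // seg_len                    # K: number of segments
    window = np.hanning(seg_len)                 # smooth end-of-segment effects
    segments = x[:n_seg * seg_len].reshape(n_seg, seg_len) * window
    # Zero-padding increases the resolution of the short-time spectrum.
    padded = np.pad(segments, ((0, 0), (0, (pad_factor - 1) * seg_len)))
    return np.fft.rfft(padded, axis=1)           # one spectrum per segment

spectra = segment_signal(np.random.randn(8000))
```

Each row of `spectra` plays the role of one L_k or R_k above.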
Based on the properties of the Fourier transform, the local cross-correlation between the two signals can be computed efficiently by the inverse Fourier transform, denoted F^{-1}, of the product of the spectrum of one segment and the complex conjugate of the spectrum of the other,

c_k(τ) = F^{-1}{ L_k(ω) R_k*(ω) } .   (3)

Had the difference between the two signals been a mere time delay due to the different locations of the microphones, the cross-correlation would have obtained its maximal value at a point which corresponds to the time-lag between the received signals. However, since the source signal passes through different physical media, the short-time cross-correlation does not necessarily obtain a large value at the time-lag index. It is therefore common (see for instance [1]) to multiply the spectrum of the cross-correlation by a weighting function in order to compensate for the differences in the frequency responses obtained at the two microphones. Denoting the spectral shaping function for the k-th segment by W_k(ω), the generalized cross-correlation from Eq. (3) is,

g_k(τ) = F^{-1}{ W_k(ω) L_k(ω) R_k*(ω) } .

For "plain" cross-correlation, W_k(ω) = 1. In our tests we found that a globally-equalized cross-correlation gives better results. The transform is obtained by setting W_k(ω) = 1 / P(ω), where P(ω) is the average over all measurements and both channels of the power spectrum.
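The generalized cross-correlation of Eq. (3), including the PHAT weighting introduced next, can be sketched as follows (an illustrative NumPy sketch; the function names and the synthetic delayed signal are ours):

```python
import numpy as np

def gcc(left_seg, right_seg, weight=None):
    """Generalized cross-correlation of two equal-length segments.
    `weight` maps the cross power spectrum to a spectral shaping
    function W(w); None corresponds to the plain cross-correlation."""
    cross = np.fft.fft(left_seg) * np.conj(np.fft.fft(right_seg))
    if weight is not None:
        cross = cross * weight(cross)
    # The inverse transform yields the cross-correlation waveform.
    return np.real(np.fft.ifft(cross))

# PHAT weighting: keep only the phase of the cross power spectrum.
phat = lambda cross: 1.0 / np.maximum(np.abs(cross), 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
y = np.roll(x, -5)                 # right channel leads by 5 samples
wave = gcc(x, y, weight=phat)
lag = int(np.argmax(wave))         # index of the extremum, here lag == 5
```

For a pure delay the PHAT waveform is an impulse at the lag; with a manikin head the waveform is far richer, which is why the paper classifies the entire waveform rather than locating its peak.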
Finally, for PHAT the weight for the spectral point at each (discrete) frequency ω is,

W_k(ω) = 1 / | L_k(ω) R_k*(ω) | .

3 Supervised Learning

We quantize the azimuth into a finite set of angles and cast azimuth estimation as a multiclass prediction problem, estimating for each quantized angle a histogram-based distribution over the training measurements. In order to classify new test data we simply compute the likelihood of the observed measurement under each distribution and choose the class attaining the maximal likelihood (ML) score with respect to the distribution defined by the histogram,

ŷ = argmax_y P(x | y) .   (7)

Multiclass Fisher discriminant: Generalising the Fisher discriminant for binary classification problems to multiclass settings, each class is modelled as a multivariate normal distribution. To do so we divide the training set into subsets, where the y-th subset corresponds to measurements from azimuths in class y. The density function of the y-th class is

P(x | y) = (2π)^{-d/2} |Σ_y|^{-1/2} exp( -(1/2) (x - μ_y)^T Σ_y^{-1} (x - μ_y) ) ,

where (x - μ_y)^T is the transpose of (x - μ_y), d is its dimensionality, μ_y denotes the mean of the normal distribution, and Σ_y the covariance matrix. Each mean and covariance matrix is set to be the maximum likelihood estimate,

μ_y = (1/m_y) Σ_{i : y_i = y} x_i ,   Σ_y = (1/m_y) Σ_{i : y_i = y} (x_i - μ_y)(x_i - μ_y)^T ,

where m_y is the number of training measurements in class y. New test waveforms were then classified using the ML formula, Eq.
7.

The advantage of the Fisher linear discriminant is that it is simple and easy to implement. However, it degenerates if the training data is non-stationary, as is often the case in sound localization problems due to effects such as moving objects. We therefore also designed, implemented and tested a second discriminative method based on the Perceptron.

Online Learning using Multiclass Perceptron with Kernels: Despite, or because of, its age, the Perceptron algorithm [9] is a simple and effective algorithm for classification. We chose the Perceptron algorithm for its simplicity, adaptability, and ease in incorporating the Mercer kernels described below. The Perceptron algorithm is a conservative online algorithm: it receives an instance, outputs a prediction for the instance, and only in case it made a prediction mistake does the Perceptron update its classification rule, which is a hyperplane. Since our setting requires building a multiclass rule, we use the version described in [6] which generalises the Perceptron to multiclass settings.
We first describe the general form of the algorithm and then discuss the modifications we performed in order to adapt it to the sound localization problem.

To extend the Perceptron algorithm to multiclass problems we maintain k hyperplanes (one per class) denoted w_1, ..., w_k. The algorithm works in an online fashion, working on one example at a time. On the t-th round, the algorithm gets a new instance x_t and sets the predicted class to be the index of the hyperplane attaining the largest inner-product with the input instance,

ŷ_t = argmax_r ⟨w_r, x_t⟩ .

If the algorithm made a prediction error, that is ŷ_t ≠ y_t, it updates the set of hyperplanes. In [6] a family of possible update schemes was given. In this work we have used the so-called uniform update, which is very simple to implement and also attained very good results. The uniform update moves the hyperplane corresponding to the correct label y_t in the direction of x_t and all the hyperplanes whose inner-products were larger than ⟨w_{y_t}, x_t⟩ away from x_t. Formally, let E = { r ≠ y_t : ⟨w_r, x_t⟩ ≥ ⟨w_{y_t}, x_t⟩ }. We update the hyperplanes as follows,

w_{y_t} ← w_{y_t} + x_t ,   w_r ← w_r − x_t / |E| for r ∈ E ,   (8)

and if r ∉ E ∪ {y_t} then we keep w_r intact. This update of the hyperplanes is performed only on rounds on which there was a prediction error. Furthermore, on such rounds only a subset of the vectors is updated and thus the algorithm is called ultraconservative. The multiclass Perceptron algorithm is guaranteed to converge to a perfect classification rule if the data can be classified perfectly by an unknown set of hyperplanes. When the data cannot be classified perfectly, an alternative competitive analysis can be applied.

The problem with the above algorithm is that it allows only linear classification rules. However, linear classifiers may not suffice in many applications, including the sound localization application.
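The uniform multiclass update above can be sketched as follows (a minimal illustration of the update rule; the class count and toy instances are ours):

```python
import numpy as np

class MulticlassPerceptron:
    """Ultraconservative multiclass Perceptron with the uniform update [6]."""

    def __init__(self, n_classes, dim):
        self.W = np.zeros((n_classes, dim))     # one hyperplane per class

    def predict(self, x):
        return int(np.argmax(self.W @ x))       # largest inner-product wins

    def update(self, x, y):
        scores = self.W @ x
        if np.argmax(scores) == y:
            return                              # conservative: no change
        # Error set E: wrong classes scoring at least as high as the true class.
        E = [r for r in range(len(scores)) if r != y and scores[r] >= scores[y]]
        self.W[y] += x                          # move true hyperplane toward x
        for r in E:
            self.W[r] -= x / len(E)             # move offenders away uniformly

clf = MulticlassPerceptron(n_classes=3, dim=2)
clf.update(np.array([1.0, 0.0]), y=1)
clf.update(np.array([0.0, 1.0]), y=2)
```

After these two mistake-driven updates, `clf.predict` returns the correct label for each toy instance.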
We therefore incorporate kernels into the multiclass Perceptron. A kernel is an inner-product operator K : X × X → R, where X is the instance space (for instance, PHAT waveforms). An explicit way to describe K is via a mapping φ from X to an inner-product space H such that K(x, x') = ⟨φ(x), φ(x')⟩. Common kernels are RBF kernels and polynomial kernels, which take the form K(x, x') = (⟨x, x'⟩ + 1)^D. Any learning algorithm that is based on inner-products with a weighted sum of vectors can be converted to a kernel-based version by explicitly keeping the weighted combination of vectors. In the case of the multiclass Perceptron we replace the update from Eq. 8 with a "kernelized" version,

w_{y_t} ← w_{y_t} + φ(x_t) ,   w_r ← w_r − φ(x_t) / |E| for r ∈ E .   (9)

Since we cannot compute φ explicitly, we instead perform bookkeeping of the weights associated with each instance and compute inner-products using the kernel functions. For
a hyperplane w_r, the inner-product with a new instance x is ⟨w_r, φ(x)⟩ = Σ_i α_{r,i} K(x_i, x), where α_{r,i} is the accumulated weight of training instance x_i in hyperplane r.

Algorithm                               Err
PHAT + Poly Kernels, D=5
PHAT + Fisher
PHAT + Peak-finding
Equalized CrossCor + Peak-finding

Table 1: Summary of results of sound localization methods for a single segment.

In our
experiments we found that a polynomial kernel of degree D = 5 yielded the best results. The results are summarised in Table 1. We defer the discussion of the results to Sec. 5.

4 Multi-segment Classification

The accuracy of a single-segment classifier is too low to make our approach practical. However, if the source of sound does not move for a period of time, we can accumulate evidence from multiple segments in order to increase the accuracy. Due to the lack of space we only outline the multi-segment classification procedure for the Fisher discriminant and compare it to smoothing and averaging techniques used in the signal processing community.

In multi-segment classification we are given T waveforms for which we assume that the source angle did not change in this period, i.e., y_1 = y_2 = ... = y_T. Each small window was processed independently to give a feature vector x_t. We then converted the waveform feature vector into a probability estimate P(x_t | y) for each discrete angle y using the Fisher discriminant. We next assumed that the probability estimates for consecutive windows are independent. This is of course a false assumption. However, we found that methods which compensate for the dependencies did not yield substantial improvements. The probability density function of the entire window is therefore P(x_1, ..., x_T | y) = Π_{t=1}^{T} P(x_t | y). We compared the Maximum Likelihood decision under the independence assumption with the following commonly used signal processing technique.
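Under the independence assumption the per-segment densities multiply, so the ML decision sums log-likelihoods; a sketch with Gaussian class models in the spirit of the multiclass Fisher discriminant (the function names and toy two-class data are ours):

```python
import numpy as np

def fit_gaussians(X, y, n_classes):
    """Maximum-likelihood mean and covariance per class."""
    params = []
    for c in range(n_classes):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        # Small ridge keeps the covariance invertible.
        cov = np.cov(Xc.T, bias=True) + 1e-6 * np.eye(X.shape[1])
        params.append((mu, cov))
    return params

def log_likelihood(x, mu, cov):
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def classify_window(segments, params):
    """ML decision over a window: sum per-segment log-likelihoods
    (equivalent to multiplying the densities) and take the argmax."""
    totals = [sum(log_likelihood(s, mu, cov) for s in segments)
              for mu, cov in params]
    return int(np.argmax(totals))

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, size=(200, 2))   # class 0 training vectors
X1 = rng.normal(3.0, 1.0, size=(200, 2))   # class 1 training vectors
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
params = fit_gaussians(X, y, n_classes=2)
window = rng.normal(3.0, 1.0, size=(10, 2))  # ten segments from class 1
label = classify_window(window, params)
```

Summing in the log domain also avoids the numerical underflow that multiplying many small densities would cause.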
We averaged the power spectrum and cross power spectrum of the different windows and only then proceeded to compute the generalized cross-correlation waveform. The averaged weight function for the PHAT waveform is now W(ω) = 1 / | (1/T) Σ_t L_t(ω) R_t*(ω) |, where the average is over the measurements in the same window. When using averaged power spectra for the equalization we obtain the smoothed coherence transform (SCOT) [1].

Algorithm
Max. Likl.
PHAT + Fisher
SCOT + Fisher
Smoothed PHAT + Fisher
Smoothed PHAT + Peak-finding
SCOT + Peak-finding

Table 2: Summary of results of sound localization methods for multiple segments.

5 Experimental Results

For each head direction, segments of data were collected with a partial overlap. For each direction, the measurements were divided into equal amounts of train and test measurements, so the same number of segments per class was used for training and for evaluation. An FFT was used to generate un-normalized cross-correlations, equalized cross-correlations, and PHAT waveforms, and extrema locations in the histograms were found. Further technical details can be obtained from http://udi.benreuven.com. (MATLAB is a trademark of Mathworks, Inc. and AIBO is a trademark of Sony and its affiliates.)

We used two evaluation measures for comparing the different algorithms. The first, denoted Err
, is the empirical classification error, which counts the number of times the predicted (discretized) angle was different from the true angle. The second evaluation measure is the average absolute difference between the predicted angle and the true angle. It should be kept in mind that the test data was obtained from the same direction set as the training data. Therefore, the empirical classification error is an appropriate evaluation measure in our experimental setting. However, alternative evaluation methods should be devised for general recordings when the test signal is not confined to a finite set of possible directions.

The accuracy results with respect to both measures on the test data for the various representations and algorithms are summarized in Table 1. It is clear from the results that traditional methods which search for extrema in the waveforms are inferior to the discriminative methods. As a by-product we confirmed that equalized cross-correlation is inferior to PHAT modelling for high SNR with strong reverberations; similar results were reported in [11]. The two discriminative methods achieve about the same results. Using the Perceptron algorithm with a degree 5 polynomial kernel achieves the best results, but the difference between the Perceptron and the multiclass Fisher discriminant is not statistically significant. It is worth noting that we also tested linear regression algorithms. Their performance turns out to be inferior to the discriminative multiclass approaches. A possible explanation is that the multiclass methods employ multiple hyperplanes and project each class onto a different hyperplane, while linear regression methods seek a single hyperplane onto which all examples are projected.

Figure 2: Acquisition system overview.

Although Fisher's discriminant and the Perceptron algorithm exhibit practically the same performance, they have different merits.
While Fisher's discriminant is very simple to implement and is space efficient, the Perceptron is capable of adapting quickly and achieves high accuracy even with small amounts of training data. In Fig. 3 we compare the error rates of Fisher's discriminant and the Perceptron on subsets of the training data. The Perceptron clearly outperforms Fisher's discriminant when the number of training examples is small, but once enough examples are provided the two algorithms are indistinguishable. This suggests that online algorithms may be more suitable when the sound source is stationary only for short periods.

Last we compared multi-segment results. Multi-segment classification was performed by taking consecutive measurements over a window during which the source location remained fixed. In Table 2 we report classification results for the various multi-segment techniques. (Since the Perceptron algorithm used a very large number of kernels we did not implement a multi-segment classification using the Perceptron. We are currently conducting research on space-efficient kernel-based methods for multi-segment classification.) Here again the best performing method is Fisher's discriminant; combining the scores directly, without averaging and smoothing, leads the pack.
The resulting prediction accuracy of Fisher's discriminant is good enough to make the solution practical so long as the sound source is fixed and the recording conditions do not change.

Figure 3: Error rates of Fisher's discriminant and the Perceptron for various training sizes.

6 Discussion

We have demonstrated that by using discriminative methods highly accurate sound localization is achievable on a small commercial robot equipped with a binaural hearing system placed inside a manikin head. We have confirmed that PHAT is superior to plain cross-correlation. For classification using multiple segments, classifying the entire PHAT waveform gave better results than various techniques that smooth the power spectrum over the segments. Our current research is focused on efficient discriminative methods for sound localization in changing environments.

References

[1] C. H. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on ASSP, 24(4):320-327, 1976.

[2] M. Omologo and P. Svaizer. Acoustic event localization using a crosspower-spectrum phase based technique. In Proceedings of ICASSP 1994, Adelaide, Australia, 1994.

[3] T. Gustafsson and B. D. Rao. Source Localization in Reverberant Environments: Statistical Analysis. Submitted to IEEE Trans. on Speech and Audio Processing, 2000.

[4] N. Strobel and R. Rabenstein. Classification of Time Delay Estimates for Robust Speaker Localization. In Proceedings of ICASSP 1999, Phoenix, USA, March 1999.

[5] J. Benesty. Adaptive eigenvalue decomposition algorithm for passive acoustic source localization. J. Acoust. Soc. Am., 107(1), January 2000.

[6] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. In Proc. of the 14th Annual Conf.
on Computational Learning Theory, 2001.

[7] R. O. Duda and P. E. Hart. Pattern Classification. Wiley, 1973.

[8] B. Porat. A Course in Digital Signal Processing. Wiley, 1997.

[9] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-407, 1958.

[10] B. Widrow and M. E. Hoff. Adaptive switching circuits. 1960 IRE WESCON Convention Record, pages 96-104, 1960.

[11] P. Aarabi and A. Mahdavi. The Relation Between Speech Segment Selectivity and Time-Delay Estimation Accuracy. In Proc. of IEEE Conf. on Acoustics, Speech and Signal Processing, 2002.
", "award": [], "sourceid": 2151, "authors": [{"given_name": "Ehud", "family_name": "Ben-reuven", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}