{"title": "Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech", "book": "Advances in Neural Information Processing Systems", "page_first": 807, "page_last": 813, "abstract": null, "full_text": "Periodic Component Analysis: \n\nAn Eigenvalue Method for Representing \n\nPeriodic Structure in Speech \n\nLawrence K. Saul and Jont B. Allen \n\n{lsaul,jba}@research.att.com \n\nAT&T Labs, 180 Park Ave, Florham Park, NJ 07932 \n\nAbstract \n\nAn eigenvalue method is developed for analyzing periodic structure in \nspeech. Signals are analyzed by a matrix diagonalization reminiscent of \nmethods for principal component analysis (PCA) and independent com(cid:173)\nponent analysis (ICA). Our method-called periodic component analysis \n(1l\"CA)-uses constructive interference to enhance periodic components \nof the frequency spectrum and destructive interference to cancel noise. \nThe front end emulates important aspects of auditory processing, such as \ncochlear filtering, nonlinear compression, and insensitivity to phase, with \nthe aim of approaching the robustness of human listeners. The method \navoids the inefficiencies of autocorrelation at the pitch period: it does not \nrequire long delay lines, and it correlates signals at a clock rate on the \norder of the actual pitch, as opposed to the original sampling rate. We \nderive its cost function and present some experimental results. \n\n1 Introduction \n\nPeriodic structure in the time waveform conveys important cues for recognizing and under(cid:173)\nstanding speech[I]. At the end of an English sentence, for example, rising versus falling \npitch indicates the asking of a question; in tonal languages, such as Chinese, it carries lin(cid:173)\nguistic information. In fact, early in the speech chain-prior to the recognition of words or \nthe assignment of meaning-the auditory system divides the frequency spectrum into peri(cid:173)\nodic and non-periodic components. 
This division is geared to the recognition of phonetic features[2]. Thus, a voiced fricative might be identified by the presence of periodicity in the lower part of the spectrum, but not the upper part. In complicated auditory scenes, periodic components of the spectrum are further segregated by their fundamental frequency[3]. This enables listeners to separate simultaneous speakers and explains the relative ease of separating male versus female speakers, as opposed to two recordings of the same voice[4].

The pitch and voicing of speech signals have been extensively studied[5]. The simplest method to analyze periodicity is to compute the autocorrelation function on sliding windows of the speech waveform. The peaks in the autocorrelation function provide estimates of the pitch and the degree of voicing. In clean wideband speech, the pitch of a speaker can be tracked by combining a peak-picking procedure on the autocorrelation function with some form of smoothing[6], such as dynamic programming. This method, however, does not approach the robustness of human listeners in noise, and at best, it provides an extremely gross picture of the periodic structure in speech. It cannot serve as a basis for attacking harder problems in computational auditory scene analysis, such as speaker separation[7], which require decomposing the frequency spectrum into its periodic and non-periodic components.

The correlogram is a more powerful method for analyzing periodic structure in speech. It looks for periodicity in narrow frequency bands. Slaney and Lyon[8] proposed a perceptual pitch detector that autocorrelates multichannel output from a model of the auditory periphery. The auditory model includes a cochlear filterbank and periodicity-enhancing nonlinearities. The information in the correlogram is summed over channels to produce an estimate of the pitch. 
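Both the baseline tracker and the correlogram rest on windowed autocorrelation. As a point of reference, a minimal sketch of autocorrelation-based pitch estimation on one window (numpy assumed; the frame synthesized below is an illustrative stand-in, not data from the paper):

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=80.0, fmax=400.0):
    """Estimate pitch and voicing of one window by peak-picking
    the normalized autocorrelation function."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    r = r / r[0]                               # normalize so r[0] = 1
    lo, hi = int(fs / fmax), int(fs / fmin)    # admissible pitch-period lags
    lag = lo + np.argmax(r[lo:hi + 1])
    return fs / lag, r[lag]                    # pitch (Hz), degree of voicing

# a 60 ms "voiced" frame: 100 Hz fundamental plus its second harmonic
fs = 8000
t = np.arange(int(0.06 * fs)) / fs
frame = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
f0, voicing = autocorr_pitch(frame, fs)        # f0 close to 100 Hz
```

The correlogram applies this same computation within each narrow frequency band of an auditory model, rather than to the wideband waveform.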
This method has two compelling features: (i) by measuring autocorrelation, it produces pitch estimates that are insensitive to phase changes across channels; (ii) by working in narrow frequency bands, it produces estimates that are robust to noise. This method, however, also has its drawbacks. Computing multiple autocorrelation functions is expensive. To avoid aliasing in upper frequency bands, signals must be correlated at clock rates much higher than the actual pitch. From a theoretical point of view, it is unsatisfying that the combination of information across channels is not derived from some principle of optimality. Finally, in the absence of conclusive evidence for long delay lines (~10 ms) in the peripheral auditory system, it seems worthwhile, for both scientists and engineers, to study ways of detecting periodicity that do not depend on autocorrelation.

In this paper, we develop an eigenvalue method for analyzing periodic structure in speech. Our method emulates important aspects of auditory processing but avoids the inefficiencies of autocorrelation at the pitch period. At the same time, it is highly robust to narrowband noise and insensitive to phase changes across channels. Note that while certain aspects of the method are biologically inspired, its details are not intended to be biologically realistic.

2 Method

We develop the method in four stages. These stages are designed to convey the main technical ideas of the paper: (i) an eigenvalue method for combining and enhancing weakly periodic signals; (ii) the use of Hilbert transforms to compensate for phase changes across channels; (iii) the measurement of periodicity by efficient sinusoidal fits; and (iv) the hierarchical analysis of information across different frequency bands.

2.1 Cross-correlation of critical bands

Consider the multichannel output of a cochlear filterbank. 
If the input to this filterbank consists of noisy voiced speech, the output will consist of weakly periodic signals from different critical bands. Can we combine these signals to enhance the periodic signature of the speaker's pitch? We begin by studying a mathematical idealization of the problem. Given n real-valued signals, \{x_i(t)\}_{i=1}^n, what linear combination s(t) = \sum_i w_i x_i(t) maximizes the periodic structure at some fundamental frequency f_0, or equivalently, at some pitch period \tau = 1/f_0? Ideally, the linear combination should use constructive interference to enhance periodic components of the spectrum and destructive interference to cancel noise. We measure the periodicity of the combined signal by the cost function:

c(w, \tau) = \frac{\sum_t |s(t+\tau) - s(t)|^2}{\sum_t |s(t)|^2}, with s(t) = \sum_i w_i x_i(t).   (1)

Here, for simplicity, we have assumed that the signals are discretely sampled and that the period \tau is an integer multiple of the sampling interval. The cost function c(w, \tau) measures the normalized prediction error, with the period \tau acting as a prediction lag. Expanding the right hand side in terms of the weights w_i gives:

c(w, \tau) = \frac{\sum_{ij} w_i w_j A_{ij}(\tau)}{\sum_{ij} w_i w_j B_{ij}},   (2)

where the matrix elements A_{ij}(\tau) are determined by the cross-correlations,

A_{ij}(\tau) = \sum_t [x_i(t) x_j(t) + x_i(t+\tau) x_j(t+\tau) - x_i(t) x_j(t+\tau) - x_i(t+\tau) x_j(t)],

and the matrix elements B_{ij} are the equal-time cross-correlations, B_{ij} = \sum_t x_i(t) x_j(t). Note that the denominator and numerator of eq. (2) are both quadratic forms in the weights w_i. By the Rayleigh-Ritz theorem of linear algebra, the weights w_i minimizing eq. (2) are given by the eigenvector of the matrix B^{-1} A(\tau) with the smallest eigenvalue. For fixed \tau, this solution corresponds to the global minimum of the cost function c(w, \tau). 
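The minimization above maps directly onto a generalized symmetric eigenvalue problem. A minimal sketch, assuming numpy/scipy; the three-channel example (two noisy copies of a periodic signal plus one pure-noise channel) is synthetic and illustrative only:

```python
import numpy as np
from scipy.linalg import eigh

def periodic_weights(X, tau):
    """Minimize c(w, tau) = sum_t |s(t+tau)-s(t)|^2 / sum_t |s(t)|^2 over w,
    where s(t) = sum_i w_i x_i(t) and X has shape (n_channels, n_samples)."""
    D = X[:, tau:] - X[:, :-tau]   # d_i(t) = x_i(t+tau) - x_i(t)
    A = D @ D.T                    # A_ij(tau) = sum_t d_i(t) d_j(t)
    B = X @ X.T                    # equal-time cross-correlations B_ij
    vals, vecs = eigh(A, B)        # generalized problem A w = c B w, ascending
    return vecs[:, 0], vals[0]     # bottom eigenvector of B^{-1} A, minimal cost

rng = np.random.default_rng(0)
fs, f0 = 8000, 100.0
t = np.arange(4096) / fs
clean = np.sin(2 * np.pi * f0 * t)
X = np.vstack([clean + 0.3 * rng.standard_normal(t.size),
               clean + 0.3 * rng.standard_normal(t.size),
               rng.standard_normal(t.size)])       # third channel: pure noise
w, cost = periodic_weights(X, tau=int(fs / f0))
```

The two periodic channels add constructively while the pure-noise channel receives a near-zero weight, so the combined signal has a smaller normalized prediction error than any single channel.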
Thus, matrix diagonalization (or simply computing the bottom eigenvector, which is often cheaper) provides a definitive answer to the above problem.

The matrix diagonalization which optimizes eq. (2) is reminiscent of methods for principal component analysis (PCA) and independent component analysis (ICA)[9]. Our method, which by analogy we call periodic component analysis (πCA), uses an eigenvalue principle to combine periodicity cues from different parts of the frequency spectrum.

2.2 Insensitivity to phase

The eigenvalue method in the previous section has one obvious shortcoming: it cannot compensate for phase changes across channels. In particular, the real-valued linear combination s(t) = \sum_i w_i x_i(t) cannot align the peaks of signals that are (say) \pi/2 radians out of phase, even though such an alignment, prior to combining the signals, would significantly reduce the normalized prediction error in eq. (1).

A simple extension of the method overcomes this shortcoming. Given real-valued signals, \{x_i(t)\}, we consider the analytic signals, \{\hat{x}_i(t)\}, whose imaginary components are computed by Hilbert transforms[10]. The Fourier series of these signals are related by:

x_i(t) = \sum_k \alpha_k \cos(\omega_k t + \phi_k)  \iff  \hat{x}_i(t) = \sum_k \alpha_k e^{i(\omega_k t + \phi_k)}.   (3)

We now reconsider the problem of the previous section, looking for the linear combination of analytic signals, s(t) = \sum_i w_i \hat{x}_i(t), that minimizes the cost function in eq. (1). In this setting, moreover, we allow the weights w_i to be complex so that they can compensate for phase changes across channels. Eq. (2) generalizes in a straightforward way to:

c(w, \tau) = \frac{\sum_{ij} w_i^* w_j A_{ij}(\tau)}{\sum_{ij} w_i^* w_j B_{ij}},   (4)

where A(\tau) and B are Hermitian matrices with matrix elements

A_{ij}(\tau) = \sum_t [\hat{x}_i^*(t) \hat{x}_j(t) + \hat{x}_i^*(t+\tau) \hat{x}_j(t+\tau) - \hat{x}_i^*(t) \hat{x}_j(t+\tau) - \hat{x}_i^*(t+\tau) \hat{x}_j(t)]

and B_{ij} = \sum_t \hat{x}_i^*(t) \hat{x}_j(t). 
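The correspondence in eq. (3) is easy to check numerically: the analytic signal of a pure cosine is the matching complex exponential, so it carries the per-channel phase explicitly, which is exactly what the complex weights absorb. A small sketch, assuming scipy's `hilbert` (which returns the analytic signal directly):

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000
t = np.arange(8000) / fs                     # 1 s: an integer number of cycles
x = np.cos(2 * np.pi * 100 * t + 0.7)        # alpha_k = 1, omega_k = 2*pi*100, phi_k = 0.7
xa = hilbert(x)                              # analytic signal ~ exp(i(omega t + phi))

envelope_ok = np.allclose(np.abs(xa), 1.0, atol=1e-8)   # |alpha_k e^{i(...)}| = 1
step = np.angle(xa[1] / xa[0])               # phase advance per sample = omega / fs
```

Because the sinusoid completes an integer number of cycles over the window, the FFT-based Hilbert transform recovers eq. (3) essentially exactly; with arbitrary windows there is some leakage at the edges.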
Again, the optimal weights w_i are given by the eigenvector corresponding to the smallest eigenvalue of the matrix B^{-1} A(\tau). (Note that all the eigenvalues of this matrix are real because A and B are Hermitian, with B positive definite.)

Our analysis so far suggests a simple-minded approach to investigating periodic structure in speech. In particular, consider the following algorithm for pitch tracking. The first step of the algorithm is to pass speech through a cochlear filterbank and compute analytic signals, \hat{x}_i(t), via Hilbert transforms. The next step is to diagonalize the matrices B^{-1} A(\tau) on sliding windows of \hat{x}_i(t) over a range of pitch periods, \tau \in [\tau_{min}, \tau_{max}]. The final step is to estimate the pitch periods by the values of \tau that minimize the cost function, eq. (1), for each sliding window. One might expect such an algorithm to be relatively robust to noise (because it can zero the weights of corrupted channels), as well as insensitive to phase changes across channels (because it can absorb them with complex weights).

Despite these attractive features, the above algorithm has serious deficiencies. Its worst shortcoming is the amount of computation needed to estimate the pitch period, \tau. Note that the analysis step requires computing n^2 cross-correlation functions, \sum_t \hat{x}_i^*(t) \hat{x}_j(t+\tau), and diagonalizing the n x n matrix, B^{-1} A(\tau). This step is unwieldy for three reasons: (i) the burden of recomputing cross-correlations for different values of \tau, (ii) the high sampling rates required to avoid aliasing in upper frequency bands, and (iii) the poor scaling with the number of channels, n. We address these concerns in the following sections.

2.3 Extracting the fundamental

Further signal processing is required to create multichannel output whose periodic structure can be analyzed more efficiently. Our front end, shown in Fig. 
1, is designed to analyze voiced speech with fundamental frequencies in the range f_0 \in [f_{min}, f_{max}], where f_{max} < 2 f_{min}. The one-octave restriction on f_0 can be lifted by considering parallel, overlapping implementations of our front end for different frequency octaves.

The stages in our front end are inspired by important aspects of auditory processing[10]. Cochlear filtering is modeled by a Bark scale filterbank with contiguous passbands. Next, we compute narrowband envelopes by passing the outputs of these filters through two nonlinearities: half-wave rectification and cube-root compression. These operations are commonly used to model the compressive unidirectional response of inner hair cells to movement along the basilar membrane. Evidence for comparison of envelopes in the peripheral auditory system comes from experiments on comodulation masking release[11]. Thus, the next stage of our front end creates a multichannel array of signals by pairwise multiplying envelopes from nearby parts of the frequency spectrum. Allowed pairs consist of any two envelopes, including an envelope with itself, that might in principle contain energy at two consecutive harmonics of the fundamental. Multiplying these harmonics, just like multiplying two sine waves, produces intermodulation distortion with energy at the sum and difference frequencies. The energy at the difference frequency creates a signature of "residue" pitch at f_0. The energy at the sum frequency is removed by bandpass filtering to frequencies [f_{min}, f_{max}] and aggressively downsampling to a sampling rate f_s = 4 f_{min}. Finally, we use Hilbert transforms to compute the analytic signal in each channel, which we call x_i(t). 
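The stages above can be sketched end to end. Everything here is an illustrative stand-in rather than the paper's implementation: Butterworth bandpass filters approximate the Bark-scale filterbank, the band edges and the same-or-adjacent pairing rule are arbitrary choices, and `resample_poly` performs the aggressive downsampling to f_s = 4 f_min:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def front_end(x, fs, fmin=80.0, fmax=140.0,
              bands=((400, 600), (600, 800), (800, 1080))):
    """Sketch of the front end: filterbank -> rectify/compress -> pairwise
    products -> bandlimit to [fmin, fmax] -> downsample -> analytic signals."""
    envs = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        y = sosfiltfilt(sos, x)
        envs.append(np.maximum(y, 0.0) ** (1.0 / 3.0))  # rectify, compress
    # products of envelopes from the same or adjacent bands
    prods = [envs[i] * envs[j]
             for i in range(len(envs))
             for j in range(i, min(i + 2, len(envs)))]
    sos_f0 = butter(4, [fmin, fmax], btype="bandpass", fs=fs, output="sos")
    q = int(fs // (4 * fmin))                           # new rate ~ 4 * fmin
    chans = [hilbert(resample_poly(sosfiltfilt(sos_f0, p), 1, q)) for p in prods]
    return np.vstack(chans), fs / q

# harmonic complex at f0 = 100 Hz with harmonics 4..10 (400-1000 Hz)
fs = 8000
t = np.arange(4 * fs) / fs
x = sum(np.cos(2 * np.pi * 100 * k * t) for k in range(4, 11))
Z, fs_ds = front_end(x, fs)
# each channel should now be close to a complex exponential near 100 Hz
f_est = np.angle(np.sum(np.conj(Z[0, :-1]) * Z[0, 1:])) * fs_ds / (2 * np.pi)
```

Rectifying and multiplying the band envelopes creates the difference-frequency "residue" at f_0, which the final bandpass isolates; the channels come out at the low clock rate f_s = 4 f_min, which is what makes the later analysis cheap.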
In sum, the stages of the front end create an array of bandlimited analytic signals, x_i(t), that, while derived from different parts of the frequency spectrum, have energy concentrated at the fundamental frequency, f_0. Note that the bandlimiting of these channels to frequencies [f_{min}, f_{max}] where f_{max} < 2 f_{min} removes the possibility that a channel contains periodic energy at any harmonic other than the fundamental. In voiced speech, this has the effect that periodic channels contain noisy sine waves with frequency f_0.

Figure 1: Signal processing in the front end. [speech waveform -> cochlear filterbank -> half-wave rectification and cube-root compression -> pairwise multiplication -> bandlimiting and downsampling -> compute analytic signals]

How can we combine these "baseband" signals to enhance the periodic signature of a speaker's pitch? The nature of these signals leads to an important simplification of the problem. As opposed to measuring the autocorrelation at lag \tau, as in eq. (1), here we can measure the periodicity of the combined signal by a simple sinusoidal fit. Let \Delta = 2\pi f_0 / f_s denote the phase accumulated per sample by a sine wave with frequency f_0 at sampling rate f_s, and let s(t) = \sum_i w_i x_i(t) denote the combined signal. We measure the periodicity of the combined signal by

c(w, \Delta) = \frac{\sum_t |s(t+1) - s(t) e^{i\Delta}|^2}{\sum_t |s(t)|^2} = \frac{\sum_{ij} w_i^* w_j A_{ij}(\Delta)}{\sum_{ij} w_i^* w_j B_{ij}},   (5)

where the matrix B is again formed by computing equal-time cross-correlations, and the matrix A(\Delta) has elements

A_{ij}(\Delta) = \sum_t [x_i^*(t) x_j(t) + x_i^*(t+1) x_j(t+1) - e^{-i\Delta} x_i^*(t) x_j(t+1) - e^{i\Delta} x_i^*(t+1) x_j(t)].

For fixed \Delta, the optimal weights w_i are given by the eigenvector corresponding to the smallest eigenvalue of the matrix B^{-1} A(\Delta).

Note that optimizing the cost function in eq. 
(5) over the phase, \Delta, is equivalent to optimizing over the fundamental frequency, f_0, or the pitch period, \tau. The structure of this cost function makes it much easier to optimize than the earlier measure of periodicity in eq. (1). For instance, the matrix elements A_{ij}(\Delta) depend only on the equal-time and one-sample-lagged cross-correlations, which do not need to be recomputed for different values of \Delta. Also, the channels x_i(t) appearing in this cost function are sampled at a clock rate on the order of f_0, as opposed to the original sampling rate of the speech. Thus, the few cross-correlations that are required can be computed with many fewer operations. These properties lead to a more efficient algorithm than the one in the previous section. The improved algorithm, working with baseband signals, estimates the pitch by optimizing eq. (5) over w and \Delta for sliding windows of x_i(t). One problem still remains, however: the need to invert and diagonalize large numbers of n x n matrices, where the number of channels, n, may be prohibitively large. This final obstacle is removed in the next section.

2.4 Hierarchical analysis

We have developed a fast recursive algorithm to locate a good approximation to the minimum of eq. (5). The recursive algorithm works by constructing and diagonalizing 2 x 2 matrices, as opposed to the n x n matrices required for an exact solution. Our approximate algorithm also provides a hierarchical analysis of the frequency spectrum that is interesting in its own right. A sketch of the algorithm is given below.

The base step of the recursion estimates a value \Delta_i for each individual channel by minimizing the error of a sinusoidal fit:

c_i(\Delta_i) = \frac{\sum_t |x_i(t+1) - x_i(t) e^{i\Delta_i}|^2}{\sum_t |x_i(t)|^2}.   (6)

The minimum of the right hand side can be computed by setting its derivative to zero and solving a quadratic equation in the variable e^{i\Delta_i}. 
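Equivalently, writing r = \sum_t x_i^*(t) x_i(t+1), the \Delta-dependent part of the numerator in eq. (6) is -2 Re[e^{-i\Delta} r], so the minimizing phase is simply \Delta_i = arg(r). A sketch of the base step under this closed form (numpy assumed; the clean test channel is illustrative):

```python
import numpy as np

def base_step(x, fs, fmin=80.0, fmax=140.0):
    """Base of the recursion: best one-sample phase advance Delta_i for one
    analytic channel, its fit error c_i(Delta_i), and the implied f0 (or None
    if f0 falls outside [fmin, fmax] and the channel should be discarded)."""
    r = np.vdot(x[:-1], x[1:])                 # sum_t conj(x_t) x_{t+1}
    delta = np.angle(r)                        # minimizer of eq. (6)
    err = np.sum(np.abs(x[1:] - np.exp(1j * delta) * x[:-1]) ** 2)
    cost = err / np.sum(np.abs(x[:-1]) ** 2)
    f0 = delta * fs / (2 * np.pi)              # Delta = 2*pi*f0/fs inverted
    return delta, cost, (f0 if fmin <= f0 <= fmax else None)

# a clean baseband channel at 100 Hz, sampled at fs = 4 * fmin = 320 Hz
fs = 320
t = np.arange(640) / fs
x = np.exp(1j * 2 * np.pi * 100 * t)
delta, cost, f0 = base_step(x, fs)             # f0 ~ 100 Hz, cost ~ 0
```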
If this minimum does not correspond to a legitimate value of f_0 \in [f_{min}, f_{max}], the ith channel is discarded from future analysis, effectively setting its weight w_i to zero. Otherwise, the algorithm passes three arguments to a higher level of the recursion: the values of \Delta_i and c_i(\Delta_i), and the channel x_i(t) itself.

The recursive step of the algorithm takes as input two auditory "substreams", s_l(t) and s_u(t), derived from "lower" and "upper" parts of the frequency spectrum, and returns as output a single combined stream, s(t) = w_l s_l(t) + w_u s_u(t). In the first step of the recursion, the substreams correspond to individual channels x_i(t), while in the kth step, they correspond to weighted combinations of 2^{k-1} channels. Associated with the substreams are phases, \Delta_l and \Delta_u, corresponding to estimates of f_0 from different parts of the frequency spectrum. The combined stream is formed by optimizing eq. (5) over the two-component weight vector, w = [w_l, w_u]. Note that the eigenvalue problem in this case involves only a 2 x 2 matrix, as opposed to an n x n matrix. The value of \Delta determines the period of the combined stream; in practice, we optimize it over the interval defined by \Delta_l and \Delta_u. Conveniently, this interval tends to shrink at each level of the recursion.

Figure 2: Measures of pitch (f_0) and periodicity (c) in nested regions of the frequency spectrum. The nodes in this tree describe periodic structure in the vowel /u/ from 400-1080 Hz. The nodes in the first (bottom) layer describe periodicity cues in individual channels; the nodes in the kth layer measure cues integrated across 2^{k-1} channels.

The algorithm works in a bottom-up fashion. Channels are combined pairwise to form streams, which are in turn combined pairwise to form new streams. Each stream has a pitch period and a measure of periodicity computed by optimizing eq. (5). 
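The recursive step can be sketched as follows, assuming numpy/scipy; the grid scan over candidate \Delta values between \Delta_l and \Delta_u is a stand-in for whatever one-dimensional optimization is used in practice, and the two noisy substreams below are synthetic:

```python
import numpy as np
from scipy.linalg import eigh

def merge(sl, su, dl, du, n_grid=32):
    """Combine lower/upper substreams via the 2x2 version of eq. (5),
    scanning Delta over the interval defined by the substream phases."""
    Z = np.vstack([sl, su])
    B = Z.conj() @ Z.T                       # equal-time cross-correlations
    best = None
    for delta in np.linspace(min(dl, du), max(dl, du), n_grid):
        D = Z[:, 1:] - np.exp(1j * delta) * Z[:, :-1]
        A = D.conj() @ D.T                   # 2x2 Hermitian numerator matrix
        vals, vecs = eigh(A, B)              # bottom eigenvalue = cost
        if best is None or vals[0] < best[0]:
            best = (vals[0].real, vecs[:, 0], delta)
    cost, w, delta = best
    return w[0] * sl + w[1] * su, cost, delta

rng = np.random.default_rng(2)
fs = 320
t = np.arange(2048) / fs

def make_stream(phi):                        # noisy 100 Hz analytic channel
    sig = np.exp(1j * (2 * np.pi * 100 * t + phi))
    return sig + 0.3 * (rng.standard_normal(t.size)
                        + 1j * rng.standard_normal(t.size))

sl, su = make_stream(0.0), make_stream(1.0)  # same f0, different phases
s, cost, delta = merge(sl, su, 1.9, 2.0)     # true Delta = 2*pi*100/320 ~ 1.96
```

The complex weight on the upper substream absorbs its phase offset, so the merged stream fits a single sinusoid better than either substream alone, mirroring the improvement seen when ascending the tree of Fig. 2.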
We order the channels so that streams are derived from contiguous (or nearly contiguous) parts of the frequency spectrum. Fig. 2 shows partial output of this recursive procedure for a windowed segment of the vowel /u/. Note how, as one ascends the tree, the combined streams have greater periodicity and less variance in their pitch estimates. This shows explicitly how the algorithm integrates information across narrow frequency bands of speech. The recursive output also suggests a useful representation for studying problems, such as speaker separation, that depend on grouping different parts of the spectrum by their estimates of f_0.

3 Experiments

We investigated the performance of our algorithm in simple experiments on synthesized vowels. Fig. 3 shows results from experiments on the vowel /u/. The pitch contours in these plots were computed by the recursive algorithm in the previous section, with f_{min} = 80 Hz, f_{max} = 140 Hz, and 60 ms windows shifted in 10 ms intervals. The solid curves show the estimated pitch contour for the clean wideband waveform, sampled at 8 kHz. The left panel shows results for filtered versions of the vowel, bandlimited to four different frequency octaves. These plots show that the algorithm can extract the pitch from different parts of the frequency spectrum. The right panel shows the estimated pitch contours for the vowel in 0 dB white noise and four types of -20 dB bandlimited noise. The signal-to-noise ratios were computed from the ratio of (wideband) speech energy to noise energy. The white noise at 0 dB presents the most difficulty; by contrast, the bandlimited noise leads to relatively few failures, even at -20 dB. Overall, the algorithm is quite robust to noise and filtering. (Note that the particular frequency octaves used in these experiments had no special relation to the filters in our front end.) 
The pitch contours could be further improved by some form of smoothing, but this was not done for the plots shown.

Figure 3: Tracking the pitch of the vowel /u/ in corrupted speech. Left panel: bandlimited speech (wideband; 0250-0500 Hz; 0500-1000 Hz; 1000-2000 Hz; 2000-4000 Hz). Right panel: noisy speech (clean; 0 dB white noise; -20 dB noise in the bands 0250-0500, 0500-1000, 1000-2000, and 2000-4000 Hz). Axes: time (sec) versus estimated pitch (Hz, roughly 90-130).

4 Discussion

Many aspects of this work need refinement. Perhaps the most important is the initial filtering into narrow frequency bands. While narrow filters have the ability to resolve individual harmonics, overly narrow filters, which reduce all speech input to sine waves, do not adequately differentiate periodic versus noisy excitation. We hope to replace the Bark scale filterbank in Fig. 1 by one that optimizes this tradeoff. We also want to incorporate adaptation and gain control into the front end, so as to improve the performance in nonstationary listening conditions. Finally, beyond the problem of pitch tracking, we intend to develop the hierarchical representation shown in Fig. 2 for harder problems in phoneme recognition and speaker separation[7]. These harder problems seem to require a method, like ours, that decomposes the frequency spectrum into its periodic and non-periodic components.

References

[1] Stevens, K. N. 1999. Acoustic Phonetics. MIT Press: Cambridge, MA.
[2] Miller, G. A. and Nicely, P. E. 1955. An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America 27, 338-352.
[3] Bregman, A. S. 1994. 
Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press: Cambridge, MA.
[4] Brokx, J. P. L. and Nooteboom, S. G. 1982. Intonation and the perceptual separation of simultaneous voices. Journal of Phonetics 10, 23-36.
[5] Hess, W. 1983. Pitch Determination of Speech Signals: Algorithms and Devices. Springer-Verlag.
[6] Talkin, D. 1995. A robust algorithm for pitch tracking (RAPT). In Kleijn, W. B. and Paliwal, K. K. (Eds.), Speech Coding and Synthesis, 497-518. Elsevier Science.
[7] Roweis, S. 2000. One microphone source separation. In Tresp, V., Dietterich, T., and Leen, T. (Eds.), Advances in Neural Information Processing Systems 13. MIT Press: Cambridge, MA.
[8] Slaney, M. and Lyon, R. F. 1990. A perceptual pitch detector. In Proc. ICASSP-90, 1, 357-360.
[9] Molgedey, L. and Schuster, H. G. 1994. Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett. 72(23), 3634-3637.
[10] Hartmann, W. M. 1997. Signals, Sound, and Sensation. Springer-Verlag.
[11] Hall, J. W., Haggard, M. P., and Fernandes, M. A. 1984. Detection in noise by spectro-temporal pattern analysis. J. Acoust. Soc. Am. 76, 50-56.
", "award": [], "sourceid": 1939, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Jont", "family_name": "Allen", "institution": null}]}