{"title": "An Alternative to Low-Level-Synchrony-Based Methods for Speech Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 2029, "page_last": 2037, "abstract": "Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction.  One popular approach to this problem is audio-visual synchrony detection. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal.  Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all.  Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models.  The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot).  Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection.  The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database.", "full_text": "An Alternative to Low-Level-Synchrony-Based\n\nMethods for Speech Detection\n\nPaul Ruvolo\n\nUniversity of California, San Diego\n\nMachine Perception Laboratory\nAtkinson Hall (CALIT2), 6100\n\n9500 Gilman Dr., Mail Code 0440\n\nLa Jolla, CA 92093-0440\n\npaul@mplab.ucsd.edu\n\nJavier R. 
Movellan\n\nUniversity of California, San Diego\n\nMachine Perception Laboratory\nAtkinson Hall (CALIT2), 6100\n\n9500 Gilman Dr., Mail Code 0440\n\nLa Jolla, CA 92093-0440\n\nmovellan@mplab.ucsd.edu\n\nAbstract\n\nDetermining whether someone is talking has applications in many areas such as\nspeech recognition, speaker diarization, social robotics, facial expression recog-\nnition, and human computer interaction. One popular approach to this problem\nis audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed\nto be talking if the visual signal around that speaker correlates with the auditory\nsignal. Here we show that with the proper visual features (in this case movements\nof various facial muscle groups), a very accurate detector of speech can be cre-\nated that does not use the audio signal at all. Further we show that this person\nindependent visual-only detector can be used to train very accurate audio-based\nperson dependent voice models. The voice model has the advantage of being able\nto identify when a particular person is speaking even when they are not visible to\nthe camera (e.g. in the case of a mobile robot). Moreover, we show that a simple\nsensory fusion scheme between the auditory and visual models improves perfor-\nmance on the task of talking detection. The work here provides dramatic evidence\nabout the ef\ufb01cacy of two very different approaches to multimodal speech detection\non a challenging database.\n\n1\n\nIntroduction\n\nIn recent years interest has been building [10, 21, 16, 8, 12] in the problem of detecting locations\nin the visual \ufb01eld that are responsible for auditory signals. A specialization of this problem is de-\ntermining whether a person in the visual \ufb01eld is currently talking. 
Applications of this technology\nare wide ranging: from speech recognition in noisy environments, to speaker diarization, to expres-\nsion recognition systems that may bene\ufb01t from knowledge of whether or not the person is talking to\ninterpret the observed expressions.\nPast approaches to the problem of speaker detection have focused on exploiting audio-visual syn-\nchrony as a measure of how likely a person in the visual \ufb01eld is to have generated the current audio\nsignal [10, 21, 16, 8, 12]. One bene\ufb01t of these approaches is their general purpose nature, i.e., they\nare not limited to detecting human speech [12]. Another bene\ufb01t is that they require very little pro-\ncessing of the visual signal (some of them operating on raw pixel values [10]). However, as we\nshow in this document, when visual features tailored to the analysis of facial expressions are used,\nit is possible to develop a very robust speech detector, based only on the visual signal, that far\noutperforms past approaches.\nGiven the strong performance of the visual speech detector, we incorporate auditory information\nusing the paradigm of transductive learning. Speci\ufb01cally we use the visual-only detector\u2019s output as\n\n\fan uncertain labeling of when a given person is speaking and then use this labeling along with a set\nof acoustic measurements to create a voice model of how that person sounds when he/she speaks.\nWe show that the error rate of the visual-only speech detector can be more than halved by combining\nit with the auditory voice models developed via transductive learning.\nAnother view of our proposed approach is that it is also based on synchrony detection, but\nat a much higher level and over a much longer time scale than previous approaches. 
More concretely, our\napproach moves from the level of synchrony between pixel \ufb02uctuations and sound energy to the\nlevel of the visual markers of talking and auditory markers of a particular person\u2019s voice. As we\nwill show later, a bene\ufb01t of this approach is that the auditory model that is optimized to predict\nthe talking/not-talking visual signal for a particular candidate speaker also works quite well without\nusing any visual input. This is an important property since the visual input is often intermittently\nabsent or degraded in real world applications (e.g. when a mobile robot moves to a part of the\nroom where it can no longer see everyone in the room, or when a subject\u2019s mouth is occluded). The\nresults presented here challenge the orthodoxy of the use of low-level synchrony-related measures\nthat dominates research in this area.\n\n2 Methods\n\nIn this section we review a popular approach to speech detection that uses Canonical Correlation\nAnalysis (CCA). Next we present our method for visual-only speaker detection using facial expres-\nsion dynamics. Finally, we show how to incorporate auditory information using our visual-only\nmodel as a training signal.\n\n2.1 Speech Detection by Low-level Synchrony\n\nHershey et al. [10] pioneered the use of audio-visual synchrony for speech detection. Slaney et\nal. [21] presented a thorough evaluation of methods for detecting audio-visual synchrony. Slaney\net al. were chie\ufb02y interested in designing a system to automatically synchronize audio and video;\nhowever, their results inspired others to use similar approaches for detecting regions in the visual\n\ufb01eld responsible for auditory events [12]. The general idea is that if measurements in two different\nsensory modalities are correlated then they are likely to be generated by a single underlying common\ncause. 
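As a toy illustration of this idea (a sketch for intuition, not the paper's method), two noisy measurements driven by a shared cause show a high Pearson correlation, while a stream with no common cause does not; all stream names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

cause = rng.normal(size=n)                       # shared underlying cause (e.g. speech activity)
audio_energy = cause + 0.5 * rng.normal(size=n)  # noisy auditory measurement
mouth_motion = cause + 0.5 * rng.normal(size=n)  # noisy visual measurement
unrelated = rng.normal(size=n)                   # stream with no common cause

r_related = np.corrcoef(audio_energy, mouth_motion)[0, 1]  # high: shared cause
r_unrelated = np.corrcoef(audio_energy, unrelated)[0, 1]   # near zero
```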
For example, if mouth pixels of a potential speaker are highly predictable based on sound\nenergy then it is likely that there is a common cause underlying both sensory measurements (i.e. that\nthe candidate speaker is currently talking).\nA popular approach to detect correlations between two different signals is Canonical Correlation\nAnalysis. Let A1, . . . , AN and V1, . . . , VN be sequences of audio and visual features, respectively,\nwith each Ai \u2208 Rv and Vi \u2208 Ru. We collectively refer to the audio and visual features with the\nvariables A \u2208 Rv\u00d7N and V \u2208 Ru\u00d7N . The goal of CCA is to \ufb01nd weight vectors wA \u2208 Rv and\nwV \u2208 Ru such that the projection of each sequence of sensory measurements onto these weight\nvectors is maximally correlated. The objective can be stated as follows:\n\n(wA, wV ) = argmax_{||wA||2 \u2264 1, ||wV ||2 \u2264 1} \u03c1(A^T wA, V^T wV)\n\n(1)\n\nwhere \u03c1 is the Pearson correlation coef\ufb01cient. Equation 1 reduces to a generalized eigenvalue\nproblem (see [9] for more details).\nOur model of speaker detection based on CCA involves computing canonical vectors wA and wV\nthat solve Equation 1 and then computing time-windowed estimates of the correlation of the auditory\nand visual features projected on these vectors at each point in time. The \ufb01nal judgment as to whether\nor not a candidate face is speaking is determined by thresholding the windowed correlation value.\n\n2.2 Visual Detector of Speech\n\nThe Facial Action Coding System (FACS) is an anatomically inspired, comprehensive and versatile\nmethod to describe human facial expressions [7]. FACS encodes the observed expressions as com-\nbinations of Action Units (AUs). 
Roughly speaking, AUs describe changes in the appearance of the\nface that are due to the effect of individual muscle movements.\n\n\fFigure 1: The Computer Expression Recognition Toolbox was used to automatically extract 84\nfeatures describing the observed facial expressions. These features were used for training a speech\ndetector.\n\nIn recent years signi\ufb01cant progress has been made in the full automation of FACS. The Computer\nExpression Recognition Toolbox (CERT, shown in Figure 1) [2] is a state-of-the-art system for\nautomatic FACS coding from video.\nThe output of the CERT system provides a versatile and effective set of features for vision-based\nautomatic analysis of facial behavior. Among other things it has been successfully used to recognize\ndriver fatigue [22], discriminate genuine from faked pain [13], and estimate how dif\ufb01cult a student\n\ufb01nds a video lecture [24, 23].\nIn this paper we used 84 outputs of the CERT system ranging from the locations of key feature points\non the face to movements of individual facial muscle groups (Action Units) to detectors that specify\nhigh-level emotional categories (such as distress). Figure 2 shows an example of the dynamics of\nCERT outputs during periods of talking and non-talking. There appears to be a periodicity to the\nmodulations in the chin raise Action Unit (AU 17) during the speech period. In order to capture this\ntype of temporal \ufb02uctuation we processed the raw CERT outputs with a bank of temporal Gabor\n\ufb01lters. Figure 3 shows a subset of the \ufb01lters we used. The Figure shows the real and imaginary parts\nof the \ufb01lters over a range of bandwidth and fundamental frequency values. In this work we use\na total of 25 temporal Gabors. 
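In code, a temporal Gabor filter bank of this kind might look like the following sketch. The video frame rate (30 Hz) and the Gaussian envelope widths are assumptions for illustration; the paper parameterizes the envelopes by half-magnitude bandwidth, and that conversion is not reproduced here.

```python
import numpy as np

def temporal_gabor(peak_hz, sigma_s, fs, extent_sigmas=3.0):
    """Complex temporal Gabor: a Gaussian envelope times a complex sinusoid."""
    t = np.arange(-extent_sigmas * sigma_s, extent_sigmas * sigma_s, 1.0 / fs)
    return np.exp(-t ** 2 / (2 * sigma_s ** 2)) * np.exp(2j * np.pi * peak_hz * t)

def gabor_energy(signal, filters):
    """Magnitude of each filter's response: one energy channel per filter."""
    return np.stack([np.abs(np.convolve(signal, f, mode="same")) for f in filters])

fs = 30.0  # assumed video frame rate in Hz (not stated in the text)
filters = [temporal_gabor(f0, sigma, fs)
           for f0 in (1, 2, 3, 4, 5)                      # peak frequencies (Hz), as in the paper
           for sigma in (0.05, 0.10, 0.15, 0.20, 0.25)]   # envelope widths (s), illustrative

au17 = np.sin(2 * np.pi * 3 * np.arange(0, 4, 1.0 / fs))  # toy AU 17 trace
features = gabor_energy(au17, filters)                    # shape (25, num_frames)
```

Each CERT output channel would be filtered this way, yielding 25 energy channels per raw output.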
Speci\ufb01cally, we use all combinations of half-magnitude bandwidths\nof 3.4, 6.8, 10.2, 13.6, and 17 Hz and peak frequency values of 1, 2, 3, 4, and 5 Hz.\nThe outputs of these \ufb01lters were used as input to a ridge logistic regression classi\ufb01er [5]. Logistic\nregression is a ubiquitous tool for machine learning and has performed quite well over a range of\ntasks [11]. Popular approaches like Support Vector Machines and Boosting can be seen as special\ncases of logistic regression. One advantage of logistic regression is that it provides estimates of the\nposterior probability of the category of interest, given the input. In our case, the probability that a\nsequence of observed images corresponds to a person talking.\n\n2.3 Voice Model\n\nThe visual speech detector described above was then used to automatically label audio-visual speech\nsignals. These labels were then used to train person-speci\ufb01c voice models. This paradigm for\ncombining weakly labeled data and supervised learning is known as transductive learning in the\nmachine learning community. It is possible to cast the bootstrapping of the voice model very sim-\nilarly to the more conventional Canonical Correlation method discussed in Section 2.1. Although\nit is known [20] that non-linear models provide superior performance to linear models for auditory\nspeaker identi\ufb01cation, consider the case where we seek to learn a linear model over auditory features\nto determine a model of a particular speaker\u2019s voice. If we assume that we are given a \ufb01xed linear\n\n\fFigure 2: An example of the shift in action unit output when talking begins. The Figure shows a bar\ngraph where the height of each black line corresponds to the value of Action Unit 17 for a particular\nframe. 
Qualitatively, there is a periodicity in CERT\u2019s Action Unit 17 (Chin Raise) output during the\ntalking period.\n\nFigure 3: A selection of the temporal Gabor \ufb01lter bank used to express the modulation of the CERT\noutputs. Shown are both the real and imaginary Gabor components over a range of bandwidths and\npeak frequencies.\n\n\fmodel, wV , that predicts when a subject is talking based on visual features, we can reformulate the\nCCA-based approach to learning an auditory model as a simple linear regression problem:\n\nwA = argmax_{||wA||2 \u2264 1} \u03c1(A^T wA, V^T wV)\n\n(2)\n\n= argmin_{wA} min_b ||A^T wA + b \u2212 V^T wV||^2\n\n(3)\n\nwhere b is a bias term. While this view is useful for seeing the commonalities between our approach\nand the classical synchrony approaches, it is important to note that our approach does not have the\nrestriction of requiring the use of linear models of either the auditory or visual talking detectors. In\nthis section we show how we can \ufb01t a non-linear voice model that is very popular for the task of\nspeaker detection using the visual detector output as a training signal.\n\n2.3.1 Auditory Features\n\nWe use the popular Mel-Frequency Cepstral Coef\ufb01cients (MFCCs) [3] as the auditory descriptors\nto model the voice of a candidate speaker. MFCCs have been applied to a wide range of audio\ncategory recognition problems such as genre identi\ufb01cation and speaker identi\ufb01cation [19], and can\nbe seen as capturing the timbral information of sound. See [14] for a more thorough discussion of\nthe MFCC feature. In other work various statistics of the MFCC features have also been shown to\nbe informative (e.g. 
\ufb01rst or second temporal derivatives). In this work we use only the raw MFCC\noutputs, leaving a systematic exploration of the acoustic feature space as future work.\n\n2.3.2 Learning and Classi\ufb01cation\n\nGiven a temporal segmentation of when each of a set of candidate speakers is speaking, we de\ufb01ne the\nset of MFCC features generated by speaker i as FAi, where the jth column, FAij , denotes the MFCC\nfeatures of speaker i at the jth time point that the speaker is talking. In order to build an auditory\nmodel that can discriminate who is speaking, we \ufb01rst model the density of input features pi for the ith\nspeaker based on the training data FAi. In order to determine the probability of a speaker generating\nnew input audio features, TA, we apply Bayes\u2019 rule p(Si = 1|TA) \u221d p(TA|Si = 1)p(Si = 1),\nwhere Si indicates whether or not the ith speaker is currently speaking. The probability distributions\nof the audio features given whether or not a given speaker is talking are modeled using 4-state hidden\nMarkov models with each state having an independent 4-component Gaussian mixture model. The\ntransition matrix is unconstrained (i.e. any state may transition to any other). The parameters of the\nvoice model were learned using the Expectation-Maximization (EM) algorithm [6].\n\n2.3.3 Threshold Selection\n\nThe outputs of the visual detector over time provide an estimate of whether or not a candidate\nspeaker is talking. In this work we convert these outputs into a binary temporal segmentation of\nwhen a candidate speaker was or was not talking. In practice we found that the outputs of the CERT\nsystem had different baselines for each subject, and thus it was necessary to develop a method for\nautomatically \ufb01nding person-dependent thresholds of the visual detector output in order to accurately\nsegment the periods where each speaker was or was not talking. 
Our threshold selection mechanism\nuses a training portion of audio-visual input as a method of tuning the threshold to each candidate\nspeaker.\nIn order to select an appropriate threshold we trained a number of audio models, each us-\ning a different threshold for the visual speech detector output. Each of these thresholds induces a\nbinary segmentation, which in turn is fed to the voice model learning component described in Sec-\ntion 2.3. Next, we evaluate each voice model on a set of testing samples (e.g.\nthose collected\nafter a suf\ufb01cient amount of audio-visual input has been collected for a particular candidate\nspeaker). The acoustic model that achieved the highest generalization performance (with respect to\nthe thresholded visual detector\u2019s output on the testing portion) was then selected for fusion with the\nvisual-only model. The reason for this choice is that models trained with less-noisy labels are likely\nto yield better generalization performance, and thus the boundary used to create those labels was\n\n\fFigure 4: A schematic of our threshold selection system. In the training stage several models are\ntrained with different temporal segmentations over who is speaking. In the testing stage each of\nthese discrete models is evaluated (in the \ufb01gure there are only two but in practice we use more)\nto see how well it generalizes on the testing set (where ground truth is de\ufb01ned based on the visual\ndetector\u2019s thresholded output). Finally, the detector that generalizes the best is fused with the visual\ndetector to give the \ufb01nal output of our system.\n\nmost likely at the boundary between the two classes. See Figure 4 for a graphical depiction of this\napproach. Note that at no point in this approach is it necessary to have ground truth values for when\na particular person was speaking. 
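The selection loop just described can be sketched as follows. This is a minimal sketch under stated assumptions: a per-class diagonal Gaussian stands in for the paper's HMM-GMM voice model, and per-frame agreement with the pseudo-labels stands in for the generalization measure; all function names are hypothetical.

```python
import numpy as np

def fit_voice_model(mfcc, talking):
    """Stand-in for the HMM-GMM voice model: per-class Gaussian over MFCC frames."""
    stats = {}
    for label in (0, 1):
        frames = mfcc[talking == label]
        stats[label] = (frames.mean(axis=0), frames.var(axis=0) + 1e-6)
    return stats

def voice_score(model, mfcc):
    """Per-frame log-likelihood ratio of talking vs. not talking."""
    def loglik(label):
        mu, var = model[label]
        return -0.5 * (((mfcc - mu) ** 2 / var) + np.log(var)).sum(axis=1)
    return loglik(1) - loglik(0)

def select_threshold(visual_train, mfcc_train, visual_test, mfcc_test, thresholds):
    """Pick the threshold whose induced voice model best predicts the thresholded
    visual detector on the held-out portion; no ground truth is used anywhere."""
    best = None
    for th in thresholds:
        model = fit_voice_model(mfcc_train, (visual_train > th).astype(int))
        pseudo_labels = (visual_test > th).astype(int)
        agreement = np.mean((voice_score(model, mfcc_test) > 0) == pseudo_labels)
        if best is None or agreement > best[1]:
            best = (th, agreement, model)
    return best
```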
All assessments of generalization performance are with respect to\nthe outputs of the visual classi\ufb01er and not the true speaking vs. not speaking label.\n\n2.4 Fusion\n\nThere are many approaches [15] to fusing the visual and auditory model outputs to estimate the\nlikelihood that someone is or is not talking. In the current work we employ a very simple fusion\nscheme that likely could be improved upon in the future. In order to compute the fused output we\nsimply add the whitened outputs of the visual and auditory detectors.\n\n\f2.5 Related Work\n\nMost past approaches for detecting whether someone is talking have either been purely visual [18]\n(i.e. using a classi\ufb01er trained on visual features from a training database) or based on audio-visual\nsynchrony [21, 8, 12].\nThe system most similar to that proposed in this document is due to Noulas and Krose [16]. In their\nwork, a switching model is proposed that modi\ufb01es the audio-visual probability emission distributions\nbased on who is likely speaking. 
Three principal differences from our work are: Noulas and Krose\nuse a synchrony-based method for initializing the learning of both the voice and visual model (in\ncontrast to our system that uses a robust visual detector for initializing), Noulas and Krose use\nstatic visual descriptors (in contrast to our system that uses Gabor energy \ufb01lters which capture facial\nexpression dynamics), and \ufb01nally we provide a method for automatic threshold selection to adjust\nthe initial detector\u2019s output to the characteristics of the current speaker.\n\n3 Results\n\nWe compared the performance of two multi-modal methods for speech detection. The \ufb01rst method\nused low-level audio-visual synchrony detection to estimate the probability of whether or not some-\none is speaking at each point in time (see Section 2.1). The second approach is the approach pro-\nposed in this document: start with a visual-only speech detector, then incorporate acoustic infor-\nmation by training speaker-dependent voice models, and \ufb01nally fuse the audio and visual models\u2019\noutputs.\nThe database we use for training and evaluation is the D006 (aka RUFACS) database [1]. The\nportion of the database we worked with contains 33 interviews (each approximately 3 minutes in\nlength) between college students and an interrogator who is not visible in the video. The database\ncontains a wide variety of vocal and facial expression behavior, as the responses of the interviewees\nare not scripted but rather spontaneous. As a consequence, this database provides a much more\nrealistic testbed for speech detection algorithms than the highly scripted databases (e.g. the CUAVE\ndatabase [17]) used to evaluate other approaches. Since we cannot extract visual information about the\nperson behind the camera, we de\ufb01ne the task of interest to be a binary classi\ufb01cation of whether or\nnot the person being interviewed is talking at each point in time. 
It is reasonable to conclude that our\nperformance would only be improved on the task of speaker detection in two-speaker environments\nif we could see both speakers\u2019 faces. The generalization to more than two speakers is untested in this\ndocument; we leave determining the scalability of our approach in that setting\nas future work.\nIn order to test the effect of the voice model bootstrapping, we use the \ufb01rst half of each interview as\na training portion (that is the portion on which the voice model is learned) and the second half as the\ntesting portion. The speci\ufb01c choice of a 50/50 split between training and test is somewhat arbitrary;\nhowever, it is a reasonable compromise between spending too long learning the voice model and not\nhaving suf\ufb01cient audio input to \ufb01t the voice model. It is important to note that no ground truth was\nused from the \ufb01rst 50% of each interview as the labeling was the result of the person-independent\nvisual speech detector.\nIn total we have 6 interviews that are suitable for evaluation purposes (i.e. we have audio and video\ninformation and codes as to when the person in front of the camera is talking). However, we have\n27 additional interviews where only video was available. The frames from these videos were used\nto train the visual-only speech detector. For both our method and the synchrony method the audio\nmodality was summarized by the \ufb01rst 13 (0th through 12th) MFCCs.\nTo evaluate the synchrony-based model, we perform the following steps. First, we apply CCA be-\ntween MFCCs and CERT outputs (plus the temporal derivatives and absolute value of the temporal\nderivatives) over the database of six interviews. Next we look for regions in the interview where the\nprojection of the audio and video onto the vectors found by CCA yields high correlation. 
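A windowed-correlation score of this kind might be computed as in the sketch below. It assumes the two 1-D streams have already been projected onto the CCA vectors; the function name is hypothetical.

```python
import numpy as np

def windowed_correlation(a_proj, v_proj, fs, window_s=5.0):
    """Pearson correlation of the two projected streams inside a sliding
    window centered at each frame; high values suggest the candidate is talking."""
    half = int(window_s * fs / 2)
    scores = np.zeros(len(a_proj))
    for t in range(len(a_proj)):
        lo, hi = max(0, t - half), min(len(a_proj), t + half + 1)
        wa, wv = a_proj[lo:hi], v_proj[lo:hi]
        if wa.std() > 0 and wv.std() > 0:
            scores[t] = np.corrcoef(wa, wv)[0, 1]
    return scores

# The projected streams are assumed to come from a CCA solver, e.g.:
# a_proj = A.T @ w_A;  v_proj = V.T @ w_V
```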
To compute\nthis correlation, we summarized the synchrony at each point in time by computing the correlation\nover a 5-second window centered at that point. This evaluation method is called \u201cWindowed Cor-\nrelation\u201d in the results table for the synchrony detection (see Table 2). We tried several different\nwindow lengths and found that the performance was best with 5 seconds.\n\n\fSubject  Visual Only  Audio Only  Fused   Visual No Dynamics\n16       0.9891       0.9796      0.9929  0.7894\n17       0.9444       0.9560      0.9776  0.8166\n49       0.9860       0.9858      0.9956  0.8370\n56       0.9598       0.8924      0.9593  0.8795\n71       0.9800       0.9321      0.9780  0.9375\n94       0.9125       0.8924      0.9364  0.7506\nmean     0.9620       0.9397      0.9733  0.8351\n\nTable 1: Results of our bootstrapping model for detecting speech. Each row indicates the perfor-\nmance (as measured by area under the ROC) of a particular detector on the second half of a video\nof a particular subject.\n\nSubject  Windowed Correlation\n16       .5925\n17       .7937\n49       .6067\n56       .7290\n71       .8078\n94       .6327\nmean     .6937\n\nTable 2: The performance of the synchrony detection model. Each row indicates the performance of\na particular detector on the second half of a video of a particular subject.\n\nTable 2 and Table 1 summarize the performance of the synchrony detection approach and our ap-\nproach, respectively. Our approach achieves an average area under the ROC of .9733 compared to\n.6937 for the synchrony approach. Moreover, our approach is able to do considerably better using\nonly vision on the area under the ROC metric (.9620) than the synchrony detection approach that\nhas access to both audio and video. The Gabor temporal \ufb01lter bank helped to signi\ufb01cantly improve\nperformance, raising it from .8351 to .9620 (see Table 1). 
It is also encouraging that our method was\nable to learn an accurate audio-only model of the interviewee (average area under the ROC of .9397).\nThis validates that our method is of use in situations where we cannot expect to always have visual\ninput on each of the candidate speakers\u2019 faces.\nOur approach also bene\ufb01tted from fusing the learned audio-based speaker models. This can be seen\nby the fact that the 2-AFC error (1 \u2212 area under the ROC) for the fused model\ndecreased by an average (geometric mean over each of the six interviews) of 57% over the vision-only\nmodel.\n\n4 Discussion and Future Work\n\nWe described a new method for multi-modal detection of when a candidate person is speaking. Our\napproach used the output of a person-independent vision-based speech detector to train a person-\ndependent voice model. To this end we described a novel approach for threshold selection for\ntraining the voice model based on the outputs of the visual detector. We showed that our method\ngreatly improved performance with respect to previous approaches to the speech detection problem.\nWe also brie\ufb02y discussed how the work proposed here can be seen in a similar light to the more\nconventional synchrony detection methods of the past. This view, combined with the large gain in\nperformance for the method presented here, demonstrates that synchrony over long time scales and\nhigh-level features (e.g. talking / not talking) works signi\ufb01cantly better than over short time scales\nand low-level features (e.g. pixel intensities).\nIn the future, we would like to extend our approach to learn fully online by incorporating approx-\nimations to the EM algorithm that are able to run in real-time [4] as well as performing threshold\nselection on the \ufb02y. Another challenge is incorporating con\ufb01dences from the visual detector output\nin the learning of the voice model.\n\n\fReferences\n[1] M. S. Bartlett, G. Littlewort, C. 
Lainscsek, I. Fasel, and J. Movellan. Recognition of facial actions in\n\nspontaneous expressions. Journal of Multimedia, 2006.\n\n[2] M. S. Bartlett, G. C. Littlewort, M. G. Frank, C. Lainscsek, I. R. Fasel, and J. R. Movellan. Automatic\n\nrecognition of facial actions in spontaneous expressions. Journal of Multimedia, 1(6):22, 2006.\n\n[3] J. Bridle and M. Brown. An experimental automatic word recognition system. JSRU Report, 1003, 1974.\n\n[4] A. Declercq and J. Piater. Online learning of Gaussian mixture models - a two-level approach. In Intl.\n\nConf. Comp. Vis., Imaging and Comp. Graph. Theory and Applications, pages 605\u2013611, 2008.\n\n[5] A. DeMaris. A tutorial in logistic regression. Journal of Marriage and the Family, pages 956\u2013968, 1995.\n\n[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM\n\nalgorithm. Journal of the Royal Statistical Society, 39(Series B):1\u201338, 1977.\n\n[7] P. Ekman, W. Friesen, and J. Hager. Facial Action Coding System (FACS): Manual and Investigator\u2019s\n\nGuide. A Human Face, Salt Lake City, UT, 2002.\n\n[8] J. Fisher and T. Darrell. Speaker association with signal-level audiovisual fusion. IEEE Transactions on\n\nMultimedia, 6(3):406\u2013413, 2004.\n\n[9] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: an overview with appli-\n\ncation to learning methods. Neural Computation, 16(12):2639\u20132664, 2004.\n\n[10] J. Hershey and J. Movellan. Audio-vision: Using audio-visual synchrony to locate sounds. Advances in\n\nNeural Information Processing Systems, 12:813\u2013819, 2000.\n\n[11] D. Hosmer and S. Lemeshow. Applied logistic regression. Wiley-Interscience, 2000.\n[12] E. Kidron, Y. Schechner, and M. Elad. Pixels that sound. In IEEE Computer Society Conference\n\non Computer Vision and Pattern Recognition, volume 1, page 88. Citeseer, 2005.\n\n[13] G. Littlewort, M. Bartlett, and K. Lee. 
Faces of pain: automated measurement of spontaneous facial\nexpressions of genuine and posed pain. In Proceedings of the 9th international conference on Multimodal\ninterfaces, pages 15\u201321. ACM, 2007.\n\n[14] B. Logan. Mel frequency cepstral coef\ufb01cients for music modeling. In International Symposium on Music\n\nInformation Retrieval, volume 28, 2000.\n\n[15] J. Movellan and P. Mineiro. Robust sensor fusion: Analysis and application to audio visual speech\n\nrecognition. Machine Learning, 32(2):85\u2013100, 1998.\n\n[16] A. Noulas and B. Krose. On-line multi-modal speaker diarization. In Proceedings of the 9th international\n\nconference on Multimodal interfaces, pages 350\u2013357. ACM, 2007.\n\n[17] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy. CUAVE: A new audio-visual database for multi-\nmodal human-computer interface research. In IEEE International Conference on Acoustics,\nSpeech and Signal Processing, volume 2. Citeseer, 2002.\n\n[18] J. Rehg, K. Murphy, and P. Fieguth. Vision-based speaker detection using Bayesian networks. In Proceed-\nings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2,\npages 110\u2013116, 1999.\n\n[19] D. Reynolds. Experimental evaluation of features for robust speaker identi\ufb01cation. IEEE Transactions on\n\nSpeech and Audio Processing, 2(4):639\u2013643, 1994.\n\n[20] D. Reynolds, T. Quatieri, and R. Dunn. Speaker veri\ufb01cation using adapted Gaussian mixture models.\n\nDigital Signal Processing, 10(1-3):19\u201341, 2000.\n\n[21] M. Slaney and M. Covell. Facesync: A linear operator for measuring synchronization of video facial\nimages and audio tracks. Advances in Neural Information Processing Systems, pages 814\u2013820, 2001.\n\n[22] E. Vural, M. Cetin, A. Ercil, G. Littlewort, M. Bartlett, and J. Movellan. Drowsy driver detection through\n\nfacial movement analysis. Lecture Notes in Computer Science, 4796:6\u201318, 2007.\n\n[23] J. 
Whitehill, M. Bartlett, and J. Movellan. Automatic facial expression recognition for intelligent tutoring\n\nsystems. In Computer Vision and Pattern Recognition, 2008.\n\n[24] J. Whitehill, M. S. Bartlett, and J. R. Movellan. Measuring the dif\ufb01culty of a lecture using automatic\n\nfacial expression recognition. In Intelligent Tutoring Systems, 2008.\n\n\f", "award": [], "sourceid": 1093, "authors": [{"given_name": "Javier", "family_name": "Movellan", "institution": null}, {"given_name": "Paul", "family_name": "Ruvolo", "institution": null}]}