{"title": "Modeling Conversational Dynamics as a Mixed-Memory Markov Process", "book": "Advances in Neural Information Processing Systems", "page_first": 281, "page_last": 288, "abstract": null, "full_text": "Modeling Conversational Dynamics as a \n\nMixed-Memory Markov Process \n\nTanzeem Choudhury \n\nIntel Research \n\ntanzeem.choudhury@intel.com \n\nSumit Basu \n\nMicrosoft Research \n\nsumitb@microsoft.com \n\nAbstract \n\ninfluences \n\nIn this work, we quantitatively investigate the ways in which a \ngiven person \nthe joint turn-taking behavior in a \nconversation. After collecting an auditory database of social \ninteractions among a group of twenty-three people via wearable \nsensors (66 hours of data each over two weeks), we apply speech \nand conversation detection methods to the auditory streams. These \nmethods automatically locate the conversations, determine their \nparticipants, and mark which participant was speaking when. We \nthen model the joint turn-taking behavior as a Mixed-Memory \nMarkov Model [1] that combines the statistics of the individual \nsubjects' self-transitions and the partners ' cross-transitions. The \nmixture parameters in this model describe how much each person's \nindividual behavior contributes to the joint turn-taking behavior of \nthe pair. By estimating these parameters, we thus estimate how \nmuch influence each participant has in determining the joint turn(cid:173)\nthis measure correlates \ntaking behavior. We \nsignificantly with betweenness centrality [2], an \nindependent \nmeasure of an individual's importance in a social network. This \nresult suggests that our estimate of conversational influence is \npredictive of social influence. \n\nshow how \n\n1 \n\nIntroduction \n\nPeople's relationships are largely determined by their social interactions, and the \nnature of their conversations plays a large part in defining those interactions. 
There is a long history of work in the social sciences aimed at understanding the interactions between individuals and the influences they have on each other's behavior. However, existing studies of social network interactions have either been restricted to online communities, where unambiguous measurements about how people interact can be obtained, or have been forced to rely on questionnaires or diaries to get data on face-to-face interactions. Survey-based methods are error prone and impractical to scale up. Studies show that self-reports correspond poorly to communication behavior as recorded by independent observers [3].

In contrast, we have used wearable sensors and recent advances in speech processing techniques to automatically gather information about conversations: when they occurred, who was involved, and who was speaking when. Our goal was then to see if we could examine the influence a given speaker had on the turn-taking behavior of her conversational partners. Specifically, we wanted to see if we could better explain the turn-taking transitions observed in a given conversation between subjects i and j by combining the transitions typical to i and those typical to j. We could then interpret the contribution from i as her influence on the joint turn-taking behavior.

In this paper, we first describe how we extract speech and conversation information from the raw sensor data, and how we can use this to estimate the underlying social network. We then detail how we use a Mixed-Memory Markov Model to combine the individuals' statistics. Finally, we show the performance of our method on our collected data and how it correlates well with other metrics of social influence.
\n\n2 Sensing and Modeling Face-to-face Communication Networks \nAlthough people heavily rely on email, telephone, and other virtual means of \ncommunication, high complexity information is primarily exchanged through face-to(cid:173)\nface interaction [4]. Prior work on sensing face-to-face networks have been based on \nproximity measures [5],[6], a weak approximation of the actual communication network. \nOur focus is to model the network based on conversations that take place within a \ncommunity. To do this, we need to gather data from real-world interactions. \n\nWe thus used an experiment conducted at MIT [7] in which 23 people agreed to wear the \nsociometer, a wearable data acquisition board [7],[8]. The device stored audio \ninformation from a single microphone at 8 KHz. During the experiment the users wore \nthe device both indoors and outdoors for six hours a day for 11 days. The participants \nwere a mix of students, facuity, and administrative support staff who were distributed \nacross different floors of a laboratory building and across different research groups. \n\n3 Speech and Conversation Detection \n\nGiven the set of auditory streams of each subject, we now have the problem of \ndetecting who is speaking when and to whom they are speaking. We break this \nproblem into two parts: voicing/speech detection and conversation detection. \n\n3.1 Voicing and Speech Detection \n\nTo detect the speech, we use the linked-HMM model for VOlClllg and speech \ndetection presented in [9]. This structure models the speech as two layers (see \nFigure 1); the lower level hidden state represents whether the current frame of audio \nis voiced or unvoiced (i.e., whether the audio in the frame has a harmonic structure, \nas in a vowel), while the second level represents whether we are in a speech or non(cid:173)\nspeech segment. 
The principle behind the model is that while there are many voiced sounds in our environment (car horns, tones, computer sounds, etc.), the dynamics of voiced/unvoiced transitions provide a unique signature for human speech; the higher level is able to capture these dynamics since the lower level's transitions are dependent on this variable.

Figure 1: Graphical model for the voicing and speech detector, with a speech layer (S[t] = {0,1}), a voicing layer (V[t] = {0,1}), and an observation layer (3 features).

To apply this model to data, the 8 kHz audio is split into 256-sample frames (32 milliseconds) with a 128-sample overlap. Three features are then computed: the non-initial maximum of the noisy autocorrelation, the number of autocorrelation peaks, and the spectral entropy. The features were modeled as a Gaussian with diagonal covariance. The model was then trained on 8000 frames of fully labeled data. We chose this model because of its robustness to noise and distance from the microphone: even at 20 feet away, more than 90% of voiced frames were detected with negligible false alarms (see [9]).

The results from this model are the binary sequences v[t] and s[t], signifying whether the frame is voiced and whether it is in a speech segment for all frames of the audio.

3.2 Conversation Detection

Once the voicing and speech segments are identified, we are still left with the problem of determining who was talking with whom and when. To approach this, we use the method of conversation detection described in [10]. The basic idea is simple: since the speech detection method described above is robust to distance, the voicing segments v[t] of all the participants in the conversation will be picked up by the detector in all of the streams (this is referred to as a \"mixed stream\" in [10]). We can then examine the mutual information of the binary voicing estimates between each person as a matching measure.
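The three per-frame features described in Section 3.1 are all simple functions of a short audio frame. The sketch below shows one plausible way to compute them; numpy is assumed, and details such as the lag cutoff for the non-initial autocorrelation maximum are illustrative guesses rather than the paper's exact settings:

```python
import numpy as np

def frame_features(frame):
    """Compute three voicing features for one audio frame: the non-initial
    maximum of the normalized autocorrelation, the number of autocorrelation
    peaks, and the spectral entropy of the frame."""
    frame = frame - np.mean(frame)
    # Normalized autocorrelation for lags 0 .. len-1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-10)
    # Non-initial maximum: skip the lags near the trivial lag-0 peak
    # (the cutoff of 20 lags is an illustrative choice).
    noninit_max = float(np.max(ac[20:])) if len(ac) > 20 else 0.0
    # Count local maxima of the autocorrelation.
    peaks = int(np.sum((ac[1:-1] > ac[:-2]) & (ac[1:-1] > ac[2:])))
    # Spectral entropy: entropy of the normalized power spectrum.
    spec = np.abs(np.fft.rfft(frame)) ** 2
    p = spec / (np.sum(spec) + 1e-10)
    entropy = float(-np.sum(p * np.log(p + 1e-10)))
    return noninit_max, peaks, entropy

def frames(signal, size=256, hop=128):
    """Split audio into 256-sample frames (32 ms at 8 kHz), 128-sample hop."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]
```

A harmonic frame (e.g., a vowel-like sine) yields a high non-initial autocorrelation maximum and low spectral entropy, while a noise frame gives the opposite, which is what makes these features discriminative for voicing.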
Since both voicing streams will be nearly identical, the mutual information should peak when the two participants are either involved in a conversation or are overhearing a conversation from a nearby group. However, we have the added complication that the streams are only roughly aligned in time. Thus, we also need to consider a range of time shifts between the streams. We can express the alignment measure a[k] for an offset of k between the two voicing streams as follows:

a[k] = I(v_1[t]; v_2[t-k]) = \sum_{i,j} p(v_1[t]=i, v_2[t-k]=j) log [ p(v_1[t]=i, v_2[t-k]=j) / ( p(v_1[t]=i) p(v_2[t-k]=j) ) ]

where i and j take on values {0,1} for unvoiced and voiced states respectively. The distributions for p(v_1, v_2) and their marginals are estimated over a window of one minute (T=3750 frames). To see how well this measure performs, we examine an example pair of subjects who had one five-minute conversation over the course of half an hour. The streams are correctly aligned at k=0, and by examining the value of a[k] over a large range we can investigate its utility for conversation detection and for aligning the auditory streams (see Figure 2).

The peaks are both strong and unique to the correct alignment (k=0), implying that this is indeed a good measure for detecting conversations and aligning the audio in our setup. By choosing the optimal threshold via the ROC curve, we can achieve 100% detection with no false alarms using time windows T of one minute.

Figure 2: Values of a[k] over ranges: 1.6 seconds, 2.5 minutes, and 11 minutes.

For each minute of data in each speaker's stream, we computed a[k] for k ranging over +/- 30 seconds with T=3750 for each of the other 22 subjects in the study.
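The alignment measure above is just an empirical mutual information between two binary streams at a given shift. A minimal sketch, assuming numpy and hypothetical function names (the paper's actual implementation is not specified):

```python
import numpy as np

def alignment_measure(v1, v2, k):
    """Empirical mutual information a[k] = I(v1[t]; v2[t-k]) between two
    binary voicing streams at offset k, estimated from joint counts over
    the overlapping window."""
    n = min(len(v1), len(v2))
    ts = range(max(0, k), n + min(0, k))      # valid t with 0 <= t-k < n
    a = np.array([v1[t] for t in ts])
    b = np.array([v2[t - k] for t in ts])
    mi = 0.0
    for i in (0, 1):                          # unvoiced / voiced states
        for j in (0, 1):
            p_ij = np.mean((a == i) & (b == j))
            p_i, p_j = np.mean(a == i), np.mean(b == j)
            if p_ij > 0:
                mi += p_ij * np.log(p_ij / (p_i * p_j))
    return mi

def best_offset(v1, v2, max_shift):
    """Search a range of shifts (the paper uses +/- 30 seconds of frames)
    and return the offset with the largest alignment measure."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda k: alignment_measure(v1, v2, k))
```

When the two streams are shifted copies of the same voicing pattern, the measure peaks sharply at the true offset and stays near zero elsewhere, which is the behavior exploited for both detection and alignment.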
\nWhile we can now be confident that this will detect most of the conversations \nbetween the subjects, since the speech segments from all the participants are being \npicked up by all of their microphones (and those of others within earshot), there is \nstill the problem of determining who is speaking when. Fortunately, this is fairly \nstraightforward. Since the microphones for each subject are pre-calibrated to have \napproximately equal energy response, we can classify each voicing segment among \nthe speakers by integrating the audio energy over the segment and choosing the \nargmax over subjects. \nIt is still possible that the resulting subject does not \ncorrespond to the actua l speaker (she could simply be the one nearest to a non(cid:173)\nsubject who is speaking), we determine an overall threshold below which the \nassignment to the speaker is rejected. Both of these methods are further detailed in \n[10]. \n\nFor this work, we rejected all conversations with more than two participants or \nthose that were simply overheard by the subj ects. Finally, we tested the overall \nperformance of our method by comparing with a hand-labeling of conversation \noccurrence and length from four subjects over 2 days (48 hours of data) and found \nan 87% agreement with the hand labeling. Note that the actual performance may \nhave been better than this, as the labelers did miss some conversations. \n\n3.3 The Turn-Taking Signal S; \n\nFinally, given the location of the conversations and who is speaking when, we can \ncreate a new signal for each subject i , S;, defined over five-second blocks, which is \n1 when the subject is holding the turn and 0 otherwise. We define the holder of the \nturn as whoever has produced more speech during the five-second block. Thus, \nwithin a given conversation between subjects i and j , the turn-taking signals are \ncomplements of each other, i.e., Si = -,SJ . 
\n\nI \n\nI \n\n4 Estimating the Social Network Structure \nOnce we have detected the pairwise conversations we can identify the communication \nthat occurs within the community and map the links between individuals. The link \nstructure is calculated from the total number of conversations each subject has with \nothers: interactions with another person that account for less than 5% of the subject's \ntotal interactions are removed from the graph. To get an intuitive picture of the \ninteraction pattern within the group, we visualize the network diagram by performing \nmulti-dimensional scaling (MDS) on the geodesic distances (number of hops) between \nthe people (Figure 3). The nodes are colored according to the physical closeness of the \nsubjects' office locations. From this we see that people whose offices are in the same \ngeneral space seem to be close in the communication space as well. \n\n\fFigure 3: Estimated network of subjects \n\n5 Modeling the Influence of Turn-taking Behavior in \n\nConversations \n\nWhen we talk to other people we are influenced by their style of interaction. \nSometimes this influence is strong and sometimes insignificant - we are interested \nin finding a way to quantify this effect. We probably all know people who have a \nstrong effect on our natural interaction style when we talk to them, causing us to \nchange our style as a result. For example, consider someone who never seems to \nstop talking once it is her turn. She may end up imposing her style on us, and we \nmay consequently end up not having enough of a chance to talk, whereas in most \nother circumstances we tend to be an active and equal participant. \n\nIn our case, we can model this effect via the signals we have already gathered. Let \nus consider the influence subject} has on subj ect i. We can compute i's average \nself-transition table, peS: I S;_I) , via simple counts over all conversations for subject \ni (excluding those with i). 
Similarly, we can compute j's average cross-transition table, p(S_t^k | S_{t-1}^j), over all subjects k (excluding i) with whom j had conversations. The question now is, for a given conversation between i and j, how much does j's average cross-transition help explain p(S_t^i | S_{t-1}^i, S_{t-1}^j)?

We can formalize this contribution via the Mixed-Memory Markov Model of Saul and Jordan [1]. The basic idea of this model was to approximate a high-dimensional conditional probability table of one variable conditioned on many others as a convex combination of the pairwise conditional tables. For a general set of N interacting Markov chains in the form of a Coupled Markov Model [11], we can write this approximation as:

p(S_t^i | S_{t-1}^1, ..., S_{t-1}^N) = \sum_j \alpha_{ij} p(S_t^i | S_{t-1}^j)

For our case of a two-chain (two-person) model the transition probabilities will be the following:

p(S_t^1 | S_{t-1}^1, S_{t-1}^2) = \alpha_{11} p(S_t^1 | S_{t-1}^1) + \alpha_{12} p(S_t^k | S_{t-1}^2)

p(S_t^2 | S_{t-1}^1, S_{t-1}^2) = \alpha_{21} p(S_t^k | S_{t-1}^1) + \alpha_{22} p(S_t^2 | S_{t-1}^2)

This is very similar to the original Mixed-Memory Model, though the cross-transition tables are estimated over all other subjects k excluding the partner, as described above. Also, since the \alpha_{ij} sum to one over j, in this case \alpha_{11} = 1 - \alpha_{12}. We thus have a single parameter, \alpha_{12}, which describes the contribution of p(S_t^k | S_{t-1}^2) to explaining p(S_t^1 | S_{t-1}^1, S_{t-1}^2), i.e., the contribution of subject 2's average turn-taking behavior on her interactions with subject 1.

5.1 Learning the influence parameters

To find the \alpha_{ij} values, we would like to maximize the likelihood of the data. Since we have already estimated the relevant conditional probability tables, we can do this via constrained gradient ascent, where we ensure that \alpha_{ij} > 0 [12].
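For the two-chain case this reduces to a one-dimensional ascent over \alpha_{12}. The sketch below assumes the self- and cross-transition tables are already estimated as 2x2 numpy arrays and that the observed data are (S^1_{t-1}, S^2_{t-1}, S^1_t) triples; the function name and step sizes are illustrative, not the paper's:

```python
import numpy as np

def fit_alpha12(triples, self_table, cross_table, steps=200, lr=0.05):
    """Fit the single mixture weight a = alpha_12 in
        p(S^1_t | S^1_{t-1}, S^2_{t-1})
            = (1 - a) * self_table[S^1_{t-1}, S^1_t]
            + a * cross_table[S^2_{t-1}, S^1_t]
    by projected gradient ascent on the log-likelihood of the observed
    (s1_prev, s2_prev, s1_cur) triples; the log-likelihood is concave in a,
    so the ascent reaches the global maximum on [0, 1]."""
    a = 0.5
    for _ in range(steps):
        grad = 0.0
        for s1p, s2p, s1c in triples:
            p_self = self_table[s1p, s1c]
            p_cross = cross_table[s2p, s1c]
            mix = (1 - a) * p_self + a * p_cross
            grad += (p_cross - p_self) / mix   # d log-likelihood / d a
        # Gradient step, projected back onto the constraint 0 <= a <= 1.
        a = min(1.0, max(0.0, a + lr * grad / len(triples)))
    return a
```

When the observed transitions are well explained by the partner's cross-table, the fitted weight moves toward 1; when the self-table explains them better, it moves toward 0.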
Let us first examine how the likelihood function simplifies for the Mixed-Memory model:

L = \prod_t p(S_t^i | S_{t-1}^1, ..., S_{t-1}^N) = \prod_t \sum_j \alpha_{ij} p(S_t^i | S_{t-1}^j)

Converting this expression to log likelihood and removing terms that are not relevant to maximization over \alpha_{ij} yields:

log L = \sum_t log \sum_j \alpha_{ij} p(S_t^i | S_{t-1}^j)

Now we reparametrize for the normality constraint with \beta_{ij} = \alpha_{ij} for j != N and \alpha_{iN} = 1 - \sum_{j != N} \beta_{ij}, remove the terms not relevant to chain i, and take the derivatives:

\partial/\partial\beta_{ij} (log L) = \sum_t [ p(S_t^i | S_{t-1}^j) - p(S_t^i | S_{t-1}^N) ] / [ \sum_k \beta_{ik} p(S_t^i | S_{t-1}^k) + (1 - \sum_k \beta_{ik}) p(S_t^i | S_{t-1}^N) ]

We can show that the likelihood is concave in the \alpha_{ij}, so we are guaranteed to achieve the global maximum by climbing the gradient. More details of this formulation are given in [12],[7].

5.2 Aggregate Influence over Multiple Conversations

In order to evaluate whether this model provides additional benefit over using a given subject's self-transition statistics alone, we estimated the reduction in KL divergence from using the mixture of interactions vs. using the self-transition model. We found that by using the mixture model we were able to reduce the KL divergence between a subject's average self-transition statistics and the observed transitions by 32% on average. However, the mixture model has extra degrees of freedom, so we tested whether the better fit was statistically significant by using the F-test. The resulting p-value was less than 0.01, implying that the mixture model is a significantly better fit to the data.

In order to find a single influence parameter for each person, we took a subset of 80 conversations and aggregated all the pairwise influences each subject had on all her conversational partners. In order to compute this aggregate value, there is an additional aspect of \alpha_{ij} we need to consider. If the subject's self-transition matrix and the complement of the partner's cross-transition matrix are very similar, the influence scores are indeterminate, since for a given interaction S_t^i = ¬S_t^k; that is
, we would essentially be trying to find the best way to linearly combine two identical transition matrices. We thus weight the contribution to the aggregate influence estimate for each individual, A_i, by the relevant J-divergence (symmetrized KL divergence) for each conversational partner:

A_i = \sum_{k in partners} J( p(S_t^i | ¬S_{t-1}^i) || p(S_t^i | S_{t-1}^i) ) \alpha_{ki}

The upper panel of Figure 4 shows the aggregated influence values for the subset of subjects contained in the set of eighty conversations analyzed.

6 Link between Conversational Dynamics and Social Role

Betweenness centrality is a measure frequently used in social network analysis to characterize importance in the social network. For a given person i, it is defined as being proportional to the number of pairs of people (j,k) for which that person lies along the shortest path in the network between j and k. It is thus used to estimate how much control an individual has over the interaction of others, since it is a count of how often she is a \"gateway\" between others. People with high betweenness are often perceived as leaders [2].

We computed the betweenness centrality for the subjects from the 80 conversations using the network structure we estimated in Section 4. We then discovered an interesting and statistically significant correlation between a person's aggregate influence score and her betweenness centrality: it appears that a person's interaction style is indicative of her role within the community based on the centrality measure. Figure 4 shows the weighted influence values along with the centrality scores. Note that ID 8 (the experiment coordinator) is somewhat of an outlier; a plausible explanation for this is that during the data collection ID 8 went and talked to many of the subjects, which is not her usual behavior.
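Betweenness centrality as defined above can be computed efficiently with Brandes' shortest-path accumulation; the sketch below is for an unweighted, undirected graph given as an adjacency dict (node labels and the dict representation are assumptions, not the paper's data format):

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality for an unweighted, undirected graph
    {node: [neighbors]} via Brandes' algorithm: each node's score counts
    the shortest paths between other pairs that pass through it, with
    ties among equally short paths split fractionally."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:                                 # BFS from source s
            v = queue.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                    # dependency accumulation
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}         # undirected: halve
```

On a simple path a-b-c, only the middle node b lies on a shortest path between two others, so it alone receives a nonzero score.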
This resulted in her having artificially high centrality (based on link structure) but not high influence based on her interaction style.

We computed the statistical correlation between the influence values and the centrality scores, both including and excluding the outlier subject ID 8. The correlation excluding ID 8 was 0.90 (p-value < 0.0004, rank correlation 0.92) and including ID 8 it was 0.48 (p-value < 0.07, rank correlation 0.65). The two measures, namely influence and centrality, are highly correlated, and this correlation is statistically significant when we exclude ID 8, who was the coordinator of the project and whose centrality is likely to be artificially large.

7 Conclusion

We have developed a model for quantitatively representing the influence of a given person j's turn-taking behavior on the joint turn-taking behavior with person i. On real-world data gathered from wearable sensors, we have estimated the relevant component statistics about turn-taking behavior via robust speech processing techniques, and have shown how we can use the Mixed-Memory Markov formalism to estimate the behavioral influence. Finally, we have shown a strong correlation between a person's aggregate influence value and her betweenness centrality score. This implies that our estimate of conversational influence may be indicative of importance within the social network.

Figure 4: Aggregate influence values (top) and corresponding betweenness centrality scores (bottom).

8 References

[1] Saul, L.K. and M. Jordan. \"Mixed Memory Markov Models.\" Machine Learning, 1999. 37: pp. 75-85.

[2] Freeman, L.C., \"A Set of Measures of Centrality Based on Betweenness.\" Sociometry, 1977. 40: pp. 35-41.
\n\n[3] Bernard, H.R., et aI., \"The Problem of Informant Accuracy: the Validity of \n\nRetrospective data.\" Annual Review of Anthropology, 1984. 13: p. pp. 495-517. \n\n[4] Allen, T., Architecture and Communication Among Product Development \n\nEngineers. 1997, Sloan School of Management, MIT: Cambridge. p. pp. 1-35. \n\n[5] Want, R., et aI., \"The Active Badge Location System.\" ACM Transactions on \n\nInformation Systems, 1992.10: p. 91-102. \n\n[6] Borovoy, R. , Folk Computing: Designing Technology to Support Face-to-Face \n\nCommunity Building. Doctoral Thesis in Media Arts and Sciences. MIT, 2001. \n\n[7] Choudhury, T. , Sensing and Modeling Human Networks, Doctoral Thesis in \n\nMedia Arts and Sciences. MIT. Cambridge, MA, 2003. \n\n[8] Gerasimov, V., T. Selker, and W. Bender, Sensing and Effecting Environment \n\nwith Extremity Computing Devices. Motorola Offspring, 2002. 1(1). \n\n[9] Basu, S. \"A Two-Layer Model for Voicing and Speech Detection.\" in Int 'l \n\nConference on Acoustics, Speech, and Signal Processing (ICASSP). 2003. \n\n[10]Basu, S., Conversation Scene Analysis. Doctoral Thesis \nEngineering and Computer Science. MIT. Cambridge, MA 2002. \n\nin Electrical \n\n[11]Brand, M., \"Coupled Hidden Markov Models for Modeling Interacting \n\nProcesses.\" MIT Media Lab Vision & Modeling Tech Report, 1996. \n\n[12]Basu, S., T. Choudhury, and B. Clarkson. \"Learning Human Interactions with \nthe Influence Model.\" MIT Media Lab Vision and Modeling Tech Report #539. \nJune, 2001. \n\n\f", "award": [], "sourceid": 2624, "authors": [{"given_name": "Tanzeem", "family_name": "Choudhury", "institution": null}, {"given_name": "Sumit", "family_name": "Basu", "institution": null}]}