{"title": "The Infinite Factorial Hidden Markov Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1697, "page_last": 1704, "abstract": "We introduces a new probability distribution over a potentially infinite number of binary Markov chains which we call the Markov Indian buffet process. This process extends the IBP to allow temporal dependencies in the hidden variables. We use this stochastic process to build a nonparametric extension of the factorial hidden Markov model. After working out an inference scheme which combines slice sampling and dynamic programming we demonstrate how the infinite factorial hidden Markov model can be used for blind source separation.", "full_text": "The In\ufb01nite Factorial Hidden Markov Model\n\nJurgen Van Gael\u2217\n\nDepartment of Engineering\nUniversity of Cambridge, UK\n\njv279@cam.ac.uk\n\nYee Whye Teh\nGatsby Unit\n\nUniversity College London, UK\nywteh@gatsby.ucl.ac.uk\n\nZoubin Ghahramani\n\nDepartment of Engineering\nUniversity of Cambridge, UK\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nWe introduce a new probability distribution over a potentially in\ufb01nite number of\nbinary Markov chains which we call the Markov Indian buffet process. This pro-\ncess extends the IBP to allow temporal dependencies in the hidden variables. We\nuse this stochastic process to build a nonparametric extension of the factorial hid-\nden Markov model. After constructing an inference scheme which combines slice\nsampling and dynamic programming we demonstrate how the in\ufb01nite factorial\nhidden Markov model can be used for blind source separation.\n\n1 Introduction\n\nWhen modeling discrete time series data, the hidden Markov model [1] (HMM) is one of the most\nwidely used and successful tools. 
The HMM defines a probability distribution over observations y_1, y_2, ..., y_T using the following generative model: it assumes there is a hidden Markov chain s_1, s_2, ..., s_T with s_t \in \{1, ..., K\} whose dynamics are governed by a K x K stochastic transition matrix \pi. At each timestep t, the Markov chain generates an output y_t using some likelihood model F parametrized by a state-dependent parameter \theta_{s_t}. We can write the probability distribution induced by the HMM as follows^1

p(y_{1:T}, s_{1:T}) = \prod_{t=1}^{T} p(s_t | s_{t-1}) p(y_t | s_t) = \prod_{t=1}^{T} \pi_{s_{t-1}, s_t} F(y_t; \theta_{s_t}).   (1)

Figure 1 shows the graphical model for the HMM.

One shortcoming of the hidden Markov model is the limited representational power of the latent variables. One way to look at the distribution defined by the HMM is to write down the marginal distribution of y_t given the previous latent state s_{t-1}

p(y_t | s_{t-1}) = \sum_{s_t} p(s_t | s_{t-1}) p(y_t | s_t) = \sum_{s_t} \pi_{s_{t-1}, s_t} F(y_t; \theta_{s_t}).   (2)

Equation (2) illustrates that the observations are generated from a dynamic mixture model. The factorial hidden Markov model (FHMM), developed in [2], addresses the limited representational power of the hidden Markov model. The FHMM extends the HMM by representing the hidden state in a factored form. This way, information from the past is propagated in a distributed manner through a set of parallel Markov chains.

* http://mlg.eng.cam.ac.uk/jurgen
^1 To make the notation more convenient, we assume w.l.o.g. that for all our models, all latent chains start in a dummy state that is in the 0 state. E.g. for the HMM s_0 = 0, for the FHMM s_0^{(m)} = 0 for all m.

Figure 1: The Hidden Markov Model
Figure 2: The Factorial Hidden Markov Model

The parallel chains can be viewed as latent features which evolve over time according to Markov dynamics. Formally, the FHMM defines a probability distribution over observations y_1, y_2, ..., y_T as follows: M latent chains s^{(1)}, s^{(2)}, ..., s^{(M)} evolve according to Markov dynamics and at each timestep t, the Markov chains generate an output y_t using some likelihood model F parameterized by a joint state-dependent parameter \theta_{s_t^{(1:M)}}. The graphical model in figure 2 shows how the FHMM is a special case of a dynamic Bayesian network. The FHMM has been successfully applied in vision [3], audio processing [4] and natural language processing [5]. Unfortunately, the dimensionality M of our factorial representation, or equivalently the number of parallel Markov chains, is a new free parameter for the FHMM which we would prefer to learn from data rather than specify beforehand.

Recently, [6] introduced the basic building block for nonparametric Bayesian factor models, called the Indian Buffet Process (IBP). The IBP defines a distribution over infinite binary matrices Z where element z_{nk} denotes whether datapoint n has feature k or not. The IBP can be combined with distributions over real numbers or integers to make the features useful for practical problems.

In this work, we derive the basic building block for nonparametric Bayesian factor models for time series, which we call the Markov Indian Buffet Process (mIBP). Using this distribution we build a nonparametric extension of the FHMM which we call the Infinite Factorial Hidden Markov Model (iFHMM). This construction allows us to learn a factorial representation for time series.

In the next section, we develop the novel and generic nonparametric mIBP distribution. Section 3 describes how to use the mIBP to build the iFHMM, which in turn can be used to perform independent component analysis on time series data.
Section 4 shows results of our application of the iFHMM to a blind source separation problem. Finally, we conclude with a discussion in section 5.

2 The Markov Indian Buffet Process

Similar to the IBP, we define a distribution over binary matrices to model whether a feature at time t is on or off. In this representation, rows correspond to timesteps and columns to features or Markov chains. We want the distribution over matrices to satisfy the following two properties: (1) the potential number of columns (representing latent features) should be able to be arbitrarily large; (2) the rows (representing timesteps) should evolve according to a Markov process.

Below, we formally derive the mIBP distribution in two steps: first, we describe a distribution over binary matrices with a finite number of columns. We choose the hyperparameters carefully so we can easily integrate out the parameters of the model. In a second phase, we take the limit as the number of features goes to infinity, in a manner analogous to [7]'s derivation of infinite mixtures.

2.1 A finite model

Let S represent a binary matrix with T rows (datapoints) and M columns (features). s_{tm} represents the hidden state at time t for Markov chain m. Each Markov chain evolves according to the transition matrix

W^{(m)} = \begin{pmatrix} 1 - a_m & a_m \\ 1 - b_m & b_m \end{pmatrix},   (3)

where W^{(m)}_{ij} = p(s_{t+1,m} = j | s_{tm} = i). We give the parameters of W^{(m)} the distributions a_m ~ Beta(\alpha/M, 1) and b_m ~ Beta(\gamma, \delta). Each chain starts with a dummy zero state s_{0m} = 0. The hidden state sequence for chain m is generated by sampling T steps from a Markov chain with transition matrix W^{(m)}.
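For concreteness, the finite model just described can be simulated directly. The sketch below (the function names are ours, not from the paper) samples a T x M binary matrix S under the chain dynamics of W^{(m)} and tallies the per-chain transition counts that the marginalization below relies on:

```python
import numpy as np

def sample_finite_mibp(T, M, alpha=1.0, gamma=1.0, delta=1.0, rng=None):
    """Sample a T x M binary state matrix S from the finite model:
    a_m ~ Beta(alpha/M, 1), b_m ~ Beta(gamma, delta), s_0m = 0,
    s_tm ~ Bernoulli(a_m) if s_{t-1,m} = 0, else Bernoulli(b_m)."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.beta(alpha / M, 1.0, size=M)   # 0 -> 1 transition probability per chain
    b = rng.beta(gamma, delta, size=M)     # 1 -> 1 transition probability per chain
    S = np.zeros((T, M), dtype=int)
    prev = np.zeros(M, dtype=int)          # dummy zero state s_0m = 0
    for t in range(T):
        p_on = np.where(prev == 0, a, b)   # row of W^(m) selected by previous state
        S[t] = (rng.random(M) < p_on).astype(int)
        prev = S[t]
    return S, a, b

def transition_counts(S):
    """Return c00, c01, c10, c11 per chain, including the dummy 0 -> s_1m step."""
    padded = np.vstack([np.zeros(S.shape[1], dtype=int), S])
    prev, curr = padded[:-1], padded[1:]
    c00 = ((prev == 0) & (curr == 0)).sum(axis=0)
    c01 = ((prev == 0) & (curr == 1)).sum(axis=0)
    c10 = ((prev == 1) & (curr == 0)).sum(axis=0)
    c11 = ((prev == 1) & (curr == 1)).sum(axis=0)
    return c00, c01, c10, c11
```

Since every chain makes exactly T transitions (counting the step out of the dummy state), the four counts sum to T for each chain, which is a cheap sanity check on any implementation.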
Summarizing, the generative specification for this process is

\forall m \in \{1, 2, ..., M\}:  a_m ~ Beta(\alpha/M, 1),  b_m ~ Beta(\gamma, \delta),  s_{0m} = 0,  s_{tm} ~ Bernoulli(a_m^{1 - s_{t-1,m}} b_m^{s_{t-1,m}}).   (4)

Next, we evaluate the probability of the state matrix S with the transition matrix parameters W^{(m)} marginalized out. We introduce the following notation: let c_m^{00}, c_m^{01}, c_m^{10}, c_m^{11} be the number of 0 -> 0, 0 -> 1, 1 -> 0 and 1 -> 1 transitions, respectively, in binary chain m (including the transition from the dummy state to the first state). We can then write

p(S | a, b) = \prod_{m=1}^{M} (1 - a_m)^{c_m^{00}} a_m^{c_m^{01}} (1 - b_m)^{c_m^{10}} b_m^{c_m^{11}}.   (5)

We integrate out a and b with respect to the conjugate priors defined in equation (4) and find

p(S | \alpha, \gamma, \delta) = \prod_{m=1}^{M} \frac{\frac{\alpha}{M} \Gamma(\frac{\alpha}{M} + c_m^{01}) \Gamma(c_m^{00} + 1) \Gamma(\gamma + \delta) \Gamma(\delta + c_m^{10}) \Gamma(\gamma + c_m^{11})}{\Gamma(\frac{\alpha}{M} + c_m^{00} + c_m^{01} + 1) \Gamma(\gamma) \Gamma(\delta) \Gamma(\gamma + \delta + c_m^{10} + c_m^{11})},   (6)

where \Gamma(x) is the Gamma function.

2.2 Taking the infinite limit

Analogous to the IBP, we compute the limit for M -> \infty of the finite model in equation (6). The probability of a single matrix in the limit as M -> \infty is zero. This is not a problem since we are only interested in the probability of a whole class of matrices, namely those matrices that can be transformed into each other through column permutations. In other words, our factorial model is exchangeable in the columns as we don't care about the ordering of the features.
Hence, we compute the infinite limit for left-ordered form (lof)-equivalence classes [6].

The left-ordered form of a binary matrix S can be defined as follows: we interpret one column of length T as encoding a binary number: column m encodes the number 2^{T-1} s_{1m} + 2^{T-2} s_{2m} + ... + s_{Tm}. We call the number which a feature encodes the history of the column. Then, we denote with M_h the number of columns in the matrix S that have the same history. We say a matrix is a lof-matrix if its columns are sorted in decreasing history values. Let S be a lof-matrix, then we denote with [S] the set of all matrices that can be transformed into S using only column permutations; we call [S] the lof-equivalence class. One can check that the number of elements in the lof-equivalence class of S is equal to

\frac{M!}{\prod_{h=0}^{2^T - 1} M_h!}.   (7)

We thus find the probability of the equivalence class of S to be

p([S]) = \sum_{S \in [S]} p(S | \alpha, \gamma, \delta) = \frac{M!}{\prod_{h=0}^{2^T - 1} M_h!} \prod_{m=1}^{M} \frac{\frac{\alpha}{M} \Gamma(\frac{\alpha}{M} + c_m^{01}) \Gamma(c_m^{00} + 1) \Gamma(\gamma + \delta) \Gamma(\delta + c_m^{10}) \Gamma(\gamma + c_m^{11})}{\Gamma(\frac{\alpha}{M} + c_m^{00} + c_m^{01} + 1) \Gamma(\gamma) \Gamma(\delta) \Gamma(\gamma + \delta + c_m^{10} + c_m^{11})}.   (8)

This form allows us to compute a meaningful limit as M -> \infty. A writeup on the technical details of this computation can be found on the author's website. The end result has the following form

\lim_{M \to \infty} p([S]) = \frac{\alpha^{M_+}}{\prod_{h=0}^{2^T - 1} M_h!} \exp\{-\alpha H_T\} \prod_{m=1}^{M_+} \frac{(c_m^{01} - 1)! \, c_m^{00}!}{(c_m^{00} + c_m^{01})!} \frac{\Gamma(\gamma + \delta) \Gamma(\delta + c_m^{10}) \Gamma(\gamma + c_m^{11})}{\Gamma(\gamma) \Gamma(\delta) \Gamma(\gamma + \delta + c_m^{10} + c_m^{11})},   (9)

where H_t denotes the t'th Harmonic number and M_+ denotes the number of Markov chains that switch on at least once between 0 and T, i.e.
M_+ is the effective dimension of our model.

2.3 Properties of the distribution

First of all, it is interesting to note from equation (9) that our model is exchangeable in the columns and Markov exchangeable^2 in the rows.

Next, we derive the distribution in equation (9) through a stochastic process that is analogous to the Indian Buffet Process but slightly more complicated for the actors involved. In this stochastic process, T customers enter an Indian restaurant with an infinitely long buffet of dishes organized in a line. The first customer enters the restaurant and takes a serving from each dish, starting at the left of the buffet and stopping after a Poisson(\alpha) number of dishes as his plate becomes overburdened. A waiter stands near the buffet and takes notes as to how many people have eaten which dishes. The t'th customer enters the restaurant and starts at the left of the buffet. At dish m, he looks at the customer in front of him to see whether he has served himself that dish.

- If so, he asks the waiter how many people have previously served themselves dish m when the person in front of them did (the waiter replies with the number c_m^{11}) and how many people didn't serve themselves dish m when the person in front of them did (the waiter replies with the number c_m^{10}). The customer then serves himself dish m with probability (\gamma + c_m^{11}) / (\gamma + \delta + c_m^{10} + c_m^{11}).
- Otherwise, he asks the waiter how many people have previously served themselves dish m when the person in front of them did not (the waiter replies with the number c_m^{01}) and how many people didn't serve themselves dish m when the person in front of them did not either (the waiter replies with the number c_m^{00}). The customer then serves himself dish m with probability c_m^{01} / (c_m^{00} + c_m^{01} + 1).

The customer then moves on to the next dish and does exactly the same. After the customer has passed all dishes people have previously served themselves from, he tries a Poisson(\alpha/t) number of new dishes. If we denote with M_1^{(t)} the number of new dishes tried by the t'th customer, the probability of any particular matrix being produced by this process is

p(S) = \frac{\alpha^{M_+}}{\prod_{t=1}^{T} M_1^{(t)}!} \exp\{-\alpha H_T\} \prod_{m=1}^{M_+} \frac{(c_m^{01} - 1)! \, c_m^{00}!}{(c_m^{00} + c_m^{01})!} \frac{\Gamma(\gamma + \delta) \Gamma(\delta + c_m^{10}) \Gamma(\gamma + c_m^{11})}{\Gamma(\gamma) \Gamma(\delta) \Gamma(\gamma + \delta + c_m^{10} + c_m^{11})}.   (10)

We can recover equation (9) by summing over all possible matrices that can be generated using the Markov Indian Buffet process that are in the same lof-equivalence class. It is straightforward to check that there are exactly \prod_{t=1}^{T} M_1^{(t)}! / \prod_{h=0}^{2^T - 1} M_h! of these. Multiplying this by equation (10) we recover equation (9). This construction shows that the effective dimension of the model (M_+) follows a Poisson(\alpha H_T) distribution.

2.4 A stick breaking representation

Although the representation above is convenient for theoretical analysis, it is not very practical for inference. Interestingly, we can adapt the stick breaking construction for the IBP [8] to the mIBP. This will be very important for the iFHMM as it will allow us to use a combination of slice sampling and dynamic programming to do inference.

The first step in the stick breaking construction is to find the distribution of a_{(1)} > a_{(2)} > ..., the order statistics of the parameters a.
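The buffet construction of section 2.3 translates directly into a forward simulator. The sketch below (our own helper; the serving probabilities are the posterior predictives implied by equations (4)-(6)) grows the dish list as customers arrive and returns the resulting T x M_+ binary matrix:

```python
import numpy as np

def sample_mibp_culinary(T, alpha=2.0, gamma=1.0, delta=1.0, rng=None):
    """Simulate the Markov Indian buffet process; returns a T x M+ binary matrix."""
    rng = np.random.default_rng() if rng is None else rng
    rows = []  # one binary serving list per customer
    c = []     # per-dish transition counts [c00, c01, c10, c11]
    for t in range(1, T + 1):
        prev = rows[-1] if rows else [0] * len(c)
        row = []
        for m, (c00, c01, c10, c11) in enumerate(c):
            if prev[m] == 1:   # person in front took dish m: posterior predictive of b_m
                p = (gamma + c11) / (gamma + delta + c10 + c11)
            else:              # person in front skipped dish m: predictive of a_m (M -> inf)
                p = c01 / (c00 + c01 + 1)
            row.append(int(rng.random() < p))
        # update transition counts for the existing dishes
        for m, s in enumerate(row):
            c[m][2 * prev[m] + s] += 1   # index 0..3 maps to c00, c01, c10, c11
        # try Poisson(alpha / t) new dishes; each starts with one 0 -> 1 transition
        for _ in range(rng.poisson(alpha / t)):
            c.append([0, 1, 0, 0])
            row.append(1)
        rows.append(row)
    S = np.zeros((T, len(c)), dtype=int)
    for t, row in enumerate(rows):
        S[t, :len(row)] = row
    return S
```

Averaging the number of columns over many runs gives an empirical check of the Poisson(\alpha H_T) distribution of M_+ stated at the end of section 2.3.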
Since the distributions on the variables a_m in our model are identical to the distribution of the feature parameters in the IBP model, we can use the result in [8] that these variables have the following distribution

a_{(1)} ~ Beta(\alpha, 1),   (11)
p(a_{(m)} | a_{(m-1)}) = \alpha a_{(m-1)}^{-\alpha} a_{(m)}^{\alpha - 1} I(0 \le a_{(m)} \le a_{(m-1)}).   (12)

The variables b_m are all independent draws from a Beta(\gamma, \delta) distribution which is independent of M. Hence if we denote with b_{(m)} the b variable corresponding to the m'th largest a value (in other words: the b value corresponding to a_{(m)}), then it follows that b_{(m)} ~ Beta(\gamma, \delta).

^2 A sequence is Markov exchangeable if its distribution is invariant under permutations of the transitions.

Figure 3: The Infinite Factorial Hidden Markov Model

3 The Infinite Factorial Hidden Markov Model

In this section, we explain how to use the mIBP as a building block in a full blown probabilistic model. The mIBP provides us with a matrix S which we interpret as an arbitrarily large set of parallel Markov chains. First we augment our binary representation with a more expressive component which can describe feature specific properties. We do this by introducing a base distribution H from which we sample a parameter \theta_m ~ H for each Markov chain. This is a rather flexible setup as the base distribution can introduce a parameter for every chain and every timestep, which we will illustrate in section 3.1.

Now that we have a model with a more expressive latent structure, we want to add a likelihood model F which describes the distribution over the observations conditional on the latent structure. Formally, F(y_t | \theta, s_{t,\cdot}) describes the probability of generating y_t given the model parameters \theta and the current latent feature state s_{t,\cdot}.
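Returning briefly to the stick breaking representation of section 2.4: the ordered weights a_{(1)} > a_{(2)} > ... of equations (11)-(12) can be realized by multiplying independent Beta(\alpha, 1) draws, since a_{(m)} = a_{(m-1)} v_m with v_m ~ Beta(\alpha, 1) has exactly the conditional density in equation (12). A brief sketch (our own function name):

```python
import numpy as np

def stick_breaking_a(num, alpha=2.0, rng=None):
    """Sample the decreasing order statistics a_(1) > a_(2) > ... of the mIBP:
    a_(m) is the running product of IID Beta(alpha, 1) draws."""
    rng = np.random.default_rng() if rng is None else rng
    return np.cumprod(rng.beta(alpha, 1.0, size=num))
```

The decreasing sequence is what the slice sampler of section 3.2 truncates: only the finitely many a_{(m)} above the slice level need to be represented.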
We note that there are two important conditions which the likelihood must satisfy in order for the limit M -> \infty to be valid: (1) the likelihood must be invariant to permutations of the features; (2) the likelihood cannot depend on \theta_m if s_{tm} = 0. Figure 3 shows the graphical model for our construction, which we call the Infinite Factorial Hidden Markov Model (iFHMM). In the following section, we describe one particular choice of base distribution and likelihood model which performs Independent Component Analysis on time series.

3.1 The Independent Component Analysis iFHMM

Independent Component Analysis [9] (ICA) means different things to different people. Originally invented as an algorithm to unmix a signal into a set of independent signals, it will be more insightful for our purpose to think of ICA in terms of the probabilistic model which we describe below. As we explain in detail in section 4, we are interested in ICA to solve the blind source separation problem. Assume that M signals are represented through the vectors x_m; grouping them we can represent the signals using the matrix X = [x_1 x_2 ... x_M]. Next, we linearly combine the signals using a mixing matrix W to generate the observed signal Y = XW. Additionally, we will assume IID Normal(0, \sigma_Y^2) noise added: Y = XW + \epsilon.

A variety of fast algorithms exist which unmix the observations Y and recover the signal X. However, crucial to these algorithms is that the number of signals is known in advance. [10] used the IBP to design the Infinite Independent Component Analysis (iICA) model which learns an appropriate number of signals from exchangeable data. Our ICA iFHMM model extends the iICA for time series.

The ICA iFHMM generative model can be described as follows: we sample S ~ mIBP and pointwise multiply (denoted by \odot) it with a signal matrix X.
Each entry in X is an IID sample from a Laplace(0, 1) distribution. One could choose many other distributions for X, but since in section 4 we will model speech data, which is known to be heavy tailed, the Laplace distribution is a convenient choice. Speakers will be speaking infrequently, so pointwise multiplying a heavy-tailed distribution with a sparse binary matrix achieves our goal of producing a sparse heavy-tailed distribution. Next, we introduce a mixing matrix W which has a row for each signal in S \odot X and a column for each observed dimension in Y. The entries for W are sampled IID from a Normal(0, \sigma_W^2) distribution. Finally, we combine the signal and mixing matrices as in the finite case to form the observation matrix Y: Y = (S \odot X)W + \epsilon, where \epsilon is Normal(0, \sigma_Y^2) IID noise for each element. In terms of the general iFHMM model defined in the previous section, the base distribution H is a joint distribution over columns of X and rows of W. The likelihood F performs the pointwise multiplication, mixes the signals and adds the noise. It can be checked that our likelihood satisfies the two technical conditions for proper iFHMM likelihoods described in section 3.

3.2 Inference

Inference for nonparametric models requires special treatment as the potentially unbounded dimensionality of the model makes it hard to use exact inference schemes. Traditionally, in nonparametric factor models inference is done using Gibbs sampling, sometimes augmented with Metropolis-Hastings steps to improve performance. However, it is commonly known that naive Gibbs sampling in a time series model is notoriously slow due to potentially strong couplings between successive time steps [11].
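Before turning to the sampler, the ICA iFHMM generative model of section 3.1 can be sketched end to end. Given a binary chain matrix S (e.g. sampled from the mIBP), the following illustrative code (names are ours) forms Y = (S \odot X)W + \epsilon:

```python
import numpy as np

def ica_ifhmm_generate(S, D, sigma_W=1.0, sigma_Y=0.3, rng=None):
    """Given a T x M binary chain matrix S, generate observations via the
    ICA iFHMM likelihood: Y = (S * X) W + eps, with X IID Laplace(0, 1),
    W IID Normal(0, sigma_W^2), and IID Normal(0, sigma_Y^2) noise."""
    rng = np.random.default_rng() if rng is None else rng
    T, M = S.shape
    X = rng.laplace(0.0, 1.0, size=(T, M))      # heavy-tailed source signals
    W = rng.normal(0.0, sigma_W, size=(M, D))   # mixing: M signals -> D observed dims
    eps = rng.normal(0.0, sigma_Y, size=(T, D))
    Y = (S * X) @ W + eps                       # pointwise gate, then linear mix
    return Y, X, W
```

Note how the gating by S makes the likelihood independent of the parameters of any chain that is switched off, which is exactly the second technical condition on iFHMM likelihoods.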
In the context of the infinite hidden Markov model, a solution was recently proposed in [12], where a slice sampler adaptively truncates the infinite dimensional model, after which dynamic programming performs exact inference. Since a stick breaking construction for the iFHMM is readily available, we can use a very similar approach for the iFHMM. The central idea is the following: we introduce an auxiliary slice variable \mu with the following distribution

\mu ~ Uniform(0, \min_{m: \exists t, s_{tm} = 1} a_m).   (13)

It is not essential that we sample from the uniform distribution; in fact, for some of our experiments we use the more flexible Beta distribution. The resulting joint distribution is

p(\mu, a, b, S) = p(\mu | a, S) p(a, b, S).   (14)

It is clear from the equation above that one recovers the original mIBP distribution when we integrate out \mu. However, when we condition the joint distribution on \mu we find

p(S | Y, \mu, a, b) \propto p(S | Y, a, b) \frac{I(0 \le \mu \le \min_{m: \exists t, s_{tm} = 1} a_m)}{\min_{m: \exists t, s_{tm} = 1} a_m},   (15)

which forces all columns of S for which a_m < \mu to be in the all-zero state. Since there can only be a finite number of a_m > \mu, this effectively implies that we need only resample a finite number of columns of S.

We now describe our algorithm in the context of the ICA iFHMM: we start with an initial S matrix and sample a, b. Next, conditional on our initial S and the data Y, we sample the ICA parameters X and W. We then start an iterative sampling scheme which involves the following steps:

1. We sample the auxiliary slice variable \mu. This might involve extending the representation of S, X and W.
2. For all the represented features, we sample S, X and W.
3. We resample the hyperparameters (\sigma_Y, \sigma_W, \alpha, \gamma, \delta) of our model.
4. We compact our representation by removing all unused features.

We experimented with 3 different algorithms for step 2. The first, a naive Gibbs sampler, did not perform well, as we expected. The second algorithm, which we used for our experiments, is a blocked Gibbs sampler which fixes all but one column of S and runs a forward-filtering backward-sampling sweep on the remaining column. This allows us to analytically integrate out one column of X in the dynamic program and resample it from the posterior afterwards. W can be sampled exactly conditional on X, S and Y. A third algorithm runs dynamic programming on multiple chains at once. We originally designed this algorithm as it has the potential to merge two features in one sweep. However, we found that because we cannot integrate out X and W in this setting, the inference was not faster than our second algorithm. Note that because the bulk of the computation is used for estimating X and W, the dynamic programming based algorithms are effectively as fast as the naive Gibbs sampler. A prototype implementation of the iFHMM sampler in Matlab or .NET can be obtained from the first author.

Figure 4: Blind speech separation experiment; figures represent which speaker is speaking at a certain point in time: columns are speakers, rows are white if the speaker is talking and black otherwise. The left figure (a) is ground truth, the next two figures (b: ICA iFHMM, c: iICA) are for the 10 microphone experiment, the right two figures (d: ICA iFHMM, e: iICA) are for the 3 microphone experiment.

4 Experiments

To test our model and inference algorithms, we address a blind speech separation task, also known as the cocktail party problem. More specifically, we record multiple people who are simultaneously speaking, using a set of microphones.
Given the mixed speech signals, the goal is to separate out the individual speech signals. Key to our presentation is that we want to illustrate that using nonparametric methods, we can learn the number of speakers from a small amount of data. Our first experiment learns to recover the signals in a setting with more microphones than speakers; our second experiment uses fewer microphones than speakers.

The experimental setup was the following: we downloaded data from 5 speakers from the Speech Separation Challenge website^3. The data for each speaker consists of 4 sentences which we appended with random pauses in between each sentence. Figure 4(a) illustrates which person is talking at what point in time. Next, we artificially mix the data 10 times. Each mixture is a linear combination of each of the 5 speakers using Uniform(0, 1) mixing weights. We centered the data to have zero mean and unit variance and added IID Normal(0, \sigma_Y^2) noise with \sigma_Y = 0.3.

In our first experiment we compared the ICA iFHMM with the iICA model using all 10 microphones. We subsample the data so we learn from 245 datapoints. We initialized the samplers for both models with an initial S matrix with 10 features, 5% random entries on. We use a Gamma(1.0, 4.0) prior on \alpha. In both models, we use an InverseGamma(2.0, 1.0) prior for \sigma_Y and \sigma_W. Finally, for the iFHMM, we chose a Gamma(10.0, 1.0) prior on \gamma and a Gamma(1.0, 1.0) prior on \delta to encode our belief that people speak for larger stretches of time, say the time to pronounce a sentence. We ran the samplers for 5000 iterations and then gathered 20 samples, one every 20 iterations.

For both the ICA iFHMM and iICA models, we average the 20 samples and rearrange the features to have maximal overlap with the ground truth features. Figure 4(b) shows that the ICA iFHMM model recognizes that the data was generated from 5 speakers.
Visual inspection of the recovered S matrix also shows that the model discovers who is speaking at what time. Figure 4(c) illustrates the results of the iICA model on the same data. Although the model discovers some structure in the data, it fails to find the right number of speakers (it finds 9) and does a poor job in discovering which speaker is active at which time. We computed the average mutual information between the 5 columns of the true S matrix and the first 5 columns of the recovered S matrices. We find that the iFHMM has an average mutual information of 0.296 compared to 0.068 for the iICA model. The difference between the two models is strictly limited to the difference between using the IBP versus the mIBP. We want to emphasize that although one could come up with ad-hoc heuristics to smooth the iICA results, the ICA iFHMM is a principled probabilistic model that does a good job at comparable computational cost.

In a second experiment, we chose to perform blind speech separation using only the first 3 microphones. We subsampled a noiseless version of the data to get 489 datapoints. We ran both the ICA iFHMM and iICA inference algorithms using exactly the same settings as in the previous experiment. Figures 4(d) and 4(e) show the average of 20 samples, rearranged to match the ground truth. In this setting both methods fail to identify the number of speakers, although the ICA iFHMM clearly performs better. The ICA iFHMM finds one signal too many: the spurious signal is very similar to the third signal, which suggests that the error is a problem of the inference algorithm and not so much of the model itself. The iICA on the other hand performs poorly: it is very hard to find any structure in the recovered Z matrix.

^3 http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm
We compared the mutual information as described above and find that the iFHMM has a mutual information of 0.091 compared to 0.028 for the iICA model.

5 Discussion

The success of the Hidden Markov Model set off a wealth of extensions to adapt it to particular situations. [2] introduced a factorial hidden Markov model which explicitly models dynamic latent features, while in [13] a nonparametric version of the Hidden Markov Model was presented. In this paper we "complete the square" by presenting a nonparametric Factorial Hidden Markov Model. We introduced a new stochastic process for latent feature representation of time series called the Markov Indian Buffet Process. We showed how this stochastic process can be used to build a nonparametric extension of the FHMM which we call the iFHMM. Another issue which deserves further exploration is inference: in [2] it was found that a structured variational method provides a good balance between accuracy and computational effort. An interesting open problem is whether we can adapt the structured variational method to the iFHMM. Finally, analogous to the two-parameter IBP [14], we would like to add one more degree of flexibility to control the 0 -> 1 transition probability more finely. Although the derivation of the mIBP with this extra parameter is straightforward, we as yet lack a stick breaking construction for this model, which is crucial for our inference scheme.

Acknowledgments

We kindly acknowledge David Knowles for discussing the generalized Amari error and A. Taylan Cemgil for his suggestions on blind source separation. Jurgen Van Gael is supported by a Microsoft Research PhD scholarship; Zoubin Ghahramani is also in the Machine Learning department, CMU.

References

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257-286, 1989.

[2] Z. Ghahramani and M.
I. Jordan, "Factorial hidden Markov models," Machine Learning, vol. 29, pp. 245-273, 1997.

[3] P. Wang and Q. Ji, "Multi-view face tracking with factorial and switching HMM," in Proceedings of the Seventh IEEE Workshops on Application of Computer Vision, pp. 401-406, IEEE Computer Society, 2005.

[4] B. Logan and P. Moreno, "Factorial HMMs for acoustic modeling," 1998.

[5] K. Duh, "Joint labeling of multiple sequences: A factorial HMM approach," in 43rd Annual Meeting of the Association for Computational Linguistics (ACL) - Student Research Workshop, 2005.

[6] T. L. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," Advances in Neural Information Processing Systems, vol. 18, pp. 475-482, 2006.

[7] R. M. Neal, "Bayesian mixture modeling," Maximum Entropy and Bayesian Methods, 1992.

[8] Y. W. Teh, D. Görür, and Z. Ghahramani, "Stick-breaking construction for the Indian buffet process," Proceedings of the International Conference on Artificial Intelligence and Statistics, vol. 11, 2007.

[9] A. Hyvarinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Networks, vol. 13, pp. 411-430, 2000.

[10] D. Knowles and Z. Ghahramani, "Infinite sparse factor analysis and infinite independent components analysis," Lecture Notes in Computer Science, vol. 4666, p. 381, 2007.

[11] S. L. Scott, "Bayesian methods for hidden Markov models: Recursive computing in the 21st century," Journal of the American Statistical Association, vol. 97, pp. 337-351, Mar. 2002.

[12] J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani, "Beam sampling for the infinite hidden Markov model," in The 25th International Conference on Machine Learning, vol. 25, (Helsinki), 2008.

[13] M. J. Beal, Z.
Ghahramani, and C. E. Rasmussen, "The infinite hidden Markov model," Advances in Neural Information Processing Systems, vol. 14, pp. 577-584, 2002.

[14] Z. Ghahramani, T. L. Griffiths, and P. Sollich, "Bayesian nonparametric latent feature models," Bayesian Statistics, vol. 8, 2007.
", "award": [], "sourceid": 109, "authors": [{"given_name": "Jurgen", "family_name": "Gael", "institution": null}, {"given_name": "Yee", "family_name": "Teh", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}]}