{"title": "Independent Factor Analysis with Temporally Structured Sources", "book": "Advances in Neural Information Processing Systems", "page_first": 386, "page_last": 392, "abstract": null, "full_text": "Independent Factor Analysis with \n\nTemporally Structured Sources \n\nHagai Attias \n\nhagai@gatsby.ucl.ac.uk \n\nGatsby Unit, University College London \n\n17 Queen Square \n\nLondon WCIN 3AR, U.K. \n\nAbstract \n\nWe present a new technique for time series analysis based on dy(cid:173)\nnamic probabilistic networks. In this approach, the observed data \nare modeled in terms of unobserved, mutually independent factors, \nas in the recently introduced technique of Independent Factor Anal(cid:173)\nysis (IFA). However, unlike in IFA, the factors are not Li.d.; each \nfactor has its own temporal statistical characteristics. We derive a \nfamily of EM algorithms that learn the structure of the underlying \nfactors and their relation to the data. These algorithms perform \nsource separation and noise reduction in an integrated manner, and \ndemonstrate superior performance compared to IFA. \n\n1 \n\nIntroduction \n\nThe technique of independent factor analysis (IFA) introduced in [1] provides a \ntool for modeling L'-dim data in terms of L unobserved factors. These factors \nare mutually independent and combine linearly with added noise to produce the \nobserved data. Mathematically, the model is defined by \n\nYt = HXt + Ut, \n\n(1) \n\nwhere Xt is the vector of factor activities at time t, Yt is the data vector, H is the \nL' x L mixing matrix, and Ut is the noise. \nThe origins of IFA lie in applied statistics on the one hand and in signal processing \non the other hand. Its statistics ancestor is ordinary factor analysis (FA), which as(cid:173)\nsumes Gaussian factors. In contrast, IFA allows each factor to have its own arbitrary \ndistribution, modeled semi-parametrically by a I-dim mixture of Gaussians (MOG). 
\nThe MOG parameters, as well as the mixing matrix and noise covariance matrix, are learned from the observed data by an expectation-maximization (EM) algorithm derived in [1]. The signal processing ancestor of IFA is the independent component analysis (ICA) method for blind source separation [2]-[6]. In ICA, the factors are termed sources, and the task of blind source separation is to recover them from the observed data with no knowledge of the mixing process. The sources in ICA have non-Gaussian distributions, but unlike in IFA these distributions are usually fixed by prior knowledge or have quite limited adaptability. More significant restrictions are that their number is set to the data dimensionality, i.e. L = L' ('square mixing'), the mixing matrix is assumed invertible, and the data are assumed noise-free (u_t = 0). In contrast, IFA allows any L, L' (including more sources than sensors, L > L'), as well as non-zero noise with unknown covariance. In addition, its use of the flexible MOG model often proves crucial for achieving successful separation [1]. \n\nTherefore, IFA generalizes and unifies FA and ICA. Once the model has been learned, it can be used for classification (fitting an IFA model for each class), completing missing data, and so on. In the context of blind separation, an optimal reconstruction of the sources x_t from data is obtained [1] using a MAP estimator. \n\nHowever, IFA and its ancestors suffer from the following shortcoming: they are oblivious to temporal information, since they do not attempt to model the temporal statistics of the data (but see [4] for square, noise-free mixing). In other words, the model learned would not be affected by permuting the time indices of {y_t}. This is unfortunate, since modeling the data as a time series would facilitate filtering and forecasting, as well as more accurate classification. 
Moreover, for source separation applications, learning temporal statistics would provide additional information on the sources, leading to cleaner source reconstructions. \n\nTo see this, one may think of the problem of blind separation of noisy data in terms of two components: source separation and noise reduction. A possible approach might be the following two-stage procedure. First, perform noise reduction using, e.g., Wiener filtering. Second, perform source separation on the cleaned data using, e.g., an ICA algorithm. Notice that this procedure directly exploits temporal (second-order) statistics of the data in the first stage to achieve stronger noise reduction. An alternative approach would be to exploit the temporal structure of the data indirectly, by using a temporal source model. In the resulting single-stage algorithm, the operations of source separation and noise reduction are coupled. This is the approach taken in the present paper. \n\nIn the following, we present a new approach to the independent factor problem based on dynamic probabilistic networks. In order to capture temporal statistical properties of the observed data, we describe each source by a hidden Markov model (HMM). The resulting dynamic model describes a multivariate time series in terms of several independent sources, each having its own temporal characteristics. Section 2 presents an EM learning algorithm for the zero-noise case, and section 3 presents an algorithm for the case of isotropic noise. The case of non-isotropic noise turns out to be computationally intractable; section 4 provides an approximate EM algorithm based on a variational approach. \nNotation: the multivariate Gaussian density is denoted by \\mathcal{G}(z, \\Sigma) = |2\\pi\\Sigma|^{-1/2} \\exp(-z^T \\Sigma^{-1} z / 2). We work with T-point time blocks denoted x_{1:T} = {x_t}_{t=1}^T. The ith coordinate of x_t is x_t^i. For a function f, \\langle f(x_{1:T}) \\rangle denotes averaging over an ensemble of x_{1:T} blocks. 
\n\n2 Zero Noise \n\nThe MOG source model employed in IFA [1] has the advantages that (i) it is capable of approximating arbitrary densities, and (ii) it can be learned efficiently from data by EM. The Gaussians correspond to the hidden states of the sources, labeled by s. Assume that at time t, source i is in state s_t^i = s. Its signal x_t^i is then generated by sampling from a Gaussian distribution with mean \\mu_s^i and variance v_s^i. In order to capture temporal statistics of the data, we endow the sources with temporal structure by introducing a transition matrix a_{s's}^i between the states. Focusing on a time block t = 1, ..., T, the resulting probabilistic model is defined by \n\np(s_t^i = s | s_{t-1}^i = s') = a_{s's}^i, p(s_0^i = s) = \\pi_s^i, \np(x_t^i | s_t^i = s) = \\mathcal{G}(x_t^i - \\mu_s^i, v_s^i), \np(y_{1:T}) = |\\det G|^T p(x_{1:T}), (2) \n\nwhere p(x_{1:T}) is the joint density of all sources x_t^i, i = 1, ..., L at all time points, and the last equation follows from x_t = G y_t with G = H^{-1} being the unmixing matrix. As usual in the noise-free scenario (see [2]; section 7 of [1]), we are assuming that the mixing matrix is square and invertible. \nThe graphical model for the observed density p(y_{1:T} | W) defined by (2) is parametrized by W = {G_{ij}, \\mu_s^i, v_s^i, \\pi_s^i, a_{s's}^i}. This model describes each source as a first-order HMM; it reduces to a time-independent model if a_{s's}^i = \\pi_s^i. Whereas temporal structure can be described by other means, e.g. a moving-average [4] or autoregressive [6] model, the HMM is advantageous since it models high-order temporal statistics and facilitates EM learning. Omitting the derivation, maximization with respect to G_{ij} results in the incremental update rule \n\n\\delta G = \\epsilon G - (\\epsilon / T) \\sum_{t=1}^T \\phi(x_t) x_t^T G, (3) \n\nwhere \\phi(x_t)^i = \\sum_s \\gamma_t^i(s) (x_t^i - \\mu_s^i) / v_s^i, and the natural gradient [3] was used; \\epsilon is an appropriately chosen learning rate. 
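The incremental rule (3) is easy to state in code. The numpy sketch below is not from the paper; the function name and array layout are illustrative. It applies one natural-gradient update of the unmixing matrix G, given a block of recovered sources x_t and the corresponding score-like quantities \phi(x_t) (rows of X and Phi):

```python
import numpy as np

def natural_gradient_step(G, X, Phi, eps):
    """One incremental update of the unmixing matrix G per rule (3):
    G <- G + eps * G - (eps / T) * sum_t phi(x_t) x_t^T G.
    X and Phi are (T, L) arrays holding x_t and phi(x_t) row-wise."""
    T = X.shape[0]
    dG = eps * G - (eps / T) * (Phi.T @ X) @ G  # Phi.T @ X = sum_t phi(x_t) x_t^T
    return G + dG
```

For time-independent sources, \phi(x^i) = -\partial \log p(x^i)/\partial x^i; using, say, phi = tanh as a stand-in (an illustrative assumption, not the MOG-HMM \phi of the paper) recovers a familiar natural-gradient ICA step.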
For the source parameters we obtain the update rules \n\n\\mu_s^i = \\sum_t \\gamma_t^i(s) x_t^i / \\sum_t \\gamma_t^i(s), v_s^i = \\sum_t \\gamma_t^i(s) (x_t^i - \\mu_s^i)^2 / \\sum_t \\gamma_t^i(s), a_{s's}^i = \\sum_t \\xi_t^i(s', s) / \\sum_t \\gamma_{t-1}^i(s'), (4) \n\nwith the initial probabilities updated via \\pi_s^i = \\gamma_0^i(s). We used the standard HMM notation \\gamma_t^i(s) = p(s_t^i = s | x_{1:T}^i), \\xi_t^i(s', s) = p(s_{t-1}^i = s', s_t^i = s | x_{1:T}^i). These posterior densities are computed in the E-step for each source, which is given in terms of the data via x_t^i = \\sum_j G_{ij} y_t^j, using the forward-backward procedure [7]. \nThe algorithm (3-4) may be used in several possible generalized EM schemes. An efficient one is given by the following two-phase procedure: (i) freeze the source parameters and learn the separating matrix G using (3); (ii) freeze G and learn the source parameters using (4), then go back to (i) and repeat. Notice that the rule (3) is similar to a natural gradient version of Bell and Sejnowski's ICA rule [2]; in fact, the two coincide for time-independent sources where \\phi(x^i) = -\\partial \\log p(x^i) / \\partial x^i. We also recognize (4) as the Baum-Welch method. Hence, in phase (i) our algorithm separates the sources using a generalized ICA rule, whereas in phase (ii) it learns an HMM for each source. \nRemark. Often one would like to model a given L'-variable time series in terms of a smaller number L <= L' of factors. In the framework of our noise-free model y_t = H x_t, this can be achieved by applying the above algorithm to the L largest principal components of the data; notice that if the data were indeed generated by L factors, the remaining L' - L principal components would vanish. Equivalently, one may apply the algorithm to the data directly, using a non-square L x L' unmixing matrix G. \nResults. Figure 1 demonstrates the performance of the above method on a 4 x 4 mixture of speech signals, which were passed through a non-linear function to modify their distributions. 
This mixture is inseparable to ICA because the source model used by the latter does not fit the actual source densities (see discussion in [1]). We also applied our dynamic network to a mixture of speech signals whose distributions were made Gaussian by an appropriate non-linear transformation. Since temporal information is crucial for separation in this case (see [4],[6]), this mixture is inseparable to ICA and IFA; however, the algorithm (3-4) accomplished separation successfully. \n\nFigure 1: Left: two of the four source distributions. Middle: outputs of the EM algorithm (3-4) are nearly independent. Right: the outputs of ICA [2] are correlated. \n\n3 Isotropic Noise \n\nWe now turn to the case of non-zero noise u_t != 0. We assume that the noise is white and has a zero-mean Gaussian distribution with covariance matrix \\Lambda. In general, this case is computationally intractable (see section 4). The reason is that the E-step requires computing the posterior distribution p(s_{0:T}, x_{1:T} | y_{1:T}) not only over the source states (as in the zero-noise case) but also over the source signals, and this posterior has a quite complicated structure. We now show that if we assume isotropic noise, i.e. \\Lambda_{ij} = \\lambda \\delta_{ij}, as well as square invertible mixing as above, this posterior simplifies considerably, making learning and inference tractable. This is done by adapting an idea suggested in [8] to our dynamic probabilistic network. \nWe start by pre-processing the data using a linear transformation that makes their covariance matrix unity, i.e., \\langle y_t y_t^T \\rangle = I ('sphering'). Here \\langle \\cdot \\rangle denotes averaging over T-point time blocks. 
From (1) it follows that H S H^T = \\lambda' I, where S = \\langle x_t x_t^T \\rangle is the diagonal covariance matrix of the sources, and \\lambda' = 1 - \\lambda. This, for a square invertible H, implies that H^T H is diagonal. In fact, since the unobserved sources can be determined only to within a scaling factor, we can set the variance of each source to unity and obtain the orthogonality property H^T H = \\lambda' I. It can be shown that the source posterior now factorizes into a product over the individual sources, p(s_{0:T}, x_{1:T} | y_{1:T}) = \\prod_i p(s_{0:T}^i, x_{1:T}^i | y_{1:T}), where \n\np(s_{0:T}^i, x_{1:T}^i | y_{1:T}) \\propto [\\prod_{t=1}^T \\mathcal{G}(x_t^i - \\eta_t^i, \\sigma_t^i) v_t^i p(s_t^i | s_{t-1}^i)] v_0^i p(s_0^i). (5) \n\nThe means and variances at time t in (5), as well as the quantities v_t^i, depend on both the data y_t and the states s_t^i; in particular, \\eta_t^i = (v_s^i \\sum_j H_{ji} y_t^j + \\lambda \\mu_s^i) / (\\lambda' v_s^i + \\lambda) and \\sigma_t^i = \\lambda v_s^i / (\\lambda' v_s^i + \\lambda), using s = s_t^i; the expressions for the v_t^i are omitted. The transition probabilities are the same as in (2). Hence, the posterior distribution (5) effectively defines a new HMM for each source, with y_t-dependent emission and transition probabilities. \nTo derive the learning rule for H, we should first compute the conditional mean \\bar{x}_t of the source signals at time t given the data. This can be done recursively using (5) as in the forward-backward procedure. We then obtain \n\nC = (1/T) \\sum_{t=1}^T y_t \\bar{x}_t^T, (6) \n\nfrom which H is computed by imposing the orthogonality constraint H^T H = \\lambda' I using Lagrange multipliers; the resulting fractional form can be evaluated via a diagonalization procedure. \nThe source parameters are computed using a learning rule (omitted) similar to the noise-free rule (4). It is easy to derive a learning rule for the noise level \\lambda as well; in fact, the ordinary FA rule would suffice. We point out that, while this algorithm has been derived for the case L = L', it is perfectly well defined (though sub-optimal: see below) for L <= L'. 
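The sphering pre-processing step assumed by this algorithm can be sketched as follows. This is a minimal numpy illustration, not code from the paper; it whitens via the eigendecomposition of the sample covariance, and the mean subtraction is an added convenience:

```python
import numpy as np

def sphere(Y):
    """Return a linearly transformed copy of Y (T x L') whose sample
    covariance is the identity matrix ('sphering')."""
    Yc = Y - Y.mean(axis=0)                   # center each sensor channel
    C = Yc.T @ Yc / Yc.shape[0]               # sample covariance <y y^T>
    d, E = np.linalg.eigh(C)                  # C = E diag(d) E^T
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T   # symmetric whitening matrix
    return Yc @ W
```

After this transformation the data covariance is the identity by construction, which is what licenses the orthogonality property H^T H = \lambda' I used above.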
\n\n4 Non-Isotropic Noise \n\nThe general case of non-isotropic noise and non-square mixing is computationally intractable. This is because the exact E-step requires summing over all possible source configurations (s_{t_1}^1, ..., s_{t_L}^L) at all times t_1, ..., t_L = 1, ..., T. The intractability problem stems from the fact that, while the sources are independent, the sources conditioned on a data vector y_{1:T} are correlated, resulting in a large number of hidden configurations. This problem does not arise in the noise-free case, and can be avoided in the case of isotropic noise and square mixing using the orthogonality property; in both cases, the exact posterior over the sources factorizes. \n\nThe EM algorithm derived below is based on a variational approach. This approach was introduced in [9] in the context of sigmoid belief networks, but constitutes a general framework for ML learning in intractable probabilistic networks; it was used in an HMM context in [10]. The idea is to use an approximate but tractable posterior to place a lower bound on the likelihood, and optimize the parameters by maximizing this bound. \nA starting point for deriving a bound on the likelihood \\mathcal{L} is Neal and Hinton's [11] formulation of the EM algorithm: \n\n\\mathcal{L} = \\log p(y_{1:T}) >= \\sum_{t=1}^T E_q \\log p(y_t | x_t) + \\sum_{i=1}^L E_q \\log p(s_{0:T}^i, x_{1:T}^i) - E_q \\log q, (7) \n\nwhere E_q denotes averaging with respect to an arbitrary posterior density over the hidden variables given the observed data, q = q(s_{0:T}, x_{1:T} | y_{1:T}). Exact EM, as shown in [11], is obtained by maximizing the bound (7) with respect to both the posterior q (corresponding to the E-step) and the model parameters W (M-step). However, the resulting q is the true but intractable posterior. In contrast, in variational EM we choose a q that differs from the true posterior, but facilitates a tractable E-step. \n\nE-Step. 
We use q(s_{0:T}, x_{1:T} | y_{1:T}) = \\prod_i q(s_{0:T}^i | y_{1:T}) \\prod_t q(x_t | y_{1:T}), parametrized as \n\nq(s_t^i = s | s_{t-1}^i = s', y_{1:T}) \\propto \\lambda_{s,t}^i a_{s's}^i, q(s_0^i = s | y_{1:T}) \\propto \\lambda_{s,0}^i \\pi_s^i, \nq(x_t | y_{1:T}) = \\mathcal{G}(x_t - \\rho_t, \\Sigma_t). (8) \n\nThus, the variational transition probabilities in (8) are described by multiplying the original ones a_{s's}^i by the parameters \\lambda_{s,t}^i, subject to the normalization constraints. The source signals x_t at time t are jointly Gaussian with mean \\rho_t and covariance \\Sigma_t. The means, covariances and transition probabilities are all time- and data-dependent, i.e., \\rho_t = f(y_{1:T}, t) etc. This parametrization scheme is motivated by the form of the posterior in (5); notice that the quantities \\eta_t^i, \\sigma_t^i, v_t^i there become the variational parameters \\rho_t^i, \\Sigma_t^{ij}, \\lambda_{s,t}^i of (8). A related scheme was used in [10] in a different context. Since these parameters will be adapted independently of the model parameters, the non-isotropic algorithm is expected to give superior results compared to the isotropic one. \n\nOf course, in the true posterior the x_t are correlated, both temporally among themselves and with s_t, and the latter do not factorize. To best approximate it, the variational parameters V = {\\rho_t^i, \\Sigma_t^{ij}, \\lambda_{s,t}^i} are optimized to maximize the bound on \\mathcal{L}, or equivalently to minimize the KL distance between q and the true posterior. This requirement leads to the fixed point equations \n\n\\rho_t = (H^T \\Lambda^{-1} H + B_t)^{-1} (H^T \\Lambda^{-1} y_t + b_t), \\Sigma_t = (H^T \\Lambda^{-1} H + B_t)^{-1}, \n\\lambda_{s,t}^i = (1 / z_t^i) \\exp[ -(1/2) \\log v_s^i - ((\\rho_t^i - \\mu_s^i)^2 + \\Sigma_t^{ii}) / (2 v_s^i) ], (9) \n\nwhere B_t^{ij} = \\sum_s [\\gamma_t^i(s) / v_s^i] \\delta_{ij}, b_t^i = \\sum_s \\gamma_t^i(s) \\mu_s^i / v_s^i, and the factors z_t^i ensure normalization. The HMM quantities \\gamma_t^i(s) are computed by the forward-backward procedure using the variational transition probabilities (8). The variational parameters are determined by solving eqs. (9) iteratively for each block y_{1:T}; in practice, we found that less than 20 iterations are usually required for convergence. \nM-Step. The update rules for W are given for the mixing parameters by \n\nH = [\\sum_t y_t \\rho_t^T] [\\sum_t (\\rho_t \\rho_t^T + \\Sigma_t)]^{-1}, \\Lambda = (1/T) \\sum_t (y_t y_t^T - y_t \\rho_t^T H^T), (10) \n\nand for the source parameters by \n\n\\mu_s^i = \\sum_t \\gamma_t^i(s) \\rho_t^i / \\sum_t \\gamma_t^i(s), v_s^i = \\sum_t \\gamma_t^i(s) ((\\rho_t^i - \\mu_s^i)^2 + \\Sigma_t^{ii}) / \\sum_t \\gamma_t^i(s), \na_{s's}^i = \\sum_t \\xi_t^i(s', s) / \\sum_t \\gamma_{t-1}^i(s'), (11) \n\nwhere the \\xi_t^i(s', s) are computed using the variational transition probabilities (8). Notice that the learning rules for the source parameters have the Baum-Welch form, in spite of the correlations between the conditioned sources. In our variational approach, these correlations are hidden in V, as manifested by the fact that the fixed point equations (9) couple the parameters V across time points (since \\gamma_t^i(s) depends on \\lambda_{s,t=1:T}^i) and sources. \nSource Reconstruction. From q(x_t | y_{1:T}) (8), we observe that the MAP source estimate is given by \\hat{x}_t = \\rho_t(y_{1:T}), and depends on both W and V. \n\nFigure 2: Left: quality of the model parameter estimates. Right: quality of the source reconstructions. (See text.) \n\nResults. The above algorithm is demonstrated on a source separation task in Figure 2. We used 6 speech signals, transformed by non-linearities to have arbitrary one-point densities, and mixed by a random 8 x 6 matrix H_0. Different signal-to-noise (SNR) levels were used. 
The error in the estimated H (left, solid line) is quantified by the size of the non-diagonal elements of (H^T H)^{-1} H^T H_0 relative to the diagonal; the results obtained by IFA [1], which does not use temporal information, are plotted for reference (dotted line). The mean squared error of the reconstructed sources (right, solid line) and the corresponding IFA result (right, dashed line) are also shown. The estimate and reconstruction errors of this algorithm are consistently smaller than those of IFA, reflecting the advantage of exploiting the temporal structure of the data. Additional experiments with different numbers of sources and sensors gave similar results. Notice that this algorithm, unlike the previous two, allows both L <= L' and L > L'. We also considered situations where the number of sensors was smaller than the number of sources; the separation quality was good, although, as expected, less so than in the opposite case. \n\n5 Conclusion \n\nAn important issue that has not been addressed here is model selection. When applying our algorithms to an arbitrary dataset, the number of factors and of HMM states for each factor should be determined. Whereas this could be done, in principle, using cross-validation, the required computational effort would be fairly large. However, in a recent paper [12] we develop a new framework for Bayesian model selection, as well as model averaging, in probabilistic networks. This framework, termed Variational Bayes, proposes an EM-like algorithm which approximates full posterior distributions over not only hidden variables but also parameters and model structure, as well as predictive quantities, in an analytical manner. It is currently being applied to the algorithms presented here with good preliminary results. 
\nOne field in which our approach may find important applications is speech technology, where it suggests building more economical signal models based on combining independent low-dimensional HMMs, rather than fitting a single complex HMM. It may also contribute toward improving recognition performance in noisy, multi-speaker, reverberant conditions which characterize real-world auditory scenes. \n\nReferences \n[1] Attias, H. (1999). Independent factor analysis. Neur. Comp. 11, 803-851. \n[2] Bell, A.J. & Sejnowski, T.J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neur. Comp. 7, 1129-1159. \n[3] Amari, S., Cichocki, A. & Yang, H.H. (1996). A new learning algorithm for blind signal separation. Adv. Neur. Info. Proc. Sys. 8, 757-763 (Ed. by Touretzky, D.S. et al). MIT Press, Cambridge, MA. \n[4] Pearlmutter, B.A. & Parra, L.C. (1997). Maximum likelihood blind source separation: a context-sensitive generalization of ICA. Adv. Neur. Info. Proc. Sys. 9, 613-619 (Ed. by Mozer, M.C. et al). MIT Press, Cambridge, MA. \n[5] Hyvärinen, A. & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neur. Comp. 9, 1483-1492. \n[6] Attias, H. & Schreiner, C.E. (1998). Blind source separation and deconvolution: the dynamic component analysis algorithm. Neur. Comp. 10, 1373-1424. \n[7] Rabiner, L. & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ. \n[8] Lee, D.D. & Sompolinsky, H. (1999), unpublished; D.D. Lee, personal communication. \n[9] Saul, L.K., Jaakkola, T. & Jordan, M.I. (1996). Mean field theory of sigmoid belief networks. J. Art. Int. Res. 4, 61-76. \n[10] Ghahramani, Z. & Jordan, M.I. (1997). Factorial hidden Markov models. Mach. Learn. 29, 245-273. \n[11] Neal, R.M. & Hinton, G.E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. 
Learning in Graphical Models, 355-368 (Ed. by Jordan, M.I.). Kluwer Academic Press. \n[12] Attias, H. (2000). A variational Bayesian framework for graphical models. Adv. Neur. Info. Proc. Sys. 12 (Ed. by Leen, T. et al). MIT Press, Cambridge, MA. \n", "award": [], "sourceid": 1682, "authors": [{"given_name": "Hagai", "family_name": "Attias", "institution": null}]}