{"title": "Scaling Factorial Hidden Markov Models: Stochastic Variational Inference without Messages", "book": "Advances in Neural Information Processing Systems", "page_first": 4044, "page_last": 4052, "abstract": "Factorial Hidden Markov Models (FHMMs) are powerful models for sequential data but they do not scale well with long sequences. We propose a scalable inference and learning algorithm for FHMMs that draws on ideas from the stochastic variational inference, neural network and copula literatures. Unlike existing approaches, the proposed algorithm requires no message passing procedure among latent variables and can be distributed to a network of computers to speed up learning. Our experiments corroborate that the proposed algorithm does not introduce further approximation bias compared to the proven structured mean-field algorithm, and achieves better performance with long sequences and large FHMMs.", "full_text": "Scaling Factorial Hidden Markov Models:\n\nStochastic Variational Inference without Messages\n\nYin Cheng Ng\n\nDept. of Statistical Science\nUniversity College London\n\ny.ng.12@ucl.ac.uk\n\nPawel Chilinski\n\nDept. of Computing Science\nUniversity College London\n\nucabchi@ucl.ac.uk\n\nRicardo Silva\n\nDept. of Statistical Science\nUniversity College London\n\nr.silva@ucl.ac.uk\n\nAbstract\n\nFactorial Hidden Markov Models (FHMMs) are powerful models for sequential data but they do not scale well with long sequences. We propose a scalable inference and learning algorithm for FHMMs that draws on ideas from the stochastic variational inference, neural network and copula literatures. Unlike existing approaches, the proposed algorithm requires no message passing procedure among latent variables and can be distributed to a network of computers to speed up learning. 
Our experiments corroborate that the proposed algorithm does not introduce further approximation bias compared to the proven structured mean-field algorithm, and achieves better performance with long sequences and large FHMMs.\n\n1 Introduction\n\nBreakthroughs in modern technology have allowed more sequential data to be collected at higher resolutions. The resulting sequential data sets are often extremely long and high-dimensional, exhibiting rich structures and long-range dependencies that can only be captured by fitting large models to the sequences, such as Hidden Markov Models (HMMs) with a large state space. The standard methods for learning and performing inference in the HMM class of models are the Expectation-Maximization (EM) and Forward-Backward algorithms. The Forward-Backward and EM algorithms are prohibitively expensive for long sequences and large models because of their linear and quadratic computational complexity with respect to sequence length and state space size respectively.\nTo rein in the computational cost of inference in HMMs, several variational inference algorithms that trade off inference accuracy for lower computational cost have been proposed in the literature. Variational inference is a deterministic approximate inference technique that approximates a posterior distribution p by minimizing the Kullback-Leibler divergence KL(q||p), where q lies in a family of distributions selected to approximate p as closely as possible while keeping the inference algorithm computationally tractable [24]. Despite its biased approximation of the actual posteriors, the variational inference approach has been proven to work well in practice [21].\nVariational inference has also been successfully scaled to tackle problems with large data sets through the use of stochastic gradient descent (SGD) algorithms [12]. However, applications of such techniques to models where the data is dependent (i.e., non-i.i.d.) 
require much care in the choice of\nthe approximating family and parameter update schedules to preserve dependency structure in the\ndata [9]. More recently, developments of stochastic variational inference algorithms to scale models\nfor non-i.i.d. data to large data sets have been increasingly explored [5, 9].\nWe propose a stochastic variational inference approach to approximate the posterior of hidden Markov\nchains in Factorial Hidden Markov Models (FHMM) with independent chains of bivariate Gaussian\ncopulas. Unlike existing variational inference algorithms, the proposed approach eliminates the need\nfor explicit message passing between latent variables and allows computations to be distributed to\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fmultiple computers. To scale the variational distribution to long sequences, we reparameterise the\nbivariate Gaussian copula chain parameters with feed-forward recognition neural networks that are\nshared by copula chain parameters across different time points. The use of recognition networks in\nvariational inference has been well-explored in models in which data is assumed to be i.i.d. [14, 11].\nTo the best of our knowledge, the use of recognition networks to decouple inference in non-factorised\nstochastic process of unbounded length has not been well-explored. In addition, both the FHMM\nparameters and the parameters of the recognition networks are learnt in conjunction by maximising\nthe stochastic lower bound of the log-marginal likelihood, computed based on randomly sampled\nsubchains from the full sequence of interest. 
The combination of recognition networks and stochastic optimisation allows us to scale the Gaussian copula chain variational inference approach to very long sequences.\n\n2 Background\n\n2.1 Factorial Hidden Markov Model\n\nFactorial Hidden Markov Models (FHMMs) are a class of HMMs consisting of M latent variables s_t = (s^1_t, ..., s^M_t) at each time point, and observations y_t, where the conditional emission probability of the observations p(y_t|s_t, η) is parameterised through factorial combinations of s_t and emission parameters η. Each of the latent variables s^m_t evolves independently in time through a discrete-valued Markov chain governed by transition matrix A^m [8]. For a sequence of observations y = (y_1, ..., y_T) and corresponding latent variables s = (s_1, ..., s_T), the joint distribution can be written as follows\n\np(y, s) = ∏_{m=1}^{M} p(s^m_1) p(y_1|s_1, η) ∏_{t=2}^{T} p(y_t|s_t, η) ∏_{m=1}^{M} p(s^m_t|s^m_{t-1}, A^m)   (1)\n\nDepending on the state of the latent variables at a particular time point, different subsets of the emission parameters η can be selected, resulting in a dynamic mixture of distributions for the data. The factorial representation of the state space reduces the number of parameters required to encode the transition dynamics compared to regular HMMs with the same number of states. As an example, a state space with 2^M states can be encoded by M binary transition matrices with a total of 2M parameters, while a regular HMM requires a transition matrix with 2^M × (2^M − 1) parameters to be estimated.\nIn this paper, we specify a FHMM with D-dimensional Gaussian emission distributions and M binary hidden Markov chains. 
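To make the factorial generative process in Equation (1) concrete, the following is a minimal NumPy sketch (our illustration, not code from the paper): M independent binary chains evolve under per-chain transition matrices, and each observation is Gaussian with a mean that is a linear function of the binary states plus a bias. The specific transition matrix, weights and noise scale are assumptions.

```python
import numpy as np

def sample_fhmm(T, M, D, seed=None):
    """Sample a toy FHMM: M independent binary Markov chains, Gaussian emissions."""
    rng = np.random.default_rng(seed)
    # One 2x2 transition matrix per chain (rows sum to 1); an assumed example.
    A = np.stack([np.array([[0.9, 0.1], [0.2, 0.8]]) for _ in range(M)])
    W = rng.normal(size=(M + 1, D))  # emission weights; last row acts as a bias
    L = 0.1 * np.eye(D)              # Cholesky factor of the shared noise covariance
    s = np.zeros((T, M), dtype=int)
    y = np.zeros((T, D))
    for t in range(T):
        for m in range(M):
            if t == 0:
                s[t, m] = rng.integers(2)  # uniform initial state
            else:
                s[t, m] = rng.choice(2, p=A[m, s[t - 1, m]])
        s_hat = np.append(s[t], 1.0)       # [s_t^1, ..., s_t^M, 1]
        y[t] = W.T @ s_hat + L @ rng.normal(size=D)
    return s, y

s, y = sample_fhmm(T=200, M=3, D=2, seed=0)
```

Because the chains are independent a priori, the 2^M-state dynamics are encoded with only M small transition matrices, matching the parameter count argument above.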
The emission distributions share a covariance matrix Σ across different states while the mean is parameterised as a linear combination of the latent variables,\n\nμ_t = W^T ŝ_t,   (2)\n\nwhere ŝ_t = [s^1_t, ..., s^M_t, 1]^T is an (M+1)-dimensional binary vector and W ∈ R^{(M+1)×D}. The FHMM model parameters Γ = (Σ, W, A^1, ..., A^M) can be estimated with the EM algorithm. Note that to facilitate optimisation, we reparameterised Σ as LL^T, where L ∈ R^{D×D} is a lower-triangular matrix.\n\n2.1.1 Inference in FHMMs\n\nExact inference in FHMMs is intractable due to the O(T M K^{M+1}) computational complexity for a FHMM with M K-state hidden Markov chains [15]. A structured mean-field (SMF) variational inference approach proposed in [8] approximates the posterior distribution with M independent Markov chains and reduces the complexity to O(T M K^2) in models with linear-Gaussian emission distributions. While the reduction in complexity is significant, inference and learning with SMF remain insurmountable in the presence of extremely long sequences. In addition, SMF requires the storage of O(2TMK) variational parameters in-memory per training sequence. Such computational requirements remain expensive to satisfy even in the age of cloud computing.\n\n2.2 Gaussian Copulas\n\nGaussian copulas are a family of multivariate cumulative distribution functions (CDFs) that capture linear dependency structure between random variables with potentially different marginal distributions. Given two random variables X_1, X_2 with respective marginal CDFs F_1, F_2, their Gaussian copula joint CDF can be written as\n\nΦ_ρ(φ^{-1}(F_1(x_1)), φ^{-1}(F_2(x_2)))   (3)\n\nwhere φ^{-1} is the quantile function of the standard Gaussian distribution, and Φ_ρ is the CDF of the standard bivariate Gaussian distribution with correlation ρ. 
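As an illustration of Equation (3), the copula CDF can be evaluated with the Python standard library alone; the one-dimensional quadrature used here for the bivariate Gaussian CDF is our assumption, not anything from the paper:

```python
from statistics import NormalDist

_nd = NormalDist()

def gaussian_copula_cdf(u1, u2, rho, n=2000):
    """Bivariate Gaussian copula C(u1, u2) = Phi_rho(phi^-1(u1), phi^-1(u2)),
    where u1 = F1(x1) and u2 = F2(x2) are marginal CDF values in (0, 1).
    Phi_rho(a, b) is computed by trapezoidal quadrature of
    integral_{-inf}^{a} pdf(z) * Phi((b - rho*z) / sqrt(1 - rho^2)) dz, |rho| < 1."""
    a, b = _nd.inv_cdf(u1), _nd.inv_cdf(u2)
    lo = -8.0  # effective minus infinity for the standard normal
    h = (a - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * h
        w = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += w * _nd.pdf(z) * _nd.cdf((b - rho * z) / (1.0 - rho * rho) ** 0.5)
    return total * h
```

With rho = 0 the copula factorises into the product of its arguments, so gaussian_copula_cdf(0.3, 0.7, 0.0) recovers 0.3 × 0.7 = 0.21; increasing rho increases the joint probability of jointly small values.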
In a bivariate setting, the dependency between X_1 and X_2 is captured by ρ. The bivariate Gaussian copula can be easily extended to multivariate settings through a correlation matrix. For an in-depth introduction to copulas, please refer to [18, 3].\n\n2.3 Stochastic Variational Inference\n\nVariational inference is a class of deterministic approximate inference algorithms that approximate intractable posterior distributions p(s|y) of latent variables s given data y with a tractable family of variational distributions q_β(s) parameterised by variational parameters β. The variational parameters are fitted to approximate the posterior distributions by maximising the evidence lower bound of the log-marginal likelihood (ELBO) [24]. By applying Jensen's inequality to log ∫ p(y, s) ds, the ELBO can be expressed as\n\nELBO = E_q[log p(y, s)] − E_q[log q(s)].   (4)\n\nThe ELBO can also be interpreted as the negative KL-divergence KL(q_β(s)||p(s|y)) up to a constant. Therefore, variational inference results in the variational distribution that is closest to p within the approximating family as measured by KL.\nMaximising the ELBO in the presence of a large data set is computationally expensive as it requires the ELBO to be computed over all data points. Stochastic variational inference (SVI) [12] successfully scales the inference technique to large data sets using subsampling-based stochastic gradient descent algorithms [2].\n\n2.4 Amortised Inference and Recognition Neural Networks\n\nThe many successes of neural networks in tackling certain supervised learning tasks have generated much research interest in applying neural networks to unsupervised learning and probabilistic modelling problems [20, 7, 19, 14]. A recognition neural network was initially proposed in [11] to extract underlying structures of data modelled by a generative neural network. 
Taking the observed data as input, the feed-forward recognition network learns to predict a vector of unobserved code that the generative neural network initially conjectured to generate the observed data.\nMore recently, recognition networks have been applied to variational inference for latent variable models [14, 7]. Given data, the latent variable model and an assumed family of variational distributions, the recognition network learns to predict optimal variational parameters for the specific data points. As the recognition network parameters are shared by all data points, information learned by the network on a subset of data points is shared with other data points. This inference process is aptly named amortised inference. In short, a recognition network can simply be thought of as a feed-forward neural network that learns to predict optimal variational parameters given the observed data, with the ELBO as its utility function.\n\n3 The Message Free Stochastic Variational Inference Algorithm\n\nWhile structured mean-field variational inference and its associated EM algorithms are effective tools for inference and learning in FHMMs with short sequences, they become prohibitively expensive as the sequences grow longer. For example, one iteration of SMF forward-backward message passing for a FHMM of 5 Markov chains and 10^6 sequential data points takes hours of computing time on a modern 8-core workstation, rendering SMF unusable for large-scale problems. To scale FHMMs to long sequences, we resort to stochastic variational inference.\nThe proposed variational inference algorithm approximates the posterior distributions of the M hidden Markov chains in a FHMM with M independent chains of bivariate Gaussian-Bernoulli copulas. The computational cost of optimising the variational parameters is managed by a subsampling-based stochastic gradient ascent algorithm similar to SVI. 
In addition, parameters of the copula chains are reparameterised using feed-forward recognition neural networks to improve the efficiency of the variational inference algorithm.\nIn contrast to the EM approach for learning FHMM model parameters, our approach allows both the model parameters and variational parameters to be learnt in conjunction by maximising the ELBO with a stochastic gradient ascent algorithm. In the following sections, we describe the variational distributions and recognition networks, and derive the stochastic ELBO for SGD.\n\n3.1 Variational Chains of Bivariate Gaussian Copulas\n\nSimilar to the SMF variational inference algorithm proposed in [8], we aim to preserve the posterior dependency of latent variables within the same hidden Markov chain by introducing chains of bivariate Gaussian copulas. The chain of bivariate Gaussian copulas variational distribution can be written as the product of bivariate Gaussian copulas divided by the marginals of the latent variables at the intersections of the pairs,\n\nq(s^m) = ∏_{t=2}^{T} q(s^m_{t-1}, s^m_t) / ∏_{t=2}^{T-1} q(s^m_t)   (5)\n\nwhere q(s^m_{t-1}, s^m_t) is the joint probability density or mass function of a bivariate Gaussian copula.\nThe copula parameterisation in Equation (5) offers several advantages. Firstly, the overlapping bivariate copula structure enforces coherence of q(s^m_t), such that ∑_{s^m_{t-1}} q(s^m_{t-1}, s^m_t) = ∑_{s^m_{t+1}} q(s^m_t, s^m_{t+1}). Secondly, the chain structure of the distribution restricts the growth in the number of variational parameters to only two parameters per chain for every increment in the sequence length. Finally, the Gaussian copula allows the marginals and dependency structure of the random variables to be modelled separately [3]. 
The decoupling of the marginal and correlation parameters thus allows these parameters to be estimated by unconstrained optimisation, and also lends itself to predicting them separately using feed-forward recognition neural networks.\nFor the rest of the paper, we assume that the FHMM latent variables are Bernoulli random variables with the following bivariate Gaussian-Bernoulli copula probability mass function (PMF) as their variational PMF\n\nq(s^m_{t-1} = 0, s^m_t = 0) = q^m_{00t}\nq(s^m_{t-1} = 0, s^m_t = 1) = 1 − θ_{t-1,m} − q^m_{00t}\nq(s^m_{t-1} = 1, s^m_t = 0) = 1 − θ_{t,m} − q^m_{00t}\nq(s^m_{t-1} = 1, s^m_t = 1) = θ_{t,m} + θ_{t-1,m} + q^m_{00t} − 1   (6)\n\nwhere q^m_{00t} = Φ_{ρ_{t,m}}(φ^{-1}(1 − θ_{t-1,m}), φ^{-1}(1 − θ_{t,m})) and q(s^m_t = 1) = θ_{t,m}. The Gaussian-Bernoulli copula can be easily extended to multinomial random variables.\nAssuming independence between random variables in different hidden chains, the posterior distribution of s can be factorised by chains and approximated by\n\nq(s) = ∏_{m=1}^{M} q(s^m)   (7)\n\n3.2 Feed-forward Recognition Neural Networks\n\nThe number of variational parameters in the chains of bivariate Gaussian copulas scales linearly with respect to the length of the sequence as well as the number of sequences in the data set. While it is possible to directly optimise these variational parameters, the approach quickly becomes infeasible as the size of the data set grows. We propose to circumvent this challenging scalability problem by reparameterising the variational parameters with rolling feed-forward recognition neural networks that are shared among variational parameters within the same chain. 
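The 2×2 joint PMF of the bivariate Gaussian-Bernoulli copula in Equation (6) can be sketched as follows (our illustration; the quadrature-based bivariate Gaussian CDF is an assumption, not the paper's implementation):

```python
from statistics import NormalDist

_nd = NormalDist()

def _bvn_cdf(a, b, rho, n=2000):
    """Standard bivariate Gaussian CDF Phi_rho(a, b), |rho| < 1, via
    1-D trapezoidal quadrature of pdf(z) * Phi((b - rho*z)/sqrt(1 - rho^2))."""
    lo = -8.0
    h = (a - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * _nd.pdf(z) * _nd.cdf((b - rho * z) / (1.0 - rho * rho) ** 0.5)
    return total * h

def copula_pair_pmf(theta_prev, theta_cur, rho):
    """Joint PMF of (s_{t-1}, s_t) under the bivariate Gaussian-Bernoulli copula,
    with marginals q(s_{t-1} = 1) = theta_prev and q(s_t = 1) = theta_cur."""
    q00 = _bvn_cdf(_nd.inv_cdf(1 - theta_prev), _nd.inv_cdf(1 - theta_cur), rho)
    return {(0, 0): q00,
            (0, 1): 1 - theta_prev - q00,
            (1, 0): 1 - theta_cur - q00,
            (1, 1): theta_prev + theta_cur + q00 - 1}
```

The four entries sum to one by construction and marginalise back to theta_prev and theta_cur, which is exactly the coherence property of the overlapping pairs discussed above.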
The marginal variational parameters θ_{t,m} and copula correlation variational parameters ρ_{t,m} are parameterised with different recognition networks as they are parameters of a different nature.\nGiven an observed sequence y = (y_1, ..., y_T), the marginal and correlation recognition networks for hidden chain m compute the variational parameters θ_{t,m} and ρ_{t,m} by performing a forward pass on a window of observed data Δy_t = (y_{t-Δt/2}, ..., y_t, ..., y_{t+Δt/2}),\n\nθ_{t,m} = f^m_θ(Δy_t),   ρ_{t,m} = f^m_ρ(Δy_t)   (8)\n\nwhere Δt + 1 is the user-selected size of the rolling window, and f^m_θ and f^m_ρ are the marginal and correlation recognition networks for hidden chain m with parameters ω_m = (ω_{θ,m}, ω_{ρ,m}). The output layer non-linearities of f^m_θ and f^m_ρ are chosen to be the sigmoid and hyperbolic tangent functions respectively, to match the ranges of θ_{t,m} and ρ_{t,m}.\nThe recognition network hyperparameters, such as the number of hidden units, the non-linearities, and the window size Δt, can be chosen based on computing budget and empirical evidence. In our experiments with shorter sequences, where the ELBO can be computed within a reasonable amount of time, we did not observe a significant difference in the converged ELBOs among different choices of non-linearity. However, we observed that the converged ELBO is sensitive to the number of hidden units, which needs to be adapted to the data set and computing budget. Recognition networks with larger hidden layers have larger capacity to approximate the posterior distributions as closely as possible but require more computing budget to learn. 
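A single-hidden-layer sketch of the rolling recognition networks in Equation (8) (our illustration; the layer sizes, initialisation and flattening of the window are assumptions):

```python
import numpy as np

class RecognitionNet:
    """Maps a window of observations to (theta_{t,m}, rho_{t,m}) with a sigmoid
    head and a tanh head, matching the ranges (0, 1) and (-1, 1)."""

    def __init__(self, window, dim, hidden, seed=None):
        rng = np.random.default_rng(seed)
        d_in = window * dim  # the window is flattened; window = Delta t + 1
        self.W1 = rng.normal(scale=0.1, size=(d_in, hidden))
        self.b1 = np.zeros(hidden)
        self.w_theta = rng.normal(scale=0.1, size=hidden)
        self.w_rho = rng.normal(scale=0.1, size=hidden)

    def __call__(self, window_y):
        h = np.tanh(window_y.reshape(-1) @ self.W1 + self.b1)
        theta = 1.0 / (1.0 + np.exp(-(h @ self.w_theta)))  # marginal head, in (0, 1)
        rho = np.tanh(h @ self.w_rho)                      # correlation head, in (-1, 1)
        return theta, rho

net = RecognitionNet(window=5, dim=2, hidden=30, seed=0)
theta, rho = net(np.zeros((5, 2)))
```

In the paper's setting one such pair of heads (or a pair of networks f^m_θ, f^m_ρ) would exist per hidden chain m, with the same parameters reused at every time point t, which is what amortises the inference.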
Similarly, the choice of Δt determines the amount of information that can be captured by the variational distributions as well as the computing budget required to learn the recognition network parameters. As a rule of thumb, we recommend that the number of hidden units and Δt be chosen as large as the computing budget allows for long sequences. We emphasize that the range of posterior dependency captured by the correlation recognition networks is not limited by Δt, as the recognition network parameters are shared across time, allowing dependency information to be encoded in the network parameters. For FHMMs with a large number of hidden chains, various schemes to share the networks' hidden layers can be devised to scale the method to FHMMs with a large state space. This presents another trade-off between computational requirements and goodness of posterior approximations.\nIn addition to scalability, the use of recognition networks also allows our approach to perform fast inference at run-time, as computing the posterior distributions only requires forward passes of the recognition networks with the data windows of interest. The computational complexity of the recognition network forward pass scales linearly with respect to Δt. As with other types of neural networks, the computation is highly data-parallel and can be massively sped up with GPUs. In comparison, computation for a stochastic variational inference algorithm based on a message passing approach also scales linearly with respect to Δt but is not data-parallel [5]. Subchains from long sequences, together with their associated recognition network computations, can also be distributed across a cluster of computers to improve learning and inference speed.\nHowever, the use of recognition networks is not without its drawbacks. 
Compared to message passing algorithms, the recognition networks approach cannot handle missing data gracefully by integrating out the relevant random variables. The fidelity of the approximated posterior can also be limited by the capacity of the neural networks and by bad local minima. The posterior distributions of the random variables close to the beginning and the end of the sequence also require special handling, as the rolling window cannot be moved any further to the left or right of the sequence. In such scenarios, the posteriors can be computed by adapting the structured mean-field algorithm proposed in [8] to the subchains at the boundaries (see Supplementary Material). The importance of the boundary scenarios in learning the FHMM model parameters diminishes as the data sequence becomes longer.\n\n3.3 Learning Recognition Network and FHMM Parameters\n\nGiven a sequence y of length T, the M-chain FHMM parameters Γ and recognition network parameters Ω = (ω_1, ..., ω_M) need to be adapted to the data by maximising the ELBO as expressed in Equation (4) with respect to Γ and Ω. Note that the distribution q(s^m) is now parameterised by the recognition network parameters ω_m. For notational simplicity, we do not explicitly express this parameterisation of q(s^m) in our notations. Plugging the FHMM joint distribution in Equation (1) and the variational distribution in Equation (7) into Equation (4), the FHMM ELBO L(Γ, Ω) for the variational chains of bivariate Gaussian copulas is approximated as\n\nL(Γ, Ω) ≈ ∑_{t=Δt/2+1}^{T-Δt/2-1} [ ⟨log p(y_t|s^1_t, ..., s^M_t)⟩_q + ∑_{m=1}^{M} ( ⟨log p(s^m_t|s^m_{t-1})⟩_q + ⟨log q(s^m_t)⟩_q − ⟨log q(s^m_t, s^m_{t+1})⟩_q ) ]   (9)\n\nEquation (9) is only an approximation of the ELBO, as the variational distributions of the s^m_t close to the beginning and end of y cannot be computed using the recognition networks. Because of the approximation, the FHMM initial distribution ∏_{m=1}^{M} p(s^m_1) cannot be learned using our approach. However, it can be approximated by the stationary distributions of the transition matrices as T becomes large, assuming that the sequence is close to stationary [5]. Comparisons to SMF in our experiment results suggest that the error caused by the approximations is negligible.\nThe log-transition probability expectations and variational entropy in Equation (9) can be easily computed as they are simply sums over pairs of Bernoulli random variables. The expectations of the log-emission distributions can be efficiently computed for certain distributions, such as multinomial and multivariate Gaussian distributions. Detailed derivations of the expectation terms in the ELBO can be found in the Supplementary Material.\n\n3.3.1 Stochastic Gradient Descent & Subsampling Scheme\n\nWe propose to optimise Equation (9) with SGD by computing noisy unbiased gradients of the ELBO with respect to Γ and Ω based on contributions from subchains of length Δt + 1 randomly sampled from y [2, 12]. Multiple subchains can be sampled in each learning iteration to form a mini-batch of subchains, reducing the variance of the noisy gradients. Noisy gradients with high variance can cause the SGD algorithm to converge slowly or diverge [2]. The subchains should also be sampled randomly without replacement until all subchains in y are depleted, to speed up convergence. To ensure unbiasedness of the noisy gradients, the gradients computed in each iteration need to be multiplied by a batch factor\n\nc = (T − Δt) / n_minibatch   (10)\n\nwhere n_minibatch is the number of subchains in each mini-batch. The scaled noisy gradients can then be used by the SGD algorithm of choice to optimise L. 
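The subchain subsampling and the batch factor of Equation (10) can be sketched as follows (our illustration; function and variable names are assumptions):

```python
import random

def sample_minibatch(T, delta_t, n_minibatch, rng=None):
    """Draw subchain start indices without replacement and the batch factor c
    that rescales a mini-batch gradient into an unbiased full-sequence gradient.
    Each subchain covers time steps [t0, t0 + delta_t]."""
    rng = rng or random.Random()
    starts = rng.sample(range(T - delta_t), n_minibatch)
    c = (T - delta_t) / n_minibatch  # Equation (10)
    return starts, c

starts, c = sample_minibatch(T=1_000_000, delta_t=20, n_minibatch=10,
                             rng=random.Random(0))
```

A gradient computed on the mini-batch's terms of Equation (9) would then be multiplied by c before the SGD update, so that its expectation matches the full-sequence gradient.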
In our implementation of the algorithm, gradients are computed using a Python automatic differentiation tool [17] and the optimisation is performed using RMSprop [22].\n\n4 Related Work\n\nCopulas have previously been adopted in the variational inference literature as a tool to model posterior dependency in models with an i.i.d. data assumption [23, 10]. However, the previously proposed approaches cannot be directly applied to the HMM class of models without addressing parameter estimation issues, as the dimensionality of the posterior distributions grows with the length of the sequences. The proposed formulation of the variational distribution circumvents the problem by exploiting the chain structure of the model, coupling only random variables within the same chain that are adjacent in time with a bivariate Gaussian-Bernoulli copula, leading to a coherent chain of bivariate Gaussian copulas as the variational distribution.\nOn the other hand, a stochastic variational inference algorithm that also aims to scale the HMM class of models to long sequences has previously been proposed in [5]. Our proposed algorithm differs from the existing approach in that it does not require explicit message passing to perform inference and learning. Applying the algorithm proposed in [5] to FHMMs requires multiple message passing iterations to determine the buffer length of each subchain in the mini-batch of data, and the procedure needs to be repeated for each FHMM Markov chain. The message passing routines can be expensive as the number of Markov chains grows. In contrast, the proposed recognition network approach eliminates the need for iterative message passing and allows the variational distributions to be learned directly from the data using gradient descent. 
The use of recognition networks also allows fast inference at run-time with modern parallel computing hardware.\nThe use of recognition networks as inference devices for graphical models has received much research interest recently because of its scalability and simplicity. Similar to our approach, the algorithms proposed in [4, 13] also make use of recognition networks for inference, but still rely on message passing to perform certain computations. In addition, [1] proposed an inference algorithm for state space models using a recognition network. However, that algorithm cannot be applied to models with non-Gaussian posteriors.\nFinally, the proposed algorithm is analogous to composite likelihood algorithms for learning in HMMs, in that the data dependency is broken up according to subchains to allow tractable computations [6]. The EM-composite likelihood algorithm in [6] partitions the likelihood function according to subchains, bounding each subchain separately with a different posterior distribution that uses only the data in that subsequence. Our recognition models generalize this approach.\n\n5 Experiments\n\nWe evaluate the validity of our algorithm and the scalability claim with experiments using real and simulated data. To validate the algorithm, we learn FHMMs on simulated and real data using the proposed algorithm and the existing SMF-EM algorithm. The models learned using the two approaches are compared with log-likelihood (LL). In addition, we compare the learned FHMM parameters to the parameters used to simulate the data. The validation experiments ensure that the proposed approach does not introduce further approximation bias compared to SMF.\nTo verify the scalability claim, we compare the LL of FHMMs with different numbers of hidden chains learned on simulated sequences of increasing length using the proposed and SMF-based EM algorithms. 
Two sets of experiments are conducted to showcase scalability with respect to sequence length and the number of hidden Markov chains. To simulate real-world scenarios where the computing budget is constrained, both algorithms are given the same fixed computing budget. The learned FHMMs are compared after the computing budget is depleted. Finally, we demonstrate the scalability of the proposed algorithm by learning a FHMM with 10 binary hidden Markov chains on a long time series recorded in a real-world scenario.\n\n5.1 Algorithm Validation\n\nSimulated Data We simulate a 1,000-timestep-long 2-dimensional sequence from a FHMM with 2 hidden binary chains and Gaussian emissions, and attempt to recover the true model parameters with the proposed approach. The simulation procedure is detailed in the Supplementary Material. The proposed algorithm successfully recovers the true model parameters from the simulated data. The LL of the learned model also compares favorably to a FHMM learned using the SMF-EM algorithm, showing no visible further bias compared to the proven SMF-EM algorithm. The LL of the proposed algorithm and SMF-EM are shown in Table 1. The learned emission parameters, together with the training data, are visualised in Figure 1.\n\nBach Chorales Data Set [16] Following the experiment in [8], we compare the proposed algorithm to SMF-EM based on LL. The training and testing data consist of 30 and 36 sequences from the Bach Chorales data set respectively. FHMMs with various numbers of binary hidden Markov chains are learned from the training data with both algorithms. The log-likelihoods, tabulated in Table 1, show that the proposed algorithm is competitive with SMF-EM on a real data set in which the FHMM is proven to be a good model, and show no further bias. 
Note that the training log-likelihood of the FHMM with 8 chains trained using the proposed algorithm is smaller than that of the FHMM with 7 chains, showing that the proposed algorithm can be trapped in bad local minima.\n\n5.2 Scalability Verification\n\nSimulated Data This experiment consists of two parts to verify scalability with respect to sequence length and state space size. In the first component, we simulate 2-dimensional sequences of varying length from a FHMM with 4 binary chains using an approach similar to the validation experiment. Given a fixed computing budget of 2 hours per sequence on a 24-core Intel i7 workstation, both SMF-EM and the proposed algorithm attempt to fit 4-chain FHMMs to the sequences. Two testing sequences of length 50,000 are also simulated from the same model. In the second component, we keep the sequence length at 15,000 and attempt to learn FHMMs with various numbers of chains with a computing budget of 1,000 s. The computing budget in the second component is scaled according to the sequence length. Log-likelihoods are computed with the last available learned parameters after the computing time runs out. The proposed algorithm is competitive with SMF-EM when the sequences are shorter and the state space is smaller, and outperforms SMF-EM on longer sequences and larger state spaces. The results in Figure 2 and Figure 3 both show increasing gaps in the log-likelihoods as the sequence length and state space size increase. The recognition networks in the experiments have 1 hidden layer with 30 tanh hidden units and a rolling window size of 5. 
The marginal and correlation recognition networks for latent variables in the same FHMM Markov chain share hidden units to reduce memory and computing requirements as the number of Markov chains increases.\n\nHousehold Power Consumption Data Set [16] We demonstrate the applicability of our algorithm to long sequences for which learning with SMF-EM using the full data set is simply intractable. The power consumption data set consists of a 9-dimensional sequence of 2,075,259 time steps. After dropping the date/time series and the current intensity series, which is highly correlated with the power consumption series, we keep the first 10^6 data points of the remaining 6-dimensional sequence for training and set aside the remaining series as test data. A FHMM with 10 hidden Markov chains is learned on the training data using the proposed algorithm. In this particular problem, we force all 20 recognition networks in our algorithm to share a common tanh hidden layer of 200 units. The rolling window size is set to 21 and we allow the algorithm to complete 150,000 SGD iterations with 10 subchains per iteration before terminating. For comparison, we also learned the 10-chain FHMM with SMF-EM on the last 5,000 data points of the training data. The models learned with the proposed algorithm and SMF-EM are compared based on the Mean Squared Error (MSE) of the smoothed test data (i.e., the learned emission means weighted by the latent variable posteriors). As shown in Table 2, the test MSEs of the proposed algorithm are lower than those of the SMF-EM algorithm in all data dimensions. The result shows that learning with more data is indeed advantageous, and the proposed algorithm allows FHMMs to take advantage of the large data set.\n\nFigure 1: Simulated data in the validation experiments with the emission parameters from simulation (red), learned by the proposed algorithm (green) and SMF-EM (blue). 
The emission means are depicted as stars and standard deviations as elliptical contours at 1 standard deviation.

Figure 2: The red and blue lines show the train (solid) and test (dashed) LL results from the proposed and SMF-EM algorithms in the scalability experiments as the sequence length (x-axis) increases. Both algorithms are given a 2-hour computing budget per data set. SMF-EM failed to complete a single iteration for the length of 150,000.

Figure 3: The red and blue lines show the train (solid) and test (dashed) LL results from the proposed and SMF-EM algorithms in the scalability experiments as the number of hidden Markov chains (x-axis) increases. Both algorithms are given a 1,000s computing budget per data set.

                  Proposed Algo.       SMF
nchain            LLtrain  LLtest   LLtrain  LLtest

Simulated Data
2                 -2.320   -2.332   -2.315   -2.338

Bach Chorales
2                 -7.241   -7.908   -7.172   -7.869
3                 -6.627   -7.306   -6.754   -7.489
4                 -6.365   -7.322   -6.409   -7.282
5                 -6.135   -6.947   -5.989   -7.174
6                 -5.973   -6.716   -5.852   -7.008
7                 -5.754   -6.527   -5.771   -6.664
8                 -5.836   -6.722   -5.675   -6.697

Table 1: LL from the validation experiments. The results demonstrate that the proposed algorithm is competitive with SMF. A plot of the Bach chorales LL is available in the Supplementary Material.

6 Conclusions

Dim.   MSE_SMF   MSE_Proposed
1      0.155     0.082
2      0.084     0.055
3      0.079     0.027
4      0.466     0.145
5      0.121     0.062
6      0.202     0.145

Table 2: Test MSEs of the SMF-EM and the proposed algorithm for each dimension in the household power consumption data set. The results show that the proposed algorithm is able to take advantage of the full data set to learn a better model because of its scalability. 
Plots of the fitted and observed data are available in the Supplementary Material.

We propose a novel stochastic variational inference and learning algorithm that does not rely on message passing to scale FHMMs to long sequences and large state spaces. The proposed algorithm achieves results competitive with structured mean-field on short sequences, and outperforms structured mean-field on longer sequences under a fixed computing budget that resembles a real-world model deployment scenario. The applicability of the algorithm to long sequences where the structured mean-field algorithm is infeasible is also demonstrated. In conclusion, we believe that the proposed scalable algorithm will open up new opportunities to apply FHMMs to long sequential data with rich structures that could not previously be modelled using existing algorithms.

References

[1] Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.

[2] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[3] Gal Elidan. Copulas in machine learning. In Piotr Jaworski, Fabrizio Durante, and Wolfgang Karl Härdle, editors, Copulae in Mathematical and Quantitative Finance, Lecture Notes in Statistics, pages 39–60. Springer Berlin Heidelberg, 2013.

[4] Kai Fan, Chunyuan Li, and Katherine Heller. A unifying variational inference framework for hierarchical graph-coupled HMM with an application to influenza infection. 2016.

[5] Nicholas Foti, Jason Xu, Dillon Laird, and Emily Fox. Stochastic variational inference for hidden Markov models. In Advances in Neural Information Processing Systems, pages 3599–3607, 2014.

[6] Xin Gao and Peter X-K Song. 
Composite likelihood EM algorithm with applications to multivariate hidden Markov model. Statistica Sinica, pages 165–185, 2011.

[7] Samuel J Gershman and Noah D Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the 36th Annual Conference of the Cognitive Science Society, 2014.

[8] Zoubin Ghahramani and Michael I Jordan. Factorial hidden Markov models. Machine Learning, 29(2-3):245–273, 1997.

[9] Prem K Gopalan and David M Blei. Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences, 110(36):14534–14539, 2013.

[10] Shaobo Han, Xuejun Liao, David B Dunson, and Lawrence Carin. Variational Gaussian copula inference. arXiv preprint arXiv:1506.05860, 2015.

[11] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.

[12] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[13] Matthew J Johnson, David Duvenaud, Alexander B Wiltschko, Sandeep R Datta, and Ryan P Adams. Composing graphical models with neural networks for structured representations and fast inference. arXiv preprint arXiv:1603.06277, 2016.

[14] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[15] Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 157–224, 1988.

[16] M. Lichman. UCI machine learning repository, 2013.

[17] Dougal Maclaurin, David Duvenaud, Matthew Johnson, and Ryan P. Adams. Autograd: Reverse-mode differentiation of native Python, 2015.

[18] Roger B Nelsen. 
An introduction to copulas, volume 139. Springer Science & Business Media, 2013.

[19] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[20] Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048–3056, 2013.

[21] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, volume 19, 2007.

[22] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.

[23] Dustin Tran, David Blei, and Edo M Airoldi. Copula variational inference. In Advances in Neural Information Processing Systems, pages 3550–3558, 2015.

[24] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.