{"title": "Spectral Learning of Large Structured HMMs for Comparative Epigenomics", "book": "Advances in Neural Information Processing Systems", "page_first": 469, "page_last": 477, "abstract": "We develop a latent variable model and an efficient spectral algorithm motivated by the recent emergence of very large data sets of chromatin marks from multiple human cell types. A natural model for chromatin data in one cell type is a Hidden Markov Model (HMM); we model the relationship between multiple cell types by connecting their hidden states by a fixed tree of known structure. The main challenge with learning parameters of such models is that iterative methods such as EM are very slow, while naive spectral methods result in time and space complexity exponential in the number of cell types. We exploit properties of the tree structure of the hidden states to provide spectral algorithms that are more computationally efficient for current biological datasets. We provide sample complexity bounds for our algorithm and evaluate it experimentally on biological data from nine human cell types. Finally, we show that beyond our specific model, some of our algorithmic ideas can be applied to other graphical models.", "full_text": "Spectral Learning of Large Structured HMMs for\n\nComparative Epigenomics\n\nChicheng Zhang\nUC San Diego\n\nJimin Song\n\nRutgers University\n\nchz038@eng.ucsd.edu\n\nsong@dls.rutgers.edu\n\nKevin C Chen\n\nRutgers University\n\nKamalika Chaudhuri\n\nUC San Diego\n\nkcchen@dls.rutgers.edu\n\nkamalika@eng.ucsd.edu\n\nAbstract\n\nWe develop a latent variable model and an ef\ufb01cient spectral algorithm motivated\nby the recent emergence of very large data sets of chromatin marks from multiple\nhuman cell types. 
A natural model for chromatin data in one cell type is a Hidden\nMarkov Model (HMM); we model the relationship between multiple cell types by\nconnecting their hidden states by a \ufb01xed tree of known structure.\nThe main challenge with learning parameters of such models is that iterative meth-\nods such as EM are very slow, while naive spectral methods result in time and\nspace complexity exponential in the number of cell types. We exploit properties\nof the tree structure of the hidden states to provide spectral algorithms that are\nmore computationally ef\ufb01cient for current biological datasets. We provide sample\ncomplexity bounds for our algorithm and evaluate it experimentally on biological\ndata from nine human cell types. Finally, we show that beyond our speci\ufb01c model,\nsome of our algorithmic ideas can be applied to other graphical models.\n\n1\n\nIntroduction\n\nIn this paper, we develop a latent variable model and ef\ufb01cient spectral algorithm motivated by the\nrecent emergence of very large data sets of chromatin marks from multiple human cell types [7, 9].\nChromatin marks are chemical modi\ufb01cations on the genome which are important in many basic\nbiological processes. After standard preprocessing steps, the data consists of a binary vector (one\nbit for each chromatin mark) for each position in the genome and for each cell type.\nA natural model for chromatin data in one cell type is a Hidden Markov Model (HMM) [8, 13],\nfor which ef\ufb01cient spectral algorithms are known. On biological data sets, spectral algorithms have\nbeen shown to have several practical advantages over maximum likelihood-based methods, including\nspeed, prediction accuracy and biological interpretability [24]. Here we extend the approach by\nmodeling multiple cell types together. We model the relationships between cell types by connecting\ntheir hidden states by a \ufb01xed tree, the standard model in biology for relationships between cell\ntypes. 
This comparative approach leverages the information shared between the different data sets in a statistically unified and biologically motivated manner.\nFormally, our model is an HMM where the hidden state $z_t$ at time $t$ has a structure represented by a tree graphical model of known structure. For each tree node $u$ we can associate an individual hidden state $z^u_t$ that depends not only on the previous hidden state $z^u_{t-1}$ for the same tree node $u$ but also on the individual hidden state of its parent node. Additionally, there is an observation variable $x^u_t$ for each node $u$, and the observation $x^u_t$ is independent of other state and observation variables conditioned on the hidden state variable $z^u_t$. In the bioinformatics literature, [5] studied this model with the additional constraint that all tree nodes share the same emission parameters. In biological applications, the main outputs of interest are the learned observation matrices of the HMM and a segmentation of the genome into regions which can be used for further studies.\nA standard approach to unsupervised learning of HMMs is the Expectation-Maximization (EM) algorithm. When applied to HMMs with very large state spaces, EM is very slow. A recent line of work on spectral learning [18, 1, 23, 6] has produced much more computationally efficient algorithms for learning many graphical models under certain mild conditions, including HMMs. However, a naive application of these algorithms to HMMs with large state spaces results in computational complexity exponential in the size of the underlying tree.\nHere we exploit properties of the tree structure of the hidden states to provide spectral algorithms that are more computationally efficient for current biological datasets. This is achieved by three novel key ideas. 
Our \ufb01rst key idea is to show that we can treat each root-to-leaf path in the tree\nseparately and learn its parameters using tensor decomposition methods. This step improves the\nrunning time because our trees typically have very low depth. Our second key idea is a novel tensor\nsymmetrization technique that we call Skeletensor construction where we avoid constructing the full\ntensor over the entire root-to-leaf path. Instead we use carefully designed symmetrization matrices\nto reveal its range in a Skeletensor which has dimension equal to that of a single tree node. The third\nand \ufb01nal key idea is called Product Projections, where we exploit the independence of the emission\nmatrices along the root-to-leaf path conditioned on the hidden states to avoid constructing the full\ntensors and instead construct compressed versions of the tensors of dimension equal to the number\nof hidden states, not the number of observations. Beyond our speci\ufb01c model, we also show that\nProduct Projections can be applied to other graphical models and thus we contribute a general tool\nfor developing ef\ufb01cient spectral algorithms.\nFinally we implement our algorithm and evaluate it on biological data from nine human cell types\n[7]. We compare our results with the results of [5] who used a variational EM approach. We also\ncompare with spectral algorithms for learning HMMs for each cell type individually to assess the\nvalue of the tree model.\n\n1.1 Related Work\n\nThe \ufb01rst ef\ufb01cient spectral algorithm for learning HMM parameters was due to [18]. There has been\nan explosion of follow-up work on spectral algorithms for learning the parameters and structure of\nlatent variable models [23, 6, 4]. [18] gives a spectral algorithm for learning an observable operator\nrepresentation of an HMM under certain rank conditions.\n[23] and [3] extend this algorithm to\nthe case when the transition matrix and the observation matrix respectively are rank-de\ufb01cient. 
[19] extends [18] to Hidden Semi-Markov Models.\n[2] gives a general spectral algorithm for learning parameters of latent variable models that have a multi-view structure \u2013 there is a hidden node and three or more observable nodes that are not connected to any other nodes and are independent conditioned on the hidden node. Many latent variable models have this structure, including HMMs, tree graphical models, topic models and mixture models. [1] provides a simpler, more robust algorithm that involves decomposing a third order tensor. [21, 22, 25] provide algorithms for learning latent trees and latent junction trees.\nSeveral algorithms have been designed for learning HMM parameters for chromatin modeling, including stochastic variational inference [16] and contrastive learning of two HMMs [26]. However, none of these methods extend directly to modeling multiple chromatin sequences simultaneously.\n\n2 The Model\n\nProbabilistic Model. The natural probabilistic model for a single epigenomic sequence is a hidden Markov model (HMM), where time corresponds to position in the sequence. The observation at time $t$ is the sequence value at position $t$, and the hidden state at $t$ is the regulatory function in this position.\n\nFigure 1: Left: A tree $T$ with 3 nodes $V = \\{r, u, v\\}$. Right: An HMM whose hidden state has structure $T$.\n\nIn comparative epigenomics, the goal is to jointly model epigenomic sequences from multiple species or cell-types. This is done by an HMM with a tree-structured hidden state [5] (THS-HMM),$^1$ where each node in the tree representing the hidden state has a corresponding observation node. Formally, we represent the model by a tuple $\\mathcal{H} = (G, \\mathcal{O}, \\mathcal{T}, \\mathcal{W})$; Figure 1 shows a pictorial representation.\n$G = (V, E)$ is a directed tree with known structure whose nodes represent individual cell-types or species. The hidden state $z_t$ and the observation $x_t$ are represented by vectors $\\{z^u_t\\}$ and $\\{x^u_t\\}$ indexed by nodes $u \\in V$. If $(v, u) \\in E$, then $v$ is the parent of $u$, denoted by $\\pi(u)$; if $v$ is a parent of $u$, then for all $t$, $z^v_t$ is a parent of $z^u_t$. In addition, the observations have the following product structure: if $u' \\neq u$, then conditioned on $z^u_t$, the observation $x^u_t$ is independent of $z^{u'}_t$ and $x^{u'}_t$ as well as any $z^{u'}_{t'}$ and $x^{u'}_{t'}$ for $t \\neq t'$.\n$\\mathcal{O}$ is a set of observation matrices $O^u = P(x^u_t | z^u_t)$ for each $u \\in V$ and $\\mathcal{T}$ is a set of transition tensors $T^u = P(z^u_{t+1} | z^u_t, z^{\\pi(u)}_{t+1})$ for each $u \\in V$. Finally, $\\mathcal{W}$ is the set of initial distributions where $W^u = P(z^u_1 | z^{\\pi(u)}_1)$ for each $z^u_1$.\nGiven a tree structure and a number of iid observation sequences corresponding to each node of the tree, our goal is to determine the parameters of the underlying THS-HMM and then use these parameters to infer the most likely regulatory function at each position in the sequences.\nBelow we use the notation $D$ to denote the number of nodes in the tree and $d$ to denote its depth. For typical epigenomic datasets, $D$ is small to moderate (5-50) while $d$ is very small (2 or 3) as it is difficult to obtain data with large $d$ experimentally. Typically $m$, the number of possible values assumed by the hidden state at a single node, is about 6-25, while $n$, the number of possible observation values assumed by a single node, is much larger (e.g. 256 in our dataset).\n\nTensors. An order-3 tensor $M \\in \\mathbb{R}^{n_1} \\otimes \\mathbb{R}^{n_2} \\otimes \\mathbb{R}^{n_3}$ is a 3-dimensional array with $n_1 n_2 n_3$ entries, with its $(i_1, i_2, i_3)$-th entry denoted as $M_{i_1,i_2,i_3}$.\nGiven $n_i \\times 1$ vectors $v_i$, $i = 1, 2, 3$, their tensor product, denoted by $v_1 \\otimes v_2 \\otimes v_3$, is the $n_1 \\times n_2 \\times n_3$ tensor whose $(i_1, i_2, i_3)$-th entry is $(v_1)_{i_1} (v_2)_{i_2} (v_3)_{i_3}$. 
A tensor that can be expressed as the tensor product of a set of vectors is called a rank 1 tensor. A tensor $M$ is symmetric if and only if for any permutation $\\pi : [3] \\to [3]$, $M_{i_1,i_2,i_3} = M_{i_{\\pi(1)},i_{\\pi(2)},i_{\\pi(3)}}$.\nLet $M \\in \\mathbb{R}^{n_1} \\otimes \\mathbb{R}^{n_2} \\otimes \\mathbb{R}^{n_3}$. If $V_i \\in \\mathbb{R}^{n_i \\times m_i}$, then $M(V_1, V_2, V_3)$ is a tensor of size $m_1 \\times m_2 \\times m_3$, whose $(i_1, i_2, i_3)$-th entry is: $M(V_1, V_2, V_3)_{i_1,i_2,i_3} = \\sum_{j_1,j_2,j_3} M_{j_1,j_2,j_3} (V_1)_{j_1,i_1} (V_2)_{j_2,i_2} (V_3)_{j_3,i_3}$.\nSince a matrix is an order-2 tensor, we also use the following shorthand to denote matrix multiplication. Let $M \\in \\mathbb{R}^{n_1} \\otimes \\mathbb{R}^{n_2}$. If $V_i \\in \\mathbb{R}^{n_i \\times m_i}$, then $M(V_1, V_2)$ is a matrix of size $m_1 \\times m_2$, whose $(i_1, i_2)$-th entry is: $M(V_1, V_2)_{i_1,i_2} = \\sum_{j_1,j_2} M_{j_1,j_2} (V_1)_{j_1,i_1} (V_2)_{j_2,i_2}$. This is equivalent to $V_1^{\\top} M V_2$.\n\n$^1$ In the bioinformatics literature, this model is also known as a tree HMM.\n\nMeta-States and Observations, Co-occurrence Matrices and Tensors. Given observations $x^u_t$ and $x^u_{t'}$ at a single node $u$, we use the notation $P^{u,u}_{t,t'}$ to denote their expected co-occurrence frequencies: $P^{u,u}_{t,t'} = E[x^u_t \\otimes x^u_{t'}]$, and $\\hat{P}^{u,u}_{t,t'}$ to denote their corresponding empirical version. The tensor $P^{u,u,u}_{t,t',t''} = E[x^u_t \\otimes x^u_{t'} \\otimes x^u_{t''}]$ and its empirical version $\\hat{P}^{u,u,u}_{t,t',t''}$ are defined similarly.\nOccasionally, we will consider the states or observations corresponding to a subset of nodes in $G$ coalesced into a single meta-state or meta-observation. Given a connected subset $S \\subseteq V$ of nodes in the tree $G$ that includes the root, we use the notation $z^S_t$ and $x^S_t$ to denote the meta-state represented by $(z^u_t, u \\in S)$ and the meta-observation represented by $(x^u_t, u \\in S)$ respectively. We define the observation matrix for $S$ as $O^S = P(x^S_t | z^S_t) \\in \\mathbb{R}^{n^{|S|} \\times m^{|S|}}$ and the transition matrix for $S$ as $T^S = P(z^S_{t+1} | z^S_t) \\in \\mathbb{R}^{m^{|S|} \\times m^{|S|}}$, respectively.\nFor sets of nodes $V_1$ and $V_2$, we use the notation $P^{V_1,V_2}_{t,t'}$ to denote the expected co-occurrence frequencies of the meta-observations $x^{V_1}_t$ and $x^{V_2}_{t'}$. Its empirical version is denoted by $\\hat{P}^{V_1,V_2}_{t,t'}$. Similarly, we can define the notation $P^{V_1,V_2,V_3}_{t,t',t''}$ and its empirical version $\\hat{P}^{V_1,V_2,V_3}_{t,t',t''}$.\n\nBackground on Spectral Learning for Latent Variable Models. Recent work by [1] has provided a novel elegant tensor decomposition method for learning latent variable models. Applied to HMMs, the main idea is to decompose a transformed version of the third order co-occurrence tensor of the first three observations to recover the parameters; [1] shows that given enough samples and under fairly mild conditions on the model, this provides an approximation to the globally optimal solution. The algorithm has three main steps. First, the third order tensor of the co-occurrences is symmetrized using the second order co-occurrence matrices to yield a symmetric tensor; this symmetric tensor is then orthogonalized by a whitening transformation. Finally, the resultant symmetric orthogonal tensor is decomposed via the tensor power method.\nIn biological applications, instead of multiple independent sequences, we have a single long sequence in the steady state. In this case, following ideas from [23], we use the average over $t$ of the third order co-occurrence tensors of three consecutive observations starting at time $t$. 
The second order co-occurrence tensor is also modified similarly.\n\n3 Algorithm\n\nA naive approach for learning parameters of HMMs with tree-structured hidden states is to directly apply the spectral method of [1]. Since this method ignores the structure of the hidden state, its running time is very high, $\\Omega(n^D m^D)$, even with optimized implementations. This motivates the design of more computationally efficient approaches.\nA plausible approach is to observe that at $t = 1$, the observations are generated by a tree graphical model; thus in principle one could learn the parameters of the underlying tree using existing algorithms [22, 21, 25]. However, this approach does not directly produce the HMM parameters; it also does not work for biological sequences because we do not have multiple independent samples at $t = 1$; instead we have a single long sequence at the steady state, and the steady state distribution of observations is not generated by a latent tree. Another plausible approach is to use the spectral junction tree algorithm of [25]; however, this algorithm does not provide the actual transition and observation matrix parameters which hold important biological information, and instead provides an observable operator representation.\nOur main contribution is to show that we can achieve a much better running time by exploiting the structure of the hidden state. Our algorithm is based on three key ideas \u2013 Partitioning, Skeletensor Construction and Product Projections. We explain these ideas next.\n\nPartitioning. Our first observation is that to learn the parameters at a node $u$, we can focus only on the unique path from the root to $u$. Thus we partition the learning problem on the tree into separate learning problems on these paths. 
This maintains correctness as proved in the Appendix.\nThe Partitioning step reduces the computational complexity since we now need to learn an HMM with $m^d$ states and $n^d$ observations, instead of the naive method where we learn an HMM with $m^D$ states and $n^D$ observations. As $d \\ll D$ in biological data, this gives us significant savings.\n\nConstructing the Skeletensor. A naive way to learn the parameters of the HMM corresponding to each root-to-node path is to work directly on the $O(n^d \\times n^d \\times n^d)$ co-occurrence tensor. Instead, we show that for each node $u$ on a root-to-node path, a novel symmetrization method can be used to construct a much smaller skeleton tensor $T^u$ of size $n \\times n \\times n$, which nevertheless captures the effect of the entire root-to-node path and projects it into the skeleton tensor, thus revealing the range of $O^u$. We call this the skeletensor.\nLet $H_u$ be the path from the root to a node $u$, and let $\\hat{P}^{H_u,u,H_u}_{1,2,3}$ be the empirical $n^{|H_u|} \\times n \\times n^{|H_u|}$ tensor of co-occurrences of the meta-observations $H_u$, $u$ and $H_u$ at times 1, 2 and 3 respectively. Based on the data we construct the following symmetrization matrices:\n\n$S_1 \\sim \\hat{P}^{u,H_u}_{2,3} (\\hat{P}^{H_u,H_u}_{1,3})^{\\dagger}, \\quad S_3 \\sim \\hat{P}^{u,H_u}_{2,1} (\\hat{P}^{H_u,H_u}_{3,1})^{\\dagger}$\n\nNote that $S_1$ and $S_3$ are $n \\times n^{|H_u|}$ matrices. Symmetrizing $\\hat{P}^{H_u,u,H_u}_{1,2,3}$ with $S_1$ and $S_3$ gives us an $n \\times n \\times n$ skeletensor, which can in turn be decomposed to give an estimate of $O^u$ (see Lemma 3 in the Appendix).\nEven though naively constructing the symmetrization matrices and skeletensor takes $O(N n^{2d+1} + n^{3d})$ time, this procedure improves computational efficiency because tensor construction is a one-time operation, while the power method which takes many iterations is carried out on a much smaller tensor.\n\nProduct Projections. We further reduce the computational complexity by using a novel algorithmic technique that we call Product Projections. The key observation is as follows. Let $H_u = \\{u_0, u_1, \\ldots, u_{d-1}\\}$ be any root-to-node path in the tree and consider the HMM that generates the observations $(x^{u_0}_t, x^{u_1}_t, \\ldots, x^{u_{d-1}}_t)$ for $t = 1, 2, \\ldots$. Even though the individual observations $x^{u_j}_t$, $j = 0, 1, \\ldots, d-1$ are highly dependent, the range of $O^{H_u}$, the emission matrix of the HMM describing the path $H_u$, is contained in the product of the ranges of $O^{u_j}$, where $O^{u_j}$ is the emission matrix at node $u_j$ (Lemma 4 in the Appendix). Furthermore, even though the $O^{u_j}$ matrices are difficult to find, their ranges can be determined by computing the SVDs of the observation co-occurrence matrices at $u_j$.\nThus we can implicitly construct and store (an estimate of) the range of $O^{H_u}$. This also gives us estimates of the range of $\\hat{P}^{H_u,H_u}_{1,3}$, the column spaces of $\\hat{P}^{u,H_u}_{2,1}$ and $\\hat{P}^{u,H_u}_{2,3}$, and the range of the first and third modes of the tensor $\\hat{P}^{H_u,u,H_u}_{1,2,3}$. Therefore during skeletensor construction we can avoid explicitly constructing $S_1$, $S_3$ and $\\hat{P}^{H_u,u,H_u}_{1,2,3}$, and instead construct their projections onto their ranges. This reduces the time complexity of the skeletensor construction step to $O(N m^{2d+1} + m^{3d} + d m n^2)$ (recall that the range has dimension $m$). While the number of hidden states $m$ could be as high as $n$, this is a significant gain in practice, as $n \\gg m$ in biological datasets (e.g. 256 observations vs. 6 hidden states).\nProduct projections are more efficient than random projections [17] on the co-occurrence matrix of meta-observations: the co-occurrence matrices are $n^d \\times n^d$ matrices, and random projections would take $\\Omega(n^d)$ time. Also, product projections differ from the suggestion of [15] since we exploit properties of the model to efficiently find good projections.\nThe Product Projections technique is a general technique with applications beyond our model. Some examples are provided in Appendix C.3.\n\n3.1 The Full Algorithm\n\nOur final algorithm follows from combining the three key ideas above. Algorithm 1 shows how to recover the observation matrices $O^u$ at each node $u$. Once the $O^u$s are recovered, one can use standard techniques to recover $T$ and $W$; details are described in Algorithm 2 in the Appendix.\n\nAlgorithm 1 Algorithm for Observation Matrix Recovery\n1: Input: $N$ samples of the three consecutive observations $(x_1, x_2, x_3)_{i=1}^{N}$ generated by an HMM with tree structured hidden state with known tree structure.\n2: for $u \\in V$ do\n3:   Perform SVD on $\\hat{P}^{u,u}_{1,2}$ to get the first $m$ left singular vectors $\\hat{U}^u$.\n4: end for\n5: for $u \\in V$ do\n6:   Let $H_u$ denote the set of nodes on the unique path from root $r$ to $u$. Let $\\hat{U}^{H_u} = \\otimes_{v \\in H_u} \\hat{U}^v$.\n7:   Construct Projected Skeletensor. First, compute symmetrization matrices:\n       $\\hat{S}^u_1 = ((\\hat{U}^u)^{\\top} \\hat{P}^{u,H_u}_{2,3} \\hat{U}^{H_u}) ((\\hat{U}^{H_u})^{\\top} \\hat{P}^{H_u,H_u}_{1,3} \\hat{U}^{H_u})^{-1}$\n       $\\hat{S}^u_3 = ((\\hat{U}^u)^{\\top} \\hat{P}^{u,H_u}_{2,1} \\hat{U}^{H_u}) ((\\hat{U}^{H_u})^{\\top} \\hat{P}^{H_u,H_u}_{3,1} \\hat{U}^{H_u})^{-1}$\n8:   Compute symmetrized second and third co-occurrences for $u$:\n       $\\hat{M}^u_2 = (\\hat{P}^{H_u,u}_{1,2}(\\hat{U}^{H_u} (\\hat{S}^u_1)^{\\top}, \\hat{U}^u) + \\hat{P}^{H_u,u}_{1,2}(\\hat{U}^{H_u} (\\hat{S}^u_1)^{\\top}, \\hat{U}^u)^{\\top})/2$\n       $\\hat{M}^u_3 = \\hat{P}^{H_u,u,H_u}_{1,2,3}(\\hat{U}^{H_u} (\\hat{S}^u_1)^{\\top}, \\hat{U}^u, \\hat{U}^{H_u} (\\hat{S}^u_3)^{\\top})$\n9:   Orthogonalization and Tensor Decomposition. Orthogonalize $\\hat{M}^u_3$ using $\\hat{M}^u_2$ and decompose to recover $(\\hat{\\theta}^u_1, \\ldots, \\hat{\\theta}^u_m)$ as in [1] (See Algorithm 3 in the Appendix for details).\n10:  Undo Projection onto Range. Estimate $O^u$ as: $\\hat{O}^u = \\hat{U}^u \\hat{\\Theta}^u$, where $\\hat{\\Theta}^u = (\\hat{\\theta}^u_1, \\ldots, \\hat{\\theta}^u_m)$.\n11: end for\n\n3.2 Performance Guarantees\n\nWe now provide performance guarantees on our algorithm. Since learning parameters of HMMs and many other graphical models is NP-Hard, spectral algorithms make simplifying assumptions on the properties of the model generating the data. Typically these assumptions take the form of some conditions on the rank of certain parameter matrices. We state below the conditions needed for our algorithm to successfully learn parameters of an HMM with tree structured hidden states. Observe that we need two kinds of rank conditions \u2013 node-wise and path-wise \u2013 to ensure that we can recover the full set of parameters on a root-to-node path.\nAssumption 1 (Node-wise Rank Condition). For all $u \\in V$, the matrix $O^u$ has rank $m$, and the joint probability matrix $P^{u,u}_{2,1}$ has rank $m$.\nAssumption 2 (Path-wise Rank Condition). For any $u \\in V$, let $H_u$ denote the path from root to $u$. Then, the joint probability matrix $P^{H_u,H_u}_{1,2}$ has rank $m^{|H_u|}$.\n\nAssumption 1 is required to ensure that the skeletensor can be decomposed, and that $\\hat{U}^u$ indeed captures the range of $O^u$. 
Assumption 2 ensures that the symmetrization operation succeeds. This kind of assumption is very standard in spectral learning [18, 1].\n[3] has provided a spectral algorithm for learning HMMs involving fourth and higher order moments when Assumption 1 does not hold. We believe similar approaches will apply to our problem as well, and we leave this as an avenue for future work.\nIf Assumptions 1 and 2 hold, we can show that Algorithm 1 is consistent \u2013 provided enough samples are available, the model parameters learnt by the algorithm are close to the true model parameters. A finite sample guarantee is provided in the Appendix.\n\nTheorem 1 (Consistency). Suppose we run Algorithm 1 on the first three observation vectors $\\{x_{i,1}, x_{i,2}, x_{i,3}\\}$ from $N$ iid sequences generated by an HMM with tree-structured hidden states. Then, for all nodes $u \\in V$, the recovered estimates $\\hat{O}^u$ satisfy the following property: with high probability over the iid samples, there exists a permutation $\\Pi^u$ of the columns of $\\hat{O}^u$ such that $\\lVert O^u - \\Pi^u \\hat{O}^u \\rVert \\le \\varepsilon(N)$ where $\\varepsilon(N) \\to 0$ as $N \\to \\infty$.\nObserve that the observation matrices (as well as the transition and initial probabilities) are recovered up to permutations of hidden states in a globally consistent manner.\n\n4 Experiments\n\nData and experimental settings. We ran our algorithm, which we call \u201cSpectacle-Tree\u201d, on a chromatin dataset on human chromosome 1 from nine cell types (H1-hESC, GM12878, HepG2, HMEC, HSMM, HUVEC, K562, NHEK, NHLF) from the ENCODE project [7]. Following [5], we used a biologically motivated tree structure of a star tree with H1-hESC, the embryonic stem cell type, as the root. There are data for eight chromatin marks for each cell type which we preprocessed into binary vectors using a standard Poisson background assumption [11]. 
The chromosome is divided into 1,246,253 segments of length 200, following [11]. The observed data consists of a binary vector of length eight for each segment, so the number of possible observations is the number of all combinations of presence or absence of the chromatin marks (i.e. $n = 2^8 = 256$). We set the number of hidden states, which we interpret as chromatin states, to $m = 6$, similar to the choice of ENCODE. Our goals are to discover chromatin states corresponding to biologically important functional elements such as promoters and enhancers, and to label each chromosome segment with the most probable chromatin state.\nObserve that instead of the first few observations from $N$ iid sequences, we have a single long sequence in the steady state per cell type; thus, similar to [23], we calculate the empirical co-occurrence matrices and tensors used in the algorithm based on two and three successive observations respectively (so, more formally, instead of $\\hat{P}_{1,2}$, we use the average over $t$ of $\\hat{P}_{t,t+1}$ and so on). Additionally, we use a projection procedure similar to [4] for rounding negative entries in the recovered observation matrices. Our experiments reveal that the rank conditions appear to be satisfied for our dataset.\n\nRun time and memory usage comparisons. First, we flattened the HMM with tree-structured hidden states into an ordinary HMM with an exponentially larger state space. Our Python implementation of the spectral algorithm for HMMs of [18] ran out of memory while performing singular value decomposition on the co-occurrence matrix, even using sparse matrix libraries. This suggests that naive application of spectral HMM is not practical for biological data.\nNext we compared the performance of Spectacle-Tree to a similar model which additionally constrained all transition and observation parameters to be the same on each branch [5]. 
That work used several variational approximations to the EM algorithm and reported that SMF (structured mean field) performed the best in their tests. Although we implemented Spectacle-Tree in Matlab and did not optimize it for run-time efficiency, Spectacle-Tree took \u223c2 hr, whereas the SMF algorithm took \u223c13 hr for 13 iterations to convergence. This suggests that spectral algorithms may be much faster than variational EM for our model.\n\nBiological interpretation of the observation matrices. Having examined the efficiency of Spectacle-Tree, we next studied the accuracy of the learned parameters. We focused on the observation matrices which hold most of the interesting biological information. Since the full observation matrix is very large ($2^8 \\times 6$ where each row is a combination of chromatin marks), Figure 2 shows the $8 \\times 6$ marginal distribution of each chromatin mark conditioned on each hidden state. Spectacle-Tree identified most of the major types of functional elements typically discovered from chromatin data: repressive, strong enhancer, weak enhancer, promoter, transcribed region and background state (states 1-6, respectively, in Figure 2b). In contrast, the SMF algorithm used three out of the six states to model the large background state (i.e. the state with no chromatin marks). It identified repressive, transcribed and promoter states (states 2, 4, 5, respectively, in Figure 2a) but did not identify any enhancer states, which are one of the most interesting classes for further biological studies.\nWe believe these results are due to the fact that the background state in the data set is large: \u223c62% of the segments do not have chromatin marks for any cell type. The background state has lower biological interest but is modeled well by the maximum likelihood approach. 
In contrast, biologically interesting states such as promoters and enhancers comprise a relatively small fraction of the genome. We cannot simply remove background segments to make the classes balanced because it would change the length distribution of the hidden states. Finally, we observed that our model estimated significantly different parameters for each cell type which capture different chromatin states (Appendix Figure 3). For example, we found enhancer states with strong H3K27ac in all cell types except for H1-hESC, where both enhancer states (3 and 6) had low signal for this mark. This mark is known to be biologically important in these cells for distinguishing active from poised enhancers [10]. This suggests that modeling the additional branch-specific parameters can yield interesting biological insights.\n\nFigure 2: The compressed observation matrices for the GM12878 cell type estimated by the (a) SMF and (b) Spectacle-Tree algorithms. The hidden states are on the X axis.\n\nComparison of the chromosome segments labels. We computed the most probable state for each chromosome segment using a posterior decoding algorithm. We tested the accuracy of the predictions using an experimentally defined data set and compared it to SMF and the spectral algorithm for HMMs run for individual cell types without the tree (Spectral-HMM). Specifically we assessed promoter prediction accuracy (state 5 for SMF and state 4 for Spectacle-Tree in Figure 2) using CAGE data from [14] which was available for six of the nine cell types. We used the F1 score (harmonic mean of precision and recall) for comparison and found that Spectacle-Tree was much more accurate than SMF for all six cell types (Table 1). 
This was because the promoter predictions of SMF were biased towards the background state so those predictions had slightly higher recall but much lower specificity.\nFinally, we compared our predictions to Spectral-HMM to assess the value of the tree model. H1-hESC is the root node so Spectral-HMM and Spectacle-Tree have the same model and obtain the same accuracy (Table 1). Spectacle-Tree predicts promoters more accurately than Spectral-HMM for all other cell types except HepG2. However, HepG2 is the most diverged from the root among the cell types based on the Hamming distance between the chromatin marks. We hypothesize that for HepG2, the tree is not a good model which slightly reduces the prediction accuracy.\n\nCell type   SMF     Spectral-HMM   Spectacle-Tree\nH1-hESC     .0273   .1930          .1930\nGM12878     .0220   .1230          .1703\nHepG2       .0274   .1022          .0993\nHUVEC       .0275   .1221          .1621\nK562        .0255   .0964          .1966\nNHEK        .0287   .1528          .1719\n\nTable 1: F1 score for predicting promoters for six cell types. The highest F1 score for each cell type is emphasized in bold. Ground-truth labels for the other 3 cell-types are currently unavailable.\n\nOur experiments show that Spectacle-Tree has improved computational efficiency, biological interpretability and prediction accuracy on an experimentally-defined feature compared to variational EM for a similar tree HMM model and a spectral algorithm for single HMMs. A previous study showed improvements for spectral learning of single HMMs over the EM algorithm [24]. Thus our algorithms may be useful to the bioinformatics community in analyzing the large-scale chromatin data sets currently being produced.\n\nAcknowledgements. KC and CZ thank NSF under IIS 1162581 for research support.\n\nReferences\n[1] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. 
CoRR, abs/1210.7559, 2012.\n\n[2] Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden Markov models. CoRR, abs/1203.0683, 2012.\n\n[3] B. Balle, X. Carreras, F. Luque, and A. Quattoni. Spectral learning of weighted automata - A forward-backward perspective. Machine Learning, 96(1-2), 2014.\n\n[4] B. Balle, W. L. Hamilton, and J. Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In ICML, pages 1386\u20131394, 2014.\n\n[5] Jacob Biesinger, Yuanfeng Wang, and Xiaohui Xie. Discovering and mapping chromatin states using a tree hidden Markov model. BMC Bioinformatics, 14(Suppl 5):S4, 2013.\n\n[6] A. Chaganty and P. Liang. Estimating latent-variable graphical models using moments and likelihoods. In ICML, 2014.\n\n[7] ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57\u201374, 2012.\n\n[8] Jason Ernst and Manolis Kellis. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology, 28(8):817\u2013825, 2010.\n\n[9] Bernstein et al. The NIH Roadmap Epigenomics Mapping Consortium. Nature Biotechnology, 28:1045\u20131048, 2010.\n\n[10] Creyghton et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci, 107(50):21931\u201321936, 2010.\n\n[11] Ernst et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473:43\u201349, 2011.\n\n[12] Jun Zhu et al. Characterizing dynamic changes in the human blood transcriptional network. PLoS Comput Biol, 6:e1000671, 2010.\n\n[13] M. Hoffman et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods, 9(5):473\u2013476, 2012.\n\n[14] S. Djebali et al. Landscape of transcription in human cells. Nature, 2012.\n\n[15] D. Foster, J. 
Rodu, and L. Ungar. Spectral dimensionality reduction for HMMs. In CoRR, 2012.\n\n[16] N. Foti, J. Xu, D. Laird, and E. Fox. Stochastic variational inference for hidden Markov models. In NIPS, 2014.\n\n[17] N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53, 2011.\n\n[18] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.\n\n[19] I. Melnyk and A. Banerjee. A spectral algorithm for inference in hidden semi-Markov models. In AISTATS, 2015.\n\n[20] E. Mossel and S. Roch. Learning non-singular phylogenies and hidden Markov models. Ann. Appl. Probab., 16(2), 05 2006.\n\n[21] A. Parikh, L. Song, and E. P. Xing. A spectral algorithm for latent tree graphical models. In ICML, pages 1065\u20131072, 2011.\n\n[22] A. P. Parikh, L. Song, M. Ishteva, G. Teodoru, and E. P. Xing. A spectral algorithm for latent junction trees. In UAI, 2012.\n\n[23] S. Siddiqi, B. Boots, and G. Gordon. Reduced-rank hidden Markov models. In AISTATS, 2010.\n\n[24] J. Song and K. C. Chen. Spectacle: fast chromatin state annotation using spectral learning. Genome Biology, 16:33, 2015.\n\n[25] L. Song, M. Ishteva, A. P. Parikh, E. P. Xing, and H. Park. Hierarchical tensor decomposition of latent tree graphical models. In ICML, 2013.\n\n[26] J. Zou, D. Hsu, D. Parkes, and R. Adams. Contrastive learning using spectral methods. In NIPS, 2013.\n", "award": [], "sourceid": 345, "authors": [{"given_name": "Chicheng", "family_name": "Zhang", "institution": "UC San Diego"}, {"given_name": "Jimin", "family_name": "Song", "institution": "Rutgers"}, {"given_name": "Kamalika", "family_name": "Chaudhuri", "institution": "UCSD"}, {"given_name": "Kevin", "family_name": "Chen", "institution": "Rutgers"}]}