{"title": "Hidden Markov Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 501, "page_last": 507, "abstract": null, "full_text": "Hidden Markov decision trees \n\nMichael I. Jordan*, Zoubin Ghahramanit , and Lawrence K. Saul* \n\n{jordan.zoubin.lksaul}~psyche.mit.edu \n\n*Center for Biological and Computational Learning \n\nMassachusetts Institute of Technology \n\nCambridge, MA USA 02139 \n\nt Department of Computer Science \n\nUniversity of Toronto \n\nToronto, ON Canada M5S 1A4 \n\nAbstract \n\nWe study a time series model that can be viewed as a decision \ntree with Markov temporal structure. The model is intractable for \nexact calculations, thus we utilize variational approximations. We \nconsider three different distributions for the approximation: one in \nwhich the Markov calculations are performed exactly and the layers \nof the decision tree are decoupled, one in which the decision tree \ncalculations are performed exactly and the time steps of the Markov \nchain are decoupled, and one in which a Viterbi-like assumption is \nmade to pick out a single most likely state sequence. We present \nsimulation results for artificial data and the Bach chorales. \n\n1 \n\nIntroduction \n\nDecision trees are regression or classification models that are based on a nested \ndecomposition of the input space. An input vector x is classified recursively by a \nset of \"decisions\" at the nonterminal nodes of a tree, resulting in the choice of a \nterminal node at which an output y is generated. A statistical approach to decision \ntree modeling was presented by Jordan and Jacobs (1994), where the decisions were \ntreated as hidden multinomial random variables and a likelihood was computed by \nsumming over these hidden variables. This approach, as well as earlier statistical \nanalyses of decision trees, was restricted to independently, identically distributed \ndata. 
The goal of the current paper is to remove this restriction; we describe a generalization of the decision tree statistical model which is appropriate for time series. \n\nThe basic idea is straightforward: we assume that each decision in the decision tree is dependent on the decision taken at that node at the previous time step. Thus we augment the decision tree model to include Markovian dynamics for the decisions. \n\n502 M. I. Jordan, Z. Ghahramani and L. K. Saul \n\nFor simplicity we restrict ourselves to the case in which the decision variable at a given nonterminal is dependent only on the same decision variable at the same nonterminal at the previous time step. It is of interest, however, to consider more complex models in which inter-nonterminal pathways allow for the possibility of various kinds of synchronization. \n\nWhy should the decision tree model provide a useful starting point for time series modeling? The key feature of decision trees is the nested decomposition. If we view each nonterminal node as a basis function, with support given by the subset of possible input vectors x that arrive at the node, then the support of each node is the union of the support associated with its children. This is reminiscent of wavelets, although without the strict condition of multiplicative scaling. Moreover, the regions associated with the decision tree are polygons, which would seem to provide a useful generalization of wavelet-like decompositions in the case of a high-dimensional input space. \n\nThe architecture that we describe in the current paper is fully probabilistic. We view the decisions in the decision tree as multinomial random variables, and we are concerned with calculating the posterior probabilities of the time sequence of hidden decisions given a time sequence of input and output vectors. 
Although such calculations are tractable for decision trees and for hidden Markov models separately, the calculation is intractable for our model. Thus we must make use of approximations. We utilize the partially factorized variational approximations described by Saul and Jordan (1996), which allow tractable substructures (e.g., the decision tree and Markov chain substructures) to be handled via exact methods, within an overall approximation that guarantees a lower bound on the log likelihood. \n\n2 Architectures \n\n2.1 Probabilistic decision trees \n\nThe \"hierarchical mixture of experts\" (HME) model (Jordan & Jacobs, 1994) is a decision tree in which the decisions are modeled probabilistically, as are the outputs. The total probability of an output given an input is the sum over all paths in the tree from the input to the output. The HME model is shown in the graphical model formalism in Figure 1. Here a node represents a random variable, and the links represent probabilistic dependencies. A conditional probability distribution is associated with each node in the graph, where the conditioning variables are the node's parents. \n\nLet z^1, z^2, and z^3 denote the (multinomial) random variables corresponding to the first, second and third levels of the decision tree.^1 We associate multinomial probabilities P(z^1|x, \u03b7^1), P(z^2|x, z^1, \u03b7^2), and P(z^3|x, z^1, z^2, \u03b7^3) with the decision nodes, where \u03b7^1, \u03b7^2, and \u03b7^3 are parameters (e.g., Jordan and Jacobs utilized softmax transformations of linear functions of x for these probabilities). The leaf probabilities P(y|x, z^1, z^2, z^3, \u03b8) are arbitrary conditional probability models; e.g., linear/Gaussian models for regression problems. \n\nThe key calculation in the fitting of the HME model to data is the calculation of the posterior probabilities of the hidden decisions given the clamped values of x and y. 
This calculation is a recursion extending upward and downward in the tree, in which the posterior probability at a given nonterminal is the sum of posterior probabilities associated with its children. The recursion can be viewed as a special case of generic algorithms for calculating posterior probabilities on directed graphs (see, e.g., Shachter, 1990). \n\n^1 Throughout the paper we restrict ourselves to three levels for simplicity of presentation. \n\nFigure 1: The hierarchical mixture of experts as a graphical model. The E step of the learning algorithm for HMEs involves calculating the posterior probabilities of the hidden (unshaded) variables given the observed (shaded) variables. \n\nFigure 2: An HMM as a graphical model. The transition matrix appears on the horizontal links and the output probability distribution on the vertical links. The E step of the learning algorithm for HMMs involves calculating the posterior probabilities of the hidden (unshaded) variables given the observed (shaded) variables. \n\n2.2 Hidden Markov models \n\nIn the graphical model formalism a hidden Markov model (HMM; Rabiner, 1989) is represented as a chain structure as shown in Figure 2. Each state node is a multinomial random variable z_t. The links between the state nodes are parameterized by the transition matrix a(z_t|z_{t-1}), assumed homogeneous in time. The links between the state nodes z_t and output nodes y_t are parameterized by the output probability distribution b(y_t|z_t), which in the current paper we assume to be Gaussian with (tied) covariance matrix \u03a3. \n\nAs in the HME model, the key calculation in the fitting of the HMM to observed data is the calculation of the posterior probabilities of the hidden state nodes given the sequence of output vectors. 
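A minimal sketch of this posterior computation may be helpful. The code below implements the standard forward-backward recursion with scaling; for simplicity it assumes a discrete output alphabet rather than the Gaussian outputs used in the paper, and the function name and array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Posterior state marginals for a discrete-output HMM.

    A:   (K, K) transition matrix, A[i, j] = P(z_t = j | z_{t-1} = i)
    B:   (K, S) emission matrix,   B[j, s] = P(y_t = s | z_t = j)
    pi:  (K,)   initial state distribution
    obs: sequence of T observed symbols
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))   # scaled forward messages
    beta = np.ones((T, K))     # scaled backward messages
    scale = np.zeros(T)

    # Forward pass, normalizing each step to avoid underflow.
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, reusing the forward scale factors.
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta                      # posterior P(z_t | y_1..y_T)
    gamma /= gamma.sum(axis=1, keepdims=True)
    log_likelihood = np.log(scale).sum()
    return gamma, log_likelihood
```

The returned marginals gamma are exactly the posterior quantities needed in the E step, and the accumulated log scale factors give the log likelihood of the observation sequence.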
This calculation, the E step of the Baum-Welch algorithm, is a recursion which proceeds forward or backward in the chain. \n\n2.3 Hidden Markov decision trees \n\nWe now marry the HME and the HMM to produce the hidden Markov decision tree (HMDT) shown in Figure 3. This architecture can be viewed in one of two ways: (a) as a time sequence of decision trees in which the decisions in a given decision tree depend probabilistically on the decisions in the decision tree at the preceding moment in time; (b) as an HMM in which the state variable at each moment in time is factorized (cf. Ghahramani & Jordan, 1996) and the factors are coupled vertically to form a decision tree structure. \n\nFigure 3: The HMDT model is an HME decision tree (in the vertical dimension) with Markov time dependencies (in the horizontal dimension). \n\nLet the state of the Markov process defining the HMDT be given by the values of hidden multinomial decisions z_t^1, z_t^2, and z_t^3, where the superscripts denote the level of the decision tree (the vertical dimension) and the subscripts denote the time (the horizontal dimension). Given our assumption that the state transition matrix has only intra-level Markovian dependencies, we obtain the following expression for the HMDT probability model: \n\nP({z_t^1, z_t^2, z_t^3}, {y_t}|{x_t}) = \u03c0^1(z_1^1|x_1) \u03c0^2(z_1^2|x_1, z_1^1) \u03c0^3(z_1^3|x_1, z_1^1, z_1^2) \u220f_{t=2}^T a^1(z_t^1|x_t, z_{t-1}^1) a^2(z_t^2|x_t, z_{t-1}^2, z_t^1) a^3(z_t^3|x_t, z_{t-1}^3, z_t^1, z_t^2) \u220f_{t=1}^T b(y_t|x_t, z_t^1, z_t^2, z_t^3) \n\nSumming this probability over the hidden values z_t^1, z_t^2, and z_t^3 yields the HMDT likelihood. \n\nThe HMDT is a 2-D lattice with inhomogeneous field terms (the observed data). It is well-known that such lattice structures are intractable for exact probabilistic calculations. 
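As a concrete reading of the probability model above, the sketch below evaluates the log joint probability of a single hidden configuration for a three-level binary HMDT. It omits the input vectors x_t, and the function name and table layouts (one transition table per level, indexed by the previous decision at that level and the current decisions at the levels above) are hypothetical conventions for this sketch, not from the paper.

```python
import numpy as np

def hmdt_log_joint(pi1, pi2, pi3, a1, a2, a3, emit, z, y):
    """Log of the HMDT joint P({z_t}, {y_t}) for one decision sequence.

    pi1, pi2, pi3: initial decision tables; pi2[z1, z2] conditions level 2
                   on the current level-1 decision, and so on.
    a1, a2, a3:    transition tables; e.g. a2[prev2, z1, z2] conditions the
                   level-2 decision on its own previous value (intra-level
                   Markov dependence) and the current level-1 decision.
    emit:          emit(y_t, z_t) returning log b(y_t | z_t^1, z_t^2, z_t^3).
    z:             (T, 3) integer array of decisions; y: length-T outputs.
    """
    z1, z2, z3 = z[0]
    lp = np.log(pi1[z1]) + np.log(pi2[z1, z2]) + np.log(pi3[z1, z2, z3])
    for t in range(1, len(y)):
        p1, p2, p3 = z[t - 1]
        c1, c2, c3 = z[t]
        lp += (np.log(a1[p1, c1])
               + np.log(a2[p2, c1, c2])
               + np.log(a3[p3, c1, c2, c3]))
    for t in range(len(y)):
        lp += emit(y[t], z[t])
    return lp
```

The likelihood requires summing this quantity over every hidden configuration, which is what makes exact calculation on the lattice expensive.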
Thus, although it is straightforward to write down the EM algorithm for the HMDT and to write recursions for the calculations of posterior probabilities in the E step, these calculations are likely to be too time-consuming for practical use (for T time steps, K values per node and M levels, the algorithm scales as O(K^{M+1} T)). Thus we turn to methods that allow us to approximate the posterior probabilities of interest. \n\n3 Algorithms \n\n3.1 Partially factorized variational approximations \n\nCompletely factorized approximations to probability distributions on graphs can often be obtained variationally as mean field theories in physics (Parisi, 1988). For the HMDT in Figure 3, the completely factorized mean field approximation would delink all of the nodes, replacing the interactions with constant fields acting at each of the nodes. This approximation, although useful, neglects to take into account the existence of efficient algorithms for tractable substructures in the graph. \n\nSaul and Jordan (1996) proposed a refined mean field approximation to allow interactions associated with tractable substructures to be taken into account. The basic idea is to associate with the intractable distribution P a simplified distribution Q that retains certain of the terms in P and neglects others, replacing them with parameters \u03bc_i that we will refer to as \"variational parameters.\" Graphically the method can be viewed as deleting arcs from the original graph until a forest of tractable substructures is obtained. Arcs that remain in the simplified graph correspond to terms that are retained in Q; arcs that are deleted correspond to variational parameters. \n\nTo obtain the best possible approximation of P we minimize the Kullback-Leibler divergence KL(Q\u2225P) with respect to the parameters \u03bc_i. The result is a coupled set of equations that are solved iteratively. 
These equations make reference to the values of expectations of nodes in the tractable substructures; thus the (efficient) algorithms that provide such expectations are run as subroutines. Based on the posterior expectations computed under Q, the parameters defining P are adjusted. The algorithm as a whole is guaranteed to increase a lower bound on the log likelihood. \n\n3.2 A forest of chains \n\nThe HMDT can be viewed as a coupled set of chains, with couplings induced directly via the decision tree structure and indirectly via the common coupling to the output vector. If these couplings are removed in the variational approximation, we obtain a Q distribution whose graph is a forest of chains. There are several ways to parameterize this graph; in the current paper we investigate a parameterization with time-varying transition matrices and time-varying fields. Thus the Q distribution is given by \n\nQ = (1/Z_Q) \u220f_{t=2}^T \u00e3_t^1(z_t^1|z_{t-1}^1) \u00e3_t^2(z_t^2|z_{t-1}^2) \u00e3_t^3(z_t^3|z_{t-1}^3) \u220f_{t=1}^T q\u0303_t^1(z_t^1) q\u0303_t^2(z_t^2) q\u0303_t^3(z_t^3) \n\nwhere \u00e3_t^i(z_t^i|z_{t-1}^i) and q\u0303_t^i(z_t^i) are potentials that provide the variational parameterization. \n\n3.3 A forest of decision trees \n\nAlternatively we can drop the horizontal couplings in the HMDT and obtain a variational approximation in which the decision tree structure is handled exactly and the Markov structure is approximated. The Q distribution in this case is \n\nQ = \u220f_{t=1}^T r\u0303_t^1(z_t^1) r\u0303_t^2(z_t^2|z_t^1) r\u0303_t^3(z_t^3|z_t^1, z_t^2) \n\nNote that a decision tree is a fully coupled graphical model; thus we can view the partially factorized approximation in this case as a completely factorized mean field approximation on \"super-nodes\" whose configurations include all possible configurations of the decision tree. \n\n3.4 A Viterbi-like approximation \n\nIn hidden Markov modeling it is often found that a particular sequence of states has significantly higher probability than any other sequence. 
\nIn such cases the Viterbi algorithm, which calculates only the most probable path, provides a useful computational alternative. \n\nWe can develop a Viterbi-like algorithm by utilizing an approximation Q that assigns probability one to a single path {\u1e91_t^1, \u1e91_t^2, \u1e91_t^3}: \n\nQ({z_t^i}) = 1 if z_t^i = \u1e91_t^i for all t, i, and 0 otherwise     (1) \n\nFigure 4: a) Artificial time series data. b) Learning curves for the HMDT. \n\nNote that the entropy term Q ln Q is zero; moreover the evaluation of the energy term Q ln P reduces to substituting \u1e91_t^i for z_t^i in P. Thus the variational approximation is particularly simple in this case. The resulting algorithm involves a subroutine in which a standard Viterbi algorithm is run on a single chain, with the other (fixed) chains providing field terms. \n\n4 Results \n\nWe illustrate the HMDT on (1) an artificial time series generated to exhibit spatial and temporal structure at multiple scales, and (2) a domain which is likely to exhibit such structure naturally: the melody lines from J. S. Bach's chorales. \n\nThe artificial data was generated from a three-level binary HMDT with no inputs, in which the root node determined coarse-scale shifts (\u00b15) in the time series, the middle node determined medium-scale shifts (\u00b12), and the bottom node determined fine-scale shifts (\u00b10.5) (Figure 4a). The temporal scales at these three nodes, as measured by the rate of convergence (second eigenvalue) of the transition matrices, with 0 (1) signifying immediate (no) convergence, were 0.85, 0.5, and 0.3, respectively. 
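The artificial data just described can be reproduced in outline as follows. Only the three-level binary structure, the +/- shift magnitudes, and the second-eigenvalue parameterization of the transition matrices come from the text; the noise level, seed, and function name are illustrative assumptions.

```python
import numpy as np

def sample_hmdt_series(T, rho=(0.85, 0.5, 0.3), shifts=(5.0, 2.0, 0.5),
                       noise=0.25, seed=0):
    """Sample a time series from a three-level binary HMDT with no inputs.

    Each level m uses a symmetric 2x2 transition matrix whose second
    eigenvalue is rho[m]; for such a matrix the self-transition
    probability is (1 + rho[m]) / 2.  Level m contributes +/- shifts[m]
    to the output, plus illustrative Gaussian observation noise.
    """
    rng = np.random.default_rng(seed)
    states = np.zeros((T, 3), dtype=int)
    states[0] = rng.integers(0, 2, size=3)
    for m, r in enumerate(rho):
        stay = (1 + r) / 2          # P(keep the same decision at level m)
        for t in range(1, T):
            flip = int(rng.random() >= stay)
            states[t, m] = states[t - 1, m] ^ flip
    signs = 2 * states - 1          # map decisions {0, 1} -> {-1, +1}
    y = signs @ np.array(shifts) + noise * rng.standard_normal(T)
    return states, y
```

A series sampled this way exhibits slow coarse-scale shifts from the root level and progressively faster fine-scale shifts from the lower levels, as in Figure 4a.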
\nWe implemented forest-of-chains, forest-of-trees and Viterbi-like approximations. The learning curves for ten runs of the forest-of-chains approximation are shown in Figure 4b. Three plateau regions are apparent, corresponding to having extracted the coarse, medium, and fine scale structures of the time series. Five runs captured all three spatio-temporal scales at their correct level in the hierarchy; three runs captured the scales but placed them at incorrect nodes in the decision tree; and two captured only the coarse-scale structure.^2 Similar results were obtained with the Viterbi-like approximation. We found that the forest-of-trees approximation was not sufficiently accurate for these data. \n\nThe Bach chorales dataset consists of 30 melody lines with 40 events each.^3 Each discrete event encoded 6 attributes: start time of the event (st), pitch (pitch), duration (dur), key signature (key), time signature (time), and whether the event was under a fermata (ferm). \n\n^2 Note that it is possible to bias the ordering of the time scales by ordering the initial random values for the nodes of the tree; we did not utilize such a bias in this simulation. \n\n^3 This dataset was obtained from the UCI Repository of Machine Learning Datasets. \n\n  K | st | pitch | dur | key | time | ferm || level 1 | level 2 | level 3 \n  2 |  3 |  84   |  6  |  6  |  95  |   0  ||  1.00   |  1.00   |  0.51 \n  3 | 22 |  93   |  7  | 38  |  99  |   0  ||  1.00   |  0.96   |  0.85 \n  4 | 55 |  96   | 36  | 48  |  99  |   5  ||  1.00   |  1.00   |  0.69 \n  5 | 57 |  97   | 41  | 41  |  99  |  61  ||  1.00   |  0.95   |  0.75 \n  6 | 70 |  94   | 58  | 40  |  99  |  10  ||  1.00   |  0.93   |  0.76 \n\n(Columns st through ferm give percent variance explained; columns level 1 through level 3 give the temporal scale at each node.) \n\nTable 1: Hidden Markov decision tree models of the Bach chorales dataset: mean percentage of variance explained for each attribute and mean temporal scales at the different nodes. \n\nThe chorales dataset was modeled with 3-level HMDTs with branching factors (K) 
2, 3, 4, 5, and 6 (3 runs at each size, summarized in Table 1). Thirteen out of 15 runs resulted in a coarse-to-fine progression of temporal scales from root to leaves of the tree. A typical run at branching factor 4, for example, dedicated the top-level node to modeling the time and key signatures (attributes that are constant throughout any single chorale), the middle node to modeling start times, and the bottom node to modeling pitch or duration. \n\n5 Conclusions \n\nViewed in the context of the burgeoning literature on adaptive graphical probabilistic models, which includes HMMs, HMEs, CVQs, IOHMMs (Bengio & Frasconi, 1995), and factorial HMMs, the HMDT would appear to be a natural next step. The HMDT includes as special cases all of these architectures; moreover it arguably combines their best features: factorized state spaces, conditional densities, representation at multiple levels of resolution and recursive estimation algorithms. Our work on the HMDT is in its early stages, but the earlier literature provides a reasonably secure foundation for its development. \n\nReferences \n\nBengio, Y., & Frasconi, P. (1995). An input output HMM architecture. In G. Tesauro, D. S. Touretzky & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA. \n\nGhahramani, Z., & Jordan, M. I. (1996). Factorial hidden Markov models. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA. \n\nJordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214. \n\nParisi, G. (1988). Statistical Field Theory. Redwood City, CA: Addison-Wesley. \n\nRabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257-285. \n\nSaul, L. K., & Jordan, M. I. (1996). 
Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA. \n\nShachter, R. (1990). An ordered examination of influence diagrams. Networks, 20, 535-563. \n", "award": [], "sourceid": 1264, "authors": [{"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}]}