{"title": "An Entropic Estimator for Structure Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 723, "page_last": 729, "abstract": null, "full_text": "An Entropic Estimator for Structure Discovery \n\nMatthew Brand \nMitsubishi Electric Research Laboratories, 201 Broadway, Cambridge MA 02139 \nbrand@merl.com \n\nAbstract \n\nWe introduce a novel framework for simultaneous structure and parameter learning in hidden-variable conditional probability models, based on an entropic prior and a solution for its maximum a posteriori (MAP) estimator. The MAP estimate minimizes uncertainty in all respects: cross-entropy between model and data; entropy of the model; entropy of the data's descriptive statistics. Iterative estimation extinguishes weakly supported parameters, compressing and sparsifying the model. Trimming operators accelerate this process by removing excess parameters and, unlike most pruning schemes, guarantee an increase in posterior probability. Entropic estimation takes an overcomplete random model and simplifies it, inducing the structure of relations between hidden and observed variables. Applied to hidden Markov models (HMMs), it finds a concise finite-state machine representing the hidden structure of a signal. We entropically model music, handwriting, and video time-series, and show that the resulting models are highly concise, structured, predictive, and interpretable: Surviving states tend to be highly correlated with meaningful partitions of the data, while surviving transitions provide a low-perplexity model of the signal dynamics. \n\n1. An entropic prior \n\nIn entropic estimation we seek to maximize the information content of parameters. For conditional probabilities, parameter values near chance add virtually no information to the model, and are therefore wasted degrees of freedom. 
\nIn contrast, parameters near the extrema {0, 1} are informative because they impose strong constraints on the class of signals accepted by the model. In Bayesian terms, our prior should assert that parameters that do not reduce uncertainty are improbable. We can capture this intuition in a surprisingly simple form: For a model of N conditional probabilities θ = {θ_1, ..., θ_N} we write \n\nP_e(θ) ∝ e^{-H(θ)} = ∏_i θ_i^{θ_i}   (1) \n\nwhence we can see that the prior measures a model's freedom from ambiguity (H(θ) is an entropy measure). Applying P_e(·) to a multinomial yields the posterior \n\nP_e(θ|ω) = P(ω|θ) P_e(θ) / P(ω)   (2) \n\nTrimming is guaranteed to increase the posterior P_e(θ|X). This stands in contrast to most pruning schemes, which typically try to minimize damage to the posterior. Expanding via Bayes' rule and taking logarithms we obtain \n\nh_i(θ_i) > log P(X|θ) - log P(X|θ)|_{θ_i→0}   (7) \n\nwhere h_i(θ_i) is the entropy due to θ_i. For small θ_i, we can approximate via differentials: \n\nθ_i ∂H(θ)/∂θ_i > θ_i ∂log P(X|θ)/∂θ_i   (8) \n\nBy mixing the left- and right-hand sides of equations 7 and 8, we can easily identify trimmable parameters: those that contribute more to the entropy than to the log-likelihood. E.g., for multinomials we set h_i(θ_i) = -θ_i log θ_i against the r.h.s. of eqn. 8 and simplify to obtain \n\nθ_i < exp[-∂log P(X|θ)/∂θ_i]   (9) \n\nParameters can be trimmed at any time during training; at convergence trimming can bump the model out of a local probability maximum, allowing further training in a lower-dimensional and possibly smoother parameter subspace. 
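As a concrete illustration of eqns. 1 and 9, the following sketch (not code from the paper; the function names and interface are our own) evaluates the entropic prior for a multinomial and applies the trimming test, using the fact that for a multinomial likelihood ∏_i θ_i^{ω_i} with evidence ω, the gradient ∂log P(X|θ)/∂θ_i is ω_i/θ_i:

```python
import numpy as np

def entropic_prior(theta):
    # Unnormalized entropic prior of eqn. 1: P_e(theta) ∝ prod_i theta_i^theta_i = exp(-H(theta)).
    t = theta[theta > 0]
    return float(np.exp(np.sum(t * np.log(t))))

def trim(theta, omega):
    # Trimming test of eqn. 9: drop theta_i when theta_i < exp(-d log P(X|theta)/d theta_i).
    # For a multinomial with evidence counts omega, that gradient is omega_i / theta_i.
    grad = np.where(theta > 0, omega / np.where(theta > 0, theta, 1.0), np.inf)
    survivors = theta >= np.exp(-grad)
    trimmed = np.where(survivors, theta, 0.0)
    return trimmed / trimmed.sum()  # renormalize the surviving mass

# Low-entropy (peaked) distributions receive a higher prior than chance:
assert entropic_prior(np.array([0.94, 0.02, 0.02, 0.02])) > entropic_prior(np.ones(4) / 4)
```

Whenever the test fires, zeroing the parameter raises the prior by e^{h_i(θ_i)}, which by eqn. 7 outweighs the likelihood it costs.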
\n2 Entropic HMM training and trimming \n\nIn entropic estimation of HMM transition probabilities, we follow the conventional E-step, calculating the probability mass for each transition to be used as evidence ω: \n\nγ_{j,i} = Σ_{t=1}^{T-1} α_j(t) P_{i|j} P_i(x_{t+1}) β_i(t+1)   (10) \n\nwhere P_{i|j} is the current estimate of the transition probability from state j to state i; P_i(x_{t+1}) is the output probability of observation x_{t+1} given state i; and α, β are obtained from forward-backward analysis and follow the notation of Rabiner [1989]. For the M-step, we calculate new estimates {P̂_{i|j}}_i = θ̂ by applying the MAP estimator in §1.1 to each ω = {γ_{j,i}}_i. That is, ω is a vector of the evidence for each kind of transition out of a single state; from this evidence the MAP estimator calculates probabilities θ̂. (In Baum-Welch re-estimation, the maximum-likelihood estimator simply sets P_{i|j} = γ_{j,i} / Σ_i γ_{j,i}.) \n\nIn iterative estimation, e.g., expectation-maximization (EM), the entropic estimator drives weakly supported parameters toward zero, skeletonizing the model and concentrating evidence on surviving parameters until their estimates converge to near the ML estimate. Trimming appears to accelerate this process by allowing slowly dying parameters to leapfrog to extinction. It also averts numerical underflow errors. For HMM transition parameters, the trimming criterion of eqn. 9 becomes \n\nP_{i|j} < exp[-Σ_{t=1}^{T-1} γ_j(t) P_i(x_{t+1}) β_i(t+1) / β_j(t)]   (11) \n\nwhere γ_j(t) is the probability of state j at time t. The multinomial output distributions of a discrete-output HMM can be entropically re-estimated and trimmed in the same manner. \n\n[Figure 1 appears here; the left panel plots entropic versus ML HMM models of Bach chorales against states at initialization.] \n\nFigure 1: Left: Sparsification, classification, and prediction superiority of entropically estimated HMMs modeling Bach chorales. 
Lines indicate mean performance over 10 trials; error bars are 2 standard deviations. Right: High-probability states and subgraphs of interest from an entropically estimated 35-state chorale HMM. Tones output by each state are listed in order of probability. Extraneous arcs have been removed for clarity. \n\n3 Structure learning experiments \n\nTo explore the practical utility of this framework, we will use entropically estimated HMMs as a window into the hidden structure of some human-generated time-series. \n\nBach Chorales: We obtained a dataset of melodic lines from 100 of J.S. Bach's 371 surviving chorales from the UCI repository [Merz and Murphy, 1998], and transposed all into the key of C. We compared entropically and conventionally estimated HMMs in prediction and classification tasks, training both from identical random initial conditions and trying a variety of different initial state-counts. We trained with 90 chorales and tested with the remaining 10. In ten trials, all chorales were rotated into the test set. Figure 1 illustrates that despite substantial loss of parameters to sparsification, the entropically estimated HMMs were, on average, better predictors of notes. (Each test sequence was truncated to a random length and the HMMs were used to predict the first missing note.) They were also better at discriminating between test chorales and temporally reversed test chorales; this is challenging because Bach famously employed melodic reversal as a compositional device. With larger models, parameter-trimming became state-trimming: an average of 1.6 states were \"pinched off\" the 35-state models when all incoming transitions were deleted. 
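The E-step of §2 can be sketched for a discrete-output HMM as follows. This is an illustrative reconstruction (the names and the likelihood-table interface are ours, not the paper's): it uses the scaled forward-backward recursions of Rabiner [1989] to accumulate the evidence of eqn. 10, each row of which the entropic scheme would pass to the MAP estimator of §1.1 rather than simply normalize:

```python
import numpy as np

def transition_evidence(A, likes, pi):
    # Accumulate eqn. 10's evidence gamma[j, i] = sum_t alpha_j(t) A[j, i] P_i(x_{t+1}) beta_i(t+1),
    # with per-step rescaling to avert underflow. A[j, i] is the j->i transition
    # probability; likes[t, i] = P_i(x_t) is the output likelihood of frame t under state i.
    T, N = likes.shape
    alpha = np.zeros((T, N)); beta = np.ones((T, N)); c = np.zeros(T)
    alpha[0] = pi * likes[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                       # scaled forward pass
        alpha[t] = (alpha[t - 1] @ A) * likes[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):              # scaled backward pass
        beta[t] = (A @ (likes[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = np.zeros((N, N))
    for t in range(T - 1):                      # posterior transition mass
        gamma += alpha[t][:, None] * A * (likes[t + 1] * beta[t + 1])[None, :] / c[t + 1]
    return gamma

def ml_mstep(gamma):
    # Baum-Welch stops here; entropic training instead feeds each row to the MAP estimator.
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Each row gamma[j, :] sums to the expected number of transitions taken out of state j, so the whole matrix sums to T-1.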
\nWhile the conventionally estimated HMMs were wholly uninterpretable, in the entropically estimated HMMs one can discern several basic musical structures (figure 1, right), including self-transitioning states that output only tonic (C-E-G) or dominant (G-B-D) triads, lower- or upper-register diatonic tones (C-D-E or F-G-A-B), and mordents (A-♮G-A). We also found chordal state sequences (F-A-C) and states that lead to the tonic (C) via the mediant (E) or the leading tone (B). \n\nHandwriting: We used 2D Gaussian-output HMMs to analyze handwriting data. Training data, obtained from the UNIPEN web site [Reynolds, 1992], consisted of sequences of normalized pen-position coordinates taken at 5msec intervals from 10 different individuals writing the digits 0-9. The HMMs were estimated from identical data and initial conditions (random upper-diagonal transition matrices; random output parameters). The diagrams in figure 2 depict transition graphs of two HMMs modeling the pen-strokes for the digit \"5,\" mapped onto the data. Ellipses indicate each state's output probability iso-contours (receptive field); Xs and arcs indicate state dwell and transition probabilities, respectively, by their thicknesses. Entropic estimation induces an interpretable automaton that captures the essential structure and timing of the pen-strokes. 50 of the 80 original transition parameters were trimmed. \n\n[Figure 2 appears here; panels c and d are confusion matrices titled with 96.0% and 93.0% accuracy.] \n\nFigure 2: (a & b): State machines of conventionally and entropically estimated hidden Markov models of writing \"5.\" (c & d): Confusion matrices for all digits. 
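The random upper-diagonal initialization mentioned above can be sketched as follows (a minimal illustration with hypothetical names, not code from the paper). Masking the transition matrix to its upper triangle before row-normalization yields a left-to-right topology, and the zeros persist under re-estimation because EM cannot resurrect a zeroed transition:

```python
import numpy as np

def random_upper_diagonal(n_states, rng=None):
    # Row-stochastic transition matrix supported on the upper triangle (incl. diagonal),
    # so a state index can only hold or advance: a left-to-right HMM topology.
    rng = np.random.default_rng(rng)
    A = np.triu(rng.random((n_states, n_states)))
    A[-1, -1] = 1.0                      # final state can only self-transition
    return A / A.sum(axis=1, keepdims=True)
```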
Estimation without the entropic prior results in a wholly opaque model, in which none of the original dynamical parameters were trimmed. Model concision leads to better classification: the confusion matrices show cumulative classification error over ten trials with random initializations. Inspection of the parameters for the model in 2b showed that all writers began in states 1 or 2. From there it is possible to follow the state diagram to reconstruct the possible sequences of pen-strokes: Some writers start with the cap (state 1) while others start with the vertical (state 2); all loop through states 3-8 and some return to the top (via state 10) to add a horizontal (state 12) or diagonal (state 11) cap. \n\nOffice activity: Here we demonstrate a model of human activity learned from medium- to long-term ambient video. By activity, we mean spatio-temporal patterns in the pose, position, and movement of one's body. To make the vision tractable, we consider the activity of a single person in a relatively stable visual environment, namely, an office. We track the gross shape and position of the office occupant by segmenting each image into foreground and background pixels. Foreground pixels are identified with reference to an acquired statistical model of the background texture and camera noise. Their ensemble properties such as motion or color are modeled via adaptive multivariate Gaussian distributions, re-estimated in each frame. A single bivariate Gaussian is fitted to the foreground pixels and we record the associated ellipse parameters [mean_x, mean_y, Δmean_x, Δmean_y, mass, Δmass, elongation, eccentricity]. Sequences of these observation vectors are used to train and test the HMMs. Approximately 30 minutes of data were taken at 5Hz from an SGI IndyCam. 
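The shape features above can be illustrated with the following sketch (our own reconstruction; the exact feature definitions in the original system may differ). A bivariate Gaussian fitted to the foreground mask is summarized by its mean, mass, and covariance eigenvalues, from which elongation and eccentricity follow; temporal deltas come from finite differences against the previous frame:

```python
import numpy as np

def blob_features(mask, prev=None):
    # Fit a single bivariate Gaussian to the foreground pixels of a boolean mask
    # and return ellipse-style descriptors. `prev` supplies the previous frame's
    # features so delta features can be formed by finite differences.
    ys, xs = np.nonzero(mask)
    cov = np.cov(np.stack([xs, ys]))              # 2x2 spatial covariance
    lo, hi = np.sort(np.linalg.eigvalsh(cov))     # minor/major axis variances
    feat = {'mean_x': float(xs.mean()), 'mean_y': float(ys.mean()),
            'mass': float(len(xs)),
            'elongation': float(np.sqrt(hi / lo)),
            'eccentricity': float(np.sqrt(1.0 - lo / hi))}
    if prev is not None:
        for k in ('mean_x', 'mean_y', 'mass'):
            feat['d_' + k] = feat[k] - prev[k]    # stand-ins for the delta features
    return feat
```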
Data was collected automatically and at random over several days by a program that started recording whenever someone entered the room after it had been empty 5+ minutes. Backgrounds were re-learned during absences to accommodate changes in lighting and room configuration. Prior to training, HMM states were initialized to tile the image with their receptive fields, and transition probabilities were initialized to prefer motion to adjoining tiles. Three sequences ranging from 1000 to 1900 frames in length were used for entropic training of 12-, 16-, 20-, 25-, and 30-state HMMs. \n\nEntropic training yielded a substantially sparsified model with an easily interpreted state machine (see figure 3). Grouping of states into activities (done only to improve readability) was done by adaptive clustering on a proximity matrix which combined Mahalanobis distance and transition probability between states. The labels are the author's description of the set of frames claimed by each state cluster during forward-backward analysis of test data. Figure 4 illustrates this analysis, showing frames from a test sequence to which specific states are strongly tuned. State 5 (figure 3, right) is particularly interesting: it has a very non-specific receptive field, no self-transition, and an extremely low rate of occupancy. Instead of modeling data, it serves to compress the model by summarizing transition patterns that are common to several other states. The entropic model has proven to be quite superior for segmenting new video into activities and detecting anomalous behavior. \n\n[Figure 3 appears here; panels are labeled initialization and final model.] \n\nFigure 3: Top: The state machine found by entropic training (left) is easily labeled and interpreted. The state machine found by conventional training (right) is not, being fully connected. 
Bottom: Transition matrices after (1) initialization, (2) entropic training, (3) conventional training, and (4 & 5) entropic training from larger initializations. The top row indicates initial probabilities of each state; each subsequent row indicates the transition probabilities out of a state. Color key: ○ = 0; • = 1. The state machines above are extracted from 2 & 3. Note that 4 & 5 show the same qualitative structure as 2, but sparser, while 3 shows almost no structure at all. \n\nFigure 4: Some sample frames assigned high state-specific probabilities by the model. Note that some states are tuned to velocities, hence the difference between states 6 and 11. \n\n4 Related work \n\nHMMs: The literature of structure-learning in HMMs is based almost entirely on generate-and-test algorithms. These algorithms work by merging [Stolcke and Omohundro, 1994] or splitting [Takami and Sagayama, 1991] states, then retraining the model to see if any advantage has been gained. Space constraints force us to summarize a recent literature review: There are now more than 20 variations and improvements on these approaches, plus some heuristic constructive algorithms (e.g., [Wolfertstetter and Ruske, 1995]). Though these efforts use a variety of heuristic techniques and priors (including MDL) to avoid detrimental model changes, much of the computation is squandered, and reported run-times often range from hours to days. Entropic estimation is exact, monotonic, and orders of magnitude faster, taking only slightly longer than standard EM parameter estimation. \n\nMDL: Description length minimization is typically done via gradient ascent or search via model comparison; few estimators are known. Rissanen [1989] introduced an estimator for binary fractions, from which Vovk [1995] derived an approximate estimator for Bernoulli models over discrete sample spaces. 
It approximates a special case of our exact estimator, which handles multinomial models in continuous sample spaces. Our framework provides a unified Bayesian treatment of two issues that are often handled separately in MDL: estimating the number of parameters and estimating their values. \n\nMaxEnt: Our prior has different premises and an effect opposite that of the \"standard\" MaxEnt prior e^{-αD(θ‖θ_0)}. Nonetheless, our prior can be derived via MaxEnt reasoning from the premise that the expectation of the perplexity over all possible models is finite [Brand, 1998]. More colloquially, we almost always expect there to be learnable structure. \n\nExtensions: For simplicity of exposition (and for results that are independent of model class), we have assumed prior independence of the parameters and taken H(θ) to be the combined parameter entropies of the model's component distributions. Depending on the model class, we can also provide variants of eqns. 1-8 for H(θ) = conditional entropy or H(θ) = entropy rate of the model. In Brand [1998] we present entropic MAP estimators for spread and covariance parameters, with applications to mixtures of Gaussians, radial basis functions, and other popular models. In the same paper we generalize eqns. 1-8 with a temperature term, obtaining a MAP estimator that minimizes the free energy of the model. This folds deterministic annealing into EM, turning it into a quasi-global optimizer. It also provides a workaround for one known limitation of entropy minimization: It is inappropriate for learning from data that is atypical of the source process. \n\nOpen questions: Our framework is currently agnostic w.r.t. two important questions: Is there an optimal trimming policy? Is there a best entropy measure? Other questions naturally arise: Can we use the entropy to estimate the peakedness of the posterior distribution, and thereby judge the appropriateness of MAP models? 
Can we also directly minimize the entropy of the hidden variables, thereby obtaining discriminant training? \n\n5 Conclusion \n\nEntropic estimation is a highly efficient hill-climbing procedure for simultaneously estimating model structure and parameters. It provides a clean Bayesian framework for minimizing all entropies associated with modeling, and an E-MAP algorithm that brings the structure of a randomly initialized model into alignment with hidden structures in the data via parameter extinction. The applications detailed here are three of many in which entropically estimated models have consistently outperformed maximum-likelihood models in classification and prediction tasks. Most notably, entropic estimation tends to produce interpretable models that shed light on the structure of relations between hidden variables and observed effects. \n\nReferences \n\nBrand, M. (1997). Structure discovery in conditional probability models via an entropic prior and parameter extinction. Neural Computation. To appear; accepted 8/98. \n\nBrand, M. (1998). Pattern discovery via entropy minimization. To appear in Proc. Artificial Intelligence and Statistics #7. \n\nMerz, C. and Murphy, P. (1998). UCI repository of machine learning databases. \n\nRabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286. \n\nReynolds, D. (1992). Handwritten digit data. UNIPEN web site, http://hwr.nici.kun.nl/unipen/. Donated by HP Labs, Bristol, England. \n\nRissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific. \n\nStolcke, A. and Omohundro, S. (1994). Best-first model merging for hidden Markov model induction. TR-94-003, International Computer Science Institute, U.C. Berkeley. \n\nTakami, J.-I. and Sagayama, S. (1991). Automatic generation of the hidden Markov model by successive state splitting on the contextual domain and the temporal domain. TR SP91-88, IEICE. \n\nVovk, V. G. 
(1995). Minimum description length estimators under the optimal coding scheme. In Vitanyi, P., editor, Proc. Computational Learning Theory / Europe, pages 237-251. Springer-Verlag. \n\nWolfertstetter, F. and Ruske, G. (1995). Structured Markov models for speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 544-7. \n", "award": [], "sourceid": 1522, "authors": [{"given_name": "Matthew", "family_name": "Brand", "institution": null}]}