{"title": "Agnostic Classification of Markovian Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 471, "abstract": "", "full_text": "Agnostic Classification of Markovian Sequences\n\n
Ran El-Yaniv    Shai Fine    Naftali Tishby*\n\n
Institute of Computer Science and Center for Neural Computation\nThe Hebrew University\nJerusalem 91904, Israel\nE-mail: {ranni,fshai,tishby}@cs.huji.ac.il\n\n
Category: Algorithms.\n\n
Abstract\n\n
Classification of finite sequences without explicit knowledge of their statistical nature is a fundamental problem with many important applications. We propose a new information theoretic approach to this problem which is based on the following ingredients: (i) sequences are similar when they are likely to be generated by the same source; (ii) cross entropies can be estimated via \"universal compression\"; (iii) Markovian sequences can be asymptotically-optimally merged. With these ingredients we design a method for the classification of discrete sequences whenever they can be compressed. We introduce the method and illustrate its application to hierarchical clustering of languages and to estimating similarities of protein sequences.\n\n
1 Introduction\n\n
While the relationship between compression (minimal description) and supervised learning is by now well established, no such connection is generally accepted for the unsupervised case. Unsupervised classification is still largely based on ad-hoc distance measures, often with no explicit statistical justification. This is particularly true for the unsupervised classification of sequences of discrete symbols, which is encountered in numerous important applications in machine learning and data mining, such as text categorization, biological sequence modeling, and analysis of spike trains.\n\n
The emergence of \"universal\" (i.e. 
asymptotically distribution independent) sequence compression techniques suggests the existence of \"universal\" classification methods that make minimal assumptions about the statistical nature of the data. Such techniques are potentially more robust and appropriate for real world applications.\n\n
*Corresponding author.\n\n
In this paper we introduce a specific method that utilizes the connection between universal compression and unsupervised classification of sequences. Our only underlying assumption is that the sequences can be approximated (in the information theoretic sense) by some finite order Markov sources. There are three ingredients to our approach. The first is the assertion that two sequences are statistically similar if they are likely to be independently generated by the same source. The second is that this likelihood can be estimated, given a typical sequence of the most likely joint source, using any good compression method for the sequence samples. The third is a novel and simple randomized sequence merging algorithm which provably generates a typical sequence of the most likely joint source of the sequences, under the above Markovian approximation assumption.\n\n
Our similarity measure is also motivated by the known \"two sample problem\" [Leh59] of estimating the probability that two given samples are taken from the same distribution. In the i.i.d. (Bernoulli) case this problem was thoroughly investigated, and the optimal statistical test is given by the sum of the empirical cross entropies between the two samples and their most likely joint source. We argue that this measure can be extended to arbitrary order Markov sources and use it to construct and sample the most likely joint source. 
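For intuition, the i.i.d. (unigram) version of this two-sample statistic can be sketched as follows; this is an illustrative Python sketch with names of our own choosing, computing the λ-weighted sum of KL divergences from each sample's empirical distribution to their λ-mixture joint source (the quantity defined formally in Section 2):

```python
import math
from collections import Counter

def js_similarity(x, y):
    """Two-sample similarity d(x, y): the lambda-weighted sum of KL
    divergences between each sample's empirical distribution and
    their most likely joint source, the lambda-mixture M."""
    lam = len(x) / (len(x) + len(y))             # sample mixture ratio
    px, py = Counter(x), Counter(y)

    def mix(c):                                  # M(c) = lam*p_x(c) + (1-lam)*p_y(c)
        return lam * px[c] / len(x) + (1 - lam) * py[c] / len(y)

    def kl(counts, n):                           # D_KL(counts/n || M), in bits
        return sum((counts[c] / n) * math.log2((counts[c] / n) / mix(c))
                   for c in counts if counts[c] > 0)

    return lam * kl(px, len(x)) + (1 - lam) * kl(py, len(y))

print(js_similarity("aabb", "aabb"))  # identical samples -> 0.0
print(js_similarity("aaaa", "bbbb"))  # disjoint samples -> 1.0 (maximal, in bits)
```

The statistic is symmetric and bounded (here between 0 and 1 bit), unlike the raw KL divergence; the paper's contribution is extending this beyond the i.i.d. case to Markov sources via compression.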
\n\nThe similarity measure and the statistical merging algorithm can be naturally combined into classification algorithms for sequences. Here we apply the method to hierarchical clustering of short text segments in 18 European languages and to evaluation of similarities of protein sequences. A complete analysis of the method, with further applications, will be presented elsewhere [EFT97].\n\n
2 Measuring the statistical similarity of sequences\n\n
Estimating the statistical similarity of two individual sequences is traditionally done by training a statistical model for each sequence and then measuring the likelihood of the other sequence under the model. Training a model entails an assumption about the nature of the noise in the data, and this is the rationale behind most \"edit distance\" measures, even when the noise model is not explicitly stated.\n\n
Estimating the log-likelihood of a sequence-sample over a discrete alphabet Σ by a statistical model can be done through the Cross Entropy or Kullback-Leibler divergence [CT91] between the sample empirical distribution p and the model distribution q, defined as:\n\n
    D_KL(p||q) = Σ_{σ∈Σ} p(σ) log (p(σ)/q(σ)).    (1)\n\n
The KL-divergence, however, has some serious practical drawbacks. It is non-symmetric and unbounded unless the model distribution q is absolutely continuous with respect to p (i.e. q(σ) = 0 ⇒ p(σ) = 0). The KL-divergence is therefore highly sensitive to low probability events under q. Using the \"empirical\" (sample) distributions for both p and q can result in very unreliable estimates of the true divergences. Essentially, D_KL(p||q) measures the asymptotic coding inefficiency of coding the sample p with an optimal code for the model distribution q.\n\n
The symmetric divergence, i.e. 
D(p, q) = D_KL(p||q) + D_KL(q||p), suffers from similar sensitivity problems and lacks a clear statistical meaning.\n\n
2.1 The \"two sample problem\"\n\n
Direct Bayesian arguments, or alternately the method of types [CK81], suggest that the probability that there exists one source distribution M for two independently drawn samples, x and y [Leh59], is proportional to\n\n
    ∫ dμ(M) Pr(x|M) · Pr(y|M) = ∫ dμ(M) · 2^{-(|x| D_KL(p_x||M) + |y| D_KL(p_y||M))},    (2)\n\n
where dμ(M) is a prior density over all candidate distributions, p_x and p_y are the empirical (sample) distributions, and |x| and |y| are the corresponding sample sizes. For large enough samples this integral is dominated (for any non-vanishing prior) by the maximal exponent in the integrand, i.e. by the most likely joint source of x and y, M_λ, defined as\n\n
    M_λ = argmin_{M'} { |x| D_KL(p_x||M') + |y| D_KL(p_y||M') },    (3)\n\n
where 0 ≤ λ = |x|/(|x|+|y|) ≤ 1 is the sample mixture ratio. The convexity of the KL-divergence guarantees that this minimum is unique and is given by\n\n
    M_λ = λ p_x + (1-λ) p_y,\n\n
the λ-mixture of p_x and p_y.\n\n
The similarity measure between two samples, d(x, y), naturally follows as the minimal value of the above exponent. That is,\n\n
Definition 1 The similarity measure, d(x, y) = V_λ(p_x, p_y), of two samples x and y, with empirical distributions p_x and p_y respectively, is defined as\n\n
    d(x, y) = V_λ(p_x, p_y) = λ D_KL(p_x||M_λ) + (1-λ) D_KL(p_y||M_λ),    (4)\n\n
where M_λ is the λ-mixture of p_x and p_y.\n\n
The function V_λ(p, q) is an extension of the Jensen-Shannon divergence (see e.g. [Lin91]) and satisfies many useful analytic properties, such as symmetry and boundedness on both sides by the L1-norm, in addition to its clear statistical meaning. See [Lin91, EFT97] for a more complete discussion of this measure.\n\n
2.2 Estimating the V_λ 
similarity measure\n\n
The key component of our classification method is the estimation of V_λ for individual finite sequences, without an explicit model distribution.\n\n
Since cross entropies, D_KL, express code-length differences, they can be estimated using any efficient compression algorithm for the two sequences. The existence of \"universal\" compression methods, such as the Lempel-Ziv algorithm (see e.g. [CT91]), which are provably asymptotically optimal for any sequence, gives us the means for asymptotically optimal estimation of V_λ, provided that we can obtain a typical sequence of the most-likely joint source, M_λ.\n\n
We apply an improvement [BE97] of the method of Ziv and Merhav [ZM93] for the estimation of the two cross-entropies using the Lempel-Ziv algorithm given two sample sequences. Notice that our estimation of V_λ is only as good as the compression method used; namely, closer to optimal compression yields better estimation of the similarity measure.\n\n
It remains to show how a typical sequence of the most-likely joint source can be generated.\n\n
3 Joint Sources of Markovian Sequences\n\n
In this section we first explicitly generalize the notion of the joint statistical source to finite order Markov probability measures. We then identify the joint source of Markovian sequences and show how to construct a typical random sample of this source.\n\n
More precisely, let x and y be two sequences generated by Markov processes with distributions P and Q, respectively. We present a novel algorithm for merging the two sequences by generating a typical sequence of an approximation to the most likely joint source of x and y. The algorithm does not require the parameters of the true sources P and Q, and the computation of the sequence is done directly from the sequence samples x and y. 
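To illustrate compression-based cross-entropy estimation, here is a minimal Python sketch in the spirit of the Ziv-Merhav cross-parsing; this is a simplified stand-in, not the improved estimator of [BE97], and all function names are our own:

```python
import math

def cross_parse(x, y):
    """Sequentially parse x into phrases, each the longest prefix of the
    remaining part of x that occurs as a substring of y (a symbol never
    seen in y forms a phrase by itself); return the phrase count."""
    phrases, i = 0, 0
    while i < len(x):
        j = i + 1
        # grow the current phrase while it still occurs in y
        while j <= len(x) and x[i:j] in y:
            j += 1
        phrases += 1
        i = max(j - 1, i + 1)   # consume the longest match, or one unseen symbol
    return phrases

def cross_entropy_estimate(x, y):
    """Parsing-based estimate, in bits per symbol, of the cross entropy
    between the source of x and the source of y."""
    return cross_parse(x, y) * math.log2(len(y)) / len(x)

text = "the quick brown fox jumps over the lazy dog " * 5
print(cross_entropy_estimate(text, text))            # low: y compresses x well
print(cross_entropy_estimate(text, "zqxjkv" * 30))   # high: y tells us little about x
```

A relative-entropy estimate is then obtained by subtracting an analogous self-parsing term for x from this cross term; as noted above, parsing closer to optimal compression yields tighter estimates.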
\nAs before, Σ denotes a finite alphabet, and P and Q denote two ergodic Markov sources over Σ of orders K_P and K_Q, respectively. By equation (3), the λ-mixture joint source M_λ of P and Q is M_λ = argmin_{M'} λ D_KL(P||M') + (1-λ) D_KL(Q||M'), where for sequences D_KL(P||M) = limsup_{n→∞} (1/n) Σ_{x∈Σ^n} P(x) log (P(x)/M(x)). The following theorem identifies the joint source of P and Q.\n\n
Theorem 1 The unique λ-mixture joint source M_λ of P and Q, of order K = max{K_P, K_Q}, is given by the following conditional distribution. For each s ∈ Σ^K, a ∈ Σ,\n\n
    M_λ(a|s) = [λP(s) / (λP(s) + (1-λ)Q(s))] P(a|s) + [(1-λ)Q(s) / (λP(s) + (1-λ)Q(s))] Q(a|s).\n\n
This distribution can be naturally extended to n sources with priors λ_1, ..., λ_n.\n\n
3.1 The \"sequence merging\" algorithm\n\n
The above theorem can be easily translated into an algorithm. Figure 1 describes a randomized algorithm that generates, from the given sequences x and y, an asymptotically typical sequence z of the most likely joint source of P and Q, as defined by Theorem 1.\n\n
Initialization:\n\n
* z[0] := a symbol chosen from x with probability λ or from y with probability 1-λ\n
* i := 0\n\n
Loop: repeat until the approximation error is lower than a prescribed threshold\n\n
* s_x := the maximal-length suffix of z appearing somewhere in x\n
* s_y := the maximal-length suffix of z appearing somewhere in y\n
* Λ(λ, s_x, s_y) := λ Pr_x(s_x) / (λ Pr_x(s_x) + (1-λ) Pr_y(s_y))\n
* r := x with probability Λ(λ, s_x, s_y), or y with probability 1 - Λ(λ, s_x, s_y)\n
* r(s_r) := a randomly chosen occurrence of s_r in r\n
* z[i+1] := the symbol appearing immediately after r(s_r) in r\n
* i := i + 1\n\n
End Repeat\n\n
Figure 1: The most-likely joint source algorithm\n\n
Notice that the algorithm is completely unparameterized; even the sequence alphabets, which may differ from one sequence to another, are not explicitly needed. The algorithm can be efficiently implemented by pre-preparing suffix trees for the given sequences, and the merging algorithm is naturally generalizable to any number of sequences.\n\n
4 Applications\n\n
There are several possible applications of our sequence merging algorithm and similarity measure. Here we focus on three: the source merging problem, estimation of sequence similarity, and bottom-up sequence classification. These algorithms differ from most existing approaches because they rely only on the sequence data, similar to universal compression, without explicit modeling assumptions. Further details, analysis, and applications of the method will be presented elsewhere [EFT97].\n\n
4.1 Merging and synthesis of sequences\n\n
An immediate application of the source merging algorithm is the synthesis of typical sequences of the joint source from some given data sequences, without any access to an explicit model of the source.\n\n
To illustrate this point consider the sequence in Figure 2. This sequence was randomly generated, character by character, from two natural excerpts: a 47,655-character string from Dickens' A Tale of Two Cities, and a 59,097-character string from Twain's The Prince and the Pauper.\n\n
Do your way to her breast, and sent a treason's sword- and not empty. 
\n\n\"I am particularly and when the stepped of his ovn commits place. No; yes, \nof course, and he passed behind that by turns ascended upon him, and my bone \nto touch it, less to say: \nIn miness?\" \nThe books third time. There was but pastened her unave misg his ruined head \nthan they had knovn to keep his saw whether think\" The feet our grace he \ncalled offer information? \n\n'Remove thought, everyone! Guards! \n\n[Twickens, 1997] \n\nFigure 2: A typical excerpt of random text generated by the \"joint source\" of \nDickens and Twain. \n\n4.2 Pairwise similarity of proteins \n\nThe joint source algorithm, combined with the new similarity measure, provide \nnatural means for computing the similarity of sequences over any alphabet. In this \nsection we illustrate this applicationl for the important case of protein sequences \n(sequences over the set of the 20 amino-acids). \nFrom a database of all known proteins we selected 6 different families and within \neach family we randomly chose 10 proteins. The families chosen are: Chaperonin, \nMHC1, Cytochrome, Kinase, Globin Alpha and Globin Beta. Our pairwise dis(cid:173)\ntances between all 60 proteins were computed using our agnostic algorithm and are \ndepicted in the 6Ox60 matrix of Figure 3. As can be seen, the algorithm succeeds to \n\nIThe protein results presented here are part of an ongoing work with G. Yona and E. \n\nBen-Sasson. \n\n\f470 \n\nR. El-Yaniv, S. Fine and N. TlShby \n\nidentify the families (the success with the Kinase and Cytochrome families is more \nlimited). \n\nPairwIse Distances of Protein Sequences \n\nchaperonin \n\nMHC I \n\ncytochrome \n\nkinase \n\nglobin a \n\nglobin b \n\nFigure 3: A 60x60 symmetric matrix representing the pairwise distances, as com(cid:173)\nputed by our agnostic algorithm, between 60 proteins, each consecutive 10 belong \nto a different family. Darker gray represent higher similarity. 
\n\nIn another experiment we considered all the 200 proteins of the Kinase family and \ncomputed the pairwise distances of these proteins using the agnostic algorithm. \nFor comparison we computed the pairwise similarities of these sequences using the \nwidely used Smith-Waterman algorithm (see e.g. [HH92]).2 The resulting agnostic \nsimilarities, computed with no biological information whatsoever, are very similar to \nthe Smith-Waterman similarities. 3 Furthermore, our agnostic measure discovered \nsome biological similarities not detected by the Smith-Waterman method. \n\n4.3 Agnostic classification of languages \n\nThe sample of the joint source of two sequences can be considered as their \"average\" \nor \"centroid\", capturing a mixture of their statistics. Averaging and measuring dis(cid:173)\ntance between objects are sufficient for most standard clustering algorithms such as \nbottom-up greedy clustering, vector quantization (VQ), and clustering by determin(cid:173)\nistic annealing. Thus, our merging method and similarity measure can be directly \napplied for the classification of finite sequences via standard clustering algorithms. \n\nTo illustrate the power of this new sequence clustering method we give the result of a \nrudimentary linguistic experiment using a greedy bottom-up (conglomerative) clus(cid:173)\ntering of short excerpts (1500 characters) from eighteen languages. Specifically, we \ntook sixteen random excerpts from the following Porto-Indo-European languages: \nAfrikaans, Catalan, Danish, Dutch, English, Flemish, French, German, Italian, \nLatin, Norwegian, Polish, Portuguese, Spanish, Swedish and Welsh, together with \n\n2we applied the Smith-Waterman for computing local-alignment costs using the state(cid:173)\n\nof-the-art blosum62 biological cost matrix. \n\n3These results are not given here due to space limitations and will be discussed \n\nelsewhere. 
\ntwo artificial languages: Esperanto and Klingon.^4\n\n
The resulting hierarchical classification tree is depicted in Figure 4. This entirely unsupervised method, when applied to these short random excerpts, clearly agrees with the \"standard\" philological tree of these languages, both in terms of the grouping and the levels of similarity (depth of the split) of the languages (the Polish-Welsh \"similarity\" is probably due to the specific transcription used).\n\n
Figure 4: Agnostic bottom-up greedy clustering of eighteen languages\n\n
Acknowledgments\n\n
We sincerely thank Ran Bachrach and Golan Yona for helpful discussions. We also thank Sageev Oore for many useful comments.\n\n
References\n\n
[BE97] R. Bachrach and R. El-Yaniv. An Improved Measure of Relative Entropy Between Individual Sequences. Unpublished manuscript, 1997.\n\n
[CK81] I. Csiszar and J. Korner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, New York, 1981.\n\n
[CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.\n\n
[EFT97] R. El-Yaniv, S. Fine and N. Tishby. Classifying Markovian Sources. In preparation, 1997.\n\n
[HH92] S. Henikoff and J. G. Henikoff, 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915-10919.\n\n
[Leh59] E. L. Lehmann. Testing Statistical Hypotheses. John Wiley & Sons, New York, 1959.\n\n
[Lin91] J. Lin, 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151.\n\n
[ZM93] J. Ziv and N. Merhav, 1993. A Measure of Relative Entropy Between Individual Sequences with Application to Universal Classification. IEEE Transactions on Information Theory, 39(4).\n\n
^4 Klingon is a synthetic language that was invented for the Star Trek TV series. 
\n", "award": [], "sourceid": 1376, "authors": [{"given_name": "Ran", "family_name": "El-Yaniv", "institution": null}, {"given_name": "Shai", "family_name": "Fine", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}