{"title": "A Hidden Markov Model for de Novo Peptide Sequencing", "book": "Advances in Neural Information Processing Systems", "page_first": 457, "page_last": 464, "abstract": null, "full_text": "A Hidden Markov Model for de Novo Peptide Sequencing\n\nBernd Fischer, Volker Roth, Joachim M. Buhmann\nInstitute of Computational Science\nETH Zurich\nCH-8092 Zurich, Switzerland\nbernd.fischer@inf.ethz.ch\n\nJonas Grossmann, Sacha Baginsky, Wilhelm Gruissem\nInstitute of Plant Sciences\nETH Zurich\nCH-8092 Zurich, Switzerland\n\nFranz Roos, Peter Widmayer\nInst. of Theoretical Computer Science\nETH Zurich\nCH-8092 Zurich, Switzerland\n\nAbstract\n\nDe novo sequencing of peptides is a challenging task in proteome research. While there exist reliable DNA-sequencing methods, the high-throughput de novo sequencing of proteins by mass spectrometry is still an open problem. Current approaches suffer from a lack of precision in detecting mass peaks in the spectrograms. In this paper we present a novel method for de novo peptide sequencing based on a hidden Markov model. Experiments demonstrate that this new method significantly outperforms standard approaches in matching quality.\n\n1 Introduction\n\nThe goal of de novo peptide sequencing is to reconstruct an amino acid sequence from a given mass spectrum. De novo sequencing by means of mass spectrometry is a very challenging task, since many practical problems like measurement errors or peak suppression have to be overcome. It is thus not surprising that current approaches to reconstructing the sequence from mass spectra are usually limited to those species for which genome information is available. This case is a simplified version of the de novo sequencing problem, since the hypothesis space of possible sequences is restricted to the known ones contained in a sequence database.\n\nIn this paper we present a Hidden Markov Model (HMM) for de novo sequencing. 
The main difference from standard methods, which are all based on dynamic programming [2, 1], lies in the fully probabilistic model. Our trained HMM defines a generative model for mass spectra which, for instance, is used for scoring observed spectra according to their likelihood given a peptide sequence. Besides predicting the most likely sequence, the HMM framework is far more general in the sense that it additionally allows us to specify the confidence in the predictions.\n\n2 Tandem Mass Spectrometry\n\nIn a typical sequencing experiment by mass spectrometry a protein is digested with the help of an enzyme. This digestion reaction breaks the protein into several peptides, each of which consists of a short sequence of typically 10 to 20 amino acid residues, with an additional H-atom at the N-terminus and an OH-group at the C-terminus.\n\nFigure 1: In the first mass measurement the parent mass is selected. In the second measurement the peptide is dissociated and the mass of the ion fragments is measured.\n\nThere are two measurement steps in a tandem mass spectrometer. The first step is responsible for filtering peptides of a certain total mass (also called the parent mass). The difficulty in measuring the parent mass arises from different 12C/13C isotope proportions of the approximately 30-80 C-atoms contained in a peptide. Fluctuations of the 13C fraction result in a binomial distribution of parent masses in the measurement. 
Given such an \"ion count distribution\" one can roughly estimate the mono-isotopic parent mass of the peptide, where the term mono-isotopic here refers to a peptide that contains exclusively 12C atoms. In practice, all isotope configurations of a peptide with parent masses that do not exceed the estimated mono-isotopic mass by more than a predefined offset are separated from the other peptides and passed to the second spectrometer.\n\nFigure 2: Top: The ideal peaks of a peptide sequence are drawn. Bottom: The spectrum of the corresponding peptide.\n\nIn the second mass measurement, a peptide is split into two fragments by means of collision induced dissociation with a noble gas. In almost all cases the peptide is broken between two amino acids. Thus, an ideal spectrum is composed of the masses of all prefix and suffix sequences of the peptide. Deviations from this ideal case are caused, e.g., by problems in determining the exact mono-isotopic mass of the fragments due to isotope shifts. Further complications are caused by an accidental loss of water (H2O), ammonia (NH3) or other molecules in the collision step. Moreover, the ion counts are not uniformly distributed over the spectrum. And last but not least, the measurements are noisy.\n\n3 The Hidden Markov Model for de Novo Peptide Sequencing\n\nA peptide can formally be described as a sequence of symbols from a fixed alphabet $\\mathcal{A}$ of 20 amino acids. We will denote amino acids with $\\alpha \\in \\mathcal{A}$ and the mass of an amino acid with $M(\\alpha)$. The input data is a spectrum of ion counts over all mass units. The ion count for mass $m$ is denoted by $x(m)$. The spectra are discretized to approximately one Dalton mass units and normalized such that the mean ion count per Dalton is constant.\n\nThe mono-isotopic parent mass $m_p$ of the peptide $P = (\\alpha_1, \\ldots, \\alpha_n)$ with $\\alpha_i \\in \\mathcal{A}$ is the sum of all amino acid masses plus a constant mass for the N- and C-termini: $m_p = \\mathrm{const}_N + \\sum_{i=1}^{n} M(\\alpha_i) + \\mathrm{const}_C$. 
For the sake of simplicity it is assumed that the N- and C-termini are not present, and thus the parent mass considered in the sequel is\n\n$$m_p = \\sum_{i=1}^{n} M(\\alpha_i). \\tag{1}$$\n\nIn the HMM framework a spectrum is regarded as a realization of a random process. The physical process that generates spectra is based on the fact that a peptide is randomly broken into two parts by interaction with a noble gas. Each of these parts is detected in the mass spectrometer and increases the ion count in the corresponding mass interval. Finally, a histogram over many such events is measured. In order to derive a model of the generation process, we make the simplifying assumptions that (i) breaks occur only at amino acid boundaries, and (ii) the probability of observing a break after a certain amino acid depends only on the amino acid itself. These assumptions allow us to model the generative process by way of a Markov process on a finite state automaton. In such a model, the process of generating a spectrum for a peptide of known parent mass is formalized as a path through the automaton in 1 Dalton steps until the constraint on the parent mass is satisfied.\n\n3.1 Finite State Automaton\n\nFigure 3: The finite state machine of the Hidden Markov Model. For each amino acid $\\alpha$ there is a list of $M(\\alpha)$ states.\n\nThe finite state automaton (fig. 3) has one initial state $s_0$. For each amino acid $\\alpha \\in \\mathcal{A}$ there exists a list of $M(\\alpha)$ states $s_1^{\\alpha}, \\ldots, s_{M(\\alpha)}^{\\alpha}$. Together with the end states $s_+$ and $s_-$ the complete set of states is\n\n$$S = \\{s_0\\} \\cup \\{ s_j^{\\alpha} \\mid \\alpha \\in \\mathcal{A},\\ 1 \\le j \\le M(\\alpha) \\} \\cup \\{s_+, s_-\\}. \\tag{2}$$\n\nThe bold edges in the graph correspond to state transition probabilities $a(s, t)$ from state $s$ to state $t$. Once the automaton is in the first state $s_1^{\\alpha}$ of the state list of an amino acid $\\alpha$, it has to pass through all other states within that list. 
Thus for the next $M(\\alpha)$ steps the list corresponding to amino acid $\\alpha$ is linearly traversed. If the automaton is in the last state $s_{M(\\alpha)}^{\\alpha}$ of a list, it can reach the start state $s_1^{\\beta}$ of any other amino acid $\\beta$. The random variable for the state sequence is denoted by $Y_1, \\ldots, Y_{m_p}$. The transition probabilities are\n\n$$a(s, t) = P\\{Y_{i+1} = t \\mid Y_i = s\\} = \\begin{cases} 1 \\quad \\text{if } s = s_i^{\\alpha} \\text{ and } t = s_{i+1}^{\\alpha} \\text{ for some } \\alpha \\in \\mathcal{A},\\ 1 \\le i < M(\\alpha) \\\\ r_{\\beta} \\quad \\text{if } s = s_{M(\\alpha)}^{\\alpha} \\text{ and } t = s_1^{\\beta} \\text{ for some } \\alpha \\in \\mathcal{A},\\ \\beta \\in \\mathcal{A} \\\\ 0 \\quad \\text{else.} \\end{cases} \\tag{3}$$\n\nThe first row ($a(s, t) = 1$) describes the case where the automaton is in a non-terminating state of the list of amino acid $\\alpha$ ($1 \\le i < M(\\alpha)$: $s = s_i^{\\alpha}$), where the following state is accepted with probability 1. The second row, on the contrary, refers to a terminating state of a list. In such a case, the starting state of any other amino acid $\\beta$ is selected with probability $r_{\\beta}$. The probabilities $r_{\\beta}$ are the probabilities of occurrence of amino acid $\\beta$.\n\nThe transition probabilities $a(s_0, t)$ from the start state $s_0$ are the occurrence probabilities of the amino acids:\n\n$$a(s_0, t) = \\begin{cases} r_{\\alpha} \\quad \\text{if } t = s_1^{\\alpha} \\text{ for some } \\alpha \\in \\mathcal{A} \\\\ 0 \\quad \\text{else} \\end{cases} \\tag{4}$$\n\nFinally one has to ensure that the parent mass constraint is fulfilled. In order to satisfy the constraint we devise a time-dependent hidden Markov model in which the transition probability changes with a Heaviside function at time $m_p$ from $a(s, t)$ to $a'(s, t)$. The dotted arrows in figure 3 show the transition probabilities $a'(s, t)$ into the end states $s_+$ and $s_-$:\n\n$$a'(s, t) = \\begin{cases} 1 \\quad \\text{if } s = s_{M(\\alpha)}^{\\alpha},\\ t = s_+ \\text{ for some } \\alpha \\in \\mathcal{A} \\\\ 1 \\quad \\text{if } s = s_i^{\\alpha},\\ t = s_- \\text{ for some } \\alpha \\in \\mathcal{A},\\ 1 \\le i < M(\\alpha) \\\\ 0 \\quad \\text{else} \\end{cases} \\tag{5}$$\n\nIf the automaton is in the last state $s_{M(\\alpha)}^{\\alpha}$ of an amino acid state list, it changes to the positive end state $s_+$ with probability 1 since the parent mass constraint is satisfied. If the automaton is in one of the other states, it changes to the negative end state $s_-$ since the parent mass constraint is violated. 
It is important to realize that all amino acid sequences that fulfill the parent mass constraint can be transformed into state sequences that end in the positive state $s_+$ and vice versa.\n\n3.2 Emission Probabilities\n\nFigure 4: Mean height of ion counts for different shifts with respect to the ideal prefix fragments (a) and suffix fragments (b).\n\nAt each state of the finite state automaton an ion count value is emitted. Figure 4 shows the mean ion count for different positions relative to the amino acid boundary, averaged over all amino acids. The histograms are taken over the training examples described in the experimental section. It happens quite frequently that an amino acid loses water (H2O) or ammonia (NH3). The ion count patterns for the prefix fragments (fig. 4 a) and the suffix fragments (fig. 4 b) are quite different for chemical reasons. For instance, carbon monoxide loss in the suffix fragments is an unlikely event. Suffix fragments are more stable than prefix fragments: the central peak at position 0 (amino acid boundary) is three times higher for the suffix fragments than for the prefix fragments. Note that in figure 4 b) we used two different scales.\n\nFigure 5: Folding the spectrum in the middle makes the internal mirror symmetry of the problem visible. The Markov chain models a sequence with three amino acids. The filled circles correspond to the amino acid boundaries. 
Each amino acid boundary generates an ion count pattern for the prefix fragment and one for the suffix fragment.\n\nBreaking a peptide in the second mass spectrometer produces both a prefix and a suffix fragment. To simultaneously process peaks of both types of fragments, we use one forward and one backward Markov chain which are independent of each other. Due to the inherent mirror symmetry of the problem (fig. 5) it is sufficient to limit the length of both models to $m_p/2$. For the recognition process we assume that we simultaneously observe two peaks $x_{m,1} = x(m)$ and $x_{m,2} = x(m_p - m)$ in step $m$. The joint observation of the prefix and the suffix peaks is an essential modeling step in our method.\n\nThe forward and the backward Markov chains are extended to hidden Markov models to describe the ion counts in the mass spectra. The emission probabilities depend on the two states of the prefix and suffix sequence, since these states give rise to ion counts in the measurements. We define\n\n$$b_{s,s'}(x_m) = P\\{ X_m = x_m = (x(m), x(m_p - m)) \\mid Y_m = (s, s') \\} \\tag{6}$$\n\nas the emission probabilities of ion counts. $X_m$ are the (coupled) random variables of the ion counts. The hidden variables for the state sequence are denoted by $Y_m$. This notion of coupled variables $X_m$ describes the transition from two independent Markov chains to one coupled hidden Markov model with a squared number of states (2-tuple states).\n\nThe joint probability of observable and hidden variables given the parent mass $m_p$ is\n\n$$P\\{X = x, Y = y \\mid s_+, m_p\\} = a(s_0, y_1)\\, a'(y_{m_p}, s_+) \\left[ \\prod_{m=1}^{\\frac{m_p - 1}{2}} b_{y_m, y_{m_p - m}}(x_m)\\, a(y_m, y_{m+1})\\, a(y_{m_p - m}, y_{m_p - m + 1}) \\right] a\\big(y_{\\frac{m_p - 1}{2}}, y_{\\frac{m_p - 1}{2} + 1}\\big) \\tag{7}$$\n\nThis formula holds for parent masses with an odd Dalton value; an equivalent formula can be derived for the even case. The first term in eq. (7) is the joint probability of the transition from $s_0$ to $y_1$ in the prefix model and the transition from $y_{m_p}$ to $s_+$ in the suffix model. 
In each term of the product, two peaks are observed on both sides of the spectrum: one at position $m$ and the other at the mirror position $m_p - m$. The joint probability of emissions is defined by $b_{y_m, y_{m_p - m}}(x_m, x_{m_p - m})$. Furthermore, the transition probabilities of the prefix and suffix sequences are multiplied, which reflects the independence assumption of the Markov model. The two chains are connected by the transition probability $a(y_{(m_p-1)/2}, y_{(m_p-1)/2+1})$ of traversing from the last state of the forward Markov chain to the first state of the backward chain.\n\n3.3 Most Probable Sequence\n\nThe input spectrum usually comes with an estimate of the parent mass with a tolerance of about 1 Dalton. Using a maximum likelihood approach the parent mass estimate is\n\n$$\\hat{m}_p = \\operatorname*{argmax}_{m_p} P\\{X = x \\mid s_+, m_p\\} = \\operatorname*{argmax}_{m_p} \\sum_{y} P\\{X = x, Y = y \\mid s_+, m_p\\}. \\tag{8}$$\n\nThe sum over all sequences can be computed efficiently by dynamic programming using the forward algorithm.\n\nOne result of de novo peptide sequencing is the computation of the best sequence generating a given spectrum. Given the estimated parent mass $\\hat{m}_p$ the maximum posterior estimate of the sequence is\n\n$$y^* = \\operatorname*{argmax}_{y} P\\{Y = y \\mid X = x, s_+, \\hat{m}_p\\} = \\operatorname*{argmax}_{y} P\\{X = x, Y = y \\mid s_+, \\hat{m}_p\\}. \\tag{9}$$\n\nThe best sequence can efficiently be found by the Viterbi algorithm. To compute the posterior probability one has to normalize the joint probability $P\\{X = x, Y = y \\mid s_+, \\hat{m}_p\\}$ by the evidence $P\\{X = x \\mid s_+, \\hat{m}_p\\}$ using the forward-backward algorithm.\n\nIn the mass spectra, ions with very low mass or almost parent mass are less frequently observed than ions with a medium mass. Therefore it becomes quite difficult to estimate the whole sequence with a high score. It is also possible to give a score for each subsequence of the peptide, especially a score for each amino acid. An amino acid is a subsequence $y_p, \\ldots, y_q$ of the state sequence $y_1, \\ldots, y_{m_p}$. Its score is\n\n$$P\\{y_p, \\ldots, y_q \\mid s_+, x, m_p\\} \\tag{10}$$\n\n$$= \\frac{\\sum_{y_1, \\ldots, y_{p-1}} \\sum_{y_{q+1}, \\ldots, y_{m_p}} P\\{y, x \\mid s_+, m_p\\}}{P\\{x \\mid s_+, m_p\\}} \\tag{11}$$\n\nThis can again be computed by some variation of the forward and backward algorithm.\n\n3.4 Simplification of the Model\n\nThe coupled hidden Markov model has $2375^2 = 5\\,640\\,625$ states, which leads to a runtime of 20 minutes per peptide; for practical applications this is problematic. A significant simplification is achieved by assuming that two spectra are observed, where the second one is the mirror version of the first one. The emission probabilities in this simplified model only depend on the states of the prefix Markov chain (fig. 6). Thus the emission of the mirror peak $x(m_p - m)$ is deterministically coupled to the emission of the peak $x_m$. Since this model has only 2375 states, the computation time reduces to 1-2 seconds per peptide.\n\n4 Experiments\n\nIn our experiments a protein sample of plant cell vacuoles (Arabidopsis thaliana) was digested with trypsin. The mass spectrometer gave an output of 7056 different candidate spectra. From a database search with SEQUEST [3] and further validation with PeptideProphet [4], 522 spectra with a confidence larger than 90% were extracted. It was shown that the PeptideProphet score is a very reliable scoring method for peptide identification by database search. The database output was used as training data.\n\nFigure 6: In the simplified model two mirrored spectra are observed. The emission of symbols is coupled with the amino acid boundaries of the prefix sequence.\n\nThe quality of the HMM inference is measured by the ratio of the number of common amino acid boundaries to the number of amino acids in the database sequence. 
The performance of the HMM was tested by leave-one-out cross validation: in each training step the emission probabilities and the amino acid occurrence probabilities are re-estimated, with one sequence excluded from the training set. To estimate the emission probabilities, the ion count is discretized to a fixed number of bins, in such a way that all bins contain an equal number of counts. The leave-one-out scheme is repeated for different numbers of discretization levels.\n\nFigure 7: Cross validation of recall rates for different numbers of bins in the discretization process. Depicted are the lower quartile, the median and the upper quartile.\n\nThe resulting recall rates are depicted in figure 7. Choosing 5 bins yields the highest recall value.\n\nWe have chosen the prominent de novo sequencing programs LUTEFISK [6] and PEAKS [5] as competitors for the simplified HMM. We compared the sequence from the HMM with the highest scoring sequences from the other programs. In figure 8 a) the difference between the estimated parent masses and the database parent mass is drawn. The plot demonstrates that all de novo sequencing methods tend to overestimate the parent mass. The best one is the HMM with 89.1% correct estimations, whereas only 59.3% of the LUTEFISK estimates and 58.1% of the PEAKS estimates are correct. In figure 8 b) boxplots of the recognition rate of peak positions are drawn. The three lines in the box correspond to the lower quartile, the median and the upper quartile of the distribution. The median recall of the HMM is 75.0%, for LUTEFISK 53.9% and for PEAKS 56.7%. 
Note that the lower quartile of the HMM results is above 50%, whereas it is below 10% for the other programs.\n\nFigure 8: a) Histogram of the difference between estimated parent mass and database output. b) Recall of peak positions.\n\n5 Conclusion and Further Work\n\nA novel method for the analysis of mass spectra in de novo peptide sequencing is presented in this paper. The proposed hidden Markov model is a fully probabilistic model for the generation process of mass spectra. The model was tested on mass spectra from vacuolar proteins. The HMM clearly outperforms its competitors in recognition of the parent mass and peak localization. In further work additional model parameters will be introduced to represent and to detect amino acids with post-translational modifications. Reliable subsequences can further be used for a tagged database search to identify peptides with post-translational modifications. Our method shows a large potential for high throughput de novo sequencing of proteins which is unmatched by competing techniques.\n\nAcknowledgment This work has been partially supported by DFG grant # Buh 914/5.\n\nReferences\n\n[1] Sacha Baginsky, Mark Cieliebak, Wilhelm Gruissem, Torsten Kleffmann, Zsuzsanna Liptak, Matthias Muller, and Paolo Penna. Audens: A tool for automatic de novo peptide sequencing. Technical Report 383, ETH Zurich, Dept. of Computer Science, 2002.\n\n[2] Ting Chen, Ming-Yang Kao, Matthew Tepel, John Rush, and George M. Church. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. 
Journal of Computational Biology, 8(3):325–337, 2001.\n\n[3] Jimmy K. Eng, Ashley L. McCormack, and John R. Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11):976–989, 1994.\n\n[4] Andrew Keller, Alexey I. Nesvizhskii, Eugene Kolker, and Ruedi Aebersold. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry, 2002.\n\n[5] Bin Ma, Kaizhong Zhang, Christopher Hendrie, Chengzhi Liang, Ming Li, Amanda Doherty-Kirby, and Gilles Lajoie. Peaks: Powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 17(20):2337–2342, 2003.\n\n[6] J. Alex Taylor and Richard S. Johnson. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Analytical Chemistry, 73:2594–2604, 2001.\n", "award": [], "sourceid": 2668, "authors": [{"given_name": "Bernd", "family_name": "Fischer", "institution": null}, {"given_name": "Volker", "family_name": "Roth", "institution": null}, {"given_name": "Jonas", "family_name": "Grossmann", "institution": null}, {"given_name": "Sacha", "family_name": "Baginsky", "institution": null}, {"given_name": "Wilhelm", "family_name": "Gruissem", "institution": null}, {"given_name": "Franz", "family_name": "Roos", "institution": null}, {"given_name": "Peter", "family_name": "Widmayer", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}