{"title": "Hidden Markov Models for Human Genes", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 768, "abstract": null, "full_text": "Hidden Markov Models for Human Genes \n\nPierre Baldi * \n\nJet Propulsion Laboratory \nCalifornia Institute of Technology \nPasadena, CA 91109 \n\nS\u00f8ren Brunak \n\nCenter for Biological Sequence Analysis \nThe Technical University of Denmark \nDK-2800 Lyngby, Denmark \n\nYves Chauvin \u2020 \n\nNet-ID, Inc. \n601 Minnesota \nSan Francisco, CA 94107 \n\nJacob Engelbrecht \n\nCenter for Biological Sequence Analysis \nThe Technical University of Denmark \nDK-2800 Lyngby, Denmark \n\nAnders Krogh \n\nElectronics Institute \nThe Technical University of Denmark \nDK-2800 Lyngby, Denmark \n\nAbstract \n\nHuman genes are not continuous but rather consist of short coding \nregions (exons) interspersed with highly variable non-coding \nregions (introns). We apply HMMs to the problem of modeling exons \nand introns and detecting splice sites in the human genome. Our \nmost interesting result so far is the detection of particular oscillatory \npatterns, with a minimal period of roughly 10 nucleotides, that \nseem to be characteristic of exon regions and may have significant \nbiological implications. \n\n* and Division of Biology, California Institute of Technology. \n\u2020 and Department of Psychology, Stanford University. \n\n761 \n\n\f762 \n\nBaldi, Brunak, Chauvin, Engelbrecht, and Krogh \n\n[Figure 1 diagram: an exon flanked by two introns, with the 3' splice site (acceptor site) at the intron/exon boundary, the 5' splice site (donor site) at the exon/intron boundary, and the consensus sequence at each site.] \n\nFigure 1: Structure of eukaryotic genes (not to scale: introns are typically much \nlonger than exons). \n\n1  INTRODUCTION \n\nThe genes of higher organisms are not continuous. 
Rather, they consist of relatively \nshort coding regions called exons interspersed with non-coding regions of highly variable \nlength called introns (Fig. 1). A complete gene may comprise as many as fifty \nexons. Very often, exons encode discrete functional or structural units of proteins. \nPrior to the translation of genes into proteins, a complex set of biochemical mechanisms \nis responsible for the precise cutting of genes at the splice junctions, i.e. the \nboundaries between introns and exons, and the subsequent removal and ligation \nwhich results in the production of mature messenger RNA. The translation machinery \nof the cell operates directly on the mRNA, converting a primary sequence \nof nucleotides into the corresponding primary sequence of amino acids, according \nto the rules of the genetic code. The genetic code converts every triplet of contiguous \nnucleotides, or codon, into one of the twenty amino acids (or into a stop signal). \nTherefore the splicing process must be exceedingly precise, since a shift of only one \nbase pair completely upsets the codon reading frame for translation. Many details \nof the splicing process are not known; in particular it is not clear how acceptor sites \n(i.e. intron/exon boundaries) and donor sites (i.e. exon/intron boundaries) are recognized \nwith extremely high accuracy. Both acceptor and donor sites are signaled \nby the existence of consensus sequences, i.e. short sequences of nucleotides which \nare highly conserved across genes and, to some extent, across species. For instance, \nmost introns start with GT and terminate with AG and additional patterns can be \ndetected in the proximity of the splice sites. 
The main problem with consensus \nsequences, in addition to their variability, is that by themselves they are insufficient \nfor reliable splice site detection. Indeed, whereas exons are relatively short with an \naverage length around 150 nucleotides, introns are often much longer, with several \nthousand seemingly random nucleotides. Therefore numerous false positive consensus \nsignals are bound to occur inside the introns. The GT dinucleotide constitutes \nroughly 5% of the dinucleotides in human DNA, but only a very small percentage of \nthese belongs to the splicing donor category, on the order of 1.5%. The dinucleotide \nAG constitutes roughly 7.5% of all the dinucleotides and only around 1% of these \nfunction as splicing acceptor sites. In addition to consensus sequences at the splice \nsites, there seem to exist a number of other weak signals (Senapathy (1989), Brunak \net al. (1992)) embedded in the 100 intron nucleotides upstream and downstream of \nan exon. Partial experimental evidence also seems to suggest that the recognition \nof the acceptor and donor boundaries of an exon may be a concerted process. \n\nIn connection with the current exponential growth of available DNA sequences and \nthe human genome project, it has become essential to be able to algorithmically \ndetect the boundaries between exons and introns and to parse entire genes. Unfortunately, \ncurrently available methods are far from performing at the level of accuracy \nrequired for a systematic parsing of the entire human genome. Most likely, gene \nparsing requires the statistical integration of several weak signals, some of which are \npoorly known, over length scales of a few hundred nucleotides. Furthermore, initial \nand terminal exons, lacking one of the splice sites, need to be treated separately. 
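The false-positive problem described in the introduction can be made concrete with a small sketch. The Python below is an illustration only, not the authors' code: the helper names dinucleotide_fraction and candidate_sites are hypothetical, and the uniform random sequence is merely a stand-in for intron DNA, so its GT and AG frequencies will not match the figures quoted for human DNA. It counts how many GT and AG dinucleotides, each a potential donor or acceptor signal, occur in a stretch that contains no splice site at all.

```python
import random


def dinucleotide_fraction(seq, dinuc):
    '''Fraction of overlapping dinucleotide positions in seq equal to dinuc.'''
    total = len(seq) - 1
    if total <= 0:
        return 0.0
    hits = sum(1 for i in range(total) if seq[i:i + 2] == dinuc)
    return hits / total


def candidate_sites(seq, dinuc):
    '''Start positions of every occurrence of dinuc (consensus-only candidates).'''
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]


# Assumption: a uniform random sequence stands in for a 3000-nucleotide intron.
rng = random.Random(0)
intron = ''.join(rng.choice('ACGT') for _ in range(3000))

for dinuc in ('GT', 'AG'):
    sites = candidate_sites(intron, dinuc)
    # Every one of these candidates is a false positive by construction.
    print(dinuc, len(sites), round(dinucleotide_fraction(intron, dinuc), 4))
```

Because every hit in this stand-in sequence is spurious, the counts suggest why a detector based on the consensus dinucleotides alone would be swamped by false positives inside long introns.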
\n\n2  HMMs FOR BIOLOGICAL PRIMARY SEQUENCES \n\nThe parsing problem has been tackled with classical statistical methods and more \nrecently using neural networks (Lapedes et al. (1988), Brunak et al. (1991)), with encouraging \nresults. Conventional neural networks, however, do not seem ideally suited to handle \nthe sort of elastic deformations introduced by evolutionary tinkering in genetic \nsequences. Another trend in recent years has been the casting of DNA and protein \nsequence problems in terms of formal languages using context free grammars, automata \nand Hidden Markov Models (HMMs). The combination of machine learning \ntechniques which can take advantage of abundant data together with new flexible \nrepresentations appears particularly promising. HMMs in particular have been used \nto model protein families and address a number of tasks such as multiple alignments, \nclassification and data base searches (Baldi et al. (1993) and (1994); Haussler et al. \n(1993); Krogh et al. (1994a); and references therein). It is the success obtained with \nthis method on protein sequences and the ease with which it can handle insertions \nand deletions that naturally suggests its application to the parsing problem. \n\nIn Krogh et al. (1994b), HMMs are applied to the problem of detecting coding/non-coding \nregions in bacterial DNA (E. coli), which is characterized by the absence \nof true introns (like that of other prokaryotes). Their approach leads to an HMM that \nintegrates both genic and intergenic regions, and can be used to locate genes fairly \nreliably. A similar approach for human DNA, that is not based on HMMs, but \nuses dynamic programming and neural networks to combine various gene finding \ntechniques, is described in Snyder and Stormo (1993). 
In this paper we take a \nfirst step towards parsing the human genome with HMMs by modeling exons (and \nflanking intron regions). \n\n[Figure 2 plot: main state entropy values against main state position, for positions 10 to 350 in two panels.] \n\nFigure 2: Entropy of emission distribution of main states. \n\nAs in the applications of HMMs to speech or protein modeling, we use left-right \narchitectures to model exon regions, intron regions or their boundaries. The architectures \ntypically consist of a backbone of main states flanked by a sequence of \ndelete states and a sequence of insert states, with the proper interconnections (see \nBaldi et al. (1994) and Krogh et al. (1994) for more details and Fig. 4 below). The \ndata base used in the experiments to be described consists of roughly 2,000 human \ninternal exons, with the corresponding adjacent introns, extracted from release 78 \nof the GenBank data base. It is essential to remark that, unlike in the previous experiments \non protein families, the exons in the data base are not directly related by \nevolution. As a result, insertions and deletions in the model should be interpreted \nin terms of formal operations on the strings rather than evolutionary events. \n\n3  EXPERIMENTS AND RESULTS \n\nA number of different HMM training experiments have been carried out using different \nclasses of sequences including exons only, flanked exons (with 50 or 100 nucleotides \non each side), introns only, flanked acceptor and flanked donor sites (with 100 \nnucleotides on each side) and slightly different architectures and learning algorithms. \nOnly a few relevant examples will be given here. 
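The quantity plotted in Fig. 2, the entropy of each main state's emission distribution, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names emission_entropy and entropy_profile are hypothetical, the logarithm base (2, giving bits) is an assumption since the paper does not state it, and the three example distributions are invented for illustration.

```python
import math


def emission_entropy(probs):
    '''Shannon entropy in bits of one emission distribution over A, C, G, T.

    Assumption: base-2 logarithm; the paper does not state the base.
    Zero-probability symbols contribute nothing and are skipped.
    '''
    return sum(-p * math.log2(p) for p in probs if p > 0)


def entropy_profile(emissions):
    '''Entropy of every main state, in backbone order.'''
    return [emission_entropy(p) for p in emissions]


# Invented example: three main states, ordered as (A, C, G, T).
emissions = [
    [0.25, 0.25, 0.25, 0.25],  # uninformative state: maximal entropy
    [0.0, 0.0, 1.0, 0.0],      # state that always emits G: zero entropy
    [0.5, 0.0, 0.5, 0.0],      # state that emits A or G equally often
]
print(entropy_profile(emissions))
```

Low-entropy positions correspond to main states that have committed to one or two nucleotides, which is why a well-learned splice site shows up as a dip in such a profile.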
\n\n[Figure 3 plots: four panels, one per nucleotide (A, C, G, T), showing emission probability against state position.] \n\nFigure 3: Emission distribution from main states. \n\nIn an early experiment, we trained a model of length 350 using 500 flanked exons, \nwith 100 nucleotides on each side, using gradient descent on the negative log-likelihood \n(Baldi and Chauvin (1994)). The exons themselves had variable lengths \nbetween 50 and 300. The entropy plot (Fig. 2), after 7 gradient descent training \ncycles, reveals that the HMM has learned the acceptor site quite well but appears \nto have some difficulties with the donor site. One possible contributing factor is \nthe high variability of the length of the training exons: the model seems to learn \ntwo donor sites, one for short exons and one for the other exons. The most striking \npattern, however, is the greater smoothness of the entropy in the exon region. In \nthe exon region, the entropy profile is weakly oscillatory, with a period of about \n20 base pairs. Discrimination and t-tests conducted on this model show that it \nis definitely capable of discriminating exon regions, but the confidence level is not \nyet sufficient to reliably search entire genomes. \n\nA slightly different model was subsequently trained, again using 500 flanked exons, \nwith the length of the exons between 100 and 200 only. The probabilities of emitting \neach one of the four nucleotides, across the main states of the model, are plotted \nin Fig. 3, after the sixth gradient descent training cycle. 
Again the donor site \nseems harder to learn than the acceptor site. Even more striking are the clear \noscillatory patterns present in the exon region, characterized by a minimal period \nof 10 nucleotides, with A and G in phase and C and T in anti-phase. \n\nFigure 4: The repeated segment of the tied model. Note that position 15 is identical \nto position 5. \n\nThe fact that the acceptor site is easier to learn could result from the fact that exons \nin the training sequences are always flanked by exactly 100 nucleotides upstream. \nTo test this hypothesis, we trained a similar model using the same sequences but \nin reverse order. Surprisingly, the model still learns the acceptor site (which is \nnow downstream from the donor site) much better than the donor site. The oscillatory \npattern in the reversed exon region is still present. The oscillations we \nobserve could also be an artifact of the method: for instance, when presented with \nrandom training sequences, oscillatory HMM solutions could appear naturally as \nlocal optima of the training procedure. To test this hypothesis, we trained a model \nusing random sequences with an average composition similar to that of the exons and found no \ndistinct oscillatory patterns. We also checked that our data base of exons does not \ncorrespond predominantly to \u03b1-helical domains of proteins. \n\nTo further test our findings, we trained a tied exon model with a hard-wired periodicity \nof 10. The tied model consists of 14 identical segments of length 10 and 5 \nadditional positions at each of the beginning and the end of the model, making a total length \nof 150. During training the segments are kept identical by tying of the parameters, \ni.e. 
the parameters are constrained to be exactly the same throughout learning, as \nin the weight sharing procedure for neural networks. The model was trained on 800 \nexon sequences of length between 100 and 200, and it was tested on 262 different \nsequences. The parameters of the repeated segment, after training, are shown in \nFig. 4. Emission probabilities are represented by horizontal bars of corresponding \nproportional length. There is a lot of structure in this segment. The most prominent \nfeature is the regular expression [^T][AT]G at positions 12-14. (The regular \nexpression means \"anything but T followed by A or T followed by G\".) The same \npattern was often found at positions with very low entropy in the \"standard models\" \ndescribed above. In order to test the significance, the tied model was compared to a \nstandard model of the same length. The average negative log-likelihood (NLL) they \nboth assign to the exon sequences and to random sequences of similar composition, \nas well as their number of parameters, are shown in the table below. \n\nModel Scores                      NLL training   NLL testing   # parameters \nStandard model with random seqs   203.2          200.3         2550 \nStandard model with real seqs     198.8          196.4         2550 \nTied model with real seqs         198.6          195.6         340 \n\nThe tied model achieves a level of performance comparable to the standard model \nbut with significantly fewer free parameters, and therefore a period of 10 in the exons \nseems to be a strong hypothesis. Note that the period of the pattern is not strictly \n10, and we found almost equally good models with a built-in period of 9 or 11. \n\nThe type of left-to-right architecture we have used is not the ideal model of an exon, \nbecause of the large length variations. 
It would be desirable to have a model with \na loop structure such that the segment can be entered as many times as necessary \nfor any given exon (see Krogh et al. (1994b) for a loop structure used for E. coli \nDNA). This is one of the future lines of research. \n\n4  CONCLUSION \n\nIn summary, we are applying HMMs and related methods to the problems of \nexon/intron modeling and human genome parsing. Our preliminary results show \nthat acceptor sites are intrinsically easier to learn than donor sites and that very \nsimple HMMs alone are not sufficient for reliable genome parsing. Most importantly, \ninteresting statistical 10-base oscillatory patterns have been detected in \nthe exon regions. If confirmed, these patterns could have significant biological and \nalgorithmic implications. These patterns could be related to the superimposition \nof several simultaneous codes (such as triplet code and frame code), and/or to the \nway DNA is wrapped around histone molecules (Beckmann and Trifonov (1991)). \nPresently, we are investigating their relationship to reading frame effects by training \nseveral HMMs using a data base of exons with the same reading frame. \n\nReferences \n\nBeckmann, J.S. and Trifonov, E.N. (1991) Splice Junctions Follow a 205-base Ladder. \nPNAS USA, 88, 2380-2383. \n\nBaldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M.A. (1994) Hidden Markov \nModels of Biological Primary Sequence Information. PNAS USA, 91, 3, 1059-1063. \n\nBaldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M.A. (1993) Hidden Markov \nModels in Molecular Biology: New Algorithms and Applications. Advances in \nNeural Information Processing Systems 5, Morgan Kaufmann, 747-754. \n\nBaldi, P. and Chauvin, Y. 
(1994) Smooth On-Line Learning Algorithms for Hidden \nMarkov Models. Neural Computation, 6, 2, 305-316. \n\nBrunak, S., Engelbrecht, J. and Knudsen, S. (1991) Prediction of Human mRNA \nDonor and Acceptor Sites from the DNA Sequence. Journal of Molecular Biology, \n220, 49-65. \n\nEngelbrecht, J., Knudsen, S. and Brunak, S. (1992) GIC rich tract in 5' end of \nhuman introns. Journal of Molecular Biology, 221, 108-113. \n\nHaussler, D., Krogh, A., Mian, I.S. and Sjolander, K. (1993) Protein Modeling \nusing Hidden Markov Models: Analysis of Globins. Proceedings of the Hawaii International \nConference on System Sciences, 1, IEEE Computer Society Press, Los \nAlamitos, CA, 792-802. \n\nKrogh, A., Brown, M., Mian, I.S., Sjolander, K. and Haussler, D. (1994a) Hidden \nMarkov Models in Computational Biology: Applications to Protein Modeling. \nJournal of Molecular Biology, 235, 1501-1531. \n\nKrogh, A., Mian, I.S. and Haussler, D. (1994b) A Hidden Markov Model that Finds \nGenes in E. Coli DNA. Technical Report UCSC-CRL-93-33, University of California \nat Santa Cruz. \n\nLapedes, A., Barnes, C., Burks, C., Farber, R. and Sirotkin, K. Application of Neural \nNetworks and Other Machine Learning Algorithms to DNA Sequence Analysis. \nIn G.I. Bell and T.G. Marr, editors. The Proceedings of the Interface Between \nComputation Science and Nucleic Acid Sequencing Workshop. Proceedings of the \nSanta Fe Institute, volume VII, pages 157-182. Addison Wesley, Redwood City, \nCA, 1988. \n\nSenapathy, P., Shapiro, M.B., and Harris, N.L. (1990) Splice Junctions, Branch \nPoint Sites, and Exons: Sequence Statistics, Identification and Applications to \nGenome Project. Patterns in Nucleic Acid Sequences, Academic Press, 252-278. 
\n\nSnyder,  E.E.  and  Stormo,  G.D.  (1993)  Identification  of coding  regions  in  genomic \nDNA  sequences:  an  application  of  dynamic  programming  and  neural  networks. \nNucleic  Acids  Research, 21, 607-613. \n\n\f", "award": [], "sourceid": 761, "authors": [{"given_name": "Pierre", "family_name": "Baldi", "institution": null}, {"given_name": "S\u00f8ren", "family_name": "Brunak", "institution": null}, {"given_name": "Yves", "family_name": "Chauvin", "institution": null}, {"given_name": "Jacob", "family_name": "Engelbrecht", "institution": null}, {"given_name": "Anders", "family_name": "Krogh", "institution": null}]}