{"title": "Discrete profile alignment via constrained information bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 1009, "page_last": 1016, "abstract": null, "full_text": "Discrete profile alignment via constrained\n information bottleneck\n\n\n\n Sean O'Rourke Gal Chechik Robin Friedman\n seano@cs.ucsd.edu gal@stanford.edu rcfriedm@ucsd.edu\n\n\n Eleazar Eskin\n eeskin@cs.ucsd.edu\n\n\n\n\n Abstract\n\n Amino acid profiles, which capture position-specific mutation prob-\n abilities, are a richer encoding of biological sequences than the in-\n dividual sequences themselves. However, profile comparisons are\n much more computationally expensive than discrete symbol com-\n parisons, making profiles impractical for many large datasets. Fur-\n thermore, because they are such a rich representation, profiles can\n be difficult to visualize. To overcome these problems, we propose a\n discretization for profiles using an expanded alphabet representing\n not just individual amino acids, but common profiles. By using an\n extension of information bottleneck (IB) incorporating constraints\n and priors on the class distributions, we find an informationally\n optimal alphabet. This discretization yields a concise, informative\n textual representation for profile sequences. Also alignments be-\n tween these sequences, while nearly as accurate as the full profile-\n profile alignments, can be computed almost as quickly as those\n between individual or consensus sequences. 
A full pairwise align-\n ment of SwissProt would take years using profiles, but less than\n 3 days using a discrete IB encoding, illustrating how discrete en-\n coding can expand the range of sequence problems to which profile\n information can be applied.\n\n\n\n1 Introduction\n\nOne of the most powerful techniques in protein analysis is the comparison of a\ntarget amino acid sequence with phylogenetically related or homologous proteins.\nSuch comparisons give insight into which portions of the protein are important by\nrevealing the parts that were conserved through natural selection. While mutations\nin non-functional regions may be harmless, mutations in functional regions are often\nlethal. For this reason, functional regions of a protein tend to be conserved between\norganisms while non-functional regions diverge.\n\n Department of Computer Science and Engineering, University of California San Diego\n Department of Computer Science, Stanford University\n\n\f\nMany of the state-of-the-art protein analysis techniques incorporate homologous\nsequences by representing a set of homologous sequences as a probabilistic profile,\na sequence of the marginal distributions of amino acids at each position in the\nsequence. For example, Yona et al.[10] uses profiles to align distant homologues from\nthe SCOP database[3]; the resulting alignments are similar to results from structural\nalignments, and tend to reflect both secondary and tertiary protein structure. The\nPHD algorithm[5] uses profiles purely for structure prediction. PSIBLAST[6] uses\nthem to refine database searches.\n\nAlthough profiles provide a lot of information about the sequence, the use of pro-\nfiles comes at a steep price. While extremely efficient string algorithms exist for\naligning protein sequences (Smith-Waterman[8]) and performing database queries\n(BLAST[6]), these algorithms operate on strings and are not immediately applica-\nble to profile alignment or profile database queries. 
While profile-based methods can be substantially more accurate than sequence-based ones, they can require at least an order of magnitude more computation time, since substitution penalties must be calculated by computing distances between probability distributions. This makes profiles impractical for use with large bioinformatics databases like SwissProt, which recently passed 150,000 sequences. Another drawback of profiles as compared to string representations is that a sequence of 20-dimensional vectors is much more difficult to interpret visually than a sequence of letters.\n\nDiscretizing the profiles addresses both of these problems. First, once a profile is represented using a discrete alphabet, alignment and database search can be performed using the efficient string algorithms developed for sequences. For example, when aligning sequences of 1000 elements, runtime decreases from 20 seconds for profiles to 2 seconds for discrete sequences. Second, by representing each class as a letter, discretized profiles can be presented in plain text like the original or consensus sequences, while conveying more information about the underlying profiles. This makes them more accurate than consensus sequences, and more dense than sequence logos (see figure 1). To make this representation intuitive, we want the discretization not only to minimize information loss, but also to reflect biologically meaningful categories by forming a superset of the standard 20-character amino acid alphabet. For example, we use \"A\" and \"a\" for strongly- and weakly-conserved Alanine. This formulation demands two types of constraints: similarities of the centroids to predefined values, and specific structural similarities between strongly- and weakly-conserved variants. We show below how these constraints can be added to the original IB formalism.\n\nIn this paper, we present a new discrete representation of proteins that takes into account information from homologues. 
The main idea behind our approach is to compress the space of probabilistic profiles in a data-dependent manner by clustering the actual profiles and representing them by a small alphabet of distributions. Since this discretization removes some of the information carried by the full profiles, we cluster the distributions in a way that directly targets minimizing the information loss. This is achieved using a variant of Information Bottleneck (IB)[9], a distributional clustering approach for informationally optimal discretization.\n\nWe apply our algorithm to a subset of MEROPS[4], a database of peptidases organized structurally by family and clan, and analyze the results in terms of both information loss and alignment quality. We show that multivariate IB in particular preserves much of the information in the original profiles using a small number of classes. Furthermore, optimal alignments for profile sequences encoded with these classes are much closer to the original profile-profile alignments than are alignments between the seed proteins. 
IB discretization is therefore an attractive way to gain some of the additional sensitivity of profiles with less computational cost.\n\n\f\n[Figure 1 here: (a) a matrix of position-specific amino acid probabilities, (b) the corresponding sequence logo, and (c) the three textual encodings of the same region:\n P00790 Seq.: ---EAPT---\n Consensus Seq.: NNDEAASGDF\n IB Seq.: NNDeaptGDF]\n\nFigure 1: (a) Profile, (b) sequence logo[2], and (c) textual representations for part of an alignment of Pepsin A precursor P00790, showing IB's concision compared to profiles and logos, and its precision compared to single sequences.\n\n\n2 Information Bottleneck\n\nInformation Bottleneck [9] is an information-theoretic approach for distributional clustering. Given a joint distribution p(X, Y) of two random variables X and Y, the goal is to obtain a compressed representation C of X, while preserving the information about Y. 
The two goals of compression and information preservation are quantified by the same measure of mutual information,\n\n I(X; Y) = sum_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ],\n\nand the problem is therefore defined as the constrained optimization problem\n\n min_{p(c|x) : I(C; Y) > K} I(C; X),\n\nwhere K is a constraint on the level of information preserved about Y; the solution must also obey the constraints p(y|c) = sum_x p(y|x) p(x|c) and p(y) = sum_x p(y|x) p(x). This constrained optimization can be reformulated using Lagrange multipliers, and turned into a tradeoff optimization function with Lagrange multiplier β:\n\n min_{p(c|x)} L := I(C; X) - β I(C; Y) (1)\n\nAs an unsupervised learning technique, IB aims to characterize the set of solutions for the complete spectrum of constraint values K. This set of solutions is identical to the set of solutions of the tradeoff optimization problem obtained for the spectrum of β values.\n\nWhen X is discrete, its natural compression is fuzzy clustering. In this case, the problem is not convex and cannot be guaranteed to contain a single global minimum. Fortunately, its solutions can be characterized analytically by a set of self-consistent equations. These self-consistent equations can then be used in an iterative algorithm that is guaranteed to converge to a local minimum. While the optimal solutions of the IB functional are in general soft clusters, in practice, hard cluster solutions are sometimes more easily interpreted. 
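To make the iterative algorithm concrete, the self-consistent equations can be sketched as the following update loop (a minimal NumPy illustration under stated assumptions, not the authors' implementation; the function name, the fixed beta, and the random soft initialization are our choices):

```python
import numpy as np

def iterative_ib(p_xy, n_clusters, beta, n_iter=100, seed=0):
    # p_xy: joint distribution p(x, y) as an (n_x, n_y) array summing to 1
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                       # p(x)
    p_y_x = p_xy / p_x[:, None]                  # p(y|x)
    q = rng.random((p_xy.shape[0], n_clusters))
    p_c_x = q / q.sum(axis=1, keepdims=True)     # initial soft p(c|x)
    eps = 1e-12
    for _ in range(n_iter):
        p_c = p_c_x.T @ p_x                      # p(c) = sum_x p(c|x) p(x)
        p_xc = p_c_x * p_x[:, None]              # p(x, c)
        p_y_c = (p_xc.T @ p_y_x) / (p_c[:, None] + eps)   # p(y|c)
        # KL(p(y|x) || p(y|c)) for every pair (x, c)
        log_ratio = np.log(p_y_x[:, None, :] + eps) - np.log(p_y_c[None, :, :] + eps)
        kl = (p_y_x[:, None, :] * log_ratio).sum(axis=2)
        # self-consistent update: p(c|x) proportional to p(c) exp(-beta * KL)
        logits = np.log(p_c + eps)[None, :] - beta * kl
        logits -= logits.max(axis=1, keepdims=True)
        w = np.exp(logits)
        p_c_x = w / w.sum(axis=1, keepdims=True)
    return p_c_x
```

Larger beta weights information preservation more heavily and hardens the assignments toward the hard-cluster solutions discussed above.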
A series of algorithms was developed for hard IB, including an algorithm that can be viewed as a one-step look-ahead sequential version of K-Means [7].\n\nTo apply IB to the problem of profile discretization discussed here, X is a given set of probabilistic profiles obtained from a set of aligned sequences and Y is the set of 20 amino acids.\n\n\f\n2.1 Constraints on centroids' semantics\n\nThe application studied in this paper differs from standard IB applications in that we are interested in obtaining a representation that is both efficient and biologically meaningful. This requires that we add two kinds of constraints on clusters' distributions, discussed below.\n\nFirst, some clusters' meanings are naturally determined by limiting them to correspond to the common 20-letter alphabet used to describe amino acids. From the point of view of distributions over amino acids, each of these symbols is used today as the delta function distribution which is fully concentrated on a single amino acid. For the goal of finding an efficient representation, we require the centroids to be close to these delta distributions. More generally, we require the centroids to be close to some predefined values ĉ_i, thus adding constraints to the IB target function of the form D_KL[p(y|ĉ_i)||p(y|c_i)] < K_i for each constrained centroid. While solving the constrained optimization problem is difficult, the corresponding tradeoff optimization problem can be made very similar to standard IB. With the additional constraints, the IB trade-off optimization problem becomes\n\n min_{p(c|x)} L = I(C; X) - β I(C; Y) + sum_{c_i ∈ C} γ(c_i) D_KL[p(y|ĉ_i)||p(y|c_i)] . (2)\n\nWe now use the following identity\n\n sum_{x,c} p(x, c) D_KL[p(y|x)||p(y|c)]\n = sum_x p(x) sum_y p(y|x) log p(y|x) - sum_c p(c) sum_y log p(y|c) sum_x p(y|x) p(x|c)\n = -H(Y|X) + H(Y|C) = I(X; Y) - I(Y; C)\n\nto rewrite the IB functional of Eq. (1) as\n\n L = I(C; X) + β sum_{c ∈ C} sum_{x ∈ X} p(x, c) D_KL[p(y|x)||p(y|c)] - β I(X; Y) .\n\nWhen γ(c_i) ≪ 1 we can similarly rewrite Eq. (2) as\n\n L = I(C; X) + β sum_{x ∈ X} p(x) sum_{c_i ∈ C} p(c_i|x) D_KL[p(y|x)||p(y|c_i)] (3)\n + sum_{c_i ∈ C} γ(c_i) D_KL[p(y|ĉ_i)||p(y|c_i)] - β I(X; Y)\n\n = I(C; X) + β sum_{x' ∈ X'} p(x') sum_{c_i ∈ C} p(c_i|x') D_KL[p(y|x')||p(y|c_i)] - β I(X; Y)\n\nThe optimization problem therefore becomes equivalent to the original IB problem, but with a modified set of samples x' ∈ X', containing X plus additional \"pseudo-counts\" or biases. This is similar to the inclusion of priors in Bayesian estimation. Formulated this way, the biases can be easily incorporated in standard IB algorithms by adding additional pseudo-counts x' = ĉ_i with prior probability p(x') ∝ γ(c_i).\n\n2.2 Constraints on relations between centroids\n\nWe want our discretization to capture correlations between strongly- and weakly-conserved variants of the same symbol. This can be done with standard IB using\n\f\nseparate classes for the alternatives. However, since the distributions of other amino acids in these two variants are likely to be related, it is preferable to define a single shared prior for both variants, and to learn a model capturing their correlation.\n\nFriedman et al.[1] describe multivariate information bottleneck (mIB), an extension of information bottleneck to joint distributions over several correlated input and cluster variables. For profile discretization, we define two compression variables connected as in Friedman's \"parallel IB\": an amino acid class C ∈ {A, C, . . .} with an associated prior, and a strength S ∈ {0, 1}. Since this model correlates strong and weak variants of each category, it requires fewer priors than simple IB. 
It also has fewer parameters: a multivariate model with n_s strengths and n_c classes has as many categories as a univariate one with n'_c = n_s n_c classes, but has only n_s + n_c - 2 free parameters for each x, instead of n_s n_c - 1.\n\n\n3 Results\n\nTo test our method, we apply it to data from MEROPS[4]. Proteins within the same family typically admit high-confidence alignments; those from different families in the same clan less so. For each protein, we generate a profile from alignments obtained from PSIBLAST with standard parameters, and compute IB classes from a large subset of these profiles using the priors described below. Finally, we encode and align pairs of profiles using the learned classes, comparing the results to those obtained both with the full profiles and with just the original sequences.\n\nFor univariate IB, we have used four types of priors reflecting biases on stability, physical properties, and observed substitution frequencies: (1) Strongly conserved classes, in which a single symbol is seen with probability S. These are the only priors used for multivariate IB. (2) Weakly conserved classes, in which a single symbol occurs with probability W; a further (S - W) of the probability mass is distributed among symbols with non-negative log-odds of substitution. (3) Physical trait classes, in which all symbols with the same hydrophobicity, charge, polarity, or aromaticity together occur uniformly with probability S. (4) A uniform class, in which all symbols occur with their background probabilities.\n\nThe choice of S and W depends upon both the data and one's prior notions of \"strong\" and \"weak\" conservation. 
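The strongly conserved, weakly conserved, and uniform prior types can be sketched as follows (a simplified illustration: the flat background frequencies and the explicit set of substitutable symbols are placeholder assumptions, whereas the paper derives the latter from non-negative substitution log-odds):

```python
import numpy as np

AMINO_ACIDS = list('ACDEFGHIKLMNPQRSTVWY')

def strong_prior(aa, background, s=0.8):
    # strongly conserved class: symbol aa carries probability s,
    # with the remaining mass spread by background frequency
    p = (1.0 - s) * np.asarray(background, dtype=float)
    p[AMINO_ACIDS.index(aa)] += s
    return p / p.sum()

def weak_prior(aa, background, substitutable, s=0.8, w=0.5):
    # weakly conserved class: symbol aa carries probability w, a further
    # (s - w) is shared by its conservative substitutions, rest background
    p = (1.0 - s) * np.asarray(background, dtype=float)
    for b in substitutable:
        p[AMINO_ACIDS.index(b)] += (s - w) / len(substitutable)
    p[AMINO_ACIDS.index(aa)] += w
    return p / p.sum()

def uniform_prior(background):
    # uniform class: every symbol at its background probability
    p = np.asarray(background, dtype=float)
    return p / p.sum()
```

For example, with a flat background of 0.05, `weak_prior('A', [0.05] * 20, ['G', 'S', 'T'])` puts 0.51 on A, 0.11 on each of G, S, and T, and 0.01 on every other symbol.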
Unbiased IB on a large subset of MEROPS with several different numbers of unbiased categories yielded a mean frequency approaching 0.7 for the most common symbol in the 20 most sharply-distributed classes (0.59 ± 0.13 for |C| = 52; 0.66 ± 0.12 for |C| = 80; 0.70 ± 0.09 for |C| = 100). Similarly, the next 20 classes have a mean most-likely-symbol frequency around 0.4. These numbers can be seen as lower bounds on S and W. We therefore chose S = 0.8 and W = 0.5, reflecting a bias toward stronger definitions of conservation than those inferred from the data.\n\n\n3.1 Iterative vs. Sequential IB\n\nSlonim[7] compares several IB algorithms, concluding that the best hard clustering results are obtained with a sequential method (sIB), in which elements are first assigned to a fixed number of clusters and then individually moved from cluster to cluster while calculating a 1-step lookahead score, until the score converges. While sIB is more efficient than exhaustive bottom-up clustering, it neglects information about the best potential candidates to be assigned to a cluster, yielding slow convergence. Furthermore, updates are expensive, since each requires recomputing the class centroids. Therefore, instead of sIB, we use iterative IB (iIB) with hard clustering, which only recomputes the centroids after performing all updates. 
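The diagnostic used above to set S and W can be reproduced in a few lines (a sketch; `centroids` is assumed to be a |C| x 20 array of the learned class distributions p(y|c)):

```python
import numpy as np

def conservation_summary(centroids, group=20):
    # frequency of the most common symbol in each class,
    # sorted from most to least sharply distributed
    top = np.sort(centroids.max(axis=1))[::-1]
    # mean and std over the sharpest `group` classes, then the next `group`
    sharpest = (top[:group].mean(), top[:group].std())
    next_group = (top[group:2 * group].mean(), top[group:2 * group].std())
    return sharpest, next_group
```

Applied to unbiased centroids, the two means are the empirical lower bounds on S and W discussed above.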
This reduces the convergence time from several hours to around ten minutes.\n\n\f\n[Figure 2 here: two panels of stretched sequence logos, one logo per learned category.]\n\nFigure 2: Stretched sequence logos for categories found by iIB (top) and sIB (bottom), ordered by primary symbol and decreasing information.\n\nSince Slonim argues that sIB outperforms soft iIB in part because sIB's discrete steps allow it to escape local optima, we expect hard iIB to have similar behavior. To test this, we applied three complete sIB iterations initialized with categories from multivariate iIB. sIB decreased the loss L by only about 3 percent (from 0.380 to 0.368), with most of this gain occurring in the first iteration. Also, the resulting categories were mostly conserved up to exchanging labels, suggesting that hard iIB finds categories similar to the sIB ones (see figure 2).\n\n\n3.2 Information Loss and Alignments\n\nOne measure of the quality of the resulting clusters is the amount of information about Y lost through discretization, I(Y; X) - I(Y; C). Figure (3b) shows the effect on information loss of varying the prior weight w with three sets of priors: 20 strongly conserved symbols and one background; these plus 20 weakly conserved symbols; and these plus 10 categories for physical characteristics. As expected, both decreasing the number of categories and increasing the number or weight of priors increases information loss. However, with a fixed number of free categories, information loss is nearly independent of prior strength, suggesting that our priors correspond to actual regularities in the data. 
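For a hard discretization, where each profile x is mapped to a single class, the quantity I(Y; X) - I(Y; C) can be computed directly (a minimal sketch with toy inputs, not the evaluation code used in the paper):

```python
import numpy as np

def mutual_information(p_ab):
    # I(A; B) in nats for a joint distribution table p_ab
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0
    return float((p_ab[nz] * np.log(p_ab[nz] / (pa @ pb)[nz])).sum())

def information_loss(p_xy, assignment, n_clusters):
    # I(Y; X) - I(Y; C), where assignment[x] is the class of profile x
    p_cy = np.zeros((n_clusters, p_xy.shape[1]))
    for x, c in enumerate(assignment):
        p_cy[c] += p_xy[x]
    return mutual_information(p_xy) - mutual_information(p_cy)
```

Merging profiles with identical conditionals p(y|x) loses no information, while merging dissimilar profiles loses some of I(Y; X).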
Finally, note that despite having fewer free parameters than the univariate models, mIB achieves comparable performance, suggesting that our decomposition into conserved class and degree of conservation is reasonable.\n\nSince we are ultimately using these classes in alignments, the true cost of discretization is best measured by the amount of change between profile and IB alignments, and the significance of this change. The latter is important because the best path can be very sensitive to small changes in the sequences or scoring matrix; if two radically different alignments have similar scores, neither is clearly \"correct\". We can represent an alignment as a pair of index-insertion sequences, one for each profile sequence to be aligned (e.g. \"1,2, , ,3,...\" versus \"1, ,2, ,3,...\"). The edit distance between these sequences for two alignments then measures how much they differ. However, even when this distance is large, the difference between two alignments may not be significant if both choices' scores are nearly the same. That is, if the optimal profile alignment's score is only slightly lower than the optimal IB class alignment's score as computed with the original profiles, either might be correct.\n\nFigure 4 shows at left both the edit distance and score change per length between profile alignments and those using IB classes, mIB classes, and the original sequences with the BLOSUM62 scoring matrix. To compare the profile and sequence alignments, profiles corresponding to gaps in the original sequences are replaced by gaps, and the resulting pairs of aligned gaps in the profile-profile alignment are removed. We consider both sequences from the same family and those from other families in the same clan, the former being more similar than the latter and therefore having better alignments. Assuming the profile-profile alignment is closest to the \"true\" alignment, iIB alignment significantly outperforms sequence alignment in both cases, with mIB showing a slight additional improvement. At right is the ROC curve for detecting superfamily relationships between profiles from different families based on alignment scores, showing that while IB fares worse than profiles, simple sequences perform essentially at chance.\n\n\f\n[Figure 3 here: (a) log-log plot of running time (s) versus sequence length for profile-profile and IB-profile alignment, with fitted curves 2e-5 * L^2 + 0.1 and 3e-3 * L - 0.1; (b) plot of I(Y; X) - I(Y; C) versus prior weight w for 21/52, 41/52, and 51/52 priors.]\n\nFigure 3: (a) Running times for profile-profile versus IB-profile alignment, showing speedups of 3.5-12.5x for pairwise global alignment. (b) I(Y; X) - I(Y; C) as a function of w for different groups of priors. The information loss for 52 categories without priors is 0.359; for 10, 0.474.\n\n                     Edit distance      Score change\n    Same Superfamily\n      mIB            0.154 ± 0.182      0.086 ± 0.166\n      IB             0.170 ± 0.189      0.107 ± 0.198\n      BLOSUM         0.390              0.065\n    Same Clan\n      mIB            0.124 ± 0.209      0.019 ± 0.029\n      IB             0.147 ± 0.232      0.022 ± 0.037\n      BLOSUM         0.360              0.062\n\nFigure 4: Left: alignment differences for IB models and sequence alignment, within and between superfamilies. Right: ROC curve for same/different superfamily classification by alignment score.\n\nFinally, figure 3a compares the performance of profile and IB alignment for different sequence lengths. To use a profile alphabet for novel alignments, we must map each input profile to the closest IB class. To be consistent with Yona[10], we use the Jensen-Shannon (JS) distance with mixing coefficient 0.5 rather than the KL distance optimized in creating the categories. Aligning two sequences of lengths n and m requires computing the |C|(n+m) JS distances between each profile and each category, a significant improvement over the mn distance computations required for profile-profile alignment when |C| ≪ min(m, n)/2. 
Our results show that JS distance computations dominate running time, since IB alignment time scales linearly with the input size, while profile alignment scales quadratically, yielding an order of magnitude improvement for typical 500- to 1000-residue sequences.\n\n\f\n4 Discussion\n\nWe have described a discrete approximation to amino acid profiles, based on minimizing information loss, that allows profile information to be used for alignment and search without additional computational cost compared to simple sequence alignment. Alignments of sequences encoded with a modest number of classes correspond to the original profile alignments significantly better than alignments of the original sequences. In addition to minimizing information loss, the classes can be constrained to correspond to the standard amino acid representation, yielding an intuitive, compact textual form for profile information.\n\nOur model is useful in three ways: (1) it makes it possible to apply existing fast discrete algorithms to arbitrary continuous sequences; (2) it models rich conditional distribution structures; and (3) its models can incorporate a variety of class constraints. We can extend our approach in each of these directions. For example, adjacent positions are highly correlated: the average entropy of a single profile is 0.99, versus 1.23 for an adjacent pair. Therefore pairs can be represented more compactly than the cross-product of a single-position alphabet. More generally, we can encode arbitrary conserved regions and still treat them symbolically for alignment and search. Other extensions include incorporating structural information in the input representation; assigning structural significance to the resulting categories; and learning the structure of multivariate IB's underlying model.\n\n\nReferences\n\n [1] Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information bottleneck. 
In Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001), pages 152-161, San Francisco, CA, 2001. Morgan Kaufmann Publishers.\n\n [2] G. E. Crooks, G. Hon, J. M. Chandonia, and S. E. Brenner. WebLogo: a sequence logo generator. Genome Research, in press, 2004.\n\n [3] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536-40, 1995.\n\n [4] N. D. Rawlings, D. P. Tolle, and A. J. Barrett. MEROPS: the peptidase database. Nucleic Acids Res., 32 Database issue:D160-4, 2004.\n\n [5] B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232:584-99, 1993.\n\n [6] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215(3):403-10, October 1990.\n\n [7] Noam Slonim. The Information Bottleneck: Theory and Applications. PhD thesis, Hebrew University, Jerusalem, Israel, 2002.\n\n [8] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.\n\n [9] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368-77, 1999.\n\n[10] Golan Yona and Michael Levitt. Within the twilight zone: A sensitive profile-profile comparison tool based on information theory. Journal of Molecular Biology, 315:1257-75, 2002.\n\f\n", "award": [], "sourceid": 2612, "authors": [{"given_name": "Sean", "family_name": "O'rourke", "institution": null}, {"given_name": "Gal", "family_name": "Chechik", "institution": null}, {"given_name": "Robin", "family_name": "Friedman", "institution": null}, {"given_name": "Eleazar", "family_name": "Eskin", "institution": null}]}