{"title": "Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale", "book": "Advances in Neural Information Processing Systems", "page_first": 1465, "page_last": 1472, "abstract": null, "full_text": "Identifying protein-protein interaction sites on a genome-wide scale\n\nHaidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag\n\nComputer Science Department, Stanford University, CA 94305\n{haidong, koller}@cs.stanford.edu\nCenter for Studies in Physics and Biology, Rockefeller University, NY 10021\neran@cs.stanford.edu\nDepartment of Genome Sciences, University of Washington, WA 98195\nasa@gs.washington.edu\nDepartment of Biochemistry, Stanford University, CA 94305\nbrutlag@stanford.edu\n\n\nAbstract\n\nProtein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on a probabilistic relational model that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate inference procedure for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high affinity, and can therefore lead to an interaction. 
We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we find that 2/3 of the predicted active motifs correspond to actual interaction sites.\n\n\n1     Introduction\n\nMany cellular functions are carried out through physical interactions between proteins. Discovering the protein interaction map can therefore help to better understand the workings of the cell. Indeed, there has been much work recently on developing high-throughput methods to produce a more complete map of protein-protein interactions [1, 2, 3].\n      Interactions between two proteins arise from physical interactions between small regions on the surface of the proteins [4] (see Fig. 2(b)). Finding interaction sites is an important task, which is of particular relevance to drug design. There is currently no high-throughput experimental method to achieve this goal, so computational methods are required. Existing methods either require solving a protein's 3D structure (e.g., [5]), and therefore are computationally very costly and not applicable on a genome-wide scale, or use known interaction sites as training data (e.g., [6]), which are relatively scarce and hence have poor coverage. 
Other work focuses on refining the highly noisy high-throughput interaction maps [7, 8, 9], or on assessing the confidence levels of the observed interactions [10].\n      In this paper, we propose a computational method for predicting protein interactions and the sites at which the interactions take place, which uses as input only high-throughput protein-protein interaction data and the protein sequences. In particular, our method assumes no knowledge of the 3D protein structure, or of the sites at which binding occurs.\n\n[Figure 1, panels (a) and (b)]\n\nFigure 1: (a) Simple illustration of our assumptions for protein-protein interactions. The small elements denote motif occurrences on proteins, with red denoting active and gray denoting inactive motifs. (b) A fragment of our probabilistic model, for the proteins $P_1$, $P_2$, $P_5$. We use yellow to denote an assignment of the value true, and black to denote the value false; full circles denote an assignment observed in the data, and patterned circles an assignment hypothesized by our algorithm. The dependencies involving inactive motif pairs were removed from the graph because they do not affect the rest of the model.\n\n    Our approach is based on the assumption that interaction sites can be described using a limited repertoire of conserved sequence motifs [11]. This is a reasonable assumption since interaction sites are significantly more conserved than the rest of the protein surface [12]. Given a protein interaction map, our method tries to explain the observed interactions by identifying a set of sites of motif occurrence on every pair of interacting proteins through which the interaction is mediated. To understand the intuition behind our approach, consider the example of Fig. 1(a). Here, the interaction pattern of the protein $P_1$ can best be explained using the motif pair $a$, $b$, where $a$ appears in $P_1$ and $b$ in the proteins $P_2$, $P_3$, $P_4$ but not in $P_5$. 
By contrast, the motif pair $d$, $b$ is not as good an explanation, because $d$ also appears in $P_5$, which has a different interaction pattern. In general, our method aims to identify motif pairs that have high affinity, potentially leading to interaction between protein pairs that contain them.\n    However, a sequence motif might be used for a different purpose, and not give rise to an active binding site; it might also be buried inside the protein, and thus be inaccessible for interaction. Thus, the appearance of an appropriate motif does not always imply interaction. A key feature of our approach is that we allow each motif occurrence in a protein to be either active or inactive. Interactions are then induced only by the interactions of high-affinity active motifs in the two proteins. Thus, in our example, the motif $d$ in $P_2$ is inactive, and hence does not lead to an interaction between $P_2$ and $P_4$, despite the affinity between the motif pair $c$, $d$. We note that Deng et al. [8] proposed a somewhat related method for genome-wide analysis of protein interaction data, based on protein domains. However, their method is focused on predicting protein-protein interactions and not on revealing the site of interaction, and they do not allow for the possibility that some domains are inactive.\n    Our goal is thus to identify two components: the affinities between pairs of motifs, and the activity of the occurrences of motifs in different proteins. Our algorithm addresses this problem by using the framework of Bayesian networks [13] and probabilistic relational models [14], which allows us to handle the inherent noise in the protein interaction data and the uncertain relationship between interactions and motif pairs. We construct a model encoding our assumption that protein interactions are induced by the interactions of active motif pairs. 
We then use the EM algorithm [15] to fill in the details of the model, learning both the motif affinities and activities from the observed data of protein-protein interactions and protein motif occurrences. We address the computational complexity of the E-step in these large, densely connected models by using an approximate inference procedure based on branch-and-bound.\n     We evaluated our model on protein-protein interactions in yeast and Prosite motifs [11]. As a basic performance measure, we evaluated the ability of our method to predict new protein-protein interactions, showing that it achieves better performance than several other models. In particular, our results validate our assumption that we can explain interactions via the interactions of active sequence motifs. More importantly, we analyze the ability of our method to discover the mechanism by which the interaction occurs. Finally, we examined co-crystallized protein pairs where the 3D structure of the interaction is known, so that we can determine the sites at which the interaction took place. We show that our active motifs are more likely to participate in interactions.\n\n\n2    The Probabilistic Model\n\nThe basic entities in our probabilistic model are the proteins and the set of sequence motifs that can mediate protein interactions. Our model therefore contains a set of protein entities $P = \\{P_1, \\ldots, P_n\\}$, with the motifs that occur in them. Each protein $P$ is associated with the set of motifs that occur in it, denoted by $P.M$. As we discussed, a key premise of our approach is that a specific occurrence of a sequence motif may or may not be active. Thus, each motif occurrence $a \\in P.M$ is associated with a binary-valued variable $P.A_a$, which takes the value true if motif $a$ is active in protein $P$ and false otherwise. 
We structure the prior probability as $P(P.A_a = \\text{true}) = \\min\\{0.8, \\frac{3 + 0.1|P.M|}{|P.M|}\\}$, to capture our intuition that the number of active motifs in a protein is roughly a constant fraction of the total number of motifs in the protein, but that even proteins with few motifs tend to have at least some number of active motifs.\n     A pair of active motifs in two proteins can potentially bind and induce an interaction between the corresponding proteins. Thus, in our model, a pair of proteins interact if each contains an active motif, and this pair of motifs bind to each other. The probability with which two motifs bind to each other is called their affinity. We encode this assumption by including in our model entities $T_{ij}$ corresponding to a pair of proteins $P_i$, $P_j$. For each pair of motifs $a \\in P_i.M$ and $b \\in P_j.M$, we introduce a variable $T_{ij}.A_{ab}$, which is a deterministic AND of the activity of these two motifs. Intuitively, this variable represents whether the pair of motifs can potentially interact. The probability with which two active motif occurrences bind is their affinity. We model the binding event between two motif occurrences using a variable $T_{ij}.B_{ab}$, and define $P(T_{ij}.B_{ab} = \\text{true} \\mid T_{ij}.A_{ab} = \\text{true}) = \\theta_{ab}$ and $P(T_{ij}.B_{ab} = \\text{true} \\mid T_{ij}.A_{ab} = \\text{false}) = 0$, where $\\theta_{ab}$ is the affinity between motifs $a$ and $b$. This model reflects our assumption that two motif occurrences can bind only if they are both active, but their actual binding probability depends on their affinity. Note that this affinity is a feature of the motif pair and does not depend on the proteins in which they appear.\n     We must also account for interactions that are not explained by our set of motifs, whether because of false positives in the data, or because of inadequacies of our model or of our motif set. 
Thus, we add a spurious binding variable $T_{ij}.S$, for cases where an interaction between $P_i$ and $P_j$ exists, but cannot be explained well by our set of active motifs. The probability that a spurious binding occurs is given by $P(T_{ij}.S = \\text{true}) = \\theta_S$.\n     Finally, we observe an interaction between two proteins if and only if some form of binding occurs, whether by a motif pair or a spurious binding. Thus, we define a variable $T_{ij}.O$, which represents whether protein $i$ was observed to interact with protein $j$, to be a deterministic OR of all the binding variables $T_{ij}.S$ and $T_{ij}.B_{ab}$. Overall, $T_{ij}.O$ is a noisy-OR [13] of all motif pair variables $T_{ij}.A_{ab}$.\n     Note that our model accounts for both types of errors in the protein interaction data. False negatives (missing interactions) in the data are addressed through the fact that the presence of an active motif pair only implies that binding takes place with some probability. False positives (wrong interactions) in the data are addressed through the introduction of the spurious interaction variables.\n     The full model defines a joint probability distribution over the entire set of attributes:\n\n$$P(P.A, T.A, T.B, T.S, T.O) = \\prod_{i} \\prod_{a \\in P_i.M} P(P_i.A_a) \\; \\prod_{ij} \\Big[ P(T_{ij}.S)\\, P(T_{ij}.O \\mid T_{ij}.B, T_{ij}.S) \\prod_{a \\in P_i.M,\\, b \\in P_j.M} P(T_{ij}.A_{ab} \\mid P_i.A_a, P_j.A_b)\\, P(T_{ij}.B_{ab} \\mid T_{ij}.A_{ab}) \\Big]$$\n\nwhere each of these conditional probability distributions is as specified above. We use $\\theta$ to denote the entire set of model parameters $\\{\\theta_{ab}\\}_{a,b} \\cup \\{\\theta_S\\}$. An instantiation of our probabilistic model is illustrated in Fig. 1(b).\n\n\n3    Learning the Model\n\nWe now turn to the task of learning the model from the data. 
In a typical setting, we are given as input a protein interaction data set, specifying a set of proteins $P$ and a set of observed interacting pairs $T.O$. We are also given a set of potentially relevant motifs, and the occurrences of these motifs in the different proteins in $P$. Thus, all the variables except for the $O$ variables are hidden. Our learning task is thus twofold: we need to infer the values of the hidden variables, both the activity variables $P.A$, $T.A$, and the binding variables $T.B$, $T.S$; we also need to find a setting of the model parameters $\\theta$, which specify the motif affinities. We use a variant of the EM algorithm [15] to find both an assignment to the parameters $\\theta$, and an assignment to the motif variables $P.A$, which is a local maximum of the likelihood function $P(T.O, P.A \\mid \\theta)$. Note that, to maximize this objective, we search for a MAP assignment to the motif activity variables, but sum out over the other hidden variables. This design decision is reasonable in our setting, where determining motif activities is an important goal; it is a key assumption for our computational procedure.\n     As in most applications of EM, our main difficulty arises in the E-step, where we need to compute the distribution over the hidden variables given the settings of the observed variables and the current parameter settings. In our model, any two motif variables (both within the same protein and across different proteins) are correlated, as there exists a path of influence between them in the underlying Bayesian network (see Fig. 1(b)). These correlations make the task of computing the posterior distribution over the hidden variables intractable, and we must resort to an approximate computation. 
While we could apply a general purpose approximate inference algorithm such as loopy belief propagation [16], such methods may not converge in densely connected models such as this one, and there are few guarantees on the quality of the results even if they do converge. Fortunately, our model turns out to have additional structure that we can exploit. We now describe an approximate inference algorithm that is tailored to our model, and is guaranteed to converge to a (strong) local maximum.\n     Our first observation is that the only variables that correlate the different protein pairs $T_{ij}$ are the motif variables $P.A$. Given an assignment to these activity variables, the network decomposes into a set of independent subnetworks, one for each protein pair. Based on this observation, we divide our computation of the E-step into two parts. In the first, we find an assignment to the motif variables in each protein, $P.A$; in the second, we compute the posterior probability over the binding motif pair variables $T.B$, $T.S$, given the assignment to the motif variables.\n     We begin by describing the second phase. We observe that, as all the motif pair variables, $T.A$, are fully determined by the motif variables, the only variables left to reason about are the binding variables $T.B$ and $T.S$. The variables for any pair $T_{ij}$ are independent of the rest of the model given the instantiation to $T.A$ and the interaction evidence. That fact, combined with the noisy-OR form of the interaction, allows us to compute the posterior probability required in the E-step exactly and efficiently. Specifically, the computation for the variables associated with a particular protein pair $T_{ij}$ is as follows, where we omit the common prefix $T_{ij}$ to simplify notation. If $A_{ab}$ = false, then $P(B_{ab} = \\text{true} \\mid A_{ab} = \\text{false}, O, \\theta) = 0$. 
Otherwise, if $A_{ab}$ = true, then\n\n$$P(B_{ab} = \\text{true} \\mid A, O, \\theta) = \\frac{P(B_{ab} = \\text{true} \\mid A, \\theta)\\, P(O \\mid B_{ab} = \\text{true}, A, \\theta)}{P(O \\mid A, \\theta)}.$$\n\nThe first term in the numerator is simply the motif affinity $\\theta_{ab}$; the second term is 1 if $O$ = true and 0 otherwise. The denominator can easily be computed as $P(O = \\text{true} \\mid A, \\theta) = 1 - (1 - \\theta_S) \\prod_{A_{ab} = \\text{true}} (1 - \\theta_{ab})$. The computation for $P(S)$ is very similar.\n   We now turn to the first phase, of finding a setting to all of the motif variables. Unfortunately, as we discussed, the model is highly interconnected, and finding an optimal joint setting to all of these variables $P.A$ is intractable. We thus approximate finding this joint assignment using a method that exploits our specific structure. Our method iterates over proteins, finding in each iteration the optimal assignment to the motif variables of each protein given the current assignment to the motif activities in the remaining proteins. The process repeats, iterating over proteins, until convergence.\n   As we discussed, the likelihood of each assignment to $P_i.A$ can be easily computed using the method described above. However, the computation for each protein is still exponential in the number of motifs it contains, which can be large (e.g., 15). However, in our specific model, we can apply the following branch-and-bound algorithm (similar to an approach proposed by Henrion [17] for BN2O networks) to find the globally optimal assignment to the motif variables of each protein. 
The idea is that we search over the space of possible assignments $P_i.A$ for one that maximizes the objective we wish to maximize. We can show that if making a motif active relative to one assignment does not improve the objective, it will also not improve the objective relative to a large set of other assignments.\n   More precisely, let $f(P_i.A) = P(P_i.A, P_{-i}.A \\mid O, \\theta)$ denote the objective we wish to maximize, where $P_{-i}.A$ is the fixed assignment to motif variables in all proteins except $P_i$. Let $P_i.A_{-a}$ denote the assignment to all the motif variables in $P_i$ except for $A_a$. We compute the ratio by which $f$ changes when we switch $P_i.A_a$ from false to true. Let $h_a(P_j) = \\prod_{P_j.A_b = \\text{true}} (1 - \\theta_{ab})$ denote the probability that motif $a$ does not bind with any active motif in $P_j$. We can now compute:\n\n$$\\delta_a(P_i.A_{-a}) = \\frac{f(P_i.A_a = \\text{true}, P_i.A_{-a})}{f(P_i.A_a = \\text{false}, P_i.A_{-a})} = \\frac{g}{1 - g} \\prod_{1 \\le j \\le n,\\; T_{ij}.O = \\text{false}} h_a(P_j) \\prod_{1 \\le j \\le n,\\; T_{ij}.O = \\text{true}} \\frac{1 - (1 - \\theta_S)\\, h_a(P_j) \\prod_{b \\ne a,\\, P_i.A_b = \\text{true}} h_b(P_j)}{1 - (1 - \\theta_S) \\prod_{b \\ne a,\\, P_i.A_b = \\text{true}} h_b(P_j)} \\qquad (1)$$\n\nwhere $g$ is the prior probability for a motif in protein $P_i$ to be active.\n   Now, consider a different point in the search, where our current motif activity assignment is $P_i.A'_{-a}$, which has all the active motifs in $P_i.A_{-a}$ and some additional ones. The first two terms in the product of Eq. (1) are the same for $\\delta_a(P_i.A_{-a})$ and $\\delta_a(P_i.A'_{-a})$. For the final term (the large fraction), one can show using some algebraic manipulation that this term in $\\delta_a(P_i.A'_{-a})$ is lower than that for $\\delta_a(P_i.A_{-a})$. We conclude that $\\delta_a(P_i.A'_{-a}) \\le \\delta_a(P_i.A_{-a})$, and hence that:\n\n$$\\frac{f(P_i.A_a = \\text{true}, P_i.A_{-a})}{f(P_i.A_a = \\text{false}, P_i.A_{-a})} \\le 1 \\;\\Rightarrow\\; \\frac{f(P_i.A_a = \\text{true}, P_i.A'_{-a})}{f(P_i.A_a = \\text{false}, P_i.A'_{-a})} \\le 1.$$\n\nIt follows that, if switching motif $a$ from inactive to active relative to $P_i.A$ decreases $f$, it will also decrease $f$ if we have some additional active motifs.\n   We can exploit this property in a branch-and-bound algorithm in order to find the globally optimal assignment $P_i.A$. Our algorithm keeps a set $V$ of viable candidates for motif assignments. For presentation, we encode assignments via the set of active motifs they contain. Initially, $V$ contains only the empty assignment $\\{\\}$. We start out by considering motif assignments with a single active motif. We put such an assignment $\\{a\\}$ in $V$ if its $f$-score is higher than $f(\\{\\})$. Now, we consider assignments $\\{a, b\\}$ that have two active motifs. We consider $\\{a, b\\}$ only if both $\\{a\\}$ and $\\{b\\}$ are in $V$. If so, we evaluate its $f$-score, and add it to $V$ if this score is greater than that of $\\{a\\}$ and $\\{b\\}$. Otherwise, we throw it away. 
We continue this process for all assignments of size $k$: for each assignment with active motif set $S$, we test whether $S - \\{a\\} \\in V$ for all $a \\in S$; if so, we compare $f(S)$ to each $f(S - \\{a\\})$, and add it if it dominates all of them. The algorithm terminates when, for some $k$, no assignment of size $k$ is saved.\n     To understand the intuition behind this pruning procedure, consider a candidate assignment $\\{a, b, c, d\\}$, and assume that $\\{a, b, c\\} \\in V$, but $\\{b, c, d\\} \\notin V$. In this case, we must have that $\\{b, c\\} \\in V$, but adding $d$ to that assignment reduces the $f$-score. In this case, as shown by our analysis, adding $d$ to the superset $\\{a, b, c\\}$ would also reduce the $f$-score.\n     This algorithm is still exponential in the worst case. However, in our setting, a protein with many motifs has a low prior probability that each of them is active. Hence, adding new motifs is less likely to increase the $f$-score, and the algorithm tends to terminate quickly. As we show in Section 4, this algorithm significantly reduces the cost of our procedure.\n     Our E-step finds an assignment to $P.A$ which is a strong local optimum of the objective function $\\max P(P.A \\mid T.O, \\theta)$: the assignment has higher probability than any assignment that changes any of the motif variables for any single protein. For that assignment, our algorithm also computes the distribution over all of the binding variables, as described above. Using this completion, we can now easily compute the (expected) sufficient statistics for the different parameters in the model. As each of these parameters is a simple binomial distribution, the maximum likelihood estimation in the M-step is entirely standard; we omit details.\n\n\n4    Results\n\nWe evaluated our model on reliable S. cerevisiae protein interaction data from the MIPS [2] and DIP [3] databases. As for non-interaction data, we randomly picked pairs of proteins that have no common function or cellular location. 
This results in a dataset of 2275 proteins, 4838 interactions ($T_{ij}.O$ = true), and 9037 non-interactions ($T_{ij}.O$ = false). We used sequence motifs from the Prosite database [11], resulting in a dataset of 516 different motifs with an average of 7.1 motif occurrences per protein. If a motif pair does not appear between any pair of interacting proteins, we initialize its affinity to be 0 to maximize the joint likelihood. Its affinity will stay at 0 during the EM iterations and thus simplify our model structure. We set the initial affinity for the remaining 8475 motif pairs to 0.03.\n     We train our model with motifs initialized to be either all active ($P.A$ = true) or all inactive ($P.A$ = false). We get similar results with these two different initializations, indicating the robustness of our algorithm. Below we only report the results based on all motifs initialized to be active. Our branch-and-bound algorithm is able to significantly reduce the number of motif activity assignments that need to be evaluated. For a protein with 15 motifs, the number of assignments evaluated is reduced from $2^{15} = 32768$ in exhaustive search to 802 using our algorithm. Since the majority of the computation is spent on finding the activity assignments, this resulted in a 40-fold reduction in running time.\n\nPredicting protein-protein interactions. We test our model by evaluating its performance in predicting interactions. We test this performance using 5-fold cross validation on the set of interacting and non-interacting protein pairs. In each fold, we train a model and predict $P(T_{ij}.O = \\text{true})$ for pairs $P_i$, $P_j$ in the held-out interactions.\n     Many motif pairs are over-represented in interacting proteins. We thus compare our method to a baseline method that ranks pairs of proteins on the basis of the maximum enrichment of over-represented motif pairs (see [18] for details). 
We also compare it to a model where all motifs are set to be active; this is analogous to the method of Deng et al. [8]. For completeness, we compare the two variants of the model using data on the domain (Pfam and ProDom [19]) content of the proteins as well as the Prosite motif content.\n\n[Figure 2, panels (a) and (b). Panel (a) X-axis: proportion of all non-interactions predicted to interact. Y-axis: proportion of all interactions predicted to interact. Legend: Association (Sprinzak & Margalit); Prosite motif (allow inactive motifs, P.A = {0, 1}); Prosite motif (all motifs active, P.A = 1); Pfam & ProDom (all domains active, P.A = 1, Deng et al.)]\n\nFigure 2: (a) ROC curve for different methods. The X-axis is the proportion of all non-interacting protein pairs in the training data predicted to interact. The Y-axis is the proportion of all interacting protein pairs in the training data predicted to interact. Points are generated using different cutoff probabilities. A larger area under the curve indicates better prediction. Our method (square marker) outperforms all other methods. 
(b) Two protein chains that form a part of the 1ryp complex in PDB, interacting at the site of two short sequence motifs.\n\nThe ROC curves in Fig. 2(a) show that our method outperforms the other methods, and that the additional degree of freedom of allowing motifs to be inactive is essential. These results validate our modeling assumptions; they also show that our method can be used to suggest new interactions and to assign confidence levels on observed interactions, which is much needed in view of the inaccuracies and large fraction of missing interactions in current interaction databases.\n\nEvaluating predicted active motifs. A key feature of our approach is its ability to detect pairs of interacting motifs. We evaluate these predictions against the data from the Protein Data Bank (PDB) [20], which contains some solved structures of interacting proteins (Fig. 2(b)). While the PDB data is scarce, it provides the ultimate evaluation of our predicted active motifs. We extracted all structures from PDB that have at least two co-crystallized chains, and whose chains are nearly identical to yeast proteins. From the residues that are in contact between two chains (distance < 5 Ångströms), we infer which protein motifs participate in interactions. Among our training data, 105 proteins have co-crystallized structure in PDB. On these proteins, our data contained a total of 620 motif occurrences, of which 386 are predicted to be active. Among those motifs predicted to be active, 257 of them (66.6%) are interacting in PDB. Among the 234 motifs predicted to be inactive, only 120 of them (51.3%) are interacting. The chi-square p-value is $10^{-4}$. On the residue level, our predicted active motifs consist of 3736 amino acids, and 1388 of them (37.2%) are interacting. In comparison, our predicted inactive motifs consist of 3506 amino acids, and only 588 of them (16.0%) are interacting. 
This significant enrichment provides support for the ability\nof our method to detect motifs that participate in interactions. In fact, the set of interactions\nin PDB is only a subset of the interactions those proteins participate in. Therefore, the\nactual rate of false positive active motifs is likely to be lower than we report here.\n\n\n5     Discussion and Conclusions\n\nIn this paper, we presented a probabilistic model that explicitly encodes elements in the\nprotein sequence that mediate protein-protein interactions. By using a variant of the EM\nalgorithm and a branch-and-bound algorithm for the E-step, we make the learning procedure\ntractable. Our results show that our method successfully uncovers motif activities and binding\naffinities, and uses them to predict both protein interactions and specific binding sites.\nThe ability of our model to predict structural elements, without a full structure analysis,\nprovides support for the viability of our approach.\n    Our use of a probabilistic model provides us with a general framework to incorporate\ndifferent types of data into our model, allowing it to be extended in various ways. First,\nwe can incorporate additional signals for protein interactions, such as gene expression data\n(as in [9]), cellular location, or even annotations from the literature (as in [7]). Second, we can\nintegrate protein interaction data across multiple species; for example, we might try to\nuse the yeast interaction data to provide more accurate predictions for the protein-protein\ninteractions in fly [10].\n\nReferences\n\n [1] P. Uetz, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces\n     cerevisiae. Nature, 403(6770):623-7, 2000.\n [2] H. W. Mewes, et al. MIPS: a database for genomes and protein sequences. 
Nucleic Acids Research,\n     2002.\n [3] I. Xenarios, et al. DIP, the database of interacting proteins: a research tool for studying cellular\n     networks of protein interactions. Nucleic Acids Research, 30(1):303-305, 2002.\n [4] P. Chakrabarti and J. Janin. Dissecting protein-protein recognition sites. PROTEINS: Structure,\n     Function, and Genetics, 47:334-343, 2002.\n [5] J. J. Gray, et al. Protein-protein docking with simultaneous optimization of rigid-body\n     displacement and side-chain conformations. Journal of Molecular Biology, 331:281-299, 2003.\n [6] Y. Ofran and B. Rost. Predicted protein-protein interaction sites from local sequence\n     information. FEBS Lett., 544(1-3):236-239, 2003.\n [7] R. Jansen, et al. A Bayesian networks approach for predicting protein-protein interactions from\n     genomic data. Science, 302:449-53, 2003.\n [8] M. Deng, S. Mehta, F. Sun, and T. Chen. Inferring domain-domain interactions from\n     protein-protein interactions. Genome Research, 12(10):1540-8, 2002.\n [9] E. Segal, H. Wang, and D. Koller. Discovering molecular pathways from protein interaction and\n     gene expression data. Bioinformatics, 19 Suppl 1:I264-I272, 2003.\n[10] L. Giot, et al. A protein interaction map of Drosophila melanogaster. Science, 302(5651):1727-36,\n     2003.\n[11] L. Falquet, et al. The PROSITE database, its status in 2002. Nucleic Acids Research, 30:235-238,\n     2002.\n[12] D. R. Caffrey, et al. Are protein-protein interfaces more conserved in sequence than the rest of\n     the protein surface? Protein Science, 13:190-202, 2003.\n[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.\n[14] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proc. AAAI, pages 580-587,\n     1998.\n[15] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via\n     the EM algorithm. J. 
Roy. Stat. Soc., B(39):1-39, 1977.\n[16] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS, pages\n     689-695, 2000.\n[17] M. Henrion. Search-based methods to bound diagnostic probabilities in very large belief nets.\n     In Uncertainty in Artificial Intelligence, pages 142-150, 1991.\n[18] E. Sprinzak and H. Margalit. Correlated sequence-signatures as markers of protein-protein\n     interaction. Journal of Molecular Biology, 311:681-692, 2001.\n[19] R. Apweiler, et al. The InterPro database, an integrated documentation resource for protein\n     families, domains and functional sites. Nucleic Acids Research, 29(1):37-40, 2001.\n[20] H.M. Berman, et al. The Protein Data Bank. Nucleic Acids Research, 28:235-242, 2000.\n", "award": [], "sourceid": 2696, "authors": [{"given_name": "Haidong", "family_name": "Wang", "institution": null}, {"given_name": "Eran", "family_name": "Segal", "institution": null}, {"given_name": "Asa", "family_name": "Ben-Hur", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}, {"given_name": "Douglas", "family_name": "Brutlag", "institution": null}]}