{"title": "Neural Network Definitions of Highly Predictable Protein Secondary Structure Classes", "book": "Advances in Neural Information Processing Systems", "page_first": 809, "page_last": 816, "abstract": "", "full_text": "Neural Network Definitions of Highly Predictable Protein Secondary Structure Classes

Alan Lapedes
Complex Systems Group (T13)
LANL, MS B213, Los Alamos, NM 87545
and The Santa Fe Institute, Santa Fe, New Mexico

Evan Steeg
Department of Computer Science
University of Toronto, Toronto, Canada

Robert Farber
Complex Systems Group (T13)
LANL, MS B213, Los Alamos, NM 87545

Abstract

We use two co-evolving neural networks to determine new classes of protein secondary structure which are significantly more predictable from local amino acid sequence than the conventional secondary structure classification. Accurate prediction of the conventional secondary structure classes (alpha helix, beta strand, and coil) from primary sequence has long been an important problem in computational molecular biology. Neural networks have been a popular method for attempting to predict these conventional secondary structure classes, but accuracy has been disappointingly low. The algorithm presented here uses neural networks to simultaneously examine both sequence and structure data, and to evolve new classes of secondary structure that can be predicted from sequence with significantly higher accuracy than the conventional classes. These new classes have both similarities to, and differences from, the conventional alpha helix, beta strand, and coil.

The conventional classes of protein secondary structure, alpha helix and beta sheet, were first introduced in 1951 by Linus Pauling and Robert Corey [Pauling, 1951] on the basis of molecular modeling. 
Prediction of secondary structure from the amino acid sequence has long been an important problem in computational molecular biology. There have been numerous attempts to predict locally defined secondary structure classes using only a local window of sequence information. The prediction methodology ranges from a combination of statistical and rule-based methods [Chou, 1978] to neural net methods [Qian, 1988], [Maclin, 1992], [Kneller, 1990], [Stolorz, 1992]. Despite a variety of intense efforts, the accuracy of prediction of conventional secondary structure is still distressingly low.

In this paper we will use neural networks to generalize the notion of protein secondary structure and to find new classes of structure that are significantly more predictable. We define protein \"secondary structure\" to be any classification of protein structure that can be defined using only local \"windows\" of structural information about the protein. Such structural information could be, e.g., the classic φψ angles [Schulz, 1979] that describe the relative orientation of peptide units along the protein backbone, or any other representation of local backbone structure. A classification of local structure into \"secondary structure classes\" is defined to be the result of any algorithm that uses a representation of local structure as Input, and which produces discrete classification labels as Output. This is a very general definition of local secondary structure that subsumes all previous definitions.

We develop classifications that are more predictable than the standard classifications [Pauling, 1951] [Kabsch, 1983] which were used in previous machine learning projects, as well as in other analyses of protein shape. We show that these new, predictable classes of secondary structure bear some relation to the conventional category of \"helix\", but also display significant differences. 
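To make the generality of this definition concrete, the following toy sketch (ours, not from the paper; the particular angle range is an illustrative assumption) shows the required shape of any such classifier: local structural information in, a discrete class label out.

```python
def classify_window(phi_psi_window):
    # phi_psi_window: list of (phi, psi) backbone angle pairs, in degrees.
    # A toy rule: label the window class 1 if the mean phi angle lies in a
    # broadly helical range, else class 0. Any function of this shape is a
    # 'secondary structure classification' under the definition above.
    mean_phi = sum(phi for phi, psi in phi_psi_window) / len(phi_psi_window)
    return 1 if -100.0 <= mean_phi <= -30.0 else 0
```

The Kabsch and Sander rules, and the evolved classes developed below, are simply more sophisticated instances of this same Input/Output shape.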
We consider the definition, and prediction from sequence, of just two classes of structure. The extension to multiple classes is not difficult, but will not be made explicit here for reasons of clarity. We will not discuss details concerning construction of a representative training set, or details of conventional neural network training algorithms, such as backpropagation. These are well studied subjects that are addressed in, e.g., [Stolorz, 1992] in the context of protein secondary structure prediction. We note in passing that one can employ complicated network architectures containing many output neurons (e.g., three output neurons for predicting alpha helix, beta chain, random coil), or many hidden units, etc. (cf. [Stolorz, 1992], [Qian, 1988], [Kneller, 1990]). However, explanatory figures presented in the next section employ only one output unit per net, and no hidden units, for clarity.

A widely adopted definition of protein secondary structure classes is due to Kabsch and Sander [Kabsch, 1983]. It has become conventional to use the Kabsch and Sander definition to define, via local structural information, three classes of secondary structure: alpha helix, beta strand, and a default class called random coil. The Kabsch and Sander alpha helix and beta strand classification captures in large part the classification first introduced by Pauling and Corey [Pauling, 1951]. Software implementing the Kabsch and Sander definitions, which takes a local window of structural information as Input and produces the Kabsch and Sander secondary structure classification of the window as Output, is widely available.

The key ideas of this paper are contained in Fig. (1).

[Figure 1: Two networks trained to agree. The Left Net maps a window of amino acid (AA) sequence to \"secondary structure\"; the Right Net maps the corresponding window of φψ angles to \"secondary structure\". An agreement measure F = Correlation(LeftO(p), RightO(p)) couples the two outputs.]

In this figure the Kabsch and Sander rules are represented by a second neural network. The Kabsch and Sander rules are just an Input/Output mapping (from a local window of structure to a classification of that structure) and may in principle be replaced with an equivalent neural net representing the same Input/Output mapping. We explicitly demonstrated that a simple neural net is capable of representing rules of the complexity of the Kabsch and Sander rules by training a network to perform the same structure classification as the Kabsch and Sander rules, and obtained high accuracy.

The representation of the structure data in the right-hand network uses φψ angles. The right-hand net sees a window of φψ angles corresponding to the window of amino acids in the left-hand network. Problems due to the angular periodicity of the φψ angles (i.e., 360 degrees and 0 degrees are different numbers, but represent the same angle) are eliminated by utilizing both the sin and cos of each angle.

The representation of the amino acids in the left-hand network is the usual unary representation employing twenty bits per amino acid. Results quoted in this paper do not use a special twenty-first bit to represent positions in a window extending past the ends of a protein.

Note that the right-hand neural network could implement extremely general definitions of secondary structure by changing the weights. We next show how to change the weights in a fashion so that new classifications of secondary structure are derived under the important restriction that they be predictable from amino acid sequence. 
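The two input representations described above can be sketched as follows (an illustrative sketch in our own notation, not the authors' code):

```python
import math

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'  # twenty standard residues, one-letter codes

def encode_residue(aa):
    # Unary (one-hot) representation: twenty bits per amino acid.
    vec = [0.0] * 20
    vec[AMINO_ACIDS.index(aa)] = 1.0
    return vec

def encode_angle(degrees):
    # Represent an angle by (sin, cos) so that 0 and 360 degrees,
    # which are different numbers, map to the same input pattern.
    r = math.radians(degrees)
    return (math.sin(r), math.cos(r))
```

A window of 13 residues thus becomes 13 x 20 inputs for the left-hand net and, for the right-hand net, 13 x 2 x 2 inputs (sin and cos of both φ and ψ at each position).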
In other words, we require that the synaptic weights be chosen so that the output of the left-hand network and the output of the right-hand network agree for each sequence-structure pair that is input to the two networks. To achieve this, both networks are trained simultaneously, starting from random initial weights in each net, under the sole constraint that the outputs of the two networks agree for each pattern in the training set. The mathematical implementation of this constraint is described in various versions below. This procedure is a general, effective method of evolving predictable secondary structure classifications of experimental data. The goal of this research is to use two mutually self-supervised networks to define new classes of protein secondary structure which are more predictable from sequence than the standard classes of alpha helix, beta sheet or coil.

3 CONSTRAINING THE TWO NETS TO AGREE

One way to impose agreement between the outputs of the two networks is to require that they covary when viewed as a stream of real numbers. Note that it is not sufficient to merely require that the outputs of the left-hand and right-hand nets agree by, e.g., minimizing the following objective function:

E = Sum_p [LeftO(p) - RightO(p)]^2    (1)

Here, LeftO(p) and RightO(p) represent the outputs of the left-hand and right-hand networks, respectively, for the pth pair of input windows: (sequence window, left net) and (structure window, right net). It is necessary to avoid the trivial minimum of E obtained where the weights and thresholds are set so that each net presents a constant Output regardless of the input data. This minimum is easily reached in Eqn (1) by merely setting all the weights and thresholds to 0.0.

Demanding that the outputs vary, or more explicitly co-vary, is a viable solution to avoiding trivial local minima. 
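This failure mode of Eqn (1) can be seen in a minimal numerical sketch (our illustration):

```python
def squared_disagreement(left_out, right_out):
    # Eqn (1): E = Sum_p [LeftO(p) - RightO(p)]^2
    return sum((l - r) ** 2 for l, r in zip(left_out, right_out))

# If every weight and threshold is zero, both nets emit the same constant
# for every pattern, and E reaches its global minimum of zero even though
# nothing about the data has been learned.
constant_left = [0.5] * 4
constant_right = [0.5] * 4
assert squared_disagreement(constant_left, constant_right) == 0.0
```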
Therefore, one can maximize the correlation, ρ, between the left-hand and right-hand network outputs. The standard correlation measure between two objects, LeftO(p) and RightO(p), is:

ρ = Sum_p (LeftO(p) - <LeftO>)(RightO(p) - <RightO>)    (3)

where <LeftO> denotes the mean of the left net's outputs over the training set, and respectively for the right net. ρ is zero if there is no variation, and is maximized if there is simultaneously both individual variation and joint agreement. In our situation it is equally desirable to have the networks maximally anti-correlated as it is for them to be correlated. (Whether the networks choose correlation or anti-correlation is evident from the behavior on the training set.) Hence the minimization of E = -ρ^2 would ensure that the outputs are maximally correlated (or anti-correlated). While this work was in progress we received a preprint by Schmidhuber [Schmidhuber, 1992] who essentially implemented Eqn. (1) with an additional variance term (in a totally different context). Our results using this measure seem quite susceptible to local minima and we prefer alternative measures to enforce agreement.

One alternative to enforce agreement, since one ultimately measures predictive performance on the basis of the Mathews correlation coefficient (see, e.g., [Stolorz, 1992]), is to simultaneously train the two networks to maximize this measure. The Mathews coefficient, C_i, for the ith state is defined as:

C_i = (p_i n_i - u_i o_i) / [(n_i + u_i)(n_i + o_i)(p_i + u_i)(p_i + o_i)]^(1/2)

where p_i is the number of examples where the left-hand net and right-hand net both predict class i, n_i is the number of examples where neither net predicts i, u_i counts the examples where the left net predicts i and the right net does not, and o_i counts the reverse. Minimizing E = -C_i^2 optimizes C_i.

Other training measures forcing agreement of the left and right networks may be used. Particularly suitable for the situation of many outputs (i.e., more than two-class discrimination) is \"mutual information\". Use of mutual information in this context is related to the IMAX algorithm for unsupervised detection of regularities across spatial or temporal data [Becker, 1992]. The mutual information is defined as

M = Sum_{i,j} P_ij log[ P_ij / (P_i. P_.j) ]    (4)

where P_ij is the joint probability of occurrence of the states of the left and right networks, and P_i. and P_.j denote its marginals. (In previous work [Stolorz, 1992] we showed how P_ij may be defined in terms of neural networks.) Minimizing E = -M maximizes M. While M has many desirable properties as a measure of agreement between two or more variables [Stolorz, 1992] [Farber, 1992] [Lapedes, 1989] [Korber, 1993], our preliminary simulations show that maximizing M is often prone to poor local maxima.

Finally, an alternative to using mutual information for multi-class, as opposed to dichotomous, classification is the Pearson correlation coefficient, X^2. This is defined in terms of P_ij as

X^2 = Sum_{i,j} (P_ij - P_i. P_.j)^2 / (P_i. P_.j)    (5)

Our simulations indicate that X^2, C_i and ρ are all less susceptible to local minima than M. However, these other objective functions suffer the defect that predictability is emphasized at the expense of utility. 
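The four agreement measures above can be sketched as plain functions (our notation, not the authors' code; P stands for a joint table P[i][j] of left/right output states):

```python
import math

def rho(left, right):
    # Correlation of Eqn (3): joint variation about the two output means.
    mean_l = sum(left) / len(left)
    mean_r = sum(right) / len(right)
    return sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))

def mathews(p_i, n_i, u_i, o_i):
    # Mathews coefficient C_i from agreement counts: p_i = both nets predict
    # class i, n_i = neither does, u_i / o_i = one net only.
    denom = math.sqrt((n_i + u_i) * (n_i + o_i) * (p_i + u_i) * (p_i + o_i))
    return (p_i * n_i - u_i * o_i) / denom

def marginals(P):
    return [sum(row) for row in P], [sum(col) for col in zip(*P)]

def mutual_information(P):
    # Eqn (4): M = Sum_ij P_ij log[P_ij / (P_i. P_.j)]
    rows, cols = marginals(P)
    return sum(pij * math.log(pij / (rows[i] * cols[j]))
               for i, row in enumerate(P)
               for j, pij in enumerate(row) if pij > 0)

def pearson_x2(P):
    # Eqn (5): X^2 = Sum_ij (P_ij - P_i. P_.j)^2 / (P_i. P_.j)
    rows, cols = marginals(P)
    return sum((pij - rows[i] * cols[j]) ** 2 / (rows[i] * cols[j])
               for i, row in enumerate(P)
               for j, pij in enumerate(row))
```

As a sanity check, a perfectly agreeing two-state table [[0.5, 0.0], [0.0, 0.5]] gives M = log 2 and X^2 = 1.0, while an independent table gives 0 for both.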
In other words, they can be maximal in the peculiar situation where a structural class is defined that occurs very rarely in the data, but when it occurs, it is predicted perfectly by the other network. The utility of this classification is therefore degraded by the fact that the predictable class occurs only rarely. Fortunately, this effect did not cause difficulties in the simulations we performed. Our best results to date have been obtained using the Mathews objective function (see Results).

4 RESULTS

The database we used consisted of 105 proteins and is identical to that used in previous investigations [Kneller, 1990] [Stolorz, 1992]. The proteins were divided into two groups: a set of 91 \"training\" proteins, and a distinct \"prediction\" set of 14 proteins. The resulting database is similar to the database used by Qian & Sejnowski [Qian, 1988] in their neural network studies of conventional secondary structure prediction. When comparison to predictability of conventional secondary structure classes was needed, we defined the conventional alpha, beta and coil states using the Kabsch and Sander definitions; these states are therefore identical to those used in previous work [Kneller, 1990] [Stolorz, 1992]. A window size of 13 residues resulted in 16028 train set examples and 3005 predict set examples. Effects of other window sizes have not yet been extensively tested. All results, including conventional backpropagation training of Kabsch and Sander classifications, as well as two-net training of our new secondary structure classifications, did not employ an extra symbol denoting positions in a window that extended past the ends of a protein. Use of such a symbol could further increase accuracy.

We found that random initial conditions are necessary for the development of interesting new classes. However, random initial conditions also suffer to a certain extent from local minima. 
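One plausible reading of the window bookkeeping in the dataset described above (our sketch; the exact handling of chain ends is not specified beyond the absence of a padding symbol) is:

```python
def windows(sequence, size=13):
    # All complete windows of `size` residues; windows that would extend
    # past the ends of the protein are simply not generated, consistent
    # with using no special end-of-chain symbol.
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

# Each protein of length L then contributes L - 12 windows of size 13,
# so example counts track total residues rather than protein counts.
```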
The mutual information function, in particular, often gets trapped quickly in uninteresting local minima when evolved from random initial conditions. More success was obtained with the other objective functions discussed above. We have not exhaustively investigated strategies to avoid local minima, and usually just chose new initial conditions if an uninteresting local minimum was encountered.

Results were best for two-class discrimination using the Mathews objective function and a layer of five hidden units in each net. If one assigns the name \"Xclass\" to the newly defined structural class, then the Mathews coefficient on the prediction set for the Xclass dichotomy is -0.425. The Mathews coefficient on the train set for the Xclass dichotomy is -0.508. For comparison, the Mathews coefficient on the same predict set data for dichotomization (using standard backpropagation training with no hidden units) into the standard secondary structure classes Alpha/NotAlpha, Beta/NotBeta, and Coil/NotCoil is 0.33, 0.26, and 0.39, respectively. Adding hidden units gives negligible accuracy increase in predicting the conventional classes, but is important for improved prediction of the new classes. The negative sign of the two-net result indicates anti-correlation, a feature allowed by our objective function. The sign of the correlation is easily assessed on the train set and then can be trivially compensated for in prediction.

A natural question to ask is whether the new classes are simply related to the more conventional classes of alpha helix, beta, and coil. A simple answer is to compute the Mathews correlation coefficient of the new secondary structure classes with each of the three Kabsch and Sander classes, for those examples in which the sequence network agreed with the structure network's classification. 
The correlation with Kabsch and Sander's alpha helix is highest: a Mathews coefficient of 0.248 was obtained on the train set, while a Mathews coefficient of 0.247 was obtained on the predict set. There is therefore a significant degree of correlation with the conventional classification of alpha helix, but significant differences exist as well. The new classes are a mixture of the conventional classes, and are not solely dominated by alpha, beta, or coil. Conventional alpha-helices comprise roughly 25% of the data (for both train and predict sets), while the new Xclass comprises 10%. It is quite interesting that an evolution of secondary structure classifications starting from random initial conditions, and hence completely unbiased towards the conventional classifications, results in a classification that has a significant relationship to conventional helices but is more predictable from amino acid sequence than conventional helices. Graphical analysis (not shown here) of the new Xclass shows that the Xclass most closely related to helix typically extends the definition of helix past the standard boundaries of an alpha-helix.

5 CONCLUSIONS

A primary goal of this investigation is to evolve highly predictable secondary structure classes. Ultimately, such classes could be used, e.g., to provide constraints on tertiary structure calculations. Further work remains to derive even more predictable classes and to analyze their physical meaning. However, it is now clear that the use of two co-evolving, adaptive networks defines a novel and useful machine learning paradigm that allows the evolution of new definitions of secondary structure that are significantly more predictable from primary amino acid sequence than the conventional definitions. 
Related work is that of [Hunter, 1992], [Hunter, 1992], [Zhang, 1992], [Zhang, 1993], in which clustering either only in sequence space, or only in structure space, is attempted. However, no condition on the compatibility of the clustering is required, so new classes of structure are not guaranteed to be predictable from sequence.

Finally, we note that the methods described here might be usefully applied to other cognitive/perceptual or engineering tasks in which correlation of two or more different representations of the same data is required. In this regard the relation of our work to the independent work of Becker [Becker, 1992], and of Schmidhuber [Schmidhuber, 1992], should be noted.

Acknowledgements

We are grateful for useful discussions with Geoff Hinton, Sue Becker, and Joe Bryngelson. Sue Becker's contribution of software that was used in the early stages of this project is much appreciated. The research of Alan Lapedes and Robert Farber was supported in part by the U.S. Department of Energy. The authors would like to acknowledge the hospitality of the Santa Fe Institute, where much of this work was performed.

References

[Becker, 1992] S. Becker. An Information-theoretic Unsupervised Learning Algorithm for Neural Networks. PhD thesis, University of Toronto (1992)
[Becker, 1992] S. Becker, G. Hinton, Nature 355, 161-163 (1992)
[Chou, 1978] P. Chou, G. Fasman, Adv. Enzymol. 47, 45 (1978)
[Farber, 1992] R. Farber, A. Lapedes, J. Mol. Biol. 226, 471 (1992)
[Hunter, 1992] L. Hunter, N. Harris, D. States, Proceedings of the Ninth International Conference on Machine Learning, San Mateo, California, Morgan Kaufmann (1992)
[Hunter, 1992] L. Hunter, D. States, IEEE Expert 7(4), 67-75 (1992)
[Kabsch, 1983] W. Kabsch, C. Sander, Biopolymers 22, 2577 (1983)
[Kneller, 1990] D. Kneller, F. Cohen, R. Langridge, J. Mol. Biol. 214, 171 (1990)
[Korber, 1993] B. Korber, R. Farber, D. Wolpert, A. Lapedes, P.N.A.S., in press (1993)
[Lapedes, 1989] A. Lapedes, C. Barnes, C. Burks, R. Farber, K. Sirotkin, in Computers and DNA, editors: G. Bell, T. Marr (1989)
[Maclin, 1992] R. Maclin, J. W. Shavlik, Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, California, Morgan Kaufmann (1992)
[Pauling, 1951] L. Pauling, R. Corey, Proc. Nat. Acad. Sci. 37, 205 (1951)
[Qian, 1988] N. Qian, T. Sejnowski, J. Mol. Biol. 202, 865 (1988)
[Schmidhuber, 1992] J. Schmidhuber, Discovering Predictable Classifications, Technical Report CU-CS-626-92, Department of Computer Science, University of Colorado (1992)
[Schulz, 1979] G. Schulz, R. Schirmer, Principles of Protein Structure, Springer Verlag, New York (1979)
[Stolorz, 1992] P. Stolorz, A. Lapedes, X. Yuan, J. Mol. Biol. 225, 363 (1992)
[Zhang, 1992] X. Zhang, D. Waltz, in Artificial Intelligence and Molecular Biology, editor: L. Hunter, AAAI Press (MIT Press) (1992)
[Zhang, 1993] X. Zhang, J. Fetrow, W. Rennie, D. Waltz, G. Berg, in Proceedings: First International Conference on Intelligent Systems for Molecular Biology, p. 438, editors: L. Hunter, D. Searls, J. Shavlik, AAAI Press, Menlo Park, CA (1993)
", "award": [], "sourceid": 758, "authors": [{"given_name": "Alan", "family_name": "Lapedes", "institution": null}, {"given_name": "Evan", "family_name": "Steeg", "institution": null}, {"given_name": "Robert", "family_name": "Farber", "institution": null}]}