{"title": "Predicting the Geometry of Metal Binding Sites from Protein Sequence", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 472, "abstract": "Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have only focused on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion is a new prediction problem that has been never attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75\\%/46\\% correct ligand-ion assignments, which improves to 88\\%/88\\% in the setting where the metal binding state is known.", "full_text": "Predicting the Geometry of Metal Binding Sites from Protein Sequence\n\nPaolo Frasconi\n\nUniversit\u00e0 degli Studi di Firenze\n\nVia di S. Marta 3, 50139 Firenze, Italy\n\np-f@dsi.unifi.it\n\nAndrea Passerini\n\nUniversit\u00e0 degli Studi di Trento\n\nVia Sommarive, 14, 38100 Povo, Italy\n\npasserini@disi.unitn.it\n\nAbstract\n\nMetal binding is important for the structural and functional characterization of\nproteins. Previous prediction efforts have only focused on bonding state, i.e.\ndeciding which protein residues act as metal ligands in some binding site.\nIdentifying the geometry of metal-binding sites, i.e. deciding which residues are jointly\ninvolved in the coordination of a metal ion, is a new prediction problem that has\nnever been attempted before from protein sequence alone. 
In this paper, we formulate\nit in the framework of learning with structured outputs. Our solution relies on\nthe fact that, from a graph theoretical perspective, metal binding has the algebraic\nproperties of a matroid, enabling the application of greedy algorithms for learning\nstructured outputs. On a data set of 199 non-redundant metalloproteins, we obtained\nprecision/recall levels of 75%/46% correct ligand-ion assignments, which\nimproves to 88%/88% in the setting where the metal binding state is known.\n\n1 Introduction\nMetal ions play important roles in protein function and structure, and metalloproteins are involved\nin a number of diseases for which medicine is still seeking effective treatment, including cancer,\nParkinson's disease, dementia, and AIDS [10]. A metal binding site typically consists of an ion bound to one\nor more protein residues (called ligands). In some cases, the ion is embedded in a prosthetic group\n(e.g. in the case of heme). Among the 20 amino acids, the four most common ligands are cysteine\n(C), histidine (H), aspartic acid (D), and glutamic acid (E). Highly conserved residues are more likely\nto be involved in the coordination of a metal ion, although in the case of cysteines, conservation is\nalso often associated with the presence of a disulfide bridge (a covalent bond between the sulfur\natoms of two cysteines) [8]. Predicting metal binding from sequence alone can be very useful in\ngenomic annotation for characterizing the function and the structure of structurally undetermined proteins,\nbut also during the experimental determination of new metalloproteins. Current high-throughput\nexperimental technologies only annotate whole proteins as metal binding [13], but cannot determine\nthe involved ligands. Most of the research for understanding metal binding has focused on finding\nsequence patterns that characterize binding sites [8]. 
Machine learning techniques have been applied\nonly more recently.\nThe easiest task to formulate in this context is bonding state prediction, which is a binary classification\nproblem: either a residue is involved in the coordination of a metal ion or is free (in the case of\ncysteines, a third class can also be introduced for disulfide bridges). This prediction task has been\naddressed in a number of recent works in the case of cysteines only [6], in the case of transition\nmetals (for C and H residues) [12], and in the special but important case of zinc proteins (for\nC, H, D, and E residues) [11, 14]. However, classification of individual residues does not provide\nsufficient information about a binding site. Many proteins bind to several ions in their holo form\nand a complete characterization requires us to identify the site geometry, i.e. the tuple of residues\ncoordinating each individual ion. This problem has only been studied assuming knowledge of the\nprotein 3D structure (e.g. [5, 1]), which limits applicability to structurally determined proteins or their\nclose homologs; it has not been studied from sequence alone. Abstracting away the biology, this is a structured\noutput prediction problem where the input consists of a string of protein residues and the output is a\nlabeling of each residue with the corresponding ion identifier (specific details are given in the next\nsection).\nThe supervised learning problem with structured outputs has recently received a considerable\namount of attention (see [2] for an overview). The common idea behind most methods consists\nof learning a function F(x, y) on input-output pairs (x, y) and, during prediction, searching for the\nargument y that maximises F when paired with the query input x. 
The main dif\ufb01culty is that the\nsearch space on which y can take values has usually exponential size (in the length of the query).\nDifferent structured output learners deal with this issue by exploiting speci\ufb01c domain properties\nfor the application at hand. Some researchers have proposed probabilistic modeling and ef\ufb01cient\ndynamic programming algorithms (e.g. [16]). Others have proposed large margin approaches com-\nbined with clever algorithmic ideas for reducing the number of constraints (e.g. [15] in the case of\ngraph matching). Another solution is to construct the structured output in a suitable Hilbert space of\nfeatures and seek the corresponding pre-image for obtaining the desired discrete structure [17]. Yet\nanother is to rely on a state-space search procedure and learn from examples good moves leading to\nthe desired goal [4].\nIn this paper we develop a large margin solution that does not require a generative model for produc-\ning outputs. We borrow ideas from [15] and [4] but speci\ufb01cally take advantage of the fact that, from\na graph theoretical perspective, the metal binding problem has the algebraic structure of a matroid,\nenabling the application of greedy algorithms.\n\n2 A formalization of the metal binding sites prediction problem\nA protein sequence s is a string in the alphabet of the 20 amino acids. Since only some of the 20\namino acids that exist in nature can act as ligands, we begin by extracting from s the subsequence\nx obtained by deleting characters corresponding to amino acids that never (or very rarely) act as\nligands. By using T = {C, H, D, E} as the set of candidate ligands, we cover 92% ligands of struc-\nturally known proteins. A large number of interesting cases (74% in transition metals) is covered by\njust considering cysteines and histidines, i.e. T = {C, H}. We also introduce the set I of symbols\nassociated with metal ion identi\ufb01ers. I includes the special nil symbol. 
The goal is to predict the\ncoordination relation between amino acids in x and metal ion identifiers in I. Amino acids that are\nnot metal-bound are linked to nil. Ideally, it would also be interesting to predict the chemical element\nof the bound metal ion. However, previous studies suggest that distinguishing the chemical element\nfrom sequence alone is a difficult task [12]. Hence, ion identifiers will have no chemical element\nattribute attached. In practice, we fix a maximum number m of possible ions (m = 4 in the subsequent\nexperiments, covering 93% of structurally known proteins) and let I = {nil, \u03b9_1, . . . , \u03b9_m}.\nThe number of admissible binding geometries for a given protein chain having n candidate ligands\nis the multinomial coefficient\n\nn! / (k_1! k_2! \u00b7\u00b7\u00b7 k_m! (n \u2212 k_1 \u2212 \u00b7\u00b7\u00b7 \u2212 k_m)!)\n\nwhere m is the number of ions and k_i the\nnumber of ligands for ion \u03b9_i. In practice, each ion is coordinated by a variable number of ligands\n(typically ranging from 1 to 4, but occasionally more), and each protein chain binds a variable\nnumber of ions (typically ranging from 1 to 4). The number of candidate ligands n grows linearly\nwith the length of the protein chain. For example, in the case of PDB chain 1H0Hb (see Figure 1), there are\nn = 52 candidate ligands and m = 3 ions coordinated by 4 residues each, yielding a set of 7 \u00b7 10^15\nadmissible conformations.\nIt is convenient to formulate the problem in a graph theoretical setting.\nIn this view, the string\nx should be regarded as a set of vertices labeled with the corresponding amino acid in T. The\nmeaning of x will be clear from the context and for simplicity we will avoid additional notation.\nDefinition 2.1 (MBG property). Let x and I be two sets of vertices (associated with candidate\nligands and metal ion identifiers, respectively). We say that a bipartite edge set y \u2282 x \u00d7 I satisfies\nthe metal binding geometry (MBG) property if the degree of each vertex in x in the graph (x \u222a I, y)\nis at most 1.\nFor a given x, let Y_x denote the set of y that satisfy the MBG property. Let F_x : Y_x \u2192 \u211d+ be a\nfunction that assigns a positive score to each bipartite edge set in Y_x. The MBG problem consists of\nfinding arg max_{y \u2208 Y_x} F_x(y).\n\nFigure 1: Metal binding structure of PDB entry 1H0Hb. For readability, only a few connections\nfrom free residues to the nil symbol are shown.\n\nNote that the MBG problem is not a matching problem (such as those studied in [15]) since more\nthan one edge can be incident to vertices belonging to I. As discussed above, we are not interested\nin distinguishing metal ions based on the element type. Hence, any two label-isomorphic bipartite\ngraphs (obtained by exchanging two non-nil metal ion vertices) should be regarded as equivalent.\nOutputs y should therefore be regarded as equivalence classes of structures (in the 1H0Hb example\nabove, there are 7 \u00b7 10^15/3! equivalence classes, each corresponding to a permutation of \u03b9_1, \u03b9_2, \u03b9_3).\nFor simplicity, we will slightly abuse notation and avoid this distinction in the following.\nWe could also look at the MBG problem by analogy with language parsing using formal grammars.\nIn this view, the binding geometry consists of a very shallow \u201cparse tree\u201d for string x, as\nexemplified in Figure 1. A difficulty that is immediately apparent is that the underlying grammar\nneeds to be context sensitive in order to capture the crossing dependencies between bound amino\nacids.\nIn real data, when representing metal bonding state in this way, crossing edges are very\ncommon. 
This view highlights a difficulty that would be encountered by attempting to solve the\nstructured output problem with a generative model as in [16].\n\n3 A greedy algorithm for constructing structured outputs\nThe core idea of the solution used in this paper is to avoid a generative model as a component of\nthe structured output learner and cast the construction of an output structure into a maximum weight\nproblem that can be solved by a greedy algorithm.\nDefinition 3.1 (Matroid). A matroid (see e.g. [9]) is an algebraic structure M = (S, Y) where S is\na finite set and Y a family of subsets of S such that: i) \u2205 \u2208 Y; ii) all proper subsets of a set y in Y\nare in Y; iii) if y and y\u2032 are in Y and |y| < |y\u2032| then there exists e \u2208 y\u2032 \\ y such that y \u222a {e} \u2208 Y.\nElements of Y are called independent sets. If y is an independent set, then ext(y) = {e \u2208 S :\ny \u222a {e} \u2208 Y} is called the extension set of y. A maximal (having an empty extension set)\nindependent set is called a base. In a weighted matroid, a local weight function v : S \u2192 \u211d+ assigns\na positive number v(e) to each element e \u2208 S. The weight function allows us to compare two\nstructures in the following sense. A set y = {e_1, . . . , e_n} is lexicographically greater than set y\u2032\nif its monotonically decreasing sequence of weights (v(e_1), . . . , v(e_n)) is lexicographically greater\nthan the corresponding sequence for y\u2032. The following classic result (see e.g. [9]) is the underlying\nsupport for many greedy algorithms:\nTheorem 3.2 (Rado 1957; Edmonds 1971). 
For any nonnegative weighting over S, a lexicographically\nmaximum base in Y maximizes the global objective function F(y) = \u2211_{e \u2208 y} v(e).\n\nWeighted matroids can be seen as a kind of discrete counterpart of concave functions: thanks to the\nabove theorem, if M is a weighted matroid, then the following greedy algorithm is guaranteed to\nfind the optimal structure, i.e. arg max_{y \u2208 Y} F(y):\n\nGREEDYCONSTRUCT(M, F)\n  y \u2190 \u2205\n  while ext(y) \u2260 \u2205\n    do y \u2190 y \u222a {arg max_{e \u2208 ext(y)} F(y \u222a {e})}\n  return y\n\nThis theory shows that if the structured output space being searched satisfies the property of a\nmatroid, learning structured outputs may be cast into the problem of learning the objective function\nF for the greedy algorithm. When following this strategy, however, we may perceive the additive\nform of F as a strong limitation as it would prescribe to predict v(e) independently for each part\ne \u2208 S, while the whole point of structured output learning is to end up with a collective decision\nabout which parts should be present in the output structure. But interestingly, the additive form of\nthe objective function as in Theorem 3.2 is not a necessary condition for the greedy optimality of\nmatroids. In fact, Helman et al. [7] show that the classic theory can be generalized to so-called\nconsistent objective functions, i.e. functions that satisfy the following additional constraints:\n\nF(y \u222a {e}) \u2265 F(y \u222a {e\u2032}) \u21d2 F(y\u2032 \u222a {e}) \u2265 F(y\u2032 \u222a {e\u2032})    (1)\n\nfor any y \u2282 y\u2032 \u2282 S and e, e\u2032 \u2208 S \\ y\u2032.\nTheorem 3.3 (Helman et al. 1993). 
If F is a consistent objective function then, for each matroid on\nS, all greedy bases are optimal.\n\nNote that the sufficient condition of Theorem 3.3 is also necessary for a slightly more general class\nof algebraic structures that include matroids, called matroid embeddings [7]. We now show that the\nMBG problem is a suitable candidate for a greedy algorithmic solution.\nTheorem 3.4. If each y \u2208 Y_x satisfies the MBG property, then M_x = (S_x, Y_x) is a matroid.\nProof. Suppose y\u2032 \u2208 Y_x and y \u2286 y\u2032. Removing an edge from y\u2032 cannot increase the degree of any\nvertex in the bipartite graph so y \u2208 Y_x. Also, suppose y \u2208 Y_x, y\u2032 \u2208 Y_x, and |y| < |y\u2032|. Then there\nmust be at least one vertex t in x having no incident edges in y and such that (t, \u03b9) \u2208 y\u2032 for some\n\u03b9 \u2208 I. Therefore y \u222a {(t, \u03b9)} also satisfies the MBG property and belongs to Y_x, showing that M_x\nis a matroid.\n\nWe can finally formulate the greedy algorithm for constructing the structured output in the MBG\nproblem. Given the input x, we begin by forming the associated MBG matroid M_x and a\ncorresponding objective function F_x : Y_x \u2192 \u211d+ (in the next section we will show how to learn the\nobjective function from data). The output structure associated with x is then computed as\n\nf(x) = arg max_{y \u2208 Y_x} F_x(y) = GREEDYCONSTRUCT(M_x, F_x).    (2)\n\nThe following result immediately follows from Definition 2.1 and Theorem 3.3:\nCorollary 3.5. 
Let (x, y) be an MBG instance.\nIf F_x is a consistent objective function and\nF_x(y\u2032 \u222a {e}) > F_x(y\u2032 \u222a {e\u2032}) for each y\u2032 \u2282 y, e \u2208 ext(y\u2032) \u2229 y and e\u2032 \u2208 ext(y\u2032) \\ y, then\nGREEDYCONSTRUCT((S_x, Y_x), F_x) returns y.\n\n4 Learning the greedy objective function\nA data set for the MBG problem consists of pairs D = {(x_i, y_i)} where x_i is a string in T* and y_i\na bipartite graph. Corollary 3.5 directly suggests the kind of constraints that the objective function\nneeds to satisfy in order to minimize the empirical error of the structured-output problem. For any\ninput string x and (partial) output structure y \u2208 Y, let F_x(y) = w^T \u03c6_x(y), where w is a weight vector\nand \u03c6_x(y) a feature vector for (x, y). The corresponding max-margin formulation is\n\nmin (1/2) \u2016w\u2016^2    (3)\n\nsubject to:\n\nw^T (\u03c6_{x_i}(y\u2032 \u222a {e}) \u2212 \u03c6_{x_i}(y\u2032 \u222a {e\u2032})) \u2265 1    (4)\nw^T (\u03c6_{x_i}(y\u2033 \u222a {e}) \u2212 \u03c6_{x_i}(y\u2033 \u222a {e\u2032})) \u2265 1    (5)\n\n\u2200i = 1, . . . , |D|, \u2200y\u2032 \u2282 y_i, \u2200e \u2208 ext(y\u2032) \u2229 y_i, \u2200e\u2032 \u2208 ext(y\u2032) \\ y_i,\n\u2200y\u2033 : y\u2032 \u2282 y\u2033 \u2282 S_x.\n\nIntuitively, the first set of constraints (Eq. 4) ensures that \u201ccorrect\u201d extensions (i.e. edges that actually\nbelong to the target output structure y_i) receive a higher weight than \u201cwrong\u201d extensions (i.e. edges\nthat do not belong to the target output structure). The purpose of the second set of constraints (Eq. 5)\nis to force the learned objective function to obey the consistency property of Eq. (1), which in turn\nensures the correctness of the greedy algorithm thanks to Theorem 3.3. 
As usual, a regularized\nvariant with soft constraints can be formulated by introducing positive slack variables and adding\ntheir 1-norm times a regularization coefficient to Eq. (3). The number of resulting constraints in the\nabove formulation grows exponentially with the number of edges in each example, hence naively\nsolving problem (3\u20135) is infeasible in practice. However, we can seek an approximate solution by\nleveraging the efficiency of the greedy algorithm also during learning. For this purpose, we will use\nan online active learner that samples constraints chosen by the execution of the greedy construction\nalgorithm.\nFor each epoch, the algorithm maintains the current highest scoring partial correct output y\u2032_i \u2286 y_i\nfor each example, initialized with the empty MBG structure, where the score is computed by the\ncurrent objective function F. While there are \u201cunprocessed\u201d examples in D, the algorithm picks\na random one and its current best MBG structure y\u2032. If there are no more correct extensions of\ny\u2032, then y\u2032 = y_i and the example is removed from D. Otherwise, the algorithm evaluates each\ncorrect extension of y\u2032, updates the current best MBG structure, and invokes the online learner by\ncalling FORCE-CONSTRAINT, which adds a constraint derived from a random incorrect extension\n(see Eq. 4). It also performs a predefined number L of lookaheads by picking a random superset\ny\u2033 of y\u2032 that is included in the target y_i, evaluating it and updating the best MBG structure if needed,\nand adding a corresponding consistency constraint (see Eq. 5). The epoch terminates when all\nexamples are processed. In practice, we found that a single epoch over the data set is sufficient for\nconvergence. Pseudocode for one epoch is given below.\n\nGREEDYEPOCH(D, L)\n  for i \u2190 1, . . . , |D|\n    do y\u2032_i \u2190 \u2205\n  while D \u2260 \u2205\n    do pick a random example (x_i, y_i) \u2208 D\n       y\u2032 \u2190 y\u2032_i, y\u2032_i \u2190 \u2205\n       if ext(y\u2032) \u2229 y_i = \u2205\n         then D \u2190 D \\ (x_i, y_i)\n         else for each e \u2208 ext(y\u2032) \u2229 y_i\n           do pick randomly e\u2032 \u2208 ext(y\u2032) \\ y_i\n              if F(y\u2032_i) < F(y\u2032 \u222a {e}) then y\u2032_i \u2190 y\u2032 \u222a {e}\n              FORCE-CONSTRAINT(F_{x_i}(y\u2032 \u222a {e}) \u2212 F_{x_i}(y\u2032 \u222a {e\u2032}) \u2265 1)\n              for l \u2190 1, . . . , L\n                do randomly choose y\u2033 : y\u2032 \u2282 y\u2033 \u2282 y_i \u2227 e, e\u2032 \u2208 S_x \\ y\u2033\n                   FORCE-CONSTRAINT(F_{x_i}(y\u2033 \u222a {e}) \u2212 F_{x_i}(y\u2033 \u222a {e\u2032}) \u2265 1)\n                   if F(y\u2032_i) < F(y\u2033 \u222a {e}) then y\u2032_i \u2190 y\u2033 \u222a {e}\n\nThere are several suitable online learners implementing the interface required by the above\nprocedure. Possible candidates include perceptron-like or ALMA-like update rules like those proposed\nin [4] for structured output learning (in our case the update would depend on the difference between\nfeature vectors of correctly and incorrectly extended structures in the inner loop of GREEDYEPOCH).\nAn alternative online learner is the LaSVM algorithm [3] equipped with obvious modifications for\nhandling constraints between pairs of examples. LaSVM is an SMO-like solver for the dual version\nof problem (3\u20135) that optimizes one or two coordinates at a time, alternating process (on newly\nacquired examples, generated in our case by the FORCE-CONSTRAINT procedure) and reprocess\n(on previously seen support vectors or patterns) steps. The ability to work efficiently in the dual\nis the most appealing feature of LaSVM in the present context and is advantageous with respect to\nperceptron-like approaches. 
Our unsuccessful preliminary experiments with simple feature vectors\nconfirmed the necessity of flexible design choices for developing rich feature spaces. Kernel methods\nare clearly more attractive in this case. We will therefore rewrite the objective function F using\na kernel k(z, z\u2032) = \u27e8\u03c6_x(y), \u03c6_{x\u2032}(y\u2032)\u27e9 between two structured instances z = (x, y) and z\u2032 = (x\u2032, y\u2032),\nso that F_x(y) = F(z) = \u2211_i \u03b1_i k(z, z_i).\nLet \u03c3_i(z) denote the set of edges incident on ion \u03b9_i \u2208 I \\ nil and n(z) the number of non-nil ion\nidentifiers that have at least one incident edge. Below is a top-down definition of the kernel used in\nthe subsequent experiments.\n\nk(z, z\u2032) = k_glob(z, z\u2032) \u2211_{i=1}^{n(z)} \u2211_{j=1}^{n(z\u2032)} k_mbs(\u03c3_i(z), \u03c3_j(z\u2032)) / (n(z) n(z\u2032))    (6)\n\nk_glob(z, z\u2032) = \u03b4(n(z), n(z\u2032)) \u00b7 2 min{|x|, |x\u2032|} / (|x| + |x\u2032|)    (7)\n\nk_mbs(\u03c3_i(z), \u03c3_j(z\u2032)) = \u03b4(|\u03c3_i(z)|, |\u03c3_j(z\u2032)|) \u2211_{\u2113=1}^{|\u03c3_i(z)|} k_res(x_i(\u2113), x\u2032_j(\u2113))    (8)\n\nwhere \u03b4(a, b) = 1 iff a = b, x_i(\u2113) denotes the \u2113-th residue in \u03c3_i(z), taken in increasing order\nof sequential position in the protein, and k_res(x_i(\u2113), x\u2032_j(\u2113)) is simply the dot product between the\nfeature vectors describing residues x_i(\u2113) and x\u2032_j(\u2113) (details on these features are given in Section 5).\nk_mbs measures the similarity between individual sites (two sites are orthogonal if they have a different\nnumber of ligands, a choice that is supported by protein functional considerations). 
k_glob ensures\nthat two structures are orthogonal unless they have the same number of sites and down-weights their\nsimilarity when their number of candidate ligands differs.\n\n5 Experiments\nWe tested the method on a dataset of non-redundant proteins previously used in [12]\nfor metal bonding state prediction (http://www.dsi.unifi.it/~passe/datasets/mbs06/dataset.tgz).\nProteins that do not bind metal ions (used in [12] as negative examples)\nare of no interest in the present case and were removed, resulting in a set of 199 metalloproteins\nbinding transition metals. Following [12], we used T = {C, H} as the set of candidate ligands.\nProtein sequences were enriched with evolutionary information derived from multiple alignments.\nProfiles were obtained by running one iteration of PSI-BLAST on the non-redundant (nr) NCBI\ndataset, with an e-value cutoff of 0.005. Each candidate ligand x_i(\u2113) was described by a feature\nvector of 221 real numbers. The first 220 attributes consist of multiple alignment profiles in the\nwindow of 11 amino acids centered around x_i(\u2113) (the window was formed from the original protein\nsequence, not the substring x_i of candidate ligands). The last attribute is the normalized sequence\nseparation between x_i(\u2113) and x_i(\u2113 \u2212 1), using the N-terminus of the chain for \u2113 = 1.\nA modified version of LaSVM (http://leon.bottou.org/projects/lasvm) was run\nwith constraints produced by the GREEDYEPOCH procedure of Section 4, using a fixed regularization\nparameter C = 1, and L \u2208 {0, 5, 10}. All experiments were repeated 30 times, randomly\nsplitting the data into a training and test set in a ratio of 80/20. Two prediction tasks were considered,\nfrom unknown and from known metal bonding state (a similar distinction is also customary\nfor the related task of disulfide bonds prediction, see e.g. [15]). 
In the latter case, the input x only\ncontains actual ligands and no nil symbol is needed.\nSeveral measures of performance are reported in Table 1. PE and RE are the precision and recall\nfor the correct assignment between a residue and the metal ion identi\ufb01er (ratio of correctly pre-\ndicted coordinations to the number of predicted/actual coordinations); correct links to the nil ion\n(that would optimistically bias the results) are ignored in these measures. AG is the geometry ac-\ncuracy, i.e.\nthe fraction of chains that are entirely correctly predicted. PS and RS are the metal\nbinding site precision and recall, respectively (ratio of correctly predicted sites to the number of pre-\ndicted/actual sites). Finally, PB and RB are precision and recall for metal bonding state prediction\n(as in binary classi\ufb01cation, being \u201cbonded\u201d the positive class). Table 2 reports the breakdown of\nthese performance measures for proteins binding different numbers of metal ions (for L = 10).\nResults show that enforcing consistency constraints tends to improve recall, especially for the bond-\ning state prediction, i.e. helps the predictor to assign a residue to a metal ion identi\ufb01er rather than to\nnil. However, it only marginally improves precision and recall at the site level. Correct prediction of\nwhole sites is very challenging and correct prediction of whole chains even more dif\ufb01cult (given the\nenormous number of alternatives to be compared). Hence, it is not surprising that some of these per-\nformance indicators are low. By comparison, absolute \ufb01gures are not high even for the much easier\ntask of disul\ufb01de bonds prediction [15]. Correct edge assignment, however, appears satisfactory and\nreasonably good when the bonding state is given. 
The complete experimental environment can be\nobtained from http://www.disi.unitn.it/~passerini/nips08.tgz.\n\nTable 1: Experimental results.\n\nab-initio\nL    PE     RE     AG     PS     RS     PB     RB\n0    75\u00b15   46\u00b15   12\u00b14   18\u00b16   14\u00b16   81\u00b15   51\u00b16\n5    66\u00b15   52\u00b14   14\u00b16   20\u00b17   17\u00b16   79\u00b14   64\u00b16\n10   63\u00b15   52\u00b15   13\u00b16   20\u00b17   15\u00b16   78\u00b14   68\u00b15\n\nmetal bonding state given\nL    PE     RE     AG     PS     RS\n0    87\u00b12   87\u00b12   64\u00b16   65\u00b16   65\u00b16\n5    87\u00b13   87\u00b13   65\u00b17   66\u00b17   66\u00b17\n10   88\u00b13   88\u00b13   67\u00b17   67\u00b17   67\u00b17\n\nTable 2: Breakdown by number of sites in each chain. BS = (K)nown/(U)nknown bonding state.\n\n# sites (chains)   BS   PE      RE      AG      PS      RS\n1 (132 chains)     U    62\u00b16    57\u00b16    19\u00b18    25\u00b19    21\u00b18\n1 (132 chains)     K    97\u00b12    97\u00b12    92\u00b16    92\u00b16    92\u00b16\n2 (48 chains)      U    67\u00b19    46\u00b18    3\u00b16     14\u00b112   6\u00b18\n2 (48 chains)      K    73\u00b15    73\u00b15    20\u00b111   21\u00b110   21\u00b110\n3 (11 chains)      U    65\u00b116   33\u00b113   0       1\u00b15     1\u00b15\n3 (11 chains)      K    61\u00b112   61\u00b112   0       8\u00b111    9\u00b113\n4 (8 chains)       U    44\u00b131   24\u00b120   0       3\u00b111    2\u00b16\n4 (8 chains)       K    37\u00b125   37\u00b125   0       1\u00b12     1\u00b12\n\n6 Related works\nAs mentioned in the Introduction, methods for structured outputs usually learn a function F on\ninput-output pairs (x, y) and construct the predicted output as f(x) = arg max_y F(x, y). Our approach\nfollows the same general principle.\nThere is a notable analogy between the constrained optimization problem (3\u20135) and the set of\nconstraints derived in [15] for the related problem of disulfide connectivity. As in [15], our method\nis based on a large-margin approach for solving a structured output prediction problem. 
The un-\nderlying formal problems are however very different and require different algorithmic solutions.\nDisul\ufb01de connectivity is a (perfect) matching problem since each cysteine is bound to exactly one\nother cysteine (assuming known bonding state, yielding a perfect matching) or can be bound to an-\nother cysteine or free (unknown bonding state, yielding a non-perfect matching). The original set of\nconstraints in [15] only focuses on complete structures (non extensible set or bases, in our terminol-\nogy). It also has exponential size but the matching structure of the problem in that case allows the\nauthors to derive a certi\ufb01cate formulation that reduces it to polynomial size. The MBG problem is\nnot a matching problem but has the structure of a matroid and our formulation allows us to control\nthe number of effectively enforced constraints by taking advantage of a greedy algorithm.\nThe idea of an online learning procedure that receives examples generated by an algorithm which\nconstructs the output structure was inspired from the Learning as Search Optimization (LaSO) ap-\nproach [4]. LaSO aims to solve a much broader class of structured output problems where good\noutput structures can be generated by AI-style search algorithms such as beam search or A*. The\ngeneration of a fresh set of siblings in LaSO when the search is stuck with a frontier of wrong can-\ndidates (essentially a backtrack) is costly compared to our greedy selection procedure and (at least\nin principle) unnecessary when working on matroids.\nAnother general way to deal with the exponential growth of the search space is to introduce a gener-\native model so that arg maxy F (x, y) can be computed ef\ufb01ciently, e.g. by developing an appropriate\ndynamic programming algorithm. Stochastic grammars and related conditional models have been\nextensively used for this purpose [2]. These approaches work well if the generative model matches\nor approximates well the domain at hand. 
Unfortunately, as discussed in Section 2, the speci\ufb01c ap-\nplication problem we study in this paper cannot be even modeled by a context-free grammar. While\nwe do not claim that it is impossible to devise a suitable generative model for this task (and indeed\nthis is an interesting direction of research), we can argue that handling context-sensitiveness is hard.\nIt is of course possible to approximate context sensitive dependencies using a simpli\ufb01ed model. In-\ndeed, an alternative view of the MBG problem is supervised sequence labeling, where the output\nstring consists of symbols in I. A (higher-order) hidden Markov model or chain-structured condi-\ntional random \ufb01eld could be used as the underlying generative model for structured output learning.\n\n\fUnfortunately, these approaches are unlikely to be very accurate since models that are structured as\nlinear chains of dependencies cannot easily capture long-ranged interactions such as those occurring\nin the example. In our preliminary experiments, SVMHMM [16] systematically assigned all bonded\nresidues to the same ion, thus never correctly predicted the geometry except in trivial cases.\n\n7 Conclusions\n\nWe have reported about the \ufb01rst successful solution to the challenging problem of predicting protein\nmetal binding geometry from sequence alone. The result \ufb01lls-in an important gap in structural and\nfunctional bioinformatics. Learning with structured outputs is a fairly dif\ufb01cult task and in spite of\nthe fact that several methodologies have been proposed, no single general approach can effectively\nsolve every possible application problem. 
The solution proposed in this paper draws on several\nprevious ideas and specifically leverages the existence of a matroid for the metal binding problem.\nOther problems that formally exhibit a greedy structure might benefit from similar solutions.\n\nAcknowledgments\n\nWe thank Thomas G\u00e4rtner for very fruitful discussions.\n\nReferences\n[1] M. Babor, S. Gerzon, B. Raveh, V. Sobolev, and M. Edelman. Prediction of transition metal-binding sites from apo protein structures. Proteins, 70(1):208\u2013217, 2008.\n[2] G. Bakir, T. Hofmann, B. Sch\u00f6lkopf, A. Smola, B. Taskar, and S. Vishwanathan, editors. Predicting Structured Data. The MIT Press, 2007.\n[3] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579\u20131619, 2005.\n[4] H. Daum\u00e9 III and D. Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. In Proc. of the 22nd Int. Conf. on Machine Learning (ICML\u201905), 2005.\n[5] J. C. Ebert and R. B. Altman. Robust recognition of zinc binding sites in proteins. Protein Sci, 17(1):54\u201365, 2008.\n[6] F. Ferr\u00e8 and P. Clote. DiANNA 1.1: an extension of the DiANNA web server for ternary cysteine classification. Nucleic Acids Res, 34:W182\u2013W185, 2006.\n[7] P. Helman, B. M. E. Moret, and H. D. Shapiro. An exact characterization of greedy structures. SIAM J. Disc. Math., 6(2):274\u2013283, 1993.\n[8] N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, B. A. Cuche, E. de Castro, C. Lachaize, P. S. Langendijk-Genevaux, and C. J. A. Sigrist. The 20 years of PROSITE. Nucleic Acids Res, 36:D245\u20139, 2008.\n[9] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.\n[10] A. Messerschmidt, R. Huber, K. Wieghardt, and T. Poulos, editors. Handbook of Metalloproteins. John Wiley & Sons, 2004.\n[11] A. Passerini, C. 
Andreini, S. Menchetti, A. Rosato, and P. Frasconi. Predicting zinc binding at the proteome level. BMC Bioinformatics, 8:39, 2007.\n[12] A. Passerini, M. Punta, A. Ceroni, B. Rost, and P. Frasconi. Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks. Proteins, 65(2):305\u2013316, 2006.\n[13] W. Shi, C. Zhan, A. Ignatov, B. A. Manjasetty, N. Marinkovic, M. Sullivan, R. Huang, and M. R. Chance. Metalloproteomics: high-throughput structural and functional annotation of proteins in structural genomics. Structure, 13(10):1473\u20131486, 2005.\n[14] N. Shu, T. Zhou, and S. Hovm\u00f6ller. Prediction of zinc-binding sites in proteins from sequence. Bioinformatics, 24(6):775\u2013782, 2008.\n[15] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: a large margin approach. In Proc. of the 22nd Int. Conf. on Machine Learning (ICML\u201905), pages 896\u2013903, 2005.\n[16] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453\u20131484, 2005.\n[17] J. Weston, O. Chapelle, A. Elisseeff, B. Sch\u00f6lkopf, and V. Vapnik. Kernel dependency estimation. Advances in Neural Information Processing Systems, 15:873\u2013880, 2003.\n", "award": [], "sourceid": 886, "authors": [{"given_name": "Paolo", "family_name": "Frasconi", "institution": null}, {"given_name": "Andrea", "family_name": "Passerini", "institution": null}]}