{"title": "Protein Interface Prediction using Graph Convolutional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6530, "page_last": 6539, "abstract": "We consider the prediction of interfaces between proteins, a challenging problem with important applications in drug discovery and design, and examine the performance of existing and newly proposed spatial graph convolution operators for this task. By performing convolution over a local neighborhood of a node of interest, we are able to stack multiple layers of convolution and learn effective latent representations that integrate information across the graph representing the three-dimensional structure of a protein of interest. An architecture that combines the learned features across pairs of proteins is then used to classify pairs of amino acid residues as part of an interface or not. In our experiments, several graph convolution operators yielded accuracy that is better than the state-of-the-art SVM method in this task.", "full_text": "Protein Interface Prediction using Graph Convolutional Networks\n\nAlex Fout\u2020\nDepartment of Computer Science\nColorado State University\nFort Collins, CO 80525\nfout@colostate.edu\n\nJonathon Byrd\u2020\nDepartment of Computer Science\nColorado State University\nFort Collins, CO 80525\njonbyrd@colostate.edu\n\nBasir Shariat\u2020\nDepartment of Computer Science\nColorado State University\nFort Collins, CO 80525\nbasir@cs.colostate.edu\n\nAsa Ben-Hur\nDepartment of Computer Science\nColorado State University\nFort Collins, CO 80525\nasa@cs.colostate.edu\n\nAbstract\n\nWe consider the prediction of interfaces between proteins, a challenging problem with important applications in drug discovery and design, and examine the performance of existing and newly proposed spatial graph convolution operators for this task. 
By performing convolution over a local neighborhood of a node of interest, we are able to stack multiple layers of convolution and learn effective latent representations that integrate information across the graph representing the three-dimensional structure of a protein of interest. An architecture that combines the learned features across pairs of proteins is then used to classify pairs of amino acid residues as part of an interface or not. In our experiments, several graph convolution operators yielded accuracy that is better than the state-of-the-art SVM method in this task.\n\n1 Introduction\n\nIn many machine learning tasks we are faced with structured objects that can naturally be modeled as graphs. Examples include the analysis of social networks, molecular structures, knowledge graphs, and computer graphics, to name a few. The remarkable success of deep neural networks in a wide range of challenging machine learning tasks, from computer vision [14, 15] and speech recognition [12] to machine translation [24] and computational biology [4], has resulted in a resurgence of interest in this area. This success has also led to the more recent interest in generalizing the standard notion of convolution over a regular grid representing a sequence or an image to convolution over graph structures, making these techniques applicable to the wide range of prediction problems that can be modeled in this way [8].\n\nIn this work we propose a graph convolution approach that allows us to tackle the challenging problem of predicting protein interfaces. Proteins are chains of amino acid residues that fold into a three-dimensional structure that gives them their biochemical function. Proteins perform their function through a complex network of interactions with other proteins. The prediction of those interactions, and of the interfaces through which they occur, is an important and challenging problem that has attracted much attention [10]. 
This paper focuses on predicting protein interfaces.\n\n\u2020 denotes equal contribution\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nDespite the plethora of available methods for interface prediction, it has been recently noted that \"The field in its current state appears to be saturated. This calls for new methodologies or sources of information to be exploited\" [10]. Most machine learning methods for interface prediction rely on hand-crafted features that come from a domain expert's insight into quantities that are likely to be useful, combined with standard machine learning approaches. Commonly used features for this task include surface accessibility, sequence conservation, residue properties such as hydrophobicity and charge, and various shape descriptors (see Aumentado et al. [6] for a review of the most commonly used features for this task).\n\nThe task of object recognition in images has similarities to interface prediction: images are represented as feature values on a 2D grid, whereas the solved crystal structure of a protein can be thought of as a collection of features on an irregular 3D grid corresponding to the coordinates of its atoms. In both cases, we are trying to recognize an object within a larger context. This suggests that approaches that have proven successful in image classification can be adapted to work for protein structures, and has motivated us to explore the generalization of the convolution operator to graph data. In fact, several techniques from computer vision have found their way into the analysis of protein structures, especially methods for locally describing the shape of an object, and various spectral representations of shape (see e.g. 
[18, 17]).\n\nIn this work we evaluate multiple existing and newly proposed graph convolution operators and introduce an architecture for the task of predicting interfaces between pairs of proteins using a graph representation of the underlying protein structure. Our results demonstrate that this approach provides state-of-the-art accuracy, outperforming a recent SVM-based approach [2]. The proposed convolution operators are not specific to interface prediction. They are applicable to graphs of arbitrary size and structure, do not require imposing an ordering on the nodes, allow for representing both node and edge features, and maintain the original graph structure, allowing multiple convolution operations without the need to downsample the graph. We therefore expect them to be applicable to a variety of other learning problems on graphs.\n\n2 Methods for Graph Convolution\n\nIn this work we consider learning problems over a collection of graphs where prediction occurs at the node level. Nodes and edges have features associated with them, and we denote by x_i the feature vector associated with node i and by A_ij the feature vector associated with the edge between nodes i and j, where for simplicity we have omitted indexing over graphs.\n\nWe describe a framework that allows us to learn a representation of a local neighborhood around each node in a graph. In the domains of image, audio, or text data, convolutional networks learn local features by assigning an ordering to pixels, amplitudes, or words based on the structure inherent to the domain, and associating a weight vector/matrix with each position within a receptive field. The standard notion of convolution over a sequence (1D convolution) or an image (2D convolution) relies on having a regular grid with a well-defined neighborhood at each position in the grid, where each position has a well-defined relationship to its neighbors, e.g. 
\"above\", \"below\", \"to the right\" in the case of a 2D grid. On a graph structure there is usually no natural choice for an ordering of the neighbors of a node. Our objective is to design convolution operators that can be applied to graphs without a regular structure, and without imposing a particular order on the neighbors of a given node. To summarize, we would like to learn a mapping at each node in the graph of the form z_i = \u03c3_W(x_i, {x_n1, . . . , x_nk}), where {n_1, . . . , n_k} are the neighbors of node i that define the receptive field of the convolution, \u03c3 is a non-linear activation function, and W are its learned parameters; the dependence on the neighboring nodes as a set represents our intention to learn a function that is order-independent. We present the following two realizations of this operator, which provide the output of a set of filters in a neighborhood of a node of interest that we refer to as the \"center node\":\n\nz_i = \u03c3( W^C x_i + (1/|N_i|) \u2211_{j \u2208 N_i} W^N x_j + b ),    (1)\n\nwhere N_i is the set of neighbors of node i, W^C is the weight matrix associated with the center node, W^N is the weight matrix associated with neighboring nodes, and b is a vector of biases, one for each filter. The dimensionality of the weight matrices is determined by the dimensionality of the inputs and the number of filters. The computational complexity of this operator on a graph with n nodes, a neighborhood of size k, F_in input features, and F_out output features is O(k F_in F_out n). Construction of the neighborhood is straightforward using a preprocessing step that takes O(n^2 log n).\n\nFigure 1: Graph convolution on protein structures. Left: Each residue in a protein is a node in a graph where the neighborhood of a node is the set of neighboring nodes in the protein structure; each node has features computed from its amino acid sequence and structure, and edges have features describing the relative distance and angle between residues. Right: Schematic description of the convolution operator, which has as its receptive field a set of neighboring residues, and produces an activation which is associated with the center residue.\n\nIn order to provide for some differentiation between neighbors, we incorporate features on the edges between each neighbor and the center node as follows:\n\nz_i = \u03c3( W^C x_i + (1/|N_i|) \u2211_{j \u2208 N_i} W^N x_j + (1/|N_i|) \u2211_{j \u2208 N_i} W^E A_ij + b ),    (2)\n\nwhere W^E is the weight matrix associated with edge features.\n\nFor comparison with order-independent methods we propose an order-dependent method, where order is determined by distance from the center node. In this method each neighbor has unique weight matrices for nodes and edges:\n\nz_i = \u03c3( W^C x_i + (1/|N_i|) \u2211_{j \u2208 N_i} W^N_j x_j + (1/|N_i|) \u2211_{j \u2208 N_i} W^E_j A_ij + b ).    (3)\n\nHere W^N_j / W^E_j are the weight matrices associated with the jth node or the edges connecting to the jth node, respectively. This operator is inspired by the PATCHY-SAN method of Niepert et al. [16]. It is more flexible than the order-independent convolution operators, allowing the learning of distinctions between neighbors at the cost of significantly more parameters.\n\nMultiple layers of these graph convolution operators can be used, and this will have the effect of learning features that characterize the graph at increasing levels of abstraction, and will also allow information to propagate through the graph, thereby integrating information across regions of increasing size. 
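As a concrete illustration, the order-independent operator with edge features (Equation (2)) can be sketched in NumPy. This is a minimal sketch with hypothetical array names, using ReLU as the activation; the actual implementation described later is in TensorFlow.

```python
import numpy as np

def node_edge_avg_conv(X, A, neighbors, Wc, Wn, We, b):
    """Sketch of the order-independent convolution of Equation (2).

    X: (n, F_in) node features; A: (n, n, F_e) edge features;
    neighbors: one integer index array per node (its receptive field);
    Wc, Wn: (F_in, F_out); We: (F_e, F_out); b: (F_out,).
    """
    Z = np.empty((X.shape[0], Wc.shape[1]))
    for i, Ni in enumerate(neighbors):
        center = X[i] @ Wc                          # center-node term
        neigh = X[Ni].mean(axis=0) @ Wn             # average over neighbor nodes
        edge = A[i, Ni].mean(axis=0) @ We           # average over incident edges
        Z[i] = np.maximum(center + neigh + edge + b, 0.0)  # ReLU activation
    return Z
```

Because the neighbor and edge terms are plain averages, the same weights apply to any neighborhood size and to any ordering of the neighbors.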
Furthermore, these operators are rotation-invariant if the features have this property.\n\nIn convolutional networks, inputs are often downsampled based on the size and stride of the receptive field. It is also common to use pooling to further reduce the size of the input. Our graph operators, on the other hand, maintain the structure of the graph, which is necessary for the protein interface prediction problem, where we classify pairs of nodes from different graphs rather than entire graphs. Using architectures with only convolutional layers and no downsampling is common practice in the area of graph convolutional networks, especially if classification is performed at the node or edge level. This practice has support from the success of networks without pooling layers in the realm of object recognition [23]. The downside of not downsampling is higher memory and computational costs.\n\nRelated work. Several authors have recently proposed graph convolutional operators that generalize the notion of convolution over a regular grid. Spectral graph theory forms the basis for several of these methods [8], in which convolutional filters are viewed as linear operators on the eigenvectors of the graph Laplacian (or an approximation thereof [13]). Our protein dataset consists of multiple graphs with no natural correspondence to each other, making it difficult to apply methods based on the graph Laplacian. In what follows we describe several existing spatial graph convolutional methods, remarking on the aspects which resemble or helped inspire our implementation.\n\nIn their Molecular Fingerprint Networks (MFNs), Duvenaud et al. 
[9] proposed a spatial graph convolution approach similar to Equation (1), except that they use a single weight matrix for all nodes in a receptive field and sum the results, whereas we distinguish between the center node and the neighboring nodes, and we average over neighbors rather than sum over them. Furthermore, their graphs do not contain edge features, so their convolution operator does not make use of them. MFNs were designed to generate a feature representation of an entire molecule. In contrast, our node-level prediction task motivates distinguishing between the center node, whose representation is being computed, and the neighboring nodes, which provide information about the local environment of the node. Averaging is important in our problem to allow for neighborhoods of any size.\n\nSchlichtkrull et al. [19] describe Relational Graph Convolutional Networks (RGCNs), which consider graphs with a large number of binary edge types, where a unique neighborhood is defined by each edge type. To reduce the total number of model parameters, they employ basis matrices or block-diagonal constraints to introduce shared parameters between the representations of different edge/neighborhood types. That aspect of the method is not relevant to our problem, and without it, Equation (1) closely resembles their convolution operator.\n\nSch\u00fctt et al. [21] define Deep Tensor Neural Networks (DTNNs) for predicting molecular energies. This version of graph convolution uses the node and edge information from neighbors to produce an additive update to the center node:\n\nz_i = x_i + (1/|N_i|) \u2211_{j \u2208 N_i} \u03c3( W [ (W^N x_j + b^N) \u2299 (W^E A_ij + b^E) ] ),    (4)\n\nwhere \u2299 denotes the elementwise product, W, W^N, and W^E are weight matrices, and b^N and b^E are bias vectors. 
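The DTNN-style update of Equation (4) can be sketched under the same assumptions (NumPy stand-in, hypothetical array names, tanh for the activation as in the original DTNN work):

```python
import numpy as np

def dtnn_update(X, A, neighbors, W, Wn, We, bn, be):
    """Sketch of the additive DTNN update (Equation (4)): each center node
    keeps its dimensionality and adds an averaged, gated neighbor message."""
    Z = X.copy()
    for i, Ni in enumerate(neighbors):
        # gate node and edge signals against each other (elementwise product)
        gated = (X[Ni] @ Wn + bn) * (A[i, Ni] @ We + be)
        msgs = np.tanh(gated @ W)        # nonlinearity applied per neighbor
        Z[i] = X[i] + msgs.mean(axis=0)  # additive update to the center node
    return Z
```

Note that `W` must map back to the input dimensionality, which is exactly the constraint discussed next: a DTNN layer's output has the same width as its input.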
Edge information is incorporated similarly to Equation (2), with a difference in how the edge and node signals are combined: their choice is an elementwise product rather than a sum. Another difference is that DTNN convolution forces the output of a layer to have the same dimensionality as its input; our approach does not require that, allowing the networks to have varying numbers of filters across convolutional layers.\n\nRather than operate on fixed neighborhoods, Atwood and Towsley [5] take a different spatial convolution approach in their Diffusion-Convolutional Neural Networks (DCNNs), and apply multiple steps (or \"hops\") of a diffusion operator that propagates the value of an individual feature across the graph. A node after k hops will contain information from all nodes that have walks of length k ending at that node. If X is a data matrix where each row corresponds to a node and each column to a different feature, then the representation of X after a k-hop convolution is:\n\nZ^k = \u03c3( w^k P^k X ),    (5)\n\nwhere w^k is the k-hop vector of weights and P^k is the transition matrix raised to the power k. Rather than stack multiple convolution layers, the authors apply the diffusion operator using multiple hop numbers. In our work we use this method with an adjacency matrix whose entries are an exponentially decreasing function of the distance between nodes.\n\nProteins as graphs. In this work we represent a protein as a graph where each amino acid residue is a node whose features represent the properties of the residue; the spatial relationships between residues (distances, angles) are represented as features of the edges that connect them (see Figure 1). The neighborhood of a node used in the convolution operator is the set of k closest residues as determined by the mean distance between their atoms. 
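The fixed-size receptive field just described can be sketched as follows; for simplicity, this hypothetical version measures distances between a single 3D point per residue rather than the mean distance over all atom pairs:

```python
import math

def knn_neighborhoods(coords, k):
    """Sketch of receptive-field construction: for each residue (here one
    (x, y, z) point standing in for its atoms), return the indices of its
    k closest residues, excluding the residue itself."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    neighborhoods = []
    for i, ci in enumerate(coords):
        # sort all other residues by distance; over n nodes this sorting
        # dominates the O(n^2 log n) preprocessing cost mentioned earlier
        others = sorted((dist(ci, cj), j) for j, cj in enumerate(coords) if j != i)
        neighborhoods.append([j for _, j in others[:k]])
    return neighborhoods
```

Each returned index list is one node's receptive field for the convolution operators above.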
Before going into the details of the node and edge features we describe the neural network architecture.\n\nFigure 2: An overview of the pairwise classification architecture. Each neighborhood of a residue in the two proteins is processed using one or more graph convolution layers, with weight sharing between the legs of the network. The activations generated by the convolutional layers are merged by concatenating them, followed by one or more regular dense layers.\n\nData Partition | Complexes | Positive examples | Negative examples\nTrain | 140 | 12,866 (9.1%) | 128,660 (90.9%)\nValidation | 35 | 3,138 (0.2%) | 1,874,322 (99.8%)\nTest | 55 | 4,871 (0.1%) | 4,953,446 (99.9%)\n\nTable 1: Number of complexes and examples in the Docking Benchmark Dataset. Positive examples are residue pairs that participate in the interface; negative examples are pairs that do not. For training we downsample the negative examples for an overall ratio of 10:1 of negative to positive examples; in validation and testing all the negative examples are used.\n\nPairwise classification architecture. In the protein interface prediction problem, examples are composed of pairs of residues, one from a ligand protein and one from a receptor protein, i.e., our task is to classify pairs of nodes from two separate graphs representing those proteins. More formally, our data are a set of N labeled pairs {((l_i, r_i), y_i)}_{i=1}^N, where l_i is a residue (node) in the ligand, r_i is a residue (node) in the receptor protein, and y_i \u2208 {\u22121, 1} is the associated label that indicates whether the two residues are interacting or not. The role of ligand/receptor is arbitrary, so we would like to learn a scoring function that is independent of the order in which the two residues are presented to the network. In the context of SVM-based methods this can be addressed using pairwise kernels, building the invariance into the representation (see e.g. [2]). 
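For a model with an explicit feature representation, one way to obtain such order invariance is to score each pair in both orders and average. A minimal sketch, with a hypothetical `score` function standing in for the dense layers applied after the merge:

```python
def order_invariant_score(score, ligand_repr, receptor_repr):
    """Average the model's output over both concatenation orders, so that
    swapping the (arbitrary) ligand/receptor roles leaves the prediction
    unchanged. Representations are plain lists of feature values here."""
    s_lr = score(ligand_repr + receptor_repr)   # list concatenation
    s_rl = score(receptor_repr + ligand_repr)
    return 0.5 * (s_lr + s_rl)
```

Even when `score` itself is asymmetric in its inputs, the averaged prediction is identical for (l, r) and (r, l).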
To create an order-invariant model in a setting which requires an explicit feature representation, we considered two approaches. One is to construct explicit features that are order-invariant by taking the sum and element-wise product of the two feature vectors. Note that pairwise kernels implicitly use all products of features, which we avoid by taking the element-wise product. Another approach is to present each example to the model in both possible orders, (l_i, r_i) and (r_i, l_i), and average the two predictions; the feature representation of an example is the concatenation of the features of the two residues [3]. In preliminary experiments both approaches yielded similar results, and our reported results use the latter.\n\nOur network architecture is composed of two identical \"legs\" which learn feature representations of the ligand and receptor proteins of a complex by applying multiple layers of graph convolution to each. The weights between the two legs are shared. We then merge the legs by concatenating residue representations together to create the representation of residue pairs. The resulting features are then passed through one or more fully-connected layers before classification (see Figure 2).\n\n3 Experiments\n\nData. In our experiments we used the data from Version 5 of the Docking Benchmark Dataset (DBD), the standard benchmark dataset for assessing docking and interface prediction methods [25]. These complexes are a carefully selected subset of structures from the Protein Data Bank (PDB). The structures are generated from x-ray crystallography or nuclear magnetic resonance experiments and contain the atomic coordinates of each amino acid residue in the protein. These proteins range in length from 29 to 1979 residues, with a median of 203.5. For each complex, DBD includes both bound and unbound forms of each protein in the complex. 
Our features are computed from the unbound form, since proteins can alter their shape upon binding, and the labels are derived from the structure of the proteins in complex. As in previous work [2], two residues from different proteins are considered part of the interface if any non-hydrogen atom in one is within 6\u00c5 of any non-hydrogen atom in the other when in complex.\n\nFor our test set we used the 55 complexes that were added since version 4.0 of DBD, and separated the complexes in DBD 4.0 into training and validation sets. In dividing the complexes into training and validation sets we stratified them by difficulty and type using the information provided in DBD. Because in any given complex there are vastly more residue pairs that don\u2019t interact than pairs that do, we downsampled the negative examples in the training set to obtain a 10:1 ratio of negative to positive examples. Final models used for testing were trained using the training and validation data, with the same 10:1 ratio of negative to positive examples. Dataset sizes are shown in Table 1.\n\nNode and edge features. Each node and edge in the graph representing a protein has features associated with it that are computed from the protein\u2019s sequence and structure. For the node features we used the same features used in earlier work [2], as summarized next. Protein sequence alone can be a good indicator of the propensity of a residue to form an interface, because each amino acid exhibits unique electrochemical and geometric properties. 
Furthermore, the level of conservation\nof a residue in alignments against similar proteins also provides valuable information, since surface\nresidues that participate in an interface tend to be more conserved than surface residues that do not.\nThe identity and conservation of a residue are quanti\ufb01ed by 20 features that capture the relative\nfrequency of each of the 20 amino acids in alignments to similar proteins. Earlier methods used\nthese features by considering a window of size 11 in sequence centered around the residue of interest\nand concatenating their features [2]. Since we are explicitly representing the structure of a protein,\neach node contains only the sequence features of the corresponding residue. In addition to these\nsequence-based features, each node contains several features computed from the structure. These\ninclude a residue\u2019s surface accessibility, a measure of its protrusion, its distance from the surface, and\nthe counts of amino acids within 8\u00c5 in two directions\u2014towards the residue\u2019s side chain, and in the\nopposite direction.\nThe primary edge feature is based on the distance between two residues, calculated as the average\ndistance between their atoms. The feature is a Radial Basis Function (RBF) of this distance with\na standard deviation of 18\u00c5 (chosen on the validation set). To incorporate information regarding\nthe relative orientation of two residues, we calculate the angle between the normal vectors of the\namide plane of each residue. Note that DCNNs use residue distances to inform the diffusion process.\nFor this we used an RBF kernel over the distance, with a standard deviation optimized as part of\nthe model selection procedure. All node and edge features were normalized to be between 0 and 1,\nexcept the residue conservation features, which were standardized.\n\nTraining, validation, and testing. 
The validation set was used to perform an extensive search over the space of possible feature representations and model hyperparameters, to select the edge distance feature RBF kernel standard deviation (2 to 32), negative to positive example ratio (1:1 to 20:1), number of convolutional layers (1 to 6), number of filters (8 to 2000), neighborhood size (2 to 26), pairwise residue representation (elementwise sum/product vs concatenation), number of dense layers after merging (0 to 4), optimization algorithm (stochastic gradient descent, RMSProp, ADAM, Momentum), learning rate (0.01 to 1), dropout probability (0.3 to 0.8), minibatch size (64 or 128 examples), and number of epochs (50 to 1000). This search was conducted manually and not all combinations were tested. Automatic model selection as in Bergstra et al. [7] failed to outperform the best manual search results.\n\nFor testing, all classifiers were trained for 80 epochs in minibatches of 128. Weight matrices were initialized as in He et al. [11] and biases initialized to zero. Rectified Linear Units were employed on all but the classification layer. During training we applied dropout with probability 0.5 to both dense and convolutional layers (except for DCNN, where performance was better when trained without dropout). Negative examples were randomly sampled to achieve a 10:1 ratio with positive examples, and the weighted cross entropy loss function was used to account for the class imbalance. Training was performed using stochastic gradient descent with a learning rate of 0.1. Test results were computed by training the model on the training and validation sets using the model hyperparameters that yielded the best validation performance. The convolution neighborhood (i.e. receptive field) is defined as a fixed-size set of residues that are closest in space to a residue of interest, and a neighborhood size of 21 yielded the best performance in our validation experiments. 
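The weighted cross-entropy loss mentioned above can be sketched as follows. This is a minimal stand-alone version; the positive-class weight of 10 matching the 10:1 imbalance is an illustrative assumption, not a value reported here.

```python
import math

def weighted_cross_entropy(labels, probs, pos_weight=10.0):
    """Binary cross entropy with the positive class up-weighted, a sketch of
    a loss that counters a heavy negative-to-positive class imbalance.

    labels: 0/1 ground truth; probs: predicted interface probabilities.
    """
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)
```

Up-weighting the rare positive class makes a missed interface residue pair cost roughly as much as many correctly rejected negatives.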
We implemented our networks in TensorFlow [1] v1.0.1 to make use of rapid training on GPUs. Training times vary from roughly 17 to 102 minutes depending on convolution method and network depth, using a single NVIDIA GTX 980 or GTX TITAN X GPU.\n\nMethod | 1 layer | 2 layers | 3 layers | 4 layers\nNo Convolution | 0.812 (0.007) | 0.810 (0.006) | 0.808 (0.006) | 0.796 (0.006)\nDiffusion (DCNN) (2 hops) [5] | 0.790 (0.014) | \u2013 | \u2013 | \u2013\nDiffusion (DCNN) (5 hops) [5] | 0.828 (0.018) | \u2013 | \u2013 | \u2013\nSingle Weight Matrix (MFN [9]) | 0.865 (0.007) | 0.871 (0.013) | 0.873 (0.017) | 0.869 (0.017)\nNode Average (Equation 1) | 0.864 (0.007) | 0.882 (0.007) | 0.891 (0.005) | 0.889 (0.005)\nNode and Edge Average (Equation 2) | 0.876 (0.005) | 0.898 (0.005) | 0.895 (0.006) | 0.889 (0.007)\nDTNN [21] | 0.867 (0.007) | 0.880 (0.007) | 0.882 (0.008) | 0.873 (0.012)\nOrder Dependent (Equation 3) | 0.854 (0.004) | 0.873 (0.005) | 0.891 (0.004) | 0.889 (0.008)\n\nTable 2: Median area under the receiver operating characteristic curve (AUC) across all complexes in the test set for various graph convolutional methods, by number of convolutional layers. Results shown are the average and standard deviation over ten runs with different random seeds. Networks have the following numbers of filters for 1, 2, 3, and 4 layers before merging, respectively: (256), (256, 512), (256, 256, 512), (256, 256, 512, 512). The exception is the DTNN method, which by necessity produces an output that has the same dimensionality as its input. Unlike the other methods, diffusion convolution performed best with an RBF with a standard deviation of 2\u00c5. After merging, all networks have a dense layer with 512 hidden units followed by a binary classification layer. Bold-faced values indicate best performance for each method.\n\nTo determine the best form of graph convolution for protein interface prediction, we implemented the spatial graph convolution operators described in the Related Work section. 
The MFN method required modification to work well in our problem, namely averaging over neighbors rather than summing. For each graph convolution method, we searched over the hyperparameters listed above using the same manual search method; for the DCNN this also included the number of hops. Diffusion convolution is a single-layer method as presented in the original publication, and indeed, stacking multiple diffusion convolutional layers yielded poor results, so testing was conducted using only one layer for that method.\n\nTo demonstrate the effectiveness of graph convolution we examine the effect of incorporating neighbor information by implementing a method that performs no convolution (referred to as No Convolution), equivalent to Equation (1) with no summation over neighbors. The PAIRpred SVM method [2] was trained by performing five-fold cross-validation on the training and validation data to select the best kernel and soft-margin parameters before evaluating on the test set.\n\n3.1 Results\n\nResults comparing the accuracy of the various graph convolution methods are shown in Table 2. Our first observation is that the proposed graph convolution methods, with AUCs around 0.89, outperform the No Convolution method, which had an AUC of 0.81, showing that the incorporation of information from a residue\u2019s neighbors improves the accuracy of interface prediction. This matches the biological intuition that the region around a residue should impact its binding affinity. 
We also observe that the proposed order-independent methods, with and without edge features (Equations (1) and (2)), and the order-dependent method (Equation (3)) performed at a similar level, although the order-independent methods do so with fewer layers and far fewer model parameters than the order-dependent method. These methods exhibit improvement over the state-of-the-art PAIRpred method, which yielded an AUC of 0.863.\n\nThe MFN method, which is a simpler version of the order-independent method given in Equation (1), performed slightly worse. This method uses the same weight matrix for the center node and its neighbors, and thereby does not differentiate between them. Its lower performance suggests this is an important distinction in our problem, where prediction is performed at the node level. This convolution operator was proposed in the context of a classification problem at the graph level. The DTNN approach is only slightly below the top-performing methods. We have observed that the other convolutional methods perform better when the number of filters is increased gradually in subsequent network layers, a feature not afforded by this method.\n\nAmong the convolutional methods, the diffusion convolution method (DCNN) performed the worst, and was similar in performance to the No Convolution method. The other convolution methods performed best when employing multiple convolutional layers, suggesting that the networks are indeed learning a hierarchical representation of the data. However, networks with more than four layers performed worse, which could be attributed to the relatively limited amount of labeled protein interface data. Finally, we note that the extreme class imbalance in the test set produces a very poor area under the precision-recall curve, with no method achieving a value above 0.017.\n\nFigure 3: PyMOL [20] visualizations of the best performing test complex (PDB ID 3HI6). Upper left: Ligand (red) and receptor (blue), along with the true interface (yellow). Upper right: Visualization of predicted scores, where brighter colors (cyan and orange) represent higher scores. Since scores are for pairs of residues, we take the max score over all partners in the partner protein. Bottom row: Activations of two filters in the second convolutional layer, where brighter colors indicate greater activation and black indicates an activation of zero. Lower left: A filter which produces high activations for buried residues, a useful screening criterion for interface detection. Lower right: A filter which gives high activations for residues near the interface of this complex.\n\nTo better understand the behavior of the best performing convolutional method we visualize the best performing test complex, PDB ID 3HI6 (see Figure 3). The figure shows that the highest predictions are in agreement with the true interface. We also visualize two convolutional filters to demonstrate their ability to learn aspects of the complex that are useful for interface prediction.\n\n4 Conclusions and Future Work\n\nWe have examined the performance of several spatial graph convolutional methods on the problem of predicting interfaces between proteins on the basis of their 3D structure. Neighborhood-based convolution methods achieved state-of-the-art performance, outperforming diffusion-based convolution and the previous state-of-the-art SVM-based method. Among the neighborhood-based methods, order-independent methods performed similarly to an order-dependent method, and we identified elements that are important for the performance of the order-independent methods.\n\nOur experiments did not demonstrate a big difference with the inclusion of edge features. 
There were very few of those, and unlike the node features, they were static: our networks learned latent representations only for the node features. These methods can be extended to learn both node and edge representations, and the underlying convolution operator admits a simple deconvolution operator, which lends itself to use with auto-encoders.
CNNs typically require large datasets to learn effective representations. This may have limited the level of accuracy that we could attain using our purely supervised approach and the relatively small number of labeled training examples. Unsupervised pre-training would allow us to use the entire Protein Data Bank, which contains close to 130,000 structures (see http://www.rcsb.org/).
The features learned by deep convolutional architectures for image classification have proved highly useful in classification tasks different from the ones they were originally trained on (see e.g. [22]). Similarly, we expect the convolution operators we propose and the resulting features to be useful in many other applications, since structure information is useful for predicting a variety of properties of proteins, including their function, their catalytic and other functional residues, their protein-protein interactions, and their interactions with DNA and RNA.
In designing our methodology we considered the question of the appropriate level at which to describe protein structure. In classifying image data, CNNs are usually applied to the raw pixel data [15]. The analogous level of description for protein structure would be the raw 3D atomic coordinates, which we thought would prove too difficult.
Using much larger training sets and unsupervised learning could allow the network to begin with features that are closer to the raw atomic coordinates and learn a more detailed representation of the geometry of proteins.

Supplementary Materials

Python code is available at https://github.com/fouticus/pipgcn, data can be downloaded from https://zenodo.org/record/1127774, and the accompanying poster can be found at https://zenodo.org/record/1134154.

Acknowledgements

This work was supported by the National Science Foundation under grant no. DBI-1564840.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

[2] Fayyaz ul Amir Afsar Minhas, Brian J. Geiss, and Asa Ben-Hur. PAIRpred: Partner-specific prediction of interacting residues from sequence and structure. Proteins: Structure, Function, and Bioinformatics, 82(7):1142–1155, 2014.

[3] Shandar Ahmad and Kenji Mizuguchi. Partner-aware prediction of interacting residues in protein-protein complexes from sequence data. PLoS One, 6(12):e29104, 2011.

[4] Christof Angermueller, Tanel Pärnamaa, Leopold Parts, and Oliver Stegle. Deep learning for computational biology. Molecular Systems Biology, 12(7):878, 2016.

[5] James Atwood and Don Towsley. Diffusion-convolutional neural networks.
In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.

[6] Tristan T. Aumentado-Armstrong, Bogdan Istrate, and Robert A. Murgita. Algorithmic approaches to protein-protein interaction site prediction. Algorithms for Molecular Biology, 10(1):1–21, 2015.

[7] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2546–2554. Curran Associates, Inc., 2011.

[8] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 2017.

[9] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[10] R. Esmaielbeiki, K. Krawczyk, B. Knapp, J.-C. Nebel, and C. M. Deane. Progress and challenges in predicting protein interfaces. Briefings in Bioinformatics, (January):1–15, 2015.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.

[12] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[13] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.
In ICLR, 2017.

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[16] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning. ACM, 2016.

[17] Lee Sael and Daisuke Kihara. Protein surface representation and comparison: New approaches in structural proteomics. Biological Data Mining, pages 89–109, 2009.

[18] Lee Sael, Bin Li, David La, Yi Fang, Karthik Ramani, Raif Rustamov, and Daisuke Kihara. Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins, 72(4):1259–1273, 2008.

[19] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.

[20] Schrödinger, LLC. The PyMOL molecular graphics system, version 1.8. November 2015.

[21] Kristof T. Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R. Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, 2017.

[22] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.

[23] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[24] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[25] Thom Vreven, Iain H. Moal, Anna Vangone, Brian G. Pierce, Panagiotis L. Kastritis, Mieczyslaw Torchala, Raphael Chaleil, Brian Jiménez-García, Paul A. Bates, Juan Fernandez-Recio, et al. Updates to the integrated protein–protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. Journal of Molecular Biology, 427(19):3031–3041, 2015.