{"title": "Neural Edit Operations for Biological Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 4960, "page_last": 4970, "abstract": "The evolution of biological sequences, such as proteins or DNAs, is driven by the three basic edit operations: substitution, insertion, and deletion. Motivated by the recent progress of neural network models for biological tasks, we implement two neural network architectures that can treat such edit operations. The first proposal is the edit invariant neural networks, based on differentiable Needleman-Wunsch algorithms. The second is the use of deep CNNs with concatenations. Our analysis shows that CNNs can recognize star-free regular expressions, and that deeper CNNs can recognize more complex regular expressions including the insertion/deletion of characters. The experimental results for the protein secondary structure prediction task suggest the importance of insertion/deletion. The test accuracy on the widely-used CB513 dataset is 71.5%, which is 1.2-points better than the current best result on non-ensemble models.", "full_text": "Neural Edit Operations for Biological Sequences\n\nSatoshi Koide\n\nToyota Central R&D Labs.\n\nkoide@mosk.tytlabs.co.jp\n\nKeisuke Kawano\n\nToyota Central R&D Labs.\n\nkawano@mosk.tytlabs.co.jp\n\nTakuro Kutsuna\n\nToyota Central R&D Labs.\n\nkutsuna@mosk.tytlabs.co.jp\n\nAbstract\n\nThe evolution of biological sequences, such as proteins or DNAs, is driven by the\nthree basic edit operations: substitution, insertion, and deletion. Motivated by the\nrecent progress of neural network models for biological tasks, we implement two\nneural network architectures that can treat such edit operations. The \ufb01rst proposal\nis the edit invariant neural networks, based on differentiable Needleman-Wunsch\nalgorithms. The second is the use of deep CNNs with concatenations. 
Our analysis\nshows that CNNs can recognize regular expressions without Kleene star, and\nthat deeper CNNs can recognize more complex regular expressions including the\ninsertion/deletion of characters. The experimental results for the protein secondary\nstructure prediction task suggest the importance of insertion/deletion. The test\naccuracy on the widely-used CB513 dataset is 71.5%, which is 1.2-points better\nthan the current best result on non-ensemble models.\n\n1\n\nIntroduction\n\nNeural networks are now used in many applications, not limited to classical \ufb01elds such as image\nprocessing, speech recognition, and natural language processing. Bioinformatics is becoming an\nimportant application \ufb01eld of neural networks. These biological applications are often implemented\nas a supervised learning model that takes a biological string (such as DNA or protein) as an input,\nand outputs the corresponding label(s), such as a protein secondary structure [13, 14, 15, 18, 19, 23,\n24, 26], protein contact maps [4, 8], and genome accessibility [12].\nInvariance, which forces a prediction model to satisfy a desirable property for a speci\ufb01c task, is\nimportant in neural networks. For example, CNNs with pooling layers capture the shift invariant\nproperty that is considered to be an important property for image recognition tasks. CNNs were \ufb01rst\nproposed to imitate the organization of the visual cortex [6]. This is often used to explain why CNNs\nwork for visual tasks. Similarly, rotation invariance for image tasks is also studied [25]. Generally, it\nis important to model the proper invariances for a given application domain.\nWhat is, then, the invariance in biological tasks? As is well known in bioinformatics, similar\nsequences tend to exhibit similar functions or structures (i.e., similar labels in terms of machine\nlearning). Here, the similarity is evaluated by sequence alignment, which is closely related to the edit\ndistance. 
This implies that labels associated with the biological sequences exhibit (weak) invariance\nwith respect to a small number of edit operations, i.e., substitution, insertion, and deletion. This paper\naims to incorporate such invariances, which we call edit invariance, into neural networks.\n\nContribution. We consider two neural network architectures that incorporate the edit operations.\nFirst, we propose the edit invariant neural networks. This is obtained by interpreting the classical\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fNeedleman-Wunsch algorithm [17] as a differentiable neural network. Next, we show that deep CNNs\nwith concatenations can treat regular expressions without Kleene stars, indicating that such CNNs\ncan capture edit operations including insertion/deletion. Our experiments demonstrate the validity of\nour approach. The test accuracy of protein secondary structure prediction on the widely-used CB513\ndataset (e.g., [26]) results in 71.5% accuracy, the state-of-the-art performance compared to those of\nprevious studies on non-ensemble models.\n\n2 Edit Invariant Neural Networks (EINN)\n\nDifferentiable Sequence Alignment.\nIn bioinfor-\nmatics, sequence alignment is key in comparing two bi-\nological strings (e.g., DNA, proteins). The Needleman-\nWunsch (NW) algorithm [17], a fundamental sequence\nalignment algorithm, calculates the similarity score be-\ntween two strings on an alphabet \u03a3. As shown in Fig.\n1, the similarity score is computed via dynamic pro-\ngramming to maximize the total score (illustrated as\nthe double-lined square) by inserting or deleting char-\nacters (depicted as the vertical and horizontal arrows,\nrespectively).\nAlthough the original NW algorithm is a function that\nuses strings as its arguments, we can naturally extend it\nto a differentiable function using embedding. 
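Before the differentiable relaxation, it may help to make the underlying discrete quantity concrete: the edit distance between two strings is computed by the classical Wagner-Fischer dynamic program. The sketch below is the standard algorithm (not code from the paper):

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    turning s into t (Wagner-Fischer dynamic programming)."""
    m, n = len(s), len(t)
    # D[i][j] = edit distance between s[:i] and t[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # substitution cost
            D[i][j] = min(D[i - 1][j - 1] + cost,     # match / substitute
                          D[i - 1][j] + 1,            # deletion
                          D[i][j - 1] + 1)            # insertion
    return D[m][n]
```

Sequences one edit operation apart have distance 1, which is exactly the "small number of edit operations" under which labels are expected to remain stable.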
Algorithm 1 shows the proposed dynamic programming procedure to calculate the NW score sNW(x1:m, y1:n; g). Here, the scalar parameter g is the gap cost that represents the cost to insert or delete a character. The differences from the original NW algorithm are three-fold.

1. The input sequences x1:m := [x1, · · · , xm] and y1:n := [y1, · · · , yn] are each d-dimensional time series (i.e., xi and yj are vectors in Rd) of length m and n, respectively.

2. Following the modification above, the score function is defined as the inner product instead of a predefined lookup table (Line 7).

3. The softmax function maxγ(x) = γ log(Σi exp(xi/γ)) is used instead of the hard max function (Line 10).

Figure 1: The NW alignment. The red line is a path that maximizes the score. The numbers in the cells correspond to Fi,j in Algorithm 1.

The dynamic programming in Algorithm 1 can be regarded as a computational graph, allowing us to differentiate the NW similarity score sNW(x1:m, y1:n; g) with respect to x1:m, y1:n, and g. In principle, it is possible to apply automatic differentiation to obtain the gradient; however, automatic differentiation might be computationally expensive because there are exponentially many backward paths in the computational graph. To avoid this problem, we modify the backward computation by employing some algebraic substitutions. Consequently, we can compute the derivatives efficiently using dynamic programming, as shown in Algorithm 2 and Algorithm 3. See Appendix A in the supplementary material for the derivation of these algorithms. 
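The forward recursion can be sketched in a few lines of NumPy. This is a sketch of the procedure as described (helper names `s_nw` and `softmax_gamma` are mine; the soft maximum is computed with a numerically stable log-sum-exp):

```python
import numpy as np

def softmax_gamma(vals, gamma):
    """max_gamma(v) = gamma * log(sum_i exp(v_i / gamma)), computed stably."""
    v = np.asarray(vals, dtype=float) / gamma
    vmax = v.max()
    return gamma * (vmax + np.log(np.exp(v - vmax).sum()))

def s_nw(x, y, g, gamma=0.01):
    """Differentiable Needleman-Wunsch score s_NW(x_{1:m}, y_{1:n}; g).
    x: (m, d), y: (n, d); g is the gap cost, gamma the softmax temperature."""
    m, n = len(x), len(y)
    F = np.zeros((m + 1, n + 1))
    F[1:, 0] = -g * np.arange(1, m + 1)   # leading deletions
    F[0, 1:] = -g * np.arange(1, n + 1)   # leading insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a = F[i - 1, j - 1] + x[i - 1] @ y[j - 1]  # (mis)match score
            b = F[i - 1, j] - g                        # gap in y
            c = F[i, j - 1] - g                        # gap in x
            F[i, j] = softmax_gamma([a, b, c], gamma)
    return F[m, n]
```

With one-hot inputs and a small γ the recursion approximates the hard NW score: for x = [e0, e1, e2] and y = [e0, e3, e1, e2] (one extra vector), the score is roughly 3 − g, i.e., inserting one character changes the score only by about the gap cost.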
With the matrix Q computed in Algorithm 2, we can calculate the derivative of the NW score with respect to xi and yj as follows:

∂sNW/∂xi = Σj=1..n Qi,j exp(Hi,j/γ) · yj,    ∂sNW/∂yj = Σi=1..m Qi,j exp(Hi,j/γ) · xi,    (1)

where Hi,j := Fi−1,j−1 + xi·yj − Fi,j.

Algorithm 1: Differentiable Needleman-Wunsch (forward): sNW(x1:m, y1:n; g)
 1  F ← 0;    // (m+2)x(n+2) zero matrix
 2  for i = 0···m do
 3      Fi,0 ← −ig
 4  for j = 1···n do
 5      F0,j ← −jg;
 6      for i = 1···m do
 7          a ← Fi−1,j−1 + xi·yj;
 8          b ← Fi−1,j − g;
 9          c ← Fi,j−1 − g;
10          Fi,j ← maxγ(a, b, c)
11  return Fm,n as sNW(x1:m, y1:n; g)

These derivatives are derived similarly to SoftDTW [3], a differentiable distance function corresponding to the dynamic time warping (hence the gapcost is not involved).

Algorithm 2: Calculation of Q (backward). We denote φγ(a, b) := exp((a − b)/γ).
 1  Q ← 0;    // (m+2) x (n+2) zero matrix
 2  for i = 1···m do
 3      Fi,n+1 ← ∞
 4  Fm+1,n+1 ← Fm,n; Qm+1,n+1 ← 1;
 5  for j = n···1 do
 6      Fm+1,j ← ∞;
 7      for i = m···1 do
 8          a ← φγ(Fi,j + xi·yj, Fi+1,j+1);
 9          b ← φγ(Fi,j − g, Fi+1,j);
10          c ← φγ(Fi,j − g, Fi,j+1);
11          Qi,j ← aQi+1,j+1 + bQi+1,j + cQi,j+1
12  return Q

Algorithm 3: Calculation of P. We denote φγ(a, b) := exp((a − b)/γ).
 1  P ← 0;    // (m+2) x (n+2) zero matrix
 2  for i = 0···m do
 3      Pi,0 ← −i
 4  for j = 1···n do
 5      P0,j ← −j;
 6      for i = 1···m do
 7          a ← φγ(Fi−1,j−1 + xi·yj, Fi,j);
 8          b ← φγ(Fi−1,j − g, Fi,j);
 9          c ← φγ(Fi,j−1 − g, Fi,j);
10          Pi,j ← aPi−1,j−1 + b(Pi−1,j − 1) + c(Pi,j−1 − 1)
11  return P

For the derivative with respect to the gapcost g, we can derive it similarly using the matrix P in Algorithm 3: ∂sNW/∂g = Pm,n. As in the original NW algorithm, the proposed method can consider insertions/deletions. It is well known that the NW score is closely related to the edit distance. Given sequences x1:m and y1:n, let us consider a modified sequence x′1:(m−1) where one feature vector xt is deleted from x1:m. In such a case, the calculated scores sNW(x, y) and sNW(x′, y) show a similar value. We call this property the edit invariance, which is expected to be important for tasks involving biological sequences.

Convolutional EINN. Here, we extend the traditional CNNs by the NW score introduced above. Let us consider an embedded sequence X ∈ Rd×L of length L, and a convolutional filter w ∈ Rd×K of kernel size K. Let x ∈ Rd×K be a frame of length K at a certain position in the embedded sequence X. In CNNs, the similarity is computed by the (Frobenius) inner product, i.e., w·x. Our idea is to replace this inner-product-based similarity with the above proposed sNW(x, w; g). Taking a limit as g → ∞ (i.e., insertion and deletion are prohibited), sNW is associated with a convolution as follows.

Proposition 1. For any x ∈ Rd×K and any w ∈ Rd×K, we have w·x = limg→∞ sNW(x, w; g).

Proof. 
If g \u2192 \u221e, we have Fi,j = Fi\u22121,j\u22121 + wi\u00b7 xj in Line 10 of Algorithm 1. This leads to\n\nlimg\u2192\u221e sNW(x, w; g) = FK,K =(cid:80)K\n\ni=1 wi\u00b7xi = w\u00b7x.\n\nTherefore, the replacement of w\u00b7x in CNNs with sNW(x, w; g) can be regarded as a generalization of\nCNNs. As mentioned above, the NW score is related to the edit distance, while the inner-product\nw\u00b7x corresponds to the Hamming distance, a special case of edit distance when insertion/deletion\nare prohibited (i.e., only substitutions are allowed). We also emphasize that this EINN-based\nconvolutional architecture allows for the use of GPUs for batch, \ufb01lter, and CNN-frame dimensions,\nalthough we cannot parallelize the innermost double loop of the dynamic programming.\n\nSummary.\nIn this section, we discussed a differentiable sequence alignment, EINN, to render the\nneural networks edit invariant. Subsequently, we proposed to replace the inner products in CNNs with\nEINNs. The proposed method is a generalization of CNNs, because the NW score sNW converges\nto an inner product as g \u2192 \u221e (Proposition 1). We employed a plain NW alignment in place of the\ninner product; however, there are many other alignment strategies, such as alignments with af\ufb01ne gap\ncosts, Smith-Waterman (SW) alignment [22], and BLAST [1]. 
It is easy to replace the inner product in CNNs with an affine gap cost alignment or the SW alignment because these alignment methods are described as computational graphs with basic operations, such as ‘+’ and ‘max.’ In contrast, creating a differentiable BLAST is highly challenging owing to its heuristic operations.

3 Deep CNN as a Regular Expression Recognizer

In bioinformatics, meaningful string patterns are called motifs, which resemble regular expressions. For example, the N-glycosylation site motif is represented as N[^P][ST][^P], where N, P, S, and T are amino acids, [^P] means any amino acid except for P, and [ST] means an amino acid S or T. This motif represents the following pattern: N, followed by any amino acid but P, followed by S or T, followed by anything but P. Another example is the C2H2-type zinc finger domain represented as C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, where x(i) and x(j,k) mean any sequence of length i and any sequence of length between j and k, respectively (we inserted “-” for readability).
We show the relationship between CNNs and regular expressions. First, we introduce regular expressions without Kleene star, which are a subset of the standard regular expressions.
Definition 1. The regular expression without Kleene star is a set of strings on an alphabet Σ defined recursively as follows. First, the following are the regular expressions without Kleene star: 1) Empty set ∅; 2) Empty string ε; 3) A single character ∀a ∈ Σ. Next, let R and S be regular expressions without Kleene star. Then, the following sets of strings are also regular expressions without Kleene star. 4) Concatenation of strings in R and S, denoted by RS. 5) A union of the sets R and S, denoted by R|S (called alternation). 
Moreover, given a string q, we say q matches R if q is included in R. □
In short, this is equivalent to the standard regular expressions without the Kleene star operator R*, which accepts (potentially) infinite repeats of strings in R. Following this definition, we can easily confirm that the sets of strings represented by the two motifs mentioned above are regular expressions without Kleene star. Hereafter, we use the Unix-like notations of regular expressions (see also the regular expression cheat sheet in the supplementary material (Appendix D)). For example, a regular expression /a.b/ describes strings such as “a, followed by any character, followed by b.” Further, /a[bc]a/ means strings such as “a, followed by b or c, followed by a,” and /(abc|ac)/ implies “abc or ac.” It is noteworthy that the last regular expression /(abc|ac)/ is equal to /ab?c/, where ‘?’ means zero or one occurrence of the preceding token. Because the Kleene star “*” is not considered, regular expressions such as “/ab*/,” describing “a followed by any number of b,” are excluded.

Simple regular expressions with CNNs. Here, we reveal the relationship between regular expressions and CNNs. Let us start from a simple example to verify whether a given input string x of length L on an alphabet Σ = {a, b, c} matches a regular expression /abc/ for each position. We assume a one-hot representation for x, where each dimension corresponds to a character in Σ.
We compose a one-dimensional (1d)-convolutional layer whose filter matrix w1 and bias b1 are given by the one-hot representation w1 = (ea, eb, ec) and b1 = −2, respectively, where ea is the one-hot vector of character “a.” This filter matrix, w1, is shown in Fig. 2 (a). 
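This one-filter matcher can be checked numerically. The sketch below uses my own helper names and a plain sliding window (no padding) to emulate the 1d convolution:

```python
import numpy as np

e = {c: np.eye(3)[i] for i, c in enumerate("abc")}   # e["a"] is the one-hot e_a

w1 = np.stack([e["a"], e["b"], e["c"]])              # filter matrix for /abc/
b1 = -2.0

x = np.array([e[c] for c in "acbcbabc"])             # one-hot input string

# ReLU(w1 . window + b1) at every window of width 3: 1 marks a match of /abc/
out = np.maximum([np.sum(w1 * x[i:i + 3]) + b1 for i in range(len(x) - 2)], 0.0)
```

For "acbcbabc" the only window equal to "abc" is the last one, so the output is 1 there and 0 elsewhere.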
Using this \ufb01lter, the\noutput of the layer at position i is 1 if x(i\u22121):(i+1) matches the regular expression /abc/, or smaller\nthan 1 otherwise (see Fig. 2 (b)). Therefore, using ReLU (i.e., relu(w1\u00b7 x(i\u22121):(i+1) \u2212 b1)), we\nobtain 1 for matching and 0 for non-matching. This shows that we can emulate the exact pattern\nmatching using a single 1d-convolutional layer. To simplify the discussion, we denote the convolution\nby a tuple (w1, b1). Similarly, the recognizer for a regular expression /ac/ can be emulated by a\n1d-convolutional layer of kernel size k = 3 consisting of w2 = (ea, ec, 0) and b2 = \u22121 (Fig. 2).\nNext, let us use a regular expression /ab?c/=/(abc|ac)/ as an example, which represents the\npattern \u2018abc\u2019, but accepts the deletion of the middle \u2018b\u2019. This can be recognized by the following\nmulti-layer network. First, we apply the two convolutions above, (w1, b1) and (w2, b2). Then, using\nthe outputs of these two \ufb01lters as an input, the next convolutional layer of kernel size 1 with parameter\nw3 = (eabc + eac) = [1, 1]T and b3 = 0 is applied (see the lowest matrix in Fig. 2 (b)).\n\nRelation between CNNs and regular expressions without Kleene star.\nIn principle, given a\nregular expression without Kleene star R, we can construct a two-layered convolutional network that\naccepts R similarly. Let k be the maximum length of strings in R. It is noteworthy that k is \ufb01nite\nbecause R does not involve the Kleene stars. For the same reason, R is a \ufb01nite set. For each string r\nin R, we construct a convolutional layer with kernel size k accepting r. Subsequently, the outputs of\nthese layers are input to the next convolutional layer of kernel size 1, which realizes the OR operation\nsimilarly to the \ufb01lter (w3, b3) above. This discussion leads to the following general proposition.\nProposition 2 (CNN as a regular expression recognizer). 
Given a regular expression without Kleene\nstar R, there exists a CNN that can verify whether a given string x matches R for each position of x.\n\nAlthough this proposition demonstrates the potential ability of CNNs, the construction is inef\ufb01cient\nespecially when |R| is large. For example, let us consider a regular expression /ba./, which consists\n\n4\n\n\fFigure 2: 1d-convolutional architecture ac-\ncepting a regular expression /(abc|ac)/.\n(a) Weights of convolutions. (b) Applying\nto an example string acbcbabc. Note that\nblanks mean zero.\n\nFigure 3: 1d-convolutional architecture accepting\na regular expression /a[bc]a[ac]ba./.\n(a)\nWeights of convolutions. (b) Applying to an exam-\nple string acabaabc. The 1\u2019s that do not match w6\nare grayed-out.\n\nof |\u03a3| strings. In this case, the construction above requires |\u03a3| convolutional \ufb01lters that might not be\nacceptable. In fact, it can be represented by only one \ufb01lter, consisting of w4 = (eb, ea, ea + eb + ec)\nand b4 = \u22122 (Fig. 3 (a)). Furthermore, /a[bc]a/ corresponds to w5 = (ea, eb +ec, ea) and b5 = \u22122\n(Fig. 3 (a)). These examples show a possibility that a large regular expression R could be compressed\ninto a small CNN. We extend this discussion in the following.\n\nGoing Deeper for Complex Regular Expressions. According to Proposition 2, shallow yet wide\nneural networks can recognize an arbitrary regular expression without Kleene star R. Here, we\ndiscuss how the depth of the neural network relates to regular expressions. Further, we investigate the\nmeaning of DenseNet [11] -like concatenation of the outputs from various layers.\nBrie\ufb02y, the depth and concatenation are important to obtain the distributed representations of string\npatterns, similar to that in image processing. 
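The constructions above, i.e., one filter per string followed by an OR layer, and a single filter encoding a character class, can both be checked numerically. A sketch with my own helper names (`conv_relu`, `encode`), assuming unit stride and no padding:

```python
import numpy as np

e = {c: np.eye(3)[i] for i, c in enumerate("abc")}  # one-hot characters

def conv_relu(x, w, b):
    """Slide filter w (k, 3) over one-hot string x (L, 3): ReLU(w . window + b)."""
    k = len(w)
    return np.maximum([np.sum(w * x[i:i + k]) + b
                       for i in range(len(x) - k + 1)], 0.0)

def encode(s):
    return np.array([e[c] for c in s])

# Union construction for /(abc|ac)/: one filter per string, then an OR layer
# (a kernel-size-1 convolution with weights [1, 1] and bias 0).
w1 = np.stack([e["a"], e["b"], e["c"]])        # accepts "abc"
w2 = np.stack([e["a"], e["c"], np.zeros(3)])   # accepts "ac" (zero-padded to k = 3)
x = encode("acbcbabc")
o_union = np.maximum(conv_relu(x, w1, -2.0) + conv_relu(x, w2, -1.0), 0.0)

# Character-class compression: /ba./ in a single filter instead of |Sigma| filters.
w4 = np.stack([e["b"], e["a"], e["a"] + e["b"] + e["c"]])
o_ba = conv_relu(encode("acabaabc"), w4, -2.0)
```

On "acbcbabc" the union output fires at the first window ("ac") and the last ("abc"); on "acabaabc" the single /ba./ filter fires exactly at the "baa" window.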
Combining deep convolutions with concatenation, we\ncan construct a recognizer for highly complicated regular expressions from small building blocks of\nsimple regular expressions.\nWe explain based on an example using Fig. 3. In addition to the atomic regular expressions /a/,\n/b/, and /c/, let us consider two regular expressions, /a[bc]a/, and /ba./, as discussed above.\nFurthermore, we consider a regular expression /a[bc]a[ac]ba./, which is more complicated, yet\nis a combination of the simple regular expressions above. This regular expression can be divided\ninto three parts: 1) /a[bc]a/ 2) /ba./ and 3) /[ac]/. The \ufb01rst two regular expressions can be\nrecognized by w4 and w5 discussed above. To recognize the last one, /[ac]/, we employ the\nconcatenation of two matrices, as shown in Fig. 3 (b). This allows us to combine with the atomic\nregular expressions. Concretely, using the convolutional \ufb01lter w6 shown in Fig. 3 (a) with b = \u22122,\nwe can recognize /a[bc]a[ac]ba./ as shown in the lowest matrix in Fig. 3 (b). Similarly, if the\nshaded cell of w6 in Fig. 3 (a) is set to 1, we can represent another regular expression with a deletion\n(i.e., /a[bc]a[ac]?ba./).\n\nSummary.\nIn this section, we \ufb01rst showed that even shallow CNNs can treat regular expressions\nwithout Kleene star in principle (Proposition 2). The number of \ufb01lters can be, however, much larger\nthan that in EINNs, which explicitly model edit operations. We provided an upper-bound of the\n\ufb01lter size, |R|, although this bound is not very tight. This indicates that shallow CNNs cannot treat\ncomplex regular expressions ef\ufb01ciently. Then, we explained that the depth and concatenation can\nmitigate this issue. Concatenations allow us to reuse and combine simple regular expressions (like\n/a/, /b/, ...). 
In the next sections, we discuss and investigate what happens if concatenations are not used by comparing with the ResNet architecture, which does not involve concatenations. We also discuss the advantages and disadvantages of EINNs and CNNs.

4 Discussion

CNNs and EINNs. In the previous sections, we have shown how to treat insertion and deletion of characters in a string, which is expected to be important for biological tasks. The EINNs treat insertions and deletions explicitly, while the deep CNNs with concatenation treat them implicitly. Here, we discuss their advantages and disadvantages. Although the EINNs model the well-established biological process directly through the NW algorithm, the computational cost is high due to the dynamic programming. Meanwhile, the computational cost of the deep CNNs is significantly lower. However, if the target regular expression involves many insertions and deletions, the number of convolutional filters required to represent it will increase rapidly. This is because the CNNs may have to treat such gapped patterns with separate convolutional filters, as shown in Fig. 2. One might wonder if we can mitigate this problem with pooling layers; however, we could not obtain improvements in the accuracy in a protein secondary structure prediction problem.
The CNN analysis in Section 3 was restricted to one-hot representations (called binarized CNNs herein). However, the filter weights and inputs of the normal CNNs are real numbers. 
We believe\nthat this binarized analysis is still meaningful because such binarized CNNs are included in normal\nCNNs, indicating that the normal CNNs can learn more \ufb02exible patterns than the binarized CNNs.\n\nResNet-like architecture. Finally, we discuss ResNet[10]-like architectures (i.e., using additive\nskip connections instead of dense connections). We argue that ResNet-like architectures are dif\ufb01cult\nto interpret. In fact, adding two matrices in the top and middle of Fig. 2 (b) generates matching\nresults for /(a|abc)/ and /(b|ac)/ (here, we ignored the third row of the top matrix). This implies\nthat additive skip connections do not allow us to combine simpler regular expressions freely. In our\nexperiments, ResNet-like architectures do not demonstrate a better performance than DenseNet-like\narchitectures for the protein secondary structure prediction task (see Section 6).\n\n5 Related Work\n\nSequence alignment and dynamic programming. The NW algorithm [17] is a fundamental\nsequence alignment algorithm. It is a global alignment algorithm, which aligns along the entire\nsequence. The Smith-Waterman (SW) algorithm [22], another well-known algorithm, is a local\nalignment algorithm where the subsequences are aligned.\nDynamic programming is used frequently for similarity computation between two sequences. Dy-\nnamic time warping (DTW) is often used for tasks involving time series (e.g., speech recognition\n[21]). Unlike the NW algorithm, DTW does not allow us to insert gaps (i.e., DTW does not consider\nthe gap cost). Cuturi and Blondel [3] proposed a differentiable loss function called Soft-DTW. In\nspeech recognition, connectionist temporal classi\ufb01cation (CTC) is used as a loss function for two\nsequences [9]. The CTC explicitly models gaps differently from the NW algorithm. In bioinformatics,\nSaigo et al. 
[20] used a local alignment kernel, which is similar to SW alignment, to optimize amino\nacid substitution matrices by gradient descent. The gradient is computed similarly to EINNs, while\nembedding is not used. The difference between these methods and EINNs is that these methods are\nused as loss functions, whereas EINNs are used as similarity functions in convolutions to make neural\nnetworks edit invariant.\n\nNN as a language recognizer Thus far, the relationships between neural networks and the formal\nlanguage theory have been studied in terms of RNNs. Minsky [16] demonstrated that any \ufb01nite state\nmachines can be emulated by a discrete state RNN with McCulloch-Pitts neurons. Forcada and\nCarrasco [5] considered a continuous version of the RNN, called the neural state machine. Gers and\nSchmidhuber [7] showed experimentally that the LSTM can learn context free grammar, including\nregular grammar. Unlike these studies, we focus on CNNs and reveal the relationship to regular\nexpressions without Kleene star (Section 3).\n\nNN for biological sequences. Neural networks are used for several biological predictive tasks, such\nas protein secondary structure (shown below), protein contact maps [4, 8], and genome accessibility\n[12]. We herein focus on the protein secondary structure prediction problem, which is a sequence\nlabeling problem predicting a label for each sequence position. The existing approaches are classi\ufb01ed\ninto three groups: 1) RNN-based models [13, 15, 23], 2) Hybrid of probabilistic models with neural\n\n6\n\n\fnetworks [18, 24, 26], and 3) CNN-based models [2, 14]. Among them, Li and Yu [13] reported\nthe test accuracy of 69.4% for the CB513 dataset, a standard open dataset for this task, based on a\nbidirectional GRU model. Busia and Jaitly [2] reported the highest CB513 accuracy, 70.3% using a\nCNN-based model.1 Based on the discussion in Section 3, we employ a much deeper architecture in\nour experiment. 
Consequently, we achieved a much higher accuracy, 71.5% (Section 6).

6 Experiments

In this section, we present the experimental results using a real task for biological sequences. We focus on the protein secondary structure prediction problem, which is widely studied both in the machine learning and bioinformatics communities. Overall, we will demonstrate that for protein structure prediction, it is important to adopt network architectures that consider insertions/deletions, as we have discussed previously.

Dataset and implementation. We follow the previous studies for the secondary structure prediction. For the test, we used the widely-used CB513 dataset. For training, we used the filtered CB6133 dataset [13, 26], which has filtered out proteins in the original CB6133 dataset having 25% or higher similarity with some proteins in CB513. Consequently, the filtered CB6133 dataset includes 5534 proteins. We train models that predict the eight-class secondary structure labels assigned at each position of a given sequence (i.e., a sequence labeling task). The feature vector at each position given in these datasets is the one-hot representation of the amino acid (22-dim) and the position specific scoring matrix (PSSM, 22-dim). We employ zero-padding for convolutional operations to keep the sequence length constant. For implementations, we used PyTorch version 0.2. Unless otherwise noted, the default settings are used (e.g., weight initialization and hyperparameters for optimizers). The training was conducted on Nvidia Tesla GPUs.

Table 1: Test accuracy (CB513).

Method                Acc. (%)
Tiny-CNN              42.0
Tiny-EINN (g = 2.5)   43.0

Results for simplified models. First, we investigate the effect of EINNs using simplified models and datasets. Here, two types of models are used: Tiny-CNN and Tiny-EINN. Figure 6 (a) shows the Tiny-CNN, while the Tiny-EINN is obtained by replacing the Conv-5 layers with the EINN convolutional layer proposed in Section 2.
For training, we used the one-hot vector for input (i.e., PSSM is not used here) and 2% of the training data (sequences) sampled from the filtered CB6133 dataset. We used the Adam optimizer with a minibatch size of 128, an initial learning rate of 0.0002 (reduced by 1/10 at epoch 15), and weight decay (10^-5). We report the CB513 accuracy at epoch 30.
In Table 1, Tiny-EINN shows an accuracy that is 1.0-point better than Tiny-CNN. In this experiment, we used the fixed gapcost g = 2.5. Figure 4 shows how the accuracy changes with respect to g. We observe that, for g > 10, the accuracy is equal to that of the CNN, 42.0% (Proposition 1). Further, the maximum accuracy is achieved at g = 2.5, indicating the potential importance of insertion/deletion.
Next, we investigate what happens when different sizes of training data are used. Figure 5 shows the test accuracy for the CB513 dataset against the gapcost with the 1%, 2%, and 5% datasets. For the 5% dataset, the performance gain, defined by the difference between EINN (with g at the peak) and CNNs (with g → ∞), is 0.6%, which is smaller than that of the 2% dataset, i.e., 1.0%. For the 1% dataset, the performance gain is 1.4%, larger than that of the 2% dataset. To summarize, we obtained larger gains for smaller datasets. Thus, this result shows that the modeling power of EINNs is better than that of CNNs.

Results for deeper models. Next, we show results for fully-deep models with realistic configurations, including a model achieving the state-of-the-art CB513 accuracy. 
Throughout the experiments,\nwe used RMSProp for optimization, with the initial learning rate of 0.00033 and minibatch size of 8.\nThe models are trained for 150 epochs, and the test accuracy at the last epoch is reported. We do not\n\n1This accuracy is based on a single model (i.e., non-ensemble model) prediction result. With model\nensembling, they obtained 71.4%, which is comparable to our result, 71.5% (note that we do not use model\nensembling in our experiment).\n\n7\n\n\fFigure 4: Gapcost g vs accuracy (Tiny-\nEINN). Tiny-EINN is nearly equivalent\nto Tiny-CNN for g > 10.\n\nFigure 5: Effect of data size. The per-\nformance gain of EINNs increases as the\ndata size decreases.\n\nuse weight decay, and the learning rate is reduced by 1/10 at epoch 100. We do not employ other\ntechniques including beam-search-based classi\ufb01cation [2], or model ensembling [2, 13].\nIn addition, we found that data augmentation improves the accuracy. To create a new training data,\nwe replaced the one-hot vector at randomly chosen positions with an amino acid drawn from the\nuniform distribution. In our experiments, we randomly replaced 15% of the residues. We maintained\nthe PSSM dimension. This simple augmentation strategy improved the CB513 accuracy by up to\n0.8-points. In the supplementary material (Appendix C), we investigate the effect of this augmentation.\nTo the best of our knowledge, this technique has not been adopted in previous studies.\nAs a baseline, we stack the ConvBlocks shown in Fig. 6 (b). This is similar to the current state-\nof-the-art model proposed in [2]. Unlike their architecture, we do not employ nonlinearity after\nConv-1 because we found it deteriorates the test performance when stacked deeply. We \ufb01rst apply\ntwo ConvBlocks. Then, at each position, a fully connected layer (of size 455) is applied, followed by\nbatch normalization, dropout (p = 0.2) and ReLU. Finally, another fully-connected layer is applied to\noutput the 8-class scores. 
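The residue-replacement augmentation described above can be sketched as follows. Array names and shapes are my assumptions; per the text, a fraction of one-hot residue rows is replaced by a uniformly drawn amino acid while the PSSM features are maintained:

```python
import numpy as np

def augment(onehot, pssm, frac=0.15, rng=None):
    """Replace `frac` of the residues' one-hot rows with a uniformly drawn
    amino acid; the PSSM features are left untouched."""
    if rng is None:
        rng = np.random.default_rng()
    L, A = onehot.shape                               # length, alphabet size (22)
    aug = onehot.copy()
    n_repl = int(round(frac * L))
    pos = rng.choice(L, size=n_repl, replace=False)   # positions to corrupt
    aug[pos] = 0.0
    aug[pos, rng.integers(0, A, size=n_repl)] = 1.0   # uniform replacement
    return aug, pssm                                  # PSSM is maintained

rng = np.random.default_rng(0)
oh = np.eye(22)[rng.integers(0, 22, size=100)]        # toy protein of length 100
ps = rng.normal(size=(100, 22))
oh_aug, ps_aug = augment(oh, ps, rng=rng)
```

Each augmented row is still a valid one-hot vector, and at most 15% of positions differ from the original sequence (a replacement can coincide with the original residue).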
To investigate the effect of the EINNs, we replace the convolutions shaded in Fig. 6 (b) in the first ConvBlock with EINNs of the same filter and kernel sizes.
The test accuracies for these models (2-block CNN† and 2-block EINN† in Table 2) indicate that the EINN-based model is again better, although the degree of improvement is smaller. This can be interpreted as follows. Following the analysis in Section 3, the ConvBlock itself can recognize complex string patterns; this could reduce the need for EINNs, even though an EINN alone can potentially recognize complex patterns.
It is impossible to replace all of the convolutions in the model with EINNs, for the following reasons. First, EINNs consume much more GPU memory than CNNs, preventing us from applying them widely. Second, EINNs are slower to compute than CNNs. Although we have implemented EINNs on GPUs, as mentioned in Section 2, they run more than ten times slower than CNNs for kernel size k = 5. This is because the innermost double loop cannot be parallelized, resulting in a time complexity of O(k^2), whereas the CNN computation runs in O(1) time on GPUs. Hence, we investigate only CNNs in the following.
In Section 3, we argued that 1) depth and 2) concatenation are important for handling edit operations with CNNs efficiently. In the following, we investigate the effect of each factor by 1) increasing the depth while keeping other conditions equivalent, and 2) using two different network architectures that do not involve concatenation (i.e., ResNet).

Table 2: Comparison of accuracies for secondary structure prediction on the CB513 dataset. Note that these results are for non-ensemble models. (*: with multitasking / †: with data augmentation.)
Method | Acc. (%)
Our 2-block CNN† | 69.7
Our 2-block EINN† | 69.8
Our 2-block CNN*† | 69.8
Our 4-block CNN*† | 70.6
Our 8-block CNN*† | 71.2
Our 12-block CNN*† | 71.5
Our 16-block CNN*† | 71.3
Our 8-block MCNN*† | 71.3
Our 12-block MCNN*† | 71.5
ResNet*† (best result) | 71.0
GSN [26] (2014) | 66.4
DeepCNF [24] (2016) | 68.3
DCRNN [13] (2016) | 69.4
NextCond CNN [2] (2017) | 70.3

Figure 6: Network architectures: (a) Tiny-{CNN/EINN}, (b) ConvBlock (inspired by [2]), and (c) Modified ConvBlock. Conv-k is the 1d-convolutional layer with kernel size k; the number at the top-left indicates the number of filters used. By replacing the shaded convolutions with EINNs of the same kernel size, we obtain their EINN versions. Here, ⓒ denotes concatenation along the filter dimension. (a) "32-convs" denotes a grouped convolution with 32 groups. (b) This ConvBlock is stacked deeply; then, at each position, a fully-connected layer is applied to output the 8-class scores (see the text for details). (c) An alternative ConvBlock, used to show robustness against the choice of network architecture.

Effect of depth. Next, we investigate deeper CNN architectures. We begin with a shallow stacking of ConvBlocks and make the stacking deeper (from 2 blocks to 16 blocks). The training procedure is the same as in the previous experiment, except for the following points. First, we employed the widely-used multitasking technique [13, 19], simultaneously predicting the secondary structure (eight classes) and solvent accessibility (four classes).
Second, we trained each model for 300 epochs, and the learning rate was reduced by 1/10 at epoch 200.
As shown in Table 2, a CB513 accuracy of 71.5% is achieved by our 12-block CNN*†, which is much higher than the results of the previous models shown at the bottom of the table. In particular, it is more accurate than the previous best single-model accuracy of 70.3% [2]. Further, deeper models tend to show higher accuracy, which corresponds to the discussion in Section 3.
We could test other techniques, such as ensemble models or templates [15], to improve the accuracy and avoid potential overfitting. Further, we should evaluate our model on various independent datasets and investigate other network architectures; however, we omitted most of these experiments primarily because of resource limitations. In the following, we show what happens when different architectures are used.

Effect of network architecture. We investigate how the network architecture affects performance by replacing the ConvBlock with the modified ConvBlock (Fig. 6 (c)), which also involves convolutional layers with concatenations. Note that the original and modified blocks have the same output dimensions and receptive fields. Table 2 ("8-block MCNN" and "12-block MCNN") shows that this modification does not change the accuracy. This indicates that many different architectures can achieve the same performance, and that there is still room for improvement.
Finally, we replace the ConvBlocks with residual blocks [10], as discussed in Section 4. The best CB513 accuracy achieved by the ResNet-like models is 71.0% ("ResNet" in Table 2), which is worse than the models in Fig. 6 (b) and (c). For details, see Appendix B.

7 Conclusion

In this paper, we discussed how to make neural networks edit-invariant, an important new property for ML tasks on biological sequences.
First, we proposed EINNs, which consist of differentiable NW algorithm modules. Using EINNs as a generalization of CNNs, we confirmed that EINNs performed better than the corresponding CNNs on a real biological task, indicating that handling insertion/deletion is important for biological ML tasks. Next, we showed that sufficiently deep CNNs with concatenation can emulate complex regular expressions, implying that such deep CNNs can also treat the insertion/deletion of characters. Finally, for the protein secondary structure prediction task on the CB513 test dataset, the accuracy of our deep CNN was better than the current best result among non-ensemble models.

References

[1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403-410, 1990.

[2] A. Busia and N. Jaitly. Next-Step Conditioned Deep Convolutional Neural Networks Improve Protein Secondary Structure Prediction. CoRR, abs/1702.03865, 2017.

[3] M. Cuturi and M. Blondel. Soft-DTW: a Differentiable Loss Function for Time-Series. In Proc. ICML'17, pages 894-903, 2017.

[4] P. Di Lena, K. Nagata, and P. Baldi. Deep architectures for protein contact map prediction. Bioinformatics, 28(19):2449-2457, 2012.

[5] M. L. Forcada and R. C. Carrasco. Finite-State Computation in Analog Neural Networks: Steps towards Biologically Plausible Models? Emergent Neural Computational Architectures Based on Neuroscience, pages 480-493, 2001.

[6] K. Fukushima.
Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193-202, 1980.

[7] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks, 12(6):1333-1340, 2001.

[8] V. Golkov, M. J. Skwark, A. Golkov, A. Dosovitskiy, T. Brox, J. Meiler, and D. Cremers. Protein contact prediction from amino acid co-evolution using convolutional networks for graph-valued images. In Proc. NIPS'16, pages 4222-4230, 2016.

[9] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proc. ICML'06, pages 369-376, 2006.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR'16, pages 770-778, 2016.

[11] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely Connected Convolutional Networks. In Proc. CVPR'17, pages 2261-2269, 2017.

[12] D. R. Kelley, J. Snoek, and J. L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 26(7):990-999, Jul. 2016.

[13] Z. Li and Y. Yu. Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks. In Proc. IJCAI'16, pages 2560-2567, 2016.

[14] Z. Lin, J. Lanchantin, and Y. Qi. MUST-CNN: A Multilayer Shift-and-Stitch Deep Convolutional Architecture for Sequence-based Protein Structure Prediction. In Proc. AAAI, 2016.

[15] C. N. Magnan and P. Baldi. SSpro/ACCpro 5: Almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics, 30(18):2592-2597, 2014.

[16] M. Minsky.
Computation: Finite and Infinite Machines. Prentice-Hall, 1967.

[17] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

[18] J. Peng, L. Bo, and J. Xu. Conditional Neural Fields. In Proc. NIPS'09, pages 1419-1427, 2009.

[19] Y. Qi, M. Oja, J. Weston, and W. S. Noble. A unified multitask architecture for predicting local protein properties. PLoS ONE, 7(3), 2012.

[20] H. Saigo, J.-P. Vert, and T. Akutsu. Optimizing amino acid substitution matrices with a local alignment kernel. BMC Bioinformatics, 7:246, 2006.

[21] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43-49, Feb. 1978.

[22] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, Mar. 1981.

[23] S. K. Sønderby and O. Winther. Protein secondary structure prediction with long short term memory networks. arXiv:1412.7828, 2016.

[24] S. Wang, J. Peng, J. Ma, and J. Xu. Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Scientific Reports, 6(18962), 2016.

[25] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic Networks: Deep Translation and Rotation Equivariance. In Proc. CVPR'17, pages 5028-5037, 2017.

[26] J. Zhou and O. G. Troyanskaya. Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In Proc.
ICML'14, pages 745-753, 2014.