{"title": "Recurrent Kernel Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 13431, "page_last": 13442, "abstract": "Substring kernels are classical tools for representing biological sequences or text. However, when large amounts of annotated data are available, models that allow end-to-end training such as neural networks are often preferred. Links between recurrent neural networks (RNNs) and substring kernels have recently been drawn, by formally showing that RNNs with specific activation functions were points in a reproducing kernel Hilbert space (RKHS). In this paper, we revisit this link by generalizing convolutional kernel networks---originally related to a relaxation of the mismatch kernel---to model gaps in sequences. It results in a new type of recurrent neural network which can be trained end-to-end with backpropagation, or without supervision by using kernel approximation techniques. We experimentally show that our approach is well suited to biological sequences, where it outperforms existing methods for protein classification tasks.", "full_text": "Recurrent Kernel Networks\n\nDexiong Chen\n\nInria\u2217\n\ndexiong.chen@inria.fr\n\nLaurent Jacob\n\nCNRS\u2020\n\nlaurent.jacob@univ-lyon1.fr\n\nJulien Mairal\n\nInria\u2217\n\njulien.mairal@inria.fr\n\nAbstract\n\nSubstring kernels are classical tools for representing biological sequences or text.\nHowever, when large amounts of annotated data are available, models that allow\nend-to-end training such as neural networks are often preferred. Links between\nrecurrent neural networks (RNNs) and substring kernels have recently been drawn,\nby formally showing that RNNs with specific activation functions were points\nin a reproducing kernel Hilbert space (RKHS). In this paper, we revisit this link\nby generalizing convolutional kernel networks\u2014originally related to a relaxation\nof the mismatch kernel\u2014to model gaps in sequences. 
It results in a new type of\nrecurrent neural network which can be trained end-to-end with backpropagation, or\nwithout supervision by using kernel approximation techniques. We experimentally\nshow that our approach is well suited to biological sequences, where it outperforms\nexisting methods for protein classification tasks.\n\n1 Introduction\n\nLearning from biological sequences is important for a variety of scientific fields such as evolution [8]\nor human health [16]. In order to use classical statistical models, a first step is often to map sequences\nto vectors of fixed size, while retaining relevant features for the considered learning task. For a long\ntime, such features have been extracted from sequence alignment, either against a reference or between\neach other [3]. The resulting features are appropriate for sequences that are similar enough, but they\nbecome ill-defined when sequences are not suited to alignment. This includes important cases such as\nmicrobial genomes, distant species, or human diseases, and calls for alternative representations [7].\nString kernels provide generic representations for biological sequences, most of which do not require\nglobal alignment [34]. In particular, a classical approach maps sequences to a high-dimensional\nfeature space by enumerating statistics about all occurring subsequences. These subsequences may be\nsimple classical k-mers leading to the spectrum kernel [21], k-mers up to mismatches [22], or\ngap-allowing subsequences [24]. Other approaches involve kernels based on a generative model [17, 35],\nor based on local alignments between sequences [36] inspired by convolution kernels [11, 37].\nThe goal of kernel design is then to encode prior knowledge in the learning process. For instance,\nmodeling gaps in biological sequences is important since it allows taking into account short insertion\nand deletion events, a common source of genetic variation. 
However, even though kernel methods are\ngood at encoding prior knowledge, they provide fixed task-independent representations. When large\namounts of data are available, approaches that optimize the data representation for the prediction task\nare now often preferred. For instance, convolutional neural networks [19] are commonly used for DNA\nsequence modeling [1, 2, 41], and have been successful for natural language processing [18]. While\nconvolution filters learned over images are interpreted as image patches, those learned over sequences\nare viewed as sequence motifs. RNNs such as long short-term memory networks (LSTMs) [14] are\nalso commonly used in both biological [13] and natural language processing contexts [5, 26].\n\n\u2217Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.\n\u2020Univ. Lyon, Universit\u00e9 Lyon 1, CNRS, Laboratoire de Biom\u00e9trie et Biologie Evolutive UMR 5558, 69000 Lyon, France.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nMotivated by the regularization mechanisms of kernel methods, which are useful when the amount\nof data is small and are still imperfect in neural networks, hybrid approaches have been developed\nbetween the kernel and neural networks paradigms [6, 27, 40]. Closely related to our work, the\nconvolutional kernel network (CKN) model originally developed for images [25] was successfully\nadapted to biological sequences in [4]. CKNs for sequences consist in a continuous relaxation of the\nmismatch kernel: while the latter represents a sequence by its content in k-mers up to a few discrete\nerrors, the former considers a continuous relaxation, leading to an infinite-dimensional sequence\nrepresentation. Finally, a kernel approximation relying on the Nystr\u00f6m method [38] projects the\nmapped sequences to a linear subspace of the RKHS, spanned by a finite number of motifs. 
When\nthese motifs are learned end-to-end with backpropagation, learning with CKNs can also be thought\nof as performing feature selection in the\u2014in\ufb01nite dimensional\u2014RKHS.\nIn this paper, we generalize CKNs for sequences by allowing gaps in motifs, motivated by genomics\napplications. The kernel map retains the convolutional structure of CKNs but the kernel approx-\nimation that we introduce can be computed using a recurrent network, which we call recurrent\nkernel network (RKN). This RNN arises from the dynamic programming structure used to compute\nef\ufb01ciently the substring kernel of [24], a link already exploited by [20] to derive their sequence neural\nnetwork, which was a source of inspiration for our work. Both our kernels rely on a RNN to build\na representation of an input sequence by computing a string kernel between this sequence and a\nset of learnable \ufb01lters. Yet, our model exhibits several differences with [20], who use the regular\nsubstring kernel of [24] and compose this representation with another non-linear map\u2014by applying\nan activation function to the output of the RNN. By contrast, we obtain a different RKHS directly by\nrelaxing the substring kernel to allow for inexact matching at the compared positions, and embed\nthe Nystr\u00f6m approximation within the RNN. The resulting feature space can be interpreted as a\ncontinuous neighborhood around all substrings (with gaps) of the described sequence. Furthermore,\nour RNN provides a \ufb01nite-dimensional approximation of the relaxed kernel, relying on the Nystr\u00f6m\napproximation method [38]. 
As a consequence, RKNs may be learned in an unsupervised manner (in\nsuch a case, the goal is to approximate the kernel map), or with supervision using backpropagation,\nwhich may be interpreted as performing feature selection in the RKHS.\n\nContributions. In this paper, we make the following contributions:\n\u2022 We generalize convolutional kernel networks for sequences [4] to allow gaps, an important option\nfor biological data. As in [4], we observe that the kernel formulation brings practical benefits over\ntraditional CNNs or RNNs [13] when the amount of labeled data is small or moderate.\n\u2022 We provide a kernel point of view on recurrent neural networks with new unsupervised and\nsupervised learning algorithms. The resulting feature map can be interpreted in terms of gappy motifs,\nand end-to-end learning amounts to performing feature selection.\n\u2022 Based on [28], we propose a new way to simulate max pooling in RKHSs, thus solving a classical\ndiscrepancy between theory and practice in the literature of string kernels, where sums are often\nreplaced by a maximum operator that does not ensure positive definiteness [36].\n\n2 Background on Kernel Methods and String Kernels\n\nKernel methods consist in mapping data points living in a set X to a possibly infinite-dimensional\nHilbert space H, through a mapping function \u03a6 : X \u2192 H, before learning a simple predictive model\nin H [33]. The so-called kernel trick allows learning to be performed without explicitly computing this\nmapping, as long as the inner-product $K(x, x') = \\langle \\Phi(x), \\Phi(x') \\rangle_{\\mathcal{H}}$ between two points $x, x'$ can\nbe efficiently computed. 
Whereas kernel methods traditionally lack scalability since they require\ncomputing an n \u00d7 n Gram matrix, where n is the number of training samples, recent approaches based\non approximations have managed to make kernel methods work at large scale in many cases [30, 38].\nFor sequences in $\\mathcal{X} = \\mathcal{A}^*$, which is the set of sequences of any possible length over an alphabet A,\nthe mapping \u03a6 often enumerates subsequence content. For instance, the spectrum kernel maps\nsequences to a fixed-length vector $\\Phi(x) = (\\phi_u(x))_{u \\in \\mathcal{A}^k}$, where $\\mathcal{A}^k$ is the set of k-mers, that is, length-k\nsequences of characters in A for some k in N, and $\\phi_u(x)$ counts the number of occurrences of u\nin x [21]. The mismatch kernel [22] operates similarly, but $\\phi_u(x)$ counts the occurrences of u up to a\nfew mismatched letters, which is useful when k is large and exact occurrences are rare.\n\n2.1 Substring kernels\n\nAs in [20], we consider the substring kernel introduced in [24], which allows modeling the presence\nof gaps when trying to match a substring u to a sequence x. Modeling gaps requires introducing\nthe following notation: $\\mathcal{I}_{x,k}$ denotes the set of indices of sequence x with k elements $(i_1, \\dots, i_k)$\nsatisfying $1 \\le i_1 < \\dots < i_k \\le |x|$, where |x| is the length of x. For an index set $\\mathbf{i}$ in $\\mathcal{I}_{x,k}$, we may\nnow consider the subsequence $x_{\\mathbf{i}} = (x_{i_1}, \\dots, x_{i_k})$ of x indexed by $\\mathbf{i}$. Then, the substring kernel\ntakes the same form as the mismatch and spectrum kernels, but $\\phi_u(x)$ counts all\u2014consecutive or\nnot\u2014subsequences of x equal to u, and weights them by the number of gaps. Formally, we consider a\nparameter \u03bb in [0, 1], and $\\phi_u(x) = \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\lambda^{\\mathrm{gaps}(\\mathbf{i})} \\delta(u, x_{\\mathbf{i}})$, where $\\delta(u, v) = 1$ if and only if u = v,\nand 0 otherwise, and $\\mathrm{gaps}(\\mathbf{i}) := i_k - i_1 - k + 1$ is the number of gaps in the index set $\\mathbf{i}$. 
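As a rough illustration of the definition above, the gap-weighted feature phi_u(x) and the inner product of two such feature vectors can be computed by brute-force enumeration. The Python sketch below is not taken from the paper: the names gaps, phi and substring_kernel are hypothetical, indices are 0-based, and the enumeration is exponential in k, so it is only meant for tiny inputs.

```python
from itertools import combinations


def gaps(idx):
    # Number of gaps in a sorted index tuple: i_k - i_1 - k + 1.
    return idx[-1] - idx[0] - len(idx) + 1


def phi(u, x, lam):
    # Brute-force phi_u(x): sum of lam**gaps(i) over all index tuples i
    # (consecutive or not) such that the subsequence x_i equals u.
    k = len(u)
    total = 0.0
    for idx in combinations(range(len(x)), k):
        if all(x[p] == c for p, c in zip(idx, u)):
            total += lam ** gaps(idx)
    return total


def substring_kernel(x, y, k, lam):
    # Brute-force gap-weighted kernel between x and y: every pair of
    # matching k-subsequences contributes lam**gaps(i) * lam**gaps(j).
    total = 0.0
    for i in combinations(range(len(x)), k):
        xi = tuple(x[p] for p in i)
        for j in combinations(range(len(y)), k):
            if xi == tuple(y[p] for p in j):
                total += lam ** gaps(i) * lam ** gaps(j)
    return total
```

For instance, phi('AT', 'ATAT', 0.5) adds two gap-free occurrences (weight 1 each) and one occurrence with two gaps (weight 0.25). Python evaluates 0.0 ** 0 as 1.0, which matches the convention that a zero penalty keeps only gap-free matches.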
When \u03bb is\nsmall, gaps are heavily penalized, whereas a value close to 1 gives similar weights to all occurrences.\nUltimately, the resulting kernel between two sequences $x$ and $x'$ is\n\n$$K_s(x, x') := \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\sum_{\\mathbf{j} \\in \\mathcal{I}_{x',k}} \\lambda^{\\mathrm{gaps}(\\mathbf{i})} \\lambda^{\\mathrm{gaps}(\\mathbf{j})} \\delta(x_{\\mathbf{i}}, x'_{\\mathbf{j}}). \\tag{1}$$\n\nAs we will see in Section 3, our RKN model relies on (1), but unlike [20], we replace the quantity\n$\\delta(x_{\\mathbf{i}}, x'_{\\mathbf{j}})$ that matches exact occurrences by a relaxation, allowing more subtle comparisons. Then,\nwe will show that the model can be interpreted as a gap-allowed extension of CKNs for sequences.\nWe also note that even though $K_s$ seems computationally expensive at first sight, it was shown in [24]\nthat (1) admits a dynamic programming structure leading to efficient computations.\n\n2.2 The Nystr\u00f6m method\n\nWhen computing the Gram matrix is infeasible, it is typical to use kernel approximations [30, 38],\nconsisting in finding a q-dimensional mapping $\\psi : \\mathcal{X} \\to \\mathbb{R}^q$ such that the kernel $K(x, x')$ can be\napproximated by a Euclidean inner-product $\\langle \\psi(x), \\psi(x') \\rangle_{\\mathbb{R}^q}$. Then, kernel methods can be simulated\nby a linear model operating on $\\psi(x)$, which does not raise scalability issues if q is reasonably small.\nAmong kernel approximations, the Nystr\u00f6m method consists in projecting points of the RKHS onto a\nq-dimensional subspace, which allows points to be represented in a q-dimensional coordinate system.\nSpecifically, consider a collection $Z = \\{z_1, \\dots, z_q\\}$ of points in X and consider the subspace\n\n$$\\mathcal{E} = \\mathrm{Span}(\\Phi(z_1), \\dots, \\Phi(z_q)) \\quad \\text{and define} \\quad \\psi(x) = K_{ZZ}^{-1/2} K_Z(x),$$\n\nwhere $K_{ZZ}$ is the q \u00d7 q Gram matrix of K restricted to the samples $z_1, \\dots, z_q$ and $K_Z(x)$ in $\\mathbb{R}^q$ carries\nthe kernel values $K(x, z_j)$, $j = 1, \\dots, q$. This approximation only requires q kernel evaluations and\noften retains good performance for learning. 
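The Nystr\u00f6m embedding just defined can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are plain Euclidean vectors, the kernel is a Gaussian kernel, the helper names gaussian_kernel, inv_sqrt and nystrom_features are hypothetical, and the eigenvalue floor eps is an added numerical safeguard that the text does not discuss.

```python
import numpy as np


def gaussian_kernel(A, B, alpha=1.0):
    # Pairwise exp(-alpha / 2 * ||a - b||^2) between rows of A and rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * alpha * sq)


def inv_sqrt(K, eps=1e-10):
    # Symmetric inverse square root through an eigendecomposition.
    # Eigenvalues are floored at eps (an assumption) to avoid dividing by zero.
    w, V = np.linalg.eigh(K)
    w = np.maximum(w, eps)
    return (V / np.sqrt(w)) @ V.T


def nystrom_features(X, Z, alpha=1.0):
    # Row n of the result is psi(x_n) = K_ZZ^{-1/2} K_Z(x_n).
    K_ZZ = gaussian_kernel(Z, Z, alpha)
    K_ZX = gaussian_kernel(Z, X, alpha)  # column n holds K_Z(x_n)
    return (inv_sqrt(K_ZZ) @ K_ZX).T
```

One sanity check for such a sketch: when the inputs are the anchor points themselves, the inner products of the embeddings reproduce the Gram matrix of the anchors up to numerical error, since each anchor already lies in the subspace spanned by the anchors.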
Interestingly, as noted in [25], $\\langle \\psi(x), \\psi(x') \\rangle_{\\mathbb{R}^q}$ is exactly\nthe inner-product in H between the projections of $\\Phi(x)$ and $\\Phi(x')$ onto E, which remain in H.\nWhen X is a Euclidean space\u2014this can be the case for sequences when using a one-hot encoding\nrepresentation, as discussed later\u2014a good set of anchor points $z_j$ can be obtained by simply clustering\nthe data and choosing the centroids as anchor points [39]. The goal is then to obtain a subspace E\nthat spans the data as well as possible. Otherwise, previous works on kernel networks [4, 25] have also\ndeveloped procedures to learn the set of anchor points end-to-end by optimizing over the learning\nobjective. This approach can then be seen as performing feature selection in the RKHS.\n\n3 Recurrent Kernel Networks\n\nWith the previous tools in hand, we now introduce RKNs. We show that the model admits variants of CKNs,\nsubstring and local alignment kernels as special cases, and we discuss its relation with RNNs.\n\n3.1 A continuous relaxation of the substring kernel allowing mismatches\n\nFrom now on, and with an abuse of notation, we represent characters in A as vectors in $\\mathbb{R}^d$. For\ninstance, when using one-hot encoding, a DNA sequence $x = (x_1, \\dots, x_m)$ of length m can be seen\nas a 4-dimensional sequence where each $x_j$ in $\\{0, 1\\}^4$ has a unique non-zero entry indicating which\nof {A, C, G, T} is present at the j-th position, and we denote by X the set of such sequences. 
We\nnow define the single-layer RKN as a generalized substring kernel (1) in which the indicator function\n$\\delta(x_{\\mathbf{i}}, x'_{\\mathbf{j}})$ is replaced by a kernel for k-mers:\n\n$$K_k(x, x') := \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\sum_{\\mathbf{j} \\in \\mathcal{I}_{x',k}} \\lambda_{x,\\mathbf{i}} \\lambda_{x',\\mathbf{j}} e^{-\\frac{\\alpha}{2} \\|x_{\\mathbf{i}} - x'_{\\mathbf{j}}\\|^2}, \\tag{2}$$\n\nwhere we assume that the vectors representing characters have unit $\\ell_2$-norm, such that $e^{-\\frac{\\alpha}{2} \\|x_{\\mathbf{i}} - x'_{\\mathbf{j}}\\|^2} = e^{\\alpha(\\langle x_{\\mathbf{i}}, x'_{\\mathbf{j}} \\rangle - k)} = \\prod_{t=1}^{k} e^{\\alpha(\\langle x_{i_t}, x'_{j_t} \\rangle - 1)}$ is a dot-product kernel, and $\\lambda_{x,\\mathbf{i}} = \\lambda^{\\mathrm{gaps}(\\mathbf{i})}$ if we follow (1).\n\nFor \u03bb = 0 and using the convention $0^0 = 1$, all the terms in these sums are zero except those\nfor k-mers with no gap, and we recover the kernel of the CKN model of [4] with a convolutional\nstructure\u2014up to the normalization, which is done k-mer-wise in CKN instead of position-wise.\nCompared to (1), the relaxed version (2) accommodates inexact k-mer matching. This is important\nfor protein sequences, where it is common to consider different similarities between amino acids in\nterms of substitution frequency along evolution [12]. 
This is also reflected in the underlying sequence\nrepresentation in the RKHS illustrated in Figure 1: by considering $\\varphi(\\cdot)$ 
the kernel mapping and\nRKHS H such that $K(x_{\\mathbf{i}}, x'_{\\mathbf{j}}) = e^{-\\frac{\\alpha}{2} \\|x_{\\mathbf{i}} - x'_{\\mathbf{j}}\\|^2} = \\langle \\varphi(x_{\\mathbf{i}}), \\varphi(x'_{\\mathbf{j}}) \\rangle_{\\mathcal{H}}$, we have\n\n$$K_k(x, x') = \\left\\langle \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\lambda_{x,\\mathbf{i}} \\varphi(x_{\\mathbf{i}}), \\sum_{\\mathbf{j} \\in \\mathcal{I}_{x',k}} \\lambda_{x',\\mathbf{j}} \\varphi(x'_{\\mathbf{j}}) \\right\\rangle_{\\mathcal{H}}. \\tag{3}$$\n\nA natural feature map for a sequence x is therefore $\\Phi_k(x) = \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\lambda_{x,\\mathbf{i}} \\varphi(x_{\\mathbf{i}})$: using the RKN\namounts to representing x by a mixture of continuous neighborhoods $\\varphi(x_{\\mathbf{i}}) : z \\mapsto e^{-\\frac{\\alpha}{2} \\|x_{\\mathbf{i}} - z\\|^2}$\ncentered on all its k-subsequences $x_{\\mathbf{i}}$, each weighted by the corresponding $\\lambda_{x,\\mathbf{i}}$ (e.g., $\\lambda_{x,\\mathbf{i}} = \\lambda^{\\mathrm{gaps}(\\mathbf{i})}$).\nAs a particular case, a feature map of CKN [4] is the sum of the kernel mapping of all the k-mers\nwithout gap.\n\nFigure 1: Representation of sequences in a RKHS based on $K_k$ with k = 4 and $\\lambda_{x,\\mathbf{i}} = \\lambda^{\\mathrm{gaps}(\\mathbf{i})}$.\n\n3.2 Extension to all k-mers and relation to the local alignment kernel\n\nDependency on the hyperparameter k can be removed by summing $K_k$ over all possible values:\n\n$$K_{\\mathrm{sum}}(x, x') := \\sum_{k=1}^{\\infty} K_k(x, x') = \\sum_{k=1}^{\\max(|x|, |x'|)} K_k(x, x').$$\n\nInterestingly, we note that $K_{\\mathrm{sum}}$ admits the local alignment kernel of [36] as a special case. More\nprecisely, local alignments are defined via the tensor product set $\\mathcal{A}_k(x, x') := \\mathcal{I}_{x,k} \\times \\mathcal{I}_{x',k}$, which\ncontains all possible alignments of k positions between a pair of sequences $(x, x')$. 
The local\nalignment score of each such alignment $\\pi = (\\mathbf{i}, \\mathbf{j})$ in $\\mathcal{A}_k(x, x')$ is defined, by [36], as\n\n$$S(x, x', \\pi) := \\sum_{t=1}^{k} s(x_{i_t}, x'_{j_t}) - \\sum_{t=1}^{k-1} \\left[ g(i_{t+1} - i_t - 1) + g(j_{t+1} - j_t - 1) \\right],$$\n\nwhere s is a symmetric substitution\nfunction and g is a gap penalty function. The local alignment kernel in [36] can then be expressed in\nterms of the above local alignment scores (Thrm. 1.7 in [36]):\n\n$$K_{\\mathrm{LA}}(x, x') = \\sum_{k=1}^{\\infty} K^k_{\\mathrm{LA}}(x, x') := \\sum_{k=1}^{\\infty} \\sum_{\\pi \\in \\mathcal{A}_k(x, x')} \\exp(\\beta S(x, x', \\pi)) \\quad \\text{for some } \\beta > 0. \\tag{4}$$\n\nWhen the gap penalty function is linear, that is, $g(x) = cx$ with $c > 0$, $K^k_{\\mathrm{LA}}$ becomes\n\n$$K^k_{\\mathrm{LA}}(x, x') = \\sum_{\\pi \\in \\mathcal{A}_k(x, x')} \\exp(\\beta S(x, x', \\pi)) = \\sum_{(\\mathbf{i}, \\mathbf{j}) \\in \\mathcal{A}_k(x, x')} e^{-c\\beta\\,\\mathrm{gaps}(\\mathbf{i})} e^{-c\\beta\\,\\mathrm{gaps}(\\mathbf{j})} \\prod_{t=1}^{k} e^{\\beta s(x_{i_t}, x'_{j_t})}.$$\n\nWhen\n$s(x_{i_t}, x'_{j_t})$ can be written as an inner-product $\\langle \\psi_s(x_{i_t}), \\psi_s(x'_{j_t}) \\rangle$ between normalized vectors, we see\nthat $K_{\\mathrm{LA}}$ becomes a special case of (2)\u2014up to a constant factor\u2014with $\\lambda_{x,\\mathbf{i}} = e^{-c\\beta\\,\\mathrm{gaps}(\\mathbf{i})}$, \u03b1 = \u03b2.\nThis observation sheds new light on the relation between the substring and local alignment kernels,\nwhich will inspire new algorithms in the sequel. To the best of our knowledge, the link we will\nprovide between RNNs and local alignment kernels is also new.\n\n3.3 Nystr\u00f6m approximation and recurrent neural networks\n\nAs in CKNs, we now use the Nystr\u00f6m approximation method as a building block to make the above\nkernels tractable. 
According to (3), we may first use the Nystr\u00f6m method described in Section 2.2 to\nfind an approximate embedding for the quantities $\\varphi(x_{\\mathbf{i}})$, where $x_{\\mathbf{i}}$ is one of the k-mers represented\nas a matrix in $\\mathbb{R}^{k \\times d}$. This is achieved by choosing a set $Z = \\{z_1, \\dots, z_q\\}$ of anchor points in $\\mathbb{R}^{k \\times d}$,\nand by encoding $\\varphi(x_{\\mathbf{i}})$ as $K_{ZZ}^{-1/2} K_Z(x_{\\mathbf{i}})$, where K is the kernel of H. Such an approximation for\nk-mers yields the q-dimensional embedding for the sequence x:\n\n$$\\psi_k(x) = \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\lambda_{x,\\mathbf{i}} K_{ZZ}^{-1/2} K_Z(x_{\\mathbf{i}}) = K_{ZZ}^{-1/2} \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\lambda_{x,\\mathbf{i}} K_Z(x_{\\mathbf{i}}). \\tag{5}$$\n\nThen, an approximate feature map $\\psi_{\\mathrm{sum}}(x)$ for the kernel $K_{\\mathrm{sum}}$ can be obtained by concatenating the\nembeddings $\\psi_1(x), \\dots, \\psi_k(x)$ for k large enough.\n\nThe anchor points as motifs. The continuous relaxation of the substring kernel presented in (2)\nallows us to learn anchor points that can be interpreted as sequence motifs, where each position can\nencode a mixture of letters. This can lead to more relevant representations than k-mers for learning on\nbiological sequences. For example, the fact that a DNA sequence is bound by a particular transcription\nfactor can be associated with the presence of a T followed by either a G or an A, followed by another\nT, which would require two k-mers but a single motif [4]. Our kernel is able to perform such a comparison.\n\nEfficient computations of $K_k$ and $K_{\\mathrm{sum}}$ approximation via RNNs. A naive computation of $\\psi_k(x)$\nwould require enumerating all substrings present in the sequence, whose number may be exponentially large\nwhen allowing gaps. For this reason, we use the classical dynamic programming approach of substring\nkernels [20, 24]. Consider then the computation of $\\psi_j(x)$ defined in (5) for $j = 1, \\dots, k$ as well as a\nset of anchor points $Z_k = \\{z_1, \\dots, z_q\\}$ with the $z_i$'s in $\\mathbb{R}^{d \\times k}$. 
We also denote by $Z_j$ the set obtained\nwhen keeping only the first j positions (columns) of the $z_i$'s, leading to $Z_j = \\{[z_1]_{1:j}, \\dots, [z_q]_{1:j}\\}$,\nwhich will serve as anchor points for the kernel $K_j$ to compute $\\psi_j(x)$. Finally, we denote by $z_i^j$ in $\\mathbb{R}^d$\nthe j-th column of $z_i$ such that $z_i = [z_i^1, \\dots, z_i^k]$. Then, the embeddings $\\psi_1(x), \\dots, \\psi_k(x)$ can be\ncomputed recursively by using the following theorem:\n\nTheorem 1. For any $j \\in \\{1, \\dots, k\\}$ and $t \\in \\{1, \\dots, |x|\\}$,\n\n$$\\psi_j(x_{1:t}) = K_{Z_j Z_j}^{-1/2} \\begin{cases} c_j[t] & \\text{if } \\lambda_{x,\\mathbf{i}} = \\lambda^{|x| - i_1 - j + 1}, \\\\ h_j[t] & \\text{if } \\lambda_{x,\\mathbf{i}} = \\lambda^{\\mathrm{gaps}(\\mathbf{i})}, \\end{cases} \\tag{6}$$\n\nwhere $c_j[t]$ and $h_j[t]$ form a sequence of vectors in $\\mathbb{R}^q$ indexed by t such that $c_j[0] = h_j[0] = 0$, and\n$c_0[t]$ is a vector that contains only ones, while the sequence obeys the recursion\n\n$$\\begin{aligned} c_j[t] &= \\lambda c_j[t-1] + c_{j-1}[t-1] \\odot b_j[t], & 1 \\le j \\le k, \\\\ h_j[t] &= h_j[t-1] + c_{j-1}[t-1] \\odot b_j[t], & 1 \\le j \\le k, \\end{aligned} \\tag{7}$$\n\nwhere $\\odot$ is the elementwise multiplication operator and $b_j[t]$ is a vector in $\\mathbb{R}^q$ whose entry i in\n$\\{1, \\dots, q\\}$ is $e^{-\\frac{\\alpha}{2} \\|x_t - z_i^j\\|^2} = e^{\\alpha(\\langle x_t, z_i^j \\rangle - 1)}$ and $x_t$ is the t-th character of x.\n\nA proof is provided in Appendix A and is based on classical recursions for computing the substring\nkernel, which were interpreted as RNNs by [20]. The main difference in the RNN structure we\nobtain is that their non-linearity is applied over the outcome of the network, leading to a feature map\nformed by composing the feature map of the substring kernel of [24] and another one from a RKHS\nthat contains their non-linearity. By contrast, our non-linearities are built explicitly in the substring\nkernel, by relaxing the indicator function used to compare characters. The resulting feature map is a\ncontinuous neighborhood around all substrings of the described sequence. 
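To make the recursion of Theorem 1 concrete, here is a small NumPy sketch that runs the updates for c and h over a sequence and compares the result with a brute-force enumeration of all k-subsequences. It is an illustration under stated assumptions rather than the released code: characters and anchor columns are assumed to have unit norm, the names rkn_recursion and brute_force are hypothetical, indices are 0-based, c_0 is taken to be all ones from the very first step, and the orthogonalization factor is left out.

```python
import numpy as np
from itertools import combinations


def rkn_recursion(x, Z, lam, alpha):
    # x: (T, d) array of unit-norm characters.
    # Z: (q, d, k) array of anchors whose columns have unit norm.
    # Returns c and h of shape (k, q); row j - 1 holds c_j[T] and h_j[T].
    T = x.shape[0]
    q, _, k = Z.shape
    c = np.zeros((k + 1, q))
    h = np.zeros((k + 1, q))
    c[0] = 1.0  # c_0[t] is a vector of ones for every t
    for t in range(T):
        # b[j - 1, i] = exp(alpha * (<x_t, z_i^j> - 1))
        b = np.exp(alpha * (np.einsum('d,qdk->kq', x[t], Z) - 1.0))
        c_prev = c.copy()
        c[1:] = lam * c_prev[1:] + c_prev[:-1] * b
        h[1:] = h[1:] + c_prev[:-1] * b
    return c[1:], h[1:]


def brute_force(x, Z, lam, alpha, use_gaps=True):
    # Direct sum over all k-subsequences of x; exponential cost, for checks only.
    T = x.shape[0]
    q, _, k = Z.shape
    out = np.zeros(q)
    for idx in combinations(range(T), k):
        sims = np.array([Z[:, :, j] @ x[p] for j, p in enumerate(idx)])
        kern = np.exp(alpha * (sims - 1.0)).prod(axis=0)
        # 0-based version of gaps(i) or of |x| - i_1 - k + 1
        power = idx[-1] - idx[0] - k + 1 if use_gaps else T - idx[0] - k
        out += lam ** power * kern
    return out
```

The loop costs on the order of T times k times q operations per sequence, whereas the enumeration grows combinatorially with the sequence length, which is the point of the dynamic programming formulation.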
In addition, the Nystr\u00f6m\nmethod applies an orthogonalization factor $K_{ZZ}^{-1/2}$ to the output $K_Z(x)$ of the network to compute our\napproximation, which is perhaps the only non-standard component of our RNN. This factor provides\nan interpretation of $\\psi(x)$ as a kernel approximation. As discussed next, it makes it possible to learn\nthe anchor points by k-means, see [4], which also makes the initialization of the supervised learning\nprocedure simple without having to deal with the scaling of the initial motifs/filters $z_j$.\n\nLearning the anchor points Z. We now turn to the application of RKNs to supervised learning.\nGiven n sequences $x^1, \\dots, x^n$ in X and their associated labels $y^1, \\dots, y^n$ in Y, e.g., Y = {\u22121, 1}\nfor binary classification or Y = R for regression, our objective is to learn a function in the RKHS H\nof $K_k$ by minimizing\n\n$$\\min_{f \\in \\mathcal{H}} \\frac{1}{n} \\sum_{i=1}^{n} L(f(x^i), y^i) + \\frac{\\mu}{2} \\|f\\|^2_{\\mathcal{H}},$$\n\nwhere $L : \\mathbb{R} \\times \\mathbb{R} \\to \\mathbb{R}$ is a convex loss function that measures the fitness of a prediction $f(x^i)$ to\nthe true label $y^i$ and \u00b5 controls the smoothness of the predictive function. After injecting our kernel\napproximation $K_k(x, x') \\simeq \\langle \\psi_k(x), \\psi_k(x') \\rangle_{\\mathbb{R}^q}$, the problem becomes\n\n$$\\min_{w \\in \\mathbb{R}^q} \\frac{1}{n} \\sum_{i=1}^{n} L\\left( \\langle \\psi_k(x^i), w \\rangle, y^i \\right) + \\frac{\\mu}{2} \\|w\\|^2. \\tag{8}$$\n\nFollowing [4, 25], we can learn the anchor points Z without exploiting training labels, by applying\na k-means algorithm to all (or a subset of) the k-mers extracted from the database and using the\nobtained centroids as anchor points. Importantly, once Z has been obtained, the linear function\nparametrized by w is still optimized with respect to the supervised objective (8). 
This procedure can\nbe thought of as learning a general representation of the sequences disregarding the supervised task,\nwhich can lead to a relevant description while limiting overfitting.\nAnother strategy consists in optimizing (8) jointly over (Z, w), after observing that $\\psi_k(x) = K_{ZZ}^{-1/2} \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\lambda_{x,\\mathbf{i}} K_Z(x_{\\mathbf{i}})$ is a smooth function of Z. Learning can be achieved by using backpropagation\nover (Z, w), or by using an alternating minimization strategy between Z and w. It leads to\nan end-to-end scheme where both the representation and the function defined over this representation\nare learned with respect to the supervised objective (8). Backpropagation rules for most operations\nare classical, except for the matrix inverse square root function, which is detailed in Appendix B.\nInitialization is also parameter-free since the unsupervised learning approach may be used for that.\n\n3.4 Extensions\n\nMultilayer construction. In order to account for long-range dependencies, it is possible to construct\na multilayer model based on kernel compositions similar to [20]. Assume that $K_k^{(n)}$ is the n-th\nlayer kernel and $\\Phi_k^{(n)}$ its mapping function. The corresponding (n + 1)-th layer kernel is defined as\n\n$$K_k^{(n+1)}(x, x') = \\sum_{\\mathbf{i} \\in \\mathcal{I}_{x,k},\\, \\mathbf{j} \\in \\mathcal{I}_{x',k}} \\lambda^{(n+1)}_{x,\\mathbf{i}} \\lambda^{(n+1)}_{x',\\mathbf{j}} \\prod_{t=1}^{k} K_{n+1}\\left( \\Phi_k^{(n)}(x_{1:i_t}), \\Phi_k^{(n)}(x'_{1:j_t}) \\right), \\tag{9}$$\n\nwhere $K_{n+1}$ will be defined in the sequel and the choice of weights $\\lambda^{(n)}_{x,\\mathbf{i}}$ slightly differs from the\nsingle-layer model. We choose indeed $\\lambda^{(N)}_{x,\\mathbf{i}} = \\lambda^{\\mathrm{gaps}(\\mathbf{i})}$ only for the last layer N of the kernel, which\ndepends on the number of gaps in the index set $\\mathbf{i}$ but not on the index positions. 
Since (9) involves\na kernel $K_{n+1}$ operating on the representation of prefix sequences $\\Phi_k^{(n)}(x_{1:t})$ from layer n, the\nrepresentation makes sense only if $\\Phi_k^{(n)}(x_{1:t})$ carries mostly local information close to position t.\nOtherwise, information from the beginning of the sequence would be overrepresented. Ideally, we\nwould like the range-dependency of $\\Phi_k^{(n)}(x_{1:t})$ (the size of the window of indices before t that\ninfluences the representation, akin to receptive fields in CNNs) to grow with the number of layers\nin a controllable manner. This can be achieved by choosing $\\lambda^{(n)}_{x,\\mathbf{i}} = \\lambda^{|x| - i_1 - k + 1}$ for n < N, which\nassigns exponentially more weight to the k-mers close to the end of the sequence.\nFor the first layer, we recover the single-layer network $K_k$ defined in (2) by defining $\\Phi_k^{(0)}(x_{1:i_k}) = x_{i_k}$\nand $K_1(x_{i_k}, x'_{j_k}) = e^{\\alpha(\\langle x_{i_k}, x'_{j_k} \\rangle - 1)}$. For n > 1, it remains to define $K_{n+1}$ to be a homogeneous\ndot-product kernel, as used for instance in CKNs [25]:\n\n$$K_{n+1}(u, u') = \\|u\\|_{\\mathcal{H}_n} \\|u'\\|_{\\mathcal{H}_n} \\kappa_n\\left( \\left\\langle \\frac{u}{\\|u\\|_{\\mathcal{H}_n}}, \\frac{u'}{\\|u'\\|_{\\mathcal{H}_n}} \\right\\rangle_{\\mathcal{H}_n} \\right) \\quad \\text{with } \\kappa_n(t) = e^{\\alpha_n(t-1)}. \\tag{10}$$\n\nNote that the Gaussian kernel $K_1$ used for the first layer may also be written as (10) since characters are\nnormalized. 
As for CKNs, the goal of homogenization is to prevent norms from growing or vanishing exponentially\nfast with n, while dot-product kernels lend themselves well to neural network interpretations.\nAs detailed in Appendix C, extending the Nystr\u00f6m approximation scheme for the multilayer construction\nmay be achieved in the same manner as with CKNs, that is, we learn one approximate\nembedding $\\psi_k^{(n)}$ at each layer, allowing us to replace the inner-products $\\langle \\Phi_k^{(n)}(x_{1:i_t}), \\Phi_k^{(n)}(x'_{1:j_t}) \\rangle$ by\ntheir approximations $\\langle \\psi_k^{(n)}(x_{1:i_t}), \\psi_k^{(n)}(x'_{1:j_t}) \\rangle$, and it is easy to show that the interpretation in terms\nof RNNs is still valid since $K_k^{(n)}$ has the same sum structure as (2).\n\nMax pooling in RKHS. Alignment scores (e.g. Smith-Waterman) in molecular biology rely on\na max operation\u2014over the scores of all possible alignments\u2014to compute similarities between\nsequences. However, using max in a string kernel usually breaks positive definiteness, even though it\nseems to perform well in practice. To solve such an issue, sum-exponential is used as a proxy in [32],\nbut it leads to a diagonal dominance issue and makes SVM solvers unable to learn. For RKN, the sum\nin (3) can also be replaced by a max\n\n$$K_k^{\\max}(x, x') = \\left\\langle \\max_{\\mathbf{i} \\in \\mathcal{I}_{x,k}} \\lambda_{x,\\mathbf{i}} \\psi_k(x_{\\mathbf{i}}), \\max_{\\mathbf{j} \\in \\mathcal{I}_{x',k}} \\lambda_{x',\\mathbf{j}} \\psi_k(x'_{\\mathbf{j}}) \\right\\rangle, \\tag{11}$$\n\nwhich empirically seems to perform well, but breaks the kernel interpretation, as in [32]. The\ncorresponding recursion amounts to replacing all the sums in (7) by a max.\nAn alternative way to aggregate local features is the generalized max pooling (GMP) introduced in\n[28], which can be adapted to the context of RKHSs. Assuming that before pooling x is embedded\nto a set of N local features $(\\varphi_1, \\dots, \\varphi_N) \\in \\mathcal{H}^N$, GMP builds a representation $\\varphi_{\\mathrm{gmp}}$ whose inner-product\nwith all the local features $\\varphi_i$ is one: $\\langle \\varphi_i, \\varphi_{\\mathrm{gmp}} \\rangle_{\\mathcal{H}} = 1$, for $i = 1, \\dots, N$. 
$\\varphi_{\\mathrm{gmp}}$ coincides\nwith the regular max when each $\\varphi_i$ is an element of the canonical basis of a finite representation\u2014i.e.,\nassuming that at each position, a single feature has value 1 and all others are 0.\nSince GMP is defined by a set of inner-product constraints, it can be applied to our approximate\nkernel embeddings by solving a linear system. This is compatible with CKN but becomes intractable\nfor RKN, which pools across $|\\mathcal{I}_{x,k}|$ positions. Instead, we heuristically apply GMP over the set\n$\\psi_k(x_{1:t})$ for all t with $\\lambda_{x,\\mathbf{i}} = \\lambda^{|x| - i_1 - k + 1}$, which can be obtained from the RNN described in\nTheorem 1. This amounts to composing GMP with mean poolings obtained over each prefix of x.\nWe observe that it performs well in our experiments. More details are provided in Appendix D.\n\n4 Experiments\n\nWe evaluate RKN and compare it to typical string kernels and RNNs for protein fold recognition.\nPyTorch code is provided with the submission and additional details are given in Appendix E.\n\n4.1 Protein fold recognition on SCOP 1.67\n\nSequencing technologies provide access to gene and, indirectly, protein sequences for yet poorly\nstudied species. In order to predict the 3D structure and function from the linear sequence of these\nproteins, it is common to search for evolutionarily related ones, a problem known as homology\ndetection. When no evolutionarily related protein with known structure is available, a\u2014more difficult\u2014\nalternative is to resort to protein fold recognition. We evaluate our RKN on such a task, where the\nobjective is to predict which proteins share a 3D structure with the query [31].\nHere we consider the Structural Classification Of Proteins (SCOP) version 1.67 [29]. We follow\nthe preprocessing procedures of [10] and remove the sequences that are more than 95% similar,\nyielding 85 fold recognition tasks. 
Each positive training set is then extended with Uniref50 to\nmake the dataset more balanced, as proposed in [13]. The resulting dataset can be downloaded\nfrom http://www.bioinf.jku.at/software/LSTM_protein. The number of training samples\nfor each task is typically around 9,000 proteins, whose length varies from tens to thousands of\namino-acids. In all our experiments we use the logistic loss. We measure classification performance\nusing auROC and auROC50 scores (area under the ROC curve, and area under the ROC curve up to 50% false positives).\nFor CKN and RKN, we evaluate both one-hot encoding of amino-acids by 20-dimensional binary\nvectors and an alternative representation relying on the BLOSUM62 substitution matrix [12]. Specifically,\nin the latter case, we represent each amino-acid by the centered and normalized vector of\nits corresponding substitution probabilities with other amino-acids. The local alignment kernel (4),\nwhich we include in our comparison, natively uses BLOSUM62.\n\nHyperparameters. We follow the training procedure of CKN presented in [4]. Specifically, for\neach of the 85 tasks, we hold out one quarter of the training samples as a validation set and use it to\ntune \u03b1, the gap penalty \u03bb and the regularization parameter \u00b5 in the prediction layer. These parameters are\nthen fixed across datasets. RKN training also relies on the alternating strategy used for CKN: we use\nthe Adam algorithm to update anchor points, and the L-BFGS algorithm to optimize the prediction\nlayer. We train for 100 epochs on each dataset: the initial learning rate for Adam is fixed to 0.05 and is\nhalved whenever the validation loss has not decreased for 5 successive epochs. We fix k to 10 and\nthe number of anchor points q to 128, and use single-layer CKN and RKN throughout the experiments.\n\nImplementation details for unsupervised models. The anchor points for CKN and RKN are\nlearned by k-means on 30,000 extracted k-mers from each dataset. 
The resulting sequence representations\nare standardized by removing the mean and dividing by the standard deviation and are used within a\nlogistic regression classifier. The parameter \u03b1 of the Gaussian kernel and the parameter \u03bb are chosen based on validation\nloss and are fixed across the datasets. The regularization parameter \u00b5 is chosen by 5-fold cross-validation on\neach dataset. As before, we fix k to 10; the number of anchor points q is set to 1024. Note that the\nperformance could be improved with larger q as observed in [4], at a higher computational cost.\n\nComparisons and results. The results are shown in Table 1. The BLOSUM62 versions of CKN and\nRKN outperform all other methods. The improvement over the mismatch and LA kernels is likely\ncaused by end-to-end trained kernel networks learning a task-specific representation in the form of a\nsparse set of motifs, whereas data-independent kernels lead to learning a dense function over the set\nof descriptors. This difference can have a regularizing effect akin to the \u2113_1-norm in the parametric\nworld, by reducing the dimension of the learned linear function w while retaining relevant features\nfor the prediction task. GPkernel also learns motifs, but relies on the exact presence of discrete motifs.\nFinally, both LSTM and [20] are based on RNNs but are outperformed by kernel networks. The latter\nwas designed and optimized for NLP tasks and yields an auROC50 of 0.4 on this task.\nRKNs outperform CKNs, albeit not by a large margin. Interestingly, as the two kernels only differ\nin allowing gaps when comparing sequences, this result suggests that this aspect is not the\nmost important for identifying common foldings in a one-versus-all setting: as the learned function\ndiscriminates one fold from all others, it may rely on coarser features and not exploit more subtle ones\nsuch as gappy motifs. 
In particular, the advantage of the LA-kernel against its mismatch counterpart\nis more likely caused by other differences than gap modelling, namely using a max rather than a\nmean pooling of k-mer similarities across the sequence, and a general substitution matrix rather than\na Dirac function to quantify mismatches. Consistently, within kernel networks GMP systematically\noutperforms mean pooling, while being slightly behind max pooling.\nAdditional details and results, scatter plots, and pairwise tests between methods to assess the statistical\nsignificance of our conclusions are provided in Appendix E. Note that when k = 14, the auROC and\nauROC50 further increase to 0.877 and 0.636, respectively.\n\nTable 1: Average auROC and auROC50 for SCOP fold recognition benchmark. LA-kernel uses\nBLOSUM62 to compare amino acids which is a little different from our encoding approach. Details\nabout pairwise statistical tests between methods can be found in Appendix E.\n\nMethod              pooling  one-hot           BLOSUM62\n                             auROC    auROC50  auROC    auROC50\nGPkernel [10]       \u2013        0.844    0.514    \u2013        \u2013\nSVM-pairwise [23]   \u2013        0.724    0.359    \u2013        \u2013\nMismatch [22]       \u2013        0.814    0.467    \u2013        \u2013\nLA-kernel [32]      \u2013        \u2013        \u2013        0.834    0.504\nLSTM [13]           \u2013        0.830    0.566    \u2013        \u2013\nCKN-seq [4]         mean     0.827    0.536    0.843    0.563\nCKN-seq [4]         max      0.837    0.572    0.866    0.621\nCKN-seq             GMP      0.838    0.561    0.856    0.608\nCKN-seq (unsup) [4] mean     0.804    0.493    0.827    0.548\nRKN (\u03bb = 0)         mean     0.829    0.542    0.838    0.563\nRKN                 mean     0.829    0.541    0.840    0.571\nRKN (\u03bb = 0)         max      0.840    0.575    0.862    0.618\nRKN                 max      0.844    0.587    0.871    0.629\nRKN (\u03bb = 0)         GMP      0.840    0.563    0.855    0.598\nRKN                 GMP      0.848    0.570    0.852    0.609\nRKN (unsup)         mean     0.805    0.504    0.833    0.570\n\nTable 2: Classification accuracy for SCOP 2.06. 
The complete table with error bars can be found in\nAppendix E.\n\nMethod             #Params  Accuracy on SCOP 2.06    Level-stratified accuracy (top1/top5/top10)\n                            top 1   top 5   top 10   family             superfamily        fold\nPSI-BLAST          -        84.53   86.48   87.34    82.20/84.50/85.30  86.90/88.40/89.30  18.90/35.10/35.10\nDeepSF             920k     73.00   90.25   94.51    75.87/91.77/95.14  72.23/90.08/94.70  51.35/67.57/72.97\nCKN (128 filters)  211k     76.30   92.17   95.27    83.30/94.22/96.00  74.03/91.83/95.34  43.78/67.03/77.57\nCKN (512 filters)  843k     84.11   94.29   96.36    90.24/95.77/97.21  82.33/94.20/96.35  45.41/69.19/79.73\nRKN (128 filters)  211k     77.82   92.89   95.51    76.91/93.13/95.70  78.56/92.98/95.53  60.54/83.78/90.54\nRKN (512 filters)  843k     85.29   94.95   96.54    84.31/94.80/96.74  85.99/95.22/96.60  71.35/84.86/89.73\n\n4.2 Protein fold classification on SCOP 2.06\n\nWe further benchmark RKN in a fold classification task, following the protocols used in [15].\nSpecifically, the training and validation datasets are composed of 14699 and 2013 sequences from\nSCOP 1.75, belonging to 1195 different folds. The test set consists of 2533 sequences from SCOP\n2.06, after removing the sequences with similarity greater than 40% with SCOP 1.75. The input\nsequence feature is represented by a vector of 45 dimensions, consisting of a 20-dimensional one-hot\nencoding of the sequence, a 20-dimensional position-specific scoring matrix (PSSM) representing\nthe profile of amino acids, a 3-class secondary structure represented by a one-hot vector and a\n2-class solvent accessibility. We further normalize each type of the feature vectors to have unit\n\u2113_2-norm, which is done for each sequence position. More dataset details can be found in [15]. We\nuse mean pooling for both CKN and RKN models, as it is more stable during training for multi-class\nclassification. The other hyperparameters are chosen in the same way as previously. 
More details\nabout the hyperparameter search grid can be found in Appendix E.\nThe accuracy results are obtained by averaging over 10 different runs and are shown in Table 2, stratified\nby prediction difficulty (family/superfamily/fold, more details can be found in [15]). In contrast\nto what we observed on SCOP 1.67, RKN sometimes yields a large improvement over CKN for fold\nclassification, especially for detecting distant homologies. This suggests that accounting for gaps\ndoes help in some fold prediction tasks, at least in a multi-class context where a single function is\nlearned for each fold.\n\nAcknowledgments\n\nWe thank the anonymous reviewers for their insightful comments and suggestions. This work has\nbeen supported by the grants from ANR (FAST-BIG project ANR-17-CE23-0011-01), by the ERC\ngrant number 714381 (SOLARIS), and ANR 3IA MIAI@Grenoble Alpes.\n\nReferences\n[1] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey. Predicting the sequence specificities\nof DNA- and RNA-binding proteins by deep learning. Nature biotechnology, 33(8):831\u2013838,\n2015.\n\n[2] C. Angermueller, T. P\u00e4rnamaa, L. Parts, and O. Stegle. Deep learning for computational biology.\nMolecular Systems Biology, 12(7):878, 2016.\n\n[3] A. Auton, L. D. Brooks, R. M. Durbin, E. Garrison, H. M. Kang, J. O. Korbel, J. Marchini,\nS. McCarthy, G. McVean, and G. R. Abecasis. A global reference for human genetic variation.\nNature, 526:68\u201374, 2015.\n\n[4] D. Chen, L. Jacob, and J. Mairal. Biological sequence modeling with convolutional kernel\nnetworks. Bioinformatics, 35(18):3294\u20133302, 02 2019.\n\n[5] K. Cho, B. Van Merri\u00ebnboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and\nY. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine\ntranslation. In Conference on Empirical Methods in Natural Language Processing (EMNLP),\n2014.\n\n[6] Y. Cho and L. K. Saul. Kernel methods for deep learning. 
In Advances in Neural Information\nProcessing Systems (NIPS), 2009.\n\n[7] T. C. P.-G. Consortium. Computational pan-genomics: status, promises and challenges. Briefings\nin Bioinformatics, 19(1):118\u2013135, 10 2016.\n\n[8] L. Flagel, Y. Brandvain, and D. R. Schrider. The Unreasonable Effectiveness of Convolutional\nNeural Networks in Population Genetic Inference. Molecular Biology and Evolution, 36(2):220\u2013238,\n12 2018.\n\n[9] M. B. Giles. Collected matrix derivative results for forward and reverse mode algorithmic\ndifferentiation. In Advances in Automatic Differentiation, pages 35\u201344. Springer, 2008.\n\n[10] T. H\u00e5ndstad, A. J. Hestnes, and P. S\u00e6trom. Motif kernel generated by genetic programming\nimproves remote homology and fold detection. BMC bioinformatics, 8(1):23, 2007.\n\n[11] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of\nComputer Science, University of California, 1999.\n\n[12] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings\nof the National Academy of Sciences, 89(22):10915\u201310919, 1992.\n\n[13] S. Hochreiter, M. Heusel, and K. Obermayer. Fast model-based protein homology detection\nwithout alignment. Bioinformatics, 23(14):1728\u20131736, 2007.\n\n[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780,\n1997.\n\n[15] J. Hou, B. Adhikari, and J. Cheng. DeepSF: deep convolutional neural network for mapping\nprotein sequences to folds. Bioinformatics, 34(8):1295\u20131303, 12 2017.\n\n[16] E. J. Topol. High-performance medicine: the convergence of human and artificial intelligence.\nNature Medicine, 25, 01 2019.\n\n[17] T. S. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote\nprotein homologies. In Conference on Intelligent Systems for Molecular Biology (ISMB), 1999.\n\n[18] N. Kalchbrenner, E. 
Grefenstette, and P. Blunsom. A convolutional neural network for modelling\nsentences. In Association for Computational Linguistics (ACL), 2014.\n\n[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.\nJackel. Backpropagation applied to handwritten zip code recognition. Neural computation,\n1(4):541\u2013551, 1989.\n\n[20] T. Lei, W. Jin, R. Barzilay, and T. Jaakkola. Deriving neural architectures from sequence and\ngraph kernels. In International Conference on Machine Learning (ICML), 2017.\n\n[21] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein\nclassification. In Biocomputing, pages 564\u2013575. World Scientific, 2001.\n\n[22] C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for\ndiscriminative protein classification. Bioinformatics, 20(4):467\u2013476, 2004.\n\n[23] L. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines\nfor detecting remote protein evolutionary and structural relationships. Journal of computational\nbiology, 10(6):857\u2013868, 2003.\n\n[24] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification\nusing string kernels. Journal of Machine Learning Research (JMLR), 2:419\u2013444, 2002.\n\n[25] J. Mairal. End-to-End Kernel Learning with Supervised Convolutional Kernel Networks. In\nAdvances in Neural Information Processing Systems (NIPS), 2016.\n\n[26] S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. In\nInternational Conference on Learning Representations (ICLR), 2018.\n\n[27] A. Morrow, V. Shankar, D. Petersohn, A. Joseph, B. Recht, and N. Yosef. Convolutional kitchen\nsinks for transcription factor binding site prediction. arXiv preprint arXiv:1706.00125, 2017.\n\n[28] N. Murray and F. Perronnin. Generalized max pooling. 
In Proceedings of the IEEE Conference\non Computer Vision and Pattern Recognition (CVPR), 2014.\n\n[29] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification\nof proteins database for the investigation of sequences and structures. Journal of molecular\nbiology, 247(4):536\u2013540, 1995.\n\n[30] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural\nInformation Processing Systems (NIPS), 2008.\n\n[31] H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology detection and\nfold recognition. Bioinformatics, 21(23):4239\u20134247, 2005.\n\n[32] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment\nkernels. Bioinformatics, 20(11):1682\u20131689, 2004.\n\n[33] B. Sch\u00f6lkopf and A. J. Smola. Learning with kernels: support vector machines, regularization,\noptimization, and beyond. MIT Press, 2002.\n\n[34] B. Sch\u00f6lkopf, K. Tsuda, and J.-P. Vert. Kernel methods in computational biology. MIT Press,\nCambridge, Mass., 2004.\n\n[35] K. Tsuda, T. Kin, and K. Asai. Marginalized kernels for biological sequences. Bioinformatics,\n18(suppl_1):S268\u2013S275, 07 2002.\n\n[36] J.-P. Vert, H. Saigo, and T. Akutsu. Convolution and local alignment kernels. Kernel methods in\ncomputational biology, pages 131\u2013154, 2004.\n\n[37] C. Watkins. Dynamic alignment kernels. In Advances in Neural Information Processing Systems\n(NIPS), 1999.\n\n[38] C. K. Williams and M. Seeger. Using the Nystr\u00f6m method to speed up kernel machines. In\nAdvances in Neural Information Processing Systems (NIPS), 2001.\n\n[39] K. Zhang, I. W. Tsang, and J. T. Kwok. Improved Nystr\u00f6m low-rank approximation and error\nanalysis. In International Conference on Machine Learning (ICML), 2008.\n\n[40] Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural networks. 
In\nInternational Conference on Machine Learning (ICML), 2017.\n\n[41] J. Zhou and O. Troyanskaya. Predicting effects of noncoding variants with deep learning-based\nsequence model. Nature Methods, 12(10):931\u2013934, 2015.", "award": [], "sourceid": 7426, "authors": [{"given_name": "Dexiong", "family_name": "Chen", "institution": "Inria"}, {"given_name": "Laurent", "family_name": "Jacob", "institution": "CNRS"}, {"given_name": "Julien", "family_name": "Mairal", "institution": "Inria"}]}