Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
The paper is well written and clear; the different components of the RKN are introduced incrementally, with motivation for each. My main concern with the paper is a lack of originality and novelty. The RKN is based on the work in , which presents a much more general framework for deriving neural architectures from various types of sequence and graph kernels. This work adapts that neural encoding for sequence kernels by replacing the exact match between subsequences with a Gaussian kernel that accounts for vectorial representations of symbols (substitution matrices for amino acids). Note that the original work already mentions the possibility of replacing exact matches with dot products. The multilayer construction is also taken from . The authors introduce a number of efficiency refinements from the kernel literature, such as the Nyström approximation, mostly following another publication (ref ) that addresses the very problem presented in this work. Overall, I believe the differences with respect to  and  are limited and too application-specific to justify publication in a top machine learning venue. The most recent competitor presented in the experimental evaluation, apart from the work in , is from 2007. Is it possible that  is the only recent approach dealing with fold recognition? AFTER REBUTTAL: The authors have better clarified the novelty of their contribution with respect to the papers they build upon. I still believe the scope is somewhat limited, but I am not against the paper being accepted, provided the authors clarify these aspects in the revised version.
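For readers unfamiliar with the efficiency refinement mentioned above, here is a minimal NumPy sketch of the Nyström approximation: a kernel matrix is approximated from its similarities to a small set of anchor points. All names, the Gaussian kernel choice, and the anchor-selection strategy are illustrative, not taken from the paper under review.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nystrom_features(X, anchors, sigma=1.0):
    # Nystrom approximation: K ~ K_nm @ inv(K_mm) @ K_mn.
    # Returns explicit features F such that F @ F.T ~ K.
    m = len(anchors)
    K_mm = gaussian_kernel(anchors, anchors, sigma)
    K_nm = gaussian_kernel(X, anchors, sigma)
    # Inverse square root of K_mm (small jitter for numerical stability).
    w, V = np.linalg.eigh(K_mm + 1e-8 * np.eye(m))
    K_mm_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return K_nm @ K_mm_inv_sqrt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
anchors = X[:20]                 # e.g. a subsample of the data
F = nystrom_features(X, anchors) # 20-dimensional explicit features
K_approx = F @ F.T
K_exact = gaussian_kernel(X, X)
```

The approximation is exact on the anchor points themselves and degrades gracefully elsewhere; the RKN paper (following the publication the review refers to) learns the anchor points rather than subsampling them.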
In the paper titled "Recurrent Kernel Networks", the authors propose a new method to model gaps in comparative sequence analysis. The method, termed Recurrent Kernel Networks (RKNs), embeds sequences as points in an RKHS while leveraging the training strategy of RNNs. Overall, this is an innovative and well-organized paper. My only question: in the original CKN paper, CKNs were trained on DNA sequences (a task similar to DeepSEA); why did the authors not apply the RKN to the same task here?
The manuscript "Recurrent Kernel Networks" generalizes convolutional kernel networks to model gaps in sequences. It further introduces a new point of view on RNNs and proposes a new way to simulate max pooling in an RKHS. In general, the manuscript is well written and is a nice addition to the existing literature. It shows an interesting connection between kernel methods and RNNs, which could become quite significant.

In some passages it shows that the page limit reduces readability, since the authors often refer to the literature (, , ) instead of giving the main ideas of those papers. Including this information would help readers get a better picture of what was known before and how novel the contributions are.

It is somewhat difficult to evaluate the performance improvement of the new method since, e.g., with one-hot encoding the new methods do not produce a better auROC than a standard approach (GPkernel). Furthermore, in all experiments only one layer was used for CKNs and RKNs (line 296, page 8), which suggests the data set choice might not have been ideal for showing the performance improvement opportunities on large data sets. The data set is also quite old (SCOP 1.75 came out in 2009, and the data is based on SCOP 1.67). Update: The author response showed additional evaluations on newer/larger data, which substantiate the claim that the newly proposed method can also lead to performance improvements on relevant prediction problems (note: the GPkernel performances are missing in those evaluations and should be included in a final version if possible).

For the RKN, the generalized max pooling is implemented by a heuristic, since it would otherwise become intractable (line 261, page 7). It performed worse than max pooling, and its performance was only shown for \lambda = 0. It would be interesting to see the performance for other values of \lambda as well.
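To make the role of \lambda concrete: below is a brute-force sketch of gap-weighted k-mer counting, the kind of kernel the manuscript builds on. This enumeration is purely illustrative (the actual models use an efficient recursion, not this code), and the function name is hypothetical. With \lambda = 0, under the convention 0^0 = 1, only contiguous k-mers receive nonzero weight, which recovers an ungapped kernel.

```python
from collections import Counter
from itertools import combinations

def gapped_kmer_counts(seq, k, lam):
    """Weight each length-k subsequence of seq by lam ** (number of gaps).

    A subsequence at positions i_1 < ... < i_k spans a window of length
    i_k - i_1 + 1, so it contains (i_k - i_1 + 1 - k) gaps.  With lam = 0
    (and Python's convention 0 ** 0 == 1), only contiguous k-mers survive.
    """
    counts = Counter()
    for idx in combinations(range(len(seq)), k):
        gaps = idx[-1] - idx[0] + 1 - k  # gaps inside the match window
        counts["".join(seq[i] for i in idx)] += lam ** gaps
    return counts

# "AA" occurs in "ABA" only with one gap, so it vanishes at lam = 0:
c0 = gapped_kmer_counts("ABA", 2, 0.0)
c5 = gapped_kmer_counts("ABA", 2, 0.5)
```

Seeing how smoothly the weights interpolate between \lambda = 0 and \lambda > 0 is exactly why results for other values of \lambda would be informative.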
Update: Some new evaluations were added in the authors' response and promised for the updated appendix, solidifying the evaluations.

Minor points:

Page 2:
-line 84: the standard spectrum kernel measures the counts of occurrences of the k-mers, not just whether a k-mer is present, as stated in the manuscript.
-line 85: the standard mismatch kernel likewise measures the counts of occurrences of u up to a few mismatching letters, not just presence/absence, as stated in the manuscript.

Page 3:
-lines 96 and 132: the gaps function might confuse some readers. It does not show k explicitly but then uses it to compute gap costs; k should be made explicit in the call. It would also help not to introduce two different ways of writing the gap penalty function (lines 96, 132).
-line 101: the authors spend time introducing string kernels but put little emphasis on introducing the more recent work. It would, for example, help the reader to see the definition of a CKN.

Page 4:
-line 133: it is unclear where the 0^0 comes into play; it would also help to elaborate a little more on the statement of line 134.

Page 6:
-line 223: again, if you remove some parts at the beginning, you could mention more details here to show how similar the construction is.

Page 8:
-line 295: it would be good to see the stability of the performance for different choices of the fixed parameters (k, q, number of layers).
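The line-84/85 point, that these kernels count occurrences rather than record presence/absence, can be illustrated with a minimal spectrum-kernel sketch (helper names are mine, not from the manuscript):

```python
from collections import Counter

def spectrum_features(seq, k):
    # Count the occurrences of each k-mer (not just presence/absence).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k):
    # Inner product of the two k-mer count vectors.
    cs, ct = spectrum_features(s, k), spectrum_features(t, k)
    return sum(cs[u] * ct[u] for u in cs)

# "AA" occurs twice in "AAA", so counts matter:
k_counts = spectrum_kernel("AAA", "AAA", 2)  # 2 * 2 = 4 with counts
```

A presence/absence variant would give 1 here instead of 4, which is why the distinction is worth stating precisely in the manuscript.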