Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

Part of Advances in Neural Information Processing Systems 19 (NIPS 2006)

Bibtex Metadata Paper

Authors

Konrad Rieck, Pavel Laskov, Sören Sonnenburg

Abstract

We propose a generic algorithm for computation of similarity measures for se- quential data. The algorithm uses generalized suffix trees for efficient calculation of various kernel, distance and non-metric similarity functions. Its worst-case run-time is linear in the length of sequences and independent of the underlying embedding language, which can cover words, k-grams or all contained subse- quences. Experiments with network intrusion detection, DNA analysis and text processing applications demonstrate the utility of distances and similarity coeffi- cients for sequences as alternatives to classical kernel functions.