Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim Buhmann
Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence databases, visibly improving performance upon existing automatic methods.
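The abstract does not spell out the construction, so the following is only a minimal sketch of one way such a two-step procedure could look. It assumes that the embedding adds a constant to all off-diagonal dissimilarities so that the centered similarity matrix becomes positive semidefinite, and that denoising amounts to keeping the leading eigen-directions. The function name `constant_shift_embedding` and the parameter `n_components` are hypothetical, chosen for illustration.

```python
import numpy as np


def constant_shift_embedding(D, n_components=None):
    """Embed a symmetric, zero-diagonal dissimilarity matrix D as vectors.

    Illustrative sketch (assumption, not the authors' reference code):
    D is treated as squared dissimilarities that may violate metricity.
    Returns (X, shift), where the rows of X are the embedded points and
    `shift` is the constant added to every off-diagonal entry of D.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]

    # Centering projection Q = I - (1/n) * ones.
    Q = np.eye(n) - np.ones((n, n)) / n

    # Centered similarity matrix; it may have negative eigenvalues
    # when D is not a squared Euclidean distance matrix.
    S_c = -0.5 * Q @ D @ Q
    S_c = 0.5 * (S_c + S_c.T)  # symmetrize against rounding error

    # Shift the diagonal by the smallest eigenvalue so the matrix is PSD.
    lam_min = min(np.linalg.eigvalsh(S_c)[0], 0.0)
    S_shift = S_c - lam_min * np.eye(n)

    # Eigendecomposition, sorted by decreasing eigenvalue.
    evals, evecs = np.linalg.eigh(S_shift)
    order = np.argsort(evals)[::-1]
    evals = np.clip(evals[order], 0.0, None)
    evecs = evecs[:, order]

    # Optional denoising step (assumption): keep leading directions only.
    if n_components is not None:
        evals = evals[:n_components]
        evecs = evecs[:, :n_components]

    X = evecs * np.sqrt(evals)
    return X, -2.0 * lam_min


# Toy example: squared dissimilarities whose square roots (1, 1, 3)
# violate the triangle inequality.
D = np.array([[0.0, 1.0, 9.0],
              [1.0, 0.0, 1.0],
              [9.0, 1.0, 0.0]])
X, shift = constant_shift_embedding(D)

# Squared Euclidean distances between the embedded points.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
```

In this sketch, without truncation, the squared Euclidean distances between the rows of `X` reproduce `D` plus the same constant on every off-diagonal entry. A grouping cost function that is unaffected by such a constant shift would then assign objects to the same clusters on the original and on the embedded data. Passing `n_components` discards the trailing directions, which is one simple stand-in for the denoising step.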