{"title": "Measures of distortion for machine learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4886, "page_last": 4895, "abstract": "Given data from a general metric space, one of the standard machine learning pipelines is to first embed the data into a Euclidean space and subsequently apply out of the box machine learning algorithms to analyze the data. The quality of such an embedding is typically described in terms of a distortion measure. In this paper, we show that many of the existing distortion measures behave in an undesired way, when considered from a machine learning point of view. We investigate desirable properties of distortion measures and formally prove that most of the existing measures fail to satisfy these properties. These theoretical findings are supported by simulations, which for example demonstrate that existing distortion measures are not robust to noise or outliers and cannot serve as good indicators for classification accuracy. As an alternative, we suggest a new measure of distortion, called $\\sigma$-distortion. We can show both in theory and in experiments that it satisfies all desirable properties and is a better candidate to evaluate distortion in the context of machine learning.", "full_text": "Measures of distortion for machine learning\n\nLeena Chennuru Vankadara\n\nUniversity of T\u00fcbingen\n\nMax Planck Institute for Intelligent Systems, T\u00fcbingen\n\nleena.chennuru@tuebingen.mpg.de\n\nUlrike von Luxburg\nUniversity of T\u00fcbingen\n\nMax Planck Institute for Intelligent Systems, T\u00fcbingen\n\nluxburg@informatik.uni-tuebingen.de\n\nAbstract\n\nGiven data from a general metric space, one of the standard machine learning\npipelines is to \ufb01rst embed the data into a Euclidean space and subsequently apply\nmachine learning algorithms to analyze the data. The quality of such an embedding\nis typically described in terms of a distortion measure. 
In this paper, we show that many of the existing distortion measures behave in an undesired way when considered from a machine learning point of view. We investigate desirable properties of distortion measures and formally prove that most of the existing measures fail to satisfy these properties. These theoretical findings are supported by simulations, which for example demonstrate that existing distortion measures are not robust to noise or outliers and cannot serve as good indicators for classification accuracy. As an alternative, we suggest a new measure of distortion, called σ-distortion. We show both in theory and in experiments that it satisfies all desirable properties and is a better candidate to evaluate distortion in the context of machine learning.

1 Introduction

Given data from a general metric space, one of the standard machine learning pipelines is to first embed the data into a Euclidean space (for example, using an unsupervised algorithm such as Isomap, locally linear embedding, or maximum variance unfolding) and subsequently apply out-of-the-box machine learning algorithms to analyze the data. Typically, the quality of such an embedding is described in terms of a distortion measure that summarizes how the distances between the embedded points deviate from the original distances. Many distortion measures have been used in the past, the most prominent ones being worst case distortion, lq-distortion (Abraham, Bartal, and Neiman, 2011), average distortion (Abraham, Bartal, and Neiman, 2011), ε-distortion (Abraham, Bartal, Kleinberg, et al., 2005), k-local distortion (Abraham, Bartal, and Neiman, 2007) and scaling distortion (Abraham, Bartal, Kleinberg, et al., 2005).
Such distortion measures are sometimes computed in hindsight to evaluate the quality of an embedding, and sometimes used directly as objective functions in embedding algorithms, for example the stress functions that are commonly used in different variants of multidimensional scaling (Cox and Cox, 2000). There also exist embedding algorithms with completely different objectives. For instance, t-SNE (Maaten and Hinton, 2008) employs an objective function that aims to enhance the cluster structure present in the data. In this paper, however, we restrict our analysis to distortion measures that evaluate the quality of distance preserving embeddings.

From a theoretical computer science point of view, many aspects of distance preservation of embeddings are well understood. For example, Bourgain's theorem (Bourgain, 1985) and the Johnson–Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984) state that any finite metric space of n points can be embedded into a Euclidean space of dimension O(log n) with worst case distortion O(log n). Many related results exist (Gupta, Krauthgamer, and Lee, 2003; Abraham, Bartal, and Neiman, 2008; Abraham, Bartal, and Neiman, 2011; Abraham, Bartal, Kleinberg, et al., 2005; Abraham, Bartal, and Neiman, 2007; Semmes, 1996).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

However, from a machine learning point of view, these results are not entirely satisfactory. The typical distortion guarantees from theoretical computer science focus on a finite metric space. However, in machine learning, we are ultimately interested in consistency statements: given a sample of n points from some underlying space, we would like to measure the distortion of an embedding algorithm as n → ∞.
In particular, the dimension of the embedding space should be constant and not grow with\nn, because we want to relate the geometry of the original underlying space to the geometry of the\nembedding space. Hence, many of the guarantees that are nice from a theoretical computer science\npoint of view (for example, because they provide approximation guarantees for NP hard problems)\nmiss the point when applied to machine learning (either in theory or in practice, see below).\nIdeally, in machine learning we would like to use the distortion measure as an indication of the quality\nof an embedding. We would hope that when we compare several embeddings, choosing the one with\nsmaller distortion would lead to better machine learning results (at least in tendency). However, when\nwe empirically investigated the behavior of existing distortion measures, we were surprised to see\nthat they behave quite erratically and often do not serve this purpose at all (see Section 4).\nIn pursuit of a more meaningful measure of distortion in the context of machine learning, we take a\nsystematic approach in this paper. We identify a set of properties that are essential for any distortion\nmeasure. In light of these properties, we propose a new measure of distortion that is designed\ntowards machine learning applications: the \u03c3-distortion. We prove in theory and through simulations\nthat our new measure of distortion satis\ufb01es many of the properties that are important for machine\nlearning, while all the other measures of distortion have serious drawbacks and fail to satisfy all of the\nproperties. 
These results can be summarized in the following table (where each column corresponds to one measure of distortion and each row to one desirable property, see Section 2 for notation and definitions):

Property / Distortion measure     σ (sigma)   wc    avg (lq)   navg   k-local   ε (epsilon)
Translation invariance            yes         yes   yes        yes    yes       yes
Monotonicity                      yes         yes   no         yes    yes       yes
Scale invariance                  yes         yes   no         yes    yes       yes
Robustness to outliers            yes         no    yes        no     no        yes
Robustness to noise               yes         no    no         no     no        yes
Incorporation of probability      yes         no    no         no     no        no
Constant distortion embeddings    yes         no    yes        ?      yes       yes

2 Existing measures of distortion

Let (X, dX) and (Y, dY) be arbitrary finite metric spaces. Let $\binom{X}{2} := \{\{u, v\} \mid u, v \in X,\ u \neq v\}$, and for any n ∈ IN let [n] denote the set {1, 2, ..., n}. An embedding of X into Y is an injective mapping f : (X, dX) → (Y, dY). Let P be a probability distribution on X, and Π := P × P the product distribution on X × X. Distortion measures aim to quantify the deviation of an embedding from isometry. Intuitively, the distortion of such an embedding is supposed to measure how far the new distances dY(f(u), f(v)) between the embedded points deviate from the original distances dX(u, v). Virtually all the existing distortion measures are summary statistics of the pairwise ratios

\[ \rho_f(u, v) = d_Y(f(u), f(v)) / d_X(u, v) \]

with u, v ∈ X. The intention is to capture the property that if the ratios dY(f(u), f(v))/dX(u, v) are close to 1 for many pairs of points u, v, then the distortion is small.
The most popular measures of distortion are the following ones:

Worst case distortion: \[ \Phi_{wc}(f) := \Big( \max_{(u,v) \in \binom{X}{2}} \rho_f(u,v) \Big) \cdot \Big( \max_{(u,v) \in \binom{X}{2}} \frac{1}{\rho_f(u,v)} \Big). \]

Average case distortion: \[ \Phi_{avg}(f) := \frac{2}{n(n-1)} \sum_{(u,v) \in \binom{X}{2}} \rho_f(u,v). \]

Normalized avg distortion: \[ \Phi_{navg}(f) := \frac{2}{n(n-1)} \sum_{u \neq v \in X} \frac{\rho_f(u,v)}{\alpha} \quad \text{with } \alpha = \min_{u \neq v \in X} \rho_f(u,v). \]

lq-distortion (with 1 ≤ q < ∞): \[ \Phi_{l_q}(f) := \mathbb{E}_\Pi\big(\rho_f(u,v)^q\big)^{1/q}. \]

ε-distortion (for all 0 < ε < 1): \[ \Phi_\varepsilon(f) := \min_{S \subset \binom{X}{2},\ |S| \ge (1-\varepsilon)\binom{n}{2}} \Phi_{wc}(f_S), \] where $f_S$ denotes the restriction of f to S.

k-local distortion: \[ \Phi_{klocal}(f) := \Big( \max_{u \in X,\, v \in kNN(u),\, u \neq v} \rho_f(u,v) \Big) \cdot \Big( \max_{u \in X,\, v \in kNN(u),\, u \neq v} \frac{1}{\rho_f(u,v)} \Big), \] where kNN(u) denotes the set of k nearest neighbours of u.

The different measures of distortion put their focus on different aspects: the worst case among all pairs of points (Φwc), the worst case excluding pathological outliers (Φε), the average case (Φavg, Φnavg, Φlq), or distortions that are just evaluated between neighboring points (Φklocal). On a conceptual level, all these measures of distortion make sense, and it is not obvious why one should prefer one over the other. However, when we studied them in practice, we found different sources of undesired behavior for many of them. For example, many of them behave in a quite unstable or even erratic manner, due to high sensitivity to outliers or because they are not invariant with respect to rescaling.
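For concreteness, these summary statistics can be computed directly from the two pairwise distance matrices. The following is a minimal numpy sketch of our own (function names and the matrix-based interface are illustrative; Φε is omitted because it involves a minimization over subsets of pairs):

```python
import numpy as np

def ratios(DX, DY):
    """Pairwise ratios rho_f(u, v) = d_Y(f(u), f(v)) / d_X(u, v)
    over all pairs u < v, given the n x n distance matrices DX, DY."""
    iu = np.triu_indices(DX.shape[0], k=1)
    return DY[iu] / DX[iu]

def phi_wc(DX, DY):          # worst case distortion
    r = ratios(DX, DY)
    return r.max() * (1.0 / r).max()

def phi_avg(DX, DY):         # average case distortion
    return ratios(DX, DY).mean()

def phi_navg(DX, DY):        # normalized average distortion
    r = ratios(DX, DY)
    return (r / r.min()).mean()

def phi_lq(DX, DY, q=2):     # lq-distortion, uniform Pi over distinct pairs
    r = ratios(DX, DY)
    return np.mean(r ** q) ** (1.0 / q)

def phi_klocal(DX, DY, k=5): # k-local distortion over k-nearest-neighbour pairs
    n = DX.shape[0]
    hi, lo = 0.0, np.inf
    for u in range(n):
        nn = np.argsort(DX[u])[1:k + 1]   # k nearest neighbours of u (not u itself)
        r = DY[u, nn] / DX[u, nn]
        hi, lo = max(hi, r.max()), min(lo, r.min())
    return hi / lo
```

Note that multiplying DY by a constant multiplies phi_avg (and phi_lq) by that constant while leaving phi_wc, phi_navg and phi_klocal unchanged, matching the scale invariance pattern established in Section 3.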
To study these issues more systematically, we will now identify a set of\nproperties that any measure of distortion should satisfy in the context of machine learning applications.\nIn Section 3.2 we then prove which of the existing measures satis\ufb01es which properties and \ufb01nd that\neach of them has particular de\ufb01ciencies. In Section 3.3 we then introduce a new measure of distortion\nthat does not suffer from these issues, and demonstrate its practical behavior in Section 4.\n\n3 Properties of distortion measures\n\nIn this section we identify properties that a distortion measure is expected to satisfy in the context of\nmachine learning. In addition to basic properties such as invariance to rescaling and translation, the\nmost important properties should resonate with an appropriate characterization of the quality of an\nembedding. In the following, let (X, dX ) be an arbitrary metric space, let Y be an arbitrary vector\nspace and let dY be a homogeneous and translation invariant metric on Y (See the supplement for the\nformal de\ufb01nitions). Let f, g : (X, dX ) \u2192 (Y, dY ) be two embeddings and let \u03a6 be any function that\nis supposed to measure the distortion of any injective mapping from X to Y .\n\n3.1 De\ufb01nitions\n\nWe start with a set of basic properties that should be satis\ufb01ed by any function that is supposed to\nprovide a measure of distortion, irrespective of the context in which it is applied.\nScale Invariance is an essential property for a measure of distortion since embeddings that are\nmerely different in units of measurement (say, kilometers vs centimeters) should not be assigned\ndifferent values of distortion. Formally, let f : (X, dX ) \u2192 (Y, dY ) and g : (X, dX ) \u2192 (Y, dY ) be\ntwo injective mappings. 
A distortion measure Φ is said to be scale invariant if for any α ∈ ℝ,

\[ \forall u \in X,\ f(u) = \alpha g(u) \implies \Phi(f) = \Phi(g). \qquad (1) \]

Translation Invariance: A measure of distortion should clearly be invariant to translations. Let f : (X, dX) → (Y, dY) and g : (X, dX) → (Y, dY) be two injective mappings. A measure of distortion Φ is said to be translation invariant if for any y ∈ Y,

\[ \forall u \in X,\ f(u) = g(u) + y \implies \Phi(f) = \Phi(g). \qquad (2) \]

Monotonicity captures the property that if distances are preserved more strictly, then the distortion of the corresponding embedding should be smaller. The formal definition is a bit tricky, because one has to be careful about scaling issues. We take care of it by standardizing the embeddings such that the average of the ρ(u, v) is 1. Let f : (X, dX) → (Y, dY) and g : (X, dX) → (Y, dY) be embeddings. Define the scaling constants $\alpha(f) = \frac{2}{n(n-1)} \sum_{u \neq v \in X} \rho_f(u, v)$ and $\alpha(g) = \frac{2}{n(n-1)} \sum_{u \neq v \in X} \rho_g(u, v)$.
Then a measure of distortion Φ is said to be monotonic if for all u, v ∈ X,

\[ \Big( \frac{\rho_f(u,v)}{\alpha(f)} \le \frac{\rho_g(u,v)}{\alpha(g)} \le 1 \quad \text{or} \quad \frac{\rho_f(u,v)}{\alpha(f)} \ge \frac{\rho_g(u,v)}{\alpha(g)} \ge 1 \Big) \implies \Phi(f) \ge \Phi(g). \qquad (3) \]

After having introduced the basic properties that need to be satisfied such that a function Φ deserves the term "distortion", we now turn to some advanced properties that specifically identify the necessary characteristics of distortion measures in the context of machine learning applications.

Robustness to outliers: Outliers are inherent to data processed by machine learning algorithms, and hence a measure of distortion that is too volatile against outliers is not desirable. What we would like to achieve is rather that the influence of a single data point or a single distance value on the measure of distortion is very small. In the spirit of this interpretation, we create two test cases as necessary conditions to deem a measure of distortion robust to outliers.

Outliers in data: To verify that the effect of a single data point on the measure of distortion is small, we stipulate that the influence of this point should converge to 0 as the number n of points goes to infinity. To formulate this property, we compare an isometric embedding to an embedding that is "isometric except for one point". Formally, let I : (X, dX) → (X, dX) be an isometry. Fix arbitrary x0, x* ∈ X and β > 0. For any n ∈ IN, let Xn = {x1, x2, ..., xn} ⊂ X \ B(x0, β). Let fn : Xn ∪ {x0} → X such that

\[ f_n(x) = \begin{cases} x^*, & \text{if } x = x_0, \\ x, & \text{otherwise.} \end{cases} \qquad (4) \]

We say that a measure of distortion Φ is not robust to outliers if $\lim_{n \to \infty} \Phi(f_n) \neq \lim_{n \to \infty} \Phi(I_n)$, where In denotes the restriction of the mapping I to Xn ∪ {x0}. In the formal definition, one needs to make sure that the distortions do not grow arbitrarily fast, which can happen either if points in the original space are too close or if points in the image space are too far from each other. The ball of positive radius β prevents the first case, and the fact that we choose x* as a fixed point prevents the second case.

Outliers in distances: To evaluate whether a measure of distortion is robust to outliers in distances, we consider mappings for which at most a constant number K of distances are distorted and compare the resulting distortion measure to the one of an isometry. Formally, let I : (X, dX) → (X, dX) be an isometry. Let XD = {x1, x2, ...} ⊂ X. Let f : XD → X be an injective mapping such that there exists a constant K ∈ IN for which the set G = {(u, v) ∈ XD × XD : dX(f(u), f(v)) ≠ dX(u, v)} satisfies |G| ≤ K. For any n ∈ IN, let fn and In denote the restriction of the mappings f and I, respectively, to Xn = {x1, x2, ..., xn} ⊂ XD. We say that a measure of distortion Φ is not robust to outliers if $\lim_{n \to \infty} \Phi(f_n) \neq \lim_{n \to \infty} \Phi(I_n)$.

Incorporation of the probability distribution: In machine learning, a standard assumption is that the data has been sampled according to some probability distribution from an underlying space. A measure of distortion should be able to take this probability distribution into account, in the sense that distortions of points in high density regions should be more costly than distortions of points in low density regions.
We formalize this idea by stipulating that, given two different embeddings which are "isometric except for one point", where the two embeddings distort two different points such that the ratios of distorted distances are the same for both embeddings, the embedding that distorts the point that occurs with higher probability needs to have a higher value of distortion.

Let (X, dX) be an arbitrary metric space. Let Xn = {x1, x2, ..., xn} be a finite subset of X. Let P denote a probability distribution on Xn. Fix arbitrary x*, y* ∈ Xn such that P(x*) > P(y*). Let x′, y′ ∈ X such that ∀i ∈ [n], dX(xi, x′) = α dX(xi, x*) and dX(xi, y′) = α dX(xi, y*). Let f, g : Xn → X be two embeddings such that:

\[ f(x) = \begin{cases} x', & \text{if } x = x^*, \\ x, & \text{otherwise,} \end{cases} \qquad g(x) = \begin{cases} y', & \text{if } x = y^*, \\ x, & \text{otherwise.} \end{cases} \]

Then a measure of distortion Φ is said to incorporate the probability distribution P if Φ(f) > Φ(g).

Robustness to noise: Noisy observations, just as outliers, are common in machine learning. In machine learning applications we would expect that the measure of distortion is smaller if there is less noise on the data. For this property, we do not provide a formal definition. Rather, we conduct experiments to empirically verify whether this is the case in simple settings.

We believe that in order to be useful for machine learning, a measure of distortion should satisfy all the basic as well as the advanced properties.
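The outlier-in-data test formalized above can also be checked numerically. The sketch below is our own illustration (the point configuration, the displaced location x*, and the sample sizes are arbitrary choices): it embeds n sample points isometrically, displaces a single extra point x0, and compares worst case and average distortion. As n grows, the influence of the outlier on Φavg vanishes while Φwc remains inflated, matching the robustness results of Section 3.2.

```python
import numpy as np

def pair_ratios(DX, DY):
    """Ratios d_Y / d_X over all distinct pairs."""
    iu = np.triu_indices(DX.shape[0], k=1)
    return DY[iu] / DX[iu]

def outlier_test(n, rng):
    """Embed n uniform points in [1, 2]^2 isometrically, except that one
    extra point x0 (outside a ball of radius beta around the sample) is
    moved to a fixed distant location x_star. Returns (Phi_wc, Phi_avg)."""
    X = rng.uniform(1.0, 2.0, size=(n, 2))
    x0, x_star = np.array([[0.0, 0.0]]), np.array([[50.0, 50.0]])
    orig = np.vstack([x0, X])
    emb = np.vstack([x_star, X])          # f_n: only x0 is displaced
    DX = np.linalg.norm(orig[:, None] - orig[None, :], axis=-1)
    DY = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    r = pair_ratios(DX, DY)
    phi_wc = r.max() * (1.0 / r).max()
    phi_avg = r.mean()
    return phi_wc, phi_avg

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    wc, avg = outlier_test(n, rng)
    print(n, round(wc, 1), round(avg, 3))  # avg drifts toward 1; wc stays inflated
```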
We would like to conclude this list with a last property\nthat is perhaps not absolutely crucial, but nice to have: the ability to provide constant-dimensional\nembeddings.\nIn learning theory we often assume that we are given a set of data points that has been sampled\naccording to some underlying probability distribution, and then we are interested in consistency\nstatements: given a sample of n points, we study the behavior of algorithms as the sample size n goes\nto in\ufb01nity. In particular, when we consider embeddings we would hope that as the sample size grows,\nthe geometry of the embedded points \u201cconverges\u201d to something that is close to the \u201ctrue geometry\u201d of\nthe underlying space. In particular, if the underlying space is \u201csimple\u201d, we would like to embed into a\nEuclidean space of constant dimension \u2014 the dimension is not supposed to grow with the sample size\nbecause we would then have to deal with an in\ufb01nite-dimensional space in the limit case, which would\nallow for too complex geometries. For embeddings in a constant-dimensional space, we would then\nlike to bound the distortion, ideally by a quantity that is bounded by a constant that is independent\nof n and just depends on the geometry of the true underlying space. In general, it is impossible to\nguarantee the existence of an embedding into Euclidean space with constant dimension and constant\ndistortion (for all the standard measures of distortion, cf. (Semmes, 1999; Semmes, 1996; Abraham,\nBartal, and Neiman, 2011)). However, guarantees can be given if we make assumptions on the\nunderlying metric space. For example if (X, dX ) is doubling, it is possible to achieve an embedding\ninto constant dimensional Euclidean space with O(1) average distortion (but \u2126(log n) worst case\ndistortion) (Abraham, Bartal, and Neiman, 2011). 
Hence we stipulate that a measure of distortion that is nice for machine learning should satisfy that if the underlying geometry of the metric space is "nice" (according to an appropriate definition), then we can guarantee that there exists an embedding in constant dimension with constant distortion.

3.2 Theoretical results: existing distortion measures fail to satisfy all properties

In the following theorem we investigate which measure of distortion satisfies which of the properties mentioned above. All the proofs can be found in the appendix.

Theorem 1 (Properties of existing distortion measures). For all choices of the parameters 1 ≤ q < ∞, 0 < ε < 1, 1 ≤ k ≤ n, the following statements are true:

(a) Φwc, Φavg, Φnavg, Φlq, Φε and Φklocal satisfy the property of translation invariance.

(b) Φwc, Φnavg, Φε, Φklocal satisfy the properties of scale invariance and monotonicity. Φavg and Φlq fail to satisfy these properties.

(c) Φε, Φavg, Φlq satisfy the property of robustness to outliers. The measures Φwc, Φnavg, Φklocal fail to do so.

(d) The distortion measures Φwc, Φavg, Φnavg, Φlq, Φε, Φklocal fail to incorporate a probability distribution defined on the data space.

At the current point in time, we do not yet have a formal guarantee regarding robustness to noise. However, in our experiments we show that ε-distortion is considerably robust to noise for certain values of ε, while the rest of the distortion measures do not demonstrate robustness to noise. Regarding constant-dimensional embeddings, there exists a large literature.
In the case of average distortion, ε-distortion, and k-local distortion for fixed values of k, ε, it has been shown that any finite subset of a doubling metric space (see the supplement for a formal definition) can be embedded into a constant dimensional Euclidean space with O(1) distortion (Abraham, Bartal, and Neiman, 2011; Abraham, Bartal, and Neiman, 2009). Hence these measures of distortion also satisfy the "nice to have" property. Such a result for normalized average distortion does not exist in the literature, to the best of our knowledge. Worst case distortion, however, fails to satisfy this property (Semmes, 1999; Semmes, 1996).

3.3 A new measure of distortion that satisfies all properties

We have seen that all the existing measures of distortion fail to satisfy at least one of the properties that we identified above. In the light of these results, we introduce a new measure of distortion, the σ-distortion (Φσ). The intuition for our definition is as follows. For a given data set X, consider a histogram of the ratios ρf(u, v). An embedding of high quality should preserve most distances as well as possible, that is, we would like to see that most of these ratios are close to 1. We characterize this by measuring the "concentration" of the distribution of the ratio of distances ρf(u, v) in terms of the variance. The fact that we consider the variance of this distribution makes our definition robust against outliers (one of the properties which most of the other distortion measures fail to satisfy). By a rescaling step we will achieve invariance with respect to scale. Furthermore, we will see that all the other properties are also satisfied by our definition. Let Xn be a finite subset of X. Given a distribution P over Xn, let Π = P × P denote the distribution on the product space Xn × Xn. For any embedding f, let $\tilde{\rho}_f(u, v)$ denote the normalized ratio of distances given by

\[ \tilde{\rho}_f(u,v) := \frac{n(n-1)\, \rho_f(u,v)}{2 \sum_{(u,v) \in \binom{X}{2}} \rho_f(u,v)}. \qquad (5) \]

The σ-distortion is then defined as

\[ \Phi_\sigma(f) := \mathbb{E}_\Pi\big( (\tilde{\rho}_f(u,v) - 1)^2 \big). \]

If P is a uniform probability distribution over Xn, then the σ-distortion measures the variance of the distribution of the normalized ratio of distances $\tilde{\rho}_f(u, v)$.

Theorem 2 (Properties of σ-distortion). The σ-distortion (a) is invariant to scale and translations, (b) satisfies the property of monotonicity, (c) is robust to outliers in data and outliers in distances, and (d) incorporates a probability distribution into its evaluation.

In addition to satisfying all of the aforementioned properties, the proofs of Abraham, Bartal, and Neiman, 2011 can be extended to show that one can embed any finite subset of a doubling metric space into constant dimensional Euclidean space (or any lp space) with O(1) distortion. So the σ-distortion also satisfies the nice-to-have property regarding constant dimensional embeddings with bounded distortion. The formal guarantees are given in the following two theorems:

Theorem 3 (General metric spaces: embeddable with constant σ-distortion in log n dimensions). Given any finite sample Xn = {x1, x2, ..., xn} from an arbitrary metric space (X, dX) and a probability distribution P on Xn, for any 1 ≤ p < ∞ there exists an embedding f : Xn → l_p^D, where D = O(log n), with σ-distortion O(1).

Theorem 4 (Doubling metric spaces: embeddable with constant σ-distortion in constant dimensions).
Given any finite sample Xn = {x1, x2, ..., xn} from a doubling metric space (X, dX) and a probability distribution P on Xn, for any 1 ≤ p < ∞ there exists an embedding f : Xn → l_p^D, where D = O(1), with σ-distortion O(1).

4 Experiments

We evaluate the behavior of various distortion measures by conducting experiments in two different settings: 1) dimensionality reduction, and 2) a pipeline consisting of dimensionality reduction followed by classification. We start with simulated data for which we know all ground truth parameters. In order to generate datasets of dimension D, we sample each coordinate independently from a specified 1-dimensional distribution. Several different distributions, such as the Gaussian distribution, Gamma distribution, Beta distribution, Gaussian mixture distributions, and the Laakso space (Bartal, Gottlieb, and Neiman, 2015), with many different parameter settings have been used to conduct the experiments. Embeddings are then generated by various dimensionality reduction algorithms. In particular, we used Isomap (Tenenbaum, De Silva, and Langford, 2000), Maximum Variance Unfolding (Weinberger and Saul, 2006), Multidimensional Scaling, PCA (Hotelling, 1933), Probabilistic PCA (Tipping and Bishop, 1999), and Structure Preserving Embedding (Shaw and Jebara, 2009). All experiments have been repeated 10 times; the error bars in the plots depict the standard deviations over the different repetitions.

Embedding dimension vs distortion: For a dataset sampled from a Euclidean space with fixed dimension, it is natural to expect that any meaningful measure of distortion decreases with increasing embedding dimension. Intuitively, higher-dimensional spaces have more degrees of freedom to place points, and theoretical results confirm the general tendency (Chan, Gupta, and Talwar, 2010; Abraham, Bartal, and Neiman, 2008).
In Figure 1, we show that this behavior can also be verified experimentally for most measures of distortion, including σ-distortion, but not for average distortion. The failure of average distortion to conform to this trend is clearly because it is not invariant to scaling. To clarify, average distortion simply computes the average of the ratios of distances ρf(u, v). An embedding can deviate from isometry by either contracting the distances between pairs of points or expanding them relative to the original distances. An immediate consequence of scale invariance of a distortion measure is that it treats expansions and contractions symmetrically in the following sense: if an embedding expands all distances by a factor α, a scale invariant distortion would assign it the same value of distortion as an embedding that contracts them by α. Average distortion does not possess this property: it places undue emphasis on expansions and understates contractions. The balance between the weight of contractions and that of expansions determines the trend followed by average distortion, which is thereby unpredictable.

Distortion vs original dimension: For datasets generated from Euclidean spaces of increasing dimension, it is also natural to expect that, for a fixed embedding dimension, the quality of an embedding decreases with increasing original dimension. The intuition here is that the data gets more complex, but the embedding space does not have the capacity to reflect this.
To our surprise, this behavior cannot be observed for most of the traditional measures of distortion, as can be seen from Figures 2 and 3 (this observation was one of the starting points of this whole line of research). When looking closer into the data, we come to the conclusion that the reason for this failure is that these measures (except for average distortion, which shows erratic behaviour due to its dependence on the scale of the embedding) suffer from outliers, which disproportionately affect the distortion measures. There are only two measures of distortion that show the desired behavior: σ-distortion and ε-distortion. We attribute this to the fact that these two measures are robust to outliers (as has also been shown in our theoretical results). We can also see from the error bars in Figures 2 and 3 that the variability of the rest of the distortion measures is significantly larger compared to that of ε-distortion as well as σ-distortion. Again we attribute this behavior to the brittleness of the other distortion measures in the presence of outliers.

Figure 1: Embedding dimension vs various measures of distortion. From left to right: Φwc, Φavg, Φnavg, Φε for ε = 0.1, Φklocal for k = 5, Φσ. The color of each curve indicates the dimension of the original space, the x-axis the dimension of the embedding space. We can clearly see that for all but the average distortion, distortion decreases with embedding dimension. Data was generated according to a standard normal distribution of dimension as indicated by the color; embeddings have been generated using Isomap.
Results for other distributions and algorithms look similar.

Effect of noise: In order to test the effect of noise on various measures of distortion, we generated mixture of Gaussian data in R^2 similar to that of the previous experiment and added normally distributed noise in R^20 of increasing variance to the data to generate different datasets. Embeddings were then performed using various algorithms into R^2. In a first evaluation, we investigated whether the distortion increases with increasing noise. In Figure 4 (left) we can see that while σ-distortion clearly shows the desired trend, all other measures of distortion fail to show the correct behavior.

Figure 2: Original dimension vs measures of distortion. From left to right: Φwc, Φavg, Φnavg, Φε for ε = 0.1, Φklocal for k = 5, Φσ. The x-axis shows the dimension of the original space, the color of the curve corresponds to the dimension of the embedding space. Each curve corresponds to Isomap embeddings of data generated according to a gamma distribution (a = 1.5, b = 4) from Euclidean spaces of dimensions 10, 20, ..., 100. Results for other distributions and algorithms look similar.

Figure 3: Same setting as in Figure 2, but data generated according to a beta distribution (a = 0.75, b = 0.75).

In a second evaluation, we then performed classification on the embedded data. The corresponding SVM and kNN loss are plotted against the variance of the additive noise. Figure 4 clearly shows that the SVM and kNN classification loss increase with increasing variance of noise. This reiterates that the quality of the embedding indeed worsens with increasing additive noise.
We performed this experiment using different embedding algorithms (Isomap, PCA, MVU), and the plots in all the experiments paint the same picture.

Figure 4: Left: Variance of noise vs. distortion measures. The x-axis shows the variance of the noise. As the measures of distortion are not all in the same range, we added two y-axes: the left (blue) one for σ-distortion, and the right (red) one for the values of the other distortion measures. Right: Variance of noise vs. classification error. The x-axis shows the variance of the noise, the y-axis the classification error. All embeddings here are created using Isomap. The behavior for the other embedding algorithms is similar.

Distortion vs. classification accuracy: In this set of experiments, we want to investigate whether a measure of distortion is a good indicator of classification accuracy. To this end, we sampled data from various mixtures of Gaussians in R2 with different sets of parameters. Gaussian noise in R20 was then added to the data to generate various datasets. The datasets were then embedded into R2 using various embedding algorithms: PCA (Hotelling, 1933), GPLVM (Lawrence, 2004), Isomap (Tenenbaum, De Silva, and Langford, 2000), MVU (Weinberger and Saul, 2006), and SPE (Shaw and Jebara, 2009). Classification is performed on the resulting embeddings using kernel SVM (with an RBF kernel) and kNN classification algorithms.
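A minimal, self-contained sketch of this kind of pipeline is given below. It is our own simplification: PCA (via an SVD) stands in for the various embedding algorithms, a leave-one-out kNN classifier stands in for the SVM/kNN evaluation, and all parameters (class means, dimensions, noise levels) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n_per_class=100, noise_var=1.0):
    """Two Gaussians in R^2, padded to R^20 and corrupted by
    isotropic Gaussian noise of the given variance."""
    means = np.array([[-2.0, 0.0], [2.0, 0.0]])
    X2 = np.vstack([rng.normal(m, 1.0, size=(n_per_class, 2)) for m in means])
    y = np.repeat([0, 1], n_per_class)
    X = np.hstack([X2, np.zeros((2 * n_per_class, 18))])
    X += rng.normal(0.0, np.sqrt(noise_var), size=X.shape)
    return X, y

def pca_embed(X, d=2):
    """Project centered data onto its top-d principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def knn_loo_error(X, y, k=5):
    """Leave-one-out k-nearest-neighbor classification error."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # a point may not vote for itself
    nn = np.argsort(D, axis=1)[:, :k]
    pred = (y[nn].mean(axis=1) > 0.5).astype(int)
    return float((pred != y).mean())

# More noise should yield a worse embedding and a higher classification loss.
for noise_var in (0.25, 9.0):
    X, y = make_dataset(noise_var=noise_var)
    print(noise_var, knn_loo_error(pca_embed(X), y))
```

In the experiments above, the distortion of each embedding would additionally be computed and compared against this classification loss.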
In Figure 5, we plot the distortions incurred by the embeddings against the classification loss incurred by the classifier (where we sorted the outcomes of all simulations according to their resulting classification accuracy). Note that in this experiment, we compare the quality of embeddings across different embedding algorithms. The ideal behavior would be for distortion to increase with increasing classification loss (in such a case, a measure of distortion could be used to select the best embedding, for example). This setting encapsulates the idea of using a distortion measure as a means of evaluating the quality of an embedding in machine learning tasks. The plots clearly show that σ-distortion and ε-distortion consistently exhibit the expected increasing trend with classification loss, whereas the other measures of distortion fail to do so.

Figure 5: Classification error vs. distortion, for kNN (left) and SVM (right). x-axis: classification error; y-axis: distortion. As the measures of distortion are not all in the same range, we added two y-axes: the left (blue) one for σ-distortion, and the right (red) one for the values of the other distortion measures. Each curve corresponds to a distortion measure as indicated in the legend. The distortions are scaled appropriately for visualization.

5 Discussion

We investigate the properties of various measures of distortion for machine learning. Both in theory and in experiments we demonstrate that many of the existing measures of distortion behave in an undesired way: in simulations they show the wrong trade-off with respect to the dimension of the original space, they are not robust to noise or outliers, and they cannot serve as good indicators for classification accuracy.
As an alternative, we define a new measure of distortion, called σ-distortion. In a nutshell, it measures the variance of the pairwise distortion ratios (rather than a norm of the vector of these ratios). We show both in theory and in experiments that it satisfies all our desirable properties. There is only one existing measure of distortion that comes close to our new σ-distortion, namely ε-distortion. It explicitly excludes an ε fraction of outlier points from the distortion computation. It behaves nicely with respect to most properties as well, but it fails to take the probability measure into account. This matters because ε-distortion provides no guarantees on an ε fraction of the pairwise distances, which could be critical for a given machine learning task. Another drawback is that it has an important parameter to tune, the value of ε (the fraction of outliers), whereas σ-distortion has no parameter. Our work clearly shows the need to study measures of distortion from a more systematic point of view, both in theory and in practice.

Acknowledgments

This work has been supported by the Institutional Strategy of the University of Tübingen (Deutsche Forschungsgemeinschaft, DFG, ZUK 63) and the International Max Planck Research School for Intelligent Systems (IMPRS-IS).

References

[1] I. Abraham, Y. Bartal, J. Kleinberg, T.-H. Chan, O. Neiman, K. Dhamdhere, A. Slivkins, and A. Gupta, "Metric embeddings with relaxed guarantees", IEEE Symposium on Foundations of Computer Science, pp. 83–100, 2005.
[2] I. Abraham, Y. Bartal, and O. Neiman, "Advances in metric embedding theory", Advances in Mathematics, vol. 228, no. 6, pp. 3026–3126, 2011.
[3] I. Abraham, Y. Bartal, and O. Neiman, "Embedding metric spaces in their intrinsic dimension", ACM-SIAM Symposium on Discrete Algorithms, pp. 363–372, 2008.
[4] I. Abraham, Y. Bartal, and O. Neiman, "Local embeddings of metric spaces", ACM Symposium on Theory of Computing, pp. 631–640, 2007.
[5] I. Abraham, Y. Bartal, and O. Neiman, "On low dimensional local embeddings", ACM-SIAM Symposium on Discrete Algorithms, pp. 875–884, 2009.
[6] Y. Bartal, L. Gottlieb, and O. Neiman, "On the impossibility of dimension reduction for doubling subsets of lp", SIAM Journal on Discrete Mathematics, vol. 29, no. 3, pp. 1207–1222, 2015.
[7] J. Bourgain, "On Lipschitz embedding of finite metric spaces in Hilbert space", Israel Journal of Mathematics, vol. 52, no. 1, pp. 46–52, 1985.
[8] T. Chan, A. Gupta, and K. Talwar, "Ultra-low-dimensional embeddings for doubling metrics", Journal of the ACM, vol. 57, no. 4, 2010.
[9] T. Cox and M. Cox, Multidimensional Scaling, Chapman and Hall/CRC, 2000.
[10] A. Gupta, R. Krauthgamer, and J. Lee, "Bounded geometries, fractals, and low-distortion embeddings", Foundations of Computer Science, pp. 534–543, 2003.
[11] H. Hotelling, "Analysis of a complex of statistical variables into principal components", Journal of Educational Psychology, vol. 24, no. 6, p. 417, 1933.
[12] W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space", Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[13] N. Lawrence, "Gaussian process latent variable models for visualisation of high dimensional data", Neural Information Processing Systems, pp. 329–336, 2004.
[14] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE", Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[15] S. Semmes, "Bilipschitz embeddings of metric spaces into Euclidean spaces", Publicacions Matemàtiques, pp. 571–653, 1999.
[16] S. Semmes, "On the nonexistence of bilipschitz parameterizations and geometric problems about A-infinity weights", Revista Matemática Iberoamericana, vol. 12, no. 2, pp. 337–410, 1996.
[17] B. Shaw and T. Jebara, "Structure preserving embedding", International Conference on Machine Learning, pp. 937–944, 2009.
[18] J. Tenenbaum, V. De Silva, and J. Langford, "A global geometric framework for nonlinear dimensionality reduction", Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[19] M. Tipping and C. Bishop, "Probabilistic principal component analysis", Journal of the Royal Statistical Society: Series B, vol. 61, no. 3, pp. 611–622, 1999.
[20] K. Weinberger and L. Saul, "An introduction to nonlinear dimensionality reduction by maximum variance unfolding", Association for the Advancement of Artificial Intelligence (AAAI), vol. 6, pp. 1683–1686, 2006.