{"title": "Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization", "book": "Advances in Neural Information Processing Systems", "page_first": 1105, "page_last": 1112, "abstract": null, "full_text": "Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization\n\nMaxim Raginsky and Svetlana Lazebnik Beckman Institute, University of Illinois 405 N Mathews Ave, Urbana, IL 61801 {maxim,slazebni}@uiuc.edu\n\nAbstract\nWe introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.\n\n1. Introduction\nThe goal of nonlinear dimensionality reduction (NLDR) [1, 2, 3] is to find low-dimensional manifold descriptions of high-dimensional data. Most NLDR schemes require a good estimate of the intrinsic dimensionality of the data to be available in advance. A number of existing methods for estimating the intrinsic dimension (e.g., [3, 4, 5]) rely on the fact that, for data uniformly distributed on a d-dimensional compact smooth submanifold of R^D, the probability of a small ball of radius ε around any point on the manifold is Θ(ε^d). In this paper, we connect this argument with the notion of quantization dimension [6, 7], which relates the intrinsic dimension of a manifold (a topological property) to the asymptotic optimal quantization error for distributions on the manifold (an operational property). Quantization dimension was originally introduced as a theoretical tool for studying \"nonstandard\" signals, such as singular distributions [6] or fractals [7]. 
However, to the best of our knowledge, it has not been previously used for dimension estimation in manifold learning. The definition of quantization dimension leads to a family of dimensionality estimation algorithms, parametrized by the distortion exponent r ∈ [1, ∞), yielding in the limit r = ∞ a scheme equivalent to Kegl's recent technique based on packing numbers [4]. To date, many theoretical aspects of intrinsic dimensionality estimation remain poorly understood. For instance, while the estimator bias and variance are assessed either heuristically [4] or exactly [5], scant attention is paid to the robustness of each particular scheme against noise. Moreover, existing schemes do not fully utilize the potential for statistical consistency afforded by ergodicity of i.i.d. data: they compute the dimensionality estimate from a fixed training sequence (typically, the entire dataset of interest), whereas we show that an independent test sequence is necessary to avoid overfitting. In addition, the framework of high-rate vector quantization allows us to analyze the performance of our scheme in the presence of noise.\n\n2. Quantization-based estimation of intrinsic dimension\nLet us begin by introducing the definitions and notation used in the rest of the paper. A D-dimensional k-point vector quantizer [6] is a measurable map Q_k : R^D → C, where C = {y_1, ..., y_k} ⊂ R^D is called the codebook and the y_i's are called the codevectors. The number log_2 k is called the rate of the quantizer, in bits per vector. The sets R_i = {x ∈ R^D : Q_k(x) = y_i}, 1 ≤ i ≤ k, are called the quantizer cells (or partition regions). The quantizer performance on a random vector X distributed according to a probability distribution μ (denoted X ∼ μ) is measured by the average rth-power distortion δ_r(Q_k|μ) = E‖X − Q_k(X)‖^r, r ∈ [1, ∞), where ‖·‖ is the Euclidean norm on R^D. In the sequel, we will often find it more convenient to work with the quantizer error e_r(Q_k|μ) = δ_r(Q_k|μ)^{1/r}. 
Let Q_k denote the set of all D-dimensional k-point quantizers. Then the performance achieved by an optimal k-point quantizer on X is δ_r(k|μ) = inf_{Q_k ∈ Q_k} δ_r(Q_k|μ), or equivalently e_r(k|μ) = δ_r(k|μ)^{1/r}.\n2.1. Quantization dimension\nThe dimensionality estimation method presented in this paper exploits the connection between the intrinsic dimension d of a smooth compact manifold M ⊂ R^D (from now on, simply referred to as a \"manifold\") and the asymptotic optimal quantization error for a regular probability distribution1 μ on M. When the quantizer rate is high, the partition cells can be well approximated by D-dimensional balls around the codevectors. Then the regularity of μ ensures that the probability of such a ball of radius ε is Θ(ε^d), and it can be shown [7, 6] that e_r(k|μ) = Θ(k^{−1/d}). This is referred to as the high-rate (or high-resolution) approximation, and motivates the definition of the quantization dimension of order r:\n\nd_r(μ) = − lim_{k→∞} log k / log e_r(k|μ).\n\nThe theory of high-rate quantization confirms that, for a regular μ supported on the manifold M, d_r(μ) exists for all 1 ≤ r ≤ ∞ and equals the intrinsic dimension of M [7, 6]. (The r = ∞ limit will be treated in Sec. 2.2.) This definition immediately suggests an empirical procedure for estimating the intrinsic dimension of a manifold from a set of samples. Let X^n = (X_1, ..., X_n) be n i.i.d. samples from an unknown regular distribution μ on the manifold. We also fix some r ∈ [1, ∞). Briefly, we select a range k_1 ≤ k ≤ k_2 of codebook sizes for which the high-rate approximation holds (see Sec. 3 for implementation details), and design a sequence of quantizers {Q̂_k}, k_1 ≤ k ≤ k_2, that give us good approximations ê_r(k|μ) to the optimal error e_r(k|μ) over the chosen range of k. Then an estimate of the intrinsic dimension is obtained by plotting log k vs. −log ê_r(k|μ) and measuring the slope of the plot over the chosen range of k (because the high-rate approximation holds, the plot is linear). 
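The slope-measuring step above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function name and the synthetic error curve are ours, and the quantizer errors are assumed to already follow the high-rate law.

```python
import math

def dimension_estimate(ks, errors):
    """Estimate intrinsic dimension as the slope of log k vs. -log e_r(k|mu),
    via ordinary least squares, over a range of codebook sizes where the
    high-rate approximation e_r(k|mu) ~ c * k**(-1/d) holds."""
    xs = [-math.log(e) for e in errors]
    ys = [math.log(k) for k in ks]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic sanity check: quantizer errors drawn from the ideal high-rate
# law e(k) = 3.0 * k**(-1/2), i.e., intrinsic dimension d = 2.
ks = list(range(32, 257, 16))
errors = [3.0 * k ** (-0.5) for k in ks]
d_hat = dimension_estimate(ks, errors)
```

On real data the errors would come from empirically designed quantizers, so the slope is only approximately linear and is measured over the range of k where the fit residual is small.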
This method hinges on estimating the optimal errors e_r(k|μ) reliably. Let us explain how this can be achieved. The ideal quantizer for each k should minimize the training error\n\ne_r(Q_k|μ_train) = ( (1/n) Σ_{i=1}^{n} ‖X_i − Q_k(X_i)‖^r )^{1/r},\n\nwhere μ_train is the corresponding empirical distribution. However, finding this empirically optimal quantizer is, in general, an intractable problem, so in practice we merely strive to produce a quantizer Q̂_k whose error e_r(Q̂_k|μ_train) is a good approximation to the minimal empirical error e_r(k|μ_train) = inf_{Q_k ∈ Q_k} e_r(Q_k|μ_train) (the issue of quantizer design is discussed in Sec. 3). However, while minimizing the training error is necessary for obtaining a statistically consistent approximation to an optimal quantizer for μ, the training error itself is an optimistically biased estimate of e_r(k|μ) [8]: intuitively, this is due to the fact that an empirically designed quantizer overfits the training set. A less biased estimate is given by the performance of Q̂_k on a test sequence independent from the training set. Let Z^m = (Z_1, ..., Z_m) be m i.i.d. samples from μ, independent from X^n. Provided m is sufficiently large, the law of large numbers guarantees that the empirical average\n\ne_r(Q̂_k|μ_test) = ( (1/m) Σ_{i=1}^{m} ‖Z_i − Q̂_k(Z_i)‖^r )^{1/r}\n\nwill be a good estimate of the test error e_r(Q̂_k|μ).\n\n1 A probability distribution μ on R^D is regular of dimension d [6] if it has compact support and if there exist constants c, ε_0 > 0, such that c^{−1} ε^d ≤ μ(B(a, ε)) ≤ c ε^d for all a ∈ supp(μ) and all ε ∈ (0, ε_0), where B(a, ε) is the open ball of radius ε centered at a. If M ⊂ R^D is a d-dimensional smooth compact manifold, then any μ with M = supp(μ) that possesses a smooth, strictly positive density w.r.t. the normalized surface measure on M is regular of dimension d. 
Using learning-theoretic formalism [8], one can show that the test error of an empirically optimal quantizer is a strongly consistent estimate of e_r(k|μ), i.e., it converges almost surely to e_r(k|μ) as n → ∞. Thus, we take ê_r(k|μ) = e_r(Q̂_k|μ_test). In practice, therefore, the proposed scheme is statistically consistent to the extent that Q̂_k is close to the optimum.\n2.2. The r = ∞ limit and packing numbers\nIf the support of μ is compact (which is the case with all probability distributions considered in this paper), then the limit e_∞(Q_k|μ) = lim_{r→∞} e_r(Q_k|μ) exists and gives the \"worst-case\" quantization error of X by Q_k:\n\ne_∞(Q_k|μ) = max_{x ∈ supp(μ)} ‖x − Q_k(x)‖.\n\nThe optimum e_∞(k|μ) = inf_{Q_k ∈ Q_k} e_∞(Q_k|μ) has an interesting interpretation as the smallest covering radius of the most parsimonious covering of supp(μ) by k or fewer balls of equal radii [6]. Let us describe how the r = ∞ case is equivalent to dimensionality estimation using packing numbers [4]. The covering number N_M(ε) of a manifold M ⊂ R^D is defined as the size of the smallest covering of M by balls of radius ε > 0, while the packing number P_M(ε) is the cardinality of the maximal set S ⊂ M with ‖x − y‖ ≥ ε for all distinct x, y ∈ S. If d is the dimension of M, then N_M(ε) = Θ(ε^{−d}) for small enough ε, leading to the definition of the capacity dimension: d_cap(M) = − lim_{ε→0} log N_M(ε) / log ε. If this limit exists, then it equals the intrinsic dimension of M. Alternatively, Kegl [4] suggests using the easily proved inequality N_M(ε) ≤ P_M(ε) ≤ N_M(ε/2) to express the capacity dimension in terms of packing numbers as d_cap(M) = − lim_{ε→0} log P_M(ε) / log ε. Now, a simple geometric argument shows that, for any μ supported on M, P_M(e_∞(k|μ)) > k [6]. On the other hand, N_M(e_∞(k|μ)) ≤ k, which implies that P_M(2e_∞(k|μ)) ≤ k. Let {ε_k} be a sequence of positive reals converging to zero, such that ε_k = e_∞(k|μ). Let k_0 be such that log ε_k < 0 for all k ≥ k_0. 
Then it is not hard to show that\n\nlog P_M(2ε_k) / (−log 2ε_k) − 1 < −log k / log e_∞(k|μ) < log P_M(ε_k) / (−log ε_k), for all k ≥ k_0.\n\nIn other words, there exists a decreasing sequence {ε_k}, such that for sufficiently large values of k (i.e., in the high-rate regime) the ratio −log k / log e_∞(k|μ) can be approximated increasingly finely both from below and from above by quantities involving the packing numbers P_M(ε_k) and P_M(2ε_k) and converging to the common value d_cap(M). This demonstrates that the r = ∞ case of our scheme is numerically equivalent to Kegl's method based on packing numbers. For a finite training set, the r = ∞ case requires us to find an empirically optimal k-point quantizer w.r.t. the worst-case error -- a task that is much more computationally complex than for the r = 2 case (see Sec. 3 for details). In addition to computational efficiency, other important practical considerations include sensitivity to sampling density and noise. In theory, this worst-case quantizer is completely insensitive to variations in sampling density, since the optimal error e_∞(k|μ) is the same for all μ with the same support. However, this advantage is offset in practice by the increased sensitivity of the r = ∞ scheme to noise, as explained next.\n2.3. Estimation with noisy data\nRandom noise transforms \"clean\" data distributed according to μ into \"noisy\" data distributed according to some other distribution ν. This will cause the empirically designed quantizer to be matched to the noisy distribution ν, whereas our aim is to estimate optimal quantizer performance on the original clean data. To do this, we make use of the rth-order Wasserstein distance [6] between μ and ν, defined as ρ_r(μ, ν) = inf_{X∼μ, Y∼ν} (E‖X − Y‖^r)^{1/r}, r ∈ [1, ∞), where the infimum is taken over all pairs (X, Y) of jointly distributed random variables with the respective marginals μ and ν. 
It is a natural measure of quantizer mismatch, i.e., the difference in performance that results from using a quantizer matched to ν on data distributed according to μ [9]. Let ν_n denote the empirical distribution of n i.i.d. samples of ν. It is possible to show (details omitted for lack of space) that for an empirically optimal k-point quantizer Q*_{k,r} trained on n samples of ν,\n\n|e_r(Q*_{k,r}|ν) − e_r(k|μ)| ≤ 2ρ_r(ν_n, ν) + ρ_r(μ, ν).\n\nMoreover, ν_n converges to ν in the Wasserstein sense [6]: lim_{n→∞} ρ_r(ν_n, ν) = 0. Thus, provided the training set is sufficiently large, the distortion estimation error is controlled by ρ_r(μ, ν). Consider the case of isotropic additive Gaussian noise. Let W be a D-dimensional zero-mean Gaussian with covariance matrix K = σ^2 I_D, where I_D is the D × D identity matrix. The noisy data are described by the random variable Y = X + W, and\n\nρ_r(μ, ν) ≤ σ√2 (Γ((r + D)/2) / Γ(D/2))^{1/r},\n\nwhere Γ is the gamma function. In particular, ρ_2(μ, ν) ≤ σ√D. The magnitude of the bound, and hence the worst-case sensitivity of the estimation procedure to noise, is controlled by the noise variance, by the extrinsic dimension, and by the distortion exponent. The factor involving the gamma functions grows without bound both as D → ∞ and as r → ∞, which suggests that the susceptibility of our algorithm to noise increases with the extrinsic dimension of the data and with the distortion exponent.\n\n3. Experimental results\nWe have evaluated our quantization-based scheme for two choices of the distortion exponent, r = 2 and r = ∞. For r = 2, we used the k-means algorithm to design the quantizers. For r = ∞, we have implemented a Lloyd-type algorithm, which alternates two steps: (1) the minimum-distortion encoder, where each sample X_i is mapped to its nearest neighbor in the current codebook, and (2) the centroid decoder, where the center of each region is recomputed as the center of the minimum enclosing ball of the samples assigned to that region. 
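The two alternating steps can be sketched as follows. This is our own minimal illustration under stated assumptions: the paper uses an exact randomized minimum-enclosing-ball solver (Welzl's algorithm), while this sketch substitutes the simpler Badoiu-Clarkson iterative approximation; all names are ours.

```python
import math

def approx_center(points, iters=100):
    """Approximate 1-center (minimum enclosing ball center) via the
    Badoiu-Clarkson scheme: repeatedly step toward the current farthest
    point with shrinking step size.  A stand-in for an exact solver."""
    c = list(points[0])
    for i in range(1, iters + 1):
        far = max(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, c)))
        c = [a + (b - a) / (i + 1) for a, b in zip(c, far)]
    return c

def lloyd_max_error(samples, codebook, steps=10):
    """Lloyd-type alternation for the r = infinity (worst-case) error:
    (1) nearest-codevector encoder, (2) 1-center decoder per cell.
    Returns the final worst-case error and the codebook."""
    for _ in range(steps):
        cells = [[] for _ in codebook]
        for x in samples:
            j = min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))
            cells[j].append(x)
        codebook = [approx_center(c) if c else y for c, y in zip(cells, codebook)]
    err = max(min(math.dist(x, y) for y in codebook) for x in samples)
    return err, codebook

# Two well-separated pairs on a line: optimal 2-point worst-case error is 0.5.
samples = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
err, cb = lloyd_max_error(samples, [(0.0, 0.0), (10.0, 0.0)])
```

As in the standard Lloyd algorithm, each step can only reduce (or leave unchanged) the objective, here the largest distance of any sample from its codevector.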
It is clear that the decoder step locally minimizes the worst-case error (the largest distance of any sample from the center). Using a simple randomized algorithm, the minimum enclosing ball can be found in O((D + 1)!(D + 1)N) time, where N is the number of samples in the region [10]. Because of this dependence on D, the running time of the Lloyd algorithm becomes prohibitive in high dimensions, and even for D < 10 it is an order of magnitude slower than k-means. Thus, we were compelled to also implement a greedy algorithm reminiscent of Kegl's algorithm for estimating the packing number [4]: supposing that k − 1 codevectors have already been selected, the kth one is chosen to be the sample point with the largest distance from the nearest codevector. Because this is the point that gives the worst-case error for codebook size k − 1, adding it to the codebook lowers the error. We generate several codebooks, initialized with different random samples, and then choose the one with the smallest error. For the experiment shown in Figure 3, the training error curves produced by this greedy algorithm were on average 21% higher than those of the Lloyd algorithm, but the test curves were only 8% higher. In many cases, the two test curves are visually almost coincident (Figure 1). Therefore, in the sequel, we report only the results for the greedy algorithm for the r = ∞ case.\n\nFigure 1: Training and test error vs. codebook size on the swiss roll (Figure 2 (a)). Dashed line: r = 2 (k-means), dash-dot: r = ∞ (Lloyd-type), solid: r = ∞ (greedy).\n\nFigure 2: (a) The swiss roll (20,000 samples). (b) Plot of rate vs. negative log of the quantizer error (log-log curves), together with parametric curves fitted using linear least squares (see text). (c) Slope (dimension) estimates: 1.88 (training) and 2.04 (test). (d) Toroidal spiral (20,000 samples). (e) Log-log curves, exhibiting two distinct linear parts. (f) Dimension estimates: 1.04 (training), 2.02 (test) in the low-rate region; 0.79 (training), 1.11 (test) in the high-rate region.\n\nOur first synthetic dataset (Fig. 2 (a)) is the 2D \"swiss roll\" embedded in R^3 [2]. We split the samples into 4 equal parts and use each part in turn for training and the rest for testing. This cross-validation setup produces four sets of error curves, which we average to obtain an improved estimate. We sample quantizer rates in increments of 0.1 bits. The lowest rate is 5 bits, and the highest rate is chosen as log(n/2), where n is the size of the training set. The high-rate approximation suggests the asymptotic form Θ(k^{−1/d}) for the quantizer error as a function of codebook size k. To validate this approximation, we use linear least squares to fit curves of the form a + b k^{−1/2} to the r = 2 training and test distortion curves for the swiss roll. The fitting procedure yields estimates of −0.22 + 29.70 k^{−1/2} and 0.10 + 28.41 k^{−1/2} for the training and test curves, respectively. These estimates fit the observed data well, as shown in Fig. 2(b), a plot of rate vs. the negative logarithm of the training and test error (\"log-log curves\" in the following). 
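The least-squares fit of a + b k^{-1/d} described above is linear in the transformed variable t = k^{-1/d}, so it has a closed form. A minimal sketch (the function name is ours, not from the paper):

```python
def fit_high_rate(ks, errors, d=2):
    """Fit e(k) ~ a + b * k**(-1/d) by ordinary least squares in the
    transformed variable t = k**(-1/d) (closed-form 2-parameter OLS)."""
    ts = [k ** (-1.0 / d) for k in ks]
    n = len(ts)
    mt, me = sum(ts) / n, sum(errors) / n
    b = sum((t - mt) * (e - me) for t, e in zip(ts, errors)) / \
        sum((t - mt) ** 2 for t in ts)
    a = me - b * mt
    return a, b

# Sanity check against the parametric form reported for the swiss roll
# test curve: errors generated exactly as 0.10 + 28.41 * k**(-1/2).
ks = [2 ** p for p in range(5, 12)]
errors = [0.10 + 28.41 * k ** (-0.5) for k in ks]
a, b = fit_high_rate(ks, errors)
```

The sign of the recovered additive constant a then distinguishes the optimistic training curve (a < 0) from the pessimistic test curve (a > 0), as discussed next.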
Note that the additive constant for the training error is negative, reflecting the fact that the training error of the empirical quantizer is identically zero when n = k (each sample becomes a codevector). On the other hand, the test error has a positive additive constant as a consequence of quantizer suboptimality. Significantly, the fit deteriorates as n/k → 1, as the average number of training samples per quantizer cell becomes too small to sustain the slow decay required for the high-rate approximation. Fig. 2(c) shows the slopes of the training and test log-log curves, obtained by fitting a line to each successive set of 10 points. These slopes are, in effect, rate-dependent dimensionality estimates for the dataset. Note that the training slope is always below the test slope; this is a consequence of the \"optimism\" of the training error and the \"pessimism\" of the test error (as reflected in the additive constants of the parametric fits). The shapes of the two slope curves are typical of many \"well-behaved\" datasets. At low rates, both the training and the test slopes are close to the extrinsic dimension, reflecting the global geometry of the dataset. As the rate increases, the local manifold structure is revealed, and the slope yields its intrinsic dimension. However, as n/k → 1, the quantizer begins to \"see\" isolated samples instead of the manifold structure. Thus, the training slope begins to fall to zero, and the test slope rises, reflecting the failure of the quantizer to generalize to the test set. For most datasets in our experiments, a good intrinsic dimensionality estimate is given by the first minimum of the test slope where the line-fitting residual is sufficiently low (marked by a diamond in Fig. 2(c)). For completeness, we also report the slope of the training curve at the same rate (note that the training curve may not have local minima because of its tendency to fall as the rate increases). 
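The rate-dependent estimates of Fig. 2(c) come from fitting a line to each successive window of log-log points. A minimal sketch of that sliding-window slope computation (our own naming; window size 10 as in the text):

```python
import math

def sliding_slopes(ks, errors, window=10):
    """Rate-dependent dimension estimates: fit a line (OLS) to each
    successive window of (-log e, log k) points and report its slope."""
    xs = [-math.log(e) for e in errors]
    ys = [math.log(k) for k in ks]
    slopes = []
    for i in range(len(ks) - window + 1):
        wx, wy = xs[i:i + window], ys[i:i + window]
        mx, my = sum(wx) / window, sum(wy) / window
        s = sum((x - mx) * (y - my) for x, y in zip(wx, wy)) / \
            sum((x - mx) ** 2 for x in wx)
        slopes.append(s)
    return slopes

# On errors following an exact d = 3 power law, every window recovers slope 3.
ks = list(range(40, 400, 20))
errors = [2.5 * k ** (-1.0 / 3) for k in ks]
slopes = sliding_slopes(ks, errors, window=10)
```

On real error curves the slopes vary with rate, and the dimensionality estimate is read off at the first low-residual minimum of the test-slope curve.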
Interestingly, some datasets yield several well-defined dimensionality estimates at different rates. Fig. 2(d) shows a toroidal spiral embedded in R^3, which at larger scales \"looks\" like a torus, while at smaller scales the 1D curve structure becomes more apparent. Accordingly, the log-log plot of the test error (Fig. 2(e)) has two distinct linear parts, yielding dimension estimates of 2.02 and 1.11, respectively (Fig. 2(f)). Recall from Sec. 2.1 that the high-rate approximation for regular probability distributions is based on the assumption that the intersection of each quantizer cell with the manifold is a d-dimensional neighborhood of that manifold. Because we compute our dimensionality estimate at a rate for which this approximation is valid, we know that the empirically optimal quantizer at this rate partitions the data into clusters that are locally d-dimensional. Thus, our dimensionality estimation procedure is also useful for finding a clustering of the data that respects the intrinsic neighborhood structure of the manifold from which it is sampled. As an example, for the toroidal spiral of Fig. 2(d), we obtain two distinct dimensionality estimates of 2 and 1 at rates 6.6 and 9.4, respectively (Fig. 2(f)). Accordingly, quantizing the spiral at the lower (resp. higher) rate yields clusters that are locally two-dimensional (resp. one-dimensional). To ascertain the effect of noise and extrinsic dimension on our method, we have embedded the swiss roll in dimensions 4 to 8 by zero-padding the coordinates and applying a random orthogonal matrix, and added isotropic zero-mean Gaussian noise in the high-dimensional space, with σ = 0.2, 0.4, ..., 1.0. First, we have verified that the r = 2 estimator behaves in agreement with the Wasserstein bound from Sec. 2.3. The top part of Fig. 
3(a) shows the maximum differences between the noisy and the noiseless test error curves for each combination of D and σ, and the bottom part shows the corresponding values of the Wasserstein bound σ√D for comparison. For each value of σ, the test error of the empirically designed quantizer differs from the noiseless case by O(√D), while, for a fixed D, the difference of the noisy and noiseless test errors grows as O(σ).\n\nFigure 3: (a) Top: empirically observed differences between noisy and noiseless test curves; bottom: theoretically derived bound σ√D. (b) Height plot of dimension estimates for the r = 2 algorithm as a function of D and σ. Top: training estimates; bottom: test estimates. (c) Dimension estimates for r = ∞. Top: training; bottom: test. Note that the training estimates are consistently lower than the test estimates: the average difference is 0.17 (resp. 0.28) for the r = 2 (resp. r = ∞) case.\n\nAs predicted by the bound, the additive constant in the parametric form of the test error increases with σ, resulting in larger slopes of the log-log curve and therefore higher dimension estimates. This is reflected in Figs. 3(b) and (c), which show training and test dimensionality estimates for r = 2 and r = ∞, respectively. 
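The Wasserstein bound from Sec. 2.3 used for this comparison is straightforward to evaluate. A minimal sketch (function name is ours; log-gamma is used to avoid overflow at large D):

```python
import math

def wasserstein_noise_bound(sigma, D, r):
    """Upper bound on rho_r(mu, nu) for isotropic Gaussian noise with
    covariance sigma**2 * I_D (Sec. 2.3):
        sigma * sqrt(2) * (Gamma((r + D) / 2) / Gamma(D / 2)) ** (1 / r)."""
    log_ratio = math.lgamma((r + D) / 2.0) - math.lgamma(D / 2.0)
    return sigma * math.sqrt(2.0) * math.exp(log_ratio / r)
```

For r = 2 the gamma ratio collapses to D/2, recovering the stated special case ρ_2(μ, ν) ≤ σ√D, and the bound grows with both D and r as claimed.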
The r = ∞ estimates are much less stable than those for r = 2 because the r = ∞ (worst-case) error is controlled by outliers and often stays constant over a range of rates. The piecewise-constant shape of the test error curves (see Fig. 1) results in log-log plots with unstable slopes. Table 1 shows a comparative evaluation on the MNIST handwritten digits database2 and a face video.3 The MNIST database contains 70,000 images at resolution 28 × 28 (D = 784), and the face video has 1965 frames at resolution 28 × 20 (D = 560). For each of the resulting 11 datasets (taking each digit separately), we used half the samples for training and half for testing. The first row of the table shows dimension estimates obtained using a baseline regression method [3]: for each sample point, a local estimate is given by the first local minimum of the curve d log δ(l) / d log l, where δ(l) is the distance from the point to its lth nearest neighbor, and a global estimate is then obtained by averaging the local estimates. The rest of the table shows the estimates obtained from the training and test curves of the r = 2 quantizer and the (greedy) r = ∞ quantizer. Comparative examination of the results shows that the r = ∞ estimates tend to be fairly low, which is consistent with the experimental findings of Kegl [4]. By contrast, the r = 2 estimates seem to be most resistant to negative bias. The relatively high values of the dimension estimates reflect the many degrees of freedom found in handwritten digits, including different scale, slant and thickness of the strokes, as well as the presence of topological features (e.g., loops in 2's or extra horizontal bars in 7's). The lowest dimensionality is found for 1's, while the highest is found for 8's, reflecting the relative complexities of different digits. For the face dataset, the different dimensionality estimates range from 4.25 to 8.30. 
This dataset certainly contains enough degrees of freedom to justify such high estimates, including changes in pose and facial expression, as well as camera jitter.4 Finally, for both the digits and the faces, significant noise in the dataset additionally inflated the estimates.\n\n2 http://yann.lecun.com/exdb/mnist/\n3 http://www.cs.toronto.edu/~roweis/data.html, B. Frey and S. Roweis.\n\nTable 1: Performance on the MNIST dataset and on the Frey faces dataset.\n\n              Handwritten digits (MNIST data set)                                 Faces\n                0      1      2      3      4      5      6      7      8      9\n# samples     6903   7877   6990   7141   6824   6313   6876   7293   6825   6958   1965\nRegression   11.14   7.86  12.79  13.39  11.98  13.05  11.19  10.42  13.79  11.26   5.63\nr = 2 train  12.39   6.51  16.04  15.38  13.22  14.63  12.05  12.32  19.80  13.44   5.70\nr = 2 test   15.47   7.11  20.89  19.78  16.79  19.80  16.02  16.02  20.07  17.46   8.30\nr = ∞ train  10.33   8.19  10.15  12.63   9.87   8.49   9.85   8.10  10.88   7.40   4.25\nr = ∞ test    9.02   6.61  13.98  12.21   7.26  10.46   9.08   9.92  14.03   9.59   6.39\n\n4. Discussion\nWe have demonstrated an approach to intrinsic dimensionality estimation based on high-rate vector quantization. A crucial distinguishing feature of our method is the use of an independent test sequence to ensure statistical consistency and avoid underestimating the dimension. Many existing methods are well known to exhibit a negative bias in high dimensions [4, 5]. This can have serious implications in practice, as it may result in low-dimensional representations that lose essential features of the data. Our results raise the possibility that this negative bias may be indicative of overfitting. In the future we plan to integrate our proposed method into a unified package of quantization-based algorithms for estimating the intrinsic dimension of the data, obtaining its dimension-reduced manifold representation, and compressing the low-dimensional data [11].\n\nAcknowledgments\nMaxim Raginsky was supported by the Beckman Institute Postdoctoral Fellowship. 
Svetlana Lazebnik was partially supported by the National Science Foundation grants IIS-0308087 and IIS-0535152.\n\nReferences\n[1] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, December 2000. [2] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, December 2000. [3] M. Brand. Charting a manifold. In NIPS 15, pages 977-984, Cambridge, MA, 2003. MIT Press. [4] B. Kegl. Intrinsic dimension estimation using packing numbers. In NIPS 15, Cambridge, MA, 2003. MIT Press. [5] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In NIPS 17, Cambridge, MA, 2005. MIT Press. [6] S. Graf and H. Luschgy. Foundations of Quantization for Probability Distributions. Springer-Verlag, Berlin, 2000. [7] P.L. Zador. Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Trans. Inform. Theory, IT-28:139-149, March 1982. [8] T. Linder. Learning-theoretic methods in vector quantization. In L. Gyorfi, editor, Principles of Nonparametric Learning. Springer-Verlag, New York, 2001. [9] R.M. Gray and L.D. Davisson. Quantizer mismatch. IEEE Trans. Commun., 23:439-443, 1975. [10] E. Welzl. Smallest enclosing disks (balls and ellipsoids). In New Results and New Trends in Computer Science, volume 555 of LNCS, pages 359-370. Springer, 1991. [11] M. Raginsky. A complexity-regularized quantization approach to nonlinear dimensionality reduction. In Proc. 2005 IEEE Int. Symp. Inform. Theory, pages 352-356.\n\n4 Interestingly, Brand [3] reports an intrinsic dimension estimate of 3 for this data set. However, he used only a 500-frame subsequence and introduced additional mirror symmetry.\n", "award": [], "sourceid": 2945, "authors": [{"given_name": "Maxim", "family_name": "Raginsky", "institution": null}, {"given_name": "Svetlana", "family_name": "Lazebnik", "institution": null}]}