{"title": "On the Sample Complexity of Subspace Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2067, "page_last": 2075, "abstract": "A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods.", "full_text": "On the Sample Complexity of Subspace Learning\n\nRobotics Brain and Cognitive Science\n\nMassachusetts Institute of Technology\n\nGuillermo D. Canas\n\nguilledc@mit.edu\n\nAlessandro Rudi\n\nIstituto Italiano di Tecnologia\n\nalessandro.rudi@iit.it\n\nLorenzo Rosasco\n\nUniversit\u00e0 degli Studi di Genova, LCSL,\n\nMassachusetts Institute of Technology & Istituto Italiano di Tecnologia\n\nlrosasco@mit.edu\n\nAbstract\n\nA large number of algorithms in machine learning, from principal component\nanalysis (PCA), and its non-linear (kernel) extensions, to more recent spectral\nembedding and support estimation methods, rely on estimating a linear subspace\nfrom samples. In this paper we introduce a general formulation of this problem\nand derive novel learning error estimates. Our results rely on natural assumptions\non the spectral properties of the covariance operator associated to the data distribution,\nand hold for a wide class of metrics between subspaces. 
As special cases, we\ndiscuss sharp error estimates for the reconstruction properties of PCA and spectral\nsupport estimation. Key to our analysis is an operator theoretic approach that has\nbroad applicability to spectral learning methods.\n\n1\n\nIntroduction\n\nThe subspace learning problem is that of \ufb01nding the smallest linear space supporting data drawn\nfrom an unknown distribution.\nIt is a classical problem in machine learning and statistics, with\nseveral established algorithms addressing it, most notably PCA and kernel PCA [12, 18]. It is also\nat the core of a number of spectral methods for data analysis, including spectral embedding meth-\nods, from classical multidimensional scaling (MDS) [7, 26], to more recent manifold embedding\nmethods [22, 16, 2], and spectral methods for support estimation [9]. Therefore knowledge of the\nspeed of convergence of the subspace learning problem, with respect to the sample size, and the\nalgorithms\u2019 parameters, is of considerable practical importance.\nGiven a measure \u03c1 from which independent samples are drawn, we aim to estimate the smallest\nsubspace S\u03c1 that contains the support of \u03c1. In some cases, the support may lie on, or close to, a\nsubspace of lower dimension than the embedding space, and it may be of interest to learn such a\nsubspace S\u03c1 in order to replace the original samples by their local encoding with respect to S\u03c1.\nWhile traditional methods, such as PCA and MDS, perform such subspace estimation in the data\u2019s\noriginal space, other, more recent manifold learning methods, such as isomap [22], Hessian eigen-\nmaps [10], maximum-variance unfolding [24, 25, 21], locally-linear embedding [16, 17], and Lapla-\ncian eigenmaps [2] (but also kernel PCA [18]), begin by embedding the data in a feature space, in\nwhich subspace estimation is carried out. Indeed, as pointed out in [11, 4, 3], the algorithms in this\nfamily have a common structure. 
They embed the data in a suitable Hilbert space H, and compute\na linear subspace that best approximates the embedded data. The local coordinates in this subspace\nthen become the new representation space. Similar spectral techniques may also be used to estimate\nthe support of the data itself, as discussed in [9].\n\nWhile the subspace estimates are derived from the available samples only, or their embedding, the\nlearning problem is concerned with the quality of the computed subspace as an estimate of S\u03c1 (the\ntrue span of the support of \u03c1). In particular, it may be of interest to understand the quality of these\nestimates, as a function of the algorithm\u2019s parameters (typically the dimensionality of the estimated\nsubspace).\nWe begin by defining the subspace learning problem (Sec. 2), in a sufficiently general way to encompass\na number of well-known problems as special cases (Sec. 4). Our main technical contribution is\na general learning rate for the subspace learning problem, which is then particularized to common\ninstances of this problem (Sec. 3). Our proofs use novel tools from linear operator theory to obtain\nlearning rates for the subspace learning problem which are significantly sharper than existing ones,\nunder typical assumptions, but also cover a wider range of performance metrics. A full sketch of the\nmain proofs is given in Section 7, including a brief description of some of the novel tools developed.\nWe conclude with experimental evidence, and discussion (Sec. 5 and 6).\n\n2 Problem definition and notation\nGiven a measure \u03c1 with support M in the unit ball of a separable Hilbert space H, we consider in\nthis work the problem of estimating, from n i.i.d. 
samples Xn = {xi}1\u2264i\u2264n, the smallest linear\nsubspace S\u03c1 := span(M) that contains M.\nThe quality of an estimate \u02c6S of S\u03c1, for a given metric (or error criterion) d, is characterized in terms\nof probabilistic bounds of the form\n\nP[ d(S\u03c1, \u02c6S) \u2264 \u03b5(\u03b4, n, \u03c1) ] \u2265 1 \u2212 \u03b4,    0 < \u03b4 \u2264 1,    (1)\n\nfor some function \u03b5 of the problem\u2019s parameters. We derive in the sequel high probability bounds of\nthe above form.\nIn the remainder the metric projection operator onto a subspace S is denoted by P_S, where P_S^2 = P_S^* = P_S\n(every P is idempotent and self-adjoint). We denote by \u2016 \u00b7 \u2016_H the norm induced by the\ndot product \u27e8\u00b7,\u00b7\u27e9_H in H, and by \u2016A\u2016_p := (Tr(|A|^p))^{1/p} the p-Schatten, or p-class norm of a linear\nbounded operator A [15, p. 84].\n\n2.1 Subspace estimates\nLetting C := Ex\u223c\u03c1 x \u2297 x be the (uncentered) covariance operator associated to \u03c1, it is easy to show\nthat S\u03c1 = Ran C. Similarly, given the empirical covariance Cn := (1/n) \u2211_{i=1}^n xi \u2297 xi, we define the\nempirical subspace estimate\n\n\u02c6Sn := span(Xn) = Ran Cn\n\n(note that the closure is not needed in this case because \u02c6Sn is finite-dimensional). We also define\nthe k-truncated (kernel) PCA subspace estimate \u02c6S^k_n := Ran C^k_n, where C^k_n is obtained from Cn by\nkeeping only its k top eigenvalues. 
Note that, since the PCA estimate \u02c6S^k_n is spanned by the top k\neigenvectors of Cn, then clearly \u02c6S^k_n \u2286 \u02c6S^{k'}_n for k < k', and therefore {\u02c6S^k_n}_{k=1}^n is a nested family of\nsubspaces (all of which are contained in S\u03c1).\nAs discussed in Section 4.1, since kernel-PCA reduces to regular PCA in a feature space [18] (and\ncan be computed with knowledge of the kernel alone), the following discussion applies equally to\nkernel-PCA estimates, with the understanding that, in that case, S\u03c1 is the span of the support of \u03c1 in\nthe feature space.\n\n2.2 Performance criteria\n\nIn order for a bound of the form of Equation (1) to be meaningful, a choice of performance criterion\nd must be made. We define the distance\n\nd\u03b1,p(U, V) := \u2016(P_U \u2212 P_V) C^\u03b1\u2016_p    (2)\n\nbetween subspaces U, V, which is a metric over the space of subspaces contained in S\u03c1, for 0 \u2264 \u03b1 \u2264 1/2\nand 1 \u2264 p \u2264 \u221e. Note that d\u03b1,p depends on \u03c1 through C but, in the interest of clarity,\nthis dependence is omitted in the notation. While d\u03b1,p is of interest in its own right, important\nperformance criteria can also be expressed as particular cases of it. In particular, the so-called\nreconstruction error [13]:\n\ndR(S\u03c1, \u02c6S) := Ex\u223c\u03c1 \u2016P_{S\u03c1}(x) \u2212 P_{\u02c6S}(x)\u2016_H^2\n\nis dR(S\u03c1,\u00b7) = d1/2,2(S\u03c1,\u00b7)^2.\nNote that dR is a natural criterion because a k-truncated PCA estimate minimizes a suitable error\ndR over all subspaces of dimension k. 
Clearly, dR(S\u03c1, \u02c6S) vanishes whenever \u02c6S contains S\u03c1 and,\nbecause the family {\u02c6S^k_n}_{k=1}^n of PCA estimates is nested, dR(S\u03c1, \u02c6S^k_n) is non-increasing with k.\nAs shown in [13], a number of unsupervised learning algorithms, including (kernel) PCA, k-means,\nk-flats, sparse coding, and non-negative matrix factorization, can be written as a minimization of dR\nover an algorithm-specific class of sets (e.g. over the set of linear subspaces of a fixed dimension in\nthe case of PCA).\n\n3 Summary of results\n\nOur main technical contribution is a bound of the form of Eq. (1), for the k-truncated PCA estimate\n\u02c6S^k_n (with the empirical estimate \u02c6Sn := \u02c6S^n_n being a particular case), whose proof is postponed to\nSec. 7. We begin by bounding the distance d\u03b1,p between S\u03c1 and the k-truncated PCA estimate \u02c6S^k_n,\ngiven a known covariance C.\nTheorem 3.1. Let {xi}1\u2264i\u2264n be drawn i.i.d. according to a probability measure \u03c1 supported on\nthe unit ball of a separable Hilbert space H, with covariance C. Assuming n > 3, 0 < \u03b4 < 1,\n0 \u2264 \u03b1 \u2264 1/2, 1 \u2264 p \u2264 \u221e, the following holds for each k \u2208 {1, . . . , n}:\n\nP[ d\u03b1,p(S\u03c1, \u02c6S^k_n) \u2264 3 t^\u03b1 \u2016C^\u03b1 (C + tI)^{\u2212\u03b1}\u2016_p ] \u2265 1 \u2212 \u03b4    (3)\n\nwhere t = max{\u03c3k, (9/n) log(n/\u03b4)}, and \u03c3k is the k-th top eigenvalue of C.\nWe say that C has eigenvalue decay rate of order r if there are constants q, Q > 0 such that\nq j^{\u2212r} \u2264 \u03c3j \u2264 Q j^{\u2212r}, where \u03c3j are the (decreasingly ordered) eigenvalues of C, and r > 1. From\nEquation (2) it is clear that, in order for the subspace learning problem to be well-defined, it must\nhold that \u2016C^\u03b1\u2016_p < \u221e, or equivalently \u03b1p > 1/r. 
Note that this condition is always met for p = \u221e, and\nalso holds in the reconstruction error case (\u03b1 = 1/2, p = 2), for any decay rate r > 1.\nKnowledge of an eigenvalue decay rate can be incorporated into Theorem 3.1 to obtain explicit\nlearning rates, as follows.\nTheorem 3.2 (Polynomial eigenvalue decay). Let C have eigenvalue decay rate of order r. Under\nthe assumptions of Theorem 3.1, it is, with probability 1 \u2212 \u03b4,\n\nd\u03b1,p(S\u03c1, \u02c6S^k_n) \u2264 Q' k^{\u2212r\u03b1+1/p}    if k < k\u2217_n    (polynomial decay)\nd\u03b1,p(S\u03c1, \u02c6S^k_n) \u2264 Q' (k\u2217_n)^{\u2212r\u03b1+1/p}    if k \u2265 k\u2217_n    (plateau)    (4)\n\nwhere it is k\u2217_n = (qn / (9 log(n/\u03b4)))^{1/r}, and Q' = 3 (Q^{1/r} \u0393(\u03b1p \u2212 1/r) \u0393(1 + 1/r) / \u0393(1/r))^{1/p}.\n\nThe above theorem guarantees a drop in d\u03b1,p with increasing k, at a rate of k^{\u2212r\u03b1+1/p}, up to k = k\u2217_n,\nafter which the bound remains constant. The estimated plateau threshold k\u2217_n is thus the value of\ntruncation past which the upper bound does not improve. Note that, as described in Section 5, this\nperformance drop and plateau behavior is observed in practice.\nThe proofs of Theorems 3.1 and 3.2 rely on recent non-commutative Bernstein-type inequalities on\noperators [5, 23], and a novel analytical decomposition. Note that classical Bernstein inequalities in\nHilbert spaces (e.g. [14]) could also be used instead of [23]. However, while this approach would\nsimplify the analysis, it produces looser bounds, as described in Section 7.\nIf we consider an algorithm that produces, for each set of n samples, an estimate \u02c6S^k_n with k \u2265 k\u2217_n\nthen, by plugging the definition of k\u2217_n into Eq. 4, we obtain an upper bound on d\u03b1,p as a function of\nn.\n\nCorollary 3.3. 
Let C have eigenvalue decay rate of order r, and Q', k\u2217_n be as in Theorem 3.2. Let\n\u02c6S\u2217_n be a truncated subspace estimate \u02c6S^k_n with k \u2265 k\u2217_n. It is, with probability 1 \u2212 \u03b4,\n\nd\u03b1,p(S\u03c1, \u02c6S\u2217_n) \u2264 Q' (9 (log n \u2212 log \u03b4) / (qn))^{\u03b1\u22121/(rp)}.\n\nRemark 3.4. Note that, by setting k = n, the above corollary also provides guarantees on the rate\nof convergence of the empirical estimate Sn = span(Xn) to S\u03c1, of order\n\nd\u03b1,p(S\u03c1, Sn) = O( ((log n \u2212 log \u03b4) / n)^{\u03b1\u22121/(rp)} ).\n\nCorollary 4.1 and Remark 3.4 are valid for all n such that k\u2217_n \u2264 n (or equivalently such that\nn^{r\u22121} (log n \u2212 log \u03b4) \u2265 q/9). Note that, because \u03c1 is supported on the unit ball, its covariance\nhas eigenvalues no greater than one, and therefore it must be q < 1. It thus suffices to require that\nn > 3 to ensure that the condition k\u2217_n \u2264 n holds.\n\n4 Applications of subspace learning\n\nWe describe next some of the main uses of subspace learning in the literature.\n\n4.1 Kernel PCA and embedding methods\n\nOne of the main applications of subspace learning is in reducing the dimensionality of the input.\nIn particular, one may find nested subspaces of dimension 1 \u2264 k \u2264 n that minimize the distances\nfrom the original to the projected samples. This procedure is known as the Karhunen-Lo\u00e8ve,\nPCA, or Hotelling transform [12], and has been generalized to Reproducing-Kernel Hilbert Spaces\n(RKHS) [18].\nIn particular, the above procedure amounts to computing an eigen-decomposition of the empirical\ncovariance (Sec. 2.1):\n\nCn = \u2211_{i=1}^n \u03c3i ui \u2297 ui,\n\nwhere the k-th subspace estimate is \u02c6S^k_n := Ran C^k_n = span{ui : 1 \u2264 i \u2264 k}. 
Note that, in the\ngeneral case of kernel PCA, we assume the samples {xi}1\u2264i\u2264n to be in some RKHS H, and to be\nobtained from the observed variables (z1, . . . , zn) \u2208 Z^n, for some space Z, through an embedding\nxi := \u03c6(zi). Typically, due to the very high dimensionality of H, we may only have indirect\ninformation about \u03c6 in the form of a kernel function K : Z \u00d7 Z \u2192 R: a symmetric, positive definite\nfunction satisfying K(z, w) = \u27e8\u03c6(z), \u03c6(w)\u27e9_H [20] (for technical reasons, we also assume K to be\ncontinuous). Note that every such K has a unique associated RKHS, and vice versa [20, p. 120\u2013121],\nwhereas, given K, the embedding \u03c6 is only unique up to an inner product-preserving transformation.\nGiven a point z \u2208 Z, we can make use of K to compute the coordinates of the projection of its\nembedding \u03c6(z) onto \u02c6S^k_n \u2286 H by means of a simple k-truncated eigen-decomposition of Kn.\nIt is easy to see that the k-truncated kernel PCA subspace \u02c6S^k_n minimizes the empirical reconstruction\nerror dR( \u02c6Sn, \u02c6S), among all subspaces \u02c6S of dimension k. 
Indeed, it is\n\ndR( \u02c6Sn, \u02c6S) = Ex\u223c\u02c6\u03c1 \u2016x \u2212 P_{\u02c6S}(x)\u2016_H^2 = Ex\u223c\u02c6\u03c1 \u27e8(I \u2212 P_{\u02c6S})x, (I \u2212 P_{\u02c6S})x\u27e9_H\n= Ex\u223c\u02c6\u03c1 \u27e8I \u2212 P_{\u02c6S}, x \u2297 x\u27e9_{HS} = \u27e8I \u2212 P_{\u02c6S}, Cn\u27e9_{HS},    (5)\n\nwhere \u27e8\u00b7,\u00b7\u27e9_{HS} is the Hilbert-Schmidt inner product, from which it is easy to see that the k-dimensional\nsubspace minimizing Equation 5 (alternatively maximizing \u27e8P_{\u02c6S}, Cn\u27e9_{HS}) is spanned\nby the top k eigenvectors of Cn.\nSince we are interested in the expected error dR(S\u03c1, \u02c6S^k_n) of the kernel PCA estimate (rather than\nthe empirical error dR( \u02c6Sn, \u02c6S)), we may obtain a learning rate for Equation 5 by particularizing our results\nto the reconstruction error, for all k (Theorem 3.2), and for k \u2265 k\u2217 with a suitable choice of k\u2217\n(Corollary 4.1). In particular, recalling that dR(S\u03c1,\u00b7) = d\u03b1,p(S\u03c1,\u00b7)^2 with \u03b1 = 1/2 and p = 2,\nand choosing a value of k \u2265 k\u2217_n that minimizes the bound of Theorem 3.2, we obtain the following\nresult.\n\nCorollary 4.1 (Performance of PCA / Reconstruction error). Let C have eigenvalue decay rate of\norder r, and \u02c6S\u2217_n be as in Corollary 3.3. Then it holds, with probability 1 \u2212 \u03b4,\n\ndR(S\u03c1, \u02c6S\u2217_n) = O( ((log n \u2212 log \u03b4) / n)^{1\u22121/r} )\n\nwhere the dependence on \u03b4 is hidden in the Landau symbol.\n\n4.2 Support estimation\n\nThe problem of support estimation consists in recovering the support M of a distribution \u03c1 on\na metric space Z from independent and identically distributed samples Zn = (zi)1\u2264i\u2264n. 
We briefly recall a\nrecently proposed approach to support estimation based on subspace learning [9], and discuss how\nour results specialize to this setting, producing a qualitative improvement over theirs.\nGiven a suitable reproducing kernel K on Z (with associated feature map \u03c6), the support M can\nbe characterized in terms of the subspace S\u03c1 = span \u03c6(M) \u2286 H [9]. More precisely, letting\nd_V(x) = \u2016x \u2212 P_V x\u2016_H be the point-subspace distance to a subspace V, it can be shown (see [9])\nthat, if the kernel separates\u00b9 M, then it is\n\nM = {z \u2208 Z | d_{S\u03c1}(\u03c6(z)) = 0}.\n\nThis suggests an empirical estimate \u02c6M = {z \u2208 Z | d_{\u02c6S}(\u03c6(z)) \u2264 \u03c4} of M, where \u02c6S = span \u03c6(Zn),\nand \u03c4 > 0. With this choice, almost sure convergence lim_{n\u2192\u221e} dH(M, \u02c6M) = 0 in the Hausdorff\ndistance [1] is related to the convergence of \u02c6S to S\u03c1 [9]. More precisely, if the eigenfunctions of the\ncovariance operator C = Ez\u223c\u03c1 [\u03c6(z) \u2297 \u03c6(z)] are uniformly bounded, then it suffices for Hausdorff\nconvergence to bound from above d_{(r\u22121)/(2r),\u221e} (where r > 1 is the eigenvalue decay rate of C). The\nfollowing result specializes Corollary 3.3 to this setting.\nCorollary 4.2 (Performance of set learning). If 0 \u2264 \u03b1 \u2264 1/2, then it holds, with probability 1 \u2212 \u03b4,\n\nd\u03b1,\u221e(S\u03c1, \u02c6S\u2217_n) = O( ((log n \u2212 log \u03b4) / n)^\u03b1 )\n\nwhere the constant in the Landau symbol depends on \u03b4.\n\nFigure 1: The figure shows the experimental behavior\nof the distance d\u03b1,\u221e( \u02c6Sk, S\u03c1) between the\nempirical and the actual support subspaces, with\nrespect to the regularization parameter. The setting\nis the one of section 5. 
Here the actual subspace\nis analytically computed, while the empirical\none is computed on a dataset with n = 1000\nand 32-bit floating point precision. Note the numerical\ninstability as k tends to 1000.\n\nLetting \u03b1 = (r\u22121)/(2r) above yields a high probability bound of order O(n^{\u2212(r\u22121)/(2r)}) (up to logarithmic\nfactors), which is considerably sharper than the bound O(n^{\u2212(r\u22121)/(2(3r\u22121))}) found in [8] (Theorem 7).\n\n\u00b9 A kernel is said to separate M if its associated feature map \u03c6 satisfies \u03c6^{\u22121}(span \u03c6(M)) = M (e.g. the\nAbel kernel is separating).\n\nNote that these are upper bounds for the best possible choice of k (which minimizes the bound).\nWhile the optima of both bounds vanish with n \u2192 \u221e, their behavior is qualitatively different. In\nparticular, the bound of [8] is U-shaped, and diverges for k = n, while ours is L-shaped (no trade-off),\nand thus also convergent for k = n. Therefore, when compared with [8], our results suggest\nthat no regularization is required from a statistical point of view though, as clarified in the following\nremark, it may be needed for purposes of numerical stability.\nRemark 4.3. While, as proven in Corollary 4.2, regularization is not needed from a statistical\nperspective, it can play a role in ensuring numerical stability in practice. Indeed, in order to find\n\u02c6M, we compute d_{\u02c6S}(\u03c6(z)) with z \u2208 Z. Using the reproducing property of K, it can be shown that,\nfor z \u2208 Z, it is d_{\u02c6S^k}(\u03c6(z))^2 = K(z, z) \u2212 \u27e8tz, ( \u02c6K^k_n)\u2020 tz\u27e9, where (tz)i = K(z, zi), \u02c6Kn is the Gram\nmatrix ( \u02c6Kn)ij = K(zi, zj), \u02c6K^k_n is the rank-k approximation of \u02c6Kn, and ( \u02c6K^k_n)\u2020 is the pseudo-inverse\nof \u02c6K^k_n. The computation of \u02c6M therefore requires a matrix inversion, which is prone to instability for\nhigh condition numbers. Figure 1 shows the behavior of the error that results from replacing \u02c6S by\nits k-truncated approximation \u02c6Sk. 
For large values of k, the small eigenvalues of \u02c6Kn are used in the\ninversion, leading to numerical instability.\n\n5 Experiments\n\nFigure 2: The spectrum of the empirical covariance (left), and the expected distance from a random\nsample to the empirical k-truncated kernel-PCA subspace estimate (right), as a function of k (n =\n1000, 1000 trials shown in a boxplot). Our predicted plateau threshold k\u2217_n (Theorem 3.2) is a good\nestimate of the value k past which the distance stabilizes.\n\nIn order to validate our analysis empirically, we consider the following experiment. Let \u03c1 be a\nuniform one-dimensional distribution in the unit interval. We embed \u03c1 into a reproducing-kernel\nHilbert space H using the exponential of the \u21131 distance (k(u, v) = exp{\u2212\u2016u \u2212 v\u2016_1}) as kernel.\nGiven n samples drawn from \u03c1, we compute its empirical covariance in H (whose spectrum is\nplotted in Figure 2 (left)), and truncate its eigen-decomposition to obtain a subspace estimate \u02c6S^k_n, as\ndescribed in Section 2.1.\nFigure 2 (right) is a box plot of the reconstruction error dR(S\u03c1, \u02c6S^k_n) associated with the k-truncated\nkernel-PCA estimate \u02c6S^k_n (the expected distance in H of samples to \u02c6S^k_n), with n = 1000 and varying\nk. While dR is computed analytically in this example, and S\u03c1 is fixed, the estimate \u02c6S^k_n is a random\nvariable, hence the variability in the graph. Notice from the figure that, as pointed out in [6] and\ndiscussed in Section 6, the reconstruction error dR(S\u03c1, \u02c6S^k_n) is always a non-increasing function of k,\ndue to the fact that the kernel-PCA estimates are nested: \u02c6S^k_n \u2282 \u02c6S^{k'}_n for k < k' (see Section 2.1). 
The\ngraph is highly concentrated around a curve with a steep initial drop, until reaching some sufficiently\nhigh k, past which the reconstruction (pseudo) distance becomes stable, and does not vanish. In our\nexperiments, this behavior is typical for the reconstruction distance and high-dimensional problems.\nDue to the simple form of this example, we are able to compute analytically the spectrum of the\ntrue covariance C. In this case, the eigenvalues of C decay as 2\u03b3/((k\u03c0)^2 + \u03b3^2), with k \u2208 N, and\ntherefore they have a polynomial decay rate r = 2 (see Section 3). Given the known spectrum decay\nrate, we can estimate the plateau threshold k = k\u2217_n in the bound of Theorem 3.2, which can be seen\nto be a good approximation of the observed start of a plateau in dR(S\u03c1, \u02c6S^k_n) (Figure 2, right). Notice\nthat our bound for this case (Corollary 4.1) similarly predicts a steep performance drop until the\nthreshold k = k\u2217_n (indicated in the figure by the vertical blue line), and a plateau afterwards.\n\n6 Discussion\n\nFigure 3 shows a comparison of our learning rates with existing rates in the literature [6, 19]. The\nplot shows the polynomial decay rate c of the high probability bound dR(S\u03c1, \u02c6S^k_n) = O(n^{\u2212c}), as a\nfunction of the eigenvalue decay rate r of the covariance C, computed at the best value k\u2217_n (which\nminimizes the bound).\n\nFigure 3: Known upper bounds for the polynomial decay rate c (for the best choice of k), for\nthe expected distance from a random sample to the empirical k-truncated kernel-PCA estimate,\nas a function of the covariance eigenvalue decay rate (higher is better). Our bound (purple line)\nconsistently outperforms previous ones [19] (black line). 
The top (dashed) line [6] has significantly\nstronger assumptions, and is only included for completeness.\n\nThe learning rate exponent c, under a polynomial eigenvalue decay assumption of the data covariance\nC, is c = s(r\u22121)/(r\u2212s+sr) for [6] and c = (r\u22121)/(2r\u22121) for [19], where s is related to the fourth moment. Note\nthat, among the two (purple and black) that operate under the same assumptions, our bound (purple\nline) is the best by a wide margin. The top, best performing, dashed line [6] is obtained for the best\npossible fourth-order moment constraint s = 2r, and is therefore not a fair comparison. However, it\nis worth noting that our bounds perform almost as well as the most restrictive one, even when we do\nnot include any fourth-order moment constraints.\nChoice of truncation parameter k. Since, as pointed out in Section 2.1, the subspace estimates \u02c6S^k_n\nare nested for increasing k (i.e. \u02c6S^k_n \u2286 \u02c6S^{k'}_n for k < k'), the distance d\u03b1,p(S\u03c1, \u02c6S^k_n), and in particular\nthe reconstruction error dR(S\u03c1, \u02c6S^k_n), is a non-increasing function of k. As has been previously discussed\n[6], this suggests that there is no tradeoff in the choice of k. Indeed, the fact that the estimates\n\u02c6S^k_n become increasingly close to S\u03c1 as k increases indicates that, when minimizing d\u03b1,p(S\u03c1, \u02c6S^k_n), the\nbest choice is the highest: k = n.\nInterestingly, however, both in practice (Section 5), and in theory (Section 3), we observe that a typical\nbehavior for the subspace learning problem in high dimensions (e.g. kernel PCA) is that there is\na certain value of k = k\u2217_n, past which performance plateaus. 
For problems such as spectral embedding\nmethods [22, 10, 25], in which a degree of dimensionality reduction is desirable, producing an\nestimate \u02c6S^k_n where k is close to the plateau threshold may be a natural parameter choice: it leads to\nan estimate of the lowest dimension (k = k\u2217_n), whose distance to the true S\u03c1 is almost as low as the\nbest-performing one (k = n).\n\n7 Sketch of the proofs\n\nDue to the novelty of the techniques employed, and in order to clarify how they may be used\nin other contexts, we provide here a proof of our main theoretical result, Theorem 3.1, with some\ndetails omitted in the interest of conciseness.\nFor each \u03bb > 0, we denote by r\u03bb(x) := 1{x > \u03bb} the step function with a cut-off at \u03bb. Given\nan empirical covariance operator Cn, we will consider the truncated version r\u03bb(Cn) where, in this\nnotation, r\u03bb is applied to the eigenvalues of Cn, that is, r\u03bb(Cn) has the same eigen-structure as Cn,\nbut its eigenvalues that are less than or equal to \u03bb are clamped to zero.\nIn order to prove the bound of Equation (3), we begin by proving a more general upper bound of\nd\u03b1,p(S\u03c1, \u02c6S^k_n), which is split into a random (A), and a deterministic part (B, C). The bound holds for\nall values of a free parameter t > 0, which is then constrained and optimized in order to find the\n(close to) tightest version of the bound.\nLemma 7.1. 
Let t > 0, 0 \u2264 \u03b1 \u2264 1/2, and \u03bb = \u03c3k(C) be the k-th top eigenvalue of C, it is,\n\nd\u03b1,p(S\u03c1, \u02c6S^k_n) \u2264 \u2016(C + tI)^{1/2} (Cn + tI)^{\u22121/2}\u2016_\u221e^{2\u03b1} \u00b7 {3/2(\u03bb + t)}^\u03b1 \u00b7 \u2016C^\u03b1 (C + tI)^{\u2212\u03b1}\u2016_p    (6)\n\nwhere the three factors on the right-hand side are labeled A, B, and C, from left to right.\nNote that the right-hand side of Equation (6) is the product of three terms, the left of which (A)\ninvolves the empirical covariance operator Cn, which is a random variable, and the right two (B, C)\nare entirely deterministic. While the term B has already been reduced to the known quantities t, \u03b1, \u03bb,\nthe remaining terms are bounded next. We bound the random term A in the next Lemma, whose proof\nmakes use of recent concentration results [23].\nLemma 7.2 (Term A). Let 0 \u2264 \u03b1 \u2264 1/2, for each (9/n) log(n/\u03b4) \u2264 t \u2264 \u2016C\u2016_\u221e, with probability 1 \u2212 \u03b4 it\nis\n\n(2/3)^\u03b1 \u2264 \u2016(C + tI)^{1/2} (Cn + tI)^{\u22121/2}\u2016_\u221e^{2\u03b1} \u2264 2^\u03b1\n\nLemma 7.3 (Term C). Let C be a symmetric, bounded, positive semidefinite linear operator on H.\nIf \u03c3k(C) \u2264 f(k) for k \u2208 N, where f is a decreasing function then, for all t > 0 and \u03b1 \u2265 0, it holds\n\n\u2016C^\u03b1 (C + tI)^{\u2212\u03b1}\u2016_p \u2264 inf_{0\u2264u\u22641} g_{u\u03b1} t^{\u2212u\u03b1}    (7)\n\nwhere g_{u\u03b1} = (f(1)^{u\u03b1p} + \u222b_1^\u221e f(x)^{u\u03b1p} dx)^{1/p}. 
Furthermore, if f(k) = g k^{\u22121/\u03b3}, with 0 < \u03b3 < 1\nand \u03b1p > \u03b3, then it holds\n\n\u2016C^\u03b1 (C + tI)^{\u2212\u03b1}\u2016_p \u2264 Q t^{\u2212\u03b3/p}    (8)\n\nwhere Q = (g^\u03b3 \u0393(\u03b1p \u2212 \u03b3) \u0393(1 + \u03b3) / \u0393(\u03b3))^{1/p}.\n\nThe combination of Lemmas 7.1 and 7.2 leads to the main Theorem 3.1, which is a probabilistic\nbound, holding for every k \u2208 {1, . . . , n}, with a deterministic term \u2016C^\u03b1 (C + tI)^{\u2212\u03b1}\u2016_p that depends\non knowledge of the covariance C. In cases in which some knowledge of the decay rate of C is\navailable, Lemma 7.3 can be applied to obtain Theorem 3.2 and Corollary 3.3. Finally, Corollary 4.1\nis simply a particular case for the reconstruction error dR(S\u03c1,\u00b7) = d\u03b1,p(S\u03c1,\u00b7)^2, with \u03b1 = 1/2, p = 2.\nAs noted in Section 3, looser bounds would be obtained if classical Bernstein inequalities in\nHilbert spaces [14] were used instead. In particular, Lemma 7.2 would result in a range for t of\nq n^{\u2212r/(r+1)} \u2264 t \u2264 \u2016C\u2016_\u221e, implying k\u2217 = O(n^{1/(r+1)}) rather than O(n^{1/r}), and thus Theorem 3.2\nwould become (for k \u2265 k\u2217) d\u03b1,p(S\u03c1, S^k_n) = O(n^{\u2212\u03b1r/(r+1)+1/(p(r+1))}) (compared with the sharper\nO(n^{\u2212\u03b1+1/(rp)}) of Theorem 3.2). For instance, for p = 2, \u03b1 = 1/2, and a decay rate r = 2 (as\nin the example of Section 5), it would be: d1/2,2(S\u03c1, Sn) = O(n^{\u22121/4}) using Theorem 3.2, and\nd1/2,2(S\u03c1, Sn) = O(n^{\u22121/6}) using classical Bernstein inequalities.\nAcknowledgments L. R. acknowledges the financial support of the Italian Ministry of Education,\nUniversity and Research FIRB project RBFR12M3AC.\n\nReferences\n[1] G. Beer. Topologies on Closed and Closed Convex Sets. Springer, 1993.\n[2] M. Belkin and P. Niyogi. 
Laplacian eigenmaps for dimensionality reduction and data representation.\n\nNeural computation, 15(6):1373\u20131396, 2003.\n\n[3] Y. Bengio, O. Delalleau, N.L. Roux, J.F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions\n\nlinks spectral embedding and kernel pca. Neural Computation, 16(10):2197\u20132219, 2004.\n\n[4] Y. Bengio, J.F. Paiement, and al. Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral\n\nclustering. Advances in neural information processing systems, 16:177\u2013184, 2004.\n\n[5] S. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.\n[6] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis.\n\nMachine Learning, 66(2):259\u2013294, 2007.\n\n[7] I. Borg and P.J.F. Groenen. Modern multidimensional scaling: Theory and applications. Springer, 2005.\n[8] Ernesto De Vito, Lorenzo Rosasco, and al. Learning sets with separating kernels. arXiv:1204.3573, 2012.\n[9] Ernesto De Vito, Lorenzo Rosasco, and Alessandro Toigo. Spectral regularization for support estimation.\n\nAdvances in Neural Information Processing Systems, NIPS Foundation, pages 1\u20139, 2010.\n\n[10] D.L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-\n\ndimensional data. Proceedings of the National Academy of Sciences, 100(10):5591\u20135596, 2003.\n\n[11] J. Ham, D.D. Lee, S. Mika, and B. Sch\u00a8olkopf. A kernel view of the dimensionality reduction of manifolds.\n\nIn Proceedings of the twenty-\ufb01rst international conference on Machine learning, page 47. ACM, 2004.\n\n[12] I. Jolliffe. Principal component analysis. Wiley Online Library, 2005.\n[13] Andreas Maurer and Massimiliano Pontil. K\u2013dimensional coding schemes in hilbert spaces. IEEE Trans-\n\nactions on Information Theory, 56(11):5839\u20135846, 2010.\n\n[14] Iosif Pinelis. Optimum bounds for the distributions of martingales in banach spaces. 
The Annals of Probability, pages 1679\u20131706, 1994.\n[15] J.R. Retherford. Hilbert Space: Compact Operators and the Trace Theorem. London Mathematical Society Student Texts. Cambridge University Press, 1993.\n[16] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323\u20132326, 2000.\n[17] L.K. Saul and S.T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. The Journal of Machine Learning Research, 4:119\u2013155, 2003.\n[18] B. Sch\u00f6lkopf, A. Smola, and K.R. M\u00fcller. Kernel principal component analysis. Artificial Neural Networks-ICANN\u201997, pages 583\u2013588, 1997.\n[19] J. Shawe-Taylor, C. K. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51(7), 2005.\n[20] I. Steinwart and A. Christmann. Support vector machines. Information Science and Statistics. Springer-Verlag, New York, 2008.\n[21] J. Sun, S. Boyd, L. Xiao, and P. Diaconis. The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem. SIAM Review, 48(4):681\u2013699, 2006.\n[22] J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319\u20132323, 2000.\n[23] J.A. Tropp. User-friendly tools for random matrices: An introduction. 2012.\n[24] K.Q. Weinberger and L.K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Computer Vision and Pattern Recognition, 2004. CVPR 2004., volume 2, pages II\u2013988. IEEE, 2004.\n[25] K.Q. Weinberger and L.K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77\u201390, 2006.\n[26] C.K.I. Williams. 
On a connection between kernel PCA and metric multidimensional scaling. Machine Learning, 46(1):11\u201319, 2002.\n", "award": [], "sourceid": 1034, "authors": [{"given_name": "Alessandro", "family_name": "Rudi", "institution": "Istituto Italiano di Tecnologia"}, {"given_name": "Guillermo", "family_name": "Canas", "institution": "MIT"}, {"given_name": "Lorenzo", "family_name": "Rosasco", "institution": "MIT"}]}