{"title": "Streaming Kernel PCA with $\\tilde{O}(\\sqrt{n})$ Random Features", "book": "Advances in Neural Information Processing Systems", "page_first": 7311, "page_last": 7321, "abstract": "We study the statistical and computational aspects of kernel principal component analysis using random Fourier features and show that under mild assumptions, $O(\\sqrt{n} \\log n)$ features suffices to achieve $O(1/\\epsilon^2)$ sample complexity. Furthermore, we give a memory efficient streaming algorithm based on classical Oja's algorithm that achieves this rate", "full_text": "Streaming Kernel PCA with \u02dcO(\n\n\u221a\n\nn) Random Features\n\nEnayat Ullah \u2020\nenayat@jhu.edu\n\nPoorya Mianjy \u2020\nmianjy@jhu.edu\n\nTeodor V. Marinov \u2020\ntmarino2@jhu.edu\n\nRaman Arora \u2020\n\narora@cs.jhu.edu\n\nAbstract\n\nWe study the statistical and computational aspects of kernel principal component\n\u221a\nanalysis using random Fourier features and show that under mild assumptions,\nn log (n)) features suf\ufb01ce to achieve O(1/\u00012) sample complexity. Further-\nO(\nmore, we give a memory ef\ufb01cient streaming algorithm based on classical Oja\u2019s\nalgorithm that achieves this rate.\n\nIntroduction\n\n1\nKernel methods represent an important class of machine learning algorithms that simultaneously\nenjoy strong theoretical guarantees as well as empirical performance. However, it is notoriously\nhard to scale them to large datasets due to space and runtime complexity (typically O(n2) and\nO(n3), respectively, for most problems) [Smola and Sch\u00f6lkopf, 1998]. There have been many efforts\nto overcome these computational challenges, including Nystr\u00f6m method [Williams and Seeger,\n2001], incomplete Cholesky factorization [Fine and Scheinberg, 2001], random Fourier features\n(RFF) [Rahimi and Recht, 2007] and randomized sketching [Yang et al., 2015]. 
In this paper, we focus on random Fourier features due to their broad applicability to a large class of kernel problems. In a seminal paper, Rahimi and Recht [2007] appealed to Bochner's theorem to argue that any shift-invariant kernel can be approximated as k(x, y) ≈ ⟨z(x), z(y)⟩, where the random Fourier feature mapping z : R^d → R^m is obtained by sampling from the inverse Fourier transform of the kernel function. This allows one to invoke fast linear techniques to solve the resulting linear problem in R^m. However, subsequent work analyzing kernel methods based on RFF suggests that to achieve the same asymptotic rates on the excess risk as obtained using the true kernel, one requires m = Ω(n) random features [Rahimi and Recht, 2009], which defeats the purpose of using random features from a computational perspective and fails to explain their empirical success.

Last year at NIPS, while Rahimi and Recht won the test-of-time award for their work on RFF [Rahimi and Recht, 2007], Rudi and Rosasco [2017] showed for the first time that, at least for the kernel ridge regression problem, under some mild distributional assumptions and for an appropriately chosen regularization parameter, one can achieve minimax optimal statistical rates using only m = O(√n log(n)) random features. It is then natural to ask if the same holds for other kernel problems.

In this paper, we focus on Kernel Principal Component Analysis (KPCA) [Schölkopf et al., 1998], a popular technique for unsupervised nonlinear representation learning.
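As a concrete illustration of the RFF construction above (a sketch, not taken from the paper): for the Gaussian kernel, the inverse Fourier transform is itself Gaussian, so sampling Gaussian frequencies with uniform phases and using cosine features gives ⟨z(x), z(y)⟩ ≈ k(x, y). The bandwidth σ and the cosine parameterization are assumptions of this sketch.

```python
import numpy as np

def rff_map(X, m, sigma=1.0, seed=None):
    """Random Fourier features for the Gaussian kernel k(x,y) = exp(-||x-y||^2 / (2 sigma^2)).

    By Bochner's theorem, frequencies are drawn from the kernel's spectral
    density (Gaussian here), so E[<z(x), z(y)>] = k(x, y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, m))   # omega_i ~ inverse Fourier transform of k
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)        # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)      # z : R^d -> R^m

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
true_k = np.exp(-np.linalg.norm(x - y) ** 2 / 2.0)
Z = rff_map(np.vstack([x, y]), m=20000, seed=1)
approx_k = Z[0] @ Z[1]
print(abs(true_k - approx_k))  # shrinks as O(1/sqrt(m))
```

The Monte Carlo error of the inner product decays at the usual 1/√m rate, which is exactly why the number of features m governs the approximation quality throughout the paper.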
We argue that scalability is an even bigger issue in the unsupervised setting, since big data is largely unlabeled. Furthermore, when extending results from supervised to unsupervised learning, we have to deal with additional challenges stemming from the non-convexity of the KPCA problem. We pose KPCA as a stochastic optimization problem and investigate the tradeoff between statistical samples and random features needed to guarantee ε-suboptimality on the population objective (i.e., a small generalization error).

KPCA entails computing the top-k principal components of the data mapped into a Reproducing Kernel Hilbert Space (RKHS) induced by a positive definite kernel [Aronszajn, 1950]. Schölkopf et al. [1998] showed that given a sample of n i.i.d. draws from the underlying distribution, the infinite-dimensional problem (over the RKHS) can be reduced to a finite-dimensional problem (in R^n)

† Department of Computer Science, Johns Hopkins University, Baltimore, MD 21204

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Algorithm | Reference                  | Sample complexity | Per-iteration cost | Memory
ERM       | Shawe-Taylor et al. [2005] | Õ(1/ε²)           | O(1/ε⁴)            | Õ(k/ε⁴)
ERM       | Blanchard et al. [2007]†   | Õ(1/ε)            | O(1/ε²)            | Õ(k/ε²)
RF-DSG    | Xie et al. [2015]          | Õ(1/ε²)           | O(k/ε²)            | Õ(k/ε²)
RF-ERM    | Lopez-Paz et al. [2014]    | Õ(1/ε²)           | Õ(1/ε⁴)            | Õ(k/ε⁴)
RF-ERM    | Corollary 4.4†             | Õ(1/ε²)           | Õ(1/ε³)            | Õ(k/ε³)
RF-Oja    | Corollary 4.4†             | Õ(1/ε²)           | Õ(k/ε)             | Õ(k/ε)

Table 1: Comparing different approaches to KPCA in terms of sample complexity, per-iteration computational cost and space complexity.
† : Optimistic rates realized under (potentially different) higher-order distributional assumptions (see Corollary 4.3, and Blanchard et al. [2007]).

using the kernel trick. In particular, the solution entails computing the top-k eigenvectors of the kernel matrix computed on the given sample. Statistical consistency of this approach was established in Shawe-Taylor et al. [2005] and further improved in Blanchard et al. [2007]. However, the computational aspects of KPCA are less well understood. Note that the eigendecomposition of the kernel matrix alone requires O(kn²) computation, which can be prohibitive for large datasets. Several recent works have attempted to accelerate KPCA using random features. Lopez-Paz et al. [2014] show that the kernel matrix computed using random features converges to the true kernel matrix in operator norm at a rate of O(n√((log n)/m)). Ghashami et al. [2016] extended this guarantee to a streaming setting using the Frequent Directions algorithm [Liberty, 2013] on random features. In a related line of work, Xie et al. [2015] propose a stochastic optimization algorithm based on doubly stochastic gradients with a 1/√n convergence in the sense of the angle between subspaces. However, all these results require m = Ω̃(n) random features to guarantee a O(1/√n) generalization bound.

More recently, Sriperumbudur and Sterge [2017] studied statistical consistency of ERM with randomized Fourier features. They showed that the top-k eigenspace of the empirical covariance matrix in the random feature space converges to that of the population covariance operator in the RKHS, when lifted to the space of square integrable functions, at a rate of O(1/√m + 1/√n)¹. This result suggests that statistical and computational efficiency cannot be achieved at the same time without making further assumptions. In this paper, we assume a spectral decay on the distribution of the data in the feature space to show that we can simultaneously guarantee statistical and computational efficiency for KPCA using random features. Our main contributions are as follows.

1. We study kernel PCA as a stochastic optimization problem and show that under mild distributional assumptions, for a wide range of kernels, the empirical risk minimizer (ERM) in the random feature space converges in objective as O(1/√n) whenever m = Ω(k√n log(n)), with overall runtime of O(kn^{3/2} log(n)).

2. We propose a stochastic approximation algorithm based on classical Oja's updates on random features which enjoys the same statistical guarantees as the ERM above but with better runtime and space requirements.

3. We overcome a key challenge associated with kernel PCA using random features, which is to ensure that the output of the algorithm corresponds to a projection operator in the (potentially infinite-dimensional) RKHS. We establish that the output of the proposed algorithms converges to a projection operator.

4. In order to better understand the computational benefits of using random features, we also consider the KPCA problem in a streaming setting, where at each iteration the algorithm is provided with a fresh sample drawn i.i.d. from the underlying distribution and is required to output a solution based on the samples observed so far. In such a setting, comparison with other algorithmic approaches suggests that Oja's algorithm on random Fourier features (see RF-Oja in Table 1) enjoys the best overall runtime as well as superior space complexity.

5. We contribute novel analytical tools that should be useful broadly when designing algorithms for kernel methods based on random features.
We provide crucial and novel insights that exploit connections between covariance operators in the RKHS and in the space of square integrable functions with respect to the data distribution. This connection allows us to view the kernel approximation using random features as an estimation problem in the space of square integrable functions, where we appeal to recent results on local Rademacher complexity [Massart, 2000, Bartlett et al., 2002, Blanchard et al., 2007] to yield faster rates.

6. Finally, we provide empirical results on a real dataset to support our theoretical results.

¹While our paper was under review, Sriperumbudur and Sterge [2017], which initially focused on statistical consistency of kernel PCA with random features, was replaced by Sriperumbudur and Sterge [2018], with a new title and a focus on computational and statistical tradeoffs of KPCA, much like our paper.

The rest of the paper is organized as follows. In Section 2, we give the problem setup. In Section 3, we provide mathematical preliminaries and introduce the key notation. The main algorithm and the results are in Section 4, and the empirical results are discussed in Section 5.

2 Problem setup

Given a random vector x ∈ R^d with underlying distribution ρ, principal component analysis (PCA) can be formulated as the following stochastic optimization problem [Arora et al., 2012, 2013]:

maximize E_{x∼ρ} ⟨P, xx⊤⟩ s.t. P ∈ P^k,   (1)

where P^k is the set of d × d rank-k orthogonal projection matrices.
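As a quick numerical check of Problem (1) (an illustration, with a synthetic covariance standing in for E[xx⊤]): among all rank-k orthogonal projections, the projection onto the top-k eigenvectors of the covariance attains the maximum, and the optimal value is the sum of the top-k eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2
A = rng.normal(size=(d, d))
C = A @ A.T                        # stand-in for the covariance E[x x^T]

def objective(P):                  # <P, C> = tr(P C)
    return np.trace(P @ C)

# Projection onto the top-k eigenvectors of C
vals, vecs = np.linalg.eigh(C)     # eigenvalues in ascending order
U = vecs[:, -k:]
P_star = U @ U.T

# No other rank-k projection scores higher
for _ in range(100):
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    assert objective(Q @ Q.T) <= objective(P_star) + 1e-9

print(objective(P_star), "== sum of top-k eigenvalues:", vals[-k:].sum())
```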
Essentially, PCA seeks a k-dimensional subspace of R^d that captures maximal variation with respect to the underlying distribution. It is well understood that the solution to the problem above is given by the projection matrix corresponding to the subspace spanned by the top-k eigenvectors of the covariance matrix E[xx⊤].

In most real-world applications, however, the data does not have a linear structure. In other words, the underlying distribution may not be well-represented by any low-rank subspace of the ambient space. In such settings, the representations learned using PCA may not be very informative. This motivates the need for non-linear dimensionality reduction methods. For example, in kernel PCA [Schölkopf et al., 1998], a canonical approach for manifold learning, a nonlinear feature map lifts the data into a higher (potentially infinite) dimensional Reproducing Kernel Hilbert Space (RKHS), where a low-rank subspace corresponds to a (non-linear) low-dimensional manifold in the ambient space. Hence, solving the PCA problem in an RKHS can better capture the complicated nonlinear structure in data. Formally, given a kernel function k(·,·) : R^d × R^d → R, KPCA can be formulated as the following stochastic optimization problem:

maximize E_{x∼ρ} ⟨P, k(x,·) ⊗_H k(x,·)⟩ s.t. P ∈ P^k_{HS(H)},   (2)

where P^k_{HS(H)} is the set of all orthogonal projection operators onto a k-dimensional subspace of the RKHS. The solution to the above problem is given by P^k_C, the projection operator corresponding to the top-k eigenfunctions of the covariance operator C := E_{x∼ρ}[k(x,·) ⊗_H k(x,·)]. The primary goal of any KPCA algorithm is then to guarantee generalization, i.e., to provide a solution P̂ ∈ P^k_{HS(H)} with a small excess risk:

E(P̂) := E_{x∼ρ} ⟨P^k_C, k(x,·) ⊗_H k(x,·)⟩ − E_{x∼ρ} ⟨P̂, k(x,·) ⊗_H k(x,·)⟩.   (3)

Given access to i.i.d. samples {x_i}_{i=1}^n ∼ ρ, one approach to solving Problem (2) is Empirical Risk Minimization (ERM), which amounts to finding the top-k eigenfunctions of the empirical covariance operator Ĉ := (1/n) Σ_{i=1}^n k(x_i,·) ⊗ k(x_i,·). Using the kernel trick, Schölkopf et al. [1998] showed that this problem is equivalent to finding the top-k eigenvectors of the kernel matrix associated with the samples. Alternatively, when approximating the kernel map with random features, Problem (2) reduces to the PCA problem (given in Equation (1)) in the random feature space. Here, we discuss two natural approaches to solving this problem: first, the ERM in the random feature space (called RF-ERM), given by the top-k eigenvectors of the empirical covariance matrix of the data in the feature space; second, the classical Oja's algorithm (called RF-Oja) [Oja, 1982].

Note that while the output of ERM is guaranteed to induce a projection operator in the RKHS of k(·,·), this may not be the case when using RFF (equivalently, when working in the RKHS associated with the approximate kernel map). Therefore, a key technical challenge when designing a KPCA algorithm based on RFF is to ensure that the output is close to the set of projection operators in the true RKHS induced by k(·,·), i.e., that d(P̂, P^k_{HS(H)}) is small.

3 Mathematical Preliminaries and Notation

In this section, we review the basic concepts we need from functional analysis [Reed and Simon, 1972]. We begin with a simple observation: given an underlying distribution on the data and a fixed kernel map, it induces a distribution on the feature map.
We work with this distribution implicitly by considering measurable Hilbert spaces. We denote matrices and Hilbert-Schmidt operators by capital roman letters D, vectors by lower-case roman letters v, and scalars by lower-case letters a. We denote operators over the space of Hilbert-Schmidt operators by capital Fraktur letters A.

Hilbert space notation and operator norm. Let H and H̃ be two separable Hilbert spaces over fields F and F̃ with measures μ and μ̃, respectively. Let {e_i}_{i≥1} and {ẽ_i}_{i≥1} denote some fixed orthonormal bases for H and H̃, respectively. The inner product between two elements h₁, h₂ ∈ H is denoted ⟨h₁, h₂⟩_H, or ⟨h₁, h₂⟩_μ. Similarly, we denote the norm of an element h ∈ H as ‖h‖_H, or ‖h‖_μ. For h₁, h₂ ∈ H, the outer product, denoted h₁ ⊗_H h₂ or h₁ ⊗_μ h₂, is the linear operator on H that maps any h₃ ∈ H to (h₁ ⊗_H h₂)h₃ = ⟨h₂, h₃⟩_H h₁. For a linear operator D : H → H̃, the operator norm of D is defined as ‖D‖₂ := sup{‖Dh‖_H̃ : h ∈ H, ‖h‖_H ≤ 1}.

Adjoint, Hilbert-Schmidt, and trace-class operators. The adjoint of a linear operator D : H → H̃ is the linear operator D* : H̃ → H such that ⟨Dh, h̃⟩_H̃ = ⟨h, D*h̃⟩_H for all h ∈ H, h̃ ∈ H̃. A linear operator D : H → H is self-adjoint if D* = D. A linear operator D : H → H̃ is compact if the image of any bounded set of H is a relatively compact subset of H̃. A linear operator D : H → H is a Hilbert-Schmidt operator if Σ_{i≥1} ‖De_i‖²_H = Σ_{i,j≥1} ⟨De_i, e_j⟩²_H < ∞. The Hilbert-Schmidt norm of D, denoted ‖D‖_{HS(H)} or ‖D‖_{HS(μ)}, is defined as (Σ_{i≥1} ‖De_i‖²_H)^{1/2}. The space of all Hilbert-Schmidt operators on H is denoted HS(H). A compact operator D : H → H is trace-class if ‖D‖_{L¹(H)} := Σ_{i≥1} ⟨(DD*)^{1/2} e_i, e_i⟩_H < ∞, where ‖D‖_{L¹(H)} denotes the nuclear norm of D. For a vector space X, L²(X, ρ) denotes the space of square integrable functions with respect to the measure ρ, i.e., L²(X, ρ) = {f : X → R, ∫_X (f(x))² dρ(x) < ∞}. L²(X, ρ) is a Hilbert space with the inner product ⟨f, g⟩_ρ := ∫_X f(x)g(x) dρ(x) for f, g ∈ L²(X, ρ). The norm induced on L²(X, ρ) is denoted ‖f‖_ρ := ⟨f, f⟩_ρ^{1/2} for f ∈ L²(X, ρ).

Projection operators, spectral decomposition. Given a vector space X, let P^k_X denote the set of rank-k projection operators on X. For a Hilbert-Schmidt operator D over a separable Hilbert space H, let λ_i(D) denote its ith largest eigenvalue. The projection operator associated with the first k eigenfunctions of D is denoted P^k_D; given the spectral decomposition D = Σ_{i=1}^∞ μ_i ψ_i ⊗ ψ_i, we have P^k_D = Σ_{i=1}^k ψ_i ⊗ ψ_i. For a finite-dimensional vector v, ‖v‖_p denotes the ℓ_p-norm of v. For operators D over finite-dimensional spaces, ‖D‖₂ and ‖D‖_F denote the spectral and Frobenius norms of D, respectively. For a metric space (Y, d) and a closed subset S ⊆ Y, we denote the distance from q ∈ Y to S by d(q, S) = min_{s∈S} d(q, s).
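In finite dimensions, the Hilbert-Schmidt norm defined above coincides with the Frobenius norm, and the basis-wise definition Σ_i ‖De_i‖² is independent of the orthonormal basis chosen. A small numerical sanity check (an illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(4, 4))

# Definition: ||D||_HS^2 = sum_i ||D e_i||^2 over an orthonormal basis {e_i}.
# With the standard basis, D e_i is simply the i-th column of D.
hs_standard = sum(np.linalg.norm(D[:, i]) ** 2 for i in range(4))

# Same quantity computed in a different orthonormal basis (columns of a rotation Q)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
hs_rotated = sum(np.linalg.norm(D @ Q[:, i]) ** 2 for i in range(4))

print(hs_standard, hs_rotated, np.linalg.norm(D, 'fro') ** 2)  # all three agree
```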
In a Hilbert space, d is the underlying metric induced by the respective norm. [n] denotes the set of natural numbers from 1 to n.

Mercer kernels and random feature maps. Let X ⊆ R^d be a compact (data) domain and ρ be a distribution on X. We are given n independent and identically distributed samples from ρ, {x_i}_{i=1}^n ∼ ρ^n. Let k : X × X → R be a Mercer kernel with the integral representation k(x, y) = ∫_Ω z(x, ω) z(y, ω) dπ(ω). Here, (Ω, π) is the probability space induced by the Mercer kernel. Let z_ω(·) := z(·, ω). We know that z_ω(·) ∈ L²(X, ρ) almost surely with respect to π. We draw i.i.d. samples ω_i ∼ π, for i = 1, . . . , m, to approximate the kernel function. Let z(·) denote the random feature map, i.e., z : R^d → R^m, z(x) = (1/√m)(z_{ω₁}(x), z_{ω₂}(x), . . . , z_{ω_m}(x)). Let F ⊆ R^m be the linear subspace spanned by the range of z, with the inner product inherited from R^m. The approximate kernel map is denoted k_m(·,·), where k_m(x, y) = ⟨z(x), z(y)⟩_F. Let H denote the separable RKHS associated with the kernel function k(·,·).

Assumption 3.1. The kernel function k is a Mercer kernel (see Theorem A.5) and has the integral representation k(x, y) = ∫_Ω z(x, ω) z(y, ω) dπ(ω) for all x, y ∈ X, where H is a separable RKHS of real-valued functions on X with a bounded positive definite kernel k. We also assume that there exists τ > 1 such that |z(x, ω)| ≤ τ for all x ∈ X, ω ∈ Ω.

Note that the z(x,·) are continuous functions because k(·,·) is continuous, and that when X is separable and k(·,·) is continuous, H is separable.

Definition 3.2. C : H → H is the covariance operator of the random variables k(x,·) with measure ρ, defined as Cf := ∫_X k(x,·) f(x) dρ(x). C is compact and self-adjoint, which implies that C has a spectral decomposition C = Σ_{i=1}^∞ λ̄_i φ̄_i ⊗_H φ̄_i, where the λ̄_i's and φ̄_i's are the eigenvalues and eigenfunctions of C, respectively. The set of eigenfunctions {φ̄_i}_{i=1}^∞ forms a unitary basis for H.

Since we are approximating the kernel k by sampling m i.i.d. copies of z_ω, this implies an approximation to the covariance operator C (in the space HS(ρ)) by a sample average of the random linear operators z_ω ⊗_ρ z_ω. The tools we use to establish concentration require a sufficient spectral decay of the variance of this random operator, which we define next.

Definition 3.3. Let C₁ denote the random linear operator on L²(X, ρ) given by C₁ = z_ω ⊗_ρ z_ω. Let C₂ = C₁ ⊗_{HS(ρ)} C₁ and define the covariance operator of C₁ to be C′ = E_π[C₂] − E_π[C₁] ⊗_{HS(ρ)} E_π[C₁].

We note that C′ can also be interpreted as the fourth moment of the random variable z_ω in L²(X, ρ). The spectrum of C′ plays a crucial role in our results through the following key quantity:

κ(B_k, k, m) = inf_{h≥0} { B_k h/m + √((k/m) Σ_{j>h} λ_j(C′)) },  where  B_k := √(E_π[⟨z_ω, z_ω⟩_ρ⁴]) / (λ̄_k − λ̄_{k+1}).   (4)

Essentially, we will see that the constant κ(B_k, k, m) is the dominating factor when bounding the excess risk, and it will therefore determine the rate of convergence of our algorithms.

From a practical perspective, working in HS(ρ) is not computationally feasible.
However, our approximation to C has a representation in the finite-dimensional space F, as defined here.

Definition 3.4. C_m : F → F is the covariance operator in HS(F), defined as C_m := E_ρ[z(x) ⊗_F z(x)]. Equivalently, for any v ∈ F, C_m v = ∫_X ⟨z(x), v⟩ z(x) dρ(x). C_m is compact and self-adjoint, which implies that C_m has a spectral decomposition C_m = Σ_{i=1}^m λ_i φ_i ⊗_F φ_i.

As mentioned at the beginning of the section, our convergence tools work most conveniently when we can incorporate the randomness with respect to ρ into the geometry of the space we study; hence the need to study L²(X, ρ). Since we are essentially dealing with random operators on F, H and L²(X, ρ), it is most appropriate to also work in the respective spaces of Hilbert-Schmidt operators. Thus, we introduce the inclusion and approximation operators, which allow us to transition with ease between the aforementioned spaces.

Definition 3.5. [Inclusion Operators I and I] The inclusion operator is defined as

I : H → L²(X, ρ), (If) = f, where f ∈ H.

Also, for an operator D ∈ HS(H) with spectral decomposition D = Σ_{i=1}^∞ μ_i ψ_i ⊗ ψ_i, define

I : HS(H) → HS(ρ), ID := Σ_{i=1}^∞ μ_i (Iψ_i / √⟨Cψ_i, ψ_i⟩_H) ⊗ (Iψ_i / √⟨Cψ_i, ψ_i⟩_H).

In Lemma A.8 and Lemma A.9 in the appendix, we show that the adjoint of the inclusion operator I is I* : L²(X, ρ) → H, given by (I*g)(·) = ∫ k(x,·) g(x) dρ(x), and that C = I*I, L = II*.

Definition 3.6. [Approximation Operators A and A] The approximation operator A is defined as

A : F → L²(X, ρ), (Av)(·) = ⟨z(·), v⟩, where v ∈ F.

For an operator D ∈ HS(F) with rank k and spectral decomposition D = Σ_{i=1}^k μ_i ψ_i ⊗ ψ_i, let Ψ be the matrix with the eigenvectors ψ_i as columns and let Φ be the matrix with the eigenvectors of C_m as columns (see Definition 3.4). Define

R* = argmin_{R⊤R = RR⊤ = I} ‖ΨR − Φ‖²_F,  Ψ̃ := ΨR*.

Let ψ̃_i be the ith column of Ψ̃, and define

A : HS(F) → HS(ρ), AD := Σ_{i=1}^k μ_i (Aψ̃_i / √⟨C_m ψ̃_i, ψ̃_i⟩_F) ⊗_ρ (Aψ̃_i / √⟨C_m ψ̃_i, ψ̃_i⟩_F).

In Lemma A.11 and Lemma A.12, we show that the adjoint of the approximation operator is A* : L²(X, ρ) → F, (A*f)_i = ∫_X f(x) z_{ω_i}(x) dρ(x), and that C_m = A*A, L_m = AA*.

We note that the definition of the approximation operator A requires knowledge of the covariance matrix C_m to find the optimal rotation matrix R*, but this is solely for the purpose of analysis and is not used in the algorithm in any form. The following definition enables us to bound the excess risk in HS(H) (Section B in the appendix).

Definition 3.7. [Operator L] Let P̃ ∈ HS(F). Let AP̃ = Σ_{i=1}^k p̃_i ⊗_ρ p̃_i be P̃ lifted to HS(ρ). Consider the equivalence relation p_i ∼ p_j if L^{1/2} p_i = L^{1/2} p_j, and let [p_i] be the equivalence class such that L^{1/2} p_i = p̃_i. The operator L : HS(F) → HS(H) is defined as LP̂ = Σ_{i=1}^k I*p_i ⊗_H I*p_i. Here I* is the restriction of the operator I* to the quotient space L²(X, ρ)/∼.

The quotient space in the definition above is with respect to the kernel of L, i.e., L²(X, ρ)/∼ ≡ L²(X, ρ)/ker(L). This quotient is benign, since the optimal solution to our optimization problem lives in the range of L and, intuitively, we can disregard any components in the kernel of L.

Finally, to conclude the section, we give a visual schematic in Figure 1 to help the reader connect the different spaces. To summarize, the key spaces of interest are the data domain X, the RKHS H of the kernel map k(·,·), and the feature space F obtained via the random feature approximation. The space L²(X, ρ) consists of functions over the data domain X that are square integrable with respect to the data distribution ρ; it allows us to embed objects from different spaces into a common space so as to compare them. Specifically, we map functions from H to L²(X, ρ) via the inclusion operator I, and vectors from F to L²(X, ρ) via the approximation operator A. I* and A* denote the adjoints of I and A, respectively.
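The rotation R* in Definition 3.6 is an instance of the classical orthogonal Procrustes problem, which has a closed-form solution via the SVD; as the text notes, this rotation is needed only for the analysis, never by the algorithm. A minimal sketch on synthetic matrices:

```python
import numpy as np

def procrustes(Psi, Phi):
    """Closed-form minimizer of ||Psi R - Phi||_F over orthogonal R:
    with U S V^T the SVD of Psi^T Phi, the minimizer is R* = U V^T."""
    U, _, Vt = np.linalg.svd(Psi.T @ Phi)
    return U @ Vt

rng = np.random.default_rng(0)
Psi, _ = np.linalg.qr(rng.normal(size=(8, 3)))   # orthonormal columns
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Phi = Psi @ R_true                               # Phi is an exact rotation of Psi

R = procrustes(Psi, Phi)
print(np.linalg.norm(Psi @ R - Phi))             # ~0: the rotation is recovered
```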
The spaces of Hilbert-Schmidt operators on H, F and L²(X, ρ) are denoted by HS(H), HS(F) and HS(ρ), respectively. Analogous to I and A, I maps operators from HS(H) to HS(ρ), and A maps operators from HS(F) to HS(ρ); these maps are essentially constructed by mapping the eigenvectors of operators via I and A, respectively. The above mappings thus allow us to embed operators in the common space HS(ρ) and to bound estimation and approximation errors. However, the problem of kernel PCA is formulated in HS(H), and bounds in HS(ρ) are therefore not sufficient. To this end, we establish an equivalence between kernel PCA in HS(H) and HS(ρ). We use the map A and the established equivalence to obtain L, which maps operators from HS(F) to HS(H). We encourage the reader to go through Sections A and B in the appendix for a gentler and more rigorous presentation.

Figure 1: Maps between the data domain (X), the space of square integrable functions on X (L²(X, ρ)), the RKHS of the kernel k(·,·), and the RKHS of the approximate feature map, as well as the maps between Hilbert-Schmidt operators on these spaces.

4 Main Results

Recall that our primary goal is to study the generalization behaviour of algorithms solving KPCA using random features. Rather than stick to a particular algorithm, we define a class of algorithms that are suitable for the problem, characterized as follows.

Definition 4.1 (Efficient Subspace Learner (ESL)). Let A be an algorithm which takes as input n points from F and outputs a rank-k projection matrix over F. Let P̂_A denote the output of the algorithm A, and let P̂_A = Φ̃Φ̃⊤ be an eigendecomposition of P̂_A.
Let Φ⊥_k be an orthogonal matrix corresponding to the orthogonal complement of the top-k eigenvectors of C_m. We say that the algorithm A is an Efficient Subspace Learner if the following holds with probability at least 1 − δ:

‖(Φ⊥_k)⊤ Φ̃‖²_F ≤ q^{ρ,π}_A(1/δ, log(m), log(n)) / n,

where q^{ρ,π}_A is a function of the triple (A, ρ, π) which has polynomial dependence on 1/δ, log(m) and log(n). For notational convenience, we drop the superscripts from q^{ρ,π}_A and write it as q_A henceforth.

Algorithm 1 KPCA with Random Features (Meta Algorithm)
Input: Training data X = {x_i}_{i=1}^n
1: Obtain training data X = {x_t}_{t=1}^n in a batch or stream
2: Sample ω_i ∼ π i.i.d., for i = 1 to m
3: Z ← RandomFeatures(X, {ω_i}_{i=1}^m)
4: P̂_A ← A(Z)   // A is an Efficient Subspace Learner, Definition 4.1
Output: P̂_A

Intuitively, an ESL is an algorithm which returns a projection onto a k-dimensional subspace such that the angle between that subspace and the space spanned by the top-k eigenvectors of C_m decays at a sub-linear rate in the number of samples. Our guarantees hold for any algorithm in this class. Algorithm 1 gives a high-level view of the algorithmic routine. To discuss the associated computational aspects, we instantiate it with two specific algorithms, ERM and Oja's algorithm, and show how the result looks in terms of their algorithmic parameters. Similar results can be obtained for other ESL algorithms such as ℓ2-RMSG [Mianjy and Arora, 2018]. We now give the main theorem of the paper, which characterizes the excess risk of an ESL.

Theorem 4.2 (Main Theorem). Let A be an efficient subspace learner, and let P̂_A be the output of A run with m random features on n ≥ 2λ₁² q_A(2/δ, log(m), log(n))² / (λ_k²(√2 − 1)²) points, where λ_i is the ith eigenvalue of C_m. Then, with probability at least 1 − δ, it holds that

(a) E(LP̂_A) ≤ 24κ(B_k, k, m) + (log(2/δ) + 7B_k)/m + √(q_A(2/δ, log(m), log(n))/n),

(b) d(LP̂_A, P^k_{HS(H)}) ≤ √(q_A(2/δ, log(m), log(n))/n).

A few remarks are in order. First, as we forewarned the reader in Section 3, the error bound is dominated by the additive term κ(B_k, k, m), which in a sense determines the hardness of the problem. As we will see, under appropriate assumptions on the data distribution in the feature space, this term can be bounded by a quantity in O(1/m). Second, the output of our algorithm, LP̂, need not be a projection operator in the RKHS. This is precisely why we need to bound the distance between LP̂ and the set of all projection operators in HS(H), which we see is of the order O(1/√n). Third, note that the dependence on the number of random features is at worst poly-logarithmic. From part (b) of Theorem 4.2, it is easy to see that if we project LP̂_A onto the set of rank-k projection operators in HS(H), we get the same rate of convergence. This is presented as Corollary C.12 in the appendix.

Next, we characterize "easy" instances of KPCA problems under which we are guaranteed a fast rate. Specifically, we show that if the spectrum of the fourth-order moment C′ of z_ω decays exponentially, then the dominating factor κ(B_k, k, m) is in O(1/m). Optimizing the number of random features with respect to the sample complexity term then gives us the following result.

Corollary 4.3 (Main - Good decay).
Along with the assumptions and notation of Theorem 4.2, if the spectrum of the operator $C'$ has an exponential decay, i.e., $\lambda_j(C') = \alpha^j$ for some $\alpha < 1$, then with $m = O(\sqrt{n}\log(n))$ random features, we have

$$\mathcal{E}(L\hat{P}_{\mathcal{A}}) \;\le\; \frac{c\,B_k}{\sqrt{n}\,\log(n)} + \frac{c'\left(k + \log(2/\delta) + 7B_k\right)}{\sqrt{n}} + \sqrt{\frac{q_{\mathcal{A}}(2/\delta, \log(m), \log(n))}{n}},$$

where $c$ and $c'$ are universal constants.

Finally, we instantiate the above corollary with two algorithms, namely ERM and Oja's algorithm.

Corollary 4.4 (ERM and Oja). With the same assumptions and notation as in Corollary 4.3,

(a). RF-ERM is an ESL with $q_{\mathrm{ERM}}(1/\delta, \log(m), \log(n)) = \frac{k\,\lambda_1\tau^2}{\mathrm{gap}^2}\,\log\!\left(\frac{2m}{\delta}\right)^2$,

(b). RF-Oja is an ESL with $q_{\mathrm{oja}}(1/\delta, \log(m), \log(n)) = \tilde{\Theta}\!\left(\frac{\Lambda}{\mathrm{gap}^2}\right)$, where $\Lambda = \sum_{i=1}^k \lambda_i$,

where $\mathrm{gap} := \lambda_k(C_m) - \lambda_{k+1}(C_m)$.

Error Decomposition: There are two sources of error when solving KPCA using random features: the estimation error ($\epsilon_e$), resulting from the fact that we have access to the distribution only through an i.i.d. sample, and the approximation error ($\epsilon_a$), resulting from the approximate feature map. Therefore, to get a better handle on the excess error, we decompose it as follows:

$$\mathcal{E}(L\hat{P}_{\mathcal{A}}) = \underbrace{\langle P^k_{C}, C\rangle_{HS(\mathcal{H})} - \langle L P^k_{C_m}, C\rangle_{HS(\mathcal{H})}}_{\epsilon_a:\ \text{Approximation Error}} + \underbrace{\langle L P^k_{C_m}, C\rangle_{HS(\mathcal{H})} - \langle L\hat{P}_{\mathcal{A}}, C\rangle_{HS(\mathcal{H})}}_{\epsilon_e:\ \text{Estimation Error}}.$$

The main idea behind controlling the approximation error is to interpret it as the error incurred in eigenspace estimation in $L^2(\mathcal{X}, \rho)$, and then use local Rademacher complexity to get faster rates.
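To make the RF-Oja instantiation of Corollary 4.4 concrete, here is a minimal Python sketch of Oja's update run over a stream of random-feature vectors. The function names, the constant step size, and the per-step QR re-orthonormalization are illustrative choices for a sketch, not the paper's exact implementation:

```python
import numpy as np

def oja_update(U, z, eta):
    """One streaming Oja step: stochastic power-iteration update with step eta.
    U: (m, k) current orthonormal basis; z: (m,) random-feature vector."""
    U = U + eta * np.outer(z, z @ U)   # stochastic gradient step on the Rayleigh quotient
    Q, _ = np.linalg.qr(U)             # re-orthonormalize the basis
    return Q

def rf_oja(Z, k, eta=0.1, rng=None):
    """Run Oja's algorithm over a stream of random-feature vectors (rows of Z),
    returning the projection onto the learned k-dimensional subspace."""
    rng = np.random.default_rng(rng)
    m = Z.shape[1]
    U = np.linalg.qr(rng.normal(size=(m, k)))[0]   # random orthonormal start
    for z in Z:
        U = oja_update(U, z, eta)
    return U @ U.T
```

Only the $m \times k$ iterate $U$ is stored, so the memory footprint is $O(mk)$, independent of the stream length $n$; this is the sense in which the streaming algorithm is memory efficient.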
In the context of Kernel PCA, this technique was first used by Blanchard et al. [2007], which allowed them to get a sharper $O(1/n)$ excess risk. The estimation error is controlled by the definition of our lifting map $A$ together with the convergence rate implicit in the definition of an ESL. Below, we guide the reader through the main steps taken to bound each of the error terms.

Bounding the Approximation error: Using simple algebraic manipulations, we can show that the approximation error is exactly the error incurred by the ERM in estimating the top $k$ eigenfunctions of the kernel integral operator $L$ using $m$ samples drawn from $\pi$. This problem of eigenspace estimation is well studied in the literature and has optimal statistical rates of $O(1/\sqrt{m})$ [Zwald and Blanchard, 2006]. This appears to be a key bottleneck and reinforces the view that the use of random features cannot provide computational benefits: it suggests that $m = \Omega(n)$ random features are required to get a $O(1/\sqrt{n})$ rate. However, these rates are conservative when viewed in the sense of excess risk. This has been studied extensively in empirical process theory, and one of the primary techniques for obtaining sharper rates is the use of local Rademacher complexity [Bartlett et al., 2002]. The key idea is to show that, around the best hypothesis in the class, the variance of the empirical process is bounded by a constant times the mean of the difference from the best hypothesis (see Theorem F.3). This technique was used in the context of Kernel PCA by Blanchard et al. [2007] to get fast $O(1/m)$ rates. We now state Lemma 4.5, which bounds the approximation error; its proof is deferred to the appendix.

Lemma 4.5 (Approximation Error). With probability at least $1 - \delta$, we have

$$\epsilon_a \;\le\; 24\,\kappa(B_k, k, m) + \frac{11\tau^2 \log(1/\delta) + 7B_k}{m}.$$

Bounding the Estimation error: Since the objective with respect to the inner product in $HS(\rho)$ equals the objective with respect to the inner product in $HS(\mathcal{H})$ (see Lemma B.4), we focus on bounding the estimation error in $L^2(\mathcal{X}, \rho)$. Using a Cauchy-Schwarz type of inequality in $HS(\rho)$, we see that it is enough to bound the difference $\|A P^k_{C_m} - A\hat{P}_{\mathcal{A}}\|_{HS(\rho)}$. We can do this in two steps: bound the error $\|P^k_{C_m} - \hat{P}_{\mathcal{A}}\|_{\mathcal{F}}$ (which we already have from the ESL guarantee) and construct $A : HS(\mathcal{F}) \to HS(\rho)$. We already have a lifting from $\mathcal{F}$ to $L^2(\mathcal{X}, \rho)$ in the form of $A$. The natural attempt to lift an operator on $\mathcal{F}$ would be to lift and appropriately rescale its eigenfunctions. Since the eigendecomposition of $\hat{P}_{\mathcal{A}}$ is not unique, we need to choose an appropriate one to be lifted. Since the goal of $A$ is to preserve distances between operators, we choose the unique eigendecomposition for which the distance $\sum_{i=1}^k \|U_i - \phi_i\|_2^2$ is minimized. Notice that the lifting operator $A$ depends on the eigendecomposition of $C_m$, which cannot be obtained in practice. This is not a problem, because $A$ is used only for the purposes of proving the main result and is not part of the proposed algorithms. We now state Lemma 4.6, which bounds the estimation error.

Lemma 4.6 (Estimation Error). With the same assumptions as Theorem 4.2, the following holds with probability at least $1 - \delta$,

$$\epsilon_e \;\le\; \sqrt{\sum_{i=1}^k \left(\frac{2\lambda_i + 4\lambda_1}{\lambda_i^2}\right)^2}\;\cdot\;\frac{\lambda_1^2\, q_{\mathcal{A}}(1/\delta, \log(m), \log(n))^2}{(\sqrt{2}-1)\, n}.$$

5 Experiments

The goal of this section is to provide empirical evidence supporting our theoretical findings in Section 4. As we motivated in Section 2, the success of an algorithm is measured in terms of its generalization ability, i.e., the variance captured by the output of the algorithm on unseen data[2].

[2] Details on how we evaluate the objective for RF-ERM/Oja are deferred to Section E due to space limitations.

Figure 2 (columns: k = 5, k = 10, k = 15): Comparisons of ERM, Nyström, RF-Oja, and RF-ERM for KPCA on the MNIST dataset, in terms of the objective value as a function of iterations (top) and as a function of CPU runtime (bottom).

We perform experiments on the MNIST dataset, which consists of 70K samples, partitioned into a training, a tuning, and a test set of sizes 20K, 10K, and 40K, respectively. We use a fixed kernel in all our experiments, since we are not concerned with model selection here. In particular, we choose the RBF kernel $k(x, x') = \exp\!\left(-\|x - x'\|^2 / 2\sigma^2\right)$ with bandwidth parameter $\sigma^2 = 50$; the bandwidth is chosen such that ERM converges in objective within observing a few thousand training samples. The objective of the ERM[3] is used as the baseline. Furthermore, to evaluate the computational speedup gained by using random features, we compare against the Nyström method [Drineas and Mahoney, 2005] as a secondary baseline. In particular, upon receiving a new sample, we perform a full Nyström approximation and ERM on the set of samples observed so far.
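The random-feature map used throughout these experiments can be sanity-checked in a few lines. The sketch below (hypothetical helper names; not the paper's experimental code) draws Rahimi-Recht features for the RBF kernel and verifies that $\langle z(x), z(y)\rangle$ concentrates around $k(x, y)$ as $m$ grows:

```python
import numpy as np

def rbf_kernel(x, y, sigma2):
    """Exact RBF kernel value exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma2))

def rff_map(X, sigma2, m, rng):
    """Rahimi-Recht features z(x)_i = sqrt(2/m) cos(w_i.x + b_i).
    For the RBF kernel, the spectral measure is Gaussian: w ~ N(0, I/sigma2)."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / np.sqrt(sigma2), size=(m, d))
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)
```

The Monte Carlo error of the inner-product approximation decays as $O(1/\sqrt{m})$, so for a pair of points the two quantities agree to within a few percent already at moderate $m$.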
Finally, empirical risk minimization (RF-ERM) and Oja's algorithm (RF-Oja) are run with random features to verify the theoretical results presented in Corollary 4.4.

Figure 2 shows the population objective as a function of the iteration count (top row) as well as of the total runtime[4] (bottom row). Each curve represents an average over 100 runs of the corresponding algorithm on training samples drawn independently and uniformly at random from the whole dataset. The number of random features and the size of the Nyström approximation are set to 750 and 100, respectively. We note:

• As predicted by Corollary 4.4, for both RF-ERM and RF-Oja, $\sqrt{n}\log(n) \approx 750$ random features suffice to achieve the same suboptimality as ERM.

• The performance of ERM is similar to that of RF-Oja and RF-ERM in terms of overall runtime. However, due to its larger space complexity of $O(n^2)$, ERM becomes infeasible for large-scale problems; this makes a case for streaming/stochastic approximation algorithms.

Finally, we note that the iterates of RF-ERM and RF-Oja reduce the objective as they approach the maximizer of the population objective from above. Although this might seem counter-intuitive, recall that the outputs of RF-ERM and RF-Oja are not necessarily projection operators. Hence, they can achieve a higher objective than the maximum.
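This last point is easy to reproduce in isolation. The following toy example (illustrative only; not tied to the MNIST setup) shows that an operator that is close to, but not exactly, a rank-$k$ projection can exceed the best projection's objective $\langle P, C\rangle = \sum_{i \le k} \lambda_i$:

```python
import numpy as np

# Toy covariance with eigenvalues 3, 2, 1.
C = np.diag([3.0, 2.0, 1.0])

# The best rank-1 projection captures the top eigenvalue: <P, C> = tr(PC) = 3.
P = np.zeros((3, 3)); P[0, 0] = 1.0
best = np.trace(P @ C)

# A nearby operator that is NOT a projection (leading entry slightly above 1)
# achieves a strictly larger objective value.
M = P.copy(); M[0, 0] = 1.1
assert np.trace(M @ C) > best
```

As the sample size grows, the bound in Theorem 4.2(b) forces such iterates back toward the set of genuine rank-$k$ projections, which is why the curves settle at the population maximum from above.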
However, as guaranteed by Corollary 4.4, the outputs of both algorithms converge to a projection operator as more training samples are introduced.

[3] The kernel matrix is computed in an online fashion for computational efficiency.
[4] Runtime is recorded in a controlled environment; each run is executed on an identical, unloaded compute node.

Acknowledgements

This research was supported in part by NSF BIGDATA grant IIS-1546482.

References

Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. arXiv preprint arXiv:1607.07837, 2016a.

Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even faster SVD decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974–982, 2016b.

Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. In Allerton Conference, pages 861–868. Citeseer, 2012.

Raman Arora, Andy Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems, NIPS, 2013.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Localized Rademacher complexities. In International Conference on Computational Learning Theory, pages 44–58. Springer, 2002.

Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66(2-3):259–294, 2007.

Petros Drineas and Michael W. Mahoney.
On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.

Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243–264, 2001.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.

Mina Ghashami, Daniel J. Perry, and Jeff Phillips. Streaming kernel principal component analysis. In Artificial Intelligence and Statistics, pages 1365–1374, 2016.

Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581–588. ACM, 2013.

David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. arXiv preprint arXiv:1402.0119, 2014.

Pascal Massart. Some applications of concentration inequalities to statistics. In Annales de la Faculté des Sciences de Toulouse: Mathématiques, volume 9, pages 245–303. Université Paul Sabatier, 2000.

Poorya Mianjy and Raman Arora. Stochastic PCA with ℓ2 and ℓ1 regularization. In International Conference on Machine Learning, pages 3528–3536, 2018.

Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313–1320, 2009.

M. Reed and B. Simon.
Methods of Modern Mathematical Physics I: Functional Analysis. Academic Press, New York, 1972.

Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3218–3228, 2017.

Walter Rudin. Fourier analysis on groups. Courier Dover Publications, 2017.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

Dino Sejdinovic and Arthur Gretton. What is an RKHS?, 2012.

John Shawe-Taylor, Christopher K. I. Williams, Nello Cristianini, and Jaz Kandola. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51(7):2510–2522, 2005.

Alex J. Smola and Bernhard Schölkopf. Learning with kernels, volume 4. Citeseer, 1998.

Bharath Sriperumbudur and Nicholas Sterge. Statistical consistency of kernel PCA with random features. arXiv preprint arXiv:1706.06296v1, 2017.

Bharath Sriperumbudur and Nicholas Sterge. Approximate kernel PCA using random features: Computational vs. statistical trade-off. arXiv preprint arXiv:1706.06296v2, 2018.

Joel A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.

Bo Xie, Yingyu Liang, and Le Song. Scale up nonlinear component analysis with doubly stochastic gradients. CoRR, abs/1504.03655, 2015. URL http://arxiv.org/abs/1504.03655.

Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression.
arXiv preprint arXiv:1501.06195, 2015.

Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems, pages 1649–1656, 2006.