{"title": "Scale Up Nonlinear Component Analysis with Doubly Stochastic Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 2341, "page_last": 2349, "abstract": "Nonlinear component analysis such as kernel Principle Component Analysis (KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, but they can not scale up to big datasets. Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high quality solutions, the number of random features should be the same order of magnitude as the number of data points, making such approach not directly applicable to the regime with millions of data points.We propose a simple, computationally efficient, and memory friendly algorithm based on the ``doubly stochastic gradients'' to scale up a range of kernel nonlinear component analysis, such as kernel PCA, CCA and SVD. Despite the \\emph{non-convex} nature of these problems, our method enjoys theoretical guarantees that it converges at the rate $\\Otil(1/t)$ to the global optimum, even for the top $k$ eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets.", "full_text": "Scale Up Nonlinear Component Analysis with\n\nDoubly Stochastic Gradients\n\nBo Xie1, Yingyu Liang2, Le Song1\n1Georgia Institute of Technology\n\nbo.xie@gatech.edu, lsong@cc.gatech.edu\n\n2Princeton University\n\nyingyul@cs.princeton.edu\n\nAbstract\n\nNonlinear component analysis such as kernel Principle Component Analysis\n(KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in\nmachine learning, statistics and data analysis, but they cannot scale up to big\ndatasets. 
Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high-quality solutions, the number of random features should be of the same order of magnitude as the number of data points, making such an approach not directly applicable to the regime with millions of data points.
We propose a simple, computationally efficient, and memory-friendly algorithm based on the "doubly stochastic gradients" to scale up a range of kernel nonlinear component analysis methods, such as kernel PCA, CCA and SVD. Despite the non-convex nature of these problems, our method enjoys theoretical guarantees that it converges at the rate Õ(1/t) to the global optimum, even for the top k eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets.

1 Introduction

Scaling up nonlinear component analysis has been challenging due to prohibitive computation and memory requirements. Recently, methods such as Randomized Component Analysis (RCA) [12] are able to scale to larger datasets by leveraging random feature approximation. Such methods approximate the kernel function by using explicit random feature mappings, then perform subsequent steps in the primal form, resulting in linear computational complexity. Nonetheless, theoretical analysis [18, 12] shows that in order to get high quality results, the number of random features should grow linearly with the number of data points. Experimentally, one often sees that the statistical performance of the algorithm improves as one increases the number of random features.
Another approach to scaling up kernel component analysis is to use stochastic gradient descent and online updates [15, 16].
These stochastic methods have also been extended to the kernel case [9, 5, 8]. They require much less computation than their batch counterparts, converge at an O(1/t) rate, and are naturally applicable to the streaming data setting. Despite that, they share a severe drawback: all data points used in the updates need to be saved, rendering them impractical for large datasets.
In this paper, we propose to use the "doubly stochastic gradients" for nonlinear component analysis. This technique is a general framework for scaling up kernel methods [6] for convex problems and has been successfully applied to many popular kernel machines such as kernel SVM, kernel ridge regression, and Gaussian processes. It uses two types of stochastic approximation simultaneously: random data points instead of the whole dataset (as in stochastic update rules), and random features instead of the true kernel functions (as in RCA). These two approximations lead to the following benefits: (1) Computation efficiency The key computation is the generation of a mini-batch of random features and the evaluation of them on a mini-batch of data points, which is very efficient. (2) Memory efficiency Instead of storing training data points, we just keep a small program for regenerating the random features, and sample previously used random features according to pre-specified random seeds. This leads to huge savings: the memory requirement up to step t is O(t), independent of the dimension of the data. (3) Adaptivity Unlike other approaches that can only work with a number of random features fixed beforehand, the doubly stochastic approach is able to increase the model complexity by using more features when new data points arrive, and thus enjoys the advantages of nonparametric methods.
Although at first look our method appears similar to the approach in [6], the two methods are fundamentally different.
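The memory-efficiency trick above (point (2)) can be made concrete: only integer seeds are stored, and each random feature is regenerated deterministically on demand. A minimal sketch, assuming a Gaussian RBF kernel with random Fourier features (the function name and bandwidth are illustrative, not from the paper):

```python
import numpy as np

def regenerate_feature(x, seed, sigma=1.0):
    """Re-create the random Fourier feature phi_omega(x) for a Gaussian RBF
    kernel from nothing but its integer seed: the frequency omega and the
    phase b are re-drawn deterministically, so no feature is ever stored."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(0.0, 1.0 / sigma, size=x.shape[0])
    b = rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(2.0) * np.cos(omega @ x + b)

# Storing only seeds 0..t-1 (O(t) integers, independent of the data
# dimension) suffices to re-evaluate every feature used so far.
x = np.ones(10)
values = [regenerate_feature(x, seed) for seed in range(5)]
```

The design point is that the seed acts as a compressed representation of the feature: regeneration trades a little computation for memory that no longer depends on the data dimension.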
In [6], they address convex problems, whereas our problem is highly non-convex. The convergence result in [6] crucially relies on properties of convex functions, which do not translate to our problem. Instead, our analysis centers around the stochastic update of power iterations, which uses a different set of proof techniques.
In this paper, we make the following contributions.
General framework We show that the general framework of doubly stochastic updates can be applied to various kernel component analysis tasks, including KPCA, KSVD and KCCA.
Strong theoretical guarantee We prove that the finite time convergence rate of the doubly stochastic approach is Õ(1/t). This is a significant result since 1) the global convergence result is w.r.t. a non-convex problem; 2) the guarantee is for update rules without explicit orthogonalization. Previous works require explicit orthogonalization, which is impractical for kernel methods on large datasets.
Strong empirical performance Our algorithm can scale to datasets with millions of data points. Moreover, the algorithm can often find much better solutions thanks to the ability to use many more random features. We demonstrate such benefits on both synthetic and real world datasets.
Since kernel PCA is a typical task, we focus on it in the paper and provide a description of other tasks in Section 4.3. Although we only state the guarantee for kernel PCA, the analysis naturally carries over to the other tasks.

2 Related work

Many efforts have been devoted to scaling up kernel methods. The random feature approach [17, 18] approximates the kernel function with explicit random feature mappings and solves the problem in primal form, thus circumventing the quadratic computational complexity. It has been applied to various kernel methods [11, 6, 12], among which the most related to our work is RCA [12].
One drawback of RCA is that its theoretical guarantees are only for kernel matrix approximation: they do not say anything about how close the solution obtained from RCA is to the true solution. In contrast, we provide a finite time convergence rate showing how our solution approaches the true solution. In addition, even though a moderate number of random features can work well for tens of thousands of data points, datasets with tens of millions of data points require many more random features. Our online approach allows the number of random features, and hence the flexibility of the function class, to grow with the number of data points. This makes our method suitable for the data streaming setting, which is not possible for previous approaches.
Online algorithms for PCA have a long history. Oja proposed two stochastic update rules for approximating the first eigenvector and provided convergence proofs in [15, 16], respectively. These rules have been extended to the generalized Hebbian update rules [19, 20, 3] that compute the top k eigenvectors (the subspace case). Similar rules have also been derived from the perspective of optimization and stochastic gradient descent [20, 2]. They have been further generalized to the kernel case [9, 5, 8]. However, online kernel PCA needs to store all the training data, which is impractical for large datasets. Our doubly stochastic method avoids this problem by using random features and keeping only a small program for regenerating previously used random features according to pre-specified seeds. As a result, it can scale up to tens of millions of data points.
For finite time convergence rates, [3] proved the O(1/t) rate for the top eigenvector in linear PCA using Oja's rule. For the same task, [21] proposed a noise-reduced PCA with a linear convergence rate, where the rate is in terms of epochs, i.e., the number of passes over the whole dataset.
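For intuition about this line of work, Oja's stochastic update for the top eigenvector in linear PCA can be sketched in a few lines. This is a simplified illustration of the classical rule (not the kernelized algorithm of this paper); the data, step-size constant, and convergence check are illustrative assumptions:

```python
import numpy as np

def oja_top_eigenvector(X, eta0=1.0, n_passes=5, seed=0):
    """Oja's rule for the top eigenvector of the covariance:
    w <- w + eta_t * x_t (x_t^T w), followed by renormalization.
    With k = 1, normalization plays the role of orthogonalization."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            t += 1
            eta = eta0 / t                  # O(1/t) step-size schedule
            w += eta * X[i] * (X[i] @ w)    # stochastic gradient step
            w /= np.linalg.norm(w)
    return w

# Toy data whose covariance has its top eigenvector along the first axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 0.5])
w = oja_top_eigenvector(X)
```

The O(1/t) rate proved in [3] for this rule is the benchmark that the subspace guarantee in Section 5 matches, without the per-step normalization being available in the kernel setting.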
The noisy power method presented in [7] provided linear convergence for a subspace, although it only converges linearly to a constant error level. In addition, the updates require explicit orthogonalization, which is impractical for kernel methods. In comparison, our method converges at O(1/t) for a subspace, without the need for orthogonalization.

Algorithm 1: {α_i}_{i=1}^t = DSGD-KPCA(P(x), k)
Require: P(ω), φ_ω(x).
1: for i = 1, . . . , t do
2:   Sample x_i ∼ P(x).
3:   Sample ω_i ∼ P(ω) with seed i.
4:   h_i = Evaluate(x_i, {α_j}_{j=1}^{i−1}) ∈ R^k.
5:   α_i = η_i φ_{ω_i}(x_i) h_i.
6:   α_j = α_j − η_i (α_j^⊤ h_i) h_i, for j = 1, . . . , i − 1.
7: end for

Algorithm 2: h = Evaluate(x, {α_i}_{i=1}^t)
Require: P(ω), φ_ω(x).
1: Set h = 0 ∈ R^k.
2: for i = 1, . . . , t do
3:   Sample ω_i ∼ P(ω) with seed i.
4:   h = h + φ_{ω_i}(x) α_i.
5: end for

3 Preliminaries

Kernels and Covariance Operators A kernel k(x, y) : X × X → R is a function that is positive-definite (PD), i.e., for all n > 1, c_1, . . . , c_n ∈ R, and x_1, . . . , x_n ∈ X, we have Σ_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0. A reproducing kernel Hilbert space (RKHS) F on X is a Hilbert space of functions from X to R. F is an RKHS if and only if there exists a k(x, x′) such that ∀x ∈ X, k(x, ·) ∈ F, and ∀f ∈ F, ⟨f(·), k(x, ·)⟩_F = f(x). Given P(x) and k(x, x′) with RKHS F, the covariance operator A : F → F is a linear self-adjoint operator defined as Af(·) := E_x[f(x) k(x, ·)], ∀f ∈ F, and furthermore ⟨g, Af⟩_F = E_x[f(x) g(x)], ∀g ∈ F.
Let F = (f_1(·), f_2(·), . . . , f_k(·)) be a list of k functions in the RKHS, and define the matrix-like notation AF(·) := (Af_1(·), . . . , Af_k(·)); F^⊤AF is a k × k matrix whose (i, j)-th element is ⟨f_i, Af_j⟩_F. The outer product of a function v ∈ F defines a linear operator vv^⊤ such that (vv^⊤)f(·) := ⟨v, f⟩_F v(·), ∀f ∈ F. Let V = (v_1(·), . . . , v_k(·)) be a list of k functions; then the weighted sum of a set of linear operators can be denoted with the matrix-like notation V Σ_k V^⊤ := Σ_{i=1}^k λ_i v_i v_i^⊤, where Σ_k is a diagonal matrix with λ_i on the i-th entry of the diagonal.
Eigenfunctions and Kernel PCA A function v is an eigenfunction of A with corresponding eigenvalue λ if Av(·) = λv(·). Given a set of eigenfunctions {v_i} and associated eigenvalues {λ_i}, where ⟨v_i, v_j⟩_F = δ_ij, we can write the eigen-decomposition as A = V Σ_k V^⊤ + V_⊥ Σ_⊥ V_⊥^⊤, where V is the list of the top k eigenfunctions, Σ_k is a diagonal matrix with the corresponding eigenvalues, V_⊥ is the list of the rest of the eigenfunctions, and Σ_⊥ is a diagonal matrix with the rest of the eigenvalues.
Kernel PCA aims to identify the top k subspace V. In the finite data case, the empirical covariance operator is A = (1/n) Σ_{i=1}^n k(x_i, ·) ⊗ k(x_i, ·). According to the representer theorem, we have v_i = Σ_{j=1}^n α_j^i k(x_j, ·), where {α^i}_{i=1}^k ⊂ R^n are weights for the data points.
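The representer theorem above is what reduces finite-data KPCA to an n × n eigenproblem on the Gram matrix in the dual form, which is exactly the computation that does not scale. A small sketch of that baseline (uncentered Gaussian RBF kernel, with an illustrative bandwidth):

```python
import numpy as np

def dual_kpca(X, k, sigma=1.0):
    """Dual-form kernel PCA baseline: eigendecompose the n x n Gram matrix
    (K alpha = lambda alpha). Costs O(n^2) memory and O(n^3) time, which is
    what prevents scaling to large n."""
    sq = np.sum(X * X, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / (2.0 * sigma**2))
    lam, alpha = np.linalg.eigh(K)             # eigenvalues in ascending order
    return lam[::-1][:k], alpha[:, ::-1][:, :k]

X = np.random.default_rng(0).normal(size=(60, 2))
lam, alpha = dual_kpca(X, k=3)
```

Each column of `alpha` is the coefficient vector α^i of one eigenfunction over the data points, matching the expansion v_i = Σ_j α_j^i k(x_j, ·).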
Using Av(·) = λv(·) and the kernel trick, we have Kα_i = λ_i α_i, where K is the n × n Gram matrix.
Random feature approximation The random feature approximation for shift-invariant kernels k(x, y) = k(x − y), e.g., the Gaussian RBF kernel, relies on the identity k(x − y) = ∫_{R^d} e^{iω^⊤(x−y)} dP(ω) = E[φ_ω(x) φ_ω(y)], which holds since the Fourier transform of a PD function is non-negative and thus can be considered as a (scaled) probability measure [17]. We can therefore approximate the kernel function as an empirical average of samples from the distribution. In other words, k(x, y) ≈ (1/B) Σ_{i=1}^B φ_{ω_i}(x) φ_{ω_i}(y), where {ω_i}_{i=1}^B are i.i.d. samples drawn from P(ω). For the Gaussian RBF kernel, k(x − x′) = exp(−‖x − x′‖²/2σ²), this yields a Gaussian distribution P(ω) ∝ exp(−σ²‖ω‖²/2). See [17] for more details.

4 Algorithm

In this section, we describe an efficient algorithm based on the "doubly stochastic gradients" to scale up kernel PCA. KPCA is essentially an eigenvalue problem in a functional space. Traditional approaches convert it to the dual form, leading to another eigenvalue problem whose size equals the number of training points, which is not scalable. Other approaches solve it in the primal form with stochastic functional gradient descent. However, these algorithms need to store all the training points seen so far. They quickly run into memory issues when working with millions of data points.
We propose to tackle the problem with "doubly stochastic gradients", in which we make two unbiased stochastic approximations. One stochasticity comes from sampling data points as in stochastic gradient descent. Another source of stochasticity comes from using random features to approximate the kernel.
One technical difficulty in designing doubly stochastic KPCA is the explicit orthogonalization step required in the update rules, which ensures that the top k eigenfunctions are orthogonal. This is infeasible for kernel methods on a large dataset since it requires solving an increasingly larger KPCA problem in every iteration. To solve this problem, we formulate the orthogonality constraints as Lagrange multipliers, which leads to an Oja-style update rule. The new update enjoys small per-iteration complexity and converges to the ground-truth subspace.
We present the algorithm by first deriving the stochastic functional gradient update without random feature approximations, then introducing the doubly stochastic updates. For simplicity of presentation, the following description uses one data point and one random feature at a time, but typically a mini-batch of data points and random features are used in each iteration.

4.1 Stochastic functional gradient update

Kernel PCA can be formulated as the following non-convex optimization problem

max_G tr(G^⊤AG) s.t. G^⊤G = I,  (1)

where G := (g^1, . . . , g^k) and g^i is the i-th function. Gradient descent on the Lagrangian leads to G_{t+1} = G_t + η_t (I − G_t G_t^⊤) A G_t. Using a stochastic approximation for A, A_t f(·) = f(x_t) k(x_t, ·), we have A_t G_t = k(x_t, ·) g_t^⊤ and G_t^⊤ A_t G_t = g_t g_t^⊤, where g_t = [g_t^1(x_t), . . . , g_t^k(x_t)]^⊤. Therefore, the update rule is

G_{t+1} = G_t (I − η_t g_t g_t^⊤) + η_t k(x_t, ·) g_t^⊤.  (2)

This rule can also be derived using stochastic gradient and Oja's rule [15, 16].

4.2 Doubly stochastic update

The update rule (2) must store all the data points it has seen so far, which is impractical for large scale datasets. To address this issue, we use the random feature approximation k(x, ·) ≈ φ_ω(x) φ_ω(·). Denoting by H_t the function we get at iteration t, the update rule becomes

H_{t+1} = H_t (I − η_t h_t h_t^⊤) + η_t φ_{ω_t}(x_t) φ_{ω_t}(·) h_t^⊤,  (3)

where h_t is the evaluation of H_t at the current data point: h_t = [h_t^1(x_t), . . . , h_t^k(x_t)]^⊤. The specific updates in terms of the coefficients are summarized in Algorithms 1 and 2. Note that in theory new random features are drawn in each iteration, but in practice one can revisit these random features.

4.3 Extensions

Locating individual eigenfunctions The algorithm only finds the eigen subspace, but not necessarily individual eigenfunctions. A modified version, called the Generalized Hebbian Algorithm (GHA) [19], can be used for this purpose: G_{t+1} = G_t + η_t A_t G_t − η_t G_t UT[G_t^⊤ A_t G_t], where UT[·] is an operator that sets the lower triangular parts to zero.
Latent variable models and kernel SVD Recently, spectral methods have been proposed to learn latent variable models with provable guarantees [1, 22], in which the key computation is SVD. Our algorithm can be straightforwardly extended to solve kernel SVD, with two simultaneous update rules. The algorithm is summarized in Algorithm 3.
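The coefficient updates of Algorithms 1 and 2 can be sketched in code as follows. This is a simplified single-feature-per-iteration version for the Gaussian RBF kernel; the step-size schedule, bandwidth, and the random initialization standing in for a good starting subspace are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def phi(x, seed, sigma=1.0):
    """Random Fourier feature phi_omega(x), regenerated from its seed."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(0.0, 1.0 / sigma, size=x.shape[0])
    b = rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(2.0) * np.cos(omega @ x + b)

def evaluate(x, alphas):
    """Algorithm 2: h(x) = sum_i phi_{omega_i}(x) * alpha_i, in R^k."""
    h = np.zeros(alphas.shape[1])
    for i in range(alphas.shape[0]):
        h += phi(x, seed=i) * alphas[i]
    return h

def dsgd_kpca(sample_x, k, t, eta0=0.5, seed=0):
    """Algorithm 1: doubly stochastic update for the top-k eigen subspace.
    A random h at step 0 stands in for the good initialization assumed
    in the analysis."""
    rng = np.random.default_rng(seed)
    alphas = np.zeros((0, k))
    for i in range(t):
        x = sample_x()
        h = rng.normal(size=k) if i == 0 else evaluate(x, alphas)
        eta = eta0 / (1.0 + i)
        # alpha_j <- alpha_j - eta_i (alpha_j^T h_i) h_i, for j < i
        alphas -= eta * np.outer(alphas @ h, h)
        # alpha_i = eta_i phi_{omega_i}(x_i) h_i
        alphas = np.vstack([alphas, eta * phi(x, seed=i) * h])
    return alphas

data_rng = np.random.default_rng(1)
alphas = dsgd_kpca(lambda: data_rng.normal(size=2), k=2, t=40)
```

Note how `evaluate` never touches stored features or data points: feature i is regenerated from seed i, so memory after t steps is the t coefficient vectors, as claimed in the introduction.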
See the supplementary for derivation details.
Kernel CCA and generalized eigenvalue problem Given two variables X and Y, CCA finds two projections such that the correlations between the two projected variables are maximized. It is equivalent to a generalized eigenvalue problem, which can also be solved in our framework. We present the updates for the coefficients in Algorithm 4, and derivation details in the supplementary.
Kernel sliced inverse regression Kernel sliced inverse regression [10] aims to do sufficient dimension reduction, in which the found low dimension representation preserves the statistical correlation with the targets. It also reduces to a generalized eigenvalue problem, and has been shown to find the same subspace as KCCA [10].

Algorithm 3: DSGD-KSVD
Require: P(ω), φ_ω(x), k.
Output: {α_i, β_i}_{i=1}^t
1: for i = 1, . . . , t do
2:   Sample x_i ∼ P(x). Sample y_i ∼ P(y).
3:   Sample ω_i ∼ P(ω) with seed i.
4:   u_i = Evaluate(x_i, {α_j}_{j=1}^{i−1}) ∈ R^k.
5:   v_i = Evaluate(y_i, {β_j}_{j=1}^{i−1}) ∈ R^k.
6:   W = u_i v_i^⊤ + v_i u_i^⊤
7:   α_i = η_i φ_{ω_i}(x_i) v_i.
8:   β_i = η_i φ_{ω_i}(y_i) u_i.
9:   α_j = α_j − η_i W α_j, for j = 1, . . . , i − 1.
10:  β_j = β_j − η_i W β_j, for j = 1, . . . , i − 1.
11: end for

Algorithm 4: DSGD-KCCA
Require: P(ω), φ_ω(x), k.
Output: {α_i, β_i}_{i=1}^t
1: for i = 1, . . . , t do
2:   Sample x_i ∼ P(x). Sample y_i ∼ P(y).
3:   Sample ω_i ∼ P(ω) with seed i.
4:   u_i = Evaluate(x_i, {α_j}_{j=1}^{i−1}) ∈ R^k.
5:   v_i = Evaluate(y_i, {β_j}_{j=1}^{i−1}) ∈ R^k.
6:   W = u_i v_i^⊤ + v_i u_i^⊤
7:   α_i = η_i φ_{ω_i}(x_i) [v_i − W u_i].
8:   β_i = η_i φ_{ω_i}(y_i) [u_i − W v_i].
9: end for

5 Analysis

In this section, we provide finite time convergence guarantees for our algorithm. As discussed in the previous section, explicit orthogonalization is not scalable in the kernel case; therefore, we need to provide guarantees for the updates without orthogonalization. This challenge is even more prominent when using random features, since they introduce additional variance.
Furthermore, our guarantees are w.r.t. the top k-dimension subspace. Although convergence without normalization for the top eigenvector has been established before [15, 16], the subspace case is complicated by the fact that there are k angles between k-dimension subspaces, and we need to bound the largest angle. To the best of our knowledge, our result is the first finite time convergence result for a subspace without explicit orthogonalization.
Note that even though our algorithm appears similar to [6] on the surface, the underlying analysis is fundamentally different. In [6], the result only applies to convex problems, where every local optimum is a global optimum, while the problems we consider are highly non-convex. As a result, many techniques that [6] builds upon are not applicable.
Conditions and Assumptions We will focus on the case when a good initialization V_0 is given:

V_0^⊤ V_0 = I, cos² θ(V, V_0) ≥ 1/2.  (4)

In other words, we analyze the later stage of the convergence, which is typical in the literature (e.g., [21]). The early stage can be analyzed using established techniques (e.g., [3]).
In practice, one can achieve a good initialization by solving a small RCA problem [12] with, e.g., thousands of data points and random features.
Throughout the paper we suppose |k(x, x′)| ≤ κ and |φ_ω(x)| ≤ φ, and regard κ and φ as constants. Note that this is true for all the kernels and corresponding random features considered. We further regard the eigengap λ_k − λ_{k+1} as a constant, which is also true for typical applications and datasets.

5.1 Update without random features

Our guarantee is on the cosine of the principal angle between the computed subspace and the ground truth eigen subspace (also called the potential function): cos² θ(V, G_t) = min_w ‖V^⊤ G_t w‖² / ‖G_t w‖².
Consider the two different update rules, one with explicit orthogonalization and one without:

F_{t+1} ← orth(F_t + η_t A_t F_t)
G_{t+1} ← G_t + η_t (I − G_t G_t^⊤) A_t G_t

where A_t is the empirical covariance of a mini-batch. Our final guarantee for G_t is the following.
Theorem 1 Assume (4) and suppose the mini-batch sizes satisfy that for any 1 ≤ i ≤ t, ‖A − A_i‖ < (λ_k − λ_{k+1})/8. There exist step sizes η_i = O(1/i) such that

1 − cos² θ(V, G_t) = O(1/t).

The convergence rate O(1/t) is of the same order as that of computing only the top eigenvector in linear PCA [3]. The bound requires that the mini-batch size is large enough so that the spectral norm of A is approximated up to the order of the eigengap. This is because the increase of the potential is of the order of the eigengap. Similar terms appear in the analysis of the noisy power method [7], which, however, requires orthogonalization and is not suitable for the kernel case. We do not specify the mini-batch size, but by assuming suitable data distributions, it is possible to obtain explicit bounds; see for example [23, 4].
Proof sketch We first prove the guarantee for the orthogonalized subspace F_t, which is more convenient to analyze, and then show that the updates for F_t and G_t are first order equivalent, so G_t enjoys the same guarantee. To do so, we will require Lemmas 2 and 3 below.
Lemma 2 1 − cos² θ(V, F_t) = O(1/t).
Let c_t² denote cos² θ(V, F_t); then a key step in proving the lemma is to show the following recurrence

c_{t+1}² ≥ c_t² (1 + 2η_t (λ_k − λ_{k+1} − 2‖A − A_t‖)(1 − c_t²)) − O(η_t²).  (5)

We will need the mini-batch size large enough so that 2‖A − A_t‖ is smaller than the eigengap.
Another key element in the proof of the theorem is the first order equivalence of the two update rules. To show this, we introduce F(G_t) ← orth(G_t + η_t A_t G_t) to denote the subspace obtained by applying the update rule of F_t to G_t. We show that the potentials of G_{t+1} and F(G_t) are close:
Lemma 3 cos² θ(V, G_{t+1}) = cos² θ(V, F(G_t)) ± O(η_t²).
The lemma means that applying the two update rules to the same input will result in two subspaces with similar potentials. Then by (5), we have 1 − cos² θ(V, G_t) = O(1/t), which leads to our theorem. The proof of Lemma 3 is based on the observation that cos² θ(V, X) = λ_min(V^⊤ X (X^⊤ X)^{−1} X^⊤ V). Comparing the Taylor expansions w.r.t. η_t for X = G_{t+1} and X = F(G_t) leads to the lemma.

5.2 Doubly stochastic update

The H_t computed in the doubly stochastic update is no longer in the RKHS, so the principal angle is not well defined. Instead, we will compare the evaluations of functions from H_t and from the true principal subspace V on a point x.
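The subspace potential from Section 5.1 can be computed numerically via the expression used in the proof of Lemma 3, cos² θ(V, X) = λ_min(V^⊤ X (X^⊤ X)^{−1} X^⊤ V). A small finite-dimensional sketch (the helper name and toy bases are illustrative; V is assumed orthonormal):

```python
import numpy as np

def cos2_theta(V, X):
    """Squared cosine of the largest principal angle between span(V)
    and span(X). V: (d, k) orthonormal basis of the true subspace;
    X: (d, k) basis of the estimate (need not be orthonormal)."""
    M = V.T @ X @ np.linalg.solve(X.T @ X, X.T @ V)
    return float(np.min(np.linalg.eigvalsh(M)))

V = np.eye(5)[:, :2]                            # true subspace: first two axes
same = V @ np.array([[2.0, 1.0], [0.0, 1.0]])   # same span, skewed basis
orth = np.eye(5)[:, 2:4]                        # a completely orthogonal subspace
```

The (X^⊤ X)^{−1} factor is what makes the potential invariant to the choice of basis, which is why the guarantee can be stated for the unorthogonalized iterates G_t.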
Formally, we show that for any function v ∈ V with unit norm ‖v‖_F = 1, there exists a function h in H_t such that for any x, err := |v(x) − h(x)|² is small with high probability.
To do so, we need to introduce a companion update rule, G̃_{t+1} ← G̃_t + η_t k(x_t, ·) h_t^⊤ − η_t G̃_t h_t h_t^⊤, which results in a function in the RKHS, but the update makes use of function values from h_t ∈ H_t, which is outside the RKHS. Let w = G̃_t^⊤ v be the coefficients of v projected onto G̃_t, h = H_t w, and z = G̃_t w. Then the error can be decomposed as

|v(x) − h(x)|² = |v(x) − z(x) + z(x) − h(x)|² ≤ 2|v(x) − z(x)|² + 2|z(x) − h(x)|² ≤ 2κ² ‖v − z‖²_F (term I: Lemma 5) + 2|z(x) − h(x)|² (term II: Lemma 6).  (6)

By definition, ‖v − z‖²_F = ‖v‖²_F − ‖z‖²_F ≤ 1 − cos² θ(V, G̃_t), so the first error term can be bounded by the guarantee on G̃_t, which can be obtained by arguments similar to those of Theorem 1. For the second term, note that G̃_t is defined in such a way that the difference between z(x) and h(x) is a martingale, which can be bounded by careful analysis.
Theorem 4 Assume (4) and suppose the mini-batch sizes satisfy that for any 1 ≤ i ≤ t, ‖A − A_i‖ < (λ_k − λ_{k+1})/8 and are of order Ω(ln(t/δ)). There exist step sizes η_i = O(1/i) such that the following holds. If Ω(1) = λ_k(G̃_i^⊤ G̃_i) ≤ λ_1(G̃_i^⊤ G̃_i) = O(1) for all 1 ≤ i ≤ t, then for any x and any function v in the span of V with unit norm ‖v‖_F = 1, we have that with probability at least 1 − δ, there exists h in the span of H_t satisfying |v(x) − h(x)|² = O((1/t) ln(t/δ)).
The point-wise error scales as Õ(1/t) with the step t. Besides the condition that ‖A − A_i‖ is up to the order of the eigengap, we additionally need that the random features approximate the kernel function up to constant accuracy on all the data points up to time t, which eventually leads to the Ω(ln(t/δ)) mini-batch sizes. Finally, we need G̃_i^⊤ G̃_i to be roughly isotropic, i.e., G̃_i to be roughly orthonormal. Intuitively, this should be true for the following reasons: G̃_0 is orthonormal; the update for G̃_t is close to that for G_t, which in turn is close to that for F_t, which is orthonormal.

Figure 1: (a) Convergence of our algorithm on the synthetic dataset. It is on par with the Õ(1/t) rate denoted by the dashed red line. (b) Recovery of the top three eigenfunctions. Our algorithm (in red) matches the ground truth (dashed blue).
Figure 2: Visualization of the molecular space dataset by the first two principal components. Bluer dots represent lower PCE values while redder dots are for higher PCE values. (a) Kernel PCA; (b) linear PCA. (Best viewed in color)

Proof sketch In order to bound term I in (6), we show:
Lemma 5 1 − cos² θ(V, G̃_t) = O((1/t) ln(t/δ)).
This is proved by following arguments similar to those for the recurrence (5), except with an additional error term, which is caused by the fact that the update rule for G̃_{t+1} uses the evaluation h_t(x_t) rather than g̃_t(x_t). Bounding this additional term thus relies on bounding the difference h_t(x) − g̃_t(x), which is also what we need for bounding term II in (6).
For this, we show:
Lemma 6 For any x and unit vector w, with probability ≥ 1 − δ over (D_t, ω_t), |g̃_t(x)w − h_t(x)w|² = O((1/t) ln(t/δ)).
The key to proving this lemma is that our construction of G̃_t makes sure that the difference between g̃_t(x)w and h_t(x)w consists of their differences at each time step. Furthermore, the difference forms a martingale and thus can be bounded by Azuma's inequality. See the supplementary for the details.

6 Experiments

Synthetic dataset with analytical solution We first verify the convergence rate of DSGD-KPCA on a synthetic dataset with an analytical solution for the eigenfunctions [24]. If the data follow a Gaussian distribution and we use a Gaussian kernel, then the eigenfunctions are given by the Hermite polynomials. We generated 1 million such 1-D data points, and ran DSGD-KPCA with a total of 262144 random features. In each iteration, we use a data mini-batch of size 512 and a random feature mini-batch of size 128. After all random features are generated, we revisit and adjust the coefficients of existing random features. The kernel bandwidth is set to the true bandwidth. The step size is scheduled as η_t = θ_0/(1 + θ_1 t), where θ_0 and θ_1 are two parameters. We use a small θ_1 ≈ 0.01 such that in early stages the step size is large enough to arrive at a good initial solution. Figure 1a shows the convergence rate of the proposed algorithm seeking the top k = 3 subspace. The potential function is the squared sine of the principal angle. We can see the algorithm indeed converges at the rate O(1/t). Figure 1b shows the recovered top k = 3 eigenfunctions compared with the ground truth. The solution coincides with one eigenfunction, and deviates only slightly from the two others.
Kernel PCA visualization on molecular space dataset The MolecularSpace dataset contains 2.3 million molecular motifs [6].
We are interested in visualizing the dataset with KPCA. The data are represented by sorted Coulomb matrices of size 75 × 75 [14]. Each molecule also has an attribute called power conversion efficiency (PCE). We use a Gaussian kernel with bandwidth chosen by the "median trick". We ran kernel PCA with a total of 16384 random features, with a feature mini-batch size of 512 and a data mini-batch size of 1024. We ran 4000 iterations with step size η_t = 1/(1 + 0.001t). Figure 2 presents the visualization obtained by projecting the data onto the top two principal components. Compared with linear PCA, KPCA shrinks the distances between the clusters and brings out the important structures in the dataset. We can also see that higher PCE values tend to lie towards the center of the ring structure.
Nonparametric latent variable model We learn latent variable models with DSGD-KSVD using one million data points [22], achieving higher quality solutions compared with two other approaches. The dataset consists of two latent components: one is a Gaussian distribution and the other a Gamma distribution with shape parameter α = 1.2. DSGD-KSVD uses a total of 8192 random features, with a feature mini-batch of size 256 and a data mini-batch of size 512. We compare with 1) random Fourier features and 2) random Nystrom features, both with a fixed set of 2048 functions [12].

Figure 3: Recovered latent components: (a) DSGD-KSVD, (b) 2048 random features, (c) 2048 Nystrom features.
Figure 4: Comparison on the KUKA dataset.

Figure 3 shows the learned conditional distributions for each component.
We can see that DSGD-KSVD achieves almost perfect recovery, while the Fourier and Nyström random feature methods either confuse high-density areas or incorrectly estimate the spread of the conditional distributions.

KCCA MNIST8M  We compare our algorithm on MNIST8M in the KCCA task; the dataset consists of 8.1 million digits and their transformations. We divide each image into its left and right halves, and learn their correlations. The evaluation criterion is the total correlation along the top k = 50 canonical correlation directions, measured on a separate test set of size 10000. We compare with 1) random Fourier and 2) random Nyström features on both total correlation and running time. Our algorithm uses a total of 20480 features, with feature mini-batches of size 2048 and data mini-batches of size 1024, and runs for 3000 iterations. The kernel bandwidth is set using the "median trick" and is the same for all methods. All algorithms are run 5 times, and the mean is reported. The results are presented in Table 1. Our algorithm achieves the best test-set correlations in run time comparable to random Fourier features. This is especially significant for random Fourier features, since their run time would increase by almost four times if the number of features were doubled. In addition, Nyström features generally achieve better results than Fourier features since they are data dependent.
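For intuition about this evaluation criterion: the canonical correlations of two views are the singular values of the whitened cross-covariance, and the criterion sums the top k of them. A hedged NumPy sketch (the regularization constant and function name are ours; the paper evaluates on random-feature representations of the image halves, not shown here):

```python
import numpy as np

def total_correlation(X, Y, k, reg=1e-6):
    """Sum of the top-k canonical correlations between views X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each view with its inverse Cholesky factor; the singular
    # values of the whitened cross-covariance are the canonical correlations.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s[:k]))
```

Two identical views give a total correlation near k, while independent views give a value near zero, which is why larger numbers in Table 1 indicate better recovered directions.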
We can also see that for large datasets, it is important to use more random features for better performance.

Table 1: KCCA results on MNIST 8M (top 50 largest correlations)

              Random features      Nyström features
# of feat     corrs.   minutes     corrs.   minutes
256            25.2      3.2        30.4      3.0
512            30.7      7.0        35.3      5.1
1024           35.3     13.9        38.0     10.1
2048           38.8     54.3        41.1     27.0
4096           41.5    186.7        42.7     71.0

DSGD-KCCA (20480 features): corrs. 43.5, 183.2 minutes.  Linear CCA: corrs. 27.4, 1.1 minutes.

Kernel sliced inverse regression on KUKA dataset  We evaluate our algorithm in the setting of kernel sliced inverse regression [10], a way to perform sufficient dimension reduction (SDR) for high-dimensional regression. After performing SDR, we fit a linear regression model using the projected input data, and evaluate the mean squared error (MSE). The dataset records rhythmic motions of a KUKA arm at various speeds, representing realistic settings for robots [13]. We use a variant that contains 2 million data points generated by the SL simulator. The KUKA robot has 7 joints, and the high-dimensional regression problem is to predict the torques from the positions, velocities and accelerations of the joints. The input has 21 dimensions while the output has 7 dimensions. Since there are seven independent joints, we set the reduced dimension to seven. We randomly select 20% of the data as a test set, and out of the remaining training set we randomly choose 5000 points as a validation set to select step sizes. The total number of random features is 10240, with both the feature mini-batch and the data mini-batch equal to 1024. We run a total of 2000 iterations using step size ηt = 15/(1 + 0.001t). Figure 4 shows the regression errors for the different methods. The error decreases with more random features, and our algorithm achieves the lowest MSE using 10240 random features.
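The downstream evaluation in this experiment (project the inputs onto the SDR directions, fit linear regression, measure test MSE) can be sketched as follows. The projection matrix B is assumed given, since computing it via kernel sliced inverse regression is beyond this snippet, and all names here are illustrative:

```python
import numpy as np

def sdr_regression_mse(X_train, Y_train, X_test, Y_test, B):
    """Project inputs onto the SDR directions B, fit linear least squares,
    and report test MSE. B would come from (kernel) sliced inverse regression."""
    def project(X):
        Z = X @ B                                      # reduced representation
        return np.hstack([Z, np.ones((len(Z), 1))])    # add intercept column
    W, *_ = np.linalg.lstsq(project(X_train), Y_train, rcond=None)
    resid = project(X_test) @ W - Y_test
    return float(np.mean(resid ** 2))
```

If the true regression function depends on the inputs only through the span of B, this pipeline incurs no loss from the dimension reduction, which is the rationale for setting the reduced dimension to the number of independent joints.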
Nyström features do not perform as well in this setting, probably because the spectrum decays slowly (there are seven independent joints); Nyström features are known to work well when the spectrum decays fast.

Acknowledgments
The research was supported in part by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, NSF CAREER IIS-1350983, NSF CCF-0832797, CCF-1117309, CCF-1302518, DMS-1317308, Simons Investigator Award, and Simons Collaboration Grant.

References
[1] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. CoRR, abs/1204.6703, 2012.
[2] R. Arora, A. Cotter, and N. Srebro. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems, pages 1815–1823, 2013.
[3] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems, pages 3174–3182, 2013.
[4] T. T. Cai and H. H. Zhou. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, 40(5):2389–2420, 2012.
[5] T.-J. Chin and D. Suter. Incremental kernel principal component analysis. IEEE Transactions on Image Processing, 16(6):1662–1674, 2007.
[6] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.-F. F. Balcan, and L. Song. Scalable kernel methods via doubly stochastic gradients.
In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
[7] M. Hardt and E. Price. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pages 2861–2869, 2014.
[8] P. Honeine. Online kernel principal component analysis: A reduced-order model. IEEE Trans. Pattern Anal. Mach. Intell., 34(9):1814–1826, 2012.
[9] K. Kim, M. O. Franz, and B. Schölkopf. Iterative kernel principal component analysis for image modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9):1351–1366, 2005.
[10] M. Kim and V. Pavlovic. Covariance operator based dimensionality reduction with extension to semi-supervised settings. In International Conference on Artificial Intelligence and Statistics, pages 280–287, 2009.
[11] Q. Le, T. Sarlos, and A. J. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In International Conference on Machine Learning, 2013.
[12] D. Lopez-Paz, S. Sra, A. Smola, Z. Ghahramani, and B. Schölkopf. Randomized nonlinear component analysis. In International Conference on Machine Learning (ICML), 2014.
[13] F. Meier, P. Hennig, and S. Schaal. Incremental local Gaussian regression. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 972–980. Curran Associates, Inc., 2014.
[14] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, A. von Lilienfeld, and K.-R. Müller. Learning invariant representations of molecules for atomization energy prediction. In Neural Information Processing Systems, pages 449–457, 2012.
[15] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biology, 15:267–273, 1982.
[16] E. Oja. Subspace methods of pattern recognition. John Wiley and Sons, New York, 1983.
[17] A.
Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
[18] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Neural Information Processing Systems, 2009.
[19] T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward network. Neural Networks, 2:459–473, 1989.
[20] N. N. Schraudolph, S. Günter, and S. V. N. Vishwanathan. Fast iterative kernel PCA. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, June 2007. MIT Press.
[21] O. Shamir. A stochastic PCA algorithm with an exponential convergence rate. arXiv preprint arXiv:1409.2848, 2014.
[22] L. Song, A. Anandkumar, B. Dai, and B. Xie. Nonparametric estimation of multi-view latent variable models. In International Conference on Machine Learning (ICML), 2014.
[23] R. Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655–686, 2012.
[24] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. In P. Langley, editor, Proc. Intl. Conf. Machine Learning, pages 1159–1166, San Francisco, California, 2000. Morgan Kaufmann Publishers.