{"title": "Stochastic Optimization of PCA with Capped MSG", "book": "Advances in Neural Information Processing Systems", "page_first": 1815, "page_last": 1823, "abstract": "We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as \"Matrix Stochastic Gradient\" (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically.", "full_text": "Stochastic Optimization of PCA with Capped MSG\n\nRaman Arora\nTTI-Chicago\n\nChicago, IL USA\n\narora@ttic.edu\n\nAndrew Cotter\nTTI-Chicago\n\nChicago, IL USA\n\ncotter@ttic.edu\n\nNathan Srebro\n\nTechnion, Haifa, Israel\n\nand TTI-Chicago\nnati@ttic.edu\n\nAbstract\n\nWe study PCA as a stochastic optimization problem and propose a novel stochastic\napproximation algorithm which we refer to as \u201cMatrix Stochastic Gradient\u201d\n(MSG), as well as a practical variant, Capped MSG. We study the method both\ntheoretically and empirically.\n\n1 Introduction\nPrincipal Component Analysis (PCA) is a ubiquitous tool used in many data analysis, machine learning\nand information retrieval applications. It is used to obtain a lower dimensional representation of\na high dimensional signal that still captures as much of the original signal as possible. Such a low dimensional\nrepresentation can be useful for reducing storage and computational costs, as complexity\ncontrol in learning systems, or to aid in visualization.\nPCA is typically phrased as a question about a fixed data set: given n vectors in R^d, what is the\nk-dimensional subspace that captures most of the variance in the data (or equivalently, that is best in\nreconstructing the vectors, minimizing the sum squared distances, or residuals, from the subspace)?\nIt is well known that this subspace is the span of the leading k components of the singular value\ndecomposition of the data matrix (or equivalently of the empirical second moment matrix). 
Hence,\nthe study of computational approaches for PCA has mostly focused on methods for finding the SVD\n(or leading components of the SVD) for a given n\u00d7d matrix (Oja & Karhunen, 1985; Sanger, 1989).\nIn this paper we approach PCA as a stochastic optimization problem, where the goal is to optimize\na \u201cpopulation objective\u201d based on i.i.d. draws from the population. In this setting, we have some\nunknown source (\u201cpopulation\u201d) distribution D over R^d, and the goal is to find the k-dimensional\nsubspace maximizing the (uncentered) variance of D inside the subspace (or equivalently, minimizing\nthe average squared residual in the population), based on i.i.d. samples from D. The main point\nhere is that the true objective is not how well the subspace captures the sample (i.e. the \u201ctraining\nerror\u201d), but rather how well the subspace captures the underlying source distribution (i.e. the \u201cgeneralization\nerror\u201d). Furthermore, we are not concerned with capturing some \u201ctrue\u201d subspace, and so\ndo not, for example, try to minimize the angle to such a subspace, but rather attempt to find a \u201cgood\u201d\nsubspace, i.e. one that is almost as good as the optimal one in terms of reconstruction error.\nOf course, finding the subspace that best captures the sample is a very reasonable approach to PCA\non the population. This is essentially an Empirical Risk Minimization (ERM) approach. However,\nwhen comparing it to alternative, perhaps computationally cheaper, approaches, we argue that one\nshould not compare the error on the sample, but rather the population objective. 
Such a view can justify\nand favor computational approaches that are far from optimal on the sample, but are essentially\nas good as ERM on the population.\nSuch a population-based view of optimization has recently been advocated in machine learning,\nand has been used to argue for crude stochastic approximation approaches (online-type methods)\nover sophisticated deterministic optimization of the empirical (training) objective (i.e. \u201cbatch\u201d methods)\n(Bottou & Bousquet, 2007; Shalev-Shwartz & Srebro, 2008). A similar argument was also\nmade in the context of stochastic optimization, where Nemirovski et al. (2009) argue for stochastic\napproximation (SA) approaches over ERM approaches. Accordingly, SA approaches,\nmostly variants of Stochastic Gradient Descent, are often the methods of choice for many learning\nproblems, especially when very large data sets are available (Shalev-Shwartz et al., 2007; Collins\net al., 2008; Shalev-Shwartz & Tewari, 2009). We take the same view in order to advocate for, study,\nand develop stochastic approximation approaches for PCA.\nIn an empirical study of stochastic approximation methods for PCA, a heuristic \u201cincremental\u201d\nmethod showed very good empirical performance (Arora et al., 2012). However, no theoretical\nguarantees or justification were given for incremental PCA. In fact, it was shown that for some distributions\nit can converge to a suboptimal solution with high probability (see Section 5.2 for more\nabout this \u201cincremental\u201d algorithm). Also relevant is careful theoretical work on online PCA by\nWarmuth & Kuzmin (2008), in which an online regret guarantee was established. Using an online-to-batch\nconversion, this online algorithm can be converted to a stochastic approximation algorithm\nwith good iteration complexity; however, the runtime for each iteration is essentially the same as that\nof ERM (i.e. 
of PCA on the sample), and thus senseless as a stochastic approximation method (see\nSection 3.3 for more on this algorithm).\nIn this paper we borrow from these two approaches and present a novel algorithm for stochastic\nPCA\u2014the Matrix Stochastic Gradient (MSG) algorithm. MSG enjoys similar iteration complexity\nto Warmuth\u2019s and Kuzmin\u2019s algorithm, and in fact we present a unified view of both algorithms\nas different instantiations of Mirror Descent for the same convex relaxation of PCA. We\nthen present the capped MSG algorithm, which is a more practical variant of MSG, has very similar\nupdates to those of the \u201cincremental\u201d method, works well in practice, and does not get stuck like\nthe \u201cincremental\u201d method. The Capped MSG algorithm is thus a clean, theoretically well founded\nmethod, with interesting connections to other stochastic/online PCA methods, and excellent practical\nperformance\u2014a \u201cbest of both worlds\u201d algorithm.\n2 Problem Setup\nWe consider PCA as the problem of finding the maximal (uncentered) variance k-dimensional subspace\nwith respect to an (unknown) distribution D over x \u2208 R^d. We assume without loss of generality\nthat the data are scaled in such a way that E_{x\u223cD}[\u2016x\u2016^2] \u2264 1. For our analysis, we also require\nthat the fourth moment be bounded: E_{x\u223cD}[\u2016x\u2016^4] \u2264 1. We represent a k-dimensional subspace by\nan orthonormal basis, collected in the columns of a matrix U. With this parametrization, PCA is\ndefined as the following stochastic optimization problem:\n\nmaximize : E_{x\u223cD}[x^T U U^T x]\nsubject to : U \u2208 R^{d\u00d7k}, U^T U = I.\n\n(2.1)\n\nIn a stochastic optimization setting we do not have direct knowledge of the distribution D, and\ninstead may access it only through i.i.d. 
samples\u2014these can be thought of as \u201ctraining examples\u201d.\nAs in other studies of stochastic approximation methods, we are less concerned with the number\nof required samples, and instead care mostly about the overall runtime required to obtain an\n\u03b5-suboptimal solution.\nThe standard approach to Problem 2.1 is empirical risk minimization (ERM): given samples {x_t}_{t=1}^{T}\ndrawn from D, we compute the empirical covariance matrix \u0108 = (1/T) \u2211_{t=1}^{T} x_t x_t^T, and take the\ncolumns of U to be the eigenvectors of \u0108 corresponding to the top-k eigenvalues. This approach\nrequires O(d^2) memory and O(d^2) operations just in order to compute the covariance matrix, plus\nsome additional time for the SVD. We are interested in methods with much lower sample time and\nspace complexity, preferably linear rather than quadratic in d.\n3 MSG and MEG\nA natural stochastic approximation (SA) approach to PCA is projected stochastic gradient descent\n(SGD) on Problem 2.1, with respect to U. This leads to the stochastic power method, for which, at\neach iteration, the following update is performed:\n\nU^{(t+1)} = P_orth(U^{(t)} + \u03b7 x_t x_t^T U^{(t)})\n\n(3.1)\n\nHere, x_t x_t^T U^{(t)} is the gradient of the PCA objective w.r.t. U (up to a constant factor), \u03b7 is a step size, and P_orth(\u00b7) projects its\nargument onto the set of matrices with orthonormal columns. Unfortunately, although SGD is well\nunderstood for convex problems, Problem 2.1 is non-convex. Consequently, obtaining a theoretical\nunderstanding of the stochastic power method, or of how the step size should be set, has proved\nelusive. Under some conditions, convergence to the optimal solution can be ensured, but no rate is\nknown (Oja & Karhunen, 1985; Sanger, 1989; Arora et al., 2012).\nInstead, we consider a re-parameterization of the PCA problem where the objective is convex. 
Instead\nof representing a linear subspace in terms of its basis matrix U, we parametrize it using the\ncorresponding projection matrix M = U U^T. We can now reformulate the PCA problem as:\n\nmaximize : E_{x\u223cD}[x^T M x]\nsubject to : M \u2208 R^{d\u00d7d}, \u03bb_i(M) \u2208 {0, 1}, rank M = k\n\n(3.2)\n\nwhere \u03bb_i(M) is the ith eigenvalue of M.\nWe now have a convex (linear, in fact) objective, but the constraints are not convex. This prompts us\nto relax the constraints by taking the convex hull of the feasible region:\n\nmaximize : E_{x\u223cD}[x^T M x]\nsubject to : M \u2208 R^{d\u00d7d}, 0 \u2aaf M \u2aaf I, tr M = k\n\n(3.3)\n\nSince the objective is linear, and the feasible region is the convex hull of that of Problem 3.2,\nan optimal solution is always attained at a \u201cvertex\u201d, i.e. a point on the boundary of the original\nconstraints. The optima of the two objectives are thus the same (strictly speaking\u2014every optimum\nof Problem 3.2 is also an optimum of Problem 3.3), and solving Problem 3.3 is equivalent to solving\nProblem 3.2.\nFurthermore, if a suboptimal solution for Problem 3.3 is not rank-k, i.e. is not a feasible point\nof Problem 3.2, we can easily sample from it to obtain a rank-k solution with the same objective\nfunction value (in expectation). This is shown by the following result of Warmuth & Kuzmin (2008):\nLemma 3.1 (Rounding (Warmuth & Kuzmin, 2008)). Any feasible solution of Problem 3.3 can be\nexpressed as a convex combination of at most d feasible solutions of Problem 3.2.\n\nAlgorithm 2 of Warmuth & Kuzmin (2008) shows how to efficiently find such a convex combination.\nSince the objective is linear, treating the coefficients of the convex combination as defining a discrete\ndistribution, and sampling according to this distribution, yields a rank-k matrix with the desired\nexpected objective function value.\n\n3.1 Matrix Stochastic Gradient\n\nPerforming SGD on Problem 3.3 (w.r.t. 
the variable M) yields the following update rule:\n\nM^{(t+1)} = P(M^{(t)} + \u03b7 x_t x_t^T),\n\n(3.4)\n\nThe projection is now performed onto the (convex) constraints of Problem 3.3. This gives the Matrix\nStochastic Gradient (MSG) algorithm, which, in detail, consists of the following steps:\n\n1. Choose a step-size \u03b7, iteration count T, and starting point M^{(0)}.\n2. Iterate the update rule (Equation 3.4) T times, each time using an independent sample\nx_t \u223c D.\n3. Average the iterates as M\u0304 = (1/T) \u2211_{t=1}^{T} M^{(t)}.\n4. Sample a rank-k solution M\u0303 from M\u0304 using the rounding procedure discussed in the previous\nsection.\n\nAnalyzing MSG is straightforward using a standard SGD analysis:\nTheorem 1. After T iterations of MSG (on Problem 3.3), with step size \u03b7 = \u221a(k/T), and starting at\nM^{(0)} = 0,\n\nE[E_{x\u223cD}[x^T M\u0303 x]] \u2265 E_{x\u223cD}[x^T M^* x] \u2212 (1/2) \u221a(k/T),\n\nwhere the expectation is w.r.t. the i.i.d. samples x_1, . . . , x_T \u223c D and the rounding, and M^* is the\noptimum of Problem 3.2.\n\nProof. The SGD analysis of Nemirovski & Yudin (1983) yields that:\n\nE[x^T M^* x \u2212 x^T M\u0304 x] \u2264 (\u03b7/2) E_{x\u223cD}[\u2016g\u2016_F^2] + \u2016M^* \u2212 M^{(0)}\u2016_F^2 / (2\u03b7T)\n\n(3.5)\n\nwhere g = x x^T is the gradient of the PCA objective. Now, E_{x\u223cD}[\u2016g\u2016_F^2] = E_{x\u223cD}[\u2016x\u2016^4] \u2264 1 and\n\u2016M^* \u2212 M^{(0)}\u2016_F^2 = \u2016M^*\u2016_F^2 = k. In the last inequality, we used the fact that M^* has k eigenvalues\nof value 1 each, and hence \u2016M^*\u2016_F = \u221ak.\n3.2 Efficient Implementation and Projection\nA na\u00efve implementation of the MSG update requires O(d^2) memory and O(d^2) operations per iteration.\nIn this section, we show how to perform this update efficiently by maintaining an up-to-date\neigendecomposition of M^{(t)}. Pseudo-code for the update may be found in Algorithm 1. Consider\nthe eigendecomposition M^{(t)} = U\u2032 diag(\u03c3)(U\u2032)^T at the tth iteration, where rank(M^{(t)}) = k_t and\nU\u2032 \u2208 R^{d\u00d7k_t}. Given a new observation x_t, the eigendecomposition of M^{(t)} + \u03b7 x_t x_t^T can be updated\nefficiently using a (k_t+1)\u00d7(k_t+1) SVD (Brand, 2002; Arora et al., 2012) (steps 1-7 of Algorithm 1).\nThis rank-one eigen-update is followed by projection onto the constraints of Problem 3.3, invoked as\nproject in step 8 of Algorithm 1, discussed in the following paragraphs and given as Algorithm 2.\n\nAlgorithm 1 Matrix stochastic gradient (MSG) update: compute an eigendecomposition of M\u2032 + \u03b7 x x^T from a\nrank-m eigendecomposition M\u2032 = U\u2032 diag(\u03c3\u2032)(U\u2032)^T and project the resulting solution onto the constraint set.\nThe computational cost is dominated by the matrix multiplication on lines 4 or 7 costing O(m^2 d) operations.\n\nmsg-step(d, k, m : N, U\u2032 : R^{d\u00d7m}, \u03c3\u2032 : R^m, x : R^d, \u03b7 : R)\n1  x\u0302 \u2190 \u221a\u03b7 (U\u2032)^T x; x\u22a5 \u2190 \u221a\u03b7 x \u2212 U\u2032 x\u0302; r \u2190 \u2016x\u22a5\u2016;\n2  if r > 0\n3    V, \u03c3 \u2190 eig([diag(\u03c3\u2032) + x\u0302 x\u0302^T, r x\u0302; r x\u0302^T, r^2]);\n4    U \u2190 [U\u2032, x\u22a5/r] V;\n5  else\n6    V, \u03c3 \u2190 eig(diag(\u03c3\u2032) + x\u0302 x\u0302^T);\n7    U \u2190 U\u2032 V;\n8  \u03c3 \u2190 distinct eigenvalues in \u03c3; \u03ba \u2190 corresponding multiplicities;\n9  \u03c3 \u2190 project(d, k, m, \u03c3, \u03ba);\n10 return U, \u03c3;\n\n
The projection procedure is based on the following lemma^1. See supplementary material for a proof.\nLemma 3.2. Let M\u2032 \u2208 R^{d\u00d7d} be a symmetric matrix, with eigenvalues \u03c3\u2032_1, . . . , \u03c3\u2032_d and associated\neigenvectors v\u2032_1, . . . , v\u2032_d. Its projection M = P(M\u2032) onto the feasible region of Problem 3.3 with\nrespect to the Frobenius norm, is the unique feasible matrix which has the same eigenvectors as M\u2032,\nwith the associated eigenvalues \u03c3_1, . . . , \u03c3_d satisfying:\n\n\u03c3_i = max(0, min(1, \u03c3\u2032_i + S))\n\nwith S \u2208 R being chosen in such a way that \u2211_{i=1}^{d} \u03c3_i = k.\n\n1 Our projection problem onto the capped simplex, even when seen in the vector setting, is substantially\ndifferent from Duchi et al. (2008). We project onto the set {0 \u2264 \u03c3 \u2264 1, \u2016\u03c3\u2016_1 = k} in Problem 3.3 and {0 \u2264\n\u03c3 \u2264 1, \u2016\u03c3\u2016_1 = k, \u2016\u03c3\u2016_0 \u2264 K} in Problem 5.1 whereas Duchi et al. (2008) project onto {0 \u2264 \u03c3, \u2016\u03c3\u2016_1 = k}.\n\nThis result shows that projecting onto the feasible region amounts to finding the value of S such that,\nafter shifting the eigenvalues by S and clipping the results to [0, 1], the result is feasible. Importantly,\nthe projection operates only on the eigenvalues. Algorithm 2 contains pseudocode which finds S\nfrom a list of eigenvalues. It is optimized to efficiently handle repeated eigenvalues\u2014rather than\nreceiving the eigenvalues in a length-d list, it instead receives a length-n list containing only the\ndistinct eigenvalues, with \u03ba containing the corresponding multiplicities. In Sections 4 and 5, we will\nsee why this is an important optimization. The central idea motivating the algorithm is that, in a\nsorted array of eigenvalues, all elements with indices below some threshold i will be clipped to 0,\nand all of those with indices above another threshold j will be clipped to 1. The pseudocode simply\nsearches over all possible pairs of such thresholds until it finds the one that works.\n\n
Algorithm 2 Routine which finds the S of Lemma 3.2. It takes as parameters the dimension d, \u201ctarget\u201d subspace\ndimension k, and the number of distinct eigenvalues n of the current iterate. The length-n arrays \u03c3\u2032 and\n\u03ba\u2032 contain the distinct eigenvalues and their multiplicities, respectively, of M\u2032 (with \u2211_{i=1}^{n} \u03ba\u2032_i = d). Line 1 sorts\n\u03c3\u2032 and re-orders \u03ba\u2032 so as to match this sorting. The loop will be run at most 2n times (once for each possible\nincrement to i or j on lines 12\u201315), so the computational cost is dominated by that of the sort: O(n log n).\n\nproject(d, k, n : N, \u03c3\u2032 : R^n, \u03ba\u2032 : N^n)\n1  \u03c3\u2032, \u03ba\u2032 \u2190 sort(\u03c3\u2032, \u03ba\u2032);\n2  i \u2190 1; j \u2190 1; s_i \u2190 0; s_j \u2190 0; c_i \u2190 0; c_j \u2190 0;\n3  while i \u2264 n\n4    if (i < j)\n5      S \u2190 (k \u2212 (s_j \u2212 s_i) \u2212 (d \u2212 c_j))/(c_j \u2212 c_i);\n6      b \u2190 (\n7        (\u03c3\u2032_i + S \u2265 0) and (\u03c3\u2032_{j\u22121} + S \u2264 1)\n8        and ((i \u2264 1) or (\u03c3\u2032_{i\u22121} + S \u2264 0))\n9        and ((j \u2265 n) or (\u03c3\u2032_{j+1} \u2265 1))\n10     );\n11     return S if b;\n12   if (j \u2264 n) and (\u03c3\u2032_j \u2212 \u03c3\u2032_i \u2264 1)\n13     s_j \u2190 s_j + \u03ba\u2032_j \u03c3\u2032_j; c_j \u2190 c_j + \u03ba\u2032_j; j \u2190 j + 1;\n14   else\n15     s_i \u2190 s_i + \u03ba\u2032_i \u03c3\u2032_i; c_i \u2190 c_i + \u03ba\u2032_i; i \u2190 i + 1;\n16 return error;\n\nThe rank-one eigen-update combined with the fast projection step yields an efficient MSG update\nthat requires O(d k_t) memory and O(d k_t^2) operations per iteration (recall that k_t is the rank of the\niterate M^{(t)}). 
This is a significant improvement over the O(d^2) memory and O(d^2) computation\nrequired by a standard implementation of MSG, if the iterates have relatively low rank.\n\n3.3 Matrix Exponentiated Gradient\nSince M is constrained by its trace, and not by its Frobenius norm, it is tempting to consider mirror\ndescent (MD) (Beck & Teboulle, 2003) instead of SGD updates for solving Problem 3.3. Recall that\nMirror Descent depends on a choice of \u201cpotential function\u201d \u03a8(\u00b7) which should be chosen according\nto the geometry of the feasible set and the subgradients (Srebro et al., 2011). Using the squared\nFrobenius norm as a potential function, i.e. \u03a8(M) = \u2016M\u2016_F^2, yields SGD, i.e. the MSG updates\nEquation 3.4. The trace-norm constraint suggests using the von Neumann entropy as the potential\nfunction, i.e. \u03a8_h(M) = \u2211_i \u03bb_i(M) log \u03bb_i(M). This leads to multiplicative updates, yielding what\nwe refer to as the Matrix Exponentiated Gradient (MEG) algorithm, which is similar to that of (Warmuth\n& Kuzmin, 2008). In fact, Warmuth and Kuzmin\u2019s algorithm exactly corresponds to online\nMirror Descent on Problem 3.3 with this potential function, but takes the optimization variable to\nbe M\u22a5 = I \u2212 M (with the constraints tr M\u22a5 = d \u2212 k and 0 \u2aaf M\u22a5 \u2aaf I). 
In either case, using the\nentropy potential, despite being well suited for the trace-geometry, does not actually lead to a better\ndependence^2 on d or k, and a Mirror Descent-based analysis again yields an excess loss of \u221a(k/T).\nWarmuth and Kuzmin present an \u201coptimistic\u201d analysis, with a dependence on the \u201creconstruction\nerror\u201d L^* = E[x^T (I \u2212 M^*) x], which yields an excess error of O(\u221a(L^* k log(d/k) / T) + k log(d/k) / T)\n(their logarithmic term can be avoided by a more careful analysis).\n\n2 This is because in our case, due to the other constraints, \u2016M^*\u2016_F = \u221a(tr M^*). Furthermore, the SGD analysis\ndepends on the Frobenius norm of the stochastic gradients, but since all stochastic gradients are rank one, this\nis the same as their spectral norm, which comes up in the entropy-case analysis, and again there is no benefit.\n\n4 MSG runtime and the rank of the iterates\nAs we saw in Sections 3.1 and 3.2, MSG requires O(k/\u03b5^2) iterations to obtain an \u03b5-suboptimal\nsolution, and each iteration costs O(k_t^2 d) operations, where k_t is the rank of iterate M^{(t)}. This\nyields a total runtime of O(k\u0304^2 d k/\u03b5^2), where k\u0304^2 = (1/T) \u2211_{t=1}^{T} k_t^2. Clearly, the runtime for MSG depends\ncritically on the rank of the iterates. If k_t is as large as d, then MSG achieves a runtime that is cubic\nin the dimensionality. On the other hand, if the rank of the iterates is O(k), the runtime is linear in\nthe dimensionality. Fortunately, in practice, each k_t is typically much lower than d. The reason for\nthis is that the MSG update performs a rank-1 update followed by a projection onto the constraints.\nSince M\u2032 = M^{(t)} + \u03b7 x_t x_t^T will have a larger trace than M^{(t)} (i.e. tr M\u2032 \u2265 k), the projection, as is\n
shown by Lemma 3.2, will subtract a quantity S from every eigenvalue of M\u2032, clipping each to 0 if\nit becomes negative. Therefore, each MSG update will increase the rank of the iterate by at most 1,\nand has the potential to decrease it, perhaps significantly. It\u2019s very difficult to theoretically quantify\nhow the rank of the iterates will evolve over time, but we have observed empirically that the iterates\ndo tend to have relatively low rank.\nWe explore this issue in greater detail experimentally, on a distribution which we expect to be difficult\nfor MSG. To this end, we generated data from known 32-dimensional distributions with diagonal\ncovariance matrices \u03a3 = diag(\u03c3/\u2016\u03c3\u2016), where \u03c3_i = \u03c4^{\u2212i} / \u2211_{j=1}^{32} \u03c4^{\u2212j}, for i = 1, . . . , 32 and for\nsome \u03c4 > 1. Observe that \u03a3(k) has a smoothly-decaying set of eigenvalues and the rate of decay is\ncontrolled by \u03c4. As \u03c4 \u2192 1, the spectrum becomes flatter resulting in distributions that present challenging\ntest cases for MSG. We experimented with \u03c4 = 1.1 and k \u2208 {1, 2, 4}, where k is the desired\nsubspace dimension used by each algorithm. The data is generated by sampling the ith standard unit\nbasis vector e_i with probability\n\u03a3_ii. We refer to this as the \u201corthogonal distribution\u201d, since it is a\ndiscrete distribution over 32 orthogonal vectors.\nIn Figure 1, we show the results with k = 4. 
We\ncan see from the left-hand plot that MSG maintains\na subspace of dimension around 15. The\nplot on the right shows how the set of nonzero\neigenvalues of the MSG iterates evolves over\ntime, from which we can see that many of the extra\ndimensions are \u201cwasted\u201d on very small eigenvalues,\ncorresponding to directions which leave\nthe state matrix only a handful of iterations after\nthey enter. This suggests that constraining k_t can\nlead to significant speedups and motivates capped\nMSG updates discussed in the next section.\n\nFigure 1: The ranks k_t (left) and the eigenvalues\n(right) of the MSG iterates M^{(t)}. Horizontal axes: iterations; vertical axes: k_t (left) and spectrum (right).\n\n5 Capped MSG\nWhile, as was observed in the previous section, MSG\u2019s iterates will tend to have ranks k_t smaller\nthan d, they will nevertheless also be larger than k. For this reason, we recommend imposing a hard\nconstraint K on the rank of the iterates:\n\nmaximize : E_{x\u223cD}[x^T M x]\nsubject to : M \u2208 R^{d\u00d7d}, 0 \u2aaf M \u2aaf I,\ntr M = k, rank M \u2264 K\n\n(5.1)\n\nWe will refer to MSG where the projection is replaced with a projection onto the constraints of\nProblem 5.1 (i.e. where the iterates are SGD iterates on Problem 5.1) as \u201ccapped MSG\u201d. As before,\nas long as K \u2265 k, Problem 5.1 and Problem 3.3 have the same optimum, it is achieved at a rank-k\nmatrix, and the extra rank constraint in Problem 5.1 is inactive at the optimum. However, the rank\nconstraint does affect the iterates, especially since Problem 5.1 is no longer convex. Nonetheless\nif K > k (i.e. 
the hard rank-constraint K is strictly larger than the target rank k), then we can\neasily check if we are at a global optimum of Problem 5.1, and hence of Problem 3.3: if the capped\nMSG algorithm converges to a solution of rank K, then the upper bound K should be increased.\nConversely, if it has converged to a rank-deficient solution, then it must be the global optimum.\nThere is thus an advantage in using K > k, and we recommend setting K = k + 1, as we do in our\nexperiments, and increasing K only if a rank deficient solution is not found in a timely manner.\nIf we take K = k, then the only way to satisfy the trace constraint is to have all non-zero eigenvalues\nequal to one, and Problem 5.1 becomes identical to Problem 3.2. The detour through the convex\nobjective of Problem 3.3 allows us to increase the search rank K, allowing for more flexibility in\nthe iterates, while still forcing each iterate to be low-rank, and each update to therefore be efficient,\nthrough the rank constraint.\n\nFigure 2: Comparison on simulated data for different values of parameter k. Panels: k = 1, k = 2, k = 4; axes: suboptimality versus iterations.\n\n5.1 Implementing the projection\nThe only difference between the implementation of MSG and capped MSG is in the projection step.\nSimilar reasoning to that which was used in the proof of Lemma 3.2 shows that if M^{(t+1)} = P(M\u2032)\nwith M\u2032 = M^{(t)} + \u03b7 x_t x_t^T, then M^{(t+1)} and M\u2032 are simultaneously diagonalizable, and therefore\nwe can consider only how the projection acts on the eigenvalues. 
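
The shift-and-clip eigenvalue projection of Lemma 3.2, on which this step builds, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 2: the function name project_eigenvalues is hypothetical, the shift S is found by plain bisection rather than by the threshold search of Algorithm 2, and the extra rank cap K of Problem 5.1 is not handled.

```python
import numpy as np

def project_eigenvalues(sigma, k, iters=100):
    # Hypothetical helper: returns clip(sigma + S, 0, 1) with S chosen so that
    # the clipped eigenvalues sum to k (the condition in Lemma 3.2).
    # S is located by bisection, a stand-in for the paper's Algorithm 2.
    sigma = np.asarray(sigma, dtype=float)
    if not 0 <= k <= sigma.size:
        raise ValueError('k must lie between 0 and the number of eigenvalues')
    lo = -sigma.max()        # at this shift every clipped eigenvalue is 0
    hi = 1.0 - sigma.min()   # at this shift every clipped eigenvalue is 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The clipped sum is nondecreasing in the shift, so bisection applies.
        if np.clip(sigma + mid, 0.0, 1.0).sum() < k:
            lo = mid
        else:
            hi = mid
    return np.clip(sigma + hi, 0.0, 1.0)

# Eigenvalues after a rank-one update (trace 2.3) projected back to trace k = 2.
projected = project_eigenvalues([1.0, 0.6, 0.5, 0.2], 2)
print(projected)  # approximately [0.925 0.525 0.425 0.125]
```

In this example the shift is S = -0.075 and no eigenvalue is clipped; with a larger trace excess, small eigenvalues would be clipped to 0, which is how the projection can reduce the rank of an iterate.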
Hence, if we let \u03c3\u2032 be the vector\nof the eigenvalues of M\u2032, and suppose that more than K of them are nonzero, then there will be\na size-K subset of \u03c3\u2032 such that applying Algorithm 2 to this set gives the projected eigenvalues.\nSince we perform only a rank-1 update at every iteration, we must check at most K possibilities,\nat a total cost of O(K^2 log K) operations, which has no effect on the asymptotic runtime because\nAlgorithm 1 requires O(K^2 d) operations.\n\n5.2 Relationship to the incremental PCA method\n\nThe capped MSG algorithm with K = k is similar to the incremental algorithm of Arora et al.\n(2012), which maintains a rank-k approximation of the covariance matrix and updates according to:\n\nM^{(t+1)} = P_{rank-k}(M^{(t)} + x_t x_t^T)\n\nwhere the projection is onto the set of rank-k matrices. Unlike MSG, the incremental algorithm does\nnot have a step-size. Updates can be performed efficiently by maintaining an eigendecomposition\nof each iterate, just as was done for MSG (see Section 3.2).\nIn a recent survey of stochastic algorithms for PCA (Arora et al., 2012), the incremental algorithm\nwas found to perform extremely well in practice\u2013it was the best, in fact, among the compared algorithms.\nHowever, there exist cases in which it can get stuck at a suboptimal solution. 
For example,\nif the data are drawn from a discrete distribution D which samples [\u221a3, 0]^T with probability 1/3\nand [0, \u221a2]^T with probability 2/3, and one runs the incremental algorithm with k = 1, then it will\nconverge to [1, 0]^T with probability 5/9, despite the fact that the maximal eigenvector is [0, 1]^T.\nThe reason for this failure is essentially that the orthogonality of the data interacts poorly with the\nlow-rank projection: any update which does not entirely displace the maximal eigenvector in one\niteration will be removed entirely by the projection, causing the algorithm to fail to make progress.\nThe capped MSG algorithm with K > k will not get stuck in such situations, since it will use\nthe additional dimensions to search in the new direction. Only as it becomes more confident in its\ncurrent candidate will the trace of M become increasingly concentrated on the top k directions. To\nillustrate this empirically, we generalized this example by generating data using the 32-dimensional\n\u201corthogonal\u201d distribution described in Section 4. This distribution presents a challenging test-case\nfor MSG, capped MSG and the incremental algorithm. Figure 2 shows plots of individual runs of\nMSG, capped MSG with K = k + 1, the incremental algorithm, and Warmuth and Kuzmin\u2019s algorithm,\nall based on the same sequence of samples drawn from the orthogonal distribution. We\ncompare algorithms in terms of the suboptimality on the population objective based on the largest\nk eigenvalues of the state matrix M^{(t)}. The plots show the incremental algorithm getting stuck for\nk \u2208 {1, 4}, and the others intermittently plateauing at intermediate solutions before beginning to\nagain converge rapidly towards the optimum. 
This behavior is to be expected on the capped MSG\nalgorithm, due to the fact that the dimension of the subspace stored at each iterate is constrained.\nHowever, it is somewhat surprising that MSG and Warmuth and Kuzmin\u2019s algorithm behaved similarly,\nand barely faster than capped MSG.\n6 Experiments\n\nFigure 3: Comparison on the MNIST dataset. The top row of plots shows suboptimality as a function of\niteration count, while the bottom row shows suboptimality as a function of estimated runtime \u2211_{s=1}^{t} (k\u2032_s)^2. Panels: k = 1, k = 4, k = 8.\n\nWe also compared the algorithms on the real-world MNIST dataset, which consists of 70,000 binary\nimages of handwritten digits of size 28\u00d728, resulting in a dimensionality of 784. We pre-normalized\nthe data by mean centering the feature vectors and scaling each feature by the product of its standard\ndeviation and the data dimension, so that each feature vector is zero mean and unit norm in expectation.\nIn addition to MSG, capped MSG, the incremental algorithm and Warmuth and Kuzmin\u2019s\nalgorithm, we also compare to a Grassmannian SGD algorithm (Balzano et al., 2010). All algorithms\nexcept the incremental algorithm have a step-size parameter. In these experiments, we ran\neach algorithm with decreasing step sizes \u03b7_t = c/\u221at for c \u2208 {2^{\u221212}, 2^{\u221211}, . . . , 2^5} and picked the\nbest c, in terms of the average suboptimality over the run, on a validation set. Since we cannot evaluate\nthe true population objective, we estimate it by evaluating on a held-out test set. 
We use 40%\nof samples in the dataset for training, 20% for validation (tuning step-size), and 40% for testing.\nWe are interested in learning a maximum variance subspace of dimension k \u2208 {1, 4, 8} in a single\n\u201cpass\u201d over the training sample. In order to compare MSG, capped MSG, the incremental algo-\nrithm and Warmuth and Kuzmin\u2019s algorithm in terms of runtime, we calculate the dominant term\nin the computational complexity: $\\sum_{s=1}^{t} (k'_s)^2$. The results are averaged over 100 random splits into\ntrain-validation-test sets.\nWe can see from Figure 3 that the incremental algorithm makes the most progress per iteration and\nis also the fastest of all algorithms. MSG is comparable to the incremental algorithm in terms of\nthe progress made per iteration. However, its runtime is slightly worse because it will often keep\na slightly larger representation (of dimension $k_t$). The capped MSG variant (with K = k + 1) is\nsignificantly faster, almost as fast as the incremental algorithm, while, as we saw in the previous\nsection, being less prone to getting stuck. Warmuth and Kuzmin\u2019s algorithm fares well with k = 1,\nbut its performance drops for higher k. Inspection of the underlying data shows that, in the k \u2208\n{4, 8} experiments, it also tends to have a larger $k_t$ than MSG, and therefore\nhas a higher cost-per-iteration. Grassmannian SGD performs better than Warmuth and Kuzmin\u2019s\nalgorithm, but much worse than MSG and capped MSG.\n\n7 Conclusions\n\nIn this paper, we presented a careful development and analysis of MSG, a stochastic approximation\nalgorithm for PCA, which enjoys good theoretical guarantees and offers a computationally efficient\nvariant, capped MSG. We show that capped MSG is well-motivated theoretically and that it does\nnot get stuck at a suboptimal solution. 
Capped MSG is also shown to have excellent empirical per-\nformance and is therefore a much better alternative to the recently proposed incremental PCA\nalgorithm of Arora et al. (2012). Furthermore, we provided a cleaner interpretation of the PCA up-\ndates of Warmuth & Kuzmin (2008) in terms of Matrix Exponentiated Gradient (MEG) updates\nand showed that both MSG and MEG can be interpreted as mirror descent algorithms on the same\nrelaxation of the PCA optimization problem but with different distance generating functions.\n\nReferences\n\nArora, Raman, Cotter, Andrew, Livescu, Karen, and Srebro, Nathan. Stochastic optimization for\nPCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing,\n2012.\n\nBalzano, Laura, Nowak, Robert, and Recht, Benjamin. Online identification and tracking of sub-\nspaces from highly incomplete information. In 48th Annual Allerton Conference on Communica-\ntion, Control, and Computing, 2010.\n\nBeck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex\noptimization. Operations Research Letters, 31(3):167\u2013175, 2003.\n\nBottou, Leon and Bousquet, Olivier. The tradeoffs of large scale learning. In NIPS\u201907, pp. 161\u2013168,\n2007.\n\nBoyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.\n\nBrand, Matthew. Incremental singular value decomposition of uncertain data with missing values.\nIn ECCV, 2002.\n\nCollins, Michael, Globerson, Amir, Koo, Terry, Carreras, Xavier, and Bartlett, Peter L. Exponen-\ntiated gradient algorithms for conditional random fields and max-margin markov networks. J.\nMach. Learn. 
Res., 9:1775\u20131822, June 2008.\n\nDuchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto\nthe l1-ball for learning in high dimensions. In Proceedings of the 25th international conference\non Machine learning, ICML \u201908, pp. 272\u2013279, New York, NY, USA, 2008. ACM.\n\nNemirovski, Arkadi and Yudin, David. Problem complexity and method efficiency in optimization.\nJohn Wiley & Sons Ltd, 1983.\n\nNemirovski, Arkadi, Juditsky, Anatoli, Lan, Guanghui, and Shapiro, Alexander. Robust stochastic\napproximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574\u2013\n1609, January 2009.\n\nOja, Erkki and Karhunen, Juha. On stochastic approximation of the eigenvectors and eigenvalues\nof the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106:\n69\u201384, 1985.\n\nSanger, Terence D. Optimal unsupervised learning in a single-layer linear feedforward neural net-\nwork. Neural Networks, 12:459\u2013473, 1989.\n\nShalev-Shwartz, Shai and Srebro, Nathan. SVM optimization: Inverse dependence on training set\nsize. In ICML\u201908, pp. 928\u2013935, 2008.\n\nShalev-Shwartz, Shai and Tewari, Ambuj. Stochastic methods for l1 regularized loss minimization.\nIn Proceedings of the 26th Annual International Conference on Machine Learning, ICML\u201909, pp.\n929\u2013936, New York, NY, USA, 2009. ACM.\n\nShalev-Shwartz, Shai, Singer, Yoram, and Srebro, Nathan. Pegasos: Primal Estimated sub-GrAdient\nSOlver for SVM. In ICML\u201907, pp. 807\u2013814, 2007.\n\nSrebro, N., Sridharan, K., and Tewari, A. On the universality of online mirror descent. Advances in\nNeural Information Processing Systems, 24, 2011.\n\nWarmuth, Manfred K. and Kuzmin, Dima. Randomized online PCA algorithms with regret bounds\nthat are logarithmic in the dimension. 
Journal of Machine Learning Research (JMLR), 9:2287\u20132320, 2008.\n", "award": [], "sourceid": 920, "authors": [{"given_name": "Raman", "family_name": "Arora", "institution": "University of Washington"}, {"given_name": "Andy", "family_name": "Cotter", "institution": "TTI Chicago"}, {"given_name": "Nati", "family_name": "Srebro", "institution": "TTI Chicago"}]}