{"title": "Coordinate-wise Power Method", "book": "Advances in Neural Information Processing Systems", "page_first": 2064, "page_last": 2072, "abstract": "In this paper, we propose a coordinate-wise version of the power method from an optimization viewpoint. The vanilla power method simultaneously updates all the coordinates of the iterate, which is essential for its convergence analysis. However, different coordinates converge to the optimal value at different speeds. Our proposed algorithm, which we call coordinate-wise power method, is able to select and update the most important k coordinates in O(kn) time at each iteration, where n is the dimension of the matrix and k <= n is the size of the active set. Inspired by the ''greedy'' nature of our method, we further propose a greedy coordinate descent algorithm applied on a non-convex objective function specialized for symmetric matrices. We provide convergence analyses for both methods. Experimental results on both synthetic and real data show that our methods achieve up to 20 times speedup over the basic power method. Meanwhile, due to their coordinate-wise nature, our methods are very suitable for the important case when data cannot fit into memory. Finally, we introduce how the coordinate-wise mechanism could be applied to other iterative methods that are used in machine learning.", "full_text": "Coordinate-wise Power Method\n\nQi Lei 1\n\nKai Zhong 1\n\nInderjit S. Dhillon 1,2\n\n1 Institute for Computational Engineering & Sciences\n\n2 Department of Computer Science\n\n{leiqi, zhongkai}@ices.utexas.edu, inderjit@cs.utexas.edu\n\nUniversity of Texas at Austin\n\nAbstract\n\nIn this paper, we propose a coordinate-wise version of the power method from\nan optimization viewpoint. 
The vanilla power method simultaneously updates\nall the coordinates of the iterate, which is essential for its convergence analysis.\nHowever, different coordinates converge to the optimal value at different speeds.\nOur proposed algorithm, which we call coordinate-wise power method, is able\nto select and update the most important k coordinates in O(kn) time at each\niteration, where n is the dimension of the matrix and k \uf8ff n is the size of the\nactive set. Inspired by the \u201cgreedy\u201d nature of our method, we further propose a\ngreedy coordinate descent algorithm applied on a non-convex objective function\nspecialized for symmetric matrices. We provide convergence analyses for both\nmethods. Experimental results on both synthetic and real data show that our\nmethods achieve up to 23 times speedup over the basic power method. Meanwhile,\ndue to their coordinate-wise nature, our methods are very suitable for the important\ncase when data cannot \ufb01t into memory. Finally, we introduce how the coordinate-\nwise mechanism could be applied to other iterative methods that are used in machine\nlearning.\n\nIntroduction\n\n1\nComputing the dominant eigenvectors of matrices and graphs is one of the most fundamental tasks\nin various machine learning problems, including low-rank approximation, principal component\nanalysis, spectral clustering, dimensionality reduction and matrix completion. Several algorithms are\nknown for computing the dominant eigenvectors, such as the power method, Lanczos algorithm [14],\nrandomized SVD [2] and multi-scale method [17]. Among them, the power method is the oldest and\nsimplest one, where a matrix A is multiplied by the normalized iterate x(l) at each iteration, namely,\n\nx(l+1) = normalize(Ax(l)).\n\nThe power method is popular in practice due to its simplicity, small memory footprint and robustness,\nand particularly suitable for computing the dominant eigenvector of large sparse matrices [14]. 
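In code, one pass of this update can be sketched as follows (a dependency-free illustration of the textbook method, not the paper's implementation):

```python
# Vanilla power iteration: multiply by A and renormalize until the iterate
# aligns with the dominant eigenvector. Plain Python lists keep the sketch
# self-contained.

def matvec(A, x):
    # y = A x for a dense matrix stored as a list of rows
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def normalize(x):
    nrm = sum(v * v for v in x) ** 0.5
    return [v / nrm for v in x]

def power_method(A, x0, iters=100):
    x = normalize(x0)
    for _ in range(iters):
        x = normalize(matvec(A, x))
    return x
```

Each iteration costs one full matrix-vector product, O(n^2) for a dense n-by-n matrix; this is the baseline cost the coordinate-wise variant below improves on.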
It has been applied to PageRank [7], sparse PCA [19, 9], private PCA [4] and spectral clustering [18]. However,\nits convergence rate depends on |\u03bb2|/|\u03bb1|, the ratio of the magnitudes of the top two dominant eigenvalues\n[14]. Note that when |\u03bb2| \u2248 |\u03bb1|, the power method converges slowly.\nIn this paper, we propose an improved power method, which we call the coordinate-wise power method,\nto accelerate the vanilla power method. The vanilla power method updates all n coordinates of the iterate\nsimultaneously, even if some have already converged to the optimal value. This motivates us to develop new\nalgorithms where we select and update a set of important coordinates at each iteration. As updating\neach coordinate costs only 1/n of one power iteration, significant running time can be saved when n is\nvery large. We raise two questions for designing such an algorithm.\nThe first question: how to select the coordinates? A natural idea is to select the coordinate that will\nchange the most, namely,\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fargmax_i |c_i|, where c = Ax/(x^T Ax) - x,   (1)\n\nwhere Ax/(x^T Ax) is a scaled version of the next iterate given by the power method; we will explain this\nspecial scaling factor in Section 2. Note that c_i denotes the i-th element of the vector c. Instead of\nchoosing only one coordinate to update, we can also choose k coordinates with the k largest changes\nin {|c_i|}_{i=1}^{n}. We will justify this selection criterion by connecting our method with a greedy coordinate\ndescent algorithm for minimizing a non-convex function in Section 3. With this selection rule, we\nare able to show that our method has global convergence guarantees and a faster convergence rate\ncompared to the vanilla power method if k satisfies certain conditions.\nAnother key question: how to choose these coordinates without too much overhead? 
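The selection rule of Eq. (1) can be sketched as follows (our illustration; `selection_criterion` and `top_k_coordinates` are hypothetical names, not from the paper):

```python
# Compute c = Ax / (x^T A x) - x and pick the k entries with largest |c_i|
# as the active set, following Eq. (1).

def selection_criterion(A, x):
    Ax = [sum(a * v for a, v in zip(row, x)) for row in A]
    rq = sum(xi * axi for xi, axi in zip(x, Ax))   # Rayleigh-quotient scaling x^T A x
    return [axi / rq - xi for axi, xi in zip(Ax, x)]

def top_k_coordinates(c, k):
    # indices of the k entries of c with largest magnitude
    return sorted(range(len(c)), key=lambda i: -abs(c[i]))[:k]
```

On a diagonal example the rule behaves as expected: coordinates whose entries are farthest from their fixed-point values get selected first.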
How to ef\ufb01ciently\nselect important elements to update is of great interest in the optimization community. For example,\n[1] leveraged nearest neighbor search for greedy coordinate selection, while [11] applied partially\nbiased sampling for stochastic gradient descent. To calculate the changes in Eq (1) we need to know\nall coordinates of the next iterate. This violates our previous intention to calculate a small subset of\nthe new coordinates. We show, by a simple trick, we can use only O(kn) operations to update the\nmost important k coordinates. Experimental results on dense as well as sparse matrices show that our\nmethod is up to 8 times faster than vanilla power method.\nRelation to optimization. Our method reminds us of greedy coordinate descent method. Indeed,\nwe show for symmetric matrices our coordinate-wise power method is similar to greedy coordinate\ndescent for rank-1 matrix approximation, whose variants are widely used in matrix completion [8]\nand non-negative matrix factorization [6]. Based on this interpretation, we further propose a faster\ngreedy coordinate descent method specialized for symmetric matrices. This method achieves up to 23\ntimes speedup over the basic power method and 3 times speedup over the Lanczos method on large\nreal graphs. For this non-convex problem, we also provide convergence guarantees when the initial\niterate lies in the neighborhood of the optimal solution.\nExtensions. With the coordinate-wise nature, our methods are very suitable to deal with the case\nwhen data cannot \ufb01t into memory. We can choose a k such that k rows of A can \ufb01t in memory, and\nthen fully process those k rows of data before loading the RAM (random access memory) with a new\npartition of the matrix. This strategy helps balance the data processing and data loading time. The\nexperimental results show our method is 8 times faster than vanilla power method for this case.\nThe paper is organized as follows. 
Section 2 introduces coordinate-wise power method for computing\nthe dominant eigenvector. Section 3 interprets our strategy from an optimization perspective and\nproposes a faster algorithm. Section 4 provides theoretical convergence guarantee for both algorithms.\nExperimental results on synthetic or real data are shown in Section 5. Finally Section 6 presents\nthe extensions of our methods: dealing with out-of-core cases and generalizing the coordinate-wise\nmechanism to other iterative methods that are useful for the machine learning community.\n2 Coordinate-wise Power Method\nThe classical power method (PM) iteratively multiplies the iterate x 2 Rn by the matrix A 2 Rn\u21e5n,\nwhich is inef\ufb01cient since some coordinates may converge faster than others. To illustrate this\n\n(a) The percentage of unconverged coor-\ndinates versus the number of operations\n\n(b) Number of updates of each coordinate\n\nFigure 1: Motivation for the Coordinate-wise Power Method.\nFigure 1(a) shows how the percentage\nof unconverged coordinates decreases with the number of operations. The gradual decrease demonstrates the\nunevenness of each coordinate as the iterate converges to the dominant eigenvector. In Figure 1(b), the X-axis\nis the coordinate indices of iterate x sorted by their frequency of updates, which is shown on the Y-axis. The\narea below each curve approximately equals the total number of operations.The given matrix is synthetic with\n|2|/|1| = 0.5, and terminating accuracy \u270f is set to be 1e-5.\n\n2\n\n\fphenomenon, we conduct an experiment with the power method; we set the stopping criterion as\nkx v1k1 < \u270f, where \u270f is the threshold for error, and let vi denote the i-th dominant eigenvector\n(associated with the eigenvalue of the i-th largest magnitude) of A in this paper. During the iterative\nprocess, even if some coordinates meet the stopping criterion, they still have to be updated at every\niteration until uniform convergence. 
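The motivating experiment behind Figure 1(a) can be reproduced in miniature as follows (a purely diagnostic sketch, since it assumes the true dominant eigenvector v1 is known in advance):

```python
# Run plain power iterations and, after each one, count how many coordinates
# are still eps-far from the known dominant eigenvector v1, as in Figure 1(a).

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def normalize(x):
    nrm = sum(v * v for v in x) ** 0.5
    return [v / nrm for v in x]

def track_unconverged(A, x0, v1, eps, iters):
    counts, x = [], normalize(x0)
    for _ in range(iters):
        x = normalize(matvec(A, x))
        if sum(a * b for a, b in zip(x, v1)) < 0:   # fix the sign before comparing
            x = [-v for v in x]
        counts.append(sum(1 for xi, vi in zip(x, v1) if abs(xi - vi) > eps))
    return counts
```

The returned counts shrink gradually rather than all at once, which is exactly the unevenness that motivates updating only the unconverged coordinates.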
In Figure 1(a), we count the number of unconverged coordinates,\n\nwhich we de\ufb01ne as {i : i 2 [n]|xi v1,i| > \u270f}, and see it gradually decreases with the iterations,\n\nwhich implies that the power method makes a large number of unnecessary updates. In this paper, for\ncomputing the dominant eigenvector, we exhibit a coordinate selection scheme that has the ability to\nselect and update \u201dimportant\u201d coordinates with little overhead. We call our method Coordinate-wise\nPower Method (CPM). As shown in Figure 1(a) and 1(b), by selecting important entries to update,\nthe number of unconverged coordinates drops much faster, leading to an overall fewer \ufb02ops.\nAlgorithm 1 Coordinate-wise Power Method\n1: Input: Symmetric matrix A 2 Rn\u21e5n, number of selected coordinates k, and number of iterations, L.\n2: Initialize x(0) 2 Rn and set z(0) = Ax(0). Set coordinate selecting criterion c(0) = x(0) \n3: for l = 1 to L do\n4:\n\nLet \u2326(l) be a set containing k coordinates of c(l1) with the largest magnitude. Execute the following\nupdates:\n\n(x(0))T z(0) .\n\nz(0)\n\n= 8<:\n\nz(l1)\nj\n\ny(l)\nj\n\nj 2 \u2326(l)\n(x(l1))T z(l1) ,\nx(l1)\nj /2 \u2326(l)\n,\nj\n\u2326(l) x(l1)\nz(l) = z(l1) + A(y(l)\n\u2326(l) )\nz(l) = z(l)/ky(l)k, x(l) = y(l)/ky(l)k\nc(l) = x(l) \n5: Output: Approximate dominant eigenvector x(L)\n\n(x(l1))T z(l1)\n\nz(l)\n\n(2)\n\n(3)\n\nAlgorithm 1 describes our coordinate-wise power method that updates k entries at a time for com-\nputing the dominant eigenvector for a symmetric input matrix, while a generalization to asymmetric\ncases is straightforward. The algorithm starts from an initial vector x(0), and iteratively performs\nupdates xi aT\ni x/xT Ax with i in a selected set of coordinates \u2326 \u2713 [n] de\ufb01ned in step 4, where\nai is the i-th row of A. The set of indices \u2326 is chosen to maximize the difference between the current\ncoordinate value xi and the next coordinate value aT\ni x/xT Ax. 
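One iteration of the scheme just described can be sketched as follows (our plain-Python reconstruction in the spirit of Algorithm 1, not the authors' C++ code; tie-breaking and stopping details are omitted):

```python
# One coordinate-wise power step. The invariant z == A x is maintained
# incrementally, so updating k coordinates costs O(kn) rather than the
# O(n^2) of a full dense power iteration.

def cpm_step(A, x, z, k):
    """Expects x normalized and z == A x; returns the new (x, z)."""
    n = len(x)
    rq = sum(xi * zi for xi, zi in zip(x, z))          # Rayleigh quotient x^T A x
    c = [zi / rq - xi for zi, xi in zip(z, x)]         # selection criterion
    omega = sorted(range(n), key=lambda i: -abs(c[i]))[:k]
    y = list(x)
    for i in omega:                                    # update the chosen coordinates
        y[i] = z[i] / rq
    for i in omega:                                    # z <- z + A (y - x), touched columns only
        delta = y[i] - x[i]
        for j in range(n):
            z[j] += A[j][i] * delta
    nrm = sum(v * v for v in y) ** 0.5                 # renormalize iterate and z together
    return [v / nrm for v in y], [v / nrm for v in z]
```

Iterating `cpm_step` from a normalized x(0) with z(0) = A x(0) drives the iterate toward the dominant eigenvector on the small symmetric examples we tried.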
z(l) and c(l) are auxiliary vectors.\nMaintaining z(l) \u2318 Ax(l) saves much time, while the magnitude of c represents importance of each\ncoordinate and is used to select \u2326.\nWe use the Rayleigh Quotient xT Ax (x is normalized) for scaling, different from kAxk in the power\nmethod. Our intuition is as follows: on one hand, it is well known that Rayleigh Quotient is the\nbest estimate for eigenvalues. On the other hand, the limit point using xT Ax scaling will satisfy\n\u00afx = A\u00afx/\u00afxT A\u00afx, which allows both negative or positive dominant eigenvectors, while the scaling\nkAxk is always positive, so its limit point only lies in the eigenvectors associated with positive\neigenvalues, which rules out the possibility of converging to the negative dominant eigenvector.\n2.1 Coordinate Selection Strategy\nAn initial understanding for our coordinate selection strategy is that we select coordinates with the\nlargest potential change. With a current iterate x and an arbitrary active set \u2326, let y\u2326 be a potential\nnext iterate with only coordinates in \u2326 updated, namely,\n\nAccording to our algorithm, we select active set \u2326 to maximize the iterate change. Therefore:\n\ni x\n\nxT Ax ,\nxi,\n\ni 2 \u2326\ni /2 \u2326\n\n(y\u2326)i =\u21e2 aT\nI\u21e2[n],|I|=k(\n= kyI xk2) = arg min\n)I\n\n2\n\ndef\n\n= kgk2)\n\n2\n\nAx\n\nxT Ax yI\n\n\u2326 = arg max\n\nI\u21e2[n],|I|=k((x \n\nAx\n\nxT Ax\n\nThis is to say, with our updating rule, our goal of maximizing iteration gap is equivalent to minimizing\nthe difference between the next iterate y(l+1) and Ax(l)/(x(l))T Ax(l), where this difference could\nbe interpreted as noise g(l). 
A good set \u2326 ensures a suf\ufb01ciently small noise g(l), thus achieving a\n\n3\n\n\fsimilar convergence rate in O(kn) time (analyzed later) as the power method does in O(n2) time.\nMore formal statement for the convergence analysis is given in Section 4.\nAnother reason for this selection rule is that it incurs little overhead. For each iteration, we maintain a\nvector z \u2318 Ax with kn \ufb02ops by the updating rule in Eq.(3). And the overhead consists of calculating\nc and choosing \u2326. Both parts cost O(n) operations. Here \u2326 is chosen by Hoare\u2019s quick selection\nalgorithm [5] to \ufb01nd the kth largest entry in |c|. Thus the overhead is negligible compared with O(kn).\nThus CPM spends as much time on each coordinate as PM does on average, while those updated k\ncoordinates are most important. For sparse matrices, the time complexity is O(n + k\nnnnz(A)) for\neach iteration, where nnz(A) is the number of nonzero elements in matrix A.\nAlthough the above analysis gives us a good intuition on how our method works, it doesn\u2019t directly\nshow that our coordinate selection strategy has any optimal properties. In next section, we give\nanother interpretation of our coordinate-wise power method and establish its connection with the\noptimization problem for low-rank approximation.\n3 Optimization Interpretation\nThe coordinate descent method [12, 6] was popularized due to its simplicity and good performance.\nWith all but one coordinates \ufb01xed, the minimization of the objective function becomes a sequence of\nsubproblems with univariate minimization. When such subproblems are quickly solvable, coordinate\ndescent methods can be ef\ufb01cient. 
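The selection overhead can be sketched as follows (a toy reconstruction: the paper uses Hoare's selection algorithm [5]; `kth_largest` and `active_set` are our own illustrative names):

```python
import random

# Expected-O(n) selection of the k-th largest value (Hoare/quickselect style),
# then one linear scan to collect the active set of the k largest |c_i|.

def kth_largest(vals, k):
    vals = list(vals)
    while True:
        pivot = random.choice(vals)
        highs = [v for v in vals if v > pivot]
        n_equal = sum(1 for v in vals if v == pivot)
        if k <= len(highs):
            vals = highs                     # k-th largest is above the pivot
        elif k <= len(highs) + n_equal:
            return pivot                     # pivot is the k-th largest
        else:
            k -= len(highs) + n_equal
            vals = [v for v in vals if v < pivot]

def active_set(c, k):
    thresh = kth_largest([abs(v) for v in c], k)
    omega = [i for i, v in enumerate(c) if abs(v) > thresh]
    for i, v in enumerate(c):                # fill ties at the threshold up to k
        if len(omega) == k:
            break
        if abs(v) == thresh:
            omega.append(i)
    return sorted(omega)
```

Since both the criterion c and this selection cost O(n), the O(kn) update of the chosen coordinates dominates each iteration, as stated above.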
Moreover, in different problem settings, a speci\ufb01c coordinate\nselecting rule in each iteration makes it possible to further improve the algorithm\u2019s ef\ufb01ciency.\nThe power method reminds us of the rank-one matrix factorization\n\nx2Rn,y2Rdf (x, y) = kA xyTk2\nF \nWith alternating minimization, the update for x becomes x Ay\nkyk2 and vice versa for y. Therefore\nfor symmetric matrix, alternating minimization is exactly PM apart from the normalization constant.\nMeanwhile, the above similarity between PM and alternating minimization extends to the similarity\nbetween CPM and greedy coordinate descent. A more detailed interpretation is in Appendix A.5,\nwhere we show the equivalence in the following coordinate selecting rules for Eq.(4): (a) largest coor-\ndinate value change, denoted as |xi|; (b) largest partial gradient (Gauss-Southwell rule), |rif (x)|;\n(c) largest function value decrease, |f (x + xiei) f (x)|. Therefore, the coordinate selection rule\nis more formally testi\ufb01ed in optimization viewpoint.\n3.1 Symmetric Greedy Coordinate Descent (SGCD)\nWe propose an even faster algorithm based on greedy coordinate descent. This method is designed\nfor symmetric matrices and additionally requires to know the sign of the most dominant eigenvalue.\nWe also prove its convergence to the global optimum with a suf\ufb01ciently close initial point.\nA natural alternative objective function speci\ufb01cally for the symmetric case would be\n\narg min\n\n(4)\n\narg min\n\nx2Rn f (x) = kA xxTk2\nF .\n\n(5)\n\nNotice that the stationary points of f (x), which require rf (x) = 4(kxk2x Ax) = 0, are obtained\nat eigenvectors: x\u21e4i = pivi, if the eigenvalue i is positive. The global minimum for Eq. 
(5) is the\neigenvector corresponding to the largest positive eigenvalue, not the one with the largest magnitude.\nFor most applications like PageRank we know 1 is positive, but if we want to calculate the negative\neigenvalue with the largest magnitude, just optimize on f = kA + xxTk2\nNow we introduce Algorithm 2 that optimizes Eq. (5). With coordinate descent, we update the i-th\ni arg min\u21b5 f (x(l) + (\u21b5 x(l)\ncoordinate by x(l+1)\ni )ei), which requires the partial derivative of\nf (x) in i-th coordinate to be zero, i.e.,\nrif (x) = 4(xikxk2\ni x) = 0.\ni + pxi + q = 0, where p = kxk2 x2\n\n(6)\n(7)\nSimilar to CPM, the most time consuming part comes from maintaining z (\u2318 Ax), as the calculation\nfor selecting the criterion c and the coef\ufb01cient q requires it. Therefore the overall time complexity for\none iteration is the same as CPM.\n\ni aii, and q = aT\n\n() x3\n\nF instead.\n\ni x + aiixi\n\n2 aT\n\n4\n\n\fNotice that c from Eq.(6) is the partial gradient of f, so we are using the Gauss-Southwell rule to\nchoose the active set. And it is actually the only effective and computationally cheap selection rule\namong previously analyzed rules (a), (b) or (c). For calculating the iterate change |xi|, one needs to\nobtain roots for n equations. Likewise, the function decrease |fi| requires even more work.\nRemark: for an unbiased initializer, x(0) should be scaled by a constant \u21b5 such that\n\n\u21b5 = arg min\n\na0\n\nkA (ax(0))(ax(0))TkF =s (x(0))T Ax(0)\n\nkx(0)k4\n\nAlgorithm 2 Symmetric greedy coordinate descent (SGCD)\n1: Input: Symmetric matrix A 2 Rn\u21e5n, number of selected coordinate, k, and number of iterations, L.\n2: Initialize x(0) 2 Rn and set z(0) = Ax(0). Set coordinate selecting criterion c(0) = x(0) z(0)\nkx(0)k2 .\n3: for l = 0 to L 1 do\n4:\n\nLet \u2326(l) be a set containing k coordinates of c(l) with the largest magnitude. 
Execute the following\nupdates:\n\nx(l+1)\nj\n\n= ( arg min\u21b5 f\u21e3x(l) + (\u21b5 x(l)\n\nj )ej\u2318 ,\n\nz(l+1) = z(l) + A(x(l+1)\n\nif j 2 \u2326(l),\nif j /2 \u2326(l).\n\nc(l+1) = x(l+1) \n\n5: Output: vector x(L)\n\nx(l)\nj ,\n\u2326(l) x(l)\n\u2326(l) )\nz(l+1)\nkx(l+1)k2\n\n4 Convergence Analysis\nIn the previous section, we propose coordinate-wise power method (CPM) and symmetric greedy\ncoordinate descent (SGCD) on a non-convex function for computing the dominant eigenvector.\nHowever, it remains an open problem to prove convergence of coordinate descent methods for general\nnon-convex functions. In this section, we show that both CPM and SGCD converge to the dominant\neigenvector under some assumptions.\n4.1 Convergence of Coordinate-wise Power Method\nConsider a positive semide\ufb01nite matrix A, and let v1 be its leading eigenvector. For any sequence\n(x(0), x(1),\u00b7\u00b7\u00b7 ) generated by Algorithm 1, let \u2713(l) to be the angle between vector x(l) and v1,\nand (l)(k)\ni )2/kc(l)k2 = kg(l)k/kc(l)k. The following lemma illustrates\nconvergence of the tangent of \u2713(l) .\nLemma 4.1. Suppose k is large enough such that\n\n= min|\u2326|=kqPi /2\u2326 (c(l)\n\ndef\n\n(l)(k) <\n\n1 2\n\n(1 + tan \u2713(l))1\n\n.\n\nThen\n\ntan \u2713(l+1) \uf8ff tan \u2713(l)(\n\n2\n1\n\n+\n\n(l)(k))\ncos \u2713(l) < tan \u2713(l)\n\n(8)\n\n(9)\n\n12\n\n21(1+tan \u2713(l)), if x(0) is not orthogonal to v1, then after T = O( 1\n12\n\nWith the aid of Lemma 4.1, we show the following iteration complexity:\nTheorem 4.2. For any sequence (x(0), x(1),\u00b7\u00b7\u00b7 ) generated by Algorithm 1 with k satisfying\nlog( tan \u2713(0)\n(l)(k) <\n))\niterations we have tan \u2713(T ) \uf8ff \".\nThe iteration complexity shown is the same as the power method, but since it requires less operations\n(O(knnz(A)/n) instead of O(nnz(A)) per iteration, we have\nCorollary 4.2.1. 
If the requirements in Theorem 4.2 apply and additionally k satis\ufb01es:\n\n(10)\nCPM has a better convergence rate than PM in terms of the number of equivalent passes over the\ncoordinates.\n\nk < n log((1 + 2)/(21))/ log(2/1),\n\n\"\n\n5\n\n\fThe RHS of (10) ranges from 0.06n to 0.5n when 2\ngoes from 105 to 1 105. Meanwhile,\n1\nexperiments show that the performance of our algorithms isn\u2019t too sensitive to the choice of k. Figure\n6 in Appendix A.6 illustrates that a suf\ufb01ciently large range of k guarantees good performances. Thus\nwe use a prescribed k = n\n20 throughout our experiments in this paper, which saves the burden of\ntuning parameters and is a theoretically and experimentally favorable choice.\nPart of the proof is inspired by the noisy power method [3] in that we consider the unchanged part\ng as noise. For the sake of a neat proof we require our target matrix to be positive semide\ufb01nite,\nalthough experimentally a generalization to regular matrices is also valid for our algorithm. Details\ncan be found in Appendix A.1 and A.3.\n4.2 Local Convergence for Optimization on kA xxTk2\nAs the objective in Problem (5) is non-convex, it is hard to show global convergence. Clearly, with\nexact coordinate descent, Algorithm 2 will converge to some stationary point. In the following, we\nshow that Algorithm 2 converges to the global minimum with a starting point suf\ufb01ciently close to it.\nTheorem 4.3. 
(Local Linear Convergence) For any sequence of iterates (x(0), x(1), ...) generated\nby Algorithm 2, assume the starting point x(0) lies in a ball centered at sqrt(\u03bb1) v1 with radius r = O((\u03bb1 - \u03bb2)/sqrt(\u03bb1)), or formally, x(0) \u2208 Br(sqrt(\u03bb1) v1); then (x(0), x(1), ...) converges to the optimum linearly.\nSpecifically, when k = 1, after T = ((14\u03bb1 - 12\u03bb2 + 4 max_i |a_ii|)/\u00b5) n log((f(x(0)) - f*)/\u03b5) iterations we have\nf(x(T)) - f* \u2264 \u03b5, where f* = f(sqrt(\u03bb1) v1) is the global minimum of the objective function f, and\n\u00b5 = inf_{x,y \u2208 Br(sqrt(\u03bb1) v1)} ||\u2207f(x) - \u2207f(y)||_\u221e / ||x - y||_\u221e \u2208 [3(\u03bb1 - \u03bb2)/n, 3(\u03bb1 - \u03bb2)].\nWe prove this by showing that the objective (5) is strongly convex and coordinate-wise Lipschitz\ncontinuous in a neighborhood of the optimum. The proof is given in Appendix A.4.\nRemark: For real-life graphs, the diagonal values a_ii = 0, and the coefficient in the iteration\ncomplexity simplifies to (14\u03bb1 - 12\u03bb2)/\u00b5 when k = 1.\n\n(a) Convergence flops vs \u03bb2/\u03bb1   (b) Convergence time vs \u03bb2/\u03bb1   (c) Convergence time vs dimension\n\nFigure 2: Matrix properties affecting performance. Figures 2(a) and 2(b) show the performance of the five methods\nwith \u03bb2/\u03bb1 ranging from 0.01 to 0.99 and fixed matrix size n = 5000. In Figure 2(a) the measurement is FLOPs,\nwhile in Figure 2(b) the Y-axis is CPU time. 
Figure 2(c) shows how the convergence time varies with the dimension\nwhen fixing \u03bb2/\u03bb1 = 2/3. In all figures the Y-axis is in log scale for better observation. Results are averaged over\n20 runs.\n5 Experiments\nIn this section, we compare our algorithms with PM, the Lanczos method [14], and VRPCA [16] on\ndense as well as sparse datasets. All the experiments were executed on an Intel(R) Xeon(R) E5430\nmachine with 16G RAM and Linux OS. We implemented all five algorithms in C++ with the Eigen\nlibrary.\n5.1 Comparison on Dense and Simulated Dataset\nWe compare PM with our CPM and SGCD methods to show how the coordinate-wise mechanism\nimproves the original method. Further, we compare with a state-of-the-art algorithm, the Lanczos\nmethod. We also include a recently proposed stochastic SVD algorithm, VRPCA, which enjoys an\nexponential convergence rate and shares the insight of processing the data in a separable way.\nWith dense, synthetic matrices we are able to test the conditions under which our methods are preferable,\nand how properties of the matrix, such as \u03bb2/\u03bb1 or the dimension, affect performance. For each\nalgorithm, we start from the same random vector, and set the stopping condition to cos \u03b8 \u2265 1 - \u03b5, \u03b5 =\n10^-6, where \u03b8 is the angle between the current iterate and the dominant eigenvector.\n\n6\n\n\fFirst we compare performance by the number of FLOPs (floating point operations), which better\nillustrates how greediness affects the algorithm's efficiency. From Figure 2(a) we can see that our\nmethods perform much better than PM, especially when \u03bb2/\u03bb1 \u2192 1, where CPM and\nSGCD are respectively more than 2 and 3 times faster than PM. Figure 2(b) shows the running time\nof the five methods under different eigenvalue ratios \u03bb2/\u03bb1. We can see that only in some extreme\ncases, when PM converges in less than 0.1 second, is PM comparable to our methods. 
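Concretely, the per-coordinate SGCD update behind these comparisons solves the cubic stationarity condition x_i^3 + p x_i + q = 0 of Eq. (7). Below is our own self-contained reconstruction of that update; the Cardano/trigonometric root finding and the restricted-objective tie-break are our choices, not necessarily the paper's:

```python
import math

def cbrt(v):
    # real cube root with sign
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

def cubic_real_roots(p, q):
    # real roots of the depressed cubic t^3 + p t + q = 0
    disc = (q / 2.0) ** 2 + (p / 3.0) ** 3
    if disc >= 0.0:                        # one real root (Cardano's formula)
        return [cbrt(-q / 2.0 + math.sqrt(disc)) + cbrt(-q / 2.0 - math.sqrt(disc))]
    r = math.sqrt(-(p ** 3) / 27.0)        # three real roots (trigonometric form)
    phi = math.acos(max(-1.0, min(1.0, -q / (2.0 * r))))
    m = 2.0 * math.sqrt(-p / 3.0)
    return [m * math.cos((phi + 2.0 * math.pi * j) / 3.0) for j in range(3)]

def sgcd_coordinate_update(A, x, i):
    # exact minimization of f(x) = ||A - x x^T||_F^2 over coordinate i,
    # with p = ||x||^2 - x_i^2 - a_ii and q = a_ii x_i - a_i^T x as in Eq. (7)
    n = len(x)
    s = sum(v * v for j, v in enumerate(x) if j != i)    # ||x||^2 - x_i^2
    b = sum(A[i][j] * x[j] for j in range(n) if j != i)  # a_i^T x - a_ii x_i
    p, q = s - A[i][i], -b

    def restricted(t):   # restricted objective along coordinate i, up to a constant
        return (s + t * t) ** 2 - 2.0 * (2.0 * t * b + A[i][i] * t * t)

    return min(cubic_real_roots(p, q), key=restricted)
```

Sweeping this update over the coordinates from a point near the optimum drives x toward sqrt(lambda1) v1, consistent with the local convergence result of Section 4.2.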
In Figure 2(c)\nthe testing factor is the dimension, which shows the performance is independent of the size of n.\nMeanwhile, in most cases, SGCD is better than Lanczos method. And although VRPCA has better\nconvergence rate, it requires at least 10n2 operations for one data pass. Therefore in real applications,\nit is not even comparable to PM.\n\n5.2 Comparison on Sparse and Real Dataset\n\nTable 1: Six datasets and the performance of three methods on them.\n\nDataset\n\ncom-Orkut\n\nsoc-LiveJournal\n\nsoc-Pokec\n\nweb-Stanford\n\nego-Gplus\nego-Twitter\n\nn\n\nnnz/n\nnnz(A)\n76.3\n3.07M 234M\n17.8\n86M\n4.85M\n44M\n1.63M\n27.3\n3.99M 14.1\n282K\n283\n30.5M\n108K\n81.3K\n2.68M\n33\n\n2\n1\n0.71\n0.78\n0.95\n0.95\n0.51\n0.65\n\nTime (sec)\n\nPM CPM SGCD Lanczos VRPCA\n189.7\n109.6\n88.1\n58.5\n596.2\n118\n8.15\n7.55\n5.06\n0.99\n0.31\n0.98\n\n63.6\n25.8\n14.2\n0.69\n1.01\n0.19\n\n31.5\n17.9\n26.5\n1.05\n0.57\n0.15\n\n19.3\n13.7\n5.2\n0.54\n0.61\n0.11\n\nTo test the scalability of our methods, we further test and compare our methods on large and sparse\ndatasets. 
We use the following real datasets:\n1) com-Orkut: Orkut online social network\n2) soc-LiveJournal: on-line community for maintaining journals, individual and group blogs\n3) soc-Pokec: Pokec, the most popular on-line social network in Slovakia\n4) web-Stanford: pages from Stanford University (stanford.edu) and hyperlinks between them\n5) ego-Gplus (Google+): social circles from Google+\n6) ego-Twitter: social circles from Twitter\nThe statistics of the datasets are summarized in Table 1, which includes the essential properties of the\ndatasets that affect performance and the average CPU time for reaching cos \u03b8(x, v1) \u2265 1 - 10^-6.\nFigure 3 shows tan \u03b8(x, v1) against the CPU time for the five methods on these datasets.\nFrom the statistics in Table 1 we can see that in all cases, either CPM or SGCD performs best.\nCPM is roughly 2-8 times faster than PM, while SGCD reaches up to 23 times and 3 times faster\nthan PM and the Lanczos method, respectively. Our methods show their advantage on soc-Pokec (3(c))\nand web-Stanford (3(d)), the most ill-conditioned cases (\u03bb2/\u03bb1 \u2248 0.95), achieving 15 or 23 times\nspeedup over PM with SGCD. Meanwhile, when the eigenvalue ratio of the dataset is not too\nlarge (see 3(a), 3(b), 3(e), 3(f)), both CPM and SGCD still outperform PM as well as the Lanczos method.
And\n\n10 1\n\n10 0\n\n1\nv\n,\nx\n\n\u03b8\nn\na\nt\n\n10 -1\n\n10 -2\n\n10 -3\n\n0\n\nCPM\nSGCD\nPM\nLanczos\nVRPCA\n\n20\n\n40\n\n60\n\n80\n\n100\n\n120\n\n140\n\n160\n\ntime (sec)\n\n10 1\n\n10 0\n\n1\nv\n,\nx\n\n\u03b8\nn\na\nt\n\n10 -1\n\n10 -2\n\n10 -3\n\n0\n\nCPM\nSGCD\nPM\nLanczos\nVRPCA\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\ntime (sec)\n\n10 1\n\n10 0\n\n1\nv\n,\nx\n\n\u03b8\nn\na\nt\n\n10 -1\n\n10 -2\n\n10 -3\n\n0\n\nCPM\nSGCD\nPM\nLanczos\nVRPCA\n\n20\n\n40\n\n60\n\n80\n\n100\n\n120\n\n140\n\n160\n\ntime (sec)\n\n(a) Performance on com-Orkut\n\n(b) Performance on LiveJournal\n\n(c) Performance on soc-Pokec\n\n10 1\n\n10 0\n\n1\nv\n,\nx\n\n\u03b8\nn\na\nt\n\n10 -1\n\n10 -2\n\n10 -3\n\n0\n\nCPM\nSGCD\nPM\nLanczos\nVRPCA\n\n1\n\n2\n\n3\n\n4\n\n5\n\n6\n\n7\n\n8\n\n9\n\ntime (sec)\n\n10 1\n\n10 0\n\n1\nv\n,\nx\n\n\u03b8\nn\na\nt\n\n10 -1\n\n10 -2\n\n10 -3\n\n0\n\nCPM\nSGCD\nPM\nLanczos\nVRPCA\n\n0.5\n\n1\n\n1.5\n\n2\n\n2.5\n\n3\n\ntime (sec)\n\n10 1\n\n10 0\n\n1\nv\n,\nx\n\n\u03b8\nn\na\nt\n\n10 -1\n\n10 -2\n\n10 -3\n\nCPM\nSGCD\nPM\nLanczos\nVRPCA\n\n0\n\n0.05\n\n0.1\n\n0.15\n\n0.2\n\n0.25\n\n0.3\n\n0.35\n\n0.4\n\n0.45\n\n0.5\n\ntime (sec)\n\n(d) Performance on web-Stanford\nFigure 3: Time comparison for sparse dataset. X-axis shows the CPU time while Y-axis is log scaled tan \u2713\nbetween x and v1. 
The empirical performance shows all three methods have linear convergence.\n\n7\n\n\fSimilar to the reasoning in the dense case, although VRPCA requires fewer iterations for convergence,\nits overall CPU time is much longer than the others in practice.\nIn summary, over both the dense and sparse datasets, SGCD is the fastest of the methods compared.\n6 Other Applications and Extensions\n6.1 Comparison on an Out-of-core Real Dataset\nAn important application for the coordinate-wise\npower method is the case when data cannot fit\ninto memory. Existing methods cannot easily be\napplied to out-of-core datasets: most of them do not indicate how to update part\nof the coordinates multiple times and fully reuse\nthe part of the matrix corresponding to those active\ncoordinates, so the data loading and data\nprocessing times are highly unbalanced. A naive\nway of using PM would be to repetitively load\npart of the matrix from disk and calculate\nthat part of the matrix-vector multiplication. But\nfrom Figure 4 we can see that reading from disk\ncosts much more time than the computation itself, so a lot of time is wasted\nif we cannot fully use the data before dumping\nit. For CPM, as we showed in Lemma 4.1,\nupdating only k coordinates of the iterate x may\nstill enhance the target direction, so we can perform\nmatrix-vector multiplications multiple times after one single load. As for SGCD, optimizing over\npart of x several times will also decrease the function value.\nWe did experiments on the dataset from Twitter [10] using out-of-core versions of the three algorithms,\nshown in Algorithm 3 in Appendix A.7. 
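The naive out-of-core baseline described above can be sketched as follows (an illustration of the block structure only; `load_block` is a hypothetical callback standing in for a disk read, and the paper's Algorithm 3 improves on this by reusing each loaded block several times before eviction):

```python
# Naive out-of-core power iteration: each row block of A is "loaded" only to
# compute its slice of Ax, so every iteration costs one full pass over the
# disk-resident matrix.

def out_of_core_power_iteration(load_block, n_blocks, x):
    y = []
    for b in range(n_blocks):
        rows = load_block(b)                 # stand-in for reading a block from disk
        y.extend(sum(a * v for a, v in zip(row, x)) for row in rows)
    nrm = sum(v * v for v in y) ** 0.5
    return [v / nrm for v in y]
```

Because every iteration re-reads the whole matrix, the loading phase dominates; the coordinate-wise methods amortize each load over several updates of the corresponding coordinates.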
The data, which contains 41.7 million user profiles and 1.47\nbillion social relations, is originally 25.6 GB and is separated into 5 files. In Figure 4, we can see\nthat after one data pass, our methods already reach rather high precision, which compresses hours of\nprocessing time to 8 minutes.\n6.2 Extension to other linear algebraic methods\nWith the optimization interpretation, we could apply the coordinate-wise mechanism to PM and get\ngood performance. Meanwhile, for some other iterative methods in linear algebra, if the connection to\noptimization is valid, or if the update is separable over coordinates, the coordinate-wise mechanism\nmay also be applicable; the Jacobi method is one example.\nFor diagonally dominant matrices, Jacobi iteration [15] is a classical method for solving the linear system\nAx = b with a linear convergence rate. The iteration procedure is:\nInitialize: split A = D + R, where D = Diag(A) and R = A - D.\nIterate: x \u2190 D^-1(b - Rx).\n\nFigure 4: A pseudograph for time comparison on the\nout-of-core dataset from Twitter. Each \"staircase\" illustrates the performance of one data pass. The flat part\nindicates the stage of loading data, while the downward\npart shows the phase of processing data. As we only\nupdated auxiliary vectors instead of the iterate every\ntime we load part of the matrix, we could not test performance until a whole data pass; therefore, for the\nsake of clear observation, we group together the loading\nphase and the processing phase in each data pass.\n\nThis method is similar to the vanilla power method: it includes a matrix-vector multiplication\nRx, with an extra translation b and a normalization step D^-1. Therefore, a similar\nrealization of the greedy coordinate-wise mechanism is also applicable here. 
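A minimal sketch of what such a greedy realization might look like for Ax = b (our illustration in the Gauss-Southwell spirit, not code from the paper; it assumes A is diagonally dominant with a nonzero diagonal):

```python
# Greedy coordinate solve for Ax = b: repeatedly pick the coordinate with the
# largest residual entry of r = b - Ax and solve for it exactly, maintaining
# the residual incrementally, analogous to the Jacobi/Gauss-Seidel updates
# sketched above.

def greedy_coordinate_solve(A, b, x, steps):
    n = len(b)
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    for _ in range(steps):
        i = max(range(n), key=lambda t: abs(r[t]))   # greedy (Gauss-Southwell) pick
        delta = r[i] / A[i][i]                       # exact coordinate solve
        x[i] += delta
        for j in range(n):                           # maintain r: r <- r - delta * A[:, i]
            r[j] -= A[j][i] * delta
    return x
```

Each step costs O(n), and for diagonally dominant systems the residual contracts at every greedy update, mirroring how CPM spends its budget only on the most important coordinates.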
See Appendix A.8 for more experiments and analyses, where we also specify its relation to the Gauss-Seidel iteration [15].
7 Conclusion
In summary, we propose a new coordinate-wise power method and a greedy coordinate descent method for computing the dominant eigenvector of a matrix. This problem is critical to many applications in machine learning. Our methods have convergence guarantees and achieve up to 23 times speedup on both real and synthetic data, as compared to the vanilla power method.
Acknowledgements
This research was supported by NSF grants CCF-1320746, IIS-1546452 and CCF-1564000.

References

[1] Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems, pages 2160–2168, 2011.

[2] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[3] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pages 2861–2869, 2014.

[4] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 331–340. ACM, 2013.

[5] Charles AR Hoare. Algorithm 65: find. Communications of the ACM, 4(7):321–322, 1961.

[6] Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1064–1072. ACM, 2011.

[7] Ilse Ipsen and Rebecca M Wills. Analysis and computation of Google's PageRank.
In 7th IMACS International Symposium on Iterative Methods in Scientific Computing, Fields Institute, Toronto, Canada, volume 5, 2005.

[8] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.

[9] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11:517–553, 2010.

[10] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600, 2010.

[11] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.

[12] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[13] Julie Nutini, Mark Schmidt, Issam H Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1632–1641, 2015.

[14] Beresford N Parlett. The Symmetric Eigenvalue Problem, volume 20. SIAM, 1998.

[15] Yousef Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.

[16] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 144–152, 2015.

[17] Si Si, Donghyuk Shin, Inderjit S Dhillon, and Beresford N Parlett. Multi-scale spectral decomposition of massive graphs.
In Advances in Neural Information Processing Systems, pages 2798–2806, 2014.

[18] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[19] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research, 14(1):899–925, 2013.