{"title": "Towards Practical Alternating Least-Squares for CCA", "book": "Advances in Neural Information Processing Systems", "page_first": 14764, "page_last": 14773, "abstract": "Alternating least-squares (ALS) is a simple yet effective solver for canonical correlation analysis (CCA). In terms of ease of use, ALS is arguably practitioners' first choice. Despite recent provably guaranteed variants, the empirical performance often remains unsatisfactory. To promote the practical use of ALS for CCA, we propose truly alternating least-squares. Instead of approximately solving two independent linear systems, in each iteration, it simply solves two coupled linear systems of half the size. It turns out that this coupling procedure is able to bring significant performance improvements in practice. Inspired by accelerated power method, we further propose faster alternating least-squares, where momentum terms are introduced into the update equations. Both algorithms enjoy linear convergence. To make faster ALS even more practical, we put forward adaptive alternating least-squares to avoid tuning the momentum parameter, which is as easy to use as the plain ALS while retaining advantages of the fast version. Experiments on several datasets empirically demonstrate the superiority of the proposed algorithms to recent variants.", "full_text": "Towards Practical Alternating Least-Squares for\n\nCCA\n\nZhiqiang Xu and Ping Li\nCognitive Computing Lab\n\nBaidu Research\n\nNo.10 Xibeiwang East Road, Beijing, 10085, China\n\n10900 NE 8th St, Bellevue, WA 98004, USA\n{xuzhiqiang04,liping11}@baidu.com\n\nAbstract\n\nAlternating least-squares (ALS) is a simple yet effective solver for canonical corre-\nlation analysis (CCA). In terms of ease of use, ALS is arguably practitioners\u2019 \ufb01rst\nchoice. Despite recent provably guaranteed variants, the empirical performance\noften remains unsatisfactory. 
To promote the practical use of ALS for CCA, we propose truly alternating least-squares. Instead of approximately solving two independent linear systems, in each iteration it simply solves two coupled linear systems of half the size. It turns out that this coupling procedure brings significant performance improvements in practical settings. Inspired by the accelerated power method, we further propose faster alternating least-squares, where momentum terms are introduced into the update equations. Theoretically, both algorithms enjoy a linear convergence rate. To make faster ALS even more practical, we put forward adaptive alternating least-squares, which avoids tuning the momentum parameter and is as easy to use as plain ALS while retaining the advantages of the fast version. Experiments on several datasets empirically demonstrate the superiority of the proposed algorithms over several recent variants of CCA solvers.

1 Introduction

Canonical correlation analysis [11] is a classical statistical tool for finding directions of maximal correlation between data sources describing the same phenomenon. It has found widespread applications in high-dimensional data analysis such as regression [12], clustering [5], classification [13], and word embedding [7], to name a few. Let X ∈ R^{dx×n} and Y ∈ R^{dy×n} be the data matrices1 of two views, with empirical cross-covariance matrix and two auto-covariance matrices given by

C_xy = (1/n) XY^⊤,  C_xx = (1/n) XX^⊤ + r_x I,  C_yy = (1/n) YY^⊤ + r_y I,

respectively, where r_x and r_y are positive regularization parameters for avoiding ill-conditioned matrices and I represents the identity matrix of the appropriate size. 
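As a concrete reference, these covariance matrices can be formed directly when the dimensions are moderate. A minimal NumPy sketch (the function name and the default r_x = r_y = 0.1 are our own choices, mirroring the regularization used later in the experiments):

```python
import numpy as np

def cca_covariances(X, Y, rx=0.1, ry=0.1):
    """Regularized empirical covariance matrices for CCA.

    X: (dx, n) and Y: (dy, n) are the two views, assumed row-centered.
    """
    n = X.shape[1]
    Cxy = (X @ Y.T) / n                             # cross-covariance
    Cxx = (X @ X.T) / n + rx * np.eye(X.shape[0])   # regularized auto-covariance
    Cyy = (Y @ Y.T) / n + ry * np.eye(Y.shape[0])
    return Cxx, Cxy, Cyy
```

The regularization makes C_xx and C_yy strictly positive definite, so they are invertible and define valid inner products, which the algorithms below rely on.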
CCA aims to find projection matrices Φ ∈ R^{dx×k} and Ψ ∈ R^{dy×k} such that the cumulative correlation between the two views is maximized after projection [19, 16]:

max_{Φ^⊤ C_xx Φ = Ψ^⊤ C_yy Ψ = I} tr(Φ^⊤ C_xy Ψ).    (1)

It is well known that the global optimum of Problem (1), also known as the canonical subspaces, can be obtained via a k-truncated singular value decomposition (SVD) of the whitened

1We assume that X and Y are row-centered at the origin.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

cross-covariance matrix C = C_xx^{-1/2} C_xy C_yy^{-1/2}, i.e.,

(U, V) = (C_xx^{-1/2} P, C_yy^{-1/2} Q),    (2)

where P and Q are the top-k left and right singular subspaces of C. Simply applying the partial SVD of C by inverting the matrices C_xx and C_yy, however, is computationally prohibitive for high-dimensional datasets, as the complexity of matrix inversion can be as high as O(d^3), where d = max{dx, dy}, and the sparsity of X and Y cannot then be exploited.

To address this computational issue of CCA, a range of relevant algorithms have been proposed recently in different settings [22, 15, 16, 10, 20, 1, 8, 2, 6, 4]. In this work, we focus on the block and off-line setting, where k > 1 and the collection of instance pairs, i.e., (X, Y), is available up front. In terms of ease of use, in this setting, alternating least-squares (ALS) [20, 10] is arguably the first choice from a user's perspective, by virtue of its simplicity, few parameters, and guaranteed convergence. Nonetheless, as we will see in our experiments, its effectiveness, especially for solutions of high accuracy, often comes at the cost of slow convergence. In particular, [20] considered inexact alternating least-squares for the vector case k = 1. 
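For intuition, Equation (2) can be realized directly when dx and dy are small enough to factor the covariance matrices. A sketch (helper names are our own) that whitens, takes a truncated SVD, and maps back:

```python
import numpy as np

def inv_sqrt(M):
    # symmetric inverse square root via eigendecomposition (M must be SPD)
    w, V = np.linalg.eigh(M)
    return (V / np.sqrt(w)) @ V.T

def exact_cca(Cxx, Cxy, Cyy, k):
    """Top-k canonical subspaces (U, V) via SVD of the whitened matrix C."""
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    C = Wx @ Cxy @ Wy                 # whitened cross-covariance
    P, s, Qt = np.linalg.svd(C)
    U = Wx @ P[:, :k]                 # Eq. (2): U = Cxx^{-1/2} P
    V = Wy @ Qt[:k].T                 #          V = Cyy^{-1/2} Q
    return U, V, s[:k]
```

This O(d^3) route is exactly what the iterative solvers in the rest of the paper avoid; it is useful here only as a ground-truth oracle for small problems.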
However, in order for the block case to work, one has to set the block size to 2k rather than k and to add a post-processing step that randomly projects the resulting 2k-dimensional subspace onto a k-dimensional subspace, as demonstrated in [10]. The update equations of alternating least-squares in both [20] and [10] are derived from the power method on an augmented real symmetric matrix, i.e.,

A = ( 0        C_xy
      C_xy^⊤   0   ).

However, the power method can only find top eigenspaces corresponding to the largest eigenvalues in magnitude rather than in the real part. Given the special eigen-structure of A [23, 20, 10], the block size has to be at least 2k to recover a top-k canonical subspace (U, V). Clearly, the way this block CCA solver proceeds not only causes a significant increase in both time and space, but may also degrade the quality of the final solution due to the random projection.

Thus, the question one would naturally ask is:

Is there any variant of ALS that is able to recover (U, V) with block size k?

In this paper, we offer a simple answer in the affirmative. Recall that the power iteration in [20, 10] leads to simultaneous approximations to the exact iterates Φ⋆_t and Ψ⋆_t of the two canonical variables and ends up solving two independent linear systems. What we change here is to approximate Φ⋆_t and Ψ⋆_t sequentially with block size k, arriving at an algorithm that approximately solves two coupled linear systems of half the size per iteration (see Algorithm 1). To stress the difference, the proposed algorithm for CCA is called truly alternating least-squares (TALS). 
It not only inherits the theoretical properties of global convergence and linear complexity from alternating least-squares but also enjoys a speedup roughly by a factor of (σ_k + σ_{k+1})/σ_k, where σ_k represents the k-th largest singular value of C. Most important to practitioners, remarkable performance improvements can be achieved in practice, as will be shown in our experiments, albeit with only a slight algorithmic change.

Moreover, we develop another variant of ALS. Inspired by a recent work on the accelerated power method [21], we consider faster alternating least-squares (FALS) for CCA with momentum acceleration. The main idea is to add a momentum term to the update equations of the iterates Φ_t and Ψ_t on top of truly alternating least-squares, which gives rise to Algorithm 2. Compared to other fast methods, e.g., those based on shift-and-invert preconditioning [20, 1], especially in the block case, the advantage here is that the fast version retains the simple structure of the plain one, and the iterates are still updated sequentially. At the least, locally linear convergence can be guaranteed. On the other hand, the algorithm is no longer almost parameter-free, due to the momentum parameter that needs to be tuned. Although we can leverage this parameter to pursue better performance by hand-tuning, this requires multiple runs of the algorithm, which may not be computationally affordable in practice. To tackle this, we put forward adaptive alternating least-squares (AALS) with automatic momentum tuning during the iterations, so that it is as easy to use as the plain version while retaining the advantages of the fast one. Experiments show that the adaptive version achieves performance comparable to its predecessor, faster alternating least-squares, and often outperforms truly alternating least-squares.

The rest of the paper is organized as follows. 
We discuss the recent literature in Section 2, then present our algorithms, with convergence guarantees for truly alternating least-squares in Section 3 and the fast versions in Section 4. Our experimental studies are reported in Section 5. Finally, the paper concludes with discussions in Section 6.

2 Related Work

There is a rich literature on CCA. We focus here on the block and off-line algorithms proposed recently. [3] proposed a randomized CCA algorithm for a pair of tall-and-thin matrices. It first performs a randomized dimensionality reduction on the matrices and then runs an off-the-shelf CCA algorithm on the results. However, it has quite a high complexity, and as was pointed out in [16], it does not work for large dx and dy. To cope with this issue, on top of [3], the problem is cast into solving a sequence of iterative least-squares problems in [15]. But only sub-optimal results can be achieved this way due to the coarse approximation, as noted in [10]. [16] proposed an iterative method with a low per-iteration cost, but there is no guarantee of global convergence, and the performance is worse than that of CCALin, i.e., the alternating least-squares method proposed in [10]. These algorithms solve Problem (1) directly.

Alternating least-squares solves Problem (1) indirectly, by targeting an equivalent problem, namely generalized eigenspace computation, of the following form:

max_{Ω^⊤ B Ω = I} tr(Ω^⊤ A Ω),

where B = diag(C_xx, C_yy). [20] proposed inexact alternating least-squares with a sub-linear convergence analysis for the vector case k = 1. The block case was considered, with block size set to 2k, and given a linear convergence analysis in [10]. While both algorithms enjoy global convergence, they have the drawbacks mentioned in Section 1. 
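The equivalence can be checked numerically on a small instance: the pencil (A, B) with A = [[0, C_xy], [C_xy^⊤, 0]] and B = diag(C_xx, C_yy) has eigenvalues ±σ_i of the whitened matrix C (plus zeros). A NumPy sketch of this check, reducing to an ordinary symmetric eigenproblem via B^{-1/2} (our own construction):

```python
import numpy as np

def whitened_eigs(Cxx, Cxy, Cyy):
    """Eigenvalues of the pencil (A, B), A = [[0, Cxy], [Cxy^T, 0]],
    B = diag(Cxx, Cyy), via the symmetric matrix B^{-1/2} A B^{-1/2}."""
    dx, dy = Cxy.shape
    A = np.block([[np.zeros((dx, dx)), Cxy],
                  [Cxy.T, np.zeros((dy, dy))]])
    B = np.block([[Cxx, np.zeros((dx, dy))],
                  [np.zeros((dy, dx)), Cyy]])
    w, V = np.linalg.eigh(B)
    B_inv_half = (V / np.sqrt(w)) @ V.T      # B^{-1/2}, B must be SPD
    return np.linalg.eigvalsh(B_inv_half @ A @ B_inv_half)
```

The ± pairing of the spectrum is exactly the eigen-structure mentioned in Section 1 that forces plain power iteration on this pencil to use block size 2k.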
In this paper, our proposed truly\nalternating least-squares is a natural extension of above two algorithms without the drawbacks.\nMost of the fast CCA algorithms rely on the shift-and-invert preconditioning paradigm that is\noriginally designed for eigenvector computation [9]. [20] extended the paradigm to the CCA setting\nand achieved better performance than alternating least-squares for the vector case. [1] further extended\nto the block setting, using the vector version as a meta algorithm to recursively \ufb01nd top-k canonical\nsubspaces via de\ufb02ation. While both algorithms have theoretically faster convergence, pragmatic\nconcerns arise that the shift-and-invert preconditioning paradigm bears a complicated algorithm\nstructure and is dif\ufb01cult to deploy in practice, especially in the block setting. The deployment is\nbuilt upon a number of tuning parameters including the nontrivial estimation of the spectral gap [20].\nThe de\ufb02ation further complicates the task in the block case. In contrast, the fast CCA algorithm\npresented in this paper follows the momentum acceleration scheme that is also originally designed for\neigenvector computation [21], and outperforms alternating least-squares, particularly for the block\ncase. The underlying algorithm is simple with much fewer parameters. Furthermore, the adaptive\nversion does not even need to tune the momentum parameter, making it more practical.\nWe also note that there are a number of recent CCA algorithms that handle the streaming setting [22,\n8, 2, 6, 4]. It will be interesting to investigate how our algorithms extend to this setting.\n\n3 Truly Alternating Least-Squares (TALS)\n\nIn this section, we detail our proposed truly alternating least-squares (TALS) for CCA, starting from\nthe existing alternating least-squares solvers. 
Update equations of alternating least-squares in [20]\n\ncan be written as \uf8f1\uf8f2\uf8f3 \u02dc\u03c6t+1 = C\u22121\n\n\u02dc\u03c8t+1 = C\u22121\n\nxx Cxy\u03c8t + \u03bet, \u03c6t+1 =\nyy C(cid:62)\n\nxy\u03c6t + \u03b7t, \u03c8t+1 =\n\n\u02dc\u03c6t+1\n(cid:107) \u02dc\u03c6t+1(cid:107)2\n\u02dc\u03c8t+1\n(cid:107) \u02dc\u03c8t+1(cid:107)2\n\n,\n\n(3)\n\nwhere \u03c6t \u2208 Rdx\u00d71, \u03c8t \u2208 Rdy\u00d71, and \u03bet, \u03b7t are errors incurred in approximating C\u22121\nC\u22121\nyy C(cid:62)\nsystems of equations Cxx\n\nxx Cxy\u03c8t and\nxx Cxy\u03c8t is the exact solution to the linear\n\u02dc\u03c6 = Cxy\u03c8t with unknowns \u02dc\u03c6, or equivalently the following least-squares\n\nxy\u03c6t by least-squares, respectively. For example, C\u22121\n\n3\n\n\fAlgorithm 1 TALS-CCA\n1: Input: T, k, data matrices X, Y.\n2: Output: approximate top-k canonical subspaces (\u03a6T , \u03a8T ).\n3: Initialize \u03a60 = GSCxx(\u03a6init) \u2208 Rdx\u00d7k, \u03a80 = GSCyy (\u03a8init) \u2208 Rdy\u00d7k, where entries of\n4: for t = 1, 2,\u00b7\u00b7\u00b7 , T do\n5:\n\n\u03a6init, \u03a8init are i.i.d. 
standard normal samples.\n\nApproximately solve least-squares\n\n\u02dc\u03a6t \u2248 arg min\n\u02dc\u03a6\u2208Rdx\u00d7k\nwith initial \u02dc\u03a6(0) = \u03a6t\u22121(\u03a6(cid:62)\n\nlt( \u02dc\u03a6) =\n\n1\n2n\n\n(cid:107)X(cid:62) \u02dc\u03a6 \u2212 Y(cid:62)\u03a8t\u22121(cid:107)2\n\nF +\n\n(cid:107) \u02dc\u03a6(cid:107)2\n\nF\n\nrx\n2\n\nt\u22121Cxx\u03a6t\u22121)\u22121(\u03a6(cid:62)\n\nt\u22121Cxy\u03a8t\u22121).\n\n6: \u03a6t = GSCxx( \u02dc\u03a6t).\n7:\n\nApproximately solve least-squares\n\n\u02dc\u03a8t \u2248 arg min\n\u02dc\u03a8\u2208Rdy\u00d7k\nwith initial \u02dc\u03a8(0) = \u03a8t\u22121(\u03a8(cid:62)\n\nst( \u02dc\u03a8) =\n\n1\n2n\n\n(cid:107)Y(cid:62) \u02dc\u03a8 \u2212 X(cid:62)\u03a6t(cid:107)2\n\nF +\n\n(cid:107) \u02dc\u03a8(cid:107)2\n\nF\n\nry\n2\n\nt\u22121Cyy\u03a8t\u22121)\u22121(\u03a8(cid:62)\n\nt\u22121C(cid:62)\n\nxy\u03a6t).\n\n8: \u03a8t = GSCyy ( \u02dc\u03a8t).\n9: end for\n\nproblem:\n\nmin\n\n\u02dc\u03c6\u2208Rdx\u00d71\n\nlt( \u02dc\u03c6) =\n\n1\n2n\n\n(cid:107)X(cid:62) \u02dc\u03c6 \u2212 Y(cid:62)\u03c8t(cid:107) +\n\n(cid:107) \u02dc\u03c6(cid:107)2\n2.\n\nrx\n2\n\nThe approximation can be done by running a least-squares solver, warm-started by \u03c6t, for only a few\niterations. The block version for k > 1 in [10] needs to take the following form:\n\nxx Cxy\u03a8t + \u03bet, \u03a6t+1 = \u02dc\u03a6t+1( \u02dc\u03a6(cid:62)\nxy\u03a6t + \u03b7t, \u03a8t+1 = \u02dc\u03a8t+1( \u02dc\u03a6(cid:62)\nyy C(cid:62)\n\nt+1Cxx \u02dc\u03a6t+1 + \u02dc\u03a8(cid:62)\nt+1Cxx \u02dc\u03a6t+1 + \u02dc\u03a8(cid:62)\n\nt+1Cyy \u02dc\u03a8t+1)\u2212 1\nt+1Cyy \u02dc\u03a8t+1)\u2212 1\n\n2\n\n2\n\n,\n\n(4)\n\nwhere \u03a6t \u2208 Rdx\u00d72k and \u03a8t \u2208 Rdy\u00d72k, rather than \u03a6t \u2208 Rdx\u00d7k and \u03a8t \u2208 Rdy\u00d7k. It is easy to see\nthat update equations in both (3) and (4) yield two independent linear systems. 
It turns out that this independence hampers the empirical performance of alternating least-squares for CCA.

To overcome these drawbacks, especially in the block case, we propose the following truly (and inexact) alternating least-squares:

Φ̃_{t+1} = C_xx^{-1} C_xy Ψ_t + ξ_t,        Φ_{t+1} = Φ̃_{t+1} (Φ̃_{t+1}^⊤ C_xx Φ̃_{t+1})^{-1/2},
Ψ̃_{t+1} = C_yy^{-1} C_xy^⊤ Φ_{t+1} + η_{t+1},   Ψ_{t+1} = Ψ̃_{t+1} (Ψ̃_{t+1}^⊤ C_yy Ψ̃_{t+1})^{-1/2},    (5)

where we now have Φ_t ∈ R^{dx×k} and Ψ_t ∈ R^{dy×k}. Compared to (3) and (4), the two induced linear systems in (5) are coupled together and are of half the size in the block setting. The corresponding algorithmic steps are given in Algorithm 1, where the subroutine GS_H(·) performs the generalized Gram-Schmidt orthogonalization process with inner product ⟨·,·⟩_H for a positive definite matrix H. Note that our initial points for the least-squares solver differ from those in both [20] and [10].

Recall that P and Q are the top-k left and right singular subspaces of C with respect to their respective Euclidean metrics, corresponding to the singular values Σ = diag(σ_1, ..., σ_k) in descending order, i.e., σ_i ≥ σ_j for 1 ≤ i < j ≤ rank(C). Thus, by Equation (2), the ground truth U and V are the counterparts with respect to the metrics C_xx and C_yy, respectively. Let θ_t = max{θ_max(Φ_t, U), θ_max(Ψ_t, V)}, where θ_max(Φ_t, U) represents the largest principal angle between the subspaces2 Φ_t and U in the underlying metric C_xx, i.e., θ_max(Φ_t, U) = cos^{-1}(σ_min(U^⊤ C_xx Φ_t)). Let nnz(X) denote the number of nonzero entries in X and κ(C_xx) the condition number of C_xx. 
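An exact (error-free) version of the coupled update (5) is easy to state in code. The sketch below (helper names are our own; the B-orthonormalization is done via an eigendecomposition rather than the Gram-Schmidt subroutine GS_H of Algorithm 1) shows the key point: the Ψ-update consumes the freshly computed Φ_{t+1}:

```python
import numpy as np

def b_orth(M, H):
    # return M (M^T H M)^{-1/2}: columns orthonormal in the H-inner product
    S = M.T @ H @ M
    w, V = np.linalg.eigh(S)
    return M @ (V / np.sqrt(w)) @ V.T

def tals_step(Phi, Psi, Cxx, Cxy, Cyy):
    """One exact iteration of Eq. (5); the two solves are coupled."""
    Phi = b_orth(np.linalg.solve(Cxx, Cxy @ Psi), Cxx)    # uses old Psi
    Psi = b_orth(np.linalg.solve(Cyy, Cxy.T @ Phi), Cyy)  # uses NEW Phi
    return Phi, Psi
```

In TALS-CCA proper, the two np.linalg.solve calls are replaced by a few warm-started passes of an iterative least-squares solver, which is where the nnz(X, Y) factor in the running time comes from.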
Algorithm 1 then has properties that\nare described by the following theorem.\n\n2For brevity we use \u03a6t to represent the subspace spanned by columns of \u03a6t or one of its bases in the\n\nunderlying metric Cxx without any risk of confusion.\n\n4\n\n\fAlgorithm 2 FALS-CCA\n1: Input: T, k, momentum parameter \u03b2, data matrices X, Y.\n2: Output: approximate top-k canonical subspaces (\u03a6T , \u03a8T ).\n3: Initialize \u03a6\u22121 = 0 \u2208 Rdx\u00d7k, \u03a60 = GSCxx(\u03a6init) \u2208 Rdx\u00d7k, \u03a80 = GSCyy (\u03a8init) \u2208 Rdy\u00d7k,\n4: for t = 1, 2,\u00b7\u00b7\u00b7 , T do\n5:\n\nwhere entries of \u03a6init, \u03a8init are i.i.d. standard normal samples.\n\n(cid:107)X(cid:62)( \u02dc\u03a6 + \u03b2\u03a6t\u22122) \u2212 Y(cid:62)\u03a8t\u22121(cid:107)2\n\nF +\n\n(cid:107) \u02dc\u03a6 + \u03b2\u03a6t\u22122(cid:107)2\n\nF\n\nrx\n2\n\nt\u22121Cxx\u03a6t\u22121)\u22121(\u03a6(cid:62)\n\nt\u22121Cxy\u03a8t\u22121).\n\nApproximately solve least-squares\n\u02dc\u03a6t \u2248 arg min\nlt( \u02dc\u03a6) =\n\u02dc\u03a6\u2208Rdx\u00d7k\nwith initial \u02dc\u03a6(0) = \u03a6t\u22121(\u03a6(cid:62)\n\n1\n2n\n\n6: \u03a6t = GSCxx( \u02dc\u03a6t).\n7:\n\nApproximately solve least-squares\n1\n2n\n\nst( \u02dc\u03a8) =\nwith initial \u02dc\u03a8(0) = \u03a8t\u22121(\u03a8(cid:62)\n\n\u02dc\u03a8t \u2248 arg min\n\u02dc\u03a8\u2208Rdy\u00d7k\n\n(cid:107)Y(cid:62)( \u02dc\u03a8 + \u03b2\u03a8t\u22121) \u2212 X(cid:62)\u03a6t(cid:107)2\n\nF +\n\n(cid:107) \u02dc\u03a8 + \u03b2\u03a8t\u22121(cid:107)2\n\nF\n\nry\n2\n\nt\u22121Cyy\u03a8t\u22121)\u22121(\u03a8(cid:62)\n\nt\u22121C(cid:62)\n\nxy\u03a6t).\n\n8: \u03a8t = GSCyy ( \u02dc\u03a8t).\n9: end for\n\nTheorem 1 Given data matrices X and Y, TALS-CCA computes a dx \u00d7 k matrix \u03a6T and a\ndy \u00d7 k matrix \u03a8T which are estimates of top-k canonical subspaces (U, V) with an error of \u0001,\ni.e., \u03a6(cid:62)\n) iterations. 
If\nNesterov\u2019s accelerated gradient descent is used as the least-squares solver, the running time is at\nmost\n\nT Cyy\u03a8T = I and tan \u03b8T \u2264 \u0001, in T = O(\n\nT Cxx\u03a6T = \u03a8(cid:62)\n\n\u03c32\nk\nk\u2212\u03c32\n\u03c32\n\n\u0001 cos \u03b80\n\nlog\n\nk+1\n\n1\n\nO(\n\nk\u03c32\nk\nk \u2212 \u03c32\n\u03c32\n\nk+1\n\nnnz(X, Y)\u03ba(X, Y)(log\n\n1\n\ncos \u03b80\n\u03c31\nk \u2212 \u03c32\n\u03c32\n\nk+1\n\nlog\n\n) +\n\n\u03c31\nk \u2212 \u03c32\n(\u03c32\nk+1) cos \u03b80\nk2\u03c32\nk\nk \u2212 \u03c32\n\u03c32\n\nk+1\n\n+\n\nmax{dx, dy} log\n\n1\n\n),\n\n\u0001 cos \u03b80\n\nlog\n\n1\n\u0001\n\nlog\n\nwhere nnz(X, Y) = nnz(X) + nnz(Y) and \u03ba(X, Y) = max{\u03ba(Cxx), \u03ba(Cyy)}.\nNote that random initializations to \u03a60 and \u03a80 result in cos \u03b80 > 0 with high probability, by\nLemma 13 in [10]. Thus, TALS-CCA is globally and linearly convergent. Proofs are provided in the\nsupplementary material. Compared to alternating least-squares, e.g., CCALin in [10], it is roughly\nfaster by a factor of\n, whereas empirical improvements are often more pronounced. Note\nthat it makes a difference especially for the cases of a small singular value gap \u03c3k \u2212 \u03c3k+1.\n\n\u03c3k+\u03c3k+1\n\n\u03c3k\n\n4 Faster Alternating Least-Squares (FALS)\n\nIn this section, we consider the momentum acceleration for CCA, motivated by accelerated power\nmethod for eigenvector computation [21]. 
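The momentum scheme of [21] that motivates FALS is, in its basic form, power iteration with a lagged correction: x_{t+1} ∝ A x_t − β x_{t−1}. A small self-contained sketch on a symmetric matrix (illustrative only; FALS applies the same idea to B^{-1/2} A B^{-1/2} in block form):

```python
import numpy as np

def momentum_power_iteration(A, beta, T=300, seed=0):
    """Top eigenvector of symmetric A via power iteration with momentum.

    The classical choice beta = (lambda_2 / 2)^2 gives the accelerated rate.
    """
    rng = np.random.default_rng(seed)
    x_prev = np.zeros(A.shape[0])
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(T):
        x_new = A @ x - beta * x_prev     # momentum term: -beta * x_{t-1}
        s = np.linalg.norm(x_new)
        x_prev, x = x / s, x_new / s      # rescale the pair jointly
    return x
```

Note that both iterates must be rescaled by the same factor, otherwise the two-term recurrence (and hence the acceleration) is destroyed.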
To derive update equations for faster alternating least-\nsquares (FALS), we \ufb01rst have CCA cast into the setting of eigenvector computation on real symmetric\nmatrices and then introduce the momentum to speedup.\nRecall that\n\n(cid:19)\n\nCxy\n\nC(cid:62)\n\nxy\n\nand B =\n\n(cid:19)\n\n(cid:18) Cxx\n(cid:18) \u03a6t\n\n1\n2\n\n\u03a8t\n\nCyy\n\n(cid:19)\n\n.\n\n\u2208 R(dx+dy)\u00d72k.\n\nLet\n\n\u02dcWt = B\n\n\u2208 R(dx+dy)\u00d72k\nThe momentum acceleration applied to B\u2212 1\n\n\u02dc\u03a8t\n\n1\n2\n\nand Wt = B\n\n2 AB\u2212 1\n\n2 then yields that\n\n(cid:18)\n(cid:19)\n\nA =\n\n(cid:18) \u02dc\u03a6t\n\n\u02dcWt+1 = B\u2212 1\n\n2 AB\u2212 1\n\n2 Wt \u2212 \u03b2Wt\u22121, Wt+1 = \u02dcWt+1( \u02dcW(cid:62)\n\nt+1\n\n\u02dcWt+1)\u2212 1\n2 ,\n\n(6)\n\n5\n\n\fAlgorithm 3 AALS-CCA\n1: Input: T, k, data matrices X, Y.\n2: Output: approximate top-k canonical subspaces (\u03a6T , \u03a8T ).\n3: Initialize \u03a6\u22121 = 0 \u2208 Rdx\u00d7k, \u03a60 = GSCxx(\u03a6init) \u2208 Rdx\u00d7k, \u03a80 = GSCyy (\u03a8init) \u2208 Rdy\u00d7k,\n4: for t = 1, 2,\u00b7\u00b7\u00b7 , T do\n5:\nmin\n1\u2264i\u2264k\n\nwhere entries of \u03a6init, \u03a8init are i.i.d. 
standard normal samples.\n\n)2, where \u03a3(t\u22121,1) = (\u03a6(cid:62)\n\nt\u22121Cxx\u03a6t\u22121)\u22121(\u03a6(cid:62)\n\nt\u22121Cxy\u03a8t\u22121).\n\n(\u03a3(t\u22121,1)\n\nii\n\n1\n4\n\n(cid:107)X(cid:62)( \u02dc\u03a6 + \u03b2t,1\u03a6t\u22122) \u2212 Y(cid:62)\u03a8t\u22121(cid:107)2\n\nF +\n\n(cid:107) \u02dc\u03a6 + \u03b2t,1\u03a6t\u22122(cid:107)2\n\nF\n\nrx\n2\n\n6:\n\n9:\n\nSet \u03b2t,1 =\nApproximately solve least-squares\n\u02dc\u03a6t \u2248 arg min\n\u02dc\u03a6\u2208Rdx\u00d7k\nwith initial \u02dc\u03a6(0) = \u03a6t\u22121\u03a3(t\u22121,1).\n\nlt( \u02dc\u03a6) =\n\n1\n2n\n\n7: \u03a6t = GSCxx( \u02dc\u03a6t).\n8:\n\n1\n4\n\n(\u03a3(t\u22121,2)\n\nii\n\nmin\n1\u2264i\u2264k\n\nSet \u03b2t,2 =\nApproximately solve least-squares\n\u02dc\u03a8t \u2248 arg min\n\u02dc\u03a8\u2208Rdy\u00d7k\nwith initial \u02dc\u03a8(0) = \u03a8t\u22121\u03a3(t\u22121,2).\n\nst( \u02dc\u03a8) =\n\n1\n2n\n\n10: \u03a8t = GSCyy ( \u02dc\u03a8t).\n11: end for\n\n)2, where \u03a3(t\u22121,2) = (\u03a8(cid:62)\n\nt\u22121Cyy\u03a8t\u22121)\u22121(\u03a8(cid:62)\n\nt\u22121C(cid:62)\n\nxy\u03a6t).\n\n(cid:107)Y(cid:62)( \u02dc\u03a8 + \u03b2t,2\u03a8t\u22121) \u2212 X(cid:62)\u03a6t(cid:107)2\n\nF +\n\n(cid:107) \u02dc\u03a8 + \u03b2t,2\u03a8t\u22121(cid:107)2\n\nF\n\nry\n2\n\nwhere \u2212\u03b2Wt\u22121 is known as the momentum term and \u03b2 is the momentum parameter. 
Expanding\nthe above update equation into two inexact update equations in \u03a6t, \u03a8t and having them coupled\ntogether as with TALS, we arrive at our faster (truly and inexact) alternating least-squares as follows:\n\nxx Cxy\u03a8t \u2212 \u03b2\u03a6t\u22121 + \u03bet, \u03a6t+1 = \u02dc\u03a6t+1( \u02dc\u03a6(cid:62)\nyy C(cid:62)\n\nxy\u03a6t+1 \u2212 \u03b2\u03a8t + \u03b7t+1, \u03a8t+1 = \u02dc\u03a8t+1( \u02dc\u03a8(cid:62)\n\nt+1Cxx \u02dc\u03a6t+1)\u2212 1\n\n2\n\nt+1Cyy \u02dc\u03a8t+1)\u2212 1\n\n2\n\n,\n\n(cid:40) \u02dc\u03a6t+1 = C\u22121\n\n\u02dc\u03a8t+1 = C\u22121\n\nwhere \u03a8t \u2208 Rdx\u00d7k and \u03a6t \u2208 Rdy\u00d7k. The algorithmic steps are given in Algorithm 2 which\nkeeps as simple as the plain alternating least-squares. Despite the simplicity, the analysis of faster\nconvergence is very dif\ufb01cult (see our discussions in Section 6). Nonetheless, it is at least locally\nlinearly convergent, as stated in the following theorem.\nTheorem 2 Given data matrices X and Y, FALS-CCA computes a dx \u00d7 k matrix \u03a6T and a\ndy \u00d7 k matrix \u03a8T which are estimates of top-k canonical subspaces (U, V) with an error of \u0001, i.e.,\nT Cxx\u03a6T = \u03a8(cid:62)\n\u03a6(cid:62)\n) iterations if\n\u03b80 \u2264 \u03c0\n4 . 
If Nesterov\u2019s accelerated gradient descent is used as the least-squares solver, the running\ntime is at most\n\nT Cyy\u03a8T = I and tan \u03b8T \u2264 \u0001, in T = O(\n\nk\u2212c\u03c31\u03b2\n\u03c32\nk+1\u22124c\u03c31\u03b2 log\nk\u2212\u03c32\n\u03c32\n\n\u0001 cos \u03b80\n\n1\n\nk(\u03c32\nk \u2212 \u03c32\n\u03c32\n\nk \u2212 c\u03c31\u03b2)\nk+1 \u2212 4c\u03c31\u03b2\n\nO(\n\nnnz(X, Y)\u03ba(X, Y)(log\n\nlog\n\n1\n\u0001\n\nlog\n\n\u03c31\nk \u2212 \u03c32\n\u03c32\n\nk+1\n\n) +\n\nwhere c > 0 is a constant.\n\n1\n\nlog\ncos \u03b80\nk \u2212 c\u03c31\u03b2)\nk+1 \u2212 4c\u03c31\u03b2\n\n\u03c31\nk \u2212 \u03c32\n(\u03c32\nk+1) cos \u03b80\nmax{dx, dy} log\n\n+\n\n1\n\n\u0001 cos \u03b80\n\n),\n\nk2(\u03c32\nk \u2212 \u03c32\n\u03c32\n\nClearly, the momentum parameter plays a key role for Algorithm 2 to work. It is central for us to\n\ufb01gure out sensible ways to set it in practice. Given the tight analysis (see Theorem 11 in [21]) for the\nexact update (6) in Wt, the optimal value of \u03b2 should be around \u03c32\n. On the other hand, it holds\nfor the optimal solution that\n\nk+1\n4\n\nCxyV = CxxU\u03a3, C(cid:62)\n\nxyU = CyyV\u03a3.\n\n6\n\n\fWe thus can write that\n\nxyU.\nAccordingly, we have the following estimate options of \u03a3 for suf\ufb01ciently large t:\n\n(U(cid:62)CxxU)\u22121U(cid:62)CxyV = \u03a3 = (V(cid:62)CyyV)\u22121V(cid:62)C(cid:62)\n\n\u03a3(t,1) (cid:44) (\u03a6(cid:62)\n\u03a3(t,2) (cid:44) (\u03a8(cid:62)\n\u03a3(t,3) (cid:44) (\u03a6(cid:62)\n\nt Cxx\u03a6t)\u22121(\u03a6(cid:62)\nt Cyy\u03a8t)\u22121(\u03a8(cid:62)\nt Cxx\u03a6t + \u03a8(cid:62)\n\nt Cxy\u03a8t),\nt C(cid:62)\nxy\u03a6t+1),\nt CyyV)\u22121(\u03a6(cid:62)\n\u03a3(t,j)\n\nt Cxy\u03a8t + \u03a8(cid:62)\n\nt C(cid:62)\n\nxy\u03a6t).\n\nBefore iterates \u03a6t and \u03a8t converge, min\nis strictly less than \u03c3k in general. Meanwhile,\n1\u2264i\u2264k\n\u03c3k+1 is bounded above by \u03c3k. 
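These diagonal estimates are cheap to form from the current iterates. A hedged sketch of the Σ^{(t,1)}-based momentum estimate used in this paper (function name is our own):

```python
import numpy as np

def momentum_from_iterates(Phi, Psi, Cxx, Cxy):
    """beta = (1/4) * (min_i Sigma^{(t,1)}_ii)^2, with
    Sigma^{(t,1)} = (Phi^T Cxx Phi)^{-1} (Phi^T Cxy Psi)."""
    Sigma = np.linalg.solve(Phi.T @ Cxx @ Phi, Phi.T @ Cxy @ Psi)
    return 0.25 * np.min(np.diag(Sigma)) ** 2
```

Since min_i Σ^{(t,1)}_ii sits below σ_k before convergence and σ_{k+1} ≤ σ_k, this quantity serves as a practical proxy for the near-optimal value σ_{k+1}^2/4 without knowing the spectrum.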
Therefore, our first strategy for approximating the optimal momentum parameter is to run a small number of iterations of TALS, which can be viewed as a burn-in process, and then set β_j = (1/4) min_{1≤i≤k} (Σ^{(T0,j)}_ii)^2 for FALS. We denote the resulting algorithm FALS-T0.

Adaptive Alternating Least-Squares (AALS) To further avoid choosing the burn-in parameter T0, the second strategy we propose is to adjust the momentum parameter β automatically and adaptively during the iterations, as described in Algorithm 3. Compared to Algorithm 2, there is no additional cost in running AALS. It remains as easy to use as plain alternating least-squares while retaining the advantages of the fast version.

5 Experiments

In this section, we examine and compare the empirical behavior of both existing and our alternating least-squares algorithms. Three real-world datasets are used: Mediamill [18], JW11 [17], and MNIST [14]. See Table 1 for statistics and descriptions. They are commonly used to test CCA

Table 1: Statistics of Datasets

DATA      | Description                            | dx  | dy   | n
Mediamill | images and their labels                | 100 | 120  | 30000
JW11      | acoustic and articulation measurements | 273 | 112  | 30000
MNIST     | left and right halves of images        | 392 | 392  | 60000
Youtube   | UCI YouTube audio and vision streams   | 64  | 1024 | 122041

solvers [20]. To show the inability of plain alternating least-squares with block size k to solve CCA, we adapt the alternating least-squares of both [20] and [10] to block size k, denoted ALS-k and CCALin-k, respectively. Note that the post-processing step is no longer needed for CCALin-k. The two algorithms differ only in the initial point passed to the least-squares solver. The original CCALin algorithm is also included as a baseline. They are compared with our TALS, FALS-T0 (i.e., FALS with burn-in parameter T0), and AALS. In particular, T0 ∈ {4, 6} is used. 
Regularization parameters are fixed to r_x = r_y = 0.1. Stochastic variance reduced gradient (SVRG) is the least-squares solver used in every algorithm. Throughout the experiments, the solver runs 2 epochs, each consisting of n iterations, with constant step-sizes α_Φ = 1/max_i ||x_i||_2^2 for Φ_t and α_Ψ = 1/max_i ||y_i||_2^2 for Ψ_t, where x_i is the i-th column of X. All the algorithms were implemented in MATLAB and run on a laptop with 8 GB memory. The quality measures we use are as follows:

• sin^2 θ_u ≜ sin^2 θ_max(Φ_t, U), the squared sine of the largest principal angle between Φ_t and U;
• sin^2 θ_v ≜ sin^2 θ_max(Ψ_t, V), the squared sine of the largest principal angle between Ψ_t and V,

where the ground truth (P, Σ, Q) is obtained by MATLAB's svds function for evaluation purposes. Smaller is better for each measure. It is worth noting that these two measures are more indicative of the performance of all the algorithms considered here than the relative objective function error

(f⋆ − f)/f⋆ ≜ (tr(Σ) − tr(Φ_t^⊤ C_xy Ψ_t)) / tr(Σ),

because the algorithms do not directly optimize the objective function of Problem (1), i.e., tr(Φ^⊤ C_xy Ψ), especially CCALin. Convergence results in terms of (f⋆ − f)/f⋆ are reported in the supplementary material.

Figure 1: Performance of different ALS algorithms for CCA.

Convergence curves of all the considered ALS algorithms are plotted in a 4 × 3 array of figures in Figure 1, with one column per dataset. The upper and lower halves of the rows correspond to sin^2 θ_u and sin^2 θ_v, respectively, while within each half the upper and lower rows show results against running time and passes over the data, respectively. 
Note that the curve patterns in running time and in passes are not necessarily the same, e.g., for CCALin. From these empirical results, we first observe that both ALS-k and CCALin-k indeed fail to work, as the values of both measures remain high throughout the iterations across datasets. This is because the target ground truth of both algorithms does not cover the top-k canonical subspaces. Second, it takes CCALin a much longer time than our ALS algorithms to find even a low-precision solution. Third, our TALS outperforms CCALin by a large margin in both measures, demonstrating the advantage of the coupling in ALS for CCA. Last, further speedups over TALS are observed for the fast versions, which showcases the potential of momentum acceleration for CCA. In particular, the adaptive version, AALS, without the need to tune the momentum parameter or set the burn-in parameter, performs as well as FALS-T0, proving its practical value to a certain extent.

Additional experiments are provided in the supplementary
material, aiming to demonstrate: 1) the performance of all the considered algorithms with varying block sizes; 2) the performance of our ALS algorithms, especially the fast versions, in comparison with the shift-and-invert (SI) preconditioning method [20] in the vector setting; and 3) the performance of the algorithms on more datasets (n = 122041). These experiments indicate that the truly alternating least-squares can sometimes achieve equally good performance compared to its fast versions. In the vector case, the faster alternating least-squares may even outperform the SI method, even though the latter is given the advantage of knowing the spectral gap at k = 1 and other tuning parameters.

6 Discussion

In this work, we study alternating least-squares as a block CCA solver. Noting the drawbacks of current alternating least-squares methods, we propose the truly alternating least-squares, which, thanks to the coupling, only needs to solve update equations of half the size. Both theory and practice show that the coupling can significantly improve the performance of alternating least-squares. On top of that, we further propose faster alternating least-squares with momentum acceleration. To make it practical, two strategies are put forward for setting the momentum parameter. One is to introduce a burning phase that sets it by running the truly alternating least-squares for a few iterations. The other is to automatically adjust the momentum parameter during the iterations, making it as easy to use as the plain alternating least-squares without sacrificing fast convergence. Experiments show that both strategies work well. Despite the excellent performance of the fast versions, they lack a tight convergence analysis explaining their empirical behavior. This seems quite difficult, given that there has been no such theory thus far on momentum acceleration for the basic eigenvector computation in a corresponding setting.
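To make the momentum mechanism referred to above concrete, here is a minimal sketch of a heavy-ball momentum term added to plain power iteration for a symmetric matrix, in the spirit of the accelerated power method [21]. This is an illustration of the general idea only, not our CCA algorithm: the function name and the fixed choice of the momentum parameter β are ours, whereas our adaptive version re-estimates the parameter during the iterations.

```python
import numpy as np

def momentum_power_iteration(A, beta, iters=200, seed=0):
    """Power iteration with a heavy-ball momentum term:
        w_{t+1} = A w_t - beta * w_{t-1},
    followed by a joint rescaling of the pair (w_t, w_{t+1})."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(A.shape[0])
    w /= np.linalg.norm(w)
    w_prev = np.zeros_like(w)
    for _ in range(iters):
        w_next = A @ w - beta * w_prev
        scale = np.linalg.norm(w_next)
        # Rescale both iterates by the same factor to keep the recurrence intact.
        w_prev, w = w / scale, w_next / scale
    return w

# Toy example: top eigenvector of a diagonal matrix with eigenvalues 1.0, 0.5, 0.3.
# With beta = (lambda_2 / 2)**2, the non-dominant components contract much faster
# than under plain power iteration.
A = np.diag([1.0, 0.5, 0.3])
w = momentum_power_iteration(A, beta=(0.5 / 2) ** 2)
print(abs(w[0]))  # close to 1: the iterate aligns with e_1
```

Per eigencomponent, the update obeys the scalar recurrence c_{t+1} = λ c_t − β c_{t−1}, whose dominant root is largest for λ = λ₁; this is the source of the speedup, and it carries over in spirit, though not trivially in analysis, to the coupled CCA updates.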
The coupling in our context further complicates the analysis. We leave this to future work, where other settings, e.g., streaming or robust ones, may be considered as well.

References

[1] Zeyuan Allen-Zhu and Yuanzhi Li. Doubly accelerated methods for faster CCA and generalized eigendecomposition. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 98–106, 2017.

[2] Raman Arora, Teodor Vanislavov Marinov, Poorya Mianjy, and Nati Srebro. Stochastic approximation for canonical correlation analysis. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4778–4787, 2017.

[3] Haim Avron, Christos Boutsidis, Sivan Toledo, and Anastasios Zouzias. Efficient dimensionality reduction for canonical correlation analysis. SIAM J. Scientific Computing, 36(5), 2014.

[4] Kush Bhatia, Aldo Pacchiano, Nicolas Flammarion, Peter L. Bartlett, and Michael I. Jordan. Gen-Oja: Simple & efficient algorithm for streaming generalized eigenvector computation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7016–7025, 2018.

[5] Kamalika Chaudhuri, Sham M. Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, page 17, 2009.

[6] Zhehui Chen, Xingguo Li, Lin Yang, Jarvis D. Haupt, and Tuo Zhao. On constrained nonconvex stochastic optimization: A case study for generalized eigenvalue decomposition.
In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 916–925, 2019.

[7] Paramveer Dhillon, Dean P. Foster, and Lyle H. Ungar. Multi-view learning of word embeddings via CCA. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 199–207. Curran Associates, Inc., 2011.

[8] Chao Gao, Dan Garber, Nathan Srebro, Jialei Wang, and Weiran Wang. Stochastic canonical correlation analysis. CoRR, abs/1702.06533, 2017.

[9] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In International Conference on Machine Learning, pages 2626–2634, 2016.

[10] Rong Ge, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In International Conference on Machine Learning, pages 2741–2750, 2016.

[11] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.

[12] Sham M. Kakade and Dean P. Foster. Multi-view regression via canonical correlation analysis. In Proceedings of the 20th Annual Conference on Learning Theory, COLT'07, pages 82–96, Berlin, Heidelberg, 2007. Springer-Verlag.

[13] Nikos Karampatziakis and Paul Mineiro. Discriminative features via generalized eigenvectors. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 494–502, 2014.

[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[15] Yichao Lu and Dean P. Foster.
Large scale canonical correlation analysis with iterative least squares. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 91–99, 2014.

[16] Zhuang Ma, Yichao Lu, and Dean P. Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 169–178, 2015.

[17] J. R. Westbury. X-ray microbeam speech production database users' handbook. IEEE Personal Communications, 1994.

[18] Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, and Arnold W. M. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th ACM International Conference on Multimedia, MM '06, pages 421–430, New York, NY, USA, 2006. ACM.

[19] Hrishikesh Vinod. Canonical ridge and econometrics of joint production. Journal of Econometrics, 4:147–166, 1976.

[20] Weiran Wang, Jialei Wang, Dan Garber, and Nati Srebro. Efficient globally convergent stochastic optimization for canonical correlation analysis. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 766–774, 2016.

[21] Peng Xu, Bryan D. He, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Accelerated stochastic power iteration. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 58–67, 2018.

[22] Florian Yger, Maxime Berar, Gilles Gasso, and Alain Rakotomamonjy.
Adaptive canonical correlation analysis based on matrix manifolds. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012.

[23] Zhihua Zhang. The singular value decomposition, applications and beyond. CoRR, abs/1510.08532, 2015.