{"title": "Stochastic Approximation for Canonical Correlation Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 4775, "page_last": 4784, "abstract": "We propose novel first-order stochastic approximation algorithms for canonical correlation analysis (CCA). Algorithms presented are instances of inexact matrix stochastic gradient (MSG) and inexact matrix exponentiated gradient (MEG), and achieve $\\epsilon$-suboptimality in the population objective in $\\operatorname{poly}(\\frac{1}{\\epsilon})$ iterations. We also consider practical variants of the proposed algorithms and compare them with other methods for CCA both theoretically and empirically.", "full_text": "Stochastic Approximation\n\nfor Canonical Correlation Analysis\n\nRaman Arora\n\nDept. of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21204\narora@cs.jhu.edu\n\nTeodor V. Marinov\n\nDept. of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21204\ntmarino2@jhu.edu\n\nNathan Srebro\nTTI-Chicago\n\nChicago, Illinois 60637\n\nnati@ttic.edu\n\nPoorya Mianjy\n\nDept. of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21204\n\nmianjy@jhu.edu\n\nAbstract\n\nWe propose novel \ufb01rst-order stochastic approximation algorithms for canonical\ncorrelation analysis (CCA). Algorithms presented are instances of inexact matrix\nstochastic gradient (MSG) and inexact matrix exponentiated gradient (MEG), and\nachieve \u270f-suboptimality in the population objective in poly( 1\n\u270f ) iterations. We also\nconsider practical variants of the proposed algorithms and compare them with other\nmethods for CCA both theoretically and empirically.\n\n1\n\nIntroduction\n\nCanonical Correlation Analysis (CCA) [11] is a ubiquitous statistical technique for \ufb01nding maximally\ncorrelated linear components of two sets of random variables. 
CCA can be posed as the following stochastic optimization problem: given a pair of random vectors $(x, y) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$ with some (unknown) joint distribution $\mathcal{D}$, find the $k$-dimensional subspaces in which the projections of $x$ and $y$ are maximally correlated, i.e. find matrices $\tilde{U} \in \mathbb{R}^{d_x \times k}$ and $\tilde{V} \in \mathbb{R}^{d_y \times k}$ that

$$\text{maximize } \mathbb{E}_{x,y}[x^\top \tilde{U} \tilde{V}^\top y] \quad \text{subject to } \tilde{U}^\top \mathbb{E}_x[x x^\top] \tilde{U} = I_k, \ \tilde{V}^\top \mathbb{E}_y[y y^\top] \tilde{V} = I_k. \qquad (1)$$

CCA-based techniques have recently met with success at unsupervised representation learning, where multiple "views" of the data are used to learn improved representations for each of the views [3, 5, 13, 23]. The different views often contain complementary information, and CCA-based "multiview" representation learning methods can take advantage of this information to learn features that are useful for understanding the structure of the data and that are beneficial for downstream tasks.

Unsupervised learning techniques leverage unlabeled data, which is often plentiful. Accordingly, in this paper we are interested in first-order stochastic approximation (SA) algorithms for solving Problem (1) that can easily scale to very large datasets. A stochastic approximation algorithm is an iterative algorithm in which each iteration uses a single sample from the population to perform an update, as in stochastic gradient descent (SGD), the classic SA algorithm.

There are several computational challenges associated with solving Problem (1). A first challenge stems from the fact that Problem (1) is non-convex. Nevertheless, akin to related spectral methods such as principal component analysis (PCA), the solution to CCA can be given in terms of a generalized eigenvalue problem. In other words, despite being non-convex, CCA admits a tractable algorithm.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
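As a concrete illustration of this tractability, here is a minimal NumPy sketch of the empirical (sample-based) solver: whiten each view, take a truncated SVD of the cross-covariance of the whitened views, and un-whiten. The function name, the small ridge term `reg`, and the synthetic data are our own choices for the sketch, not from the paper.

```python
import numpy as np

def empirical_cca(X, Y, k, reg=1e-6):
    """Empirical CCA for centered paired samples in the columns of
    X (dx, n) and Y (dy, n). Returns (U, V) with U^T Cx U ~= I_k."""
    n = X.shape[1]
    Cx = X @ X.T / n + reg * np.eye(X.shape[0])   # auto-covariance, view 1
    Cy = Y @ Y.T / n + reg * np.eye(Y.shape[0])   # auto-covariance, view 2
    Cxy = X @ Y.T / n                             # cross-covariance

    def inv_sqrt(C):
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(w ** -0.5) @ Q.T

    Wx, Wy = inv_sqrt(Cx), inv_sqrt(Cy)
    T = Wx @ Cxy @ Wy                  # whitened cross-covariance
    Phi, s, PsiT = np.linalg.svd(T)
    return Wx @ Phi[:, :k], Wy @ PsiT[:k, :].T   # un-whitened singular vectors

rng = np.random.default_rng(0)
Z = rng.standard_normal((2, 1000))                       # shared latent signal
X = np.vstack([Z, rng.standard_normal((3, 1000))])       # view 1 (dx = 5)
Y = np.vstack([Z + 0.1 * rng.standard_normal((2, 1000)),
               rng.standard_normal((4, 1000))])          # view 2 (dy = 6)
U, V = empirical_cca(X, Y, k=2)
# Whitening constraint of Problem (1) holds on the (regularized) sample:
Cx = X @ X.T / 1000 + 1e-6 * np.eye(5)
assert np.allclose(U.T @ Cx @ U, np.eye(2), atol=1e-6)
```

This is exactly the sample average approximation route discussed later; the algorithms proposed in the paper instead approach the population problem one sample at a time.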
In particular, numerical techniques based on the power iteration method and its variants can be applied to these problems to find globally optimal solutions. Much recent work, therefore, has focused on analyzing the optimization error of the power iteration method for the generalized eigenvalue problem [1, 8, 24]. However, these analyses concern the numerical (empirical) optimization error in finding the left and right singular vectors of a fixed given matrix built from empirical estimates of the covariance matrices, and not the population $\epsilon$-suboptimality (i.e., suboptimality measured in the population objective) of Problem (1), which is our focus here.

The second challenge, which is our main concern here, arises when designing first-order stochastic approximation algorithms for CCA. The main difficulty, compared to PCA and most other machine learning problems, is that the constraints also involve stochastic quantities that depend on the unknown distribution $\mathcal{D}$. Put differently, the CCA objective does not decompose over samples. To see this, consider the case $k = 1$. The CCA problem can then be posed equivalently as maximizing the correlation objective

$$\rho(u^\top x, v^\top y) = \frac{\mathbb{E}_{x,y}[u^\top x y^\top v]}{\sqrt{\mathbb{E}_x[u^\top x x^\top u]}\sqrt{\mathbb{E}_y[v^\top y y^\top v]}}.$$

This yields an unconstrained optimization problem. However, the objective is no longer an expectation, but a ratio of expectations. If we were to solve the empirical version of this problem, it is easy to check that the objective ties all the samples together. This departs significantly from the typical stochastic approximation scenario. Crucially, with a single sample, it is not possible to obtain an unbiased estimate of the gradient of the objective $\rho(u^\top x, v^\top y)$.
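A small numerical check of this coupling (our own illustration, with an arbitrary fixed pair $(u, v)$): the empirical correlation is a ratio of sample averages, which is not the same as the sample average of per-sample ratios, so the objective cannot be split into independent per-sample terms.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.standard_normal((3, n))                   # view 1 samples (columns)
y = x[:2] + 0.5 * rng.standard_normal((2, n))     # view 2, correlated with view 1
u, v = np.array([1.0, 1.0, 0.0]), np.array([1.0, 1.0])

ux, vy = u @ x, v @ y
# The empirical correlation objective: a ratio of sample averages ...
rho_hat = np.mean(ux * vy) / np.sqrt(np.mean(ux ** 2) * np.mean(vy ** 2))
# ... which differs from the sample average of per-sample ratios:
avg_of_ratios = np.mean((ux * vy) / (np.abs(ux) * np.abs(vy)))

# The shared denominator ties all samples together, so neither the
# objective nor its gradient is an average of independent per-sample terms.
print(rho_hat, avg_of_ratios)
```

The two printed quantities differ markedly on this data, illustrating why a single-sample unbiased gradient estimate is unavailable.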
Therefore, we consider a \ufb01rst-order\noracle that provides inexact estimates of the gradient with a norm bound on the additive noise, and\nfocus on inexact proximal gradient descent algorithms for CCA.\nFinally, it can be shown that the CCA problem given in Problem (1) is ill-posed if the population\n\nauto-covariance matrices Ex\u21e5xx>\u21e4 or Ey\u21e5yy>\u21e4 are ill-conditioned. This observation follows from\nthe fact that if there exists a direction in the kernel of Ex\u21e5xx>\u21e4 or Ey\u21e5yy>\u21e4 in which x and y\nexhibit non-zero covariance, then the objective of Problem (1) is unbounded. We would like to avoid\nrecovering such directions of spurious correlation and therefore assume that the smallest eigenvalues\nof the auto-covariance matrices and their empirical estimates are bounded below by some positive\nconstant. Formally, we assume that Cx \u232b rxI and Cy \u232b ryI. This is the typical assumption made in\nanalyzing CCA [1, 7, 8].\n\n1.1 Notation\nScalars, vectors and matrices are represented by normal, Roman and capital Roman letters respectively,\ne.g. x, x, and X. Ik denotes identity matrix of size k \u21e5 k, where we drop the subscript whenever the\nsize is clear from the context. The `2-norm of a vector x is denoted by kxk. For any matrix X, spectral\nnorm, nuclear norm, and Frobenius norm are represented by kXk2, kXk\u21e4, and kXkF respectively.\nThe trace of a square matrix X is denoted by Tr (X). Given two matrices X 2 Rk\u21e5d, Y 2 Rk\u21e5d, the\nstandard inner-product between the two is given as hX, Yi = TrX>Y; we use the two notations\ninterchangeably. For symmetric matrices X and Y, we say X \u232b Y if X Y is positive semi-de\ufb01nite\n(PSD). Let x 2 Rdx and y 2 Rdy denote two sets of centered random variables jointly distributed as\nD with corresponding auto-covariance matrices Cx = Ex[xx>], Cy = Ey[yy>], and cross-covariance\nmatrix Cxy = E(x,y)[xy>], and de\ufb01ne d := max{dx, dy}. 
Finally, X 2 Rdx\u21e5n and Y 2 Rdy\u21e5n\ndenote data matrices with n corresponding samples from view 1 and view 2, respectively.\n\n1.2 Problem Formulation\nGiven paired samples (x1, y1), . . . , (xT , yT ), drawn i.i.d. from D, the goal is to \ufb01nd a maximally\ncorrelated subspace of D, i.e. in terms of the population objective. A simple change of variables in\nProblem (1), with U = C1/2\n\ny \u02dcV, yields the following equivalent problem:\n\nx \u02dcU and V = C1/2\n\nmaximize Tr\u21e3U>C 1\n\ny V\u2318 s.t. U>U = I, V>V = I.\n\n(2)\nTo ensure that Problem 2 is well-posed, we assume that r := min{rx, ry} > 0, where rx = min(Cx)\nand ry = min(Cy) are smallest eigenvalues of the population auto-covariance matrices. Furthermore,\nwe assume that with probability one, for (x, y) \u21e0 D, we have that max{kxk2 ,kyk2}\uf8ff B. Let\n 2 Rdx\u21e5k and 2 Rdy\u21e5k denote the top-k left and right singular vectors, respectively, of the\npopulation cross-covariance matrix of the whitened views T := C1/2\n. It is easy to check\nthat the optimum of Problem (1) is achieved at U\u21e4 = C1/2\ny . Therefore, a natural\n\nx , V\u21e4 = C1/2\n\nx CxyC1/2\n\ny\n\nx Cxy C 1\n\n2\n\n2\n\n2\n\n\fapproach, given a training dataset, is to estimate empirical auto-covariance and cross-covariance\n\nmatrices to compute bT, an empirical estimate of T; matrices U\u21e4 and V\u21e4 can then be estimated\nusing the top-k left and right singular vectors ofbT. This approach is referred to as sample average\napproximation (SAA) or empirical risk minimization (ERM).\nIn this paper, we consider the following equivalent re-parameterization of Problem (2) given by the\nvariable substitution M = UV>, also referred to as lifting. 
Find M 2 Rdx\u21e5dy that\nmaximize hM, C 1\nWe are interested in designing SA algorithms that, for any bounded distribution D with minimum\neigenvalue of the auto-covariance matrices bounded below by r, are guaranteed to \ufb01nd an \u270f-suboptimal\nsolution on the population objective (3), from which, we can extract a good solution for Problem (1).\n\ni s.t. i(M) 2{ 0, 1}, i = 1, . . . , min{dx, dy}, rank (M) \uf8ff k.\n\nx CxyC 1\n\n(3)\n\ny\n\n2\n\n2\n\n1.3 Related Work\nThere has been a \ufb02urry of recent work on scalable approaches to the empirical CCA problem,\ni.e. methods for numerical optimization of the empirical CCA objective on a \ufb01xed data set [1, 8, 14,\n15, 24]. These are typically batch approaches which use the entire data set at each iteration, either for\nperforming a power iteration [1, 8] or for optimizing the alternative empirical objective [14, 15, 24]:\nminimize 1\n\nF s.t. \u02dcU>Cx,n \u02dcU = I, \u02dcV>Cy,n \u02dcV = I,\n\n(4)\n\n2nk \u02dcU>X \u02dcV>Yk2\n\nF + xk \u02dcUk2\n\nF + yk \u02dcVk2\n\nwhere Cx,n and Cy,n are the empirical estimates of covariance matrices for the n samples stacked\nin the matrices X 2 Rdx\u21e5n and Y 2 Rdy\u21e5n, using alternating least squares [14], projected gradient\ndescent (AppaGrad, [15]) or alternating SVRG combined with shift-and-invert pre-conditioning [24].\nHowever, all the works above focus only on the empirical problem, and can all be seen as instances of\nSAA (ERM) approach to the stochastic optimization (learning) problem (1). In particular, the analyses\nin these works bounds suboptimality on the training objective, not the population objective (1).\nThe only relevant work we are aware of that studies algorithms for CCA as a population problem is a\nparallel work by [7]. However, there are several key differences. First, the objective considered in [7]\nis different from ours. 
The focus in [7] is on \ufb01nding a solution U, V that is very similar (has high\nalignment with) the optimal population solution U\u21e4, V\u21e4. In order for this to be possible, [7] must rely\non an \"eigengap\" between the singular values of the cross-correlation matrix Cxy. In contrast, since\nwe are only concerned with \ufb01nding a solution that is good in terms of the population objective (2),\nwe need not, and do not, depend on such an eigengap. If there is no eigengap in the cross-correlation\nmatrix, the population optimal solution is not well-de\ufb01ned, but that is \ufb01ne for us \u2013 we are happy to\nreturn any optimal (or nearly optimal) solution.\nFurthermore, given such an eigengap, the emphasis in [7] is on the guaranteed overall runtime of their\nmethod. Their core algorithm is very ef\ufb01cient in terms of runtime, but is not a streaming algorithm\nand cannot be viewed as an SA algorithm. They do also provide a streaming version, which is runtime\nand memory ef\ufb01cient, but is still not a \u201cnatural\u201d SA algorithm, in that it does not work by making\na small update to the solution at each iteration. In contrast, here we present a more \u201cnatural\u201d SA\nalgorithm and put more emphasis on its iteration complexity, i.e. the number of samples processed.\nWe do provide polynomial runtime guarantees, but rely on a heuristic capping in order to achieve\ngood runtime performance in practice.\nFinally, [7] only consider obtaining the top correlated direction (k = 1) and it is not clear how to\nextend their approach to Problem (1) of \ufb01nding the top k 1 correlated directions. Our methods\nhandle the general problem, with k 1, naturally and all our guarantees are valid for any number of\ndesired directions k.\n\n1.4 Contributions\nThe goal in this paper is to directly optimize the CCA \u201cpopulation objective\u201d based on i.i.d. draws\nfrom the population rather than capturing the sample, i.e. the training objective. 
This view justi\ufb01es\nand favors stochastic approximation approaches that are far from optimal on the sample but are\nessentially as good as the sample average approximation approach on the population. Such a view\n\n3\n\n\fhas been advocated in supervised machine learning [6, 18]; here, we carry over the same view to the\nrich world of unsupervised learning. The main contributions of the paper are as follows.\n\n\u2022 We give a convex relaxation of the CCA optimization problem. We present two stochastic\napproximation algorithms for solving the resulting problem. These algorithms work in a\nstreaming setting, i.e. they process one sample at a time, requiring only a single pass through\nthe data, and can easily scale to large datasets.\n\n\u2022 The proposed algorithms are instances of inexact stochastic mirror descent with the choice\nof potential function being Frobenius norm and von Neumann entropy, respectively. Prior\nwork on inexact proximal gradient descent suggests a lower bound on the size of the noise\nrequired to guarantee convergence for inexact updates [16]. While that condition is violated\nhere for the CCA problem, we give a tighter analysis of our algorithms with noisy gradients\nestablishing sub-linear convergence rates.\n\n\u2022 We give precise iteration complexity bounds for our algorithms, i.e. we give upper bounds\non iterations needed to guarantee a user-speci\ufb01ed \u270f-suboptimality (w.r.t. population) for\nCCA. These bounds do not depend on the eigengap in the cross-correlation matrix. To the\nbest of our knowledge this is a \ufb01rst such characterization of CCA in terms of generalization.\n\u2022 We show empirically that the proposed algorithms outperform existing state-of-the-art\nmethods for CCA on a real dataset. 
We make our implementation of the proposed algorithms\nand existing competing techniques available online1.\n\n2 Matrix Stochastic Gradient for CCA (MSG-CCA)\n\nProblem (3) is a non-convex optimization problem, however, it admits a simple convex relaxation.\nTaking the convex hull of the constraint set in Problem 3 gives the following convex relaxation:\n\n2\n\n2\n\ny\n\nx CxyC 1\n\nmaximize hM, C 1\n\ni s.t. kMk2 \uf8ff 1, kMk\u21e4 \uf8ff k.\n\n(5)\nWhile our updates are designed for Problem (5), our algorithm returns a rank-k solution, through a\nsimple rounding procedure ([27, Algorithm 4]; see more details below), which has the same objective\nin expectation. This allows us to guarantee \u270f-suboptimality of the output of the algorithm on the\noriginal non-convex Problem (3), and equivalently Problem (2).\nSimilar relaxations have been considered previously to design stochastic approximation (SA) al-\ngorithms for principal component analysis (PCA) [2] and partial least squares (PLS) [4]. These\nSA algorithms are instances of stochastic gradient descent \u2013 a popular choice for convex learning\nproblems. However, designing similar updates for the CCA problem is challenging since the gradient\nof the CCA objective (see Problem (5)) w.r.t. M is g := C1/2\n, and it is not at all clear\nhow one can design an unbiased estimator, gt, of the gradient g unless one knows the marginal\ndistributions of x and y. Therefore, we consider an instance of inexact proximal gradient method [16]\nwhich requires access to a \ufb01rst-order oracle with noisy estimates, @t, of gt. We show that an oracle\n\nx CxyC1/2\n\nwith bound on E[PT\n\nt=1 kgt @tk] of O(pT ) ensures convergence of the proximal gradient method.\n\nFurthermore, we propose a \ufb01rst order oracle with the above property which instantiates the inexact\ngradient as\n\n(6)\nwhere Wx,t, Wy,t are empirical estimates of whitening transformation based on training data seen\nuntil time t. 
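Concretely, such an oracle maintains running estimates of the auto-covariances (initialized from an auxiliary sample, as in Algorithm 1 below) and whitens each incoming rank-one term. A minimal sketch, with class and method names of our own choosing:

```python
import numpy as np

class InexactCCAGradient:
    """Running whitening estimates W_{x,t}, W_{y,t}, producing the
    inexact gradient  d_t = W_{x,t} x_t y_t^T W_{y,t}  of (6)."""

    def __init__(self, X_aux, Y_aux):
        # Initialize auto-covariances from an auxiliary sample of size tau.
        self.tau = X_aux.shape[1]
        self.Cx = X_aux @ X_aux.T / self.tau
        self.Cy = Y_aux @ Y_aux.T / self.tau
        self.t = 0

    @staticmethod
    def _inv_sqrt(C):
        # Inverse matrix square root via symmetric eigendecomposition.
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(np.maximum(w, 1e-12) ** -0.5) @ Q.T

    def step(self, x, y):
        # Rank-one running updates of the empirical auto-covariances.
        self.t += 1
        a = 1.0 / (self.t + self.tau)
        self.Cx = (1 - a) * self.Cx + a * np.outer(x, x)
        self.Cy = (1 - a) * self.Cy + a * np.outer(y, y)
        Wx, Wy = self._inv_sqrt(self.Cx), self._inv_sqrt(self.Cy)
        return Wx @ np.outer(x, y) @ Wy   # inexact gradient estimate
```

The whitening matrices computed here are biased estimates of the true $C_x^{-1/2}$, $C_y^{-1/2}$, which is exactly why the analysis below must tolerate additive gradient noise.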
This leads to the following stochastic inexact gradient update:

$$M_{t+1} = P_F(M_t + \eta_t \partial_t), \qquad (7)$$

where $\partial_t := W_{x,t} x_t y_t^\top W_{y,t} \approx g_t$ is the inexact gradient of (6) and $P_F$ is the projection operator onto the constraint set of Problem (5).

Algorithm 1 provides the pseudocode for the proposed method, which we term inexact matrix stochastic gradient method for CCA (MSG-CCA). At each iteration, we receive a new sample $(x_t, y_t)$ and update the empirical estimates of the whitening transformations, which define the inexact gradient $\partial_t$. This is followed by a gradient update with step size $\eta$, and projection onto the set of constraints of Problem (5) with respect to the Frobenius norm through the operator $P_F(\cdot)$ [2]. After $T$ iterations, the algorithm returns a rank-$k$ matrix after a simple rounding procedure [27].

1 https://www.dropbox.com/sh/dkz4zgkevfyzif3/AABK9JlUvIUYtHvLPCBXLlpha?dl=0

Algorithm 1 Matrix Stochastic Gradient for CCA (MSG-CCA)
Input: training data $\{(x_t, y_t)\}_{t=1}^{T}$, step size $\eta$, auxiliary training data $\{(x'_i, y'_i)\}_{i=1}^{\tau}$
Output: $\tilde{M}$
1: Initialize: $M_1 \leftarrow 0$, $C_{x,0} \leftarrow \frac{1}{\tau}\sum_{i=1}^{\tau} x'_i {x'_i}^\top$, $C_{y,0} \leftarrow \frac{1}{\tau}\sum_{i=1}^{\tau} y'_i {y'_i}^\top$
2: for $t = 1, \cdots, T$ do
3:   $C_{x,t} \leftarrow \frac{t+\tau-1}{t+\tau} C_{x,t-1} + \frac{1}{t+\tau} x_t x_t^\top$,  $W_{x,t} \leftarrow C_{x,t}^{-1/2}$
4:   $C_{y,t} \leftarrow \frac{t+\tau-1}{t+\tau} C_{y,t-1} + \frac{1}{t+\tau} y_t y_t^\top$,  $W_{y,t} \leftarrow C_{y,t}^{-1/2}$
5:   $\partial_t \leftarrow W_{x,t} x_t y_t^\top W_{y,t}$
6:   $M_{t+1} \leftarrow P_F(M_t + \eta \partial_t)$   % projection given in [2]
7: end for
8: $\bar{M} = \frac{1}{T}\sum_{t=1}^{T} M_t$
9: $\tilde{M} = \mathrm{rounding}(\bar{M})$   % Algorithm 2 in [27]

We denote the empirical estimates of the auto-covariance matrices based on the first $t$ samples by $C_{x,t}$ and $C_{y,t}$. Our analysis of MSG-CCA follows a two-step procedure. First, we show that the empirical estimates of the whitening transform matrices, i.e. $W_{x,t} := C_{x,t}^{-1/2}$ and $W_{y,t} := C_{y,t}^{-1/2}$, guarantee that the expected error in the "inexact" estimate $\partial_t$ converges to zero as $O(1/\sqrt{t})$. Next, we show that the resulting noisy stochastic gradient method converges to the optimum as $O(1/\sqrt{T})$. In what follows, we denote the true whitening transforms by $W_x := C_x^{-1/2}$ and $W_y := C_y^{-1/2}$.

Since Algorithm 1 requires inverting the empirical auto-covariance matrices, we need to ensure that the smallest eigenvalues of $C_{x,t}$ and $C_{y,t}$ are bounded away from zero. Our first technical result shows that this happens with high probability for all iterates.

Lemma 2.1. With probability $1-\delta$ with respect to training data drawn i.i.d. from $\mathcal{D}$, it holds uniformly for all $t$ that $\lambda_{\min}(C_{x,t}) \geq \frac{r_x}{2}$ and $\lambda_{\min}(C_{y,t}) \geq \frac{r_y}{2}$ whenever

$$\tau \geq \max\left\{ \frac{1}{c_x} \log\left(\frac{2d_x}{\log\left(\frac{1}{1-\delta}\right)}\right) - 1,\; \frac{1}{c_y} \log\left(\frac{2d_y}{\log\left(\frac{1}{1-\delta}\right)}\right) - 1 \right\}.$$

Here $c_x = \frac{3r_x^2}{6B^2 + Br_x}$ and $c_y = \frac{3r_y^2}{6B^2 + Br_y}$.

We denote by $A_t$ the event that for all $j = 1, \ldots, t-1$ the empirical auto-covariance matrices $C_{x,j}$ and $C_{y,j}$ have their smallest eigenvalues bounded from below by $\frac{r_x}{2}$ and $\frac{r_y}{2}$, respectively. Lemma 2.1 above guarantees that this event occurs with probability at least $1-\delta$, as long as there are $\tau = \Omega\left(\frac{B^2}{r^2} \log\left(\frac{2d}{\log(\frac{1}{1-\delta})}\right)\right)$ samples in the auxiliary dataset.

Lemma 2.2. Assume that the event $A_t$ occurs, and that with probability one, for $(x, y) \sim \mathcal{D}$, we have $\max\{\|x\|^2, \|y\|^2\} \leq B$. Then, for $\kappa := \frac{8B^2\sqrt{2\log(d)}}{r^2}$, the following holds for all $t$:

$$\mathbb{E}_{\mathcal{D}}\left[\|g_t - \partial_t\|_2 \mid A_t\right] \leq \frac{\kappa}{\sqrt{t}}.$$

The result above bounds the size of the expected noise in the estimate of the inexact gradient. Not surprisingly, the error decays as our estimates of the whitening transformation improve with more data. Moreover, the rate at which the error decreases is sufficient to bound the suboptimality of the MSG-CCA algorithm even with noisy biased stochastic gradients.
Theorem 2.3. 
After T iterations of MSG-CCA (Algorithm 1) with step size \u2318 = 2pk\nGpT\nsample of size \u2327 =\u2326( B2\n)), and initializing M1 = 0, the following holds:\n2pkG + 2k\uf8ff + kB/r\n\n, auxiliary\n\nr2 log(\n\nlog(\n\n2d\n\nhM\u21e4, C 1\n\n2\n\nx CxyC 1\n\ny\n\n2\n\nx CxyC 1\n\ny\n\n2\n\ni] \uf8ff\n\npT\n\n)\n\npTpT1\ni E[h \u02dcM, C 1\n\n2\n\n,\n\n(8)\n\n5\n\n\fwhere the expectation is with respect to the i.i.d. samples and rounding, \uf8ff is as de\ufb01ned in Lemma 2.2,\nM\u21e4 is the optimum of (3), \u02dcM is the rank-k output of MSG-CCA, and G = 2Bprxry\nWhile Theorem 2.3 gives a bound on the objective of Problem (3), it implies a bound on the original\nCCA objective of Problem (1). In particular, given a rank-k factorization of \u02dcM := UV>, such that\nU>U = Ik and V>V = Ik, we construct\n\n.\n\n\u02dcU = C 1\n\nx,T U, \u02dcV := C 1\n\ny,T V.\n\n2\n\n2\n\nWe then have the following generalization bound.\nTheorem 2.4. After T iterations of MSG-CCA (Algorithm 1) with step size \u2318 = 2pk\nGpT\nsample of size \u2327 =\u2326( B2\nT1 ) )), and initializing M1 = 0, the following holds\n\nr2 log(\n\n2d\nlog( T\n\n(9)\n\n, auxiliary\n\nTr(U>\n\n\u21e4 CxyV\u21e4)E[Tr( \u02dcU>Cxy \u02dcV)] \uf8ff\n\nE[k \u02dcU>Cx \u02dcU Ik2] \uf8ff\n\nE[k \u02dcV>Cy \u02dcV Ik2] \uf8ff\n\n2pkG + 2k\uf8ff\n\npT\n\nT\n\nx r 2B2\ny r 2B2\n\nT\n\nB\nr2\n\nB\nr2\n\n2kB\n\n+\n\nkB\nrT\n\n+\n\nlog (dx) +\n\nlog (dy) +\n\nlog (d) +\n\nlog (d)!,\n\n2B\n3T\n\nT\n\nr2 r 2B2\nlog (dx)! +\nlog (dy)! +\n\n2B\n3T\n\n2B\n3T\n\nB + 1\n\nT\n\nB + 1\n\nT\n\n,\n\n,\n\n.\n\nwhere the expectation is with respect to the i.i.d. samples and rounding, the pair (U\u21e4, V\u21e4) is\nthe optimum of (1), ( \u02dcU , \u02dcV ) are the factors (de\ufb01ned in (9)) of the rank-k output of MSG-CCA,\nr := min{rx, ry}, d := max{dx, dy}, \uf8ff is as given in Lemma 2.2, and G = 2Bprxry\nAll proofs are deferred to the Appendix in the supplementary material. 
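To make the update (7) and the extraction (9) concrete, here is a self-contained NumPy sketch (our own rendering, not the paper's reference implementation). The Frobenius projection onto $\{\|M\|_2 \leq 1, \|M\|_* \leq k\}$ is implemented by clipping singular values to $[0, 1]$ and bisecting on a KKT shift so that their sum is at most $k$; this is a simplified stand-in for the projection routine of [2], and the rounding procedure of [27] is replaced by a plain top-$k$ truncation of the averaged iterate.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root via symmetric eigendecomposition."""
    w, Q = np.linalg.eigh(C)
    return Q @ np.diag(np.maximum(w, 1e-12) ** -0.5) @ Q.T

def project_F(M, k):
    """Frobenius projection of M onto {||M||_2 <= 1, ||M||_* <= k}:
    clip singular values to [0, 1]; if their sum still exceeds k,
    bisect on a shift so that sum_i clip(s_i - shift, 0, 1) <= k."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    clipped = lambda shift: np.clip(s - shift, 0.0, 1.0)
    if clipped(0.0).sum() <= k:
        return U @ np.diag(clipped(0.0)) @ Vt
    lo, hi = 0.0, s.max()
    for _ in range(60):                       # bisection on the KKT shift
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if clipped(mid).sum() > k else (lo, mid)
    return U @ np.diag(clipped(hi)) @ Vt

def msg_cca(stream, X_aux, Y_aux, k, eta):
    """One pass of (simplified) MSG-CCA over an iterable of pairs (x_t, y_t)."""
    tau = X_aux.shape[1]
    Cx, Cy = X_aux @ X_aux.T / tau, Y_aux @ Y_aux.T / tau
    M = np.zeros((Cx.shape[0], Cy.shape[0]))
    M_bar, T = np.zeros_like(M), 0
    for t, (x, y) in enumerate(stream, start=1):
        a = 1.0 / (t + tau)                   # running covariance updates
        Cx = (1 - a) * Cx + a * np.outer(x, x)
        Cy = (1 - a) * Cy + a * np.outer(y, y)
        grad = inv_sqrt(Cx) @ np.outer(x, y) @ inv_sqrt(Cy)  # inexact gradient
        M = project_F(M + eta * grad, k)                     # update (7)
        M_bar, T = M_bar + M, t
    M_bar /= T
    # Top-k truncation (stand-in for rounding), then un-whiten as in (9).
    U, s, Vt = np.linalg.svd(M_bar)
    return inv_sqrt(Cx) @ U[:, :k], inv_sqrt(Cy) @ Vt[:k, :].T
```

Note the returned factors are un-whitened with the final covariance estimates $C_{x,T}$, $C_{y,T}$, mirroring the construction of $(\tilde{U}, \tilde{V})$ in (9).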
Few remarks are in order.\nConvexity: In our design and analysis of MSG-CCA, we have leveraged the following observations:\n(i) since the objective is linear, an optimum of (5) is always attained at an extreme point, corresponding\nto an optimum of (3); (ii) the exact convex relaxation (5) is tractable (this is not often the case for\nnon-convex problems); and (iii) although (5) might also have optima not on extreme points, we have\nan ef\ufb01cient randomized method, called rounding, to extract from any feasible point of (5) a solution\nof (3) that has the same value in expectation [27].\nEigengap free bound: Theorem 2.3 and 2.4 do not require an eigengap in the cross-correlation\nmatrix Cxy, and in particular the error bound, and thus the implied iteration complexity to achieve a\ndesired suboptimality does not depend on an eigengap.\nComparison with [7]: It is not straightforward to compare with the results of [7]. As discussed in\nSection 1.3, authors in [7] consider only the case k = 1 and their objective is different than ours. They\nseek (u, v) that have high alignment with the optimal (u\u21e4, v\u21e4) as measured through the alignment\n2\u00afu>Cxu\u21e4 + \u00afv>Cyv\u21e4. Furthermore, the analysis in [7] is dependent on the eigengap\n(\u00afu, \u00afv) := 1\n = 1 2 between the top two singular values 1, 2 of the population cross-correlation matrix T.\nNevertheless, one can relate their objective (u, v) to ours and ask what their guarantees ensure in\nterms of our objective, namely achieving \u270f-suboptimality for Problem (3). For the case k = 1, and\nin the presence of an eigengap , the method of [7] can be used to \ufb01nd an \u270f-suboptimal solution to\nProblem (3) with O( log2(d)\nCapped MSG-CCA: Although MSG-CCA comes with good theoretical guarantees, the compu-\ntational cost per iteration can be O(d3). Therefore, we consider a practical variant of MSG-CCA\nthat explicitly controls the rank of the iterates. 
To ensure computational ef\ufb01ciency, we recommend\nimposing a hard constraint on the rank of the iterates of MSG-CCA, following an approach similar to\nprevious works on PCA [2] and PLS [4]:\nx CxyC 1\n\n(10)\nFor estimates of the whitening transformations, at each iteration, we set the smallest dK eigenvalues\nof the covariance matrices to a constant (of the order of the estimated smallest eigenvalue of the\n\ni s.t. kMk2 \uf8ff 1, kMk\u21e4 \uf8ff k, rank (M) \uf8ff K.\n\nmaximize hM, C 1\n\n\u270f2 ) samples.\n\ny\n\n2\n\n2\n\n6\n\n\fcovariance matrix). This allows us to ef\ufb01ciently compute the whitening transformations since the\ncovariance matrices decompose into a sum of a low-rank matrix and a scaled identity matrix, bringing\ndown the computational cost per iteration to O(dK2). We observe empirically on a real dataset\n(see Section 4) that this procedure along with capping the rank of MSG iterates does not hurt the\nconvergence of MSG-CCA.\n\n3 Matrix Exponentiated Gradient for CCA (MEG-CCA)\n\nIn this section, we consider matrix multiplicative weight updates for CCA. Multiplicative weights\nmethod is a generic algorithmic technique in which one updates a distribution over a set of interest\nby iteratively multiplying probability mass of elements [12]. In our setting, the set is that of d k-\ndimensional (paired) subspaces and the multiplicative algorithm is an instance of matrix exponentiated\ngradient (MEG) update. A motivation for considering MEG is the fact that for related problems,\nincluding principal component analysis (PCA) and partial least squares (PLS), MEG has been shown\nto yield fast optimistic rates [4, 22, 26]. Unfortunately we are not able to recover such optimistic\nrates for CCA as the error in the inexact gradient decreases too slowly.\nOur development of MEG requires the symmetrization of Problem (3). Recall that g :=\nC1/2\nd = dx + dy. The matrix C is referred to as the self-adjoint dilation of the matrix g [20]. 
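The utility of the dilation rests on the standard fact that its nonzero eigenvalues come in pairs $\pm\sigma_i(g)$, with eigenvectors formed by stacking the corresponding left and right singular vectors of $g$. A quick NumPy check of the eigenvalue pairing (our own illustration; the random matrix stands in for $C_x^{-1/2} C_{xy} C_y^{-1/2}$):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal((3, 4))          # stand-in for the whitened cross-covariance

# Self-adjoint dilation: C = [[0, g], [g^T, 0]], of size (d_x + d_y).
C = np.block([[np.zeros((3, 3)), g],
              [g.T, np.zeros((4, 4))]])

sing = np.linalg.svd(g, compute_uv=False)   # singular values of g (3 of them)
eigs = np.sort(np.linalg.eigvalsh(C))       # 7 eigenvalues of the dilation

# Expected spectrum: +/- each singular value, padded with |d_x - d_y| zeros.
expected = np.sort(np.concatenate([sing, -sing, np.zeros(1)]))
print(eigs)
```

Because the top-$k$ eigenvectors of the dilation stack the top-$k$ singular pairs of $g$, eigenvector methods on $C$ recover the CCA directions.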
Given the\nSVD of g = U\u2303V> with no repeated singular values, the eigen-decomposition of C is given as\n\n. Consider the following symmetric matrix C :=\uf8ff 0\n0 \u2303\u25c6\u2713U U\nV V\u25c6>\n\n0 of size d \u21e5 d, where\n\n2\u2713U U\nV V\u25c6\u2713\u23030\n\nx CxyC1/2\n\nC =\n\n1\n\ng\n\ng\n\ny\n\n.\n\nIn other words, the top-k left and right singular vectors of C1/2\n, which comprise the CCA\nsolution we seek, are encoded in top and bottom rows, respectively, of the top-k eigenvectors of its\ndilation. This suggests the following scaled re-parameterization of Problem (3): \ufb01nd M 2 Rd\u21e5d that\n(11)\n\nmaximize hM, Ci s.t. i(M) 2{ 0, 1}, i = 1, . . . , d, rank (M) = k.\n\nx CxyC1/2\n\ny\n\nAs in Section 2, we take the convex hull of the constraint set to get a convex relaxation to Problem (11).\n\n(12)\nStochastic mirror descent on Problem (12) with the choice of potential function being the quantum\nrelative entropy gives the following updates [4, 27]:\n\nmaximize hM, Ci s.t. M \u232b 0,kMk2 \uf8ff 1, Tr (M) = k.\n\n, Mt = P\u21e3bMt\u2318 ,\n\n(13)\n\nexp (log (Mt1) + \u2318Ct)\n\nTr (exp (log (Mt1) + \u2318Ct))\n\nbMt =\n\nwhere Ct is the self-adjoint dilation of unbiased instantaneous gradient gt, and P denotes the Bregman\nprojection [10] onto the convex set of constraints in Problem (12). As discussed in Section 2 we\n\nt=1 kCt \u02dcCtk|AT ] of O(pT ).\nonly need an inexact gradient estimate \u02dcCt of Ct with a bound on E[PT\nSetting \u02dcCt to be the self-adjoint dilation of @t, de\ufb01ned in Section 2, guarantees such a bound.\nLemma 3.1. Assume that the event At occurs,gt @t has no repeated singular values and that with\nprobability one, for (x, y) \u21e0 D, we have max{kxk2 ,kyk2}\uf8ff B. 
Then, for \uf8ff de\ufb01ned in lemma 2.2,\nwe have that, Ext,ythMt1 M\u21e4, Ct \u02dcCt|Ati \uf8ff 2k\uf8ffpt\n, where M\u21e4 is the optimum of Problem (11).\nUsing the bound above, we can bound the suboptimality gap in the population objective between the\ntrue rank-k CCA solution and the rank-k solution returned by MEG-CCA.\nTheorem 3.2. After T iterations of MEG-CCA (see Algorithm 2 in Appendix) with step size \u2318 =\n)) and initializing M0 = 1\nd I,\n1\n\nGT \u25c6, auxiliary sample of size \u2327 =\u2326( B2\n\nG log\u27131 +q log(d)\n\nr2 log(\n\npTpT1\n\nlog(\n\n2d\n\n)\n\nthe following holds:\n\nhM\u21e4, Ci E[h \u02dcM, Ci] \uf8ff 2kr G2 log (d)\n\nT\n\n+ 2\n\nk\uf8ff\npT\n\n,\n\n7\n\n\fwhere the conditional expectation is taken with respect to the distribution and the internal randomiza-\ntion of the algorithm, M\u21e4 is the optimum of Problem (11), \u02dcM is the rank-k output of MEG-CCA after\nrounding, G = 2Bprxry\n\nand \uf8ff is de\ufb01ned in Lemma 2.2.\n\nAll of our remarks regarding latent convexity of the problem and practical variants from Section 2\napply to MEG-CCA as well. We note, however, that without additional assumptions like eigengap for\nT we are not able to recover projections to the canonical subspaces as done in Theorem 2.4.\n\n4 Experiments\n\nWe provide experimental results for our proposed methods, in particular we compare capped-MSG\nwhich is the practical variant of Algorithm 1 with capping as de\ufb01ned in equation (10), and MEG\n(Algorithm 2 in the Appendix), on a real dataset, Mediamill [19], consisting of paired observations\nof videos and corresponding commentary. We compare our algorithms against CCALin of [8], ALS\nCCA of [24]2, and SAA, which is denoted by \u201cbatch\u201d in Figure 1. All of the comparisons are given\nin terms of the CCA objective as a function of either CPU runtime or number of iterations. The\ntarget dimensionality in our experiments is k 2{ 1, 2, 4}. 
The choice of k is dictated largely by the\nfact that the spectrum of the Mediamill dataset decays exponentially. To ensure that the problem\nis well-conditioned, we add I for = 0.1 to the empirical estimates of the covariance matrices on\nMediamill dataset. For both MSG and MEG we set the step size at iteration t to be \u2318t = 0.1pt .\nMediamill is a multiview dataset consisting of n = 10, 000 corresponding videos and text annota-\ntions with labels representing semantic concepts [19]. The image view consists of 120-dimensional\nvisual features extracted from representative frames selected from videos, and the textual features are\n100-dimensional. We give the competing algorithms, both CCALin and ALS CCA, the advantage of\n\nthe knowledge of the eigengap at k. In particular, we estimate the spectrum of the matrixbT for the\n\nMediamill dataset and set the gap-dependent parameters in CCALin and ALS CCA accordingly.\nWe note, however, that estimating the eigengap to set the parameters is impractical in real scenarios.\nBoth CCALin and ALS CCA will therefore require additional tuning compared to MSG and MEG\nalgorithms proposed here. In the experiments, we observe that CCALin and ALS CCA outperform\nMEG and capped-MSG when recovering the top CCA component, in terms of progress per-iteration.\nHowever, capped-MSG is the best in terms of the overall runtime. The plots are shown in Figure 1.\n\n5 Discussion\n\n\u270f2 ).\n\nWe study CCA as a stochastic optimization problem and show that it is ef\ufb01ciently learnable by\nproviding analysis for two stochastic approximation algorithms. In particular, the proposed algorithms\nachieve \u270f-suboptimality in population objective in iterations O( 1\nNote that both of our Algorithms, MSG-CCA in Algorithm 1 and MEG-CCA in Algorithm 2\nin Appendix B are instances of inexact proximal-gradient method which was studied in [16]. 
In particular, both algorithms receive a noisy gradient $\partial_t = g_t + E_t$ at iteration $t$ and perform exact proximal steps (the Bregman projections in equations (7) and (13)). The main result in [16] provides an $O(E^2/T)$ convergence rate, where $E = \sum_{t=1}^{T} \lVert E_t \rVert$ is the partial sum of the errors in the gradients. It is shown there that $E = o(\sqrt{T})$ is a necessary condition to obtain convergence. However, for the CCA problem considered in this paper, our Lemma A.6 shows that $E = O(\sqrt{T})$; in fact, it is easy to see that $E = \Theta(\sqrt{T})$. Our analysis nevertheless yields $O(\frac{1}{\sqrt{T}})$ convergence rates for both Algorithms 1 and 2. This perhaps warrants further investigation into the more general problem of inexact proximal-gradient methods.

In empirical comparisons, we found the capped version of the proposed MSG algorithm to outperform other methods, including MEG, in terms of the overall runtime needed to reach an $\epsilon$-suboptimal solution. Future work will focus on gaining a better theoretical understanding of capped MSG.

²We run ALS only for $k = 1$, as the algorithm and the current implementation from the authors do not handle $k > 1$.

[Figure 1: three panels, (a) $k = 1$, (b) $k = 2$, (c) $k = 4$, plotting the objective for CAPPED-MSG, MEG, CCALin, batch, ALS CCA, and the maximum objective.]

Figure 1: Comparisons of CCA-Lin, CCA-ALS, MSG, and MEG for CCA optimization on the
MediaMill dataset, in terms of the objective value as a function of iteration (top) and as a function of CPU runtime (bottom).

Acknowledgements

This research was supported in part by NSF BIGDATA grant IIS-1546482.

References

[1] Z. Allen-Zhu and Y. Li. Doubly accelerated methods for faster CCA and generalized eigendecomposition. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017. Full version available at http://arxiv.org/abs/1607.06017.

[2] R. Arora, A. Cotter, and N. Srebro. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems (NIPS), 2013.

[3] R. Arora and K. Livescu. Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7135–7139. IEEE, 2013.

[4] R. Arora, P. Mianjy, and T. Marinov. Stochastic optimization for multiview representation learning using partial least squares. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1786–1794, 2016.

[5] A. Benton, R. Arora, and M. Dredze. Learning multiview embeddings of Twitter users. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 2, pages 14–19, 2016.

[6] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

[7] C. Gao, D. Garber, N. Srebro, J. Wang, and W. Wang. Stochastic canonical correlation analysis. arXiv preprint arXiv:1702.06533, 2017.

[8] R. Ge, C. Jin, P. Netrapalli, A. Sidford, et al. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In International Conference on Machine Learning, pages 2741–2750, 2016.

[9] S. Golden. Lower bounds for the Helmholtz function.
Physical Review, 137(4B):B1127, 1965.

[10] M. Herbster and M. K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1(Sep):281–309, 2001.

[11] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.

[12] S. Kale. Efficient algorithms using the multiplicative weights update method. Princeton University, 2007.

[13] E. Kidron, Y. Y. Schechner, and M. Elad. Pixels that sound. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 88–95. IEEE, 2005.

[14] Y. Lu and D. P. Foster. Large scale canonical correlation analysis with iterative least squares. In Advances in Neural Information Processing Systems, pages 91–99, 2014.

[15] Z. Ma, Y. Lu, and D. Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In Proceedings of the 32nd International Conference on Machine Learning, pages 169–178, 2015.

[16] M. Schmidt, N. L. Roux, and F. R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pages 1458–1466, 2011.

[17] B. A. Schmitt. Perturbation bounds for matrix square roots and Pythagorean sums. Linear Algebra and its Applications, 174:215–227, 1992.

[18] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935. ACM, 2008.

[19] C. G. Snoek, M. Worring, J. C. Van Gemert, J.-M. Geusebroek, and A. W. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th ACM International Conference on Multimedia, pages 421–430. ACM, 2006.

[20] J. A. Tropp. User-friendly tail bounds for sums of random matrices.
Foundations of Computational Mathematics, 12(4):389–434, 2012.

[21] J. A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

[22] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, pages 995–1018, 2005.

[23] A. Vinokourov, N. Cristianini, and J. Shawe-Taylor. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems, pages 1497–1504, 2003.

[24] W. Wang, J. Wang, D. Garber, and N. Srebro. Efficient globally convergent stochastic optimization for canonical correlation analysis. In Advances in Neural Information Processing Systems, pages 766–774, 2016.

[25] M. K. Warmuth and D. Kuzmin. Online variance minimization. In Learning Theory, pages 514–528. Springer, 2006.

[26] M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In NIPS'06, 2006.

[27] M. K. Warmuth and D. Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(10), 2008.