{"title": "Statistical Convergence of Kernel CCA", "book": "Advances in Neural Information Processing Systems", "page_first": 387, "page_last": 394, "abstract": null, "full_text": "Statistical Convergence of Kernel CCA\n\nKenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp\n\nFrancis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris, France francis.bach@mines.org\n\nArthur Gretton Max Planck Institute for Biological Cybernetics 72076 Tubingen, Germany  arthur.gretton@tuebingen.mpg.de\n\nAbstract\nWhile kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence.\n\n1\n\nIntro duction\n\nKernel canonical correlation analysis (kernel CCA) has been proposed as a nonlinear extension of CCA [1, 11, 3]. Given two random variables, kernel CCA aims at extracting the information which is shared by the two random variables, and has been successfully applied in various practical contexts. More precisely, given two random variables X and Y , the purpose of kernel CCA is to provide nonlinear mappings f (X ) and g (Y ) such that their correlation is maximized. As in many statistical methods, the desired functions are in practice estimated from a finite sample. Thus, the convergence of the estimated functions to the population ones with increasing sample size is very important to justify the method. 
Since the goal of kernel CCA is to estimate a pair of functions, the convergence should be evaluated in an appropriate functional norm: thus, we need tools from functional analysis to characterize the type of convergence. The purpose of this paper is to rigorously prove the statistical convergence of kernel CCA, and of a related method. The latter uses a NOrmalized Cross-Covariance Operator, and we call it NOCCO for short. Both kernel CCA and NOCCO require a regularization coefficient to enforce smoothness of the functions in the finite sample case (thus avoiding a trivial solution), but the required decay of this regularization with increasing sample size has not yet been established. Our main theorems give a sufficient condition on the decay of the regularization coefficient for the finite sample estimates to converge to the desired functions in the population limit. Another important issue in establishing the convergence is an appropriate distance measure for functions. For NOCCO, we obtain convergence in the norm of reproducing kernel Hilbert spaces (RKHS) [2]. This norm is very strong: if the positive definite (p.d.) kernels are continuous and bounded, it is stronger than the uniform norm in the space of continuous functions, and thus the estimated functions converge uniformly to the desired ones. For kernel CCA, we show convergence in the L² norm, which is a standard distance measure for functions. We also discuss the relation between our results and two relevant studies: COCO [9] and CCA on curves [10].\n\n2 Kernel CCA and related methods\n\nIn this section, we review kernel CCA as presented by [3], and then formulate it with covariance operators on RKHS. In this paper, a Hilbert space always refers to a separable Hilbert space, and an operator to a linear operator. ‖T‖ denotes the operator norm sup_{‖u‖=1} ‖Tu‖, and R(T) denotes the range of an operator T.
Throughout this paper, (HX, kX) and (HY, kY) are RKHS of functions on measurable spaces X and Y, respectively, with measurable p.d. kernels kX and kY. We consider a random vector (X, Y) : Ω → X × Y with distribution P_XY. The marginal distributions of X and Y are denoted P_X and P_Y. We always assume\n\nE_X[kX(X, X)] < ∞  and  E_Y[kY(Y, Y)] < ∞.  (1)\n\nNote that under this assumption it is easy to see HX and HY are continuously included in L²(P_X) and L²(P_Y), respectively, where L²(μ) denotes the Hilbert space of square integrable functions with respect to the measure μ.\n\n2.1 CCA in reproducing kernel Hilbert spaces\n\nClassical CCA provides the linear mappings aᵀX and bᵀY that achieve maximum correlation. Kernel CCA extends this by looking for functions f and g such that f(X) and g(Y) have maximal correlation. More precisely, kernel CCA solves\n\nmax_{f∈HX, g∈HY} Cov[f(X), g(Y)] / ( Var[f(X)]^{1/2} Var[g(Y)]^{1/2} ).  (2)\n\nIn practice, we have to estimate the desired functions from a finite sample. Given an i.i.d. sample (X1, Y1), ..., (Xn, Yn) from P_XY, an empirical solution of Eq. (2) is\n\nmax_{f∈HX, g∈HY} Ĉov[f(X), g(Y)] / { (V̂ar[f(X)] + ε_n ‖f‖²_HX)^{1/2} (V̂ar[g(Y)] + ε_n ‖g‖²_HY)^{1/2} },  (3)\n\nwhere Ĉov and V̂ar denote the empirical covariance and variance, such as\n\nĈov[f(X), g(Y)] = (1/n) Σ_{i=1}^n ( f(Xi) − (1/n) Σ_{j=1}^n f(Xj) )( g(Yi) − (1/n) Σ_{j=1}^n g(Yj) ).\n\nThe positive constant ε_n is a regularization coefficient. As we shall see, the regularization terms ε_n‖f‖²_HX and ε_n‖g‖²_HY make the problem well-formulated statistically, enforce smoothness, and enable operator inversion, as in Tikhonov regularization.\n\n2.2 Representation with cross-covariance operators\n\nKernel CCA and related methods can be formulated using covariance operators [4, 7, 8], which make theoretical discussions easier.
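In practice the empirical problem in Eq. (3) reduces to a generalized eigenproblem on centered Gram matrices. The following is a minimal NumPy/SciPy sketch, not code from the paper: the Gaussian kernel and all names are illustrative choices, and it uses the common device from [3] of replacing K̃² + nεK̃ by (K̃ + (nε/2)I)², which keeps the right-hand side positive definite.\n\n```python\nimport numpy as np\nfrom scipy.linalg import eigh\n\ndef rbf_gram(X, sigma=1.0):\n    # Gaussian kernel Gram matrix K_ij = exp(-||xi - xj||^2 / (2 sigma^2))\n    sq = np.sum(X**2, axis=1)\n    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T\n    return np.exp(-d2 / (2.0 * sigma**2))\n\ndef centered(K):\n    # H K H with H = I - (1/n) 1 1^T: empirically centers the feature maps\n    n = K.shape[0]\n    H = np.eye(n) - 1.0 / n\n    return H @ K @ H\n\ndef kernel_cca(Kx, Ky, eps):\n    # First kernel canonical correlation of Eq. (3):\n    #   maximize  a' Kx Ky b\n    #   s.t.      a'(Kx + (n*eps/2) I)^2 a = b'(Ky + (n*eps/2) I)^2 b = 1\n    n = Kx.shape[0]\n    Kx, Ky = centered(Kx), centered(Ky)\n    Rx = Kx + (n * eps / 2.0) * np.eye(n)\n    Ry = Ky + (n * eps / 2.0) * np.eye(n)\n    A = np.zeros((2 * n, 2 * n))\n    A[:n, n:] = Kx @ Ky\n    A[n:, :n] = Ky @ Kx\n    B = np.zeros((2 * n, 2 * n))\n    B[:n, :n] = Rx @ Rx\n    B[n:, n:] = Ry @ Ry\n    vals, vecs = eigh(A, B)          # generalized eigenproblem, cf. Eq. (6)\n    rho = vals[-1]                   # largest canonical correlation\n    alpha, beta = vecs[:n, -1], vecs[n:, -1]\n    return rho, alpha, beta\n\nrng = np.random.default_rng(0)\nX = rng.normal(size=(80, 1))\nY = X + 0.05 * rng.normal(size=(80, 1))   # strongly dependent pair\nrho, alpha, beta = kernel_cca(rbf_gram(X), rbf_gram(Y), eps=1e-2)\n```\n\nThe coefficient vectors give the estimated functions as expansions f_n(·) = Σ_i α_i (kX(·, Xi) − (1/n)Σ_j kX(·, Xj)), and analogously for g_n, matching the Gram-matrix representation noted below.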
It is known that there exists a unique cross-covariance operator Σ_YX : HX → HY for (X, Y) such that\n\n⟨g, Σ_YX f⟩_HY = E_XY[ (f(X) − E_X[f(X)]) (g(Y) − E_Y[g(Y)]) ]  ( = Cov[f(X), g(Y)] )\n\nholds for all f ∈ HX and g ∈ HY. The cross-covariance operator represents the covariance of f(X) and g(Y) as a bilinear form of f and g. In particular, if Y is equal to X, the self-adjoint operator Σ_XX is called the covariance operator. Let (X1, Y1), ..., (Xn, Yn) be i.i.d. random vectors on X × Y with distribution P_XY. The empirical cross-covariance operator Σ_YX^{(n)} is defined as the cross-covariance operator with respect to the empirical distribution (1/n) Σ_{i=1}^n δ_{Xi} δ_{Yi}. By definition, for any f ∈ HX and g ∈ HY, the operator Σ_YX^{(n)} gives the empirical covariance as follows:\n\n⟨g, Σ_YX^{(n)} f⟩_HY = Ĉov[f(X), g(Y)].  (4)\n\nLet Q_X and Q_Y be the orthogonal projections which respectively map HX onto the closure of R(Σ_XX) and HY onto the closure of R(Σ_YY). It is known [4] that Σ_YX can be represented as\n\nΣ_YX = Σ_YY^{1/2} V_YX Σ_XX^{1/2},\n\nwhere V_YX : HX → HY is the unique bounded operator such that ‖V_YX‖ ≤ 1 and V_YX = Q_Y V_YX Q_X. We often write V_YX as Σ_YY^{-1/2} Σ_YX Σ_XX^{-1/2} in an abuse of notation, even when Σ_XX^{-1/2} or Σ_YY^{-1/2} are not appropriately defined as operators. With cross-covariance operators, the kernel CCA problem can be formulated as\n\nsup_{f∈HX, g∈HY} ⟨g, Σ_YX f⟩_HY   subject to   ⟨f, Σ_XX f⟩_HX = 1,  ⟨g, Σ_YY g⟩_HY = 1.  (5)\n\nAs with classical CCA, the solution of Eq. (5) is given by the eigenfunctions corresponding to the largest eigenvalue of the following generalized eigenproblem:\n\n[ O      Σ_XY ] [f]        [ Σ_XX  O    ] [f]\n[ Σ_YX  O     ] [g]  = ρ1  [ O     Σ_YY ] [g].  (6)\n\nSimilarly, the empirical estimator in Eq. (3) is obtained by solving\n\nsup_{f∈HX, g∈HY} ⟨g, Σ_YX^{(n)} f⟩_HY   subject to   ⟨f, (Σ_XX^{(n)} + ε_n I) f⟩_HX = 1,  ⟨g, (Σ_YY^{(n)} + ε_n I) g⟩_HY = 1.  (7)\n\nLet us assume that the operator V_YX is compact,¹ and let φ and ψ be the unit eigenfunctions of V_YX corresponding to the largest singular value; that is,\n\n⟨ψ, V_YX φ⟩_HY = max_{f∈HX, g∈HY, ‖f‖_HX = ‖g‖_HY = 1} ⟨g, V_YX f⟩_HY.  (8)\n\nGiven φ ∈ R(Σ_XX^{1/2}) and ψ ∈ R(Σ_YY^{1/2}), the kernel CCA solution in Eq. (6) is\n\nf = Σ_XX^{-1/2} φ,   g = Σ_YY^{-1/2} ψ.  (9)\n\nIn the empirical case, let φ_n ∈ HX and ψ_n ∈ HY be the unit eigenfunctions corresponding to the largest singular value of the finite rank operator\n\nV_YX^{(n)} := (Σ_YY^{(n)} + ε_n I)^{-1/2} Σ_YX^{(n)} (Σ_XX^{(n)} + ε_n I)^{-1/2}.  (10)\n\nAs in Eq. (9), the empirical estimators f_n and g_n in Eq. (7) are equal to\n\nf_n = (Σ_XX^{(n)} + ε_n I)^{-1/2} φ_n,   g_n = (Σ_YY^{(n)} + ε_n I)^{-1/2} ψ_n.  (11)\n\n¹A bounded operator T : H1 → H2 is called compact if any bounded sequence {u_n} ⊂ H1 has a subsequence {u_{n'}} such that T u_{n'} converges in H2. One of the useful properties of a compact operator is that it admits a singular value decomposition (see [5, 6]).\n\nNote that all the above empirical operators and the estimators can be expressed in terms of Gram matrices. The solutions f_n and g_n are exactly the same as those given in [3], and are obtained as linear combinations of kX(·, Xi) − (1/n) Σ_{j=1}^n kX(·, Xj) and kY(·, Yi) − (1/n) Σ_{j=1}^n kY(·, Yj). The functions φ_n and ψ_n are obtained similarly.\n\nThere exist additional, related methods to extract nonlinear dependence. The constrained covariance (COCO) [9] uses the unit eigenfunctions of Σ_YX:\n\nmax_{f∈HX, g∈HY, ‖f‖_HX = ‖g‖_HY = 1} ⟨g, Σ_YX f⟩_HY = max_{f∈HX, g∈HY, ‖f‖_HX = ‖g‖_HY = 1} Cov[f(X), g(Y)].\n\nThe statistical convergence of COCO has been proved in [8]. Instead of normalizing the covariance by the variances, COCO normalizes it by the RKHS norms of f and g. Kernel CCA is a more direct nonlinear extension of CCA than COCO. COCO tends to find functions with large variance for f(X) and g(Y), which may not be the most correlated features.
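The finite-rank operator V_YX^{(n)} of Eq. (10) also reduces to Gram matrices. A minimal NumPy sketch follows; it assumes the standard reduction that the nonzero singular values of V_YX^{(n)} are the square roots of the nonzero eigenvalues of R̃_X R̃_Y, where R̃ = K̃ (K̃ + n ε_n I)^{-1} for the centered Gram matrix K̃. The Gaussian kernel and all function names are illustrative, not from the paper.\n\n```python\nimport numpy as np\n\ndef rbf_gram(X, sigma=1.0):\n    # Gaussian kernel Gram matrix (illustrative kernel choice)\n    sq = np.sum(X**2, axis=1)\n    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T\n    return np.exp(-d2 / (2.0 * sigma**2))\n\ndef centered(K):\n    n = K.shape[0]\n    H = np.eye(n) - 1.0 / n\n    return H @ K @ H\n\ndef nocco_top_correlation(Kx, Ky, eps):\n    # R = Kc (Kc + n*eps*I)^{-1}: spectrum mu/(mu + n*eps) lies in [0, 1)\n    n = Kx.shape[0]\n    Rx = np.linalg.solve(centered(Kx) + n * eps * np.eye(n), centered(Kx))\n    Ry = np.linalg.solve(centered(Ky) + n * eps * np.eye(n), centered(Ky))\n    # largest singular value of V^(n): sqrt of largest eigenvalue of Rx Ry\n    lam = np.linalg.eigvals(Rx @ Ry).real.max()\n    return float(np.sqrt(max(lam, 0.0)))\n\nrng = np.random.default_rng(1)\nn = 100\nX = rng.normal(size=(n, 1))\nY = np.sin(X) + 0.1 * rng.normal(size=(n, 1))   # nonlinear dependence\neps_n = n ** (-0.25)   # eps_n -> 0 while n^{1/3} eps_n -> infinity\nrho = nocco_top_correlation(rbf_gram(X), rbf_gram(Y), eps_n)\n```\n\nAny schedule ε_n = c·n^{-a} with 0 < a < 1/3 satisfies the decay condition (12) of Section 3; the example uses ε_n = n^{-1/4}, for which n^{1/3} ε_n = n^{1/12} → ∞.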
On the other hand, kernel CCA may encounter situations where it finds functions with moderately large covariance but very small variance for f(X) or g(Y), since Σ_XX and Σ_YY can have arbitrarily small eigenvalues. A possible compromise is to use the eigenfunctions φ and ψ of V_YX, the NOrmalized Cross-Covariance Operator (NOCCO). While the statistical meaning of NOCCO is not as direct as that of kernel CCA, it incorporates the normalization by Σ_XX and Σ_YY. We will establish the convergence of kernel CCA and NOCCO in Section 3.\n\n3 Main theorems: convergence of kernel CCA and NOCCO\n\nWe show the convergence of NOCCO in the RKHS norm, and of kernel CCA in the L² sense. The results may easily be extended to the convergence of the eigenspace corresponding to the m-th largest eigenvalue.\n\nTheorem 1. Let (ε_n)_{n=1}^∞ be a sequence of positive numbers such that\n\nlim_{n→∞} ε_n = 0,   lim_{n→∞} n^{1/3} ε_n = ∞.  (12)\n\nAssume V_YX is compact, and the eigenspaces given by Eq. (8) are one-dimensional. Let φ, ψ, φ_n, and ψ_n be the unit eigenfunctions of Eqs. (8) and (10). Then\n\n|⟨φ_n, φ⟩_HX| → 1,   |⟨ψ_n, ψ⟩_HY| → 1\n\nin probability, as n goes to infinity.\n\nTheorem 2. Let (ε_n)_{n=1}^∞ be a sequence of positive numbers which satisfies Eq. (12). Assume that φ and ψ are included in R(Σ_XX) and R(Σ_YY), respectively, and that V_YX is compact. Then, for f, g, f_n, and g_n in Eqs. (9) and (11), we have\n\n‖(f_n − E_X[f_n(X)]) − (f − E_X[f(X)])‖_{L²(P_X)} → 0,\n‖(g_n − E_Y[g_n(Y)]) − (g − E_Y[g(Y)])‖_{L²(P_Y)} → 0\n\nin probability, as n goes to infinity.\n\nThe convergence of NOCCO in the RKHS norm is a very strong result. If kX and kY are continuous and bounded, the RKHS norm is stronger than the uniform norm on the continuous functions. In such cases, Theorem 1 implies φ_n and ψ_n converge uniformly to φ and ψ, respectively.
This uniform convergence is useful in practice, because in many applications the function value at each point is important.\n\nFor any complete orthonormal systems (CONS) {φ_i}_{i=1}^∞ of HX and {ψ_i}_{i=1}^∞ of HY, the compactness assumption on V_YX requires that the correlation of Σ_XX^{-1/2} φ_i(X) and Σ_YY^{-1/2} ψ_i(Y) decay to zero as i → ∞. This is not necessarily satisfied in general. A trivial example is the case of variables with Y = X, in which V_YX = I is not compact. In this case, NOCCO is solved by an arbitrary function. Moreover, kernel CCA does not have solutions if Σ_XX has arbitrarily small eigenvalues.\n\nLeurgans et al. [10] discuss CCA on curves, which are represented by stochastic processes on an interval, and use the Sobolev space of functions with square integrable second derivative. Since the Sobolev space is an RKHS, their method is an example of kernel CCA. They also show the convergence of estimators under the condition n^{1/2} ε_n → ∞. Although the proof can be extended to a general RKHS, convergence is measured by the correlation,\n\n⟨f_n, Σ_XX f⟩_HX / ( ⟨f_n, Σ_XX f_n⟩_HX^{1/2} ⟨f, Σ_XX f⟩_HX^{1/2} ) → 1,\n\nwhich is weaker than the L² convergence in Theorem 2. In fact, using ⟨f, Σ_XX f⟩_HX = 1, it is easy to derive the above convergence from Theorem 2. On the other hand, convergence of the correlation does not necessarily imply ⟨(f_n − f), Σ_XX (f_n − f)⟩_HX → 0. From the equality\n\n⟨(f_n − f), Σ_XX (f_n − f)⟩_HX = ( ⟨f_n, Σ_XX f_n⟩_HX^{1/2} − ⟨f, Σ_XX f⟩_HX^{1/2} )²\n  + 2 { 1 − ⟨f_n, Σ_XX f⟩_HX / ( ‖Σ_XX^{1/2} f_n‖_HX ‖Σ_XX^{1/2} f‖_HX ) } ‖Σ_XX^{1/2} f_n‖_HX ‖Σ_XX^{1/2} f‖_HX,\n\nwe require ⟨f_n, Σ_XX f_n⟩_HX → ⟨f, Σ_XX f⟩_HX = 1 in order to guarantee that the left hand side converges to zero. However, with the normalization ⟨f_n, (Σ_XX^{(n)} + ε_n I) f_n⟩_HX = 1, convergence of ⟨f_n, Σ_XX f_n⟩_HX is not clear. We use the stronger assumption n^{1/3} ε_n → ∞ to prove ⟨(f_n − f), Σ_XX (f_n − f)⟩_HX → 0 in Theorem 2.\n\n4 Outline of the proof of the main theorems\n\nWe show only the outline of the proof in this paper.
See [6] for the details.\n\n4.1 Preliminary lemmas\n\nWe introduce some definitions for our proofs. Let H1 and H2 be Hilbert spaces. An operator T : H1 → H2 is called Hilbert-Schmidt if Σ_{i=1}^∞ ‖T φ_i‖²_H2 < ∞ for a CONS {φ_i}_{i=1}^∞ of H1. For Hilbert-Schmidt operators, the Hilbert-Schmidt norm and inner product are defined as\n\n‖T‖²_HS = Σ_{i=1}^∞ ‖T φ_i‖²_H2,   ⟨T1, T2⟩_HS = Σ_{i=1}^∞ ⟨T1 φ_i, T2 φ_i⟩_H2.\n\nThese definitions are independent of the CONS, and obviously ‖T‖ ≤ ‖T‖_HS. For more details, see [5] and [8]. For a Hilbert space F, a Borel measurable map F : Ω → F from a measurable space Ω is called a random element in F. For a random element F in F with E‖F‖ < ∞, there exists a unique element E[F] ∈ F, called the expectation of F, such that\n\n⟨E[F], g⟩_F = E[⟨F, g⟩_F]   (g ∈ F)\n\nholds. If random elements F and G in F satisfy E[‖F‖²] < ∞ and E[‖G‖²] < ∞, then ⟨F, G⟩_F is integrable. Moreover, if F and G are independent, we have\n\nE[⟨F, G⟩_F] = ⟨E[F], E[G]⟩_F.  (13)\n\nIt is easy to see that under the condition Eq. (1), the random element kX(·, X) ⊗ kY(·, Y) in the direct product HX ⊗ HY is integrable, i.e. E[‖kX(·, X) ⊗ kY(·, Y)‖_{HX⊗HY}] < ∞. Combining Lemma 1 in [8] and Eq. (13), we obtain the following lemma.\n\nLemma 3. The cross-covariance operator Σ_YX is Hilbert-Schmidt, and\n\n‖Σ_YX‖²_HS = ‖E[ (kX(·, X) − E_X[kX(·, X)]) ⊗ (kY(·, Y) − E_Y[kY(·, Y)]) ]‖²_{HX⊗HY}.\n\nThe law of large numbers implies lim_{n→∞} ⟨g, Σ_YX^{(n)} f⟩_HY = ⟨g, Σ_YX f⟩_HY for each f and g in probability. The following lemma shows a much stronger uniform result.\n\nLemma 4. ‖Σ_YX^{(n)} − Σ_YX‖_HS = O_p(n^{-1/2})   (n → ∞).\n\nProof. Write for simplicity F = kX(·, X) − E_X[kX(·, X)], G = kY(·, Y) − E_Y[kY(·, Y)], Fi = kX(·, Xi) − E_X[kX(·, X)], and Gi = kY(·, Yi) − E_Y[kY(·, Y)]. Then F, F1, ..., Fn are i.i.d. random elements in HX, and a similar property also holds for G, G1, ..., Gn. Lemma 3 and the same argument as its proof imply\n\n‖Σ_YX^{(n)}‖²_HS = ‖ (1/n) Σ_{i=1}^n (Fi − (1/n) Σ_{j=1}^n Fj) ⊗ (Gi − (1/n) Σ_{j=1}^n Gj) ‖²_{HX⊗HY},\n⟨Σ_YX^{(n)}, Σ_YX⟩_HS = ⟨ E[F ⊗ G], (1/n) Σ_{i=1}^n (Fi − (1/n) Σ_{j=1}^n Fj) ⊗ (Gi − (1/n) Σ_{j=1}^n Gj) ⟩_{HX⊗HY}.\n\nFrom these equations, we have\n\n‖Σ_YX^{(n)} − Σ_YX‖²_HS = ‖ (1/n)(1 − 1/n) Σ_{i=1}^n Fi ⊗ Gi − (1/n²) Σ_{i<j} (Fi ⊗ Gj + Fj ⊗ Gi) − E[F ⊗ G] ‖²_{HX⊗HY}.\n\nUsing E[Fi] = E[Gi] = 0 and E[⟨Fi ⊗ Gj, Fk ⊗ Gl⟩] = 0 for i ≠ j, {k, l} ≠ {i, j}, we have\n\nE‖Σ_YX^{(n)} − Σ_YX‖²_HS = (1/n) E‖F ⊗ G‖²_{HX⊗HY} − (1/n) ‖E[F ⊗ G]‖²_{HX⊗HY} + O(1/n²).\n\nThe proof is completed by Chebyshev's inequality.\n\nThe following two lemmas are essential parts of the proof of the main theorems.\n\nLemma 5. Let (ε_n) be positive numbers such that ε_n → 0 (n → ∞). Then\n\n‖V_YX^{(n)} − (Σ_YY + ε_n I)^{-1/2} Σ_YX (Σ_XX + ε_n I)^{-1/2}‖ = O_p(ε_n^{-3/2} n^{-1/2}).\n\nProof. The operator on the left hand side is equal to\n\n{ (Σ_YY^{(n)} + ε_n I)^{-1/2} − (Σ_YY + ε_n I)^{-1/2} } Σ_YX^{(n)} (Σ_XX^{(n)} + ε_n I)^{-1/2}\n+ (Σ_YY + ε_n I)^{-1/2} ( Σ_YX^{(n)} − Σ_YX ) (Σ_XX^{(n)} + ε_n I)^{-1/2}\n+ (Σ_YY + ε_n I)^{-1/2} Σ_YX { (Σ_XX^{(n)} + ε_n I)^{-1/2} − (Σ_XX + ε_n I)^{-1/2} }.  (14)\n\nFrom the equality A^{-1/2} − B^{-1/2} = B^{-3/2}(B^{3/2} − A^{3/2})A^{-1/2} + B^{-3/2}(A − B), applied with A = Σ_YY^{(n)} + ε_n I and B = Σ_YY + ε_n I, the first term in Eq. (14) is equal to\n\n(Σ_YY + ε_n I)^{-3/2} { (Σ_YY + ε_n I)^{3/2} − (Σ_YY^{(n)} + ε_n I)^{3/2} } (Σ_YY^{(n)} + ε_n I)^{-1/2} Σ_YX^{(n)} (Σ_XX^{(n)} + ε_n I)^{-1/2}\n+ (Σ_YY + ε_n I)^{-3/2} ( Σ_YY^{(n)} − Σ_YY ) Σ_YX^{(n)} (Σ_XX^{(n)} + ε_n I)^{-1/2}.\n\nFrom ‖(Σ_YY + ε_n I)^{-3/2}‖ ≤ ε_n^{-3/2}, ‖(Σ_YY^{(n)} + ε_n I)^{-1/2} Σ_YX^{(n)} (Σ_XX^{(n)} + ε_n I)^{-1/2}‖ ≤ 1, ‖Σ_YX^{(n)} (Σ_XX^{(n)} + ε_n I)^{-1/2}‖ ≤ ‖Σ_YY^{(n)}‖^{1/2}, and Lemma 7, the norm of the above operator is upper-bounded by\n\nε_n^{-3/2} ( 3 ( max{ ‖Σ_YY^{(n)}‖, ‖Σ_YY‖ } + ε_n )^{1/2} + ‖Σ_YY^{(n)}‖^{1/2} ) ‖Σ_YY^{(n)} − Σ_YY‖.\n\nA similar bound applies to the third term of Eq. (14), and the second term is upper-bounded by (1/ε_n) ‖Σ_YX^{(n)} − Σ_YX‖. Thus, Lemma 4 completes the proof.\n\nLemma 6. Assume V_YX is compact. Then, for a sequence ε_n → 0,\n\n‖(Σ_YY + ε_n I)^{-1/2} Σ_YX (Σ_XX + ε_n I)^{-1/2} − V_YX‖ → 0   (n → ∞).\n\nProof. It suffices to prove that { (Σ_YY + ε_n I)^{-1/2} − Σ_YY^{-1/2} } Σ_YX (Σ_XX + ε_n I)^{-1/2} and Σ_YY^{-1/2} Σ_YX { (Σ_XX + ε_n I)^{-1/2} − Σ_XX^{-1/2} } converge to zero. Using Σ_YX = Σ_YY^{1/2} V_YX Σ_XX^{1/2}, the former is equal to\n\n{ (Σ_YY + ε_n I)^{-1/2} Σ_YY^{1/2} − I } V_YX Σ_XX^{1/2} (Σ_XX + ε_n I)^{-1/2},  (15)\n\nand since ‖Σ_XX^{1/2} (Σ_XX + ε_n I)^{-1/2}‖ ≤ 1, it suffices to show ‖{ (Σ_YY + ε_n I)^{-1/2} Σ_YY^{1/2} − I } V_YX‖ → 0. Note that R(V_YX) is included in the closure of R(Σ_YY), as remarked in Section 2.2. Let v = Σ_YY^{1/2} u be an arbitrary element in R(V_YX) ∩ R(Σ_YY^{1/2}). We have\n\n‖{ (Σ_YY + ε_n I)^{-1/2} Σ_YY^{1/2} − I } v‖_HY = ‖(Σ_YY + ε_n I)^{-1/2} Σ_YY^{1/2} { Σ_YY^{1/2} − (Σ_YY + ε_n I)^{1/2} } u‖_HY ≤ ‖{ Σ_YY^{1/2} − (Σ_YY + ε_n I)^{1/2} } u‖_HY.\n\nSince (Σ_YY + ε_n I)^{1/2} → Σ_YY^{1/2} in norm, we obtain\n\n‖{ (Σ_YY + ε_n I)^{-1/2} Σ_YY^{1/2} − I } v‖ → 0   (n → ∞)  (16)\n\nfor all v ∈ R(V_YX) ∩ R(Σ_YY^{1/2}). Because V_YX is compact, Lemma 8 in the Appendix shows Eq. (15) converges to zero. The convergence of the second norm is similar.\n\n4.2 Proof of the main theorems\n\nProof of Thm. 1. This follows from Lemmas 5, 6, and Lemma 9 in the Appendix.\n\nProof of Thm. 2. We show only the convergence of f_n. Note that the left hand side of the first claim of Theorem 2 is equal to ‖Σ_XX^{1/2} (f_n − f)‖_HX. W.l.o.g., we can assume φ_n → φ in HX. From\n\n‖Σ_XX^{1/2} (f_n − f)‖²_HX = ‖Σ_XX^{1/2} f_n‖²_HX − 2 ⟨φ, Σ_XX^{1/2} f_n⟩_HX + ‖φ‖²_HX,\n\nit suffices to show Σ_XX^{1/2} f_n converges to φ in probability. We have\n\n‖Σ_XX^{1/2} f_n − φ‖_HX ≤ ‖Σ_XX^{1/2} { (Σ_XX^{(n)} + ε_n I)^{-1/2} − (Σ_XX + ε_n I)^{-1/2} } φ_n‖_HX\n+ ‖Σ_XX^{1/2} (Σ_XX + ε_n I)^{-1/2} (φ_n − φ)‖_HX + ‖Σ_XX^{1/2} (Σ_XX + ε_n I)^{-1/2} φ − φ‖_HX.\n\nUsing the same argument as the bound on the first term in Eq. (14), the first term on the R.H.S. of the above inequality is shown to converge to zero. The convergence of the second term is obvious. Using the assumption φ ∈ R(Σ_XX), the same argument as the proof of Eq. (16) applies to the third term, which completes the proof.\n\n5 Concluding remarks\n\nWe have established the statistical convergence of kernel CCA and NOCCO, showing that the finite sample estimators of the nonlinear mappings converge to the desired population functions. This convergence is proved in the RKHS norm for NOCCO, and in the L² norm for kernel CCA. These results give a theoretical justification for using the empirical estimates of NOCCO and kernel CCA in practice. We have also derived a sufficient condition, ε_n → 0 with n^{1/3} ε_n → ∞, for the decay of the regularization coefficient ε_n, which ensures the convergence described above.
As [10] suggests, the order of the sufficient condition seems to depend on the function norm used to measure convergence. An interesting question is whether the order n^{1/3} ε_n → ∞ can be improved for convergence in the L² or RKHS norm.\n\nAnother question that remains to be addressed is when to use kernel CCA, COCO, or NOCCO in practice. The answer probably depends on the statistical properties of the data. It might consequently be helpful to determine the relation between the spectral properties of the data distribution and the solutions of these methods.\n\nAcknowledgements\nThis work is partially supported by KAKENHI 15700241 and the Inamori Foundation.\n\nReferences\n[1] S. Akaho. A kernel method for canonical correlation analysis. Proc. Intern. Meeting of the Psychometric Society (IMPS2001), 2001.\n[2] N. Aronszajn. Theory of reproducing kernels. Trans. American Mathematical Society, 69(3):337-404, 1950.\n[3] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Machine Learning Research, 3:1-48, 2002.\n[4] C. R. Baker. Joint measures and cross-covariance operators. Trans. American Mathematical Society, 186:273-289, 1973.\n[5] N. Dunford and J. T. Schwartz. Linear Operators, Part II. Interscience, 1963.\n[6] K. Fukumizu, F. R. Bach, and A. Gretton. Consistency of kernel canonical correlation. Research Memorandum 942, Institute of Statistical Mathematics, 2005.\n[7] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Machine Learning Research, 5:73-99, 2004.\n[8] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. Tech Report 140, Max-Planck-Institut für biologische Kybernetik, 2005.\n[9] A. Gretton, A. Smola, O. Bousquet, R. Herbrich, B. Schölkopf, and N. Logothetis. Behaviour and convergence of the constrained covariance. Tech Report 128, Max-Planck-Institut für biologische Kybernetik, 2004.\n[10] S. 
Leurgans, R. Moyeed, and B. Silverman. Canonical correlation analysis when the data are curves. J. Royal Statistical Society, Series B, 55(3):725-740, 1993.\n[11] T. Melzer, M. Reiter, and H. Bischof. Nonlinear feature extraction using generalized canonical correlation analysis. Proc. Intern. Conf. Artificial Neural Networks (ICANN2001), 353-360, 2001.\n\nA Lemmas used in the proofs\n\nWe list the lemmas used in Section 4. See [6] for the proofs.\n\nLemma 7. Suppose A and B are positive self-adjoint operators on a Hilbert space such that 0 ≤ A ≤ λI and 0 ≤ B ≤ λI hold for a positive constant λ. Then\n\n‖A^{3/2} − B^{3/2}‖ ≤ 3 λ^{1/2} ‖A − B‖.\n\nLemma 8. Let H1 and H2 be Hilbert spaces, and H0 be a dense linear subspace of H2. Suppose A_n and A are bounded operators on H2, and B is a compact operator from H1 to H2, such that A_n u → A u for all u ∈ H0, and sup_n ‖A_n‖ ≤ M for some M > 0. Then A_n B converges to A B in norm.\n\nLemma 9. Let A be a compact positive operator on a Hilbert space H, and A_n (n ∈ N) be bounded positive operators on H such that A_n converges to A in norm. Assume the eigenspace of A corresponding to the largest eigenvalue is one-dimensional and spanned by a unit eigenvector φ, and the maximum of the spectrum of A_n is attained by a unit eigenvector φ_n. Then we have |⟨φ_n, φ⟩_H| → 1 as n → ∞.\n", "award": [], "sourceid": 2891, "authors": [{"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}