{"title": "Transelliptical Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 359, "page_last": 367, "abstract": null, "full_text": "Transelliptical Component Analysis\n\nFang Han\n\nDepartment of Biostatistics\nJohns Hopkins University\nBaltimore, MD 21210\nfhan@jhsph.edu\n\nHan Liu\n\nDepartment of Operations Research and Financial Engineering\nPrinceton University, NJ 08544\nhanliu@princeton.edu\n\nAbstract\n\nWe propose a high dimensional semiparametric scale-invariant principal component analysis, named TCA, by utilizing the natural connection between the elliptical distribution family and the principal component analysis. The elliptical distribution family includes many well-known multivariate distributions such as the multivariate Gaussian, t and logistic, and it was extended to the meta-elliptical family by Fang et al. (2002) using copula techniques. In this paper we extend the meta-elliptical distribution family to an even larger family, called transelliptical. We prove that TCA can obtain a near-optimal s\u221a(log d/n) estimation consistency rate in recovering the leading eigenvector of the latent generalized correlation matrix under the transelliptical distribution family, even if the distributions are very heavy-tailed, have infinite second moments, do not have densities and possess arbitrarily continuous marginal distributions. A feature selection result with explicit rate is also provided. TCA is further implemented in both numerical simulations and large-scale stock data to illustrate its empirical usefulness. Both theories and experiments confirm that TCA can achieve model flexibility, estimation accuracy and robustness at almost no cost.\n\n1 Introduction\nGiven x1, . . . 
, xn \u2208 Rd as n i.i.d realizations of a random vector X \u2208 Rd with population covariance matrix \u03a3 and correlation matrix \u03a30, the Principal Component Analysis (PCA) aims at recovering the top m leading eigenvectors u1, . . . , um of \u03a3. In practice, \u03a3 is unknown and the top m leading eigenvectors u\u03021, . . . , u\u0302m of the Pearson sample covariance matrix are obtained as the estimators. However, because the PCA is well-known to be scale-variant, meaning that changing the measurement scale of variables will make the estimators different, the PCA conducted on the sample correlation matrix is also regular in literatures [2]. It aims at recovering the top m leading eigenvectors \u03b81, . . . , \u03b8m of \u03a30 using the top m leading eigenvectors \u03b8\u03021, . . . , \u03b8\u0302m of the Pearson sample correlation matrix. Because \u03a30 is scale-invariant, we call the PCA aiming at recovering the eigenvectors of \u03a30 the scale-invariant PCA.\nIn high dimensional settings, when d scales with n, it has been discussed in [14] that u\u03021 and \u03b8\u03021 are generally not consistent estimators of u1 and \u03b81. For any two vectors v1, v2 \u2208 Rd, denote the angle between v1 and v2 by \u2220(v1, v2). [14] proved that \u2220(u1, u\u03021) and \u2220(\u03b81, \u03b8\u03021) do not converge to zero. Therefore, it is commonly assumed that \u03b81 = (\u03b811, . . . , \u03b81d)T is sparse, meaning that card(supp(\u03b81)) := card({\u03b81j : \u03b81j \u2260 0}) = s < n. This results in a variety of sparse PCA procedures. Here we note that supp(uj) = supp(\u03b8j), for j = 1, . . . , d.\nThe elliptical distributions are of special interest in Principal Component Analysis. The study of elliptical distributions and their extensions have been launched in statistics recently by [4]. The elliptical distributions can be characterized by their stochastic representations [5]. A random vector Z = (Z1, . . . 
, Zd)T is said to follow an elliptical distribution or be elliptically distributed with parameters \u00b5, \u03a3 \u2ab0 0, and rank(\u03a3) = q, if it admits the stochastic representation: Z = \u00b5 + \u03beAU, where \u00b5 \u2208 Rd, \u03be \u2208 R and U \u2208 Rq are independent random variables, \u03be \u2265 0, U is uniformly distributed on the unit sphere in Rq, and A \u2208 Rd\u00d7q is a fixed matrix such that AAT = \u03a3. We call \u03be the generating variable. The density of Z does not necessarily exist. Elliptical distribution family includes a variety of famous multivariate distributions: multivariate Gaussian, multivariate Cauchy, Student\u2019s t, logistic, Kotz, symmetric Pearson type-II and type-VII distributions. We refer to [3, 5] and [4] for more details.\n[4] introduce the term meta-elliptical distribution in extending the continuous elliptical distributions whose densities exist to a wider class of distributions with densities existing. The construction of the meta-elliptical distributions is based on the copula technique and it was initially introduced by [25]. In particular, when the latent elliptical distribution is the multivariate Gaussian, we have the meta-Gaussian or the nonparanormal distributions introduced by [16] and [19].\nThe elliptical distribution is of special interest in Principal Component Analysis (PCA). It has been shown in a variety of literatures [27, 11, 22, 12, 24] that the PCA conducted on elliptical distributions shares a number of good properties enjoyed by the PCA conducted on the Gaussian distribution. In particular, [11] show that with regard to a range of hypothesis relevant to PCA, tests based on a multivariate Gaussian assumption have the identical power for all elliptical distributions even without second moments. 
We will utilize this connection to construct a new model in this paper.\nIn this paper, a new high dimensional scale-invariant principal component analysis approach is proposed, named Transelliptical Component Analysis (TCA). Firstly, to achieve both the estimation accuracy and model flexibility, we build the model of TCA on the transelliptical distributions. A random vector X = (X1, . . . , Xd)T is said to follow a transelliptical distribution if there exists a set of univariate strictly monotone functions f = {fj}_{j=1}^d such that f(X) := (f1(X1), . . . , fd(Xd))T follows a continuous elliptical distribution with parameters \u00b5 = 0 and \u03a30 = [\u03a30_jk] \u2ab0 0. Here diag(\u03a30) = 1. Transelliptical distributions do not necessarily possess densities and are strict extensions to the meta-elliptical distributions defined in [4]. TCA aims at recovering the top m leading eigenvectors \u03b81, . . . , \u03b8m of \u03a30.\nSecondly, to estimate \u03a30 robustly and efficiently, instead of estimating the transformation functions {f\u0302j}_{j=1}^d of {fj}_{j=1}^d as [19] did, realizing that {fj}_{j=1}^d preserve the ranks of the data, we utilize the nonparametric rank-based correlation coefficient estimator, Kendall\u2019s tau, to estimate \u03a30. We prove that even though the generating variable \u03be is changing and marginal distributions are arbitrarily continuous, Kendall\u2019s tau correlation matrix approximates \u03a30 in a parametric rate OP(\u221a(log d/n)). This key observation makes Kendall\u2019s tau a better estimator than Pearson sample correlation matrix with regard to a much larger distribution family than the Gaussian.\nThirdly, in terms of methodology and theory, we analyze the general case that X follows a transelliptical distribution and \u03b81 is sparse. Here \u03b81 is the leading eigenvector of \u03a30. We obtain the TCA estimator \u03b8\u0303\u22171 of \u03b81 utilizing the Kendall\u2019s tau correlation matrix. 
We prove that the TCA can obtain a fast convergence rate in terms of parameter estimation and is of the rate sin \u2220(\u03b81, \u03b8\u0303\u221e) = OP(s\u221a(log d/n)), where \u03b8\u0303\u221e is the estimator TCA obtains. A feature selection consistency result with explicit rate is also provided.\n2 Background\nWe start with notations: Let M = [Mjk] \u2208 Rd\u00d7d and v = (v1, ..., vd)T \u2208 Rd. Let v\u2019s subvector with entries indexed by I be denoted by vI, M\u2019s submatrix with rows indexed by I and columns indexed by J be denoted by MIJ. Let MI\u00b7 and M\u00b7J be the submatrix of M with rows in I and all columns, and the submatrix of M with columns in J and all rows. For 0 < q < \u221e, we define the \u21130, \u2113q and \u2113\u221e vector norm as\n\n\u2016v\u20160 := card(supp(v)), \u2016v\u2016q := (\u2211_{i=1}^d |vi|^q)^{1/q} and \u2016v\u2016\u221e := max_{1\u2264i\u2264d} |vi|.\n\nWe define the matrix \u2113max norm as the elementwise maximum value: \u2016M\u2016max := max{|Mij|} and the \u2113\u221e norm as \u2016M\u2016\u221e := max_{1\u2264i\u2264d} \u2211_{j=1}^d |Mij|. Let \u039bj(M) be the j-th largest eigenvalue of M. In particular, \u039bmin(M) := \u039bd(M) and \u039bmax(M) := \u039b1(M) are the smallest and largest eigenvalues of M. The vectorized matrix of M, denoted by vec(M), is defined as: vec(M) := (MT\u00b71, . . . , MT\u00b7d)T. Let Sd\u22121 := {v \u2208 Rd : \u2016v\u20162 = 1} be the d-dimensional unit sphere. The sign =d denotes that the two sides of the equality have the same distributions. For any two vectors a, b \u2208 Rd and any two squared matrices A, B \u2208 Rd\u00d7d, denote the inner product of a and b, A and B by\n\n\u27e8a, b\u27e9 := aT b and \u27e8A, B\u27e9 := Tr(AT B).\n\n2.1 Elliptical and Transelliptical Distributions\nThis section is devoted to a brief discussion of elliptical and transelliptical distributions. In the sequel, to be clear, a random vector X = (X1, . . . 
, Xd)T is said to be continuous if the marginal distribution functions are all continuous.\n2.1.1 Elliptical Distributions\nIn this section we shall firstly provide a definition of the elliptical distributions following [5].\nDefinition 2.1. Given \u00b5 \u2208 Rd and \u03a3 \u2208 Rd\u00d7d, where rank(\u03a3) = q \u2264 d, a random vector Z = (Z1, . . . , Zd)T is said to have an elliptical distribution or is elliptically distributed with parameters \u00b5 and \u03a3, if and only if Z has a stochastic representation: Z =d \u00b5 + \u03beAU, where \u00b5 \u2208 Rd, A \u2208 Rd\u00d7q, AAT = \u03a3, \u03be \u2265 0 is a random variable independent of U, U \u2208 Sq\u22121 is uniformly distributed in the unit sphere in Rq. In this setting we denote by Z \u223c ECd(\u00b5, \u03a3, \u03be).\nA random variable in R with continuous marginal distribution function does not necessarily possess density. A well-known set of examples is the Cantor distribution, whose support set is the Cantor set. We refer to [7] for more discussions on this phenomenon. \u03a3 is symmetric and positive semi-definite, but not necessarily positive definite.\nProposition 2.1. A random vector Z = (Z1, . . . , Zd)T has the stochastic representation Z \u223c ECd(\u00b5, \u03a3, \u03be), if and only if Z has the characteristic function exp(itT\u00b5)\u03c6(tT\u03a3t), where \u03c6 is a properly-defined characteristic function. We denote by Z \u223c ECd(\u00b5, \u03a3, \u03c6). If \u03be is absolutely continuous and \u03a3 is non-singular, then the density of Z exists and is of the form: pZ(z) = |\u03a3|^{\u22121/2} g((z \u2212 \u00b5)T \u03a3^{\u22121} (z \u2212 \u00b5)), where g : [0, \u221e) \u2192 [0, \u221e). We denote by Z \u223c ECd(\u00b5, \u03a3, g).\nA proof can be found in page 42 of [5]. 
When the density exists, each of \u03be, \u03c6 and g uniquely determines the other two. The relationship among \u03be, \u03c6 and g is described in Theorem 2.2 and Theorem 2.9 of [5]. The next proposition states that \u03a3, \u03c6, \u03be and A are not unique.\nProposition 2.2 (Theorem 2.15 of [5]). (i) If Z = \u00b5 + \u03beAU and Z = \u00b5\u2217 + \u03be\u2217A\u2217U\u2217, where A \u2208 Rd\u00d7q and A\u2217 \u2208 Rd\u00d7q, Z is continuous, then there exists a constant c > 0 such that \u00b5\u2217 = \u00b5, A\u2217A\u2217T = cAAT, \u03be\u2217 = c^{\u22121/2}\u03be. (ii) If Z \u223c ECd(\u00b5, \u03a3, \u03c6) and Z \u223c ECd(\u00b5\u2217, \u03a3\u2217, \u03c6\u2217), Z is continuous, then there exists a constant c > 0 such that \u00b5\u2217 = \u00b5, \u03a3\u2217 = c\u03a3, \u03c6\u2217(\u00b7) = \u03c6(c^{\u22121}\u00b7).\nThe next proposition discusses the cases where (\u00b5, \u03a3, \u03be) is identifiable for Z.\nProposition 2.3. If Z \u223c ECd(\u00b5, \u03a3, \u03be) is continuous with rank(\u03a3) = q, then (1) P(\u03be = 0) = 0; (2) \u03a3ii > 0 for i \u2208 {1, . . . , d}; (3) (\u00b5, \u03a3, \u03be) is identifiable for Z under the constraint that max(diag(\u03a3)) = 1.\nWe define \u03a30 = [\u03a30_jk] with \u03a30_jk = \u03a3jk/\u221a(\u03a3jj\u03a3kk) to be the generalized correlation matrix of Z. \u03a30 is the correlation matrix of Z when Z\u2019s second moment exists and still reflects the rank dependency even when Z has infinite second moment [13].\n2.1.2 Transelliptical Distributions\nTo extend the elliptical distribution, we firstly define two sets of symmetric matrices: R+_d = {\u03a3 \u2208 Rd\u00d7d : \u03a3T = \u03a3, diag(\u03a3) = 1, \u03a3 \u227b 0}; R_d = {\u03a3 \u2208 Rd\u00d7d : \u03a3T = \u03a3, diag(\u03a3) = 1, \u03a3 \u2ab0 0}.\nDefinition 2.2. A random vector X = (X1, . . . , Xd)T with continuous marginal distribution functions F1, . . . 
, Fd and density existing is said to follow a meta-elliptical distribution if and only if there exists a continuous elliptically distributed random vector Z \u223c ECd(0, \u03a30, g) with the marginal distribution function Qg and \u03a30 \u2208 R+_d, such that (Q_g^{\u22121}(F1(X1)), . . . , Q_g^{\u22121}(Fd(Xd)))T =d Z.\nIn this paper, we generalize the meta-elliptical distribution family to a broader class, named the transelliptical. The transelliptical distributions do not assume that densities exist for both X and Z and are therefore strict extensions to meta-elliptical distributions.\nDefinition 2.3. A random vector X = (X1, . . . , Xd)T is said to follow a transelliptical distribution if and only if there exists a set of strictly monotone functions f = {fj}_{j=1}^d and a latent continuous elliptically distributed random vector Z \u223c ECd(0, \u03a30, \u03be) with \u03a30 \u2208 R_d, such that (f1(X1), . . . , fd(Xd))T =d Z. We call such X \u223c TEd(\u03a30, \u03be; f1, . . . , fd) and \u03a30 the latent generalized correlation matrix.\nProposition 2.4. If X follows a meta-elliptical distribution, in other words, X possesses density and has continuous marginal distributions F1, . . . 
, Fd, and there exists a continuous random vector Z \u223c ECd(0, \u03a30, g) such that (Q_g^{\u22121}(F1(X1)), . . . , Q_g^{\u22121}(Fd(Xd)))T =d Z, then we have X \u223c TEd(\u03a30, \u03be; Q_g^{\u22121}(F1), . . . , Q_g^{\u22121}(Fd)).\nTo be more clear, the transelliptical distribution family is strictly larger than the meta-elliptical distribution family in three senses: (i) the generating variable \u03be of the latent elliptical distribution is not necessarily absolutely continuous in transelliptical distributions; (ii) the parameter \u03a30 is strictly enlarged from R+_d to R_d; (iii) the marginal distributions of X do not necessarily possess densities.\nThe term meta-Gaussian (or the nonparanormal) is introduced by [16, 19]. The term meta-elliptical copula is introduced in [6]. This is actually an alternative definition of the meta-elliptical distribution. The term elliptical copula is introduced in [18]. In summary,\n\ntranselliptical \u2283 meta-elliptical = meta-elliptical copula \u2283 elliptical* \u2283 elliptical copula,\ntranselliptical \u2283 meta-Gaussian = nonparanormal.\n\nHere elliptical* represents the elliptical distributions which are continuous and possess densities.\n\n2.2 Latent Correlation Matrix Estimation for Transelliptical Distributions\nWe firstly study the correlation and covariance matrices of elliptical distributions. Given Z \u223c ECd(\u00b5, \u03a3, \u03be), we first explore the relationship between the moments of Z and \u00b5 and \u03a3.\nProposition 2.5. Given Z \u223c ECd(\u00b5, \u03a3, \u03be) with rank(\u03a3) = q and finite second moments and \u03a30 the generalized correlation matrix of Z, we have E(Z) = \u00b5, Var(Z) = (E(\u03be^2)/q) \u03a3, and Cor(Z) = \u03a30.\nWhen the random vector is elliptically distributed with second moment finite, the sample mean and correlation matrices are element-wise consistent estimators of \u00b5 and \u03a30. 
However, the elliptical distributions are generally very heavy-tailed (multivariate t or Cauchy distributions for example), making Pearson sample correlation matrix a bad estimator. When the distribution family is extended to the transelliptical, the Pearson sample correlation matrix is generally no longer an element-wise consistent estimator of \u03a30. A similar \u201cplug-in\u201d idea as [6] works when \u03be is known. In the general case when \u03be is unknown, the \u201cplug-in\u201d idea itself is unavailable.\n3 The TCA\nIn this section we propose the TCA approach. TCA is a two-stage method in estimating the leading eigenvectors of \u03a30. Firstly, we estimate the Kendall\u2019s tau correlation matrix R\u0302. Secondly, we plug R\u0302 into a sparse PCA algorithm.\n\n3.1 Rank-based Measures of Associations\nThe main idea of the TCA is to exploit the Kendall\u2019s tau statistic to estimate the generalized correlation matrix \u03a30 efficiently and robustly. In detail, let X = (X1, . . . , Xd)T be a d-dimensional random vector with marginal distributions F1, . . . , Fd and the joint distributions Fjk for the pair (Xj, Xk). The population Spearman\u2019s rho and Kendall\u2019s tau correlation coefficients are given by\n\n\u03c1(Xj, Xk) = Corr(Fj(Xj), Fk(Xk)),\n\u03c4(Xj, Xk) = P((Xj \u2212 X\u0303j)(Xk \u2212 X\u0303k) > 0) \u2212 P((Xj \u2212 X\u0303j)(Xk \u2212 X\u0303k) < 0),\n\nwhere (X\u0303j, X\u0303k) is an independent copy of (Xj, Xk). In particular, for Kendall\u2019s tau, we have the following theorem, which states an explicit relationship between \u03c4jk and \u03a30_jk given X \u223c TEd(\u03a30, \u03be; f1, . . . , fd), no matter what the generating variable \u03be is. This is a strict extension to [4]\u2019s result on the meta-elliptical distribution family.\nTheorem 3.1. Given X \u223c TEd(\u03a30, \u03be; f1, . . . 
, fd) transelliptically distributed, we have\n\n\u03a30_jk = sin((\u03c0/2) \u03c4(Xj, Xk)). (3.1)\n\nRemark 3.1. Although the conclusion in Theorem 3.1 of [4] is correct, the proof provided is wrong or at least very ambiguous. Theorem 2.22 in [5] builds the result only for one sample statistic and cannot be generalized to the statistic of multiple samples, like the Kendall\u2019s tau or Spearman\u2019s rho. Therefore, we provide a new and clear version here. Detailed proofs can be found in the long version of this paper [8].\nSpearman\u2019s rho depends not only on \u03a3 but also on the generating variable \u03be. When X follows multivariate Gaussian, [17] proves that: \u03c1(Xj, Xk) = (6/\u03c0) arcsin(\u03a30_jk/2). On the other hand, when X \u223c TEd(\u03a30, \u03be; f1, . . . , fd) with \u03be =d 1, [10] proves that: \u03c1(Xj, Xk) = 3(arcsin(\u03a30_jk)/\u03c0) \u2212 4(arcsin(\u03a30_jk)/\u03c0)^3.\nIn estimating \u03c4(Xj, Xk), let x1, . . . , xn be n independent realizations of X, where xi = (xi1, . . . , xid)T. We consider the following rank-based statistic:\n\n\u03c4\u0302jk := (2/(n(n \u2212 1))) \u2211_{1 \u2264 i < i\u2032 \u2264 n} sign((xij \u2212 xi\u2032j)(xik \u2212 xi\u2032k)), if j \u2260 k (3.2)\n\n[...] in estimating. By spectral decomposition, we write: \u03a30 = \u2211_{j=1}^d \u03bbj \u03b8j \u03b8jT [...] > 0 to make \u03a30 non-degenerate. \u03b81, . . . , \u03b8d \u2208 Sd\u22121 are the corresponding eigenvectors of \u03bb1, . . . , \u03bbd. Inspired by the model Md(\u03a30, \u03be, s; f), it is natural to consider the following optimization problem:\n\n\u03b8\u0303\u22171 = arg max_{v \u2208 Rd} vT R\u0302 v, subject to v \u2208 Sd\u22121 \u2229 B0(s), (3.4)\n\nwhere B0(s) := {v \u2208 Rd : \u2016v\u20160 \u2264 s} and R\u0302 is the estimated Kendall\u2019s tau correlation matrix. The corresponding global optimum is denoted by \u03b8\u0303\u22171.\n\n3.2.2 TCA Algorithm\n\nGenerally we can plug in the Kendall\u2019s tau correlation matrix R\u0302 to any sparse PCA algorithm listed above. In this paper, to approximate \u03b81, we consider using the Truncated Power method (TPower) proposed by [28] and [20]. The main idea of the TPower is to utilize the power method, but truncate the vector to an \u21130 ball with radius k in each iteration. Detailed algorithms are provided in the long version of this paper [8]. The final estimator is denoted by \u03b8\u0303\u221e with \u2016\u03b8\u0303\u221e\u20160 = k. 
It will be shown in Section 4 and Section 5 that the Kendall\u2019s tau correlation matrix is a better statistic in estimating the correlation matrix than the Pearson sample correlation matrix in the sense that (i) it enjoys the Gaussian parametric rate in a much larger distribution family, including many distributions with heavy tails; (ii) it is a more robust estimator, i.e. resistant to outliers.\nWe use the iterative deflation method to learn the first k instead of the first one leading eigenvectors, following the discussions of [21, 15, 28, 29]. In detail, a matrix \u0393\u0302 \u2208 Rd\u00d7d deflates a vector v \u2208 Rd and achieves a new matrix \u0393\u0302\u2032: \u0393\u0302\u2032 := (I \u2212 vvT)\u0393\u0302(I \u2212 vvT). In this way, \u0393\u0302\u2032 is orthogonal to v.\n\n4 Theoretical Properties\n\nIn this section the theoretical properties of the TCA estimators are provided. Especially, we are interested in the high dimensional case when d > n.\n\n4.1 Rank-based Correlation Matrix Estimation\n\nThis section is devoted to the concentration result of the Kendall sample correlation matrix R\u0302 to the Pearson correlation matrix \u03a30. The \u2113max convergence rate of R\u0302 is provided in the next theorem.\nTheorem 4.1. Given x1, . . . , xn n independent realizations of X \u223c TEd(\u03a30, \u03be; f1, . . . , fd) and letting R\u0302 be the Kendall tau correlation matrix, we have with probability at least 1 \u2212 d^{\u22125/2},\n\n\u2016R\u0302 \u2212 \u03a30\u2016max \u2264 3\u03c0\u221a(log d/n). (4.1)\n\nProof sketch. 
Theorem 4.1 can be proved by realizing that \u03c4\u0302jk is an unbiased estimator of \u03c4(Xj, Xk) and is a U-statistic with size 2. Hoeffding\u2019s inequality for U-statistic can then be applied to obtain the result. Detailed proofs can be found in the long version of this paper [8].\n\n4.2 TCA Estimators\n\nThis section is devoted to the statement of our main result on the upper bound of the estimated error of the TCA global optimum \u03b8\u0303\u22171 and TPower solver \u03b8\u0303\u221e. We assume that the Model Md(\u03a30, \u03be, s; f) holds and the next theorem provides an upper bound on the angle between the estimated leading eigenvector \u03b8\u0303\u22171 and true leading eigenvector \u03b81.\nTheorem 4.2. Let \u03b8\u0303\u22171 be the global solution to Equation (3.4) and the Model Md(\u03a30, \u03be, s; f) holds. For any two vectors v1 \u2208 Sd\u22121 and v2 \u2208 Sd\u22121, letting |sin \u2220(v1, v2)| = \u221a(1 \u2212 (v1T v2)^2), then we have\n\nP(|sin \u2220(\u03b8\u0303\u22171, \u03b81)| \u2264 (6\u03c0/(\u03bb1 \u2212 \u03bb2)) \u00b7 s\u221a(log d/n)) \u2265 1 \u2212 d^{\u22125/2}. (4.2)\n\nProof sketch. The key idea of the proof is to utilize the \u2113max norm convergence result of R\u0302 to \u03a30. Detailed proofs can be found in the long version of this paper [8].\n\nGenerally, when s and \u03bb1, \u03bb2 do not scale with (n, d), the rate is OP(\u221a(log d/n)), which is the parametric rate [20, 26, 23] obtains. When (n, d) goes to infinity, the two leading eigenvalues \u03bb1 and \u03bb2 will typically go to infinity and will at least be away from zero. Hence, our rate shown in Theorem 4.2 will be usually better than the seemingly more common rate: (6\u03c0\u03bb1/(\u03bb1 \u2212 \u03bb2)) \u00b7 s\u221a(log d/n).\n\nCorollary 4.1 (Feature Selection Consistency of the TCA). Let \u03b8\u0303\u22171 be the global solution to Equation (3.4) and the Model Md(\u03a30, \u03be, s; f) holds. 
Let \u0398 := supp(\u03b81) and \u0398\u0302\u2217 := supp(\u03b8\u0303\u22171). If we further have\n\nmin_{j\u2208\u0398} |\u03b81j| \u2265 (6\u221a2 \u03c0/(\u03bb1 \u2212 \u03bb2)) \u00b7 s\u221a(log d/n),\n\nthen we have P(\u0398\u0302\u2217 = \u0398) \u2265 1 \u2212 d^{\u22125/2}.\n\nProof sketch. The key of the proof is to construct a contradiction given Theorem 4.2 and the condition on the minimum value of |\u03b81|. Detailed proofs can be found in the long version of this paper [8].\n\n5 Experiments\n\nIn this section we investigate the empirical performance of the TCA method. We utilize the TPower algorithm proposed by [28] and the following three methods are considered: (1) Pearson: the classic high dimensional scale-invariant PCA using the Pearson sample correlation matrix of the data; (2) Kendall: the TCA using the Kendall correlation matrix; (3) LatPearson: the classic high dimensional scale-invariant PCA using the Pearson sample correlation matrix of the data drawn from the latent elliptical distribution (perfect without data contamination).\n\n5.1 Numerical Simulations\n\nIn the simulation study we randomly sample n data points from a certain transelliptical distribution TEd(\u03a30, \u03be; f1, . . . , fd). Here we consider the set up of d = 100. To determine the transelliptical distribution, firstly, we derive \u03a30 in the following way: A covariance matrix \u03a3 is firstly synthesized through the eigenvalue decomposition, where the first two eigenvalues are given and the corresponding eigenvectors are pre-specified to be sparse. In detail, let \u03a3 = \u2211_{j=1}^d \u03c9j uj ujT, where \u03c91 = 6, \u03c92 = 3, \u03c93 = . . . 
= \u03c9d = 1, and the first two leading eigenvectors of \u03a3, u1 and u2, are sparse with the first s = 10 entries of u1 and the second s = 10 entries of u2 are nonzero, i.e.\n\nu1j = 1/\u221a10 if 1 \u2264 j \u2264 10, and 0 otherwise; u2j = 1/\u221a10 if 11 \u2264 j \u2264 20, and 0 otherwise. (5.1)\n\nThe remaining eigenvectors are chosen arbitrarily. The generalized correlation matrix \u03a30 is generated from \u03a3, with \u03bb1 = 4, \u03bb2 = 2.5, \u03bb3, . . . , \u03bbd \u2264 1 and the top two leading eigenvectors sparse:\n\n\u03b81j = \u22121/\u221a10 if 1 \u2264 j \u2264 10, and 0 otherwise; \u03b82j = \u22121/\u221a10 if 11 \u2264 j \u2264 20, and 0 otherwise. (5.2)\n\nSecondly, using \u03a30, we consider the following three generating schemes:\n[Scheme 1] X \u223c TEd(\u03a30, \u03be; f1, . . . , fd) with \u03be \u223c \u03c7d and f1(x) = . . . = fd(x) = x. Here \u221a(Y1^2 + . . . + Yd^2) \u223c \u03c7d with Y1, . . . , Yd \u223ci.i.d N(0, 1). In other words, \u03c7d is the chi-distribution with degree of freedom d. This is equivalent to say that X \u223c N(0, \u03a30) (Example 2.4 of [5]).\n[Scheme 2] X \u223c TEd(\u03a30, \u03be; f1, . . . , fd) with \u03be =d \u221am \u03be\u22171/\u03be\u22172 and f1(x) = . . . = fd(x) = x. Here \u03be\u22171 \u223c \u03c7d, \u03be\u22172 \u223c \u03c7m, \u03be\u22171 is independent of \u03be\u22172 and m \u2208 N. This is equivalent to say that X \u223c Mtd(m, 0, \u03a30), i.e. X following a multivariate-t distribution with degree of freedom m, mean 0 and covariance matrix \u03a30 (Example 2.5 of [5]). Here we consider m = 3.\n[Scheme 3] X \u223c TEd(\u03a30, \u03be; f1, . . . , fd) with \u03be =d \u221am \u03be\u22171/\u03be\u22172. Here \u03be\u22171 \u223c \u03c7d, \u03be\u22172 \u223c \u03c7m, \u03be\u22171 is independent of \u03be\u22172 and m = 3. Moreover, {f1, . . . , fd} = {h1, h2, h3, h4, h5, h1, h2, h3, h4, h5, . . 
.}, where\n\nh1^{\u22121}(x) := x, h2^{\u22121}(x) := sign(x)|x|^{1/2} / \u221a(\u222b |t| \u03c6(t) dt), h3^{\u22121}(x) := x^3 / \u221a(\u222b t^6 \u03c6(t) dt),\nh4^{\u22121}(x) := (\u03a6(x) \u2212 \u222b \u03a6(t)\u03c6(t) dt) / \u221a(\u222b (\u03a6(y) \u2212 \u222b \u03a6(t)\u03c6(t) dt)^2 \u03c6(y) dy), h5^{\u22121}(x) := (exp(x) \u2212 \u222b exp(t)\u03c6(t) dt) / \u221a(\u222b (exp(y) \u2212 \u222b exp(t)\u03c6(t) dt)^2 \u03c6(y) dy).\n\nThis is equivalent to say that X is transelliptically distributed with the latent elliptical distribution Z \u223c Mtd(3, 0, \u03a30).\nTo evaluate the robustness of different methods, let r \u2208 [0, 1) represent the proportion of samples being contaminated. For each dimension, we randomly select \u230anr\u230b entries and replace them with either 5 or -5 with equal probability. The final data matrix we obtained is X \u2208 Rn\u00d7d. Here we pick r = 0, 0.02 or 0.05. Under the Scheme 1 to Scheme 3 with different levels of contamination (r = 0, 0.02 or 0.05), we repeatedly generate the data matrix X for 1,000 times and compute the averaged False Positive Rates and False Negative Rates using a path of tuning parameters k from 5 to 90. The feature selection performances of different methods are then evaluated by plotting (FPR(k), 1\u2212FNR(k)). The corresponding ROC curves are presented in Figure 1 (A). More results are shown in the long version of this paper [8]. It can be observed that Kendall is generally better and more resistant to the outliers compared with Pearson.\n\nFigure 1: (A) ROC curves under Scheme 1, Scheme 2 and Scheme 3 (top, middle, bottom) and data contamination at different levels (r = 0, 0.02, 0.05 from left to right). x-axis is FPR and y-axis is TPR. Here n = 100 and d = 100. 
(B) Successful matches of the market trend proportions only using the stocks in Ak and Bk. The x-axis represents the tuning parameter k scaling from 1 to 200; the y-axis represents the % of successful matches. The curve denoted by \u2018Kendall\u2019 represents the points of (k, \u03c1Ak) and the curve denoted by \u2018Pearson\u2019 represents the points of (k, \u03c1Bk).\n\n5.2 Equities Data\nIn this section we apply the TCA on the stock price data from Yahoo! Finance (finance.yahoo.com). We collected the daily closing prices for J=452 stocks that were consistently in the S&P 500 index between January 1, 2003 and January 1, 2008. This gave us altogether T=1,257 data points, where each data point corresponds to the vector of closing prices on a trading day. Let St = [St_{t,j}], with St_{t,j} denoting the closing price of stock j on day t.\nWe wish to evaluate the ability of using only k stocks to represent the trend of the whole stock market. To this end, we run Kendall and Pearson on St and obtain the leading eigenvectors \u03b8\u0303Kendall and \u03b8\u0303Pearson using the tuning parameter k \u2208 N. Let Ak := supp(\u03b8\u0303Kendall) and Bk := supp(\u03b8\u0303Pearson). We then let T^W_t, T^{Ak}_t and T^{Bk}_t denote the trend of the whole stocks, the Ak stocks and the Bk stocks on the t-th day compared with the (t \u2212 1)-th day, i.e.:\n\nT^W_t := I(\u2211_j St_{t,j} \u2212 \u2211_j St_{t\u22121,j} > 0), T^{Ak}_t := I(\u2211_{j\u2208Ak} St_{t,j} \u2212 \u2211_{j\u2208Ak} St_{t\u22121,j} > 0) and T^{Bk}_t := I(\u2211_{j\u2208Bk} St_{t,j} \u2212 \u2211_{j\u2208Bk} St_{t\u22121,j} > 0),\n\nhere I is the indicator function. In this way, we can calculate the proportion of successful matches of the market trend using the stocks in Ak and Bk as: \u03c1Ak := (1/T) \u2211_t I(T^W_t = T^{Ak}_t) and \u03c1Bk := (1/T) \u2211_t I(T^W_t = T^{Bk}_t). We visualize the result by plotting (k, \u03c1Ak) and (k, \u03c1Bk) on a 2D figure. 
The result is presented in Figure 1 (B).\nIt can be observed from Figure 1 (B) that Kendall summarizes the trend of the whole stock market consistently better than Pearson. Moreover, the averaged difference between the two methods is (1/200) \u2211_k (\u03c1Ak \u2212 \u03c1Bk) = 1.4025 with the standard deviation 0.6743. Therefore, the difference is significant.\n\n6 Acknowledgement\nThis research was supported by NSF award IIS-1116730.\n\nReferences\n[1] TW Anderson. Statistical inference in elliptically contoured and related distributions. Recherche, 67:02, 1990.\n\n[2] M.G. Borgognone, J. Bussi, and G. Hough. Principal component analysis in sensory analysis: covariance or correlation matrix? Food quality and preference, 12(5-7):323\u2013326, 2001.\n\n[3] S. Cambanis, S. Huang, and G. Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368\u2013385, 1981.\n\n[4] H.B. Fang, K.T. Fang, and S. Kotz. The meta-elliptical distributions with given marginals. Journal of Multivariate Analysis, 82(1):1\u201316, 2002.\n\n[5] KT Fang, S. Kotz, and KW Ng. Symmetric multivariate and related distributions. Chapman&Hall, London, 1990.\n\n[6] C. Genest, AC Favre, J. B\u00e9liveau, and C. Jacques. 
Metaelliptical copulas and their use in frequency analysis of multivariate hydrological data. Water Resour. Res, 43(9):W09401, 2007.\n\n[7] P.R. Halmos. Measure theory, volume 18. Springer, 1974.\n\n[8] F. Han and H. Liu. TCA: Transelliptical principal component analysis for high dimensional non-gaussian data. Technical Report, 2012.\n\n[9] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, pages 13\u201330, 1963.\n\n[10] H. Hult and F. Lindskog. Multivariate extremes, aggregation and dependence in elliptical distributions. Advances in Applied probability, 34(3):587\u2013608, 2002.\n\n[11] DR Jensen. The structure of ellipsoidal distributions, ii. principal components. Biometrical Journal, 28(3):363\u2013369, 1986.\n\n[12] DR Jensen. Conditioning and concentration of principal components. Australian Journal of Statistics, 39(1):93\u2013104, 1997.\n\n[13] H. Joe. Multivariate models and dependence concepts, volume 73. Chapman & Hall/CRC, 1997.\n\n[14] I.M. Johnstone and A.Y. Lu. Sparse principal components analysis. arXiv preprint arXiv:0901.4392, 2009.\n\n[15] M. Journ\u00e9e, Y. Nesterov, P. Richt\u00e1rik, and R. Sepulchre. Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11:517\u2013553, 2010.\n\n[16] KS Kelly and R. Krzysztofowicz. A bivariate meta-gaussian density for use in hydrology. Stochastic Hydrology and Hydraulics, 11(1):17\u201331, 1997.\n\n[17] W.H. Kruskal. Ordinal measures of association. Journal of the American Statistical Association, pages 814\u2013861, 1958.\n\n[18] D. Kurowicka, J. Misiewicz, and RM Cooke. Elliptical copulae. In Proc of the International Conference on Monte Carlo Simulation-Monte Carlo, pages 209\u2013214, 2000.\n\n[19] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. 
The Journal of Machine Learning Research, 10:2295\u20132328, 2009.\n\n[20] Z. Ma. Sparse principal component analysis and iterative thresholding. arXiv preprint arXiv:1112.2432, 2011.\n\n[21] L. Mackey. Deflation methods for sparse pca. Advances in neural information processing systems, 21:1017\u20131024, 2009.\n\n[22] G.P. McCabe. Principal variables. Technometrics, pages 137\u2013144, 1984.\n\n[23] D. Paul and I.M. Johnstone. Augmented sparse principal component analysis for high dimensional data. arXiv preprint arXiv:1202.1242, 2012.\n\n[24] GQ Qian, G. Gabor, and RP Gupta. Principal components selection by the criterion of the minimum mean difference of complexity. Journal of multivariate analysis, 49(1):55\u201375, 1994.\n\n[25] A. Sklar. Fonctions de r\u00e9partition \u00e0 n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8(1):11, 1959.\n\n[26] V.Q. Vu and J. Lei. Minimax rates of estimation for sparse pca in high dimensions. arXiv preprint arXiv:1202.0786, 2012.\n\n[27] C.M. Waternaux. Principal components in the nonnormal case: The test of equality of q roots. Journal of Multivariate Analysis, 14(3):323\u2013335, 1984.\n\n[28] X.T. Yuan and T. Zhang. Truncated power method for sparse eigenvalue problems. arXiv preprint arXiv:1112.2679, 2011.\n\n[29] Y. Zhang, A. d'Aspremont, and L.E. Ghaoui. Sparse pca: Convex relaxations, algorithms and applications. Handbook on Semidefinite, Conic and Polynomial Optimization, pages 915\u2013940, 2012.\n", "award": [], "sourceid": 4828, "authors": [{"given_name": "Fang", "family_name": "Han", "institution": null}, {"given_name": "Han", "family_name": "Liu", "institution": null}]}