{"title": "Generalizing Analytic Shrinkage for Arbitrary Covariance Structures", "book": "Advances in Neural Information Processing Systems", "page_first": 1869, "page_last": 1877, "abstract": "Analytic shrinkage is a statistical technique that offers a fast alternative to cross-validation for the regularization of covariance matrices and has appealing consistency properties. We show that the proof of consistency implies bounds on the growth rates of eigenvalues and their dispersion, which are often violated in data. We prove consistency under assumptions which do not restrict the covariance structure and therefore better match real world data. In addition, we propose an extension of analytic shrinkage --orthogonal complement shrinkage-- which adapts to the covariance structure. Finally we demonstrate the superior performance of our novel approach on data from the domains of finance, spoken letter and optical character recognition, and neuroscience.", "full_text": "Generalizing Analytic Shrinkage for Arbitrary Covariance Structures\n\nDaniel Bartz\nDepartment of Computer Science\nTU Berlin, Berlin, Germany\ndaniel.bartz@tu-berlin.de\n\nKlaus-Robert M\u00fcller\nTU Berlin, Berlin, Germany\nKorea University, Korea, Seoul\nklaus-robert.mueller@tu-berlin.de\n\nAbstract\n\nAnalytic shrinkage is a statistical technique that offers a fast alternative to cross-validation\nfor the regularization of covariance matrices and has appealing consistency\nproperties. We show that the proof of consistency requires bounds on\nthe growth rates of eigenvalues and their dispersion, which are often violated in\ndata. We prove consistency under assumptions which do not restrict the covariance\nstructure and therefore better match real world data. In addition, we propose an\nextension of analytic shrinkage \u2013orthogonal complement shrinkage\u2013 which adapts\nto the covariance structure. 
Finally we demonstrate the superior performance of\nour novel approach on data from the domains of finance, spoken letter and optical\ncharacter recognition, and neuroscience.\n\n1 Introduction\n\nThe estimation of covariance matrices is the basis of many machine learning algorithms and estimation\nprocedures in statistics. The standard estimator is the sample covariance matrix: its entries are\nunbiased and consistent [1]. A well-known shortcoming of the sample covariance is the systematic\nerror in the spectrum. In particular for high dimensional data, where dimensionality p and number\nof observations n are often of the same order, large eigenvalues are over- and small eigenvalues\nunderestimated. A form of regularization which can alleviate this bias is shrinkage [2]: the convex\ncombination of the sample covariance matrix S and a multiple of the identity $T = p^{-1}\\,\\mathrm{trace}(S)\\,I$,\n\n$$C_{\\mathrm{sh}} = (1 - \\lambda) S + \\lambda T, \\qquad (1)$$\n\nhas potentially lower mean squared error and lower bias in the spectrum [3]. The standard procedure\nfor choosing an optimal regularization for shrinkage is cross-validation [4], which is known to be\ntime consuming. For online settings CV can become infeasible and a faster model selection method\nis required. Recently, analytic shrinkage [3], which provides a consistent analytic formula for the\nabove regularization parameter $\\lambda$, has become increasingly popular. It minimizes the expected mean\nsquared error of the convex combination with a computational cost of $O(p^2)$, which is negligible\nwhen used for algorithms like Linear Discriminant Analysis (LDA) which are $O(p^3)$.\nThe consistency of analytic shrinkage relies on assumptions which are rarely tested in practice [5].\nThis paper will therefore aim to render the analytic shrinkage framework more practical and usable\nfor real world data. 
We contribute in three aspects: first, we derive simple tests for the applicability\nof the analytic shrinkage framework and observe that for many data sets of practical relevance the\nassumptions which underlie consistency are not fulfilled. Second, we design assumptions which\nbetter fit the statistical properties observed in real world data, which typically has a low dimensional\nstructure. Under these new assumptions, we prove consistency of analytic shrinkage. We\nshow a counter-intuitive result: for typical covariance structures, no shrinkage \u2013and therefore no\nregularization\u2013 takes place in the limit of high dimensionality and number of observations. In practice,\nthis leads to weak shrinkage and degrading performance. Therefore, third, we propose an extension\nof the shrinkage framework: automatic orthogonal complement shrinkage (aoc-shrinkage)\ntakes the covariance structure into account and outperforms standard shrinkage on real world data at\na moderate increase in computation time. Note that proofs of all theorems in this paper can be found\nin the supplemental material.\n\n2 Overview of analytic shrinkage\n\nTo derive analytic shrinkage, the expected mean squared error of the shrinkage covariance matrix\neq. 
(1) as an estimator of the true covariance matrix C is minimized:\n\n$$\\lambda^\\star = \\arg\\min_\\lambda R(\\lambda) := \\arg\\min_\\lambda E\\left[ \\left\\| C - (1-\\lambda) S - \\lambda T \\right\\|^2 \\right] \\qquad (2)$$\n\n$$= \\arg\\min_\\lambda \\sum_{i,j} \\left\\{ 2\\lambda \\left( \\mathrm{Cov}(S_{ij}, T_{ij}) - \\mathrm{Var}(S_{ij}) \\right) + \\lambda^2 E\\left[ (S_{ij} - T_{ij})^2 \\right] + \\mathrm{Var}(S_{ij}) \\right\\}$$\n\n$$= \\frac{\\sum_{i,j} \\left\\{ \\mathrm{Var}(S_{ij}) - \\mathrm{Cov}(S_{ij}, T_{ij}) \\right\\}}{\\sum_{i,j} E\\left[ (S_{ij} - T_{ij})^2 \\right]}. \\qquad (3)$$\n\nThe analytic shrinkage estimator $\\hat{\\lambda}$ is obtained by replacing expectations with sample estimates:\n\n$$\\widehat{\\mathrm{Var}}(S_{ij}) = \\frac{1}{(n-1)n} \\sum_s \\left( x_{is} x_{js} - \\frac{1}{n} \\sum_t x_{it} x_{jt} \\right)^2,$$\n\n$$\\widehat{\\mathrm{Cov}}(S_{ii}, T_{ii}) = \\frac{1}{(n-1)np} \\sum_k \\left\\{ \\sum_s x_{is}^2 x_{ks}^2 - \\frac{1}{n} \\sum_t x_{it}^2 \\sum_{t'} x_{kt'}^2 \\right\\},$$\n\n$$\\widehat{E}\\left[ (S_{ij} - T_{ij})^2 \\right] = (S_{ij} - T_{ij})^2.$$\n\nTheoretical results on the estimator $\\hat{\\lambda}$ are based on analysis of a sequence of statistical models\nindexed by n. $X_n$ denotes a $p_n \\times n$ matrix of n iid observations of $p_n$ variables with mean zero\nand covariance matrix $\\Sigma_n$. $Y_n = \\Gamma_n^T X_n$ denotes the same observations in their eigenbasis, having\ndiagonal covariance $\\Lambda_n = \\Gamma_n^T \\Sigma_n \\Gamma_n$. Lower case letters $x^n_{it}$ and $y^n_{it}$ denote the entries of $X_n$ and\n$Y_n$, respectively$^1$. The main theoretical result on the estimator $\\hat{\\lambda}$ is its consistency in the large n, p\nlimit [3]. A decisive role is played by an assumption on the eighth moments$^2$ in the eigenbasis:\nAssumption 2 (A2, Ledoit/Wolf 2004 [3]). There exists a constant $K_2$ independent of n such that\n\n$$p_n^{-1} \\sum_{i=1}^{p_n} E\\left[ (y^n_{i1})^8 \\right] \\le K_2.$$\n\n3 Implicit assumptions on the covariance structure\n\nFrom the assumption on the eighth moments in the eigenbasis, we derive requirements on the eigenvalues\nwhich facilitate an empirical check:\nTheorem 1 (largest eigenvalue growth rate). Let A2 hold. Then, there exists a limit on the growth\nrate of the largest eigenvalue\n\n$$\\gamma^n_1 = \\max_i \\mathrm{Var}(y^n_i) = O\\left( p_n^{1/4} \\right).$$\n\nTheorem 2 (dispersion growth rate). Let A2 hold. Then, there exists a limit on the growth rate of\nthe normalized eigenvalue dispersion\n\n$$d_n = p_n^{-1} \\sum_i \\left( \\gamma_i - p_n^{-1} \\sum_j \\gamma_j \\right)^2 = O(1).$$\n\n$^1$ We shall often drop the sequence index n and the observation index t to improve readability of formulas.\n$^2$ Eighth moments arise because $\\mathrm{Var}(S_{ij})$, the variance of the sample covariance, is of fourth order and has\nto converge. Nevertheless, even for non-Gaussian data convergence is fast.\n\nFigure 1: Covariance matrices and dependency of the largest eigenvalue/dispersion on the dimensionality.\nAverage over 100 repetitions.\n\nFigure 2: Dependency of the largest eigenvalue/dispersion on the dimensionality. Average over 100\nrandom subsets.\n\nThe theorems restrict the covariance structure of the sequence of models when the dimensionality\nincreases. To illustrate this, we design two sequences of models A and B indexed by their dimensionality\np, in which dimensions $x^p_i$ are correlated with a signal $s^p$:\n\n$$x^p_i = \\begin{cases} (0.5 + b^p_i) \\cdot \\varepsilon^p_i + \\alpha c^p_i s^p, & \\text{with probability } P_s^{A/B}(i), \\\\ (0.5 + b^p_i) \\cdot \\varepsilon^p_i, & \\text{else,} \\end{cases} \\qquad (4)$$\n\nwhere $b^p_i$ and $c^p_i$ are uniform random from $[0, 1]$, $s^p$ and $\\varepsilon^p_i$ are standard normal, $\\alpha = 1$, $P_s^B(i) = 0.2$\nand $P_s^A(i) = (i/10 + 1)^{-7/8}$ (power law decay). 
To avoid systematic errors, we hold the ratio of\nobservations to dimensions fixed: $n_p/p = 2$.\nTo the left in Figure 1, covariance matrices are shown: For model A, the matrix is dense in the\nupper left corner; the more dimensions we add, the sparser the matrix gets. For model B,\ncorrelations are spread out evenly. To the right, normalized sample dispersion and largest eigenvalue\nare shown. For model A, we see the behaviour from the theorems: the dispersion is bounded, the\nlargest eigenvalue grows with the fourth root of the dimensionality. For model B, there is a linear dependency of both\ndispersion and largest eigenvalue: A2 is violated.\nFor real world data, we measure the dependency of the largest eigenvalue/dispersion on the dimensionality\nby averaging over random subsets. Figure 2 shows the results for four data sets$^3$: (1) New\nYork Stock Exchange, (2) USPS hand-written digits, (3) ISOLET spoken letters and (4) a Brain\nComputer Interface EEG data set. The largest eigenvalues and the normalized dispersions (see Figure 2)\nclosely resemble model B; a linear dependence on the dimensionality which violates A2 is\nvisible.\n\n$^3$ For details on the data sets, see section 6.\n\n[Figure 1 panels: covariance matrices (model A, model B); normalized sample dispersion and max(EV) versus dimensionality.]\n[Figure 2 panels: US stock market (#assets), USPS hand-written digits (#pixels), ISOLET spoken letters (#features), BCI EEG data (#features); normalized sample dispersion and max(EV) versus dimensionality.]\n\n4 Analytic shrinkage for arbitrary covariance structures\n\nWe replace A2 by a weaker assumption on the moments in the basis of the observations X which\ndoes not impose any constraints on the covariance structure$^4$:\nAssumption 2\u2032 (A2\u2032). 
There exists a constant $K_2$ independent of p such that\n\n$$p^{-1} \\sum_{i=1}^{p} E\\left[ (x^p_{i1})^8 \\right] \\le K_2.$$\n\nStandard assumptions. For the proof of consistency, the relationship between dimensionality and\nnumber of observations has to be defined and a weak restriction on the correlation of the products\nof uncorrelated variables is necessary. We use slightly modified versions of the original assumptions [3].\nAssumption 1\u2032 (A1\u2032, Kolmogorov asymptotics). There exists a constant $K_1$, $0 \\le K_1 \\le \\infty$, independent\nof p such that\n\n$$\\lim_{p \\to \\infty} p/n_p = K_1.$$\n\nAssumption 3\u2032 (A3\u2032).\n\n$$\\lim_{p \\to \\infty} \\frac{\\sum_{i,j,k,l \\in Q_p} \\left( \\mathrm{Cov}[y^p_{i1} y^p_{j1}, y^p_{k1} y^p_{l1}] \\right)^2}{|Q_p|} = 0,$$\n\nwhere $Q_p$ is the set of all quadruples consisting of distinct integers between 1 and p.\n\nAdditional assumptions. A1\u2032 to A3\u2032 subsume a wide range of dispersion and eigenvalue configurations.\nTo investigate the role which this plays, we categorize sequences by adding an additional\nparameter k. It will prove essential for the limit behavior of optimal shrinkage and the consistency\nof analytic shrinkage:\nAssumption 4 (A4, growth rate of the normalized dispersion). Let $\\gamma_i$ denote the eigenvalues of C.\nThen, the limit behaviour of the normalized dispersion is parameterized by k:\n\n$$p^{-1} \\sum_i \\left( \\gamma_i - p^{-1} \\sum_j \\gamma_j \\right)^2 = \\Theta\\left( \\max(1, p^{2k-1}) \\right),$$\n\nwhere $\\Theta$ is the Landau Theta.\nIn sequences of models with $k \\le 0.5$ the normalized dispersion is bounded from above and below, as\nin model A in the last section. For $k > 0.5$ the normalized dispersion grows with the dimensionality,\nfor $k = 1$ it is linear in p, as in model B.\nWe make two technical assumptions to rule out degenerate cases. First, we assume that, on average,\nadditional dimensions make a positive contribution to the mean variance:\nAssumption 5 (A5). 
There exists a constant $K_3$ such that\n\n$$p^{-1} \\sum_{i=1}^{p} E\\left[ (x^p_{i1})^2 \\right] \\ge K_3.$$\n\nSecond, we assume that limits on the relation between second, fourth and eighth moments exist:\nAssumption 6 (A6, moment relation). $\\exists\\, \\alpha_4, \\alpha_8, \\beta_4$ and $\\beta_8$:\n\n$$E[y_i^4] \\le (1 + \\alpha_4) E^2[y_i^2], \\qquad E[y_i^8] \\le (1 + \\alpha_8) E^2[y_i^4],$$\n$$E[y_i^4] \\ge (1 + \\beta_4) E^2[y_i^2], \\qquad E[y_i^8] \\ge (1 + \\beta_8) E^2[y_i^4].$$\n\n$^4$ For convenience, we index the sequence of statistical models by p instead of n.\n\nFigure 3: Illustration of orthogonal complement shrinkage.\n\nTheoretical results on limit behaviour and consistency. We are able to derive a novel theorem\nwhich shows that under these wider assumptions, shrinkage remains consistent:\nTheorem 3 (Consistency of Shrinkage). Let A1\u2032, A2\u2032, A3\u2032, A4, A5, A6 hold and\n\n$$m = E\\left[ \\left( (\\lambda^\\star - \\hat{\\lambda}) / \\lambda^\\star \\right)^2 \\right]$$\n\ndenote the expected squared relative error of the estimate $\\hat{\\lambda}$. Then, independently of k,\n\n$$\\lim_{p \\to \\infty} m = 0.$$\n\nAn unexpected caveat accompanying this result is the limit behaviour of the optimal shrinkage\nstrength $\\lambda^\\star$:\nTheorem 4 (Limit behaviour). Let A1\u2032, A2\u2032, A3\u2032, A4, A5, A6 hold. Then, there exist $0 < b_l < b_u < 1$ such that\n\n$$k \\le 0.5 \\;\\Rightarrow\\; \\forall n: b_l \\le \\lambda^\\star \\le b_u, \\qquad k > 0.5 \\;\\Rightarrow\\; \\lim_{p \\to \\infty} \\lambda^\\star = 0.$$\n\nThe theorem shows that there is a fundamental problem with analytic shrinkage: if k is larger\nthan 0.5 (all data sets in the last section had k = 1) there is no shrinkage in the limit.\n\n5 Automatic orthogonal complement shrinkage\n\nOrthogonal complement shrinkage. To obtain a finite shrinkage strength, we propose an extension\nof shrinkage we call oc-shrinkage: it leaves the first eigendirection untouched and performs\nshrinkage on the orthogonal complement oc of that direction. 
Figure 3 illustrates this approach. It\nshows a three dimensional true covariance matrix with a high dispersion that makes it highly ellipsoidal.\nThe result is a high level of discrepancy between the spherical shrinkage target and the true\ncovariance. The best convex combination of target and sample covariance will put extremely low\nweight on the target. The situation is different in the orthogonal complement of the first eigendirection\nof the sample covariance matrix: there, the discrepancy between sample covariance and target\nis strongly reduced.\nTo simplify the theoretical analysis, let us consider the case where there is only a single growing\neigenvalue while the remainder stays bounded:\nAssumption 4\u2032 (A4\u2032, single large eigenvalue). Let us define\n\n$$z_i = y_i, \\quad 2 \\le i \\le p, \\qquad z_1 = p^{-k/2} y_1.$$\n\nThere exist constants $F_l$ and $F_u$ such that $F_l \\le E[z_i^8] \\le F_u$.\nA recent result from Random Matrix Theory [6] allows us to prove that the projection on the empirical\northogonal complement $\\widehat{oc}$ does not affect the consistency of the estimator $\\hat{\\lambda}_{\\widehat{oc}}$:\nTheorem 5 (consistency of oc-shrinkage). Let A1\u2032, A2\u2032, A3\u2032, A4\u2032, A5, A6 hold. In addition, assume\nthat 16th moments$^5$ of the $y_i$ exist and are bounded. Then, independently of k,\n\n$$\\lim_{p \\to \\infty} \\left( \\hat{\\lambda}_{\\widehat{oc}} - \\arg\\min_\\lambda Q_{\\widehat{oc}}(\\lambda) \\right)^2 = 0,$$\n\nwhere Q denotes the mean squared error (MSE) of the convex combination (cmp. eq. (2)).\n\nAutomatic model selection. Orthogonal complement shrinkage only yields an advantage if the first\neigenvalue is large enough. Starting from eq. (2), we can consistently estimate the error of standard\nshrinkage and orthogonal complement shrinkage and only use oc-shrinkage when the difference\n$\\widehat{\\Delta}_{R,\\widehat{oc}}$ is positive. 
In the supplemental material, we derive a formula for a conservative estimate:\n\n$$\\widehat{\\Delta}_{R,\\mathrm{cons.},\\widehat{oc}} = \\widehat{\\Delta}_{R,\\widehat{oc}} - m_\\Delta \\hat{\\sigma}_{\\widehat{\\Delta}_{R,\\widehat{oc}}} - m_E \\hat{\\lambda}^2_{\\widehat{oc}} \\hat{\\sigma}_{\\hat{E}}.$$\n\nUsage of $m_\\Delta = 0.45$ corresponds to 75% probability of improvement under gaussianity and yields\ngood results in practice. The second term is relevant in small samples; setting $m_E = 0.1$ is sufficient.\nA dataset may have multiple large eigenvalues. It is straightforward to iterate the procedure and thus\nautomatically select the number of retained eigendirections $\\hat{r}$. We call this automatic orthogonal\ncomplement shrinkage. An algorithm listing can be found in the supplemental material.\nThe computational cost of aoc-shrinkage is larger than that of standard shrinkage as it additionally\nrequires an eigendecomposition $O(p^3)$ and some matrix multiplications $O(\\hat{r} p^2)$. In the applications\nconsidered here, this additional cost is negligible: $\\hat{r} \\ll p$ and the eigendecomposition can replace\nmatrix inversions for LDA, QDA or portfolio optimization.\n\nFigure 4: Automatic selection of the number of eigendirections. Average over 100 runs.\n\n6 Empirical validation\n\nSimulations. To test the method, we extend model B (eq. (4), section 3) to three signals, $P_{s_i} = (0.1, 0.25, 0.5)$. 
Figure 4 reports the percentage improvement in average loss over the sample covariance\nmatrix,\n\n$$\\mathrm{PRIAL}\\left( C_{\\mathrm{sh/oc-sh/aoc-sh}} \\right) = \\frac{E\\| S - C \\| - E\\| C_{\\mathrm{sh/oc-sh/aoc-sh}} - C \\|}{E\\| S - C \\|},$$\n\n$^5$ The existence of 16th moments is needed because we bound the estimation error in each direction by the\nmaximum over all directions, an extremely conservative approximation.\n\n[Figure 4 axes: PRIAL versus dimensionality p; curves: Shrinkage, oc(1)-Shrinkage, oc(2)-Shrinkage, oc(3)-Shrinkage, oc(4)-Shrinkage, aoc-Shrinkage.]\n\nTable 1: Portfolio risk. Mean absolute deviations $\\cdot 10^3$ (mean squared deviations $\\cdot 10^6$) of the resulting\nportfolios for the different covariance estimators and markets. \u2020 := aoc-shrinkage significantly\nbetter than this model at the 5% level, tested by a randomization test.\n\nEstimator | US | EU | HK\nsample covariance | 8.56\u2020 (156.1\u2020) | 5.93\u2020 (78.9\u2020) | 6.57\u2020 (81.2\u2020)\nstandard shrinkage | 6.27\u2020 (86.4\u2020) | 4.43\u2020 (46.2\u2020) | 6.32\u2020 (76.2\u2020)\n  $\\hat{\\lambda}$ | 0.09 | 0.12 | 0.10\nshrinkage to a factor model | 5.56\u2020 (69.6\u2020) | 4.00\u2020 (39.1\u2020) | 6.17\u2020 (72.9\u2020)\n  $\\hat{\\lambda}$ | 0.41 | 0.44 | 0.42\naoc-shrinkage | 5.41 (67.0) | 3.83 (36.3) | 6.11 (71.8)\n  $\\hat{\\lambda}$ | 0.75 | 0.79 | 0.75\n  average $\\hat{r}$ | 1.64 | 1.17 | 1.41\n\nTable 2: Accuracies for classification tasks on ISOLET and USPS data. 
\u2217 := signi\ufb01cantly better\nthan all compared methods at the 5% level, tested by a randomization test.\n\nntrain\nLDA\nLDA (shrinkage)\nLDA (aoc)\nQDA\nQDA (shrinkage)\nQDA (aoc)\n\nISOLET\n500\n75.77%\n88.92%\n89.69%\n2.783%\n58.57%\n59.51%\n\n\u2217\n\n\u2217\n\n2000\n92.29%\n93.25%\n93.42%\n4.882%\n75.4%\n80.84%\n\n\u2217\n\n5000\n94.1%\n94.3%\n94.33%\n14.09%\n79.25%\n87.35%\n\nUSPS\n500\n72.31%\n83.77%\n83.95%\n10.11%\n82.2%\n83.31%\n\n\u2217\n\n5000\n\n2000\n87.45% 89.56%\n88.37% 89.77%\n88.37% 89.77%\n49.45% 72.43%\n88.85% 89.67%\n89.4%\n90.07%\n\n\u2217\n\nof standard shrinkage, oc-shrinkage for one to four eigendirections and aoc-shrinkage.\nStandard shrinkage behaves as predicted by Theorem 4: \u02c6\u03bb and therefore the PRIAL tend to zero in\nthe large n, p limit. The same holds for orders of oc-shrinkage \u2013oc(1) and oc(2)\u2013 lower than the\nnumber of signals, but performance degrades more slowly. For small dimensionalities eigenvalues\nare small and therefore there is no advantage for oc-shrinkage. On the contrary, the higher the order\nof oc-shrinkage, the larger the error by projecting out spurious large eigenvalues which should have\nbeen subject to regularization. The automatic order selection aoc-shrinkage leads to close to optimal\nPRIAL for all dimensionalities.\n\nReal world data I: portfolio optimization Covariance estimates are needed for the minimization\nof portfolio risk [7]. Table 1 shows portfolio risk for approximately eight years of daily return data\nfrom 1200 US, 600 European and 100 Hong Kong stocks, aggregated from Reuters tick data [8].\nEstimation of covariance matrices is based on short time windows (150 days) because of the data\u2019s\nnonstationarity. Despite the unfavorable ratio of observations to dimensionality, standard shrinkage\nhas very low values of \u02c6\u03bb: the stocks are highly correlated and the spherical target is highly inappro-\npriate. 
Shrinkage to a financial factor model incorporating the market factor [9] provides a better\ntarget; it leads to stronger shrinkage and better portfolios. Our proposed aoc-shrinkage yields even\nstronger shrinkage and significantly outperforms all compared methods.\n\nTable 3: Accuracies for classification tasks on BCI data. Artificially injected noise in one electrode.\n\u2217 := significantly better than all compared methods at the 5% level, tested by a randomization test.\n\n$\\sigma_{noise}$ | 0 | 10 | 30 | 100 | 300 | 1000\nLDA | 92.28% | 92.28% | 92.28% | 92.28% | 92.28% | 92.28%\nLDA (shrinkage) | 92.39% | 92.94% | 92.18% | 88.04% | 82.15% | 73.79%\nLDA (aoc) | 93.27%\u2217 | 93.27%\u2217 | 93.24%\u2217 | 92.88%\u2217 | 93.16%\u2217 | 93.19%\u2217\naverage $\\hat{r}$ | 2.0836 | 3.0945 | 3.0891 | 3.0891 | 3.0891 | 3.09\n\nFigure 5: High variance components responsible for failure of shrinkage in BCI. $\\sigma_{noise} = 10$.\nSubject 1.\n\nReal world data II: USPS and ISOLET. We applied Linear and Quadratic Discriminant Analysis\n(LDA and QDA) to hand-written digit recognition (USPS, 1100 observations with 256 pixels for\neach of the 10 digits [10]) and spoken letter recognition (ISOLET, 617 features, 7797 recordings of\n26 spoken letters [11], obtained from the UCI ML Repository [12]) to assess the quality of standard\nand aoc-shrinkage covariance estimates.\nTable 2 shows that aoc-shrinkage outperforms standard shrinkage for QDA and LDA on both data\nsets for different training set sizes. Only for LDA and large sample sizes on the relatively low\ndimensional USPS data, there is no difference between standard and aoc-shrinkage: the automatic\nprocedure decides that shrinkage on the whole space is optimal.\n\nReal world data III: Brain-Computer-Interface. The BCI data was recorded in a study in which\n11 subjects had to distinguish between noisy and noise-free phonemes [13, 14]. 
We applied LDA\non 427 standardized features calculated from event related potentials in 61 electrodes to classify two\nconditions: correctly identified noise-free and correctly identified noisy phonemes ($n_{train} = 1000$).\nFor Table 3, we simulated additive noise in a random electrode (100 repetitions). With and without\nnoise, our proposed aoc-shrinkage outperforms standard shrinkage LDA. Without noise, $\\hat{r} \\approx 2$ high\nvariance directions \u2013probably corresponding to ocular and facial muscle artefacts, depicted to the\nleft in Figure 5\u2013 are left untouched by aoc-shrinkage. With injected noise, the number of directions\nincreases to $\\hat{r} \\approx 3$, as the procedure detects the additional high variance component \u2013to the right\nin Figure 5\u2013 and adapts the shrinkage procedure such that performance remains unaffected. For\nstandard shrinkage, noise affects the analytic regularization and performance degrades as a result.\n\n7 Discussion\n\nAnalytic shrinkage is a fast and accurate alternative to cross-validation which yields comparable\nperformance, e.g. in prediction tasks and portfolio optimization. This paper has contributed by clarifying\nthe (limited) applicability of the analytic shrinkage formula. In particular we could show that\nits assumptions are often violated in practice since real world data has complex structured dependencies.\nWe therefore introduced a set of more general assumptions to shrinkage theory, chosen\nsuch that the appealing consistency properties of analytic shrinkage are preserved. We have shown\nthat for typical structure in real world data, strong eigendirections adversely affect shrinkage by\ndriving the shrinkage strength to zero. Therefore, finally, we have proposed an algorithm which\nautomatically restricts shrinkage to the orthogonal complement of the strongest eigendirections if\nappropriate. 
This leads to improved robustness and significant performance enhancement in simulations\nand on real world data from the domains of finance, spoken letter and optical character\nrecognition, and neuroscience.\n\nAcknowledgments\n\nThis work was supported in part by the World Class University Program through the National Research\nFoundation of Korea funded by the Ministry of Education, Science, and Technology, under\nGrant R31-10008. We thank Gilles Blanchard, Duncan Blythe, Thorsten Dickhaus, Irene Winkler\nand Anne Porbadnigk for valuable comments and discussions.\n\nReferences\n[1] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2008.\n[2] Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. 3rd Berkeley Sympos. Math. Statist. Probability, volume 1, pages 197\u2013206, 1956.\n[3] Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365\u2013411, 2004.\n[4] Jerome H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165\u2013175, 1989.\n[5] Juliane Sch\u00e4fer and Korbinian Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1):1175\u20131189, 2005.\n[6] Boaz Nadler. Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics, 36(6):2791\u20132817, 2008.\n[7] Harry Markowitz. Portfolio selection. Journal of Finance, VII(1):77\u201391, March 1952.\n[8] Daniel Bartz, Kerr Hatrick, Christian W. Hesse, Klaus-Robert M\u00fcller, and Steven Lemm. 
Directional Variance Adjustment: Bias reduction in covariance matrices based on factor analysis with an application to portfolio optimization. PLoS ONE, 8(7):e67503, July 2013.\n[9] Olivier Ledoit and Michael Wolf. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10:603\u2013621, 2003.\n[10] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550\u2013554, May 1994.\n[11] Mark A. Fanty and Ronald Cole. Spoken letter recognition. In Advances in Neural Information Processing Systems, volume 3, pages 220\u2013226, 1990.\n[12] Kevin Bache and Moshe Lichman. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2013.\n[13] Anne Kerstin Porbadnigk, Jan-Niklas Antons, Benjamin Blankertz, Matthias S. Treder, Robert Schleicher, Sebastian M\u00f6ller, and Gabriel Curio. Using ERPs for assessing the (sub)conscious perception of noise. In 32nd Annual Intl Conf. of the IEEE Engineering in Medicine and Biology Society, pages 2690\u20132693, 2010.\n[14] Anne Kerstin Porbadnigk, Matthias S. Treder, Benjamin Blankertz, Jan-Niklas Antons, Robert Schleicher, Sebastian M\u00f6ller, Gabriel Curio, and Klaus-Robert M\u00fcller. Single-trial analysis of the neural correlates of speech quality perception. Journal of Neural Engineering, 10(5):056003, 2013.\n", "award": [], "sourceid": 940, "authors": [{"given_name": "Daniel", "family_name": "Bartz", "institution": "TU Berlin"}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": "TU Berlin"}]}