{"title": "Minimax Estimation of Bandable Precision Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 4888, "page_last": 4896, "abstract": "The inverse covariance matrix provides considerable insight for understanding statistical models in the multivariate setting. In particular, when the distribution over variables is assumed to be multivariate normal, the sparsity pattern in the inverse covariance matrix, commonly referred to as the precision matrix, corresponds to the adjacency matrix representation of the Gauss-Markov graph, which encodes conditional independence statements between variables. Minimax results under the spectral norm have previously been established for covariance matrices, both sparse and banded, and for sparse precision matrices. We establish minimax estimation bounds for estimating banded precision matrices under the spectral norm. Our results greatly improve upon the existing bounds; in particular, we find that the minimax rate for estimating banded precision matrices matches that of estimating banded covariance matrices. The key insight in our analysis is that we are able to obtain barely-noisy estimates of $k \\times k$ subblocks of the precision matrix by inverting slightly wider blocks of the empirical covariance matrix along the diagonal. Our theoretical results are complemented by experiments demonstrating the sharpness of our bounds.", "full_text": "Minimax Estimation of Bandable Precision Matrices\n\nDepartment of Statistics and Data Science\n\nDepartment of Statistics and Data Science\n\nAddison J. Hu\u2217\n\nYale University\n\nNew Haven, CT 06520\naddison.hu@yale.edu\n\nSahand N. Negahban\n\nYale University\n\nNew Haven, CT 06520\n\nsahand.negahban@yale.edu\n\nAbstract\n\nThe inverse covariance matrix provides considerable insight for understanding\nstatistical models in the multivariate setting. 
In particular, when the distribution over\nvariables is assumed to be multivariate normal, the sparsity pattern in the inverse\ncovariance matrix, commonly referred to as the precision matrix, corresponds to\nthe adjacency matrix representation of the Gauss-Markov graph, which encodes\nconditional independence statements between variables. Minimax results under the\nspectral norm have previously been established for covariance matrices, both sparse\nand banded, and for sparse precision matrices. We establish minimax estimation\nbounds for estimating banded precision matrices under the spectral norm. Our\nresults greatly improve upon the existing bounds; in particular, we find that the\nminimax rate for estimating banded precision matrices matches that of estimating\nbanded covariance matrices. The key insight in our analysis is that we are able to\nobtain barely-noisy estimates of k\u00d7k subblocks of the precision matrix by inverting\nslightly wider blocks of the empirical covariance matrix along the diagonal. Our\ntheoretical results are complemented by experiments demonstrating the sharpness\nof our bounds.\n\n1 Introduction\n\nImposing structure is crucial to performing statistical estimation in the high-dimensional regime\nwhere the number of observations can be much smaller than the number of parameters. In estimating\ngraphical models, a long line of work has focused on understanding how to impose sparsity on the\nunderlying graph structure.\nSparse edge recovery is generally not easy for an arbitrary distribution. However, for Gaussian\ngraphical models, it is well-known that the graphical structure is encoded in the inverse of the\ncovariance matrix $\\Sigma^{-1} = \\Omega$, commonly referred to as the precision matrix [12, 14, 3]. Therefore,\naccurate recovery of the precision matrix is paramount to understanding the structure of the graphical\nmodel. 
As a consequence, a great deal of work has focused on sparse recovery of precision matrices\nunder the multivariate normal assumption [8, 4, 5, 17, 16]. Beyond revealing the graph structure, the\nprecision matrix also turns out to be highly useful in a variety of applications, including portfolio\noptimization, speech recognition, and genomics [12, 23, 18].\nAlthough there has been a rich literature exploring the sparse precision matrix setting for Gaussian\ngraphical models, less work has emphasized understanding the estimation of precision matrices\nunder additional structural assumptions, with some exceptions for block structured sparsity [10] or\nbandability [1]. One would hope that extra structure should allow us to obtain more statistically\nefficient solutions. In this work, we focus on the case of bandable precision matrices, which capture\na sense of locality between variables. Bandable matrices arise in a number of time-series contexts\nand have applications in climatology, spectroscopy, fMRI analysis, and astronomy [9, 20, 15]. For\nexample, in the time-series setting, we may assume that edges between variables $X_i$, $X_j$ are more\nlikely when i is temporally close to j, as is the case in an auto-regressive process. The precision and\ncovariance matrices corresponding to distributions with this property are referred to as bandable, or\ntapering. We will discuss the details of this model in the sequel.\n\n\u2217Addison graduated from Yale in May 2017. Up-to-date contact information may be found at http://huisaddison.com/.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nPast work: Previous work has explored the estimation of both bandable covariance and precision\nmatrices [6, 15]. Closely related work includes the estimation of sparse precision and covariance\nmatrices [3, 17, 4]. 
Asymptotically-normal entrywise precision estimates as well as minimax rates\nfor operator norm recovery of sparse precision matrices have also been established [16]. A line of\nwork developed concurrently to our own establishes a matching minimax lower bound [13].\nWhen considering an estimation technique, a powerful criterion for evaluating whether the technique\nperforms optimally in terms of convergence rate is minimaxity. Past work has established minimax\nrates of convergence for sparse covariance matrices, bandable covariance matrices, and sparse\nprecision matrices [7, 6, 4, 17].\nThe technique for estimating bandable covariance matrices proposed in [6] is shown to achieve the\noptimal rate of convergence. However, no such theoretical guarantees have been shown for the\nbandable precision estimator proposed in recent work for estimating sparse and smooth precision\nmatrices that arise from cosmological data [15].\nOf note is the fact that the minimax rate of convergence for estimating sparse covariance matrices\nmatches the minimax rate of convergence for estimating sparse precision matrices. In this paper,\nwe introduce an adaptive estimator and show that it achieves the optimal rate of convergence when\nestimating bandable precision matrices from the banded parameter space (3). We find, satisfyingly,\nthat, analogous to the sparse case, in which the minimax rate of convergence is the same for\nboth precision and covariance matrices, the minimax rate of convergence for estimating bandable\nprecision matrices matches the minimax rate of convergence for estimating bandable covariance\nmatrices that has been established in the literature [6].\n\nOur contributions: Our goal is to estimate a banded precision matrix based on n i.i.d. observations.\nWe consider a parameter space of precision matrices \u2126 with a power law decay structure nearly\nidentical to the bandable covariance matrices considered for covariance matrix estimation [6]. 
We\npresent a simple-to-implement algorithm for estimating the precision matrix. Furthermore, we show\nthat the algorithm is minimax optimal with respect to the spectral norm. The upper and lower bounds\ngiven in Section 3 together imply the following optimal rate of convergence for estimating bandable\nprecision matrices under the spectral norm. Informally, our results show the following bound for\nrecovering a banded precision matrix with bandwidth k.\nTheorem 1.1 (Informal). The minimax risk for estimating the precision matrix \u2126 over the class $\\mathcal{P}_\\alpha$\ngiven in (3) satisfies:\n\n$$\\inf_{\\hat{\\Omega}} \\sup_{\\mathcal{P}_\\alpha} \\mathbb{E} \\left\\| \\hat{\\Omega} - \\Omega \\right\\|^2 \\approx \\frac{k + \\log p}{n} \\qquad (1)$$\n\nwhere this bound is achieved by the tapering estimator $\\hat{\\Omega}_k$ as defined in Equation (7).\nAn important point to note, which is shown more precisely in the sequel, is that the rate of convergence\nas compared to sparse precision matrix recovery is improved by a factor of $\\min(k \\log p, k^2)$.\nWe establish a minimax upper bound by detailing an algorithm for obtaining an estimator given\nobservations $x_1, \\ldots, x_n$ and a pre-specified bandwidth k, and studying the resultant estimator\u2019s risk\nproperties under the spectral norm. We show that an estimator using our algorithm with the optimal\nchoice of bandwidth attains the minimax rate of convergence with high probability.\nTo establish the optimality of our estimation routine, we derive a minimax lower bound to show\nthat the rate of convergence cannot be improved beyond that of our estimator. The lower bound is\nestablished by constructing subparameter spaces of (3) and applying testing arguments through Le\nCam\u2019s method and Assouad\u2019s lemma [22, 6].\nTo supplement our analysis, we conduct numerical experiments to explore the performance of our\nestimator in the finite sample setting. 
The numerical experiments confirm that even in the finite\nsample case, our proposed estimator exhibits the minimax rate of convergence.\n\nThe remainder of the paper is organized as follows. In Section 2, we detail the exact model setting\nand introduce a blockwise inversion technique for precision matrix estimation. In Section 3, theorems\nestablishing the minimaxity of our estimator under the spectral norm are presented. An upper bound\non the estimator\u2019s risk is given in high probability with the help of a result from set packing. The\nminimax lower bound is derived by way of a testing argument. Both bounds are accompanied by\ntheir proofs. Finally, in Section 4, our estimator is subjected to numerical experiments. Formal proofs\nof the theorems may be found in the longer version of the paper [11].\n\nNotation: We will now collect notation that will be used throughout the remaining sections. Vectors\nwill be denoted as lower-case x while matrices are upper-case A. The spectral or operator norm of a\nmatrix is defined to be $\\|A\\| = \\sup_{\\|x\\|_2 = \\|y\\|_2 = 1} \\langle Ax, y \\rangle$, while the matrix $\\ell_1$ norm of a symmetric matrix\n$A \\in \\mathbb{R}^{m \\times m}$ is defined to be $\\|A\\|_1 = \\max_j \\sum_{i=1}^{m} |A_{ij}|$.\n\n2 Background and problem set-up\n\nIn this section we present details of our model and the estimation procedure. If one considers\nobservations of the form $x_1, \\ldots, x_n \\in \\mathbb{R}^p$ drawn from a distribution with precision matrix $\\Omega_{p \\times p}$ and\nzero mean, the goal then is to estimate the unknown matrix $\\Omega_{p \\times p}$ based on the observations $\\{x_i\\}_{i=1}^{n}$.\nGiven a random sample of p-variate observations x1, . . . 
, xn drawn from a multivariate distribution\nwith population covariance $\\Sigma = \\Sigma_{p \\times p}$, our procedure is based on a tapering estimator derived from\nblockwise estimates for estimating the precision matrix $\\Omega_{p \\times p} = \\Sigma^{-1}$.\nThe maximum likelihood estimator of \u03a3 is\n\n$$\\hat{\\Sigma} = (\\hat{\\sigma}_{ij})_{1 \\le i,j \\le p} = \\frac{1}{n} \\sum_{l=1}^{n} (x_l - \\bar{x})(x_l - \\bar{x})^\\top \\qquad (2)$$\n\nwhere $\\bar{x}$ is the empirical mean of the vectors $x_i$. We will construct estimators of the precision matrix\n$\\Omega = \\Sigma^{-1}$ by inverting blocks of $\\hat{\\Sigma}$ along the diagonal, and averaging over the resultant subblocks.\nThroughout this paper we adhere to the convention that $\\omega_{ij}$ refers to the ijth element in a matrix \u2126.\nConsider the parameter space $\\mathcal{F}_\\alpha$, with associated probability measure $\\mathcal{P}_\\alpha$, given by:\n\n$$\\mathcal{F}_\\alpha = \\mathcal{F}_\\alpha(M_0, M) = \\left\\{ \\Omega : \\max_j \\sum_i \\{ |\\omega_{ij}| : |i - j| \\ge k \\} \\le M k^{-\\alpha} \\text{ for all } k, \\, \\lambda_i(\\Omega) \\in [M_0^{-1}, M_0] \\right\\} \\qquad (3)$$\n\nwhere $\\lambda_i(\\Omega)$ denotes the ith eigenvalue of \u2126, with $\\lambda_i \\ge \\lambda_j$ for all i \u2264 j. We also constrain\n\u03b1 > 0, M > 0, $M_0$ > 0. Observe that this parameter space is nearly identical to that given in\nEquation (3) of [6]. We impose an additional assumption on the minimum eigenvalue of $\\Omega \\in \\mathcal{F}_\\alpha$,\nwhich is used in the technical arguments where the risk of estimating \u2126 under the spectral norm is\nbounded in terms of the error of estimating $\\Sigma = \\Omega^{-1}$.\nObserve that the parameter space intuitively dictates that the magnitude of the entries of \u2126 decays in\npower law as we move away from the\ndiagonal. 
As with the parameter space for bandable covariance\nmatrices given in [6], we may understand \u03b1 in (3) as a rate of decay for the precision entries $\\omega_{ij}$ as\nthey move away from the diagonal; it can also be understood in terms of the smoothness parameter in\nnonparametric estimation [19]. As will be discussed in Section 3, the optimal choice of k depends on\nboth n and the decay rate \u03b1.\n\n2.1 Estimation procedure\n\nWe now detail the algorithm for obtaining minimax estimates for bandable \u2126, which is also given as\npseudo-code\u00b2 in Algorithm 1.\n\n\u00b2 In the pseudo-code, we adhere to the NumPy convention (1) that arrays are zero-indexed, (2) that slicing an\narray arr with the operation arr[a:b] includes the element indexed at a and excludes the element indexed at\nb, and (3) that if b is greater than the length of the array, only elements up to the terminal element are included,\nwith no errors.\n\nThe algorithm is inspired by the tapering procedure introduced by Cai, Zhang, and Zhou [6] in the\ncase of covariance matrices, with modifications in order to estimate the precision matrix. Estimating\nthe precision matrix introduces new difficulties as we do not have direct access to the estimates of\nelements of the precision matrix. For a given integer k, 1 \u2264 k \u2264 p, we construct a tapering estimator\nas follows. First, we calculate the maximum likelihood estimator for the covariance, as given in\nEquation (2). 
Then, for all integers 1 \u2212 m \u2264 l \u2264 p and m \u2265 1, we define the matrices with square\nblocks of size at most 3m along the diagonal:\n\n$$\\hat{\\Sigma}^{(3m)}_{l-m} = \\left( \\hat{\\sigma}_{ij} \\mathbf{1}\\{ l - m \\le i < l + 2m, \\, l - m \\le j < l + 2m \\} \\right)_{p \\times p} \\qquad (4)$$\n\nFor each $\\hat{\\Sigma}^{(3m)}_{l-m}$, we replace the nonzero block with its inverse to obtain $\\breve{\\Omega}^{(3m)}_{l-m}$. For a given l, we\nrefer to the individual entries of this intermediate matrix as follows:\n\n$$\\breve{\\Omega}^{(3m)}_{l-m} = \\left( \\breve{\\omega}^{l}_{ij} \\mathbf{1}\\{ l - m \\le i < l + 2m, \\, l - m \\le j < l + 2m \\} \\right)_{p \\times p} \\qquad (5)$$\n\nFor each l, we then keep only the central m \u00d7 m subblock of $\\breve{\\Omega}^{(3m)}_{l-m}$ to obtain the blockwise estimate\n$\\hat{\\Omega}^{(m)}_{l}$:\n\n$$\\hat{\\Omega}^{(m)}_{l} = \\left( \\breve{\\omega}^{l}_{ij} \\mathbf{1}\\{ l \\le i < l + m, \\, l \\le j < l + m \\} \\right)_{p \\times p} \\qquad (6)$$\n\nNote that this notation allows for l < 0 and l + m > p; in each case, this out-of-bounds indexing\nallows us to cleanly handle corner cases where the subblocks are smaller than m \u00d7 m.\nFor a given bandwidth k (assume k is divisible by 2), we calculate these blockwise estimates for both\nm = k and m = k/2. Finally, we construct our estimator by averaging over the block matrices:\n\n$$\\hat{\\Omega}_k = \\frac{2}{k} \\cdot \\left( \\sum_{l=1-k}^{p} \\hat{\\Omega}^{(k)}_{l} - \\sum_{l=1-k/2}^{p} \\hat{\\Omega}^{(k/2)}_{l} \\right) \\qquad (7)$$\n\nWe note that within k/2 entries of the diagonal, each entry is effectively the sum of k/2 estimates, and as\nwe move from k/2 to k from the diagonal, each entry is progressively the sum of one fewer entry.\nTherefore, within k/2 of the diagonal, the entries are not tapered; and from k/2 to k of the diagonal, the\nentries are linearly tapered to zero. 
The analysis of this estimator makes careful use of this tapering\nschedule and the fact that our estimator is constructed through the average of block matrices of size\nat most k \u00d7 k.\n\n2.2 Implementation details\n\nThe naive algorithm performs O(p + k) inversions of square matrices with size at most 3k. This\nmethod can be sped up considerably through an application of the Woodbury matrix identity and\nthe Schur complement relation [21, 2]. Doing so reduces the computational complexity of the\nalgorithm from $O(pk^3)$ to $O(pk^2)$. We discuss the details of the modified algorithm and its computational\ncomplexity below.\nSuppose we have $\\breve{\\Omega}^{(3m)}_{l-m}$ and are interested in obtaining $\\breve{\\Omega}^{(3m)}_{l-m+1}$. We observe that the nonzero block\nof $\\breve{\\Omega}^{(3m)}_{l-m+1}$ corresponds to the inverse of the nonzero block of $\\hat{\\Sigma}^{(3m)}_{l-m+1}$, which only differs by one\nrow and one column from $\\hat{\\Sigma}^{(3m)}_{l-m}$, the matrix for which the inverse of the nonzero block corresponds\nto $\\breve{\\Omega}^{(3m)}_{l-m}$, which we have already computed. We may understand the movement from $\\hat{\\Sigma}^{(3m)}_{l-m}$, $\\breve{\\Omega}^{(3m)}_{l-m}$\nto $\\hat{\\Sigma}^{(3m)}_{l-m+1}$ (to which we already have direct access) and $\\breve{\\Omega}^{(3m)}_{l-m+1}$ as two rank-1 updates. Let us view\nthe nonzero blocks of $\\hat{\\Sigma}^{(3m)}_{l-m}$, $\\breve{\\Omega}^{(3m)}_{l-m}$ as the block matrices:\n\n$$\\mathrm{NonZero}(\\hat{\\Sigma}^{(3m)}_{l-m}) = \\begin{bmatrix} A \\in \\mathbb{R}^{1 \\times 1} & B \\in \\mathbb{R}^{1 \\times (3m-1)} \\\\ B^\\top \\in \\mathbb{R}^{(3m-1) \\times 1} & C \\in \\mathbb{R}^{(3m-1) \\times (3m-1)} \\end{bmatrix}$$\n\n$$\\mathrm{NonZero}(\\breve{\\Omega}^{(3m)}_{l-m}) = \\begin{bmatrix} \\tilde{A} \\in \\mathbb{R}^{1 \\times 1} & \\tilde{B} \\in \\mathbb{R}^{1 \\times (3m-1)} \\\\ \\tilde{B}^\\top \\in \\mathbb{R}^{(3m-1) \\times 1} & \\tilde{C} \\in \\mathbb{R}^{(3m-1) \\times (3m-1)} \\end{bmatrix}$$\n\nThe Schur complement relation tells us that given $\\hat{\\Sigma}^{(3m)}_{l-m}$, $\\breve{\\Omega}^{(3m)}_{l-m}$, we may trivially compute $C^{-1}$ as\nfollows:\n\n$$C^{-1} = \\left( \\tilde{C}^{-1} + B^\\top A^{-1} B \\right)^{-1} = \\tilde{C} - \\frac{\\tilde{C} B^\\top B \\tilde{C}}{A + B \\tilde{C} B^\\top} \\qquad (8)$$\n\nby the Woodbury matrix identity, which gives an efficient algorithm for computing the inverse of\na matrix subject to a low-rank (in this case, rank-1) perturbation. This allows us to move from the\ninverse of a matrix in $\\mathbb{R}^{3m \\times 3m}$ to the inverse of a matrix in $\\mathbb{R}^{(3m-1) \\times (3m-1)}$ where a row and column\nhave been removed. A nearly identical argument allows us to move from the $\\mathbb{R}^{(3m-1) \\times (3m-1)}$ matrix\nto an $\\mathbb{R}^{3m \\times 3m}$ matrix where a row and column have been appended, which gives us the desired block\nof $\\breve{\\Omega}^{(3m)}_{l-m+1}$.\nWith this modification to the algorithm, we need only compute the inverse of a square matrix of width\n2m at the beginning of the routine; thereafter, every subsequent block inverse may be computed\nthrough simple rank-one matrix updates.\n\nAlgorithm 1 Blockwise Inversion Technique\n\nfunction FITBLOCKWISE($\\hat{\\Sigma}$, k)\n    $\\hat{\\Omega}$ \u2190 $0_{p \\times p}$\n    for l \u2208 [1 \u2212 k, p) do\n        $\\hat{\\Omega}$ \u2190 $\\hat{\\Omega}$ + BLOCKINVERSE($\\hat{\\Sigma}$, k, l)\n    end for\n    for l \u2208 [1 \u2212 \u230ak/2\u230b, p) do\n        $\\hat{\\Omega}$ \u2190 $\\hat{\\Omega}$ \u2212 BLOCKINVERSE($\\hat{\\Sigma}$, \u230ak/2\u230b, l)\n    end for\n    return $\\hat{\\Omega}$\nend function\n\nfunction BLOCKINVERSE($\\hat{\\Sigma}$, m, l)\n    s \u2190 max{l \u2212 m, 0}\n    f \u2190 min{p, l + 2m}\n    M \u2190 ($\\hat{\\Sigma}$[s:f, s:f])$^{-1}$    \u25b7 Obtain 3m \u00d7 3m block inverse.\n    s \u2190 m + min{l \u2212 m, 0}\n    N \u2190 M[s:s+m, s:s+m]    \u25b7 Preserve central m \u00d7 m block of inverse.\n    P \u2190 $0_{p \\times p}$\n    s \u2190 max{l, 0}\n    f \u2190 min{l + m, p}\n    P[s:f, s:f] = N    \u25b7 Restore block inverse to appropriate indices.\n    return P\nend function\n\n2.3 Complexity details\n\nWe now detail the factor of k improvement in computational complexity provided through the\napplication of the Woodbury matrix identity and the Schur complement relation introduced in Section\n2.2. Recall that the naive implementation of Algorithm 1 involves O(p + k) inversions of square\nmatrices of size at most 3k, each of which costs $O(k^3)$. Therefore, the overall complexity of the naive\nalgorithm is $O(pk^3)$, as k < p.\nNow, consider the Woodbury-Schur-improved algorithm. The initial single inversion of a 2k \u00d7 2k\nmatrix costs $O(k^3)$. Thereafter, we perform O(p + k) updates of the form given in Equation (8).\nThese updates simply require vector-matrix operations. Therefore, the update complexity on each\niteration is $O(k^2)$. It follows that the overall complexity of the amended algorithm is $O(pk^2)$.\n\n3 Rate optimality under the spectral norm\n\nHere we present the results that establish the rate optimality of the above estimator under the spectral\nnorm. For symmetric matrices A, the spectral norm, which corresponds to the largest singular value\nof A, coincides with the $\\ell_2$-operator norm. 
We establish optimality by first deriving an upper bound\nin high probability using the blockwise inversion estimator defined in Section 2.1. We then give\na matching lower bound in expectation by carefully constructing two sets of multivariate normal\ndistributions and then applying Assouad\u2019s lemma and Le Cam\u2019s method.\n\n3.1 Upper bound under the spectral norm\n\nIn this section we derive a risk upper bound for the tapering estimator defined in (7) under the operator\nnorm. We assume the distribution of the $x_i$\u2019s is subgaussian; that is, there exists \u03c1 > 0 such that:\n\n$$\\mathbb{P}\\left\\{ |v^\\top (x_i - \\mathbb{E} x_i)| > t \\right\\} \\le e^{-t^2 \\rho / 2} \\qquad (9)$$\n\nfor all t > 0 and $\\|v\\|_2 = 1$. Let $\\mathcal{P}_\\alpha = \\mathcal{P}_\\alpha(M_0, M, \\rho)$ denote the set of distributions of $x_i$ that satisfy\n(3) and (9).\nTheorem 3.1. The tapering estimator $\\hat{\\Omega}_k$, defined in (7), of the precision matrix $\\Omega_{p \\times p}$ with $p > n^{\\frac{1}{2\\alpha+1}}$ satisfies:\n\n$$\\sup_{\\mathcal{P}_\\alpha} \\mathbb{P}\\left\\{ \\left\\| \\hat{\\Omega}_k - \\Omega \\right\\|^2 \\ge C \\frac{k + \\log p}{n} + C k^{-2\\alpha} \\right\\} = O\\left( p^{-15} \\right) \\qquad (10)$$\n\nwith k = o(n), log p = o(n), and a universal constant C > 0.\nIn particular, the estimator $\\hat{\\Omega} = \\hat{\\Omega}_k$ with $k = n^{\\frac{1}{2\\alpha+1}}$ satisfies:\n\n$$\\sup_{\\mathcal{P}_\\alpha} \\mathbb{P}\\left\\{ \\left\\| \\hat{\\Omega}_k - \\Omega \\right\\|^2 \\ge C n^{-\\frac{2\\alpha}{2\\alpha+1}} + C \\frac{\\log p}{n} \\right\\} = O\\left( p^{-15} \\right) \\qquad (11)$$\n\nGiven the result in Equation (10), it is easy to show that setting $k = n^{\\frac{1}{2\\alpha+1}}$ yields the optimal rate by\nbalancing the size of the inside-taper and outside-taper terms, which gives Equation (11).\nThe proof of this theorem, which is given in the supplementary material, relies on the fact that when\nwe invert a 3k \u00d7 3k block, the difference between the central k \u00d7 k block and the 
corresponding\nk \u00d7 k block which would have been obtained by inverting the full matrix has a negligible contribution\nto the risk. As a result, we are able to take concentration bounds on the operator norm of subgaussian\nmatrices, customarily used for bounding the norm of the difference of covariance matrices, and apply\nthem instead to differences of precision matrices to obtain our result.\nThe key insight is that we can relate the spectral norm of a k \u00d7 k subblock produced by our estimator\nto the spectral norm of the corresponding k \u00d7 k subblock of the covariance matrix, which allows us\nto apply concentration bounds from classical random matrix theory. Moreover, it turns out that if we\napply the tapering schedule induced by the construction of our estimator to the population parameter\n\u2126 \u2208 F\u03b1, we may express the tapered population \u2126 as a sum of block matrices in exactly the same\nway that our estimator is expressed as a sum of block matrices.\nIn particular, the tapering schedule is presented next. 
Suppose $\\Omega \\in \\mathcal{F}_\\alpha$ is a population precision matrix.\nThen, we denote the tapered version of \u2126 by $\\Omega_A$, and construct:\n\n$$\\Omega_A = (\\omega_{ij} \\cdot v_{ij})_{p \\times p}, \\qquad \\Omega_B = (\\omega_{ij} \\cdot (1 - v_{ij}))_{p \\times p}$$\n\nwhere the tapering coefficients, which decay linearly from one to zero as $|i - j|$ moves from k/2 to k, are given by:\n\n$$v_{ij} = \\begin{cases} 1 & \\text{for } |i - j| < \\frac{k}{2} \\\\ 2 - \\frac{|i - j|}{k/2} & \\text{for } \\frac{k}{2} \\le |i - j| < k \\\\ 0 & \\text{for } |i - j| \\ge k \\end{cases}$$\n\nWe then handle the risk of estimating the inside-taper $\\Omega_A$ and the risk of estimating the outside-taper\n$\\Omega_B$ separately.\nBecause our estimator and the population parameter are both averages over k \u00d7 k block matrices\nalong the diagonal, we may then take a union bound over the high probability bounds on the spectral\nnorm deviation for the k \u00d7 k subblocks to obtain a high probability bound on the risk of our estimator.\nWe refer the reader to the longer version of the paper for further details [11].\n\n3.2 Lower bound under the spectral norm\n\nIn Section 3.1, we established Theorem 3.1, which states that our estimator achieves the rate of\nconvergence $n^{-\\frac{2\\alpha}{2\\alpha+1}}$ under the spectral norm by using the optimal choice of $k = n^{\\frac{1}{2\\alpha+1}}$. Next we\ndemonstrate a matching lower bound, which implies that the upper bound established in Equation\n(11) is tight up to constant factors.\nSpecifically, for the estimation of precision matrices in the parameter space given by Equation (3),\nthe following minimax lower bound holds.\nTheorem 3.2. 
The minimax risk for estimating the precision matrix \u2126 over $\\mathcal{P}_\\alpha$ under the operator\nnorm satisfies:\n\n$$\\inf_{\\hat{\\Omega}} \\sup_{\\mathcal{P}_\\alpha} \\mathbb{E} \\left\\| \\hat{\\Omega} - \\Omega \\right\\|^2 \\ge c n^{-\\frac{2\\alpha}{2\\alpha+1}} + c \\frac{\\log p}{n} \\qquad (12)$$\n\nAs in many information theoretic lower bounds, we first identify a subset of our parameter space that\ncaptures most of the complexity of the full space. We then establish an information theoretic limit\non estimating parameters from this subspace, which yields a valid minimax lower bound over the\noriginal set.\nSpecifically, for our particular parameter space $\\mathcal{F}_\\alpha$, we identify two subparameter spaces, $\\mathcal{F}_{11}$ and $\\mathcal{F}_{12}$.\nThe first, $\\mathcal{F}_{11}$, is a collection of $2^k$ matrices with varying levels of density. To this collection, we\napply Assouad\u2019s lemma to obtain a lower bound with rate $n^{-\\frac{2\\alpha}{2\\alpha+1}}$. The second, $\\mathcal{F}_{12}$, is a collection of\ndiagonal matrices, to which we apply Le Cam\u2019s method to derive a lower bound with rate $\\frac{\\log p}{n}$.\nThe rate given in Theorem 3.2 is therefore a lower bound on the minimax rate for estimation over the union\n$\\mathcal{F}_1 = \\mathcal{F}_{11} \\cup \\mathcal{F}_{12} \\subset \\mathcal{F}_\\alpha$. The full details of the subparameter space construction and derivation of\nlower bounds may be found in the full-length version of the paper [11].\n\n4 Experimental results\n\nWe implemented the blockwise inversion technique in NumPy and ran simulations on synthetic\ndatasets. Our experiments confirm that even in the finite sample case, the blockwise inversion\ntechnique achieves the theoretical rates. In the experiments, we draw observations from a multivariate\nnormal distribution with precision parameter $\\Omega \\in \\mathcal{F}_\\alpha$, as defined in (3). 
Following [6], for given\nconstants \u03c1, \u03b1, p, we consider precision matrices $\\Omega = (\\omega_{ij})_{1 \\le i,j \\le p}$ of the form:\n\n$$\\omega_{ij} = \\begin{cases} 1 & \\text{for } 1 \\le i = j \\le p \\\\ \\rho |i - j|^{-\\alpha - 1} & \\text{for } 1 \\le i \\neq j \\le p \\end{cases} \\qquad (13)$$\n\nThough the precision matrices considered in our experiments are Toeplitz, our estimator does not\ntake advantage of this knowledge. We choose \u03c1 = 0.6 to ensure that the matrices generated are\nnon-negative definite.\nIn applying the tapering estimator as defined in (7), we choose the bandwidth to be $k = \\lfloor n^{\\frac{1}{2\\alpha+1}} \\rfloor$,\nwhich gives the optimal rate of convergence, as established in Theorem 3.1.\nIn our experiments, we varied \u03b1, n, and p. For our first set of experiments, we allowed \u03b1 to take\non values in {0.2, 0.3, 0.4, 0.5}, n to take values in {250, 500, 750, 1000}, and p to take values in\n{100, 200, 300, 400}. Each setting was run for five trials, and the averages are plotted with error\nbars to show variability between experiments. We observe in Figure 1a that the spectral norm error\nincreases linearly as log p increases, confirming the $\\frac{\\log p}{n}$ term in the rate of convergence.\nBuilding upon the experimental results from the first set of simulations, we provide an additional\nset of trials for the \u03b1 = 0.2, p = 400 case, with n \u2208 {11000, 3162, 1670}. These sample sizes were\nchosen so that in Figure 1b, there is overlap between the error plots for \u03b1 = 0.2 and the other \u03b1\nregimes\u00b3. As with Figure 1a, Figure 1b confirms the minimax rate of convergence given in Theorem\n3.1. Namely, we see that plotting the error with respect to $n^{-\\frac{2\\alpha}{2\\alpha+1}}$ results in linear plots with almost\nidentical slopes. We note that in both plots, there is a small difference in the behavior for the case\n\u03b1 = 0.2. This observation can be attributed to the fact that for such a slow decay of the precision\nmatrix bandwidth, we have a more subtle interplay between the bias and variance terms presented in\nthe theorems above.\n\n\u00b3 For the \u03b1 = 0.2, p = 400 case, we omit the settings where n \u2208 {250, 500, 750} from Figure 1b to\nimprove the clarity of the plot.\n\n(a) Spectral norm error as log p changes.\n\n(b) Mean spectral norm error as $n^{-\\frac{2\\alpha}{2\\alpha+1}}$ changes.\n\nFigure 1: Experimental results. Note that the plotted error grows linearly as a function of log p and\n$n^{-\\frac{2\\alpha}{2\\alpha+1}}$, respectively, matching the theoretical results; however, the linear relationship is less clear in\nthe \u03b1 = 0.2 case, due to the subtle interplay of the error terms.\n\n5 Discussion\n\nIn this paper we have presented minimax upper and lower bounds for estimating banded precision\nmatrices after observing n samples drawn from a p-dimensional subgaussian distribution. Furthermore,\nwe have provided a computationally efficient algorithm that achieves the optimal rate of convergence\nfor estimating a banded precision matrix under the operator norm. Theorems 3.1 and 3.2 together\nestablish that the minimax rate of convergence for estimating precision matrices over the parameter\nspace $\\mathcal{F}_\\alpha$ given in Equation (3) is $n^{-\\frac{2\\alpha}{2\\alpha+1}} + \\frac{\\log p}{n}$, where \u03b1 dictates the bandwidth of the precision\nmatrix.\nThe rate achieved in this setting parallels the results established for estimating a bandable covariance\nmatrix [6]. As in that result, we observe that different regimes dictate which term dominates in the\nrate of convergence. 
In the setting where log p is of a lower order than $n^{\\frac{1}{2\\alpha+1}}$, the $n^{-\\frac{2\\alpha}{2\\alpha+1}}$ term\ndominates, and the rate of convergence is determined by the smoothness parameter \u03b1. However, when\nlog p is much larger than $n^{\\frac{1}{2\\alpha+1}}$, p has a much greater influence on the minimax rate of convergence.\nOverall, we have shown the performance gains that may be obtained through added structural\nconstraints. An interesting line of future work will be to explore algorithms that uniformly exhibit\na smooth transition between fully banded models and sparse models on the precision matrix. Such\nmethods could adapt to the structure and allow for mixtures between banded and sparse precision\nmatrices. Another interesting direction would be in understanding how dependencies between the n\nobservations will influence the error rate of the estimator.\nFinally, the results presented here apply to the case of subgaussian random variables. Unfortunately,\nmoving away from the Gaussian setting in general breaks the connection between precision matrices\nand graph structure. Hence, a fruitful line of work will be to also develop methods that can be applied\nto estimating the banded graphical model structure with general exponential family observations.\n\nAcknowledgements\n\nWe would like to thank Harry Zhou for stimulating discussions regarding matrix estimation problems.\nSN acknowledges funding from NSF Grant DMS 1723128.\n\n[Figure 1 panels: (a) spectral norm error against log(p), setting n = 1000; (b) mean spectral norm error against $n^{-\\frac{2\\alpha}{2\\alpha+1}}$, setting p = 400; curves for \u03b1 = 0.2, 0.3, 0.4, 0.5.]\n\nReferences\n[1] P. J. Bickel and Y. R. Gel. Banded regularization of autocovariance matrices in application to parameter\nestimation and forecasting of time series. 
Journal of the Royal Statistical Society: Series B (Statistical\nMethodology), 73(5):711\u2013728, 2011.\n\n[2] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, UK, 2004.\n\n[3] T. T. Cai, W. Liu, and X. Luo. A Constrained L1 Minimization Approach to Sparse Precision Matrix\nEstimation. arXiv:1102.2233 [stat], February 2011.\n\n[4] T. T. Cai, W. Liu, and H. H. Zhou. Estimating sparse precision matrix: Optimal rates of convergence and\nadaptive estimation. The Annals of Statistics, 44(2):455\u2013488, April 2016.\n\n[5] T. T. Cai, Z. Ren, and H. H. Zhou. Estimating structured high-dimensional covariance and precision\nmatrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics, 10(1):1\u201359, 2016.\n\n[6] T. T. Cai, C.-H. Zhang, and H. H. Zhou. Optimal rates of convergence for covariance matrix estimation.\nThe Annals of Statistics, 38(4):2118\u20132144, August 2010.\n\n[7] T. T. Cai and H. H. Zhou. Optimal rates of convergence for sparse covariance matrix estimation.\nThe Annals of Statistics, 40(5):2389\u20132420, October 2012.\n\n[8] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso.\nBiostatistics, 2007.\n\n[9] K. J. Friston, P. Jezzard, and R. Turner. Analysis of functional MRI time-series. Human Brain Mapping,\n1(2):153\u2013171, 1994.\n\n[10] M. J. Hosseini and S.-I. Lee. Learning sparse Gaussian graphical models with overlapping blocks. In\nAdvances in Neural Information Processing Systems, pages 3808\u20133816, 2016.\n\n[11] A. J. Hu and S. N. Negahban. Minimax Estimation of Bandable Precision Matrices.\narXiv:1710.07006v1, 2017.\n\n[12] S. L. Lauritzen. Graphical Models. Oxford Statistical Science Series. Clarendon Press, Oxford, 1996.\n\n[13] K. Lee and J. Lee. Estimating Large Precision Matrices via Modified Cholesky Decomposition.\narXiv:1707.01143 [stat], July 2017.\n\n[14] N. 
Meinshausen and P. B\u00fchlmann. High-dimensional graphs and variable selection with the Lasso. The Annals\nof Statistics, 34:1436\u20131462, 2006.\n\n[15] N. Padmanabhan, M. White, H. H. Zhou, and R. O\u2019Connell. Estimating sparse precision matrices. Monthly\nNotices of the Royal Astronomical Society, 460(2):1567\u20131576, 2016.\n\n[16] Z. Ren, T. Sun, C.-H. Zhang, and H. H. Zhou. Asymptotic normality and optimalities in estimation of large\nGaussian graphical models. The Annals of Statistics, 43(3):991\u20131026, June 2015.\n\n[17] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation.\nElectronic Journal of Statistics, 2:494\u2013515, 2008.\n\n[18] G. Saon and J. T. Chien. Bayesian sensing hidden Markov models for speech recognition. In 2011 IEEE\nInternational Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5056\u20135059, May\n2011.\n\n[19] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated,\n1st edition, 2008.\n\n[20] H. Visser and J. Molenaar. Trend estimation and regression analysis in climatological time series: an\napplication of structural time series models and the Kalman filter. Journal of Climate, 8(5):969\u2013979, 1995.\n\n[21] M. A. Woodbury. Inverting modified matrices. Statistical Research Group, Memo. Rep. no. 42. Princeton\nUniversity, Princeton, N. J., 1950.\n\n[22] B. Yu. Assouad, Fano and Le Cam. In Festschrift for Lucien Le Cam, pages 423\u2013435. Springer-Verlag,\nBerlin, 1997.\n\n[23] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika,\n94(1):19\u201335, 2007.\n", "award": [], "sourceid": 2520, "authors": [{"given_name": "Addison", "family_name": "Hu", "institution": "Yale University"}, {"given_name": "Sahand", "family_name": "Negahban", "institution": "Yale University"}]}