{"title": "Non-convex Statistical Optimization for Sparse Tensor Graphical Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1081, "page_last": 1089, "abstract": "We consider the estimation of sparse graphical models that characterize the dependency structure of high-dimensional tensor-valued data. To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. The penalized maximum likelihood estimation of this model involves minimizing a non-convex objective function. In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with the optimal statistical rate of convergence as well as consistent graph recovery. Notably, such an estimator achieves estimation consistency with only one tensor sample, a guarantee not established in previous work. Our theoretical results are backed by thorough numerical studies.", "full_text": "Non-convex Statistical Optimization for Sparse Tensor Graphical Model

Wei Sun, Yahoo Labs, Sunnyvale, CA (sunweisurrey@yahoo-inc.com)
Han Liu, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ (hanliu@princeton.edu)
Zhaoran Wang, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ (zhaoran@princeton.edu)
Guang Cheng, Department of Statistics, Purdue University, West Lafayette, IN (chengg@stat.purdue.edu)

Abstract

We consider the estimation of sparse graphical models that characterize the dependency structure of high-dimensional tensor-valued data.
To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. The penalized maximum likelihood estimation of this model involves minimizing a non-convex objective function. In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with the optimal statistical rate of convergence as well as consistent graph recovery. Notably, such an estimator achieves estimation consistency with only one tensor sample, a guarantee not established in previous work. Our theoretical results are backed by thorough numerical studies.

1 Introduction

High-dimensional tensor-valued data are prevalent in many fields such as personalized recommendation systems and brain imaging research [1, 2]. Traditional recommendation systems are mainly based on the user-item matrix, whose entries denote each user's preference for a particular item. To incorporate additional information into the analysis, such as the temporal behavior of users, we need to consider a user-item-time tensor. For another example, functional magnetic resonance imaging (fMRI) data can be viewed as a three-way (third-order) tensor, since it contains brain measurements taken at different locations over time under various experimental conditions. Also, in the microarray study of aging [3], thousands of gene expression measurements are recorded on 16 tissue types of 40 mice with varying ages, which forms a four-way gene-tissue-mouse-age tensor.

In this paper, we study the estimation of conditional independence structure within tensor data. For example, in the microarray study of aging we are interested in the dependency structure across different genes, tissues, ages and even mice.
Assuming the data are drawn from a tensor normal distribution, a straightforward way to estimate this structure is to vectorize the tensor and estimate the underlying Gaussian graphical model associated with the vector. Such an approach ignores the tensor structure and requires estimating a rather high-dimensional precision matrix with insufficient sample size. For instance, in the aforementioned fMRI application the sample size is one if we aim to estimate the dependency structure across different locations, times and experimental conditions. To address this problem, a popular approach is to assume that the covariance matrix of the tensor normal distribution is separable, in the sense that it is the Kronecker product of small covariance matrices, each of which corresponds to one way of the tensor. Under this assumption, our goal is to estimate the precision matrix corresponding to each way of the tensor. See §1.1 for a detailed survey of previous work.

Although the assumption of a Kronecker product structure of the covariance makes the statistical model much more parsimonious, it poses significant challenges. In particular, the penalized negative log-likelihood function is non-convex with respect to the unknown sparse precision matrices. Consequently, there exists a gap between computational and statistical theory. More specifically, as we will show in §1.1, the existing literature mostly focuses on establishing the existence of a local optimum that has the desired statistical guarantees, rather than offering efficient algorithmic procedures that provably achieve the desired local optima. In contrast, we analyze an alternating minimization algorithm which iteratively minimizes the non-convex objective function with respect to each individual precision matrix while fixing the others. The established theoretical guarantees of the proposed algorithm are as follows.
Suppose that we have $n$ observations from a $K$-th order tensor normal distribution. We denote by $m_k$, $s_k$, $d_k$ ($k = 1, \ldots, K$) the dimension, the sparsity, and the maximum number of non-zero entries in each row of the precision matrix corresponding to the $k$-th way of the tensor, and we define $m = \prod_{k=1}^K m_k$. The $k$-th precision matrix estimator from our alternating minimization algorithm achieves a $\sqrt{m_k(m_k+s_k)\log m_k/(nm)}$ statistical rate of convergence in Frobenius norm, which is minimax-optimal since this is the best rate one can obtain even when the remaining $K-1$ true precision matrices are known [4]. Furthermore, under an extra irrepresentability condition, we establish a $\sqrt{m_k\log m_k/(nm)}$ rate of convergence in max norm, which is also optimal, and a $d_k\sqrt{m_k\log m_k/(nm)}$ rate of convergence in spectral norm. These estimation consistency results, together with a sufficiently large signal strength condition, further imply the model selection consistency of recovering all the edges. A notable implication of these results is that, when $K \ge 3$, our alternating minimization algorithm can achieve estimation consistency in Frobenius norm even if we only have access to one tensor sample, which is often the case in practice. This phenomenon was not observed in previous work. Finally, we conduct extensive experiments to evaluate the numerical performance of the proposed alternating minimization method. Under the guidance of theory, we propose a way to significantly accelerate the algorithm without sacrificing statistical accuracy.

1.1 Related work and our contribution

A special case of our sparse tensor graphical model with $K = 2$ is the sparse matrix graphical model, which is studied by [5-8]. In particular, [5] and [6] only establish the existence of a local optimum with the desired statistical guarantees. Meanwhile, [7] considers an algorithm that is similar to ours.
However, the statistical rates of convergence obtained by [6, 7] are much slower than ours when $K = 2$; see Remark 3.6 in §3.1 for a detailed comparison. For $K = 2$, our statistical rate of convergence in Frobenius norm recovers the result of [5]. In other words, our theory confirms that the desired local optimum studied by [5] not only exists, but is also attainable by an efficient algorithm. In addition, for the matrix graphical model, [8] establishes statistical rates of convergence in spectral and Frobenius norms for the estimator attained by a similar algorithm. Their results achieve estimation consistency in spectral norm with only one matrix observation. However, their rate is slower than ours with $K = 2$; see Remark 3.11 in §3.2 for a detailed discussion. Furthermore, we allow $K$ to increase and establish estimation consistency, even in Frobenius norm, for $n = 1$. Most importantly, all these results focus on the matrix graphical model and cannot handle the aforementioned motivating applications such as the gene-tissue-mouse-age tensor dataset.

In the context of the sparse tensor graphical model with a general $K$, [9] shows the existence of a local optimum with the desired rates, but does not prove whether there exists an efficient algorithm that provably attains such a local optimum. In contrast, we prove that our alternating minimization algorithm achieves an estimator with the desired statistical rates. To achieve this, we apply a novel theoretical framework that treats the population and sample optimizers separately, and then establish one-step convergence for the population optimizer (Theorem 3.1) and the optimal rate of convergence for the sample optimizer (Theorem 3.4). A new concentration result (Lemma B.1) is developed for this purpose, which is also of independent interest.
Moreover, we establish additional theoretical guarantees, including the optimal rate of convergence in max norm, estimation consistency in spectral norm, and graph recovery consistency of the proposed sparse precision matrix estimator.

In addition to the literature on graphical models, our work is also closely related to a recent line of research on alternating minimization for non-convex optimization problems [10-13]. These existing results mostly focus on problems such as dictionary learning, phase retrieval and matrix decomposition; hence, our statistical model and analysis are completely different from theirs. Our paper is also related to a recent line of work on tensor decomposition; see, e.g., [14-17] and the references therein. Compared with them, our work focuses on the graphical model structure within tensor-valued data.

Notation: For a matrix $A = (A_{i,j}) \in \mathbb{R}^{d\times d}$, we denote by $\|A\|_\infty$, $\|A\|_2$, $\|A\|_F$ its max, spectral, and Frobenius norms, respectively. We define $\|A\|_{1,\text{off}} := \sum_{i\neq j}|A_{i,j}|$ as its off-diagonal $\ell_1$ norm and $|||A|||_\infty := \max_i \sum_j |A_{i,j}|$ as its maximum absolute row sum. Denote by $\text{vec}(A)$ the vectorization of $A$, which stacks the columns of $A$, and by $\text{tr}(A)$ the trace of $A$. For an index set $S \subseteq \{(i,j): i,j \in \{1,\ldots,d\}\}$, we define $[A]_S$ as the matrix whose entry indexed by $(i,j) \in S$ equals $A_{i,j}$, and is zero otherwise. We denote by $1_d$ the $d \times d$ identity matrix. Throughout this paper, we use $C, C_1, C_2, \ldots$ to denote generic absolute constants, whose values may vary from line to line.

2 Sparse tensor graphical model

2.1 Preliminary

We employ the tensor notation used by [18]. Throughout this paper, higher order tensors are denoted by boldface Euler script letters, e.g., $\mathcal{T}$. We consider a $K$-th order tensor $\mathcal{T} \in \mathbb{R}^{m_1\times m_2\times\cdots\times m_K}$. When $K = 1$ it reduces to a vector, and when $K = 2$ it reduces to a matrix. The $(i_1, \ldots
, i_K)$-th element of the tensor $\mathcal{T}$ is denoted by $\mathcal{T}_{i_1,\ldots,i_K}$. We define the vectorization of $\mathcal{T}$ as $\text{vec}(\mathcal{T}) := (\mathcal{T}_{1,1,\ldots,1}, \ldots, \mathcal{T}_{m_1,1,\ldots,1}, \ldots, \mathcal{T}_{1,m_2,\ldots,m_K}, \ldots, \mathcal{T}_{m_1,m_2,\ldots,m_K})^\top \in \mathbb{R}^m$ with $m = \prod_k m_k$. In addition, we define the Frobenius norm of a tensor $\mathcal{T}$ as $\|\mathcal{T}\|_F := (\sum_{i_1,\ldots,i_K} \mathcal{T}^2_{i_1,\ldots,i_K})^{1/2}$.

For tensors, a fiber refers to the higher order analogue of the rows and columns of matrices. A fiber is obtained by fixing all but one of the indices of the tensor; e.g., a mode-$k$ fiber of $\mathcal{T}$ is given by $\mathcal{T}_{i_1,\ldots,i_{k-1},:,i_{k+1},\ldots,i_K}$. Matricization, also known as unfolding, is the process of transforming a tensor into a matrix. We denote by $\mathcal{T}_{(k)}$ the mode-$k$ matricization of a tensor $\mathcal{T}$, which arranges the mode-$k$ fibers as the columns of the resulting matrix. Another useful operation is the $k$-mode product: the $k$-mode product of a tensor $\mathcal{T} \in \mathbb{R}^{m_1\times m_2\times\cdots\times m_K}$ with a matrix $A \in \mathbb{R}^{J\times m_k}$ is denoted by $\mathcal{T} \times_k A$, has size $m_1 \times\cdots\times m_{k-1} \times J \times m_{k+1} \times\cdots\times m_K$, and its entries are defined as $(\mathcal{T} \times_k A)_{i_1,\ldots,i_{k-1},j,i_{k+1},\ldots,i_K} := \sum_{i_k=1}^{m_k} \mathcal{T}_{i_1,\ldots,i_K} A_{j,i_k}$. In addition, for a list of matrices $\{A_1, \ldots, A_K\}$ with $A_k \in \mathbb{R}^{m_k\times m_k}$, $k = 1, \ldots, K$, we define $\mathcal{T} \times \{A_1, \ldots, A_K\} := \mathcal{T} \times_1 A_1 \times_2 \cdots \times_K A_K$.

2.2 Model

A tensor $\mathcal{T} \in \mathbb{R}^{m_1\times m_2\times\cdots\times m_K}$ follows the tensor normal distribution with zero mean and covariance matrices $\Sigma_1, \ldots, \Sigma_K$, denoted $\mathcal{T} \sim \text{TN}(0; \Sigma_1, \ldots, \Sigma_K)$, if its probability density function is

  $p(\mathcal{T}\,|\,\Sigma_1, \ldots, \Sigma_K) = (2\pi)^{-m/2} \Big\{\prod_{k=1}^K |\Sigma_k|^{-m/(2m_k)}\Big\} \exp\big(-\|\mathcal{T} \times \Sigma^{-1/2}\|_F^2/2\big),$   (2.1)

where $m = \prod_{k=1}^K m_k$ and $\Sigma^{-1/2} := \{\Sigma_1^{-1/2}, \ldots, \Sigma_K^{-1/2}\}$.
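As a concrete illustration, the mode-$k$ matricization and the $k$-mode product defined in §2.1 can be sketched in numpy as follows. This is our own minimal sketch under one common ordering convention, not the authors' code; the exact column ordering of the unfolding is a convention and only needs to be used consistently.

```python
import numpy as np

def unfold(T, k):
    """Mode-k matricization T_(k): the mode-k fibers of T become columns."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def mode_k_product(T, A, k):
    """k-mode product T x_k A for A of shape (J, m_k): multiply every
    mode-k fiber of T by A."""
    out = A @ unfold(T, k)                      # shape (J, m / m_k)
    rest = [T.shape[i] for i in range(T.ndim) if i != k]
    return np.moveaxis(out.reshape([A.shape[0]] + rest), 0, k)
```

With these two primitives, the product $\mathcal{T} \times \{A_1, \ldots, A_K\}$ is simply a loop of `mode_k_product` calls over $k = 1, \ldots, K$.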
When $K = 1$, this tensor normal distribution reduces to the vector normal distribution with zero mean and covariance $\Sigma_1$. According to [9, 18], $\mathcal{T} \sim \text{TN}(0; \Sigma_1, \ldots, \Sigma_K)$ if and only if $\text{vec}(\mathcal{T}) \sim \text{N}(\text{vec}(0); \Sigma_K \otimes \cdots \otimes \Sigma_1)$, where $\text{vec}(0) \in \mathbb{R}^m$ and $\otimes$ is the matrix Kronecker product.

We consider parameter estimation for the tensor normal model. Assume that we observe independently and identically distributed tensor samples $\mathcal{T}_1, \ldots, \mathcal{T}_n$ from $\text{TN}(0; \Sigma_1^*, \ldots, \Sigma_K^*)$. We aim to estimate the true covariance matrices $(\Sigma_1^*, \ldots, \Sigma_K^*)$ and their corresponding true precision matrices $(\Omega_1^*, \ldots, \Omega_K^*)$, where $\Omega_k^* = (\Sigma_k^*)^{-1}$ ($k = 1, \ldots, K$). To address the identifiability issue in the parameterization of the tensor normal distribution, we assume that $\|\Omega_k^*\|_F = 1$ for $k = 1, \ldots, K$. This renormalization does not change the graph structure of the original precision matrices.

A standard approach to estimating $\Omega_k^*$, $k = 1, \ldots, K$, is the maximum likelihood method via (2.1). Up to a constant, the negative log-likelihood of the tensor normal distribution is $\frac{1}{m}\text{tr}[S(\Omega_K \otimes \cdots \otimes \Omega_1)] - \sum_{k=1}^K \frac{1}{m_k}\log|\Omega_k|$, where $S := \frac{1}{n}\sum_{i=1}^n \text{vec}(\mathcal{T}_i)\text{vec}(\mathcal{T}_i)^\top$. To encourage the sparsity of each precision matrix in the high-dimensional scenario, we consider a penalized log-likelihood estimator, which is obtained by minimizing

  $q_n(\Omega_1, \ldots, \Omega_K) := \frac{1}{m}\text{tr}[S(\Omega_K \otimes \cdots \otimes \Omega_1)] - \sum_{k=1}^K \frac{1}{m_k}\log|\Omega_k| + \sum_{k=1}^K P_{\lambda_k}(\Omega_k),$   (2.2)

where $P_{\lambda_k}(\cdot)$ is a penalty function indexed by the tuning parameter $\lambda_k$. In this paper, we focus on the lasso penalty [19], i.e., $P_{\lambda_k}(\Omega_k) = \lambda_k\|\Omega_k\|_{1,\text{off}}$.
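To make the objective concrete, here is a minimal numpy sketch that evaluates $q_n$ in (2.2) for given precision matrices. It is our own illustration, not the authors' code, and assumes the column-major vec convention used in the text (first index varying fastest).

```python
import numpy as np
from functools import reduce

def qn(samples, Omegas, lams):
    """Evaluate the penalized negative log-likelihood (2.2), up to constants.

    samples: array of shape (n, m_1, ..., m_K); Omegas: list of (m_k, m_k)
    precision matrices; lams: list of tuning parameters lambda_k.
    """
    n = samples.shape[0]
    dims = samples.shape[1:]
    m = int(np.prod(dims))
    # vec() with the first index varying fastest, so that
    # vec(T) ~ N(0, Sigma_K kron ... kron Sigma_1) as in the text
    vecs = np.stack([Ti.flatten(order='F') for Ti in samples])
    S = vecs.T @ vecs / n                            # sample covariance of vec(T)
    Omega_kron = reduce(np.kron, reversed(Omegas))   # Omega_K kron ... kron Omega_1
    val = np.trace(S @ Omega_kron) / m
    for mk, Om, lam in zip(dims, Omegas, lams):
        val -= np.linalg.slogdet(Om)[1] / mk         # -(1/m_k) log|Omega_k|
        val += lam * (np.abs(Om).sum() - np.abs(np.diag(Om)).sum())  # off-diag lasso
    return val
```

Jointly minimizing this function over all $\Omega_k$ is non-convex, which is exactly why the alternating scheme of §2.3 is used.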
This estimation procedure applies similarly to a broad family of other penalty functions.

We call the penalized model from (2.2) the sparse tensor graphical model. It reduces to the sparse vector graphical model [20, 21] when $K = 1$, and to the sparse matrix graphical model [5-8] when $K = 2$. Our framework generalizes them to capture the graphical structure of higher order tensor-valued data.

2.3 Estimation

This section introduces the estimation procedure for the sparse tensor graphical model. A computationally efficient algorithm is provided to estimate the precision matrix for each way of the tensor.

Recall that in (2.2), $q_n(\Omega_1, \ldots, \Omega_K)$ is jointly non-convex with respect to $\Omega_1, \ldots, \Omega_K$. Nevertheless, it is bi-convex, since $q_n(\Omega_1, \ldots, \Omega_K)$ is convex in $\Omega_k$ when the remaining $K-1$ precision matrices are fixed. This bi-convexity plays a critical role in our algorithm construction and its theoretical analysis in §3.

Exploiting the bi-convexity, we propose to solve this non-convex problem by alternately updating one precision matrix while fixing the others. Note that, for any $k = 1, \ldots, K$, minimizing (2.2) with respect to $\Omega_k$ while fixing the remaining $K-1$ precision matrices is equivalent to minimizing

  $L(\Omega_k) := \frac{1}{m_k}\text{tr}(S_k\Omega_k) - \frac{1}{m_k}\log|\Omega_k| + \lambda_k\|\Omega_k\|_{1,\text{off}}.$   (2.3)

Here $S_k := \frac{m_k}{nm}\sum_{i=1}^n V_i^k (V_i^k)^\top$, where $V_i^k := [\mathcal{T}_i \times \{\Omega_1^{1/2}, \ldots, \Omega_{k-1}^{1/2}, 1_{m_k}, \Omega_{k+1}^{1/2}, \ldots, \Omega_K^{1/2}\}]_{(k)}$, with $\times$ the tensor product operation and $[\cdot]_{(k)}$ the mode-$k$ matricization operation defined in §2.1. The expression in (2.3) can be shown by noting that $V_i^k = [\mathcal{T}_i]_{(k)}(\Omega_K^{1/2} \otimes\cdots\otimes \Omega_{k+1}^{1/2} \otimes \Omega_{k-1}^{1/2} \otimes\cdots\otimes \Omega_1^{1/2})^\top$, according to the properties of mode-$k$ matricization shown by [18]. Hereafter, we drop the superscript $k$ of $V_i^k$ when there is no confusion.
Note that minimizing (2.3) corresponds to estimating a vector-valued Gaussian graphical model and can be solved efficiently via the glasso algorithm [21].

Algorithm 1: Solve sparse tensor graphical model via Tensor lasso (Tlasso)
1: Input: tensor samples $\mathcal{T}_1, \ldots, \mathcal{T}_n$, tuning parameters $\lambda_1, \ldots, \lambda_K$, maximum number of iterations $T$.
2: Initialize $\Omega_1^{(0)}, \ldots, \Omega_K^{(0)}$ randomly as symmetric and positive definite matrices and set $t = 0$.
3: Repeat:
4:   $t = t + 1$.
5:   For $k = 1, \ldots, K$:
6:     Given $\Omega_1^{(t)}, \ldots, \Omega_{k-1}^{(t)}, \Omega_{k+1}^{(t-1)}, \ldots, \Omega_K^{(t-1)}$, solve (2.3) for $\Omega_k^{(t)}$ via glasso [21].
7:     Normalize $\Omega_k^{(t)}$ such that $\|\Omega_k^{(t)}\|_F = 1$.
8:   End For
9: Until $t = T$.
10: Output: $\widehat{\Omega}_k = \Omega_k^{(T)}$ ($k = 1, \ldots, K$).

The details of our Tensor lasso (Tlasso) algorithm are shown in Algorithm 1. It starts from a random initialization and then alternately updates each precision matrix until convergence. In §3, we will show that the statistical properties of the obtained estimator are insensitive to the choice of initialization (see the discussion following Theorem 3.5).

3 Theory of statistical optimization

We first prove the estimation errors in Frobenius norm, max norm, and spectral norm, and then provide the model selection consistency of our Tlasso estimator. We defer all the proofs to the appendix.

3.1 Estimation error in Frobenius norm

Based on the penalized log-likelihood in (2.2), we define the population log-likelihood function as $q(\Omega_1, \ldots
, \Omega_K) := \frac{1}{m}\mathbb{E}\,\text{tr}[\text{vec}(\mathcal{T})\text{vec}(\mathcal{T})^\top(\Omega_K \otimes\cdots\otimes \Omega_1)] - \sum_{k=1}^K \frac{1}{m_k}\log|\Omega_k|.$   (3.1)

By minimizing $q(\Omega_1, \ldots, \Omega_K)$ with respect to $\Omega_k$, $k = 1, \ldots, K$, we obtain the population minimization function with parameter $\Omega_{[K]\setminus k} := \{\Omega_1, \ldots, \Omega_{k-1}, \Omega_{k+1}, \ldots, \Omega_K\}$, i.e.,

  $M_k(\Omega_{[K]\setminus k}) := \operatorname{argmin}_{\Omega_k} q(\Omega_1, \ldots, \Omega_K).$   (3.2)

Theorem 3.1. For any $k = 1, \ldots, K$, if $\Omega_j$ ($j \neq k$) satisfies $\text{tr}(\Sigma_j^*\Omega_j) \neq 0$, then the population minimization function in (3.2) satisfies $M_k(\Omega_{[K]\setminus k}) = m\big[m_k\prod_{j\neq k}\text{tr}(\Sigma_j^*\Omega_j)\big]^{-1}\Omega_k^*$.

Theorem 3.1 reveals a surprising phenomenon: the population minimization function recovers the true precision matrix, up to a constant, in only one iteration. If $\Omega_j = \Omega_j^*$, $j \neq k$, then $M_k(\Omega_{[K]\setminus k}) = \Omega_k^*$. Otherwise, after a normalization such that $\|M_k(\Omega_{[K]\setminus k})\|_F = 1$, the normalized population minimization function still fully recovers $\Omega_k^*$. This observation suggests that setting $T = 1$ in Algorithm 1 is sufficient, a suggestion further supported by our numerical results.

In practice, when (3.1) is unknown, we approximate it via its sample version $q_n(\Omega_1, \ldots, \Omega_K)$ defined in (2.2), which gives rise to the statistical error of the estimation procedure. Analogously to (3.2), we define the sample-based minimization function with parameter $\Omega_{[K]\setminus k}$ as

  $\widehat{M}_k(\Omega_{[K]\setminus k}) := \operatorname{argmin}_{\Omega_k} q_n(\Omega_1, \ldots, \Omega_K).$   (3.3)

In order to prove the estimation error, it remains to quantify the statistical error induced by finite samples. The following two regularity conditions are assumed for this purpose.

Condition 3.2 (Bounded Eigenvalues). For any $k = 1, \ldots
, K$, there is a constant $C_1 > 0$ such that

  $0 < C_1 \le \lambda_{\min}(\Sigma_k^*) \le \lambda_{\max}(\Sigma_k^*) \le 1/C_1 < \infty,$

where $\lambda_{\min}(\Sigma_k^*)$ and $\lambda_{\max}(\Sigma_k^*)$ refer to the minimal and maximal eigenvalues of $\Sigma_k^*$, respectively.

Condition 3.2 requires uniform boundedness of the eigenvalues of the true covariance matrices $\Sigma_k^*$; it is commonly assumed in the graphical model literature [22].

Condition 3.3 (Tuning). For any $k = 1, \ldots, K$ and some constant $C_2 > 0$, the tuning parameter $\lambda_k$ satisfies $(1/C_2)\sqrt{\log m_k/(nmm_k)} \le \lambda_k \le C_2\sqrt{\log m_k/(nmm_k)}$.

Condition 3.3 specifies the choice of the tuning parameters. In practice, a data-driven tuning procedure [23] can be performed to approximate the optimal choice of the tuning parameters.

Before characterizing the statistical error, we define a sparsity parameter for $\Omega_k^*$, $k = 1, \ldots, K$. Let $S_k := \{(i,j): [\Omega_k^*]_{i,j} \neq 0\}$. Denote the sparsity parameter $s_k := |S_k| - m_k$, which is the number of nonzero entries in the off-diagonal component of $\Omega_k^*$. For each $k = 1, \ldots, K$, we define $B(\Omega_k^*)$ as the set containing $\Omega_k^*$ and its neighborhood of some sufficiently large constant radius $\alpha > 0$, i.e.,

  $B(\Omega_k^*) := \{\Omega \in \mathbb{R}^{m_k\times m_k}: \Omega = \Omega^\top;\ \Omega \succ 0;\ \|\Omega - \Omega_k^*\|_F \le \alpha\}.$   (3.4)

Theorem 3.4. Assume Conditions 3.2 and 3.3 hold. For any $k = 1, \ldots, K$, the statistical error of the sample-based minimization function defined in (3.3) satisfies, for any fixed $\Omega_j \in B(\Omega_j^*)$ ($j \neq k$),

  $\big\|\widehat{M}_k(\Omega_{[K]\setminus k}) - M_k(\Omega_{[K]\setminus k})\big\|_F = O_P\Big(\sqrt{m_k(m_k+s_k)\log m_k/(nm)}\Big),$   (3.5)

where $M_k(\Omega_{[K]\setminus k})$ and $\widehat{M}_k(\Omega_{[K]\setminus k})$ are defined in (3.2) and (3.3), and $m = \prod_{k=1}^K m_k$.

Theorem 3.4 establishes the statistical error associated with $\widehat{M}_k(\Omega_{[K]\setminus k})$ for arbitrary $\Omega_j \in B(\Omega_j^*)$ with $j \neq k$.
In comparison, previous work on the existence of a local solution with the desired statistical properties only establishes results similar to Theorem 3.4 for $\Omega_j = \Omega_j^*$ with $j \neq k$. The extension to arbitrary $\Omega_j \in B(\Omega_j^*)$ involves non-trivial technical barriers. In particular, we first establish the rate of convergence of the difference between a sample-based quadratic form and its expectation (Lemma B.1) via concentration of Lipschitz functions of Gaussian random variables [24]; this result is also of independent interest. We then carefully characterize the rate of convergence of $S_k$ defined in (2.3) (Lemma B.2). Finally, we obtain (3.5) using the results for vector-valued graphical models developed by [25].

Combining Theorem 3.1 and Theorem 3.4, we obtain the rate of convergence of the Tlasso estimator in Frobenius norm, which is our main result.

Theorem 3.5. Assume that Conditions 3.2 and 3.3 hold. For any $k = 1, \ldots, K$, if the initialization satisfies $\Omega_j^{(0)} \in B(\Omega_j^*)$ for any $j \neq k$, then the estimator $\widehat{\Omega}_k$ from Algorithm 1 with $T = 1$ satisfies

  $\big\|\widehat{\Omega}_k - \Omega_k^*\big\|_F = O_P\Big(\sqrt{m_k(m_k+s_k)\log m_k/(nm)}\Big),$   (3.6)

where $m = \prod_{k=1}^K m_k$ and $B(\Omega_j^*)$ is defined in (3.4).

Theorem 3.5 shows that as long as the initialization is within a constant distance of the truth, our Tlasso algorithm attains a consistent estimator after only one iteration. The initialization condition $\Omega_j^{(0)} \in B(\Omega_j^*)$ holds trivially, since for any $\Omega_j^{(0)}$ that is positive definite and has unit Frobenius norm we have $\|\Omega_j^{(0)} - \Omega_j^*\|_F \le 2$, noting that $\|\Omega_j^*\|_F = 1$ ($j = 1, \ldots, K$) for the identifiability of the tensor normal distribution. In the literature, [9] shows that there exists a local minimizer of (2.2) whose convergence rate achieves (3.6).
However, it is unknown whether their algorithm can find such a minimizer, since there may be many other local minimizers.

A notable implication of Theorem 3.5 is that, when $K \ge 3$, the estimator from our Tlasso algorithm can achieve estimation consistency even if we only have access to one observation, i.e., $n = 1$, which is often the case in practice. To see this, suppose that $K = 3$ and $n = 1$. When the dimensions $m_1$, $m_2$, and $m_3$ are of the same order of magnitude and $s_k = O(m_k)$ for $k = 1, 2, 3$, all three error rates corresponding to $k = 1, 2, 3$ in (3.6) converge to zero.

This result indicates that the estimation of the $k$-th precision matrix takes advantage of the information from the $j$-th way ($j \neq k$) of the tensor data. Consider the simple case in which $K = 2$ and one precision matrix $\Omega_1^* = 1_{m_1}$ is known. In this scenario the rows of the matrix data are independent, and hence the effective sample size for estimating $\Omega_2^*$ is in fact $nm_1$. The optimality result for the vector-valued graphical model [4] implies that the optimal rate for estimating $\Omega_2^*$ is $\sqrt{(m_2+s_2)\log m_2/(nm_1)}$, which matches our result in (3.6). Therefore, the rate in (3.6) obtained by our Tlasso estimator is minimax-optimal, since it is the best rate one can obtain even when $\Omega_j^*$ ($j \neq k$) are known. As far as we know, this phenomenon has not been discovered by any previous work on tensor graphical models.

Remark 3.6. For $K = 2$, our tensor graphical model reduces to the matrix graphical model with Kronecker product covariance structure [5-8]. In this case, the rate of convergence of $\widehat{\Omega}_1$ in (3.6) reduces to $\sqrt{(m_1+s_1)\log m_1/(nm_2)}$, which is much faster than the rate $\sqrt{m_2(m_1+s_1)(\log m_1+\log m_2)/n}$ established by [6] and the rate $\sqrt{(m_1+m_2)\log[\max(m_1,m_2,n)]/(nm_2)}$ established by [7].
In the literature, [5] shows that there exists a local minimizer of the objective function whose estimation errors match ours. However, it is unknown whether their estimator can achieve this convergence rate. Our theorem, on the other hand, confirms that our algorithm is able to find such an estimator with the optimal rate of convergence.

3.2 Estimation error in max norm and spectral norm

We next show the estimation error in max norm and spectral norm. Trivially, these estimation errors are bounded by that in Frobenius norm shown in Theorem 3.5. To develop improved rates of convergence in max and spectral norms, we need to impose stronger conditions on the true parameters.

We first introduce some important notation. Denote by $d_k$ the maximum number of non-zeros in any row of the true precision matrix $\Omega_k^*$, that is,

  $d_k := \max_{i\in\{1,\ldots,m_k\}} \big|\{j \in \{1,\ldots,m_k\}: [\Omega_k^*]_{i,j} \neq 0\}\big|,$   (3.7)

with $|\cdot|$ the cardinality of the set. For each covariance matrix $\Sigma_k^*$, we define $\kappa_{\Sigma_k^*} := |||\Sigma_k^*|||_\infty$. Denote the Hessian matrix $\Gamma_k^* := (\Omega_k^*)^{-1} \otimes (\Omega_k^*)^{-1} \in \mathbb{R}^{m_k^2\times m_k^2}$, whose entry $[\Gamma_k^*]_{(i,j),(s,t)}$ corresponds to the second-order partial derivative of the objective function with respect to $[\Omega_k]_{i,j}$ and $[\Omega_k]_{s,t}$. We define its sub-matrix indexed by the index set $S_k$ as $[\Gamma_k^*]_{S_k,S_k} = [(\Omega_k^*)^{-1} \otimes (\Omega_k^*)^{-1}]_{S_k,S_k}$, which is the $|S_k| \times |S_k|$ matrix with rows and columns of $\Gamma_k^*$ indexed by $S_k$. Moreover, we define $\kappa_{\Gamma_k^*} := |||([\Gamma_k^*]_{S_k,S_k})^{-1}|||_\infty$. In order to establish the rate of convergence in max norm, we need to impose an irrepresentability condition on the Hessian matrix.

Condition 3.7 (Irrepresentability). For each $k = 1, \ldots, K$, there exists some $\alpha_k \in (0,1]$ such that

  $\max_{e\in S_k^c} \big\|[\Gamma_k^*]_{e,S_k}\big([\Gamma_k^*]_{S_k,S_k}\big)^{-1}\big\|_1 \le 1 - \alpha_k.$

Condition 3.7 controls the influence of the non-connected terms in $S_k^c$ on the connected edges in $S_k$. This condition has been widely applied in lasso penalized models [26, 27].

Condition 3.8 (Bounded Complexity). For each $k = 1, \ldots, K$, the parameters $\kappa_{\Sigma_k^*}$ and $\kappa_{\Gamma_k^*}$ are bounded, and the parameter $d_k$ in (3.7) satisfies $d_k = o\big(\sqrt{nm/(m_k\log m_k)}\big)$.

Theorem 3.9. Suppose Conditions 3.2, 3.3, 3.7 and 3.8 hold. Assume $s_k = O(m_k)$ for $k = 1, \ldots, K$, and assume the $m_k$ are of the same order, i.e., $m_1 \asymp m_2 \asymp \cdots \asymp m_K$. For each $k$, if the initialization satisfies $\Omega_j^{(0)} \in B(\Omega_j^*)$ for any $j \neq k$, then the estimator $\widehat{\Omega}_k$ from Algorithm 1 with $T = 2$ satisfies

  $\big\|\widehat{\Omega}_k - \Omega_k^*\big\|_\infty = O_P\Big(\sqrt{m_k\log m_k/(nm)}\Big).$   (3.8)

In addition, the edge set of $\widehat{\Omega}_k$ is a subset of the true edge set of $\Omega_k^*$, that is, $\text{supp}(\widehat{\Omega}_k) \subseteq \text{supp}(\Omega_k^*)$.

Theorem 3.9 shows that our Tlasso estimator achieves the optimal rate of convergence in max norm [4]. Here we consider the estimator obtained after two iterations, since we require a new concentration inequality (Lemma B.3) for the sample covariance matrix, which is built upon the estimator in Theorem 3.5. A direct consequence of Theorem 3.9 is the estimation error in spectral norm.

Corollary 3.10. Suppose the conditions of Theorem 3.9 hold. For any $k = 1, \ldots, K$, we have

  $\big\|\widehat{\Omega}_k - \Omega_k^*\big\|_2 = O_P\Big(d_k\sqrt{m_k\log m_k/(nm)}\Big).$   (3.9)

Remark 3.11. We now compare our rate of convergence in spectral norm for $K = 2$ with that established in the sparse matrix graphical model literature. In particular, [8] establishes the rate $O_P\big(\sqrt{m_k(s_k\vee 1)\log(m_1\vee m_2)/(nm_k)}\big)$ for $k = 1, 2$. Therefore, when $d_k^2 \le s_k \vee 1$, which holds, for example, for bounded degree graphs, our rate is faster. However, our faster rate comes at the price of assuming the irrepresentability condition.
Using recent advances in non-convex regularization [28], we can eliminate the irrepresentability condition; we leave this to future work.

3.3 Model selection consistency

Theorem 3.9 ensures that the estimated precision matrix correctly excludes all non-informative edges and includes all the true edges $(i,j)$ with $|[\Omega_k^*]_{i,j}| > C\sqrt{m_k\log m_k/(nm)}$ for some constant $C > 0$. Therefore, in order to achieve model selection consistency, a sufficient condition is to assume that, for each $k = 1, \ldots, K$, the minimal signal $\theta_k := \min_{(i,j)\in\text{supp}(\Omega_k^*)} |[\Omega_k^*]_{i,j}|$ is not too small.

Theorem 3.12. Under the conditions of Theorem 3.9, if $\theta_k \ge C\sqrt{m_k\log m_k/(nm)}$ for some constant $C > 0$, then for any $k = 1, \ldots, K$, $\text{sign}(\widehat{\Omega}_k) = \text{sign}(\Omega_k^*)$ with high probability.

Theorem 3.12 indicates that our Tlasso estimator is able to correctly recover the graphical structure of each way of the high-dimensional tensor data. To the best of our knowledge, this is the first model selection consistency result for high-dimensional tensor graphical models.

4 Simulations

We compare the proposed Tlasso estimator with two alternatives. The first is the direct graphical lasso (Glasso) approach [21], which applies the glasso to the vectorized tensor data to estimate $\Omega_1^* \otimes\cdots\otimes \Omega_K^*$ directly. The second is the iterative penalized maximum likelihood method (P-MLE) proposed by [9], whose termination condition is set to $\sum_{k=1}^K \|\widehat{\Omega}_k^{(t)} - \widehat{\Omega}_k^{(t-1)}\|_F/K \le 0.001$. For simplicity, in our Tlasso algorithm we set the initialization of the $k$-th precision matrix to $1_{m_k}$ for each $k = 1, \ldots, K$, and the total number of iterations to $T = 1$. The tuning parameter $\lambda_k$ is set to $20\sqrt{\log m_k/(nmm_k)}$. For a fair comparison, the same tuning parameter is applied in the P-MLE method.
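For reproducibility, tensor normal samples of the kind used in these simulations can be generated by scaling an i.i.d. standard normal tensor by $\Sigma_k^{1/2}$ along each mode. This is a standard construction and our own sketch; the actual simulation code (including the triangle and nearest-neighbor graph construction) is not given in the paper.

```python
import numpy as np

def sample_tensor_normal(n, Sigmas, rng):
    """Draw n i.i.d. samples from TN(0; Sigma_1, ..., Sigma_K) by multiplying
    an i.i.d. standard normal tensor by Sigma_k^{1/2} along each mode k."""
    roots = []
    for S in Sigmas:
        w, V = np.linalg.eigh(S)                    # symmetric square root
        roots.append((V * np.sqrt(np.clip(w, 0.0, None))) @ V.T)
    Z = rng.standard_normal([n] + [S.shape[0] for S in Sigmas])
    for k, R in enumerate(roots):
        # multiply mode k+1 (mode 0 indexes the sample) by Sigma_k^{1/2}
        Z = np.moveaxis(np.tensordot(R, Z, axes=(1, k + 1)), 0, k + 1)
    return Z
```

In a simulation, each $\Sigma_k$ would be the inverse of a sparse precision matrix $\Omega_k^*$ built from the chosen graph, rescaled so that $\|\Omega_k^*\|_F = 1$.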
In the direct Glasso approach, the tuning parameter is chosen by cross-validation via the huge package [29].

We consider two simulations with a third-order tensor, i.e., K = 3. In Simulation 1 we construct a triangle graph for each precision matrix, while in Simulation 2 we construct a four-nearest-neighbor graph. An illustration of the generated graphs is shown in Figure 1. In each simulation, we consider three scenarios: s1: n = 10 and (m_1, m_2, m_3) = (10, 10, 10); s2: n = 50 and (m_1, m_2, m_3) = (10, 10, 10); s3: n = 10 and (m_1, m_2, m_3) = (100, 5, 5). We repeat each example 100 times and compute the averaged computational time, the averaged estimation error of the Kronecker product of precision matrices, (m_1 m_2 m_3)^{-1} ‖Ω̂_1 ⊗ ··· ⊗ Ω̂_K − Ω*_1 ⊗ ··· ⊗ Ω*_K‖_F, the true positive rate (TPR), and the true negative rate (TNR). More specifically, denoting by a*_{i,j} the (i, j)-th entry of Ω*_1 ⊗ ··· ⊗ Ω*_K and by â_{i,j} the corresponding entry of Ω̂_1 ⊗ ··· ⊗ Ω̂_K, we define

    TPR := Σ_{i,j} 1(â_{i,j} ≠ 0, a*_{i,j} ≠ 0) / Σ_{i,j} 1(a*_{i,j} ≠ 0)  and  TNR := Σ_{i,j} 1(â_{i,j} = 0, a*_{i,j} = 0) / Σ_{i,j} 1(a*_{i,j} = 0).

As shown in Figure 1, our Tlasso is dramatically faster than both alternative methods. In Scenario s3, Tlasso takes about five seconds per replicate and P-MLE takes about 500 seconds, while the direct Glasso method takes more than one hour and is therefore omitted from the plot. The Tlasso algorithm is not only computationally efficient but also enjoys superior estimation accuracy. In all examples, the direct Glasso method has significantly larger errors than Tlasso due to ignoring the tensor graphical structure. Tlasso outperforms P-MLE in Scenarios s1 and s2 and is comparable to it in Scenario s3.

Figure 1: Left two plots: illustrations of the generated graphs; middle two plots: computational time; right two plots: estimation errors. In each group of two plots, the left (right) panel is for Simulation 1 (2).

Table 1 shows the variable selection performance. Our Tlasso identifies almost all edges in these six examples, while the Glasso and P-MLE methods miss several true edges. On the other hand, Tlasso tends to include more non-connected edges than the other methods.

Table 1: A comparison of variable selection performance. Here TPR and TNR denote the true positive rate and true negative rate (standard errors in parentheses); "/" indicates that the direct Glasso method is omitted in Scenario s3 due to its computational cost.

                      Glasso                        P-MLE                         Tlasso
             TPR           TNR           TPR           TNR           TPR           TNR
Sim 1  s1    0.27 (0.002)  0.96 (0.000)  1 (0)         0.89 (0.002)  1 (0)         0.76 (0.004)
       s2    0.34 (0.000)  0.93 (0.000)  1 (0)         0.89 (0.002)  1 (0)         0.76 (0.004)
       s3    /             /             1 (0)         0.93 (0.001)  1 (0)         0.70 (0.004)
Sim 2  s1    0.08 (0.000)  0.96 (0.000)  1 (0)         0.88 (0.002)  1 (0)         0.65 (0.005)
       s2    0.15 (0.000)  0.92 (0.000)  0.93 (0.004)  0.85 (0.002)  1 (0)         0.63 (0.005)
       s3    /             /             0.82 (0.001)  0.93 (0.001)  0.99 (0.001)  0.38 (0.002)

Acknowledgement
We would like to thank the anonymous reviewers for their helpful comments.
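As a closing note on the protocol behind Table 1, the TPR and TNR criteria of Section 4 reduce to a few lines of code once the Kronecker products are formed. A minimal stdlib-only sketch; the helpers `kron` and `tpr_tnr` are ours, illustrative rather than the paper's implementation:

```python
from itertools import product

def kron(A, B):
    """Kronecker product of two matrices given as nested lists:
    (A ⊗ B)[i*r + k][j*s + l] = A[i][j] * B[k][l]."""
    q, s = len(A[0]), len(B[0])
    return [[A[i][j] * B[k][l] for j in range(q) for l in range(s)]
            for i, k in product(range(len(A)), range(len(B)))]

def tpr_tnr(est, truth):
    """TPR: fraction of true non-zeros estimated as non-zero;
    TNR: fraction of true zeros estimated as zero."""
    tp = fn = tn = fp = 0
    for e_row, t_row in zip(est, truth):
        for e, t in zip(e_row, t_row):
            if t != 0:
                tp += (e != 0)
                fn += (e == 0)
            else:
                tn += (e == 0)
                fp += (e != 0)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: a 2x2 precision factor Kronecker an identity factor.
truth = kron([[1.0, 0.5], [0.5, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
est = [row[:] for row in truth]
est[0][2] = 0.0   # miss one true edge
est[0][3] = 0.1   # include one spurious edge
```

With eight true non-zeros and eight true zeros in this 4 × 4 example, one miss and one false inclusion give TPR = TNR = 7/8.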
Han Liu is grateful for the support of NSF CAREER Award DMS1454377, NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841. Guang Cheng's research is sponsored by NSF CAREER Award DMS1151692, NSF DMS1418042, a Simons Fellowship in Mathematics, ONR N00014-15-1-2331, and a grant from the Indiana Clinical and Translational Sciences Institute.

References
[1] S. Rendle and L. Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In International Conference on Web Search and Data Mining, 2010.
[2] G. I. Allen. Sparse higher-order principal components analysis. In International Conference on Artificial Intelligence and Statistics, 2012.
[3] J. Zahn, S. Poosala, A. Owen, D. Ingram, et al. AGEMAP: A gene expression database for aging in mice. PLOS Genetics, 3:2326–2337, 2007.
[4] T. Cai, W. Liu, and H. H. Zhou. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Annals of Statistics, 2015.
[5] C. Leng and C. Y. Tang. Sparse matrix graphical models. Journal of the American Statistical Association, 107:1187–1200, 2012.
[6] J. Yin and H. Li. Model selection and estimation in the matrix normal graphical model. Journal of Multivariate Analysis, 107:119–140, 2012.
[7] T. Tsiligkaridis, A. O. Hero, and S. Zhou. On convergence of Kronecker graphical Lasso algorithms. IEEE Transactions on Signal Processing, 61:1743–1755, 2013.
[8] S. Zhou. Gemini: Graph estimation with matrix variate normal instances. Annals of Statistics, 42:532–562, 2014.
[9] S. He, J. Yin, H. Li, and X. Wang. Graphical model selection and estimation for high dimensional tensor data. Journal of Multivariate Analysis, 128:165–185, 2014.
[10] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Symposium on Theory of Computing, pages 665–674, 2013.
[11] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pages 2796–2804, 2013.
[12] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere. arXiv:1504.06785, 2015.
[13] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms for sparse coding. arXiv:1503.00778, 2015.
[14] A. Anandkumar, R. Ge, D. Hsu, S. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
[15] W. Sun, J. Lu, H. Liu, and G. Cheng. Provable sparse tensor decomposition. arXiv:1502.01425, 2015.
[16] S. Zhe, Z. Xu, X. Chu, Y. Qi, and Y. Park. Scalable nonparametric multiway data analysis. In International Conference on Artificial Intelligence and Statistics, 2015.
[17] S. Zhe, Z. Xu, Y. Qi, and P. Yu. Sparse Bayesian multiview learning for simultaneous association discovery and diagnosis of Alzheimer's disease. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[18] T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Review, 51:455–500, 2009.
[19] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
[20] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94:19–35, 2007.
[21] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 9:432–441, 2008.
[22] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
[23] W. Sun, J. Wang, and Y. Fang. Consistent selection of tuning parameters via variable selection stability. Journal of Machine Learning Research, 14:3419–3440, 2013.
[24] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2011.
[25] J. Fan, Y. Feng, and Y. Wu. Network exploration via the adaptive Lasso and SCAD penalties. Annals of Applied Statistics, 3:521–541, 2009.
[26] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2567, 2006.
[27] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
[28] Z. Wang, H. Liu, and T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of Statistics, 42:2164–2201, 2014.
[29] T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman. The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 13:1059–1062, 2012.
[30] A. Gupta and D. Nagar. Matrix Variate Distributions. Chapman and Hall/CRC Press, 2000.
[31] P. Hoff. Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Analysis, 6:179–196, 2011.
[32] A. P. Dawid. Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981.
[33] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39:1069–1097, 2011.