{"title": "Multiway clustering via tensor block models", "book": "Advances in Neural Information Processing Systems", "page_first": 715, "page_last": 725, "abstract": "We consider the problem of identifying multiway block structure from a large noisy tensor. Such problems arise frequently in applications such as genomics, recommendation system, topic modeling, and sensor network localization. We propose a tensor block model, develop a unified least-square estimation, and obtain the theoretical accuracy guarantees for multiway clustering. The statistical convergence of the estimator is established, and we show that the associated clustering procedure achieves partition consistency. A sparse regularization is further developed for identifying important blocks with elevated means. The proposal handles a broad range of data types, including binary, continuous, and hybrid observations. Through simulation and application to two real datasets, we demonstrate the outperformance of our approach over previous methods.", "full_text": "Multiway clustering via tensor block models\n\nMiaoyan Wang\n\nUniversity of Wisconsin \u2013 Madison\n\nmiaoyan.wang@wisc.edu\n\nYuchen Zeng\n\nUniversity of Wisconsin \u2013 Madison\n\nyzeng58@wisc.edu\n\nAbstract\n\nWe consider the problem of identifying multiway block structure from a large\nnoisy tensor. Such problems arise frequently in applications such as genomics,\nrecommendation system, topic modeling, and sensor network localization. We\npropose a tensor block model, develop a uni\ufb01ed least-square estimation, and\nobtain the theoretical accuracy guarantees for multiway clustering. The statistical\nconvergence of the estimator is established, and we show that the associated\nclustering procedure achieves partition consistency. A sparse regularization is\nfurther developed for identifying important blocks with elevated means. The\nproposal handles a broad range of data types, including binary, continuous, and\nhybrid observations. 
Through simulation and application to two real datasets, we demonstrate that our approach outperforms previous methods.\n\n1 Introduction\n\nHigher-order tensors have recently attracted increased attention in data-intensive fields such as neuroscience [1], social networks [2], computer vision [3], and genomics [4, 5]. In many applications, the data tensors are often expected to have underlying block structure. One example is multi-tissue expression data [4], in which genome-wide expression profiles are collected from different tissues in a number of individuals. There may be groups of genes similarly expressed in subsets of tissues and individuals; mathematically, this implies an underlying three-way block structure in the data tensor. In a different context, block structure may emerge in a binary-valued tensor. Examples include multilayer network data [2], with the nodes representing the individuals and the layers representing the multiple types of relations. Here a planted block represents a community of individuals that are highly connected within a class of relationships.\n\nFigure 1: Examples of tensor block model (TBM). (a) Our TBM method is used for multiway clustering and for revealing the underlying checkerbox structure in a noisy tensor. (b) The sparse TBM method is used for detecting sub-tensors of elevated means.\n\nThis paper presents a new method and the associated theory for tensors with block structure. We develop a unified least-squares estimation procedure for identifying multiway block structure. The proposal applies to a broad range of data types, including binary, continuous, and hybrid observations. We establish a high-probability error bound for the resulting estimator, and show that the procedure enjoys consistency guarantees on block structure recovery as the dimension of the data tensor grows. 
Furthermore, we develop a sparse extension of the tensor block model for block selection.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1 shows two immediate examples of our method. When the data tensor possesses a checkerbox pattern modulo some unknown reordering of entries, our method amounts to multiway clustering that simultaneously clusters each mode of the tensor (Figure 1a). When the data tensor has no full checkerbox structure but contains a small number of sub-tensors with elevated means, we develop a sparse version of our method to detect these sub-tensors of interest (Figure 1b).\n\nRelated work. Our work is closely related to, but also clearly distinct from, low-rank tensor decomposition. A number of methods have been developed for low-rank tensor estimation, including CANDECOMP/PARAFAC (CP) decomposition [6] and Tucker decomposition [7]. The CP model decomposes a tensor into a sum of rank-1 tensors, whereas the Tucker model decomposes a tensor into a core tensor multiplied by orthogonal matrices in each mode. In this paper we investigate an alternative block structure assumption, which has yet to be studied for higher-order tensors. Note that a block structure automatically implies low-rankness. However, as we will show in Section 4, a direct application of low-rank estimation to the current setting results in an inferior estimator. Therefore, a full exploitation of the block structure is necessary; this is the focus of the current paper.\n\nOur work is also connected to biclustering [8] and its higher-order extensions [9, 10]. Existing multiway clustering methods [9, 10, 5, 11] typically take a two-step procedure, by first estimating a low-dimensional representation of the data tensor and then applying clustering algorithms to the tensor factors. 
In contrast, our tensor block model takes a single shot to perform estimation and clustering simultaneously. This approach achieves higher accuracy and improved interpretability. Moreover, earlier solutions to multiway clustering [12, 9] focus on algorithmic effectiveness, leaving the statistical optimality of the estimators unaddressed. Very recently, Chi et al. [13] provided an attempt to study the statistical properties of the tensor block model. We will show that our estimator attains a faster convergence rate than theirs, and that the power is further boosted with a sparse regularization.\n\n2 Preliminaries\n\nWe begin by reviewing a few basic facts about tensors [14]. We use Y = (y_{i1,...,iK}) ∈ R^{d1×···×dK} to denote an order-K, (d1, ..., dK)-dimensional tensor. The multilinear multiplication of a tensor Y ∈ R^{d1×···×dK} by matrices Mk = (m^{(k)}_{ik,jk}) ∈ R^{sk×dk} is defined as\n\nY ×1 M1 ··· ×K MK = ( Σ_{j1,...,jK} y_{j1,...,jK} m^{(1)}_{i1,j1} ··· m^{(K)}_{iK,jK} ),\n\nwhich results in an order-K, (s1, ..., sK)-dimensional tensor. For any two tensors Y = (y_{i1,...,iK}) and Y' = (y'_{i1,...,iK}) of identical order and dimensions, their inner product is defined as ⟨Y, Y'⟩ = Σ_{i1,...,iK} y_{i1,...,iK} y'_{i1,...,iK}. The Frobenius norm of a tensor Y is defined as ||Y||_F = ⟨Y, Y⟩^{1/2}; it is the Euclidean norm of Y regarded as a (∏k dk)-dimensional vector. The maximum norm of Y is defined as ||Y||_max = max_{i1,...,iK} |y_{i1,...,iK}|. An order-(K-1) slice of Y is a sub-tensor of Y obtained by holding the index in one mode fixed while letting the other indices vary.\n\nA clustering of d objects is a partition of the index set [d] := {1, 2, ..., d} into R disjoint non-empty subsets. 
We refer to the number of clusters, R, as the clustering size. Equivalently, the clustering (or partition) can be represented using a "membership matrix". A membership matrix M ∈ R^{R×d} is an incidence matrix whose (i, j)-entry is 1 if and only if element j belongs to cluster i, and 0 otherwise. Throughout the paper, we use the terms "clustering", "partition", and "membership matrix" interchangeably. For a higher-order tensor, the concept of index partition applies to each of the modes. A block is a sub-tensor induced by the index partitions along each of the K modes. We use the term "cluster" to refer to the marginal partition on mode k, and reserve the term "block" for the multiway partition of the tensor.\n\n3 Tensor block model\n\nLet Y = (y_{i1,...,iK}) ∈ R^{d1×···×dK} denote an order-K, (d1, ..., dK)-dimensional data tensor. The main assumption of the tensor block model (TBM) is that the observed data tensor Y is a noisy realization of an underlying tensor that exhibits a checkerbox structure (see Figure 1a). Specifically, suppose that the k-th mode of the tensor consists of Rk clusters. If the tensor entry y_{i1,...,iK} belongs to the block determined by the rk-th cluster in mode k for rk ∈ [Rk], then we assume that\n\ny_{i1,...,iK} = c_{r1,...,rK} + ε_{i1,...,iK}, for (i1, ..., iK) ∈ [d1] × ··· × [dK], (1)\n\nwhere c_{r1,...,rK} is the mean of the tensor block indexed by (r1, ..., rK), and the ε_{i1,...,iK}'s are independent, mean-zero noise terms to be specified later. 
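The generative model (1) is straightforward to simulate. Below is a minimal numpy sketch (not the authors' code; the dimensions, cluster counts, and uniform block means are arbitrary illustrative choices):

```python
import numpy as np

def simulate_tbm(dims=(30, 30, 30), R=(3, 3, 3), sigma=1.0, seed=0):
    """Draw one tensor from a Gaussian tensor block model:
    y_{i1..iK} = c_{r1..rK} + eps, with independent N(0, sigma^2) noise."""
    rng = np.random.default_rng(seed)
    core = rng.uniform(-3, 3, size=R)               # block means c_{r1..rK}
    labels = [rng.integers(0, R[k], size=dims[k])   # mode-k cluster labels
              for k in range(len(dims))]
    signal = core[np.ix_(*labels)]                  # blockwise-constant mean tensor
    return signal + sigma * rng.normal(size=dims), core, labels

Y, core, labels = simulate_tbm()
```

The signal tensor is blockwise constant by construction: every entry equals the core value indexed by its mode-wise cluster labels.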
Our goal is to (i) find the clustering along each of the modes, and (ii) estimate the block means {c_{r1,...,rK}}, such that a corresponding blockwise-constant checkerbox structure emerges in the data tensor.\n\nThe tensor block model (1) falls into a general class of non-overlapping, constant-mean clustering models [15], in that each tensor entry belongs to exactly one block with a common mean. The TBM can be equivalently expressed as a special tensor Tucker model,\n\nY = C ×1 M1 ×2 ··· ×K MK + E, (2)\n\nwhere C = (c_{r1,...,rK}) ∈ R^{R1×···×RK} is a core tensor consisting of block means, Mk ∈ {0,1}^{dk×Rk} is a membership matrix indicating the block allocations along mode k for k ∈ [K], and E = (ε_{i1,...,iK}) is the noise tensor. We view the TBM (2) as a super-sparse Tucker model, in the sense that each row of Mk contains a single 1, with all remaining entries 0.\n\nWe make a general assumption on the noise tensor E. The noise terms ε_{i1,...,iK} are assumed to be independent, mean-zero σ-subgaussian, where σ > 0 is the subgaussianity parameter. Precisely,\n\nE e^{λ ε_{i1,...,iK}} ≤ e^{λ²σ²/2}, for all (i1, ..., iK) ∈ [d1] × ··· × [dK] and all λ ∈ R. (3)\n\nThe assumption (3) incorporates common situations such as Gaussian noise, Bernoulli noise, and noise with bounded support. In particular, we consider two important examples of the TBM:\n\nExample 1 (Gaussian tensor block model) Let Y be a continuous-valued tensor. The Gaussian tensor block model (GTBM), y_{i1,...,iK} ~ N(c_{r1,...,rK}, σ²) independently, is a special case of model (1), with the subgaussianity parameter σ equal to the error standard deviation. The GTBM serves as the foundation for many tensor clustering algorithms [12, 4, 13].\n\nExample 2 (Stochastic tensor block model) Let Y be a binary-valued tensor. 
The stochastic tensor block model (STBM), y_{i1,...,iK} ~ Bernoulli(c_{r1,...,rK}) independently, is a special case of model (1), with subgaussianity parameter σ = 1/2 (by Hoeffding's lemma for variables bounded in [0, 1]). The STBM can be viewed as an extension, to higher-order tensors, of the popular stochastic block model [16, 17] for matrix-based network analysis. In the field of community detection, multi-layer stochastic block models have also been developed for multi-relational network data analysis [18, 19].\n\nMore generally, our model also applies to hybrid error distributions, in which different types of distributions are allowed for different portions of the tensor. This scenario may happen, for example, when the data tensor Y represents concatenated measurements from multiple data sources.\n\nBefore we discuss the estimation, we present the identifiability of the TBM.\n\nAssumption 1 (Irreducible core) The core tensor C is called irreducible if it cannot be written as a block tensor with the number of mode-k clusters smaller than Rk, for any k ∈ [K].\n\nIn the matrix case (K = 2), irreducibility is equivalent to saying that C has no two identical rows and no two identical columns. In the higher-order case, the assumption requires that no two order-(K-1) slices of C within the same mode are identical. Note that irreducibility is a weaker assumption than full-rankness.\n\nProposition 1 (Identifiability) Consider a Gaussian or Bernoulli TBM (1). Under Assumption 1, the factor matrices Mk are identifiable up to permutations of cluster labels.\n\nThe identifiability property of the TBM is stronger than that of classical factor models [20, 21]. In Tucker [22, 14] and many other factor analyses [20, 21], the factors are identifiable only up to orthogonal rotations. Those models recover only the (column) space spanned by Mk, but not the individual factors. 
In contrast, our model does not suffer from rotational invariance, and as we show in Section 4, every individual factor is consistently estimated in high dimensions. This benefits the interpretation of the factors in the tensor block model.\n\nWe propose a least-squares approach for estimating the TBM. Let Θ = C ×1 M1 ×2 ··· ×K MK denote the mean signal tensor with block structure. The mean tensor is assumed to belong to the following parameter space:\n\nP_{R1,...,RK} = { Θ ∈ R^{d1×···×dK} : Θ = C ×1 M1 ×2 ··· ×K MK, with some membership matrices Mk and a core tensor C ∈ R^{R1×···×RK} }.\n\nIn the following theoretical analysis, we assume the clustering size R = (R1, ..., RK) is known and simply write P for short. The adaptation to unknown R will be addressed in Section 5.2. The least-squares estimator for the TBM (1) is\n\nΘ̂ = arg min_{Θ ∈ P} { −2⟨Y, Θ⟩ + ||Θ||²_F }. (4)\n\nThe objective is equal (ignoring constants) to the sum of squares ||Y − Θ||²_F, hence the name of our estimator.\n\n4 Statistical convergence\n\nIn this section, we establish the convergence rate of the least-squares estimator (4) under two measures. The first measure is the mean squared error (MSE):\n\nMSE(Θ_true, Θ̂) = (1 / ∏k dk) ||Θ_true − Θ̂||²_F,\n\nwhere Θ_true, Θ̂ ∈ P are the true and estimated mean tensors, respectively. While this loss function corresponds to the likelihood for the Gaussian tensor model, the same assertion does not hold for other types of distributions, such as the stochastic tensor block model. 
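The remark that the objective in (4) equals the sum of squares up to a constant can be verified numerically; a small sketch with arbitrary random tensors, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 10, 10))      # data tensor
Theta = rng.normal(size=(10, 10, 10))  # any candidate mean tensor

# objective (4) and the sum of squares differ by the constant ||Y||_F^2
obj = -2 * np.sum(Y * Theta) + np.sum(Theta ** 2)
sse = np.sum((Y - Theta) ** 2)
assert np.isclose(sse, obj + np.sum(Y ** 2))

mse = sse / Y.size                     # the 1 / prod_k d_k scaling of the MSE
```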
We will show that, with very high probability, a simple least-squares estimator achieves a fast convergence rate in a general class of block tensor models.\n\nTheorem 1 (Convergence rate of MSE) Let Θ̂ be the least-squares estimator of Θ_true under model (1). There exist two constants C1, C2 > 0 such that\n\nMSE(Θ_true, Θ̂) ≤ C1 σ² ( ∏k Rk + Σk dk log Rk ) / ∏k dk (5)\n\nholds with probability at least 1 − exp(−C2(∏k Rk + Σk dk log Rk)), uniformly over Θ_true ∈ P.\n\nThe convergence rate of the MSE in (5) consists of two parts. The first part, ∏k Rk, is the number of parameters in the core tensor C, while the second part, Σk dk log Rk, reflects the complexity of estimating the Mk's. It is the price that one has to pay for not knowing the locations of the blocks.\n\nWe compare our bound with the existing literature. The Tucker tensor decomposition has a minimax convergence rate proportional to Σk dk R'k [22], where R'k is the multilinear rank in mode k. Applying Tucker decomposition to the TBM yields Σk dk Rk, because the mode-k rank is bounded by the number of mode-k clusters. Now, as both the dimension dmin = mink dk and the clustering size Rmin = mink Rk tend to infinity, we have ∏k Rk + Σk dk log Rk ≪ Σk dk Rk. Therefore, by fully exploiting the block structure, we obtain a better convergence rate than previously possible.\n\nRecently, [13] proposed a convex relaxation for estimating the TBM. In the special case when the tensor dimensions are equal at every mode, d1 = ... = dK = d, their estimator has a convergence rate of order O(d^{-1}) for all K ≥ 2. 
As we see from (5), our estimator attains a much better convergence rate, O(d^{-(K-1)}), which is especially favorable as the order K increases.\n\nThe bound (5) generalizes previous results on structured matrix estimation in network analysis [23, 16]. Earlier work [16] suggests the following heuristic on the sample complexity in the matrix case:\n\n[ (number of parameters) + log(complexity of models) ] / (number of samples). (6)\n\nOur result supports this important principle for general K ≥ 2. Note that, in the TBM, the sample size is the total number of entries ∏k dk, the number of parameters is ∏k Rk, and the combinatorial complexity for estimating the block structure is of order ∏k Rk^{dk}.\n\nNext we study the consistency of the partition. To define the misclassification rate (MCR), we need to introduce some additional notation. Let Mk = (m^{(k)}_{i,r}) and M̂k = (m̂^{(k)}_{i,r'}) be two mode-k membership matrices, and let D^{(k)} = (D^{(k)}_{r,r'}) be the mode-k confusion matrix with elements\n\nD^{(k)}_{r,r'} = (1/dk) Σ_{i=1}^{dk} 1{ m^{(k)}_{i,r} = m̂^{(k)}_{i,r'} = 1 }, where r, r' ∈ [Rk].\n\nNote that the row/column sums of D^{(k)} represent the proportion of nodes in each cluster defined by Mk or M̂k. We restrict ourselves to non-degenerate clusterings; that is, the row/column sums of D^{(k)} are bounded below by a constant τ > 0. With a slight abuse of notation, we still use P = P(τ) to denote the parameter space with this non-degeneracy assumption. 
The least-squares estimator (4) should also be interpreted with this constraint imposed.\n\nWe define the mode-k misclassification rate (MCR) as\n\nMCR(Mk, M̂k) = max_{r ∈ [Rk], a ≠ a' ∈ [Rk]} min{ D^{(k)}_{a,r}, D^{(k)}_{a',r} }.\n\nIn other words, the MCR is the element-wise maximum of the confusion matrix after removing the largest entry from each column. Under the non-degeneracy assumption, MCR = 0 if and only if the confusion matrix D^{(k)} is a permutation of a diagonal matrix; that is, the estimated partition matches the true partition, up to permutations of cluster labels.\n\nTheorem 2 (Convergence rate of MCR) Consider a tensor block model (2) with subgaussian parameter σ. Define the minimal gap between the blocks δmin = (1/||C||_max) mink δ^{(k)}, where δ^{(k)} = min_{rk ≠ r'k} max_{r1,...,rk-1,rk+1,...,rK} ( c_{r1,...,rk,...,rK} − c_{r1,...,r'k,...,rK} )². Let Mk,true be the true mode-k membership and M̂k be the estimator from (4). Then, for any ε ∈ [0, 1],\n\nP( MCR(M̂k, Mk,true) ≥ ε ) ≤ 2^{1 + Σ_{k=1}^K dk} exp( −C ε² δ²min τ^{3K-2} ∏_{k=1}^K dk / σ² ),\n\nwhere C > 0 is a positive constant, and τ > 0 is the lower bound on the cluster proportions.\n\nThe above theorem shows that our estimator consistently recovers the block structure as the dimension of the data tensor grows. The block-mean gap δmin serves the role of the eigen-separation in the classical tensor Tucker decomposition [22]. 
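The MCR definition translates directly into code. A numpy sketch, working with label vectors rather than membership matrices (an equivalent representation, used here only for brevity):

```python
import numpy as np

def mcr(labels_true, labels_est, R):
    """Mode-k misclassification rate: the largest entry of the confusion
    matrix D after removing the largest entry from each column."""
    d = len(labels_true)
    D = np.zeros((R, R))
    for r_true, r_est in zip(labels_true, labels_est):
        D[r_true, r_est] += 1.0 / d
    mask = np.ones_like(D, dtype=bool)
    mask[np.argmax(D, axis=0), np.arange(R)] = False  # drop column maxima
    return D[mask].max()

# a perfect clustering up to label permutation has MCR = 0
truth = np.array([0, 0, 1, 1, 2, 2])
est = np.array([2, 2, 0, 0, 1, 1])
print(mcr(truth, est, 3))  # 0.0
```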
Table 1 summarizes the comparison of various tensor methods in the special case when d1 = ··· = dK = d and R1 = ··· = RK = R.\n\nTable 1: Comparison of various tensor decomposition methods. For ease of presentation, we summarize only the leading terms in the dimension.\n\nMethod | Recovery error (||Θ̂ − Θ_true||²_F / σ²) | Clustering error (MCR) | Block detection (see Section 6)\nTucker [22] | dR | - | No\nCoCo [13] | d^{K-1} | - | No\nTBM (this paper) | d log R | σ δmin^{-1} τ^{-(3K-2)/2} d^{-(K-1)/2} | Yes\n\n5 Numerical implementation\n\n5.1 Alternating optimization\n\nWe introduce an alternating optimization scheme for solving (4). Estimating Θ consists of finding both the core tensor C and the membership matrices Mk. The optimization (4) can be written as\n\n(Ĉ, {M̂k}) = arg min_{C ∈ R^{R1×···×RK}, membership matrices Mk} f(C, {Mk}), where f(C, {Mk}) = ||Y − C ×1 M1 ×2 ··· ×K MK||²_F.\n\nThe decision variables consist of K + 1 blocks of variables, one for the core tensor C and K for the membership matrices Mk. We notice that, if any K out of the K + 1 blocks of variables are known, then the last block of variables can be solved for explicitly. This observation suggests that we can iteratively update one block of variables at a time while keeping the others fixed. Specifically, given the collection of M̂k's, the core tensor estimate Ĉ = arg min_C f(C, {M̂k}) consists of the sample averages of each tensor block. 
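The core-tensor update just described (blockwise sample averages) can be written in a few lines for the order-3 case; a sketch assuming label vectors per mode (a hypothetical helper, not the authors' implementation):

```python
import numpy as np

def update_core_3way(Y, labels, R):
    """Blockwise sample averages for an order-3 tensor: each core entry is
    the mean of Y over its block; empty blocks are left at zero."""
    M = [np.eye(Rk)[lab] for Rk, lab in zip(R, labels)]  # d_k x R_k one-hot
    sums = np.einsum('ijk,ia,jb,kc->abc', Y, M[0], M[1], M[2])
    counts = np.einsum('ia,jb,kc->abc', M[0], M[1], M[2])
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
```

On a noiseless blockwise-constant tensor this recovers the core exactly, which is a convenient sanity check when implementing the alternating updates.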
Given the block means Ĉ and K − 1 membership matrices, the remaining membership matrix can be solved for via a simple nearest-neighbor search over only Rk discrete points. The full procedure is described in Algorithm 1.\n\nAlgorithm 1 Multiway clustering based on tensor block models\nInput: Data tensor Y ∈ R^{d1×···×dK}, clustering size R = (R1, ..., RK).\nOutput: Block-mean tensor Ĉ ∈ R^{R1×···×RK}, and the membership matrices M̂k.\n1: Initialize the marginal clusterings by performing independent k-means on each of the K modes.\n2: repeat\n3: Update the core tensor Ĉ = (ĉ_{r1,...,rK}). Specifically, for each (r1, ..., rK) ∈ [R1] × ··· × [RK],\n\nĉ_{r1,...,rK} = (1/n_{r1,...,rK}) Σ_{(i1,...,iK) ∈ M̂1^{-1}(r1) × ··· × M̂K^{-1}(rK)} y_{i1,...,iK}, (7)\n\nwhere M̂k^{-1}(rk) denotes the set of indices that belong to the rk-th cluster in mode k, and n_{r1,...,rK} = ∏k |M̂k^{-1}(rk)| denotes the number of entries in the block indexed by (r1, ..., rK).\n4: for k in {1, 2, ..., K} do\n5: Update the mode-k membership matrix M̂k. Specifically, for each a ∈ [dk], assign the cluster label M̂k(a) ∈ [Rk]:\n\nM̂k(a) = arg min_{r ∈ [Rk]} Σ_{I−k} ( ĉ_{M̂1(i1),...,r,...,M̂K(iK)} − y_{i1,...,a,...,iK} )²,\n\nwhere I−k = (i1, ..., ik−1, ik+1, ..., iK) denotes the tensor coordinates except the k-th mode.\n6: end for\n7: until convergence\n\nAlgorithm 1 can be viewed as a higher-order extension of the ordinary (one-way) k-means algorithm. The core tensor C serves the role of the centroids. As each iteration reduces the value of the objective function, which is bounded below, convergence of the algorithm is guaranteed. The per-iteration computational cost scales linearly with the sample size, d = ∏k dk, and this complexity matches classical tensor methods [24, 25, 22]. We recognize that obtaining the global optimizer for such a non-convex optimization is typically difficult [26, 1]. Following common practice in non-convex optimization [1], we run the algorithm multiple times, using random initializations with independent one-way k-means on each of the modes.\n\n5.2 Tuning parameter selection\n\nAlgorithm 1 takes the number of clusters R as an input. In practice such information is often unknown, and R needs to be estimated from the data Y. We propose to select this tuning parameter using the Bayesian information criterion (BIC),\n\nBIC(R) = log( ||Y − Θ̂||²_F ) + ( Σk log dk / ∏k dk ) pe, (8)\n\nwhere pe is the effective number of parameters in the model. In our case we take pe = ∏k Rk + Σk dk log Rk, which is inspired by (6). We choose the R̂ that minimizes BIC(R) via grid search. Our choice of BIC aims to balance the goodness-of-fit to the data against the degrees of freedom in the population model. We test its empirical performance in Section 7.\n\n6 Extension to sparse estimation\n\nIn some large-scale applications, not every block in a data tensor is of equal importance. For example, in genome-wide expression data analysis, only a few entries represent signals while the majority come from background noise (see Figure 1b). While our estimator (4) is still able to handle this scenario by assigning small values to some of the ĉ_{r1,...,rK}'s, the estimates may suffer from high variance. 
It is thus beneficial to introduce regularized estimation for a better bias-variance trade-off and improved interpretability.\n\nHere we illustrate a sparse version of the TBM, imposing regularization on the block means in order to localize important blocks in the data tensor. This problem can be formulated as variable selection on the block parameters. We propose the following regularized least-squares estimation:\n\nΘ̂_sparse = arg min_{Θ ∈ P} { ||Y − Θ||²_F + λ ||C||_ρ },\n\nwhere C ∈ R^{R1×···×RK} is the block-mean tensor, ||C||_ρ is the penalty function with ρ indexing the tensor norm, and λ is the penalty tuning parameter. Some widely used penalties include the Lasso penalty (ρ = 1), the sparse subset penalty (ρ = 0), the ridge penalty (ρ = Frobenius norm), and the elastic net (a linear combination of ρ = 1 and the Frobenius norm), among many others.\n\nFor parsimony, we discuss only the Lasso and sparse subset penalties; other penalizations can be derived similarly. Sparse estimation incurs only slight changes to Algorithm 1. When updating the core tensor C in (7), we fit a penalized least-squares problem with respect to C. The closed form for the entry-wise sparse estimate ĉ_sparse_{r1,...,rK} is (see Lemma 2 in the Supplement):\n\nĉ_sparse_{r1,...,rK} = ĉ_ols_{r1,...,rK} 1{ |ĉ_ols_{r1,...,rK}| ≥ sqrt(λ / n_{r1,...,rK}) } if ρ = 0,\nĉ_sparse_{r1,...,rK} = sign(ĉ_ols_{r1,...,rK}) ( |ĉ_ols_{r1,...,rK}| − λ / (2 n_{r1,...,rK}) )_+ if ρ = 1,\n\nwhere a_+ = max(a, 0) and ĉ_ols_{r1,...,rK} denotes the ordinary least-squares estimate in (7). The choice of penalty ρ often depends on the study goals and interpretations in specific applications. 
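The two closed-form updates are one-liners in practice; a hedged numpy sketch (the array-shaped block means and block sizes are illustrative assumptions):

```python
import numpy as np

def threshold_core(c_ols, n_block, lam, rho):
    """Sparse core update: hard thresholding for the subset penalty
    (rho = 0), soft thresholding for the Lasso penalty (rho = 1)."""
    if rho == 0:
        return c_ols * (np.abs(c_ols) >= np.sqrt(lam / n_block))
    if rho == 1:
        return np.sign(c_ols) * np.maximum(np.abs(c_ols) - lam / (2 * n_block), 0.0)
    raise ValueError("only rho in {0, 1} sketched here")

c = np.array([0.3, -1.5, 2.0])   # ordinary least-squares block means
n = np.array([100, 100, 100])    # block sizes n_{r1,...,rK}
print(threshold_core(c, n, lam=25.0, rho=0))  # zeroes out only the first block
```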
Given a penalty function, we select the tuning parameter λ via the BIC (8), where we modify pe into pe_sparse = ||Ĉ_sparse||_0 + Σk dk log Rk. Here ||·||_0 denotes the number of non-zero entries in the tensor. The empirical performance of this proposal will be evaluated in Section 7.\n\n7 Experiments\n\nIn this section, we evaluate the empirical performance of our TBM method (our software is available at https://cran.r-project.org/web/packages/tensorsparse). We consider both non-sparse and sparse tensors, and compare the recovery accuracy with other tensor-based methods. Unless otherwise stated, we generate Gaussian tensors under the block model (1). The block means are generated i.i.d. from Uniform[-3, 3]. The entries of the noise tensor E are generated i.i.d. from N(0, σ²). In each simulation study, we report summary statistics across nsim = 50 replications.\n\n7.1 Finite-sample performance\n\nIn the first experiment, we assess the empirical relationship between the root mean squared error (RMSE) and the dimension. We set σ = 3 and consider tensors of order 3 and order 4 (see Figure 2). In the case of order-3 tensors, we increase d1 from 20 to 70, and for each choice of d1, we set the other two dimensions (d2, d3) such that d1 log R1 ≈ d2 log R2 ≈ d3 log R3. Recall that our theoretical analysis suggests a convergence rate O(sqrt(log R1 / (d2 d3))) for our estimator. Figure 2a plots the recovery error versus the rescaled sample size N1 = sqrt(d2 d3 / log R1). We find that the RMSE decreases roughly at the rate 1/N1. This is consistent with our theoretical result. It is observed that tensors with a higher number of blocks tend to yield higher recovery errors, as reflected by the upward shift of the curves as R increases. Indeed, a higher R means a higher intrinsic dimension of the problem, thus increasing the difficulty of the estimation. Similar behavior can be observed in the order-4 case in Figure 2b, where the rescaled sample size is N2 = sqrt(d2 d3 d4 / log R1).\n\nFigure 2: Estimation error for block tensors with Gaussian noise. Each curve corresponds to a fixed clustering size R. (a) Average RMSE against the rescaled sample size N1 = sqrt(d2 d3 / log R1) for order-3 tensors. (b) Average RMSE against the rescaled sample size N2 = sqrt(d2 d3 d4 / log R1) for order-4 tensors.\n\nIn the second experiment, we evaluate the selection performance of our BIC criterion (8). Supplementary Table S1 reports the selected numbers of clusters under various combinations of the dimension d, the clustering size R, and the noise level σ. We find that, for the case d = (40, 40, 40) and R = (4, 4, 4), the BIC selection is accurate in the low-to-moderate noise setting. In the high-noise setting with σ = 12, the selected number of clusters is slightly smaller than the true number, but the accuracy increases when either the dimension increases to d = (40, 40, 80) or the clustering size reduces to R = (2, 3, 4). Within a tensor, the selection appears to be easier for shorter modes with smaller numbers of clusters. This phenomenon is to be expected, since a shorter mode has more effective samples for clustering.\n\n7.2 Comparison with alternative methods\n\nNext, we compare our TBM method with two popular low-rank tensor estimation methods: (i) CP decomposition and (ii) Tucker decomposition. Following the literature [13, 5, 9], we perform the clustering by applying k-means to the resulting factors along each of the modes. We refer to these techniques as CP+k-means and Tucker+k-means. We generate noisy block tensors with five clusters on each of the modes, and then assess both the estimation and clustering performance of each method. 
Note that TBM takes a single shot to perform estimation and clustering simultaneously, whereas the CP- and Tucker-based methods separate these two tasks into two steps. We use the RMSE to assess the estimation accuracy and the clustering error rate (CER) to measure the clustering accuracy. The CER is calculated from the disagreements (i.e., one minus the Rand index) between the true and estimated block partitions of the three-way tensor. For a fair comparison, we provide all methods with the true number of clusters.\n\nFigure 3a shows that TBM achieves the lowest estimation error among the three methods. The gain in accuracy is more pronounced as the noise grows. Neither CP nor Tucker recovers the signal tensor, although Tucker appears to yield modest clustering performance (Figure 3b). One possible explanation is that the Tucker model imposes orthogonality on the factors, which makes the subsequent k-means clustering easier than for the CP factors. Figures 3b-c show that the clustering error increases with noise but decreases with dimension. This agrees with our expectation, as in tensor data analysis, a larger dimension implies a larger sample size.\n\nFigure 3: Performance comparison in terms of RMSE and CER. (a) Estimation error against noise for tensors of dimension (40, 40, 40). (b) Clustering error against noise for tensors of dimension (40, 40, 40). (c) Clustering error against noise for tensors of dimension (40, 50, 60).\n\nSparse case. We next evaluate the performance when the signal tensor is sparse. The simulated model is the same as before, except that we generate block means from a mixture of a point mass at zero and Uniform[-3, 3], with probabilities p (the sparsity rate) and 1 − p, respectively. We generate noisy tensors of dimension d = (40, 40, 40) with varying levels of sparsity and noise. We use the ℓ0-penalized TBM and primarily focus on the selection accuracy. 
The performance is quantified via the sparsity error rate, which is the proportion of entries that were incorrectly set to zero or incorrectly set to nonzero. We also report the proportion of true zeros that were correctly identified (correct zeros).
Table 2 reports the BIC-selected λ averaged across 50 simulations. We see a substantial benefit obtained by penalization. The proposed λ is able to guide the algorithm to correctly identify zeros while maintaining good accuracy in identifying nonzeros. The resulting sparsity level is close to the ground truth. Supplementary Figure S1 shows the estimation error and sparsity error against σ when p = 0.8. Again, the sparse TBM outperforms the other methods.

Sparsity (p)  Noise (σ)  BIC-selected λ  Estimated sparsity rate  Correct zero rate  Sparsity error rate
0.5           4          136.0 (37.5)    0.55 (0.04)              1.00 (0.02)        0.06 (0.03)
0.5           8          439.2 (80.2)    0.58 (0.06)              0.94 (0.08)        0.15 (0.07)
0.8           8          458.0 (63.3)    0.81 (0.15)              0.87 (0.16)        0.21 (0.13)

Table 2: Sparse TBM for estimating tensors of dimension d = (40, 40, 40). The reported statistics are averaged across 50 simulations, with standard deviations given in parentheses. Numbers in bold indicate that the ground truth is within 2 standard deviations of the sample average.

7.3 Real data analysis

Lastly, we apply our method to two real datasets. The first dataset is a real-valued tensor consisting of approximately 1 million expression values from 13 brain tissues, 193 individuals, and 362 genes [4]. We subtracted the overall mean expression from the data and applied the ℓ0-penalized TBM to identify important blocks in the resulting tensor. The top blocks exhibit a clear tissues × genes specificity (Supplementary Table S2).
In particular, the top over-expressed block is driven by tissues {Substantia nigra, Spinal cord} and genes {GFAP, MBP}, suggesting their elevated expression across individuals. In fact, GFAP encodes filament proteins for mature astrocytes, and MBP encodes myelin sheath proteins for oligodendrocytes; both play important roles in the central nervous system [27]. Our method also identifies blocks with extremely negative means (i.e., under-expressed blocks). The top under-expressed block is driven by tissues {Cerebellum, Cerebellar Hemisphere} and genes {CDH9, GPR6, RXFP1, CRH, DLX5/6, NKX2-1, SLC17A8}. The gene DLX6 encodes proteins involved in forebrain development [27], whereas the cerebellum tissues are located in the hindbrain. This spatial opposition is consistent with the observed under-expression pattern.
The second dataset we consider is the Nations data [2]. This is a 14 × 14 × 56 binary tensor recording 56 political relationships among 14 countries between 1950 and 1965. We note that 78.9% of the entries are zero. Again, we applied the ℓ0-penalized TBM to identify important blocks in the data. We found that the 14 countries are naturally partitioned into 5 clusters: two representing neutral countries, {Brazil, Egypt, India, Israel, Netherlands} and {Burma, Indonesia, Jordan}; one eastern bloc, {China, Cuba, Poland, USSR}; and two western blocs, {USA} and {UK}. The relation types are partitioned into 7 clusters, among which the exports-related activities {reltreaties, booktranslations, relbooktranslations, exports3, relexports} and the NGO-related activities {relintergovorgs, relngo, intergovorgs3, ngoorgs3} are two major clusters that involve connections between the neutral and western blocs. Other top blocks are described in the Supplement.
We compared the goodness-of-fit of various clustering methods on the Brain expression and Nations datasets.
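The goodness-of-fit measure is the proportion of variance explained. The paper does not spell out the exact formula, so the sketch below assumes the usual definition, 1 − RSS/TSS over all tensor entries, with Ŷ the fitted block-mean tensor; the function name is ours.

```python
import numpy as np

def variance_explained(Y, Y_hat):
    """Assumed goodness-of-fit: 1 - RSS/TSS over all tensor entries,
    where Y_hat holds the fitted block means."""
    rss = np.sum((Y - Y_hat) ** 2)   # residual sum of squares
    tss = np.sum((Y - Y.mean()) ** 2)  # total sum of squares about the grand mean
    return 1.0 - rss / tss
```

Under this definition, a perfect fit gives 1 and fitting only the grand mean gives 0, so entries that are close within each estimated block push the measure toward 1.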
Because code for the CoCo method [13] is not yet available, we excluded it from our numerical comparison (see Section 4 for a theoretical comparison with CoCo). Table 3 summarizes the proportion of variance explained by each clustering method:

Dataset           TBM    TBM-sparse  CP+k-means  Tucker+k-means  CoTeC [12]
Brain expression  0.856  0.855       0.576       0.434           0.849
Nations           0.439  0.433       0.324       0.253           0.419

Table 3: Comparison of goodness-of-fit in the Brain expression and Nations datasets.

Our method (TBM) achieves the highest variance proportion, suggesting that the entries within the same cluster are close (i.e., a good clustering). As expected, the sparse TBM results in a slightly lower proportion, because it has a lower model complexity at the cost of a small bias. It is remarkable that the sparse TBM still achieves a higher goodness-of-fit than the other methods. The improved interpretability with little loss of accuracy makes the sparse TBM appealing in applications.

8 Conclusion

We have developed a statistical setting for studying the tensor block model. Under suitable assumptions, the least-square estimator achieves a convergence rate O(∑k dk log Rk), which is faster than previously possible. Our TBM method applies to a broad range of data distributions and can handle both sparse and dense data tensors. We demonstrate the benefit of sparse regularization for the power of detection. In specific applications, prior knowledge may suggest other regularizations for the parameters. For example, in multi-layer network analysis, it may sometimes be reasonable to impose symmetry on the parameters along certain modes. In some other applications, non-negativity of the parameter values may be enforced.
We leave these directions for future study.

Acknowledgements

This research was supported by NSF grant DMS-1915978 and the University of Wisconsin-Madison, Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

References

[1] Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.

[2] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning, volume 11, pages 809–816, 2011.

[3] Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Tensor analyzers. In International Conference on Machine Learning, pages 163–171, 2013.

[4] Miaoyan Wang, Jonathan Fischer, and Yun S Song. Three-way clustering of multi-tissue multi-individual gene expression data using constrained tensor decomposition. Annals of Applied Statistics, in press, 2019.

[5] Victoria Hore, Ana Viñuela, Alfonso Buil, Julian Knight, Mark I McCarthy, Kerrin Small, and Jonathan Marchini. Tensor decomposition for multiple-tissue gene expression experiments. Nature Genetics, 48(9):1094, 2016.

[6] Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927.

[7] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.

[8] Kean Ming Tan and Daniela M Witten. Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics, 23(4):985–1008, 2014.

[9] Tamara G Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining. In 2008 Eighth IEEE International Conference on Data Mining, pages 363–372.
IEEE, 2008.

[10] Chang-Dong Wang, Jian-Huang Lai, and Philip S Yu. Multi-view clustering based on belief propagation. IEEE Transactions on Knowledge and Data Engineering, 28(4):1007–1021, 2015.

[11] Miaoyan Wang and Lexin Li. Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality. arXiv preprint arXiv:1811.05076, 2018.

[12] Stefanie Jegelka, Suvrit Sra, and Arindam Banerjee. Approximation algorithms for tensor clustering. In International Conference on Algorithmic Learning Theory, pages 368–383. Springer, 2009.

[13] Eric C Chi, Brian R Gaines, Will Wei Sun, Hua Zhou, and Jian Yang. Provable convex co-clustering of tensors. arXiv preprint arXiv:1803.06518, 2018.

[14] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[15] Sara C Madeira and Arlindo L Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1(1):24–45, 2004.

[16] Chao Gao and Zongming Ma. Minimax rates in network analysis: Graphon estimation, community detection and hypothesis testing. arXiv preprint arXiv:1811.06055, 2018.

[17] Peter J Bickel and Aiyou Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.

[18] Jing Lei, Kehui Chen, and Brian Lynch. Consistent community detection in multi-layer network data. Biometrika, to appear, 2019.

[19] Subhadeep Paul, Yuguo Chen, et al. Consistent community detection in multi-relational data through restricted multi-layer stochastic blockmodel. Electronic Journal of Statistics, 10(2):3807–3870, 2016.

[20] Robin A Darton. Rotation in factor analysis.
Journal of the Royal Statistical Society: Series D (The Statistician), 29(3):167–194, 1980.

[21] Hervé Abdi. Factor rotations in factor analyses. Encyclopedia for Research Methods for the Social Sciences, Sage: Thousand Oaks, pages 792–795, 2003.

[22] Anru Zhang and Dong Xia. Tensor SVD: Statistical and computational limits. IEEE Transactions on Information Theory, 2018.

[23] Chao Gao, Yu Lu, Zongming Ma, and Harrison H Zhou. Optimal estimation and completion of matrices with biclustering structures. The Journal of Machine Learning Research, 17(1):5602–5630, 2016.

[24] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.

[25] Miaoyan Wang and Yun Song. Tensor decompositions via two-mode higher-order SVD (HOSVD). In Artificial Intelligence and Statistics, pages 614–622, 2017.

[26] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

[27] Nuala A O'Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1):D733–D745, 2015.