{"title": "Tensor Biclustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1311, "page_last": 1320, "abstract": "Consider a dataset where data is collected on multiple features of multiple individuals over multiple times. This type of data can be represented as a three dimensional individual/feature/time tensor and has become increasingly prominent in various areas of science. The tensor biclustering problem computes a subset of individuals and a subset of features whose signal trajectories over time lie in a low-dimensional subspace, modeling similarity among the signal trajectories while allowing different scalings across different individuals or different features. We study the information-theoretic limit of this problem under a generative model. Moreover, we propose an efficient spectral algorithm to solve the tensor biclustering problem and analyze its achievability bound in an asymptotic regime. Finally, we show the efficiency of our proposed method in several synthetic and real datasets.", "full_text": "Tensor Biclustering\n\nSoheil Feizi\n\nStanford University\n\nsfeizi@stanford.edu\n\nHamid Javadi\n\nStanford University\n\nhrhakim@stanford.edu\n\nDavid Tse\n\nStanford University\n\ndntse@stanford.edu\n\nAbstract\n\nConsider a dataset where data is collected on multiple features of multiple individu-\nals over multiple times. This type of data can be represented as a three dimensional\nindividual/feature/time tensor and has become increasingly prominent in various\nareas of science. The tensor biclustering problem computes a subset of individuals\nand a subset of features whose signal trajectories over time lie in a low-dimensional\nsubspace, modeling similarity among the signal trajectories while allowing dif-\nferent scalings across different individuals or different features. We study the\ninformation-theoretic limit of this problem under a generative model. 
Moreover, we propose an efficient spectral algorithm to solve the tensor biclustering problem and analyze its achievability bound in an asymptotic regime. Finally, we show the efficiency of our proposed method on several synthetic and real datasets.

1 Introduction

Let T ∈ R^{n1×n2} be a data matrix whose rows and columns represent individuals and features, respectively. Given T, the matrix biclustering problem aims to find a subset of individuals (i.e., J1 ⊂ {1, 2, ..., n1}) which exhibit similar values across a subset of features (i.e., J2 ⊂ {1, 2, ..., n2}) (Figure 1-a). The matrix biclustering problem has been studied extensively in machine learning and statistics and is closely related to the problems of sub-matrix localization, planted clique, and community detection [1, 2, 3].

In modern datasets, however, instead of collecting data on every individual-feature pair at a single time, we may collect data at multiple times. One can visualize a trajectory over time for each individual-feature pair. This type of dataset has become increasingly prominent in different areas of science. For example, the roadmap epigenomics dataset [4] provides multiple histone modification marks for genome-tissue pairs, the genotype-tissue expression dataset [5] provides expression data on multiple genes for individual-tissue pairs, and there have been recent efforts to collect various omics data in individuals at different times [6].

Suppose we have n1 individuals, n2 features, and we collect data for every individual-feature pair at m different times. This data can be represented as a three dimensional tensor T ∈ R^{n1×n2×m} (Figure 1-b). The tensor biclustering problem aims to compute a subset of individuals and a subset of features whose trajectories are highly similar. Similarity is modeled as the trajectories lying in a low-dimensional (say, one-dimensional) subspace (Figure 1-d). 
This definition allows different scalings across different individuals or different features, and is important in many applications such as omics datasets [6], because individual-feature trajectories often have their own intrinsic scalings. In particular, at each time the individual-feature data matrix may not exhibit a matrix bicluster separately. This means that repeated applications of matrix biclustering cannot solve the tensor biclustering problem. Moreover, for the same reason, trajectories in a bicluster can have large distances among themselves (Figure 1-d). Thus, a distance-based clustering of signal trajectories is likely to fail as well.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: (a) The matrix biclustering problem. (b) The tensor biclustering problem. (c) The tensor triclustering problem. (d) A visualization of a bicluster in a three dimensional tensor. Trajectories in the bicluster (red points) form a low dimensional subspace.

This problem formulation has two main differences with tensor triclustering, which is a natural generalization of matrix biclustering to a three dimensional tensor (Figure 1-c). Firstly, unlike tensor triclustering, tensor biclustering has an asymmetric structure along tensor dimensions, inspired by the aforementioned applications. That is, since a tensor bicluster is defined as a subset of individuals and a subset of features with similar trajectories, the third dimension of the tensor (i.e., the time dimension) plays a different role compared to the other two dimensions. This is in contrast with tensor triclustering, where there is no such difference between the roles of the tensor dimensions in defining the cluster. 
Secondly, in tensor biclustering, the notion of a cluster is defined in terms of trajectories lying in a low-dimensional subspace, while in tensor triclustering, a cluster is defined as a sub-cube with similar entries.

Finding statistically significant patterns in multi-dimensional data tensors has been studied in dimensionality reduction [7, 8, 9, 10, 11, 12, 13, 14] and topic modeling [15, 16, 17], among others. One related model is the spiked tensor model [7]. Unlike the tensor biclustering model, which is asymmetric along tensor dimensions, the spiked tensor model has a symmetric structure. Computational and statistical limits for the spiked tensor model have been studied in [8, 9, 10, 14], among others. For more details, see Supplementary Materials (SM) Section 1.3.

In this paper, we study information-theoretic and computational limits for the tensor biclustering problem under a statistical model described in Section 2. From a computational perspective, we present four polynomial time methods and analyze their asymptotic achievability bounds. In particular, one of our proposed methods, namely tensor folding+spectral, outperforms the other methods both theoretically (under realistic model parameters) and numerically in several synthetic and real data experiments. Moreover, we characterize a fundamental limit below which no algorithm can solve the tensor biclustering problem reliably in a minimax sense. We show that above this limit, a maximum likelihood estimator (MLE), which has an exponential computational complexity, can solve this problem with vanishing error probability.

1.1 Notation

We use T, X, and Z to represent input, signal, and noise tensors, respectively. For any set J, |J| denotes its cardinality. [n] represents the set {1, 2, ..., n}, and J̄ = [n] − J. ‖x‖2 = (x^t x)^{1/2} is the ℓ2 norm of the vector x. x ⊗ y is the Kronecker product of two vectors x and y. 
The asymptotic notation a(n) = O(b(n)) means that there exists a universal constant c such that, for sufficiently large n, we have |a(n)| < c b(n). If there exists c > 0 such that a(n) = O(b(n) log(n)^c), we use the notation a(n) = Õ(b(n)). The asymptotic notations a(n) = Ω(b(n)) and a(n) = Ω̃(b(n)) are the same as b(n) = O(a(n)) and b(n) = Õ(a(n)), respectively. Moreover, we write a(n) = Θ(b(n)) iff a(n) = Ω(b(n)) and b(n) = Ω(a(n)). Similarly, we write a(n) = Θ̃(b(n)) iff a(n) = Ω̃(b(n)) and b(n) = Ω̃(a(n)).

2 Problem Formulation

Let T = X + Z, where X is the signal tensor and Z is the noise tensor. Consider

T = X + Z = Σ_{r=1}^{q} σ_r u_r^{(J1)} ⊗ w_r^{(J2)} ⊗ v_r + Z,   (1)

where u_r^{(J1)} and w_r^{(J2)} have zero entries outside of the index sets J1 and J2, respectively. We assume σ1 ≥ σ2 ≥ ... ≥ σq > 0. Under this model, the trajectories X(J1, J2, :) form an at most q-dimensional subspace. We assume q ≪ min(m, |J1| × |J2|).

Definition 1 (Tensor Biclustering). The tensor biclustering problem aims to compute the bicluster index sets (J1, J2) given T generated according to (1).

In this paper, we make the following simplifying assumptions: we assume q = 1, n = n1 = n2, and k = |J1| = |J2|. To simplify notation, we drop the superscripts (J1) and (J2) from u_r^{(J1)} and w_r^{(J2)}, respectively. Without loss of generality, we normalize the signal vectors such that ‖u1‖ = ‖w1‖ = ‖v1‖ = 1. 
Moreover, we assume that for every (j1, j2) ∈ J1 × J2, Δ ≤ u1(j1) ≤ cΔ and Δ ≤ w1(j2) ≤ cΔ, where c is a constant. Under these assumptions, a signal trajectory can be written as X(j1, j2, :) = u1(j1) w1(j2) v1. The scaling of this trajectory depends on the row- and column-specific parameters u1(j1) and w1(j2). Note that our analysis can be extended naturally to the more general setup of multiple embedded biclusters with q > 1. We discuss this in Section 7.

Next we describe the noise model. If (j1, j2) ∉ J1 × J2, we assume that the entries of the noise trajectory Z(j1, j2, :) are i.i.d. and each entry has a standard normal distribution. If (j1, j2) ∈ J1 × J2, we assume that the entries of Z(j1, j2, :) are i.i.d. and each entry has a Gaussian distribution with zero mean and variance σ_z^2. We analyze the tensor biclustering problem under two noise models for σ_z^2:

- Noise Model I: In this model, we assume σ_z^2 = 1, i.e., the variance of the noise within and outside of the bicluster is assumed to be the same. This is the noise model often considered in the analysis of sub-matrix localization [2, 3] and tensor PCA [7, 8, 9, 10, 11, 12, 14]. Although this model simplifies the analysis, it has the following drawback: under this noise model, for every value of σ1, the average trajectory length in the bicluster is larger than the average trajectory length outside of the bicluster. See SM Section 1.2 for more details.

- Noise Model II: In this model, we assume σ_z^2 = max(0, 1 − σ1^2/(mk^2)), i.e., σ_z^2 is modeled to minimize the difference between the average trajectory lengths within and outside of the bicluster. If σ1^2 < mk^2, noise is added to make the average trajectory lengths within and outside of the bicluster comparable. See SM Section 1.2 for more details.

3 Computational Limits of the Tensor Biclustering Problem

3.1 Tensor Folding+Spectral

Recall the formulation of the tensor biclustering problem (1). Let

T_{(j1,1)} ≜ T(j1, :, :)   and   T_{(j2,2)} ≜ T(:, j2, :),   (2)

be the horizontal (first mode) and lateral (second mode) matrix slices of the tensor T, respectively. One way to learn the embedded bicluster in the tensor is to compute row and column indices whose trajectories are highly correlated with each other. To do that, we compute

C1 ≜ Σ_{j2=1}^{n} T_{(j2,2)} T_{(j2,2)}^t   and   C2 ≜ Σ_{j1=1}^{n} T_{(j1,1)} T_{(j1,1)}^t.   (3)

Figure 2: A visualization of the tensor folding+spectral algorithm (Algorithm 1) to compute the bicluster index set J2. The bicluster index set J1 can be computed similarly.

Algorithm 1 Tensor Folding+Spectral
Input: T, k
Compute û1, the top eigenvector of C1
Compute ŵ1, the top eigenvector of C2
Compute Ĵ1, the indices of the k largest values of |û1|
Compute Ĵ2, the indices of the k largest values of |ŵ1|
Output: Ĵ1 and Ĵ2

C1 represents a combined covariance matrix along the tensor columns (Figure 2). We refer to it as the folded tensor over the columns. If there were no noise, this matrix would be equal to σ1^2 u1 u1^t. Thus, its eigenvector corresponding to the largest eigenvalue would be equal to u1. On the other hand, we have u1(j1) = 0 if j1 ∉ J1 and |u1(j1)| > Δ otherwise. Therefore, selecting the k indices of the top eigenvector with largest magnitudes would recover the index set J1. With added noise, the top eigenvector of the folded tensor is a perturbed version of u1; nevertheless, one can estimate J1 similarly (Algorithm 1). A similar argument holds for C2.

Theorem 1. 
Let û1 and ŵ1 be the top eigenvectors of C1 and C2, respectively. Under both noise models I and II,

- for m < Õ(√n), if σ1^2 = Ω̃(n),
- for m = Ω̃(√n), if σ1^2 = Ω̃(√n · max(n, m)),

as n → ∞, with high probability we have |û1(j1)| > |û1(j1′)| and |ŵ1(j2)| > |ŵ1(j2′)| for every j1 ∈ J1, j1′ ∈ J̄1, j2 ∈ J2, and j2′ ∈ J̄2.

In the proof of Theorem 1, following the result of [18] for a Wigner noise matrix, we have proved an l∞ version of the Davis-Kahan lemma for a Wishart noise matrix. This lemma can be of independent interest to the readers.

3.2 Tensor Unfolding+Spectral

Let T_unfolded ∈ R^{m×n^2} be the unfolded tensor T such that T_unfolded(:, (j1 − 1)n + j2) = T(j1, j2, :) for 1 ≤ j1, j2 ≤ n. Without noise, the top right singular vector of this matrix is u1 ⊗ w1, which corresponds to the singular value σ1. Therefore, selecting the k^2 indices of this singular vector with largest magnitudes would recover the index set J1 × J2. With added noise, however, the top singular vector of the unfolded tensor is perturbed; nevertheless, one can estimate J1 × J2 similarly (SM Section 2).

Theorem 2. Let x̂ be the top right singular vector of T_unfolded. 
Under both noise models I and II, if σ1^2 = Ω̃(max(n^2, m)), then as n → ∞, with high probability we have |x̂(j′)| < |x̂(j)| for every j in the bicluster and j′ outside of the bicluster.

3.3 Thresholding Sum of Squared and Individual Trajectory Lengths

If the average trajectory length in the bicluster is larger than the one outside of the bicluster, methods based on trajectory length statistics can be successful in solving the tensor biclustering problem. One such method is thresholding individual trajectory lengths: we select the k^2 indices (j1, j2) with the largest trajectory lengths ‖T(j1, j2, :)‖ (SM Section 2).

Theorem 3. As n → ∞, with high probability, Ĵ1 = J1 and Ĵ2 = J2

- if σ1^2 = Ω̃(k^2 √m), under noise model I;
- if σ1^2 = Ω̃(mk^2), under noise model II.

Another method to solve the tensor biclustering problem is thresholding the sum of squared trajectory lengths: we select the k row indices with the largest sums of squared trajectory lengths along the columns as an estimate of J1, and estimate J2 similarly (SM Section 2).

Theorem 4. As n → ∞, with high probability, Ĵ1 = J1 and Ĵ2 = J2

- if σ1^2 = Ω̃(k√(nm)), under noise model I;
- if σ1^2 = Ω̃(mk^2 + k√(nm)), under noise model II.

4 Statistical (Information-Theoretic) Limits of the Tensor Biclustering Problem

4.1 Coherent Case

In this section, we study a statistical (information-theoretic) boundary for the tensor biclustering problem under the following statistical model: we assume u1(j1) = 1/√k for j1 ∈ J1 and, similarly, w1(j2) = 1/√k for j2 ∈ J2. Moreover, we assume v1 is a fixed given vector with ‖v1‖ = 1. 
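As a concrete illustration of this coherent generative model, a synthetic instance can be sampled as in the following sketch; the function name, default parameters, and the one-time random draw of the (then fixed) unit vector v1 are our illustrative assumptions, and noise model I (unit-variance Gaussian noise everywhere) is assumed:

```python
import numpy as np

def coherent_instance(n=200, m=50, k=40, sigma1=300.0, seed=0):
    """Sample T = sigma1 * u1 (x) w1 (x) v1 + Z under the coherent model,
    with u1 = 1/sqrt(k) on J1, w1 = 1/sqrt(k) on J2 (noise model I)."""
    rng = np.random.default_rng(seed)
    J1 = rng.choice(n, size=k, replace=False)   # hidden row bicluster
    J2 = rng.choice(n, size=k, replace=False)   # hidden column bicluster
    u1 = np.zeros(n); u1[J1] = 1.0 / np.sqrt(k)
    w1 = np.zeros(n); w1[J2] = 1.0 / np.sqrt(k)
    v1 = rng.standard_normal(m); v1 /= np.linalg.norm(v1)  # fixed unit vector
    X = sigma1 * np.einsum('i,j,t->ijt', u1, w1, v1)       # rank-one signal
    T = X + rng.standard_normal((n, n, m))                 # i.i.d. N(0,1) noise
    return T, set(J1.tolist()), set(J2.tolist())
```

Instances of this form are what the synthetic experiments of Section 6.1 are run on.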
In the next section, we consider a non-coherent model where v1 is random and unknown. Let T be an observed tensor from the tensor biclustering model (J1, J2), and let Jall be the set of all possible (J1, J2); thus, |Jall| = (n choose k)^2. A maximum likelihood estimator (MLE) for the tensor biclustering problem can be written as:

max_{(Ĵ1, Ĵ2) ∈ Jall}  Σ_{(j1,j2) ∈ Ĵ1×Ĵ2} ( v1^t T(j1, j2, :) − (k(1 − σ_z^2)/(2σ1)) ‖T(j1, j2, :)‖^2 ).   (4)

Note that under noise model I, the second term is zero. To solve this optimization, one needs to compute the likelihood function for (n choose k)^2 possible bicluster indices. Thus, the computational complexity of the MLE is exponential in n.

Theorem 5. Under noise model I, if σ1^2 = Ω̃(k), then as n → ∞, with high probability, (J1, J2) is the optimal solution of optimization (4). A similar result holds under noise model II if mk = Ω(log(n/k)).

Next, we establish an upper bound on σ1^2 under which no computational method can solve the tensor biclustering problem with vanishing probability of error. This upper bound matches the MLE achievability bound of Theorem 5, indicating its tightness.

Theorem 6. Let T be an observed tensor from the tensor biclustering model with bicluster indices (J1, J2). Let A be an algorithm that uses T and computes (Ĵ1, Ĵ2). 
Under noise model I, for any fixed 0 < α < 1, if σ1^2 < c_α k log(n/k), then as n → ∞, we have

inf_{A ∈ AllAlg}  sup_{(J1,J2) ∈ Jall}  P[ Ĵ1 ≠ J1 or Ĵ2 ≠ J2 ] > 1 − α − log(2)/(2k log(ne/k)).   (5)

A similar result holds under noise model II if mk = Ω(log(n/k)).

4.2 Non-coherent Case

In this section we consider a setup similar to that of Section 4.1, with the difference that v1 is assumed to be uniformly distributed over the unit sphere. For simplicity, in this section we only consider noise model I. The ML optimization in this setup can be written as follows:

max_{(Ĵ1, Ĵ2) ∈ Jall}  ‖ Σ_{(j1,j2) ∈ Ĵ1×Ĵ2} T(j1, j2, :) ‖^2.   (6)

Theorem 7. Under noise model I, if σ1^2 = Ω̃(max(k, √(km))), then as n → ∞, with high probability, (J1, J2) is the optimal solution of optimization (6).

If k > Ω(m), the achievability bound of Theorem 7 simplifies to that of Theorem 5; in this case, using the result of Theorem 6, this bound is tight. If k < O(m), the achievability bound of Theorem 7 simplifies to Ω̃(√(mk)), which is larger than that of Theorem 5 (this is the price we pay for not knowing v1). In the following, we show that this bound is also tight.

To show the converse of Theorem 7, we consider the detection task, which is presumably easier than the estimation task. Consider two probability distributions: (1) P_{σ1}, under which the observed tensor is T = σ1 u1 ⊗ w1 ⊗ v1 + Z, where J1 and J2 have uniform distributions over k-subsets of [n] and v1 is uniform over the unit sphere; and (2) P_0, under which the observed tensor is T = Z. Noise entries are i.i.d. normal. We need the following definition of contiguous distributions ([8]):

Definition 2. 
For every n ∈ N, let P_{0,n} and P_{1,n} be two probability measures on the same measure space. We say that the sequence (P_{1,n}) is contiguous with respect to (P_{0,n}) if, for any sequence of events A_n, we have

lim_{n→∞} P_{0,n}(A_n) = 0  ⇒  lim_{n→∞} P_{1,n}(A_n) = 0.   (7)

Theorem 8. If σ1^2 < Õ(√(mk)), P_{σ1} is contiguous with respect to P_0.

This theorem, together with Lemma 2 of [8], establishes the converse of Theorem 7. The proof is based on bounding the second moment of the Radon-Nikodym derivative of P_{σ1} with respect to P_0 (SM Section 4.9).

5 Summary of Asymptotic Results

Table 1 summarizes asymptotic bounds for the case of Δ = 1/√k and m = Θ(n). For the MLE we consider the coherent model of Section 4.1. Table 1 also summarizes the computational complexity of the different tensor biclustering methods. We discuss the analytical and empirical running times of these methods in SM Section 2.2.

Table 1: Comparative analysis of tensor biclustering methods. Results have been simplified for the case of m = Θ(n) and Δ = 1/√k.

| Methods | σ1^2, noise model I | σ1^2, noise model II | Comp. Complexity |
| Tensor Folding+Spectral | Ω̃(n^{3/2}) | Ω̃(n^{3/2}) | O(n^4) |
| Tensor Unfolding+Spectral | Ω̃(n^2) | Ω̃(n^2) | O(n^3) |
| Th. Sum of Squared Trajectory Lengths | Ω̃(nk) | Ω̃(nk^2) | O(n^3) |
| Th. Individual Trajectory Lengths | Ω̃(k^2 √n) | Ω̃(nk^2) | O(n^3) |
| Maximum Likelihood | Ω̃(k) | Ω̃(k) | exp(n) |
| Statistical Lower Bound | Õ(k) | Õ(k) | - |

Figure 3: Performance of different tensor biclustering methods for various values of σ1 (i.e., the signal strength), under both noise models I and II. We consider n = 200, m = 50, k = 40. 
Experiments have been repeated 10 times for each point.

In both noise models, the maximum likelihood estimator, which has an exponential computational complexity, leads to the best achievability bound compared to the other methods. Below this bound, the inference is statistically impossible. The tensor folding+spectral method outperforms the other methods with polynomial computational complexity if k > √n under noise model I, and if k > n^{1/4} under noise model II. For smaller values of k, thresholding individual trajectory lengths leads to a better achievability bound. This case is part of the high-SNR regime where the average trajectory length within the bicluster is significantly larger than the one outside of the bicluster. Unlike thresholding individual trajectory lengths, the other methods use the entire tensor to solve the tensor biclustering problem; thus, when k is very small, the accumulated noise can dominate the signal strength. Moreover, the performance of the tensor unfolding method is always worse than that of the tensor folding method. The reason is that the tensor unfolding method merely infers a low dimensional subspace of trajectories, ignoring the block structure that the true low dimensional trajectories form.

6 Numerical Results

6.1 Synthetic Data

In this section we evaluate the performance of different tensor biclustering methods on synthetic datasets. We use the statistical model described in Section 4.1 to generate the input tensor T. Let (Ĵ1, Ĵ2) be the estimates of the bicluster indices (J1, J2), where |Ĵ1| = |Ĵ2| = k. To evaluate the inference quality, we compute the fraction of correctly recovered bicluster indices (SM Section 3.1).

In our simulations we consider n = 200, m = 50, k = 40. Figure 3 shows the performance of four tensor biclustering methods for different values of σ1 (i.e., the signal strength), under both noise models I and II. 
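The recovery metric used here, the fraction of correctly recovered bicluster indices, can be computed with a small helper like the following (a hypothetical implementation for illustration, not the authors' SM code; it assumes |Ĵ1| = |J1| and |Ĵ2| = |J2| = k):

```python
def recovery_rate(J1_hat, J2_hat, J1, J2):
    """Fraction of correctly recovered bicluster indices, averaged over
    the row index set and the column index set (both of size k)."""
    k = len(J1)
    hits = len(set(J1_hat) & set(J1)) + len(set(J2_hat) & set(J2))
    return hits / (2 * k)
```

A perfect recovery gives 1.0; chance-level recovery of k-out-of-n indices gives roughly k/n.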
The tensor folding+spectral algorithm outperforms the other methods under both noise models. The gain is larger under noise model II than under noise model I.

6.2 Real Data

In this section we apply tensor biclustering methods to the roadmap epigenomics dataset [4], which provides histone mark signal strengths in different segments of the human genome in various tissues and cell types. In this dataset, finding a subset of genome segments and a subset of tissues (cell types) with highly correlated histone mark values can provide insight on tissue-specific functional roles of genome segments [4]. After pre-processing the data (SM Section 3.2), we obtain a data tensor T ∈ R^{n1×n2×m}, where n1 = 49 is the number of tissues (cell types), n2 = 1457 is the number of genome segments, and m = 7 is the number of histone marks. Note that although in our analytical results we assume n1 = n2 for simplicity, our proposed methods can be used in the more general case considered in this section.

Figure 4: An application of tensor biclustering methods to the roadmap epigenomics data.

We form two combined covariance matrices C1 ∈ R^{n1×n1} and C2 ∈ R^{n2×n2} according to (3). Figure 4-(a,b) shows the largest eigenvalues of C1 and C2, respectively. As illustrated in these figures, the spectral gaps (i.e., λ1 − λ2) of these matrices are large, indicating the existence of a low dimensional signal tensor in the input tensor. We also form an unfolded tensor T_unfolded ∈ R^{m×n1n2}. 
Similarly, there is a large gap between the first and second largest singular values of T_unfolded (Figure 4-c). We use the tensor folding+spectral algorithm (Algorithm 1) with |J1| = 10 and |J2| = 400 (we consider other values for the bicluster size in SM Section 3.2). The output of the algorithm (Ĵ1, Ĵ2) is illustrated in Figure 4-d (for visualization purposes, we re-order rows and columns so that the bicluster appears in the corner). Figure 4-e illustrates the unfolded subspace {T(j1, j2, :) : (j1, j2) ∈ Ĵ1 × Ĵ2}. In this inferred bicluster, the histone marks H3K4me3, H3K9ac, and H3K27ac have relatively high values. Reference [4] shows that these histone marks indicate a promoter region with increased activation in the genome.

To evaluate the quality of the inferred bicluster, we compute the total absolute pairwise correlation among the vectors in the inferred bicluster. As illustrated in Figure 4-f, the quality of the bicluster inferred by the tensor folding+spectral algorithm is higher than that of the other methods. Next, we compute the bicluster quality when bicluster indices are chosen uniformly at random with the same cardinality, repeating this experiment 100 times. There is a significant gap between the quality of these random biclusters and that of the biclusters inferred by the tensor biclustering methods, indicating the significance of our inferred biclusters. For more details on these experiments, see SM Section 3.2.

7 Discussion

In this paper, we introduced and analyzed the tensor biclustering problem. The goal is to compute a subset of tensor rows and columns whose corresponding trajectories form a low dimensional subspace. To solve this problem, we proposed a method called tensor folding+spectral, which demonstrated improved analytical and empirical performance compared to the other considered methods. 
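As a concrete illustration, the folding+spectral procedure (Algorithm 1) can be sketched in a few lines of NumPy. This is a minimal sketch under our reading of (3), assuming a dense tensor T of shape (n1, n2, m) and a known bicluster size k; the function name is ours, not from the released code:

```python
import numpy as np

def tensor_folding_spectral(T, k):
    """Fold the tensor into the combined covariance matrices C1, C2,
    take their top eigenvectors, and keep the k largest-magnitude entries."""
    n1, n2, m = T.shape
    # C1 = sum over columns j2 of T(:, j2, :) T(:, j2, :)^t  (n1 x n1)
    C1 = np.einsum('ajt,bjt->ab', T, T)
    # C2 = sum over rows j1 of T(j1, :, :) T(j1, :, :)^t     (n2 x n2)
    C2 = np.einsum('iat,ibt->ab', T, T)
    u1_hat = np.linalg.eigh(C1)[1][:, -1]   # top eigenvector of C1
    w1_hat = np.linalg.eigh(C2)[1][:, -1]   # top eigenvector of C2
    J1_hat = np.argsort(-np.abs(u1_hat))[:k]
    J2_hat = np.argsort(-np.abs(w1_hat))[:k]
    return set(J1_hat.tolist()), set(J2_hat.tolist())
```

On a noiseless rank-one tensor, C1 reduces to σ1^2 u1 u1^t and the selected indices recover (J1, J2) exactly, matching the discussion following (3).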
Moreover, we characterized computational and statistical (information-theoretic) limits for the tensor biclustering problem in an asymptotic regime, under both coherent and non-coherent statistical models.

Our results consider the case where the rank of the subspace is equal to one (i.e., q = 1). When q > 1, in both the tensor folding+spectral and tensor unfolding+spectral methods, the embedded subspace in the signal matrix will have rank q > 1, with singular values σ1 ≥ σ2 ≥ ... ≥ σq > 0. In this setup, we need the spectral radius of the noise matrix to be smaller than σq in order to guarantee the recovery of the subspace. The procedure to characterize asymptotic achievability bounds would follow similar steps to the rank one case, with some technical differences; for example, we would need to extend Lemma 6 to the case where the signal matrix has rank q > 1. Moreover, in our problem setup, we assumed that the size of the bicluster k and the rank of its subspace q are known parameters. In practice, these parameters can be learned approximately from the data. For example, in the tensor folding+spectral method, a good choice for the parameter q would be the index at which the eigenvalues of the folded matrix decrease significantly. Knowing q, one can determine the size of the bicluster similarly, as the number of indices in the top eigenvectors with significantly larger absolute values. 
Another practical approach to estimating the model parameters would be trial and error plus cross-validation.

Some of the developed proof techniques may be of independent interest as well. For example, we proved an l∞ version of the Davis-Kahan lemma for a Wishart noise matrix. Solving the tensor biclustering problem for the case of multiple overlapping biclusters, for the case of an incomplete tensor, and for the case of a priori unknown bicluster sizes are among future directions.

8 Code

We provide code for tensor biclustering methods at the following link: https://github.com/SoheilFeizi/Tensor-Biclustering.

9 Acknowledgment

We thank Prof. Ofer Zeitouni for the helpful discussion on detectability proof techniques for probability measures.

References

[1] Amos Tanay, Roded Sharan, and Ron Shamir. Biclustering algorithms: A survey. Handbook of computational molecular biology, 9(1-20):122–124, 2005.

[2] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267, 2014.

[3] T. Tony Cai, Tengyuan Liang, and Alexander Rakhlin. Computational and statistical boundaries for submatrix localization in a large noisy matrix. arXiv preprint arXiv:1502.01988, 2015.

[4] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J. Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.

[5] GTEx Consortium et al. The genotype-tissue expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235):648–660, 2015.

[6] Rui Chen, George I. Mias, Jennifer Li-Pook-Than, Lihua Jiang, Hugo Y. K. Lam, Rong Chen, Elana Miriami, Konrad J. Karczewski, Manoj Hariharan, Frederick E. Dewey, et al. 
Personal\nomics pro\ufb01ling reveals dynamic molecular and medical phenotypes. Cell, 148(6):1293\u20131307,\n2012.\n\n[7] Emile Richard and Andrea Montanari. A statistical model for tensor pca. In Advances in Neural\n\nInformation Processing Systems, pages 2897\u20132905, 2014.\n\n[8] Andrea Montanari, Daniel Reichman, and Ofer Zeitouni. On the limitation of spectral methods:\nFrom the gaussian hidden clique problem to rank-one perturbations of gaussian tensors. In\nAdvances in Neural Information Processing Systems, pages 217\u2013225, 2015.\n\n[9] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms\nfrom sum-of-squares proofs: tensor decomposition and planted sparse vectors. arXiv preprint\narXiv:1512.02337, 2015.\n\n9\n\n\f[10] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via\n\nsum-of-square proofs. In COLT, pages 956\u20131006, 2015.\n\n[11] Amelia Perry, Alexander S Wein, and Afonso S Bandeira. Statistical limits of spiked tensor\n\nmodels. arXiv preprint arXiv:1612.07728, 2016.\n\n[12] Thibault Lesieur, L\u00e9o Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborov\u00e1. Sta-\ntistical and computational phase transitions in spiked tensor estimation. arXiv preprint\narXiv:1701.08010, 2017.\n\n[13] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor\n\ndecomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014.\n\n[14] Anru Zhang and Dong Xia. Guaranteed tensor pca with optimality in statistics and computation.\n\narXiv preprint arXiv:1703.02724, 2017.\n\n[15] Animashree Anandkumar, Rong Ge, Daniel J Hsu, and Sham M Kakade. A tensor approach\nto learning mixed membership community models. Journal of Machine Learning Research,\n15(1):2239\u20132312, 2014.\n\n[16] Animashree Anandkumar, Rong Ge, Daniel J Hsu, Sham M Kakade, and Matus Telgarsky.\nTensor decompositions for learning latent variable models. 
Journal of Machine Learning\nResearch, 15(1):2773\u20132832, 2014.\n\n[17] Victoria Hore, Ana Vi\u00f1uela, Alfonso Buil, Julian Knight, Mark I McCarthy, Kerrin Small, and\nJonathan Marchini. Tensor decomposition for multiple-tissue gene expression experiments.\nNature Genetics, 48(9):1094\u20131100, 2016.\n\n[18] Yiqiao Zhong and Nicolas Boumal. Near-optimal bounds for phase synchronization. arXiv\n\npreprint arXiv:1703.06605, 2017.\n\n10\n\n\f", "award": [], "sourceid": 867, "authors": [{"given_name": "Soheil", "family_name": "Feizi", "institution": "Stanford University"}, {"given_name": "Hamid", "family_name": "Javadi", "institution": "Stanford University"}, {"given_name": "David", "family_name": "Tse", "institution": "Stanford University"}]}