{"title": "Minimax Localization of Structural Information in Large Noisy Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 909, "page_last": 917, "abstract": "We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions: i) We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. ii) We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 
iii) We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition.", "full_text": "Minimax Localization of Structural Information in Large Noisy Matrices

Mladen Kolar†∗ (mladenk@cs.cmu.edu), Sivaraman Balakrishnan†∗ (sbalakri@cs.cmu.edu), Alessandro Rinaldo†† (arinaldo@stat.cmu.edu), Aarti Singh† (aarti@cs.cmu.edu)
† School of Computer Science and †† Department of Statistics, Carnegie Mellon University

Abstract

We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc., is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions:

1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest.

2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit.

3. 
We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition.

1 Introduction

Biclustering is the problem of identifying a (typically) sparse set of relevant columns and rows in a large, noisy data matrix. This problem along with the first algorithm to solve it were proposed by Hartigan [14] as a way to directly cluster data matrices to produce clusters with greater interpretability. Biclustering routinely arises in several applications such as discovering groups of proteins and drugs that interact with each other [19], learning phylogenetic relationships between different species based on alignments of snippets of their gene sequences [30], identifying malware that have similar signatures [7] and identifying groups of users with similar tastes for commercial products [29]. In these applications, the data matrix is often indexed by (object, feature) pairs and the goal is to identify clusters in this set of bipartite variables.

In standard clustering problems, the goal is only to identify meaningful groups of objects and the methods typically use the entire feature vector to define a notion of similarity between the objects.

∗ These authors contributed equally to this work.

Biclustering can be thought of as high-dimensional clustering where only a subset of the features are relevant for identifying similar objects, and the goal is to identify not only groups of objects that are similar, but also which features are relevant to the clustering task. Consider, for instance, gene expression data where the objects correspond to genes, and the features correspond to their expression levels under a variety of experimental conditions. 
Our present understanding of biological systems leads us to expect that subsets of genes will be co-expressed only under a small number of experimental conditions. Although pairs of genes are not expected to be similar under all experimental conditions, it is critical to be able to discover local expression patterns, which can for instance correspond to joint participation in a particular biological pathway or process. Thus, while clustering aims to identify global structure in the data, biclustering takes a more local approach by jointly clustering both objects and features.

Prevalent techniques for finding biclusters are typically heuristic procedures with little or no theoretical underpinning. In order to study, understand and compare biclustering algorithms we consider a simple theoretical model of biclustering [18, 17, 26]. This model is akin to the spiked covariance model of [15] widely used in the study of PCA in high dimensions.

We will focus on the following simple observation model for the matrix A ∈ R^{n1×n2}:

A = βuv′ + Δ,   (1)

where Δ = {Δij}_{i∈[n1], j∈[n2]} is a random matrix whose entries are i.i.d. N(0, σ²) with σ² > 0 known, u = {ui : i ∈ [n1]} and v = {vi : i ∈ [n2]} are unknown deterministic unit vectors in R^{n1} and R^{n2}, respectively, and β > 0 is a constant. To simplify the presentation, we assume that u ∝ {0, 1}^{n1} and v ∝ {0, 1}^{n2}. Let K1 = {i : ui ≠ 0} and K2 = {i : vi ≠ 0} be the sets indexing the non-zero components of the vectors u and v, respectively. We assume that u and v are sparse, that is, k1 := |K1| ≪ n1 and k2 := |K2| ≪ n2. While the sets (K1, K2) are unknown, we assume that their cardinalities are known. Notice that the magnitude of the signal for all the coordinates in the bicluster K1 × K2 is β/√(k1k2). 
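For concreteness, data from model (1) can be simulated directly from this description. The sketch below is ours (plain Python; the function name and the particular parameter values are illustrative, not from the paper):

```python
import math
import random

def sample_bicluster_matrix(n1, n2, K1, K2, beta, sigma, seed=0):
    """Draw A = beta * u v' + noise as in model (1), where u and v are the
    normalized 0/1 indicator vectors of the row set K1 and column set K2,
    so every entry of the bicluster K1 x K2 has mean beta / sqrt(k1 * k2)."""
    rng = random.Random(seed)
    k1, k2 = len(K1), len(K2)
    per_entry = beta / math.sqrt(k1 * k2)  # signal magnitude per bicluster entry
    return [[per_entry * (i in K1) * (j in K2) + rng.gauss(0.0, sigma)
             for j in range(n2)] for i in range(n1)]

A = sample_bicluster_matrix(8, 10, {1, 2}, {0, 3, 4}, beta=20.0, sigma=1.0)
```

Note that β is spread over the k1k2 bicluster entries, which is what makes small biclusters with the same total signal strength harder to detect entry by entry.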
The parameter β measures the strength of the signal, and is the key quantity we will be studying.

We focus on the case of a single bicluster that appears as an elevated sub-matrix of size k1 × k2 with signal strength β embedded in a large n1 × n2 data matrix with entries corrupted by additive Gaussian noise with variance σ². Under this model, the biclustering problem is formulated as the problem of estimating the sets K1 and K2, based on a single noisy observation A of the unknown signal matrix βuv′. Biclustering is most subtle when the matrix is large with several irrelevant variables, the entries are highly noisy, and the bicluster is small as defined by a sparse set of rows/columns. We provide a sharp characterization of tuples of (β, n1, n2, k1, k2, σ²) under which it is possible to recover the bicluster, and we study several common methods and establish the regimes under which they succeed.

We establish minimax lower and upper bounds for the following class of models. Let

Θ(β0, k1, k2) := {(β, K1, K2) : β ≥ β0, |K1| = k1, K1 ⊂ [n1], |K2| = k2, K2 ⊂ [n2]}   (2)

be a set of parameters. For a parameter θ ∈ Θ, let P_θ denote the joint distribution of the entries of A = {aij}_{i∈[n1], j∈[n2]}, whose density with respect to the Lebesgue measure is

∏_{i∈[n1], j∈[n2]} N(aij; β(k1k2)^{−1/2} 1I{i ∈ K1, j ∈ K2}, σ²),   (3)

where the notation N(z; μ, σ²) denotes the density of a Gaussian random variable with mean μ and variance σ², and 1I denotes the indicator function.

We derive a lower bound that identifies tuples of (β, n1, n2, k1, k2, σ²) under which we can recover the true biclustering from a noisy high dimensional matrix. We show that a combinatorial procedure based on the scan statistic achieves the minimax optimal limits; however, it is impractical as it requires enumerating all possible sub-matrices of a given size in a large matrix. We analyze the scalings (i.e. 
the relation between β and (n1, n2, k1, k2, σ²)) under which some computationally tractable procedures for biclustering, including element-wise thresholding, column/row average thresholding and sparse singular vector decomposition (SSVD), succeed with high probability.

We consider the detection of both small and large biclusters of weak activation, and show that at the minimax scaling the problem is surprisingly subtle (e.g., even detecting big clusters is quite hard).

In Table 1, we describe our main findings and compare the scalings under which the various algorithms succeed.

Algorithm:       Combinatorial | Thresholding | Row/Column Averaging                        | Sparse SVD
SNR scaling:     Minimax       | Weak         | Intermediate                                | Weak
Bicluster size:  Any           | Any          | (n1^{1/2+α} × n2^{1/2+α}), α ∈ (0, 1/2)     | Any
Reference:       Theorem 2     | Theorem 3    | Theorem 4                                   | Theorem 5

where the scalings are:

1. Minimax: β ≍ σ max( √(k1 log(n1 − k1)), √(k2 log(n2 − k2)) )
2. Weak: β ≍ σ max( √(k1k2 log(n1 − k1)), √(k1k2 log(n2 − k2)) )
3. Intermediate (for large clusters): β ≍ σ max( √(k1k2 log(n1 − k1)) / n2^α, √(k1k2 log(n2 − k2)) / n1^α )

Element-wise thresholding does not take advantage of any structure in the data matrix and hence does not achieve the minimax scaling for any bicluster size. If the clusters are big enough, row/column averaging performs better than element-wise thresholding since it can take advantage of structure. We also study a convex relaxation for sparse SVD, based on the DSPCA algorithm proposed by [11], that encourages the singular vectors of the matrix to be supported over a sparse set of variables. However, despite the increasing popularity of this method, we show that it is only guaranteed to yield a sparse set of singular vectors when the SNR is quite high, equivalent to element-wise thresholding, and fails for stronger scalings of the SNR.

1.1 Related work

Due to its practical importance and difficulty, biclustering has attracted considerable attention (for some recent surveys see [9, 27, 20, 22]). Broadly, algorithms for biclustering can be categorized as either score-based searches or spectral algorithms. Many of the proposed algorithms for identifying relevant clusters are based on heuristic searches whose goal is to identify large average sub-matrices or sub-matrices that are well fit by a two-way ANOVA model. Sun et al. [26] provide some statistical backing for these exhaustive search procedures. In particular, they show how to construct a test via exhaustive search to distinguish when there is a small sub-matrix of weak activation from the "null" case when there is no bicluster.

The premise behind the spectral algorithms is that if there was a sub-matrix embedded in a large matrix, then this sub-matrix could be identified from the left and right singular vectors of A. In the case when exactly one of u and v is random, the model (1) can be related to the spiked covariance model of [15]. In the case when v is random, the matrix A has independent columns and dependent rows. Therefore, A′A is a spiked covariance matrix and it is possible to use the existing theoretical results on the first eigenvalue to characterize the left singular vector of A. A lot of recent work has dealt with estimation of sparse eigenvectors of A′A; see for example [32, 16, 24, 31, 2]. For biclustering applications, the assumption that exactly one of u or v is random is not justifiable; therefore, theoretical results for the spiked covariance model do not translate directly. 
Singular vectors of the model (1) have been analyzed by [21], improving on earlier results of [6]. These results however are asymptotic and do not consider the case when u and v are sparse.

Our setup for the biclustering problem also falls in the framework of structured normal means multiple hypothesis testing problems, where for each entry in the matrix the hypotheses are that the entry has mean 0 versus an elevated mean. The presence of a bicluster (sub-matrix), however, imposes structure on which elements are elevated concurrently. Recently, several papers have investigated the structured normal means setting for ordered domains. For example, [5] consider the detection of elevated intervals and other parametric structures along an ordered line or grid, [4] consider detection of elevated connected paths in tree and lattice topologies, and [3] considers nonparametric cluster structures in a regular grid. In addition, [1] consider testing of different elevated structures in a general but known graph topology. Our setup for the biclustering problem requires identification of an elevated submatrix in an unordered matrix. At a high level, all these results suggest that it is possible to leverage the structure to improve the SNR threshold at which the hypothesis testing problem is feasible. However, computationally efficient procedures that achieve the minimax SNR thresholds are only known for a few of these problems. Our results for biclustering have a similar flavor, in that the minimax threshold requires a combinatorial procedure whereas the computationally efficient procedures we investigate are often sub-optimal.

The rest of this paper is organized as follows. In Section 2, we provide a lower bound on the minimum signal strength needed for successfully identifying the bicluster. Section 3 presents a combinatorial procedure which achieves the lower bound and hence is minimax optimal. 
We investigate some computationally efficient procedures in Section 4. Simulation results are presented in Section 5 and we conclude in Section 6. All proofs are deferred to the Appendix.

2 Lower bound

In this section, we derive a lower bound for the problem of identifying the correct bicluster, indexed by K1 and K2, in model (1). In particular, we derive conditions on (β, n1, n2, k1, k2, σ²) under which any method is going to make an error when estimating the correct cluster. Intuitively, if either the signal-to-noise ratio β/σ or the cluster size is small, the minimum signal strength needed will be high, since it is harder to distinguish the bicluster from the noise.

Theorem 1. Let α ∈ (0, 1/8) and

β_min = β_min(n1, n2, k1, k2, σ) = σ√α · max( √(k1 log(n1 − k1)), √(k2 log(n2 − k2)), √( k1k2 log((n1 − k1)(n2 − k2)) / (k1 + k2) ) ).   (4)

Then for all β0 ≤ β_min,

inf_ψ sup_{θ∈Θ(β0,k1,k2)} P_θ[ψ(A) ≠ (K1(θ), K2(θ))] ≥ (√M / (1 + √M)) (1 − 2α − √(2α / log M)) → 1 − 2α as n1, n2 → ∞,   (5)

where M = min(n1 − k1, n2 − k2), Θ(β0, k1, k2) is given in (2) and the infimum is over all measurable maps ψ : R^{n1×n2} → 2^{[n1]} × 2^{[n2]}.

The result can be interpreted in the following way: for any biclustering procedure ψ, if β0 ≤ β_min, then there exists some element in the model class Θ(β0, k1, k2) such that the probability of incorrectly identifying the sets K1 and K2 is bounded away from zero.

The proof is based on a standard technique described in Chapter 2.6 of [28]. We start by identifying a subset of parameter tuples that are hard to distinguish. Once a suitable finite set is identified, tools for establishing lower bounds on the error in multiple-hypothesis testing can be directly applied. These tools only require computing the Kullback-Leibler (KL) divergence between two distributions P_{θ1} and P_{θ2}, which in the case of model (1) are two multivariate normal distributions. 
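For intuition on the quantity being computed, note that under (3) two parameters θ = (β, K1, K2) and θ′ = (β, K1′, K2′) induce product Gaussian measures that differ only in the placement of the elevated submatrix, so the KL divergence reduces to a count of disagreeing entries (a sketch consistent with the definitions above; the full construction is in the Appendix):

```latex
\mathrm{KL}(P_{\theta} \,\|\, P_{\theta'})
  = \sum_{i \in [n_1],\, j \in [n_2]} \frac{(\mu_{ij} - \mu'_{ij})^2}{2\sigma^2}
  = \frac{\beta^2}{2\sigma^2 k_1 k_2}\,
    \bigl| (K_1 \times K_2) \,\triangle\, (K_1' \times K_2') \bigr|,
```

where μij = β(k1k2)^{−1/2} 1I{i ∈ K1, j ∈ K2} and △ denotes the symmetric difference. Moving one row of the bicluster, for example, changes 2k2 entries and costs β²/(σ²k1) in KL, which is why hypotheses differing in a single row or column are the hard ones to distinguish.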
These constructions and calculations are described in detail in the Appendix.

3 Minimax optimal combinatorial procedure

We now investigate a combinatorial procedure achieving the lower bound of Theorem 1, in the sense that, for any θ ∈ Θ(β_min, k1, k2), the probability of recovering the true bicluster (K1, K2) tends to one as n1 and n2 grow unbounded. This scan procedure consists in enumerating all possible pairs of subsets of the row and column indexes of size k1 and k2, respectively, and choosing the one whose corresponding submatrix has the largest overall sum. In detail, for an observed matrix A and two candidate subsets K̃1 ⊂ [n1] and K̃2 ⊂ [n2], we define the associated score S(K̃1, K̃2) := Σ_{i∈K̃1} Σ_{j∈K̃2} aij. The estimated bicluster is the pair of subsets of sizes k1 and k2 achieving the highest score:

ψ(A) := argmax_{(K̃1, K̃2)} S(K̃1, K̃2) subject to |K̃1| = k1, |K̃2| = k2.   (6)

The following theorem determines the signal strength β needed for the decoder ψ to find the true bicluster.

Theorem 2. Let A ∼ P_θ with θ ∈ Θ(β, k1, k2) and assume that k1 ≤ n1/2 and k2 ≤ n2/2. If

β ≥ 4σ max( √(k1 log(n1 − k1)), √(k2 log(n2 − k2)), √( 2k1k2 log((n1 − k1)(n2 − k2)) / (k1 + k2) ) )   (7)

then P[ψ(A) ≠ (K1, K2)] ≤ 2[(n1 − k1)^{−1} + (n2 − k2)^{−1}], where ψ is the decoder defined in (6).

Comparing to the lower bound in Theorem 1, we observe that the combinatorial procedure using the decoder ψ that looks for all possible clusters and chooses the one with largest score achieves the lower bound up to constants. Unfortunately, this procedure is not practical for data sets commonly encountered in practice, as it requires enumerating all (n1 choose k1)(n2 choose k2) possible sub-matrices of size k1 × k2. 
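The decoder (6) is straightforward to write down, though its cost restricts it to tiny matrices. The sketch below (ours, plain Python) also uses the observation that for a fixed row set the best column set is simply the k2 columns with the largest restricted column sums, so only the row sets need to be enumerated:

```python
from itertools import combinations

def scan_decoder(A, k1, k2):
    """Maximize the score S(K1~, K2~) = sum_{i in K1~, j in K2~} a_ij over
    all row sets of size k1 and column sets of size k2, as in (6)."""
    n1, n2 = len(A), len(A[0])
    best, best_score = None, float("-inf")
    for rows in combinations(range(n1), k1):
        # Given the rows, the optimal columns are the k2 largest column sums.
        by_sum = sorted(range(n2), key=lambda j: -sum(A[i][j] for i in rows))
        cols = by_sum[:k2]
        score = sum(A[i][j] for i in rows for j in cols)
        if score > best_score:
            best, best_score = (set(rows), set(cols)), score
    return best
```

Even with the inner maximization made separable, the outer loop still visits all (n1 choose k1) row sets, which is the combinatorial bottleneck discussed above.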
The combinatorial procedure requires the signal to be positive, but not necessarily constant throughout the bicluster. In fact, it is easy to see that, provided the average signal in the bicluster is larger than that stipulated by the theorem, this procedure succeeds with high probability irrespective of how the signal is distributed across the bicluster. Finally, we remark that the estimation of the cluster is done under the assumption that k1 and k2 are known. Establishing minimax lower bounds and a procedure that adapts to unknown k1 and k2 is an open problem.

4 Computationally efficient biclustering procedures

In this section we investigate the performance of various procedures for biclustering that, unlike the optimal scan statistic procedure studied in the previous section, are computationally tractable. For each of these procedures, however, computational ease comes at the cost of suboptimal performance: recovery of the true bicluster is only possible if β is much larger than the minimax signal strength of Theorem 1.

4.1 Element-wise thresholding

The simplest procedure that we analyze is based on element-wise thresholding. The bicluster is estimated as

ψ_thr(A, τ) := {(i, j) ∈ [n1] × [n2] : |aij| ≥ τ},   (8)

where τ > 0 is a parameter. The following theorem characterizes the signal strength β required for element-wise thresholding to succeed in recovering the bicluster.

Theorem 3. Let A ∼ P_θ with θ ∈ Θ(β, k1, k2) and fix δ > 0. Set the threshold τ as

τ = σ √( 2 log( ((n1 − k1)(n2 − k2) + k1(n2 − k2) + k2(n1 − k1)) / δ ) ).

If

β ≥ √(k1k2) σ ( √(2 log(k1k2/δ)) + √( 2 log( ((n1 − k1)(n2 − k2) + k1(n2 − k2) + k2(n1 − k1)) / δ ) ) )

then P[ψ_thr(A, τ) ≠ K1 × K2] = O(δ).

Comparing Theorem 3 with the lower bound in Theorem 1, we observe that the signal strength β needs to be O(max(√k1, √k2)) larger than the lower bound. 
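The estimator (8) itself is a one-liner; a sketch (ours), with the threshold τ passed in since the theorem's choice depends on (n1, n2, k1, k2, σ, δ):

```python
def threshold_decoder(A, tau):
    """Estimate the bicluster as every position whose entry is at least
    tau in absolute value, as in (8); no matrix structure is used."""
    return {(i, j) for i, row in enumerate(A)
                   for j, a in enumerate(row) if abs(a) >= tau}
```

Because each entry is tested in isolation, the procedure pays for every one of the roughly n1·n2 noise entries that could exceed τ, which is the source of the extra max(√k1, √k2) factor just noted.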
This is not surprising, since element-wise thresholding is not exploiting the structure of the problem, but is assuming that the large elements of the matrix A are positioned randomly. From the proof it is not hard to see that this upper bound is tight up to constants, i.e. if

β ≤ c √(k1k2) σ ( √(2 log(k1k2/δ)) + √( 2 log( ((n1 − k1)(n2 − k2) + k1(n2 − k2) + k2(n1 − k1)) / δ ) ) )

for a small enough constant c, then thresholding will no longer recover the bicluster, with probability at least 1 − δ. It is also worth noting that thresholding requires the signal in the bicluster to be neither constant nor positive, provided it is larger in magnitude, at every entry, than the threshold specified in the theorem.

4.2 Row/Column averaging

Next, we analyze another procedure, based on column and row averaging. When the bicluster is large, this procedure exploits the structure of the problem and outperforms both element-wise thresholding and the sparse SVD, which is discussed in the following section. The averaging procedure works well only if the bicluster is "large", as specified below, since otherwise the row or column average is dominated by the noise.

More precisely, the averaging procedure computes the average of each row and column of A and outputs the k1 rows and k2 columns with the largest average. Let {r_{r,i}}_{i∈[n1]} and {r_{c,j}}_{j∈[n2]} denote the positions of rows and columns when they are ordered according to row and column averages in descending order. The bicluster is then estimated as

ψ_avg(A) := {i ∈ [n1] : r_{r,i} ≤ k1} × {j ∈ [n2] : r_{c,j} ≤ k2}.   (9)

The following theorem characterizes the signal strength β required for the averaging procedure to succeed in recovering the bicluster.

Theorem 4. Let A ∼ P_θ with θ ∈ Θ(β, k1, k2). If k1 = Ω(n1^{1/2+α}) and k2 = Ω(n2^{1/2+α}), where α ∈ (0, 1/2) is a constant, and

β ≥ 4σ max( √(k1k2 log(n1 − k1)) / n2^α, √(k1k2 log(n2 − k2)) / n1^α )

then P[ψ_avg(A) ≠ (K1, K2)] ≤ [n1^{−1} + n2^{−1}].

Comparing to Theorem 3, we observe that averaging requires lower signal strength than element-wise thresholding when the bicluster is large, that is, k1 = Ω(√n1) and k2 = Ω(√n2). Unless both k1 = Θ(n1) and k2 = Θ(n2), the procedure does not achieve the lower bound of Theorem 1; however, the procedure is simple and computationally efficient. It is also not hard to show that this theorem is sharp in its characterization of the averaging procedure. Further, unlike thresholding, averaging requires the signal to be positive in the bicluster.

It is interesting to note that a large bicluster can also be identified without assuming the normality of the noise matrix Δ. This non-parametric extension is based on a simple sign-test, and the details are provided in the Appendix.

4.3 Sparse singular value decomposition (SSVD)

An alternate way to estimate K1 and K2 would be based on the singular value decomposition (SVD), i.e. finding ũ and ṽ that maximize ⟨ũ, Aṽ⟩, and then thresholding the elements of ũ and ṽ. Unfortunately, such a method would perform poorly when the signal β is weak and the dimensionality is high, since, due to the accumulation of noise, ũ and ṽ are poor estimates of u and v and do not exploit the fact that u and v are sparse.

In fact, it is now well understood [8] that SVD is strongly inconsistent when the signal strength is weak, i.e. ∠(ũ, u) → π/2 (and similarly for v) almost surely. 
See [26] for a clear exposition and discussion of this inconsistency in the SVD setting.

To properly exploit the sparsity in the singular vectors, it seems natural to impose a cardinality constraint to obtain a sparse singular vector decomposition (SSVD):

max_{u∈S^{n1−1}, v∈S^{n2−1}} ⟨u, Av⟩ subject to ||u||_0 ≤ k1, ||v||_0 ≤ k2,

which can be further rewritten as

max_{Z∈R^{n2×n1}} tr AZ subject to Z = vu′, ||u||_2 = 1, ||v||_2 = 1, ||u||_0 ≤ k1, ||v||_0 ≤ k2.   (10)

The above problem is non-convex and computationally intractable. Inspired by the convex relaxation methods for sparse principal component analysis proposed by [11], we consider the following relaxation of the SSVD:

max_{X∈R^{(n1+n2)×(n1+n2)}} tr AX21 − λ 1′|X21|1 subject to X ⪰ 0, tr X11 = 1, tr X22 = 1,   (11)

where X is the block matrix

X = [ X11 X12 ; X21 X22 ]   (12)

with the block X21 corresponding to Z in (10). If the optimal solution X̂ is of rank 1, then, necessarily, X̂ = (û′, v̂′)′ (û′, v̂′). Based on the sparse singular vectors û and v̂, we estimate the bicluster as

K̂1 = {j ∈ [n1] : ûj ≠ 0} and K̂2 = {j ∈ [n2] : v̂j ≠ 0}.

The user-defined parameter λ controls the sparsity of the solution X̂21 and, therefore, provided the solution is of rank one, it also controls the sparsity of the vectors û and v̂ and of the estimated bicluster.

The following theorem provides sufficient conditions for the solution X̂ to be rank one and to recover the bicluster.

Theorem 5. Consider the model in (1). Assume k1 ≍ k2, k1 ≤ n1/2 and k2 ≤ n2/2. If

β ≥ 2σ √( k1k2 log((n1 − k1)(n2 − k2)) )   (13)

then the solution X̂ of the optimization problem in (11) with λ = β/(2√(k1k2)) is of rank 1 with probability 1 − O(k1^{−1}). Furthermore, we have that (K̂1, K̂2) = (K1, K2) with probability 1 − O(k1^{−1}).

It is worth noting that SSVD correctly recovers the signed vectors û and v̂ under this signal strength. 
In particular, the procedure works even if the u and v in Equation 1 are signed.

The following theorem establishes necessary conditions for the SSVD to have a rank 1 solution that correctly identifies the bicluster.

Theorem 6. Consider the model in (1). Fix c ∈ (0, 1/2). Assume that k1 ≍ k2, k1 = o(n1^{1/2−c}) and k2 = o(n2^{1/2−c}). If

β ≤ 2σ √( c k1k2 log max(n1 − k1, n2 − k2) ),   (14)

then, with λ = β/(2√(k1k2)), the optimization problem (11) does not have a rank 1 solution that correctly recovers the sparsity pattern, with probability at least 1 − O(exp(−(√k1 + √k2)²)) for sufficiently large n1 and n2.

From Theorem 6 we observe that the sufficient conditions of Theorem 5 are sharp. In particular, the two theorems establish that the SSVD does not achieve the lower bound given in Theorem 1. The signal strength needs to be of the same order as for element-wise thresholding, which is somewhat surprising since, from the formulation of the SSVD optimization problem, it seems that the procedure uses the structure of the problem. From numerical simulations in Section 5 we observe that, although SSVD requires the same scaling as thresholding, it consistently performs slightly better at a fixed signal strength.

5 Simulation results

We test the performance of the three computationally efficient procedures on synthetic data: thresholding, averaging and sparse SVD. For sparse SVD we use an implementation posted online by [11]. We generate data from (1) with n = n1 = n2, k = k1 = k2, σ² = 1 and u = v ∝ (1′_k, 0′_{n−k})′. For each algorithm we plot the Hamming fraction (i.e. the Hamming distance between s_û and s_u, rescaled to be between 0 and 1) against the rescaled signal strength. 
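The Hamming-fraction metric, and one run of the kind of experiment described here, can be sketched as follows (plain Python, our names; the row/column averaging of Section 4.2 is used as the decoder for concreteness):

```python
import random

def hamming_fraction(est_support, true_support, n):
    """Fraction of the n coordinates where the estimated support of a
    vector disagrees with the true support (0 = perfect recovery)."""
    return sum((i in est_support) != (i in true_support) for i in range(n)) / n

def averaging_rows(A, k):
    """The k row indices of A with the largest row sums (equivalently,
    the largest row averages; cf. (9))."""
    return set(sorted(range(len(A)), key=lambda i: -sum(A[i]))[:k])

def one_run(n, k, beta, sigma=1.0, seed=0):
    """Sample A from model (1) with K1 = K2 = {0, ..., k-1}, then return
    the Hamming fraction of the averaging estimate of the row support."""
    rng = random.Random(seed)
    mu = beta / k  # per-entry signal, since beta / sqrt(k1 k2) = beta / k here
    A = [[mu * (i < k) * (j < k) + rng.gauss(0.0, sigma) for j in range(n)]
         for i in range(n)]
    return hamming_fraction(averaging_rows(A, k), set(range(k)), n)
```

Averaging such runs over 50 independent draws at a grid of rescaled signal strengths reproduces the shape of the success/failure curves described below.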
In each case we average the results over 50 runs. For thresholding and sparse SVD the rescaled signal strength (x-axis) is β / (k√(log(n − k))), and for averaging the rescaled signal strength (x-axis) is βn^α / (k√(log(n − k))). We observe that there is a sharp threshold between success and failure of the algorithms, and the curves show good agreement with our theory.

The vertical line shows the point after which successful recovery happens for all values of n. We can make a direct comparison between thresholding and sparse SVD (since the curves are identically rescaled) to see that, at least empirically, sparse SVD succeeds at a smaller scaling constant than thresholding even though their asymptotic rates are identical.

[Figure 1: three panels (k = log(n), k = n^{1/3}, k = 0.2n), curves for n = 100, 200, 300, 400, 500.] Figure 1: Thresholding: Hamming fraction versus rescaled signal strength.

[Figure 2: two panels (k = n^{1/2+α} with α = 0.1, and k = 0.2n), curves for n = 100, ..., 500.] Figure 2: Averaging: Hamming fraction versus rescaled signal strength.

[Figure 3: three panels (k = log(n), k = n^{1/3}, k = 0.2n), curves for n = 100, ..., 500.] Figure 3: Sparse SVD: Hamming fraction versus rescaled signal strength.

6 Discussion

In this paper, we analyze biclustering using a simple statistical model (1), where a sparse rank one matrix is perturbed with noise. Using this model, we have characterized the minimal signal strength below which no procedure can succeed in recovering the bicluster. This lower bound can be matched using an exhaustive search technique. However, it is still an open problem to find a computationally efficient procedure that is minimax optimal.

Amini et al. [2] analyze the convex relaxation procedure proposed in [11] for high-dimensional sparse PCA. Under the minimax scaling for this problem, they show that, provided a rank-1 solution exists, it has the desired sparsity pattern (they were, however, not able to show that a rank-1 solution exists with high probability). Somewhat surprisingly, we show that in the SVD case a rank-1 solution with the desired sparsity pattern does not exist with high probability. The two settings, however, are not identical, since the noise in the spiked covariance model is Wishart rather than Gaussian, and has correlated entries. 
It would be interesting to analyze whether our negative result has similar\nimplications for the sparse PCA setting.\nThe focus of our paper has been on a model with one cluster, which although simple, provides\nseveral interesting theoretical insights. In practice, data often contains multiple clusters which need\nto be estimated. Many existing algorithms (see e.g. [17] and [18]) try to estimate multiple clusters\nand it would be useful to analyze these theoretically.\nFurthermore, the algorithms that we have analyzed assume knowledge of the size of the cluster,\nwhich is used to select the tuning parameters. It is a challenging problem of great practical relevance\nto \ufb01nd data driven methods to select these tuning parameters.\n7 Acknowledgments\nWe would like to thank Arash Amini and Martin Wainwright for fruitful discussions, and Larry\nWasserman for his ideas, indispensable advice and wise guidance. This research is supported in\npart by AFOSR under grant FA9550-10-1-0382 and NSF under grant IIS-1116458. SB would also\nlike to thank Jaime Carbonell and Srivatsan Narayanan for several valuable comments and thought-\nprovoking discussions.\n\n8\n\n\fReferences\n[1] Louigi Addario-Berry, Nicolas Broutin, Luc Devroye, and G\u00b4abor Lugosi. On combinatorial testing prob-\n\nlems. Ann. Statist., 38(5):3063\u20133092, 2010.\n\n[2] A.A. Amini and M.J. Wainwright. High-Dimensional Analysis Of Semide\ufb01nite Relaxations For Sparse\n\nPrincipal Components. The Annals of Statistics, 37(5B):2877\u20132921, 2009.\n\n[3] Ery Arias-Castro, Emmanuel J. Cand`es, and Arnaud Durand. Detection of an anomalous cluster in a\n\nnetwork. Ann. Stat., 39(1):278\u2013304, 2011.\n\n[4] Ery Arias-Castro, Emmanuel J. Cand`es, Hannes Helgason, and Ofer Zeitouni. Searching for a trail of\n\nevidence in a maze. Ann. Statist., 36(4):1726\u20131757, 2008.\n\n[5] Ery Arias-Castro, David L. Donoho, and Xiaoming Huo. 
Adaptive multiscale detection of filamentary structures in a background of uniform random points. Ann. Statist., 34(1):326-349, 2006.
[6] Jushan Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1):135-171, 2003.
[7] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauscheck, Christopher Kruegel, and Engin Kirda. Scalable, behavior-based malware clustering. In 16th Symposium on Network and Distributed System Security (NDSS), 2009.
[8] F. Benaych-Georges and R. Rao Nadakuditi. The singular values and vectors of low rank perturbations of large rectangular random matrices. ArXiv e-prints, March 2011.
[9] S. Busygin, O. Prokopyev, and P. M. Pardalos. Biclustering in data mining. Computers & Operations Research, 35(9):2964-2987, 2008.
[10] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? CoRR, abs/0912.3599, 2009.
[11] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49:434-448, 2007.
[12] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the Geometry of Banach Spaces, 1:317-366, 2001.
[13] R. Fletcher. Semi-definite matrix constraints in optimization. SIAM Journal on Control and Optimization, 23:493, 1985.
[14] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123-129, 1972.
[15] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295-327, 2001.
[16] I. M. Johnstone and A. Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682-693, 2009.
[17] L. Lazzeroni and A. Owen.
Plaid models for gene expression data. Statistica Sinica, 12:61-86, 2002.
[18] Mihee Lee, Haipeng Shen, Jianhua Z. Huang, and J. S. Marron. Biclustering via sparse singular value decomposition. Biometrics, 66(4):1087-1095, 2010.
[19] Jinze Liu and Wei Wang. OP-cluster: Clustering by tendency in high dimensional space. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM '03, pages 187-, Washington, DC, USA, 2003. IEEE Computer Society.
[20] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE Transactions on Computational Biology and Bioinformatics, pages 24-45, 2004.
[21] A. Onatski. Asymptotics of the principal components estimator of large factor models with weak factors. Economics Department, Columbia University, 2009.
[22] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, 6(1):90-105, 2004.
[23] R. T. Rockafellar. The Theory of Subgradients and its Applications to Problems of Optimization. Convex and Nonconvex Functions. Heldermann, 1981.
[24] H. Shen and J. Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6):1015-1034, 2008.
[25] G. W. Stewart. Perturbation theory for the singular value decomposition. Computer Science Technical Report Series, Vol. CS-TR-2539, page 13, 1990.
[26] X. Sun and A. B. Nobel. On the maximal size of large-average and ANOVA-fit submatrices in a Gaussian random matrix. ArXiv e-prints, September 2010.
[27] A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. Handbook of Computational Molecular Biology, 2004.
[28] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[29] Lyle Ungar and Dean P. Foster. A formal statistical approach to collaborative filtering.
In CONALD, 1998.
[30] S. Wang, R. R. Gutell, and D. P. Miranker. Biclustering as a method for RNA local multiple sequence alignment. Bioinformatics, 23:3289-3296, December 2007.
[31] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515, 2009.
[32] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.