Minimax Localization of Structural Information in Large Noisy Matrices

Part of Advances in Neural Information Processing Systems 24 (NIPS 2011)

Bibtex Metadata Paper SpotlightSlide Supplemental

Authors

Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh

Abstract

We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions: i) We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. ii) We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. iii) We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition.