{"title": "Approximation Algorithms for $\\ell_0$-Low Rank Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 6648, "page_last": 6659, "abstract": "We study the $\\ell_0$-Low Rank Approximation Problem, where the goal is,    given an $m \\times n$ matrix $A$, to output a rank-$k$ matrix $A'$ for which   $\\|A'-A\\|_0$ is minimized.    Here, for a matrix $B$, $\\|B\\|_0$ denotes the number of its non-zero entries.    This NP-hard variant of low rank approximation is natural for problems    with no underlying metric, and its goal is to minimize the number of disagreeing   data positions.      We provide approximation algorithms which significantly improve the running time    and approximation factor of previous work.    For $k > 1$, we show how to find, in poly$(mn)$ time for every $k$,    a rank $O(k \\log(n/k))$ matrix $A'$ for which $\\|A'-A\\|_0 \\leq O(k^2 \\log(n/k)) \\OPT$.    To the best of our knowledge, this is the first algorithm with provable guarantees    for the $\\ell_0$-Low Rank Approximation Problem for $k > 1$,    even for bicriteria algorithms.       For the well-studied case when $k = 1$, we give a $(2+\\epsilon)$-approximation    in {\\it sublinear time}, which is impossible for other variants of low rank    approximation such as for the  Frobenius norm.    We strengthen this for the well-studied case of binary matrices to obtain    a $(1+O(\\psi))$-approximation in sublinear time,    where $\\psi = \\OPT/\\nnz{A}$.   For small $\\psi$, our approximation factor is $1+o(1)$.", "full_text": "Approximation Algorithms for\n(cid:96)0-Low Rank Approximation\n\nKarl Bringmann1\n\nPavel Kolev1\u2217\n\nDavid P. 
Woodruff2

kbringma@mpi-inf.mpg.de    pkolev@mpi-inf.mpg.de    dwoodruf@cs.cmu.edu

1 Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
2 Department of Computer Science, Carnegie Mellon University

Abstract

We study the ℓ0-Low Rank Approximation Problem, where the goal is, given an m × n matrix A, to output a rank-k matrix A′ for which ‖A′ − A‖0 is minimized. Here, for a matrix B, ‖B‖0 denotes the number of its non-zero entries. This NP-hard variant of low rank approximation is natural for problems with no underlying metric, and its goal is to minimize the number of disagreeing data positions. We provide approximation algorithms which significantly improve the running time and approximation factor of previous work. For k > 1, we show how to find, in poly(mn) time for every k, a rank O(k log(n/k)) matrix A′ for which ‖A′ − A‖0 ≤ O(k^2 log(n/k)) OPT. To the best of our knowledge, this is the first algorithm with provable guarantees for the ℓ0-Low Rank Approximation Problem for k > 1, even for bicriteria algorithms. For the well-studied case when k = 1, we give a (2 + ε)-approximation in sublinear time, which is impossible for other variants of low rank approximation such as for the Frobenius norm. We strengthen this for the well-studied case of binary matrices to obtain a (1 + O(ψ))-approximation in sublinear time, where ψ = OPT/‖A‖0. For small ψ, our approximation factor is 1 + o(1).

1 Introduction

Low rank approximation of an m × n matrix A is an extremely well-studied problem, where the goal is to replace the matrix A with a rank-k matrix A′ which well-approximates A, in the sense that ‖A − A′‖ is small under some measure ‖·‖. 
Since any rank-k matrix A′ can be written as U · V, where U is m × k and V is k × n, this allows for a significant parameter reduction. Namely, instead of storing A, which has mn entries, one can store U and V, which have only (m + n)k entries in total. Moreover, when computing Ax, one can first compute V x and then U(V x), which takes (m + n)k instead of mn time. We refer the reader to several surveys [19, 24, 40] for references to the many results on low rank approximation.
We focus on approximation algorithms for the low-rank approximation problem, i.e. we seek to output a rank-k matrix A′ for which ‖A − A′‖ ≤ α‖A − A_k‖, where A_k = argmin_{rank(B)=k} ‖A − B‖ is the best rank-k approximation to A, and the approximation ratio α is as small as possible. One of the most widely studied error measures is the Frobenius norm ‖A‖_F = (Σ_{i=1}^m Σ_{j=1}^n A_{i,j}^2)^{1/2}, for which the optimal rank-k approximation can be obtained via the singular value decomposition (SVD). Using randomization and approximation, one can compute an α = 1 + ε approximation, for any ε > 0, in time much faster than the min(mn^2, m^2 n) time required for computing the SVD, namely, in O(‖A‖0 + n · poly(k/ε)) time [9, 26, 29], where ‖A‖0 denotes the number of non-zero entries of A.
∗This work has been funded by the Cluster of Excellence "Multimodal Computing and Interaction" within the Excellence Initiative of the German Federal Government.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
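To make the parameter reduction above concrete, the following minimal sketch (plain Python; the function name is ours) computes the factored matrix-vector product U(Vx) with (m + n)k multiplications instead of the mn an explicit product would need:

```python
def matvec_factored(U, V, x):
    """Compute (U V) x as U (V x).

    U is m x k and V is k x n, both given as lists of rows, so the
    work is (m + n) * k multiplications rather than m * n."""
    k, n, m = len(V), len(V[0]), len(U)
    # First the k-dimensional intermediate vector V x ...
    Vx = [sum(V[t][j] * x[j] for j in range(n)) for t in range(k)]
    # ... then the m-dimensional result U (V x).
    return [sum(U[i][t] * Vx[t] for t in range(k)) for i in range(m)]
```

For example, with U = [[1], [2]] and V = [[3, 4]] (so UV = [[3, 4], [6, 8]]), multiplying by x = [1, 1] gives [7, 14] without ever forming UV.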
For the Frobenius norm, ‖A‖0 time is also a lower bound, as any algorithm that does not read nearly all entries of A might not read a very large entry, and therefore cannot achieve a relative error approximation.
The rank-k matrix A_k obtained by computing the SVD is also optimal with respect to any rotationally invariant norm, such as the operator and Schatten-p norms. Thus, such norms can also be solved exactly in polynomial time. Recently, however, there has been considerable interest [10, 3, 32] in obtaining low rank approximations for NP-hard error measures such as the entrywise ℓp-norm ‖A‖_p = (Σ_{i,j} |A_{i,j}|^p)^{1/p}, where p ≥ 1 is a real number. Note that for p < 1, this is not a norm, though it is still a well-defined quantity. For p = ∞, this corresponds to the max-norm or Chebyshev norm. It is known that one can achieve a poly(k log(mn))-approximation in poly(mn) time for the low-rank approximation problem with entrywise ℓp-norm for every p ≥ 1 [36, 8].

1.1 ℓ0-Low Rank Approximation

A natural variant of low rank approximation which the results above do not cover is that of ℓ0-low rank approximation, where the measure ‖A‖0 is the number of non-zero entries. In other words, we seek a rank-k matrix A′ for which the number of entries (i, j) with A′_{i,j} ≠ A_{i,j} is as small as possible. Letting OPT = min_{rank(B)=k} Σ_{i,j} δ(A_{i,j} ≠ B_{i,j}), where δ(A_{i,j} ≠ B_{i,j}) = 1 if A_{i,j} ≠ B_{i,j} and 0 otherwise, we would like to output a rank-k matrix A′ for which there are at most α OPT entries (i, j) with A′_{i,j} ≠ A_{i,j}. Approximation algorithms for this problem are essential since solving the problem exactly is NP-hard [12, 14], even when k = 1 and A is a binary matrix.
The ℓ0-low rank approximation problem is quite natural for problems with no underlying metric, and its goal is to minimize the number of disagreeing data positions with a low rank matrix. Indeed, this error measure directly answers the following question: if we are allowed to ignore some data (outliers or anomalies), what is the best low-rank model we can get? One well-studied case is when A is binary, but A′ and its factors U and V need not necessarily be binary. This is called unconstrained Binary Matrix Factorization in [18], which has applications to association rule mining [20], biclustering structure identification [42, 43], pattern discovery for gene expression [34], digits reconstruction [25], mining high-dimensional discrete-attribute data [21, 22], market based clustering [23], and document clustering [43]. There is also a body of work on Boolean Matrix Factorization which restricts the factors to also be binary, which is referred to as constrained Binary Matrix Factorization in [18]. This is motivated by applications such as classifying text documents, and there is a large body of work on this, see, e.g. [28, 31].
The ℓ0-low rank approximation problem coincides with a number of problems in different areas. It exactly coincides with the famous matrix rigidity problem over the reals, which asks for the minimal number OPT of entries of A that need to be changed in order to obtain a matrix of rank at most k. The matrix rigidity problem is well-studied in complexity theory [15, 16, 39] and parameterized complexity [13]. These works are not directly relevant here as they do not provide approximation algorithms. 
There are also other variants of ℓ0-low rank approximation, corresponding to cases such as when A is binary, A′ = UV is required to have binary factors U and V, and multiplication is either performed over a binary field [41, 17, 12, 30], or corresponds to an OR of ANDs. The latter is known as the Boolean model [4, 12, 27, 33, 35, 38]. These different notions of inner products lead to very different algorithms and results for the ℓ0-low rank approximation problem. However, all these models coincide in the special and important case in which A is binary and k = 1. This case was studied in [20, 34, 18]; the algorithm for k = 1 forms the basis for successful heuristics for general k, e.g. the PROXIMUS technique [20].
Another related problem is robust PCA [6], in which there is an underlying matrix A that can be written as a low rank matrix L plus a sparse matrix S [7]. Candès et al. [7] argue that both components are of arbitrary magnitude, and we know neither the locations of the non-zeros in S nor how many there are. Moreover, grossly corrupted observations are common in image processing, web data analysis, and bioinformatics, where some measurements are arbitrarily corrupted due to occlusions, malicious tampering, or sensor failures. Specific scenarios include video surveillance, face recognition, latent semantic indexing, and ranking of movies, books, etc. [7]. These problems have the common theme of being an arbitrary magnitude sparse perturbation to a low rank matrix with no natural underlying metric, and so the ℓ0-error measure (which is just the Hamming distance, or number of disagreements) is appropriate. In order to solve robust PCA in practice, Candès et al. [7] relaxed the ℓ0-error measure to the ℓ1-norm. 
Understanding theoretical guarantees for solving the original ℓ0-problem is of fundamental importance, and we study this problem in this paper.
Finally, interpreting 0^0 as 0, the ℓ0-low rank approximation problem coincides with the aforementioned notion of entrywise ℓp-approximation when p = 0. It is not hard to see that previous work [8] for general p ≥ 1 fails to give any approximation factor for p = 0. Indeed, critical to their analysis is the scale-invariance property of a norm, which does not hold for p = 0 since ℓ0 is not a norm.

1.2 Our Results

We provide approximation algorithms for the ℓ0-low rank approximation problem which significantly improve the running time or approximation factor of previous work. In some cases our algorithms even run in sublinear time, i.e., faster than reading all non-zero entries of the matrix. This is provably impossible for other measures such as the Frobenius norm and, more generally, any ℓp-norm for p > 0. For k > 1, our approximation algorithms are, to the best of our knowledge, the first with provable guarantees for this problem.
First, for k = 1, we significantly improve the polynomial running time of previous (2 + ε)-approximations for this problem. The best previous algorithm, due to Jiang et al. [18], was based on the observation that there exists a column u of A spanning a 2-approximation. Therefore, solving the problem min_v ‖A − uv^T‖0 for each column u of A yields a 2-approximation, where for a matrix B the measure ‖B‖0 counts the number of non-zero entries. The problem min_v ‖A − uv^T‖0 decomposes into Σ_i min_{v_i} ‖A_{:,i} − v_i u‖0, where A_{:,i} is the i-th column of A, and v_i the i-th entry of the vector v. The optimal v_i is the mode of the ratios A_{j,i}/u_j, where j ranges over indices in {1, 2, . . . , m} with u_j ≠ 0. 
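The mode-of-ratios computation just described can be sketched directly. The following plain-Python reference implementation (function names are ours) tries every column of A as u and therefore corresponds to the O(‖A‖0 n)-style baseline, not a sublinear algorithm:

```python
from collections import Counter

def best_multiple(col, u):
    """l0 regression onto one direction: find v minimizing the number of
    positions where col[j] != v * u[j]; return (v, cost)."""
    m = len(col)
    support = [j for j in range(m) if u[j] != 0]
    # Off the support of u, v*u is 0, so position j costs 1 iff col[j] != 0.
    base = sum(1 for j in range(m) if u[j] == 0 and col[j] != 0)
    if not support:
        return 0.0, base
    # On the support, v agrees at position j iff v == col[j] / u[j];
    # the best v is the most common (mode) ratio.
    ratios = Counter(col[j] / u[j] for j in support)
    v, agreements = ratios.most_common(1)[0]
    return v, base + len(support) - agreements

def rank1_l0_2approx(A):
    """Try every column of A (list of rows) as u; by the structural result
    above, the best one gives a 2-approximation. Returns (cost, col_index, v)."""
    m, n = len(A), len(A[0])
    best = None
    for c in range(n):
        u = [A[i][c] for i in range(m)]
        v, cost = [], 0
        for i in range(n):
            col = [A[r][i] for r in range(m)]
            vi, ci = best_multiple(col, u)
            v.append(vi)
            cost += ci
        if best is None or cost < best[0]:
            best = (cost, c, v)
    return best
```

For instance, corrupting one entry of a rank-1 matrix, e.g. A = [[5, 2, 3], [2, 4, 6]] obtained from [[1, 2, 3], [2, 4, 6]], has OPT(1) = 1, and the sketch recovers a rank-1 fit with exactly one disagreement.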
As a result, one can find a rank-1 matrix uv^T providing a 2-approximation in O(‖A‖0 n) time, which was the best known running time. Somewhat surprisingly, we show that one can achieve sublinear time for solving this problem. Namely, we obtain a (2 + ε)-approximation in (m + n) poly(ε^{-1} ψ^{-1} log(mn)) time, for any ε > 0, where ψ = OPT/‖A‖0. This significantly improves upon the earlier O(‖A‖0 n) time for not too small ε and ψ. Our result should be contrasted to Frobenius norm low rank approximation, for which Ω(‖A‖0) time is required even for k = 1, as otherwise one might miss a very large entry in A. Since ℓ0-low rank approximation is insensitive to the magnitude of entries of A, we bypass this general impossibility result.
Next, still considering the case of k = 1, we show that if the matrix A is binary, a well-studied case coinciding with the abovementioned GF(2) and Boolean models, we obtain an approximation algorithm parameterized in terms of the ratio ψ = OPT/‖A‖0, showing that it is possible in time (m + n) ψ^{-1} poly(log(mn)) to obtain a (1 + O(ψ))-approximation. Note that our algorithm is again sublinear, unlike all algorithms in previous work. Moreover, when A is itself very well approximated by a low rank matrix, then ψ may actually be sub-constant, and we obtain a significantly better (1 + o(1))-approximation than the previous best known 2-approximations. Thus, we simultaneously improve the running time and approximation factor. 
We also show that the running time of our algorithm is optimal up to poly(log(mn)) factors, by proving that any (1 + O(ψ))-approximation succeeding with constant probability must read Ω((m + n) ψ^{-1}) entries of A in the worst case.
Finally, for arbitrary k > 1, we first give an impractical algorithm that runs in time n^{O(k)} and achieves an α = poly(k)-approximation. To the best of our knowledge this is the first approximation algorithm for the ℓ0-low rank approximation problem with any non-trivial approximation factor. To make our algorithm practical, we reduce the running time to poly(mn), with an exponent independent of k, if we allow for a bicriteria solution. In particular, we allow the algorithm to output a matrix A′ of somewhat larger rank O(k log(n/k)), for which ‖A − A′‖0 ≤ O(k^2 log(n/k)) min_{rank(B)=k} ‖A − B‖0. Although we do not obtain rank exactly k, many of the motivations for finding a low rank approximation, such as reducing the number of parameters and fast matrix-vector product, still hold if the output rank is O(k log(n/k)). We are not aware of any alternative algorithms which achieve poly(mn) time and any provable approximation factor, even for bicriteria solutions.

2 Preliminaries

For a matrix A ∈ R^{m×n} with entries A_{i,j}, we write A_{i,:} for its i-th row and A_{:,j} for its j-th column.

Input Formats. We always assume that we have random access to the entries of the given matrix A, i.e. we can read any entry A_{i,j} in constant time. For our sublinear time algorithms we need more efficient access to the matrix, specifically the following two variants:
(1) We say that we are given A with column adjacency arrays if we are given arrays B_1, . . . , B_n and lengths ℓ_1, . . .
, ℓ_n such that for any 1 ≤ k ≤ ℓ_j the pair B_j[k] = (i, A_{i,j}) stores the row i containing the k-th nonzero entry in column j as well as that entry A_{i,j}. This is a standard representation of matrices used in many applications. Note that given only these adjacency arrays B_1, . . . , B_n, in order to access any entry A_{i,j} we can perform a binary search over B_j, and hence random access to any matrix entry takes time O(log n). Moreover, we assume that we have random access to matrix entries in constant time, and note that this is optimistic by at most a factor O(log n).
(2) We say that we are given matrix A with row and column sums if we can access the numbers Σ_j A_{i,j} for i ∈ [m] and Σ_i A_{i,j} for j ∈ [n] in constant time (and, as always, access any entry A_{i,j} in constant time). Notice that storing the row and column sums takes O(n + m) space, and thus while this might not be standard information it is very cheap to store.
We show that the first access type even allows us to sample from the set of nonzero entries uniformly in constant time.
Lemma 1. Given a matrix A ∈ R^{m×n} with column adjacency arrays, after O(n) time preprocessing we can sample a uniformly random nonzero entry (i, j) from A in time O(1).
The proof of this lemma, as well as most other proofs in this extended abstract, can be found in the full version of the paper.

3 Algorithms for Real ℓ0-rank-k

Given a matrix A ∈ R^{m×n}, the ℓ0-rank-k problem asks to find a matrix A′ with rank k such that the difference between A and A′ measured in the ℓ0 norm is minimized. We denote the optimum value by

OPT(k) := min_{rank(A′)=k} ‖A − A′‖0 = min_{U ∈ R^{m×k}, V ∈ R^{k×n}} ‖A − UV‖0.    (1)

In this section, we establish several new results on the ℓ0-rank-k problem. 
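The preprocessing behind Lemma 1 can be realized with a table over the column lengths ℓ_1, . . . , ℓ_n. The sketch below (ours; the adjacency array of column j is assumed to be a Python list of (row, value) pairs) uses prefix sums and binary search, giving O(log n) per sample; the O(1) bound of the lemma can be achieved by replacing the binary search with, e.g., an alias table:

```python
import bisect
import random

def build_sampler(cols):
    """cols[j] is the adjacency array of column j: the (i, A[i][j]) pairs of
    its nonzero entries. Returns prefix sums of the column lengths, O(n)."""
    prefix = [0]
    for col in cols:
        prefix.append(prefix[-1] + len(col))
    return prefix

def sample_nonzero(cols, prefix, rng=random):
    """Return ((i, j), value) uniformly among all nonzero entries of A."""
    r = rng.randrange(prefix[-1])           # the r-th nonzero entry overall
    j = bisect.bisect_right(prefix, r) - 1  # column containing it, O(log n)
    i, val = cols[j][r - prefix[j]]
    return (i, j), val
```

Since every nonzero entry corresponds to exactly one value of r, the sample is uniform over the nonzeros regardless of how unbalanced the columns are.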
In Subsection 3.1, we prove a structural lemma that shows the existence of k columns which provide a (k + 1)-approximation to OPT(k), and we also give an Ω(k)-approximation lower bound for any algorithm that selects k columns from the input matrix A. In Subsection 3.2, we give an approximation algorithm that runs in poly(n^k, m) time and achieves an O(k^2)-approximation. To the best of our knowledge, this is the first algorithm with provable non-trivial approximation guarantees. In Subsection 3.3, we design a practical algorithm that runs in poly(n, m) time with an exponent independent of k, if we allow for a bicriteria solution.

3.1 Structural Results

We give a new structural result for ℓ0 showing that any matrix A contains k columns which provide a (k + 1)-approximation for the ℓ0-rank-k problem (1).
Lemma 2. Let A ∈ R^{m×n} be a matrix and k ∈ [n]. There is a subset J^{(k)} ⊂ [n] of size k and a matrix Z ∈ R^{k×n} such that ‖A − A_{:,J^{(k)}} Z‖0 ≤ (k + 1) OPT(k).

Proof. Let UV be an optimal rank-k approximation of A. Let Q^{(0)} be the set of columns j with UV_{:,j} = 0, and let R^{(0)} := [n] \ Q^{(0)}, S^{(0)} := [m], and T^{(0)} := ∅. We split the value OPT(k) into OPT(S^{(0)}, R^{(0)}) := ‖A_{S^{(0)},R^{(0)}} − UV_{S^{(0)},R^{(0)}}‖0 and OPT(S^{(0)}, Q^{(0)}) := ‖A_{S^{(0)},Q^{(0)}} − UV_{S^{(0)},Q^{(0)}}‖0 = ‖A_{S^{(0)},Q^{(0)}}‖0.
Suppose OPT(S^{(0)}, R^{(0)}) ≥ |S^{(0)}||R^{(0)}|/(k + 1). Then, for any subset J^{(k)} it follows that min_Z ‖A − A_{S^{(0)},J^{(k)}} Z‖0 ≤ |S^{(0)}||R^{(0)}| + ‖A_{S^{(0)},Q^{(0)}}‖0 ≤ (k + 1) OPT(k). Otherwise, there is a column i^{(1)} such that ‖A_{S^{(0)},i^{(1)}} − (UV)_{S^{(0)},i^{(1)}}‖0 ≤ OPT(S^{(0)}, R^{(0)})/|R^{(0)}| ≤ OPT(k)/|R^{(0)}|.
Let T^{(1)} be the set of row indices on which (UV)_{S^{(0)},i^{(1)}} and A_{S^{(0)},i^{(1)}} disagree, and similarly S^{(1)} := S^{(0)} \ T^{(1)} the set on which they agree. Then we have |T^{(1)}| ≤ OPT(k)/|R^{(0)}|. Hence, in the submatrix T^{(1)} × R^{(0)} the total error is at most |T^{(1)}| · |R^{(0)}| ≤ OPT(k). Let R^{(1)}, D^{(1)} be a partitioning of R^{(0)} such that A_{S^{(1)},j} is linearly dependent on A_{S^{(1)},i^{(1)}} iff j ∈ D^{(1)}. Then by selecting column A_{:,i^{(1)}} the incurred cost on the submatrix S^{(1)} × D^{(1)} is zero. For the remaining submatrix S^{(1)} × R^{(1)}, we perform a recursive call of the algorithm.
We make at most k recursive calls, on instances S^{(ℓ)} × R^{(ℓ)} for ℓ ∈ {0, . . . , k − 1}. In the ℓ-th iteration, either OPT(S^{(ℓ)}, R^{(ℓ)}) ≥ |S^{(ℓ)}||R^{(ℓ)}|/(k + 1 − ℓ) and we are done, or there is a column i^{(ℓ+1)} which partitions S^{(ℓ)} into S^{(ℓ+1)}, T^{(ℓ+1)} and R^{(ℓ)} into R^{(ℓ+1)}, D^{(ℓ+1)} such that |S^{(ℓ+1)}| ≥ m · Π_{i=0}^{ℓ} (1 − 1/(k + 1 − i)) = ((k − ℓ)/(k + 1)) · m, and for every j ∈ D^{(ℓ+1)} the column A_{S^{(ℓ+1)},j} belongs to the span of {A_{S^{(ℓ+1)},i^{(t)}}}_{t=1}^{ℓ+1}.
Suppose we performed k recursive calls. We now show that the incurred cost in the submatrix S^{(k)} × R^{(k)} is at most OPT(S^{(k)}, R^{(k)}) ≤ OPT(k). By construction, the sub-columns {A_{S^{(k)},i}}_{i ∈ I^{(k)}} are linearly independent, where I^{(k)} = {i^{(1)}, . . . , i^{(k)}} is the set of the selected columns, and A_{S^{(k)},I^{(k)}} = (UV)_{S^{(k)},I^{(k)}}. Since rank(A_{S^{(k)},I^{(k)}}) = k, it follows that rank(U_{S^{(k)},:}) = k, rank(V_{:,I^{(k)}}) = k and the matrix V_{:,I^{(k)}} ∈ R^{k×k} is invertible. 
Hence, for the matrix Z = (V_{:,I^{(k)}})^{-1} V_{:,R^{(k)}} we have OPT(S^{(k)}, R^{(k)}) = ‖A_{S^{(k)},R^{(k)}} − A_{S^{(k)},I^{(k)}} Z‖0.
The statement follows by noting that the recursive calls accumulate a total cost of at most k · OPT(k) in the submatrices T^{(ℓ+1)} × R^{(ℓ)} for ℓ ∈ {0, 1, . . . , k − 1}, as well as cost at most OPT(k) in the submatrix S^{(k)} × R^{(k)}.

We also show that any algorithm that selects k columns of a matrix A incurs at least an Ω(k)-approximation for the ℓ0-rank-k problem.
Lemma 3. Let k ≤ n/2. Suppose A = (G_{k×n}; I_{n×n}) ∈ R^{(n+k)×n} is a matrix composed of a Gaussian random matrix G ∈ R^{k×n} with G_{i,j} ∼ N(0, 1) and the identity matrix I_{n×n}. Then for any subset J^{(k)} ⊂ [n] of size k, we have min_{Z ∈ R^{k×n}} ‖A − A_{:,J^{(k)}} Z‖0 = Ω(k) · OPT(k).

3.2 Basic Algorithm

We give an impractical algorithm that runs in poly(n^k, m) time and achieves an O(k^2)-approximation. To the best of our knowledge this is the first approximation algorithm for the ℓ0-rank-k problem with non-trivial approximation guarantees.
Theorem 4. Given A ∈ R^{m×n} and k ∈ [n] we can compute in O(n^{k+1} m^2 k^{ω+1}) time a set of k indices J^{(k)} ⊂ [n] and a matrix Z ∈ R^{k×n} such that ‖A − A_{:,J^{(k)}} Z‖0 ≤ O(k^2) · OPT(k).
Our result relies on a subroutine by Berman and Karpinski [5] (attributed also to Kannan in that paper) which, given a matrix U and a vector b, approximates min_x ‖Ux − b‖0 in polynomial time. Specifically, we invoke in our algorithm the following variant of this result, established by Alon, Panigrahy, and Yekhanin [2].
Theorem 5. [2] There is an algorithm that given A ∈ R^{m×k} and b ∈ R^m outputs in O(m^2 k^{ω+1}) time a vector z ∈ R^k such that w.h.p. 
‖Az − b‖0 ≤ k · min_x ‖Ax − b‖0.

3.3 Bicriteria Algorithm

Our main contribution in this section is to design a practical algorithm that runs in poly(n, m) time with an exponent independent of k, if we allow for a bicriteria solution.
Theorem 6. Given A ∈ R^{m×n} and k ∈ [1, n], there is an algorithm that in expected poly(m, n) time outputs a subset of indices J ⊂ [n] with |J| = O(k log(n/k)) and a matrix Z ∈ R^{|J|×n} such that ‖A − A_{:,J} Z‖0 ≤ O(k^2 log(n/k)) · OPT(k).
The structure of the proof follows a recent approximation algorithm [8, Algorithm 3] for the ℓp-low rank approximation problem, for any p ≥ 1. We note that the analysis of [8, Theorem 7] is missing an O(log^{1/p} n) approximation factor, and naïvely provides an O(k log^{1/p} n)-approximation rather than the stated O(k)-approximation. Further, it might be possible to obtain an efficient algorithm yielding an O(k^2 log k)-approximation for Theorem 6 using unpublished techniques in [37]; we leave the study of obtaining the optimal approximation factor to future work.
There are two critical differences with the proof of [8, Theorem 7]. We cannot use the earlier [8, Theorem 3], which shows that any matrix A contains k columns which provide an O(k)-approximation for the ℓp-low rank approximation problem, since that proof requires p ≥ 1 and critically uses scale-invariance, which does not hold for p = 0. Our combinatorial argument in Lemma 2 seems fundamentally different than the maximum volume submatrix argument in [8] for p ≥ 1.
Second, unlike ℓp-regression for p ≥ 1, the ℓ0-regression problem min_x ‖Ux − b‖0, given a matrix U and vector b, is not efficiently solvable, since it corresponds to a nearest codeword problem, which is NP-hard [1]. 
Thus, we resort to an approximation algorithm for ℓ0-regression, based on ideas for solving the nearest codeword problem in [2, 5].
Note that OPT(k) ≤ ‖A‖0. Since there are only mn + 1 possibilities for OPT(k), we can assume we know OPT(k): we can run Algorithm 1 below for each such possibility, obtaining a rank-O(k log n) solution, and then output the solution found with the smallest cost. This can be further optimized by forming instead O(log(mn)) guesses of OPT(k). One of these guesses is within a factor of 2 from the true value of OPT(k), and we note that the following argument only needs to know OPT(k) up to a factor of 2.
We start by defining the notion of approximate coverage, which is different than the corresponding notion in [8] for p ≥ 1, due to the fact that ℓ0-regression cannot be efficiently solved. Consequently, approximate coverage for p = 0 cannot be efficiently tested. Let Q ⊆ [n] and let M = A_{:,Q} be an m × |Q| submatrix of A. We say that a column M_{:,i} is (S, Q)-approximately covered by a submatrix M_{:,S} of M, if |S| = 2k and min_x ‖M_{:,S} x − M_{:,i}‖0 ≤ 100(k + 1) OPT(k)/|Q|.
Lemma 7. (Similar to [8, Lemma 6], but using Lemma 2) Let Q ⊆ [n] and M = A_{:,Q} be a submatrix of A. Suppose we select a subset R of 2k uniformly random columns of M. Then with probability at least 1/3, at least a 1/10 fraction of the columns of M are (R, Q)-approximately covered.

Proof. To show this, as in [8], consider a uniformly random column index i not in the set R. Let T := R ∪ {i} and η := min_{rank(B)=k} ‖M_{:,T} − B‖0. Since T is a uniformly random subset of 2k + 1 columns of M, E_T[η] ≤ (2k + 1) OPT(k)/|Q|. Let E_1 be the event that η ≤ 10(2k + 1) OPT(k)/|Q|. Then, by a Markov bound, Pr[E_1] ≥ 9/10.
Fix a configuration T = R ∪ {i} and let L(T) ⊂ T be the subset guaranteed by Lemma 2 such that |L(T)| = k and min_X ‖M_{:,L(T)} X − M_{:,T}‖0 ≤ (k + 1) min_{rank(B)=k} ‖M_{:,T} − B‖0. Notice that E_i[min_x ‖M_{:,L(T)} x − M_{:,i}‖0 | T] = (1/(2k + 1)) min_X ‖M_{:,L(T)} X − M_{:,T}‖0 ≤ (k + 1)η/(2k + 1). Let E_2 denote the event that min_x ‖M_{:,L(T)} x − M_{:,i}‖0 ≤ 10(k + 1)η/(2k + 1). By a Markov bound, applied conditionally on T and combined via the law of total probability, Pr[E_2] ≥ 9/10.
Further, as in [8], let E_3 be the event that i ∉ L(T). Observe that there are k + 1 ways to choose a subset R′ ⊂ T such that |R′| = 2k and L(T) ⊂ R′ (one for each element of T \ L(T) that is left out), while there are 2k + 1 ways to choose R′ overall. It follows that Pr[L(T) ⊂ R | T] = (k + 1)/(2k + 1) > 1/2. Hence, by the law of total probability, we have Pr[E_3] > 1/2.
As in [8], Pr[E_1 ∧ E_2 ∧ E_3] > 2/5, and conditioned on E_1 ∧ E_2 ∧ E_3, min_x ‖M_{:,R} x − M_{:,i}‖0 ≤ min_x ‖M_{:,L} x − M_{:,i}‖0 ≤ 10(k + 1)η/(2k + 1) ≤ 100(k + 1) OPT(k)/|Q|, where the first inequality uses that, given E_3, L is a subset of R, and regressing on more columns cannot increase the cost, while the second inequality uses the occurrence of E_2 and the final inequality uses the occurrence of E_1.
As in [8], if Z_i is an indicator random variable indicating whether i is approximately covered by R, and Z = Σ_{i∈Q} Z_i, then E_R[Z] ≥ 2|Q|/5 and E_R[|Q| − Z] ≤ 3|Q|/5. By a Markov bound, Pr[|Q| − Z ≥ 9|Q|/10] ≤ 2/3. Thus, with probability at least 1/3, at least a 1/10 fraction of the columns of M are (R, Q)-approximately covered.

Algorithm 1 Selecting O(k log(n/k)) columns of A.
Require: An integer k, and a matrix A.
Ensure: O(k log(n/k)) columns of A
APPROXIMATELYSELECTCOLUMNS(k, A):
if number of columns of A ≤ 2k then
  return all the columns of A
else
  repeat
    Let R be a set of 2k uniformly random columns of A
  until at least a (1/10)-fraction of the columns of A are nearly approximately covered
  Let A_R be the columns of A not nearly approximately covered by R
  return R ∪ APPROXIMATELYSELECTCOLUMNS(k, A_R)
end if

Given Lemma 7, we are ready to prove Theorem 6. As noted above, a key difference with the corresponding [8, Algorithm 3] for ℓp and p ≥ 1 is that we cannot efficiently test whether a column i is approximately covered by a set R. We will instead again make use of Theorem 5.

Proof of Theorem 6. The computation of the matrix Z forces us to relax the notion of (R, Q)-approximately covered to the notion of (R, Q)-nearly-approximately covered, as follows: we say that a column M_{:,i} is (R, Q)-nearly-approximately covered if the algorithm in Theorem 5 returns a vector z such that ‖M_{:,R} z − M_{:,i}‖0 ≤ 100(k + 1)^2 OPT(k)/|Q|. By the guarantee of Theorem 5, if M_{:,i} is (R, Q)-approximately covered then it is also w.h.p. (R, Q)-nearly-approximately covered.
Suppose Algorithm 1 makes t iterations, and let A_{:,∪_{i=1}^t R_i} and Z be the resulting solution. We now bound its cost. Let B_0 = [n], and consider the i-th iteration of Algorithm 1. 
We denote by R_i a set of 2k uniformly random columns of B_{i−1}, by G_i the set of columns that are (R_i, B_{i−1})-nearly-approximately covered, and by B_i = B_{i−1} \ {G_i ∪ R_i} the set of the remaining columns. By construction, |G_i| ≥ |B_{i−1}|/10 and |B_i| ≤ (9/10)|B_{i−1}| − 2k < (9/10)|B_{i−1}|. Since Algorithm 1 terminates when |B_{t+1}| ≤ 2k, we have 2k < |B_t| < (1 − 1/10)^t n, and thus the number of iterations satisfies t ≤ 10 log(n/2k). By construction, |G_i| = (1 − α_i)|B_{i−1}| for some α_i ≤ 9/10, and so Σ_{i=1}^t |G_i|/|B_{i−1}| ≤ t ≤ 10 log(n/2k).
Since each column j ∈ G_i is (R_i, B_{i−1})-nearly-approximately covered, we have ‖A_{:,R_i} z^{(j)} − A_{:,j}‖0 ≤ 100(k + 1)^2 OPT(k)/|B_{i−1}|, and hence Σ_{i=1}^t Σ_{j∈G_i} ‖A_{:,R_i} z^{(j)} − A_{:,j}‖0 ≤ 100(k + 1)^2 OPT(k) · Σ_{i=1}^t |G_i|/|B_{i−1}| ≤ O(k^2 log(n/2k)) · OPT(k).
By Lemma 7, the expected number of iterations needed to select a set R_i such that |G_i| ≥ |B_{i−1}|/10 is O(1). Since the number of recursive calls t is bounded by O(log(n/k)), it follows by a Markov bound that Algorithm 1 chooses O(k log(n/k)) columns in total. Since the approximation algorithm of Theorem 5 runs in polynomial time, our entire algorithm runs in expected polynomial time.

4 Algorithm for Real ℓ0-rank-1

Given a matrix A ∈ R^{m×n}, the ℓ0-rank-1 problem asks to find a matrix A′ with rank 1 such that the difference between A and A′ measured in the ℓ0 norm is minimized. We denote the optimum value by

OPT(1) := min_{rank(A′)=1} ‖A − A′‖0 = min_{u ∈ R^m, v ∈ R^n} ‖A − uv^T‖0.    (2)

In the trivial case when OPT(1) = 0, there is an optimal algorithm that runs in time O(‖A‖0) and finds the exact rank-1 decomposition uv^T of a matrix A. 
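For intuition, the OPT(1) = 0 case can be handled directly: if A is exactly rank ≤ 1, any nonzero column serves as u, and every column is then forced to be a fixed multiple of it. The following dense O(mn)-time sketch (ours, not the paper's O(‖A‖0)-time adjacency-array routine) illustrates this:

```python
def exact_rank1(A):
    """If A (a list of rows) has rank <= 1, return (u, v) with
    A[i][j] == u[i] * v[j] for all i, j; otherwise return None."""
    m, n = len(A), len(A[0])
    # Find a column with a nonzero entry to act as u.
    pivot = next((j for j in range(n) if any(A[i][j] for i in range(m))), None)
    if pivot is None:                       # A is the zero matrix
        return [0] * m, [0] * n
    u = [A[i][pivot] for i in range(m)]
    r = next(i for i in range(m) if u[i] != 0)
    v = [A[r][j] / u[r] for j in range(n)]  # scaling forced by row r
    for i in range(m):
        for j in range(n):
            if A[i][j] != u[i] * v[j]:
                return None                 # rank > 1, so OPT(1) >= 1
    return u, v
```

If the verification loop fails at any entry, A has rank at least 2 and hence OPT(1) ≥ 1, which is exactly the regime the remainder of this section addresses.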
In this work, we focus on the case $\mathrm{OPT}^{(1)} \geq 1$. We show that Algorithm 2 yields a $(2 + \epsilon)$-approximation and runs in time nearly linear in $\|A\|_0$, for any constant $\epsilon > 0$. Furthermore, a variant of our algorithm even runs in sublinear time, if $\|A\|_0$ is large and $\psi \stackrel{\mathrm{def}}{=} \mathrm{OPT}^{(1)}/\|A\|_0$ is not too small. In particular, we obtain time $o(\|A\|_0)$ when $\mathrm{OPT}^{(1)} \geq (\epsilon^{-1} \log(mn))^4$ and $\|A\|_0 \geq n\,(\epsilon^{-1} \log(mn))^4$.

Algorithm 2
Input: $A \in \mathbb{R}^{m \times n}$ and $\epsilon \in (0, 0.1)$.
1. Partition the columns of $A$ into weight classes $S = \{S^{(0)}, \ldots, S^{(\log n + 1)}\}$ such that $S^{(0)}$ contains all columns $j$ with $\|A_{:,j}\|_0 = 0$ and $S^{(i)}$ contains all columns $j$ with $2^{i-1} \leq \|A_{:,j}\|_0 < 2^i$.
2. For each weight class $S^{(i)}$ do:
    2.1 Sample a set $C^{(i)}$ of $\Theta(\epsilon^{-2} \log n)$ elements uniformly at random from $S^{(i)}$.
    2.2 Find a vector $z^{(j)} \in \mathbb{R}^n$ such that $\|A - A_{:,j}[z^{(j)}]^T\|_0 \leq (1 + \frac{\epsilon}{15}) \min_v \|A - A_{:,j}v^T\|_0$, for each column $A_{:,j} \in C^{(i)}$.
3. Compute a $(1 + \frac{\epsilon}{15})$-approximation $Y_j$ of $\|A - A_{:,j}[z^{(j)}]^T\|_0$ for every $j \in \bigcup_{i \in [|S|]} C^{(i)}$.
Return: the pair $(A_{:,j}, z^{(j)})$ corresponding to the minimal value $Y_j$.

Theorem 8. There is an algorithm that, given $A \in \mathbb{R}^{m \times n}$ with column adjacency arrays and $\mathrm{OPT}^{(1)} \geq 1$, and given $\epsilon \in (0, 0.1]$, runs w.h.p. in time $O\big(\big(\frac{n \log m}{\epsilon^2} + \min\big\{\|A\|_0,\; n + \frac{\psi^{-1} \log n}{\epsilon^2}\big\}\big) \log^2 n\big)$ and outputs a column $A_{:,j}$ and a vector $z$ that satisfy w.h.p. $\|A - A_{:,j}z^T\|_0 \leq (2 + \epsilon)\,\mathrm{OPT}^{(1)}$. The algorithm also computes a value $Y$ satisfying w.h.p. $(1 - \epsilon)\,\mathrm{OPT}^{(1)} \leq Y \leq (2 + 2\epsilon)\,\mathrm{OPT}^{(1)}$.

The only steps whose implementation details are not immediate are Steps 2.2 and 3; we discuss them in Sections 4.1 and 4.2, respectively. Note that the algorithm from Theorem 8 selects a column $A_{:,j}$ and then finds a good vector $z$ such that the product $A_{:,j}z^T$ approximates $A$. We show that the approximation guarantee $2 + \epsilon$ is essentially tight for algorithms following this pattern.
Lemma 9. There exists a matrix $A \in \mathbb{R}^{n \times n}$ such that $\min_z \|A - A_{:,j}z^T\|_0 \geq 2(1 - 1/n)\,\mathrm{OPT}^{(1)}$ for every column $A_{:,j}$.

4.1 Implementing Step 2.2

Step 2.2 of Algorithm 2 uses the following sublinear procedure, given in Algorithm 3.
Lemma 10. Given $A \in \mathbb{R}^{m \times n}$, $u \in \mathbb{R}^m$ and $\epsilon \in (0, 1)$, we can compute in $O(\epsilon^{-2} n \log m)$ time a vector $z \in \mathbb{R}^n$ such that w.h.p. $\|A_{:,i} - z_i u\|_0 \leq (1 + \epsilon) \min_{v_i} \|A_{:,i} - v_i u\|_0$ for every $i \in [n]$.

Algorithm 3
Input: $A \in \mathbb{R}^{m \times n}$, $u \in \mathbb{R}^m$ and $\epsilon \in (0, 1)$.
Let $N \stackrel{\mathrm{def}}{=} \mathrm{supp}(u)$, $Z \stackrel{\mathrm{def}}{=} \Theta(\epsilon^{-2} \log m)$, and $p \stackrel{\mathrm{def}}{=} Z/|N|$.
1. Select each index $i \in N$ with probability $p$, and let $S$ be the resulting set.
2. Compute a vector $z \in \mathbb{R}^n$ such that $z_j = \arg\min_{r \in \mathbb{R}} \|A_{S,j} - r \cdot u_S\|_0$ for all $j \in [n]$.
Return: the vector $z$.

4.2 Implementing Step 3

In Step 3 of Algorithm 2 we want to compute a $(1 + \frac{\epsilon}{15})$-approximation $Y_j$ of $\|A - A_{:,j}[z^{(j)}]^T\|_0$ for every $j \in \bigcup_{i \in [|S|]} C^{(i)}$. We present two solutions: an exact algorithm (see Lemma 11) and a sublinear-time sampling-based algorithm (see Lemma 13).
Lemma 11. Suppose $A, B \in \mathbb{R}^{m \times n}$ are represented by column adjacency arrays.
Then, we can compute the quantity $\|A - B\|_0$ in $O(\|A\|_0 + n)$ time.

For our second, sampling-based implementation of Step 3, we make use of an algorithm by Dagum et al. [11] for estimating the expected value of a random variable. We note that the runtime of their algorithm is itself a random variable, whose magnitude is bounded w.h.p. within a certain range.
Theorem 12 ([11]). Let $X$ be a random variable taking values in $[0, 1]$ with $\mu \stackrel{\mathrm{def}}{=} \mathbb{E}[X] > 0$. Let $0 < \epsilon, \delta < 1$ and $\rho_X = \max\{\mathrm{Var}[X], \epsilon\mu\}$. There is an algorithm with sample access to $X$ that computes an estimator $\tilde\mu$ in time $t$ such that, for a universal constant $c$, we have:
i) $\Pr[(1 - \epsilon)\mu \leq \tilde\mu \leq (1 + \epsilon)\mu] \geq 1 - \delta$, and
ii) $\Pr[t \geq c\, \epsilon^{-2} \log(1/\delta)\, \rho_X/\mu^2] \leq \delta$.

We now state the key technical insight on which our sublinear algorithm builds.
Lemma 13. There is an algorithm that, given $A, B \in \mathbb{R}^{m \times n}$ with column adjacency arrays and $\|A - B\|_0 \geq 1$, and given $\epsilon > 0$, computes an estimator $Z$ that satisfies w.h.p. $(1 - \epsilon)\|A - B\|_0 \leq Z \leq (1 + \epsilon)\|A - B\|_0$. The algorithm runs w.h.p. in time $O\big(n + \epsilon^{-2}\, \frac{\|A\|_0 + \|B\|_0}{\|A - B\|_0}\, \log n\big)$.

We now present the main result of this section.
Theorem 14. There is an algorithm that, given $A \in \mathbb{R}^{m \times n}$ with column adjacency arrays and $\mathrm{OPT}^{(1)} \geq 1$, and given $j \in [n]$, $v \in \mathbb{R}^m$ and $\epsilon \in (0, 1)$, outputs an estimator $Y$ that satisfies w.h.p. $(1 - \epsilon)\|A - A_{:,j}v^T\|_0 \leq Y \leq (1 + \epsilon)\|A - A_{:,j}v^T\|_0$. The algorithm runs w.h.p. in time $O(\min\{\|A\|_0,\; n + \epsilon^{-2}\psi^{-1}\log n\})$, where $\psi = \mathrm{OPT}^{(1)}/\|A\|_0$.

To implement Step 3 of Algorithm 2, we simply apply Theorem 14 with $A$, $\epsilon$ and $v = z^{(j)}$ to each sampled column $j \in \bigcup_{0 \leq i \leq \log n + 1} C^{(i)}$.

5 Algorithms for Boolean $\ell_0$-rank-1

Our goal is to compute an approximate solution of the Boolean $\ell_0$-rank-1 problem, defined by:
$$\mathrm{OPT} = \mathrm{OPT}_A \stackrel{\mathrm{def}}{=} \min_{u \in \{0,1\}^m,\, v \in \{0,1\}^n} \|A - uv^T\|_0, \quad \text{where } A \in \{0,1\}^{m \times n}. \tag{3}$$
In practice, approximating a matrix $A$ by a rank-1 matrix $uv^T$ makes most sense if $A$ is close to being rank-1. Hence, the above optimization problem is most relevant when $\mathrm{OPT} \ll \|A\|_0$. In this section, we focus on the case $\mathrm{OPT}/\|A\|_0 \leq \phi$ for sufficiently small $\phi > 0$. We prove the following.
Theorem 15. Given $A \in \{0,1\}^{m \times n}$ with row and column sums, and given $\phi \in (0, 1/80]$ with $\mathrm{OPT}/\|A\|_0 \leq \phi$, we can compute vectors $\tilde u, \tilde v$ with $\|A - \tilde u \tilde v^T\|_0 \leq (1 + 5\phi)\mathrm{OPT} + 37\phi^2\|A\|_0$ in time $O(\min\{\|A\|_0 + m + n,\; \phi^{-1}(m + n)\log(mn)\})$.
In combination with Theorem 8, we obtain the following.
Theorem 16. Given $A \in \{0,1\}^{m \times n}$ with column adjacency arrays and with row and column sums, for $\psi = \mathrm{OPT}/\|A\|_0$ we can compute vectors $\tilde u, \tilde v$ with $\|A - \tilde u \tilde v^T\|_0 \leq (1 + 500\psi)\mathrm{OPT}$ in time w.h.p. $O(\min\{\|A\|_0 + m + n,\; \psi^{-1}(m + n)\} \cdot \log^3(mn))$.
A variant of the algorithm from Theorem 15 can also be used to solve the Boolean $\ell_0$-rank-1 problem exactly.
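To make the objective in (3) concrete: for a fixed $u \in \{0,1\}^m$, the optimal $v \in \{0,1\}^n$ decomposes column by column, since setting $v_j = 1$ costs the Hamming distance between column $j$ and $u$, while $v_j = 0$ costs the number of ones in column $j$. A small illustrative sketch of this best response (our own naming and dense representation, not the row/column-sum procedure behind Theorems 15 and 16):

```python
def best_v_given_u(A, u):
    """For Boolean A (list of m rows of length n) and fixed u in {0,1}^m,
    return the v in {0,1}^n minimizing ||A - u v^T||_0, and that minimum."""
    m, n = len(A), len(A[0])
    v, cost = [], 0
    for j in range(n):
        col = [A[r][j] for r in range(m)]
        cost_one = sum(a != b for a, b in zip(col, u))  # Hamming distance to u
        cost_zero = sum(col)                            # ones in column j
        v.append(1 if cost_one < cost_zero else 0)
        cost += min(cost_one, cost_zero)
    return v, cost
```

Alternating this exact best response between $u$ (over rows) and $v$ (over columns) yields a simple local-search heuristic; the guarantees stated above require the more careful sampling-based procedures.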
This yields the following theorem, which in particular shows that the problem is solvable in polynomial time when $\mathrm{OPT} \leq O\big(\sqrt{\|A\|_0}\,\log(mn)\big)$.
Theorem 17. Given a matrix $A \in \{0,1\}^{m \times n}$, if $\mathrm{OPT}_A/\|A\|_0 \leq 1/240$ then we can solve the Boolean $\ell_0$-rank-1 problem exactly in time $2^{O(\mathrm{OPT}/\sqrt{\|A\|_0})} \cdot \mathrm{poly}(mn)$.

6 Lower Bounds for Boolean $\ell_0$-rank-1

We now give a lower bound of $\Omega(n/\phi)$ on the number of samples of any $(1 + O(\phi))$-approximation algorithm for the Boolean $\ell_0$-rank-1 problem, where $\mathrm{OPT}/\|A\|_0 \leq \phi$ as before.
Theorem 18. Let $C \geq 1$. Given an $n \times n$ Boolean matrix $A$ with column adjacency arrays and with row and column sums, and given $\sqrt{\log(n)/n} \ll \phi \leq 1/(100C)$ such that $\mathrm{OPT}_A/\|A\|_0 \leq \phi$, computing a $(1 + C\phi)$-approximation of $\mathrm{OPT}_A$ requires reading $\Omega(n/\phi)$ entries of $A$.

The technical core of our argument is the following lemma.
Lemma 19. Let $\phi \in (0, 1/2)$, and let $X_1, \ldots, X_k$ be Boolean random variables with expectations $p_1, \ldots, p_k$, where $p_i \in \{1/2 - \phi,\ 1/2 + \phi\}$ for each $i$. Let $\mathcal{A}$ be an algorithm which can adaptively obtain any number of samples of each random variable, and which outputs a bit $b_i$ for every $i \in [1 : k]$. Suppose that, with probability at least $0.95$ over the joint probability space of $\mathcal{A}$ and the random samples, $\mathcal{A}$ outputs, for at least a $0.95$ fraction of all $i$, $b_i = 1$ if $p_i = 1/2 + \phi$ and $b_i = 0$ otherwise. Then, with probability at least $0.05$, $\mathcal{A}$ makes $\Omega(k/\phi^2)$ samples in total.

References
[1] Michael Alekhnovich. More on average case vs approximation complexity. Computational Complexity, 20(4):755–786, 2011.
[2] Noga Alon, Rina Panigrahy, and Sergey Yekhanin. Deterministic approximation algorithms for the nearest codeword problem. In APPROX-RANDOM 2009, pages 339–351, 2009.
[3] Sanjeev Arora, Rong Ge, Ravi Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. SIAM J. Comput., 45(4):1582–1611, 2016.
[4] Radim Belohlávek and Vilém Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci., 76(1):3–20, 2010.
[5] Piotr Berman and Marek Karpinski. Approximating minimum unsatisfiability of linear equations. In SODA 2002, pages 514–516, 2002.
[6] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
[7] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011.
[8] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy, and David P. Woodruff. Algorithms for $\ell_p$ low-rank approximation. In ICML 2017, pages 806–814, 2017.
[9] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC 2013, pages 81–90, 2013.
[10] Kenneth L. Clarkson and David P. Woodruff. Input sparsity and hardness for robust subspace approximation. In FOCS 2015, pages 310–329, 2015.
[11] Paul Dagum, Richard M. Karp, Michael Luby, and Sheldon M. Ross. An optimal algorithm for Monte Carlo estimation. SIAM J. Comput., 29(5):1484–1496, 2000.
[12] C. Dan, K. Arnsfelt Hansen, H. Jiang, L. Wang, and Y. Zhou. Low rank approximation of binary matrices: Column subset selection and generalizations. ArXiv e-prints, 2015.
[13] Fedor V. Fomin, Daniel Lokshtanov, S. M. Meesum, Saket Saurabh, and Meirav Zehavi. Matrix rigidity: Matrix theory from the viewpoint of parameterized complexity. In STACS 2017. Springer, 2017.
[14] Nicolas Gillis and Stephen A. Vavasis. On the complexity of robust PCA and $\ell_1$-norm low-rank matrix approximation. CoRR, abs/1509.09236, 2015.
[15] D. Grigoriev. Using the notions of separability and independence for proving the lower bounds on the circuit complexity (in Russian). Notes of the Leningrad branch of the Steklov Mathematical Institute, Nauka, 1976.
[16] D. Grigoriev. Using the notions of separability and independence for proving the lower bounds on the circuit complexity. Journal of Soviet Math., 14(5):1450–1456, 1980.
[17] Harold W. Gutch, Peter Gruber, Arie Yeredor, and Fabian J. Theis. ICA over finite fields – separability and algorithms. Signal Processing, 92(8):1796–1808, 2012.
[18] Peng Jiang, Jiming Peng, Michael Heath, and Rui Yang. A clustering approach to constrained binary matrix factorization. In Data Mining and Knowledge Discovery for Big Data, pages 281–303. Springer, 2014.
[19] Ravi Kannan and Santosh Vempala. Spectral algorithms. Foundations and Trends in Theoretical Computer Science, 4(3-4):157–288, 2009.
[20] Mehmet Koyutürk and Ananth Grama. PROXIMUS: a framework for analyzing very high dimensional discrete-attributed datasets. In KDD 2003, pages 147–156, 2003.
[21] Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan. Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Trans. Knowl. Data Eng., 17(4):447–461, 2005.
[22] Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan. Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Trans. Math. Softw., 32(1):33–69, 2006.
[23] Tao Li. A general model for clustering binary data. In KDD 2005, pages 188–197, 2005.
[24] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
[25] Edward Meeds, Zoubin Ghahramani, Radford M. Neal, and Sam T. Roweis. Modeling dyadic data with binary latent factors. In NIPS 2006, pages 977–984, 2006.
[26] Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In STOC 2013, pages 91–100, 2013.
[27] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. IEEE Trans. Knowl. Data Eng., 20(10):1348–1362, 2008.
[28] Pauli Miettinen and Jilles Vreeken. MDL4BMF: minimum description length for Boolean matrix factorization. TKDD, 8(4):18:1–18:31, 2014.
[29] Jelani Nelson and Huy L. Nguyen. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In FOCS 2013, pages 117–126, 2013.
[30] A. Painsky, S. Rosset, and M. Feder. Generalized independent component analysis over finite alphabets. ArXiv e-prints, 2015.
[31] S. Ravanbakhsh, B. Poczos, and R. Greiner. Boolean matrix factorization and noisy completion via message passing. ArXiv e-prints, 2015.
[32] Ilya P. Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In STOC 2016, pages 250–263, 2016.
[33] Jouni K. Seppänen, Ella Bingham, and Heikki Mannila. A simple algorithm for topic identification in 0-1 data. In PKDD 2003, pages 423–434, 2003.
[34] Bao-Hong Shen, Shuiwang Ji, and Jieping Ye. Mining discrete patterns via binary matrix factorization. In KDD 2009, pages 757–766, 2009.
[35] Tomáš Singliar and Milos Hauskrecht. Noisy-or component analysis and its application to link analysis. Journal of Machine Learning Research, 7:2189–2213, 2006.
[36] Zhao Song, David P. Woodruff, and Peilin Zhong. Low rank approximation with entrywise $\ell_1$-norm error. CoRR, abs/1611.00898, 2016.
[37] Zhao Song, David P. Woodruff, and Peilin Zhong. Entrywise low rank approximation of general functions, 2018. Manuscript.
[38] Jaideep Vaidya, Vijayalakshmi Atluri, and Qi Guo. The role mining problem: finding a minimal descriptive set of roles. In SACMAT 2007, pages 175–184, 2007.
[39] Leslie G. Valiant. Graph-theoretic arguments in low-level complexity. In MFCS 1977, pages 162–176, 1977.
[40] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.
[41] Arie Yeredor. Independent component analysis over Galois fields of prime order. IEEE Trans. Information Theory, 57(8):5342–5359, 2011.
[42] Zhong-Yuan Zhang, Tao Li, Chris Ding, Xian-Wen Ren, and Xiang-Sun Zhang. Binary matrix factorization for analyzing gene expression data. Data Mining and Knowledge Discovery, 20(1):28–52, 2010.
[43] Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. Binary matrix factorization with applications. In ICDM 2007, pages 391–400. IEEE, 2007.