{"title": "Average Case Column Subset Selection for Entrywise $\\ell_1$-Norm Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 10111, "page_last": 10121, "abstract": "We study the column subset selection problem with respect to the entrywise $\\ell_1$-norm loss. It is known that in the worst case, to obtain a good rank-$k$ approximation to a matrix, one needs an arbitrarily large $n^{\\Omega(1)}$ number of columns to obtain a $(1+\\epsilon)$-approximation to an $n \\times n$ matrix. Nevertheless, we show that under certain minimal and realistic distributional settings, it is possible to obtain a $(1+\\epsilon)$-approximation with a nearly linear running time and poly$(k/\\epsilon)+O(k\\log n)$ columns. Namely, we show that if the input matrix $A$ has the form $A = B + E$, where $B$ is an arbitrary rank-$k$ matrix, and $E$ is a matrix with i.i.d. entries drawn from any distribution $\\mu$ for which the $(1+\\gamma)$-th moment exists, for an arbitrarily small constant $\\gamma > 0$, then it is possible to obtain a $(1+\\epsilon)$-approximate column subset selection to the entrywise $\\ell_1$-norm in nearly linear time. Conversely we show that if the first moment does not exist, then it is not possible to obtain a $(1+\\epsilon)$-approximate subset selection algorithm even if one chooses any $n^{o(1)}$ columns. This is the first algorithm of any kind for achieving a $(1+\\epsilon)$-approximation for entrywise $\\ell_1$-norm loss low rank approximation.", "full_text": "Average Case Column Subset Selection for Entrywise\n\n(cid:96)1-Norm Loss\n\nZhao Song\u2217\n\nUniversity of Washington\n\nmagic.linuxkde@gmail.com\n\nDavid P. Woodruff\u2217\n\nCarnegie Mellon University\ndwoodruf@cs.cmu.edu\n\nPeilin Zhong\u2217\n\nColumbia University\n\npz2225@columbia.edu\n\nAbstract\n\nWe study the column subset selection problem with respect to the entrywise (cid:96)1-\nnorm loss. 
It is known that in the worst case, to obtain a good rank-k approximation to a matrix, one needs an arbitrarily large n^{Ω(1)} number of columns to obtain a (1 + ε)-approximation to the best entrywise ℓ1-norm low rank approximation of an n × n matrix. Nevertheless, we show that under certain minimal and realistic distributional settings, it is possible to obtain a (1 + ε)-approximation with a nearly linear running time and poly(k/ε) + O(k log n) columns. Namely, we show that if the input matrix A has the form A = B + E, where B is an arbitrary rank-k matrix, and E is a matrix with i.i.d. entries drawn from any distribution μ for which the (1 + γ)-th moment exists, for an arbitrarily small constant γ > 0, then it is possible to obtain a (1 + ε)-approximate column subset selection to the entrywise ℓ1-norm in nearly linear time. Conversely, we show that if the first moment does not exist, then it is not possible to obtain a (1 + ε)-approximate subset selection algorithm even if one chooses any n^{o(1)} columns. This is the first algorithm of any kind achieving a (1 + ε)-approximation for entrywise ℓ1-norm loss low rank approximation.

1 Introduction

Numerical linear algebra algorithms are fundamental building blocks in many machine learning and data mining tasks. A well-studied problem is low rank matrix approximation. The most common version of the problem is also known as Principal Component Analysis (PCA), in which the goal is to find a low rank matrix to approximate a given matrix such that the Frobenius norm of the error is minimized. The optimal solution of this objective can be obtained via the singular value decomposition (SVD). Hence, the problem can be solved in polynomial time.
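For contrast with the ℓ1 setting studied in this paper, the Frobenius-optimal rank-k approximation is a short computation. The following NumPy sketch (our own illustration, not from the paper; all names are ours) recovers a noisy rank-3 matrix via truncated SVD, which is optimal by the Eckart–Young theorem.

```python
import numpy as np

def best_rank_k_frobenius(A, k):
    # Truncated SVD gives the Frobenius-optimal rank-k approximation
    # (Eckart-Young theorem).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 50))  # rank-3 ground truth
A = B + 0.01 * rng.standard_normal((50, 50))                     # noisy observation
A_k = best_rank_k_frobenius(A, 3)
```

Since B itself has rank 3, optimality of `A_k` implies ‖A − A_k‖_F ≤ ‖A − B‖_F.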
If approximate solutions are allowed, then the running time can be made almost linear in the number of non-zero entries of the given matrix [1, 2, 3, 4, 5, 6].

An important variant of the PCA problem is the entrywise ℓ1-norm low rank matrix approximation problem. In this problem, instead of minimizing the Frobenius norm of the error, we seek to minimize the ℓ1-norm of the error. In particular, given an n × n input matrix A and a rank parameter k, we want to find a matrix B with rank at most k such that ‖A − B‖_1 is minimized, where for a matrix C, ‖C‖_1 is defined to be Σ_{i,j} |C_{i,j}|. There are several reasons for using the ℓ1-norm as the error measure. For example, solutions with respect to the ℓ1-norm loss are usually more robust than solutions with Frobenius norm loss [7, 8]. Further, the ℓ1-norm loss is often used as a relaxation of the ℓ0-loss, which has wide applications including sparse recovery, matrix completion, and robust PCA; see e.g., [9, 8]. Although a number of algorithms have been proposed for the ℓ1-norm loss [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], the problem is known to be NP-hard [23].

∗Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The first ℓ1-low rank approximation with provable guarantees was proposed by [24]. To cope with NP-hardness, the authors gave a solution with a poly(k log n)-approximation ratio, i.e., their algorithm outputs a rank-k matrix B′ ∈ R^{n×n} for which

  ‖A − B′‖_1 ≤ α · min_{rank-k B} ‖A − B‖_1    (1)

for α = poly(k log n). The approximation ratio α was further improved to O(k log k) by allowing B′ to have a slightly larger rank k′ = O(k log n) [25].
Such a B′ with larger rank is referred to as a bicriteria solution. However, in high precision applications, such approximation factors are too large. A natural question is whether one can compute a (1 + ε)-approximate solution efficiently for ℓ1-norm low rank approximation. In fact, a (1 + ε)-approximation algorithm was given in [26], but the running time of their algorithm is a prohibitive n^{poly(k/ε)}. Unfortunately, [26] shows that in the worst case a 2^{k^{Ω(1)}} running time is necessary for any constant approximation, given a standard conjecture in complexity theory.

Notation. To describe our results, let us first introduce some notation. We use [n] to denote the set {1, 2, · · · , n}. We use A_i to denote the i-th column of A, and A^j to denote the j-th row of A. Let Q ⊆ [n]. We use A_Q to denote the matrix comprised of the columns of A with column indices in Q. Similarly, we use A^Q to denote the matrix comprised of the rows of A with row indices in Q. We use ([n] choose t) to denote the set of all size-t subsets of [n]. Let ‖A‖_F denote the Frobenius norm of a matrix A, i.e., ‖A‖_F is the square root of the sum of squares of all the entries in A. For 1 ≤ p < 2, we use ‖A‖_p to denote the entrywise ℓp-norm of a matrix A, i.e., ‖A‖_p is the p-th root of the sum of p-th powers of the absolute values of the entries of A. ‖A‖_1 is an important special case of ‖A‖_p, corresponding to the sum of absolute values of the entries in A. A random variable X has the Cauchy distribution if its probability density function is f(z) = 1/(π(1 + z²)).

1.1 Our Results

We propose an efficient bicriteria (1 + ε)-approximate column subset selection algorithm for the ℓ1-norm.
We bypass the running time lower bound mentioned above by making a mild assumption on the input data, and we also show that our assumption is necessary in a certain sense. Our main algorithmic result is described as follows.

Theorem 1.1 (Informal version of Theorem 2.13). Suppose we are given a matrix A = A∗ + ∆ ∈ R^{n×n}, where rank(A∗) = k for k = n^{o(1)}, and ∆ is a random matrix for which the ∆_{i,j} are i.i.d. symmetric random variables with E[|∆_{i,j}|^p] = O(E[|∆_{i,j}|]^p) for some constant p > 1. Let ε ∈ (0, 1/2) satisfy 1/ε = n^{o(1)}. There is an Õ(n² + n poly(k/ε))-time algorithm² (Algorithm 1) which can output a subset S ⊆ [n] with |S| ≤ poly(k/ε) + O(k log n) for which

  min_{X ∈ R^{|S|×n}} ‖A_S X − A‖_1 ≤ (1 + ε)‖∆‖_1

holds with probability at least 99/100.

Note the running time in Theorem 1.1 is nearly linear in the number of non-zero entries of A, since for an n × n matrix with i.i.d. noise drawn from any continuous distribution, the number of non-zero entries of A will be n² with probability 1. We also show the moment assumption of Theorem 1.1 is necessary in the following precise sense.

Theorem 1.2 (Hardness, informal version of Theorem B.20). Let n > 0 be sufficiently large. Let A = η · 1 · 1^⊤ + ∆ ∈ R^{n×n} be a random matrix where η = n^{c₀} for some sufficiently large constant c₀, 1 ∈ R^n is the all-ones vector, and ∀i, j ∈ [n], ∆_{i,j} ∼ C(0, 1) are i.i.d. standard Cauchy random variables. Let r = n^{o(1)}. Then with probability at least 1 − O(1/ log log n), ∀S ⊆ [n] with |S| = r,

  min_{X ∈ R^{r×n}} ‖A_S X − A‖_1 ≥ 1.002 ‖∆‖_1.

1.2 Our Techniques

For an overview of our hardness result, we refer readers to the supplementary material, namely, Appendix B.
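The role of the first-moment assumption separating Theorems 1.1 and 1.2 can be seen in a quick simulation (ours, not from the paper): sample means of Cauchy noise never concentrate, while a heavy-tailed distribution with a finite (1 + γ)-th moment (here Student-t with 1.5 degrees of freedom, a stand-in choice of ours) does concentrate, which is exactly what averaging-based fitting exploits.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100_000, 20

# No first moment: the mean of n Cauchy samples is itself Cauchy distributed,
# so its typical magnitude stays O(1) no matter how large n is.
cauchy_means = np.array([rng.standard_cauchy(n).mean() for _ in range(trials)])

# Finite (1+gamma)-th moment (infinite variance is fine): means concentrate.
t_means = np.array([rng.standard_t(1.5, size=n).mean() for _ in range(trials)])

cauchy_dev = np.median(np.abs(cauchy_means))
t_dev = np.median(np.abs(t_means))
```

Under the fixed seed, `cauchy_dev` is orders of magnitude larger than `t_dev`.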
In the following, we will outline the main techniques used in our algorithm.

²We use the notation Õ(f) := O(f · log^{O(1)} f).

(1 + ε)-Approximate ℓ1-Low Rank Approximation. We make the following distributional assumption on the input matrix A ∈ R^{n×n}: namely, A = A∗ + ∆, where A∗ is an arbitrary rank-k matrix and the entries of ∆ are i.i.d. from any symmetric distribution with E[|∆_{i,j}|] = 1 and E[|∆_{i,j}|^p] = O(1) for any real number p strictly greater than 1, e.g., p = 1.000001 would suffice. Note that such an assumption is mild compared to typical noise models which require the noise to be Gaussian or have bounded variance; in our case the random variables may even be heavy-tailed with infinite variance. In this setting we show it is possible to obtain a subset of poly(k(ε^{−1} + log n)) columns spanning a (1 + ε)-approximation. This provably overcomes the column subset selection lower bound of [24], which shows for entrywise ℓ1-low rank approximation that there are matrices for which any subset of poly(k) columns spans at best a k^{Ω(1)}-approximation.

Consider the following algorithm: sample poly(k/ε) columns of A, and try to cover as many of the remaining columns as possible. Here, by covering a column i, we mean that if A_I is the subset of columns sampled, then min_y ‖A_I y − A_i‖_1 ≤ (1 + O(ε))n.
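The covering test min_y ‖A_I y − A_i‖_1 is an ℓ1-regression problem. The paper relies on fast approximate solvers (see the survey cited as [28]); as a concrete illustration, a small instance can be solved exactly as a linear program. This is our own slow sketch, with all names and sizes chosen by us.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(M, b):
    """Solve min_y ||M y - b||_1 exactly as a linear program."""
    n, d = M.shape
    # Variables z = [y (d entries), s (n slack entries)]; minimize sum(s)
    # subject to -s <= M y - b <= s.
    c = np.concatenate([np.zeros(d), np.ones(n)])
    I = np.eye(n)
    A_ub = np.block([[M, -I], [-M, -I]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * d + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    y = res.x[:d]
    return y, np.abs(M @ y - b).sum()

rng = np.random.default_rng(2)
M = rng.standard_normal((60, 4))
b = M @ rng.standard_normal(4) + 0.1 * rng.standard_cauchy(60)  # heavy-tailed noise
y_hat, cost = l1_regression(M, b)
```

On heavy-tailed data the ℓ1 fit is never worse than the least-squares fit in ℓ1 cost, by optimality of the LP solution.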
The reason for this notion of covering is that we are able to show in Lemma 2.1 that in this noise model, ‖∆‖_1 ≥ (1 − ε)n² w.h.p., and so if we could cover every column i, our overall cost would be (1 + O(ε))n², which would give a (1 + O(ε))-approximation to the overall cost.

We will not be able to cover all columns, unfortunately, with our initial sample of poly(k/ε) columns of A. Instead, though, we will show that we will be able to cover all but a set T of εn/(k log k) of the columns. Fortunately, we show in Lemma 2.4 another property of the noise matrix ∆: all subsets S of columns of size at most n/r, for r ≥ (1/γ)^{1+1/(p−1)}, satisfy Σ_{j∈S} ‖∆_j‖_1 = O(γn²). Thus, for the above set T that we do not cover, we can apply this lemma with γ = ε/(k log k), and then we know that Σ_{j∈T} ‖∆_j‖_1 = O(εn²/(k log k)), which then enables us to run a previous Õ(k)-approximate ℓ1 low rank approximation algorithm [25] on the set T. This will only incur total cost O(εn²), and since by Lemma 2.1 above the overall cost is at least (1 − ε)n², we can still obtain a (1 + O(ε))-approximation overall.

The main missing piece of the algorithm to describe is why we are able to cover all but a small fraction of the columns. One thing to note is that our noise distribution may not have finite variance, and consequently, there can be very large entries ∆_{i,j} in some columns. In Lemma 2.3, we show the number of columns in ∆ for which there exists an entry larger than n^{1/2+1/(2p)} in magnitude is O(n^{(2−p)/2}), which, since p > 1 is a constant bounded away from 1, is sublinear.
Let us call this set of columns with an entry larger than n^{1/2+1/(2p)} in magnitude the set H of "heavy" columns; we will not make any guarantees about H. Rather, we will stuff it into the small set T of columns above on which we will run our earlier O(k log k)-approximation.

For the remaining, non-heavy columns, which constitute almost all of our columns, we show in Lemma 2.5 that ‖∆_i‖_1 ≤ (1 + ε)n w.h.p. The reason this is important is that, recall, to cover some column i by a sample set I of columns, we need min_y ‖A_I y − A_i‖_1 ≤ (1 + O(ε))n. It turns out, as we now explain, that we will get min_y ‖A_I y − A_i‖_1 ≤ ‖∆_i‖_1 + e_i, where e_i is a quantity which we can control and make O(εn) by increasing our sample size I. Consequently, since ‖∆_i‖_1 ≤ (1 + ε)n, overall we will have min_y ‖A_I y − A_i‖_1 ≤ (1 + O(ε))n, which means that i will be covered. We now explain what e_i is, and why min_y ‖A_I y − A_i‖_1 ≤ ‖∆_i‖_1 + e_i.

Towards this end, we first explain a key insight in this model. Since the p-th moment exists for some real number p > 1 (e.g., p = 1.000001 suffices), averaging helps reduce the noise of fitting a column A_i by subsets of other columns. Namely, we show in Lemma 2.2 that for any t non-heavy columns ∆_{i_1}, . . . , ∆_{i_t} of ∆, and any coefficients α_1, α_2, . . . , α_t ∈ [−1, 1], we have ‖Σ_{j=1}^t α_j ∆_{i_j}‖_1 = O(t^{1/p} n); that is, since the individual coordinates of the ∆_{i_j} are zero-mean random variables, their sum concentrates as we add up more columns. We do not need bounded variance for this property.

How can we use this averaging property for subset selection? The idea is, instead of sampling a single subset I of O(k) columns and trying to cover each remaining column with this subset as in [25], we will sample multiple independent subsets I_1, I_2, . . . , I_t. Each set has size poly(k/ε) and we will sample at most poly(k/ε) subsets. By a similar argument to [25], for any given column index i ∈ [n], for most of these subsets I_j, we have that A∗_i/‖∆_i‖_1 can be expressed as a linear combination of the columns A∗_ℓ/‖∆_ℓ‖_1, ℓ ∈ I_j, via coefficients of absolute value at most 1. Note that this is only true for most i and most j; we develop terminology for this in Definitions 2.6, 2.7, 2.8, and 2.9, referring to what we call a good core. We quantify what we mean by most i and most j having this property in Lemma 2.11 and Lemma 2.12.

The key lemma that drives the analysis is Lemma 2.10, which shows that min_y ‖A_I y − A_i‖_1 ≤ ‖∆_i‖_1 + e_i for I = I_1 ∪ · · · ∪ I_t, where e_i = O((q^{1/p}/t^{1−1/p}) n), q is the size of each I_j, and t is the number of different I_j. We need q to be at least k, just as before, so that we can be guaranteed that when we adjoin a column index i to I_j, there is some positive probability that A∗_i/‖∆_i‖_1 can be expressed as a linear combination of the columns A∗_ℓ/‖∆_ℓ‖_1, ℓ ∈ I_j, with coefficients of absolute value at most 1. What is different in our noise model, though, is the division by t^{1−1/p}. Since p > 1, if we set t to be a large enough poly(k/ε), then e_i = O(εn), and then we will have covered A_i, as desired. This captures the main property: averaging the linear combinations expressing A∗_i/‖∆_i‖_1 using different subsets I_j gives us better and better approximations to A∗_i/‖∆_i‖_1.
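A quick simulation (our own; symmetrized Pareto noise is our stand-in for a symmetric heavy-tailed distribution with finite p-th moment for p < 1.5 but infinite variance) shows the gain from averaging columns with bounded coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
n, t = 20_000, 400

def noise(size):
    # Symmetric heavy-tailed noise: E|X| = 3 is finite, the variance is not.
    signs = rng.choice([-1.0, 1.0], size=size)
    return signs * (rng.pareto(1.5, size=size) + 1.0)

# Per-coordinate l1 cost of fitting with a single noisy column...
single = np.abs(noise(n)).sum() / n
# ...versus the average of t noisy columns (all coefficients equal to 1/t):
averaged = np.abs(sum(noise(n) for _ in range(t)) / t).sum() / n
```

The averaged cost shrinks roughly like t^{1/p−1} relative to the single-column cost, despite the infinite variance.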
Of course we need to ensure several properties, such as not sampling a heavy column (the averaging in Lemma 2.2 does not apply when this happens), and we need to ensure that most of the I_j have small-coefficient linear combinations expressing A∗_i/‖∆_i‖_1, etc. This is handled in our main theorem, Theorem 2.13.

2 ℓ1-Norm Column Subset Selection

We first present two subroutines.

Linear regression with ℓ1 loss. The first subroutine needed is an approximate ℓ1 linear regression solver. In particular, given a matrix M ∈ R^{n×d}, n vectors b_1, b_2, · · · , b_n ∈ R^n, and an error parameter ε ∈ (0, 1), we want to compute x_1, x_2, · · · , x_n ∈ R^d for which ∀i ∈ [n], we have

  ‖M x_i − b_i‖_1 ≤ (1 + ε) · min_{x ∈ R^d} ‖M x − b_i‖_1.

Furthermore, we also need an estimate v_i of the regression cost ‖M x_i − b_i‖_1 for each i ∈ [n] such that ‖M x_i − b_i‖_1 ≤ v_i ≤ (1 + ε)‖M x_i − b_i‖_1. Such an ℓ1-regression problem can be solved efficiently (see [28] for a survey). The total running time to solve these n regression problems simultaneously is at most Õ(n²) + n · poly(d log n), and the success probability is at least 0.999.

ℓ1 column subset selection for general matrices. The second subroutine needed is an ℓ1-low rank approximation solver for general input matrices, though we allow a large approximation ratio. We use the algorithm proposed by [25] for this purpose.
In particular, given an n × d (d ≤ n) matrix M and a rank parameter k, the algorithm can output a small set S ⊂ [n] with size at most O(k log n), such that

  min_{X ∈ R^{|S|×d}} ‖M_S X − M‖_1 ≤ O(k log k) · min_{rank-k B} ‖M − B‖_1.

Furthermore, the running time is at most Õ(n²) + n · poly(k log n), and the success probability is at least 0.999. Now we can present our algorithm, Algorithm 1.

Algorithm 1 ℓ1-Low Rank Approximation with Input Assumption
1: procedure L1NOISYLOWRANKAPPROX(A ∈ R^{n×n}, k, ε)    ▷ Theorem 2.13
2:   Sample a set I from ([n] choose s) uniformly at random, where s = poly(k/ε).
3:   Solve the approximate ℓ1-regression problem min_{x ∈ R^{|I|}} ‖A_I x − A_i‖_1 for each i ∈ [n], and let v_i be the estimated regression cost.
4:   Compute the set T = {i ∈ [n] | v_i is one of the top l largest values among v_1, v_2, · · · , v_n}, where l = n/poly(k/ε).
5:   Solve ℓ1-column subset selection for A_T. Let the solution be A_Q.
6:   Solve the approximate ℓ1-regression problem min_{X ∈ R^{(|I|+|Q|)×n}} ‖A_{I∪Q} X − A‖_1, and let X̂ be the solution. Return A_{I∪Q} and X̂.    ▷ A_{I∪Q} X̂ is a good low rank approximation to A
7: end procedure

Running time. Uniformly sampling a set I can be done in poly(k/ε) time. According to our ℓ1-regression subroutine, solving min_x ‖A_I x − A_i‖_1 for all i ∈ [n] can be finished in Õ(n²) + n · poly(k log(n)/ε) time. We only need sorting to compute the set T, which takes O(n log n) time. By our second subroutine, the ℓ1-column subset selection for A_T will take Õ(n²) + n · poly(k log n) time. The last step only needs an ℓ1-regression solver, which takes Õ(n²) + n · poly(k log(n)/ε) time. Thus, the overall running time is Õ(n²) + n · poly(k log(n)/ε).

The remaining parts of this section focus on analyzing the correctness of the algorithm.

2.1 Properties of the Noise Matrix

Recall that the input matrix A ∈ R^{n×n} can be decomposed as A∗ + ∆, where A∗ is the ground truth and ∆ is a random noise matrix. In particular, A∗ is an arbitrary rank-k matrix, and ∆ is a random matrix where each entry is an i.i.d. sample drawn from an unknown symmetric distribution. The only assumption on ∆ is that each entry ∆_{i,j} satisfies E[|∆_{i,j}|^p] = O(E[|∆_{i,j}|]^p) for some constant p > 1, i.e., the p-th moment of the noise distribution is bounded. Without loss of generality, we will suppose E[|∆_{i,j}|] = 1, E[|∆_{i,j}|^p] = O(1), and p ∈ (1, 2) throughout the paper. In this section, we present some key properties of the noise matrix.

The following lemma provides a lower bound on ‖∆‖_1. Once we have such a lower bound, we can focus on finding a solution whose approximation cost is at most that lower bound.

Lemma 2.1 (Lower bound on the noise matrix). Let ∆ ∈ R^{n×n} be a random matrix where the ∆_{i,j} are i.i.d. samples drawn from a symmetric distribution. Suppose E[|∆_{i,j}|] = 1 and E[|∆_{i,j}|^p] = O(1) for some constant p ∈ (1, 2). Then, ∀ε ∈ (0, 1) which satisfies 1/ε = n^{o(1)}, we have

  Pr[‖∆‖_1 ≥ (1 − ε)n²] ≥ 1 − e^{−Θ(n)}.

The next lemma shows the main reason why we are able to get a small fitting cost when running regression. Consider a toy example.
Suppose we have a target number a ∈ R, and another t numbers a + g_1, a + g_2, · · · , a + g_t ∈ R, where the g_i are i.i.d. samples drawn from the standard Gaussian distribution N(0, 1). If we use a + g_i to fit a, then the expected cost is E[|a + g_i − a|] = E[|g_i|] = √(2/π). However, if we use the average of a + g_1, a + g_2, · · · , a + g_t to fit a, then the expected cost is E[|Σ_{i=1}^t g_i|/t]. Since the g_i are independent, Σ_{i=1}^t g_i is a Gaussian random variable with variance t, which means that the above expected cost is √(2/π)/√t. Thus the fitting cost is reduced by a factor of √t. By generalizing the above argument, we obtain the following lemma.

Lemma 2.2 (Averaging reduces the noise). Let ∆_1, ∆_2, · · · , ∆_t ∈ R^n be t random vectors whose entries ∆_{i,j} are i.i.d. symmetric random variables with E[|∆_{i,j}|] = 1 and E[|∆_{i,j}|^p] = O(1) for some constant p ∈ (1, 2). Let α_1, α_2, · · · , α_t ∈ [−1, 1] be t real numbers. Conditioned on ∀i ∈ [n], j ∈ [t], |∆_{i,j}| ≤ n^{1/2+1/(2p)}, with probability at least 1 − 2^{−n^{Θ(1)}},

  ‖Σ_{i=1}^t α_i ∆_i‖_1 ≤ O(t^{1/p} n).

The above lemma needs the condition that no entry in the noise columns is too large. Fortunately, we can show that most of the (noise) columns do not have any large entry.

Lemma 2.3 (Only a small number of columns have large entries). Let ∆ ∈ R^{n×n} be a random matrix where the ∆_{i,j} are i.i.d. symmetric random variables with E[|∆_{i,j}|] = 1 and E[|∆_{i,j}|^p] = O(1) for some constant p ∈ (1, 2). Let

  H = {j ∈ [n] | ∃i ∈ [n], |∆_{i,j}| > n^{1/2+1/(2p)}}.

Then with probability at least 0.999, |H| ≤ O(n^{1−(p−1)/2}).

The following lemma shows that any small subset of the columns of the noise matrix ∆ cannot contribute too much to the overall error. By combining it with the previous lemma, the entrywise ℓ1 cost of all columns containing large entries can be bounded.

Lemma 2.4. Let ∆ ∈ R^{n×n} be a random matrix where the ∆_{i,j} are i.i.d. symmetric random variables with E[|∆_{i,j}|] = 1 and E[|∆_{i,j}|^p] = O(1) for some constant p ∈ (1, 2). Let ε ∈ (0, 1) satisfy 1/ε = n^{o(1)}. Let r ≥ (1/ε)^{1+1/(p−1)}. Then, with probability at least .999, ∀S ⊂ [n] with |S| ≤ n/r, Σ_{j∈S} ‖∆_j‖_1 = O(εn²).

We say a (noise) column is good if it does not have a large entry. We can show that, with high probability, the entrywise ℓ1 cost of a good (noise) column is small.

Lemma 2.5 (Cost of good noise columns). Let ∆ ∈ R^n be a random vector where the ∆_i are i.i.d. symmetric random variables with E[|∆_i|] = 1 and E[|∆_i|^p] = O(1) for some constant p ∈ (1, 2). Let ε ∈ (0, 1) satisfy 1/ε = n^{o(1)}. If ∀i ∈ [n], |∆_i| ≤ n^{1/2+1/(2p)}, then with probability at least 1 − 2^{−n^{Θ(1)}}, ‖∆‖_1 ≤ (1 + ε)n.

2.2 Definition of Tuples and Cores

In this section, we provide some basic definitions, e.g., of a tuple, a good tuple, the core of a tuple, and a coefficients tuple. These definitions will be heavily used later when we analyze the correctness of our algorithm. Before we present the definitions, we introduce a notion R_{A∗}(S).
Given a matrix A∗ ∈ R^{n₁×n₂}, for a set S ⊆ [n₂], we define

  R_{A∗}(S) := argmax_{P : P ⊆ S} { |det((A∗)^Q_P)| : |P| = |Q| = rank(A∗_S), Q ⊆ [n₁] },

where for a square matrix C, det(C) denotes the determinant of C. The above maximum is over both P and Q, while R_{A∗}(S) only takes the value of the corresponding P. By Cramer's rule, if we use the columns of A∗ with index in the set R_{A∗}(S) to fit any column of A∗ with index in the set S, the absolute value of any fitting coefficient will be at most 1. The use of Cramer's rule is as follows. Consider a rank-k matrix M ∈ R^{n×(k+1)}. Let P ⊆ [k + 1], Q ⊆ [n] with |P| = |Q| = k be such that |det(M^Q_P)| is maximized. Since M has rank k, we know det(M^Q_P) ≠ 0 and thus the columns of M_P are independent. Let i ∈ [k + 1] \ P. Then the linear equation M_P x = M_i is feasible and there is a unique solution x. Furthermore, by Cramer's rule, x_j = det(M^Q_{[k+1]\{j}})/det(M^Q_P). Since |det(M^Q_P)| ≥ |det(M^Q_{[k+1]\{j}})|, we have ‖x‖_∞ ≤ 1.

Small fitting coefficients are good since they will not increase the noise by too much. For example, suppose A∗_i = A∗_S x and ‖x‖_∞ ≤ 1, i.e., the i-th column can be fit by the columns with indices in the set S, and the fitting coefficients x ∈ R^{|S|} are small. If we use the noisy columns A∗_S + ∆_S to fit the noisy column A∗_i + ∆_i, then the fitting cost is at most ‖(A∗_S + ∆_S)x − (A∗_i + ∆_i)‖_1 ≤ ‖∆_i‖_1 + ‖∆_S x‖_1. Since ‖x‖_∞ ≤ 1, it is possible to give a good upper bound for ‖∆_S x‖_1.

Definition 2.6 (Tuple). A (q, t, n)-tuple is defined to be (S_1, S_2, · · · , S_t, i), where ∀j ∈ [t], S_j ⊂ [n] with |S_j| = q. Let S = ∪_{j=1}^t S_j. Then |S| = qt, i.e., S_1, S_2, · · · , S_t are disjoint. Furthermore, i ∈ [n] and i ∉ S. For simplicity, we use (S_[t], i) to denote (S_1, S_2, · · · , S_t, i).

We next provide the definition of a good tuple.

Definition 2.7 (Good tuple). Given a rank-k matrix A∗ ∈ R^{n×n}, an (A∗, q, t, α)-good tuple is a (q, t, n)-tuple (S_[t], i) which satisfies

  |{j ∈ [t] | i ∉ R_{A∗}(S_j ∪ {i})}| ≥ α · t.

We need the definition of the core of a tuple.

Definition 2.8 (Core of a tuple). The core of (S_[t], i) is defined to be the set

  {j ∈ [t] | i ∉ R_{A∗}(S_j ∪ {i})}.

We define a coefficients tuple as follows.

Definition 2.9 (Coefficients tuple). Given a rank-k matrix A∗ ∈ R^{n×n}, let (S_[t], i) be an (A∗, q, t, α)-good tuple. Let C be the core of (S_[t], i). A coefficients tuple corresponding to (S_[t], i) is defined to be (x_1, x_2, · · · , x_t) where ∀j ∈ [t], x_j ∈ R^q. The vector x_j ∈ R^q satisfies: x_j = 0 if j ∈ [t] \ C, while A∗_{S_j} x_j = A∗_i and ‖x_j‖_∞ ≤ 1 if j ∈ C. To guarantee the coefficients tuple is unique, we restrict each vector x_j ∈ R^q to be the one that has the minimum lexicographic order.

2.3 Properties of a Good Tuple and a Coefficients Tuple

Consider a good tuple (S_1, S_2, · · · , S_t, i).
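Before analyzing such tuples, the Cramer's-rule fact from Section 2.2 can be verified by brute force on a tiny instance. The sketch below is our own exponential-time illustration (all names are ours): it finds the maximum-determinant row/column sets Q and P and checks that the leftover column is expressible with coefficients of absolute value at most 1.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, k = 8, 3
M = rng.standard_normal((n, k)) @ rng.standard_normal((k, k + 1))  # rank k, k+1 columns

# Exhaustively find column set P and row set Q (|P| = |Q| = k) maximizing |det|.
best, P, Q = -1.0, None, None
for cols in combinations(range(k + 1), k):
    for rows in combinations(range(n), k):
        d = abs(np.linalg.det(M[np.ix_(rows, cols)]))
        if d > best:
            best, P, Q = d, cols, rows

i = [j for j in range(k + 1) if j not in P][0]        # leftover column index
x = np.linalg.solve(M[np.ix_(Q, P)], M[list(Q), i])   # Cramer coefficients
```

By maximality of |det(M^Q_P)|, every Cramer numerator is at most the denominator, so ‖x‖_∞ ≤ 1, and since rank(M) = k the relation M_P x = M_i holds on all rows, not just those in Q.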
By the de\ufb01nition of a good tuple, the size of the core C\nof the tuple is large. For each j \u2208 C, the coef\ufb01cients xj of using A\u2217\ni should have absolute\nvalue at most 1. Now consider the noisy setting. As discussed in the previous section, using ASj to \ufb01t\nAi has cost at most (cid:107)\u2206i(cid:107)1 +(cid:107)\u2206Sj xj(cid:107)1. Although (cid:107)\u2206Sj xj(cid:107)1 has a good upper bound, it is not small\nenough. To further reduce the (cid:96)1 \ufb01tting cost, we can now apply the averaging argument (Lemma 2.2)\nover all the \ufb01tting choices corresponding to C. Formally, we have the following lemma.\nLemma 2.10 (Good tuples imply low \ufb01tting cost). Suppose we are given a matrix A \u2208 Rn\u00d7n which\nsatis\ufb01es A = A\u2217 + \u2206, where A\u2217 \u2208 Rn\u00d7n has rank k. Here \u2206 \u2208 Rn\u00d7n is a random matrix where\n\u2206i,j are i.i.d. symmetric random variables with E[|\u2206i,j|] = 1 and E[|\u2206i,j|p] = O(1) for some\nconstant p \u2208 (1, 2). Let H \u2282 [n] be de\ufb01ned as follows:\n\nto \ufb01t A\u2217\n\nSj\n\n, for all (A\u2217, q, t, 1/2)-good tuples\n\n(cid:12)(cid:12)(cid:12)(cid:12) \u2203i \u2208 [n],|\u2206i,j| > n1/2+1/(2p)\n(S1, S2,\u00b7\u00b7\u00b7 , St, i) which satisfy H \u2229(cid:16)(cid:83)t\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) 1\nt(cid:88)\n\nLet q, t \u2264 no(1). Then, with probability at least 1 \u2212 2\u2212n\u0398(1)\n= \u2205, we have\n\n(cid:13)(cid:13)(cid:13)A{(cid:83)t\n\nj=1 Sj}y \u2212 Ai\n\nASj xj \u2212 Ai\n\n\u2264 (cid:107)\u2206i(cid:107)1 + O(q1/p/t1\u22121/pn),\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)1\n(cid:1). 
Let \u03c0 : I \u2192 I be a random permutation of\n\nwhere C is the core of (S1, S2,\u00b7\u00b7\u00b7 , St, i), and (x1, x2,\u00b7\u00b7\u00b7 , xt) is the coef\ufb01cients tuple correspond-\ning to (S1, S2,\u00b7\u00b7\u00b7 , St, i).\nWe next show that if we choose columns randomly, it is easy to \ufb01nd a good tuple.\nLemma 2.11. Given a rank-k matrix A\u2217 \u2208 Rn\u00d7n, let q > 10k, t > 0. Let I = {i1, i2,\u00b7\u00b7\u00b7 , iqt+1}\nqt + 1 elements. \u2200j \u2208 [t], let\n\nbe a subset drawn uniformly at random from(cid:0) [n]\n\nSj =(cid:8)i\u03c0((j\u22121)q+1), i\u03c0((j\u22121)q+2),\u00b7\u00b7\u00b7 , i\u03c0((j\u22121)q+q)\n\nWe use i to denote i\u03c0(qt+1). With probability \u2265 1 \u2212 2k/q, (S1, S2,\u00b7\u00b7\u00b7 , St, i) is an\n(A\u2217, q, t, 1/2)\u2212good tuple.\nLemma 2.11 implies that if we randomly choose S1, S2,\u00b7\u00b7\u00b7 , St, then with high probability, there\nare many choices of i \u2208 [n], such that (S1, S2,\u00b7\u00b7\u00b7 , St, i) is a good tuple. Precisely, we can show\nthe following.\nLemma 2.12. Given a rank-k matrix A\u2217 \u2208 Rn\u00d7n, let q > 10k, t > 0. Let I = {i1, i2,\u00b7\u00b7\u00b7 , iqt} be a\n\nrandom subset uniformly drawn from(cid:0)[n]\n(cid:1). Let \u03c0 be a random permutation of qt elements. \u2200j \u2208 [t],\n(cid:12)(cid:12)(cid:8)i \u2208 [n] \\ I(cid:12)(cid:12) (S1, S2,\u00b7\u00b7\u00b7 , St, i) is an (A\u2217, q, t, 1/2)\u2212good tuple(cid:9)(cid:12)(cid:12) \u2265 (1 \u2212 4k/q)(n \u2212 qt).\n\nSj =(cid:8)i\u03c0((j\u22121)q+1), i\u03c0((j\u22121)q+2),\u00b7\u00b7\u00b7 , i\u03c0((j\u22121)q+q)\n\nThen with probability at least 2k/q,\n\nwe de\ufb01ne Sj as follows:\n\n|C|\n\nj=1\n\n(cid:9) .\n\n(cid:9) .\n\n(cid:26)\n\nqt+1\n\nqt\n\n2.4 Main Result\n\nNow we are able to put all ingredients together to prove our main theorem, Theorem 2.13.\nTheorem 2.13 (Formal version of Theorem 1.1). 
Suppose we are given a matrix $A = A^* + \Delta \in \mathbb{R}^{n \times n}$, where $\operatorname{rank}(A^*) = k$ for $k = n^{o(1)}$, and $\Delta$ is a random matrix for which the $\Delta_{i,j}$ are i.i.d. symmetric random variables with $\mathbb{E}[|\Delta_{i,j}|] = 1$ and $\mathbb{E}[|\Delta_{i,j}|^p] = O(1)$ for some constant $p \in (1, 2)$. Let $\epsilon \in (0, 1/2)$ satisfy $1/\epsilon = n^{o(1)}$. There is an $\widetilde{O}(n^2 + n\,\mathrm{poly}(k/\epsilon))$ time algorithm (Algorithm 1) which can output a subset $S \subseteq [n]$ with $|S| \le \mathrm{poly}(k/\epsilon) + O(k \log n)$ for which
$$\min_{X \in \mathbb{R}^{|S| \times n}} \|A_S X - A\|_1 \le (1 + \epsilon)\|\Delta\|_1$$
holds with probability at least 99/100.

Proof. We discussed the running time at the beginning of Section 2. Next, we turn to correctness. Let
$$q = \Omega\left( k \left( \frac{k \log k}{\epsilon} \right)^{1 + \frac{1}{p-1}} \right), \qquad t = \frac{q^{\frac{1}{p-1}}}{\epsilon^{1 + \frac{1}{p-1}}}, \qquad r = \Theta(q/k).$$
Let
$$I_1 = \left\{ i^{(1)}_1, i^{(1)}_2, \cdots, i^{(1)}_{qt} \right\},\ I_2 = \left\{ i^{(2)}_1, i^{(2)}_2, \cdots, i^{(2)}_{qt} \right\},\ \cdots,\ I_r = \left\{ i^{(r)}_1, i^{(r)}_2, \cdots, i^{(r)}_{qt} \right\}$$
be $r$ independent subsets drawn uniformly at random from $\binom{[n]}{qt}$. Let $I = \bigcup_{s \in [r]} I_s$, which is the same as that in Algorithm 1. Let $\pi_1, \pi_2, \cdots, \pi_r$ be $r$ independent random permutations of $qt$ elements. Due to Lemma 2.12 and a Chernoff bound, with probability at least .999, $\exists s \in [r]$,
$$\left| \left\{ i \in [n] \setminus I_s \;\middle|\; (S_1, S_2, \cdots, S_t, i) \text{ is an } (A^*, q, t, 1/2)\text{-good tuple} \right\} \right| \ge (1 - 4k/q)(n - qt),$$
where
$$S_j = \left\{ i^{(s)}_{\pi_s((j-1)q+1)}, i^{(s)}_{\pi_s((j-1)q+2)}, \cdots, i^{(s)}_{\pi_s((j-1)q+q)} \right\}, \quad \forall j \in [t].$$
Let the set $H \subset [n]$ be defined as follows:
$$H = \left\{ j \in [n] \;\middle|\; \exists i \in [n],\ |\Delta_{i,j}| > n^{1/2 + 1/(2p)} \right\}.$$
Then due to Lemma 2.3, with probability at least 0.999, $|H| \le O(n^{1 - (p-1)/2})$. Thus, for $j \in [r]$, the probability that $H \cap I_j \ne \emptyset$ is at most $O(qt \cdot n^{1-(p-1)/2} / (n - qt)) = 1/n^{\Omega(1)}$. By taking a union bound over all $j \in [r]$, with probability at least $1 - 1/n^{\Omega(1)}$, $\forall j \in [r]$, $I_j \cap H = \emptyset$. Thus, we can condition on $I_s \cap H = \emptyset$. Due to Lemma 2.10 and $q^{1/p}/t^{1-1/p} = \epsilon$,
$$\left| \left\{ i \in [n] \setminus I_s \;\middle|\; \min_{y \in \mathbb{R}^{qt}} \|A_{I_s} y - A_i\|_1 \le \|\Delta_i\|_1 + O(\epsilon n) \right\} \right| \ge (1 - 4k/q)(n - qt).$$
Due to Lemma 2.5 and a union bound over all $i \in [n] \setminus H$, with probability at least .999, $\forall i \notin H$, $\|\Delta_i\|_1 \le (1 + \epsilon)n$. Thus,
$$\left| \left\{ i \in [n] \setminus I_s \;\middle|\; \min_{y \in \mathbb{R}^{qt}} \|A_{I_s} y - A_i\|_1 \le (1 + O(\epsilon))n \right\} \right| \ge (1 - 4k/q)(n - qt) - |H|.$$
Let
$$T' = [n] \setminus \left\{ i \in [n] \;\middle|\; \min_{y \in \mathbb{R}^{qt}} \|A_{I_s} y - A_i\|_1 \le (1 + O(\epsilon))n \right\}.$$
Then $|T'| \le O(kn/q + n^{1-(p-1)/2}) = O(kn/q) = O((\epsilon/(k \log k))^{1 + 1/(p-1)}\, n)$. By our selection of $T$ in Algorithm 1, $T'$ must be a subset of $T$.
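The identity $q^{1/p}/t^{1-1/p} = \epsilon$ invoked above is exactly what the choice of $t$ is designed to achieve; assuming $t = q^{\frac{1}{p-1}}/\epsilon^{1+\frac{1}{p-1}}$ as set at the start of the proof, a one-line check (using $1 + \frac{1}{p-1} = \frac{p}{p-1}$):
$$t^{1 - \frac{1}{p}} = \left( \frac{q^{\frac{1}{p-1}}}{\epsilon^{\frac{p}{p-1}}} \right)^{\frac{p-1}{p}} = \frac{q^{\frac{1}{p}}}{\epsilon} \qquad \Longrightarrow \qquad \frac{q^{1/p}}{t^{1-1/p}} = \epsilon.$$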
Due to Lemma 2.4, with probability at least .999, $\|\Delta_T\|_1 \le O(\epsilon n^2/(k \log k))$. By our second subroutine mentioned at the beginning of Section 2, we can find a set $Q \subset [n]$ with $|Q| = O(k \log n)$ such that $\min_{X \in \mathbb{R}^{|Q| \times |T|}} \|A_Q X - A_T\|_1 \le O(k \log k)\|\Delta_T\|_1 \le O(\epsilon n^2)$. Thus, we have
$$\min_{X \in \mathbb{R}^{(|Q| + qtr) \times n}} \|A_{Q \cup I} X - A\|_1 \le \min_{X_1 \in \mathbb{R}^{(qtr) \times n}} \|A_I X_1 - A_{[n] \setminus T}\|_1 + \min_{X_2 \in \mathbb{R}^{|Q| \times n}} \|A_Q X_2 - A_T\|_1 \le (1 + O(\epsilon))n^2.$$
Due to Lemma 2.1, with probability at least .999, $\|\Delta\|_1 \ge (1 - \epsilon)n^2$, and thus $\min_{X \in \mathbb{R}^{(|Q| + qtr) \times n}} \|A_{Q \cup I} X - A\|_1 \le (1 + O(\epsilon))\|\Delta\|_1$.

3 Experiments

The take-home message from our theoretical analysis is that although the noise distribution may be heavy-tailed, if the $p$-th ($p > 1$) moment of the distribution exists, averaging may reduce the noise. In the spirit of averaging, we found that taking a median works a bit better in practice. Inspired by our theoretical analysis, we propose a simple heuristic algorithm (Algorithm 2) which outputs a rank-$k$ solution. We tested Algorithm 2 on both synthetic and real datasets.

Datasets. For each rank-$k$ experiment, we chose a high rank matrix $\widehat{A} \in \mathbb{R}^{n \times d}$, applied top-$k$ SVD to $\widehat{A}$, and obtained a rank-$k$ matrix $A^*$ as our ground truth matrix.
For our synthetic data experiments, the matrix $\widehat{A} \in \mathbb{R}^{500 \times 500}$ was generated at random, where each entry was drawn uniformly from $\{0, 1, \cdots, 9\}$. For the real datasets, we chose isolet³ ($617 \times 1559$) or mfeat⁴ ($651 \times 2000$) as $\widehat{A}$ [29].

Algorithm 2 Median Heuristic
1: procedure L1NoisyLowRankApproxHeu($A \in \mathbb{R}^{n \times d}$, $k \ge 1$)
2:   Sample a set $I = \{i_1, i_2, \cdots, i_{sk}\}$ from $\binom{[d]}{sk}$ uniformly at random.
3:   Compute $B \in \mathbb{R}^{n \times k}$ s.t., for $t \in [n]$, $q \in [k]$, $B_{t,q} = \operatorname{median}(A_{t, i_{s(q-1)+1}}, \cdots, A_{t, i_{sq}})$.
4:   Solve $\min_{X \in \mathbb{R}^{k \times d}} \|BX - A\|_1$ and let the solution be $X^*$. Output $BX^*$.
5: end procedure

Figure 1 (panels: SYNTHETIC, ISOLET, MFEAT): Empirical results. The noise distributions of the experiments in the first row are from a 1.1-stable distribution. The noise distributions corresponding to the second row are the 1.1-th root of a Cauchy distribution. The blue, red, orange, and yellow bars denote SVD, the entrywise $\ell_1$-norm low rank algorithm in [24], the uniform $k$-column subset sampling algorithm in [25], and Algorithm 2, respectively.

We tested two different noise distributions. One is the standard Lévy 1.1-stable distribution [30]. The other is constructed from the standard Cauchy distribution: to draw a sample from the constructed distribution, we draw a sample from the Cauchy distribution, keep the sign unchanged, and take the $\frac{1}{1.1}$-th power of the absolute value. Notice that both distributions have a bounded $p$-th moment for any $p < 1.1$, but not for any $p > 1.1$. To construct the noise matrix $\Delta \in \mathbb{R}^{n \times d}$, we drew a matrix $\widehat{\Delta}$ where each entry is an i.i.d. sample from one of the two noise distributions, and then scaled the noise: $\Delta = \widehat{\Delta} \cdot \frac{\|A^*\|_1}{20 \cdot n \cdot d}$.
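The median heuristic and the Cauchy-root noise model above are straightforward to prototype. Below is a minimal numpy sketch, not the paper's implementation: the exact $\ell_1$ regression in line 4 of Algorithm 2 (a linear program) is replaced by a few iteratively reweighted least squares (IRLS) steps, and the function names, sizes, and seeds are ours.

```python
import numpy as np

def l1_fit(B, A, iters=10, eps=1e-8):
    """Approximately solve min_X ||B X - A||_1 column-by-column via IRLS.
    (A stand-in for the exact l1 regression, which is a linear program.)"""
    X = np.linalg.lstsq(B, A, rcond=None)[0]          # least-squares warm start
    for _ in range(iters):
        R = np.abs(B @ X - A)                          # current residual magnitudes
        for j in range(A.shape[1]):
            sw = 1.0 / np.sqrt(np.maximum(R[:, j], eps))   # sqrt of IRLS weights
            X[:, j] = np.linalg.lstsq(B * sw[:, None], sw * A[:, j], rcond=None)[0]
    return X

def median_heuristic(A, k, s, rng=None):
    """Algorithm 2 (Median Heuristic): each column of B is the entrywise median
    of s randomly sampled columns of A; then l1-fit X and return B X."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    I = rng.choice(d, size=s * k, replace=False)       # sample s*k distinct columns
    # Group q in [k] uses sampled columns i_{s(q-1)+1}, ..., i_{sq}.
    B = np.median(A[:, I].reshape(n, k, s), axis=2)
    return B @ l1_fit(B, A)

# Demo in the spirit of the synthetic experiments (smaller sizes):
rng = np.random.default_rng(0)
A_star = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 120))  # rank-5 truth
c = rng.standard_cauchy((200, 120))
Delta = np.sign(c) * np.abs(c) ** (1 / 1.1)            # Cauchy-root heavy-tailed noise
Delta *= np.abs(A_star).sum() / (20 * A_star.size)     # scale as described above
A = A_star + Delta
out = median_heuristic(A, k=5, s=20, rng=1)
ratio = np.abs(out - A).sum() / np.abs(Delta).sum()    # reported approximation ratio
```

Sampling without replacement requires $sk \le d$, matching the paper's choice $s = \min(50, \lfloor n/k \rfloor)$; more IRLS iterations (or an exact LP solver) would get closer to the true $\ell_1$ fit.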
We set $A = A^* + \Delta$ as the input.

Methodologies. We compare Algorithm 2 with SVD, the poly$(k, \log n)$-approximate entrywise $\ell_1$ low rank approximation algorithm of [24], and uniform $k$-column subset sampling [25]⁵. For Algorithm 2, we set $s = \min(50, \lfloor n/k \rfloor)$. For all of the algorithms we repeated the experiment the same number of times and compared the best solution obtained by each algorithm. We report the approximation ratio $\|B - A\|_1 / \|\Delta\|_1$ for each algorithm, where $B \in \mathbb{R}^{n \times d}$ is the output rank-$k$ matrix. The results are shown in Figure 1. As shown in the figure, Algorithm 2 outperformed all of the other algorithms.

Acknowledgments. David P. Woodruff was supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done while he was visiting the Simons Institute for the Theory of Computing. Peilin Zhong is supported in part by NSF grants (CCF-1703925, CCF-1421161, CCF-1714818, CCF-1617955 and CCF-1740833), the Simons Foundation (#491119 to Alexandr Andoni), a Google Research Award, and a Google Ph.D. fellowship. Part of this work was done while Zhao Song and Peilin Zhong were interns at IBM Research - Almaden and while Zhao Song was visiting the Simons Institute for the Theory of Computing.

³https://archive.ics.uci.edu/ml/datasets/isolet
⁴https://archive.ics.uci.edu/ml/datasets/Multiple+Features
⁵We chose to compare with [24, 25] due to their theoretical guarantees.
Though the uniform $k$-column subset sampling described in the experiments of [25] is a heuristic algorithm, it is inspired by their theoretical algorithm.

References

[1] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 21-24 October 2006, Berkeley, California, USA, Proceedings, pages 143–152, 2006.

[2] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference, STOC'13, Palo Alto, CA, USA, June 1-4, 2013, pages 81–90. https://arxiv.org/pdf/1207.6365, 2013.

[3] Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 91–100. ACM, https://arxiv.org/pdf/1210.3135, 2013.

[4] Jelani Nelson and Huy L. Nguyễn. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, https://arxiv.org/pdf/1211.1002, 2013.

[5] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in euclidean space. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 499–508, 2015.

[6] Michael B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Arlington, VA, USA, January 10-12, 2016, pages 278–287, 2016.

[7] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

[8] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

[9] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.

[10] Qifa Ke and Takeo Kanade. Robust subspace computation using $\ell_1$ norm. Technical Report CMU-CS-03-172, Carnegie Mellon University, Pittsburgh, PA, 2003.

[11] Qifa Ke and Takeo Kanade. Robust $\ell_1$ norm factorization in the presence of outliers and missing data by alternative convex programming. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 739–746. IEEE, 2005.

[12] Eunwoo Kim, Minsik Lee, Chong-Ho Choi, Nojun Kwak, and Songhwai Oh. Efficient $\ell_1$-norm-based low-rank matrix approximations for large-scale problems using alternating rectified gradient method. IEEE Transactions on Neural Networks and Learning Systems, 26(2):237–251, 2015.

[13] Nojun Kwak. Principal component analysis based on $\ell_1$-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1672–1680, 2008.

[14] Yinqiang Zheng, Guangcan Liu, Shigeki Sugimoto, Shuicheng Yan, and Masatoshi Okutomi. Practical low-rank matrix approximation under robust $\ell_1$-norm. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pages 1410–1417, 2012.

[15] J. Paul Brooks and Sapan Jot. PCAL1: An implementation in R of three methods for $\ell_1$-norm principal component analysis. Optimization Online preprint, 2012.

[16] J. Paul Brooks and José H. Dulá. The $\ell_1$-norm best-fit hyperplane problem. Appl. Math. Lett., 26(1):51–55, 2013.

[17] J. Paul Brooks, José H. Dulá, and Edward L. Boone. A pure $\ell_1$-norm principal component analysis. Computational Statistics & Data Analysis, 61:83–98, 2013.

[18] Deyu Meng, Zongben Xu, Lei Zhang, and Ji Zhao. A cyclic weighted median method for $\ell_1$ low-rank matrix factorization with missing entries. In AAAI, volume 4, page 6, 2013.

[19] Panos P. Markopoulos, George N. Karystinos, and Dimitrios A. Pados. Some options for $\ell_1$-subspace signal processing. In ISWCS 2013, The Tenth International Symposium on Wireless Communication Systems, Ilmenau, TU Ilmenau, Germany, August 27-30, 2013, pages 1–5, 2013.

[20] Panos P. Markopoulos, George N. Karystinos, and Dimitrios A. Pados. Optimal algorithms for $\ell_1$-subspace signal processing. IEEE Trans. Signal Processing, 62(19):5046–5058, 2014.

[21] P. P. Markopoulos, S. Kundu, S. Chamadia, and D. A. Pados. Efficient $\ell_1$-norm principal-component analysis via bit flipping. ArXiv e-prints, 2016.

[22] Young Woong Park and Diego Klabjan. Iteratively reweighted least squares algorithms for $\ell_1$-norm principal component analysis. arXiv preprint arXiv:1609.02997, 2016.

[23] Nicolas Gillis and Stephen A. Vavasis. On the complexity of robust PCA and $\ell_1$-norm low-rank matrix approximation. arXiv preprint arXiv:1509.09236, 2015.

[24] Zhao Song, David P. Woodruff, and Peilin Zhong. Low rank approximation with entrywise $\ell_1$-norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017.

[25] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy, and David P. Woodruff. Algorithms for $\ell_p$ low rank approximation. In ICML. arXiv preprint arXiv:1705.06730, 2017.

[26] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P. Woodruff. A PTAS for $\ell_p$-low rank approximation. In SODA, 2019.

[27] Zhao Song, Ruosong Wang, Lin F. Yang, Hongyang Zhang, and Peilin Zhong. Efficient symmetric norm regression via linear sketching. arXiv preprint arXiv:1910.01788, 2019.

[28] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

[29] Arthur Asuncion and David Newman. UCI Machine Learning Repository, 2007.

[30] Benoit Mandelbrot. The Pareto-Lévy law and the distribution of income. International Economic Review, 1(2):79–106, 1960.

[31] Andreas Maurer. A bound on the deviation probability for sums of non-negative random variables. J. Inequalities in Pure and Applied Mathematics, 4(1):15, 2003.

[32] Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, pages 1502–1513, 1997.

[33] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for $\ell_p$ regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.