{"title": "Outlier-Robust High-Dimensional Sparse Estimation via Iterative Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 10689, "page_last": 10700, "abstract": "We study high-dimensional sparse estimation tasks in a robust setting where a constant fraction\nof the dataset is adversarially corrupted. Specifically, we focus on the fundamental problems of robust\nsparse mean estimation and robust sparse PCA.\nWe give the first practically viable robust estimators for these problems. \nIn more detail, our algorithms are sample and computationally efficient \nand achieve near-optimal robustness guarantees. \nIn contrast to prior provable algorithms which relied on the ellipsoid method, \nour algorithms use spectral techniques to iteratively remove outliers from the dataset. \nOur experimental evaluation on synthetic data shows that our algorithms are scalable and \nsignificantly outperform a range of previous approaches, nearly matching the best error rate without corruptions.", "full_text": "Outlier-Robust High-Dimensional Sparse Estimation\n\nvia Iterative Filtering\n\nIlias Diakonikolas\n\nUniversity of Wisconsin - Madison\nilias.diakonikolas@gmail.com\n\nSushrut Karmalkar\n\nUT Austin\n\ns.sushrut@gmail.com\n\nDaniel Kane\n\nUniversity of California, San Diego\n\nEric Price\nUT Austin\n\nAlistair Stewart\nWeb3 Foundation\n\ndakane@ucsd.edu\n\necprice@cs.utexas.edu\n\nstewart.al@gmail.com\n\nAbstract\n\nWe study high-dimensional sparse estimation tasks in a robust setting where a\nconstant fraction of the dataset is adversarially corrupted. Speci\ufb01cally, we focus\non the fundamental problems of robust sparse mean estimation and robust sparse\nPCA. We give the \ufb01rst practically viable robust estimators for these problems. In\nmore detail, our algorithms are sample and computationally ef\ufb01cient and achieve\nnear-optimal robustness guarantees. 
In contrast to prior provable algorithms which\nrelied on the ellipsoid method, our algorithms use spectral techniques to iteratively\nremove outliers from the dataset. Our experimental evaluation on synthetic data\nshows that our algorithms are scalable and signi\ufb01cantly outperform a range of\nprevious approaches, nearly matching the best error rate without corruptions.\n\n1\n\nIntroduction\n\n1.1 Background\n\nThe task of leveraging sparsity to extract meaningful information from high-dimensional datasets\nis a fundamental problem of signi\ufb01cant practical importance, motivated by a range of data analysis\napplications. Various formalizations of this general problem have been investigated in statistics and\nmachine learning for at least the past two decades, see, e.g., [HTW15] for a recent textbook on the\ntopic. This paper focuses on the unsupervised setting and in particular on estimating the parameters\nof a high-dimensional distribution under sparsity assumptions. Concretely, we study the problems\nof sparse mean estimation and sparse PCA under natural data generating models.\nThe classical setup in statistics is that the data was generated by a probabilistic model of a given\ntype. This is a simplifying assumption that is only approximately valid, as real datasets are typically\nexposed to some source of contamination. The \ufb01eld of robust statistics [Hub64, HR09, HRRS86]\naims to design estimators that are robust in the presence of model misspeci\ufb01cation. In recent years,\ndesigning computationally ef\ufb01cient robust estimators for high-dimensional settings has become a\npressing challenge in a number of applications. 
These include the analysis of biological datasets, where natural outliers are common [RPW+02, PLJD10, LAT+08] and can contaminate the downstream statistical analysis, and data poisoning attacks [BNJT10], where even a small fraction of fake data (outliers) can substantially degrade the learned model [BNL12, SKL17].
This discussion motivates the design of robust estimators that can tolerate a constant fraction of adversarially corrupted data. We will use the following model of corruptions (see, e.g., [DKK+16]):

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Definition 1.1. Given 0 < ε < 1/2 and a family of distributions D on R^d, the adversary operates as follows: The algorithm specifies some number of samples N, and N samples X_1, X_2, ..., X_N are drawn from some (unknown) D ∈ D. The adversary is allowed to inspect the samples, removes εN of them, and replaces them with arbitrary points. This set of N points is then given to the algorithm. We say that a set of samples is ε-corrupted if it is generated by the above process.

Our model of corruptions generalizes several other robustness models, including Huber's contamination model [Hub64] and the malicious PAC model [Val85, KL93].
In the context of robust sparse mean estimation, we are given an ε-corrupted set of samples from a Gaussian distribution N(μ, I) with unknown mean, where μ ∈ R^d is assumed to be k-sparse, and the goal is to output a hypothesis vector μ̂ that approximates μ in ℓ2-norm. In the context of robust sparse PCA (in the spiked covariance model), we are given an ε-corrupted set of samples from N(0, I + ρvv^T), where v ∈ R^d is assumed to be k-sparse, and the goal is to approximate v.
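As a concrete illustration, the corruption process of Definition 1.1 can be simulated as follows. This is a minimal sketch for experiments; the constant-bias attack shown here is just one of many possible adversarial strategies, and the function names are illustrative, not from the paper.

```python
import numpy as np

def eps_corrupt(samples, eps, attack, seed=0):
    """Simulate the adversary of Definition 1.1: after inspecting the
    N clean draws, remove an eps-fraction of them and replace them
    with arbitrary points produced by `attack`."""
    n = len(samples)
    m = int(eps * n)
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)  # which samples to replace
    corrupted = samples.copy()
    corrupted[idx] = attack(samples, m)         # adversary may depend on the data
    return corrupted

# Example: N(mu, I) inliers with a k-sparse mean, constant +2 bias attack
d, k, n, eps = 100, 5, 2000, 0.1
mu = np.zeros(d)
mu[:k] = 1.0
clean = np.random.default_rng(1).normal(size=(n, d)) + mu
corrupted = eps_corrupt(clean, eps, lambda s, m: s[:m] + 2.0)
```

Note that because the adversary sees the clean samples before acting, the replacement points may be arbitrary functions of the data, which is what makes this model stronger than Huber contamination.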
In both settings, we would like to design computationally efficient estimators with sample complexity poly(k, log d, 1/ε), i.e., close to the information-theoretic minimum, that achieve near-optimal error guarantees.
Until recently, even for the simplest high-dimensional parameter estimation settings, no polynomial time robust learning algorithms with dimension-independent error guarantees were known. Two concurrent works [DKK+16, LRV16] made the first progress on this front for the unsupervised setting. Specifically, [DKK+16, LRV16] gave the first polynomial time algorithms for robustly learning the mean and covariance of high-dimensional Gaussians and other models. These works focused on the dense regime and as a result did not obtain algorithms with sublinear sample complexity in the sparse setting. Building on [DKK+16], more recent work [BDLS17] obtained sample-efficient polynomial time algorithms for the robust sparse setting, and in particular for the problems of robust sparse mean estimation and robust sparse PCA studied in this paper. These algorithms are based on the convex programming methodology of [DKK+16] and in particular inherently rely on the ellipsoid algorithm. Moreover, the separation oracle required for the ellipsoid algorithm turns out to be another convex program (an SDP, in order to solve sparse PCA). As a consequence, the running time of these algorithms, while polynomially bounded, is impractically high.

1.2 Our Results and Techniques

The main contribution of this paper is the design of significantly faster robust estimators for the aforementioned high-dimensional sparse problems. More specifically, our algorithms are iterative and each iteration involves a simple spectral operation (computing the largest eigenvalue of an appropriate matrix). Our algorithms achieve the same error guarantee as [BDLS17] with similar sample complexity.
At the technical level, we enhance the iterative filtering methodology of [DKK+16] to apply to the sparse setting, which we believe is of independent interest and could lead to faster algorithms for other robust sparse estimation tasks as well.
For robust sparse mean estimation, we show:

Theorem 1.2 (Robust Sparse Mean Estimation). Let D ∼ N(μ, I) be a Gaussian distribution on R^d with unknown k-sparse mean vector μ, and let ε > 0. Let S be an ε-corrupted set of samples from D of size N = Ω̃(k² log(d)/ε²). There exists an algorithm that, on input S, k, and ε, runs in polynomial time and returns μ̂ such that with probability at least 2/3 it holds that ‖μ̂ − μ‖_2 = O(ε√(log(1/ε))).

Some comments are in order. First, the sample complexity of our algorithm is asymptotically the same as that of [BDLS17], and matches the lower bound of [DKS17] against Statistical Query algorithms for this problem. The major advantage of our algorithm over [BDLS17] is that while their algorithm made use of the ellipsoid method, ours uses only spectral techniques and is scalable.
For robust sparse PCA in the spiked covariance model, we show:

Theorem 1.3 (Robust Sparse PCA). Let D ∼ N(0, I + ρvv^T) be a Gaussian distribution on R^d with spiked covariance for an unknown k-sparse unit vector v, and 0 < ρ < O(1). For ε > 0, let S be an ε-corrupted set of samples from D of size N = Ω(k⁴ log⁴(d/ε)/ε²). There exists an algorithm that, on input S, k, and ε, runs in polynomial time and returns v̂ ∈ R^d such that with probability at least 2/3 we have that ‖v̂v̂^T − vv^T‖_F = O(ε log(1/ε)/ρ).

The sample complexity upper bound in Theorem 1.3 is somewhat worse than the information-theoretic optimum of Θ(k² log d/ε²).
While the ellipsoid-based algorithm of [BDLS17] achieves near-optimal sample complexity (within logarithmic factors), our algorithm is practically viable, as it only uses spectral operations. We also note that the sample complexity in the above theorem is not known to be optimal for our algorithm. It seems quite plausible, via a tighter analysis, that our algorithm in fact has near-optimal sample complexity as well.
For both of our algorithms, in the most interesting regime of k ≪ √d, the running time per iteration is dominated by the O(Nd²) computation of the empirical covariance matrix. The number of iterations is at most εN, although it is typically much smaller, so both algorithms take at most O(εN²d²) time.

1.3 Related Work

There is extensive literature on exploiting sparsity in statistical estimation (see, e.g., [HTW15]). In this section, we summarize the work most directly related to the results of this paper. Sparse mean estimation is arguably one of the most fundamental sparse estimation tasks and is closely related to the Gaussian sequence model [Tsy08, Joh17]. The task of sparse PCA in the spiked covariance model, initiated in [Joh01], has been extensively investigated (see Chapter 8 of [HTW15] and references therein). In this work, we design algorithms for the aforementioned problems that are robust to a constant fraction of outliers.
Learning in the presence of outliers is an important goal in statistics that has been studied since the 1960s [Hub64]. See, e.g., [HR09, HRRS86] for book-length introductions to robust statistics. Until recently, all known computationally efficient high-dimensional estimators could tolerate only a negligible fraction of outliers, even for the task of mean estimation. Recent work [DKK+16, LRV16] gave the first efficient robust estimators for basic high-dimensional unsupervised tasks, including mean and covariance estimation.
Since the dissemination of [DKK+16, LRV16], there has been a flurry of research activity on computationally efficient robust learning in high dimensions [BDLS17, CSV17, DKK+17, DKS17, DKK+18a, SCV18, DKS18b, DKS18a, HL18, KSS18, PSBR18, DKK+18b, KKM18, DKS19, LSLC18a, CDKS18, CDG18, CDGW19].
In the context of robust sparse estimation, [BDLS17] obtained sample-efficient polynomial time algorithms for robust sparse mean estimation and robust sparse PCA. The main difference between [BDLS17] and the results of this paper is that the [BDLS17] algorithms use the ellipsoid method (whose separation oracle is an SDP). Hence, these algorithms are prohibitively slow for practical applications. More recent work [LSLC18b] gave an iterative method for robust sparse mean estimation, which however requires multiple solutions to a convex relaxation for sparse PCA in each iteration. Finally, [LLC19] proposed an algorithm for robust sparse mean estimation via iterative trimmed hard thresholding. While this algorithm seems practically viable in terms of runtime, it can only tolerate a 1/(√k log(nd)) (i.e., sub-constant) fraction of corruptions.

1.4 Paper Organization

In Section 2, we describe our algorithms and provide a detailed sketch of their analysis. In Section 3, we report detailed experiments demonstrating the performance of our algorithms on synthetic data in various parameter regimes. Due to space limitations, the full proofs of correctness for our algorithms can be found in the full version of this paper.

2 Algorithms

In this section, we describe our algorithms in tandem with a detailed outline of the intuition behind them and a sketch of their analysis. Due to space limitations, the proof of correctness is deferred to the full version of our paper.
At a high level, our algorithms use the iterative filtering methodology of [DKK+16].
The main idea\nis to iteratively remove a small subset of the dataset, so that eventually we have removed all the\nimportant outliers and the standard estimator (i.e., the estimator we would have used in the noiseless\ncase) works. Before we explain our new ideas that enhance the \ufb01ltering methodology to the sparse\nsetting, we provide a brief technical description of the approach.\n\n3\n\n\fOverview of Iterative Filtering. The basic idea of iterative \ufb01ltering [DKK+16] is the following:\nIn a given iteration, carefully pick some test statistic (such as v \u00b7 x for a well-chosen v). If there\nwere no outliers, this statistic would follow a nice distribution (with good concentration properties).\nThis allows us to do some sort of statistical hypothesis testing of the \u201cnull hypothesis\u201d that each\nxi is an inlier, rejecting it (and believing that xi is an outlier) if v \u00b7 xi is far from the expected\ndistribution. Because there are a large number of such hypotheses, one uses a procedure reminiscent\nof the Benjamini-Hochberg procedure [BH95] to \ufb01nd a candidate set of outliers with low false\ndiscovery rate (FDR), i.e., a set with more outliers than inliers in expectation. This procedure looks\nfor a threshold T such that the fraction of points with test statistic above T is at least a constant\nfactor more than it \u201cshould\u201d be. If such a threshold is found, those points are mostly outliers and can\nbe safely removed. The key goal is to judiciously design a test statistic such that either the outliers\naren\u2019t particularly important\u2014so the naive empirical solution is adequate\u2014or at least one point will\nbe \ufb01ltered out.\nIn other words, the goal is to \ufb01nd a test statistic such that, if the distribution of the test statistic is\n\u201cclose\u201d to what it would be in the outlier-free world, then the outliers cannot perturb the answer too\nmuch. 
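To make this thresholding step concrete, here is a minimal single-iteration sketch, with a simple Gaussian tail test standing in for the paper's exact statistics: the function name, the slack factor, and the thresholds are illustrative assumptions, not the constants used in our algorithms.

```python
import math
import numpy as np

def filter_once(scores, eps, slack=9.0):
    """One generic filtering step: scan for a threshold T at which the
    empirical tail of |score| exceeds the Gaussian inlier tail bound by
    a constant factor; if one exists, remove everything above it.
    Assumes the test statistic is ~N(0,1) on inliers."""
    s = np.abs(scores)
    for T in np.arange(2.0, s.max() + 0.1, 0.1):
        tail = np.mean(s >= T)                       # empirical tail mass
        bound = slack * math.erfc(T / math.sqrt(2)) + 0.3 * eps ** 2
        if tail >= bound:                            # tail too heavy: mostly outliers
            return scores[s < T], True
    return scores, False                             # no violation: accept the data

rng = np.random.default_rng(0)
inliers = rng.normal(size=900)
outliers = np.full(100, 6.0)                         # gross outliers at 6 sigma
kept, filtered = filter_once(np.concatenate([inliers, outliers]), eps=0.1)
```

Because the empirical tail is compared to a multiple of the inlier tail bound, any threshold that fires removes more outliers than inliers in expectation, which is the low-false-discovery-rate property described above.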
An additional complication is that the test statistics depend on the data (such as v · x, where v is the principal component of the data), making the distribution on inliers also nontrivial. This consideration drives the sample complexity of the algorithms.
In the algorithms we describe below, we use a specific parameterized notion of a good set. We define this precisely in the supplementary material; briefly, any large enough sample drawn from the uncorrupted distribution will satisfy the structural properties required for the set to be good.
We now describe how to design such test statistics for our two sparse settings.

Notation. Before we describe our algorithms, we set up some notation. We define h_k : R^d → R^d to be the thresholding operator that keeps the k entries of v with the largest magnitude and sets the rest to 0. For a finite set S, we write a ∈_u S to mean that a is chosen uniformly at random from S. For M ∈ R^{d×d} and U ⊆ [d], let M_U denote the matrix M restricted to the U × U submatrix.

Algorithm 1 Robust Sparse Mean Estimation via Iterative Filtering
1: procedure ROBUST-SPARSE-MEAN(S, k, ε, τ)
input: A multiset S such that there exists an (ε, k, τ)-good set G with Δ(G, S) ≤ 2ε.
output: A multiset S′ with a smaller fraction of corrupted samples, or a vector μ̂ with ‖μ̂ − μ‖_2 ≤ ε√(log(1/ε)).
2: Compute the sample mean μ̃ = E_{X∈_u S}[X] and the sample covariance matrix Σ̃, i.e., Σ̃ = (Σ̃_{i,j})_{1≤i,j≤d} with Σ̃_{i,j} = E_{X∈_u S}[(X_i − μ̃_i)(X_j − μ̃_j)].
3: Let U ⊆ [d] × [d] be the set of the k largest-magnitude entries of the diagonal of Σ̃ − I and the k² − k largest-magnitude off-diagonal entries, with ties broken so that if (i, j) ∈ U then (j, i) ∈ U.
4: if ‖(Σ̃ − I)_(U)‖_F ≤ O(ε log(1/ε)) then return μ̂ := h_k(μ̃).
5: Set U′ = {i ∈ [d] : (i, j) ∈ U}.
6: Compute the largest eigenvalue λ* of (Σ̃ − I)_{U′} and a corresponding unit eigenvector v*.
7: if λ* ≥ Ω(ε√(log(1/ε))) then: Let δ_ℓ := 3√(ελ*). Find T > 0 such that Pr_{X∈_u S}[|v* · (X − μ̃)| ≥ T + δ_ℓ] ≥ 9 · erfc(T/√2) + 3ε²/(T² ln(k ln(Nd/τ))).
8: return the multiset S′ = {x ∈ S : |v* · (x − μ̃)| ≤ T + δ_ℓ}.
9: Let p(x) = ((x − μ̃)^T (Σ̃ − I)_(U) (x − μ̃) − Tr((Σ̃ − I)_(U))) / ‖(Σ̃ − I)_(U)‖_F.
10: Find T > 4 such that Pr_{X∈_u S}[|p(X)| ≥ T] ≥ 9 exp(−T/4) + 3ε²/(T ln² T).
11: return the multiset S′ = {x ∈ S : |p(x)| ≤ T}.

Robust Sparse Mean Estimation. Here we briefly describe the motivation and analysis of Algorithm 1, which gives a single iteration of our filter for the robust sparse mean setting.
In order to estimate the k-sparse mean μ, it suffices to ensure that our estimate μ′ has |v · (μ′ − μ)| small for any 2k-sparse unit vector v.
The now-standard idea in robust statistics [DKK+16] is that if a small number of corrupted samples suffices to cause a large change in our estimate of v · μ, then this must lead to a substantial increase in the sample variance of v · x, which we can detect.
Thus, a very basic form of a robust algorithm might be to compute a sample covariance matrix Σ̃, and let v be the 2k-sparse unit vector that maximizes v^T Σ̃ v. If this number is close to 1, it certifies that our estimate μ′ (obtained by truncating the sample mean to its k largest entries) is a good estimate of the true mean μ. If not, this allows us to filter our sample set by throwing away the values where v · x is furthest from the true mean. This procedure guarantees that we have removed more corrupted samples than uncorrupted ones. We then repeat the filter until the empirical variance in every sparse direction is close to 1.
Unfortunately, the optimization problem of finding the optimal v is computationally challenging, requiring a convex program. To circumvent the need for a convex program, we notice that v^T Σ̃ v − 1 = (Σ̃ − I) · (vv^T) is large only if Σ̃ − I has large entries on the (2k)² non-zero entries of vv^T. Thus, if the 4k² largest entries of Σ̃ − I had small ℓ2-norm, this would certify that no such bad v existed and would allow us to return the truncated sample mean. In case these entries have large ℓ2-norm, we show that we can produce a filter that removes more bad samples than good ones. Let A be the matrix consisting of the large entries of Σ̃ (for the moment assume that they are all off-diagonal, but this is not needed). We know that the sample mean of p(x) = (x − μ′)^T A (x − μ′) is Σ̃ · A = ‖A‖²_F. On the other hand, if μ′ approximates μ on the O(k²) entries in question, we would have that ‖p‖_2 = ‖A‖_F. This means that if ‖A‖_F is reasonably large, an ε-fraction of corrupted points changed the mean of p from 0 to ‖A‖²_F = ‖A‖_F ‖p‖_2, so many of these errors must have had |p(x)| = Ω((‖A‖_F/ε) ‖p‖_2). This becomes very unlikely for good samples if ‖A‖_F is much larger than ε (by standard results on the concentration of Gaussian polynomials). Thus, if μ′ is approximately μ on these O(k²) coordinates, we can produce a filter. To ensure this, we can use existing filter-based algorithms to approximate the mean on these O(k²) coordinates. This results in Algorithm 1. For the analysis, we note that if the entries of A are small, then v^T (Σ̃ − I) v must be small for any unit k-sparse v, which certifies that the truncated sample mean is good. Otherwise, we can filter the samples using the first kind of filter. This ensures that our mean estimate is sufficiently close to the true mean that we can then filter using the second kind of filter.
It is not hard to show that the above works if we are given sufficiently many samples, but to obtain a tight analysis of the sample complexity, we need a number of subtle technical ideas. The detailed analysis of the sample complexity is deferred to the full version of our paper.

Robust Sparse PCA. Here we briefly describe the motivation and analysis of Algorithm 2, which gives a single iteration of our filter for the sparse PCA setting.
Note that estimating the k-sparse vector v is equivalent to estimating E[XX^T − I] = vv^T.
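In the uncorrupted case, this reduction is easy to implement. The sketch below recovers a sparse spike from the k² largest-magnitude entries of the empirical mean of Y = XX^T − I; it corresponds only to the output step of the PCA algorithm described next (the filtering of outliers is omitted), and the function name and constants are illustrative.

```python
import numpy as np

def spike_from_restricted_mean(X, k):
    """Estimate a k-sparse spike v from samples of N(0, I + rho v v^T):
    form the empirical mean of Y = x x^T - I, keep its k^2 largest-
    magnitude entries (the support Q), and return the top eigenvector
    of the restricted matrix. Outlier filtering is not performed here."""
    n, d = X.shape
    Ybar = X.T @ X / n - np.eye(d)           # empirical mean of x x^T - I
    flat = np.argsort(np.abs(Ybar), axis=None)
    mask = np.zeros(d * d, dtype=bool)
    mask[flat[-k * k:]] = True               # support Q of the k^2 largest entries
    M = np.where(mask.reshape(d, d), Ybar, 0.0)
    M = (M + M.T) / 2                        # re-symmetrize after masking
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -1]                    # top eigenvector, approximately +/- v

# Spiked-covariance data: v is k-sparse and rho = 1
rng = np.random.default_rng(0)
d, k, n, rho = 100, 4, 5000, 1.0
v = np.zeros(d)
v[:k] = 0.5                                  # unit vector on a size-k support
X = rng.normal(size=(n, d))
X += (np.sqrt(1 + rho) - 1) * np.outer(X @ v, v)   # now Cov(X) = I + rho v v^T
v_hat = spike_from_restricted_mean(X, k)
```

On clean data the k² support entries of vv^T dominate the sampling noise of the remaining entries, so the restricted matrix is close to vv^T and its top eigenvector aligns with v up to sign.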
In fact, estimating E[XX^T − I] to error ε in Frobenius norm allows one to estimate v within error ε in ℓ2-norm. Thus, we focus on the task of robustly approximating the mean of Y = XX^T − I.
Our algorithm is going to take advantage of one fact about X that even errors cannot hide: that Var[v · X] is large. This is because removing uncorrupted samples cannot reduce the variance by much more than an ε-fraction, and adding samples can only increase it. This means that an adversary attempting to fool our algorithm can only do so by creating other directions where the variance is large, or simply by adding other large entries to the sample covariance matrix in order to make it hard to find this particular k-sparse eigenvector. In either case, the adversary is creating large entries in the empirical mean of Y that should not be there. This suggests that the largest entries of the empirical mean of Y, whether errors or not, will be of great importance.
These large entries will tell us where to focus our attention. In particular, we can find the k² largest entries of the empirical mean of Y and attempt to filter based on them. When we do so, one of two things will happen: either we remove bad samples and make progress, or we verify that these entries ought to be large, and thus must come from the support of v.
In particular, when we reach the second case, since the adversary cannot shrink the empirical variance of v · X by much, almost all of the entries on the support of v must remain large, and this can be captured by our algorithm.

Algorithm 2 Robust Sparse PCA via Iterative Filtering
1: procedure ROBUST-SPARSE-PCA(S, k, Σ̃, ε, δ, τ)
input: A multiset S, an estimate Σ̃ of the true covariance, and a real number δ ∈ R.
output: A multiset S′ with a smaller fraction of corrupted samples, or a matrix Σ′ with ‖Σ′ − Σ‖_F ≤ O(√(εδ) + ε log(1/ε)).
2: For any x ∈ R^d define γ(x) := vec(xx^T − I) ∈ R^{d²}.
3: Compute μ̃ := E_S[γ(x)], μ̂ = h_{k²}(μ̃), and Q := Supp(μ̂).
4: Compute M_Q := E_S[(γ(x) − μ̃)(γ(x) − μ̃)^T]_{Q×Q} ∈ R^{k²×k²}.
5: Let λ, v* be the maximum eigenvalue and corresponding eigenvector of M_Q − Cov_{X∼N(0,Σ̃)}(γ(X)_Q).
6: if λ < C · (δ + ε log²(1/ε)), where C is a sufficiently large constant, then
7:   Compute w, the largest eigenvector of mat(μ̃)_Q. return ww^T + I.
8: Let μ̂ = median({γ(x) · v* | x ∈ S}). Find a number T > log(1/ε) satisfying Pr_S[|γ(x)_Q · v* − μ̂| > CT + 3] > ε/(T² log²(T)).
9: return S′ = {x ∈ S : |γ(x)_Q · v* − μ̂| < T}.

The above algorithm works under a set of deterministic conditions on the good set of samples that are satisfied with high probability with poly(k) log(d)/ε² samples.
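A single filtering step of this procedure can be sketched as follows, with the support Q and direction v supplied by the caller (in Algorithm 2 they come from the restricted excess covariance of γ(x)), and with a simple quantile cutoff standing in for the tail test with the constants C and T. All names and thresholds here are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def pca_filter_step(X, Q, v, eps):
    """One filtering step of the sparse-PCA filter (sketch): project
    gamma(x) = vec(x x^T - I), restricted to the support Q, onto a
    direction v, center at the median, and flag the extreme tail."""
    n, d = X.shape
    G = np.einsum('ni,nj->nij', X, X).reshape(n, d * d) - np.eye(d).ravel()
    s = G[:, Q] @ v                          # test statistic gamma(x)_Q . v
    dev = np.abs(s - np.median(s))           # deviation from the robust center
    keep = dev <= np.quantile(dev, 1 - eps / 2)
    return keep                              # boolean mask of retained samples

# Outliers with extra variance along a sparse direction u, as in our experiments
rng = np.random.default_rng(0)
d, k, n, eps = 20, 4, 1000, 0.1
u = np.zeros(d)
u[:k] = 1 / np.sqrt(k)
X = rng.normal(size=(n, d))
X[:100] += 4.0 * rng.normal(size=(100, 1)) * u      # first 100 rows are outliers
Q = np.array([i * d + j for i in range(k) for j in range(k)])
keep = pca_filter_step(X, Q, np.ones(k * k) / k, eps)
```

The median is used as the center because, unlike the mean, it is barely moved by an ε-fraction of corruptions, so the deviation scores of planted outliers remain large.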
Our current analysis does not establish the information-theoretically optimal sample size of O(k² log(d)/ε²), though we believe that this is plausible via a tighter analysis.
We note that a naive implementation of this algorithm will achieve error poly(ε) in our final estimate for v, while our goal is to obtain Õ(ε) error. To achieve this, we need to overcome two difficulties. First, when trying to filter Y on subsets of its coordinates, we do not know the true variance of Y, and thus cannot expect to obtain Õ(ε) error. This is fixed with a bootstrapping method, similar to that in [Kan18] for estimating the covariance of a Gaussian. In particular, we do not know Var[Y] a priori, but after we run the algorithm, we obtain an approximation to v, which gives an approximation to Var[Y]. This in turn lets us get a better approximation to v and a better approximation to Var[Y]; and so on.

3 Experiments

For every experiment, we run 10 trials and plot the median value of the measurement. We shade the interquartile range around each measurement as a measure of its confidence. Each experiment was run on a computer with a 2.7 GHz Intel Core i5 processor and 8GB of 1867 MHz DDR3 RAM.

3.1 Robust Sparse Mean Estimation

The performance of robust estimation algorithms depends heavily on the noise model. The "hard" noise distributions for one algorithm may be easy for a different algorithm, if that one can identify and filter out the outliers.
We therefore consider three different synthetic data distributions: two that demonstrate the ε√k worst-case performance of other algorithms, and one that demonstrates the ε√(log(1/ε)) performance of our full algorithm.
The algorithms we consider are RME sp, our algorithm; RME sp L, a version of our algorithm with only the linear filter and not the quadratic one; NP, the "naive pruning" algorithm that drops samples with obviously-outlier coordinates, then outputs the empirical mean; oracle, which is told exactly

(a) Unlike RANSAC, our algorithm RME sp can filter out the noise and match the oracle's performance. RME also matches the oracle, but needs more samples. (b) For fixed m, as k increases, RANSAC and NP both diverge from RME and RME sp.
Figure 1: Constant-bias noise is easy for our algorithm, since it is caught by the linear filter.

(a) With sufficiently many samples, the quadratic filter can filter out the noise, matching the oracle. The linear filter alone does not, even with a large number of samples. (b) For k ≪ √d, the linear filter alone does not filter out the noise, leading to an ε√k dependence for RME sp L.
Our algorithm RME sp nearly matches oracle.
Figure 2: The linear-hiding noise model shows that the quadratic filter is necessary.

(a) This noise model gives Ω(ε√(log(1/ε))) error to the oracle, and RME sp is at most twice this. (b) This gap persists regardless of m.
Figure 3: The flipping noise model demonstrates that the error can remain Ω(ε√(log(1/ε))).

(a) The constant noise model is easy to remove and does not take many samples. (b) The linear-hiding noise model is harder and requires more samples to get the same guarantee.
Figure 4: Sample complexity required to do well (in this case, 70% of errors being less than 1.2) depends on the noise model.

(a) Not only does RME not have small error for small sample complexity, interestingly it also takes longer to terminate. (b) RME sp takes time close to RME sp L until the quadratic filter begins to apply (as can be seen in Figure 2), after which it takes much longer. (c) The runtimes for our sparse algorithms do not change very much as we increase k for the linear-hiding noise. (d) The runtimes for our sparse algorithms appear to increase linearly with d in the case of linear-hiding noise.
Figure 5: Runtimes for robust mean estimation.

which coordinates are inliers and outputs their empirical mean; RME, which applies the non-sparse robust mean estimation algorithm of [DKK+17]; and RANSAC, which computes the mean of a randomly chosen set of points, half the size of the entire set. One mean is preferred to another if it has more points in a ball of radius √(d + √d) around it. For algorithms that have non-sparse outputs, we sparsify to the largest k coordinates before measuring the ℓ2 distance to the true mean.
Our distributions are:

• Constant-bias noise. Noise that biases every coordinate consistently (e.g., if the outliers add 2 to every coordinate, or set every coordinate to μ_i + 1) is difficult for naive algorithms (such as coordinate-wise median, NP, RANSAC) to deal with, but ideal for the linear filter. In Figure 1 we consider the noise that adds 2 to every coordinate.
• Linear-hiding noise. To demonstrate that the quadratic filter in our algorithm is necessary, we use the following data distribution. The inliers are drawn from N(0, I).
The outliers\nare evenly split between two types: N (1S, I) for some size-k set S, and N (0, 2I \u2212 IS).\nThe diagonal of the empirical covariance does not reveal S, so our linear \ufb01lter fails to prune\nanything, leading to \u03b5\nk error for RME sp L; the quadratic \ufb01lter successfully removes all\nthe outliers. This is shown in Figure 2.\n\u2022 Flipping noise. For both those types of noise, with suf\ufb01ciently many samples our \ufb01nal\nalgorithm will prune out essentially all the outliers; there also exist noise models where\n\nnoise model that picks a k-sparse direction v, and replaces the \u03b5 fraction of points furthest\nin the \u2212v direction with points in the +v direction. In fact, for this noise even the oracle\n\n\u2126(\u03b5(cid:112)log(1/\u03b5)) noise will remain at all times. In Figure 3 we demonstrate this for the\nmethod also has \u2126(\u03b5(cid:112)log(1/\u03b5)) error from the missing points, but our algorithm has twice\nformance of RME sp seems to be within a constant factor of the O(\u03b5(cid:112)log(1/\u03b5)) worst-case per-\n\nDiscussion. Matching our theoretical results, with suf\ufb01ciently many samples the worst-case per-\n\nthe error from the un\ufb01lterable added points.\n\nformance of oracle. This is not true for the naive algorithms NP, RANSAC, or the simpli\ufb01cation\nRME sp L of our algorithm, which all have an \u03b5\nk dependence. While our theoretical results show\n\n\u221a\n\n8\n\n0100200300m0.000.050.100.150.20secd = 300, k = 10, eps = 0.1. Noise is constant +2 biasedNP runtimeRME_sp runtimeRME runtime20004000600080001000012000m0123secd = 1000, k = 40, eps = 0.1. Noise is linear-hidingNP runtimeRME_sp_L runtimeRME_sp runtimeRME runtime050100150200k0.00.10.20.30.40.5secd = 300, eps = 0.1, m = 5000. Noise is linear-hidingNP runtimeRME_sp_L runtimeRME_sp runtime100200300d0.000.020.040.060.080.10seck = 20, eps = 0.1, m = 3000. 
While our theoretical results show that Õ(k²) samples suffice, the empirical results given in Figure 4 are consistent with Õ(k) being sufficient.

Our algorithm runs much faster than the ellipsoid-based approach. For instance, for k = 10, d = 300, m = 50 in the case of constant-biased noise, our algorithm takes 0.015 seconds to finish. In comparison, the very first iteration of the SDP-based solution takes 10 seconds to solve with CVXOPT; the full ellipsoid-based algorithm, if implemented, would take many times that.

3.2 Robust Sparse PCA

In Figure 6 we compare our robust sparse PCA algorithm RSPCA to a dense algorithm RDPCA for robust PCA. RDPCA looks at the empirical covariance matrix and robustly estimates the standard deviation in the direction of maximum variance. The algorithm then filters points using a modified version of the linear filter from [DKK+17] and hence requires a sample complexity of Õ(d). For this algorithm, we only consider a single simple noise model: we draw outlier samples from N(0, I + uuᵀ), where u has disjoint support from the true vector v.

Figure 6: Sample complexity of RSPCA is better than that of RDPCA for smaller sparsity. (a) The natural dense algorithm RDPCA requires more samples than the sparse algorithm to get error < 0.1 (d = 200, k = 4, ε = 0.1). (b) For a fixed m (d = 300, ε = 0.1, m = 2000), RSPCA performs better than RDPCA when k < √d and then performs worse, until coming close to RDPCA. Note that the variance of RSPCA is smaller than that of RDPCA.

The sparse algorithm seems to perform better than the dense algorithm for k up to roughly √d; this is better than what we can prove, which is that it should be better up to at least d^(1/4).

4 Conclusions

In this paper, we have presented iterative filtering algorithms for two natural and fundamental robust sparse estimation tasks: sparse mean estimation and sparse PCA.
In both cases, our algorithms achieve near-optimal Õ(ε) error with sample complexity depending primarily on the sparsity k and only logarithmically on the ambient dimension d. Our theoretical guarantees are comparable to those of [BDLS17], with the significant advantage that our algorithms only use simple spectral techniques rather than the ellipsoid algorithm. This makes our algorithms practically viable and easy to implement. Our implementations perform essentially as expected: in sparse settings they require significantly fewer samples than dense robust estimation, and their accuracy avoids the √k dependence of common benchmark techniques like RANSAC.

5 Acknowledgements

The authors would like to thank the following sources of support. Ilias Diakonikolas was supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship. Sushrut Karmalkar was supported by NSF Award CNS-1414023. Daniel Kane was supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship. Eric Price was supported in part by NSF Award CCF-1751040 (CAREER). A part of this work was performed when Alistair Stewart was a postdoctoral researcher at USC.

References

[BDLS17] S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse estimation in high dimensions. In Proc. 30th Annual Conference on Learning Theory (COLT), pages 169–212, 2017.

[BH95] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.

[BNJT10] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar. The security of machine learning.
Machine Learning, 81(2):121–148, 2010.

[BNL12] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2012.

[CDG18] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. CoRR, abs/1811.09380, 2018. Conference version in SODA 2019, pages 2755–2771.

[CDGW19] Y. Cheng, I. Diakonikolas, R. Ge, and D. P. Woodruff. Faster algorithms for high-dimensional robust covariance estimation. In Conference on Learning Theory, COLT 2019, pages 727–757, 2019.

[CDKS18] Y. Cheng, I. Diakonikolas, D. M. Kane, and A. Stewart. Robust learning of fixed-structure Bayesian networks. In Proc. 33rd Annual Conference on Neural Information Processing Systems (NIPS), 2018.

[CSV17] M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Proc. 49th Annual ACM Symposium on Theory of Computing (STOC), pages 47–60, 2017.

[DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 655–664, 2016. Journal version in SIAM Journal on Computing, 48(2), pages 742–864, 2019.

[DKK+17] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proc. 34th International Conference on Machine Learning (ICML), pages 999–1008, 2017.

[DKK+18a] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proc. 29th Annual Symposium on Discrete Algorithms (SODA), pages 2683–2702, 2018.

[DKK+18b] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart.
Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.

[DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.

[DKS18a] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1061–1073, 2018.

[DKS18b] I. Diakonikolas, D. M. Kane, and A. Stewart. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1047–1060, 2018.

[DKS19] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proc. 30th Annual Symposium on Discrete Algorithms (SODA), 2019.

[HL18] S. B. Hopkins and J. Li. Mixture models, robustness, and sum of squares proofs. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1021–1034, 2018.

[HR09] P. J. Huber and E. M. Ronchetti. Robust Statistics. Wiley, New York, 2009.

[HRRS86] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, 1986.

[HTW15] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015.

[Hub64] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.

[Joh01] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327, 2001.

[Joh17] I. M. Johnstone. Gaussian estimation: Sequence and wavelet models.
Available at http://statweb.stanford.edu/~imj/GE_08_09_17.pdf, 2017.

[Kan18] D. M. Kane. Robust covariance estimation. Talk given at the TTIC Workshop on Computational Efficiency and High-Dimensional Robust Statistics, 2018. Available at http://www.iliasdiakonikolas.org/tti-robust/Kane-Covariance.pdf.

[KKM18] A. Klivans, P. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Proc. 31st Annual Conference on Learning Theory (COLT), pages 1420–1430, 2018.

[KL93] M. J. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.

[KSS18] P. K. Kothari, J. Steinhardt, and D. Steurer. Robust moment estimation and improved clustering via sum of squares. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1035–1046, 2018.

[LAT+08] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319:1100–1104, 2008.

[LLC19] L. Liu, T. Li, and C. Caramanis. High dimensional robust estimation of sparse models via trimmed hard thresholding. CoRR, abs/1901.08237, 2019.

[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), 2016.

[LSLC18a] L. Liu, Y. Shen, T. Li, and C. Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.

[LSLC18b] L. Liu, Y. Shen, T. Li, and C. Caramanis. High dimensional robust sparse regression. CoRR, abs/1805.11643, 2018.

[PLJD10] P. Paschou, J. Lewis, A. Javed, and P. Drineas. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, 47:835–847, 2010.

[PSBR18] A. Prasad, A. S. Suggala, S.
Balakrishnan, and P. Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[RPW+02] N. Rosenberg, J. Pritchard, J. Weber, H. Cann, K. Kidd, L. A. Zhivotovsky, and M. W. Feldman. Genetic structure of human populations. Science, 298:2381–2385, 2002.

[SCV18] J. Steinhardt, M. Charikar, and G. Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proc. 9th Innovations in Theoretical Computer Science Conference (ITCS), pages 45:1–45:21, 2018.

[SKL17] J. Steinhardt, P. Wei Koh, and P. S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30, pages 3520–3532, 2017.

[Tsy08] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 2008.

[Val85] L. Valiant. Learning disjunctions of conjunctions. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence, pages 560–566, 1985.