{"title": "Robust Regression via Hard Thresholding", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 729, "abstract": "We study the problem of Robust Least Squares Regression (RLSR) where several response variables can be adversarially corrupted. More specifically, for a data matrix X \\in \\R^{p x n} and an underlying model w*, the response vector is generated as y = X'w* + b where b \\in n is the corruption vector supported over at most C.n coordinates. Existing exact recovery results for RLSR focus solely on L1-penalty based convex formulations and impose relatively strict model assumptions such as requiring the corruptions b to be selected independently of X.In this work, we study a simple hard-thresholding algorithm called TORRENT which, under mild conditions on X, can recover w* exactly even if b corrupts the response variables in an adversarial manner, i.e. both the support and entries of b are selected adversarially after observing X and w*. Our results hold under deterministic assumptions which are satisfied if X is sampled from any sub-Gaussian distribution. Finally unlike existing results that apply only to a fixed w*, generated independently of X, our results are universal and hold for any w* \\in \\R^p.Next, we propose gradient descent-based extensions of TORRENT that can scale efficiently to large scale problems, such as high dimensional sparse recovery. and prove similar recovery guarantees for these extensions. Empirically we find TORRENT, and more so its extensions, offering significantly faster recovery than the state-of-the-art L1 solvers. 
For instance, even on moderate-sized datasets (with p = 50K) with around 40% corrupted responses, a variant of our proposed method called TORRENT-HYB is more than 20× faster than the best L1 solver.", "full_text": "Robust Regression via Hard Thresholding

Kush Bhatia†, Prateek Jain†, and Purushottam Kar‡*
†Microsoft Research, India
‡Indian Institute of Technology Kanpur, India
{t-kushb,prajain}@microsoft.com, purushot@cse.iitk.ac.in

Abstract

We study the problem of Robust Least Squares Regression (RLSR) where several response variables can be adversarially corrupted. More specifically, for a data matrix X ∈ R^{p×n} and an underlying model w*, the response vector is generated as y = X^⊤ w* + b, where b ∈ R^n is the corruption vector supported over at most C·n coordinates. Existing exact recovery results for RLSR focus solely on L1-penalty based convex formulations and impose relatively strict model assumptions, such as requiring the corruptions b to be selected independently of X.
In this work, we study a simple hard-thresholding algorithm called TORRENT which, under mild conditions on X, can recover w* exactly even if b corrupts the response variables in an adversarial manner, i.e. both the support and entries of b are selected adversarially after observing X and w*. Our results hold under deterministic assumptions which are satisfied if X is sampled from any sub-Gaussian distribution. Finally, unlike existing results that apply only to a fixed w* generated independently of X, our results are universal and hold for any w* ∈ R^p.
Next, we propose gradient descent-based extensions of TORRENT that can scale efficiently to large-scale problems, such as high-dimensional sparse recovery, and prove similar recovery guarantees for these extensions. 
Empirically we find TORRENT, and more so its extensions, offering significantly faster recovery than the state-of-the-art L1 solvers. For instance, even on moderate-sized datasets (with p = 50K) with around 40% corrupted responses, a variant of our proposed method called TORRENT-HYB is more than 20× faster than the best L1 solver.

"If among these errors are some which appear too large to be admissible, then those equations which produced these errors will be rejected, as coming from too faulty experiments, and the unknowns will be determined by means of the other equations, which will then give much smaller errors."
A. M. Legendre, On the Method of Least Squares. 1805.

1 Introduction

Robust Least Squares Regression (RLSR) addresses the problem of learning a reliable set of regression coefficients in the presence of several arbitrary corruptions in the response vector. Owing to the wide applicability of regression, RLSR features as a critical component of several important real-world applications in a variety of domains such as signal processing [1], economics [2], computer vision [3, 4], and astronomy [2].
Given a data matrix X = [x_1, ..., x_n] with n data points in R^p and the corresponding response vector y ∈ R^n, the goal of RLSR is to learn a ŵ such that,

(ŵ, Ŝ) = arg min_{w ∈ R^p, S ⊂ [n]: |S| ≥ (1−β)·n} Σ_{i∈S} (y_i − x_i^⊤ w)²,    (1)

*This work was done while P.K. was a postdoctoral researcher at Microsoft Research India.

That is, we wish to simultaneously determine the set of corruption-free points Ŝ and also estimate the best model parameters over the set of clean points. 
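The generative setup behind this objective is easy to simulate. The sketch below (NumPy; the dimensions and the uniform corruption law are illustrative choices, not fixed by the paper at this point) builds a response vector from a sparse corruption vector and evaluates the RLSR objective on the clean set:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, alpha = 20, 500, 0.2              # dimension, sample count, corruption fraction

X = rng.standard_normal((p, n))          # columns are the data points x_i
w_star = rng.standard_normal(p)
w_star /= np.linalg.norm(w_star)         # underlying model w*

y_clean = X.T @ w_star                   # y* = X^T w*
b = np.zeros(n)
S_bad = rng.choice(n, size=int(alpha * n), replace=False)
b[S_bad] = rng.uniform(-5, 5, size=len(S_bad))   # sparse, unbounded corruptions
y = y_clean + b                          # observed responses

# The inner sum of the RLSR objective, evaluated at (w*, true clean set):
S_clean = np.setdiff1d(np.arange(n), S_bad)
obj = np.sum((y[S_clean] - X[:, S_clean].T @ w_star) ** 2)
assert obj < 1e-12                       # clean points fit w* up to float error
```

With no dense noise, the objective vanishes on the clean set at w*; the difficulty of (1) lies entirely in not knowing which points are clean.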
However, the optimization problem given above is non-convex (jointly in w and S) in general and might not directly admit efficient solutions. Indeed there exist reformulations of this problem that are known to be NP-hard to optimize [1].
To address this problem, most existing methods with provable guarantees assume that the observations are obtained from some generative model. A commonly adopted model is the following

y = X^⊤ w* + b,    (2)

where w* ∈ R^p is the true model vector that we wish to estimate and b ∈ R^n is the corruption vector that can have arbitrary values. A common assumption is that the corruption vector is sparsely supported, i.e. ‖b‖_0 ≤ α·n for some α > 0.
Recently, [4] and [5] obtained a surprising result which shows that one can recover w* exactly even when α ≲ 1, i.e., when almost all the points are corrupted, by solving an L1-penalty based convex optimization problem: min_{w,b} ‖w‖_1 + λ‖b‖_1, s.t. X^⊤ w + b = y. However, these results require the corruption vector b to be selected oblivious of X and w*. Moreover, the results impose severe restrictions on the data distribution, requiring that the data be either sampled from an isotropic Gaussian ensemble [4], or row-sampled from an incoherent orthogonal matrix [5]. Finally, these results hold only for a fixed w* and are not universal in general.
In contrast, [6] studied RLSR with less stringent assumptions, allowing arbitrary corruptions in response variables as well as in the data matrix X, and proposed a trimmed inner product based algorithm for the problem. However, their recovery guarantees are significantly weaker. Firstly, they are able to recover w* only up to an additive error α√p (or α√s if w* is s-sparse). Hence, they require α ≤ 1/√p just to claim a non-trivial bound. 
Note that this amounts to being able to tolerate only a vanishing fraction of corruptions. More importantly, even with n → ∞ and extremely small α, they are unable to guarantee exact recovery of w*. A similar result was obtained by [7], albeit using a sub-sampling based algorithm with stronger assumptions on b.
In this paper, we focus on a simple and natural thresholding based algorithm for RLSR. At a high level, at each step t, our algorithm alternately estimates an active set S_t of "clean" points and then updates the model to obtain w_{t+1} by minimizing the least squares error on the active set. This intuitive algorithm seems to embody a long-standing heuristic first proposed by Legendre [8] over two centuries ago (see introductory quotation in this paper) that has been adopted in later literature [9, 10] as well. However, to the best of our knowledge, this technique has never been rigorously analyzed before in non-asymptotic settings, despite its appealing simplicity.
Our Contributions: The main contribution of this paper is an exact recovery guarantee for the thresholding algorithm mentioned above, which we refer to as TORRENT-FC (see Algorithm 1). We provide our guarantees in the model given in (2) where the corruptions b are selected adversarially but restricted to have at most α·n non-zero entries, where α is a global constant dependent only on X¹. Under deterministic conditions on X, namely the subset strong convexity (SSC) and smoothness (SSS) properties (see Definition 1), we guarantee that TORRENT-FC converges at a geometric rate and recovers w* exactly. 
We further show that these properties (SSC and SSS) are satisfied w.h.p. if a) the data X is sampled from a sub-Gaussian distribution and, b) n ≥ p log p.
We would like to stress three key advantages of our result over the results of [4, 5]: a) we allow b to be adversarial, i.e., both support and values of b to be selected adversarially based on X and w*, b) we make assumptions on data that are natural, as well as significantly less restrictive than what existing methods make, and c) our analysis admits universal guarantees, i.e., holds for any w*.
We would also like to stress that while hard-thresholding based methods have been studied rigorously for the sparse-recovery problem [11, 12], hard thresholding has not been studied formally for the robust regression problem. [13] study soft-thresholding approaches to the robust regression problem but without any formal guarantees. Moreover, the two problems are completely different and hence techniques from sparse-recovery analysis do not extend to robust regression.

¹Note that for an adaptive adversary, as is the case in our work, recovery cannot be guaranteed for α ≥ 1/2, since the adversary can introduce corruptions as b_i = x_i^⊤(w̃ − w*) for an adversarially chosen model w̃. This would make it impossible for any algorithm to distinguish between w* and w̃, thus making recovery impossible.

Despite its simplicity, TORRENT-FC does not scale very well to datasets with large p as it solves least squares problems at each iteration. 
We address this issue by designing a gradient descent based algorithm (TORRENT-GD), and a hybrid algorithm (TORRENT-HYB), both of which enjoy a geometric rate of convergence and can recover w* under the model assumptions mentioned above. We also propose extensions of TORRENT for the RLSR problem in the sparse regression setting where p ≫ n but ‖w*‖_0 = s* ≪ p. Our algorithm TORRENT-HD is based on TORRENT-FC but uses the Iterative Hard Thresholding (IHT) algorithm, a popular algorithm for sparse regression. As before, we show that TORRENT-HD also converges geometrically to w* if a) the corruption index α is less than some constant C, b) X is sampled from a sub-Gaussian distribution and, c) n ≥ s* log p.
Finally, we experimentally evaluate existing L1-based algorithms and our hard thresholding-based algorithms. The results demonstrate that our proposed algorithms (TORRENT-(FC/GD/HYB)) can be significantly faster than the best L1 solvers, exhibit better recovery properties, as well as be more robust to dense white noise. For instance, on a problem with 50K dimensions and 40% corruption, TORRENT-HYB was found to be 20× faster than L1 solvers, as well as achieve lower error rates.

2 Problem Formulation

Given a set of data points X = [x_1, x_2, ..., x_n], where x_i ∈ R^p, and the corresponding response vector y ∈ R^n, the goal is to recover a parameter vector w* which solves the RLSR problem (1). We assume that the response vector y is generated using the following model:

y = y* + b + ε, where y* = X^⊤ w*.

Hence, in the above model, (1) reduces to estimating w*. We allow the model w* representing the regressor to be chosen in an adaptive manner after the data features have been generated.
The above model allows two kinds of perturbations to y_i – dense but bounded noise ε_i (e.g. 
white noise ε_i ∼ N(0, σ²), σ ≥ 0), as well as potentially unbounded corruptions b_i – to be introduced by an adversary. The only requirement we enforce is that the gross corruptions be sparse. ε shall represent the dense noise vector, for example ε ∼ N(0, σ²·I_{n×n}), and b the corruption vector, such that ‖b‖_0 ≤ α·n for some corruption index α > 0. We shall use the notation S* ⊆ [n] to denote the set of "clean" points, i.e. points that have not faced unbounded corruptions, so that supp(b) = [n]\S*. We consider adaptive adversaries that are able to view the generated data points x_i, as well as the clean responses y*_i and dense noise values ε_i, before deciding which locations to corrupt and by what amount.
We denote the unit sphere in p dimensions by S^{p−1}. For any γ ∈ (0, 1], we let S_γ = {S ⊂ [n] : |S| = γ·n} denote the set of all subsets of size γ·n. For any set S, we let X_S := [x_i]_{i∈S} ∈ R^{p×|S|} denote the matrix whose columns are composed of points in that set. Also, for any vector v ∈ R^n, we use the notation v_S to denote the |S|-dimensional vector consisting of those components that are in S. We use λ_min(X) and λ_max(X) to denote, respectively, the smallest and largest eigenvalues of a square symmetric matrix X. We now introduce two properties, namely Subset Strong Convexity and Subset Strong Smoothness, which are key to our analyses.
Definition 1 (SSC and SSS Properties). A matrix X ∈ R^{p×n} satisfies the Subset Strong Convexity Property (resp. Subset Strong Smoothness Property) at level γ with strong convexity constant λ_γ (resp. 
strong smoothness constant Λ_γ) if the following holds:

λ_γ ≤ min_{S∈S_γ} λ_min(X_S X_S^⊤) ≤ max_{S∈S_γ} λ_max(X_S X_S^⊤) ≤ Λ_γ.

Remark 1. We note that the uniformity enforced in the definitions of the SSC and SSS properties is not for the sake of convenience but rather a necessity. Indeed, a uniform bound is required in face of an adversary which can perform corruptions after data and response variables have been generated, and choose to corrupt precisely that set of points where the SSC and SSS parameters are the worst.

3 TORRENT: Thresholding Operator-based Robust Regression Method

We now present TORRENT, a Thresholding Operator-based Robust RegrEssioN meThod for performing robust regression at scale. Key to our algorithms is the Hard Thresholding Operator which we define below.

Algorithm 1 TORRENT: Thresholding Operator-based Robust RegrEssioN meThod
Input: Training data {x_i, y_i}, i = 1 ... n, step length η, thresholding parameter β, tolerance ε
1: w^0 ← 0, S_0 = [n], t ← 0, r^0 ← y
2: while ‖r^t_{S_t}‖_2 > ε do
3:   w^{t+1} ← UPDATE(w^t, S_t, η, r^t, S_{t−1})
4:   r^{t+1}_i ← y_i − ⟨w^{t+1}, x_i⟩
5:   S_{t+1} ← HT(r^{t+1}, (1−β)n)
6:   t ← t + 1
7: end while
8: return w^t

Algorithm 2 UPDATE TORRENT-FC
Input: Current model w, current active set S
1: return arg min_w Σ_{i∈S} (y_i − ⟨w, x_i⟩)²

Algorithm 3 UPDATE TORRENT-GD
Input: Current model w, current active set S, step size η
1: g ← X_S(X_S^⊤ w − y_S)
2: return w − η·g

Algorithm 4 UPDATE TORRENT-HYB
Input: Current model w, current active set S, step size η, current residuals r, previous active set S′
1: // Use the GD update if the active set S is changing a lot
2: if |S\S′| > Δ then
3:   w′ ← UPDATE-GD(w, S, η, r, S′)
4: else
5:   // If stable, use the FC update
6:   w′ ← UPDATE-FC(w, S)
7: end if
8: return w′

Definition 2 (Hard Thresholding Operator). For any vector v ∈ R^n, let σ_v ∈ S_n be the permutation that orders elements of v in ascending order of their magnitudes, i.e. |v_{σ_v(1)}| ≤ |v_{σ_v(2)}| ≤ ... ≤ |v_{σ_v(n)}|. Then for any k ≤ n, we define the hard thresholding operator as

HT(v; k) = {i ∈ [n] : σ_v^{−1}(i) ≤ k}

Using this operator, we present our algorithm TORRENT (Algorithm 1) for robust regression. TORRENT follows a most natural iterative strategy of, alternately, estimating an active set of points which have the least residual error on the current regressor, and then updating the regressor to provide a better fit on this active set. 
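This loop, with the fully corrective update, is short enough to transcribe directly. The sketch below is a plain NumPy rendition of Algorithms 1 and 2 (the iteration cap and default tolerance are illustrative additions):

```python
import numpy as np

def hard_threshold(r, k):
    """HT(r; k): indices of the k smallest-magnitude entries of r (Definition 2)."""
    return np.argsort(np.abs(r))[:k]

def torrent_fc(X, y, beta, eps=1e-10, max_iter=100):
    """Fully corrective TORRENT (Algorithm 1 with the Algorithm 2 update).

    X : p x n data matrix whose columns are the points x_i
    y : length-n response vector
    beta : thresholding parameter, an upper bound on the corruption fraction
    """
    p, n = X.shape
    w = np.zeros(p)
    S = np.arange(n)                         # S_0 = [n]
    k = int((1 - beta) * n)
    for _ in range(max_iter):
        # fully corrective step: exact least squares on the active set
        XS = X[:, S]
        w = np.linalg.lstsq(XS.T, y[S], rcond=None)[0]
        r = y - X.T @ w                      # residuals on all points
        S = hard_threshold(r, k)             # keep the (1-beta)n best-fit points
        if np.linalg.norm(r[S]) <= eps:      # stop once the active set fits well
            break
    return w
```

On data drawn from model (2) with a clean majority and no dense noise, this loop typically converges to w* up to machine precision within a handful of iterations.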
We offer three variants of our algorithm, based on how aggressively the algorithm tries to fit the regressor to the current active set.
We first propose a fully corrective algorithm TORRENT-FC (Algorithm 2) that performs a fully corrective least squares regression step in an effort to minimize the regression error on the active set. This algorithm makes significant progress in each step, but at the cost of more expensive updates. To address this, we then propose a milder, gradient descent-based variant TORRENT-GD (Algorithm 3) that performs a much cheaper update of taking a single step in the direction of the gradient of the objective function on the active set. This reduces the regression error on the active set but does not minimize it. This turns out to be beneficial in situations where dense noise is present along with sparse corruptions, since it prevents the algorithm from overfitting to the current active set.
Both the algorithms proposed above have their pros and cons – the FC algorithm provides significant improvements with each step but is expensive to execute, whereas the GD variant, although efficient in executing each step, offers slower progress. To get the best of both these algorithms, we propose a third, hybrid variant TORRENT-HYB (Algorithm 4) that adaptively selects either the FC or the GD update depending on whether the active set is stable across iterations or not.
In the next section we show that this hard thresholding-based strategy offers a linear convergence rate for the algorithm in all its three variations. We shall also demonstrate the applicability of this technique to high dimensional sparse recovery settings in a subsequent section.

4 Convergence Guarantees

For the sake of ease of exposition, we will first present our convergence analyses for cases where dense noise is not present, i.e. 
y = X^⊤ w* + b, and will handle cases with dense noise and sparse corruptions later. We first analyze the fully corrective TORRENT-FC algorithm. The convergence proof in this case relies on the optimality of the two steps carried out by the algorithm: the fully corrective step that selects the best regressor on the active set, and the hard thresholding step that discovers a new active set by selecting points with the least residual error on the current regressor.

Theorem 3. Let X = [x_1, ..., x_n] ∈ R^{p×n} be the given data matrix and y = X^⊤ w* + b be the corrupted output with ‖b‖_0 ≤ α·n. Let Algorithm 2 be executed on this data with the thresholding parameter set to β ≥ α. Let Σ_0 be an invertible matrix such that X̃ = Σ_0^{−1/2} X satisfies the SSC and SSS properties at level γ with constants λ_γ and Λ_γ respectively (see Definition 1). If the data satisfies (1+√2)·Λ_β/λ_{1−β} < 1, then after t = O(log((1/√n)·‖b‖_2/ε)) iterations, Algorithm 2 obtains an ε-accurate solution w^t, i.e. ‖w^t − w*‖_2 ≤ ε.

Proof (Sketch). Let r^t = y − X^⊤ w^t be the vector of residuals at time t and C_t = X_{S_t} X_{S_t}^⊤. Also let S* = [n]\supp(b) be the set of uncorrupted points. The fully corrective step ensures that

w^{t+1} = C_t^{−1} X_{S_t} y_{S_t} = C_t^{−1} X_{S_t} (X_{S_t}^⊤ w* + b_{S_t}) = w* + C_t^{−1} X_{S_t} b_{S_t},

whereas the hard thresholding step ensures that ‖r^{t+1}_{S_{t+1}}‖_2² ≤ ‖r^{t+1}_{S*}‖_2². Combining the two gives us

‖b_{S_{t+1}}‖_2² ≤(ζ_1) ‖X̃_{S*\S_{t+1}}^⊤ (X̃_{S_t} X̃_{S_t}^⊤)^{−1} X̃_{S_t} b_{S_t}‖_2² + 2·b_{S_{t+1}}^⊤ X̃_{S_{t+1}}^⊤ (X̃_{S_t} X̃_{S_t}^⊤)^{−1} X̃_{S_t} b_{S_t}
≤(ζ_2) (Λ_β²/λ_{1−β}²)·‖b_{S_t}‖_2² + 2·(Λ_β/λ_{1−β})·‖b_{S_t}‖_2 ‖b_{S_{t+1}}‖_2,

where ζ_1 follows from setting X̃ = Σ_0^{−1/2} X and X_S^⊤ C_t^{−1} X_{S′} = X̃_S^⊤ (X̃_{S_t} X̃_{S_t}^⊤)^{−1} X̃_{S′}, and ζ_2 follows from the SSC and SSS properties, ‖b_{S_t}‖_0 ≤ ‖b‖_0 ≤ β·n and |S*\S_{t+1}| ≤ β·n. Solving the quadratic equation and performing other manipulations gives us the claimed result.

Theorem 3 relies on a deterministic (fixed design) assumption, specifically (1+√2)·Λ_β/λ_{1−β} < 1, in order to guarantee convergence. We can show that a large class of random designs, including Gaussian and sub-Gaussian designs, actually satisfy this requirement. 
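The min/max in Definition 1 range over all (γn)-sized subsets, so λ_γ and Λ_γ are combinatorial to compute exactly; for a given random design one can still probe them by sampling subsets. The helper below is a hypothetical diagnostic (Monte-Carlo only): it upper-bounds the true λ_γ and lower-bounds the true Λ_γ, since the genuine constants extremize over all subsets rather than a sample.

```python
import numpy as np

def estimate_ssc_sss(X, gamma, trials=100, seed=0):
    """Monte-Carlo probe of the SSC/SSS constants of Definition 1 at level gamma.

    Samples random subsets S with |S| = gamma * n and tracks the extreme
    eigenvalues of X_S X_S^T.  Returns (lam_est, Lam_est): an over-estimate
    of lambda_gamma and an under-estimate of Lambda_gamma -- a diagnostic,
    not a certificate, since an adversary picks the worst subset.
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    k = int(gamma * n)
    lam_est, Lam_est = np.inf, -np.inf
    for _ in range(trials):
        S = rng.choice(n, size=k, replace=False)
        eigs = np.linalg.eigvalsh(X[:, S] @ X[:, S].T)  # ascending order
        lam_est = min(lam_est, eigs[0])
        Lam_est = max(Lam_est, eigs[-1])
    return lam_est, Lam_est
```

Probing Λ_β with gamma = β and λ_{1−β} with gamma = 1−β gives a quick sanity check of the (1+√2)·Λ_β/λ_{1−β} < 1 condition of Theorem 3 on a concrete sample, keeping in mind that the sampled extremes are optimistic.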
That is to say, data generated from these distributions satisfy the SSC and SSS conditions such that (1+√2)·Λ_β/λ_{1−β} < 1 with high probability. Theorem 4 explicates this for the class of Gaussian designs.
Theorem 4. Let X = [x_1, ..., x_n] ∈ R^{p×n} be the given data matrix with each x_i ∼ N(0, Σ). Let y = X^⊤ w* + b and ‖b‖_0 ≤ α·n. Also, let α ≤ β < 1/65 and n ≥ Ω(p + log(1/δ)). Then, with probability at least 1−δ, the data satisfies (1+√2)·Λ_β/λ_{1−β} < 9/10. More specifically, after T ≥ 10 log((1/√n)·‖b‖_2/ε) iterations of Algorithm 1 with the thresholding parameter set to β, we have ‖w^T − w*‖ ≤ ε.
Remark 2. Note that Theorem 4 provides rates that are independent of the condition number λ_max(Σ)/λ_min(Σ) of the distribution. We also note that results similar to Theorem 4 can be proven for the larger class of sub-Gaussian distributions. We refer the reader to Section G for the same.
Remark 3. We remind the reader that our analyses can readily accommodate dense noise in addition to sparse unbounded corruptions. We direct the reader to Appendix A which presents convergence proofs for our algorithms in these settings.
Remark 4. We would like to point out that the design requirements made by our analyses are very mild when compared to existing literature. 
Indeed, the work of [4] assumes the Bouquet Model where distributions are restricted to be isotropic Gaussians, whereas the work of [5] assumes a more stringent model of sub-orthonormal matrices, something that even Gaussian designs do not satisfy. Our analyses, on the other hand, hold for the general class of sub-Gaussian distributions.

We now analyze the TORRENT-GD algorithm which performs cheaper, gradient-style updates on the active set. We will show that this method nevertheless enjoys a linear rate of convergence.
Theorem 5. Let the data settings be as stated in Theorem 3 and let Algorithm 3 be executed on this data with the thresholding parameter set to β ≥ α and the step length set to η = 1/Λ_{1−β}. If the data satisfies max{η√Λ_β, 1 − η·λ_{1−β}} ≤ 1/4, then after t = O(log((1/√n)·‖b‖_2/ε)) iterations, Algorithm 1 obtains an ε-accurate solution w^t, i.e. ‖w^t − w*‖_2 ≤ ε.
Similar to TORRENT-FC, the assumptions made by the TORRENT-GD algorithm are also satisfied by the class of sub-Gaussian distributions. The proof of Theorem 5, given in Appendix D, details these arguments. Given the convergence analyses for TORRENT-FC and GD, we now move on to provide a convergence analysis for the hybrid TORRENT-HYB algorithm which interleaves FC and GD steps. Since the exact interleaving adopted by the algorithm depends on the data, and is not known in advance, this poses a problem. We address this problem by giving below a uniform convergence guarantee, one that applies to every interleaving of the FC and GD update steps.
Theorem 6. Suppose Algorithm 4 is executed on data that allows Algorithms 2 and 3 a convergence rate of η_FC and η_GD respectively. Suppose we have 2·η_FC·η_GD < 1. Then for any interleaving of the FC and GD steps that the policy may enforce, after t = O(log((1/√n)·‖b‖_2/ε)) iterations, Algorithm 4 ensures an ε-optimal solution, i.e. ‖w^t − w*‖ ≤ ε.
We point out to the reader that the assumption made by Theorem 6, i.e. 2·η_FC·η_GD < 1, is readily satisfied by random sub-Gaussian designs, albeit at the cost of reducing the noise tolerance limit. As we shall see, TORRENT-HYB offers attractive convergence properties, merging the fast convergence rates of the FC step, as well as the speed and protection against overfitting provided by the GD step.

5 High-dimensional Robust Regression

In this section, we extend our approach to the robust high-dimensional sparse recovery setting. As before, we assume that the response vector y is obtained as y = X^⊤ w* + b, where ‖b‖_0 ≤ α·n. However, this time we also assume that w* is s*-sparse, i.e. ‖w*‖_0 ≤ s*. As before, we shall neglect white/dense noise for the sake of simplicity. We reiterate that it is not possible to use existing results from sparse recovery (such as [11, 12]) directly to solve this problem.
Our objective would be to recover a sparse model ŵ so that ‖ŵ − w*‖_2 ≤ ε. The challenge here is to forgo a sample complexity of n ≳ p and instead perform recovery with n ∼ s* log p samples alone. For this setting, we modify the FC update step of the TORRENT-FC method to the following:

w^{t+1} ← arg min_{‖w‖_0 ≤ s} Σ_{i∈S_t} (y_i − ⟨w, x_i⟩)²,    (3)

for some target sparsity level s ≪ p. We refer to this modified algorithm as TORRENT-HD. 
Assuming X satisfies the RSC/RSS properties (defined below), (3) can be solved efficiently using results from sparse recovery (for example the IHT algorithm [11, 14] analyzed in [12]).
Definition 7 (RSC and RSS Properties). A matrix X ∈ R^{p×n} will be said to satisfy the Restricted Strong Convexity Property (resp. Restricted Strong Smoothness Property) at level s = s_1 + s_2 with strong convexity constant α_{s_1+s_2} (resp. strong smoothness constant L_{s_1+s_2}) if the following holds for all ‖w_1‖_0 ≤ s_1 and ‖w_2‖_0 ≤ s_2:

α_s ‖w_1 − w_2‖_2² ≤ ‖X^⊤(w_1 − w_2)‖_2² ≤ L_s ‖w_1 − w_2‖_2²

For our results, we shall require the subset versions of both these properties.
Definition 8 (SRSC and SRSS Properties). A matrix X ∈ R^{p×n} will be said to satisfy the Subset Restricted Strong Convexity (resp. Subset Restricted Strong Smoothness) Property at level (γ, s) with strong convexity constant α_{(γ,s)} (resp. strong smoothness constant L_{(γ,s)}) if for all subsets S ∈ S_γ, the matrix X_S satisfies the RSC (resp. RSS) property at level s with constant α_s (resp. L_s).
We now state the convergence result for the TORRENT-HD algorithm.
Theorem 9. Let X ∈ R^{p×n} be the given data matrix and y = X^⊤ w* + b be the corrupted output with ‖w*‖_0 ≤ s* and ‖b‖_0 ≤ α·n. Let Σ_0 be an invertible matrix such that Σ_0^{−1/2} X satisfies the SRSC and SRSS properties at level (γ, 2s+s*) with constants α_{(γ,2s+s*)} and L_{(γ,2s+s*)} respectively (see Definition 8). 
Let Algorithm 2 be executed on this data with the TORRENT-HD update, thresholding parameter set to β ≥ α, and s ≥ 32·(L_{(1−β,2s+s*)}/α_{(1−β,2s+s*)}).
If X also satisfies 4·L_{(β,s+s*)}/α_{(1−β,s+s*)} < 1, then after t = O(log((1/√n)·‖b‖_2/ε)) iterations, Algorithm 2 obtains an ε-accurate solution w^t, i.e. ‖w^t − w*‖_2 ≤ ε. In particular, if X is sampled from a Gaussian distribution N(0, Σ) and n ≥ Ω(s*·(λ_max(Σ)/λ_min(Σ))·log p), then for all values of α ≤ β < 1/65, we can guarantee ‖w^t − w*‖_2 ≤ ε after t = O(log((1/√n)·‖b‖_2/ε)) iterations of the algorithm (w.p. ≥ 1 − 1/n^10).

Figure 1: (a), (b) and (c) Phase-transition diagrams depicting the recovery properties of the TORRENT-FC, TORRENT-HYB and L1 algorithms. The colors red and blue represent a high and low probability of success resp. A method is considered successful in an experiment if it recovers w* up to a 10^{−4} relative error. Both variants of TORRENT can be seen to recover w* in the presence of a larger number of corruptions than the L1 solver. (d) Variation in recovery error with the magnitude of corruption. As the corruption is increased, TORRENT-FC and TORRENT-HYB show improved performance while the problem becomes more difficult for the L1 solver.

Remark 5. The sample complexity required by Theorem 9 is identical to the one required by analyses for high dimensional sparse recovery [12], save constants. Also note that TORRENT-HD can tolerate 
Also note that TORRENT-HD can tolerate the same corruption index as TORRENT-FC.

6 Experiments

Several numerical simulations were carried out on linear regression problems in low-dimensional, as well as sparse high-dimensional, settings. The experiments show that TORRENT not only offers statistically better recovery properties as compared to L1-style approaches, but also that it can be more than an order of magnitude faster.

Data: For the low-dimensional setting, the regressor w* ∈ R^p was chosen to be a random unit norm vector. Data was sampled as x_i ∼ N(0, I_p) and response variables were generated as y*_i = ⟨w*, x_i⟩. The set of corrupted points S* was selected as a uniformly random (αn)-sized subset of [n] and the corruptions were set to b_i ∼ U(−5‖y*‖_∞, 5‖y*‖_∞) for i ∈ S*. The corrupted responses were then generated as y_i = y*_i + b_i + ε_i where ε_i ∼ N(0, σ²). For the sparse high-dimensional setting, supp(w*) was selected to be a random s*-sized subset of [p]. Phase-transition diagrams (Figure 1) were generated by repeating each experiment 100 times. For all other plots, each experiment was run over 20 random instances of the data and the plots depict the mean results.

Algorithms: We compared various variants of our algorithm TORRENT to the regularized L1 algorithm for robust regression [4, 5]. Note that the L1 problem can be written as min_z ‖z‖_1 s.t. Az = y, where A = [X^⊤ (1/λ) I_{m×m}] and z* = [w*^⊤ λb^⊤]^⊤. We used the Dual Augmented Lagrange Multiplier (DALM) L1 solver implemented by [15] to solve the L1 problem. We ran a finely tuned grid search over the λ parameter for the L1 solver and quote the best results obtained from the search. In the low-dimensional setting, we compared the recovery properties of TORRENT-FC (Algorithm 2) and TORRENT-HYB (Algorithm 4) with the DALM-L1 solver, while for the high-dimensional case, we compared TORRENT-HD against the DALM-L1 solver. Both the L1 solver and our methods were implemented in Matlab and were run on a single core 2.4 GHz machine with 8 GB RAM.

Choice of L1-solver: An extensive comparative study of various L1 minimization algorithms was performed by [15], who showed that the DALM and Homotopy solvers outperform other counterparts both in terms of recovery properties and timings. We extended their study to our observation model and found the DALM solver to be significantly better than the other L1 solvers; see Figure 3 in the appendix. We also observed, similar to [15], that the Approximate Message Passing (AMP) solver diverges on our problem, as the input matrix to the L1 solver is a non-Gaussian matrix A = [X^⊤ (1/λ) I].

Figure 2: In low-dimensional (a, b), as well as sparse high-dimensional (c, d) settings, TORRENT offers better recovery as the fraction of corrupted points α is varied. In terms of runtime, TORRENT is an order of magnitude faster than L1 solvers in both settings. In the low-dim.
setting, TORRENT-HYB is the fastest of all the variants.

Evaluation Metric: We measure the performance of the various algorithms using the standard L2 error: r_ŵ = ‖ŵ − w*‖_2. For the phase-transition plots (Figure 1), we deemed an algorithm successful on an instance if it obtained a model ŵ with error r_ŵ < 10^{−4} · ‖w*‖_2. We also measured the CPU time required by each of the methods, so as to compare their scalability.

6.1 Low Dimensional Results

Recovery Property: The phase-transition plots presented in Figure 1 represent our recovery experiments in graphical form. Both the fully-corrective and hybrid variants of TORRENT show better recovery properties than the L1-minimization approach, as indicated by the number of runs (out of 100) in which each algorithm was able to correctly recover w*. Figure 2 shows the variation in recovery error as a function of α in the presence of white noise and exhibits the superiority of TORRENT-FC and TORRENT-HYB over L1-DALM. Here again, TORRENT-FC and TORRENT-HYB achieve significantly lower recovery error than L1-DALM for all α ≤ 0.5. Figure 3 in the appendix shows that the variation of ‖ŵ − w*‖_2 with varying p, σ and n follows a similar trend, with TORRENT having significantly lower recovery error in comparison to the L1 approach.

Figure 1(d) brings out an interesting trend in the recovery properties of TORRENT. As we increase the magnitude of corruption from U(−‖y*‖_∞, ‖y*‖_∞) to U(−20‖y*‖_∞, 20‖y*‖_∞), the recovery error for TORRENT-HYB and TORRENT-FC decreases, as expected, since it becomes easier to identify the grossly corrupted points.
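This trend matches a simple residual-separation heuristic. The following toy sketch (ours, purely illustrative; the helper `residual_separation` and all of its constants are assumptions rather than the paper's experimental setup) fits ordinary least squares on all points, corruptions included, and measures how cleanly the corrupted residuals separate from the clean ones as the corruption magnitude grows:

```python
import random

def residual_separation(mag, n=200, frac=0.25, sigma=0.2, seed=1):
    """Generate y = 2*x + noise, corrupt a frac fraction of responses
    by +/- mag, fit least squares on ALL points, and return the
    smallest corrupted residual divided by the largest clean residual.
    A ratio above 1 means one thresholding pass isolates every
    corruption."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [2.0 * xi + rng.gauss(0, sigma) for xi in x]
    k = int(frac * n)
    for i in range(k):                       # corrupt the first k points
        y[i] += rng.choice([-1.0, 1.0]) * mag
    w = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    res = [abs(y[i] - x[i] * w) for i in range(n)]
    return min(res[:k]) / max(res[k:])

# The separation ratio grows with the corruption magnitude: grossly
# corrupted points are the easiest ones for thresholding to find.
print(residual_separation(mag=1.0), residual_separation(mag=20.0))
```

Only the qualitative trend is meaningful here; the actual experiments in this section use the uniform corruption model described under "Data" above.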
However, the L1 solver was unable to exploit this and in fact exhibited an increase in recovery error.

Run Time: In order to ascertain the recovery properties of TORRENT on ill-conditioned problems, we performed an experiment where data was sampled as x_i ∼ N(0, Σ) with diag(Σ) ∼ U(0, 5). Figure 2 plots the recovery error as a function of time. TORRENT-HYB was able to correctly recover w* about 50× faster than L1-DALM, which spent a considerable amount of time pre-processing the data matrix X. Even after being allowed to run for 500 iterations, the L1 algorithm was unable to reach the desired residual error of 10^{−4}. Figure 2 also shows that our TORRENT-HYB algorithm is able to converge to the optimal solution much faster than TORRENT-FC or TORRENT-GD. This is because TORRENT-FC solves a least squares problem at each step and thus, even though it requires significantly fewer iterations to converge, each iteration in itself is very expensive. While each iteration of TORRENT-GD is cheap, it is still limited by the slow O((1 − 1/κ)^t) convergence rate of the gradient descent algorithm, where κ is the condition number of the covariance matrix. TORRENT-HYB, on the other hand, is able to combine the strengths of both methods to achieve faster convergence.

6.2 High Dimensional Results

Recovery Property: Figure 2 shows the variation in recovery error in the high-dimensional setting as the number of corrupted points was varied. For these experiments, n was set to 5s* log(p) and the fraction of corrupted points α was varied from 0.1 to 0.7.
While L1-DALM fails to recover w* for α > 0.5, TORRENT-HD offers perfect recovery even for α values up to 0.7.

Run Time: Figure 2 shows the variation in recovery error as a function of run time in this setting. L1-DALM was found to be an order of magnitude slower than TORRENT-HD, making it infeasible for sparse high-dimensional settings. One key reason for this is that the L1-DALM solver is significantly slower in identifying the set of clean points. For instance, whereas TORRENT-HD was able to identify the clean set of points in only 5 iterations, it took L1 around 250 iterations to do the same.

References

[1] Christoph Studer, Patrick Kuppinger, Graeme Pope, and Helmut Bölcskei. Recovery of Sparsely Corrupted Signals. IEEE Transactions on Information Theory, 58(5):3115–3130, 2012.

[2] Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier Detection. John Wiley and Sons, 1987.

[3] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust Face Recognition via Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.

[4] John Wright and Yi Ma. Dense Error Correction via ℓ1 Minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.

[5] Nam H. Nguyen and Trac D. Tran. Exact recoverability from dense corrupted observations via L1 minimization.
IEEE Transactions on Information Theory, 59(4):2036–2058, 2013.

[6] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust Sparse Regression under Adversarial Corruption. In 30th International Conference on Machine Learning (ICML), 2013.

[7] Brian McWilliams, Gabriel Krummenacher, Mario Lucic, and Joachim M. Buhmann. Fast and Robust Least Squares Estimation in Corrupted Linear Models. In 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.

[8] Adrien-Marie Legendre (1805). On the Method of Least Squares. In D.E. Smith, editor, A Source Book in Mathematics (translated from the French), pages 576–579. New York: Dover Publications, 1959.

[9] Peter J. Rousseeuw. Least Median of Squares Regression. Journal of the American Statistical Association, 79(388):871–880, 1984.

[10] Peter J. Rousseeuw and Katrien Driessen. Computing LTS Regression for Large Data Sets. Journal of Data Mining and Knowledge Discovery, 12(1):29–45, 2006.

[11] Thomas Blumensath and Mike E. Davies. Iterative Hard Thresholding for Compressed Sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[12] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On Iterative Hard Thresholding Methods for High-dimensional M-Estimation. In 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.

[13] Yiyuan She and Art B. Owen. Outlier Detection Using Nonconvex Penalized Regression. arXiv:1006.2592 [stat.ME].

[14] Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In 26th International Conference on Machine Learning (ICML), 2009.

[15] Allen Y. Yang, Arvind Ganesh, Zihan Zhou, Shankar Sastry, and Yi Ma. A Review of Fast ℓ1-Minimization Algorithms for Robust Face Recognition. CoRR abs/1007.3753, 2012.

[16] Beatrice Laurent and Pascal Massart.
Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

[17] Thomas Blumensath. Sampling and reconstructing signals from a union of linear subspaces. IEEE Transactions on Information Theory, 57(7):4660–4671, 2011.

[18] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5, pages 210–268. Cambridge University Press, 2012.