{"title": "Fast and Robust Least Squares Estimation in Corrupted Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 415, "page_last": 423, "abstract": "Subsampling methods have been recently proposed to speed up least squares estimation in large scale settings. However, these algorithms are typically not robust to outliers or corruptions in the observed covariates. The concept of influence that was developed for regression diagnostics can be used to detect such corrupted observations as shown in this paper. This property of influence -- for which we also develop a randomized approximation -- motivates our proposed subsampling algorithm for large scale corrupted linear regression which limits the influence of data points since highly influential points contribute most to the residual error. Under a general model of corrupted observations, we show theoretically and empirically on a variety of simulated and real datasets that our algorithm improves over the current state-of-the-art approximation schemes for ordinary least squares.", "full_text": "Fast and Robust Least Squares Estimation in\n\nCorrupted Linear Models\n\nBrian McWilliams\u21e4 Gabriel Krummenacher\u21e4 Mario Lucic\n\nJoachim M. Buhmann\n\n{mcbrian,gabriel.krummenacher,lucic,jbuhmann}@inf.ethz.ch\n\nDepartment of Computer Science\n\nETH Z\u00a8urich, Switzerland\n\nAbstract\n\nSubsampling methods have been recently proposed to speed up least squares esti-\nmation in large scale settings. However, these algorithms are typically not robust\nto outliers or corruptions in the observed covariates.\nThe concept of in\ufb02uence that was developed for regression diagnostics can be\nused to detect such corrupted observations as shown in this paper. 
This property of influence – for which we also develop a randomized approximation – motivates our proposed subsampling algorithm for large scale corrupted linear regression, which limits the influence of data points since highly influential points contribute most to the residual error. Under a general model of corrupted observations, we show theoretically and empirically on a variety of simulated and real datasets that our algorithm improves over the current state-of-the-art approximation schemes for ordinary least squares.

1 Introduction

To improve scalability of the widely used ordinary least squares algorithm, a number of randomized approximation algorithms have recently been proposed. These methods, based on subsampling the dataset, reduce the computational time from O(np²) to o(np²)¹ [14]. Most of these algorithms are concerned with the classical fixed design setting, or the case where the data is assumed to be sampled i.i.d., typically from a sub-Gaussian distribution [7]. This is known to be an unrealistic modelling assumption since real-world data are rarely well-behaved in the sense of the underlying distributions.
We relax this limiting assumption by considering the setting where, with some probability, the observed covariates are corrupted with additive noise. This scenario corresponds to a generalised version of the classical problem of "errors-in-variables" in regression analysis, which has recently been considered in the context of sparse estimation [12]. This corrupted observation model poses a more realistic model of real data, which may be subject to many different sources of measurement noise or heterogeneity in the dataset.
A key consideration for sampling is to ensure that the points used for estimation are typical of the full dataset. Typicality requires the sampling distribution to be robust against outliers and corrupted points. In the i.i.d. sub-Gaussian setting, outliers are rare and can often easily be identified by examining the statistical leverage scores of the datapoints.
Crucially, in the corrupted observation setting described in §2, the concept of an outlying point concerns the relationship between the observed predictors and the response. Now, leverage alone cannot detect the presence of corruptions. Consequently, without using additional knowledge about the corrupted points, the OLS estimator (and its subsampled approximations) is biased. This also rules out stochastic gradient descent (SGD) – which is often used for large scale regression – since the convex cost functions and regularizers typically used for noisy data are not robust with respect to measurement corruptions.
This setting motivates our use of influence – the effective impact an individual datapoint exerts on the overall estimate – in order to detect and therefore avoid sampling corrupted points. We propose an algorithm which is robust to corrupted observations and exhibits reduced bias compared with other subsampling estimators.

*Authors contributed equally.
¹Informally: f(n) = o(g(n)) means f(n) grows more slowly than g(n).

Outline and Contributions. In §2 we introduce our corrupted observation model before reviewing the basic concepts of statistical leverage and influence in §3. In §4 we briefly review two subsampling approaches to approximating least squares, based on structured random projections and leverage weighted importance sampling. Based on these ideas we present influence weighted subsampling (IWS-LS), a novel randomized least squares algorithm based on subsampling points with small influence, in §5.
In §6 we analyse IWS-LS in the general setting where the observed predictors can be corrupted with additive sub-Gaussian noise. 
Comparing the IWS-LS estimate with that of OLS and other randomized least squares approaches, we show a reduction in both bias and variance. It is important to note that the simultaneous reduction in bias and variance is relative to OLS and randomized approximations, which are only unbiased in the non-corrupted setting. Our results rely on novel finite sample characteristics of leverage and influence, which we defer to §SI.3. Additionally, in §SI.4 we prove an estimation error bound for IWS-LS in the standard sub-Gaussian model.
Computing influence exactly is not practical in large-scale applications, and so we propose two randomized approximation algorithms based on the randomized leverage approximation of [8]. Both of these algorithms run in o(np²) time, which improves scalability in large problems. Finally, in §7 we present an extensive experimental evaluation which compares the performance of our algorithms against several randomized least squares methods on a variety of simulated and real datasets.

2 Statistical model

In this work we consider a variant of the standard linear model

y = Xβ + ε,    (1)

where ε ∈ R^n is a noise term independent of X ∈ R^{n×p}. However, rather than directly observing X we instead observe Z, where

Z = X + UW,    (2)

U = diag(u_1, ..., u_n) and u_i is a Bernoulli random variable with probability π of being 1. W ∈ R^{n×p} is a matrix of measurement corruptions. The rows of Z therefore are corrupted with probability π and not corrupted with probability (1 − π).

Definition 1 (Sub-Gaussian matrix). A zero-mean matrix X is called sub-Gaussian with parameter (1/n σ_x², 1/n Σ_x) if (a) each row x_i^T ∈ R^p is sampled independently and has E[x_i x_i^T] = 1/n Σ_x, and (b) for any unit vector v ∈ R^p, v^T x_i is a sub-Gaussian random variable with parameter at most 1/√n σ_x.

We consider the specific instance of the linear corrupted observation model in Eqs. (1), (2) where
• X, W ∈ R^{n×p} are sub-Gaussian with parameters (1/n σ_x², 1/n Σ_x) and (1/n σ_w², 1/n Σ_w) respectively,
• ε ∈ R^n is sub-Gaussian with parameters (1/n σ_ε², 1/n σ_ε² I_n),
and all are independent of each other.
The key challenge is that even when π and the magnitude of the corruptions, σ_w, are relatively small, the standard linear regression estimate is biased and can perform poorly (see §6). Sampling methods which are not sensitive to corruptions in the observations can perform even worse if they somehow subsample a proportion rn > πn of corrupted points. Furthermore, the corruptions may not be large enough to be detected via leverage based techniques alone.
The model described in this section generalises the "errors-in-variables" model from classical least squares modelling. Recently, similar models have been studied in the high dimensional (p ≫ n) setting in [4–6, 12] in the context of robust sparse estimation. The "low-dimensional" (n > p) setting is investigated in [4], but the "big data" setting (n ≫ p) has not been considered so far.²
In the high-dimensional problem, knowledge of the corruption covariance, Σ_w [12], or the data covariance, Σ_x [5], is required to obtain a consistent estimate. This assumption may be unrealistic in many settings. We aim to reduce the bias in our estimates without requiring knowledge of the true covariance of the data or the corruptions, and instead sub-sample only non-corrupted points.

3 Diagnostics for linear regression

In practice, the sub-Gaussian linear model assumption is often violated either by heterogeneous noise or by a corruption model as in §2. 
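The corrupted observation model of Eqs. (1)-(2) is easy to simulate. The sketch below is our own illustration (dimensions, seed and noise levels are arbitrary and are not the paper's experimental settings); it shows how even a modest corruption probability π biases the plain OLS fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, pi, sigma_w = 5000, 10, 0.1, 2.0       # illustrative sizes, not the paper's

beta = rng.standard_normal(p)                # true coefficients
X = rng.standard_normal((n, p))              # clean covariates
y = X @ beta + 0.1 * rng.standard_normal(n)  # y = X beta + eps (Eq. 1)

# Z = X + U W (Eq. 2): each row corrupted independently with probability pi
u = rng.random(n) < pi
W = sigma_w * rng.standard_normal((n, p))
Z = X + u[:, None] * W

beta_ols = np.linalg.lstsq(Z, y, rcond=None)[0]    # OLS on corrupted Z
beta_clean = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS on clean X, for reference
print(np.linalg.norm(beta_ols - beta), np.linalg.norm(beta_clean - beta))
```

The first error stays bounded away from zero as n grows, while the second vanishes: this is the attenuation bias that the analysis in §6 quantifies.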
In such scenarios, fitting a least squares model to the full dataset is unwise, since the outlying or corrupted points can have a large adverse effect on the model fit. Regression diagnostics have been developed in the statistics literature to detect such points (see e.g. [2] for a comprehensive overview). Recently, [14] proposed subsampling points for least squares based on their leverage scores. Other recent works suggest related influence measures that identify subspace [16] and multi-view [15] clusters in high dimensional data.

3.1 Statistical leverage

For the standard linear model in Eq. (1), the well known least squares solution is

β̂ = arg min_β ‖y − Xβ‖² = (X^T X)^{−1} X^T y.    (3)

The projection matrix I − L, with L := X(X^T X)^{−1} X^T, specifies the subspace in which the residual lies. The diagonal elements of the "hat matrix" L, l_i := L_ii, i = 1, ..., n, are the statistical leverage scores of the ith sample. Leverage scores quantify to what extent a particular sample is an outlier with respect to the distribution of X.
An equivalent definition from [14], which will be useful later, concerns any matrix U ∈ R^{n×p} which spans the column space of X (for example, the matrix whose columns are the left singular vectors of X). The statistical leverage scores of the rows of X are the squared row norms of U, i.e. l_i = ‖U_i‖².
Although the use of leverage can be motivated from the least squares solution in Eq. (3), the leverage scores do not take into account the relationship between the predictor variables and the response variable y. Therefore, low-leverage points may have a weak predictive relationship with the response and vice-versa. 
In other words, it is possible for such points to be outliers with respect to the conditional distribution P(y|X) but not the marginal distribution on X.

3.2 Influence

A concept that captures the predictive relationship between covariates and response is influence. Influential points are those that might not be outliers in the geometric sense, but instead adversely affect the estimated coefficients.
One way to assess the influence of a point is to compute the change in the learned model when the point is removed from the estimation step [2]. We can compute a leave-one-out least squares estimator by straightforward application of the Sherman-Morrison-Woodbury formula (see Prop. 3 in §SI.3):

β̂_{−i} = (X^T X − x_i^T x_i)^{−1} (X^T y − x_i^T y_i) = β̂ − (X^T X)^{−1} x_i^T e_i / (1 − l_i),

where e_i = y_i − x_i β̂_OLS. Defining the influence³, d_i, as the change in expected mean squared error, we have

d_i = (β̂ − β̂_{−i})^T (X^T X) (β̂ − β̂_{−i}) = e_i² l_i / (1 − l_i)².

Points with large values of d_i are those which, if added to the model, have the largest adverse effect on the resulting estimate. Since influence only depends on the OLS residual error and the leverage scores, the influence of every point can be computed at the cost of a least squares fit. In the next section we will see how to approximate both quantities using random projections.

²Unlike [5, 12] and others we do not consider sparsity in our solution since n ≫ p.
³The expression we use is also called Cook's distance [2].

4 Fast randomized least squares algorithms

We briefly review two randomized approaches to least squares approximation: the importance weighted subsampling approach of [9] and the dimensionality reduction approach of [14]. The former proposes an importance sampling probability distribution according to which a small number of rows of X and y are drawn and used to compute the regression coefficients. If the sampling probabilities are proportional to the statistical leverages, the resulting estimator is close to the optimal estimator [9]. We refer to this as LEV-LS.
The dimensionality reduction approach can be viewed as a random projection step followed by uniform subsampling. The class of Johnson-Lindenstrauss projections – e.g. the SRHT – has been shown to approximately uniformize leverage scores in the projected space. Uniformly subsampling the rows of the projected matrix proves to be equivalent to leverage weighted sampling on the original dataset [14]. We refer to this as SRHT-LS. It is analysed in the statistical setting by [7], who also propose ULURU, a two step fitting procedure which aims to correct for the subsampling bias and consequently converges to the OLS estimate at a rate independent of the number of subsamples [7].

Subsampled Randomized Hadamard Transform (SRHT). The SRHT consists of a preconditioning step after which n_subs rows of the new matrix are subsampled uniformly at random in the following way: √(n/n_subs) SHD · X = ΠX, with the definitions [3]:
• S is a subsampling matrix.
• D is a diagonal matrix whose entries are drawn independently from {−1, 1}.
• H ∈ R^{n×n} is a normalized Walsh-Hadamard matrix⁴, which is defined recursively as
H_n = [ H_{n/2}  H_{n/2} ; H_{n/2}  −H_{n/2} ],  H_2 = [ +1 +1 ; +1 −1 ].
We set H = (1/√n) H_n so it has orthonormal columns.
As a result, the rows of the transformed matrix ΠX have approximately uniform leverage scores (see [17] for detailed analysis of the SRHT). 
Due to the recursive nature of H, the cost of applying the SRHT is O(np log n_subs) operations, where n_subs is the number of rows sampled from X [1].
The SRHT-LS algorithm solves β̂_SRHT = arg min_β ‖Πy − ΠXβ‖², which for an appropriate subsampling ratio, r = Ω(p²/ρ²), results in a residual error, ẽ, which satisfies

‖ẽ‖ ≤ (1 + ρ)‖e‖,    (4)

where e = y − Xβ̂_OLS is the vector of OLS residual errors [14].

⁴For the Hadamard transform, n must be a power of two, but other transforms exist (e.g. DCT, DFT) for which similar theoretical guarantees hold and there is no restriction on n.

Randomized leverage computation. Recently, a method based on random projections has been proposed to approximate the leverage scores, based on first reducing the dimensionality of the data using the SRHT and then computing the leverage scores using this low-dimensional approximation [8–10, 13]. The leverage approximation algorithm of [8] uses an SRHT, Π₁ ∈ R^{r₁×n}, to first compute the approximate SVD of X, Π₁X = U_{ΠX} Σ_{ΠX} V_{ΠX}^T, followed by a second SRHT, Π₂ ∈ R^{p×r₂}, to compute an approximate orthogonal basis for X:

R^{−1} = V_{ΠX} Σ_{ΠX}^{−1} ∈ R^{p×p},  Ũ = X R^{−1} Π₂ ∈ R^{n×r₂}.    (5)

The approximate leverage scores are now the squared row norms of Ũ, l̃_i = ‖Ũ_i‖². From [14] we derive the following result relating to the randomized approximation of the leverage scores:

l̃_i ≤ (1 + ρ_l) l_i,    (6)

where the approximation error, ρ_l, depends on the choice of projection dimensions r₁ and r₂.
The leverage weighted least squares (LEV-LS) algorithm samples rows of X and y with probability proportional to l_i (or l̃_i in the approximate case) and performs least squares on this subsample. The residual error resulting from leverage weighted least squares is bounded by Eq. (4), implying that LEV-LS and SRHT-LS are equivalent [14]. It is important to note that under the corrupted observation model these approximations will be biased.

5 Influence weighted subsampling

In the corrupted observation model, OLS and therefore the random approximations to OLS described in §4 obtain poor predictions. To remedy this, we propose influence weighted subsampling (IWS-LS), which is described in Algorithm 1. IWS-LS subsamples points according to the distribution P_i = c/d_i, where c is a normalizing constant so that Σ_{i=1}^n P_i = 1. OLS is then estimated on the subsampled points. The sampling procedure ensures that points with high influence are selected infrequently, and so the resulting estimate is less biased than the full OLS solution. Several approaches similar in spirit have previously been proposed based on identifying and down-weighting the effect of highly influential observations [19].
Obviously, IWS-LS is impractical in the scenarios we consider since it requires the OLS residuals and full leverage scores. However, we use it as a baseline and to simplify the analysis. In the next section, we propose an approximate influence weighted subsampling algorithm which combines the approximate leverage computation of [8] and the randomized least squares approach of [14].

Algorithm 1 Influence weighted subsampling (IWS-LS)
Input: Data: Z, y
1: Solve β̂_OLS = arg min_β ‖y − Zβ‖²
2: for i = 1 . . . n do
3:   e_i = y_i − z_i β̂_OLS
4:   l_i = z_i^T (Z^T Z)^{−1} z_i
5:   d_i = e_i² l_i / (1 − l_i)²
6: end for
7: Sample rows (Z̃, ỹ) of (Z, y) proportional to 1/d_i
8: Solve β̂_IWS = arg min_β ‖ỹ − Z̃β‖²
Output: β̂_IWS

Algorithm 2 Residual weighted subsampling (aRWS-LS)
Input: Data: Z, y
1: Solve β̂_SRHT = arg min_β ‖Π · (y − Zβ)‖²
2: Estimate residuals: ẽ = y − Zβ̂_SRHT
3: Sample rows (Z̃, ỹ) of (Z, y) proportional to 1/ẽ_i²
4: Solve β̂_RWS = arg min_β ‖ỹ − Z̃β‖²
Output: β̂_RWS

Randomized approximation algorithms. Using the ideas from §4 we obtain the following randomized approximation to the influence scores:

d̃_i = ẽ_i² l̃_i / (1 − l̃_i)²,    (7)

where ẽ_i is the ith residual error computed using the SRHT-LS estimator. Since the approximation errors of ẽ_i and l̃_i are bounded (inequalities (4) and (6)), this suggests that our randomized approximation to the influence is close to the true influence.

Basic approximation. The first approximation algorithm is identical to Algorithm 1, except that the leverage scores and residuals are replaced by their randomized approximations as in Eq. (7). We refer to this algorithm as approximate influence weighted subsampling (aIWS-LS). Full details are given in Algorithm 3 in §SI.2.

Residual weighted sampling. Leverage scores are typically uniform [7, 13] for sub-Gaussian data. Even in the corrupted setting, the difference in leverage scores between corrupted and non-corrupted points is small (see §6). Therefore, the main contribution to the influence of each point originates from the residual error, e_i². Consequently, we propose sampling with probability inversely proportional to the approximate squared residual, 1/ẽ_i². The resulting algorithm, residual weighted subsampling (aRWS-LS), is detailed in Algorithm 2. 
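Algorithm 1 translates directly into NumPy. The sketch below is our own rendering of the exact O(np²) baseline (the 1e-12 floor on d_i, added to avoid division by zero for points with vanishing residual, is our addition, not part of the algorithm):

```python
import numpy as np

def iws_ls(Z, y, n_subs, rng):
    """Influence weighted subsampling least squares (sketch of Algorithm 1)."""
    beta_ols, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ beta_ols                          # residuals e_i
    G_inv = np.linalg.inv(Z.T @ Z)
    lev = np.einsum("ij,jk,ik->i", Z, G_inv, Z)   # leverage l_i = z_i (Z^T Z)^{-1} z_i^T
    d = e**2 * lev / (1.0 - lev) ** 2             # influence d_i (Cook's distance form)
    w = 1.0 / np.maximum(d, 1e-12)                # sample with P_i proportional to 1/d_i
    rows = rng.choice(len(y), size=n_subs, replace=False, p=w / w.sum())
    beta_iws, *_ = np.linalg.lstsq(Z[rows], y[rows], rcond=None)
    return beta_iws
```

On data drawn from the corrupted model, the high-influence (mostly corrupted) rows are sampled with vanishing probability, so the subsampled fit is close to OLS on the clean rows.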
Although aRWS-LS is not guaranteed to be a good approximation to IWS-LS, empirical results suggest that it works well in practice and is faster to compute than aIWS-LS.

Computational complexity. Clearly, the computational complexity of IWS-LS is O(np²). The computational complexity of aIWS-LS is O(np log n_subs + npr₂ + n_subs p²), where the first term is the cost of SRHT-LS, the second term is the cost of the approximate leverage computation, and the last term solves OLS on the subsampled dataset. Here, r₂ is the dimension of the random projection detailed in Eq. (5). The cost of aRWS-LS is O(np log n_subs + np + n_subs p²), where the first term is the cost of SRHT-LS, the second term is the cost of computing the residuals ẽ, and the last term solves OLS on the subsampled dataset. This computation can be reduced to O(np log n_subs + n_subs p²). Therefore the cost of both aIWS-LS and aRWS-LS is o(np²).

6 Estimation error

In this section we prove an upper bound on the estimation error of IWS-LS in the corrupted model. First, we show that the OLS error consists of two additional variance terms that depend on the size and proportion of the corruptions, and an additional bias term. We then show that IWS-LS can significantly reduce the relative variance and bias in this setting, so that the error no longer depends on the magnitude of the corruptions but only on their proportion. We compare these results to recent results from [4, 12] suggesting that consistent estimation requires knowledge about Σ_w. More recently, [5] show that incomplete knowledge about this quantity results in a biased estimator where the bias is proportional to the uncertainty about Σ_w. We see that the form of our bound matches these results.
Inequalities are said to hold with high probability (w.h.p.) if the probability of failure is not more than C₁ exp(−C₂ log p), where C₁, C₂ are positive constants that do not depend on the scaling quantities n, p, σ_w. The symbol ≲ 
means that we ignore constants that do not depend on these scaling quantities. Proofs are provided in the supplement. Unless otherwise stated, ‖·‖ denotes the ℓ2 norm for vectors and the spectral norm for matrices.

Corrupted observation model. As a baseline, we first investigate the behaviour of the OLS estimator in the corrupted model.

Theorem 1 (A bound on ‖β̂_OLS − β‖). If n ≳ (σ_x² σ_w² / λ_min(Σ_x)²) p log p, then w.h.p.

‖β̂_OLS − β‖ ≲ ( (σ_ε σ_x + π σ_ε σ_w + π(σ_w² + σ_w σ_x)‖β‖) √(p log p / n) + π² σ_w √p ‖β‖ ) · (1/γ),    (8)

where 0 < γ ≤ λ_min(Σ_x) + π λ_min(Σ_w).

Remark 1 (No corruptions case). Notice that for a fixed σ_w, taking lim_{π→0}, or for a fixed π, taking lim_{σ_w→0} (i.e. there are no corruptions), the above error reduces to the least squares result (see for example [4]).

Remark 2 (Variance and bias). The first three terms in (8) scale with √(1/n), so as n → ∞ these terms tend towards 0. The last term does not depend on √(1/n), and so for some non-zero π the least squares estimate will incur some bias depending on the fraction and magnitude of corruptions.

We are now ready to state our theorem characterising the mean squared error of the influence weighted subsampling estimator.

Theorem 2 (Influence sampling in the corrupted model). For n ≳ (σ_x² σ_w² / λ_min(Σ_x*)²) p log p we have

‖β̂_IWS − β‖ ≲ ( (σ_ε σ_x + π σ_ε/(σ_w + 1) + π‖β‖) √(p log p / n_subs) + π √p ‖β‖ ) · (1/γ),

where 0 < γ ≤ λ_min(Σ_x*) and Σ_x* is the covariance of the influence weighted subsampled data.

Figure 1: Comparison of the distribution of the influence and leverage for corrupted and non-corrupted points. The ℓ1 distance between the histograms is shown in brackets ((a) Influence: 1.1; (b) Leverage: 0.1).

Remark 3. 
Theorem 2 states that the influence weighted subsampling estimator removes the proportional dependence of the error on σ_w, so the additional variance terms scale as O(π/σ_w · √(p/n_subs)) and O(π √(p/n_subs)). The relative contribution of the bias term is π √p ‖β‖, compared with π² σ_w √p ‖β‖ for the OLS or non-influence-based subsampling methods.

Comparison with fully corrupted setting. We note that the bound in Theorem 1 is similar to the bound in [5] for an estimator where all data points are corrupted (i.e. π = 1) and where incomplete knowledge of the covariance matrix of the corruptions, Σ_w, is used. The additional bias in the estimator is proportional to the uncertainty in the estimate of Σ_w – in Theorem 1 this corresponds to σ_w². Unbiased estimation is possible if Σ_w is known. See the Supplementary Information for further discussion, where the relevant results from [5] are provided in Section SI.6.1 as Lemma 16.

7 Experimental results

We compare IWS-LS against the methods SRHT-LS [14] and ULURU [7]. These competing methods represent the current state-of-the-art in fast randomized least squares. Since SRHT-LS is equivalent to LEV-LS [9], the comparison will highlight the difference between importance sampling according to the two different types of regression diagnostic in the corrupted model. Similar to IWS-LS, ULURU is also a two-step procedure, where the first step is equivalent to SRHT-LS; the second reduces bias by subtracting the result of regressing onto the residual. The experiments with the corrupted data model will demonstrate the difference in robustness of IWS-LS and ULURU to corruptions in the observations. Note that we do not compare with SGD. 
Although SGD has excellent properties for large-scale linear regression, we are not aware of a convex loss function which is robust to the corruption model we propose.
We assess the empirical performance of our method compared with standard and state-of-the-art randomized approaches to linear regression in several different scenarios. We evaluate these methods on the basis of the estimation error: the ℓ2 norm of the difference between the true weights and the learned weights, ‖β̂ − β‖. We present additional results for root mean squared prediction error (RMSE) on the test set in §SI.7.
For all the experiments on simulated data sets we use n_train = 100,000, n_test = 1000, p = 500. For datasets of this size, computing exact leverage is impractical, and so we report results for IWS-LS in §SI.7. For aIWS-LS and aRWS-LS we used the same number of sub-samples to approximate the leverage scores and residuals as for solving the regression. For aIWS-LS we set r₂ = p/2 (see Eq. (5)). The results are averaged over 100 runs.

Corrupted data. We investigate the corrupted data noise model described in Eqs. (1)-(2). We show three scenarios where π ∈ {0.05, 0.1, 0.3}. X and W were sampled from independent, zero-mean Gaussians with standard deviation σ_x = 1 and σ_w = 0.4 respectively. The true regression coefficients, β, were sampled from a standard Gaussian. We added i.i.d. zero-mean Gaussian noise with standard deviation σ_ε = 0.1.
Figure 1 shows the difference in distribution of influence and leverage between non-corrupted points (top) and corrupted points (bottom) for a dataset with 30% corrupted points. The distribution of leverage is very similar between the corrupted and non-corrupted points, as quantified by the ℓ1 difference. 
This suggests that leverage alone cannot be used to identify corrupted points.

Figure 2: Comparison of mean estimation error and standard deviation on two corrupted simulated datasets and the airline delay dataset ((a) 5% corruptions; (b) 30% corruptions; (c) airline delay).

On the other hand, although there are some corrupted points with small influence, they typically have a much larger influence than non-corrupted points. We give a theoretical explanation of this phenomenon in §SI.3 (Remarks 4 and 5).
Figures 2(a) and (b) show the estimation error and the mean squared prediction error for different subsample sizes. In this setting, computing IWS-LS is impractical (due to the exact leverage computation), so we omit those results, but we notice that aIWS-LS and aRWS-LS quickly improve over the full least squares solution and the other randomized approximations in all simulation settings. In all cases, influence based methods also achieve lower-variance estimates.
For 30% corruptions and a small number of samples, ULURU outperforms the other subsampling methods. However, as the number of samples increases, influence based methods start to outperform OLS. Here, ULURU converges quickly to the OLS solution but is not able to overcome the bias introduced by the corrupted datapoints. Results for 10% corruptions are shown in Figs. 5 and 6, and we provide results on smaller corrupted datasets (to show the performance of IWS-LS), as well as non-corrupted data simulated according to [13], in §SI.7.

Airline delay dataset. The dataset consists of details of all commercial flights in the USA over 20 years. The dataset, along with visualisations, is available from http://stat-computing.org/dataexpo/2009/. Selecting the first n_train = 13,000 US Airways flights from January 2000 (corresponding to approximately 1.5 weeks), our goal is to predict the delay time of the next n_test = 5,000 US Airways flights. 
The features in this dataset consist of a binary vector representing origin-destination pairs and a real value representing distance (p = 170).
The dataset might be expected to violate the usual i.i.d. sub-Gaussian design assumption of standard linear regression, since the lengths of delays are often very different depending on the day. For example, delays may be longer due to public holidays or on weekends. Of course, such regular events could be accounted for in the modelling step, but some unpredictable outliers such as weather delay may also occur. Results are presented in Figure 2(c); the RMSE is the error in predicted delay time in minutes. Since the dataset is smaller, we can run IWS-LS to observe the accuracy of aIWS-LS and aRWS-LS in comparison. For more than 3000 samples, these algorithms outperform OLS and quickly approach IWS-LS. The result suggests that the corrupted observation model is a good model for this dataset. Furthermore, ULURU is unable to achieve the full accuracy of the OLS solution.

8 Conclusions

We have demonstrated, theoretically and empirically under the generalised corrupted observation model, that influence weighted subsampling is able to significantly reduce both the bias and variance compared with the OLS estimator and other randomized approximations which do not take influence into account. Importantly, our fast approximation, aRWS-LS, performs similarly to IWS-LS. We find that ULURU quickly converges to the OLS estimate, although it is not able to overcome the bias induced by the corrupted datapoints despite its two-step procedure. The performance of IWS-LS relative to OLS in the airline delay problem suggests that the corrupted observation model is a more realistic modelling scenario than the standard sub-Gaussian design model for some tasks. Software is available at http://people.inf.ethz.ch/kgabriel/software.html.

Acknowledgements. 
We thank David Balduzzi, Cheng Soon Ong and the anonymous reviewers for invaluable discussions, suggestions and comments.

References
[1] Nir Ailon and Edo Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. In 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1–9, 2008.
[2] David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, 1981.
[3] Christos Boutsidis and Alex Gittens. Improved matrix algorithms via the Subsampled Randomized Hadamard Transform. 2012. arXiv:1204.0062v4 [cs.DS].
[4] Yudong Chen and Constantine Caramanis. Orthogonal Matching Pursuit with noisy and missing data: low and high dimensional results. June 2012. arXiv:1206.0823.
[5] Yudong Chen and Constantine Caramanis. Noisy and missing data regression: distribution-oblivious support recovery. In International Conference on Machine Learning, 2013.
[6] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, 2013.
[7] P. Dhillon, Y. Lu, D. P. Foster, and L. Ungar. New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems, 2013.
[8] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. September 2011. arXiv:1109.3843v2 [cs.DS].
[9] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '06, pages 1127–1136, New York, NY, USA, 2006. ACM.
[10] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
[11] Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17(52):1–6, 2012.
[12] Po-Ling Loh and Martin J. Wainwright. High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, June 2012.
[13] Ping Ma, Michael W. Mahoney, and Bin Yu. A statistical perspective on algorithmic leveraging. In Proceedings of the International Conference on Machine Learning, 2014.
[14] Michael W. Mahoney. Randomized algorithms for matrices and data. April 2011. arXiv:1104.5557v3 [cs.DS].
[15] Brian McWilliams and Giovanni Montana. Multi-view predictive partitioning in high dimensions. Statistical Analysis and Data Mining, 5(4):304–321, 2012.
[16] Brian McWilliams and Giovanni Montana. Subspace clustering of high-dimensional data: a predictive approach. Data Mining and Knowledge Discovery, 28:736–772, 2014.
[17] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. November 2010. arXiv:1011.1595v4 [math.NA].
[18] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. November 2010. arXiv:1011.3027.
[19] Roy E. Welsch. Regression sensitivity analysis and bounded-influence estimation. In Evaluation of Econometric Models, pages 153–167. Academic Press, 1980.
", "award": [], "sourceid": 282, "authors": [{"given_name": "Brian", "family_name": "McWilliams", "institution": "ETH Zurich"}, {"given_name": "Gabriel", "family_name": "Krummenacher", "institution": "ETH Zurich"}, {"given_name": "Mario", "family_name": "Lucic", "institution": "ETH Zurich"}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": "ETH Zurich"}]}