
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The idea of this paper is to use hard thresholding to remove potential outliers for robust regression. It seems a simple but interesting idea. One suggestion is that one might use a varying sequence of values for the parameter beta. For example, at the beginning, the estimate w^t is less accurate, and thus we might treat fewer samples as outliers; as the algorithm proceeds and w^t becomes more accurate for large t, it might be acceptable to reject more outliers.
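To make the suggestion concrete, here is a minimal sketch of such a schedule (the linear schedule, function name, and all parameter values are illustrative assumptions, not the paper's algorithm):

```python
import numpy as np

def hard_threshold_with_schedule(X, y, n_iters=30, beta_start=0.05, beta_end=0.4):
    """Hard-thresholding regression with a growing outlier budget beta_t.

    Few samples are rejected early (when w^t is still inaccurate) and
    more are rejected as w^t improves; purely illustrative sketch.
    """
    n, p = X.shape
    w = np.zeros(p)
    for t in range(n_iters):
        # linearly increase the assumed outlier fraction beta_t
        beta_t = beta_start + (beta_end - beta_start) * t / max(n_iters - 1, 1)
        r = np.abs(y - X @ w)                            # residuals under current w
        keep = np.argsort(r)[: int((1 - beta_t) * n)]    # smallest residuals = "clean"
        w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # refit on clean set
    return w
```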
One difficulty with the proposed algorithm is how to choose the parameters beta and epsilon. More discussion of this should be added to the experiments.
Q2: Please summarize your review in 1-2 sentences
The paper considers a robust regression algorithm that applies hard thresholding to remove potential outliers.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Overall, this paper seemed to be well-written and clearly presented. However, the point about dealing with an additive noise perturbation in addition to sparse corruption is not quite clear. The main result here seems to be contained in Theorem 10 in Appendix A. But as Corollary 11 points out, the error bound provided by the theorem reduces to a constant plus epsilon in the case of Gaussian noise; i.e., the algorithm is not guaranteed to provide a consistent estimate of w^*. This seems to be a weaker conclusion than the results in, say, [5] or [6]. Does that mean the latter methods are preferable in the case of noisy observations? It would be good if the authors could clarify this point.
Q2: Please summarize your review in 1-2 sentences
The results in this paper seem quite novel and relevant. The technical details also appear to be sound.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes and analyzes a new algorithm for the robust regression problem. Robust regression is an old problem, related to outlier detection.
Introduction 
The robust regression problem is defined in Eq. (2). I think the authors should slightly rephrase their definition, as Eq. (2) is only one possible "modern" definition of the robust least squares problem.
The authors also claim that robust regression and sparse regression are "completely different". The only difference is the sparsity assumption on the coefficient vector w. Robust regression has a much more specific model for the noise, but can also be a sparse regression (as studied in Sec. 5). Again, I think the authors should rephrase this part to avoid misunderstanding.
My major remark here is the lack of a state-of-the-art review on robust regression/outlier detection. Only very recent works are mentioned, but this is a very old problem, and some classical methods such as L1 regression (in the sense of $\min_w \|y - Xw\|_1$) should be mentioned here and used as a reference.
Sec. 2 
The problem formulation is clear, and the noise is specified by two terms: a classical Gaussian noise term, and an unbounded "sparse" noise term.
However, this formulation is not the only one possible. A reference to a Bayesian model for the noise in robust regression would be welcome here, in order to give a reasonably complete view of robust regression.
Moreover, I think that the authors could clearly reformulate the robust least squares regression problem as an optimization problem such as (there are other possibilities)
$\min_{w,b} \frac{1}{2}\|y - Xw - b\|_2^2 + \lambda \|b\|_0$
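For illustration, the two-term noise model described above might be generated as follows (sizes and magnitudes are arbitrary, illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha, sigma = 500, 20, 0.1, 0.1       # alpha = fraction of corrupted responses

X = rng.standard_normal((n, p))              # covariates
w_star = rng.standard_normal(p)              # true regressor
eps = sigma * rng.standard_normal(n)         # classical Gaussian noise

b = np.zeros(n)                              # unbounded "sparse" noise
outliers = rng.choice(n, size=int(alpha * n), replace=False)
b[outliers] = 100.0 * rng.standard_normal(outliers.size)

y = X @ w_star + b + eps                     # corrupted responses
```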
Sec. 3 
This section introduces the "Torrent" algorithms studied in Sec. 4. These algorithms made me think of a kind of "matching pursuit": the selection of the active set is made directly on the residual instead of on its correlation with the matrix X, Torrent-FC being the "OMP" and Torrent-GD the "classical" MP.
The definition of hard thresholding suits the problem well, but is somewhat counterintuitive compared to the classical hard thresholding used for sparse regression.
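To illustrate the inversion, a sketch of the active-set selection on residuals (illustrative, not the paper's exact operator):

```python
import numpy as np

def ht_active_set(r, k):
    """Indices of the k smallest-magnitude residuals.

    Counterintuitive vs. sparse regression: classical hard thresholding
    keeps the k LARGEST entries of a vector, while here the k best-fitting
    points (smallest residuals) are kept as the "clean" active set.
    """
    return np.argsort(np.abs(r))[:k]
```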
Sec. 4 
This section presents the theoretical results for the proposed algorithm, and is thus the main theoretical contribution of the paper. I have no particular remarks on this part.
Sec. 5 
This section extends the previous results to the "robust sparse regression problem". Following my remark on Sec. 2, I think that at this point the authors should reformulate the problem as
$\min_{w,b} \frac{1}{2}\|y - Xw - b\|_2^2 + \lambda \|b\|_0 + \mu \|w\|_0$
Sec. 6 
The experimental part is OK, but the only method the authors use for comparison is L1-DALM, which solves only the sparse robust regression problem, not the robust regression problem without the sparsity assumption. So the comparison is not fair, as the goals of the two approaches are not the same. I also think that the authors must compare their algorithm to a classical L1-regression problem, which is the state of the art in robust regression (and also to an L1-L1 approach for the robust sparse regression problem).
Q2: Please summarize your review in 1-2 sentences
An OK paper on robust regression, with some theoretical results on the proposed algorithm, but with some gaps in the presentation of robust regression as well as no comparison to some classical approaches.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The presented work represents a significant contribution to the RLSR problem. The method, based on an alternating thresholding algorithm, is both clever and simple. The idea is to alternate, at each step, between estimating the set of "clean" points and updating the unknown parameter vector. The proposed original method (TORRENT-FC) guarantees exact recovery of the parameter vector as long as: the proportion of outliers is below a fraction $\alpha$ (the corruption index) of the total data points, where $\alpha$, whose value is data dependent, is not larger than 1/2; the subset strong convexity and subset strong smoothness properties (Definition 1) are satisfied; n >= p log p, where p is the dimension of the data space; and X is sampled from any sub-Gaussian distribution. However, the above method does not scale well to datasets with large p. This issue is overcome with a pair of modified versions, one based on gradient descent (TORRENT-GD) and the other a hybrid of both algorithms, the so-called TORRENT-HYB. Moreover, both ensure geometric convergence. Experimental comparison shows that the proposed algorithms outperform the best L1-based algorithms: they are significantly faster and converge to a solution for which the estimation error of the parameter vector is much lower.
TORRENT-HYB combines the best of TORRENT-FC and TORRENT-GD. Indeed, TORRENT-FC yields a large improvement of the solution at each step but is computationally more expensive, while TORRENT-GD executes each step efficiently, but its progress is slower.
Finally, the presented approach is extended to deal with high-dimensional data, leading to the TORRENT-HD algorithm. This algorithm is able to yield exact recovery even for corruption indices as large as 0.7.
The article is well founded; the proofs of all theorems are present in the paper and in the supplementary material, and are very clear. It is also a very well-written paper, and, as previously said, the proposed solution to this significant problem is very original.
Q2: Please summarize your review in 1-2 sentences
This is a very interesting paper, where the authors propose a new Robust Least Squares Regression algorithm that can handle up to 50% outliers under mild conditions, less restrictive than previous methods: the data matrix may be sampled from any sub-Gaussian distribution. The results achieved outperform those of state-of-the-art L1 solvers, both in estimation error and in convergence time.
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper considers the problem of Robust Least Squares Regression, where some of the responses of a linear model are corrupted by heavy noise or even adversarial manipulation, and others are uncorrupted or corrupted only by i.i.d. Gaussian noise. So the problem is a combination of detecting the corrupted components and performing linear regression.
The authors propose to use iterative hard thresholding in the sense that in each iteration a least squares problem is solved w.r.t. those data points for which the difference between response and linear prediction was smallest in the preceding step. Instead of solving the least squares problem exactly, a variant is proposed where the solution of the quadratic problem is approximated via a single gradient descent step. In addition, a hybrid method is presented which chooses one of these options depending on how much the active set changed in the last iteration.
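A compact sketch of the two update rules as just described (variable names, the step size, and the fixed iteration count are illustrative assumptions):

```python
import numpy as np

def iterative_ht_regression(X, y, beta, n_iters=50, step=None):
    """Sketch of the described procedure.

    Each iteration keeps the (1 - beta) * n points with smallest residuals
    and then either solves least squares on them exactly (step=None) or
    takes a single gradient descent step on the restricted objective.
    """
    n, p = X.shape
    k = int((1 - beta) * n)
    w = np.zeros(p)
    for _ in range(n_iters):
        r = y - X @ w
        S = np.argsort(np.abs(r))[:k]          # current active ("clean") set
        if step is None:                        # exact least squares update
            w, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)
        else:                                   # single gradient descent step
            w = w + step * X[S].T @ (y[S] - X[S] @ w)
    return w
```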
The idea of using iterative hard thresholding to find the "active set" of indices of the uncorrupted responses does not seem to be new; see, e.g., "Outlier detection using nonconvex penalized regression" by She & Owen (2011). However, the convergence results of Section 4 are new and interesting. It seems a bit unclear whether Theorems 3 and 4 implicitly assume that there exists a unique solution w^* or if there might be more than one and the algorithm finds one of them. Also, the constraint on n in Theorem 4 is given as an order of magnitude. However, the constants seem to matter, since the problem probably gets easier the larger n is compared to p.
In general, the paper is well written. Although mathematically correct, the phrasing in l. 90 that "\alpha < 1/2" suffices for exact recovery is a bit misleading, given that \alpha < 1/65 in Theorem 4.
Concerning the high-dimensional setting of Section 5, one should note that (3) cannot be solved globally optimally in general.
The L1 problem solved for comparison in Section 6 is $\min_w \|w\|_1 + \lambda \|X^\top w - y\|_1$. One could also use $\|w\|_2^2$ in the low-dimensional case. Also, the experiments could be improved by applying the algorithms to real-world data.
Minor issues: * Algorithm 2 should be called UPDATE-FC and Algorithm 3 UPDATE-GD
* In Section 2, the term "bounded noise" is used for the Gaussian noise and "potentially unbounded corruption" for the outliers. I am not sure this is good terminology: there is no bound on the magnitude of Gaussian noise either.
Q2: Please summarize your review in 1-2 sentences
The paper proposes iterative hard thresholding methods for Robust Least Squares Regression and provides conditions for exact recovery of the underlying model. Numerical experiments indicate advantages compared to L1 methods.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for their comments and suggestions. We will keep these in mind while revising the draft.
Reviewer 1 ==========
- The modification to the algorithm proposed by the reviewer is interesting and one that we have actually tried. However, we did not include it since we do not have a theoretical analysis of the resulting method. Moreover, the method seems difficult to implement in practice, since the rate at which beta is increased seems to have an effect on convergence rates.
- Several interesting modifications to the basic TORRENT framework are possible. In this paper we focused on a class of simple algorithms with few tunable parameters that have solid theoretical analyses. It is noteworthy that TORRENT-FC has a single hyperparameter beta. Moreover, as Theorem 3 indicates, our bounds hold whenever beta is larger than the true fraction of outliers, as long as there are enough clean responses.
Reviewer 2 ==========
- Comparison to classical approaches like $\min_w \|y - X^\top w\|_1$: We do indeed cite and compare against such formulations. In fact, the formulation in line no. 67 can easily be rewritten as $\min_w \|w\|_1 + \lambda \|y - X^\top w\|_1$ by replacing $b = y - X^\top w$. Note that for large values of lambda, this indeed minimizes $\|y - X^\top w\|_1$.
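To spell the substitution out (the exact line-67 formulation is not reproduced in these reviews, so the constrained form below is an assumed reconstruction):

$\min_{w,b} \|w\|_1 + \lambda \|b\|_1 \quad \text{s.t.} \quad y = X^\top w + b$

Eliminating $b$ through the constraint, i.e., setting $b = y - X^\top w$, gives exactly $\min_w \|w\|_1 + \lambda \|y - X^\top w\|_1$.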
- Experiments: comparison to classical L1 and L1-L1 approaches: As mentioned above, the L1-DALM approach that we compare against can be reformulated as the "classical L1" approach mentioned by the reviewer. Hence, we do indeed compare against the L1 and L1-L1 approaches that the reviewer mentions.
- L1-DALM only solves sparse regression: As mentioned above, L1-DALM solves $\min_w \|w\|_1 + \lambda \|y - X^\top w\|_1$. This is equivalent to the low-dimensional L1 problem $\min_w \|y - X^\top w\|_1$ for large lambda. To ensure fairness, we tuned lambda over a fine grid (see line 368). Also, we compared against the L2-regularized formulation ($\|w\|_2^2 + \lambda \|y - Xw\|_1$), but it did not change the results significantly.
- Comparison to L1 solvers other than L1-DALM: As mentioned on line 373, we indeed compared against other state-of-the-art L1 solvers (see Fig. 3(d)).
- Sparse regression usually refers to problems where the regressor is sparse but the responses are not highly corrupted, whereas robust regression refers to problems where responses/covariates might be corrupted. Our comment was meant to highlight this fact.
- The noise model used in Section 2 is the most widely used model in the literature that we could find. If the reviewer could point out other models, we would be happy to refer to them.
Reviewer 3 ========== We thank the reviewer for the kind comments.
Reviewer 4 ==========
- Indeed, our current analysis of the noise model with white noise and sparse corruptions does not guarantee a consistent estimate. In fact, due to adversarial corruptions, it seems unlikely that consistent estimates are even possible, since the adversary can add corruptions only to responses where the Gaussian noise is "negative". Hence, the effective noise would not be zero-mean, which precludes consistent estimates.
- The result of [5] ([5, Theorem 4]) seems to be similar to ours, as they also estimate the true w* up to the variance of the white noise. Moreover, the corruptions considered in [5] are not adaptive and need to be added independently of the data points and the white noise. The result of [6] does not even consider dense noise, and even then it offers very weak noise tolerance (alpha < 1/sqrt(p)).
Reviewer 5 ========== We thank the reviewer for the kind comments. We will reformat the plots to increase their clarity.
Reviewer 6 ==========
- We thank the reviewer for pointing out the reference (She and Owen, 2011) and will reference it appropriately.
- The results in our paper (Theorems 3 and 4) assume the noise model given in equation 2 (the more general model is given in Section 2) and guarantee convergence to the underlying parameter w* that generates the (corrupted) responses.
- We will explicate the constants in Theorem 4 in the supplementary material.
- We will state the actual bound (currently 1/65) in the introduction (line 90). We would like to stress that the main goal of the paper was to show that a constant fraction of corruptions in the responses can be tolerated. We did not strive to optimize constants.
- As mentioned in line 368, we tune lambda on a fine grid. Hence, the "low-dimensional" case, i.e., the one without the $\|w\|_1$ term, is already subsumed. Moreover, we did try using L2 regularization, but it did not offer significant improvements because the data is sub-Gaussian and we provided enough clean responses for each experiment.
- Post submission, we also performed experiments on certain face recognition tasks that require robust regression problems to be solved. Here again we observe significantly better performance for our method than the state of the art for these tasks.
