NeurIPS 2020

Thunder: a Fast Coordinate Selection Solver for Sparse Learning


Review 1

Summary and Contributions: This paper introduces Thunder, a new fast solver for large-scale l1-regularized minimization problems (a.k.a. the Lasso). The authors propose a two-stage minimization procedure consisting of screening and recruiting of the relevant support (feature selection), followed by shrinkage. They provide a theoretical analysis of the algorithm and demonstrate its efficiency as a solver on both simulated and real datasets, relative to Celer and Blitz. The feature set F is decomposed into an active set A and a remaining set R, and Thunder solves sub-problems to actively refine A until a specifically designed stopping condition is satisfied.
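To make the summary above concrete, the generic active-set scheme it describes can be sketched as follows: maintain an active set A, solve the Lasso sub-problem restricted to A, and recruit features from the remaining set R that violate the optimality conditions. This is only an illustrative sketch of the general pattern, not the authors' implementation; in particular, the recruiting step here is a plain KKT check rather than Thunder's sampled two-layer selection, and `lasso_cd` is a hypothetical minimal coordinate-descent sub-solver.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, w, idx, n_sweeps=100):
    """Coordinate descent on the Lasso sub-problem restricted to columns `idx`."""
    col_sq = (X[:, idx] ** 2).sum(axis=0)
    r = y - X[:, idx] @ w[idx]          # residual using only active features
    for _ in range(n_sweeps):
        for k, j in enumerate(idx):
            if col_sq[k] == 0.0:
                continue
            r += X[:, j] * w[j]          # remove feature j's contribution
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[k]
            r -= X[:, j] * w[j]          # add it back with the updated weight
    return w

def active_set_lasso(X, y, lam, tol=1e-6, max_rounds=50):
    """Generic active-set loop: solve on A, recruit KKT violators from R."""
    n, p = X.shape
    w = np.zeros(p)
    active = []
    for _ in range(max_rounds):
        if active:
            w = lasso_cd(X, y, lam, w, active)
        # At the optimum, |X^T (y - Xw)| <= lam holds for every feature.
        grad = X.T @ (y - X @ w)
        violators = np.where(np.abs(grad) > lam + tol)[0]
        new = [j for j in violators if j not in active]
        if not new:
            break
        active.extend(new)               # recruit violating features into A
    return w
```

On a noiseless sparse instance this loop recovers the support after a handful of recruiting rounds; the point of the paper's refinements is to make the recruiting step itself cheap when p is large.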

Strengths: Pre-existing approaches to fast sparse solvers include feature screening for the Lasso [10], which develops rules based on a dual feasible point to prune the feature set before shrinkage, thereby reducing computation time. Other approaches rely on carefully adjusting the regularization parameter \lambda (homotopy-based methods) or on solving sub-problems (working set methods) to achieve better computational efficiency. This paper effectively combines the merits of all of these approaches and packages them as a multi-stage sparse solver.
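For readers less familiar with dual-based screening, a representative rule can be sketched as follows. This is a generic Gap Safe-style sphere test (an illustration of the family of rules in [10], not necessarily the specific rule used in the paper): from any primal iterate w one builds a dual-feasible point from the residual, bounds the distance to the dual optimum via the duality gap, and safely discards any feature whose correlation with the dual point stays below 1 over the whole safe sphere.

```python
import numpy as np

def gap_safe_screen(X, y, w, lam):
    """Boolean mask of features that a Gap Safe sphere test can discard
    for the Lasso  min_w 0.5*||y - Xw||^2 + lam*||w||_1  (illustrative sketch)."""
    rho = y - X @ w                                  # primal residual
    scale = max(lam, np.abs(X.T @ rho).max())
    theta = rho / scale                              # dual-feasible point: ||X^T theta||_inf <= 1
    primal = 0.5 * rho @ rho + lam * np.abs(w).sum()
    diff = theta - y / lam
    dual = 0.5 * y @ y - 0.5 * lam**2 * (diff @ diff)
    gap = max(primal - dual, 0.0)                    # duality gap at (w, theta)
    radius = np.sqrt(2.0 * gap) / lam                # safe-sphere radius around theta
    col_norms = np.linalg.norm(X, axis=0)
    # Feature j is safely inactive if |x_j^T theta'| < 1 for all theta' in the sphere.
    return np.abs(X.T @ theta) + radius * col_norms < 1.0
```

The key property, which the paper's "safe" terminology refers to, is that any feature passing this test is guaranteed to have a zero coefficient at the optimum, so discarding it cannot change the solution.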

Weaknesses: The paper would benefit from a better exposition of safe screening methods [10] (lines 17-23). Line 26: "unsafe assumptions" vs. "safe" screening rules -- either explain the distinction briefly or defer the explanation to a later point.

Correctness: The claims appear to be correct.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: The paper introduces a solver for Lasso-type problems with a new screening strategy. The screening strategy is based on operational properties of the active set, with feature-recruiting and feature-removing components. The proposed recruiting is equipped with a sampling approximation and a two-layer splitting selection. Numerical comparisons on synthetic and application datasets demonstrate the advantage of the method with respect to timing. The major contribution of the paper is the screening strategy with its two-layer recruiting set. The design of this improved strategy is based on an insightful comparison with a few currently popular screening methods. Rebuttal checked. I agree with the other reviews and will keep my score.

Strengths: The design of the strategy is clear and well-motivated by the limitations of working set methods and dynamic screening. The empirical performance is promising.

Weaknesses: The details of the algorithm need to be better presented and discussed. The effects of a few of the proposed components are not clearly demonstrated. For example, what is the impact of the sampling recruiting strategy of Section 2.2 on accuracy, convergence properties, and speed? Does it break the safe-screening property? Similarly, what is the effect of the size ratio of Section 2.3 on the algorithm, theoretically and empirically? Since these are the major innovative components, it would be worthwhile to study them in detail. For example, in the experiments, disable the component of Section 2.3 and use only that of Section 2.2, then evaluate the difference. The current experiments are not informative in this respect. The broader impact section is also not very informative -- it is essentially a summary of the introduction rather than a broader impact statement.

Correctness: I did not check the proofs in the supplement. The main text looks correct.

Clarity: The writing is reasonable. Some improvement could be achieved by using more consistent notation with clear definitions. The algorithm description should also be improved.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: This paper proposes a new solver, dubbed Thunder, for large-scale l1-regularized optimization. In addition, the paper provides theoretical justification and experiments for the proposed algorithm.

Strengths: This work proposes a novel solver for general l1-regularized convex optimization problems with L-Lipschitz gradients. Experimentally, the Thunder solver is faster than the recent Celer and BlitzL1 solvers. Theoretically, the paper provides a nontrivial time-complexity analysis of the proposed algorithm.

Weaknesses: 1) This work should mention more related work. E.g., when introducing l1 regularization for sparse learning and compressed sensing, some classic works such as https://arxiv.org/pdf/math/0502327.pdf, https://authors.library.caltech.edu/10092/1/CANieeespm08.pdf, https://www.cns.nyu.edu/~tony/vns/readings/olshausen-field-1996.pdf, https://www.sciencedirect.com/science/article/pii/S0042698997001697, and https://statweb.stanford.edu/~tibs/lasso/lasso.pdf should be cited. 2) In the experiments section, the paper would appeal to a broader audience if it explained the relationship between Thunder and the CVX solver and provided some experimental comparison, since people in other fields may be unfamiliar with the Celer and BlitzL1 solvers. 3) Although the experiments show some practical benefits, I cannot find any principled way to choose the hyperparameters such as K_1 and K_2. Are their choices problem-specific? Are there similar problem-specific parameters for Celer and Blitz that were optimized for a fair comparison? The improvements of Thunder seem to come from more careful engineering of the working sets. Are there more fundamental reasons behind the improvements?

Correctness: The theoretical results seem sound and correct, and the experiments seem reasonable.

Clarity: The paper is well-organized and presented.

Relation to Prior Work: Yes, the novelty is clearly discussed. If the comparison is fair and not case-by-case, then the proposed algorithm could improve the state of the art for l1 solvers, but at the time of review this is difficult to assess.

Reproducibility: Yes

Additional Feedback: 1) Poor word choice: "seminal", the second word of line 25. 2) The labels in the figures are small (Figures 4, 5, 7, 8).


Review 4

Summary and Contributions: This paper proposes a novel Lasso screening method. By learning an active set that is much smaller than the whole feature set, efficiency is greatly improved. The most expensive part is the feature recruiting, for which the authors use a bi-level selection strategy to further improve efficiency. Besides efficiency, the authors also theoretically show the safeness of the algorithm. In the complexity analysis, the authors point out that the complexity can be largely reduced when the ground-truth sparsity ratio or the sparsity penalty is large, which is the theoretical highlight. The authors have run several experiments to support their claims.

Strengths: Efficiency is largely improved compared to previous studies. The theoretical and empirical results are consistent. The algorithms are not complex and are thus easy to apply.

Weaknesses: The complexity is largely reduced by feature shrinking. How do the ratio between R_1 and R_2 and the shrinking frequency K_2 affect the final solution? There is no sensitivity analysis on this part. Does the proposed approach require feature independence? Since the authors mention that the active set will be the same size as the optimal set, will this still hold when features are correlated? A minor suggestion: since real data may violate the assumptions, it would be better to report prediction results in addition to running time, to show whether the right features are selected.

Correctness: Yes, both the empirical and theoretical results are sound.

Clarity: Yes, the logic is clear.

Relation to Prior Work: Yes; the authors use subset selection to avoid large-scale computation.

Reproducibility: Yes

Additional Feedback: Generally, the paper is impressive in both its theoretical and empirical aspects. My main concerns are as follows: 1) The assumptions: in the empirical study, the authors highlight that, compared to Celer, their active set is closer to the optimal active set. Is this because Celer includes correlated features? I would like to know whether the proposed method really finds the active features, but the authors do not provide results on this. Especially for real data, where features are highly correlated, can the method improve the post-stage Lasso prediction results? Also, the Lipschitz and convexity ratio might lead to numerical issues on real datasets, which could affect the prediction results. If the authors included prediction results, the paper would be more solid. 2) Splitting R_1 and R_2: since the authors choose 1/3 as the ratio in the experiments, I wonder whether one could be more aggressive here. A sensitivity analysis of the ratio and the shrinking frequency is missing from the paper.