__ Summary and Contributions__: The paper focuses on sparse representation in deep neural networks for image restoration tasks. The authors propose adding sparsity constraints via the model structure instead of the usual regularization term in the loss function. The method dynamically selects a subset of convolution kernels during inference and skips the computation for the complementary set of kernels. The method gives a better tradeoff between speed and accuracy, and it can be plugged into any convolutional neural network.

__ Strengths__: 1. Novelty and importance: the paper addresses sparse representation, one of the most important concepts in ill-posed image restoration, which dominated the field for decades but has rarely been investigated since deep learning approaches became prevalent.
2. Methodology: the paper borrows ideas from sparse-coding-based approaches but implements them in a way that suits deep learning. The sparsity is determined by an MLP conditioned on the features instead of by the iterative optimization used in sparse coding.
3. Experiments: the ablation study covers the key design choices of the proposed method. The main results show significant improvements in speed and accuracy across various models and tasks.
4. Potential: the proposed solution significantly outperforms CondConv on image restoration, so it can be extended to image recognition tasks. More importantly, this work gives a promising direction to explain and improve CondConv theoretically.

__ Weaknesses__: 1. The experiment mentioned at Line 140 is not addressed in the ablation study.
2. The results in Figure 4 show that increasing cardinality consistently improves performance. Is there a point at which performance starts to regress?

__ Correctness__: The paper is technically correct.

__ Clarity__: The paper is written concisely and clearly. Technical details are covered thoroughly in the main paper.

__ Relation to Prior Work__: Related work is well addressed. Comparisons between the proposed and previous methods are covered in both the method and experiment sections. The previous work covered is not limited to EDSR and RCAN in image restoration, but also includes SE-Net and CondConv in image recognition.

__ Reproducibility__: No

__ Additional Feedback__:

__ Summary and Contributions__: The authors introduce a form of sparsity for deep CNNs, in which a subset of channels is turned on/off by auxiliary soft gating networks (each of which is a multi-layer perceptron on the input signal). They apply the network to super-resolution, denoising, and compression artifact removal, showing state-of-the-art performance on all three.
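To make the gating idea summarized above concrete, here is a minimal NumPy sketch. It is not the paper's architecture: it assumes a squeeze-and-excitation-style pooled descriptor, a two-layer MLP, and a sigmoid gate with a hard cutoff, and the names `gate_channels`, `w1`, `w2`, and `threshold` are hypothetical.

```python
import numpy as np

def gate_channels(features, w1, w2, threshold=0.5):
    """Gate the channels of a (C, H, W) feature map with a small MLP.

    Hypothetical sketch: the pooling, MLP shape, and threshold are
    assumptions for illustration, not the method under review.
    """
    pooled = features.mean(axis=(1, 2))             # global average pool -> (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)           # MLP layer 1 with ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # MLP layer 2 with sigmoid -> (C,)
    active = gates >= threshold                     # channels that stay switched on
    gated = features * (gates * active)[:, None, None]
    return gated, active

rng = np.random.default_rng(0)
C, H, W, hidden_dim = 8, 4, 4, 4
features = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((hidden_dim, C))
w2 = rng.standard_normal((C, hidden_dim))
gated, active = gate_channels(features, w1, w2)
```

Channels whose gate falls below the cutoff are exactly zero in `gated`, so a real implementation could skip the convolutions that consume them; that skipping is where a compute saving would come from.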

__ Strengths__: Results are at the state of the art, with about half the computation (at least for the super-resolution application).

__ Weaknesses__: * Model is ad-hoc, somewhat complex, and difficult to understand (as described).
* Interpretation/Visualization of the gating network (Fig. 5) was not so helpful. The authors state that "the last layer is more correlated to semantics, for example, tree branches in the first image and lion in the second image." But I do not see this at all. Moreover, I don't see why one would expect a semantic partition for a network that is performing super-resolution.

__ Correctness__: Method is ad-hoc, so no derivations/proofs to check. I can't verify consistency of primary equations without understanding definitions of c or d (see below).

__ Clarity__: I had trouble understanding the details of the architecture. Figure 3 did not help. In particular, after multiple re-readings I still could not understand the definitions of the hidden nodes c or cardinal groups d.
Also, please have your article proof-read for style and grammar issues - there are many improper sentences.

__ Relation to Prior Work__: The method is an extension of the Squeeze-and-Excitation model (CVPR-2018), and offers some advantages over that.

__ Reproducibility__: No

__ Additional Feedback__: >>> Added after reading the authors' feedback and the other reviewers' comments:
The use of gating networks in these problems is interesting, and gives performance gains. I still think the paper is not very clearly written, but acknowledge that the other reviewers did not have this reaction. The author feedback helped me to understand Fig. 5, but I still would not call that "semantic": there is nothing that indicates the meaning or identity of the scene or objects, just foreground vs. background. I agree with R3 that the connection to group sparsity should be explained, and with R4 that connections to attention models should be explained. Overall, neither the feedback nor the other reviews significantly altered my view of the paper, which is marginally above threshold. In any case, I hope the authors will find our comments helpful in improving the work.

__ Summary and Contributions__: This paper enforces sparsity in neural networks in order to improve network performance. The authors argue that sparse coding algorithms are not appropriate because of their iterative nature. They observe that ReLU imposes sparsity. The paper proposes novel sparsity constraints to obtain a sparse representation in neural networks. Their algorithm enforces a group sparsity property. They do this with a maximum selection operation, similar to a matching pursuit algorithm. To get a differentiable operator, they use a softmax.
In their experimental section, the authors carried out a serious ablation study testing different hypotheses, including sparsity constraints. They tested the method on a wide range of applications, including noise removal, compression, and super-resolution.
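The step from maximum selection to softmax mentioned above can be illustrated with a small NumPy sketch. It only shows the generic relaxation (a softmax with a temperature approaches a one-hot argmax as the temperature goes to zero); the function name `soft_select` and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_select(scores, temperature=1.0):
    """Differentiable stand-in for picking the maximum entry of `scores`."""
    z = (scores - scores.max()) / temperature   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([0.2, 1.5, -0.3, 0.9])

# Hard (non-differentiable) selection: one-hot vector at the argmax.
hard = np.eye(len(scores))[scores.argmax()]

# Soft selection: a probability vector that still favors the maximum.
soft = soft_select(scores, temperature=1.0)

# A low temperature pushes the soft selection toward the hard one.
sharp = soft_select(scores, temperature=0.01)
```

The hard selection has zero gradient almost everywhere, while the soft version passes gradients to every score, which is the usual reason for this kind of relaxation.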

__ Strengths__: This paper presents serious experimental work testing ideas for imposing sparsity in neural networks with different sparsity criteria. The experiments have been carried out with care.

__ Weaknesses__: The paper is rather weak on the theoretical side of sparsity and on the existing work. The paper claims in the introduction that "sparsity of hidden representation in deep neural networks cannot be solved by iterative optimization as sparse coding". I do not understand this claim, since algorithms such as LISTA do compute sparse codes with a few layers in deep networks.
The fact that sparsity is needed for denoising, compression, or inverse problems is well understood independently of neural networks and results from work carried out by many researchers, such as Donoho, between 1995 and 2005. I do not understand why they say that such sparsity cannot be implemented, given that a ReLU is the proximal operator of a positive l1 sparse coder, that many algorithms implement a sparse code with such architectures, and that such architectures with ReLU get very good performance for denoising and inverse problems, as shown by "Convolutional Neural Networks for Inverse Problems in Imaging: A Review", published in 2017, and much more work has been done since. Group sparse coding is not a problem either, using mixed l2/l1 norms, which are also implemented with ReLU non-linearities.
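The remark about ReLU and the proximal operator can be checked numerically. For a scalar x and λ ≥ 0, the minimizer of ½(z − x)² + λz over z ≥ 0 is max(x − λ, 0), i.e. a ReLU with a bias of −λ. The sketch below compares that closed form with a brute-force grid search; the helper name `prox_nonneg_l1_bruteforce`, the grid, and the test values are chosen only for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def prox_nonneg_l1_bruteforce(x, lam, grid):
    """Minimize 0.5*(z - x)^2 + lam*z over z >= 0 by grid search, per element."""
    # cost[i, j] is the objective for input x[i] evaluated at candidate grid[j]
    cost = 0.5 * (grid[None, :] - x[:, None]) ** 2 + lam * grid[None, :]
    return grid[cost.argmin(axis=1)]

x = np.array([-1.0, 0.1, 0.5, 2.0])
lam = 0.3
grid = np.linspace(0.0, 3.0, 3001)   # non-negative candidates, step 0.001

closed_form = relu(x - lam)          # ReLU with bias -lam
numeric = prox_nonneg_l1_bruteforce(x, lam, grid)
```

The two results agree up to the grid resolution, which is the sense in which a ReLU layer with a bias already performs one non-negative l1 shrinkage step.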
The algorithm proposed by the authors makes sense and in fact has similarities with matching pursuit procedures, which extract maxima as they do to build sparse codes, but there is no guarantee of the efficiency of such algorithms in this neural network context, besides the numerical experiments shown at the end.
The difficulty in evaluating such experiments is that the improvements are often marginal relative to methods that are not always state of the art, such as JPEG or SA-DCT for compression. The multiplication of tables and numbers does not help.

__ Correctness__: The empirical methodology seems correct but some of the claims of the paper concerning sparsity are not correct.

__ Clarity__: It is sufficiently well written to be understood.

__ Relation to Prior Work__: Yes, this is an algorithmic work and this particular algorithm has not been tested so far.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper presents a new approach to enforce sparsity in neural network based denoising. The method is well explained, seems novel and promising and results are convincing.

__ Strengths__: The denoising results are convincing, the paper is well written, figures and tables are neat, the method is original.

__ Weaknesses__: It is missing comparisons in terms of computation time.
Minor comments:
Line 62: What is a coupled dictionary?
Line 172: How does the model behave if the noise level is lower than 15 or higher than 50?
Line 168: Use "$\times 2$" instead of "x2"

__ Correctness__: Yes it seems correct to me.

__ Clarity__: Yes, it is well written.

__ Relation to Prior Work__: Yes this paper is very clear.

__ Reproducibility__: Yes

__ Additional Feedback__: Adding computation time in addition to PSNR and SSIM would be relevant.
Comparing with a patch-based approach such as FEPLL for denoising and super-resolution would be useful too:
Parameswaran, Shibin, et al. "Accelerating GMM-Based Patch Priors for Image Restoration: Three Ingredients for a $100\times$ Speed-Up." IEEE Transactions on Image Processing 28.2 (2018): 687-698.