__ Summary and Contributions__: The paper presents a new type of structured convolution that uses sum pooling followed by a smaller convolution to maintain the receptive field (i.e., effective kernel size) while using fewer operations (additions and multiplications) and parameters.
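For concreteness, the core decomposition can be illustrated with a small sketch (my own reconstruction from the paper's description, not the authors' code): a 3x3 kernel built from four 2x2 blocks of ones, each weighted by a scalar alpha, is equivalent to 2x2 sum pooling followed by a 2x2 convolution, cutting multiplications per output from 9 to 4.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Plain 'valid' cross-correlation, the operation behind conv layers."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
alpha = rng.standard_normal((2, 2))  # the small "underlying" kernel

# Build the 3x3 structured kernel: each alpha weights a 2x2 block of ones.
W = np.zeros((3, 3))
for i in range(2):
    for j in range(2):
        W[i:i + 2, j:j + 2] += alpha[i, j]

x = rng.standard_normal((8, 8))

direct = correlate2d_valid(x, W)                # 9 multiplications per output
pooled = correlate2d_valid(x, np.ones((2, 2)))  # sum pooling: additions only
decomposed = correlate2d_valid(pooled, alpha)   # 4 multiplications per output

assert np.allclose(direct, decomposed)
```

The equivalence holds exactly whenever the large kernel lies in this structured family, which is what the paper's training procedure encourages.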

__ Strengths__: The concept of the structured convolution proposed in the paper is overall novel and relevant to the NeurIPS community for its potential applications.

__ Weaknesses__: The major weaknesses of this paper concern soundness and significance.
In terms of soundness, the biggest issue is how the empirical evaluation was designed and executed. Firstly, it is odd that the authors implement the structured convolution as a post-training conversion rather than directly as a trainable architectural feature (e.g., like depthwise convolution), which would eliminate the computational cost of the regularization term in Eq (3) (likely expensive, yet left undiscussed in the paper) and might further improve accuracy. Otherwise, the authors should demonstrate that Eq (3) can serve as quick fine-tuning for fully trained (regular) networks to justify their approach. Secondly, although the reduction in the numbers of operations and parameters is impressive overall, the drop in accuracy compared to the baselines (Tables 1 to 5) is non-negligible, especially for more recent networks like EfficientNet and MobileNetV2. Even though the proposed approach compares favorably to other compression methods [48, 12, 35] in Fig 6, the significance of the margins is unclear without standard errors shown in the figure. More importantly, [48, 12, 35] are all relatively dated, making the comparison questionable. Finally, the authors should report actual inference speed on generally available devices (i.e., CPU or GPU) using reasonably recent DL frameworks for readers' reference.
In terms of significance, given the less than satisfactory empirical evaluation as stated above, the paper is unfortunately not significant enough.

__ Correctness__: The paper is overall correct.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: The related work section is overall fine, although it is unclear whether the authors compare against the state of the art in Fig 6, arguably the most important of all the results.

__ Reproducibility__: Yes

__ Additional Feedback__: [Reply to Rebuttal]
I appreciate the authors' detailed feedback and have raised my score to 7. Below are some further suggestions.
1. The accuracy drop from using the structured convolution as an architectural feature during training (i.e., the direct approach) is unexpectedly high. Please discuss this further or consider fixing the issue (which would also have the obvious benefit of accelerating training).
2. Please include the analysis of the regularizer and discuss its pros and cons (e.g., when the batch size is small or the convolution kernel is large).
3. Consider working with hardware experts to provide readers with estimated speedups for inference using TPUs or other custom hardware.
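On point 2, one concrete object such an analysis could cover looks like the following sketch. This is a hypothetical projection-residual penalty of my own construction, not necessarily the paper's Eq (3): it measures the squared distance from a 3x3 kernel to the 4-dimensional subspace of structured kernels.

```python
import numpy as np

def structured_basis():
    # Columns: vectorized 3x3 masks, each a 2x2 block of ones.
    cols = []
    for i in range(2):
        for j in range(2):
            m = np.zeros((3, 3))
            m[i:i + 2, j:j + 2] = 1.0
            cols.append(m.ravel())
    return np.stack(cols, axis=1)  # shape (9, 4)

def structural_penalty(kernel):
    # Hypothetical regularizer: squared distance from `kernel` to the
    # structured subspace, found by a least-squares projection.
    A = structured_basis()
    w = kernel.ravel()
    alpha, *_ = np.linalg.lstsq(A, w, rcond=None)
    return float(np.sum((w - A @ alpha) ** 2))

# An exactly structured kernel incurs zero penalty; a generic one does not.
exact = (structured_basis() @ np.array([1.0, -2.0, 0.5, 3.0])).reshape(3, 3)
assert structural_penalty(exact) < 1e-12
assert structural_penalty(np.eye(3)) > 1e-6
```

The per-kernel least-squares solve in every training step is exactly the kind of cost the analysis should quantify, and it grows with kernel size, which motivates the pros-and-cons discussion requested above.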

__ Summary and Contributions__: This paper proposes a network compression method that imposes compositionality on convolution kernels. The proposed method greatly reduces the number of multiplications and the number of parameters in the original network. Compared to baselines, it achieves a similar compression ratio while maintaining slightly higher accuracy.

__ Strengths__: - The idea of structured convolutions is quite interesting, and limiting the degrees of freedom of convolution kernels is reasonable. The idea is also quite simple and can thus generalize to most CNN architectures.
- This paper is well written. The authors explain structured convolutions using both intuitive diagrams and formal math.
- Many experiments are conducted across datasets, tasks and architectures (CIFAR, ImageNet, Cityscapes, ResNet, MobileNet, EfficientNet), showing that the proposed method can empirically generalize well.

__ Weaknesses__: The central idea of structured convolution resembles using a smaller kernel size. It would be better if the authors explained the difference from simply swapping out 3x3 convs for 2x2 convs (cf. Fig 1 (b)). I can see that conceptually they are not exactly equivalent, but some experiments should be done to verify that this conceptual difference makes a difference in practice.
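To make the conceptual difference concrete (a sketch based on my reading of Fig 1, not the authors' code): the effective 3x3 kernel realized by sum pooling plus a 2x2 convolution has all nine taps non-zero, so it mixes a full 3x3 neighborhood using only four free parameters, whereas a plain 2x2 conv sees only four input pixels per output. An empirical comparison would still be needed to show this matters for accuracy.

```python
import numpy as np

alpha = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

# Effective 3x3 kernel realized by 2x2 sum pooling followed by conv(alpha):
# each alpha weights a 2x2 block of ones inside the 3x3 window.
W = np.zeros((3, 3))
for i in range(2):
    for j in range(2):
        W[i:i + 2, j:j + 2] += alpha[i, j]

print(W)
# [[ 1.  3.  2.]
#  [ 4. 10.  6.]
#  [ 3.  7.  4.]]
# All 9 taps are non-zero: the receptive field is genuinely 3x3,
# which a standalone 2x2 convolution cannot reproduce.
assert np.count_nonzero(W) == 9
```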

__ Correctness__: The claims and method make sense, and the experiments are done properly.
A minor issue: at line 118, should it be 1 <= n <= N and 1 <= c <= C?

__ Clarity__: This paper is in general well written. However, Sec 4 is a bit confusing when introducing structured convolutions. Using the number of betas and the number of 1s to define a structure is not very intuitive. Although the intuition is mentioned in line 128, it might be better to present it before the definition.

__ Relation to Prior Work__: This paper clearly discusses the related work in an organized way, and this work appears sufficiently different from prior work. However, I am not very familiar with the related work in this area.

__ Reproducibility__: Yes

__ Additional Feedback__: Post rebuttal: the rebuttal has addressed my concern and I would like to keep my score of 7.

__ Summary and Contributions__: This paper proposes structured convolutions, which can be decomposed into a sum-pooling operation and a smaller convolution to achieve higher computation and parameter efficiency.

__ Strengths__: 1. This paper is on the whole clearly written, and the experimental evaluations are well executed.

__ Weaknesses__: 1. With the imposed structure, the types of features that can be extracted might be limited. Would it be possible to approximate some hand-crafted filters, such as the 3x3 Laplacian filter, using the proposed structured kernel?
2. There are two steps in the proposed training scheme. Where does the approximation error mainly come from, Step 1 or Step 2? Could the authors report the model performance after the first step? Why not directly optimize the weights in the restricted subspace?
3. Typo: Figure 3, alpha_2*(x2+x3+x5+x6)
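On point 1, the question is directly checkable with a least-squares fit (a quick experiment of my own, not from the paper): project the 3x3 Laplacian onto the 4-parameter structured subspace and measure the residual.

```python
import numpy as np

# Basis of 3x3 structured kernels: four 2x2 blocks of ones.
cols = []
for i in range(2):
    for j in range(2):
        m = np.zeros((3, 3))
        m[i:i + 2, j:j + 2] = 1.0
        cols.append(m.ravel())
A = np.stack(cols, axis=1)  # shape (9, 4)

laplacian = np.array([[0.0,  1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0,  1.0, 0.0]])

# Best structured approximation of the Laplacian, in the least-squares sense.
alpha, *_ = np.linalg.lstsq(A, laplacian.ravel(), rcond=None)
best = (A @ alpha).reshape(3, 3)
rel_err = np.linalg.norm(best - laplacian) / np.linalg.norm(laplacian)
print(rel_err)  # ~0.95: the Laplacian lies almost entirely outside the subspace
```

So at least this particular hand-crafted filter is approximated poorly by a single structured kernel, which supports the concern about limited expressiveness.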

__ Correctness__: Yes

__ Clarity__: Yes.

__ Relation to Prior Work__: I'm not an expert in this field and do not know if the authors have clearly discussed the relationship between the proposed and related works.

__ Reproducibility__: Yes

__ Additional Feedback__: No.

__ Summary and Contributions__: This paper introduces a way to decompose convolutional kernels into a sum pooling followed by convolution. This has the advantage of reducing the parameters, additions, and multiplications needed in a convolutional layer, which reduces the computational cost of the network. This does result in slightly lower accuracy, but has the benefit of being more efficient.

__ Strengths__: The paper is very well written and easy to understand. The conceptual decomposition part is clever and explained well. The results are strong, showing the reduction of operations and comparable performance.

__ Weaknesses__: There are a few details that I would like to see discussed more:
- What is the training and inference runtime of the network? The paper correctly notes that this depends on the underlying hardware and software, but I am curious what the runtime of this approach is with current libraries and standard hardware (e.g., GPU, CPU). Since efficiency is a main claim, it would be great to see it actually demonstrated (faster runtime, lower power usage, etc.) rather than just a reduction in multiplications.
- It is a bit unclear how this is actually implemented. At first I thought it redefined the convolutional operation in the network, but later in the paper it seems that standard convolution is used and the decomposition is applied after training. Unfortunately, no code is provided, so it is difficult to understand how this is actually implemented.
- Relatedly, how is the 2-step training process done? Is Step 1 the standard, full gradient-descent approach with Step 2 performed once training is finished, or are Steps 1 and 2 alternated at each training iteration?
Line 207, "The proposed scheme trains the architecture with the original C×N×N kernels in place but with a structural regularization loss," makes it seem like there would be no efficiency gains during training, only during inference. It also leaves unclear how the efficient version is implemented. If the original kernels are used, adding details about the implementation is critical to show an actual reduction in operations.
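For what it is worth, the post-training conversion I would expect looks something like this sketch (my guess at the pipeline, not the authors' implementation): fit the 2x2 alpha kernel to the trained 3x3 kernel by least squares, then run inference as sum pooling plus the small convolution.

```python
import numpy as np

def correlate2d(x, k):
    """Plain 'valid' cross-correlation."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Basis: each column is a vectorized 2x2-block-of-ones mask in a 3x3 window.
A = np.stack([np.pad(np.ones((2, 2)), ((i, 1 - i), (j, 1 - j))).ravel()
              for i in range(2) for j in range(2)], axis=1)

rng = np.random.default_rng(0)
trained = rng.standard_normal((3, 3))  # stand-in for a trained 3x3 kernel

# Conversion step: least-squares fit of the 2x2 alpha kernel.
alpha = np.linalg.lstsq(A, trained.ravel(), rcond=None)[0].reshape(2, 2)

x = rng.standard_normal((8, 8))
pooled = correlate2d(x, np.ones((2, 2)))  # sum pooling: additions only
approx = correlate2d(pooled, alpha)       # 2x2 conv: 4 mults per output
exact = correlate2d(x, trained)           # original 3x3 conv: 9 mults
# `approx` matches `exact` only to the extent the trained kernel is
# (regularized to be) structured; for a random kernel there is a residual.
```

If something like this is what the paper does, stating it explicitly (and releasing code) would resolve the ambiguity about where the operation savings actually come from.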

__ Correctness__: Yes

__ Clarity__: Yes, the paper is well written. However, implementation details are lacking, and I think they are necessary to show that the proposed approach actually reduces operations.

__ Relation to Prior Work__: Yes

__ Reproducibility__: No

__ Additional Feedback__: Overall, I think the paper is good: it is well written and makes a good contribution. It really needs the implementation details, which would greatly strengthen the paper.
--- After rebuttal
The rebuttal mostly addressed my concerns. I would have liked to see runtime on different hardware (e.g., GPU) to better demonstrate the runtime gains. The results on CPU show a small improvement. It would be great to add results that actually demonstrate this on the targeted hardware, as the authors mention that "our method is designed for recent accelerators that allow efficient sum-pooling operations, thus the theoretical speedups (Tables 1-4) are realizable on such platforms."