Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Originality: Using differentiable architecture search to prune a network is new to me, and it makes sense. Using interpolation to fuse different channel sizes also makes sense and could possibly be used in other applications.
Quality: The contribution of the paper seems clear and the proposed methodology makes sense. Experimental results show that the proposed pruning can lead to a good trade-off between computational cost and accuracy.
Clarity: The presentation of the paper should be improved. For instance, channel-wise interpolation is abbreviated sometimes as (CWI) and other times as (CHI), which makes the paper very confusing to read. Also, the actual way that CHI is performed is not very clear. It can be intuited from Fig. 2, but a proper formulation is missing.
Significance: In my opinion, the contribution of this paper is important because it shows that it is possible to cast network pruning in terms of differentiable architecture search, and the results place the method among the most promising pruning approaches.
Additional comments:
- From Tab. 1, the use of knowledge distillation (KD) seems to be important for good results. However, KD can be used to improve any pruning approach. It is therefore not clear whether the good performance of the proposed method is due to the network architecture optimization or to the knowledge distillation. If it is the latter, the contribution of the paper is reduced.
- Fig. 2 helped me to fully understand the contribution. However, there should also be a clearer formulation of the approach.
- In l.50 the authors talk about optimizing the number of channels but do not talk about the number of layers. This is a bit confusing because it is not clear what the final aim of the paper is. This kind of problem can be found at several points in the paper. It seems as if multiple people with different understandings wrote different parts of the paper.
- The authors should also compare with other approaches in terms of training time. This method seems computationally quite intensive during training.
- From my understanding, in this work channels are grouped together based just on their order. This means that there can be other combinations of channels that outperform the predefined ones. One could argue that this constraint is maintained during training and can induce the channels to group in a meaningful way. However, this is not true when the optimisation starts from a pre-trained network, as seems to be the case.
Final decision: I read the other reviews and the rebuttal. Some of the answers to my questions are not fully satisfactory:
- Q2.1: I saw the comparison with and without KD in Figure 1; however, I wanted to see the influence of this factor on the final results. That is, would the proposed method without KD still be superior to the others? The authors show results for a specific configuration in Tab. 2, and it seems that for the other methods using KD produces worse results. Is there a reason for that, or is it a typo?
- Q2.2: With this question I wanted to see more convincing results than a single experiment. However, the authors did not include any new experiment.
Overall I still view the paper positively; however, I would like to see these two points clarified in the final version.
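To make the request for a proper formulation of channel-wise interpolation concrete, the following is a minimal sketch of one possible reading, assuming plain linear interpolation along the channel axis. The function names `channel_interpolate` and `fuse`, the (C, H, W) layout, and the choice of linear interpolation are all assumptions made for illustration; the operator the paper actually uses for CWI/CHI may differ.

```python
import numpy as np

def channel_interpolate(x, out_channels):
    """Resample a (C, H, W) feature map to (out_channels, H, W).

    Assumption: linear interpolation along the channel axis only;
    the spatial dimensions are left untouched.
    """
    in_channels = x.shape[0]
    if in_channels == out_channels:
        return x.copy()
    # Output channel positions expressed in input-channel coordinates.
    pos = np.linspace(0.0, in_channels - 1, out_channels)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, in_channels - 1)
    w = (pos - lo).reshape(-1, 1, 1)
    return (1.0 - w) * x[lo] + w * x[hi]

def fuse(feature_maps, probs):
    """Weighted sum of feature maps with different channel counts.

    Every map is first aligned to the largest channel count, then the
    maps are combined with the given (hypothetical) probabilities.
    """
    target = max(f.shape[0] for f in feature_maps)
    return sum(p * channel_interpolate(f, target)
               for p, f in zip(probs, feature_maps))

# Two candidate outputs of the same layer with 2 and 4 channels.
small = np.ones((2, 4, 4))
large = np.zeros((4, 4, 4))
fused = fuse([small, large], probs=[0.3, 0.7])
print(fused.shape)
```

Under this reading, the grouping of channels depends only on their order, which is the constraint questioned in the comments above.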
1) The paper argues that current network pruning approaches keep the same architectural design, whereas a more efficient network could be found with a new architecture.
2) The paper organization was good and easy to follow.
3) The approach is a bit non-intuitive: for example, why combine the different feature maps at all? Why search only over depth and width? It took some time to understand, but the solution seems to be out-of-the-box thinking.
4) The results are solid, i.e. fewer FLOPs and better accuracy on ImageNet compared to previous approaches to network pruning.
The writing of the article could be improved. Besides a few typographical errors across the paper (lines 72, 160, 160), there are other places where sentences could be rephrased, for example:
24: 'apply them' -> 'deploy them'
47: Rephrase whole sentence.
68: 'to develop the powerful networks' -> 'to deploy the deep networks'
87: Do not use 'while' as 'though' unless it is at the beginning of a sentence.
98: Rephrase whole sentence.
The style of the article could also be improved. State the main contributions of the paper more clearly in the introductory sections, and use Fig. 1a and Fig. 1b instead of saying 'the first line of Fig. 1'.
----------------------------------------------------------------------------------------------
The proposed method is quite original, as if attempting to 'grow' a reduced network instead of pruning a larger one. The NAS method is also mathematically very elegant: since it is set up as a differentiable problem, the error can be propagated and the gradient descent family of methods can be used to search for a locally optimal solution (a structure, in this case) efficiently. Having said that, this is yet another case of applying a relaxation to a hard problem and calling it a day. In fact, it is not easy to tell how much of the proposed TAS method is new and how much is just a variant, a minor incremental improvement, or a use case of the work already presented in:
Dong, Xuanyi, and Yi Yang. "Searching for a robust neural architecture in four gpu hours." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
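The kind of relaxation referred to above can be illustrated with a Gumbel-softmax sample over a discrete set of candidates, a standard device in differentiable architecture search. This is a generic sketch, not the TAS implementation: the function name `gumbel_softmax`, the candidate widths, and the temperature values are hypothetical.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed (soft) one-hot sample over the candidates.

    Lower temperatures `tau` give sharper samples; as tau goes to 0
    the sample approaches a one-hot vector.
    """
    # Gumbel(0, 1) noise via the inverse-CDF transform of uniform noise.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    z = (logits + g) / tau
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
alpha = np.array([0.5, 1.0, -0.3])   # hypothetical architecture parameters
widths = np.array([16, 32, 64])      # hypothetical candidate channel counts

soft = gumbel_softmax(alpha, tau=5.0, rng=rng)
# A soft sample yields a weighted mix of candidates, so a loss computed
# from it has well-defined gradients with respect to alpha.
relaxed_width = float(soft @ widths)
print(soft, relaxed_width)
```

The discrete choice of a width is replaced by a continuous weighting, which is what allows gradient-based search, and also what makes the result only a locally optimal structure.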
EDIT: I have read the other reviews and the rebuttal. Thanks to the authors for clarifying my questions. I am happy to raise my score, and I urge the authors to make sure the clarity concerns are addressed in the draft.
Clarity: I struggled with the clarity of parts of the draft. Some specific points that I may have misunderstood:
- I could not find a clear definition of CHI. Please define this clearly in the draft. If it has many possible interpretations, then specify the goal. Is CHI == CWI? If so, please explain this in a bit more detail as well. An example may be helpful.
- How was the cost of the network estimated? In particular, what is the difference between E_cost(A) and F(A), and how is this made into a differentiable loss? Please specify.
- The role of distillation could have been made clearer. In particular, the authors could specify whether the new network inherits (1) weights, (2) architectures from the networks trained earlier in the TAS procedure.
- The train / test / validation splits need to be clarified in the experimental section. This is especially important since TAS trains on validation data.
Quality: There are serious concerns regarding the quality of the experiments.
- TAS uses validation data to optimize the architecture of the networks. Even though the validation data is not used directly to optimize the weights, it can still have a very significant influence on the learned weights via the bilevel optimization scheme. Therefore, it is correct to treat the validation data as part of the training set when evaluating overfitting. I could not tell from the current draft whether the reported results are evaluated on the test or training sets. Moreover, given that the validation set was not specified (from what I could tell!), it is hard to be sure whether the reported numbers reflect overfitting or not.
- In the experiment on the effect of strategies to differentiate alpha, it was unclear why the choice was made not to constrain the computation cost. This seems like the least informative choice, and the fact that the chosen method is the only one that succeeds in this setting does not make it obvious why it should succeed when the aim is to constrain the computational cost. The authors could make this more convincing by including experiments with constrained costs.
Significance: Given the concerns with the evaluation, it is hard to assess the significance of the work. Even so, the improvements over competitor methods were modest, which raises concerns about the impact of the method going forward.
Originality: The method is original, although not completely distinct from previous works; see Louizos et al. 2018. This would not be a major concern if the experimental results were more interpretable and robust.
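As an illustration of the question above about how a network cost can become a differentiable loss, one common construction is the expected cost under a softmax over the architecture parameters. The sketch below assumes that construction; the names `expected_cost` and `expected_cost_grad` and the per-candidate costs are hypothetical, and the paper's actual definitions of E_cost(A) and F(A) may differ.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()

def expected_cost(alpha, costs):
    """Expected cost sum_i p_i * c_i with p = softmax(alpha).

    Smooth in alpha, unlike the cost of a single discrete choice.
    """
    return float(softmax(alpha) @ costs)

def expected_cost_grad(alpha, costs):
    """Analytic gradient: d E / d alpha_j = p_j * (c_j - E)."""
    p = softmax(alpha)
    return p * (costs - p @ costs)

alpha = np.array([0.2, -0.1, 0.4])   # hypothetical architecture parameters
costs = np.array([1.0, 2.0, 4.0])    # hypothetical cost of each candidate

print(expected_cost(alpha, costs))
print(expected_cost_grad(alpha, costs))
```

Under this assumption, the discrete cost of the architecture finally selected and the expected cost used for training are different quantities, which is why the distinction between them needs to be stated explicitly.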