Paper ID: | 1094 |
---|---|

Title: | Learning Structured Sparsity in Deep Neural Networks |

This paper experiments with structured sparsity (group lasso) for reducing the size of convolutional neural networks in a way that is more regular, and hence more amenable to vector computation on GPUs, than the simple l1 regularization recently explored by Han et al. I find this to be an important innovation. The experimental results seem promising. I find the depth regularization particularly interesting.

This is a nice paper with a nice set of experimental results. Making DNNs run more efficiently is a core problem of deep learning. Sparsification is also useful for improved performance (through better regularization) and for improved interpretability. The novelty is limited, however, since structured sparsity is a well-established idea.

2-Confident (read it all; understood it all reasonably well)

This paper proposes using group sparsity on various CNN parameters to speed up their execution. The CNN is pre-trained using a classical baseline; then, starting from the corresponding parameters, group sparsity constraints are applied during retraining. The inactive groups are then removed, and the final network is fine-tuned without the sparsity constraints. This simple method gives good speed-ups without much loss in accuracy (and sometimes even an improvement).
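The "remove inactive groups" step can be illustrated with a minimal NumPy sketch. It assumes a filter-wise grouping, a weight layout of (filters, channels, height, width), and a hypothetical helper `prune_filters` with a hypothetical tolerance `tol`; none of these names come from the paper, and this is not the authors' implementation.

```python
import numpy as np

def prune_filters(w, w_next, tol=1e-6):
    """Drop filters of `w` whose L2 norm is at most `tol` (hypothetical helper).

    w:      (N, C, H, W) weights of one conv layer with N filters.
    w_next: (M, N, H2, W2) weights of the following conv layer, whose
            input channels correspond to the filters of `w`.
    """
    # One L2 norm per filter, taken over the channel and spatial axes.
    norms = np.sqrt((w ** 2).sum(axis=(1, 2, 3)))
    keep = norms > tol
    # Removing a filter also removes the matching input channel downstream.
    return w[keep], w_next[:, keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 5, 5))
w[1] = 0.0  # pretend the sparsity penalty zeroed these two filters
w[3] = 0.0
w_next = rng.normal(size=(8, 4, 3, 3))

w_small, w_next_small, keep = prune_filters(w, w_next)
print(w_small.shape, w_next_small.shape, keep)
```

The point of the sketch is that the pruned layers stay dense, only smaller, which is why this kind of sparsity translates into speed-ups on standard hardware.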

Using group sparsity to turn off redundant parts of a CNN and improve its speed seems like a good idea. Indeed, significant speed-ups are obtained in a large variety of experiments, with little loss in accuracy and sometimes even a small improvement. The authors use group sparsity on several axes, including the number of filters and channels used, the shape of the filters (I didn't really understand how the authors efficiently deactivate certain filter sites; this should be clarified), and the number of layers (using shortcuts to prevent the network from being disconnected). The idea explored in the paper is thus rather straightforward, but it is a good and probably useful one. However, unless I missed something, there are many details missing: How is the group sparsity optimisation performed within the CNN training? Just as importantly, how are the regularization weights chosen? Are they cross-validated? For example, Table 2 reports results with "different strengths of structured sparsity regularisation". These missing details are important, because in the end there must be a balance between speed and accuracy. If the network is to be used in practice, it is not enough to duplicate baseline results on test data. The authors should clarify these points.

2-Confident (read it all; understood it all reasonably well)

The paper proposes the use of group lasso to prune weights in a way that respects the 4-D tensor structure. For example, we would like to prune a whole channel or a whole filter, or even prune an entire layer of weights using the "shortcut" trick present in, e.g., residual nets. The justification for doing this is twofold: it makes more sense semantically, and it allows us to avoid the computational penalty incurred with truly sparse weight matrices (especially in convolutional layers), as the sparsity can be quite random.
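For readers less familiar with the penalty, the following is a small NumPy sketch of how a group norm is computed over slices of a 4-D weight tensor. The helper name `group_lasso` and the (filters, channels, height, width) layout are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def group_lasso(w, group_axis):
    """Sum of L2 norms of the slices of `w` indexed along `group_axis`."""
    other = tuple(a for a in range(w.ndim) if a != group_axis)
    return np.sqrt((w ** 2).sum(axis=other)).sum()

# Toy tensor: 2 filters, 3 input channels, 1x1 kernels.
w = np.zeros((2, 3, 1, 1))
w[0, 0, 0, 0] = 3.0
w[0, 1, 0, 0] = 4.0

filter_penalty = group_lasso(w, group_axis=0)   # groups are whole filters
channel_penalty = group_lasso(w, group_axis=1)  # groups are whole input channels
print(filter_penalty, channel_penalty)  # 5.0 7.0
```

Because each group contributes an unsquared L2 norm, the penalty tends to drive entire slices to zero together rather than scattering zeros at random positions.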

The idea of using group lasso to prune weights is simply not novel enough to be publishable at NIPS. I would be more open to it if the experimental results were amazing, but they are quite lackluster.

2-Confident (read it all; understood it all reasonably well)

This paper tackles a common problem resulting from CNN training: the learned CNN structure is redundant in terms of filters, channels, depth, and even filter shapes. To reduce this redundancy, the paper proposes a structured sparsity learning method that directly learns a compressed CNN structure through group lasso regularization. Groups are formed channel-wise, filter-wise, shape-wise, and even depth-wise. Experimental results show a notable speed-up compared with the baseline CNN model.
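One way to write down the four groupings is sketched below. The notation is an assumption introduced here for illustration (layer index $l$, filter index $n$, channel index $c$, spatial indices $m, k$, data loss $E_D$, regularization strength $\lambda_g$) and may differ from the paper's.

```latex
% Training objective: data loss plus a group lasso term per layer
E(W) = E_D(W) + \lambda_g \sum_{l=1}^{L} R_g\!\left(W^{(l)}\right),
\qquad
R_g(w) = \sum_{g=1}^{G} \left\lVert w^{(g)} \right\rVert_2

% Possible choices of the groups w^{(g)} for a layer W^{(l)} \in \mathbb{R}^{N \times C \times M \times K}:
%   filter-wise:  W^{(l)}_{n,:,:,:}   (one group per filter n)
%   channel-wise: W^{(l)}_{:,c,:,:}   (one group per input channel c)
%   shape-wise:   W^{(l)}_{:,c,m,k}   (one group per filter site (c, m, k))
%   depth-wise:   W^{(l)}             (the whole layer is a single group)
```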

The details of the training method are not very clear. The authors seem to simply give the optimization objective and do not discuss the parameter settings. Why is the model trained from weights initialized by the baseline, rather than from scratch? This makes it more of a fine-tuning method. The authors discuss different regularization methods that group channels, filters, or filter shapes; I wonder whether these methods can be combined for further speed-up, or whether the optimization is the main challenge. Another issue is that recent CNN architectures often adopt 3×3 filters, e.g., VGG-16 and ResNet; what is the point of shape regularization when the filters are already highly local?

2-Confident (read it all; understood it all reasonably well)

This paper applies structured sparsity learning to simplify the parameters of a deep neural network, enabling fast computation without a large loss of accuracy.

This paper proposes a structured sparsity learning approach to simplify or speed up a learned deep network for use on platforms with limited computational resources. The idea of learning a compact and hardware-friendly structure is very interesting. However, some aspects are not clear enough. 1. How is the approach related to hardware? How does one obtain CPU-, GPU-, or FPGA-friendly structures, respectively? Or how should the optimization goal be designed to achieve this? 2. Could the approach be used to transform the network into a version with only integer operations? 3. How should computational complexity be balanced against the error of the simplified network? 4. What is the computational complexity of the proposed approach itself?

2-Confident (read it all; understood it all reasonably well)

The paper presents a comprehensive method to sparsify deep neural networks across channel, filter, filter shape and layer. The proposed method looks very practical.

First, the proposed method needs to be compared with the state-of-the-art method, group-wise brain damage (GBD) [Lebedev, 2015]. Although GBD falls under the filter-shape category of this paper, its performance improvement (in terms of the number of floating-point operations) is larger than what this paper presents, especially in the case of Conv1 in AlexNet. Second, the example networks used in the experiments (LeNet, AlexNet for CIFAR-10, etc.) are mostly over-parameterized, i.e., relatively easy to compress. I recommend using more optimized networks such as GoogLeNet, SqueezeNet, and ResNet-152 to show the effectiveness of the proposed method.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)