Reviews: Swapout: Learning an ensemble of deep architectures

NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona

Paper ID:	13
Title:	Swapout: Learning an ensemble of deep architectures

Reviewer 1

Summary

Swapout is a method that stochastically selects the form of forward propagation in a neural network from a palette of choices: drop, identity, feedforward, or residual. It achieves the best results on CIFAR-10 and CIFAR-100 that I'm aware of.

Qualitative Assessment

Overall, a fine paper. I would argue for acceptance. The basic idea is a simple generalization of dropout and residual networks. It seems to work noticeably better than either, although not by a revolutionary amount. What I appreciate is the discussion in section 3.1 about sampling from the swapout distribution and then averaging, especially how deterministic inference interacts poorly with batch normalization. That is a valuable lesson for the community to understand. It was published after the NIPS deadline, but I highly recommend that the authors read the arXiv preprint from Cornell Tech, http://arxiv.org/pdf/1605.06431v1.pdf , which shows that residual networks are exponentially-sized ensembles. It would be interesting to put swapout into the context of that paper.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

Reviewer 2

Summary

This paper examines a stochastic training method for deep architectures, formulated so that it generalizes dropout and stochastic depth techniques. The paper studies a stochastic formulation for layer outputs that can be written as Y = Θ1 ⊙ X + Θ2 ⊙ F(X), where Θ1 and Θ2 are tensors of i.i.d. Bernoulli random variables. This allows each layer to be dropped (Y = 0), act as a feedforward layer (Y = F(X)), be skipped (Y = X), or behave like a residual network (Y = X + F(X)). The paper provides some well reasoned conjectures as to why "both dropout and swapout networks interact poorly with batch normalization if one uses deterministic inference", while also providing some nice experiments on the importance of the form of the stochastic training schedule and on the number of samples required to obtain estimates that make sampling useful. The approach yields a performance improvement over comparable models, provided that two critical details are respected: an appropriate stochastic training schedule and a sufficient number of samples.
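The formulation summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain-list representation of X and F(X), and the default probabilities are assumptions made for illustration only.

```python
import random

def swapout_unit(x, fx, p1, p2, rng):
    """Sample y = theta1 * x + theta2 * fx for one unit.

    theta1 ~ Bernoulli(p1) gates the identity path,
    theta2 ~ Bernoulli(p2) gates the feedforward path F(x).
    """
    theta1 = 1 if rng.random() < p1 else 0
    theta2 = 1 if rng.random() < p2 else 0
    return theta1 * x + theta2 * fx

def swapout_layer(xs, fxs, p1=0.5, p2=0.5, rng=None):
    """Apply independent swapout masks to every unit of a layer.

    xs  : layer inputs X (list of floats)
    fxs : precomputed F(X), same length as xs (hypothetical stand-in)
    """
    rng = rng or random.Random()
    return [swapout_unit(x, fx, p1, p2, rng) for x, fx in zip(xs, fxs)]
```

Setting (p1, p2) to (0, 0), (1, 0), (0, 1), or (1, 1) recovers the four deterministic cases: dropped, skipped, feedforward, and residual. Intermediate probabilities mix these cases independently per unit.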

Qualitative Assessment

This approach is a useful generalization of dropout and related methods for injecting stochasticity into training, and it is likely to be of fairly wide interest to the community. The theoretical motivations and justifications for the proposed approach are not monumental, but are sufficiently well formulated and justified relative to comparable work. The experiments are well executed, and critical aspects of using this approach are explored in sufficient depth. The exploration of the learning schedule and of the number of samples required to obtain performance gains is an essential aspect of this work and is underscored sufficiently well to provide good value to the community. I generally think the community will like and use these ideas. Table 1 provides a good, fair comparison among ResNet v2 Ours Wx2, Dropout v2 Wx2, and Swapout v2 Wx4. However, an experiment using ResNet v2 Wx4 with dropout would be a good point of comparison in Table 4.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

Reviewer 3

Summary

This paper proposes a new form of the popular dropout technique for regularization of deep neural networks. In particular, this approach develops a dropout strategy for a combination of recent developments in deep learning (residual networks, stochastic depth networks). The algorithm stochastically chooses between dropping out residual connections, feedforward connections, and both. The authors demonstrate that this strategy empirically improves on the aforementioned techniques for convolutional networks applied to CIFAR-10 and CIFAR-100.

Qualitative Assessment

This is a generally well written paper that develops a neat novel idea for regularization of deep networks. I think the presented idea could be impactful in that it may prove to be very useful in practice to many researchers and practitioners. The presented empirical evaluation demonstrates that the approach is quite promising. The idea is, however, not tremendously novel in light of recent work (res-nets and stochastic depth networks), and it would have been nice to see comparison on some more interesting benchmarks.

Small typo in the abstract: "exiting" should be "existing".

I appreciate the additional analysis of the test time statistics of the approach. Why can't you estimate the test time statistics empirically on a validation set? I really appreciate the tidbits on why dropout and swapout interact poorly with batch normalization. It's useful to know that you don't have to average over very many sampled dropouts (swapouts). I think this is a neat additional analysis and rather useful to the community.

Why do the authors first do exactly 196 and then 224 epochs before decaying the learning rate? Normally such specific choices would arouse suspicion, except in this case I expect it doesn't make much difference (e.g. between 196 and a round number like 200). Some additional detail on how the hyperparameters were chosen seems necessary here.

I think it would have significantly improved the paper if experiments were run on more interesting problems. CIFAR-10 seems to be exhausted at this point. The current numbers are beyond the human level error of ~6%, which is rather worrying. Papers that beat this number are either overfitting to the test set or to the original labeler. Either way, it doesn't seem like a particularly useful benchmark anymore. One of the major reasons why res-nets were so well received was that they achieved state-of-the-art results on the much larger and more challenging ImageNet.

A major result of the paper is that we might not need networks as deep as originally thought. However, it doesn't seem surprising (to me at least) that one wouldn't need an especially deep network for the 32x32 images in the CIFAR dataset (whereas this may be a different story for much larger images with more classes).

As far as I can tell, the authors do not actually compare to the state-of-the-art on CIFAR-10. The All-convolutional-net paper achieved a lower error. Could the authors explain why these baselines are not included? The paper states that it compares to 'fair' baselines, but this seems like a somewhat qualified result. For example, as far as I can tell, the standard data-augmentation tricks are applied in this work as well.

There is a lot of extra space available in this paper (see e.g. section 3.2). This leaves lots of space for an additional table of experimental results on e.g. ImageNet.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

Reviewer 4

Summary

This paper proposes a generalization of several stochastic regularization techniques and architectures (dropout, stochastic depth, ResNets) for effectively training deep networks with skip connections. Like stochastic depth, swapout allows for connections that randomly skip layers, which has been shown to give improved performance, perhaps due to shorter paths to the loss layer and the resulting implicit ensemble over architectures with differing depth. However, like dropout, swapout is applied independently to each unit in a layer, allowing for a richer space of sampled architectures. Since accurate approximations of the expectation are not easily attainable due to the skip connections, the authors propose stochastic inference (in which multiple forward passes are averaged during inference) instead of deterministic inference. To assess its effectiveness, the authors evaluate swapout on the CIFAR datasets, showing improvements over various baselines.
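Stochastic inference as described in this summary can be sketched as follows. This is only an illustration under stated assumptions: `forward` is a hypothetical callable standing in for one randomly masked forward pass, and the default of 30 samples is simply the figure mentioned elsewhere in these reviews.

```python
import random

def stochastic_inference(forward, x, num_samples=30, seed=0):
    """Average the outputs of several stochastic forward passes.

    forward     : hypothetical callable (x, rng) -> list of floats,
                  one pass with freshly sampled masks
    num_samples : number of passes to average (must be >= 1)
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    rng = random.Random(seed)
    sums = None
    for _ in range(num_samples):
        out = forward(x, rng)
        # Accumulate element-wise; a new list is built on each step.
        sums = list(out) if sums is None else [s + o for s, o in zip(sums, out)]
    return [s / num_samples for s in sums]
```

Each call to `forward` costs one full pass, so inference cost grows linearly with `num_samples`. Deterministic inference would instead run a single pass with every mask replaced by its expected value.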

Qualitative Assessment

This paper was well written, and the comparisons in Figure 1 clearly demonstrated the distinction between swapout and previous work. However, because it is essentially a generalization of ResNets, stochastic depth, and dropout, it is lacking in novelty. Furthermore, as the authors clearly stated, a major frustration with this area of research is the lack of a clear explanation for empirical successes. Although the authors provided some potential explanations (e.g. theoretical arguments about the stability of stochastic gradient descent, an increased space of sampled architectures included in the implicit ensemble, etc.), they do not provide much new insight. The experimental results do show consistent improvements in CIFAR-10/100 classification performance, but it would have been more convincing to show additional experiments on different datasets like ImageNet. Furthermore, it is not clear how each of their modifications contributed to the experimental improvements. Specifically, the authors suggest that the previously reported poor performance of dropout with residual networks and batch normalization could be explained by poor variance estimates when using deterministic inference, but they do not show any experiments with stochastic inference.
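One way to see how deterministic inference can produce poor variance estimates is a short calculation for the simplest case. This is an illustrative assumption, not the paper's own analysis: a single activation X is multiplied by a mask θ ~ Bernoulli(p) that is independent of X, and deterministic inference replaces θ by its mean p.

```latex
% Stochastic pass (as seen during training): Y = \theta X
\mathbb{E}[Y] = p\,\mathbb{E}[X], \qquad
\mathrm{Var}(Y) = \mathbb{E}[\theta^2]\,\mathbb{E}[X^2] - p^2\,\mathbb{E}[X]^2
               = p\,\mathbb{E}[X^2] - p^2\,\mathbb{E}[X]^2

% Deterministic inference: \hat{Y} = p X
\mathbb{E}[\hat{Y}] = p\,\mathbb{E}[X], \qquad
\mathrm{Var}(\hat{Y}) = p^2\,\mathbb{E}[X^2] - p^2\,\mathbb{E}[X]^2

% The means agree, but the variances differ:
\mathrm{Var}(Y) - \mathrm{Var}(\hat{Y}) = p\,(1 - p)\,\mathbb{E}[X^2] \;\ge\; 0
```

Under these assumptions the means match, but for 0 < p < 1 and nonzero X the deterministic activation has strictly smaller variance than the one seen during training, so a normalization layer whose statistics were accumulated under sampling would be applied to differently scaled inputs at test time.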

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

Reviewer 5

Summary

This paper introduces Swapout, a regularization method that generalizes several stochastic training methods and residual architectures, matching the performance of residual networks on CIFAR-10 and CIFAR-100 with wider, shallower networks. Swapout samples from {0, X, F(X), F(X) + X} at a per-unit level and, at test time, uses MC dropout to get predictions from the ensembled network.

Qualitative Assessment

The experimental results of this method are promising. However, comparisons to residual networks would be fairer if they included computation time. It is worth noting that the MC dropout sampling strategy at test time with 30 samples increases the cost 30x. This cost should be considered in addition to the number of parameters, and it may limit practical applications compared to existing stochastic methods with lower computational time. Additionally, comparisons to methods that scale well with depth (stochastic depth and residual networks) seem incomplete without empirical results from deeper swapout architectures. By dropping out identity connections, the effective length of the identity paths is limited (and thus credit assignment paths are longer), so the optimization benefit of residual networks may be lost for very deep networks; therefore it is not clear whether the performance of very deep swapout networks can match that of the original residual networks. The authors claim that “appropriate choices of theta 1 and theta 2 yield all architectures covered by dropout, all architectures covered by stochastic depth, and block level skip connections.” However, swapout does not seem to behave like stochastic depth: with sampling at a unit level, it is exceedingly unlikely to drop out a whole layer of feedforward units, and thus it may not maintain the benefits of stochastic depth. Some claims about the effects of units per layer could have alternate explanations. Because linear interpolation of theta1 and theta2 in both directions decreased error, the effect of the schedule may have to do not with the number of units per layer, as stated, but rather with the average lengths of the credit assignment paths. The improvement of Linear(1,0.5) over Linear(0.5,1) could also be explained by the same rationale as in the stochastic depth paper: noise in the earlier layers affects all subsequent units.
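The point about unit-level sampling rarely dropping a whole layer can be made concrete with a small calculation. The sketch below assumes the feedforward masks are independent Bernoulli variables with a shared keep probability; the function name and parameters are hypothetical.

```python
def whole_layer_drop_probability(keep_prob, num_units):
    """Probability that every feedforward unit in a layer is dropped.

    Assumes each unit's mask is an independent Bernoulli(keep_prob),
    so all num_units masks are zero with probability
    (1 - keep_prob) ** num_units.
    """
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in [0, 1]")
    if num_units < 1:
        raise ValueError("num_units must be at least 1")
    return (1.0 - keep_prob) ** num_units
```

With a keep probability of 0.5 the probability halves with every additional unit, so for layers with thousands of units it is vanishingly small. Stochastic depth, by contrast, drops an entire block with a single coin flip.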

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

Reviewer 6

Summary

This paper integrates the ideas of dropout and stochastic depth into a newly designed network. The proposed scheme shows improved performance on the public CIFAR benchmark datasets.

Qualitative Assessment

Advantages:
1. The proposed classifier achieves performance comparable to ResNet on the CIFAR dataset.

Questions and Suggestions:
1. The major novelty of the proposed pipeline is the integration of current training approaches, such as dropout and stochastic depth, into the newly proposed framework. The whole scheme can be viewed as an ensemble training method. From this point of view, the novelty is not very strong.
2. More experiments are needed. It would be more convincing to see a performance comparison on another popular benchmark, like ImageNet.
3. The authors claim that swapout is notably and reliably more efficient in its use of parameters. More experiments and analysis are needed to support this claim.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)