Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
=Statistical Strength=

Throughout the paper, you refer to the concept of 'statistical strength' without describing what it actually means. I expect it means that if two things are correlated, you can estimate their properties with better sample efficiency by taking this correlation into account, since you are effectively getting more data. Given that two features are correlated, optimization will be improved if you do some sort of preconditioning that accounts for this structure. In other words, given that features are correlated, you want to 'share statistical strength.'

However, it is less clear to me why you would want to regularize the model such that things become correlated/anti-correlated. Wouldn't you want features to be uncorrelated (covariance equal to the identity)? Rather than having correlated, redundant features, why not simply reduce the number of features, since this is a tunable hyperparameter? I can imagine an argument that it is better to have a model with many features whose redundancy is carefully controlled than one with fewer features (fewer activations in intermediate layers). This argument would be consistent with recent work on overparameterization and implicit regularization. However, you would need to test it in your experiments.

=Line 46=

I don't understand this statement, and I expect other readers will not either. Why set priors that seek to mimic the behavior of the prior-less learning method? Why would this lead to better generalization?

=Effect of batch size=

I was concerned by how large the effect of batch size is in Figure 2. The difference between 2B and 2A is considerable, for example. This suggests that there is considerable implicit regularization from using different batch sizes, and that this effect may be substantially bigger than the effect of Adareg.
In particular, the difference between 2B and 2A for a given method seems to be bigger than the difference between methods within either 2A or 2B. I know there has been work on the connection between mini-batch SGD, natural gradient, and posterior inference (https://arxiv.org/abs/1806.09597); I am not up to date with this literature, and I expect there is follow-on work. It is important to comment on this.

In general, assessing the impact of regularization techniques is difficult because there are so many ways to regularize. You could also do early stopping, for example. The important regularization technique that I wish you had discussed more is simply using a model with less capacity (fewer hidden activations). See my comment on 'statistical strength.'

=Minor comments=

The claims in the first sentences of the abstract are unnecessarily general. Why make broad statements about 'most previous works' or explain how all neural networks are designed? You repeatedly refer to things being 'diverse' in the intro, but do not sufficiently explain or motivate what that means.
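To make the 'statistical strength' point above concrete, here is a toy sketch (my own illustration, not from the paper) of why accounting for feature correlation via preconditioning speeds up optimization; the data, learning rate, and step count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated features -> an ill-conditioned curvature matrix.
n = 1000
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)  # nearly redundant copy of x1
X = np.stack([x1, x2], axis=1)
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

H = X.T @ X / n   # curvature (Hessian) of the least-squares loss
g0 = X.T @ y / n  # linear term of the gradient

def final_error(precondition, steps=50, lr=0.1):
    """Gradient descent on the least-squares loss; optionally
    precondition with inv(H), which accounts for the correlation."""
    w = np.zeros(2)
    P = np.linalg.inv(H) if precondition else np.eye(2)
    for _ in range(steps):
        grad = H @ w - g0
        w = w - lr * (P @ grad)
    return float(np.linalg.norm(w - w_true))

print("plain GD error:      ", final_error(precondition=False))
print("preconditioned error:", final_error(precondition=True))
```

With the correlation ignored, gradient descent barely moves along the near-redundant direction; the preconditioned run converges in the same step budget.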
The paper proposes to fit a matrix-variate normal prior with a Kronecker-structured covariance matrix as an adaptive, data-dependent regularization technique. The prior also encourages neurons to be statistically correlated with each other. An efficient block coordinate descent algorithm (alternating between hyperparameters and weights) with analytical solutions is proposed to optimize the framework. Empirical results show that the proposed method can outperform weight decay and DeCov in terms of generalization performance. Furthermore, detailed analyses such as spectral norm and correlation are provided.

The paper provides a solid justification of the mechanism by which the matrix-variate normal prior regularizes the weights, taking into account the local curvature. I am convinced that the proposed method can be used as a strong regularization technique. The connection between the prior and weight regularization is analogous to that between l2 regularization and a Gaussian weight prior.

My main complaint is the scale of the experiments, though the paper focuses on small datasets. I expect a newly proposed regularization technique to be tested with deeper networks, because regularization can be more effective when added to a more over-parameterized network. Also, at line 205, the paper says the regularization is placed only on the last softmax layer. I wonder whether the baseline approaches, such as weight decay, are also placed only on the last softmax layer (the provided code only has the details of the proposed method). If so, how does the proposed regularization compare to weight decay or l2 regularization + BN applied to all layers? I believe this (l2 + BN) is the more common setup.

The paper doesn't discuss the computational overhead over weight decay in detail. I am convinced that it is efficient in the EB framework (and only applied to the last softmax layer), but it should still be slower than weight decay. It would be better to have a forward-backward time comparison on a larger network.
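For reference, the prior-to-regularizer correspondence I mention above is the standard matrix-variate normal identity: the negative log-prior is a quadratic penalty, and identity row/column covariances recover plain weight decay. A quick numerical sanity check (my own sketch; the shapes, U, V, and lambda are arbitrary placeholder values, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 3, 4
W = rng.normal(size=(d_in, d_out))

def random_spd(d):
    """Random symmetric positive-definite matrix (placeholder covariance)."""
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)

U = random_spd(d_in)   # row (input-side) covariance
V = random_spd(d_out)  # column (output-side) covariance

# Negative log-density of MN(0, U, V), up to additive constants:
penalty = 0.5 * np.trace(np.linalg.inv(V) @ W.T @ np.linalg.inv(U) @ W)

# Equivalent vectorized form: vec(W) ~ N(0, V kron U).
w = W.flatten(order="F")  # vec() stacks columns
penalty_vec = 0.5 * w @ np.linalg.inv(np.kron(V, U)) @ w
assert np.allclose(penalty, penalty_vec)

# With U = I and V = (1/lambda) I the penalty reduces to weight decay:
lam = 0.1
wd = 0.5 * np.trace(np.linalg.inv(np.eye(d_out) / lam) @ W.T @ W)
assert np.allclose(wd, 0.5 * lam * np.linalg.norm(W, "fro") ** 2)
```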
A minor issue is that the paper doesn't discuss [McInerney, 2017] in the related work. AEB is not a novel approach, and related work should be discussed in more detail in the revision. Also, the proposed method has limited novelty given that a Kronecker-structured covariance matrix as the posterior has already been proposed in [Zhang et al., 2018]. Another minor issue is that the idea that neurons are designed to be statistically related seems to conflict with Dropout, which is designed to prevent co-adaptation of neurons. Can the authors elaborate on this conflict in the related work?

[McInerney, 2017]: McInerney, J. (2017). An Empirical Bayes Approach to Optimizing Machine Learning Algorithms. NIPS 2017.
[Zhang et al., 2018]: Zhang, G., Sun, S., Duvenaud, D.K., & Grosse, R.B. (2018). Noisy Natural Gradient as Variational Inference. ICML 2018.
This paper introduces a new learning algorithm that trains neural networks with adaptive regularization. The authors show that the weights in each layer are correlated and then define a prior distribution over the weights to reflect this correlation (essentially reducing the search space). The authors propose a block coordinate descent algorithm to optimize the hyperparameters of this distribution and the weights of the network, and show that the proposed method outperforms existing regularizers.

Overall, this paper is well written, novel, and introduces a very practical way to adaptively regularize a neural network. The following are some minor problems:
1- The authors do not study or discuss the running time of their algorithm compared to the baselines.
2- Figure 3 shows that the proposed algorithm oscillates and is not well behaved. It would be nice if the authors could comment on this behavior.
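For what it's worth, the kind of alternating scheme described above can be sketched in a few lines for the simple case of a scalar Gaussian prior (my own illustration with made-up data; the paper's actual closed-form updates for the matrix-variate prior are more involved):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)
sigma2 = 0.25  # noise variance, assumed known for this sketch

tau2 = 1.0  # prior variance hyperparameter, initialized arbitrarily
for _ in range(20):
    # Block 1: weights given the hyperparameter -- MAP estimate,
    # which for a Gaussian prior is the ridge-regression solution.
    lam = sigma2 / tau2
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    # Block 2: hyperparameter given the weights -- closed-form
    # maximizer of the Gaussian prior likelihood over tau2.
    tau2 = float(np.mean(w ** 2)) + 1e-8

print("recovered-weight error:", np.linalg.norm(w - w_true))
```

Both blocks have analytical solutions, which is what makes this style of block coordinate descent cheap; a running-time comparison against the baselines would still be needed for the authors' matrix-variate version.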