Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Update 2019-08-07: The authors' rebuttal shared more convincing ImageNet results. It did not address my other concerns as thoroughly. Nonetheless, I raised my score from 6 to 7. ************************** Some general comments: The paper is well written and easy to read. The introduction clearly states the objective and the context. The text and the two pseudo-code algorithms clearly explain how the method works and how it differs from previous methods. I did not feel the need to ask the authors to share their code. The CIFAR-10 experiments are solid and involve rigorous tuning of the baseline hyperparameters. The ImageNet experiments seem preliminary but are rather promising. Other comments: A) Though the proposed method is more straightforward than previous methods, it is unclear whether it is easier to tune. Some of the previous methods only required tuning Adam's alpha; the latent weights were initialized and clipped using the Glorot et al. or He et al. initialization methods. In comparison, the proposed method, Bop, requires tuning two hyperparameters: an adaptivity rate and a global threshold. It would be nice if the method did not require tuning a global threshold (though this is not a big deal). Besides, it seems a bit awkward that there is a single global threshold for the whole network. Intuitively, I feel that wide layers may require a smaller threshold than narrow layers. B) I agree with the authors that it would be nice if their method used the second moment of the gradient (like Adam). C) The method seems very dependent on batch normalization to maintain unit variance across the units. Maybe the binarized weights could be scaled (after the sign function) so that the method can work without batch normalization? D) The method currently only works with 1-bit weights. It would be nice if it could also work with ternary, 2-bit, or arbitrary-bit-width weights. It would be even nicer if the infinite-bit limit of the method were equivalent to SGD or Adam.
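To make comment C concrete, here is a minimal sketch of one possible scaling. The review does not say which scale to use; the fixed factor 1/sqrt(fan_in), the function name `preactivation_variance`, and the assumption of independent unit-variance inputs are all illustrative choices and are not taken from the paper. The sketch only estimates the variance of a single pre-activation with and without the scale.

```python
import math
import random


def preactivation_variance(fan_in, scale, n_samples=2000, seed=0):
    """Estimate Var[scale * sum_i w_i * x_i] for fixed binary weights.

    Assumptions (illustrative only): w_i is drawn once from {-1, +1},
    and the inputs x_i are independent standard normal variables.
    """
    rng = random.Random(seed)
    # Binarized weights, i.e. what remains after the sign function.
    w = [rng.choice((-1.0, 1.0)) for _ in range(fan_in)]
    outputs = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
        outputs.append(scale * sum(w_i * x_i for w_i, x_i in zip(w, x)))
    mean = sum(outputs) / n_samples
    return sum((o - mean) ** 2 for o in outputs) / n_samples


fan_in = 64
# Without scaling, the variance grows roughly linearly with fan_in.
raw_var = preactivation_variance(fan_in, scale=1.0)
# With the hypothetical 1/sqrt(fan_in) scale, it stays close to one.
scaled_var = preactivation_variance(fan_in, scale=1.0 / math.sqrt(fan_in))
print(raw_var, scaled_var)
```

Under these assumptions a fixed per-layer scale keeps the pre-activation variance near one at initialization. Whether that is enough to train without batch normalization is an open question that this sketch does not address.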
As opposed to updating the real-valued network as usual and quantizing the learned network for the forward pass, the inertia concept is useful because it can explain the behavior of the gradient-descent-based updates. It is natural to interpret a large magnitude in the corresponding real-valued network as "confidence" in the sign of that parameter, which is what the BNN uses as a binary variable. I agree with this view. What I wonder about, though, is that it is already known that the gradient for a particular weight is determined not by its current value but by the input values and the backpropagated error associated with it (the product of the two). In that regard, it is obvious that some kind of momentum term will help speed up convergence, as various momentum-based optimization techniques have already shown. The proposed algorithm, Bop, sounds to me like one of those variants, while the authors claim that it is the first optimizer designed specifically for BNNs. More specifically, the update rule in eq. (5) looks familiar to me, as it is equivalent to the accumulated gradients in the original definition of momentum. The slight difference is that in the proposed algorithm the accumulated gradient replaces the weight magnitude, while in the original momentum method it replaces the gradient in the update. I therefore wonder what the main difference is between the proposed method and a regular momentum-based approach. Meanwhile, the experimental results show only marginal improvement.
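The comparison raised in this review can be sketched side by side. The review does not reproduce eq. (5), so the Bop step below is a reading of the method as described in the reviews (an exponential moving average of gradients with an adaptivity rate, plus a global threshold that decides sign flips). The names `bop_step`, `momentum_step`, `gamma`, `tau`, `lr`, and `beta` are assumptions made for illustration, not the authors' code.

```python
def bop_step(w, m, grad, gamma, tau):
    """One Bop-style update on binary weights w in {-1, +1}.

    m     : exponential moving average of past gradients
    gamma : adaptivity rate (assumed name)
    tau   : global flip threshold (assumed name)
    """
    new_w, new_m = [], []
    for w_i, m_i, g_i in zip(w, m, grad):
        m_i = (1.0 - gamma) * m_i + gamma * g_i
        # Flip only if the averaged gradient is strong enough and has the
        # same sign as the weight, i.e. flipping would reduce the loss.
        if abs(m_i) > tau and (m_i > 0) == (w_i > 0):
            w_i = -w_i
        new_w.append(w_i)
        new_m.append(m_i)
    return new_w, new_m


def momentum_step(w, v, grad, lr, beta):
    """Classical momentum SGD on real-valued weights, for comparison."""
    new_v = [beta * v_i + g_i for v_i, g_i in zip(v, grad)]
    new_w = [w_i - lr * v_i for w_i, v_i in zip(w, new_v)]
    return new_w, new_v


# Bop: the state m never becomes a weight; it only triggers sign flips.
w_bin, m = bop_step([1.0, -1.0, 1.0], [0.0, 0.0, 0.0],
                    [0.5, 0.5, 1e-9], gamma=0.1, tau=0.01)
print(w_bin)  # weights remain in {-1, +1}

# Momentum: the accumulated gradient moves a real-valued weight directly.
w_real, v = momentum_step([0.3, -0.2], [0.0, 0.0], [0.5, 0.5],
                          lr=0.1, beta=0.9)
print(w_real)
```

In this sketch both methods accumulate gradients in the same way up to a constant factor, which matches the reviewer's observation. The difference is in how the accumulator is used: momentum SGD adds it to a stored real-valued weight, whereas the Bop-style step keeps no real-valued weight and compares the accumulator against a threshold to decide a discrete flip.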
This paper addresses optimization for BNNs and provides a novel latent-free optimizer, which challenges the existing practice of using latent weights. This is an interesting and original idea. Specifically, whereas one common way to view BNN training is to treat the binary weights as an approximation to a real-valued weight vector, this paper argues that the latent weights used in previous methods are in fact not weights. The paper supports this argument by introducing the concept of inertia. Motivated by this insight, a novel optimizer called Bop is introduced. Compared with existing latent-weight-based optimizers, Bop requires less memory. One of my previous concerns was the relationship to, and understanding of, latent-weight methods from the perspective of Bop. The authors' response gave a clear explanation by drawing similarities between momentum and the threshold in mitigating noisy flipping behavior. I agree with this viewpoint. The experimental results on the CIFAR-10 and ImageNet datasets demonstrate that Bop achieves results comparable to the baseline optimizers. The original submission showed that Bop gives slightly worse results than the baseline on ImageNet with 50 epochs. However, in their response the authors added results for BiReal-Net trained from scratch on ImageNet using Bop for 200 epochs, which are competitive with previously reported methods. I think this result is promising. Overall, this paper provides a novel idea of training BNNs without latent weights, which is interesting and insightful and might open a new way to train BNNs. The experimental results of the proposed Bop optimizer are comparable to those of existing methods (the experimental details need to be improved). Future directions for improving Bop are also given. ------------------------------------ Update 2019-08-07: 1. The authors' rebuttal adds results for BiReal-Net trained from scratch on ImageNet using Bop for 200 epochs, which are competitive with (slightly better than) previously reported methods. I think this result is promising. 2. In the rebuttal, the authors explain the connections between the proposed Bop and traditional latent-weight methods, which further helps in understanding Bop from the perspective of momentum. This is interesting, but I think some further understanding is still needed. I keep my previous score, which is still 7.