NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
POST-REBUTTAL: the authors have answered my questions satisfactorily, hence I am increasing my score to 7.

I enjoyed reading this clearly written piece. It is a bit of a misnomer to call this method "batch normalization", since it does not normalize the "spread" but only the "center". The authors justify this by describing distributions over SPD matrices (Sec. 3.2). (Side question: what is the relationship between eq. (10) and the Wishart distribution? Are they the same, but one defined with the Riemannian metric? The Wishart is also a maximum entropy distribution over SPD matrices.) However, a maximum entropy distribution depends not only on the space but also on which sufficient statistics one chooses, and I do not see an inherent reason why a variance-like quantity should be ignored. I suggest changing the name of the method to "batch centering" to avoid confusion.

Since traditional BN allows higher learning rates, I wish to see learning curves with this method. How does this method compare (final performance and convergence speed) to a naive BN followed by projection onto the closest point on the manifold (sketched below)? (Perhaps this is a stupid question, but please explain.) Also, the computational complexity of this method for the forward and backward passes is not obvious. Can you include it?
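For concreteness, here is a minimal sketch of the naive baseline referred to above: ordinary Euclidean batch normalization on the matrix entries, followed by a projection of each output back onto the SPD cone (symmetrize, then floor the eigenvalues). This is only an illustration of the question, not anything from the paper; the function name and the eps/min_eig values are arbitrary.

import numpy as np

def naive_bn_then_project(X, eps=1e-5, min_eig=1e-5):
    # X: array of shape (batch, n, n), each slice assumed SPD.
    # Step 1: plain entrywise batch normalization over the batch dimension.
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    X_norm = (X - mean) / np.sqrt(var + eps)
    # Step 2: project each result back to a nearby SPD matrix:
    # symmetrize, eigendecompose (O(n^3) per matrix), floor the eigenvalues.
    out = np.empty_like(X_norm)
    for i, M in enumerate(X_norm):
        S = 0.5 * (M + M.T)
        w, V = np.linalg.eigh(S)
        w = np.clip(w, min_eig, None)
        out[i] = (V * w) @ V.T
    return out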
Reviewer 2
Originality: this paper contains an original contribution, tightly based on the SPD networks of [22], where the hidden representations are symmetric positive definite (SPD) matrices. The originality lies in adapting batch normalization, which is usually performed on real vectors, to SPD representations. Note that in their batch normalization the variance is not normalized; the authors may consider using a different term (e.g. batch centering) because of this.

Quality: the proposed method is particularly useful for SPD networks, and could be useful in other networks with SPD hidden representations. The authors should state explicitly the complexity of computing the exponential map and its inverse (see the sketch below). How much computational overhead is involved in applying your batch normalization?

Clarity: the writing is satisfactory and the algorithms are clearly presented, but the math formulae can be greatly improved. In particular, there should be a small section at the beginning explaining the Riemannian geometry of SPD matrices.

Significance: this is a deep learning paper based on Riemannian geometry. It can be useful in the particular deep learning architecture of SPD networks [22], and it may also be interesting from an application perspective (applying SPDNet to EEG/MRI datasets). From the mathematical standpoint the novelty is limited. Overall, I feel that the potential significance is average because of these limitations.

Minor comments:
- The authors could cite related references on the matrix Bregman divergence, the Bures distance in deep learning, and the log-det divergence. Note that for a Bregman divergence one can have a notion of variance.
- Eq. (1): explain what "log" is (use \log instead of log). Is it the matrix logarithm?
- Line 39: it is not clear what "each layer processes a point on the SPD manifold" means.
- The paper seems to have been finished in a rush; Tables 1 and 3 are poorly formatted.
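Regarding the complexity question above: assuming the exponential map and its inverse are computed the standard way, they boil down to matrix exponentials and logarithms of symmetric matrices (exactly, or up to congruence by the base point), so each application is dominated by one symmetric eigendecomposition, i.e. O(n^3) for an n x n matrix. A minimal sketch, with illustrative function names not taken from the paper:

import numpy as np

def spd_log(X):
    # Matrix logarithm of an SPD matrix: X = V diag(w) V^T  ->  V diag(log w) V^T.
    # Cost dominated by one symmetric eigendecomposition, i.e. O(n^3).
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def spd_exp(S):
    # Matrix exponential of a symmetric matrix; the inverse map of spd_log.
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T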
Reviewer 3
The paper is well written and very comprehensible. The necessary basics are explained, and the experimental results give evidence that the proposed method increases classification performance. The experiments seem to have been done very thoroughly. While the results on the NATO radar dataset are not very outstanding, the method seems to outperform the baseline SPDNet on the AFW and the HDM05 datasets.

Overall, good work with some minor errors:
- line 24: symmetric positive definite (no capitalization)
- throughout the paper: math operators (like log, exp, argmin, ...) can be defined using \DeclareMathOperator{\argmin}{argmin} (see the example after this list)
- algorithm 1: batch and normalized (no capitalization)
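For reference, the suggested preamble declarations could look like the following (\DeclareMathOperator requires amsmath; the operator names beyond \argmin are only illustrative):

\usepackage{amsmath}
\DeclareMathOperator{\argmin}{argmin}  % or \DeclareMathOperator* to place subscripts underneath
\DeclareMathOperator{\logm}{log}       % e.g. a matrix logarithm typeset upright
\DeclareMathOperator{\expm}{exp}       % e.g. a matrix exponential typeset upright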