Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
* First, for a nonsymmetric matrix, the eigenvalues are not related to the SVD by a variational principle. This is a point of confusion throughout the paper. In line 57, you start with "given an input matrix M". Listing the SVD / QR as a method to compute the eigendecomposition (ED) is not correct here. Later you clarify that M is considered to be a covariance matrix. You should clarify at the very beginning that the scope of your method is restricted to symmetric matrices!
* It is not clear to me why it is preferable to use power iterations for approximating the gradients. You say in line 28 that the power iteration method is not numerically stable if two eigenvalues are close. So what is the advantage of using power iterations compared to using the analytic solution for the gradients?
* Well, you introduce some tricks like ridge regularization later in order to stabilize your computations. But what if you set your epsilon in line 117 to 0? I assume you would not gain much...
* That said, your computational experiments are a bit biased, in the sense that you have introduced a hidden regularization parameter in your network, namely epsilon. I assume that setting epsilon to 10^-4 has quite some effect. What happens if you set epsilon to 10^-12? Further, you compare your method, which uses several practical tricks (Sec 2.5), to a plain implementation of the SVD. This comparison does not seem fair, since you can, for instance, also truncate the ordinary SVD. Thus, I think that a fair comparison requires the use of the same practical tricks.
* I am not very impressed by your results. I do not see much benefit of using ZCA or PCA denoising here, in particular not for CIFAR10. Your results for CIFAR100 are somewhat far away from current state-of-the-art performance (correct me if I am wrong here!). Further, it would be nice to compare results to [r1].
* Is PCA denoising really a new normalizing strategy for deep networks?
* Overall, the quality of the writing is good.
Smaller comments:
* I doubt that the eigendecomposition is widely used in deep networks, because there are so many numerical issues. You may want to modify the abstract accordingly.
* You should introduce L in Eq. (3).
* Why is Sigma in Eq. (5) not bold?
[r1] Iterative Normalization: Beyond Standardization towards Efficient Whitening. CVPR 2019.
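The first point of this review, that eigenvalues and singular values coincide only in the symmetric (covariance) case, can be checked numerically. The following is a minimal NumPy sketch and not part of the review or the paper; the matrices `M` and `A` are hypothetical values chosen only for illustration.

```python
import numpy as np

# Nonsymmetric matrix (hypothetical example values).
M = np.array([[2.0, 3.0],
              [0.0, 1.0]])
# Eigenvalue magnitudes, sorted in descending order.
eig_M = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
# Singular values (NumPy returns them in descending order).
sv_M = np.linalg.svd(M, compute_uv=False)

# Symmetric positive semi-definite matrix, built like an (unnormalized) covariance.
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])
C = A.T @ A
eig_C = np.sort(np.linalg.eigvalsh(C))[::-1]
sv_C = np.linalg.svd(C, compute_uv=False)

print("nonsymmetric: eigenvalues", eig_M, "singular values", sv_M)
print("symmetric PSD: eigenvalues", eig_C, "singular values", sv_C)
```

For `M` the two spectra differ, although their products agree because both equal |det M|. For `C` they match, which is why an SVD routine can stand in for the eigendecomposition only when the input is restricted to symmetric positive semi-definite matrices.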
The paper proposes a numerically stable and backpropagation-compatible eigendecomposition for deep neural networks. It addresses the instability issue in analytic derivatives, as well as the convergence issue for power iterations. The authors give the theoretical justification behind their approach and show numerical evidence for the choice of parameters. The algorithm is robust when applied in ZCA and PCA denoising, which marginally improves the performance of deep networks (ResNet18/50) on CIFAR-10/100. The paper is sound, though I am not familiar enough with the area to comment on the originality. The algorithm should be useful in practice for handling the corner cases.
Major comments:
1. The authors mention a few times in the paper that the method is intended for large matrices, yet most matrices in the experiments are relatively small. I am curious whether the failure cases become more or less common when the matrices become large enough.
Minor comments:
1. The improvement in the experiments seems rather small. Most of the time, the convergence behavior looks very similar whether using the existing methods or the proposed one.
Post author feedback: I thank the authors for addressing my questions. I think the problem this paper is trying to tackle is important. Unfortunately, I am still not convinced that using a power series approximation to the derivative when the eigendecomposition is not differentiable is the best approach in this case, or that it leads to meaningful improvement in applications. As I remain mostly neutral on this paper, I have decided to leave my score unchanged.
The basic idea is very simple: for a PCA layer in a DNN, both SVD and power iteration methods have drawbacks. To overcome this, we run SVD in the forward pass and power iteration in the backward pass. This leads to an "improper" backprop, but the results are close enough (people do this all the time, for example in batch renorm) or the errors introduced are good enough that it doesn't matter.
I think that originality is somewhat limited in this paper (i.e., it combines two well-known elements in a well-known way). But I do think that the superiority of this approach to previous ones speaks to its value.
The experimental evaluation of the main method is sound and convincing. The crucial experiment for me is Table 2, which shows that d = 64 basically works for this method, whereas previous methods could only really handle d = 4. A very natural question here is whether we even have to do blocks of d. On the other hand, I'm not totally convinced about Sec 3.2 and the PCA denoising: realistically, it doesn't look any better than batch normalization. I think perhaps trying this on non-residual networks might give better results, or comparing to vanilla networks without BN.
The paper is well written. Even in the face of quite a lot of mathematics, the paper is clear and a pleasure to read. I found only one typo (line 106 'backpropogation'). This is a high quality submission, and the authors have obviously put effort into the writing: thank you!
The significance of this paper is hurt a little by the niche nature of PCA layers, but as above, I think that this paper could be the basis of a lot of new ideas, so overall I think significance is high. (It's not clear to me that citing Eigenfaces as an application of PCA layers to deep learning is appropriate.)
POST AUTHOR FEEDBACK: Thank you for the clarifications. My critiques were quite minor before, so nothing significant has changed in the score; I am leaving it as is.
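The forward/backward split summarized at the start of this review can be illustrated with a small NumPy sketch. This is not the authors' implementation: the helper `power_iteration`, the test matrix `C`, its spectrum, and the iteration counts are all assumptions made for illustration, and the gradient computation through the power iteration steps (which an autodiff framework would provide) is omitted.

```python
import numpy as np

def power_iteration(C, v0, num_iters):
    """Approximate the dominant eigenvector of a symmetric PSD matrix C."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(num_iters):
        v = C @ v
        v = v / np.linalg.norm(v)
    return v

rng = np.random.default_rng(0)

# Symmetric PSD matrix with a known, well separated spectrum (hypothetical values).
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
C = Q @ np.diag([4.0, 2.0, 1.0, 0.5]) @ Q.T
C = (C + C.T) / 2  # remove round-off asymmetry

# "Forward pass": exact decomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
v_exact = eigvecs[:, -1]

# Power iteration from a random start needs many steps to converge.
v_random = power_iteration(C, rng.standard_normal(4), num_iters=100)

# Started from the exact eigenvector, power iteration stays at that fixed point,
# so a few steps suffice; this is the warm start the hybrid scheme relies on.
v_warm = power_iteration(C, v_exact, num_iters=3)

# |cosine| between the vectors; the sign of an eigenvector is arbitrary.
align_random = abs(v_exact @ v_random)
align_warm = abs(v_exact @ v_warm)
print(align_random, align_warm)
```

Both alignments are close to 1 here because the leading eigenvalues are well separated. When the top two eigenvalues are nearly equal, the random start converges slowly, which is one motivation for initializing from the exact decomposition.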