NeurIPS 2020

Woodbury Transformations for Deep Generative Flows

Review 1

Summary and Contributions: This paper uses the Woodbury matrix identity and Sylvester's determinant identity to construct low-rank linear flows on high-dimensional spaces, which the authors call Woodbury transformations. The paper empirically compares running time and NLL on CIFAR-10, ImageNet32/64 and CelebA in a Glow-type generative model, comparing Woodbury transformations with 1x1 convolutions and other alternatives from the literature.
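
For reference, the two identities named in this summary can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a single flattened input of dimension n and a bottleneck of size d, and the function names and the shapes of U (n x d) and V (d x n) are illustrative assumptions.

```python
import numpy as np

def woodbury_forward(x, U, V):
    """Forward map y = (I_n + U V) x, with U of shape (n, d) and V of shape (d, n)."""
    return x + U @ (V @ x)

def woodbury_inverse(y, U, V):
    """Inverse via the Woodbury matrix identity:
    (I_n + U V)^{-1} = I_n - U (I_d + V U)^{-1} V.
    Only a d x d linear system is solved."""
    d = U.shape[1]
    small = np.eye(d) + V @ U
    return y - U @ np.linalg.solve(small, V @ y)

def woodbury_logabsdet(U, V):
    """Log-determinant via Sylvester's determinant identity:
    det(I_n + U V) = det(I_d + V U)."""
    d = U.shape[1]
    _, logabsdet = np.linalg.slogdet(np.eye(d) + V @ U)
    return logabsdet

# Illustrative sizes (assumed, not taken from the paper).
rng = np.random.default_rng(0)
n, d = 64, 4
U = 0.1 * rng.standard_normal((n, d))
V = 0.1 * rng.standard_normal((d, n))
x = rng.standard_normal(n)

y = woodbury_forward(x, U, V)
x_rec = woodbury_inverse(y, U, V)
```

Both the inverse and the log-determinant require only d x d operations on top of the low-rank products, which is the source of the efficiency when d is much smaller than n.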

Strengths: The paper provides a simple but effective method to construct tractable linear flows on high-dimensional spaces. The experimental section describes the running-time trade-offs of different methods from the literature and shows improved NLL when the proposed method replaces 1x1 convolutions or other methods. Even though other methods obtain similar NLL, the running-time experiment shows the advantage of using Woodbury transformations in these situations.

Weaknesses: The method is relatively straightforward and some directions could be explored further. For instance, because the transformations are fully connected, typical convolutional weight sharing is not used. It would be good to discuss the downsides of fully connected Woodbury transformations, and possible alternative formulations that would exploit convolutional weight sharing. Further, the models used in the experimental section are quite small. As a result, the NLL performance is not very good compared to newer flow-based models. In addition, the gains in NLL are quite small, and it would be better if the authors included standard deviations over multiple runs. Minor: - If possible, I would advise the authors to include the "changing bottleneck" experiment in the main paper. This experiment relates to the required size of the bottleneck. - For a better overview, it would be nice to have a table showing the complexity of the different methods for their forward/inverse/logdet computations in one place.

Correctness: The paper builds upon two existing identities from the literature and discusses the time complexity of the method. These seem to be correct. On the empirical evaluation the paper could be more precise: please describe in detail how the best-performing model was selected (i.e. run for a fixed number of epochs, or selected based on a validation set). Are the results shown in Table 1 from a single run, or an average over multiple runs?

Clarity: Overall, the paper is well-written. The comments on time-complexity throughout the paper are appreciated.

Relation to Prior Work: In general the relation to prior work is clear. Note that in [1] a somewhat similar approach was used to obtain rank-1 covariance matrices for variational autoencoders. Perhaps this would be a nice addition to the related work. Also, the above-mentioned lack of convolutional weight sharing could be discussed in relation to emerging and periodic convolutions. [1] Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Danilo J. Rezende, Shakir Mohamed, Daan Wierstra.

Reproducibility: Yes

Additional Feedback: ==== After rebuttal ===== I am mostly happy with the rebuttal and I have updated my score to a 7. I strongly advise the authors to take into account the following recommendations and update their paper accordingly: - The authors argue in the rebuttal that squeeze layers also destroy spatial weight sharing and therefore it doesn’t matter that Woodbury transformations do not have spatial weight sharing. Although I agree that squeeze layers do this (only to some degree, though), I would really like to see the authors mention this as a possible limitation of Woodbury transformations. - From the description in the rebuttal, it seems that the authors report the best-performing model by looking at the test performance every T iterations. Although overfitting in generative modelling is by far not as bad as in a task like classification, reporting test performance in this manner is not good practice. Perhaps it would be good to highlight in the paper if this is indeed the case, and note that this is not the best approach to model selection.

Review 2

Summary and Contributions: The paper discusses a new class of invertible transformations for flow-based generative models. The idea is to utilize Woodbury transformations. The authors propose to use channel and spatial transformations to achieve flexible and efficient transformations. Further, they show how the Woodbury matrix identity makes it possible to invert the channel/spatial transformations, and how Sylvester's determinant identity gives the Jacobian determinant. The empirical results show the potential of the presented idea.

Strengths: + The proposed spatial and channel transformations, and their parameterization that allows relatively easy inversion, are interesting. + The way the matrices are expressed allows the use of the Woodbury identity, and thus the transformation is invertible. + The empirical evaluation is sound. + The proposed transformations could be useful as a building block of invertible neural networks and flow-based models. + The memory-efficient version of the proposed transformations reduces the number of parameters while maintaining almost the same performance.

Weaknesses: - I wonder what the total computational complexity is compared to other methods (e.g., emerging convolutions). If I imagine the Woodbury flow working on a mobile device, the number of operations could cause a significant power demand. - Following on from that, I am worried that the total computational complexity is much higher than for other approaches. This could limit the usability of the proposed transformation.

Correctness: I do not find any flaws in the presented methods. Similarly, I find the experiments sound and well performed. The paper is very well written. The organization is correct, the flow is good. All concepts are clearly explained and easy to follow.

Clarity: The paper is very well written. The organization is correct, the flow is good. All concepts are clearly explained and easy to follow.

Relation to Prior Work: The paper explains precisely what the prior work is.

Reproducibility: Yes

Additional Feedback: * The only comment I can think of is about the total computational complexity. Maybe it would be fair to add it in the appendix. ===AFTER THE REBUTTAL=== I would like to thank the authors for their rebuttal. In my opinion the paper is good and solid, and it deserves to be accepted. I was even leaning towards an 8; however, after a vivid discussion with the other reviewers, I decided to keep my score.

Review 3

Summary and Contributions: The authors propose to use the Woodbury matrix identity and Sylvester’s determinant identity to effectively compute the inverse and Jacobian determinant in deep generative flow models. One of the benefits of using these mechanisms is to accelerate model learning. Moreover, a Woodbury transformation can find deeper dependencies between the features because it is able to model correlations along both the channel and spatial axes.

Strengths: This paper makes an empirical contribution.

Weaknesses: The presented empirical results do not confirm a significant advantage of the introduced modifications over existing methods. Detailed comments on the results: 1) In Fig. 4 the results of the Glow method do not match those presented in the original paper. I suppose that both models were not trained for long enough. 2) Fig. 3 shows that 1x1 convolutions win in almost all cases. Based on these results, one can say that the application of Woodbury transformations in flow models does not significantly reduce training time. Moreover, the authors state that "NLL of Woodbury Glow decreases faster". I would say that it obtains lower values but the rate of decrease is similar (Fig. 5). In my opinion, the presented results are insufficient for NeurIPS.

Correctness: I have discussed most of my concerns above.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Post rebuttal ============================================== Thanks for the authors' response. Following the comments of the other reviewers, I agree that the paper is clearly written and the idea is sound. Hence I changed my score to 6. I didn't give a higher rating because the authors took the best-performing model (according to them) and did not compare with the full-capacity models proposed in the original papers.

Review 4

Summary and Contributions: The paper introduces an efficient transformation. The model constructs a dense fully connected layer from a low-rank plus identity matrix, which enjoys the efficient inversion and determinant computation given by the Woodbury matrix identity and Sylvester’s determinant identity. A memory-efficient version of the transformation is also introduced, which factorizes the spatial filtering into two independent so-called Woodbury transformations, one along each axis.

Strengths: The paper introduces an efficient transformation to the library of invertible flows. Empirical evaluations show that the proposed model can slightly improve the bpd performance. Compared with invertible convolutions with a complete kernel size of n, e.g. those used in [17], the proposed Woodbury transformation has 2*n*d parameters, so it can offer more flexibility in general; on the other hand, it is not specifically designed for locally structured inputs such as images, as convolutions are.

Weaknesses: After reading the rebuttal, I have updated my score, but I believe it is necessary that the paper clarify that this model is a simple form of Sylvester flow and compare it with a sequence of planar flows since, roughly speaking, the Woodbury transformation can also be viewed as applying the planar flow sequentially (d times), i.e. y = (I + U V^T)x = (I + u_1 * v_1^T + u_2 * v_2^T + ... + u_d * v_d^T)x, where u_i and v_i are the i^{th} columns of the matrices U and V. The proposed transformation is indeed a low-rank plus identity transformation, which can be seen as a simple form of Sylvester flow, z’ = z + Ah(Bz + b), where b=0 and h(x) = x. The description given for Sylvester flow in Section 4 is inaccurate. It can actually be performed on matrix (tensor) inputs and can also be performed along each axis in the same way as ME-Woodbury. Its inverse is tractable and has an analytical form under some simplifications, e.g. reducing it to the Woodbury flow; otherwise inversion is possible using iterative methods, so it *can* generate samples efficiently. So my main concern is the limited novelty of the proposed model in comparison to the available flows, but I may be persuaded during the rebuttal phase.
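
The algebra in this comment can be illustrated with a short numerical sketch. It only checks the identities as stated in the comment (the low-rank update written as a sum of rank-one terms, and a Sylvester-type map with b = 0 and identity h); the function names and sizes are hypothetical, and it is not the implementation of either paper.

```python
import numpy as np

def low_rank_plus_identity(x, U, V):
    """y = (I + U V^T) x, with U and V both of shape (n, d)."""
    return x + U @ (V.T @ x)

def sum_of_rank_one_updates(x, U, V):
    """y = x + sum_i u_i (v_i^T x), using the columns u_i, v_i of U, V."""
    y = x.copy()
    for i in range(U.shape[1]):
        y = y + U[:, i] * (V[:, i] @ x)
    return y

def sylvester_type_map(z, A, B, b, h):
    """z' = z + A h(B z + b), the general form quoted in the review."""
    return z + A @ h(B @ z + b)

# Illustrative sizes (assumed).
rng = np.random.default_rng(0)
n, d = 32, 3
U = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
x = rng.standard_normal(n)

y_low_rank = low_rank_plus_identity(x, U, V)
y_rank_one = sum_of_rank_one_updates(x, U, V)
# Special case b = 0, h = identity, A = U, B = V^T.
y_sylvester = sylvester_type_map(x, U, V.T, np.zeros(d), lambda t: t)
```

All three outputs agree up to floating-point error, which is the sense in which the low-rank plus identity map is a special case of the quoted Sylvester-type form.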

Correctness: Throughout the paper, invertible convolutions are claimed to be computationally inefficient, and the computational cost of the circular (periodic) convolution is stated inaccurately. To be more accurate, for 2-D inputs of size N1*N2, the Fourier transform can be computed in O(N1*N2(logN1 + logN2)) time. For more detail please refer to [17] or Rafael C Gonzalez and Richard E Woods. Digital image processing, 1992. Interestingly, the computational complexity of the Woodbury flow for an input of size 256*256 with d=16 is of the same order as that of the 2-D convolution, since (logN1 + logN2) = 16 with base-2 logarithms. The Woodbury flow can offer more flexibility on some datasets, but it does not take into account the local structure that a convolution can capture in 2-D images. The running-time experiments in Section 5 do not seem to be a fair comparison between the different methods, as the models don't have similar numbers of parameters. For example, an emerging convolution with a 3*3 kernel is not comparable with a Woodbury transformation with n*d parameters. Also, it is expected that ME-Woodbury is faster than Woodbury, as it replaces heavier matrix products of size h*w*d with two lighter products, one along each axis, of sizes h*d+w*d. Furthermore, as mentioned above, the periodic (circular) convolution is also computationally efficient, and its running time depends on employing a fast and parallelizable implementation of the 2-D FFT. The results in Table 1 show that the proposed flows can slightly enhance performance compared to the baselines. The shortcoming is that small versions of all the models are trained, rather than the full-capacity models proposed in the original papers, and hence the results are very far from those in the original papers. So, to support your conclusion that Woodbury transformations will achieve new SOTA results, it is also important to reproduce the models at full capacity and compare them with your model at similar complexity.
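
The order-of-magnitude comparison in this comment can be reproduced with simple arithmetic. This sketch assumes base-2 logarithms, ignores constant factors, and assumes the low-rank cost scales as N1*N2*d for a single channel; that scaling is a simplification made here for illustration, not a figure taken from the paper.

```python
import math

def fft2_cost(n1, n2):
    """Operation-count estimate N1*N2*(log2 N1 + log2 N2) for a 2-D FFT."""
    return n1 * n2 * (math.log2(n1) + math.log2(n2))

def low_rank_cost(n1, n2, d):
    """Assumed estimate n*d for a rank-d product on n = N1*N2 inputs."""
    return n1 * n2 * d

n1 = n2 = 256
d = 16

log_term = math.log2(n1) + math.log2(n2)  # 8 + 8 = 16
fft_ops = fft2_cost(n1, n2)               # 65536 * 16
woodbury_ops = low_rank_cost(n1, n2, d)   # 65536 * 16
```

Under these assumptions the two estimates coincide for a 256*256 input with d=16, matching the reviewer's remark that the costs are of the same order.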

Clarity: The presentation is good but can be improved. Here are some details and possible typos: Vertical space is needed after the caption of Figure 1 to separate it from the text. Lines 172-176 are repeated. Line 180: the correct Sylvester flow has an extra identity mapping (skip connection), i.e. z_{t+1}= z_{t} + … . Line 185: the statement “Sylvester flows are inverse functions” is not clear.

Relation to Prior Work: The paper provides a good review of available normalizing flows, but it also needs to re-examine the main advantage of the proposed model over Sylvester flow.

Reproducibility: Yes

Additional Feedback: