__ Summary and Contributions__: The paper proposes to use wavelets in normalizing flows. The authors exploit the natural parallelism stemming from this wavelet decomposition to scale the training of the model to 1024x1024 images and to train faster. The paper also uses an MCMC algorithm (NUTS) to sample from an annealed distribution.

__ Strengths__: The motivation and explanation of the technique are clear. The results are indeed competitive in terms of log-likelihood, with accelerated training (a claimed 15x speedup). The paper also provides an additional motivation for annealed sampling in terms of distribution matching.

__ Weaknesses__: The global coherence of samples seems to be negatively impacted by this technique, which does not seem to be mentioned, addressed, or studied in the paper. This is most visible in areas that should be sharp but are instead blurred by the super-resolution stages. I personally suspect that this stems from the super-resolution stages being convolutional and not using global information.
-- Authors' rebuttal acknowledged --
While the technique and initial experiments are interesting, a deeper study is lacking on the point I've mentioned. I maintain my score.

__ Correctness__: The claim and the methodology are correct.

__ Clarity__: The paper is clearly written.

__ Relation to Prior Work__: The prior work is discussed adequately and the paper explains how it differs from prior work. A relevant paper to cite would also be "Normalizing Flows with Multi-Scale Autoregressive Priors" by Shweta Mahajan, Apratim Bhattacharyya, Mario Fritz, Bernt Schiele, and Stefan Roth.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper introduces Wavelet Flows, a normalising flow architecture for high resolution signals. It uses the wavelet transform to decompose an input signal into its wavelet coefficients (I_0, D_0, D_1, …, D_r) and models them sequentially using separate conditional models that produce the detail coefficients conditioned on the previous-resolution image. It thus simultaneously produces a model for all resolutions, and can also do resolution upscaling.
They show that such an architecture is competitive with previous normalising flows like Glow for images on standard benchmarks while being faster to train, and also scales to high resolution.
== Post rebuttal update ==
The authors in their rebuttal show convincing evidence of quantitatively better performance than previous methods. At the same time, the blurriness of the samples is not yet addressed, which matters if the high-resolution modelling is to be useful; architectural changes as suggested by Reviewer 1, or removal of the patch-wise training, would be useful ablations here to understand more clearly what causes the blurriness. I also agree with Reviewer 3's suggestion that exploring the choice of wavelets would be a natural experiment to include.
The quantitative results, especially the training speedup, seem strong enough though that I'd be ok to lean towards acceptance, and hence I revise my score from 5 to 6.

__ Strengths__: The choice of the wavelet decomposition for use in normalising flows is novel, and they obtain a good super-resolution model for free. They also train quite efficiently compared to previous flows. For low temperature sampling, past work could only successfully use additive coupling blocks, but they show a way to get lower temperature samples from affine flow blocks too using MCMC. They obtain competitive log-prob performance on ImageNet and LSUN benchmarks at lower compute, and have diverse samples on CelebA-HQ compared to Glow. The experiment studying the effect of changing temperature to correctly match the entropy of the detail coefficients is good.

__ Weaknesses__: - While the aim is to use the wavelet transform to get high-resolution images, the samples (e.g. Figure 2) qualitatively look quite blurry compared to the same 8-bit samples in past normalising flow papers like RealNVP/Glow/Flow++. It could be because they used less compute, but the likelihood numbers are comparable on both ImageNet and LSUN, which suggests that the wavelet decomposition may not be a good inductive bias for qualitative sample quality. Including some qualitative metrics like FID might help here.
- It would be nice if they showed comparisons to a baseline that trains a similar super-resolution model but, instead of the wavelet basis, uses the usual pixel basis, i.e. instead of producing the detail coefficients D_j from I_j, directly produces I_{j+1} from I_j. In this case the decomposition is overcomplete, since here H(I_n) = (I_0, I_1, …, I_n), but it would show more clearly whether the quantitative performance benefits come from the wavelet basis or from the super-resolution nature of the model. (Note this is equivalent to optimising log p(I_n), since p(I_{j-1}|I_j) is a delta function.) This would also support the claims on Page 2 that modelling the residuals is better than directly producing the higher resolution image. (The paper seems to assume that overcomplete representations can't be used for flows; however, since we're actually optimising the discrete log prob with the dequantisation trick, the above decomposition is fine.)
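Written out in the reviewer's notation, the suggested pixel-basis baseline optimises the same quantity as the original objective (a sketch; the delta-function argument is the reviewer's):

```latex
% Each I_j is a deterministic function of I_{j+1}, so every factor
% p(I_j \mid I_{j+1}, \ldots, I_n) is a delta (a Kronecker factor in the
% discrete setting) and the overcomplete chain collapses:
\log p(I_0, I_1, \ldots, I_n)
  = \log p(I_0) + \sum_{j=0}^{n-1} \log p(I_{j+1} \mid I_j)
  = \log p(I_n)
```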

__ Correctness__: Yes, the methods are correct and the experiments are well described.

__ Clarity__: Yes, the paper is easy to understand and the notation is clear.

__ Relation to Prior Work__: Mostly. The super-resolution aspect should include comparisons to, and discussion of, previous flow approaches (e.g. SRFlow).

__ Reproducibility__: Yes

__ Additional Feedback__: Nits:
1. Patch-wise training is a good idea: it increases batch diversity and saves memory. However, some local features can depend on global information; is the receptive field big enough? For example, at high resolution, eye colors in a face.
2. Rearrange the tables to place them near where they're referred to.
3. What is meant by "global variations" in Line 108? Giving some intuition of what the wavelet coefficients mean could be useful.
4. I am not sure I understand the claim that the multi-scale PixelRNN can't be used for computing the probability density of high resolution images (Line 43). It can provide log p(x_r), and if you mean the probability density of the added content, can't it just be log p(x_2r) - log p(x_r), where r is the resolution?
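The chain-rule argument behind item 4 can be made explicit (a sketch in the reviewer's notation, treating the low-resolution image x_r as a deterministic function of x_{2r}):

```latex
% x_r is computed deterministically from x_{2r}, so p(x_r \mid x_{2r}) = 1
% (a delta / Kronecker factor in the discrete setting), and hence
\log p(x_{2r} \mid x_r)
  = \log p(x_{2r}, x_r) - \log p(x_r)
  = \log p(x_{2r}) - \log p(x_r)
```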

__ Summary and Contributions__: This paper introduces a hierarchical structure for normalizing flows for density estimation and data generation based on wavelet transforms, allowing for a natural factorization of the data distribution based on different resolutions of the data.
For density estimation, each image is fed into a sequence of wavelet transforms. Each wavelet transform takes an image and outputs a lower resolution image (obtained by a low-pass filter) and a tensor of detail coefficients (obtained by a high-pass filter). Repeatedly applying wavelet transforms to the output images leads to a set of detail-coefficient tensors, one per scale, and a final 1x1x3 “image” representing the average intensity per channel. The original representation can be recovered from this representation with a sequence of inverse wavelet transforms. As a specific instantiation, the authors consider the Haar wavelet, which has a 2x2 filter size.
Using the representations of the original image at different scales, the density of the original image is factorized into a product of distributions over the detail-coefficient tensors, each conditioned on the low-pass filtered image from the corresponding wavelet transform. The average intensity per channel is modeled with an unconditional density.
Having obtained all of the low resolution images and detail coefficients from a sequence of wavelet transforms, each factor in the distribution is separately modeled with a (conditional) normalizing flow. This allows for training each of the flow components in parallel, as opposed to the common practice of training the flow levels in sequence in hierarchical flow architectures. Sampling does require sequential passes through each normalizing flow, starting from the coarsest representation of the image all the way up to the highest resolution, akin to super resolution. The authors claim that the parallel training of the different flow components can lead to up to 15x faster training as compared to other hierarchical flows such as Glow. Furthermore, the proposed hierarchy allows the authors to train normalizing flows on images with a high resolution of 1024x1024 pixels.
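The decomposition and the per-scale factorisation described above can be sketched in a few lines. Here `haar_step` follows the standard 2x2 Haar filters on a single-channel image, while `base_logprob` and `detail_logprobs` are hypothetical stand-ins for the learned unconditional and conditional flows; this is an illustration, not the authors' implementation, and the constant Jacobian term of the Haar transform is omitted:

```python
import numpy as np

def haar_step(img):
    """One 2x2 Haar wavelet step: returns the half-resolution low-pass
    image and three detail-coefficient channels."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    low = (a + b + c + d) / 2.0                    # low-pass (average)
    details = np.stack([(a - b + c - d) / 2.0,     # horizontal detail
                        (a + b - c - d) / 2.0,     # vertical detail
                        (a - b - c + d) / 2.0],    # diagonal detail
                       axis=-1)
    return low, details

def wavelet_flow_logprob(image, base_logprob, detail_logprobs):
    """Total log-density as a sum of independent per-scale terms:
    log p(I_n) = log p(I_0) + sum_j log p(D_j | I_j)."""
    terms, x, level = [], image, 0
    while x.shape[0] > 1:
        x, details = haar_step(x)
        terms.append(detail_logprobs[level](details, x))  # log p(D_j | I_j)
        level += 1
    terms.append(base_logprob(x))  # log p(I_0), the coarsest "image"
    return sum(terms)
```

Because each term in the sum touches only its own scale's coefficients, the per-scale flows can be optimised independently, which is where the claimed training parallelism comes from.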
Finally, a significant part of the manuscript is devoted to sampling from an annealed density, as opposed to sampling from the density that the model was trained to optimize, as the authors argue that flows produce better samples when sampling from an annealed distribution. In order to sample with a temperature smaller than 1, the authors use MCMC to sample from the unnormalized annealed distribution.
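To sample at T < 1, one can target the unnormalised annealed density directly with MCMC. The sketch below uses a simple random-walk Metropolis step in place of the paper's NUTS sampler, and assumes the common convention of raising the density to the power 1/T^2 (which, for a Gaussian, scales the standard deviation by T); it is an illustration, not the authors' procedure:

```python
import numpy as np

def sample_annealed(logp, x0, temperature=0.97, steps=20000, step_size=0.5, seed=0):
    """Random-walk Metropolis targeting the unnormalised annealed density
    p(x)^(1/T^2). `logp` is any unnormalised log-density; at T=1 this
    reduces to ordinary MCMC on p itself."""
    rng = np.random.default_rng(seed)
    x, lp = x0, logp(x0) / temperature**2
    samples = np.empty(steps)
    for i in range(steps):
        prop = x + step_size * rng.standard_normal()  # symmetric proposal
        lp_prop = logp(prop) / temperature**2
        if np.log(rng.uniform()) < lp_prop - lp:      # Metropolis accept/reject
            x, lp = prop, lp_prop
        samples[i] = x
    return samples
```

For a standard normal target, annealing at temperature T shrinks the sample standard deviation towards T, which is the concentration effect low-temperature sampling aims for.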
--------------- post-rebuttal update -------------------
Wall-clock time: I appreciate that the authors' rebuttal provided a more detailed comparison of training times; this point has been addressed satisfactorily for me.
Choice of wavelets: the authors responded that learning a basis could be interesting for future work. As I stated in my review, I would have liked to see some additional results showing the influence of the choice of wavelets (not necessarily learnable ones). One could imagine that this choice plays a role in the inductive bias of the model and the resulting sample quality (Reviewers 1 and 2 both commented on the blurriness of samples). These seem to me more natural experiments for a paper that introduces a wavelet-based hierarchy for normalising flows than the extensive results on reduced-temperature sampling. However, I am inclined to view this as a difference in personal preference rather than a reason to reject the paper.
With this in mind, I will retain my score of marginally above the acceptance threshold.

__ Strengths__: 1. The proposed multi-resolution hierarchical structure naturally arises from the wavelet transforms and allows for parallel training of the flow components, which can in principle increase training speed. As normalizing flows are in general large models that take a long time to train, increasing the training speed while maintaining good performance is an important research direction to make flow models more practically useful.
2. The authors managed to scale their proposed method to high resolution data of 1024x1024 pixels, which is (to my knowledge) not done in previous normalizing flow papers.

__ Weaknesses__: 1. As the paper’s main proposal is to use wavelet transforms to obtain a natural hierarchy of normalizing flows, it is surprising to me that the authors have not shown an exploration of the effect of the different possible choices of wavelet. The influence of different wavelet choices on the sample quality and quantitative performance seems like an important topic to investigate, as wavelet transforms for normalizing flows form such a central part of the novelty of this paper.
2. One of the main claims of the authors is that the proposed method can be trained up to 15 times faster than other flow methods like Glow. Although it is likely that the parallel training of the flow components can decrease the training time, the comparison of training times presented in the paper has some issues. The authors only report the total number of GPU hours required to train their method to convergence on their hardware, and compare this to *estimates* of Glow's total training time (in GPU hours) obtained from log files and discussions on GitHub. This seems problematic for several reasons. If the number of epochs used for training is not the same across methods, then even if the estimates are correct, the numbers are hard to compare. A more informative comparison would be training time per epoch, or the time required for a single forward pass (with and without a parameter update). Since Glow's code is publicly available, and the authors have used part of Glow in their own code, running such a comparison *on the same hardware* should not pose a problem, especially for the lower resolution images. Furthermore, the authors state that for the LSUN bedroom dataset, wavelet flows are 7.5 times faster than Glow, but wavelet flows also perform worse than Glow in terms of bpd. If Glow were trained only to the same performance level as wavelet flows, what would the difference in training time be? Finally, the authors claim that wavelet flows have a 15x speedup in training time for CelebA at a 256x256 resolution, but no bits-per-dimension results are shown for wavelet flows at this resolution, so it is not clear whether there is a trade-off between training time and quantitative performance on this dataset.
3. A significant portion of the paper is devoted to sampling from the unnormalized annealed density with MCMC, with the authors arguing that samples from flow models are better at lower temperatures. As stated in the experiment section, the authors pick an annealing temperature of T=0.97, which is very close to T=1, where no MCMC is required and one can sample exactly from the flow model. This makes me wonder whether running the MCMC procedure, which can also misbehave in high dimensions, is worth it when one can sample exactly at T=1.

__ Correctness__: As mentioned above, although a training time speed up is likely to be present, the numbers provided in the paper to support this claim are hard to compare to those of other methods. Other claims and methods in the paper appear to be correct.

__ Clarity__: Yes, the paper is clearly written.

__ Relation to Prior Work__: In general the related work is sufficiently discussed. I do think a reference to SReC (https://arxiv.org/abs/2004.02872) and an explanation of the difference would be in order.

__ Reproducibility__: Yes

__ Additional Feedback__: 1. The discussion on additive versus affine coupling layers and the annealing strength required for good samples is not the strongest part of the paper. The authors state that Figure 6 shows that the marginal density of the annealed additive model matches the ground truth marginal less well than that of the annealed affine model. However, even though the height of the peak seems better captured by the annealed affine model, the tails are better matched by the additive model.
2. In Fig 3c, is the panel with T=1 produced with MCMC sampling or with exact sampling from the normalizing flow model without MCMC? How do these two compare for T=1?
3. Where available, a comparison against better flow models such as Flow++ in table 1 would be appropriate, even if the authors mention that Flow++ uses more sophisticated dequantization which could be combined with the proposed method.

__ Summary and Contributions__: This paper proposes a method to train a multi-scale normalizing flow. Previously, flow models were trained to learn a distribution of images at a single size (e.g. 32x32), and training on large images (e.g. 1024x1024) is very time consuming. This paper proposes to first decompose a high resolution image into a pyramid of smaller images and learn one flow per scale. A (predefined) Haar wavelet is used to decompose an image into multiple scales. The experiments show that the proposed method achieves faster training (up to 15x) and on-par bits per dimension. The authors also show that the proposed method can be applied to super resolution and to sampling low resolution images.

__ Strengths__: Learning the density function of high resolution images is a difficult problem, and existing methods are normally slow to train. The proposed method provides a way to learn a density function of high resolution images quickly.

__ Weaknesses__: While Table 1 compares the proposed method with other baselines, the comparison is not on HIGH RESOLUTION images: results of both the baselines and the proposed method are reported on 32x32 and 64x64 images. On the 1024x1024 dataset, the authors only report the result of the proposed method, since training the baselines is impractical. Still, it would be nice to show some comparisons on datasets with bigger sizes such as 128x128 or 256x256. As we can see from Table 1, at the higher image size (i.e. 64x64) the proposed method shows worse results; it would be good to see whether this trend continues with increasing image size.

__ Correctness__: The method is correct.

__ Clarity__: The paper is well written and easy to understand.

__ Relation to Prior Work__: Related work was clearly discussed.

__ Reproducibility__: No

__ Additional Feedback__: After reading the feedback from the authors, I think my concerns are properly addressed, although ablation studies of different wavelets are not in the paper, and the quality of the high resolution images generated by the method is still not very good. This paper provides a novel method to train a flow network on high resolution images efficiently. Further studies, such as different model architectures or hyperparameters, may improve the quality of the generated images. I suggest accepting the paper.