Three knowledgeable referees support acceptance, and I also recommend acceptance. The key contribution of this submission is a new reconstruction loss for VAEs (somewhat like a JPEG loss) that matches human perception more closely than traditional VAE reconstruction losses (e.g., negative Gaussian log-likelihood). For applications where the goal is to generate sharp images rather than to maximize the likelihood of held-out data, the proposed method is a good alternative to other known ways of generating sharp images with VAEs (i.e., autoregressive or flow-based decoders and adversarial loss functions). Unlike these alternatives, the proposed method introduces few additional parameters to learn from the data. R1's and R2's concern about the lack of quantitative measures of performance is justified, but the author response also makes a compelling point about the difficulty of picking a fair quantitative metric. Given the aims of this submission (to generate images with VAEs that look realistic to people), the strong qualitative results presented seem adequate to validate the approach. The authors should consider revising the manuscript to address R4's question about how MCMC can be used to draw samples.