Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper is very well written in clear English. The proposed framework is methodologically sound, and sufficient details are given to make it fully reproducible and extendable. The experiments are convincing, with a relevant choice of comparative methods and an insightful ablation study. The bibliography and literature review are adequate.

Minor remarks and typos:
- Line 96: "albeit these may be-not be as expressive" -> "may not be as expressive".
- Sec. 4.2: it is not obvious how the proposed architecture yields a receptive field of 12500, since the total squeezing is 2^8. More details on this would be appreciated, as well as on the final dimension of z (which should be 4096, but this is implicit and hard to deduce).
- In Sec. 4.5, the authors refer to "network g", but this does not appear in Figure 1. Although one may deduce that it corresponds to the green "embedding" block, this could be made clearer.
- In Sec. 4.7, more details would be appreciated on how batches are built. In particular, do they contain a mix of frames from all speakers, from a subset of speakers, or from a single speaker?
- The authors do not provide error bars for any of their results. Some standard deviations would be appreciated.
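To make the receptive-field remark above concrete, here is a small sketch of how a stack of squeeze-and-convolve flow blocks accumulates receptive field over the input waveform. The block count, squeeze factor, and kernel width below are illustrative assumptions, not the paper's actual values; the point is only that the reported figure should be derivable from such an accounting.

```python
# Hypothetical receptive-field calculator for a stack of flow blocks, where
# each block first squeezes the signal by `squeeze` (so every later operation
# spans `squeeze`x more input samples) and then applies a convolution of
# kernel width `kernel`. Illustrative numbers only, not taken from the paper.
def receptive_field(n_blocks, squeeze=2, kernel=3):
    rf = 1       # receptive field, measured in input samples
    stride = 1   # input samples covered by one current-resolution step
    for _ in range(n_blocks):
        stride *= squeeze            # squeezing widens all subsequent ops
        rf += (kernel - 1) * stride  # the conv adds (k-1) * stride samples
    return rf

print(receptive_field(8, squeeze=2, kernel=3))  # prints 1021
```

With 8 blocks of squeeze factor 2 (total squeezing 2^8) and kernel width 3, this gives 1021 samples, which illustrates why the stated 12500 is not immediately obvious from the squeezing alone and would benefit from an explicit derivation in the paper.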
Originality: The authors propose original work, as they are the first to utilize normalizing flows for voice conversion.

Quality: The paper provides extensive experiments as well as ablation studies, along with samples of the voice conversions. Despite the lack of baselines, the authors did a good job deconstructing which components of their model are important and also experimenting with other, non-trivial models such as StarGAN and VQ-VAE. One thing I noticed, though, is that when the source and target voices are of different genders, the converted voice seems to have more artifacts and noise than usual.

Clarity: The paper is well written and easy to follow, which I appreciated.

Significance: There is not a significant amount of work on non-parallel music or audio conversion in general, so this work is a step forward in opening up new research directions.

-----------------------------------------------------------------
UPDATE: I appreciate the authors addressing my confusion regarding the forward-backward conversion section, as well as the additional experiment on the identity neutrality of the z space. I will keep my original score.
Pros:
1. Using the invertibility (forward-backward conversion) of a flow-based model for voice conversion is an overall clever idea.
2. End-to-end training of a voice conversion system directly on the raw waveform is elegant.

Cons:
1. The novelty in machine learning/deep learning is limited: Section 4 reads more like an architecture-tuning summary of Glow/WaveGlow. The quality of the posted audio samples and the subjective evaluation (both naturalness and similarity) also need to be improved.
2. For a flow-based model, the mapping between the latent `z` and the data `x` has to be bijective, so there is no information bottleneck as in autoencoder-based models (e.g., VQ-VAE). It is therefore hard to be convinced that `z` lies in a condition-neutral space, as the authors propose in Section 4.3, especially when the speaker embeddings are not powerful enough. Further experiments, like the latent-space analysis in VQ-VAE, would be necessary to show that `z` is condition/identity-neutral.
3. An important comparison with state-of-the-art many-to-many non-parallel voice conversion work is missing, such as AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (ICML 2019). AUTOVC seems to have better reported MOS and higher-quality audio samples, although it operates on spectrograms rather than on the raw waveform.
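The bijectivity point in Con 2 can be sketched in a few lines. Below is a minimal toy version of forward-backward conversion with a single speaker-conditioned affine flow step: encode with the source identity, decode with the target identity. The per-speaker parameters and the dictionary `SPEAKERS` are hypothetical stand-ins for the paper's conditioning networks, not its actual architecture; the sketch only illustrates that the exact inverse means `z` necessarily carries all information about `x`.

```python
import numpy as np

# One affine flow step conditioned on a speaker s:
#   z = (x - b(s)) * exp(-a(s))    and its exact inverse
#   x = z * exp(a(s)) + b(s)
# (a, b) per speaker are fixed illustrative values standing in for the
# conditioning networks; everything here is a toy, not the paper's model.
SPEAKERS = {"src": (0.5, 1.0), "tgt": (-0.3, 2.0)}

def forward(x, spk):
    a, b = SPEAKERS[spk]
    return (x - b) * np.exp(-a)

def inverse(z, spk):
    a, b = SPEAKERS[spk]
    return z * np.exp(a) + b

x = np.array([0.1, -0.2, 0.4])
z = forward(x, "src")        # encode with the source identity
x_conv = inverse(z, "tgt")   # decode with the target identity: conversion
x_back = inverse(z, "src")   # decode with the source identity
assert np.allclose(x_back, x)  # bijectivity: x is recovered exactly
```

Because the round trip is exact, nothing forces `z` to discard speaker identity, which is why extra evidence (e.g., a latent-space analysis) is needed to support the condition-neutrality claim.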