Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper proposes something that strains belief -- creating an estimate of a face from a speech signal. Surprisingly, this succeeds in some regards! Ultimately this stems from the fact that speech production depends on facial characteristics for aspects of the sound, a fact that has been known in that literature for a while. The paper is thought-provoking and leads one to wonder what other reconstruction tasks might be possible across domains that seem unlinked. The quantitative evaluation is a little tricky to do correctly. It would be interesting to see whether there is a way to compare more directly -- for example, testing whether people can pick out which of the GAN faces was created by the model (given the human face), or having people choose which face is most likely to have produced the speech signal they hear. This would show whether people make the same judgements as the GAN model.
Detailed comments already given above.
This paper proposes a convolutional-neural-network-based model to reconstruct a face from spoken speech, trained with a supervised GAN. The problem is novel, but the model itself is not so much: the encoder (or embedder) and decoder (or generator) design is quite standard, and supervised GAN training has also been popularly used, so in that respect the novelty is incremental. I also think this paper needs a more thorough experimental study to show the effectiveness of the proposed model:

1. From the experimental results, I suspect that the generated faces match only coarse attributes (gender, race, etc.) and not much about identity. Face identity classification might be a better evaluation in this regard than the illustrated matching test. Even in the matching test on test data restricted to the same gender, the accuracy was only slightly higher than chance, so the result does not seem significant.

2. There is no baseline. The qualitative results include no baseline either. I understand that the problem may be new, but the paper could at least present an ablation study or simpler baselines. The models in Table 3 are given without any explanation; I assume they are used only for the matching test, not as generative models for the speech-to-face reconstruction task.

In sum, the problem looks interesting and novel, but the novelty of the proposed model seems incremental, and the experimental results do not sufficiently demonstrate its effectiveness. The writing is fine but could be improved.

Typos:
line #173: Only F_d() and F_c() are used -> Only F_e() and F_g() are used
line #197: a first -> the first
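On the significance point: whether a matching accuracy slightly above chance is meaningful can be checked with a simple one-sided binomial test. A minimal sketch follows; the trial count and accuracy used are hypothetical illustrations, not numbers from the paper.

```python
from math import comb

def binom_p_value(n_trials, n_correct, p_chance=0.5):
    """One-sided exact binomial test: probability of observing at least
    n_correct successes in n_trials if the true accuracy were p_chance."""
    return sum(
        comb(n_trials, k) * p_chance**k * (1 - p_chance)**(n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )

# Hypothetical example: 53% accuracy on 200 two-way matching trials.
p = binom_p_value(200, 106)
# p is well above 0.05, so 53% on 200 trials would not be significant.
```

Reporting such a p-value (or a confidence interval) for the matching-test accuracy would make it easy to judge whether the above-chance result is real.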