Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper presents an interesting direction towards a deeper understanding of speech signals in neural architectures. The methodology adapts an existing approach based on mean field theory to audio and speech recognition tasks.

Comparing two sets of random weights might be an overly weak baseline; it would be nice to see a comparison to weights after a short period of training, to skip past the poorly scaled norms that initial weights can sometimes have. It would also be nice if the main figures in the paper included error bars, or auxiliary experiments showing whether the findings presented are robust across different training runs of the same neural architecture. Finally, I am not sure that other practitioners would be able to implement the metrics used and reproduce the experimental setup with other neural architectures; even though some information is given in the supplemental material, a more specific method description would help others continue to use these techniques.
The paper is overall well written, and the experimental design is fundamentally well thought out and reasonable. I cannot say whether it is entirely novel, or whether similar-looking graphs could have been obtained with different or similar techniques. The results look intuitively correct and confirm one's expectations.

I find some parts of the experimental setup confusing: the CNN *model* has been trained on WSJ and Spoken Wikipedia, and is not performing an ASR task, but a closed-set word recognition task (in addition to the speaker ID task). Why was Spoken Wikipedia used in addition to WSJ? Would it not have been possible to use Librispeech or another well-known corpus? The entire paper would be much "cleaner" if both types of systems ("CNN" = word recognition, "DS2" = ASR) had been trained and evaluated on the same type of data; if that is not possible, please explain. Also, what is the "CNN dataset"?

It is interesting that training a system towards phones also increases its capacity for words. Would it be possible to perform the same experiments with characters (which is what the DS2 system has been trained with)? This should exhibit a similar pattern, but one could then also compare phone and character systems. In English, the grapheme-to-phoneme relationship is quite complicated ("tangled"), and it should be possible to show that this analysis can "measure" the degree to which certain phones have clear relationships with characters, while other phones have no unique relationship with characters.
At a high level, there are many works on understanding how neural network models perform speech recognition internally, and this paper provides a very solid analysis using a theoretical tool, namely mean-field theory. However, I am not fully convinced how much the findings here can benefit speech recognition research. In particular, the observations come from one particular model configuration (network structure, word-level label units, etc.). It is unclear to me how much the observations would change in a different experimental setting, for example with a system modeling sub-phonemes as in a traditional hybrid system, evaluated on a more challenging conversational speech corpus. The high-level conclusions, such as the normalization of speaker factors and the untangling of words in a speech recognition system, are not surprising to me as a speech recognition researcher. They are not new, but this paper provides some more theoretical evidence to confirm what we have believed already.

A question: I am a bit confused about what N, the ambient dimension, is in Section 2.1. Nor do I fully understand why \alpha in Section 2.1 is important. If P is the vocabulary size, we would simply use a softmax layer of size P for classification; what does this have to do with \alpha and a separating hyperplane?
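For context on the question about \alpha: in the capacity framework the paper draws on, the load \alpha = P/N measures how many objects P a single linear readout (one separating hyperplane per binary dichotomy, not a P-way softmax) can shatter in an N-dimensional feature space; Cover's classic counting argument puts the critical load for random points at \alpha = 2. A minimal sketch of that phenomenon, assuming nothing from the paper itself (the perceptron feasibility check, the dimension N = 40, and the two loads below are my own illustrative choices):

```python
import numpy as np

def linearly_separable(X, y, max_epochs=1000):
    """Perceptron feasibility check: True if a separating hyperplane
    through the origin is found within max_epochs passes over the data."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # point on the wrong side (or on the boundary)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:  # full clean pass: hyperplane found
            return True
    return False

rng = np.random.default_rng(0)
N = 40  # ambient dimension

# Well below capacity (alpha = 0.25) vs. far above it (alpha = 10):
# random labelings are separable in the first case and not in the second,
# with overwhelming probability.
for P in (10, 400):
    X = rng.standard_normal((P, N))          # P random points in R^N
    y = rng.choice([-1.0, 1.0], size=P)      # random binary dichotomy
    print(f"alpha = {P / N:.2f}: separable = {linearly_separable(X, y)}")
```

The same feasibility question, asked of the point-cloud manifolds of per-class network activations rather than of single random points, is what gives \alpha its role in the paper's analysis: it quantifies representational capacity independently of any particular trained classifier head.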