NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6783
Title: Kernelized Bayesian Softmax for Text Generation

Reviewer 1

This paper builds on the observation that context vectors from a language model such as BERT often cluster into separate groups for the same next word. These clusters may correspond to different senses of the word and often have different variances. The authors argue that a traditional softmax is not expressive enough to capture these clusters. A similar argument was made by Yang et al. in their Mixture of Softmax (MoS) paper. The solution presented here is quite different, though: allocate multiple senses to each word in the output embedding table, and use a parameterized kernel to model the variance. The ideas are pretty neat and, as far as I know, original.

The paper is rather heavy on prose and light on results, analysis, and insights. Several intuitions are offered for why the individual components make sense, but no results show that these intuitions actually hold. Specifically, I would have liked to see at least a discussion of which words end up with a higher number of senses, or higher values of \theta, after training. Results are presented on 3 tasks, which is good. But all of them use a simple baseline; it would be interesting to see whether the improvements persist with state-of-the-art models (e.g. transformer architectures). There is a brief discussion of runtime, but I would like to see a more concrete speed comparison against MoS.

The writing could be tightened up considerably. There is a lot of repetitive text about the intuitions behind word embeddings (e.g. lines 20-32, 77-83, 89-90, 113-118). It is not clear how Pr(y_{t-1} = s_i^j | x_{[0:t-1]}) is computed in Eq. 7 (is this the same as Eq. 5?). The x and y axes in Figure 2 are not clear, nor is it clear why the kernel presented in Eq. 9 is maximized at 0. The algorithm confuses the reader rather than helping: why is there a maximization step within the training loop? Why is only a single instance of S and T being trained on? An informed reader could guess the intended workings of the algorithm, but scientific writing should not rely on guesses.

Overall, the paper presents some compelling ideas but lacks the results to back them up.

Reviewer 2

Originality: The approach is original and interesting. The related work is cited adequately.

Quality: The theoretical part seems sound, but the experimental part has some flaws and should be improved.

Clarity: The paper is well written and easy to understand.

Significance: There is no doubt about the importance of the work; e.g., if the authors had actually demonstrated that they indeed learn multi-sense embeddings, that would be very useful. But given the current state of experimentation, it is unclear whether the technique actually works the way the authors claim.

Reviewer 3

Tackling multiple senses of words is important, and this paper makes the first attempt to resolve the problem that, in text generation, each word corresponds to a single (sense) vector at the projection layer. I appreciate the motivation of this work, and the proposed model improves performance on various text generation tasks. In terms of method, there is nothing particularly surprising: it adopts a sense embedding matrix instead of a word embedding matrix and uses heuristics to dynamically allocate senses to words. In addition, it employs kernels with a learnable variance in place of the inner product. While these techniques are not novel, they may be practical and provide a baseline for future work.
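
For readers unfamiliar with the setup, the mechanism described in the reviews (several sense vectors per word, scored by a kernel with a learnable width rather than an inner product) can be illustrated with a minimal sketch. This is not the authors' implementation. The kernel below is a stand-in chosen for simplicity, not the paper's Eq. 9; summing sense probabilities into a word probability is an assumption about how the two levels connect; and the names `kernel_score`, `word_probabilities`, and `sense_to_word` are hypothetical.

```python
import math


def kernel_score(h, e, theta):
    """Stand-in kernel (an assumption, not the paper's Eq. 9).

    The score is scaled by both vector norms and is largest when h and e
    point in the same direction; theta > 0 controls how quickly the score
    decays as the angle between them grows.
    """
    norm_h = math.sqrt(sum(x * x for x in h))
    norm_e = math.sqrt(sum(x * x for x in e))
    if norm_h == 0.0 or norm_e == 0.0:
        return 0.0
    cos = sum(a * b for a, b in zip(h, e)) / (norm_h * norm_e)
    return norm_h * norm_e * math.exp(theta * (cos - 1.0))


def word_probabilities(h, sense_vectors, sense_thetas, sense_to_word, num_words):
    """Softmax over all senses, then sum the senses that belong to each word."""
    scores = [kernel_score(h, e, t) for e, t in zip(sense_vectors, sense_thetas)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [0.0] * num_words
    for value, word in zip(exps, sense_to_word):
        probs[word] += value / total
    return probs


# Toy vocabulary: word 0 has two senses, word 1 has one.
context = [1.0, 0.0]
senses = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
thetas = [1.0, 1.0, 1.0]   # one width parameter per sense
owner = [0, 0, 1]          # sense index -> word index
print(word_probabilities(context, senses, thetas, owner, num_words=2))
```

In this toy case the context vector aligns with one sense of word 0, so word 0 receives most of the probability mass even though its other sense points the opposite way. Inspecting the learned per-sense width and the number of senses allocated per word is the kind of analysis Reviewer 1 asks for.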