NeurIPS 2020

All Word Embeddings from One Embedding


Review 1

Summary and Contributions: This paper presents a new method for compressing neural networks by reducing the number of parameters required to produce word embeddings. The proposed method uses a single base embedding for all words, to which a filter is applied depending on the word whose embedding is desired. The resulting vector is then passed through a simple neural network to produce the final word embedding. Much of the paper focuses on how to build these filters, and the authors evaluate the quality of the embeddings along with their use in practical tasks.
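
To make this concrete, here is a minimal sketch of the forward pass just described, assuming PyTorch-style modules and made-up dimensions; the per-word filter is simplified to one fixed random vector, whereas the paper builds filters compositionally from shared codebooks, so this is an illustration rather than the authors' implementation.

# Illustrative sketch only, not the authors' code. One shared base embedding,
# a fixed word-dependent filter, and a small trainable feed-forward network
# produce the final word embedding. Dimensions are made up.
import torch
import torch.nn as nn

class OneEmbeddingSketch(nn.Module):
    def __init__(self, vocab_size, dim=512, hidden=2048, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        self.base = nn.Parameter(torch.randn(dim, generator=g))  # single shared embedding
        # Simplification: one fixed random filter per word. The paper instead composes
        # filters from shared codebooks so that memory does not grow with the vocabulary.
        self.register_buffer("filters", torch.randn(vocab_size, dim, generator=g))
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, word_ids):                        # word_ids: (batch,)
        filtered = self.base * self.filters[word_ids]   # apply the word-dependent filter
        return self.ffn(filtered)                       # (batch, dim) final embeddings

emb = OneEmbeddingSketch(vocab_size=32000)
vectors = emb(torch.tensor([3, 17, 42]))                # embeddings for three words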

Strengths: The evaluations provide strong evidence that this method for producing word embeddings with fewer parameters is effective. This work provides a novel (and effective) approach to parameter compression for neural networks. The NeurIPS community would likely be interested in and benefit from such a method, as it can substantially reduce the number of model parameters, and thus training time, while maintaining model performance.

Weaknesses: To further demonstrate the benefit of this compression scheme in practice, it would have been nice to see results from training a larger network with the ALONE embeddings and comparing its performance against a standard Transformer with a similar total number of parameters. Also, since one advantage of a compression scheme like this is the reduction in overall computation required to train an effective model, it would have been nice to see some comparison of training time, computation cost, total energy usage, etc.

Correctness: Based on the descriptions in the paper, the methods used for evaluation of ALONE are correct and make use of standard evaluation datasets and setups.

Clarity: Overall, the paper is very clear and well written. There are some instances where ALONE is referred to as “the ALONE” or “our ALONE”, which sounds a little odd.

Relation to Prior Work: A thorough comparison of this work with other approaches for reducing the number of parameters in neural networks is provided, covering both the experimental design and the theoretical contributions. The authors’ proposed method for compressing neural networks was compared with others such as pruning, knowledge distillation, and quantization. It was quite clear that ALONE takes a novel approach to this issue.

Reproducibility: Yes

Additional Feedback: Why use this over pre-trained embeddings like GloVe, which are already expressive and fairly low-dimensional? UPDATE AFTER REBUTTAL: Thank you for addressing my concerns regarding GloVe. I stand by my score.


Review 2

Summary and Contributions: This paper proposes a method to compress the large word embedding matrix in NLP models to reduce the parameter size. The proposed approach computes word embeddings from shared filter vectors and another shared vector, making the parameter size independent of the vocabulary size in order to achieve compression. Experiments on translation and summarization tasks demonstrate that this method compresses the embedding matrix significantly without losing performance. ------------ After Rebuttal ------------- Thank you for the response! It addressed most of my concerns and I would like to increase my score to 6.

Strengths: (1) The proposed method is very simple: it randomly assigns and combines filter vectors from a shared codebook and does not need to learn this discrete assignment operation. (2) The results are good compared to non-compressed baselines, with significantly fewer embedding parameters and on-par performance.

Weaknesses: (1) I am not an expert in this direction, but I think the experiments lack other compression baselines -- the main experiments only compare with a normal seq2seq baseline and the toy “factorized embed” method, while there are many embedding matrix compression methods out there, as cited by the authors. This paper compares to none of them; the only included DeFINE result does not report the embedding parameters, and it does not seem to be in a comparable setting either (as the footnote says, DeFINE uses more parameters than the original Transformer). In the introduction the authors exclude some related work like [1], saying it needs additional parameters to learn, but I don’t think this is a reasonable excuse not to compare with it. Those methods are independent of the vocabulary size and have good performance/compression rates as well. Also, don’t the parameters of the feed-forward network here count as “additional parameters”? I think those related works are comparable, and the authors should compare against at least one other competitive compression method to show the advantage of the proposed method over others.

(2) It would be nice to show the overall compression rate instead of only that of the embedding matrix. Sometimes the overall compression rate might be small even though the embedding parameters seem to be compressed a lot.

[1] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. ICLR 2018.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: This paper describes a new memory-efficient method to represent word embeddings, as opposed to storing them in a word embedding matrix, which can be huge. The main idea is to construct a function (or, as the authors say, a filter) that takes in a single common vector and outputs a word embedding. Specifically, this filter is defined by M matrices with c columns each, which are randomly sampled from a predefined distribution. To get the vector for a word, one samples one column from each matrix, adds them, and passes the resulting vector through a feedforward network. The matrices are initialized and fixed, whereas the feedforward network is trained with the downstream tasks. The authors present an initial proof of concept by trying to reconstruct GloVe vectors. In their main experiments, they show competitive performance on WMT En-De translation and abstractive summarization (headline generation), both based on Transformer models. They compare against embedding tables and factored embedding tables. The presented method requires less memory than the baselines and performs as well as (sometimes better than) them. ------------------------------ I thank the authors for their rebuttal and stand by my score.
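
A compact sketch of the construction described above, in plain numpy. All names and dimensions are illustrative, and the element-wise combination of the summed filter with the shared vector is an assumption based on how Review 1 describes the filter being applied; this is not the authors' code.

# Sketch of the codebook-based filter construction described above (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
V, D, M, c = 32000, 512, 8, 64                  # vocab size, embed dim, number of matrices, columns each

base = rng.standard_normal(D)                   # the single common vector
codebooks = rng.standard_normal((M, c, D))      # M fixed random matrices, c columns each (stored row-wise)
word_to_cols = rng.integers(0, c, size=(V, M))  # fixed random column choice per word (a V x M index table)

# Trainable feed-forward network; random weights stand in for trained ones here.
W1 = rng.standard_normal((D, 4 * D))
W2 = rng.standard_normal((4 * D, D))

def embed(word_id):
    cols = word_to_cols[word_id]                        # one column index per matrix
    filt = codebooks[np.arange(M), cols].sum(axis=0)    # sum the M selected columns
    hidden = np.maximum((base * filt) @ W1, 0.0)        # filter the common vector, then ReLU
    return hidden @ W2                                  # final D-dimensional word embedding

vec = embed(42)                                         # embedding for word id 42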

Strengths: 1. The presented model requires, in theory, much less memory than traditional word embedding methods for really large vocabularies, which is quite impressive. 2. Experiments validate the utility of this method on multiple tasks.

Weaknesses: Not a limitation per se, but it would have been interesting to see experiments in larger-vocabulary settings (for example, with word-based language models) where BPE is not used, since BPE decreases the vocabulary size by a lot anyway. Something minor: the total memory calculation should include a VxM term for the columns chosen from the matrices for each word.
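
A rough accounting of that point, under notation I am assuming here (V words, embedding dimension D, M matrices with c columns each, and a two-layer feed-forward network with hidden size D_ff); the symbols and the two-layer count are mine, not the paper's:

\begin{align*}
  \text{full embedding table} &\approx V \cdot D \ \text{floats} \\
  \text{ALONE-style storage}  &\approx \underbrace{M \cdot c \cdot D + 2 \cdot D \cdot D_{\text{ff}}}_{\text{floats}}
                               \;+\; \underbrace{V \cdot M \cdot \lceil \log_2 c \rceil}_{\text{bits for the chosen columns}}
\end{align*}

Since the matrices are fixed and randomly drawn, their term could in principle be regenerated from a seed, but the VxM index term is exactly the one this review points out.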

Correctness: Yes, I believe so. The method seems sound and the experiments are done on two standard seq2seq tasks and datasets. The results also seem very convincing and impressive.

Clarity: Yes, very clearly written.

Relation to Prior Work: Yes, prior work seems quite extensive.

Reproducibility: Yes

Additional Feedback: Some analysis of what the embeddings learned by this model look like compared to traditional embedding tables would be interesting to see (maybe using certain intrinsic evaluation measures). As I said, experiments on language modeling tasks would also be interesting. Additionally, I think this work could have applications in training word embeddings (with fastText-like methods) or in cross-lingual word embedding methods.


Review 4

Summary and Contributions: This paper proposes a novel word embedding method, ALONE, that reduces the parameter space without harming performance on both word-level tasks and more sophisticated NLP tasks.

Strengths: This paper tackles the problem that the number of word embedding parameters scales linearly with the vocabulary size, which is not affordable given the large vocabularies of current training corpora. The proposed method addresses this by decomposing the word embeddings into a few codebooks, which requires less parameter space and can scale logarithmically with the vocabulary size. While the proposed method effectively reduces the parameter space, the performance of downstream applications is not affected. Some of the experiments even show better performance for specific architectures.
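
One way to make the logarithmic-scaling remark explicit, under notation I am assuming here (M codebooks with c entries each and vocabulary size V); this is an illustrative reading, not a formula taken from the paper:

\[
  c^{M} \ge V \quad\Longleftrightarrow\quad M \ge \log_{c} V ,
\]

so giving every word a distinct combination requires only a number of codebook choices that grows like log V, while the codebook and feed-forward parameters themselves do not grow with V.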

Weaknesses: The writing of this paper needs to be revised; please explain the meaning of notation before using it (e.g., D_{inter}). The proposed method directly deals with pretrained GloVe vectors. Why not jointly train the parameters using the GloVe objective? Also, it would be nice to include studies applying the proposed method on top of pretrained language models (e.g., BERT/GPT) and to report the resulting performance.

Correctness: I have several concerns regarding the current experimental setup:
1. For the reconstruction task, I noticed that the performance is comparable to GloVe when D_{inter}=2400; the number of parameters in this setting could embed approximately 5k words in GloVe (see the rough count below). Does this mean that the parameters can potentially overfit to the top 5k words?
2. If that is the case, it is worth reporting the fraction of words in the SimLex-999, WordSim-353, and RG-65 datasets that are among the top 5k most frequent words in English Wikipedia.
3. To avoid overfitting, I am curious to see the performance if (4) is applied across the entire vocabulary (weighting by word frequency probably makes more sense than averaging).
4. Since the Transformer model itself already has a lot of parameters, I do not think the size of the word embeddings is the bottleneck; reducing the word embeddings seems to be marginal according to lines 195-200.
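
The rough count behind point 1, assuming 300-dimensional GloVe vectors and counting only a two-layer feed-forward network of width D_{inter} (both assumptions on my part):

\[
  2 \times 300 \times 2400 = 1{,}440{,}000 \approx 4{,}800 \times 300 ,
\]

i.e., those feed-forward parameters alone are comparable to a table of roughly 5k 300-dimensional GloVe vectors.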

Clarity: The writing of this paper needs to be revised. Apart from some grammatical errors, please explain the meaning of notation before using it (e.g., D_{inter}).

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: