| Paper ID | 6320 |
|---|---|
| Title | On the Downstream Performance of Compressed Word Embeddings |

Compressing word embedding matrices is an important problem, useful both for running NLP applications on small devices and for building more efficient (and less polluting) NLP models. The authors make an important contribution to understanding and evaluating different compression methods. The paper is well written and well explained (I liked the survey in Sections 2.1 and 2.2). My only real concern is the large amount of content in the appendix (33 pages!). It seems that much of the content, including proofs that are important for the paper's argument, was put in the appendix to save space. I do not have a clear idea of how to save space, but I would encourage the authors, at the very least, to use the ninth page to bring some of this content back into the main paper if it is accepted.

Questions and comments:

1. The word embedding reconstruction error measure (Section 2.2) assumes that X and X~ have the same dimension. But this does not hold for compression, in which k < d. How can this measure then be used to evaluate compressed models?
2. How do the authors apply GloVe to question answering on SQuAD (Section 2.3)?
3. Section 3.1.1: a potential way to speed up the score computation is to consider only part of the embedding matrix (i.e., a subset of the vocabulary, using a smaller n). Did the authors study the effect of this relaxation, in theory or in practice?

=====================

Thank you for the clarifications in your response.

The paper is very well written and clear. The statements proved are useful and move forward a highly active area of research. I cannot comment in depth on originality, as I'm not confident I would know of other overlapping work. The main drawback of the paper is that it is essentially a longer technical project that has been shoehorned into a 10-page paper, with many threads tied to the supplementary material. For example, both of the theorems are left unproven in the paper itself. The paper consistently touts having connected logistic regression to eigenspace overlap, but this connection is made entirely in the supplementary material. I'm not sure what I would change here, but there is a sense that if I read "the paper" (the 10-page version), then I know "the stuff the paper is about," and this paper plus its supplementary material is pushing the boundaries of that distinction for me. At the same time, these are great ideas and fit well in this conference. I'm including detailed feedback in the improvements section below.

The authors propose a new metric for evaluating compressed word embeddings, which they term the eigenspace overlap. They show that the overlap between the left singular vectors of the uncompressed and compressed embeddings is a reliable metric and a good proxy for the performance of the compressed embeddings on downstream tasks. The paper makes several contributions. First, the authors show that earlier compression methods do not significantly outperform a simple uniform quantization baseline. Second, they show that the metrics previously used to evaluate these embeddings cannot explain the strong performance of uniform quantization relative to the other methods. Motivated by these findings, they propose a new metric, the eigenspace overlap, that not only explains the surprisingly high performance of uniform quantization compared to the others, but also provides an easy-to-use metric that correlates better with downstream task performance than most previous metrics. To me, the especially striking result is that uniform quantization does as well as these other methods, which highlights that the previous compression techniques were not thoroughly compared to relevant baselines. The paper performs an average-case analysis, based on linear regression, of the generalization error with compressed versus uncompressed embeddings, and shows that larger eigenspace overlap results in better expected generalization performance for the compressed embedding. The authors also show that it is a good metric for choosing between different compressed embeddings, in terms of both accuracy and robustness.

Originality: A new metric for measuring the quality of compressed embeddings. The paper distinguishes its metric from previous ones, which are adequately described for the purposes of the paper.

Quality: The technical content is sound, with thorough analysis and proofs of the proposed methods.

Clarity: Well-written paper, very easy to follow.

Significance: Useful for low-memory applications.
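
To make the metric summarized in this review concrete, here is a minimal sketch of an eigenspace overlap computation. It assumes the score is defined as ||U^T U~||_F^2 / max(d, k), where U and U~ hold the left singular vectors of the uncompressed (n x d) and compressed (n x k) embedding matrices; that normalization should be checked against the paper. The `uniform_quantize` helper is a simplified, hypothetical stand-in for the uniform quantization baseline and is not the authors' exact procedure.

```python
import numpy as np


def eigenspace_overlap(X, X_tilde):
    """Assumed definition: ||U^T U_tilde||_F^2 / max(d, k).

    U (n x d) and U_tilde (n x k) hold the left singular vectors of the
    uncompressed and compressed embedding matrices. Both matrices are
    assumed to have full column rank and the same number of rows n.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_tilde, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    d, k = U.shape[1], U_tilde.shape[1]
    return np.linalg.norm(U.T @ U_tilde, "fro") ** 2 / max(d, k)


def uniform_quantize(X, bits):
    """Simplified deterministic uniform quantization (illustrative only)."""
    levels = 2 ** bits - 1
    lo, hi = X.min(), X.max()
    step = (hi - lo) / levels
    # Snap every entry to the nearest of the evenly spaced grid points.
    return lo + np.round((X - lo) / step) * step


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))  # toy "embedding": n=200 words, d=10

print(eigenspace_overlap(X, X))                       # identical subspaces
print(eigenspace_overlap(X, X[:, :5]))                # k=5 < d=10 is allowed
print(eigenspace_overlap(X, uniform_quantize(X, 8)))  # mild quantization
print(eigenspace_overlap(X, uniform_quantize(X, 1)))  # aggressive quantization
```

Under this assumed definition the score depends only on the column spans of the two matrices, so it is well defined even when the compressed embedding has fewer dimensions than the original.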

Originality: The metric is a new take on what yields good generalization bounds. Although in many ways this paper is a natural generalization of the paper that introduces the PIP loss, I think the authors still prove some novel bounds.

Quality: The quality of the paper is decent. The authors provide some theoretical justification for their work. However, the experimental section is rather weak: the downstream comparison of embeddings seems very narrow. I would have liked to see a broader comparison.

Clarity: The paper is well written, with few grammatical mistakes and typos. The authors also explain their methodology and experiments clearly, in a manner that will help reproducibility.

Significance: Although the paper introduces a potentially important idea, it is largely similar to the ideas in the paper on pairwise inner product similarity. Nonetheless, the authors give rigorous justification for their ideas, which is always welcome.
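
For reference, the PIP (pairwise inner product) loss that this review compares against can be sketched as follows. This assumes the usual definition ||X X^T - X~ X~^T||_F from the PIP loss paper; the toy matrices and variable names are illustrative only and do not come from the submission.

```python
import numpy as np


def pip_loss(X, X_tilde):
    """Assumed definition: ||X X^T - X_tilde X_tilde^T||_F.

    X is n x d and X_tilde is n x k; only the number of rows must match.
    """
    return np.linalg.norm(X @ X.T - X_tilde @ X_tilde.T, "fro")


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal matrix

print(pip_loss(X, X @ Q))     # rotation leaves inner products unchanged
print(pip_loss(X, X[:, :4]))  # a lower-dimensional embedding is allowed
print(pip_loss(X, 2.0 * X))   # rescaling changes the inner products
```

Because it compares n x n Gram matrices, this loss, like a subspace-based score, does not require the two embeddings to share a dimension, but unlike a purely subspace-based score it is sensitive to the scale of the embeddings.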