NeurIPS 2019
Reviewer 1
The paper proposes a new approach for unsupervised learning of text embeddings. The method is grounded in theoretical results, and the proposed method seems efficient (faster training) according to the experimental results. The big problem with the paper is the experimental setup/results:

1) Experiments are performed with embeddings of size 100 only. Methods such as word2vec are known to perform better in analogy experiments when using an embedding dimension of 300 or more.

2) The authors mention that they achieve state-of-the-art results for word similarity and analogy tasks. However, they do not compare their numbers with the ones reported in previous works; they only compare with their own runs of previous embedding models. For instance, the numbers reported in the original fastText paper (https://aclweb.org/anthology/Q17-1010) are much better than the ones reported in the experimental section. In the word analogy task, the original FastText achieves accuracies of 77.8 (SemGoogle) and 74.9 (SynGoogle), which are significantly better than the numbers reported for JoSE. Is this only related to the embedding size? The original FastText uses an embedding size of 300. Does JoSE improve when the embedding size increases? Is it still significantly superior to the baselines when using an embedding size of 300? Are embeddings of larger dimensionality less sensitive to the effect of training in Euclidean space and testing in spherical space?

3) Although the clustering task is interesting, I believe that a text ranking task such as answer selection (which is normally tackled using cosine similarity on embeddings) would give good insights into the effectiveness of the paragraph embeddings for a more real-world task.

=== Thank you for adding the new results. I've changed my score to 6.
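For context, the SemGoogle/SynGoogle analogy accuracies discussed above are conventionally computed with the 3CosAdd rule over cosine similarities. Below is a minimal sketch of that standard evaluation, assuming row-vector embeddings and a word-to-index vocabulary; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def solve_analogy(emb, vocab, a, b, c):
    """Answer "a is to b as c is to ?" via 3CosAdd over cosine similarity.

    emb is a (V, d) matrix of word vectors, vocab maps word -> row index.
    Names are illustrative only; this is not the paper's evaluation code.
    """
    # Row-normalize so that dot products equal cosine similarities.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query = normed[vocab[b]] - normed[vocab[a]] + normed[vocab[c]]
    scores = normed @ query
    # Standard protocol: the three query words are excluded as answers.
    for w in (a, b, c):
        scores[vocab[w]] = -np.inf
    inv_vocab = {i: w for w, i in vocab.items()}
    return inv_vocab[int(np.argmax(scores))]
```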
Reviewer 2
This paper proposes JoSE, a method to train word embeddings. The unsupervised approach is rooted in the principle that words with similar contexts should have similar embeddings. The generative model has some novelty in that it uses both word-word and word-paragraph co-occurrences, but the novelty largely lies in the constraint that all embeddings lie on the unit sphere, for which the authors derive an optimization procedure using Riemannian optimization.

The empirical results from this paper are strong, outperforming the GloVe, Poincare GloVe, and Word2vec baselines considerably in some cases. FastText is also outperformed, though less so; FastText does have the advantage of using character n-gram information, which is not used in JoSE. They also evaluate on analogies and on embedding documents from the 20 Newsgroups dataset and clustering them, evaluating on the purity of the clusters.

While using a spherical topology for embeddings is not new, I have not seen it applied in this manner. I find the results impressive enough for acceptance, and I checked the paper for experimental issues that would give the method an unfair advantage. The authors do have a hyperparameter they tune, but they use the same value for all experiments; other methods keep default hyperparameters. One concern I also had was efficiency, but their method is actually the fastest of the methods they compare to.

A couple of minor nits. SIF is used as a baseline; what does that mean exactly? SIF largely corresponds to techniques for leveraging existing embeddings using SVD, etc. Was that done here? Or by SIF do you mean the SIF embeddings were downloaded? If the latter, "Towards Universal Paraphrastic Sentence Embeddings" should be cited, since those are largely the actual embeddings in SIF.
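For readers unfamiliar with the SIF baseline being questioned above: the usual SIF recipe (Arora et al., 2017) takes pre-trained word vectors, forms a frequency-weighted average per text, and removes the first principal component, so the choice of underlying word vectors matters. A rough sketch under that assumption follows; all names are illustrative, and this does not claim to be what the paper actually ran.

```python
import numpy as np

def sif_embeddings(texts, word_vecs, word_prob, a=1e-3):
    """Smooth Inverse Frequency (SIF) text embeddings, sketched.

    texts: list of token lists; word_vecs: dict word -> np.ndarray of size d;
    word_prob: dict word -> unigram probability. Names are illustrative only.
    """
    d = len(next(iter(word_vecs.values())))
    X = np.zeros((len(texts), d))
    for i, tokens in enumerate(texts):
        known = [w for w in tokens if w in word_vecs]
        if not known:
            continue
        # Weighted average: frequent words are down-weighted by a / (a + p(w)).
        X[i] = np.mean([a / (a + word_prob[w]) * word_vecs[w] for w in known],
                       axis=0)
    # Common-component removal: subtract projection onto the first singular vector.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```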
Reviewer 3
- This paper proposes a novel text embedding approach in the spherical space and develops an optimization algorithm to train the model.
- The claims in this paper are well supported by theory and derivations.
- The results on text embedding applications look pretty good. They show the model can learn text embeddings in the spherical space effectively and efficiently.

Weaknesses:
- The paper is well organized; however, it might be hard for readers to reproduce the results without the code, especially in Section 4. A step-by-step illustration of the optimization algorithm would help.
- The authors did not show the possible weaknesses of the proposed model.
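As a pointer toward the requested step-by-step illustration, the generic shape of a Riemannian SGD update on the unit sphere is: project the Euclidean gradient onto the tangent space at the current point, take a gradient step, then retract back onto the sphere. The sketch below shows that generic update only; the paper's actual update rule and retraction may differ.

```python
import numpy as np

def riemannian_sgd_step(x, euclidean_grad, lr):
    """One generic Riemannian SGD step on the unit sphere (illustrative only).

    x is assumed to satisfy ||x|| = 1; this is not the paper's exact algorithm.
    """
    # Tangent-space projection: remove the gradient component along x.
    riem_grad = euclidean_grad - np.dot(euclidean_grad, x) * x
    x_new = x - lr * riem_grad
    # Retraction: map the updated point back onto the unit sphere.
    return x_new / np.linalg.norm(x_new)
```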