NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5992
Title: Levenshtein Transformer

Reviewer 1

[Update] Thanks for the revision and clarification! I have revised my review accordingly.

This submission introduces the Levenshtein Transformer, a non-autoregressive model for text generation and post-editing. Instead of generating tokens left-to-right, it repeats a `delete-and-insert` procedure: starting from an initial string, it keeps deleting tokens from, or inserting new tokens into, the output until convergence. The model is trained with imitation learning, where the expert policy is derived either from gold data or from a pretrained autoregressive teacher model. Experiments on text summarization, machine translation, and post-editing show that the proposed model outperforms Transformer baselines in both accuracy and efficiency. Overall, I think this is interesting work, but I do have some points of confusion in both the technical and experimental parts.
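The delete-and-insert procedure the reviewer summarizes can be sketched in miniature. The following is a hypothetical toy, not the paper's model: `delete_policy` and `insert_policy` are placeholder callables standing in for the learned classifiers, and the sketch only illustrates the alternating refinement loop and its convergence check.

```python
def refine(tokens, delete_policy, insert_policy, max_iters=10):
    """Repeatedly apply a deletion pass then an insertion pass until
    the sequence stops changing (or the iteration budget runs out)."""
    for _ in range(max_iters):
        # Deletion pass: keep only the tokens the policy votes to keep.
        kept = [t for t in tokens if not delete_policy(t, tokens)]
        # Insertion pass: the policy proposes (possibly zero) new tokens
        # for each slot between consecutive kept tokens, plus both ends.
        out = []
        for i in range(len(kept) + 1):
            out.extend(insert_policy(i, kept))
            if i < len(kept):
                out.append(kept[i])
        if out == tokens:  # converged: no edits were applied
            return out
        tokens = out
    return tokens
```

With trivial placeholder policies (delete a marker token, insert nothing), `refine(["a", "<bad>", "b"], lambda t, seq: t == "<bad>", lambda i, seq: [])` converges to `["a", "b"]` after one edit pass.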

Reviewer 2

Originality: This is interesting work that casts sequence generation as two iterative tasks, insertion and deletion. I believe the formulation is new, coupled with a training procedure based on imitation learning over the two policies (deletion and insertion).

Quality: The proposed model and its training procedure seem apt and well designed. Experiments are carried out carefully, with consistent gains over the SOTA (i.e., the Transformer) and faster inference speed.

Clarity: The paper is clearly written, though I have a couple of minor questions regarding technical details; see the "Improvements" section.

Significance: Given the inference efficiency and the reasonable quality improvements, I feel this work has the potential to impact future research.

Other comment: line 89: "we our policy for one iteration is" -> "{our, the}(?) policy for ..."

Reviewer 3

=== Detailed Comments ===

> "two atomic operations — insertion and deletion"

This is somewhat debatable. Under LevT, an insertion operation first requires the number of slots to be predicted, and only then are the actual tokens predicted, so it is not completely atomic; in the authors' own terminology from Figure 1, "Insert Placeholders" then "Fill-in Tokens".

> Section 1, "(up to ×5 speed-up"; Figure 4; Section 4.1, "Analysis of Efficiency"

This reviewer finds the speed and iteration comparisons quite misleading. Figure 4 adds U[0, 0.5) noise to the data points, which means it can appear to subtract iterations; this gives a misleading plot. The iteration analysis (Figure 4 and elsewhere) is also misleading because the authors fail to account for the fact that one LevT iteration is roughly three times more expensive than a standard Transformer iteration (i.e., compared with other published methods).

> Section 3.1, Roll-in Policy

It took me several passes to fully understand the roll-in policy. This section should be rewritten to be clearer and easier to follow.

> Section 3.2, Expert Policy, and Section 4, "Oracle vs. Teacher Model"

The terminology is confusing; please use the standard terminology of the field. This is simply distillation vs. no distillation, and the "Oracle" / "Teacher Model" labels obscure that. Additionally, the use of the Levenshtein edit distance (more specifically, decomposing it with dynamic programming and using it as the oracle policy) is not new; citations to [1, 2] are missing.

> Section 3.3

Comment: it seems your model might benefit from a noisy-decoding approach, i.e., greedy decoding with some noise, then selecting the best candidate by the log-probability of the entire sequence.

> Section 4, Machine Translation

The authors present several MT results (Ro-En, En-De, En-Ja); this reviewer will focus on the WMT14 En-De results.
This is because WMT14 En-De is a well-established benchmark for machine translation, whereas the other datasets are much less established and lack strong prior work; they are more interesting to the NLP community and less so to the machine learning community. First, this reviewer thanks the authors for publishing WMT14 En-De results instead of taking the easy way out and reporting only on less competitive MT datasets. However, the empirical results are misleading (i.e., Table 1): the authors fail to compare against other prior published work while making bold claims about their own. The Transformer baseline is very weak (26.X BLEU); it is behind the original Transformer paper [3] at 27.3 BLEU, which in turn is behind modern Transformer implementations that can reach >28 BLEU quite easily.

> Section 4, Gigaword

As with the MT results, citations and comparisons to other published work are missing. For example, citing a few of the SOTA prior works from [4] would be nice.

Overall, this reviewer argues for acceptance of this paper. The ideas are sufficiently novel and constitute a good contribution to the community. The empirical results are lacking, but that alone should not be grounds for rejection. However, this reviewer also finds the paper quite misleading in several places, especially in the comparisons with prior work, and a few citations are missing. The writing and exposition can be improved: there are many minor grammatical errors, the text feels very rushed, and it definitely needs significant polishing. The exposition is currently below the NeurIPS bar. Assuming these issues are addressed in the rebuttal, this reviewer believes the paper should be accepted and is willing to raise the score from 6 to 7.
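The oracle construction the reviewer alludes to, decomposing the Levenshtein distance with dynamic programming to recover an edit script, is a standard technique. Below is a minimal illustrative sketch (not the paper's implementation), restricted to insertions and deletions so that it matches an insertion/deletion-only edit model; substitutions are therefore realized as a delete plus an insert.

```python
def levenshtein_ops(src, tgt):
    """Levenshtein DP over insertions and deletions only
    (cost = len(src) + len(tgt) - 2 * LCS); the backtrace recovers
    one optimal edit script turning `src` into `tgt`."""
    n, m = len(src), len(tgt)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i          # delete all of src[:i]
    for j in range(m + 1):
        dist[0][j] = j          # insert all of tgt[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i - 1] == tgt[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]
            else:
                dist[i][j] = 1 + min(dist[i - 1][j],   # delete src[i-1]
                                     dist[i][j - 1])   # insert tgt[j-1]
    # Backtrace to recover one optimal sequence of edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1                        # match: no edit
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, src[i - 1]))
            i -= 1
        else:
            ops.append(("insert", i, tgt[j - 1]))
            j -= 1
    return dist[n][m], list(reversed(ops))
```

For example, `levenshtein_ops("abc", "adc")` gives distance 2 with one deletion of `"b"` and one insertion of `"d"`; an oracle policy in this style would supervise the model's delete and insert heads with exactly such scripts.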
[1] Optimal Completion Distillation for Sequence Learning
[2] EditNTS: An Edit-Based Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing
[3] Attention Is All You Need
[4]