Three reviewers recommended accepting the paper. They liked the idea introduced in the paper, which tackles an important research question (making self-attention more efficient when applied to long sequences). Minor concerns included missing previous works (from the discussion and experiments) and the fact that it is not clear whether the method would lead to significant gains in practice (as the number of clusters might not scale sublinearly, and the work relies on sparse matrix multiplication, which requires an efficient GPU implementation). After the rebuttal and discussion, the reviewers indicated that these concerns were addressed or did not justify a rejection. Thus, the paper is accepted. I encourage the authors to take into account the reviewers' feedback and to cite missing existing work, including "Efficient content-based sparse attention with routing transformers" (https://arxiv.org/abs/2003.05997).