This paper proposes a method for dealing with the quadratic complexity of attention, with the goal of extending Transformers to longer sequences. The authors propose a clever way to train a 'judge' that selects relevant subsequences, so that full attention only needs to run over the selected portion of the input. The idea is novel and appears to work well in practice. Please make sure to address all suggestions by the reviewers in the final version of the paper.
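For concreteness, here is a minimal sketch of one common instantiation of this kind of chunk-selection scheme; it is not the authors' actual method, and the names (`ChunkJudge`, `select_relevant`), the mean-pooled linear scorer, and the top-k selection rule are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChunkJudge(nn.Module):
    """Hypothetical relevance scorer: assigns one score per chunk of the sequence."""
    def __init__(self, d_model: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        # chunks: (batch, n_chunks, chunk_len, d_model)
        pooled = chunks.mean(dim=2)             # (batch, n_chunks, d_model)
        return self.scorer(pooled).squeeze(-1)  # (batch, n_chunks)

def select_relevant(x: torch.Tensor, judge: ChunkJudge,
                    chunk_len: int, k: int) -> torch.Tensor:
    """Keep only the top-k chunks by judge score, so attention runs on a shorter sequence."""
    b, n, d = x.shape
    n_chunks = n // chunk_len
    chunks = x[:, : n_chunks * chunk_len].view(b, n_chunks, chunk_len, d)
    scores = judge(chunks)                            # (b, n_chunks)
    top = scores.topk(k, dim=1).indices               # (b, k)
    idx = top[..., None, None].expand(-1, -1, chunk_len, d)
    selected = chunks.gather(1, idx)                  # (b, k, chunk_len, d)
    return selected.reshape(b, k * chunk_len, d)

# Usage: attention cost drops from O(n^2) to O((k * chunk_len)^2).
x = torch.randn(2, 1024, 64)
judge = ChunkJudge(64)
short = select_relevant(x, judge, chunk_len=64, k=4)  # (2, 256, 64)
```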