This paper proposes an end-to-end self-supervised learning approach for speech representations. It can serve as unsupervised pre-training for fast and robust deployment of automatic speech recognition systems, especially in low-resource settings or when labeled data is limited. The authors report compelling performance of the proposed technique on Librispeech and TIMIT. This is a strong paper, and all reviewers are supportive of acceptance. Large-scale unsupervised pre-training has had a great impact in vision and NLP, and the work reported here represents an analogous effort in the speech community. That being said, the reviewers raised some minor concerns in the review and discussion. For instance, the exposition could be further polished in the final version to make it more accessible to readers.