This work extends the Transformer language model architecture of GPT-2 by scaling it to 175 billion parameters, resulting in a new model called GPT-3. The paper demonstrates that, when conditioned on zero or a few labeled examples in its input context, this large model can perform a multitude of language tasks without any further updates to model parameters.

While zero-shot/few-shot performance trails the state of the art (SOTA) on most tasks, the novelty lies in the demonstrated strong zero- and few-shot performance across diverse tasks. Clarity of exposition is another strength of the paper.

One limitation is the lack of reproducibility due to the massive compute required to train the model. Another is that the paper's scientific insights are limited; the contribution is largely one of engineering. Nevertheless, the strong experimental findings and the thorough analysis presented in the paper make it worthy of acceptance at NeurIPS.

Regarding ethical concerns, a large language model of this capability could attract the interest of malicious actors. However, the paper does a thorough job of addressing these concerns.
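To make the few-shot conditioning setup concrete: task examples are simply serialized into the model's input context, with no gradient updates. The sketch below is a hypothetical illustration of how such a prompt might be formatted (the function name and the translation task are assumptions for illustration, not the paper's exact format):

```python
# Hypothetical sketch of few-shot "in-context learning": the model's weights
# stay frozen; labeled examples are merely prepended to the query in the prompt.
def build_few_shot_prompt(examples, query):
    """Format (input, output) example pairs plus a final query as one string."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")  # model completes after "Output:"
    return "\n\n".join(blocks)

# Illustrative English-to-French few-shot prompt with two conditioning examples.
prompt = build_few_shot_prompt(
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "mint",
)
print(prompt)
```

Under this setup, the zero-shot case is simply the same prompt with an empty example list, which is why no parameter changes are needed to switch tasks.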