GPT-3: Language Models are Few-Shot Learners

Grigory Sapunov
11 min read · Jun 2, 2020

OpenAI just published a paper, “Language Models are Few-Shot Learners,” presenting the successor to their well-known GPT-2: the GPT-3 family of models. The largest of them (175B parameters), “GPT-3” itself, is over 100x larger than the largest GPT-2 (1.5B parameters)!

TL;DR (Executive Summary)

  • No, you can’t download the model :)
  • And you probably can’t even train it from scratch unless you have a very powerful infrastructure.
  • The GPT-3 architecture is mostly the same as GPT-2’s (there are minor differences; see below).
  • The largest GPT-3 model size is 100x larger than the largest GPT-2 model (175B vs. 1.5B parameters).
  • The authors do not use fine-tuning or any other task-specific training (except the LM task).
  • Instead, they condition the model with the task description and/or some demonstrations of the task. It is called “in-context learning”.
  • Essentially, they treat the [trained] model as an intelligent entity and ask it to perform some task: the task is described as text on the model’s input, possibly with one or more examples given as text as well; the model continues the text, and this continuation is treated as the answer.
  • Evaluation example for the CoQA dataset: see the corresponding figure in the paper.

Introduction

GPT (Generative Pretrained Transformer) models are transformer-based autoregressive language models, meaning they are trained on the task of “language modeling”: predicting the next word of a sentence based on the history of previous words (the context). GPT models are built using the transformer decoder only. For comparison, BERT is built using the transformer encoder, and typical neural machine translation (NMT) models use the complete encoder-decoder transformer.
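To make “predicting the next word” concrete, here is a minimal sketch (mine, not code from the paper) of the two ingredients of decoder-only language modeling: shift-by-one next-token targets and the causal mask that keeps each position from attending to the future.

```python
# Minimal sketch of decoder-only language modeling (not OpenAI's code).
import numpy as np

tokens = np.array([17, 42, 7, 99, 3])   # toy token ids for one training sequence

# Next-token targets: at position t the model must predict token t+1.
inputs, targets = tokens[:-1], tokens[1:]

# Causal (autoregressive) mask: position i may attend only to positions <= i.
T = len(inputs)
causal_mask = np.tril(np.ones((T, T), dtype=bool))

print(inputs)                    # [17 42  7 99]
print(targets)                   # [42  7 99  3]
print(causal_mask.astype(int))   # lower-triangular matrix of ones
```

During training, the model’s predicted distribution at each position is scored against the corresponding target token with cross-entropy; generation simply repeats “predict, append, repeat”.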

If you are not familiar enough with the Transformer architecture, here is the best introduction I’ve seen, called The Illustrated Transformer. There is also The Illustrated GPT-2 (Visualizing Transformer Language Models) from the same author if you want to better understand the GPT model itself.

History

Architecture

GPT-3 is actually a family of models, ranging from smaller ones (125M parameters) up to the largest, the 175B-parameter “GPT-3” itself.

Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models

Essentially, the architecture is the same as GPT-2’s (including the modified initialization, pre-normalization, and reversible tokenization described therein), with the exception that the authors use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
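The paper only says the sparsity is “similar to the Sparse Transformer”, so as a rough illustration (my assumption, not the exact GPT-3 pattern), here is a dense causal attention mask next to a locally banded causal one, where each position attends only to a window of recent positions:

```python
# Dense causal mask vs. a locally banded causal mask (illustrative only).
import numpy as np

def dense_causal_mask(T):
    # Every position attends to all earlier positions (and itself).
    return np.tril(np.ones((T, T), dtype=bool))

def banded_causal_mask(T, w):
    # Position i attends only to positions i-w+1 .. i (a local sliding window).
    band = np.triu(np.ones((T, T), dtype=bool), k=-(w - 1))
    return dense_causal_mask(T) & band

print(dense_causal_mask(6).astype(int))
print(banded_causal_mask(6, w=3).astype(int))
```

Alternating the two patterns across layers presumably reduces attention cost at a 2048-token context while still letting information propagate globally through the dense layers.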

The model became proportionally larger: more layers (up to 96), more units in each bottleneck layer (up to 12288), and a larger context window (2048 tokens, compared to 1024 in GPT-2 and 512 in GPT).
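For reference, a compact way to state the shape comparison (GPT-3 numbers are from the text above; the GPT-2 1.5B dimensions are from the GPT-2 paper; per-model head counts and learning rates live in the paper’s Table 2.1 and are omitted here):

```python
# Rough shape comparison of the largest GPT-2 and GPT-3 models (illustrative).
gpt2_xl   = dict(n_params="1.5B", n_layers=48, d_model=1600,  n_ctx=1024)
gpt3_175b = dict(n_params="175B", n_layers=96, d_model=12288, n_ctx=2048)
```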

The training was performed with model parallelism on multiple V100 GPUs of a cluster provided by Microsoft.

Datasets

The dataset has grown significantly. GPT-2 was trained on 40GB of text data. GPT-3 uses 570GB of filtered CommonCrawl data (filtered down from 45TB of compressed plaintext), plus some high-quality reference corpora (WebText2, Books1, Books2, Wikipedia), giving ~500B BPE tokens in total.

Datasets used to train GPT-3

Rationale

In recent years, we’ve seen significant progress in using pre-trained transformer models (say, BERT) on many NLP tasks by fine-tuning them directly instead of creating task-specific architectures.

This approach has several limitations. A major one is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning. For many tasks, it is difficult to collect a large supervised training dataset. Moreover, the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it. Last but not least, humans learn in a different way: they do not require large supervised datasets to learn most language tasks.

We can try a route called meta-learning. In the context of language models, it means the model develops a broad set of skills and pattern recognition abilities at training time and then uses those abilities at inference time to rapidly adapt to or recognize the desired task. GPT-2 already used “in-context learning”, treating the text input of a pretrained language model as a form of task specification.

Language model meta-learning. During unsupervised pre-training, a language model develops a broad set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize the desired task. The term “in-context learning” describes the inner loop of this process, which occurs within the forward-pass upon each sequence.

GPT-2 results were interesting, but still far inferior to fine-tuning. The recent trend of creating larger and larger models (OpenAI’s 1.5B GPT-2, NVIDIA’s 8.3B Megatron-LM, Google’s 11B T5, Microsoft’s 17B Turing-NLG) showed that each increase brought improvements in text synthesis and/or downstream NLP tasks. There was hope that in-context learning abilities might show similarly strong gains with scale. This finally led to the 10x larger 175B GPT-3 (compared to Turing-NLG, not GPT-2).

Approach

We can employ the same GPT-2 approach to explore in-context learning:

  • Train the language model to predict the next token on a large corpus of texts; NO fine-tuning or any other task-specific training (except the LM task).
  • Then condition the model by a task description (just English text).
  • (optionally) Then condition the model with one or a few (typically 10 to 100) demonstrations of the task.
  • Then (for some tasks) add a natural language prompt in addition to demonstrations.
  • A large enough model can “understand” the task and solve it.

The option with only a task description is called Zero-Shot, a description with a single demonstration — One-Shot, and a description with a few demonstrations — Few-Shot.
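As a concrete illustration of the three settings, here is a hypothetical prompt builder (my own sketch, loosely following the paper’s English → French translation illustration); the model is only ever asked to continue the text, and the continuation is read off as the answer:

```python
# Hypothetical zero-/one-/few-shot prompts for an English -> French task
# (my own wording, in the spirit of the paper's illustration).
task_description = "Translate English to French:"
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
query = "peppermint =>"

def build_prompt(task, demos, query, k):
    """k = 0 gives zero-shot, k = 1 one-shot, k >= 2 few-shot."""
    lines = [task]
    for en, fr in demos[:k]:
        lines.append(f"{en} => {fr}")
    lines.append(query)
    return "\n".join(lines)

print(build_prompt(task_description, demonstrations, query, k=0))  # zero-shot
print(build_prompt(task_description, demonstrations, query, k=2))  # few-shot
```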

Zero-shot, one-shot and few-shot, contrasted with traditional fine-tuning.

Fine-tuning is out of scope for this work, though it is an interesting option for future research.

Results

The results are very decent and include several new SOTAs.

  • Performance grows with the model size. Language modeling performance follows a power law (its general form is sketched right after this list). The power-law behavior observed in previous works continues for an additional two orders of magnitude, with only small deviations from the predicted curve.
Smooth scaling of performance with compute.
  • Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context. Few-shot learning also improves dramatically with model size. The general trends with both model size and number of examples in-context hold for most tasks studied.
Larger models make increasingly efficient use of in-context information. In-context learning performance on a simple task requiring the model to remove random symbols from a word, both with and without a natural language task description. These “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.
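For reference, the power law mentioned above has the general form established in the earlier scaling-laws work (Kaplan et al., 2020); the constants are fitted values, not numbers from this post:

```latex
% Validation loss L as a function of training compute C:
% C_c is a fitted scale constant, \alpha_C a fitted exponent.
L(C) \approx \left( \frac{C_c}{C} \right)^{\alpha_C}
```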

All 8 models (the 175-billion-parameter GPT-3 and 7 smaller ones) are evaluated on a wide range of datasets:

  • Language Modeling: “Our largest model sets a new SOTA on PTB by a substantial margin of 15 points, achieving a perplexity of 20.50”
  • LAMBADA (tests the modeling of long-range dependencies in text): “in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art”
  • StoryCloze (involves selecting the correct ending sentence for five-sentence long stories): “GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model but improves over previous zero-shot results by roughly 10%.”
  • Closed Book Question Answering: “Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning.”
  • Translation (although GPT-3’s training data primarily consists of English (93% by word count), it also includes 7% foreign language content): “For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction.”
  • Winograd-Style Tasks: “On Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance.”
  • Common Sense Reasoning: “PhysicalQA (PIQA), asks common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot (the last measured on PIQA’s test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a fine-tuned RoBERTa.”
  • Reading Comprehension: “We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.”
  • SuperGLUE: “We observe a wide range in GPT-3’s performance across tasks.”
  • NLI: “These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.”
  • Synthetic and Qualitative Tasks: reasonable proficiency at moderately complex arithmetic (addition/subtraction), and mixed results (both successes and failures) on “character manipulation” tasks (solving them requires character-level manipulation, whereas GPT-3’s BPE encoding operates on significant fractions of a word).
  • SAT Analogies: “GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% (random guessing yields 20%)”.
  • and finally News Article Generation.

News Article Generation

A few more words on this one.

The model was able to generate news articles that are practically indistinguishable from real ones to human evaluators (80 US-based participants). The mean human accuracy at detecting articles produced by the 175B parameter model was barely above chance, at ∼52%.

The result holds both for short (∼200 word) news articles and longer (∼500 word) articles.

So, the trend of realistic content generation continues.

The model also shows the ability to learn and use new words, correct English grammar, and even compose poems (see the examples in the paper).

Benchmark contamination analysis

Some results might have been inflated due to test-set contamination (overlap between the training data and the benchmarks), so the authors performed an additional analysis of it.

They constructed cleaned versions of each of the benchmarks to check for potential contamination in the training set. Performance on most benchmarks changed negligibly, but some were flagged for further review. On inspection, the authors find some evidence for contamination of the PIQA and Winograd results, and they mark the corresponding results in the paper. They find no evidence that other benchmarks are affected.


Potential Misuse

OpenAI GPT-related works traditionally have a section on Broader Impacts and Potential Misuse. It continues their investigation into the malicious uses of AI.


Another section is dedicated to Fairness, Bias, and Representation. The authors focus on biases related to gender, race, and religion, although they note that many other categories of bias are likely present and could be studied in follow-up work.

The authors have a GitHub repo, though it doesn’t contain any trained models. 😸 And it’s hard to expect one, knowing OpenAI’s approach to such things (their concerns about potential misuse, staged release, and so on).

BTW, had they decided to publish the 175B GPT-3, it would take at least 175,000,000,000 (trainable parameters) * 4 (bytes per FP32 parameter) = 700GB. Torrents would probably be the right place to share such a model 🤗
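The back-of-the-envelope arithmetic behind that number (with an obvious caveat: half-precision weights would halve it):

```python
# Raw size of the FP32 weights, in decimal gigabytes.
n_params = 175_000_000_000
bytes_per_fp32 = 4
print(n_params * bytes_per_fp32 / 1e9)   # 700.0 GB (FP16 would be ~350 GB)
```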

Distilled models could be part of a solution to this problem, and in addition to the smaller size, they can sometimes improve performance even further. Yet distilling such a huge model is an interesting research task in itself; as the authors mention, it “has not been tried at the scale of hundred of billions parameters; new challenges and opportunities may be associated with applying it to models of this size.”

And, surely, you need a proper infrastructure to deploy it. A single Tesla V100 has at most 32GB of memory, so you need 20+ such GPUs (a couple of DGX-2s, or at least three of the recently announced DGX A100s with the newer A100 GPU) just to store all the parameters in memory (you’ll need several times more to hold activations, gradients, and so on). And you need highly experienced engineers to efficiently implement training (if you manage to train the model yourself) and inference (to use a trained model) of such a huge distributed model.

The model size is not the whole picture. The amount of computation is the main dimension to look at.

Regarding the total amount of computation: GPT-3 175B total training compute is estimated at 3640 PF-days (petaflop/s-days). Given that a recent DGX A100 has a peak performance of 5 PFLOPS, it would take approximately 72 days for a cluster of 10 DGX A100s ($2M to buy) to train this model (probably much more, because running at peak performance 100% of the time is unrealistic).

For the previous-generation DGX-2 (containing 16x V100 and having a peak performance of 2 PFLOPS), it would take approximately 36 days on a cluster of 50 DGX-2s.
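The same estimates, reproduced as a two-liner (assuming, optimistically, 100% peak utilization):

```python
# Rough training-time estimates from the 3640 PF-days figure above.
total_compute_pf_days = 3640

def days_to_train(n_nodes, pflops_per_node):
    return total_compute_pf_days / (n_nodes * pflops_per_node)

print(days_to_train(10, 5))   # ~72.8 days on 10x DGX A100 (5 PFLOPS each)
print(days_to_train(50, 2))   # ~36.4 days on 50x DGX-2   (2 PFLOPS each)
```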

In the case of cloud training, there is an estimate of nearly $12M in compute, based on public cloud GPU/TPU cost models.

In terms of energy, the first configuration (10x DGX A100) should consume 10 × 6.5 kW × 72 days × 24 hours/day ≈ 112 MWh. The DGX-2 is the more relevant choice here (because GPT-3 was trained on V100s), so the second configuration (50x DGX-2), training for 36 days, would consume 50 × 10 kW × 36 days × 24 hours/day ≈ 432 MWh.

People on Twitter inspired me to estimate the carbon footprint of training such a large model. For the second configuration (50x DGX-2), with CO2 emissions per kWh taken from here (I use 0.99 pounds of CO2 per kWh), the calculation gives ~428K pounds, or ~194 tonnes of CO2.
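And the corresponding energy/CO2 arithmetic for the 50x DGX-2 scenario, under the same ballpark assumptions:

```python
# Energy and CO2 for 50x DGX-2 training for 36 days (ballpark only).
n_nodes, kw_per_node, days = 50, 10, 36   # DGX-2 max power draw ~10 kW
energy_kwh = n_nodes * kw_per_node * days * 24
co2_lb = energy_kwh * 0.99                # 0.99 lb CO2 per kWh, as above
co2_tonnes = co2_lb * 0.453592 / 1000     # pounds -> metric tonnes
print(energy_kwh)         # 432000 kWh ~= 432 MWh
print(round(co2_lb))      # ~427680 lb
print(round(co2_tonnes))  # ~194 tonnes
```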

Disclaimer: this is a ballpark estimate, the real configurations might have been very different from the model chosen here, and the real numbers of hours spent on training and energy consumption can be different as well. If you know more relevant numbers or find an error here, let me know.

Anyway, waiting for a 2T-parameter GPT-4 next year 💣
