[papers] RoBERTa: A Robustly Optimized BERT Pretraining Approach
Not a month goes by without a new language model claiming to surpass good old BERT (oh my god, it’s only 9 months old) in one aspect or another. We have seen XLNet, KERMIT, ERNIE, MT-DNN, and so on.
Now he/she strikes back. Meet RoBERTa (for Robustly optimized BERT approach).
The authors found that BERT was significantly undertrained and, when trained properly, can match or exceed the performance of every model published after it.
The magic is an improved recipe for training BERT models. The modifications are simple and include:
(1) Training the model longer, with bigger batches, over more data.
The original BERT was trained on a combination of BookCorpus and English Wikipedia, totaling 16GB of uncompressed text.
RoBERTa is additionally trained on:
- CC-News, collected from the English portion of the CommonCrawl News dataset (76GB after filtering).
- OpenWebText, an open-source recreation of the WebText corpus, containing web content extracted from URLs shared on Reddit with at least three upvotes (38GB).
- Stories, a dataset containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas (31GB).
(2) Removing the next sentence prediction (NSP) objective.
This is not the first time researchers have gotten rid of this objective. The authors of XLNet unexpectedly found that the next-sentence prediction objective proposed in the original BERT did not necessarily lead to an improvement in their setting. Instead, it tended to harm performance, except on the RACE dataset. Hence, when they trained XLNet-Large, they excluded the next-sentence prediction objective. The RoBERTa authors also found that removing the NSP loss matches or slightly improves downstream task performance, hence the decision.
(3) Training on longer sequences.
Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries: when the end of one document is reached, sampling continues from the next document, with an extra separator token added between documents.
Additionally, larger mini-batches are used.
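To make the packing above concrete, here is a rough sketch of how such full-sentence packing could work (the function and the separator token name here are my own illustration, not the authors’ code):

```python
SEP = "[SEP]"   # separator token name is just an assumption for this sketch
MAX_LEN = 512   # maximum number of tokens per training input

def pack_full_sentences(documents, max_len=MAX_LEN, sep=SEP):
    """Greedily pack tokenized sentences from consecutive documents into
    inputs of at most `max_len` tokens; inputs may cross document
    boundaries, with an extra separator token added between documents.
    (Sentences longer than `max_len` are not handled here.)"""
    current = []
    for doc_idx, doc in enumerate(documents):     # doc = list of sentences
        for sentence in doc:                      # sentence = list of tokens
            if current and len(current) + len(sentence) > max_len:
                yield current                     # emit a full input
                current = []
            current.extend(sentence)
        if doc_idx < len(documents) - 1 and len(current) < max_len:
            current.append(sep)                   # crossing a document boundary
    if current:
        yield current

# toy usage: two tiny "documents", each a list of tokenized sentences
docs = [[["the", "cat", "sat", "."], ["it", "purred", "."]],
        [["dogs", "bark", "."]]]
for inp in pack_full_sentences(docs, max_len=8):
    print(inp)
```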
(4) Dynamically changing the masking pattern applied to the training data.
The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, the training data was duplicated 10 times so that each sequence was masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.
RoBERTa instead uses dynamic masking, where the masking pattern is generated anew every time a sequence is fed to the model. This becomes crucial when pretraining for more steps or with larger datasets.
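Here is a rough sketch of the difference, using the standard BERT 80/10/10 masking recipe (the toy helper below is my own illustration, not the reference implementation):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "ran", "."]   # toy vocabulary

def bert_mask(tokens, mask_prob=0.15):
    """Standard BERT-style masking: select ~15% of positions; replace 80%
    of them with [MASK], 10% with a random token, keep 10% unchanged."""
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                     # the model must predict this
            r = random.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = random.choice(VOCAB)
            # else: keep the original token
    return out, labels

sequence = ["the", "cat", "sat", "on", "the", "mat", "."]

# Static masking (original BERT): masks are fixed in preprocessing; the data
# was duplicated 10 times, so each sequence had 10 masks reused over 40 epochs.
static_masks = [bert_mask(sequence) for _ in range(10)]

# Dynamic masking (RoBERTa): a fresh mask is generated every time the
# sequence is fed to the model.
for epoch in range(3):
    masked, _ = bert_mask(sequence)
    print(epoch, masked)
```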
The result?
The new model establishes a new state of the art on 4 of the 9 GLUE tasks: MNLI, QNLI, RTE, and STS-B. It also matches state-of-the-art results on SQuAD and RACE.
Waiting for the next round of tuning other models. 🍿
The only problem with RoBERTa seems to be that there is no such character in the Sesame Street show…
P.S. Let’s guess which other names we will see soon. UmBERTo, BERTie, RoBERT, …? Your ideas? :)
UPD. I was wrong. There is a Roberta in Sesame Street! So, there are no problems with the paper now :)
UPD2. (ten minutes later) Oh no! Things are changing so fast! There is a problem. ERNIE 2.0!
UPD3. Looking closer at the ERNIE 2.0 results, there are no problems for RoBERTa after all :)
ERNIE 2.0’s scores (see the table below) are lower than RoBERTa’s (see the table above):
As of the end of the day on July 31st, the GLUE benchmark leaderboard looks like this, with RoBERTa in first place:
UPD4. There is another interesting BERT modification called SpanBERT (you can see it on the GLUE leaderboard in 9th place). It was published just two days before RoBERTa and shares a few coauthors with it.
SpanBERT is designed to better represent and predict spans of text. It differs from BERT in both the masking scheme and the training objectives. SpanBERT adds a new span-boundary objective (SBO) that trains the model to predict the entire masked span from the observed tokens at its boundaries (and gets rid of the NSP objective, as RoBERTa does):
SpanBERT reaches substantially better performance on span selection tasks in particular. On the GLUE benchmark, the main gains from SpanBERT are on the SQuAD-based QNLI dataset and on RTE:
Still, SpanBERT’s results are weaker than RoBERTa’s.
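For the curious, here is a minimal PyTorch-style sketch of the span-boundary objective as the paper describes it: each token inside a masked span is predicted from the encoder outputs of the two boundary tokens plus a relative position embedding, passed through a 2-layer feed-forward head (the class name, dimensions, and the untied decoder below are my own simplifications):

```python
import torch
import torch.nn as nn

class SpanBoundaryObjective(nn.Module):
    """Predict each token inside a masked span from the encoder states of
    the two span-boundary tokens plus a relative position embedding,
    via a 2-layer feed-forward head with GeLU and LayerNorm."""
    def __init__(self, hidden=768, vocab_size=30522, max_span_len=10):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden)
        self.head = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
        )
        self.decoder = nn.Linear(hidden, vocab_size)  # untied for simplicity

    def forward(self, hidden_states, span_start, span_end):
        """hidden_states: (seq_len, hidden); the masked span covers
        positions [span_start, span_end] inclusive."""
        left = hidden_states[span_start - 1]      # token just before the span
        right = hidden_states[span_end + 1]       # token just after the span
        logits = []
        for i in range(span_start, span_end + 1):
            pos = self.pos_emb(torch.tensor(i - span_start))
            h = self.head(torch.cat([left, right, pos]))
            logits.append(self.decoder(h))        # predict the original token
        return torch.stack(logits)                # (span_len, vocab_size)

# toy usage: random "encoder outputs" for a 16-token sequence, span at 5..8
sbo = SpanBoundaryObjective(hidden=32, vocab_size=100, max_span_len=10)
states = torch.randn(16, 32)
print(sbo(states, span_start=5, span_end=8).shape)  # torch.Size([4, 100])
```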