Speeding up BERT
The heavy BERT
BERT became an essential ingredient of many NLP deep learning pipelines. It is considered a milestone in NLP, as ResNet is in the computer vision field.
The only problem with BERT is its size.
BERT-base contains 110M parameters. The larger variant, BERT-large, contains 340M parameters. It’s hard to deploy a model of this size into many environments with limited resources, such as mobile or embedded systems.
Training and inference times are tremendous.
Training of BERT-base was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total), and training of BERT-large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.
Training time is usually not an issue for an end-user since it’s typically a one-time investment (well, actually a many-times investment, because you’ll probably have to retrain the model several times before you get a satisfactory result). Nevertheless, it’s worth improving the training speed if you can. The faster you iterate, the sooner you’ll solve your problem.
BERT inference times vary depending on the model and the available hardware, but in many cases they significantly limit how much, how fast, and how cheaply you can process your data. For some real-time applications, it could be prohibitive.
Optimizing neural networks
This set of problems is not new to neural networks. Other domains (say, computer vision) have faced the same issues before, and several approaches to compress and speed up NN models have been developed.
These approaches can roughly be divided into several groups:
- Architecture improvements (change the architecture to a faster one, say, replace an RNN with a Transformer or a CNN; use layers that require fewer computations, and so on) or smarter optimization (learning rate and schedule, number of warmup steps, larger batch size, etc.).
- Model compression (usually done via quantization and/or pruning, which reduce the total amount of computation while keeping the architecture unchanged. OK, mostly unchanged).
- Model distillation (train a smaller model that will replicate the behavior of the original model)
Let’s look at what can be done with BERT regarding these approaches.
1. Architecture and optimization improvements
Large-scale distributed training
The first (or even zeroth) thing to speed up BERT training is to distribute it on a larger cluster. While the original BERT was already trained using several machines, there are some optimized solutions for distributed training of BERT (e.g. from Alibaba or NVIDIA).
A recent record was set by NVIDIA that trained BERT-large in 53 minutes using (a very expensive) NVIDIA DGX SuperPOD with 92 DGX-2H nodes with a total of 1,472 V100 GPUs (which theoretically can deliver up to 190 PFLOPS).
Another example of a more clever optimization (and using super-powerful hardware) is a new layerwise adaptive large batch optimization technique called LAMB which allowed reducing BERT training time from 3 days to just 76 minutes on a (very expensive as well) TPUv3 Pod (1024 TPUv3 chips that can provide more than 100 PFLOPS performance for mixed-precision computing).
Architectures
Regarding solutions that are more about the architecture and less about the hardware, there is a progressive stacking method for training BERT. It is based on the observation that a self-attention layer’s distribution concentrates locally around its own position and the start-of-sentence token, and that the attention distribution of a shallow model is similar to that of a deep one. Motivated by this, the authors proposed a stacking algorithm to transfer knowledge from a shallow model to a deep model, and applied stacking progressively to accelerate BERT training. They achieved a training time about 25% shorter than the original BERT, mainly because, for the same number of steps, training a small model needs less computation.
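Conceptually, the trick is simple: train a shallow encoder, then initialize a deeper one by copying its layers on top of themselves and continue training. Here is a minimal PyTorch sketch of one such stacking step (the function name and the assumption that the encoder layers live in a list or nn.ModuleList are mine, not the paper’s):

```python
import copy
import torch.nn as nn

def progressively_stack(shallow_layers):
    """One stacking step: double the depth of a trained shallow encoder
    by copying its layers on top of themselves. The deeper model is then
    trained further (and can be stacked again later)."""
    doubled = [copy.deepcopy(layer) for layer in shallow_layers]
    doubled += [copy.deepcopy(layer) for layer in shallow_layers]
    return nn.ModuleList(doubled)

# e.g. train 3 layers, stack to 6, train, stack to 12, train again
```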
Other architectural improvements reducing the total amount of memory and/or computation are sparse factorizations of the attention matrix (aka Sparse Transformer by OpenAI) and block attention.
ALBERT
And finally, there is a possible architectural descendant of BERT called ALBERT (A Lite BERT) submitted to ICLR 2020 conference.
ALBERT incorporates two parameter reduction techniques.
The first one is a factorized embedding parameterization, separating the size of the hidden layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings.
The second technique is cross-layer parameter sharing. This technique prevents the number of parameters from growing with the depth of the network.
Both techniques significantly reduce the number of parameters for BERT without seriously hurting performance, thus improving parameter-efficiency.
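To make the two ideas concrete, here is a minimal PyTorch sketch (not the real ALBERT code; the class name, dimensions, and use of nn.TransformerEncoderLayer are all illustrative) that combines a factorized embedding with a single Transformer block shared across all layers:

```python
import torch.nn as nn

class AlbertStyleEncoder(nn.Module):
    """Sketch of ALBERT's two parameter-reduction ideas:
    - factorized embeddings: vocab -> small E, then a projection E -> hidden H
    - cross-layer sharing: one Transformer block reused at every depth
    """
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768,
                 num_layers=12, num_heads=12):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)  # V x E instead of V x H
        self.embed_proj = nn.Linear(embed_dim, hidden_dim)      # E x H projection
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, token_ids):
        x = self.embed_proj(self.token_embed(token_ids))
        for _ in range(self.num_layers):  # the same weights at every layer
            x = self.shared_layer(x)
        return x
```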
An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster.
It even outperforms heavily tuned RoBERTa!
2. Quantization and pruning
Quantization decreases the numerical precision of a model’s weights.
Typically, models are trained using FP32 (32-bit floating point); they can then be quantized into FP16 (16-bit floating point), INT8 (8-bit integer), or even further down to INT4 or INT1, reducing the model size 2x, 4x, 8x, or 32x respectively. This is called post-training quantization.
Another (harder and less mature) option is quantization-aware training. FP16 training is becoming a commodity now. ICLR 2020 has an interesting submission on state-of-the-art training results using an 8-bit floating point representation across ResNet, GNMT, and Transformer models.
Pruning removes some (not- or less-important) weights (or sometimes neurons) from the model, producing sparse weight matrices (or smaller layers). There is also research on removing entire matrices corresponding to attention heads of a transformer.
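As an illustration, here is a minimal sketch of unstructured magnitude pruning for a single linear layer (a hand-rolled version, not any particular library’s API; real setups usually keep a mask and fine-tune afterwards):

```python
import torch

def magnitude_prune(linear, sparsity=0.3):
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute value in a single nn.Linear layer."""
    with torch.no_grad():
        w = linear.weight
        k = int(sparsity * w.numel())
        if k > 0:
            threshold = w.abs().flatten().kthvalue(k).values
            mask = (w.abs() > threshold).float()
            w.mul_(mask)  # keep only the largest-magnitude weights
    return linear
```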
Quantization can be performed using TensorFlow Lite, a part of TensorFlow for on-device inference. TensorFlow Lite provides the tools to convert and run TensorFlow models on mobile, embedded and IoT devices. TensorFlow Lite supports post-training quantization and quantization-aware training.
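For example, post-training quantization of an exported model boils down to a few lines with the TFLite converter (the SavedModel path below is just a placeholder):

```python
import tensorflow as tf

# Convert a SavedModel with post-training (weight) quantization enabled.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```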
Another option is to use TensorRT framework from NVIDIA. NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
Recently NVIDIA announced TensorRT 6 with new optimizations that deliver inference for BERT-large in only 5.8 ms on T4 GPUs and 4.2 ms on V100. For the Titan RTX it should be faster: a rough estimate using the peak performance of these cards (you can find the numbers here) gives a 2x speedup, but in reality it’ll probably be smaller.
5.84 ms for a 340M parameters BERT-large model and 2.07 ms for a 110M BERT-base with a batch size of one are cool numbers. With a larger batch size of 128, you can process up to 250 sentences/sec using BERT-large.
More numbers can be found here.
PyTorch recently announced quantization support since version 1.3. It is experimental right now, but you can already start using it thanks to the tutorial in which dynamic quantization is applied to an LSTM language model converting the model weights to INT8.
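The mechanics boil down to a single call to torch.quantization.quantize_dynamic. Here is a sketch on a toy model (the layer sizes are arbitrary), converting the weights of all nn.Linear modules to INT8:

```python
import torch
import torch.nn as nn

# A stand-in for whatever float model you want to speed up.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

quantized_model = torch.quantization.quantize_dynamic(
    model,              # the float model
    {nn.Linear},        # layer types to quantize dynamically
    dtype=torch.qint8,  # store weights as 8-bit integers
)
```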
There is a well-known quantization of BERT called Q-BERT (from the “Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT” paper). The authors achieve performance comparable to the baseline with at most 2.3% degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters, and up to 4× compression of the embedding table as well as activations.
3. Distillation
Another interesting model compression method is distillation — a technique that transfers the knowledge of a large “teacher” network to a smaller “student” network. The “student” network is trained to mimic the behaviors of the “teacher” network.
A version of this strategy has already been pioneered by Rich Caruana and his collaborators. In their important paper, they demonstrate convincingly that the knowledge acquired by a large ensemble of models can be transferred to a single small model.
Geoffrey Hinton et al. showed this technique can be applied to neural networks in their paper called “Distilling the Knowledge in a Neural Network”.
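The core idea is to train the student on the teacher’s softened output distribution in addition to the hard labels. A minimal sketch of such a loss (the hyperparameter values are illustrative, not taken from any particular paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hinton-style distillation: mix a soft loss (KL between
    temperature-softened teacher and student distributions) with the
    usual hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean")
    soft_loss = soft_loss * (temperature ** 2)  # gradient scaling from the paper
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```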
DistilBERT
Since then this approach was applied to different neural networks, and you probably heard of a BERT distillation called DistilBERT by HuggingFace.
Finally, on October 2nd, a paper on DistilBERT called “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter” emerged; it was submitted to NeurIPS 2019.
DistilBERT is a smaller language model trained under the supervision of BERT, in which the authors removed the token-type embeddings and the pooler (used for the next sentence prediction task) and kept the rest of the architecture identical while reducing the number of layers by a factor of two.
You can use DistilBERT off-the-shelf with the help of the transformers python package by HuggingFace (formerly known as pytorch-transformers and pytorch-pretrained-bert). Version 2.0.0 of the package supports TensorFlow 2.0/PyTorch interoperability.
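A short usage sketch (the checkpoint name is the standard uncased DistilBERT release; the example sentence is arbitrary):

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

# Tokenize a sentence and extract its contextual representations.
ids = tokenizer.encode("Speeding up BERT is fun", add_special_tokens=True)
input_ids = torch.tensor([ids])

with torch.no_grad():
    hidden_states = model(input_ids)[0]  # (batch, seq_len, 768)
```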
DistilBERT authors also used a few training tricks from the recent RoBERTa paper which showed that the way BERT is trained is crucial for its final performance.
DistilBERT compares surprisingly well to BERT: authors were able to retain more than 95% of the performance while having 40% fewer parameters.
In terms of inference time, DistilBERT is more than 60% faster and smaller than BERT and 120% faster and smaller than ELMo+BiLSTM.
TinyBERT
A few days ago a new BERT distillation emerged — TinyBERT by Huawei.
To build a competitive TinyBERT, the authors first propose a new Transformer distillation method to distill the knowledge embedded in the teacher BERT. Specifically, they designed several loss functions to fit different representations from the BERT layers (a rough sketch of these losses follows the list):
- the output of the embedding layer;
- the hidden states and attention matrices derived from the Transformer layers;
- the logits output by the prediction layer.
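Here is the promised rough sketch of how such layer-wise fitting losses could look (the dictionary keys, the single shared projection, and the equal weighting of the terms are my simplifying assumptions, not the exact TinyBERT formulation):

```python
import torch
import torch.nn.functional as F

def transformer_distillation_losses(student, teacher, proj):
    """Fit a student to a teacher at three levels: embeddings,
    intermediate Transformer layers, and final logits.
    `student`/`teacher` are dicts of tensors; `proj` is a learned linear
    map from the student's hidden size to the teacher's."""
    # 1) output of the embedding layer
    embed_loss = F.mse_loss(proj(student["embeddings"]), teacher["embeddings"])

    # 2) hidden states and attention matrices from chosen Transformer layers
    hidden_loss = sum(
        F.mse_loss(proj(h_s), h_t)
        for h_s, h_t in zip(student["hidden_states"], teacher["hidden_states"])
    )
    attn_loss = sum(
        F.mse_loss(a_s, a_t)
        for a_s, a_t in zip(student["attentions"], teacher["attentions"])
    )

    # 3) logits from the prediction layer (soft targets)
    logit_loss = F.kl_div(
        F.log_softmax(student["logits"], dim=-1),
        F.softmax(teacher["logits"], dim=-1),
        reduction="batchmean",
    )
    return embed_loss + hidden_loss + attn_loss + logit_loss
```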
The attention-based fitting is inspired by recent findings that the attention weights learned by BERT can capture substantial linguistic knowledge, which suggests that this linguistic knowledge can be transferred well from the teacher BERT to the student TinyBERT. It is, however, ignored in existing KD methods for BERT, such as Distilled BiLSTM_SOFT, BERT-PKD, and DistilBERT.
Then, they proposed a novel two-stage learning framework including the general distillation and the task-specific distillation. At the general distillation stage, the original BERT without fine-tuning acts as the teacher model. The student TinyBERT learns to mimic the teacher’s behavior by executing the proposed Transformer distillation on the large scale corpus from the general domain. They obtained general TinyBERT that can be fine-tuned for various downstream tasks. At the task-specific distillation stage, they perform the data augmentation to provide more task-related materials for teacher-student learning and then re-execute the Transformer distillation on the augmented data.
Both stages are essential to improve the performance and generalization capability of TinyBERT.
TinyBERT is empirically effective and achieves results comparable to BERT-base on GLUE datasets while being 7.5x smaller and 9.4x faster at inference.
Waiting for it to be applied to BERT-large and XLNet-large models, and for the code to be published.
Other distillations
There are some other well-known distillations.
(2019/03) “Distilling Task-Specific Knowledge from BERT into Simple Neural Networks” paper distilled BERT into a single-layer BiLSTM achieving comparable results with ELMo, while using roughly 100 times fewer parameters and 15 times less inference time.
(2019/08) “Patient Knowledge Distillation for BERT Model Compression” paper proposed a Patient Knowledge Distillation approach that was among the first attempts to use hidden states of the teacher, not only the output from the last layer. Their student model patiently learned from multiple intermediate layers of the teacher model for incremental knowledge extraction. In their Patient-KD framework, the student is cultivated to imitate the representations only for the [CLS] token in the intermediate layers. The code is here.
(2019/09) A recent “Extreme Language Model Compression with Optimal Subwords and Shared Projections” paper submitted to ICLR 2020 focuses on a knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. They employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. The method is able to compress the BERT-base model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB.
TinyBERT’s results look better, but either way, a BERT-like model in 7 MB looks cool.
The described methods do not contradict each other, so we expect a significant synergy with these methods applied together to speeding up BERT (and other) models.
Release Notes
2019/10/05: added a link to DistilBERT paper.
2019/10/10: added a section on other well-known BERT distillations.
2019/10/10: added a section on Q-BERT.
2019/10/13: added a section on PyTorch 1.3 quantization support.