This is the part of the “Hardware for Deep Learning” series devoted to ASICs. The contents of the series are here.

As of the beginning of 2021, ASICs are the only real alternative to GPUs for:
1) deep learning training (definitely), and
2) inference (less so, because there are tools for using FPGAs with a not-so-steep learning curve, as well as ways to do efficient inference on CPUs).

Now, when every large company is launching its own DL or AI chip, it’s impossible to stay silent. So, ASICs.

Table of Contents

· Google TPU
TPU v1
TPU v2
TPU v3
TPU v4

JAX by Google Research is getting more and more popular.

DeepMind recently announced that they are using JAX to accelerate their research and have already developed a set of libraries on top of JAX.

More and more research papers are built using JAX — for example, the recent Vision Transformer (ViT).

JAX demonstrates impressive performance. Even the world’s fastest transformer is now built on JAX.

What is JAX?

And what are its strengths?

JAX is a pretty low-level library similar to NumPy but with several cool features:

  • Autograd — JAX can automatically differentiate native Python and NumPy code. It can differentiate through a large subset…
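To make the Autograd point concrete, here is a minimal sketch (my own toy example, not from the original post) of differentiating ordinary numerical Python code with `jax.grad`:

```python
import jax
import jax.numpy as jnp

# A plain numerical function written in NumPy style.
def tanh_loss(x):
    return jnp.sum(jnp.tanh(x) ** 2)

# jax.grad returns a new function that computes d(loss)/dx.
grad_fn = jax.grad(tanh_loss)

x = jnp.array([0.0, 1.0, -1.0])
g = grad_fn(x)
# Analytically, d/dx tanh(x)^2 = 2*tanh(x)*(1 - tanh(x)^2),
# so the gradient is 0 at x=0, positive at x=1, negative at x=-1.
print(g)
```

The same `grad_fn` can be composed further (e.g., `jax.grad(jax.grad(...))` for higher-order derivatives), which is one of the features that makes JAX attractive for research code.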

Photo by Arseny Togulev on Unsplash

OpenAI just published the paper “Language Models are Few-Shot Learners”, presenting a recent upgrade of their well-known GPT-2 model: the GPT-3 family of models, whose largest member (175B parameters), the “GPT-3” proper, is more than 100x larger than the largest (1.5B parameters) GPT-2!

TL;DR (Executive Summary)

  • No, you can’t download the model :)
  • And you probably can’t even train it from scratch unless you have a very powerful infrastructure.
  • The GPT-3 architecture is mostly the same as the GPT-2 one (there are minor differences, see below).
  • The largest GPT-3 model size is 100x larger than the largest GPT-2 model (175B vs. 1.5B parameters).

There are many floating-point formats you can hear about in the context of deep learning. Here is a summary of what they are about and where they are used.


The 80-bit IEEE 754 extended-precision binary floating-point format is typically known by its x86 implementation, which started with the Intel 8087 math co-processor (the good old times when CPUs did not support floating-point computations and the FPU was a separate co-processor). In this implementation it contains:

  • 1 bit sign
  • 15 bits exponent
  • 64 bits fraction
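The same sign/exponent/fraction decomposition applies to any IEEE 754 binary format. Since Python does not expose the 80-bit type directly, here is a small illustrative sketch (my own example) that splits the familiar 32-bit single-precision format into its bit fields:

```python
import struct

def float32_fields(x):
    """Split an IEEE 754 binary32 value into (sign, exponent, fraction) bit fields."""
    # Reinterpret the float's 4 bytes as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8-bit biased exponent (bias = 127)
    fraction = bits & 0x7FFFFF        # 23-bit fraction (implicit leading 1)
    return sign, exponent, fraction

# 1.0 is stored as sign=0, biased exponent=127 (i.e., 2^0), fraction=0.
print(float32_fields(1.0))   # (0, 127, 0)
# -2.0 = -1.0 * 2^1, so sign=1 and the biased exponent is 128.
print(float32_fields(-2.0))  # (1, 128, 0)
```

One notable difference of the 80-bit format: its 64-bit significand stores the leading integer bit explicitly, whereas binary32/binary64 leave it implicit.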

Photo by Adrien Olichon on Unsplash

Part 3: ACT in Transformers

Part 1 is here.
Part 2 is here.

Finally, ACT came into transformers.

The Universal Transformer applies the original idea of ACT to a transformer instead of an RNN.

The authors say they are adding the recurrent inductive bias of RNNs, together with a dynamic per-position halting mechanism, to the transformer. …

Photo by Adrien Olichon on Unsplash

Part 2: ACT in Residual Networks

Part 1 is here.

The next step in ACT development was made by Michael Figurnov (then an intern at Google and a Ph.D. student at the Higher School of Economics, now a researcher at DeepMind).

“Spatially Adaptive Computation Time for Residual Networks” by
Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, Ruslan Salakhutdinov

Photo by Adrien Olichon on Unsplash

Part 1: ACT in RNNs

There is an interesting, little-known topic of Adaptive Computation Time (ACT) in neural networks. It is applicable to different kinds of neural networks (RNNs, ResNets, Transformers), and you can use this rather general idea elsewhere as well.

The general idea is that some complex data might require more computation to produce a final result, while simple or unimportant data might require less. You can dynamically decide how long to process the data by training a neural network to adapt to the data automatically.
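A toy sketch of the halting mechanism (hand-picked halting probabilities rather than a trained network, and `epsilon` is my own illustrative choice): the model keeps “pondering” until the accumulated halting probability reaches 1 − ε, and the final step receives the leftover probability mass, so the per-step weights sum to 1 and can be used to average the intermediate states.

```python
import numpy as np

def act_halting(halt_probs, epsilon=0.01):
    """Given per-step halting probabilities, return the ACT step weights."""
    weights, total = [], 0.0
    for p in halt_probs:
        if total + p >= 1.0 - epsilon:
            # Halting condition reached: the last step gets the remainder R.
            weights.append(1.0 - total)
            break
        weights.append(p)
        total += p
    return np.array(weights)  # weights sum to 1 -> a weighted average of states

w = act_halting([0.1, 0.3, 0.4, 0.5])
print(w, w.sum())  # [0.1 0.3 0.4 0.2], 1.0 -> halted after 4 steps
```

Hard data would keep the cumulative probability low for many steps (more computation); easy data would cross the threshold almost immediately.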

The original work by Alex Graves is this one:
“Adaptive Computation Time for Recurrent Neural…

Computer-generated art sold by Christie’s

The Internet of Fakes

In the last few years, we have seen AI reach the plateau of productivity in the field of content generation.

We have heard news about artistic style transfer and face-swapping applications (aka deepfakes), natural voice generation (Google Duplex), music synthesis, automatic review generation, smart reply, and smart compose. Computer-generated art was even sold by Christie’s.

Where are we now? What’s next?

Let’s look at some examples to give you an intuition of what is possible right now and where we are headed.

Let’s start with images and videos.

Image processing

It’s been a long time since “photoshop” meant just a program name. A lot…

Source of the image

The heavy BERT

BERT became an essential ingredient of many NLP deep learning pipelines. It is considered a milestone in NLP, as ResNet is in the computer vision field.

The only problem with BERT is its size.

The BERT-base model contains 110M parameters. The larger variant, BERT-large, contains 340M parameters. It’s hard to deploy a model of such size into many environments with limited resources, such as mobile or embedded systems.

Not a month goes by without a new language model claiming to surpass the good old BERT (oh my god, it’s only 9 months old) in one aspect or another. We have seen XLNet, KERMIT, ERNIE, MT-DNN, and so on.

Now BERT strikes back. Meet RoBERTa (for Robustly optimized BERT approach).

Bert and Ernie reading the latest research paper

The authors found that BERT was significantly undertrained and that, with a better training procedure, it can match or exceed the performance of every model published after it.

The magic is an improved recipe for training BERT models. The modifications are simple; they include:

(1) Training the model longer, with bigger batches, over more data.

Original BERT was trained…

Grigory Sapunov

ML/DL/AI expert. Software engineer with 20+ years programming experience. Loves Life Sciences. CTO and co-Founder of Intento. Google Developer Expert in ML.
