This is a part about ASICs from the “Hardware for Deep Learning” series. The content of the series is here.
As of the beginning of 2021, ASICs are the only real alternative to GPUs for
1) deep learning training (definitely), and
2) inference (less so, because there are tools for using FPGAs with a not-so-steep learning curve, as well as ways to do efficient inference on CPUs).
Now, when every large company launches its own DL or AI chip, it’s impossible to stay silent. So, ASICs.
JAX by Google Research is getting more and more popular.
Deepmind recently announced they are using JAX to accelerate their research and already developed a set of libraries on top of JAX.
More and more research papers are built using JAX, say, the recent Vision Transformer (ViT).
And what are its strengths?
JAX is a pretty low-level library similar to NumPy but with several cool features:
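To give a taste of those features, here is a minimal toy sketch (not from the article): `jax.grad` differentiates plain NumPy-style code, and `jax.jit` compiles it with XLA.

```python
import jax
import jax.numpy as jnp

# An ordinary NumPy-style function: a toy squared loss.
def loss(w, x):
    return jnp.sum((w * x) ** 2)

grad_loss = jax.grad(loss)  # automatic differentiation w.r.t. the first argument
fast_loss = jax.jit(loss)   # XLA-compiled version of the same function

w = jnp.array([1.0, 2.0])
x = jnp.array([3.0, 4.0])

g = grad_loss(w, x)  # d/dw sum((w*x)^2) = 2*w*x^2 → [18., 64.]
y = fast_loss(w, x)  # 3^2 + 8^2 = 73.0
```

The same function works under `jax.vmap` for automatic batching, which is part of what makes JAX so composable.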
OpenAI just published a paper, “Language Models are Few-Shot Learners”, presenting an upgrade of their well-known GPT-2 model: the GPT-3 family of models. The largest of them (175B parameters), “GPT-3” itself, is 100x larger than the largest GPT-2 (1.5B parameters)!
There are many floating-point formats you may hear about in the context of deep learning. Here is a summary of what they are about and where they are used.
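As a quick illustration (a small Python sketch, not from the article), the standard library’s `struct` module lets you inspect the 1 sign + 8 exponent + 23 mantissa bit layout of the most common format, binary32 (float32):

```python
import struct

def float32_bits(x):
    """Return the IEEE 754 binary32 representation of x as a 32-character bit string."""
    [n] = struct.unpack(">I", struct.pack(">f", x))  # reinterpret float bits as uint32
    return f"{n:032b}"

bits = float32_bits(1.0)
sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
# 1.0 = (+1) * 2^(127-127) * 1.0 → sign "0", exponent "01111111", mantissa all zeros
```

The same trick with `">H"`/`">e"` works for binary16 (fp16), which has 5 exponent and 10 mantissa bits.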
An 80-bit IEEE 754 extended-precision binary floating-point format, best known from the x86 implementation that started with the Intel 8087 math co-processor (the good old times when CPUs did not support floating-point computations and the FPU was a separate co-processor). In this implementation it contains:
Finally, ACT came into transformers.
The Universal Transformer exploits the original idea of ACT applied to a transformer instead of an RNN.
The authors say they add the recurrent inductive bias of RNNs to the transformer, together with a dynamic per-position halting mechanism. …
Part 1 is here.
The next step in ACT development was made by Michael Figurnov (then an intern at Google and a Ph.D. student at the Higher School of Economics, now a researcher at DeepMind).
“Spatially Adaptive Computation Time for Residual Networks” by
Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, Ruslan Salakhutdinov
There is an interesting little-known topic of Adaptive Computation Time (ACT) in Neural Networks. It is applicable to different kinds of neural networks (RNN, ResNet, Transformer) and you can use this rather general idea somewhere else as well.
The general idea is that some complex data might require more computation to produce a final result, while some simple or unimportant data might require less. And you can dynamically decide how long to process the data by training a neural network to adapt to the data automatically.
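The halting mechanism behind this idea can be sketched roughly as follows (a simplified toy version: in the real ACT mechanism, the network itself produces a halting probability from its state at each step):

```python
def act_halting(halt_probs, eps=0.01):
    """Adaptive Computation Time halting (simplified sketch):
    accumulate per-step halting probabilities until they would exceed 1 - eps;
    the final step receives the remainder, so the weights sum to 1 and can be
    used to average the intermediate states."""
    weights, total = [], 0.0
    for p in halt_probs:
        if total + p >= 1.0 - eps:
            weights.append(1.0 - total)  # remainder assigned to the halting step
            break
        weights.append(p)
        total += p
    return weights

# Example: halting probabilities [0.2, 0.5, 0.6] → computation stops at step 3
w = act_halting([0.2, 0.5, 0.6])  # [0.2, 0.5, 0.3]
```

The number of steps taken (here, the length of `w`) is what varies per input, which is exactly where the adaptive computation savings come from.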
In the last few years, we have seen AI reaching a productivity plateau in the field of content generation.
We have heard news about artistic style transfer and face-swapping applications (aka deepfakes), natural voice generation (Google Duplex) and music synthesis, automatic review generation, smart reply, and smart compose. Computer-generated art has even been sold at Christie’s.
Where are we now? What’s next?
Let’s look at some examples to give you an intuition of what is possible right now and where we are headed.
Let’s start with images and videos.
It’s been a long time since “photoshop” meant only a program name. A lot…
BERT became an essential ingredient of many NLP deep learning pipelines. It is considered a milestone in NLP, as ResNet is in the computer vision field.
The only problem with BERT is its size.
The BERT-base model contains 110M parameters. The larger variant, BERT-large, contains 340M parameters. It’s hard to deploy a model of such size into many environments with limited resources, such as mobile or embedded systems.
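To get a feel for where those 110M and 340M figures come from, here is a back-of-the-envelope count of the encoder parameters (a sketch assuming the standard BERT hyperparameters; it ignores the task-specific heads, so it lands slightly under the reported numbers):

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Rough parameter count for a BERT-style encoder (embeddings + layers + pooler)."""
    ffn = 4 * hidden  # BERT's feed-forward intermediate size is 4x the hidden size
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden  # + LayerNorm
    per_layer = (
        4 * (hidden * hidden + hidden)  # Q, K, V and output projections (with biases)
        + 2 * hidden                    # attention LayerNorm
        + hidden * ffn + ffn            # FFN up-projection
        + ffn * hidden + hidden         # FFN down-projection
        + 2 * hidden                    # output LayerNorm
    )
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

base = bert_param_count(12, 768)    # ~109.5M, reported as 110M
large = bert_param_count(24, 1024)  # ~335M, reported as 340M
```

Note how the per-layer term scales quadratically with the hidden size, which is why BERT-large is roughly three times heavier despite having only twice the layers.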
Now it strikes back. Meet RoBERTa (for Robustly optimized BERT approach).
The authors found that BERT was significantly undertrained and, with the right training, can match or exceed the performance of every model published after it.
The magic is an improved recipe for training BERT models. The modifications are simple; they include:
(1) Training the model longer, with bigger batches, over more data.
Original BERT was trained…
ML/DL/AI expert. Software engineer with 20+ years of programming experience. Loves Life Sciences. CTO and co-founder of Intento. Google Developer Expert in ML.