Adaptive Computation Time (ACT) in Neural Networks [3/3]
Part 3: ACT in Transformers
Part 1 is here.
Part 2 is here.
Finally, ACT made its way into transformers.
The Universal Transformer applies the original idea of ACT to a transformer instead of an RNN.
The authors say they are adding the recurrent inductive bias of RNNs, together with a dynamic per-position halting mechanism, to the transformer. From my point of view, this is far from the recurrent inductive bias of RNNs (there is no recurrence along the sequence), yet very close to the original ACT, which was already applied to a network with a recurrent inductive bias (each element is processed several times by the same function until a halting criterion is met).
“Universal Transformers” by Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser
Paper: https://arxiv.org/abs/1807.03819
Slides: here
Code: here
Some other implementations: https://github.com/topics/universal-transformer
Author’s blog: https://mostafadehghani.com/2019/05/05/universal-transformers/
Google’s blog: https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
#3. Universal Transformer (UT)
Architecture
There are two flavors of UT in the paper: 1) UT with a fixed number of repetitions, and 2) UT with dynamic halting.
The Universal Transformer repeatedly refines a series of vector representations for each position of the sequence in parallel, by combining information from different positions using self-attention and applying a recurrent transition function across all time steps. The number of time steps, T, is arbitrary but fixed (no ACT here).
Depending on the task, the authors use one of two different transition functions: either a separable convolution or a fully-connected neural network consisting of a single rectified-linear activation between two affine transformations, applied position-wise (similar to the original Transformer).
In fact, Universal Transformer is a recurrent function (not in time, but in depth) that evolves per-symbol hidden states in parallel, based at each step on the sequence of previous hidden states. In that sense, UT is similar to architectures such as the Neural GPU and the Neural Turing Machine.
When running for a fixed number of steps, the Universal Transformer is equivalent to a multi-layer Transformer with tied parameters across its layers.
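To make this equivalence concrete, here is a minimal PyTorch-style sketch (mine, not the authors' code) of a fixed-depth UT encoder: a single shared block applied T times, which is exactly a T-layer Transformer with tied weights. The coordinate (position + timestep) embeddings that the paper adds at every step are omitted for brevity.

```python
import torch.nn as nn

class FixedDepthUTEncoder(nn.Module):
    """Sketch: one shared Transformer block applied T times (no ACT)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, T=6):
        super().__init__()
        # A single block whose weights are reused at every step --
        # equivalent to a T-layer Transformer with tied parameters.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, d_ff, batch_first=True)
        self.T = T

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        for _ in range(self.T):
            x = self.shared_block(x)
        return x
```

Note that the feed-forward part of this standard block is already the position-wise "two affine transformations with a ReLU in between" transition function mentioned above.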
Universal Transformer with dynamic halting modulates the number of computational steps needed to process each input symbol dynamically based on a scalar pondering value that is predicted by the model at each step. The pondering values are in a sense the model’s estimation of how much further computation is required for the input symbols at each processing step.
More precisely, Universal Transformer with dynamic halting adds a dynamic ACT halting mechanism to each position in the input sequence. Once the per-symbol recurrent block halts (indicating a sufficient number of revisions for that symbol), its state is simply copied to the next step until all blocks halt or we reach a maximum number of steps. The final output of the encoder is then the final layer of representations.
As the recurrent transition function can be applied any number of times, this implies that UTs can have variable depth (number of per-symbol processing steps).
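Here is a rough sketch of how per-position halting can look (my simplification, not the authors' implementation: the full ACT formulation also keeps a remainder and returns a halting-probability-weighted average of the intermediate states, which is skipped here in favor of plain copying, as described above).

```python
import torch
import torch.nn as nn

def run_with_dynamic_halting(state, step_fn, halt_proj, max_steps=10, eps=0.01):
    """state:     (batch, seq_len, d_model) initial per-symbol representations
    step_fn:   shared block (self-attention + transition function)
    halt_proj: nn.Linear(d_model, 1) predicting a per-position halting score
    """
    batch, seq_len, _ = state.shape
    halting_prob = torch.zeros(batch, seq_len, device=state.device)
    still_running = torch.ones(batch, seq_len, device=state.device)

    for _ in range(max_steps):
        if still_running.sum() == 0:          # every position has halted
            break
        # One more revision for positions that have not halted yet;
        # halted positions simply copy their state to the next step.
        updated = step_fn(state)
        mask = still_running.unsqueeze(-1)
        state = mask * updated + (1.0 - mask) * state

        # Accumulate halting probabilities; freeze positions crossing 1 - eps.
        p = torch.sigmoid(halt_proj(state)).squeeze(-1)   # (batch, seq_len)
        halting_prob = halting_prob + p * still_running
        still_running = still_running * (halting_prob <= 1.0 - eps).float()
    return state
```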
Turing completeness
The interesting aspect of UT is its universality. Under certain assumptions, the resulting Universal Transformer can be shown to be Turing-complete (or “computationally universal”).
Remember, RNNs are considered Turing-complete (here is a proof by Hava Siegelmann and Eduardo Sontag). Yet this is disputed, and some argue it does not hold in practice (the construction relies on unbounded-precision weights and unbounded running time).
Regarding the original Transformer: assuming finite precision, it can be shown that it is not computationally universal (though there is some controversy here as well).
While the Transformer executes a total number of operations that scales with the input size, the number of sequential operations is constant and independent of the input size, determined solely by the number of layers.
An intuitive example from the author’s blog is a function whose execution requires sequentially processing each input element. In this case, for any given depth T, one can construct an input sequence of length N > T that a Transformer cannot process correctly.
The author states:
“Unlike the standard Transformer — which cannot be computationally universal as the number of sequential operations is constant — we can choose the number of steps as a function of the input length in the Universal Transformer. This holds independently of whether or not adaptive computation time is employed but does assume a non-constant, even if possibly deterministic, number of steps.”
The authors show that they can reduce a Neural GPU (a Turing-complete model capable, in principle, of learning arbitrary algorithms, much like a Neural Turing Machine) to a Universal Transformer.
More on this topic can be found in the author’s blog.
Results
The results are promising.
On the bAbI question answering dataset, the subject-verb agreement task, LAMBADA language modeling, algorithmic tasks (Copy, Reverse, and integer Addition, all on strings of decimal digits ‘0’-‘9’), “learning to execute” tasks, and machine translation, the results are really good.
The gap over the ordinary Transformer (and sometimes over the LSTM) is significant.
Additionally, weight sharing in depth leads to better performance of UTs (compared to the standard Transformer) on very small datasets and allows the UT to be a very data efficient model, making it attractive for domains and tasks with limited available data.
Interesting model, worth trying.
#3b. Adaptive attention span
Another interesting place for applying adaptivity in transformers is attention span.
The paper called “Adaptive Attention Span in Transformers” proposes a novel self-attention mechanism that can learn its optimal attention span.
Paper: https://arxiv.org/abs/1905.07799
Code: https://github.com/facebookresearch/adaptive-span
Post: https://ai.facebook.com/blog/making-transformer-networks-simpler-and-more-efficient/
The problem with the vanilla transformer is its fixed context size (or attention span). Moreover, the context cannot be made very large because of the computational cost of the attention mechanism, which is O(n²) in the sequence length.
There are some solutions to these problems (like Transformer XL or Sparse Transformer, and now we have a Compressive Transformer as well).
Here is an alternative approach: let the layer (or even the attention head) decide the required context size on its own.
There are two options, learnable (called the adaptive attention span) and ACT-like (called the dynamic attention span).
The adaptive attention span lets each attention head learn its own attention span independently of the other heads (this is done with a non-increasing soft masking function). The span is learnable, but it stays fixed once training is done.
Dynamic attention span changes the span dynamically depending on the current input.
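As an illustration of the learnable (adaptive) variant, here is a rough PyTorch-style sketch (mine, not the facebookresearch/adaptive-span code) of the soft, non-increasing mask from the paper, m_z(x) = clamp((R + z - x) / R, 0, 1), where x is the query-key distance, z is a learned span per head, and R controls the softness of the ramp; the paper additionally puts an L1 penalty on z so spans stay as small as the task allows.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft mask over attention weights, one learnable span per head."""
    def __init__(self, n_heads, max_span, ramp=32):
        super().__init__()
        self.max_span, self.ramp = max_span, ramp
        # Span parameterized in [0, 1] via a sigmoid and scaled to max_span.
        self.z = nn.Parameter(torch.zeros(n_heads, 1))

    def forward(self, attn_scores, distances):
        # attn_scores: (batch, n_heads, q_len, k_len) pre-softmax scores
        # distances:   (q_len, k_len) distance of each key from each query
        span = torch.sigmoid(self.z) * self.max_span               # (n_heads, 1)
        mask = ((self.ramp + span.unsqueeze(-1) - distances)
                / self.ramp).clamp(0, 1)                           # (n_heads, q_len, k_len)
        # Mask the attention weights and renormalize, as in the paper.
        weights = torch.softmax(attn_scores, dim=-1) * mask.unsqueeze(0)
        return weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)
```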
In addition to these changes, the authors incorporated the relative position embeddings of Shaw et al. (2018) and the caching mechanism of Dai et al. (2019) to speed up training and evaluation.
Results are cool. The models are smaller, the performance is better.
Internally, larger adaptive spans are learned only where they are needed.
And the dynamic spans adapt to the input sequence: the span increases at the beginning of words and in the middle of composed words, e.g., to predict the “l” in “overlook”.
#3c. ALBERT
As you have already seen, UT without ACT is a very strong baseline on its own. Maybe the repetitions themselves are the key, and adaptivity is not required? As in the case of Repeat-RNN?
This could be true as well (at least given enough repetitions).
Another piece of evidence is ALBERT (“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”), a modified BERT-style model.
Paper: https://arxiv.org/abs/1909.11942
Code: https://github.com/google-research/ALBERT
Post: https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html
As I wrote earlier, ALBERT incorporates two parameter reduction techniques.
The first one is a factorized embedding parameterization, separating the size of the hidden layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings.
The second technique (more relevant to our current topic) is cross-layer parameter sharing. This technique prevents the number of parameters from growing with the depth of the network.
Both techniques significantly reduce the number of parameters for BERT without seriously hurting performance, thus improving parameter-efficiency.
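A toy sketch of both ideas (my own, not the official ALBERT code, and ignoring details such as GELU activations, layer norm placement, and position/segment embeddings):

```python
import torch.nn as nn

class AlbertStyleEncoder(nn.Module):
    """Sketch: factorized embeddings + one encoder block shared across all layers."""
    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=768,
                 n_heads=12, n_layers=12):
        super().__init__()
        # Factorized embedding: V*E + E*H parameters instead of V*H.
        self.token_embed = nn.Embedding(vocab_size, embed_size)
        self.embed_proj = nn.Linear(embed_size, hidden_size)
        # Cross-layer sharing: a single block; depth = how many times it is applied.
        self.shared_layer = nn.TransformerEncoderLayer(
            hidden_size, n_heads, 4 * hidden_size, batch_first=True)
        self.n_layers = n_layers

    def forward(self, token_ids):
        x = self.embed_proj(self.token_embed(token_ids))
        for _ in range(self.n_layers):   # same weights at every depth, as in UT without ACT
            x = self.shared_layer(x)
        return x
```

With these toy numbers, the embedding block costs 30000·128 + 128·768 ≈ 3.9M parameters instead of 30000·768 ≈ 23M, and the encoder stores one layer’s weights instead of twelve.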
As the authors wrote:
“We observed that the network often learned to perform similar operations at various layers, using different parameters of the network. This possible redundancy is eliminated in ALBERT by parameter-sharing across the layers, i.e., the same layer is applied on top of each other.”
As you see, this is similar to UT applying the same transformation a predefined number of times.
In the case of UT (without ACT) you simply use the number of repetitions as a parameter, while in the case of ALBERT it is done by building an architecture with a predefined number of identical layers. Computationally, this is the same.
In ALBERT the main goal was to reduce the number of parameters (and then to scale the model up again). So the authors look at the cost of such a decision (reducing the parameter count by sharing): how much does it hurt performance compared with a BERT-style model, where no parameters are shared and all the layers are different? Not surprisingly from this point of view, it hurts (a bit).
But looking from the opposite side, at how depth (the number of layers) affects ALBERT’s performance, we see that performance increases significantly.
If we compare a 3-layer ALBERT model with a 1-layer ALBERT model, although they have the same number of parameters, the performance differs significantly. However, there are diminishing returns when continuing to increase the number of layers: the results of a 12-layer network are relatively close to the results of a 24-layer network, and the performance of a 48-layer network appears to decline.
It would be interesting to compare Universal Transformer with ALBERT given the same parameter budget.
ALBERT’s results are very impressive.
Conclusions
So, this part concludes the series on ACT.
As you see, the idea of ACT is rather general and can be applied to many different things (and even more things have not been touched by ACT yet, so treat this as a call to action).
The possible path from simple (fixed) solutions to more complex ones (like ACT) looks like this:
- Start with something fixed (number of layers, attention span, activation function, you-name-it)
- Add learnability to the model (say, learn attention span; add learned parameter to an activation function, as in PReLU; it’s strange we did not see a case of a learned number of repetitions in UT or somewhere else — let me know if I missed it)
- Add adaptivity to the model (to allow the model to dynamically adjust the required parameters, whether it is an attention span or the number of repetitions; BTW, have you seen an activation function with this property? Maybe Backpropamine can be treated as a close relative).
I’d expect more ACT-related solutions in the future.