Paper, Tags: #nlp, #architectures
The Transformer is based solely on attention mechanisms; compared with recurrent and convolutional models, it is more parallelizable and requires significantly less time to train.
Recurrent models compute sequentially along the symbol positions, which precludes parallelization within training examples; this becomes critical at longer sequence lengths, since memory constraints limit batching across examples.
Transformer: no recurrence; relies entirely on an attention mechanism to draw global dependencies between input and output, which allows significantly more parallelization.
Learning dependencies between distant positions takes only a constant number of operations. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence.
Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolutions.
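A minimal NumPy sketch of single-head scaled dot-product self-attention (the function name, toy dimensions, and random weights are illustrative assumptions; the real model uses learned projections and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d_model) sequence; returns (n, d_k) representations.

    Every output position is a weighted sum over *all* positions, so
    relating two distant positions takes a single matrix operation."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # project the same sequence to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n, n) compatibility of each query with each key
    weights = softmax(scores, axis=-1)     # attention weights, rows sum to 1
    return weights @ V                     # weighted sum of the values

# toy example: sequence of 5 positions, model dim 16, head dim 8
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```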
It uses stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder:
- Encoder: each layer has two sub-layers, a multi-head self-attention mechanism and a simple, position-wise fully-connected feed-forward network
- Multi-head attention: performs the attention function in parallel over several projected versions of the queries, keys, and values; the outputs are concatenated and once again projected (see the sketch after this list)
- Decoder: in addition to the encoder's two sub-layers, each layer inserts a third sub-layer that performs multi-head attention over the output of the encoder stack
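A sketch of the multi-head mechanism, continuing the toy setup and `softmax` from the block above; the head count and dimensions are illustrative assumptions, and the residual connections and layer normalization that wrap each sub-layer in the paper are omitted here:

```python
def multi_head_attention(X_q, X_kv, params):
    """params: (list of per-head (Wq, Wk, Wv) tuples, output projection Wo).

    Each head runs the attention function independently (in parallel in
    practice); the head outputs are concatenated and projected once more.
    For encoder self-attention X_q is X_kv; in the decoder's attention over
    the encoder, X_kv comes from the encoder stack's output."""
    heads, Wo = params
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X_q @ Wq, X_kv @ Wk, X_kv @ Wv
        d_k = Q.shape[-1]
        w = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        outs.append(w @ V)
    return np.concatenate(outs, axis=-1) @ Wo  # (n_q, d_model)

# toy setup: 2 heads of dim 8 over d_model = 16
h, d_v = 2, 8
heads = [tuple(rng.normal(size=(d_model, d_v)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(h * d_v, d_model))
print(multi_head_attention(X, X, (heads, Wo)).shape)  # (5, 16)
```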
An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, with the weight assigned to each value computed by a compatibility function of the query with the corresponding key.
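In the paper the compatibility function is a scaled dot product, giving

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of the queries and keys; scaling by $1/\sqrt{d_k}$ keeps large dot products from pushing the softmax into regions with very small gradients.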
Self-attention layers are compared to recurrent and convolutional layers on total computational complexity per layer and on the amount of computation that can be parallelized (the minimum number of sequential operations required).
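For reference, the figures from the paper's comparison (n = sequence length, d = representation dimension):
- Self-attention: O(n² · d) per layer, O(1) sequential operations
- Recurrent: O(n · d²) per layer, O(n) sequential operations

So self-attention is cheaper per layer whenever n < d, which is the common case for sentence representations.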