Attention is all you need, Vaswani et al., 2017

Paper, Tags: #nlp, #architectures

The Transformer is based solely on attention mechanisms; it is more parallelizable and requires significantly less time to train than recurrent or convolutional models.

Recurrent models compute sequentially along the symbol positions, which precludes parallelization within training examples; this becomes critical at longer sequence lengths, since memory constraints limit batching across examples.

The Transformer avoids recurrence entirely and relies on an attention mechanism to draw global dependencies between input and output, allowing significantly more parallelization.

Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. In the Transformer, relating any two positions takes a constant number of operations, regardless of their distance in the sequence.

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolutions.

Model

It uses stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder:

  • Encoder: each layer has a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network
    • Multi-head attention: performs the attention function in parallel over several heads; the outputs are concatenated and once again projected (see the sketch after this list)
  • Decoder: each layer adds a third sub-layer that performs multi-head attention over the output of the encoder stack; its own self-attention is masked so a position cannot attend to subsequent positions
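
A minimal NumPy sketch of the multi-head attention sub-layer described above: the inputs are projected once per head, the attention function runs on each head in parallel, and the concatenated outputs are projected again. The random weight matrices stand in for learned parameters, the shapes follow the paper's base model (h = 8, d_model = 512, d_k = 64), and the helper names are illustrative, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V — the attention function described in the next section
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(x, h=8, d_model=512, seed=0):
    # project the input h times, attend in parallel, concatenate, project once more;
    # random matrices stand in for the learned projections W_q, W_k, W_v, W_o
    rng = np.random.default_rng(seed)
    d_k = d_model // h                                    # 64 per head in the base model
    heads = []
    for _ in range(h):
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
        heads.append(scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v))
    W_o = rng.standard_normal((h * d_k, d_model)) * 0.02  # final output projection
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.default_rng(1).standard_normal((10, 512))   # 10 positions, d_model = 512
print(multi_head_attention(x).shape)                      # -> (10, 512)
```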

Attention

An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
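
Concretely, the paper's scaled dot-product attention computes the weights with a softmax over dot products of queries and keys, scaled by the square root of the key dimension d_k:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```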

Why self-attention

Self-attention is compared against recurrent and convolutional layers on three criteria: total computational complexity per layer, the amount of computation that can be parallelized (measured by the minimum number of sequential operations), and the maximum path length between any two positions, which governs how easily long-range dependencies can be learned.
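
The paper's comparison of layer types, for sequence length n, representation dimension d, and convolution kernel width k (shorter maximum path lengths make long-range dependencies easier to learn):

```latex
\begin{array}{lccc}
\text{Layer type} & \text{Complexity per layer} & \text{Sequential operations} & \text{Maximum path length} \\
\text{Self-attention} & O(n^2 \cdot d) & O(1) & O(1) \\
\text{Recurrent} & O(n \cdot d^2) & O(n) & O(n) \\
\text{Convolutional} & O(k \cdot n \cdot d^2) & O(1) & O(\log_k n)
\end{array}
```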