Attention Is All You Need
Ashish Vaswani, Noam Shazeer et al.
Introduction
- Dominant models in sequence transduction use complex recurrent or convolutional neural networks.
- The proposed Transformer relies solely on attention mechanisms, eliminating recurrence and convolutions.
- The Transformer is more parallelizable, requires significantly less training time, and achieves superior translation quality.
Background
- Recurrent models are constrained by inherently sequential computation, which restricts parallelization within training examples.
- Attention mechanisms allow for modeling dependencies regardless of their distance in sequences.
- The Transformer relies entirely on self-attention to compute representations of its input and output, without using RNNs or convolution.
Model Architecture
- The Transformer has an encoder-decoder structure using stacked self-attention and fully connected layers.
- Encoder and decoder each consist of N = 6 identical layers, with a residual connection around each sub-layer followed by layer normalization (see the sketch below).
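A minimal NumPy sketch of the residual sub-layer pattern described above. The `layer_norm` and `residual_sublayer` helpers are illustrative stand-ins for the paper's parameterized layers, not its actual implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Each sub-layer output is LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Example: stack N = 6 identical layers, each with a placeholder sub-layer.
x = np.random.randn(10, 512)                 # (sequence length, d_model)
for _ in range(6):
    x = residual_sublayer(x, lambda h: h)    # identity stands in for attention / FFN
```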
Encoder and Decoder Stacks
- Each encoder layer has two sub-layers: multi-head self-attention and a position-wise feed-forward network.
- Each decoder layer adds a third sub-layer that performs multi-head attention over the encoder output; its self-attention sub-layer is masked so positions cannot attend to subsequent positions (a masking sketch follows below).
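A hedged sketch of the decoder's look-ahead mask, assuming NumPy and an additive mask that places -inf on future positions so softmax assigns them zero weight; the exact masking mechanics are not spelled out in these notes.

```python
import numpy as np

def causal_mask(seq_len):
    # Upper-triangular mask: position i may attend only to positions <= i.
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```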
Attention Mechanism
- An attention function maps a query and a set of key-value pairs to an output, computed as a weighted sum of the values.
- Scaled Dot-Product Attention computes the weights by applying softmax to dot products scaled by $1/\sqrt{d_k}$.
Scaled Dot-Product Attention
- Queries (Q), keys (K), and values (V) are packed into matrices to compute the outputs (a NumPy sketch follows the formula).
- Formula: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
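A minimal NumPy sketch of the formula above. The optional `mask` argument (an additive mask such as the decoder's look-ahead mask) is an interface assumption for illustration, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask            # -inf on disallowed positions
    weights = softmax(scores, axis=-1)    # attention weights sum to 1 over keys
    return weights @ V

# Example: 4 query positions, 6 key/value positions, d_k = d_v = 64.
Q = np.random.randn(4, 64)
K = np.random.randn(6, 64)
V = np.random.randn(6, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 64)
```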