Explain the Transformer architecture and self-attention mechanism
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replaced recurrent models (RNNs and LSTMs) by relying entirely on self-attention. It uses an encoder-decoder structure: the encoder processes the entire input sequence in parallel, while the decoder generates the output autoregressively, one token at a time.
Self-attention computes relationships between every pair of tokens in a sequence. For each token, it creates Query (Q), Key (K), and Value (V) vectors via learned projections. Attention weights are computed as softmax(QK^T / sqrt(d_k)); the division by sqrt(d_k) keeps the dot products from growing large and saturating the softmax. The weights are then multiplied by V to produce a weighted sum of value vectors for each position. This lets every token attend to all others, capturing long-range dependencies without sequential processing.
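The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a full implementation; the projection matrices Wq, Wk, Wv and the random inputs are placeholders standing in for learned parameters and token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)      # shape (5, 4): one output per token
```

Note that the output has one row per input token: each row mixes information from all five positions, weighted by attention.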
Multi-head attention runs multiple attention operations in parallel with different learned projections, then concatenates and projects the results. This lets the model attend to different representation subspaces (e.g., syntax vs semantics). Positional encoding (sinusoidal or learned) injects sequence order since attention is permutation-invariant.
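The sinusoidal encoding mentioned above follows the fixed formula from the original paper: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). A short sketch (assuming an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding; each dimension pair uses a different
    frequency, so every position gets a unique pattern."""
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]   # even embedding dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dims: sine
    pe[:, 1::2] = np.cos(angles)            # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(10, 16)
# In a Transformer, pe is simply added to the token embeddings.
```

Because the encoding is deterministic, it extrapolates to sequence lengths not seen in training; learned positional embeddings trade that property for flexibility.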
Transformers displaced RNNs because they allow full parallelization across the sequence during training, keep the path length between any two tokens constant (mitigating vanishing gradients), and scale better with compute. Modern LLMs typically use decoder-only variants (GPT-style), which apply a causal mask so each token can only attend to earlier positions during autoregressive generation.
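The causal masking used by decoder-only models can be sketched as follows: future positions are set to -inf before the softmax, so their attention weights become exactly zero. The random score matrix here is a stand-in for QK^T / sqrt(d_k).

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions so each token attends only to itself and
    earlier tokens, as in GPT-style decoder-only models."""
    seq_len = scores.shape[-1]
    # Strictly upper-triangular mask marks the future positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Softmax over each row; exp(-inf) = 0 zeroes out future positions.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.random.default_rng(1).normal(size=(4, 4)))
```

In the resulting matrix, row i has nonzero weights only in columns 0..i, which is what makes next-token prediction well defined during training.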
Key Takeaways
- Self-attention computes Q, K, V for each token and aggregates information via weighted sums
- Multi-head attention captures different types of relationships in parallel
- Positional encoding is essential since attention is permutation-invariant
- Decoder-only architectures (GPT) dominate modern LLMs for generation