Overview
This course explores a paper that introduces a linear attention mechanism for transformers, reducing the compute and memory cost of self-attention from quadratic to linear in sequence length while revealing a connection between autoregressive transformers and RNNs. Learning outcomes include understanding the formulation of linear attention, its impact on transformer performance, and its relationship to RNNs, as well as softmax attention, the quadratic complexity it incurs, kernel functions, and the paper's experiments. The teaching method is a detailed walkthrough of the paper's concepts and findings. The course is intended for anyone interested in deep learning, transformer models, attention mechanisms, and the optimization of neural networks.
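The core trick the course covers can be sketched in a few lines: replacing the softmax similarity exp(q·k) with a kernel feature map φ(q)·φ(k) lets the attention product be reassociated so the N×N attention matrix is never formed. The sketch below is illustrative, not the authors' implementation; the elu(x)+1 feature map is one choice discussed in this setting.

```python
import numpy as np

def feature_map(x):
    # A simple positive feature map phi(x); elu(x) + 1 is one common choice.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Linear attention: O(N) in sequence length N.

    Softmax attention computes normalize(Q K^T) V via an N x N matrix.
    With phi(q) . phi(k) in place of exp(q . k), the product reassociates
    as phi(Q) (phi(K)^T V), so cost is linear in N.
    """
    Qp, Kp = feature_map(Q), feature_map(K)       # (N, d) feature-mapped
    KV = Kp.T @ V                                 # (d, d_v) key/value summary
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T      # (N, 1) normalizer
    return (Qp @ KV) / Z

# Usage: toy random queries, keys, and values.
N, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Note that the output matches the explicit (N×N) kernel attention exactly; only the order of matrix multiplications changes.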
Syllabus
- Intro & Overview
- Softmax Attention & Transformers
- Quadratic Complexity of Softmax Attention
- Generalized Attention Mechanism
- Kernels
- Linear Attention
- Experiments
- Intuition on Linear Attention
- Connecting Autoregressive Transformers and RNNs
- Caveats with the RNN connection
- More Results & Conclusion
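The syllabus item on connecting autoregressive transformers and RNNs refers to the causal form of linear attention, which can be evaluated as a recurrence over a fixed-size state. A hedged sketch of that idea (not the authors' code; the elu(x)+1 feature map is an assumption carried over from the non-causal sketch):

```python
import numpy as np

def feature_map(x):
    # Positive feature map phi(x); elu(x) + 1 is one common choice.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Causal linear attention computed as an RNN-style recurrence.

    The state S accumulates outer products phi(k_t) v_t^T and z accumulates
    phi(k_t); each output attends only to past positions, so autoregressive
    generation needs constant memory per step instead of a growing cache.
    """
    N, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))    # running sum of key-value outer products
    z = np.zeros(d)           # running sum of key features (normalizer)
    out = np.zeros((N, d_v))
    for t in range(N):
        q, k, v = feature_map(Q[t]), feature_map(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z)
    return out

# Usage: toy causal sequence.
N, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = causal_linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

The per-step update of (S, z) is exactly an RNN hidden-state update, which is the transformer–RNN connection the course discusses, along with its caveats.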
Taught by
Yannic Kilcher