
The Inner Workings of Multihead Latent Attention (MLA)

Multihead Latent Attention (MLA), introduced by DeepSeek in their V2 model, is an alternative to standard attention (and to variants such as MQA and GQA) that dramatically reduces the memory bandwidth required by the attention calculation by compressing the key-value cache into a much smaller latent vector per token.
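
To make that concrete, here is a minimal sketch of the key-value compression step. The dimensions and weight names (`d_latent`, `W_dkv`, `W_uk`, `W_uv`) are illustrative assumptions, not DeepSeek-V2's actual configuration, and the decoupled RoPE keys are omitted:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only to make the arithmetic easy to follow.
d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

# Down-projection: compress each token's hidden state into a small latent
# vector. This latent is what gets cached, instead of full keys and values.
W_dkv = nn.Linear(d_model, d_latent, bias=False)

# Up-projections: reconstruct per-head keys and values from the latent.
W_uk = nn.Linear(d_latent, n_heads * d_head, bias=False)
W_uv = nn.Linear(d_latent, n_heads * d_head, bias=False)

x = torch.randn(1, 16, d_model)   # (batch, seq_len, d_model)
c_kv = W_dkv(x)                   # (1, 16, d_latent)  <- this is cached
k = W_uk(c_kv).view(1, 16, n_heads, d_head)
v = W_uv(c_kv).view(1, 16, n_heads, d_head)

# Standard attention caches k and v: 2 * n_heads * d_head = 1024 values per token.
# MLA caches only c_kv: d_latent = 128 values per token, an 8x reduction here.
print(c_kv.shape, k.shape, v.shape)
```

Because the cache holds only the latent vectors, each decoding step reads far less data from memory, which is where the bandwidth savings come from.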

Patterns and Messages - Part 5 - The Residual Stream

Something I find really helpful about this merged-matrix perspective is that it puts everything in “model space”. The patterns and messages are vectors with the same length as the word embeddings, and the merged projection matrices map model space back to model space.
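
As a quick numerical illustration of that point, here is a sketch with made-up dimensions and random weights. It assumes the merged-matrix definitions of the pattern and message matrices (W_QK = W_Q W_Kᵀ and W_VO = W_V W_O); the specific sizes are hypothetical:

```python
import torch

# Illustrative dimensions; not taken from any particular model.
d_model, d_head = 512, 64

# Per-head projections (random stand-ins for real weights).
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)
W_V = torch.randn(d_model, d_head)
W_O = torch.randn(d_head, d_model)

# Merging the projections gives two d_model x d_model matrices per head.
W_QK = W_Q @ W_K.T   # the "pattern" (query-key) matrix
W_VO = W_V @ W_O     # the "message" (value-output) matrix

x = torch.randn(d_model)   # a word embedding / residual stream vector

pattern = x @ W_QK         # same length as the embedding: d_model
message = x @ W_VO         # same length as the embedding: d_model

print(W_QK.shape, W_VO.shape)        # torch.Size([512, 512]) for both
print(pattern.shape, message.shape)  # torch.Size([512]) for both
```

Nothing ever drops down to the per-head dimension in this view; every vector you work with can be compared directly against the word embeddings and the residual stream.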