Chris McCormick    Patterns & Messages    Archive

Exploring the inner workings of Transformers, and how we might improve them.

Patterns and Messages - Part 5 - The Residual Stream

Something I find really helpful about this merged-matrix perspective is that it puts everything in “model space”. The patterns and messages and their projection matrices all have the same length as the word embeddings.
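To make "model space" a bit more concrete, here's a minimal NumPy sketch. It assumes the merged matrices are formed per head as `W_Q @ W_K.T` and `W_V @ W_O`. The names `W_P` and `W_M`, the toy sizes, and the random weights are all illustrative choices for this sketch, not something taken from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 2  # toy sizes, chosen only for illustration

# Per-head projection matrices (random stand-ins for learned weights).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# Merged matrices (assumed form): both are d_model x d_model.
W_P = W_Q @ W_K.T  # hypothetical name for the "pattern" projection
W_M = W_V @ W_O    # hypothetical name for the "message" projection

x_query = rng.normal(size=d_model)  # embedding of the token doing the looking
x_key = rng.normal(size=d_model)    # embedding of a token being looked at

pattern = x_query @ W_P  # same length as a word embedding
message = x_key @ W_M    # same length as a word embedding

# The attention logit can be taken as a dot product directly in model space,
# and it matches the usual q . k computed in the smaller head space.
score_merged = pattern @ x_key
score_standard = (x_query @ W_Q) @ (x_key @ W_K)

print(pattern.shape, message.shape)
print(np.allclose(score_merged, score_standard))
```

Under these assumptions, the pattern and the message both come out with length `d_model`, so they can be compared against (or added to) token embeddings directly, without ever stepping into the smaller per-head space.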

Patterns and Messages - Part 2 - Token Communication

In the previous post, we looked at how our tendency to think of Attention in terms of large matrix multiplications obscures some key insights, and we rolled back the GPU optimizations in order to reveal them (recapped in the next section).