Patterns and Messages - Part 4 - Attention as a Dynamic Neural Network

19 Feb 2025

When you reduce Attention down to two matrices instead of four, the pattern and message vectors represent a more familiar architecture: they form a neural network whose neurons are created dynamically at inference time from the tokens.

Patterns and Messages - Part 3 - Alternative Decompositions

19 Feb 2025

One potential benefit of this merged perspective is that it lets us begin our research into Transformer efficiency from a “more fundamental” definition of Attention.

Patterns and Messages - Part 2 - Token Communication

18 Feb 2025

In the previous post, we looked at how our tendency to think of Attention in terms of large matrix multiplications obscures some key insights, and we rolled back the GPU optimizations in order to reveal them (recapped in the next section).

Patterns and Messages - Part 1 - The Missing Subscript

18 Feb 2025

In this post, we’ll look at how a tiny bit of algebra suddenly opens up a wealth of insight.

Patterns and Messages: A New Framing of Transformer Attention

18 Feb 2025

I recently had a series of “aha!” moments around the Attention equations that have led to some exciting weeks of research and insight.