Chris McCormick

Exploring the inner workings of Transformers--and how we might improve them.

Patterns and Messages - Part 2 - Token Communication

In the previous post, we looked at how our tendency to think of Attention in terms of large matrix multiplications obscures some key insights, and we rolled back the GPU optimizations to reveal them (recapped in the next section).