Patterns and Messages - Part 6 - Vocabulary-Based Analysis
What had me most excited about the merged matrix perspective (and perhaps overly so) was that the patterns and messages are in model space, the same dimension as the vocabulary.
Something I find really helpful about this merged-matrix perspective is that it puts everything in “model space”. The patterns, the messages, and the rows of their projection matrices all have the same length as the word embeddings.
When you reduce Attention down to two matrices instead of four, the pattern and message vectors reveal a more familiar architecture: they form a neural network whose neurons are created dynamically at inference time from the tokens.
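To make the merged-matrix idea concrete, here is a minimal NumPy sketch of a single attention head (scaling and masking omitted for brevity). The names `W_pat` and `W_msg` are illustrative, not from any library: they are the merged query-key and value-output matrices, and the sketch checks that the two-matrix computation matches the standard four-matrix one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 4, 3

# Hypothetical per-head projection matrices (shapes follow the usual convention).
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

X = rng.standard_normal((n_tokens, d_model))  # token embeddings

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# --- Standard four-matrix attention ---
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
out_standard = softmax(Q @ K.T) @ V @ W_O

# --- Merged two-matrix view: patterns and messages live in model space ---
W_pat = W_K @ W_Q.T   # (d_model, d_model): merged query-key matrix
W_msg = W_V @ W_O     # (d_model, d_model): merged value-output matrix

patterns = X @ W_pat  # one pattern vector per token, length d_model
messages = X @ W_msg  # one message vector per token, length d_model

# Each token's embedding is matched against every pattern, and the
# resulting weights mix the messages -- a dynamically built layer of neurons.
scores = X @ patterns.T
out_merged = softmax(scores) @ messages

# The two formulations are algebraically identical.
assert np.allclose(out_standard, out_merged)
```

Note that the patterns and messages here have length `d_model`, not `d_head`: each token contributes one “neuron” whose input weights are its pattern and whose output weights are its message.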
One potential benefit of this merged perspective is that it lets us begin our research into Transformer efficiency from a “more fundamental” definition of Attention.
In the previous post, we looked at how our tendency to think of Attention in terms of large matrix multiplications obscures some key insights, and we rolled back the GPU optimizations in order to reveal them (recapped in the next section).