"I love you" "too": LLM Attention Explained.
This post explains, mechanically and with simple math, why a language model is likely to continue “I love you” with “too” instead of a random noun like “chair” or “database”.
0. The Tools
To understand this, you only need to know three simple concepts:
- Vector: A list of numbers representing a word’s “identity” (e.g., [0.2, -1.5, 3.0]).
- Dot Product (a · b): A way to multiply two vectors to see how well they “align.” A higher number means a stronger match.
- Softmax: A formula that turns raw scores into percentages (e.g., 80%, 15%, 5%) that all add up to 100%.
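Here is a minimal NumPy sketch of those two operations (the vectors and scores are toy values invented for illustration, not taken from a real model):

```python
import numpy as np

# Two toy "word identity" vectors (made-up numbers, not real embeddings).
a = np.array([0.2, -1.5, 3.0])
b = np.array([0.1, -1.0, 2.5])

# Dot product: a higher value means the two vectors "align" better.
alignment = a @ b  # 0.02 + 1.5 + 7.5 = 9.02

# Softmax: turns raw scores into percentages that sum to 100%.
def softmax(scores):
    exps = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([3.0, 1.0, 0.2])))  # ~[0.84, 0.11, 0.05]
```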
1. Words Become Vectors
Each word (I, love, you) is converted into a vector. These numbers contain the word’s “vibe,” grammar, and meaning.
- x_I, x_love, x_you
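As a toy sketch, each word is just a row looked up in a learned embedding table (the vocabulary, dimensions, and values below are invented for illustration):

```python
import numpy as np

# Tiny made-up vocabulary and a random stand-in for the learned embedding table.
vocab = {"I": 0, "love": 1, "you": 2}
d_model = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# Each word's vector is simply its row in the table.
x_I    = embedding_table[vocab["I"]]
x_love = embedding_table[vocab["love"]]
x_you  = embedding_table[vocab["you"]]
print(x_you)  # a 4-number "identity" vector for the word "you"
```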
2. The Three Roles (Learned from the World)
The model doesn’t just look at the raw word. It uses three “filters” (Wq, Wk, Wv)—which it perfected by reading billions of pages of text—to give each word three specialized roles:
| Role | Name | Purpose |
|---|---|---|
| Q | Query | What is this word looking for? (“you” is looking for a relationship context) |
| K | Key | What does this word offer? (“love” offers a deep emotional context) |
| V | Value | What info should it contribute? (The “reciprocal sentiment” data) |
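In code, the three roles are just three matrix multiplications. The sketch below uses random placeholder matrices; in a real model Wq, Wk, and Wv are weights learned during training:

```python
import numpy as np

d_model, d_head = 4, 3
rng = np.random.default_rng(1)

# Stand-ins for the three learned "filters."
Wq = rng.normal(size=(d_model, d_head))
Wk = rng.normal(size=(d_model, d_head))
Wv = rng.normal(size=(d_model, d_head))

x_you = rng.normal(size=d_model)  # stand-in embedding for "you"

q_you = x_you @ Wq  # Query: what "you" is looking for
k_you = x_you @ Wk  # Key: what "you" offers
v_you = x_you @ Wv  # Value: what "you" contributes to the blend
```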
3. Scoring the Past (The Attention Filter)
The model wants to know what comes after “you”. It takes the Query (Q) of the current word (“you”) and compares it to the Keys (K) of every word in the sequence so far (including itself) using a dot product:
score_I = q_you · k_I (Low match)
score_love = q_you · k_love (High match!)
score_you = q_you · k_you (Self-match)
The Big Idea: Because the model has learned how English works, it knows that the “Query” for the word “you” should be highly attracted to the “Key” for the word “love.”
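To make that concrete, here are hand-picked toy vectors (not from a real model) in which the query for “you” lines up strongly with the key for “love”:

```python
import numpy as np

q_you  = np.array([2.0, 1.0, 0.0])    # what "you" is looking for
k_I    = np.array([0.1, -0.2, 0.5])   # offers "first-person subject" context
k_love = np.array([1.8, 1.2, 0.1])    # offers "deep emotional" context
k_you  = np.array([0.5, 0.5, 0.5])    # offers "the listener" context

score_I    = q_you @ k_I      # 0.2 - 0.2 + 0.0 = 0.0  (low match)
score_love = q_you @ k_love   # 3.6 + 1.2 + 0.0 = 4.8  (high match!)
score_you  = q_you @ k_you    # 1.0 + 0.5 + 0.0 = 1.5  (self-match)
```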
4. The Weighted Blend
The model uses Softmax to turn those scores into weights, then uses the weights to “mix” the Value (V) vectors together into a single Context Vector. Because Wv has learned a “keep the useful parts” filter, the blend mostly carries what matters for next-word prediction.
score_I = q_you · k_I
score_love = q_you · k_love
score_you = q_you · k_you
w_I, w_love, w_you = softmax([score_I, score_love, score_you])
context = w_I * v_I + w_love * v_love + w_you * v_you
This creates a new, single vector that represents the “vibe” of the whole sequence: “Positive emotion directed at the listener.”
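Here is a runnable version of that pseudocode, reusing the toy scores from the previous step; the value vectors are made-up placeholders for what each word contributes:

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# [score_I, score_love, score_you] from the toy scoring example above.
scores = np.array([0.0, 4.8, 1.5])
w_I, w_love, w_you = softmax(scores)   # ~[0.01, 0.96, 0.04]

v_I    = np.array([0.1, 0.0, 0.2])
v_love = np.array([1.5, 2.0, 0.3])     # carries the "reciprocal sentiment" data
v_you  = np.array([0.4, 0.1, 0.9])

context = w_I * v_I + w_love * v_love + w_you * v_you
print(context)  # dominated by v_love, because w_love is by far the largest weight
```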
5. Scoring the Dictionary
Now the model needs to pick the next word. It compares this “vibe” (the Context Vector) against its entire dictionary of 50,000+ words, looking for a word whose “identity” (embedding) matches this specific context.
score("too")= Context · embedding_too –> Very Highscore("chair")= Context · embedding_chair –> Very Low
6. The Final Choice
Finally, the model turns these dictionary scores into probabilities:
- too: 92%
- back: 5%
- database: 0.0001%
The model samples from these percentages and prints “too”.
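In code, that last step is a softmax over the dictionary scores followed by weighted random sampling. The scores below are toy values chosen to roughly mirror the percentages above:

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

words  = ["too", "back", "database"]
logits = np.array([8.0, 5.1, -5.7])   # toy dictionary scores for three candidate words
probs  = softmax(logits)              # roughly [0.95, 0.05, 0.000001]

rng = np.random.default_rng(0)
next_word = rng.choice(words, p=probs)
print(next_word)  # almost always "too"
```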
Summary
- Transform: Turn words into Q, K, and V based on “wisdom” learned during training.
- Match: Compare the current Query to past Keys to see what matters most.
- Blend: Use those match scores to create a weighted “Context Vector.”
- Predict: Find the word in the dictionary that best aligns with that Context Vector.