
"I love you" "too": LLM Attention Explained.

This post explains, mechanically and with simple math, why a language model is likely to continue “I love you” with “too” instead of a random noun like “chair” or “database.”

0. The Tools

To understand this, you only need to know three simple concepts:

  • Vector: A list of numbers representing a word’s “identity” (e.g., [0.2, -1.5, 3.0]).
  • Dot Product (a · b): A way to multiply two vectors to see how well they “align.” A higher number means a stronger match.
  • Softmax: A formula that turns raw scores into percentages (e.g., 80%, 15%, 5%) that all add up to 100%.
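Here is a minimal sketch of all three in plain Python (the vectors and scores are made-up toy numbers, just to show the mechanics):

import math

a = [0.2, -1.5, 3.0]   # a toy word vector
b = [0.1, -1.0, 2.5]   # another toy vector

# Dot product: multiply matching entries and add them up; bigger = better alignment
dot = sum(ai * bi for ai, bi in zip(a, b))
print(dot)   # 9.02

# Softmax: exponentiate each score, then divide by the total so the results sum to 1
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ≈ [0.66, 0.24, 0.10]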

1. Words Become Vectors

Each word (I, love, you) is converted into a vector. These numbers contain the word’s “vibe,” grammar, and meaning.

  • x_I, x_love, x_you
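Mechanically, this step is just a lookup table: each word maps to its learned vector. A toy sketch (the 3-number vectors below are invented; real models use vectors with hundreds or thousands of entries):

# Toy embedding table: in a real model these vectors are learned during training
embedding = {
    "I":    [0.9,  0.1, -0.3],
    "love": [0.2,  1.4,  0.8],
    "you":  [0.7, -0.2,  0.5],
}

x_I, x_love, x_you = embedding["I"], embedding["love"], embedding["you"]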

2. The Three Roles (Learned from the World)

The model doesn’t just look at the raw word. It uses three “filters” (Wq, Wk, Wv)—which it perfected by reading billions of pages of text—to give each word three specialized roles:

  • Q (Query): What is this word looking for? (“you” is looking for a relationship context)
  • K (Key): What does this word offer? (“love” offers a deep emotional context)
  • V (Value): What info should it contribute? (the “reciprocal sentiment” data)
3. Scoring the Past (The Attention Filter)

The model wants to know what comes after you. It takes the Query (Q) of the current word (“you”) and compares it to the Keys (K) of every word that came before it using a dot product:

  • score_I = q_you · k_I (Low match)
  • score_love = q_you · k_love (High match!)
  • score_you = q_you · k_you (Self-match)

The Big Idea: Because the model has learned how English works, it knows that the “Query” for the word “you” should be highly attracted to the “Key” for the word “love.”
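A toy version of this scoring step, with made-up numbers chosen so the pattern is easy to see (real models also divide these scores by the square root of the vector size to keep them well-behaved):

import numpy as np

# Made-up query for "you" and keys for each word
q_you  = np.array([0.2, 1.0, 0.5])
k_I    = np.array([0.9, -0.1, 0.0])
k_love = np.array([0.1, 1.3, 0.8])
k_you  = np.array([0.3, 0.4, 0.6])

score_I    = q_you @ k_I      # 0.08 -> low match
score_love = q_you @ k_love   # 1.72 -> high match
score_you  = q_you @ k_you    # 0.76 -> self-match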

4. The Weighted Blend

The model uses Softmax to turn those scores into weights, then uses those weights to “mix” the Value (V) vectors together into a single Context Vector. (Remember, Wv has learned to act as a “keep the useful parts” filter, so the blend mostly carries what matters for next-word prediction.)

score_I = q_you · k_I
score_love = q_you · k_love
score_you = q_you · k_you

w_I, w_love, w_you = softmax([score_I, score_love, score_you])

context = w_I * v_I + w_love * v_love + w_you * v_you

This creates a new, single vector that represents the “vibe” of the whole sequence: “Positive emotion directed at the listener.”
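The same recipe as runnable Python, again with made-up scores and Value vectors:

import numpy as np

# Made-up attention scores and Value vectors
score_I, score_love, score_you = 0.5, 3.0, 1.0
v_I    = np.array([0.1, 0.0, 0.2])
v_love = np.array([0.9, 1.2, 0.4])
v_you  = np.array([0.3, 0.5, 0.1])

# Softmax: exponentiate and normalize so the weights sum to 1
scores  = np.array([score_I, score_love, score_you])
weights = np.exp(scores) / np.exp(scores).sum()   # ≈ [0.07, 0.82, 0.11]
w_I, w_love, w_you = weights

# The Context Vector is a weighted blend of the Values, dominated by "love"
context = w_I * v_I + w_love * v_love + w_you * v_you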

5. Scoring the Dictionary

Now, the model needs to pick a new word. It compares this “vibe” (Context Vector) against its entire dictionary of 50,000+ words. It is looking for a word whose “identity” (embedding) matches this specific context.

  • score("too") = Context · embedding_too –> Very High
  • score("chair") = Context · embedding_chair –> Very Low

6. The Final Choice

Finally, the model turns these dictionary scores into probabilities:

  • too: 92%
  • back: 5%
  • database: 0.0001%

The model samples from these percentages and prints “too”.
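Picking the word is then just a weighted random draw. A sketch using Python’s random.choices with the toy percentages above:

import random

words = ["too", "back", "database"]
probs = [0.92, 0.05, 0.000001]   # the probabilities shown above (toy numbers)

# random.choices normalizes the weights internally, so they don't have to sum to 1
next_word = random.choices(words, weights=probs, k=1)[0]
print(next_word)   # "too" almost every time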

Summary

  1. Transform: Turn words into Q, K, and V based on “wisdom” learned during training.
  2. Match: Compare the current Query to past Keys to see what matters most.
  3. Blend: Use those match scores to create a weighted “Context Vector.”
  4. Predict: Find the word in the dictionary that best aligns with that Context Vector.