Byte Pair Encoding (BPE) — From Banana to Bandana
Let’s start with a quick fact: a single byte is 8 bits, so it can represent 2^8 = 256 values (IDs 0–255). When tokenizing text, each token gets an ID, and the raw bytes claim those first 256 IDs. Every new merged token therefore gets the next free ID: 256, then 257, and so on.
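A quick way to see this in Python (the printed values are just the UTF-8 bytes of the example string):

```python
ids = list("banana bandana".encode("utf-8"))
print(ids)             # [98, 97, 110, 97, 110, 97, 32, 98, 97, 110, 100, 97, 110, 97]
print(max(ids) < 256)  # True -- raw bytes never leave the 0-255 range,
                       # so the first merged token is free to take ID 256
```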
One of the smartest ways to build these tokens? Byte Pair Encoding (BPE) — used by GPT, RoBERTa, and many other models. It works by merging the most frequent symbol pairs into bigger tokens, over and over, so we keep the vocabulary small but expressive.
And instead of a dry definition… let’s use a story about a banana wearing a bandana.
1. The Idea Behind BPE
- Start with characters (including spaces).
- Find the most frequent pair.
- Merge them into one token.
- Repeat until the vocab reaches its target size (see the sketch right below).
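Here’s a minimal sketch of that loop in Python, operating on raw UTF-8 bytes. This is not how any particular tokenizer library implements it; the function names (count_pairs, merge, train_bpe) and the num_merges parameter are made up for illustration.

```python
from collections import Counter

def count_pairs(ids):
    """Count how often each adjacent pair of token IDs appears."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merge rules, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new merged id
    for new_id in range(256, 256 + num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        ids = merge(ids, best, new_id)
        merges[best] = new_id
    return ids, merges
```

For example, train_bpe("banana bandana", 2) learns "an" and then "ban" at the byte level, which mirrors Steps 1 and 2 of the walkthrough in the next section (later merges depend on how ties between equally frequent pairs are broken).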
Why it works:
- Captures common subwords like ing, tion, ana.
- Handles rare words & typos better than splitting only on spaces.
- Keeps vocab size manageable.
2. Example: "banana bandana"
Let’s use numbers instead of letters:
1 = b, 2 = a, 3 = n, 4 = d, 5 = (space)
"banana bandana" → 1 2 3 2 3 2 5 1 2 3 4 2 3 2
Step 0 — Start
[1] [2] [3] [2] [3] [2] [5] [1] [2] [3] [4] [2] [3] [2]
Step 1 — Merge most frequent pair (2 3 → 23). It appears four times, and every occurrence gets merged:
[1] [23] [23] [2] [5] [1] [23] [4] [23] [2]
Step 2 — Merge (1 23 → 123)
[123] [23] [2] [5] [123] [4] [23] [2]
Step 3 — Merge (23 2 → 232) and (123 4 → 1234)
[123] [232] [5] [1234] [232]
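If you want to check the arithmetic, here is a small sketch that replays those merge rules on the numeric sequence (the merge helper is the same one sketched earlier, repeated so the snippet runs on its own). The last two printed lines together correspond to Step 3.

```python
def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = [1, 2, 3, 2, 3, 2, 5, 1, 2, 3, 4, 2, 3, 2]  # "banana bandana"
for pair, new_id in [((2, 3), 23), ((1, 23), 123), ((23, 2), 232), ((123, 4), 1234)]:
    ids = merge(ids, pair, new_id)
    print(ids)
# [1, 23, 23, 2, 5, 1, 23, 4, 23, 2]
# [123, 23, 2, 5, 123, 4, 23, 2]
# [123, 232, 5, 123, 4, 232]
# [123, 232, 5, 1234, 232]
```

(The new IDs 23, 123, 232, 1234 are just the story’s readable labels; a real tokenizer would hand out 256, 257, 258, 259.)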
3. Why BPE Matters
- Efficiency → fewer tokens for the same meaning.
- Typo resilience → bananna still breaks into familiar chunks.
- Language flexibility → works for languages without spaces.
- Reusability → builds a library of common word pieces.
4. Try It Yourself
Want to see BPE in action? Try merging tokens for "low low lower": the first new token created after the original 256 byte IDs gets 256 as its ID.
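As a rough sketch (reusing the pair-counting idea from earlier, not a full trainer), the winning pair in "low low lower" is the bytes of "lo", and because it is the first merge ever learned, its new token gets ID 256. Note that "lo" and "ow" actually tie at three occurrences each, so a different implementation could just as well merge "ow" first.

```python
from collections import Counter

ids = list("low low lower".encode("utf-8"))
pairs = Counter(zip(ids, ids[1:]))
best = max(pairs, key=pairs.get)   # (108, 111) -- the bytes for "lo" ("ow" ties; first seen wins here)
new_id = 256                       # the first ID after the raw 0-255 byte range
print(best, pairs[best], new_id)   # (108, 111) 3 256
```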
Conclusion
BPE isn’t just token splitting — it’s why LLMs can talk in so many contexts with such a small vocabulary. It finds patterns, merges them, and gives the model the building blocks of language without drowning it in rare tokens.
It’s simple, clever, and everywhere in modern NLP. Next time you chat with an AI, remember: somewhere inside, it’s quietly merging banana with bandana.
Full version of this article: Byte Pair Encoding (BPE) Explained with the Banana Bandana Example