Controls
Byte Pair Encoding
BPE is a subword tokenization algorithm used by most modern LLMs (GPT, LLaMA, etc.). It iteratively merges the most frequent adjacent pair of tokens, starting from individual characters, to build a compact vocabulary.
Unlike word-level tokenization (huge vocabulary, can’t handle new words) or character-level (too granular, long sequences), BPE finds a middle ground: common words become single tokens while rare words decompose into meaningful subwords.
How to Use
- Select a corpus from the dropdown or enter custom words
- Press Run to animate all merge steps automatically
- Press Step to advance one merge cycle at a time
- Watch pairs merge in the tokenization, frequency chart, and vocabulary
- Adjust speed with the slider
BPE Merge Loop
1. Initialize: Split each word into characters, append an end-of-word marker
2. Count pairs: For each adjacent token pair across all words, count its frequency
3. Find best pair: Select the pair with the highest frequency
4. Merge: Replace every occurrence of the best pair with a single merged token
5. Add to vocabulary: Record the new merged token and the merge rule
6. Repeat steps 2–5 until the maximum number of merges is reached
Pseudocode
vocab = all unique characters + "_"
for i in 1..max_merges:
    pairs = count_pairs(corpus)
    best = argmax(pairs)
    corpus = merge(corpus, best)
    vocab.add(best[0] + best[1])
    rules.add(best → merged)
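The pseudocode above can be fleshed out into a minimal runnable sketch in Python. The `(tokens, frequency)` corpus format and the helper names are illustrative assumptions, not the visualization's actual internals:

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for tokens, freq in corpus:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(corpus, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = pair[0] + pair[1]
    new_corpus = []
    for tokens, freq in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        new_corpus.append((out, freq))
    return new_corpus

def bpe(words, max_merges):
    # Initialize: split each word into characters plus an end-of-word marker.
    corpus = [(list(w) + ["_"], f) for w, f in words.items()]
    vocab = {t for tokens, _ in corpus for t in tokens}
    rules = []
    for _ in range(max_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # highest-frequency pair wins
        corpus = merge(corpus, best)
        vocab.add(best[0] + best[1])
        rules.append(best)
    return corpus, vocab, rules
```

Running `bpe({"low": 5, "lower": 2}, 2)` first merges `l`+`o` into `lo`, then `lo`+`w` into `low`, so the common stem becomes a single token, matching what the animation shows step by step.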
Pair Frequency
freq(a, b) = Σ_{w ∈ corpus} count(ab in w) × freq(w)
Each adjacent pair (a, b) is counted across all words, weighted by word frequency.
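A small worked example of this weighted count (the `(tokens, frequency)` corpus format is assumed for illustration):

```python
from collections import Counter

# Toy corpus: "low" appears 5 times, "lower" appears 2 times.
corpus = [(["l", "o", "w", "_"], 5), (["l", "o", "w", "e", "r", "_"], 2)]

pairs = Counter()
for tokens, freq in corpus:
    for a, b in zip(tokens, tokens[1:]):
        pairs[(a, b)] += freq  # each occurrence weighted by word frequency

# ("l", "o") occurs in both words: 5 + 2 = 7
# ("w", "_") occurs only in "low":  5
```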
Compression Ratio
compression = initial_tokens / current_tokens
Measures how much the total token count has been reduced by merging.
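A quick worked example with the toy corpus above: merging "low" down from four character tokens to two tokens halves the count.

```python
# "low" x5, split into characters + "_", gives 4 tokens per word.
initial_tokens = 4 * 5                          # 20
# After merging down to ["low", "_"], each word has 2 tokens.
current_tokens = 2 * 5                          # 10
compression = initial_tokens / current_tokens   # 2.0, i.e. "2.00x" in the panel
```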
Vocabulary Growth
|V_i| = |V_0| + i
Each merge adds exactly one new token to the vocabulary.
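Because every merge step creates exactly one new token, vocabulary size grows linearly. A sketch (the starting size of 27 is an assumption, e.g. 26 letters plus the "_" marker):

```python
v0 = 27  # assumed initial vocabulary: 26 letters + "_" end-of-word marker
sizes = [v0 + i for i in range(4)]  # vocabulary size after i merges
# sizes == [27, 28, 29, 30]
```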
BPE Metrics
| Metric | Value |
| --- | --- |
| Phase | Idle |
| Merge Step | 0 / 10 |
| Vocab Size | - |
| Total Tokens | - |
| Compression | 1.00x |
| Best Pair | - |