Byte Pair Encoding

BPE is a subword tokenization algorithm used by most modern LLMs (GPT, LLaMA, etc.). It iteratively merges the most frequent adjacent pair of tokens, starting from individual characters, to build a compact vocabulary.

Unlike word-level tokenization (huge vocabulary, can’t handle new words) or character-level (too granular, long sequences), BPE finds a middle ground: common words become single tokens while rare words decompose into meaningful subwords.
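As a sketch of this middle ground, consider applying already-learned merge rules to an unseen word. The rules below are hypothetical, as if learned from a corpus containing "low" and "newest":

```python
# Hypothetical merge rules (in the order they were learned)
rules = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t"), ("est", "_")]

def apply_bpe(word, rules):
    """Split a word into characters + end marker, then replay merges in order."""
    tokens = list(word) + ["_"]
    for a, b in rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]   # merge the adjacent pair in place
            else:
                i += 1
    return tokens

print(apply_bpe("lowest", rules))  # ['low', 'est_']
```

The unseen word "lowest" never appeared in training, yet it decomposes into two meaningful subwords rather than seven characters.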

How to Use

  • Select a corpus from the dropdown or enter custom words
  • Press Run to animate all merge steps automatically
  • Press Step to advance one merge cycle at a time
  • Watch pairs merge in the tokenization, frequency chart, and vocabulary
  • Adjust speed with the slider

BPE Merge Loop

  1. Initialize: Split each word into characters, append end-of-word marker _
  2. Count pairs: For each adjacent token pair across all words, count frequency
  3. Find best pair: Select the pair with highest frequency
  4. Merge: Replace every occurrence of the best pair with a single merged token
  5. Add to vocabulary: Record the new merged token and the merge rule
  6. Repeat steps 2–5 until max merges reached

Pseudocode

vocab = all unique characters + "_"
for i in 1..max_merges:
  pairs = count_pairs(corpus)
  best = argmax(pairs)
  corpus = merge(corpus, best)
  vocab.add(best[0] + best[1])
  rules.add(best → merged)
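The pseudocode above can be fleshed out into a runnable sketch. Function and variable names here are my own; the corpus is represented as word → frequency counts, matching the pair-frequency weighting described below:

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for tokens, freq in corpus:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated token."""
    merged = []
    for tokens, freq in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append((out, freq))
    return merged

def bpe(words, max_merges):
    # Initialize: split each word into characters plus the end-of-word marker.
    corpus = [(list(w) + ["_"], f) for w, f in words.items()]
    vocab = {t for tokens, _ in corpus for t in tokens}
    rules = []
    for _ in range(max_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        # Ties break by first-seen order here; the demo may break them differently.
        best = max(pairs, key=pairs.get)
        corpus = merge(corpus, best)
        vocab.add(best[0] + best[1])
        rules.append(best)
    return corpus, vocab, rules
```

For example, `bpe({"low": 2, "lower": 1}, 2)` first merges ("l", "o") (frequency 3), then ("lo", "w"), producing the vocabulary entry "low".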

Pair Frequency

freq(a, b) = Σ_{w ∈ corpus} count(ab in w) × freq(w)

Each adjacent pair (a, b) is counted across all words, weighted by word frequency.
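For instance, with a toy corpus (word frequencies are illustrative):

```python
from collections import Counter

# Toy corpus: word -> frequency
corpus = {"low": 5, "lowest": 2}

pairs = Counter()
for word, freq in corpus.items():
    tokens = list(word) + ["_"]
    for a, b in zip(tokens, tokens[1:]):
        pairs[(a, b)] += freq   # weight each pair by its word's frequency

print(pairs[("l", "o")])  # 5 + 2 = 7
```

The pair ("l", "o") occurs once in each word, so its weighted frequency is 5 + 2 = 7, while ("w", "_") appears only in "low" and scores 5.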

Compression Ratio

compression = initial_tokens / current_tokens

Measures how much the total token count has been reduced by merging.
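A quick worked example (the token counts are illustrative):

```python
# Five occurrences of a word tokenized as "l o w _" at character level:
initial_tokens = 5 * 4            # 20 tokens before any merges
# After merges collapse each occurrence to the two tokens "low" + "_":
current_tokens = 5 * 2            # 10 tokens
compression = initial_tokens / current_tokens
print(f"{compression:.2f}x")      # 2.00x
```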

Vocabulary Growth

|V_i| = |V_0| + i

Each merge adds exactly one new token to the vocabulary.
