Byte Pair Encoding

BPE is a subword tokenization algorithm used by most modern LLMs (GPT, LLaMA, etc.). It iteratively merges the most frequent adjacent pair of tokens, starting from individual characters, to build a compact vocabulary.

Unlike word-level tokenization (huge vocabulary, can’t handle new words) or character-level (too granular, long sequences), BPE finds a middle ground: common words become single tokens while rare words decompose into meaningful subwords.
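As a sketch of this middle ground, consider applying already-learned merge rules to an unseen word. The rules below are hypothetical, as if learned from a corpus containing "low" and "newest":

```python
# Hypothetical merge rules (in the order they were learned)
rules = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t"), ("est", "_")]

def apply_bpe(word, rules):
    """Split a word into characters + end marker, then replay merges in order."""
    tokens = list(word) + ["_"]
    for a, b in rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]   # merge the adjacent pair in place
            else:
                i += 1
    return tokens

print(apply_bpe("lowest", rules))  # ['low', 'est_']
```

The unseen word "lowest" never appeared in training, yet it decomposes into two meaningful subwords rather than seven characters.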

How to Use

  • Select a corpus from the dropdown or enter custom words
  • Press Run to animate all merge steps automatically
  • Press Step to advance one merge cycle at a time
  • Watch pairs merge in the tokenization, frequency chart, and vocabulary
  • Adjust speed with the slider

BPE Merge Loop

  1. Initialize: Split each word into characters, append end-of-word marker _
  2. Count pairs: For each adjacent token pair across all words, count frequency
  3. Find best pair: Select the pair with highest frequency
  4. Merge: Replace every occurrence of the best pair with a single merged token
  5. Add to vocabulary: Record the new merged token and the merge rule
  6. Repeat steps 2–5 until max merges reached

Pseudocode

vocab = all unique characters + "_"
for i in 1..max_merges:
  pairs = count_pairs(corpus)
  best = argmax(pairs)
  corpus = merge(corpus, best)
  vocab.add(best[0] + best[1])
  rules.add(best → merged)
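The pseudocode above can be fleshed out into a runnable sketch. Function and variable names here are my own; the corpus is represented as word → frequency counts, matching the pair-frequency weighting described below:

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for tokens, freq in corpus:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated token."""
    merged = []
    for tokens, freq in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append((out, freq))
    return merged

def bpe(words, max_merges):
    # Initialize: split each word into characters plus the end-of-word marker.
    corpus = [(list(w) + ["_"], f) for w, f in words.items()]
    vocab = {t for tokens, _ in corpus for t in tokens}
    rules = []
    for _ in range(max_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        # Ties break by first-seen order here; the demo may break them differently.
        best = max(pairs, key=pairs.get)
        corpus = merge(corpus, best)
        vocab.add(best[0] + best[1])
        rules.append(best)
    return corpus, vocab, rules
```

For example, `bpe({"low": 2, "lower": 1}, 2)` first merges ("l", "o") (frequency 3), then ("lo", "w"), producing the vocabulary entry "low".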

Pair Frequency

freq(a, b) = Σ_{w ∈ corpus} count(ab in w) × freq(w)

Each adjacent pair (a, b) is counted across all words, weighted by word frequency.
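For instance, with a toy corpus (word frequencies are illustrative):

```python
from collections import Counter

# Toy corpus: word -> frequency
corpus = {"low": 5, "lowest": 2}

pairs = Counter()
for word, freq in corpus.items():
    tokens = list(word) + ["_"]
    for a, b in zip(tokens, tokens[1:]):
        pairs[(a, b)] += freq   # weight each pair by its word's frequency

print(pairs[("l", "o")])  # 5 + 2 = 7
```

The pair ("l", "o") occurs once in each word, so its weighted frequency is 5 + 2 = 7, while ("w", "_") appears only in "low" and scores 5.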

Compression Ratio

compression = initial_tokens / current_tokens

Measures how much the total token count has been reduced by merging.
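A quick worked example (the token counts are illustrative):

```python
# Five occurrences of a word tokenized as "l o w _" at character level:
initial_tokens = 5 * 4            # 20 tokens before any merges
# After merges collapse each occurrence to the two tokens "low" + "_":
current_tokens = 5 * 2            # 10 tokens
compression = initial_tokens / current_tokens
print(f"{compression:.2f}x")      # 2.00x
```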

Vocabulary Growth

|V_i| = |V_0| + i

Each merge adds exactly one new token to the vocabulary.
