GPT-4 can write Shakespearean sonnets, pass the bar exam, and debug complex code. Ask it "how many r's are in strawberry?" and it confidently answers two. The culprit isn't the neural network. It's the step that happens before the neural network ever sees your text: tokenization. The way a model chops text into pieces determines what it can count, what it can spell, why it struggles with arithmetic, and how much you pay per API call. Tokenization is the invisible foundation that shapes everything a language model can and cannot do.
Tokens Are the Atomic Units of Every LLM
Language models don't read characters or words. They read tokens: numerical IDs that represent chunks of text. Before any transformer attention head fires, before any embedding lookup happens, raw text must be converted into a sequence of integer IDs from a fixed vocabulary.
How raw text becomes token IDs through the tokenization pipeline
This creates a fundamental design tension. If each character is a token (vocabulary of ~256), sequences become extremely long. A single paragraph might consume 500+ tokens, and the transformer's attention mechanism scales quadratically with sequence length. If each word is a token, you need a vocabulary of hundreds of thousands of entries, and any word not in the vocabulary becomes an unknown `<UNK>` token, devastating for names, typos, code, or any language underrepresented in training data.
As of early 2026, the solution powering every major language model is subword tokenization: break text into pieces larger than characters but smaller than words. Common words like "the" stay intact. Rare words like "tokenization" get split into meaningful pieces like "token" + "ization". The dominant algorithm for learning these splits is called Byte-Pair Encoding.
Byte-Pair Encoding: The Algorithm Behind Every Major LLM
BPE was originally a data compression algorithm (Gage, 1994), adapted for NLP by Sennrich, Haddow, and Birch (2016). The core idea is elegant: start with the smallest possible units and iteratively merge the most frequent adjacent pairs.
The training process works in three steps:
- Initialize the vocabulary with individual characters (or bytes) plus an end-of-word marker
- Count every adjacent pair of symbols across the training corpus
- Merge the most frequent pair into a single new symbol, then repeat
The merge rules learned during training are saved and applied deterministically to new text at inference time. GPT-2 learned 50,000 merges; GPT-4o's o200k_base tokenizer learned roughly 200,000.
Here is BPE training from scratch in Python. The corpus contains four words with varying frequencies, and we watch how BPE discovers subword units:
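A minimal implementation of this loop follows; it reproduces the trace below. One assumption worth flagging: ties between equally frequent pairs break in first-seen order, and real tokenizers add pre-tokenization and byte-level handling on top of this core.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def apply_merge(symbols, pair):
    """Replace each occurrence of `pair` in a symbol list with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} corpus."""
    corpus = [(list(w) + ["</w>"], f) for w, f in word_freqs.items()]
    merges = []
    for step in range(1, num_merges + 1):
        counts = pair_counts(corpus)
        best = max(counts, key=counts.get)  # ties break in first-seen order
        print(f"Merge {step}: '{best[0]}' + '{best[1]}' -> "
              f"'{best[0] + best[1]}' (freq={counts[best]})")
        merges.append(best)
        corpus = [(apply_merge(s, best), f) for s, f in corpus]
    return merges

def encode(word, merges):
    """Tokenize a new word by replaying the learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

merges = train_bpe({"low": 5, "lowest": 2, "newer": 6, "wider": 3}, num_merges=10)
print("Start:", list("lowest") + ["</w>"])
print("Final:", encode("lowest", merges))  # ['low', 'e', 's', 't', '</w>']
```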
Word frequencies: {'low': 5, 'lowest': 2, 'newer': 6, 'wider': 3}
Initial tokens:
n e w e r </w> x6
l o w </w> x5
w i d e r </w> x3
l o w e s t </w> x2
Merge 1: 'e' + 'r' -> 'er' (freq=9)
Merge 2: 'er' + '</w>' -> 'er</w>' (freq=9)
Merge 3: 'l' + 'o' -> 'lo' (freq=7)
Merge 4: 'lo' + 'w' -> 'low' (freq=7)
Merge 5: 'n' + 'e' -> 'ne' (freq=6)
Merge 6: 'ne' + 'w' -> 'new' (freq=6)
Merge 7: 'new' + 'er</w>' -> 'newer</w>' (freq=6)
Merge 8: 'low' + '</w>' -> 'low</w>' (freq=5)
Merge 9: 'w' + 'i' -> 'wi' (freq=3)
Merge 10: 'wi' + 'd' -> 'wid' (freq=3)
Final tokens after 10 merges:
newer</w> x6
low</w> x5
wid er</w> x3
low e s t </w> x2
--- Encoding a new word: 'lowest' ---
Start: ['l', 'o', 'w', 'e', 's', 't', '</w>']
Final: ['low', 'e', 's', 't', '</w>']
Notice what happened: "newer" became a single token because it appeared 6 times. "low" merged into one piece, so "lowest" gets split as ["low", "e", "s", "t"]. The model sees "low" as a familiar subword, then processes the suffix character by character. This is exactly how real tokenizers handle rare words composed of common morphemes.
Key Insight: BPE's merge list is a learned compression scheme tuned to the tokenizer's training corpus. It is not the same as the model's training data, and this mismatch is the root cause of glitch tokens, which we cover later.
Byte-Level BPE: The Innovation That Eliminated Unknown Tokens
Original BPE operated on characters, which left gaps for unseen Unicode symbols. Radford et al. (2019) introduced byte-level BPE in GPT-2: instead of starting with characters, start with raw bytes (a base vocabulary of exactly 256). Since any text in any language encodes to a byte sequence via UTF-8, byte-level BPE guarantees zero unknown tokens for any input. English, Chinese, Arabic, emoji, code, binary data: everything can be represented.
Modern tokenizers also apply regex-based pre-tokenization to prevent merges across category boundaries. GPT-2 introduced a regex pattern that keeps contractions ("don't" becomes "don" + "'t"), separates numbers from letters, and prevents spaces from merging with words. GPT-4o's o200k_base uses increasingly sophisticated patterns that also handle CJK characters and non-Latin scripts.
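A simplified, ASCII-only sketch of this kind of pattern is below. The real GPT-2 pattern uses Unicode categories (`\p{L}`, `\p{N}`) via the third-party `regex` module; this stdlib version only illustrates the idea:

```python
import re

# ASCII-only approximation of GPT-2-style pre-tokenization.
PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions, split off
    r"| ?[A-Za-z]+"              # a word, with its preceding space attached
    r"| ?[0-9]+"                 # a number, kept separate from letters
    r"| ?[^\sA-Za-z0-9]+"        # runs of punctuation
    r"|\s+"                      # remaining whitespace
)

print(PRETOKENIZE.findall("don't pay 42 dollars!"))
# ["don", "'t", " pay", " 42", " dollars", "!"]
```

BPE merges then happen only inside each chunk, which is why a space never fuses with the word before it and digits never fuse with letters.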
The Three Tokenizer Families Powering Production LLMs
Every major language model uses one of three tokenizer implementations. Understanding them matters when you're picking models for production, estimating costs, or debugging unexpected behavior.
Comparison of the three main tokenizer families: tiktoken, SentencePiece, and HuggingFace Tokenizers
tiktoken (OpenAI, Rust core): The fastest option at 3-6x the speed of alternatives. Inference only, no training support. Powers all OpenAI models and has been adopted by Llama 3/4 and Mistral's Tekken tokenizer.
SentencePiece (Kudo and Richardson, 2018): Treats input as a raw character stream with no pre-tokenization step, encoding spaces as the metasymbol ▁ (U+2581). Supports both BPE and Unigram algorithms. Particularly strong for languages without clear word boundaries (Chinese, Japanese, Thai). Powers Google's Gemini and Gemma families.
HuggingFace Tokenizers: Rust-backed library supporting BPE, WordPiece, and Unigram with full training support. The most flexible option, used by thousands of open-source models.
BPE vs. WordPiece vs. Unigram
These three algorithms take fundamentally different approaches to building vocabularies:
| Property | BPE | WordPiece | Unigram |
|---|---|---|---|
| Direction | Bottom-up (merge) | Bottom-up (merge) | Top-down (prune) |
| Selection criterion | Most frequent pair | Pair that maximizes likelihood | Remove token that least impacts loss |
| Morphology recovery | Weak | Moderate | Strong |
| Used by | GPT, Llama, Mistral, DeepSeek | BERT, DistilBERT | Gemini, T5, mBART |
Pro Tip: The Unigram algorithm (Kudo, 2018) starts with a huge candidate vocabulary and iteratively prunes tokens whose removal least increases the corpus loss. Because it evaluates each token's marginal contribution, Unigram recovers morphological suffixes like "-ly", "-ing", and "-tion" far more reliably than BPE. If you're working with morphologically rich languages (Turkish, Finnish, Hungarian), Unigram-based tokenizers tend to perform better.
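At inference time, Unigram segmentation is a Viterbi search for the tokenization with maximum total log-probability. A sketch with an invented toy vocabulary (the scores below are made up for illustration):

```python
import math

# Toy unigram vocabulary: token -> log-probability (values invented).
LOGP = {
    "token": -4.0, "ization": -6.0, "iz": -7.0, "ation": -6.5,
    "t": -9.0, "o": -9.0, "k": -9.0, "e": -9.0, "n": -9.0,
    "i": -9.0, "z": -9.0, "a": -9.0, "s": -9.0,
}

def viterbi_segment(text, logp):
    """Find the segmentation maximizing the sum of token log-probabilities."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best score for text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    # Recover tokens by walking the backpointers from the end.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

print(viterbi_segment("tokenization", LOGP))  # ['token', 'ization']
```

Because whole morphemes like "ization" carry higher probability than their character-by-character spellings, the search naturally recovers suffix boundaries, which is exactly the morphology advantage noted above.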
Vocabulary Sizes Have Exploded Since 2023
Vocabulary sizes have grown roughly 8x in three years. Larger vocabularies produce shorter token sequences (less compute in self-attention) and better multilingual coverage, at the cost of larger embedding matrices.
| Model | Tokenizer | Vocab Size | Type | Release |
|---|---|---|---|---|
| GPT-5 / GPT-4o / o3 | tiktoken (o200k_base) | ~200,000 | Byte-level BPE | 2024-2025 |
| Llama 4 (Scout/Maverick) | tiktoken-based | 202,048 | Byte-level BPE | April 2025 |
| Gemini 3 / Gemma 3 | SentencePiece | 262,144 | Unigram/BPE | 2025 |
| Claude Opus 4.6 / Sonnet 4 | Proprietary BPE | ~65,536 | Byte-level BPE | 2025 |
| DeepSeek-V3 / R2 | Custom BPE | ~128,000 | Byte-level BPE | 2024-2025 |
| Qwen3 | Custom BBPE | ~151,936 | Byte-level BPE | 2025 |
| Mistral (Tekken) | tiktoken-based | 131,072 | BPE | 2024 |
| Llama 3 | tiktoken-based | 128,256 | Byte-level BPE | 2024 |
| Llama 2 | SentencePiece | 32,000 | BPE + byte fallback | 2023 |
From 32K (Llama 2) to 262K (Gemini 3) in three years. Tao et al. (2024) at NeurIPS 2024 showed this isn't arbitrary: there's a log-linear relationship between vocabulary size and training loss. Llama 2's 32K vocabulary was optimal for a 7B model, but for the 70B variant, the compute-optimal vocabulary would have been at least 216K, 7x larger than what was actually used.
The Vocabulary Size Tradeoff
Choosing vocabulary size is one of the most consequential decisions in building a language model. Every time the model predicts the next token, it computes a probability distribution over the entire vocabulary:

$$P(x_{t+1} = i \mid x_{1:t}) = \mathrm{softmax}(W h_t)_i, \qquad W \in \mathbb{R}^{|V| \times d}$$

so the output projection alone costs on the order of $n \cdot d \cdot |V|$ multiply-adds per sequence, where:

- $n$ is the sequence length (number of tokens)
- $d$ is the model's hidden dimension
- $|V|$ is the vocabulary size
In Plain English: A 262K vocabulary (Gemini 3) means 262,144 softmax computations per token position, 8x more than Llama 2's 32K vocabulary. But larger vocabularies produce shorter sequences, so fewer tokens go through attention. In practice, the total compute often decreases because attention scales quadratically with sequence length while softmax scales only linearly with vocabulary size.
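The tradeoff can be sketched numerically. The formulas below ignore MLP blocks and constant factors, and the sequence lengths and compression gain are illustrative assumptions, not measurements from any real model:

```python
def forward_flops(n_tokens, d_model, vocab_size, n_layers):
    """Crude FLOP sketch: quadratic attention vs. linear output softmax.
    Ignores MLP blocks and constant factors; illustrative only."""
    attention = n_layers * 2 * n_tokens**2 * d_model   # QK^T and AV matmuls
    output = 2 * n_tokens * d_model * vocab_size       # final vocab projection
    return attention + output

# Assumed scenario: a 262K vocabulary shortens the same long document
# from 50,000 to 42,000 tokens (compression gain is an assumption).
small_vocab = forward_flops(50_000, 4096, 32_000, 32)
large_vocab = forward_flops(42_000, 4096, 262_144, 32)
print(large_vocab < small_vocab)  # True: shorter sequences win at long context
```

Note the crossover depends heavily on sequence length: at short contexts the larger softmax dominates, while at long contexts the quadratic attention term does.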
Five Ways Tokenization Breaks Your Model
Tokenization is not a solved problem. It introduces systematic failures that affect model accuracy, fairness, and cost.
Arithmetic and Number Tokenization
Ask GPT-4 to compute 1,234 + 5,678 and it might get it wrong. Not because the transformer can't do addition, but because the tokenizer splits numbers inconsistently. "480" might be a single token while "481" splits into "4" + "81". The model never sees individual digits aligned for column addition.
Singh and Strouse (2024) demonstrated at ICLR 2025 that right-to-left tokenization improves arithmetic accuracy by over 22 percentage points. Simply adding commas to numbers ("1,234") forces digit grouping that aligns addends correctly.
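If you pre-process prompts yourself, the comma trick is one line of Python:

```python
def group_digits(n: int) -> str:
    """Insert thousands separators so tokenizers split numbers into aligned
    three-digit groups (the formatting trick from Singh and Strouse, 2024)."""
    return f"{n:,}"

print(group_digits(1234567))  # 1,234,567
```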
The Multilingual Token Tax
The same sentence costs dramatically different amounts depending on language. Lundin et al. (2025) found tokenization premiums of 2-5x for low-resource African languages compared to English, with the cost amplified further by quadratic attention scaling. Arabic text requires 68% to 340% more tokens than equivalent English text, depending on the tokenizer.
This is not just an efficiency problem. Higher fertility (tokens per word) means longer sequences, more compute, higher latency, and higher API costs for the same meaning. Research shows fertility explains 20-50% of the variance in model accuracy across languages. This directly impacts RAG pipelines where retrieved passages consume context window budget measured in tokens.
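Fertility itself is straightforward to measure. The `tokenize` callable below is a stand-in for whatever tokenizer you use (for example, a tiktoken encoding's `encode` method):

```python
def fertility(text, tokenize):
    """Average tokens per whitespace-delimited word for a given tokenizer.
    `tokenize` is any callable mapping text -> list of tokens."""
    return len(tokenize(text)) / len(text.split())

# With a character-level "tokenizer", fertility is just average characters
# per word (including the separating spaces):
print(round(fertility("the cat sat", tokenize=list), 2))  # 3.67
```

Comparing this number across your target languages, on your own traffic, is the quickest way to predict cross-lingual cost and latency gaps before committing to a model.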
Glitch Tokens: The SolidGoldMagikarp Problem
In January 2023, researchers Jessica Rumbelow and Matthew Watkins discovered that asking ChatGPT to repeat "SolidGoldMagikarp" produced "distribute" instead. The root cause: mismatch between tokenizer training data and model training data. "SolidGoldMagikarp" was a Reddit username frequent enough in the tokenizer corpus to earn its own BPE token, but so rare in model training data that its embedding was essentially random noise.
The problem persists. Systematic research found roughly 4.3% of vocabulary entries across tested models are glitch tokens. The GlitchMiner framework (AAAI 2026) uses gradient-based entropy maximization to find them in GPT-4, Llama 2, Mistral, and DeepSeek-V3.
Code Formatting Waste
Whitespace, indentation, and newlines consume approximately 24.5% of tokens across programming languages while contributing minimal semantic value. This overhead compounds when using structured outputs, where JSON formatting adds curly braces, quotes, and indentation on top of the actual content. Pan et al. (2025) showed Java loses 14.7% and C# loses 13.2% to pure formatting overhead. GPT-4's cl100k_base tokenizer groups 4 spaces into a single token (token ID 257) and has dedicated tokens for whitespace sequences up to 128 spaces.
Token Boundary Misalignment
When token boundaries in a prompt don't match what the model expects, performance degrades dramatically. In Chinese text, misaligned boundaries cause the probability of the correct next token to drop by up to four orders of magnitude. Microsoft's Guidance library implements "token healing" to fix this by backing up partial tokens and re-sampling aligned continuations.
Common Pitfall: Prompting with `<a href="http:` won't produce `//` next, because `://` is a single token (ID 1129 in cl100k_base), but your prompt forced `:` to be tokenized separately. The model doesn't know how to continue from a boundary that never occurs in training data.
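The healing step can be sketched in a few lines over a toy string vocabulary. Real implementations such as Guidance operate on token IDs and the model's logits; this sketch only shows the trim-and-constrain idea:

```python
def heal(prompt, vocab):
    """Token healing sketch: trim the longest prompt suffix that is a prefix
    of some vocabulary token, and return the tokens allowed to replace it."""
    for cut in range(len(prompt), 0, -1):          # try longest suffix first
        suffix = prompt[-cut:]
        continuations = sorted(t for t in vocab if t.startswith(suffix))
        # Heal only if the suffix could extend into a longer token.
        if len(continuations) > 1 or (continuations and continuations[0] != suffix):
            return prompt[:-cut], continuations
    return prompt, sorted(vocab)                   # nothing to heal

vocab = {":", "://", "http", "a", " href=", "<", '"'}
trimmed, allowed = heal('<a href="http:', vocab)
print(trimmed)  # <a href="http
print(allowed)  # [':', '://']
```

The generator then samples the next token only from `allowed`, so the model is free to emit `://` as a single token, exactly as it saw during training.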
Beyond Subwords: The Rise of Byte-Level Models
The most exciting development in tokenization is the push to eliminate it entirely. If models could process raw bytes, every problem above (arithmetic splits, multilingual inequality, glitch tokens, boundary effects) would disappear.
Evolution from character-level to byte-level tokenization-free architectures
ByT5 (Xue et al., 2022) proved the concept: a transformer processing byte sequences can match token-level models. But byte sequences are 4-5x longer, making attention costs prohibitive.
MegaByte (Yu et al., 2023) from Meta introduced a two-level architecture: a large "global" transformer processes fixed-size patches of bytes, while a smaller "local" transformer handles individual bytes within each patch. This achieves sub-quadratic scaling for million-byte sequences.
SpaceByte (Slagle, 2024, NeurIPS 2024) took a smarter approach: instead of fixed-size patches, apply the larger transformer blocks only after space characters (natural word boundaries). SpaceByte matched subword transformer performance on English text and code, achieving 1.009 bits-per-byte on PG-19 versus MegaByte's 1.083.
The real breakthrough came in December 2024 with Meta's Byte Latent Transformer (BLT) (Pagnoni et al., 2024). BLT uses entropy-based dynamic patching: a small byte-level language model computes next-byte entropy, and patch boundaries appear where the next byte is hardest to predict. Simple, predictable regions (common words) get large patches requiring little compute; complex regions (rare words, code, numbers) get small patches with more attention.
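The patching rule can be sketched with a stand-in entropy estimator. Here a bigram byte model fitted on the input itself plays the role of BLT's small byte-level transformer, and the threshold is arbitrary:

```python
import math
from collections import Counter, defaultdict

def bigram_entropies(data: bytes):
    """Next-byte Shannon entropy under a bigram model fit on `data` itself
    (a stand-in for BLT's small byte-level language model)."""
    following = defaultdict(Counter)
    for a, b in zip(data, data[1:]):
        following[a][b] += 1
    entropies = [8.0]  # first byte has no context: assume maximum uncertainty
    for i in range(1, len(data)):
        counts = following[data[i - 1]]
        total = sum(counts.values())
        entropies.append(-sum((c / total) * math.log2(c / total)
                              for c in counts.values()))
    return entropies

def entropy_patches(data: bytes, threshold: float = 1.0):
    """Start a new patch wherever next-byte entropy exceeds the threshold."""
    ents = bigram_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if ents[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

text = b"the cat sat on the mat"
patches = entropy_patches(text)
assert b"".join(patches) == text  # patches partition the input losslessly
```

Predictable stretches stay glued into long patches while uncertain positions open new ones, which is the compute-allocation behavior BLT exploits.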
The results: BLT matches Llama 3 at 8B parameters while using up to 50% fewer inference FLOPs. And because BLT operates on raw bytes, it handles typos, spelling variations, and novel words gracefully. There is no fixed vocabulary to be surprised by.
The 2025-2026 Frontier in Tokenization Research
Even within subword approaches, recent work has pushed the boundaries considerably.
SuperBPE (COLM 2025): A two-pass BPE that first learns standard subword tokens, then learns cross-word "superword" tokens spanning whitespace. SuperBPE produces 33% fewer tokens and improves average performance by 4.0% across 30 benchmarks, with an 8.2% gain on MMLU. It wins on 25 of 30 individual tasks and trains in a few hours on 100 CPUs.
BoundlessBPE (COLM 2025): Relaxes the pre-tokenization boundary constraint entirely, allowing merges across word boundaries. Achieves up to 15% improvement in bytes per token and a 3-5% increase in Renyi efficiency over standard BPE.
LiteToken (February 2026): Identifies and removes "intermediate merge residues," tokens that are frequent during BPE training but rarely appear in final tokenized output. About 10% of tokens in major tokenizers are residues. LiteToken is plug-and-play: it works with any existing tokenizer, reduces fragmentation, and improves handling of noisy or misspelled inputs.
Dynamic tokenization is gaining traction too. ADAT (NeurIPS 2024) iteratively refines the vocabulary based on model feedback during training. Retrofitting LLMs with Dynamic Tokenization (ACL 2025) enables flexible tokenization post-training, reducing inference FLOPs by choosing token granularity adaptively.
On the theoretical side, Rajaraman et al. (2024) at NeurIPS 2024 proved that transformers cannot learn k-th order Markov sources without tokenization but can with it. This is the first theoretical justification for why tokenization helps beyond mere compression.
When to Use Each Tokenization Strategy
Choosing the right tokenizer depends on your use case, target languages, and compute budget:
| Scenario | Recommended approach | Why |
|---|---|---|
| English-only production API | Large BPE vocab (100K+) | Shortest sequences, lowest cost |
| Multilingual application | SentencePiece Unigram or o200k_base | Better cross-lingual compression |
| Code-heavy workloads | Byte-level BPE with code-aware pre-tokenization | Handles whitespace efficiently |
| Research on new languages | Train custom BPE/Unigram on domain data | Avoids the multilingual token tax |
| Extreme noise tolerance needed | Byte-level model (BLT) | No fixed vocabulary to break |
| Small model (<1B params) | Smaller vocab (32K-64K) | Embedding matrix fits in memory |
| Large model (70B+) | Larger vocab (128K-256K) | Compute-optimal per scaling laws |
Pro Tip: Prompt caching (available from OpenAI, Anthropic, and Google) gives up to 90% discounts on repeated input tokens. Combined with a large-vocabulary tokenizer that produces fewer tokens, you can achieve 60-80% cost reductions on production workloads with context engineering.
Practical Cost Implications
All major LLM providers charge per token. Since different tokenizers produce different token counts for the same text, the model choice affects cost independently of quality:
- 1,000 English words come to roughly 1,300 tokens with o200k_base but 1,500+ tokens with a smaller-vocabulary tokenizer
- Non-English text shows even larger differences: the same Arabic paragraph might cost 3x more with one provider than another
- Output tokens cost 2-5x more than input tokens across providers. A more efficient tokenizer for your language saves money on both sides
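These factors combine into a back-of-the-envelope cost model. The prices and cache behavior below are hypothetical; check your provider's current pricing:

```python
def monthly_cost(requests, in_tokens, out_tokens,
                 in_price_per_m, out_price_per_m,
                 cached_fraction=0.0, cache_discount=0.9):
    """Rough monthly spend estimate. Prices are per million tokens;
    `cached_fraction` of input tokens gets `cache_discount` off."""
    in_cost = requests * in_tokens * in_price_per_m / 1e6
    in_cost *= 1 - cached_fraction * cache_discount
    out_cost = requests * out_tokens * out_price_per_m / 1e6
    return in_cost + out_cost

# Hypothetical prices: $2.50/M input, $10/M output, 50% of input cache-hit.
print(round(monthly_cost(100_000, 2_000, 500, 2.50, 10.00,
                         cached_fraction=0.5), 2))  # 775.0
```

Plugging in two candidate models' real token counts for the same prompts (which differ by tokenizer, as above) makes the cross-provider cost gap concrete.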
Decision framework for choosing a tokenizer based on use case and constraints
Conclusion
Tokenization is the most underappreciated component in the entire language model stack. Every problem you've encountered with LLMs (arithmetic failures, "how many r's in strawberry" mistakes, inflated API costs for non-English text, mysterious glitch-token behavior) traces back to how text gets split into integers before the model processes it.
The field is at a turning point. BPE has served well since 2016, and innovations like SuperBPE and LiteToken are pushing the subword approach further. But byte-level models like Meta's BLT have proven that tokenization-free architectures can match tokenized models at scale while eliminating entire categories of failure modes. The question is no longer whether models can work without tokenizers, but when the transition happens at production scale.
For practitioners, the immediate takeaway is that tokenization is a first-class design decision. The tokenizer you pick affects model accuracy, multilingual fairness, inference cost, and which tasks your model can reliably handle. Understanding tokenization isn't optional.
To build on this foundation, explore How Large Language Models Actually Work for the transformer architecture that processes tokens, Text Embeddings for how tokens become vectors, and Context Engineering for working within token limits effectively. For the frontier of model intelligence built on top of tokenization, see Reasoning Models.
Frequently Asked Interview Questions
Q: Why do LLMs use subword tokenization instead of character-level or word-level tokenization?
Character-level tokenization creates very long sequences that are expensive for attention (quadratic scaling), and the model struggles to learn meaningful patterns from individual characters. Word-level tokenization can't handle out-of-vocabulary words like typos, names, or code. Subword tokenization hits the sweet spot: common words stay intact for efficiency, while rare words decompose into reusable subword pieces that the model has seen in other contexts.
Q: Explain how BPE training works in three sentences.
BPE initializes the vocabulary with individual characters (or bytes), then repeatedly counts all adjacent symbol pairs across the corpus and merges the most frequent pair into a new symbol. This continues for a fixed number of merge operations. The resulting merge rules are saved and applied in the same order to tokenize new text at inference time.
Q: A user complains your multilingual chatbot is slower and more expensive for Arabic queries than English ones. What's happening?
The tokenizer likely has much higher fertility (tokens per word) for Arabic than English because it was trained primarily on English text. The same semantic content requires 2-5x more tokens in Arabic, which increases both latency (more attention computations) and cost (per-token pricing). Solutions include using a tokenizer with better multilingual coverage (like o200k_base or SentencePiece trained on balanced multilingual data) or training a language-specific tokenizer.
Q: What is a "glitch token" and why do they exist?
A glitch token is a vocabulary entry whose embedding is essentially random noise because the token appeared frequently in the tokenizer's training data (earning a vocabulary slot) but rarely in the model's actual training data (so the model never learned its meaning). When prompted with a glitch token, models produce nonsensical or evasive outputs. About 4.3% of vocabulary entries in tested models are glitch tokens.
Q: How does Meta's Byte Latent Transformer (BLT) eliminate the need for a fixed vocabulary?
BLT processes raw bytes instead of tokens, using a small auxiliary model to compute next-byte entropy. It places patch boundaries where entropy is high (hard-to-predict regions), creating variable-sized patches that group predictable bytes together. This means common words get processed cheaply in large patches while rare or complex sequences get fine-grained attention. BLT matches Llama 3 at 8B parameters while using up to 50% fewer inference FLOPs.
Q: Your team is building a code generation model. What tokenization considerations are most important?
First, ensure the tokenizer efficiently handles whitespace, since formatting consumes roughly 25% of tokens in code. Look for tokenizers with dedicated multi-space tokens (like GPT-4's cl100k_base). Second, consider how the tokenizer splits variable names and syntax: camelCase and snake_case identifiers should ideally split at natural boundaries. Third, evaluate byte-level BPE to handle diverse programming languages and special characters without unknown tokens.
Q: Why does vocabulary size matter for model scaling, and what's the current best practice?
Vocabulary size directly affects the embedding matrix size (vocab x hidden dim parameters) and the softmax computation cost per token. However, larger vocabularies produce shorter sequences, reducing the quadratic attention cost. Research from Tao et al. (2024) showed a log-linear relationship between optimal vocabulary size and model size. Current best practice for large models (70B+) is 128K-256K tokens, while smaller models (<7B) may benefit from 32K-64K to keep embedding overhead manageable.