BERT: How Google Changed NLP Forever

LDS Team
Let's Data Science

In October 2018, a team at Google led by Jacob Devlin published a paper that reshaped how machines understand language. BERT (Bidirectional Encoder Representations from Transformers) didn't just beat existing benchmarks; it obliterated them, improving state-of-the-art results on eleven NLP tasks simultaneously. One year later, Google integrated BERT into its search engine, affecting 10% of all English queries. The core idea was deceptively simple: instead of reading text left-to-right like previous models, let the model read in both directions at once. That architectural choice launched the pre-train-then-fine-tune era that dominates machine learning to this day.

We'll use a consistent running example throughout this article: fine-tuning BERT for sentiment classification on movie reviews. Every concept, from tokenization to the final classification head, ties back to this task.

The Pre-Training and Fine-Tuning Revolution

Pre-training and fine-tuning is a two-stage transfer learning approach where a model first learns general language representations from massive unlabeled text, then adapts to specific tasks with minimal labeled data. Before BERT, practitioners trained models from scratch for every new task. Need a spam classifier? Train from scratch. Need a question answering system? Start over. This wasted enormous compute and demanded large labeled datasets for every application.

BERT flipped the script. Google pre-trained one model on 3.3 billion words (English Wikipedia plus the BooksCorpus), learning grammar, facts, and reasoning patterns. You could then fine-tune this pre-trained model on your specific task with as few as a thousand labeled examples. A movie review classifier, a named entity recognizer, and a question answering system could all start from the same pre-trained weights.

Key Insight: Pre-training captures the "what" of language (syntax, semantics, world knowledge), while fine-tuning teaches the "how" of your specific task. This is the same principle behind transfer learning, applied to NLP for the first time at scale.

This approach reduced labeling costs dramatically. Tasks that previously required 100,000 labeled examples could now achieve comparable accuracy with 5,000 examples and a fine-tuned BERT model.

Masked Language Modeling: Teaching BERT to Read

Masked Language Modeling (MLM) is BERT's primary pre-training objective, where the model learns to predict randomly hidden words from their surrounding context in both directions. Traditional language models like GPT-1 predicted the next word given all previous words, reading left-to-right. This works well for generation but limits understanding, because each word can only attend to what came before it.

BERT's solution was brilliant. During pre-training, 15% of input tokens are randomly selected for prediction. Of those selected tokens:

  • 80% are replaced with a special [MASK] token
  • 10% are replaced with a random word
  • 10% are left unchanged

For our sentiment example, consider the sentence: "The movie was absolutely [MASK] and I loved every minute." BERT must predict "brilliant" (or a similar word) using context from both sides: "absolutely" on the left and "and I loved" on the right. A left-to-right model would never see "I loved every minute" when predicting the masked word.
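The selection-and-corruption procedure above can be sketched in a few lines. This is a simplified illustration (whole tokens, a toy vocabulary, stdlib randomness), not BERT's actual implementation:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: select ~15% of positions, then
    mask 80% of them, randomize 10%, and leave 10% unchanged."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)      # None = position not predicted
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # else: leave the token unchanged (the remaining 10%)
    return corrupted, labels

tokens = "the movie was absolutely brilliant and i loved every minute".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, seed=4)
print(corrupted)
print(labels)
```

The `labels` list records the original token only at selected positions; everywhere else the loss is simply not computed.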

The MLM training loss follows cross-entropy over the masked positions:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$$

Where:

  • $\mathcal{L}_{\text{MLM}}$ is the masked language modeling loss
  • $\mathcal{M}$ is the set of masked token positions
  • $x_i$ is the original token at position $i$
  • $\mathbf{x}_{\setminus \mathcal{M}}$ is the input sequence with masked positions replaced
  • $P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$ is the predicted probability of the correct token given the corrupted input

In Plain English: For each masked word in the movie review, BERT looks at every other word in the sentence (both before and after the mask) and outputs a probability distribution over its entire vocabulary. The loss pushes BERT to assign the highest probability to the original word that was hidden.
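To make the formula concrete, here is the masked cross-entropy computed by hand on toy, hypothetical logits (three-word vocabulary, two positions, one of them masked):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(logits_per_position, target_ids, masked_positions):
    """Cross-entropy summed over masked positions only, matching
    -sum_{i in M} log P(x_i | corrupted input)."""
    loss = 0.0
    for i in masked_positions:
        probs = softmax(logits_per_position[i])
        loss -= math.log(probs[target_ids[i]])
    return loss

# Toy example: position 0 is masked, its true token id is 0.
logits = [[2.0, 0.5, 0.1],   # position 0: model favors token 0
          [0.0, 0.0, 0.0]]   # position 1: not masked, so ignored
loss = mlm_loss(logits, target_ids=[0, 2], masked_positions=[0])
```

Note that position 1 contributes nothing: only the positions in $\mathcal{M}$ enter the sum, which is why unmasked tokens produce no gradient signal through this loss.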

The 80/10/10 split is a clever trick. If BERT only ever saw [MASK] tokens during pre-training, it would struggle during fine-tuning when no masks are present. The 10% random replacement and 10% unchanged tokens force the model to maintain good representations for all positions, not just masked ones.

Next Sentence Prediction and Why It Fell Out of Favor

Next Sentence Prediction (NSP) is BERT's secondary pre-training objective, where the model learns whether two sentences naturally follow each other in text. During training, BERT receives sentence pairs: half the time sentence B follows sentence A in the corpus (labeled IsNext), half the time it's a random sentence (labeled NotNext).

For our sentiment task, a positive pair might be: "The cinematography was stunning." followed by "Every frame looked like a painting." A negative pair would swap in something unrelated like "The restaurant closed at midnight."
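Constructing NSP training pairs is mechanical. A minimal sketch (the real pipeline samples from a document-level corpus; this toy version works over a flat sentence list):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build NSP pairs: for each adjacent sentence pair, flip a coin.
    Heads -> keep the true next sentence (IsNext); tails -> substitute
    a random other sentence (NotNext)."""
    rng = random.Random(seed)
    pairs = []
    for a, b in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            pairs.append((a, b, "IsNext"))
        else:
            distractor = rng.choice([s for s in sentences if s is not b])
            pairs.append((a, distractor, "NotNext"))
    return pairs

corpus = [
    "The cinematography was stunning.",
    "Every frame looked like a painting.",
    "The restaurant closed at midnight.",
]
for a, b, label in make_nsp_pairs(corpus, seed=1):
    print(f"{label}: {a!r} -> {b!r}")
```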

NSP aimed to help tasks like question answering and natural language inference. The idea had merit, but later research revealed a problem.

Common Pitfall: RoBERTa (Liu et al., 2019) showed that removing NSP entirely and training with longer contiguous text sequences actually improved performance. The issue was that NSP was too easy: the model could often distinguish random sentence pairs by topic alone, without learning deeper logical relationships. This finding pushed all subsequent BERT variants to drop or replace NSP.

BERT Architecture: The Encoder-Only Transformer

BERT uses only the encoder portion of the original transformer architecture, applying bidirectional self-attention across all layers. While the full transformer (Vaswani et al., 2017) has both an encoder and a decoder, BERT discards the decoder entirely. This is a deliberate design choice: BERT's goal is understanding, not generation.

[Figure: BERT encoder-only architecture showing bidirectional self-attention across all layers]

The original BERT paper introduced two model sizes:

| Configuration | Layers | Hidden Size | Attention Heads | Parameters |
|---------------|--------|-------------|-----------------|------------|
| BERT-Base     | 12     | 768         | 12              | 110M       |
| BERT-Large    | 24     | 1024        | 16              | 340M       |

Each layer applies multi-head self-attention followed by a feed-forward network. The critical difference from GPT-style models is the attention mask. GPT uses a causal mask that prevents each token from attending to future tokens. BERT uses no mask at all: every token attends to every other token. This bidirectional attention gives BERT its power for understanding tasks.

For a movie review like "The plot was predictable but the acting saved it," BERT's attention lets "predictable" attend to "saved" and vice versa. A left-to-right model processing "predictable" would have no idea the review eventually turns positive.
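The difference between the two attention patterns is easiest to see by building the masks directly. A minimal sketch, where `True` means "attention allowed" (real implementations add these masks to attention scores as large negative values, but the pattern is identical):

```python
def causal_mask(n):
    """GPT-style: token i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT-style: every token attends to every other token."""
    return [[True] * n for _ in range(n)]

# Visualize the causal pattern for a 4-token sequence: 'x' = allowed.
for row in causal_mask(4):
    print("".join("x" if ok else "." for ok in row))
```

Under the causal mask, "predictable" at position 3 could never see "saved" at position 7; under BERT's all-`True` mask, it can.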

Key Insight: Bidirectional attention gives BERT superior understanding of existing text, but it cannot generate text autoregressively. GPT's causal attention enables fluent text generation but sacrifices full bidirectional context. Neither is universally better; they serve different purposes.

WordPiece Tokenization

WordPiece is BERT's subword tokenization algorithm that splits text into a vocabulary of approximately 30,522 tokens, balancing between full words and individual characters. Unlike word-level tokenization (which fails on unseen words) or character-level tokenization (which produces very long sequences), WordPiece finds a middle ground.

The algorithm works greedily: for each word, it finds the longest matching prefix in the vocabulary, marks it as a token, and repeats for the remaining characters. Continuation pieces are prefixed with ##.

For our sentiment example, consider tokenizing "The movie was unbelievably good":

```text
Input:  "The movie was unbelievably good"
Tokens: ["The", "movie", "was", "un", "##bel", "##iev", "##ably", "good"]
IDs:    [1996, 3185, 2001, 4895, 12588, 9870, 10354, 2204]
```

Common words stay intact ("movie", "good"), while rarer words get split into recognizable pieces. This means BERT never encounters a truly unknown word, because it can always decompose it into known subwords. For a deeper look at how different tokenization strategies affect model performance, see our guide on tokenization.
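The greedy longest-prefix matching described above can be sketched directly. This uses a tiny hand-picked vocabulary for illustration; real BERT uses ~30K learned pieces and falls back to `[UNK]` only when no match exists:

```python
def wordpiece(word, vocab):
    """Greedy WordPiece: repeatedly take the longest vocabulary prefix.
    Continuation pieces are looked up with the '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece           # not the word's first piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1                           # shrink and retry
        else:
            return ["[UNK]"]                   # no prefix matched at all
        start = end
    return pieces

vocab = {"un", "##bel", "##iev", "##ably", "good", "movie"}
print(wordpiece("unbelievably", vocab))  # -> ['un', '##bel', '##iev', '##ably']
```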

BERT also adds special tokens: [CLS] at the start (used for classification), [SEP] between sentences, and [PAD] for padding shorter sequences.

The [CLS] Token and Sentence Representations

The [CLS] (classification) token is prepended to every input and serves as an aggregate representation of the entire sequence after passing through all transformer layers. By the final hidden layer, its embedding has attended to every other token across all 12 (or 24) layers, making it a compressed summary of the full input.

For our sentiment classifier, this is the token that matters most. The classification pipeline works like this:

  1. Prepend [CLS] to the movie review
  2. Pass the full sequence through all BERT layers
  3. Extract the [CLS] token's final hidden state (a 768-dimensional vector for BERT-Base)
  4. Feed this vector through a single linear layer + softmax for classification

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

review = "The cinematography was breathtaking and the story kept me on the edge of my seat"
inputs = tokenizer(review, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)

# Note: the classification head is freshly initialized here, so these
# probabilities are meaningless until the model is fine-tuned.
probs = torch.softmax(outputs.logits, dim=-1)
print(f"Negative: {probs[0][0]:.4f}, Positive: {probs[0][1]:.4f}")
```

Pro Tip: While [CLS] works well for classification after fine-tuning, it's a poor choice for generating sentence embeddings without fine-tuning. For embedding tasks, models like Sentence-BERT apply mean pooling over all token embeddings instead. See our article on text embeddings for production-grade embedding strategies.
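Mean pooling itself is simple: average the token embeddings, skipping padding positions via the attention mask. A minimal sketch over toy, hypothetical 2-dimensional embeddings (a real model's hidden states would be 768-dimensional tensors):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real tokens only.
    attention_mask: 1 = real token, 0 = padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Three real tokens plus one padding slot whose embedding must be ignored.
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]  # last row is [PAD]
mask = [1, 1, 1, 0]
sentence_embedding = mean_pool(embs, mask)  # averages only the three real tokens
```

The masking step matters: averaging padding embeddings would pull every short sentence toward the same padding vector.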

Fine-Tuning BERT for Downstream Tasks

Fine-tuning adapts BERT's pre-trained weights to a specific task by adding a thin task-specific head and training end-to-end on labeled data. The same pre-trained model handles radically different tasks with minimal architectural changes.

[Figure: BERT fine-tuning pipeline showing one pre-trained model adapted to classification, NER, and question answering tasks]

Sentiment Classification (Our Running Example)

Add a linear layer on top of [CLS]. Feed in labeled movie reviews. Train for 2-4 epochs with a learning rate around 2e-5. That's it. The pre-trained layers already understand language; you're just teaching the classification head what "positive" and "negative" mean in your domain.

```python
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers versions
)

# trainer = Trainer(model=model, args=training_args, train_dataset=..., eval_dataset=...)
# trainer.train()
```

Named Entity Recognition (NER)

Instead of using [CLS], NER uses every token's final hidden state. Each token gets classified into entity categories (Person, Organization, Location, etc.). WordPiece tokenization complicates this slightly: if "Washington" splits into ["Wash", "##ington"], only the first subword token gets a label, and the rest are ignored during loss computation.
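The first-subword labeling rule is easy to implement once tokens are marked with `##` continuations. A sketch, using the PyTorch convention of `-100` as the ignore index for cross-entropy (hypothetical integer labels):

```python
def align_labels(word_labels, wordpieces):
    """Give each word's label to its first subword; continuation pieces
    (those starting with '##') get -100 so the loss ignores them."""
    aligned, word_idx = [], 0
    for piece in wordpieces:
        if piece.startswith("##"):
            aligned.append(-100)
        else:
            aligned.append(word_labels[word_idx])
            word_idx += 1
    return aligned

pieces = ["Wash", "##ington", "visited", "Paris"]
labels = [1, 0, 2]  # e.g. B-LOC, O, B-LOC for the three original words
print(align_labels(labels, pieces))  # -> [1, -100, 0, 2]
```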

Question Answering

For extractive QA (like SQuAD), BERT learns two things: where the answer starts and where it ends. Two linear layers predict start and end positions over all tokens. The input format is [CLS] question [SEP] passage [SEP], and the model identifies the answer span within the passage.
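Once the model emits start and end logits, span selection is a small search: pick the (start, end) pair with the highest combined score, subject to start ≤ end and a length cap. A sketch over hypothetical logits (real pipelines also exclude spans that cross into the question):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair maximizing start_logit + end_logit,
    with start <= end and an answer no longer than max_len tokens."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hypothetical logits over a 6-token passage: the answer spans tokens 2-3.
start = [0.1, 0.2, 3.0, 0.1, 0.0, 0.1]
end   = [0.0, 0.1, 0.2, 2.5, 0.1, 0.0]
print(best_span(start, end))  # -> (2, 3)
```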

Production Fine-Tuning Considerations

  • Learning rate: 2e-5 to 5e-5 works for most tasks. Higher rates destabilize pre-trained weights.
  • Epochs: 2-4 is typical. BERT fine-tunes fast because pre-trained weights are already strong.
  • Batch size: 16-32. Larger batches need linear learning rate scaling.
  • Sequence length: BERT's original maximum is 512 tokens. Truncate or use chunking for longer documents.
  • Memory: BERT-Base needs roughly 4GB GPU memory for inference, 8-12GB for fine-tuning with batch size 16.
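For documents that exceed the 512-token limit, the usual workaround is overlapping chunks: slide a window across the token ids so no context is lost at chunk boundaries. A sketch (special tokens like [CLS]/[SEP] omitted for clarity, so real budgets are slightly smaller):

```python
def chunk_ids(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows of max_len
    tokens; consecutive windows share `stride` tokens of overlap."""
    chunks, start = [], 0
    step = max_len - stride
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break                      # final window reached the end
        start += step
    return chunks

ids = list(range(1000))                # a 1000-token document
chunks = chunk_ids(ids, max_len=512, stride=128)
print(len(chunks), [len(c) for c in chunks])  # -> 3 [512, 512, 232]
```

Per-chunk predictions are then aggregated, e.g. by max- or mean-pooling classification scores across chunks.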

BERT vs GPT: Encoder Meets Decoder

BERT and GPT represent the fundamental split in modern NLP: encoder-only models for understanding versus decoder-only models for generation. Both originate from the same transformer architecture, but they use it differently.

[Figure: Side-by-side comparison of BERT bidirectional encoder and GPT autoregressive decoder architectures]

| Aspect | BERT (Encoder) | GPT (Decoder) |
|--------|----------------|---------------|
| Attention direction | Bidirectional (sees all tokens) | Unidirectional (sees only past tokens) |
| Pre-training task | Masked Language Modeling | Next Token Prediction |
| Primary strength | Classification, retrieval, NER | Text generation, reasoning, chat |
| Fine-tuning approach | Task-specific head on top | Prompt engineering or instruction tuning |
| Inference latency | Fast (single forward pass) | Slow (autoregressive token-by-token) |
| Typical parameter count (2026) | 100M to 400M | 7B to 400B+ |
| Cost per 1M tokens (API) | 0.01 to 0.10 USD | 0.50 to 15.00 USD |

Key Insight: For our sentiment classifier, BERT processes the entire review in a single forward pass (about 5ms on a T4 GPU). A GPT-style model would need to generate a classification answer token-by-token, taking 50-200ms for the same task. When you're classifying millions of reviews daily, that 10-40x speed difference translates directly into infrastructure cost.

The BERT Family: From DistilBERT to ModernBERT

BERT spawned an entire family of encoder models, each addressing specific limitations of the original. Here's how they compare as of March 2026:

| Model | Year | Key Innovation | Parameters | Context Length |
|-------|------|----------------|------------|----------------|
| BERT | 2018 | Bidirectional MLM + NSP | 110M / 340M | 512 |
| DistilBERT | 2019 | Knowledge distillation (40% smaller, 60% faster) | 66M | 512 |
| RoBERTa | 2019 | Removed NSP, more data, dynamic masking | 125M / 355M | 512 |
| ALBERT | 2019 | Parameter sharing, factorized embeddings | 12M / 18M | 512 |
| DeBERTa | 2020 | Disentangled attention + enhanced mask decoder | 134M / 390M | 512 |
| DeBERTaV3 | 2021 | ELECTRA-style training + DeBERTa architecture | 86M / 304M | 512 |
| ModernBERT | 2024 | RoPE, GeGLU, alternating local/global attention | 149M / 395M | 8,192 |

RoBERTa (Liu et al., 2019) proved that BERT was significantly undertrained. By removing NSP, using 160GB of text (10x BERT's training data), and employing dynamic masking (changing which tokens are masked each epoch), RoBERTa pushed the GLUE average from BERT-Large's 82.2 to 88.5.

DeBERTa (He et al., 2020) introduced disentangled attention, separating content and position into two distinct vectors instead of summing them early like BERT. This yielded GLUE scores of 90.1, surpassing human performance on several benchmarks. The tradeoff: DeBERTa-Large uses roughly twice the GPU memory of RoBERTa-Large.

ModernBERT (December 2024) is the most significant update to the BERT architecture in years. Built by Answer.AI, LightOn, and collaborators, it brings modern techniques to encoder models:

  • Rotary Positional Embeddings (RoPE) replace absolute positional encodings, enabling the 8,192-token context window
  • GeGLU activation replaces GeLU in feed-forward layers for improved expressiveness
  • Alternating local/global attention where every third layer uses full global attention (RoPE theta 160,000) and remaining layers use 128-token sliding window attention (RoPE theta 10,000)
  • Flash Attention and unpadding for 2x faster inference than older encoders
  • Trained on 2 trillion tokens of English and code data (600x more than original BERT)
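
The alternating local/global pattern can be sketched as attention masks. This is a deliberately simplified illustration of the idea (a centered sliding window, global attention every third layer); ModernBERT's exact window semantics may differ from this toy version:

```python
def layer_mask(layer_idx, seq_len, window=128):
    """Simplified alternating attention: every third layer is fully
    global; the rest restrict each token to a local sliding window."""
    if layer_idx % 3 == 0:                       # global layer
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2                           # local layer: +/- half tokens
    return [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]

# Layer 0 attends globally; layer 1 sees only a 128-token neighborhood.
global_mask = layer_mask(0, seq_len=256)
local_mask = layer_mask(1, seq_len=256)
```

Local layers cost O(n · window) instead of O(n²), which is what makes the 8,192-token context affordable; the periodic global layers still let information propagate across the whole sequence.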

Pro Tip: If you're starting a new project in 2026 that needs an encoder model, start with ModernBERT-base (149M params). It matches or beats DeBERTaV3 on most benchmarks, runs faster, handles 8K token contexts natively, and understands code. The only reason to pick DeBERTaV3 is if you need a model specifically fine-tuned for a niche domain where DeBERTa checkpoints already exist.

When to Use BERT-Style Models (and When Not To)

Encoder models still dominate specific production workloads in 2026. Here's the decision framework.

Use a BERT-style encoder when:

  1. You need low-latency classification (spam, toxicity, sentiment) at high throughput
  2. You're building retrieval or ranking systems for search or RAG pipelines
  3. You need structured extraction (NER, relation extraction, POS tagging)
  4. Your task has clear labels and you can fine-tune on domain data
  5. Cost matters: encoder inference is 10-100x cheaper than LLM inference

Use a GPT-style decoder when:

  1. You need open-ended text generation (writing, summarization, translation)
  2. The task requires multi-step reasoning or chain-of-thought
  3. You want zero-shot or few-shot capability without fine-tuning
  4. Your task definition changes frequently (prompt engineering is faster than re-training)

Common Pitfall: Don't use a 70B-parameter LLM for binary classification just because it can do it. A fine-tuned BERT-Base (110M parameters) will classify faster, cheaper, and often more accurately on tasks with sufficient training data. Reserve LLMs for tasks that genuinely require generation or reasoning. For the principles behind using context effectively with LLMs, see context engineering.

BERT's Lasting Impact on Modern AI

BERT's influence extends far beyond its own architecture. GPT-2 and GPT-3 adopted the same pre-training philosophy with a decoder architecture, and Vision Transformers (ViT) applied masked pre-training to images. Today's largest models trace their training methodology back to Devlin et al.'s 2018 work.

Google's integration of BERT into Search in October 2019 was the first time a transformer model directly touched billions of users daily. When someone searched "can you get medicine for someone pharmacy," BERT understood the query was about picking up a prescription for another person, something keyword-matching missed entirely.

In production systems as of March 2026, encoder models remain the backbone of:

  • Search engines (query understanding, document ranking)
  • Content moderation at scale (billions of posts classified daily)
  • Embedding generation for vector databases and similarity search
  • Named entity recognition in financial, legal, and medical documents
  • Code search and understanding (ModernBERT's training on code data makes it strong here)

The Hugging Face Transformers library (now at v5) hosts over 80,000 BERT-family checkpoints, making it the most fine-tuned architecture in history.

Conclusion

BERT proved that bidirectional understanding of text, achieved through masked language modeling, could produce representations powerful enough to advance the entire field of NLP in one paper. The pre-train-then-fine-tune approach it introduced remains the default strategy for building ML systems in 2026, whether you're working with encoder models or the massive decoder-only LLMs that followed.

For production NLP tasks that demand speed, accuracy, and cost efficiency, encoder models haven't been replaced. They've been refined. ModernBERT's 8,192-token context window, Flash Attention, and 2-trillion-token training corpus bring the architecture into the modern era while preserving what made BERT special: fast, focused, bidirectional understanding. If you're building search, classification, or extraction systems, start here.

To go deeper, explore how large language models work for the full picture of the transformer ecosystem, dive into text embeddings to see how BERT-style models power semantic search, or learn how encoders fit into RAG pipelines for production retrieval systems.

Interview Questions

Q: Why does BERT use bidirectional attention instead of left-to-right, and what tradeoff does this create?

Bidirectional attention lets each token attend to both past and future context, producing richer representations. The tradeoff is that BERT cannot generate text autoregressively because it has already "seen" the full input, so it excels at classification and retrieval but cannot produce coherent text like GPT-style models.

Q: Explain Masked Language Modeling. Why does BERT mask only 15% of tokens and use the 80/10/10 split?

MLM randomly selects 15% of tokens for prediction. Masking too many removes too much context; masking too few makes training slow. The 80/10/10 split (mask/random/unchanged) addresses a train-test mismatch: during fine-tuning, no [MASK] tokens appear. By sometimes showing the original or random tokens during pre-training, BERT learns representations that work well without masks.

Q: What is the purpose of the [CLS] token, and when should you not rely on it?

The [CLS] token aggregates information from the entire sequence through self-attention across all layers. After fine-tuning, its final hidden state feeds into classification heads. However, for sentence embeddings without task-specific fine-tuning, [CLS] produces poor representations. Mean pooling over all token embeddings (as in Sentence-BERT) works significantly better for similarity tasks.

Q: Why did RoBERTa remove Next Sentence Prediction, and what replaced it?

RoBERTa showed that NSP didn't help and sometimes hurt performance. The task was too easy because the model could distinguish sentence pairs by topic alone, without learning real logical relationships. RoBERTa replaced it with nothing; it simply trained on longer contiguous text sequences with dynamic masking, which gave the model better long-range understanding.

Q: When would you choose a fine-tuned BERT model over a large language model for a production NLP task?

Choose BERT when you need high-throughput, low-latency classification or extraction with labeled training data. A fine-tuned BERT-Base runs 10-40x faster than prompting a decoder model and costs a fraction of the compute. Choose an LLM when you need generation, zero-shot capability, or multi-step reasoning.

Q: How does ModernBERT improve on the original BERT architecture?

ModernBERT (2024) brings three key improvements: rotary positional embeddings (RoPE) extend context length from 512 to 8,192 tokens, alternating local/global attention layers improve efficiency on long sequences, and training on 2 trillion tokens (vs. BERT's 3.3 billion) dramatically improves knowledge. It's roughly 2x faster than older encoders at equivalent sequence lengths thanks to Flash Attention and unpadding.

Q: Explain the difference between WordPiece and BPE tokenization in the context of BERT and GPT.

Both create subword vocabularies but differ in construction. BPE iteratively merges the most frequent adjacent character pairs, while WordPiece selects merges that maximize corpus likelihood. In practice both produce similar tokenizations, but WordPiece tends toward slightly more linguistically meaningful subwords. BERT uses about 30,000 tokens; GPT models typically use 50,000 to 100,000.

Q: A colleague suggests using BERT for summarizing long documents. What's wrong with this approach?

BERT is an encoder-only model that cannot generate text. Summarization requires producing new text, which needs a decoder (GPT-style) or encoder-decoder model (T5, BART). BERT's 512-token limit also means it can't process most documents without chunking. For extractive summarization, you'd score and select existing sentences rather than generate new ones, which produces lower-quality results.
