Every time you type a prompt into ChatGPT and watch words appear one at a time, you're witnessing a decoder-only transformer predicting the next token. That mechanism, refined across seven years and five major releases, turned a 117-million-parameter research experiment into the engine behind products used by hundreds of millions of people. Understanding the GPT architecture means understanding why modern AI feels so capable and where its limits still live.
We'll trace the full GPT lineage from GPT-1 through GPT-5 using a single running example: how the model generates a response to "The capital of France is". Every architectural concept connects back to this one sentence.
Decoder-Only Transformer Architecture
The GPT architecture is a decoder-only transformer, meaning it uses only the right half of the original transformer design proposed by Vaswani et al. (2017). Where the full transformer has an encoder that reads input and a decoder that produces output, GPT collapses both roles into a single stack of decoder layers.
Each layer contains two sub-blocks: a masked multi-head self-attention mechanism and a position-wise feed-forward network (FFN). Layer normalization and residual connections wrap each sub-block. GPT-1 used post-normalization (LayerNorm after the sub-block), while GPT-2 and every model since switched to pre-normalization (LayerNorm before the sub-block), which stabilizes training at scale.
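The difference between the two placements can be sketched in a few lines of plain Python (a toy `layer_norm` without the learned gain and bias, and a stand-in sub-layer in place of real attention or an FFN):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    """GPT-1 style: residual add first, then LayerNorm."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    """GPT-2+ style: LayerNorm first, then residual add."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# A stand-in sub-block; real models use attention or an FFN here.
double = lambda x: [2.0 * v for v in x]

print(post_norm_block([1.0, 2.0, 3.0], double))  # normalized output
print(pre_norm_block([1.0, 2.0, 3.0], double))   # residual stream preserved
```

The key point: pre-normalization keeps the raw residual stream flowing through the stack unnormalized, which is what stabilizes gradients in very deep models.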
For our running example, the input "The capital of France is" gets tokenized into a sequence of token IDs. Each token ID maps to an embedding vector, and learned positional encodings tell the model where each token sits in the sequence. This combined representation feeds into the first transformer layer.
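A toy version of this pipeline looks like the following (a hypothetical five-word vocabulary and random 4-dimensional embeddings; real GPT models use byte-pair-encoded subwords, a ~50k-100k token vocabulary, and learned embedding tables):

```python
import random

random.seed(0)
D = 4  # toy embedding width (GPT-3 uses 12,288)

# Hypothetical toy vocabulary; real tokenizers split text into subwords.
vocab = {"The": 0, "capital": 1, "of": 2, "France": 3, "is": 4}
tok_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in vocab]
pos_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(8)]

tokens = "The capital of France is".split()
ids = [vocab[t] for t in tokens]  # token IDs: [0, 1, 2, 3, 4]

# Input to the first layer = token embedding + positional embedding, per position.
x = [[t + p for t, p in zip(tok_emb[i], pos_emb[pos])]
     for pos, i in enumerate(ids)]

print(ids)                 # [0, 1, 2, 3, 4]
print(len(x), len(x[0]))   # 5 positions, each a D-dim vector
```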
Figure: GPT decoder-only transformer architecture showing token input, embeddings, stacked transformer layers with causal attention and FFN, and output probabilities.
Key Insight: The "decoder-only" label is slightly misleading. GPT doesn't decode from a separate encoder's output. It reads and generates within the same stream of tokens. A more accurate name would be "autoregressive transformer," but the original naming stuck.
Autoregressive Next-Token Prediction
Autoregressive language modeling is the training objective that defines every GPT model. The model learns to predict the next token in a sequence given all preceding tokens. During training, it sees massive amounts of text and adjusts its weights to maximize the probability of each observed token given its context.
The probability of a full sequence factors as:

$$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$

Where:
- $P(x_t \mid x_1, \ldots, x_{t-1})$ is the probability of token $x_t$ given all previous tokens
- $T$ is the total number of tokens in the sequence
- The product runs over every position, each conditioned on everything before it
In Plain English: To generate the word after "The capital of France is", the model computes a probability distribution over its entire vocabulary. "Paris" gets a high probability because the training data contained many passages establishing that fact. The model picks a token (say, "Paris"), appends it, and repeats the process for the next position.
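This loop can be sketched with hand-picked toy distributions (hypothetical probabilities over a three-word "vocabulary"; a real model computes them with a neural network over its full vocabulary):

```python
import math

# Hypothetical next-token distributions, for illustration only.
cond = {
    ("The", "capital", "of", "France", "is"): {"Paris": 0.92, "the": 0.05, "a": 0.03},
    ("The", "capital", "of", "France", "is", "Paris"): {".": 0.85, ",": 0.15},
}

def greedy_next(context):
    """Pick the highest-probability next token given the context."""
    dist = cond[tuple(context)]
    return max(dist, key=dist.get)

context = ["The", "capital", "of", "France", "is"]
out = []
for _ in range(2):
    tok = greedy_next(context)
    out.append(tok)
    context.append(tok)  # condition on the committed choice

print(out)  # ['Paris', '.']

# The full-sequence probability factors as the product of the per-step choices:
log_p = math.log(0.92) + math.log(0.85)
```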
This left-to-right generation is why text appears word by word in ChatGPT. The model doesn't "think" about the whole response at once. It commits to one token, then conditions on that choice for the next one. This differs fundamentally from BERT, which processes the entire input bidirectionally.
For a deeper look at how temperature, top-k, and top-p sampling strategies shape token selection, see our dedicated guide.
Causal Self-Attention: Why GPT Only Looks Backward
Causal (masked) self-attention is the mechanism that enforces the autoregressive property inside the transformer. In standard self-attention, every token can attend to every other token. Causal attention adds a triangular mask that blocks each token from seeing any future tokens.
The attention computation for a single head is:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

Where:
- $Q$, $K$, $V$ are the query, key, and value matrices (linear projections of the input)
- $d_k$ is the dimension of each key vector (scaling factor to prevent dot products from growing too large)
- $M$ is the causal mask: 0 for allowed positions, $-\infty$ for future positions
- The softmax converts raw scores into a probability distribution over positions
In Plain English: When the model processes the token "France" in our example, it can attend to "The", "capital", "of", and "France" itself. It cannot peek ahead at "is" or anything after it. The mask fills those future positions with negative infinity, which softmax converts to zero attention weight.
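The mask-then-softmax step can be demonstrated directly (dummy uniform scores stand in for the real $QK^\top/\sqrt{d_k}$ values):

```python
import math

def softmax(scores):
    """Numerically stable softmax; exp(-inf) underflows cleanly to 0.0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "capital", "of", "France", "is"]
n = len(tokens)

# Raw attention scores would come from Q.K^T / sqrt(d_k); dummy values here.
raw = [[1.0] * n for _ in range(n)]

# Causal mask: position i may attend to positions 0..i only.
NEG_INF = float("-inf")
masked = [[raw[i][j] if j <= i else NEG_INF for j in range(n)] for i in range(n)]

weights = [softmax(row) for row in masked]
# Row for "France" (index 3): spread over the first four tokens, zero on "is".
print([round(w, 2) for w in weights[3]])  # [0.25, 0.25, 0.25, 0.25, 0.0]
```

Every future position receives exactly zero attention weight, which is what guarantees the autoregressive factorization holds inside every layer.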
This is the core architectural difference between GPT and BERT. BERT uses bidirectional attention (every token sees every other token), which makes it excellent for understanding existing text but unable to generate new text autoregressively. GPT's causal mask sacrifices bidirectional context in exchange for the ability to generate coherent text token by token.
Multi-head attention runs this computation in parallel across multiple "heads" (GPT-3 uses 96 heads per layer), each learning to focus on different relationship types: syntactic structure in one head, semantic similarity in another, positional patterns in a third.
GPT Evolution: From 117M to Frontier Scale
The GPT family spans seven years of rapid scaling. Each generation introduced architectural innovations beyond simply adding more parameters.
| Model | Year | Parameters | Layers | Context | Key Innovation |
|---|---|---|---|---|---|
| GPT-1 | 2018 | 117M | 12 | 512 | Unsupervised pre-training + supervised fine-tuning |
| GPT-2 | 2019 | 1.5B | 48 | 1,024 | Pre-normalization, zero-shot task transfer |
| GPT-3 | 2020 | 175B | 96 | 2,048 | In-context learning, few-shot prompting |
| GPT-4 | 2023 | ~1.8T (rumored MoE) | 120 | 128K | Multimodal input, mixture of experts |
| GPT-5 | 2025 | Undisclosed | Undisclosed | 1M+ | Unified reasoning, real-time routing |
GPT-1 (2018) proved a concept: pre-train a language model on raw text with next-token prediction, then fine-tune it on specific tasks. With just 117M parameters and a 512-token context window across 12 transformer layers, it showed that unsupervised pre-training could transfer to downstream NLP tasks. The training corpus was BookCorpus, roughly 7,000 unpublished books.
GPT-2 (2019) scaled to 1.5B parameters and introduced pre-normalization (LayerNorm before the attention and FFN blocks). The bigger discovery was zero-shot transfer: GPT-2 could perform tasks it was never trained on simply by framing them as text completion. OpenAI initially withheld the full model over misuse concerns.
GPT-3 (2020) was the inflection point. At 175B parameters trained on 300B tokens, it demonstrated in-context learning: the ability to perform tasks based on a few examples provided in the prompt, without any weight updates. GPT-3's 96 layers, 96 attention heads per layer, and 12,288-dimensional embeddings made it 100x larger than GPT-2. This is where the scaling laws research by Kaplan et al. became critical, showing predictable relationships between compute, data, parameters, and loss.
GPT-4 (2023) moved to a mixture of experts (MoE) architecture, though OpenAI never officially confirmed the details. Credible leaks suggest approximately 1.8 trillion total parameters across 120 layers, with 16 expert networks of roughly 111B parameters each and 2 experts routed per forward pass. GPT-4 also added multimodal input (vision), extended the context to 128K tokens, and showed substantial improvements in reasoning benchmarks. The MoE approach meant GPT-4 could be far larger in total capacity while keeping inference costs manageable.
GPT-5 (August 2025) represents the current frontier. OpenAI focused on capability over parameter counts: 94.6% on AIME 2025 (math), 74.9% on SWE-bench Verified (coding), and 45% fewer factual errors than GPT-4o when web search is enabled. The key innovation is a real-time router that dynamically switches between fast mode for simple queries and "thinking" mode for complex reasoning, merging GPT-4o and the o-series models into a single system.
Figure: GPT evolution timeline from GPT-1 in 2018 to GPT-5 in 2025, showing parameter counts and key innovations at each stage.
Pre-Training, Fine-Tuning, and Alignment
The GPT training pipeline has three distinct phases, each building on the previous one.
Phase 1: Pre-training uses next-token prediction on trillions of tokens from the internet, books, and code. The model learns grammar, facts, and reasoning purely from predicting what comes next. GPT-3 trained on 300B tokens; modern frontier models train on 10T+. The loss is standard cross-entropy between the predicted distribution and the actual next token.
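The loss at a single position reduces to a one-liner (hypothetical probabilities over a tiny vocabulary; a real model scores every token in a ~100k-entry vocabulary):

```python
import math

def cross_entropy(pred_dist, target):
    """Next-token loss: negative log-probability assigned to the observed token."""
    return -math.log(pred_dist[target])

# Hypothetical model distribution at the position after "The capital of France is".
pred = {"Paris": 0.92, "London": 0.01, "a": 0.07}
loss = cross_entropy(pred, "Paris")
print(round(loss, 4))  # a confident correct prediction yields a small loss
```

Training simply averages this quantity over every position in every sequence and minimizes it by gradient descent.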
Phase 2: Supervised Fine-Tuning (SFT) trains the pre-trained model on curated instruction-response pairs. Human contractors write high-quality responses, and the model learns to follow instructions rather than just complete text. This transforms a completion engine into a useful assistant.
Phase 3: Alignment via RLHF/RLAIF further refines behavior. In RLHF, humans rank multiple model outputs, training a reward model that captures preferences. The language model is then fine-tuned with PPO to maximize reward scores while staying close to the SFT baseline. RLAIF, pioneered by Anthropic's Constitutional AI, replaces human rankers with AI critics guided by explicit principles, cutting cost from over one dollar per preference label to under a cent.
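The reward-model step can be sketched with the pairwise (Bradley-Terry style) loss commonly used in RLHF; the scores below are made-up stand-ins for a real reward model's outputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss for training the reward model:
    -log P(chosen beats rejected) = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Hypothetical reward scores for two candidate responses to one prompt.
print(preference_loss(2.0, 0.5))  # reward model agrees with the ranking: low loss
print(preference_loss(0.5, 2.0))  # reward model disagrees: high loss
```

Minimizing this loss over many ranked pairs produces a scalar reward function, which PPO then optimizes against during the final fine-tuning stage.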
Pro Tip: The alignment phase is why ChatGPT refuses certain requests and follows instructions. The raw pre-trained model has no concept of helpfulness or safety. RLHF/RLAIF is what shapes the "personality" you interact with.
For a detailed exploration of how context engineering builds on these alignment techniques in production systems, see our guide.
Scaling Laws: Predicting Performance Before Training
Scaling laws are empirical relationships that predict a model's loss as a function of compute budget, parameter count, and training data size. They let researchers decide how to allocate resources before committing millions of dollars to a training run.
Kaplan et al. (2020) established that loss follows a power law in each variable. For model size:

$$L(N) = L_\infty + \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Where:
- $L(N)$ is the cross-entropy loss as a function of model size
- $N$ is the number of model parameters
- $L_\infty$ is a constant (the "irreducible entropy" floor)
- $\alpha_N$ is the scaling exponent (empirically around 0.076 for parameters)
- $N_c$ is a fitted scale constant
In Plain English: Double the parameters and you get a predictable, diminishing reduction in loss. This isn't a guess; it holds across multiple orders of magnitude, from millions to trillions of parameters.
The Chinchilla scaling laws (Hoffmann et al., 2022) refined this by showing that Kaplan's approach over-allocated parameters relative to data. Chinchilla's finding: a compute-optimal model should train on roughly 20 tokens per parameter. This means a 70B model needs about 1.4T training tokens. GPT-3, with 175B parameters trained on only 300B tokens, was significantly undertrained by this standard.
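The 20-tokens-per-parameter rule is easy to sanity-check (illustrative arithmetic only; the true compute-optimal ratio varies somewhat with the fitted constants):

```python
def chinchilla_optimal_tokens(params):
    """Compute-optimal training tokens: roughly 20 tokens per parameter."""
    return 20 * params

print(chinchilla_optimal_tokens(70e9) / 1e12)   # 1.4 trillion tokens for a 70B model
print(chinchilla_optimal_tokens(175e9) / 1e12)  # 3.5 trillion -- GPT-3 saw only 0.3T
```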
Modern practice has moved even further from Chinchilla-optimal. Meta's Llama 3 8B trained on 15T tokens, nearly 200x the Chinchilla recommendation, because inference cost matters more than training efficiency for widely deployed models. Training a smaller model on more data yields better quality per inference dollar.
KV Cache and Efficient Inference
The KV cache is the single most important optimization for autoregressive inference. Without it, generating each new token would require recomputing attention over every previous token from scratch, making generation cost quadratic in sequence length.
For our running example, after the model generates "Paris" from "The capital of France is", the KV cache holds the key-value pairs for all six tokens. When generating the next token (perhaps a period), the model only computes Q, K, V for the new token, retrieves the cached K and V for positions 1 through 6, and runs attention. This drops per-token generation cost from $O(n^2)$ to $O(n)$.
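A toy operation count makes the quadratic-versus-linear difference concrete (counting only attention-score entries, ignoring the projections and FFN, which the cache also helps with):

```python
def scores_without_cache(t):
    """No cache: re-run attention for all t tokens, a t x t score matrix."""
    return t * t

def scores_with_cache(t):
    """With cache: compute only the new token's row against t cached keys."""
    return t

# Total attention-score work to generate a 100-token continuation.
total_no_cache = sum(scores_without_cache(t) for t in range(1, 101))
total_cache = sum(scores_with_cache(t) for t in range(1, 101))
print(total_no_cache, total_cache)  # 338350 vs 5050
```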
The tradeoff is memory. For a model like GPT-3 with 96 layers and 96 attention heads, the KV cache for a single 2,048-token sequence requires roughly 3 GB in FP16. At 128K context (GPT-4), this balloons to around 200 GB per sequence. Modern techniques for managing this include:
- Multi-Query Attention (MQA) and Grouped-Query Attention (GQA): share key-value heads across multiple query heads, cutting cache size by 8x or more
- Multi-head Latent Attention (MLA): compresses key-value pairs into a low-rank latent space, reducing cache by 90%+ (used in DeepSeek V3/V4)
- Quantization: storing cache entries in INT8 or FP8 instead of FP16
- Selective eviction: dropping cache entries for less-important tokens based on attention entropy
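The cache-size arithmetic behind these techniques fits in one function. The model configuration below is hypothetical (a mid-sized 32-layer model), chosen to show the MQA/GQA saving rather than to match any specific released model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    """Per-sequence KV cache size: 2 tensors (K and V) per layer, each of
    shape [seq_len, n_kv_heads, head_dim], at bytes_per_val precision (FP16 = 2)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_val

# Hypothetical 32-layer model with 32 heads of width 128 (d_model = 4096).
full_mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa_4kv = kv_cache_bytes(32, n_kv_heads=4, head_dim=128, seq_len=8192)

print(full_mha / 2**30, "GiB")  # 4.0 GiB with full multi-head attention
print(gqa_4kv / 2**30, "GiB")   # 0.5 GiB with GQA: 4 KV heads shared by 32 Q heads
```

Note that the cache grows linearly with sequence length and layer count but is independent of how many query heads attend to it, which is exactly the lever MQA and GQA pull.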
Common Pitfall: Many developers overlook KV cache memory when planning inference infrastructure. A model that fits in GPU memory for a single short prompt can run out of memory entirely when serving long-context requests or batching multiple users.
Encoder-Only vs. Decoder-Only vs. Encoder-Decoder
Three transformer variants dominated NLP between 2018 and 2023. Understanding why decoder-only won explains why GPT's architecture became the default for frontier models.
Figure: Comparison of encoder-only, decoder-only, and encoder-decoder transformer architectures, showing their attention patterns, training objectives, and best use cases.
| Architecture | Example | Attention | Training Objective | Best For |
|---|---|---|---|---|
| Encoder-only | BERT (2018) | Bidirectional | Masked language modeling | Classification, NER, search |
| Decoder-only | GPT (2018+) | Causal (left-to-right) | Next-token prediction | Generation, chat, reasoning |
| Encoder-decoder | T5 (2019) | Bidirectional encoder + causal decoder | Span corruption | Translation, summarization |
Why decoder-only won. Three factors converged. First, simplicity: a single stack of identical layers is easier to scale, parallelize, and optimize than a two-stack encoder-decoder. Second, scaling behavior: decoder-only models showed better performance scaling with compute, partly because every token in the training data contributes to the loss (BERT masks only 15% of tokens per pass). Third, emergent abilities: as decoder-only models scaled past 100B parameters, they developed unexpected capabilities like in-context learning, chain-of-thought reasoning, and code generation that weren't explicitly trained.
BERT remains excellent for tasks that need bidirectional understanding (semantic search, classification), but it can't generate text. T5 showed that encoder-decoder models can be competitive, but the added complexity hasn't paid off at frontier scale. Every major frontier model in March 2026, including GPT-5, Claude Opus 4.6, Gemini 3.1 Pro, and Llama 4, uses a decoder-only architecture.
The Modern Frontier: GPT-5 and Its Competitors
As of March 2026, the frontier model competition looks markedly different from even a year ago.
GPT-5 (August 2025) unified reasoning and fast generation behind a single API. Its real-time routing system decides whether a query needs "thinking" compute or can be answered quickly. On SWE-bench Verified, it scores 74.9%, with 45% fewer factual errors than GPT-4o when web search is enabled.
Claude Opus 4.6 (February 2026) leads on complex coding tasks with 80.8% on SWE-bench and offers 1M-token context in beta. At 5/25 USD per million tokens (input/output), it's the premium coding and agent model.
Gemini 3.1 Pro (February 2026) scores 94.3% on GPQA Diamond (PhD-level science) at 2/12 USD per million tokens (input/output). It remains the only major model with native video input processing.
Llama 4 (April 2025) brought open-weight models to MoE. Llama 4 Scout uses 17B active parameters from 109B total across 16 experts, with a 10M-token context window.
DeepSeek V3 (December 2024) matched GPT-4-class performance with 671B total parameters (37B active) and Multi-head Latent Attention, training on just 2.788M H800 GPU hours.
For a thorough comparison of open-source versus closed LLMs, including cost, performance, and deployment tradeoffs, see our dedicated article.
When to Use GPT-Style Models (and When Not To)
GPT-style decoder-only models excel at generative tasks: chatbots, content creation, code generation, reasoning, and agentic workflows. If your task involves producing text, a decoder-only architecture is almost certainly the right choice.
They're not the best fit for classification or embedding tasks needing bidirectional context. For semantic search or sentence similarity, encoder-based models still outperform decoder-only models per compute dollar.
For tasks requiring structured outputs like JSON schemas, GPT-style models work well with constrained decoding, but you need to handle the output format carefully.
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a geography expert."},
        {"role": "user", "content": "The capital of France is"},
    ],
    max_tokens=50,
    temperature=0.0,
)

# The model generates autoregressively, token by token:
# Step 1: P("Paris" | "The capital of France is") → high probability → select "Paris"
# Step 2: P("." | "The capital of France is Paris") → select "."
# Step 3: P(<stop> | ...) → generation ends
print(response.choices[0].message.content)
# Output: Paris.
```
Figure: Autoregressive token-by-token generation flow, showing how GPT predicts each successive token conditioned on all previous tokens.
Conclusion
The GPT architecture's power comes from a deceptively simple idea: stack identical decoder layers, train them to predict the next token, and scale. From GPT-1's 117M parameters in 2018 to GPT-5's unified reasoning system in 2025, the core mechanism hasn't changed. What changed is the scale, the alignment techniques, and the engineering required to serve billions of requests.
Understanding how transformers work at the attention level provides the foundation for grasping GPT. From there, the path to understanding modern AI systems leads through tokenization, sampling strategies, and reasoning models that build on the GPT architecture with chain-of-thought capabilities.
The next time you watch ChatGPT type out a response, you'll know exactly what's happening: a causal attention mask ensuring each token only sees the past, a softmax distribution picking the next token, and a KV cache making the whole thing fast enough to feel like a conversation.
Interview Questions
Q: Why does GPT use a decoder-only architecture instead of the full encoder-decoder transformer?
Decoder-only architectures are simpler to scale because they use a single, uniform stack of layers. Every token in the training data contributes to the loss function (unlike BERT, which masks only 15% of tokens), making training more data-efficient. At frontier scale, decoder-only models also exhibit emergent capabilities like in-context learning that weren't observed in encoder-decoder models at the same compute budget.
Q: Explain the causal attention mask and why it's necessary for autoregressive generation.
The causal mask is an upper-triangular matrix of negative infinity values applied to the attention scores before softmax. It prevents each token from attending to future positions. Without it, the model during training could "cheat" by looking ahead at the answer, and during generation, there would be no future tokens to attend to anyway. The mask enforces the autoregressive factorization: each token's probability depends only on preceding tokens.
Q: What is the KV cache and why does it matter for inference performance?
The KV cache stores the key and value projections from previous tokens so they don't need to be recomputed at each generation step. Without it, generating token $t$ would require reprocessing all $t-1$ previous tokens through every attention layer, making generation cost quadratic. With the KV cache, each new token only computes its own Q, K, V and reuses cached K and V from prior tokens, reducing per-step cost to linear. The tradeoff is memory: for long contexts, the cache can consume hundreds of gigabytes.
Q: How do scaling laws influence decisions about model size and training data?
Scaling laws (Kaplan et al., 2020; Chinchilla, 2022) show that loss decreases as a power law with respect to parameters, data, and compute. Chinchilla showed that compute-optimal training uses roughly 20 tokens per parameter, meaning many early models like GPT-3 were undertrained. In practice, modern models like Llama 3 train far beyond Chinchilla-optimal because inference cost per token matters more than training efficiency for widely deployed models.
Q: What changed between GPT-3 and GPT-4 architecturally?
GPT-4 is widely believed to use a mixture of experts (MoE) architecture rather than GPT-3's dense transformer. In MoE, each token activates only a subset of experts, so total parameter count can be much larger (rumored 1.8T) while keeping per-token compute manageable. GPT-4 also added multimodal input (vision) and extended context to 128K tokens. OpenAI has not officially confirmed the MoE details, but multiple credible leaks corroborate the architecture.
Q: How does RLHF transform a pre-trained GPT model into a useful assistant?
RLHF happens after supervised fine-tuning. Human annotators rank multiple model outputs for the same prompt, and these rankings train a reward model. The language model is then fine-tuned with PPO to maximize reward model scores while staying close to the SFT baseline (using a KL divergence penalty). This process teaches the model to be helpful, follow instructions, and avoid harmful outputs. RLAIF (using AI feedback instead of humans) has become increasingly common due to 100x lower costs per preference label.
Q: Why did decoder-only models win over encoder-only and encoder-decoder architectures for modern LLMs?
Three factors: simplicity (one uniform layer stack scales and parallelizes better), training efficiency (every token contributes to the loss, unlike BERT's 15% masking rate), and emergent abilities (in-context learning, chain-of-thought reasoning, and code generation appeared at scale in decoder-only models but not in other architectures at equivalent compute). By 2026, every frontier model uses decoder-only, including GPT-5, Claude Opus 4.6, Gemini 3.1 Pro, and Llama 4.
Q: A user reports that GPT generates great short responses but degrades on long outputs. What's the likely cause?
This is typically an attention dilution problem. As the sequence grows, attention weights spread across more tokens, weakening recall of earlier context. Positional encoding limitations can also cause degradation beyond the training context length. Solutions include models with RoPE (Rotary Position Embeddings) that generalize to longer contexts, retrieval-augmented generation, and placing critical information near the end of the input.