The Transformer Architecture Explained

LDS Team
Let's Data Science

In June 2017, eight researchers at Google published a paper that would become the most cited machine learning paper of the 21st century. "Attention Is All You Need" (Vaswani et al., 2017) introduced the transformer, an architecture that replaced recurrence and convolution with a single mechanism: attention. With over 168,000 citations on Semantic Scholar by early 2026, the transformer now powers every major language model, from GPT-5 to Claude to Gemini, and has expanded far beyond text into vision, audio, protein structure prediction, and robotics.

We'll trace one sentence through the entire transformer architecture, from raw tokens to translated output, so you can see exactly how each component fits together. Our running example: translating the English sentence "The cat sat on the mat" into French.

Why Recurrent Networks Hit a Wall

Recurrent neural networks process tokens one at a time, left to right. Each hidden state depends on the previous one, creating a strict sequential bottleneck. For RNNs and LSTMs, this means two practical problems.

First, training can't be parallelized across sequence positions. A 512-token input requires 512 sequential steps regardless of hardware. Second, information from early tokens must survive through every intermediate hidden state to reach later tokens. Even LSTMs, designed to mitigate this, struggle with dependencies spanning hundreds of positions.

The transformer solved both problems simultaneously. Every token attends to every other token in a single parallel operation. A 512-token sequence needs one step, not 512. And the distance between any two tokens is always one attention hop, regardless of their positions in the sequence.

| Property | RNN/LSTM | Transformer |
| --- | --- | --- |
| Processing | Sequential (O(n) steps) | Parallel (O(1) steps) |
| Max dependency path | O(n) | O(1) |
| Training speed on GPU | Slow (low utilization) | Fast (high utilization) |
| Memory for long sequences | Fixed hidden state | O(n^2) attention matrix |
| Years dominant | 2014 to 2017 | 2017 to present |

[Figure: RNN sequential processing vs Transformer parallel processing]

Key Insight: The transformer trades memory for parallelism. The O(n^2) attention cost seems expensive, but GPUs are designed for exactly this kind of dense matrix operation. The sequential nature of RNNs wastes GPU compute.

Self-Attention: The Core Mechanism

Self-attention lets each token in a sequence decide how much to attend to every other token, including itself. Think of it like a library search. Every word in our sentence "The cat sat on the mat" simultaneously asks a question (Query), advertises what it contains (Key), and holds actual information to share (Value).

Query, Key, Value Matrices

Each input token starts as an embedding vector. The model learns three weight matrices, W_Q, W_K, and W_V, which project every embedding into three distinct vectors:

  • Query (Q): "What am I looking for?" When the word "sat" generates its query, it's asking for relevant context.
  • Key (K): "What do I contain?" The word "cat" generates a key that advertises: "I'm the subject, an animate noun."
  • Value (V): "Here's my information." If attention decides "cat" is relevant to "sat," the value of "cat" gets passed along.

The dot product between a query and a key measures relevance. High dot product means strong relevance; the value from that position gets weighted heavily in the output.

Scaled Dot-Product Attention

The full attention computation in a single equation:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • Q is the matrix of query vectors (one row per token)
  • K is the matrix of key vectors (one row per token)
  • V is the matrix of value vectors (one row per token)
  • d_k is the dimension of each key vector
  • QK^T computes the dot product between every query-key pair
  • \sqrt{d_k} is the scaling factor that prevents the dot products from growing too large
  • \text{softmax} normalizes each row to a probability distribution

In Plain English: For the word "sat" in our sentence, the model computes a relevance score against every other word. "Cat" scores high (it's the subject doing the sitting), "mat" scores moderately (it's the location), and "the" scores low. These scores become probabilities via softmax, and the output for "sat" becomes a weighted mix of all words' values, dominated by the most relevant ones.

The \sqrt{d_k} scaling is critical. Without it, dot products grow proportionally to the dimension size, pushing softmax into regions where gradients vanish. The original paper used d_k = 64, so the scaling factor was \sqrt{64} = 8.
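You can verify this variance argument numerically. The sketch below (illustrative, not from the paper) draws random queries and keys with unit-variance components and checks that raw dot products have variance near d_k, while scaled ones have variance near 1:

```python
import torch

torch.manual_seed(0)
d_k = 64
# Random queries and keys with unit-variance components
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=-1)       # one dot product per row pair
scaled = raw / d_k ** 0.5       # divide by sqrt(d_k)

print(f"Raw dot-product variance:    {raw.var():.1f}")   # close to d_k = 64
print(f"Scaled dot-product variance: {scaled.var():.2f}")  # close to 1
```

Variance near 1 keeps the softmax inputs in a range where its gradients stay healthy.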

[Figure: Self-attention mechanism showing Query, Key, Value matrices flowing to the attention output]

Multi-Head Attention Captures Different Relationships

A single attention head learns one type of relationship. But language has many simultaneous relationships: syntactic (subject-verb), semantic (word similarity), positional (adjacent words), and coreference (pronoun-antecedent). Multi-head attention runs multiple attention heads in parallel, each with its own learned W_Q, W_K, W_V projections.

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O

Where:

  • \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
  • h is the number of attention heads (8 in the original paper)
  • W_O is the output projection matrix
  • Each head operates on d_k = d_{\text{model}} / h dimensions (512 / 8 = 64)

In Plain English: Imagine eight different analysts examining our sentence "The cat sat on the mat." One analyst focuses on which words are grammatically connected (cat-sat), another on spatial relationships (sat-on-mat), another on article-noun pairs (the-cat, the-mat). Each analyst works independently on a 64-dimensional slice, then their findings are concatenated and projected back to the full 512 dimensions.

The original paper found that heads naturally specialize. Some consistently track syntactic dependencies, others handle positional patterns. This emergent specialization is one reason transformers generalize so well across tasks.

[Figure: Multi-head attention with parallel heads concatenated and projected]

Pro Tip: In modern architectures, Grouped Query Attention (GQA) shares key-value heads across groups of query heads. Llama 3, Gemini, and Mistral all use GQA because it reduces KV-cache memory by 4 to 8 times during inference with negligible quality loss. This matters enormously when serving models to millions of users.

Positional Encoding: Teaching Order to a Parallel System

Self-attention is permutation-invariant. It treats "The cat sat on the mat" and "mat the on sat cat The" identically. The model needs positional information injected explicitly.

The original transformer used sinusoidal positional encodings:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

Where:

  • pos is the position of the token in the sequence (0, 1, 2, ...)
  • i is the dimension index
  • d_{\text{model}} is the embedding dimension (512 in the original paper)
  • Even dimensions use sine, odd dimensions use cosine

In Plain English: Each position gets a unique fingerprint made of sine and cosine waves at different frequencies. Position 0 gets one pattern, position 1 gets a slightly rotated version. The model can learn to extract relative distances because the difference between any two positions' encodings depends only on their distance, not their absolute positions.
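The formulas translate directly into a few lines of PyTorch. This is a sketch; the function name and `max_len` parameter are illustrative choices, not from the paper:

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dim indices
    freq = 1.0 / (10000 ** (i / d_model))                          # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * freq)  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
print(pe.shape)       # torch.Size([100, 512])
print(pe[0, :4])      # position 0: sin(0)=0 and cos(0)=1 alternating
```

Note that position 0 yields the pattern [0, 1, 0, 1, ...], and every later position is a unique combination of wave values bounded in [-1, 1].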

Modern Positional Encoding: RoPE

Nearly all production transformers in 2026, including Llama 3, Claude, Gemini, Mistral, and DeepSeek, have abandoned sinusoidal encodings in favor of Rotary Position Embedding (RoPE). RoPE rotates query and key vectors in two-dimensional subspaces by an angle proportional to their position. After rotation, the dot product between any query-key pair naturally encodes their relative distance. RoPE adds zero extra parameters, handles arbitrary sequence lengths, and enables techniques like YaRN and NTK-aware scaling for extending context to millions of tokens.
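The core rotation is simple to sketch. The version below pairs adjacent dimensions and reuses the sinusoidal frequencies; it is a minimal illustration, and production implementations differ in layout and caching details:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of dimensions of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    theta = pos / (base ** (i / d))                                # (seq_len, d/2)
    cos, sin = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]                                # paired dims
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(6, 64)
q_rot = apply_rope(q)
# Rotations preserve vector norms; only the angles between positions change
print(torch.allclose(q.norm(dim=-1), q_rot.norm(dim=-1), atol=1e-5))  # True
```

The key property: if the same query and key vectors appear at positions m and n, the dot product after rotation depends only on m - n, which is exactly the relative-position signal attention needs.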

The Full Transformer Architecture

The original transformer follows an encoder-decoder structure. Let's trace our sentence "The cat sat on the mat" through each component.

[Figure: Full transformer architecture with encoder and decoder stacks]

Encoder Stack

The encoder consists of N = 6 identical layers. Each layer has two sub-layers:

  1. Multi-head self-attention. Every token in the English input attends to every other English token. The word "sat" can directly access "cat" and "mat" to understand the full scene.

  2. Position-wise feed-forward network. Two linear transformations with a ReLU (or GELU in modern variants) activation between them, applied identically to each position:

\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2

Where:

  • x is the input vector at a single token position (512 dimensions in the original paper)
  • W_1 and b_1 are the weights and bias of the first linear layer, projecting to d_{ff} = 2048 dimensions
  • \max(0, \cdot) is the ReLU activation (GELU in modern variants)
  • W_2 and b_2 are the weights and bias of the second linear layer, projecting back to d_{\text{model}} = 512

In Plain English: Each word's representation gets pushed through a bottleneck. For our sentence "The cat sat on the mat," the 512-dimensional representation of "sat" expands to 2048 dimensions, giving the model room to perform complex per-token reasoning, then compresses back to 512. This expansion-contraction pattern lets each position refine its meaning independently.
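The expansion-contraction pattern is just two `nn.Linear` layers with an activation between them. A minimal sketch with the original paper's dimensions:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand 512 -> 2048
        self.w2 = nn.Linear(d_ff, d_model)   # contract 2048 -> 512
        self.act = nn.ReLU()                 # GELU in modern variants

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.act(self.w1(x)))

ffn = PositionwiseFFN()
x = torch.randn(2, 6, 512)   # batch=2, six tokens ("The cat sat on the mat")
print(ffn(x).shape)          # torch.Size([2, 6, 512])
```

Because the same weights apply at every position, the FFN processes all six tokens in parallel, just like attention.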

Both sub-layers use residual connections and layer normalization:

\text{LayerNorm}(x + \text{Sublayer}(x))

Where:

  • x is the input to the sub-layer (the residual stream)
  • \text{Sublayer}(x) is either the multi-head attention or feed-forward network output
  • x + \text{Sublayer}(x) is the residual connection, adding the sub-layer's output back to its input
  • \text{LayerNorm} normalizes activations across the feature dimension to stabilize training

In Plain English: When the encoder processes "sat," the attention layer's output gets added back to the original representation of "sat." This skip connection means even if the attention layer learns nothing useful at first, the original signal survives. Layer normalization then rescales everything to prevent values from exploding or vanishing as they pass through dozens of stacked layers.

Together, residual connections and layer normalization allow stacking dozens (or hundreds) of layers without degradation. This is the backbone of how we train deep learning models with hundreds of billions of parameters.

Common Pitfall: The original paper applies layer norm after the residual addition (Post-LN). Most modern transformers use Pre-LN (normalizing before the sub-layer), which is more stable during training. GPT-style models all use Pre-LN. This small change matters significantly at scale.
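The two variants differ only in where the normalization sits relative to the residual addition. A schematic sketch, with a plain linear layer standing in for attention or the FFN:

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or FFN

def post_ln(x):
    # Original 2017 transformer: normalize AFTER the residual addition
    return norm(x + sublayer(x))

def pre_ln(x):
    # Modern GPT-style: normalize BEFORE the sub-layer, so the residual
    # stream itself is never normalized and gradients flow cleanly at depth
    return x + sublayer(norm(x))

x = torch.randn(2, 6, d_model)
print(post_ln(x).shape, pre_ln(x).shape)  # both torch.Size([2, 6, 512])
```

Same parameter count, same shapes; only the position of `norm` changes, yet Pre-LN trains far more stably in very deep stacks.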

Decoder Stack

The decoder also has 6 identical layers, but with three sub-layers per layer:

  1. Masked multi-head self-attention. The decoder generates tokens left to right. During training, the mask prevents position i from attending to positions > i, ensuring the model can't "cheat" by looking at future tokens in the target translation.

  2. Encoder-decoder cross-attention. The decoder's queries attend to the encoder's keys and values. When generating the French word "chat" (cat), the decoder focuses on the encoder's representation of "cat." This is where the translation actually happens.

  3. Position-wise feed-forward network. Same structure as in the encoder.
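The causal mask from step 1 is just a lower-triangular matrix: masked positions receive a score of negative infinity, so softmax assigns them exactly zero weight. A minimal sketch:

```python
import torch

seq_len = 6
# Lower-triangular mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(seq_len, seq_len)                 # raw attention scores
scores = scores.masked_fill(mask == 0, float('-inf'))  # block future positions
weights = torch.softmax(scores, dim=-1)

print(weights[0])  # row 0 can only see itself: [1, 0, 0, 0, 0, 0]
print(weights[-1]) # the last row sees everything
```

Each row still sums to 1; the mask only redistributes probability mass over the visible (past and present) positions.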

Output Layer

The decoder's output passes through a linear layer and softmax to produce a probability distribution over the French vocabulary. At each step, the model picks the most probable next token (or samples from the distribution, as covered in LLM Sampling).

Attention in PyTorch: A Compact Implementation

This implementation shows the core scaled dot-product and multi-head attention in PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, V), weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Project and reshape: (batch, seq, d_model) -> (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention per head
        attn_output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads and project
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(attn_output)

# Usage example
mha = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)  # batch=2, seq_len=10, d_model=512
output = mha(x, x, x)        # self-attention: Q=K=V=x
print(f"Input shape:  {x.shape}")
print(f"Output shape: {output.shape}")
```

Output:

```
Input shape:  torch.Size([2, 10, 512])
Output shape: torch.Size([2, 10, 512])
```

Notice how the output shape matches the input exactly. Self-attention transforms representations without changing dimensions, which is what makes the residual connections possible.

Scaling Laws: Bigger Models, Better Performance

One of the most remarkable properties of transformers is how predictably their performance improves with scale. The Kaplan scaling laws (2020) first demonstrated that cross-entropy loss follows a power law with respect to model size, dataset size, and compute budget.

The Chinchilla study (Hoffmann et al., 2022) refined this, showing that compute-optimal training requires scaling data and parameters equally. Their finding: roughly 20 tokens per parameter. A 70-billion parameter model should train on about 1.4 trillion tokens.
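The 20-tokens-per-parameter rule makes budget estimates trivial. A quick sketch, using the standard C ≈ 6·N·D approximation for training FLOPs (where N is parameters and D is tokens):

```python
def chinchilla_optimal_tokens(n_params: float, ratio: float = 20.0) -> float:
    """Compute-optimal training tokens under the Chinchilla ~20 tokens/param rule."""
    return n_params * ratio

n = 70e9                               # a 70B-parameter model
tokens = chinchilla_optimal_tokens(n)
flops = 6 * n * tokens                 # standard C ~ 6*N*D approximation
print(f"Optimal tokens: {tokens / 1e12:.1f}T")    # 1.4T
print(f"Training compute: {flops:.2e} FLOPs")     # ~5.9e23
```

That 1.4 trillion tokens matches the Chinchilla model in the table below exactly.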

| Model | Parameters | Training Tokens | Tokens/Param Ratio |
| --- | --- | --- | --- |
| GPT-3 (2020) | 175B | 300B | 1.7 |
| Chinchilla (2022) | 70B | 1.4T | 20 |
| Llama 3 (2024) | 70B | 15T | 214 |
| DeepSeek-V3 (2025) | 671B (37B active) | 14.8T | 22 (total) |

The trend since Chinchilla has been to overtrain, using far more tokens than the compute-optimal ratio suggests. Llama 3 trained on 15 trillion tokens for a 70B model. Why? Inference costs dominate deployment budgets. A smaller, overtrained model is cheaper to serve even if training costs more.

Key Insight: Scaling laws tell you what's possible, not what's practical. Compute-optimal training minimizes total FLOPs, but real-world economics favor smaller models trained on more data because you pay for inference millions of times but train only once.

Modern Transformer Innovations (March 2026)

The 2017 architecture was just the starting point. Here's what's changed.

Flash Attention

Standard attention materializes the full n \times n attention matrix in GPU memory. FlashAttention (Dao, 2022) restructures the computation using tiling and kernel fusion to avoid this materialization, reducing memory from O(n^2) to O(n) while running 2 to 4 times faster. FlashAttention-3 targets NVIDIA Hopper GPUs with asynchronous execution, and FlashAttention-4 (March 2026) is purpose-built for Blackwell's Tensor Memory architecture, reaching 1,600 TFLOPS on B200 GPUs with up to 3.6 times speedup over FlashAttention-2.
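In practice you rarely write these kernels yourself. PyTorch's `torch.nn.functional.scaled_dot_product_attention` (available since PyTorch 2.0) dispatches to a FlashAttention-style fused kernel when the hardware and dtypes allow; which backend actually runs depends on your GPU and PyTorch version. A usage sketch:

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 1024, 64)  # (batch, heads, seq_len, d_k)
k = torch.randn(2, 8, 1024, 64)
v = torch.randn(2, 8, 1024, 64)

# Fused attention: on supporting backends, the 1024x1024 attention matrix
# is never materialized in memory. is_causal=True applies the decoder mask.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```

The call is a drop-in replacement for the manual matmul-softmax-matmul pattern shown earlier, with identical outputs up to numerical precision.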

Mixture of Experts

As of March 2026, virtually all frontier models use Mixture of Experts (MoE) layers. Instead of one massive feed-forward network, MoE uses hundreds of smaller "expert" networks with a router that selects the top-k experts per token. DeepSeek-V3 has 671 billion total parameters but activates only 37 billion per token across 256 experts with top-8 routing. This delivers the quality of a huge model at the inference cost of a small one.
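A toy version of the routing logic makes the idea concrete. This is illustrative only; real MoE layers add load-balancing losses, capacity limits, and expert parallelism:

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)  # learned routing scores
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        logits = self.router(x)                           # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # top-k experts per token
        weights = torch.softmax(weights, dim=-1)          # normalize chosen scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                sel = idx[:, slot] == e                   # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, slot:slot + 1] * self.experts[e](x[sel])
        return out

moe = ToyMoE()
x = torch.randn(10, 64)
print(moe(x).shape)  # torch.Size([10, 64])
```

With 8 experts and top-2 routing, each token touches only a quarter of the expert parameters, which is the same total-versus-active distinction DeepSeek-V3 exploits at scale.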

Decoder-Only Dominance

The original encoder-decoder architecture was designed for sequence-to-sequence tasks like translation. But the GPT architecture showed that a decoder-only transformer, trained autoregressively, works remarkably well for generation, classification, translation, and essentially every language task. Every major LLM in 2026 (GPT-5, Claude, Gemini, Llama 4) uses a decoder-only architecture.

Beyond Text

Transformers have spread to every modality. Vision Transformers (ViT) now match or exceed CNNs on image classification. AlphaFold 2 used transformer attention to solve protein structure prediction, one of biology's oldest open problems. Whisper and Conformer architectures power modern speech recognition. Diffusion models for image and video generation (Stable Diffusion, Sora) use transformer backbones. The attention mechanism itself turned out to be modality-agnostic: any data that can be expressed as a sequence of tokens is fair game.

When to Use Transformers (and When Not To)

Use transformers when:

  • Your data has long-range dependencies (text, genomics, time series with seasonal patterns)
  • You have enough data and compute for pretraining or a good pretrained model exists
  • Parallelism matters for training speed
  • You need to capture complex relationships across multiple dimensions

Consider alternatives when:

  • Your sequences are very short (under 50 tokens). Simpler models like CNNs or even MLPs can be faster and equally accurate
  • You're severely compute-constrained. Transformers are memory-hungry, especially at long sequence lengths
  • You need strict real-time latency on edge devices. State-space models (Mamba) offer linear-time alternatives with competitive quality
  • Your task is fundamentally tabular. Gradient-boosted trees still outperform transformers on most structured data

Conclusion

The transformer's central insight was deceptively simple: let every element in a sequence attend to every other element, in parallel, and learn which connections matter. That idea, expressed through query-key-value projections and scaled dot-product attention, turned out to be the right inductive bias for language, and then for vision, protein folding, and audio processing too.

Understanding how large language models actually work starts with understanding transformers. The attention mechanism produces the text embeddings that power semantic search and RAG systems. And the architectural innovations since 2017, from RoPE to Flash Attention to Mixture of Experts, have been refinements of the original design rather than replacements.

If you're studying deep learning seriously, the transformer is the architecture you'll encounter most. Read the original paper. Implement the attention mechanism from scratch. Then explore the BERT and GPT families to see how one architecture spawned two fundamentally different approaches to the same underlying mechanism.

Interview Questions

Q: Explain the purpose of the scaling factor \sqrt{d_k} in the attention formula.

Without scaling, the dot products between queries and keys grow proportionally to the dimension d_k. Large dot products push the softmax function into saturation regions where gradients become extremely small, stalling training. Dividing by \sqrt{d_k} normalizes the variance of the dot products to approximately 1, keeping softmax in a well-behaved range regardless of the dimension size.

Q: Why does the transformer use multi-head attention instead of a single attention head with the same total dimension?

Multiple heads allow the model to attend to different types of relationships simultaneously. One head might learn syntactic dependencies (subject-verb agreement), another might capture semantic similarity, and another positional patterns. A single large head would blend all these signals into one attention distribution, losing the ability to maintain distinct relationship types. Empirically, 8 heads at 64 dimensions each consistently outperform 1 head at 512 dimensions.

Q: What is the difference between self-attention and cross-attention?

In self-attention, queries, keys, and values all come from the same sequence. Each token attends to all other tokens in its own sequence. In cross-attention (used in the decoder), queries come from the decoder's current representation, but keys and values come from the encoder's output. This is how the decoder "reads" the source sequence during translation or any sequence-to-sequence task.
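The shape difference makes this concrete: in cross-attention the query sequence and the key/value sequence can have different lengths, and the output always follows the queries. A sketch using the fused attention call (lengths are arbitrary):

```python
import torch
import torch.nn.functional as F

dec = torch.randn(1, 8, 4, 64)  # 4 decoder positions (French tokens so far)
enc = torch.randn(1, 8, 6, 64)  # 6 encoder positions ("The cat sat on the mat")

# Cross-attention: Q from the decoder, K and V from the encoder
out = F.scaled_dot_product_attention(dec, enc, enc)
print(out.shape)  # torch.Size([1, 8, 4, 64]) -- one output per decoder query
```

Self-attention is the special case where all three inputs are the same tensor.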

Q: How do modern models handle positional information differently from the original transformer?

The original transformer added fixed sinusoidal positional encodings to input embeddings. Modern models overwhelmingly use RoPE (Rotary Position Embedding), which applies position-dependent rotations to query and key vectors before the dot product. RoPE encodes relative position directly into the attention scores, requires no additional parameters, and can be extended to longer sequences than seen during training through techniques like YaRN.

Q: Why have decoder-only architectures become dominant over encoder-decoder designs?

Decoder-only models are simpler to train (single autoregressive objective), scale more efficiently (one stack instead of two), and surprisingly handle encoder-decoder tasks like translation well when formulated as generation. The practical engineering benefits of maintaining one architecture for pretraining, finetuning, and inference outweigh any marginal quality gains from task-specific encoder-decoder designs.

Q: What is Flash Attention, and why does it matter for production systems?

Standard attention requires materializing the full n \times n attention matrix, using O(n^2) memory. Flash Attention restructures the computation using tiling and kernel fusion to compute exact attention without storing the full matrix, reducing memory to O(n) while achieving a 2 to 4 times speedup. This directly enables longer context lengths. Without Flash Attention, processing 128K tokens on current hardware would be impractical.

Q: How does Mixture of Experts relate to the transformer's feed-forward layers?

MoE replaces the single feed-forward network in each transformer layer with multiple "expert" feed-forward networks and a routing mechanism. For each token, a learned router selects the top-k experts (typically 2 to 8). This allows the total parameter count to grow enormously while keeping per-token compute constant. DeepSeek-V3 has 671B parameters but activates only 37B per token, achieving the quality of a massive dense model at a fraction of the compute cost.

Q: A team member says, "Transformers can't handle long sequences because attention is quadratic." How would you respond?

That was a valid concern in 2017, but it's largely solved in 2026. Flash Attention reduces the memory bottleneck from O(n^2) to O(n) without approximation. Grouped Query Attention reduces KV-cache memory during inference. Techniques like ring attention distribute long sequences across multiple GPUs. Models like Gemini 1.5 and Claude routinely process 200K or more tokens. The quadratic compute cost remains, but hardware advances and algorithmic optimizations have pushed practical limits to millions of tokens.

Explore all career paths