Long Context Models: Working with 1M+ Token Windows

LDS Team
Let's Data Science

The original Transformer processed 512 tokens, roughly a single page of text. Llama 4 Scout now claims a context window of 10 million tokens: the equivalent of the entire Harry Potter series repeated fifteen times. Long context models have gone from processing a paragraph to ingesting entire codebases in a single pass, and the engineering required to get there touches every layer of the stack, from attention algorithms to GPU memory management.

But headline numbers lie. A model that accepts 1 million tokens does not necessarily reason over them equally well. The RULER benchmark (Hsieh et al., COLM 2024) revealed that most models claiming 128K+ context degrade sharply on complex retrieval tasks well before reaching their advertised limits. Effective context length sits at roughly 50-65% of the marketed capacity for most models. If you're building production systems that depend on long context, understanding where it works and where it breaks is not optional.

Throughout this article, we'll use a concrete running scenario: a legal team processing a 400-page contract (roughly 300K tokens) to find contradictory clauses. This task needs global reasoning across the entire document, which is exactly where long context shines and where its limitations become most visible.

The Context Window in March 2026

Context window sizes across major models span two orders of magnitude. Here is where things stand as of March 2026:

Model                    Max Input        Architecture Notes
------------------------------------------------------------------------------------------
Llama 4 Scout            10M              17B active, 16 experts (MoE); iRoPE; trained at 256K
Gemini 3 Pro             1M               64K output; matches Gemini 1.5 Pro NIAH scores
GPT-4.1                  1M               100% NIAH accuracy at 900K+; released Apr 2025
Llama 4 Maverick         1M               17B active, 128 experts (MoE)
Grok 4                   2M               Sliding-window memory; released Sep 2025
GPT-5.2                  400K             128K output; released Dec 2025
Claude Opus 4.6          200K (1M beta)   128K output; 1M requires Usage Tier 4+
DeepSeek V3              128K             671B total, 37B active (MoE); uses MLA
DeepSeek V4 (expected)   1M               Trillion params; Engram memory; launching Mar 2026

Key Insight: There is a critical gap between advertised and effective context length. Llama 4 Scout's 10M window was trained at only 256K tokens and relies on inference-time extrapolation via iRoPE to generalize. Independent benchmarks at the full 10M scale remain limited. The RULER benchmark showed that only about half of models claiming 32K+ context maintained satisfactory performance at that length on multi-hop reasoning tasks.

[Figure: Context window evolution from GPT-3 to Llama 4 Scout]

For our contract analysis scenario, this table drives a real decision. A 300K-token contract fits comfortably inside GPT-4.1 or Gemini 3 Pro, but would require the beta tier for Claude Opus 4.6 and would not fit in DeepSeek V3 at all. Choosing the right model starts with knowing the actual working capacity, not the marketing number.

The Quadratic Attention Bottleneck

The reason long context took years to achieve is rooted in the Transformer's self-attention mechanism. For every token, the model computes its relationship to every other token. To understand how LLMs actually work at a deeper level, you need to see the math behind this bottleneck:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • Q is the query matrix (what each token is looking for)
  • K is the key matrix (what each token advertises about itself)
  • V is the value matrix (the actual content each token carries)
  • d_k is the head dimension, used for scaling
  • QK^T produces an N × N attention matrix, where N is the sequence length

In Plain English: Every token in your 300K-token contract must compute a relevance score against every other token. That's 90 billion pairwise comparisons per layer, per attention head. Double the contract length and the work quadruples, not doubles.

The following demo shows how this scaling plays out in concrete memory terms:

code
Attention Matrix Memory (FP16, single head, single layer):
  Seq Length    Matrix Size  Relative Cost
--------------------------------------------
       4,096          32 MB             1x
      32,768           2 GB            64x
     131,072          32 GB         1,024x
   1,048,576         2.0 TB        65,536x

Softmax Attention Score Distribution (Query at position 64):
  Max attention weight:  0.0466
  Min attention weight:  0.0004
  Top-5 positions:       [83, 45, 3, 23, 37]
  Top-5 weight sum:      0.1994
  Top-10% weight sum:    0.3845

Notice something important in those attention scores: the top 10% of positions already capture 38% of the total weight, nearly four times their uniform share. Attention in real models is sparse. Most tokens contribute very little to any given query. This sparsity is what makes techniques like KV cache eviction, sliding window attention, and heavy-hitter retention possible.

Common Pitfall: People often assume 1M-token context means 1M tokens of equal importance. In practice, attention weight concentrates on a small fraction of positions. A 300K-token contract might have 95% of attention weight distributed across 30K tokens of key clauses, definitions, and cross-references.
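The memory column in the demo follows directly from quadratic scaling, and the arithmetic is easy to reproduce. The sketch below uses toy random logits for the sparsity part, so those numbers are illustrative rather than the exact figures shown above:

```python
import numpy as np

def attn_matrix_gib(n, bytes_per_el=2):
    # Full N x N score matrix, FP16 (2 bytes), single head, single layer
    return n * n * bytes_per_el / 2**30

for n in (4096, 32768, 131072, 1048576):
    rel = (n // 4096) ** 2                      # quadratic growth relative to 4K
    print(f"{n:>9,}  {attn_matrix_gib(n):>12,.3f} GiB  {rel:>6}x")

# Sparsity of softmax attention for a single query over 128 positions
rng = np.random.default_rng(0)
logits = 2.0 * rng.standard_normal(128)         # toy attention logits
w = np.exp(logits - logits.max())
w /= w.sum()
top_10pct = np.sort(w)[-13:].sum()              # weight held by the top ~10% of positions
print(f"Top-10% weight sum: {top_10pct:.4f}")
```

Doubling n quadruples the printed matrix size, which is the quadratic bottleneck in two lines of arithmetic.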

Flash Attention: Making Long Context Feasible

Flash Attention (Tri Dao et al., NeurIPS 2022) is the single most important enabler of long context. The core insight: standard attention is bottlenecked not by raw compute but by memory I/O. Specifically, the reads and writes between GPU high-bandwidth memory (HBM) and fast on-chip SRAM dominate wall-clock time.

Flash Attention never materializes the full N × N attention matrix. Instead, it uses tiling: the Q, K, and V matrices are divided into blocks that fit in SRAM. Each block computes a partial attention result, and an online softmax algorithm stitches partial results together using running statistics. The output is mathematically identical to standard attention (no approximation), but memory usage drops from O(N²) to O(N).

Version             Year   Key Innovation                                       Speedup
------------------------------------------------------------------------------------------------------
Flash Attention v1  2022   IO-aware tiling, fused CUDA kernel                   2-4x over standard attention
Flash Attention v2  2023   Warp-level partitioning, better sequence parallelism 2x over v1, 50-73% peak FLOPS on A100
Flash Attention v3  2024   Hopper GPU (H100): async WGMMA, FP8 support          1.5-2x over v2, 840 TFLOPS (85% utilization)

In Plain English: Before Flash Attention, loading our 300K-token contract would require materializing a 180 GB attention matrix per attention head, per layer. Flash Attention computes the exact same result while storing only a few megabytes of block-level intermediates in fast on-chip SRAM. This is what makes million-token context windows physically possible on existing hardware.

A Flash Attention v4 targeting Blackwell GPUs (B200) is already in development, written in CuTeDSL and promising further gains on next-generation hardware.
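The tiling-plus-online-softmax idea is compact enough to sketch in NumPy. This is a didactic single-head version (block size and shapes are arbitrary), not the fused CUDA kernel, but it produces the same result as standard attention up to floating-point rounding:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full n x n score matrix
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def flash_attention(Q, K, V, block=32):
    # Tiled attention with online softmax: only one (n, block) score tile
    # exists at a time, never the full n x n matrix.
    n, d = Q.shape
    O = np.zeros_like(Q)
    m = np.full(n, -np.inf)                  # running row-wise max of scores
    l = np.zeros(n)                          # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)            # scores against this KV tile only
        m_new = np.maximum(m, S.max(axis=-1))
        corr = np.exp(m - m_new)             # rescale previously accumulated results
        P = np.exp(S - m_new[:, None])
        l = l * corr + P.sum(axis=-1)
        O = O * corr[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
assert np.allclose(flash_attention(Q, K, V), naive_attention(Q, K, V))
```

The running max and denominator are exactly the "running statistics" described above: each new tile rescales the partial output so the final division yields a correct softmax.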

Positional Encodings That Scale

The original Transformer used fixed sinusoidal positional embeddings that broke at lengths beyond the training data. Understanding tokenization is essential here, because position encodings operate on the token sequence, and a "300K-token contract" means something very specific about how the text was split.

Modern long-context models use position encoding schemes designed for extrapolation:

RoPE (Rotary Position Embeddings), introduced by Su et al. (2021), encodes position by rotating query and key vectors in pairs of dimensions. The angle of rotation is proportional to the token's position, with different frequencies for different dimension pairs. Because the dot product of two rotated vectors depends only on their relative distance, RoPE captures both absolute and relative position elegantly. It is now the dominant position encoding, used by LLaMA, Mistral, Qwen, DeepSeek, and most open-source models.
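RoPE's relative-distance property can be verified numerically in a few lines. This is a minimal sketch (pairing adjacent dimensions, base 10000 as in the original paper), not any model's production implementation:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive dimension pairs of x by position-dependent angles
    d = x.shape[0]
    theta = pos * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
# The rotated dot product depends only on relative distance (10 in both cases):
a = rope(q, 100) @ rope(k, 90)
b = rope(q, 15) @ rope(k, 5)
assert np.isclose(a, b)
```

Because rotations compose, the query-key score at positions (100, 90) matches the score at (15, 5): only the gap of 10 matters.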

ALiBi (Attention with Linear Biases) (Press et al., ICLR 2022) takes a different approach entirely. Instead of modifying embeddings, it directly subtracts a distance-proportional penalty from attention scores. Models trained with ALiBi on 1024-token sequences can extrapolate to 2048+ at inference with minimal degradation.

YaRN (Yet another RoPE extensioN) (Peng et al., ICLR 2024) extends RoPE-based models to longer contexts with minimal fine-tuning. YaRN applies different interpolation strategies across RoPE dimensions using a ramp function, preserving high-frequency detail while extending low-frequency range. Qwen3 uses YaRN to go from 32K to 128K context.

For the most extreme extensions, LongRoPE (Microsoft Research, ICML 2024) uses evolutionary search to find optimal per-dimension rescaling factors, achieving context windows beyond 2 million tokens with a progressive fine-tuning strategy.

Llama 4 Scout introduced iRoPE (interleaved RoPE), which alternates RoPE layers with no-position-encoding layers. This hybrid design enables inference-time extrapolation from 256K training context to 10M tokens, though real-world performance at the extreme end remains less proven.

Pro Tip: If you're fine-tuning an open-source model for longer context, YaRN is the practical choice. It needs 10x fewer training tokens and 2.5x fewer steps than naive RoPE interpolation. Start with a 4x extension (32K to 128K) before attempting anything larger.

Taming the KV Cache

When a model generates text token by token, it caches the Key and Value projections of all previous tokens so it doesn't recompute them at each step. For long contexts, this KV cache dominates GPU memory:

\text{bytes per token} = 2 \times L \times H_{kv} \times d \times b

Where:

  • L is the number of Transformer layers
  • H_kv is the number of key-value heads
  • d is the dimension per head
  • b is bytes per element (2 for FP16)
  • The factor of 2 accounts for storing both K and V

In Plain English: For our 300K-token contract in Llama 3 70B (80 layers, 8 KV heads, head dimension 128), the FP16 KV cache alone would eat roughly 197 GB of GPU memory, well over two H100s' worth at 80 GB each. This is why KV cache optimization isn't a nice-to-have; it's a requirement for long context to work at all.

[Figure: KV cache memory growth with sequence length]

Three families of solutions have emerged:

Architectural Compression

Grouped Query Attention (GQA) shares KV heads across groups of query heads, reducing cache size up to 8x. DeepSeek takes this further with Multi-head Latent Attention (MLA), compressing all KV heads into a small latent vector for a 93.3% cache reduction while maintaining full multi-head expressiveness. K-EXAONE (LG AI Research, Jan 2026) uses a 3:1 hybrid of global and sliding-window attention with a 128-token window, cutting memory use by 70% compared to full global attention across all layers.
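The head-sharing idea behind GQA fits in a few lines. This is a toy NumPy sketch of the mechanism, not any model's actual implementation:

```python
import numpy as np

def gqa(Q, K, V):
    # Q: (n_q_heads, n, d); K, V: (n_kv_heads, n, d), n_q_heads divisible by n_kv_heads
    group = Q.shape[0] // K.shape[0]
    d = Q.shape[-1]
    out = np.empty_like(Q)
    for h in range(Q.shape[0]):
        kv = h // group                           # each query-head group shares one KV head
        S = Q[h] @ K[kv].T / np.sqrt(d)
        P = np.exp(S - S.max(-1, keepdims=True))
        out[h] = (P / P.sum(-1, keepdims=True)) @ V[kv]
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16, 32))              # 8 query heads
K = rng.standard_normal((2, 16, 32))              # only 2 KV heads need caching: 4x smaller cache
V = rng.standard_normal((2, 16, 32))
out = gqa(Q, K, V)
```

The cache shrinks by exactly the ratio of query heads to KV heads, because only K and V are cached during generation.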

PagedAttention

PagedAttention (vLLM) treats GPU memory like virtual memory in an operating system. The KV cache gets divided into non-contiguous pages, eliminating fragmentation and enabling copy-on-write sharing across requests with common prefixes. This alone provides 2-4x throughput improvement in serving scenarios.
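A toy page-table allocator conveys the core idea. The names and sizes here are illustrative; vLLM's real implementation manages GPU memory blocks and adds copy-on-write sharing:

```python
class PagedKVCache:
    """Toy page-table allocator in the spirit of vLLM's PagedAttention."""

    def __init__(self, n_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(n_pages))   # physical pages not yet assigned
        self.page_tables = {}                    # request id -> list of physical pages
        self.lengths = {}                        # request id -> tokens stored so far

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % self.page_size == 0:              # current page full (or first token)
            self.page_tables.setdefault(req, []).append(self.free_pages.pop())
        self.lengths[req] = n + 1

    def physical_slot(self, req, token_idx):
        # Translate a logical token index into (physical page, offset within page)
        page = self.page_tables[req][token_idx // self.page_size]
        return page, token_idx % self.page_size

cache = PagedKVCache(n_pages=8, page_size=4)
for _ in range(6):                               # 6 tokens occupy exactly 2 pages
    cache.append_token("contract-qa")
print(cache.page_tables["contract-qa"], cache.physical_slot("contract-qa", 5))
```

Because pages are allocated on demand and need not be contiguous, no memory is wasted pre-reserving a request's maximum length, which is where the fragmentation savings come from.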

Quantization and Eviction

Production systems compress the KV cache to 4-bit or even 2-bit precision with minimal quality loss. For extreme lengths, H2O (Heavy-Hitter Oracle) keeps only the most-attended "heavy hitter" tokens plus a sliding window of recent tokens, achieving up to 29x throughput improvement with 20% heavy-hitter retention.
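The heavy-hitter eviction policy can be sketched as a keep-mask over cached positions. This is a simplification of the published method, which tracks attention scores per head and per layer:

```python
import numpy as np

def h2o_keep_mask(cum_attn, recent=32, heavy_frac=0.2):
    # cum_attn: cumulative attention weight each cached token has received so far
    n = cum_attn.shape[0]
    keep = np.zeros(n, dtype=bool)
    keep[-recent:] = True                            # always keep the recent window
    heavy = np.argsort(cum_attn)[-int(heavy_frac * n):]
    keep[heavy] = True                               # plus the most-attended "heavy hitters"
    return keep

rng = np.random.default_rng(0)
scores = rng.random(1000)                            # toy cumulative attention for 1000 cached tokens
mask = h2o_keep_mask(scores, recent=64, heavy_frac=0.2)
print(f"kept {mask.sum()} of {mask.size} tokens")
```

Everything outside the mask is evicted from the cache, trading a small recall risk for the large throughput gains reported above.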

Pro Tip: When deploying open-source long-context models, 4-bit KV cache quantization can cut memory usage by 75% with negligible retrieval accuracy loss. This is often the difference between needing a multi-GPU setup and fitting on a single card.

The "Lost in the Middle" Problem

A well-documented limitation of long context was identified by Liu et al. (TACL 2024). Models recall information placed at the beginning or end of the context far better than information buried in the middle, creating a U-shaped performance curve across document position.

code
Simulated Retrieval Accuracy by Document Position
(U-shaped 'Lost in the Middle' effect)

  Position       Region   Accuracy
-----------------------------------
         1    Beginning      93.0%
         2    Beginning      91.0%
         3    Beginning      90.7%
         6       Middle      78.0%
        10       Middle      68.3%
        11       Middle      66.2%
        15       Middle      75.0%
        18          End      90.0%
        19          End      89.5%
        20          End      89.2%

Average accuracy (first 20%):  91.0%
Average accuracy (middle 60%): 73.2%
Average accuracy (last 20%):   88.2%

The middle-position accuracy drops nearly 18 percentage points compared to the beginning. For our contract scenario, a contradictory clause buried on page 200 of 400 sits right in the danger zone.

This effect has been significantly mitigated in 2025-2026 models through attention calibration and training improvements. Never Lost in the Middle introduced position-agnostic training, and Found in the Middle proposed plug-and-play positional encoding fixes. Still, achieving truly position-uniform retrieval across very long contexts remains an open problem.

Practical workarounds for production:

  1. Document labeling with XML tags. Wrap distinct sections in indexed tags so the model can reference them by ID rather than by position.
  2. Strategic ordering. Place the most critical information at the beginning and end, where recall is strongest.
  3. Chain-of-thought anchoring. Ask the model to first list relevant section IDs, then answer. This forces a full-context scan before responding.
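Workarounds 1 and 3 combine naturally at prompt-construction time. A minimal sketch (tag names and instruction wording are illustrative, not a prescribed format):

```python
def build_prompt(sections, question):
    # Wrap each section in an indexed tag so the model can cite sections by ID
    tagged = "\n".join(
        f'<section id="{i}">\n{text}\n</section>'
        for i, text in enumerate(sections, start=1)
    )
    return (
        f"{tagged}\n\n"
        "First list the IDs of every section relevant to the question, "
        "then answer using only those sections.\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    ["Clause 14.2: Termination requires 90 days notice.",
     "Clause 22.1: Either party may terminate with 30 days notice."],
    "Do any clauses contradict each other?",
)
```

Asking for section IDs first forces the model to scan the whole context before committing to an answer, which pushes against the middle-position blind spot.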

Benchmarking Long Context: Beyond Needle in a Haystack

The Needle in a Haystack (NIAH) test inserts a specific fact at varying depths within a long document and asks the model to retrieve it. While useful as a sanity check, NIAH is too easy for modern models. GPT-4.1 achieves 100% accuracy throughout its full 1M-token context. The Gemini 1.5 Pro technical report demonstrated 99.7% recall at 1M tokens, a benchmark Gemini 3 Pro matches or exceeds.

The RULER benchmark (Hsieh et al., COLM 2024) provides a much harder test. RULER extends NIAH with four task categories: multi-needle retrieval, multi-hop tracing, aggregation, and question answering. Despite near-perfect vanilla NIAH scores, almost all models show substantial performance drops on RULER as context grows. RULER is now the standard for evaluating whether a model's context window is genuinely useful or merely nominal.

Benchmark            Task Type                            Difficulty   What It Tests
---------------------------------------------------------------------------------------------
NIAH                 Single fact retrieval                Easy         Can the model find one fact?
RULER                Multi-task (4 categories)            Hard         Does the model reason across context?
LongBench v2         503 questions, contexts to 2M words  Hard         Real-world long-context tasks
HashHop (Magic.dev)  Random hash pair retrieval           Very hard    True memorization, no semantic shortcuts

Other notable benchmarks include LongBench (21 tasks across 6 categories, bilingual) and LongBench v2 (503 challenging questions with contexts up to 2M words). Magic.dev's HashHop, used to evaluate their 100M-token LTM-2-Mini, eliminates semantic cues entirely by using random incompressible hash pairs, ensuring the model genuinely stores and recalls rather than guessing from context.

Key Insight: A model scoring 99% on NIAH might score 60% on RULER at the same context length. Always benchmark on tasks that match your actual use case complexity. For our contract analysis scenario, multi-hop reasoning (clause A references clause B which modifies clause C) is the real test, and that's exactly where weaker models fail.

Long Context Versus RAG

A common misconception is that massive context windows make RAG obsolete. The reality is more nuanced than that.

Dimension          Long Context                     RAG
------------------------------------------------------------------------------------
Reasoning scope    Global (sees all connections)    Local (only retrieved chunks)
Cost per query     High (process all tokens)        Low (only retrieved chunks)
Latency            Higher for initial processing    Low (millisecond retrieval)
Data freshness     Static per request               Dynamic (index updates cheaply)
Retrieval recall   High, but degrades with length   Depends on embedding quality

Use long context when the task requires synthesizing information across the entire document. Finding contradictions across a 400-page contract, understanding global code dependencies, summarizing themes across 50 emails. RAG fails here because vector search may miss subtle cross-document connections that only appear when you see the full picture.

Use RAG when you have a large, frequently updated knowledge base and need specific factual answers. Processing 1 million tokens costs $2-10 per query with frontier models; RAG processes only the relevant chunks at a fraction of the cost. For our contract analysis scenario, if the legal team only needs to look up a specific clause definition, RAG is faster and cheaper. If they need to find every instance where clause 14.2 conflicts with clauses elsewhere in the document, long context wins.

Research confirms this tradeoff. Li et al. (2025) found that long context generally outperforms RAG on Wikipedia-based QA, but RAG has advantages for dialogue-based queries and cost-sensitive applications.

[Figure: RAG versus long context decision framework]

Pro Tip: The best production systems use both. Load the full document into long context for global reasoning tasks, then switch to RAG for repeated factual lookups against the same corpus. This hybrid approach cuts costs by 5-10x on query-heavy workloads while preserving the ability to reason globally when needed.

Prompt Caching: The Economics Lever

Long-context costs drop dramatically with prompt caching, which stores the computed KV cache from a prompt prefix so subsequent queries reuse it instead of reprocessing from scratch:

Provider        Cache Write Cost                 Cache Read Discount            Min Cache Size
------------------------------------------------------------------------------------------------
Anthropic       1.25x base (5-min TTL)           90% off input cost             1024-4096 tokens
Google Gemini   Storage fee ($1-4.50/MTok/hr)    75-90% off (model dependent)   32K tokens (explicit)
OpenAI          Free (automatic)                 50% off input cost             1024 tokens

Consider loading our 300K-token contract for interactive querying with Claude Opus 4.6 ($5/MTok standard input rate):

  • Without caching: Each query processes the full 300K prefix = $1.50 per query
  • With caching: First query pays 1.25x ($1.88 cache write). Subsequent queries pay 0.1x = $0.15 per query, a 10x reduction

Over 50 queries against the same contract, caching saves roughly $66 compared to reprocessing every time. Anthropic's cache has a 5-minute TTL, so batch your queries. Google's implicit caching (enabled by default since May 2025 on Gemini 2.5+ models) provides automatic 75% savings with no code changes.
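The same arithmetic works for any provider once you know the write and read multipliers. This sketch assumes Anthropic-style defaults (1.25x write, 0.10x read); swap in your provider's numbers:

```python
def query_costs(n_queries, prefix_tokens, price_per_mtok,
                write_mult=1.25, read_mult=0.10):
    # One cache write on the first query, discounted reads on the rest
    base = prefix_tokens / 1e6 * price_per_mtok
    uncached = n_queries * base
    cached = write_mult * base + (n_queries - 1) * read_mult * base
    return uncached, cached

uncached, cached = query_costs(50, 300_000, 5.0)   # 50 queries over the 300K contract
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")
```

The break-even point comes after the second query: one write plus one read already costs less than two uncached passes whenever the read discount exceeds the write premium.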

Pro Tip: Structure prompts with stable content first (system prompt, tool definitions, document context) and variable content last (user query). Caching matches on exact prefixes, so placing the changing part at the end maximizes cache hits.

Frontier Research: What Comes Next

Several research directions are pushing the boundaries of long context beyond brute-force attention scaling:

Titans + MIRAS (Google, Dec 2025) introduces a deep neural network as a long-term memory module. Unlike traditional RNNs that compress state into small vectors, Titans uses a multi-layer perceptron that updates its weights while reading input, combining RNN-speed inference with transformer-accuracy reasoning. The MIRAS framework generalizes this into new attention-free architectures (Moneta, Yaad, Memora) that match or surpass linear RNNs on long-context tasks.

InftyThink (Zhejiang University, Mar 2026) transforms monolithic reasoning into an iterative process with intermediate summarization. The model generates a partial reasoning chain, summarizes its progress, and builds upon those summaries in subsequent iterations. This creates a sawtooth memory pattern that enables unbounded reasoning depth while keeping computational costs bounded. Experiments show 3-11% improvements on MATH500 and GPQA benchmarks.

Chain of Agents (Google, NeurIPS 2024) takes a multi-agent approach: multiple worker agents each process a segment of the input, then a manager agent synthesizes their contributions. CoA achieves up to 10% improvement over both RAG and full-context baselines on summarization, QA, and code completion tasks.

Ring Attention (Liu et al., 2023) distributes sequences across multiple GPUs arranged in a ring. Each device holds one block and computes local attention while K and V blocks circulate, with communication fully overlapped by computation. Context length scales linearly with device count, zero approximation error.

When to Use Long Context (and When Not To)

Long context is powerful, but it's not always the right tool:

Choose long context when:

  • The task requires cross-document reasoning (contradictions, themes, dependencies)
  • You need the model to see the full picture before answering
  • Document structure matters (code repos, legal contracts, medical records)
  • Query volume is low enough that per-query cost is acceptable

Choose RAG (or hybrid) when:

  • Your knowledge base exceeds 1M tokens or updates frequently
  • You need sub-second latency on factual lookups
  • Budget constraints make $2-10 per query prohibitive
  • The task is point-lookup, not synthesis

Avoid long context entirely when:

  • Your task doesn't benefit from more context (most classification tasks plateau at 4K tokens)
  • You're just padding context with irrelevant documents hoping the model "gets smarter"
  • You haven't validated retrieval quality at your target length with RULER or similar benchmarks

Conclusion

Long context models have moved from a research curiosity to production infrastructure, but the engineering reality is more complex than the headline token counts suggest. A model that accepts 1M tokens is not the same as a model that reasons well over 1M tokens. The gap between advertised and effective context, exposed by RULER and similar benchmarks, means you must validate retrieval quality at your actual working lengths before committing to a long-context architecture.

The stack that enables long context, Flash Attention for IO-efficient computation, RoPE and its extensions for scalable positioning, GQA and MLA for cache compression, prompt caching for cost reduction, represents some of the most elegant systems engineering in modern AI. Understanding these components turns long context from a black-box feature into a tool you can reason about, optimize, and deploy with confidence.

For the fundamentals of how these models process language internally, see How Large Language Models Actually Work. To understand the token vocabulary that defines what "1 million tokens" actually contains, read Tokenization: Why It Matters More Than You Think. And for when long context isn't the right tool and retrieval is a better fit, see RAG: Making LLMs Smarter with Your Data.

Interview Questions

Q: What is the difference between advertised context window and effective context length?

The advertised context window is the maximum number of tokens a model can accept as input. Effective context length is the range within which the model maintains strong retrieval and reasoning performance. RULER benchmark results show effective length is typically 50-65% of advertised capacity, because models degrade on multi-hop reasoning and aggregation tasks well before hitting their stated limit.

Q: Explain how Flash Attention reduces memory from O(N^2) to O(N) without approximation.

Flash Attention tiles the Q, K, and V matrices into blocks that fit in GPU SRAM and computes partial attention results per block. An online softmax algorithm maintains running statistics (max and sum of exponentials) to stitch block results together, producing output mathematically identical to standard attention. The key saving is that the full N-by-N attention matrix never gets materialized in HBM.

Q: Your team wants to process a 500K-token codebase for bug detection. Would you use long context or RAG?

Long context is the better fit here. Bug detection requires understanding cross-file dependencies, import chains, and how functions interact across modules. RAG would retrieve individual code chunks in isolation, missing the global dependency graph. I'd use a model like GPT-4.1 or Gemini 3 Pro with their 1M-token window, and validate with RULER-style benchmarks on code-specific tasks to confirm the model actually reasons across the full context at that scale.

Q: What causes the "Lost in the Middle" problem, and how do you mitigate it in production?

Models trained primarily with causal attention develop stronger attention patterns for tokens at the beginning (primacy bias) and end (recency bias) of the context. Information in the middle receives weaker attention weights. Production mitigations include wrapping documents in indexed XML tags so the model references by ID, placing critical information at the start and end, and using chain-of-thought prompting that forces a full context scan before answering.

Q: How does Grouped Query Attention (GQA) reduce KV cache size?

Standard multi-head attention has separate K and V projections for every attention head. GQA shares a single set of K and V heads across a group of query heads. With 8 query heads sharing 1 KV head (8:1 ratio), the KV cache shrinks by 8x. The quality tradeoff is minimal because most of the model's expressive power comes from the query projections, not the key-value pairs.

Q: When would prompt caching fail to provide cost savings?

Prompt caching matches on exact token prefixes. If every query changes the system prompt, document ordering, or includes different context, the cache never hits. It also fails when query volume is too low to amortize the cache write cost, or when the time between queries exceeds the TTL (5 minutes for Anthropic). The worst case is single-shot queries against unique documents, where you pay the 1.25x cache write premium with no subsequent reads.

Q: Compare RoPE and ALiBi for position encoding. When would you prefer each?

RoPE encodes position by rotating query-key vectors, making the dot product depend on relative distance. ALiBi adds a linear distance penalty directly to attention scores. RoPE is more expressive and dominates in practice (LLaMA, Mistral, DeepSeek all use it), but ALiBi extrapolates better to unseen lengths without fine-tuning. If you're deploying a pre-trained model at exactly its trained length, RoPE is standard. If you need to push beyond training length without any fine-tuning, ALiBi offers more graceful degradation.

Q: A model claims 10M token context but was trained at 256K. Should you trust it for 1M-token tasks?

Be skeptical. Training at 256K and extrapolating to 10M via techniques like iRoPE means the model has never seen attention patterns at 1M scale during training. It may handle simple retrieval (NIAH-style) at 1M, but complex multi-hop reasoning or aggregation tasks often degrade significantly beyond the training length. Always run your specific task at the target length and measure before committing to production.