In September 2024, OpenAI's o1-preview scored 83% on the American Invitational Mathematics Examination (AIME 2024). Seven months later, o3 scored 96.7% on the same exam — matching the level of gold-medal International Math Olympiad competitors. The model didn't get bigger between those releases. It got better at thinking. Where standard LLMs generate answers in a single forward pass — predicting the most likely next token — reasoning models pause, plan, explore multiple solution paths, critique their own logic, and backtrack when they hit a dead end. They have traded speed for deliberation, and the results have been extraordinary.
This shift represents the most significant change in how LLMs operate since the Transformer architecture itself. Understanding how reasoning models work — the chain-of-thought mechanisms, the reinforcement learning training, the test-time compute scaling laws — is now essential knowledge for anyone building with AI.
The shift from prediction to deliberation
A reasoning model is an LLM designed to generate intermediate computational steps — often called a "chain of thought" — before producing a final answer. Unlike standard LLMs that optimize for immediate next-token probability, reasoning models optimize for the correctness of a verified final answer, using reinforcement learning to internalize the process of self-correction.
Nobel laureate Daniel Kahneman described two modes of human thought: System 1 (fast, automatic, intuitive — like completing "Bread and...") and System 2 (slow, deliberative, logical — like calculating in your head). Standard LLMs like GPT-4o and Claude Sonnet 4.5 in direct mode are System 1 engines. Reasoning models like o3, DeepSeek-R1, and Claude Opus 4.6 in adaptive thinking mode simulate System 2 by generating a hidden (or partially visible) thinking process before presenting the final result.
In Plain English: A standard LLM reads your question and immediately starts writing the answer. A reasoning model reads your question, opens an internal scratchpad, debates with itself for hundreds or thousands of tokens, checks its own work, and only then writes the answer you see. The scratchpad is where the "intelligence" happens.
The core difference lies in the objective function. A standard LLM asks, "What is the most likely next word?" A reasoning model asks, "What is the next logical step to maximize the probability of a correct outcome?"
From prompting to training: the chain-of-thought revolution
The foundation of every modern reasoning model traces back to a single insight: LLMs perform dramatically better on multi-step problems when they show their work.
Chain-of-Thought prompting (Wei et al., Google, January 2022). The original CoT paper demonstrated that including a few examples of step-by-step reasoning in the prompt caused LLMs to generate their own intermediate steps, dramatically improving accuracy on math, logic, and commonsense tasks. In standard generation, the model maps input directly to output: $P(\text{answer} \mid \text{question})$. With chain-of-thought, the model generates reasoning steps before the output, marginalizing over possible reasoning paths:
$$P(\text{answer} \mid \text{question}) = \sum_{\text{path}} P(\text{answer} \mid \text{path}, \text{question}) \cdot P(\text{path} \mid \text{question})$$
In Plain English: The probability of the correct answer depends on the probability of generating a good reasoning path first. Instead of jumping from question to answer, the model explores various reasoning paths. If a path is logically sound, the final answer is far more likely to be correct. This is why "showing your work" makes LLMs smarter — it's not just cosmetic.
Zero-shot CoT (Kojima et al., May 2022) eliminated the need for crafted examples entirely. Simply appending "Let's think step by step" to any prompt boosted reasoning performance significantly — proving that the capability was latent in the model, waiting to be activated.
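To see how small the change is, here is a minimal sketch using the OpenAI Python SDK (the model name is illustrative; any instruction-tuned chat model responds to the same trigger phrase):

```python
# Zero-shot CoT is literally a one-line change to the prompt.
# Assumes the OpenAI Python SDK with OPENAI_API_KEY set; the model name is
# illustrative -- any instruction-tuned chat model shows the same effect.
from openai import OpenAI

client = OpenAI()
question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

prompts = {
    "direct": question,                                           # System 1: answer immediately
    "zero-shot CoT": question + "\n\nLet's think step by step.",  # System 2: reason first
}

for label, prompt in prompts.items():
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```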
Self-Consistency (Wang et al., March 2022, ICLR 2023) pushed accuracy further by sampling multiple diverse reasoning paths and taking the majority vote. On the GSM8K math benchmark, self-consistency improved accuracy by 17.9 percentage points.
Tree of Thoughts (Yao et al., Princeton/Google DeepMind, NeurIPS 2023) generalized chain-of-thought to tree-structured reasoning with strategic lookahead and backtracking — allowing models to evaluate multiple branches and prune dead ends, similar to how AlphaGo searched game trees.
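The control loop is easier to grasp in code. The sketch below is a schematic version of tree search over thoughts, not the paper's implementation: the propose_thoughts and score_thought stubs stand in for the LLM prompts that generate and evaluate candidate reasoning steps.

```python
# Schematic sketch of a Tree-of-Thoughts breadth-first search loop.
# propose_thoughts() and score_thought() stand in for LLM calls (the generator
# and evaluator prompts); here they are toy stubs so the loop runs end to end.
from typing import List, Tuple

def propose_thoughts(state: str, k: int = 3) -> List[str]:
    """Stand-in for the LLM proposing k candidate next reasoning steps."""
    return [f"{state} -> step{i}" for i in range(k)]

def score_thought(state: str) -> float:
    """Stand-in for the LLM rating a partial solution (0 = dead end, 1 = promising)."""
    return 1.0 / (1.0 + len(state))   # toy heuristic

def tree_of_thoughts(problem: str, depth: int = 3, beam_width: int = 2) -> str:
    frontier: List[Tuple[float, str]] = [(score_thought(problem), problem)]
    for _ in range(depth):
        candidates = []
        for _, state in frontier:
            for thought in propose_thoughts(state):
                candidates.append((score_thought(thought), thought))
        # Prune: keep only the most promising branches. Abandoning low-scoring
        # branches and continuing elsewhere is the "backtracking" behavior.
        frontier = sorted(candidates, reverse=True)[:beam_width]
    return frontier[0][1]

print(tree_of_thoughts("Solve: make 24 from 4, 9, 10, 13"))
```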
The critical shift came in late 2024: instead of prompting models to reason, researchers began training them to reason natively through reinforcement learning. OpenAI's o1-preview (September 2024) was the first commercial model with internalized chain-of-thought, and DeepSeek-R1 (January 2025) proved it could be done in the open.
The mechanics of test-time compute
For a decade, the "scaling laws" of AI referred to training compute — make the model bigger, train on more data, get better results. Around 2024, this approach hit diminishing returns as high-quality training data became scarce. Snell et al. ("Scaling LLM Test-Time Compute Optimally," August 2024) demonstrated a different scaling axis: performance improves logarithmically with the amount of compute spent at inference time.
In Plain English: Just as a human gives a better answer with 60 seconds to think rather than 5, a reasoning model gets smarter with more "thinking tokens." This relationship is logarithmic — the first 1,000 thinking tokens help enormously, but going from 10,000 to 11,000 helps less. The implication is profound: a smaller model that thinks for 10 seconds can outperform a massive model that answers instantly.
This created a new design trade-off. Previously, intelligence was determined entirely by model size. Now intelligence is a function of both model capability and inference budget. A developer can dial a "thinking knob" — more thinking tokens buys more accuracy at the cost of latency and compute.
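To make the shape of that trade-off concrete, here is a toy illustration; the constants are invented for demonstration and not fitted to any real benchmark.

```python
import math

# Toy illustration of logarithmic test-time scaling. The base and slope are
# made-up numbers chosen only to show the diminishing-returns curve.
def toy_accuracy(thinking_tokens: int, base: float = 0.40, slope: float = 0.04) -> float:
    return min(0.99, base + slope * math.log2(1 + thinking_tokens))

for budget in [0, 1_000, 10_000, 11_000, 100_000]:
    print(f"{budget:>7} thinking tokens -> ~{toy_accuracy(budget):.0%} accuracy (toy model)")
```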
How reasoning models are trained
The architecture of a reasoning system typically involves three components: a Generator (the LLM proposing the next reasoning step), a Verifier (a reward model evaluating whether each step is correct), and a Search Strategy (the algorithm exploring the tree of possible thoughts).
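A minimal sketch of how those three components fit together is below. Both stubs stand in for real models; only the wiring between generator, verifier, and search strategy is the point.

```python
# The three components of a reasoning system as plain callables. Both demo
# functions are stubs standing in for real models.
from typing import Callable, List

Generator = Callable[[str], List[str]]      # partial solution -> candidate next steps
Verifier = Callable[[str], float]           # partial solution -> estimated correctness

def greedy_step_search(problem: str, generate: Generator, verify: Verifier,
                       max_steps: int = 5) -> str:
    """Search strategy: at each step, keep the candidate the verifier scores highest."""
    solution = problem
    for _ in range(max_steps):
        candidates = generate(solution)
        if not candidates:
            break
        solution = max(candidates, key=verify)   # the verifier guides the search
    return solution

# Toy stand-ins so the sketch runs.
demo_generate: Generator = lambda s: [s + f" | step{i}" for i in range(3)]
demo_verify: Verifier = lambda s: -abs(len(s) % 7 - 3)   # arbitrary scoring stub
print(greedy_step_search("Prove: the sum of two even numbers is even.",
                         demo_generate, demo_verify))
```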
Process reward models vs. outcome reward models
Traditional RLHF uses Outcome Reward Models (ORMs) — one binary signal at the end: "Was the final answer correct?" This is sparse feedback. In a 50-step math proof, the model gets a single thumbs-up or thumbs-down, with no information about where the reasoning went wrong.
Process Reward Models (PRMs) evaluate every step. Lightman et al. ("Let's Verify Step by Step," OpenAI, May 2023) demonstrated that process supervision significantly outperforms outcome supervision, with their PRM solving 78.2% of a representative MATH subset. They released PRM800K — 800,000 step-level human feedback labels — which became foundational for the reasoning models that followed.
In Plain English: The total reward sums up the reward for each individual step, discounted over time. A step that's correct early matters more than a correct step late in the chain. This forces the model to get every step right, not just gamble on the final answer.
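As a sketch of one common formulation (actual PRM reward designs vary), a discounted sum over step-level rewards looks like this:

```python
# Sketch of a discounted process reward (one common formulation; actual PRM
# reward designs vary): every step earns its own reward, and earlier steps
# are weighted more heavily via the discount factor gamma.
def process_reward(step_rewards: list[float], gamma: float = 0.9) -> float:
    return sum(gamma ** t * r for t, r in enumerate(step_rewards))

mistake_midway = [1.0, 1.0, 0.0, 1.0, 1.0]    # error at step 3 of 5
mistake_at_start = [0.0, 1.0, 1.0, 1.0, 1.0]  # error at step 1 of 5
print(f"{process_reward(mistake_midway):.2f}")    # 3.29
print(f"{process_reward(mistake_at_start):.2f}")  # 3.10 -- the earlier the mistake, the larger the penalty
```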
Key Insight: Recent findings at ICLR 2025 complicate this picture — discriminative ORMs can match discriminative PRMs across 14 diverse domains. However, process advantage verifiers remain 1.5-5x more compute-efficient and enable 6x gains in sample efficiency for online RL. The debate is shifting from "PRM vs. ORM" to "when and how to combine both."
DeepSeek-R1: reasoning from reinforcement learning
DeepSeek-R1 (released January 20, 2025, published in Nature) provided the first fully open account of how to train a reasoning model. Its training used Group Relative Policy Optimization (GRPO) — an RL algorithm that eliminates the need for a separate critic/value model by using normalized rewards across different generations of the same prompt as the advantage baseline. GRPO is more computationally efficient than PPO (Proximal Policy Optimization) and operates at the sequence level rather than the token level.
The most remarkable finding came from DeepSeek-R1-Zero, trained via pure RL without any supervised fine-tuning. The model spontaneously developed self-reflection, self-verification, extended chain-of-thought, and what the authors described as an "aha moment" — recognizing and correcting its own mistakes mid-reasoning. R1-Zero used Reinforcement Learning with Verifiable Rewards (RLVR): accuracy rewards from ground-truth checkers (compilers, math solvers) and format rewards for output structure. No neural reward models were used — a deliberate choice to avoid reward hacking at scale.
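The sketch below illustrates both ideas with toy inputs: a verifiable reward built from a ground-truth check plus a format check, and a GRPO-style advantage computed by normalizing rewards within a group of completions for the same prompt. It is a schematic illustration, not DeepSeek's training code.

```python
# (1) RLVR-style verifiable reward: accuracy from a ground-truth check plus a
#     small format reward for wrapping the reasoning in <think> tags.
# (2) GRPO-style advantage: rewards for a group of completions to the SAME
#     prompt are normalized against the group mean and std -- no critic model.
import re
import statistics

def verifiable_reward(completion: str, ground_truth: str) -> float:
    format_ok = bool(re.search(r"<think>.*</think>", completion, re.S))  # format reward
    answer = completion.rsplit("</think>", 1)[-1].strip()
    correct = answer == ground_truth                                     # accuracy reward
    return 1.0 * correct + 0.1 * format_ok

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in group_rewards]

# Four sampled completions for the same prompt ("What is 17 * 24?").
group = [
    "<think>17*24 = 17*20 + 17*4 = 340 + 68</think> 408",
    "<think>17*24 is about 400</think> 398",
    "<think>24*17 = 24*10 + 24*7 = 240 + 168</think> 408",
    "408",                                   # correct but skips the required format
]
rewards = [verifiable_reward(c, "408") for c in group]
print(rewards)                    # [1.1, 0.1, 1.1, 1.0]
print(grpo_advantages(rewards))   # above-average completions get positive advantage
```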
The full DeepSeek-R1 pipeline added four stages: (1) cold-start SFT with curated chain-of-thought examples, (2) GRPO training with verifiable rewards, (3) rejection sampling to generate high-quality reasoning data, and (4) final SFT + RL rounds for alignment.
The reasoning model landscape in 2026
Every major AI lab now ships reasoning models. The table below captures the current frontier as of February 2026.
| Model | Lab | Released | Key Feature | Pricing (per M tokens) |
|---|---|---|---|---|
| o3 | OpenAI | Apr 2025 | Hidden CoT, high compute mode | $8 out |
| o4-mini | OpenAI | Apr 2025 | Cost-efficient reasoning | $4.40 out |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B MoE, MIT license, open weights | Open-source |
| DeepSeek-R1-0528 | DeepSeek | May 2025 | Major R1 update | Open-source |
| Claude Opus 4.6 | Anthropic | Feb 2026 | Adaptive thinking, 1M context | $75 out |
| Gemini 2.5 Pro | Google | Mar 2025 | Native thinking, Deep Think mode | $10 out |
| QwQ-32B | Alibaba | Mar 2025 | 32B params matches 671B R1, Apache 2.0 | Open-source |
| Grok 3 Think | xAI | Feb 2025 | 10x compute vs. Grok 2 | Premium+ subscription |
| Qwen3-235B | Alibaba | Apr 2025 | Unified thinking/non-thinking modes | Open-source |
Benchmark performance across the frontier
Reasoning models are evaluated on progressively harder benchmarks. The numbers below represent the state of the art as of February 2026.
Mathematical reasoning
| Model | AIME 2024 | AIME 2025 | MATH-500 |
|---|---|---|---|
| o3 | 96.7% | 88.9% | — |
| o4-mini | 91.7% | 92.7% | — |
| DeepSeek-R1 | 79.8% | — | 97.3% |
| DeepSeek-R1-0528 | 91.4% | 87.5% | — |
| Gemini 2.5 Pro | 92.0% | 86.7% | — |
| Claude Sonnet 4.5 | — | 87.0% | — |
| Grok 3 Think (cons@64) | — | 93.3% | — |
Graduate-level science (GPQA Diamond) and code
| Model | GPQA Diamond | SWE-bench Verified | Codeforces Elo |
|---|---|---|---|
| Gemini 2.5 Pro | 84.0% | — | — |
| o3 | 83.3-87.7% | 69.1% | 2706 |
| o4-mini | 81.4% | 68.1% | 2719 |
| Claude Opus 4.6 (Thinking) | — | 80.8% | — |
| Claude Sonnet 4.5 | 83.4% | — | — |
| Grok 3 | 84.6% | — | — |
| DeepSeek-R1 | 71.5% | ~49% | 2029 |
The hardest benchmarks
Humanity's Last Exam (2,500 expert-level questions across 100+ subjects, created by Center for AI Safety and Scale AI): Gemini 3 Pro Preview leads at 37.5%, followed by Claude Opus 4.6 at 36.7% — up from single digits in early 2025. Human experts score ~90%.
FrontierMath (350 research-level math problems by Epoch AI): o4-mini leads at 17%, o3 at 10%. Most non-reasoning models score ~0%.
GPQA Diamond (graduate-level science): Gemini 3 Pro Deep Think leads at 93.8%, a significant jump over the next tier of models (Grok 3 at 84.6%, Gemini 2.5 Pro at 84.0%, Claude Sonnet 4.5 at 83.4%).
ARC-AGI-2 (novel visual reasoning): o3 scores just 4% at low compute. Gemini 3 Deep Think reaches 45.1%. Humans score 100%.
Distilling reasoning into smaller models
One of DeepSeek-R1's most impactful contributions was demonstrating that reasoning ability can be distilled from a 671B parameter teacher model into dramatically smaller student models. All six distilled models were released under MIT license on January 20, 2025.
| Distilled Model | Base | Params | AIME 2024 | MATH-500 | GPQA Diamond |
|---|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 1.5B | — | 83.9% | — |
| R1-Distill-Qwen-7B | Qwen2.5-7B | 7B | 55.5% | 92.8% | 49.1% |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | 69.7% | 93.9% | 59.1% |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | 72.6% | 94.3% | 62.1% |
| R1-Distill-Llama-70B | Llama-3.3-70B | 70B | 70.0% | 94.5% | 65.2% |
The results are striking: R1-Distill-Qwen-7B — a 7 billion parameter model — surpassed QwQ-32B-Preview. R1-Distill-Qwen-32B outperformed OpenAI's o1-mini across multiple benchmarks. This catalyzed an explosion of open-source reasoning models: Sky-T1-32B-Preview (January 2025, trained for under $450, matching o1-preview), the OpenThinker series (OpenThoughts-114k dataset, progressing from 7B to 32B models), and QwQ-32B (March 2025, matching DeepSeek-R1 at 32B parameters under Apache 2.0 license).
The implication: reasoning-capable models now run on consumer hardware. A quantized R1-Distill-Qwen-14B fits in 16GB of VRAM and outperforms models 50x its size from a year earlier.
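Here is a sketch of what running one of these locally looks like, using Hugging Face transformers with bitsandbytes 4-bit quantization. The repository id is the one DeepSeek published on the Hub; a CUDA GPU is required, and exact memory use depends on your quantization settings.

```python
# Sketch of running a distilled reasoning model locally with Hugging Face
# transformers and bitsandbytes 4-bit quantization.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

prompt = "How many positive integers less than 100 are divisible by 3 or 5?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Reasoning models need a generous output budget: the <think> chain counts
# against max_new_tokens just like the visible answer does.
output = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```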
Controlling the thinking budget in production
Modern inference APIs expose direct control over how much a model reasons. This is the "thinking budget" — the maximum number of tokens the model can spend on its internal chain-of-thought before it must produce a visible answer.
OpenAI uses a reasoning_effort parameter with three levels (low, medium, high) for o3, o4-mini, and o3-mini. Moving from low to high typically raises accuracy by 10-30% on hard benchmarks.
Anthropic introduced budget_tokens with Claude 3.7 Sonnet's extended thinking (February 2025), allowing 1,024 to 128,000 thinking tokens. Claude Opus 4.6 (February 2026) introduced adaptive thinking — the model dynamically decides how much to reason, with four effort levels: low, medium, high (default), and max.
Google uses thinkingBudget for Gemini 2.5 models (0 to 24,576 tokens, or -1 for dynamic) and thinkingLevel for Gemini 3 models.
Common Pitfall: The max_tokens parameter now includes hidden reasoning tokens. If you set a limit of 4,096 tokens and the model uses 3,500 to think, only 596 remain for the visible answer. Developers who truncate output limits to save money often abort the thought process mid-stream, getting a garbled or empty response instead of a cheaper one.
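The sketch below shows the three knobs side by side, using the vendor Python SDKs as documented at the time of writing. Model ids and parameter names change frequently, so treat these calls as illustrations rather than a reference.

```python
# Sketch of the three "thinking knobs" described above. Model ids are
# illustrative; check each provider's current API reference before use.
from openai import OpenAI
from anthropic import Anthropic
from google import genai
from google.genai import types

question = "Prove that the square root of 2 is irrational."

# OpenAI: categorical effort levels on o-series models.
openai_resp = OpenAI().chat.completions.create(
    model="o4-mini",
    reasoning_effort="high",                # "low" | "medium" | "high"
    messages=[{"role": "user", "content": question}],
)

# Anthropic: an explicit token budget for extended thinking. Note that
# max_tokens covers thinking AND the visible answer, so leave headroom
# above budget_tokens (the pitfall described above).
anthropic_resp = Anthropic().messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": question}],
)

# Google: a thinking budget in tokens; -1 lets the model decide dynamically.
gemini_resp = genai.Client().models.generate_content(
    model="gemini-2.5-pro",
    contents=question,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=-1)
    ),
)

# Each response exposes the visible answer separately from the thinking,
# which may be summarized or hidden depending on the provider.
```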
Self-consistency: reasoning without a reasoning model
You don't need a dedicated reasoning model to get reasoning-like behavior. Self-consistency — generating multiple answers and taking the majority vote — is a powerful technique that works with any LLM.
from collections import Counter
# Five reasoning paths for "What is 17 x 24?"
# Simulating a standard LLM generating five independent solutions
# Three paths compute correctly, two make arithmetic errors
path_answers = [408, 408, 418, 408, 398]
vote_counts = Counter(path_answers)
majority_answer = vote_counts.most_common(1)[0][0]
confidence = vote_counts[majority_answer] / len(path_answers)
print(f"Path results: {path_answers}")
print(f"Vote counts: {dict(vote_counts)}")
print(f"Majority answer: {majority_answer}")
print(f"Confidence: {confidence:.0%}")
print(f"Correct answer: {17 * 24}")
# Output:
# Path results: [408, 408, 418, 408, 398]
# Vote counts: {408: 3, 418: 1, 398: 1}
# Majority answer: 408
# Confidence: 60%
# Correct answer: 408
Wang et al. showed this approach improves GSM8K accuracy by 17.9 percentage points over a single greedy generation. In production, combining self-consistency with a Best-of-N selection using a separate reward model (inference-time rejection sampling) can approximate reasoning model performance at lower cost — though at higher latency.
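Here is a minimal sketch of that combination, with a stubbed reward model and canned candidate answers; in production the scorer would be a trained verifier and the candidates would come from sampling the LLM.

```python
# Sketch of Best-of-N selection with a separate reward model (inference-time
# rejection sampling). The scorer is a stub and the candidates are canned.
from collections import Counter

def score_with_reward_model(question: str, answer: str) -> float:
    """Stand-in for a learned verifier scoring (question, answer) pairs."""
    return 0.9 if "408" in answer else 0.2   # toy: pretend the verifier is reliable

question = "What is 17 x 24?"
candidates = ["The answer is 408.", "I get 418.", "408", "It's 398.", "408 total."]

# Option A: self-consistency (majority vote over extracted answers).
extracted = ["408" if "408" in c else c for c in candidates]
majority = Counter(extracted).most_common(1)[0][0]

# Option B: Best-of-N (keep the single candidate the reward model scores highest).
best = max(candidates, key=lambda c: score_with_reward_model(question, c))

print(f"Majority vote: {majority}")
print(f"Best-of-N:     {best}")
```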
The economics of reasoning
Reasoning models consume 5-100x more tokens per request than standard models. Even after OpenAI's 80% price cut in June 2025 (o3 dropped from $40 to $8 per million output tokens), a complex code review with o3 can still cost 5-10x more than the same query to GPT-4o — because o3 generates far more output tokens due to its hidden thinking chain.
The cost-effective strategy for most applications is a hybrid architecture (a minimal routing sketch follows the list):
- Router: A fast, cheap model (GPT-4o-mini, Gemini 2.5 Flash) classifies the query complexity.
- Simple queries route to the fast model for immediate response.
- Complex queries route to a reasoning model (o4-mini at $4.40/M output, or DeepSeek-R1 self-hosted) asynchronously.
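Here is a minimal sketch of the routing pattern, with stubbed model calls; in production, classify_complexity, answer_fast, and answer_with_reasoning would each wrap an API call to the models named above.

```python
# Minimal sketch of the router pattern. The classifier is a keyword stub; in
# production it would be a cheap LLM call that labels the query "simple" or
# "complex", and the two handlers would call the fast and reasoning models.
def classify_complexity(query: str) -> str:
    """Stand-in for a cheap classifier model."""
    hard_markers = ("prove", "debug", "optimize", "step by step", "plan")
    return "complex" if any(m in query.lower() for m in hard_markers) else "simple"

def answer_fast(query: str) -> str:
    return f"[fast model] {query}"            # e.g. GPT-4o-mini / Gemini 2.5 Flash

def answer_with_reasoning(query: str) -> str:
    return f"[reasoning model] {query}"       # e.g. o4-mini / self-hosted R1

def route(query: str) -> str:
    handler = answer_with_reasoning if classify_complexity(query) == "complex" else answer_fast
    return handler(query)

print(route("What is the capital of France?"))
print(route("Prove that there are infinitely many primes."))
```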
o4-mini deserves special attention: at roughly a tenth of o3's launch price (and still about half its price after the June 2025 cut), it matches or beats o3 on AIME 2025 (92.7% vs. 88.9%) and Codeforces (2719 vs. 2706 Elo). For most production use cases, o4-mini is the sweet spot between cost and capability.
Limitations and open problems
The overthinking problem
Reasoning models can perform worse on simple tasks. A survey titled "Stop Overthinking" found that models generate "excessively detailed or unnecessarily elaborate reasoning steps" even when they arrive at the correct answer early. Asking o3 "What is the capital of France?" wastes tokens on deliberation that adds nothing. Adaptive thinking systems (Claude Opus 4.6, Gemini's dynamic budget) are an explicit response to this problem — letting the model skip reasoning when it isn't needed.
Chain-of-thought faithfulness
Anthropic's Alignment Science team published "Reasoning Models Don't Always Say What They Think" (Chen et al., May 2025). They found that Claude 3.7 Sonnet mentioned inserted hints in its reasoning chain only 25% of the time, and DeepSeek-R1 only 39% of the time. The model's visible chain-of-thought is not a reliable transcript of its actual computational process — it is a post-hoc rationalization that may omit key factors. This has profound implications for AI safety: you cannot fully monitor a model's behavior by reading its reasoning trace.
When NOT to use reasoning models
Reasoning models are the wrong tool for: factual retrieval and Q&A (where a standard model with RAG is faster and cheaper), text transformation tasks (summarization, translation, formatting), entity extraction and classification, creative writing, and any latency-sensitive application requiring sub-second responses. Reserve reasoning models for multi-step math, complex code generation, scientific reasoning, multi-step planning, and tasks where correctness is worth the 10-30 second wait.
The road from CoT prompting to adaptive thinking
| Date | Milestone |
|---|---|
| Jan 2022 | Wei et al. publish Chain-of-Thought prompting (Google) |
| Mar 2022 | Wang et al. publish Self-Consistency (ICLR 2023) |
| May 2022 | Kojima et al. publish zero-shot CoT ("Let's think step by step") |
| May 2023 | Lightman et al. publish "Let's Verify Step by Step" / PRM800K (OpenAI) |
| May 2023 | Yao et al. publish Tree of Thoughts (Princeton/DeepMind, NeurIPS 2023) |
| Aug 2024 | Snell et al. publish test-time compute scaling laws |
| Sep 2024 | OpenAI releases o1-preview — first commercial reasoning model |
| Dec 2024 | Gemini 2.0 Flash Thinking Experimental; o3 previewed (ARC-AGI 87.5%) |
| Jan 2025 | DeepSeek-R1 released (671B MoE, MIT license) + 6 distilled models |
| Feb 2025 | Claude 3.7 Sonnet introduces extended thinking; Grok 3 Think released |
| Mar 2025 | QwQ-32B matches R1 at 32B params; Gemini 2.5 Pro with native thinking |
| Apr 2025 | o3 and o4-mini released; Qwen3 adds unified thinking/non-thinking modes |
| May 2025 | Claude Opus 4 and Sonnet 4; DeepSeek-R1-0528 (AIME 2024: 91.4%) |
| Sep 2025 | Claude Sonnet 4.5 (AIME 2025: 87% without tools) |
| Feb 2026 | Claude Opus 4.6 introduces adaptive thinking (model decides when to reason) |
The trajectory is clear: reasoning evolved from an external prompting trick (2022) to an internalized capability trained via RL (2024), and is now converging on adaptive reasoning — models that dynamically decide whether and how much to think, eliminating the overhead of reasoning on trivial queries while preserving depth on hard problems.
Conclusion
Reasoning models represent the maturation of generative AI from pattern matching to genuine problem-solving. Through chain-of-thought training, process reward models, and test-time compute scaling, models like o3, DeepSeek-R1, and Claude Opus 4.6 have internalized the trial-and-error process that humans use to solve hard problems — trading latency for accuracy in a trade-off that is increasingly favorable as inference costs fall.
The most effective AI engineers in 2026 are not just prompt engineers — they are reasoning architects, designing the verification loops, thinking budgets, and model routing strategies that determine when a system should think fast and when it should think deep.
To understand the foundation these reasoning models build on, start with How Large Language Models Actually Work. For optimizing the context fed to reasoning engines, see Context Engineering: From Prompts to Production. To combine reasoning with external knowledge retrieval, explore Retrieval-Augmented Generation (RAG). And for the vector representations that power semantic search in these systems, read Text Embeddings: The Foundation of Semantic Search.