
Reasoning Models: How AI Learned to Think Step by Step

LDS Team · Let's Data Science

In September 2024, OpenAI's o1-preview scored 83% on the American Invitational Mathematics Examination. Seven months later, o3 hit 96.7% on the same exam, matching gold-medal International Math Olympiad competitors. The model didn't get bigger. It got better at thinking. On March 3, 2026, Donald Knuth published "Claude's Cycles" after Claude Opus 4.6 solved an open graph theory conjecture in 31 guided explorations over roughly an hour.

Reasoning models represent the most significant shift in how LLMs actually work since the Transformer architecture. Understanding their chain-of-thought mechanisms, reinforcement learning training, and test-time compute scaling is essential for anyone building with AI in 2026.

Reasoning technique evolution from basic prompting to test-time compute scaling

The Shift from Prediction to Deliberation

A reasoning model is an LLM designed to generate intermediate computational steps, often called a "chain of thought," before producing a final answer. Unlike standard LLMs that optimize for immediate next-token probability, reasoning models optimize for the correctness of a verified final answer, using reinforcement learning to internalize self-correction.

Nobel laureate Daniel Kahneman described two modes of human thought: System 1 (fast, intuitive, like completing "Bread and...") and System 2 (slow, deliberative, like calculating $17 \times 24$ in your head). Standard LLMs are System 1 engines. Reasoning models simulate System 2 by generating a hidden thinking process before answering.

In Plain English: A standard LLM reads your question and immediately starts writing. A reasoning model opens an internal scratchpad, debates with itself for thousands of tokens, checks its own work, and only then writes the answer you see. That scratchpad is where the intelligence happens.

A standard LLM asks, "What is the most likely next word?" A reasoning model asks, "What is the next logical step to maximize the probability of a correct outcome?"

| Dimension | Standard LLM | Reasoning Model |
| --- | --- | --- |
| Thinking mode | System 1 (instant) | System 2 (deliberative) |
| Objective | Most probable next token | Correct final answer |
| Internal process | Single forward pass | Hidden chain-of-thought |
| Latency | Sub-second | 5 to 60+ seconds |
| Training signal | Next-token prediction | Reinforcement learning with verification |
| Best for | Retrieval, summarization, chat | Math, code, science, multi-step planning |

The Chain-of-Thought Revolution

The foundation of every modern reasoning model traces back to a single insight: LLMs perform dramatically better on multi-step problems when they show their work.

Chain-of-Thought prompting (Wei et al., Google, January 2022). Including a few examples of step-by-step reasoning in the prompt caused LLMs to generate their own intermediate steps, dramatically improving accuracy on math and logic tasks. In standard generation, the model maps input directly to output: $P(y|x)$. With chain-of-thought, the model generates reasoning steps $z$ before the output:

$$P(y|x) = \sum_{z} P(y|z, x) \cdot P(z|x)$$

Where:

  • $P(y|x)$ is the probability of the correct answer $y$ given input $x$
  • $z$ represents a specific reasoning path (chain of intermediate steps)
  • $P(z|x)$ is the probability of generating that reasoning path from the input
  • $P(y|z, x)$ is the probability of the correct answer given both the reasoning path and input
  • The sum is over all possible reasoning paths

In Plain English: The probability of a correct answer depends on generating a good reasoning path first. Instead of jumping straight to an answer, the model explores various routes. If a path is logically sound, the final answer is far more likely correct. This is why "showing your work" makes LLMs smarter; it is not cosmetic.
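The marginalization above can be made concrete with toy numbers. The path probabilities and the direct-answer baseline below are invented for illustration; the point is only that when generation probability concentrates on sound reasoning paths, the marginal probability of a correct answer rises well above the no-reasoning baseline:

```python
# Toy illustration of P(y|x) = sum_z P(y|z, x) * P(z|x).
# Hypothetical values for three reasoning paths the model might sample
# when asked "What is 17 x 24?".
paths = [
    # (P(z|x), P(correct | z, x))
    (0.5, 0.95),  # careful long multiplication
    (0.3, 0.90),  # decomposition: 17*20 + 17*4
    (0.2, 0.20),  # sloppy mental shortcut, usually wrong
]

# Assumed accuracy of answering directly, with no reasoning steps
p_direct = 0.60

# Marginalize over the reasoning paths
p_with_cot = sum(p_z * p_y_given_z for p_z, p_y_given_z in paths)

print(f"P(correct | direct answer):    {p_direct:.2f}")
print(f"P(correct | chain-of-thought): {p_with_cot:.2f}")
```

Here the chain-of-thought marginal works out to 0.785, beating the assumed 0.60 direct-answer baseline even though one of the three paths is unreliable.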

Zero-shot CoT (Kojima et al., May 2022) eliminated crafted examples entirely. Simply appending "Let's think step by step" to any prompt boosted reasoning performance, proving the capability was latent in the model, waiting to be activated.

Self-Consistency (Wang et al., March 2022) pushed accuracy further by sampling multiple diverse reasoning paths and taking the majority vote. On GSM8K math benchmarks, self-consistency improved accuracy by +17.9%.

Tree of Thoughts (Yao et al., Princeton/Google DeepMind, NeurIPS 2023) generalized chain-of-thought to tree-structured reasoning with strategic lookahead and backtracking, allowing models to evaluate branches and prune dead ends, similar to how AlphaGo searched game trees.

Chain of Thought versus Tree of Thoughts comparison showing linear vs branching reasoning

More recently, Forest-of-Thought (ICML 2025) extended tree reasoning to multiple parallel trees with sparse activation, achieving 96.8% on the Game of 24. DR-CoT (Dynamic Recursive Chain-of-Thought, Scientific Reports 2025) introduced recursive refinement with dynamic context truncation.

The critical shift came in late 2024: instead of prompting models to reason, researchers began training them to reason natively through reinforcement learning. OpenAI's o1-preview (September 2024) was the first commercial model with internalized chain-of-thought, and DeepSeek-R1 (January 2025) proved it could be done in the open.

Test-Time Compute: A New Scaling Axis

For a decade, AI's "scaling laws" meant training compute: bigger model, more data, better results. Around 2024, this hit diminishing returns as high-quality data became scarce. Snell et al. ("Scaling LLM Test-Time Compute Optimally," August 2024) demonstrated a different axis: performance improves logarithmically with compute spent at inference time.

$$\text{Performance} \propto \log(\text{Test-Time Compute})$$

Where:

  • $\text{Performance}$ is the accuracy or correctness metric on a given task
  • $\text{Test-Time Compute}$ is the amount of compute (measured in tokens or FLOPs) spent during inference
  • The logarithmic relationship means early thinking tokens help enormously, while later tokens yield diminishing returns

In Plain English: Just as a human gives a better answer with 60 seconds to think rather than 5, a reasoning model gets smarter with more "thinking tokens." The first 1,000 thinking tokens help enormously, but going from 10,000 to 11,000 helps less. A smaller model that thinks for 10 seconds can outperform a massive model that answers instantly.

This created a new design trade-off. Intelligence is now a function of both model capability and inference budget. Inference demand is projected to exceed training demand by 118x by 2026, reshaping GPU procurement toward inference-optimized hardware. The simulation below demonstrates the principle:

```python
import numpy as np

np.random.seed(42)

def simulate_reasoning_accuracy(n_candidates, n_trials=2000, base_accuracy=0.55):
    """Each candidate independently correct with base_accuracy.
    Majority vote over n_candidates improves effective accuracy."""
    correct = 0
    for _ in range(n_trials):
        votes = np.random.binomial(1, base_accuracy, size=n_candidates)
        if votes.sum() > n_candidates / 2:
            correct += 1
    return correct / n_trials

compute_levels = [1, 3, 5, 9, 15, 25, 51]
print(f"{'Candidates':<12} {'Accuracy':<10} {'Gain vs 1'}")
print('-' * 36)

baseline = None
for n in compute_levels:
    acc = simulate_reasoning_accuracy(n)
    if baseline is None:
        baseline = acc
    gain = acc - baseline
    print(f'{n:<12} {acc:<10.1%} {gain:+.1%}')
```

Output:

```
Candidates   Accuracy   Gain vs 1
------------------------------------
1            54.4%      +0.0%
3            60.4%      +5.9%
5            59.9%      +5.4%
9            61.9%      +7.5%
15           65.5%      +11.1%
25           69.4%      +14.9%
51           75.4%      +21.0%
```

Key Insight: Even with each individual attempt barely above coin-flip accuracy (55%), majority voting over 51 candidates pushes effective accuracy above 75%. This is the mathematical foundation of test-time compute scaling: more thinking tokens are equivalent to more candidates in a majority vote.
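The simulated numbers above can be cross-checked analytically. With the same 55% per-attempt assumption, the probability of a correct majority among n independent attempts has a closed-form binomial expression (the classic Condorcet jury setup), which reproduces the trend without sampling noise:

```python
from math import comb

def majority_vote_accuracy(n, p=0.55):
    """Probability that more than half of n independent attempts,
    each correct with probability p, are correct (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Same candidate counts as the simulation above
for n in [1, 3, 5, 9, 15, 25, 51]:
    print(f"{n:>3} candidates: {majority_vote_accuracy(n):.1%}")
```

The exact values climb monotonically with n, confirming that the dips in the simulated table (e.g. 3 vs. 5 candidates) are sampling noise, not a real effect.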

How Reasoning Models Are Trained

A reasoning system has three components: a Generator (proposes the next step), a Verifier (evaluates step correctness), and a Search Strategy (explores the tree of possible thoughts).

Process Reward Models vs. Outcome Reward Models

Traditional RLHF uses Outcome Reward Models (ORMs): one binary signal at the end. In a 50-step math proof, the model gets a single thumbs-up or thumbs-down with no information about where reasoning went wrong.

Process Reward Models (PRMs) evaluate every step. Lightman et al. ("Let's Verify Step by Step," OpenAI, May 2023) showed process supervision significantly outperforms outcome supervision, solving 78.2% of a MATH subset. They released PRM800K, 800,000 step-level labels that became foundational for reasoning model training.

$$R_{\text{total}} = \sum_{t=1}^{T} \gamma^t \cdot r(s_t)$$

Where:

  • $R_{\text{total}}$ is the total discounted reward for the full reasoning chain
  • $T$ is the total number of reasoning steps
  • $\gamma$ is the discount factor ($0 < \gamma < 1$), weighting earlier steps more heavily
  • $r(s_t)$ is the reward assigned to step $t$ by the process reward model
  • $s_t$ is the state of the reasoning chain at step $t$

In Plain English: The total reward sums up the reward for each individual step, discounted over time. A correct step early matters more than a correct step late in the chain. This forces the model to get every step right rather than gamble on the final answer.
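A minimal sketch of the discounted sum, using a hypothetical discount factor and hand-picked binary step rewards, shows why a mistake early in the chain is penalized harder than the same mistake late:

```python
def total_process_reward(step_rewards, gamma=0.9):
    """Discounted sum of per-step rewards from a process reward model:
    R_total = sum_t gamma^t * r(s_t), with t starting at 1."""
    return sum(gamma**t * r for t, r in enumerate(step_rewards, start=1))

# Hypothetical PRM scores for two 5-step chains (1 = good step, 0 = bad)
early_mistake = [0, 1, 1, 1, 1]   # stumbles on step 1
late_mistake  = [1, 1, 1, 1, 0]   # stumbles on step 5

print(f"Early mistake: {total_process_reward(early_mistake):.3f}")
print(f"Late mistake:  {total_process_reward(late_mistake):.3f}")
```

Because $\gamma^t$ shrinks with $t$, losing the step-1 reward costs more total reward than losing the step-5 reward, so the training signal pushes the model to get the foundations right first.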

Process reward model training pipeline showing step-level supervision

PRM research has accelerated: Athena-PRM extended PRMs to multimodal reasoning (10+ point gains on visual math benchmarks with just 5,000 samples), FunPRM adapted step-level supervision to code generation, and ToolPRMBench introduced the first benchmark for PRMs in tool-using agent settings.

Key Insight: ICLR 2025 findings complicate the PRM narrative: discriminative ORMs can match discriminative PRMs across 14 diverse domains. However, process advantage verifiers remain 1.5 to 5x more compute-efficient and enable 6x gains in sample efficiency for online RL. The debate is shifting from "PRM vs. ORM" to "when and how to combine both."

DeepSeek-R1 and GRPO: Reasoning from Reinforcement Learning

DeepSeek-R1 (January 2025, published in Nature) provided the first fully open account of training a reasoning model. Its key innovation was Group Relative Policy Optimization (GRPO), from the DeepSeekMath paper, which eliminates the separate critic/value model PPO requires. GRPO generates multiple responses per prompt and uses their mean reward as the advantage baseline, reducing memory by 40 to 60% and costing up to 18x less than PPO.
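The group-relative baseline can be sketched in a few lines. This shows only the advantage computation, not the full GRPO objective (which adds PPO-style clipping and a KL penalty); the reward values are hypothetical outputs of a verifiable checker:

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantage in the spirit of DeepSeekMath:
    A_i = (r_i - mean(group)) / std(group).
    The group mean replaces PPO's learned critic as the baseline."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard all-equal groups
    return [(r - mu) / sigma for r in group_rewards]

# Hypothetical verifiable rewards for 8 sampled responses to one prompt
# (1.0 = passed the ground-truth checker, 0.0 = failed)
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
for r, a in zip(rewards, grpo_advantages(rewards)):
    print(f"reward={r:.0f}  advantage={a:+.2f}")
```

Responses that beat their own group's average get positive advantage and are reinforced; no separate value network ever has to be trained or held in memory.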

The most remarkable finding came from DeepSeek-R1-Zero, trained via pure RL without any supervised fine-tuning. The model spontaneously developed self-reflection, self-verification, and what the authors described as an "aha moment," recognizing and correcting its own mistakes mid-reasoning. R1-Zero used Reinforcement Learning with Verifiable Rewards (RLVR): accuracy rewards from ground-truth checkers (compilers, math solvers) and format rewards. No neural reward models were used, deliberately avoiding reward hacking at scale.

The full R1 pipeline added four stages: cold-start SFT, GRPO training with verifiable rewards, rejection sampling for high-quality reasoning data, and final SFT + RL for alignment. R1 scored 97.3% on MATH-500.

The Reasoning Model Frontier in March 2026

Every major AI lab now ships reasoning models. The table below captures the current frontier.

| Model | Lab | Released | Key Feature | Pricing (per M tokens) |
| --- | --- | --- | --- | --- |
| GPT-5.4 | OpenAI | Mar 2026 | GDPval 83%, thinking mode uses 50 to 80% fewer tokens than o3 | $10 in / $30 out |
| o3 | OpenAI | Apr 2025 | Hidden CoT, high compute mode | $2 in / $8 out |
| o4-mini | OpenAI | Apr 2025 | Cost-efficient reasoning | $1.10 in / $4.40 out |
| Claude Opus 4.6 | Anthropic | Feb 2026 | Adaptive thinking, solved Knuth conjecture | $15 in / $75 out |
| Gemini 3.1 Pro | Google | Feb 2026 | ARC-AGI-2 77.1%, GPQA Diamond 94.3%, Deep Think | $2 in / $8 out |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B MoE, MIT license, open weights | Open-source |
| QwQ-32B | Alibaba | Mar 2025 | 32B params matches R1, Apache 2.0 | Open-source |
| Ministral 3 Reasoning | Mistral | Dec 2025 | 3B/8B/14B reasoning variants, Apache 2.0 | Open-source |

Benchmark Performance Across the Frontier

| Model | AIME 2024 | MATH-500 | GPQA Diamond | SWE-Bench Verified |
| --- | --- | --- | --- | --- |
| GPT-5.4 (thinking) | | | 83.0% (GDPval) | 57.7% (Pro) |
| o3 | 96.7% | | 83.3 to 87.7% | 69.1% |
| o4-mini | 91.7% | | 81.4% | 68.1% |
| Claude Opus 4.6 | | | | 80.8% |
| Gemini 3.1 Pro | | | 94.3% | 80.6% |
| DeepSeek-R1 | 79.8% | 97.3% | 71.5% | ~49% |
| QwQ-32B | | | | |

GPT-5.4's thinking mode uses 50 to 80% fewer tokens than o3 for equivalent accuracy, with 6x fewer hallucinations. Claude Opus 4.6 leads SWE-Bench at 80.8%; Gemini 3.1 Pro holds the GPQA Diamond record at 94.3%.

Pro Tip: For cost-sensitive deployments, o4-mini is the sweet spot: roughly 10x cheaper than o3, it matches or edges out o3 on AIME 2025 (92.7% vs. 88.9%) and Codeforces (2719 vs. 2706 Elo). Start there and escalate only when accuracy demands it.

The Hardest Benchmarks

Humanity's Last Exam (2,500 expert-level questions): Gemini 3 Pro leads at 37.5%, Claude Opus 4.6 at 36.7%. Human experts score ~90%. ARC-AGI-2 (novel visual reasoning): Gemini 3.1 Pro reaches 77.1%. ARC-AGI-3, launching March 25, 2026, will be the first interactive reasoning benchmark with 1,000+ levels across 150+ environments. FrontierMath (350 research-level math): o4-mini leads at 17%; most non-reasoning models score ~0%.

Distilling Reasoning into Smaller Models

DeepSeek-R1 demonstrated that reasoning ability can be distilled from a 671B teacher into dramatically smaller students. All six distilled models shipped under MIT license in January 2025.

| Distilled Model | Base | Params | AIME 2024 | MATH-500 |
| --- | --- | --- | --- | --- |
| R1-Distill-Qwen-1.5B | Qwen2.5 | 1.5B | | 83.9% |
| R1-Distill-Qwen-7B | Qwen2.5 | 7B | 55.5% | 92.8% |
| R1-Distill-Qwen-14B | Qwen2.5 | 14B | 69.7% | 93.9% |
| R1-Distill-Qwen-32B | Qwen2.5 | 32B | 72.6% | 94.3% |
| R1-Distill-Llama-70B | Llama-3.3 | 70B | 70.0% | 94.5% |

R1-Distill-Qwen-7B surpassed QwQ-32B-Preview; R1-Distill-Qwen-32B outperformed o1-mini. This catalyzed open-source reasoning: Sky-T1-32B-Preview (trained for under $450), QwQ-32B (Apache 2.0), and Ministral 3 Reasoning (3B/8B/14B, the 14B scoring 85% on AIME 2025). A quantized R1-Distill-Qwen-14B fits in 16GB of VRAM.

Controlling the Thinking Budget in Production

Modern inference APIs expose direct control over reasoning depth:

OpenAI offers reasoning_effort (low, medium, high) for o3 and o4-mini. Moving from low to high raises accuracy by 10 to 30% on hard benchmarks.

Anthropic introduced budget_tokens (1,024 to 128,000 thinking tokens) with Claude 3.7 Sonnet's extended thinking. Claude Opus 4.6 added adaptive thinking: four effort levels where the model decides how much to reason.

Google uses thinkingBudget for Gemini 2.5 and thinkingLevel for Gemini 3 with Deep Think mode.

Common Pitfall: The max_tokens parameter now includes hidden reasoning tokens. If you set a limit of 4,096 tokens and the model uses 3,500 to think, only 596 remain for the visible answer. Developers who truncate output limits to save money often abort the thought process mid-stream, getting a garbled or empty response.

Self-Consistency: Reasoning Without a Reasoning Model

You don't need a dedicated reasoning model to get reasoning-like behavior. Self-consistency, generating multiple answers and taking the majority vote, works with any LLM. This is relevant to context engineering workflows where you control the inference pipeline.

```python
from collections import Counter

# Five reasoning paths for "What is 17 x 24?"
# Simulating a standard LLM generating five independent solutions
# Three paths compute correctly, two make arithmetic errors
path_answers = [408, 408, 418, 408, 398]

vote_counts = Counter(path_answers)
majority_answer = vote_counts.most_common(1)[0][0]
confidence = vote_counts[majority_answer] / len(path_answers)

print(f"Path results: {path_answers}")
print(f"Vote counts: {dict(vote_counts)}")
print(f"Majority answer: {majority_answer}")
print(f"Confidence: {confidence:.0%}")
print(f"Correct answer: {17 * 24}")
```

Output:

```
Path results: [408, 408, 418, 408, 398]
Vote counts: {408: 3, 418: 1, 398: 1}
Majority answer: 408
Confidence: 60%
Correct answer: 408
```

Wang et al. showed self-consistency improves GSM8K accuracy by +17.9% over a single greedy generation. Combining it with Best-of-N selection using a reward model can approximate reasoning model performance at lower cost.
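Best-of-N selection swaps the majority vote for an argmax over reward-model scores. The candidate strings and scores below are invented stand-ins for real model generations and reward-model outputs:

```python
# Best-of-N: score each candidate answer with a reward model and keep
# the highest-scoring one. Scores here are hypothetical; in practice
# they come from a trained verifier or process reward model.
candidates = [
    ("408, via long multiplication", 0.92),
    ("418, via a mental shortcut",   0.31),
    ("408, via 17*20 + 17*4",        0.88),
    ("398, dropped a carry",         0.12),
]

best_answer, best_score = max(candidates, key=lambda c: c[1])
print(f"Selected: {best_answer} (reward {best_score:.2f})")
```

Unlike majority voting, Best-of-N works even when candidates rarely agree exactly (e.g. free-form proofs), since the verifier scores each candidate independently.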

The Economics of Reasoning

Reasoning models consume 5 to 100x more tokens per request than standard models. Even after OpenAI's 80% price cut in June 2025 (o3 dropped from $10/$40 to $2/$8 per million tokens), a complex code review with o3 costs 5 to 10x more than GPT-4o due to hidden thinking tokens.

The cost-effective strategy is a hybrid architecture: a fast, cheap model (GPT-4o-mini, Gemini 2.5 Flash) routes simple queries instantly, while complex queries go to a reasoning model (o4-mini at $4.40/M output or self-hosted DeepSeek-R1) asynchronously. For RAG pipelines, simple retrieval should never hit a reasoning model; multi-hop synthesis questions are ideal candidates.
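A minimal router sketch, using keyword heuristics as a stand-in for the cheap classifier model (the model-name defaults and regex signals are purely illustrative):

```python
import re

# Illustrative complexity signals; a production router would use a small
# classifier model, but keyword rules show the shape of the design.
REASONING_SIGNALS = [
    r"\bprove\b", r"\bdebug\b", r"\bstep[- ]by[- ]step\b",
    r"\boptimi[sz]e\b", r"\bplan\b", r"\bmulti[- ]hop\b",
]

def route(query, cheap_model="gpt-4o-mini", reasoning_model="o4-mini"):
    """Send multi-step queries to the reasoning model, everything else
    to the fast, cheap model."""
    hits = sum(bool(re.search(p, query.lower())) for p in REASONING_SIGNALS)
    long_query = len(query.split()) > 40
    return reasoning_model if (hits >= 1 or long_query) else cheap_model

print(route("What is the capital of France?"))
print(route("Prove that the sum of two even numbers is even, step by step."))
```

The first query stays on the cheap model; the second trips a reasoning signal and pays for deliberation, matching the 80/20 split described below for production traffic.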

Limitations and When NOT to Use Reasoning Models

The Overthinking Problem

Reasoning models can perform worse on simple tasks. The "Stop Overthinking" survey (2025) found that models generate excessively elaborate reasoning even when they reach the correct answer early. Asking o3 "What is the capital of France?" wastes tokens on pointless deliberation. Adaptive thinking (Claude Opus 4.6, Gemini's dynamic budget) is the direct response.

Chain-of-Thought Faithfulness

Anthropic's "Reasoning Models Don't Always Say What They Think" (Chen et al., May 2025) found Claude 3.7 Sonnet mentioned inserted hints in its reasoning only 25% of the time; DeepSeek-R1 only 39%. The visible chain-of-thought is not a reliable transcript of internal computation. It is a post-hoc rationalization, with profound implications for AI safety monitoring.

When to Use vs. When NOT to Use Reasoning Models

| Use Reasoning Models | Avoid Reasoning Models |
| --- | --- |
| Multi-step math and proofs | Factual retrieval and Q&A (use RAG) |
| Complex code generation and debugging | Text transformation (summarization, translation) |
| Scientific reasoning across disciplines | Entity extraction and classification |
| Multi-hop planning with constraints | Creative writing |
| Tasks where correctness justifies 10 to 30s latency | Latency-sensitive applications (sub-second) |
| Hard benchmark-style problems | High-volume, low-complexity queries |

Common Pitfall: Many teams default to reasoning models for everything after seeing benchmark scores. In practice, 80% of production queries are simple enough that a standard model with good context engineering handles them faster, cheaper, and just as accurately. Reserve reasoning compute for the 20% that actually needs it.

The Timeline from CoT Prompting to Adaptive Thinking

| Date | Milestone |
| --- | --- |
| Jan 2022 | Chain-of-Thought prompting (Wei et al., Google) |
| Mar 2022 | Self-Consistency (Wang et al.) |
| May 2022 | Zero-shot CoT: "Let's think step by step" (Kojima et al.) |
| May 2023 | "Let's Verify Step by Step" / PRM800K (Lightman et al., OpenAI) |
| May 2023 | Tree of Thoughts (Yao et al., NeurIPS 2023) |
| Aug 2024 | Test-time compute scaling laws (Snell et al.) |
| Sep 2024 | o1-preview: first commercial reasoning model |
| Jan 2025 | DeepSeek-R1 (671B MoE, MIT license) + 6 distilled models |
| Mar 2025 | QwQ-32B matches R1 at 32B; Gemini 2.5 Pro with native thinking |
| Apr 2025 | o3 and o4-mini released |
| Dec 2025 | Ministral 3 Reasoning; Forest-of-Thought (ICML 2025) |
| Feb 2026 | Claude Opus 4.6 (adaptive thinking); Gemini 3.1 Pro (ARC-AGI-2: 77.1%) |
| Mar 2026 | GPT-5.4 released; ARC-AGI-3 launches Mar 25 |

The trajectory: reasoning evolved from a prompting trick (2022) to an internalized capability trained via RL (2024), and now converges on adaptive reasoning, where models decide whether and how much to think.

Conclusion

Reasoning models represent the maturation of generative AI from pattern matching to genuine problem-solving. Through chain-of-thought training, process reward models, and test-time compute scaling, models like o3, DeepSeek-R1, and Claude Opus 4.6 have internalized the trial-and-error process humans use to tackle hard problems, trading latency for accuracy in a trade-off that grows more favorable as inference costs fall.

The most effective AI engineers in 2026 are reasoning architects, designing verification loops, thinking budgets, and routing strategies that determine when a system should think fast and when it should think deep. The open-source ecosystem means this capability is no longer locked behind API paywalls.

To understand the foundation these reasoning models build on, start with how LLMs actually work. For optimizing context fed to reasoning engines, see context engineering. To combine reasoning with external knowledge, explore RAG. The question going forward is not whether AI can reason, but how efficiently we direct that reasoning toward problems that actually matter.

Interview Questions

Q: What is the difference between chain-of-thought prompting and a trained reasoning model?

Chain-of-thought prompting adds step-by-step examples to the prompt to elicit reasoning from a standard LLM. A trained reasoning model has internalized the reasoning process through reinforcement learning, generating intermediate steps natively without special prompting. The trained approach is more reliable because the behavior is baked into the weights rather than coaxed through prompt engineering.

Q: Explain the difference between a Process Reward Model and an Outcome Reward Model.

An ORM provides a single binary signal after the final answer. A PRM evaluates every intermediate step, giving denser feedback so the model learns which specific steps went wrong. PRMs are 1.5 to 5x more compute-efficient for training, though ICLR 2025 work shows discriminative ORMs can match PRMs in some domains.

Q: What is GRPO and why was it significant for DeepSeek-R1?

GRPO generates multiple responses per prompt and uses their mean reward as the advantage baseline, eliminating PPO's separate critic model. This reduces memory by 40 to 60% and can be 18x more cost-efficient than PPO, making reasoning model training accessible without massive infrastructure.

Q: Your team wants to add a reasoning model to a production chatbot. What architecture would you recommend?

A hybrid routing architecture where a cheap classifier evaluates query complexity. Simple factual questions route to a standard LLM; only multi-step reasoning, complex code, or math hits the reasoning model. This keeps costs manageable while ensuring hard problems get the compute they need.

Q: Why might a reasoning model perform worse than a standard LLM on a simple question?

Reasoning models can overthink simple queries, generating elaborate reasoning chains for questions that need no deliberation. This wastes tokens, increases latency, and can actually introduce errors through unnecessary intermediate steps. Adaptive thinking systems like Claude Opus 4.6 address this by letting the model dynamically decide whether and how much to reason.

Q: A reasoning model shows correct logic in its chain-of-thought but arrives at a wrong answer. What could explain this?

Research from Anthropic (Chen et al., May 2025) showed that the visible chain-of-thought is not a faithful transcript of the model's internal computation. Models mentioned inserted hints in their reasoning only 25 to 39% of the time. The displayed reasoning can be a post-hoc rationalization rather than the actual decision process, meaning the "correct logic" you see may not reflect what actually determined the output.

Q: How does knowledge distillation work for reasoning models, and what are its practical implications?

The teacher model generates high-quality reasoning traces, and the student is fine-tuned to reproduce both process and answers. DeepSeek showed a 7B distilled model surpassing a 32B non-distilled model, meaning reasoning-capable models now fit on consumer GPUs with a quantized 14B in 16GB of VRAM.

Q: What is the relationship between test-time compute and model accuracy?

Performance scales logarithmically with test-time compute: the first additional thinking tokens help enormously, but returns diminish. There is always an optimal thinking budget beyond which more tokens are wasteful. API parameters like reasoning_effort or budget_tokens let you tune this trade-off per query.
