Reasoning Models: How AI Learned to Think Step by Step

LDS Team
Let's Data Science

In September 2024, OpenAI's o1 scored 83% on the American Invitational Mathematics Examination (AIME 2024). Seven months later, o3 scored 96.7% on the same exam — matching the level of gold-medal International Math Olympiad competitors. The model didn't get bigger between those releases. It got better at thinking. Where standard LLMs generate answers in a single forward pass — predicting the most likely next token — reasoning models pause, plan, explore multiple solution paths, critique their own logic, and backtrack when they hit a dead end. They have traded speed for deliberation, and the results have been extraordinary.

This shift represents the most significant change in how LLMs operate since the Transformer architecture itself. Understanding how reasoning models work — the chain-of-thought mechanisms, the reinforcement learning training, the test-time compute scaling laws — is now essential knowledge for anyone building with AI.

The shift from prediction to deliberation

A reasoning model is an LLM designed to generate intermediate computational steps — often called a "chain of thought" — before producing a final answer. Unlike standard LLMs that optimize for immediate next-token probability, reasoning models optimize for the correctness of a verified final answer, using reinforcement learning to internalize the process of self-correction.

Nobel laureate Daniel Kahneman described two modes of human thought: System 1 (fast, automatic, intuitive — like completing "Bread and...") and System 2 (slow, deliberative, logical — like calculating 17 × 24 in your head). Standard LLMs like GPT-4o and Claude Sonnet 4.5 in direct mode are System 1 engines. Reasoning models like o3, DeepSeek-R1, and Claude Opus 4.6 in adaptive thinking mode simulate System 2 by generating a hidden (or partially visible) thinking process before presenting the final result.

In Plain English: A standard LLM reads your question and immediately starts writing the answer. A reasoning model reads your question, opens an internal scratchpad, debates with itself for hundreds or thousands of tokens, checks its own work, and only then writes the answer you see. The scratchpad is where the "intelligence" happens.

The core difference lies in the objective function. A standard LLM asks, "What is the most likely next word?" A reasoning model asks, "What is the next logical step to maximize the probability of a correct outcome?"

From prompting to training: the chain-of-thought revolution

The foundation of every modern reasoning model traces back to a single insight: LLMs perform dramatically better on multi-step problems when they show their work.

Chain-of-Thought prompting (Wei et al., Google, January 2022). The original CoT paper demonstrated that including a few examples of step-by-step reasoning in the prompt caused LLMs to generate their own intermediate steps, dramatically improving accuracy on math, logic, and commonsense tasks. In standard generation, the model maps input directly to output: P(y|x). With chain-of-thought, the model generates reasoning steps z before the output:

P(y|x) = \sum_{z} P(y|z, x) \cdot P(z|x)

In Plain English: The probability of the correct answer depends on the probability of generating a good reasoning path first. Instead of jumping from question to answer, the model explores various reasoning paths. If a path is logically sound, the final answer is far more likely to be correct. This is why "showing your work" makes LLMs smarter — it's not just cosmetic.
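
To make the decomposition concrete, here is a toy calculation showing how the answer probability is a weighted sum over reasoning paths. Every number is invented for illustration; real models don't expose these quantities directly.

python
# Toy illustration of P(y|x) = sum over z of P(y|z, x) * P(z|x).
# Hypothetical reasoning paths z for "What is 17 x 24?":
# (probability of generating this path, probability of the correct answer given the path)
paths = {
    "decompose: 17*24 = 17*20 + 17*4": (0.5, 0.95),
    "long multiplication, carries kept": (0.3, 0.90),
    "sloppy mental math, dropped a carry": (0.2, 0.10),
}

# Marginalize over the reasoning paths
p_correct_with_cot = sum(p_z * p_y_given_z for p_z, p_y_given_z in paths.values())

# Hypothetical probability of answering correctly with no intermediate reasoning
p_correct_direct = 0.55

print(f"P(correct | chain-of-thought) ~= {p_correct_with_cot:.2f}")
print(f"P(correct | direct answer)    ~= {p_correct_direct:.2f}")

The sum is dominated by whichever paths the model is most likely to generate, which is why training the model to prefer sound paths (the subject of the later sections) matters more than any single lucky sample.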

Zero-shot CoT (Kojima et al., May 2022) eliminated the need for crafted examples entirely. Simply appending "Let's think step by step" to any prompt boosted reasoning performance significantly — proving that the capability was latent in the model, waiting to be activated.

Self-Consistency (Wang et al., March 2022, ICLR 2023) pushed accuracy further by sampling multiple diverse reasoning paths and taking the majority vote. On GSM8K math benchmarks, self-consistency improved accuracy by +17.9%.

Tree of Thoughts (Yao et al., Princeton/Google DeepMind, NeurIPS 2023) generalized chain-of-thought to tree-structured reasoning with strategic lookahead and backtracking — allowing models to evaluate multiple branches and prune dead ends, similar to how AlphaGo searched game trees.
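
The search skeleton is simple even though the ideas are powerful. Below is a minimal beam-search sketch in the spirit of Tree of Thoughts, not the authors' implementation: propose_thoughts and score_thought are placeholders for LLM calls, and the dummy scorer exists only so the example runs.

python
import heapq

def propose_thoughts(state, k=3):
    # Placeholder for an LLM call that proposes k candidate next reasoning steps.
    return [f"{state} -> step{i}" for i in range(k)]

def score_thought(state):
    # Placeholder for an LLM or verifier call rating a partial solution in [0, 1].
    return 1.0 / (1.0 + len(state))  # dummy heuristic so the example runs

def tree_of_thoughts(problem, depth=3, beam_width=2):
    # Each beam entry is (score, partial reasoning state).
    beam = [(score_thought(problem), problem)]
    for _ in range(depth):
        candidates = []
        for _, state in beam:
            for nxt in propose_thoughts(state):               # expand the tree
                candidates.append((score_thought(nxt), nxt))  # evaluate each branch
        # Prune: keep only the most promising branches. Backtracking happens
        # implicitly when a previously favored branch falls out of the beam.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]

print(tree_of_thoughts("What is 17 x 24?"))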

The critical shift came in late 2024: instead of prompting models to reason, researchers began training them to reason natively through reinforcement learning. OpenAI's o1-preview (September 2024) was the first commercial model with internalized chain-of-thought, and DeepSeek-R1 (January 2025) proved it could be done in the open.

The mechanics of test-time compute

For a decade, the "scaling laws" of AI referred to training compute — make the model bigger, train on more data, get better results. Around 2024, this approach hit diminishing returns as high-quality training data became scarce. Snell et al. ("Scaling LLM Test-Time Compute Optimally," August 2024) demonstrated a different scaling axis: performance improves logarithmically with the amount of compute spent at inference time.

\text{Performance} \propto \log(\text{Test-Time Compute})

In Plain English: Just as a human gives a better answer with 60 seconds to think rather than 5, a reasoning model gets smarter with more "thinking tokens." This relationship is logarithmic — the first 1,000 thinking tokens help enormously, but going from 10,000 to 11,000 helps less. The implication is profound: a smaller model that thinks for 10 seconds can outperform a massive model that answers instantly.

This created a new design trade-off. Previously, intelligence was determined entirely by model size. Now intelligence is a function of both model capability and inference budget. A developer can dial a "thinking knob" — more thinking tokens buys more accuracy at the cost of latency and compute.
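
A toy curve makes the diminishing returns easy to see. The coefficients below are invented to illustrate the logarithmic shape, not fitted to any real model or benchmark.

python
import math

def toy_accuracy(thinking_tokens, base=0.40, gain=0.055, ceiling=0.97):
    # Illustrative log-scaling curve: accuracy grows with log(compute), then saturates.
    return min(ceiling, base + gain * math.log(thinking_tokens))

for budget in [100, 1_000, 10_000, 11_000, 100_000]:
    print(f"{budget:>7} thinking tokens -> ~{toy_accuracy(budget):.1%} estimated accuracy")

With these made-up coefficients, going from 1,000 to 10,000 thinking tokens buys roughly 13 points of accuracy, while going from 10,000 to 11,000 buys about half a point: the same diminishing-returns pattern described above.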

How reasoning models are trained

The architecture of a reasoning system typically involves three components: a Generator (the LLM proposing the next reasoning step), a Verifier (a reward model evaluating whether each step is correct), and a Search Strategy (the algorithm exploring the tree of possible thoughts).
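
Wired together, the three components form a short loop. In this sketch the generator and verifier are placeholder functions (in a real system they would be an LLM and a process reward model), and the search strategy is the simplest possible one: greedily commit to the best-scoring step at each turn.

python
def generate_candidates(problem, steps_so_far, k=3):
    # Generator: placeholder for an LLM proposing k candidate next reasoning steps.
    return [f"step {len(steps_so_far) + 1}, variant {i}" for i in range(k)]

def verify_step(problem, steps_so_far, candidate):
    # Verifier: placeholder for a process reward model scoring one step in [0, 1].
    return 1.0 - 0.1 * int(candidate[-1])  # dummy score so the example runs

def solve(problem, max_steps=4):
    # Search strategy: greedy step-by-step selection guided by the verifier.
    steps = []
    for _ in range(max_steps):
        candidates = generate_candidates(problem, steps)
        best = max(candidates, key=lambda c: verify_step(problem, steps, c))
        steps.append(best)
    return steps

print(solve("Prove that the sum of two even integers is even."))

Swapping the greedy loop for beam search or Monte Carlo tree search changes only the solve function; the generator/verifier contract stays the same.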

Process reward models vs. outcome reward models

Traditional RLHF uses Outcome Reward Models (ORMs) — one binary signal at the end: "Was the final answer correct?" This is sparse feedback. In a 50-step math proof, the model gets a single thumbs-up or thumbs-down, with no information about where the reasoning went wrong.

Process Reward Models (PRMs) evaluate every step. Lightman et al. ("Let's Verify Step by Step," OpenAI, May 2023) demonstrated that process supervision significantly outperforms outcome supervision, with their PRM solving 78.2% of a representative MATH subset. They released PRM800K — 800,000 step-level human feedback labels — which became foundational for the reasoning models that followed.

R_{\text{total}} = \sum_{t=1}^{T} \gamma^t \cdot r(s_t)

In Plain English: The total reward sums up the reward for each individual step, discounted over time. A step that's correct early matters more than a correct step late in the chain. This forces the model to get every step right, not just gamble on the final answer.
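
A direct reading of the formula in code, with an arbitrary discount factor and invented step rewards standing in for a process reward model's judgments:

python
# Discounted process reward: R_total = sum over t of gamma^t * r(s_t), t = 1..T.
gamma = 0.95
step_rewards = [1.0, 1.0, 0.0, 1.0, 1.0]  # third step judged incorrect by the PRM

r_total = sum(gamma**t * r for t, r in enumerate(step_rewards, start=1))
print(f"Discounted reward with one bad step: {r_total:.3f}")

# Every step correct earns strictly more, and earlier steps carry more weight:
perfect = sum(gamma**t for t in range(1, len(step_rewards) + 1))
print(f"Discounted reward, perfect chain:    {perfect:.3f}")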

Key Insight: Recent findings at ICLR 2025 complicate this picture — discriminative ORMs can match discriminative PRMs across 14 diverse domains. However, process advantage verifiers remain 1.5-5x more compute-efficient and enable 6x gains in sample efficiency for online RL. The debate is shifting from "PRM vs. ORM" to "when and how to combine both."

DeepSeek-R1: reasoning from reinforcement learning

DeepSeek-R1 (released January 20, 2025, published in Nature) provided the first fully open account of how to train a reasoning model. Its training used Group Relative Policy Optimization (GRPO) — an RL algorithm that eliminates the need for a separate critic/value model by using normalized rewards across different generations of the same prompt as the advantage baseline. GRPO is more computationally efficient than PPO (Proximal Policy Optimization) and operates at the sequence level rather than the token level.
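
The heart of GRPO is the advantage computation: sample a group of completions for the same prompt, score them, and normalize each reward against the group's own mean and standard deviation. The sketch below shows only that step, with invented binary rewards; it is not a full RL training loop.

python
import statistics

def grpo_advantages(group_rewards):
    # Group-relative advantage: (reward - group mean) / group standard deviation.
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in group_rewards]

# Rewards for six sampled completions of one prompt (1.0 = verified correct answer)
rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
for i, adv in enumerate(grpo_advantages(rewards)):
    print(f"completion {i}: reward={rewards[i]:.0f}, advantage={adv:+.2f}")

Because the baseline comes from the group itself, no separate value network has to be trained or held in memory, which is where the efficiency gain over PPO comes from.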

The most remarkable finding came from DeepSeek-R1-Zero, trained via pure RL without any supervised fine-tuning. The model spontaneously developed self-reflection, self-verification, extended chain-of-thought, and what the authors described as an "aha moment" — recognizing and correcting its own mistakes mid-reasoning. R1-Zero used Reinforcement Learning with Verifiable Rewards (RLVR): accuracy rewards from ground-truth checkers (compilers, math solvers) and format rewards for output structure. No neural reward models were used — a deliberate choice to avoid reward hacking at scale.
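
A sketch of what rule-based verifiable rewards can look like. The paper describes accuracy rewards and format rewards only at a high level; the <think> tag check, the exact-match comparison, and the weighting below are illustrative assumptions, not DeepSeek's published reward function.

python
import re

def format_reward(completion: str) -> float:
    # Reward the expected structure: reasoning enclosed in <think>...</think> tags.
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Reward a verifiably correct final answer (here: a simple exact-string match).
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == ground_truth else 0.0

def rlvr_reward(completion: str, ground_truth: str) -> float:
    # Illustrative weighting; the real coefficients are not public.
    return accuracy_reward(completion, ground_truth) + 0.5 * format_reward(completion)

sample = "<think>17 x 24 = 17 x 20 + 17 x 4 = 340 + 68 = 408</think>408"
print(rlvr_reward(sample, "408"))  # accuracy + format -> 1.5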

The full DeepSeek-R1 pipeline added four stages: (1) cold-start SFT with curated chain-of-thought examples, (2) GRPO training with verifiable rewards, (3) rejection sampling to generate high-quality reasoning data, and (4) final SFT + RL rounds for alignment.

The reasoning model landscape in 2026

Every major AI lab now ships reasoning models. The table below captures the current frontier as of February 2026.

| Model | Lab | Released | Key Feature | Pricing (per M tokens) |
| --- | --- | --- | --- | --- |
| o3 | OpenAI | Apr 2025 | Hidden CoT, high compute mode | $2 in / $8 out |
| o4-mini | OpenAI | Apr 2025 | Cost-efficient reasoning | $1.10 in / $4.40 out |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B MoE, MIT license, open weights | Open-source |
| DeepSeek-R1-0528 | DeepSeek | May 2025 | Major R1 update | Open-source |
| Claude Opus 4.6 | Anthropic | Feb 2026 | Adaptive thinking, 1M context | $15 in / $75 out |
| Gemini 2.5 Pro | Google | Mar 2025 | Native thinking, Deep Think mode | $1.25 in / $10 out |
| QwQ-32B | Alibaba | Mar 2025 | 32B params matches 671B R1, Apache 2.0 | Open-source |
| Grok 3 Think | xAI | Feb 2025 | 10x compute vs. Grok 2 | Premium+ subscription |
| Qwen3-235B | Alibaba | Apr 2025 | Unified thinking/non-thinking modes | Open-source |

Benchmark performance across the frontier

Reasoning models are evaluated on progressively harder benchmarks. The numbers below represent the state of the art as of February 2026.

Mathematical reasoning

| Model | AIME 2024 | AIME 2025 | MATH-500 |
| --- | --- | --- | --- |
| o3 | 96.7% | 88.9% | — |
| o4-mini | 91.7% | 92.7% | — |
| DeepSeek-R1 | 79.8% | — | 97.3% |
| DeepSeek-R1-0528 | 91.4% | 87.5% | — |
| Gemini 2.5 Pro | 92.0% | 86.7% | — |
| Claude Sonnet 4.5 | — | 87.0% | — |
| Grok 3 Think (cons@64) | 93.3% | — | — |

Graduate-level science (GPQA Diamond) and code

| Model | GPQA Diamond | SWE-bench Verified | Codeforces Elo |
| --- | --- | --- | --- |
| Gemini 2.5 Pro | 84.0% | — | — |
| o3 | 83.3-87.7% | 69.1% | 2706 |
| o4-mini | 81.4% | 68.1% | 2719 |
| Claude Opus 4.6 (Thinking) | 80.8% | — | — |
| Claude Sonnet 4.5 | 83.4% | — | — |
| Grok 3 | 84.6% | — | — |
| DeepSeek-R1 | 71.5% | ~49% | 2029 |

The hardest benchmarks

Humanity's Last Exam (2,500 expert-level questions across 100+ subjects, created by Center for AI Safety and Scale AI): Gemini 3 Pro Preview leads at 37.5%, followed by Claude Opus 4.6 at 36.7% — up from single digits in early 2025. Human experts score ~90%.

FrontierMath (350 research-level math problems by Epoch AI): o4-mini leads at 17%, o3 at 10%. Most non-reasoning models score ~0%.

GPQA Diamond (graduate-level science): Gemini 3 Pro Deep Think leads at 93.8%, a significant jump over the next tier of models (Grok 3 at 84.6%, Gemini 2.5 Pro at 84.0%, Claude Sonnet 4.5 at 83.4%).

ARC-AGI-2 (novel visual reasoning): o3 scores just 4% at low compute. Gemini 3 Deep Think reaches 45.1%. Humans score 100%.

Distilling reasoning into smaller models

One of DeepSeek-R1's most impactful contributions was demonstrating that reasoning ability can be distilled from a 671B parameter teacher model into dramatically smaller student models. All six distilled models were released under MIT license on January 20, 2025.

| Distilled Model | Base | Params | AIME 2024 | MATH-500 | GPQA Diamond |
| --- | --- | --- | --- | --- | --- |
| R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 1.5B | — | 83.9% | — |
| R1-Distill-Qwen-7B | Qwen2.5-7B | 7B | 55.5% | 92.8% | 49.1% |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | 69.7% | 93.9% | 59.1% |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | 72.6% | 94.3% | 62.1% |
| R1-Distill-Llama-70B | Llama-3.3-70B | 70B | 70.0% | 94.5% | 65.2% |

The results are striking: R1-Distill-Qwen-7B — a 7 billion parameter model — surpassed QwQ-32B-Preview. R1-Distill-Qwen-32B outperformed OpenAI's o1-mini across multiple benchmarks. This catalyzed an explosion of open-source reasoning models: Sky-T1-32B-Preview (January 2025, trained for under $450, matching o1-preview), the OpenThinker series (OpenThoughts-114k dataset, progressing from 7B to 32B models), and QwQ-32B (March 2025, matching DeepSeek-R1 at 32B parameters under Apache 2.0 license).

The implication: reasoning-capable models now run on consumer hardware. A quantized R1-Distill-Qwen-14B fits in 16GB of VRAM and outperforms models 50x its size from a year earlier.
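
Mechanically, this kind of distillation is supervised fine-tuning on teacher-generated traces. The sketch below shows only the data-preparation step, assuming a dataset of (prompt, teacher reasoning, answer) triples and a simple <think>-tag template; the tokenizer name is an example, and the formatting does not reproduce DeepSeek's exact recipe.

python
from transformers import AutoTokenizer

# Hypothetical distillation examples: prompt, teacher reasoning trace, final answer
examples = [
    {
        "prompt": "What is 17 x 24?",
        "reasoning": "17 x 24 = 17 x 20 + 17 x 4 = 340 + 68 = 408",
        "answer": "408",
    },
]

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")  # the student's tokenizer

def to_training_example(ex):
    # Concatenate prompt + trace + answer; mask the prompt out of the LM loss.
    prompt_ids = tokenizer(ex["prompt"] + "\n").input_ids
    target_ids = tokenizer(
        "<think>" + ex["reasoning"] + "</think>" + ex["answer"] + tokenizer.eos_token
    ).input_ids
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [-100] * len(prompt_ids) + target_ids,  # -100 is ignored by the loss
    }

batch = [to_training_example(ex) for ex in examples]
print(len(batch[0]["input_ids"]), "tokens in the first training example")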

Controlling the thinking budget in production

Modern inference APIs expose direct control over how much a model reasons. This is the "thinking budget" — the maximum number of tokens the model can spend on its internal chain-of-thought before it must produce a visible answer.

OpenAI uses a reasoning_effort parameter with three levels (low, medium, high) for o3, o4-mini, and o3-mini. Moving from low to high typically raises accuracy by 10-30% on hard benchmarks.

Anthropic introduced budget_tokens with Claude 3.7 Sonnet's extended thinking (February 2025), allowing 1,024 to 128,000 thinking tokens. Claude Opus 4.6 (February 2026) introduced adaptive thinking — the model dynamically decides how much to reason, with four effort levels: low, medium, high (default), and max.

Google uses thinkingBudget for Gemini 2.5 models (0 to 24,576 tokens, or -1 for dynamic) and thinkingLevel for Gemini 3 models.

Common Pitfall: The max_tokens parameter now includes hidden reasoning tokens. If you set a limit of 4,096 tokens and the model uses 3,500 to think, only 596 remain for the visible answer. Developers who truncate output limits to save money often abort the thought process mid-stream, getting a garbled or empty response instead of a cheaper one.
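
Two hedged examples of what these knobs look like in code, using the OpenAI and Anthropic Python SDKs. The model identifiers and token numbers are illustrative; check each provider's current documentation. The key detail in the Anthropic call is that max_tokens covers thinking plus the visible answer, so it must leave headroom above budget_tokens.

python
# OpenAI: reasoning effort on an o-series model (Chat Completions API)
from openai import OpenAI

openai_client = OpenAI()
resp = openai_client.chat.completions.create(
    model="o4-mini",                      # illustrative model id
    reasoning_effort="high",              # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
)
print(resp.choices[0].message.content)

# Anthropic: explicit thinking budget with extended thinking
import anthropic

anthropic_client = anthropic.Anthropic()
msg = anthropic_client.messages.create(
    model="claude-sonnet-4-5",            # illustrative model id
    max_tokens=16000,                     # thinking tokens + visible answer
    thinking={"type": "enabled", "budget_tokens": 8000},  # leaves ~8k for the answer
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
)
print(next(block.text for block in msg.content if block.type == "text"))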

Self-consistency: reasoning without a reasoning model

You don't need a dedicated reasoning model to get reasoning-like behavior. Self-consistency — generating multiple answers and taking the majority vote — is a powerful technique that works with any LLM.

python
from collections import Counter

# Five reasoning paths for "What is 17 x 24?"
# Simulating a standard LLM generating five independent solutions
# Three paths compute correctly, two make arithmetic errors
path_answers = [408, 408, 418, 408, 398]

vote_counts = Counter(path_answers)
majority_answer = vote_counts.most_common(1)[0][0]
confidence = vote_counts[majority_answer] / len(path_answers)

print(f"Path results: {path_answers}")
print(f"Vote counts: {dict(vote_counts)}")
print(f"Majority answer: {majority_answer}")
print(f"Confidence: {confidence:.0%}")
print(f"Correct answer: {17 * 24}")

# Output:
# Path results: [408, 408, 418, 408, 398]
# Vote counts: {408: 3, 418: 1, 398: 1}
# Majority answer: 408
# Confidence: 60%
# Correct answer: 408

Wang et al. showed this approach improves GSM8K accuracy by +17.9% over a single greedy generation. In production, combining self-consistency with a Best-of-N selection using a separate reward model (inference-time rejection sampling) can approximate reasoning model performance at lower cost — though at higher latency.

The economics of reasoning

Reasoning models consume 5-100x more tokens per request than standard models. Even after OpenAI's 80% price cut in June 2025 (o3 dropped from $10 in / $40 out to $2 in / $8 out per million tokens), a complex code review with o3 can still cost 5-10x more than the same query to GPT-4o — because o3 generates far more output tokens due to its hidden thinking chain.

The cost-effective strategy for most applications is a hybrid architecture, sketched in code below:

  1. Router: A fast, cheap model (GPT-4o-mini, Gemini 2.5 Flash) classifies the query complexity.
  2. Simple queries route to the fast model for immediate response.
  3. Complex queries route to a reasoning model (o4-mini at $4.40/M output, or DeepSeek-R1 self-hosted) asynchronously.
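
A minimal sketch of that routing pattern. The complexity classifier here is a keyword heuristic standing in for a call to a small, cheap model, and the model names are placeholders rather than real identifiers.

python
def classify_complexity(query: str) -> str:
    # Placeholder for a cheap classifier call; a real router would ask a small,
    # fast model to label the query. A keyword heuristic keeps the sketch runnable.
    hard_signals = ("prove", "debug", "optimize", "step by step", "plan")
    return "complex" if any(s in query.lower() for s in hard_signals) else "simple"

def route(query: str) -> dict:
    # Simple queries go to a fast model; complex ones get a reasoning model
    # with a generous thinking budget (handled asynchronously in production).
    if classify_complexity(query) == "simple":
        return {"model": "fast-chat-model", "reasoning_effort": None}
    return {"model": "reasoning-model", "reasoning_effort": "high"}

for q in ["What's the capital of France?",
          "Prove that the square root of 2 is irrational, step by step."]:
    print(q, "->", route(q))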

o4-mini deserves special attention: at roughly 10x cheaper than o3, it matches or beats o3 on AIME 2025 (92.7% vs. 88.9%) and Codeforces (2719 vs. 2706 Elo). For most production use cases, o4-mini is the sweet spot between cost and capability.

Limitations and open problems

The overthinking problem

Reasoning models can perform worse on simple tasks. A survey titled "Stop Overthinking" found that models generate "excessively detailed or unnecessarily elaborate reasoning steps" even when they arrive at the correct answer early. Asking o3 "What is the capital of France?" wastes tokens on deliberation that adds nothing. Adaptive thinking systems (Claude Opus 4.6, Gemini's dynamic budget) are an explicit response to this problem — letting the model skip reasoning when it isn't needed.

Chain-of-thought faithfulness

Anthropic's Alignment Science team published "Reasoning Models Don't Always Say What They Think" (Chen et al., May 2025). They found that Claude 3.7 Sonnet mentioned inserted hints in its reasoning chain only 25% of the time, and DeepSeek-R1 only 39% of the time. The model's visible chain-of-thought is not a reliable transcript of its actual computational process — it is a post-hoc rationalization that may omit key factors. This has profound implications for AI safety: you cannot fully monitor a model's behavior by reading its reasoning trace.

When NOT to use reasoning models

Reasoning models are the wrong tool for: factual retrieval and Q&A (where a standard model with RAG is faster and cheaper), text transformation tasks (summarization, translation, formatting), entity extraction and classification, creative writing, and any latency-sensitive application requiring sub-second responses. Reserve reasoning models for multi-step math, complex code generation, scientific reasoning, multi-step planning, and tasks where correctness is worth the 10-30 second wait.

The road from CoT prompting to adaptive thinking

| Date | Milestone |
| --- | --- |
| Jan 2022 | Wei et al. publish Chain-of-Thought prompting (Google) |
| Mar 2022 | Wang et al. publish Self-Consistency (ICLR 2023) |
| May 2022 | Kojima et al. publish zero-shot CoT ("Let's think step by step") |
| May 2023 | Lightman et al. publish "Let's Verify Step by Step" / PRM800K (OpenAI) |
| May 2023 | Yao et al. publish Tree of Thoughts (Princeton/DeepMind, NeurIPS 2023) |
| Aug 2024 | Snell et al. publish test-time compute scaling laws |
| Sep 2024 | OpenAI releases o1-preview — first commercial reasoning model |
| Dec 2024 | Gemini 2.0 Flash Thinking Experimental; o3 previewed (ARC-AGI 87.5%) |
| Jan 2025 | DeepSeek-R1 released (671B MoE, MIT license) + 6 distilled models |
| Feb 2025 | Claude 3.7 Sonnet introduces extended thinking; Grok 3 Think released |
| Mar 2025 | QwQ-32B matches R1 at 32B params; Gemini 2.5 Pro with native thinking |
| Apr 2025 | o3 and o4-mini released; Qwen3 adds unified thinking/non-thinking modes |
| May 2025 | Claude Opus 4 and Sonnet 4; DeepSeek-R1-0528 (AIME 2024: 91.4%) |
| Sep 2025 | Claude Sonnet 4.5 (AIME 2025: 87% without tools) |
| Feb 2026 | Claude Opus 4.6 introduces adaptive thinking (model decides when to reason) |

The trajectory is clear: reasoning evolved from an external prompting trick (2022) to an internalized capability trained via RL (2024), and is now converging on adaptive reasoning — models that dynamically decide whether and how much to think, eliminating the overhead of reasoning on trivial queries while preserving depth on hard problems.

Conclusion

Reasoning models represent the maturation of generative AI from pattern matching to genuine problem-solving. Through chain-of-thought training, process reward models, and test-time compute scaling, models like o3, DeepSeek-R1, and Claude Opus 4.6 have internalized the trial-and-error process that humans use to solve hard problems — trading latency for accuracy in a trade-off that is increasingly favorable as inference costs fall.

The most effective AI engineers in 2026 are not just prompt engineers — they are reasoning architects, designing the verification loops, thinking budgets, and model routing strategies that determine when a system should think fast and when it should think deep.

To understand the foundation these reasoning models build on, start with How Large Language Models Actually Work. For optimizing the context fed to reasoning engines, see Context Engineering: From Prompts to Production. To combine reasoning with external knowledge retrieval, explore Retrieval-Augmented Generation (RAG). And for the vector representations that power semantic search in these systems, read Text Embeddings: The Foundation of Semantic Search.