
Advanced Prompt Engineering: Chain-of-Thought, ReAct, and Structured Outputs

LDS Team
Let's Data Science

Ask a basic question and any LLM gets it right. Ask it to extract structured data from a messy clinical note, reason through a multi-step diagnostic problem, or autonomously gather information and synthesize a report — and the gap between "demo mode" and "production system" becomes obvious fast.

Basic prompting (give the task, get the answer) covers roughly 60% of what's possible. The remaining 40% comes from techniques that change how the model reasons, what format it returns, and how reliably it does both. This article covers the techniques that actually move production metrics: chain-of-thought, self-consistency, ReAct, structured outputs, system prompt engineering, prompt caching, and extended thinking — including how the field has shifted toward "context engineering" for agentic systems.

The running example throughout: a medical record analysis assistant that extracts structured patient data from unstructured clinical notes. It's a realistic, complex task that shows exactly why each technique matters.

A Framework for Choosing Techniques

Advanced prompting techniques solve different problems. Before reaching for any of them, identify which problem you actually have.

| Problem | Technique | Typical Gain |
| --- | --- | --- |
| Model skips steps, makes reasoning errors | Chain-of-Thought | 10–40% accuracy lift on multi-step tasks |
| Single output is unreliable, needs confidence | Self-Consistency | 12–18% accuracy lift on reasoning tasks |
| Model needs real-time data or tool use | ReAct | Enables tasks impossible from context alone |
| Output format is inconsistent | Structured Outputs | >99% schema adherence vs ~82% with JSON mode |
| Token costs too high for large contexts | Prompt Caching | Up to 90% cost reduction |
| Model ignores key instructions | System Prompt Engineering | Substantial compliance improvement |
| Complex reasoning in agentic loops | Extended Thinking / Think Tool | More reliable multi-step decisions |

No single technique is universally better. Self-consistency costs 3–5x tokens. ReAct requires tool infrastructure. Structured outputs require schema definitions. Extended thinking incurs reasoning token costs. Use what the problem actually demands.

Figure: Prompt engineering techniques organized by the problem they solve

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting is the practice of instructing a model to reason through a problem step by step before producing a final answer. Introduced in Wei et al. (2022), CoT consistently improves accuracy on tasks requiring arithmetic, logic, and multi-step planning — the kinds of tasks where "just answer it" produces plausible-sounding errors.

There are two forms, and the choice between them matters.

Zero-Shot CoT

Zero-shot CoT appends a simple instruction to trigger step-by-step reasoning. The classic version is "Let's think step by step." Kojima et al. (2022) found that this single phrase improved accuracy with InstructGPT (text-davinci-002) from 17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K. By 2026, gains persist on reasoning-heavy tasks even with frontier models, though models like Claude Opus 4.6 and GPT-5 have reasoning built in, which makes explicit CoT instructions more of a fine adjustment than a fundamental unlock.

Applied to the medical record example:

code
System: You are a medical record analyst.

User: Extract the primary diagnosis from this clinical note:
"Pt presents with c/o chest pain x3 days, worse on exertion,
relieved by rest. EKG shows ST depression in V4-V5. Troponin 0.8.
Plan: admit for rule out ACS."

Let's think step by step.

The model now walks through: what symptoms are present, what lab values indicate, what the plan suggests — and arrives at "Acute Coronary Syndrome (rule-out)" rather than guessing from the word "chest pain" alone.

Key Insight: Zero-shot CoT works because it allocates more tokens to intermediate reasoning. The model isn't "smarter" — it's given space to work. The reasoning tokens themselves become the workspace.
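A minimal sketch of how the trigger might be wired into an application. It assumes a convention, not part of any provider's API, that the model is asked to end its reasoning with a line starting with `Final answer:`. The helper names `build_zero_shot_cot_prompt` and `extract_final_answer` are hypothetical, and the sample response is hand-written for illustration.

```python
from typing import Optional

COT_TRIGGER = "Let's think step by step."

def build_zero_shot_cot_prompt(task: str, note: str) -> str:
    """Place the zero-shot CoT trigger after the task and the note."""
    return (
        f"{task}\n"
        f'"{note}"\n\n'
        f"{COT_TRIGGER}\n"
        "End with a line of the form 'Final answer: <diagnosis>'."
    )

def extract_final_answer(response_text: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, if present."""
    for line in reversed(response_text.splitlines()):
        if line.strip().lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return None

prompt = build_zero_shot_cot_prompt(
    "Extract the primary diagnosis from this clinical note:",
    "Pt presents with c/o chest pain x3 days, worse on exertion.",
)

# Hand-written stand-in for a model response (assumption, not real output).
sample_response = (
    "Symptoms: exertional chest pain.\n"
    "Plan: admit for rule out ACS.\n"
    "Final answer: Acute Coronary Syndrome (rule-out)"
)
print(extract_final_answer(sample_response))
```

Separating the reasoning from a marked final line keeps the extra tokens useful to the model while letting downstream code read only the answer.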

Few-Shot CoT

Few-shot CoT provides 2–6 demonstrations where each example includes the question, the step-by-step reasoning trace, and the final answer. The model learns the expected reasoning pattern from those examples and applies it to the new input.

code
User: Analyze clinical note: "65yo M with DM2, hypertension. Presents with
increased thirst, polyuria x2 weeks. FBG 312, HbA1c 11.2%. Metformin not
achieving control."

Reasoning:
- Patient has DM2 (established diagnosis)
- New symptoms: polydipsia, polyuria = hyperglycemia symptoms
- FBG 312 mg/dL = well above normal (70–100 mg/dL)
- HbA1c 11.2% = severe uncontrolled diabetes
- Current therapy (Metformin) failing to achieve targets

Conclusion: Uncontrolled Type 2 Diabetes Mellitus. Medication escalation indicated.

---
Now analyze: "45yo F, non-smoker. Productive cough x3 weeks, low-grade fever,
night sweats. CXR shows right upper lobe infiltrate. PPD positive."

The model follows the established pattern: enumerate findings, assess significance, reach a clinical conclusion.

Example selection matters more than example count. Production systems using dynamic example selection — retrieving the most semantically similar examples from a pool for each new input — outperform static 5-shot sets on precision while using fewer tokens. Start with 3–5 examples. Add more only when you observe specific failure modes.
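Dynamic example selection can be sketched in a few lines. The version below assumes word-overlap (Jaccard) similarity as a stand-in for the embedding similarity a production system would more likely use, and the example pool is invented for illustration.

```python
from typing import Dict, List, Set

def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of rough word tokens."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two texts, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def select_examples(pool: List[Dict[str, str]], query: str, k: int = 3) -> List[Dict[str, str]]:
    """Return the k pool examples whose notes are most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(ex["note"], query), reverse=True)
    return ranked[:k]

# Hypothetical few-shot pool; a real pool would hold full reasoning traces.
EXAMPLE_POOL = [
    {"note": "65yo M with DM2, increased thirst, polyuria, HbA1c 11.2%",
     "conclusion": "Uncontrolled Type 2 Diabetes Mellitus"},
    {"note": "Chest pain on exertion, ST depression, troponin elevated",
     "conclusion": "Acute Coronary Syndrome"},
    {"note": "Productive cough, fever, CXR infiltrate",
     "conclusion": "Pneumonia (suspected)"},
]

query = "45yo F, productive cough x3 weeks, low-grade fever, CXR shows infiltrate"
best = select_examples(EXAMPLE_POOL, query, k=1)
print(best[0]["conclusion"])
```

Only the selected examples go into the prompt, which is where the token savings over a static 5-shot set come from.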

When CoT Helps and When It Hurts

CoT is not always the right call. For simple factual retrieval ("What is the patient's date of birth?"), chain-of-thought adds tokens without improving accuracy — and can actually introduce errors as the model reasons its way to a wrong conclusion.

| Use CoT | Skip CoT |
| --- | --- |
| Multi-step arithmetic or logic | Direct lookup from text |
| Causal reasoning ("why did X lead to Y") | Classification with obvious signals |
| Planning tasks with dependencies | Single-field extraction |
| Ambiguous cases requiring differential diagnosis | Questions with one clear answer |

Figure: Chain-of-thought vs. direct answer: before and after comparison showing reasoning steps

Self-Consistency: Majority Vote Over Reasoning Paths

Self-consistency (Wang et al., 2022) samples multiple independent reasoning paths for the same question, then takes a majority vote on the final answer. The key insight: even when individual reasoning chains contain errors, the correct answer tends to appear more often than any single wrong answer.

The accuracy gains are well-documented. On GSM8K, self-consistency improves over single-sample CoT by 17.9 percentage points. On SVAMP (math word problems), the gain is 11.0 points. On AQuA, 12.2 points.

Here's what that looks like in practice for our medical record assistant. Given an ambiguous note, you call the model 5 times with temperature=0.7 and aggregate:

code
Final answer: 42
Confidence: 60% (3/5 votes)

For the medical record assistant, "responses" would be diagnosis strings. "Acute Coronary Syndrome" appearing in 4 of 5 traces gives far more confidence than a single output ever could.
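The aggregation step itself is small. Here is a minimal sketch, assuming the final answers have already been extracted from five sampled traces into a list called `responses`; the model calls are omitted and the sample values are illustrative.

```python
from collections import Counter

def majority_vote(responses):
    """Return (winning answer, votes for it, total samples)."""
    counts = Counter(r.strip() for r in responses)
    answer, votes = counts.most_common(1)[0]
    return answer, votes, len(responses)

# Illustrative final answers from five sampled reasoning traces.
responses = ["42", "42", "41", "42", "40"]

answer, votes, total = majority_vote(responses)
print(f"Final answer: {answer}")
print(f"Confidence: {votes / total:.0%} ({votes}/{total} votes)")
```

For free-text answers such as diagnosis strings, normalize them (case, synonyms, abbreviations) before counting, or near-identical answers will split the vote. The vote share is an agreement signal rather than a calibrated probability.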

Pro Tip: Set temperature between 0.5 and 0.8 for self-consistency sampling. Lower temperatures produce near-identical traces (no diversity to vote over). Higher temperatures introduce noise that drowns out the correct answer.

The cost question. Self-consistency multiplies your token bill by the number of samples — typically 3 to 5x. This is worth it when: (a) the task has high-stakes consequences (medical, legal, financial), (b) you need a calibrated confidence score alongside the answer, or (c) a single wrong output has downstream consequences that are expensive to fix. For bulk processing of low-stakes classifications, it's overkill.

Research published in 2025 introduced Confidence-Informed Self-Consistency (CISC), which weights votes by per-sample confidence scores. CISC reportedly matches the accuracy of standard 18-sample majority voting using only 10 samples, which is about 44% fewer samples for equivalent accuracy.
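The core idea of confidence-weighted voting can be illustrated as follows. This is a simplified sketch, not the paper's procedure: it assumes each sample already comes with a confidence score between 0 and 1 and simply sums those scores per answer, without reproducing how CISC extracts or normalizes confidence.

```python
from collections import defaultdict

def weighted_vote(samples):
    """samples: list of (answer, confidence) pairs; returns the top answer."""
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)

# Illustrative samples: (diagnosis, self-reported confidence).
samples = [
    ("NSTEMI", 0.9),
    ("Unstable angina", 0.4),
    ("Unstable angina", 0.3),
    ("NSTEMI", 0.8),
    ("GERD", 0.2),
]
print(weighted_vote(samples))
```

Weighting lets one high-confidence trace outvote several low-confidence ones, which is why fewer samples can reach the same accuracy as plain majority voting.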

ReAct: Reason, Then Act

ReAct (Yao et al., 2022) combines chain-of-thought reasoning with tool-use actions in an interleaved loop. Rather than reasoning from static context alone, a ReAct agent cycles through Thought → Action → Observation until the task is complete.

The loop works like this:

  1. Thought: The model reasons about what it knows and what it needs
  2. Action: The model calls a tool (search, database query, calculator, API)
  3. Observation: The tool result is fed back into context
  4. Repeat until the model determines the task is complete

A ReAct trace for the medical record assistant might look like this:

code
Thought: The note mentions "Troponin 0.8" but I need to know the normal range
to assess clinical significance. I should look this up.

Action: search("troponin normal range ACS threshold")

Observation: Normal troponin I < 0.04 ng/mL. Value >0.04 is elevated.
ACS is typically indicated at >0.1 ng/mL.

Thought: Troponin of 0.8 ng/mL is 20x the upper limit of normal and 8x the ACS
threshold. Combined with ST depression and chest pain on exertion, this strongly
indicates NSTEMI.

Action: search("NSTEMI vs unstable angina troponin differentiation")

Observation: Troponin elevation distinguishes NSTEMI from unstable angina.
Elevated troponin = NSTEMI; normal troponin with ischemia symptoms = UA.

Thought: All evidence points to NSTEMI. I have enough to answer.

Final Answer: Non-ST-elevation myocardial infarction (NSTEMI). Troponin 0.8 ng/mL
(8x the ACS threshold), ST depression V4-V5, exertional chest pain all consistent.

This is why ReAct is the foundation of modern AI agents. The pattern underlies frameworks like LangChain's agent executor and the tool-calling implementations in the Claude Agent SDK, and it is covered in more depth in Building AI Agents: ReAct, Planning, and Tool Use.

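In the original paper, ReAct outperformed chain-of-thought on FEVER (fact verification) and was slightly behind it on HotpotQA (multi-hop question answering), where the strongest results came from combining ReAct with CoT self-consistency. Its main advantage was grounding: it could retrieve information that wasn't in the prompt context, which reduced hallucinated facts. The authors also reported that ReAct produced more interpretable reasoning traces than baselines without reasoning steps.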

Figure: ReAct loop showing Thought, Action, Observation cycle with real medical record example

Key Insight: ReAct's power isn't just tool access — it's that the model can decide what to look up based on intermediate reasoning. A static RAG retrieval fetches documents once, upfront. ReAct can search again based on what the first search revealed.

The mechanism behind ReAct agents is the model's tool-calling capability, covered in Function Calling and Tool Use for AI Agents. The model generates a structured tool call, your application executes it, and the result gets appended to the conversation. The loop is multiple turns of that cycle.
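The loop can be sketched without any provider SDK. In the sketch below, `scripted_model` is a hypothetical stand-in for an LLM call and `search` is a stub tool backed by a one-entry dictionary; both are assumptions made so the control flow can run on its own.

```python
def search(query):
    """Stub tool: a real system would call a search API or a database."""
    knowledge = {"troponin normal range": "Normal troponin I < 0.04 ng/mL."}
    return knowledge.get(query, "No results.")

TOOLS = {"search": search}

def scripted_model(history):
    """Stand-in for an LLM: act first, answer once an observation exists."""
    if not any(step["type"] == "observation" for step in history):
        return {"type": "action", "tool": "search", "input": "troponin normal range"}
    return {"type": "final", "answer": "Troponin 0.8 ng/mL is elevated."}

def react_loop(model, question, max_steps=5):
    """Alternate model steps and tool calls until a final answer appears."""
    history = [{"type": "question", "content": question}]
    for _ in range(max_steps):
        step = model(history)
        if step["type"] == "final":
            return step["answer"], history
        observation = TOOLS[step["tool"]](step["input"])
        history.append(step)
        history.append({"type": "observation", "content": observation})
    raise RuntimeError("No final answer within max_steps")

answer, trace = react_loop(scripted_model, "Is troponin 0.8 significant?")
print(answer)
```

The `max_steps` cap matters in practice: without it, a model that never decides it is finished keeps calling tools indefinitely.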

Extended Thinking and the Think Tool

Extended thinking is Anthropic's API-level answer to chain-of-thought. Rather than relying on prompt instructions to trigger reasoning, extended thinking gives Claude a dedicated scratchpad of "thinking tokens" that are generated before the final response. These tokens are returned in a separate thinking block rather than in the answer text, so you can choose not to show them to users, but they are still billed as output tokens.

The distinction from traditional CoT prompting matters in practice:

| | Traditional CoT | Extended Thinking |
| --- | --- | --- |
| Trigger | Prompt instruction ("think step by step") | API parameter (thinking: {type: "enabled", budget_tokens: N}) |
| Visibility | In the response text | Separate thinking block |
| Cost | Output tokens | Thinking tokens, billed as output tokens |
| Faithfulness | Model may not verbalize actual reasoning | Still not guaranteed faithful (Anthropic research confirms) |
| Best for | Simple multi-step tasks | Complex reasoning, long agentic chains |

Adaptive thinking (available for Claude Opus 4.6 and Sonnet 4.6 as of 2026) removes the need to manually set a thinking token budget. Instead of specifying budget_tokens: 5000, you set the thinking type to "adaptive" and Claude decides how much reasoning each request needs. A simple extraction might use no thinking tokens; a complex multi-step diagnosis uses many.
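For reference, the fixed-budget form mentioned above might be assembled like this. It is a sketch under stated assumptions: `thinking_request` is a hypothetical helper that only builds the request arguments, the model ID is copied from this article's caching example, and the call to the API is left as a comment.

```python
def thinking_request(prompt, budget_tokens=5000, max_tokens=8000,
                     model="claude-opus-4-6"):
    """Build keyword arguments for a Messages API call with a thinking budget."""
    # The thinking budget counts toward max_tokens, so it must be smaller.
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

request = thinking_request("Assess this clinical note: ...")
print(request["thinking"])

# With the Anthropic SDK installed and an API key configured:
#   import anthropic
#   response = anthropic.Anthropic().messages.create(**request)
```

Keeping the request construction in one function makes it easy to swap the fixed budget for an adaptive setting later without touching call sites.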

For the medical record assistant handling complex edge cases (conflicting lab values, rare presentations, multiple comorbidities), adaptive thinking can improve diagnostic accuracy over static CoT instructions, though the size of the gain should be measured on your own evaluation set.

The "think" tool is a related but distinct concept. While extended thinking handles pre-response deliberation, the think tool lets Claude pause mid-tool-call chain to explicitly reason about what it just received before deciding the next action. Anthropic introduced it for situations where Claude performs long sequences of tool calls and needs to assess intermediate results carefully — like our medical record assistant looking up multiple reference values and synthesizing them.

python
# Claude think tool usage — explicit reasoning step in agentic chain
tools = [
    {
        "name": "think",
        "description": "Use this tool to reason carefully about tool results before proceeding",
        "input_schema": {
            "type": "object",
            "properties": {
                "thought": {"type": "string", "description": "Your reasoning"}
            },
            "required": ["thought"]
        }
    },
    {
        "name": "search_drug_database",
        "description": "Search for drug interactions, dosing, contraindications",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }
    }
]

The think tool is especially valuable when a single wrong tool call has high cost (an API call to an external database, a file modification, a message sent to a user). Having Claude reason explicitly before acting catches errors that implicit CoT misses.
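On the application side, the think tool needs almost no implementation, because its value is in the reasoning the model writes into the call. A minimal dispatcher sketch, assuming the two tool names defined above; the drug database handler is a hypothetical stub.

```python
thought_log = []

def run_tool(name, tool_input):
    """Dispatch one tool call; `think` only records the model's reasoning."""
    if name == "think":
        thought_log.append(tool_input["thought"])
        return "Thought recorded."
    if name == "search_drug_database":
        # Hypothetical stub: a real handler would query a drug database here.
        return f"(stub) results for: {tool_input['query']}"
    raise ValueError(f"Unknown tool: {name}")

print(run_tool("search_drug_database", {"query": "metformin insulin interaction"}))
print(run_tool("think", {"thought": "No interaction flagged; check renal dosing next."}))
```

Logging the thoughts is optional, but it gives you an audit trail of why the agent took each subsequent action.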

Tree of Thoughts: When to Use It

Tree of Thoughts (ToT, Yao et al., 2023) extends chain-of-thought into a search tree. Rather than a single linear reasoning chain, the model generates multiple candidate next steps at each node, evaluates which paths look most promising, and explores those branches — pruning dead ends.

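For extremely difficult reasoning tasks (programming competition problems, complex planning, mathematical proofs), ToT significantly outperforms both CoT and self-consistency. The cost: each tree node is a full LLM call. A fully expanded 3-branch, 4-depth tree has 81 leaf nodes and 120 nodes in total (3 + 9 + 27 + 81), so one problem can require well over 100 calls before pruning, plus the calls used to evaluate candidate steps.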
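The call count is easy to check. The deepest level of a 3-branch, 4-depth tree alone has 81 nodes, and counting every level gives more. This sketch assumes a fully expanded tree with no pruning and one generation call per node, and it ignores evaluation calls.

```python
def tree_call_counts(branching, depth):
    """Return (nodes at the deepest level, nodes across all levels)."""
    per_level = [branching ** level for level in range(1, depth + 1)]
    return per_level[-1], sum(per_level)

leaves, total = tree_call_counts(3, 4)
print(leaves, total)  # 81 120
```

Pruning reduces the total, but the growth is still exponential in depth, which is why tree search is reserved for problems where correctness outweighs cost.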

In practice, ToT is used selectively in 2026, primarily in offline research settings or high-value tasks where cost is secondary to correctness. For most production applications, self-consistency at 5 samples captures much of the benefit at a small fraction of the compute.

More recent research has explored Forest-of-Thought (FoT), which integrates multiple reasoning trees with sparse activation — only expanding branches that look genuinely promising. This improves efficiency significantly, but the engineering complexity keeps ToT variants out of most production stacks.

Common Pitfall: Teams new to agentic AI sometimes reach for ToT when self-consistency would be sufficient. Measure accuracy vs. cost on your actual task before committing to a tree-search approach.

Structured Outputs: Reliable JSON at Scale

Structured outputs are the single most impactful prompt engineering technique for production systems. The problem they solve: you need JSON back from the model, you ask for JSON, and roughly 10–20% of the time you get something that breaks your parser — a markdown code fence, a trailing comma, a comment inside the JSON.

JSON Mode vs. Structured Outputs

The distinction matters for production reliability.

JSON mode (OpenAI's response_format: {"type": "json_object"}) guarantees syntactically valid JSON. Nothing more. The model can still return a completely different schema than you requested, with extra fields or wrong types. By 2026, JSON mode is considered legacy: a training-wheels solution.
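The gap between "valid JSON" and "the schema you asked for" is easy to demonstrate with the standard library. In this sketch the raw string is a hand-written example of what JSON mode could return, and `schema_errors` is a deliberately simple hypothetical checker, not a full JSON Schema validator.

```python
import json

# Fields the application expects, with their Python types.
REQUIRED_FIELDS = {
    "primary_diagnosis": str,
    "medications": list,
    "follow_up_required": bool,
}

def schema_errors(payload):
    """List the ways a parsed payload deviates from REQUIRED_FIELDS."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"wrong type for {field}")
    return errors

# Syntactically valid JSON, but the wrong key name and the wrong type.
raw = '{"diagnosis": "Uncontrolled T2DM", "medications": "Metformin", "follow_up_required": true}'
payload = json.loads(raw)  # parses without error
print(schema_errors(payload))
```

The parse succeeds and the application still cannot use the result, which is exactly the failure mode schema enforcement removes.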

Structured outputs (OpenAI's strict: true with a JSON schema, Anthropic's tool-use with defined input schemas, Gemini's response_schema parameter) enforce a specific schema at the token generation level using constrained sampling. The model cannot generate tokens that would violate the schema. Adherence is >99% in production.

Here's the practical difference for extracting data from a medical record:

python
from pydantic import BaseModel
from typing import Optional, List
from openai import OpenAI

class MedicalExtract(BaseModel):
    patient_age: Optional[int]
    primary_diagnosis: str
    diagnoses: List[str]
    medications: List[str]
    critical_values: List[str]
    follow_up_required: bool

client = OpenAI()

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-11-20",
    messages=[
        {
            "role": "system",
            "content": "Extract structured medical data from the clinical note."
        },
        {
            "role": "user",
            "content": """65yo M with DM2, hypertension. Presents with increased
            thirst, polyuria x2 weeks. FBG 312, HbA1c 11.2%. Metformin not
            achieving control. Plan: add insulin, follow up in 4 weeks."""
        }
    ],
    response_format=MedicalExtract,
)

extract = response.choices[0].message.parsed
print(f"Diagnosis: {extract.primary_diagnosis}")
print(f"Follow-up: {extract.follow_up_required}")

The response_format=MedicalExtract argument passes the Pydantic schema to the API. OpenAI's constrained decoding engine ensures every token produced conforms to that schema. As long as the response completes (no refusal and no truncation at the token limit), there are no parsing errors, no missing fields, and no wrong types.

Common Pitfall: Pydantic validation happens on the Python object — the model still needs good prompting to put the right values in the correct fields. Structured outputs guarantee format, not accuracy. A badly prompted model will still hallucinate diagnoses, just hallucinate them into a perfectly formatted dict.

For Anthropic's Claude, the equivalent pattern uses tool definitions. Define an extract_medical_data tool with your JSON schema, instruct the model to call it, and the model returns a structured tool-use block. The Model Context Protocol uses this same mechanism for standardized tool definitions across providers.
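A sketch of what that tool-based request might look like. The schema is a trimmed version of the MedicalExtract model above, the model ID is copied from this article's caching example, and the response blocks are shown as plain dicts (the shape of the raw JSON response); `first_tool_input` is a hypothetical helper.

```python
extract_tool = {
    "name": "extract_medical_data",
    "description": "Record structured fields extracted from a clinical note.",
    "input_schema": {
        "type": "object",
        "properties": {
            "patient_age": {"type": ["integer", "null"]},
            "primary_diagnosis": {"type": "string"},
            "medications": {"type": "array", "items": {"type": "string"}},
            "follow_up_required": {"type": "boolean"},
        },
        "required": ["primary_diagnosis", "medications", "follow_up_required"],
    },
}

request = {
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "tools": [extract_tool],
    # Force the model to answer by calling this specific tool.
    "tool_choice": {"type": "tool", "name": "extract_medical_data"},
    "messages": [{"role": "user", "content": "65yo M with DM2 ..."}],
}

def first_tool_input(content_blocks):
    """Return the input of the first tool_use block, or None if absent."""
    for block in content_blocks:
        if block.get("type") == "tool_use":
            return block["input"]
    return None

# With the Anthropic SDK: anthropic.Anthropic().messages.create(**request)
```

Forcing the tool with tool_choice avoids the case where the model replies in prose instead of calling the extraction tool.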

Schema-first development — define your Pydantic model first, then write the prompt to fill it — has become the industry standard for data extraction pipelines in 2026. It cuts integration code by up to 60% compared to regex-based post-processing.

System Prompt Engineering

A system prompt is a contract between you and the model. It defines who the model is, what it can and can't do, and what success looks like. Most system prompts are written haphazardly; the well-engineered ones read like precise specifications.

What Works in Role Definition

The "you are a helpful assistant" pattern is nearly useless. Models perform significantly better with specific, bounded role definitions that include constraints.

What doesn't work:

code
You are a helpful medical assistant.

What does:

code
You are a clinical data extraction system. Your sole function is extracting
structured fields from clinical notes. You output only the requested fields.
You do not add clinical interpretation beyond what is directly stated.
You do not hallucinate values. If a field is not present in the note,
return null. If a value is ambiguous, flag it in the "notes" field.

The second version specifies: what the system does, what it doesn't do, how to handle edge cases, and what "correct" behavior looks like. That's a contract.

Instruction Ordering and Placement Effects

Research consistently shows models pay most attention to the beginning and end of context (the "primacy and recency" effect). Critical instructions belong at the top of the system prompt. Equally important: the task instruction itself goes at the end of the user message, after any data you're asking the model to process.

Structure that works:

code
[SYSTEM]
Role + goal (who you are, what success looks like)
Hard constraints (what you must never do)
Output format specification

[USER]
Context/data to process

Task instruction (what to do with the above)

Mixing context and instructions degrades compliance. The 4-block pattern — INSTRUCTIONS / CONTEXT / TASK / OUTPUT FORMAT — consistently outperforms unstructured prompts across providers.

Anthropic's official prompt engineering guidance adds one more layer: use XML tags to separate logical sections (<instructions>, <examples>, <data>). Claude in particular responds well to structured XML delimiters because its training data included heavily structured documents.
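Combining the ordering advice with XML delimiters, a user message could be assembled as sketched here. The `tag` and `build_user_message` helpers are hypothetical, the tag names are the ones listed above, and the task instruction is placed last to follow the placement guidance in this section.

```python
def tag(name, body):
    """Wrap a block of text in an XML-style tag pair."""
    return f"<{name}>\n{body.strip()}\n</{name}>"

def build_user_message(instructions, examples, data, task):
    """Order the sections so the task instruction comes after the data."""
    sections = [
        tag("instructions", instructions),
        tag("examples", examples),
        tag("data", data),
        task.strip(),
    ]
    return "\n\n".join(sections)

message = build_user_message(
    instructions="Extract only fields stated in the note. Return null if absent.",
    examples="Note: 'FBG 312' -> critical_values: ['FBG 312 mg/dL']",
    data="65yo M with DM2, hypertension. FBG 312, HbA1c 11.2%.",
    task="Extract the structured fields from the note in <data>.",
)
print(message)
```

Referring to a section by its tag name in the task ("the note in <data>") gives the model an unambiguous pointer to the content it should process.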

Negative Instructions

"Do not" instructions are underused and highly effective. Models follow explicit prohibitions more reliably than they infer them from positive framing.

For the medical record assistant:

  • "Do not infer diagnoses not stated in the note"
  • "Do not return values with units stripped (return '312 mg/dL', not '312')"
  • "Do not include information from your training data that contradicts the note"

Token Budget Considerations

The practical sweet spot for most system prompts is 150 to 300 words. LLM reasoning performance degrades around 3,000 tokens of system prompt as attention gets diluted. Longer is not better. Every line in a system prompt should earn its place.

Prompt Caching: 90% Cost Reduction on Large Contexts

Prompt caching lets you mark static parts of a prompt — the system instructions, few-shot examples, document corpus — so the model doesn't reprocess them on every call. The inference provider caches the KV-state for the prefix and reuses it.

The economics differ by provider:

| Provider | Read cost | Write cost | Cache TTL | Control |
| --- | --- | --- | --- | --- |
| Anthropic Claude | 10% of input price | 125% of input price (one-time) | 5 min (extendable to 1h) | Explicit via cache_control |
| OpenAI | 50% of input price | No write cost | Automatic | None (automatic) |
| Google Gemini | 25% of input price | Standard input price | Configurable | Explicit context cache |

For the medical record assistant processing 10,000 notes against a 500-page clinical guidelines document, caching the guidelines reduces the input cost of the guidelines tokens by ~90% (Anthropic) or ~50% (OpenAI) on every call after the first, as long as calls arrive before the cache expires. The guidelines are written into cache once; each subsequent call pays the discounted read rate for the guidelines plus the full rate for the note tokens.
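A back-of-the-envelope check of the Anthropic figures, in units of "one input token at the base price". The assumptions are mine: a 50,000-token guidelines prefix, 500-token notes, 10,000 calls, and every call arriving while the cache is still warm, so the prefix is written only once.

```python
GUIDELINE_TOKENS = 50_000   # static prefix (assumed size)
NOTE_TOKENS = 500           # variable part of each call (assumed size)
CALLS = 10_000
WRITE_RATE = 1.25           # cache write: 125% of the input price
READ_RATE = 0.10            # cache read: 10% of the input price

def cost_without_cache():
    """Every call pays full price for the prefix and the note."""
    return CALLS * (GUIDELINE_TOKENS + NOTE_TOKENS)

def cost_with_cache():
    """One cache write, then discounted reads; notes are always full price."""
    write = GUIDELINE_TOKENS * WRITE_RATE
    reads = (CALLS - 1) * GUIDELINE_TOKENS * READ_RATE
    notes = CALLS * NOTE_TOKENS
    return write + reads + notes

savings = 1 - cost_with_cache() / cost_without_cache()
print(f"Input cost saved: {savings:.1%}")
```

Under these assumptions the overall saving lands just below 90%, because the note tokens and the one-time write are never discounted. Gaps between calls longer than the cache TTL add more writes and lower the saving further.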

The rule of thumb: structure your prompts with static content first (system instructions, few-shot examples, tool definitions, documents) and variable content last. Caching is prefix-based — the provider matches from the start of the prompt, so anything that changes between calls must come after everything static.

python
# Anthropic prompt caching — static system prompt + dynamic note
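import anthropic

# Placeholder inputs so the snippet runs as written; substitute real data.
# (A real guidelines document must exceed the provider's minimum cacheable length.)
clinical_guidelines_text = "<full text of the clinical guidelines>"
patient_note = "<one clinical note>"

client = anthropic.Anthropic()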

response = client.messages.create(
    model="claude-opus-4-6",
    system=[
        {
            "type": "text",
            "text": "You are a clinical data extraction system...",
            "cache_control": {"type": "ephemeral"}   # Cache this prefix
        },
        {
            "type": "text",
            "text": clinical_guidelines_text,         # 50k tokens of guidelines
            "cache_control": {"type": "ephemeral"}   # Cache this too
        }
    ],
    messages=[
        {
            "role": "user",
            "content": patient_note   # Variable — not cached, paid at full rate
        }
    ],
    max_tokens=1024
)

Since early 2026, Anthropic also automatically caches frequently-accessed static system instructions at the workspace level — so even without explicit cache_control markers, repeat callers with identical system prompts see cost benefits. Explicit markers give you more control over what's cached and how long it persists.

Figure: Prompt caching architecture showing static prefix vs dynamic user content across multiple requests

Context Engineering: Beyond Individual Prompts

By late 2025, Anthropic formalized what practitioners had been discovering independently: for multi-step agentic systems, the limiting factor is no longer individual prompt quality but the entire context state across turns. Anthropic calls this "context engineering."

Context engineering is the set of strategies for curating and maintaining the optimal set of tokens during LLM inference — including system instructions, tools, MCP connections, retrieved documents, and message history. Anthropic's internal research found context-engineered agents achieved 54% better performance on multi-step tasks compared to prompt-engineered equivalents.

The practical difference:

| | Prompt Engineering | Context Engineering |
| --- | --- | --- |
| Scope | Single prompt, one turn | Entire context across turns |
| Focus | What instructions to give | What information to include when |
| Iteration | Per-prompt A/B testing | Per-turn context curation |
| Tooling | Prompt templates | Dynamic context pipelines |

For the agentic AI stack — where a medical record assistant might run for dozens of turns, accumulating tool results, retrieved documents, and intermediate conclusions — context engineering determines whether the model retains the right information or drowns in irrelevant history.

The key patterns Anthropic recommends: just-in-time context loading (don't dump all guidelines upfront — load the relevant section when needed), explicit summarization of long tool chains before new tasks start, and structured state management that keeps the most decision-relevant information in the recency window.
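The summarization pattern can be sketched as a history-compaction step. This is an illustration under assumptions, not Anthropic's implementation: `compact_history` is a hypothetical helper, and the default summarizer is a placeholder where a real system would call a model to write the summary.

```python
def compact_history(messages, keep_recent=4, summarize=None):
    """Replace all but the most recent messages with a single summary."""
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    if summarize is None:
        # Placeholder summarizer; a real one would be an LLM call.
        summarize = lambda msgs: f"Summary of {len(msgs)} earlier messages."
    summary = {"role": "user", "content": summarize(older)}
    return [summary] + recent

history = [{"role": "user", "content": f"tool result {i}"} for i in range(10)]
compacted = compact_history(history, keep_recent=4)
print(len(compacted), compacted[0]["content"])
```

Keeping the latest turns verbatim preserves the decision-relevant detail in the recency window, while the summary keeps older conclusions available at a fraction of the tokens.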

This connects directly to the broader agentic AI stack — prompt engineering is one layer of the stack, context engineering is the orchestration above it. For deeper coverage of how evaluation works across the full agentic pipeline, the LLM evaluation guide covers the frameworks that measure whether these techniques actually work.

Prompt Versioning: Treat Prompts Like Code

Prompts are production code. They have the same failure modes as software: regression after changes, inconsistent behavior across environments, no rollback mechanism without version control. In 2026, serious teams version prompts in git alongside their application code.

Practical workflow:

  • Store prompts in separate files (.txt or .md) tracked in git
  • Branch for experiments (one git branch per prompt variant)
  • Log the prompt version with every LLM call (add prompt_version: "v2.3" to request metadata)
  • A/B test prompt changes against a baseline on a sample of production traffic before full rollout
  • Write evaluations that run on every prompt commit — automated regression testing for AI behavior
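One way to make the logging step concrete is to attach both the human-readable version and a content hash to every call's metadata. The helper names are hypothetical, and the field name `prompt_sha` is an assumption added alongside the `prompt_version` field mentioned above.

```python
import hashlib

def prompt_fingerprint(prompt_text):
    """Short, stable hash of the exact prompt text that was sent."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

def call_metadata(prompt_text, prompt_version):
    """Metadata to log alongside each LLM call."""
    return {
        "prompt_version": prompt_version,
        "prompt_sha": prompt_fingerprint(prompt_text),
    }

system_prompt = "You are a clinical data extraction system."
print(call_metadata(system_prompt, "v2.3"))
```

The hash catches the case where someone edits a prompt file without bumping the version label, since any change to the text changes the fingerprint.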

When to Use Each Technique

| Scenario | Recommended Approach |
| --- | --- |
| Simple extraction, one field, clear text | Direct prompt, no special technique |
| Multi-step reasoning, complex logic | Zero-shot CoT or few-shot CoT |
| High-stakes output, need confidence score | Self-Consistency (5 samples) |
| Needs real-time data or external tools | ReAct with tool definitions |
| JSON output for downstream processing | Structured Outputs with Pydantic |
| Large static context, many requests | Prompt Caching |
| Inconsistent model behavior | System Prompt Engineering + versioning |
| Long agentic tool chains, costly mistakes | Extended Thinking or Think Tool |
| Multi-turn agent, accumulating context | Context Engineering patterns |
| Extreme reasoning tasks, cost secondary | Tree of Thoughts (offline/research) |

What not to combine blindly: Self-consistency + long few-shot examples + large document context + extended thinking = massive token bills. Each technique multiplies cost. Combine them only when the accuracy gain justifies the expense for that specific task.

Conclusion

The jump from basic prompting to production-grade prompt engineering is about matching technique to problem. Chain-of-thought gives the model space to reason. Self-consistency validates answers by consensus. ReAct extends reasoning into tool-use loops that power real AI agents. Structured outputs eliminate parsing failures. Extended thinking and the think tool handle complex agentic chains where implicit CoT isn't enough. Context engineering scales all of it across multi-turn systems.

For the medical record assistant, the production system likely uses all of them together: a carefully engineered system prompt with cached guidelines, few-shot CoT examples for reasoning traces, structured outputs with a Pydantic schema for guaranteed extraction format, ReAct for cases requiring drug database lookups or reference range verification, and adaptive thinking for genuinely complex multi-comorbidity cases.

To go deeper on the agent patterns that ReAct enables, read Building AI Agents: ReAct, Planning, and Tool Use. For the underlying mechanism that makes tool use work, Function Calling and Tool Use for AI Agents covers the API mechanics. And for understanding how context flows through modern AI systems at an architectural level, The Agentic AI Stack covers how prompt engineering, context management, and tool use fit together in production.

The best prompt engineers treat prompts as hypotheses: write, test, measure, iterate. These techniques aren't magic — they're levers. Learn when to pull each one.

Interview Questions

What is chain-of-thought prompting and when does it improve accuracy?

Chain-of-thought prompting instructs the model to reason step by step before giving a final answer. The phrase "Let's think step by step" (zero-shot CoT) or explicit reasoning demonstrations (few-shot CoT) both work. CoT improves accuracy on multi-step arithmetic, logical reasoning, and planning tasks, with typical gains of 10–40%. For simple lookup tasks, CoT can hurt performance by introducing opportunities for reasoning errors.

How does self-consistency prompting work and what's the tradeoff?

Self-consistency samples multiple independent reasoning traces at elevated temperature, then takes a majority vote on the final answer. It improves accuracy on GSM8K by 17.9 points over single-sample CoT (Wang et al., 2022). The tradeoff is cost: 5 samples means 5x the tokens. It's appropriate for high-stakes tasks where accuracy matters more than throughput.

Explain the ReAct (Reason + Act) pattern and where it's used in production.

ReAct interleaves chain-of-thought reasoning with tool-use actions in a loop: the model reasons about what it needs, calls a tool, observes the result, and continues reasoning. This is the core pattern behind most production AI agents — LangChain's agent executor, Claude's agentic systems, and any system where the model needs real-time information. The key advantage over static RAG is that ReAct can search based on intermediate reasoning, not just the initial query.

What is the difference between JSON mode and structured outputs?

JSON mode guarantees syntactically valid JSON but does not enforce a schema — the model can return wrong field names, wrong types, or extra fields. Structured outputs use constrained decoding to enforce a specific JSON schema at the token level, achieving >99% schema adherence. For production data extraction pipelines, structured outputs have become the standard as JSON mode reliability is insufficient.

How should a system prompt be structured for best compliance?

Place the most critical instructions at the top and bottom (primacy and recency effects). Use a 4-block structure: role/goal first, then constraints, then context/data, then the task instruction at the very end. Keep system prompts to roughly 150–300 words; reasoning performance degrades as system prompts approach 3,000 tokens. Use explicit negative instructions ("do not infer values not present") alongside positive ones. Anthropic additionally recommends XML tags to separate logical sections.

How does prompt caching reduce costs and when is it most valuable?

Prompt caching stores the KV-state of a static prompt prefix so it doesn't need to be reprocessed on each call. Anthropic's Claude charges 10% of normal input price for cached tokens (90% savings), while OpenAI charges 50% of input price (50% savings). It's most valuable when: (a) you have a large, static document or guidelines corpus, (b) you're making many calls with only the user input varying, or (c) you have long few-shot example sets repeated across requests.

What is the difference between extended thinking and chain-of-thought prompting in Claude?

Traditional CoT is triggered via prompt instructions and appears inline in the response text. Extended thinking is an API-level parameter that gives Claude a separate scratchpad of reasoning tokens before generating the final response; the thinking is returned in its own block, so the application decides whether end users see it. Adaptive thinking (available for Claude Opus 4.6 and Sonnet 4.6) goes further: it lets Claude dynamically decide how many thinking tokens each request warrants rather than requiring a fixed budget. Use extended thinking for complex tasks in agentic systems; use CoT for simpler multi-step tasks where the reasoning itself is part of the desired output.

You're building a clinical note extraction system. A team member suggests asking the model "return the results as JSON." What's wrong with this and what would you do instead?

"Return results as JSON" relies on the model's inclination to comply: it produces JSON most of the time but fails roughly 10–20% of the time in ways that break downstream parsers. Instead, define a Pydantic model that exactly specifies the fields, types, and optionality, then use the provider's structured output API (OpenAI's response_format with a JSON schema, Anthropic's tool definition). This enforces schema adherence at the token-generation level, which all but eliminates parsing failures. Pair this with a carefully engineered system prompt that specifies how to handle missing values and ambiguous data.
