Structured Outputs: Making LLMs Return Reliable JSON

LDS Team
Let's Data Science

You spent hours crafting the perfect prompt. The reasoning is sound, the context is rich, and the model is the latest state-of-the-art release. You ask for a simple JSON summary. The response streams in, looking perfect — until the very last line where a trailing comma breaks your parser. Or worse, the model wraps the JSON in markdown backticks, adds a polite "Here is your data:" preamble, or hallucinates an extra field that crashes your pipeline.

This was the reality of structured LLM output until 2024. Today, every major provider — OpenAI, Anthropic, and Google — offers native structured output guarantees. The underlying technique, constrained decoding, mathematically forces the model to produce valid, schema-conformant output by modifying the token probability distribution at every generation step. No more hoping the model follows instructions. No more retry loops. Just guaranteed structure.

From hope to guarantee: the evolution of structured output

Structured output from LLMs has evolved through four distinct phases:

Phase 1: Prompt engineering (2020-2023). Developers added instructions like "Output valid JSON only" and provided few-shot examples. This relied entirely on the model's probabilistic tendency to follow instructions — with failure rates of 5-20% depending on schema complexity.

Phase 2: JSON mode (2023-2024). OpenAI introduced response_format: { type: "json_object" } in November 2023. This guaranteed valid JSON syntax but provided no schema enforcement — the model could return any valid JSON structure, not necessarily the one you wanted.

Phase 3: Structured Outputs with schema enforcement (2024-2025). OpenAI released Structured Outputs in August 2024 with response_format: { type: "json_schema" }, guaranteeing that output conforms to a specific JSON Schema. Google Gemini added response_schema support at Google I/O in May 2024. Anthropic followed in November 2025 with constrained decoding for Claude, now generally available across Opus 4.6, Sonnet 4.5, and Haiku 4.5.

Phase 4: High-performance constrained decoding engines (2025-2026). Open-source engines like XGrammar and llguidance achieved near-zero overhead constrained decoding, enabling structured output at production scale. In May 2025, OpenAI credited llguidance for foundational work underpinning their structured output implementation.

How constrained decoding works

At its core, an LLM is an autoregressive model that predicts the next token based on all previous tokens:

P(x_t \mid x_{1:t-1})

At each step, the model produces a logit (raw score) for every token in its vocabulary — typically 32,000 to 128,000 tokens. Normally, these logits pass through softmax to become probabilities, and the next token is sampled from that distribution.

Constrained decoding inserts a logit processor between the model's output and the sampling step. This processor tracks the current position within the target grammar (JSON Schema, regex, etc.) and masks invalid tokens by setting their logits to −∞:

P_{\text{constrained}}(x_i) =
\begin{cases}
\frac{P_{\text{model}}(x_i)}{\sum_{j \in V_{\text{valid}}} P_{\text{model}}(x_j)} & \text{if } x_i \in V_{\text{valid}} \\
0 & \text{otherwise}
\end{cases}

In Plain English: The model generates scores for every word in its dictionary. A grammar checker then turns off every word that would break the JSON rules — setting its score to negative infinity so it gets zero probability. The remaining valid words keep their relative rankings, and the model picks from only those. The model is physically incapable of generating invalid output, no matter how much it "wants" to.
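
To see what this masking looks like in code, here is a minimal sketch of a logit processor, assuming the grammar engine (not shown) supplies the set of valid token IDs for the current state:

python
import torch

def apply_grammar_mask(logits: torch.Tensor, valid_token_ids: list[int]) -> torch.Tensor:
    """Mask every token the grammar forbids at the current position.

    `logits` has shape (vocab_size,); `valid_token_ids` is assumed to come
    from a grammar engine that tracks the current parse state.
    """
    mask = torch.full_like(logits, float("-inf"))  # forbid everything by default
    mask[valid_token_ids] = 0.0                    # re-enable the grammar-legal tokens
    return logits + mask

# After masking, softmax gives every invalid token exactly zero probability,
# so sampling can only ever pick a grammar-legal continuation:
# probs = torch.softmax(apply_grammar_mask(logits, valid_ids), dim=-1)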

The state machine behind the mask

The logit processor needs to know which tokens are valid at each position. This requires compiling the target schema into a state machine that tracks the current position in the output structure.

The foundational algorithm was formalized by Willard & Louf (2023) in the paper that introduced the Outlines library. Their key insight: autoregressive text generation can be reformulated as transitions between states of a finite-state machine (FSM). The algorithm works in two phases:

Offline (once per schema):

  1. Convert the JSON Schema into a regular expression
  2. Build an FSM from the regex
  3. For every FSM state, precompute which vocabulary tokens are valid transitions — producing an index: state → {valid_token_ids}

Online (every token):

  1. Look up the current FSM state in the precomputed index — an O(1) hash map lookup
  2. Mask all tokens not in the valid set
  3. Sample from remaining tokens
  4. Advance the FSM to the new state

This precomputation makes constrained decoding remarkably fast at inference time. The computational cost is paid once during schema compilation, then amortized across all generation calls using that schema.
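
As a toy illustration of the two phases (purely for intuition, not how production engines store their indexes), consider a five-token vocabulary and a character-level FSM for a grammar that accepts a quoted integer:

python
# States: 0 = expect opening quote, 1 = expect first digit,
#         2 = inside digits (may close), 3 = accept.
CHAR_EDGES = {
    (0, '"'): 1,
    (2, '"'): 3,
    **{(1, d): 2 for d in "0123456789"},
    **{(2, d): 2 for d in "0123456789"},
}

TOY_VOCAB = {0: '"', 1: "1", 2: "23", 3: "x", 4: '7"'}

def simulate(state, token_text):
    """Walk a multi-character token through the character FSM; None if it breaks the grammar."""
    for ch in token_text:
        state = CHAR_EDGES.get((state, ch))
        if state is None:
            return None
    return state

# Offline phase (once per schema): precompute state -> {valid token id: next state}.
INDEX = {}
for state in (0, 1, 2, 3):
    INDEX[state] = {}
    for token_id, text in TOY_VOCAB.items():
        next_state = simulate(state, text)
        if next_state is not None:
            INDEX[state][token_id] = next_state

# Online phase (every decoding step): O(1) lookup, mask everything else, advance.
state = 0
for token_id in [0, 2, 4]:           # '"', '23', '7"'  ->  the output "237" in quotes
    assert token_id in INDEX[state]  # in a real engine, every other token gets -inf
    state = INDEX[state][token_id]
print(state)  # 3, the accepting state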

FSM versus CFG: two approaches to grammar enforcement

The FSM approach from Willard & Louf works well for flat schemas, but hits a fundamental limitation: regular expressions cannot express recursion. JSON is inherently recursive — objects can contain objects, arrays can contain arrays — so a pure FSM approach must either flatten recursion to a fixed depth or reject recursive schemas entirely.

Context-free grammars (CFGs) solve this by using a pushdown automaton (PDA) — essentially a finite-state machine augmented with a stack. The stack tracks nesting depth, enabling the grammar to match opening braces with closing braces at arbitrary depth.

The challenge is performance. An FSM has a finite, known number of states that can be fully precomputed. A PDA's state depends on the stack contents, creating potentially infinite states and making full precomputation impossible.

XGrammar (Dong et al., MLSys 2025) solved this with an elegant insight: split vocabulary tokens into context-independent and context-dependent sets. Context-independent tokens (~99% of the vocabulary) have validity that depends only on the current grammar position, not the stack contents — these can be fully precomputed into bitmask tables, just like FSM approaches. Context-dependent tokens (~1%) require runtime stack inspection. The result: CFG-level expressiveness with FSM-level performance for the vast majority of tokens, achieving up to 100x speedup over traditional grammar-constrained methods.

Pro Tip: If your schema includes recursive structures (nested comments, tree-like data, recursive $ref definitions), you need a CFG-based engine like XGrammar, llama.cpp grammar mode, or llguidance. FSM-based tools like Outlines will either reject the schema or flatten recursion to a fixed depth.

The BPE tokenization challenge

A subtlety that trips up many implementations: LLMs do not generate characters — they generate tokens, which are variable-length byte sequences produced by BPE (Byte Pair Encoding) tokenization. A single token might be "json" (4 characters), " the" (4 characters including the space), or "\n\n" (2 characters).

This creates a fundamental mismatch. Grammar constraints operate at the character level ("the next character must be a digit"), but the model generates multi-character tokens. Validating a token like "123" requires checking that each of its three characters is valid at its corresponding position in the grammar.

Worse, BPE's canonical tokenization depends on context. The same substring can tokenize differently depending on what precedes it. When constrained decoding forces the model down an unusual token path, it may produce a non-canonical tokenization that the model rarely encountered during training, subtly degrading output quality.

Token healing, pioneered by Microsoft's Guidance library, addresses the boundary problem. When a prompt ends mid-token, the tokenization of the boundary differs from what the model would see if the full string were tokenized jointly. Token healing backs up generation by one token, then constrains the first generated token to begin with the removed token's text — allowing the canonically tokenized version to emerge naturally.
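
You can see the boundary effect with an off-the-shelf tokenizer. The exact splits depend on the vocabulary's merge table, but with GPT-2's BPE the joint encoding typically keeps "://" as a single token, while cutting the prompt after "http:" forces a different, non-canonical split:

python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Canonical tokenization of the full string.
print(tok.convert_ids_to_tokens(tok.encode("http://example.com")))

# What the model sees when the prompt is cut after "http:" and the
# continuation is tokenized separately at that boundary.
print(tok.convert_ids_to_tokens(tok.encode("http:")) +
      tok.convert_ids_to_tokens(tok.encode("//example.com")))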

The provider landscape in 2026

Every major LLM provider now offers native structured output support, but the implementations differ significantly:

Feature | OpenAI | Anthropic | Google Gemini
--- | --- | --- | ---
API parameter | response_format: { type: "json_schema" } | output_config: { format: { type: "json_schema" } } | response_mime_type + response_schema
Release date | Aug 2024 | Nov 2025 (beta); now GA | May 2024 (Google I/O); enhanced schema support later
Enforcement | Constrained decoding | Constrained decoding | Constrained decoding
Recursive schemas ($ref) | Supported | Supported | Limited (improving since Nov 2025)
Strict mode | Yes (requires additionalProperties: false) | Enabled by default at GA | Via response_schema
Supported models | GPT-4o, GPT-4o-mini, o3-mini | Opus 4.6, Sonnet 4.5, Opus 4.5, Haiku 4.5 | Gemini 2.5 Pro/Flash, Gemini 3

Key Insight: OpenAI's strict mode requires that every object sets additionalProperties: false and lists all properties in the required array. Optional fields must be represented as a union with null. This is more restrictive than standard JSON Schema but enables 100% schema compliance via constrained decoding. Note that structured outputs via response_format: json_schema are supported on GPT-4o and o3-mini; newer models like GPT-4.1 support structured outputs through function calling with strict: true and the Responses API.
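
Concretely, a strict-mode-friendly schema looks roughly like the hand-written sketch below; treat the envelope as illustrative and check the current API reference for exact field names:

python
# Every property appears in "required", every object disables
# additionalProperties, and the "optional" field is a union with null.
strict_schema = {
    "name": "support_ticket",  # illustrative schema name
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "assignee": {"type": ["string", "null"]},  # optional = nullable, not omitted
        },
        "required": ["title", "priority", "assignee"],
        "additionalProperties": False,
    },
}

# Passed to the API as:
# response_format={"type": "json_schema", "json_schema": strict_schema}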

JSON mode versus Structured Outputs

A critical distinction that causes confusion: JSON mode guarantees syntactically valid JSON but not schema adherence. Structured Outputs guarantees the output conforms to a specific JSON Schema. Always prefer Structured Outputs when available — JSON mode still requires you to validate and handle schema mismatches manually.

Defining schemas with Pydantic

In the Python ecosystem, Pydantic has become the standard for defining data structures that compile into JSON Schemas:

python
from pydantic import BaseModel, Field
from typing import List, Optional

class SentimentAnalysis(BaseModel):
    score: float = Field(
        ..., ge=0.0, le=1.0,
        description="Normalized sentiment score: 0 = negative, 1 = positive."
    )
    key_entities: List[str] = Field(
        description="Proper nouns or products mentioned in the text."
    )
    reasoning: str = Field(
        description="Brief explanation of why this score was assigned."
    )
    flag_for_review: bool = Field(
        default=False,
        description="True if text contains hate speech or PII."
    )

The description fields serve double duty: they become part of the JSON Schema that the LLM sees as instructions, and they document the data structure for developers. This is the most granular prompt engineering you can do — specifying intent at the individual field level.
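
Recent versions of the official openai SDK accept a Pydantic class directly and translate it into a strict JSON Schema for you. A minimal sketch, assuming your SDK version exposes the beta parse helper:

python
from openai import OpenAI

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I loved the service but the price was steep."}],
    response_format=SentimentAnalysis,  # the Pydantic model defined above
)

result = completion.choices[0].message.parsed  # a validated SentimentAnalysis instance
print(result.score, result.flag_for_review)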

Recursive and polymorphic schemas

Real-world applications often require complex structures. A comment thread with arbitrary nesting:

python
from __future__ import annotations

class Comment(BaseModel):
    author: str
    content: str
    replies: List[Comment] = Field(default_factory=list)

Comment.model_rebuild()  # Resolve forward reference

A search result that could be different types:

python
from typing import Union, Literal

class Product(BaseModel):
    type: Literal["product"]
    name: str
    price: float

class Video(BaseModel):
    type: Literal["video"]
    title: str
    duration_seconds: int

class SearchResult(BaseModel):
    item: Union[Product, Video]

With constrained decoding, when the model generates "type": "video", the decoder immediately masks out price (which belongs to Product) and unmasks duration_seconds. The grammar enforces semantic consistency — you will never get a video with a price field.

The open-source structured generation ecosystem

Outlines (dottxt)

Outlines, born from the Willard & Louf (2023) paper, pioneered the FSM approach to constrained decoding. It compiles JSON Schemas and regex patterns into finite-state machines with precomputed vocabulary indexes. Outlines is model-agnostic and works with any model that exposes logits (HuggingFace Transformers, vLLM, etc.).
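
Usage is a thin layer over a Pydantic model. The sketch below follows the 0.x interface (the 1.0 release reorganized the API, so check the documentation for your installed version); the model name is just an example:

python
import outlines

# Any HuggingFace model that exposes logits will do.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Compiles SentimentAnalysis into an FSM once, then reuses it on every call.
generator = outlines.generate.json(model, SentimentAnalysis)

result = generator("Review: great coffee, painfully slow service. Analyze the sentiment.")
print(result.score)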

Limitation: Complex schemas with features like minItems, maxItems, or large enums can cause compilation times from 40 seconds to over 10 minutes, as the regex representation explodes in size. The JSONSchemaBench benchmark found Outlines had the lowest compliance rate among tested engines, primarily due to these timeouts.

XGrammar

XGrammar (Dong et al., MLSys 2025) from the CMU/MLC team (led by Tianqi Chen) takes the CFG/PDA approach with the context-independent/dependent token split described above. As of 2025, vLLM uses XGrammar by default for structured generation, and SGLang integrates it as well. XGrammar achieves token mask generation in under 40 microseconds per token and near-zero overhead when integrated with serving frameworks.
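
If you serve a model with vLLM, you can pass a JSON Schema per request through the OpenAI-compatible endpoint and let XGrammar enforce it server-side. A sketch, assuming a locally running vllm serve instance and the guided_json extension (flag names have shifted between vLLM releases):

python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model vLLM is serving
    messages=[{"role": "user", "content": "I loved the service but the price was steep."}],
    extra_body={"guided_json": SentimentAnalysis.model_json_schema()},
)

print(resp.choices[0].message.content)  # a JSON string conforming to the schema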

llguidance (Microsoft/Guidance)

llguidance is a Rust-based constrained decoding engine from Microsoft, using a derivative-based regex engine (derivre) for the lexer and an optimized Earley parser for CFG rules. It achieves approximately 50 microseconds CPU time per token for a 128K-token vocabulary with negligible startup costs. In May 2025, OpenAI publicly credited llguidance for foundational work underpinning their Structured Outputs implementation.

Instructor

Instructor provides a high-level interface that patches LLM client SDKs (OpenAI, Anthropic, Google, etc.) to accept Pydantic models directly as the response_model parameter:

python
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=SentimentAnalysis,
    messages=[{"role": "user", "content": "I loved the service but the price was steep."}]
)

print(result.score)      # 0.65
print(result.reasoning)  # "Positive sentiment about service quality..."

Instructor handles schema translation, validation, and automatic retries. It is the easiest entry point for practitioners who want structured outputs without managing the low-level constrained decoding stack.

The semantic quality problem

Constrained decoding guarantees syntactic validity — the output is valid JSON matching your schema. It does not guarantee semantic correctness — the values in that JSON may be wrong, hallucinated, or low-quality.

Research has shown that constraining the output format can actually degrade the model's reasoning ability. Tam et al. (EMNLP 2024 Industry Track) found that JSON-mode constrained decoding hinders reasoning tasks because the model may be forced to output an answer field before completing its chain-of-thought reasoning. The fix is simple but important: always place reasoning fields before answer fields in your schema, allowing the model to think before committing to a conclusion.

More fundamentally, Park et al. (NeurIPS 2024) demonstrated that constrained decoding distorts the model's probability distribution. When high-probability tokens are masked because they violate the grammar, the remaining tokens are renormalized, amplifying relative differences and producing outputs that are syntactically correct but semantically less natural. Their ASAp algorithm progressively corrects this distortion but adds computational overhead.

Pro Tip: When using structured outputs for tasks requiring reasoning (math, logic, analysis), include a reasoning or chain_of_thought string field before the answer fields in your schema. This lets the model "think aloud" within the structured format, mitigating the reasoning degradation that constrained decoding can cause.
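
A minimal illustration of that ordering (the field names are only an example):

python
from pydantic import BaseModel, Field

class MathAnswer(BaseModel):
    # Generated first: the model works through the problem token by token...
    reasoning: str = Field(description="Step-by-step working shown before answering.")
    # ...and only then commits to a final value.
    final_answer: float = Field(description="The final numeric answer.")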

Performance and compliance benchmarks

The JSONSchemaBench benchmark (Geng et al., 2025) evaluated six constrained decoding frameworks against 10,000 real-world JSON schemas of varying complexity:

Engine | Approach | GlaiveAI (Simple) | GitHub Easy (Moderate) | Compilation Speed
--- | --- | --- | --- | ---
Guidance/llguidance | CFG (Earley parser) | 98% | Highest across 6/8 datasets | ~50 µs/token
OpenAI | Constrained decoding | 100% | 97% | N/A (server-side)
XGrammar | CFG (PDA) | 93% | 87% | <40 µs/token
Llama.cpp | CFG (GBNF grammar) | 97% | 88% | Variable
Gemini | Constrained decoding | 100% | 88% | N/A (server-side)
Outlines | FSM (regex) | 96% | 83% | 40s-10min for complex schemas

The SGLang serving framework combined XGrammar with compressed finite-state machines that analyze transition paths and decode multiple tokens in a single step, achieving up to 2x latency reduction and up to 2.5x throughput improvement over uncompressed approaches in JSON decoding benchmarks.

The practical upshot: Despite per-token overhead from grammar checking, structured outputs often reduce total latency because the model generates no conversational filler, stops immediately when the JSON is complete, and eliminates retry logic that unstructured approaches require.

Common pitfalls

Ordering fields for reasoning. Place reasoning or explanation fields before answer or result fields. The model generates tokens sequentially — if the answer comes first, the model cannot reason its way to it.

Schema complexity budget. Deeply nested schemas with large enums or many oneOf branches increase grammar compilation time and per-token overhead. Keep schemas as flat as practical.

Confusing JSON mode with Structured Outputs. JSON mode only guarantees valid JSON, not schema compliance. Use Structured Outputs (json_schema type) whenever available.

Ignoring the description field. Pydantic Field(description=...) values are passed to the LLM as inline instructions. Omitting them forces the model to guess field semantics from names alone.
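
To check what the model will actually see, dump the schema Pydantic generates; the Field descriptions land directly in it (output abridged):

python
import json

print(json.dumps(SentimentAnalysis.model_json_schema(), indent=2))
# {
#   "properties": {
#     "score": {
#       "description": "Normalized sentiment score: 0 = negative, 1 = positive.",
#       ...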

Conclusion

Structured outputs have transformed LLMs from unpredictable text generators into reliable software components. The shift from prompt-based hope to grammar-enforced guarantees — powered by constrained decoding algorithms from Willard & Louf (2023) through XGrammar (2024) and llguidance — enables a new class of agentic workflows where LLMs can reliably chain steps, pass data between functions, and interact with external systems knowing the data interchange format will never break.

The key architectural decision is choosing the right level of abstraction: API-level structured outputs (OpenAI, Anthropic, Google) for simplicity, high-level wrappers like Instructor for rapid development, or low-level engines like XGrammar and llguidance for maximum control in self-hosted deployments.

To understand how these models generate tokens internally, see How Large Language Models Actually Work. For how tokenization affects constrained decoding, read Tokenization Deep Dive: Why It Matters More Than You Think. And for how structured outputs fit into retrieval pipelines, see RAG: Making LLMs Smarter with Your Data.