Structured Outputs: Making LLMs Return Reliable JSON

LDS Team
Let's Data Science

You spent hours crafting the perfect prompt. The reasoning is sound, the context is rich, and the model is the latest state-of-the-art release. You ask for a simple JSON summary. The response streams in, looking perfect — until the very last line where a trailing comma breaks your parser. Or worse, the model wraps the JSON in markdown backticks, adds a polite "Here is your data:" preamble, or hallucinates an extra field that crashes your pipeline.

This was the reality of structured LLM output until 2024. Today, every major provider — OpenAI, Anthropic, and Google — offers native structured output guarantees. The underlying technique, constrained decoding, mathematically forces the model to produce valid, schema-conformant output by modifying the token probability distribution at every generation step. No more hoping the model follows instructions. No more retry loops. Just guaranteed structure.

From hope to guarantee: the evolution of structured output

Structured output from LLMs has evolved through four distinct phases:

Phase 1: Prompt engineering (2020-2023). Developers added instructions like "Output valid JSON only" and provided few-shot examples. This relied entirely on the model's probabilistic tendency to follow instructions — with failure rates of 5-20% depending on schema complexity.

Phase 2: JSON mode (2023-2024). OpenAI introduced response_format: { type: "json_object" } in November 2023. This guaranteed valid JSON syntax but provided no schema enforcement — the model could return any valid JSON structure, not necessarily the one you wanted.

Phase 3: Structured Outputs with schema enforcement (2024-2025). OpenAI released Structured Outputs in August 2024 with response_format: { type: "json_schema" }, guaranteeing that output conforms to a specific JSON Schema. Google Gemini added response_schema support at Google I/O in May 2024. Anthropic followed in November 2025 with constrained decoding for Claude, now generally available across Opus 4.6, Sonnet 4.5, and Haiku 4.5.

Phase 4: High-performance constrained decoding engines (2025-2026). Open-source engines like XGrammar and llguidance achieved near-zero overhead constrained decoding, enabling structured output at production scale. In May 2025, OpenAI credited llguidance for foundational work underpinning their structured output implementation.

How constrained decoding works

At its core, an LLM is an autoregressive model that predicts the next token based on all previous tokens:

P(x_t \mid x_{1:t-1})

At each step, the model produces a logit (raw score) for every token in its vocabulary — typically 32,000 to 128,000 tokens. Normally, these logits pass through softmax to become probabilities, and the next token is sampled from that distribution.

Constrained decoding inserts a logit processor between the model's output and the sampling step. This processor tracks the current position within the target grammar (JSON Schema, regex, etc.) and masks invalid tokens by setting their logits to −∞:

P_{\text{constrained}}(x_i) =
\begin{cases}
\frac{P_{\text{model}}(x_i)}{\sum_{j \in V_{\text{valid}}} P_{\text{model}}(x_j)} & \text{if } x_i \in V_{\text{valid}} \\
0 & \text{otherwise}
\end{cases}

In Plain English: The model generates scores for every word in its dictionary. A grammar checker then turns off every word that would break the JSON rules — setting its score to negative infinity so it gets zero probability. The remaining valid words keep their relative rankings, and the model picks from only those. The model is physically incapable of generating invalid output, no matter how much it "wants" to.
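
To see what this masking looks like in code, here is a minimal sketch of a logit processor, assuming the grammar engine (not shown) supplies the set of valid token IDs for the current state:

python
import torch

def apply_grammar_mask(logits: torch.Tensor, valid_token_ids: list[int]) -> torch.Tensor:
    """Mask every token the grammar forbids at the current position.

    `logits` has shape (vocab_size,); `valid_token_ids` is assumed to come
    from a grammar engine that tracks the current parse state.
    """
    mask = torch.full_like(logits, float("-inf"))  # forbid everything by default
    mask[valid_token_ids] = 0.0                    # re-enable the grammar-legal tokens
    return logits + mask

# After masking, softmax gives every invalid token exactly zero probability,
# so sampling can only ever pick a grammar-legal continuation:
# probs = torch.softmax(apply_grammar_mask(logits, valid_ids), dim=-1)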

The state machine behind the mask

The logit processor needs to know which tokens are valid at each position. This requires compiling the target schema into a state machine that tracks the current position in the output structure.

The foundational algorithm was formalized by Willard & Louf (2023) in the paper that introduced the Outlines library. Their key insight: autoregressive text generation can be reformulated as transitions between states of a finite-state machine (FSM). The algorithm works in two phases:

Offline (once per schema):

  1. Convert the JSON Schema into a regular expression
  2. Build an FSM from the regex
  3. For every FSM state, precompute which vocabulary tokens are valid transitions — producing an index: state → {valid_token_ids}

Online (every token):

  1. Look up the current FSM state in the precomputed index — an O(1) hash map lookup
  2. Mask all tokens not in the valid set
  3. Sample from remaining tokens
  4. Advance the FSM to the new state

This precomputation makes constrained decoding remarkably fast at inference time. The computational cost is paid once during schema compilation, then amortized across all generation calls using that schema.
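
As a toy illustration of the two phases (purely for intuition, not how production engines store their indexes), consider a five-token vocabulary and a character-level FSM for a grammar that accepts a quoted integer:

python
# States: 0 = expect opening quote, 1 = expect first digit,
#         2 = inside digits (may close), 3 = accept.
CHAR_EDGES = {
    (0, '"'): 1,
    (2, '"'): 3,
    **{(1, d): 2 for d in "0123456789"},
    **{(2, d): 2 for d in "0123456789"},
}

TOY_VOCAB = {0: '"', 1: "1", 2: "23", 3: "x", 4: '7"'}

def simulate(state, token_text):
    """Walk a multi-character token through the character FSM; None if it breaks the grammar."""
    for ch in token_text:
        state = CHAR_EDGES.get((state, ch))
        if state is None:
            return None
    return state

# Offline phase (once per schema): precompute state -> {valid token id: next state}.
INDEX = {}
for state in (0, 1, 2, 3):
    INDEX[state] = {}
    for token_id, text in TOY_VOCAB.items():
        next_state = simulate(state, text)
        if next_state is not None:
            INDEX[state][token_id] = next_state

# Online phase (every decoding step): O(1) lookup, mask everything else, advance.
state = 0
for token_id in [0, 2, 4]:           # '"', '23', '7"'  ->  the output "237" in quotes
    assert token_id in INDEX[state]  # in a real engine, every other token gets -inf
    state = INDEX[state][token_id]
print(state)  # 3, the accepting state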

FSM versus CFG: two approaches to grammar enforcement

The FSM approach from Willard & Louf works well for flat schemas, but hits a fundamental limitation: regular expressions cannot express recursion. JSON is inherently recursive — objects can contain objects, arrays can contain arrays — so a pure FSM approach must either flatten recursion to a fixed depth or reject recursive schemas entirely.

Context-free grammars (CFGs) solve this by using a pushdown automaton (PDA) — essentially a finite-state machine augmented with a stack. The stack tracks nesting depth, enabling the grammar to match opening braces with closing braces at arbitrary depth.

The challenge is performance. An FSM has a finite, known number of states that can be fully precomputed. A PDA's state depends on the stack contents, creating potentially infinite states and making full precomputation impossible.

XGrammar (Dong et al., MLSys 2025) solved this with an elegant insight: split vocabulary tokens into context-independent and context-dependent sets. Context-independent tokens (~99% of the vocabulary) have validity that depends only on the current grammar position, not the stack contents — these can be fully precomputed into bitmask tables, just like FSM approaches. Context-dependent tokens (~1%) require runtime stack inspection. The result: CFG-level expressiveness with FSM-level performance for the vast majority of tokens, achieving up to 100x speedup over traditional grammar-constrained methods.

Pro Tip: If your schema includes recursive structures (nested comments, tree-like data, recursive $ref definitions), you need a CFG-based engine like XGrammar, llama.cpp grammar mode, or llguidance. FSM-based tools like Outlines will either reject the schema or flatten recursion to a fixed depth.

The BPE tokenization challenge

A subtlety that trips up many implementations: LLMs do not generate characters — they generate tokens, which are variable-length byte sequences produced by BPE (Byte Pair Encoding) tokenization. A single token might be "json" (4 characters), " the" (4 characters including the space), or "\n\n" (2 characters).

This creates a fundamental mismatch. Grammar constraints operate at the character level ("the next character must be a digit"), but the model generates multi-character tokens. Validating a token like "123" requires checking that each of its three characters is valid at its corresponding position in the grammar.

Worse, BPE's canonical tokenization depends on context. The same substring can tokenize differently depending on what precedes it. When constrained decoding forces the model down an unusual token path, it may produce a non-canonical tokenization that the model rarely encountered during training, subtly degrading output quality.

Token healing, pioneered by Microsoft's Guidance library, addresses the boundary problem. When a prompt ends mid-token, the tokenization of the boundary differs from what the model would see if the full string were tokenized jointly. Token healing backs up generation by one token, then constrains the first generated token to begin with the removed token's text — allowing the canonically tokenized version to emerge naturally.
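
You can see the boundary effect with an off-the-shelf tokenizer. The exact splits depend on the vocabulary's merge table, but with GPT-2's BPE the joint encoding typically keeps "://" as a single token, while cutting the prompt after "http:" forces a different, non-canonical split:

python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Canonical tokenization of the full string.
print(tok.convert_ids_to_tokens(tok.encode("http://example.com")))

# What the model sees when the prompt is cut after "http:" and the
# continuation is tokenized separately at that boundary.
print(tok.convert_ids_to_tokens(tok.encode("http:")) +
      tok.convert_ids_to_tokens(tok.encode("//example.com")))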

The provider landscape in 2026

Every major LLM provider now offers native structured output support, but the implementations differ significantly:

Feature | OpenAI | Anthropic | Google Gemini
--- | --- | --- | ---
API parameter | response_format: { type: "json_schema" } | output_config: { format: { type: "json_schema" } } | response_mime_type + response_schema
Release date | Aug 2024 | Nov 2025 (beta); now GA | May 2024 (Google I/O); enhanced schema support later
Enforcement | Constrained decoding | Constrained decoding | Constrained decoding
Recursive schemas ($ref) | Supported | Supported | Limited (improving since Nov 2025)
Strict mode | Yes (requires additionalProperties: false) | Enabled by default at GA | Via response_schema
Supported models | GPT-4o, GPT-4o-mini, o3-mini | Opus 4.6, Sonnet 4.5, Opus 4.5, Haiku 4.5 | Gemini 2.5 Pro/Flash, Gemini 3

Key Insight: OpenAI's strict mode requires that every object sets additionalProperties: false and lists all properties in the required array. Optional fields must be represented as a union with null. This is more restrictive than standard JSON Schema but enables 100% schema compliance via constrained decoding. Note that structured outputs via response_format: json_schema are supported on GPT-4o and o3-mini; newer models like GPT-4.1 support structured outputs through function calling with strict: true and the Responses API.
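
Concretely, a strict-mode-friendly schema looks roughly like the hand-written sketch below; treat the envelope as illustrative and check the current API reference for exact field names:

python
# Every property appears in "required", every object disables
# additionalProperties, and the "optional" field is a union with null.
strict_schema = {
    "name": "support_ticket",  # illustrative schema name
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "assignee": {"type": ["string", "null"]},  # optional = nullable, not omitted
        },
        "required": ["title", "priority", "assignee"],
        "additionalProperties": False,
    },
}

# Passed to the API as:
# response_format={"type": "json_schema", "json_schema": strict_schema}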

JSON mode versus Structured Outputs

A critical distinction that causes confusion: JSON mode guarantees syntactically valid JSON but not schema adherence. Structured Outputs guarantees the output conforms to a specific JSON Schema. Always prefer Structured Outputs when available — JSON mode still requires you to validate and handle schema mismatches manually.

Defining schemas with Pydantic

In the Python ecosystem, Pydantic has become the standard for defining data structures that compile into JSON Schemas:

python
from pydantic import BaseModel, Field
from typing import List, Optional

class SentimentAnalysis(BaseModel):
    score: float = Field(
        ..., ge=0.0, le=1.0,
        description="Normalized sentiment score: 0 = negative, 1 = positive."
    )
    key_entities: List[str] = Field(
        description="Proper nouns or products mentioned in the text."
    )
    reasoning: str = Field(
        description="Brief explanation of why this score was assigned."
    )
    flag_for_review: bool = Field(
        default=False,
        description="True if text contains hate speech or PII."
    )

The description fields serve double duty: they become part of the JSON Schema that the LLM sees as instructions, and they document the data structure for developers. This is the most granular prompt engineering you can do — specifying intent at the individual field level.
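
Recent versions of the official openai SDK accept a Pydantic class directly and translate it into a strict JSON Schema for you. A minimal sketch, assuming your SDK version exposes the beta parse helper:

python
from openai import OpenAI

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I loved the service but the price was steep."}],
    response_format=SentimentAnalysis,  # the Pydantic model defined above
)

result = completion.choices[0].message.parsed  # a validated SentimentAnalysis instance
print(result.score, result.flag_for_review)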

Recursive and polymorphic schemas

Real-world applications often require complex structures. A comment thread with arbitrary nesting:

python
from __future__ import annotations

class Comment(BaseModel):
    author: str
    content: str
    replies: List[Comment] = Field(default_factory=list)

Comment.model_rebuild()  # Resolve forward reference

A search result that could be different types:

python
from typing import Union, Literal

class Product(BaseModel):
    type: Literal["product"]
    name: str
    price: float

class Video(BaseModel):
    type: Literal["video"]
    title: str
    duration_seconds: int

class SearchResult(BaseModel):
    item: Union[Product, Video]

With constrained decoding, when the model generates "type": "video", the decoder immediately masks out price (which belongs to Product) and unmasks duration_seconds. The grammar enforces semantic consistency — you will never get a video with a price field.

The open-source structured generation ecosystem

Outlines (dottxt)

Outlines, born from the Willard & Louf (2023) paper, pioneered the FSM approach to constrained decoding. It compiles JSON Schemas and regex patterns into finite-state machines with precomputed vocabulary indexes. Outlines is model-agnostic and works with any model that exposes logits (HuggingFace Transformers, vLLM, etc.).
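
Usage is a thin layer over a Pydantic model. The sketch below follows the 0.x interface (the 1.0 release reorganized the API, so check the documentation for your installed version); the model name is just an example:

python
import outlines

# Any HuggingFace model that exposes logits will do.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Compiles SentimentAnalysis into an FSM once, then reuses it on every call.
generator = outlines.generate.json(model, SentimentAnalysis)

result = generator("Review: great coffee, painfully slow service. Analyze the sentiment.")
print(result.score)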

Limitation: Complex schemas with features like minItems, maxItems, or large enums can cause compilation times from 40 seconds to over 10 minutes, as the regex representation explodes in size. The JSONSchemaBench benchmark found Outlines had the lowest compliance rate among tested engines, primarily due to these timeouts.

XGrammar

XGrammar (Dong et al., MLSys 2025) from the CMU/MLC team (led by Tianqi Chen) takes the CFG/PDA approach with the context-independent/dependent token split described above. As of 2025, vLLM uses XGrammar by default for structured generation, and SGLang integrates it as well. XGrammar achieves token mask generation in under 40 microseconds per token and near-zero overhead when integrated with serving frameworks.
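
If you serve a model with vLLM, you can pass a JSON Schema per request through the OpenAI-compatible endpoint and let XGrammar enforce it server-side. A sketch, assuming a locally running vllm serve instance and the guided_json extension (flag names have shifted between vLLM releases):

python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model vLLM is serving
    messages=[{"role": "user", "content": "I loved the service but the price was steep."}],
    extra_body={"guided_json": SentimentAnalysis.model_json_schema()},
)

print(resp.choices[0].message.content)  # a JSON string conforming to the schema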

llguidance (Microsoft/Guidance)

llguidance is a Rust-based constrained decoding engine from Microsoft, using a derivative-based regex engine (derivre) for the lexer and an optimized Earley parser for CFG rules. It achieves approximately 50 microseconds CPU time per token for a 128K-token vocabulary with negligible startup costs. In May 2025, OpenAI publicly credited llguidance for foundational work underpinning their Structured Outputs implementation.

Instructor

Instructor provides a high-level interface that patches LLM client SDKs (OpenAI, Anthropic, Google, etc.) to accept Pydantic models directly as the response_model parameter:

python
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=SentimentAnalysis,
    messages=[{"role": "user", "content": "I loved the service but the price was steep."}]
)

print(result.score)      # 0.65
print(result.reasoning)  # "Positive sentiment about service quality..."

Instructor handles schema translation, validation, and automatic retries. It is the easiest entry point for practitioners who want structured outputs without managing the low-level constrained decoding stack.

The semantic quality problem

Constrained decoding guarantees syntactic validity — the output is valid JSON matching your schema. It does not guarantee semantic correctness — the values in that JSON may be wrong, hallucinated, or low-quality.

Research has shown that constraining the output format can actually degrade the model's reasoning ability. Tam et al. (EMNLP 2024 Industry Track) found that JSON-mode constrained decoding hinders reasoning tasks because the model may be forced to output an answer field before completing its chain-of-thought reasoning. The fix is simple but important: always place reasoning fields before answer fields in your schema, allowing the model to think before committing to a conclusion.

More fundamentally, Park et al. (NeurIPS 2024) demonstrated that constrained decoding distorts the model's probability distribution. When high-probability tokens are masked because they violate the grammar, the remaining tokens are renormalized, amplifying relative differences and producing outputs that are syntactically correct but semantically less natural. Their ASAp algorithm progressively corrects this distortion but adds computational overhead.

Pro Tip: When using structured outputs for tasks requiring reasoning (math, logic, analysis), include a reasoning or chain_of_thought string field before the answer fields in your schema. This lets the model "think aloud" within the structured format, mitigating the reasoning degradation that constrained decoding can cause.
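
A minimal illustration of that ordering (the field names are only an example):

python
from pydantic import BaseModel, Field

class MathAnswer(BaseModel):
    # Generated first: the model works through the problem token by token...
    reasoning: str = Field(description="Step-by-step working shown before answering.")
    # ...and only then commits to a final value.
    final_answer: float = Field(description="The final numeric answer.")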

Performance and compliance benchmarks

The JSONSchemaBench benchmark (Geng et al., 2025) evaluated six constrained decoding frameworks against 10,000 real-world JSON schemas of varying complexity:

Engine | Approach | GlaiveAI (Simple) | GitHub Easy (Moderate) | Compilation Speed
--- | --- | --- | --- | ---
Guidance/llguidance | CFG (Earley parser) | 98% | Highest across 6/8 datasets | ~50 µs/token
OpenAI | Constrained decoding | 100% | 97% | N/A (server-side)
XGrammar | CFG (PDA) | 93% | 87% | <40 µs/token
Llama.cpp | CFG (GBNF grammar) | 97% | 88% | Variable
Gemini | Constrained decoding | 100% | 88% | N/A (server-side)
Outlines | FSM (regex) | 96% | 83% | 40s-10min for complex schemas

The SGLang serving framework combined XGrammar with compressed finite-state machines that analyze transition paths and decode multiple tokens in a single step, achieving up to 2x latency reduction and up to 2.5x throughput improvement over uncompressed approaches in JSON decoding benchmarks.

The practical upshot: Despite per-token overhead from grammar checking, structured outputs often reduce total latency because the model generates no conversational filler, stops immediately when the JSON is complete, and eliminates retry logic that unstructured approaches require.

Common pitfalls

Ordering fields for reasoning. Place reasoning or explanation fields before answer or result fields. The model generates tokens sequentially — if the answer comes first, the model cannot reason its way to it.

Schema complexity budget. Deeply nested schemas with large enums or many oneOf branches increase grammar compilation time and per-token overhead. Keep schemas as flat as practical.

Confusing JSON mode with Structured Outputs. JSON mode only guarantees valid JSON, not schema compliance. Use Structured Outputs (json_schema type) whenever available.

Ignoring the description field. Pydantic Field(description=...) values are passed to the LLM as inline instructions. Omitting them forces the model to guess field semantics from names alone.
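
To check what the model will actually see, dump the schema Pydantic generates; the Field descriptions land directly in it (output abridged):

python
import json

print(json.dumps(SentimentAnalysis.model_json_schema(), indent=2))
# {
#   "properties": {
#     "score": {
#       "description": "Normalized sentiment score: 0 = negative, 1 = positive.",
#       ...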

Conclusion

Structured outputs have transformed LLMs from unpredictable text generators into reliable software components. The shift from prompt-based hope to grammar-enforced guarantees — powered by constrained decoding algorithms from Willard & Louf (2023) through XGrammar (2024) and llguidance — enables a new class of agentic workflows where LLMs can reliably chain steps, pass data between functions, and interact with external systems knowing the data interchange format will never break.

The key architectural decision is choosing the right level of abstraction: API-level structured outputs (OpenAI, Anthropic, Google) for simplicity, high-level wrappers like Instructor for rapid development, or low-level engines like XGrammar and llguidance for maximum control in self-hosted deployments.

To understand how these models generate tokens internally, see How Large Language Models Actually Work. For how tokenization affects constrained decoding, read Tokenization Deep Dive: Why It Matters More Than You Think. And for how structured outputs fit into retrieval pipelines, see RAG: Making LLMs Smarter with Your Data.