Structured Output Comparison: Which JSON Mode Actually Works
OpenAI Structured Outputs, Anthropic tool use, Pydantic validation, Outlines. We benchmarked all four on real schemas from production agents. The reliability gaps surprised us, especially at deeper nesting.
By Arjun Raghavan, Security & Systems Lead, BIPI · April 22, 2024 · 8 min read
Every team building agents reaches the same fork: how do you force the model to emit JSON your code can actually parse? We have shipped systems using each of the four mainstream approaches over the past two years. They are not interchangeable. The reliability profile depends heavily on schema complexity and which model you are running.
The four approaches
OpenAI Structured Outputs (the json_schema response_format mode) constrains generation at the token level using grammar-based decoding. Anthropic tool use forces the model into a tool call whose input schema defines the output shape. Pydantic validation parses free-form output and retries on validation failure. Outlines is an open-source library that does grammar-constrained decoding for any Hugging Face model.
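To make the first of these concrete, here is a minimal sketch of a Structured Outputs call with the openai Python SDK. The invoice fields are illustrative, not one of our benchmark schemas.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative flat schema. Strict mode requires additionalProperties: false
# and every property listed under "required".
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "issued_on": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_id", "issued_on", "total"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the invoice fields from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema, "strict": True},
    },
)

# Token-level grammar constraints mean the content parses as schema-valid JSON;
# refusals and truncated responses are the remaining failure modes.
print(resp.choices[0].message.content)
```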
Benchmark setup
We took five schemas from real production systems: a flat 8-field invoice schema, a 3-level nested clinical note, a discriminated union with 6 variants, a recursive comment-thread schema, and a 200-field product catalogue schema. We ran each schema through every approach on 500 prompts and tracked four failure modes: missing required field, type mismatch, invalid JSON, and hallucinated extra field.
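The classification step was mechanical. A sketch of the idea using the jsonschema package, not our actual harness:

```python
import json
import jsonschema

def classify_failure(raw: str, schema: dict) -> str:
    """Bucket a model response into one of the benchmark's failure modes."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    validator = jsonschema.Draft202012Validator(schema)
    for error in validator.iter_errors(payload):
        # error.validator names the JSON Schema keyword that failed.
        if error.validator == "required":
            return "missing_required_field"
        if error.validator == "type":
            return "type_mismatch"
        if error.validator == "additionalProperties":
            return "hallucinated_extra_field"
    return "ok"
```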
Flat schemas: everything works
On flat schemas with under 20 fields, all four approaches hit above 96 percent reliability. OpenAI Structured Outputs was best at 99.8 percent. The remaining 0.2 percent were almost all timeouts, not schema violations. If your data is flat, pick the approach that fits your model provider and move on.
Deeper nesting separates the approaches
At three levels of nesting the gap opened up. Pydantic-with-retry fell to 78 percent on the first attempt, recovering to 94 percent after one retry. OpenAI Structured Outputs held above 95 percent on three-level schemas but dropped to 87 percent on recursive structures. Anthropic tool use hit 91 percent on three-level schemas and 84 percent on recursive ones. Outlines on Llama 3.3 70B hit 96 percent, but at 2.4x the latency of GPT-4o.
Discriminated unions are the hard case
Discriminated unions, where one field determines the shape of the rest, are where most approaches stumble. The model has to commit to the discriminator value first and then produce a consistent payload. Anthropic tool use was strongest here at 94 percent, which surprised us because the documentation does not emphasise this. OpenAI Structured Outputs hit 89 percent, mostly failing by producing fields from the wrong variant. Pydantic-with-retry hit 91 percent after two retries, which is competitive but slow.
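For readers unfamiliar with the pattern, here is what a discriminated union looks like as a Pydantic v2 model. Ours had six variants; this sketch shows two, with made-up payment types.

```python
from typing import Literal, Union
from pydantic import BaseModel, Field

class CardPayment(BaseModel):
    method: Literal["card"]        # the discriminator value
    last4: str
    auth_code: str

class BankTransfer(BaseModel):
    method: Literal["bank_transfer"]
    iban: str
    reference: str

class PaymentEvent(BaseModel):
    # Pydantic dispatches on "method" before validating the rest,
    # mirroring the commitment the model has to make during generation.
    payment: Union[CardPayment, BankTransfer] = Field(discriminator="method")
```

The wrong-variant failures we saw correspond to outputs like method set to "card" paired with an iban field.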
Schema complexity limits
OpenAI Structured Outputs has hard limits: 100 object properties total, 5 levels of nesting, no patternProperties, no additionalProperties: true. We hit the 100-property limit on a real client schema and had to split it into sub-schemas with a coordinator prompt. Anthropic has softer limits, documented as best practices, but the practical ceiling is similar.
- OpenAI Structured Outputs: 100 object properties total, 5 levels of nesting, strict JSON Schema subset
- Anthropic tool use: no hard documented limit but degrades above 4 levels of nesting
- Outlines: limited by what the grammar engine supports, sometimes hangs on ambiguous grammars
- Pydantic-with-retry: no limits but cost scales with retry count
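The splitting we did for the 100-property limit is mechanical when the top level is a flat object of independent fields; nested or interdependent fields need manual grouping. A sketch under that assumption:

```python
def split_schema(schema: dict, max_props: int = 100) -> list[dict]:
    """Split a flat object schema into sub-schemas that each fit under the
    property limit; a coordinator prompt runs one extraction per sub-schema
    and merges the results."""
    props = list(schema["properties"].items())
    required = set(schema.get("required", []))
    chunks = []
    for i in range(0, len(props), max_props):
        chunk = dict(props[i:i + max_props])
        chunks.append({
            "type": "object",
            "properties": chunk,
            "required": [k for k in chunk if k in required],
            "additionalProperties": False,
        })
    return chunks
```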
Latency comparison
OpenAI Structured Outputs adds 0 to 100 ms over an equivalent text completion because the grammar work happens during generation. Anthropic tool use is similar. Pydantic-with-retry has zero overhead on first try but a full additional model call on every retry. Outlines depends on hardware. On an A100 with vLLM and a 70B model we measured 280 to 600 ms overhead per call.
Validation-and-retry feels simpler, but the tail latency is brutal: p99 doubles with one retry and triples with two.
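The retry loop is simple enough that the cost hides in plain sight. A sketch, assuming a hypothetical call_model function that wraps your provider client and returns raw text:

```python
from pydantic import BaseModel, ValidationError

def parse_with_retry(prompt: str, model_cls: type[BaseModel], max_retries: int = 2):
    """Validate free-form output; every retry is a full extra model call,
    which is exactly what doubles p99 latency."""
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        raw = call_model(messages)  # hypothetical provider wrapper
        try:
            return model_cls.model_validate_json(raw)
        except ValidationError as err:
            # Feed the validation errors back so the model can self-correct.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your JSON failed validation:\n{err}\nReturn corrected JSON only.",
            })
    raise RuntimeError(f"schema validation failed after {max_retries} retries")
```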
What we run in production
Our default stack today: OpenAI Structured Outputs on OpenAI models, Anthropic tool use on Claude, and Pydantic validation as a second layer regardless of provider. The second layer catches the rare cases where grammar-constrained output is technically valid JSON but semantically broken: dates in the wrong century, negative quantities, foreign-key references to deleted rows.
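The second layer is ordinary Pydantic validators over the already-parsed output. A sketch of the kind of semantic checks we mean, with illustrative field names:

```python
from datetime import date
from pydantic import BaseModel, field_validator

class LineItem(BaseModel):
    sku: str
    quantity: int
    delivered_on: date

    @field_validator("quantity")
    @classmethod
    def quantity_is_positive(cls, v: int) -> int:
        # Grammar-valid JSON can still carry a negative quantity.
        if v <= 0:
            raise ValueError("quantity must be positive")
        return v

    @field_validator("delivered_on")
    @classmethod
    def date_in_plausible_range(cls, v: date) -> date:
        # Catches the "wrong century" class of error.
        if not (2000 <= v.year <= 2100):
            raise ValueError("date outside plausible range")
        return v
```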
When to reach for Outlines
Outlines wins when you are running on-prem models for compliance or cost reasons and you cannot use a managed provider's structured output. It also wins when you need exotic constraints like regex matching on substrings inside fields. The tradeoff is operational: you are running a vLLM cluster and managing GPU capacity. Most teams underestimate that overhead.
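A sketch of that exotic-constraint case, per the Outlines 0.x API (the interface has moved since, so check current docs); the model name and ID format are illustrative:

```python
import outlines

# Loads a Hugging Face model locally; in production this sat behind vLLM.
model = outlines.models.transformers("meta-llama/Llama-3.3-70B-Instruct")

# Constrain the output to an exact regex, e.g. an internal order-ID format
# that no managed provider's structured output mode can express.
generator = outlines.generate.regex(model, r"ORD-[0-9]{4}-[A-Z]{2}[0-9]{6}")
order_id = generator("Extract the order ID from: ...")
```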
Pick the structured output approach that matches your provider, layer Pydantic over the top, and budget for one retry in your latency SLO. That covers 99 percent of agent use cases. The remaining 1 percent are the kind of problem that justifies a custom solution and a senior engineer.
Read more field notes, explore our services, or get in touch at info@bipi.in.