The model may return JSON that matches the schema you want, but with hallucinated values, such as an `invoice_date` that is off by 2 months or a transcript array in the wrong order. The JSON is valid, but the values are not.
Structured output today is a big part of using LLMs, especially when building deterministic workflows.
Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, and not the actual values within the produced JSON.
So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON schema pass rate, type correctness, and value accuracy across three modalities: text, image, and audio.
For our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and cross-checked by an LLM, so a missing or hallucinated value is counted as wrong.
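To make the distinction between schema validity and value accuracy concrete, here is a minimal sketch in Python. It is not the SOB scoring code: `schema_pass` is a simplified stand-in for a real JSON Schema validator, `value_accuracy` assumes exact-match scoring over leaf values, and the record itself is made up for illustration.

```python
from typing import Any, Dict

# Hypothetical mapping from JSON Schema type names to Python types.
# A real harness would use a full JSON Schema validator instead.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def type_ok(value: Any, type_name: str) -> bool:
    # In Python, bool is a subclass of int, so exclude it for numeric types.
    if type_name in ("number", "integer") and isinstance(value, bool):
        return False
    return isinstance(value, JSON_TYPES[type_name])


def schema_pass(output: Dict[str, Any], schema: Dict[str, Any]) -> bool:
    """Simplified check: required keys exist and top-level types match."""
    if any(key not in output for key in schema.get("required", [])):
        return False
    return all(
        type_ok(output[key], spec["type"])
        for key, spec in schema.get("properties", {}).items()
        if key in output
    )


def flatten(value: Any, path: str = "") -> Dict[str, Any]:
    """Map every leaf value to a path such as 'transcript[0]'."""
    leaves: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            leaves.update(flatten(child, f"{path}.{key}" if path else key))
    elif isinstance(value, list):
        for i, child in enumerate(value):
            leaves.update(flatten(child, f"{path}[{i}]"))
    else:
        leaves[path] = value
    return leaves


def value_accuracy(output: Dict[str, Any], truth: Dict[str, Any]) -> float:
    """Fraction of ground-truth leaves reproduced exactly (assumed metric)."""
    expected = flatten(truth)
    predicted = flatten(output)
    if not expected:
        return 1.0
    correct = sum(
        1 for p, v in expected.items() if p in predicted and predicted[p] == v
    )
    return correct / len(expected)


# Made-up record: valid structure, wrong date, transcript in the wrong order.
schema = {
    "type": "object",
    "required": ["vendor", "invoice_date", "transcript"],
    "properties": {
        "vendor": {"type": "string"},
        "invoice_date": {"type": "string"},
        "transcript": {"type": "array"},
    },
}
truth = {"vendor": "Acme", "invoice_date": "2025-03-14", "transcript": ["hello", "goodbye"]}
output = {"vendor": "Acme", "invoice_date": "2025-05-14", "transcript": ["goodbye", "hello"]}

print(schema_pass(output, schema))    # True
print(value_accuracy(output, truth))  # 0.25
```

Because array elements are keyed by index, a correctly transcribed but reordered array scores as wrong. Here the output passes the schema check while only one of four leaf values is correct.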
Open-source models are doing pretty well, with GLM-4.7 coming in second, right after GPT-5.4.
We noticed the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, Gemini-2.5-Flash leads audio.
For example, GPT-5.4 ranks 3rd on text but 9th on images.
Model size is not a predictor, either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.
Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.
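A field-level check of the kind described here can be sketched as follows. The helper name `field_mismatches` is hypothetical, and exact string equality is an assumption; the benchmark's actual comparison rule is not stated in this post.

```python
from typing import Any, Dict, List, Tuple


def field_mismatches(
    output: Dict[str, Any], truth: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (field, expected, got) for every top-level field that differs."""
    missing = object()  # sentinel so a legitimate None value is not confused with absence
    mismatches = []
    for field, expected in truth.items():
        got = output.get(field, missing)
        if got is missing or got != expected:
            mismatches.append((field, expected, None if got is missing else got))
    return mismatches


# Values taken from the audio example above.
truth = {"target_market_age": "15 to 35 years"}
output = {"target_market_age": "25 to 35"}

# A type-only guardrail passes: the returned value is a string, as required.
assert isinstance(output["target_market_age"], str)

print(field_mismatches(output, truth))
# [('target_market_age', '15 to 35 years', '25 to 35')]
```

Exact equality is strict: it would also flag a harmless variant such as "15 to 35" without "years", so a real harness would need to decide on a normalization policy.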
Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.

Discussion (13 Comments)
Why no Opus 4.7? Why is Gemini 3.1 Pro missing?
If there is some other criterion (e.g. models within certain time or budget), great - just make it explicit.
When I see "Top 5 at a glance" and it is missing key frontier models, I am (at best) confused.
For the benchmark, the setup was kept consistent across all models, and Opus and 3.1 Pro would typically be overkill and expensive even with reasoning off.
Good point tho, will add this point in the blog too :)
Also, the benchmark is open source, so anyone can run a model on it and create a PR too. The leaderboard is dynamic and will automatically add it.
If you want to avoid using Opus 4.7, then why include GPT-5.4 (unless with a disclaimer that it is on a low reasoning setting, or a check that on medium its price is comparable with Haiku/Flash)?
Also, usually it is good to look at the newest model. Gemini 2.5 Flash is quite dated. Gemini 3.1 Flash Lite is the new one (https://openrouter.ai/google/gemini-3.1-flash-lite-preview).
Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.
While most models were great at producing schema-valid JSON, they were pretty bad at producing accurate values.
In the graph you'll see an almost 20%-30% drop between the JSON schema pass rate and the value accuracy.
Check out the paper section "6.3 Structured Decoding Ablation"
Paper: https://arxiv.org/pdf/2604.25359
We ran the comparison and saw no difference. Since some models don't support structured decoding, we used greedy decoding on all models to keep the bench consistent.
> Our goal is to be the best general model for deterministic tasks
I'm sorry but this simply doesn't make sense. If you want a deterministic output don't use an LLM.
Many developer workflows use LLMs to produce structured artifacts because of their flexibility in consuming unstructured inputs.
> "don't use an LLM"
Partially agree. That's what we're building towards at interfaze.ai: a hybrid between transformers (LLMs) and traditional CNN/DNN architectures to solve this problem of "deterministic" output. This gives devs the flexibility of custom schema definitions and unstructured input while still getting high-quality structured output like you would get from a CNN model like EasyOCR.
The industry is moving toward using LLMs for more and more deterministic tasks, so this benchmark allows us to measure it.