Claude Opus 4.6 accuracy on BridgeBench hallucination test drops from 83% to 68%

bbratao•about 16 hours ago•12 comments

Discussion (12 Comments)

Reubend•about 14 hours ago

Because the website doesn't seem to show any sample size of runs, I assume they ran it once across the suite.

The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

I don't see this as evidence that Opus 4.6 has gotten worse.
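To put rough numbers on the run-to-run noise argument, here is a minimal sketch of a two-proportion z-test on the two scores from the title (83% and 68%). The benchmark's question count is not stated anywhere in this thread, so the values of `n` below are purely hypothetical, and the function names are illustrative. The test also assumes a single run per score and independent questions, which may not match how BridgeBench is actually run.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """z statistic for the difference between two independent accuracy scores."""
    # Pooled accuracy under the null hypothesis that both runs share one true rate.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical question counts; the real suite size is unknown.
for n in (30, 100, 500):
    z = two_proportion_z(0.83, 0.68, n, n)
    print(f"n={n}: z={z:.2f}, p={two_sided_p(z):.4f}")
```

Under these assumptions, a 15-point gap on about 30 questions stays below the usual |z| > 1.96 threshold, while on 100 or more questions it clears it. Whether the drop is noise therefore depends heavily on the sample size that the site does not report.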

ehtbanton•about 7 hours ago

Benchmarks like this one are designed to thoroughly test the model across several iterations. A drop of 15 percentage points is a MASSIVE discrepancy.

Come on Anthropic, admit what you're doing already and let us access your best models unhindered, even if it costs us more. At the moment we just all feel short-changed.