
Discussion (12 Comments)

Reubend, about 14 hours ago
Because the website doesn't seem to show any sample size of runs, I assume they ran it once across the suite.

The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

I don't see this as evidence that Opus 4.6 has gotten worse.

slurpyb, about 7 hours ago
I would love to know what you’re doing in the harness to not feel the total degradation in experience now in comparison to December & January.
bsder, about 10 hours ago
> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

And how is that an excuse?

I don't care about how good a model could be. I care about how good a model was on my run.

Consequently, my opinion on a model is going to be based around its worst performance, not its best.

As such, this qualifies as strong evidence that Opus 4.6 has gotten worse.

senko, about 2 hours ago
>> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

> And how is that an excuse? […] this qualifies as strong evidence…

This qualifies as nothing, because of how random processes work; that's what the GP is saying. The numbers are not reliable if it's just one run.

If this is counter-intuitive, a refresher on basic statistics and probability theory may be in order.
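The single-run unreliability senko describes can be illustrated with a small simulation. The numbers below (100 tasks, a 70% true pass rate) are purely hypothetical and are not taken from the benchmark under discussion:

```python
import random

# Hypothetical illustration: suppose a model truly passes each of 100
# benchmark tasks independently with probability 0.70. A single run's
# observed score still swings noticeably from sampling noise alone.

def one_run(n_tasks=100, true_pass_rate=0.70):
    """Simulate one benchmark run; each task passes independently."""
    return sum(random.random() < true_pass_rate for _ in range(n_tasks)) / n_tasks

scores = [one_run() for _ in range(1000)]
spread = max(scores) - min(scores)
# Across 1000 simulated runs, the best and worst single-run scores
# routinely differ by well over 15 percentage points, even though the
# underlying model never changed.
```

Under these assumptions, a double-digit gap between two individual runs says very little on its own; averaging many runs is what shrinks the noise.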

dlahoda, about 10 hours ago
Are models really nondeterministic?
Rury, about 9 hours ago
People are describing the results when they say models are non-deterministic. Give it the same exact input twice, and you'll get two different outputs. Deterministic would mean the same input always gives the same output.
loneboat, about 10 hours ago
Yes. Look up LLM "temperature" - it's an internal parameter that tweaks how deterministically they behave.
csomar, about 10 hours ago
The models are deterministic, the inference is not.
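The temperature mechanism mentioned above can be sketched in a few lines. This is an illustrative toy, not any vendor's actual decoder; the logit values are invented:

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.2]  # hypothetical next-token scores from a fixed model

# Greedy decoding (the temperature -> 0 limit) is deterministic:
# the same logits always pick the same token.
greedy = logits.index(max(logits))

# Sampling at temperature > 0 is not: the same logits can yield
# different tokens on different runs.
probs = softmax_with_temperature(logits, temperature=1.0)
token = random.choices(range(len(logits)), weights=probs)[0]
```

This matches csomar's framing: the forward pass producing `logits` is a fixed function of the input, while the sampling step at the end introduces the run-to-run variation.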
ehtbanton, about 7 hours ago
Benchmarks like this one are designed to thoroughly test the model across several iterations. 15% is a MASSIVE discrepancy.

Come on Anthropic, admit what you're doing already and let us access your best models unhindered, even if it costs us more. At the moment we just all feel short-changed.