These seem to be different tests? One has 6 tasks, the other has 30.
yorwba • about 2 hours ago
Yeah, of those 6 tasks, only "halluc-doc-http-handler" isn't within 1% of the previous result. 86.6% is 13/15 rounded down, so if they sampled 15 attempts for that task, the probability of getting 100% when the true success rate was 13/15 would be (13/15)^15 ≈ 0.12 > 0.11, which is not all that unlikely.
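For anyone who wants to check the arithmetic, here's a minimal sketch. The 15-attempt sample size is an assumption inferred from 86.6% ≈ 13/15, not something the leaderboard states:

```python
# Sanity check: if the true per-attempt success rate is 13/15 and the
# task is sampled 15 times independently, how likely is a perfect run?
p_success = 13 / 15   # true success rate implied by the 86.6% score
n_attempts = 15       # assumed sample size (inferred, not documented)

p_all_perfect = p_success ** n_attempts
print(f"P(15/15 successes | p = 13/15) = {p_all_perfect:.4f}")  # ≈ 0.1169
```

So a perfect score on a single small-sample task is entirely consistent with an unchanged underlying success rate.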
ekidd • about 3 hours ago
Thank you. Those are not the same test at all. I agree that something weird is up with Opus. But this post doesn't actually prove anything, and the title is misleading editorializing.
jiwidi • about 5 hours ago
See the original Opus 4.6 sitting at 16% hallucination and the retest on April 12th at 33%.
They must be doing some quantization or optimization to meet demand; otherwise, why would model performance degrade this much? It's been crazy for me personally.
dns_snek • about 2 hours ago
I'm the last person to defend any of these companies, but it's not a retest. The set of tasks is clearly different, and results for the original tasks are nearly identical. The only difference is one task where it previously had fabrications = 0 and now has 1, which dropped that task's score from 100% to 86%.
Combining multiple tests on the same leaderboard like this is nonsense; there should be a separate leaderboard for the new tasks where every model is tested again.
Putting it on the original leaderboard as "Opus 4.6 (April 12)" is so obviously inappropriate that it smells like deception. You could say that the leaderboard is hallucinated.
metalman • about 3 hours ago
Do people get the simple fact that data sets are becoming more and more polluted? First from AI output, and second from an increasingly deranged human population that is hyper-focused on gaining some extra financial advantage at ANY cost, a huge part of which is "reputational management", again with zero limits on misinformation.
A sane society would pull the plug, now.