
Discussion (53 Comments)
I'd say the example actually does (vaguely) suggest that Qwen might be overfitting to the Pelican.
But in terms of making something physically plausible, Opus certainly got a lot closer
For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.
https://redd.it/1slz38i
https://x.com/JeffDean/status/2024525132266688757
If anything, the disastrous Opus 4.7 pelican shows us they don't pelicanmaxx.
I guess initially it would have been a silly way to demonstrate the effect of model size. But the size of the largest models stopped increasing a while ago; recent improvements are driven principally by optimizing for specific tasks. If you had some secret task that you knew they weren't training for, you could use it as a benchmark for how much the models are genuinely improving versus overfitting to their training set, but this is not that.
https://blog.brokk.ai/introducing-the-brokk-power-ranking/
This doesn't hold if some models trained on the benchmark and some didn't, but you can fix this by deliberately fine-tuning all models for the benchmark before comparing them. For more in-depth discussion of this, see https://mlbenchmarks.org/11-evaluating-language-models.html#...
Pelican: saturated!
But that Opus pelican, though?
It's pretty good at finding bugs, but not so good at writing patches to fix them.