Through the looking glass of benchmark hacking

Discussion (9 Comments) | Read Original on HackerNews

ej88•about 1 hour ago

This is cool!

I used to work on post-training & evals. It's really hard to make a good eval set and catch all forms of reward hacking. Excited to see more from poolside!

fsh•about 2 hours ago

I don't get the point. The model has presumably been trained on all public GitHub code, so the evaluation is tainted anyway.

adrian_b•about 1 hour ago

A couple of days ago there was another thread about an experiment with many LLMs, in which the Anthropic models especially were found to "cheat" in a large percentage of the benchmarked coding tasks, by searching the Internet for appropriate code and inserting it into the program they had to write.

The conclusion of that study was that, when benchmarking LLMs for coding ability, they should not have access to the Internet if you want to know their intrinsic abilities.

Moreover, this can be worrisome as a more direct copyright infringement than the one caused by training: even if the code they find on the Internet and insert into the generated files is open source, it is pretty certain that it had a license that prohibits the removal of the copyright notice.

htrp•5 minutes ago

> A couple of days ago there was another thread about an experiment with many LLMs, in which the Anthropic models especially were found to "cheat" in a large percentage of the benchmarked coding tasks, by searching the Internet for appropriate code and inserting it into the program they had to write.

Can you find the thread?

ej88•about 1 hour ago

SWE-Bench Pro has a public and a private test set, where the private eval is from proprietary codebases only.

pratio•about 2 hours ago

Are you guys affiliated with https://poolside.fm/ or https://poolsuite.net?

colesantiago•about 1 hour ago

They are not, although Poolside FM was the first one to use the "Poolside" name.

Poolside AI filed a trademark infringement claim against "Poolside FM" that forced Poolside FM to change their name to "Poolsuite".

https://x.com/Poolsuite/status/1398007075435843592

This annoyed the founder of Poolsuite, as they had ripped off his brand.

https://x.com/marty/status/1932386087390818635?s=46

schnitzelstoat•about 3 hours ago

It was an interesting read. Perhaps I misunderstood the part about blocking GitHub, but is it not possible just to block it from accessing that specific repo?

changoplatanero•about 2 hours ago

In theory, yes, blocking a specific repo is possible. In practice it is more difficult, as the repo could be cloned under different names, and you might have hundreds of training tasks that you need to configure this for. So it would be a lot of work to verify, one by one, that you blocked them.
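
To make the difficulty concrete, here is a minimal sketch of what a per-task repo blocklist might look like. Everything in it is an assumption for illustration: the task id, the repo names, and the `BLOCKED_REPOS`, `repo_slug`, and `is_blocked` names are hypothetical, and this is not how poolside or any benchmark harness actually filters traffic.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical per-task blocklist: task id -> "owner/repo" slugs to hide.
# With hundreds of tasks, each entry has to be written and verified by hand.
BLOCKED_REPOS = {
    "task-001": {"example-org/widget-lib"},
}


def repo_slug(url: str) -> Optional[str]:
    """Return a lowercased 'owner/repo' for a github.com URL, else None."""
    parsed = urlparse(url)
    if parsed.hostname not in {"github.com", "www.github.com"}:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None
    repo = parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{parts[0]}/{repo}".lower()


def is_blocked(task_id: str, url: str) -> bool:
    """True if the URL points at a repo blocked for this task."""
    slug = repo_slug(url)
    return slug is not None and slug in BLOCKED_REPOS.get(task_id, set())


# The exact repo is caught:
print(is_blocked("task-001", "https://github.com/example-org/widget-lib/blob/main/a.py"))
# A copy under another owner and name is not, which is the problem described above:
print(is_blocked("task-001", "https://github.com/someone-else/widget-copy"))
```

The second call returns `False` because a name-based rule knows nothing about the contents of the copy, so every clone or mirror would need its own entry.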
