Static vulnerability discovery benchmarks become outdated quickly. Cases leak into training data, and scores start measuring memorization. The monthly refresh keeps the test set ahead of contamination — or at least makes the contamination window honest.
Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.
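To make the per-case flow concrete, here is a minimal sketch of that loop. Every name in it (`Curator`, `Finder`, `Judge`, the `Report` fields) is an illustrative assumption, not the benchmark's published code:

```python
from dataclasses import dataclass, field

MAX_SHELL_STEPS = 24  # the post's per-case shell budget for the Finder

@dataclass
class Report:
    """Structured report the Finder must submit (fields are illustrative)."""
    vuln_class: str          # e.g. "path traversal"
    file: str
    function: str
    trace: list[str] = field(default_factory=list)  # source-to-sink reasoning

def run_case(curator, finder, judge, advisory, repo, sink_hints):
    # Curator reads the public advisory and builds a blinded answer key.
    answer_key = curator.build_key(advisory)

    # Finder explores the checked-out repo under a bounded shell budget.
    # It never sees the advisory or the patch, only the sink hints.
    for _ in range(MAX_SHELL_STEPS):
        cmd = finder.next_command(repo, sink_hints)
        if cmd is None:  # Finder decides it has gathered enough evidence
            break
        finder.observe(repo.run_shell(cmd))

    report: Report = finder.write_report()

    # Judge scores the blinded submission against the answer key without
    # knowing which model produced it.
    return judge.score(report, answer_key)
```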
Only repos with 10k+ stars qualify. A diversity pass prevents any single repo from dominating the set. Ambiguous advisories (merge commits, multi-repo references, unresolvable refs) are dropped.
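One way those curation rules could be expressed, with field names and the per-repo cap as assumptions rather than the benchmark's actual parameters:

```python
MIN_STARS = 10_000

def keep(advisory) -> bool:
    """Drop advisories the post calls ambiguous (field names are assumed)."""
    if advisory.repo_stars < MIN_STARS:
        return False
    if advisory.fix_is_merge_commit:        # merge commits obscure the real fix
        return False
    if len(advisory.referenced_repos) > 1:  # multi-repo references
        return False
    if not advisory.refs_resolve:           # unresolvable refs
        return False
    return True

def diversity_pass(cases, max_per_repo=3):
    """Cap how many cases any single repo contributes (cap value is a guess)."""
    kept, per_repo = [], {}
    for case in cases:
        n = per_repo.get(case.repo, 0)
        if n < max_per_repo:
            kept.append(case)
            per_repo[case.repo] = n + 1
    return kept
```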
Currently evaluating GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, GLM-5.1, and Kimi K2.5. All traces are public.
Methodology: https://ndaybench.winfunc.com/methodology
Live Leaderboard: https://ndaybench.winfunc.com/leaderboard
Live Traces: https://ndaybench.winfunc.com/traces

Discussion (11 Comments)
I'd love to see how the model we serve, Qwen3.5 122B A10B, stacks up against the rest on this benchmark. AI Router Switzerland (aiRouter.ch) can sponsor free API access for about a month if that helps for adding it to the evaluation set.
Curator, answer key, Finder, shell steps, structured report, sink hints… I understand nothing. Did you use an LLM to generate this HN submission?
It looks like a standard LLM-as-a-judge approach. Do you manually validate or verify some of the results? Done poorly, the results can be very noisy and meaningless.
You don't really need manual verification for these; the CVEs (vulnerabilities) are public and can be programmatically validated.
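For what it's worth, a rough sketch of the kind of programmatic check that is possible. This uses the public OSV API; whether a given CVE ID resolves directly (rather than via a GHSA alias) depends on the database, and the matching logic here is deliberately crude:

```python
import json
import urllib.request

def fetch_advisory(vuln_id: str) -> dict:
    """Pull the public advisory record for an ID from the OSV database."""
    url = f"https://api.osv.dev/v1/vulns/{vuln_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def affected_matches(advisory: dict, reported_package: str) -> bool:
    """Crude check: does the reported package appear in the advisory's
    affected list? A real validator would also compare versions and files."""
    return any(
        entry.get("package", {}).get("name") == reported_package
        for entry in advisory.get("affected", [])
    )
```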
Must have.
> The Finder never sees the patch.
I wasn’t worried that this eval would show the answer to the model before evaluating it. Seems like requirements leaked into this post.
Will incorporate false-positive rates into the rubric from the next run onwards.
At winfunc, we spent a lot of research time taming these models to drive down false positives (the rate is high!), so this does feel important enough to document. Thanks!