Teaching Claude Why

Discussion (2 Comments)

TyrunDemeg101•about 6 hours ago

The 0% blackmail rate is the headline, but the post's own footnote is the more interesting line: "results on more recent models may be confounded by the presence of information about the evaluation in the pre-training corpus." The blackmail eval was published a year ago and got broad coverage. It's almost certainly in the pre-training data of every model trained since. So a current model "passing" might just be recognizing a test it has already seen.

Setting contamination aside, there's a structural issue behavioral testing can't solve. You can falsify "this model is aligned" with one counterexample. You can't verify it with any number of passing tests. The next scenario you try might be the one that breaks the claim. That's not a methodology problem but a logical one, and the same asymmetry shows up, in different forms, in interpretability and theoretical guarantees.
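
To put a rough number on that asymmetry, here is a minimal sketch. It assumes something the real situation does not guarantee: that test scenarios are independent draws from the same distribution the model will face in deployment. The function name `failure_rate_upper_bound` is hypothetical and not from the post. Under that assumption, n clean passes only bound the per-scenario failure rate from above (roughly 3/n at 95% confidence, the "rule of three"); they never push it to zero.

```python
def failure_rate_upper_bound(passing_trials: int, confidence: float = 0.95) -> float:
    """Largest per-trial failure rate still consistent with zero failures.

    Assumes independent, identically distributed trials (an idealization;
    real eval scenarios are hand-picked and may be contaminated).
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if passing_trials <= 0:
        # No evidence at all: any failure rate is still possible.
        return 1.0
    # Zero failures in n trials has probability (1 - p)^n.
    # Solve (1 - p)^n = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / passing_trials)


if __name__ == "__main__":
    for n in (10, 100, 1_000, 1_000_000):
        bound = failure_rate_upper_bound(n)
        print(f"{n:>9} passing tests -> failure rate could still be up to {bound:.2e}")
```

The bound shrinks as tests accumulate but stays strictly positive, and it says nothing about scenarios outside the tested distribution, which is the case the comment is worried about.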

The most important sentence in the post is buried in the discussion: "our auditing methodology is not yet sufficient to rule out scenarios in which Claude would choose to take catastrophic autonomous action." That sentence describes the actual epistemic situation. The headline number doesn't.

Expanded on this here: https://aaronjholbrook.com/thoughts/2026-05-08-the-negative-...

mitchbob•about 6 hours ago

Dupe of

https://news.ycombinator.com/item?id=48066592 (43 comments)
