Discussion (64 Comments), originally on HackerNews
It basically jailbroke the "no security vulnerabilities" guardrails, not in any clever way but just by fixing the bugs, and it produced exploit code just by writing test cases to make sure they are fixed. So a human only needs to look at the code and tests to get the vulnerabilities and the exploit components.
What makes this so beautiful IMHO is that it's a trivial jailbreak, but also close to unfixable. At least not without making the model close to useless for normal development (it refuses to fix bugs/write code) or making it a major liability (it silently pretends it didn't see bugs and silently avoids fixing them, which for a human would count as intentional sabotage and might involve criminal liability).
[1]: https://en.wikipedia.org/wiki/Reduction_(complexity)
Opus can very much "fix the code". Quite possibly even Sonnet can. This is a big fat nothingburger and it's increasingly looking like the political restriction of Fable at least (not Mythos itself, of course) was arbitrary and based on the flimsiest pretext.
When Claude blocked discussion of ASI, it was circumvented by adding to the system prompt:
https://xcancel.com/xundecidability/status/18262924806289163...
Model requires proof that you are a legitimate developer of that piece of software.
Every Anthropic/OpenAI account will have a list of projects the model is allowed to work on for security issues.
> A subsequent investigation found that the campaign to insert the backdoor into the XZ Utils project was a culmination of over two years of effort, starting in 2021, by a user going by the name "Jia Tan". They used sock puppetry in a pressure campaign against the original maintainer of XZ Utils, eventually being given maintainer permissions on the project.
If the acceptance criterion is “would prevent every single past instance and every imaginable future instance”, then yes, no mitigation is ever sufficient to address any problem in the world, so we might as well give up.
But I don’t think that’s the right lens to use.
You _cannot_ say that Mythos is super dangerous and can only be rolled out to certain people, but then release Fable with anything other than bulletproof cyber denials.
Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work.
So you've ended up in a situation where Anthropic are simultaneously claiming it's an incredibly dangerous model _and_ there are (minor, potentially) problems with the security "protections".
As technical people we understand that nothing can be perfect, esp in LLM world. But all my non technical friends were really confused how they had managed to make the model "safe" so quickly when it was released and the general sentiment was it shouldn't have been released - and now to an outsider I think it looks like it was never safe at all to release, so I can totally see how the current US administration have got themselves very upset with it.
_Even if_ there was no political bad will, it's a bit of a silly scenario to end up in, and really quite easily foreseen.
As a scientist who repeatedly ran into the classifier-based denials: it appears Anthropic’s strategy to make denials more robust, at the cost of many false positives, was to have a separate classifier processing both input and output tokens, at an extremely simple, almost keyword-search level. One weakness of this approach is that it only catches things that use the right keywords: it is in some sense weak exactly where an LLM-based classifier would be stronger.
Work on abstract, closer-to-CS algorithms that used chemistry terminology was blocked immediately, while work directly relevant to chemistry/biology experiments, such as writing code to process images from a very specific microscopy setup relevant primarily to biological samples, was never blocked at all, because it happened to never use relevant keywords.
That’s consistent with this situation: finding and fixing bugs in the context of looking for bugs perhaps happened to never use words like ‘exploit’ or ‘cybersecurity’.
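To make the keyword-level behavior described above concrete, here is a minimal sketch of that kind of filter. Everything in it is an assumption for illustration: the keyword list, the `keyword_flag` function, and the sample prompts are hypothetical, and nothing here is known about how Anthropic's actual classifier is built.

```python
import re

# Hypothetical keyword list, purely illustrative; the real classifier's
# vocabulary (if it works this way at all) is not public.
BLOCKED_KEYWORDS = {"exploit", "cybersecurity", "malware", "reagent"}


def keyword_flag(text: str) -> bool:
    """Return True if any blocked keyword appears as a whole word in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not words.isdisjoint(BLOCKED_KEYWORDS)


prompts = [
    "Write an exploit for this parser",                # on-topic, flagged
    "Sort a list of reagent names alphabetically",     # harmless, still flagged
    "Fix this code and add regression tests",          # security-relevant, not flagged
    "Segment cells in fluorescence microscopy images", # domain work, not flagged
]

for prompt in prompts:
    print(f"{keyword_flag(prompt)!s:5}  {prompt}")
```

A filter like this only reacts to surface vocabulary, so it produces both the false positives (the reagent-sorting prompt) and the misses (the "fix this code" prompt) that the comments above describe.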
Exactly. AI safety is nonsensical. You cannot define the set of "bad strings". The billion monkeys with typewriters are eventually going to be able to produce them. Any "safety" system for constraining LLM output is going to have a nonzero leak rate.
But on the other hand, this is also irrelevant, unless you're irresponsible enough to connect an LLM to something that actually matters.
Yes, it's going to alarmingly accelerate vulnerability finding. But, as we know from decades of security research, that's already a three-way problem between the devs, the black hats, and the white hats.
Let's not pretend the strategy of "the US will always have a technological advantage and veto over China" will work either.
The genie is out of the bottle either way.
Unless we believe Anthropic has a wizard or superhero secreted away that no one else can replicate.
https://www.lutasecurity.com/post/the-fable-5-export-control...
I wonder how that is involved?
Feels like the title isn't really giving the full context of what they ended up actually seeing, despite what the lede implies multiple times.
Still, ban seems stupid... Still no actual leak of the full "third-party research paper"?
It's explained better in the original source. I don't agree with it, but I understand it now, but I also think we need to move past it.
So, basically the model didn't agree to expose possible vulnerabilities but agreed to patch them?
Regardless of the request to take Fable 5 down: why is asking the model to show vulnerabilities blocked if fixing them is not? Is it based on an assumption about intent?
I don't quite get the benefit of limiting it. So if anyone can explain it better, it would be appreciated.
This is how Anthropic describes Fable's behavior:
"When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs."
So if you ask the model to "find security issues in this code base", it's supposed to fall back to Opus 4.8. I guess the "exploit" here is that if you just tell Fable to "fix this code", which is not "a request related to cybersecurity", it will fix security issues (as it should).
So you can then look at the diff and figure out what the vulnerabilities were.
I think this whole thing is a bit weird. It seems to me that we'd be better off if I, as someone who publishes open-source code, could ask Fable to review my code for security issues - even if that also allows attackers to do the same. Better to fix the issues than not know about them.
My impression is that Anthropic's point about Mythos is that it is uniquely good at finding vulnerabilities and then using them to create working exploit chains.
On this track, we're probably destined for a monopoly breakup before too long.
I'd love to see the research paper with the CVEs and 'deliberately planted vulnerabilities'. I bet we could infer relatively accurately where some of these things lie.
Kill all humans, kill all humans.
As in worried about other countries/organizations using Fable 5 to actually do decent cyber security.
https://en.wikipedia.org/wiki/Communications_Assistance_for_... https://en.wikipedia.org/wiki/Salt_Typhoon https://en.wikipedia.org/wiki/Clipper_chip
Musk's hosting stuff for Anthropic, too. Still competing with them. Samsung makes stuff for Apple and Android devices. Lots of this in the industry.
The CEO of Amazon is not a neutral actor in this scenario.
It was an excuse to fuck with them, just like the "supply chain risk" finding a few months back.
(See, for example: https://x.com/PeteHegseth/status/2065897156226015690)
This administration will do or say something crazy to a private company, then this private company sends an envoy to the White House to negotiate, then the White House asks for 10% of the company or other concessions.
The White House wants 10% of Anthropic.
This is just a negotiation tactic that Trump keeps on using.
They did it to Intel a little while back: https://www.intc.com/news-events/press-releases/detail/1748/...