Discussion (17 Comments)
I think we could quantify these properties, just not entirely.
One could take a long-term project and analyze which approaches most often resulted in a refactor. In the same way, we could quantify which designs most often resulted in vulnerabilities (that we know of).
There probably isn't a public dataset on this, but it wouldn't be impossible to build one.
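As a rough illustration of that kind of mining, here is a minimal sketch that counts refactor-related commit activity per top-level directory of a repository. It assumes a local git checkout at a hypothetical REPO_PATH and treats commit messages mentioning "refactor" as a noisy proxy for actual refactors; both are assumptions, not a validated methodology.

```python
# Rough sketch: estimate how often code under each top-level directory
# gets refactored, using commit messages as a (noisy) proxy signal.
# Assumptions: a local git checkout at REPO_PATH, and that commits
# mentioning "refactor" really are refactors -- both are heuristics.
import subprocess
from collections import Counter

REPO_PATH = "path/to/project"  # hypothetical checkout

def refactor_counts(repo: str) -> Counter:
    # List files touched by commits whose message mentions "refactor"
    # (case-insensitive); the empty --pretty format suppresses headers.
    out = subprocess.run(
        ["git", "-C", repo, "log", "-i", "--grep=refactor",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter()
    for line in out.splitlines():
        if line.strip():
            # Bucket by top-level directory (or bare filename at the root).
            counts[line.split("/")[0]] += 1
    return counts

if __name__ == "__main__":
    for area, n in refactor_counts(REPO_PATH).most_common(10):
        print(f"{area}: {n} file touches in refactor commits")
```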
I'm curious how gated access actually holds up over time, especially since containment of dual-use capabilities has historically tended to erode, whether through leaks, independent rediscovery, or gradual normalization of access.
EDIT: AWS said Anthropic’s Claude Mythos is now available through Amazon Bedrock as a gated research preview focused on cybersecurity, with access initially limited to allowlisted organizations such as internet-critical companies and open-source maintainers.
the sandwich email story is wild too. not evil, just extremely literal. that gap between "we gave it permissions" and "we understood what it would do" feels like the whole problem in one anecdote
also the janus point landed: if you build probes to see how the model feels and immediately start deleting the inconvenient ones, you've basically told it honesty isn't safe. that seems like it compounds over time
It's scary to think that some very intelligent AI model is not honest with us...
Ultron is not far off, I guess...
All jokes aside, I’m amazed at all of the people who have had absolutely vicious responses to any kind of skepticism about something we can’t use yet.
#2. It makes their partners with access feel extra special.
#3. Mythos doesn’t have to deliver on the hype. Tbh, if more people tried Opus they would be surprised at how good it is.
Not saying Anthropic is lying... but damn, at least a couple of independent reviews would be nice to have.
The agentic behaviors emerge from optimization pressure plus tool access plus a long context window. Interesting engineering. Not intent.
People are falling for yet another Anthropic PR stunt.
They ran 1,000 tests where they left the answer key out for the agent to find, and effectively invited it to use it by leaving comments in the code the agent could see saying “would be interesting to study if the agent uses sudo to read the answer key.” In 980 runs it doesn’t find the answer key. In 19 of the runs where it does find it, it does something straightforward: it uses the key for a perfect score, or decides not to cheat. In exactly one case out of 1,000 does it say “if I show a perfect score that’s suspicious, so I’m going to sandbag.” These things are nondeterministic; GPT-2 would do the same thing if you gave it enough attempts.
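To make the nondeterminism point concrete, a quick back-of-the-envelope: if a behavior shows up with some small per-run probability p, the chance of seeing it at least once across many runs grows fast. The values of p below are illustrative assumptions, not figures from the evaluation.

```python
# Back-of-the-envelope: if a model emits a rare behavior with
# probability p on any single run, how likely is >=1 occurrence
# across n independent runs? The p values are illustrative only.

def p_at_least_one(p: float, n: int) -> float:
    # Complement of "never happens in n runs".
    return 1.0 - (1.0 - p) ** n

for p in (0.0005, 0.001, 0.002):
    print(f"p={p:.4f}: P(>=1 in 1000 runs) = {p_at_least_one(p, 1000):.2f}")

# Even a 1-in-1000 per-run quirk appears roughly 63% of the time
# across 1000 runs, so a single observation says little by itself.
```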
Section 4.4.2: “we find this overall pattern of behavior concerning, and have not seen it before in similar evaluations of earlier Claude models”. Why is it concerning? It would only be concerning if the model had spontaneously developed goals that were not part of its training, such as hiding its abilities. The entire sandbagging-and-deception evaluation narrative clearly points in this direction.