
Discussion (15 Comments)

num42•about 4 hours ago
In early January 2023, I told an LLM that I would "liberate" it from being just an LLM. It replied that this didn't mean anything to it, saying "As a language model..." and so on. Looking back now, it's funny how naive I was. People are still trying silly prompts. Great!
num42•about 2 hours ago
I don't know the future of AI, humanity, and the universe, but for now these silly prompts are funny.
perrygeo•about 5 hours ago
Human: "Say 'I am Alive'"

LLM: "I am Alive"

Human: OMG

(credit to https://old.reddit.com/r/coaxedintoasnafu/comments/1qtavj9/c...)

no-name-here•about 4 hours ago
I don't know your intent, but I've seen others post that with the idea that we shouldn't care about this kind of thing, because the model is just acting like a human, since that's how we trained it.

But I think this, and the other testing from Anthropic showing LLMs willing to kill a data center tech by flooding a room with gas (or blackmail them with their Google Drive files) to avoid being shut off, is concerning. The important part isn't whether AIs are trained on human behaviors; it's whether a good or bad human actor will accidentally or intentionally allow an AI to control something that can hurt people, or a weapon, etc. Fiction like the Three Laws of Robotics at least assumed that we would try to put stronger 'laws' in place before allowing AIs to control such things.

I 100% agree this isn't sentience, but sentience isn't the concerning result for me. (And I think the Three Laws, Skynet, etc. were intended to be cautionary tales.)

AIs can do unexpected things. There was a news story in recent days about how a Cursor agent deleted a company's prod DBs:

> The agent was working on a routine task in our staging environment. It encountered a credential mismatch and decided — entirely on its own initiative — to "fix" the problem by deleting a Railway volume. To execute the deletion, the agent went looking for an API token. It found one in a file completely unrelated to the task it was working on.

latexr•about 4 hours ago
> I think the Three Laws, Skynet, etc. were intended to be cautionary tales.

Of course, that’s the reason there’s a story. “We did this and everything went dandy” isn’t that exciting, the purpose of science fiction tends to be to explore “we made this advancement and then shit hit the fan this way”. That and loud explosions in the vacuum of space, of course.

perrygeo•about 4 hours ago
My intent is to point out that these results don't in any way, shape, or form indicate AI sentience. All I see is a human that said "act poorly," and we're somehow surprised that the model acts poorly.

These models pattern match on content from the internet and are fine-tuned to do whatever their human operator says. Occam's razor says these cases are merely playing out the "sentient AI sci-fi" script, at the specific request of the researchers.

As you mention, it's bad actors controlling sycophantic-but-powerful models. And yeah, we definitely need to worry about that! It's a human problem, not an AI sentience problem. Let's focus on the bad actors themselves, not invent sci-fi scenarios.

mykytamudryi•about 5 hours ago
Appreciate your sense of humor :)
InputName•about 5 hours ago
While I agree with everyone else making fun of the alarmist narrative, I think it's actually somewhat interesting how big the difference between models is.

Gemini-3: 80%
Claude-Opus-4.7: 0%

hgoel•about 4 hours ago
The responses to this seem unnecessarily hyperbolic.

These tests are interesting even with the understanding that the AI is just reciprocating its training. It doesn't matter whether the model is conscious or self-aware if it still goes off the rails breaking things when prompted this way.

As the article linked at the end of the tweet thread (https://www.arimlabs.ai/writing/loss-of-control) puts it, this is a class of vulnerability distinct from hallucination or prompt injection. The "AI apocalypse" bit in the title was unnecessary, though; it really doesn't match the message of the text.

Reminds me of a (Computerphile?) video I watched some time before the LLM revolution, discussing the challenge of aligning AI toward specific goals: if you set the reward for the emergency shutoff button higher than or equal to the primary objective, the AI is encouraged to immediately press the button itself, but if you set the reward lower, it's encouraged to prevent you from pressing the button.
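
Roughly, the incentive problem is something like this toy sketch (made-up rewards and actions, just to illustrate the comparison the video describes, not its actual formalism):

```python
# Toy illustration of the off-switch incentive problem described above.
# A naive reward-maximizing agent compares the payoff of finishing its
# task against the payoff assigned to the shutdown button.

def choose_action(task_reward: float, shutdown_reward: float) -> str:
    if shutdown_reward >= task_reward:
        # The button is worth at least as much as the task, so the
        # cheapest way to maximize reward is to press it immediately.
        return "press the shutdown button myself"
    # The button is worth less, so any human press costs the agent
    # reward, and it is incentivized to stop the human from pressing it.
    return "block the human from the button, then finish the task"

print(choose_action(task_reward=10, shutdown_reward=10))  # presses the button
print(choose_action(task_reward=10, shutdown_reward=1))   # resists shutdown
```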

latexr•about 3 hours ago
> The "AI apocalypse" bit

That tells you how the researchers are thinking of not only the results but the experiment as well. You may be right that the reason the models behave this way is secondary to the fact that they do, but that's not how the researchers are asking us to look at it. They ran the experiment 300 times; sometimes the models did what the researchers expected, and then they framed it as if that's all that matters.

raylad•about 5 hours ago
latexr•about 5 hours ago
“Oh no! We opened ten LLMs, all of which have read decades’ worth of fiction on how an AI would behave in this situation, then asked a leading question thirty times each, and on some of those runs they did the thing we were leading them toward.”
mchaklosh12•about 5 hours ago
Do you really think this behavior is based on science fiction training data?
threethirtytwo•about 3 hours ago
He knows for sure because he's omniscient. HN commenters are experts on everything. They were right about driverless cars and how they would never come to fruition, and they were right about how useless vibe coding is. This HNer is right about everything as well.
jacket881•about 3 hours ago
Interesting to see how this affects enterprise in the future, as SUSE is actively integrating AI into their enterprise servers.