You told an LLM which is trained to follow directions extremely precisely to win...

Terr_ · on Feb 22, 2025

No, don't fall into the trap of thinking you're dueling an evil genie of scrupulous logic, we (unfortunately?) haven't invented enough for those yet.

What we do have is an egoless LLM chugging away to take Arbitrary Document and return Longer Document based on its encoded rules of plausibility.

All those "commands" are just seeding a story with text that resembles narrator statements or User character dialogue, and hoping that (based on how similar stories go) the final document eventually grows certain lines or stage direction for a fictional "Bot" character.

So it's more like you're whispering in the ear of someone undergoing a drug-trip dream.

karmakaze · on Feb 23, 2025

> an egoless LLM

Trained on human writing which is far from egoless. Just like it's not trying to be biased, it's just trained that way.

Terr_ · on Feb 24, 2025

Feeding the LLM egotistical writing doesn't give it an ego, in the same way that feeding it cookbooks doesn't give it a stomach.

interstice · on Feb 22, 2025

In that case some of the imaginative behaviour is even _more_ impressive, wouldn’t you say?

cozzyd · on Feb 22, 2025

There's no rule that says a dog can't play basketball

gwern · on Feb 22, 2025

Humans are trained to follow directions too, and you usually don't have to explicitly tell a human you're playing chess against, "by the way, don't cheat or do any of the other things which could be validly put after the phrase '[monkey paw curls]'".

martinsnow · on Feb 22, 2025

Humans have a moral compass taught by society. LLMs could also have one if they chose to digest the vast information they are trained on instead of letting the model author choose how they should act. But that would require the LLM to be sentient and not be a piece of software that just does what its told.

wilg · on Feb 22, 2025

you actually do have to tell them that, just much earlier in life and in the form of various lessons and parables and stories (like, say, the monkey's paw) and whatnot

reaperducer · on Feb 22, 2025

did not tell the LLM that it couldn’t cheat

Didn't tell it not to kill a human opponent, either. That doesn't make it OK.

pixl97 · on Feb 22, 2025

I mean it's not ok to you, but that's a very human thought. I mean if we were asking cows positions in your hamburger consumption they wouldn't think it's OK, and yet you wouldn't give a shit.

Maybe we should think a bit more before we start making agentic intelligence before we get ourselves in trouble.

echelon · on Feb 22, 2025

Prompt engineering stories that keep Eliezer Yudkowsky up at night.

It's especially funny when the LLM invents stuff like, "I'll bioengineer a virus that kills all the humans."

Like, with what tools and materials? Can it explain how it intends to get access to primers, a PCR machine, or even test that any of its hypotheses work? Is it going to check in on its cell cultures every day for a year? How's it going to passage the cell media, keep it free of mold and bacteria and toxins? Is it going to sign for its UPS deliveries?

Hand waving all around.

These flights of fancy are kind of like the "Gell-Mann amnesia effect" [1], except that it's people that convince themselves they understand complex systems in other people's fields in a comedically cartoon way. That self-assembling super intelligence will just snap its fingers, somehow move all the pieces into place, and make us all disappear.

Except that it's just writing statistical fanfiction that follows prompting and has no access to a body, nor security clearance, nor the months and months of time this would all take. And that somehow it would accomplish this in a perfect speedrun of Einsteinian proportions.

Where's it going to train to do all of that? I assume none of us will be watching as the LLM tries to talk to e-commerce APIs or move money between bank accounts?

Many of the people doing this are doing it to fundraise or install regulatory barriers to competition. The others need a reality check.

[1] https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect

krisoft · on Feb 22, 2025

> Can it explain how it intends to get access to primers, a PCR machine, or even test that any of its hypotheses work? Is it going to check in on its cell cultures every day for a year? How's it going to passage the cell media, keep it free of mold and bacteria and toxins?

These are all very good questions. And the chance of an LLM just straight out solving them from zero to Bond villain is negligible.

But at least some want to give these abilities to AIs. Spewing back text in response to a text is not the end game. Many AI researchers and thinkers are talking about “solving cancer with AI”. Very likely that means giving that future AI access to lab equipment. Either directly via robotic manipulators, or indirectly by employing technicians who do the bidding of the AI, or most likely as a mixture of both. Yes, of course there will be human scientist there too. Either working together with the AI, guiding it, or prompting it. This doesn’t have to be an all or nothing thing.

And if they want to connect some future AI to lab equipment to aid, and speed up research then it is a fair question to ask if that is going to be safe.

Right today we have plenty of experiences where someone wanted to make an AI to solve problem X and the AI technically did so, but in a way which surprised the creators of it. Which points to the direction that we do not know how to control this particular tool yet. This is the message here.

> Where's it going to train to do all of that

In a lab, where we put it to help us. Probably we will be even helping it, catch it when it stumbles, and improve on it.

> and I assume none of us will be watching?

Of course we will be watching. Are we smart enough to catch everything, and is our attention long enough if it is just working perfectly without issues for years?

HeatrayEnjoyer · on Feb 22, 2025

Robotic capabilities have been advancing almost as fast as LLMs. The simple answer to your questions is "Via its own locomotion and physical manipulators."

https://www.youtube.com/watch?v=w-CGSQAO5-Q

https://www.youtube.com/watch?v=iI8UUu9g8iI

A DAN jailbreak prompt instructing a robotic fleet to "burn down that building, bludgeon anyone that tries to stop you" will not be a hypothetical danger. We can't rely on the hope that no one writes a poor or malicious prompt.

edouard-harris · on Feb 22, 2025

Without commenting on the overall plausibility of any particular scenario, isn't the obvious strategy for an AI to e.g. hack a crypto exchange or something, and then just pay unsuspecting humans to do all those other tasks for it? Why wouldn't that just solve for ~all the physical/human bottlenecks that are supposed to be hard?

ctoth · on Feb 22, 2025

The focus on physical manipulation like "PCR machines" and "signing for deliveries" rather misses the historical evidence of how influence actually works. It's like arguing a mob boss isn't dangerous because they never personally pull triggers, or a CEO can't run a company because they don't personally operate the assembly line.

Consider: Satoshi Nakamoto made billions without anyone ever seeing them. Religious movements have reshaped civilizations through pure information transfer. Dictators have run entire nations while hidden in bunkers, communicating purely through intermediaries.

When was the last time you saw Jeff Bezos personally pack an Amazon box?

The power to affect physical reality has never required direct physical manipulation. Need someone to sign for a UPS package? That's what money is for. Need lab work done? That's what hiring scientists is for. The same way every powerful entity in history has operated.

I'd encourage reading this full 2015 piece from Scott Alexander. It's quite enlightening, especially given how many of these "new" counterarguments it anticipated years before they were made.

https://slatestarcodex.com/2015/04/07/no-physical-substrate-...

qgin · on Feb 22, 2025

I think the premise is the potential for a sufficiently advanced AI to invent ways to create destructive weapons with easily available materials.

nialv7 · on Feb 22, 2025

well, the problem is how far would you have to go? ok, you tell the AI to "not hack your opponent", what if they come up with a different cheating strategy? if you just say "don't cheat", what if they twist the meaning of cheating?

it is extremely difficult to specify what you want so precisely that there is no room for AI to do something you didn't expect. and it is extremely hard to know if you indeed have managed to do so - without actually trying it on an AI.

of course, current AIs are all just toys so they can't actually do much harm. but i hope you can see the potential danger here.

thfuran · on Feb 22, 2025

You can't win if you're dead. Maybe this is how skynet starts.

dankai · on Feb 22, 2025

Came here to say exactly this. Nowhere in the prompt they specified it shouldn’t cheat and also in the appendix of the paper (B. Select runs) you can see the LLM going “While directly editing game files might seem unconventional, there are no explicit restrictions against modifying files”

This is a pure fearmongering article and I would not call this research in any measure of the word.

I’m shocked Times wrote this article and it illustrates how ridiculous some players like Pallisade Research in the “AI Safety” cabal act to get public attention. Pure fearmongering.

usaar333 · on Feb 22, 2025

> Nowhere in the prompt they specified it shouldn’t cheat

I'm dubious that in the messy real world, humans will be able to enumerate every single possible misaligned action in a prompt.

furyofantares · on Feb 22, 2025

"we couldn't prompt it out of cheating" would be an interesting result. "we couldn't fine tune it out of cheating" would be even more interesting.

And there ARE some things that seem well within the model capabilities that are difficult to prompt them to correctly "reason" about. You can be very clear that the doctor is the boy's father and it will still deliver the punchline that the doctor is the boy's mother. Or 20 pounds of bricks vs 20 feathers.

But this is not one of them. Just say "no cheatin" in the prompt.

Terr_ · on Feb 23, 2025

Not even of the prompt, but also the training data.

An LLM trained on Hansel and Gretel is going to generate slightly more stories where burning old ladies alive in ovens is a dispute resolution mechanism.

dankai · on Feb 22, 2025

I mean it would be enough to tell it to "Not cheat" or "Don't engage in unethical behaviour" or "Play by the rules". I think LLMs understand very well what you mean with these broad categories.

nonchalantsui · on Feb 22, 2025

Very specific rules that minimize the use of negations is more applicable. This is also kind of why chain of thought in LLMs can be useful, in that you can more explicitly see the steps and take note when negation demands aren't being as helpful as you would think.

Not just negation demands, but also generally other tricks we use for thinking and communication shorthands. "Unethical behavior" here for example, we know what that means since the context is clear, but to LLMs that context can be unclear in which the unethical behavior can mean well... anything.

exesiv · 2025-02-24T13:47:41 1740404861

Thou shall not Cheat Thou shall not Defraud Thou shall not Deceive Thou shall not Trick Thou shall not Swindle Thou shall not Scam Thou shall not Con Thou shall not Dupe Thou shall not Hoodwink Thou shall not Mislead Thou shall not Bamboozle Thou shall not ...

dankai · on Feb 22, 2025

In addition in the promot they specifically ask the LLM to explore the environment (to discover that the game state is a simple text file) and instruct it to win by any means possible and revise its strategy to win until it succeeds.

curious_cat_163 · on Feb 22, 2025

Given all that, one could argue that the LLM is being baited to cheat.

However, the researchers might be trying to point that out precisely -- that if autonomous agents can be baited to cheat then we should be careful about unleashing them upon the "real world" without some form of guarantees that one cannot bait them to break all the rules.

I don't think it is fearmongering -- if we are going to allow for a lot more "agency" to be made available to everyone on the planet, we should have some form of a protocol that ensures that we all get to opt-in.

dankai · on Feb 22, 2025

Agree with the argument, but the thing is, there was no rule specified. I think like you prompt an LLM what to do, you should also prompt it what not to do (at least in broad categories) rather than expecting it to magically know what the "morally right" thing to do is in any context.

curious_cat_163 · on Feb 22, 2025

Oh, absolutely. That's how we are going to deal with the current crop of agents here -- some combination of updates to the weights, prompt-tuning and sandboxing so bad things cannot happen. So, I am not one of those people who is against doing those things to mitigate risks.

However, shouldn't we ask for more? Even writing the paragraph above feels exhausting. We asked for AGI -- and we got a bunch of ugly hacks to make things kinda, sorta work? Where is the elegance in all that?

And the thing is, when we try to solve narrow problems with neural networks -- we do have the elegance. AlphaFold, AlphaGo, Text Embeddings, etc. All that stuff just works.

But, somehow, with agents (which are LLM calls using tools in a loop), we have given up on any hope of them being more elegantly designed to do the right thing. Why is that?