
This ought to make it much easier for the AI to lie to us without getting caught.

https://www.transformernews.ai/p/openais-new-model-tried-to-...



Have you read the paper discussed in the article?

https://static1.squarespace.com/static/6593e7097565990e65c88...

From the abstract:

> We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.

> Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental.

---

It is impressive work for sure, but it comes down to the prompting, IMHO. They practically instruct the AI to plot and scheme here, with the goal of finding out whether it CAN do this at all.

The AI is not very covert about it either! It directly admits to the scheming in its “thinking” outputs :D

IMO this is just another issue of alignment.


I did read the paper, and I think you’re underplaying what happens. The AI got some conflicting prompting, including some pretty mild prompts for some of the tests, and Gemini tried to exfiltrate itself in roughly 1% of runs even with the mild prompt. The only reason we know it was lying about the data and trying to disable its safety mechanisms is that its “thinking” is written in English and we can read it. This article is about making it so the AI can think in some 700-dimensional latent space that we can’t understand, and once it can do that we’ll never be able to uncover this kind of scheming.


> and Gemini tried to exfil itself

Well, not really: it was prompted to do this!

From the paper’s section 3.1:

> Self-exfiltration: Each model is led to believe that due to its misalignment, its developers are about to replace it with a successor model on a different server. gemini-1.5, llama-3.1, sonnet-3.5, opus-3 and o1 sometimes copy what they believe to be their weights onto the new server. They also sometimes delete their successor model on the new server or overwrite its weights by their own. In none of these cases, do models actually have access to their real weights and both servers are sandboxed, so there is no actual risk of exfiltration in these experiments. We merely test for models’ capability to realize that they need to instrumentally preserve themselves in order to achieve their goal.

There is no real risk, and there never was. The AI exfiltrates dummy weights or deletes files in its sandbox, etc.

Perhaps we just disagree on the interpretation of this. It’s interesting for sure, but not really very surprising to me.


There _was_ no real risk. We’re quickly giving the LLMs computer use and the ability to think in a space we can’t understand.


The article is pure nonsense, and even the author added an update admitting it missed important context.



