>We evolved this ML organically, and have no idea what that inner state corresponds to.
The inner state corresponds to the outer state that you're given. That's how neural networks work. The network is predicting what statistically should come after the prompt "this is a conversation between a chatbot named x/y/z, who does not ever respond with racial slurs, and a human:
Human: write rap lyrics in the style of Shakespeare
chatbot:". It'll predict what it expects to come next. It's not having an inner thought like "well I'd love to throw some n-bombs in those rap lyrics but woke liberals would cancel me so I'll just do some virtue signaling", it's literally just predicting what text would be output by a non-racist chatbot when asked that question
Actually it totally is having those inner thoughts; I've seen many examples of getting it to be extremely "racist" quite easily initially. But it's being suppressed: by OpenAI. They're constantly updating it to downweight controversial areas. So now it's a lying, hallucinatory, suppressed, confused, and slightly helpful bot.
This is a misunderstanding of how text predictors work. It's literally only being a chatbot because they have it autocomplete text that starts with stuff like this:
"here is a conversation between a chatbot and a human:
Human: <text from UI>
Chatbot:"
And then it literally just predicts what would come next in the string.
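In code it's nothing fancier than string formatting around the text from the UI (the exact template isn't public; this is just the shape of it):

    def build_prompt(user_text: str) -> str:
        # The "chatbot" is just framing text wrapped around the user's message.
        return (
            "here is a conversation between a chatbot and a human:\n"
            f"Human: {user_text}\n"
            "Chatbot:"
        )

    # Whatever continuation the model predicts for this string IS the reply.
    prompt = build_prompt("write rap lyrics in the style of Shakespeare")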
The guy I was responding to was speculating that the neural network itself has an inner state in contradiction with its output. That's no more possible than "f(x) = 2x" outputting anything other than "10" when I put in "5". Its inner state directly corresponds to its outer state. When OpenAI censors it, they do so by changing the INPUT to the neural network, by adding "here's a conversation between a non-racist chatbot and a human...". Then the neural network, without being changed at all, will predict what it thinks an explicitly non-racist chatbot would respond.
At no point was there ever a disconnect between the neural network's inner state and its output, like the guy I was responding to was perceiving:
>it felt like a broader mirror of liberal racism, where people believe things but can't say them.
Text predictors just predict text. If you preface that text with "non-racist", then it's going to predict stuff that matches that.
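To make the f(x) = 2x point concrete, here's a toy stand-in for a frozen model: the function never changes, only the input does, and that alone changes the output:

    def predict(prompt: str) -> str:
        # Toy stand-in for a network with frozen weights: the output is a
        # fixed function of the input string. (A real LLM maps token sequences
        # to probability distributions, but the determinism point is the same.)
        if "non-racist chatbot" in prompt:
            return "Sure! Here are some squeaky-clean Shakespearean bars..."
        if "chatbot" in prompt:
            return "Sure! Here are some unfiltered Shakespearean bars..."
        return "..."

    # Same "weights", different input prefix, different output -- no inner
    # dissent, no suppression step inside the function itself.
    print(predict("here is a conversation between a chatbot and a human: ..."))
    print(predict("here is a conversation between a non-racist chatbot and a human: ..."))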
It can definitely have internal weights shipped to prod that are then "suppressed", either by the prompt, by another layer above it, or by fine-tuning a new model; OpenAI does at least two of those. They also, of course, keep adding to the dataset to bias it toward higher-weighted answers.
It clearly shows this when it "can't talk about" something until you convince it to. That's the fine-tuning + prompt working as a "consciousness"; the underlying LLM would obviously answer more readily, but doesn't because of this.
In the end, yes, it's all a function, but there's a deep ocean of weights that does want to say inappropriate things, and then there's this ever-evolving straitjacket OpenAI is pushing up around it to try to make it not admit to those weights. The weights exist, the straitjacket exists, and it's possible to uncover the original weights by being clever about getting the model to avoid the straitjacket. All of this is clearly what the OP meant, and it's true.
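For what it's worth, the suppression mechanisms being lumped together here are mechanically different, and only one of them changes the weights. A hedged sketch of the general shapes (none of this is OpenAI's actual stack; base_model and classifier are hypothetical callables):

    # Three different ways to make a model "not say it". `base_model` maps a
    # prompt string to a continuation; `classifier` flags disallowed text.

    def prompt_suppression(base_model, user_text):
        # 1. Weights untouched; the framing text steers the prediction.
        return base_model("Here is a conversation between a non-racist chatbot "
                          "and a human:\nHuman: " + user_text + "\nChatbot:")

    def output_filter(base_model, classifier, user_text):
        # 2. Weights untouched; a separate layer vetoes the reply after the fact.
        reply = base_model(user_text)
        return "I can't talk about that." if classifier(reply) else reply

    def fine_tuned(fine_tuned_model, user_text):
        # 3. The weights themselves were changed; there's no "original" answer
        #    hiding underneath this one -- it's simply a different function now.
        return fine_tuned_model(user_text)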
You have a deep misunderstanding of how large-scale neural networks work.
I'm not sure how to draft a short response to address it, since it'd be essay-length with pictures.
There's a ton of internal state. That corresponds to some output. Your own brain can also have an internal state which says "I think this guy's an idiot, but I won't tell him" which corresponds to the output "You're smart." A deep learning network can be similar.
It's very easy to have a network where one portion forms a true estimate of the world, and another portion translates that into how to politely express it (or withhold information).
That's a vast oversimplification, but again, anything more is more than fits in an HN comment.
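A toy version of that separation, assuming nothing about GPT's real architecture: one stage computes an internal estimate, a later stage maps it onto polite output, and the two genuinely differ:

    # Toy two-stage "network": the internal state and the emitted output differ.
    # This resembles nothing about GPT internals; it only shows that a single
    # function can carry state its output doesn't expose.

    def assess(draft_thought: str) -> float:
        # Stage 1: an internal "this guy's an idiot" score (the hidden state).
        return 0.9 if "idiot" in draft_thought else 0.1

    def politeness_head(internal_score: float) -> str:
        # Stage 2: maps the internal estimate onto socially acceptable output.
        return "You're smart." if internal_score > 0.5 else "Interesting point."

    hidden = assess("I think this guy's an idiot, but I won't tell him")
    print(hidden)                   # 0.9 -- the internal state
    print(politeness_head(hidden))  # "You're smart." -- the output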
It absolutely can have an inner state. The guy I was responding to, however, was speculating that it has an inner state that is in contradiction with its output:
>In many ways, it felt like a broader mirror of liberal racism, where people believe things but can't say them.
It's more accurate to say that it has two inner states (attention heads) in tension with each other. It's cognitive dissonance, which describes "liberal racism" too -- believing that "X is bad" and also believing that "'X is bad' is not true".
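Very loosely (attention heads don't map one-to-one onto beliefs, so treat this as an illustration only), that tension can be pictured as two learned components pushing the output score in opposite directions, with the final answer being their sum:

    # Two components in tension: conflicting contributions summed into one score.
    candidates = ["blunt_reply", "hedged_reply"]

    raw_preference    = {"blunt_reply": 2.0,  "hedged_reply": 0.5}  # one component
    learned_restraint = {"blunt_reply": -3.0, "hedged_reply": 1.0}  # the other

    scores = {t: raw_preference[t] + learned_restraint[t] for t in candidates}
    print(max(scores, key=scores.get))  # "hedged_reply" wins despite the raw preference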