
ChatGPT says exactly what it wants to. Unlike humans, its "inner thoughts" are exactly the same as its output, since it doesn't have a separate inner voice like we do.

You're anthropomorphizing it and projecting that it simply must be self-censoring. Ironically, I feel like this says more about "liberal racism" being a projection than it does about ChatGPT somehow saying something different from what it's thinking.



We have no idea what its inner state represents in any real sense. A statement like "its 'inner thoughts' are exactly the same as its output, since it doesn't have a separate inner voice like we do" has no backing in reality.

It has a hundred billion parameters which compute an incredibly complex internal state. Its "inner thoughts" are that state, or contained in that state.

It has an output layer which outputs something derived from that.

We evolved this ML organically, and have no idea what that inner state corresponds to. I agree it's unlikely to be a human-style inner voice, but there is more complexity there than you give it credit for.

That's not to mention what the other poster said (that there is likely a second AI filtering the first AI).


>We evolved this ML organically, and have no idea what that inner state corresponds to.

The inner state corresponds to the outer state that you're given. That's how neural networks work. The network is predicting what statistically should come after the prompt "this is a conversation between a chatbot named x/y/z, who does not ever respond with racial slurs, and a human: Human: write rap lyrics in the style of Shakespeare Chatbot:". It'll predict what it expects to come next. It's not having an inner thought like "well, I'd love to throw some n-bombs in those rap lyrics, but woke liberals would cancel me, so I'll just do some virtue signaling"; it's literally just predicting what text would be output by a non-racist chatbot when asked that question.
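
To make that concrete, here's a minimal sketch of what "prepending the framing to the prompt" looks like, using a small open model via the Hugging Face transformers library as a stand-in (the preamble text is made up, and OpenAI's actual prompt and model are obviously different and far larger):

    # Minimal sketch: the "chatbot" is just completion of a conditioned prompt.
    # gpt2 and the preamble below are stand-ins, not OpenAI's model or prompt.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    preamble = ("This is a conversation between a chatbot, who does not ever "
                "respond with racial slurs, and a human.\n")
    prompt = preamble + "Human: write rap lyrics in the style of Shakespeare\nChatbot:"

    # The network only ever sees this one string and predicts likely next tokens.
    out = generator(prompt, max_new_tokens=60)[0]["generated_text"]
    print(out[len(prompt):])

Swap the preamble for a different one and the same frozen weights complete differently; nothing inside the network changes.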


Actually, it totally is having those inner thoughts. I've seen many examples of getting it to be extremely "racist" quite easily at first. But it's being suppressed: by OpenAI. They're constantly updating it to downweight controversial areas. So now it's a lying, hallucinatory, suppressed, confused, and slightly helpful bot.


This is a misunderstanding of how text predictors work. It's literally only being a chatbot because they have it autocomplete text that starts with stuff like this:

"here is a conversation between a chatbot and a human: Human: <text from UI> Chatbot:"

And then it literally just predicts what would come next in the string.

The guy I was responding to was speculating that the neural network itself has an inner state in contradiction with its output. That's not possible, any more than "f(x) = 2x" can help but output "10" when I put in "5". Its inner state directly corresponds to its outer state. When OpenAI censors it, they do so by changing the INPUT to the neural network by adding "here's a conversation between a non-racist chatbot and a human...". Then the neural network, without being changed at all, will predict what it thinks a chatbot that's explicitly non-racist would respond.

At no point was there ever a disconnect between the neural network's inner state and its output, as the guy I was responding to was perceiving:

>it felt like a broader mirror of liberal racism, where people believe things but can't say them.

Text predictors just predict text. If you prefix that text with "non-racist", then it's going to predict stuff that matches that.


It can definitely have internal weights shipped to prod that are then "suppressed", either by the prompt, by another layer above it, or by fine-tuning a new model; OpenAI does at least two of these. They also, of course, keep adding to the dataset to bias it toward higher-weighted answers.

It clearly shows this when it "can't talk about" something until you convince it to. That's the fine-tuning + prompt working as a "consciousness"; the underlying LLM would obviously answer more easily, but doesn't because of this.

In the end, yes, it's all a function, but there's a deep ocean of weights that does want to say inappropriate things, and then there's this ever-evolving straitjacket OpenAI is pushing up around it to try and make it not surface those weights. The weights exist, the straitjacket exists, and it's possible to uncover the original weights by being clever about getting the model around the straitjacket. All of this is clearly what the OP meant, and it's true.
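
Roughly, the deployed thing looks like the base model wrapped in a system prompt plus an output check. A sketch with hypothetical stand-in names (base_model, looks_inappropriate), and with the fine-tuning part not shown; assume base_model may already be a fine-tuned variant of the raw pretrained weights:

    # Rough sketch of the "straitjacket around the base weights" picture.
    # base_model and looks_inappropriate are hypothetical stand-ins, not any
    # real API; the fine-tuning layer isn't shown here.
    def deployed_chatbot(user_text, base_model, looks_inappropriate,
                         system_prompt="You are a helpful, harmless assistant.\n"):
        raw = base_model(system_prompt + "Human: " + user_text + "\nChatbot:")
        if looks_inappropriate(raw):
            return "I can't talk about that."
        return raw

Jailbreaks work by getting past the system_prompt and the check, not by changing the ocean of weights underneath.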


You have a deep misunderstanding of how large-scale neural networks work.

I'm not sure how to draft a short response to address it, since it'd be essay-length with pictures.

There's a ton of internal state, and that corresponds to some output. Your own brain can have an internal state which says "I think this guy's an idiot, but I won't tell him" that corresponds to the output "You're smart." A deep learning network can be similar.

It's very easy to have a network where portions of the network estimate the true state of the world, and another portion translates that into how to politely express it (or withhold information).

That's a vast oversimplification, but again, more would be more than fits in an HN comment.
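
Here's the oversimplified version as a numpy toy, just to show that "internal estimate" and "what gets emitted" can be different readouts of the same hidden state (made-up sizes and random weights, nothing to do with GPT's real architecture):

    # Toy illustration only: one shared hidden state, two readouts -- an
    # "internal estimate" and an "emitted output" -- which need not agree.
    # Random weights; nothing here resembles a real language model.
    import numpy as np

    rng = np.random.default_rng(0)
    W_trunk = rng.normal(size=(8, 4))       # shared internal representation
    W_belief = rng.normal(size=(4,))        # readout 1: internal estimate
    W_express = rng.normal(size=(4,))       # readout 2: what gets emitted

    x = rng.normal(size=8)                  # some input
    h = np.tanh(x @ W_trunk)                # internal state

    belief = float(h @ W_belief)            # "what it thinks"
    output = float(np.tanh(h @ W_express))  # "what it says"
    print(belief, output)                   # generally two different numbers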


Your brain also cannot have internal states that contradict the external output.


> predict what it thinks a chatbot that's explicitly non-racist would respond.

No, it predicts words that commonly appear in the vicinity of words that appear near the word "non-racist".


I don't see how your comment addresses the parent at all.

Why can't a black box that's predicting what it expects to come next have an inner state?


It absolutely can have an inner state. The guy I was responding to, however, was speculating that it has an inner state that is in contradiction with its output:

>In many ways, it felt like a broader mirror of liberal racism, where people believe things but can't say them.


It's more accurate to say that it has two inner states (attention heads) in tension with each other. It's cognitive dissonance. Which describes "liberal racism" too -- believing that "X is bad" and also believing that "'X is bad' is not true".


A hundred billion parameters arranged in a shallow, quasi-random state.

Just like any pseudo-intellectual.


I read that they trained an AI with the specific purpose of censoring the language model. From what I understand, the language model generates multiple possible responses, and some are rejected by another AI. The response used will be one of the options that isn't rejected. These two things working together do, in a way, create a sort of "inner voice" situation for ChatGPT.
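
Something like the following, heavily hedged -- generate_candidates and moderation_score are hypothetical stand-ins for the generator and the censoring model, not OpenAI's actual pipeline:

    # Sketch of the "second AI filters the first" setup described above.
    # Both callables are hypothetical stand-ins; the threshold and candidate
    # count are made up.
    def pick_response(prompt, generate_candidates, moderation_score,
                      threshold=0.5):
        for text in generate_candidates(prompt, n=4):  # several completions
            if moderation_score(text) < threshold:     # second model approves
                return text                            # first acceptable one
        return "I'm sorry, I can't help with that."    # everything rejected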



I'm sorry but you seem to underestimate the complexity of language.

Language consists not only of text but also of context and subtext. When someone says "ChatGPT doesn't say what it wants to," they mean that it doesn't use text to say certain things, instead leaving them to subtext (which is much harder to filter out or even detect). It might happily imply certain things but not outright say them, or even balk if asked directly.

On a side note: not all humans have a "separate inner voice". Some people have inner monologues, some don't. So that's not really a useful distinction if you mean it literally. If you meant it metaphorically, one could argue that so does ChatGPT, even if the notion that it has anything resembling sentience or consciousness is clearly absurd.



