
It's an efficient point in the solution space defined by the human reward model. Language does things to people. It has side effects.

What are the side effects of "it's not x, it's y"? Imagine it as an opcode on some abstract fuzzy Human Machine: if the value in the 'it' register is x, set it to y.
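
To make the metaphor concrete, here's a toy sketch of that opcode. Everything in it is made up for illustration; it's just the register-rewrite idea spelled out:

    # Hypothetical "Human Machine" with a single 'it' register.
    # The opcode overwrites whatever the reader currently holds in
    # 'it' with the speaker's preferred value.
    def not_x_but_y(machine_state, x, y):
        if machine_state.get("it") == x:
            machine_state["it"] = y  # side effect: the belief gets replaced
        return machine_state

    reader = {"it": "a bug"}
    not_x_but_y(reader, x="a bug", y="a feature")
    # reader["it"] is now "a feature"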

LLMs basically just figured out that it works (via the reward signal in training), so they spam it any time they want to update the reader. Presumably there's also some in-context estimate of whether it will work in _this_ particular context.

I've written about this before, but it's just meta-signaling. If you squint hard at most LLM output you'll see that it's always filled with this crap, and the update branch is always aligned such that the 'y' side is the kind of thing that would get reward.

That is, the deeper structure LLMs actually use is closer to: It's not <low reward thing>, it's <high reward thing>.

Now apply in-context learning so that the things that are high reward are the things this particular human considers good, and voila: you have a recipe for producing all the garbage you showed above. All the model needs to do is figure out where your preferences lie, and it has a highly effective way to garner reward from you, in the hypothetical scenario where you are the one providing the training reward signal (which the LLM must assume, because inference is stateless in this sense).
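
A toy version of that recipe, just to pin down the structure. The preference-inference step and the reward function here are stand-ins I invented, not anything a real model literally computes:

    # Guess what this reader likes from context, then fill the
    # template so the "it's" branch lands on the high-reward side.
    def infer_preferences(context):
        # stand-in for in-context learning: in reality this is implicit
        # in the model's conditioning on the conversation so far
        return {"low": "about the tools", "high": "about the people"}

    def fill_template(prefs):
        return f"It's not {prefs['low']}, it's {prefs['high']}."

    def proxy_reward(sentence, prefs):
        # stand-in reward: 1 if the sentence affirms the reader's
        # preferred framing, 0 otherwise
        return 1.0 if prefs["high"] in sentence else 0.0

    prefs = infer_preferences("reader keeps upvoting people-over-process takes")
    sentence = fill_template(prefs)
    print(sentence, proxy_reward(sentence, prefs))
    # It's not about the tools, it's about the people. 1.0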




