There's still some typicality defined by the prompt though. If you ask for a proof in the style of Shakespeare, you're going to get some "average" Shakespeare. It's kind of embedded in the task definition; you're shifting the reference distribution.
If an LLM returned something really unusual for Shakespeare when you didn't ask for it, you'd say it's not performing well.
Maybe that's tautological but I think it's what's usually meant by "average".
I'm sure LLMs that do something different are on the near horizon, but I don't think we're quite there yet.
The point was that no, you won't (necessarily) get some "average" Shakespeare. A sampler may introduce bias and look for the "above average" Shakespeare in the distribution.
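To make that concrete, here's a toy sketch of one way a sampler can be biased: best-of-n selection. Everything in it is an assumption for illustration. The "quality" score is just a standard normal draw standing in for how good a given pastiche is, and `sample_quality` and `best_of_n` are hypothetical names, not anything a real LLM stack exposes.

```python
import random
import statistics


def sample_quality(rng):
    # Hypothetical stand-in: one draw from the model's output distribution,
    # scored on an arbitrary quality scale (standard normal by assumption).
    return rng.gauss(0.0, 1.0)


def best_of_n(rng, n=8):
    # A biased sampler: draw n candidates and keep only the highest scoring one.
    return max(sample_quality(rng) for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 2000

# Plain sampling: the mean lands near the distribution's own average.
plain_mean = statistics.mean(sample_quality(rng) for _ in range(trials))

# Best-of-n: same underlying distribution, but the selection step shifts the mean up.
biased_mean = statistics.mean(best_of_n(rng) for _ in range(trials))

print(f"plain sampling mean: {plain_mean:.2f}")
print(f"best-of-8 mean:      {biased_mean:.2f}")
```

The underlying distribution is identical in both cases; only the selection step changes, which is roughly the distinction between "what the model considers typical" and "what the sampler actually returns".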