> Any probability distribution over strings can theoretically be factored into a product of such a “probability that next token is x given that the text so far is y”.
And such a probability distribution, efficient or otherwise, would not generally understand concepts. The P(next_token) is based upon the syntactical structure built via the model and some basic semantic distance that LLMs provide. They don't have enough conceptual power to reliably generate new facts and know that they are facts consistent with the model. That would be an internal representation system.
The academic exercise here is similar to monads: "yes, any computed function f(x) can be expressed as a sufficiently pre-computed large lookup table." With LLMs we're dealing with approximate lookups due to lossy compression, but that's still what these prior probabilities are: lookup tables. Lookup tables are not smart, do not understand concepts, and they have little to no capacity to generate new results not sufficiently represented in the training set.
My main concern here is the theoretical point, and so I’m not addressing the “this is what current (e.g. transformer-based) models do” parts.
> The P(next_token) is based upon the syntactical structure built via the model and some basic semantic distance that LLMs provide.
Regardless of whether this is true for existing transformer-based models, this is not true for all computable conditional probability distributions over text.
Any computable task can be framed as sampling from some conditional probability distribution. (If the task is deterministic, that just means that the conditional probability distribution to sample from is one which has probability 1 for some string, when conditioned on the thing it is to be conditioned on.)
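As a toy illustration of that (nothing to do with any actual model; the task and names are made up for the example), here is a deterministic task, reversing a string, written as a next-token distribution to sample from:

```python
# Minimal sketch: a deterministic task ("reverse the input string") framed as a
# conditional distribution over the next character given the text so far.
# All names here are illustrative; no particular model or library is assumed.

def next_char_distribution(prompt: str, partial_output: str) -> dict[str, float]:
    """Return P(next char | prompt, output so far) for the task 'reverse the prompt'."""
    target = prompt[::-1]                      # the single correct output
    if len(partial_output) < len(target):
        correct = target[len(partial_output)]  # the one correct continuation
        return {correct: 1.0}                  # probability 1 on it, 0 on everything else
    return {"<eos>": 1.0}                      # finished: emit end-of-sequence

# Sampling from this distribution always reproduces the deterministic answer:
prompt, out = "hello", ""
while True:
    dist = next_char_distribution(prompt, out)
    token = max(dist, key=dist.get)            # "sampling" is trivial: one option has p=1
    if token == "<eos>":
        break
    out += token
assert out == "olleh"
```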
Whether transformer-based models are lookup tables or not, not all computable probability distributions over text are. (As, of course, not all computable tasks can be expressed as a simple finite lookup table.)
I don’t know exactly what you mean by “generally understand concepts”, though I suppose
> They don't have enough conceptual power to reliably generate new facts and know that they are facts consistent with the model. That would be an internal representation system.
is describing that somewhat.
And, in that case, if there is any computational process which counts as having “enough conceptual power to generate new facts and know that they are facts consistent with the model”, then a computable conditional probability distribution over strings conditioned on their prefixes (and therefore also a computable probability distribution over the next token given all tokens so far) is also, theoretically, capable of that.
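The “therefore also” here is just the usual chain-rule factorization of a distribution over strings into next-token conditionals:

```latex
% chain rule: a distribution over whole strings determines, and is determined by,
% the conditional next-token distributions
P(x_1, x_2, \dots, x_n) = \prod_{t=1}^{n} P\left(x_t \mid x_1, \dots, x_{t-1}\right)
```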
And so, it would follow that “it only predicts the next token” doesn’t (in principle/theory) preclude it having such an understanding of concepts, unless no computational process ever can.
> “it only predicts the next token” doesn’t (in principle/theory) preclude it having such an understanding of concepts, unless no computational process ever can.
In my opinion, this is highly reductive and academic. Whether these models are transformers or not, lookup likelihood is not indicative of understanding of concepts in any reasonable way.
If the response to an algebraic equation were based upon the probability of tokens in a corpus... and not an actual deterministic application of the rules of algebra, would that response know concepts? Would it be intelligent?
With math specifically, given the unbounded space of possible token sequences compared to ordinary language, it's clear that token prediction is not a useful methodology.
Let's say we're just trying to multiply two integers. Even if a model had Rain Man powers of memorization, and it memorized phone book after phone book of multiplication tables, the probabilistic likelihood model would fail for the very obvious reason that we cannot enumerate (and train on) all the possible outcomes of math and calculate their frequencies. We can, however, understand and use the concepts of math, which are distinct from their symbolic representation.
> lookup likelihood is not indicative of understanding of concepts in any reasonable way.
Where did I ever say that the thing was doing lookup? I only said it was producing a probability distribution.
Is your claim that all programs are just doing lookup?
> If the response to an algebraic equation were based upon the probability of tokens in a corpus...
Ah, I see the confusion. When I say “probability distribution” I do not mean “for each option, the empirical fraction of the time that this particular option appeared in the corpus”. Rather, by “probability distribution”, I mean (in the discrete case) “an assignment, to each of the options, of a number which is at least zero and at most one, such that the assigned values sum to 1”. I am allowing that this assignment of values is computed (from what is being conditioned on) in any way whatsoever.
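In symbols, for a set of options X, all that’s required of the assignment p is:

```latex
p(x) \ge 0 \ \text{ for each option } x \in X, \qquad \sum_{x \in X} p(x) = 1
```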
If the correct answer is a number, it may compute the entire correct number through some standard means, and then look at however many correct tokens from the number are already present, and assign a probability of 1 to the correct next one, and 0 to all other tokens. If conditioning on a partial answer that has parts wrong, it may use an arbitrary distribution.
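To make that concrete with the earlier multiplication example, here is a toy sketch of such a distribution (purely illustrative; not a claim about how any existing model computes anything):

```python
# Sketch of the distribution described above, for multiplying two integers:
# compute the whole correct answer, then put probability 1 on the correct next
# digit given however much of the answer has been emitted so far.
# Purely illustrative; not a description of any actual model.

def next_token_distribution(a: int, b: int, answer_so_far: str) -> dict[str, float]:
    correct = str(a * b)                       # the full correct answer, computed directly
    if not correct.startswith(answer_so_far):
        # Conditioning on an already-wrong partial answer: any distribution will do.
        digits = "0123456789"
        return {d: 1 / len(digits) for d in digits}
    if len(answer_so_far) == len(correct):
        return {"<eos>": 1.0}                  # answer complete
    return {correct[len(answer_so_far)]: 1.0}  # probability 1 on the correct next digit

# Greedy sampling from this distribution multiplies correctly, with no corpus of
# memorized multiplication tables anywhere in sight:
out = ""
while True:
    dist = next_token_distribution(1234, 5678, out)
    tok = max(dist, key=dist.get)
    if tok == "<eos>":
        break
    out += tok
assert out == str(1234 * 5678)   # "7006652"
```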