In this case, the curse of dimensionality works in our favor: the capacity of an n-dimensional simplex also grows exponentially with n, so it is well suited to hold the exponentially large state space of the mixed-state presentation.
Updating from one mixed state to another just involves updating each pure state in the mixture and then mixing the resulting states back together, so everything is neatly linear.
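As a minimal sketch of that linearity, consider a toy hidden Markov model (the model, tokens, and matrices below are illustrative assumptions, not anything from a real training setup). The token-conditioned belief update is linear before normalization, so updating a mixture gives the same result as mixing the updates of its pure states:

```python
import numpy as np

# Toy HMM with 3 hidden states and 2 tokens (purely illustrative).
# T[x][i, j] = P(next state = j, emit token x | current state = i),
# so for each state i the entries of T[0] and T[1] jointly sum to 1.
rng = np.random.default_rng(0)
raw = rng.random((2, 3, 3))
T = raw / raw.sum(axis=(0, 2), keepdims=True)

def update(belief, token):
    """Bayes-update a belief (mixed state) on observing `token`.
    The un-normalized update is linear; only the final division is not."""
    b = belief @ T[token]
    return b / b.sum()

# Two pure states and their 50/50 mixture.
e0 = np.array([1.0, 0.0, 0.0])
e1 = np.array([0.0, 1.0, 0.0])
mix = 0.5 * e0 + 0.5 * e1

# Linearity: un-normalized update of the mixture equals the
# mixture of the un-normalized pure-state updates.
lhs = mix @ T[1]
rhs = 0.5 * (e0 @ T[1]) + 0.5 * (e1 @ T[1])
assert np.allclose(lhs, rhs)
```

The only nonlinear step is the final renormalization of the belief vector; the per-token transition itself is a single matrix multiply.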
One way to represent the training data would be to have one state per token position, which emits the corresponding token and advances to the next position, or jumps to a randomly sampled other document if there is no next position. That matches how LLMs are actually trained, but typical transformers have a much smaller residual-stream dimension than the number of training tokens, so the model has to conflate some states. Synchronizing with the training data is also not what we want models to do anyway; otherwise we would just use infini-gram (https://arxiv.org/abs/2401.17377) to regurgitate exact matches.
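The one-state-per-token-position presentation can be sketched concretely on a tiny made-up corpus (the documents and the end-of-document behavior below are illustrative assumptions): state i deterministically emits token i and steps to state i+1, and the last position of each document jumps uniformly to the start of a random document.

```python
import numpy as np

# Hypothetical two-document corpus; each training token gets one state.
docs = [["the", "cat"], ["a", "dog", "barks"]]
tokens = [t for d in docs for t in d]
starts = []
pos = 0
for d in docs:
    starts.append(pos)   # state index of each document's first token
    pos += len(d)
n = len(tokens)
ends = {s + len(d) - 1 for s, d in zip(starts, docs)}

trans = np.zeros((n, n))
for i in range(n):
    if i in ends:
        # End of a document: jump to a uniformly random document start.
        trans[i, starts] = 1.0 / len(starts)
    else:
        # Otherwise: advance to the next token position.
        trans[i, i + 1] = 1.0

# State i emits tokens[i] with probability 1, so the emission "matrix"
# is just the token list itself.
emit = tokens

assert np.allclose(trans.sum(axis=1), 1.0)  # rows are proper distributions
```

With one state per training token, the transition matrix has as many rows as the corpus has tokens, which is exactly why a transformer's much smaller residual stream is forced to conflate states rather than memorize this presentation.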