My summary: Anything that drives predictions of "the next token" must build both a model of "how the world works" and a model of "what state the world is in right now". The authors train a transformer and demonstrate that it learns internal structure representing both of these.
Crucially, it tends to find the simplest such representation that still solves the problem, which is why the model is ultimately only sufficient for the problems it was trained to solve.
(I could be wrong here. Please correct me if that's the case.)
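To make the "representation of world state" claim concrete, here's a toy sketch of the standard linear-probing idea: if a model's hidden activations really encode the current world state, a simple linear readout trained on those activations should recover it well above chance. The activations below are synthetic stand-ins (a noisy "state direction" in a 32-dim space), not from a real transformer, so this only illustrates the technique, not the paper's actual experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: a binary world state, and hidden activations that
# encode it as a (noisy) direction in activation space.
n, d = 1000, 32
world_state = rng.integers(0, 2, size=n)           # the "true" state per example
direction = rng.normal(size=d)                     # hypothetical state direction
hidden = np.outer(world_state, direction) + 0.5 * rng.normal(size=(n, d))

# Linear probe: least-squares fit from hidden activations to world state.
w, *_ = np.linalg.lstsq(hidden, world_state, rcond=None)
pred = (hidden @ w) > 0.5
accuracy = (pred == world_state).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

If the probe's accuracy is far above the 0.5 chance level, the state is linearly decodable from the activations, which is (roughly) the kind of evidence such papers present for an internal world-state representation.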
I think the comment you're replying to means exactly what you're saying: it will find solutions that are "easy" for the optimizer to reach, and therefore solutions that are simple in the sense that some optimizer converges to them.