
I think people care too much about trying to innovate on model architecture. A model is ultimately a compressed representation of its training data, so even if you came up with a more efficient compression, the model's capabilities wouldn't be any better. What matters more is finding more efficient ways of training, like the current shift toward reinforcement learning.


But isn't the maximum training efficiency naturally tied to the architecture? In other words, wouldn't a different architecture have a different training-efficiency landscape? As I've said elsewhere: it isn't about "caring too much about new model architectures", it's about striking a balance between exploitation and exploration.


I didn't convey my thoughts very well. The genuinely valuable "more efficient ways of training" are paradigm shifts: pretraining for learning raw knowledge, fine-tuning for making a model behave in certain ways, and reinforcement learning for learning from an environment. Those are all agnostic to the model architecture. A better architecture might make pretraining 2x faster, but it won't make pretraining replace the need for reinforcement learning. There's less value in exploring that space than in finding ways to train a model to be capable of something it wasn't before.
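
To make that distinction concrete, here's a rough sketch (hypothetical Model/update interfaces, not any real framework) of the three stages. The architecture lives entirely behind model.update; the paradigms only differ in where the training signal comes from:

    # Minimal sketch of the three training paradigms discussed above.
    # The interfaces (Model, predict, update) are made up for illustration;
    # the point is that every stage consumes the same model object, so the
    # paradigms are orthogonal to whatever architecture sits behind update().

    from typing import Callable, List, Tuple

    class Model:
        """Stand-in for any architecture (transformer, SSM, ...)."""
        def predict(self, prompt: str) -> str:
            return ""   # placeholder inference
        def update(self, loss: float) -> None:
            pass        # placeholder gradient step

    def pretrain(model: Model, corpus: List[str]) -> None:
        # Stage 1: next-token prediction over raw text ("raw knowledge").
        for text in corpus:
            loss = float(len(text))  # dummy loss; really cross-entropy on next tokens
            model.update(loss)

    def finetune(model: Model, pairs: List[Tuple[str, str]]) -> None:
        # Stage 2: supervised prompt/target pairs that shape behavior.
        for prompt, target in pairs:
            loss = 0.0 if model.predict(prompt) == target else 1.0
            model.update(loss)

    def reinforce(model: Model, env_reward: Callable[[str], float], prompts: List[str]) -> None:
        # Stage 3: learn from an environment's reward instead of fixed targets.
        for prompt in prompts:
            reward = env_reward(model.predict(prompt))
            model.update(-reward)  # maximizing reward ~ minimizing negative reward

    if __name__ == "__main__":
        m = Model()
        pretrain(m, ["some raw text"])
        finetune(m, [("question", "answer")])
        reinforce(m, lambda out: 1.0, ["task prompt"])

Swapping in a faster architecture only changes what happens inside update(); it doesn't change the fact that only the third stage can learn from an environment.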



