Agreed, and I think that many of the problems people expect LLMs to become capable of solving will, in fact, require AGI.
It may well turn out that LLMs are NOT the path to AGI. You can make them bigger and better, and address some of their shortcomings with various tweaks, but it seems that AGI requires online/continual learning which may prove impossible to retrofit onto a pre-trained transformer. Gradient descent may be the wrong tool for incremental learning.
At least in theory we can achieve incremental learning by retraining from scratch every time we get new training data. That approach has drawbacks, such as inconsistent performance across training runs and significantly higher training cost, but it's achievable. The question is whether there are methods more efficient than gradient descent. It seems clear by now that there is no other algorithm in sight that could reach this level of intelligence without gradient descent at its core; the open problem is just how gradient descent is used.
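To make the contrast concrete, here's a minimal sketch of the two regimes (a toy NumPy logistic-regression model invented purely for illustration, not anything from this thread): retraining from scratch on everything seen so far, versus taking a few more gradient steps on just the new batch.

    import numpy as np

    def train_logreg(X, y, epochs=200, lr=0.1, w=None):
        """Plain batch gradient descent on a toy logistic-regression model."""
        w = np.zeros(X.shape[1]) if w is None else w.copy()
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-X @ w))     # current predictions
            w -= lr * X.T @ (p - y) / len(y)     # gradient step
        return w

    # Regime 1: retrain from scratch on all data accumulated so far.
    # Consistent, but the cost grows with every update.
    def retrain_from_scratch(all_X, all_y):
        return train_logreg(np.vstack(all_X), np.concatenate(all_y))

    # Regime 2: continue gradient descent on only the new batch.
    # Cheap, but the old data is no longer in the loop, so the model
    # can drift away from (i.e. forget) what it learned earlier.
    def incremental_update(w, new_X, new_y):
        return train_logreg(new_X, new_y, epochs=20, w=w)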
The obvious alternative to gradient descent here would be Bayes Formula (probabilistic Bayesian belief updates), since this addresses the exact problem that our brains evolved to optimize - how to use prediction failure (sensory feedback vs prediction) to make better predictions - better prediction of where the food is, what the predator will do, how to attract a mate, etc. Predicting the next word too (learning language), of course.
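As a toy illustration of what a belief update looks like computationally (a hypothetical Beta-Bernoulli example, not a claim about how the brain or any existing model actually works): each new observation is a constant-time adjustment to the posterior, with no pass back over old data and no gradient step.

    # Online Bayesian belief update: a Beta prior over the probability of
    # some event, updated one observation at a time via Bayes' rule.
    class BetaBelief:
        def __init__(self, a=1.0, b=1.0):    # Beta(1,1) = uniform prior
            self.a, self.b = a, b

        def update(self, event_occurred):    # O(1) per observation
            if event_occurred:
                self.a += 1
            else:
                self.b += 1

        def predict(self):                   # posterior predictive P(event)
            return self.a / (self.a + self.b)

    belief = BetaBelief()
    for obs in [True, False, True, True]:    # streaming observations
        belief.update(obs)
    print(belief.predict())                  # -> 0.666..., revised on the fly

The point of the sketch is the update rule, not the model: the belief is revised after every single observation, which is exactly the online/continual behaviour that's hard to get out of a big pre-trained network.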
I don't think pre-training for every update works - it's an incredibly slow and expensive way to learn, and the training data just isn't there. Where is the training data that could teach it every aspect of any job - the stuff that humans learn by experimentation and experience? The training data that is available via text (and video) is mostly artifacts: what someone created, not the thought process that went into creating it, or the failed experiments and pitfalls to avoid along the way.
It would be nice to have a generic college-graduate pre-trained AGI as a starting point, but then you need to take that and train it to be a developer (starting at entry level, etc), or whatever job you'd like it to do. It takes a human years of practice to get good at jobs like these, with many try-fail-rethink experiments every day. Imagine if each of those daily updates took 6 months and $100M to incorporate?! We really need genuine online learning where each generic graduate-level AGI instance can get on-the-job training and human feedback and update its own "weights" continually.
> The obvious alternative to gradient descent here would be Bayes Formula
If you know a little about the math behind gradient descent, you can see that an embedding layer followed by a softmax layer gives you exactly the best Bayes estimate. If you want a bit of structure, like making every word depend on the previous n words, you get a convolutional RNN, which is also well studied. These ideas are natural and elegant, but it may be a better idea to digest the research that's already been done, to avoid diving into too many dead ends.
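For anyone who wants the usual one-line justification behind the "best Bayes estimate" claim (a standard cross-entropy decomposition, not specific to embeddings or any particular architecture):

    \mathbb{E}_{x,y \sim p}\left[-\log q_\theta(y \mid x)\right]
      = \mathbb{E}_{x}\left[ H\big(p(\cdot \mid x)\big)
        + \mathrm{KL}\big(p(\cdot \mid x) \,\|\, q_\theta(\cdot \mid x)\big) \right]

This is minimized exactly when q_θ(y|x) = p(y|x), i.e. when the softmax outputs match the true conditional distribution of the next word - though that says nothing about how cheaply those estimates can be updated online, which is the point of contention here.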
No, I don't "want a bit of structure" ... I want a predictive architecture that supports online learning. So far the only one I'm aware of is the cortex.
Not sure what approaches you are considering as dead ends, but RNNs still have their place (e.g. Mamba), depending on what you are trying to achieve.