Agreed, and I think that many of the problems people expect LLMs to become capable of solving will, in fact, require AGI.
It may well turn out that LLMs are NOT the path to AGI. You can make them bigger and better, and address some of their shortcomings with various tweaks, but it seems that AGI requires online/continual learning which may prove impossible to retrofit onto a pre-trained transformer. Gradient descent may be the wrong tool for incremental learning.
At least in theory we can achieve incremental learning by retraining from scratch every time we get new training data. That approach has drawbacks, such as inconsistent performance across training runs and significantly higher training cost, but it's achievable. The question is whether there are methods more efficient than gradient descent. It seems clear by now that there is no other algorithm in sight that could reach this level of intelligence without gradient descent at its core; the open problem is just how gradient descent is used.
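To make the contrast concrete, here's a minimal sketch of the two regimes (a toy NumPy logistic-regression model invented purely for illustration, not anything from this thread): retraining from scratch on everything seen so far, versus taking a few more gradient steps on just the new batch.

    import numpy as np

    def train_logreg(X, y, epochs=200, lr=0.1, w=None):
        """Plain batch gradient descent on a toy logistic-regression model."""
        w = np.zeros(X.shape[1]) if w is None else w.copy()
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-X @ w))     # current predictions
            w -= lr * X.T @ (p - y) / len(y)     # gradient step
        return w

    # Regime 1: retrain from scratch on all data accumulated so far.
    # Consistent, but the cost grows with every update.
    def retrain_from_scratch(all_X, all_y):
        return train_logreg(np.vstack(all_X), np.concatenate(all_y))

    # Regime 2: continue gradient descent on only the new batch.
    # Cheap, but the old data is no longer in the loop, so the model
    # can drift away from (i.e. forget) what it learned earlier.
    def incremental_update(w, new_X, new_y):
        return train_logreg(new_X, new_y, epochs=20, w=w)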
The obvious alternative to gradient descent here would be Bayes Formula (probabilistic Bayesian belief updates), since this addresses the exact problem that our brains evolved to optimize - how to use prediction failure (sensory feedback vs prediction) to make better predictions - better prediction of where the food is, what the predator will do, how to attract a mate, etc. Predicting the next word too (learning language), of course.
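As a toy illustration of what a belief update looks like computationally (a hypothetical Beta-Bernoulli example, not a claim about how the brain or any existing model actually works): each new observation is a constant-time adjustment to the posterior, with no pass back over old data and no gradient step.

    # Online Bayesian belief update: a Beta prior over the probability of
    # some event, updated one observation at a time via Bayes' rule.
    class BetaBelief:
        def __init__(self, a=1.0, b=1.0):    # Beta(1,1) = uniform prior
            self.a, self.b = a, b

        def update(self, event_occurred):    # O(1) per observation
            if event_occurred:
                self.a += 1
            else:
                self.b += 1

        def predict(self):                   # posterior predictive P(event)
            return self.a / (self.a + self.b)

    belief = BetaBelief()
    for obs in [True, False, True, True]:    # streaming observations
        belief.update(obs)
    print(belief.predict())                  # -> 0.666..., revised on the fly

The point of the sketch is the update rule, not the model: the belief is revised after every single observation, which is exactly the online/continual behaviour that's hard to get out of a big pre-trained network.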
I don't think pre-training for every update works - it's an incredibly slow and expensive way to learn, and the training data just isn't there. Where is the training data that could teach it every aspect of any job - the stuff that humans learn by experimentation and experience? The training data that is available via text (and video) is mostly artifacts: what someone created, not the thought process that went into creating it, or the failed experiments and pitfalls to avoid along the way.
It would be nice to have a generic college-graduate pre-trained AGI as a starting point, but then you need to take that and train it to be a developer (starting at entry level, etc), or whatever job you'd like it to do. It takes a human years of practice to get good at jobs like these, with many try-fail-rethink experiments every day. Imagine if each of those daily updates took 6 months and $100M to incorporate?! We really need genuine online learning where each generic graduate-level AGI instance can get on-the-job training and human feedback and update its own "weights" continually.
> The obvious alternative to gradient descent here would be Bayes Formula
If you know a little about the math behind gradient descent, you can see that an embedding layer followed by a softmax layer gives you exactly the best Bayes estimate. If you want a bit of structure, like making every word depend on the previous n words, you get a convolutional RNN, which is also well studied. These ideas are natural and elegant, but it may be a better idea to digest the research that's already been done, to avoid diving into too many dead ends.
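For anyone who wants the usual one-line justification behind the "best Bayes estimate" claim (a standard cross-entropy decomposition, not specific to embeddings or any particular architecture):

    \mathbb{E}_{x,y \sim p}\left[-\log q_\theta(y \mid x)\right]
      = \mathbb{E}_{x}\left[ H\big(p(\cdot \mid x)\big)
        + \mathrm{KL}\big(p(\cdot \mid x) \,\|\, q_\theta(\cdot \mid x)\big) \right]

This is minimized exactly when q_θ(y|x) = p(y|x), i.e. when the softmax outputs match the true conditional distribution of the next word - though that says nothing about how cheaply those estimates can be updated online, which is the point of contention here.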
No, I don't "want a bit of structure" ... I want a predictive architecture that supports online learning. So far the only one I'm aware of is the cortex.
Not sure what approaches you are considering as dead ends, but RNNs still have their place (e.g. Mamba), depending on what you are trying to achieve.