The limit for this sort of exercise is "holding everything in memory", because training a neural network requires updating the weights frequently. An NVIDIA A100 has a memory bandwidth of roughly 2 TB/s; your home ADSL is something on the order of 10 Mbit/s. And then there's latency.
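To put rough numbers on that, a back-of-the-envelope sketch (the 7B-parameter fp16 model size is an assumed example, not a figure from above):

```python
# Back-of-the-envelope: time to move one full copy of the weights.
# The 7B-parameter fp16 model is an assumed example.
params = 7e9
bytes_per_param = 2                  # fp16
model_bytes = params * bytes_per_param

a100_hbm = 2e12                      # ~2 TB/s A100 memory bandwidth
home_adsl = 10e6 / 8                 # 10 Mbit/s link, in bytes/s

print(f"A100 HBM:  {model_bytes / a100_hbm:.3f} s")      # milliseconds
print(f"Home ADSL: {model_bytes / home_adsl / 3600:.1f} h")  # hours, per transfer
```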
Mind you, theoretically that is a limitation of our current network architectures. If we could conceive of a learning approach that was localised, to the point of being "embarrassingly parallel", perhaps it could work. It would probably be less efficient, but if it is sufficiently parallel to compensate for Amdahl's law, who knows?
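Amdahl's law makes the bar concrete. A quick sketch with an assumed parallel fraction:

```python
# Amdahl's law: speedup with n workers when a fraction p of the work parallelises.
# p = 0.99 and the worker counts are illustrative assumptions.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for n in (10, 1_000, 100_000):
    print(f"{n:>7} workers -> {amdahl_speedup(0.99, n):.1f}x")
# Speedup saturates near 1/(1-p) = 100x, so a scheme that relies on massive
# parallelism to offset slow links needs p extremely close to 1.
```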
Less theoretically, one could imagine using the same approach we use in systems engineering in general: functional decomposition. Instead of one Huge Model To Rule Them All, train separate models that each perform a specific, modular function, and then integrate them.
In a sense this is what is currently happening already. Stable Diffusion have one model to generate img2depth, to generate an estimation which parts of a picture are far away from the lense. They have another model to upscale low res images to high res images, etc etc. This is also how the brain works.
But it is difficult to see how this sort of approach could be pushed down to the very small-scale, low-context tasks that something like folding@home relies on.
You would likely be limited by the communication latency between nodes unless you came up with some novel model architecture or training method. Most of these large-scale models are trained on GPUs connected by very high-speed interconnects.
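To put a number on that: in data-parallel training, a ring all-reduce pushes roughly 2·(n−1)/n times the gradient size through each node's link every step, plus a latency term. A sketch with assumed figures (gradient size, node counts, bandwidths, and latencies are all made up for illustration):

```python
# Rough per-step cost of synchronising gradients with a ring all-reduce.
# All of the concrete figures below are assumed examples.
def allreduce_seconds(grad_bytes, n_nodes, link_bytes_per_s, latency_s):
    traffic = 2 * (n_nodes - 1) / n_nodes * grad_bytes   # bytes per node
    phases = 2 * (n_nodes - 1)                           # latency-bound steps
    return traffic / link_bytes_per_s + phases * latency_s

grad_bytes = 1.4e10                                        # ~7B params in fp16
print(allreduce_seconds(grad_bytes, 8, 300e9, 5e-6))       # NVLink-class cluster
print(allreduce_seconds(grad_bytes, 1000, 1.25e6, 50e-3))  # 10 Mbit home links
```

That comes out to a fraction of a second per step on a fast cluster versus hours per step over home broadband, before even accounting for stragglers.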
The term for this is federated learning. Usually it’s used to preserve privacy, since a user’s data can stay on their device. I think it ends up not being efficient at the model sizes used here.
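For reference, the core of federated averaging (FedAvg): clients train on their own data locally and a central server averages the resulting weights each round. A minimal NumPy sketch with the local training step stubbed out:

```python
import numpy as np

# Minimal FedAvg sketch. Local training is stubbed with a random update;
# a real client would run SGD on its own private data instead.
def local_update(weights, client_data, lr=0.01):
    fake_gradient = np.random.randn(*weights.shape)      # placeholder for local SGD
    return weights - lr * fake_gradient

def fedavg_round(global_weights, clients):
    local_models = [local_update(global_weights.copy(), c) for c in clients]
    return np.mean(local_models, axis=0)                 # average the client models

weights = np.zeros(10)
for _ in range(5):                                       # five communication rounds
    weights = fedavg_round(weights, clients=[None] * 20)
```

Only the weights cross the network each round, which is why it works for privacy but still struggles at the model sizes discussed above.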
Couldn't we train a very good model by distributing the dataset along with the computing power using something similar to folding@home?