“With machine learning, you spent most of the time copying memory between the CPU and GPU”
- this is a sign that you are most likely doing it wrong. Yes, some operations are inherently bandwidth bound, but the most important ones, such as large matrix multiplies (transformers) and convolutions, are compute bound.
TIP: if you need to emulate a bigger batch with less GPU RAM available, use the gradient accumulation trick. It is super easy to implement in PyTorch, and it is already available as a single flag (accumulate_grad_batches) in PyTorch Lightning. A sketch follows below.
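Roughly, the trick in plain PyTorch looks like this (a minimal sketch; the model, data, and accumulation_steps value are all placeholder assumptions, not anything from the thread):

    import torch

    # Placeholders: a dummy model, loss, optimizer, and random data.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(128, 10).to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    dataset = torch.utils.data.TensorDataset(
        torch.randn(64, 128), torch.randint(0, 10, (64,)))
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)

    accumulation_steps = 4  # effective batch size = 8 * 4 = 32

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        loss = criterion(model(inputs), targets)
        # Scale the loss so the summed gradients average over the
        # larger effective batch rather than over each mini-batch.
        (loss / accumulation_steps).backward()
        # Only step the optimizer once every accumulation_steps batches.
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

In PyTorch Lightning the equivalent is the single flag mentioned above: Trainer(accumulate_grad_batches=4).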
Your gradient accumulation trick involves multiple CPU-to-GPU transfers, which is precisely what the parent is trying to avoid by fitting a larger batch in GPU memory.