Most parallel processing scales linearly with core count. But the 3090 is more interesting for machine learning because of its RAM: it has 24GB versus the 3080's 10GB. With machine learning, you spend most of the time copying memory between the CPU and GPU, so being able to fit more data on the GPU reduces computation latency.
“With machine learning, you spend most of the time copying memory between the CPU and GPU”
This is a sign that you are most likely doing it wrong. Yes, some operations are inherently bandwidth bound, but the most important ones, such as large matrix multiplies (transformers) and convolutions, are compute bound.
TIP: if you need to emulate a bigger batch with less RAM available, use the gradient accumulation trick. It is super easy to implement in PyTorch, and it is already available as a single flag (accumulate_grad_batches) in PyTorch Lightning.
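For illustration, here is a minimal sketch of the trick in plain PyTorch; the model, optimizer, and random data below are toy placeholders, not anything specific to the setup being discussed:

    import torch
    from torch import nn

    # Toy setup -- placeholders just to make the loop runnable.
    model = nn.Linear(128, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    loader = [(torch.randn(8, 128), torch.randint(0, 10, (8,)))
              for _ in range(16)]

    accumulation_steps = 4  # effective batch size = 8 * 4 = 32

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.cuda(), targets.cuda()
        loss = criterion(model(inputs), targets)
        # Scale the loss so the summed gradients match one big-batch update.
        (loss / accumulation_steps).backward()
        # Only step the optimizer every accumulation_steps mini-batches.
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

In PyTorch Lightning the equivalent is just passing accumulate_grad_batches=4 to the Trainer.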
Your gradient accumulation trick involves multiple CPU-to-GPU transfers, which is precisely what the parent is trying to avoid by fitting a larger batch in GPU memory.