Most parallel processing scales linearly with core count. But the 3090 is more interesting for machine learning because of its RAM: it has 24GB versus the 3080's 10GB. With machine learning, you spend most of the time copying memory between the CPU and GPU, so being able to fit more data on the GPU reduces computation latency.
“With machine learning, you spend most of the time copying memory between the CPU and GPU”
This is a sign that you are most likely doing it wrong. Yes, some operations are inherently bandwidth bound, but the most important ones, such as large matrix multiplies (transformers) and convolutions, are compute bound.
TIP: if you need to emulate a bigger batch with less RAM available, use the gradient accumulation trick. It is super easy to implement in PyTorch, and it is already available as a single flag (accumulate_grad_batches) in PyTorch Lightning.
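For illustration, here is a minimal sketch of the trick in plain PyTorch; the model, optimizer, and random data below are toy placeholders, not anything specific to the setup being discussed:

    import torch
    from torch import nn

    # Toy setup -- placeholders just to make the loop runnable.
    model = nn.Linear(128, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    loader = [(torch.randn(8, 128), torch.randint(0, 10, (8,)))
              for _ in range(16)]

    accumulation_steps = 4  # effective batch size = 8 * 4 = 32

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.cuda(), targets.cuda()
        loss = criterion(model(inputs), targets)
        # Scale the loss so the summed gradients match one big-batch update.
        (loss / accumulation_steps).backward()
        # Only step the optimizer every accumulation_steps mini-batches.
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

In PyTorch Lightning the equivalent is just passing accumulate_grad_batches=4 to the Trainer.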
Your gradient accumulation trick involves multiple CPU-to-GPU transfers, which is precisely what the parent is trying to avoid by fitting a larger batch in GPU memory.