Could be an artifact of the small model size not fully utilizing the GPU. For example, for the slightly larger Qwen3 0.6B model the A100 is faster (you can see it when scrolling to the bottom here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11...)
From that table, the A100 tok/sec numbers (higher is faster) are:
- Eager: 28
- Compiled: 128
And
- KV cache eager: 26
- KV cache compiled: 99
The KV cache is slower here likely because that code path isn't GPU-optimized; on CPU, the KV cache version is faster. To make it faster on GPU, you would, for example, pre-allocate the cache tensors on the device instead of `torch.cat`ting them on the fly.
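Roughly what that looks like (a minimal sketch with made-up shapes, not the actual Qwen3 0.6B config; only the key cache is shown, the value cache works the same way):

```python
import torch

batch_size, num_heads, head_dim = 1, 16, 64
max_seq_len = 1024
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Growing the cache with torch.cat: reallocates and copies every step ---
k_cache = torch.empty(batch_size, num_heads, 0, head_dim, device=device)
for step in range(max_seq_len):
    new_k = torch.randn(batch_size, num_heads, 1, head_dim, device=device)
    k_cache = torch.cat([k_cache, new_k], dim=2)  # whole cache is copied each step

# --- Pre-allocated cache: write new keys into a fixed buffer in place ---
k_cache = torch.empty(batch_size, num_heads, max_seq_len, head_dim, device=device)
for step in range(max_seq_len):
    new_k = torch.randn(batch_size, num_heads, 1, head_dim, device=device)
    k_cache[:, :, step:step + 1, :] = new_k   # no new allocation, no copy of old entries
    k_valid = k_cache[:, :, :step + 1, :]     # view passed to attention at this step
```

The second version avoids launching an allocation + copy kernel for the entire cache on every decoding step, which is what tends to hurt on GPU.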