Interesting post & problem. I wonder if the reason that the idea of running the ...

Interesting post & problem. I wonder if the reason that the idea of running the tasks on the same queue fails is for the same reason you have a problem in the first place - variable clock rate means it’s impossible to schedule precisely and you end up aliasing your spin stop time ideal time based on how the OS decided to clock the GPU. But that suggests that maybe your spin job isn’t complex enough to run the GPU at the highest clock because if it is running at max then you should be able to reliably time the stop of the spin even without adding a software PLL (which may not be a bad idea). I didn’t see a detailed explanation of how the spin is implemented and I suspect a more thorough spin loop that consistently drives more of the GPU might be more effective at keeping the clock rate at max perf.