
I actually looked at this in detail about a year ago for some automated driving compute work at my previous job, and I found that the detailed info you'd want from Nvidia was just 100% unavailable. There are pretty good proxies in some of the data you can get out of Nvidia tools, and there's some extra info you can glean from the function call stack in the open source Nvidia driver shim layer (the actual main components are still binary blobs, even with the "open source" driver), but overall you still can't get much useful info out.

Now that Brendan works for Intel, he can get a lot of this info from the much more open source Intel GPU driver, but that's only so useful since everyone is still either Nvidia or AMD. The more hopeful sign is that a lot of Nvidia's major customers are going to start demanding this sort of access, and there's a real chance that AMD's more accessible driver starts documenting what to actually look at, which will create the market competition to fill this space. In the meantime, take a look at the flamegraph capabilities in PyTorch and similar frameworks, work one abstraction level up, and eke out what performance you can there.
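
For example, here's a minimal sketch of that approach with the PyTorch profiler (assuming a recent PyTorch build with CUDA available; the model and shapes are just placeholders): collect kernel time with stacks attached, then export folded stacks for flamegraph.pl plus a Chrome/Perfetto trace.

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Placeholder model and input; swap in the real workload.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
    ).cuda()
    x = torch.randn(64, 1024, device="cuda")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        with_stack=True,  # keep stacks so samples can be folded into a flamegraph
    ) as prof:
        for _ in range(20):
            model(x)

    # Folded stacks for flamegraph.pl (some PyTorch versions also want
    # experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True)
    # for full source info), plus a trace file and a quick summary table.
    prof.export_stacks("cuda_stacks.txt", "self_cuda_time_total")
    prof.export_chrome_trace("trace.json")
    print(prof.key_averages(group_by_stack_n=5)
              .table(sort_by="self_cuda_time_total", row_limit=10))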



I just sent the link to a driver developer at Nvidia. If he shares the link with others at Nvidia, they should become aware of the idea tomorrow. That said, I have no idea if he will do that, but at least I tried.


Are they interested in you optimizing your workloads, or just in selling you more GPUs to help you get to market faster?


It is in Nvidia's interest that their cards have a better developer experience and cost less to run than their competitors'.


The problem is that CUDA already does that, and given where the ROCm and Intel solutions sit on capability and ease of use, Nvidia isn't really incentivized to improve beyond that baseline.


I'm not sure; it seems to me like this should be doable on Nvidia as well. Here's a paper that uses Nvidia's instruction sampling (exposed through CUPTI) to provide optimization advice:

https://ieeexplore.ieee.org/document/9370339

It seems like the instruction sampler is there, and it also provides the stall reason.


The issue there is that the info is limited to what Nvidia chooses to export from the on-chip execution. Most of what we can do for observation is in the kernel driver space, not on-chip or even in the low-level transit to the chip. One of the other commenters pointed out that you can get huge benefits from avoiding busy waiting on the data returned from the chip, which makes total sense, but it also increases latency, which didn't work for my near-realtime use case when I was investigating. Beyond those types of low-hanging fruit, where you can accept a little latency for better power-state management, it's hard to find low-level optimizations specifically for Nvidia through the closed source parts of the CUDA stack or through the driver transit to the chip when those are intentionally hidden.
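
To make that busy-wait tradeoff concrete, here's a minimal sketch in PyTorch (the matmul is a placeholder workload): the default host-side synchronize typically spins the CPU for the lowest wakeup latency, while an event created with blocking=True does a yielding wait underneath, trading a little latency for much less CPU burn.

    import torch

    x = torch.randn(4096, 4096, device="cuda")  # placeholder workload

    # Spin-wait path: torch.cuda.synchronize() typically polls the GPU,
    # giving the lowest wakeup latency but burning a CPU core while it waits.
    y = x @ x
    torch.cuda.synchronize()

    # Blocking-wait path: an event created with blocking=True makes
    # synchronize() yield the CPU until the matmul finishes, at the cost
    # of slightly higher wakeup latency.
    done = torch.cuda.Event(blocking=True)
    y = x @ x
    done.record()
    done.synchronize()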

A while ago, I read a paper on dissecting the Nvidia architecture using very specifically tuned microbenchmarks to understand things like the on-chip cache structure [0]. Unfortunately, no one has done this for recent, seriously-used architectures, so it's hard to apply that info today. Similarly, there isn't an eBPF VM running on the chip to summarize all of this, and the Nvidia tools aren't intended to make this kind of info easy to get, probably specifically because of this paper...

[0] https://arxiv.org/pdf/1804.06826
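
For a rough sense of what that kind of microbenchmarking looks like, here's a crude sketch from PyTorch rather than the hand-tuned SASS/CUDA microbenchmarks the paper uses (sizes and iteration counts are arbitrary): sweep working-set sizes and look for the effective-bandwidth knee where the data stops fitting in L2.

    import torch

    def copy_bandwidth_gbs(num_bytes, iters=50):
        # Re-read the same working set repeatedly; if src and dst both fit
        # in L2, the copies are served from cache and bandwidth jumps.
        n = num_bytes // 4
        src = torch.empty(n, dtype=torch.float32, device="cuda")
        dst = torch.empty_like(src)
        dst.copy_(src)  # warm up
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            dst.copy_(src)
        end.record()
        end.synchronize()
        seconds = start.elapsed_time(end) / 1000.0  # elapsed_time is in ms
        return iters * num_bytes * 2 / seconds / 1e9  # read + write traffic

    for mb in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"{mb:4d} MiB: {copy_bandwidth_gbs(mb * 2**20):7.1f} GB/s")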



