
I actually looked at this in detail about a year ago for some automated driving compute work at my previous job, and I found that the detailed info you'd want from Nvidia was just 100% unavailable. There are pretty good proxies in some of the data you can get out of Nvidia tools, and there's some extra info you can glean from the function call stack in the open source Nvidia driver shim layer (the actual main components are still binary blobs, even with the "open source" driver), but overall you still can't get much useful info out.

Now that Brendan works for Intel, he can get a lot of this info from the much more open source Intel GPU driver, but that's only so useful since everyone is still either Nvidia or AMD. The more hopeful sign is that a lot of Nvidia's major customers are going to start demanding this sort of access, and there's a real chance that AMD's more accessible driver starts documenting what to actually look at, which will create the market competition to fill this space. In the meantime, take a look at the flamegraph capabilities in PyTorch and similar frameworks, work one abstraction level up, and eke out what performance you can there.
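
For example, here's a minimal sketch of that approach with the PyTorch profiler (assuming a recent PyTorch build with CUDA available; the model and shapes are just placeholders): collect kernel time with stacks attached, then export folded stacks for flamegraph.pl plus a Chrome/Perfetto trace.

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Placeholder model and input; swap in the real workload.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
    ).cuda()
    x = torch.randn(64, 1024, device="cuda")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        with_stack=True,  # keep stacks so samples can be folded into a flamegraph
    ) as prof:
        for _ in range(20):
            model(x)

    # Folded stacks for flamegraph.pl (some PyTorch versions also want
    # experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True)
    # for full source info), plus a trace file and a quick summary table.
    prof.export_stacks("cuda_stacks.txt", "self_cuda_time_total")
    prof.export_chrome_trace("trace.json")
    print(prof.key_averages(group_by_stack_n=5)
              .table(sort_by="self_cuda_time_total", row_limit=10))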



I just sent the link to a driver developer at Nvidia. If he shares the link with others at Nvidia, they should become aware of the idea tomorrow. That said, I have no idea if he will do that, but at least I tried.


Are they interested in you optimizing your workloads, or just in selling you more GPUs to help you get to market faster?


It is in Nvidia's interest that their cards have a better developer experience and cost less to run than their competitors'.


The problem is that CUDA already does that, and given where the ROCm and Intel solutions sit on capability and ease of use, Nvidia isn't really incentivized to improve beyond that baseline.


I'm not sure; it seems to me like this should be doable on Nvidia as well. Here's a paper that uses Nvidia's instruction sampling (exposed through CUPTI) to provide optimization advice:

https://ieeexplore.ieee.org/document/9370339

It seems like the instruction sampler is there, and it also provides the stall reason.


The issue there is that the info is limited to what Nvidia chooses to export from the on-chip execution. Most of what we can do for observation is in the kernel driver space, not on-chip or even in the low-level transit to the chip. One of the other commenters pointed out that you can get huge benefits from avoiding busy waiting on the data returned from the chip, which makes total sense, but it also increases latency, which didn't work for my near-realtime use case when I was investigating. Beyond those types of low-hanging fruit, where you can accept a little latency for better power-state management, it's hard to find low-level optimizations specifically for Nvidia through the closed source parts of the CUDA stack or through the driver transit to the chip when those are intentionally hidden.
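
To make that busy-wait tradeoff concrete, here's a minimal sketch in PyTorch (the matmul is a placeholder workload): the default host-side synchronize typically spins the CPU for the lowest wakeup latency, while an event created with blocking=True does a yielding wait underneath, trading a little latency for much less CPU burn.

    import torch

    x = torch.randn(4096, 4096, device="cuda")  # placeholder workload

    # Spin-wait path: torch.cuda.synchronize() typically polls the GPU,
    # giving the lowest wakeup latency but burning a CPU core while it waits.
    y = x @ x
    torch.cuda.synchronize()

    # Blocking-wait path: an event created with blocking=True makes
    # synchronize() yield the CPU until the matmul finishes, at the cost
    # of slightly higher wakeup latency.
    done = torch.cuda.Event(blocking=True)
    y = x @ x
    done.record()
    done.synchronize()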

A while ago, I read a paper on dissecting the Nvidia architecture using very specifically tuned microbenchmarks to understand things like the on-chip cache structure [0]. Unfortunately, no one has done this for recent, seriously-used architectures, so it's hard to apply that info today. Similarly, there isn't an eBPF VM running on the chip to summarize all of this, and the Nvidia tools aren't intended to make this kind of info easy to get, probably specifically because of this paper...

[0] https://arxiv.org/pdf/1804.06826
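
For a rough sense of what that kind of microbenchmarking looks like, here's a crude sketch from PyTorch rather than the hand-tuned SASS/CUDA microbenchmarks the paper uses (sizes and iteration counts are arbitrary): sweep working-set sizes and look for the effective-bandwidth knee where the data stops fitting in L2.

    import torch

    def copy_bandwidth_gbs(num_bytes, iters=50):
        # Re-read the same working set repeatedly; if src and dst both fit
        # in L2, the copies are served from cache and bandwidth jumps.
        n = num_bytes // 4
        src = torch.empty(n, dtype=torch.float32, device="cuda")
        dst = torch.empty_like(src)
        dst.copy_(src)  # warm up
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            dst.copy_(src)
        end.record()
        end.synchronize()
        seconds = start.elapsed_time(end) / 1000.0  # elapsed_time is in ms
        return iters * num_bytes * 2 / seconds / 1e9  # read + write traffic

    for mb in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"{mb:4d} MiB: {copy_bandwidth_gbs(mb * 2**20):7.1f} GB/s")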



