> for no reason other than to minimally rice benchmarks.
For AI/ML applications, perhaps no one will notice.
For gaming, yielding threads of execution to the OS can periodically incur minimum scheduler delays of 10-20ms. Many gamers will notice an extra frame or so of latency being injected at random.
Sure, but CUDA is an AI/ML API, and anyway you're not doing blocking calls when writing a graphics engine. (Well, you'd better not be.) Besides, these calls will already busy-spin for a few milliseconds before yielding to the OS - it's just that you have to explicitly opt in to the yielding part. So these are the sorts of calls you'd use for high-throughput work, but they behave like calls designed for very-low-latency work. There is no point in shaving a few milliseconds off a call that takes on the order of seconds, other than to make NVIDIA look a few percent better in benchmarks. The tradeoffs are all wrong, and because nobody knows about it, megawatts of energy are being wasted.
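For reference, a minimal sketch of what that opt-in looks like with the CUDA runtime API: `cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)` (or `cudaDeviceScheduleYield`) tells the driver to block/yield the calling thread on synchronization instead of spinning a CPU core. Whether the extra scheduler latency is acceptable depends on your workload; this is just illustrating the knob being discussed, not a recommendation for every case.

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    // Opt in to blocking synchronization: the calling CPU thread sleeps
    // until the GPU signals completion instead of busy-spinning.
    // Must be set before the CUDA context is created on this device.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... launch kernels / async copies here ...

    // With the flag above, this no longer pins a CPU core at 100% for the
    // duration of the GPU work; the thread blocks until the device is idle.
    cudaDeviceSynchronize();
    return 0;
}
```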