We looked into this at Modal! We shipped vGPU support but didn't see demand, and our internal benchmarks for MPS and Green Contexts didn't indicate a big win.
The tricky thing here is that many GPU workloads saturate at least one of the resources on the GPU -- arithmetic throughput, memory bandwidth, thread slots, registers -- and so there's typically resource contention that leads to lowered throughput/increased latency for all parties.
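To make the contention point concrete, here's a toy proportional-share model (my own simplification, not how any real GPU scheduler works -- SM and memory-channel arbitration is far more dynamic) showing why co-locating two workloads that each saturate a shared resource slows both down:

```python
# Toy model: when combined demand on a shared GPU resource (e.g. memory
# bandwidth) exceeds capacity, each co-located workload's throughput is
# scaled down proportionally. Purely illustrative.

def colocated_throughput(demands, capacity=1.0):
    """Return each workload's effective throughput when sharing one resource."""
    total = sum(demands)
    if total <= capacity:
        return list(demands)  # no contention: everyone runs at full speed
    return [d * capacity / total for d in demands]

# Two workloads that each use 60% of memory bandwidth when run alone:
shared = colocated_throughput([0.6, 0.6])
print(shared)  # each drops from 0.6 to 0.5 -- both parties see ~17% lower
               # throughput (and correspondingly higher latency)
```

The point is that unlike CPU time-slicing, there's no slack to hide behind once one resource is saturated: the slowdown lands on every tenant at once.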
And in a cloud (esp serverless/auto-scaling) computing context, the variety of GPU SKUs means you can often more easily right-size your workload onto whole replicas (on our platform, from one T4 up to 8 H100s per replica).