We looked into this at Modal! We shipped vGPU support but didn't see demand, and our internal benchmarks for MPS and Green Contexts didn't indicate a big win.
The tricky thing here is that many GPU workloads saturate at least one of the resources on the GPU -- arithmetic throughput, memory bandwidth, thread slots, registers -- and so there's typically resource contention that leads to lowered throughput/increased latency for all parties.
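To make the contention point concrete, here's a toy proportional-share model (my own simplification, not how any real GPU scheduler works -- SM and memory-channel arbitration is far more dynamic) showing why co-locating two workloads that each saturate a shared resource slows both down:

```python
# Toy model: when combined demand on a shared GPU resource (e.g. memory
# bandwidth) exceeds capacity, each co-located workload's throughput is
# scaled down proportionally. Purely illustrative.

def colocated_throughput(demands, capacity=1.0):
    """Return each workload's effective throughput when sharing one resource."""
    total = sum(demands)
    if total <= capacity:
        return list(demands)  # no contention: everyone runs at full speed
    return [d * capacity / total for d in demands]

# Two workloads that each use 60% of memory bandwidth when run alone:
shared = colocated_throughput([0.6, 0.6])
print(shared)  # each drops from 0.6 to 0.5 -- both parties see ~17% lower
               # throughput (and correspondingly higher latency)
```

The point is that unlike CPU time-slicing, there's no slack to hide behind once one resource is saturated: the slowdown lands on every tenant at once.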
And in a cloud (esp serverless/auto-scaling) computing context, the variety of GPU SKUs means you can often more easily right-size your workload onto whole replicas (on our platform, from one T4 up to 8 H100s per replica).