A single-threaded benchmark better represents real performance, I'd argue. 10 Gbps is only 1.25 GB/s, after all, and few applications use parallel streams.
You can't do it with zero kernel code for ISA devices, but if there were a PCI busmouse, uio + uio_pci_generic would work for reading the mouse, and you'd use uinput to send the events to the input stack. If you're willing to write a little UIO stub driver for the interrupt, you can do it for ISA too. UIO has been around since 2006 or so.
tl;dr: it's there, but nobody is interested in reinventing ancient pre-PCI drivers, so there's no generic ISA plumbing.
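For the curious, the userspace side really is that small: /dev/uioN turns interrupts into blocking reads, and the rest is decoding whatever the device hands you. A minimal sketch, assuming a uio_pci_generic-bound device at /dev/uio0; the packet format here is an illustrative PS/2-style layout (a real busmouse would have its own register map), and the uinput injection is elided:

```python
import os
import struct

def decode_packet(b0: int, dx: int, dy: int):
    """Decode a PS/2-style 3-byte mouse packet (illustrative format).

    b0 carries the button bits and the sign bits for the two deltas.
    Returns (left, right, dx, dy) with the deltas as signed ints.
    """
    left = bool(b0 & 0x01)
    right = bool(b0 & 0x02)
    if b0 & 0x10:          # X sign bit set: delta is negative
        dx -= 256
    if b0 & 0x20:          # Y sign bit
        dy -= 256
    return left, right, dx, dy

def irq_loop(uio_path="/dev/uio0"):
    """Block on UIO interrupts; decode and forward events in a real driver.

    Each read() on /dev/uioN returns a 4-byte count of interrupts seen
    so far; with uio_pci_generic you re-arm the masked INTx by writing 1.
    """
    fd = os.open(uio_path, os.O_RDWR)
    while True:
        os.write(fd, struct.pack("<i", 1))  # unmask the interrupt
        os.read(fd, 4)                      # blocks until the next IRQ
        # ...read dx/dy/buttons from the mmap'd device registers,
        # decode_packet(...), and forward the events via /dev/uinput...
```

The point is that the kernel's only job here is interrupt delivery; everything device-specific stays in userspace.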
"TPU 8t and TPU 8i deliver up to two times better performance-per-watt over the previous generation" sounds impressive, especially as the previous generation is so recent (2025).
Interesting that there's separate inference and training focused hardware. Do companies using NV hardware also use different hardware for each task or is their compute more fungible?
That training is compute-bound and inference is memory-bound is well-known, but I don't think Nvidia deployments typically specialize for one vs the other.
One reason is that most clouds/neoclouds don't own the workloads and want fungibility. Given that you're spending a lot on H200s and whatnot, it's worth also spending on the networking to make sure you can sell them to all kinds of customers. The Groq LPU in Vera Rubin is an inference-specific accelerator, and Cerebras is also inference-optimized, so specialization is starting to happen.
I can't answer for NVIDIA but AWS has its own training and inference chips, and word on the street is the inference chips are too weak, so some companies are running inference on the training chips.
They stopped producing Inferentia altogether and are only investing in Trainium now. They also announced a partnership with Cerebras not long ago. That should give you a clue.
> Interesting that there's separate inference and training focused hardware. Do companies using NV hardware also use different hardware for each task or is their compute more fungible?
Dedicated hardware will usually be faster, which is why as certain things mature, they go from being complicated and expensive to being cheap and plentiful in $1 chips. This tells me Google has a much better grasp on their stack than people building on NVidia, because Google owns everything from the keyboard to the silicon. They've iterated so much they understand how to separate out different functions that compete with each other for resources.
The "training" chips will probably be quite usable for slower, higher-throughput inference at scale. I expect that to be quite popular eventually for non-time-sensitive uses.
Vera Rubin will have Groq chips focused on fast inference so it points toward a trend. Also, with energy needs so high, why not reach for every feasible optimization?
Nvidia said in March that they're working on specialized inference hardware, but they don't have any right now. You can do inference on Nvidia's current hardware, but it's not as efficient.
I did some (really shallow) research, and Lightpanda seems like a better solution for letting an agent search the web than a wrapper around the Chrome DevTools protocol.
> Processors are always locked at the highest performance state (including "turbo" frequencies). All cores are unparked. Thermal output may be significant.
There's no Linux mode spelled the same ("High Performance"), and I don't think Linux systems universally do this:
> Processors are always locked at the highest performance state (including "turbo" frequencies).
Unless performance state means something idiosyncratic in MS terminology.
Normally you'd want to let idle cores apply power-saving measures, including downclocking, to donate some unused power envelope to the busy cores, increasing overall performance.
> "throughput-performance:
> A server profile optimized for high throughput. It disables power-saving mechanisms and enables sysctl settings that improve the throughput performance of disk and network I/O.
>
> accelerator-performance:
> A profile that contains the same tuning as the throughput-performance profile. Additionally, it locks the CPU to low C-states so that the latency is less than 100 us. This improves the performance of certain accelerators, such as GPUs.
>
> latency-performance:
> A server profile optimized for low latency. It disables power-saving mechanisms and enables sysctl settings that improve latency. The CPU governor is set to performance and the CPU is locked to low C-states (by PM QoS)."
Here the latency-performance profile sounds most like the Windows Server mode (but different from throughput-performance).
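For reference, switching between these profiles is a one-liner with tuned-adm (assuming tuned is installed and running, as it is by default on RHEL-family systems):

```shell
# Show the currently active tuned profile
tuned-adm active

# Switch to the low-latency profile quoted above
sudo tuned-adm profile latency-performance

# Spot-check one of the things it changes, e.g. the cpufreq governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```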
Smartphone battery capacity seems to double every ~8 years. The design space is: add more battery capacity, accept shorter battery life, or use less power.
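A doubling time of ~8 years is slower progress than it sounds; as a quick sanity check on the implied annual rate:

```python
# Annual capacity growth implied by a doubling time of 8 years:
# solve r**8 == 2 for the yearly multiplier r
annual = 2 ** (1 / 8)
print(f"about {(annual - 1) * 100:.0f}% per year")  # roughly 9%
```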