
Another option is to use the "processor trace" functionality available in Intel and Apple CPUs. This can give you a history of every single instruction executed and timing information every few instructions, with very little observer effect. Probably way more accurate than the approach in the paper, though you need the right kind of CPU and you have to deal with a huge amount of data being collected.
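For anyone who wants to try it on Linux, the easiest entry point is perf's intel_pt event. This is a generic sketch of the usual workflow, not something from the paper; it assumes a PT-capable Intel CPU and a perf build with the xed decoder, and ./myapp is just a placeholder:

    # record user-space execution into perf.data (the PT stream lands in the AUX area)
    perf record -e intel_pt//u -- ./myapp

    # decode offline into a per-instruction listing (slow; needs the xed plugin)
    perf script --insn-trace --xed

Recording is cheap; the expensive part is the offline decode afterwards.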


Those definitely make them less wrong, but still leave you hanging because most functions have side effects and those are exceedingly difficult to trace.

The function that triggers GC is typically not the function that made the mess.
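To make that concrete, here is a contrived Java sketch (my example, not anything from the article): makeMess() churns the heap with short-lived garbage, but the tiny allocation in innocentWork() can be the one that finally trips a collection, so a sampling profiler pins the pause on the wrong method.

    import java.util.ArrayList;
    import java.util.List;

    public class GcBlame {
        static final List<byte[]> sink = new ArrayList<>();

        // Fills the heap with short-lived garbage.
        static void makeMess() {
            for (int i = 0; i < 1_000_000; i++) {
                sink.add(new byte[128]);
                if (sink.size() > 10_000) sink.clear();
            }
        }

        // Allocates almost nothing, yet its allocation may be the one
        // that triggers the collection makeMess() set up.
        static long innocentWork() {
            long[] tiny = new long[8];
            for (int i = 0; i < tiny.length; i++) tiny[i] = i;
            return tiny[7];
        }

        public static void main(String[] args) {
            long total = 0;
            for (int round = 0; round < 200; round++) {
                makeMess();
                total += innocentWork();
            }
            System.out.println(total);
        }
    }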

The function that stalls on L2 cache misses often did not cause the miss.
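Same idea for the cache case, again a contrived sketch with made-up sizes: evict() streams through a buffer far larger than L2, so sumHot() pays the misses on data it would otherwise keep resident, and that is where the profiler charges the time.

    public class CacheBlame {
        static final int HOT  = 1 << 15;   // 32 K longs ~= 256 KB, roughly L2-sized
        static final int COLD = 1 << 23;   // 8 M longs  ~= 64 MB, far larger than L2
        static final long[] hot  = new long[HOT];
        static final long[] cold = new long[COLD];

        // Touches a huge buffer, pushing 'hot' out of the cache as a side effect.
        static long evict() {
            long s = 0;
            for (int i = 0; i < COLD; i++) s += cold[i];
            return s;
        }

        // The misses caused by evict() are paid (and profiled) here.
        static long sumHot() {
            long s = 0;
            for (int i = 0; i < HOT; i++) s += hot[i];
            return s;
        }

        public static void main(String[] args) {
            long total = 0;
            for (int round = 0; round < 50; round++) {
                total += evict();
                total += sumHot();
            }
            System.out.println(total);
        }
    }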

Relying on the profiler alone can easily leave 2-3x of performance on the table, and in some cases 4x. And in a world where autoscaling exists and computers run on batteries, that's a substantial delta.

And the fact is that, with few exceptions, nobody after 2008 has really known me as the optimization guy, because I don't glorify it; I'm the super-clean-code guy. If you want fast gibberish, one of those guys can come in after me and find another 2x, if you or I don't shoo them away. Now you're creeping into order-of-magnitude territory. And all of that comes after the profiler has stopped feeding you easy answers.


Do you have a source for "with very little observer effect"? I don't know any better; it just seems like a big assumption that the CPU can emit all this extra stuff without behaving differently.


Trace data are sent out through a wide/fast port (PCIe or a 60-pin connector) and captured by fast dedicated hardware at something like 10 GB per second. The trace is usually compressed: most of it only needs to indicate whether each branch was taken or not taken (TNT packets on x86; Arm's ETM is a different format but a similar enough trace path), plus a little timing, exception/interrupt, and address overhead. The bottleneck is streaming and storing the trace from the hardware debugger, since its internal buffer usually holds under half a second at max throughput, although on Intel processors you can further filter down to one application via CR3 matching. (Regarding the last five years of Apple: I'm not sure you'll find any info on Apple's debuggers and modifications to the Arm architecture. Ever.)
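A rough back-of-envelope with my own made-up numbers, just to show why that bandwidth is tractable: at roughly 1 bit per conditional branch,

    1e9 branches/s × 1 bit/branch ÷ 8 ≈ 125 MB/s per core

before timing and address packets, so a multi-GB/s trace port plus that compression can keep up with a handful of cores.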

If you encounter a slowdown using RTIT or IPT (the old and new names for Intel's hardware trace), it's usually a single-digit percentage. (The sources here are Intel's vague documentation claims plus anecdotes: Magic Trace, Hagen Paul Pfeifer, Andi Kleen, Prelude Research.)

Decoding happens later and is significantly slower, and this is where the article's focus, JIT compilation, can be problematic with hardware trace: the JIT-generated instruction bytes might change or disappear before you decode, and mapping the machine code back to each Java instruction can be tricky.


Thanks! I didn't realise it was common for CPUs to rock dedicated hardware for this.


It's not an assumption; it's based on claims made by CPU manufacturers. It's possible to get the overhead down to within 1-2%.

Intuitively this works because the hardware can just spend some extra area to stream the info off on the side of the datapath -- it doesn't need to be in the critical path.



