
Another option is to use the "processor trace" functionality available in Intel and Apple CPUs. This can give you a history of every single instruction executed and timing information every few instructions, with very little observer effect. Probably way more accurate than the approach in the paper, though you need the right kind of CPU and you have to deal with a huge amount of data being collected.
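For anyone who wants to try it on Linux, the easiest entry point is perf's intel_pt event. This is a generic sketch of the usual workflow, not something from the paper; it assumes a PT-capable Intel CPU and a perf build with the xed decoder, and ./myapp is just a placeholder:

    # record user-space execution into perf.data (the PT stream lands in the AUX area)
    perf record -e intel_pt//u -- ./myapp

    # decode offline into a per-instruction listing (slow; needs the xed plugin)
    perf script --insn-trace --xed

Recording is cheap; the expensive part is the offline decode afterwards.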


Those definitely make them less wrong, but still leave you hanging because most functions have side effects and those are exceedingly difficult to trace.

The function that triggers GC is typically not the function that made the mess.
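To make that concrete, here is a contrived Java sketch (my example, not anything from the article): makeMess() churns the heap with short-lived garbage, but the tiny allocation in innocentWork() can be the one that finally trips a collection, so a sampling profiler pins the pause on the wrong method.

    import java.util.ArrayList;
    import java.util.List;

    public class GcBlame {
        static final List<byte[]> sink = new ArrayList<>();

        // Fills the heap with short-lived garbage.
        static void makeMess() {
            for (int i = 0; i < 1_000_000; i++) {
                sink.add(new byte[128]);
                if (sink.size() > 10_000) sink.clear();
            }
        }

        // Allocates almost nothing, yet its allocation may be the one
        // that triggers the collection makeMess() set up.
        static long innocentWork() {
            long[] tiny = new long[8];
            for (int i = 0; i < tiny.length; i++) tiny[i] = i;
            return tiny[7];
        }

        public static void main(String[] args) {
            long total = 0;
            for (int round = 0; round < 200; round++) {
                makeMess();
                total += innocentWork();
            }
            System.out.println(total);
        }
    }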

The function that stalls on L2 cache misses often did not cause the miss.
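Same idea for the cache case, again a contrived sketch with made-up sizes: evict() streams through a buffer far larger than L2, so sumHot() pays the misses on data it would otherwise keep resident, and that is where the profiler charges the time.

    public class CacheBlame {
        static final int HOT  = 1 << 15;   // 32 K longs ~= 256 KB, roughly L2-sized
        static final int COLD = 1 << 23;   // 8 M longs  ~= 64 MB, far larger than L2
        static final long[] hot  = new long[HOT];
        static final long[] cold = new long[COLD];

        // Touches a huge buffer, pushing 'hot' out of the cache as a side effect.
        static long evict() {
            long s = 0;
            for (int i = 0; i < COLD; i++) s += cold[i];
            return s;
        }

        // The misses caused by evict() are paid (and profiled) here.
        static long sumHot() {
            long s = 0;
            for (int i = 0; i < HOT; i++) s += hot[i];
            return s;
        }

        public static void main(String[] args) {
            long total = 0;
            for (int round = 0; round < 50; round++) {
                total += evict();
                total += sumHot();
            }
            System.out.println(total);
        }
    }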

Relying on the profiler alone can easily leave 2-3x of performance on the table, and in some cases 4x. And in a world where autoscaling exists and computers run on batteries, that's a substantial delta.

And the fact is that, with few exceptions, nobody after 2008 has really known me as the optimization guy, because I don't glorify it; I'm the super-clean-code guy. If you want fast gibberish, one of those guys can come in after me and find another 2x, if you or I don't shoo them away. Now you're creeping into order-of-magnitude territory. And all of that comes after the profiler has stopped feeding you easy answers.


Do you have a source for "with very little observer effect"? I don't know any better; it just seems like a big assumption that the CPU can emit all this extra stuff without behaving differently.


Trace data are sent out through a wide/fast port (PCIe or a 60-pin connector) and captured by fast dedicated hardware at something like 10 GB per second. The trace is usually compressed: most of it only needs to indicate whether each branch was taken or not taken (TNT packets on x86; Arm's ETM is a different format but a similar enough trace path), plus a little timing, exception/interrupt, and address overhead. The bottleneck is streaming and storing the trace from the hardware debugger, since its internal buffer usually holds under half a second at max throughput, although on Intel processors you can further filter down to one application via CR3 matching. (Regarding the last five years of Apple: I'm not sure you'll find any info on Apple's debuggers and modifications to the Arm architecture. Ever.)
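A rough back-of-envelope with my own made-up numbers, just to show why that bandwidth is tractable: at roughly 1 bit per conditional branch,

    1e9 branches/s × 1 bit/branch ÷ 8 ≈ 125 MB/s per core

before timing and address packets, so a multi-GB/s trace port plus that compression can keep up with a handful of cores.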

If you encounter a slowdown using RTIT or IPT (the old and new names for Intel's hardware trace), it's usually a single-digit percentage. (The sources here are Intel's vague documentation claims plus anecdotes: Magic Trace, Hagen Paul Pfeifer, Andi Kleen, Prelude Research.)

Decoding happens later and is significantly slower, and this is where the article's focus, JIT compilation, can be problematic with hardware trace: the JIT-generated instruction bytes might change or disappear before you decode, and mapping the machine code back to each Java instruction can be tricky.


Thanks! I didn't realise it was common for CPUs to rock dedicated hardware for this.


It's not an assumption; it's based on claims made by CPU manufacturers. It's possible to get the overhead down to within 1-2%.

Intuitively this works because the hardware can just spend some extra area to stream the info off on the side of the datapath -- it doesn't need to be in the critical path.



