

NobodyNada on Dec 22, 2022 | on: 24-core CPU and I can’t move my mouse (2017)

In the follow-up blog post (https://randomascii.wordpress.com/2017/07/27/what-is-windows...) he uses a sampler to try to find the bottleneck by looking at CPU instructions, but concludes:

> But the main thing I always realize when using this technique is that modern CPUs are weird and confusing. Because CPUs are massively out-of-order and super-scalar it is not at all clear what it means for a sampling interrupt to hit “on” a particular instruction. If an instruction is particularly expensive then samples are more likely to hit “near there” but I’m not sure where “near there” actually is:

> If there are three instructions executing simultaneously when the interrupt fires then which one “gets the blame”?

> If a load instruction misses in the cache and forces the CPU to wait then will the samples show up on the load instruction, or on the first use of the data? Both seem to happen.

> If a branch is mispredicted then will the samples show up on the branch instruction or on the branch target?

> What’s going on with the expensive cmp instruction on line 24 of the spreadsheet?

> If anyone has a good model for what happens to the CPU pipelines when a sampling interrupt happens I would appreciate that. Ideally that would explain the relationship between clusters of samples and expensive instructions.

I have a distinct memory of reading a blog post probably 5-ish years ago where someone did just that. The author started with some microbenchmarks of very tight loops, and used some sort of profiler/perf counter tool that measured hit counts for each instruction of the loop. Then they did a deep dive into the instruction throughputs, latencies, and dependency chains to demonstrate how bottlenecks at the CPU level manifested as clusters of samples, and how to use this information to optimize the loop.

Does anybody else remember this post, and possibly where I can find it? I’ve been in a couple of situations where it would have been tremendously helpful, but I just haven’t been able to dig it up.

turol on Dec 23, 2022

It was Travis Downs: https://travisdowns.github.io/blog/2019/08/20/interrupts.htm...

NobodyNada on Dec 23, 2022

That’s it, thank you so much!

formerly_proven on Dec 22, 2022

Sounds like the work of Agner Fog.
