I just want to say: wow! You are showing me something I bet many others are real...

nkurz · on April 26, 2020

> What is the speed with restriction to SSE3? That would finalize the tests.

I think this is the right incantation to restrict to SSE3:

  clang-6.0 -O3 bodies.c -msse3 -Wall -lm -o bodies_clang_O3_sse3_c
  perf stat bodies_clang_O3_sse3_c 5000000
  1,470,870,787 cycles       #    3.691 GHz
  3,391,153,749 instructions #    2.31  insns per cycle

  gcc-8 -O3 bodies.c -msse3 -Wall -lm -o bodies_gcc_O3_sse3_c
  perf stat bodies_gcc_O3_sse3_c 5000000
  2,256,550,525 cycles       #    3.691 GHz
  3,306,361,186 instructions #    1.47  insns per cycle

It would be interesting to analyze the difference between GCC and Clang here. Just glancing at it without really understanding it, one big difference seems to be that Clang might be calling out to a separate square root routine rather than using the built-in assembly instruction. Hmm, although that makes me wonder whether that library routine is using a more advanced instruction set.

> Do you mind if I directly quote you in a follow up post?

Sure, but realize that I'm speculating here. Essentially, I'm an expert (or at least was a couple of years ago) on integer vector operations on modern Intel server processors. But I'm not familiar with their consumer models, and I'm much less fluent in floating point. The result is that I know where to look for answers, but don't actually know them offhand. So quote the primary sources instead when you can.

Please do write a follow-up, and send me an email when you post it. My address is in my HN profile (click on my username). Also, I might be able to provide you with remote access to testing machines if that would help you test.

nkurz · on April 26, 2020

OK, I might have a lead on the GCC versus Clang gap. At the least, this obscure command line option closes the gap:

  gcc-8 -fno-math-errno -O3 bodies.c -msse3 -Wall -lm -o bodies_gcc_O3_sse3_c
  perf stat bodies_gcc_O3_sse3_c 5000000
  1,398,805,282 cycles       #    3.691 GHz
  3,181,041,562 instructions #    2.27  insns per cycle

I haven't thought about it much, but there is more info here: https://stackoverflow.com/questions/37117809/why-cant-gcc-op.... While the question is about C++, it hints that the spec might require sqrt() to set errno if given a negative number. Clang special-cases this by adding a never-taken branch, but GCC does not (unless told it doesn't need to care).