I just want to say: wow! You are showing me something I bet many others are not aware of: degraded SSE performance in these types of processors! Your DIVSD vs DIVPD comment makes a lot of sense too. Man, I feel like this HN thread has been a gold mine of this kind of information.
What is the speed with restriction to SSE3? That would finalize the tests.
Do you mind if I directly quote you in a follow up post? This is really good stuff.
It would be interesting to analyze the difference between GCC and Clang here. Just glancing without comprehension, it looks like one big difference is that Clang might be calling out to a different square root routine rather than using the assembly builtin. Hmm, although it makes me wonder if maybe that library routine is using some more advanced instruction set?
> Do you mind if I directly quote you in a follow up post?
Sure, but realize that I'm speculating here. Essentially, I'm an expert (or at least was a couple of years ago) on integer vector operations on modern Intel server processors. But I'm not familiar with their consumer models, and I'm much less fluent in floating point. The result is that I know where to look for answers, but don't actually know them offhand. So quote the primary sources instead when you can.
Please do write a follow-up, and send me an email when you post it. My address is in my HN profile (click on my username). Also, I might be able to provide you with remote access to testing machines if it would help you test.
I haven't thought about it much, but there's more info here: https://stackoverflow.com/questions/37117809/why-cant-gcc-op.... While the question is about C++, it hints that the spec might require sqrt() to set errno if given a negative number. Clang special-cases this by adding a never-taken branch, but GCC does not (unless told it doesn't need to care).