To back up the other commenter - it's not the same. https://godbolt.org/z/r6e443...

cogman10 · 2026-02-11T13:59:11 1770818351

What's strange is I'm finding that gcc really struggles to correctly optimize this.

This was my function

    for (auto v : array) {
        if (v != 0)
            return false;
    }
    return true;

clang emits basically the same thing yours does. But gcc ends up just really struggling to vectorize for large numbers of array.

Here's gcc for 42 elements:

https://godbolt.org/z/sjz7xd8Gs

and here's clang for 42 elements:

https://godbolt.org/z/frvbhrnEK

Very bizarre. Clang pretty readily sees that it can use SIMD instructions and really optimizes this while GCC really struggles to want to use it. I've even seen strange output where GCC will emit SIMD instructions for the first loop and then falls back on regular x86 compares for the rest.

Edit: Actually, it looks like for large enough array sizes, it flips. At 256 elements, gcc ends up emitting simd instructions while clang does pure x86. So strange.

maccard · 2026-02-12T18:43:50 1770921830

Writing a micro benchmark is an academic exercise. You end up benchmarking in isolation which only tells you is your function faster in that exact scenario. Something which is faster in isolation in a microbenchmark can be slower when put in a real workload because vextoising is likely to have way more of an impact than anything else. Similarly, if you parallelise it, you introduce a whole new category of ways to compare.

cogman10 · 2026-02-12T19:09:59 1770923399

This isn't a microbenchmark. In fact, I haven't even bothered to benchmark it (perhaps the non-simd version actually is faster?)

This is purely me looking at the emitted assembly and being surprised at when the compilers decide to deploy it and not deploy it. It may be the case that the SIMD instructions are in fact slower even though they should theoretically end up faster.

Both compilers are simply using heuristics to determine when it's fruitful to deploy SIMD instructions.

secondcoming · 2026-02-11T15:54:52 1770825292

I;ve had to coerce gcc to emitting SIMD code by using int instead of bool. Also, the early return may be putting it off.

abbeyj · 2026-02-11T19:11:00 1770837060

Doing both of those things does seem to help: https://godbolt.org/z/1vv7cK4bE

GCC trunk seems to like using `bool` so we may eventually be able to retire the hack of using `int`.

btdmaster · 2026-02-11T16:11:56 1770826316

I see yeah that makes sense. I wanted to highlight that "magic" will, on average, give the optimizer a harder time. Explicit offset loops like that are generally avoided in many C++ styles in favor of iterators.

delta_p_delta_x · 2026-02-11T17:17:20 1770830240

Even at a higher level of abstraction, the compiler seems to pull through: https://godbolt.org/z/1nvE34YTe

btdmaster · 2026-02-11T19:23:20 1770837800

It emits a cmp/jmp still when arithmetic would be fine though which is the difference highlighted in the article and examples in this thread. It's nice that it simplifies down to assembly, but the assembly is somewhat questionable (especially that xor eax eax branch target on the other side).