I’m probably an optimization expert, and I would solve that problem completely d...

xoranth · on July 6, 2023

Interesting. I think you can vectorize the prologue using movemask + popcnt instead of keeping a counter in the ymm registers (warning: untested code, still need to benchmark it):

    const __m256i zero = _mm256_setzero_si256();
    const __m256i s = _mm256_set1_epi8( 's' );
    const __m256i p = _mm256_set1_epi8( 'p' );

    const size_t a = (size_t)input;
    const size_t rem = a % 32;
    const char* aligned = input - rem;

    const __m256i v = _mm256_load_si256(( const __m256i*) input);
    const __m256i z = _mm256_cmpeq_epi8( v, zero );

    size_t m_plus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, s));
    size_t m_minus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, p));
    size_t m_zero = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, z));
    size_t offset_zero = _mm_tzcnt_64(m_zero >> rem);

    m_plus = _bzhi_u64(m_plus >> rem, offset_zero);
    m_minus = _bzhi_u64(m_minus >> rem, offset_zero);

    // Skip loop we already found the end of the string...
    while (m_zero == 0) {
        // ...
    }
    
    // ...
    
    return m_plus + res - m_minus;

Const-me · on July 7, 2023

Good idea, and it can be used for both prologue and epilogue pieces. Updated that gist.

However, this is only relevant for very small inputs. For longer inputs the vectorized portion of the function gonna dominate the performance.

gavinray · on July 6, 2023

Do you know if this is possible using "std::experimental::simd" out of curiosity?

https://en.cppreference.com/w/cpp/experimental/simd

Const-me · on July 6, 2023

I don’t have any experience with that library.

Still, based on the documentation you have linked, I’m not sure it could possibly generate some code similar to my version. I could be wrong but I don’t see APIs which aggregate or accumulate the `simd_mask` vectors they output for results of vector comparisons.

utopcell · on July 7, 2023

you might want to rewrite this in a form that is compatible with @414owen's repo.

bluedevilzn · on July 7, 2023

What’s a good source to learn and practice AVX?

Const-me · on July 7, 2023

For a starting point, I wrote that article couple years ago: http://const.me/articles/simd/simd.pdf

I don’t recommend assembly. Intrinsics are typically good enough performance wise, and writing correct assembly is hard. For instance, Chromium has non-trivial dependencies with code written in assembly, and they caused tons of fun debugging issues like that https://bugs.chromium.org/p/chromium/issues/detail?id=121838... Using intrinsics would have solved that because modern compilers follow ABI conventions of the target platforms very carefully.

About that highway, I don’t have any experience but based on the documentation I don’t like it too much. They say that’s a thin wrapper over intrinsics, but I believe it still breaks things. Specifically, Intel and ARM don’t document highway, but they have decent documentation on their intrinsics. Similarly, stackoverflow has thousands of questions and answers with tags like SSE and AVX, most of them are related to intrinsics, but nothing related to highway.

utopcell · on July 7, 2023

You could study Google's Highway library [1].

[1] https://github.com/google/highway

kierank · on July 7, 2023

FFmpeg has a lot of assembly language that can be added to:

https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...