Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I’m probably an optimization expert, and I would solve that problem completely differently.

On my computer, the initial C version runs at 389 MB / second. I haven’t tested the assembly versions, but if they deliver the same 6.2x speedup, would result in 2.4 GB/second here.

Here’s C++ version which for long buffers exceeds 24 GB/second on my computer: https://gist.github.com/Const-me/3ade77faad47f0fbb0538965ae7... That’s 61x speedup compared to the original version, without any assembly, based on AVX2 intrinsics.



Interesting. I think you can vectorize the prologue using movemask + popcnt instead of keeping a counter in the ymm registers (warning: untested code, still need to benchmark it):

    const __m256i zero = _mm256_setzero_si256();
    const __m256i s = _mm256_set1_epi8( 's' );
    const __m256i p = _mm256_set1_epi8( 'p' );

    const size_t a = (size_t)input;
    const size_t rem = a % 32;
    const char* aligned = input - rem;

    const __m256i v = _mm256_load_si256(( const __m256i*) input);
    const __m256i z = _mm256_cmpeq_epi8( v, zero );

    size_t m_plus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, s));
    size_t m_minus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, p));
    size_t m_zero = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, z));
    size_t offset_zero = _mm_tzcnt_64(m_zero >> rem);

    m_plus = _bzhi_u64(m_plus >> rem, offset_zero);
    m_minus = _bzhi_u64(m_minus >> rem, offset_zero);

    // Skip loop we already found the end of the string...
    while (m_zero == 0) {
        // ...
    }
    
    // ...
    
    return m_plus + res - m_minus;


Good idea, and it can be used for both prologue and epilogue pieces. Updated that gist.

However, this is only relevant for very small inputs. For longer inputs the vectorized portion of the function gonna dominate the performance.


Do you know if this is possible using "std::experimental::simd" out of curiosity?

https://en.cppreference.com/w/cpp/experimental/simd


I don’t have any experience with that library.

Still, based on the documentation you have linked, I’m not sure it could possibly generate some code similar to my version. I could be wrong but I don’t see APIs which aggregate or accumulate the `simd_mask` vectors they output for results of vector comparisons.


you might want to rewrite this in a form that is compatible with @414owen's repo.


What’s a good source to learn and practice AVX?


For a starting point, I wrote that article couple years ago: http://const.me/articles/simd/simd.pdf

I don’t recommend assembly. Intrinsics are typically good enough performance wise, and writing correct assembly is hard. For instance, Chromium has non-trivial dependencies with code written in assembly, and they caused tons of fun debugging issues like that https://bugs.chromium.org/p/chromium/issues/detail?id=121838... Using intrinsics would have solved that because modern compilers follow ABI conventions of the target platforms very carefully.

About that highway, I don’t have any experience but based on the documentation I don’t like it too much. They say that’s a thin wrapper over intrinsics, but I believe it still breaks things. Specifically, Intel and ARM don’t document highway, but they have decent documentation on their intrinsics. Similarly, stackoverflow has thousands of questions and answers with tags like SSE and AVX, most of them are related to intrinsics, but nothing related to highway.


You could study Google's Highway library [1].

[1] https://github.com/google/highway


FFmpeg has a lot of assembly language that can be added to:

https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: