I’m probably an optimization expert, and I would solve that problem completely differently.
On my computer, the initial C version runs at 389 MB / second. I haven’t tested the assembly versions, but if they deliver the same 6.2x speedup, would result in 2.4 GB/second here.
Interesting. I think you can vectorize the prologue using movemask + popcnt instead of keeping a counter in the ymm registers (warning: untested code, still need to benchmark it):
const __m256i zero = _mm256_setzero_si256();
const __m256i s = _mm256_set1_epi8( 's' );
const __m256i p = _mm256_set1_epi8( 'p' );
const size_t a = (size_t)input;
const size_t rem = a % 32;
const char* aligned = input - rem;
const __m256i v = _mm256_load_si256(( const __m256i*) input);
const __m256i z = _mm256_cmpeq_epi8( v, zero );
size_t m_plus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, s));
size_t m_minus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, p));
size_t m_zero = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, z));
size_t offset_zero = _mm_tzcnt_64(m_zero >> rem);
m_plus = _bzhi_u64(m_plus >> rem, offset_zero);
m_minus = _bzhi_u64(m_minus >> rem, offset_zero);
// Skip loop we already found the end of the string...
while (m_zero == 0) {
// ...
}
// ...
return m_plus + res - m_minus;
Still, based on the documentation you have linked, I’m not sure it could possibly generate some code similar to my version. I could be wrong but I don’t see APIs which aggregate or accumulate the `simd_mask` vectors they output for results of vector comparisons.
I don’t recommend assembly. Intrinsics are typically good enough performance wise, and writing correct assembly is hard. For instance, Chromium has non-trivial dependencies with code written in assembly, and they caused tons of fun debugging issues like that https://bugs.chromium.org/p/chromium/issues/detail?id=121838... Using intrinsics would have solved that because modern compilers follow ABI conventions of the target platforms very carefully.
About that highway, I don’t have any experience but based on the documentation I don’t like it too much. They say that’s a thin wrapper over intrinsics, but I believe it still breaks things. Specifically, Intel and ARM don’t document highway, but they have decent documentation on their intrinsics. Similarly, stackoverflow has thousands of questions and answers with tags like SSE and AVX, most of them are related to intrinsics, but nothing related to highway.
On my computer, the initial C version runs at 389 MB / second. I haven’t tested the assembly versions, but if they deliver the same 6.2x speedup, would result in 2.4 GB/second here.
Here’s C++ version which for long buffers exceeds 24 GB/second on my computer: https://gist.github.com/Const-me/3ade77faad47f0fbb0538965ae7... That’s 61x speedup compared to the original version, without any assembly, based on AVX2 intrinsics.