Sorry, but what exactly do you base that on? They are very much designed to exploit the out-of-order, speculative, super-scalar nature of modern CPUs. vqsort is pure, well-done SIMD, so how do you explain that graph https://github.com/Voultapher/sort-research-rs/blob/main/wri... if your claims are true? Also, there are reasons why SIMD is not a good fit for every situation.
That's before the performance bugfix we talked about, right? (We re-initialized the RNG on every sort.)
Vqsort performance on M1 is indeed about half that of AVX-512. The NEON instruction set is missing some important operations for Quicksort, including Compress and popcount of vector masks.
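For anyone unfamiliar with those two operations, here is a rough scalar model of what they contribute to one partition step. This is only a sketch in Python, not vqsort's actual code: `compress` and `partition_step` are hypothetical names, and a real implementation does this on whole vectors with hardware instructions rather than list comprehensions.

```python
def compress(lanes, mask):
    """Scalar model of SIMD Compress: pack the lanes whose mask bit is set
    to the front, preserving their order."""
    return [x for x, keep in zip(lanes, mask) if keep]


def partition_step(lanes, pivot):
    """Partition one 'vector' of keys around a pivot.

    Returns the keys below the pivot, the remaining keys, and the number of
    keys below the pivot (the popcount of the comparison mask), which tells
    the caller how far to advance its left-side write position.
    """
    mask = [x < pivot for x in lanes]               # vector compare -> mask
    left = compress(lanes, mask)                    # keys < pivot
    right = compress(lanes, [not m for m in mask])  # keys >= pivot
    n_left = sum(mask)                              # popcount of the mask
    return left, right, n_left
```

Without a native Compress, the packing has to be emulated (for example with table-driven shuffles), and without a mask popcount the count needs extra instructions, which is where the additional cost on NEON comes from.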
Yes, like everything before the addendum, the results are for the same tested vqsort version without the bug fix, as I motivate in the section "Author's conclusion and opinion" at the end. I suspect, though, that even the new version will not outperform crumsort or ipnsort on this machine.
The obvious thing is to use SIMD.