You claim yourself that "I have not delved very much into depth", yet you make a lot of assumptions here. Here are the key facts: the `_by` functions that contain the complex exception-handling logic are only used for tests, and the benchmarked ones are as native as possible, e.g. `std::sort(data, data + len)`. I pick either gcc or clang based on whichever gives better performance, and use these settings with the cc crate https://github.com/Voultapher/sort-research-rs/blob/b131ed61... plus sort-specific options. Everything gets statically linked into one big binary. Also note that the C++ code is header-only, so there is one big translation unit per sort.

The only disadvantage, based on my reading of the generated asm, is one additional function call versus possible inlining for the initial part of the sort. That adds a total of ~20 cycles or less to the total time, while the benchmark itself is measured in microseconds, where 1 microsecond is ~4900 cycles.
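To make the setup concrete, here is a minimal sketch of the kind of extern "C" shim being described: the benchmarked entry point calls `std::sort` directly, with no `_by`-style exception-handling wrapper in the path. The function name and element type are illustrative assumptions, not the actual symbols used in sort-research-rs.

```cpp
// Hypothetical benchmark shim (names are illustrative, not the repo's actual code).
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Compiled as its own header-only translation unit with the cc crate and
// statically linked into the benchmark binary. The only boundary cost versus
// a pure C++ caller is this one non-inlined call into the shim.
extern "C" void cpp_std_sort_u64(uint64_t* data, size_t len) {
    std::sort(data, data + len);  // benchmarked path: plain std::sort, nothing else
}
```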
And yet you ignore the majority of my questions above. It was honest feedback, but now I will bite and say that the report looks good on paper while the underlying code is low-quality, very non-idiomatic, and without explanations; it is very difficult to understand what is being tested and how, what the testing/benchmarking methodology is, etc. Many important pieces are missing, and all of that raises the question of what the genuine intention of this experiment was: was it actually to find a bottleneck in the implementations, if any, or was it something else? Now that you're avoiding addressing the feedback, the answer is starting to become obvious.