Despite all of these issues you mention, std::simd is perfectly usable in the state it is in today in nightly Rust.
I've written thousands and thousands of lines of Rust SIMD code over the last ~4 years, and in my opinion it's a pretty nice way of writing portable SIMD code.
I don't know about the specific issues blocking stabilization, but the API has been relatively stable, although there were some breaking changes a few years ago.
Maybe you can't extract 100% of your CPU's capabilities using it, but I don't find that to be a problem, because there's a zero-cost fallback to CPU-specific intrinsics when necessary.
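For example, here's a minimal sketch of that fallback pattern (my own illustration, not from any particular codebase): std::simd types convert to and from the vendor types in std::arch at zero cost, so you can drop to an intrinsic like _mm256_rcp_ps for an operation the portable API doesn't expose:

    #![feature(portable_simd)]
    use std::simd::f32x8;

    // Fast approximate reciprocal. std::simd has no rcp operation, so on
    // AVX we drop down to the vendor intrinsic; everywhere else we fall
    // back to an exact portable division.
    #[cfg(all(target_arch = "x86_64", target_feature = "avx"))]
    fn recip_approx(v: f32x8) -> f32x8 {
        use std::arch::x86_64::{__m256, _mm256_rcp_ps};
        let raw: __m256 = v.into(); // zero-cost conversion
        // SAFETY: the cfg above guarantees AVX is enabled at compile time.
        // (On recent toolchains this call is safe and the block redundant.)
        unsafe { _mm256_rcp_ps(raw) }.into()
    }

    #[cfg(not(all(target_arch = "x86_64", target_feature = "avx")))]
    fn recip_approx(v: f32x8) -> f32x8 {
        f32x8::splat(1.0) / v
    }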
I recently wrote some computer graphics code and got really nice performance (~20x my scalar code; ~5x from just a naive translation), and the same codebase compiles to AVX2, SSE2, and ARM NEON. It uses f32x8s (256-bit vector width), which aren't natively available on SSE or NEON, but the compiler can split those vectors; the f32x8 version was faster than f32x4 even on 128-bit hardware. I would otherwise have needed to painstakingly port this codebase to each CPU, so it was at least a 3x reduction in lines of code (and more in programmer time).
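To illustrate the portability (a toy kernel, not the actual graphics code): the same f32x8 source compiles to one 256-bit op per iteration on AVX2, while on SSE2 and NEON the compiler splits each vector into two 128-bit halves:

    #![feature(portable_simd)]
    use std::simd::f32x8;

    // Scale-and-offset over a slice, 8 lanes at a time.
    fn scale_offset(data: &mut [f32], scale: f32, offset: f32) {
        let s = f32x8::splat(scale);
        let o = f32x8::splat(offset);
        let mut chunks = data.chunks_exact_mut(8);
        for chunk in &mut chunks {
            let v = f32x8::from_slice(chunk);
            (v * s + o).copy_to_slice(chunk);
        }
        // Scalar tail for lengths that aren't a multiple of 8.
        for x in chunks.into_remainder() {
            *x = *x * scale + offset;
        }
    }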
> Despite all of these issues you mention, std::simd is perfectly usable in the state it is in today in nightly Rust.
Absolutely! It seems to work for many people. The point of my comment was not that there's anything necessarily wrong with it. It's a huge API and it's hard and slow to stabilize huge things when you only get one shot.
> A f32x16 version would also be faster on 256b hardware, but spill in SSE. For Zen5 you probably want to use f32x32.
Yeah, exceeding native vector width is kinda just adding another round of loop unrolling. Sometimes it helps, sometimes it doesn't. This is probably mostly about register pressure.
And architecture-specific benchmarking is required if you want to get the most performance out of it.
> I'd prefer if std::simd would encourage relative to native SIMD width scaling (and support scalable SIMD ISAs).
It is possible to write width-generic SIMD code (i.e. have the vector width as a generic parameter) in Rust std::simd (or C++ templates and vector extensions) and make it relative to the native vector width, although you need to define that width explicitly.
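A sketch of what I mean (the names are mine, just for illustration): the lane count is a const generic, and the "native" width is something you pick per target yourself:

    #![feature(portable_simd)]
    use std::simd::{prelude::*, LaneCount, SupportedLaneCount};

    // Width-generic sum: the same body instantiates as f32x4, f32x8, ...
    fn sum<const N: usize>(data: &[f32]) -> f32
    where
        LaneCount<N>: SupportedLaneCount,
    {
        let mut acc = Simd::<f32, N>::splat(0.0);
        let mut chunks = data.chunks_exact(N);
        for chunk in &mut chunks {
            acc += Simd::from_slice(chunk);
        }
        acc.reduce_sum() + chunks.remainder().iter().sum::<f32>()
    }

    // The "relative to native width" part is what you define explicitly:
    #[cfg(target_feature = "avx2")]
    const NATIVE_LANES: usize = 8;
    #[cfg(not(target_feature = "avx2"))]
    const NATIVE_LANES: usize = 4;

    fn sum_native(data: &[f32]) -> f32 {
        sum::<NATIVE_LANES>(data)
    }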
In my problem domain (computer graphics, etc.) the vector width is often mandated by the task at hand (e.g. 2D vs 3D); it's often not about doing something to an array of size N. This doesn't lead to optimal HW utilization, but it's convenient and still a lot faster than scalar code.
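E.g. a 3D point fits naturally in a single f32x4 with one padding lane (again just a sketch, not my actual code):

    #![feature(portable_simd)]
    use std::simd::{f32x4, num::SimdFloat};

    // A 3D point in a single f32x4; lane 3 is padding. Not optimal lane
    // utilization, but convenient and much faster than scalar code.
    #[derive(Clone, Copy)]
    struct Vec3(f32x4);

    impl Vec3 {
        fn new(x: f32, y: f32, z: f32) -> Self {
            Vec3(f32x4::from_array([x, y, z, 0.0]))
        }
        fn dot(self, other: Vec3) -> f32 {
            (self.0 * other.0).reduce_sum()
        }
    }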
Scalable SIMD ISAs are kind of a new thing, so I'm not sure how well current std::simd or C vector extensions (or LLVM IR SIMD ops) map to the HW. Maybe they would be better served by another kind of API? I don't really know; I haven't had the privilege of writing any scalable vector code yet.
What I'm trying to say is that, IMO, std::simd works well enough and should probably be stabilized (almost) as is, barring any showstopper issues. It's already useful and has been for many years.