
Why isn’t std::simd stable yet? Why do so many great features seem stuck in the same nightly-forever limbo land, like generators?

I’m sure more people than ever are working on the compiler. What’s going on?



There really aren't that many people working on the compiler. It's mostly volunteers.

The structure is unlike a traditional company. In a traditional company, managers decide the priorities and direct the employees what to work on while facilitating that work. While there are people in more managerial-type positions working on the Rust compiler, their job is not to tell the volunteers what to work on (they cannot), but to help the volunteers accomplish whatever it is they want to do.

I don't know about std::simd specifically, but for many features, it's simply a case of "none of the very small number of people working on the Rust compiler have prioritized it".

I do wish there were a bounty system, where people could say "I really want std::simd, so I'll pay $5,000 to the Rust Foundation if it gets stabilized". If enough people did that, I'm sure they could find a way to make it happen. But I think realistically, very few people would be willing to put up even a cent for the features they want. I hear a lot of people wishing for better const generics, but only 27 people have set up a donation to boxy (lead of the const generics group, https://github.com/sponsors/BoxyUwU).


> There really aren't that many people working on the compiler. It's mostly volunteers.

Seems smart to put the language as a requirement for compiling the Linux kernel and a bunch of other core projects then!


I think it seems just right. Languages these days are either controlled by volunteers or megacorps. Because Linux is about freedom and is not aligned with megacorps, I think they'd prefer a volunteer-driven language like Rust or C++ rather than the corporate ones.


I'm not sure you can claim that Linux is about freedom. Linux is run by a bunch of corps and megacorps who are otherwise competing, not by volunteers.


Yeah, I think if you want ideals and freedom you have to look elsewhere. It's not controlled by a single company, but it is controlled by a single-digit number of them.


I’m not sure you can argue that Rust and C++ have anything like a similar story around being volunteer-oriented, given the number of places that have C++ compiler groups contributing papers / implementations.


C++ has been an industrial language from the early days, since it got adopted among C compiler vendors back in the 1980s.


I mean, Linux development works exactly the same way.


Linux has a BDFL and a quasi-corporate structure that keeps all the incentives aligned. Rust has neither of those.


I think it's quite rare for Linux developers not to be doing it on behalf of some company.

Weren't a bunch of modules deprecated recently as a consequence of Intel layoffs?


> I think it's quite rare for Linux developers not to be doing it on behalf of some company.

Corporate-sponsored contributions are probably the majority, but I don't think true volunteers are super-rare. In both cases, though, they're a "volunteer" from the perspective of the Linux leadership: they're contributing the changes that they want to make, and they're not employees of Linux who can be directed to work on the things that leadership thinks are important.

(And conversely it's the same with Rust - a lot of those volunteer contributors work for employers who want Rust to have some specific functionality, so they employ someone to work on that)


In the Python forum you will get banned if you dare to ask that contributors disclose whether they are contributing on behalf of some company, and which company it is.


> Why isn’t std::simd stable yet?

Leaving aside any specific blockers:

- It's a massive, hard problem to build a portable abstraction layer over the SIMD capabilities of various CPUs.

- It's a delicate balance between performance and usability, and people care deeply about both.

- It's subject to Rust's stability guarantee for the standard library: once we ship it, we can't fix any API issues.

- There are already portable SIMD libraries in the ecosystem, which aren't subject to that stability guarantee as they can ship new semver-major versions. (One of these days, I hope we have ways to do that for the standard library.)

- Many people already use non-portable SIMD for the 1-3 targets they care about, instead.


> Many people already use non-portable SIMD for the 1-3 targets they care about, instead.

This is something a lot of people (myself included) have gotten tripped up by. Non-portable SIMD intrinsics have been stable under std::arch for a long time. Obviously they aren't nearly as nice to use, but if you're in a place where you need explicit SIMD speed-ups, that probably isn't a killer.
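
For illustration, here's a minimal sketch of what the stable, non-portable path looks like (x86-64, where SSE is part of the baseline so no runtime detection is needed; the function name is mine):

    #[cfg(target_arch = "x86_64")]
    fn add_four_f32(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
        use std::arch::x86_64::*;
        // SSE is baseline on x86_64, so these intrinsics are always available.
        unsafe {
            let sum = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
            let mut out = [0.0f32; 4];
            _mm_storeu_ps(out.as_mut_ptr(), sum);
            out
        }
    }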


Exactly. Many parts of SIMD are entirely stable, for x86, ARM, WebAssembly...

The thing that isn't stable in the standard library is the portable abstraction layer atop those. But several of those exist in the community.


Despite all of these issues you mention, std::simd is perfectly usable in the state it is in today in nightly Rust.

I've written thousands and thousands of lines of Rust SIMD code over the last ~4 years and it's, in my opinion, a pretty nice way of doing SIMD code that is portable.

I don't know about the specific issues in stabilization, but the API has been relatively stable, although there were some breaking changes a few years ago.

Maybe you can't extract 100% of your CPU's capabilities using it, but I don't find that a problem because there's a zero-cost fallback to CPU-specific intrinsics when necessary.

I recently wrote some computer graphics code and I could get really nice performance (~20x my scalar code, 5x from just a naive translation). And the same codebase can be compiled to AVX2, SSE2 and ARM NEON. It uses f32x8's (256-bit vector width), which aren't natively available on SSE or NEON, but the compiler can split those vectors. The f32x8 version was faster than the f32x4 version even on 128-bit hardware. I would have needed to painstakingly port this codebase to each CPU, so it was at least a 3x reduction in lines of code (and more in programmer time).
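
For a flavor of what that looks like (a minimal sketch on nightly with the portable_simd feature; the names are mine, not from my actual codebase):

    #![feature(portable_simd)]
    use std::simd::f32x8;

    // Scale-and-offset a slice in 8-wide chunks. The compiler lowers f32x8
    // to AVX on x86-64, or splits it across two 128-bit registers on SSE2/NEON.
    fn scale_offset(data: &mut [f32], scale: f32, offset: f32) {
        let split = data.len() / 8 * 8;
        let (body, tail) = data.split_at_mut(split);
        for chunk in body.chunks_exact_mut(8) {
            let v = f32x8::from_slice(chunk) * f32x8::splat(scale) + f32x8::splat(offset);
            chunk.copy_from_slice(&v.to_array());
        }
        for x in tail {
            *x = *x * scale + offset; // scalar remainder
        }
    }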


> Despite all of these issues you mention, std::simd is perfectly usable in the state it is in today in nightly Rust.

Absolutely! It seems to work for many people. The point of my comment was not that there's anything necessarily wrong with it. It's a huge API and it's hard and slow to stabilize huge things when you only get one shot.


An f32x16 version would also be faster on 256-bit hardware, but would spill on SSE. For Zen 5 you probably want to use f32x32.

I'd prefer it if std::simd encouraged scaling relative to the native SIMD width (and supported scalable SIMD ISAs).


> An f32x16 version would also be faster on 256-bit hardware, but would spill on SSE. For Zen 5 you probably want to use f32x32.

Yeah, exceeding native vector width is kinda just adding another round of loop unrolling. Sometimes it helps, sometimes it doesn't. This is probably mostly about register pressure.

And architecture-specific benchmarking is required if you want to get the most performance out of it.

> I'd prefer it if std::simd encouraged scaling relative to the native SIMD width (and supported scalable SIMD ISAs).

It is possible to write width-generic SIMD code (i.e. have the vector width as a generic parameter) in Rust std::simd (or C++ templates and vector extensions) and make it relative to the native vector width (albeit you need to define that explicitly).
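
A sketch of the shape (nightly; the lane count N is a const generic that the caller picks per target):

    #![feature(portable_simd)]
    use std::simd::{LaneCount, Simd, SupportedLaneCount};

    // Width-generic multiply-accumulate: instantiate with N = 4, 8, 16...
    // depending on the target's native vector width.
    fn mul_add<const N: usize>(acc: &mut [f32], a: &[f32], b: &[f32])
    where
        LaneCount<N>: SupportedLaneCount,
    {
        for ((acc, a), b) in acc
            .chunks_exact_mut(N)
            .zip(a.chunks_exact(N))
            .zip(b.chunks_exact(N))
        {
            let r = Simd::<f32, N>::from_slice(a) * Simd::from_slice(b) + Simd::from_slice(acc);
            acc.copy_from_slice(&r.to_array());
        }
        // Remainder handling omitted for brevity.
    }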

In my problem domain (computer graphics etc.) the vector width is often mandated by the task at hand (e.g. 2D vs 3D). It's often not about doing something on an array of size N. This does not lead to optimal HW utilization, but it's convenient and still a lot faster than scalar code.

Scalable SIMD ISAs are kind of a new thing, so I'm not sure how well current std::simd or C vector extensions (or LLVM IR SIMD ops) map to the HW. Maybe they would be better served by another kind of API? I don't really know; I haven't had the privilege of writing any scalable vector code yet.

What I'm trying to say is IMO std::simd works well enough and should probably be stabilized (almost) as is, barring any show stopper issues. It's already useful and has been for many years.


> we can't fix any API issues.

Can’t APIs be fixed between editions?


Partially (with upcoming support for renaming things across editions), but it's a pain if the types change (because then they're no longer common vocabulary), and all the old APIs still have to exist.


There is a GitHub issue that details what's blocking stabilization for each feature. I've read a few recently and noticed some patterns:

1. A high bar for quality in std

2. Dependencies on other unstable features

3. Known bugs

4. Conflicts with other unstable features

It seems anything that affects trait solving is very complicated and is more likely to have bugs or combine non-trivially with other trait-solving features.

I think there is also some sampling bias. Tons of features get stabilized, but you are much more likely to notice a nightly feature that is unstable for a long time and complex enough to be excited about.


> It seems anything that affects trait solving is very complicated and is more likely to have bugs or combine non-trivially with other trait-solving features.

Yep, and this is why many features die or linger on forever. Getting the trait solving working correctly across types and soundly across lifetimes is complicated enough to have killed several features previously (like specialization/min_specialization). It was the reason async trait took so long and why GATs were so important.


> Dependencies on other unstable features

AFAIK that’s not a blocker for Rust - the std library is allowed to use unstable features at all times.


I think they meant dependencies on unstable features which might yet change their semantics. A stable API relying on an unstable implementation is common in Rust (the ? operator, for example), but that is entirely dependent on having a good idea of what the eventual stable version is going to look like, in such a way that the already-stable feature won't break in any way.
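
A concrete instance: the `?` operator has been stable since 1.13, while the trait machinery it desugars to is still unstable and can churn underneath it. Stable code only ever touches the operator:

    use std::num::ParseIntError;

    // `?` is stable; its desugaring to the unstable `Try`/`FromResidual`
    // traits is an implementation detail that stable callers never name.
    fn parse_and_double(s: &str) -> Result<i32, ParseIntError> {
        let n: i32 = s.parse()?; // early-returns the Err variant on failure
        Ok(n * 2)
    }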


Usually when I go and read the GitHub and Zulip threads, the reason for paused work comes down to the fact that no one has come up with a design that maintains every existing promise the compiler has made. The most common ones I see: the feature conflicts with safety or semver/encapsulation, interacts weirdly with object safety, causes post-monomorphization errors, or breaks perfect type-class coherence (see Haskell's unsound specialization).

Too many promises have been made.

Rust needs more unsafe opt-outs. Ironically, SIMD has this, so it does not bother me.


std::arch::* intrinsics for SIMD are stable and you can use them today. The situation is only slightly worse than C/C++ because the Rust compiler cares a lot about undefined behavior, so there's some safe-but-technically-unsafe/annoying cfg stuff to make sure the intrinsics are actually emitted as you intend.
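
That "annoying cfg stuff" typically looks like this (a sketch: compile one version with AVX2 enabled and select it at runtime; the names are illustrative):

    #[cfg(target_arch = "x86_64")]
    fn sum(data: &[f32]) -> f32 {
        #[target_feature(enable = "avx2")]
        unsafe fn sum_avx2(data: &[f32]) -> f32 {
            // The enabled feature lets intrinsics (and autovectorization)
            // actually emit AVX2 instructions in this function.
            data.iter().sum()
        }
        if is_x86_feature_detected!("avx2") {
            unsafe { sum_avx2(data) } // safe: we just checked AVX2 is present
        } else {
            data.iter().sum()
        }
    }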

There is nothing blocking high quality SIMD libraries on stable in Rust today. The bar for inclusion in std is just much higher than the rest of the ecosystem.


Given the “blazingly fast” branding, I too would have thought this would be in stable Rust by now.

However, like other commenters I assume it’s because it’s hard, not all that many users of Rust really need it, and the compiler team is small and only consists of volunteers.


Getting maximum performance out of SIMD requires rolling your own code with intrinsics. It is something a compiler can't do for you at a pretty fundamental level.

Most interesting performance optimizations from vector ISAs can't be done by the compiler.


Interesting, how so? I’ve had really good success with the autovectorization in gcc and the Intel C compiler. Often it’s faster than my own intrinsics, though not always. One notable example, though, is that it seems to struggle with reduction: when I’m updating large arrays, i.e. `A[i] += a`, the compiler struggles to use SIMD for this and I need to do it myself.
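
For what it's worth, the usual manual fix for sum-style reductions is to keep independent vector accumulators and reduce once at the end (a sketch on nightly std::simd; the same shape works with intrinsics on stable):

    #![feature(portable_simd)]
    use std::simd::{f32x8, num::SimdFloat};

    fn sum(data: &[f32]) -> f32 {
        let (body, tail) = data.split_at(data.len() / 8 * 8);
        let mut acc = f32x8::splat(0.0); // 8 independent partial sums
        for chunk in body.chunks_exact(8) {
            acc += f32x8::from_slice(chunk);
        }
        // This reassociates float additions, which is exactly why the
        // compiler won't do it for you without explicit permission.
        acc.reduce_sum() + tail.iter().sum::<f32>()
    }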


There's no optimal portable `movemask` operation, because AArch64 NEON doesn't have one.
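
For reference, the usual AArch64 workaround packs 4 bits per byte with a narrowing shift instead (a hedged sketch of the well-known shrn-by-4 idiom, using core::arch::aarch64 intrinsics):

    #[cfg(target_arch = "aarch64")]
    unsafe fn neon_movemask(v: std::arch::aarch64::uint8x16_t) -> u64 {
        use std::arch::aarch64::*;
        // Narrowing shift right by 4 packs each byte's comparison result
        // into a nibble: 64 bits out, instead of x86's 1 bit per byte.
        let packed = vshrn_n_u16::<4>(vreinterpretq_u16_u8(v));
        vget_lane_u64::<0>(vreinterpret_u64_u8(packed))
    }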


> Getting maximum performance out of SIMD requires rolling your own code with intrinsics

Not disagreeing with this statement in general, but with std::simd I can get 80% of the performance with 20% of the effort compared to intrinsics.

For the last 20%, there's a zero cost fallback to intrinsics when you need it.
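
That fallback is just a type conversion: std::simd vectors convert to and from the matching vendor types (a sketch, assuming nightly on x86-64; the caller must ensure AVX is available before running this):

    #![feature(portable_simd)]
    #[cfg(target_arch = "x86_64")]
    fn rsqrt_fast(v: std::simd::f32x8) -> std::simd::f32x8 {
        use std::arch::x86_64::*;
        // f32x8 <-> __m256 From impls make dropping to a single vendor
        // intrinsic free; everything stays in registers.
        unsafe { _mm256_rsqrt_ps(__m256::from(v)).into() }
    }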


To clarify, there are many things SIMD is used for that look nothing like the loop parallelism or numerics commonly discussed. For example, heterogeneous concurrency is likely going to be beyond compilers for the foreseeable future, and it is a great SIMD optimization.

A common example is executing the equivalent of a runtime SQL WHERE clause on arbitrary data structures of mixed types. Clever idioms allow surprisingly complex, unrelated constraint operators to be evaluated in parallel with SIMD. It would be cool if a compiler could take a large pile of fussy, branchy scalar code that evaluates ad hoc constraints on data structures and convert it to an equivalent SIMD constraint engine, but that doesn't seem likely anytime soon. So we roll them by hand.
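
As a toy version of the idea, evaluating something like `WHERE age > 30 AND score < 2.5` over an 8-row column batch with lane masks (a sketch on nightly std::simd; the schema is made up):

    #![feature(portable_simd)]
    use std::simd::{cmp::SimdPartialOrd, f32x8, i32x8};

    // Both comparisons run across all 8 rows at once; the result is a
    // bitmask of matching rows, with no per-row branches.
    fn where_clause(age: i32x8, score: f32x8) -> u64 {
        let m = age.simd_gt(i32x8::splat(30)) & score.simd_lt(f32x8::splat(2.5));
        m.to_bitmask()
    }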


> Given the “blazingly fast” branding, I too would have thought this would be in stable Rust by now.

It's the exact opposite. It's the portable SIMD abstraction that isn't stable yet. But the vendor specific SIMD intrinsics have been stable for quite some time already (x86-64 for many years for example). And indeed, those are necessary for some cases.

ripgrep wouldn't be as fast as it is if it weren't possible to use SIMD on stable Rust.


I do scientific computing, and even I rarely have a situation where CPU SIMD is a clear win. Usually it's either not worth the added complexity, or the problem is so embarrassingly parallel that you should use a GPU.


Interesting, in what domain? My work is in scientific computing as well (finite elements) and I usually find myself in the opposite situation: SIMD is very helpful but the added complexity of using a GPU is not worthwhile.


Don’t forget that autovectorization does a lot too. Explicit SIMD is only for when you want to ensure you get exactly what you want; many applications just kinda get it for free.
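
Right: a plain elementwise loop like this usually compiles straight to SIMD at opt-level 3, with no explicit vector code at all:

    // LLVM autovectorizes this without any hints; the slice iterators
    // even let it prove the bounds checks away.
    fn saxpy(y: &mut [f32], a: f32, x: &[f32]) {
        for (y, x) in y.iter_mut().zip(x) {
            *y += a * x;
        }
    }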


I would love generators too, but I think the more features they add, the more interactions with existing features they have to deal with, so it's not surprising that it's slowing down.


Generators in particular have been blocked on the AsyncIterator trait. There are also open questions around consuming those (`for await i in stream`, or just keep to `while let Some(i) = stream.next().await`? What about parallel iteration? What about pinning obligations? Do that as part of desugaring or make it explicit?). It is a shame because it is almost orthogonal, but any given decision might not be compatible with different approaches for generators. The good news is that some people are working on it again.
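
For context, the second form is what works today (a sketch using the futures crate's StreamExt; `for await` remains hypothetical syntax):

    use futures::stream::{self, StreamExt};

    async fn consume() {
        let mut stream = stream::iter(0..3);
        // The hand-written desugaring people use while `for await` is
        // still an open design question.
        while let Some(i) = stream.next().await {
            println!("{i}");
        }
    }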


Would love this. I've heard it's not planned to be in the near future. Maybe "perfect is the enemy of good enough"?


Rust doesn’t have a BDFL so there’s nobody with the power to push things through when they’re good enough.

And since Rust basically sells itself on high standards (zero-cost abstractions, etc.) the devs go back and forth until it feels like the solution is handed down from the heavens.


And somehow it has ended up feeling more pleasant and consistent than most languages with a BDFL, even though it was designed by committee. I don't really understand how that happened, but I appreciate the cautious and conservative approach they've taken


As someone who used std::simd in a submission to an academic conference CFP*, I have looked deeply into how std::simd works, and I would conclude that there are a couple of reasons it isn't stable yet (this is rather long and may need 10 minutes to read):

1. It is highly dependent on LLVM intrinsics, which themselves can change quite a lot. Sometimes an intrinsic would even fail to instantiate and crash the entire compilation. I, for example, met chronic ICE crashes for the same code on different nightly Rust versions. Then I realized it was because the SIMD operation was too complicated and I needed to simplify it, and sometimes to stop recursing and expanding too much, to prevent stack spilling and exhausting register allocation.

This happens from time to time, especially when using std::simd with embedded targets where registers are scarce.

2. Some hardware design decisions make SIMD itself unergonomic and hard to generalize, and this is reflected in the design of std::simd as well.

Recall that SIMD techniques stem from the vector processors in supercomputers from the likes of Cray and IBM. That is, from the 70s, when computation and hardware design were primitive and simple, so they had fixed vector sizes.

That ancient design is very stable and is still kept to this day, even in the likes of AVX2, AVX-512, VFP and NEON, so it influenced the design of things like lane count (https://doc.rust-lang.org/std/simd/struct.LaneCount.html).

But plot twist: as time went on, it turned out that modern SIMD is capable of variable vector sizes; RISC-V's vector extension is one such implementation, for example.

So now we come to a dilemma: keep the existing fixed lane-count design, or allow it to extend further. If we allow it to extend further, to cater for things like variable SIMD vector lengths, then we need to wait for generic_const_exprs to be stable, and right now it is not only unstable but incomplete too (https://github.com/rust-lang/portable-simd/issues/416).

This is a hard philosophical change to the design and is not easy to deal with. Time will tell.

3. As an extension to #2, thinking in SIMD is hard in the very first place, and to use it in production you even have to think about different situations. This comes in the form of dynamic dispatch, and it is a pain to deal with. Although we have great helpers such as multiversion, it is still very hard to design an interface that scales. Take Google's highway (https://github.com/google/highway/blob/master/g3doc/quick_re...) for example: it is the library for writing portable SIMD code with dynamic dispatch in C++, but in an esoteric and not-so-ergonomic way. How we could do better with std::simd is still an open question. How do you abstract the idea of a scatter-gather operation? What the heck is swizzle? Why do we call it shuffle and not permutation? Lots of stuff to learn, and that means lots of pain to go through.
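
For the curious, here are those two concepts in std::simd terms (a sketch on nightly): simd_swizzle! reorders lanes by compile-time indices, and gather loads lanes from arbitrary positions.

    #![feature(portable_simd)]
    use std::simd::{simd_swizzle, Simd};

    fn demo() {
        let v = Simd::from_array([10, 20, 30, 40]);
        // Swizzle/shuffle/permute all name the same thing: lane reordering.
        let rev = simd_swizzle!(v, [3, 2, 1, 0]);
        assert_eq!(rev.to_array(), [40, 30, 20, 10]);

        // Gather: load lanes from arbitrary, non-contiguous indices.
        let table = [0, 1, 4, 9, 16, 25, 36, 49];
        let idx = Simd::from_array([7usize, 0, 3, 3]);
        let g = Simd::gather_or_default(&table, idx);
        assert_eq!(g.to_array(), [49, 0, 9, 9]);
    }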

4. Plus, when you think in SIMD, there can be multiple instructions and multiple ways to do the same thing, one maybe more efficient than the others.

For example, as I have to touch some finite field stuff in GF(2^8), there are a few ways to do finite field multiplication:

a. Precomputed table lookup

b. Russian Peasant Multiplication (basically carryless Karatsuba multiplication, though often reduced to the form of table lookups as well; it can also be seen as a ripple counter with modulo arithmetic, except carry has to be delivered in a different way) - a scalar sketch appears after this list

c. Do an inner product and then do Barrett reduction (https://www.esat.kuleuven.be/cosic/publications/article-1115...)

d. Or just treat it as multiplication over a polynomial power series, but this essentially means we treat it as a finite field convolution, which I suspect is highly related to the Fourier transform. (https://arxiv.org/pdf/1102.4772)

e. Use the somewhat new GF2P8AFFINEQB (https://www.felixcloutier.com/x86/gf2p8affineqb) from GFNI which, contrary to the common belief that it is AVX-512-only, is actually available for SSE/AVX/AVX2 as well (this is called GFNI-SSE in gcc), so it works on my 13600KF too (except I obviously cannot use ZMM registers, or I just get an illegal instruction for anything that touches ZMM or uses the EVEX encoding). I have an internal implementation of finite field multiplication using just that, but I need to use the polynomial 0x11D rather than 0x11B, so GF2P8MULB (https://www.felixcloutier.com/x86/gf2p8mulb) is out of the question (it is supposed to be theoretically the fastest in the world, if we could use an arbitrary polynomial), but this is rather hard to understand and explain in the first place. (By the way, I used SIMDe for that: https://github.com/simd-everywhere/simde)
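
To make (b) concrete, here is the scalar version (a sketch; `poly` is the low byte of the reduction polynomial, 0x1B for 0x11B or 0x1D for 0x11D):

    // Russian Peasant multiplication in GF(2^8): shift-and-XOR, reducing
    // by the field polynomial whenever the high bit would overflow.
    fn gf256_mul(mut a: u8, mut b: u8, poly: u8) -> u8 {
        let mut acc = 0u8;
        while b != 0 {
            if b & 1 != 0 {
                acc ^= a; // "addition" in GF(2^8) is XOR
            }
            let overflow = a & 0x80 != 0;
            a <<= 1;
            if overflow {
                a ^= poly; // reduce modulo the field polynomial
            }
            b >>= 1;
        }
        acc
    }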

All of these can be done in SIMD, but each of these methods has its pros and cons. Table lookup may be fast and seemingly O(1), but you actually need to keep the table in cache, meaning we trade space for time, and SIMD would amplify the cache thrashing from multiple accesses. This can slow down the CPU pipeline, although modern CPUs are clever enough about cache management. If you want to do Russian Peasant Multiplication, you need a bunch of loops to go through the shifts and XORs chunk by chunk.

If you want Barrett reduction, then you need efficient carryless multiplication such as PCLMULQDQ (https://www.felixcloutier.com/x86/pclmulqdq) to do the inner product and reduce the polynomial. Or, more primitively, find a way to do finite field Horner's method in SIMD...

How to think in SIMD is already hard, as said in #3. How to balance in SIMD like this is even harder.

Unless you want a certain edge, or want to shatter the benchmark, I would say SIMD is not a good investment. You need to use SIMD in the right scenario at the right time. SIMD is useful, but also kind of niche, and modern CPUs are optimized well enough that the performance of general solutions without SIMD is good enough too, since everything eventually gets dumped down to uops anyway, with deep pipelines, branch predictors, superscalar and speculative execution doing their magic together. And most of the time, if you do want SIMD, the easiest SIMD method is generally enough.

*: I used std::simd intensively in my own project. The paper got rejected on the grounds that it was severely lacking in literature study, and that I shouldn't have relied so much on an LLM to generate it.

However, the code is here (https://github.com/stevefan1999-personal/sigmah). Now I have a new approach to this problem, derived from my current work with finite fields, error correction, divide and conquer, and polynomial multiplication, and I plan to resubmit the paper once I have time to clean it up, with a more careful approach next time too. Although, since the problem of string matching with don't-cares can be seen as convolution, I suspect my approach will end up looking like that anyway, making the paper still unworthy of acceptance.


> performance of general solutions without SIMD is good enough too, since everything eventually gets dumped down to uops anyway, with deep pipelines, branch predictors, superscalar and speculative execution doing their magic together

A quick comment on this one point (personal opinion): from a hyperscaler perspective, scalar code is most certainly not enough. The energy cost of scheduling a MUL instruction is something like 10x that of the actual operation it performs. It is important to amortize that cost over many elements (i.e. SIMD).


wow, thanks for this long explanation.




