> if general-purpose HW with horribly-slow misaligned loads/stores came out from...

dzaima · 2026-03-11T10:50:15 1773226215

> How is that different for RISC-V?

RISC-V hardware with slow misaligned mem ops does exist to a non-insignificant extent, and it seems not enough people have laughed at it; instead, compilers just surrendered and default to not emitting misaligned ops.

> As you observed there's a feedback loop between what compilers output and what gets optimised in hardware.

Well, that loop needs to start somewhere, and it has already started, and started wrong. I suppose we'll see what happens with real RVA23 hardware; at the very least, even if it takes a decade for most hardware to support misaligned well, software could retroactively change its defaults while still remaining technically-RVA23-compatible, so I suppose that's good.

brucehoult · 2026-03-11T23:32:06 1773271926

> RISC-V hardware with slow misaligned mem ops does exist to a non-insignificant extent

Only U74 and P550, old RV64GC CPUs.

SiFive's RVA23 cores have fast misaligned accesses, as do all THead and SpacemiT cores.

I can't imagine that all the Tenstorrent and Ventana and so forth people doing massively OoO 8-wide cores won't also have fast misaligned accesses.

As a previous poster said: if you're targeting RVA23 then just assume misaligned is fast and if someone one day makes one that isn't then sucks to be them.

dzaima · 2026-03-11T23:54:05 1773273245

P550 is, like, what, only a year old? I suppose there has been some laughing at it at least.

Also Kendryte K230 / C908, but only on vector mem ops, which adds a whole other mess onto this.

I'd hope all the massive OoO will have fast misaligned mem ops, anything else would immediately cause infinite pain for decades.

But of course there'll be plenty of RVA23 hardware that's much smaller eventually too, once it becomes a general expectation instead of "cool thing for the very-top-end to have".

I do agree that it'd be reasonable to just assume fast misaligned ops, but for whatever reason gcc and clang just don't, and that's what we have for defaults.
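For anyone wanting to see what their own toolchain assumes by default, here is a minimal sketch. It assumes the `__riscv_misaligned_fast` / `__riscv_misaligned_slow` / `__riscv_misaligned_avoid` preprocessor macros from the RISC-V C API conventions as I understand them; older compilers may define none of these, so treat the output as a hint. The function name is just illustrative.

```c
#include <stdio.h>

/* Reports which misaligned-access assumption this compilation targets.
   ASSUMPTION: the __riscv_misaligned_* macros follow the RISC-V C API
   conventions; a toolchain that predates them defines none of them. */
static const char *misaligned_assumption(void) {
#if !defined(__riscv)
    return "not a RISC-V target";
#elif defined(__riscv_misaligned_fast)
    return "fast";
#elif defined(__riscv_misaligned_slow)
    return "slow";
#elif defined(__riscv_misaligned_avoid)
    return "avoid";
#else
    return "unknown (macros not defined by this toolchain)";
#endif
}

/* Convenience wrapper: prints the assumption on one line. */
static void print_misaligned_assumption(void) {
    printf("misaligned access assumption: %s\n", misaligned_assumption());
}
```

Compiling this once with default flags and once with a different alignment option should show whether the default really is the conservative one on a given compiler version.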

brucehoult · 2026-03-12T03:13:55 1773285235

> P550 is, like, what, only a year old?

No, it was released to customers in June 2021, almost five years ago.

https://www.sifive.com/press/sifive-performance-p550-core-se...

It has taken a while for this core to appear in an SoC suitable for SBCs. Intel was originally announced as doing that, and got as far as showing a working SoC/board at the Intel Innovation 2022 event in September 2022.

Someone who attended that event was able to download the source code for my primes benchmark and compile and run it, at the show, and was kind enough to send me the results. They were fine.

For reasons known only to Intel, they subsequently cancelled mass production of the chip.

ESWIN stepped up and made the EIC7700X, as used in the Milk-V Megrez and SiFive HiFive Premier P550, which did indeed ship just over a year ago.

But technically we could have had boards with the Intel chip three years ago.

Heck we should have had the far better/faster Milk-V Oasis with the P670 core (and 16 of them!) two years ago. Again, that was business/politics that prevented it, not technology.

dzaima · 2026-03-12T13:33:12 1773322392

> No, it was released to customers in June 2021, almost five years ago.

Ah, okay. (still, like, at least a couple decades newer than the last x86-64 chip with slow unaligned mem ops, if such ever existed at all? Haven't heard of / can't find anything saying any aarch64 ever had problems with them either, so still much worse for the RISC-V side).

Well, I suppose we can hope that business/politics messes will all never happen again and won't affect anything RVA23.

adgjlsfhk1 · 2026-03-12T03:13:42 1773285222

> I do agree that it'd be reasonable to just assume fast misaligned ops, but for whatever reason gcc and clang just don't, and that's what we have for defaults.

This very much has a "for now" on it. Once there is actually widespread hardware with the feature, I would be very surprised if the compilers don't update their heuristics (at least for RVA23 chips)

dzaima · 2026-03-12T13:33:52 1773322432

Indeed we shall hope heuristics update; but of course if no compilers emit misaligned ops, hardware has no reason to actually bother making them fast, so it's primed for going wrong.

adgjlsfhk1 · 2026-03-12T20:31:23 1773347483

hardware devs traditionally have been pretty good at helping the compiler teams with things like this (because it's a lot cheaper to improve the compiler than your chip).

newpavlov · 2026-03-11T10:47:13 1773226033

>So just use misaligned loads if Zicclsm is supported.

LLVM and GCC developers clearly disagree with you. In other words, re-iterating the previously raised point: Zicclsm is effectively useless and we have to wait decades for hypothetical Oilsm.

Most programmers will not know that the misaligned issue even exists, let alone about options like -mno-strict-align. They will just compile their project with default settings and blame RISC-V for being slow.
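To make the issue concrete, here is a minimal sketch of the portable unaligned-access idiom in C (helper names are hypothetical). The source is identical either way; what changes is the lowering. With strict alignment assumed, a RISC-V compiler will typically expand the `memcpy` into byte loads plus shifts, while with -mno-strict-align it is free to emit a single word load. The exact codegen depends on compiler version and flags, so check the disassembly instead of trusting this description.

```c
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value from a pointer with no alignment guarantee.
   A fixed-size memcpy is the portable idiom; the compiler decides whether
   it becomes one load or several narrower loads. */
static inline uint32_t load_u32_unaligned(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Matching store: writes 4 bytes at p regardless of p's alignment. */
static inline void store_u32_unaligned(void *p, uint32_t v) {
    memcpy(p, &v, sizeof v);
}
```

Code like this shows up in parsers, hash functions, and compression inner loops, which is where the default can hurt.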

RISC-V could've easily avoided all this mess by properly mandating misaligned pointer handling as part of the I extension.

dzaima · 2026-03-11T11:36:12 1773228972

Well, we don't necessarily have to wait for Oilsm; software that wants to could just choose to be opinionated and run massively-worse on suboptimal hardware. And, of course, once Oilsm hardware becomes the standard, it'd be fine to recompile RVA23-targeting software to it too.

> RISC-V could've easily avoided all this mess by properly mandating misaligned pointer handling as part of the I extension.

Rather hard to mandate performance by an open ISA. Especially considering that there could actually be scenarios where it may be necessary to chicken-bit it off; and of course the fact that there's already some questionability on ops crossing pages, where even ARM/x86 are very slow.

newpavlov · 2026-03-11T14:07:10 1773238030

I am not saying that RISC-V should mandate performance. If anything, we wouldn't have had the problem with Zicclsm if they had not bothered with the stupid performance note.

I would be fine with any of the following 3 approaches:

1) Mandate that store/loads do not support misaligned pointers and introduce separate misaligned instructions (good for correctness, so it's my personal preference).

2) Mandate that store/loads always support misaligned pointers.

3) Mandate that store/loads do not support misaligned pointers unless Zicclsm/Oilsm/whatever is available.

If hardware wants to implement slow handling of misaligned pointers for some reason, it's squarely the responsibility of the hardware's vendor. And everyone would know whom to blame for poor performance on some workloads.

We are effectively going to end up with 3, but many years later and with a lot of additional unnecessary mess associated with it. Arguably, this issue should've been long sorted out in the age of ratification of the I extension.

dzaima · 2026-03-11T14:44:00 1773240240

2 is basically infeasible with RISC-V being intended for a wide range of use-cases. 1 might be ok but introduces a bunch of opcode space waste.

Indeed extremely sad that Zicclsm wasn't a thing in the spec, from the very start (never mind that even now it only lives in the profiles spec); going through the git history, seems that the text around misaligned handling optionality goes all the way back to the very start of the riscv/riscv-isa-manual repo, before `Z*` extensions existed at all.

More broadly, it's rather sad that there aren't similar extensions for other forms of optional behavior (one thing that was recently brought up is RVV vsetvli with e.g. `e64,mf2`, useful for massive-VLEN>DLEN hardware).

newpavlov · 2026-03-11T15:28:00 1773242880

>1 might be ok but introduces a bunch of opcode space waste.

I wouldn't call it "waste". Moreover, it's fine for misaligned instructions to use a wider encoding or be less rich than their aligned counterparts. For example, they may not have the immediate offset, or may have a shorter one. One fun possibility is to encode the misaligned variant into the aligned instructions using the immediate offset with all bits set to one; as a side effect, it would also make the offset fully symmetric.

dzaima · 2026-03-11T17:34:49 1773250489

Of course that'd result in entirely-avoidable slowdown for the potentially-misaligned ops. Perhaps fine for a program that doesn't use them frequently, but quite bad for ones that need misaligned ops everywhere.

In terms of correctness, there's also the possibility of partially-misaligned ops (e.g. an 8B load with 4B alignment, loading two adjacent int32_t fields) so you're not handling everything with correct faults anyways.
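A minimal sketch of that partially-misaligned case, with hypothetical helper names: an 8-byte load from an address that is only guaranteed 4-byte aligned, pulling in two adjacent `int32_t` values at once. It is written with `memcpy` so it stays well-defined C; whether it compiles to a single 8-byte load depends on the target and alignment flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero if p is aligned to n bytes (n must be a power of two). */
static inline int is_aligned(const void *p, size_t n) {
    return ((uintptr_t)p & (n - 1)) == 0;
}

/* Loads two adjacent int32_t values as one 64-bit chunk.
   p is 4-byte aligned by its type, but nothing guarantees 8-byte alignment,
   so a single 8-byte hardware load here may be misaligned by 4. */
static inline uint64_t load_two_i32(const int32_t *p) {
    uint64_t bits;
    memcpy(&bits, p, sizeof bits);
    return bits;
}
```

With an 8-byte-aligned array of `int32_t`, the element at index 1 is the interesting input: it passes an "aligned for int32_t" check and fails an "aligned for the 8-byte access" check, so a scheme where only dedicated misaligned instructions tolerate misalignment has to decide which kind of op this is.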