What do you mean? Do you want to annotate the condition as unpredictable, so tha...

toredo1729_2 · on Feb 9, 2025

Yes, that would be great. It's not always beneficial, but in some (rare, but for me important) cases it's better. Currently, the only way to ensure a conditional move is used is inline assembly. This is not portable and also less maintainable than a "proper" solution.

tjalfi · on Feb 9, 2025

clang has the __builtin_unpredictable() intrinsic[0] for this purpose.

[0] https://clang.llvm.org/docs/LanguageExtensions.html#builtin-...
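A minimal sketch of how the intrinsic might be used, assuming Clang; the `UNPREDICTABLE` macro and the `clamp_max` helper are made-up names for illustration. The builtin is only a hint that the condition is hard to predict, so it does not guarantee a conditional move, and the macro falls back to the plain condition on compilers that lack it (e.g. GCC).

```c
#include <stdint.h>

/* Use the builtin only where the compiler reports it; otherwise the
   macro is a no-op that just evaluates the condition. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_unpredictable)
#    define UNPREDICTABLE(cond) __builtin_unpredictable(cond)
#  endif
#endif
#ifndef UNPREDICTABLE
#  define UNPREDICTABLE(cond) (cond)
#endif

/* Hypothetical helper: return x, but no larger than limit.
   The hint tells the optimizer that x > limit is hard to predict,
   which makes a branchless lowering more attractive to it. */
int64_t clamp_max(int64_t x, int64_t limit)
{
    if (UNPREDICTABLE(x > limit)) {
        return limit;
    }
    return x;
}
```

Whether a conditional move actually comes out still depends on the target and optimization level, so checking the generated assembly is the only way to be sure.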

IshKebab · on Feb 9, 2025

And it's not always possible! E.g. most RISC-V CPUs don't support it yet.

dzaima · on Feb 9, 2025

Eh, it takes ~3-4 instrs to do a branchless "x ? y : z" on baseline rv64i (depending on the format you have the condition in) via "y^((y^z)&x)", and with Zicond that only goes down to 3 instrs (they really don't want to standardize GPR instrs with 3 operands so what Zicond adds is "x ? y : 0" and "x ? 0 : y" ¯\_(ツ)_/¯; might bring the latency down by an instr or two though).
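For illustration, the xor/and trick above can be sketched in C roughly as follows. The function names are made up, and the sketch assumes 64-bit values. One detail: taken literally, `y^((y^z)&x)` yields `z` when `x` is all ones, so the version here xors with `z` instead so that a true condition selects `y`. The second function covers the case where the condition is not already a mask, which is where the extra instruction comes from.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless select given a mask that is either 0 or all ones.
   mask == all ones -> y, mask == 0 -> z. */
uint64_t select_by_mask(uint64_t mask, uint64_t y, uint64_t z)
{
    assert(mask == 0 || mask == UINT64_MAX);
    return z ^ ((y ^ z) & mask);
}

/* Same select for an arbitrary condition value: first normalize it to
   0 or 1, then negate (unsigned wraparound) to get 0 or all ones. */
uint64_t select_by_cond(uint64_t cond, uint64_t y, uint64_t z)
{
    uint64_t mask = -(uint64_t)(cond != 0);
    return select_by_mask(mask, y, z);
}
```

A compiler is free to turn this back into a branch or a conditional move if it recognizes the pattern, so this shows the arithmetic rather than a guaranteed instruction sequence.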

IshKebab · on Feb 9, 2025

It's more about removing branches than instruction counts or latency.

dzaima · on Feb 9, 2025

The "y^((y^z)&x)" method is already branchless, and close in performance to the Zicond variant, is my point; i.e. Zicond doesn't actually add much.

IshKebab · on Feb 9, 2025

Are you sure? As soon as you add actual computations in you're heading through the whole execution pipeline & forwarding network, tying up ALUs, etc. Zicond can probably be handled without all that.

Also that isn't actually equivalent since `x` needs to be all 1s or all 0s surely? Neither GCC nor Clang uses that method, but they do use Zicond.

dzaima · on Feb 9, 2025

Zicond's czero.eqz & czero.nez (& the `or` to merge those together for the 3-instr impl of the general `x?y:z`) still have to go through the execution pipeline, forwarding network, an ALU, etc just as much as an `xor` or `and` does. It's just that there's a shorter dependency chain and maybe one less instr.
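To make the 3-instr sequence concrete, here is a rough C model of the two Zicond instructions and the merging `or`. These are plain functions that mirror the instruction semantics, not compiler intrinsics, and the names are chosen for illustration. Both instructions compare the whole condition register against zero, so any nonzero `cond` counts as true.

```c
#include <stdint.h>

/* Model of czero.eqz rd, rs1, rs2: rd = (rs2 == 0) ? 0 : rs1 */
uint64_t czero_eqz(uint64_t value, uint64_t cond)
{
    return cond == 0 ? 0 : value;
}

/* Model of czero.nez rd, rs1, rs2: rd = (rs2 != 0) ? 0 : rs1 */
uint64_t czero_nez(uint64_t value, uint64_t cond)
{
    return cond != 0 ? 0 : value;
}

/* General cond ? y : z as two czero ops plus an or. */
uint64_t zicond_select(uint64_t cond, uint64_t y, uint64_t z)
{
    uint64_t a = czero_eqz(y, cond); /* y if cond != 0, else 0 */
    uint64_t b = czero_nez(z, cond); /* z if cond == 0, else 0 */
    return a | b;                    /* exactly one side is nonzeroed */
}
```

The two czero results do not depend on each other, which is where the shorter dependency chain comes from.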

Indeed you may need to negate `x` if you have only the LSB set in it; hence "3-4 instrs ... depending on the format you have the condition in" in my original message.

I assume gcc & clang just haven't bothered considering the branchless baseline impl, rather than it being particularly bad.

Note that there's another way some RISC-V hardware supports doing branchless conditional stores - a jump over a move instr (or in some cases, even some arithmetic instructions), which they internally convert to a branchless update.