You might want to give feedback to the risc-v developers (although it might be t...

dzaima · on Feb 9, 2025

There's of course no "way" to check, as it's a microarchitectural property. Your best bet is comparing performance of the same code on predictable vs unpredictable branches.
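A rough sketch of that kind of comparison, for illustration only. The function names (`select_sum`, `fill_predictable`, `fill_unpredictable`, `time_select_sum`) are hypothetical and not from this thread. It assumes the compiler actually emits the branch-over-move idiom for `select_sum`, which has to be confirmed by reading the generated assembly, since many compilers will turn this loop into a select or vectorize it.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical kernel: the accumulator is conditionally replaced.
   Whether this compiles to a branch over a move or to a cmov/select
   depends on compiler, flags, and target; check the assembly first. */
uint64_t select_sum(const uint8_t *flags, const uint64_t *vals, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t x = acc + vals[i];
        if (flags[i])
            acc = x;
    }
    return acc;
}

/* Predictable input: every flag is 1, so the branch always goes the same way. */
void fill_predictable(uint8_t *flags, size_t n)
{
    for (size_t i = 0; i < n; i++)
        flags[i] = 1;
}

/* Unpredictable input: low bit of a xorshift64 generator. */
void fill_unpredictable(uint8_t *flags, size_t n, uint64_t seed)
{
    uint64_t s = seed ? seed : 0x9E3779B97F4A7C15ull; /* state must be nonzero */
    for (size_t i = 0; i < n; i++) {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        flags[i] = (uint8_t)(s & 1);
    }
}

/* CPU time in seconds for one call; the result is stored through `out`
   so the call cannot be optimized away. */
double time_select_sum(const uint8_t *flags, const uint64_t *vals,
                       size_t n, uint64_t *out)
{
    clock_t t0 = clock();
    *out = select_sum(flags, vals, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Running `time_select_sum` over a large array filled once with each pattern, and repeating a few times, gives two timings for identical machine code. A clear gap suggests a real branch is being predicted; near-identical timings would be consistent with the core treating the idiom as a conditional move, though other effects can also hide the difference.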

I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).

Oh, another thing I don't like about fusing condjump+mv: it makes it much harder to intentionally use a branch on a known-predictable condition to avoid the dependency on both branches.
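To make the dependency point concrete, here is a minimal sketch with a hypothetical function `walk` and hypothetical arrays; `__builtin_expect` is a GCC/Clang extension. With a real, well-predicted branch, the common path `i + 1` can proceed speculatively without waiting for the load from `next_override`. If the same code were executed as a select, the new `i` would depend on both inputs, putting the load latency on the loop-carried chain.

```c
#include <stddef.h>

/* Hypothetical index walk. Callers must ensure every index reached
   stays within both arrays. */
size_t walk(const size_t *next_override, const unsigned char *has_override,
            size_t n_steps)
{
    size_t i = 0;
    for (size_t step = 0; step < n_steps; step++) {
        if (__builtin_expect(has_override[i], 0))
            i = next_override[i];   /* rare path: needs the loaded value */
        else
            i = i + 1;              /* common path: independent of the load */
    }
    return i;
}
```

Whether a given compiler keeps the branch here is not guaranteed; the hint only expresses the intent.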

ryao · on Feb 9, 2025

> There's of course no "way" to check, as it's a microarchitectural property. Your best bet is comparing performance of the same code on predictable vs unpredictable branches.

I was afraid that would be the answer to my question, but since my read of your previous comment “there's way to check whether it's implemented” seemed to suggest you knew a way I did not, I had my fingers crossed. It had to be either that you knew a trick I did not, or that a typo had deleted the word “no”.

> I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).

Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding logic to detect the idiom to the instruction decoder is the way to go. x86 cores spend enormous amounts of silicon on situational tricks to make code run faster. I doubt spending silicon on one more trick would be terrible, especially since a number of the other tricks to extract more performance likely apply to even more obscure situations. As for what happens in the x86 core, the instruction decoder would presumably emit what it emits for the explicit version when it sees the implicit version. I have no idea what that is inside an x86 core. I suspect that there are some corner cases to handle involving the mov instruction causing a fault (as you would want the CPU to report that the mov triggered the fault, not the jmp), but it seems doable given that they already had to handle instruction faults in other cases of fusion.

Also, if either of us were sufficiently motivated, we might be able to get GCC to generate better code through a plugin that will detect the implicit cmov idiom and replace it with an explicit cmov:

https://gcc.gnu.org/onlinedocs/gccint/Plugins.html

A similar plugin likely could be written for LLVM:

https://llvm.org/docs/WritingAnLLVMNewPMPass.html#registerin...

Note that I have not confirmed whether their plugins are able to hook the compiler backend where they would need to hook to do this.

Of course, such plugins won’t do anything for all of the existing binaries that have the implicit idiom or any new binaries built without the plugins, but they could at least raise awareness of the issue. It is not a full solution since compilers don’t emit the implicit cmov idiom in all cases where a cmov would be beneficial, but it would at least address the cases where they do.

dzaima · on Feb 9, 2025

> since my read of your previous comment seemed to suggest you knew a way I did not, I had my fingers crossed.

Whoops, typo! edited.

> Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding this to the instruction decoder is the way to go

From what I've read, the thing they're against (or at least is a major blocker) is having a standard GPR instruction that takes 3 operands, as all current GPR instrs take a max of two. I cannot imagine there being any way that fusing instructions is less silicon than a new instruction; if anything, it'd be not wanting to waste opcode space, or being fine with the branchy version (which I'm not).

Zen 4, at least as per Agner's microarchitecture optimization guide, only fuses nops and cmp/test/basic_arith+jcc; not that many tricks, only quite necessary ones (nops being present in code alignment, and branches, well, being basically mandatory every couple instructions).

No need for a plugin; it is possible to achieve branchless moves on both as-is: https://www.godbolt.org/z/eojqMseqs. A plugin wouldn't be any more stable than that mess. (also, huh, __builtin_expect_with_probability actually helped there!)
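For readers unfamiliar with the builtin, a minimal sketch of how it can be used to nudge a compiler toward a branchless select. This is not the code behind the godbolt link above; `select_hinted` and `select_ternary` are hypothetical names, and `__builtin_expect_with_probability` is an extension available in recent GCC and Clang releases. A probability of 0.5 tells the compiler the condition is expected to be unpredictable, which may (but is not guaranteed to) make it prefer a conditional move.

```c
/* Hint that the condition is true about half the time, i.e. hard to
   predict. The compiler is free to ignore the hint. */
int select_hinted(int cond, int a, int b)
{
    if (__builtin_expect_with_probability(cond != 0, 1, 0.5))
        return a;
    return b;
}

/* Plain ternary for comparison; compilers often, but not always,
   lower this to a conditional move as well. */
int select_ternary(int cond, int a, int b)
{
    return cond ? a : b;
}
```

Both functions return the same values; only the generated code may differ, so the effect has to be checked per compiler and optimization level.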

I'd imagine a major problem for the basic impls is that the compiler may early on lose the info that the load can be run in both cases, at which point doing it unconditionally would be an incorrect transformation.
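A small sketch of why the load case is harder, using hypothetical functions. In `load_or_default` the load is only valid when the pointer is non-NULL, so a compiler cannot hoist it above the test and then select between results. One common source-level workaround, shown in `load_or_default_select`, is to select between two valid addresses first and then do a single unconditional load.

```c
#include <stddef.h>

/* The load executes only when p is non-NULL. Performing it
   unconditionally would dereference NULL, so the branch cannot be
   replaced by "load both, then select". */
int load_or_default(const int *p, int fallback)
{
    if (p != NULL)
        return *p;
    return fallback;
}

/* Select the address, not the value: both candidate addresses are
   valid, so the single load that follows is always safe. */
int load_or_default_select(const int *p, int fallback)
{
    const int *src = p ? p : &fallback;
    return *src;
}
```

Whether the second form actually compiles to a conditional move still depends on the compiler and target.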

ryao · on Feb 9, 2025

I had suggested the virtual instructions to the RISC-V developers to eliminate branchy cmov assembly, as I am not happy with it either. It is surprising to realize that x86 cores are not making more use of macro-op fusion, contrary to my expectation, but I guess it makes sense now that I think about it. Their designers have plenty of other knobs for tuning performance, and the better their branch predictor becomes, the less this actually matters outside of the cases where developers go out of their way to use cmov.

A plugin would handle cases where the implicit idiom is emitted without needing the developer to explicitly try to force this. As far as I know, most people don’t ever touch conditional moves on the CPU, and the few that do (myself included) only bother with it for extremely hot code paths, which leaves some low-hanging fruit on the table, particularly when the compiler is emitting the implicit version by coincidence. The safety of the transformation as a last pass in the compiler backend is not an issue since the output would be no more buggy than it previously was (as both branches are already calculated). Trying to handle all cases (the fruit that is not low-hanging) is where you have to worry about incorrect transformations.

dzaima · on Feb 10, 2025

Ah, your gcc example does have the branchful branch that could be done branchlessly; I was thinking about my original example with a load, which can't be transformed back.

On fusion, https://dougallj.github.io/applecpu/firestorm.html mentions ones that Apple's M1 does - arith+branch, and very specialized stuff.