> since my read of your previous comment seemed to suggest you knew a way I did not, I had my fingers crossed.
Whoops, typo! edited.
> Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding this to the instruction decoder is the way to go
From what I've read, the thing they're against (or at least what is a major blocker) is having a standard GPR instruction that takes three source operands, as all current GPR instrs take a max of two. I cannot imagine there being any way that fusing instructions takes less silicon than a new instruction; if anything, it'd be not wanting to waste opcode space, or being fine with the branchy version (which I'm not).
Zen 4, at least as per Agner's microarchitecture optimization guide, only fuses nops and cmp/test/basic_arith+jcc; not that many tricks, only quite necessary ones (nops being present in code alignment, and branches, well, being basically mandatory every couple instructions).
No need for a plugin; it is possible to achieve branchless moves on both as-is: https://www.godbolt.org/z/eojqMseqs. A plugin wouldn't be any more stable than that mess. (Also, huh, __builtin_expect_with_probability actually helped there!)
I'd imagine a major problem for the basic impls is that the compiler may early on lose the info that the load can be run in both cases, at which point doing it unconditionally would be an incorrect transformation.
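For readers who can't open the link, here is a minimal sketch of the two usual source-level ways of nudging a compiler towards a branchless select. It is not the code behind the godbolt link; the helper names `select_mask_u64` and `select_hint_u64` are made up for illustration, and whether either one actually compiles to a cmov (or stays a branch) depends on the compiler, version, and flags.

```c
#include <stdint.h>

/* Hypothetical helper: returns a when cond is nonzero, otherwise b,
 * using only arithmetic so there is no branch in the source. */
static inline uint64_t select_mask_u64(int cond, uint64_t a, uint64_t b)
{
    /* mask is all ones when cond != 0 and all zeros when cond == 0 */
    uint64_t mask = (uint64_t)0 - (uint64_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}

/* Hypothetical helper: same result, but written as a ternary with a
 * probability hint. A probability of 0.5 tells GCC/Clang the condition
 * is unpredictable, which may (not must) favour a conditional move. */
static inline uint64_t select_hint_u64(int cond, uint64_t a, uint64_t b)
{
    return __builtin_expect_with_probability(cond != 0, 1, 0.5) ? a : b;
}
```

Both helpers only pick between values that are already computed, so there is nothing unsafe about evaluating them unconditionally; the hint variant relies on a GCC/Clang builtin and is not portable to other compilers.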
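To make the load problem concrete, a small sketch (function names are hypothetical, not taken from the linked example): in the guarded version the compiler can only turn the branch into a select if it can prove the pointer is dereferenceable even when the condition is false, while in the last version the programmer has already stated that the load always happens.

```c
#include <stddef.h>

/* Both values are already computed, so replacing the branch with a
 * select is always a valid transformation. */
static int pick_value(int cond, int a, int b)
{
    return cond ? a : b;
}

/* The load is guarded: p may be NULL (or otherwise invalid) when cond
 * is false, so hoisting *p above the test would be incorrect unless
 * the compiler still knows p is safe to dereference. */
static int pick_guarded_load(int cond, const int *p, int fallback)
{
    return cond ? *p : fallback;
}

/* Here the caller promises p is always valid. Loading first makes that
 * explicit and reduces the rest to a select between two known values. */
static int pick_unconditional_load(int cond, const int *p, int fallback)
{
    int loaded = *p;
    return cond ? loaded : fallback;
}
```

Calling `pick_guarded_load(0, NULL, 7)` is well defined and returns the fallback, which is exactly why an unconditional load there would be a miscompile; `pick_unconditional_load` must never be called with an invalid pointer.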
I had suggested the virtual instructions to the RISC-V developers to eliminate branchy cmov assembly, as I am not happy with it either. I am surprised that x86 cores are not making more use of macro-op fusion, contrary to my expectation, but I guess it makes sense now that I think about it. Their designers have plenty of other knobs for tuning performance, and the better their branch predictors become, the less this actually matters outside of the cases where developers go out of their way to use cmov.
A plugin would handle cases where the implicit idiom is emitted, without needing the developer to explicitly try to force this. As far as I know, most people don't ever touch conditional moves on the CPU, and the few that do (myself included) only bother with them for extremely hot code paths, which leaves some low-hanging fruit on the table, particularly when the compiler is emitting the implicit version by coincidence. The safety of the transformation as a last pass in the compiler backend is not an issue, since the output would be no more buggy than it previously was (as both branches are already calculated). Trying to handle all cases (the fruit that is not low-hanging) is where you have to worry about incorrect transformations.
Ah, your gcc example does have the branchful branch that could be done branchlessly; I was thinking about my original example with a load, which can't be transformed back.