I think the wide variety of hardware and the complexities of running on a real system sometimes make this problem harder than it looks.
Even within a given processor microarchitecture (say, just Zen 2, or just Haswell), different CPUs will be running at different frequencies, have different cooling solutions, and be running different microcode releases, all of which affect which program is the fastest. And this is without considering cache pressure or memory latency, which also depend on whatever other programs the user happens to be running.
Running a superoptimizer for the linear algebra program that runs on your Cray supercluster can give clear gains. Doing the same for every combination of user hardware seems less feasible: you may find output that is a clear win on the machines you tested, but it may well lose out on other machines.
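To make the superoptimization idea concrete, here is a toy sketch of the core search loop (hypothetical, not any real tool like STOKE or souper): exhaustively enumerate short programs over a made-up micro-ISA until one matches the target function on a set of test inputs. The op set, function names, and test range are all assumptions for illustration; real superoptimizers also need formal equivalence checking and, crucially, a per-machine cost model, which is exactly the part that varies across user hardware.

```python
# Toy superoptimizer sketch (hypothetical micro-ISA, illustration only).
# Real tools verify equivalence formally and rank candidates with a
# machine-specific cost model; this just does brute-force search.
from itertools import product

# Hypothetical ops: each maps the current value x (and the original
# input x0) to a new value.
OPS = {
    "add_x": lambda x, x0: x + x0,   # add the original input
    "shl1":  lambda x, x0: x << 1,   # double
    "neg":   lambda x, x0: -x,       # negate
    "inc":   lambda x, x0: x + 1,    # increment
}

def run(program, x0):
    """Execute a straight-line sequence of op names on input x0."""
    x = x0
    for op in program:
        x = OPS[op](x, x0)
    return x

def superoptimize(target, max_len=3, tests=range(-4, 5)):
    """Return the shortest op sequence agreeing with `target` on `tests`."""
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, t) == target(t) for t in tests):
                return program
    return None

# Example: search for a 3-op sequence computing 4*x + 1.
print(superoptimize(lambda x: 4 * x + 1))
```

Note that even this toy version only answers "which program is correct and shortest"; deciding which correct program is *fastest* requires timing or modeling each candidate on the actual target machine, which is where the hardware-variability problem above bites.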