OpenBLAS

gnufx · on April 26, 2020

GEMM is a fairly special case, but the case is unproven as far as I can see. The OpenBLAS kernels are certainly bigger than they need to be, but it's not clear to me that they satisfy the request (especially as regards "holistic" treatment of registers), and "OpenBLAS" isn't a performance measurement.

Since I'm interested in this, I'd like to see a reproducible case with an analysis of the performance difference. BLAS developers have made statements about GCC optimization failures (at least in vectorization) that aren't correct in my experience with any reasonably recent GCC. BLIS' generic C kernel already runs at ~60% (I forget the exact number) of the speed of the Haswell assembler code for large DGEMM without any attempt at tuning, perhaps with GCC extensions. (I didn't check whether the blocking is right for Haswell, or pursue an analysis with MAQAO or something similar.)
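
To make "generic C kernel" concrete, here is roughly the shape of thing I mean. This is only a sketch and not BLIS' actual reference kernel: the name `dgemm_ukernel`, the block sizes `MR` and `NR`, and the packed-panel layout are assumptions for illustration, and the sizes are not tuned for Haswell or anything else.

```c
#include <stddef.h>

/* Hypothetical register-block sizes, chosen for illustration only. */
enum { MR = 4, NR = 8 };

/*
 * Update an MR x NR block of C (row-major, leading dimension ldc):
 *     C += A_panel * B_panel
 * a holds k slices of MR contiguous doubles, b holds k slices of NR
 * contiguous doubles. The packing itself is assumed to be done elsewhere.
 */
static void dgemm_ukernel(size_t k,
                          const double *restrict a,
                          const double *restrict b,
                          double *restrict c, size_t ldc)
{
    /* Local accumulators that the compiler can keep in vector registers. */
    double acc[MR][NR] = {{0.0}};

    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)   /* unit stride in b and acc */
                acc[i][j] += a[p * MR + i] * b[p * NR + j];

    /* Write the accumulated block back into C. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```

Building with something like `gcc -O3 -march=native -fopt-info-vec` reports which loops GCC vectorized, which is a first check before making any claim either way. Whether the accumulators actually stay in registers depends on the block sizes and the target, so the generated assembly still needs looking at.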