MKL's benchmark performance requires AVX + FMA.
3.5 GHz * (4 add + 4 multiply) * 2 fma/cycle = 56 peak GFLOPS.
To exceed 50 GFLOPS without them would imply the CPU ran at 12.5 GHz.
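For anyone who wants to redo that arithmetic, here is a minimal sketch in C. The function names and the breakdown into lanes, flops per lane, and instructions per cycle are my own illustration, not taken from any benchmark suite. The SSE2 case assumes 2 doubles per vector with one add and one multiply issued per cycle (4 flops/cycle), which is the assumption that reproduces the 12.5 GHz figure.

```c
/* Back-of-the-envelope peak FLOP/s model (illustrative, hypothetical helper names).
 *
 * ghz             : core clock in GHz
 * lanes           : doubles per SIMD vector (4 for 256-bit AVX, 2 for 128-bit SSE2)
 * flops_per_lane  : 2 for a fused multiply-add, 1 for a plain add or multiply
 * instr_per_cycle : SIMD arithmetic instructions issued per cycle
 */
double peak_gflops(double ghz, int lanes, int flops_per_lane, int instr_per_cycle)
{
    return ghz * lanes * flops_per_lane * instr_per_cycle;
}

/* Clock (in GHz) a core would need to reach target_gflops with the given width. */
double required_ghz(double target_gflops, int lanes, int flops_per_lane,
                    int instr_per_cycle)
{
    return target_gflops / (double)(lanes * flops_per_lane * instr_per_cycle);
}
```

With these definitions, `peak_gflops(3.5, 4, 2, 2)` gives 56 for AVX2 + FMA, and `required_ghz(50.0, 2, 1, 2)` gives 12.5 for the assumed SSE2 case.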
OpenBLAS, on the other hand, actually performed poorly because it was limited to SSE thanks to a bug preventing the CPU from being recognized as Zen2.
I checked the instructions with perf and it is using an SSE code path. Also, as reported elsewhere, MKL_DEBUG_CPU_TYPE=5 no longer enables AVX2 support as it used to.
The plot thickens. As I reported elsewhere in the thread, the slow code paths were selected on my machine unless I overrode the mkl_serv_intel_cpu_true function to always return true. However, this was with PyTorch.
I have now also compiled the ACE DGEMM benchmark and linked against MKL iomp:
So, it is clearly using a GEMM kernel. Now I wonder what is different between PyTorch and this simple benchmark that makes PyTorch end up on a slow SSE code path.
Found the discrepancy. I use single precision in PyTorch. When I benchmark sgemm, the SSE code path is selected.
Conclusion: MKL detects Zen now, but currently only implements a Zen code path for dgemm and not for sgemm. To get good performance for sgemm, you have to fake being an Intel CPU.
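For reference, the override mentioned earlier in the thread can be sketched as a tiny shim library. This is a sketch under assumptions, not an official or supported mechanism: the symbol name comes from this thread, while the signature (no arguments, `int` return) and the file names below are my assumptions. It also only has a chance of working when MKL is linked dynamically, so that the preloaded symbol takes precedence.

```c
/* fakeintel.c (hypothetical file name)
 *
 * Shadows MKL's internal vendor check so that it always reports an Intel CPU.
 * Assumption: the function takes no arguments and returns a nonzero int for
 * "true". This is an internal symbol and may change between MKL releases.
 */
int mkl_serv_intel_cpu_true(void)
{
    return 1;
}
```

On Linux this would typically be built with something like `gcc -shared -fPIC -o libfakeintel.so fakeintel.c` and loaded via `LD_PRELOAD=./libfakeintel.so` in front of the benchmark or Python process. It is worth confirming with perf afterwards that the sgemm code path actually changed.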
FWIW, on my [Skylake/Cascadelake]-X Intel systems, Intel's compilers performed well, almost always outperforming GCC and Clang. But on Zen, their performance was terrible. So I was happy to see that MKL, unlike the compilers, did not appear to gimp AMD.
It's disappointing that MKL doesn't use optimized code paths on the 3700X.
I messaged the person who actually ran the benchmarks and owns the laptop, asking them to chime in with more information. I'm just the person who wrote that benchmark suite.
No, I don't. The 32-core AWS systems must be Epyc, so I'll try benchmarking there.
When OpenBLAS identifies the arch, it is competitive with MKL in single threaded performance, at least for matrices with a couple hundred rows and columns or more.
But MKL truly shines with multiple threads, so scaling on a 32 core system would be interesting to look at.