
PyTorch uses MKL (https://www.google.com/amp/s/amp.reddit.com/r/MachineLearnin...), and the workaround for AMD has been disabled by Intel.


I think MKL actually fixed Zen performance. That is, the workaround no longer makes any difference because it is no longer needed.

Small matrix multiply benchmarks on a Zen2 (Ryzen 7 4700U), featuring MKL 2020.1.216+0, OpenBLAS, and Eigen: https://gist.github.com/stillyslalom/bd916e3d26b4531364676ac...

MKL's benchmark performance requires AVX + FMA. A Zen2 core has two 256-bit FMA pipes, so: 3.5 GHz * 2 FMA/cycle * (4 adds + 4 multiplies per 256-bit FMA) = 56 peak double-precision GFLOPS. Without AVX and FMA (i.e. on the 4-flop/cycle SSE path), exceeding 50 GFLOPS would imply the CPU ran at 12.5 GHz.
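
The same back-of-envelope check in a few lines of C (a sketch; the 3.5 GHz clock and two 256-bit FMA pipes are the assumptions above):

    /* peak.c: theoretical peak for one Zen2 core under the assumptions above. */
    #include <stdio.h>

    int main(void) {
        double ghz = 3.5;              /* sustained clock (assumption) */
        double avx_fma = 2.0 * 4 * 2;  /* 2 FMA pipes x 4 doubles x (mul+add) */
        double sse = 4.0;              /* flops/cycle without AVX+FMA */
        printf("AVX+FMA peak: %.0f GFLOPS\n", ghz * avx_fma);          /* 56 */
        printf("GHz needed for 50 GFLOPS on SSE: %.1f\n", 50.0 / sse); /* 12.5 */
        return 0;
    }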

OpenBLAS, on the other hand, performed poorly because a CPU-detection bug kept the chip from being recognized as Zen2, limiting it to SSE kernels.


> I think MKL actually fixed Zen performance. That is, the workaround no longer makes any difference because it is no longer needed.

Odd. I am testing this on my 3700X and it is definitely not using the AVX, FMA, or AVX2 code paths. Intel MKL 2020 update 2:

     ldd  ~/git/sticker2/target/release/sticker2  | grep mkl_intel
     libmkl_intel_lp64.so => /nix/store/jpjwkkv1dqk4nn8swjzr5qqzp0dpzk2f-mkl-2020.2.254/lib/libmkl_intel_lp64.so (0x00007fe786862000)
I checked the instructions with perf and it is using an SSE code path. Also, as reported elsewhere, MKL_DEBUG_CPU_TYPE=5 no longer enables AVX2 support as it used to.


Comparing OpenBLAS and MKL with `peakflops` in Julia, there's definitely an advantage for MKL:

    julia> using LinearAlgebra

    julia> BLAS.vendor()
    :openblas64

    julia> BLAS.set_num_threads(1)

    julia> peakflops()
    3.9023447970402664e10


    julia> using LinearAlgebra
    
    julia> BLAS.vendor()
    :mkl
    
    julia> BLAS.set_num_threads(1)
    
    julia> peakflops()
    4.8113846984735275e10
That's close to the ~50 GFLOPS I saw in @celrod's benchmarks.


The plot thickens. As I reported elsewhere in the thread, the slow code paths were selected on my machine unless I overrode the mkl_serv_intel_cpu_true function to always return true. However, that was with PyTorch.
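
For reference, that override is typically done by preloading a one-line shim that shadows MKL's CPU check (a sketch; fakeintel.c is a hypothetical name, and build flags may vary):

    /* fakeintel.c: shadow MKL's CPU-vendor check so the Intel code paths
       are taken on AMD.
       Build:   gcc -shared -fPIC -o libfakeintel.so fakeintel.c
       Preload: LD_PRELOAD=$PWD/libfakeintel.so ./your-mkl-program */
    int mkl_serv_intel_cpu_true(void) {
        return 1;
    }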

I have now also compiled the ACE DGEMM benchmark and linked it against MKL and iomp:

    $ ./mt-dgemm 1000 | grep GFLOP
    GFLOP/s rate:         69.124168 GF/s
The most-used function is:

    mt-dgemm  libmkl_def.so       [.] mkl_blas_def_dgemm_kernel_zen
So it is clearly using a Zen GEMM kernel. Now I wonder what is different between PyTorch and this simple benchmark that causes PyTorch to end up on the slow SSE code path.


Found the discrepancy. I use single precision in PyTorch. When I benchmark sgemm, the SSE code path is selected.

Conclusion: MKL detects Zen now, but currently only implements a Zen code path for dgemm and not for sgemm. To get good performance for sgemm, you have to fake being an Intel CPU.

Edit, longer description: https://github.com/pytorch/builder/issues/504
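
A minimal way to check that split (a sketch; gemm-probe.c is a hypothetical name, and the link line is one possibility among several) is to issue one call per precision and look at the hot symbols under perf:

    /* gemm-probe.c: one dgemm and one sgemm call, large enough to sample
       with `perf record` / `perf report` and see which kernel each
       precision dispatches to.
       Build: gcc gemm-probe.c -lmkl_rt -o gemm-probe */
    #include <stdlib.h>
    #include <mkl_cblas.h>

    #define N 2000

    int main(void) {
        double *A = calloc((size_t)N * N, sizeof *A);
        double *B = calloc((size_t)N * N, sizeof *B);
        double *C = calloc((size_t)N * N, sizeof *C);
        float  *a = calloc((size_t)N * N, sizeof *a);
        float  *b = calloc((size_t)N * N, sizeof *b);
        float  *c = calloc((size_t)N * N, sizeof *c);

        /* double precision: dispatches to the Zen-optimized kernel */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);
        /* single precision: falls back to the slow SSE code path */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0f, a, N, b, N, 0.0f, c, N);
        return 0;
    }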


Hmm.

FWIW, on my [Skylake/Cascadelake]-X Intel systems, Intel's compilers performed well, almost always outperforming GCC and Clang. But on Zen, their performance was terrible. So I was happy to see that MKL, unlike the compilers, did not appear to gimp AMD.

It's disappointing that MKL doesn't use optimized code paths on the 3700X.

I messaged the person who actually ran the benchmarks and owns the laptop, asking them to chime in with more information. I'm just the person who wrote that benchmark suite.


It seems I have found the issue. We were both right. MKL now uses a Zen-optimized kernel for dgemm, but not (yet?) for sgemm. More details:

https://github.com/pytorch/builder/issues/504


If OpenBLAS' CPU detection fails, you can force it with an environment variable, but why omit AMD's implementation?
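
(With a DYNAMIC_ARCH build, forcing it looks something like `OPENBLAS_CORETYPE=Zen ./mt-dgemm 1000`; the exact core-type spelling here is an assumption.)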


Do you know if this applies to Epyc as well?


No, I don't. The 32-core AWS systems must be Epyc, so I'll try benchmarking there.

When OpenBLAS identifies the arch correctly, it is competitive with MKL in single-threaded performance, at least for matrices with a couple hundred rows and columns or more. But MKL truly shines with multiple threads, so scaling on a 32-core system would be interesting to look at.


You can see BLIS on Intel's home turf at https://github.com/flame/blis/blob/master/docs/Performance.m... (52-core SKX) and compare with OpenBLAS on 32-core Zen1. (Multithreaded BLAS isn't typically used in HPC, where the parallelism is elsewhere.)



