Hacker News | skidrow's comments

SIMD intrinsics and manually unrolled loops are definitely needed. That's why all BLAS libraries vectorize and unroll loops by hand. Even modern compilers can't auto-vectorize and unroll reliably.


are OpenBLAS and MKL not well optimized lol? They literally compared against OpenBLAS/MKL and posted the results in the article. As someone already mentioned, this implementation is faster than MKL even on an Intel Xeon with 96 cores. Maybe you missed the point, but the purpose of the article was to show HOW to implement matmul with NumPy-like performance without FORTRAN/ASSEMBLY code, NOT how to write a BLIS-competitive library. So the article and the code LGTM.


Look, it's indeed a reasonable comparison. They use matrix sizes up to M=N=K=5000, so the FFI overhead is negligible. And what would be the point of comparing NumPy against BLAS when NumPy uses BLAS under the hood anyway?


Their implementation outperforms not only a recent version of OpenBLAS but also MKL on their machine (these are the DEFAULT BLAS libraries shipped with numpy). What's the point of comparing against BLIS if numpy doesn't use it by default? The authors explicitly say: "We compare against NumPy". They use matrix sizes up to M=N=K=5000, so the FFI overhead is, in fact, negligible.

