I don't think this code can make use of the tensor cores, or the wgmma instructions that you typically need to get peak performance out of them.
Programming these is a nightmare as you need to have several in flight concurrently for peak performance.
Perhaps you don't need the extra flops as you end up bandwidth bound?
Regardless, the good thing about the code in the blog is that it'll probably work pretty well on other accelerators if you port it to HIP or similar. If you use wgmma I'm not sure it's even portable across Nvidia generations.
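To illustrate what I mean, the kind of kernel the blog can get away with is a plain fp16 matrix-vector product along these lines (a rough sketch of my own, not the blog's actual code; gemv_fp16 and the launch shape are made up). No tensor cores, no wgmma, and it translates to HIP more or less mechanically (modulo AMD's 64-wide wavefronts):

    #include <cuda_fp16.h>

    // One warp per output row: lanes stride across the row, then a warp
    // shuffle reduction sums the partial dot products. Each fp16 weight
    // is loaded once and used once, so this is memory-bandwidth bound.
    __global__ void gemv_fp16(const half* W, const half* x, float* y,
                              int rows, int cols) {
        int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
        int lane = threadIdx.x % 32;
        if (row >= rows) return;

        float acc = 0.0f;
        for (int c = lane; c < cols; c += 32)
            acc += __half2float(W[(size_t)row * cols + c]) * __half2float(x[c]);

        // Warp-level reduction via down-shuffles.
        for (int off = 16; off > 0; off >>= 1)
            acc += __shfl_down_sync(0xffffffffu, acc, off);

        if (lane == 0) y[row] = acc;
    }

Launched with, say, 128 threads per block so each of the four warps handles one output row; nothing generation-specific in there.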
Good point, yes. That explains why he's getting performance similar to the leading frameworks. Those tensor operations help for training or for throughput-optimised batched inference, but not really at a batch size of one.
I actually didn't know that. I'm in the space as a hobbyist and I had a vague understanding that tensor cores are essential for reaching peak performance but only work for certain operations, like dense matrix-matrix multiplication. It was on my list to investigate whether they could be used to further improve single-batch decoding; it makes sense that they don't help when it's all matrix-vector.
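To put rough numbers on that: decoding one token at batch size one turns each weight matrix into a GEMV, which is about 2*N^2 FLOPs against about 2*N^2 bytes of fp16 weights, so roughly 1 FLOP per byte. A modern data-centre GPU only becomes compute-bound at a few hundred FLOPs per byte of HBM traffic, so you hit the memory-bandwidth roof long before tensor cores could matter; they start paying off once a bigger batch lets each weight be reused across many tokens.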