I don't think this code can make use of the tensor cores, or the wgmma instructions that you typically need to get peak performance out of them.
Programming these is a nightmare as you need to have several in flight concurrently for peak performance.
Perhaps you don't need the extra flops as you end up bandwidth bound?
Regardless, the good thing about the code in the blog is that it'll probably work pretty well on other accelerators if you port it to HIP or similar. If you use wgmma I'm not sure it's even portable across Nvidia generations.
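To illustrate what I mean, the kind of kernel the blog can get away with is a plain fp16 matrix-vector product along these lines (a rough sketch of my own, not the blog's actual code; gemv_fp16 and the launch shape are made up). No tensor cores, no wgmma, and it translates to HIP more or less mechanically (modulo AMD's 64-wide wavefronts):

    #include <cuda_fp16.h>

    // One warp per output row: lanes stride across the row, then a warp
    // shuffle reduction sums the partial dot products. Each fp16 weight
    // is loaded once and used once, so this is memory-bandwidth bound.
    __global__ void gemv_fp16(const half* W, const half* x, float* y,
                              int rows, int cols) {
        int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
        int lane = threadIdx.x % 32;
        if (row >= rows) return;

        float acc = 0.0f;
        for (int c = lane; c < cols; c += 32)
            acc += __half2float(W[(size_t)row * cols + c]) * __half2float(x[c]);

        // Warp-level reduction via down-shuffles.
        for (int off = 16; off > 0; off >>= 1)
            acc += __shfl_down_sync(0xffffffffu, acc, off);

        if (lane == 0) y[row] = acc;
    }

Launched with, say, 128 threads per block so each of the four warps handles one output row; nothing generation-specific in there.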
Good point, yes. That explains why he's getting performance similar to the leading frameworks. Those tensor operations help for training or for throughput-optimised batched inference, but not really at a batch size of one.
I actually didn't know that. I'm in the space as a hobbyist and I had a vague understanding that tensor cores are essential for reaching peak performance but only work for certain operations, like dense matrix-matrix multiplication. It was on my list to investigate whether they could be used to further improve single-batch decoding; it makes sense that they don't help when it's all matrix-vector.
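To put rough numbers on that: decoding one token at batch size one turns each weight matrix into a GEMV, which is about 2*N^2 FLOPs against about 2*N^2 bytes of fp16 weights, so roughly 1 FLOP per byte. A modern data-centre GPU only becomes compute-bound at a few hundred FLOPs per byte of HBM traffic, so you hit the memory-bandwidth roof long before tensor cores could matter; they start paying off once a bigger batch lets each weight be reused across many tokens.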