That is on a CPU. A GPU works differently: threads on a GPU implicitly vectorize loads and stores across their warp/block. My question concerned GPUs, where you cannot vectorize instructions by loop unrolling, since the instructions are effectively already vector instructions.
I think you have a mistaken understanding of how GPUs work? There is some "vectorization" across threads in the form of coalescing, but what I am talking about is literally a vectorized load/store, the same as you would see on a CPU. Like, you can do an ld / ld.64 / ld.128 to specify the width of the memory operation. If your loop loads individual elements and it is possible to load them together, then the compiler can fuse them into a single wider load.
That makes more sense. When you said automatic vectorization, I was thinking about SIMD calculations. Nvidia does support doing 128-bit loads and stores:
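For example, the usual way to get a 128-bit load in CUDA is to read through one of the built-in vector types like `float4` (a minimal sketch; the kernel and buffer names here are made up for illustration, and the pointers must be 16-byte aligned for the wide load to be legal):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles four floats at once.
// Reading through a float4 pointer lets the compiler emit a single
// 128-bit load (ld.global.v4.f32 in PTX) instead of four 32-bit loads.
__global__ void scale4(float4 *out, const float4 *in, float s, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];   // one 128-bit load
        v.x *= s;
        v.y *= s;
        v.z *= s;
        v.w *= s;
        out[i] = v;         // one 128-bit store
    }
}
```

The same trick works with `int4`, `double2`, etc.; the constraint is that the base address must be aligned to the vector width (16 bytes for `float4`), which `cudaMalloc` allocations satisfy.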