That is on a CPU. A GPU works differently: threads on a GPU implicitly vectorize loads and stores across their warp/block. My question concerned GPUs, where you cannot vectorize instructions by loop unrolling, since the instructions are effectively already vector instructions.
I think you have a mistaken understanding of how GPUs work? There is some "vectorization" across threads in the form of coalescing, but what I am talking about is literally a vectorized load/store, the same as you would see on a CPU. Like, you can do an ld / ld.64 / ld.128 to specify the width of the memory operation. If your loop loads individual elements and it is possible to load them together, then the compiler can fuse them into a single wider load.
That makes more sense. When you said automatic vectorization, I was thinking about SIMD calculations. Nvidia does support doing 128-bit loads and stores:
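For example, the usual way to get a 128-bit load in CUDA is to read through one of the built-in vector types like `float4` (a minimal sketch; the kernel and buffer names here are made up for illustration, and the pointers must be 16-byte aligned for the wide load to be legal):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles four floats at once.
// Reading through a float4 pointer lets the compiler emit a single
// 128-bit load (ld.global.v4.f32 in PTX) instead of four 32-bit loads.
__global__ void scale4(float4 *out, const float4 *in, float s, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];   // one 128-bit load
        v.x *= s;
        v.y *= s;
        v.z *= s;
        v.w *= s;
        out[i] = v;         // one 128-bit store
    }
}
```

The same trick works with `int4`, `double2`, etc.; the constraint is that the base address must be aligned to the vector width (16 bytes for `float4`), which `cudaMalloc` allocations satisfy.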