> at larger vector sizes, you spend more time in the embarrassingly-parallelizable kernel

SIMD vectors have a fixed width, so even though the kernel is "embarrassingly parallel", you should only expect the speedup to approach a fixed ceiling as the size grows (roughly the lane count at best), not to keep increasing. Moreover, the bigger the matrix, the more likely you are to see cache-size and memory-bandwidth effects in your performance numbers, so SIMD could be less of a win in the limit.
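To make that concrete, here is a toy back-of-the-envelope model, not a benchmark. Every number in it is made up for illustration: `lanes` is an assumed vector width, `overhead` is an assumed fixed non-vectorized cost, and `compute_per_elem` / `mem_per_elem` are hypothetical per-element costs in arbitrary time units.

```python
def simd_speedup(n, lanes=8, overhead=100.0):
    """Modelled scalar_time / simd_time for n elements.

    Assumes a fixed serial overhead plus one time unit per element,
    with the SIMD version handling `lanes` elements per unit.
    """
    scalar_time = overhead + n
    simd_time = overhead + n / lanes
    return scalar_time / simd_time


def bandwidth_bound_speedup(lanes=8, compute_per_elem=1.0, mem_per_elem=0.5):
    """Same idea, but each element also costs memory traffic.

    Per-element time is whichever is slower, compute or memory.
    SIMD only shrinks the compute term (a crude roofline-style model).
    """
    scalar = max(compute_per_elem, mem_per_elem)
    simd = max(compute_per_elem / lanes, mem_per_elem)
    return scalar / simd


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, round(simd_speedup(n), 3))
    print("bandwidth-bound:", bandwidth_bound_speedup())
```

In the first function the speedup climbs toward `lanes` as `n` grows but never reaches it. In the second, once the working set no longer fits in cache and the memory term dominates, the modelled speedup is capped well below the lane count no matter how wide the vectors are.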