Interestingly enough, Vector<T> itself is a bit limited and has a smaller API than Vector128/256/512<T>. It is being improved to match those and then expand further, but that is happening incrementally: some in .NET 8, more now in .NET 9, and further changes after that. After all, the compiler treats it pretty much the same as a vector of the respective width under the hood.

The main problem with writing SIMD code for the first time is the learning curve: the performance improvement doesn't come just from performing n element operations per instruction, but also from reducing the bookkeeping that is usually paid per element per loop iteration. That means fewer conditional branches to terminate the loop, branchless shuffles in vectors instead of branchy per-element selects, loading more data at a time, etc.
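To make the bookkeeping point concrete, here is a minimal sketch of a vectorized sum. `SimdSumSketch` and `Sum` are names I made up for illustration, and it assumes .NET 7 or later for the cross-platform `Vector128` helpers. The vector loop pays one loop-condition check and one add per `Vector128<int>.Count` elements rather than per element.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration; not taken from the runtime or the guide.
static class SimdSumSketch
{
    public static int Sum(ReadOnlySpan<int> values)
    {
        int i = 0;
        int total = 0;

        if (Vector128.IsHardwareAccelerated && values.Length >= Vector128<int>.Count)
        {
            ref int start = ref MemoryMarshal.GetReference(values);
            Vector128<int> acc = Vector128<int>.Zero;
            int lastBlock = values.Length - Vector128<int>.Count;

            // One loop check and one add per Vector128<int>.Count elements.
            for (; i <= lastBlock; i += Vector128<int>.Count)
            {
                acc += Vector128.LoadUnsafe(ref start, (nuint)i);
            }

            // Horizontal add of the lanes, done once after the loop.
            total = Vector128.Sum(acc);
        }

        // Scalar tail for the elements that did not fill a whole vector.
        for (; i < values.Length; i++)
        {
            total += values[i];
        }

        return total;
    }
}
```

The accumulator stays in a vector for the whole loop and is only reduced to a scalar at the end, which is where most of the saving over a naive per-element loop comes from.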

Which is why many first-time attempts leave the wrong impression that vectorization is hard, while the truth is that people just stumble into known "don't do that, also do less" traps: needlessly spilling vectors, writing to intermediate buffers instead of directly to the destination, modifying individual vector elements or even iterating over them, and avoiding actual vector operations.
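As a rough illustration of avoiding those traps, the sketch below replaces non-positive floats with zero in place. `SimdSelectSketch` and `ClampNegativesToZero` are hypothetical names, and it again assumes .NET 7 or later. It uses a compare plus `Vector128.ConditionalSelect` instead of touching lanes one by one, and stores straight to the destination with no intermediate buffer.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration; not taken from the runtime or the guide.
static class SimdSelectSketch
{
    public static void ClampNegativesToZero(Span<float> data)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated)
        {
            ref float start = ref MemoryMarshal.GetReference(data);
            int lastBlock = data.Length - Vector128<float>.Count;

            for (; i <= lastBlock; i += Vector128<float>.Count)
            {
                Vector128<float> v = Vector128.LoadUnsafe(ref start, (nuint)i);

                // Per-lane mask: all bits set where v > 0, zero elsewhere.
                Vector128<float> mask = Vector128.GreaterThan(v, Vector128<float>.Zero);

                // Branchless select, written directly back to the destination.
                Vector128.ConditionalSelect(mask, v, Vector128<float>.Zero)
                         .StoreUnsafe(ref start, (nuint)i);
            }
        }

        // Scalar tail with the same semantics as the vector path.
        for (; i < data.Length; i++)
        {
            if (!(data[i] > 0f))
            {
                data[i] = 0f;
            }
        }
    }
}
```

The version to avoid would read each lane with `GetElement`, branch on it, and rebuild the vector with `WithElement`, which throws away most of what the vector path buys you.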

There's a new-ish guide on vectorization if you're interested: https://github.com/dotnet/runtime/blob/main/docs/coding-guid...