This. Modern SIMD extensions have gathers and scatters specifically to work with these kinds of memory layouts. For example, ARM64 NEON has interleaving loads and stores in the form of LD2/LD3/LD4 and the corresponding ST2/ST3/ST4 counterparts.
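Roughly, using LD2/ST2 through intrinsics looks like this (untested sketch, the function name and {re, im} pair layout are just illustrative):

    #include <arm_neon.h>
    #include <stddef.h>

    /* Scale interleaved {re, im} float pairs in place.  vld2q_f32 (LD2)
       de-interleaves the stream into two registers (all re, all im);
       vst2q_f32 (ST2) re-interleaves on the way out.  Tail handling omitted. */
    void scale_pairs(float *data, size_t n_pairs, float k) {
        for (size_t i = 0; i + 4 <= n_pairs; i += 4) {
            float32x4x2_t v = vld2q_f32(&data[2 * i]);
            v.val[0] = vmulq_n_f32(v.val[0], k);
            v.val[1] = vmulq_n_f32(v.val[1], k);
            vst2q_f32(&data[2 * i], v);
        }
    }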
Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?
Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.
Gathers on AVX2 used to be problematic, but I assume that shouldn't be the case today, especially if the lane-crossing is minimal? (If you do know, please share!)
Gather is still terrible; the only core that handles it well is Intel's P-core. AMD issues 40+ micro-ops for an AVX2 gather (80 in AVX-512), and Intel's E-core is much worse.
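For reference, this is the kind of pattern those micro-op counts apply to: a single AVX2 gather pulling one strided field out of an AoS array (illustrative struct layout, sketch only):

    #include <immintrin.h>

    struct particle { float x, y, z, w; };   /* hypothetical AoS layout, 16-byte stride */

    /* Gather the x field of 8 consecutive particles into one ymm register.
       Indices are in units of the 4-byte scale, i.e. one per struct. */
    static __m256 gather_x(const struct particle *p) {
        __m256i idx = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
        return _mm256_i32gather_ps((const float *)p, idx, 4);
    }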
When using SIMD you must use either SoA or AoSoA for optimal performance. You can sometimes use AoS if you have a special hand-coded swizzle loader for the format.
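By "swizzle loader" I mean something like this: plain vector loads plus shuffles that turn packed xyz (AoS) data into x/y/z registers (SoA). Rough sketch with SSE intrinsics; the names are just illustrative:

    #include <immintrin.h>

    /* Load 4 packed xyz points (12 floats, AoS) and transpose them into
       SoA form: x = {x0..x3}, y = {y0..y3}, z = {z0..z3}. */
    static void load_xyz_soa(const float *p, __m128 *x, __m128 *y, __m128 *z) {
        __m128 x0y0z0x1 = _mm_loadu_ps(p + 0);
        __m128 y1z1x2y2 = _mm_loadu_ps(p + 4);
        __m128 z2x3y3z3 = _mm_loadu_ps(p + 8);
        __m128 x2y2x3y3 = _mm_shuffle_ps(y1z1x2y2, z2x3y3z3, _MM_SHUFFLE(2, 1, 3, 2));
        __m128 y0z0y1z1 = _mm_shuffle_ps(x0y0z0x1, y1z1x2y2, _MM_SHUFFLE(1, 0, 2, 1));
        *x = _mm_shuffle_ps(x0y0z0x1, x2y2x3y3, _MM_SHUFFLE(2, 0, 3, 0));
        *y = _mm_shuffle_ps(y0z0y1z1, x2y2x3y3, _MM_SHUFFLE(3, 1, 2, 0));
        *z = _mm_shuffle_ps(y0z0y1z1, z2x3y3z3, _MM_SHUFFLE(3, 0, 3, 1));
    }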
Do you know of any resources on such swizzle loaders? I've toyed around with hand-coding x86 SIMD myself, and getting everything horizontally in the right place is always a pain.
https://documentation-service.arm.com/static/6530e5163f12c06... (PDF)