
This. Modern SIMD extensions have gathers and scatters specifically for working with these kinds of memory layouts. For example, ARM64 NEON has interleaving loads and stores in the form of LD2/LD3/LD4 and their respective ST* counterparts.

https://documentation-service.arm.com/static/6530e5163f12c06... (PDF)
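
For a concrete sense of what those instructions look like from C or C++, the LD3 form is exposed as the vld3q_f32 intrinsic, which loads 12 floats and deinterleaves them into three registers in one go. A minimal sketch, with an illustrative sum-of-x-components loop (tail handling omitted):

  #include <arm_neon.h>

  // Sum the x components of an interleaved xyzxyz... array, 4 points per iteration.
  // vld3q_f32 is the LD3 deinterleaving load: val[0] = x's, val[1] = y's, val[2] = z's.
  float sum_x(const float *xyz, int n_points) {
      float32x4_t acc = vdupq_n_f32(0.0f);
      for (int i = 0; i + 4 <= n_points; i += 4) {
          float32x4x3_t p = vld3q_f32(xyz + 3 * i);
          acc = vaddq_f32(acc, p.val[0]);
      }
      return vaddvq_f32(acc); // AArch64 horizontal add
  }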



Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?

Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.
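
For context, "using gathers" here means something like the following: pulling the x components out of an interleaved xyz array with a stride-3 index vector (a sketch; the layout and indices are just an example):

  #include <immintrin.h>

  // Gather 8 x components from an interleaved xyzxyz... array, starting at point i.
  // The index vector selects every third float; scale is 4 bytes per index unit.
  __m256 gather_x(const float *xyz, int i) {
      const __m256i idx = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
      return _mm256_i32gather_ps(xyz + 3 * i, idx, 4);
  }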


They are pretty good: https://dougallj.github.io/applecpu/measurements/firestorm/L...

Gathers on AVX2 used to be problematic, but I assume that's no longer the case today, especially if the lane-crossing is minimal? (If you do know, please share!)


Gather is still terrible; the only core that handles it well is Intel's P-core. AMD issues 40+ micro-ops for an AVX2 gather (80 for AVX-512), and Intel's E-core is much worse.

When using SIMD you must use either SoA or AoSoA for optimal performance. You can sometimes use AoS if you have a special hand-coded swizzle loader for the format.
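
To spell out the three layouts being contrasted, a sketch in plain structs (the 8-wide block in the AoSoA variant is just one choice, sized to an AVX register of floats):

  // AoS: one struct per point; fields are interleaved in memory.
  struct PointAoS { float x, y, z; };

  // SoA: one contiguous array per field; maps directly onto plain vector loads.
  struct PointsSoA { float *x, *y, *z; };

  // AoSoA: fixed-width blocks of each field, so a block fills one SIMD register
  // while keeping nearby points' fields close together in memory.
  struct PointBlock8 { float x[8], y[8], z[8]; };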


Do you know of any resources on such swizzle loaders? I've toyed around with hand-coding x86 SIMD myself, and getting everything horizontally in the right place is always a pain.


You can often find them by searching Stack Overflow (try AVX2 plus deinterleave / AoS to SoA / transpose, etc.), especially any answers by Peter Cordes.

You can also write them yourself, but I'd add a verifier that checks the output against scalar code, as it can be tricky to get correct.

Intel had an article up for the 3x8 transpose, but it seems to no longer exist, so I'll just post the pseudocode:

  #include <immintrin.h>

  // AoS (xyzxyzxyz...) -> SoA (xxx... / yyy... / zzz...) transpose of 8 points.
  // On entry x, y, z hold three consecutive 256-bit loads of the interleaved data;
  // on exit they hold the 8 x's, 8 y's and 8 z's respectively.
  void swizzle3_AoS_to_SoA(__m256 &x, __m256 &y, __m256 &z) {
      __m256 m14 = _mm256_permute2f128_ps(x, z, 0x21); // high 128 bits of x, low 128 bits of z
      __m256 m03 = _mm256_blend_ps(x, y, 0xF0);        // low half of x, high half of y; 1 cycle
      __m256 m25 = _mm256_blend_ps(y, z, 0xF0);        // low half of y, high half of z

      // shuffles are all 1 cycle
      __m256 xy = _mm256_shuffle_ps(m14, m25, _MM_SHUFFLE(2, 1, 3, 2)); // upper x's and y's
      __m256 yz = _mm256_shuffle_ps(m03, m14, _MM_SHUFFLE(1, 0, 2, 1)); // lower y's and z's
      __m256 xo = _mm256_shuffle_ps(m03, xy, _MM_SHUFFLE(2, 0, 3, 0));
      __m256 yo = _mm256_shuffle_ps(yz, xy, _MM_SHUFFLE(3, 1, 2, 0));
      __m256 zo = _mm256_shuffle_ps(yz, m25, _MM_SHUFFLE(3, 0, 3, 1));
      x = xo;
      y = yo;
      z = zo;
  }
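
As suggested above, a swizzle like this is worth checking against a scalar reference. A minimal sketch of such a verifier, assuming the function above is in the same file and compiled with AVX enabled:

  #include <cstdio>

  int main() {
      // Interleaved test data: element i is simply the value i.
      float xyz[24];
      for (int i = 0; i < 24; ++i) xyz[i] = (float)i;

      __m256 x = _mm256_loadu_ps(xyz + 0);
      __m256 y = _mm256_loadu_ps(xyz + 8);
      __m256 z = _mm256_loadu_ps(xyz + 16);
      swizzle3_AoS_to_SoA(x, y, z);

      float xs[8], ys[8], zs[8];
      _mm256_storeu_ps(xs, x);
      _mm256_storeu_ps(ys, y);
      _mm256_storeu_ps(zs, z);

      // Scalar reference: x of point i is xyz[3*i], y is xyz[3*i+1], z is xyz[3*i+2].
      int ok = 1;
      for (int i = 0; i < 8; ++i)
          ok &= (xs[i] == xyz[3 * i]) & (ys[i] == xyz[3 * i + 1]) & (zs[i] == xyz[3 * i + 2]);
      printf(ok ? "swizzle OK\n" : "swizzle MISMATCH\n");
      return !ok;
  }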



