Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Gather is still terrible, the only core that handles it well is the Intel's P core. AMD issues 40+ micro ops in AVX2(80 in AVX512), and the Intel E core is much worse.

When using SIMD you must either use SoA or AoSoA for optimal performance. You can sometimes use AoS if you have a special hand coded swizzle loader for the format.



Do you know of any resources on such swizzle loaders? I've toyed around with hand-coding x86 SIMD myself, and getting everything horizontally in the right place is always a pain.


You can often find them by search stack overflow(try AVX2 + (deinterleave/AoS to SoA/transpose etc), especially any answers by Peter Cordes.

You can write them yourself also but I'd add a verifier that checks the output with scalar code as it can be tricky to get correct.

Intel had an article up for the 3x8 transpose, but it seems to no longer exist so i'll just post the psuedo code

   //xyz -> xxx 
 void swizzle3_AoS_to_SoA(v8float &x, v8float &y, v8float &z) {
  v8float m14 = interleave_low_high<1, 2>(x, z); //swap low/high 128 bits
  v8float m03 = blend<0, 0, 0, 0, 1, 1, 1, 1>(x, y); //_mm256_blend_ps 1 cycle
  v8float m25 = blend<0, 0, 0, 0, 1, 1, 1, 1>(y, z);

  //shuffles are all 1 cycle
  __m256 xy = _mm256_shuffle_ps(m14, m25, _MM_SHUFFLE(2, 1, 3, 2)); // upper x's and y's 
  __m256 yz = _mm256_shuffle_ps(m03, m14, _MM_SHUFFLE(1, 0, 2, 1)); // lower y's and z's
  v8float xo = _mm256_shuffle_ps(m03, xy, _MM_SHUFFLE(2, 0, 3, 0));
  v8float yo = _mm256_shuffle_ps(yz, xy, _MM_SHUFFLE(3, 1, 2, 0));
  v8float zo = _mm256_shuffle_ps(yz, m25, _MM_SHUFFLE(3, 0, 3, 1));
  x = xo;
  y = yo;
  z = zo;

 }




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: