Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

As not mentioned in the article, if you want the general form of this algorithm, it is a Hillis-Steele prefix sum: <https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorte...>


I don't think this really describes neon_prefixsum_fast as a whole? The algorithm does use a Hillis-Steele sum on sums of 4 values, but each of these is computed with a sequential sum, interleaving those with a transposed order. In terms of what's added to what, it's actually quite a bit like my "Sequential broadcasting" picture from [0]. The reference I'd use for a general form is "Parallel Scan as a Multidimensional Array Problem"[1], breaking 16 elements into a 4x4 array; the paper describes how the scan splits into a row-wise scan, plus values obtained from an exclusive scan on carries from the rows.

[0] https://mlochbaum.github.io/BQN/implementation/primitive/fol...

[1] https://ashinkarov.github.io/pubs/2022-scan.html




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: