The Case of the Missing SIMD Code (optidash.ai)
76 points by ethanwillis on June 8, 2023 | 31 comments


It's cool and everything, but I don't go along with the why-don't-people-just tone of the whole thing. I think it's fairly understandable not to want to maintain micro-optimized implementations of however many hot routines your library has for an ever-expanding array of SIMD extensions and variants, each of which you either have to ship in your ever-expanding binaries (if you're doing runtime capability detection) or which will likely never see the light of day for 99.9% of users, who get shipped a baseline binary for compatibility (if you're doing build-time capability detection).

I'm sure libpng is in a better place with his improvement, but I don't think the reason it didn't already have it is that the maintainers are dummies or "not ambitious enough".

In a world where coverage-guided fuzzers are getting better every day, maintaining a high-profile library in a non-memory-safe language can be nerve-wracking enough with just a single implementation of your array-processing loops.


Not that I'm necessarily disagreeing, but simde has made life a lot easier here. If you want to write SIMD code and don't want to use a higher-level library or depend on the compiler to optimize the code down to SIMD (both of which honestly seem like pretty good options these days), simde is worth taking a look at. You write what looks like native intrinsics code, and it translates it to the other archs for you. Well, "other archs" here just means partially-supported NEON, but I'd still consider that quite a win.

No relation with the project, just a user.
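For anyone unfamiliar, here's roughly what that looks like (a hypothetical add4 function of my own, not from simde's docs; assumes the simde headers are available). You write SSE2 intrinsics as usual, and simde maps them to NEON, or a scalar fallback, on other platforms:

    // With native aliases enabled, SSE2 intrinsics compile unchanged on ARM
    // via simde's NEON mappings.
    #define SIMDE_ENABLE_NATIVE_ALIASES
    #include "simde/x86/sse2.h"
    #include <stdint.h>

    // Add two arrays of four int32s with SSE2-style intrinsics.
    void add4(const int32_t *a, const int32_t *b, int32_t *out) {
      __m128i va = _mm_loadu_si128((const __m128i *)a);
      __m128i vb = _mm_loadu_si128((const __m128i *)b);
      _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
    }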


I was curious about these libraries a few weeks ago and did some searching. Is there one that's got a clearly dominating set of users or contributors?

I don't know what a good way to compare these might be, other than perhaps activity/contributor count.

[1] https://github.com/simd-everywhere/simde

[2] https://github.com/ermig1979/Simd

[3] https://github.com/google/highway

[4] https://gitlab.com/libeigen/eigen

[5] https://github.com/shibatch/sleef


I'm the main author of Highway, so I have some opinions :D The number of operations and platforms supported are important criteria.

A hopefully unbiased commentary:

Simde allows you to take existing nonportable intrinsics and get them to run on another platform. This is useful when you have a bunch of existing code and tight deadlines. The downside is less-than-optimal performance: a portable abstraction can be more efficient than forcing one platform to exactly match the semantics of another. Although a ton of effort has gone into Simde, it sometimes also resorts to autovectorization, which may or may not work.

Eigen and SLEEF are mostly math-focused projects that also have a portability layer. SLEEF is designed for C and thus has type suffixes which are rather verbose, see https://github.com/shibatch/sleef/blob/master/src/libm/sleef... But it offers a more complete libm than Highway's. Eigen is similarly useful for higher-level linear algebra that you probably do not want to reimplement yourself. However, I am skeptical of its expression-template approach; in my experience one can run into trouble when the code gets complex (maybe some compiler limits are reached). Both of them may be less useful in non-math domains; AFAIK they lack many of the specialized instructions Highway has (e.g. interleaved load/store, compress/expand, software AES/CLMUL, popcount, lzcnt, saturated add/sub, 128-bit compare/minmax, fixed-point mul). I think Eigen would struggle with dynamic dispatch because of its header-only style, and SLEEF (understandably) lacks 8/16-bit types.

Highway is used in multiple projects, and because it is just a collection of wrapper functions around intrinsics, it is more predictable/reliable. For this reason it is also, AFAIK, currently the only solution that supports scalable vectors on SVE and RISC-V V. The main downside is that it requires C++11, so we are unable to help projects that are C-only and cannot compile parts of their code as C++.
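For a flavor of the API, here is a sketch adapted from Highway's documented example (assuming size is a multiple of the vector length; real code would handle remainders):

    #include <cstddef>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // x[i] = mul[i] * x[i] + add[i]; the same source compiles to SSE4/AVX2,
    // NEON, SVE or RISC-V V, with the vector width chosen per target.
    void MulAddLoop(const hn::ScalableTag<float> d, const float* HWY_RESTRICT mul,
                    const float* HWY_RESTRICT add, size_t size,
                    float* HWY_RESTRICT x) {
      for (size_t i = 0; i < size; i += hn::Lanes(d)) {
        const auto m = hn::Load(d, mul + i);
        const auto a = hn::Load(d, add + i);
        hn::Store(hn::MulAdd(m, hn::Load(d, x + i), a), d, x + i);
      }
    }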


This entire space has a lot of options, and it's still an open question what the right approach is. The fact is that C++ is a scalar language, with scalar guarantees and scalar assumptions built into its types. Hacking around this requires sacrifices, and different libraries are choosing different approaches. EVO (not featured in your list) is trying to approach it from an algorithms point of view. Highway is portable wrappers around common functionality.

I think this is inherently a language/compiler problem. When libraries use intrinsic functions to override compiler decisions, they interact poorly. One example: here's a comparison of a scalar versus an AVX-512 intrinsic version of accumulate.

https://godbolt.org/z/cMafh4Pnv

IDK if you can read assembly, but the intrinsic version is waaaay worse, and about 4x slower. This is because using intrinsics makes it very hard for the compiler to do other optimizations it otherwise would (e.g. unrolling and reordering).


I agree compiler transforms sometimes do not interact well with intrinsics, but some still do help.

Interestingly, even `#pragma clang unroll(4)` is ignored here. Often in such cases we have to manually unroll, e.g. introduce multiple accumulators.
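To illustrate the multiple-accumulator trick (a generic sketch, not the godbolt example above): independent partial sums break the loop-carried dependency chain, so the CPU can overlap the additions.

    #include <cstddef>

    // Sum with four independent accumulators; the additions are no longer
    // serialized through a single register.
    float SumUnrolled(const float* a, size_t n) {
      float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      size_t i = 0;
      for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
      }
      for (; i < n; ++i) s0 += a[i];  // scalar remainder
      return (s0 + s1) + (s2 + s3);
    }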


Can I pick your brain? I'm working on yet another C++ based SIMD library, and my head is sort of boiling over.


Happy to discuss, preferably via issue on github.com/google/highway, would you like to open one?


I'm not familiar with all of those, but simde is AFAICT lower-level than the others. It doesn't attempt to build any higher-level abstractions, just translates between the existing standards and instruction sets.


I've also run into this thinking, and have been looking to solve it in codebases I'm working on.

I've run across https://github.com/aff3ct/MIPP but have not worked with it extensively yet. It looks to be a solution to the problem of rewriting X parallel pipelines into Y SIMD extensions.

Perhaps something like this, or languages introducing something similar into their standard libraries/modules would be a solution.

None of this, of course, solves run-time capability detection or the binary-size growth needed to support it.


I totally agree we don't want multiple hand-maintained optimized implementations, but that is no longer necessary. Highway (mentioned below) lets you write your code once and have it compiled and specialized for the major SIMD targets. If you're writing, say, a thousand lines of SIMD, the binary size impact is a few dozen KiB.

This is also a purely library solution, so no build system magic is required.
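For the curious, the runtime-dispatch side looks roughly like the following (a compressed sketch of Highway's documented foreach_target pattern; real code needs the include ordering and guards from the docs):

    #include <cstddef>
    // This translation unit is re-included once per SIMD target.
    #define HWY_TARGET_INCLUDE "my_module.cc"
    #include "hwy/foreach_target.h"
    #include "hwy/highway.h"

    HWY_BEFORE_NAMESPACE();
    namespace project {
    namespace HWY_NAMESPACE {
    void MulBy2(float* p, size_t n) { /* vector code as in the sketch above */ }
    }  // namespace HWY_NAMESPACE
    }  // namespace project
    HWY_AFTER_NAMESPACE();

    #if HWY_ONCE
    namespace project {
    HWY_EXPORT(MulBy2);  // builds a table of per-target implementations
    void CallMulBy2(float* p, size_t n) {
      HWY_DYNAMIC_DISPATCH(MulBy2)(p, n);  // selects the best target at runtime
    }
    }  // namespace project
    #endif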


I don't agree with this argument at all when the domain is a relatively fixed format with a specification.


We've worked with Larry in the past, and still do to this day, at kraken.io. He helped us create what I believe to be the fastest way on planet Earth to detect the various image formats. We'll be open-sourcing it later this year (it's not nearly as presentable as it needs to be for the outside world). Indeed, it has some pretty spectacular SIMD going on, and I can't wait to get it out there in the proper way.


> Newer instructions (e.g. AVX2) would simplify this slightly, but the compiler is actually smart about replacing some of my intrinsics with better replacements.

I've seen this story many times before. What the author should be doing is reorganizing his code to be branch-free, clear about aliasing, and aggressive with vectorization-friendly pragmas. Notice that when he wrote vectorized code using the 25-year-old SSE instructions, the compiler was able to optimize it into better, more contemporary SIMD instructions? That's not because he used SSE intrinsics; it's because SSE intrinsics forced him to write branch-free code. If he took the same steps with scalar code, it would now be clearer and easier for the compiler to truly optimize, and it would stay that way over time as new architectures come out.
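As a hypothetical illustration of that rewrite (my own toy example, not the author's libpng code; __restrict is the usual compiler extension for declaring non-aliasing pointers):

    #include <cstddef>
    #include <cstdint>

    // Branchy version: per-element control flow that the vectorizer must
    // first if-convert, which it cannot always do.
    void abs_branchy(const int32_t* in, int32_t* out, size_t n) {
      for (size_t i = 0; i < n; ++i) {
        if (in[i] < 0) out[i] = -in[i];
        else out[i] = in[i];
      }
    }

    // Branch-free version with non-aliasing pointers: the select typically
    // compiles to a conditional move in scalar code, or e.g. VPABSD when
    // vectorized.
    void abs_branchfree(const int32_t* __restrict in, int32_t* __restrict out,
                        size_t n) {
      for (size_t i = 0; i < n; ++i) {
        const int32_t v = in[i];
        out[i] = v < 0 ? -v : v;
      }
    }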


For a moment I was confused why the text sounded so familiar, but then realized the author is Larry Bank and I read the post on his blog https://bitbanksoftware.blogspot.com/2021/05/the-case-of-mis... before.


While it is technically interesting to do such experiments, it is quite presumptuous IMO to devalue portable, reasonably performing code. I never need to worry about libpng performance, and that is likely true for 99.9% of use cases. And when you really do want to optimize for some reason, my experience is that well-structured portable code is actually the easier starting point.


> There is no scalar instruction for calculating the absolute value, so it uses a negate and conditional move instead.

Wait, what? x86 has specialized instructions for all kinds of things but doesn't have absolute value?


IIRC you can compose an xor and subtract to do abs, which might eliminate the need for a dedicated opcode. Although negate/cmov might work better. Either way you'd only need an abs if it showed up often enough that you wanted to save code size or could schedule the pipeline better, which seems unlikely.
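The xor-and-subtract version mentioned here, for the record (a standard bit trick; right-shifting a signed value is technically implementation-defined, but all mainstream compilers implement it as an arithmetic shift):

    #include <cstdint>

    int32_t abs_xor_sub(int32_t x) {
      const int32_t sign = x >> 31;  // 0 if x >= 0, -1 (all ones) if x < 0
      return (x ^ sign) - sign;      // no-op for x >= 0, two's-complement negate otherwise
    }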


I wonder how many people need anything in the BMI suite of instructions, or BMI2.

https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruc...

POPCNT, LZCNT and TZCNT are pretty nifty and all, but probably more obscure than absolute value.

I do think PDEP and PEXT are brilliant and that more people should play with those two BMI2 instructions.
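For a taste of PEXT (a toy example of mine, not from the thread): it gathers the bits selected by a mask into the low end of the result.

    #include <cstdint>
    #include <immintrin.h>  // _pext_u64 requires BMI2 (compile with -mbmi2)

    // Extract the lowest bit of each nibble into a contiguous 16-bit value.
    uint64_t nibble_low_bits(uint64_t x) {
      return _pext_u64(x, 0x1111111111111111ull);
    }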


A dedicated absolute-value instruction would save exactly one instruction every time you do absolute value, and both of the instructions in the naive version are already pretty fast. Dedicated LZCNT/TZCNT instructions save a dozen or so instructions every time you do those operations, and while some 'clever' alternatives use fewer instructions, those instructions can be fairly slow.

Amdahl's law. You don't gain a lot of benefit by improving something which is already fast.


Popcount is extremely common in algorithms. XOR then popcount gives Hamming distance, for example. I've used lzcnt many times as well, in parsing binary formats and flag fields.
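The Hamming-distance idiom, for reference (C++20's std::popcount compiles down to the POPCNT instruction where available):

    #include <bit>      // C++20 std::popcount
    #include <cstdint>

    int hamming_distance(uint64_t a, uint64_t b) {
      return std::popcount(a ^ b);  // count of differing bit positions
    }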


tzcnt and popcnt are for parsing strings! If you'd like to e.g. lex a JSON document, then you'll need these.


Popcnt is probably the most broadly applicable BMI instruction with a myriad of applications in parallel programming. I'm not surprised to hear that it has applications for string parsing, but I admit that I can't see how right now. But I can believe it even without seeing an example.

TZCNT being used for string parsing... Sorry, can't see it at all or how it could be possible lol. Could you give an example?

(31 - LZCNT(x)) is the 32-bit binary logarithm (floor of log2(x)) and likely has a number of mathematical applications.


This is the simdjson bit_indexer[0] which takes in a mask representing the positions of structural characters ([]{}:,) and outputs a list of the positions of those characters. Basically it uses popcnt to know how many there are and uses tzcnt repeatedly to find the position of a single one. It's a bit questionable whether building this list is necessary, but that concern is just about whether the tzcnt and popcnt belong here or in some other code that processes the structural characters (e.g. to iterate over structural characters you could iterate over the raw masks until you find a nonzero one and then start tzcnt'ing there).

If we want to keep the list of positions, there are other approaches for converting the vector mask to the list of positions. We could AND the vector mask with {0,1,2,3,4...}, then widen by 4x to get int32 positions and add our current int32 position, then use or emulate vcompress and unconditionally write everything to the list, counting on updating the position by popcnt(mask) to keep the array dense. So we'd still need popcnt, but some implementations of this sort of thing compute the count some other way[1] and don't literally use the popcnt instruction. This approach might be more reasonable if we only widen by 2x, produce a list of int16 positions per 64K input chunk, and then go back and widen that whole list to int32 later.

[0]: https://github.com/simdjson/simdjson/blob/master/src/generic...

[1]: https://github.com/lemire/despacer/blob/master/src/despacer....
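The mask-iteration idiom being described, as a generic sketch (not simdjson's actual code; C++20's std::countr_zero maps to TZCNT where available):

    #include <bit>      // C++20 std::countr_zero
    #include <cstdint>

    // Calls visit(pos) for each set bit position in mask, low to high.
    template <typename Visit>
    void for_each_set_bit(uint64_t mask, Visit visit) {
      while (mask != 0) {
        visit(std::countr_zero(mask));  // index of lowest set bit (TZCNT)
        mask &= mask - 1;               // clear lowest set bit (BLSR)
      }
    }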


> Basically it uses popcnt to know how many there are and uses tzcnt repeatedly to find the position of a single one.

BLSI and BLSR are better instructions for iterating over a bitset and finding the 1s (BLSI) and then removing the 1s (BLSR).

Furthermore, it only took 3 assembly instructions to implement before the BMI instruction set (see: https://www.chessprogramming.org/General_Setwise_Operations#... )

You're correct on popcnt, but this kind of iteration I normally see with BLSI / BLSR, not with TZCNT. I feel like TZCNT is slower and less elegant for this purpose.

---------

To go from a single selected bit (e.g. 0b00001000) to its index (i.e. 3 in this case), I guess TZCNT could be used, though. But I would have done popcnt(x-1) instinctively.

TZCNT would be faster though, so there's that.

------

But all of these operations are only ~2 to 3 assembly instructions, roughly on the same scale as absolute value. With the exception of popcnt, of course.

----

EDIT: Okay, I think I see how TZCNT comes into play now. Thanks! Just had to think it through.


Awesome work! It makes me wonder how many critical libraries there are out there, dependencies for thousands of projects, that could use optimizations like this. Probably more than we can imagine.


The vast majority of them I would guess.


Faster PNG en/decoders do exist, though; it's probably just that libpng is the most commonly used, not the most extensively optimised.


Meanwhile SIMD is still out of reach of JavaScript (unless you count using JS to load WASM).


WebGPU


Ah yes, the Web Shading Lang dialect of JavaScript.



