PDEP/PEXT are the big ones, they are extremely important for real-time sensor an...

BeeOnRope · on Dec 8, 2018

Thanks, that is really interesting. It is hard to believe that pdep/ext alone could result in a 10x throughput improvement - but I acknowledge it is possible since that is one very slow to emulate instruction in the general case, and if you needed exactly that...

It actually isn't clear to me exactly what Intel was targeting with that pair of instructions, but they sure is useful in all sorts of scenarios.

> The only other non-standard instructions with similar value are the AES intrinsics

If I can ask, what are the interesting uses outside of encryption? The main use I am aware of is as a handy fast and high-quality hash function implemented in hardware (and you don't need all the rounds when you are just after quality, and not adversarial collision resistance).

jandrewrogers · on Dec 9, 2018

For PDEP/PEXT it is the general case of ad hoc and unpredictable bit extract/deposit sequences. A decade ago, I spent a lot of time designing clever libraries that could dynamically effect this but even if you could amortize the overhead of setting up the machinery, it still was ~20 cycles. These instructions eliminated the need to code gen at all, and each instruction runs a lot faster than ~20 cycles. When those instructions showed up with Haswell, it wiped out a lot of code I had written, and in a good way. You can compose them to effect algorithms that would be very complicated (and slow) to implement otherwise.

I've read some things from Intel that suggest PDEP/PEXT were designed for cryptographic applications. However, they are a straightforward implementation of generalized shift networks (there is literature on this), so their potential applications are much broader.

For AES, those instructions have interesting properties for integer manipulation beyond encryption, and even beyond providing the basis for the fastest generic non-cryptographic hash functions currently available for both large and small keys. For example, you can compute a perfect hash (e.g. collision-free hashing from 32-bits to 32-bits) in a few clock cycles for scalar primitives using AES intrinsics. If you understand the construction, which superficially seems like it should not be possible, the result is virtually ideal statistically. Brilliant for hash tables, which still spend a lot of their time hashing, so I am surprised no one seems to be doing it (I figured it out myself, studying the statistical peculiarities of the AES instructions).