PDEP/PEXT are the big ones. They are extremely important for real-time sensor and event processing (plus a few other things, like join parallelization). Those instructions let you trivially compute ad hoc intersections between arbitrary, mixed-dimensionality constraints in high-dimensionality spaces, something that would otherwise require some very ugly and much slower code in pure C++. They are also useful for massively parallel graph analytics. Ironically, the instructions were not designed for this purpose. We are talking about a 10x improvement in throughput; it isn't trivial.
I lived in the HPC world prior to the existence of these instructions. I wouldn't want to go back. I used to design insanely complex and inscrutable bit-twiddling libraries to achieve the result of what is a handful of instructions now. It is one of the very few intrinsics I can't live without for most of the high-performance codes I write. The only other non-standard instructions with similar value are the AES intrinsics (which are useful for more than encryption).
Vector instruction support is important but more spotty in its value, at least in my case. I have applications where I expect the details of vector performance will matter but I have insufficient data thus far. Early AVX implementations were marginal but I could see use cases for AVX-512, though I have no anecdotal data to support that conjecture.
Thanks, that is really interesting. It is hard to believe that PDEP/PEXT alone could result in a 10x throughput improvement, but I acknowledge it is possible, since that is a very slow instruction to emulate in the general case, and if you needed exactly that...
It actually isn't clear to me exactly what Intel was targeting with that pair of instructions, but they sure are useful in all sorts of scenarios.
> The only other non-standard instructions with similar value are the AES intrinsics
If I can ask, what are the interesting uses outside of encryption? The main use I am aware of is as a handy fast and high-quality hash function implemented in hardware (and you don't need all the rounds when you are just after quality, and not adversarial collision resistance).
For PDEP/PEXT it is the general case of ad hoc and unpredictable bit extract/deposit sequences. A decade ago, I spent a lot of time designing clever libraries that could dynamically effect this but even if you could amortize the overhead of setting up the machinery, it still was ~20 cycles. These instructions eliminated the need to code gen at all, and each instruction runs a lot faster than ~20 cycles. When those instructions showed up with Haswell, it wiped out a lot of code I had written, and in a good way. You can compose them to effect algorithms that would be very complicated (and slow) to implement otherwise.
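For readers who have not used these instructions, here is a minimal portable sketch of what PEXT and PDEP compute, written as plain C++ loops. It is only a reference for the semantics, not the poster's library and not a fast implementation. The names `pext_ref`, `pdep_ref`, `interleave2`, `X_MASK` and `Y_MASK` are made up for this illustration. On x86-64 hardware with BMI2, the intrinsics `_pext_u64` and `_pdep_u64` from `<immintrin.h>` perform the same operation in a single instruction each.

```cpp
#include <cstdint>

// Reference PEXT: collect the bits of `src` at the positions set in `mask`
// and pack them contiguously into the low bits of the result.
inline std::uint64_t pext_ref(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t out = 0;
    std::uint64_t out_bit = 1;                      // next low-order slot to fill
    while (mask != 0) {
        std::uint64_t lowest = mask & (~mask + 1);  // isolate lowest set mask bit
        if (src & lowest) out |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1;                           // clear that mask bit
    }
    return out;
}

// Reference PDEP: take the low bits of `src` in order and scatter them to
// the positions set in `mask`; all other result bits are zero.
inline std::uint64_t pdep_ref(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t out = 0;
    std::uint64_t src_bit = 1;                      // next low-order source bit
    while (mask != 0) {
        std::uint64_t lowest = mask & (~mask + 1);
        if (src & src_bit) out |= lowest;
        src_bit <<= 1;
        mask &= mask - 1;
    }
    return out;
}

// Hypothetical example: a 2-D key with x in the even bits and y in the odd bits.
constexpr std::uint64_t X_MASK = 0x5555555555555555ULL;
constexpr std::uint64_t Y_MASK = 0xAAAAAAAAAAAAAAAAULL;

// Build the interleaved key from two coordinates.
inline std::uint64_t interleave2(std::uint64_t x, std::uint64_t y) {
    return pdep_ref(x, X_MASK) | pdep_ref(y, Y_MASK);
}
```

With these definitions, `interleave2(5, 3)` gives `0x1B`, and `pext_ref(0x1B, X_MASK)` and `pext_ref(0x1B, Y_MASK)` recover `5` and `3`. The masks do not have to be regular: any 64-bit mask chosen at run time works, which is the "ad hoc and unpredictable" case described above. The loops cost one iteration per set mask bit, which gives a sense of why a general software emulation is slow.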
I've read some things from Intel that suggest PDEP/PEXT were designed for cryptographic applications. However, they are a straightforward implementation of generalized shift networks (there is literature on this), so their potential applications are much broader.
For AES, those instructions have interesting properties for integer manipulation beyond encryption, and even beyond providing the basis for the fastest generic non-cryptographic hash functions currently available for both large and small keys. For example, you can compute a perfect hash (e.g. collision-free hashing from 32 bits to 32 bits) in a few clock cycles for scalar primitives using AES intrinsics. If you understand the construction, which superficially seems like it should not be possible, the result is virtually ideal statistically. It is brilliant for hash tables, which still spend a lot of their time hashing, so I am surprised no one seems to be doing it (I figured it out myself by studying the statistical peculiarities of the AES instructions).