As impressive as this compiler analysis is, I shudder to think how much time the compiler spends on these kinds of optimizations. In my opinion, a better architecture would be for the compiler to provide a separate analysis tool that suggests source-level changes for these kinds of optimizations. It could alert you that the loop could be replaced with a popcount (and optionally make the textual replacement in the source for you). Then you pay the cost of the optimization once and get clarity about what your code _actually_ does at runtime, instead of the compiler transparently pulling the rug out from underneath you when it runs with optimizations.
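For context, this is the kind of loop in question. GCC and Clang can recognize this idiom at `-O2`/`-O3` and replace the whole loop with a single `popcnt` instruction on targets that support it (a sketch; the function name is mine):

```c
#include <stdint.h>

/* Naive population count: count set bits one at a time.
 * Modern GCC/Clang recognize this idiom and, with optimizations
 * enabled and a suitable target (e.g. -mpopcnt on x86), compile
 * the entire loop down to a single popcnt instruction. */
unsigned naive_popcount(uint64_t x) {
    unsigned n = 0;
    while (x) {
        n += x & 1;   /* add the low bit */
        x >>= 1;      /* shift the next bit into place */
    }
    return n;
}
```

Comparing the `-O0` and `-O2` assembly of this function is an easy way to see the transformation the parent is describing.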
Side note: many years ago I wrote the backend for a private global surveillance system that has almost surely tracked the physical location of anyone reading this. We could efficiently track how often a device had been seen at a location in the prior 64 (days|weeks|months) in just 192 bytes, using popcount to compute the counts. I am not proud that I built this.
I did a lot of research on this [1]. I got confirmation from Robert Garner (architect of the SPARC processor) that the NSA did indeed ask for the population count instruction. His story of meeting with the NSA is pretty amusing [2].
It goes back to the CDC 6600 at least, and is most often seen as part of Hamming distance computation (pop(xor(x,y))). But it turns out to be really useful for other things (trailing zero count), and worth having in hardware since the software sequence is a ~dozen instructions for 64 bits.
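Both uses mentioned above fall out of popcount directly. A sketch using GCC/Clang's `__builtin_popcountll` (the function names here are mine):

```c
#include <stdint.h>

/* Hamming distance: the number of bit positions where x and y
 * differ, i.e. pop(xor(x, y)). */
unsigned hamming(uint64_t x, uint64_t y) {
    return (unsigned)__builtin_popcountll(x ^ y);
}

/* Trailing zero count via popcount: ~x & (x - 1) sets exactly the
 * bits below the lowest set bit of x, so counting them gives the
 * number of trailing zeros. (Yields 64 for x == 0.) */
unsigned ctz_via_popcount(uint64_t x) {
    return (unsigned)__builtin_popcountll(~x & (x - 1));
}
```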
I <3 [AT]BM and BMI[12], and functional-equivalency matching with cost minimization optimization compiler passes.
I wish GCC and LLVM had compiler passes to semi/automagically "vectorize" hot sections using SIMD, e.g., magic transformation of UTF-8 conversion, regex matching, and string functions.
There's a fun approach where you do the computation as a tree, in parallel. You do a little masking and shifting to add each even-numbered bit to the odd-numbered bit next to it, and come up with (assuming we're working on a 64-bit value) 32 partial sums of 2 bits each. Then you add pairs of those to get 16 partial sums of 4 bits each, and so forth until you get to the top. This requires six addition steps, plus shifts and masks for each one.
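The tree approach described above can be written out like this (the function name is mine; this is the classic SWAR bit-counting trick):

```c
#include <stdint.h>

/* Tree-style popcount for 64 bits: each step adds adjacent groups
 * of bits, doubling the group width each time (1 -> 2 -> 4 -> 8
 * -> 16 -> 32 -> 64). Six add steps in total, each with its own
 * mask and shift. */
unsigned popcount64_tree(uint64_t x) {
    x = (x & 0x5555555555555555ULL) + ((x >> 1)  & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2)  & 0x3333333333333333ULL);
    x = (x & 0x0F0F0F0F0F0F0F0FULL) + ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL);
    x = (x & 0x00FF00FF00FF00FFULL) + ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = (x & 0x0000FFFF0000FFFFULL) + ((x >> 16) & 0x0000FFFF0000FFFFULL);
    x = (x & 0x00000000FFFFFFFFULL) + ((x >> 32) & 0x00000000FFFFFFFFULL);
    return (unsigned)x;
}
```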
I don't know if compilers are able to detect this and compile it down to a single instruction, though.