I found the branch prediction claim a bit weak without perfmon data to support it. The references point out that on Intel the 'popcnt %r1, %r2' instruction has a false dependency on its output register, and neither gcc nor clang seems to know about that and adjust register allocation accordingly. The usual explanation is that the chip handles popcnt in the same scheduling bucket as instructions like 'add', where the destination really is an input. So a popcnt that keeps writing the same register inside a tight loop (e.g. 'popcnt %rax, %rax') forms a loop-carried dependency chain, which prevents the processor from overlapping iterations, in effect from unrolling the loop in hardware. As a result the code's performance is highly sensitive to seemingly random changes, because the compiler occasionally gets lucky with its register choice and breaks the chain.
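For what it's worth, the workaround I've usually seen is to zero the destination register right before the popcnt, since the chip recognizes 'xor reg, reg' as a dependency-breaking idiom. Here is a rough sketch, assuming gcc or clang on x86-64 and a CPU that supports popcnt. The names popcnt_nodep and count_bits are made up for illustration, and I haven't benchmarked this, so treat it as a starting point rather than a fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: popcount of one 64-bit word.
   On x86-64 (gcc/clang) the destination is zeroed first so that popcnt
   does not have to wait on whatever last wrote that register. */
static inline uint64_t popcnt_nodep(uint64_t x)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t r;
    __asm__("xorl %k0, %k0\n\t"   /* zeroing idiom on the 32-bit half clears all 64 bits */
            "popcntq %1, %0"
            : "=&r"(r)            /* early clobber: r must not share a register with x */
            : "r"(x)
            : "cc");
    return r;
#else
    /* Portable fallback (assumption: only here so the sketch builds elsewhere). */
    uint64_t r = 0;
    while (x) {
        x &= x - 1;               /* clear the lowest set bit */
        r++;
    }
    return r;
#endif
}

/* Hypothetical tight loop of the kind discussed above. */
uint64_t count_bits(const uint64_t *buf, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcnt_nodep(buf[i]);
    return total;
}
```

The early-clobber constraint matters here: without it the compiler could assign the input and output to the same register, and the xor would wipe the input before popcnt reads it.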