
For those unaware: *do not use this on modern systems.*

Normal for-loops are much faster than they were when Duff's Device was invented, since they take advantage of modern branch prediction.
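
For reference, here's the classic construct next to the straightforward loop the parent is talking about. A minimal sketch: the function names are mine, and like the original it assumes count > 0.

    #include <stddef.h>

    /* Classic Duff's Device: an 8x-unrolled copy loop where the switch
       jumps into the middle of the body to handle count % 8. */
    void copy_duff(char *to, const char *from, size_t count)
    {
        size_t n = (count + 7) / 8;   /* passes through the unrolled body */
        switch (count % 8) {
        case 0: do { *to++ = *from++;
        case 7:      *to++ = *from++;
        case 6:      *to++ = *from++;
        case 5:      *to++ = *from++;
        case 4:      *to++ = *from++;
        case 3:      *to++ = *from++;
        case 2:      *to++ = *from++;
        case 1:      *to++ = *from++;
                } while (--n > 0);
        }
    }

    /* The plain loop a modern compiler has no trouble with. */
    void copy_plain(char *to, const char *from, size_t count)
    {
        for (size_t i = 0; i < count; i++)
            to[i] = from[i];
    }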



That depends on how large the misprediction penalty is and how many times the branch executes. Branch prediction isn't magic. It never makes branches beneficial. The best it can do, with perfect accuracy, is bring the amortized misprediction penalty down to zero. The key is that Duff's Device also relies on branching, and it's a less predictable branch that's more likely to be mispredicted. On the other hand, one BP miss is still cheaper than several hits plus one miss at the end. On the other other hand, there are issues to consider like code size and cache turnover (favors a loop), handling of special cases where larger loads/stores can be used (favors DD), processor extensions, etc. If you're really trying to write an optimal memcpy-like function, you'll probably end up using a hybrid of several approaches - including Duff's Device.
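
A rough sketch of the kind of hybrid that comment describes: wide copies for the bulk, then a small fall-through switch (a degenerate Duff's Device) for the tail. The name is mine, and it assumes a memcpy of a constant 8 bytes compiles down to a single unaligned load/store pair, which common compilers do but isn't guaranteed here.

    #include <stddef.h>
    #include <string.h>

    void copy_hybrid(void *dst, const void *src, size_t count)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* Bulk: one easily-predicted branch per 8 bytes copied. */
        while (count >= 8) {
            memcpy(d, s, 8);
            d += 8; s += 8; count -= 8;
        }

        /* Tail: fall through to copy the remaining 0-7 bytes. */
        switch (count) {
        case 7: *d++ = *s++; /* fall through */
        case 6: *d++ = *s++; /* fall through */
        case 5: *d++ = *s++; /* fall through */
        case 4: *d++ = *s++; /* fall through */
        case 3: *d++ = *s++; /* fall through */
        case 2: *d++ = *s++; /* fall through */
        case 1: *d++ = *s++; /* fall through */
        case 0: break;
        }
    }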


Another thing to note is that loop unrolling can be harmful to performance, since the unrolled code takes up more instruction cache. Beyond branch prediction, modern compilers can also convert loops into vector/SIMD instructions, among other magic.
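
For example, a loop as plain as this is typically auto-vectorized by gcc or clang at -O3 into SIMD loads/stores, with no manual unrolling at all (an illustration of typical compiler behaviour, not a guarantee):

    #include <stddef.h>

    /* Simple, countable, restrict-qualified so the pointers can't alias:
       exactly the shape auto-vectorizers handle well. */
    void add_arrays(float *restrict dst, const float *restrict a,
                    const float *restrict b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }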

Duff's Device is still useful for creating coroutines though; it's a handy way of yielding, then later returning to the yield point.
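
A minimal sketch of that coroutine trick, in the style of Simon Tatham's "Coroutines in C" and protothreads. The CR_* macro names are mine, not a standard API, and the state lives in statics (or it could live in a passed-in struct) because locals don't survive across yields.

    #include <stdio.h>

    #define CR_BEGIN(state)    switch (state) { case 0:
    #define CR_YIELD(state, x) do { state = __LINE__; return (x); case __LINE__:; } while (0)
    #define CR_END(state)      } state = 0; return -1

    /* Yields 1, 2, 3, ... one value per call, resuming after the yield point. */
    int counter(void)
    {
        static int state = 0;
        static int i;

        CR_BEGIN(state);
        for (i = 1; ; i++)
            CR_YIELD(state, i);
        CR_END(state);
    }

    int main(void)
    {
        for (int k = 0; k < 5; k++)
            printf("%d\n", counter());   /* prints 1 2 3 4 5 */
        return 0;
    }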



