Seems a bit funny that CRC32 is so slow.

nigeltao · on March 16, 2019

The comparison is to SMHasher's CRC32 implementation, which is simple but slow, forked from zlib some time ago.

SIMD CRC32 implementations can be more than 14x faster than recent zlib's.

I dropped some details as comments on the original blog post.

corysama · on March 15, 2019

I expect that's because it was designed very well for the days of serial, in-order, static execution. But, CPUs today are all about pipelining and parallelism. Herb Sutter's "Free Lunch" has been over for longer than most people here have been writing software. http://www.gotw.ca/publications/concurrency-ddj.htm

mark-r · on March 15, 2019

Sutter's solution was to write more concurrent code, but that's not what helps in this case. Rather the CPU vendors have been throwing transistors at doing more math operations in parallel. All you need to do is make sure you're taking advantage of the vector instructions, whether that's done by the compiler or with intrinsic calls.

Traditionally the fastest CRC code has used lookup tables, I wonder if that's creating cache pressure these days? Or maybe that approach was abandoned while I wasn't paying attention.

dragontamer · on March 15, 2019

> Rather the CPU vendors have been throwing transistors at doing more math operations in parallel. All you need to do is make sure you're taking advantage of the vector instructions, whether that's done by the compiler or with intrinsic calls.

Instruction-level parallelism is incredibly important. I'd say that any optimizing programmer needs to fully understand ILP, and how it interacts with pipelines and dependency cutting (and register renaming).

Modern CPUs are extremely well parallelized with ILP. Any good, modern hash function will take advantage of this feature of modern CPUs.

Case in point, it seems like xxhash is SCALAR 32-bit / 64-bit code. No vectorization as far as I can tell, its purely using ILP to get its speed.

Intel Assembly has a 64-bit multiplier (but vectorized only has 32-bit multipliers). I've theorized to myself that this 64-bit multiplier could lead to better mixing than the vectorized instructions, and it seems like xxhash goes for that.

The 32-bit version of xxhash can likely be vectorized and optimized even further.

corysama · on March 15, 2019

https://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-...

rurban · on March 15, 2019

Nope. The fastest crc code now uses the intrinsics and some parallel execution tricks, see crclib from Cloudflare.

corysama · on March 17, 2019

Where do I find this library? Googling “crclib from Cloudflare” without quotes returns your comment as the top hit :P

rurban · on March 17, 2019

I have them here:

https://github.com/cloudflare/zlib.git

https://github.com/rurban/crcutil.git

woadwarrior01 · on March 15, 2019

The OP probably finds it funny because Intel processors since Nehalem have had an instruction for computing CRC32.

pcwalton · on March 15, 2019

Though, sadly, the CRC32 instruction computes the wrong polynomial for Zip and most other formats. :(

Twirrim · on March 15, 2019

Where is the disparity between the two coming from?

nigeltao · on March 16, 2019

There are two widely used polynomials, IEEE and Castagnoli. These parameterize the CRC-32 algorithm, and are sometimes known as crc32 and crc32c.

The IEEE polynomial is used by Bzip2, Ethernet (IEEE 802.3), Gzip, MPEG-2, PNG, SATA, Zip and other formats.

The Castagnoli polynomial is used by Btrfs, Ext4, iSCSI, SCTP and other formats.

For better or worse, the hardware instruction computes CRC-32-Castagnoli, which means it's not relevant for e.g. Zip.

Twirrim · on March 16, 2019

Ahh interesting. Thank you.

Is there any particular advantage of one polynomial over the other?

nigeltao · on March 16, 2019

https://en.wikipedia.org/wiki/Cyclic_redundancy_check#Standa... suggests that Castagnoli has better "error detection capacity". What that means exactly (in terms of information theory) and whether that difference matters in practice is getting beyond my expertise level. In any case, existing formats like Zip are what they are, and as I mentioned in the blog post comments, it is still possible to SIMD-accelerate the IEEE flavor.

kevin_thibedeau · on March 15, 2019

It still should be vectorizable without the dedicated instruction. That isn't evident here so the comparison is questionable.