Another contender is zstd: https://github.com/facebook/zstd. It typically offers better compression ratios than LZ4 at a slight (depending on your data) cost in speed. It also offers a training mode that tunes the algorithm to increase the compression ratio on specific types of data, which is particularly useful for compressing small pieces of data.
That table shows zstd comparing poorly at its faster settings, but at slower settings it offers a significantly better compression ratio, albeit with 3x slower decompression.
Zstd is very different - it includes an entropy coder.
LZ4 only finds repeated matches, but then doesn't encode them very efficiently.
To put it simplistically, if you have a file which is a (good) random mix of an equal number of A and B characters, LZ4 won't be able to compress it significantly, while Zstd will compress it 8:1, converging to an encoding where a '1' bit is A and a '0' bit is B.
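A rough sketch of where the 8:1 figure comes from. This is not zstd's actual entropy stage (which uses Huffman and FSE coding); it is a hand-rolled illustration that measures the Shannon entropy of such a file and packs it by hand, one bit per character. The helper names `entropy_bits_per_symbol` and `pack_ab` are made up for this example.

```python
import math
import random
from collections import Counter


def entropy_bits_per_symbol(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def pack_ab(data: bytes) -> bytes:
    """Encode 'A' as a 1 bit and 'B' as a 0 bit, eight characters per output byte."""
    out = bytearray()
    for i in range(0, len(data), 8):
        chunk = data[i:i + 8]
        byte = 0
        for ch in chunk:
            byte = (byte << 1) | (1 if ch == ord("A") else 0)
        byte <<= 8 - len(chunk)  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)


# Build a shuffled file with exactly as many A's as B's.
random.seed(0)
symbols = [ord("A")] * 4000 + [ord("B")] * 4000
random.shuffle(symbols)
data = bytes(symbols)

packed = pack_ab(data)
print(entropy_bits_per_symbol(data))  # bits of information per 8-bit character
print(len(data), len(packed))         # input size vs. bit-packed size
```

With only two equally likely symbols, each 8-bit character carries one bit of information, so one bit per character is the best any order-0 entropy coder can do on this input.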
> To put it simplistically, if you have a file which is a (good) random mix of an equal number of A and B characters, LZ4 won't be able to compress it significantly
I checked it. LZ4 still reduces the size by half; no idea why half. So a 10 MB file compresses to 5 MB.
Edit: I checked with the highest compression setting, and it compresses a 1 MB file to 185 KB. So what the parent wrote is false.
Yes, if I take the 8 combinations (aaa, aab, aba, etc.) and assign each of them a 9-bit codeword, I replace each 24-bit sequence with a 9-bit sequence. So arithmetic coders have no problem with cases like this.
Actually, I seem to recall that he was working on it before getting hired by Facebook (unless there was a massive delay before the hiring became known). I was following his excellent blog posts on the matter at the time.
Yes, it was a fully working thing before Facebook. There has been a lot of improvement in both the core and the CLI, but the core innovations of zstd were well established before Facebook. I was probably following his blog (even though I wasn't a compression expert) for months before I saw the post about his joining Facebook.
I've spent an afternoon testing zstd's custom dictionaries. They really only provide benefits on small data blocks. According to my tests, the largest block size at which custom dictionaries still provide a benefit is 8K; above that, the compression ratio advantage over the default is definitely gone.
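To illustrate why a dictionary helps on small blocks, here is a minimal sketch that uses zlib's preset-dictionary support from the Python standard library as a stand-in. It is not zstd and involves no training step: the `SHARED_DICT` bytes and the sample record are invented for the example, and the sizes it prints say nothing about zstd's actual ratios.

```python
import zlib
from typing import Optional

# Hypothetical shared dictionary: byte strings we expect to recur across records.
SHARED_DICT = b'{"user_id": , "status": "active", "country": "US", "plan": "free"}'


def compress(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate one block, optionally primed with a preset dictionary."""
    if zdict is None:
        c = zlib.compressobj(level=9)
    else:
        c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inflate one block; the dictionary must be the one used to compress it."""
    if zdict is None:
        d = zlib.decompressobj()
    else:
        d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


# A single small record: too short to contain much internal repetition.
record = b'{"user_id": 48213, "status": "active", "country": "US", "plan": "free"}'

plain = compress(record)
with_dict = compress(record, SHARED_DICT)

print(len(record), len(plain), len(with_dict))
```

A small block has almost no history of its own for the compressor to match against, so the dictionary supplies that history up front. On large blocks the data itself quickly provides it, which fits the observation that the advantage fades. Note that decompression needs the exact same dictionary bytes that were used for compression.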
There should be a way to pool standard dictionaries somewhere, such as a "standard english text corpus data" dictionary, that you can then download on demand for encoding, say, BLOB text fields in a database with little to no overhead.
Without such a facility, though, the way this would probably work in, say, a database is that the dictionary is maintained internally, constructed on the fly from the field data, and not exposed to users. Although I don't know whether you'd have to keep every version of the dictionary in order to successfully decompress old data. If so, then perhaps this is a niche feature.
W.r.t. standard dictionaries, it's something we're interested in, but the fundamental reality of dictionaries is that their effectiveness is strongly tied to their specificity. Put another way, a universal dictionary is a self-contradiction.
And yes, totally, I know at least RocksDB supports exactly that behavior [0].