Regarding lossless text compression, does anyone know how a simple way to compre...

maccard · on Dec 30, 2024

I ended up with a similar problem. We replaced the data with a simple binary serialization format, gzipped that, and then base64-encoded the gzipped data. It’s far from perfect, but it was a 250x saving in our case, taking it from “stupidly large” to “we don’t care” with an hour’s work.
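A minimal sketch of that pipeline (binary serialization, then gzip, then base64), using only the Python standard library. The record layout and the names `RECORD`, `encode`, and `decode` are hypothetical, chosen for illustration; the comment does not say which serialization format was actually used.

```python
import base64
import gzip
import struct

# Hypothetical fixed-width record: (sensor_id: uint32, timestamp: uint32, value: float32).
RECORD = struct.Struct("<IIf")

def encode(records):
    """Pack records to bytes, gzip them, and return base64 text safe for a text column."""
    raw = b"".join(RECORD.pack(*r) for r in records)
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def decode(text):
    """Reverse of encode: base64-decode, gunzip, and unpack each fixed-width record."""
    raw = gzip.decompress(base64.b64decode(text))
    return [RECORD.unpack_from(raw, off) for off in range(0, len(raw), RECORD.size)]
```

Base64 adds roughly a third back onto the compressed size, so storing the gzipped bytes in a binary column avoids that overhead when the storage layer allows it.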

coder543 · on Dec 30, 2024

For compressing short (<100 bytes), repetitive strings, you could potentially train a zstd dictionary on your dataset, and then use that same dictionary for all rows. Of course, you’d want to disable several zstd defaults, like outputting the zstd header, since every single byte counts for short string compression.

ianburrell · on Dec 30, 2024

Postgres supports TOAST (long record) compression. It seems it can be enabled per column. It looks like it supports LZ4 and Zstd now. Zstd compresses better at the expense of more time.

Too · on Dec 31, 2024

Two similar approaches that exploit the fact that large portions of the text in a record are static.

Using CLP: https://www.uber.com/blog/reducing-logging-cost-by-two-order...

https://messagetemplates.org/

Rewrite the log to extract and group by common identifiers:

https://bit.kevinslin.com/p/lossless-log-aggregation
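The shared idea behind these approaches, storing the static text once and keeping only the variable parts per record, can be sketched as follows. This is a toy illustration, not the actual algorithm used by CLP or the linked posts; it assumes that variable fields are runs of digits and that the NUL byte never appears in a log line. The names `encode_line` and `decode_line` are hypothetical.

```python
import re

# Assumption: variables are runs of digits; real systems use richer rules.
VAR = re.compile(r"\d+")
PLACEHOLDER = "\x00"  # assumed never to occur in the log text itself

def encode_line(line, templates):
    """Return (template_id, variables); `templates` maps template text -> id."""
    variables = VAR.findall(line)  # kept as strings so leading zeros survive
    template = VAR.sub(PLACEHOLDER, line)
    template_id = templates.setdefault(template, len(templates))
    return template_id, variables

def decode_line(template_id, variables, templates_by_id):
    """Rebuild the original line by interleaving static parts and variables."""
    parts = templates_by_id[template_id].split(PLACEHOLDER)
    out = [parts[0]]
    for var, part in zip(variables, parts[1:]):
        out.append(var)
        out.append(part)
    return "".join(out)
```

Each distinct template is stored once, and every row then holds only a small integer id plus its variables, which also makes the variables easy to group or query on.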

Tostino · on Dec 30, 2024

TOAST compression is likely your best option for that data. You may need to lower the TOAST data size threshold for that column.

brody_hamer · on Dec 30, 2024

I haven’t played around with it too much myself, but I remember reading that gzip (or at least Python’s compatible zlib library) supports a “seed dictionary” of expected fragments.

I gather that you’d supply the same “seed” during both compression and decompression, and this would reduce the amount of information embedded into the compressed result.
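In Python's standard `zlib` module this is the `zdict` argument to `zlib.compressobj` and `zlib.decompressobj`. A minimal sketch follows; the `SEED` contents, the sample `row`, and the helper names are made up for illustration, and a real seed would be built from fragments that actually recur in your data.

```python
import zlib

# Hypothetical seed dictionary: fragments expected to appear in most rows.
SEED = b'{"level":"info","service":"checkout","message":"'

def compress_row(row: bytes) -> bytes:
    """Compress one row with the shared seed dictionary."""
    c = zlib.compressobj(level=9, zdict=SEED)
    return c.compress(row) + c.flush()

def decompress_row(blob: bytes) -> bytes:
    """Decompress one row; the same seed must be supplied or decoding fails."""
    d = zlib.decompressobj(zdict=SEED)
    return d.decompress(blob) + d.flush()

row = b'{"level":"info","service":"checkout","message":"order placed"}'
assert decompress_row(compress_row(row)) == row
```

The seed itself is never stored in the output, so it has to be versioned and kept alongside the data: rows compressed with one seed cannot be decoded with another.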

duskwuff · on Dec 31, 2024

Many other compression libraries, like zstd, support functionality along those lines. For that matter, brotli's big party trick is having a built-in dictionary, tuned for web content.

It's easy to implement in LZ-style compressors - it amounts to injecting the dictionary as context, as if it had been previously output by the decompressor. (There's a striking parallel to how LLM prompting works.)
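To make the "dictionary as prior context" idea concrete, here is a toy LZ77-style coder in which both sides pre-load their history buffer with the dictionary. It is a teaching sketch under simplifying assumptions (brute-force match search, tokens kept as Python tuples rather than packed bits), not how zlib, zstd, or brotli are implemented; `lz_compress` and `lz_decompress` are hypothetical names.

```python
def lz_compress(data: bytes, dictionary: bytes = b"", min_match: int = 3):
    """Greedy toy LZ77: emit ('lit', byte) or ('copy', distance, length) tokens."""
    history = bytearray(dictionary)  # dictionary injected as if already seen
    start = len(history)
    history += data
    tokens = []
    i = start
    while i < len(history):
        best_len, best_dist = 0, 0
        for j in range(i):  # brute force; real compressors use hash chains
            k = 0
            while i + k < len(history) and history[j + k] == history[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            tokens.append(("copy", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", history[i]))
            i += 1
    return tokens

def lz_decompress(tokens, dictionary: bytes = b"") -> bytes:
    """Replay tokens on top of the same dictionary, then drop the dictionary."""
    out = bytearray(dictionary)  # same context, as if previously output
    start = len(out)
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-wise copy permits overlapping matches
    return bytes(out[start:])
```

Because back-references may point into the pre-loaded bytes, a short input that shares fragments with the dictionary turns into a handful of copy tokens instead of a long run of literals.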