Regarding lossless text compression, does anyone know how a simple way to compress repetitive JSON(B) data in a regular Postgres table? Ideally I would use columnar compression [1], but I'm limited to the extensions supported by Google Cloud SQL [2].
Since my JSON(B) data is fairly repetitive, my bet would be to store some sort of JSON schema in a parent table. I'm storing the response body from a API call to a third-party API, so normalizing it by hand is probably out of the question.
I wonder if Avro can be helpful for storing the JSON schema. Even if I had to create custom PL/SQL functions for my top 10 JSON schemas it would be ok, since the data is growing very quickly and I imagine it could be compressed at least 10x compared to regular JSON or JSONB columns.
I ended up with a similar problem. We replaced the data with a simple binary serialization format, gzip’ed that, and then base64 encoded the gzipped data. It’s far from perfect but it was 250x saving in our case making it go from “stupidly large” to “we don’t care” with an hours work.
For compressing short (<100 bytes), repetitive strings, you could potentially train a zstd dictionary on your dataset, and then use that same dictionary for all rows. Of course, you’d want to disable several zstd defaults, like outputting the zstd header, since every single byte counts for short string compression.
Postgres supports toast (long record) compression. It seems to support enabling on columns. It looks like it supports LZ4 and Zstd now. Zstd has better compression at expense of more time.
I haven’t played around with it too much myself, but I remember reading that gzip (or at least python’s compatible zlib library) supports a “seed dictionary” of expected fragments”.
I gather that you’d supply the same “seed” during both compression and decompression, and this would reduce the amount of information embedded into the compressed result.
Many other compression libraries, like zstd, support functionality along those lines. For that matter, brotli's big party trick is having a built-in dictionary, tuned for web content.
It's easy to implement in LZ-style compressors - it amounts to injecting the dictionary as context, as if it had been previously output by the decompressor. (There's a striking parallel to how LLM prompting works.)
Since my JSON(B) data is fairly repetitive, my bet would be to store some sort of JSON schema in a parent table. I'm storing the response body from a API call to a third-party API, so normalizing it by hand is probably out of the question.
I wonder if Avro can be helpful for storing the JSON schema. Even if I had to create custom PL/SQL functions for my top 10 JSON schemas it would be ok, since the data is growing very quickly and I imagine it could be compressed at least 10x compared to regular JSON or JSONB columns.
[1] https://github.com/citusdata/citus?tab=readme-ov-file#creati... [2] https://cloud.google.com/sql/docs/postgres/extensions