> We introduced BtrBlocks, an open columnar compression format for data lakes. By analyzing a collection of real-world datasets, we selected a pool of fast encoding schemes for this use case. Additionally, we introduced Pseudodecimal Encoding, a novel compression scheme for floating-point numbers. Using our sample-based compression scheme selection algorithm and our generic framework for cascading compression, we showed that, compared to existing data lake formats, BtrBlocks achieves a high compression factor, competitive compression speed and superior decompression performance. BtrBlocks is open source and available at https://github.com/maxi-k/btrblocks.
It is interesting, and I'd love to look over some detailed benchmarks on the differences. Storing floats as integers overcomes several of their challenges. The example of dollar units would be a good candidate for a short delta compression.
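To make the dollar example concrete, here is a rough Python sketch of what I mean. It is only my own illustration and not how BtrBlocks or its Pseudodecimal Encoding works; the helper names (`to_cents`, `delta_encode`, `delta_decode`) are made up, and it assumes the amounts arrive as decimal strings with at most two fractional digits.

```python
from decimal import Decimal


def to_cents(prices):
    """Convert decimal dollar strings to exact integer cents (hypothetical helper)."""
    return [int(Decimal(p) * 100) for p in prices]


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


prices = ["19.99", "20.49", "20.49", "21.00"]
cents = to_cents(prices)      # [1999, 2049, 2049, 2100]
deltas = delta_encode(cents)  # [1999, 50, 0, 51]
assert delta_decode(deltas) == cents
```

The deltas only stay small when neighbouring values are close, so how well this works depends heavily on the column keeping a favourable sort order.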
I doubt I'd ever use columnar compression again, as I found it too difficult to fight DBAs on keeping the original sorting and schema preserved in an optimal way. I do find it really interesting, though.