
> CONCLUSION

> We introduced BtrBlocks, an open columnar compression format for data lakes. By analyzing a collection of real-world datasets, we selected a pool of fast encoding schemes for this use case. Additionally, we introduced Pseudodecimal Encoding, a novel compression scheme for floating-point numbers. Using our sample-based compression scheme selection algorithm and our generic framework for cascading compression, we showed that, compared to existing data lake formats, BtrBlocks achieves a high compression factor, competitive compression speed and superior decompression performance. BtrBlocks is open source and available at https://github.com/maxi-k/btrblocks.



It is interesting, and I'd love to look over some detailed benchmarks of the differences. Storing floats as integers overcomes several of their challenges. The example of dollar units would be a good candidate for a short delta compression.
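
To make the dollar example concrete, here is a rough Python sketch of the idea: scale the amounts to integer cents, then store only the differences between neighbouring values. This is a generic fixed-point plus delta scheme, not BtrBlocks' Pseudodecimal Encoding, and the function names and sample prices are made up for illustration. It assumes every value has at most two decimal places.

```python
from itertools import accumulate


def to_cents(prices):
    """Scale dollar amounts to integer cents (assumes at most 2 decimal places)."""
    return [round(p * 100) for p in prices]


def delta_encode(values):
    """Keep the first value as-is, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original integers with a running sum over the deltas."""
    return list(accumulate(deltas))


def from_cents(cents):
    """Convert integer cents back to dollar floats."""
    return [c / 100 for c in cents]


# Hypothetical sample column of prices in dollars.
prices = [19.99, 20.49, 20.99, 21.25]

cents = to_cents(prices)        # [1999, 2049, 2099, 2125]
deltas = delta_encode(cents)    # [1999, 50, 50, 26]
restored = from_cents(delta_decode(deltas))
```

After the first entry the deltas are small integers, which is what makes them cheap to bit-pack. The catch is that deltas only stay small when neighbouring rows hold similar values, so the benefit depends on the column's sort order.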

I doubt I'd ever use columnar compression again, as I found it too difficult to fight DBAs over keeping the original sort order and schema preserved in an optimal way. I do find it really interesting, though.



