I am having trouble understanding how any organization could ever need a collection of logs larger than the entire Internet Archive. 100 PB is staggering, and the idea of filling that with logs, while entirely possible, just seems pointless given the cost of managing that kind of data.
This is on a technical level quite impressive though, don't get me wrong, I just don't understand the use case.
These are probably order and trade logs. You want to have them, and you need them for auditing; presumably Binance wants to be more professional in that respect. HFT is generating billions of orders per day per trader.
OK, so let's do some napkin math... I'm guessing something like this is the information you might want to log:
user ID: 128 bits
timestamp: 96 bits
IP address: 32 bits
coin type: idk, 32 bits? how many fake internet money types can there be?
price: 32 bits
quantity: 32 bits
So in total we have 352 bits. Now let's double it for the lulz, so 704 bits, why not. You know what, let's just round up to 1024 bits: each trade is 128 bytes, a nice round number.
That means 200 PB (2e17 bytes, mind you) is enough to store 1.5625e15 trades. If all the traders are doing 1e9 trades/day, and we assume this dataset covers 13 months (~403 days), that means there are roughly 3877 HFT traders all simultaneously making 11574 trades per second. That seems like a lot.
In other words, that means Binance is processing about 44.9 million orders per second. Are they, though?
EDIT: No; some googling indicates they claim they can process something like 1.4 million TPS, but I'd hazard a guess the actual figure on average is lower.
EDIT: err, sorry, that should have been 100 PB. Divide all those numbers by two: roughly 22 million orders per second, which is still more than an order of magnitude worth of absurd.
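If anyone wants to double-check the arithmetic, here's the same napkin math as a few lines of Python (same rough assumptions as above: 128 bytes per trade, 1e9 trades/day per trader, ~13 months of data, and the corrected 100 PB figure):

    # Same napkin math as above. All inputs are rough assumptions, not Binance's real numbers.
    BYTES_PER_TRADE = 128             # padded trade record from the field list above
    DATASET_BYTES = 100e15            # 100 PB, the corrected figure
    TRADES_PER_TRADER_PER_DAY = 1e9
    DAYS = 13 * 31                    # ~13 months

    total_trades = DATASET_BYTES / BYTES_PER_TRADE
    traders = total_trades / (TRADES_PER_TRADER_PER_DAY * DAYS)
    orders_per_second = total_trades / (DAYS * 86_400)

    print(f"total trades:        {total_trades:.4e}")        # ~7.8e14
    print(f"implied HFT traders: {traders:,.0f}")            # roughly 1,939
    print(f"orders per second:   {orders_per_second:,.0f}")  # roughly 22.4 million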
The only thing I can think of is that they are collecting every single line of log data from every single production server with absolutely zero expiration so that they can backtrack any future attack with precision, maybe even finding the original breach.
That's the only actual use case I can think of for something like this, which makes sense for a cryptocurrency exchange that is certainly expecting to get hacked at some point.
Same; I'd also love to know more about the technical details of their logging format, the on-disk storage format, and why they were only able to reduce the storage size to 20% of the uncompressed size. For example, clp[1] can achieve much, much better compression on log data.
Exactly! Which is again one of the reasons it's confusing that people apply full-text search technology to logs. Machine logs are a lot less entropic than human prose, and can therefore be compressed a whole lot better. A corollary is that, because of the redundancy in the data, "grepping" the compressed form can be very fast, so long as the compression scheme allows it.
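To make that concrete, here's a toy sketch of the template-plus-variables idea that log-aware compressors are built around. This is just the general idea, not clp's actual format, and the log lines and field names are made up:

    # Toy sketch of template-based log storage: each line is split into a
    # static template plus its variable fields. The template dictionary stays
    # tiny, so a "grep" on the static text only has to scan templates, and
    # only matching groups ever get reconstructed. Illustration only.
    import re

    NUM = re.compile(r"\d+(?:\.\d+)?")
    PLACEHOLDER = "\x11"

    templates = {}   # template text -> template id
    columns = {}     # template id -> list of variable-field rows

    def ingest(line):
        tpl = NUM.sub(PLACEHOLDER, line)
        tid = templates.setdefault(tpl, len(templates))
        columns.setdefault(tid, []).append(NUM.findall(line))

    def grep_static(substring):
        # Decide which groups match by looking only at the template dictionary.
        for tpl, tid in templates.items():
            if substring in tpl:
                for row in columns[tid]:
                    it = iter(row)
                    yield re.sub(PLACEHOLDER, lambda _: next(it), tpl)

    for line in [
        "INFO order accepted user=42 qty=7 px=0.0931",
        "INFO order accepted user=99 qty=3 px=0.8800",
        "WARN order rejected user=42 reason_code=12",
    ]:
        ingest(line)

    print(len(templates), "templates for", sum(len(v) for v in columns.values()), "lines")
    for hit in grep_static("rejected"):
        print(hit)

On real machine logs the template dictionary stays tiny relative to the data, which is where both the compression win and the cheap filtering come from.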
If the query infrastructure operating on this compressed data can itself store intermediate results, then we've killed two birds with one stone, because we've also gotten rid of the restrictive query language. That's how cascading MapReduce jobs (or Spark) do it, allowing users to perform complex analyses that are entirely off the table if they're restricted to the Lucene query language. Imagine a world where your SQL database was one giant table and only allowed you to query it with SELECT. That's pretty limiting, right?
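For a flavor of what keeping intermediate results buys you, here's a rough PySpark sketch: a per-user aggregation that's then filtered and ranked, i.e. an aggregation over an aggregation. The column names and input path are invented for illustration; the point is just that nothing like this fits into a single Lucene-style query:

    # Hypothetical multi-stage analysis over parsed trade logs; column names
    # and the input path are made up for illustration.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("log-analysis-sketch").getOrCreate()
    logs = spark.read.parquet("s3://example-bucket/trade-logs/")  # hypothetical path

    # Stage 1: per-user totals and rejection counts.
    per_user = logs.groupBy("user_id").agg(
        F.count("*").alias("orders"),
        F.sum(F.when(F.col("status") == "REJECTED", 1).otherwise(0)).alias("rejected"),
    )

    # Stage 2: reuse that intermediate result to rank active users by rejection rate.
    suspicious = (
        per_user.filter(F.col("orders") > 10_000)
                .withColumn("reject_rate", F.col("rejected") / F.col("orders"))
                .orderBy(F.col("reject_rate").desc())
                .limit(100)
    )

    suspicious.show()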
So as a technology demonstration of Quickwit this seems really cool (it can clearly scale!), but it's also kind of an indictment of Binance (and all the other companies out there doing ELKish things).