Fun read. I haven't used Java in a long time, so even the "idiomatic Java code that would pass muster with any seasoned Java developer" was a bit shocking.
I know it's a cliche that someone brings up Rust in any programming thread, but I can't help myself. ;-) Several of the techniques here are easy enough in Rust that I just do them routinely.
* Rayon makes it easy to process line-wise in reasonable-sized chunks without allocating a string per line.
* The standard integer parsing stuff takes a slice, not an owned string.
* The standard hash map doesn't have the same expectation "that we pass in an instance of the key class that is equal to an existing instance"; you can call HashMap<String>::get with a &str. (One limitation is that std::collections::HashMap::entry does expect a K, so there's some redundancy if you need to insert afterward, but you could drop to hashbrown::HashMap::raw_entry_mut to avoid that.)
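To make the last two bullets concrete, here's a minimal sketch (the station/reading split is just a placeholder example, not taken from any particular solution):

```rust
use std::collections::HashMap;

fn main() {
    // Parse a reading straight from borrowed slices of the line;
    // no owned String is ever allocated for the lookup.
    let line = "Hamburg;12.0";
    let (station, reading) = line.split_once(';').unwrap();
    let value: f64 = reading.parse().unwrap();

    // The map owns String keys, but get/get_mut accept a &str
    // because String implements Borrow<str>.
    let mut totals: HashMap<String, f64> = HashMap::new();
    totals.insert(station.to_string(), 0.0);
    if let Some(total) = totals.get_mut(station) {
        *total += value;
    }
}
```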
I actually wrote a Rust version as well, and yes, it was far easier to write, with far less code (although not incorporating all the tricks), completely safe, and pretty fast -- but still 2x slower than my end result in Java.
HashMap was quite a bottleneck in Rust as well, for many reasons, not just the key allocation problem. But it was very easy to implement the same kind of custom hashtable, which almost by default ends up having a better memory layout than in Java.
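For a sense of what "the same kind of custom hashtable" means, here is a stripped-down open-addressing sketch; the field names, probing scheme, and the "hash == 0 means empty" convention are simplified placeholders, not the actual entry:

```rust
// Entries live inline in one contiguous Vec, so a probe touches a single
// slot instead of chasing per-entry object pointers (the part that takes
// extra effort to reproduce in Java).
#[derive(Clone, Default)]
struct Slot {
    hash: u64,    // 0 means "empty" in this sketch (a real table needs more care)
    key: Vec<u8>, // station name
    min: i16,
    max: i16,
    sum: i64,
    count: u32,
}

struct Table {
    slots: Vec<Slot>, // length is a power of two
    mask: usize,
}

impl Table {
    fn new(capacity_pow2: usize) -> Self {
        Table { slots: vec![Slot::default(); capacity_pow2], mask: capacity_pow2 - 1 }
    }

    // Linear probing: the hash picks the starting slot, a matching hash and
    // key is a hit, and an empty slot means the key is new and gets claimed.
    fn slot_for(&mut self, key: &[u8], hash: u64) -> &mut Slot {
        let mut i = (hash as usize) & self.mask;
        loop {
            if self.slots[i].hash == 0 {
                self.slots[i].hash = hash;
                self.slots[i].key = key.to_vec();
                return &mut self.slots[i];
            }
            if self.slots[i].hash == hash && self.slots[i].key == key {
                return &mut self.slots[i];
            }
            i = (i + 1) & self.mask;
        }
    }
}
```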
> HashMap was quite a bottleneck in Rust as well, for many reasons, not just the key allocation problem.
I'd be curious to hear more. Other than the default hash function being quite expensive, and occasionally wanting to use the more expressive hashbrown API, I've been happy with Rust's std::collections::HashMap.
The reason the custom hashtable wins out isn't something generally applicable. For the very specific dataset used in the challenge, the hash function could be radically simplified, to just a single multiply and rotate left.
To be fair, I didn't try out the hashbrown API. Maybe that, together with FxHash, would have been enough for the official dataset with 97% of keys < 16 bytes. But, I was optimizing for the "10k" dataset as well, with a lot of longer keys, and hashing just the first 8 bytes was the winning idea.
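Roughly, the idea looks like this; the constant and the rotation count here are illustrative placeholders, not the exact values from the entry:

```rust
// Hash just the first (up to) 8 key bytes with one multiply and one rotate.
fn hash_prefix(key: &[u8]) -> u64 {
    // Load up to 8 leading bytes into a u64 (little-endian, zero-padded).
    let mut buf = [0u8; 8];
    let n = key.len().min(8);
    buf[..n].copy_from_slice(&key[..n]);
    let word = u64::from_le_bytes(buf);

    // One multiply by an odd constant, then a rotate to spread the
    // high-order bits back into the low bits used as the table index.
    word.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(29)
}
```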
I didn't bother to try, so I'm not sure. There would probably be some challenges; I don't see how I'd accomplish it without a branch in the custom hasher.
They do, because the standard library implementation was changed to use hashbrown. However, the standard library API has a slight limitation: there is no way to say "I have a reference to a string. If there's an entry for it, give me that. Otherwise clone the string and insert a new entry. Also, only do one key lookup."
You end up either doing two lookups or always cloning the string, even if it turns out you didn't need to.
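Concretely, with std you're stuck with one of these two shapes (a rough sketch; the counter payload is just for illustration):

```rust
use std::collections::HashMap;

// Workaround 1: probe first, insert on a miss. A miss costs two hash lookups.
fn bump_two_lookups(map: &mut HashMap<String, u32>, key: &str) {
    if let Some(v) = map.get_mut(key) {
        *v += 1;
    } else {
        map.insert(key.to_owned(), 1);
    }
}

// Workaround 2: a single lookup through the entry API, but the key is
// cloned up front even when the entry already exists.
fn bump_always_clone(map: &mut HashMap<String, u32>, key: &str) {
    *map.entry(key.to_owned()).or_insert(0) += 1;
}
```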
Cool, I see mfiguiere linked to my recent blog post! Let me share a few words about it...
I took part in the One Billion Row Challenge (1BRC). It was a lot of fun, but also a great learning experience. People came up with some pretty incredible optimization tricks. Put together they amount to a huge number, and they are all mingled up inside individual solutions. They also operate on many levels -- from quite high-level down to incredibly low and detailed.
In retrospect, I could see there was a good number of tricks that are relatively easy to grasp, and reusable in other projects. I felt the urge to do a writeup that captures this knowledge in one place, isolating and explaining each of the tricks.
Thank you for taking the time to write it up. Really nice to see some modern Java, as well as the commented ideas, many of which seem to generalize quite well beyond Java and the JVM.
What an excellent post, one of the best on 1BRC I've come across so far. Big shout-out to Marko for participating in the challenge, making a strong push for clarifying corner cases of the rules, and sharing his experiences in this amazing write-up!
Thanks for the praise Gunnar, but we all owe it to you for organizing it, and especially sticking through thick and thin when it took off, and needed lots of attention to evaluate everyone and maintain a level playing field!
The last part of the article raises an interesting question for me: what is the fastest, mostly fault-tolerant implementation that could be created? Say, something that runs on 10 different versions of the input, each of which has had up to 20 characters from the standard ASCII set inserted, replaced, or removed at random in the input file. So we'd mostly be looking for "data corruption" fault tolerance rather than hardening against malicious input. How much can you still "out optimize" the compiler and JVM if you can't throw away all the safety?
I wonder if using a trie instead of a hash would have provided a performance win.
If you're parsing the file row by row, iterating over the trie as you process each character (much as they argue for computing the int value, and which is what you have to do to hash the key anyway) should be similar. What you'd end up with is micro-architectural issues around cache performance.
There's a reason everybody uses hashing - if your data is trusted, it does very few operations.
Tries have to allocate, walk, and interpret multiple nodes. Perhaps not as bad as trees (though that depends on how many possible trie node representations there are, vs what the data density is at each level), but still worse than the non-colliding hash.
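To make the comparison concrete, here's a rough sketch of the per-byte trie walk being contrasted with a single hash probe (the node layout is illustrative, not from any submission):

```rust
// Every input byte costs a dependent load into a child table, so long keys
// touch many nodes and potentially many cache lines.
struct TrieNode {
    children: [Option<Box<TrieNode>>; 256], // one slot per possible next byte
    station_id: Option<u32>,                // set when a key ends at this node
}

fn lookup(root: &TrieNode, key: &[u8]) -> Option<u32> {
    let mut node = root;
    for &b in key {
        // One pointer chase per byte of the key.
        node = node.children[b as usize].as_deref()?;
    }
    node.station_id
}
```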
That said, with a finite dataset a perfect hash would probably beat a general hash.
It was known, but the requirement was that the program keep working for an arbitrary keyset that conforms to the specified rules (up to 10,000 unique keys, each up to 100 bytes in length).
Top solutions tend to use hashes that ignore many input bytes. Even if they did not, they would use hashes that consume 4 or 8 bytes of input at a time (you could do more if the language exposed aesenc).
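A word-at-a-time hash in that spirit looks roughly like this; the FNV-style constant and the mixing are just illustrative, not taken from any specific entry:

```rust
// Fold in 8 input bytes per step instead of one, with a scalar mix per word.
fn hash_wordwise(key: &[u8]) -> u64 {
    let mut h: u64 = 0;
    let mut chunks = key.chunks_exact(8);
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        h = (h ^ word).wrapping_mul(0x100000001B3); // one multiply per 8 bytes
    }
    // Handle the 0..7 trailing bytes with a zero-padded load.
    let rest = chunks.remainder();
    if !rest.is_empty() {
        let mut buf = [0u8; 8];
        buf[..rest.len()].copy_from_slice(rest);
        h = (h ^ u64::from_le_bytes(buf)).wrapping_mul(0x100000001B3);
    }
    h
}
```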
Interestingly enough, that was my first idea. But when you consider the tiny keyset size, it would be hard to beat two machine instructions to calculate the hash + a single array lookup.
There is one entry which uses a trie, but it's not at the top, IIRC (which may or may not be related to using a trie; there are many other relevant design decisions).
I must admit I only glanced at the article, and at past threads when the topic came up. How does this work, the speed of 1.7s I mean? Looking at the input, let's say the average entry is 16 bytes and there are a billion of them; that's ~16 GB. Average read speed of SSDs is what, ~500 MB/s? Scanning alone at max throughput would take half a minute. This must be relying on DDR4+ read speeds, which would probably come in under a second in certain cases. Is that what's going on, a RAM disk?
I used to benchmark a lot on an enterprise-grade SSD 10 years ago, and that was already at 2 GB/s. Today, even my laptop's SSD supports multiple GB/s.
But you're right about the contest -- each program was measured five times in a row, and there was enough RAM to fit the entire file into the page cache.
The best time using all 32 cores (64 hyperthreads) on the evaluation machine was 323 milliseconds.
Running without RAM cache would be a great followup to this challenge. I think a time around 2-3 seconds should be achievable. But, it would be highly sensitive to the hardware setup and how well the disk-reading code is placed on cores relative to the connection to the disk. Not sure what it would take to allow hundreds of contestants to benefit from the best arrangement.
This was the most approachable commentary on a 1BRC submission I've read thus far.
Up to the 6.6-second variant, I see the point of pushing the envelope of the programming environment and of hacks that might help optimize future builds of the language. But beyond that, the optimizations seem to be more about overcoming the inherent limitations of the environment, which are due to consciously made tradeoffs.
I feel that once you hit that point, we have reached the upper limit of the competition.