The Billion Row Challenge (1BRC) – Step-by-Step from 71s to 1.7s (questdb.io)
274 points by mfiguiere on Feb 22, 2024 | 41 comments


Fun read. I haven't used Java in a long time, so even the "idiomatic Java code that would pass muster with any seasoned Java developer" was a bit shocking.

I know it's a cliche that someone brings up Rust in any programming thread, but I can't help myself. ;-) Several of the techniques here are easy enough in Rust that I just do them routinely.

* Rayon makes it easy to process line-wise in reasonable-sized chunks without allocating a string per line.

* The standard integer parsing stuff takes a slice, not an owned string.

* The standard hash map doesn't have the same expectation "that we pass in an instance of the key class that is equal to an existing instance"; you can call HashMap<String>::get with a &str. (One limitation is that std::collections::HashMap::entry does expect a K, so there's some redundancy if you need to insert afterward, but you could drop to hashbrown::HashMap::raw_entry_mut to avoid that.)


I actually wrote a Rust version as well, and yes, it was far easier to write, far less code (although not incorporating all the tricks), completely safe, and pretty fast -- but still 2x slower than my end result in Java.

HashMap was quite a bottleneck in Rust as well, for many reasons, not just the key allocation problem. But it was very easy to implement the same kind of custom hashtable, which almost by default ends up having a better memory layout than in Java.

https://github.com/mtopolnik/rust-1brc/blob/main/src/main.rs

I bet I could improve on it just a bit and match the Java time. Not sure about topping it, though.


> HashMap was quite a bottleneck in Rust as well, for many reasons, not just the key allocation problem.

I'd be curious to hear more. Other than the default hash function being quite expensive, and occasionally wanting to use the more expressive hashbrown API, I've been happy with Rust's std::collections::HashMap.


The reason the custom hashtable wins out isn't something generally applicable. For the very specific dataset used in the challenge, the hash function could be radically simplified, to just a single multiply and rotate left.

To be fair, I didn't try out the hashbrown API. Maybe that, together with FxHash, would have been enough for the official dataset with 97% of keys < 16 bytes. But, I was optimizing for the "10k" dataset as well, with a lot of longer keys, and hashing just the first 8 bytes was the winning idea.


> For the very specific dataset used in the challenge, the hash function could be radically simplified, to just a single multiply and rotate left.

That's just a custom hash function, though; couldn't you do that with the standard std::collections::HashMap?


I didn't bother to try, so not sure. There would probably be some challenges and I don't see how I'd accomplish it without some branch instruction in the custom hasher.


I thought everyone used hashbrown now

https://github.com/rust-lang/hashbrown


They do, because the standard library implementation was changed to use hashbrown. However, the standard library API has a slight limitation in that you can't say "I have a reference to a string. If there's an entry for it, give me that. Otherwise, clone the string and insert a new entry. Also, only do one key lookup."

You end up either having to do two lookups or always cloning the string, even if you end up not needing it.



You can't. `entry()` takes the key by value, not by reference:

https://doc.rust-lang.org/std/collections/hash_map/struct.Ha...

You might be getting confused because in that example the `HashMap` keys are references.


My bad. That’s not been an issue in my code, as I use arenas to manage memory and never put an owning container as key in a HashMap.


Cool, I see mfiguiere linked to my recent blog post! Let me share a few words about it...

I took part in the One Billion Row challenge (1BRC). It was a lot of fun, but also a great learning experience. People came up with some pretty incredible optimization tricks. When you put them all together, it's a huge number, and they are all mingled up in individual solutions. They also happen on many levels -- from quite high, to incredibly low and detailed.

In retrospect, I could see there was a good number of tricks that are relatively easy to grasp, and reusable in other projects. I felt the urge to do a writeup that captures this knowledge in one place, isolating and explaining each of the tricks.


Thank you for taking the time to write it up. Really nice to see some modern Java, as well as the ideas with comments -- many of which seem to generalize quite well beyond Java and the JVM.


If you do find the time, I'm sure that writeup would be very valuable!


The linked article is that write-up, this is the author replying to someone else posting their blog article.


What an excellent post, one of the best on 1BRC I've come across so far. Big shout-out to Marko for participating in the challenge, making a strong push for clarifying corner cases of the rules, and sharing his experiences in this amazing write-up!


Thanks for the praise Gunnar, but we all owe it to you for organizing it, and especially sticking through thick and thin when it took off, and needed lots of attention to evaluate everyone and maintain a level playing field!


The last part of the article raises an interesting question for me. What is the fastest, mostly fault tolerant implementation that could be created? So something that runs on say 10 different versions of the input, each of which has had up to 20 characters from the standard ASCII set inserted, replaced or removed randomly in the input file. So we'd be mostly looking for "data corruption" fault tolerance vs hardened against malicious input. How much can you still "out optimize" the compiler and JVM if you can't throw away all the safety?


I wonder if using a trie instead of a hash would have provided a performance win.

if you're parsing the file row by row, iterating over the trie as you process each character (which is what you have to do to hash it anyway) should be similar. What you'd run into is micro-architectural issues with cache performance.


There's a reason everybody uses hashing - if your data is trusted, it does very few operations.

Tries have to allocate, walk, and interpret multiple nodes. Perhaps not as bad as trees (though that depends on how many possible trie node representations there are, vs what the data density is at each level), but still worse than the non-colliding hash.

That said, with a finite dataset, a perfect hash would probably beat a general hash.


Was the list of city names known? Could you statically pre-build the Trie structure?


It was known, but the requirement was that the program keep working for an arbitrary keyset that conforms to the specified rules (up to 10,000 unique keys, each up to 100 bytes in length).


Tries are rarely the most efficient option.

Kind of like linked lists...conceptually pleasant but rarely if ever the best option for real-world perf.


Top solutions tend to use hashes that ignore many input bytes. Even if they did not, they would use hashes that consume 4 or 8 bytes of input at a time (you could do more if the language exposed aesenc)


Interestingly enough, that was my first idea. But when you consider the tiny keyset size, it would be hard to beat two machine instructions to calculate the hash + a single array lookup.


There is one entry which uses a trie, but it's not at the top, IIRC (which may or may not be related to using a trie, there's many other relevant design decisions).


I must admit I only glanced at the article, and in the past when the topic appeared. How does this work, the speed of 1.7s I mean? Taking a look at the input, let's say the average entry is 16 bytes and there are a billion of them; that's ~16GB. Average read speed of SSDs is what, ~500MB/s? Scanning alone at max throughput will take half a minute. This must be relying on DDR4+ read speeds, which would probably come in under a second in certain cases. Is that what's going on, a RAM disk?
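The arithmetic behind the question, using the commenter's ~16 GB estimate and round illustrative bandwidth figures:

```rust
// Back-of-the-envelope scan times for a ~16 GB input at various storage
// bandwidths. The GB/s figures are rough ballpark numbers, not benchmarks.
fn main() {
    let file_gb = 16.0_f64;
    let tiers = [("SATA SSD", 0.5), ("PCIe 4.0 NVMe", 7.0), ("DDR4 RAM", 25.0)];
    for (name, gb_per_s) in tiers {
        // 16 GB / 0.5 GB/s = 32 s; / 7 GB/s ~ 2.3 s; / 25 GB/s ~ 0.6 s
        println!("{:>14}: {:.1} s just to stream the bytes", name, file_gb / gb_per_s);
    }
}
```

So sequential read bandwidth alone rules out a slow SATA SSD for a 1.7 s total, but not NVMe or RAM-resident data.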


I used to benchmark a lot on an enterprise-grade SSD 10 years ago, and that was already at 2 GB/s. Today, even my laptop's SSD supports multiple GB/s.

But you're right about the contest -- each program was measured five times in a row, and there was enough RAM to fit the entire file into the page cache.

The best time using all 32 cores (64 hyperthreads) on the evaluation machine was 323 milliseconds.


I went to the original repo, and it's indeed a RAM disk. That makes it clear then. I was already almost excited about some IO wizardry going on.

Results are determined by running the program on a Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM).

Programs are run from a RAM disk (i.e. the IO overhead for loading the file from disk is not relevant), using 8 cores of the machine.


Running without RAM cache would be a great followup to this challenge. I think a time around 2-3 seconds should be achievable. But, it would be highly sensitive to the hardware setup and how well the disk-reading code is placed on cores relative to the connection to the disk. Not sure what it would take to allow hundreds of contestants to benefit from the best arrangement.


PCIe 3.0 x4 SSD runs at ~3500MB/s, PCIe 4.0 x4 SSD runs at ~7000MB/s and PCIe 5.0 x4 SSD runs at ~14000MB/s (yes - megabytes, not megabits)!


A single KIOXIA SSD on PCIe 5.0 can do that 14,000MB/s and 2M IOPS on 4 lanes. Some servers have 100+ lanes. You can have some serious speed if you want!


blazing! I had ye olde SSD SATA bricks in mind though.


14000MB/s? That's around 14,000,000 KB/s !


this was the most approachable commentary on a 1BRC submission i've read thus far.

up to the 6.6-second variant, i see the point of pushing the envelope of the programming environment and hacks that might help optimize future builds of the language. but beyond that, the optimizations seem to be more about overcoming the inherent limitations of the environment, which are due to consciously made tradeoffs.

i feel that once you hit that limit, we have reached the upper limit of the competition.


Here is the .NET version

https://github.com/noahfalk/1brc

still a bit faster than the fastest Java


For the ultimate performance, so to speak, Valhalla still needs to arrive in the JVM.

So .NET has quite a few tricks that on Java side still require FFI, or clever use of Panama, maybe.


These tools won't really close the gap unless Panama's vector APIs get a big makeover.


Panama vector API is still in preview for a reason.

And it depends on Valhalla being done.


Very nice! Always use profilers!

In reality, though, data always has errors, so you get many validity checks in there, and data is never only ASCII, but full of Unicode.

I so wish Python would come with the equivalent of jvisualvm in its "batteries included" (not even talking about better profilers).


The bottom line is: if you want performance in Java, code as if it was C!



