This brings back memories of debugging an Azul Zing bug where an effectively-final optimization ended up doing the wrong thing with zstd-jni. It was painful enough that I couldn't convince the team to re-enable the optimization for years after it was fixed.
I agree. It’s the enshittification of the internet. Luckily we still have infrastructure providers with more sensible offerings. We don’t have to use aws, gcp, etc.
I've used bazel for about 10 years, all in small orgs. Right now, single digit team.
I don't think there exists a better solution for a project mixing Java and C/C++ than Bazel. The new module system for bazel has matured a lot. As an example, it's trivial to add boringssl and access it from Java.
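For instance, a bzlmod dependency is roughly a one-liner plus a normal Bazel dep. This is just a sketch: the version string and target names are assumptions, so check the Bazel Central Registry for the real ones.

```starlark
# MODULE.bazel — the version here is illustrative, not current.
bazel_dep(name = "boringssl", version = "0.20240913.0")

# BUILD.bazel — hypothetical JNI shim so Java code can reach boringssl.
cc_library(
    name = "crypto_jni",
    srcs = ["crypto_jni.cc"],
    deps = ["@boringssl//:crypto"],
)
```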
This is fair. mise (by design) does nothing to help with that sort of thing. It's also definitely designed for languages outside C/C++, and to a lesser degree Java.
mise tasks are basically just fancy bash scripts though, so I could totally see a setup that uses mise tasks/tools for node/js/ruby and dispatches to other tools for building Java and C/C++.
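Something like this, say (tool versions, task names, and commands are all made up for illustration):

```toml
# .mise.toml — mise manages the node/ruby toolchains, and its tasks
# just shell out to bazel for the Java/C++ parts.
[tools]
node = "22"
ruby = "3.3"

[tasks.build-frontend]
run = "npm run build"

[tasks.build-backend]
run = "bazel build //server/..."
```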
That’s the default in the monorepos I’ve worked on.
When a third party dep is broken or needs a workaround, just include a patch in the build (or fork). Then those patches can be upstreamed asynchronously without slowing down development.
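In Bazel this is directly supported on the repository rule; something like the following, where the dep name, URL, and patch file are placeholders:

```starlark
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Sketch: fetch a third-party dep and apply a local patch at build time.
http_archive(
    name = "some_dep",
    urls = ["https://example.com/some_dep-1.2.3.tar.gz"],
    sha256 = "...",
    strip_prefix = "some_dep-1.2.3",
    patches = ["//third_party/patches:fix_build.patch"],
    patch_args = ["-p1"],
)
```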
I really appreciate the work done to evolve Java. They seem able both to fix earlier mistakes (and there were many!) and to keep new ones from happening. While I see other languages struggling under the weight of complexity added over time, Java seems to get simpler and easier even with significant updates to the language spec.
I think the «best practices» found in Java enterprise has meant that a lot of people think all Java has to look like that.
High performance Java is awesome and can be competitive with C++ and Rust for non-trivial systems, thanks to a much better development cycle, debugging experience, and high-quality libraries.
Most of the benefits come from the JVM, so Kotlin has those too, but I don't feel Kotlin is enough of an improvement to warrant the downsides (at least for me). That said, Kotlin, Scala, and Clojure are all great for the Java ecosystem.
Mechanisms for getting the Linux kernel out of the way are pretty decent these days, and CPUs with a lot of cores are common. That means you can isolate a bunch of cores, pin threads the way you want, and then use some kernel bypass to access hardware directly. Communicate between cores using ring buffers.
This gives you the best of both worlds: a system carefully designed for the hardware with near-optimal performance, and still the ability to take advantage of the full Linux kernel for management, monitoring, debugging, etc.
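The core-to-core ring buffer can be sketched in a few lines of Java. This is a minimal single-producer/single-consumer version; real implementations (e.g. the LMAX Disruptor) add cache-line padding, batching, and busy-spin waiting on the pinned cores.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal SPSC ring buffer: one pinned producer core calls offer(),
// one pinned consumer core calls poll(). Capacity must be a power of two.
final class SpscRing {
    private final long[] buf;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRing(int capacityPow2) {
        buf = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    boolean offer(long v) {
        long t = tail.get();
        if (t - head.get() == buf.length) return false; // full
        buf[(int) (t & mask)] = v;
        tail.lazySet(t + 1); // release-store: value visible before the index
        return true;
    }

    long poll() { // returns Long.MIN_VALUE when empty
        long h = head.get();
        if (h == tail.get()) return Long.MIN_VALUE;
        long v = buf[(int) (h & mask)];
        head.lazySet(h + 1);
        return v;
    }

    public static void main(String[] args) {
        SpscRing ring = new SpscRing(1024);
        for (long i = 0; i < 10; i++) ring.offer(i);
        long sum = 0;
        for (long v; (v = ring.poll()) != Long.MIN_VALUE; ) sum += v;
        System.out.println(sum); // prints 45
    }
}
```

The lazySet (ordered store) is what makes this cheap: producer and consumer each write only their own counter, so there's no contended CAS on the hot path.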
Accessing hardware directly via /dev/mem is literally the original kernel bypass strategy, before we got the UIO and VFIO APIs to do it in a blessed way.
Isolating a core and then pinning a single thread is the way to go to get both low latency and high throughput, sacrificing efficiency.
This works fine on Linux, and it's a common approach for trading systems, where it's acceptable to burn a bunch of cores on this type of stuff. The cores are mostly busy-spinning and doing nothing, so it's very inefficient in terms of actual work, but great for latency and throughput when you need it.
It is definitely not good advice for all things. For workloads at either end of the CPU/IO spectrum (e.g. almost all waiting on IO, or almost all doing CPU work) it can be a huge win: you get very good L1 cache utilization, aren't context-switching, and don't need to handle thread synchronization in your code because no state is shared between threads.
For workloads that are a mix of IO and non-trivial CPU work, it can still work but is much, much harder to get right.
In a previous project we used fixed32/fixed64 instead of varint values in the schema for messages where performance was critical.
That left varints used only for tags. For encoding we already know the tag sizes at compile time, and we don't have to decode them since we can match on the raw bytes (again, known at compile time).
A final optimization is to make sure the message has a fixed size and then just mutate the bytes of a template directly before shipping it off. Hard to beat.
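A rough sketch of the template trick, assuming a toy one-field schema (field 1, fixed32 on the protobuf wire format). The offsets are specific to this toy layout, not a general rule:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Pre-encoded message template: tag byte 0x0D is (field 1 << 3) | wire
// type 5 (fixed32), followed by 4 little-endian value bytes. Encoding
// a message is then just patching 4 bytes in a copy of the template.
final class Template {
    private static final byte[] TEMPLATE = {0x0D, 0, 0, 0, 0};
    private static final int VALUE_OFFSET = 1; // right after the tag

    static byte[] encode(int value) {
        byte[] msg = TEMPLATE.clone();
        ByteBuffer.wrap(msg).order(ByteOrder.LITTLE_ENDIAN)
                  .putInt(VALUE_OFFSET, value);
        return msg;
    }

    public static void main(String[] args) {
        byte[] msg = encode(0x12345678);
        System.out.printf("%02x %02x %02x %02x %02x%n",
            msg[0], msg[1], msg[2], msg[3], msg[4]);
        // prints "0d 78 56 34 12"
    }
}
```

In a real system you'd reuse one buffer per connection instead of cloning, which is where "just mutate the bytes of a template directly" comes from.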
In a newer project we just use SBE, but it lacks some of the ergonomics of protobuf.
The number of unique int32 values is not that large, and a full bitset would only be 512MB. Hard to beat a dense bitset in performance.
As a general-purpose data structure, I would look at roaring bitmaps for this purpose, which have good trade-offs and great performance for most use-cases.
If only simple lookups are needed over a static set, then it's worth looking at minimal perfect hash functions (https://sux.di.unimi.it).
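The dense bitset is a couple of lines: membership is one shift, one mask, one load. A full 2^32 universe needs a 512 MB long[] (hence the number above); the small universe here is just for illustration:

```java
// Dense bitset sketch for 32-bit keys: word index = key >>> 6,
// bit within the word = key & 63 (Java's 1L << key shifts mod 64,
// which is exactly that).
final class DenseBitset {
    private final long[] words;

    DenseBitset(long universeSize) {
        words = new long[(int) ((universeSize + 63) >>> 6)];
    }

    void set(int key) { words[key >>> 6] |= 1L << key; }

    boolean contains(int key) {
        return (words[key >>> 6] & (1L << key)) != 0;
    }

    public static void main(String[] args) {
        DenseBitset s = new DenseBitset(1 << 20);
        s.set(42);
        s.set(123_456);
        System.out.println(s.contains(42) + " " + s.contains(43));
        // prints "true false"
    }
}
```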
Hmm, bitmaps is an interesting idea!
If the data is dense enough, then yeah I guess a quick linear scan would work.
There's also the extreme of simply storing the answer for each possible query as a u32 and just index the array, but there the overhead is much larger.
Rank-select is also interesting, but I doubt it comes anywhere close in performance.
I happen to also be working on a minimal perfect hashing project that has way higher throughput than other methods (mostly because of batching), see https://curiouscoding.nl/posts/ptrhash-paper/ and the first post linked from there.
Nice work! I love the care and discussion around low level optimizations, as such things are often ignored.
There are a lot of interesting variants of rank/select and other succinct data structures, each with different tradeoffs. They're maybe most useful when compression is critical. At least my own data structures can't compete with the query performance you are showing, but they are great when the memory footprint matters.