For example, Claude can fluently generate Bevy code as of the training cutoff date, and there's no way there's enough training data on the web to explain this. There's an agent somewhere in a compile-test loop generating Bevy examples.
A custom LLM language could have fine-grained fuzzing, mocking, concurrent calling, memoization, and other features that allow LLMs to generate and debug synthetic code more effectively.
If that works, there's a pathway to a novel language having higher quality training data than even Python.
I recently had Codex convert a script of mine from bash to a custom, Make-inspired language for HPC work (think Nextflow, but an actual language). The bash script submitted a bunch of jobs based on some inputs. I wanted this converted to use my pipeline language instead.
I wrote this custom language. It's on GitHub, but the example code that would have been available for training is very limited.
I gave it two inputs -- the original bash script and an example of my pipeline language (unrelated jobs).
The code it gave me was syntactically correct, and was really close to the final version. I didn't have to edit very much to get the code exactly where I wanted it.
This is to say -- if a novel language is somewhat similar to an existing syntax, the LLM will be surprisingly good at writing it.
Hill-climbing a password would only be possible if intermediate KV cache entries were stored. To hill-climb "hunter2", you're going to try "a", "b", "c", etc., until you notice that "h" comes back faster. Then you try "ha", "hb", and so on.
But that's only going to work if the cache looks like: "h", "hu", "hun", ..., "hunter2"
If just "hunter2" is in the cache, you won't get any signal until you stumble on exactly that password. And that's before getting into the block size granularity of the caches discussed elsewhere in this thread.
That's not to say timing attacks aren't possible. I haven't looked at Claude Code's prompt generation, but there's no intrinsic reason why you couldn't do things like figure out what open source code and research papers your competitors are loading into context.
Sharing caches between orgs would be an incredible misstep.
Right, you can't actually guess a letter (byte) at a time, but you can guess a token at a time (I believe the vocabulary is ~200,000 possible tokens in GPT-5).
So you could send each of the 200,000 possible tokens, see which one is cached, and then send 200,000 more to find the next cached token.
Certainly less efficient, but well within the realm of a feasible attack.
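As rough back-of-the-envelope arithmetic (the secret length below is made up for illustration), the brute force scales as vocabulary size times the number of tokens in the cached secret:

    fn main() {
        // Assumes, as above, a cache entry at every token boundary of the secret.
        let vocab: u64 = 200_000;   // ~GPT-5 vocabulary size, per the comment above
        let secret_tokens: u64 = 8; // hypothetical length of the cached secret
        let worst_case_probes = vocab * secret_tokens;
        println!("worst case: {worst_case_probes} probe requests"); // 1,600,000
    }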
It's a good call-out re: tokens vs. letters, but I think you might have misunderstood my point -- you can't do it a token at a time unless the intermediate KV cache is stored after each token is generated.
This won't be the case in any non-toy implementation, as it would be unnecessary and slow.
Ah, fair enough. Anthropic caches at a block level (basically a single message), so for non-trivial messages this is much less of a concern, although I definitely understand why they still scope the cache to a single tenant.
The author also proposes an incorrect optimization to Tokio here, which suggests a lack of understanding of the specific guarantees given:
https://github.com/tokio-rs/tokio/pull/7622
The tests do not appear to simulate the queue in Loom, which would be a very, very good idea.
This stuff is hard. I almost certainly made a mistake in what I've written above (edit: I did!). In practice, the queue is probably fine to use, but I wouldn't be shocked if there's a heisenbug lurking in this codebase that manifests something like: it all works fine now, but in the next LLVM version an optimization pass is added which breaks it on ARM in release mode, and after that the queue yields duplicate values in a busy loop every few million reads which is only triggered on Graviton processors.
Or something. Like I said, this stuff is hard. I wrote a very detailed simulator for the Rust/C++ memory model, have implemented dozens of lockless algorithms, and I still make a mistake every time I go to write code. You need to simulate it with something like Loom to have any hope of a robust implementation.
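For a sense of what that looks like, here is roughly the canonical Loom example (adapted from Loom's README): a load-then-store "increment" that passes ordinary tests, but loses an update under an interleaving Loom will find.

    use loom::sync::atomic::AtomicUsize;
    use loom::sync::atomic::Ordering::{Acquire, Relaxed, Release};
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    #[should_panic]
    fn buggy_concurrent_inc() {
        // Loom exhaustively explores the interleavings (and permitted
        // memory-model reorderings) of the closure passed to loom::model.
        loom::model(|| {
            let num = Arc::new(AtomicUsize::new(0));

            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let num = num.clone();
                    thread::spawn(move || {
                        // Not atomic as a whole: the load and store can interleave
                        // with the other thread's, losing an increment.
                        let curr = num.load(Acquire);
                        num.store(curr + 1, Release);
                    })
                })
                .collect();

            for h in handles {
                h.join().unwrap();
            }

            // Loom will find a schedule where this assertion fails.
            assert_eq!(2, num.load(Relaxed));
        });
    }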
For anyone interested in learning about Rust's memory model, I can't recommend Mara Bos's Rust Atomics and Locks enough.
> The tests do not appear to simulate the queue in Loom, which would be a very, very good idea.
Loom is apparently this: https://github.com/tokio-rs/loom
I've used Tokio a bit in the past but wasn't aware of that tool at all. It looks really useful, and I'm probably not alone in never having heard of it before. Any tips, tricks, or gotchas one should know beforehand?
In development, you import Loom's mutex; in production, you import a regular mutex. This has zero overhead in production builds, but the simulation testing itself is usually quite slow: only one thread can execute at a time, and many iterations are required to cover the interleavings.
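Concretely, the type swap is just a cfg gate. This is the pattern Loom's docs describe; the module and test names below are illustrative.

    // Production builds use std; Loom builds are selected with
    // RUSTFLAGS="--cfg loom" cargo test
    #[cfg(loom)]
    use loom::sync::{Arc, Mutex};
    #[cfg(not(loom))]
    use std::sync::{Arc, Mutex};

    // Loom-only tests live behind the same cfg and drive the real code paths.
    #[cfg(all(loom, test))]
    mod loom_tests {
        #[test]
        fn runs_under_the_model() {
            loom::model(|| {
                // Construct the structure under test here and exercise it from
                // a couple of loom::thread::spawn'd threads; Loom explores the
                // permitted interleavings and memory-model reorderings.
            });
        }
    }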