Single-use JIT Performance on x86 Processors

andikleen2 · on July 2, 2020

You have to be a bit careful with the CLFLUSH method. I tried to use it in a widely used program years ago because Intel recommended it, but we found that it just hangs the CPU on some older VIA/Centaur CPUs. Presumably that's fixed these days, but the old CPUs are likely still around.

Nyan · on July 2, 2020

Thanks for the info! Unfortunately I don't have access to any VIA/Centaur CPUs, so couldn't test on those (though test results welcome if anyone is willing/able to!).

But yeah, you have to check the CPU you're running on when doing these tricks unfortunately, as results vary greatly across micro-architectures.

Interestingly, there's a more optimal CLFLUSHOPT instruction on more recent processors, which often seems to be quite effective for this task.

ComputerGuru · on July 2, 2020

Funny. My line of thought was “it’s probably still not fixed, but those old cpus won’t be running new code anytime soon.”

MaxBarraclough · on July 2, 2020

Looks fun, but impractical. Are there real uses for this kind of thing, on modern architectures?

Related question: has anyone tried to create a high-level language for doing this kind of madness?

vardump · on July 2, 2020

There are practical applications for this.

Perhaps most importantly for regexp compilation, as it is a fairly common use case [1].

Some other examples I can think of: JIT compiling an eval-statement. Constructing a filter for example for a database scan. CPU based pixel shader. Etc.

A lot of code is JIT compiled, but executed only once. So this was a pretty interesting article with practical performance implications in some scenarios.

[1]: See for example https://www.pcre.org/original/doc/html/pcrejit.html

Nyan · on July 2, 2020

Thanks for the info. I'm not particularly familiar with common JIT applications, but I suspect that this use-case is actually more niche than may think.

The problem is that the example presented requires a memory page with write + execute permissions (at the same time). I suspect many JITs don't do this for security reasons (and to deal with OSes which don't allow it), as it may make it easier for an attacker to gain arbitrary code execution. It's likely that many JITs toggle between write and execute permissions, rather than have both enabled at the same time. Whilst this reduces attack surface, changing permissions on allocated memory requires syscalls, which are quite expensive in terms of performance.

The scenario presented in the article avoids the impact of syscalls, to maximize performance, leaving only the impact caused by the processor itself. If a JIT isn't overly concerned with this type of security, using write+execute memory could be a way to avoid syscall overhead. On the other hand, if a JIT does toggle permissions, the syscall overhead is likely much more significant than overheads caused by the processor (although the techniques shown might still help depending on how the JIT engine works).

cat199 · on July 2, 2020

> many JITs don't do this for security reasons

would be nice, but historically it's more been:

- all JITs do this

- OpenBSD creates W^X

- open source JITs and other OS's start to incorporate W^X

- things are now either W^X compatible or haven't been ported to a W^X OS yet.

Nyan · on July 2, 2020

Ah good point, thanks for the info!

vardump · on July 2, 2020

Yeah, the security implications are obvious, R+W+X should not be used with untrusted inputs.

Not that I'd recommend this, but alternatively you could also map exact same memory twice, one with R+X and the other with R+W. The attacker would need to figure out the writable address. Unfortunately there are probably a lot of ways to accidentally leak this information to the attacker...

There are still plenty of use cases where inputs can be trusted.

saagarjha · on July 2, 2020

For performance I think this scheme is fairly common for W^X JITs.

MaxBarraclough · on July 2, 2020

Regex is a good example of where JIT can be a good idea for performance, but is it really 'single use'? A compiled regex object would presumably obey the lifetime rules of the host language. It might live as long as the host process, and be reused many times.

I wasn't very clear but I was really thinking of 'self-modifying' code, in contrast to conventional JIT as with regex (once generated, the binary is presumably never modified).

chrisseaton · on July 2, 2020

Self-modifying code is sometimes used to allow you to update an inline cache for a method dispatch as the program is running. Rather than having to recompile and produce new machine code to update an inline cache.

MaxBarraclough · on July 2, 2020

Neat idea. Presumably a dynamic JIT like HotSpot would use this technique.

In C you can have a global function pointer, and mutate it, but presumably that still has an additional indirection compared to the self-modifying approach.

chrisseaton · on July 2, 2020

> Presumably a dynamic JIT like HotSpot would use this technique.

Not in practice in HotSpot no, but it could do and similar VMs do. execute|write is not fashionable these days! Some systems enforce execute^write, so you can't do it.

MaxBarraclough · on July 3, 2020

> execute|write is not fashionable these days!

Ah, of course! Hadn't thought of that.

vardump · on July 2, 2020

> Regex is a good example of where JIT can be a good idea for performance, but is it really 'single use'?

Ideally not, but take a look at real life code bases... Just like with SQL prepared statements.

This kind of waste happens often due to reduced visibility. Abstraction layers are good at hiding details like construction cost. Repeated construction happens accidentally pretty easily.

pizlonator · on July 2, 2020

Usually if the code is really single use then the optimal solution is to interpret before JITing.

The more severe case of this in modern JITs is inline cache repatching. That’s super frequent.

Jweb_Guru · on July 2, 2020

> Usually if the code is really single use then the optimal solution is to interpret before JITing.

This is the conventional wisdom, yes, but recently (for very long-running database queries at least) it seems to be less true.

saagarjha · on July 2, 2020

If you know something is long running, then yes it makes sense to compile it early. If you don’t, then interpreting until you have it compiled and then OSRing usually does a good job.

Nyan · on July 2, 2020

> Are there real uses for this kind of thing, on modern architectures?

For me, I came up with an algorithm for doing error correction coding, however, good performance can only be achieved by JIT'ing code. Trying to implement the algorithm without JIT results in many if/switch statements and memory lookups, which makes it much slower. Unfortunately, the JIT'd code can only be used once because a new function needs to be written every time the routine is called, which leads to a scenario like that in the article.

Otherwise, I do think this is somewhat niche, but there may be some interesting applications if the security of write+execute memory is not a concern.

vardump · on July 2, 2020

Yeah, I was actually thinking about this particular case for code specialization. In code where the inner loop is very branchy, you can have considerable gains for being able to remove unnecessary branches (and code).

This kind of technique was (is?) a fairly common in demoscene. Often just modifying constants in existing code but also specializing (AFAIK usually block concatenation) isn't unheard of.

(By the way, at least on x86, it might pay off to watch out for things like inner loop(s) branch target 16-byte alignment to avoid penalties.)

Nyan · on July 2, 2020

Didn't know the practicality of this in the demoscene - thanks for the info!

saagarjha · on July 2, 2020

Demos are generally more focused on size savings than performance ;)

vardump · on July 2, 2020

Only for size compos. Like 4 kB demos. Or 256 byte, the "new 4k".

For retro systems (like C64, speccy, Amiga OCS) speed is generally the king. Of course there are still sizecoding compos for them as well. My oooold Amiga demo effects were full of code generation and SMC.

chrisseaton · on July 2, 2020

> Are there real uses for this kind of thing, on modern architectures?

Sometimes it's simpler to have all code be compiled rather than interpreted - one representation for everything. For example until fairly recently V8 always compiled everything - no interpreter.