
I'm surprised this is seen as a liability of mmap rather than a cooperative scheduler that isn't using native kernel threads. This is the deal you make with the devil when you use cooperative scheduling without involving the kernel, so I'm surprised it is news to people working with cooperative schedulers. These faults can happen even if you never explicitly memory map files (particularly since executables and shared libraries are often memory mapped into processes), so page faults are a blocking hazard for cooperative schedulers even without mmap.

The MMU in the hardware is aggressively parallel, and the only thread being blocked on the page fault is the one touching the page that needs to be swapped in. In reality, you can get heavily parallelized IO using mmap (indeed, it works quite well when you have a ton of IO you'd like to execute in parallel).
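
For what it's worth, here's a minimal sketch of that pattern (Linux/POSIX assumed, error handling omitted, "data.bin" is a placeholder path): fan reads out across threads over one mmap'd region, and each thread only stalls on its own page faults.

    /* Sketch only: parallel reads over an mmap'd file, one chunk per thread. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS 8

    static const unsigned char *base;
    static size_t chunk;

    static void *sum_chunk(void *arg) {
        size_t idx = (size_t)arg;
        unsigned long sum = 0;
        /* Touching these pages may major-fault; only this thread blocks
           while the kernel pages the data in, the others keep running. */
        for (size_t i = idx * chunk; i < (idx + 1) * chunk; i++)
            sum += base[i];
        return (void *)sum;
    }

    int main(void) {
        int fd = open("data.bin", O_RDONLY);       /* placeholder file */
        struct stat st;
        fstat(fd, &st);
        base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        chunk = (size_t)st.st_size / NTHREADS;     /* remainder ignored for brevity */

        pthread_t tid[NTHREADS];
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, sum_chunk, (void *)i);
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }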



I would describe it more as a limitation of mmap than a liability.

Modern async models have their origin in addressing serious shortcomings with the traditional POSIX APIs, particularly with respect to mmap and kernel schedulers. You can’t mix the models easily; many parts of POSIX don’t play nicely with async architectures. Traditional async I/O engines use direct I/O on locked memory, and if you use async you need to be cognizant of why this is. Half the point of async is to have explicit control and knowledge of when page faults are scheduled. Async is a power tool, not something that should be used lightly for casual tasks nor accidentally pulled in as a dependency.
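
To make "direct I/O on locked memory" concrete, a rough Linux-specific sketch (error handling omitted, the path is a placeholder); a real engine would submit the read asynchronously via io_uring or similar rather than with a plain pread:

    #define _GNU_SOURCE                /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLOCK 4096                 /* O_DIRECT wants aligned sizes/offsets */

    int main(void) {
        /* Bypass the page cache entirely, so reads never depend on
           page-cache faults. */
        int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* placeholder */

        /* Aligned buffer, pinned in RAM so touching it cannot major-fault. */
        void *buf;
        posix_memalign(&buf, BLOCK, BLOCK);
        mlock(buf, BLOCK);

        /* The engine decides exactly when this I/O is scheduled. */
        pread(fd, buf, BLOCK, 0);

        munlock(buf, BLOCK);
        free(buf);
        close(fd);
        return 0;
    }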

The issue here appears to be that it is far too easy for someone to inadvertently mix models, not async per se. Async has substantial advantages versus native kernel threads, so getting rid of async is not a realistic solution. No one is going to give up a several-fold performance increase versus native kernel threads because some developers can’t figure out how to not mix models or the ecosystem doesn’t protect developers against inadvertently mixing models.

Async is used heavily in some C/C++ domains but it doesn’t seem to cause many issues there, perhaps because dependencies are much more explicit and intentional. Async has also been idiomatic for certain domains in C/C++ for decades so there is an element of maturity around working with it.


> I would describe it more as a limitation of mmap than a liability.

Except it's a limitation that shows up even if you never make an mmap call. It's just a reality of living with virtual memory (and arguably, with preemptive kernel scheduling in general, since the kernel can decide to context-switch away from a thread at any time).

> Traditional async I/O engines use direct I/O on locked memory, and if you use async you need to be cognizant of why this is. Half the point of async is to have explicit control and knowledge of when page faults are scheduled. Async is a power tool, not something that should be used lightly for casual tasks nor accidentally pulled in as a dependency.

Cooperative multitasking, to avoid the stalling described in this article, also needs locked memory/explicit control/knowledge of when page faults are scheduled.
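
e.g. a cooperative runtime that wants to rule out major faults entirely might pin its whole address space at startup. A tiny sketch (Linux; needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* Fault in and pin everything mapped now and in the future, so
           user-space context switches can't stall on paging. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");
        /* ... start the cooperative scheduler here ... */
        return 0;
    }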

> Async is used heavily in some C/C++ domains but it doesn’t seem to cause many issues there, perhaps because dependencies are much more explicit and intentional. Async has also been idiomatic for certain domains in C/C++ for decades so there is an element of maturity around working with it.

It's more that the people in those domains have an understanding of cooperative multitasking's trade-offs, and it is an explicit design choice to employ it.


I think your point here can be generalized further. Why should someone expect reading memory to benefit from async code?

The fact that the memory in this case has an access layer with exploitable latency is where the chatter about this stems from, but it misses the fundamental issue at hand.

If this were a valid concept, we’d have async memcpy interfaces.


It is not exactly async memory, but at the turn of the millennium a few unices experimented with scheduler activations: the kernel would upcall back into the application whenever a thread would block for any reason, allowing rescheduling of the user space thread.

In the end, the complexity wasn't worth it at the time, but it is possible that something like that could be brought back in the future.


> I'm surprised this is seen as a liability of mmap rather than a cooperative scheduler that isn't using native kernel threads

Indeed. In practice, though, it's easier to write high performance servers and storage systems with async runtimes (like tokio) than with native threads, at least with the current state of the ecosystem. That's not for some fundamental reason - it's possible to get great threaded performance - just the current reality.

So, whoever's fault this is, it's useful to have good evidence of this downside of async runtimes (and worth thinking about ways that OSs could let runtimes know when they were about to block on IO).
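
One existing, if clunky, primitive in that direction is mincore(2): a runtime can check whether a page is resident before touching it and punt the access to a blocking thread pool if it isn't. A rough sketch (inherently racy, since residency can change between the check and the access):

    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns 1 if the page containing addr is resident, 0 if touching it
       would likely major-fault, -1 on error. */
    static int page_resident(const void *addr) {
        unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE);
        void *page = (void *)((unsigned long)addr & ~(pagesize - 1));
        unsigned char vec;
        if (mincore(page, pagesize, &vec) != 0)
            return -1;
        return vec & 1;
    }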


> That's not for some fundamental reason - it's possible to get great threaded performance - just the current reality.

I would argue it is for a fundamental reason. Cooperative multitasking in user-space requires far less overhead than anything a kernel might do. It's just an explicit part of the trade-off: you get more efficient context switches and control when context can change, and in exchange you leave something on the table whenever the kernel is involved.

> So, whoever's fault this is, it's useful to have good evidence of this downside of async runtimes (and worth thinking about ways that OSs could let runtimes know when they were about to block on IO).

But it isn't specific to async runtimes (in fact, a kernel-based preemptively scheduled async runtime wouldn't have this problem). It's a problem specific to cooperative multitasking.


> cooperative scheduler that isn't using native kernel threads

Can anyone point me towards cooperative thread schedulers that use native kernel threads? Would this effectively mean implementing a cooperative model on top of pthreads?


The term to search for prior art is user-mode scheduler / scheduling. Basically you add additional kernel features that allow making some scheduling decisions in the application, it's not something you'd just build on a vanilla pthreads implementation.

Examples:

Windows 7 UMS: https://learn.microsoft.com/en-us/windows/win32/procthread/u...

google3 fibers / switchto: https://www.youtube.com/watch?v=KXuZi9aeGTw


changes in the page table block the whole process, doesn't matter what combination of concurrency models you're using. we could do with a sub-process mapping API from the OS, but it's not something any major OS offers today, and requires designing for at a very fundamental level due to interaction with the hardware, and associated hardware constraints.


> changes in the page table block the whole process, doesn't matter what combination of concurrency models you're using.

I don't think that's necessarily true --- adding a mapping doesn't need to stop other threads that share a page table unless they're also modifying the page table. I don't think the TLB would cache an unmapped entry, but even if it did, the page fault handler will check, see that it's fine and resume execution.

For unmapping, it's different, in that you have to do IPI TLB shootdowns, but that interrupts other threads rather than blocking them.


And even if other threads are contending for the page table lock, the kernel doesn’t hold that lock for the entire duration of the I/O. Only for the tiny fraction of that duration where the kernel is spending CPU time doing bookkeeping. For the rest of the time, during which the system is just waiting for the disk, the thread that triggered the page-in is still blocked, but other threads can do whatever they want, including page-table modifications.

From what I’ve read on LWN, contention on the page table lock (mmap_sem / mmap_lock) has been a real and perennial issue for Linux, especially on servers with huge numbers of CPUs; but it’s a far smaller effect than what this post is talking about.


> From what I’ve read on LWN, contention on the page table lock (mmap_sem / mmap_lock) has been a real and perennial issue for Linux, especially on servers with huge numbers of CPUs; but it’s a far smaller effect than what this post is talking about.

...and either way, that's kernel lock contention, not blocking IO.


fair pushback, though s/unless they're also modifying the page table/unless they're also accessing the page table/, as it too needs to be synchronized. so yes, sometimes it has no effect, but given how often programs end up loading pages, crosstalk is super common


Crosstalk is absolutely common for a number of operations, but that crosstalk is NOT the same as blocking until the page is loaded into RAM. That operation is blocking on IO.


Page faults block the thread, not the process, because the thread is trying to access memory that isn't available to it. Other threads run just fine so long as they too don't trigger page faults. The article specifically mentions this, and I've built entire architectures based around this reality. They work great.
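
A toy demonstration of exactly that (Linux, error handling omitted, the file path is a placeholder): one thread takes a major fault on a cold mmap'd page while a second thread keeps counting the whole time.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static atomic_long ticks;
    static volatile unsigned char sink;

    static void *spinner(void *arg) {
        (void)arg;
        for (;;)
            atomic_fetch_add(&ticks, 1);     /* keeps making progress */
    }

    int main(void) {
        int fd = open("big_file.bin", O_RDONLY);        /* placeholder */
        struct stat st;
        fstat(fd, &st);
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);   /* make it cold */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        pthread_t t;
        pthread_create(&t, NULL, spinner, NULL);

        long before = atomic_load(&ticks);
        sink = p[0];              /* major fault: only this thread stalls */
        long after = atomic_load(&ticks);

        printf("spinner advanced %ld ticks during the fault\n", after - before);
        return 0;
    }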

...and of course there are also multi-process concurrency models, where even if a process were blocked, the other processes would not. So no, it does absolutely matter what combination of concurrency models you are using.



