IMO this is a strong argument for proper threads over async: you can try to guess what will and won't block as an async framework dev, but you'll never fully match reality, and you end up wasting resources when an executor thread blocks when you weren't expecting it to.
I don’t find this argument super strong, fwiw. It could just mean ‘be wary of doing blocking operations with async, and note that mmap makes reading memory blocking (paging in) and writing memory blocking (CoW pages)’.
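To make that concrete, here is a minimal sketch (assuming the memmap2 crate and a non-empty file at a made-up path; both are illustrative choices, not anything from the comment above) of how an mmap read hides blocking:

    use std::fs::File;

    fn main() -> std::io::Result<()> {
        let file = File::open("/var/log/syslog")?; // hypothetical large file
        // Safety assumption for this sketch: nothing truncates the file while mapped.
        let map = unsafe { memmap2::Mmap::map(&file)? };
        // This looks like a plain memory read, but if the page isn't resident
        // the thread blocks on a page fault (disk I/O). No syscall and no
        // .await is involved, so an async executor can't see it as a yield point.
        let first = map[0];
        println!("first byte: {first}");
        Ok(())
    }

The same goes for writes into a private/CoW mapping: the fault handler runs and the executor thread simply stalls.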
I think there are reasons to be wary, but to me debugging comes first (this goes two ways though: if you have a single ‘actual’ thread, then many races can’t happen), because debuggers/tracers/etc. work better on non-async code.

Performance comes second, but it’s complicated. The big cost with threads is heavy context switches and per-thread memory. The big cost with async is losing CPU locality (because many syscalls on Linux won’t lead to your thread yielding, and the core your thread is on will likely have more of the relevant information and lots of cache to take advantage of when the syscall returns[1]) and spending more on coordination. Without io_uring, you end up sending your syscall work (nonblocking fd ops excepted) out to some thread pool, which eventually picks it up (likely via some futex), loads it into cache, sends it to the OS on some random core, and then sends the result back to you in a way you will notice, so that the next step can be (internally) scheduled. It can be hard to keep a handle on the latency added by all that indirection.

The third reason I have to be wary of async is that it can be harder to track resource usage when you have a big bag of async stuff going on at once. With threads there is some sense in which you can limit per-thread cost and then limit the number of threads. I find this third reason quite weak.
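As a concrete example of that indirection (a minimal sketch, assuming Tokio; Tokio’s own tokio::fs does essentially this internally, since regular file reads have no nonblocking mode):

    use std::fs;

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        // The read() is shipped to a blocking-capable thread pool, runs on
        // whatever core that pool's thread lands on, and the result is then
        // signalled back to the async task -- every hop described above,
        // each with its own cache and wakeup cost.
        let bytes = tokio::task::spawn_blocking(|| fs::read("/etc/hostname"))
            .await
            .expect("blocking task panicked")?;
        println!("read {} bytes", bytes.len());
        Ok(())
    }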
All that said, it seems pretty clear that async provides a lot of value, especially for ‘single-threaded’ (I use this phrase in a loose sense) contexts like JavaScript or Python, where you can reduce some multithreading pain. And I remain excited for io_uring-based async to pick up steam.
[1] There’s this thing people say about the context switching in and out of kernel space for a syscall being very expensive. See for example the first graph here: https://www.usenix.org/legacy/events/osdi10/tech/full_papers... . But I think it isn’t really very true these days (maybe Spectre & co. mitigations changed that?), at least on Linux.
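If you want to check this on your own machine, here is a rough sketch (assuming the libc crate; a back-of-the-envelope loop, not a rigorous benchmark):

    use std::time::Instant;

    fn main() {
        // Time a cheap syscall in a tight loop. getpid(2) does almost no
        // kernel work, so the per-call figure mostly reflects the
        // user/kernel transition itself.
        const N: u32 = 1_000_000;
        let start = Instant::now();
        for _ in 0..N {
            unsafe { libc::getpid() };
        }
        println!("~{:?} per syscall", start.elapsed() / N);
    }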
Async is for tasks dominated by waiting, e.g. HTTP serving, not computation. This means it's extremely rare to run into mmap-blocking-related issues if you don't do something strange.
Furthermore, async doesn't exclude multithreading:
- having a multi-threaded pool of blocking worker threads in addition to the CPU-bound threads is pretty normal
- having multiple async threads, potentially with cross-core work stealing, is also the norm
I.e., if you just follow basic advice, the huge majority of tasks that could interact in any performance-problematic way will not run as async tasks, even if you write an async web server (see the sketch below).
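For instance, a minimal sketch of that basic advice with Tokio (the thread counts are arbitrary illustrations): wait-dominated work stays on the work-stealing async workers, and anything blocking is shunted to a separate pool.

    use std::time::Duration;

    fn main() {
        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(4)        // async executor threads (work-stealing)
            .max_blocking_threads(64) // separate pool for blocking work
            .enable_all()
            .build()
            .unwrap();

        rt.block_on(async {
            // Wait-dominated task: stays on the async workers.
            let waiting = tokio::spawn(async {
                tokio::time::sleep(Duration::from_millis(10)).await;
            });
            // Blocking task: routed to the blocking pool, so it
            // can't starve the async workers.
            let blocking = tokio::task::spawn_blocking(|| {
                std::thread::sleep(Duration::from_millis(10));
            });
            let _ = tokio::join!(waiting, blocking);
        });
    }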
> but you'll never fully match reality and you end up wasting resources when an executor blocks when you weren't expecting
And with non-async IO you always waste tons of resources, even without doing anything unusual, _iff_ it's about wait-dominated tasks, as you have way more management overhead.
Furthermore, in more realistic cases it's quite common that some unplanned blocking mainly causes latency issues (which in the worst case could cause timeouts), but because async engines still use multithreading, it doesn't lead to relevant utilization issues. That is, if it's just some unplanned blocking. If you do obviously wrong things, like processing large files in async tasks, things can be different.
An argument against async is that, depending on what you use, it can add complexity, and a lot of use cases don't benefit from its advantages enough to make it a reasonable choice. Though that is also a bit language-dependent. E.g. JS is already cooperative in your program anyway, and using async makes things simpler there (as the alternative is callbacks). Or in Python with the GIL, the performance gains of async are much higher compared to the gains in, say, C++.
This kind of issue exists only in async executor implementations that cannot detect blocked workers and inject new ones to compensate for the starvation. I'm not aware of Rust having anything like this today (neither Tokio nor async-std does) or in development for tomorrow, but there are implementations that demonstrate resilience to this in other language(s).
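For reference, the failure mode itself is easy to reproduce (a minimal sketch using Tokio's single-threaded scheduler, chosen to make the starvation obvious):

    use std::time::{Duration, Instant};

    fn main() {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_time()
            .build()
            .unwrap();

        rt.block_on(async {
            let started = Instant::now();
            let victim = tokio::spawn(async move {
                tokio::time::sleep(Duration::from_millis(10)).await;
                // Prints roughly 1s, not 10ms: the blocking call below held
                // the only worker, and nothing detected or compensated for it.
                println!("victim ran after {:?}", started.elapsed());
            });
            // Un-cooperative blocking inside an async task -- exactly what a
            // blocking-detecting executor would answer by injecting a worker.
            std::thread::sleep(Duration::from_secs(1));
            victim.await.unwrap();
        });
    }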
Unfortunately, the Tokio blog post does not appear to look into .NET's implementation in sufficient detail and as a result gets some of the details wrong.
Starting with .NET 6, there are two mechanisms that determine the ThreadPool's active thread count: a hill-climbing algorithm and blocking detection.
Hill-climbing is the mechanism that both the Tokio blog post and the articles it references mention. I hope the blog's contents do not indicate the depth of research performed by the Tokio developers, because the coverage has a few obvious issues: it references an article written in 2006 covering .NET Framework that talks about heavier and more problematic use cases. As you can expect, the implementation has received numerous changes since then, and 14 years later likely shares little with the original code. In general, the performance of then-available .NET Core 3.1 was incomparably better, to put it mildly, which includes tiered compilation in the JIT that reduced the impact of the startup-like cases that used to be more problematic. Thus, I don't think the observations made in the Tokio post are conclusive regarding the current implementation.
In fact, my interpretation of how various C# codebases evolved throughout the years is that hill-climbing worked a little too well, enabling ungodly heaps of exceedingly bad code that completely disregarded expected async/await usage and abused the threadpool to oblivion, with the most egregious cases handled by enterprise applications overriding the minimum thread count to a hundred or two and/or increasing the thread injection rate. Luckily, those days are long gone. The community is now in an over-adjustment phase where people would rather unnecessarily contort the code with async than block it here and there and let the threadpool work its magic.
There are also other mistakes in the article regarding task granularity, execution time, and behavior, but those are out of scope for this comment.
Anyway, the second mechanism is active blocking detection. This is something that was introduced in .NET 6 with the rewrite of the threadpool implementation in C#. The way it works is that the threadpool exposes a new internal API that lets all kinds of internal routines notify it that a worker is blocked or about to get blocked. This allows it to immediately inject a new thread to avoid starvation, without a wind-up period. This works very well for the most problematic scenarios of abuse (or just unavoidable sync and async interaction around the edges) and further ensures that the "jitter" discussed in the articles does not happen. Later on, the threadpool will reclaim idle threads after a delay when it sees they do not perform useful work, via hill-climbing or otherwise.
I've been meaning to put up a small demonstration of hill-climbing in light of un-cooperative blocking for a while, so your question was a good opportunity:
(I’m under the impression that this was released in 2021, whereas the linked Tokio post is from 2020. Hopefully that frames the Tokio post’s observations more accurately.)