
One of the comments on the article is really interesting. The recent Meltdown mitigations have really blown up the cost of privilege transitions, because people now expect non-architectural data not to leak across privilege boundaries; system calls are about twice as slow as they used to be. Meanwhile, I/O is faster than ever, with PCIe and NVMe. Io_uring offers the opportunity to avoid privilege transitions through asynchronous calls based on writing to a memory buffer shared between user space and the kernel. That has the potential to fundamentally change the basic system call interface. As the article hints, the trick is designing the API so you can construct as large a block of work to be done asynchronously as possible. At the limit, you could push the core of your I/O loop down into the kernel, hence the suggestions that BPF programs could be submitted through the ring to chain operations entirely within the kernel.

(Incidentally, this is a good illustration of the flexibility of UNIX’s model of describing everything with a file descriptor. The same interface meant for asynchronous file I/O was easily extended to network I/O.)



> hence the suggestions that BPF programs could be submitted through the ring to chain operations all within the kernel

That sounds unbelievably annoying.

No, the limit has been found, and it is not in the kernel: you push the I/O loop up into userspace, making it more specific, not more general. They call it SPDK [1] (or DPDK for networking), and as far as I can tell, the principle is essentially a dummy driver in the kernel that maps the entire PCIe peripheral memory space into your chosen process, and everything flows from there.

At the I/O limit, interrupt-driven asynchronous I/O isn't feasible, because interrupts introduce latency and waste cycles not doing work. All userspace I/O frameworks work only through polling.

1: https://spdk.io/


> At the I/O limit, asynchronous isn't feasible because interrupts introduce latency and waste cycles not doing work.

io_uring supports polled i/o: https://lore.kernel.org/linux-block/20190116175003.17880-8-a...


The problem with mechanisms like DPDK is that they bypass all the infrastructure in the kernel and make it hard to play well with others using the same hardware or services. DPDK, for example, bypasses the TCP/IP stack. SPDK bypasses the VFS. You can write your own TCP/IP stack or filesystem on top of those things, but then you can't play well with other processes using those services. While some GPUs can directly multiplex command streams from different processes, most hardware cannot.


That's the point of DPDK: to get the kernel out of the way of packet processing.

Userland packet processing (in a network context) is much more flexible and less brittle than forcing certain functionality to exist solely in the kernel layer. However, things do exist that allow you to (mostly) transparently re-jigger a standard app's TCP/IP calls. One such example is using LD_PRELOAD to "hijack" the syscalls for certain things and snake them over to your (super high performance) userspace app!

There's a lot of exciting stuff happening in the open source networking world (DPDK, VPP/FDio, Network Service Mesh, etc). I really recommend digging into it!


SPDK is a pain to use. io_uring supports polling, and that gets it within a few percent of SPDK performance.


Interestingly, the advent of these asynchronous, context-switch-less interfaces might see microkernels make a comeback: they were originally decried for performance reasons, since their message-passing traditionally needs many expensive context switches, and interfaces like this remove much of that cost.


I also wonder whether, in this batched model, there's some opportunity to further leverage the 'race to idle' approach you see in mobile devices. If I have 16 cores and half of them are frequently spun down, the thermal budget for the others improves.



