One of the authors here: yeah, it's very interesting. The flame graphs here don't do a great job of highlighting one aspect of the challenge, which is that crypto fans out across many CPUs. I think the hunch that 20-30 Gbps is attainable (on a fast system) is accurate, but it'll take more work to get there.
What's interesting is that the cost of x/crypto on these system classes is prohibitive for serial decoding at 10 Gbps. Ballparking with a 1280-byte MTU, you have about 1000ns to process a packet, and it takes about 2000ns to encrypt one. The fan-out is critical at these levels, and it will always introduce its own additional costs: synchronization, memory management and so on.
Is the per-packet processing in WireGuard stateless? As in, no sequential packet numbering in the crypto, etc?
If yes, then you should be able to get the kernel to spread your incoming traffic across cores with minimal contention and coordination, with multiqueue tuntap:
Absolutely, multiqueue is on the list. What that would in theory allow us to solve well is rss/xss. As mentioned above, we would still have to fan out crypto, which means we need to solve for NUMA-aware queue processing, which isn’t immediately solvable with the current Go runtime APIs. Lots of interesting things to work on!
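For anyone wondering what "fan out crypto" might look like in the abstract, here is a minimal sketch, not wireguard-go's actual code. It assumes each packet gets a counter that is used as the AEAD nonce (WireGuard's transport messages carry one), assigns those counters serially, seals in parallel, and hands results back in the original order. To stay stdlib-only it swaps in AES-GCM for the ChaCha20-Poly1305 that WireGuard really uses, and `sealParallel` and `job` are hypothetical names. It does nothing about NUMA placement, and a real implementation would presumably keep long-lived workers and reuse buffers.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
	"fmt"
	"sync"
)

// job carries one packet. The counter is assigned serially before fan-out,
// so workers share no mutable state while encrypting.
type job struct {
	counter   uint64
	plaintext []byte
	out       []byte
}

// sealParallel encrypts packets on `workers` goroutines and returns the
// ciphertexts in the original packet order. AES-GCM is a stdlib stand-in
// for ChaCha20-Poly1305 here.
func sealParallel(key []byte, packets [][]byte, workers int) ([][]byte, error) {
	if workers < 1 {
		workers = 1
	}
	// One AEAD instance per worker, built up front so errors surface early.
	aeads := make([]cipher.AEAD, workers)
	for w := range aeads {
		block, err := aes.NewCipher(key)
		if err != nil {
			return nil, err
		}
		if aeads[w], err = cipher.NewGCM(block); err != nil {
			return nil, err
		}
	}

	jobs := make([]job, len(packets))
	for i, p := range packets {
		jobs[i] = job{counter: uint64(i), plaintext: p} // cheap serial step
	}

	idx := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(aead cipher.AEAD) {
			defer wg.Done()
			nonce := make([]byte, aead.NonceSize()) // per-worker scratch
			for i := range idx {
				j := &jobs[i]
				// Leading nonce bytes stay zero; the counter fills the last 8.
				binary.LittleEndian.PutUint64(nonce[len(nonce)-8:], j.counter)
				j.out = aead.Seal(nil, nonce, j.plaintext, nil)
			}
		}(aeads[w])
	}
	for i := range jobs {
		idx <- i
	}
	close(idx)
	wg.Wait() // every job is sealed once Wait returns

	out := make([][]byte, len(jobs))
	for i := range jobs {
		out[i] = jobs[i].out
	}
	return out, nil
}

func main() {
	key := make([]byte, 32) // all-zero demo key, never use in practice
	packets := make([][]byte, 8)
	for i := range packets {
		packets[i] = make([]byte, 1280)
	}
	sealed, err := sealParallel(key, packets, 4)
	if err != nil {
		panic(err)
	}
	fmt.Println("sealed packets:", len(sealed), "ciphertext bytes each:", len(sealed[0]))
}
```

Even in this toy version the extra costs mentioned above are visible: the unbuffered channel is a synchronization point on every packet, and each `Seal` call allocates a fresh output slice.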