I once explored this, hitting around 125K RPS per core on Node.js. Then I realized it was pointless: the moment you add any real work (database calls, file I/O, etc.), throughput drops below 10K RPS.
We just do the networking bits a bit differently now. DPDK was a product of its time.
Unless you can get an ASIC to do it, in which case the ASIC is massively preferable; just the power savings generally¹ end the discussion. (= remove most routers from the list; also some security appliances and load balancers.)
¹ exceptions confirm the rule, i.e. small/boutique setups
Zero copy is the important part for applications that need to saturate the NIC. For example Netflix integrated encryption into the FreeBSD kernel so they could use sendfile for zero-copy transfers from SSD (in the case of very popular titles) to a TLS stream. Otherwise they would have had two extra copies of every block of video just to encrypt it.
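For a feel of what that looks like, here's a rough sketch of the same idea via Linux kTLS (Netflix's implementation is FreeBSD-specific, so this is just the analogous path; the key material is assumed to come from a TLS handshake already done in user space):

    /* Sketch: hand an established TCP socket to the kernel TLS layer, then
     * sendfile() the media file; file pages go from the page cache straight
     * into encrypted TLS records, with no bounce through user space just to
     * encrypt. Needs kernel >= 4.13 with CONFIG_TLS. */
    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>

    #ifndef TCP_ULP
    #define TCP_ULP 31
    #endif
    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif

    static int send_encrypted(int sockfd, int filefd, off_t count,
                              const unsigned char *key, const unsigned char *iv,
                              const unsigned char *salt, const unsigned char *seq)
    {
        /* Attach the "tls" upper-layer protocol to the TCP socket. */
        if (setsockopt(sockfd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        /* Install the TX keys negotiated by the user-space handshake. */
        struct tls12_crypto_info_aes_gcm_128 ci = {
            .info.version = TLS_1_2_VERSION,
            .info.cipher_type = TLS_CIPHER_AES_GCM_128,
        };
        memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
        if (setsockopt(sockfd, SOL_TLS, TLS_TX, &ci, sizeof(ci)) < 0)
            return -1;

        /* Zero-copy send: the kernel encrypts as it builds the records. */
        return sendfile(sockfd, filefd, NULL, count) < 0 ? -1 : 0;
    }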
Note however that their actual streaming stack is very different from the application stack. The constraint isn't strictly technical: ISP colocation space is expensive, so they need to have the most juiced machines they can possibly fit in the rack to control costs.
There's an obvious appeal to accomplishing zero-copy by pushing network functionality into user space instead of application functionality into kernel space, so the DPDK evolution is natural.
I keep watching and trying io_uring and still can't get it as fast, with code as simple, or as consistent for those use cases. AF_XDP gets me partly there, but then you're writing eBPF... might as well go full DPDK.
Maybe it's a skill issue on my part, though. Or just a well-fitting niche.
I also want to get into socket I/O with io_uring in Zig. I'll try to apply everything I found in the liburing wiki [0] and see how far I can get (the fastest hardware I have is 10 Gbit/s).
Seems like there are:

- multi-shot requests (rough sketch below)
- register_napi on the uring instance
- zero-copy receive/send (probably won't be able to get into that one)

Did you already try these, or are there other configurations I could add to improve it?
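For reference, the multishot receive path I'm planning to start from looks roughly like this in C before porting it to Zig (buffer count/size are just placeholders; assumes liburing 2.4+ and a ~6.0+ kernel, error handling mostly trimmed):

    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBUFS 256
    #define BUFSZ 4096

    static int recv_loop(int sockfd)
    {
        struct io_uring ring;
        if (io_uring_queue_init(256, &ring, 0) < 0)
            return -1;

        /* Register a ring of provided buffers the kernel picks from (group 0). */
        char *bufs = malloc((size_t)NBUFS * BUFSZ);
        int ret;
        struct io_uring_buf_ring *br =
            io_uring_setup_buf_ring(&ring, NBUFS, 0, 0, &ret);
        if (!bufs || !br) {
            free(bufs);
            io_uring_queue_exit(&ring);
            return -1;
        }
        for (int i = 0; i < NBUFS; i++)
            io_uring_buf_ring_add(br, bufs + (size_t)i * BUFSZ, BUFSZ, i,
                                  io_uring_buf_ring_mask(NBUFS), i);
        io_uring_buf_ring_advance(br, NBUFS);

        /* One SQE keeps producing CQEs until EOF/error or buffer exhaustion. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = 0;
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        while (io_uring_wait_cqe(&ring, &cqe) == 0) {
            if (cqe->res <= 0)
                break;                       /* error or peer closed */
            int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
            printf("got %d bytes in buffer %d\n", cqe->res, bid);
            /* Recycle the buffer so the multishot request can keep going. */
            io_uring_buf_ring_add(br, bufs + (size_t)bid * BUFSZ, BUFSZ, bid,
                                  io_uring_buf_ring_mask(NBUFS), 0);
            io_uring_buf_ring_advance(br, 1);
            int more = cqe->flags & IORING_CQE_F_MORE;
            io_uring_cqe_seen(&ring, cqe);
            if (!more)
                break;                       /* multishot stopped; re-arm if needed */
        }
        io_uring_queue_exit(&ring);
        free(bufs);
        return 0;
    }

As far as I can tell, register_napi is a separate io_uring_register_napi() call on the same ring; I haven't gotten to that or to zero-copy send yet.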
[0]: https://github.com/axboe/liburing/wiki/io_uring-and-networki...
AF_XDP is another way to do high-performance networking in the kernel, and it's not bad.
DPDK still has a ~30% advantage over an optimized kernel-space application, but it comes with a huge maintenance burden. A lot of people reach for it, though, without optimizing the kernel interfaces first.
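For example, socket busy polling is one of the kernel-path knobs that often gets skipped; a minimal sketch (the 50 microsecond window is just an illustrative value, and there's an equivalent net.core.busy_poll sysctl):

    /* Sketch of one "tune the kernel path first" knob: busy polling, where
     * the socket spins in the driver for a short window instead of sleeping
     * on the next interrupt. */
    #include <sys/socket.h>

    #ifndef SO_BUSY_POLL
    #define SO_BUSY_POLL 46
    #endif

    static int enable_busy_poll(int sockfd)
    {
        int busy_poll_usec = 50;  /* example value, tune per workload */
        return setsockopt(sockfd, SOL_SOCKET, SO_BUSY_POLL,
                          &busy_poll_usec, sizeof(busy_poll_usec));
    }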
One other big plus of DPDK for me is the low-level access to hardware offloads: GPUDirect (when you can get it to work), StorageDirect, or most of the DMA engines available in some (not so) high-end hardware. The flow API on Mellanox hardware is the basis of many of my multi-accelerator applications (I wish they supported P4 for the packet format instead, or just open-sourced whatever low-level ISA the controller is running, but I don't buy enough gear to have a voice). Perusing the DPDK documentation can give ideas.
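To make the flow-API point concrete, a minimal rte_flow rule looks something like this (the UDP port and queue index are just example values, and real code needs error handling):

    /* Sketch: ask the NIC to steer UDP dst-port 4789 straight to RX queue 1
     * in hardware via rte_flow, so those packets never touch the CPU path
     * used by the rest of the traffic. */
    #include <rte_byteorder.h>
    #include <rte_flow.h>

    static struct rte_flow *steer_udp_to_queue(uint16_t port_id)
    {
        struct rte_flow_attr attr = { .ingress = 1 };

        struct rte_flow_item_udp udp_spec = { .hdr.dst_port = RTE_BE16(4789) };
        struct rte_flow_item_udp udp_mask = { .hdr.dst_port = RTE_BE16(0xffff) };
        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },
            { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
            { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_spec, .mask = &udp_mask },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };

        struct rte_flow_action_queue queue = { .index = 1 };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        struct rte_flow_error error;
        return rte_flow_create(port_id, &attr, pattern, actions, &error);
    }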
So, yes, very low-level with some batteries included. Good and stable for niche uses. But a far smaller hiring pool (is the io_uring-at-100Gb pool bigger? I don't know).
As vendors are eager to remind us, custom silicon to accelerate everything from L1 to L7 exists. That said, it is still the case in 2025 that the "fast path" data plane will end up passing either nothing or everything in a flow to the "slow path" control plane, where the most significant silicon is less 'ASIC' and more 'aarch64'.
This is all to say that the GP's comments are broadly correct.