
495 points guntars | 1 comments | source
bmcahren ◴[] No.44981313[source]
This was a good read and great work. Can't wait to see the performance tests.

Your write-up connected some early knowledge from when I was 11, when I was trying to set up a database/backend and kept finding lots of cgi-bin scripts online. I realize now those were spinning up a new process with each request: https://en.wikipedia.org/wiki/Common_Gateway_Interface

I remember when sendfile became available for my large gaming forum with dozens of TB of demo downloads. That alone was huge for concurrency.

I thought I had sworn off this type of engineering, but between this, the Netflix case of an extra 40ms, and the GTA 5 70% load-time reduction, maybe there is a lot more impactful work to be done.

https://netflixtechblog.com/life-of-a-netflix-partner-engine...

https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

replies(2): >>44981421 #>>44989337 #
commandersaki ◴[] No.44989337[source]
I'm sceptical of the efficiency gains with sendfile; they seem marginal at best, even in the late 90s when it was at the height of its popularity.
replies(2): >>44989857 #>>44989863 #
lossolo ◴[] No.44989857[source]
> seems marginal at best

Depends on the workload.

Normally you would go read() -> write() so:

1. Disk -> page cache (DMA)

2. Kernel -> user copy (read)

3. User -> kernel copy (write)

4. Kernel -> NIC (DMA)

sendfile():

1. Disk -> page cache (DMA)

No user-space copies; the kernel wires those pages straight to the socket.

2. Kernel -> NIC (DMA)

So basically, it eliminates 1-2 memory copies, along with the associated cache pollution and memory-bandwidth overhead. If you are running high-QPS web services where syscall and copy overheads dominate (for example, CDNs or static file serving), the gains can be really big. Based on my observations this can mean double-digit reductions in CPU usage and up to ~2x higher throughput.

replies(1): >>44991515 #
commandersaki ◴[] No.44991515{3}[source]
I understand the optimisation; I'm just sceptical that it's even that useful. It seems it'd only kick in for pathological cases where kernel round-trip time really dominates; my gut reckons most applications just don't benefit. Caddy gained sendfile support in the last few years, and in benchmarks with it toggled on and off you usually wouldn't see a discernible difference [1].

Which makes me sceptical of the argument for kTLS stated in the article: what benefit does offloading your crypto to the kernel provide (while possibly making it more brittle)? I've seen the author of haproxy say that the performance gain he's measured is only marginal, though he did point out it's useful in that you can strace your process and see plaintext instead of ciphertext, which is nice.

[1]: https://blog.tjll.net/reverse-proxy-hot-dog-eating-contest-c...