Io_uring, kTLS and Rust for zero syscall HTTPS server

(blog.habets.se)

495 points guntars | 1 comments | 22 Aug 25 03:51 UTC


bmcahren ◴[22 Aug 25 05:35 UTC] No.44981313[source]▶

This was a good read and great work. Can't wait to see the performance tests.

Your write-up connected some early knowledge from when I was 11, when I was trying to set up a database/backend and kept finding lots of cgi-bin scripts online. I realize now those were spinning up a new process for each request: https://en.wikipedia.org/wiki/Common_Gateway_Interface

I remember when sendfile became available for my large gaming forum with dozens of TB of demo downloads. That alone was huge for concurrency.

I thought I had sworn off this type of engineering, but between this, the Netflix case of an extra 40ms, and the GTA 5 70% load time reduction, maybe there is a lot more impactful work to be done.

https://netflixtechblog.com/life-of-a-netflix-partner-engine...

https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

replies(2): >>44981421 #>>44989337 #

kev009 ◴[22 Aug 25 05:56 UTC] No.44981421[source]▶

>>44981313 #

It wasn't just CGI: every HTTP session was commonly a forked copy of the entire server in the CERN and Apache lineage! Apache gradually got better answers, but its API, together with common add-ons, made the transition a bit difficult, so web servers like nginx took off; they are built closer to the architecture in the article, with event-driven I/O from the beginning.
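For readers who have not seen the contrast, here is a minimal sketch of the event-driven style: one thread multiplexes many connections through a readiness-based selector instead of forking a server copy per session. It is an illustration only, not how nginx or the article's server is implemented; the names `make_listener` and `run_event_loop` and the echo behaviour are assumptions made for the example.

```python
import selectors
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket (port 0 lets the OS pick one)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    sock.setblocking(False)
    return sock


def run_event_loop(listener, should_stop, timeout=0.05):
    """Echo data back to every client from a single thread.

    Runs until should_stop() returns True and returns the number of
    connections accepted. No process or thread is created per client.
    """
    sel = selectors.DefaultSelector()  # epoll/kqueue/poll, whichever is available
    sel.register(listener, selectors.EVENT_READ)
    clients = set()
    accepted = 0
    try:
        while not should_stop():
            for key, _events in sel.select(timeout):
                if key.fileobj is listener:
                    # The listener is readable: a new connection is waiting.
                    conn, _addr = listener.accept()
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ)
                    clients.add(conn)
                    accepted += 1
                else:
                    conn = key.fileobj
                    data = conn.recv(4096)
                    if data:
                        # Simplification: fine for small payloads. A real server
                        # would buffer and wait for EVENT_WRITE instead.
                        conn.sendall(data)
                    else:
                        # Empty read means the peer closed the connection.
                        sel.unregister(conn)
                        clients.discard(conn)
                        conn.close()
    finally:
        for conn in clients:
            conn.close()
        sel.close()
    return accepted
```

The caller owns the listening socket and decides when to stop, for example by passing `threading.Event().is_set` as `should_stop`. Per-connection state here is just a registered socket, which is the main reason this design scales further than a forked server copy per session.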

replies(3): >>44982088 #>>44983055 #>>44985024 #

tliltocatl ◴[22 Aug 25 14:20 UTC] No.44985024[source]▶

>>44981421 #

That's because the Unix API used to assume fork() is extremely cheap. Threads were an ugly performance hack and second-class citizens; they still are in some ways. The assumption was indeed true on the PDP-11 (just copy a <64KB disk file!), but as address spaces grew, it became prohibitively expensive to copy page tables, so programmers turned to multithreading. And then multicore CPUs became the norm, and multithreading on multicore CPUs meant any kind of copy-on-write required a TLB shootdown, making fork() even more expensive. VMS (and its clone known as Windows NT) did it right from the start: processes are just resource containers, the units of execution are threads, and all I/O is async. But being technically superior doesn't outweigh the disadvantage of being proprietary.
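A small sketch of the process-versus-thread distinction above, assuming a Unix-like system where `os.fork` is available. It does not measure fork() cost; it only shows the semantics that make fork() expensive to provide: the child gets its own copy-on-write view of memory, while a thread shares the parent's address space. The function names and the `hits` counter are hypothetical, chosen for the example.

```python
import os
import threading


def mutate_in_child_process(state):
    """fork(), bump state["hits"] in the child, and return the child's view.

    The child's write lands on its own copy-on-write pages, so the
    parent's dictionary is left untouched.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(read_fd)
            state["hits"] += 1
            os.write(write_fd, str(state["hits"]).encode())
        finally:
            os._exit(0)  # leave immediately, skipping the parent's cleanup handlers
    # parent process
    os.close(write_fd)
    child_view = int(os.read(read_fd, 32))
    os.close(read_fd)
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return child_view


def mutate_in_thread(state):
    """Bump state["hits"] from another thread, which shares our memory."""
    def work():
        state["hits"] += 1

    worker = threading.Thread(target=work)
    worker.start()
    worker.join()
    return state["hits"]
```

Calling `mutate_in_child_process` on a dictionary leaves the caller's copy unchanged, while `mutate_in_thread` changes it in place. Keeping those two views separate is the bookkeeping (page tables, copy-on-write faults, TLB invalidation) that the comment argues grew costly as address spaces and core counts increased.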

replies(1): >>44993913 #

1. kev009 ◴[23 Aug 25 07:02 UTC] No.44993913[source]▶

>>44985024 #

It's also a pretty bold scheduler benchmark to be handling tens of thousands of processes or 1:1 thread wakeups, especially the further back in time you go, considering fairness issues. And then that's running at the wrong latency granularity for fast I/O completion events across that many nodes, so it's going to run like a screen door on a submarine without a lot of rethinking.

Evented I/O works out pretty well in practice for the I and D cache, especially if you can affine and allocate things as the article states, and do similar natural alignments inside the kernel (i.e. RSS/consistent hashing).
