Io_uring, kTLS and Rust for zero syscall HTTPS server

(blog.habets.se)

Show context

Seattle3503 ◴[22 Aug 25 05:44 UTC] No.44981374[source]▶

> For example when submitting a write operation, the memory location of those bytes must not be deallocated or overwritten.

> The io-uring crate doesn’t help much with this. The API doesn’t allow the borrow checker to protect you at compile time, and I don’t see it doing any runtime checks either.

I've seen comments like this before[1], and I get the impression that building a a safe async Rust library around io_uring is actually quite difficult. Which is sort of a bummer.

IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".

[1] https://boats.gitlab.io/blog/post/io-uring/

replies(7): >>44981390 #>>44981469 #>>44981966 #>>44982846 #>>44983850 #>>44983930 #>>44989979 #

newpavlov ◴[22 Aug 25 10:32 UTC] No.44982846[source]▶

>>44981374 #

This actually one of my many gripes about Rust async and why I consider it a bad addition to the language in the long term. The fundamental problem is that rust async was developed when epoll was dominant (and almost no one in the Rust circles cared about IOCP) and it has heavily influenced the async design (sometimes indirectly through other languages).

Think about it for a second. Why do we not have this problem with "synchronous" syscalls? When you call `read` you also "pass mutable borrow" of the buffer to the kernel, but it maps well into the Rust ownership/borrow model since the syscall blocks execution of the thread and there are no ways to prevent it in user code. With poll-based async model you side-step this issues since you use the same "sync" syscalls, but which are guaranteed to return without blocking.

For a completion-based IO to work properly with the ownership/borrow model we have to guarantee that the task code will not continue execution until it receives a completion event. You simply can not do it with state machines polled in user code. But the threading model fits here perfectly! If we are to replace threads with "green" threads, user Rust code will look indistinguishable from "synchronous" code. And no, the green threads model can work properly on embedded systems as demonstrated by many RTOSes.

There are several ways of how we could've done it without making the async runtime mandatory for all targets (the main reason why green threads were removed from Rust 1.0). My personal favorite is introduction of separate "async" targets.

Unfortunately, the Rust language developers made a bet on the unproved polling stackless model because of the promised efficiency and we are in the process of finding out whether the bet plays of or not.

replies(3): >>44983562 #>>44984589 #>>44984882 #

duped ◴[22 Aug 25 13:41 UTC] No.44984589[source]▶

>>44982846 #

> You simply can not do it with state machines polled in user code

That's not really true. The only guarantees in Rust futures are that they are polled() once and must have their Waker's wake() called before they are polled again. A completion based future submits the request on first poll and calls wake() on completion. That's kind of the interesting design of futures in Rust - they support polling and completion.

The real conundrum is that the futures are not really portable across executors. For io_using for example, the executor's event loop is tightly coupled with submission and completion. And due to instability of a few features (async trait, return impl trait in trait, etc) there is not really a standard way to write executor independent async code (you can, some big crates do, but it's not necessarily trivial).

Combine that with the fact that container runtimes disable io_uring by default and most people are deploying async web servers in Docker containers, it's easy to see why development has stalled.

It's also unfair to mischaracterize design goals and ideas from 2016 with how the ecosystem evolved over the last decade, particularly after futures were stabilized before other language items and major executors became popular. If you look at the RFCs and blog posts back then (eg: https://aturon.github.io/tech/2016/09/07/futures-design/) you can see why readiness was chosen over completion, and how completion can be represented with readiness. He even calls out how naïve completion (callbacks) leads to more allocation on future composition and points to where green threads were abandoned.

replies(3): >>44984765 #>>44988043 #>>44988514 #

newpavlov ◴[22 Aug 25 13:57 UTC] No.44984765[source]▶

>>44984589 #

No, the fundamental problem (in the context of io-uring) is that futures are managed by user code and can be dropped at any time. This often referred as "cancellation safety". Imagine a future has initialized completion-based IO with buffer which is part of the future state. User code can simply drop the future (e.g. if it was part of `select!`) and now we have a huge problem on our hands: the kernel will write into a dropped buffer! In the synchronous context it's equivalent to de-allocating thread stack under foot of the thread which is blocked on a synchronous syscall. You obviously can do it (using safe code) in thread-based code, but it's fine to do in async.

This is why you have to use various hacks when using io-uring based executors with Rust async (like using polling mode or ring-owned buffers and additional data copies). It could be "resolved" on the language level with an additional pile of hacks which would implement async Drop, but, in my opinion, it would only further hurt consistency of the language.

>He even calls out how naïve completion (callbacks) leads to more allocation on future composition and points to where green threads were abandoned.

I already addressed it in the other comment.

replies(2): >>44984811 #>>44985652 #

duped ◴[22 Aug 25 14:02 UTC] No.44984811[source]▶

>>44984765 #

That problem exists regardless of whether you want to use stackful coroutines or not. The stack could be freed by user code at anytime. It could also panic and drop buffers upon unwinding.

I wouldn't call async drop a pile of hacks, it's actually something that would be useful in this context.

And that said there's an easy fix: don't use the pointers supplied by the future!

replies(1): >>44984977 #

newpavlov ◴[22 Aug 25 14:15 UTC] No.44984977[source]▶

>>44984811 #

>That problem exists regardless of whether you want to use stackful coroutines or not. The stack could be freed by user code at anytime. It could also panic and drop buffers upon unwinding.

Nope. The problem does not exist in the stackfull model by the virtue of user being unable (in safe code) to drop stack of a stackfull task similarly to how you can not drop stack of a thread. If you want to cancel a stackfull task, you have to send a cancellation signal to it and wait for its completion (i.e. cancellation is fully cooperative). And you can not fundamentally panic while waiting for a completion event, the task code is "frozen" until the signal is received.

>it's actually something that would be useful in this context.

Yes, it's useful to patch a bunch of holes introduced by the Rust async model and only for that. And this is why I call it a bunch of hacks, especially considering the fundamental issues which prevent implementation of async Drop. A properly designed system would've properly worked with the classic Drop.

>And that said there's an easy fix: don't use the pointers supplied by the future!

It's always amusing when Rust async advocates say that. Let met translate: don't use `let mut buf = [0u8; 16]; socket.read_all(&mut buf).await?;`. If you can't see why such arguments are bonkers, we don't have anything left to talk about.

replies(2): >>44985638 #>>44985662 #

oconnor663 ◴[22 Aug 25 15:14 UTC] No.44985638[source]▶

>>44984977 #

> don't use `let mut buf = [0u8; 16]; socket.read_all(&mut buf).await?;`. If you can't see why such arguments are bonkers, we don't have anything left to talk about.

It doesn't seem bonkers to me. I know you already know these details, but spelling it out: If I'm using select/poll/epoll in C to do non-blocking reads of a socket, then yes I can use any old stack buffer to receive the bytes, because those are readiness APIs that only write through my pointer "now or never". But if I'm using IOCP/io_uring, I have to be careful not to use a stack buffer that doesn't outlive the whole IO loop, because those are completion APIs that write through my pointer "later". This isn't just a question of the borrow checker being smart enough to analyze our code; it's a genuine difference in what correct code needs to do in these two different settings. So if async Rust forces us to use heap allocated (or long-lived in some other way) buffers to do IOCP/io_uring reads, is that a failure of the async model, or is that just the nature of systems programming?

replies(1): >>44985901 #

1. newpavlov ◴[22 Aug 25 15:38 UTC] No.44985901[source]▶

>>44985638 #

>is that a failure of the async model

This, 100%. Being really generous, it can be called a leaky model which is poorly compatible with completion-based APIs.

replies(1): >>44987957 #

2. vlovich123 ◴[22 Aug 25 18:26 UTC] No.44987957[source]▶

>>44985901 (TP) #

The leaky model is that you could ever receive into a stack buffer and you're arguing to persist this model. The reason it's leaky is that copying memory around is supremely expensive. But that's how the BSD socket API from the 90s works and btw something you can make work with async provided you're into memory copies. io_uring is a modern API that's for performance and that's why Rust libraries try to avoid memory copying within the internals. Supporting copying into the stack buffer with io_uring is very difficult to accomplish even in synchronous code. It's not a failure of async but a different programming paradigm altogether.

As someone else mentioned, what you really want is to ask io_uring to allocate the pages itself so that for reads it gives you pages that were allocated by the kernel to be filled directly by HW and then mapped into your userspace process without any copying by the kernel or any other SW layer involved.

replies(2): >>44989382 #>>44996320 #

3. Asmod4n ◴[22 Aug 25 20:24 UTC] No.44989382[source]▶

>>44987957 #

there is just one catch.

Using the feature to let io_uring handle buffers for you limits you to the mem lock limit of a process, which is 8MB on a typical debian install (more on others) And that's a hard limit unless you got root access to said machine.

replies(1): >>44989420 #

4. vlovich123 ◴[22 Aug 25 20:27 UTC] No.44989420{3}[source]▶

>>44989382 #

Sure, that's the most efficient way. But you can still have the user allocate a read buffer, pass it to the read API & receive it on the way out. In fact, unlike what OP claimed, this is actually more efficient since you could safely avoid unnecessarily initializing this buffer safely (by truncating to the length read before returning) whereas safely using uninitialized buffers is kind of tricky.

5. zbentley ◴[23 Aug 25 14:34 UTC] No.44996320[source]▶

>>44987957 #

> what you really want is to ask io_uring to allocate the pages itself so that for reads it gives you pages that were allocated by the kernel

Okay, but what about writes? If I have a memory region that I want io_uring to write, it's a major pain in the ass to manage the lifetime of objects in that region in a safe way. My choices are basically: manually manage the lifetime and only allow it to be dropped when I see a completion show up (this is what most everything does now, and it's a) hard to get right and b) limited in many ways, e.g. it's heap-only), or permanently leak that memory as unusable.

replies(1): >>44999559 #

6. vlovich123 ◴[23 Aug 25 22:16 UTC] No.44999559{3}[source]▶

>>44996320 #

You ask the I/O system for a writable buffer. When you fill it up, you hand it off. Once the I/o finishes, it goes back into the available pool of memory to write with. This is how high performance I/O works.

replies(1): >>45004570 #

7. zbentley ◴[24 Aug 25 14:35 UTC] No.45004570{4}[source]▶

>>44999559 #

Okay, but . . . how would that work? A syscall gives back a pointer (I thought the point was to avoid syscalls/context switches)? An io_malloc userspace function (great, now how do I manage lifetimes of the buffers it hands out)? Something else?

replies(1): >>45070568 #

8. vlovich123 ◴[29 Aug 25 23:32 UTC] No.45070568{5}[source]▶

>>45004570 #

The memory is allocated by the runtime that has the io_uring backend. You ask it for memory which it manages in its own memory allocator. Lifetime is managed no differently than Vec. For example, when you drop the DmaBuffer [1] it goes back into the pool. Or you hand it off as an I/O submission after filling it up.

The memory frequently needs to be mlocked memory anyway, so a general purpose allocator doesn't work.

[1] https://docs.rs/glommio/latest/glommio/fn.allocate_dma_buffe...

↑