Your write up connected some early knowledge from when I was 11 where I was trying to set up a database/backend and was finding lots of cgi-bin online. I realize now those were spinning up new processes with each request https://en.wikipedia.org/wiki/Common_Gateway_Interface

I remember when sendfile became available for my large gaming forum with dozens of TB of demo downloads. That alone was huge for concurrency.

I thought I had swore off this type of engineering but between this, the Netflix case of extra 40ms and the GTA 5 70% load time reduction maybe there is a lot more impactful work to be done.

https://netflixtechblog.com/life-of-a-netflix-partner-engine...

https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

replies(2): >>44981421 #>>44989337 #

5. 6r17 ◴[22 Aug 25 05:36 UTC] No.44981322[source]▶

>>44980865 (OP) #

I really want to see the benchmarks on this ; tried it like 4 days ago and then built a standard epoll implementation ; I could not compete against nginx using uring but that's not the easiest task for an arrogant night so I really hope you get some deserved sweet numbers ; mine were a sad deception but I did not do most of your implementation - rather simply tried to "batch" calls. Wish you the best of luck and much fun

6. ValtteriL ◴[22 Aug 25 05:40 UTC] No.44981349[source]▶

>>44980865 (OP) #

Excellent read. I'd like to see DPDK style full kernel bypass next

replies(1): >>44981754 #

7. Seattle3503 ◴[22 Aug 25 05:44 UTC] No.44981374[source]▶

>>44980865 (OP) #

> For example when submitting a write operation, the memory location of those bytes must not be deallocated or overwritten.

> The io-uring crate doesn’t help much with this. The API doesn’t allow the borrow checker to protect you at compile time, and I don’t see it doing any runtime checks either.

I've seen comments like this before[1], and I get the impression that building a a safe async Rust library around io_uring is actually quite difficult. Which is sort of a bummer.

IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".

[1] https://boats.gitlab.io/blog/post/io-uring/

replies(7): >>44981390 #>>44981469 #>>44981966 #>>44982846 #>>44983850 #>>44983930 #>>44989979 #

8. JoshTriplett ◴[22 Aug 25 05:48 UTC] No.44981390[source]▶

>>44981374 #

I think the right way to build a safe interface around io_uring would be to use ring-owned buffers, ask the ring for a buffer when you want one, and give the buffer back to the ring when initiating a write.

replies(2): >>44981470 #>>44982162 #

9. kev009 ◴[22 Aug 25 05:56 UTC] No.44981421[source]▶

>>44981313 #

It wasn't just CGI, every HTTP session was commonly a forked copy of the entire server in the CERN and Apache lineage! Apache gradually had better answers, but their API with common addons made it a bit difficult to transition so webservers like nginx took off which are built closer to the architecture in the article with event driven I/O from the beginning.

replies(3): >>44982088 #>>44983055 #>>44985024 #

10. kev009 ◴[22 Aug 25 06:02 UTC] No.44981451[source]▶

>>44981408 #

FWIW Rust advice is maybe 15% of the bottom of the article, most of the decisions apply equally to C and the article is a fairly sensible survey of APIs.

11. jcranmer ◴[22 Aug 25 06:06 UTC] No.44981469[source]▶

>>44981374 #

There is, I think, an ownership model that Rust's borrow checker very poorly supports, and for lack of a better name, I've called it hot potato ownership. The basic idea is that you have a buffer which you can give out as ownership in the expectation that the person you gave it to will (eventually) give it back to you. It's a sort of non-lexical borrowing problem, and I very quickly discovered when trying to implement it myself in purely safe Rust that the "giving the buffer back" is just really gnarly to write.

replies(3): >>44981493 #>>44981689 #>>44982450 #

12. pingiun ◴[22 Aug 25 06:06 UTC] No.44981470{3}[source]▶

>>44981390 #

This is something that Amos Wenger (fasterthanlime) has worked on: https://github.com/bearcove/loona/blob/main/crates/buffet/RE...

13. stouset ◴[22 Aug 25 06:11 UTC] No.44981493{3}[source]▶

>>44981469 #

Maybe I’m misunderstanding, but why is that not possible with a

    Fn(_: T) -> T

replies(3): >>44981684 #>>44981785 #>>44982627 #

14. iknowstuff ◴[22 Aug 25 06:53 UTC] No.44981684{4}[source]▶

>>44981493 #

It totally is

https://docs.rs/tokio-uring/latest/tokio_uring/fs/struct.Fil...

15. tayo42 ◴[22 Aug 25 06:54 UTC] No.44981689{3}[source]▶

>>44981469 #

Refcel didn't work? Or rc?

replies(1): >>44982100 #

16. spaintech ◴[22 Aug 25 07:09 UTC] No.44981754[source]▶

>>44981349 #

Not sure if you are aware of this, but LUNA does this already.

https://www.usenix.org/system/files/atc23-zhu-lingjun.pdf

17. dwattttt ◴[22 Aug 25 07:15 UTC] No.44981785{4}[source]▶

>>44981493 #

As sibling notes, it is. It's very rarely seen though.

One place you might see something like it is if an API takes ownership, but returns it on error; you see the error side carry the resource you gave it, so you could try again.

18. Imustaskforhelp ◴[22 Aug 25 07:20 UTC] No.44981811[source]▶

>>44980865 (OP) #

Such a good read.

I am patient to wait for the benchmarks so take your time ,but I honestly love how the author doesn't care about benchmarks right now and wanted to clean the code first. Its kinda impressive that there are people who have such line of thinking in this world where benchmarks gets maxxed and whole project's sole existence is to satisfy benchmarks.

Really a breath of fresh air and honestly I admire the author so much for this. It was such a good read, loved it a lot thank you. Didn't know ktls existed or Io_uring could be used in such a way.

19. mgaunard ◴[22 Aug 25 07:44 UTC] No.44981924[source]▶

>>44980865 (OP) #

"zero syscall"

> In order to avoid busy looping, both the kernel and the web server will only busy-loop checking the queue for a little bit (configurable, but think milliseconds), and if there’s nothing new, the web server will do a syscall to “go to sleep” until something gets added to the queue.

replies(3): >>44981958 #>>44982011 #>>44982686 #

20. KolmogorovComp ◴[22 Aug 25 07:52 UTC] No.44981958[source]▶

>>44981924 #

It’s good to read an article until the end

> This means that a busy web server can serve all of its queries without even once (after setup is done) needing to do a syscall. As long as queues keep getting added to, strace will show nothing.

21. ozgrakkurt ◴[22 Aug 25 07:53 UTC] No.44981966[source]▶

>>44981374 #

You don’t have to represent everything with borrows. You can just use data structures like Slab to make it cancel safe.

As an example this library I wrote before is cancel safe and doesn’t use lifetimes etc. for it.

https://github.com/steelcake/io2

replies(1): >>44983102 #

22. nly ◴[22 Aug 25 08:02 UTC] No.44982011[source]▶

>>44981924 #

Like all polling I/O models (that don't spin) it also means you have to wait milliseconds in the worst case to start servicing a request. That's a long time.

For comparison a read/write over a TCP socket on loopback between two process is a few microseconds using BSD sockets API.

replies(1): >>44982195 #

23. avar ◴[22 Aug 25 08:17 UTC] No.44982088{3}[source]▶

>>44981421 #

    every HTTP session was commonly a forked
    copy of the entire server in the CERN
    and Apache lineage!

And there's nothing wrong with that for application workers. On *nix systems fork() is very fast, you can fork "the entire server" and the kernel will only COW your memory. As nginx etc. showed you can get better raw file serving performance with other models, but it's still a legitimate technique for application logic where business logic will drown out any process overhead.

replies(2): >>44982384 #>>44982389 #

24. rfoo ◴[22 Aug 25 08:18 UTC] No.44982100{4}[source]▶

>>44981689 #

Slapping Rc<T> over something that could be clearly uniquely owned is a sign of very poorly designed lifetime rules / system.

And yes, for now async Rust is full of unnecessary Arc<T> and is very poorly made.

replies(1): >>44982580 #

25. Tuna-Fish ◴[22 Aug 25 08:31 UTC] No.44982162{3}[source]▶

>>44981390 #

This works perfectly well, and allows using the type system to handle safety. But it also really limits how you handle memory, and makes it impossible to do things like filling out parts of existing objects, so a lot of people are reluctant to take the plunge.

replies(1): >>44984196 #

26. klabb3 ◴[22 Aug 25 08:37 UTC] No.44982195{3}[source]▶

>>44982011 #

> Like all polling I/O models (that don't spin) it also means you have to wait milliseconds in the worst case to start servicing a request. That's a long time.

No? What they're saying is the busy loop will spin until an event occurs, for at most x ms. And if it does park the thread (the only syscall required), it can be immediately woken up on the first event too. Only if multiple events occurred since the last call would you receive them together. This normally happens only under high load, when event processing takes enough time to have a buildup of new events in the background. Increased latency is the intended outcome on high loads.

To be fair, it was a while ago I read the io-uring paper. But I distinctly recall the mix of poll and park behavior, plus configurable wait conditions. Please correct me if I'm wrong (someone here certainly knows).

27. tsimionescu ◴[22 Aug 25 09:07 UTC] No.44982384{4}[source]▶

>>44982088 #

Forking for anything other than calling exec is still a horrible idea (with special exceptions like shells). Forking is a very unsafe operation (you can easily share locks and files with the child process unless both your code and every library you use is very careful - for example, it's easy to get into malloc deadlocks with forked processes), and its performance depends a lot on how you actually use it.

replies(1): >>44996499 #

28. josephg ◴[22 Aug 25 09:08 UTC] No.44982389{4}[source]▶

>>44982088 #

So long as you have something like nginx in front of your server. Otherwise your whole site can be taken down by a slowloris attack over a 33.6k modem.

29. pornel ◴[22 Aug 25 09:17 UTC] No.44982450{3}[source]▶

>>44981469 #

This can be done with exclusively owned objects. That's how io_uring abstractions work in Rust – you give your (heap allocated) buffer to a buffer pool, and get it back when the operation is done.

&mut references are exclusive and non-copyable, so the hot potato approach can even be used within their scope.

But the problem in Rust is that threads can unwind/exit at any time, invalidating buffers living on the stack, and io_uring may use the buffer for longer than the thread lives.

The borrow checker only checks what code is doing, but doesn't have power to alter runtime behavior (it's not a GC after all), so it only can prevent io_uring abstractions from getting any on-stack buffers, but has no power to prevent threads from unwinding to make on-stack buffer safe instead.

replies(2): >>44983643 #>>44987564 #

30. LAC-Tech ◴[22 Aug 25 09:31 UTC] No.44982527[source]▶

>>44980865 (OP) #

I think rusts glacial compile times prevent it from being a useful platform for web apps. Yes it's a nice language, and very performant, but it's horrible devex to have to wait seconds for your server to recompile after a change.

replies(3): >>44982707 #>>44984093 #>>44986057 #

31. zozbot234 ◴[22 Aug 25 09:41 UTC] No.44982580{5}[source]▶

>>44982100 #

If the thread can be dropped while the buffer is "owned" by the kernel io-uring facilities (to be given back when the operation completes) that's not "unique" ownership. The existing Rc/Arc<T> may be overkill for that case, but something very much like it will still be needed.

32. IshKebab ◴[22 Aug 25 09:49 UTC] No.44982627{4}[source]▶

>>44981493 #

How is that different to

  Fn(_: &mut T)

?

replies(1): >>44983405 #

33. thomashabets2 ◴[22 Aug 25 10:03 UTC] No.44982686[source]▶

>>44981924 #

Under load it's zero syscall (barring any rare allocations inside rustls for the handshake. I can't guarantee that it never does).

Without load the overhead of calling (effectively) sleep() is, while technically true, not relevant.

But sure, you can tweak the busyloop timers and burn 100% CPU on kernel and user side indefinitely if you want to avoid that sleep-when-idle syscall. It's just… not a good idea.

replies(1): >>44988195 #

34. maeln ◴[22 Aug 25 10:07 UTC] No.44982707[source]▶

>>44982527 #

> but it's horrible devex to have to wait seconds for your server to recompile after a change.

What a time to be alived that seconds to recompile is consider horrible devex.

replies(2): >>44986592 #>>44991470 #

35. phrotoma ◴[22 Aug 25 10:07 UTC] No.44982712[source]▶

>>44980865 (OP) #

Anybody know what the state of kTLS is? I asked one of the Cilium devs about it a while ago'cause I'd seen Thomas Graf excitedly talking about it and he told me that kernel support in many distros was lacking so they aren't ready to enable it by default.

replies(2): >>44984358 #>>44990810 #

36. bullen ◴[22 Aug 25 10:19 UTC] No.44982779[source]▶

>>44980865 (OP) #

So far everything after epoll that I have compared with falls short.

So to reimplement my foundation (with all the bugs) will not be worth it.

I will however compare Javas NIO (epoll) with the new Virtual Threads IO (without pinning).

http://github.com/tinspin/rupy

replies(1): >>44985050 #

37. fuy ◴[22 Aug 25 10:26 UTC] No.44982814[source]▶

>>44981223 #

perf and look at stack traces (or off-cpu events for waits/locks). also, ebpf

38. newpavlov ◴[22 Aug 25 10:32 UTC] No.44982846[source]▶

>>44981374 #

This actually one of my many gripes about Rust async and why I consider it a bad addition to the language in the long term. The fundamental problem is that rust async was developed when epoll was dominant (and almost no one in the Rust circles cared about IOCP) and it has heavily influenced the async design (sometimes indirectly through other languages).

Think about it for a second. Why do we not have this problem with "synchronous" syscalls? When you call `read` you also "pass mutable borrow" of the buffer to the kernel, but it maps well into the Rust ownership/borrow model since the syscall blocks execution of the thread and there are no ways to prevent it in user code. With poll-based async model you side-step this issues since you use the same "sync" syscalls, but which are guaranteed to return without blocking.

For a completion-based IO to work properly with the ownership/borrow model we have to guarantee that the task code will not continue execution until it receives a completion event. You simply can not do it with state machines polled in user code. But the threading model fits here perfectly! If we are to replace threads with "green" threads, user Rust code will look indistinguishable from "synchronous" code. And no, the green threads model can work properly on embedded systems as demonstrated by many RTOSes.

There are several ways of how we could've done it without making the async runtime mandatory for all targets (the main reason why green threads were removed from Rust 1.0). My personal favorite is introduction of separate "async" targets.

Unfortunately, the Rust language developers made a bet on the unproved polling stackless model because of the promised efficiency and we are in the process of finding out whether the bet plays of or not.

replies(3): >>44983562 #>>44984589 #>>44984882 #

39. jabl ◴[22 Aug 25 11:07 UTC] No.44983055{3}[source]▶

>>44981421 #

To nitpick at least as of Apache HTTPD 1.3 ages ago it wasn't forking for every request, but had a pool of already forked worker processes with each handling one connection at a time but could handle an unlimited number of connections sequentially, and it could spawn or kill worker processes depending on load.

The same model is possible in Apache httpd 2.x with the "prefork" mpm.

replies(1): >>44989797 #

40. ozgrakkurt ◴[22 Aug 25 11:14 UTC] No.44983102{3}[source]▶

>>44981966 #

Just realised my code isn’t cancel safe either. It is invalid if the user just drops a read future and the buffer itself while the operation is in the kernel.

It is just a PITA to get it fully right.

Probably need the buffer to come from the async library so user allocates the buffers using the async library like a sibling comment says.

It is just much easier to not use Rust and say futures should run fully always and can’t be just dropped and make some actual progress. So I’m just doing it in zig now

41. Soft ◴[22 Aug 25 11:54 UTC] No.44983405{5}[source]▶

>>44982627 #

In the former the caller does not retain access to T until Fn returns.

replies(2): >>44984480 #>>44984586 #

42. kibwen ◴[22 Aug 25 12:13 UTC] No.44983562{3}[source]▶

>>44982846 #

> The fundamental problem is that rust async was developed when epoll was dominant (and almost no one in the Rust circles cared about IOCP)

No, this is a mistaken retelling of history. The Rust developers were not ignorant of IOCP, nor were they zealous about any specific async model. They went looking for a model that fit with Rust's ethos, and completion didn't fit. Aaron Turon has an illuminating post from 2016 explaining their reasoning: https://aturon.github.io/tech/2016/09/07/futures-design/

See the section "Defining futures":

There’s a very standard way to describe futures, which we found in every existing futures implementation we inspected: as a function that subscribes a callback for notification that the future is complete.

Note: In the async I/O world, this kind of interface is sometimes referred to as completion-based, because events are signaled on completion of operations; Windows’s IOCP is based on this model.

[...] Unfortunately, this approach nevertheless forces allocation at almost every point of future composition, and often imposes dynamic dispatch, despite our best efforts to avoid such overhead.

[...] TL;DR, we were unable to make the “standard” future abstraction provide zero-cost composition of futures, and we know of no “standard” implementation that does so.

[...] After much soul-searching, we arrived at a new “demand-driven” definition of futures.

I'm not sure where this meme came from where people seem to think that the Rust devs rejected a completion-based scheme because of some emotional affinity for epoll. They spent a long time thinking about the problem, and came up with a solution that worked best for Rust's goals. The existence of a usable io_uring in 2016 wouldn't have changed the fundamental calculus.

replies(2): >>44983784 #>>44992207 #

43. api ◴[22 Aug 25 12:18 UTC] No.44983599[source]▶

>>44980865 (OP) #

This is impressive but it’s also an amazing amount of complexity and difficult programming to work around the fact that syscalls are so slow.

It seems like there’s these fundamental things in OSes that we just can’t improve, or I suppose can’t without breaking too much backward compatibility, so we are forced to do this.

replies(1): >>44988360 #

44. alfiedotwtf ◴[22 Aug 25 12:22 UTC] No.44983643{4}[source]▶

>>44982450 #

In my universe, `let` wouldn’t exist… instead there would only be 3 ways to declare variables:

  1. global my_global_var: GlobalType = …
  2. heap my_heap_var: HeapType = …
  3. stack my_stack_var: StackType = …

Global types would need to implement a global trait to ensure mutual exclusion (waves hands).

So by having the location of allocation in the type itself, we no longer have to do boxing mental gymnastics

replies(2): >>44984696 #>>44986248 #

45. newpavlov ◴[22 Aug 25 12:34 UTC] No.44983784{4}[source]▶

>>44983562 #

>which we found in every existing futures implementation we inspected

This is exactly what I meant when I wrote about the indirect influence from other languages. People may dress it up as much as they want, but it's clear that polling was the most important model at the time (outside of the Windows world) and a lot of design consideration was put into being compatible with it. The Rust async model literally uses the polling terminology in its most fundamental interfaces!

>this approach nevertheless forces allocation at almost every point of future composition

This is only true in the narrow world of modeling async execution with futures. Do you see heap allocations in Go on each equivalent of "future composition" (i.e. every function call)? No, you do not. With the stackfull models you allocate a full stack for your task and you model function calls as plain function calls without any future composition shenaniganry.

Yes, the stackless model is more efficient memory-wise and allows for some additional useful tricks (like sharing future stacks in `join!`). But the stackfull model is perfectly efficient for 95+% of use cases, fits better with the borrow/ownership model, does not result in the `.await` noise, does not lead to the horrible ecosystem split (including split between different executors), and does not need the language-breaking hacks like `Pin` (see the `noalias` exception made for it). And I believe it's possible to close the memory efficiency gap between the models with certain compiler improvements (tracking maximum stack usage bound for functions and introducing a separate async ABI with two separate stacks).

>The existence of a usable io_uring in 2016 wouldn't have changed the fundamental calculus.

IIRC the first usable versions of io-uring very released approximately during the time when the Rust async was undergoing stabilization. I am really confident that if the async system was designed today we would've had a totally different model. Importance of completion-based models has only grown since then not only because of the sane async file IO, but also because of Spectre and Meltdown.

replies(1): >>44984545 #

46. aliceryhl ◴[22 Aug 25 12:40 UTC] No.44983850[source]▶

>>44981374 #

> IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".

Well, I think there is interest, but mostly for file IO.

For file IO, the situation is pretty simple. We already have to implement that using spawn_blocking, and spawn_blocking has the exact same buffer challenges as io_uring does, so translating file IO to io_uring is not that tricky.

On the other hand, I don't think tokio::net's existing APIs will support io_uring. Or at least they won't support the buffer-based io_uring APIs; there is no reason they can't register for readiness through io_uring.

replies(1): >>44984017 #

47. johncolanduoni ◴[22 Aug 25 12:46 UTC] No.44983930[source]▶

>>44981374 #

It’s annoying but possible to do this correctly and not have the API be too bad. The “happy path” of a clean success or error is fine if you accept that buffers can’t just be simple &[u8] slices. Cancellation can be handled safely with something like the following API contract:

Have your function signature be async fn read(buffer: &mut Vec<u8>) -> Result<…>’ (you can use something more convenient like ‘&mut BytesMut’ too). If you run the future to completion (success or failure), the argument holds the same buffer passed in, with data filled in appropriately on success. If you cancel/drop the future, the buffer may point at an empty allocation instead (this is usually not an annoying constraint for most IO flows, and footgun potential is low).

The way this works is that your library “takes” the underlying allocation before starting the operation out of the variable, replacing it with the default unallocated ‘Vec<u8>’. Once the buffer is no longer used by the IO system, it puts it back before returning. If you cancel, it manages the buffer in the background to release it when safe and the unallocated buffer is left in the passed variable.

replies(1): >>44984454 #

48. johncolanduoni ◴[22 Aug 25 12:53 UTC] No.44984017{3}[source]▶

>>44983850 #

This covers probably 90% of the usefulness of io_uring for non-niche applications. Its original purpose was doing buffered async file IO without a bunch of caveats that make it effectively useless. The biggest speed up I’ve found with it is ‘stat’ing large sets of files in the VFS cache. It can literally be 50x faster at that, since you can do 1000 files with a single systemcall and the data you need from the kernel is all in memory.

High throughput network usecases that don’t need/want AF_XDP or DPDK can get most of the speedup with ‘sendmmsg/recvmmsg’ and segmentation offload.

replies(2): >>44985107 #>>44985151 #

49. j-krieger ◴[22 Aug 25 12:59 UTC] No.44984093[source]▶

>>44982527 #

Compile times aren’t glacial and will be much faster with the new trait solver and cranelift.

50. j-krieger ◴[22 Aug 25 13:01 UTC] No.44984119[source]▶

>>44980865 (OP) #

I do wonder if this would make for an excellent exfil implant since it doesn‘t register syscalls.

replies(1): >>44996576 #

51. johncolanduoni ◴[22 Aug 25 13:07 UTC] No.44984196{4}[source]▶

>>44982162 #

That’s annoying for people writing bespoke low-level networking code, but for a high-level HTTP library it’s a rounding error in the overall complexity on display. I think the bigger barrier for Tokio is that the interplay between having an epoll instance and a io_uring instance on the same pool is problematic and can erase performance gains. If done greenfield you could implement the “normal” APIs with ‘IORING_OP_POLL_ADD’, but not all of the exposed ‘mio’ surface area can work this way - only the oneshot API.

52. drewg123 ◴[22 Aug 25 13:22 UTC] No.44984358[source]▶

>>44982712 #

That's a shame. How hard is it to enable? Do you need a custom kernel, or can you enable it at runtime?

On FreeBSD, its been in the kernel / openssl since 13, and has been one runtime toggle (sysctl kern.ipc.tls.enable=1) away from being enabled. And its enabled by default in the upcoming FreeBSD-15.

We (at Netflix) have run all of our tls encrypted streaming over kTLS for most of a decade.

53. andyferris ◴[22 Aug 25 13:30 UTC] No.44984454{3}[source]▶

>>44983930 #

It sounds like this would be better modelled by passing ownership of the buffer and expecting it to be returned on the success (ok) case. What you described doesn't seem compatible with what I would call a mutable borrow (mutate the contents of a Vec<u8>).

Or maybe I've misunderstood?

replies(1): >>44991486 #

54. andyferris ◴[22 Aug 25 13:32 UTC] No.44984480{6}[source]▶

>>44983405 #

I think I'm lost. If I give a mutable reference to a function... I can't access it (even read it) until it returns, no?

What is different?

replies(1): >>44996390 #

55. klaussilveira ◴[22 Aug 25 13:34 UTC] No.44984503[source]▶

>>44980865 (OP) #

For anyone wanting to learn more about how to create a small server with io_uring: https://unixism.net/2020/04/io-uring-by-example-article-seri...

56. kibwen ◴[22 Aug 25 13:37 UTC] No.44984545{5}[source]▶

>>44983784 #

> But the stackfull model is

The existence of advantages doesn't change anything here. The problems is that the disadvantages made this approach a non-starter, despite a lot of effort to make it work. Tradeoffs exist in language design, and the approaches were judged accordingly. What works for Go doesn't necessarily work for Rust, because they target different domains.

> I am really confident that if the async system was designed today we would've had a totally different model

No, without solving the original problems, the outcome would be the same. The Rust devs at the time were well aware of io_uring.

replies(1): >>44985952 #

57. IshKebab ◴[22 Aug 25 13:41 UTC] No.44984586{6}[source]▶

>>44983405 #

That's true of mutable references too though isn't it? In fact lots of people have suggested they should really have been called "exclusive references", since you can actually mutate some objects through non-exclusive references (called "interior mutability" normally).

58. duped ◴[22 Aug 25 13:41 UTC] No.44984589{3}[source]▶

>>44982846 #

> You simply can not do it with state machines polled in user code

That's not really true. The only guarantees in Rust futures are that they are polled() once and must have their Waker's wake() called before they are polled again. A completion based future submits the request on first poll and calls wake() on completion. That's kind of the interesting design of futures in Rust - they support polling and completion.

The real conundrum is that the futures are not really portable across executors. For io_using for example, the executor's event loop is tightly coupled with submission and completion. And due to instability of a few features (async trait, return impl trait in trait, etc) there is not really a standard way to write executor independent async code (you can, some big crates do, but it's not necessarily trivial).

Combine that with the fact that container runtimes disable io_uring by default and most people are deploying async web servers in Docker containers, it's easy to see why development has stalled.

It's also unfair to mischaracterize design goals and ideas from 2016 with how the ecosystem evolved over the last decade, particularly after futures were stabilized before other language items and major executors became popular. If you look at the RFCs and blog posts back then (eg: https://aturon.github.io/tech/2016/09/07/futures-design/) you can see why readiness was chosen over completion, and how completion can be represented with readiness. He even calls out how naïve completion (callbacks) leads to more allocation on future composition and points to where green threads were abandoned.

replies(3): >>44984765 #>>44988043 #>>44988514 #

59. npalli ◴[22 Aug 25 13:51 UTC] No.44984695[source]▶

>>44980865 (OP) #

So, current status on async

Rust - you need to understand: Futures, Pin, Waker, async runtimes, Send/Sync bounds, async trait objects, etc.

C++20, coroutines.

Go, goroutines.

Java21+, virtual threads

replies(4): >>44984787 #>>44985922 #>>44986154 #>>44988572 #

60. IX-103 ◴[22 Aug 25 13:51 UTC] No.44984696{5}[source]▶

>>44983643 #

Doesn't Rust do this? `let` is always on the stack. If you want to allocate on the heap then you need a Box. So `let foo = Box::new(MyFoo::default ())` creates a Box on the stack that points to a MyFoo on the heap. So MyFoo is a stack type and Box<MyFoo> is a heap type. Or do you think there is value in defining MyFooStack and MyFooHeap separately to support both use cases?

replies(2): >>44985539 #>>44986339 #

61. newpavlov ◴[22 Aug 25 13:57 UTC] No.44984765{4}[source]▶

>>44984589 #

No, the fundamental problem (in the context of io-uring) is that futures are managed by user code and can be dropped at any time. This often referred as "cancellation safety". Imagine a future has initialized completion-based IO with buffer which is part of the future state. User code can simply drop the future (e.g. if it was part of `select!`) and now we have a huge problem on our hands: the kernel will write into a dropped buffer! In the synchronous context it's equivalent to de-allocating thread stack under foot of the thread which is blocked on a synchronous syscall. You obviously can do it (using safe code) in thread-based code, but it's fine to do in async.

This is why you have to use various hacks when using io-uring based executors with Rust async (like using polling mode or ring-owned buffers and additional data copies). It could be "resolved" on the language level with an additional pile of hacks which would implement async Drop, but, in my opinion, it would only further hurt consistency of the language.

>He even calls out how naïve completion (callbacks) leads to more allocation on future composition and points to where green threads were abandoned.

I already addressed it in the other comment.

replies(2): >>44984811 #>>44985652 #

62. thomashabets2 ◴[22 Aug 25 13:59 UTC] No.44984787[source]▶

>>44984695 #

Rust: Well yes. Rust does force you to understand the things, or it won't compile. It does have drawbacks.

Go: goroutines are not async. And you can't understand goroutines without understanding channels. And channels are weirdly implemented in Go, where the semantics of edge cases, while well defined, are like rolling a D20 die if you try to reason from first principles.

Go doesn't force you to understand things. I agree with that. It has pros and cons.

I see what you mean but "cheap threads" is not the same thing as async. More like "current status of massive concurrency". Except that's not right either. tarweb, the subject of the blog post in question, is single threaded and uses io_uring as an event loop. (the idea being to spin up one thread per CPU core, to use full capacity)

So it's current status of… what exactly?

Cheap threads have a benefit over an async loop. The main one being that they're easier to reason about. It also has drawbacks. E.g. each thread may be light weight, but it does need a stack.

replies(1): >>44985199 #

63. duped ◴[22 Aug 25 14:02 UTC] No.44984811{5}[source]▶

>>44984765 #

That problem exists regardless of whether you want to use stackful coroutines or not. The stack could be freed by user code at anytime. It could also panic and drop buffers upon unwinding.

I wouldn't call async drop a pile of hacks, it's actually something that would be useful in this context.

And that said there's an easy fix: don't use the pointers supplied by the future!

replies(1): >>44984977 #

64. withoutboats3 ◴[22 Aug 25 14:07 UTC] No.44984882{3}[source]▶

>>44982846 #

genuinely so sad to me that you are still grinding this axe. if your fantasy design works so much better - go build it then!

replies(1): >>44985113 #

65. hnaccountme ◴[22 Aug 25 14:08 UTC] No.44984890[source]▶

>>44980865 (OP) #

Hey, Is there a working HTTP server with all these features?

I am working on something like this for work. But with plain old C

66. newpavlov ◴[22 Aug 25 14:15 UTC] No.44984977{6}[source]▶

>>44984811 #

>That problem exists regardless of whether you want to use stackful coroutines or not. The stack could be freed by user code at anytime. It could also panic and drop buffers upon unwinding.

Nope. The problem does not exist in the stackfull model by the virtue of user being unable (in safe code) to drop stack of a stackfull task similarly to how you can not drop stack of a thread. If you want to cancel a stackfull task, you have to send a cancellation signal to it and wait for its completion (i.e. cancellation is fully cooperative). And you can not fundamentally panic while waiting for a completion event, the task code is "frozen" until the signal is received.

>it's actually something that would be useful in this context.

Yes, it's useful to patch a bunch of holes introduced by the Rust async model and only for that. And this is why I call it a bunch of hacks, especially considering the fundamental issues which prevent implementation of async Drop. A properly designed system would've properly worked with the classic Drop.

>And that said there's an easy fix: don't use the pointers supplied by the future!

It's always amusing when Rust async advocates say that. Let met translate: don't use `let mut buf = [0u8; 16]; socket.read_all(&mut buf).await?;`. If you can't see why such arguments are bonkers, we don't have anything left to talk about.

replies(2): >>44985638 #>>44985662 #

67. tliltocatl ◴[22 Aug 25 14:20 UTC] No.44985024{3}[source]▶

>>44981421 #

That's because Unix API used to assume fork() is extremely cheap. Threads were ugly performance hack second-class citizens - still are in some ways. This was indeed true on PDP-11 (just copy a <64KB disk file!), but as address spaces grew, it became prohibitively expensive to copy page tables, so programmers turned to multithreading. At then multicore CPUs became the norm, and multithreading on multicore CPUs meant any kind of copy-on-write required TLB shootdown, making fork() even more expensive. VMS (and its clone known as Windows NT) did it right from the start - processes are just resource containers, units execution are threads and all IO is async. But being technically superior doesn't outweighs the disadvantage of being proprietary.

replies(1): >>44993913 #

68. ozgrakkurt ◴[22 Aug 25 14:22 UTC] No.44985050[source]▶

>>44982779 #

This wiki page might be useful for anyone that is looking into this

https://github.com/axboe/liburing/wiki/io_uring-and-networki...

Also there is napi support in uring which uses polled io on sockets instead of interrupt based io from what I understand. You can see examples using it in liburing github

69. ◴[22 Aug 25 14:27 UTC] No.44985107{4}[source]▶

>>44984017 #

70. newpavlov ◴[22 Aug 25 14:28 UTC] No.44985113{4}[source]▶

>>44984882 #

Deal with it. Async is my greatest disappointment in the otherwise mostly stellar language. And I will continue to argue strongly against it.

After Rust has raised the level of quality and expectations to such great level, async feels like 3 steps back with all those arguments "you are holding it wrong", footguns, and piles of hacks. And this sentiment is shared by many others. It's really disappointing to see how many resources are getting sunk into the flawed async model by both the language and the ecosystem developers.

>go build it then

I did build it and it's in the process of being adopted into a proprietary database (theoretically a prime use-case for async Rust). Sadly, because I don't have ways to change the language and the compiler, it has obvious limitations (and generally it can be called unsound, especially around thread locals). It works for our project only because we have a tightly controlled code base. In future I plan to create a custom "green-thread" fork of `std` to ease limitations a bit. Because of the limitations (and the proprietary nature of the project) it is unlikely to be published as an open source project.

Amusingly, during online discussions I've seen other unrelated people who done similar stuff.

replies(2): >>44986035 #>>44990422 #

71. alde ◴[22 Aug 25 14:30 UTC] No.44985140[source]▶

>>44980865 (OP) #

Unfortunately io_uring is disabled by default on most cloud workload orchestrators, like CloudRun, GKE, EKS and even local Docker. Hope this will change soon, but until then it will remain very niche.

replies(2): >>44986488 #>>44988690 #

72. dpeckett ◴[22 Aug 25 14:31 UTC] No.44985151{4}[source]▶

>>44984017 #

For TCP streams syscall overhead isn't a big issue really, you can easily transfer large chunks of data in each write(). If you have TCP segmentation offload available you'll have no serious issues pushing 100gbit/s. Also if you are sending static content don't forget sendfile().

UDP is a whole another kettle of fish, get's very complicated to go above 10gbit/s or so. This is a big part of why QUIC really struggles to scale well for fat pipes [1]. sendmmsg/recvmmsg + UDP GRO/GSO will probably get you to ~30gbit/s but beyond that is a real headache. The issue is that UDP is not stream focused so you're making a ton of little writes and the kernel networking stack as of today does a pretty bad job with these workloads.

FWIW even the fastest QUIC implementations cap out at <10gbit/s today [2].

Had a good fight writing a ~20gbit userspace UDP VPN recently. Ended up having to bypass the kernels networking stack using AF_XDP [3].

I'm available for hire btw, if you've got an interesting networking project feel free to reach out.

1. https://arxiv.org/abs/2310.09423

2. https://microsoft.github.io/msquic/

3. https://github.com/apoxy-dev/icx/blob/main/tunnel/tunnel.go

replies(2): >>44991392 #>>44991928 #

73. ori_b ◴[22 Aug 25 14:36 UTC] No.44985199{3}[source]▶

>>44984787 #

> Go: goroutines are not async

Sure they are. The abstraction they provide is a synchronous API, but it's accomplished using an async runtime.

replies(3): >>44985707 #>>44987321 #>>44990910 #

74. butterisgood ◴[22 Aug 25 15:00 UTC] No.44985475[source]▶

>>44980865 (OP) #

Where do people get the idea that one thread per core is correct on a system that deals with time slices?

In my experience “oversubscribing” threads to cores (more threads than cores) provides a wall-clock time benefit.

I think one thread per core would work better without preemptive scheduling.

But then we aren’t talking about Unix.

replies(4): >>44985631 #>>44986628 #>>44988220 #>>44988584 #

75. tele_ski ◴[22 Aug 25 15:07 UTC] No.44985539{6}[source]▶

>>44984696 #

The suggestion is c# class vs struct basically, with explicit globals which are just class with synchronization

replies(1): >>44986589 #

76. gorset ◴[22 Aug 25 15:14 UTC] No.44985631[source]▶

>>44985475 #

Isolating a core and then pinning a single thread is the way to go to get both low latency and high throughput, sacrificing efficiency.

This works fine on Linux, and common approach for trading systems where it’s fine to oversubscribe a bunch of cores for this type of stuff. The cores are mostly busy spinning and doing nothing, so it’s very inefficient in terms of actual work, but great for latency and throughput when you need it.

replies(1): >>44986485 #

77. oconnor663 ◴[22 Aug 25 15:14 UTC] No.44985638{7}[source]▶

>>44984977 #

> don't use `let mut buf = [0u8; 16]; socket.read_all(&mut buf).await?;`. If you can't see why such arguments are bonkers, we don't have anything left to talk about.

It doesn't seem bonkers to me. I know you already know these details, but spelling it out: If I'm using select/poll/epoll in C to do non-blocking reads of a socket, then yes I can use any old stack buffer to receive the bytes, because those are readiness APIs that only write through my pointer "now or never". But if I'm using IOCP/io_uring, I have to be careful not to use a stack buffer that doesn't outlive the whole IO loop, because those are completion APIs that write through my pointer "later". This isn't just a question of the borrow checker being smart enough to analyze our code; it's a genuine difference in what correct code needs to do in these two different settings. So if async Rust forces us to use heap allocated (or long-lived in some other way) buffers to do IOCP/io_uring reads, is that a failure of the async model, or is that just the nature of systems programming?

replies(1): >>44985901 #

78. vlovich123 ◴[22 Aug 25 15:15 UTC] No.44985652{5}[source]▶

>>44984765 #

I really don’t understand this argument. If you force the user to transfer ownership of the buffer into the I/O subsystem, the system can make sure to transfer ownership of the buffer into the async runtime, not leaving it held within the cancellable future and the future returns that buffer which is given back when the completion is received from the kernel. What am I missing?

replies(2): >>44985725 #>>44987286 #

79. duped ◴[22 Aug 25 15:17 UTC] No.44985662{7}[source]▶

>>44984977 #

> The problem does not exist in the stackfull model by the virtue of user being unable (in safe code) to drop stack of a stackfull task similarly to how you can not drop stack of a thread.

If you're not doing things better than threads then why don't you just use threads?

> And you can not fundamentally panic while waiting for a completion event, the task code is "frozen" until the signal is received.

So you only allow join/select at the task level? Sounds awful!

> Let met translate: don't use `let mut buf = [0u8; 16]; socket.read_all(&mut buf).await?;

Yes, exactly. It's more like `let buf = socket.read(16);`

replies(1): >>44985885 #

80. thomashabets2 ◴[22 Aug 25 15:20 UTC] No.44985707{4}[source]▶

>>44985199 #

I'm trying to understand the context in which the parent commenter uses the term, since it can mean multiple things. They said "async" and then enumerated some wildly different things.

Like, do you need async runtimes to do epoll async in Rust? No. Ok, so that excludes many definitions. Do you need coroutines in C++ to do aio for reading and writing? No.

So like I said, what do they mean by "async"? The blog post refers to a web server that does "async" in Rust without any async runtime, and without the `async` keyword.

In other words, that parent commenter is what's called "not even wrong".

81. newpavlov ◴[22 Aug 25 15:22 UTC] No.44985725{6}[source]▶

>>44985652 #

The goal of the async system is to allow users to write synchronous looking code which is executed asynchronously with all associated benefits. "Forcing" users to do stuff like this shows the clear failure to achieve this goal. Additionally, passing ownership like this (instead of passing mutable borrow) arguably goes against the zero-cost principle.

replies(1): >>44985850 #

82. vlovich123 ◴[22 Aug 25 15:33 UTC] No.44985850{7}[source]▶

>>44985725 #

I don’t follow the zero copy argument. You pass in an owned buffer and get an owned buffer back out. There’s no copying happening here. It’s your claim that async is supposed to look like synchronous code but I don’t buy it. I don’t see why that’s a goal. Synchronous is an anachronistic software paradigm for a computer hardware architecture that never really existed (electronics are concurrent and asynchronous by nature) and cause a lot of performance problems trying to make it work that way.

Indeed, one thing I’ve always wondered is if you can submit a read request for a page aligned buffer and have the kernel arrange for data to be written directly into that without any additional copies. That’s probably not possible since there’s routing happening in the kernel and it accumulates everything into sk_buffs.

But maybe it could arrange for the framing part of the packet and the data to be decoupled so that it can just give you a mapping into the data region (maybe instead of you even providing a buffer, it gives you back an address mapped into your space). Not sure if that TLB update might be more expensive than a single copy.

replies(2): >>44986038 #>>44986257 #

83. newpavlov ◴[22 Aug 25 15:37 UTC] No.44985885{8}[source]▶

>>44985662 #

>If you're not doing things better than threads then why don't you just use threads?

Because green threads are more efficient than the classical threads. You have less context switching, more control over concurrency (e.g. you can have application-level pseudo critical section and tools like `join!`/`select!`), and with io-uring you have a much smaller number of syscalls.

In other words, memory footprint would be similar to the classical threads, but runtime performance can be much higher.

>So you only allow join/select at the task level? Sounds awful!

What is the difference with join/select at the future level?

Yes, with the most straightforward implementation you have to allocate full stack for each sub-task (somewhat equivalent to boxing sub-futures). But it's theoretically possible to use the parent task stack for sub-task stacks with the aforementioned compiler improvements.

Another difference is that instead of just dropping the future state on the floor you have to explicitly send a cancellation signal (e.g. based on `IORING_OP_ASYNC_CANCEL`) and wait for the sub-task to finish. Performance-wise it should have minimal difference when compared against the hypothetical async Drop.

>Yes, exactly.

Ok, I have nothing more to add then.

84. newpavlov ◴[22 Aug 25 15:38 UTC] No.44985901{8}[source]▶

>>44985638 #

>is that a failure of the async model

This, 100%. Being really generous, it can be called a leaky model which is poorly compatible with completion-based APIs.

replies(1): >>44987957 #

85. oconnor663 ◴[22 Aug 25 15:40 UTC] No.44985922[source]▶

>>44984695 #

Note that C++ coroutines use heap allocation to avoid the problems that Pin is solving, which is a pretty big carve-out from the "zero overhead principle" that C++ usually aims for. The long development time of async traits has also been related to Rust not heap allocating futures. Whether that performance+portability-vs-complexity tradeoff is worth it for any given project is, of course, a different question.

replies(2): >>44987670 #>>44989960 #

86. no_wizard ◴[22 Aug 25 15:42 UTC] No.44985952{6}[source]▶

>>44984545 #

What were the original problems exactly? From what I recall they effectively boiled down to size concerns due to seeing themselves as a c/c++ successor and they didn’t want to lose any adoption in the embedded systems target audience.

replies(2): >>44986171 #>>44986648 #

87. mbStavola ◴[22 Aug 25 15:49 UTC] No.44986035{5}[source]▶

>>44985113 #

> it has obvious limitations (and generally it can be called unsound, especially around thread locals)

Is this really better than what we have now? I don't think async is perfect, but I can see what tradeoffs they are currently making and how they plan to address most if not all of them. "General" unsoundness seems like a rather large downside.

> In future I plan to create a custom "green-thread" fork of `std` to ease limitations a bit

Can you go more in-depth into these limitations and which would be alleviated by having first class support for your approach in the compiler/std?

replies(1): >>44986338 #

88. newpavlov ◴[22 Aug 25 15:49 UTC] No.44986038{8}[source]▶

>>44985850 #

You have an inevitable overhead of managing the owned buffer when compared against simply passing mutable borrow to an already existing buffer. Imagine if `io::Read` APIs were constructed as `fn read(&mut self, buf: Vec<u8>) -> io::Resul<Vec<u8>>`.

Parity with synchronous programming is an explicit goal of Rust async declared many times (e.g. see here https://github.com/rust-lang/rust-project-goals/issues/105). I agree with your rant about the illusion of synchronicity, but it does not matter. The synchronous abstraction is immensely useful in practice and less leaky it is, the better.

replies(2): >>44986309 #>>44987910 #

89. dj_gitmo ◴[22 Aug 25 15:50 UTC] No.44986057[source]▶

>>44982527 #

Did you read the post? It has nothing to do with web apps.

90. K0nserv ◴[22 Aug 25 15:59 UTC] No.44986154[source]▶

>>44984695 #

The facts that Send/Sync bounds model are still relevant in all the other languages, the absence of Send/Sync just means it's easier to write subtly incorrect code.

replies(1): >>44987201 #

91. const_cast ◴[22 Aug 25 16:00 UTC] No.44986171{7}[source]▶

>>44985952 #

I mean from an outsiders perspective on Rust this is how I saw it.

Rust is in a strange place because they're a systems language directly competing with C++. Async, in general, doesn't vibe with that but green threads definitely don't.

If you're gonna do green threads you might as well throw in a GC too and get a whole runtime. And now you're writing Go.

replies(2): >>44986288 #>>44986357 #

92. kbolino ◴[22 Aug 25 16:06 UTC] No.44986248{5}[source]▶

>>44983643 #

But what does "heap my_heap_var" actually mean, without a garbage collector? Who owns "my_heap_var" and when does it get deallocated? What does explicitly writing out the heap-ness of a variable ultimately provide, that Rust's existing type system with its many heap-allocated types (Box, Rc, Arc, Vec, HashMap, etc.) doesn't already provide?

replies(1): >>44996133 #

93. namibj ◴[22 Aug 25 16:06 UTC] No.44986257{8}[source]▶

>>44985850 #

Such reads are in principle supported if you have sufficient hardware offloading of your stream. AFAIK io_uring got an update a while back specifically to make this practical for non-stream reads, where you basically provide a slab allocator region to the ring and get to tell reads to pick a free slot/slab in that region _only when they actually get the data_ instead of you blocking DMA capable memory for as long as the remote takes to send you the data.

94. zozbot234 ◴[22 Aug 25 16:08 UTC] No.44986288{8}[source]▶

>>44986171 #

On the contrary, stackless async can "vibe" quite well with deep embedded workloads that also require a low-level language like C/C++. There's very few meaningful alternatives to Rust in that space.

95. namibj ◴[22 Aug 25 16:10 UTC] No.44986309{9}[source]▶

>>44986038 #

The problem is that the ring requires suitably locked memory which won't be subject to swapping, thus inherently forcing a different memory type if you want the extra-low-overhead/extra-scalable operation.

It makes sense to ask the ring wrapper for memory that you can emplace your payload into before submitting the IO if you want to use zero-copy.

replies(1): >>44987251 #

96. newpavlov ◴[22 Aug 25 16:12 UTC] No.44986338{6}[source]▶

>>44986035 #

>Is this really better than what we have now?

Depends on the metric you use. Memory-wise it's a bit less efficient (our tasks usually are quite big, so relative overhead is small in our case), runtime-wise it should be on par or slightly ahead. From the source code perspective, in my opinion, it's much better. We don't have the async/await noise everywhere and after development of the `std` fork we will get async in most dependencies as well for "free" (we still would need to inspect the code to see that they do not use blocking `libc` calls for example). I always found it amusing that people use "sync" `log`-based logging in their async projects, we will not have this problem. The approach also allows migration of tasks across cores even if you keep `Rc` across yield points. And of course we do not need to duplicate traits with their async counterparts and Drop implementations with async operations work properly out of the box.

>Can you go more in-depth into these limitations and which would be alleviated by having first class support for your approach in the compiler/std?

The most obvious example is thread locals. Right now we have to ensure that code does not wait on completion while having a thread local reference (we allow migration of tasks across workers/cores by default). We ban use of thread locals in our code and assume that dependencies are unable to yield into our executor. With forked `std` we can replace the `thread_local!` macro with a task-local implementation which would resolve this issue.

Another source of potential unsoundness is reuse of parent task stack for sub-task stacks in our implementation of `select!`/`join!` (we have separate variants which allocate full stacks for sub-tasks which are used for "fat" sub-tasks). Right now we have to provide stack size for sub-tasks manually and check that the value is correct using external tools (we use raw syscalls for interacting with io-uring and forbid external shared library calls inside sub-tasks). This could be resolved with the aforementioned special async ABI and tracking of maximum stack usage bound.

Finally, our implementation may not work out-of-box on Windows (I read that it has protections against messing with stack pointer on which we rely), but it's not a problem for us since we target only modern Linux.

replies(1): >>44987144 #

97. kbolino ◴[22 Aug 25 16:13 UTC] No.44986339{6}[source]▶

>>44984696 #

You may already know this, but let-bindings are not necessarily on the stack. The reference does say they are (it's important to remember that the reference is not normative), and it is often simpler to think of them that way, but in reality they don't have to be on the stack.

The compiler can perform all sorts of optimizations, and on most modern CPU architectures, it is better to shove as many values into registers as possible. If you don't take the address of a variable, you don't run out of registers, and you don't call other, non-inlined functions, then let-bindings (and function arguments/return values) need not ever spill onto the stack.

In some cases, values don't even get into registers. Small numeric constants (literals, consts, immutable lets) can simply be inlined as immediate values in the assembly/machine code. In the other direction, large constant arrays and strings don't spill onto the stack but rather the constant pool.

replies(1): >>44986522 #

98. no_wizard ◴[22 Aug 25 16:15 UTC] No.44986357{8}[source]▶

>>44986171 #

I don't think doing green threads equates to 'well might as well have a GC now!'. I think they made the wrong tradeoff too, because hardware will inevitably catch up to the language requirements, especially if its desirable to use. Not to mention over time things can be made more efficient from the Rust side as well, with compiler improvements, better programming techniques etc.

I think they made the wrong bet, personally. Having worked in enough languages that have function coloring problems I would avoid it as a language design as a line in the sand item, regardless of tradeoffs

replies(2): >>44990669 #>>44992857 #

99. butterisgood ◴[22 Aug 25 16:25 UTC] No.44986485{3}[source]▶

>>44985631 #

I just wish people who give this advice for 1 thread per core would "expand their reasoning" or "show the work".

It's not blanket good advice for all things.

replies(2): >>44987171 #>>44989341 #

100. nicce ◴[22 Aug 25 16:26 UTC] No.44986488[source]▶

>>44985140 #

Back to self-hosting!

101. zozbot234 ◴[22 Aug 25 16:29 UTC] No.44986522{7}[source]▶

>>44986339 #

In particular, let bindings within async code (and coroutines, if that feature is stabilized at some point) might easily live on the heap.

102. kibwen ◴[22 Aug 25 16:36 UTC] No.44986589{7}[source]▶

>>44985539 #

Note that items declared as `static` in Rust are already globals that require synchronization (in Rust terms, static items must implement `Sync`), although they're located in static memory rather than on the stack or heap.

103. craftkiller ◴[22 Aug 25 16:36 UTC] No.44986592{3}[source]▶

>>44982707 #

At my first job out of college it took 30 minutes to recompile and launch the server. Now the kids complain about 10 seconds. It's just impossible for me to take their complaints seriously. 10 seconds isn't even enough time for a mental context-switch, its just slightly more time than "instant". Back in the day, something like this wasn't an exaggeration: https://xkcd.com/303/

replies(1): >>44993844 #

104. jandrewrogers ◴[22 Aug 25 16:38 UTC] No.44986628[source]▶

>>44985475 #

A mistake people make with thread-per-core (TPC) architecture is thinking you can pick and choose the parts you find convenient, when in reality it is much closer to "all or nothing". It may be worse to half-ass a TPC implementation than to not use TPC at all. However, TPC is more efficient in just about all contexts if you do it correctly.

Most developers are unfamiliar with the design idioms for TPC e.g. how to properly balance and shed load between cores.

105. kibwen ◴[22 Aug 25 16:40 UTC] No.44986648{7}[source]▶

>>44985952 #

Have you read the article by Aaron Turon linked above? It's very informative, and if you have any questions about specific parts of it, feel free to reference them. In particular it boils down to the fact that Rust bends over backwards to avoid putting anything that requires allocation or dynamic dispatch in the core language (e.g. Rust's closures are fascinating in that they're stack-allocated, like C++'s, while also playing nicely with the borrow checker, which is quite a feat). This property extends to the current design of async, which makes async suitable for embedded devices, which is extremely cool (check out the Embassy project for the state of the art in this space).

106. surajrmal ◴[22 Aug 25 17:19 UTC] No.44987144{7}[source]▶

>>44986338 #

If you use a custom libc and dynamic linker, you can very easily customize thread locals to work the way you want without forking the standard library.

107. lossolo ◴[22 Aug 25 17:22 UTC] No.44987171{4}[source]▶

>>44986485 #

Check out Scylla and its underlying framework Seastar. They expand their reasoning and show the work.

108. koakuma-chan ◴[22 Aug 25 17:25 UTC] No.44987201{3}[source]▶

>>44986154 #

Yeah the new typescript compiler that's written in Go crashed for me the other day because of some kind of concurrent modification. Java also has runtime checks for concurrent modification in its collections.

109. surajrmal ◴[22 Aug 25 17:30 UTC] No.44987251{10}[source]▶

>>44986309 #

newpavlov seems to operate in a theoretical bubble. Until something like he describes is published publicly where we talk more concretely, it's not worth engaging. Requiring special buffers is very typical for any hardware offload with zero copy. It isn't a leaky abstraction. Async rust is well designed and I've not seen anything rival it. If there are problems it's in the libraries built on top like tokio.

110. Inufu ◴[22 Aug 25 17:33 UTC] No.44987286{6}[source]▶

>>44985652 #

Requiring ownership transfer gives up on one of the main selling points of Rust, being able to verify reference lifetime and safety at compile time. If we have to give up on references then a lot of Rusts complexity no longer buys us anything.

replies(1): >>44987838 #

111. smw ◴[22 Aug 25 17:36 UTC] No.44987321{4}[source]▶

>>44985199 #

Yeah, in fact I'd argue that any abstraction that doesn't let you treat the work as sync is fundamentally broken.

https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...

112. jcranmer ◴[22 Aug 25 17:56 UTC] No.44987564{4}[source]▶

>>44982450 #

Yes and no.

In my case, I have code that essentially looks like this:

   struct Parser {
     state: ParserState
   }
   struct Subparser {
     state: ParserState
   }
   impl Parser {
     pub fn parse_something(&mut self) -> Subparser {
       Subparse { state: self.state } // NOTE: doesn't work
     }
   }
   impl Drop for Subparser {
     fn drop(&mut self) {
       parser.state = self.state; // NOTE: really doesn't work
     }
   }

Okay, I can make the first line work by changing Parser.state to be an Option<ParserState> instead and using Option::take (or std::mem::replace on a custom enum; going from an &mut T to a T is possible in a number of ways). But how do I give Subparser the ability to give its ParserState back to the original parser? If I could make Subparser take a lifetime and just have a pointer to Parser.state, I wouldn't even bother with half of this setup because I would just reach into the Parser directly, but that's not an option in this case. (The safe Rust option I eventually reached for is a oneshot channel, which is actually a lot of overhead for this case).

It's the give-back portion of the borrow-to-give-back pattern that ends up being gnarly. I'm actually somewhat disappointed that the Rust ecosystem has in general given up on trying to build up safe pointer abstractions in the ecosystem, like doing use tracking for a pointed-to object. FWIW, a rough C++ implementation of what I would like to do is this:

  template <typename T> class HotPotato {
    T *data;
    HotPotato<T> *borrowed_from = nullptr, *given_to = nullptr;

    public:
    T *get_data() {
      // If we've given the data out, we can't use it at the moment.
      return given_to ? nullptr : data;
    }
    std::unique_ptr<HotPotato<T>> borrow() {
      assert(given_to == nullptr);
      auto *new_holder = new HotPotato();
      new_holder->data = data;
      new_holder->borrowed_from = this;
      given_to = new_holder;
    }

    ~HotPotato() {
      if (given_to) {
        given_to->borrowed_from = borrowed_from;
      }
      if (borrowed_from) {
        borrowed_from->given_to = given_to;
      } else {
        delete data;
      }
    }
  };

replies(1): >>44991000 #

113. nly ◴[22 Aug 25 18:04 UTC] No.44987670{3}[source]▶

>>44985922 #

Quite a lot of work was done in Clang at least to elide allocations for coroutines where the compiler can see enough information.

114. vlovich123 ◴[22 Aug 25 18:18 UTC] No.44987838{7}[source]▶

>>44987286 #

I'm not sure what you're trying to say, but the compile-time safety requirement isn't given up. It would look something like:

    self.buffer = io_read(self.buffer)?

This isn't much different than

    io_read(&mut self.buffer)?

since rust doesn't permit simultaneous access when a mutable reference is taken.

replies(1): >>44993473 #

115. vlovich123 ◴[22 Aug 25 18:23 UTC] No.44987910{9}[source]▶

>>44986038 #

The problem is that "parity" can be interpreted in different ways and you're choosing to interpret it in a way that doesn't seem to be communicated in the issue you referenced. In fact, the common definition of parity is something like "feature parity" meaning that you can do accomplish all the things you did before, even if it's not the same (e.g. MacOS has copy-paste feature parity even though it's a different shortcut or might work slightly differently from other operating systems). It rarely means "drop-in" parity where you don't have to do anything.

To me it's pretty clear that parity in the issue referenced refers to equivalence parity - that is you can accomplish the tasks in some way, not that it's a drop-in replacement. I haven't seen anywhere suggested that async lets you write synchronous code without any changes, nor that integrating completion-style APIs with asynchronous will yield code that looks like synchronous. For one, completion-style APIs are for performance and performance APIs are rarely structured for simplicity but to avoid implicit costs hidden in common leaky (but simpler) abstractions. For another, completion-style APIs in synchronous programming ALSO looks different from epoll/select-like APIs, so I really don't understand the argument you're trying to make.

EDIT:

> You have an inevitable overhead of managing the owned buffer when compared against simply passing mutable borrow to an already existing buffer. Imagine if `io::Read` APIs were constructed as `fn read(&mut self, buf: Vec<u8>) -> io::Resul<Vec<u8>>`.

I'm imaging and I don't see a huge problem in terms of the overhead this implies. And you'd probably not necessarily take in a Vec directly but some I/O-specific type since such an API would be for performance.

116. vlovich123 ◴[22 Aug 25 18:26 UTC] No.44987957{9}[source]▶

>>44985901 #

The leaky model is that you could ever receive into a stack buffer and you're arguing to persist this model. The reason it's leaky is that copying memory around is supremely expensive. But that's how the BSD socket API from the 90s works and btw something you can make work with async provided you're into memory copies. io_uring is a modern API that's for performance and that's why Rust libraries try to avoid memory copying within the internals. Supporting copying into the stack buffer with io_uring is very difficult to accomplish even in synchronous code. It's not a failure of async but a different programming paradigm altogether.

As someone else mentioned, what you really want is to ask io_uring to allocate the pages itself so that for reads it gives you pages that were allocated by the kernel to be filled directly by HW and then mapped into your userspace process without any copying by the kernel or any other SW layer involved.

replies(2): >>44989382 #>>44996320 #

117. 0x457 ◴[22 Aug 25 18:32 UTC] No.44988043{4}[source]▶

>>44984589 #

> And due to instability of a few features (async trait, return impl trait in trait, etc) there is not really a standard way to write executor independent async code (you can, some big crates do, but it's not necessarily trivial).

Uhm all of that is just sugar on top of stable feature. None of these features or lack off prevent portability.

Full portability isn't possible specifically due to how Waker works (i.e. is implementation specific). That allows async to work with different style of asyncs. Reason why io_uring is hard in rust is because of io_uring way of dealing with memory.

118. mgaunard ◴[22 Aug 25 18:42 UTC] No.44988195{3}[source]▶

>>44982686 #

In my experience, trying to use io_uring for spinning/no-system-call uses is not straightforward.

First, there are some tricks required to actually make it work at all, then there is a problem that you'll need a core not only for userland, but also inside the kernel, both of them per-application.

Sharing a kernel spinning thread across multiple applications is also possible but requires further efforts (you need to share some parent ring across processes, which need to be related).

Overall I feel that it doesn't really deliver on the no-system-call idea, certainly not out of the box. You might have a more straightforward experience with XDP, which coincidentally gives you a lot more access and control as well if you need it.

119. danudey ◴[22 Aug 25 18:44 UTC] No.44988220[source]▶

>>44985475 #

One thread per core if you're CPU-bound and not IO-bound.

In this very specific case, it seems as though the vast majority of the webserver's work is asynchronous and event-based, so the actual webserver is never waiting on I/O input or output - once it's ready you dump it somewhere the kernel can get to it and move on to the next request if there is one.

I think this gets this specific project close to the platonic ideal of a one-thread-per-core workload if indeed you're never waiting on I/O or any syscalls, but I feel as though it should come with extreme caveats of "this is almost never how the real world works so don't go artificially limiting your application to `nproc` threads without actually testing real-world use cases first".

replies(1): >>44990240 #

120. j_seigh ◴[22 Aug 25 18:53 UTC] No.44988360[source]▶

>>44983599 #

I don't think it has to be. Conceptually it's just a couple of queues.

There's a software equivalent of the Peter Principle where software or an API becomes increasingly complex to the point where no one understands it. They then attempt to fix that by adding more functionality (complexity).

121. hobofan ◴[22 Aug 25 19:05 UTC] No.44988514{4}[source]▶

>>44984589 #

> The only guarantees in Rust futures are that they are polled() once and must have their Waker's wake() called before they are polled again.

I just had to double-check as this sounded strange to me, and no that's not true.

The most efficient design is to do it that way, yes, but there are no guarantees of that sort. If one wants to build a less efficient executor, it's perfectly permissible to just poll futures on a tight loop without involving the Waker at all.

replies(1): >>44988813 #

122. hobofan ◴[22 Aug 25 19:09 UTC] No.44988572[source]▶

>>44984695 #

If you are fine with writing "good enough" high-level Rust code (that will potentially still beat out most other languages in terms of performance) and are fine with using the mid-level primitives that other people have built, you don't really have to understand most of those things.

123. wahern ◴[22 Aug 25 19:10 UTC] No.44988584[source]▶

>>44985475 #

In the case of io_uring, one user thread per core is not a bad rule of thumb given that the kernel side is using a pool of worker threads.

124. superb_dev ◴[22 Aug 25 19:17 UTC] No.44988690[source]▶

>>44985140 #

Why do they disable io_uring?

replies(2): >>44990499 #>>44999134 #

125. duped ◴[22 Aug 25 19:29 UTC] No.44988813{5}[source]▶

>>44988514 #

Let me rephrase, there's no guarantee that a poll() is called again (because of cancel safety) and in practice you have to call wake() because executors won't reschedule the task unless one of their children wake()s

126. WJW ◴[22 Aug 25 20:11 UTC] No.44989233[source]▶

>>44980865 (OP) #

Pretty cool! Adding kTLS is definitely an improvement. I made an actually zero-syscall per request server a few years ago (and blogged about it at https://wjwh.eu/posts/2021-10-01-no-syscall-server-iouring.h...) but as TFA notes it comes at a heavy cost of constantly busy-looping.

io_uring is very cool tech though and has been progressing at an impressive pace the last few years.

127. commandersaki ◴[22 Aug 25 20:20 UTC] No.44989337[source]▶

>>44981313 #

I'm sceptical of the efficiency gains with sendfile; seems marginal at best, even in the late 90s when it was at the height of popularity.

replies(2): >>44989857 #>>44989863 #

128. thinkharderdev ◴[22 Aug 25 20:20 UTC] No.44989341{4}[source]▶

>>44986485 #

It is definitely not good advice for all things. For workloads that are either end of the CPU/IO spectrum (e.g. almost all waiting on IO or almost all doing CPU work) it can be a huge win as you can get very good L1 cache utilization, are not context-switching and don't need to handle thread synchronization in your code because not state is shared between threads.

For workloads that are a mix of IO and non-trivial CPU work, it can still work but is much, much harder to get right.

129. Asmod4n ◴[22 Aug 25 20:24 UTC] No.44989382{10}[source]▶

>>44987957 #

there is just one catch.

Using the feature to let io_uring handle buffers for you limits you to the mem lock limit of a process, which is 8MB on a typical debian install (more on others) And that's a hard limit unless you got root access to said machine.

replies(1): >>44989420 #

130. vlovich123 ◴[22 Aug 25 20:27 UTC] No.44989420{11}[source]▶

>>44989382 #

Sure, that's the most efficient way. But you can still have the user allocate a read buffer, pass it to the read API & receive it on the way out. In fact, unlike what OP claimed, this is actually more efficient since you could safely avoid unnecessarily initializing this buffer safely (by truncating to the length read before returning) whereas safely using uninitialized buffers is kind of tricky.

131. kev009 ◴[22 Aug 25 21:04 UTC] No.44989797{4}[source]▶

>>44983055 #

I don't see anything in my comment that implied _when_ the forking happened so it's not really a nit :)

132. lossolo ◴[22 Aug 25 21:11 UTC] No.44989857{3}[source]▶

>>44989337 #

> seems marginal at best

Depends on the workload.

Normally you would go read() -> write() so:

1. Disk -> page cache (DMA)

2. Kernel -> user copy (read)

3. User -> kernel copy (write)

4. Kernel -> NIC (DMA)

sendfile():

1. Disk -> page cache (DMA)

No user space copies, kernel wires those pages straight to the socket

2. Kernel -> NIC (DMA)

So basically, it eliminates 1-2 memory copies along with the associated cache pollution and memory bandwidth overhead. If you are running high QPS web services where syscall and copy overheads dominate, for example CDNs/static file serving the gains can be really big. Based on my observations this can mean double digit reductions in CPU usage and up to ~2x higher throughput.

replies(1): >>44991515 #

133. kev009 ◴[22 Aug 25 21:11 UTC] No.44989863{3}[source]▶

>>44989337 #

Then you don't understand the memory and protection model of a modern system very well.

sendfile effectively turns your user space file server into a control plane, and moves the data plane to where the data is eliminating copies between address spaces. This can be made congruent with I/O completions (i.e. Ethernet+IP and block) and made asynchronous so the entire thing is pumping data between completion events. Watch the Netflix video the author links in the post.

There is an inverted approach where you move all this into a single user address space, i.e. DPDK, but it's the same overall concept just a different who.

134. mpyne ◴[22 Aug 25 21:22 UTC] No.44989960{3}[source]▶

>>44985922 #

C++ coroutines must allocate at runtime as the allocation size isn't resolvable early enough at compile time to statically fix the allocation, but it's not required to be allocated from the heap (not that custom allocators are fun, but it is possible).

In any event it's essentially a stack frame so it's not a failure of zero-overhead, the stack frame will need to be somewhere.

135. touisteur ◴[22 Aug 25 21:24 UTC] No.44989979[source]▶

>>44981374 #

I wish I could have been paid to work on SPARK specification around io_uring so that one could have built on it. Or to work on SPARK-to-eBPF (there's already a llvm backend for gnat) and have some form of guarantees at the seams... alas.

136. butterisgood ◴[22 Aug 25 21:46 UTC] No.44990240{3}[source]▶

>>44988220 #

But, your CPU availability is time sliced... So, why is not "more than one thread per core" equivalent to "more CPU" (my point is, sometimes it is...)

replies(1): >>44990341 #

137. butterisgood ◴[22 Aug 25 21:55 UTC] No.44990341{4}[source]▶

>>44990240 #

https://github.com/rminnich/9front/tree/ron_nix

Has Ron Minnich's port of "Nix" (not NixOS as you may know it), to 9front.

The entire point of this is to disallow the kernel pre-empting and switching out CPU cores that should be dedicated to an "application". (Application Cores).

One could imagine this arrangement plus io_uring would be awfully nice.

138. arianvanp ◴[22 Aug 25 22:14 UTC] No.44990499{3}[source]▶

>>44988690 #

Sandboxing like gvisor is based on syscalls and iouring makes your code syscallless

139. surajrmal ◴[22 Aug 25 22:28 UTC] No.44990669{9}[source]▶

>>44986357 #

There are other languages with green threads and folks are free to use those. Zig is trying to do interesting things with stackful coroutines.

I don't think I nor most systems programmers would have chosen rust if it required green threads instead of stackless coroutines for async. If you work on embedded or low level environments like kernels and whatnot, you need something that falls back to callbacks for async. I'm sure folks who work on servers would have been fine with green threads but they were not the target audience for rust. Being upset because you fall outside the target demographic of a particular language doesn't mean they made the wrong choice. It just means you should look for something else.

140. tempaccount420 ◴[22 Aug 25 22:42 UTC] No.44990810[source]▶

>>44982712 #

kTLS just sounds like a bad idea all around.

141. gpderetta ◴[22 Aug 25 22:54 UTC] No.44990910{4}[source]▶

>>44985199 #

By that definition, pthread is also async. If everything is async, then the word loses all meanings.

Async is really about the surface syntax and ergonomics, not the implementation.

replies(1): >>44996554 #

142. pornel ◴[22 Aug 25 23:04 UTC] No.44991000{5}[source]▶

>>44987564 #

You can implement this in Rust.

It's an equivalent of Rc<Cell<(Option<Box<T>>, Option<Box<T>>)>>, but with the Rc replaced by a custom shared type that avoids keeping refcount by having max 2 owners.

You're going to need UnsafeCell to implement the exact solution, which needs a few lines of code that is as safe as the C++ version.

143. johncolanduoni ◴[22 Aug 25 23:50 UTC] No.44991392{5}[source]▶

>>44985151 #

Yeah all agreed - the only addendum I’d add is for cases where you can’t use large buffers because you don’t have the data (e.g. realtime data streams or very short request/reply cycles). These end up having the same problems, but are not soluble by TCP or UDP segmentation offloads. This is where reduced syscall overhead (or even better kernel bypass) really shines for networking.

144. hu3 ◴[22 Aug 25 23:57 UTC] No.44991470{3}[source]▶

>>44982707 #

It was already horrible devex 40 years ago when turbo pascal could compile millions of lines almost instantly with a processor that was slower than my current watch processor.

145. johncolanduoni ◴[22 Aug 25 23:59 UTC] No.44991486{4}[source]▶

>>44984454 #

It is compatible under Rust’s model (I’ve used it to implement safe io_uring interfaces specifically). ‘&mut Vec<u8>’ doesn’t just let you mutate contents or extend the allocation - you can call ‘mem::replace(…)’ and swap the allocation entirely. It’s morally equivalent to passing back and forth, and almost identical in the generated machine code (structure return values look a lot like mutable structure arguments at the register calling convention level). However it’s much less annoying to work with in practice - passing buffers back and forth and then reassigning them to the same variable name results in a lot of semantically irrelevant code to please the ownership model.

146. commandersaki ◴[23 Aug 25 00:02 UTC] No.44991515{4}[source]▶

>>44989857 #

I understand the optimisation, I'm just saying I'm sceptical the optimisation is even that useful, like it seems it'd only kick in with pathological cases where kernel round trip time is really dominating; my gut reckons most applications just don't benefit. Caddy in the last few years got sendfile support and with it on and off and it usually you wouldn't see a discernible difference [1].

Which makes me sceptical for the argument for kTLS which is stated in the article; what benefit does offloading your crypto to the kernel provider (possibly making it more brittle). I've seen the author of haproxy say that performance he's seen has been only marginal, but did point out it was useful in that you can strace your process and see plaintext instead of ciphertext which is nice.

[1]: https://blog.tjll.net/reverse-proxy-hot-dog-eating-contest-c...

147. mastax ◴[23 Aug 25 01:03 UTC] No.44991928{5}[source]▶

>>44985151 #

I have a hard time believing that google is serving YouTube over QUIC/HTTP3 at 10Gbit/s, or even 30Gbit/s.

replies(1): >>44992512 #

148. fpoling ◴[23 Aug 25 01:44 UTC] No.44992207{4}[source]▶

>>44983562 #

That post explicitly stated one of the goals was to avoid requiring heap allocations. But fundamentally io_uring is incompatible with the stack and in practice coding against it requires dynamic allocations. If that would be known 10 years ago, surely it would have influenced the design goals.

149. johncolanduoni ◴[23 Aug 25 02:29 UTC] No.44992512{6}[source]▶

>>44991928 #

These are per-connection bottlenecks, largely due to implementation choices in the Linux network stack. Even with vanilla Linux networking, vertical scale can get the aggregate bandwidth as high as you want if you don’t need 10G per connection (which YouTube doesn’t), as long as you have enough CPU cores and NIC queues.

Another thing to consider: Google’s load balancers are all bespoke SDN and they almost certainly speak HTTP1/2 between the load balancers and the application servers. So Linux network stack constraints are probably not relevant for the YouTube frontend serving HTTP3 at all.

150. fpoling ◴[23 Aug 25 03:26 UTC] No.44992857{9}[source]▶

>>44986357 #

Hardware does not catches up with language requirements. If anything, it is languages/compilers that catch up with hardware, like SSE instructions and loop parallel ism.

For me the mistake that Rust made was that it tried too hard to behave like C/C++ with its single execution stack.

Ada uses two stacks allowing a callee to return a stack-allocated arrays to the caller. Not only it allows to avoid dynamic allocations in many cases where C++ allocates memory, but it also reduces the need for pointers making the code safer even without the borrow checker.

If instead of async Rust spent efforts on implementing something like that or even allow for explicit stack control from safe code so green threads or co-routines could be implemented as a library it could be more compatible with io_uring world.

replies(1): >>44993588 #

151. Inufu ◴[23 Aug 25 05:34 UTC] No.44993473{8}[source]▶

>>44987838 #

It means you can for example no longer do things like get multiple disjoint references into the same buffer for parallel reads/writes of independent chunks.

Or well you can, using unsafe, Arc and Mutex - but at that point the safety guarantees aren’t much better than what I get in well designed C++.

Don’t get me wrong, I still much prefer Rust, but I wish async and references worked together better.

Source: I recently wrote a high-throughput RPC library in Rust (saturating > 100 Gbit NICs)

152. zozbot234 ◴[23 Aug 25 05:57 UTC] No.44993588{10}[source]▶

>>44992857 #

> Ada uses two stacks allowing a callee to return a stack-allocated arrays to the caller.

You could do this manually by threading a pointer to a separately-allocated stack (could be on the heap or perhaps just a static allocation) as an extra function parameter. It's just a very simple case of arena allocation, with similar advantages and disadvantages. (For example, the caller must ensure that enough space is available on the dynamic-data stack for anything that the callee might want to push onto it.) In general it's just not really worth it, because it turns out that dynamically-sized data that one would not want to simply place on the heap is rare anyway.

153. LAC-Tech ◴[23 Aug 25 06:47 UTC] No.44993844{4}[source]▶

>>44986592 #

I can remember instant reloads of application servers on a job 10+ years ago grandpa. This isn't new.

replies(1): >>45001508 #

154. kev009 ◴[23 Aug 25 07:02 UTC] No.44993913{4}[source]▶

>>44985024 #

It's also a pretty bold scheduler benchmark to be handling tens of thousands of processes or 1:1 thread wakeups, especially the further back in time you go considering fairness issues. And then that's running at the wrong latency granularity for fast I/O completion events across that many nodes so it's going to run like a screen door on a submarine without a lot of rethinking things.

Evented I/O works out pretty well in practice for the I and D cache, especially if you can affine and allocate things as the article states, and do similar natural alignments inside the kernel (i.e. RSS/consistent hashing).

155. alfiedotwtf ◴[23 Aug 25 14:12 UTC] No.44996133{6}[source]▶

>>44986248 #

> What does explicitly writing out the heap-ness of a variable ultimately provide, that Rust's existing type system with its many heap-allocated types (Box, Rc, Arc, Vec, HashMap, etc.) doesn't already provide?

To be honest, I was thinking more in terms of cognitive overload i.e. is all that Box boilerplate even needed if we were to treat all `heap my_heap = …” as box underneath? In other words, couldn’t we elide all that away:

    let foo = Box::new(MyFoo::default ());

Becomes:

    heap foo = MyFoo::default();

Must nicer!

156. zbentley ◴[23 Aug 25 14:34 UTC] No.44996320{10}[source]▶

>>44987957 #

> what you really want is to ask io_uring to allocate the pages itself so that for reads it gives you pages that were allocated by the kernel

Okay, but what about writes? If I have a memory region that I want io_uring to write, it's a major pain in the ass to manage the lifetime of objects in that region in a safe way. My choices are basically: manually manage the lifetime and only allow it to be dropped when I see a completion show up (this is what most everything does now, and it's a) hard to get right and b) limited in many ways, e.g. it's heap-only), or permanently leak that memory as unusable.

replies(1): >>44999559 #

157. zbentley ◴[23 Aug 25 14:47 UTC] No.44996390{7}[source]▶

>>44984480 #

Let's say a function "foo" calls "fn bar(_: &mut T) -> ()".

When passing a mutable reference, the lifetime of the object is largely decided by "foo" (with some caveats).

Now, let's say that "foo" instead calls "fn bar(_: T) -> T".

When passing the object itself, the lifetime is largely decided/decide-able by "bar".

158. zbentley ◴[23 Aug 25 15:02 UTC] No.44996499{5}[source]▶

>>44982384 #

I think it's not quite that bad (and I know that this has been litigated to death all over the programmer internet).

If you are forking from a language/ecosystem that is extremely thread-friendly, (e.g. Go, Java, Erlang) fork is more risky. This is because such runtimes mean a high likelihood of there being threads doing fork-unsafe things at the moment of fork().

If you are forking from a language/ecosystem that is thread-unfriendly, fork is less risky. That isn't to say "it's always safe/low risk to run fork() in e.g. Python, Ruby, Perl", but in those contexts it's easier to prove/test invariants like "there are no threads running/so-and-so lock is not held at the point in my program when I fork", at which point the risks of fork(2) are much reduced.

To be clear, "reduced" is not the same as "gone"! You still have to reason about explicitly taken locks in the forking thread, file descriptors, signal handlers, and unexpected memory growth due to CoW/GC interactions. But that's a lot more tractable than the Java situation of "it's tricky to predict how many Java threads are active when I want to fork, and even trickier to know if there are any JNI/FFI-library-created raw pthreads running, the GC might be threaded, and checking for each of those things is still racy with my call to fork(2)".

You still have to make sure that that fork-safety invariants are true. But the effort to do that is extremely different depending on language platform.

Rust/C/C++ don't cleanly fit into either of those two (already mushy/subjective) categorizations, though. Whether forking is feasible in a given Rust/C/C++ codebase depends on what the code does and requires a tricky set of judgement calls and at-a-distance knowledge going forward to make sure that the codebase doesn't become fork-unsafe in harmful ways.

159. zbentley ◴[23 Aug 25 15:11 UTC] No.44996554{5}[source]▶

>>44990910 #

Eh, not really. Async (in this semantic context) is generally about cooperative concurrency and also often about concurrent or multiplexed I/O. Pthreads aren't async by those definitions, though you can run async code within a given pthread as usual.

Goroutines are an unusual case, in that they don't have cooperative concurrency--they're pre-emptive--but the Go runtime does perform I/O using concurrent multiplexers under the hood.

So goroutines are kind of both: computation execution and code semantics look like pthreads, but I/O operations look like NodeJS on the backend.

Now, I'm not sure what "async runtime" means in the GP. If they're referring to I/O multiplexers, then they should say that. If they're referring to something else, then I'm not familiar with other uses of that term that would accurately apply to Golang.

replies(1): >>44996891 #

160. zbentley ◴[23 Aug 25 15:14 UTC] No.44996576[source]▶

>>44984119 #

It would, hence why major cloud providers currently disable io_uring in many of their compute environments.

replies(1): >>45011923 #

161. ori_b ◴[23 Aug 25 16:02 UTC] No.44996891{6}[source]▶

>>44996554 #

Well, that's exactly what the kernel is doing when it swaps threads. When you block on I/O, you're voluntarily pausing your thread and doing concurrent I/O with another thread.

Async and threads are a lot closer than most people think. An OS is mainly a queue for swapping between async operations, and a collection of abstracted services that the async operations can request, like network or disk i/o.

replies(1): >>45002390 #

162. alpb ◴[23 Aug 25 21:05 UTC] No.44999134{3}[source]▶

>>44988690 #

Security reasons. https://news.ycombinator.com/item?id=44632240 There are also other edge cases around cgroups accounting that renders some isolation/throttling mechanisms not fully effective.

163. vlovich123 ◴[23 Aug 25 22:16 UTC] No.44999559{11}[source]▶

>>44996320 #

You ask the I/O system for a writable buffer. When you fill it up, you hand it off. Once the I/o finishes, it goes back into the available pool of memory to write with. This is how high performance I/O works.

replies(1): >>45004570 #

164. craftkiller ◴[24 Aug 25 04:52 UTC] No.45001508{5}[source]▶

>>44993844 #

Yeah, we had hot reloading of code too but hot-reloading for instant "reloads" was needed back then. Nowadays, you can do a full relaunch of the server in 10 seconds so hot reloads no longer matter.

165. ◴[24 Aug 25 08:19 UTC] No.45002390{7}[source]▶

>>44996891 #

166. zbentley ◴[24 Aug 25 14:35 UTC] No.45004570{12}[source]▶

>>44999559 #

Okay, but . . . how would that work? A syscall gives back a pointer (I thought the point was to avoid syscalls/context switches)? An io_malloc userspace function (great, now how do I manage lifetimes of the buffers it hands out)? Something else?

replies(1): >>45070568 #

167. j-krieger ◴[25 Aug 25 09:26 UTC] No.45011923{3}[source]▶

>>44996576 #

Interesting!

168. vlovich123 ◴[29 Aug 25 23:32 UTC] No.45070568{13}[source]▶

>>45004570 #

The memory is allocated by the runtime that has the io_uring backend. You ask it for memory which it manages in its own memory allocator. Lifetime is managed no differently than Vec. For example, when you drop the DmaBuffer [1] it goes back into the pool. Or you hand it off as an I/O submission after filling it up.

The memory frequently needs to be mlocked memory anyway, so a general purpose allocator doesn't work.

[1] https://docs.rs/glommio/latest/glommio/fn.allocate_dma_buffe...

↑