
Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
427 points | bcantrill | 2 comments

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

mdasen ◴[] No.45778522[source]
I rewrote this in Go and it also deadlocks. It doesn't seem to be Rust-specific.

I'm going to write down the order of events.

1. Background task takes the lock and holds it for 5 seconds.

2. Async Thing 1 tries to take the lock, but must wait for background task to release it. It is next in line to get the lock.

3. We fire off a goroutine that's just sleeping for a second.

4. Select wants to find a channel that is finished. The sleepChan finishes first (since it's sleeping for 1 second) while Async Thing 1 is still waiting 4 more seconds for the lock. So select will execute the sleepChan case.

5. That case fires off Async Thing 2. Async Thing 2 is waiting for the lock, but it is second in line to get the lock after Async Thing 1.

6. Async Thing 1 gets the lock and is ready to write to its channel, but main is blocked reading from c2, not c1: it is "awaiting" c2 via "<-c2". Async Thing 1 can't release its lock until its write to c1 completes, and on an unbuffered channel that write can't complete until something receives via "<-c1". But the program has already entered the sleepChan case, and it won't try to receive from c1 until that case finishes. That case never finishes, because Async Thing 2 is waiting for the lock that Async Thing 1 still holds.

You can use buffered channels in Go so that Async Thing 1 can write to c1 without main reading from it, just as, per the article, you could use join_all in Rust.

The underlying issue is the same in Go and Rust: "select" says "get me the first one that finishes", and then the branch that finishes first waits on a lock that only gets released once you read from the other branch. It just doesn't feel Rust-specific.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        lock := sync.Mutex{}
        c1 := make(chan string)
        c2 := make(chan string)
        sleepChan := make(chan bool)    
        
        go start_background_task(&lock)
        time.Sleep(1 * time.Millisecond) //make sure it schedules start_background_task first

        go do_async_thing(c1, "op1", &lock)
 
        go func() {
                time.Sleep(1 * time.Second)
                sleepChan <- true
        }()

        for range 2 {
                select {
                case msg1 := <-c1:
                        fmt.Println("In the c1 case")
                        fmt.Printf("received %s\n", msg1)
                case <-sleepChan:
                        fmt.Println("In the sleepChan case")
                        go do_async_thing(c2, "op2", &lock)
                        fmt.Printf("received %s\n", <-c2) // "awaiting" c2 here, but op1 won't release the lock until we read c1
                }
        }
        fmt.Println("all done")
    }

    func start_background_task(lock *sync.Mutex) {
        fmt.Println("starting background task")
        lock.Lock()
        fmt.Println("acquired background task lock")
        defer lock.Unlock()
        time.Sleep(5 * time.Second)
        fmt.Println("dropping background task lock")
    }

    func do_async_thing(c chan string, label string, lock *sync.Mutex) {
        fmt.Printf("%s: started\n", label)
        lock.Lock()
        fmt.Printf("%s: acquired lock\n", label)
        defer lock.Unlock()
        fmt.Printf("%s: done\n", label)
        c <- label
    }
1. clarkmcc ◴[] No.45779192[source]
I think the thing that rubs me the wrong way is that Rust was supposed to offer "fearless" concurrency. Go doesn't claim that title, so I'm not offended when it doesn't live up to it.
2. kibwen ◴[] No.45784777[source]
Despite "fearless concurrency", Rust has been careful to never claim to prevent deadlocks/race conditions in general, in either async code or non-async code. It's certainly easier to get deadlocks in async Rust than in non-async Rust, but this isn't some sort of novel failure mode.