
110 points ingve | 8 comments
1. websiteapi ◴[] No.46007761[source]
there's a lot of hype around durable execution these days. why do that instead of regular use of queues? is it the dev ergonomics that's cool here?

you can (and people already do) model the steps of any arbitrarily large workflow, process their results in a modular fashion, and have whatever process begins the workflow check the state of the necessary preconditions before taking any action, so it can jump to the currently needed step, retry ones that failed, and so forth.
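A rough sketch of that manual approach, with a hypothetical three-step workflow and a SQLite state store (all names here are illustrative, not any particular framework's):

```python
import sqlite3

# Each step records its completion in a state table; the driver inspects
# that state to decide which step to run (or retry) next.
STEPS = ["reserve_inventory", "charge_card", "send_confirmation"]

def init_store(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS step_state (
        workflow_id TEXT, step TEXT, status TEXT,
        PRIMARY KEY (workflow_id, step))""")

def step_status(conn, workflow_id, step):
    row = conn.execute(
        "SELECT status FROM step_state WHERE workflow_id=? AND step=?",
        (workflow_id, step)).fetchone()
    return row[0] if row else "pending"

def mark(conn, workflow_id, step, status):
    conn.execute("INSERT OR REPLACE INTO step_state VALUES (?,?,?)",
                 (workflow_id, step, status))

def next_step(conn, workflow_id):
    # Precondition check: the next runnable step is the first one that
    # hasn't completed; a "failed" step is simply picked up again (retry).
    for step in STEPS:
        if step_status(conn, workflow_id, step) != "done":
            return step
    return None  # workflow complete
```

A driver loop would call `next_step`, dispatch to the matching handler, and `mark` the result; after a crash, restarting the loop resumes at the right step. This is exactly the bookkeeping durable execution frameworks automate.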

replies(5): >>46007901 #>>46008038 #>>46008154 #>>46008279 #>>46009559 #
2. tptacek ◴[] No.46007901[source]
We build what is effectively a durable execution "engine" for our orchestrator (ours is backed by boltdb and not SQLite, which I objected to, correctly). The steps in our workflows build running virtual machines and include things like allocating addresses, loading BPF programs, preparing root filesystems, and registering services.

Short answer: we need to be able to redeploy and bounce the orchestrator without worrying about what stage each running VM on our platform is in.
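The core idea can be sketched generically (this is not superfly/fsm's actual API, just an illustration of "bounce the orchestrator without losing track of each VM's stage"):

```python
import sqlite3

# Hypothetical sketch: the orchestrator durably records the last completed
# stage for each VM's workflow, so a redeploy/restart resumes every
# in-flight workflow from where it left off instead of from scratch.
STAGES = ["allocate_addresses", "load_bpf", "prepare_rootfs", "register_services"]

class Orchestrator:
    def __init__(self, conn):
        self.conn = conn
        conn.execute("""CREATE TABLE IF NOT EXISTS vm_stage (
            vm_id TEXT PRIMARY KEY, done_through INTEGER)""")

    def run(self, vm_id):
        row = self.conn.execute(
            "SELECT done_through FROM vm_stage WHERE vm_id=?",
            (vm_id,)).fetchone()
        start = (row[0] + 1) if row else 0   # resume after last completed stage
        for i in range(start, len(STAGES)):
            self.execute_stage(vm_id, STAGES[i])   # process may die mid-way
            self.conn.execute(
                "INSERT OR REPLACE INTO vm_stage VALUES (?,?)", (vm_id, i))

    def execute_stage(self, vm_id, stage):
        pass  # the real work (addresses, BPF, rootfs, services) goes here
```

A second `run("vm1")` after a restart executes zero stages, because the checkpoint says the workflow already finished.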

JP, the dev that built this out for us, talks a bit about the design rationale (search for "Cadence") here:

https://fly.io/blog/the-exit-interview-jp/

The library itself is open:

https://github.com/superfly/fsm

3. ryeats ◴[] No.46008038[source]
As you say, it can be done, but using a message queue as a database is an anti-pattern, and that's essentially what you're doing for these kinds of long-running tasks. The reason is that there's a lot of state you're likely going to want to track, persist, and checkpoint as a task runs. Yes, you can carefully string together a series of database calls chained with message transactions so you don't lose anything when an issue happens, but then you also need bespoke logic to restart or retry each step, and it can turn into a bit of a mess.
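What "carefully stringing together database calls with message transactions" tends to look like in practice is an outbox-style write, where the checkpoint and the outgoing message commit atomically (table and column names here are illustrative):

```python
import sqlite3

# The task's checkpoint and the message that triggers the next step are
# written in ONE transaction, so a crash between "update state" and
# "publish" can't lose work or publish a message for state that was
# never saved.
def complete_step_and_enqueue_next(conn, task_id, step, next_msg):
    with conn:  # sqlite3: commits both statements together, rolls back on error
        conn.execute("UPDATE tasks SET current_step=? WHERE id=?",
                     (step, task_id))
        conn.execute("INSERT INTO outbox (task_id, payload) VALUES (?,?)",
                     (task_id, next_msg))
    # A separate relay process would read the outbox and publish to the
    # real queue, deleting rows only after the broker acknowledges them.
```

And that's before the bespoke restart/retry logic per step — which is the mess the parent is describing, and what durable execution frameworks package up for you.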
4. snicker7 ◴[] No.46008154[source]
Message queues (e.g. SQS) are inappropriate for tracking long-running tasks/workflows. This is due to the operational requirements such as:

- Checking the status of a task (queued, pending, failed, cancelled, completed)
- Cancelling a queued task (or a pending task, if the execution environment supports it)
- Re-prioritizing queued tasks
- Searching for tasks based off an attribute (e.g. tag)

You really do need a database for this.
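A minimal sketch of the kind of task table those operations imply (the schema is illustrative, not any particular framework's):

```python
import sqlite3

def make_task_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        status TEXT CHECK (status IN
            ('queued','pending','failed','cancelled','completed')),
        priority INTEGER DEFAULT 0,
        tag TEXT)""")
    return conn

def cancel(conn, task_id):
    # Only a queued task can be cancelled unilaterally; a pending (running)
    # task would also need cooperation from the execution environment.
    cur = conn.execute(
        "UPDATE tasks SET status='cancelled' WHERE id=? AND status='queued'",
        (task_id,))
    return cur.rowcount == 1

def reprioritize(conn, task_id, priority):
    conn.execute(
        "UPDATE tasks SET priority=? WHERE id=? AND status='queued'",
        (priority, task_id))

def find_by_tag(conn, tag):
    return [r[0] for r in conn.execute(
        "SELECT id FROM tasks WHERE tag=?", (tag,))]
```

None of these are queries a broker like SQS can answer; once a message is in flight you can't inspect, cancel, reorder, or search it.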

replies(2): >>46009178 #>>46009553 #
5. kodablah ◴[] No.46008279[source]
> is it the dev ergonomics that's cool here?

Yup. Being able to write imperative code that automatically resumes where it left off is very valuable. It's best to represent durable Turing-completeness using the modern approach to authoring such logic: programming languages. Being able to loop, try/catch, apply advanced conditional logic, etc. in a crash-proof algorithm that can run for weeks/months/years and is introspectable has a lot of value over just using queues.

Durable execution is all just queues and task processing and event sourcing under the hood though.
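A toy illustration of the event-sourcing trick under the hood: completed step results are recorded in a durable history, and on restart the workflow function is re-run from the top with recorded steps replayed from history instead of re-executed (names made up for the sketch):

```python
class Workflow:
    def __init__(self, history=None):
        self.history = history or []   # the event log (persisted in reality)
        self.position = 0

    def step(self, name, fn):
        if self.position < len(self.history):
            # Replay path: this step already ran before the crash, so
            # return its recorded result instead of executing it again.
            recorded_name, result = self.history[self.position]
            assert recorded_name == name, "non-deterministic workflow code"
            self.position += 1
            return result
        result = fn()                   # first execution: run and record
        self.history.append((name, result))
        self.position += 1
        return result

def transfer(wf, calls):
    # Ordinary imperative code, yet resumable: `calls` tracks which side
    # effects actually execute (vs. being replayed from history).
    a = wf.step("debit", lambda: calls.append("debit") or 100)
    b = wf.step("credit", lambda: calls.append("credit") or a)
    return b
```

Re-running `transfer` with the saved history returns the same result while executing no side effects, which is exactly why the workflow code must be deterministic.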

6. yyx ◴[] No.46009178[source]
Sounds like Celery with a SQLAlchemy backend.
7. DenisM ◴[] No.46009553[source]
I’m reminded of classical LRU cache implementation - double linked list and a hash map that points to the list elements.

It is a queue if we squint really hard, but it allows random access and reordering. Do we have durable structures of this kind?
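For reference, the classic structure in question; Python's `OrderedDict` is precisely a hash map over a doubly linked list, giving O(1) lookup plus O(1) reordering. A durable version would presumably persist the same two structures, e.g. rows carrying prev/next pointers in a table:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # hash map + doubly linked list in one

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)     # O(1) reorder: now most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```

It behaves like a queue at the eviction end, but `get` can touch and reorder any element, which is the random access Kafka/SQS can't express.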

I can’t imagine how to shoehorn this into Kafka or SQS.

8. hmaxdml ◴[] No.46009559[source]
The hype is because DE is such a dev-experience improvement over building your own queue. Good DE frameworks come with workflows, pub/sub, notifications, distributed queues with tons of flow-control options, etc.