
110 points ingve | 8 comments
1. websiteapi ◴[] No.46007761[source]
there's a lot of hype around durable execution these days. why do that instead of regular use of queues? is it the dev ergonomics that's cool here?

you can (and people already do) model the steps of any arbitrarily large workflow, process their results in a modular fashion, and have whatever process begins the workflow check the state of the necessary preconditions before taking any action, so it can jump to the currently needed step, retry ones that failed, and so forth.
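A rough sketch of that manual approach, with a hypothetical three-step workflow and a SQLite state store (all names here are illustrative, not any particular framework's):

```python
import sqlite3

# Each step records its completion in a state table; the driver inspects
# that state to decide which step to run (or retry) next.
STEPS = ["reserve_inventory", "charge_card", "send_confirmation"]

def init_store(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS step_state (
        workflow_id TEXT, step TEXT, status TEXT,
        PRIMARY KEY (workflow_id, step))""")

def step_status(conn, workflow_id, step):
    row = conn.execute(
        "SELECT status FROM step_state WHERE workflow_id=? AND step=?",
        (workflow_id, step)).fetchone()
    return row[0] if row else "pending"

def mark(conn, workflow_id, step, status):
    conn.execute("INSERT OR REPLACE INTO step_state VALUES (?,?,?)",
                 (workflow_id, step, status))

def next_step(conn, workflow_id):
    # Precondition check: the next runnable step is the first one that
    # hasn't completed; a "failed" step is simply picked up again (retry).
    for step in STEPS:
        if step_status(conn, workflow_id, step) != "done":
            return step
    return None  # workflow complete
```

A driver loop would call `next_step`, dispatch to the matching handler, and `mark` the result; after a crash, restarting the loop resumes at the right step. This is exactly the bookkeeping durable execution frameworks automate.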

replies(5): >>46007901 #>>46008038 #>>46008154 #>>46008279 #>>46009559 #
2. tptacek ◴[] No.46007901[source]
We build what is effectively a durable execution "engine" for our orchestrator (ours is backed by boltdb and not SQLite, which I objected to, correctly). The steps in our workflows build running virtual machines and include things like allocating addresses, loading BPF programs, preparing root filesystems, and registering services.

Short answer: we need to be able to redeploy and bounce the orchestrator without worrying about what stage each running VM on our platform is in.
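The core idea can be sketched generically (this is not superfly/fsm's actual API, just an illustration of "bounce the orchestrator without losing track of each VM's stage"):

```python
import sqlite3

# Hypothetical sketch: the orchestrator durably records the last completed
# stage for each VM's workflow, so a redeploy/restart resumes every
# in-flight workflow from where it left off instead of from scratch.
STAGES = ["allocate_addresses", "load_bpf", "prepare_rootfs", "register_services"]

class Orchestrator:
    def __init__(self, conn):
        self.conn = conn
        conn.execute("""CREATE TABLE IF NOT EXISTS vm_stage (
            vm_id TEXT PRIMARY KEY, done_through INTEGER)""")

    def run(self, vm_id):
        row = self.conn.execute(
            "SELECT done_through FROM vm_stage WHERE vm_id=?",
            (vm_id,)).fetchone()
        start = (row[0] + 1) if row else 0   # resume after last completed stage
        for i in range(start, len(STAGES)):
            self.execute_stage(vm_id, STAGES[i])   # process may die mid-way
            self.conn.execute(
                "INSERT OR REPLACE INTO vm_stage VALUES (?,?)", (vm_id, i))

    def execute_stage(self, vm_id, stage):
        pass  # the real work (addresses, BPF, rootfs, services) goes here
```

A second `run("vm1")` after a restart executes zero stages, because the checkpoint says the workflow already finished.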

JP, the dev that built this out for us, talks a bit about the design rationale (search for "Cadence") here:

https://fly.io/blog/the-exit-interview-jp/

The library itself is open:

https://github.com/superfly/fsm

3. ryeats ◴[] No.46008038[source]
As you say, it can be done, but using a message queue as a database is an anti-pattern, and that's essentially what you're doing for these kinds of long-running tasks. The reason is that there's a lot of state you're likely going to want to track, persist, and checkpoint as a task runs. Yes, you can carefully string together a series of database calls chained with message transactions so you don't lose anything when an issue happens, but then you also need bespoke logic to restart or retry each step, and it can turn into a bit of a mess.
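What "carefully stringing together database calls with message transactions" tends to look like in practice is an outbox-style write, where the checkpoint and the outgoing message commit atomically (table and column names here are illustrative):

```python
import sqlite3

# The task's checkpoint and the message that triggers the next step are
# written in ONE transaction, so a crash between "update state" and
# "publish" can't lose work or publish a message for state that was
# never saved.
def complete_step_and_enqueue_next(conn, task_id, step, next_msg):
    with conn:  # sqlite3: commits both statements together, rolls back on error
        conn.execute("UPDATE tasks SET current_step=? WHERE id=?",
                     (step, task_id))
        conn.execute("INSERT INTO outbox (task_id, payload) VALUES (?,?)",
                     (task_id, next_msg))
    # A separate relay process would read the outbox and publish to the
    # real queue, deleting rows only after the broker acknowledges them.
```

And that's before the bespoke restart/retry logic per step — which is the mess the parent is describing, and what durable execution frameworks package up for you.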
4. snicker7 ◴[] No.46008154[source]
Message queues (e.g. SQS) are inappropriate for tracking long-running tasks/workflows. This is due to the operational requirements such as:

- Checking the status of a task (queued, pending, failed, cancelled, completed)
- Cancelling a queued task (or a pending task, if the execution environment supports it)
- Re-prioritizing queued tasks
- Searching for tasks based off an attribute (e.g. tag)

You really do need a database for this.
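A minimal sketch of the kind of task table those operations imply (the schema is illustrative, not any particular framework's):

```python
import sqlite3

def make_task_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        status TEXT CHECK (status IN
            ('queued','pending','failed','cancelled','completed')),
        priority INTEGER DEFAULT 0,
        tag TEXT)""")
    return conn

def cancel(conn, task_id):
    # Only a queued task can be cancelled unilaterally; a pending (running)
    # task would also need cooperation from the execution environment.
    cur = conn.execute(
        "UPDATE tasks SET status='cancelled' WHERE id=? AND status='queued'",
        (task_id,))
    return cur.rowcount == 1

def reprioritize(conn, task_id, priority):
    conn.execute(
        "UPDATE tasks SET priority=? WHERE id=? AND status='queued'",
        (priority, task_id))

def find_by_tag(conn, tag):
    return [r[0] for r in conn.execute(
        "SELECT id FROM tasks WHERE tag=?", (tag,))]
```

None of these are queries a broker like SQS can answer; once a message is in flight you can't inspect, cancel, reorder, or search it.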

replies(2): >>46009178 #>>46009553 #
5. kodablah ◴[] No.46008279[source]
> is it the dev ergonomics that's cool here?

Yup. Being able to write imperative code that automatically resumes where it left off is very valuable. It's best to represent durable Turing-completeness using the modern approach to authoring such logic: programming languages. Being able to loop, try/catch, apply advanced conditional logic, etc. in a crash-proof algorithm that can run for weeks/months/years and is introspectable has a lot of value over just using queues.

Durable execution is all just queues and task processing and event sourcing under the hood though.
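A toy illustration of the event-sourcing trick under the hood: completed step results are recorded in a durable history, and on restart the workflow function is re-run from the top with recorded steps replayed from history instead of re-executed (names made up for the sketch):

```python
class Workflow:
    def __init__(self, history=None):
        self.history = history or []   # the event log (persisted in reality)
        self.position = 0

    def step(self, name, fn):
        if self.position < len(self.history):
            # Replay path: this step already ran before the crash, so
            # return its recorded result instead of executing it again.
            recorded_name, result = self.history[self.position]
            assert recorded_name == name, "non-deterministic workflow code"
            self.position += 1
            return result
        result = fn()                   # first execution: run and record
        self.history.append((name, result))
        self.position += 1
        return result

def transfer(wf, calls):
    # Ordinary imperative code, yet resumable: `calls` tracks which side
    # effects actually execute (vs. being replayed from history).
    a = wf.step("debit", lambda: calls.append("debit") or 100)
    b = wf.step("credit", lambda: calls.append("credit") or a)
    return b
```

Re-running `transfer` with the saved history returns the same result while executing no side effects, which is exactly why the workflow code must be deterministic.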

6. yyx ◴[] No.46009178[source]
Sounds like Celery with a SQLAlchemy backend.
7. DenisM ◴[] No.46009553[source]
I’m reminded of classical LRU cache implementation - double linked list and a hash map that points to the list elements.

It is a queue if we squint really hard, but it allows random access and reordering. Do we have durable structures of this kind?
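For reference, the classic structure in question; Python's `OrderedDict` is precisely a hash map over a doubly linked list, giving O(1) lookup plus O(1) reordering. A durable version would presumably persist the same two structures, e.g. rows carrying prev/next pointers in a table:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # hash map + doubly linked list in one

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)     # O(1) reorder: now most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```

It behaves like a queue at the eviction end, but `get` can touch and reorder any element, which is the random access Kafka/SQS can't express.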

I can’t imagine how to shoehorn this into Kafka or SQS.

8. hmaxdml ◴[] No.46009559[source]
The hype is because DE is such a dev-experience improvement over building your own queue. Good DE frameworks come with workflows, pub/sub, notifications, distributed queues with tons of flow-control options, etc.