Building a Durable Execution Engine with SQLite

1. roughly ◴[21 Nov 25 21:04 UTC] No.46008976[source]▶

One thing that needs to be emphasized with “durable execution” engines is they don’t actually get you out of having to handle errors, rollbacks, etc. Even the canonical examples everyone uses - so you’re using a DE engine to restart a sales transaction, but the part of that transaction that failed was “charging the customer” - did it fail before or after the charge went through? You failed while updating the inventory system - did the product get marked out or not? All of these problems are tractable, but once you’ve solved them - once you’ve built sufficient atomicity into your system to handle the actual failure cases - the benefits of taking on the complexity of a DE system are substantially lower than the marketing pitch.

replies(3): >>46009362 #>>46009374 #>>46009633 #

2. hedgehog ◴[21 Nov 25 21:46 UTC] No.46009362[source]▶

>>46008976 (TP) #

In my one encounter with one of these systems it induced new code and tooling complexity, orders of magnitude performance overhead for most operations, and made dev and debug workflows much slower. All for... an occasional convenience far outweighed by the overall drag of using it. There are probably other environments where something like this makes sense but I can't figure out what they are.

replies(2): >>46009456 #>>46009666 #

3. throwaway894345 ◴[21 Nov 25 21:48 UTC] No.46009374[source]▶

>>46008976 (TP) #

> they don’t actually get you out of having to handle errors

I wrote a durable system that recovers from all sorts of errors (mostly network faults) without writing much error handling code. It just retries automatically, and importantly the happy path and the error path are exactly the same, so I don’t have to worry that my error path has much less execution than my happy path.

> but the part of that transaction that failed was “charging the customer” - did it fail before or after the charge went through?

In all cases, whether the happy path or the error path, the first thing you do is compare the desired state (“there exists a transaction exists charging the customer $5”) with the actual state (“has the customer been charged $5?”) and that determines whether you (re)issue the transaction or just update your internal state.

> once you’ve built sufficient atomicity into your system to handle the actual failure cases - the benefits of taking on the complexity of a DE system are substantially lower than the marketing pitch

I probably agree with this. The main value is probably not in the framework but rather in the larger architecture that it encourages—separating things out into idempotent functions that can be safely retried. I could maybe be persuaded otherwise, but most of my “durable execution” patterns seem to be more of a “controller pattern” (in the sense of a Kubernetes controller, running a reconciling control loop) and it just happens that any distributed, durable controller platform includes a durable execution subsystem.

4. throwaway894345 ◴[21 Nov 25 21:58 UTC] No.46009456[source]▶

>>46009362 #

> All for... an occasional convenience far outweighed by the overall drag of using it

If you have any long-running operation that could be interrupted mid-run by any network fluke (or the termination of the VM running your program, or your program being OOMed, or some issue with some third party service that your app talks to, etc), and you don’t want to restart the whole thing from scratch, you could benefit from these systems. The alternative is having engineers manually try to repair the state and restart execution in just the right place and that scales very badly.

I have an application that needs to stand up a bunch of cloud infrastructure (a “workspace” in which users can do research) on the press of a button, and I want to make sure that the right infrastructure exists even if some deployment attempt is interrupted or if the upstream definition of a workspace changes. Every month there are dozens of network flukes or 5XX errors from remote endpoints that would otherwise leave these workspaces in a broken state and in need of manual repair. Instead, the system heals itself whenever the fault clears and I basically never have to look at the system (I periodically check the error logs, however, to confirm that the system is actually recovering from faults—I worry that the system has caught fire and there’s actually some bug in the alerting system that is keeping things quiet).

replies(1): >>46011164 #

5. jedberg ◴[21 Nov 25 22:15 UTC] No.46009633[source]▶

>>46008976 (TP) #

The key to a durable workflow is making each step idempotent. Then you don't have to worry about those things. You just run the failed step again. If it already worked the first time, it's a no-op.

For example, stripe lets you include an idempotency key with your request. If you try to make a charge again with the same key, it ignores you. A DE framework like DBOS will automatically generate the idempotency key for you.

But you're correct, if you can't make the operation idempotent, then you have to handle that yourself.

replies(1): >>46009942 #

6. jedberg ◴[21 Nov 25 22:17 UTC] No.46009666[source]▶

>>46009362 #

I'm not sure which one you used, but ideally it's so lightweight that the benefits outweigh the slight cost of developing with them. Besides the recovery benefit, there is observability and debugging benefits too.

replies(1): >>46011100 #

7. repeekad ◴[21 Nov 25 22:46 UTC] No.46009942[source]▶

>>46009633 #

Temporal plus idempotency keys solves probably the majority of infrastructure normally needed for production systems

replies(1): >>46010375 #

8. cyberpunk ◴[21 Nov 25 23:34 UTC] No.46010375{3}[source]▶

>>46009942 #

Except to run temporal at scale on prem you’ll need 50x the infra you had before.

replies(1): >>46010599 #

9. jedberg ◴[22 Nov 25 00:01 UTC] No.46010599{4}[source]▶

>>46010375 #

Indeed, one of the main selling points of DBOS. All the functionality of Temporal without any of the infrastructure.

replies(1): >>46010644 #

10. cyberpunk ◴[22 Nov 25 00:08 UTC] No.46010644{5}[source]▶

>>46010599 #

Ah I don't know if I would agree with that. Temporal does a lot of stuff; we just don't happen to need most of it and it's really heavyweight on the database side (running low 500 or so workflows/second of their own 'hello world' style echo benchmark translates to 100k database ops/second..

DBOS is tied to Postgres, right? That wouldn't scale anywhere near where we need either.

Sadly there aren't many shortcuts in this space and pretending there are seems a bit hip at the moment. In the end, mostly everyone who can afford to solve such problems are gonna end up writing their own systems for this.

11. hedgehog ◴[22 Nov 25 01:12 UTC] No.46011100{3}[source]▶

>>46009666 #

I don't want to start a debate about a specific vendor but the cost was very high. Leaky serialization of call arguments and results, then hairpinning messages across the internet and back to get to workers. 200ms overhead for a no-op call. There was some observability benefit but it didn't allow for debugger access and had its own special way of packaging code so net add of complexity there too. That's not getting into the induced complexity caused by adding a bunch of RPC boundaries to fit their execution model. All that and using the thing effectively still requires understanding their runtime model. I understand the motivation, but not the technical approach.

12. hedgehog ◴[22 Nov 25 01:20 UTC] No.46011164{3}[source]▶

>>46009456 #

The system I used didn't have any notion of repair, just retry-forever. What did you use for that? I've written service tree management tools that do that sort of thing on a single host but not any kind of distributed system.