Show HN: Hatchet – Open-source distributed task queue

(github.com)

578 points abelanger | 2 comments | 08 Mar 24 17:07 UTC | HN request time: 0.425s | source

Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves for 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis and data loss can occur when suffering OOM if you're not careful, and using PG helps avoid an entire class of problems.

We also wanted something that was significantly easier to use and debug for application developers. A lot of times the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get our task start times down to 25-50ms and much more optimization is on the roadmap.

We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, dependent tasks. A few things we're currently working on - we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but didn't want to spend additional time on the exchange logic until we built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

Show context

topicseed ◴[08 Mar 24 17:26 UTC] No.39643364[source]▶

>>39643136 (OP) #

What specific strategies does Hatchet employ to guarantee fault tolerance and enable durable execution? How does it handle partial failures in multi-step workflows?

replies(1): >>39643881 #

abelanger ◴[08 Mar 24 18:05 UTC] No.39643881[source]▶

>>39643364 #

Each task in Hatchet is backed by a workflow [1]. Workflows are predefined steps which are persisted in PostgreSQL. If a worker dies or crashes midway through (stops heartbeating to the engine), we reassign tasks (assuming they have retries left). We also track timeouts in the database, which means if we miss a timeout, we simply retry after some amount of time. Like I mentioned in the post, we avoid some classes of faults just by relying on PostgreSQL and persisting each workflow run, so you don't need to time out with distributed locks in Redis, for example, or worry about data loss if Redis OOMs. Our `ticker` service is basically its own worker which is assigned a lease for each step run.

We also store the input/output of each workflow step in the database. So resuming a multi-step workflow is pretty simple - we just replay the step with the same input.

To zoom out a bit - unlike many alternatives [2], the execution path of a multi-step workflow in Hatchet is declared ahead of time. There are tradeoffs to this approach; it makes it much easier to run a single-step workflow or if you know the workflow execution path ahead of time. You also avoid classes of problems related to workflow versioning, we can gracefully drain older workflow version with a different execution path. It's also more natural to debug and see a DAG execution instead of debugging procedural logic.

The clear tradeoff is that you can't try...catch the execution of a single task or concatenate a bunch of futures that you wait for later. Roadmap-wise, we're considering adding procedural execution on top of our workflows concept. Which means providing a nice API for calling `await workflow.run` and capturing errors. These would be a higher-level concept in Hatchet and are not built yet.

There are some interesting concepts around using semaphores and durable leases that are relevant here, which we're exploring [3].

[1] https://docs.hatchet.run/home/basics/workflows [2] https://temporal.io [3] https://www.citusdata.com/blog/2016/08/12/state-machines-to-...

replies(3): >>39644055 #>>39644522 #>>39645566 #

spenczar5 ◴[08 Mar 24 18:49 UTC] No.39644522[source]▶

>>39643881 #

What happens if a worker goes silent for longer than the heartbeat duration, then a new worker is spawned, then the original worker “comes back to life”? For example, because there was a network partition, or because the first worker’s host machine was sleeping, or even just that the first worker process was CPU starved?

replies(1): >>39647221 #

1. abelanger ◴[08 Mar 24 22:22 UTC] No.39647221[source]▶

>>39644522 #

The heartbeat duration (5s) is not the same as the inactive duration (60s). If a worker has been down for 60 seconds, we reassign to provide some buffer and handle unstable networks. Once someone asks we'll expose these options and make them configurable.

We currently send cancellation signals for individual tasks to workers, but our cancellation signals aren't replayed if they fail on the network. This is an important edge case for us to figure out.

There's not much we can do if the worker ignores that signal. We should probably add some alerting if we see multiple responses on the same task, because that means the worker is ignoring the cancellation signal. This would also be a problem if workloads start blocking the whole thread.

replies(1): >>39647844 #

2. spenczar5 ◴[08 Mar 24 23:36 UTC] No.39647844[source]▶

>>39647221 (TP) #

Right, I meant inactive duration, of course.

Cancellation signals are tricky. You of course cannot be sure that the remote end receives it. This turns into the two generals problem.

Yes, you need monitoring for this case. I work on scientific workloads which can completely consume CPU resources. This failure scenario is quite real.

Not all tasks are idempotent, but it sounds like a prudent user should try to design things that way, since your system has “at least once” execution of tasks, as opposed to “at most once.” Despite any marketing claims, “exactly once” is not generally possible.

Good docs on this point are important, as is configurability for cases when “at most once” is preferable.

↑