
578 points | abelanger | 4 comments

Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), and we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and we felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur under OOM conditions if you're not careful; using Postgres helps avoid that entire class of problems.
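
To make the SKIP LOCKED point concrete, here's a minimal sketch of transactional dequeueing with psycopg2 - the table and column names are illustrative, not Hatchet's actual schema:

    import psycopg2

    conn = psycopg2.connect("dbname=queue")

    def dequeue_one():
        # One transaction covers claiming the task, running it, and recording
        # the result; SKIP LOCKED lets concurrent workers claim different rows.
        with conn:
            with conn.cursor() as cur:
                cur.execute(
                    """
                    SELECT id, payload FROM tasks
                    WHERE status = 'queued'
                    ORDER BY created_at
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                    """
                )
                row = cur.fetchone()
                if row is None:
                    return None
                task_id, payload = row
                # ... run the task and write its result here ...
                cur.execute("UPDATE tasks SET status = 'done' WHERE id = %s", (task_id,))
                return task_id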

We also wanted something that was significantly easier to use and debug for application developers. The burden of building task observability often falls on the infra/platform team (for example, asking the infra team to build a Grafana view for tasks based on exported Prometheus metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters, and regions - they are remotely invoked via a long-lived gRPC connection to the Hatchet queue. We've been optimizing latency to get task start times down to 25-50ms, and much more optimization is on the roadmap.
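
As a rough illustration of that worker model (a simplified sketch, not our actual SDK API): a worker registers named task handlers, then blocks on a long-lived stream of assignments pushed by the queue.

    from typing import Callable, Dict

    class Worker:
        def __init__(self) -> None:
            self.handlers: Dict[str, Callable[[dict], dict]] = {}

        def register(self, name: str):
            # Decorator that maps a task name to its handler function.
            def wrap(fn: Callable[[dict], dict]):
                self.handlers[name] = fn
                return fn
            return wrap

        def run(self, assignments):
            # In Hatchet the assignments arrive over a long-lived gRPC stream;
            # here we just iterate over an in-memory iterable for illustration.
            for name, payload in assignments:
                print(name, "->", self.handlers[name](payload))

    worker = Worker()

    @worker.register("send-welcome-email")
    def send_welcome_email(payload: dict) -> dict:
        return {"sent_to": payload["email"]}

    worker.run([("send-welcome-email", {"email": "a@example.com"})])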

We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we had built a stable underlying queue. We're also considering NATS for engine-engine and engine-worker connections.
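
On the "just use Postgres" point, one Postgres-native option for that pub/sub is LISTEN/NOTIFY. A minimal listener sketch with psycopg2 (the channel name and payload are illustrative):

    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=queue")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    cur = conn.cursor()
    cur.execute("LISTEN task_events;")

    # Another connection (e.g. the component that enqueued a task) would run:
    #   NOTIFY task_events, 'task 42 queued';

    while True:
        # Wait up to 5 seconds for the socket to become readable.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            print(f"channel={note.channel} payload={note.payload}")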

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

Kinrany (No.39645312)
With NATS in the stack, what's the advantage over using NATS directly?
replies(1): >>39649524
1. abelanger (No.39649524)
I'm assuming specifically you mean Nex functions? Otherwise NATS gives you connectivity and a message queue - it doesn't (or didn't) have the concept of task executions or workflows.

With regard to Nex: it isn't fully stable and only supports JavaScript/WebAssembly. It's also extremely new, so I'd be curious to see how things stabilize in the coming year.

replies(3): >>39650640, >>39651097, >>39655626
2. rapnie (No.39650640)
I recently came across Nex in the context of Wasmcloud [0] and its ability to support long-running tasks/workflows. My impression is that Nex indeed still needs time to mature. There was also a talk [1] about using Temporal here. It may be interesting for Hatchet to check out (note: I am not affiliated with Wasmcloud, nor currently using it).

[0] https://wasmcloud.com

[1] https://www.temporal.io/replay/videos/zero-downtime-deploys-...

3. bruth (No.39651097)
(Disclaimer: I am a NATS maintainer and work for Synadia)

The parent comment may have been referring to the fact that NATS supports durable (and replicated) work-queue streams, so those could be used directly for queueing tasks and having a set of workers dequeue concurrently - regardless of whether you use Nex. Nex is indeed fairly new, but the team is iterating on it quickly and we are dogfooding it internally to keep stabilizing it.
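
For illustration, a minimal work-queue stream with the nats-py client (stream/subject/consumer names here are made up):

    import asyncio
    import nats
    from nats.js.api import StreamConfig, RetentionPolicy

    async def main():
        nc = await nats.connect("nats://localhost:4222")
        js = nc.jetstream()

        # A work-queue stream: each message is delivered to one consumer and
        # removed from the stream once acknowledged.
        await js.add_stream(StreamConfig(
            name="TASKS",
            subjects=["tasks.>"],
            retention=RetentionPolicy.WORK_QUEUE,
        ))

        await js.publish("tasks.email", b'{"to": "a@example.com"}')

        # Workers share a durable pull consumer and dequeue concurrently.
        sub = await js.pull_subscribe("tasks.>", durable="workers")
        for msg in await sub.fetch(1):
            print(msg.subject, msg.data)
            await msg.ack()

        await nc.close()

    asyncio.run(main())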

Another benefit of NATS is built-in multi-tenancy, which allows distinct applications/teams/contexts to have an isolated set of streams and messaging. It acts as a secure namespace.

NATS supports clustering within a region or across regions. For example, Synadia hosts a supercluster spanning many regions across the globe and the three major cloud providers. As it applies to distributed work queues, you can place work-queue streams in a cluster within the region/provider closest to the users/apps enqueueing the work, and then deploy workers in the same region to optimize the latency of dequeuing and processing.

It could be worth a deeper look at how much you could leverage for this use case.

4. Kinrany (No.39655626)
I wasn't thinking of Nex; I didn't realize Hatchet includes compute and doesn't just store tasks.

Still, it seems like NATS + any lambda implementation + a dumb service that wakes lambdas when they need to process something would be simple to set up and, in combination, would do the same thing.
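
Roughly this kind of dispatcher, sketched with nats-py (invoke_lambda is a stand-in for whatever FaaS invocation you'd actually use, and the work-queue stream is assumed to already exist):

    import asyncio
    import nats
    from nats.errors import TimeoutError as NatsTimeout

    async def invoke_lambda(payload: bytes) -> None:
        # Stand-in for an HTTP call or cloud SDK invocation of the function.
        print("invoking function with", payload)

    async def main():
        nc = await nats.connect("nats://localhost:4222")
        js = nc.jetstream()
        sub = await js.pull_subscribe("tasks.>", durable="dispatcher")
        while True:
            try:
                msgs = await sub.fetch(10, timeout=5)
            except NatsTimeout:
                continue  # nothing queued right now
            for msg in msgs:
                await invoke_lambda(msg.data)
                await msg.ack()

    asyncio.run(main())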