Show HN: Hatchet – Open-source distributed task queue

(github.com)

Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves for 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur on OOM if you're not careful; using PG helps avoid an entire class of problems.
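
For readers unfamiliar with SKIP LOCKED, here is a minimal sketch of the general Postgres queue pattern it enables. This is not Hatchet's actual schema or code: the `tasks` table, its columns, and the `claim_next_task` helper are all hypothetical, and `conn` is assumed to be any DB-API 2.0 connection to Postgres (for example one from psycopg).

```python
# Hypothetical schema: tasks(id, status, payload). Not Hatchet's real tables.
CLAIM_SQL = """
UPDATE tasks
SET status = 'running'
WHERE id = (
    SELECT id
    FROM tasks
    WHERE status = 'queued'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""


def claim_next_task(conn):
    """Claim one queued task, or return None if none is available.

    Rows already locked by another worker are skipped instead of waited on,
    so concurrent workers neither block each other nor claim the same task.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()  # None when no unlocked queued row exists
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Because the queue lives in the same database as application data, an `INSERT` into such a table can share a transaction with other writes, which is the "transactional enqueueing" benefit described above.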

We also wanted something that was significantly easier to use and debug for application developers. A lot of times the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get our task start times down to 25-50ms and much more optimization is on the roadmap.

We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, dependent tasks. A few things we're currently working on - we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but didn't want to spend additional time on the exchange logic until we built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.
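
To make "dependent tasks" and "retries" concrete, here is a rough in-process sketch using only the Python standard library. It is an illustration of the idea, not Hatchet's SDK or engine: `run_dag`, `steps`, `deps`, and `max_retries` are hypothetical names, and a real queue would persist state and dispatch each step to a worker rather than calling it inline.

```python
from graphlib import TopologicalSorter


def run_dag(steps, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed step.

    steps: maps step name -> callable taking a dict of parent results.
    deps:  maps step name -> set of parent step names.
    """
    results = {}
    graph = {name: deps.get(name, set()) for name in steps}
    # static_order() yields every step after all of its parents.
    for name in TopologicalSorter(graph).static_order():
        parents = {p: results[p] for p in deps.get(name, set())}
        for attempt in range(max_retries + 1):
            try:
                results[name] = steps[name](parents)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted; surface the failure
    return results
```

A step only runs once all of its parents have produced results, and a step that keeps failing stops the run after `max_retries` extra attempts.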

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

bluehadoop ◴[08 Mar 24 18:08 UTC] No.39643927[source]▶

>>39643136 (OP) #

How does this compare against Temporal/Cadence/Conductor? Does hatchet also support durable execution?

https://temporal.io/ https://cadenceworkflow.io/ https://conductor-oss.org/

replies(1): >>39644550 #

1. abelanger ◴[08 Mar 24 18:50 UTC] No.39644550[source]▶

>>39643927 #

It's very similar - I used Temporal at a previous company to run a couple million workflows per month. The gRPC networking with workers is the most similar component, I especially liked that I only had to worry about an http2 connection with mTLS instead of a different broker protocol.

Temporal is a powerful system but we were getting to the point where it took a full-time engineer to build an observability layer around Temporal. Integrating workflows in an intuitive way with OpenTelemetry and logging was surprisingly non-trivial. We wanted to build more of a Vercel-like experience for managing workflows.

We have a section on the docs page for durable execution [1], also see the comment on HN [2]. Like I mention in that comment, we still have a long way to go before users can write a full workflow in code in the same style as a Temporal workflow; for now, users either define the execution path ahead of time or invoke a child workflow from an existing workflow. This is also something that requires customization for each SDK - like Temporal's custom asyncio event loop in their Python SDK [3]. We don't want to roll this out until we can be sure about compatibility with the way most people write their functions.

[1] https://docs.hatchet.run/home/features/durable-execution

[2] https://news.ycombinator.com/item?id=39643881

[3] https://github.com/temporalio/sdk-python

replies(2): >>39646064 #>>39651696 #

2. bicijay ◴[08 Mar 24 20:33 UTC] No.39646064[source]▶

>>39644550 (TP) #

Well, you just got a user. Love the concept of Temporal, but I can't justify the overhead you need with infra to make it work for the upper guys... And the cloud offering is a bit expensive for small companies.

replies(1): >>39647745 #

3. mfateev ◴[08 Mar 24 23:23 UTC] No.39647745[source]▶

>>39646064 #

Do you know about the Temporal startup program? It gives enough credits to offset support fees for 2 years. https://temporal.io/startup

replies(2): >>39649420 #>>39658844 #

4. Aeolun ◴[09 Mar 24 04:39 UTC] No.39649420{3}[source]▶

>>39647745 #

If you expect to still be small after 2 years, doesn't that just delay the expense until you are locked in?

5. dangoodmanUT ◴[09 Mar 24 14:17 UTC] No.39651696[source]▶

>>39644550 (TP) #

> we were getting to the point where it took a full-time engineer to build an observability layer around Temporal

We did it in like 5 minutes by adding in otel traces? And maybe another 15 to add their grafana dashboard?

What obstacles did you experience here?

replies(1): >>39653865 #

6. abelanger ◴[09 Mar 24 18:52 UTC] No.39653865[source]▶

>>39651696 #

Well, for one - most otel services (like Honeycomb) are designed around aggregate views, and engineers found it difficult to track down the failure of specific workflows. We were already using Sentry, had started adding prom + grafana into our stack, and were already using mezmo for logging. So to debug a workflow, we'd see an alert come in through Sentry, grab the workflow ID and activity ID, perform a search in the Temporal console, track down the failed activity (of which there could be anywhere from 1 to 100), and associate that with our logs in mezmo (involving a new query syntax). This is a lot of raw data, and it takes time to parse it and figure out what's going wrong. And then we wanted to build out a view of worker health, which involves a new set of dashboards and alerts that are different from our error alerting in Sentry.

Yes, this sounded broken to us too - we were aware of the promise of consolidation with an OpenTelemetry and Grafana stack, but we couldn't make this transition happen cleanly, and when you're already relying on certain tools for your API it makes the transition more difficult. There's also upskilling involved in getting engineers on the team to adjust to otel when they're used to more intuitive tools like sentry and mezmo.

A good set of default metrics, better search, and views for worker performance and pools - that would have gone a long way. The extent of Temporal's UI features is basic recent workflows, an expanded workflow view with stack traces for thrown errors, a schedules page, and a settings page.

7. bicijay ◴[10 Mar 24 13:22 UTC] No.39658844{3}[source]▶

>>39647745 #

I know it's gonna sound entitled, but even though we are a small company, we still process a lot of events from third parties. Temporal Cloud pricing is based on number of actions, so 2400 bucks would only cover some months in our case.
