
578 points abelanger | 8 comments

Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), and we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur on OOM if you're not careful; using PG avoids an entire class of those problems.
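
To make the transactional-enqueueing point concrete, here's a minimal sketch of the SKIP LOCKED pattern this style of queue builds on - illustrative only, using a hypothetical tasks table and psycopg rather than Hatchet's actual schema:

    import psycopg  # assumption: psycopg 3; the tasks table/columns are hypothetical

    def enqueue(conn, task_type, payload):
        # Transactional enqueueing: the task row commits (or rolls back) together
        # with whatever application writes happen in the same transaction.
        with conn.transaction():
            conn.execute(
                "INSERT INTO tasks (type, payload, status) VALUES (%s, %s, 'queued')",
                (task_type, payload),
            )

    def dequeue(conn):
        # FOR UPDATE SKIP LOCKED lets many workers poll the same table without
        # blocking on, or double-claiming, rows another worker has already locked.
        with conn.transaction():
            return conn.execute(
                """
                UPDATE tasks SET status = 'running'
                WHERE id = (
                    SELECT id FROM tasks
                    WHERE status = 'queued'
                    ORDER BY id
                    FOR UPDATE SKIP LOCKED
                    LIMIT 1
                )
                RETURNING id, type, payload
                """
            ).fetchone()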

We also wanted something that was significantly easier to use and debug for application developers. Too often, the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks from exported Prometheus metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters, and regions - they're invoked remotely over a long-lived gRPC connection with the Hatchet queue. We've optimized latency to get task start times down to 25-50ms, and much more optimization is on the roadmap.

We also support a number of extra features you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We're also considering NATS for engine-engine and engine-worker connections.
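
To give a feel for the developer experience, here's a rough sketch of a dependent, two-step workflow with a cron trigger in the Python SDK. The decorator and parameter names (on_events, on_crons, parents, timeout, retries) are written from memory and may not match the current SDK exactly:

    from hatchet_sdk import Hatchet

    hatchet = Hatchet()

    # Assumed decorator API: a workflow triggered by an event or a cron schedule.
    @hatchet.workflow(on_events=["order:created"], on_crons=["0 * * * *"])
    class OrderWorkflow:
        # Steps form a DAG via `parents`; timeouts and retries are set per step.
        @hatchet.step(timeout="30s", retries=3)
        def fetch(self, context):
            return {"fetched": True}

        @hatchet.step(parents=["fetch"], timeout="1m")
        def process(self, context):
            return {"status": "processed"}

    # Workers are plain processes holding a long-lived gRPC connection to the
    # Hatchet engine; they can run on any VM, cluster, or region.
    worker = hatchet.worker("order-worker")
    worker.register_workflow(OrderWorkflow())
    worker.start()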

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

1. bluehadoop ◴[] No.39643927[source]
How does this compare against Temporal/Cadence/Conductor? Does hatchet also support durable execution?

https://temporal.io/ https://cadenceworkflow.io/ https://conductor-oss.org/

replies(1): >>39644550 #
2. abelanger ◴[] No.39644550[source]
It's very similar - I used Temporal at a previous company to run a couple million workflows per month. The gRPC networking with workers is the most similar component; I especially liked that I only had to worry about an HTTP/2 connection with mTLS instead of a separate broker protocol.

Temporal is a powerful system, but we were getting to the point where it took a full-time engineer to build an observability layer around Temporal. Integrating workflows with OpenTelemetry and logging in an intuitive way was surprisingly non-trivial. We wanted to build more of a Vercel-like experience for managing workflows.

We have a section in the docs on durable execution [1]; also see this comment on HN [2]. As I mention there, we still have a long way to go before users can write a full workflow in code in the same style as a Temporal workflow: today, users either define the execution path ahead of time or invoke a child workflow from an existing workflow. This also requires customization for each SDK - like Temporal's custom asyncio event loop in their Python SDK [3]. We don't want to roll this out until we can be sure it's compatible with the way most people write their functions.

[1] https://docs.hatchet.run/home/features/durable-execution

[2] https://news.ycombinator.com/item?id=39643881

[3] https://github.com/temporalio/sdk-python
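
For readers unfamiliar with the distinction: below is roughly what the Temporal-style "workflow as code" looks like in their Python SDK [3] - a sketch only, but it shows why replay-safe control flow needs per-SDK work like that custom asyncio event loop:

    from datetime import timedelta
    from temporalio import activity, workflow

    @activity.defn
    async def charge_card(order_id: str) -> str:
        return f"charged:{order_id}"

    @workflow.defn
    class OrderWorkflow:
        @workflow.run
        async def run(self, order_id: str) -> str:
            # Control flow (branches, loops, sleeps) lives in ordinary code rather
            # than a predefined DAG; the SDK replays it deterministically.
            return await workflow.execute_activity(
                charge_card, order_id, start_to_close_timeout=timedelta(seconds=30)
            )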

replies(2): >>39646064 #>>39651696 #
3. bicijay ◴[] No.39646064[source]
Well, you just got a user. I love the concept of Temporal, but I can't justify the infra overhead needed to make it work to the higher-ups... And the cloud offering is a bit expensive for small companies.
replies(1): >>39647745 #
4. mfateev ◴[] No.39647745{3}[source]
Do you know about the Temporal startup program? It gives enough credits to offset support fees for 2 years. https://temporal.io/startup
replies(2): >>39649420 #>>39658844 #
5. Aeolun ◴[] No.39649420{4}[source]
If you're expecting to still be small after 2 years, doesn't that just delay the expense until you're locked in?
6. dangoodmanUT ◴[] No.39651696[source]
> we were getting to the point where it took a full-time engineer to build an observability layer around Temporal

We did it in like 5 minutes by adding in OTel traces? And maybe another 15 to add their Grafana dashboard?
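
For reference, the basic wiring in their Python SDK is roughly the following (a minimal sketch; it assumes an OpenTelemetry exporter/collector is already configured elsewhere):

    import asyncio
    from temporalio.client import Client
    from temporalio.contrib.opentelemetry import TracingInterceptor

    async def main():
        # The tracing interceptor emits OTel spans for workflow and activity
        # calls made through this client and any workers built from it.
        client = await Client.connect(
            "localhost:7233",
            interceptors=[TracingInterceptor()],
        )
        # ... register workers / start workflows with `client` as usual

    asyncio.run(main())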

What obstacles did you experience here?

replies(1): >>39653865 #
7. abelanger ◴[] No.39653865{3}[source]
Well, for one - most OTel backends (like Honeycomb) are designed around aggregate views, and engineers found it difficult to track down the failure of a specific workflow. We were already using Sentry, had started adding Prometheus + Grafana to our stack, and were already using Mezmo for logging. So to debug a workflow, we'd see an alert come in through Sentry, grab the workflow ID and activity ID, search in the Temporal console, track down the failed activity (of which there could be anywhere from 1 to 100), and associate that with our logs in Mezmo (which meant yet another query syntax). That's a lot of raw data to parse before you can figure out what's going wrong. And then we wanted to build out a view of worker health, which meant another set of dashboards and alerts, separate from our error alerting in Sentry.

Yes, this sounded broken to us too - we were aware of the promised consolidation of an OpenTelemetry + Grafana stack, but we couldn't make that transition happen cleanly, and when you're already relying on certain tools for the rest of your API, the transition gets harder. There's also upskilling involved in getting engineers on the team to adjust to OTel when they're used to more intuitive tools like Sentry and Mezmo.

A good set of default metrics, better search, and views for worker performance and pools - that would have gone a long way. The extent of the Temporal UI is a basic recent-workflows list, an expanded workflow view with stack traces for thrown errors, a schedules page, and a settings page.

8. bicijay ◴[] No.39658844{4}[source]
I know it's gonna sound entitled, but even though we are a small company, we still process a lot of events from third parties. Temporal Cloud pricing is based on the number of actions, and 2400 bucks would only cover a few months in our case.