

578 points abelanger | 15 comments | | HN request time: 0.214s | source | bottom

Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur under OOM conditions if you're not careful; using PG helps avoid an entire class of problems.
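For readers unfamiliar with the pattern: "transactional enqueueing" means the task row is inserted in the same transaction as the business write, so the two commit or roll back together. A rough SQL sketch (table and column names here are illustrative, not Hatchet's actual schema):

```sql
BEGIN;
-- Business write and task enqueue succeed or fail atomically:
INSERT INTO orders (id, status) VALUES (123, 'pending');
INSERT INTO tasks (kind, payload, status)
VALUES ('process_order', '{"order_id": 123}', 'queued');
COMMIT;
```

A broker-backed queue can't offer this: the enqueue to Redis/RabbitMQ happens outside the database transaction, so a crash between the commit and the publish can drop or duplicate work.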

We also wanted something that was significantly easier to use and debug for application developers. Too often, the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported Prometheus metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get our task start times down to 25-50ms and much more optimization is on the roadmap.

We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to use just Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We're also considering NATS for engine-engine and engine-worker connections.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

kcorbitt ◴[] No.39643991[source]
I love your vision and am excited to see the execution! I've been looking for exactly this product (postgres-backed task queue with workers in multiple languages and decent built-in observability) for like... 3 years. Every 6 months I'll check in and see if someone has built it yet, evaluate the alternatives, and come away disappointed.

One important feature request that would probably block our adoption: one reason I prefer a Postgres-backed queue over e.g. Redis is simply to simplify our infra by having fewer servers and technologies in the stack. Adding in RabbitMQ is definitely an extra dependency I'd really like to avoid.

(Currently we've settled on graphile-worker which is fine for what it does, but leaves a lot of boxes unchecked.)

replies(9): >>39644137 #>>39645512 #>>39646111 #>>39647059 #>>39647179 #>>39650750 #>>39651174 #>>39652574 #>>39652765 #
1. abelanger ◴[] No.39644137[source]
Thank you, appreciate the kind words! What boxes are you looking to check?

Yes, I'm not a fan of the RabbitMQ dependency either - see here for the reasoning: https://news.ycombinator.com/item?id=39643940.

It would take some work to replace this with listen/notify in Postgres, and less work to replace it with an in-memory component, but we can't provide the same guarantees in that case.

replies(2): >>39647886 #>>39647932 #
2. jaggederest ◴[] No.39647886[source]
I come to this only as an interested observer, but my experience with listen/notify is that it outperforms RabbitMQ/Kafka in small to medium operations and has always pleasantly surprised me. You might find it's a little easier than you think to slim your dependency stack down.
replies(1): >>39651534 #
3. kcorbitt ◴[] No.39647932[source]
Boxes-wise, I'd like a management interface at least as good as the one Sidekiq has offered Rails for years. I'd also need some hard numbers around performance and probably a bit more battle-testing before using this in our current product.
4. hosh ◴[] No.39651534[source]
How do you handle things when no listeners are available to be notified?
replies(1): >>39653677 #
5. abelanger ◴[] No.39653677{3}[source]
Presumably there'd be a messages table that you listen/notify on, and you'd replay messages that weren't consumed when a listener rejoins. But yeah, this is the overhead I was referencing.
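A hypothetical sketch of that overhead: every message is persisted to a table, NOTIFY serves purely as a wake-up, and a rejoining listener first drains anything it missed (the schema below is illustrative, not Hatchet's):

```sql
-- Producer: persist the message, then nudge any live listeners.
INSERT INTO messages (topic, body)
VALUES ('task_events', '{"task_id": 42}');
NOTIFY task_events;

-- Rejoining consumer: replay everything unconsumed before
-- waiting on the channel again.
SELECT id, body FROM messages
WHERE topic = 'task_events' AND consumed_at IS NULL
ORDER BY id;
```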
replies(2): >>39654661 #>>39655861 #
6. jaggederest ◴[] No.39654661{4}[source]
Yep, but practically speaking you need those records anyway, even if you're using another queue to actually distribute the jobs - every system I've built of a reasonable size has had a job audit table. Plus, it's an "Enterprise Feature™", so you can differentiate on it if you like that kind of feature-based pricing.
replies(1): >>39655867 #
7. hosh ◴[] No.39655861{4}[source]
With the way LISTEN/NOTIFY works, Postgres doesn't keep a record of notifications that had no listener at the time. So you cannot replay them - unless you know something about PostgreSQL that I don't.
replies(1): >>39655893 #
8. hosh ◴[] No.39655867{5}[source]
Postgres's LISTEN/NOTIFY doesn't keep those kinds of records. The whole point of using SKIP LOCKED is that you can update rows to track those messages while serving concurrent consumers.
replies(1): >>39657159 #
9. yencabulator ◴[] No.39655893{5}[source]
You insert work-to-be-performed into a table, and use NOTIFY only to wake up consumers that there is more work to be had. Consumers that weren't there at the time of NOTIFY can look at the rows in the table at startup.
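The protocol described here - rows carry the work, the notification is only a wake-up - can be mimicked outside Postgres. A minimal in-process analogy, using a condition variable in place of LISTEN/NOTIFY and a list in place of the table (all names are illustrative):

```python
import threading

class WorkTable:
    """Toy stand-in for a jobs table plus LISTEN/NOTIFY.

    The notification carries no payload; consumers always re-query
    the "table", so a consumer that attaches late still sees rows
    inserted before it started listening.
    """

    def __init__(self):
        self._rows = []            # pending work (the "table")
        self._cond = threading.Condition()

    def insert_and_notify(self, job):
        with self._cond:
            self._rows.append(job)     # INSERT the row first...
            self._cond.notify_all()    # ...then NOTIFY (wake-up only)

    def claim(self, timeout=1.0):
        with self._cond:
            # A late-joining consumer finds existing rows without
            # ever having received a notification.
            if not self._rows:
                self._cond.wait(timeout=timeout)
            return self._rows.pop(0) if self._rows else None

table = WorkTable()
table.insert_and_notify({"job": "send_email"})
# This consumer attaches *after* the notify fired, yet still gets the work:
job = table.claim()
```

The key property is that the durable state lives in the rows, not in the notification, so missed notifications cost only latency, never data.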
replies(1): >>39656434 #
10. hosh ◴[] No.39656434{6}[source]
I see. So the NOTIFY just says there is work to be performed, but there is no payload that includes the job. The consumer still has to make a query, and if there isn't enough work, the queries come back empty. This saves you from having to poll, but it's not a true push system.
replies(1): >>39657180 #
11. jaggederest ◴[] No.39657159{6}[source]
Yes. I'm saying you'll need to manually insert some kind of job audit log into a different table. Cheers
12. jaggederest ◴[] No.39657180{7}[source]
As far as I can tell, NOTIFY is fan-out, in the sense that it sends a message to all the LISTENing connections, so it wouldn't make sense in that context anyway. It's not one-to-one; it's about making sure that jobs get picked up in a timely fashion. If you're doing something fancier with event sourcing or the equivalent, you can send events via NOTIFY and have clients decide what to do with them.

Quoth the manual: "The NOTIFY command sends a notification event together with an optional “payload” string to each client application that has previously executed LISTEN channel for the specified channel name in the current database. Notifications are visible to all users."

replies(1): >>39657571 #
13. hosh ◴[] No.39657571{8}[source]
NOTIFY can be fired from triggers to send payloads related to changes to a table. It can be set up to send the id of a row that was inserted or updated, for example. (But WAL replication is usually better for this.)
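A sketch of that setup (names illustrative): a trigger function fires after each insert and publishes the new row's id on a channel via pg_notify:

```sql
CREATE OR REPLACE FUNCTION notify_new_task() RETURNS trigger AS $$
BEGIN
  -- Send the inserted row's id to every listener on the channel.
  PERFORM pg_notify('task_inserted', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER task_inserted_notify
AFTER INSERT ON tasks
FOR EACH ROW EXECUTE FUNCTION notify_new_task();
```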
replies(1): >>39660110 #
14. yencabulator ◴[] No.39660110{9}[source]
Broadcasting the id to a lot of workers is not useful, only one of them should work on the task. Waking up the workers to do a SELECT FOR UPDATE .. SKIP LOCKED is the trick. At best the NOTIFY payload could include the kind of worker that should wake up.
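The claim step being referred to looks roughly like this (schema illustrative): every awakened worker runs it, and whichever wins the lock takes the job while the rest skip past instead of blocking:

```sql
-- Claim exactly one queued row; rows locked by another
-- worker are skipped, not waited on.
UPDATE tasks
SET status = 'running', started_at = now()
WHERE id = (
  SELECT id FROM tasks
  WHERE status = 'queued'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, kind, payload;
```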
replies(1): >>39666056 #
15. ◴[] No.39666056{10}[source]