578 points by abelanger | 12 comments

Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run). We're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur under OOM conditions if you're not careful; using PG avoids that entire class of problems.
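To illustrate the core dequeueing pattern (a minimal sketch assuming a simple "tasks" table and psycopg2 - illustrative only, not our actual schema):

    import psycopg2

    # Claim one queued task. FOR UPDATE SKIP LOCKED makes competing
    # workers skip rows another transaction has locked instead of
    # blocking, so each task is claimed exactly once.
    DEQUEUE_SQL = """
    UPDATE tasks
    SET status = 'running', started_at = now()
    WHERE id = (
        SELECT id FROM tasks
        WHERE status = 'queued'
        ORDER BY created_at
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, payload;
    """

    def dequeue_one(conn):
        with conn, conn.cursor() as cur:  # commits on success
            cur.execute(DEQUEUE_SQL)
            return cur.fetchone()  # None when the queue is empty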

We also wanted something that was significantly easier to use and debug for application developers. Often, the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported Prometheus metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've worked to get task start times down to 25-50ms, and much more optimization is on the roadmap.
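To make that concrete, here's roughly what a worker looks like with our Python SDK (a simplified sketch - see our docs for the exact decorator names and parameters):

    from hatchet_sdk import Hatchet

    hatchet = Hatchet()  # connection config comes from the environment

    # A simplified workflow definition; exact names/params per our docs.
    @hatchet.workflow(on_events=["user:created"])
    class Welcome:
        @hatchet.step()
        def send_welcome(self, context):
            return {"status": "sent"}

    # The worker can run on any VM/cluster/region that can reach the
    # engine; it holds a long-lived gRPC connection and is invoked
    # remotely when a task is assigned to it.
    worker = hatchet.worker("welcome-worker")
    worker.register_workflow(Welcome())
    worker.start()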

We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We are also considering NATS for engine-engine and engine-worker connections.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

1. jerrygenser No.39644543
Something I really like about some pub/sub systems is push subscriptions. For example, in GCP Pub/Sub you can have a "subscriber" that is not pulling events off the queue but is instead an HTTP endpoint that events are pushed to.

The nice thing about this is that you can use a runtime like Cloud Run or Lambda and allow that runtime to scale based on HTTP requests, and also scale to zero.
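For illustration, a minimal sketch of such a push endpoint, assuming Flask (Pub/Sub POSTs a JSON envelope whose message data is base64-encoded; a 2xx response acks the message, anything else triggers a retry):

    import base64, json
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/push", methods=["POST"])
    def handle_push():
        envelope = request.get_json()
        data = base64.b64decode(envelope["message"].get("data", ""))
        job = json.loads(data or "{}")
        # ... run the job; a non-2xx response would trigger redelivery ...
        return ("", 204)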

Setting up autoscaling for workers can be a little more finicky - e.g., in Kubernetes you might set up KEDA autoscaling based on queue-depth metrics, but these might need to be exported from RabbitMQ.

I suppose you could have a setup where your daemon worker makes HTTP requests, and in that sense "pushes" to the place where jobs actually run, but this adds another level of complexity.

Is there any plan to support a push model, where you can push jobs over HTTP to daemons that hold the HTTP connections open?

replies(8): >>39645123, >>39647156, >>39647854, >>39650089, >>39650936, >>39651913, >>39655909, >>39681765
2. abelanger No.39645123
I like that idea - basically the first HTTP request ensures the worker gets spun up on a lambda, and the task gets picked up on the next poll once the worker is running. We already have the underlying push model for our streaming feature: https://docs.hatchet.run/home/features/streaming. You can configure this to post to an HTTP endpoint pretty easily.

The daemon feels fragile to me - why not just shut down the worker client-side after some period of inactivity?
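A minimal sketch of what I mean (poll_for_task is a hypothetical stand-in for the SDK's polling):

    import time

    IDLE_TIMEOUT = 60.0  # seconds without work before shutting down

    def run_until_idle(poll_for_task):
        # Exit after a period of inactivity so a serverless runtime
        # can scale the worker back down to zero.
        last_work = time.monotonic()
        while time.monotonic() - last_work < IDLE_TIMEOUT:
            task = poll_for_task()  # returns None when queue is empty
            if task is None:
                time.sleep(1.0)  # back off between empty polls
                continue
            task.run()
            last_work = time.monotonic()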

replies(1): >>39645490
3. jerrygenser No.39645490
I think it depends on the HTTP runtime. One of the things with Cloud Run is that if the server is not handling requests, it doesn't get CPU time. So even if the first request is "wake up", it wouldn't get any CPU to poll outside of the request-response cycle.

You can configure Cloud Run to always allocate CPU, but it's a lot more expensive. I don't think it would be a good autoscaling story either, since autoscaling is based on HTTP requests being processed. (Maybe it can be done via CPU, but that may not be what you want - the workload may not even be CPU-bound.)

4. alexbouchard No.39647156
The push queue model has major benefits, as you mentioned. We've built Hookdeck (hookdeck.com) on that premise. I hope we see more projects adopt it.
5. tonyhb No.39647854
You might want to look at https://www.inngest.com for that. Disclaimer: I'm a cofounder. We released event-driven step functions about 20 months ago.
replies(1): >>39650910
6. jsmeaton No.39650089
https://cloud.google.com/tasks is such a good model and I really want an open source version of it (or to finally bite the bullet and write my own).

Having HTTP targets means you get things like rate limiting, middleware, and observability that your regular application uses, and you aren't tied to whatever backend the task system supports.

Set up a separate scaling group and away you go.
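For reference, creating an HTTP-target task with the google-cloud-tasks Python client looks roughly like this (the project, queue, and URL values are placeholders):

    import json
    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("my-project", "us-central1", "my-queue")

    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://example.com/tasks/handle",
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"user_id": 123}).encode(),
        }
    }

    # Cloud Tasks pushes the request to the URL; retries and rate
    # limits are governed by the queue's configuration.
    client.create_task(request={"parent": parent, "task": task})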

7. jerrygenser No.39650910
Looks cool, but it looks like it's TypeScript-only. If there's a JSON payload, couldn't any web server handle it?
replies(1): >>39672687
8. lysecret No.39650936
Yep, we use Cloud Tasks and Pub/Sub a lot. Another big benefit is that the GCP infra keeps "pushing" your messages even if your infra goes down.
9. sixdimensional No.39651913
Some tools like Apache NiFi call this pattern an HTTP listener. It's basically a kind of sink, and it also somewhat resembles webhook architecture.
10. yencabulator No.39655909
> For example, in GCP Pub/Sub you can have a "subscriber" that is not pulling events off the queue but is instead an HTTP endpoint that events are pushed to.

That just means there's a lightweight worker that does the HTTP POST to your "subscriber", with retries etc., just like it's done here.
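Conceptually something like this sketch (dequeue and the endpoint URL are hypothetical stand-ins):

    import time
    import requests

    ENDPOINT = "https://example.com/jobs"  # subscriber URL (placeholder)

    def push_loop(dequeue):
        # dequeue() blocks until a job is available, then we POST it
        # to the subscriber with exponential backoff on failure.
        while True:
            job = dequeue()
            for attempt in range(5):
                try:
                    resp = requests.post(ENDPOINT, json=job, timeout=10)
                    if resp.ok:
                        break  # a 2xx response acks the job
                except requests.RequestException:
                    pass  # network error: retry below
                time.sleep(2 ** attempt)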

11. tonyhb No.39672687
We support TS, Python, Golang, and Java/Kotlin with official SDKs, and our SDK spec is open, so yes — any server can handle it :)
12. jamescmartinez No.39681765
Mergent (YC S21 - https://mergent.co) might be precisely what you're looking for in terms of a push-over-HTTP model for background jobs and crons.

You simply define a task using our API and we take care of pushing it to any HTTP endpoint, holding the connection open and using the HTTP status code to determine success/failure, whether or not we should retry, etc.

Happy to answer any questions here or over email james@mergent.co