←back to thread

578 points abelanger | 2 comments | | HN request time: 0s | source

Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves for 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis and data loss can occur when suffering OOM if you're not careful, and using PG helps avoid an entire class of problems.

We also wanted something that was significantly easier to use and debug for application developers. A lot of times the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get our task start times down to 25-50ms and much more optimization is on the roadmap.

We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, dependent tasks. A few things we're currently working on - we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but didn't want to spend additional time on the exchange logic until we built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

Show context
stephen ◴[] No.39653807[source]
Wow, looks great! We currently happily use graphile-worker, and have two questions:

> full transactional enqueueing

Do you mean transactional within the same transaction as the application's own state?

My guess is no (from looking at the docs, where enqueuing in the SDK looks a lot like a wire call and not issuing a SQL command over our application's existing connection pool), and that you mean transactionality between steps within the Hatchet jobs...

I get that, but fwiw transactionality of "perform business logic against entities + job enqueue" (both for queuing the job itself, as well as work performed by workers) is the primary reason we're using a PG-based job queue, as then we avoid transactional outboxes for each queue/work step.

So, dunno, loosing that would be a big deal/kinda defeat the purpose (for us) of a PG-based queue.

2nd question, not to be a downer, but I'm just genuinely curious as a wanna-be dev infra/tooling engineer, but a) why take funding to build this (it seems bootstrappable? maybe that's naive), and b) why would YC keeping putting money into these "look really neat but ...surely?... will never be the 100x returns/billion dollar companies" dev infra startups? Or maybe I'm over-estimating the size of the return/exit necessary to make it worth their while.

replies(1): >>39654002 #
1. abelanger ◴[] No.39654002[source]
Yeah these are great questions.

> Do you mean transactional within the same transaction as the application's own state? My guess is no (from looking at the docs, where enqueuing in the SDK looks a lot like a wire call and not issuing a SQL command over our application's existing connection pool), and that you mean transactionality between steps within the Hatchet jobs...

Yeah, it's the latter, though we'd actually like to support the former in the longer term. There's no technical reason we can't write the workflow/task and read from the same table that you're enqueueing with in the same transaction as your application. That's the really exciting thing about the RiverQueue implementation, though it also illustrates how difficult it is to support every PG driver in an elegant way.

Transactional enqueueing is important for a whole bunch of other reasons though - like assigning workers, maintaining dependencies between tasks, implementing timeouts.

> why take funding to build this (it seems bootstrappable? maybe that's naive)

The thesis is that we can help some users offload their tasks infra with a hosted version, and hosted infra is hard to bootstrap.

> why would YC keeping putting money into these "look really neat but ...surely?... will never be the 100x returns/billion dollar companies" dev infra startups?

I think Cloudflare is an interesting example here. You could probably make similar arguments against a spam protection proxy, which was the initial service. But a lot of the core infrastructure needed for that service branches into a lot of other products, like a CDN or caching layer, or a more compelling, full-featured product like a WAF. I can't speak for YC or the other dev infra startups, but I imagine that's part of the thesis.

replies(1): >>39654543 #
2. stephen ◴[] No.39654543[source]
Nice! Thanks for the reply!

> hosted infra is hard to bootstrap.

Ah yeah, that definitely makes sense...

> a lot of the core infrastructure needed for that service branches > into a lot of other products

Ah, I think I see what you mean--the goal isn't to be "just a job queue" in 2-5 years, it's to grow into a wider platform/ecosystem/etc.

Ngl I go back/forth between rooting for dev-founded VC companies like yourself, or Benjie, the guy behind graphile-worker, who is tip-toeing into being commercially-supported.

Like I want both to win (paydays all around! :-D), but the VC money just gives such a huge edge, of establishing a very high bar of polish / UX / docs / devrel / marketing, basically using loss-leading VC money for a bet that may/may not work out, that it's very hard for the independents to compete. I have honestly been hoping post-ZIRP would swing some advantage back to the Benjie's of the world, but looks like no/not yet.

...I say all of above ^ while myself working for a VC-backed prop-tech company...so, kinda the pot calling the kettle black. :-D

Good luck! The fairness/priority queues of Hatchet definitely seem novel, at least from what I'm used to, so will keep it bookmarked/in the tool chest.