Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur on OOM if you're not careful; using PG helps avoid an entire class of problems.
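For reference, the core SKIP LOCKED dequeue pattern mentioned here looks roughly like this - a minimal sketch in Python/psycopg2 against a made-up `tasks` table, not Hatchet's actual schema:

```python
# Illustrative only: the SKIP LOCKED dequeue pattern, against a hypothetical
# `tasks` table (id, status, payload, created_at, started_at).
import psycopg2

def dequeue_one(conn):
    """Claim a single queued task without blocking on rows other workers hold."""
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE tasks
                SET status = 'running', started_at = now()
                WHERE id = (
                    SELECT id FROM tasks
                    WHERE status = 'queued'
                    ORDER BY created_at
                    FOR UPDATE SKIP LOCKED
                    LIMIT 1
                )
                RETURNING id, payload;
                """
            )
            return cur.fetchone()  # None if the queue is empty

conn = psycopg2.connect("dbname=app")
task = dequeue_one(conn)
```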
We also wanted something that was significantly easier to use and debug for application developers. A lot of times the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.
What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get our task start times down to 25-50ms and much more optimization is on the roadmap.
We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We're also considering the use of NATS for engine-engine and engine-worker connections.
We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.
However, I am still missing a section on why this is different from any of the other existing and more mature solutions. What led you to develop this over existing options, and what different tradeoffs did you make? Extra points if you can concisely tell me what you do badly that your 'competitors' do well, because I don't believe there is one best solution in this space; it is all tradeoffs.
[1] https://github.com/hatchet-dev/hatchet/blob/main/README.md#h...
> Welcome to Hatchet! This guide walks you through getting set up on Hatchet Cloud. If you'd like to self-host Hatchet, please see the self-hosted quickstart instead.
but the link to "self-hosted quickstart" links back to the same page
Hatchet looks cool nonetheless. Queues are a pain for many other use-cases too.
We also store the input/output of each workflow step in the database. So resuming a multi-step workflow is pretty simple - we just replay the step with the same input.
To zoom out a bit - unlike many alternatives [2], the execution path of a multi-step workflow in Hatchet is declared ahead of time. There are tradeoffs to this approach; it makes things much easier if you're running a single-step workflow or know the workflow execution path ahead of time. You also avoid classes of problems related to workflow versioning: we can gracefully drain older workflow versions with a different execution path. It's also more natural to debug and see a DAG execution instead of debugging procedural logic.
The clear tradeoff is that you can't try...catch the execution of a single task or concatenate a bunch of futures that you wait for later. Roadmap-wise, we're considering adding procedural execution on top of our workflows concept, which means providing a nice API for calling `await workflow.run` and capturing errors. These would be a higher-level concept in Hatchet and are not built yet.
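To make the "declared ahead of time" model concrete, here's a tiny self-contained sketch of the idea - not the Hatchet SDK, and the decorator/step names are made up - showing how a DAG can be registered up front and then executed in dependency order:

```python
# Not Hatchet's API: a sketch of declaring the execution path ahead of time.
# Steps register their parents up front, so the DAG is known before anything
# runs (and can be versioned, drained, and visualized).
from graphlib import TopologicalSorter

STEPS = {}      # step name -> function
PARENTS = {}    # step name -> list of parent step names

def step(parents=()):
    def register(fn):
        STEPS[fn.__name__] = fn
        PARENTS[fn.__name__] = list(parents)
        return fn
    return register

@step()
def load_docs(inputs):
    return ["doc-a", "doc-b"]

@step(parents=["load_docs"])
def summarize(inputs):
    return [f"summary of {d}" for d in inputs["load_docs"]]

def run_workflow(event):
    """Run steps in dependency order, keeping each step's output by name."""
    outputs = {}
    for name in TopologicalSorter(PARENTS).static_order():
        parent_outputs = {p: outputs[p] for p in PARENTS[name]}
        outputs[name] = STEPS[name](parent_outputs or event)
    return outputs

print(run_workflow({"source": "s3://bucket/docs"}))
```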
There are some interesting concepts around using semaphores and durable leases that are relevant here, which we're exploring [3].
[1] https://docs.hatchet.run/home/basics/workflows [2] https://temporal.io [3] https://www.citusdata.com/blog/2016/08/12/state-machines-to-...
https://temporal.io/ https://cadenceworkflow.io/ https://conductor-oss.org/
One important feature request that probably would block our adoption: one reason why I prefer a postgres-backed queue over eg. Redis is just to simplify our infra by having fewer servers and technologies in the stack. Adding in RabbitMQ is definitely an extra dependency I'd really like to avoid.
(Currently we've settled on graphile-worker which is fine for what it does, but leaves a lot of boxes unchecked.)
The component which needs the highest uptime is our ingestion service [1]. This ingests events from the Hatchet SDKs and is responsible for writing the workflow execution path, and then sends messages downstream to our other engine components. This is a horizontally scalable service and you should run at least 2 replicas across different AZs. Also see how to configure different services for engine components [2].
The other piece of this is PostgreSQL: use your favorite managed provider with point-in-time restores and backups. This is the core of our self-healing; beyond that, I'm not sure where it makes sense to route writes if the primary goes down.
Let me know what you need for self-hosted docs, happy to write them up for you.
[1] https://github.com/hatchet-dev/hatchet/tree/main/internal/se... [2] https://docs.hatchet.run/self-hosting/configuration-options#...
Yes, I'm not a fan of the RabbitMQ dependency either - see here for the reasoning: https://news.ycombinator.com/item?id=39643940.
It would take some work to replace this with listen/notify in Postgres, less work to replace this with an in-memory component, but we can't provide the same guarantees in that case.
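For anyone curious what the listen/notify route would look like, here's a rough psycopg2 sketch (channel and payload names made up). Note that plain NOTIFY is best-effort - notifications sent while nobody is listening are simply lost, which is the guarantees gap mentioned above:

```python
# Sketch of Postgres LISTEN/NOTIFY as a lightweight pub/sub channel.
import json
import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=app")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

with conn.cursor() as cur:
    cur.execute("LISTEN task_events;")

def publish(other_conn, event: dict):
    # NOTIFY is delivered when the sending transaction commits.
    with other_conn, other_conn.cursor() as cur:
        cur.execute("SELECT pg_notify('task_events', %s);", (json.dumps(event),))

while True:
    # Block until the socket is readable, then drain pending notifications.
    if select.select([conn], [], [], 5) == ([], [], []):
        continue  # timeout, loop again
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        print("received:", note.channel, note.payload)
```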
I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.
Wondering if perhaps the Postgres underpinnings here would make that possible.
EDIT: seems so! https://docs.hatchet.run/home/features/streaming
It's dead simple: the existence of the URI means the topic/channel/what have you exists; to access it, one needs to know the URI; data is streamed but there's no access to old data; multiple consumers are no problem.
The core difference is that pg-boss is a library while Hatchet is a separate service which runs independently of your workers. This service also provides a UI and API for interacting with Hatchet - I don't think pg-boss has those things, so you'd probably have to build out observability yourself.
This doesn't make a huge difference when you're at 1 worker, but having each worker poll your database can lead to DB issues if you're not careful - I've seen some pretty low-throughput setups for very long-running jobs using a database with 60 CPUs because of polling workers. Hatchet distributes in two layers - the "engine" and the "worker" layer. Each engine polls the database and fans out to the workers over a long-lived gRPC connection. This reduces pressure on the DB and lets us manage which workers to assign tasks to based on things like max concurrent runs on each worker or worker health.
The nice thing about this is that you can use a runtime like cloud run or lambda and allow that runtime to scale based on http requests and also scale to zero.
Setting up autoscaling for workers can be a little more finicky, e.g. in Kubernetes you might set up KEDA autoscaling based on some queue depth metrics, but these might need to be exported from RabbitMQ.
I suppose you could have a setup where your daemon worker is making HTTP requests and in that sense "push" to the place where jobs are actually running, but this adds another level of complexity.
Is there any plan to support a push model where you can push jobs over HTTP to some daemons that hold the HTTP connections open?
Temporal is a powerful system but we were getting to the point where it took a full-time engineer to build an observability layer around Temporal. Integrating workflows in an intuitive way with OpenTelemetry and logging was surprisingly non-trivial. We wanted to build more of a Vercel-like experience for managing workflows.
We have a section on the docs page for durable execution [1], also see the comment on HN [2]. Like I mention in that comment, we still have a long way to go before users can write a full workflow in code in the same style as a Temporal workflow; today, users either define the execution path ahead of time or invoke a child workflow from an existing workflow. This is also something that requires customization for each SDK - like Temporal's custom asyncio event loop in their Python SDK [3]. We don't want to roll this out until we can be sure about compatibility with the way most people write their functions.
[1] https://docs.hatchet.run/home/features/durable-execution
It would help to see a mapping of Celery to Hatchet as examples. The current examples require you to understand (and buy into) Hatchet's model, but that's hard to do without understanding how it compares to existing solutions.
I’m all for just using Postgres in service of the grug brain philosophy.
Will definitely be looking into this, congrats on the launch!
A nice thing in Celery Flower is viewing the `args, kwargs`, whereas Hatchet operates on JSON request/response bodies, so some early users have mentioned that it's hard to get visibility into the exact typing/serialization that's happening. Something for us to work on.
Long live Postgres queues.
The daemon feels fragile to me, why not just shut down the worker client-side after some period of inactivity?
There have to be at least 10 different ways to run a distributed task queue between the cloud providers (Amazon, Azure, GCP), self-hosting RabbitMQ, etc.
I'm curious how they are able to convince investors that there is a sizable portion of market they think doesn't already have this solved (or already has it solved and is willing to migrate)
This tool comes with more bells and whistles and presumably will be more constrained in what you can do with it, where ZeroMQ gives you the flexibility to build your own protocol. In principle they have many of the same use cases, like how you can buy ready made whipped cream or whip up your own with some heavy cream and sugar -- one approach is more constrained but works for most situations where you need some whipped cream, and the other is a lot more work and somewhat higher risk (you can over whip your cream and end up with butter), but you can do a lot more with it.
I want the task graph to run without thinking about retries, timeouts, serialized resources, etc.
Interested to look at your particular approach.
You can configure Cloud Run to always allocate CPU, but it's a lot more expensive. I don't think it would be a good autoscaling story since autoscaling is based on HTTP requests being processed (maybe it can be done via CPU, but that may not be what you want; it may not even be CPU bound).
Would be interested to know what features you feel it’s lacking.
Related, but separately: can you trigger a variable number of task executions from one step? If the answer to the previous question is yes then it would of course be trivial; if not, I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.
For example, some of the examples involve a load_docs step, but all loaded docs seem to be passed to the next step execution in the DAG together, unless I'm just misunderstanding something. How could we tweak such an example to have a separate task execution per document loaded? The benefits of durable execution and being able to resume an intensive workflow without repeating work are lessened if you can't naturally/easily control the size of the unit of work for task executions.
Could not find any specifics on generative AI in your docs. Thanks
I found that shocking at the time, if plausible, and wondered why nobody pulled on that thread. I suppose like me they had bigger fish to fry.
It was pretty painless for me to set up and write tests against. The operator works well and is really simple if you want to save money.
I mean, isn't Hatchet another dependency? Graphile Worker? I like all these things, but why draw the line at one thing over another over essentially aesthetics?
You better start believing in dependencies if you’re a programmer.
> I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.
Yeah, we were having a conversation yesterday about this - there's probably a simple decorator we could add so that if a step returns an array, and a child step is dependent on that parent step, it fans out if a `fanout` key is set. If we can avoid unstructured trace diagrams in favor of a nice DAG-style workflow execution we'd prefer to support that.
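The `fanout` decorator above isn't built yet; purely to illustrate the shape of the behavior (no SDK involved, all names made up): a parent step returns a list and each item becomes its own child execution of the dependent step:

```python
# Sketch of fan-out semantics: one child execution per item of a parent step's
# returned array. This is the behavior being discussed, not Hatchet's API.
from concurrent.futures import ThreadPoolExecutor

def load_docs(event):
    return ["doc-a", "doc-b", "doc-c"]

def process_doc(doc):
    return f"processed {doc}"

def run_with_fanout(event):
    docs = load_docs(event)  # parent step returns an array
    # Fan out: each item gets its own child execution, tracked as a separate future.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(process_doc, docs))

print(run_with_fanout({"source": "s3://bucket"}))
```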
The other thing we've started on is propagating a single "flow id" to each child workflow so we can provide the same visualization/tracing that we provide in each workflow execution. This is similar to AWS X-Ray.
As I mentioned we're working on the durable workflow model, and we'll find a way to make child workflows durable in the same way activities (and child workflows) are durable on Temporal.
[1] https://docs.hatchet.run/sdks/typescript-sdk/api/admin-clien...
I started out by just entering a record into a database queue and polling every few seconds. Functional, but our IO costs for polling weren't ideal, and we wanted to distribute this without using stuff like ShedLock. I switched to Redis, but it got complicated dealing with multiple dispatchers, OOM issues, and having to run a secondary job to move individual tasks in and out of the immediate queue, etc. I had started looking at switching to backing it with PG and SKIP LOCKED, etc., but I've changed positions.
I can see a similar use case on my horizon and wondered if Hatchet would be suitable for it.
> How do you distribute inference across workers?
In Hatchet, "run inference" would be a task. By default, tasks get randomly assigned to workers in a FIFO fashion. But we give you a few options for controlling how tasks get ordered and sent. For example, let's say you'd like to limit users to 1 inference task at a time per session. You could do this by setting a concurrency key "<session-id>" and `maxRuns=1` [1]. This means that for each session key, you only run 1 inference task. The purpose of this would be fairness.
> Can one use just any protocol
We handle the communication between the worker and the queue through a gRPC connection. We assume that you're passing JSON-serializable objects through the queue.
[1] https://docs.hatchet.run/home/features/concurrency/round-rob...
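To illustrate the semantics (this is not the Hatchet API, just the behavior described above sketched with asyncio): at most one in-flight inference task per session key:

```python
# Per-key concurrency limiting: maxRuns=1 per "<session-id>" key, enforced
# here with a per-key semaphore purely for illustration.
import asyncio
from collections import defaultdict

limits = defaultdict(lambda: asyncio.Semaphore(1))  # maxRuns=1 per key

async def run_inference(session_id: str, prompt: str) -> str:
    async with limits[session_id]:          # key = "<session-id>"
        await asyncio.sleep(0.1)            # stand-in for the actual model call
        return f"[{session_id}] answer to {prompt!r}"

async def main():
    # Two tasks for the same session run one after the other;
    # different sessions proceed concurrently.
    results = await asyncio.gather(
        run_inference("sess-1", "q1"),
        run_inference("sess-1", "q2"),
        run_inference("sess-2", "q1"),
    )
    print(results)

asyncio.run(main())
```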
As a professional I’m allergic to statements like “you better start believing in X”. How can you even have objective discourse at work like that?
I did end up moving it to Redis and basically ZADD an execution timestamp and job ID, then ZRANGEBYSCORE at my desired interval and remove those jobs as I successfully distribute them out to workers. I then set a fence time. At that time, a job runs to move stuff that should have run but didn't (rare, thankfully) into a remediation queue, and load the next block of items that should run between now + fence. At the service level, any item with a scheduled date within the fence gets ZADDed after being inserted into the normal database. Anything outside the fence will be picked up at the appropriate time.
This worked. I was able to ramp up the polling time to get near-real-time dispatch while also noticeably reducing costs. Problems were some occasional Redis issues (OOM, and having to either keep bumping up the Redis instance size or reduce the fence duration), allowing multiple pollers for redundancy and scale (I used ShedLock for that :/), and occasionally a bug where the poller craps out in the middle of the Redis work, resulting in an at-least-once SLA that required downstream protections to make sure I don't send the same message multiple times to the patient.
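For anyone wanting to replicate that hand-rolled setup, the core of it in redis-py looks roughly like this (key names made up, remediation/fence logic omitted):

```python
# Minimal redis-py sketch of the pattern described above: schedule with ZADD
# (score = execution timestamp), then poll with ZRANGEBYSCORE and remove what
# you dispatch.
import time
import redis

r = redis.Redis()

def schedule(job_id: str, run_at: float):
    r.zadd("scheduled_jobs", {job_id: run_at})

def poll_and_dispatch():
    due = r.zrangebyscore("scheduled_jobs", 0, time.time())
    for job_id in due:
        # Only the poller that removes the member dispatches it, which avoids
        # double-dispatch when multiple pollers race on the same job.
        if r.zrem("scheduled_jobs", job_id):
            dispatch(job_id.decode())

def dispatch(job_id: str):
    print("dispatching", job_id)

schedule("job-123", time.time() + 5)
while True:
    poll_and_dispatch()
    time.sleep(1)
```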
Again, it all works but I’m interested in seeing if there are solutions that I don’t have to hand roll.
We're both second time CTOs and we've been on both sides of this, as consumers of and creators of OSS. I was previously a co-founder and CTO of Porter [2], which had an open-core model. There are two risks that most companies think about in the open core model:
1. Big companies using your platform without contributing back in some way or buying a license. I think this is less of a risk, because these organizations are incentivized to buy a support license to help with maintenance, upgrades, and since we sit on a critical path, with uptime.
2. Hyperscalers folding your product in to their offering [3]. This is a bigger risk but is also a bit of a "champagne problem".
Note that smaller companies/individual developers are who we'd like to enable, not crowd out. If people would like to use our cloud offering because it reduces the headache for them, they should do so. If they just want to run our service and manage their own PostgreSQL, they should have the option to do that too.
Based on all of this, here's where we land on things:
1. Everything we've built so far has been 100% MIT licensed. We'd like to keep it that way and make money off of Hatchet Cloud. We'll likely roll out a separate enterprise support agreement for self hosting.
2. Our cloud version isn't going to run a different core engine or API server than our open source version. We'll write interfaces for all plugins to our servers and engines, so even if we have something super specific to how we've chosen to do things on the cloud version, we'll expose the options to write your own plugins on the engine and server.
3. We'd like to make self-hosting as easy to use as our cloud version. We don't want our self-hosted offering to be a second-class citizen.
Would love to hear everyone's thoughts on this.
Okay, but we're talking about this on a post about using another piece of software.
What is the rationale for, well, this additional dependency, Hatchet, that's okay, and its inevitable failures are okay, but this other dependency, RabbitMQ, which does something different, but will have fewer failures for some objective reasons, that's not okay?
Hatchet is very much about aesthetics. What else does Hatchet have going on? It doesn't have a lot of history, it's going to have a lot of bugs. It works as a DSL written in Python annotations, which is very much an aesthetic choice, very much something I see a bunch of AI startups doing, which I personally think is kind of dumb. Like OpenAI tools are "just" JSON schemas, they don't reinvent everything, and yet Trigger, Hatchet, Runloop, etc., they're all doing DSLs. It hews to a specific promotional playbook that is also very aesthetic. Is this not the "objective discourse at work" you are looking for?
I am not saying it is bad, I am saying that 99% of people adopting it will be doing so for essentially aesthetic reasons - and being less knowledgeable about alternatives might describe 50-80% of the audience, but to me, being less knowledgeable as a "professional" is an aesthetic choice. There's nothing wrong with this.
You can get into the weeds about what you meant by whatever you said. I am aware. But I am really saying, I'm dubious of anyone promoting "Use my new thing X which is good because it doesn't introduce a new dependency." It's an oxymoron plainly on its face. It's not in their marketing copy but the author is talking about it here, and maybe the author isn't completely sincere, maybe the author doesn't care and will happily write everything on top of RabbitMQ if someone were willing to pay for it, because that decision doesn't really matter. The author is just being reactive to people's aesthetics, that programmers on social media "like" Postgres more than RabbitMQ, for reasons, and that means you can "only" use one, but that none of those reasons are particularly well informed by experience or whatever, yet nonetheless strongly held.
When you want to explain something that doesn't make objective sense when read literally, okay, it might have an aesthetic explanation that makes more sense.
I have an existing pipeline that runs tasks across two K8s clusters that share a DB. Is it possible to define steps in a workflow where the step run logic is set up to run elsewhere? Essentially not having an inline run function defined, and another worker process listening for that step name.
> Do you publish pricing for your cloud offering?
Not yet, we're rolling out the cloud offering slowly to make sure we don't experience any widespread outages. As soon as we're open for self-serve on the cloud side, we'll publish our pricing model.
> For the self hosted option, are there plans to create a Kubernetes operator?
Not at the moment, our initial plan was to help folks with a KEDA autoscaling setup based on Hatchet queue metrics, which is something I've done with Sidekiq queue depth. We'll probably wait to build a k8s operator after our existing Helm chart is relatively stable.
> With an MIT license do you fear Amazon could create a Amazon Hatchet Service sometime in the future?
Yes. The question is whether that risk is worth the tradeoff of not being MIT-licensed. There are also paths to getting integrated into AWS marketplace we'll explore longer-term. I added some thoughts here: https://news.ycombinator.com/item?id=39646788.
Looking ahead (and back) in the database and placing an exclusive lock on the schedule is the way to do this. You basically guarantee scheduling at +/- the polling interval if your service goes down while maintaining the lock. This allows you to horizontally scale the `tickers` which are polling for the schedules.
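A minimal sketch of that ticker pattern (illustrative schema, not Hatchet's): each ticker polls for due schedules and claims rows with SKIP LOCKED, so multiple tickers can run without double-firing:

```python
# Sketch of horizontally scalable "tickers" against a hypothetical `schedules`
# table (id, workflow_name, next_run_at, run_interval).
import json
import time
import psycopg2

def enqueue_workflow_run(cur, workflow_name):
    cur.execute(
        "INSERT INTO tasks (status, payload) VALUES ('queued', %s);",
        (json.dumps({"workflow": workflow_name}),),
    )

def tick(conn):
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, workflow_name
            FROM schedules
            WHERE next_run_at <= now()
            FOR UPDATE SKIP LOCKED;
            """
        )
        for schedule_id, workflow_name in cur.fetchall():
            enqueue_workflow_run(cur, workflow_name)
            cur.execute(
                "UPDATE schedules SET next_run_at = next_run_at + run_interval WHERE id = %s;",
                (schedule_id,),
            )

conn = psycopg2.connect("dbname=app")
while True:
    tick(conn)
    time.sleep(1)  # the polling interval bounds how late a schedule can fire
```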
We currently send cancellation signals for individual tasks to workers, but our cancellation signals aren't replayed if they fail on the network. This is an important edge case for us to figure out.
There's not much we can do if the worker ignores that signal. We should probably add some alerting if we see multiple responses on the same task, because that means the worker is ignoring the cancellation signal. This would also be a problem if workloads start blocking the whole thread.
That seems pretty long - am I misunderstanding something? By my understanding this means the time from enqueue to job processing, maybe someone can enlighten me.
There's still a lot of work to do on optimization though, particularly to improve the polling interval when there aren't workers available to run the task. Some people might expect to set a max concurrency limit of 1 on each worker and have each subsequent workflow take 50ms to start, which isn't the case at the moment.
[1] https://github.com/hatchet-dev/hatchet/tree/main/examples/lo...
'But I am really saying, I'm dubious of anyone promoting "Use my new thing X which is good because it doesn't introduce a new dependency."'
"Advances in software technology and increasing economic pressure have begun to break down many of the barriers to improved software productivity. The ${PRODUCT} is designed to remove the remaining barriers […]"
It reads like the above quote from the pitch of r1000 in 1985. https://datamuseum.dk/bits/30003882
If you're saying that the scheduling in Hatchet should be a separate library, we rely on go-cron [1] to run cron schedules.
It's not an eternity in a task queue which supports DAG-style workflows with concurrency limits and fairness strategies. The reason for this is you need to check all of the subscribed workers and assign a task in a transactional way.
The limit on the Postgres level is probably on the order of 5-10ms on a managed PG provider. Have a look at: https://news.ycombinator.com/item?id=39593384.
Also, these are not my benchmarks, but have a look at [1] for Temporal timings.
[1] https://www.windmill.dev/blog/launch-week-1/fastest-workflow...
>When you want to explain something that doesn't make objective sense when read literally, okay, it might have an aesthetic explanation that makes more sense.
What an attitude and way to kill a discussion. Again, hard for me to imagine that you're able to have objective discussions at work. As you wish I won't engage in discourse with you so you can feel smart.
Cancellation signals are tricky. You of course cannot be sure that the remote end receives it. This turns into the two generals problem.
Yes, you need monitoring for this case. I work on scientific workloads which can completely consume CPU resources. This failure scenario is quite real.
Not all tasks are idempotent, but it sounds like a prudent user should try to design things that way, since your system has “at least once” execution of tasks, as opposed to “at most once.” Despite any marketing claims, “exactly once” is not generally possible.
Good docs on this point are important, as is configurability for cases when “at most once” is preferable.
This seems like a lot of boilerplate to write functions with, to me (context: I created http://github.com/DAGWorks-Inc/hamilton).
Feature-wise, the biggest missing pieces from Graphile Worker for me are (1) a robust management web ui and (2) really strong documentation.
Looking forward to trying it out!
A UI is a common request, something I’ve been considering investing effort into. I don’t think we’ll ever have one in the core package, but probably as a separate package/plugin (even a third party one); we’ve been thinking more about the events and APIs such a system would need and making these available, and adding a plugin system to enable tighter integration.
Could you expand on what’s missing in the documentation? That’s been a focus recently (as you may have noticed with the new expanded docusaurus site linked previously rather than just a README), but documentation can always be improved.
1. Functions which allow you to declaratively sleep until a specific time, automatically rescheduling jobs (https://www.inngest.com/docs/reference/functions/step-sleep-...).
2. Declarative cancellation, which allows you to cancel jobs if the user reschedules their appointment automatically (https://www.inngest.com/docs/guides/cancel-running-functions).
3. General reliability and API access.
Inngest does that for you, but again — disclaimer, I made it and am biased.
https://renegadeotter.com/2023/11/30/job-queues-with-postrgr...
I got a few feature requests for Pueue that were out of scope as they didn't fit Pueue's vision, but they seem to fit Hatchet quite well (e.g. complex scheduling functionality and multi-agent support) :)
One thing I'm missing from your website, however, is an actual view of the interface - what does the actual user interface look like?
Having the possibility to schedule stuff in a smart way is nice and all, but how do you *overlook* it? It's important to get a good overview of how your tasks perform.
Once I'm convinced that this is actually a useful piece of software, I would like to reference you in the Readme of Pueue as an alternative for users that need more powerful scheduling features (or multi-client support) :) Would that be ok for you?
Like I mentioned here [1], we'll expand our comparison section over time. If Pueue's an alternative people are asking about, we'll definitely put it in there.
> Having the possibility to schedule stuff in a smart way is nice and all, but how do you overlook it? It's important to get a good overview of how your tasks perform.
I'm not sure what you mean by this. Perhaps you're referring to this - https://news.ycombinator.com/item?id=39647154 - in which case I'd say: most software is far from perfect. Our scheduling works but has limitations and is being refactored before we advertise it and build it into our other SDKs.
Is there any task queue you are completely happy with?
I use Redis, but it’s only half of the solution.
I'm personally very excited about River and I think it fills an important gap in the Go ecosystem! Also now that sqlc w/ pgx seems to be getting more popular, it's very easy to integrate.
With regards to Nex -- it isn't fully stable and only supports Javascript/Webassembly. It's also extremely new, so I'd be curious to see how things stabilize in the coming year.
My laptop can execute about 400 billion CPU instructions per second on battery.
That's about 10 billion instructions in 25ms.
That's the CPU alone, i.e. not including the GPU, which would increase the total considerably. Also not counting SIMD lanes as separate: the count is bona fide assembly language instructions.
It comes from cores running at ~4GHz, 8 issued instructions per clock, times 12 cores, plus 4 additional "efficiency" cores adding a bit more. People have confirmed by measurement that 8 instructions per clock is achievable (or close to it) in well-optimised code. Average code is more like 2-3 per cycle.
Only for short periods as the CPU is likely to get hot and thermally throttle even with its fan. But when it throttles it'll still exceed 1 billion in 25ms.
For perspective on how far silicon has come, the GPU on my laptop is reported to do about 14 trillion floating-point 32-bit calculations per second.
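For what it's worth, the arithmetic behind those figures (ignoring the efficiency cores):

```python
# The arithmetic behind the figures above (performance cores only).
cores, ghz, instructions_per_clock = 12, 4, 8
per_second = cores * ghz * 1e9 * instructions_per_clock   # ~3.8e11 ("about 400 billion")
per_25_ms = per_second * 0.025                            # ~9.6e9  ("about 10 billion")
print(f"{per_second:.2e} instructions/s, {per_25_ms:.2e} per 25 ms")
```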
Tools like Hatchet are one less dependency for projects already using Postgres: Postgres has become a de-facto database to build against.
Compare that to an application built on top of Postgres and using Celery + Redis/RabbitMQ.
Also, it seems like you are confusing aesthetic with ergonomics. Since forever, software developers have tried to improve on all of "aesthetics" (code/system structure appearance), "ergonomics" (how easy/fast is it to build with) and "performance" (how well it works), and the cycle has been continuous (we introduce extra abstractions, then do away with some when it gets overly complex, and on and on).
All I want is a simple way to specify a tree of jobs to run to do things like checkout a git branch, build it, run the tests, then install the artifacts.
Or push a new static website to some site. Or periodically do something.
My grug brain simply doesn't want to deal with modern way of doing $SHIT. I don't need to manage a million different tasks per hour, so scaling vertically is acceptable to me, and the benefits of scaling horizontally simply don't appear in my use cases.
Having http targets means you get things like rate limiting, middleware, and observability that your regular application uses, and you aren’t tied to whatever backend the task system supports.
Set up a separate scaling group and away you go.
Fast, easy, well, cheap is not a quality measure, but it sure is a way to build more useless abstractions. You tell me which abstractions have made your software twice as effective.
[0] https://github.com/wakatime/wakaq
I don’t think this is out of place
Plus some adjacent discussion on GitHub: https://github.com/prometheus/client_python/issues/902
Hope that helps!
You say Celery can use Redis or RabbitMQ as a backend, but I've also used it with Postgres as a broker successfully, although on a smaller scale (just a single DB node). It's undocumented, so I definitely won't recommend anybody use this in production now, but it seems to still work fine. [1]
How does Hatchet compare to this setup? Also, have you considered making a plugin backend for Celery, so that old systems can be ported more easily?
[1] https://www.temporal.io/replay/videos/zero-downtime-deploys-...
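For reference, the undocumented Celery-on-Postgres setup mentioned above is wired up roughly like this, assuming kombu's SQLAlchemy transport (the `sqla+` broker prefix) - worth verifying against your kombu/Celery versions:

```python
# Celery with Postgres as both broker and result backend, via SQLAlchemy.
# Undocumented/experimental, as the comment above notes.
from celery import Celery

app = Celery(
    "tasks",
    broker="sqla+postgresql://user:password@localhost/appdb",   # queue tables in PG
    backend="db+postgresql://user:password@localhost/appdb",    # results in PG too
)

@app.task
def add(x, y):
    return x + y
```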
I really try to suggest people skip Node and learn a proper backend language with a solid framework with a proven architecture.
Here are the most heavily upvoted in the past 12 months:
Hatchet https://news.ycombinator.com/item?id=39643136
Inngest https://news.ycombinator.com/item?id=36403014
Windmill https://news.ycombinator.com/item?id=35920082
HN comments on Temporal.io https://github.com/temporalio https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
Internally we rant about the complexity of the above projects vs. using transactional job queue libs like:
river https://news.ycombinator.com/item?id=38349716
neoq: https://github.com/acaloiaro/neoq
gue: https://github.com/vgarvardt/gue
Deep inside, I can't wait to see someone like ThePrimeTimeagen review it ;) https://www.youtube.com/@ThePrimeTimeagen
The parent comment may have been referring to the fact that NATS has support for durable (and replicated) work queue streams, so those could be used directly for queuing tasks and having a set of workers dequeuing concurrently. And this is regardless of whether you want to use Nex or not. Nex is indeed fairly new, but the team is iterating on it quickly and we are dog-fooding it internally to keep stabilizing it.
The other benefit of NATS is the built-in multi-tenancy, which allows distinct applications/teams/contexts to have an isolated set of streams and messaging. It acts as a secure namespace.
NATS supports clustering within a region or across regions. For example, Synadia hosts a supercluster in many different regions across the globe and across the three major cloud providers. As it applies to distributed work queues, you can place work queue streams in a cluster within a region/provider closest to the users/apps enqueuing the work, and then deploy workers in the same region for optimizing latency of dequeuing and processing.
Could be worth a deeper look on how much you could leverage for this use case.
The license is more permissive than ours (MIT vs AGPLv3), and you're using Go vs Rust for us, but other than that the architecture looks extremely similar: also based mostly on Postgres, with the same insight as us - that it's sufficient. I'm curious where you see the main differentiator long-term.
What makes abstractions more versatile has more to do with their composability and the expressiveness of those compositions.
An abstraction that attempts to (apparently) reduce complexity without also being composable is overall less versatile. Usually, something that does one thing well is designed to also be as simple as possible. Otherwise you are increasing the overall complexity (and reducing reliability, or making it fragile instead of anti-fragile) for very little gain.
Yeah, faith will be your last resort when the resulting tower of babel fails in hitherto unknown to man modes.
We did it in like 5 minutes by adding in otel traces? And maybe another 15 to add their grafana dashboard?
What obstacles did you experience here?
We’ve had a lot of pain with celery and Redis over the years and Hatchet seems to be a pretty compelling alternative. I’d want to see the codebase stabilize a bit before seriously considering it though. And frankly I don’t see a viable path to real commercialization for them so I’d only consider it if everything you needed really was MIT licensed.
Windmill is super interesting but I view it as the next evolution of something like Zapier. Having a large corpus of templates and integrations is the power of that type of product. I understand that under the hood it is a similar paradigm, but the market positioning is rightfully night and day. And I also do see a path to real commercialization of the Windmill product because of the above.
Not so much talking about the original post, I think it’s awesome what they are building, and clearly they have learned by observing other things.
I understand our positioning is not clear on our landing page (and we are working on it), but my read of Hatchet is that what they put forward is mostly a durable execution engine for arbitrary code in Python/TypeScript on a fleet of managed workers, which is exactly what Windmill is. We are profitable and probably wouldn't be if we were MIT licensed with no enterprise features.
From reading their documentation, the implementation is extremely similar: you define workflows as code ahead of time, and the engine makes sure they progress reliably on your fleet of workers (one of our customers has 600 workers deployed on edge environments). There are a few minor differences: we implement the workers as a generic Rust binary that pulls the workflows, so you never have to redeploy them to test and deploy new workflows, whereas they have developed SDKs for each language so you can define your own deployable workers (which is more similar to Inngest/Temporal). Also, we use polling and REST instead of gRPC for communication between workers and servers.
Like I mention in that comment, we'd like to keep our repository 100% MIT licensed. I realize this is unpopular among open source startups - and I'm sure there are good reasons for that. We've considered these reasons and still landed on the MIT license.
> I'm curious how you're building a money making business around an open source product.
We'd like to make money off of our cloud version. See the comment on pricing here - https://news.ycombinator.com/item?id=39653084 - which also links to other comments about pricing, sorry about that.
We'll be posting updates and announcements in the Discord - and the Github in our releases - I'd expect that we document this pattern pretty soon.
> full transactional enqueueing
Do you mean transactional within the same transaction as the application's own state?
My guess is no (from looking at the docs, where enqueuing in the SDK looks a lot like a wire call and not issuing a SQL command over our application's existing connection pool), and that you mean transactionality between steps within the Hatchet jobs...
I get that, but fwiw transactionality of "perform business logic against entities + job enqueue" (both for queuing the job itself, as well as work performed by workers) is the primary reason we're using a PG-based job queue, as then we avoid transactional outboxes for each queue/work step.
So, dunno, losing that would be a big deal/kinda defeat the purpose (for us) of a PG-based queue.
2nd question, not to be a downer, but I'm just genuinely curious as a wanna-be dev infra/tooling engineer: a) why take funding to build this (it seems bootstrappable? maybe that's naive), and b) why would YC keep putting money into these "look really neat but ...surely?... will never be the 100x returns/billion dollar companies" dev infra startups? Or maybe I'm over-estimating the size of the return/exit necessary to make it worth their while.
Yes, this sounded broken to us too - we were aware of the promise of consolidation with an OpenTelemetry and Grafana stack, but we couldn't make this transition happen cleanly, and when you're already relying on certain tools for your API it makes the transition more difficult. There's also upskilling involved in getting engineers on the team to adjust to OTel when they're used to more intuitive tools like Sentry and Mezmo.
A good set of default metrics, better search, and views for worker performance and pools - that would have gone a long way. The extent of the Temporal UI's features is a basic recent-workflows list, an expanded workflow view with stack traces for thrown errors, a schedules page, and a settings page.
Put in a DAU/MAU/volume/revenue clause that pertains specifically only to hyperscalers and resellers. Don't listen to the naysayers telling you not to do it. This isn't their company or their future. They don't care if you lose your business or that you put in all of that work just for a tech giant to absorb it for free and turn it against you.
Just do it. Do it now and you won't get (astroturfed?) flak for that decision later from people who don't even have skin in the game. It's not a big deal. I would buy open core products with these protections -- it's not me you're protecting yourselves against, and I'm nowhere in the blast radius. You're trying not to die in the miasma of monolithic cloud vendors.
> Do you mean transactional within the same transaction as the application's own state? My guess is no (from looking at the docs, where enqueuing in the SDK looks a lot like a wire call and not issuing a SQL command over our application's existing connection pool), and that you mean transactionality between steps within the Hatchet jobs...
Yeah, it's the latter, though we'd actually like to support the former in the longer term. There's no technical reason we can't write the workflow/task and read from the same table that you're enqueueing with in the same transaction as your application. That's the really exciting thing about the RiverQueue implementation, though it also illustrates how difficult it is to support every PG driver in an elegant way.
Transactional enqueueing is important for a whole bunch of other reasons though - like assigning workers, maintaining dependencies between tasks, implementing timeouts.
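A sketch of the pattern being asked for here - the business write and the job enqueue commit or roll back together, so no transactional outbox is needed (table names are illustrative, and this is not Hatchet's current API):

```python
# Enqueue a job in the same transaction as the application's own state change.
import json
import psycopg2

def create_order_and_enqueue(conn, order):
    with conn:  # single transaction: both inserts commit together or not at all
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (customer_id, total) VALUES (%s, %s) RETURNING id;",
                (order["customer_id"], order["total"]),
            )
            order_id = cur.fetchone()[0]
            cur.execute(
                "INSERT INTO tasks (status, payload) VALUES ('queued', %s);",
                (json.dumps({"workflow": "process_order", "order_id": order_id}),),
            )
    return order_id

conn = psycopg2.connect("dbname=app")
create_order_and_enqueue(conn, {"customer_id": 42, "total": 99.50})
```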
> why take funding to build this (it seems bootstrappable? maybe that's naive)
The thesis is that we can help some users offload their tasks infra with a hosted version, and hosted infra is hard to bootstrap.
> why would YC keeping putting money into these "look really neat but ...surely?... will never be the 100x returns/billion dollar companies" dev infra startups?
I think Cloudflare is an interesting example here. You could probably make similar arguments against a spam protection proxy, which was the initial service. But a lot of the core infrastructure needed for that service branches into a lot of other products, like a CDN or caching layer, or a more compelling, full-featured product like a WAF. I can't speak for YC or the other dev infra startups, but I imagine that's part of the thesis.
We still need to do some work on this feature though, we'll make sure to document it when it's well-supported.
> hosted infra is hard to bootstrap.
Ah yeah, that definitely makes sense...
> a lot of the core infrastructure needed for that service branches into a lot of other products
Ah, I think I see what you mean--the goal isn't to be "just a job queue" in 2-5 years, it's to grow into a wider platform/ecosystem/etc.
Ngl I go back/forth between rooting for dev-founded VC companies like yourself, or Benjie, the guy behind graphile-worker, who is tip-toeing into being commercially-supported.
Like I want both to win (paydays all around! :-D), but the VC money just gives such a huge edge, of establishing a very high bar of polish / UX / docs / devrel / marketing, basically using loss-leading VC money for a bet that may/may not work out, that it's very hard for the independents to compete. I have honestly been hoping post-ZIRP would swing some advantage back to the Benjies of the world, but looks like no/not yet.
...I say all of above ^ while myself working for a VC-backed prop-tech company...so, kinda the pot calling the kettle black. :-D
Good luck! The fairness/priority queues of Hatchet definitely seem novel, at least from what I'm used to, so will keep it bookmarked/in the tool chest.
Still, it seems like NATS + any lambda implementation + a dumb service that wakes lambdas when they need to process something, would be simple to set up and in combination do the same thing.
That just means that there's a lightweight worker that does the HTTP POST to your "subscriber". With retries etc, just like it's done here.
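Something like this, in other words - a rough sketch of a lightweight push worker (the `callback_url` field and job shape are made up):

```python
# Push model sketch: POST the job to the subscriber's endpoint, retrying with
# backoff on failure; requeue or dead-letter if all attempts fail.
import time
import requests

def push_with_retries(job: dict, max_attempts: int = 5) -> bool:
    for attempt in range(max_attempts):
        try:
            resp = requests.post(job["callback_url"], json=job["payload"], timeout=30)
            if 200 <= resp.status_code < 300:
                return True          # subscriber accepted the job
        except requests.RequestException:
            pass                     # network error: fall through to retry
        time.sleep(2 ** attempt)     # exponential backoff before retrying
    return False                     # give up; requeue or dead-letter the job

job = {"callback_url": "https://example.com/tasks/handle", "payload": {"id": 123}}
push_with_retries(job)
```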
Quoth the manual: "The NOTIFY command sends a notification event together with an optional “payload” string to each client application that has previously executed LISTEN channel for the specified channel name in the current database. Notifications are visible to all users."
You simply define a task using our API and we take care of pushing it to any HTTP endpoint, holding the connection open and using the HTTP status code to determine success/failure, whether or not we should retry, etc.
Happy to answer any questions here or over email james@mergent.co