Show HN: Hatchet – Open-source distributed task queue

One recurring issue I had in my last position was the need to schedule an unbounded number of jobs, often months to a year out. Example use case: a patient schedules a follow-up appointment in 6 months, so I schedule a series of appointment reminders in the days leading up to it. I might have millions of these jobs.

I started out by just inserting a record into a database queue and polling every few seconds. Functional, but our IO costs for polling weren’t ideal, and we wanted to distribute this without using stuff like ShedLock. I switched to Redis, but it got complicated dealing with multiple dispatchers, OOM issues, having to run a secondary job to move individual tasks in and out of the immediate queue, etc. I had started looking at backing it with PG and SKIP LOCKED, etc., but I’ve since changed positions.

I can see a similar use case on my horizon and wondered whether Hatchet would be suitable for it.

Well, it was a dumbed-down example. In that particular case, appointments can be added, removed, or moved at any moment, so I can’t just run one job every 24 hours to tee up the next day’s work and leave it at that. Simply polling the database for messages that are due to go out gives me my just-in-time queue, but then I need to build out the work to distribute it, and we didn’t like the IO costs.
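
For anyone following along, here is a minimal sketch of that polling approach. It uses Python's built-in sqlite3 purely so it runs anywhere; the table name, column names, and helper functions are all made up for illustration and are not what we actually ran. The point is that adds, moves, and cancels are just row changes, and each poll only picks up what is due right now.

```python
import sqlite3

def open_queue() -> sqlite3.Connection:
    """Create an in-memory queue table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reminders ("
        " id INTEGER PRIMARY KEY,"
        " due_at INTEGER NOT NULL,"          # epoch seconds
        " sent INTEGER NOT NULL DEFAULT 0)"
    )
    # Index so each poll can find unsent, due rows without a full scan.
    conn.execute("CREATE INDEX idx_due ON reminders (sent, due_at)")
    return conn

def schedule(conn, reminder_id, due_at):
    """Add a reminder, or move an existing one to a new time."""
    conn.execute(
        "INSERT OR REPLACE INTO reminders (id, due_at, sent) VALUES (?, ?, 0)",
        (reminder_id, due_at),
    )

def cancel(conn, reminder_id):
    """Remove a reminder, e.g. when the appointment is cancelled."""
    conn.execute("DELETE FROM reminders WHERE id = ?", (reminder_id,))

def poll_due(conn, now, limit=100):
    """Return ids of unsent reminders due at or before `now`, marking them sent."""
    rows = conn.execute(
        "SELECT id FROM reminders WHERE sent = 0 AND due_at <= ?"
        " ORDER BY due_at LIMIT ?",
        (now, limit),
    ).fetchall()
    ids = [row[0] for row in rows]
    conn.executemany("UPDATE reminders SET sent = 1 WHERE id = ?", [(i,) for i in ids])
    conn.commit()
    return ids
```

This is a single poller only. Running several of them against one table needs some form of row claiming or locking on top, which is where the distribution work comes in.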

I did end up moving it to Redis: basically ZADD an execution timestamp and job ID, then ZRANGEBYSCORE at my desired interval and remove those jobs as I successfully distribute them out to workers. I then set a fence time. At that time a job runs to move stuff that should have run but didn’t (rare, thankfully) into a remediation queue, and to load the next block of items that should run between now and now + fence. At the service level, any items with a scheduled date within the fence get ZADDed after being inserted into the normal database. Anything outside the fence will be picked up at the appropriate time.
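
A rough sketch of that fence scheme, to make the moving parts concrete. To keep it runnable without a server, SortedSet below is a tiny in-memory stand-in for the three Redis commands involved (ZADD, ZRANGEBYSCORE, ZREM), and a plain dict stands in for the normal database. FencedScheduler and every method name on it are invented for this example.

```python
class SortedSet:
    """In-memory stand-in for a Redis sorted set (assumption, not a Redis client)."""

    def __init__(self):
        self._scores = {}  # member -> score

    def zadd(self, member, score):
        self._scores[member] = score

    def zrangebyscore(self, lo, hi):
        # Members with lo <= score <= hi, ordered by score then member.
        hits = [(s, m) for m, s in self._scores.items() if lo <= s <= hi]
        return [m for _, m in sorted(hits)]

    def zrem(self, member):
        self._scores.pop(member, None)


class FencedScheduler:
    def __init__(self, db, fence_seconds):
        self.db = db                    # durable store stand-in: job_id -> run_at
        self.fence_seconds = fence_seconds
        self.fence_end = 0              # nothing is loaded until the first roll
        self.zset = SortedSet()
        self.remediation = []           # jobs that should have run but did not

    def schedule(self, job_id, run_at):
        # Always write to the database first; ZADD only if inside the fence.
        self.db[job_id] = run_at
        if run_at <= self.fence_end:
            self.zset.zadd(job_id, run_at)

    def poll(self, now, dispatch):
        for job_id in self.zset.zrangebyscore(0, now):
            dispatch(job_id)
            # Remove only after a successful dispatch, so a crash in between
            # leaves the job in place (at-least-once, duplicates possible).
            self.zset.zrem(job_id)
            self.db.pop(job_id, None)

    def roll_fence(self, now):
        # Anything still in the set and already due missed its window.
        for job_id in self.zset.zrangebyscore(0, now):
            self.remediation.append(job_id)
            self.zset.zrem(job_id)
        # Load the next block: jobs due between now and now + fence.
        self.fence_end = now + self.fence_seconds
        for job_id, run_at in self.db.items():
            if now < run_at <= self.fence_end:
                self.zset.zadd(job_id, run_at)
```

The fence duration is the knob that trades memory for database load: a longer fence means fewer reloads but more entries held in the sorted set at once.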

This worked. I was able to ramp up the polling frequency to get near-real-time dispatch while also noticeably reducing costs. The problems were occasional Redis issues (OOM, and having to either keep bumping up the Redis instance size or reduce the fence duration), allowing multiple pollers for redundancy and scale (I used ShedLock for that :/), and occasionally a bug where the poller craps out in the middle of the Redis work, resulting in an at-least-once SLA, which required downstream protections to make sure I don’t send the same message multiple times to the patient.
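
The downstream protection amounts to an idempotency check on the consumer side. A minimal sketch, assuming the job ID is a good enough dedupe key; ReminderSender is a made-up name, and the in-memory set stands in for what would need to be a durable store (for example a table with a unique constraint on the key).

```python
class ReminderSender:
    """Drops redeliveries of a job that has already been sent (hypothetical helper)."""

    def __init__(self, send):
        self._send = send    # callable that actually delivers the message
        self._seen = set()   # stand-in for a durable record of sent job ids

    def handle(self, job_id):
        """Send once per job_id. Returns True if sent, False if a duplicate."""
        if job_id in self._seen:
            return False
        self._send(job_id)
        # Recording after the send keeps at-least-once semantics: a crash
        # between these two lines can still produce one duplicate. Recording
        # before the send flips that to a possible lost message instead.
        self._seen.add(job_id)
        return True
```

Usage would look like this: the second delivery of the same job is ignored.

```python
sent = []
sender = ReminderSender(sent.append)
sender.handle("job-1")   # delivered
sender.handle("job-1")   # duplicate, skipped
```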

Again, it all works but I’m interested in seeing if there are solutions that I don’t have to hand roll.