←back to thread

43 points enether | 2 comments | | HN request time: 0.578s | source

The space is confusing to say the least.

Message queues are usually a core part of any distributed architecture, and the options are endless: Kafka, RabbitMQ, NATS, Redis Streams, SQS, ZeroMQ... and then there's the “just use Postgres” camp for simpler use cases.

I’m trying to make sense of the tradeoffs between:

- async fire-and-forget pub/sub vs. sync RPC-like point to point communication

- simple FIFO vs. priority queues and delay queues

- intelligent brokers (e.g. RabbitMQ, NATS with filters) vs. minimal brokers (e.g. Kafka’s client-driven model)

There's also a fair amount of ideology/emotional attachment - some folks root for underdogs written in their favorite programming language, others reflexively dismiss anything that's not "enterprise-grade". And of course, vendors are always in the mix trying to steer the conversation toward their own solution.

If you’ve built a production system in the last few years:

1. What queue did you choose?

2. What didn't work out?

3. Where did you regret adding complexity?

4. And if you stuck with a DB-based queue — did it scale?

I’d love to hear war stories, regrets, and opinions.

Show context
Jemaclus ◴[] No.44006992[source]
For large applications in a service-oriented architecture, I leverage Kafka 100% of the time. With Confluent Cloud and Amazon MSK, infra is relatively trivial to maintain. There's really no reason to use anything else for this.

For smaller projects of "job queues," I tend to use Amazon SQS or RabbitMQ.

But just for clarity, Kafka is not really a message queue -- it's a persistent structured log that can be used as a message queue. More specifically, you can replay messages by resetting the offset. In a queue, the idea is once you pop an item off the queue, it's no longer in the queue and therefore is gone once it's consumed, but with Kafka, you're leaving the message where it is and moving an offset instead. This means, for example, that you can have many many clients read from the same topic without issue.

SQS and other MQs don't have that persistence -- once you consume the message and ack, the message disappears and you can't "replay it" via the queue system. You have to re-submit the message to process it. This means you can really only have one client per topic, because once the message is consumed, it's no longer available to anyone else.

There are pros and cons to either mechanism, and there's significant overlap in the usage of the two systems, but they are designed to serve different purposes.

The analogy I tend to use is that Kafka is like reading a book. You read a page, you turn the page. But if you get confused, you can flip back and reread a previous page. An MQ like RabbitMQ or Sidekiq is more like the line at the grocery store: once the customer pays, they walk out and they're gone. You can't go back and re-process their cart.

Again, pros and cons to both approaches.

"What didn't work out?" -- I've learned in my career that, in general, I really like replayability, so Kafka is typically my first choice, unless I know that re-creating the messages are trivial, in which case I am more inclined to lean toward RabbitMQ or SQS. I've been bitten several times by MQs where I can't easily recreate the queue, and I lose critical messages.

"Where did you regret adding complexity?" -- Again, smaller systems that are just "job queues" (versus service-to-service async communication) don't need a whole lot of complexity. So I've learned that if it's a small system, go with an MQ first (any of them are fine), and go with Kafka only if you start scaling beyond a single simple system.

"And if you stuck with a DB-based queue -- did it scale?" -- I've done this in the past. It scales until it doesn't. Given my experience with MQs and Kafka, I feel it's a trivial amount of work to set up an MQ/Kafka, and I don't get anything extra by using a DB-based queue. I personally would avoid these, unless you have a compelling reason to use it (eg, your DB isn't huge, and you can save money).

replies(2): >>44019302 #>>44019334 #
mlhpdx ◴[] No.44019334[source]
We build applications very differently. SQS queues with 1000s of clients have been a go to for me for over a decade. And the opposite as well — 1000s of queues (one per client device, they’re free). Zero maintenance, zero cost when unused. Absurd scalability.
replies(1): >>44021704 #
1. matt_s ◴[] No.44021704[source]
Hey I'm curious how the consumers of those queues typically consume their data, is it some job that is polling, another piece of tech that helps scale up for bursts of queue traffic, etc. We're using the google equivalent and I'm finding that there are a lot of compute resources being used on both the publisher and subscriber sides. The use cases here I'm talking about are mostly just systems trying to stay in sync with some data where the source system is the source of record and consumers are using it for read-only purposes of some kind.
replies(1): >>44022684 #
2. mlhpdx ◴[] No.44022684[source]
On the producer side I’d expect to see change data capture being directed to a queue fairly efficiently, but perhaps you have some intermediary that’s running between the system of record and the queue? The latter works, but yeah it eats compute.

On the consumer side the duty cycle drives design. If it’s a steady flow then a polling listener is easy to right size. If the flow is episodic (long periods of idle with unpredictable spikes of high load) one option is to put a alarm on the queue that triggers when it goes from empty to non-empty, and handle that alarm by starting the processing machinery. That avoids the cost of constantly polling during dead time.