
73 points ajhool | 4 comments

It's common to see here that Postgres hosted in RDS can handle 99% of workloads up to millions of users. I'm building an IoT app with a plan to ingest the IoT traffic into DynamoDB partitioned on user ID (I'm quite familiar with the tradeoffs) and keep everything else in Postgres; there's a rough sketch of the planned write path after the questions below. A few services, but not microservices (basically: core service, identity service, IoT data service, notification service). We'll be ingesting and monitoring about 1,000,000 IoT devices daily (1 packet per device per day) and about 1,000,000 users, with only some 5,000 active users per day (basically we monitor users' IoT devices 24/7, but only around 5,000 users will have anomalous results and log in).

In the database posts and discussions here, I sometimes find that the opinions are strong but the numbers are missing. Obviously, applications vary widely in traffic and query complexity, so apples-to-apples comparisons are hard. Still, I would greatly benefit from hearing some real-world experiences with numbers.

Some rough, ballpark database questions for current or prior applications:

1. How many customers do you have?

2. What's expected daily traffic? Peak traffic?

3. What database engine or engines do you use?

4. How many rows or how much storage does your db have?

5. What else about your application is relevant for database load?

6. Microservices, services, or a monolith? Happy with it?
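
For concreteness, here is roughly the write path I have in mind for the DynamoDB side. It's only a sketch: the table name, key names, and sort-key scheme are placeholders, and it assumes AWS credentials and a region are already configured for boto3.

    import json
    import time

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("iot_readings")  # placeholder table name

    def ingest_packet(user_id: str, device_id: str, payload: dict) -> None:
        """Store one daily device packet, partitioned on user ID as described above."""
        table.put_item(
            Item={
                "user_id": user_id,                        # partition key
                "sk": f"{device_id}#{int(time.time())}",   # sort key: device + timestamp
                "payload": json.dumps(payload),            # ~1 kB packet, stored as an opaque string
            }
        )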

1. wolfgang42 (No.43368475)
Let's do some math.

First, the IoT devices reporting daily. In the absence of further context, I’m going to assume that it doesn’t matter when the devices report, so they can be configured to spread out their load. I’m also going to assume 1kb of data per device, but with an HTTPS API there’s roughly 7kb of overhead that we need to account for when calculating bandwidth. (Source: http://netsekure.org/2010/03/tls-overhead/ . TLS session resumption gets that down to ~300 bytes, but adds implementation complications.)

    $ units -1v '1M req/day' 'req/sec'
        1M req/day = 11.574074 req/sec
    $ units -1v '1M req/day * 8kbyte/req' 'kbyte/sec' # Incoming bandwidth
        1M req/day * 8kbyte/req = 92.592593 kbyte/sec
    $ units -1v '1M req/day * 1kbyte/req' 'years/TB' # Storage
        reciprocal conversion
        1 / (1M req/day * 1kbyte/req) = 2.7379093 years/TB
It looks like our load here is a whopping 12 RPS, and we could handle the traffic on an ISDN line from the 1990s. Data storage is a little trickier; if we can’t compress or delete old data we may have to stop by Best Buy for a new hard drive every half decade or so.
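
If you'd rather sanity-check the arithmetic without `units`, here are the same numbers in a few lines of Python; the 1 kB payload and 7 kB of HTTPS overhead are the assumptions from above.

    SECONDS_PER_DAY = 86_400

    devices = 1_000_000          # one ~1 kB packet per device per day
    payload_kb = 1
    https_overhead_kb = 7        # rough TLS/HTTP overhead, no session resumption

    ingest_rps = devices / SECONDS_PER_DAY
    inbound_kb_per_s = devices * (payload_kb + https_overhead_kb) / SECONDS_PER_DAY
    storage_gb_per_year = devices * payload_kb * 365 / 1e6   # kB -> GB

    print(f"{ingest_rps:.1f} req/s")            # ~11.6 req/s
    print(f"{inbound_kb_per_s:.1f} kB/s in")    # ~92.6 kB/s
    print(f"{storage_gb_per_year:.0f} GB/yr")   # ~365 GB/yr, i.e. ~2.7 years per TB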

Users can’t be configured to load balance themselves, so we’ll be pessimistic and assume that every single one of them logs in to check their device over their morning coffee. We’ll also assume that every time they do that they want all of the data from the last 3 months, though in practice this could probably be summarized before we send it to them.

    $ units -1v '5000 req/1 hour' 'req/sec'
        5000 req/1 hour = 1.3888889 req/sec
    $ units -1v '5000 req/1 hour * 90 kbyte/req' 'MB/sec' # Outgoing bandwidth
        5000 req/1 hour * 90 kbyte/req = 0.125 MB/sec
For this, we have just under 2 RPS, but our responses are quite a lot bigger so the bandwidth is higher—we probably want to move into the early 2000s and upgrade to a DSL connection. Oh, and we also want to make sure our disk can handle the read load—but conveniently, since we’re just pulling raw data and not asking for any complicated joins, these numbers are actually the same. 2 RPS gives us 500ms per request; since a spinning rust drive pessimistically takes 10ms/seek we only have a ~50-seek budget per request so we probably want to make sure to cluster or partition the table by device to improve data locality. (Doing that increases the write load significantly, though, so maybe we want to upgrade to an SSD or think about a database that’s smart enough to do some kind of periodic bulk rebalancing.)
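
The same quick check for the read side, using the pessimistic assumptions above: every active user pulls ~90 days of 1 kB packets within a single hour, and a spinning disk costs ~10 ms per seek.

    active_users = 5_000
    window_s = 3_600             # everyone checks in over one morning-coffee hour
    history_kb = 90              # ~90 daily 1 kB packets per device

    read_rps = active_users / window_s                                 # ~1.4 req/s, call it 2
    outbound_mb_per_s = active_users * history_kb / window_s / 1_000   # ~0.125 MB/s

    seek_ms = 10                 # pessimistic spinning-disk seek time
    budget_ms = 1_000 / 2        # ~500 ms per request at ~2 req/s
    print(f"{read_rps:.2f} req/s, {outbound_mb_per_s:.3f} MB/s out, "
          f"~{budget_ms / seek_ms:.0f} seeks per request")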

Oh, I almost forgot, we'll also want to make sure we have disk space to keep track of all of those idle logins who aren’t doing anything:

    $ units -1v '1M records * 1kbyte/rec' 'GB'
        1M records * 1kbyte/rec = 1 GB
Modern computers—for very relative definitions of modern—are fast, if they don’t get bogged down. Based on my numbers, you could probably run your system from a Macbook in your desk drawer; but it would be trivial to add a requirement which would multiply any of these numbers by several orders of magnitude. The problem with trying to compare architectures is that you have to really understand what your needs are, and not only are they going to be difficult to compare to someone else’s setup (small differences can add up fast when multiplied across requests) but it’s also hard to tell how much effort the other system put into optimizing (or, for that matter, how much effort you want to put into optimizing).
replies(1): >>43368539 #
2. chatmasta (No.43368539)
This is a nice analysis of incoming and outgoing bandwidth, but totally neglects to analyze the compute power required by each query. You can’t answer that without knowing the possible queries that users can send. Do they have a dashboard with a known set of queries? Do they have a GraphQL API that can produce pathological queries? Can they send SQL directly? Are they running aggregations? Do they use filters? Are they querying recent data or all historical data?

And that’s just the read side. You also need to ask about ingestion and transformation. Is the database storing raw events and nothing else? (Is the user happy with that?) Is it append only? Is it being rolled up and summarized into daily partitions? How many transactions per second? How many of those are INSERT vs. UPDATE or DELETE? Which rows are being updated? Only recent or any of them? All of them?

etc…
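
To make just one of those ingestion questions concrete, "rolled up and summarized into daily partitions" might look something like this in miniature; the summary fields here are invented purely for illustration.

    from collections import defaultdict
    from datetime import datetime, timezone

    def daily_rollup(events):
        """Collapse raw (device_id, unix_ts, value) events into one summary row per device per day."""
        summary = defaultdict(lambda: {"count": 0, "min": float("inf"),
                                       "max": float("-inf"), "total": 0.0})
        for device_id, ts, value in events:
            day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
            row = summary[(device_id, day)]
            row["count"] += 1
            row["min"] = min(row["min"], value)
            row["max"] = max(row["max"], value)
            row["total"] += value
        return summary  # what reads hit instead of the raw event stream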

There is no generic answer to this question, and “requests per second” is a reductive and insufficient interpretation of the problem that won’t identify any of the hidden complexity.

replies(1): >>43368899 #
3. wolfgang42 (No.43368899)
The OP doesn’t give any hints as to what users are doing, so, as I said, I just assumed “last 90 days of data” was the only query they’d make. But it’s a fair point that this caused me to leave out CPU power, since it’s basically negligible when you’re just shoveling data off of a disk.

Thanks for expanding on my point about how important it is to actually have detailed requirements when doing performance calculations!

replies(1): >>43368907 #
4. chatmasta (No.43368907)
Yes indeed, OP only asked “how much traffic is required,” which could be zero if his database is locked while trying to respond to some absurd query from one of his users :)