
73 points by ajhool | 1 comment

It's common to see here that Postgres hosted in RDS can handle 99% of workloads up to millions of users. I'm building an IoT app with a plan to ingest the IoT traffic into DynamoDB partitioned on user id (I'm quite familiar with the tradeoffs) and keep everything else in Postgres. A few services, but not microservices (basically: core service, identity service, IoT data service, notification service). We'd be ingesting and monitoring about 1,000,000 IoT devices daily (1 packet per device per day) for about 1,000,000 users, with only around 5,000 active users per day (basically we monitor user IoT devices 24/7, but only some 5,000 users will have anomalous results and log in).
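To make the ingest side concrete, here's a minimal sketch of what a DynamoDB write partitioned on user id could look like with boto3 (table name, key names and payload shape are hypothetical, not a finalised design):

    import time
    import boto3

    # Hypothetical table: partition key "user_id", sort key "ts" (epoch seconds).
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("iot_readings")  # assumed table name

    def ingest_packet(user_id: str, device_id: str, payload: dict) -> None:
        # At ~1 packet per device per day this write path is trivially cheap.
        # Note: boto3 requires Decimal (not float) for numeric payload values.
        table.put_item(
            Item={
                "user_id": user_id,          # partition key
                "ts": int(time.time()),      # sort key
                "device_id": device_id,
                "payload": payload,
            }
        )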

In the database posts & discussions here I sometimes find that the opinions are strong but the numbers are missing. Obviously applications vary widely in traffic and query complexity, so apples-to-apples comparisons are hard. Still, I would greatly benefit from hearing some real-world experiences with numbers.

Some rough, ballpark database questions for current or prior applications:

1. How many customers do you have?

2. What's your expected daily traffic? Peak traffic?

3. What database engine or engines do you use?

4. How many rows or how much storage does your db have?

5. What else about your application is relevant for database load?

6. Microservices, services, or a monolith? Happy with it?

1. iot_throwaway | No.43368319
I’m not going to pile on with “these are bad questions”.

We serve a few hundred concurrently online users of our web dashboard, which visualises data from our ~100k online IoT device fleet.

With respect to our time series systems:

We store around 1M fields (kv pairs) per second across around 10k rows per second, i.e. almost a billion rows per day and almost a hundred billion new data points per day.
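For anyone sanity-checking those figures, it's just the per-second rates multiplied out over a day:

    SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

    rows_per_second = 10_000
    fields_per_second = 1_000_000

    rows_per_day = rows_per_second * SECONDS_PER_DAY      # 864,000,000 (~1 billion)
    fields_per_day = fields_per_second * SECONDS_PER_DAY  # 86,400,000,000 (~100 billion)

    print(f"{rows_per_day:,} rows/day, {fields_per_day:,} fields/day")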

We compress this down to around 1TB of data per day using ClickHouse, and we keep the last 30 days of data available for "online analysis".
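A minimal sketch of what that kind of ClickHouse table could look like (the schema, codecs and table name are assumptions for illustration, not our actual DDL), using the clickhouse-connect client:

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")  # assumed host

    # Hypothetical narrow schema: delta/ZSTD-style codecs do the heavy
    # compression, and the TTL drops data older than 30 days.
    client.command("""
        CREATE TABLE IF NOT EXISTS iot_metrics (
            device_id  UInt64,
            ts         DateTime        CODEC(DoubleDelta, ZSTD),
            field_name LowCardinality(String),
            value      Float64         CODEC(Gorilla, ZSTD)
        )
        ENGINE = MergeTree
        PARTITION BY toYYYYMMDD(ts)
        ORDER BY (device_id, field_name, ts)
        TTL ts + INTERVAL 30 DAY
    """)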

We also store all the data we’ve ever ingested in BigQuery for M/L purposes. I shudder to think of how much this costs.

Our queries are all pretty simple and fast, just scanning and aggregating with filtering, all under 100ms.
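To give a flavour, the queries are roughly of this shape (illustrative only; the table and column names match the hypothetical schema above, not our real one):

    # A typical dashboard query: filter to one device and field,
    # aggregate per minute over the last day.
    result = client.query("""
        SELECT
            toStartOfMinute(ts) AS minute,
            avg(value)          AS avg_value,
            max(value)          AS max_value
        FROM iot_metrics
        WHERE device_id = {device:UInt64}
          AND field_name = 'temperature'
          AND ts >= now() - INTERVAL 1 DAY
        GROUP BY minute
        ORDER BY minute
    """, parameters={"device": 42})

    for minute, avg_value, max_value in result.result_rows:
        print(minute, avg_value, max_value)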

This costs around 30k USD per month on AWS, not including networking which is too hard for me to separate from our non time series workloads.

Our infrastructure is all Python and microservices, using Kafka, and it runs on under 1000 cores for the whole pipeline, excluding databases.
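The writer services follow roughly this pattern (a hedged sketch with kafka-python, assumed topic/broker names, and the same hypothetical ClickHouse table and client as above; the real services also handle retries, backpressure and time-based flushes):

    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "iot-readings",                        # assumed topic name
        bootstrap_servers=["kafka:9092"],      # assumed broker address
        group_id="clickhouse-writer",
        value_deserializer=lambda v: json.loads(v),
    )

    batch = []
    for message in consumer:
        reading = message.value
        batch.append((reading["device_id"], reading["ts"],
                      reading["field_name"], reading["value"]))
        if len(batch) >= 10_000:               # flush in large batches
            client.insert(
                "iot_metrics", batch,
                column_names=["device_id", "ts", "field_name", "value"],
            )
            batch.clear()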

Our databases run on, I believe, low tens of the largest Graviton instances that were available the last time we upgraded.

This is maintained by a backend team of around 15 people, and a total engineering team including hardware, data and software of around 45 people.