
15 points jbryu | 1 comment | source

I’m hosting a turn-based multiplayer browser game on a single Hetzner CCX23 x86 cloud server (4 vCPU, 16GB RAM, 80GB disk). The backend is built with Node.js and Socket.IO and runs under Docker Swarm. I also use Traefik for load balancing.

Matchmaking uses a round-robin sharding approach: each room is always handled by the same backend instance, letting me keep game state in memory and scale horizontally without Redis.
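The room-to-instance assignment described above can be sketched roughly like this. This is a hypothetical illustration, not the author's actual code; `INSTANCES`, `roomOwner`, and `assignRoom` are assumed names.

```javascript
// Minimal sketch of round-robin room sharding: each room is pinned to
// one backend instance, so its game state can live in that process's memory.
const INSTANCES = 4;          // assumed number of backend replicas in the swarm
let nextInstance = 0;
const roomOwner = new Map();  // roomId -> instance index

function assignRoom(roomId) {
  if (!roomOwner.has(roomId)) {
    roomOwner.set(roomId, nextInstance);
    nextInstance = (nextInstance + 1) % INSTANCES;
  }
  // Every player joining this room is routed to the same instance.
  return roomOwner.get(roomId);
}
```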

Here’s the issue: at ~500 concurrent players across ~60 rooms (max 8 players/room), I see low CPU usage but high event loop lag. One feature in my game is typing during a player's turn: each throttled keystroke is broadcast to the other players in real time. If I remove this logic, I can handle 1000+ players without issue.
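For context, a throttled keystroke broadcast like the one described might look like the sketch below. The throttle interval, the `'keystroke'`/`'typing'` event names, and the `roomId` variable are assumptions; Socket.IO's `volatile` flag (which allows dropping messages rather than queueing them) is a real server-side feature that can help with high-frequency, loss-tolerant events like typing indicators.

```javascript
// A minimal leading-edge throttle: at most one call per intervalMs.
function throttle(fn, intervalMs) {
  let last = 0;
  return (...args) => {
    const now = Date.now();
    if (now - last >= intervalMs) {
      last = now;
      fn(...args);
    }
  };
}

// Hypothetical usage on the server (names assumed, 50ms interval assumed):
// socket.on('keystroke', throttle((text) => {
//   // volatile = this packet may be dropped instead of buffered
//   socket.to(roomId).volatile.emit('typing', text);
// }, 50));
```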

Scaling out backend instances on my single server doesn't help. I expected spreading the load across more instances to help, but I still hit the same limit around 500 players. This suggests to me that the bottleneck isn’t CPU or app logic, but something deeper in the stack. But I’m not sure what.

Some server metrics at 500 players:

- CPU: 25% per core (according to htop)

- PPS: ~3000 in / ~3000 out

- Bandwidth: ~100KBps in / ~800KBps out

Could 500 concurrent players just be a realistic upper bound for my single-server setup, or is something misconfigured? I know scaling out with new servers should fix the issue, but I wanted to check in with the internet first to see if I'm missing anything. I’m new to multiplayer architecture so any insight would be greatly appreciated.

1. snowman_lars (No.44435960)
You say you use Docker via Docker Swarm, so maybe this is about the Docker network setup?

I haven't tried Swarm, but I assume it can to some degree show the same effects as Docker Compose with several services. I'm also less sure of the effects if you never have communication between containers, but I suspect the same or a similar issue may still apply.

What I experienced (not exactly a load test, but processing a large dataset through multiple Docker containers started from a docker compose config) was that the default Docker bridge network (docker0) was saturated. After creating a user-defined Docker network and configuring the various containers to use it, things got a lot better.

So this is the question for you: do all the containers in the swarm talk via docker0? If yes, read up on Docker networks in relation to Swarm in particular.
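A rough way to check this and attach the service to a dedicated network, sketched under the assumption of a service named `mygame_backend` (the real service name will differ):

```shell
# List networks and inspect which ones the service's tasks actually use.
docker network ls
docker service inspect mygame_backend \
  --format '{{json .Spec.TaskTemplate.Networks}}'

# Swarm services attached to a user-defined overlay network bypass docker0.
docker network create --driver overlay game-net
docker service update --network-add game-net mygame_backend
```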