
15 points by jbryu | 1 comment

I’m hosting a turn-based multiplayer browser game on a single Hetzner CCX23 x86 cloud server (4 vCPU, 16GB RAM, 80GB disk). The backend is built with Node.js and Socket.IO and runs via Docker Swarm, and I also use Traefik for load balancing.

Matchmaking uses a round-robin sharding approach: each room is always handled by the same backend instance, letting me keep game state in memory and scale horizontally without Redis.
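A minimal sketch of what round-robin room sharding like this could look like (hypothetical helper, not the poster's actual code): each new room is pinned to the next backend instance in rotation, and the mapping is kept so every later event for that room hits the same instance.

```javascript
// Hypothetical sketch: pin each room to one backend instance, round-robin.
function makeMatchmaker(numInstances) {
  let next = 0;
  const roomToInstance = new Map(); // roomId -> instance index

  return (roomId) => {
    if (!roomToInstance.has(roomId)) {
      roomToInstance.set(roomId, next);
      next = (next + 1) % numInstances; // rotate over instances
    }
    return roomToInstance.get(roomId); // sticky thereafter
  };
}
```

Because the mapping is sticky, in-memory game state never needs to move between instances, which is what lets this design skip Redis.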

Here’s the issue: at ~500 concurrent players across ~60 rooms (max 8 players/room), I see low CPU usage but high event loop lag. One feature in my game is typing during a player's turn: each throttled keystroke is broadcast to the other players in real time. If I remove this logic, I can handle 1000+ players without issue.
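For reference, a throttled keystroke broadcast like the one described could look something like this (a sketch, not the poster's code; the `emit` callback stands in for something like `socket.to(roomId).emit('typing', payload)`):

```javascript
// Leading-edge throttle: forward at most one keystroke per intervalMs.
function makeThrottle(intervalMs, emit) {
  let last = 0;
  return (payload) => {
    const now = Date.now();
    if (now - last >= intervalMs) {
      last = now;
      emit(payload); // e.g. socket.to(roomId).emit('typing', payload)
    }
  };
}
```

Note that even with per-sender throttling, each forwarded keystroke still fans out to every other player in the room as a separate packet.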

Scaling out backend instances on my single server doesn't help. I expected less load per backend instance to help, but I still hit the same limit around 500 players. This suggests the bottleneck isn't CPU or app logic but something deeper in the stack, though I'm not sure what.

Some server metrics at 500 players:

- CPU: 25% per core (according to htop)

- PPS: ~3000 in / ~3000 out

- Bandwidth: ~100KBps in / ~800KBps out

Could 500 concurrent players just be a realistic upper bound for my single-server setup, or is something misconfigured? I know scaling out with new servers should fix the issue, but I wanted to check in with the internet first to see if I'm missing anything. I’m new to multiplayer architecture so any insight would be greatly appreciated.

pvg ◴[] No.44389469[source]
It sounds like you want to coalesce the outbound updates, otherwise everyone typing is accidentally quadratic.
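A minimal sketch of the coalescing idea (hypothetical helper, not from the thread): keep only the latest typing payload per room and flush once per tick, so each keystroke no longer becomes its own outbound packet per listener.

```javascript
// Coalesce outbound updates: buffer the latest payload per room,
// flush once per tick instead of once per keystroke.
function makeCoalescer(flush) {
  const pending = new Map(); // roomId -> latest payload

  const push = (roomId, payload) => pending.set(roomId, payload);

  const tick = () => {
    for (const [roomId, payload] of pending) flush(roomId, payload);
    pending.clear();
  };

  return { push, tick };
}

// usage sketch: setInterval(coalescer.tick, 50), with flush calling
// something like io.to(roomId).emit('typing', payload)
```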
replies(1): >>44389558 #
jbryu ◴[] No.44389558[source]
I thought this might've been the issue too, but because the game is turn-based there should only ever be 1 person typing at once (in a given room).
replies(2): >>44390462 #>>44392350 #
brudgers ◴[] No.44390462[source]
there should only ever be 1 person typing at once (in a given room)

Have you verified that is the case?

replies(1): >>44391104 #
jbryu ◴[] No.44391104{3}[source]
Yep, just triple checked. If distributing the load on a single server across more backend containers doesn't decrease ping, then maybe this is just the natural upper bound for my particular game... The only shared bottleneck between all the backend containers I can think of right now is at the OS or network interface layer, but things still lag even after I tried increasing these OS networking limits:

  net.core.wmem_max = 16777216
  net.core.rmem_max = 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216
  net.ipv4.tcp_rmem = 4096 87380 16777216
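(For anyone following along: values like these can be applied at runtime with sysctl as root, or persisted under /etc/sysctl.d/ so they survive a reboot.)

```shell
# Apply at runtime (requires root); matches the values above
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_max=16777216
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"

# Or persist in a drop-in file and reload everything
# echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.d/99-tuning.conf
# sysctl --system
```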

Perhaps the reality for low-latency multiplayer games is to embrace horizontal scaling rather than vertical scaling? Not sure.
replies(2): >>44391631 #>>44392112 #
codingdave ◴[] No.44391631{4}[source]
Networking bottlenecks are not always on your box - they could be on the router your box is talking to. Or, depending on load, the ethernet packets themselves could be crowding the physical subnet. Do you have a way to mock 500 users playing the game that would truly keep all the traffic internal to your OS? Because if that works, but the lag persists for real players, the problem is external to your OS.
replies(1): >>44391883 #
jbryu ◴[] No.44391883{5}[source]
Good point. I actually don't know what performance looks like with 500 real users. The way I'm mocking right now is by running a script on my local machine that generates 500+ bots, which listen to events to auto-join and play games. I tried to implement the bots to behave as closely to humans as possible. I'm not sure if this is what you mean by keeping traffic internal to my box's OS, but right now this approach creates lag. I also hadn't considered whether spinning up hundreds of WebSocket connections from a single source (my local machine) has implications for the load test itself.
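One thing that can distort a single-machine load test is opening all the connections at once: the load generator itself can hit ephemeral-port and CPU limits. A ramp-up schedule (hypothetical helper; `connectBot` is a placeholder for whatever actually opens a socket.io-client connection) makes that easier to spot:

```javascript
// Hypothetical: stagger bot connections instead of opening 500 at once,
// so load-generator limits show up as a gradual knee rather than chaos.
function rampSchedule(totalBots, botsPerSecond) {
  const delays = [];
  for (let i = 0; i < totalBots; i++) {
    // delay (ms) before bot i connects
    delays.push(Math.floor(i / botsPerSecond) * 1000);
  }
  return delays;
}

// usage sketch:
// rampSchedule(500, 25).forEach((d, i) => setTimeout(() => connectBot(i), d));
```

If lag only appears past a certain connection count regardless of ramp speed, that points at the server; if it tracks how fast bots connect, the generator machine is suspect.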