Turned out that Node.js didn't gracefully close TCP connections. It just silently dropped the connection and sent a RST packet if the other side tried to reuse it. Fun times.
I won't name the product because it's not its fault, but we had an HA cluster of 3 instances of it set up. Users reported that the first login of the day would fail, but only for the first person to come into the office. You'd hit the login button, it would take 30 seconds to return an invalid login, and then you'd try logging in again and it would work fine for the rest of the day.
Turns out IT had a "passive" firewall (traffic inspection and blocking, but no NAT) in place between the nodes. The nodes established long-running TCP connections between them for synchronization. The firewall internally kept a table of known established connections and eventually dropped them if they were idle. The product had turned on TCP keepalive, but the Linux default keepalive interval is longer than the firewall's timeout. When the firewall dropped the connection from the table it didn't spit out RST packets to anyone, it just silently stopped letting traffic flow.
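For anyone curious about the mechanics: enabling SO_KEEPALIVE alone isn't enough, because the Linux defaults (first probe after 7200 seconds of idle time, i.e. `net.ipv4.tcp_keepalive_time`) are far longer than most firewall idle timeouts. A sketch of the per-socket fix, using the Linux-specific TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT options (the timeout values here are illustrative, not from the actual product):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Turn on keepalive probes for this socket. With only this set,
# Linux waits 7200 s (2 hours) of idleness before the first probe --
# a firewall that silently drops flows idle for ~30 min never sees one.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Override the defaults per-socket so probes fire well inside the
# firewall's idle window (Linux-specific option names):
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)  # first probe after 10 min idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # then re-probe every 60 s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # give up after 5 unanswered probes
```

The same thing can be done system-wide via sysctl (`net.ipv4.tcp_keepalive_time` etc.), which is often the only option when you can't change the application's socket code.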
When the first user of the day tried to log in, all three HA nodes believed their TCP connections were still alive and happy (since they had no reason to think otherwise) and had to wait for those connections to time out before tearing them down and re-establishing them. That was a fun one to figure out...