
Why AO3 Was Down

(www.reddit.com)
151 points by danso | 3 comments
1. Groxx ◴[] No.44470528[source]
Ha, a site I worked on hit this limit for the "follow relationships" table - we had to build a new compound-key table to migrate to, with triggers to dual read/write, to unbreak everything. All in a few hours of "wtf" -> "oh crap" -> "well, I guess we have to do it right this time" and quick coding.
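For context, a signed 32-bit auto-increment id tops out at 2,147,483,647 rows, which is the limit being described. A minimal sketch of the dual-write idea, using SQLite for portability and made-up table/column names (the real migration would be MySQL DDL, and PT-OSC generates this kind of trigger for you): every write to the old table is mirrored into the new compound-key table by a trigger, so the backfill can take as long as it needs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Old table: single auto-increment int primary key (the one running out).
CREATE TABLE follows (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    follower INT, followee INT
);
-- New table: compound key, no exhaustible id column.
CREATE TABLE follows_v2 (
    follower INT, followee INT,
    PRIMARY KEY (follower, followee)
);
-- Dual-write: any insert into the old table is mirrored into the new one.
CREATE TRIGGER follows_dual_write AFTER INSERT ON follows
BEGIN
    INSERT OR REPLACE INTO follows_v2 (follower, followee)
    VALUES (NEW.follower, NEW.followee);
END;
""")

con.execute("INSERT INTO follows (follower, followee) VALUES (1, 2)")
rows = con.execute("SELECT follower, followee FROM follows_v2").fetchall()
```

A complete setup would also need UPDATE and DELETE triggers plus the backfill of pre-existing rows; this only shows the insert path.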

And then I pulled apart PT-OSC to make it more... less incredibly stupid about resource use, so it wouldn't cause too much load while it backfilled. And let it run for about 6 weeks.

Good luck! It's a fun problem to have - excess success, and a light puzzle to solve :)

replies(1): >>44472420 #
2. bilka ◴[] No.44472420[source]
Do you happen to have those PT-OSC changes around? We've already migrated bookmarks with the downtime (with PT-OSC), but there are more tables that would be nice to get migrated away from int without going into maintenance or shedding a lot of load.
replies(1): >>44472829 #
3. Groxx ◴[] No.44472829[source]
No, it's long, long gone.

When I did it, the script was a bit of a mess of trigger setup, and then a backfill that only monitored replica lag, as if the status of the much less heavily used failover instance were somehow the most important part of a database. Hopefully that's no longer true, and none of this is necessary anymore.

So I essentially split it in half so I could keep only the trigger setup, and carefully read the queries the backfill would perform so I could duplicate them. Then I wrote a very simple loop: select N records, copy them to the new table, and check how long that took; if the batch was outside the target bounds, scale up by min(5%, 100) or scale down by 30%.

Intentionally very polite to the main DB, because once the triggers are in place it really doesn't matter how long it takes. It dropped down to single digits at peak load on some days, so I think that was the correct choice.