←back to thread

264 points davidgomes | 2 comments | | HN request time: 0.001s | source
Show context
dwwoelfel ◴[] No.41879888[source]
Here's how we did it at OneGraph (RIP), where we not only upgraded versions without downtime, but we also moved hosting providers from GCP to Aurora without downtime.

1. Set up logical replication to a new database server. We used https://github.com/2ndQuadrant/pglogical, but maybe you don't need that any more with newer versions of postgres?

2. Flip a feature flag that pauses all database queries and wait for the queue of queries to complete.

3. Wait for the query queue to drain and for replication to catch up.

4. Flip a feature flag that switches the connection from the old db to the new db.

5. Flip the flag to resume queries.

It helped that we were written in OCaml. We had to write our own connection pooling, which meant that we had full control over the query queue. Not sure how you would do it with e.g. Java's Hikari, where the query queue and the connection settings are complected.

We also had no long-running queries, with a default timeout of 30 seconds.

It helped to over-provision servers during the migration, because any requests that came in while the migration was ongoing would have to wait for the migration to complete.

replies(1): >>41880045 #
1. droobles ◴[] No.41880045[source]
That is awesome, I dream of being able to do zero downtime SQL migrations.
replies(1): >>41880215 #
2. brentjanderson ◴[] No.41880215[source]
One of the linked pieces in the Neon blog post is from Knock, where we pulled off a practically zero downtime migration: https://knock.app/blog/zero-downtime-postgres-upgrades

In that post we walk through all the steps we took to go from Postgres 11.9 to 15.3.