
234 points benocodes | 5 comments
whalesalad ◴[] No.41836959[source]
So satisfying to do a huge upgrade like this and then see the actual proof in the pudding with all the reduced latencies and query times.
replies(1): >>41837062 #
hu3 ◴[] No.41837062[source]
Yeah, some numbers caught my attention, like the ~94% reduction in overall database lock time.

And to think they never have to worry about VACUUM. Ahh the peace.

replies(4): >>41837227 #>>41837317 #>>41837626 #>>41838255 #
InsideOutSanta ◴[] No.41837317[source]
As somebody who has always used MySQL but has always been told I should be using Postgres, I'd love to understand what the issues with VACUUM are and what I should be aware of when potentially switching databases.
replies(5): >>41837435 #>>41838160 #>>41838537 #>>41839012 #>>41845775 #
1. mjr00 ◴[] No.41838160{3}[source]
Worth reading up on Postgres' MVCC model for concurrency.[0]

Short version is that VACUUM is needed to clean up dead tuples and reclaim disk space. For most cases with smaller amounts of data, auto-vacuum works totally fine. But I've had issues with frequently updated tables of 100M+ rows where auto-vacuum falls behind and stops working completely. That necessitated a full data dump + restore (because we didn't want to double our storage capacity to do a VACUUM FULL). We fixed it by sharding the table and tweaking auto-vacuum to run more frequently, but this isn't stuff you have to worry about in MySQL.
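To make that concrete: the usual first check is whether dead tuples are piling up and when autovacuum last finished, and the usual first fix is tightening the per-table autovacuum settings. Something along these lines (not exactly what we ran, and the table name is just a placeholder):

    -- dead-tuple counts and last autovacuum, per table
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;

    -- make autovacuum trigger earlier on one hot table (placeholder name)
    ALTER TABLE big_hot_table SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- default 0.2: 20% of rows must be dead first
        autovacuum_vacuum_cost_delay = 0        -- don't throttle vacuums of this table
    );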

Honestly, if you're a small shop without database/Postgres experts and MySQL performance is adequate for you, I wouldn't switch. Newer versions of MySQL have fixed the egregious issues, like silent data truncation on INSERT by default, and it's easier to maintain, in my experience.

[0] https://www.postgresql.org/docs/current/mvcc-intro.html

replies(2): >>41838279 #>>41838746 #
2. InsideOutSanta ◴[] No.41838279[source]
Thanks for that, that's valuable information.
3. williamdclt ◴[] No.41838746[source]
As much as I have gripes with the autovac, I’m surprised at the idea of getting to such a broken state. 100M rows is not small, but it’s not huge either; how frequent is “frequent updates”? How long ago was that (there have been a lot of changes in autovac since v9)?

“Stops working completely” should not be a thing. It could be vacuuming slower than the update frequency (although that’d be surprising), but I don’t know of any reason it’d just stop.

That being said, I’ve also had issues with autovac (on Aurora, to be fair; I couldn’t say if it was Aurora-specific), like it running constantly without vacuuming anything, as if there were an old transaction idling (there wasn’t).
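The way I usually rule out the “old transaction pinning the xmin horizon” theory is something like this (just a sketch; Aurora may have its own wrinkles):

    -- sessions whose snapshot can keep vacuum from removing dead tuples
    SELECT pid, state, xact_start, age(backend_xmin) AS xmin_age, left(query, 60) AS query
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY age(backend_xmin) DESC;

    -- forgotten prepared transactions and stale replication slots can pin it too
    SELECT * FROM pg_prepared_xacts;
    SELECT slot_name, active, xmin FROM pg_replication_slots;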

replies(2): >>41839294 #>>41839304 #
4. mjr00 ◴[] No.41839294[source]
It's been a while, but IIRC it was on pg12. I'm basing "stopped working completely" on the vacuum statistics saying the last auto-vacuum for these tables started weeks ago and never actually finished. "Frequent updates" means regularly rewriting 10 million rows (at various places) throughout the table. I should also mention that there were 100+ materialized views built off this table, which I'm sure had an impact.
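If it helps anyone hitting the same wall: you can also watch a running (auto)vacuum directly via pg_stat_progress_vacuum (available since 9.6, IIRC) rather than just staring at last_autovacuum, which makes "started weeks ago and never finished" much easier to diagnose. Something like:

    -- progress of any vacuum currently running, autovacuum workers included
    SELECT pid, relid::regclass AS table_name, phase,
           heap_blks_scanned, heap_blks_total, index_vacuum_count
    FROM pg_stat_progress_vacuum;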

In any case, this got resolved but caused a huge operational headache, and isn't something that would have been a problem with MySQL. I feel like that's the main reason VACUUM gets hated on; all of the problems with it are solvable, but you only find those problems by running into them, and when you run into them on your production database it ends up somewhere between "pain in the ass" and "total nightmare" to resolve.

5. sgarland ◴[] No.41839304[source]
On decently-sized tables (100,000,000 is, as you say, not small but not huge), if you haven’t tuned cost limiting and/or the various parameters controlling autovacuum workers, it’s entirely possible for autovacuum to effectively do nothing, especially if you’re in the cloud with backing disks that have limited IOPS / throughput.
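Concretely, I mean the cost-based throttling knobs; the defaults (IIRC an effective cost limit of 200, a 2ms delay on modern versions, 3 workers) are sized for much smaller hardware. Illustrative values only, self-hosted syntax shown:

    -- loosen autovacuum throttling (numbers are illustrative, size to your I/O budget)
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;  -- effective default is 200
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms'; -- older versions default to 20ms
    ALTER SYSTEM SET autovacuum_naptime = '15s';            -- default 1min
    ALTER SYSTEM SET autovacuum_max_workers = 6;            -- default 3; needs a restart
    SELECT pg_reload_conf();                                -- the rest apply on reload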

It continues to baffle me why AWS picks some truly terrible defaults for parameter groups. I understand most of them come from the RDBMS defaults, but AWS has the luxury of knowing precisely how many CPUs and how much RAM any given instance has. On any decently-sized instance, it should allocate far more memory for maintenance_work_mem, for example.
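For example, something like this (on RDS/Aurora the same parameter names go in a custom parameter group, since ALTER SYSTEM generally isn’t allowed there; the numbers depend entirely on the instance):

    -- memory for VACUUM / CREATE INDEX etc.; the stock default is only 64MB
    ALTER SYSTEM SET maintenance_work_mem = '2GB';
    -- optional separate cap for autovacuum workers (defaults to maintenance_work_mem)
    ALTER SYSTEM SET autovacuum_work_mem = '1GB';
    SELECT pg_reload_conf();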