Upgrading Uber's MySQL Fleet

1. remon ◴[14 Oct 24 14:48 UTC] No.41838059[source]▶

Impressive numbers at a glance, but that boils down to ~140 qps per node, which is one to two orders of magnitude below what you'd expect a typical MySQL node to serve. Obviously average execution time is mostly a function of the complexity of the query, but based on Uber's business I can't really see what sort of non-normative queries they'd run at volume (e.g. for their customer-facing apps). Uber's infra runs on Amazon AWS afaik, and even taking some level of volume discount into account they're burning many millions of USD on some combination of overcapacity and suboptimal querying/caching strategies.

replies(5): >>41838139 #>>41838199 #>>41838202 #>>41839045 #>>41839409 #

2. Jgrubb ◴[14 Oct 24 14:57 UTC] No.41838139[source]▶

>>41838059 (TP) #

See, the problem is that the people who care about cost performance and the people who care about UX performance are rarely the same people, and often neither side is empowered with the data or experience they need to bridge the gap.

replies(1): >>41839066 #

3. nunez ◴[14 Oct 24 15:03 UTC] No.41838199[source]▶

>>41838059 (TP) #

Didn't realize their entire MySQL data layer runs in AWS. Given that they went with basically a blue-green update strategy, this was essentially a "witness our cloud spend" kind of post.

replies(1): >>41838349 #

4. ◴[14 Oct 24 15:03 UTC] No.41838202[source]▶

>>41838059 (TP) #

5. pocket_cheese ◴[14 Oct 24 15:20 UTC] No.41838349[source]▶

>>41838199 #

They're not. Almost all of their infra was on prem when I worked there 3 years ago.

replies(1): >>41838402 #

6. remon ◴[14 Oct 24 15:26 UTC] No.41838402{3}[source]▶

>>41838349 #

It's neither. I remember them moving to the cloud, but apparently they moved to Google/Oracle (the latter making this article particularly interesting btw). As per the relevant press release: "It’s understood that Uber will close down its own on-premises data centers and move the entirety of its information technology workloads to Oracle and Google Cloud."

7. aseipp ◴[14 Oct 24 16:26 UTC] No.41839045[source]▶

>>41838059 (TP) #

Dividing the fleet QPS by the number of nodes is completely meaningless because it assumes that queries are distributed evenly across every part of the system and that every part of the system is uniform (e.g. it is unclear what the read/write patterns are, what proportion of these nodes are read replicas or hot standbys, or whether their sizing and configuration are the same). That isn't realistic at all. I would guess it is extremely likely that hot subsets of these clusters, depending on the use case, see anywhere from 1 to 4 orders of magnitude higher QPS than your guess, probably on a near constant basis.

Don't get me wrong, a lot of people have talked about Uber doing overengineering in weird ways, maybe they're even completely right. But being like "Well, obviously x/y = z, and z is rather small, therefore it's not impressive, isn't this obvious?" is the computer programming equivalent of the "econ 101 student says supply and demand explain everything" phenomenon. It's not an accurate characterization of the system at all and falls prey to the very thing you're alluding to ("this is obvious.")
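To illustrate the point about fleet-wide averages, here is a minimal sketch. The fleet composition and every per-node figure below are made up; they are chosen only so that the mean lands on the ~140 qps cited upthread, and say nothing about Uber's actual topology.

```python
# Hypothetical fleet: every number here is an assumption for illustration.
def fleet_stats(per_node_qps):
    """Return (mean qps per node, peak qps on any node, peak/mean ratio)."""
    total = sum(per_node_qps)
    mean = total / len(per_node_qps)
    peak = max(per_node_qps)
    return mean, peak, peak / mean

# 1,000 nodes: 900 near-idle replicas/standbys, 90 warm nodes, 10 hot primaries.
nodes = [10] * 900 + [500] * 90 + [8600] * 10

mean_qps, peak_qps, ratio = fleet_stats(nodes)
print(f"mean={mean_qps:.0f} qps, peak={peak_qps} qps, peak/mean={ratio:.0f}x")
```

Under these assumed numbers the fleet averages 140 qps per node while the hot nodes each serve 8,600 qps, so the average alone says little about how hard any given node is working.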

replies(1): >>41839798 #

8. bushbaba ◴[14 Oct 24 16:28 UTC] No.41839066[source]▶

>>41838139 #

Hardware is cheap relative to salaries. It might take 1 engineer 1 quarter to optimize. Compare that to a few thousand per server.
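As a rough sketch of that comparison: the break-even point is the number of servers an optimization has to eliminate before it pays for the engineer's time. Both cost figures below are hypothetical placeholders, not numbers from the thread or the article.

```python
import math

def break_even_servers(engineer_cost_per_quarter, cost_per_server):
    """Servers that must be eliminated for one engineer-quarter to pay for itself."""
    return math.ceil(engineer_cost_per_quarter / cost_per_server)

# Hypothetical figures: a fully loaded engineer-quarter and "a few thousand" per server.
engineer_quarter = 75_000
server_cost = 3_000

n = break_even_servers(engineer_quarter, server_cost)
print(f"Optimization pays off once it removes at least {n} servers")
```

Whether that threshold is large or small depends on the size of the fleet, which is what the replies below argue about.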

replies(3): >>41839204 #>>41840085 #>>41840375 #

9. sgarland ◴[14 Oct 24 16:40 UTC] No.41839204{3}[source]▶

>>41839066 #

It might take an engineer with no prior RDBMS knowledge a quarter to be able to optimize a DB for their use case, but then it’s effectively free. You found the optimal parameters to use for writer nodes? Great, roll that out to the fleet.

10. Twirrim ◴[14 Oct 24 16:59 UTC] No.41839409[source]▶

>>41838059 (TP) #

They're not on AWS. They use on-prem and are migrating to Google and Oracle clouds.

https://www.forbes.com/sites/danielnewman/2023/02/21/uber-go...

11. 0cf8612b2e1e ◴[14 Oct 24 17:36 UTC] No.41839798[source]▶

>>41839045 #

Simple enough just to think about localities and time of day. New York during Tuesday rush hour could be more load than all of North Dakota sees in a month. Even busy cities probably drop down to nothing on a weekday at 3am.

12. Jgrubb ◴[14 Oct 24 18:01 UTC] No.41840085{3}[source]▶

>>41839066 #

Ok, but we're in a thread about Uber's cloud bills, which are probably well into the 9 figures annually. It definitely gets talked about in board meetings.

Global public cloud spend is hundreds of billions of dollars a year. I wouldn't be surprised if it's AWS's marketing team that came up with the talking point about how much more expensive developer time is.

Edit: to put this another way: wherever you work, you might know which parts of the architecture need some performance work, but do you know which parts of the architecture cost the most money?

13. JackSlateur ◴[14 Oct 24 18:26 UTC] No.41840375{3}[source]▶

>>41839066 #

A couple of years ago, I optimized some shit and reduced the annual bill by 150k€ for 3 days of work.

I might say, "hardware" is expensive compared to (my) salary :)

replies(1): >>41840591 #

14. notyourwork ◴[14 Oct 24 18:49 UTC] No.41840591{4}[source]▶

>>41840375 #

There isn’t always low hanging fruit. And when there is, it likely requires engineering knowledge to know it exists.

replies(1): >>41842979 #

15. sgarland ◴[14 Oct 24 22:46 UTC] No.41842979{5}[source]▶

>>41840591 #

There almost always is, actually. If you’re in the cloud and aren’t a tiny startup, that means you’ve had team[s] building your infrastructure, probably led by devs at some point.

It doesn’t take engineering knowledge to browse through CloudWatch metrics and see that your average CPU utilization is in the single digits.
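A minimal sketch of that kind of check, assuming the CPU utilization samples have already been exported from whatever monitoring system is in use. It does not call CloudWatch; the instance names, sample values, and the 10% threshold are all hypothetical.

```python
from statistics import mean

def underutilized(cpu_samples, threshold_pct=10.0):
    """Return instance names whose average CPU utilization is below the threshold.

    Instances with no samples are skipped rather than treated as idle.
    """
    return sorted(
        name
        for name, samples in cpu_samples.items()
        if samples and mean(samples) < threshold_pct
    )

# Hypothetical exported metrics: instance name -> CPU utilization samples (percent).
samples = {
    "db-writer-1": [62.0, 71.5, 58.3],
    "db-replica-1": [4.1, 3.7, 5.2],
    "db-replica-2": [8.9, 9.4, 7.7],
    "db-standby-1": [],
}

idle = underutilized(samples)
print(idle)
```

An average alone can hide short bursts, so a real review would probably also look at peak or percentile utilization before resizing anything.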