There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried. Sadly I think flame graphs made profiling more accessible to the unmotivated but didn’t actually improve overall results.
The sympathy is also needed. Problems aren't found when people don't care, or consider the current performance acceptable.
> There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried.
It's hard for profilers to identify slowdowns that are due to the architecture. Making the function do less work to get its result feels different from determining that the function's result is unnecessary.
If you see a database query that takes 1 hour to run, and only touches a few GB of data, you should be thinking "Well, NVMe bandwidth is multiple gigabytes per second, why can't it run in 1 second or less?"
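Spelled out as arithmetic (a minimal sketch in Python; the 5 GB size and 3.5 GB/s bandwidth are assumed round numbers, not measurements from the comment):

    # Back-of-envelope check: the sizes below are assumed round numbers.
    data_gb = 5            # "a few GB" touched by the query
    nvme_gb_per_s = 3.5    # typical sustained NVMe sequential read

    scan_s = data_gb / nvme_gb_per_s
    print(f"ideal sequential scan: {scan_s:.1f} s")                    # ~1.4 s
    print(f"1-hour query is {3600 / scan_s:,.0f}x off raw bandwidth")  # ~2,500x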
The idea that anyone would accept a request to a website taking longer than 30ms (the time it takes for a game to render its entire world, including both the CPU and GPU parts, at 60fps) is insane, and nobody should really accept it, but we commonly do.
Uber could run the complete global rider/driver flow from a single server.
It doesn't, in part because all of those individual trips earn $1 or more each, so it's perfectly acceptable to the business to be inefficient and use hundreds of servers for this task.
Similarly, a small website taking 150ms to render the page only matters if the lost productivity costs more than the engineering time to fix it, and even then, only makes sense if that engineering time isn't more productively used to add features or reliability.
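As a rough sketch of that trade-off, every number below is a made-up assumption, not something from the comment:

    # Hypothetical numbers only: is shaving ~100ms off a 150ms render worth it?
    daily_page_views = 20_000
    seconds_saved_per_view = 0.100        # 150ms -> 50ms, say
    value_of_user_time_per_hour = 30.0    # dollars, assumed

    daily_value = daily_page_views * seconds_saved_per_view / 3600 * value_of_user_time_per_hour
    print(f"time saved is worth ~${daily_value:.2f}/day")                 # ~$16.67/day

    engineering_cost = 5_000.0            # assumed cost of the optimization work
    print(f"break-even after ~{engineering_cost / daily_value:.0f} days") # ~300 days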
The common element between attempts is new visualizations. And like drawing a projection of an object in a mechanical engineering drawing, there is no one projection that contains the entire description of the problem. You need to present several and let the brain synthesize the data missing in each individual projection into an accurate model.
I always liked Shaw’s “The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.”
https://en.wikipedia.org/wiki/Speed_of_light
Just as an example, round trip delay from where I rent to the local backbone is about 14 ms alone, and the average for a webserver is 53 ms. Just as a simple echo reply. (I picked it because I'd hoped that was in Redmond or some nearby datacenter, but it looks more likely to be in a cheaper labor area.)
However, it's only the bloated ECMAScript (JavaScript) trash web of today that makes a website take longer than ~1 second to load on a modern PC. Plain old HTML, images on a reasonable diet, and some script elements only for interactive things can scream.
mtr -bzw microsoft.com
6. AS7922 be-36131-cs03.seattle.wa.ibone.comcast.net (2001:558:3:942::1) 0.0% 10 12.9 13.9 11.5 18.7 2.6
7. AS7922 be-2311-pe11.seattle.wa.ibone.comcast.net (2001:558:3:3a::2) 0.0% 10 11.8 13.3 10.6 17.2 2.4
8. AS7922 2001:559:0:80::101e 0.0% 10 15.2 20.7 10.7 60.0 17.3
9. AS8075 ae25-0.icr02.mwh01.ntwk.msn.net (2a01:111:2000:2:8000::b9a) 0.0% 10 41.1 23.7 14.8 41.9 10.4
10. AS8075 be140.ibr03.mwh01.ntwk.msn.net (2603:1060:0:12::f18e) 0.0% 10 53.1 53.1 50.2 57.4 2.1
11. AS8075 2603:1060:0:10::f536 0.0% 10 82.1 55.7 50.5 82.1 9.7
12. AS8075 2603:1060:0:10::f3b1 0.0% 10 54.4 96.6 50.4 147.4 32.5
13. AS8075 2603:1060:0:10::f51a 0.0% 10 49.7 55.3 49.7 78.4 8.3
14. AS8075 2a01:111:201:f200::d9d 0.0% 10 52.7 53.2 50.2 58.1 2.7
15. AS8075 2a01:111:2000:6::4a51 0.0% 10 49.4 51.6 49.4 54.1 1.7
20. AS8075 2603:1030:b:3::152 0.0% 10 50.7 53.4 49.2 60.7 4.2

Billing, serving assets like map tiles, etc. not included.
Some key things to understand:
* The scale of Uber is not that high. A big city surely has < 10,000 drivers simultaneously, probably less than 1,000.
* The driver and rider phones participate in the state keeping. They send updates every 4 seconds, but they only have to be online to start a trip. Both mobiles cache a trip log that gets uploaded when network is available.
* Since driver/rider send updates every 4 seconds, and since you don't need to be online to continue or end a trip, you don't even need an active spare for the server. A hot spare can rebuild the world state in 4 seconds. State for a rider and driver is just a few bytes each for id, position and status.
* Since you'll have the rider and driver trip logs from their phones, you don't necessarily have to log the ride server side either. It's also OK to lose a little data on the server. You can use UDP.
Don't forget that in the olden times, all the taxis in a city like New York were dispatched by humans. All the police in the city were dispatched by humans. You can replace a building of dispatchers with a good server and mobile hardware working together.
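A minimal sketch of what that single-server world state could look like; the port, wire format, and field layout here are my assumptions for illustration, not the commenter's:

    import socket
    import struct
    import time

    # In-memory world state: id -> (lat, lon, status, last_seen).
    # At ~40 bytes per entry, even 100,000 concurrent drivers and riders
    # fit in a few megabytes, and a cold spare refills it within one
    # 4-second update interval.
    state = {}

    # Hypothetical wire format: 8-byte id, two doubles for position, 1-byte status.
    FMT = "!Qddb"
    SIZE = struct.calcsize(FMT)  # 25 bytes per update

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 9999))

    while True:
        data, _addr = sock.recvfrom(64)
        if len(data) < SIZE:
            continue  # UDP: drop malformed packets; losing a little is OK
        entity_id, lat, lon, status = struct.unpack(FMT, data[:SIZE])
        state[entity_id] = (lat, lon, status, time.time())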
Microservices try to fix that. But then you need bin packing, so microservices beget Kubernetes.
What the internet will tell me is that Uber has 4,500 distinct services, which is more services than there are counties in the US.
My first job out of college, I got handed the slowest machine they had. The app was already half done and was dogshit slow even with small data sets. I was embarrassed to think my name would be associated with it. The UI painted so slowly I could watch the individual lines paint on my screen.
My friend and I in college had made homework into a game of seeing who could make their homework assignment run faster or use less memory. Such as calculating the Fibonacci of 100, or 1000. So I just started applying those skills and learning new ones.
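For flavor, the kind of gap that game turns on (a sketch, not the actual homework):

    # Naive recursion makes ~1.6^n calls; fib(100) would never finish this way.
    def fib_naive(n):
        return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

    # The same answer in n steps and constant memory; fib(1000) is instant.
    def fib_iter(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    print(fib_iter(100))   # 354224848179261915075
    print(fib_iter(1000))  # a 209-digit number, still instant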
For weeks I evaluated improvements to the code by saying “one Mississippi, two Mississippi”. Then how many syllables I got through. Then the stopwatch function on my watch. No profilers, no benchmarking tools, just code review.
And that’s how my first specialization became optimization.
The sandwich view hides invocation count, which is one of the biggest things you need to look at for that remaining 3x.
Also you need to think about budgets, which is something game designers do and the rest of us ignore. Do I want 10% of overall processing time to be spent accessing reloadable config? Reporting stats? If the answer is no then we need to look at that, even if data retrieval is currently 40% of overall response time and we are trying to get from 2 seconds to 200ms.
That means config and stats have a budget of 20ms each, and you will never hit 200ms if nobody looks at them. So you can pretend like they don’t exist until you get all the other tent poles chopped, and then do the surprise-Pikachu face when you’ve already painted yourself into a corner with your other changes.
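A minimal sketch of that budgeting exercise; the component names and measured times are assumptions for illustration, consistent with the 2-seconds-to-200ms example above:

    TARGET_MS = 200  # the response-time goal

    budgets_ms = {
        "data retrieval":   120,
        "reloadable config": 20,  # 10% of the target
        "stats reporting":   20,  # 10% of the target
        "everything else":   40,
    }

    # Hypothetical current measurements for today's 2-second response.
    measured_ms = {
        "data retrieval":   800,  # the 40% everyone profiles
        "reloadable config": 450,
        "stats reporting":   350,
        "everything else":   400,
    }

    for name, budget in budgets_ms.items():
        actual = measured_ms[name]
        status = "OVER" if actual > budget else "ok"
        print(f"{name:18} budget {budget:4}ms  measured {actual:4}ms  {status}")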
When we have a lot of shit that all needs to get done, we want to get to transparency: look at the pile and figure out how to do it all effectively. Combine errands and spread the stressful bits out over time. None of the tools and none of the literature supports this exercise, and in fact most of the literature is actively hostile to it. Which is why you should read a certain level of reproval or even contempt in my writing about optimization. It’s very much intended.
Most advice on writing fast code has not materially changed for a time period where the number of calculations we do has increased by 5 orders of magnitude. In every other domain, we re-evaluate our solutions at each order of magnitude. We have marched past ignorant and into insane at this point. We are broken and we have been broken for twenty years.
The reality is most of those requests (now) get mixed in with a firehose of traffic, and could be served much faster than 16ms if that is all that was going on. But it’s never all that is going on.