The world could run on older hardware if software optimization was a priority

(twitter.com)

848 points turrini | 2 comments | 13 May 25 10:31 UTC | HN request time: 0.415s | source

Show context

cletus ◴[13 May 25 16:32 UTC] No.43974735[source]▶

So I've worked for Google (and Facebook) and it really drives the point home of just how cheap hardware is and how not worth it optimizing code is most of the time.

More than a decade ago Google had to start managing their resource usage in data centers. Every project has a budget. CPU cores, hard disk space, flash storage, hard disk spindles, memory, etc. And these are generally convertible to each other so you can see the relative cost.

Fun fact: even though at the time flash storage was ~20x the cost of hard disk storage, it was often cheaper net because of the spindle bottleneck.

Anyway, all of these things can be turned into software engineer hours, often called "mili-SWEs" meaning a thousandth of the effort of 1 SWE for 1 year. So projects could save on hardware and hire more people or hire fewer people but get more hardware within their current budgets.

I don't remember the exact number of CPU cores amounted to a single SWE but IIRC it was in the thousands. So if you spend 1 SWE year working on optimization acrosss your project and you're not saving 5000 CPU cores, it's a net loss.

Some projects were incredibly large and used much more than that so optimization made sense. But so often it didn't, particularly when whatever code you wrote would probably get replaced at some point anyway.

The other side of this is that there is (IMHO) a general usability problem with the Web in that it simply shouldn't take the resources it does. If you know people who had to or still do data entry for their jobs, you'll know that the mouse is pretty inefficient. The old terminals from 30-40+ years ago that were text-based had some incredibly efficent interfaces at a tiny fraction of the resource usage.

I had expected that at some point the Web would be "solved" in the sense that there'd be a generally expected technology stack and we'd move on to other problems but it simply hasn't happened. There's still a "framework of the week" and we're still doing dumb things like reimplementing scroll bars in user code that don't work right with the mouse wheel.

I don't know how to solve that problem or even if it will ever be "solved".

replies(5): >>43975206 #>>43976742 #>>43978070 #>>43978689 #>>43983076 #

1. mike_hearn ◴[13 May 25 19:34 UTC] No.43976742[source]▶

>>43974735 #

I worked there too and you're talking about performance in terms of optimal usage of CPU on a per-project basis.

Google DID put a ton of effort into two other aspects of performance: latency, and overall machine utilization. Both of these were top-down directives that absorbed a lot of time and attention from thousands of engineers. The salary costs were huge. But, if you're machine constrained you really don't want a lot of cores idling for no reason even if they're individually cheap (because the opportunity cost of waiting on new DC builds is high). And if your usage is very sensitive to latency then it makes sense to shave milliseconds off because of business metrics, not hardware $ savings.

replies(1): >>43977380 #

2. cletus ◴[13 May 25 20:32 UTC] No.43977380[source]▶

>>43976742 (TP) #

The key part here is "machine utilization" and absolutely there was a ton of effort put into this. I think before my time servers were allocated to projects but even early on in my time at Google Borg had already adopted shared machine usage and therew was a whole system of resource quota implemented via cgroups.

Likewise there have been many optimization projects and they used to call these out at TGIF. No idea if they still do. One I remember was reducing the health checks via UDP for Stubby and given that every single Google product extensively uses Stubby then even a small (5%? I forget) reduction in UDP traffic amounted to 50,000+ cores, which is (and was) absolutely worth doing.

I wouldn't even put latency in the same category as "performance optimization" because often you decrease latency by increasing resource usage. For example, you may send duplicate RPCs and wait for the fastest to reply. That could be double or tripling effort.

↑