
151 points ibobev | 2 comments
bob1029 ◴[] No.45653379[source]
I look at cross-core communication as a 100x latency penalty. Everything follows from there. The dependencies in the workload ultimately determine how it should be spread across the cores (or not!). The real elephant in the room is that it's often much faster to just do the whole job on a single core, even if you have 255 others available. Some workloads do not care what kind of clever scheduler you have in hand. If everything constantly depends on the prior result, you will never get any uplift.
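A minimal sketch of that dependency point in Python (the step function and the sizes are mine, purely illustrative): a chain where every step consumes the previous result cannot use extra cores at all, while the same amount of independent work splits across a pool cleanly.

```python
# Illustrative only: step() is a stand-in for "some unit of work".
from concurrent.futures import ProcessPoolExecutor

def step(x: int) -> int:
    return (x * 31 + 7) % 1_000_003

def chained(n: int) -> int:
    # Serial by construction: iteration i needs the result of iteration i-1,
    # so no scheduler can spread this across cores.
    acc = 1
    for _ in range(n):
        acc = step(acc)
    return acc

def independent(n: int) -> int:
    # No iteration depends on another, so the work splits across cores
    # (minus the dispatch/communication overhead discussed above).
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(step, range(n), chunksize=10_000))

if __name__ == "__main__":
    print(chained(1_000_000))      # cannot be sped up by adding cores
    print(independent(1_000_000))  # scales with core count, minus overhead
```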

You see this most obviously (visually) in places like game engines. In Unity, the difference between non-burst and burst-compiled code is extreme. The difference between single-core and multi-core for the job system is often irrelevant by comparison. If the amount of CPU time being spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job to be run on the fleet has a lot of overhead; it has to be worth that one-time 100x latency cost both ways.
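To make the per-job overhead concrete, here is a rough sketch (not Unity; plain Python with a process pool, and the job sizes are made up): the same total work dispatched as many tiny jobs versus a few fat batches. The tiny jobs mostly pay dispatch and serialization cost; the batches amortize it.

```python
import time
from concurrent.futures import ProcessPoolExecutor

def tiny_job(x: int) -> int:
    # Far too little work to be worth a round trip to another core.
    return x * x

def batched_job(xs: range) -> int:
    # Same work, but one dispatch covers 10,000 elements.
    return sum(x * x for x in xs)

if __name__ == "__main__":
    n = 200_000
    with ProcessPoolExecutor() as pool:
        t0 = time.perf_counter()
        total_a = sum(pool.map(tiny_job, range(n)))        # one job per element
        print(f"tiny jobs:    {time.perf_counter() - t0:.3f} s")

        t0 = time.perf_counter()
        batches = [range(i, min(i + 10_000, n)) for i in range(0, n, 10_000)]
        total_b = sum(pool.map(batched_job, batches))      # 20 jobs total
        print(f"batched jobs: {time.perf_counter() - t0:.3f} s")
        assert total_a == total_b
```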

The GPU is the ultimate example of this. Some workloads benefit dramatically from the incredible parallelism; others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm. Some ML techniques are terrible at running on the GPU, but it seems as if we've convinced ourselves that a GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute: getting data in and out of a GPU takes an eternity compared to L1. There are other fundamental problems with GPUs, such as warp divergence, that preclude clever workarounds.
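A rough illustration of the transfer-latency point (this assumes NumPy plus CuPy and a CUDA-capable GPU; the sizes and the trivial kernel are mine): for a small array the host/device copies dwarf the arithmetic and the CPU wins easily, while a large, copy-amortized workload can favor the GPU.

```python
import time
import numpy as np
import cupy as cp  # assumes a CUDA-capable GPU is available

def gpu_roundtrip(host_array):
    dev = cp.asarray(host_array)   # host -> device copy
    dev = dev * 2.0 + 1.0          # trivial elementwise kernel
    return cp.asnumpy(dev)         # device -> host copy (synchronizes)

if __name__ == "__main__":
    gpu_roundtrip(np.zeros(16))    # warm-up so kernel compilation isn't timed
    for n in (1_000, 100_000_000):
        x = np.random.rand(n)
        t0 = time.perf_counter(); x * 2.0 + 1.0; cpu = time.perf_counter() - t0
        t0 = time.perf_counter(); gpu_roundtrip(x); gpu = time.perf_counter() - t0
        print(f"n={n:>11,}  cpu={cpu*1e3:9.2f} ms  gpu incl. copies={gpu*1e3:9.2f} ms")
```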

replies(7): >>45660423 #>>45661402 #>>45661430 #>>45662310 #>>45662427 #>>45662527 #>>45667568 #
dist-epoch ◴[] No.45661430[source]
The thing with GPUs is that for many problems, really dumb and simple algorithms (think the bubble-sort equivalent) are many times faster than very fancy CPU algorithms (think the quicksort equivalent). Your typical non-neural-network GPU algorithm rarely uses more than 50% of its power, yet still outperforms carefully written CPU algorithms.
replies(2): >>45665642 #>>45668021 #
1. sumtechguy ◴[] No.45668021[source]
That is an application of the formula

Pre-work time + pack up time + send time + unpack time + work time + pack up time + send time + unpack time + post-work time.

All remote work has these costs, even something 'simple' like a remote REST call. If the remote work time plus all that other overhead is less than the local call, then it is worth sending the work remote, time-wise. If not, the local CPU wins.

In many cases right now, the GPU is 'winning' that race.
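A back-of-the-envelope helper for that formula (the function name and example numbers are mine, not from the comment):

```python
def offload_wins(pre, pack, send, unpack, remote_work,
                 pack_back, send_back, unpack_back, post,
                 local_work):
    """True if doing the job remotely (GPU, REST call, ...) beats doing it locally."""
    remote_total = (pre + pack + send + unpack + remote_work
                    + pack_back + send_back + unpack_back + post)
    return remote_total < local_work

# 2 ms of copies each way around a 1 ms remote kernel (5 ms total) only wins
# if the same work would take more than 5 ms locally.
print(offload_wins(0, 1, 1, 0, 1, 0, 1, 1, 0, local_work=10))  # True
print(offload_wins(0, 1, 1, 0, 1, 0, 1, 1, 0, local_work=3))   # False
```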

replies(1): >>45668465 #
2. EraYaN ◴[] No.45668465[source]
There are some neat tricks to remove almost all of the pack and unpack time. Apache Arrow can help a ton there, since it uses the same data format on both the CPU and the GPU (or other accelerators). And on some unified-memory systems, even the send time can be very low.
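A minimal sketch of the zero-copy idea (assumes pyarrow and NumPy; the GPU half of the story would be a library that understands the Arrow format, e.g. cuDF, which is not shown here): because Arrow fixes one columnar layout, fixed-width data can cross library boundaries without a pack/unpack step.

```python
import numpy as np
import pyarrow as pa

values = np.arange(10_000_000, dtype=np.float64)

# Put the data into Arrow's columnar format once.
arrow_col = pa.array(values)

# Converting back requires no copy for fixed-width types without nulls;
# this call raises rather than silently copying.
view = arrow_col.to_numpy(zero_copy_only=True)

assert view.dtype == values.dtype and view[-1] == values[-1]
```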