
151 points ibobev | 7 comments
bob1029 ◴[] No.45653379[source]
I look at cross-core communication as a 100x latency penalty. Everything follows from there. The dependencies in the workload ultimately determine how it should be spread across the cores (or not!). The real elephant in the room is that it's often much faster to just do the whole job on a single core, even if you have 255 others available. Some workloads do not care what kind of clever scheduler you have on hand. If everything constantly depends on the prior action, you will never get any uplift.

You see this most obviously (visually) in places like game engines. In Unity, the difference between non-Burst and Burst-compiled code is extreme. The difference between single and multi core for the job system is often irrelevant by comparison. If the amount of CPU time spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job out to be run on the fleet has a lot of overhead; it has to be worth that one-time 100x latency cost both ways.
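To put a rough number on the dispatch overhead (nothing Unity-specific here; a hypothetical sketch using plain std::thread, where spawning is far costlier than a pooled job system, but the shape of the result is the same):

    use std::thread;
    use std::time::Instant;

    // Stand-in for a "job": CPU-bound work with an internal data dependency.
    fn busy_work(iters: u64) -> u64 {
        (0..iters).fold(0u64, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
    }

    fn main() {
        for &iters in &[100u64, 10_000, 1_000_000] {
            let t = Instant::now();
            let inline: u64 = (0..64).map(|_| busy_work(iters)).sum();
            let inline_time = t.elapsed();

            let t = Instant::now();
            // One worker per job: pays the dispatch/wakeup cost 64 times over.
            let spawned: u64 = (0..64)
                .map(|_| thread::spawn(move || busy_work(iters)))
                .collect::<Vec<_>>()
                .into_iter()
                .map(|h| h.join().unwrap())
                .sum();
            let spawn_time = t.elapsed();

            assert_eq!(inline, spawned);
            println!("{iters:>9} iters/job: inline {inline_time:?} vs spawned {spawn_time:?}");
        }
    }

Tiny jobs lose to the single core outright; only the biggest ones pay back the round trip.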

The GPU is the ultimate example of this. Some workloads benefit dramatically from the incredible parallelism; others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm. Some ML techniques are terrible at running on the GPU, but it seems as if we've convinced ourselves that the GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute: getting data in and out of a GPU takes an eternity compared to hitting L1. And there are other fundamental problems with GPUs (warp divergence) that preclude clever workarounds.

replies(7): >>45660423 #>>45661402 #>>45661430 #>>45662310 #>>45662427 #>>45662527 #>>45667568 #
1. pron ◴[] No.45662427[source]
That's fine, but a work-stealing scheduler doesn't redistribute work willy-nilly. Locally submitted tasks are likely to remain local, and are generally stolen only when stealing does pay off. If everything is more or less evenly distributed, you'll get little or no stealing.

That's not to say it's perfect. The hard part is anticipating how much workload is about to arrive and deciding how many worker threads to spawn. If you overestimate and have too many worker threads running, you'll get wasteful stealing; if you're overly conservative and slow to respond to a growing workload (to avoid over-stealing), you'll wait for threads to spawn and hurt your latencies just as the workload begins to spike.
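For anyone who hasn't looked inside one of these, the "local first, steal only when idle" behavior largely falls out of the deque structure itself. A minimal sketch using the crossbeam-deque crate (not how any particular runtime tunes it):

    use crossbeam_deque::{Steal, Stealer, Worker};

    fn next_task(local: &Worker<u32>, others: &[Stealer<u32>]) -> Option<u32> {
        // Fast path: pop from our own queue; no stealing while local work exists.
        if let Some(task) = local.pop() {
            return Some(task);
        }
        // Slow path: only when idle do we go looking at other workers' queues.
        for s in others {
            loop {
                match s.steal() {
                    Steal::Success(task) => return Some(task),
                    Steal::Retry => continue,
                    Steal::Empty => break,
                }
            }
        }
        None
    }

    fn main() {
        let w0 = Worker::new_fifo();
        let w1 = Worker::new_fifo();
        for i in 0..4u32 {
            w0.push(i); // all work submitted locally to worker 0
        }
        // Worker 1 has nothing local, so it resorts to stealing from worker 0.
        while let Some(t) = next_task(&w1, &[w0.stealer()]) {
            println!("worker 1 stole task {t}");
        }
    }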

replies(1): >>45662804 #
2. vlovich123 ◴[] No.45662804[source]
There are secondary costs, though. Because a task might run on any thread, you have to sprinkle atomics and/or mutexes all over the place (in Rust parlance, the spawned tasks must be Send), and those have implicit performance costs that stack up even if the task is never actually transferred.

In other words, you could probably easily do 10M op/s per core on a thread-per-core design but struggle to get 1M op/s on a work-stealing design. And the work-stealing number is total throughput for the whole machine, whereas the 10M op/s design will generally keep scaling with the number of CPUs.
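A tiny illustration of the "you pay even if the task never moves" part (a sketch; the real cost shows up in whatever shared state your tasks touch, not just refcounts):

    use std::rc::Rc;
    use std::sync::Arc;
    use std::time::Instant;

    fn main() {
        const N: u64 = 50_000_000;

        // Thread-per-core flavor: task-local state can live in Rc (not Send),
        // so each clone/drop is a plain increment/decrement.
        let local = Rc::new(0u64);
        let t = Instant::now();
        for _ in 0..N {
            std::hint::black_box(Rc::clone(&local));
        }
        println!("Rc clone+drop:  {:?}", t.elapsed());

        // Work-stealing flavor: the task may run on any thread, so the same
        // state must be Arc (with fields behind Mutex/atomics). Every clone
        // and drop is an atomic RMW, even if the task never actually migrates.
        let shared = Arc::new(0u64);
        let t = Instant::now();
        for _ in 0..N {
            std::hint::black_box(Arc::clone(&shared));
        }
        println!("Arc clone+drop: {:?}", t.elapsed());
    }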

replies(2): >>45663406 #>>45664068 #
3. zipy124 ◴[] No.45663406[source]
What about using something like https://github.com/judofyr/spice?
replies(1): >>45663658 #
4. vlovich123 ◴[] No.45663658{3}[source]
That's not the kind of task queue being discussed here. It's more like rayon: I have a par_iter and I want it to go as fast as possible over a large number of elements. A slightly different use case from thread-per-core vs. a work-stealing runtime.
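Roughly this shape (a sketch using rayon, since that's the comparison above):

    use rayon::prelude::*;

    fn main() {
        let data: Vec<u64> = (0..10_000_000).collect();

        // Fork-join over one big collection: the runtime splits the range and
        // joins the partial sums. Very different from a long-lived queue of
        // independent tasks arriving over time.
        let sum: u64 = data.par_iter().map(|x| x * 2).sum();
        println!("{sum}");
    }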
5. pron ◴[] No.45664068[source]
An occasional successful CAS (on an owned cache line) has very little cost, but if you have to sprinkle atomics/mutexes all over the place, then something in your design is clearly not scalable regardless of the concurrency implementation (you're expecting contention in a lot of places).
replies(1): >>45664426 #
6. vlovich123 ◴[] No.45664426{3}[source]
An atomic add on a 6 GHz high-end desktop CPU (a 13900) is, I believe, on the order of 4-10 ns. If it's in your hot path, that path can't go faster than 50-100 million operations/s; that's the cost of one such instruction (down from the ~24 billion non-atomic additions that 6 GHz core could otherwise do). A CAS brings this down to ~20-50 Mops/s. And if that cache line is cached on another CPU, you pay an additional hidden latency of anywhere from 40-200 ns, further reducing your hot path to at most 5-25 Mops/s (ignoring the secondary effect of slowing down those other cores without them even doing anything). God forbid there's any contention: you're looking at a 20x spread between the best and worst case for how much throughput you lose by having a single CAS in your hot loop.

And this is just the task scheduler. At least in Rust you'll also need thread-safe data structures accessed within the task itself; that's what I was referring to as "sprinkled". If you really want to target something running at 10 Mops/s on a single core, I don't think you can possibly get there with a work-stealing approach.
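These are ballpark figures, but they're easy to sanity-check with a sketch like this (a single thread hammering one location that nothing else touches):

    use std::hint::black_box;
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::Instant;

    fn main() {
        const N: u64 = 100_000_000;

        // Baseline: plain adds the CPU can pipeline and reorder freely.
        let mut x = 0u64;
        let t = Instant::now();
        for i in 0..N {
            x = x.wrapping_add(black_box(i));
        }
        let plain = t.elapsed();
        black_box(x);

        // Atomic adds from one thread on a cache line it owns; sharing the
        // line with another core or contending on it would only add to this.
        let a = AtomicU64::new(0);
        let t = Instant::now();
        for i in 0..N {
            a.fetch_add(i, Ordering::Relaxed);
        }
        let atomic = t.elapsed();
        black_box(a.load(Ordering::Relaxed));

        println!("plain add:  {:.2} ns/op", plain.as_nanos() as f64 / N as f64);
        println!("atomic add: {:.2} ns/op", atomic.as_nanos() as f64 / N as f64);
    }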
replies(1): >>45680136 #
7. skavi ◴[] No.45680136{4}[source]
Is that the best-case latency? E.g., with only one thread adding to that location?