
151 points ibobev | 5 comments
vacuity No.45652410
There are no hard rules; use principles flexibly.

That being said, some things are generally true for the long term: use a pinned thread per core, maximize locality (of data and code, wherever relevant), and use asynchronous programming where performance demands it. To incorporate the OP, give control where it's due to each entity (here, the scheduler). Cross-core data movement was never the enemy, but unprincipled cross-core data movement can be. If even distribution of work is important, work-stealing is excellent, as long as it's done carefully. Details like how concurrency is implemented (shared state, here) or who controls the data depend on the circumstances.
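For the pinning point, a minimal sketch of pinning the calling thread to a single core, assuming Linux (`os.sched_setaffinity` is Linux-only, and the choice of core here is purely illustrative):

```python
import os

# Pin the calling thread (pid 0 = self) to one core. Linux-only;
# on other platforms you'd reach for pthread_setaffinity_np or
# SetThreadAffinityMask instead.
allowed = os.sched_getaffinity(0)   # cores we are currently allowed to run on
pinned_core = min(allowed)          # illustrative choice: lowest allowed core
os.sched_setaffinity(0, {pinned_core})

# From here on the scheduler keeps this thread on pinned_core,
# so its working set stays warm in that core's private caches.
```

In a real thread-per-core design you would spawn one worker per core and pin each to a distinct core ID, rather than pinning the main thread as this sketch does.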

replies(1): >>45661510
1. AaronAPU No.45661510
I did mass-scale performance benchmarking on highly optimized workloads using lock-free queues and fibers, and pinning to a core was almost never faster. There were a few topologies where it was, but they were outliers.

This was on a wide variety of intel, AMD, NUMA, ARM processors with different architectures, OSes and memory configurations.

Part of the reason is hyper-threading (or Threadripper-type architectures), but even pinning to core groups usually wasn't faster.

This was even more so the case when competing workloads were stealing cores from the OS scheduler.

replies(4): >>45661808, >>45662327, >>45663519, >>45669765
2. zamadatix No.45661808
I think the workload may be as much of a factor as (if not more than) the uniqueness of the topology itself in how much pinning matters. If your workload is purely computationally limited, then it doesn't matter. Same if it's actually I/O limited. If it's memory-bandwidth limited, then it depends on things like how much fits in per-core cache vs. shared cache vs. going to RAM, and how RAM is actually fed to the cores.
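A quick back-of-the-envelope way to tell which regime you're in is a roofline-style check. The throughput numbers below are hypothetical, not measured:

```python
# Roofline sketch: a kernel is bandwidth-bound when its arithmetic
# intensity (FLOPs per byte moved) times memory bandwidth falls below
# the CPU's peak compute rate. All figures are illustrative.
peak_flops = 2.0e12   # hypothetical peak compute: 2 TFLOP/s
mem_bw = 200.0e9      # hypothetical memory bandwidth: 200 GB/s

# Streaming update y[i] += a * x[i]: 2 FLOPs per element,
# 24 bytes moved per element (read x, read y, write y; 8-byte doubles).
intensity = 2 / 24
achievable = min(peak_flops, intensity * mem_bw)

# achievable comes out far below peak_flops, so this kernel is
# bandwidth-bound: pinning and locality matter, extra ALUs don't.
bandwidth_bound = achievable < peak_flops
```

With these (made-up) numbers the streaming kernel can only sustain about 17 GFLOP/s against a 2 TFLOP/s peak, which is the regime where cache-fit and RAM-feeding questions dominate.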

A really interesting niche is all of the performance considerations around the design/use of VPP (Vector Packet Processing) in the networking context. It's just one example of a single niche, but it can give a good idea of how both "changing the way the computation works" and "changing the locality and pinning" can come together at the same time. I forget the username but the person behind VPP is actually on HN often, and a pretty cool guy to chat with.

Or, as vacuity put it, "there are no hard rules; use principles flexibly".

3. jandrewrogers No.45662327
Most high-performance workloads are limited by memory bandwidth these days. Even in HPC, that became the primary bottleneck for a large percentage of workloads in the 2000s. High-performance data infrastructure is largely the same. You can drive 200 GB/s of I/O on a server in real systems today.

The memory-bandwidth-bound case is where thread-per-core tends to shine. It is the problem thread-per-core was invented to solve in HPC, and it empirically had significant performance benefits. Today we use it in high-scale databases and other I/O-intensive infrastructure when performance and scalability are paramount.
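As a toy illustration of the structure being described, here is a minimal shared-nothing sketch: each worker owns one shard of the state and only ever touches its own shard, so the hot path needs no locks. (Real thread-per-core systems use one pinned OS thread per core and lock-free inboxes; neither is shown here, and all names are hypothetical.)

```python
import threading
import queue

NUM_WORKERS = 4
shards = [dict() for _ in range(NUM_WORKERS)]   # each worker's private state
inboxes = [queue.Queue() for _ in range(NUM_WORKERS)]

def owner(key):
    # Route every key to one fixed worker, so no two workers
    # ever touch the same shard entry.
    return hash(key) % NUM_WORKERS

def worker(wid):
    shard, inbox = shards[wid], inboxes[wid]
    while True:
        msg = inbox.get()
        if msg is None:          # shutdown sentinel
            return
        key, value = msg
        shard[key] = shard.get(key, 0) + value   # lock-free: sole owner

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    inboxes[owner(key)].put((key, value))
for q in inboxes:
    q.put(None)
for t in threads:
    t.join()

# Merging shards is safe after join; keys never overlap across shards.
total = {k: v for shard in shards for k, v in shard.items()}
```

The "does not degrade gracefully" point shows up even in this toy: if any producer bypasses `owner()` and writes to the wrong inbox, two workers can end up mutating overlapping keys, and the lock-free assumption silently breaks.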

That said, it is an architecture that does not degrade gracefully. I've seen more thread-per-core implementations in the wild that were broken by design than ones that were implemented correctly. It requires a commitment to rigor and thoroughness in the architecture that most software devs are not used to.

4. vacuity No.45663519
Thanks for sharing. Aside from what the other replies to you have shared, I admittedly have less experience, and I'm mainly interested in the OS perspective. Balancing global and local optimizations is hard, so the OS deserves some leeway, but as I see it, mainstream OSes tend to be awkward no matter what. It's long past time for OS schedulers to consider high-level metadata to get a rough idea of the idiosyncrasies of the workload. In the extreme case, designing the OS from the ground up to minimize cross-core contention[0] gives the most control, maximizing potential performance. As jandrewrogers says in a sibling reply, this requires a commitment to rigor, treacherous and nonportable as it is. In any case, with improved infrastructure ("with sufficiently smart compilers"...), thread-per-core gains power.

[0] https://news.ycombinator.com/item?id=45651183

5. menaerus No.45669765
What type of workloads?