128 points by ksec | 2 comments
dragontamer ◴[] No.42751521[source]
The triple decoder is one unique feature. The fact that Intel managed to get the three decode clusters lined up on small loops to deliver an effective 9-wide instruction issue is basically miraculous IMO. Very well done.

Another unique feature is the L2 cache shared between 4 cores. This means that thread communication across those 4 cores has much lower latency.

I've had lots of debates with people online about this design vs hyperthreading. The overall discovery from Intel seems to be that highly threaded tasks use fewer resources (cache, reorder-buffer entries, etc.).

Big cores (Intel P-cores or AMD Zen 5) can obviously split into 2 hyperthreads, but what if that division is still too coarse? E-cores give you 4 threads of support (4 physical cores) in roughly the same area as 1 P-core.

This is because the L2 cache is shared/consolidated, and the other resources (reorder buffers, register files, etc.) are all so much smaller on an E-core.

It's an interesting design. I'd still think that growing the cores to 4-way SMT (like Xeon Phi) or 8-way SMT (POWER10) would be a more conventional way to split up resources, though. But obviously I don't work at Intel, nor can I make those kinds of decisions.

replies(8): >>42751667 #>>42751930 #>>42752001 #>>42752140 #>>42752196 #>>42752200 #>>42753025 #>>42753142 #
Salgat ◴[] No.42753142[source]
What we desperately need before we get too deep into this is stronger support in languages for heterogeneous cores in an architecture agnostic way. Some way to annotate that certain threads should run on certain types of cores (and close together in memory hierarchy) without getting too deep into implementation details.
replies(2): >>42754839 #>>42757713 #
mlyle ◴[] No.42754839[source]
I don't think so. I don't trust software authors to make the right choice, and the most lopsided cases, where a thread will almost always need a bigger core, can afford to wait for the scheduler to figure it out.

And if you want to be close together in the memory hierarchy, does that mean close to the RAM that you can easily get to? And you want allocations from there? If you really want that, you can use numa(3).

> without getting too deep into implementation details.

Every microarchitecture is its own special case in what you win by being close to things and how that interacts with contention and distance to everything else. You either don't care and trust the infrastructure, or you want to micromanage it all, IMO.

replies(1): >>42757189 #
Salgat ◴[] No.42757189[source]
I'm talking about being close together in the cache. If a threadpool manager is hinted that 4 threads are going to share a lot of memory, it can place them on cores that share the same L2 cache. And no matter what, you're trusting software developers either way, whether at the app level, the language/runtime level, or the operating-system level.
replies(2): >>42757753 #>>42758907 #
1. dragontamer ◴[] No.42757753[source]
NUMA-aware threading is somewhat rare, but it does exist.

It's just reaching into the high arts of high-performance computing that fewer and fewer programmers know about. I'm not an HPC expert myself; I just like to study this stuff on the side as a hobby.

So NUMA-awareness is when your code knows that &variable1 is located in one physical location, while &variable2 is somewhere else.

This is possible because NUMA-aware allocators (numa_alloc_onnode from libnuma on Linux, VirtualAllocExNuma on Windows) take parameters that request the allocation from a particular NUMA zone.

Now that you know certain variables are tied to particular physical locations, you can also tie threads to those same NUMA locations via CPU affinity. And with a bit of effort, you can ensure that the threads in one workpool share the same NUMA zones.
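
For concreteness, here's a minimal sketch of that pattern using Linux's libnuma plus pthreads (my own illustration, not anything specific from above; the node number and buffer size are arbitrary placeholders, and real code would query the topology first):

    /* Sketch: allocate memory on one NUMA node and pin the worker thread to that
     * node's CPUs, so the data and the thread stay in the same NUMA zone.
     * Build with: gcc numa_pin.c -lnuma -lpthread
     */
    #include <numa.h>      /* numa_available, numa_run_on_node, numa_alloc_onnode, numa_free */
    #include <pthread.h>
    #include <stdio.h>

    #define NODE 0                         /* arbitrary: the NUMA zone we stay inside */
    #define BUF_SIZE (64UL * 1024 * 1024)  /* arbitrary working-set size */

    static void *worker(void *arg)
    {
        (void)arg;
        numa_run_on_node(NODE);            /* restrict this thread to NODE's CPUs */

        char *buf = numa_alloc_onnode(BUF_SIZE, NODE);   /* memory physically on NODE */
        if (!buf) { perror("numa_alloc_onnode"); return NULL; }

        for (size_t i = 0; i < BUF_SIZE; i += 4096)
            buf[i] = 1;                    /* touch pages so they fault in on NODE */

        numa_free(buf, BUF_SIZE);
        return NULL;
    }

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }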

---------

Now, code awareness of shared caches is less common. But following the same model of "abstracted work pools with thread affinity + NUMA awareness of data", programmers have been able to keep cooperating work on Zen 1 cores that share the same L3 cache (a CCX).

A shared L2 cache across E-cores is new, but it's not a new concept in general. (I.e. the same mechanisms and abstractions we used for thread-affinity checks on Zen cores sharing an L3 cache, or for NUMA awareness on multi-socket CPUs, would all still work for the L2 cache.)

I don't know if the libraries support that. But I bet Intel's library (TBB) and their programmers are working on keeping their abstractions clean and efficient.
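
The raw mechanism is already there on Linux, though: sysfs exposes which logical CPUs share each cache, and you can pin a pool's threads to exactly that group. A rough sketch (mine, not Intel's or TBB's; index2 is usually the unified L2, but the "level" file in the same directory should be checked on real hardware):

    /* Sketch: discover which logical CPUs share cpu0's L2 cache via sysfs, then
     * pin a thread to exactly that group.
     * Build with: gcc l2_affinity.c -lpthread
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Parse a cpulist string like "0-3" or "0,2,4-7" into a cpu_set_t. */
    static void parse_cpulist(const char *s, cpu_set_t *set)
    {
        CPU_ZERO(set);
        int lo, hi;
        while (*s) {
            if (sscanf(s, "%d-%d", &lo, &hi) == 2) {
                for (int c = lo; c <= hi; c++) CPU_SET(c, set);
            } else if (sscanf(s, "%d", &lo) == 1) {
                CPU_SET(lo, set);
            }
            while (*s && *s != ',') s++;   /* advance to the next comma, if any */
            if (*s == ',') s++;
        }
    }

    static void *worker(void *arg)
    {
        (void)arg;
        /* Threads pinned here all hit the same L2, so shared data stays hot. */
        return NULL;
    }

    int main(void)
    {
        char buf[256] = {0};
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list", "r");
        if (!f || !fgets(buf, sizeof buf, f)) { perror("sysfs"); return 1; }
        fclose(f);

        cpu_set_t l2_peers;
        parse_cpulist(buf, &l2_peers);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof l2_peers, &l2_peers);  /* confine to the L2 group */

        pthread_t t;
        pthread_create(&t, &attr, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }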

replies(1): >>42758981 #
2. mlyle ◴[] No.42758981[source]
> I don't know if the libraries support that. But I bet Intel's library (TBB) and their programmers are working on keeping their abstractions clean and efficient.

Intel can declare a set of nodes and the distances between those nodes in ACPI, and then Linux/libnuma/etc. pick it up.

So, e.g., in AMD's SLIT tables, the local node is 10; nodes within the same partition are 11; nodes within the same socket are 12; distant sockets are >= 20.
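
Those distances are exactly what libnuma hands back to applications; a tiny sketch (mine) that dumps the matrix:

    /* Sketch: print the node-to-node distance matrix the kernel exposes from the
     * ACPI SLIT via libnuma. 10 means local; bigger numbers mean farther away.
     * Build with: gcc numa_dist.c -lnuma
     */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        int max = numa_max_node();                       /* highest node number present */
        for (int from = 0; from <= max; from++) {
            for (int to = 0; to <= max; to++)
                printf("%4d", numa_distance(from, to));  /* relative distance, local = 10 */
            printf("\n");
        }
        return 0;
    }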

There are fancier, more detailed tables (e.g. HMAT), and some code out there uses them, but that's kind of beyond the scope of libnuma.