
128 points by ksec | 10 comments
dragontamer ◴[] No.42751521[source]
The triple decoder is one unique feature. The fact that Intel managed to get the three decode clusters lined up on small loops to deliver an effective 9-wide decode is basically miraculous IMO. Very well done.

Another unique feature is the L2 shared between 4 cores. This means that thread communication across those 4 cores has much lower latency.

I've had lots of debates with people online about this design vs. hyperthreading. It seems like Intel's overall finding is that highly threaded tasks use fewer resources per thread (cache, ROB entries, etc.).

Big cores (Intel P cores or AMD Zen 5) can obviously split into 2 hyperthreads, but what if that division is still too coarse? A cluster of E cores gives you 4 threads in roughly the same area as 1 P core.

This is because the L2 cache is shared/consolidated, and the other resources (ROB, register files, etc.) are all so much smaller on the E core.

It's an interesting design. I still think that growing the cores to 4-way SMT (like Xeon Phi) or 8-way SMT (POWER10) would be a more conventional way to split up resources, though. But obviously I don't work at Intel, nor can I make these kinds of decisions.

replies(8): >>42751667 #>>42751930 #>>42752001 #>>42752140 #>>42752196 #>>42752200 #>>42753025 #>>42753142 #
Salgat ◴[] No.42753142[source]
What we desperately need before we get too deep into this is stronger support in languages for heterogeneous cores in an architecture-agnostic way: some way to annotate that certain threads should run on certain types of cores (and close together in the memory hierarchy) without getting too deep into implementation details.
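
To make it concrete, here is a rough C sketch of the kind of hint I mean. Everything in it (hw_hint_t, thread_hint) is invented for illustration and stubbed out as a no-op, since no such portable API exists today:

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical, platform-agnostic hint vocabulary (invented names). */
    typedef enum {
        HINT_SHARE_WORKING_SET, /* these threads share a hot working set      */
        HINT_PERF_PRIORITY,     /* prefer the fastest available core class    */
        HINT_EFFICIENCY         /* background work; efficiency cores are fine */
    } hw_hint_t;

    /* Purely advisory: a real runtime might map this onto affinity or QoS
       classes, or ignore it entirely. Stubbed here so the sketch compiles. */
    static int thread_hint(pthread_t *threads, size_t n, hw_hint_t hint) {
        (void)threads; (void)n; (void)hint;
        return 0;
    }

    /* Usage would be something like:
       thread_hint(workers, 4, HINT_SHARE_WORKING_SET); */
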
replies(2): >>42754839 #>>42757713 #
mlyle ◴[] No.42754839[source]
I don't think so. I don't trust software authors to make the right choice, and even the most lopsided cases, where a thread will almost always want a bigger core, can afford to wait for the scheduler to figure that out.

And if you want to be close together in the memory hierarchy, does that mean close to RAM that you can easily get to? And that you want allocations from there? If you really want that, you can use numa(3).

> without getting too deep into implementation details.

Every microarchitecture is its own special case in what you win by being close to things, and in how that plays against contention and distance to everything else. You either don't care and trust the infrastructure, or you want to micromanage it all, IMO.

replies(1): >>42757189 #
Salgat ◴[] No.42757189[source]
I'm talking about being close together in the cache. If a threadpool manager is hinted that 4 threads are going to share a lot of memory, they can be scheduled onto cores behind the same L2. And no matter what, you're trusting software developers either way, whether at the app level, the language/runtime level, or the operating system level.
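
The closest you can get today on Linux is explicit affinity, which is exactly the implementation detail a hint like that should hide. A rough sketch, assuming purely for illustration that CPUs 0-3 share one L2 (a real program would have to read the topology from /sys/devices/system/cpu/cpu*/cache first); build with -pthread:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void *worker(void *arg) {
        (void)arg;
        /* ... work on the shared data set ... */
        return NULL;
    }

    int main(void) {
        /* Assumption for illustration: CPUs 0-3 sit behind one L2 slice. */
        cpu_set_t l2_group;
        CPU_ZERO(&l2_group);
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &l2_group);

        pthread_t threads[4];
        for (int i = 0; i < 4; i++) {
            pthread_create(&threads[i], NULL, worker, NULL);
            /* Pin each worker to the assumed L2-sharing group. */
            pthread_setaffinity_np(threads[i], sizeof(l2_group), &l2_group);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }
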
replies(2): >>42757753 #>>42758907 #
mlyle ◴[] No.42758907{3}[source]
> you're trusting software developers either way, whether it be at the app level, the language/runtime level, or the operating system level.

I trust systems to do better based on observed behavior rather than a software engineer's guess of how it will be scheduled. Who knows if, in a given use case, the program is a "small" part of the system or a "large" part that should get preferential placement and scheduling.

> If a threadpool manager is hinted that 4 threads are going to share a lot of memory, they can be allocated on the same l2 cache.

And so this is kind of a weird thing: we know we're going to be performance critical and we need things to be forced to be adjacent... but we don't know the exact details of the hardware we're running on. (Else, just numa_bind and be done...)
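
For reference, the numa_bind route looks roughly like this; it assumes libnuma is installed and, purely for illustration, that node 0 is the one you want (link with -lnuma):

    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        /* Bind this thread and its future allocations to node 0. */
        struct bitmask *nodes = numa_parse_nodestring("0");
        numa_bind(nodes);
        numa_bitmask_free(nodes);

        /* ... allocate and run; memory now comes from node 0 ... */
        return 0;
    }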

replies(1): >>42761401 #
1. Salgat ◴[] No.42761401{4}[source]
The beauty is that you don't care what hardware you run on; all you're annotating are useful but generic properties, such as which threads share a lot of memory, or that a thread should have the highest performance priority so that internally it stays on P cores instead of the more scalable E cores. Very simple, optional hints.
replies(1): >>42761643 #
2. mlyle ◴[] No.42761643[source]
> should have highest performance priority so that internally it stays on p cores

Everything will decide that it wants P cores: it isn't punished for its battery or energy impact, and it wants to win out over other applications so users have a better experience with it.

And even if not made in bad faith, it doesn't know what else is running on the system.

Also these decisions tend to be unduly influenced by microbenchmarks and then don't apply to the real system.

> which threads are sharing a lot of memory

But if they're not super active, should the scheduler really change what it's doing? And doesn't the size of that L2 matter? The hint is worthless if, e.g., the data gets churned out of the cache before there's any benefit from the sharing.

In the end, if you don't know pretty specific details of the environment you'll run on: what the hardware is like, what loading is like, what data set size is like, and what else will be running on the machine -- it is probably better to leave this decision to the scheduler.

If you do know all those things, and it's worth tuning this stuff in depth, odds are you're doing HPC and you already know what the machine is like.

replies(1): >>42764002 #
3. Salgat ◴[] No.42764002[source]
To clarify, what gets scheduled where is up to the OS or runtime; all you're doing is setting relative priority. If everything has the same priority, then it's just as likely to all run on E cores.
replies(1): >>42764545 #
4. mlyle ◴[] No.42764545{3}[source]
And then, what's the point?

A system that encourages everyone to jack their priority up is pointless.

A system for telling the OS that the developer anticipates data being shared and super hot will mostly be lied to (by accident or on purpose).

There are the edge cases: database servers, HPC, etc., where you believe the system has a sole occupant that can predict its loading.

But libnuma and the underlying ACPI SRAT/SLIT/HMAT tables are a pretty good fit for those use cases.
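
For example, the SLIT-derived node distances are already queryable through libnuma without inventing any new interface (sketch; link with -lnuma):

    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0)
            return 1;
        int n = numa_num_configured_nodes();
        /* Print the ACPI SLIT-derived distance matrix; lower is closer,
           and 10 conventionally means "local". */
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                printf("%4d", numa_distance(i, j));
            printf("\n");
        }
        return 0;
    }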

replies(1): >>42765390 #
5. Salgat ◴[] No.42765390{4}[source]
If you lie about the nature of your application, you'll only hurt your own performance in this scheme. You're not telling the OS which cores to run on; you're simply giving hints about how the program behaves. It's no different from telling the threadpool manager how many threads to create or whether a thread is long-lived. It's a platform-agnostic hint to help performance. And remember, this is all optional, just like the threadpool example that already exists in most major languages. Are you going to argue that programs shouldn't have access to the CPU's core count either? By your logic they'd just shoot themselves in the foot with that too.
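
(And that core-count information is already exposed in a portable-enough way; a trivial sketch:)

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Online logical CPU count: the coarse, portable fact a
           threadpool already sizes itself from. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("%ld logical CPUs online\n", n);
        return 0;
    }
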
replies(1): >>42766086 #
6. mlyle ◴[] No.42766086{5}[source]
Again, there are already explicit ways for programs to exercise fine control: this stuff is declared in ACPI, exposed by libnuma, and higher-level shims exist over it. But generally you want to know both how the entire machine is being used and pretty detailed information about working-set sizes before attempting this.

Most things that have tried to set affinities have ended up screwing it up.

There's no need to put an easier user interface on the footgun or to make the footgun cross-platform. These interfaces provide opportunities for small wins (generally <5%) and big losses. If you're in a supercomputing center or a hyperscaler running your own app, this is worth it; if you're writing a DBMS that will run on tens of thousands of dedicated machines, it may be worth it. But usually you don't understand the way your program will be used well enough to know whether this is a win.

replies(1): >>42772218 #
7. Salgat ◴[] No.42772218{6}[source]
In the context of the future of heterogeneous computing, where your standard PC will have thousands of cores of various capabilities and localities, I very much disagree.
replies(1): >>42772796 #
8. mlyle ◴[] No.42772796{7}[source]
> where your standard pc will have thousands of cores

Thousands of non-GPU cores, intended to run normal tasks? I doubt it.

Thousands of special-purpose cores running different programs, managing power, networks, RGB lighting, and so on? Maybe, but those don't really benefit from this.

Thousands of cores including GPU cores? What you're talking about in labelling locality isn't sufficient to address this problem, and isn't really even a significant step towards its solution.

replies(1): >>42773976 #
9. Salgat ◴[] No.42773976{8}[source]
CPUs are trending towards heterogeneous many-core designs. 16 cores was considered server-exclusive only a decade ago; now we're at a heterogeneous 24 cores on an Intel 14900K. The biggest limit right now is on the software side, hence my original comment. I wouldn't be surprised if someday the CPU and GPU are combined to overcome the memory wall, with many different types of specialized cores depending on the use case.
replies(1): >>42777206 #
10. mlyle ◴[] No.42777206{9}[source]
The software side is limited, somewhat intrinsically (there tend to be a lot of things we want to do in order, so Amdahl's law wins).

And even when you aren't intrinsically limited by that, optimal placement doesn't reduce contention that much (assuming you're not ping-ponging a single cache line every operation or something dumb like that).

But the hardware side is limited too: we're not getting more transistors that quickly anymore, and we don't want cores much smaller than an Intel E core. Even if we stack in 3D, all that net wafer area is not cheap and isn't getting cheaper quickly.