
128 points ksec | 1 comment | source
dragontamer ◴[] No.42751521[source]
The triple decoder is one unique feature. The fact that Intel managed to line them up so that small loops achieve an effective 9-wide instruction issue is basically miraculous, IMO. Very well done.

Another unique feature is the L2 shared between 4 cores. This means that thread communication across those 4 cores has much lower latency.

I've had lots of debates with people online about this design vs. Hyperthreading. The overall discovery from Intel seems to be that highly threaded tasks use fewer resources (cache, ROB entries, etc.).

Big cores (P-cores or AMD Zen5) can obviously split into 2 hyperthreads, but what if that division is still too coarse? E-cores give you 4 threads of support in roughly the same space as 1 P-core.

This is because the L2 cache is shared/consolidated, and the other resources (ROB entries, register files, etc.) are all just so much smaller on the E-core.

It's an interesting design. I'd still think that growing the cores to 4-way SMT (like Xeon Phi) or 8-way SMT (POWER10) would be a more conventional way to split up resources, though. But obviously I don't work at Intel and can't make these kinds of decisions.

replies(8): >>42751667 #>>42751930 #>>42752001 #>>42752140 #>>42752196 #>>42752200 #>>42753025 #>>42753142 #
01HNNWZ0MV43FF ◴[] No.42751667[source]
> It seems like the overall discovery from Intel is that highly threaded tasks use less resources (cache, ROPs, etc. etc).

Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async

replies(1): >>42751756 #
dragontamer ◴[] No.42751756[source]
Not... quite. I think you've got the cause and effect backwards.

Programmers who happen to write multithreaded programs don't need powerful cores; they want more cores. A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

Programmers who happen to write powerful single-threaded programs need powerful cores. For example, AMD's "X3D" line of CPUs famously has 96MB of L3 cache, and video games running on these very powerful cores get much better performance.

It's not "programmers should change their code to fit the machine." From Intel's perspective, CPU designers should change their core designs to match the different kinds of programmers. Single-threaded (or low-thread) programmers, largely represented by video game programmers, want P-cores... but not very many of them.

Multithreaded programmers, represented by graphics and a few other communities, want E-cores. Splitting a P-core into "only" 2 threads is not sufficient; they want 4x or even 8x more cores. Because there are multiple communities of programmers out there, dedicating design teams to creating entirely different cores is a worthwhile endeavor.

--------

> Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async

Power-efficiency is going to be incredibly difficult moving forward.

It should be noted that E-cores are not very power-efficient, though. They're area-efficient, i.e. cheaper for Intel to make. Intel can sell 4x as many E-cores for roughly the same price/area as 1x P-core.

E-cores are cost-efficient cores. I think they happen to use slightly less power, but I'm not convinced that power efficiency is their particular design goal.

If your code benefits from cache (i.e. big cores), it's probable that the lowest-power option is to run on large caches (P-cores, Zen5, or Zen5 X3D). Communicating with RAM always costs more power than just communicating with caches, after all.

If your code does NOT benefit from cache (e.g. Blender regularly has 100GB+ scenes for complex movies), then all of those extra resources on P-cores are useless: nothing fits anyway, and the core spends almost all of its time waiting on RAM. So the E-core will be more power-efficient in this case.

replies(3): >>42752304 #>>42752572 #>>42753201 #
rasz ◴[] No.42753201[source]
>A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

Nah, a Blender programmer will prefer one core with AVX-512 over 4 without it.

replies(2): >>42753399 #>>42755005 #
dragontamer ◴[] No.42753399{3}[source]
I mean, eventually yeah.

I like Zen5 as much as the next guy, but it should be noted that even the most recent version of Blender is AVX (256-bit) only. That means E-cores remain the optimal cores for a lot of Blender workloads.

Hopefully AMD Zen5's AVX-512 becomes more popular. Maybe it will as Intel rolls out AVX10 (a somewhat compatible instruction set).

replies(1): >>42753861 #
adgjlsfhk1 ◴[] No.42753861{4}[source]
Would Blender benefit from the parts of AVX-512 other than the width? I would think the approximate-sqrt instructions might be useful.
replies(1): >>42753881 #
dragontamer ◴[] No.42753881{5}[source]
AVX-512 is one of the best instruction sets I've seen. No joke.

There are all kinds of ways AVX-512 could help Blender. But those ways are incompatible with older AVX2 or SSE code. The question is whether Blender is willing to support SSE, AVX, and AVX-512 code paths; each new code path means more maintenance and more effort.

AVX-512 has more registers: not just 32x 512-bit registers (AVX normally has 16x 256-bit registers), but also the kmask registers (64-bit registers that take over the boolean logic that used to be done in the 256-bit vector registers). This alone should give the compiler far more optimizations to find automatically.

There are also VPCOMPRESSB and VPEXPANDB, conflict detection, and other instructions that make new SIMD data structures far more efficient to implement. But that requires a deep understanding that very few programmers have yet.