
128 points by ksec | 3 comments
dragontamer No.42751521
The triple decoder is one unique feature. The fact that Intel managed to get the decoders lined up on small loops to deliver an effective 9-wide instruction issue is basically miraculous IMO. Very well done.

Another unique feature is the L2 shared between 4 cores. This means that thread communication across those 4 cores has much lower latency.
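
To put a rough number on "lower latency", here's the kind of micro-benchmark I mean: a minimal sketch (Linux-only, untested here; core IDs 0 and 1 are just an assumption, check your actual topology with lscpu) that pins two threads to two cores and ping-pongs an atomic between them. Run it with both threads inside one E-core cluster, then across clusters, and compare the round-trip times.

    // Ping-pong an atomic "turn" flag between two pinned threads and report
    // the average round-trip time. Compile with g++ on Linux (pthread_setaffinity_np
    // is a glibc extension). Core IDs below are assumptions about the topology.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <pthread.h>
    #include <sched.h>

    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        constexpr int kIters = 1'000'000;
        std::atomic<int> turn{0};   // 0 = main thread's turn, 1 = helper's turn

        std::thread pong([&] {
            pin_to_core(1);         // assumed: shares an L2 cluster with core 0
            for (int i = 0; i < kIters; ++i) {
                while (turn.load(std::memory_order_acquire) != 1) { /* spin */ }
                turn.store(0, std::memory_order_release);
            }
        });

        pin_to_core(0);
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; ++i) {
            turn.store(1, std::memory_order_release);
            while (turn.load(std::memory_order_acquire) != 0) { /* spin */ }
        }
        auto t1 = std::chrono::steady_clock::now();
        pong.join();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("avg round trip: %.1f ns\n", ns / kIters);
    }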

I've had lots of debates with people online about this design vs. hyperthreading. It seems like the overall discovery from Intel is that highly threaded tasks use fewer resources (cache, reorder-buffer entries, etc.).

Big cores (Intel P-cores or AMD Zen 5) can obviously split into 2 hyperthreads, but what if that division is still too coarse? An E-core cluster gives you 4 threads of support in roughly the same space as 1 P-core.

This is because the L2 cache is shared/consolidated, and other resources (reorder buffers, register files, etc.) are all so much smaller on the E-core.

It's an interesting design. I'd still think that growing the cores to 4-way SMT (like Xeon Phi) or 8-way SMT (POWER10) would be a more conventional way to split up resources, though. But obviously I don't work at Intel and can't make these kinds of decisions.

01HNNWZ0MV43FF No.42751667
> It seems like the overall discovery from Intel is that highly threaded tasks use fewer resources (cache, reorder buffers, etc.)

Does that mean that if I take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async.

dragontamer No.42751756
Not... quite. I think you've got the cause-and-effect backwards.

Programmers who happen to write multi-threaded programs don't need powerful cores; they want more cores. A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

Programmers who happen to write demanding single-threaded programs need powerful cores. For example, AMD's "X3D" line of CPUs famously has 96MB of L3 cache, and video games running on these very powerful cores get much better performance.

Its not "Programmers should change their code to fit the machine". From Intel's perspective, CPU designers should change their core designs to match the different kinds of programmers. Single-threaded (or low-thread) programmers... largely represented by the Video Game programmers... want P-cores. But not very much of them.

Multithreaded programmers... represented by graphics and a few others... want E-cores. Splitting a P-core into "only" 2 threads is not sufficient; they want 4x or even 8x more cores. Because there are multiple communities of programmers out there, dedicating design teams to creating entirely different cores is a worthwhile endeavor.
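
To make the "many weaker cores" workload shape concrete, here's a minimal sketch of the kind of embarrassingly parallel loop I mean. The Particle struct and the gravity-only update are made up (this is not Blender's actual solver): each thread owns a contiguous slice and never talks to the others, so throughput scales with core count rather than per-core strength.

    // Split one big per-particle update evenly across all hardware threads.
    // No sharing between threads, so 4 slow cores ~= 4x one slow core.
    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Particle { float x, y, z, vx, vy, vz; };

    void step_all(std::vector<Particle>& ps, float dt) {
        const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
        const std::size_t chunk = (ps.size() + n_threads - 1) / n_threads;
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < n_threads; ++t) {
            const std::size_t begin = t * chunk;
            const std::size_t end   = std::min(ps.size(), begin + chunk);
            workers.emplace_back([&ps, begin, end, dt] {
                for (std::size_t i = begin; i < end; ++i) {  // this thread's slice only
                    ps[i].vy -= 9.81f * dt;                  // stand-in "physics": gravity
                    ps[i].x  += ps[i].vx * dt;
                    ps[i].y  += ps[i].vy * dt;
                    ps[i].z  += ps[i].vz * dt;
                }
            });
        }
        for (auto& w : workers) w.join();
    }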

--------

> Does that mean that if I take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async.

Power-efficiency is going to be incredibly difficult moving forward.

It should be noted that E-cores are not especially power-efficient, though. They're area-efficient, i.e., cheaper for Intel to make. Intel can sell 4x as many E-cores for roughly the same price/area as 1x P-core.

E-cores are cost-efficient cores. I think they happen to use slightly less power, but I'm not convinced that power-efficiency is their particular design goal.

If your code benefits from cache (i.e., big cores), it's probable that the lowest-power option is to run on the cores with large caches (P-cores, or Zen 5 / Zen 5 X3D). Communicating with RAM always costs more power than just communicating with caches, after all.

If your code does NOT benefit from cache (e.g., Blender regularly has 100GB+ scenes for complex movies), then all of those extra resources on P-cores are useless, as nothing fits anyway and the core will spend almost all of its time waiting on RAM. So the E-core will be more power-efficient in this case.
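
A toy way to see that split: the same pointer-chasing loop over a working set that fits in L2 versus one far bigger than any cache. The array sizes below are assumptions (tune them to your part's caches), and it allocates a few hundred MB. In the first case a big core's resources actually get used; in the second, any core mostly sits waiting on RAM.

    // Pointer-chase through a random cycle: dependent loads defeat prefetching,
    // so the per-load time is roughly the cache (or RAM) latency.
    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    static std::vector<std::uint32_t> random_cycle(std::size_t n) {
        std::vector<std::uint32_t> perm(n);
        std::iota(perm.begin(), perm.end(), 0u);
        std::shuffle(perm.begin(), perm.end(), std::mt19937{42});
        std::vector<std::uint32_t> next(n);
        for (std::size_t k = 0; k < n; ++k)          // perm order defines one big cycle
            next[perm[k]] = perm[(k + 1) % n];
        return next;
    }

    static double chase(const std::vector<std::uint32_t>& next, std::uint64_t steps) {
        std::uint32_t i = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::uint64_t s = 0; s < steps; ++s) i = next[i];  // dependent loads
        auto t1 = std::chrono::steady_clock::now();
        std::printf("(last index %u)\n", i);  // keep the loop from being optimized away
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    }

    int main() {
        const auto small = random_cycle(64 * 1024);         // ~256 KB: cache-resident
        const auto big   = random_cycle(64 * 1024 * 1024);  // ~256 MB: RAM-bound
        std::printf("small: %.1f ns/load\n", chase(small, 50'000'000));
        std::printf("big:   %.1f ns/load\n", chase(big,   50'000'000));
    }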

seanmcdirmid No.42752572
> A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

Don’t they really want GPU threads for that? You wouldn’t get by with just weaker cores.

dragontamer No.42753392
Blender's cloth-physics data lives in RAM (scenes and models can grow very large, too large for a GPU's memory).

Figuring out which vertices of a physics simulation need to be sent to the GPU would be time, effort, and PCIe traffic spent _NOT_ running the cloth physics.

Furthermore, once all the data is figured out and blocked out in cache, it's... cached. Cloth physics only interacts with a small number of close, nearby objects. Yeah, you _could_ calculate this small portion and send it to the GPU, but the CPU is really good at just traversing trees and keeping the most recently used data in L1, L2, and L3 caches automatically (without any need for special code).
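
For a feel of what "just traversing trees" means here, a sketch of a plain recursive AABB-tree query of the sort a cloth solver uses to find nearby colliders. None of this is Blender's actual data structure; the point is that the hot nodes near the root end up in L1/L2 on their own, with zero special code.

    // Collect every leaf object whose bounding box overlaps the query box.
    #include <vector>

    struct AABB {
        float min[3], max[3];
        bool overlaps(const AABB& o) const {
            for (int a = 0; a < 3; ++a)
                if (max[a] < o.min[a] || o.max[a] < min[a]) return false;
            return true;
        }
    };

    struct Node {
        AABB box;
        int left = -1, right = -1;   // -1 means "no child"
        int object = -1;             // leaf payload: index of a collider
    };

    void query(const std::vector<Node>& tree, int node, const AABB& q,
               std::vector<int>& hits) {
        if (node < 0 || !tree[node].box.overlaps(q)) return;
        if (tree[node].object >= 0) hits.push_back(tree[node].object);
        query(tree, tree[node].left,  q, hits);   // upper levels stay cached
        query(tree, tree[node].right, q, hits);   // across many nearby queries
    }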

All in all, I expect something like cloth physics (a calculation Blender currently does CPU-only) is best done on the CPU. Not because GPUs are bad at this algorithm... but because PCIe transfers are too slow, and cloth physics benefits too much from caching and other CPU features.

It'd be a lot of effort to translate all that code to the GPU, and you likely won't get large gains (the way raytracing / Cycles rendering does from GPU compute).

seanmcdirmid No.42753599
NVIDIA's PhysX has its own cloth-physics abstractions: https://docs.nvidia.com/gameworks/content/gameworkslibrary/p..., so I'm sure it is a thing we do on GPUs already, if only for games. These are old demos anyway:

https://www.youtube.com/watch?v=80vKqJSAmIc

I wonder what the difference is between the cloth physics you are talking about and what NVIDIA has been doing for, I think, more than a decade now? Is it scale? It sounds like, at least, there are alternatives that do it on the GPU, and there are questions about whether Blender will do it on the GPU:

https://blenderartists.org/t/any-plans-to-make-cloth-simulat...

dragontamer No.42753848
Cloth / Hair physics in those games were graphics-only physics.

It could collide with any mesh that was inside the GPU's memory, but those calculations can't work on any information stored in CPU RAM. Well... not efficiently, anyway.

---------

When the Cloth simulator in Blender runs, it generates all kinds of information the CPU needs for other steps. In effect, Blender's cloth physics serves as an input to animation frames, which is all CPU-side information.

Again: I know cloth physics executes on GPUs very well in isolation. But I'd be surprised if Blender's specific cloth physics would ever be efficient on a GPU. Because, as it turns out, the calculations kind of don't matter in the big picture. There are a lot of other things you need to do after those calculations (animations, keyframes, and other such interactions). And if all that information is scattered across 100GB of CPU RAM, it'd be very hard to untangle that data and get it to a GPU (and back).

In a video game PhysX setting, you just display the cloth physics to the screen. In Blender, a 3D animation program, you have to do a lot more with all that information and touch many other data structures.

PCIe is very slow compared to RAM.
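
Rough, back-of-envelope numbers (ballpark assumptions, not measurements): PCIe 4.0 x16 tops out around ~32 GB/s per direction, dual-channel DDR5 is on the order of 60-90 GB/s, and a GPU's own GDDR6/HBM is in the hundreds of GB/s. So just streaming a 100GB scene across PCIe is on the order of 3+ seconds each way before any simulation work happens, and that's ignoring the cost of gathering scattered data into transferable buffers.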