
172 points by marban | 2 comments
InTheArena ◴[] No.40051885[source]
While everyone has focused on the power efficiency of Apple's M-series chips, one thing that has been very interesting is how powerful the unified memory model (with the memory on-package with the CPU) and its large memory bandwidth actually are. Hence a lot of people in the local LLaMA community are really going after high-memory Macs.
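
Rough napkin math for why bandwidth is the number people chase (the figures below are illustrative assumptions, not measurements): token-by-token decoding streams essentially all of the weights from memory for every generated token, so bandwidth divided by model size gives an approximate tokens-per-second ceiling.

    // Rough ceiling for single-stream decode: memory bandwidth / bytes per token.
    #include <cstdio>

    int main() {
        double bandwidth_gb_s = 800.0;  // assumed: M2 Ultra-class memory bandwidth
        double model_gb       = 35.0;   // assumed: ~70B params at ~4-bit quantization
        std::printf("decode ceiling: ~%.0f tokens/s\n", bandwidth_gb_s / model_gb);
        return 0;
    }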

It's great to see NPUs here with the new Ryzen cores - but I wonder how effective they will be with off-die memory versus the Apple approach.

That said, it's nothing but great to see these capabilities in something other than an expensive NVIDIA card. Local NPUs may really help with deploying more inferencing capabilities at the edge.

Edited - sorry, meant on-package.

v1sea ◴[] No.40053126[source]
edit: I was wrong.
smallmancontrov ◴[] No.40053618[source]
> low latency results each frame

What does that do to your utilization?

I've been out of this space for a while, but in game dev any backwards information flow (GPU->CPU) completely murdered performance. "Whatever you do, don't stall the pipeline." Instant 50%-90% performance hit. Even if you had to awkwardly duplicate calculations on the CPU, it was almost always worth it, and not by a small amount. The caveat to "almost" was that if you were willing to wait 2-4 frames to get data back, you could do that without stalling the pipeline.
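
Something like the pattern below is what I mean (sketched with CUDA streams and events rather than a graphics API; the buffer names and the 3-frame lag are just illustrative): queue each frame's readback asynchronously and only consume the copy that was issued a few frames earlier, so the pipeline never drains.

    // Sketch of the "wait a few frames" readback pattern, using CUDA streams and
    // events for illustration; the same idea applies to readback buffers in
    // graphics APIs. Buffer count / lag of 3 is made up.
    #include <cuda_runtime.h>
    #include <cstdio>

    const int kLag   = 3;         // frames of delay before the CPU reads a result
    const int kCount = 1 << 20;   // floats produced per frame (illustrative)

    __global__ void simulate(float* out, int n, int frame) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = i * 0.001f + frame;   // stand-in for the real work
    }

    int main() {
        float* d_buf;
        cudaMalloc(&d_buf, kCount * sizeof(float));

        // One pinned host buffer and one event per in-flight frame.
        float*      h_buf[kLag];
        cudaEvent_t done[kLag];
        for (int i = 0; i < kLag; ++i) {
            cudaMallocHost(&h_buf[i], kCount * sizeof(float));
            cudaEventCreate(&done[i]);
        }

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        for (int frame = 0; frame < 100; ++frame) {
            int slot = frame % kLag;

            // Consume the result queued kLag frames ago. By now it has almost
            // always finished, so this wait is effectively free and the GPU
            // pipeline never empties out.
            if (frame >= kLag) {
                cudaEventSynchronize(done[slot]);
                printf("frame %d sees result from frame %d: %f\n",
                       frame, frame - kLag, h_buf[slot][0]);
            }

            // Queue this frame's work and its readback without waiting on it.
            simulate<<<(kCount + 255) / 256, 256, 0, stream>>>(d_buf, kCount, frame);
            cudaMemcpyAsync(h_buf[slot], d_buf, kCount * sizeof(float),
                            cudaMemcpyDeviceToHost, stream);
            cudaEventRecord(done[slot], stream);
        }
        cudaStreamSynchronize(stream);
        return 0;
    }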

I didn't think this was a memory architecture thing; I thought it was a data dependency thing. If you have to finish all calculations before the readback, and you have to read back before starting new calculations, the time for all cores to empty out and fill back up is guaranteed to be dead time, regardless of whether the job is rendering polygons, finite element calculations, or neural nets.

Does shared memory actually change this somehow? Or does it just make it more convenient to shoot yourself in the foot?

EDIT: or is the difference that HPC operates in a regime where long "frame time" dwarfs the pipeline empty/refill "dead time"?

1. v1sea ◴[] No.40053853[source]
It was probably because my workloads were relatively small that I could get away with a 90 Hz read on the CPU side. I'll need to dig deeper into it. The metrics I was seeing showed 200-300 microseconds of GPU time for the physics calculations, with the CPU reading from that buffer within the same frame. Maybe I'm wrong, need to test more.
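
If it helps, this is roughly how I'd split the measurement next time (a CUDA-events sketch; the kernel and the sizes are made up): time the physics step on the GPU with events, and separately time the blocking same-frame copy on the CPU.

    // Sketch of measuring the two pieces separately: the GPU time of the
    // physics step, and the wall-clock cost of a blocking same-frame readback.
    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    __global__ void physics_step(float* state, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) state[i] += 0.016f;   // placeholder for the real integration
    }

    int main() {
        const int n = 64 * 1024;         // deliberately small workload
        float* d_state; cudaMalloc(&d_state, n * sizeof(float));
        float* h_state; cudaMallocHost(&h_state, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        physics_step<<<(n + 255) / 256, 256>>>(d_state, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, start, stop);   // kernel time only

        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpy(h_state, d_state, n * sizeof(float), cudaMemcpyDeviceToHost);
        auto t1 = std::chrono::steady_clock::now();   // blocking read, same frame

        printf("physics kernel: %.3f ms, same-frame readback: %.3f ms\n", gpu_ms,
               std::chrono::duration<double, std::milli>(t1 - t0).count());
        return 0;
    }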
2. smallmancontrov ◴[] No.40054488[source]
> edit: I was wrong.

If the only reason you were "wrong" was because you intuitively understood that it wasn't worth a large amount of valuable human time to save a small amount of cheap machine time, you were right in the way that matters (time allocation) and should keep it up :)