548 points nsagent | 1 comments
albertzeyer ◴[] No.44566290[source]
This is exciting. So this uses CUDA's unified memory? I wonder how well that works. Does unified memory in CUDA actually behave the same as on Apple silicon? On Apple silicon, as I understand it, memory is shared between the GPU and CPU anyway. But that is not the case for CUDA. So when you have a tensor on the CPU, how does it end up on the GPU? That needs a copy somehow. Or is this all hidden by CUDA?
replies(3): >>44566325 #>>44566412 #>>44571076 #
zcbenz ◴[] No.44566412[source]
In the absence of hardware unified memory, CUDA will automatically copy data between CPU/GPU when there are page faults.
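
For readers who have not used managed memory, here is a minimal sketch of what that looks like from the program's side. It is not taken from the project under discussion; the kernel `add_one` and the array size are made up for illustration, and it assumes a GPU and driver that support on-demand page migration (on older GPUs the runtime moves managed data in bulk around kernel launches instead of on faults).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1.0f to every element.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation, one pointer, usable from both host and device code.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }

    // Host writes: the touched pages end up resident in CPU memory.
    for (int i = 0; i < n; ++i) data[i] = 0.0f;

    // No explicit cudaMemcpy. When the kernel touches a page that is not
    // resident on the GPU, the access faults and the driver migrates it.
    add_one<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // Host reads: pages fault back to the CPU side the same way.
    std::printf("data[0] = %f\n", data[0]);

    cudaFree(data);
    return 0;
}
```

The copy is still there; it is just triggered by faults at page granularity instead of by an explicit call.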
replies(4): >>44566633 #>>44566987 #>>44567184 #>>44567252 #
saagarjha ◴[] No.44567184[source]
This seems like it would be slow…
replies(1): >>44567307 #
1. freeone3000 ◴[] No.44567307[source]
Matches my experience. It's memory stalls all over the place, made worse by the fact that (on 12.3 at least) there wasn't even a prefetcher.
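
One way to sidestep fault-driven stalls is to request the migration explicitly before the kernel runs. The sketch below assumes the CUDA 12.x form of `cudaMemPrefetchAsync` (pointer, byte count, destination device, stream); newer toolkits change this signature, so check the runtime API reference for your version. The kernel `scale` and the sizes are hypothetical, and the call only succeeds on devices that report concurrent managed access.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiplies every element by factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    int device = 0;
    cudaGetDevice(&device);

    float *data = nullptr;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Assumption: CUDA 12.x signature (ptr, bytes, dstDevice, stream).
    // Moves the whole range to the GPU up front instead of page by page
    // on first touch inside the kernel.
    cudaError_t err = cudaMemPrefetchAsync(data, bytes, device, 0);
    if (err != cudaSuccess) {
        // Not fatal: the kernel still works, it just falls back to faults.
        std::fprintf(stderr, "prefetch to GPU skipped: %s\n",
                     cudaGetErrorString(err));
    }

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);

    // Bring the data back before the host reads it; cudaCpuDeviceId
    // names host memory as the destination.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    std::printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

Whether this removes the stalls described above depends on the hardware, driver, and access pattern; it only helps when you know ahead of time which ranges a kernel will touch.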