548 points by nsagent | 2 comments
albertzeyer:
This is exciting. So this is using CUDA's unified memory? I wonder how well that works. Does unified memory in CUDA actually behave the same as on Apple silicon? On Apple silicon, as I understand it, memory is physically shared between GPU and CPU anyway, but with CUDA that's not the case. So when you have a tensor on the CPU, how does it end up on the GPU? That needs a copy somehow. Or is this all hidden by CUDA?
zcbenz:
In the absence of hardware unified memory, CUDA will automatically copy data between CPU/GPU when there are page faults.
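For reference, a minimal sketch of what zcbenz describes, using CUDA's managed-memory allocator (cudaMallocManaged). The kernel and sizes are illustrative; the point is that there is one allocation and no explicit cudaMemcpy, with pages migrating on first-touch faults:

    // Managed memory: one allocation visible to both CPU and GPU.
    // Without hardware coherence, pages migrate on page faults.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float)); // no separate host/device buffers

        for (int i = 0; i < n; ++i)        // CPU writes: pages resident on host
            data[i] = 1.0f;

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // GPU touch faults pages to device
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]); // CPU read faults pages back to host
        cudaFree(data);
        return 0;
    }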
fenced_load:
There is also NVLink-C2C support between Nvidia's CPUs and GPUs, which doesn't require any copy: CPU and GPU directly access each other's memory over a cache-coherent bus. IIRC they already have 4 CPU + 4 GPU servers available.
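A sketch of what that looks like in code, assuming a Grace Hopper-class system where the GPU reports cudaDevAttrPageableMemoryAccess: the GPU can dereference an ordinary malloc'd pointer directly, with coherence instead of page migration. The capability check is the important part; on most discrete GPUs this path doesn't exist:

    // Coherent CPU-GPU access (e.g. over NVLink-C2C): the kernel reads and
    // writes a plain host allocation, no CUDA allocator involved.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void inc(int *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;
    }

    int main() {
        int pageable = 0;
        cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);
        if (!pageable) {
            fprintf(stderr, "GPU cannot access pageable host memory directly\n");
            return 1;
        }

        const int n = 1024;
        int *p = (int *)calloc(n, sizeof(int)); // ordinary CPU allocation

        inc<<<(n + 255) / 256, 256>>>(p, n);    // GPU accesses host memory coherently
        cudaDeviceSynchronize();

        printf("p[0] = %d\n", p[0]);
        free(p);
        return 0;
    }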
benreesman:
Yeah, NCCL is a whole world, and it's not even the only thing involved, but IIRC that's the difference between 8xH100 PCIe and 8xH100 SXM5.
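To make the NCCL point concrete, here is a minimal single-process all-reduce sketch across all visible GPUs. NCCL picks the fastest transport it finds at init time (NVLink/NVSwitch on SXM boards, otherwise PCIe), which is where the PCIe-vs-SXM difference shows up; the buffer size and the 8-GPU cap are illustrative:

    // Single-process NCCL all-reduce across all visible GPUs.
    #include <cstdio>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev > 8) ndev = 8;              // cap to the fixed arrays below

        ncclComm_t comms[8];
        int devs[8];
        for (int i = 0; i < ndev; ++i) devs[i] = i;
        ncclCommInitAll(comms, ndev, devs);  // one communicator per GPU

        const size_t count = 1 << 20;
        float *buf[8];
        cudaStream_t streams[8];
        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaMalloc(&buf[i], count * sizeof(float));
            cudaStreamCreate(&streams[i]);
        }

        // Sum buffers across GPUs in place; the group calls let one thread
        // drive all ranks without deadlocking.
        ncclGroupStart();
        for (int i = 0; i < ndev; ++i)
            ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            ncclCommDestroy(comms[i]);
        }
        return 0;
    }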