
548 points by nsagent | 2 comments
albertzeyer ◴[] No.44566290[source]
This is exciting. So this is using CUDA's unified memory? I wonder how well that works. Is the behavior of unified memory in CUDA actually the same as on Apple silicon? On Apple silicon, as I understand it, memory is shared between the GPU and CPU anyway. But on most CUDA hardware that is not the case. So when you have some tensor on the CPU, how does it end up on the GPU? That needs a copy somehow. Or is this all hidden by CUDA?
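
To make the question concrete: with CUDA's managed memory, the copy is indeed hidden. A minimal sketch, assuming the standard CUDA runtime API, where one pointer is valid on both host and device and the driver migrates pages behind the scenes:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *data = nullptr;

        // One allocation, one pointer, usable from both CPU and GPU.
        // On a discrete GPU the driver migrates pages over PCIe/NVLink
        // on first touch; on physically unified hardware nothing moves.
        cudaMallocManaged((void **)&data, n * sizeof(float));

        for (int i = 0; i < n; ++i) data[i] = 1.0f;      // touched on CPU

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // touched on GPU
        cudaDeviceSynchronize();                         // wait before CPU reads

        printf("data[0] = %f\n", data[0]);               // back on CPU, no memcpy
        cudaFree(data);
        return 0;
    }

So on Apple silicon (and other truly shared-memory parts) the same code maps onto genuinely shared DRAM, while on a discrete card the runtime pages data back and forth for you.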
replies(3): >>44566325 #>>44566412 #>>44571076 #
MBCook ◴[] No.44566325[source]
This is my guess, but does the higher-end hardware they sell, like the server-rack AI systems, perhaps have unified memory?

I know standard GPUs don’t.

The patch suggested one of the reasons for it was to make it easy to develop on a Mac and run on a supercomputer. So the hardware with unified memory might be in that class.

replies(4): >>44566356 #>>44566370 #>>44566408 #>>44567318 #
1. ajuhasz ◴[] No.44566370[source]
The Jetsons[1] have unified memory[2].

[1] https://www.nvidia.com/en-us/autonomous-machines/embedded-sy... [2] https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s...
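
For what it's worth, on Jetson-class hardware even plain mapped ("zero-copy") host memory is genuinely zero-copy, since the SoC DRAM is physically shared. A minimal sketch, assuming the standard CUDA runtime API; on a discrete card the same calls would instead make the GPU reach across PCIe:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void increment(int *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;
    }

    int main() {
        const int n = 256;
        int *h = nullptr, *d = nullptr;

        cudaSetDeviceFlags(cudaDeviceMapHost);        // allow host-mapped buffers
        cudaHostAlloc((void **)&h, n * sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d, h, 0);  // device alias of the same pages

        for (int i = 0; i < n; ++i) h[i] = i;         // written by the CPU
        increment<<<1, n>>>(d, n);                    // updated in place by the GPU
        cudaDeviceSynchronize();

        printf("h[0] = %d\n", h[0]);                  // prints 1: same physical memory
        cudaFreeHost(h);
        return 0;
    }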

replies(1): >>44566586 #
2. tonyarkles ◴[] No.44566586[source]
They sure do, and it's pretty amazing. One iteration of a vision system I worked on got frames from a camera over a Mellanox NIC with RDMA support (Rivermax), preprocessed the images with CUDA, and ran inference on them with TensorRT; the first time a single byte of the inference pipeline touched the CPU was when we consumed the output.
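
A rough sketch of that kind of GPU-resident pipeline, with rdma_receive_into() and run_inference() as hypothetical stand-ins for the Rivermax and TensorRT calls (omitted here), so the only device-to-host copy is the final result:

    #include <cuda_runtime.h>

    // Hypothetical placeholders: the real code would use Rivermax for the
    // GPUDirect RDMA receive and TensorRT for inference.
    void rdma_receive_into(unsigned char *dev_buf, size_t bytes) { /* NIC -> GPU DMA */ }
    void run_inference(const float *input, float *output) { /* TensorRT enqueue */ }

    __global__ void preprocess(const unsigned char *raw, float *tensor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tensor[i] = raw[i] / 255.0f;       // e.g. normalize pixels
    }

    int main() {
        const int n = 1920 * 1080 * 3;                // one RGB frame
        unsigned char *frame; float *tensor, *result_d;
        float result_h[1000];                         // the only CPU-side data

        cudaMalloc((void **)&frame, n);               // DMA target for the NIC
        cudaMalloc((void **)&tensor, n * sizeof(float));
        cudaMalloc((void **)&result_d, sizeof(result_h));

        rdma_receive_into(frame, n);                  // frame lands in GPU memory
        preprocess<<<(n + 255) / 256, 256>>>(frame, tensor, n);
        run_inference(tensor, result_d);              // still on the GPU
        cudaMemcpy(result_h, result_d, sizeof(result_h),
                   cudaMemcpyDeviceToHost);           // first byte to touch the CPU
        return 0;
    }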
They sure do and it's pretty amazing. One iteration of a vision system I worked on got frames from a camera over a Mellanox NIC that supports RDMA (Rivermax), preprocessed the images using CUDA, did inference on them with TensorRT, and the first time a single byte of the inference pipeline hit the CPU itself was when we were consuming the output.