
548 points by nsagent | 1 comment
zdw No.44566065
How does this work when one of the key features of MLX is using a unified memory architecture? (see bullets on repo readme: https://github.com/ml-explore/mlx )

I would think that bringing that to all UMA APUs (of any vendor) would be interesting, but discrete GPUs would definitely need a different approach?

edit: reading the PR comments, it appears that CUDA supports a UMA API directly, and will transparently copy as needed.
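For reference, a minimal sketch of what CUDA's managed (unified) memory API looks like: cudaMallocManaged returns a single pointer valid on both host and device, and the driver migrates pages on demand. This is an illustration of the API, not code from the PR.

    // Managed memory sketch: one pointer, pages migrate on first touch.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));   // visible to CPU and GPU
        for (int i = 0; i < n; i++) x[i] = 1.0f;    // touched on the host first
        scale<<<(n + 255) / 256, 256>>>(x, n);      // pages fault over to the GPU
        cudaDeviceSynchronize();
        printf("%f\n", x[0]);                       // pages fault back to the host
        cudaFree(x);
        return 0;
    }

On a discrete GPU those page migrations cross PCIe, which is where the stalls discussed in the reply below come from.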

replies(1): >>44567302 #
1. freeone3000 ◴[] No.44567302[source]
Eh yes but from my experience its lack of prefetch lends to significant memory stalls waiting for the copy. It might be suitable if your entire dataset fits in VRAM after doing a “manual prefetch” but it killed performance for my application (ML training) so hard that we actually got time to move to streaming loads.