How does this work when one of the key features of MLX is using a unified memory architecture? (see bullets on repo readme: https://github.com/ml-explore/mlx )
I would think that bringing that to all UMA APUs (of any vendor) would be interesting, but discrete GPUs would definitely need a different approach?
edit: reading the PR comments, it appears that CUDA supports a UMA API directly, and will transparently copy as needed.
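For anyone curious what that looks like in practice, here is a minimal sketch of CUDA's managed ("unified") memory API (my own illustration, not code from the PR): `cudaMallocManaged` hands out a single pointer usable from both CPU and GPU, and the driver migrates/copies pages on demand, so even a discrete GPU behind PCIe can present a UMA-like programming model.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple kernel that scales an array in place.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // One allocation visible to both host and device; the driver
    // migrates pages between CPU and GPU memory as they are touched.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU uses the same pointer
    cudaDeviceSynchronize();                         // wait before the CPU reads again

    printf("data[0] = %f\n", data[0]);               // prints 2.000000
    cudaFree(data);
    return 0;
}
```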