How does this work when one of the key features of MLX is using a unified memory architecture? (see bullets on repo readme: https://github.com/ml-explore/mlx )
I would think that bringing that to all UMA APUs (of any vendor) would be interesting, but discrete GPUs would definitely need a different approach?
edit: reading the PR comments, it appears that CUDA supports a UMA API directly, and will transparently copy as needed.
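For anyone curious what that looks like in practice, here is a minimal sketch of CUDA's managed ("unified") memory API (my own illustration, not code from the PR): `cudaMallocManaged` hands out a single pointer usable from both CPU and GPU, and the driver migrates/copies pages on demand, so even a discrete GPU behind PCIe can present a UMA-like programming model.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple kernel that scales an array in place.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // One allocation visible to both host and device; the driver
    // migrates pages between CPU and GPU memory as they are touched.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU uses the same pointer
    cudaDeviceSynchronize();                         // wait before the CPU reads again

    printf("data[0] = %f\n", data[0]);               // prints 2.000000
    cudaFree(data);
    return 0;
}
```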