My guess would be that the model is faulted into memory lazily, page by page (4K or 16K chunks), as the model is used, so only the parts that are actually needed ever get loaded.
The kernel also evicts old pages from the page cache to make room for new ones, especially when the machine is under memory pressure. As with most performance work, this trades inference speed for memory usage, but it's likely faster overall because you don't have to read the entire file from disk up front. Each input takes a different path through the model, so successive inputs keep faulting in more of it.
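To make that concrete, here's a rough C sketch of the mechanism (not any project's real loader; the file name and the offset poked below are just placeholders): mmap() the whole file read-only and let the kernel fault pages in on demand.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "model.gguf"; /* placeholder */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* No I/O happens here; the kernel only sets up the mapping. */
    const unsigned char *model = mmap(NULL, st.st_size, PROT_READ,
                                      MAP_PRIVATE, fd, 0);
    if (model == MAP_FAILED) { perror("mmap"); return 1; }

    /* Reading a byte triggers a page fault that pulls in just that
     * 4K/16K page (plus some readahead), not the whole file. */
    size_t offset = (size_t)(st.st_size / 2);
    printf("byte at offset %zu: %u\n", offset, (unsigned)model[offset]);

    munmap((void *)model, st.st_size);
    close(fd);
    return 0;
}
```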
The cool part is that this memory architecture should work just fine with hardware acceleration, too, as long as the computer has unified memory (anything with an integrated GPU). This approach likely won't be possible with dedicated GPUs/VRAM.
This approach _does_ still work for running a dense model with limited memory; the time/memory savings are just smaller. The GPU doesn't literally multiply every matrix in the file at the same time, so the page cache never needs to hold the entire model at once.
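If you wanted to steer that eviction by hand rather than rely on the kernel's own replacement policy, something like this hypothetical Linux helper (illustrative offsets, not a real file layout) hints which slice of the mapping, say the next layer's weights, is about to be needed and which one can be dropped. For a clean read-only file mapping this is safe: dropped pages are simply refaulted from disk if they're ever touched again.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round an offset down to a page boundary; madvise() wants
 * page-aligned addresses. */
static size_t page_floor(size_t off)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return off - (off % page);
}

/* Hypothetical helper: prefetch the region we're about to read and
 * mark the region we're done with as droppable. */
static void hint_layers(unsigned char *model,
                        size_t next_off, size_t next_len,
                        size_t done_off, size_t done_len)
{
    size_t a = page_floor(next_off);
    madvise(model + a, next_len + (next_off - a), MADV_WILLNEED);

    size_t b = page_floor(done_off);
    madvise(model + b, done_len + (done_off - b), MADV_DONTNEED);
}
```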
I think that's just an accounting thing. Many UNIX variants won't "charge" read-only memory-mapped pages to a process, because they can be shared among many processes and evicted at will.
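You can see that accounting directly with mincore(): map a big file read-only, touch a handful of pages, and only those (plus some readahead) show up as resident, even though the whole file is "mapped". A Linux-flavored sketch, with a placeholder file name:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "model.gguf"; /* placeholder */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    size_t page   = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = ((size_t)st.st_size + page - 1) / page;

    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch the first 16 pages only; everything else stays on disk
     * (readahead may pull in a few extra). */
    volatile unsigned char sink = 0;
    for (size_t i = 0; i < 16 && i < npages; i++)
        sink ^= map[i * page];
    (void)sink;

    /* mincore() fills one byte per page; bit 0 means "resident". */
    unsigned char *vec = malloc(npages);
    size_t resident = 0;
    if (vec && mincore(map, st.st_size, vec) == 0) {
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
    }
    printf("%zu of %zu pages resident\n", resident, npages);

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```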