
1311 points by msoad | 5 comments
jart No.35393615
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
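
For anyone wondering what the change does mechanically: it loads the weights via mmap() instead of reading them into freshly allocated buffers. A rough sketch of that general technique follows (this is not the actual llama.cpp loader, and the file name is hypothetical):

    /* Minimal sketch of mmap()-based weight loading, not the llama.cpp code. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "ggml-model-f16.bin";   /* hypothetical weights file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* PROT_READ + MAP_SHARED: nothing is copied up front; pages come
         * straight from the OS page cache and fault in only when tensors are
         * actually touched, so a warm cache makes "loading" near-instant and
         * the pages are shared between processes mapping the same file. */
        void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }

        printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);

        munmap(weights, st.st_size);
        close(fd);
        return 0;
    }
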
1. nynx No.35393868
Why is it behaving sparsely? There are only dense operations, right?
2. w1nk No.35394105
I also have this question; yes, it should be. The forward pass should require accessing all of the weights, AFAIK.
5. HarHarVeryFunny No.35399718
From what I've read, there's no evidence it's "behaving sparsely". That was just offered as a possible explanation for why it might not be loading all the weights, but it makes no sense in terms of the model: it's going to be using all the weights.

Another suggestion is that not all of the word/token embedding table is being used, which would be a function of the input used to test, but that would be easy enough to disprove, since there would then be different memory usage for different inputs.
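
For what it's worth, a crude way to run that comparison on Linux is to sample the process's VmRSS after evaluating different prompts. A sketch only; the helper name is my own, not anything in llama.cpp:

    /* Read the process's resident set size (KiB) from /proc/self/status. */
    #include <stdio.h>
    #include <string.h>

    long vm_rss_kib(void) {
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) return -1;
        char line[256];
        long kib = -1;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "VmRSS:", 6) == 0) {
                sscanf(line + 6, "%ld", &kib);
                break;
            }
        }
        fclose(f);
        return kib;
    }

    /* Usage idea: call vm_rss_kib() after evaluating a short prompt and again
     * after a prompt covering many more distinct tokens; identical numbers
     * would argue against the "unused embedding rows" explanation. */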

It seems possible the reported memory usage is lower than reality, if that's how mmap/top accounting works. In any case it seems a good use of mmap, especially since for a multi-layer model the layer weights are used sequentially, so paged load-on-demand will work relatively well even in a low-memory situation.
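
One direct way to settle that would be to ask the kernel how much of the mapping is actually resident, rather than trusting top. A sketch (mine, not anything in llama.cpp) using mincore(), which reports one residency byte per page of a mapping:

    /* Measure how much of an mmap()ed weights region is resident in RAM. */
    #define _DEFAULT_SOURCE          /* for mincore() on glibc */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void report_residency(void *addr, size_t length) {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (length + (size_t)page - 1) / (size_t)page;
        unsigned char *vec = malloc(npages);   /* one status byte per page */
        if (!vec) return;

        /* addr must be the page-aligned start of the mapping, which mmap()
         * guarantees for the pointer it returned. Low bit = page resident. */
        if (mincore(addr, length, vec) == 0) {
            size_t resident = 0;
            for (size_t i = 0; i < npages; i++) resident += vec[i] & 1;
            printf("%zu of %zu pages resident (%.1f%%)\n",
                   resident, npages, 100.0 * resident / (double)npages);
        }
        free(vec);
    }

    /* Usage idea: call this on the mapped weights right after a forward pass.
     * If far fewer than 100% of the pages are resident, the model really isn't
     * touching all the weights; if essentially all are resident, the low
     * numbers in top are just an accounting artifact. */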