
1311 points msoad | 3 comments
jart ◴[] No.35393615[source]
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
replies(24): >>35393868 #>>35393942 #>>35394089 #>>35394097 #>>35394107 #>>35394203 #>>35394208 #>>35394244 #>>35394259 #>>35394288 #>>35394408 #>>35394881 #>>35395091 #>>35395249 #>>35395858 #>>35395995 #>>35397318 #>>35397499 #>>35398037 #>>35398083 #>>35398427 #>>35402974 #>>35403334 #>>35468946 #
thomastjeffery ◴[] No.35394408[source]
How diverse is the training corpus?
replies(1): >>35394827 #
dchest ◴[] No.35394827[source]
https://arxiv.org/abs/2302.13971
replies(1): >>35394950 #
1. thomastjeffery ◴[] No.35394950[source]
Is there any measure, not of size or token count, but of the diversity of the text's content?

Did that metric meaningfully change when the amount of required memory dropped?

If the diversity were lower, I would expect fewer patterns to be modeled from the text. If that is the case, then the resulting model itself would be smaller, both during and after training.

replies(1): >>35395333 #
2. actually_a_dog ◴[] No.35395333[source]
By "diversity," do you mean something like "entropy?" Like maybe

    H_s := -\sum_{x \in X_s} p(x) \log p(x)
where X_s := all s-grams from the training set? That seems like it would quickly become hard, if not impossible, to actually compute. And even if you could, what would it tell you?

Or, wait... are you referring to running such an analysis on the output of the model? Yeah, that might prove interesting....
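For small s it's actually cheap to compute on a token stream. Here's a minimal sketch (the tokenizer and corpora are made up for illustration; real LLM training sets would need streaming counts):

```python
from collections import Counter
from math import log2

def sgram_entropy(tokens, s):
    """Shannon entropy (in bits) of the empirical s-gram distribution:
    H_s = -sum over distinct s-grams x of p(x) * log2(p(x)),
    where p(x) is the s-gram's relative frequency in the stream."""
    grams = [tuple(tokens[i:i + s]) for i in range(len(tokens) - s + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A repetitive stream has low s-gram entropy; a varied one is near log2(N).
repetitive = ["the", "cat", "sat"] * 100
varied = [f"tok{i}" for i in range(300)]   # every token distinct
print(sgram_entropy(repetitive, 2))  # ~1.58 bits: only 3 distinct bigrams
print(sgram_entropy(varied, 2))      # ~8.22 bits: every bigram unique
```

The catch is the denominator of the estimate: for large s, most s-grams occur once, so the empirical entropy saturates at log2(number of windows) regardless of the text's actual structure.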

replies(1): >>35415417 #
3. thomastjeffery ◴[] No.35415417[source]
I'm really just speculating here.

Because the text we write is not uniformly distributed random noise, what we encode into it (by writing) carries entropy.

Because LLMs model text with inference, they model all of the entropy that is present.

That would mean the resulting model size would be a measure of entropy (the sum of patterns) divided by repetition (recurring patterns). In this count, I would treat each unique token on its own as an instance of the identity pattern.

So to answer both questions: yes.
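That "entropy divided by repetition" intuition can be sanity-checked with a toy experiment, using a general-purpose compressor as a crude stand-in for anything that exploits recurring patterns (the corpora here are made up; this says nothing about actual LLM weights):

```python
import random
import string
import zlib

def compressed_ratio(text: str) -> float:
    """Compressed size / raw size: a rough proxy for entropy per byte.
    Text full of recurring patterns compresses to a small fraction of
    its raw size; varied text has little structure to exploit."""
    raw = text.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

repetitive = "the cat sat on the mat. " * 100          # one pattern, repeated
random.seed(0)                                          # deterministic "noise"
varied = "".join(random.choices(string.ascii_lowercase + " ", k=2400))

print(compressed_ratio(repetitive))  # small: repetition compresses away
print(compressed_ratio(varied))      # large: near-random resists compression
```

Both strings are 2400 characters, but the repetitive one shrinks to a few percent of its size while the near-random one barely compresses, which is the sense in which "size after modeling" tracks entropy rather than raw length.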