
228 points nkko | 4 comments
1. mdp2021 ◴[] No.43888295[source]
When the attempt, though, is to have the LLM output an "idea", not just a "next token", selection over the logits vector should break that original idea... If the idea is complete, there should be no need to sample over the logits at all.

The sampling, in this framework, should not happen near the output level ("what will the next spoken word be").
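To make the objection concrete, here is a minimal sketch (with made-up logits for a toy 4-token vocabulary) contrasting greedy argmax decoding with temperature sampling over the logits - the "selection over the logits vector" being discussed:

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Scale logits by temperature, then normalize to a probability distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one decoding step over a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

# Greedy decoding: deterministically take the argmax; no sampling involved.
greedy = max(range(len(logits)), key=lambda i: logits[i])

# Temperature sampling: draw the next token from the softmax distribution,
# so lower-probability tokens can still be selected at the output level.
random.seed(0)
probs = softmax(logits, temperature=1.0)
sampled = random.choices(range(len(logits)), weights=probs)[0]

print(greedy)   # always index 0, the highest logit
print(sampled)  # varies with the random draw
```

The point of contention above is that this per-token randomness intervenes after whatever "idea" the model may hold, word by word.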

replies(1): >>43888329 #
2. minimaxir ◴[] No.43888329[source]
LLMs are trained to maximize the probability of correct guesses for the next token, not "ideas". You cannot define an idea as a training loss objective.
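A small sketch of that training objective, with a made-up output distribution: the loss is per-token cross-entropy, i.e. minus the log-probability assigned to the correct next token, and nothing in it scores a whole "idea":

```python
import math

def next_token_loss(probs, target_index):
    # Cross-entropy at one position: -log p(correct next token).
    return -math.log(probs[target_index])

# Hypothetical model output: a distribution over a toy 4-token vocabulary.
probs = [0.7, 0.1, 0.1, 0.1]

# The objective only asks: "was this the right next token?"
loss_when_right = next_token_loss(probs, 0)  # model confident and correct
loss_when_wrong = next_token_loss(probs, 3)  # little mass on the target

print(round(loss_when_right, 4))
print(round(loss_when_wrong, 4))
```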
replies(2): >>43888414 #>>43888493 #
3. mdp2021 ◴[] No.43888414[source]
That is an architectural problem. To rephrase the post: it is paradoxical to make changes near the output level - "changing words just before they are said" - when the expectation is that the model works with ideas. (And even then, selection would not happen at the output level; it would happen while the structure of the idea is being defined.)

So, articles like this submission - while interesting from many points of view - make the elephant in the room more evident.

> You cannot define an idea as a training loss objective

What tells you so? If you see a technical limit, note e.g. that sentences and paragraphs can have their own position in an embedding space.
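As an illustration of that claim, here is a toy sketch (with invented 2-d token vectors, not a real model) in which whole sentences get their own position in an embedding space via mean pooling, and can then be compared:

```python
# Hypothetical 2-d vectors for a tiny vocabulary; a real model would
# produce contextual vectors of much higher dimension.
token_vectors = {
    "cats": [1.0, 0.0],
    "purr": [0.8, 0.2],
    "stocks": [0.0, 1.0],
    "fell": [0.1, 0.9],
}

def sentence_embedding(tokens):
    # Mean-pool the token vectors: the sentence becomes one point in the space.
    dims = len(next(iter(token_vectors.values())))
    summed = [0.0] * dims
    for t in tokens:
        for i, v in enumerate(token_vectors[t]):
            summed[i] += v
    return [s / len(tokens) for s in summed]

def cosine(a, b):
    # Cosine similarity between two sentence-level points.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

s1 = sentence_embedding(["cats", "purr"])
s2 = sentence_embedding(["stocks", "fell"])

print(cosine(s1, s1))  # a sentence is maximally similar to itself
print(cosine(s1, s2))  # unrelated sentences land far apart
```

Nothing here is a training objective by itself, but it shows sentence-level positions are at least well-defined objects one could optimize over.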

4. orbital-decay ◴[] No.43888493[source]
Interpretability studies offer several orthogonal ways to look at this, it's like Newtonian vs Lagrangian mechanics. Autoregressive token prediction, pattern matching, idea conceptualization, pathfinding in the extremely multidimensional space...