
228 points nkko | 2 comments
orbital-decay No.43888112
One thing not said here is that samplers have no access to the model's internal state. They're basic math applied to the output distribution, which technically carries some semantics, but you can't decode it without being as smart as the model itself.

Certain samplers described here, like repetition penalty or DRY, are just like this: the model could repeat itself in a myriad of ways, and the only way to prevent all of them is better training, not n-gram search or other classic NLP methods. This is basically trying to plug every hole with a finger. How many fingers do you have?
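For context, the token-level repetition penalty being critiqued above can be sketched in a few lines (a simplified illustration; the function name and signature are made up, though the math mirrors the standard heuristic, e.g. HF transformers' RepetitionPenaltyLogitsProcessor). Note it only dampens exact token re-use, which is exactly why it can't catch paraphrased repetition:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Dampen the logits of tokens that have already been generated.

    Positive logits are divided by the penalty and negative ones
    multiplied by it, so the adjustment always pushes probability down.
    """
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

out = apply_repetition_penalty(np.array([2.0, -1.0, 0.5]),
                               generated_ids=[0, 1], penalty=2.0)
```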

Hacking the autoregressive process has some low-hanging fruit like Min-P that can yield improvements and make certain nifty tricks possible, but if you're doing it to turn a bad model into a good one, you're doing it wrong.
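As a sketch of what Min-P does (a simplified, assumed implementation, not the exact code in any particular library): a token is kept only if its probability is at least `min_p` times the top token's probability, so the candidate set shrinks when the model is confident and widens when it isn't:

```python
import numpy as np

def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=None):
    """Min-P sampling sketch: filter by a fraction of the max probability,
    renormalize, then sample from the surviving tokens."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()   # threshold scales with confidence
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

With a sharply peaked distribution and `min_p=0.5`, only the top token survives the filter, so sampling is deterministic.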

replies(2): >>43888135 #>>43889081 #
Der_Einzige No.43888135
No, it's done to turn an uncreative model into a creative one. This idea that sampling isn't that important, or that it's some violation of the bitter lesson, is exactly why I had to call out the whole academic field for having a giant blind spot around this kind of research in our oral presentation at ICLR!

Top-n-sigma has been around since mid-2024 and min-p since 2023, and we are still waiting for these innovations to be integrated anywhere outside of open source (i.e., outside of HF/vLLM). API providers are slow-walking this on purpose because they don't want to deal with the risk of models being "too creative" (high temperatures also likely break their watermarking).

One other thing: making models aware of their own sampling settings is super easy if you just feed the settings back to the model on every token or generation (say, using structured generation). Models can then control their own sampling settings, and thus "have access to their internal state," with just a tiny bit of extra programming (the model can write that code for you now, lol).
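A minimal sketch of that feedback loop (everything here is hypothetical: `generate` is a stub standing in for any LLM call that can return structured JSON output, and the `[sampler: ...]` tag is an invented convention):

```python
import json

def generate(prompt, temperature):
    # Stub for a real LLM call with structured (JSON) output.
    return json.dumps({"temperature": 0.7, "text": " ..."})

def self_tuning_loop(prompt, temperature=1.0, turns=3):
    """Feed the current sampler settings back into the prompt each turn
    and let the model propose new ones."""
    for _ in range(turns):
        tagged = f"{prompt}\n[sampler: temperature={temperature}]"
        reply = json.loads(generate(tagged, temperature))
        temperature = reply["temperature"]  # model adjusts its own sampling
        prompt += reply["text"]
    return temperature
```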

replies(3): >>43888242 #>>43890812 #>>43892167 #
1. achierius No.43890812
How is it not a violation of the bitter lesson? You're correcting the model after the fact using human logic, where the bitter lesson would want you to just train a better model.

Not that I think that goes against your point -- I think it's rather a problem with the bitter lesson.

replies(1): >>43895587 #
2. Der_Einzige No.43895587
The primary argument for why it's not a violation is that the heuristic is (almost) free. LLM-designed samplers probably are, and will continue to be, better, but to start the recursive self-improvement engine, a few free heuristics will be needed.

The bitter-lesson critique is that human-designed heuristics were not free: they harmed the notion of "letting the computer figure it out" by slowing down training. High-temperature sampling is very important for halfway-decent synthetic data generation, and thus for "letting the computer figure it out" in natural language, and better sampling is the only way to make high-temperature generations coherent.
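To make that last point concrete with toy numbers (synthetic logits, not from any real model): raising temperature inflates the tail of the raw softmax, so plain sampling starts picking implausible tokens, while a min-p cutoff keeps the candidate set pinned to the plausible ones:

```python
import numpy as np

logits = np.array([10.0, 9.0, -5.0, -6.0, -7.0])  # two plausible tokens, three junk

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def min_p_candidates(logits, temperature, min_p=0.1):
    """Count tokens surviving a min-p filter at a given temperature."""
    p = softmax(logits / temperature)
    return int((p >= min_p * p.max()).sum())

# Probability mass on the three junk tokens under plain sampling:
tail_t1 = softmax(logits / 1.0)[2:].sum()   # negligible at T=1
tail_t5 = softmax(logits / 5.0)[2:].sum()   # several percent at T=5
# With min-p, the candidate set stays at the two plausible tokens:
kept_t1 = min_p_candidates(logits, 1.0)
kept_t5 = min_p_candidates(logits, 5.0)
```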