
228 points | nkko | 1 comment
orbital-decay No.43888112
One thing not said here is that samplers have no access to the model's internal state. They apply basic math to the output distribution, which technically carries some semantics, but you can't decode it without being as smart as the model itself.

Certain samplers described here, like repetition penalty or DRY, are just like this: the model could repeat itself in myriad ways, and the only way to prevent all of them is better training, not n-gram search or other classic NLP methods. This is basically trying to plug every hole with a finger. How many fingers do you have?
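For context on what these samplers actually do: the classic repetition penalty simply rescales the logits of tokens that already appear in the context before sampling. A minimal sketch (the penalty value 2.0 is an arbitrary illustration, and real implementations operate on full vocab-sized tensors):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """CTRL-style repetition penalty: damp logits of already-seen tokens.
    Positive logits are divided by the penalty, negative ones multiplied,
    so previously generated tokens become less likely either way."""
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

logits = np.array([2.0, 0.5, -1.0, 3.0])
# Tokens 0 and 2 were already generated; their logits get penalized.
penalized = apply_repetition_penalty(logits, [0, 2], penalty=2.0)
```

This illustrates the "finger in every hole" problem: the penalty only sees exact token IDs, so the model can repeat itself at the phrase or semantic level without ever triggering it.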

Hacking the autoregressive process has some low-hanging fruit like Min-P that can yield modest improvements and make certain nifty tricks possible, but if you're doing it to turn a bad model into a good one, you're doing it wrong.
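For reference, Min-P is a truncation sampler: it keeps only tokens whose probability is at least some fraction of the top token's probability, so the cutoff adapts to how confident the model is. A minimal sketch (the `min_p=0.1` value and toy logits are illustrative):

```python
import numpy as np

def min_p_filter(logits, min_p=0.1):
    """Min-P sampling filter: keep tokens whose probability is at least
    min_p times the top token's probability, then renormalize."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    mask = probs >= min_p * probs.max()
    kept = np.where(mask, probs, 0.0)
    return kept / kept.sum()

filtered = min_p_filter(np.array([3.0, 1.0, 0.0, -5.0]), min_p=0.1)
```

Unlike a fixed top-k or top-p cutoff, the threshold here scales with the peak probability: a confident distribution prunes aggressively, a flat one keeps more candidates.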

Der_Einzige No.43888135
No, it's done to turn an uncreative model into a creative one. This idea that sampling isn't important, or is somehow a violation of the bitter lesson, is exactly why I had to call out the whole academic field for having a giant blind spot for this kind of research in our oral presentation at ICLR!

Top-n-sigma has been around since mid-2024, min_p since 2023, and we're still waiting for these innovations to be integrated anywhere outside of open source (i.e., outside of HF/vLLM). API providers are doing this slowly on purpose because they don't want to deal with the risk of models being "too creative" (also, high temperature likely breaks their watermarking).

One other thing: making models aware of their own sampling settings is super easy if you just feed the settings back to the model every token or generation (say, using structured generation). Models can then control their own sampling settings and thus "have access to their internal state" with just a tiny bit of extra programming (the model can even write that code for you now, lol).
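The loop described above can be sketched as follows. Everything here is hypothetical scaffolding: `fake_model` stands in for a real LLM call, and the JSON control field `set_temperature` is an invented convention that structured generation would enforce in practice:

```python
import json
import random

def fake_model(prompt):
    """Stand-in for an LLM API call (hypothetical). It ignores the prompt
    and returns a JSON blob containing text plus a requested sampler
    change; real structured generation would guarantee this parses."""
    return json.dumps({
        "text": "...",
        "set_temperature": round(random.uniform(0.5, 1.5), 2),
    })

settings = {"temperature": 1.0}
for _ in range(3):
    # Feed the current sampler settings back into the context every
    # generation, so the model "sees" (and can rewrite) its own knobs.
    prompt = f"[sampler: {json.dumps(settings)}]\nContinue the story."
    reply = json.loads(fake_model(prompt))
    settings["temperature"] = reply["set_temperature"]
```

The design point is just the feedback edge: the sampler state becomes part of the context, so from the model's perspective it is observable and controllable like any other text.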

orbital-decay No.43888242
I guess variance is a better word for this. Creativity is a pretty loose term; for example, most people will describe R1 as creative in RP/stories for its tendency to derail everything in an unhinged way, but it still lacks variance like every other modern model (kill the reasoning chain and look at the logprobs to see what I mean). The bitter lesson is not some threshold and can't be "violated"; it describes a curve of diminishing returns. As long as you're on the easy part, you're fine.

But the bigger problem is that concepts are expressed before they're decoded into the output distribution. You can steer them to a degree by hacking the autoregressive process, but if the model itself has learned that this concept maps to that particular concept rather than to a set of concepts (and RL tends to do exactly that), fixing it with sampling is somewhere between hard and impossible: you'll just lose accuracy and make the model dumber as you force out-of-distribution outputs.