
228 points nkko | 6 comments
orbital-decay ◴[] No.43888112[source]
One thing not said here is that samplers have no access to the model's internal state. They're basic math applied to the output distribution, which technically carries some semantics, but you can't decode it without being as smart as the model itself.

Certain samplers described here, like repetition penalty or DRY, are just like this - the model could repeat itself in a myriad of ways, and the only way to prevent all of them is better training, not n-gram search or other classic NLP methods. This is basically trying to plug every hole with a finger. How many fingers do you have?
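For illustration, the blunt per-token "finger" being described here looks roughly like the CTRL-style repetition penalty. A minimal NumPy sketch (function name and defaults are my own, not any particular library's):

```python
import numpy as np

def repetition_penalty(logits, generated_ids, penalty=1.2):
    """CTRL-style repetition penalty: push down the logit of every token
    that already appears in the context. Positive logits are divided by
    the penalty and negative ones multiplied, so seen tokens always
    become less likely."""
    logits = logits.astype(float).copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits
```

Note that this penalizes a seen token everywhere, including legitimate reuse (variable names, proper nouns), which is exactly why it can't distinguish degenerate loops from valid repetition.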

Hacking the autoregressive process has some low-hanging fruit like Min-P that can make some improvement and certain nifty tricks possible, but if you're doing it to turn a bad model into a good one, you're doing it wrong.
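A minimal sketch of Min-P itself, assuming the usual formulation (keep tokens whose probability is at least `p_base` times the top token's, renormalize, sample); the function name is mine:

```python
import numpy as np

def min_p_sample(logits, p_base=0.1, rng=None):
    """Min-P sketch: the cutoff scales with the model's confidence.
    When the model is sure, almost everything is pruned; when the
    distribution is flat, most tokens survive."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= p_base * probs.max()   # relative, not absolute, threshold
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```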

replies(2): >>43888135 #>>43889081 #
1. Der_Einzige ◴[] No.43888135[source]
No, it's done to turn an uncreative model into a creative model. This idea that sampling isn't that important, or is some violation of the bitter lesson, is exactly why I had to call out the whole academic field for having a giant blind spot around this kind of research in our oral presentation at ICLR!

Top-n sigma has been around since mid-2024, min_p since 2023, and we are still waiting for these innovations to be integrated anywhere outside of open source (i.e. outside of HF/vllm). API providers are deliberately slow about this because they don't want to deal with the risk of models being "too creative" (also, high temp likely breaks their watermarking).
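As I understand the top-n sigma paper, it filters in logit space rather than probability space: keep only tokens within n standard deviations of the maximum logit. A rough NumPy sketch (names mine):

```python
import numpy as np

def top_n_sigma(logits, n=1.0):
    """Top-n sigma sketch: drop tokens whose logit falls more than
    n standard deviations below the max, then softmax the survivors.
    The cutoff is computed on raw logits, before any temperature
    scaling is applied."""
    thresh = logits.max() - n * logits.std()
    filtered = np.where(logits >= thresh, logits, -np.inf)
    exp = np.exp(filtered - filtered.max())   # exp(-inf) = 0 zeroes dropped tokens
    return exp / exp.sum()
```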

One other thing - making models aware of their own sampling settings is super easy if you just feed them back to the model every token or generation (say, using structured generation). Models can control their own sampling settings and thus "have access to their internal state" with just a tiny bit of extra programming (the model can write that code for you now, lol)

replies(3): >>43888242 #>>43890812 #>>43892167 #
2. orbital-decay ◴[] No.43888242[source]
I guess variance is a better word for this. Creativity is a pretty loose term, for example most people will describe R1 as creative in RP/stories for its tendency to derail everything in an unhinged way, but it still lacks variance like every other modern model (kill the reasoning chain and look at logprobs to get what I mean). The bitter lesson is not some threshold and can't be violated, it describes a curve of diminishing returns. As long as you're on the easy part, it's fine.

But the bigger problem is that the concepts are expressed before they're decoded into the output distribution. You can steer them to a degree by hacking the autoregressive process, but if the model itself learned that this concept maps to that one particular concept rather than to a set of concepts (and RL tends to do exactly that), fixing it with sampling is usually hard to impossible - you'll just lose accuracy and make the model dumber as you force out-of-distribution outputs.

3. achierius ◴[] No.43890812[source]
How is it not a violation of the bitter lesson? You're trying to correct the model after the fact using human logic, where the bitter lesson would want you to just train a better model.

Not that I think that goes against your point -- I think it's rather a problem with the bitter lesson.

replies(1): >>43895587 #
4. NitpickLawyer ◴[] No.43892167[source]
> No, it's done to turn an uncreative model into a creative model. This idea that sampling isn't that important or is some violation of the bitter lesson is exactly why I had to call out the whole academic field as having a giant blindspot for this kind of research in our oral presentation at ICLR!

I see this sentiment a lot; there are even people who swear by samplers like XTC (which sounds counterintuitive af), but it's always on "creative" tasks. On math tasks, with a clear correct/incorrect answer, none of the "creative" samplers come out on top, not even min_p (except at crazy temperatures, and even there the overall accuracy is still lower than normal temps with normal sampling)...
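For reference, my understanding of XTC ("exclude top choices"): with some probability, if two or more tokens exceed a threshold, drop all of them except the least likely. A rough sketch (names and defaults are mine, not the reference implementation's):

```python
import numpy as np

def xtc(probs, threshold=0.1, xtc_prob=0.5, rng=None):
    """XTC sketch: unlike most samplers, which trim the tail, this trims
    the HEAD of the distribution - it deliberately avoids the most
    predictable continuations, hence the counterintuitive feel."""
    rng = rng or np.random.default_rng()
    probs = probs.astype(float).copy()
    above = np.flatnonzero(probs > threshold)
    if len(above) >= 2 and rng.random() < xtc_prob:
        keep = above[np.argmin(probs[above])]      # least likely "top choice"
        probs[np.setdiff1d(above, [keep])] = 0.0   # cut the rest of the head
        probs /= probs.sum()
    return probs
```

Because only above-threshold tokens are ever removed, the fallback token is still a reasonably probable one - which is why it spices up prose yet reliably hurts tasks with a single correct answer.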

The main problem is that "creativity" is such a subjective measure that it's hard to score properly.

replies(1): >>43895563 #
5. Der_Einzige ◴[] No.43895563[source]
I think "crazy" temperatures start around 100, not 2-3 as folks commonly claim in the literature.

You're right in general on this post, but I think you underestimate how many coomers/ERP folks there are and how much they use LLMs. XTC was made for them, to give some notion of slop removal. It's probably not quite as good at that task as the antislop sampler (from Sam Paech, the EQ-Bench creator), but I find XTC quite good at adding "spice" to outputs.

Re: "creativity" being hard to measure - agreed, especially the difficulty of scoring it! We have some nitpickers of our own whispering into our ears about this. You don't happen to be at Stanford, do you? IYKYK...

6. Der_Einzige ◴[] No.43895587[source]
The primary argument for why it's not a violation is that the heuristic is (almost) free. LLM-designed samplers probably are, and will continue to be, better - but to start the recursive self-improvement engine, a few free heuristics will be needed.

The bitter lesson's critique is that human-designed heuristics were not free: they harmed "letting the computer figure it out" by slowing down training. High-temperature sampling is very important for halfway-decent synthetic data generation, and thus for "letting the computer figure it out" in natural language. Better sampling is the only way to make high-temperature generations coherent.
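A toy illustration of that last claim - as temperature rises, the raw distribution flattens onto the whole vocabulary, while a min-p style cutoff keeps the sampleable support anchored to the top token. This is a sketch of the intuition, not any provider's implementation:

```python
import numpy as np

def support_after_min_p(logits, temperature, p_base):
    """Count how many tokens remain sampleable after temperature
    scaling plus a min-p cutoff (p_base=0 disables the cutoff)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int((probs >= p_base * probs.max()).sum())
```

With the cutoff disabled, every token stays in play at high temperature (pure noise); with it enabled, the support grows only modestly as temperature rises, which is what keeps high-temperature generations coherent.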