
228 points | nkko | 2 comments
orbital-decay No.43888112
One thing not said here is that samplers have no access to the model's internal state. Sampling is basic math applied to the output distribution, which technically carries some semantics, but you can't decode them without being as smart as the model itself.

Certain samplers described here, like repetition penalty or DRY, are exactly this: the model could repeat itself in a myriad of ways, and the only way to prevent all of them is better training, not n-gram search or other classic NLP methods. This is basically trying to plug every hole with a finger. How many fingers do you have?
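(For readers unfamiliar with these samplers: the classic repetition penalty fits in a few lines. This is a toy sketch over a token-id → logit dict, not any particular library's implementation; real code operates on tensors, and the divide-positive/multiply-negative convention follows common open-source samplers.)

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Scale down the logits of tokens that already appeared in the output.

    Toy sketch: `logits` is a dict of token_id -> raw logit. Following
    the common convention, positive logits are divided by the penalty
    and negative logits are multiplied by it (both push the token down).
    """
    out = dict(logits)
    for tid in set(generated_ids):
        if tid in out:
            out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out
```

Note the finger/hole problem in the code itself: it only penalizes exact token-id repeats, so the model can trivially rephrase its way around it.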

Hacking the autoregressive process has some low-hanging fruit like Min-P that can yield improvements and make certain nifty tricks possible, but if you're doing it to turn a bad model into a good one, you're doing it wrong.
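(Min-P, for reference, is about as simple as samplers get: keep only tokens whose probability is at least some fraction of the top token's probability. A toy sketch over a token → probability dict, not any specific library's code:)

```python
def min_p_filter(probs, min_p=0.1):
    """Min-P sampling sketch: the cutoff scales with model confidence.

    Keep tokens whose probability is at least `min_p` times the
    probability of the most likely token, then renormalize. When the
    model is confident the pool shrinks; when it is uncertain the
    pool widens.
    """
    top = max(probs.values())
    kept = {t: p for t, p in probs.items() if p >= min_p * top}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}
```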

replies(2): >>43888135 #>>43889081 #
Der_Einzige No.43888135
No, it's done to turn an uncreative model into a creative one. The idea that sampling isn't that important, or is somehow a violation of the bitter lesson, is exactly why I had to call out the whole academic field for having a giant blind spot for this kind of research in our oral presentation at ICLR!

Top-n-sigma has been around since mid-2024 and min_p since 2023, and we are still waiting for these innovations to be integrated anywhere outside of open source (i.e. outside of HF/vLLM). API providers are dragging their feet on purpose because they don't want to deal with the risk of models being "too creative" (high temperatures also likely break their watermarking).
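(Top-n-sigma is also tiny. The sketch below follows my reading of the method, keeping only tokens whose logit falls within n standard deviations of the maximum logit; see the original paper for the exact formulation.)

```python
import statistics

def top_n_sigma_filter(logits, n=1.0):
    """Top-n-sigma sketch: a statistics-based cutoff in logit space.

    Keep tokens whose logit is at least (max_logit - n * sigma), where
    sigma is the standard deviation of all logits. Unlike top-k, the
    number of surviving tokens adapts to how peaked the distribution is.
    """
    m = max(logits.values())
    sigma = statistics.pstdev(logits.values())
    return {t: l for t, l in logits.items() if l >= m - n * sigma}
```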

One other thing: making models aware of their own sampling settings is super easy if you just feed the settings back to the model on every token or generation (say, using structured generation). Models can then control their own sampling settings, and thus "have access to their internal state", with just a tiny bit of extra programming (the model can write that code for you now, lol).
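(The feedback loop is really just string plumbing. A sketch of the idea; the header format and field names here are made up for illustration, not any particular API:)

```python
def build_prompt(user_msg, sampler_state):
    """Prepend the current sampler settings to the prompt so the model
    can 'see' them; with structured generation the model could also
    emit new values that the serving loop applies on the next call.
    The [sampler ...] header format is purely illustrative.
    """
    header = (f"[sampler temperature={sampler_state['temperature']} "
              f"min_p={sampler_state['min_p']}]\n")
    return header + user_msg
```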

replies(3): >>43888242 #>>43890812 #>>43892167 #
1. NitpickLawyer No.43892167
> No, it's done to turn an uncreative model into a creative one. The idea that sampling isn't that important, or is somehow a violation of the bitter lesson, is exactly why I had to call out the whole academic field for having a giant blind spot for this kind of research in our oral presentation at ICLR!

I see this sentiment a lot; there are even people who swear by samplers like XTC (which sounds counterintuitive af), but it's always on "creative" tasks. On math tasks with a clear correct/incorrect answer, none of the "creative" samplers come out on top, not even min_p (except at crazy temperatures, and even there the overall accuracy is still lower than normal temperatures with normal sampling)...
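(And XTC really is counterintuitive: it sometimes removes the *most* likely tokens. A sketch of the idea as I understand it from open-source implementations; parameter names are illustrative:)

```python
import random

def xtc_filter(probs, threshold=0.1, xtc_probability=0.5, rng=random):
    """XTC ('exclude top choices') sketch.

    With probability `xtc_probability`, drop every token whose
    probability exceeds `threshold` EXCEPT the least likely of them,
    forcing the model off its most predictable continuation while
    still keeping a plausible token in the pool.
    """
    if rng.random() >= xtc_probability:
        return dict(probs)                     # sampler not triggered this step
    above = sorted((t for t in probs if probs[t] >= threshold),
                   key=lambda t: probs[t])     # ascending by probability
    if len(above) < 2:
        return dict(probs)                     # nothing to exclude
    kept = {t: p for t, p in probs.items() if t not in above[1:]}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}
```

You can see why this tanks math accuracy: when there is one correct next token, it is usually exactly the high-probability token XTC throws away.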

The main problem is that "creativity" is such a subjective measure that it's hard to score properly.

replies(1): >>43895563 #
2. Der_Einzige No.43895563
I think "crazy" temperatures start around 100, not the 2-3 that folks commonly claim in the literature.
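(For context on why 100 is the interesting regime: temperature just divides the logits before the softmax, so large T flattens the distribution toward uniform but never inverts the ranking. A toy sketch:)

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Standard temperature scaling: softmax(logits / T).

    T -> 0 approaches greedy decoding; very large T approaches a
    uniform distribution over the vocabulary. The max-subtraction is
    the usual trick for numerical stability.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

At T=100, a 2-logit gap between two tokens shrinks to 0.02, leaving them nearly equally likely; that's the kind of flattening combining high temperature with a filter like min_p relies on.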

You're right in general on this post, but I think you underestimate how many coomers/ERP folks there are and how much they use LLMs. XTC was made for them, to give some notion of slop removal. It's probably not quite as good at that task as the antislop sampler (from Sam Paech, the EQ-Bench creator), but I find XTC to be quite good at adding "spice" to outputs.

Re: the difficulty of measuring "creativity", that's especially true, particularly the difficulty of scoring it! We have some nitpickers of our own whispering into our ears about this. You don't happen to be at Stanford, do you? IYKYK...