
169 points constantinum | 10 comments
1. StrauXX ◴[] No.40714872[source]
Did I understand the documentation for many of these libraries correctly in that they reprompt until they receive valid JSON? If so, I don't understand why one would do that when token masking is a deterministically verifiable way to get structured output of any kind (as done by Guidance and LMQL, for instance). This is not meant to be snarky, I really am curious. Is there an upside to reprompting, aside from easier implementation?
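For concreteness, here's roughly what I mean by token masking, as a minimal sketch with Hugging Face transformers (the model and the allowed-token rule are placeholders; Guidance and LMQL do something far more sophisticated with grammars):

    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    class AllowListProcessor(LogitsProcessor):
        """Set the logits of every token outside `allowed_ids` to -inf."""
        def __init__(self, allowed_ids):
            self.allowed = torch.tensor(sorted(allowed_ids))

        def __call__(self, input_ids, scores):
            mask = torch.full_like(scores, float("-inf"))
            mask[:, self.allowed] = 0.0   # allowed tokens keep their score
            return scores + mask

    tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # toy constraint: only digits and a closing brace can ever be sampled
    allowed = {i for ch in '0123456789}' for i in tok.encode(ch)}
    out = model.generate(
        **tok('{"count": ', return_tensors="pt"),
        max_new_tokens=5,
        logits_processor=LogitsProcessorList([AllowListProcessor(allowed)]),
    )
    print(tok.decode(out[0]))

The sampled continuation can only ever contain tokens from the allow-list, so there is nothing to re-check afterwards.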
replies(4): >>40714984 #>>40714988 #>>40715185 #>>40715620 #
2. hellovai ◴[] No.40714984[source]
The main one is that most people don't own the model, so if you use OpenAI / Anthropic / etc. you can't apply token masking yourself. In that case, reprompting is pretty much the only option.
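Roughly, the pattern is the sketch below (the model name, prompt, and retry count are placeholders, not anyone's actual implementation):

    import json
    from openai import OpenAI

    client = OpenAI()

    def ask_for_json(prompt, retries=3):
        messages = [{"role": "user", "content": prompt + "\nRespond with JSON only."}]
        for _ in range(retries):
            reply = client.chat.completions.create(
                model="gpt-4o-mini", messages=messages,
            ).choices[0].message.content
            try:
                return json.loads(reply)   # success: the model produced valid JSON
            except json.JSONDecodeError as err:
                # feed the bad output and the parse error back, then try again
                messages += [
                    {"role": "assistant", "content": reply},
                    {"role": "user", "content": f"That wasn't valid JSON ({err}). Try again."},
                ]
        raise ValueError("no valid JSON after retries")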
replies(2): >>40716262 #>>40725394 #
3. frabcus ◴[] No.40714988[source]
My experience with models, even from about a year ago, is that the model has firmly decided what it thinks by the time of the last layer, so the probabilities at that layer aren't very useful.

You either get the same (in this case wrong) thing worded differently, or, worse, you get effectively noise if the second-highest probability is much lower than the largest one.

My guess is that applies here too. Better to let all the layers rethink the tokens than to force hallucination of, e.g., a random letter when you aren't expecting an angle bracket.

(Edit: the above assumes using logprobs and/or logit_bias with the OpenAI API, not some other masking technique.)
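(For reference, a rough sketch of what that looks like through the API: logit_bias pushes specific token IDs up or down, with -100/+100 acting roughly as ban/force. The IDs below are illustrative cl100k_base values for "Yes"/"No", not something to copy blindly.)

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",                  # placeholder model
        messages=[{"role": "user", "content": "Is 7 prime? Answer Yes or No."}],
        max_tokens=1,
        logit_bias={"9642": 100, "2822": 100},  # illustrative token IDs for "Yes"/"No"
    )
    print(resp.choices[0].message.content)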

replies(1): >>40715094 #
4. HeatrayEnjoyer ◴[] No.40715094[source]
Why not apply it at an earlier layer then?
replies(1): >>40716700 #
5. torginus ◴[] No.40715185[source]
Isn't reprompting a decent technique? Considering most modern languages are LL(k), that is, you need at most k tokens of lookahead to parse the output (to be fair, these are programming-language tokens, not LLM tokens), with k=1 being the most common choice, wouldn't it be reasonable to expect to regenerate only a handful of tokens at most?
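One hedged way to cash that out (a sketch, not something these libraries necessarily do): json.JSONDecodeError reports the offset of the first character that broke the parse, so in principle you could keep the valid prefix and regenerate only from there instead of redoing the whole response.

    import json

    def first_bad_offset(text):
        """Return None if `text` is valid JSON, else the offset where parsing failed."""
        try:
            json.loads(text)
            return None
        except json.JSONDecodeError as err:
            return err.pos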
replies(1): >>40715256 #
6. joatmon-snoo ◴[] No.40715256[source]
Author here: yes, reprompting can work well enough if the latency hit is acceptable to you.

If you’re driving user-facing interactions with LLMs, though, and you’re already dealing with >1min latency on the first call (as many of our current users are!), waiting for another LLM call to come back is a really frustrating thing to block your UX on.

7. Havoc ◴[] No.40715620[source]
For local models you can use grammars to constrain the output directly.
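A rough sketch with llama-cpp-python (the model path is a placeholder; llama.cpp's CLI accepts the same GBNF text via --grammar-file):

    from llama_cpp import Llama, LlamaGrammar

    # a tiny GBNF grammar: output must look like {"answer": "<letters/digits/spaces>"}
    GBNF = r'''
    root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
    string ::= "\"" [a-zA-Z0-9 ]* "\""
    ws     ::= [ \t\n]*
    '''

    llm = Llama(model_path="./model.gguf")               # placeholder path
    grammar = LlamaGrammar.from_string(GBNF)
    out = llm("Reply with a JSON object:", grammar=grammar, max_tokens=64)
    print(out["choices"][0]["text"])                     # text is forced to match the grammar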
8. michaelt ◴[] No.40716262[source]
In the specific cases of OpenAI and Anthropic, both have 'tool use' interfaces that will generate valid JSON following a schema of your choice.

You're right, though, that reprompting works with pretty much everything out there, including hosted models that don't have tool use as part of their API. And it's simple, too: you don't even need to know what "token masking" is.

Reprompting can also apply arbitrary criteria that are more complex than a JSON schema. You ask it to choose an excerpt of a document and the string it returns isn't an excerpt? Just reprompt.
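For example, a hedged sketch of the tool-use route with the openai client (the schema, model, and function name here are placeholders; Anthropic's tool-use API takes a similar JSON Schema):

    import json
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                 # placeholder model
        messages=[{"role": "user", "content": "Extract the person from: Ada, 36."}],
        tools=[{
            "type": "function",
            "function": {
                "name": "record_person",
                "parameters": {
                    "type": "object",
                    "properties": {"name": {"type": "string"},
                                   "age": {"type": "integer"}},
                    "required": ["name", "age"],
                },
            },
        }],
        # force the model to "call" the function, i.e. emit schema-conforming JSON
        tool_choice={"type": "function", "function": {"name": "record_person"}},
    )
    args = resp.choices[0].message.tool_calls[0].function.arguments
    print(json.loads(args))                  # e.g. {"name": "Ada", "age": 36}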

9. anonymoushn ◴[] No.40716700{3}[source]
At earlier layers the activations don't correspond as cleanly to tokens, and I expect that vendor APIs for proprietary LLMs wouldn't let you do anything like this.
10. StrauXX ◴[] No.40725394[source]
It does work. With OpenAI, at least, you definitely can use token masking. There are some limitations, but even those can be worked around. I have used token masking with the OpenAI API via LMQL without any issues.