Hmm, that's an interesting way of thinking about it. The way I see it, I trust XML less, because the less information-dense representation gives the model more room to make a mistake: if you think of every token as an opportunity to be correct or wrong, the higher token count needed to represent the same content in XML gives the model more chances to get the output wrong (kinda like the birthday paradox).
(Plus, more output tokens is more expensive!)
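To put made-up numbers on it: if the model independently got each token right 99.9% of the time, a 60-token output would come out clean about 0.999^60 ≈ 94% of the time, while a 76-token output drops to about 0.999^76 ≈ 93%. The per-token rate is invented, but the point stands: the more tokens you need, the more those small per-token error chances stack up.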
e.g. using the cl100k_base tokenizer (the one GPT-4 uses), this JSON is 60 tokens:
{
  "method": "GET",
  "endpoint": "/api/model/details",
  "headers": {
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",
    "Content-Type": "application/json"
  },
  "queryParams": {
    "model_id": "12345"
  }
}
whereas this XML is 76 tokens:
<?xml version="1.0" encoding="UTF-8" ?>
<method>GET</method>
<endpoint>/api/model/details</endpoint>
<headers>
  <Authorization>Bearer YOUR_ACCESS_TOKEN</Authorization>
  <Content-Type>application/json</Content-Type>
</headers>
<queryParams>
  <model_id>12345</model_id>
</queryParams>
You can check out the tokenization here by toggling "show tokens":
https://www.promptfiddle.com/json-vs-xml-token-count-BtXe3
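Or, if you'd rather reproduce the counts locally, here's a quick sketch using the tiktoken library (assumes pip install tiktoken; the exact numbers can shift a little depending on the whitespace in your payload):

import tiktoken

# The two payloads from above, pasted verbatim.
json_payload = """{
  "method": "GET",
  "endpoint": "/api/model/details",
  "headers": {
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",
    "Content-Type": "application/json"
  },
  "queryParams": {
    "model_id": "12345"
  }
}"""

xml_payload = """<?xml version="1.0" encoding="UTF-8" ?>
<method>GET</method>
<endpoint>/api/model/details</endpoint>
<headers>
  <Authorization>Bearer YOUR_ACCESS_TOKEN</Authorization>
  <Content-Type>application/json</Content-Type>
</headers>
<queryParams>
  <model_id>12345</model_id>
</queryParams>"""

# cl100k_base is the encoding GPT-4 uses.
enc = tiktoken.get_encoding("cl100k_base")
print("JSON tokens:", len(enc.encode(json_payload)))
print("XML tokens: ", len(enc.encode(xml_payload)))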