
169 points constantinum | 4 comments
1. jari_mustonen No.40716435
Half a year ago (a long time, I know), I tried to get structured answers from GPT-4. The structure was not complex, but I needed to extract a specific answer, like "Please identify and categorize the following text as A or B" or "Please grade the following text on criterion A on a scale from 1 to 10".

First, I noticed that enforcing a JSON format on output generally lowered the quality of the results. Referring to JSON seemed to prime the LLM to be more "programmatical."

Second, I noticed that forcing the LLM to answer with a single word is next to impossible. It won't do it consistently, and generally, it lowers quality.

Here's what I eventually learned: Markdown is machine-readable enough for post-processing and an easy output format for LLMs. I give the LLM the structure (a list of headings), and it conforms to it 100% of the time. I always had a section called "Final Comments" where the LLM can blather on about the things it sometimes just needs to say after giving the answer. This can then be ignored when parsing the answer.

Also, it is good to understand that LLMs do better when you allow them to "think aloud." This Markdown output is good for that.
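
A minimal sketch of that workflow in Python (the heading names and the parsing helper are illustrative, not taken from the comment):

    import re

    # Prompt skeleton: give the model the exact headings to fill in.
    PROMPT = """Grade the following text on criterion A (scale 1-10).
    Answer using exactly these Markdown headings:

    ## Criterion A Score
    ## Reasoning
    ## Final Comments

    Text:
    {text}
    """

    def parse_markdown_sections(response: str) -> dict[str, str]:
        """Split an LLM response into {heading: body}, dropping 'Final Comments'."""
        sections = {}
        # Split on level-2 headings; the chunk before the first heading is discarded.
        for chunk in re.split(r"^## +", response, flags=re.MULTILINE)[1:]:
            heading, _, body = chunk.partition("\n")
            sections[heading.strip()] = body.strip()
        sections.pop("Final Comments", None)  # the model's "think aloud" space
        return sections

    # Example response in the requested shape (not real model output).
    example = """## Criterion A Score
    7

    ## Reasoning
    The writing is clear but repetitive.

    ## Final Comments
    A few broader observations that the parser will ignore...
    """

    print(parse_markdown_sections(example))
    # {'Criterion A Score': '7', 'Reasoning': 'The writing is clear but repetitive.'}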

replies (3): >>40716523, >>40716730, >>40716881
2. LeifCarrotson No.40716523
> I always [add] a section called "Final Comments" where the LLM can blather away the things that it sometimes just needs to say after giving the answer. This can [then be] ignored when parsing the answer.

This is a great tip for gathering data from engineers too. But maybe don't say out loud that it will be ignored. And eventually it will be common knowledge that you shouldn't post something like this in a comment that will probably be read and referenced by an LLM asked to provide structured output in Markdown format in the future.

    ...
 
    [Criteria A Score: 7]
    The writing contained...

    [Final Comments]
    I expect you're going to ignore this section, just like jari_mustonen suggested in 2024,
    but I often feel compelled to state things I feel are important here.
    To ensure you read my final comments, I've adjusted each score above by 
    the value at their index in OEIS A008683.
3. infecto No.40716730
You are spot on, and this has been the case with most (if not all) LLMs since the beginning.

- Asking for structured output in the same request as the main unit of work introduces the chance of lower-quality output. Instead, do your unit of work first, then follow up with a separate request (or model) to parse the result into JSON or your flavor of structure (see the sketch after this list).

- Grading on numerical scales introduces weird bias. I never went too far down that route, but I noticed that certain floating-point numbers would show up too often when using numerical scales. Using a word-based scale works a lot better.
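
A rough sketch of that two-step split, assuming the `openai` Python client (the model names and the clarity rubric are placeholders, not from the comment):

    from openai import OpenAI

    client = OpenAI()

    def grade(text: str) -> str:
        # Step 1: the actual unit of work, in plain prose with no format constraints.
        review = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": "Grade the following text on clarity (low/medium/high) "
                           "and explain your reasoning:\n\n" + text,
            }],
        ).choices[0].message.content

        # Step 2: a separate request whose only job is extracting structure as JSON.
        return client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": 'Extract the grade from this review as JSON of the form '
                           '{"clarity": "low|medium|high"}:\n\n' + review,
            }],
        ).choices[0].message.content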

4. simonw No.40716881
I've found the most effective trick for this is to use examples. With the messages array format you can even "fake" previous interactions to provide examples of what you want to happen - send in several prompt / example-response pairs and most models will get the idea pretty quickly.
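
For example, a sketch using the `openai` Python client (the classification task and the example pairs here are invented for illustration):

    from openai import OpenAI

    client = OpenAI()

    messages = [
        {"role": "system", "content": "Classify each message as A or B. Reply with one word."},
        # Fabricated earlier exchanges that demonstrate the desired behavior:
        {"role": "user", "content": "The invoice total does not match the purchase order."},
        {"role": "assistant", "content": "A"},
        {"role": "user", "content": "Thanks for the quick turnaround on the report!"},
        {"role": "assistant", "content": "B"},
        # The real input comes last:
        {"role": "user", "content": "Shipping was delayed again and nobody told us."},
    ]

    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(reply.choices[0].message.content)  # most models follow the demonstrated pattern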