jari_mustonen:
Half a year ago (a long time, I know), I tried to get structured answers from GPT-4. The structure was not complex, but I needed to extract a specific answer, as in "Please identify and categorize the following text as A or B" or "Please grade the following text on criterion A on a scale from 1 to 10".

First, I noticed that enforcing a JSON format on the output generally lowered the quality of the results. Referring to JSON seemed to prime the LLM to be more "programmatic."

Second, I noticed that forcing the LLM to answer with a single word is next to impossible. It won't do it consistently, and generally it lowers quality.

Here's what I eventually learned: Markdown is machine-readable enough for post-processing and an easy output format for LLMs. I give the LLM the structure (a list of headings), and it conforms to it 100% of the time. I always include a section called "Final Comments" where the LLM can blather on about the things it sometimes just needs to say after giving the answer. This section can then be ignored when parsing the answer.

Also, it is good to understand that LLMs do better when you allow them to "think aloud," and this Markdown output format gives them room to do so.
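
A minimal sketch of how such heading-structured output could be parsed; the heading names, the sample response, and the regex are my own illustration under the assumptions above, not anything from the comment:

```python
import re

# Headings we would instruct the LLM to use in the prompt, e.g.:
# "Answer using exactly these Markdown headings: ## Category, ## Reasoning, ## Final Comments"
EXPECTED_HEADINGS = ["Category", "Reasoning", "Final Comments"]

def parse_markdown_sections(text: str) -> dict[str, str]:
    """Split an LLM response into {heading: body} using Markdown '#' headings."""
    sections: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {heading: body.strip() for heading, body in sections.items()}

# Example LLM output (fabricated for illustration)
response = """\
## Category
A

## Reasoning
The text is primarily descriptive rather than persuasive.

## Final Comments
Note that some passages are ambiguous; a human reviewer may disagree.
"""

parsed = parse_markdown_sections(response)
answer = parsed.get("Category")      # the value we actually care about
parsed.pop("Final Comments", None)   # the "blather" section is simply ignored
print(answer)  # -> "A"
```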

infecto:
You are spot on, and this has been the case with most/all LLMs since the beginning.

- Asking for structured output in the same request as the main unit of work introduces the chance of lower-quality output. Instead, you should do your unit of work and then follow up with a different request/model to parse it into JSON or your flavor of structure (see the sketch after this list).

- Grading on numerical scales introduces weird biases. I never went too far down that route, but I noticed that certain floating-point numbers showed up too often when using numerical scales. A word-based scale works a lot better.
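
A minimal sketch of that two-pass pattern, combined with a word-based scale that gets mapped back to a number in ordinary code; call_llm is a hypothetical stand-in for whatever chat-completion client you use, and its canned replies exist only so the example runs:

```python
import json

# Hypothetical placeholder for a real chat-completion client; the canned
# replies below stand in for model output so this sketch is runnable.
def call_llm(prompt: str) -> str:
    if "Return only JSON" in prompt:
        return '{"grade": "good"}'
    return "## Grade\ngood\n\n## Reasoning\nClear structure, minor factual slips."

# Pass 1: the actual unit of work, free-form so quality isn't constrained.
work_prompt = (
    "Grade the following text on clarity using one of: poor, fair, good, excellent.\n"
    "Explain your reasoning.\n\nTEXT: ..."
)
analysis = call_llm(work_prompt)

# Pass 2: a separate request whose only job is to reformat the answer as JSON.
extract_prompt = (
    'Return only JSON of the form {"grade": "<poor|fair|good|excellent>"} '
    "summarizing this analysis:\n\n" + analysis
)
grade_word = json.loads(call_llm(extract_prompt))["grade"]

# Map the word-based scale to a number in post-processing, not in the prompt.
GRADE_TO_SCORE = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}
print(grade_word, GRADE_TO_SCORE[grade_word])  # -> good 3
```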