Test-driven development with an LLM for fun and profit

(blog.yfzhou.fyi)

219 points crazylogger | 2 comments | 16 Jan 25 15:30 UTC

smusamashah [16 Jan 25 18:17 UTC] No.42728826

On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

The code is at least testable and verifiable. For everything else, I am left wondering whether it's the truth or a hallucination. That adds exactly the mental burden I was trying to avoid by using an LLM in the first place.

sdesol [16 Jan 25 19:19 UTC] No.42729640

>>42728826

> On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

I'm working on an LLM chat app that is built around mistrust. The basic idea is that it is unlikely for a supermajority of quality LLMs to all get the same answer wrong.

This isn't foolproof, but it does provide some level of confidence in the answer.
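A minimal sketch of the supermajority idea, assuming each model's reply has already been reduced to a short answer such as "yes" or "no". The function name, the 2/3 default threshold, and the model names in the tests are illustrative assumptions, not how the app actually works.

```python
from collections import Counter
from typing import Optional


def supermajority_answer(answers: dict[str, str], threshold: float = 2 / 3) -> Optional[str]:
    """Return the answer shared by at least `threshold` of the models, else None.

    `answers` maps a (hypothetical) model name to its short answer, e.g. "yes"/"no".
    """
    if not answers:
        return None
    # Normalize so "No", "no " and "NO" are counted as the same vote.
    votes = Counter(a.strip().lower() for a in answers.values())
    top_answer, top_count = votes.most_common(1)[0]
    # Only trust the answer if a large enough share of models agree on it.
    if top_count / len(answers) >= threshold:
        return top_answer
    # No supermajority: treat the question as unresolved.
    return None
```

Returning `None` rather than the plurality answer is a deliberate choice here: a split vote is a signal to go verify elsewhere, not to pick a winner.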

Here is a quick example in which I analyze results from multiple LLMs that answered, "When did Homer Simpson go to Mars?"

https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...

In the yes/no table, every model except GPT-4o and GPT-4o mini said no. When I asked GPT-4o who was correct, it cited an episode as "evidence", so I asked for more information about that episode. Based on its answer, the mission to Mars appears to have been a hoax, and when I challenged GPT-4o on this, it agreed and said Homer never went to Mars, as the other models had said.

I then asked Sonnet 3.5 about the episode and it said GPT-4o misinterpreted the plot.

https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...

At this point, I am confident (but not 100% sure) that Homer never went to Mars; if I really needed to know, I would search the web.

horsawlarway [16 Jan 25 22:05 UTC] No.42731507

>>42729640

Isn't this essentially making the point of the post above you?

For comparison, if I just do a web search for "Did homer simpson go to mars", I am immediately linked to the Wikipedia page for that exact episode (https://en.wikipedia.org/wiki/The_Marge-ian_Chronicles), and the plot summary is shorter than your LLM output. It clearly says that Marge and Lisa (note: NOT Homer) almost went to Mars but did not go. Further, the summary correctly includes the outro, which does show Marge and Lisa on Mars in the year 2051.

Basically, for factual content, the LLM output was a garbage game of telephone.

sdesol [16 Jan 25 23:24 UTC] No.42732233

>>42731507

> Isn't this essentially making the point of the post above you?

Yes. I wrote the chat app precisely because I mistrust LLMs, but I do find them extremely useful when you approach them with the right mindset. If answering "Did Homer Simpson go to Mars?" correctly is critical, you can choose to require 100% consensus; otherwise, you will need a fallback plan.
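The strict policy can be sketched in a few lines, assuming answers are short strings and that the fallback is some other source of truth (a web search, a human check). `answer_or_fallback` and the `fallback` hook are hypothetical names for illustration, not part of the app.

```python
from typing import Callable


def answer_or_fallback(answers: dict[str, str], fallback: Callable[[], str]) -> str:
    """Accept the models' answer only on 100% consensus; otherwise call `fallback`.

    `fallback` is a hypothetical hook, e.g. a web search or a manual lookup.
    """
    # Normalize case and whitespace, then collect the distinct answers.
    normalized = {a.strip().lower() for a in answers.values()}
    if len(normalized) == 1:
        # Every model gave the same answer, so accept it.
        return normalized.pop()
    # Any disagreement (or no answers at all) means the LLMs are not trusted.
    return fallback()
```

Treating an empty answer set as "no consensus" keeps the policy conservative: the fallback runs whenever unanimity cannot be shown.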

When I asked all the LLMs about the Wikipedia article, they all correctly answered "No" and talked about Marge and Lisa in the future without Homer.

↑