The code is at least testable and verifiable. For everything else, I'm left wondering whether it's the truth or a hallucination, which adds exactly the kind of mental burden I was trying to avoid by using an LLM in the first place.
I'm working on an LLM chat app that is built around mistrust. The basic idea is that it is unlikely a supermajority of quality LLMs will all get the same answer wrong.
This isn't foolproof, but it does provide some level of confidence in the answer.
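To illustrate the idea, here is a minimal sketch of that supermajority check. The `ask_model` helper is hypothetical (the real app calls each provider's API), and the yes/no normalization is deliberately naive; this is just the shape of the approach, not the actual implementation.

```python
from collections import Counter

# Hypothetical helper: send the same question to one model and return its
# raw text answer. In a real app this would call each provider's API.
def ask_model(model: str, question: str) -> str:
    raise NotImplementedError("wire up to your LLM providers")

def consensus(question: str, models: list[str], threshold: float = 0.75) -> dict:
    """Ask several models the same yes/no question and report agreement."""
    answers = {m: ask_model(m, question) for m in models}
    # Naive normalization: any answer starting with "yes" counts as yes.
    votes = Counter(
        "yes" if a.strip().lower().startswith("yes") else "no"
        for a in answers.values()
    )
    top_answer, count = votes.most_common(1)[0]
    return {
        "answers": answers,
        "majority": top_answer,
        "agreement": count / len(models),
        "supermajority": count / len(models) >= threshold,
    }

# Example: only treat the answer as trustworthy if >= 75% of models agree.
# result = consensus("Did Homer Simpson go to Mars?",
#                    ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"])
```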
Here is a quick example in which I analyze results from multiple LLMs that answered, "When did Homer Simpson go to Mars?"
https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...
If you look at the yes/no table, all models except GPT-4o and GPT-4o mini said no. When I asked GPT-4o which answer was correct, it provided "evidence" from an episode, so I asked for more details on that episode. Based on what it said, the mission to Mars appears to have been a hoax, and when I challenged GPT-4o on this, it agreed and said Homer never went to Mars, as the other models had said.
I then asked Sonnet 3.5 about the episode, and it said GPT-4o had misinterpreted the plot.
https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...
At this point, I am fairly confident (but not 100% sure) that Homer never went to Mars, and if I really needed to know, I'd search the web.