
219 points crazylogger | 4 comments
smusamashah ◴[] No.42728826[source]
On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. It incurs more of the mental burden that I was trying to avoid by using an LLM in the first place.

replies(7): >>42728915 #>>42729219 #>>42729640 #>>42729926 #>>42730263 #>>42730292 #>>42731632 #
energy123 ◴[] No.42731632[source]
We need a hallucination benchmark.

In my experience, o1 is very good at avoiding hallucinations and I trust it more, but o1-mini and 4o are awful.
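
Even something crude would be a start. A toy sketch of what I mean (the questions and the model call below are placeholders, and substring matching is far too naive for a real benchmark):

    # Toy hallucination benchmark: factual questions with known answers,
    # counting responses that don't contain the reference fact.
    QUESTIONS = [
        ("Who wrote 'The Selfish Gene'?", "richard dawkins"),
        ("In what year did Apollo 11 land on the Moon?", "1969"),
    ]

    def ask_model(question: str) -> str:
        return "Richard Dawkins wrote it."   # stand-in for a real model call

    def hallucination_rate() -> float:
        misses = sum(1 for q, ref in QUESTIONS if ref not in ask_model(q).lower())
        return misses / len(QUESTIONS)

    print(f"hallucination rate: {hallucination_rate():.0%}")   # 50% with this stub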

replies(1): >>42732332 #
1. sdesol ◴[] No.42732332[source]
Well, given the price ($15.00 / 1M input tokens and $60.00 / 1M output tokens), I would hope so. At that price, I think it is fair to say it is doing a lot of checks in the background.
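
To put rough numbers on it (the token counts below are just assumptions for the sake of the arithmetic, and o1's hidden reasoning tokens are billed as output tokens):

    # Illustrative cost of a single o1 request at $15/1M input, $60/1M output.
    # Token counts are made up for the example.
    input_tokens = 2_000
    output_tokens = 30_000   # includes the hidden reasoning tokens
    cost = input_tokens / 1e6 * 15.00 + output_tokens / 1e6 * 60.00
    print(f"${cost:.2f}")    # $1.83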
replies(1): >>42732460 #
2. energy123 ◴[] No.42732460[source]
It is expensive. But if I'm correct about o1, it means user mistrust of LLMs is going to be a short-lived thing as costs come down and more people use o1 (or better) models as their daily driver.
replies(1): >>42732737 #
3. sdesol ◴[] No.42732737[source]
> mistrust of LLMs is going to be a short-lived thing as costs come down and more people use o1

I think the biggest question is whether o1 is scalable. I think o1 does well because it is going back and forth hundreds if not thousands of times. Somebody in a thread I was participating in mentioned that they let o1 crunch on a problem for 10 minutes. It sounded like it saved them a lot of work, so it was well worth it.
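
To be clear, nobody outside OpenAI knows how o1 actually works; the loop below is purely a sketch of the kind of generate/critique/revise cycle I have in mind (the model calls are stubs, not a real API), but it shows why the cost scales with the number of passes:

    # Hypothetical sketch of a "go back and forth N times" loop, not o1's actual method.
    def llm(prompt: str) -> str:
        """Stand-in for a model call; each real call would burn more output tokens."""
        return "no mistakes found"

    def answer_with_review(question: str, max_passes: int = 100) -> str:
        draft = llm(question)
        for _ in range(max_passes):
            critique = llm(f"Find mistakes in this answer to {question!r}: {draft}")
            if "no mistakes" in critique.lower():
                break
            draft = llm(f"Rewrite the answer, fixing: {critique}")
        return draft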

Whether or not o1 is practical for the general public, we will have to wait and see.

replies(1): >>42732826 #
4. energy123 ◴[] No.42732826{3}[source]
I'm going to wager "yes", because o3-mini (High) gets equal benchmark scores to o1 despite using a third as much compute, and because the consistent trend has been rapid order-of-magnitude decreases in price for a fixed level of intelligence (the trend has many dovetailing components, both hardware- and software-related). I can't forecast the future, but this would be my bet on a time horizon of < 3 years.
I'm going to wager "yes" because o3-mini (High) gets equal benchmark scores to o1 despite using 1/3rd as much compute, and because the consistent trend has been towards rapid order-of-magnitude decreases in price for a fixed level of intelligence (trend has many components dovetailing, both hardware and software related). Can't forecast the future, but this would be my bet on a time horizon of < 3 years.