
219 points crazylogger | 4 comments
smusamashah ◴[] No.42728826[source]
On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. It incurs more of the mental burden that I was trying to avoid by using an LLM in the first place.

replies(7): >>42728915 #>>42729219 #>>42729640 #>>42729926 #>>42730263 #>>42730292 #>>42731632 #
energy123 ◴[] No.42731632[source]
We need a hallucination benchmark.

In my experience, o1 is very good at avoiding hallucinations and I trust it more, but o1-mini and 4o are awful.
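
Even something crude would be a start. A toy sketch of what I mean (the questions and the model call below are placeholders, and substring matching is far too naive for a real benchmark):

    # Toy hallucination benchmark: factual questions with known answers,
    # counting responses that don't contain the reference fact.
    QUESTIONS = [
        ("Who wrote 'The Selfish Gene'?", "richard dawkins"),
        ("In what year did Apollo 11 land on the Moon?", "1969"),
    ]

    def ask_model(question: str) -> str:
        return "Richard Dawkins wrote it."   # stand-in for a real model call

    def hallucination_rate() -> float:
        misses = sum(1 for q, ref in QUESTIONS if ref not in ask_model(q).lower())
        return misses / len(QUESTIONS)

    print(f"hallucination rate: {hallucination_rate():.0%}")   # 50% with this stub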

replies(1): >>42732332 #
1. sdesol ◴[] No.42732332[source]
Well, given the price ($15.00 / 1M input tokens and $60.00 / 1M output tokens), I would hope so. At that price, I think it is fair to say it is doing a lot of checks in the background.
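
To put rough numbers on it (the token counts below are just assumptions for the sake of the arithmetic, and o1's hidden reasoning tokens are billed as output tokens):

    # Illustrative cost of a single o1 request at $15/1M input, $60/1M output.
    # Token counts are made up for the example.
    input_tokens = 2_000
    output_tokens = 30_000   # includes the hidden reasoning tokens
    cost = input_tokens / 1e6 * 15.00 + output_tokens / 1e6 * 60.00
    print(f"${cost:.2f}")    # $1.83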
replies(1): >>42732460 #
2. energy123 ◴[] No.42732460[source]
It is expensive. But if I'm correct about o1, it means user mistrust of LLMs is going to be a short-lived thing as costs come down and more people use o1 (or better) models as their daily driver.
replies(1): >>42732737 #
3. sdesol ◴[] No.42732737[source]
> mistrust of LLMs is going to be a short-lived thing as costs come down and more people use o1

I think the biggest question is whether o1 is scalable. I think o1 does well because it is going back and forth hundreds if not thousands of times. Somebody in a thread I was participating in mentioned that they let o1 crunch on a problem for 10 minutes. It sounded like it saved them a lot of work, so it was well worth it.
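
To be clear, nobody outside OpenAI knows how o1 actually works; the loop below is purely a sketch of the kind of generate/critique/revise cycle I have in mind (the model calls are stubs, not a real API), but it shows why the cost scales with the number of passes:

    # Hypothetical sketch of a "go back and forth N times" loop, not o1's actual method.
    def llm(prompt: str) -> str:
        """Stand-in for a model call; each real call would burn more output tokens."""
        return "no mistakes found"

    def answer_with_review(question: str, max_passes: int = 100) -> str:
        draft = llm(question)
        for _ in range(max_passes):
            critique = llm(f"Find mistakes in this answer to {question!r}: {draft}")
            if "no mistakes" in critique.lower():
                break
            draft = llm(f"Rewrite the answer, fixing: {critique}")
        return draft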

Whether or not o1 is practical for the general public, we will have to wait and see.

replies(1): >>42732826 #
4. energy123 ◴[] No.42732826{3}[source]
I'm going to wager "yes", because o3-mini (High) gets equal benchmark scores to o1 despite using a third as much compute, and because the consistent trend has been rapid order-of-magnitude decreases in price for a fixed level of intelligence (the trend has many dovetailing components, both hardware- and software-related). I can't forecast the future, but this would be my bet on a time horizon of < 3 years.
I'm going to wager "yes" because o3-mini (High) gets equal benchmark scores to o1 despite using 1/3rd as much compute, and because the consistent trend has been towards rapid order-of-magnitude decreases in price for a fixed level of intelligence (trend has many components dovetailing, both hardware and software related). Can't forecast the future, but this would be my bet on a time horizon of < 3 years.