
555 points | maheshrijal | 1 comment
M4v3R ◴[] No.43710247[source]
Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549...

With the right knowledge and a few web searches, one can answer this question in a matter of minutes at most. The model fumbled around modding forums and other sites and did manage to find some good information, but then started to hallucinate details and used them in its further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated.

What's even worse, in the thinking trace it looks like it's aware it does not have an answer and that 399 is just an estimate. But in the answer itself it confidently states it found the correct value.

Essentially, it hid the fact that it doesn't really know and gave me an estimate without telling me.

Now, I'm perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn't. Not to lie to my face.

Edit: Turns out it’s not just me: https://x.com/transluceai/status/1912552046269771985?s=46

werdnapk ◴[] No.43711672[source]
I've used AI with "niche" programming questions and it's always a total letdown. I truly don't understand this "vibe coding" movement unless everyone is building todo apps.
hatefulmoron ◴[] No.43711916[source]
It's incredible when I ask Claude 3.7 a question about Typescript/Python and it can generate hundreds of lines of code that are pretty on point (it's usually not exactly correct on the first prompt, but it's coherent).

I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about Typescript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject.

mhitza ◴[] No.43712454[source]
You don't need to go that esoteric. I've seen them make stuff up pretty often for more common functional programming languages like Haskell and OCaml.
greenavocado ◴[] No.43712635[source]
Recommend using RAG for this. Make the Haskell or OCaml documentation your knowledge base and index it for RAG. Then it makes a heck of a lot more sense!
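A minimal sketch of that idea in pure Python, assuming you have the Haskell or OCaml documentation saved as plain text. Every name here (`chunk`, `score`, `retrieve`, `build_prompt`) is illustrative, not a real library's API; a production setup would use embeddings and a vector store instead of this crude lexical retriever:

```python
import re
from collections import Counter

def chunk(text, size=80):
    """Split documentation into fixed-size word chunks for indexing."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query, passage):
    """Crude lexical relevance: count query words that appear in the passage."""
    q = Counter(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return sum(n for w, n in q.items() if w in p)

def retrieve(query, chunks, k=2):
    """Return the k most relevant documentation chunks for the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query, chunks):
    """Prepend retrieved documentation so the model answers from real docs
    instead of hallucinating syntax."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Use only this documentation:\n{context}\n\nQuestion: {query}"
```

The point is just that the model sees the actual documentation text alongside the question, which gives it something concrete to ground its answer in.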
rashkov ◴[] No.43713194[source]
How does one do that? As far as I can tell, neither the Claude nor ChatGPT web client supports this. Is there a third-party tool that people are using?
hellsten ◴[] No.43730209[source]
You could try using the built-in "projects" feature of Claude and ChatGPT: https://support.anthropic.com/en/articles/9517075-what-are-p...

You can get pretty good results by copying the output of Firefox's Reader View into your project, for example: about:reader?url=https://learnxinyminutes.com/ocaml/