Vibe vs reality: anyone actually working in the space daily can attest to how brittle these systems are.
Maybe this changes in SWE with more automated tests in verifiable simulators, but the real world is far too complex to simulate in its vastness.
Don't ask LLMs to "Write me Microsoft Excel".
Instead, ask it to "Write a directory tree view for the Open File dialog box in Excel".
Break your projects down into the smallest chunks you can for the LLMs. The more specific you are, the more reliable it's going to be.
The rest of this year is going to be companies figuring out how to break down large tasks into smaller tasks for LLM consumption.
"Write a Python script that adds three numbers together".
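A prompt that narrow maps to code any reviewer can verify at a glance, which is the whole point of the decomposition. A hypothetical sketch of what you'd expect back (function name and structure are assumptions, not anything the model is guaranteed to produce):

```python
def add_three(a: float, b: float, c: float) -> float:
    """Return the sum of three numbers."""
    return a + b + c

# Trivial to eyeball and trivial to test:
print(add_three(1, 2, 3))  # 6
```

Contrast that with "Write me Microsoft Excel": there's no equivalent one-glance check, so reliability collapses as the chunk grows.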
Is that bar going up? I think it probably is, although not as fast/far as some believe. I also think that "unreliable" can still be "useful".
gpt-4.5-preview-2025-02-27 replied with "Hi!"
I got "hi", as expected. What is the full system prompt + user message you're using?
https://i.imgur.com/Y923KXB.png
> gpt-4.5-preview-2025-02-27
Same "hi": https://i.imgur.com/VxiIrIy.png
Say just 'hi'
The "without any extra words or explanations" part was for the readers of your comment. Perhaps kubb also made a similar mistake. I used an empty system prompt.