
615 points by __rito__ | 2 comments

Related from yesterday: Show HN: Gemini Pro 3 imagines the HN front page 10 years from now - https://news.ycombinator.com/item?id=46205632
nomel ◴[] No.46232459[source]
> I realized that this task is actually a really good fit for LLMs

I've found the opposite, since these models still fail pretty wildly at nuance. I think it's a conceptual "needle in the haystack" sort of problem.

A good test is to find a thread with a disagreement and have the model analyze the discussion. It will usually misrepresent what each side was saying, align strongly with one user, and miss the actual divide causing the disagreement (the needle).
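
For concreteness, here's a minimal sketch of that test: pull a thread from the public HN Algolia API, flatten the comment tree to plain text, and ask a model to summarize each side's position and name the point of divergence. The thread ID, prompt wording, and model name are placeholders, not part of my original setup; I used the Anthropic Python SDK here, but any chat-completion API works the same way.

    import json
    import urllib.request

    import anthropic  # pip install anthropic

    def fetch_thread(item_id: int) -> dict:
        """Fetch a full HN thread (with nested comments) from the Algolia API."""
        url = f"https://hn.algolia.com/api/v1/items/{item_id}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def flatten(item: dict, depth: int = 0) -> str:
        """Render the comment tree as indented plain text for the prompt.
        Note: comment text comes back as HTML; a real harness would strip tags."""
        text = item.get("text") or item.get("title") or ""
        out = f"{'  ' * depth}{item.get('author', '?')}: {text}\n"
        for child in item.get("children", []):
            out += flatten(child, depth + 1)
        return out

    def analyze_disagreement(item_id: int) -> str:
        thread = flatten(fetch_thread(item_id))
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env
        msg = client.messages.create(
            model="claude-opus-4-5",  # assumed model alias; substitute whichever model you're probing
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": (
                    "Two or more users in this thread disagree. Summarize each "
                    "side's actual position, then identify the specific point "
                    "of divergence:\n\n" + thread
                ),
            }],
        )
        return msg.content[0].text

    print(analyze_disagreement(46232459))  # hypothetical example: the thread above

Then compare the model's summary against your own reading of the thread; the failure mode is that it flattens one side into a strawman rather than locating the actual divide.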

replies(1): >>46232847 #
1. gowld ◴[] No.46232847[source]
As always, which model versions did you use in your test?
replies(1): >>46236493 #
2. nomel ◴[] No.46236493[source]
Claude Opus 4.5, Gemini 3 Pro, ChatGPT 5.1. Haven't tried ChatGPT 5.2.

Seeing the failure requires a discussion with genuine nuance. Gemini is by far the worst at this (which fits my suspicion that they heavily weighted Reddit posts).

I don't think this is all that strange, though. The human on one side of the argument is also missing the nuance, which is the source of the conflict. Is there a belief that AI has surpassed the average human at conversational nuance!?