
549 points thecr0w | 1 comment
999900000999 No.46183645
Space Jam website design as an LLM benchmark.

This article is a bit negative. Claude gets close; it just can't get the order right, which is something OP can manually fix.

I prefer GitHub Copilot because it's cheaper and integrates with GitHub directly. Sometimes it gets things right on the first try; other times I have to try three or four times.

replies(4): >>46183660, >>46183768, >>46184119, >>46184297
GeoAtreides No.46184119
>which is something OP can manually fix

What if the LLM gets something wrong that the operator (a junior dev, perhaps) doesn't even know is wrong? That's the main issue: if it fails here, it will fail at other things, and in less obvious ways.

replies(2): >>46185803, >>46187184
1. godelski No.46185803
I think that's the main problem with them. It is hard to figure out when they're wrong.

As the post shows, you can't trust them when they think they've solved something, but you also can't trust them when they think they haven't[0]. These things are optimized for human preference, which ultimately means they're optimized to hide mistakes. After all, we can't penalize mistakes in training when we don't know the mistakes are mistakes. The de facto bias is that we prefer mistakes we don't know are mistakes over mistakes we do[1].
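
A toy sketch of that incentive (my own illustration, not from the post; the names and values are made up): a preference-style reward can only penalize flaws the rater actually notices, so a subtly wrong answer scores the same as a correct one.

    # Toy illustration: a preference reward only penalizes flaws the rater can see.
    from dataclasses import dataclass

    @dataclass
    class Answer:
        correct: bool
        mistake_visible: bool  # can a human rater spot the flaw?

    def preference_reward(a: Answer) -> float:
        # The rater only downvotes what they notice.
        return 0.0 if (not a.correct and a.mistake_visible) else 1.0

    answers = [
        Answer(correct=True,  mistake_visible=False),  # genuinely right  -> 1.0
        Answer(correct=False, mistake_visible=True),   # obviously wrong  -> 0.0
        Answer(correct=False, mistake_visible=False),  # subtly wrong     -> 1.0
    ]
    for a in answers:
        print(a, preference_reward(a))
    # The subtly wrong answer is rewarded like the right one, so training
    # pressure favors hiding mistakes rather than surfacing them.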

Personally, I think a well-designed tool makes errors obvious. As a tool user, that's what I want, and it's what makes tool use effective. But LLMs flip this on its head, making errors difficult to detect, which is incredibly problematic.

[0] I frequently see this when it flags something as a problem that actually isn't one, which makes steering more difficult.

[1] Yes, conceptually, unknown unknowns are worse. But you can't measure unknown unknowns; they are indistinguishable from knowns. So you always optimize for deception (along with other things) when you don't have clear objective truths (which is most situations).