
688 points by crescit_eundo | 1 comment
codeflo No.42145710
At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
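
For illustration only, here is a minimal sketch of how such benchmark special-casing might work: fingerprint the incoming program and, if it matches a known benchmark kernel, substitute a hand-tuned routine. Everything in it (the digests, the function names, the hashlib-based matching) is invented for the example and not taken from any real compiler.

    # Hypothetical illustration of benchmark special-casing in a compiler:
    # fingerprint the input program and, if it matches a known benchmark
    # kernel, emit a hand-tuned routine instead of compiling it normally.
    import hashlib

    # Digests of benchmark kernels the vendor has hand-optimized (placeholder values).
    KNOWN_BENCHMARK_KERNELS = {
        "0000deadbeef": "call __vendor_hand_tuned_matmul",
    }

    def fingerprint(source: str) -> str:
        # Strip whitespace so trivially reformatted copies of the benchmark still match.
        normalized = "".join(source.split())
        return hashlib.sha256(normalized.encode()).hexdigest()[:12]

    def compile_function(source: str) -> str:
        digest = fingerprint(source)
        if digest in KNOWN_BENCHMARK_KERNELS:
            # Benchmark detected: return the pre-baked, hand-optimized code path.
            return KNOWN_BENCHMARK_KERNELS[digest]
        # Otherwise fall through to the ordinary optimization pipeline.
        return generic_codegen(source)

    def generic_codegen(source: str) -> str:
        return f"generic code for a {len(source)}-byte function"
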
darkerside No.42146244
VW got in a lot of trouble for this (the diesel "defeat devices" that detected an emissions test and changed engine behavior).
conradev No.42147357
GPT-3.5 did not “cheat” on chess benchmarks, though; it was actually just better at chess?
GolfPopper No.42147748
I think the OP's point is that GPT-3.5 may have a chess engine baked into its (closed and unavailable) code for PR purposes. So it "realizes" that "hey, I'm playing a game of chess" and then, rather than doing whatever it normally does, it just acts as a front-end for a quite good chess engine.
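
To make that theory concrete (it is pure speculation about GPT-3.5's internals, nothing confirmed), here is a sketch of what such a front-end could look like: a crude regex spots chess moves in the prompt and, if enough are found, the position is handed to a real engine. It assumes the python-chess package and a local Stockfish binary; looks_like_chess and ordinary_language_model are made-up names.

    # Sketch of the "front-end for a chess engine" theory. Nothing here is known
    # about GPT-3.5's internals; the detection heuristic and the Stockfish
    # delegation are hypothetical. Assumes python-chess and Stockfish on PATH.
    import re
    import chess
    import chess.engine

    # Very rough pattern for SAN moves ("e4", "Nf3", "O-O", "exd5", "e8=Q", ...).
    MOVE_RE = re.compile(r"\b([KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?|O-O(?:-O)?)\b")

    def looks_like_chess(prompt: str) -> bool:
        # Crude heuristic: several SAN-looking tokens means "this is a chess game".
        return len(MOVE_RE.findall(prompt)) >= 4

    def reply(prompt: str) -> str:
        if looks_like_chess(prompt):
            board = chess.Board()
            for san in MOVE_RE.findall(prompt):
                try:
                    board.push_san(san)
                except ValueError:
                    break  # stop at the first token that isn't a legal move here
            engine = chess.engine.SimpleEngine.popen_uci("stockfish")
            try:
                result = engine.play(board, chess.engine.Limit(time=0.1))
                return board.san(result.move)
            finally:
                engine.quit()
        return ordinary_language_model(prompt)

    def ordinary_language_model(prompt: str) -> str:
        return "..."  # stand-in for the model's normal text generation
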
conradev No.42147861
I see – my initial interpretation of OP’s “special case” was “Theory 2: GPT-3.5-instruct was trained on more chess games.”

But I guess it's also possible that they had a real chess engine hiding in there.