Something weird is happening with LLMs and chess

(dynomight.substack.com)

696 points crescit_eundo | 1 comments | 14 Nov 24 17:05 UTC

codeflo ◴[15 Nov 24 10:52 UTC] No.42145710[source]▶

At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.

replies(3): >>42146244 #>>42146391 #>>42151266 #

darkerside ◴[15 Nov 24 12:21 UTC] No.42146244[source]▶

>>42145710 #

VW got in a lot of trouble for this

replies(10): >>42146543 #>>42146550 #>>42146553 #>>42146556 #>>42146560 #>>42147093 #>>42147124 #>>42147353 #>>42147357 #>>42148300 #

sigmoid10 ◴[15 Nov 24 13:08 UTC] No.42146560[source]▶

>>42146244 #

Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.

replies(3): >>42146749 #>>42147885 #>>42150309 #

0xFF0123 ◴[15 Nov 24 13:32 UTC] No.42146749[source]▶

>>42146560 #

The only difference is the legality. From an integrity point of view it's basically the same

replies(7): >>42146884 #>>42146984 #>>42147072 #>>42147078 #>>42147443 #>>42147742 #>>42147978 #

Swenrekcah ◴[15 Nov 24 14:10 UTC] No.42147078[source]▶

>>42146749 #

That is not true. Even ChatGPT understands how they are different. I won’t paste the whole response, but here are the differences it highlights:

Key differences:

1. Intent and harm: VW’s actions directly violated laws and had environmental and health consequences. Optimizing LLMs for chess benchmarks, while arguably misleading, doesn’t have immediate real-world harms.

2. Scope: Chess-specific optimization is generally a transparent choice within AI research. It’s not a hidden “defeat device” but rather an explicit design goal.

3. Broader impact: LLMs fine-tuned for benchmarks often still retain general-purpose capabilities. They aren’t necessarily “broken” outside chess, whereas VW cars fundamentally failed to meet emissions standards.
