Something weird is happening with LLMs and chess

(dynomight.substack.com)

696 points crescit_eundo | 1 comments | 14 Nov 24 17:05 UTC | HN request time: 0.242s | source

Show context

codeflo ◴[15 Nov 24 10:52 UTC] No.42145710[source]▶

At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.

replies(3): >>42146244 #>>42146391 #>>42151266 #

darkerside ◴[15 Nov 24 12:21 UTC] No.42146244[source]▶

>>42145710 #

VW got in a lot of trouble for this

replies(10): >>42146543 #>>42146550 #>>42146553 #>>42146556 #>>42146560 #>>42147093 #>>42147124 #>>42147353 #>>42147357 #>>42148300 #

ArnoVW ◴[15 Nov 24 13:08 UTC] No.42146556[source]▶

>>42146244 #

True. But they did not optimize for a specific case. They detected the test and then enabled a special regime, that was not used normally.

It’s as if OpenAI detects the IP address from a benchmark organization, and then used a completely different model.

replies(1): >>42148055 #

1. K0balt ◴[15 Nov 24 15:58 UTC] No.42148055[source]▶

>>42146556 #

This is the apples to apples version. Perhaps might be more accurate to say that when detecting a benchmark attempt the model tries the prompt 3 times with different seeds then picks the best answer, otherwise it just zero-shots the prompt in everyday use.

I say this because the be test still uses the same hardware (model) but changed the way it behaved by running emissions friendly parameters ( a different execution framework) that wouldn’t have been used in everyday driving, where fuel efficiency and performance optimized parameters were used instead.

What I’d like to know is if it actually was unethical or not. The overall carbon footprint of the lower fuel consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions model, which typically does not factor in fuel consumed.

↑