Something weird is happening with LLMs and chess

(dynomight.substack.com)

696 points crescit_eundo | 1 comments | 14 Nov 24 17:05 UTC

codeflo ◴[15 Nov 24 10:52 UTC] No.42145710[source]▶

At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
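To make the compiler case concrete, here is a minimal sketch of what "detecting a benchmark and swapping in a hand-crafted path" could look like. Everything in it is hypothetical: the `KNOWN_BENCHMARK` source, the `SPECIAL_CASES` table, and the path names are made up for illustration and do not describe any real compiler or driver.

```python
import hashlib


def fingerprint(source: str) -> str:
    """Return a stable fingerprint of the program text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


# Hypothetical benchmark kernel the vendor wants to look good on.
KNOWN_BENCHMARK = "for i in range(n): total += a[i] * b[i]"

# Hypothetical table: fingerprint of a known benchmark -> hand-tuned code path.
SPECIAL_CASES = {fingerprint(KNOWN_BENCHMARK): "hand_tuned_dot_product"}


def choose_code_path(source: str) -> str:
    """Use the hand-tuned path only when the input is a recognised benchmark."""
    return SPECIAL_CASES.get(fingerprint(source), "generic_optimizer")
```

An exact match on the benchmark text gets the special path; changing even one character falls back to the generic optimizer, which is why this kind of tuning says little about performance on code that is not the benchmark.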

replies(3): >>42146244 #>>42146391 #>>42151266 #

darkerside ◴[15 Nov 24 12:21 UTC] No.42146244[source]▶

>>42145710 #

VW got in a lot of trouble for this

replies(10): >>42146543 #>>42146550 #>>42146553 #>>42146556 #>>42146560 #>>42147093 #>>42147124 #>>42147353 #>>42147357 #>>42148300 #

TrueDuality ◴[15 Nov 24 13:06 UTC] No.42146543[source]▶

>>42146244 #

Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).

replies(1): >>42146761 #

krisoft ◴[15 Nov 24 13:33 UTC] No.42146761[source]▶

>>42146543 #

> VW got in trouble for running _different_ software in test vs prod.

Not quite. They programmed their "prod" software to recognise the circumstances of a laboratory test and behave differently. Namely, during laboratory emissions testing it would activate emission control features that it would not activate otherwise.

The software was the same software they flash on production cars. They were production cars. You could take a random car from a random dealership and it would have done the same trickery in the lab.
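As a toy illustration of "same artifact, different behaviour", consider the sketch below. It is purely hypothetical and not a description of the actual VW logic: the sensor fields, the thresholds, and the `looks_like_lab_test` heuristic are assumptions invented for the example. The point is only that a single production build can branch at runtime on whether it thinks it is being tested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorSnapshot:
    """Hypothetical sensor readings available to the engine software."""
    steering_angle_deg: float
    driven_wheel_speed_kmh: float
    undriven_wheel_speed_kmh: float


def looks_like_lab_test(s: SensorSnapshot) -> bool:
    """Assumed heuristic: driven wheels spinning on rollers while the
    steering wheel is straight and the undriven wheels stand still."""
    return (
        abs(s.steering_angle_deg) < 1.0
        and s.driven_wheel_speed_kmh > 10.0
        and s.undriven_wheel_speed_kmh == 0.0
    )


def emission_control_mode(s: SensorSnapshot) -> str:
    """One build, one binary: the behaviour differs only by runtime branch."""
    return "full" if looks_like_lab_test(s) else "reduced"
```

There is no separate "test build" anywhere in this sketch. The same function ships on every car and simply answers differently when the inputs look like a test rig.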

replies(1): >>42147479 #

TrueDuality ◴[15 Nov 24 14:58 UTC] No.42147479[source]▶

>>42146761 #

I disagree with your distinction on the environments but understand your argument. Production for VW to me is "on the road when a customer is using your product as intended". Using the same artifact for those different environments isn't the same as "running that in production".

replies(1): >>42149146 #

krisoft ◴[15 Nov 24 17:47 UTC] No.42149146[source]▶

>>42147479 #

The “test” environment is the domain of prototype cars driving at the proving ground. It is an internal affair, only for employees and contractors. The software is compiled on some engineer’s laptop and uploaded to the ECU manually by an engineer. No two cars are ever the same, everything is in flux. The number of cars is small.

“Production” is a factory line producing cars. The software is uploaded to the ECUs automatically by some factory machine. Every car is exactly the same, with the exact same software version on thousands and thousands of cars. The cars are sold to customers.

Some small number of these production cars are sent to third parties for regulatory compliance checks. But those cars don’t suddenly become non-production cars just because someone sticks a probe up their exhaust. The same way gmail’s production servers don’t suddenly turn into test environments just because a user opens the network tab in their browser’s dev tools to see what kind of requests fly over the wire.
