←back to thread

688 points crescit_eundo | 2 comments | | HN request time: 0s | source
Show context
codeflo ◴[] No.42145710[source]
At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
replies(3): >>42146244 #>>42146391 #>>42151266 #
darkerside ◴[] No.42146244[source]
VW got in a lot of trouble for this
replies(10): >>42146543 #>>42146550 #>>42146553 #>>42146556 #>>42146560 #>>42147093 #>>42147124 #>>42147353 #>>42147357 #>>42148300 #
1. ArnoVW ◴[] No.42146556[source]
True. But they did not optimize for a specific case. They detected the test and then enabled a special regime, that was not used normally.

It’s as if OpenAI detects the IP address from a benchmark organization, and then used a completely different model.

replies(1): >>42148055 #
2. K0balt ◴[] No.42148055[source]
This is the apples to apples version. Perhaps might be more accurate to say that when detecting a benchmark attempt the model tries the prompt 3 times with different seeds then picks the best answer, otherwise it just zero-shots the prompt in everyday use.

I say this because the be test still uses the same hardware (model) but changed the way it behaved by running emissions friendly parameters ( a different execution framework) that wouldn’t have been used in everyday driving, where fuel efficiency and performance optimized parameters were used instead.

What I’d like to know is if it actually was unethical or not. The overall carbon footprint of the lower fuel consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions model, which typically does not factor in fuel consumed.