
688 points crescit_eundo | 1 comments
codeflo ◴[] No.42145710[source]
At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
darkerside ◴[] No.42146244[source]
VW got in a lot of trouble for this.
sigmoid10 ◴[] No.42146560[source]
Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. To be comparable, the government would first need to pass laws under which, e.g., only compilers that pass a certain benchmark may be used for purchasable products, and then the developers would need to manipulate behaviour during those benchmarks.
0xFF0123 ◴[] No.42146749[source]
The only difference is the legality. From an integrity point of view, it's basically the same.
Thorrez ◴[] No.42146884[source]
I think breaking a law is more unethical than not breaking a law.

Also, legality isn't the only difference in the VW case. VW had a "good emissions" mode. They enabled it during the test but disabled it during regular driving, even though it would have worked there. With compilers, there's no "good performance" mode that would work during regular usage but is being switched off.

hansworst ◴[] No.42147439{3}[source]
Overfitting on test data absolutely does mean that the model would perform better in benchmarks than it would in real-life use cases.
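To make the overfitting point concrete, here is a deliberately tiny sketch, not a description of how any real model is trained. It assumes a "model" that has simply memorized the published benchmark pairs and falls back to a crude guess on anything else. The names `BENCHMARK`, `HELD_OUT`, and `MemorizingModel` are made up for illustration.

```python
# Toy illustration only: a "model" that has memorized a published benchmark.
# BENCHMARK, HELD_OUT and MemorizingModel are hypothetical names.

BENCHMARK = [(2, 4), (3, 9), (5, 25), (7, 49)]    # published (x, x*x) pairs
HELD_OUT = [(4, 16), (6, 36), (8, 64), (9, 81)]   # same task, unseen inputs


class MemorizingModel:
    """Recalls memorized answers; uses a weak heuristic for unseen inputs."""

    def __init__(self, training_pairs):
        self.table = dict(training_pairs)

    def predict(self, x):
        # Exact recall for inputs seen in training, a crude guess elsewhere.
        return self.table.get(x, 2 * x)


def accuracy(model, pairs):
    """Fraction of (input, expected) pairs the model answers exactly."""
    correct = sum(1 for x, y in pairs if model.predict(x) == y)
    return correct / len(pairs)


# Assumption: the benchmark itself leaked into the training data.
model = MemorizingModel(BENCHMARK)
benchmark_score = accuracy(model, BENCHMARK)
held_out_score = accuracy(model, HELD_OUT)
print(benchmark_score, held_out_score)
```

The benchmark score is perfect while the held-out score collapses, which is the gap between benchmark performance and real-life performance that the comment describes.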
1. Thorrez ◴[] No.42158947{4}[source]
I think you're talking about something different from what sigmoid10 was talking about. sigmoid10 said "manipulate behaviour during those benchmarks". I interpreted that to mean the compiler detects whether a benchmark is running and alters its behavior only then, so real-life use cases would not be affected.
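The "detect the benchmark and behave differently only then" reading can be sketched as follows. This is a hypothetical toy, not how any actual compiler is known to work: `KNOWN_BENCHMARKS`, `fingerprint`, and `compile_program` are invented names, and "compiling" here just returns a tagged string.

```python
import hashlib

# Hypothetical toy: a "compiler" that special-cases one known benchmark.


def fingerprint(source: str) -> str:
    """Stable identifier for a program's exact source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


# Assumed benchmark program and a hand-crafted replacement for it.
BENCH_SOURCE = "for i in range(1000): total += i"
KNOWN_BENCHMARKS = {fingerprint(BENCH_SOURCE): "total += 499500"}


def compile_program(source: str) -> tuple[str, str]:
    """Return (path_taken, emitted_code) for the given source."""
    special = KNOWN_BENCHMARKS.get(fingerprint(source))
    if special is not None:
        # Only the exact benchmark text hits the hand-tuned path.
        return ("special-cased", special)
    # Every other program goes through the ordinary path unchanged.
    return ("generic", source)


bench_result = compile_program(BENCH_SOURCE)
user_result = compile_program("for i in range(1000): total += 2 * i")
print(bench_result[0], user_result[0])
```

In this sketch ordinary programs are compiled exactly as they would be without the special case, so the benchmark number is inflated but regular usage is neither helped nor harmed. That is the distinction being drawn against the overfitting case.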